## Machine Learning With Python: Linear Regression With Multiple Variables
### Machine Learning-based Data Analysis: Plans and Procedures
In machine learning-based data analysis, the following procedure is generally followed.

1. Understanding the Business and Defining the Problem

2. Collecting Data

3. Data Pre-processing and Exploration (filling missing values, removing unneeded columns)

4. Preparing Data for Model Training

5. Model Performance Evaluation

6. Improving Model Performance and Market Application

Sample problem: predicting home prices <br>

Given these home prices, find the price of a home that has:<br>
<br>
3000 sq ft area, 3 bedrooms, 40 years old, in Bangalore
<br>
2500 sq ft area, 4 bedrooms, 5 years old, in Mangalore
<br>
Here area, bedrooms, and age are called independent variables or features, whereas price is the dependent variable.
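
As a rough sketch of what the model assumes, multiple linear regression expresses the dependent variable as a weighted sum of the features plus an intercept. The symbols b_0 to b_3 are illustrative names for the intercept and the weights; the place of the home is left out here because it is text and only becomes usable once it is encoded as dummy columns later in the notebook, each of which then gets its own weight.

$$
\text{price} = b_0 + b_1 \cdot \text{area} + b_2 \cdot \text{bedrooms} + b_3 \cdot \text{age}
$$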

In [1]:
# Import supporting packages
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# data collection

df = pd.read_csv("homeprices3.csv")
df

Unnamed: 0,place,area,bedrooms,age,price,doorno,housename
0,Bangalore,2600.0,3.0,20.0,550000,5,Serenity House.
1,Mangalore,3000.0,4.0,15.0,565000,45,Willow Cottage.
2,Mangalore,3200.0,,18.0,610000,78,Sunflower Villa.
3,Bangalore,,3.0,30.0,595000,656,Rosewood Retreat.
4,Udupi,4000.0,5.0,,760000,3123,Oakhurst.
5,Udupi,4100.0,6.0,8.0,810000,623,Willow Cottage.
6,Udupi,3800.0,4.0,5.0,665000,656,Rosewood Retreat.
7,Udupi,3200.0,4.0,18.0,710000,3216,
8,Bangalore,3600.0,4.0,25.0,595000,656,Cedarwood Manor.
9,Mangalore,4000.0,5.0,,760000,33,Green Gables.


In [3]:
df.shape  # rows and columns 

(11, 7)

In [4]:
df.describe()

Unnamed: 0,area,bedrooms,age,price,doorno
count,10.0,10.0,9.0,11.0,11.0
mean,3660.0,4.4,16.333333,684545.454545,886.090909
std,707.420981,1.074968,8.291562,115313.801114,1165.236152
min,2600.0,3.0,5.0,550000.0,5.0
25%,3200.0,4.0,8.0,595000.0,61.5
50%,3700.0,4.0,18.0,665000.0,656.0
75%,4000.0,5.0,20.0,760000.0,656.0
max,5100.0,6.0,30.0,910000.0,3216.0


In [5]:
# count the missing (NaN) values in each column

df.isna().sum()

place        0
area         1
bedrooms     1
age          2
price        0
doorno       0
housename    1
dtype: int64

In [6]:
# delete the columns that are not needed for prediction

del df['doorno']
del df['housename']
df

Unnamed: 0,place,area,bedrooms,age,price
0,Bangalore,2600.0,3.0,20.0,550000
1,Mangalore,3000.0,4.0,15.0,565000
2,Mangalore,3200.0,,18.0,610000
3,Bangalore,,3.0,30.0,595000
4,Udupi,4000.0,5.0,,760000
5,Udupi,4100.0,6.0,8.0,810000
6,Udupi,3800.0,4.0,5.0,665000
7,Udupi,3200.0,4.0,18.0,710000
8,Bangalore,3600.0,4.0,25.0,595000
9,Mangalore,4000.0,5.0,,760000


In [7]:
# fill the missing values with the column median

a = df.area.median()
a

3700.0

In [8]:
df.area = df.area.fillna(a)
df.area

0     2600.0
1     3000.0
2     3200.0
3     3700.0
4     4000.0
5     4100.0
6     3800.0
7     3200.0
8     3600.0
9     4000.0
10    5100.0
Name: area, dtype: float64

In [9]:
df.bedrooms = df.bedrooms.fillna(df.bedrooms.median())

In [10]:
df.age = df.age.fillna(df.age.median())
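
The three median fills above can also be done in one step. The following is a minimal sketch on a made-up toy frame (the name `toy` and its values are illustrative, not the contents of homeprices3.csv); it relies on `DataFrame.median()` skipping NaN by default and on `fillna()` accepting a Series of per-column fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; NOT the homeprices3.csv data.
toy = pd.DataFrame({
    "area": [2600.0, 3000.0, np.nan, 4000.0],
    "bedrooms": [3.0, np.nan, 4.0, 5.0],
    "age": [20.0, 15.0, 18.0, np.nan],
})

numeric_cols = ["area", "bedrooms", "age"]

# median() ignores NaN and returns one value per column;
# fillna() with that Series fills each column with its own median.
medians = toy[numeric_cols].median()
toy[numeric_cols] = toy[numeric_cols].fillna(medians)

print(medians)
print(toy)
```

Filling with the median rather than the mean is a common choice when a column may contain outliers, since one very large house does not shift the median much.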

#### Using pandas to create dummy variables

In [11]:
dum = pd.get_dummies(df.place)  # one-hot encode the place names as dummy columns
dum

Unnamed: 0,Bangalore,Mangalore,Udupi
0,True,False,False
1,False,True,False
2,False,True,False
3,True,False,False
4,False,False,True
5,False,False,True
6,False,False,True
7,False,False,True
8,True,False,False
9,False,True,False


In [12]:
dum = dum.astype(int)
dum

Unnamed: 0,Bangalore,Mangalore,Udupi
0,1,0,0
1,0,1,0
2,0,1,0
3,1,0,0
4,0,0,1
5,0,0,1
6,0,0,1
7,0,0,1
8,1,0,0
9,0,1,0
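
As an alternative sketch, `pd.get_dummies` can encode a column and return integer dummies in a single call by passing `columns=` and `dtype=int`. The toy frame below is made up for illustration. Note that this form prefixes the new columns with the source column name (`place_Bangalore` and so on), which differs from the bare city names used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame; NOT the homeprices3.csv data.
toy = pd.DataFrame({
    "place": ["Bangalore", "Mangalore", "Udupi", "Mangalore"],
    "area": [2600.0, 3000.0, 4000.0, 3200.0],
})

# columns= selects which columns to encode; the original "place" column is
# replaced by one 0/1 column per distinct value, and dtype=int avoids booleans.
encoded = pd.get_dummies(toy, columns=["place"], dtype=int)

print(encoded)
```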


#### Step 3: Analyse the Data Set

In [13]:
# take price as the target variable y
y = df.price

In [14]:
# remove price so that only the feature columns remain
del df['price']
df

Unnamed: 0,place,area,bedrooms,age
0,Bangalore,2600.0,3.0,20.0
1,Mangalore,3000.0,4.0,15.0
2,Mangalore,3200.0,4.0,18.0
3,Bangalore,3700.0,3.0,30.0
4,Udupi,4000.0,5.0,18.0
5,Udupi,4100.0,6.0,8.0
6,Udupi,3800.0,4.0,5.0
7,Udupi,3200.0,4.0,18.0
8,Bangalore,3600.0,4.0,25.0
9,Mangalore,4000.0,5.0,18.0


In [15]:
# place is already encoded in dum, so the original place column is no longer needed

del df['place']
df

Unnamed: 0,area,bedrooms,age
0,2600.0,3.0,20.0
1,3000.0,4.0,15.0
2,3200.0,4.0,18.0
3,3700.0,3.0,30.0
4,4000.0,5.0,18.0
5,4100.0,6.0,8.0
6,3800.0,4.0,5.0
7,3200.0,4.0,18.0
8,3600.0,4.0,25.0
9,4000.0,5.0,18.0


In [16]:
x = df
x

Unnamed: 0,area,bedrooms,age
0,2600.0,3.0,20.0
1,3000.0,4.0,15.0
2,3200.0,4.0,18.0
3,3700.0,3.0,30.0
4,4000.0,5.0,18.0
5,4100.0,6.0,8.0
6,3800.0,4.0,5.0
7,3200.0,4.0,18.0
8,3600.0,4.0,25.0
9,4000.0,5.0,18.0


In [17]:
# concatenate dum with the features, i.e. add place to x in numerical form

x = pd.concat([x,dum],axis='columns')
x

Unnamed: 0,area,bedrooms,age,Bangalore,Mangalore,Udupi
0,2600.0,3.0,20.0,1,0,0
1,3000.0,4.0,15.0,0,1,0
2,3200.0,4.0,18.0,0,1,0
3,3700.0,3.0,30.0,1,0,0
4,4000.0,5.0,18.0,0,0,1
5,4100.0,6.0,8.0,0,0,1
6,3800.0,4.0,5.0,0,0,1
7,3200.0,4.0,18.0,0,0,1
8,3600.0,4.0,25.0,1,0,0
9,4000.0,5.0,18.0,0,1,0


##### Analyse the data set by training and testing: fit the model on one part of the known data and verify it against the rest
##### We use 80% of the rows for training and 20% for testing, chosen at random

### Training the Algorithm

In [18]:
# splitting the data

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)
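
Because the split is random, rerunning the cell above generally selects different test rows and so changes the error reported later. A minimal sketch of making the split repeatable is shown below. It uses made-up stand-in data with 11 rows, and the value `random_state=0` is an arbitrary choice for illustration, not something the notebook uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with 11 rows; NOT the homeprices3.csv data.
X_demo = pd.DataFrame({"area": list(range(1000, 12000, 1000))})
y_demo = pd.Series(list(range(11)), name="price")

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

print(len(X_tr), len(X_te))                # sizes of the two parts
print(list(X_te.index) == list(X_te2.index))  # same rows selected both times
```

With 11 rows and `test_size=0.2`, scikit-learn rounds the test size up to 3 rows and leaves 8 for training, which matches the three rows shown in `x_test` further down.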

In [19]:
# Create linear regression object or model

from sklearn.linear_model import LinearRegression
model = LinearRegression()

With .fit(), you calculate the optimal values of the weights 𝑏₀, 𝑏₁, …, 𝑏₆ (the intercept plus one weight per feature column), using the existing input and output as the arguments. In other words, .fit() fits the model. It returns self, which is the variable model itself, so creating the model and fitting it can also be written as a single chained statement. Here the model is fitted on the training split only, so that the test rows remain unseen:

In [20]:
model.fit(x_train,y_train)
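
To make the role of the fitted weights concrete, here is a small sketch on synthetic data generated from known weights (the names `X_demo`, `y_demo`, and `demo_model` are illustrative and separate from the notebook's variables). It reads the intercept and weights back from `intercept_` and `coef_`, and checks that a prediction is simply the intercept plus the weighted sum of the features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data built from known weights: y = 50 + 2*x1 + 3*x2
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
y_demo = 50 + 2 * X_demo[:, 0] + 3 * X_demo[:, 1]

# fit() returns the estimator itself, so creation and fitting can be chained.
demo_model = LinearRegression().fit(X_demo, y_demo)

print(demo_model.intercept_)  # estimate of b0
print(demo_model.coef_)       # estimates of b1 and b2

# A prediction is the intercept plus the dot product of weights and features.
new_row = np.array([6.0, 4.0])
manual = demo_model.intercept_ + demo_model.coef_ @ new_row
predicted = demo_model.predict(new_row.reshape(1, -1))[0]
print(manual, predicted)
```

On the notebook's fitted `model`, the same two attributes would hold one intercept and six weights, in the column order of `x_train`.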

#### Step 5: Test the Algorithm

In [21]:
x_test

Unnamed: 0,area,bedrooms,age,Bangalore,Mangalore,Udupi
10,5100.0,6.0,8.0,0,1,0
6,3800.0,4.0,5.0,0,0,1
7,3200.0,4.0,18.0,0,0,1


In [22]:
y_pred = model.predict(x_test)

In [23]:
from sklearn.metrics import mean_absolute_error

mean_absolute_error(y_test, y_pred)

72817.35159820206
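
The mean absolute error is the average of the absolute differences between the actual and predicted values, so it is expressed in the same unit as the price. The value above means the predictions for the three test rows were off by roughly 72,800 on average. The sketch below, using made-up numbers, computes the metric by hand and compares it with `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual and predicted values, for illustration only.
y_true_demo = np.array([100.0, 200.0, 300.0])
y_pred_demo = np.array([110.0, 190.0, 330.0])

# Absolute errors are 10, 10 and 30; their mean is the MAE.
manual_mae = np.mean(np.abs(y_true_demo - y_pred_demo))
sklearn_mae = mean_absolute_error(y_true_demo, y_pred_demo)

print(manual_mae, sklearn_mae)
```

With only three test rows, a single badly predicted home can dominate the figure, so the score is best read as a rough indication rather than a stable estimate.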

- Predict the price of a home with 3000 sq ft area, 3 bedrooms, 40 years old, in Bangalore

In [24]:
model.predict([[3000,3,40,1,0,0]])



array([453174.65753423])

- Predict the price of a home with 2500 sq ft area, 4 bedrooms, 5 years old, in Mangalore

In [25]:
model.predict([[2500,4,5,0,1,0]])



array([592054.79452051])
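
Typing the dummy values by hand, as in `[3000,3,40,1,0,0]`, is easy to get wrong because the position of each 1 or 0 must match the training column order. One possible way to reduce that risk is a small helper that builds the row from the place name. `make_feature_row` below is a hypothetical helper written for this sketch, not part of pandas or scikit-learn; the column order is copied from the `x` table shown earlier.

```python
import pandas as pd

# Column order copied from the feature table x in this notebook.
FEATURE_COLUMNS = ["area", "bedrooms", "age", "Bangalore", "Mangalore", "Udupi"]
PLACES = ["Bangalore", "Mangalore", "Udupi"]


def make_feature_row(area, bedrooms, age, place):
    """Hypothetical helper: one-row DataFrame in the training column order."""
    if place not in PLACES:
        raise ValueError(f"unknown place: {place!r}")
    row = {"area": area, "bedrooms": bedrooms, "age": age}
    for p in PLACES:
        row[p] = int(p == place)  # 1 for the matching city, 0 otherwise
    return pd.DataFrame([row], columns=FEATURE_COLUMNS)


home_1 = make_feature_row(3000, 3, 40, "Bangalore")
home_2 = make_feature_row(2500, 4, 5, "Mangalore")
print(home_1)
print(home_2)

# With the fitted model from this notebook, the rows could then be passed as:
#   model.predict(home_1)
```

Passing a DataFrame with the same column names as the training data also lets scikit-learn check the feature names, which a plain nested list does not allow.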