<h2 style="color:green" align="center"> Linear Regression with Multiple Variables</h2>

<h3 style="color:purple">Sample problem: predicting home prices in Monroe, New Jersey (USA)</h3>

The table below lists home prices in Monroe Township, NJ. Here the price depends on **area (square feet), number of bedrooms, and age of the home (in years)**. Given these prices, we want to predict the prices of new homes from their area, bedrooms, and age.

<img src="homeprices.jpg" style='height:200px;width:350px'>

Given these home prices, find the price of a home that has:

**3000 sq ft area, 3 bedrooms, 40 years old**

**2500 sq ft area, 4 bedrooms, 5 years old**

We will use linear regression with multiple variables here. The price can be calculated using the following equation:

<img src="equation.jpg" >

Here area, bedrooms, and age are called independent variables or **features**, whereas price is the dependent variable.
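
Written out in text, and assuming the image shows the usual multiple linear regression form (the symbols $m_1$, $m_2$, $m_3$ and $b$ are labels chosen here for illustration), the equation reads:

$$\text{price} = m_1 \cdot \text{area} + m_2 \cdot \text{bedrooms} + m_3 \cdot \text{age} + b$$

Here $m_1$, $m_2$, $m_3$ are the coefficients learned for each feature and $b$ is the intercept.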

#### Import the Necessary Packages

In [2]:
import pandas as pd

import matplotlib.pyplot as plt 

from sklearn import linear_model

#### Read the Dataset

In [3]:
df = pd.read_csv('homeprices.csv')
df

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000


#### Data Preprocessing

In [4]:
df.isnull().sum()

area        0
bedrooms    1
age         0
price       0
dtype: int64

In [18]:
# Which value should fill the missing bedroom count: mean, median, or mode?

In [5]:
mode = df.bedrooms.mode()[0]
mode

3.0
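
To make the mean/median/mode question concrete, here is a minimal sketch that computes all three candidates for the `bedrooms` column. The values are typed in by hand from the table above (an assumption made so the snippet runs without `homeprices.csv`), and the name `candidates` is only illustrative.

```python
import numpy as np
import pandas as pd

# Bedroom counts copied from the table above; NaN marks the missing value.
bedrooms = pd.Series([3.0, 4.0, np.nan, 3.0, 5.0, 6.0], name="bedrooms")

# pandas skips NaN by default when computing these statistics.
candidates = {
    "mean": bedrooms.mean(),
    "median": bedrooms.median(),
    "mode": bedrooms.mode()[0],  # mode() returns a Series; take the first entry
}

for name, value in candidates.items():
    print(f"{name}: {value}")
```

The mean is not a whole number of bedrooms, while the median and mode are. This notebook fills the gap with the mode; the median would be an equally reasonable choice for a count like this.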

In [6]:
df.bedrooms = df.bedrooms.fillna(mode)
df

Unnamed: 0,area,bedrooms,age,price
0,2600,3.0,20,550000
1,3000,4.0,15,565000
2,3200,3.0,18,610000
3,3600,3.0,30,595000
4,4000,5.0,8,760000
5,4100,6.0,8,810000


In [7]:
df.isnull().sum()

area        0
bedrooms    0
age         0
price       0
dtype: int64

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   area      6 non-null      int64  
 1   bedrooms  6 non-null      float64
 2   age       6 non-null      int64  
 3   price     6 non-null      int64  
dtypes: float64(1), int64(3)
memory usage: 320.0 bytes


In [13]:
import numpy as np
df.bedrooms = df.bedrooms.astype(np.int64)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   area      6 non-null      int64
 1   bedrooms  6 non-null      int64
 2   age       6 non-null      int64
 3   price     6 non-null      int64
dtypes: int64(4)
memory usage: 320.0 bytes


In [14]:
X = df.drop('price', axis = 'columns')
X

Unnamed: 0,area,bedrooms,age
0,2600,3,20
1,3000,4,15
2,3200,3,18
3,3600,3,30
4,4000,5,8
5,4100,6,8


In [15]:
y = df.price 
y

0    550000
1    565000
2    610000
3    595000
4    760000
5    810000
Name: price, dtype: int64

#### Build the Model

In [16]:
reg = linear_model.LinearRegression()  # Model Creation 

reg.fit(X, y)  # Train the Model

#### Find the Coefficients

In [17]:
reg.coef_

array([  119.67905405, 13097.24903475, -4207.28764479])

#### Find the Intercept

In [18]:
reg.intercept_

256461.14864864905

**Find the price of a home with 3000 sq ft area, 3 bedrooms, 40 years old**

In [19]:
reg.predict([[3000,3,40]])



array([486498.55212355])
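
The problem statement also asks about a second home (2500 sq ft, 4 bedrooms, 5 years old). The sketch below predicts both homes in one call. It rebuilds the cleaned table inline rather than reading `homeprices.csv` (an assumption for self-containment), and passes a DataFrame with the same column names as the training data, which avoids the feature-name warning that recent scikit-learn versions emit for plain lists. The name `new_homes` is illustrative.

```python
import pandas as pd
from sklearn import linear_model

# Cleaned data from the table above (missing bedroom count already filled with 3).
df = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000, 4100],
    "bedrooms": [3, 4, 3, 3, 5, 6],
    "age": [20, 15, 18, 30, 8, 8],
    "price": [550000, 565000, 610000, 595000, 760000, 810000],
})
X = df.drop("price", axis="columns")
y = df.price

reg = linear_model.LinearRegression()
reg.fit(X, y)

# The two homes from the problem statement, with matching column names.
new_homes = pd.DataFrame({
    "area": [3000, 2500],
    "bedrooms": [3, 4],
    "age": [40, 5],
})
predicted = reg.predict(new_homes)

print(new_homes.assign(predicted_price=predicted))
```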

#### Do the Manual Calculation

In [20]:
119.67905405*3000 + 13097.24903475 * 3 -4207.28764479 *40 + 256461.14864864887

486498.5521112989
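
Typing the coefficients by hand introduces small rounding differences. As a sketch of the same check without copying numbers, the calculation can be done directly from `reg.coef_` and `reg.intercept_`. The data is rebuilt inline from the table above (an assumption so the snippet runs on its own), and `manual` is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Cleaned data from the table above (missing bedroom count already filled with 3).
df = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000, 4100],
    "bedrooms": [3, 4, 3, 3, 5, 6],
    "age": [20, 15, 18, 30, 8, 8],
    "price": [550000, 565000, 610000, 595000, 760000, 810000],
})
X = df.drop("price", axis="columns")
y = df.price
reg = linear_model.LinearRegression().fit(X, y)

# price = coef . features + intercept, using the fitted values at full precision.
home = np.array([3000, 3, 40])
manual = np.dot(reg.coef_, home) + reg.intercept_

# The model's own prediction for the same home.
from_model = reg.predict(pd.DataFrame([home], columns=X.columns))[0]

print(manual, from_model)
```

The two numbers should agree up to floating-point precision, which confirms that `predict` is just the weighted sum of the features plus the intercept.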

In [21]:
reg.score(X,y)

0.9534350855214516


In [82]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows (2 of 6) for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

In [83]:
X_train

Unnamed: 0,area,bedrooms,age
5,4100,6,8
2,3200,3,18
4,4000,5,8
3,3600,3,30


In [84]:
X_test

Unnamed: 0,area,bedrooms,age
0,2600,3,20
1,3000,4,15


In [85]:
y_train

5    810000
2    610000
4    760000
3    595000
Name: price, dtype: int64

In [86]:
y_test

0    550000
1    565000
Name: price, dtype: int64

In [87]:
reg = linear_model.LinearRegression()  # Model Creation 

reg.fit(X_train, y_train)

In [88]:
reg.score(X_train, y_train)

1.0

In [89]:
reg.score(X_test, y_test)

-80.53287981859008
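
A training score of 1.0 next to a strongly negative test score is what one would expect here: four training rows are fit with four parameters (three coefficients plus the intercept), so the model passes through every training point exactly and has nothing left to generalize from. `score` returns R² = 1 - SS_res / SS_tot, which goes negative whenever the model does worse than simply predicting the mean of the test targets. The sketch below recomputes R² by hand on the same split; the data is rebuilt inline from the table above (an assumption for self-containment), and the names `ss_res`, `ss_tot` and `r2_manual` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Cleaned data from the table above (missing bedroom count already filled with 3).
df = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000, 4100],
    "bedrooms": [3, 4, 3, 3, 5, 6],
    "age": [20, 15, 18, 30, 8, 8],
    "price": [550000, 565000, 610000, 595000, 760000, 810000],
})
X = df.drop("price", axis="columns")
y = df.price

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)
reg = linear_model.LinearRegression().fit(X_train, y_train)

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the test mean).
y_true = y_test.to_numpy()
y_pred = reg.predict(X_test)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, reg.score(X_test, y_test))
```

With only two test rows, SS_tot is small, so even moderate prediction errors produce a large negative R². Six rows are too few to evaluate a model with a train/test split in any meaningful way.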

In [93]:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X = diabetes.data
X

array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990749, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06833155, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286131, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04688253,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452873, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00422151,  0.00306441]])

In [95]:
y = diabetes.target

In [96]:
y

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
       128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,
       150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,
       200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,
        42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,
        83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,
       104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
       173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,
       107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,
        60., 174., 259., 178., 128.,  96., 126., 28

In [126]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the 442 samples for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

In [127]:
X_train.shape

(353, 10)

In [128]:
X.shape

(442, 10)

In [129]:
X_test.shape

(89, 10)

In [148]:
from sklearn.ensemble import RandomForestRegressor

# Use a shallow random forest here instead of linear_model.LinearRegression()
reg = RandomForestRegressor(max_depth=3)
reg.fit(X_train, y_train)

In [149]:
reg.score(X_train, y_train)

0.5821235330417691

In [150]:
reg.score(X_test, y_test)

0.4668512591750701
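
The random forest above is created without a `random_state`, so its scores will vary slightly from run to run. As a hedged sketch of how the two model types from this notebook could be compared on the same split, the snippet below fits both a linear regression and a seeded random forest and collects their train and test R². The seed value `0` and the names `models` and `scores` are choices made here for illustration, not part of the original run.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

models = {
    "linear regression": LinearRegression(),
    # random_state=0 is an arbitrary seed so repeated runs give the same forest.
    "random forest (max_depth=3)": RandomForestRegressor(max_depth=3, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train R^2 = {scores[name][0]:.3f}, test R^2 = {scores[name][1]:.3f}")
```

Comparing the gap between train and test R² for each model gives a rough sense of how much each one overfits on this dataset.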