## Linear Regression
For this prediction task, we will try out the following regression algorithms:

1. Linear Regression
2. Polynomial Regression
3. Random Forest Regression
4. Support Vector Machine

The first part of the notebook covers EDA.
<br> The second part of the notebook covers preprocessing and training. </br>

## ****Data****
* CRIM: per capita crime rate by town
* ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS: proportion of non-retail business acres per town 
* CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise) 
* NOX: nitric oxides concentration (parts per 10 million) [parts/10M]
* RM: average number of rooms per dwelling 
* AGE: proportion of owner-occupied units built prior to 1940
* DIS: weighted distances to five Boston employment centres
* RAD: index of accessibility to radial highways
* TAX: full-value property-tax rate per $10,000 [$/10k]
* PTRATIO: pupil-teacher ratio by town
* B: The result of the equation B=1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* LSTAT: % lower status of the population
* MEDV: Median value of owner-occupied homes in $1000's [k$]

## Import Library

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np 
import pandas as pd 

#Read in data
df = pd.read_csv('/kaggle/input/the-boston-houseprice-data/boston.csv')
print(df.shape)
print('m: Number of training examples: ', df.shape[0])
# df.shape[1] counts every column, including the target MEDV, so subtract one
print('n: Number of independent variables: ', df.shape[1] - 1)
print('Target variable: MEDV')


In [None]:
df.head()

### Multi-variate Linear Regression and Matrix dimension
 $h_\theta(x)$ = hypothesis, the regression function used to predict new values
 <br> $\theta$ = parameters </br>
 <br> </br>
<br> $$\mathbf{ h_\theta(x_{i}) = \theta_0+\theta_1 CRIM + \theta_2 ZN + \theta_3 INDUS + \theta_4 CHAS + \theta_5 NOX + \theta_6 RM + ... }$$ </br>

<br> $$\mathbf{X} = \left( \begin{smallmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1n}\\
                                1 & x_{21} & x_{22} & \cdots & x_{2n}\\
                                1 & x_{31} & x_{32} & \cdots & x_{3n}\\
                                \vdots & \vdots & \vdots & \ddots & \vdots\\
                                1 & x_{m1} & x_{m2} & \cdots & x_{mn}\\
                                \end{smallmatrix} \right)_{(m,n+1)}$$ </br>

$$\theta = \left( \begin{matrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_j \\ \vdots \\ \theta_n \end{matrix} \right)_{(n+1,1)}
\quad
\mathbf{y} = \left( \begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_m \end{matrix} \right)_{(m,1)}$$

The leading column of ones in X multiplies $\theta_0$ (the intercept), so X has shape (m, n+1) and its product with $\theta$ is well defined.


#### Vectorized Form of hypothesis function
-> A vectorized implementation replaces explicit loops over examples with a single matrix product, which usually makes our code run faster
$$\mathbf{ h_\theta{(x)} = X\theta}$$
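
As a minimal sketch of the vectorized hypothesis, the cell below computes all predictions with one matrix product and compares the result with an explicit loop. It assumes a column of ones is prepended to the features so that theta_0 acts as the intercept; the toy numbers and the `theta` values are made up for illustration and do not come from the Boston data.

```python
import numpy as np

# Toy feature matrix: m = 3 examples, n = 2 features (made-up values).
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])

# Prepend a column of ones so theta_0 is the intercept: shape (m, n + 1).
X_design = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Hypothetical parameter vector of shape (n + 1,).
theta = np.array([0.5, 2.0, -1.0])

# Vectorized hypothesis: one matrix-vector product covers every example.
h_vectorized = X_design @ theta

# Equivalent explicit loop, shown only for comparison.
h_loop = np.array([
    theta[0] + sum(theta[j + 1] * row[j] for j in range(len(row)))
    for row in X_raw
])

print(h_vectorized)
print(np.allclose(h_vectorized, h_loop))
```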

# Exploratory Data Analysis

In [None]:
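# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay gives a comparable plot
sns.histplot(df['MEDV'], kde=True)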

## Question 1: Does price increase as the crime rate decreases?

In [None]:

sns.lmplot(x='CRIM',y='MEDV',data=df,aspect=2,height=6)
plt.xlabel('Crime rate')
plt.ylabel('Median value of owner-occupied home')
plt.title('Crime rate vs owner-home value')

The plot suggests that home prices tend to be higher where the crime rate is close to 0.
<br> One plausible reading is that safer neighborhoods command higher home prices. </br>

## Question 2: Does NOX have an effect on home prices?

In [None]:
sns.lmplot(x='NOX', y='MEDV', data=df, aspect=2)

As the nitric oxide concentration increases, home prices decrease.
<br> Excessive nitric oxide is harmful to health, so people are less likely to move into homes in areas with high NOX levels, which may explain the lower prices.</br>

# Training Linear Regression Model
First, we need to split our data into X and y.
<br> X is a matrix containing all columns (features) except 'MEDV'. </br>
<br> y is a column vector containing only 'MEDV'. </br>


In [None]:
X = df.drop('MEDV', axis=1)
y = df['MEDV']

## Train test split
We will split our data into a training set (70%) and a test set (30%).

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
from sklearn import metrics
from sklearn.model_selection import cross_val_score


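def cross_val(model):
    # Mean score over 10-fold cross-validation on the full data set
    # (for regressors the default score is R^2)
    pred = cross_val_score(model, X, y, cv=10)
    return pred.mean()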

def print_evaluate(true, predicted):  
    mae = metrics.mean_absolute_error(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(metrics.mean_squared_error(true, predicted))
    r2_square = metrics.r2_score(true, predicted)
    print('MAE:', mae)
    print('MSE:', mse)
    print('RMSE:', rmse)
    print('R2 Square', r2_square)
    print('__________________________________')
    
def evaluate(true, predicted):
    mae = metrics.mean_absolute_error(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(metrics.mean_squared_error(true, predicted))
    r2_square = metrics.r2_score(true, predicted)
    return mae, mse, rmse, r2_square
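
To show what each of the four numbers returned by `evaluate` measures, the sketch below recomputes them from their definitions with NumPy. `manual_metrics` is a hypothetical helper that is not part of this notebook, and the two small arrays are made-up values used only for illustration.

```python
import numpy as np

def manual_metrics(true, predicted):
    """Recompute MAE, MSE, RMSE and R^2 from their definitions."""
    true = np.asarray(true, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = true - predicted
    mae = np.mean(np.abs(errors))               # average absolute error
    mse = np.mean(errors ** 2)                  # average squared error
    rmse = np.sqrt(mse)                         # same units as the target
    ss_res = np.sum(errors ** 2)                # residual sum of squares
    ss_tot = np.sum((true - true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                  # share of variance explained
    return mae, mse, rmse, r2

# Made-up targets and predictions.
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.0, 5.0, 8.0, 9.0]

toy_mae, toy_mse, toy_rmse, toy_r2 = manual_metrics(toy_true, toy_pred)
print('MAE:', toy_mae)
print('MSE:', toy_mse)
print('RMSE:', toy_rmse)
print('R2 Square', toy_r2)
```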

# Preparing data for linear regression
* ****Linear Assumption****. Linear regression assumes that the relationship between your input and output is linear. It does not support anything else. This may be obvious, but it is good to remember when you have a lot of attributes. You may need to transform data to make the relationship linear (e.g. log transform for an exponential relationship).
* ****Remove Noise****. Linear regression assumes that your input and output variables are not noisy. Consider using data cleaning operations that let you better expose and clarify the signal in your data. This is most important for the output variable, so remove outliers in the output variable (y) if possible.
* ****Remove Collinearity****. Linear regression can produce unstable coefficients and over-fit your data when you have highly correlated input variables. Consider calculating pairwise correlations for your input data and removing the most correlated variables.
* ****Gaussian Distributions****. Linear regression will make more reliable predictions if your input and output variables have a Gaussian distribution. You may get some benefit from using transforms (e.g. log or BoxCox) on your variables to make their distribution more Gaussian looking.
* ****Rescale Inputs****. Linear regression will often make more reliable predictions if you rescale input variables using standardization or normalization.
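
Two of the checks listed above can be sketched in a few lines. The example below uses synthetic data rather than the Boston columns, and the 0.9 correlation threshold is an arbitrary assumption chosen for illustration. It flags highly correlated feature pairs and shows a log transform reducing the skew of a right-skewed variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 500

# Synthetic features: b is nearly a copy of a, c is independent of both.
a = rng.normal(size=n_samples)
b = a + 0.05 * rng.normal(size=n_samples)
c = rng.normal(size=n_samples)
features = np.column_stack([a, b, c])
names = ['a', 'b', 'c']

# Pairwise Pearson correlations between the columns.
corr = np.corrcoef(features, rowvar=False)

# Collect pairs whose absolute correlation exceeds the (arbitrary) threshold.
threshold = 0.9
high_pairs = [(names[i], names[j])
              for i in range(len(names))
              for j in range(i + 1, len(names))
              if abs(corr[i, j]) > threshold]
print('Highly correlated pairs:', high_pairs)

def skewness(values):
    """Sample skewness: the third standardized moment."""
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# A log-normal variable is right-skewed; its logarithm is Gaussian.
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=n_samples)
skew_before = skewness(skewed)
skew_after = skewness(np.log(skewed))
print('Skew before log:', skew_before)
print('Skew after log: ', skew_after)
```

In practice one of the two features in a flagged pair would be dropped, or the pair combined, before fitting the linear model.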

In [None]:
####### Rescaling variables #######
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('std_scalar', StandardScaler())
])

X_train = pipeline.fit_transform(X_train)
X_test = pipeline.transform(X_test)
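
What the scaling step does can be sketched directly in NumPy. The point of the sketch is that the mean and standard deviation are learned from the training rows only and then reused for the test rows, which mirrors `fit_transform` on the training set followed by `transform` on the test set. The arrays are made-up values, not Boston data.

```python
import numpy as np

# Made-up training and test rows with two features on very different scales.
toy_train = np.array([[1.0, 100.0],
                      [2.0, 200.0],
                      [3.0, 300.0]])
toy_test = np.array([[2.0, 400.0]])

# Learn the statistics from the training rows only.
mu = toy_train.mean(axis=0)
sigma = toy_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same transformation to both sets.
toy_train_scaled = (toy_train - mu) / sigma
toy_test_scaled = (toy_test - mu) / sigma

print(toy_train_scaled)
print(toy_test_scaled)
```

Computing the statistics on the full data set instead would leak information from the test set into training.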




# 1. Linear Regression Algorithm

In [None]:
from sklearn.linear_model import LinearRegression

# The `normalize` argument was removed in scikit-learn 1.2; the inputs are already standardized above
lin_reg = LinearRegression()
lin_reg.fit(X_train,y_train)

#### Model Evaluation

In [None]:
coeff_df = pd.DataFrame(lin_reg.coef_, X.columns, columns=['Coefficient'])
coeff_df

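****Note:**** each coefficient is the weight that multiplies its feature's value in the prediction. Because the features were standardized above, a coefficient is the change in predicted MEDV for a one-standard-deviation increase in that feature.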
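
To make the note concrete, the sketch below fits ordinary least squares on made-up, noise-free data and then rebuilds one prediction by hand as the intercept plus the sum of coefficient times feature value. It uses `np.linalg.lstsq` as a stand-in for `LinearRegression`; the data and the weights 4, 3 and -2 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
toy_X = rng.normal(size=(50, 2))

# Noise-free target built from known weights: y = 4 + 3*x1 - 2*x2.
toy_y = 4.0 + 3.0 * toy_X[:, 0] - 2.0 * toy_X[:, 1]

# Least-squares fit with an explicit intercept column.
design = np.column_stack([np.ones(len(toy_X)), toy_X])
solution, *_ = np.linalg.lstsq(design, toy_y, rcond=None)
intercept, coefs = solution[0], solution[1:]

# Prediction for one new row = intercept + sum(coefficient * feature value).
new_row = np.array([1.0, 2.0])
prediction = intercept + np.dot(coefs, new_row)

print('Intercept:', intercept)
print('Coefficients:', coefs)
print('Prediction:', prediction)
```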

In [None]:
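# Plot predicted against actual values; points close to the diagonal are good predictions
pred = lin_reg.predict(X_test)
plt.scatter(y_test, pred)
plt.xlabel('Actual MEDV')
plt.ylabel('Predicted MEDV')
plt.show()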


In [None]:
#Prediction on test/train sets
test_pred = lin_reg.predict(X_test)
train_pred = lin_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, train_pred)

****Overfitting****: training MSE is much lower than testing MSE
<br>****Underfitting****: MSE is high on both the training set and the test set </br>

In [None]:
results_df = pd.DataFrame(data=[["Linear Regression", *evaluate(y_test, test_pred)]], 
                          columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])
results_df

# 2. Polynomial Regression

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly_reg = PolynomialFeatures(degree=2)

X_train_2_d = poly_reg.fit_transform(X_train)
X_test_2_d = poly_reg.transform(X_test)

# The `normalize` argument was removed in scikit-learn 1.2; the inputs were standardized before expansion
lin_reg = LinearRegression()
lin_reg.fit(X_train_2_d,y_train)

test_pred = lin_reg.predict(X_test_2_d)
train_pred = lin_reg.predict(X_train_2_d)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred)
print('====================================')
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, train_pred)

****This is an example of overfitting. The polynomial model fits the training set so closely that it generalizes poorly to unseen data.****
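
The effect can be reproduced on a small synthetic problem. The sketch below is not the Boston model: it fits one-dimensional polynomials of degree 1 and degree 10 to made-up data whose true signal is linear, then compares training and test MSE. The sample sizes, noise level and degrees are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    """Truly linear signal plus Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x + rng.normal(scale=0.3, size=n)
    return x, y

def mse(y_true, y_hat):
    return float(np.mean((y_true - y_hat) ** 2))

# Small training set, larger test set drawn from the same process.
x_tr, y_tr = make_data(15)
x_te, y_te = make_data(200)

train_err, test_err = {}, {}
for degree in (1, 10):
    # Least-squares polynomial fit of the given degree.
    poly = np.polynomial.Polynomial.fit(x_tr, y_tr, deg=degree)
    train_err[degree] = mse(y_tr, poly(x_tr))
    test_err[degree] = mse(y_te, poly(x_te))
    print(f'degree={degree:2d}  train MSE={train_err[degree]:.4f}  '
          f'test MSE={test_err[degree]:.4f}')
```

The higher-degree fit cannot have a larger training error than the straight line, because the straight line is one of the functions it can represent. A test error that rises while the training error falls is the signature of overfitting.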

In [None]:
results_df_2 = pd.DataFrame(data=[["Polynomial", *evaluate(y_test, test_pred)]], 
                          columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
results_df

# 3. Random Forest Algorithm

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor(n_estimators=1000)
rf_reg.fit(X_train, y_train)

test_pred = rf_reg.predict(X_test)
train_pred = rf_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred)

print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, train_pred)

In [None]:
results_df_3 = pd.DataFrame(data=[["Random Forest", *evaluate(y_test, test_pred)]], 
                          columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
results_df = pd.concat([results_df, results_df_3], ignore_index=True)
results_df

# 4. Support Vector Machine Algorithm

In [None]:
from sklearn.svm import SVR

svm_reg = SVR(kernel='rbf', C=1000000, epsilon=0.001)
svm_reg.fit(X_train, y_train)

test_pred = svm_reg.predict(X_test)
train_pred = svm_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred)

print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, train_pred)

Our R2 Square and MSE are excellent on the training set but terrible on the test set. This is an example of overfitting.

In [None]:
results_df_4 = pd.DataFrame(data=[["Support Vector Machine", *evaluate(y_test, test_pred)]], 
                          columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square'])

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
results_df = pd.concat([results_df, results_df_4], ignore_index=True)
results_df

# Comparing Models

In [None]:
results_df.set_index('Model', inplace=True)
results_df['R2 Square'].plot(kind='barh', figsize=(12, 8)) # barh draws a horizontal bar chart

# Summary

### Random Forest outperforms all the other regression models.
##### Comparing Linear Regression to Random Forest
<br> Random Forest performs better because it does not assume a linear relationship between the features and the target. </br>
<br> CHAS is a binary categorical feature (0 or 1). Tree-based models such as Random Forest handle this kind of feature naturally by splitting on it, which may contribute to the better result on this dataset.</br>
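
The linearity point can be illustrated with a toy case. The sketch below is a simplified stand-in rather than the notebook's model: it uses a single decision stump (one split) in place of the many trees in a random forest, and a made-up step-shaped target. The stump captures the step exactly, while a straight line cannot.

```python
import numpy as np

# Step-shaped target that is not linear in x (made-up data).
step_x = np.linspace(0.0, 1.0, 101)
step_y = np.where(step_x > 0.5, 1.0, 0.0)

# Straight-line least-squares fit.
slope, intercept = np.polyfit(step_x, step_y, deg=1)
line_mse = float(np.mean((step_y - (slope * step_x + intercept)) ** 2))

# Decision stump: try every midpoint between neighbouring x values as a
# threshold and predict the mean of the target on each side of it.
stump_mse = np.inf
for t in (step_x[:-1] + step_x[1:]) / 2:
    left = step_y[step_x <= t]
    right = step_y[step_x > t]
    stump_pred = np.where(step_x <= t, left.mean(), right.mean())
    stump_mse = min(stump_mse, float(np.mean((step_y - stump_pred) ** 2)))

print('Line MSE: ', line_mse)
print('Stump MSE:', stump_mse)
```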