## Implementing Linear Regression on USA Housing Data

We will apply a linear regression model to the USA Housing data to predict home prices.

In [None]:
#Importing required numpy, pandas, scikit-learn, matplotlib and seaborn libraries

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

**Importing `csv` file, for data analysis**

In [None]:
# Importing USA Housing.csv
data = pd.read_csv('data/usa_housing.csv')

In [None]:
# Checking for Null Values
data.info()

**Using `info()` we find that there are 5000 rows and 7 columns.**

In [None]:
data.head()

In [None]:
# Getting the summary of Data
data.describe()

**Data Preparation**

1. `data.info()` shows no null values, so there is no need to delete or replace any data.

2. The `Address` column is text rather than a numeric feature and is not needed for the model, hence we can drop it.
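
As a more explicit check than reading the `info()` output, missing values can be counted per column. The sketch below uses a small hypothetical frame (`toy`, not the housing data) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data
toy = pd.DataFrame({
    "Price": [100.0, np.nan, 300.0],
    "Area Population": [10.0, 20.0, 30.0],
})

# Count missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# True only when the whole frame contains no missing values
has_no_nulls = not toy.isnull().values.any()
print(has_no_nulls)
```

On the real data, `data.isnull().sum()` would be expected to report zero for every column if the statement above holds.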


In [None]:
# Dropping Address Column
data.drop(['Address'],axis=1,inplace=True)

In [None]:
data.head()

In [None]:
# Shuffle the DataFrame rows (a fixed random_state keeps the shuffle reproducible)
data = data.sample(frac=1, random_state=42)

### Using Seaborn Visualization (pairplot)
**To find correlation among columns with respect to home prices.**

- A pair plot builds on two basic figures: the histogram and the scatter plot.
- The histograms on the diagonal show the distribution of a single variable.
- The scatter plots in the upper and lower triangles show the relationship (or lack thereof) between two variables.

In [None]:
# Let's plot a pair plot of all variables in our dataframe
sns.pairplot(data);

#### From the pair plots above, it is evident:
- That the distribution of each variable is roughly bell-shaped (approximately normal).

- There is a visible relationship, of varying strength, between `Price` and the following:
  - `Avg. Area Income`
  - `Avg. Area House Age`
  - `Avg. Area Number of Rooms`
  - `Avg. Area Number of Bedrooms`
  - `Area Population`

In [None]:
# Visualize relationship between features and response using scatterplots
sns.pairplot(
    data,
    x_vars=['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
            'Avg. Area Number of Bedrooms', 'Area Population'],
    y_vars='Price',
    height=7,
    aspect=0.7,
    kind='scatter',
);

### Using Seaborn Visualization (heatmap)
**To find correlation among columns with respect to home price.**

**With seaborn's default colormap, the darkest cells in this heatmap mark correlations close to zero, i.e. little or no linear relationship between two variables.**<br>
**A lighter shade shows a stronger positive linear relationship between the variables.**

In [None]:
sns.heatmap(data.corr(),annot=True);

## Find correlation among columns with respect to home prices

### Correlation
- The correlation coefficient, or simply correlation, is an index that ranges from -1 to 1.
- When the value is near zero, there is no linear relationship.
- As the correlation gets closer to plus or minus one, the linear relationship is stronger.
- A value of one (or negative one) indicates a perfect linear relationship between two variables.
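
To make these properties concrete, here is a minimal sketch that computes the Pearson correlation from its definition (covariance divided by the product of the standard deviations). The helper `pearson_r` and the sample arrays are illustrative and not part of the notebook; pandas' `corr()` uses the Pearson method by default.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed directly from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # deviations from the mean of x
    dy = y - y.mean()          # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
noisy_y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

perfect = pearson_r(x, 2 * x + 1)   # exact increasing line
inverse = pearson_r(x, -3 * x)      # exact decreasing line
noisy = pearson_r(x, noisy_y)       # imperfect but positive relationship

print(perfect, inverse, noisy)
```

The two exact lines give values at the ends of the range, while the noisy series falls strictly between 0 and 1.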

In [None]:
# Let's find the correlation between Price and the other variables in the dataset.
data.corr().Price.sort_values(ascending=False)

### Histogram
- A great way to get started exploring a single variable is with the histogram. 
- A histogram divides variable into bins, counts the data points in each bin.
- It shows bins on x-axis and counts on the y-axis.
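
The binning step can be sketched with `numpy.histogram`, which returns the counts and the bin edges that a histogram plot would draw. The sample values are made up for illustration:

```python
import numpy as np

# Made-up sample values, not taken from the housing data
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Split the range [1, 9] into 4 equal-width bins and count points per bin
counts, edges = np.histogram(values, bins=4)

print(edges)   # bin boundaries along the x-axis
print(counts)  # number of points in each bin (the bar heights)
```

Every bin is closed on the left and open on the right, except the last one, which also includes its right edge.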

In [None]:
# Histogram of Price with a kernel density estimate overlaid
sns.histplot(data.Price, kde=True);

### Linear Regression Analysis
- Before we start Linear Regression Analysis, we need to `split the dataset into training and test data`.
- We will use `training data to fit our model` and `test data to validate our model`.
- A common choice is to keep `30% as test data` and `70% as training data`.

In [None]:
# Independent variables(x)
X = data[['Avg. Area Income','Avg. Area House Age','Avg. Area Number of Rooms','Avg. Area Number of Bedrooms','Area Population']]

# Dependent variables(y)
Y = data['Price']

In [None]:
X.head()

In [None]:
Y.head()

### Train Test Split
- Our goal is to create a model that generalises well for new data. 
- Our test set serves as proxy for new data.
- Training data is the data on which we fit the linear regression model.
- Finally, we evaluate the fitted model on the test data.
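
The sketch below illustrates a 70/30 split on a small synthetic array (the `*_demo` names are hypothetical). It also shows that fixing `random_state` makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# 30% of the rows go to the test set, the remaining 70% to the training set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=50
)

# Repeating the call with the same seed yields the same partition
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=50
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

Each row ends up in exactly one of the two sets, so the test set stays unseen during training.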

**Splitting the dataset into train and test, giving 30% as test data and 70% as train data.**<br>
**The code for splitting is as follows:**

In [None]:
#random_state is the seed used by the random number generator, it can be any integer.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.7 ,test_size = 0.3, random_state=50)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

**Using LinearRegression function from Scikit-Learn Library.**

In [None]:
# Creating a Linear Regression Model
lm = LinearRegression()

# Training the Model created using the 70% data assigned for training
lm.fit(X_train, Y_train)

In [None]:
# Retrieve the fitted intercept (c) and the coefficients (one slope m per feature):

# Intercept (c):
print("Intercept : ", lm.intercept_)

# Coefficients (m), in the same order as the columns of X_train:
print("Coefficients : ", lm.coef_)

In [None]:
# Let's see the coefficient
coeff_df = pd.DataFrame(lm.coef_,X_test.columns,columns=['Coefficient'])
coeff_df
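
To illustrate how the intercept and coefficients combine, the following sketch fits a model on synthetic, noise-free data (all names and values here are assumptions for illustration) and reproduces `predict` by hand as intercept plus the weighted sum of the features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known linear rule (no noise)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = 4.0 + X_demo @ true_coef

model = LinearRegression().fit(X_demo, y_demo)

# Prediction written out by hand: y = c + m1*x1 + m2*x2 + m3*x3
manual = model.intercept_ + X_demo @ model.coef_

print(model.intercept_, model.coef_)
print(np.allclose(manual, model.predict(X_demo)))
```

Each coefficient can therefore be read as the change in the predicted value for a one-unit increase in that feature, with the other features held fixed.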

## Validating the trained model

**Predicting Values of `y` based on 30% `x` test values.**

In [None]:
Y_pred = lm.predict(X_test)

### Comparing Actual Values with Predicted Values by plotting graph

In [None]:
plt.scatter(Y_test, Y_pred)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")

### How accurate is our Model ?

Common metrics for evaluating regression models are the R2 score, the explained variance score and the Mean Squared Error (MSE).

### Coefficient of determination (R2)
- R2 is the fraction (percentage) of variation in the dependent variable Y that is explained by the independent variables X.
- It typically ranges from 0 (no predictability) to 1 (or 100%), which indicates complete predictability; on test data it can even be negative when the model does worse than always predicting the mean of Y.
- A high R2 indicates that the response variable is predicted with less error.
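
As a small worked example, R2 can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compared with `metrics.r2_score`. The arrays below are made up for illustration:

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_hat) ** 2)           # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

# A constant prediction far from the data does worse than the mean
r2_bad = metrics.r2_score(y_true, np.full(4, 20.0))

print(r2_manual, metrics.r2_score(y_true, y_hat), r2_bad)
```

The second case shows why R2 is not strictly bounded below by zero.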

In [None]:
vari = metrics.explained_variance_score(Y_test,Y_pred)
r2score = metrics.r2_score(Y_test, Y_pred)
mse = metrics.mean_squared_error(Y_test,Y_pred)

print('Explained variance: ',vari)
print('R2 score : ', r2score)
print('MSE: ',mse )

- Predicted values and actual values agree well with each other, and the R2 score is 0.92 (the maximum possible is 1.0).
- The MSE looks large, but it is expressed in squared price units, so its raw magnitude is hard to judge on its own. We will still check whether a simpler model performs better.
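
Because MSE is measured in squared units of the target, its raw value tends to look very large when prices are in the hundreds of thousands or more. Taking the square root (RMSE) gives an error in the same units as the price. A minimal sketch with made-up prices:

```python
import numpy as np

# Made-up actual and predicted prices
y_true = np.array([1_000_000.0, 1_200_000.0, 900_000.0])
y_hat = np.array([1_100_000.0, 1_150_000.0, 950_000.0])

mse_demo = np.mean((y_true - y_hat) ** 2)   # squared price units
rmse_demo = np.sqrt(mse_demo)               # same units as the prices

print(mse_demo)
print(rmse_demo)
```

Here the MSE is in the billions even though the typical error is only a few percent of the price, which is why RMSE or R2 is usually easier to interpret.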

#### To try to reduce MSE, we can drop the column with the least correlation and check whether the model improves.
**Find the list of columns and their correlation with Price**

In [None]:
data.corr().Price.sort_values(ascending=False)

**Avg. Area Number of Bedrooms - has least correlation, hence can be dropped**
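
Whether dropping a weak feature helps is best checked empirically. The sketch below uses synthetic data (the `strong` and `weak` features are assumptions, not the housing columns) to show that, on the data used for fitting, least squares with fewer features cannot reach a higher R2; any gain has to show up on held-out data, and it is often small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic target driven mostly by one feature, weakly by another
rng = np.random.default_rng(42)
n = 500
strong = rng.normal(size=n)
weak = rng.normal(size=n)
y = 3.0 * strong + 0.1 * weak + rng.normal(scale=0.5, size=n)

X_full = np.column_stack([strong, weak])   # both features
X_reduced = strong.reshape(-1, 1)          # weak feature dropped

# R2 measured on the same data used for fitting
r2_full = LinearRegression().fit(X_full, y).score(X_full, y)
r2_reduced = LinearRegression().fit(X_reduced, y).score(X_reduced, y)

print(r2_full, r2_reduced)
```

Comparing the test-set metrics before and after the drop, as the notebook does next, is the meaningful comparison.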

In [None]:
# Create a new DataFrame without the column instead of modifying a slice of `data` in place
X = X.drop(columns=['Avg. Area Number of Bedrooms'])

In [None]:
X.head()

In [None]:
Y.head()

**Splitting dataset into train and test , giving 30% as test data and 70% as train data**

In [None]:
#random_state is the seed used by the random number generator, it can be any integer.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.7 ,test_size = 0.3, random_state=50)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

**Using LinearRegression function from Scikit-Learn Library.**

In [None]:
# Creating a Linear Regression Model
lm = LinearRegression()

# Training the Model created using the 70% data assigned for training
lm.fit(X_train, Y_train)

In [None]:
# Retrieve the fitted intercept (c) and the coefficients (one slope m per feature):

# Intercept (c):
print("Intercept : ", lm.intercept_)

# Coefficients (m), in the same order as the columns of X_train:
print("Coefficients : ", lm.coef_)

In [None]:
# Let's see the coefficient
coeff_df = pd.DataFrame(lm.coef_,X_test.columns,columns=['Coefficient'])
coeff_df

## Validating the trained model

**Predicting Values of `y` based on 30% `x` test values.**

In [None]:
Y_pred = lm.predict(X_test)

### Comparing Actual Values with Predicted Values by plotting graph

In [None]:
plt.scatter(Y_test, Y_pred)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")

### How accurate is our Model ?

Common metrics for evaluating regression models are the R2 score, the explained variance score and the Mean Squared Error (MSE).

In [None]:
vari = metrics.explained_variance_score(Y_test,Y_pred)
r2score = metrics.r2_score(Y_test, Y_pred)
mse = metrics.mean_squared_error(Y_test,Y_pred)

print('Explained variance: ',vari)
print('R2 score : ', r2score)
print('MSE: ',mse )

## R2 score is 0.92, meaning the model explains about 92% of the variance in home prices on the test data, which is good.