# **Example of supervised learning with linear regression**

<p align='justify'>Supervised learning can be divided into two subcategories: classification and regression.

A regression problem is one in which the output variable is a real or continuous value, such as “salary” or “weight”. Many different models can be used; the simplest is linear regression, which fits the hyperplane that minimizes the sum of squared differences between the predicted and the observed values.

Here we solve a well-known regression problem using a linear regression model. </p>
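
To make the idea of a least-squares fit concrete, here is a minimal sketch for the one-feature case in pure Python. The function name `fit_simple_linear_regression` and the data values are made up for illustration; they are not part of the Boston data or of scikit-learn.

```python
# Minimal sketch: ordinary least squares for a single feature.
# The data below is invented for illustration only.

def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]  # roughly y = 2x + 1 with small noise

slope, intercept = fit_simple_linear_regression(xs, ys)
print("slope:", slope)
print("intercept:", intercept)
```

With several features the same criterion applies, only the line becomes a hyperplane with one coefficient per feature.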

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of [Boston MA](http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). The dataset columns are:

CRIM - per capita crime rate by town

ZN - proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS - proportion of non-retail business acres per town.

CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)

NOX - nitric oxides concentration (parts per 10 million)

RM - average number of rooms per dwelling

AGE - proportion of owner-occupied units built prior to 1940

DIS - weighted distances to five Boston employment centres

RAD - index of accessibility to radial highways

TAX - full-value property-tax rate per $10,000

PTRATIO - pupil-teacher ratio by town

B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

LSTAT - % lower status of the population

MEDV - Median value of owner-occupied homes in $1000's

In [None]:
#import some libraries
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.datasets import load_boston # load the dataset (load_boston was deprecated in scikit-learn 1.0 and removed in 1.2)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
#get to know the dataset
#first load the dataset and print its keys
boston_dataset = load_boston()
print("keys:",boston_dataset.keys())

In [None]:
#first five target (label) values of the dataset
print("target:",boston_dataset['target'][:5])

In [None]:
#print the shape of data
print("shape of data:",boston_dataset['data'].shape) 
#display some data
print("Data:\n",boston_dataset['data'][:5])

In [None]:
#list of features that we will analyze and use to build the model
print("features name:", boston_dataset['feature_names'])
#the abbreviated feature names are described above

In [None]:
#the raw array above is hard to read
#convert it to a DataFrame for a clearer view
boston_df = pd.DataFrame(boston_dataset['data'], columns=boston_dataset['feature_names'])
#use display for a nicely formatted table
display(boston_df.head()) #head() returns the first five rows

In [None]:
#add the target values to the DataFrame as a PRICE column
boston_df['PRICE'] = boston_dataset.target
display(boston_df.head()) #head() returns the first five rows

In [None]:
#statistics 
boston_df.describe()


In [None]:
#number of unique values
boston_df.nunique()

In [None]:
#count the null values in each column
boston_df.isnull().sum()


In [None]:
#see rows with null values
boston_df[boston_df.isnull().any(axis=1)]

In [None]:
#see correlation between the features
crtn = boston_df.corr()
print(crtn.shape)
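
`DataFrame.corr()` computes the Pearson correlation coefficient by default. The sketch below shows that computation for a single pair of columns in pure Python. The helper `pearson_r` and the `rooms` and `price` values are hypothetical, chosen only to illustrate the formula.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# made-up values, not taken from the Boston data
rooms = [5.0, 6.0, 6.5, 7.0, 8.0]
price = [15.0, 21.0, 24.0, 30.0, 45.0]

r = pearson_r(rooms, price)
print("correlation:", round(r, 3))
```

The result always lies between -1 and 1. Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.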

In [None]:
#plot the correlation matrix as a heatmap
plt.figure(figsize=(10,10))
sns.heatmap(crtn, square=True, fmt='.1f', annot=True, cmap='Blues')

In [None]:
#now split the dataset
#import train_test_split to split the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(boston_dataset['data'], boston_dataset['target'], random_state=4) 
print("X train Shape:", X_train.shape) #shape of the training data which is 75%
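
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for testing. The sketch below illustrates how such a shuffled split works. It is not scikit-learn's implementation: the helper `split_indices` is hypothetical, rounding the test count up is an assumption of this sketch, and the shuffled order will differ from what `random_state=4` produces in scikit-learn.

```python
import math
import random


def split_indices(n_samples, test_size=0.25, seed=4):
    """Shuffle row indices and split them into train and test parts."""
    # assumption: the number of test rows is rounded up
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    # a seeded generator makes the split reproducible
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


# the Boston data has 506 rows
train_idx, test_idx = split_indices(506)
print(len(train_idx), len(test_idx))  # 379 127
```

Fixing the seed matters because a different split gives different training data and therefore different coefficients and scores.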

In [None]:
#display the first rows of the data used to train the model
print("X train data:\n", X_train[:5])

**Train the model**

In [None]:
#first train the model
from sklearn.linear_model import LinearRegression

#create a Linear Regression
lr = LinearRegression()

#use training set
lr.fit(X_train, y_train)

#intercept (bias term b) of the fitted model
lr.intercept_

In [None]:
#pair each feature name with its fitted coefficient
coefficients = pd.DataFrame({'Attribute': boston_dataset['feature_names'], 'Coefficients': lr.coef_})
display(coefficients)
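
The intercept and the coefficients fully determine the model: a prediction is the intercept plus the sum of each coefficient multiplied by its feature value. The sketch below shows this with a hypothetical helper `predict_one` and made-up numbers, not the values fitted above.

```python
def predict_one(intercept, coefficients, features):
    """Linear model prediction: intercept + sum of coefficient * feature."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))


# made-up model with three features
intercept = 2.0
coefficients = [0.5, -1.0, 3.0]
features = [4.0, 1.0, 2.0]

prediction = predict_one(intercept, coefficients, features)
print(prediction)  # 2.0 + 2.0 - 1.0 + 6.0 = 9.0
```

A positive coefficient raises the predicted value as that feature increases and a negative one lowers it. The magnitudes are only comparable across features when the features are on similar scales.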

**Evaluate the model**

In [None]:
#predict on the training data
yprd = lr.predict(X_train)

In [None]:
#print the evaluations
print('R^2:',metrics.r2_score(y_train, yprd))
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train, yprd))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_train, yprd))
print('MSE:',metrics.mean_squared_error(y_train, yprd))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_train, yprd)))
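
The metrics printed above can be spelled out from their definitions. The following sketch computes them in pure Python on a tiny made-up example. The function name `regression_metrics` is hypothetical, and the adjusted R^2 uses the same formula as the cell above.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Compute R^2, adjusted R^2, MAE, MSE and RMSE from their definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_y = sum(y_true) / n
    ss_res = sum(e ** 2 for e in residuals)          # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total variation
    r2 = 1 - ss_res / ss_tot
    # penalize R^2 for the number of features used
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    mae = sum(abs(e) for e in residuals) / n
    mse = ss_res / n
    return {"R^2": r2, "Adjusted R^2": adj_r2,
            "MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


# made-up values for illustration
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

scores = regression_metrics(y_true, y_pred, n_features=1)
for name, value in scores.items():
    print(name + ":", round(value, 4))
```

MAE and RMSE are in the same units as the target (thousands of dollars here), while R^2 is unitless and measures the share of variance explained.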

In [None]:
plt.scatter(y_train, yprd)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
plt.title("Prices vs Predicted prices")
plt.show()

In [None]:
plt.scatter(yprd,y_train-yprd)
plt.title("Predicted vs residuals")
plt.xlabel("Predicted")
plt.ylabel("Residuals")
plt.show()

In [None]:
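#distplot is deprecated in recent seaborn versions; histplot with kde=True draws a comparable plot
sns.histplot(y_train-yprd, kde=True)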
plt.title("Histogram of Residuals")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()

**Evaluate on the test data**

In [None]:
# Predicting Test data
y_test_pred = lr.predict(X_test)

In [None]:
print('R^2:', metrics.r2_score(y_test, y_test_pred))
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_test_pred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_test, y_test_pred))
print('MSE:',metrics.mean_squared_error(y_test, y_test_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))