# Boston Housing Dataset

# Boston house prices dataset

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.



# Importing the required libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_style("whitegrid")
pd.set_option('display.max_columns', 8)
pd.set_option('display.width', 10000)

# Loading the dataset

In [None]:
dataset = pd.read_csv('../input/bostoncsv/Boston.csv')

In [None]:
print(dataset.keys())

In [None]:
# Drop the leftover index column, then check the shape of the data
dataset = dataset.drop('Unnamed: 0', axis=1)
print(dataset.shape)

# Analysing the dataset

In [None]:
print(dataset.head())

In [None]:
print(dataset.describe())

In [None]:
dataset.info()  # info() prints its summary directly and returns None

# Now separating the dataset into features (X) and the target variable (Y)

In [None]:
X = dataset.iloc[:, :-1].values  # all columns except the target (last column)
Y = dataset.iloc[:, -1].values

# Checking distribution of target variable

In [None]:
plt.subplots(figsize=(20,15))
sns.histplot(Y, kde=True)  # distplot is deprecated in recent seaborn releases
plt.show()

# Creating the correlation matrix, which measures the linear relationships between the variables

In [None]:
correlation_matrix = dataset.corr().round(2)
# annot=True prints the correlation value inside each square
plt.subplots(figsize=(20,15))
sns.heatmap(data=correlation_matrix, annot=True)
plt.show()

Observations:
* To fit a linear regression model, we select the features that have a high correlation
  (positive or negative) with the target variable, here RM and LSTAT.
* An important point when selecting features for a linear regression model is to check for
  multicollinearity. RAD and TAX have a correlation of 0.91, so this pair is strongly
  correlated. We should not select both of these features together for training the model.
  The same goes for DIS and AGE.
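
The multicollinearity check can also be done programmatically instead of by reading the heatmap. The following is a minimal sketch, not part of the original workflow: the `correlated_pairs` helper, the 0.7 threshold, and the synthetic `demo` frame standing in for `dataset` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `dataset` (values are illustrative, not the real data).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "rad": a,
    "tax": a + rng.normal(scale=0.1, size=200),  # nearly a copy of "rad"
    "rm": rng.normal(size=200),                  # independent of the others
})

def correlated_pairs(df, threshold=0.7):
    """Return (col_a, col_b, r) for column pairs with |r| above the threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    return pairs

pairs = correlated_pairs(demo)
print(pairs)
```

Applied to the real data, the same helper could be called as `correlated_pairs(dataset.drop('medv', axis=1))` so that only feature-to-feature correlations are listed.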

Based on the above observations, we will choose RM, PTRATIO and LSTAT as our features.
Using scatter plots, let's see how LSTAT and RM vary with MEDV.

In [None]:
plt.figure(figsize=(20, 5))

features = ['lstat', 'rm']
target = Y
for i, col in enumerate(features):
    plt.subplot(1, len(features), i + 1)
    x = dataset[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('MEDV')
plt.show()

Observations:
* The price increases roughly linearly as RM increases. There are a few outliers, and the
  data seems to be capped at 50.
* The price tends to decrease as LSTAT increases, though the relationship does not look
  exactly linear.
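
The apparent cap at 50 can be checked by counting how many observations share the maximum value. This is a minimal sketch with a made-up `medv` array, not the real target column; excluding the capped rows is shown only as one possible option and is not something the notebook does.

```python
import numpy as np

# Illustrative target values (not the real data); 50.0 plays the role of the cap.
medv = np.array([24.0, 21.6, 50.0, 34.7, 50.0, 18.9, 50.0])

cap = medv.max()
n_capped = int((medv == cap).sum())
print(f"max = {cap}, observations at the max = {n_capped}")

# One option (an assumption, not part of the original workflow):
# keep only the rows strictly below the cap before fitting.
mask = medv < cap
print(medv[mask])
```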

In [None]:
# Keep only the three selected features
X = dataset[['rm', 'ptratio', 'lstat']]

# Now splitting the data into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=5)


# Now applying Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [None]:
lr_reg = LinearRegression()
lr_reg.fit(X_train, y_train)

# Predicting on the training dataset

In [None]:
y_train_lr_predict = lr_reg.predict(X_train)

# Predicting on the test dataset

In [None]:
y_test_lr_predict = lr_reg.predict(X_test)

# Model Evaluation

# Evaluating the model on training dataset

In [None]:
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_lr_predict))
r2_train = r2_score(y_train, y_train_lr_predict)

In [None]:
print("The LR model performance for the training set")
print("--------------------------------------------")
print("RMSE of training set is {}".format(rmse_train))
print("R2 Score of training set is {}".format(r2_train))


# Evaluating the model on test dataset

In [None]:
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_lr_predict))
r2_test = r2_score(y_test, y_test_lr_predict)

In [None]:
print("The LR model performance for the test set")
print("--------------------------------------------")
print("RMSE of test set is {}".format(rmse_test))
print("R2 Score of test set is {}".format(r2_test))
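
For reference, the two metrics reported here can be computed by hand. The sketch below uses made-up numbers rather than the notebook's predictions: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(rmse, r2)
```

These should agree with `np.sqrt(mean_squared_error(...))` and `r2_score(...)` on the same inputs.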

# Now Comparing the Test and Predicted values of target variable

In [None]:
lr_compare = pd.DataFrame({'Actual': y_test, 'Predicted': y_test_lr_predict})
print(lr_compare)
(lr_compare.head(10)).plot(kind='bar', figsize=(15,5))
plt.show()

# Now Applying Polynomial Regression On the Dataset

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
poly_features = PolynomialFeatures(degree=2)

# Transforming the existing features to higher degree features.

In [None]:
x_train_poly = poly_features.fit_transform(X_train)

# Fit the transformed features to Linear Regression

In [None]:
poly_model = LinearRegression()

In [None]:
poly_model.fit(x_train_poly, y_train)

# Predicting on the training dataset

In [None]:
y_train_poly_predicted = poly_model.predict(x_train_poly)

# Predicting on the test dataset

In [None]:
# Reuse the transformer fitted on the training data; do not refit it on the test set
y_test_poly_predicted = poly_model.predict(poly_features.transform(X_test))
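
When predicting on held-out data, the transformer fitted on the training set should be reused. One way to make that automatic is to chain the transformer and the regressor in a scikit-learn pipeline. The sketch below is an alternative arrangement, not the notebook's own code, and runs on synthetic data (`X_demo`, `y_demo` are made up).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data for illustration: y is an exact quadratic in one feature.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(100, 1))
y_demo = 2.0 * X_demo[:, 0] ** 2 - X_demo[:, 0] + 1.0

X_tr, X_te = X_demo[:80], X_demo[80:]
y_tr, y_te = y_demo[:80], y_demo[80:]

# fit() fits the transformer on the training data only;
# predict() applies transform() (never fit) to whatever it is given.
poly_pipeline = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_pipeline.fit(X_tr, y_tr)

pred = poly_pipeline.predict(X_te)
print(np.abs(pred - y_te).max())  # largest absolute error on the held-out rows
```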

# Model Evaluation

# Evaluating the model on training dataset

In [None]:
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_poly_predicted))
r2_train = r2_score(y_train, y_train_poly_predicted)

In [None]:
print("The poly model performance for the training set")
print("--------------------------------------------")
print("RMSE of training set is {}".format(rmse_train))
print("R2 Score of training set is {}".format(r2_train))

# Evaluating the model on test dataset

In [None]:
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_poly_predicted))
r2_test = r2_score(y_test, y_test_poly_predicted)

In [None]:
print("The poly model performance for the test set")
print("--------------------------------------------")
print("RMSE of test set is {}".format(rmse_test))
print("R2 Score of test set is {}".format(r2_test))

# Now Comparing the Test and Predicted values of target variable

In [None]:
poly_compare = pd.DataFrame({'Actual': y_test, 'Predicted': y_test_poly_predicted})
print(poly_compare)
(poly_compare.head(10)).plot(kind='bar', figsize=(15,5))
plt.show()

# Now Applying Support Vector Regression on the dataset

In [None]:
from sklearn.svm import SVR

In [None]:
svr_model = SVR(kernel='rbf')
svr_model.fit(X_train, y_train)
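
An RBF kernel is based on distances between samples, so features measured on larger scales can dominate it. A common precaution, which the notebook does not apply, is to standardize the features before the SVR. The following is a hedged sketch on synthetic data; whether scaling helps on the three selected Boston features is an assumption that would need to be checked against the test metrics.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(4, 9, size=200),      # small-scale feature
    rng.uniform(200, 700, size=200),  # large-scale feature
])
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# The scaler is fitted inside the pipeline, so it only ever sees the data passed to fit().
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
scaled_svr.fit(X_demo, y_demo)

print(scaled_svr.score(X_demo, y_demo))  # R2 on the data it was fitted on
```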

# Predicting On Training Dataset

In [None]:
y_train_svr_predicted = svr_model.predict(X_train)

# Predicting On Test Dataset

In [None]:
y_test_svr_predicted = svr_model.predict(X_test)

# Model Evaluation

# Evaluating the model on training dataset

In [None]:
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_svr_predicted))
r2_train = r2_score(y_train, y_train_svr_predicted)

In [None]:
print("The svr model performance for the training set")
print("--------------------------------------------")
print("RMSE of training set is {}".format(rmse_train))
print("R2 Score of training set is {}".format(r2_train))

# Evaluating the model on test dataset

In [None]:
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_svr_predicted))
r2_test = r2_score(y_test, y_test_svr_predicted)

In [None]:
print("The svr model performance for the test set")
print("--------------------------------------------")
print("RMSE of test set is {}".format(rmse_test))
print("R2 Score of test set is {}".format(r2_test))

# Now Comparing the Test and Predicted values of target variable

In [None]:
svr_compare = pd.DataFrame({'Actual': y_test, 'Predicted': y_test_svr_predicted})
print(svr_compare)
(svr_compare.head(10)).plot(kind='bar', figsize=(15,5))
plt.show()