Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows (taken from the UCI Machine Learning Repository):

* CRIM: per capita crime rate by town
* ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS: proportion of non-retail business acres per town
* CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* NOX: nitric oxides concentration (parts per 10 million)
* RM: average number of rooms per dwelling
* AGE: proportion of owner-occupied units built prior to 1940
* DIS: weighted distances to five Boston employment centers
* RAD: index of accessibility to radial highways
* TAX: full-value property-tax rate per $10,000
* PTRATIO: pupil-teacher ratio by town 
* B: 1000(Bk − 0.63)^2, where Bk is the proportion of Black residents by town
* LSTAT: % lower status of the population
* MEDV: median value of owner-occupied homes in thousands of dollars
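
The B attribute is a quadratic transform, so it is not monotonic in Bk. As a small worked illustration (the helper name `b_feature` is hypothetical and not part of the dataset documentation), the formula can be evaluated for a few proportions:

```python
def b_feature(bk: float) -> float:
    """Evaluate B = 1000 * (Bk - 0.63)^2 for a proportion Bk in [0, 1]."""
    return 1000 * (bk - 0.63) ** 2

# B is zero at Bk = 0.63 and grows on both sides of that value
for bk in (0.0, 0.63, 1.0):
    print(bk, round(b_feature(bk), 1))
# prints: 0.0 396.9, then 0.63 0.0, then 1.0 136.9
```

Because the curve is symmetric around 0.63, two different proportions can map to the same value of B, which makes this column harder to interpret than the others.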


**Here "MEDV" is the target variable (the value we want to predict).**

CHAS is a categorical (binary) variable; the rest are numerical variables.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

In [None]:
#Importing Dataset

column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df = pd.read_csv('/kaggle/input/boston-house-prices/housing.csv', header=None, delimiter=r"\s+", names= column_names)
df.head()

In [None]:
df.info()    # Checking Data types and missing values

So there are no object-typed columns and no missing values.

In [None]:
# Another (more commonly used) way to check for missing values
df.isnull().sum()

Let's look at summary statistics for the numerical columns of the dataset (CHAS is excluded because it is binary).

In [None]:
df.drop("CHAS", axis = 1).describe().transpose()

In [None]:
# Correlation Matrix
plt.figure(figsize= (12, 8))
sns.heatmap(df.drop("CHAS", axis = 1).corr(), annot = True)

1. "DIS" is strongly (negatively) correlated with "INDUS", "NOX" and "AGE".

2. "TAX" is strongly correlated with "RAD", "INDUS" and "NOX".

3. "MEDV" has a high positive correlation with "RM", the average number of rooms.

4. "MEDV" has a high negative correlation with "LSTAT".

5. Highly correlated features need attention when using linear regression, because multicollinearity makes the coefficient estimates unstable.
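
One common way to quantify the multicollinearity noted above is the variance inflation factor (VIF). The sketch below is not part of the original analysis: the helper `vif_table` is hypothetical, the data is synthetic, and each VIF is computed as the matching diagonal entry of the inverse correlation matrix. A VIF above roughly 5 to 10 is often read as a warning sign, although that threshold is only a rule of thumb.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.Series:
    """Return the VIF of each column: the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="VIF").sort_values(ascending=False)

# Synthetic toy data (an assumption for illustration): "b" is almost a copy of "a"
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),  # nearly collinear with "a"
    "c": rng.normal(size=500),                 # independent of the others
})

vif = vif_table(toy)
print(vif.round(1))  # "a" and "b" get large VIFs, "c" stays close to 1
```

In this notebook the same helper could be applied to `df.drop(["MEDV", "CHAS"], axis=1)` to see which features overlap the most.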


In [None]:
# Correlation of each feature with the target (MEDV)
plt.figure(figsize= (10, 6))
correlation = df.drop("CHAS", axis = 1).corr().iloc[0:12,-1]
correlation.plot(kind = "bar")

In [None]:
# Univariate analysis of MEDV
plt.figure(figsize= (8, 6))
sns.histplot(df["MEDV"], kde=True)   # distplot is deprecated in recent seaborn releases

The histogram shows that the target is right-skewed.
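
Skewness can also be checked numerically rather than only by eye. The following sketch uses synthetic, log-normally distributed data (an assumption, not the housing data) to show how `Series.skew()` reports right skew and how a log transform can reduce it. Transforming the target is an option this notebook does not use.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a target such as MEDV
rng = np.random.default_rng(0)
toy_target = pd.Series(rng.lognormal(mean=3.0, sigma=0.5, size=1000), name="toy_target")

skew_raw = toy_target.skew()            # positive value means a longer right tail
skew_log = np.log1p(toy_target).skew()  # skewness after a log(1 + x) transform

print(f"raw skew: {skew_raw:.2f}, log-transformed skew: {skew_log:.2f}")
```

On the real data the equivalent call would be `df["MEDV"].skew()`. If a model were trained on the log-transformed target, its predictions would need to be mapped back with `np.expm1`.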

In [None]:
plt.figure(figsize= (8, 6))
sns.scatterplot(x = df["LSTAT"], y= df["MEDV"])

In [None]:
plt.figure(figsize= (8, 6))
sns.scatterplot(x = df["RM"], y= df["MEDV"])

In [None]:
plt.figure(figsize= (8, 6))
sns.regplot(x = df["TAX"], y= df["RAD"])

In [None]:
plt.figure(figsize= (8, 6))
sns.regplot(x = df["INDUS"], y= df["NOX"])

In [None]:
df["CHAS"].value_counts().plot(kind = "bar")

This shows that CHAS is imbalanced: most tracts do not bound the river (CHAS = 0).

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
X = df.drop("MEDV", axis = 1)
y = df["MEDV"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size = 0.3, random_state = 42)

In [None]:
y_train.hist()

In [None]:
y_test.hist()

y_train and y_test appear to have similar distributions.
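
The visual comparison can be backed up by putting the summary statistics of both splits side by side. This is a minimal sketch on synthetic data (the toy values are assumptions), using the same `test_size` and `random_state` as the split above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y with 506 rows
rng = np.random.default_rng(42)
toy_X = pd.DataFrame({"x": rng.normal(size=506)})
toy_y = pd.Series(rng.normal(loc=22.0, scale=9.0, size=506), name="toy_y")

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42
)

# One column per split: count, mean, std, min, quartiles, max
summary = pd.DataFrame({"train": toy_y_train.describe(), "test": toy_y_test.describe()})
print(summary.round(2))
```

With the notebook's own variables, the same table comes from `pd.DataFrame({"train": y_train.describe(), "test": y_test.describe()})`. Similar means, spreads and quartiles support the reading of the histograms.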

# Scaling The dataset

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(X_train)   # fit on the training features only; StandardScaler ignores y

In [None]:
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Multiple Linear Regression

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(X_train_scaled, y_train)

In [None]:
y_pred_lr = lr.predict(X_test_scaled)

In [None]:
rmse = mean_squared_error(y_test, y_pred_lr)**(1/2)
rmse

In [None]:
lr.score(X_test_scaled, y_test)

In [None]:
plt.figure(figsize= (10, 6))
sns.regplot(x = y_test, y = y_pred_lr)   # recent seaborn versions require keyword arguments
plt.xlim([0, 60])

Let's try a Random Forest Regressor.

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf = RandomForestRegressor(random_state = 42)

In [None]:
rf.fit(X_train_scaled, y_train)

In [None]:
y_pred_rf = rf.predict(X_test_scaled)

In [None]:
rmse = mean_squared_error(y_test, y_pred_rf)**0.5
rmse

In [None]:
r2_score(y_test, y_pred_rf)

In [None]:
plt.figure(figsize= (10, 6))
sns.regplot(x = y_test, y = y_pred_rf)   # recent seaborn versions require keyword arguments
plt.xlim([0, 60])

The score has improved considerably, even though we have not tuned any hyperparameters yet. Let's see how the model performs after hyperparameter tuning with a randomized search.

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# Number of features to consider at every split
max_features = [1.0, 'sqrt']   # 'auto' was removed in recent scikit-learn; 1.0 (all features) is its equivalent for regressors

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

print(random_grid)
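
To put the size of this grid in perspective, the number of possible combinations can be counted and compared with the 100 candidates (`n_iter = 100`) and 3 folds (`cv = 3`) used in the randomized search. The sketch below only does the arithmetic. The dictionary `option_counts` is a hypothetical helper that records how many values each list in the grid holds.

```python
from math import prod

# Number of candidate values per hyperparameter in the grid
option_counts = {
    "n_estimators": 10,
    "max_features": 2,
    "max_depth": 12,          # 11 depths plus None
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "bootstrap": 2,
}

total_combinations = prod(option_counts.values())
n_iter, cv = 100, 3                      # settings of the randomized search
total_fits = n_iter * cv                 # one model fit per candidate per fold
share_sampled = n_iter / total_combinations

print(total_combinations)                # 4320
print(total_fits)                        # 300
print(f"{share_sampled:.1%}")            # 2.3%
```

An exhaustive grid search with the same 3 folds would need 12,960 fits, so the randomized search trades completeness for a much shorter run time.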

In [None]:
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)

# Fit the random search model
rf_random.fit(X_train_scaled, y_train)

In [None]:
# Extract best hyperparameters from 'rf_random'

best_hyperparams = rf_random.best_params_
print('Best hyperparameters:\n', best_hyperparams)

In [None]:
# Extract best model from 'rf_random'
best_model = rf_random.best_estimator_

# Predict the target values for the test set
y_pred_rf = best_model.predict(X_test_scaled)

# Evaluate the test set RMSE
rmse_test = mean_squared_error(y_test, y_pred_rf)**(1/2)

# Print the test set RMSE
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))

In [None]:
r2_score(y_test, y_pred_rf)