**Importing standard libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
%matplotlib inline

**Importing the dataset from sklearn module**

In [None]:
from sklearn import datasets

In [None]:
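#Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version
boston = datasets.load_boston()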

In [None]:
boston.keys()

In [None]:
boston.feature_names

**Variables in the dataset**
1. CRIM: Per capita crime rate by town
2. ZN: Proportion of residential land zoned for lots over 25,000 sq. ft
3. INDUS: Proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: Nitric oxide concentration (parts per 10 million)
6. RM: Average number of rooms per dwelling
7. AGE: Proportion of owner-occupied units built prior to 1940
8. DIS: Weighted distances to five Boston employment centers
9. RAD: Index of accessibility to radial highways
10. TAX: Full-value property tax rate per USD 10,000 
11. PTRATIO: Pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)², where Bk is the proportion of people of African American descent by town
13. LSTAT: Percentage of lower status of the population
14. MEDV: Median value of owner-occupied homes in USD 1000s

In [None]:
boston.data.shape

**Converting dataset to a dataframe**

In [None]:
df = pd.DataFrame(boston.data, columns = boston.feature_names)

In [None]:
df['MEDV'] = boston.target #Adding the price column also to the original df
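The two cells above wrap a NumPy feature matrix in a DataFrame and then append the target as the last column. A minimal sketch of the same pattern on a made-up 3x2 array; the names `features`, `target` and `toy_df` and all values are illustrative assumptions, not taken from the Boston data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for boston.data, boston.feature_names and boston.target
features = np.array([[0.1, 6.5], [0.3, 5.9], [0.2, 7.1]])
feature_names = ["CRIM", "RM"]
target = np.array([24.0, 21.6, 34.7])

toy_df = pd.DataFrame(features, columns=feature_names)
toy_df["MEDV"] = target  # the target becomes the last column

print(toy_df.shape)
print(list(toy_df.columns))
```

Keeping the target last is what later allows `iloc[:, :-1]` and `iloc[:, -1]` to separate features from the target.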

**Exploratory Data Analysis**

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
#sns.pairplot(df)

In [None]:
#Looking at the distribution of the target variable
#(histplot replaces distplot, which is deprecated in recent seaborn versions)
sns.histplot(df['MEDV'],bins=30,kde=True)

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(df.corr(),cmap='coolwarm',annot=True)
plt.xticks(rotation=0)
plt.show()

In [None]:
#clustermap creates its own figure, so the size is passed to it directly
sns.clustermap(df.corr(),cmap='coolwarm',annot=True,figsize=(12,10))
plt.show()

**Key Correlation Takeaways**
1. Negative correlation between LSTAT and MEDV, i.e. the higher the percentage of lower-status population, the lower the median home value
2. Positive correlation between NOX and INDUS, i.e. nitric oxide concentration rises with the proportion of non-retail business acres
3. Strong positive correlation between TAX and RAD, i.e. towns with better access to radial highways tend to have higher property tax rates
4. Positive correlation between RM and MEDV, i.e. the median home value rises with the average number of rooms
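The values in the heatmap are Pearson correlation coefficients, which is what `df.corr()` computes by default. A minimal sketch of that calculation on synthetic data (not the Boston data); the helper `pearson_r` and the generated arrays are assumptions made for illustration:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Synthetic data: y falls as x rises, plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 40, size=200)
y = 35 - 0.9 * x + rng.normal(0, 4, size=200)

r = pearson_r(x, y)
print(round(r, 3))  # negative, because y decreases with x
```

A coefficient near -1 or +1 indicates a strong linear relationship, but it says nothing about which variable drives the other.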

In [None]:
#Distribution of the dependent variable
plt.figure(figsize=(4,6))
sns.boxplot(y='MEDV',data=df)

In [None]:
#Visualising simple regression output of LSTAT v/s MEDV (based on correlation inference)  
sns.lmplot(x='LSTAT',y='MEDV',data=df)

In [None]:
#Visualising simple regression output of RM v/s MEDV (based on correlation inference)  
sns.lmplot(x='RM',y='MEDV',data=df)
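By default, the line drawn by `lmplot` is an ordinary least squares fit with a single feature. A minimal sketch of the closed-form solution behind such a line, using made-up points that lie exactly on y = 2 + 3x so the expected result is known:

```python
import numpy as np

# Synthetic points on the line y = 2 + 3x (not the Boston data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x

# Least squares with one feature:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
intercept = y.mean() - slope * x.mean()

print(slope, intercept)
```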

**Regression Analysis**

In [None]:
#Separating the features (X) and the target (y)
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

In [None]:
#Splitting X and y in train and test observations 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=100)

In [None]:
#Checking the shape of the new split datasets
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
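The shapes printed above follow from simple arithmetic on the 506 rows of the dataset. A rough sketch of an 80/20 shuffle split with NumPy; it assumes the test count is rounded up, and it does not reproduce the exact rows that `train_test_split` selects with `random_state=100`:

```python
import math
import numpy as np

n_samples = 506   # number of rows in the Boston data
test_size = 0.2

# Assumption: the test count is rounded up and the training set takes the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Shuffle the row indices reproducibly, then cut them into two disjoint groups
rng = np.random.default_rng(100)
indices = rng.permutation(n_samples)
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(n_train, n_test)
```

Fixing the seed keeps the split reproducible, so the metrics can be compared between runs.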

In [None]:
#Creating an instance object of linear regression
from sklearn.linear_model import LinearRegression
ln = LinearRegression()

In [None]:
#Fitting the linear model based on training data
ln.fit(X=X_train,y=y_train)

In [None]:
#Using the fitted linear model to predict y for the X_test observations
y_pred = ln.predict(X=X_test)

In [None]:
intercept = ln.intercept_
coefficients = ln.coef_

print(intercept)
print(coefficients)
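The intercept and coefficients printed above are the least squares solution for the training data. A minimal sketch of the same kind of fit with plain NumPy on synthetic data whose true parameters are known; `true_coef`, `true_intercept` and the generated arrays are assumptions made for illustration:

```python
import numpy as np

# Synthetic data with known parameters (not the Boston data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([3.0, -2.0, 0.5])
true_intercept = 10.0
y = true_intercept + X @ true_coef + rng.normal(0, 0.1, size=200)

# Prepend a column of ones so the intercept is estimated along with the slopes
X_design = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coef = params[0], params[1:]

print(intercept, coef)
```

With little noise the estimates land close to the true values, which is a quick way to sanity-check a regression pipeline before applying it to real data.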

In [None]:
X_train.columns

In [None]:
coeff = pd.DataFrame(data=coefficients,index=X_train.columns,columns=['Coefficients'])

In [None]:
coeff['Description'] = ['Per capita crime rate by town', 
                        'Proportion of residential land zoned for lots over 25,000 sq. ft', 
                        'Proportion of non-retail business acres per town',
                        'Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)',
                        'Nitric oxide concentration (parts per 10 million)',
                        'Average number of rooms per dwelling',
                        'Proportion of owner-occupied units built prior to 1940',
                        'Weighted distances to five Boston employment centers',
                        'Index of accessibility to radial highways',
                        'Full-value property tax rate per USD 10,000',
                        'Pupil-teacher ratio by town',
                        '1000(Bk - 0.63)², where Bk is the proportion of people of African American descent by town',
                        'Percentage of lower status of the population']

In [None]:
#Printing the new data frame, which has the coefficient value for each feature variable along with its description
pd.options.display.max_colwidth = 100
coeff

In [None]:
#Adding the Pearson's correlation coefficients also for comparison
corr = df.corr()['MEDV'].values
coeff['Correlation'] = corr[:-1]
pd.options.display.max_colwidth = 40
coeff

**Interpretation of major coefficients:**
1. In line with the positive correlation seen earlier between RM and MEDV, the regression model also shows a positive association between the number of rooms (RM) and the median value of the house (MEDV). With all other features held constant, one additional room per dwelling is associated with an increase of 3672 dollars in the median house price (since MEDV values are given in 1000s)
2. In contrast, with all other features held constant, a one-unit increase in NOX (measured in parts per 10 million) is associated with a decline of 16136 dollars in the median house price
3. If the tract bounds the river, i.e. CHAS = 1, the median house value is higher by 3062 dollars, all else being equal
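The dollar figures above come from multiplying each coefficient by 1000, because MEDV is recorded in USD 1000s. A small worked sketch of that arithmetic; the coefficient values are the ones implied by the figures quoted in the list, and the helper `effect_in_dollars` is a hypothetical name:

```python
# Coefficients implied by the dollar figures quoted above (MEDV is in USD 1000s)
coefficients = {"RM": 3.672, "NOX": -16.136, "CHAS": 3.062}

def effect_in_dollars(coef, delta=1.0):
    """Predicted change in median value (USD) for a change of `delta` in one
    feature, with the other features held fixed."""
    return coef * delta * 1000

for name, coef in coefficients.items():
    print(name, round(effect_in_dollars(coef)))

# The same calculation for a smaller NOX change of 0.1 units
print(round(effect_in_dollars(coefficients["NOX"], delta=0.1), 1))
```

The effect scales linearly with the size of the change, so a 0.1-unit change in NOX corresponds to one tenth of the full-unit figure.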

**The scatter plots below show that the linear regression model predicts the test values moderately well. 
About 5 to 6 observations fall clearly away from the fitted line**

In [None]:
sns.set_style('darkgrid')
plt.scatter(x=y_test,y=y_pred)
plt.xlabel('y_test')
plt.ylabel('y_pred')

In [None]:
#Plotting a comparative plot of predicted y values and actual test y values 
from numpy.polynomial.polynomial import polyfit
b, m = polyfit(y_test, y_pred,1)
plt.scatter(x=y_test,y=y_pred)
plt.plot(y_test, b + m * y_test,color='red')
plt.xlabel('y_test')
plt.ylabel('y_pred')
plt.show()
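The order `b, m` in the cell above matters: `numpy.polynomial.polynomial.polyfit` returns coefficients from the lowest degree upward, whereas the older `np.polyfit` returns them from the highest degree downward. A short check on made-up points with a known intercept and slope:

```python
import numpy as np
from numpy.polynomial.polynomial import polyfit

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 5.0 + 2.0 * x                    # intercept 5, slope 2

b, m = polyfit(x, y, 1)              # lowest degree first: intercept, then slope
m_old, b_old = np.polyfit(x, y, 1)   # legacy function: slope, then intercept

print(b, m)
```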

**Let us now plot the residual values, i.e. the difference between the actual test values and the predicted test values**

The distribution is concentrated around zero and slightly below it, i.e. for most observations the predicted value is close to, or somewhat higher than, the actual value 

In [None]:
#Plotting the residuals (histplot replaces the deprecated distplot)
sns.histplot(y_test - y_pred,kde=True)
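The direction in which a residual distribution leans can also be checked numerically rather than by eye. A minimal sketch using the third standardized moment on made-up residuals; the helper `sample_skewness` and the values are assumptions, not the notebook's output:

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment: negative for a long left tail,
    positive for a long right tail."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)

# Made-up residuals (actual - predicted): mostly small, a few large and positive
residuals = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 12.0, 15.0])

print(sample_skewness(residuals))   # positive: the long tail is on the right
print(np.mean(residuals < 0))       # share of observations that were over-predicted
```

Applied to `y_test - y_pred`, the same two numbers would summarise both the tail direction and how often the model over-predicts.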

**After a basic graphical exploration of the predicted values, let us calculate some key metrics to assess the regression model's accuracy**

In [None]:
from sklearn import metrics as mt

In [None]:
RMSE = np.sqrt(mt.mean_squared_error(y_test,y_pred))
print('MAE',mt.mean_absolute_error(y_test,y_pred))
print('MSE',mt.mean_squared_error(y_test,y_pred))
print('RMSE',RMSE)
print('R squared',mt.r2_score(y_test,y_pred))
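For reference, the four metrics printed above can be written out by hand. A minimal sketch on a tiny made-up example (not the notebook's test set), chosen so the results are easy to verify:

```python
import numpy as np

# Small made-up example
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_hat
mae = np.abs(errors).mean()     # mean absolute error
mse = (errors ** 2).mean()      # mean squared error
rmse = np.sqrt(mse)             # same units as the target
# R squared: 1 minus residual sum of squares over total sum of squares
r2 = 1 - (errors ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()

print(mae, mse, rmse, r2)
```

Here the errors are 1, 0, -1 and -2, giving an MAE of 1.0, an MSE of 1.5 and an R squared of 0.7.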

In [None]:
#Revisiting the mean of the target variable to gauge the relative size of the error
print(df['MEDV'].mean()) 
pricemean = df['MEDV'].mean()
print(RMSE/pricemean * 100)
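The last line of the cell above expresses the RMSE as a percentage of the mean target value. A small sketch of that ratio with hypothetical round numbers; `relative_rmse` is an illustrative helper, not part of any library:

```python
def relative_rmse(rmse, target_mean):
    """RMSE expressed as a percentage of the mean of the target."""
    return rmse / target_mean * 100

# Hypothetical values: an RMSE of 5 (USD 5,000) against a mean of 25 (USD 25,000)
print(relative_rmse(5.0, 25.0))  # 20.0
```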

**Conclusion**
1. The final regression model has an R-squared value of 0.755 on the test data, i.e. it explains about 75.5% of the variance in the test values, which is a decent fit
2. The RMSE value of 4.859 indicates that the model's predictions of the median house value are typically off by about USD 4,859 (since the median values are given in USD 1000s)
3. The RMSE is about 21.56% of the mean of the target variable, so the model may not be as accurate as desired. A next step is to explore whether it can be improved, for example by eliminating less important features