# Practice Exercise - Linear Regression

### Problem Statement

The goal is to predict housing prices for a town or suburb from the features of its locality. Along the way, we identify the most important features in the dataset, apply data preprocessing techniques, and build a linear regression model that predicts the prices.

### Data Information

Each record in the dataset describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. Attribute details are listed below.

Attribute Information (in order):
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NX: nitric oxides concentration (parts per 10 million); named NOX in the original dataset
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centers
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per 10,000 dollars
- PTRATIO: pupil-teacher ratio by town
- LSTAT: percentage of lower-status population
- MEDV: median value of owner-occupied homes, in thousands of dollars

### Import Necessary Libraries

In [3]:
%load_ext nb_black
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error




### Load the dataset

In [6]:
df = pd.read_csv("boston.csv")
df.sample(10)

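| | CRIM | ZN | INDUS | CHAS | NX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 505 | 0.04741 | 0.0 | 11.93 | 0 | 0.573 | 6.03 | 80.8 | 2.505 | 1 | 273 | 21.0 | 7.88 | 11.9 |
| 275 | 0.09604 | 40.0 | 6.41 | 0 | 0.447 | 6.854 | 42.8 | 4.2673 | 4 | 254 | 17.6 | 2.98 | 32.0 |
| 353 | 0.01709 | 90.0 | 2.02 | 0 | 0.41 | 6.728 | 36.1 | 12.1265 | 5 | 187 | 17.0 | 4.5 | 30.1 |
| 127 | 0.25915 | 0.0 | 21.89 | 0 | 0.624 | 5.693 | 96.0 | 1.7883 | 4 | 437 | 21.2 | 17.19 | 16.2 |
| 69 | 0.12816 | 12.5 | 6.07 | 0 | 0.409 | 5.885 | 33.0 | 6.498 | 4 | 345 | 18.9 | 8.79 | 20.9 |
| 94 | 0.04294 | 28.0 | 15.04 | 0 | 0.464 | 6.249 | 77.3 | 3.615 | 4 | 270 | 18.2 | 10.59 | 20.6 |
| 19 | 0.7258 | 0.0 | 8.14 | 0 | 0.538 | 5.727 | 69.5 | 3.7965 | 4 | 307 | 21.0 | 11.28 | 18.2 |
| 214 | 0.28955 | 0.0 | 10.59 | 0 | 0.489 | 5.412 | 9.8 | 3.5875 | 4 | 277 | 18.6 | 29.55 | 23.7 |
| 59 | 0.10328 | 25.0 | 5.13 | 0 | 0.453 | 5.927 | 47.2 | 6.932 | 8 | 284 | 19.7 | 9.22 | 19.6 |
| 259 | 0.65665 | 20.0 | 3.97 | 0 | 0.647 | 6.842 | 100.0 | 2.0107 | 5 | 264 | 13.0 | 6.9 | 30.1 |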



### Check the shape of the dataset

In [8]:
df.shape

(506, 13)


### Get the info regarding column datatypes

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NX       506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  LSTAT    506 non-null    float64
 12  MEDV     506 non-null    float64
dtypes: float64(10), int64(3)
memory usage: 51.5 KB



### Get summary statistics for the numerical columns

In [10]:
df.describe()

| | CRIM | ZN | INDUS | CHAS | NX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 11.36 | 21.2 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 37.97 | 50.0 |



### Exploratory Data Analysis

**Plot the distribution plots for all the numerical features and list your observations.**

**Plot the scatterplots for features and the target variable `MEDV` and list your observations.**
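
A minimal sketch of this step, again on synthetic data standing in for `df`. The helper `plot_against_target` is a hypothetical name, and the linear relationship used to generate the demo target is invented purely so the plots show a visible trend.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


def plot_against_target(data, target="MEDV", n_cols=4):
    """Draw one scatterplot per feature with the target on the y-axis."""
    features = [col for col in data.columns if col != target]
    n_rows = -(-len(features) // n_cols)  # ceiling division
    fig, axes = plt.subplots(
        n_rows, n_cols, figsize=(3.5 * n_cols, 3 * n_rows), sharey=True, squeeze=False
    )
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, features):
        sns.scatterplot(data=data, x=col, y=target, ax=ax, s=15)
    # Hide any unused panels in the grid.
    for ax in flat_axes[len(features):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


# Synthetic stand-in for `df` (assumption: the formula for MEDV is made up).
rng = np.random.default_rng(1)
rooms = rng.normal(6.3, 0.7, 150)
lstat = rng.uniform(2, 35, 150)
demo = pd.DataFrame({
    "RM": rooms,
    "LSTAT": lstat,
    "MEDV": 9 * rooms - 0.5 * lstat - 25 + rng.normal(0, 2, 150),
})
fig = plot_against_target(demo)
```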

**Plot the correlation heatmap and list your observations.**
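
The heatmap could be drawn along the following lines. The frame below is synthetic (an assumption made so the block runs on its own); with the real data, `demo` would be replaced by `df`.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for `df` (assumption: the formula for MEDV is made up).
rng = np.random.default_rng(2)
rooms = rng.normal(6.3, 0.7, 200)
lstat = rng.gamma(3.0, 4.0, 200)
demo = pd.DataFrame({
    "RM": rooms,
    "LSTAT": lstat,
    "MEDV": 9 * rooms - 0.5 * lstat + rng.normal(0, 2, 200),
})

# Pairwise Pearson correlations between all numeric columns.
corr = demo.corr()

fig, ax = plt.subplots(figsize=(6, 5))
# Fix the colour scale to [-1, 1] so positive and negative values are comparable.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap (synthetic data)")
fig.tight_layout()

# Features ranked by the strength of their correlation with the target.
ranking = corr["MEDV"].drop("MEDV").abs().sort_values(ascending=False)
print(ranking)
```

The ranking at the end is optional, but it gives a quick first look at which features move most closely with `MEDV`.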

### Split the dataset

Separate the independent variables (the features) from the dependent variable (`MEDV`), then split the data 70:30 into train and test sets.
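
A sketch of the split using `train_test_split`, which the notebook already imports. The data is a synthetic stand-in for `df`, and `random_state=1` is an arbitrary choice made only for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for `df` (assumption: random values for illustration only).
rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, 200),
    "LSTAT": rng.uniform(2, 35, 200),
    "MEDV": rng.normal(22.5, 9.0, 200),
})

# Independent variables (features) and dependent variable (target).
X = demo.drop(columns="MEDV")
y = demo["MEDV"]

# test_size=0.30 gives the 70:30 train/test ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```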

### Model Building

**Fit the model to the training set**
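
Fitting might look like the sketch below. The features and the linear formula that generates the target are invented for the demo; on the real data, `X_train` and `y_train` would come from the split of `df`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic features and target (assumption: made-up linear relationship plus noise).
rng = np.random.default_rng(4)
X = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, 200),
    "LSTAT": rng.uniform(2, 35, 200),
})
y = 9 * X["RM"] - 0.5 * X["LSTAT"] - 25 + rng.normal(0, 2, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Ordinary least squares; an intercept is fitted by default.
model = LinearRegression()
model.fit(X_train, y_train)
print(model.intercept_, model.coef_)
```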

**Get the score on the training set.**
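
For a `LinearRegression` model, `score` returns the R-squared of the predictions. A short sketch, using `make_regression` to generate placeholder data (an assumption; the real notebook would reuse the split from `df`):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Placeholder regression data (assumption: not the Boston dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
model = LinearRegression().fit(X_train, y_train)

# R-squared on the same rows the model was fitted on.
train_r2 = model.score(X_train, y_train)
print(f"Training R-squared: {train_r2:.3f}")
```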

**Write your own function for the R-squared score.**
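
R-squared is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the true values from their mean. A minimal NumPy version follows; the function name `r_squared` and the toy arrays are illustrative.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot


# Toy values (assumption: made up to keep the arithmetic easy to check).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])
print(r_squared(y_true, y_pred))  # SS_res = 0.5, SS_tot = 20, so 1 - 0.025 = 0.975
```

Predicting the mean for every row gives a score of 0, and a model that does worse than the mean gives a negative score, which is why R-squared on unseen data is not bounded below by 0.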

**Get the score on the test set.**
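
The test score can be obtained either from `model.score` or by passing predictions to `r2_score`; both compute the same quantity. The data below is placeholder data from `make_regression` (an assumption, so the block runs standalone).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Placeholder regression data (assumption: not the Boston dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
model = LinearRegression().fit(X_train, y_train)

# R-squared on rows the model has not seen during fitting.
test_r2 = model.score(X_test, y_test)

# Equivalent route: predict first, then score the predictions.
same_r2 = r2_score(y_test, model.predict(X_test))
print(f"Test R-squared: {test_r2:.3f}")
```

A test score far below the training score would suggest overfitting; similar scores suggest the model generalises about as well as it fits.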

**Get the RMSE on the test set.**
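
RMSE is the square root of the mean squared error. The sketch below takes the root with NumPy, which avoids depending on any particular scikit-learn version's RMSE options. The arrays are toy values, not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values (assumption: made up to keep the arithmetic easy to check).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

mse = mean_squared_error(y_true, y_pred)  # (0.25 + 0.25 + 0 + 0) / 4 = 0.125
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.4f}")
```

With a fitted model, the same two lines would take the test targets and the model's predictions on the test features. RMSE is in the units of the target, so for `MEDV` it reads as thousands of dollars.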

**Get the model coefficients.**
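
After fitting, the slopes are stored in `coef_` and the constant term in `intercept_`. One way to present them is a small table indexed by feature name, sketched here on noise-free synthetic data so that the recovered numbers are easy to verify.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features (assumption: made-up values, not the Boston data).
rng = np.random.default_rng(5)
X = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, 100),
    "LSTAT": rng.uniform(2, 35, 100),
})
# Noise-free target, so the fit should recover 2, -3 and 5.
y = 2.0 * X["RM"] - 3.0 * X["LSTAT"] + 5.0

model = LinearRegression().fit(X, y)

# One row per feature, plus a final row for the intercept.
coef_table = pd.DataFrame(
    {"Coefficient": np.append(model.coef_, model.intercept_)},
    index=list(X.columns) + ["Intercept"],
)
print(coef_table)
```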

**Automate the equation of the fit**
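
The equation can be assembled as a string from the intercept, the coefficients, and the feature names. The helper `build_equation` below is a hypothetical name, and the numbers passed to it are placeholders rather than fitted values.

```python
def build_equation(intercept, coefs, names, target="MEDV"):
    """Return the fitted line as text: target = intercept + sum(coef * feature)."""
    terms = [f"({coef:.4f} * {name})" for coef, name in zip(coefs, names)]
    return f"{target} = {intercept:.4f} + " + " + ".join(terms)


# Placeholder numbers (assumption: not taken from a real fit).
equation = build_equation(5.0, [2.0, -3.0], ["RM", "LSTAT"])
print(equation)  # MEDV = 5.0000 + (2.0000 * RM) + (-3.0000 * LSTAT)
```

With a fitted model, the arguments would be its `intercept_`, its `coef_`, and the column names of the training features, in the same order the model saw them.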