# US House Price Prediction Using Linear Regression

## Linear regression

In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the "lack of fit" in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares loss function as in ridge regression ($L_2$-norm penalty) and lasso ($L_1$-norm penalty). Conversely, the least squares approach can be used to fit models that are not linear models. Thus, although the terms "least squares" and "linear model" are closely linked, they are not synonymous.
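
For reference, the fitting criteria named above can be written out explicitly. The notation is an assumption introduced here for illustration: $n$ observations $(x_i, y_i)$, intercept $\beta_0$, coefficient vector $\beta$, and a penalty weight $\lambda \ge 0$.

$$
\begin{aligned}
\text{Least squares:} \quad & \min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 \\
\text{Least absolute deviations:} \quad & \min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \left| y_i - \beta_0 - x_i^\top \beta \right| \\
\text{Ridge:} \quad & \min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \lVert \beta \rVert_2^2 \\
\text{Lasso:} \quad & \min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \lVert \beta \rVert_1
\end{aligned}
$$

Setting $\lambda = 0$ in the ridge or lasso criterion recovers ordinary least squares.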

### Import packages and dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [None]:
df = pd.read_csv("../input/usa-housing/USA_Housing.csv")
df.head()

### Check basic info on the data set

**'info()' method to check the data types and the number of non-null entries in each column**

In [None]:
df.info(verbose=True)

**'describe()' method to get the statistical summary of the various features of the data set**

In [None]:
df.describe(percentiles=[0.1,0.25,0.5,0.75,0.9])

**'columns' attribute to get the names of the columns (features)**

In [None]:
df.columns

### Basic plotting and visualization on the data set

**Pairplots using seaborn**

In [None]:
sns.pairplot(df)

**Distribution of price (the predicted quantity)**

In [None]:
df['Price'].plot.hist(bins=25,figsize=(8,4))

In [None]:
df['Price'].plot.density()

**Correlation matrix and heatmap**

In [None]:
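df.select_dtypes(include="number").corr() # Numeric columns only; the string column Address cannot be correlated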

In [None]:
plt.figure(figsize=(10,7))
sns.heatmap(df.select_dtypes(include="number").corr(),annot=True,linewidths=2) # Numeric columns only
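
The correlation matrix can also be reduced to a single ranked column showing how strongly each feature correlates with the target. The sketch below uses a small synthetic frame rather than the real data set; the column names `income` and `rooms` and all numeric values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data: two numeric features,
# a target column, and a text column that must be excluded.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "income": rng.normal(70_000, 10_000, 200),
    "rooms": rng.normal(7, 1, 200),
})
demo["Price"] = (20 * demo["income"] + 100_000 * demo["rooms"]
                 + rng.normal(0, 50_000, 200))
demo["Address"] = "n/a"

# Keep numeric columns only, then rank features by correlation with Price.
numeric = demo.select_dtypes(include="number")
corr_with_price = (numeric.corr()["Price"]
                   .drop("Price")
                   .sort_values(ascending=False))
print(corr_with_price)
```

A ranking like this is only a first look: it captures pairwise linear association and says nothing about how features interact once they are in the model together.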

### Feature and variable sets

**Make a list of data frame column names**

In [None]:
l_column = list(df.columns) # Making a list out of column names
len_feature = len(l_column) # Length of column vector list
l_column

**Put all the numerical features in X and Price in y; ignore Address, which is a string and cannot be used directly in linear regression**

In [None]:
X = df[l_column[0:len_feature-2]]
y = df[l_column[len_feature-2]]
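
The positional slicing above assumes that Price is the second-to-last column and Address the last. A minimal sketch of a name-based alternative follows, using a tiny hypothetical frame; `feature_a` and `feature_b` are placeholder names, not columns from the data set.

```python
import pandas as pd

# Hypothetical miniature frame; only 'Price' and 'Address' mirror the real column names.
demo = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [10.0, 20.0, 30.0],
    "Price": [100.0, 200.0, 300.0],
    "Address": ["x", "y", "z"],
})

# Select by name so the result does not depend on column order.
X_demo = demo.drop(columns=["Price", "Address"])
y_demo = demo["Price"]

print(list(X_demo.columns))
print(y_demo.name)
```

Selecting by name keeps working if columns are later reordered or new ones are appended.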

In [None]:
print("Feature set size:",X.shape)
print("Target variable set size:",y.shape)

In [None]:
X.head()

In [None]:
y.head()

### Test-train split

**Import train_test_split function from scikit-learn**

In [None]:
from sklearn.model_selection import train_test_split

**Create X and y train and test splits in one command using a split ratio and a random seed**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

**Check the size and shape of the train/test splits (the sizes should match the ratio set by the test_size parameter above)**

In [None]:
print("Training feature set size:",X_train.shape)
print("Test feature set size:",X_test.shape)
print("Training target variable set size:",y_train.shape)
print("Test target variable set size:",y_test.shape)
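
The ratio check can also be done numerically. This sketch uses made-up arrays with 1000 rows (an assumption, not the real data) and the same `test_size` and `random_state` as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 1000 samples with 2 features each.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123
)

# Fraction of rows that ended up in the test split.
test_fraction = len(X_te) / len(X_demo)
print(X_tr.shape, X_te.shape, test_fraction)
```

Fixing `random_state` makes the split reproducible, so repeated runs compare models on the same held-out rows.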

### Model fit and training

**Import linear regression model estimator from scikit-learn and instantiate**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn import metrics

In [None]:
lm = LinearRegression() # Creating a Linear Regression object 'lm'

**Fit the model in place on the instantiated object**

In [None]:
lm.fit(X_train,y_train) # fit() trains the 'lm' object in place, so there is no need to assign the result to another variable

**Check the intercept and coefficients and put them in a DataFrame**

In [None]:
print("The intercept term of the linear model:", lm.intercept_)

In [None]:
print("The coefficients of the linear model:", lm.coef_)

In [None]:
#idict = {'Coefficients':lm.intercept_}
#idf = pd.DataFrame(data=idict,index=['Intercept'])
cdf = pd.DataFrame(data=lm.coef_, index=X_train.columns, columns=["Coefficients"])
#cdf=pd.concat([idf,cdf], axis=0)
cdf
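
The commented-out lines above hint at putting the intercept and the coefficients in one table. One possible way to do that is sketched below on synthetic data with known true values (intercept 5, coefficients 2 and -3); the feature names and the data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known linear model plus small noise.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 2)),
                      columns=["feature_a", "feature_b"])
y_demo = (5.0 + 2.0 * X_demo["feature_a"] - 3.0 * X_demo["feature_b"]
          + rng.normal(0, 0.1, 200))

model = LinearRegression().fit(X_demo, y_demo)

# Stack the intercept row on top of the per-feature coefficient rows.
summary = pd.concat([
    pd.DataFrame({"Coefficients": [model.intercept_]}, index=["Intercept"]),
    pd.DataFrame({"Coefficients": model.coef_}, index=X_demo.columns),
])
print(summary)
```

Each coefficient is the change in the predicted target for a one-unit increase in that feature with the other features held fixed, so magnitudes are only comparable across features measured on similar scales.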

### Prediction, error estimate, and regression evaluation metrics

**Prediction using the lm model**

In [None]:
predictions = lm.predict(X_test)
print("Type of the predicted object:", type(predictions))
print("Size of the predicted object:", predictions.shape)

**Scatter plot of predicted prices against y_test to see whether the points fall along a 45-degree straight line**

In [None]:
plt.figure(figsize=(10,7))
plt.title("Actual vs. predicted house prices",fontsize=25)
plt.xlabel("Actual test set house prices",fontsize=18)
plt.ylabel("Predicted house prices", fontsize=18)
plt.scatter(x=y_test,y=predictions)

**Histogram of the residuals, i.e. the prediction errors (expect an approximately normal distribution)**

In [None]:
plt.figure(figsize=(10,7))
plt.title("Histogram of residuals to check for normality",fontsize=25)
plt.xlabel("Residuals",fontsize=18)
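plt.ylabel("Density", fontsize=18)
# histplot with kde=True replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(y_test - predictions, kde=True, stat="density")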
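
A histogram gives a visual impression of normality; a numeric companion is the correlation from a normal Q-Q (probability) plot. The sketch below is one possible check, run on synthetic data with Gaussian noise (an assumption) and using `scipy.stats.probplot`.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Synthetic regression problem with normally distributed noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, 500)

model = LinearRegression().fit(X_demo, y_demo)
residuals = y_demo - model.predict(X_demo)

# probplot returns the theoretical and ordered sample quantiles,
# plus a least-squares line fit through them with its correlation r.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print("Q-Q correlation:", round(r, 4))
```

A correlation close to 1 means the ordered residuals line up with normal quantiles; clearly lower values suggest skew or heavy tails.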

**Scatter plot of residuals against predicted values (to check homoscedasticity)**

In [None]:
plt.figure(figsize=(10,7))
plt.title("Residuals vs. predicted values plot (Homoscedasticity)\n",fontsize=25)
plt.xlabel("Predicted house prices",fontsize=18)
plt.ylabel("Residuals", fontsize=18)
plt.scatter(x=predictions,y=y_test-predictions)

**Regression evaluation metrics**

In [None]:
print("Mean absolute error (MAE):", metrics.mean_absolute_error(y_test,predictions))
print("Mean square error (MSE):", metrics.mean_squared_error(y_test,predictions))
print("Root mean square error (RMSE):", np.sqrt(metrics.mean_squared_error(y_test,predictions)))
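
To make the three error metrics concrete, they can be computed directly from their definitions and compared with the scikit-learn results. The four values below are made up for illustration.

```python
import numpy as np
from sklearn import metrics

# Toy values, not taken from the housing data.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred            # 0.5, 0.0, -1.5, -1.0

mae = np.mean(np.abs(errors))       # mean of absolute errors
mse = np.mean(errors ** 2)          # mean of squared errors
rmse = np.sqrt(mse)                 # square root of MSE, in the target's units

print(mae, metrics.mean_absolute_error(y_true, y_pred))
print(mse, metrics.mean_squared_error(y_true, y_pred))
print(rmse)
```

MAE and RMSE are in the same units as the target (here, price), while MSE is in squared units; RMSE weights large errors more heavily than MAE does.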

**R-squared value**

In [None]:
print("R-squared value of predictions:",round(metrics.r2_score(y_test,predictions),3))
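
The R-squared value is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values. A small sketch with made-up numbers, checked against `metrics.r2_score`:

```python
import numpy as np
from sklearn import metrics

# Toy values, not taken from the housing data.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(round(r2, 3), round(metrics.r2_score(y_true, y_pred), 3))
```

A value of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values are possible on test data when the model does worse than that baseline.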