# House Price Prediction for King County (USA) using Linear Regression Techniques

This notebook has three parts:

Part 1: Exploratory Data Analysis

Part 2: Predicting house prices with a machine learning model.
A linear regression algorithm is used for the price prediction.

Part 3: Conclusion

Dataset source: Kaggle [https://www.kaggle.com/harlfoxem/housesalesprediction]

Note: Seattle is located in King County.


### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Render plots inline in the notebook
%matplotlib inline

## Part 1 : Exploratory Data Analysis

In [None]:
# Import csv file into dataframe
df = pd.read_csv('../input/kc_house_data.csv')

In [None]:
df.head()


**Nearly all columns contain numeric data (the `date` column is stored as text), so the values need no further conversion for this analysis.
Let's check whether the dataset contains any missing data.**


In [None]:
df.info()
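
`df.info()` lists the non-null count for each column. A more direct check is to count missing values per column. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the housing data) so that it runs on its own; on the real data the same two calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "price": [221900.0, 538000.0, np.nan, 604000.0],
    "bedrooms": [3, 3, 2, 4],
    "sqft_living": [1180.0, 2570.0, 770.0, np.nan],
})

# Number of missing values in each column; all zeros means nothing is missing.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print("Any missing values:", has_missing)
```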

**The data is clean, with no missing values. Let's get a summary of the data.**

In [None]:
df.describe()

## Correlation

In [None]:
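# Correlation of each numeric column with price (the text date column is excluded)
df.select_dtypes('number').corr().loc[['price']]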

## Top five columns by correlation with house price
- **sqft_living = 0.702035**
- **grade = 0.667434**
- **sqft_above = 0.605567**
- **sqft_living15 = 0.585379**
- **bathrooms = 0.525138**
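
A ranking like the one above can be produced programmatically instead of being read off the correlation row. The following is a minimal sketch on synthetic data; the frame `toy`, the column `noise_feature`, and the generated prices are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends on sqft_living, not on noise_feature.
rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(500, 4000, n)
toy = pd.DataFrame({
    "sqft_living": sqft,
    "noise_feature": rng.normal(size=n),
    "price": 200 * sqft + rng.normal(0, 50_000, n),
})

# Correlation of every numeric column with price, strongest first.
price_corr = (
    toy.select_dtypes("number")
    .corr()["price"]
    .drop("price")              # a column always correlates 1.0 with itself
    .sort_values(ascending=False)
)
top = price_corr.head(5)
print(top)
```

Sorting the signed values puts strong negative correlations last; sorting `price_corr.abs()` instead would rank by strength regardless of direction.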



In [None]:
sns.heatmap(df[['sqft_living', 'grade', 'sqft_above', 'sqft_living15', 'bathrooms', 'price']].corr(), cmap='coolwarm')

In [None]:
df.columns

In [None]:
import warnings
warnings.filterwarnings("ignore")

**Let's use seaborn to create a jointplot comparing the sqft_living and price columns. Does this correlation make sense?**

In [None]:
sns.jointplot(x='sqft_living',y='price',data=df)

**Let's plot grade against price.**

In [None]:
sns.jointplot(x='grade',y='price',data=df)

In [None]:
sns.pairplot(df[['price', 'bedrooms', 'bathrooms', 'floors', 'waterfront', 'view']])

In [None]:
sns.pairplot(df[['price','sqft_above', 'sqft_basement', 'yr_renovated','lat', 'sqft_living15']])

In [None]:
sns.set_style('whitegrid')

**The sqft_living column has the strongest correlation with the price column (about 0.70).**

In [None]:
# Scatter plot with a first-order (straight-line) fit; x and y are passed by keyword
sns.regplot(x='sqft_living', y='price', data=df, order=1, ci=None, scatter_kws={'color':'r', 's':9})
plt.xlim(0, 13540)
plt.ylim(bottom=0);

## Training a Linear Regression Model

Let's now begin to train our regression model. We first need to split the data into an X array that contains the features to train on, and a y array with the target variable, in this case the price column.

### X and y arrays

In [None]:
# Use every column except id, date, and the target (price) as features
X = df[['bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15']]
y = df['price']

## Train Test Split

Now let's split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.

In [None]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)

## Creating and Training the Model

In [None]:

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)

predictions = lm.predict(X_test)

In [None]:
plt.scatter(y_test,predictions)

**Residual Histogram**

In [None]:
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(y_test - predictions, bins=50, kde=True);

In [None]:
print('Intercept:',lm.intercept_)

In [None]:
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df

Interpreting the coefficients:

Examples:
- Holding all other features fixed, one additional bathroom (**bathrooms**) is associated with an **increase of 36276 dollars in house price**.
- Holding all other features fixed, one additional square foot of living area (**sqft_living**) is associated with an **increase of 109 dollars in house price**.
- Holding all other features fixed, a one-level increase in **grade** is associated with an **increase of 96102 dollars in house price**.
- Holding all other features fixed, a one-unit increase in **sqft_living15** is associated with an **increase of 24 dollars in house price**.
- Holding all other features fixed, one additional square foot above ground (**sqft_above**) is associated with an **increase of 70 dollars in house price**.
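
The "holding all other features fixed" reading can be checked numerically: changing one feature by one unit changes the prediction by exactly that feature's coefficient. The sketch below is an illustration on synthetic data, and it fits the model with `np.linalg.lstsq` rather than scikit-learn so that it runs with NumPy alone. The slopes used to generate the prices (100 and 30,000) are made up and are not the coefficients of the notebook's model.

```python
import numpy as np

# Synthetic houses with known slopes (assumed values, for illustration only).
rng = np.random.default_rng(1)
n = 500
sqft = rng.uniform(500, 4000, n)
bathrooms = rng.integers(1, 5, n).astype(float)
price = 50_000 + 100 * sqft + 30_000 * bathrooms + rng.normal(0, 10_000, n)

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(n), sqft, bathrooms])
coef, *_ = np.linalg.lstsq(A, price, rcond=None)
intercept, b_sqft, b_bath = coef

def predict(sqft_value, bath_value):
    """Prediction of the fitted linear model for one house."""
    return intercept + b_sqft * sqft_value + b_bath * bath_value

# Same living area, one extra bathroom: the difference equals the coefficient.
delta = predict(2000, 3) - predict(2000, 2)
print("bathrooms coefficient:", b_bath)
print("change in prediction: ", delta)
```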


## Regression Evaluation Metrics

Comparing these metrics:

- **MAE** (mean absolute error) is the easiest to understand, because it is the average absolute error.
- **MSE** (mean squared error) is more popular than MAE, because MSE "punishes" larger errors more heavily, which tends to be useful in the real world.
- **RMSE** (root mean squared error) is even more popular than MSE, because RMSE is expressed in the same units as "y".

All of these are **loss functions**: lower values are better, so we want to minimize them.

In [None]:
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
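
The same three metrics can be written out by hand, which makes the definitions concrete. This is a small sketch with made-up prices and predictions (`y_true` and `y_pred` are hypothetical values, not output of the model above).

```python
import numpy as np

# Made-up actual and predicted prices.
y_true = np.array([300_000.0, 450_000.0, 250_000.0, 800_000.0])
y_pred = np.array([320_000.0, 430_000.0, 260_000.0, 700_000.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average size of the error: 37500.0
mse = np.mean(errors ** 2)      # squaring weights the 100,000 miss heavily: 2.725e9
rmse = np.sqrt(mse)             # back in dollars: about 52,201.5

print("MAE: ", mae)
print("MSE: ", mse)
print("RMSE:", rmse)
```

RMSE comes out larger than MAE here because the single large error dominates once the errors are squared.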

## R2 Score 

In [None]:
from sklearn.metrics import r2_score
print('R2 Score : ',r2_score(y_test, predictions))
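
For reference, R2 compares the model's squared error with the squared error of always predicting the mean: R2 = 1 - SS_res / SS_tot. A minimal sketch with made-up numbers (not values from the housing data):

```python
import numpy as np

# Made-up targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # error left after the model: 1.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean: 20.0
r2 = 1 - ss_res / ss_tot                         # 0.95

# Predicting the mean for every row gives R2 = 0 by construction.
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = 1 - np.sum((y_true - mean_pred) ** 2) / ss_tot

print("R2:", r2)
print("R2 of the mean-only baseline:", r2_mean)
```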

## Conclusion

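Results: The R2 score is about 0.709, meaning the model explains roughly 70.9% of the variance in house prices on the test set. This suggests that linear regression is a reasonable model for predicting house prices from the given features.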

Note: This score can change with the data used and with how the data is split into training and test sets.
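
One way to see this variability is to repeat the train/test split with different random seeds and compare the resulting R2 values. The sketch below is an illustration under stated assumptions: the data is synthetic, the helper `r2_for_seed` is hypothetical, and the fit uses `np.linalg.lstsq` instead of scikit-learn so that the example depends on NumPy only. It keeps the notebook's 25% test fraction.

```python
import numpy as np

def r2_for_seed(X, y, seed, test_fraction=0.25):
    """Split once with the given seed, fit least squares, return test-set R2."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(len(y) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    # Add an intercept column, then fit on the training rows only.
    A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    A_test = np.column_stack([np.ones(n_test), X[test_idx]])
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

    pred = A_test @ coef
    ss_res = np.sum((y[test_idx] - pred) ** 2)
    ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
    return 1 - ss_res / ss_tot

# Synthetic data: one feature plus noise (made-up relationship).
data_rng = np.random.default_rng(42)
X = data_rng.uniform(500, 4000, size=(400, 1))
y = 150 * X[:, 0] + data_rng.normal(0, 100_000, 400)

# Same data, five different splits.
scores = [r2_for_seed(X, y, seed) for seed in range(5)]
for seed, score in enumerate(scores):
    print(f"seed {seed}: R2 = {score:.3f}")
```

Reporting the mean and spread of several such scores, or using k-fold cross-validation, gives a more stable estimate than a single split.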