## Introduction

In this notebook I load and preprocess a dataset of house sales in King County, Washington (which includes Seattle), then train and test a multivariate log-linear regression to predict house prices. The King County dataset was collected between May 2014 and May 2015 and contains 21613 rows × 21 columns.


## Table of Contents

   1. [Variables Overview](#cell1)
   2. [Importing Relevant Libraries](#cell2)
   3. [Importing Dataset](#cell4)
   4. [Preprocessing](#cell5)
       - Dealing with missing values
       - Check for duplicate values
       - Exploring the descriptive statistics of the variables
       - Exploring PDFs (Probability Distribution Functions)
       - Checking Ordinary Least Squares (OLS) assumptions
       - Relaxing OLS assumptions (log transformation)
       - Checking for multicollinearity
   5. [Linear Regression Model](#cell6)
       - Declaring dependent and independent variables
       - Scaling data
       - Train_Test_Split data
       - Fitting model
   6. [Checking Results of Linear Regression Model](#cell7)
       - Scatter plot (y_train vs predicted X_train)
       - Residual PDF
       - R^2 score
       - Features and weights
   7. [Testing](#cell8)
       - Scatter plot (y_test vs predicted X_test)
       - Actual value, predicted value and differences chart
    
    
   


## Variables Overview <a id="cell1"></a>

**id** - Unique ID for each home sold

**date** - Date of the home sale

**price** - Price of each home sold

**bedrooms** - Number of bedrooms

**bathrooms** - Number of bathrooms, where .5 accounts for a room with a toilet but no shower

**sqft_living** - Square footage of the home's interior living space

**sqft_lot** - Square footage of the land space

**floors** - Number of floors

**waterfront** - A dummy variable for whether the home overlooks the waterfront or not

**view** - An index from 0 to 4 of how good the view of the property was

**condition** - An index from 1 to 5 on the condition of the home

**grade** - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.

**sqft_above** - The square footage of the interior housing space that is above ground level

**sqft_basement** - The square footage of the interior housing space that is below ground level

**yr_built** - The year the house was initially built

**yr_renovated** - The year of the house’s last renovation

**zipcode** - What zipcode area the house is in

**lat** - Latitude

**long** - Longitude

**sqft_living15** - The square footage of interior housing living space for the nearest 15 neighbors

**sqft_lot15** - The square footage of the land lots of the nearest 15 neighbors



## Importing Relevant Libraries<a id="cell2"></a>
I import NumPy for numerical operations, pandas for data manipulation and analysis, matplotlib to visualize data in charts and graphs, and seaborn for higher-level statistical plots. I also change the pandas display format so that it does not show scientific notation when displaying data.


In [None]:
import numpy as np
import pandas as pd
#set the pandas display format so it will not use scientific notation
pd.set_option('display.float_format', lambda x: '%.3f' % x)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()



## Importing Dataset<a id="cell4"></a>
I use the pandas read_csv function to load the CSV into raw_data and display the first five rows of the dataset. Noticing that 'date' carries extra trailing characters (T000000), I strip them and cast the column to float.

In [None]:
raw_data = pd.read_csv('../input/housesalesprediction/kc_house_data.csv')
raw_data.head()

In [None]:
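# Strip the trailing time marker 'T000000' from every date string
raw_data['date'] = raw_data['date'].str.replace('T000000', '')
# Store the remaining digits as a number so the column can be used as a numeric feature
raw_data['date'] = raw_data['date'].astype(float)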
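
As an alternative to stripping the suffix and casting to float, the date strings could be parsed into real timestamps. The sketch below is not part of the original workflow: it assumes the raw values follow a 'YYYYMMDDT000000' layout, uses three made-up sample strings, and the column names date_parsed, sale_year and sale_month are hypothetical.

```python
import pandas as pd

# Illustrative values in the assumed 'YYYYMMDDT000000' layout of the raw 'date' column
sample = pd.DataFrame({'date': ['20141013T000000', '20150225T000000', '20140502T000000']})

# Parse the strings into timestamps instead of stripping and casting to float
sample['date_parsed'] = pd.to_datetime(sample['date'], format='%Y%m%dT%H%M%S')

# Derive numeric features that a regression can use directly
sample['sale_year'] = sample['date_parsed'].dt.year
sample['sale_month'] = sample['date_parsed'].dt.month

print(sample[['date_parsed', 'sale_year', 'sale_month']])
```

Year and month columns keep the seasonal information in a form that is easier to interpret than a single YYYYMMDD number.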


In [None]:
raw_data

## Preprocessing<a id="cell5"></a>
With the dataset successfully loaded, I move on to preprocessing it.

### Dealing with missing values
Calling isnull().sum() on the dataset returns the number of null values in each column.

In [None]:
raw_data.isnull().sum()

### Check for duplicate values
To check for duplicates, I use the .duplicated() method followed by .sum(), which returns the number of duplicated rows.

In [None]:
raw_data.duplicated().sum()

In [None]:
# Since the dataset contains no missing or duplicated values, I carry it forward unchanged as data_no_mv (data with no missing values).
data_no_mv = raw_data
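
The two checks above can be combined into one small helper. This is only a sketch on a made-up toy frame; quality_report is a hypothetical name and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

def quality_report(df):
    """Return the number of missing cells and fully duplicated rows in df."""
    return {
        'missing_cells': int(df.isnull().sum().sum()),
        'duplicated_rows': int(df.duplicated().sum()),
    }

# Toy frame (made-up values): one missing price and one repeated row
toy = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'price': [300000.0, 450000.0, 450000.0, np.nan],
})
report = quality_report(toy)
print(report)

# If the report is non-zero, duplicates and incomplete rows can be dropped like this
toy_clean = toy.drop_duplicates().dropna().reset_index(drop=True)
print(len(toy_clean))
```

When both counts are zero, as reported for this dataset, no rows need to be dropped.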

### Exploring the descriptive statistics of the variables
I use the pandas .describe() method to pull summary statistics for the features in the dataset. The thing to watch for in this table is a large gap between the max, the mean and the percentiles. The first thing I notice is that the max of price is 7700000.000, while the median (50%) is 450000.000 and the mean is 540088.142.

In [None]:
data_no_mv.describe()

### Exploring PDFs (Probability Distribution Functions) of features
I take the data with no missing values and plot the PDF of the variables whose descriptive statistics look suspicious. I am looking for outliers and for a reasonably well-behaved distribution. Outliers are observations that lie far from the vast majority of observations and can throw off the model's predictive ability. A simple way to handle them is to drop the top percentile of a feature or to filter the data by a threshold.
        
        

In [None]:
sns.distplot(data_no_mv['price'])

In [None]:
# Using the .quantile method, I drop the top 1% of the "price" values to handle outliers and bring the distribution closer to normal, which helps the regression
z = data_no_mv['price'].quantile(0.99)
data_1 = data_no_mv[data_no_mv['price']<z]


In [None]:
sns.distplot(data_1['price'])

In the bedrooms distribution, I notice outliers ranging up to 35. I isolate the bedrooms feature to inspect the column.
After searching King County and Seattle houses on Zillow, I find that they do not pass 20 bedrooms, and beyond 16 hardly any places seem to be available. In the code below I apply a stricter cutoff and keep only entries with fewer than 8 bedrooms.

In [None]:
sns.distplot(data_1['bedrooms'])

In [None]:
bedrms = pd.DataFrame(raw_data['bedrooms'])
bedrms = bedrms.dropna(axis=0)

In [None]:
bedrms.sort_values(by='bedrooms')

In [None]:
data_2 = data_1[data_1['bedrooms']<8]

In [None]:
sns.distplot(data_2['bedrooms'])

In [None]:
sns.distplot(data_2['sqft_lot'])

Using the .quantile method, I keep the lowest 95% of observations of the "sqft_lot" variable to handle outliers and bring the distribution closer to normal, which helps the regression.

In [None]:
z = data_2['sqft_lot'].quantile(0.95)
data_3 = data_2[data_2['sqft_lot']<z]

In [None]:
sns.distplot(data_3['sqft_lot'])

In [None]:
sns.distplot(data_3['sqft_above'])

Using the .quantile method, I keep the lowest 99% of observations of the "sqft_above" variable to handle outliers and bring the distribution closer to normal, which helps the regression.

In [None]:
z = data_3['sqft_above'].quantile(0.99)
data_4 = data_3[data_3['sqft_above']<z]

In [None]:
sns.distplot(data_4['sqft_above'])
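
The percentile filters above all follow the same pattern, so they could be wrapped in one helper. The sketch below uses synthetic log-normal values in place of the real columns; drop_upper_tail is a hypothetical name and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

def drop_upper_tail(df, column, q):
    """Keep rows whose value in `column` is strictly below its q-th quantile."""
    cutoff = df[column].quantile(q)
    return df[df[column] < cutoff]

# Synthetic right-skewed data standing in for a column such as 'price'
rng = np.random.default_rng(0)
toy = pd.DataFrame({'price': rng.lognormal(mean=13, sigma=0.5, size=1000)})

trimmed = drop_upper_tail(toy, 'price', 0.99)

# Compare row counts and skewness before and after trimming
print(len(toy), len(trimmed))
print(toy['price'].skew(), trimmed['price'].skew())
```

Chaining the helper (price at 0.99, sqft_lot at 0.95, sqft_above at 0.99) would reproduce the sequence of quantile filters used here, with each cutoff computed on the already filtered data.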

### Index reset
I reset the index of the filtered data and store the result in the data_cleaned variable.

In [None]:
# data_4 is the last filtered frame (price, bedrooms, sqft_lot and sqft_above filters applied)
data_cleaned = data_4.reset_index(drop=True)

In [None]:
data_cleaned.describe()

### Checking Ordinary Least Squares (OLS) assumptions
I use scatter plots of possible predictors against "price" to check the linearity assumption of Ordinary Least Squares.

In [None]:
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize =(15,3))
ax1.scatter(data_cleaned['bedrooms'],data_cleaned['price'])
ax1.set_title('price and bedrooms')

ax2.scatter(data_cleaned['sqft_living'],data_cleaned['price'])
ax2.set_title('price and sqft_living')

ax3.scatter(data_cleaned['yr_built'],data_cleaned['price'])
ax3.set_title('price and yr_built')


plt.show()

In [None]:
f, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, sharey=True, figsize =(15,3))
ax1.scatter(data_cleaned['grade'],data_cleaned['price'])
ax1.set_title('price and grade')

ax2.scatter(data_cleaned['sqft_lot'],data_cleaned['price'])
ax2.set_title('price and sqft_lot')

ax3.scatter(data_cleaned['condition'],data_cleaned['price'])
ax3.set_title('price and condition')


ax4.scatter(data_cleaned['sqft_above'],data_cleaned['price'])
ax4.set_title('price and sqft_above')

plt.show()

### Relaxing assumptions
I use np.log to transform 'price' into 'Log_price', which gives a more linear relationship with the other variables, and then drop 'price'. np.log returns the natural logarithm of a number; working on the log scale relaxes the linearity assumption and gives a better model fit.

In [None]:
log_price = np.log(data_cleaned['price'])
data_cleaned['Log_price'] = log_price
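
To illustrate why the log transformation helps, the sketch below compares the skewness of a right-skewed variable before and after np.log. The prices are synthetic log-normal draws, not the King County data, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (log-normal), standing in for the 'price' column
rng = np.random.default_rng(42)
price = pd.Series(rng.lognormal(mean=13, sigma=0.5, size=5000), name='price')

log_price = np.log(price)

# Skewness near 0 indicates a roughly symmetric distribution
print('skew of price    :', round(price.skew(), 3))
print('skew of Log_price:', round(log_price.skew(), 3))

# np.exp is the inverse of np.log, so values on the log scale can be mapped back to prices
recovered = np.exp(log_price)
print(np.allclose(recovered, price))
```

The same inverse, np.exp, is what the testing section relies on to turn predicted log prices back into prices.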


In [None]:
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize =(15,3))
ax1.scatter(data_cleaned['bedrooms'],data_cleaned['Log_price'])
ax1.set_title('Log_price and bedrooms')

ax2.scatter(data_cleaned['sqft_living'],data_cleaned['Log_price'])
ax2.set_title('Log_price and sqft_living')

ax3.scatter(data_cleaned['yr_built'],data_cleaned['Log_price'])
ax3.set_title('Log_price and yr_built')


plt.show()

In [None]:
f, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, sharey=True, figsize =(15,3))
ax1.scatter(data_cleaned['grade'],data_cleaned['Log_price'])
ax1.set_title('Log_price and grade')

ax2.scatter(data_cleaned['sqft_lot'],data_cleaned['Log_price'])
ax2.set_title('Log_price and sqft_lot')

ax3.scatter(data_cleaned['condition'],data_cleaned['Log_price'])
ax3.set_title('Log_price and condition')


ax4.scatter(data_cleaned['sqft_above'],data_cleaned['Log_price'])
ax4.set_title('Log_price and sqft_above')

plt.show()

In [None]:
data_cleaned = data_cleaned.drop(['price'], axis=1)
data_cleaned

### Checking for multicollinearity
To check the no-multicollinearity assumption, I import variance_inflation_factor from statsmodels. None of the features break this assumption.

In [None]:
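from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF is computed on the predictors only, so the target (Log_price) is left out
variables = data_cleaned.drop(['Log_price'], axis=1)
vif = pd.DataFrame()
# One VIF per column: how well that column is explained by all the other columns
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif["features"] = variables.columns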

In [None]:
vif
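
For intuition about what the VIF measures, it can be computed by hand: regress each feature on all the others and take 1 / (1 - R^2). The sketch below does this with NumPy on synthetic columns; vif_table and the column names are hypothetical. It adds an intercept to each auxiliary regression, so its numbers can differ from the statsmodels call above, which is applied to the raw columns.

```python
import numpy as np
import pandas as pd

def vif_table(features):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j on the others."""
    X = features.to_numpy(dtype=float)
    rows = []
    for j, name in enumerate(features.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Add an intercept column so R^2 is measured around the mean of y
        design = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        r2 = 1.0 - resid.var() / y.var()
        rows.append({'features': name, 'VIF': 1.0 / (1.0 - r2)})
    return pd.DataFrame(rows)

# Synthetic predictors: 'sqft_b' is almost a copy of 'sqft_a', 'other' is independent
rng = np.random.default_rng(1)
a = rng.normal(size=500)
toy = pd.DataFrame({
    'sqft_a': a,
    'sqft_b': a + rng.normal(scale=0.1, size=500),
    'other': rng.normal(size=500),
})
toy_vif = vif_table(toy)
print(toy_vif)
```

A VIF close to 1 means a feature is nearly uncorrelated with the rest; common rules of thumb treat values above roughly 5 to 10 as a sign of problematic multicollinearity.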

In [None]:
# Renaming data_cleaned to data_pre_proc now that preprocessing is done
data_pre_proc = data_cleaned

## Linear Regression Model<a id="cell6"></a>

### Declaring dependent and independent variables
I declare the independent and dependent variables. For the independent variables (inputs), Log_price is dropped since it is the dependent variable (targets).

In [None]:
inputs = data_pre_proc.drop(['Log_price'],axis=1)

In [None]:
targets = data_pre_proc['Log_price']

### Scaling data
I import the StandardScaler class from sklearn and use it to scale the independent variables, so that all features are on a comparable scale (zero mean, unit variance) and their weights can be compared.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(inputs)

In [None]:
x_scaled = scaler.transform(inputs)

In [None]:
x_scaled

### Train_Test_Split data
I use an 80/20 split: 80% of the data for training and 20% for testing, with a random state of 9.

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x_scaled,targets,test_size= 0.20,random_state=9)
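
The notebook fits the scaler on all rows before splitting. A common variation, sketched here on synthetic data, is to split first and fit the scaler on the training rows only, so that no information from the test rows enters the scaling. The names toy_inputs, toy_targets and toy_scaler are hypothetical and the values are random.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for inputs and targets (200 rows, 3 features)
rng = np.random.default_rng(9)
toy_inputs = rng.normal(loc=50, scale=10, size=(200, 3))
toy_targets = rng.normal(loc=13, scale=0.5, size=200)

# Split first, so the test rows stay unseen
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_inputs, toy_targets, test_size=0.20, random_state=9
)

# Fit the scaler on the training rows only, then reuse its mean and scale on the test rows
toy_scaler = StandardScaler()
x_tr_scaled = toy_scaler.fit_transform(x_tr)
x_te_scaled = toy_scaler.transform(x_te)

print(x_tr_scaled.shape, x_te_scaled.shape)
print(x_tr_scaled.mean(axis=0).round(3))
```

With this ordering, the test set's scaled means are generally close to zero but not exactly zero, which is expected.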

### Fitting model
I fit the linear regression model on the training data. I then check the results with a scatter plot of the predicted values against the observed values, and with a residual PDF built from the difference between targets and predictions to visualize the error.

In [None]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train,y_train)

In [None]:
reg.get_params()


## Checking Results<a id="cell7"></a>

### Scatter plot 

Plotting predicted values against the observed values to check the results

In [None]:
y_hat = reg.predict(x_train)
y_hat

In [None]:
plt.scatter(y_train ,y_hat, alpha=0.2)
plt.xlabel('Targets (Y_train)', size=15)
plt.ylabel('Predictions  (Y_hat)', size=15)
plt.xlim(11,15)
plt.ylim(11,15)
plt.title('Actual vs Predicted')


### Residual PDF

In [None]:
# The residuals are the differences between the targets and the predictions; their distribution shows the spread and mean of the error
sns.distplot(y_train - y_hat)
plt.show()

### R^2 score
The R^2 score of 76% is the share of the variation in the response variable that the linear regression model explains.

In [None]:
reg.score(x_train,y_train)
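
For reference, the R^2 returned by reg.score can be reproduced by hand as 1 - SS_res / SS_tot. The sketch below uses four made-up values rather than the model's output; r_squared and adjusted_r_squared are hypothetical helper names, and the adjusted version (which penalizes the number of features) is an addition not computed in the notebook.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_rows, n_features):
    """Adjusted R^2 corrects R^2 for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_rows - 1) / (n_rows - n_features - 1)

# Made-up targets and predictions for illustration
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(y_true, y_pred)
adj = adjusted_r_squared(r2, n_rows=4, n_features=1)
print(round(r2, 4), round(adj, 4))
```

In the notebook's setting, n_rows would be x_train.shape[0] and n_features would be x_train.shape[1].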

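### Features and weights
I check how much weight each feature carries in predicting the price. A positive weight means that the price increases as the feature increases; a negative weight means that the price decreases as the feature increases. Because the inputs are standardized, the weights are comparable across features and could be used for feature selection.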

In [None]:
reg_summary = pd.DataFrame(inputs.columns.values, columns=['Features'])
reg_summary['weights'] = reg.coef_
reg_summary
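
Because the target is Log_price and the inputs are standardized, a weight w can be read as follows: a one-standard-deviation increase in the feature multiplies the predicted price by exp(w), holding the other features fixed. The sketch below converts weights into percentage changes. The feature names and weights are made up for illustration and are not results from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up weights for illustration only; they are not results from this notebook
toy_summary = pd.DataFrame({
    'Features': ['feature_a', 'feature_b', 'feature_c'],
    'weights': [0.10, -0.05, 0.00],
})

# On the log scale, adding w to Log_price multiplies the price by exp(w)
toy_summary['price_change_%'] = (np.exp(toy_summary['weights']) - 1) * 100
print(toy_summary)
```

Applying the same conversion to reg_summary['weights'] would give a percentage reading for each real feature.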

## Testing<a id="cell8"></a>
Plotting the predicted values against the test targets in a scatter plot to show how well the model predicts unseen data.

### Scatter plot (y_test vs predicted X_test)

In [None]:
y_hat_test = reg.predict(x_test)

In [None]:
plt.scatter(y_test ,y_hat_test, alpha=0.2)
plt.xlabel('Targets (y_test)', size=15)
plt.ylabel('Predictions  (Y_hat_test)', size=15)
plt.xlim(11.5,15)
plt.ylim(11.5,15)
plt.title('Targets (y_test) vs Predicted')


### Actual value, predicted value and differences chart
The final test of the linear regression model is how well its predictions hold up against the actual data. For this, I use np.exp to transform the log values back to prices. I create a Predictions column from the predictions on x_test, then add a Target column from y_test, resetting its index so that the rows line up. I finish by displaying a new table containing the predictions, targets, residuals and percentage differences to show the model's accuracy.

In [None]:
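predv = pd.DataFrame(np.exp(y_hat_test), columns=['Predictions'])
# Reset the index first: y_test still carries the shuffled row labels from the split,
# and assigning it as-is would misalign the rows and produce NaN targets
y_test = y_test.reset_index(drop=True)
predv['Target'] = np.exp(y_test)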

In [None]:
predv['Residual'] = predv['Target'] - predv['Predictions']

In [None]:
predv['Difference%'] = np.absolute(predv['Residual']/predv['Target']*100)
predv
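
The table above depends on the targets and predictions lining up row by row. The sketch below uses three made-up prices to show what happens when a Series with shuffled row labels is assigned without resetting its index, and then summarizes the percentage error in one number. All toy_ names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy log-scale targets with shuffled row labels, as they look after train_test_split
toy_y_test = pd.Series(np.log([300000.0, 450000.0, 600000.0]), index=[5, 2, 9])
toy_y_hat = np.log([330000.0, 450000.0, 540000.0])

toy_predv = pd.DataFrame(np.exp(toy_y_hat), columns=['Predictions'])

# Assigning the Series as-is aligns on labels: only label 2 matches, the rest become NaN
misaligned = toy_predv.assign(Target=np.exp(toy_y_test))
print(misaligned['Target'].isna().sum())

# Resetting the index first lines the rows up by position
toy_predv['Target'] = np.exp(toy_y_test.reset_index(drop=True))
toy_predv['Residual'] = toy_predv['Target'] - toy_predv['Predictions']
toy_predv['Difference%'] = np.absolute(toy_predv['Residual'] / toy_predv['Target'] * 100)

# One-number summaries of the percentage error
mean_pct = toy_predv['Difference%'].mean()
median_pct = toy_predv['Difference%'].median()
print(mean_pct, median_pct)
```

On the real table, predv['Difference%'].describe() would give the same kind of summary, including the quartiles of the percentage error.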