# House Sales in King County, USA

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

This is my first attempt at using simple regression models to predict house prices in King County, USA.

### This is a list of features in our dataset

- <b>id</b>: Unique identifier for each house sold (primary key of the dataset)

- <b>date</b>: Date the house was sold


- <b>price</b>: Price of the house (our prediction target)


- <b>bedrooms</b>: Number of bedrooms


- <b>bathrooms</b>: Number of bathrooms

- <b>sqft_living</b>: Square footage of the home

- <b>sqft_lot</b>: Square footage of the lot


- <b>floors</b>: Total floors (levels) in the house


- <b>waterfront</b>: Whether the house has a view of a waterfront


- <b>view</b>: Rating of how good the view from the house is, on a scale from 0 to 4 (the column holds values 0 through 4, so it is not a boolean)


- <b>condition</b>: Overall condition of the house

- <b>grade</b>: Overall grade given to the housing unit, based on the King County grading system


- <b>sqft_above</b>: Square footage of the house excluding the basement


- <b>sqft_basement</b>: Square footage of the basement

- <b>yr_built</b>: Year the house was built


- <b>yr_renovated</b>: Year the house was renovated

- <b>zipcode</b>: Zip code


- <b>lat</b>: Latitude coordinate

- <b>long</b>: Longitude coordinate

- <b>sqft_living15</b>: Living room area in 2015 (implies some renovations). This might or might not have affected the lot size area


- <b>sqft_lot15</b>: Lot size area in 2015 (implies some renovations)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
from sklearn.linear_model import LinearRegression
%matplotlib inline

# Exploratory Data Analysis

In [None]:
df = pd.read_csv('../input/housesalesprediction/kc_house_data.csv')
df.head()

#### Get an overview of our dataset

In [None]:
df.dtypes

In [None]:
df.describe()

In [None]:
#Check if there is any null value in our dataset
df.isnull().sum()

In [None]:
df.columns

I will add a new feature 'age' to the dataset (the number of years between 'yr_built' and the most recent construction year in the data) and remove the feature 'yr_built'. We will also drop 'id' and 'date', since these features won't help us predict the price of houses (our target).

In [None]:
df['age'] = df['yr_built'].max() - df['yr_built']
df.drop(columns=['yr_built', 'id', 'date'], inplace=True)
df.head()

#### We now look at some basic plots about correlations between the dataset's features and the price.

- Look at the numbers in the table. Each one is a real number between -1 and 1. A magnitude close to 1 indicates a strong correlation between the two variables, and a magnitude close to 0 indicates a weak correlation or none at all. A positive sign indicates a positive correlation and a negative sign indicates a negative correlation.
- We will see it more intuitively with the heatmap below the numeric table.
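
To make the reading of these numbers concrete, here is a minimal sketch on synthetic data (not the King County dataset). The column names `strong_pos`, `strong_neg` and `unrelated` are made up for the illustration. It also recomputes one coefficient by hand as the covariance divided by the product of the standard deviations, which is what `corr()` returns by default (Pearson's r).

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (assumption: not taken from the dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "x": x,
    "strong_pos": 2 * x + rng.normal(scale=0.1, size=500),   # almost linear in x
    "strong_neg": -3 * x + rng.normal(scale=0.1, size=500),  # almost linear, opposite sign
    "unrelated": rng.normal(size=500),                        # generated independently of x
})

# Correlation of every column with x
corr_with_x = toy.corr()["x"]
print(corr_with_x)

# Pearson's r by hand: covariance / (std of x * std of strong_pos)
r_manual = np.cov(toy["x"], toy["strong_pos"])[0, 1] / (
    toy["x"].std() * toy["strong_pos"].std()
)
print(r_manual)
```

The first two columns should come out close to 1 and -1, while the unrelated column should stay close to 0.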

In [None]:
df.corr()

In [None]:
def corr_heatmap(corr_matrix):
    # Draw a heatmap of the correlation matrix passed in
    fig, ax = plt.subplots(figsize = (15, 10))
    sns.heatmap(corr_matrix, cmap="YlGnBu", ax=ax)
    ax.set_title("Correlation between data's features", fontsize = 20)

corr_heatmap(df.corr())

# Data Visualization

##### We now look at a few plots of 'price' against other features

In [None]:
fig = plt.figure(figsize = (18,14)) # create figure

ax0 = fig.add_subplot(221) # add subplot 1 (2 rows, 2 columns, first plot); same as add_subplot(2, 2, 1)
ax1 = fig.add_subplot(222) # add subplot 2 (2 rows, 2 columns, second plot); same as add_subplot(2, 2, 2)
ax2 = fig.add_subplot(223)
ax3 = fig.add_subplot(224)

# Subplot 1:
sns.boxplot(x='waterfront',y='price', data=df, ax=ax0)
ax0.set_title('Correlation between waterfront and price')
ax0.set_xlabel('Waterfront')
ax0.set_ylabel('Price')

# Subplot 2:
sns.boxplot(x='bedrooms',y='price',data=df, ax=ax1)
ax1.set_title('Correlation between number of bedrooms and price')
ax1.set_xlabel('Bedrooms')
ax1.set_ylabel('Price')

#Subplot 3:
sns.boxplot(x='grade',y='price',data=df, ax=ax2)
ax2.set_title('Correlation between grade and price')
ax2.set_xlabel('Grade')
ax2.set_ylabel('Price')

#Subplot 4:
sns.boxplot(x='bathrooms',y='price',data=df, ax=ax3)
ax3.set_title('Correlation between bathrooms and price')
ax3.set_xlabel('Bathrooms')
ax3.set_ylabel('Price')


plt.show()

#### I have a few observations based on the boxplots above:
- It is clear that houses with a waterfront view are more valuable than those without.
- There is a weak positive correlation between the number of bedrooms and the price. The plot shows quite a few outliers for these two features.
- There is a positive correlation between grade and price. This is understandable, since the grade reflects how valuable the house is.
- Interestingly, the correlation between the number of bathrooms and the price is a bit stronger than the correlation between the number of bedrooms and the price. Even though there are still some outliers, we can see a positive relationship between these two features.

##### Let's look at a few more plots to see the relationships between 'price' and other features

In [None]:
fig = plt.figure(figsize = (18,14)) # create figure

ax0 = fig.add_subplot(221) # add subplot 1 (2 rows, 2 columns, first plot); same as add_subplot(2, 2, 1)
ax1 = fig.add_subplot(222) # add subplot 2 (2 rows, 2 columns, second plot); same as add_subplot(2, 2, 2)
ax2 = fig.add_subplot(223)
ax3 = fig.add_subplot(224)

# Subplot 1 (regression plot) : 
sns.regplot(x='sqft_living',y='price', data=df, ax=ax0)
ax0.set_title('Correlation between sqft living and price')
ax0.set_xlabel('sqft living')
ax0.set_ylabel('Price')

# Subplot 2 (Scatter plot) :
sns.scatterplot(x='sqft_lot',y='price',data=df, ax=ax1)
ax1.set_title('Correlation between sqft lot and price')
ax1.set_xlabel('sqft lot')
ax1.set_ylabel('Price')

# Subplot 3 (Scatter plot) :
sns.scatterplot(x='age',y='price',data=df, ax=ax2)
ax2.set_title('Correlation between age and price')
ax2.set_xlabel('Age')
ax2.set_ylabel('Price')

# Subplot 4 (Regression plot) :
sns.regplot(x='sqft_above',y='price',data=df, ax=ax3)
ax3.set_title('Correlation between sqft above and price')
ax3.set_xlabel('Sqft above')
ax3.set_ylabel('Price')


plt.show()

#### I have a few more observations based on the plots above:
- We can see positive correlations between price and both sqft living and sqft above. The regression lines show this.
- We cannot conclude anything from sqft lot and age. The points are scattered all over the place.


> We can use the Pandas method <code>corr()</code> to find the features, other than price, that are most correlated with price. Then we plot this table to show it more clearly.

In [None]:
# Create a new dataframe df_temp without the features 'long', 'lat', 'zipcode' and 'age'
df_temp = df.drop(columns=['long', 'lat', 'zipcode', 'age'])

# Correlation of each remaining feature with 'price', sorted in descending order, with the index reset
df_corr = df_temp.corr()['price'].sort_values(ascending=False).reset_index()

df_corr = df_corr[df_corr["index"] != "price"]  # Eliminate 'price' itself from the x-axis

sns.set_style("dark")  # set the style before drawing so that it applies to this figure
plt.figure(figsize=(15,9))
sns.barplot(x = "index", y = "price", data = df_corr)
plt.xlabel("Other features")
plt.ylabel("Correlation with price")
plt.xticks(rotation = 90)
plt.title("Correlation with price")
plt.grid(True)

In [None]:
# Count how many houses have each number of floors (sorted from most to least common)
floor_df = df['floors'].value_counts()
floor_df.rename_axis('floor value', inplace = True)
floor_df

In [None]:


colors_list = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'lightgreen', 'pink']
explode_list = [0.12, 0, 0, 0, 0.05, 0.15] # offset ratio for each wedge, in the order of floor_df

floor_df.plot(kind='pie',
              figsize=(15, 6),
              autopct='%1.1f%%',
              startangle=90,
              shadow=True,
              labels=None,          # turn off labels on pie chart
              pctdistance=1.15,     # relative distance from the center at which the autopct text is drawn
              colors=colors_list,   # add custom colors
              explode=explode_list  # offset the most common floor value and the two least common ones
              )

# raise the title by 12% so that it clears the percentage labels placed by pctdistance
plt.title('The distribution of floor values', y=1.12) 

plt.axis('equal') 

# add legend
plt.legend(labels=floor_df.index, loc='upper left') 

plt.show()

In [None]:
# Import folium to plot the house locations on an interactive map
import folium
from folium.plugins import FastMarkerCluster

In [None]:
df_geo = df[["zipcode" , "lat" , "long"]]

lat = df_geo["lat"]
long = df_geo["long"]

cordinates = list(zip(lat,long))

In [None]:
king_map = folium.Map(location = [47.5480, -122.257], zoom_start = 10)

FastMarkerCluster(data=cordinates).add_to(king_map)

king_map

# Model developments

###  Simple Linear Regression and Multiple Linear Regression

> Let's try simple linear regression model and calculate R^2 between 'price' and other features
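
As a reminder of what the score means, R^2 is one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch on synthetic data (the "square footage" and "price" values are made up, not taken from the dataset) that checks this definition against `LinearRegression.score`. For a regression with a single feature, it also equals the squared Pearson correlation between the feature and the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only (assumption: not the King County data)
rng = np.random.default_rng(1)
X_toy = rng.uniform(500, 4000, size=(200, 1))                  # made-up "square footage"
y_toy = 250 * X_toy[:, 0] + rng.normal(scale=1e5, size=200)    # made-up "price"

model = LinearRegression().fit(X_toy, y_toy)
y_pred = model.predict(X_toy)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_toy - y_pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# With one feature, R^2 is the square of Pearson's r
r_squared = np.corrcoef(X_toy[:, 0], y_toy)[0, 1] ** 2

print(r2_manual, model.score(X_toy, y_toy), r_squared)
```

Note that the scores in this section are computed on the same data the models were fitted on, so they describe goodness of fit rather than predictive performance.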

In [None]:
Y = df['price']

lm1 = LinearRegression()
X1 = df[['sqft_living']]
lm1.fit(X1,Y)
print('The R^2 score between \'sqft_living\' and \'price\' is ' + str(lm1.score(X1,Y)))

lm2 = LinearRegression()
X2 = df[['bathrooms']]
lm2.fit(X2,Y)
print('The R^2 score between \'bathrooms\' and \'price\' is ' + str(lm2.score(X2,Y)))

lm3 = LinearRegression()
X3 = df[['grade']]
lm3.fit(X3,Y)
print('The R^2 score between \'grade\' and \'price\' is ' + str(lm3.score(X3,Y)))

> - Let's compare the distribution of the actual prices against the distribution of the predicted values

In [None]:
width = 10
height = 8
plt.figure(figsize=(width, height))


# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curves
ax1 = sns.kdeplot(x=df['price'], color="r", label="Actual Value")
sns.kdeplot(x=lm1.predict(X1), color="b", label="Fitted Values", ax=ax1)
plt.legend()

plt.title('Actual vs Fitted Values for Price based on sqft_living')
plt.xlabel('Price (in dollars)')
plt.ylabel('Density')

plt.show()
plt.close()

In [None]:
width = 10
height = 8
plt.figure(figsize=(width, height))


# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curves
ax1 = sns.kdeplot(x=df['price'], color="r", label="Actual Value")
sns.kdeplot(x=lm2.predict(X2), color="b", label="Fitted Values", ax=ax1)
plt.legend()

plt.title('Actual vs Fitted Values for Price based on bathrooms')
plt.xlabel('Price (in dollars)')
plt.ylabel('Density')

plt.show()
plt.close()

In [None]:
width = 10
height = 8
plt.figure(figsize=(width, height))


# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curves
ax1 = sns.kdeplot(x=df['price'], color="r", label="Actual Value")
sns.kdeplot(x=lm3.predict(X3), color="b", label="Fitted Values", ax=ax1)
plt.legend()

plt.title('Actual vs Fitted Values for Price based on grade')
plt.xlabel('Price (in dollars)')
plt.ylabel('Density')

plt.show()
plt.close()

> We now use multiple linear regression to predict 'price' using the following list of features:

In [None]:
features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]

In [None]:
lm4 = LinearRegression()
X4 = df[features]
X4.head()

In [None]:
lm4.fit(X4,Y)
print('The R^2 score between these features and \'price\' is ' + str(lm4.score(X4,Y)))


In [None]:
Yhat_mult = lm4.predict(df[features])
Yhat_mult

# Pipeline

><p>Data pipelines simplify the steps of processing the data. We use the <b>Pipeline</b> class to chain three steps: scaling with <b>StandardScaler</b>, a polynomial transform with <b>PolynomialFeatures</b>, and a <b>LinearRegression</b> model.</p>
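
To show what a pipeline does behind the scenes, here is a minimal sketch on synthetic data (the arrays `X_toy` and `y_toy` are made up for the illustration). It fits the same three steps once through a `Pipeline` and once by hand, and checks that the predictions agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data for illustration only (assumption: not the King County data)
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy[:, 0] ** 2 + 2 * X_toy[:, 1] + rng.normal(scale=0.1, size=100)

# Version 1: the three steps chained in a pipeline
pipe_toy = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(include_bias=False)),
    ("model", LinearRegression()),
])
pipe_toy.fit(X_toy, y_toy)
pred_pipe = pipe_toy.predict(X_toy)

# Version 2: the same steps applied one after another by hand
scaler = StandardScaler().fit(X_toy)
X_scaled = scaler.transform(X_toy)
poly = PolynomialFeatures(include_bias=False).fit(X_scaled)
X_manual = poly.transform(X_scaled)   # 3 features become 9 (degree 2 by default, no bias column)
lr = LinearRegression().fit(X_manual, y_toy)
pred_manual = lr.predict(X_manual)

print(X_manual.shape)
print(np.allclose(pred_pipe, pred_manual))
```

The pipeline version keeps the fitted scaler and polynomial transform together with the model, so calling `predict` on new data applies all steps in the right order.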

In [None]:
Input = [('scale', StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model', LinearRegression())]

In [None]:
pipe=Pipeline(Input)
pipe

In [None]:
new_features = df[features]
pipe.fit(new_features, Y)

In [None]:
ypipe=pipe.predict(new_features)
ypipe[0:4]

# Model Evaluation and Refinement

Import the necessary modules:

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
print("done")

We will split the data into training and testing sets:

In [None]:
features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]    
X = df[features]
Y = df['price']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)


print("number of test samples:", x_test.shape[0])
print("number of training samples:",x_train.shape[0])

In [None]:
from sklearn.linear_model import Ridge

In [None]:
RidgeModel = Ridge(alpha = 0.1)
RidgeModel.fit(x_train,y_train)
RidgeModel.score(x_test,y_test)


> We now perform a second order polynomial transform on both the training data and testing data. Create and fit a Ridge regression object using the training data, set the regularisation parameter to 0.1, and calculate the R^2 utilising the test data provided.

In [None]:
pr = PolynomialFeatures(degree=2)
x_train_pr = pr.fit_transform(x_train)  # fit the transformer on the training data only
x_test_pr = pr.transform(x_test)        # reuse the fitted transformer on the test data

RidgeModel.fit(x_train_pr,y_train)
RidgeModel.score(x_test_pr,y_test)
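
The `cross_val_score` function imported earlier is not used above. As a possible next step, the following is a minimal sketch of how it could be used to compare several values of the regularisation parameter. It runs on synthetic data, and the list of alpha values and the use of `make_pipeline` are assumptions made for the illustration, not part of the analysis above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data for illustration only (assumption: not the King County data)
rng = np.random.default_rng(3)
X_toy = rng.normal(size=(300, 4))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] * X_toy[:, 2] + rng.normal(scale=0.5, size=300)

cv_results = {}
for alpha in [0.01, 0.1, 1, 10, 100]:   # hypothetical grid of alpha values
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=2),
        Ridge(alpha=alpha),
    )
    # 5-fold cross-validation; the default score for a regressor is R^2
    scores = cross_val_score(model, X_toy, y_toy, cv=5)
    cv_results[alpha] = scores.mean()

best_alpha = max(cv_results, key=cv_results.get)
print(cv_results)
print("best alpha:", best_alpha)
```

Because the transforms sit inside the pipeline, they are refitted on the training part of each fold, so no information from the held-out fold leaks into the preprocessing.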