# King County House Price Regression Project

Coded by Luna McBride

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split #Split the data into train and test
from sklearn.linear_model import LinearRegression #Add in our linear regression
from sklearn.preprocessing import StandardScaler #Test out scaling
from sklearn.neural_network import MLPRegressor #Add a multilayer perceptron to test regression ability
from sklearn import svm #Add a support vector machine to test regression ability
from sklearn.tree import DecisionTreeRegressor #Add a single tree regressor to test regression ability
from sklearn.ensemble import RandomForestRegressor #Add a forest regressor to test regression ability
from sklearn.ensemble import ExtraTreesRegressor #Add even more trees to test regression ability

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
house = pd.read_csv("../input/kc-housesales-data/kc_house_data.csv") #Read in the houses dataset
house.head() #Take a peek at the dataset

---

# Check for nulls

In [None]:
print(house.count())
print(house.isnull().any())

There are no null values, so the next step is to clean up some of the variables.

---

# Fix up some variables

## Change dates into an age

In [None]:
house["age"] = pd.DatetimeIndex(house['date']).year - house["yr_built"] #Get the age of the building
print(house['date'][1], " - ", house["yr_built"][1], " = ", house["age"][1]) #Print the equation as a sanity check

## Remove year renovated

In [None]:
house.drop(columns = ["yr_renovated"], inplace = True) #Drop the year renovated field
house.head() #Take a peek at the dataset

## Drop ID and Date, as they are not helpful here

In [None]:
house.drop(columns = ["id", "date"], inplace = True) #Drop the ID and Date Fields
house.head() #Take a peek at the dataset

## Drop the first living and lot columns

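Newer versions of these measurements exist in the sqft_living15 and sqft_lot15 variables, so I am treating those as the better representation of the current lot. Note that some data dictionaries for this dataset describe the two "15" columns as averages over the 15 nearest neighbors rather than as updated measurements.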

In [None]:
house.drop(columns = ["sqft_living", "sqft_lot"], inplace = True) #Drop the old sqft lot and living Fields
house.head() #Take a peek at the dataset

---

# Split the data

In [None]:
price = np.array(house["price"].copy().astype(int)) #Set price to be the prices of the houses
price = np.log(price) #Log the price
print(price) #Print the prices

I tried various scaling functions to raise the linear regression score. After searching online for other options, I found that, strangely, just taking the log of the price worked a whole lot better.
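One consequence of training on the log of the price is that predictions come out on the log scale and need `np.exp` to get back to dollars. A minimal sketch of that round trip, using made-up prices rather than values from the dataset:

```python
import numpy as np

# Hypothetical prices in dollars, standing in for house["price"]
prices = np.array([200000, 450000, 1000000, 7000000], dtype=float)

log_prices = np.log(prices)     # the target the models are trained on
recovered = np.exp(log_prices)  # map log-scale values back to dollars

# The gap between the largest and smallest value shrinks sharply on the log scale
ratio_raw = prices.max() / prices.min()
ratio_log = log_prices.max() / log_prices.min()

print(log_prices)
print(recovered)
print(ratio_raw, ratio_log)
```

The compressed range is one plausible reason the log helped here: a few very expensive houses no longer dominate the squared error.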

In [None]:
characteristics = house.drop("price", axis = 1) #Get every other feature of our dataframe except price
chara = pd.get_dummies(characteristics) #Get the dummies for easier model training
scale = StandardScaler() #Add a standard scaler to scale our data for easier use later
scale.fit(chara) #Fit the scaler with our characteristics
chara = scale.transform(chara) #Transform the data with our scaler

print(chara) #Print the scaled data

In [None]:
charaTrain, charaTest, priceTrain, priceTest = train_test_split(chara, price, test_size = 0.3) #Split the data into train and test
print(priceTest) #Print one of the price splits
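One thing to note about the two cells above: the scaler is fitted on every row before the split, so the test rows contribute to the mean and standard deviation used for scaling. A possible alternative, sketched here on synthetic data (the arrays `X` and `y` are stand-ins, not the real features), is to split first and let a scikit-learn pipeline fit the scaler on the training rows only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and the log prices (not the real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -0.2, 0.1, 0.0]) + 13.0 + rng.normal(scale=0.1, size=200)

# Split first, so the test rows never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The pipeline fits the scaler on the training rows, then reuses it on the test rows
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)
print(r2)
```

With a dataset this large the difference in scores is likely small, but the pipeline keeps the test set fully unseen.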

---

# Train the linear regression

In [None]:
regression = LinearRegression() #Open a linear regression model
regression.fit(charaTrain, priceTrain) #Fit the regression model

In [None]:
print(regression.score(charaTest, priceTest)) #Print the R² score of the model on the test set
print(regression.coef_) #Print the model coefficients

---

# Test other regressions

In [None]:
neural = MLPRegressor(hidden_layer_sizes = (3,100), random_state=1, max_iter=500) #Build a neural network (two hidden layers of 3 and 100 units) to test regression
neural.fit(charaTrain, priceTrain) #Fit the network with the train set

In [None]:
svr = svm.SVR() #Get a support vector regressor to test ability
svr.fit(charaTrain, priceTrain) #Fit the regressor

In [None]:
tree = DecisionTreeRegressor() #Build a tree
tree.fit(charaTrain, priceTrain) #Fit the tree

In [None]:
forest = RandomForestRegressor() #Build a whole forest of trees
forest.fit(charaTrain, priceTrain) #Fit the forest

In [None]:
forestBig = ExtraTreesRegressor() #Build a more random forest
forestBig.fit(charaTrain, priceTrain) #Fit the more random forest

---

# Print Results

In [None]:
#Print the R² scores of all the models on the test set
print("Linear Regression R²: ", regression.score(charaTest, priceTest))
print("Neural Network R²: ", neural.score(charaTest, priceTest))
print("Support Vector R²: ", svr.score(charaTest, priceTest))
print("Single Tree R²: ", tree.score(charaTest, priceTest))
print("Random Forest R²: ", forest.score(charaTest, priceTest))
print("Even more Random Forest R²: ", forestBig.score(charaTest, priceTest))

According to the scores, the strongest regressors for this data were the two forests, each reaching a score above 0.88. (For scikit-learn regressors, `score` returns the coefficient of determination R², not a classification accuracy.) These were followed close behind by the neural network and the SVM, which both scored above 0.84. Trailing far at the end were the Linear Regression and the single tree, surprisingly. There was a gap of about 0.07 between the Linear Regression and its next closest neighbor, the SVM (0.77 vs 0.84). This shows that while linear regression is the typical go-to in settings like statistics, there are better options out there depending on the data.
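Since every number above comes from the `score` method, it helps to see what that method measures for a regressor: one minus the ratio of the residual sum of squares to the total sum of squares. A small sketch with made-up true and predicted values (not results from the models above), checked against `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted log prices
y_true = np.array([12.3, 13.1, 12.8, 13.6, 14.0])
y_pred = np.array([12.5, 13.0, 12.9, 13.4, 13.8])

ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print(r2_manual, r2_sklearn)
```

A score of 0.88 therefore means the model explains about 88% of the variance in log price on the test set, which is not the same as being right 88% of the time.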

In [None]:
attributes = characteristics.columns #Get the tested attributes
attributes = list(zip(attributes, regression.coef_)) #Zip the attributes together with their coefficient
sortAtt = sorted(attributes, key = lambda x: x[1], reverse = True) #Sort the zipped attributes by their coefficients

print("According to the Linear Regression, the most important factors for pricing are: ") #Start printing the most important labels

#Print only the top five attributes in the sorted list
for label, coef in sortAtt[:5]:
    print(label) #Print the label as an important factor

In [None]:
attributes = characteristics.columns #Get the tested attributes
attributes = list(zip(attributes, forest.feature_importances_)) #Zip the attributes together with their feature importance
sortAtt = sorted(attributes, key = lambda x: x[1], reverse = True) #Sort the zipped attributes by their importance

print("According to the Random Forest (most accurate), the most important factors for pricing are: ") #Start printing the most important labels

#Print only the top five attributes in the sorted list
for label, coef in sortAtt[:5]:
    print(label) #Print the label as an important factor

In the listings of the most important features for price, the Linear Regression and the far more accurate Random Forest gave very different results. Both models listed grade among the most important features. Grade is King County's own rating of a house, so it makes sense that prices in King County track it. Both lists also include latitude and size (not counting the basement), but these features sit at very different positions in each list.

The Forest ranked location (latitude and longitude) and sizing (the size of the house and the size not counting the basement) as the most important features. This makes sense: prices tend to run higher or lower depending on the area, and size is usually an important factor when considering a house.

The Linear Regression ranked age far above everything else, which is a bit odd considering age was not even on the Forest's list. Latitude and above-ground size also appear, which tells me the Linear Regression was reaching a similar conclusion about space and location as the Forest; it just put more emphasis on age. Pricing by age can make sense, but I feel like it only matters much for very old houses. The Forest model supports this intuition: I had to raise the cutoff in the loop to 9 before age appeared, which means the Forest ranks it as the 9th most important feature.
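Instead of raising the cutoff until a feature shows up, its rank can be looked up directly. The sketch below does that with hypothetical names and numbers (they stand in for `characteristics.columns`, `forest.feature_importances_`, and `regression.coef_`, and are not results from this notebook). It also shows a variant for the linear model that ranks by the magnitude of each coefficient, since sorting by the raw value pushes large negative coefficients to the bottom of the list:

```python
import numpy as np

# Hypothetical feature names and forest importances (not real results)
names = ["grade", "lat", "long", "sqft_living15", "age"]
importances = np.array([0.40, 0.28, 0.07, 0.20, 0.05])

# Sort feature names from most to least important
order = np.argsort(importances)[::-1]
ranked = [names[i] for i in order]

# Look up the 1-based rank of a single feature directly
age_rank = ranked.index("age") + 1
print(ranked, age_rank)

# Hypothetical linear coefficients on standardized features (not real results)
coefs = np.array([0.20, 0.10, -0.01, 0.05, -0.30])

# Rank by absolute value so strongly negative coefficients count as important too
ranked_by_magnitude = [names[i] for i in np.argsort(np.abs(coefs))[::-1]]
print(ranked_by_magnitude)
```

Whether magnitude is the right notion of importance depends on the question being asked; the lists discussed above were sorted by the raw coefficient values.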