# Housing Price Prediction

The purpose of this demo is to walk through a typical process for building a model that predicts house prices.

# Data Ingestion

Data ingestion is usually one of the first steps in a data science or machine learning project.

A project may draw on many different data sources. Some of the common ones are:

- CSV, XLSX, and JSON Files
- Images and Video
- Databases

Depending on the type of data you're using, there are a few different methods you can use. For CSV or Excel files, `pandas` has straightforward functions (`read_csv` and `read_excel`) for importing into a `DataFrame`:

In [None]:
import pandas as pd

df = pd.read_csv("../input/housesalesprediction/kc_house_data.csv")
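The other file types listed above can be loaded in much the same way. The following is a minimal sketch that uses small in-memory strings in place of real files; the sample values, the variable names, and the commented-out Excel path are all invented for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for real files on disk
csv_text = "id,price,sqft_living\n1,221900,1180\n2,538000,2570\n"
json_text = '[{"id": 1, "price": 221900}, {"id": 2, "price": 538000}]'

# read_csv and read_json accept file paths or file-like objects
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

# Excel files are read with pd.read_excel, which needs an engine such as openpyxl:
# df_xlsx = pd.read_excel("some_file.xlsx")  # hypothetical path

print(df_csv.shape, df_json.shape)
```

Each call returns a `DataFrame`, so the exploration steps that follow apply regardless of the source format.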

# Data Exploration

Data exploration is an important step in understanding the data that's available. It's usually worth looking at the imported data to understand its structure and value ranges.

`df.dtypes` shows us the data types of each column in the dataframe

In [None]:
df.dtypes

`df.head()` prints the first 5 rows of the dataframe

In [None]:
df.head()

`df.describe()` gets us basic summary statistics (count, mean, standard deviation, min, quartiles, and max) for the numeric columns in our data

In [None]:
df.describe()

# Feature Selection

Feature selection is perhaps the most important part of the machine learning process - after data, of course - because this is the phase in which we define which characteristics are most relevant to the problem we're trying to solve

There are many different ways that we can do this, but often the easiest starting point is making use of correlations:

In [None]:
import seaborn as sns

In [None]:
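# numeric_only=True skips non-numeric columns (such as a date stored as text),
# which would otherwise raise an error on pandas >= 2.0
corr = df.corr(numeric_only=True)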

We can visualize the correlation matrix returned by `df.corr()` by passing it to a `seaborn` heatmap

In [None]:
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns)
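Besides reading the heatmap, it can help to rank the features by the strength of their correlation with the target. The sketch below does this on a small toy frame whose values are invented for illustration; it assumes pandas 1.5 or later for the `numeric_only` argument.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "price": [100, 150, 200, 250, 300],
    "sqft_living": [800, 1200, 1500, 2100, 2400],
    "zipcode": [5, 1, 4, 2, 3],
    "date": ["a", "b", "c", "d", "e"],  # non-numeric, excluded from the correlation
})

toy_corr = toy.corr(numeric_only=True)

# Correlation of every feature with the target, strongest first.
# abs() is used because a strong negative correlation is just as informative.
ranking = toy_corr["price"].drop("price").abs().sort_values(ascending=False)
print(ranking)
```

Applied to the real data, the same pattern would use `corr["price"]`. Correlation only captures linear relationships, so a low value does not prove a feature is useless.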

# Training a Model

In order to train a machine learning model we need to do the following:

1. Select the columns we determine are necessary
2. Split our data into training and testing datasets
3. Create a new model instance and fit the model to the data
4. Evaluate the model

Often we may also want to do the above steps with a few different models in order to find the one that represents our data best
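As a rough sketch of trying several models, the loop below fits two regressors on the same split and collects their test scores. The synthetic data and the choice of `RandomForestRegressor` as the second candidate are assumptions made for illustration, not part of this demo's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: a linear relationship plus a little noise (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] - X_demo[:, 2] + rng.normal(0, 0.5, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each candidate on the same training data and score it on the same test data
scores = {}
for name, candidate in candidates.items():
    candidate.fit(Xd_train, yd_train)
    scores[name] = candidate.score(Xd_test, yd_test)

print(scores)
```

Using one shared split keeps the comparison fair, since every model sees exactly the same training and test rows.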

## 1. Select data

In [None]:
x_labels = ['sqft_living', 'grade', 'bathrooms']
y_labels = ['price']

df_x = df[x_labels]
df_y = df[y_labels]

np_x = df_x.to_numpy()
np_y = df_y.to_numpy().ravel()  # flatten the single-column frame into a 1-D array

In [None]:
df_x.head()

In [None]:
df_y.head()

In [None]:
np_y

## 2. Split Test and Train Data

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(np_x, np_y)
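By default `train_test_split` holds out 25% of the rows and shuffles them differently on every run. The sketch below, on a small invented array, shows how `test_size` and `random_state` might be used to control the split size and make it reproducible; the value `42` is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of X_small is [2i, 2i + 1] and its label is i
X_small = np.arange(40).reshape(20, 2)
y_small = np.arange(20)

# Hold out 20% of the rows; a fixed random_state gives the same split every run
a_train, a_test, b_train, b_test = train_test_split(
    X_small, y_small, test_size=0.2, random_state=42
)

print(a_train.shape, a_test.shape)
```

Features and labels are shuffled together, so each row of `a_test` still lines up with the matching entry of `b_test`.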

Once we've got our data split into training and test sets, we can train our model on the training data using the `fit` method of the `LinearRegression` model

## 3. Fit the Model

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(x_train, y_train)

In [None]:
y_pred = model.predict(x_test)

In [None]:
y_pred
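A fitted `LinearRegression` also exposes the weights it learned through its `coef_` and `intercept_` attributes. The sketch below fits a model to invented, noise-free data generated from a known formula, so the recovered weights can be checked against it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2*x1 + 3*x2 + 5 with no noise (illustrative only)
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y_toy = 2.0 * X_toy[:, 0] + 3.0 * X_toy[:, 1] + 5.0

toy_model = LinearRegression().fit(X_toy, y_toy)

# One coefficient per input column, plus a single intercept
print(toy_model.coef_, toy_model.intercept_)
```

On the housing model, `model.coef_` would hold one weight per entry of `x_labels`, in the same order.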

## 4. Evaluate the Model

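A simple measure of model performance is the coefficient of determination, $R^2$, which `model.score` returns for regression models. A value of 1 means perfect predictions. The mean absolute error complements it by reporting the average size of the error in the units of the target (here, price).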

In [None]:
model.score(x_test, y_test)

In [None]:
from sklearn.metrics import mean_absolute_error

mean_absolute_error(y_test, y_pred)
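To make the two metrics concrete, the sketch below computes them by hand on four invented values and compares the results with scikit-learn. It uses the standard definition of $R^2$ as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Invented values for illustration
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.0, 8.0, 9.5])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((actual - predicted) ** 2)      # squared prediction errors
ss_tot = np.sum((actual - actual.mean()) ** 2)  # squared deviations from the mean
r2_manual = 1 - ss_res / ss_tot

# MAE = mean of the absolute prediction errors
mae_manual = np.mean(np.abs(actual - predicted))

print(r2_manual, r2_score(actual, predicted))
print(mae_manual, mean_absolute_error(actual, predicted))
```

Both pairs of numbers should agree, which shows that `model.score` and `mean_absolute_error` are summarizing the same prediction errors in two different ways.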

Alternatively, it can also be useful to compare predicted data to the actual data visually:

In [None]:
x_test_transp = x_test.transpose()

df_test = pd.DataFrame({
    'sqft_living': x_test_transp[0],
    'grade': x_test_transp[1],
    'bathrooms': x_test_transp[2],
    'actual_price': y_test,
    'predicted_price': y_pred
})

df_test.head()

In [None]:
# Reshape to long format so actual and predicted prices share one 'price' column
df_plot_test = df_test[['sqft_living', 'actual_price', 'predicted_price']].melt(
    'sqft_living', var_name='value', value_name='price'
)
sns.scatterplot(data=df_plot_test, x='sqft_living', y='price', hue='value')

## Talking about the results