## San Francisco Housing Pricing

1. Take a big-picture view
2. Data cleaning and exploration
3. Feature engineering for ML algorithms
4. Pick an ML model and train it -- today we use a simple linear model as an example

### 1. Take a big-picture view

In [1]:
import numpy as np
import pandas as pd
import os
# to make this notebook's output identical at every run
np.random.seed(42)

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
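# 'seaborn-white' was renamed to 'seaborn-v0_8-white' in Matplotlib 3.6; use whichever exists
style_name = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use(style_name)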

##### We load open-source data from the SF MLS historical database.

In [2]:
df = pd.read_csv('Sales.csv')
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'Sales.csv'
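
The error above means pandas could not find Sales.csv relative to the notebook's working directory. A minimal sketch of a loader that fails early with a more helpful message is shown here; `load_sales` is a hypothetical helper, and the demo file holds made-up values, not the real MLS data.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_sales(path="Sales.csv"):
    """Read the sales CSV, reporting where pandas looked if it is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path.resolve()} not found; current directory is {Path.cwd()}"
        )
    return pd.read_csv(csv_path)


# Demo on a throwaway file with invented values (hypothetical columns).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "Sales.csv"
    demo_path.write_text("sale_price,beds\n1000000,2\n1500000,3\n")
    demo = load_sales(demo_path)

print(demo.shape)
```

If the file lives elsewhere, passing its full path to the helper avoids depending on the working directory.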

By default pandas hides some columns of a wide table, so we lift the display limit to see all of them:

In [None]:
pd.set_option('display.max_columns', None)
df.head()
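
`pd.set_option` changes the display limit for the rest of the session. If we only want the wide view for one cell, `pd.option_context` is a scoped alternative. The sketch below uses a toy frame with invented values and hypothetical column names.

```python
import pandas as pd

# Toy frame with 39 columns; the names and values are invented.
wide = pd.DataFrame([[0] * 39], columns=[f"col_{i}" for i in range(39)])

limit_before = pd.get_option("display.max_columns")

with pd.option_context("display.max_columns", None):
    limit_inside = pd.get_option("display.max_columns")
    full_repr = repr(wide)  # rendered without dropping any column

# Outside the block the previous limit is back in place.
limit_after = pd.get_option("display.max_columns")
```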

In [None]:
## There are 39 attributes in total
len(df.columns)

In [None]:
df.columns

All the records are from California

In [None]:
df['state'].unique()

Further, all the records are from San Francisco

In [None]:
df['city'].unique()
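
Instead of checking columns one by one, we could list every column that holds a single distinct value. This is a small sketch on toy records (the column names mirror the notebook, the values are invented).

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["CA", "CA", "CA"],
    "city": ["San Francisco", "San Francisco", "San Francisco"],
    "beds": [2, 3, 4],
})

# nunique() counts distinct non-missing values per column.
distinct_counts = toy.nunique()
constant_columns = distinct_counts[distinct_counts == 1].index.tolist()
print(constant_columns)
```

Columns found this way carry no information for a model and are natural candidates for dropping.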

In [None]:
## summary of the housing data: columns and basic statistics
## each row represents one sale record
df.info()

In [None]:
df.describe()

##### Let's revisit this dataset:

In [None]:
df.head()

***We can have a look at each column:***

+ longitude, latitude and elevation: the precise location of the house
+ full_address: also a detailed location; with Google Maps or another mapping system, we could map the (longitude, latitude, elevation) location to a street and number

+ state and city: all records are from San Francisco, California

+ street_no, street_name, street_suffix: supplemental information for the full address

+ zip, area, district_no, district_desc: zip code, location and neighbourhood name

+ on_market_date, cdom: listing date and cumulative days on market

The other columns are self-explanatory.

##### Let's have a look at the whole data and distribution:

In [None]:
## Scatter map of the houses: colour shows sale_price, marker size shows elevation
df.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=df["elevation"]/60, label="elevation", figsize=(10,7),
    c="sale_price", cmap=plt.get_cmap("jet"), colorbar=True,
    sharex=False)
plt.legend()

### 2. Get data ready and explore data

We need to drop some columns that are redundant or not needed in our analysis.

In [None]:
df.columns

In [None]:
df.drop(['full_address', 'city', 'state', 'street_no', 'street_name', 'street_suffix', 'district_no', 'district_desc'],
        axis=1, inplace=True)

In [None]:
df.head()

We also notice that 'area' and 'subdist_no' hold the same values:

In [None]:
# Evaluates to False when no row differs, i.e. the two columns are identical
False in (df['area'] == df['subdist_no']).values
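
The same check can be phrased so that it returns True when the columns match, and so that any mismatching rows are easy to inspect. This is a sketch on toy values; only the column names come from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "area": [1, 2, 10],
    "subdist_no": [1, 2, 10],
    "zip": [94118, 94122, 94110],
})

# True only if every row has the same value in both columns.
same_everywhere = (toy["area"] == toy["subdist_no"]).all()

# Rows that disagree, if any; an empty frame confirms the redundancy.
mismatches = toy[toy["area"] != toy["subdist_no"]]
```

Note that `==` treats missing values as unequal, so rows where both columns are NaN would show up as mismatches.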

Both columns encode the region of San Francisco in which the house is located.

In [None]:
df['subdist_desc'].unique()

![avatar](https://m2p7s3n2.rocketcdn.me/wp-content/uploads/2019/12/SanFranciscoNeighborhoods.jpg)

##### For now we leave out attributes like on_market_date, cdom and sale_date. These features capture how the market received and valued a particular house.

<font color=red> ***After-class exercise: include cdom or other metrics related to sales speed in the model***</font>

In [None]:
df.drop(['area', 'subdist_no', 'zip', 'on_market_date', 'cdom', 'sale_date', 'lot_acres', 'orig_list_price'], axis=1, inplace = True)

In [None]:
df

For simplicity, we only keep houses with a positive lot size (lot_sf).

In [None]:
df = df[df['lot_sf'] > 0]
df

In [None]:
df.describe()

In [None]:
df.hist(figsize=(20,15))
plt.show()

We need to look at some features more granularly:

In [None]:
df['sale_price'].hist(bins=100)

In [None]:
df[df['sale_price'] == df['sale_price'].min()]

In [None]:
df[df['sale_price'] == df['sale_price'].max()]

In [None]:
df[df['year_built'] == df['year_built'].min()]

It seems that some of the houses are missing year_built data. We can either discard those data points or fill them in.
Here I will just drop them and leave the other approach as an exercise.

<font color=red> ***After-class exercise: fill year_built with mean or median data***</font>
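
As a starting point for the exercise, here is one possible sketch of a median fill. It assumes that a year_built of 0 or less marks a missing value (consistent with the `> 0` filter used in this notebook); the toy values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"year_built": [1925, 0, 1960, 2005, 0]})

# Assumption: non-positive years are placeholders for "unknown".
year = toy["year_built"].where(toy["year_built"] > 0, np.nan)

# median() skips NaN, so only the known years contribute.
median_year = year.median()
toy["year_built"] = year.fillna(median_year).astype(int)
print(toy["year_built"].tolist())
```

With an even number of known years the median can end in .5, in which case `astype(int)` truncates it; rounding first is an equally reasonable choice.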

In [None]:
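# .copy() gives an independent frame, so adding columns later does not warn about writing to a slice
df = df[df['year_built'] > 0].copy()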

In [None]:
df[df['year_built'] == df['year_built'].min()]

In [None]:
df['year_built'].hist()

In [None]:
df['HouseAge'] = 2021 - df['year_built']
df.drop(['year_built'], axis=1, inplace= True)
df

##### We can have a look at internal correlations between different features by scatter plotting them:

In [None]:
from pandas.plotting import scatter_matrix

attributes = ["sale_price", "rooms", "baths",
              "beds", "lot_sf", "num_parking", "HouseAge"]
scatter_matrix(df[attributes], figsize=(20, 16));

The plots suggest a positive correlation between sale_price and lot_sf, baths, rooms, etc.

In [None]:
df.plot(kind="scatter", x="sale_price", y="lot_sf",
             alpha=0.1, figsize=(12, 10))

In [None]:
df.plot(kind="scatter", x="sale_price", y="baths",
             alpha=0.1, figsize=(12, 10))

##### We can also compute the correlation matrix and sort its values. Read it with care: not all of the correlations are meaningful.

In [None]:
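# numeric_only=True skips text columns such as subdist_desc (required on pandas 2.x)
corr_matrix = df.corr(numeric_only=True)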
corr_matrix

In [None]:
corr_matrix["sale_price"].sort_values(ascending=False)
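
To make the numbers in the correlation matrix less of a black box, the sketch below computes the Pearson coefficient by hand and compares it with pandas. The toy prices and lot sizes are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "sale_price": [800_000, 1_200_000, 1_500_000, 2_100_000],
    "lot_sf": [1_800, 2_500, 2_400, 4_000],
})

x = toy["lot_sf"].to_numpy(dtype=float)
y = toy["sale_price"].to_numpy(dtype=float)

# Pearson r: sum of products of centred values, divided by the
# square root of the product of their sums of squares.
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# pandas uses the Pearson method by default.
r_pandas = toy["sale_price"].corr(toy["lot_sf"])
```

The coefficient only measures linear association, which is one reason some entries in the matrix are not meaningful on their own.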

Can you try to explain all of those relationships?

### 3. Feature engineering for ML algorithms

To be brief, feature engineering is the process of translating a representation that a program has difficulty using into something that is easy for it to digest.

Translating year_built into HouseAge, as we did above, is already a form of feature engineering.

We noticed that (longitude, latitude, elevation) together represent location. Intuitively, the exact coordinates matter less for pricing a house than its relative location, i.e. its neighbourhood.

In [None]:
df.drop(['longitude', 'latitude',  'elevation'], axis=1, inplace=True)

In [None]:
np.sort(df['subdist_desc'].unique())

So the dataset already tags each record with a location. Looking at a map of San Francisco, we find that the leading number identifies a larger region and the name after it identifies a more detailed area within that region. For simplicity, we will only consider the larger region.

In [None]:
df['subdist_desc'] = df['subdist_desc'].apply(lambda s: s.split()[0])
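
The lambda above raises an error if any label is missing, because a NaN has no `split` method. A sketch of the same extraction with the vectorised `.str` accessor, which passes missing values through, is shown here. The label strings are hypothetical; the real format in the data may differ.

```python
import pandas as pd

# Hypothetical labels in a "<region number> - <neighbourhood>" style.
labels = pd.Series(["1 - Inner Richmond", "2 - Outer Sunset", None, "10 - Bayview"])

# Split on whitespace and keep the first token; missing labels stay missing.
region = labels.str.split().str[0]
print(region.tolist())
```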

In [None]:
region_label = df[['subdist_desc']]
region_label

Since the regions are labeled with numbers, we need an encoding that keeps the model from treating those labels as numerical magnitudes.

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# sparse_output replaced the sparse argument in scikit-learn 1.2
cat_encoder = OneHotEncoder(sparse_output=False)
region_1hot = cat_encoder.fit_transform(region_label)
region_1hot

In [None]:
cat_encoder.categories_
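
To see what one-hot encoding produces, here is a small illustration on invented region labels. It uses `pd.get_dummies` rather than the `OneHotEncoder` above, purely because the resulting column names make the output easy to read.

```python
import pandas as pd

regions = pd.DataFrame({"subdist_desc": ["1", "2", "10", "2"]})

# One column per distinct label; each row has a single 1.0 in its own region's column.
one_hot = pd.get_dummies(regions["subdist_desc"], prefix="region", dtype=float)
print(one_hot)
```

Because the labels are strings, the columns are ordered as text ("1", "10", "2"), which is harmless here since the order of one-hot columns carries no meaning.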

In [None]:
df.drop(['subdist_desc'], axis=1, inplace=True)
df

In [None]:
df.values.shape

In [None]:
fulldata = np.c_[df.values, region_1hot]

In [None]:
np.c_[df.values, region_1hot].shape
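
Stacking raw arrays discards the column names, so the target later has to be found by position. An alternative sketch that keeps the names is shown below. The toy values and the `region_1`, `region_2` column names are hypothetical.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame(
    {"sale_price": [1.0e6, 1.4e6, 2.0e6], "beds": [2, 3, 4]},
    index=[5, 9, 12],  # non-contiguous index, as left behind by row filtering
)
region_1hot = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
region_cols = ["region_1", "region_2"]  # hypothetical names

# Reuse the original index so that concat aligns the rows one to one.
encoded = pd.DataFrame(region_1hot, columns=region_cols, index=numeric.index)
full_df = pd.concat([numeric, encoded], axis=1)

# Select target and features by name instead of by position.
y = full_df["sale_price"].to_numpy()
X = full_df.drop(columns="sale_price").to_numpy()
```

Passing `index=numeric.index` matters: without it, `pd.concat` would align on mismatched labels and fill the gaps with NaN.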

### 4. Pick an ML model and train it -- today we use a simple linear model as an example

***Split training and testing data sets:***

In [None]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(fulldata, test_size=0.2, random_state=42)
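
Conceptually, the split shuffles the rows and sets a fraction aside for testing. The hand-rolled NumPy sketch below illustrates the idea only; `shuffle_split` is a hypothetical helper, not scikit-learn's implementation, and it will not assign the same rows as `train_test_split` even with the same seed.

```python
import numpy as np


def shuffle_split(data, test_size=0.2, seed=42):
    """Shuffle row indices, then cut them into a train part and a test part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    n_test = int(np.ceil(len(data) * test_size))
    return data[order[n_test:]], data[order[:n_test]]


data = np.arange(20).reshape(10, 2)  # 10 toy rows
train_part, test_part = shuffle_split(data)
print(train_part.shape, test_part.shape)
```

Fixing the seed keeps the split reproducible, which is what `random_state=42` does in the cell above.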

In [None]:
# Column 0 is used as the target; this assumes sale_price is the first column of df
train_target = train_set[:, 0]
train_target = train_target.reshape(len(train_target),-1)
train_target

In [None]:
train_features = train_set[:, 1:]
train_features

In [None]:
test_target = test_set[:, 0]
test_target = test_target.reshape(len(test_target),-1)
test_features = test_set[:, 1:]

In [None]:
train_target.shape, train_features.shape

In [None]:
from sklearn.linear_model import LinearRegression
regression = LinearRegression()
regression.fit(train_features, train_target)

In [None]:
regression.coef_

In [None]:
regression.intercept_
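
`LinearRegression` fits an ordinary least squares model. To connect `coef_` and `intercept_` to that idea, the sketch below solves a least squares problem directly with NumPy on synthetic, noiseless data. The data and the "true" coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
true_coef = np.array([3.0, -2.0])
target = features @ true_coef + 5.0  # noiseless, so the fit should be exact

# Prepend a column of ones so the first solved weight acts as the intercept.
design = np.column_stack([np.ones(len(features)), features])
solution, *_ = np.linalg.lstsq(design, target, rcond=None)
intercept, coef = solution[0], solution[1:]
print(intercept, coef)
```

On the housing data each entry of `coef_` can be read the same way: the change in predicted price per unit change in that feature, holding the others fixed.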

***Then we need to evaluate the performance of the model on training and testing sets:***

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
price_predictions_on_train = regression.predict(train_features)
mse = mean_squared_error(train_target, price_predictions_on_train)
sqrtmse = np.sqrt(mse)
sqrtmse

In [None]:
price_predictions_on_test = regression.predict(test_features)
mse = mean_squared_error(test_target, price_predictions_on_test)
sqrtmse = np.sqrt(mse)
sqrtmse
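
The value computed above is the root mean squared error (RMSE), expressed in the same units as the price. A sketch of the formula written out by hand, together with a naive baseline that always predicts the mean price, is shown below. The toy numbers are invented.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error; ravel() guards against (n, 1) vs (n,) broadcasting."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


actual = np.array([1.0e6, 1.5e6, 2.0e6])
predicted = np.array([1.1e6, 1.4e6, 2.3e6])

model_rmse = rmse(actual, predicted)
# Baseline: predict the mean price for every house.
baseline_rmse = rmse(actual, np.full_like(actual, actual.mean()))
print(model_rmse, baseline_rmse)
```

Comparing the model's test RMSE with such a baseline, and with its own training RMSE, gives a first indication of whether the model has learned something useful and whether it overfits.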