## California Housing Price
- goal: predict the median house price per district
- task: supervised regression on labeled data
- dataset: https://github.com/ageron/handson-ml2/tree/master/datasets/housing

### 1. Read Data

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [None]:
df=pd.read_csv("../input/california-housing-prices/housing.csv")
df.head()

In [None]:
df.shape

In [None]:
df.isna().sum()
# total_bedrooms: 207 of 20,640 values are missing

In [None]:
df.dtypes

### 2. Data Exploration

#### a) categorical var

In [None]:
df.ocean_proximity.value_counts()

In [None]:
sns.countplot(x='ocean_proximity',data=df)

#### b) numeric var

In [None]:
df.describe()

In [None]:
df.median_house_value.hist(bins=100)
# abnormal spike at the upper end of the range: possible outliers

In [None]:
num_var=list(df.select_dtypes(include=np.number))
num_var

In [None]:
i=0
df[num_var[i]].hist(bins=100)

In [None]:
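plt.figure(figsize=(15,10))
# one histogram per numeric column on a 3x3 grid
for r,col in enumerate(num_var,start=1):
    plt.subplot(3,3,r)
    # drop NaN first: plt.hist cannot bin missing values (total_bedrooms has some)
    plt.hist(x=df[col].dropna(),bins=100)
    plt.title(col)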

In [None]:
df.hist(bins=100,figsize=(15,10))
plt.show()

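- `median_income`: expressed in tens of thousands of dollars, so 0-10 corresponds to roughly $0-$100,000.
- `housing_median_age` (50+) and `median_house_value` show a spike at the top of their range. Treat these as outliers and check whether predictions are needed in that range.
    - options for the outliers:
        - delete them, or export them as a separate dataset and predict them separately.
        - feature scaling?
- most distributions are tail-heavy; consider transforming them toward a bell shape.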
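
One common way to pull a tail-heavy feature toward a bell shape is a log transform. The sketch below is only an illustration on synthetic data: the `skewed` array and the `skewness` helper are assumptions made for this example, not part of the housing dataset or the original notebook.

```python
import numpy as np

# Synthetic right-skewed sample standing in for a tail-heavy column
# such as `population` (assumption: not the real housing data).
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=7.0, sigma=0.8, size=5000)

def skewness(x):
    """Sample skewness: the mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

# log1p computes log(1 + x); it is safe for zeros and compresses the right tail.
logged = np.log1p(skewed)

print("skewness before:", round(skewness(skewed), 2))
print("skewness after: ", round(skewness(logged), 2))
```

A skewness near 0 after the transform suggests a more symmetric distribution; whether this helps the final model still has to be checked with cross-validation.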

### Train/test Split

In [None]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df,test_size=0.33, random_state=123)
print("Total sample size = %i\n training sample size = %i \n testing sample size = %i"\
%(df.shape[0],train_set.shape[0],test_set.shape[0]))

### Better split: StratifiedSplit
- stratify the dataset by income category so that the train and test sets keep the same income distribution

In [None]:
df.median_income.hist(bins=100)

In [None]:
# create categorical income
df['income_bins']=pd.cut(df.median_income,
                         bins=[0,1.5,3,4.5,6,7.5,np.inf],
                         labels=[1,2,3,4,5,6])
df.income_bins.hist(bins=20)

In [None]:
df['income_bins']=pd.cut(df.median_income,
                         bins=[0,1.5,3,4.5,6,7.5,np.inf])
df.income_bins.value_counts()

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
split=StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=123)
for train_index,test_index in split.split(df,df.income_bins):
    # split() yields positional indices, so select rows with iloc
    train_set=df.iloc[train_index]
    test_set=df.iloc[test_index]

In [None]:
train_set.income_bins.value_counts(normalize=True)
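
To see why stratification matters, it can help to compare the income-bin proportions of a random test set and a stratified test set against the full data. The sketch below uses synthetic incomes (the `demo` frame and the `bin_share` helper are assumptions for illustration) and `train_test_split` with its `stratify` argument rather than `StratifiedShuffleSplit`; both should give a stratified split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for `median_income` (assumption: not the real data).
rng = np.random.default_rng(42)
demo = pd.DataFrame({"median_income": rng.lognormal(1.2, 0.5, size=10_000)})
demo["income_bins"] = pd.cut(demo["median_income"],
                             bins=[0, 1.5, 3, 4.5, 6, 7.5, np.inf],
                             labels=[1, 2, 3, 4, 5, 6]).astype(int)

# Purely random split vs. split stratified on the income bins.
_, rand_test = train_test_split(demo, test_size=0.2, random_state=123)
_, strat_test = train_test_split(demo, test_size=0.2, random_state=123,
                                 stratify=demo["income_bins"])

def bin_share(frame):
    """Share of rows falling in each income bin."""
    return frame["income_bins"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "overall": bin_share(demo),
    "random": bin_share(rand_test),
    "stratified": bin_share(strat_test),
})
# Absolute deviation from the overall proportions, per bin.
comparison["random_err"] = (comparison["random"] - comparison["overall"]).abs()
comparison["strat_err"] = (comparison["stratified"] - comparison["overall"]).abs()
print(comparison.round(4))
```

The `strat_err` column should stay close to zero for every bin, while `random_err` depends on the luck of the draw.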

In [None]:
print(train_set.shape[0],test_set.shape[0])

In [None]:
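# income_bins was only needed for stratification; drop it from both sets
train_set=train_set.drop('income_bins',axis=1)
test_set=test_set.drop('income_bins',axis=1)
ts=train_set.copy()
ts.shape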

#### a) geo data

In [None]:
sns.scatterplot(x=ts['longitude'],y=ts['latitude'],alpha=0.1)

In [None]:
sns.jointplot(x=ts['longitude'],y=ts['latitude'],alpha=0.1)

High-density areas:
- the Bay Area and the areas around Los Angeles and San Diego
- a long line through the Central Valley, around Sacramento and Fresno.

In [None]:
ts.plot(kind='scatter',x='longitude',y='latitude',alpha=0.4,
       s=ts.population/100,label='population',figsize=(10,6),
       c=ts.median_house_value,
       cmap=plt.get_cmap('jet'))
plt.title('California Housing Price')
plt.show()

#### b) correlation matrix

In [None]:
plt.figure(figsize=(8,6))
# numeric_only=True (pandas >= 1.5) skips the text column ocean_proximity
sns.heatmap(ts.corr(numeric_only=True),annot=True,linewidths=.5)

Correlation with **median_house_value**:
- `median_income`: 0.69
- `total_rooms`: 0.14, `housing_median_age`: 0.11
- `latitude`: -0.14

In [None]:
ts.corr(numeric_only=True).median_house_value.sort_values(ascending=False)

In [None]:
from pandas.plotting import scatter_matrix
var=['median_house_value','median_income','total_rooms','housing_median_age']
scatter_matrix(ts[var],figsize=(10,8))
plt.show()

`median_income` seems highly correlated with `median_house_value`.

In [None]:
sns.scatterplot(x=ts['median_income'],y=ts['median_house_value'],alpha=0.1)

#### c) new col

In [None]:
ts.head()

In [None]:
ts['room_per_households']=ts.total_rooms/ts.households
ts['bedroom_per_room']=ts.total_bedrooms/ts.total_rooms
ts['population_per_households']=ts.population/ts.households

#### new correlation matrix

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(ts.corr(numeric_only=True),annot=True,linewidths=.5)

In [None]:
ts.corr(numeric_only=True).median_house_value.sort_values(ascending=False)

1. The new variable `room_per_households` is more strongly correlated with `median_house_value` than `total_rooms` is.
    - the larger the house, the more expensive it tends to be.
2. The new variable `bedroom_per_room` is negatively correlated with `median_house_value`.
    - houses with a lower bedroom-to-room ratio tend to be more expensive.

### 3. Data Preparation

In [None]:
y=train_set['median_house_value'].copy()
y.head()

In [None]:
train_set.drop('median_house_value',axis=1,inplace=True)

In [None]:
housing=train_set.copy()
housing.head()

#### a) missing data

In [None]:
housing.isna().sum()

In [None]:
from sklearn.impute import SimpleImputer
housing_num=housing.drop('ocean_proximity',axis=1) #drop cat col first.

imputer=SimpleImputer(strategy='median')
X=imputer.fit_transform(housing_num)
housing_tr=pd.DataFrame(X,columns=housing_num.columns)

In [None]:
imputer.statistics_

In [None]:
housing_num.median().values

In [None]:
housing_tr.median().values

#### b) categorical data

In [None]:
# double brackets [[...]] keep a 2-D DataFrame, which the encoders expect
housing_cat=housing[['ocean_proximity']]
housing_cat.head()

In [None]:
from sklearn.preprocessing import OrdinalEncoder
oe=OrdinalEncoder()
housing_cat_encoded=oe.fit_transform(housing_cat)
housing_cat_encoded

In [None]:
oe.categories_

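- The categories have no natural order, so an ordinal encoding would imply a ranking that does not exist; one-hot encoding is a better fit.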

In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe=OneHotEncoder()
housing_cat_1hot=ohe.fit_transform(housing_cat)
housing_cat_1hot

In [None]:
housing_cat_1hot.toarray()

In [None]:
ohe.categories_

In [None]:
housing.head()

#### c) add more columns

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class CombinedAttr(BaseEstimator, TransformerMixin):
    def __init__(self,add_bedroom_per_room=True): # note: double underscores (__init__), not single
        self.add_bedroom_per_room=add_bedroom_per_room
    def fit(self,X,y=None):
        return self
    def transform(self,X,y=None):
        # column positions: 3=total_rooms, 4=total_bedrooms, 5=population, 6=households
        room_per_households=X[:,3]/X[:,6]
        population_per_households=X[:,5]/X[:,6]
        if self.add_bedroom_per_room:
            bedroom_per_room=X[:,4]/X[:,3]
            return np.c_[X,room_per_households,bedroom_per_room,population_per_households]
        else:
            return np.c_[X,room_per_households,population_per_households]

In [None]:
attr_adder=CombinedAttr(add_bedroom_per_room=True)
housing_attr=attr_adder.transform(housing.values)

#### d) feature scaling
- min-max scaling
- standardization
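
The two scaling options can be compared on a toy column. This is a minimal sketch with made-up numbers (the array `x` is an assumption for illustration), using scikit-learn's `MinMaxScaler` and `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy column with one large value (assumption: illustrative numbers only).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling: (x - min) / (max - min), maps values into [0, 1].
x_minmax = MinMaxScaler().fit_transform(x)

# Standardization: (x - mean) / std, gives zero mean and unit variance.
x_std = StandardScaler().fit_transform(x)

print(x_minmax.ravel())
print(x_std.ravel())
```

With min-max scaling the single large value squeezes the other points close to 0, which is why standardization is often the safer choice when a feature has outliers.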

#### e) Pipelines

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline=Pipeline([
            ('imputer',SimpleImputer(strategy='median')),
            ('attr_adder',CombinedAttr()),
            ('std_scaler',StandardScaler()) 
])

housing_num_tr=num_pipeline.fit_transform(housing_num)

In [None]:
from sklearn.compose import ColumnTransformer

num_var=list(housing_num)
cat_var=['ocean_proximity']

full_pipeline=ColumnTransformer([
            ('num',num_pipeline,num_var),
            ('cat',OneHotEncoder(),cat_var)
])

X=full_pipeline.fit_transform(housing)

### 4. Modeling

### 4.1 Models

#### a) linear_regression

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()
lin_reg.fit(X,y)

In [None]:
some_data=housing.iloc[:5]
some_label=y.iloc[:5]
some_prepared=full_pipeline.transform(some_data)

In [None]:
lin_reg.predict(some_prepared)

In [None]:
list(some_label)

In [None]:
from sklearn.metrics import mean_squared_error
housing_pred=lin_reg.predict(X)
lin_mse=mean_squared_error(y,housing_pred)
lin_rmse=np.sqrt(lin_mse)
lin_rmse

The RMSE is too large for this data, about half of a typical house price, so the linear model is underfitting.

#### b) decision tree

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_reg=DecisionTreeRegressor()
tree_reg.fit(X,y)

In [None]:
housing_pred=tree_reg.predict(X)
tree_mse=mean_squared_error(y,housing_pred)
tree_rmse=np.sqrt(tree_mse)
tree_rmse

An RMSE of 0 on the training set is not realistic; the tree has memorized the training data (overfitting).

### 4.2 Model Evaluation - Cross-validation

In [None]:
from sklearn.model_selection import cross_val_score
tree_rmse=cross_val_score(tree_reg,X,y,
                       scoring='neg_root_mean_squared_error',
                       cv=10)

In [None]:
def display_scores(scores):
    print('rmse Scores:',scores)
    print('Mean:',scores.mean())
    print('Std:',scores.std())

In [None]:
display_scores(-tree_rmse)

In [None]:
lin_rmse=cross_val_score(lin_reg,X,y,
                       scoring='neg_root_mean_squared_error',
                       cv=10)

In [None]:
display_scores(-lin_rmse)

The decision tree appears to be overfitting: its cross-validated RMSE is much worse than its training RMSE of 0, and linear regression scores slightly better.
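
The gap between training error and cross-validated error can be reproduced on synthetic data. The sketch below is an illustration only: `X_demo` and `y_demo` are made-up arrays, not the housing features, and the exact numbers will differ from the notebook's.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with noise (assumption: not the housing data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 1.0, size=500)

# An unconstrained tree can memorize every training point.
tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
train_rmse = np.sqrt(np.mean((tree.predict(X_demo) - y_demo) ** 2))

# Cross-validation scores the model on folds it has not seen.
# The scorer returns negative RMSE, so flip the sign.
neg_scores = cross_val_score(DecisionTreeRegressor(random_state=0),
                             X_demo, y_demo,
                             scoring="neg_root_mean_squared_error", cv=10)
cv_rmse = -neg_scores.mean()

print("training RMSE:", train_rmse)
print("10-fold CV RMSE:", cv_rmse)
```

A training RMSE of 0 next to a clearly positive cross-validated RMSE is the typical signature of overfitting.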

#### c) Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf_reg=RandomForestRegressor()
rf_reg.fit(X,y)

In [None]:
rf_rmse=cross_val_score(rf_reg,X,y,
                       scoring='neg_root_mean_squared_error',
                       cv=10)

In [None]:
display_scores(-rf_rmse)

### 4.3 Fine-Tune Model

### Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid=[
    {'n_estimators':[3,10,30],'max_features':[2,4,6,8]}, #12
    {'bootstrap':[False],'n_estimators':[3,10],'max_features':[2,3,4]}  #6
]
#12+6=18 total combinations

rf_reg=RandomForestRegressor()

grid_search=GridSearchCV(rf_reg,param_grid,cv=10,
                        scoring='neg_root_mean_squared_error',
                        return_train_score=True)
grid_search.fit(X,y)

In [None]:
grid_search.best_estimator_

In [None]:
cvres=grid_search.cv_results_
for mean_score,params in zip(cvres['mean_test_score'],cvres['params']):
    # the scorer already returns negative RMSE, so only flip the sign (no sqrt)
    print(-mean_score,params)

### Randomized Search

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
        'n_estimators': randint(1, 200),
        'max_features': randint(1, 8)
    }

rf_reg=RandomForestRegressor()

random_search = RandomizedSearchCV(rf_reg, param_distributions=param_dist,
                                   n_iter=10, cv=5, scoring='neg_root_mean_squared_error')
random_search.fit(X, y)

In [None]:
random_search.best_estimator_

In [None]:
cvres=random_search.cv_results_
for mean_score,params in zip(cvres['mean_test_score'],cvres['params']):
    # the scorer already returns negative RMSE, so only flip the sign (no sqrt)
    print(-mean_score,params)

##### Ensemble (to be updated)

### Feature Importance

In [None]:
grid_search.best_estimator_.feature_importances_

In [None]:
new_var=['room_per_households','bedroom_per_room','population_per_households']
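cat_encoder=full_pipeline.named_transformers_['cat'] # the encoder fitted inside the pipeline
features=num_var+new_var+list(cat_encoder.categories_[0])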
features

In [None]:
feature_importances = pd.Series(grid_search.best_estimator_.feature_importances_, index=features)
feature_importances.nlargest(10).plot(kind='barh')

Most important features:
`median_income`, `INLAND`, `population_per_households`.

In [None]:
# Evaluate the final model on the test set
final_model=grid_search.best_estimator_

In [None]:
X_test=test_set.drop('median_house_value',axis=1)
y_test=test_set['median_house_value'].copy()

X_test_prepared=full_pipeline.transform(X_test)
final_pred=final_model.predict(X_test_prepared)

In [None]:
final_mse=mean_squared_error(y_test,final_pred)
final_rmse=np.sqrt(final_mse)
final_rmse

In [None]:
# 95% confidence interval for the test-set RMSE (how precise is the estimate?)
from scipy import stats
confidence=0.95
squared_err=(final_pred-y_test)**2
np.sqrt(stats.t.interval(confidence,len(squared_err)-1,
                         loc=squared_err.mean(),
                         scale=stats.sem(squared_err)
                        ))
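
The interval above can be unpacked step by step: take the mean of the squared errors, add and subtract the t critical value times the standard error, then take the square root of both bounds. The sketch below checks the manual calculation against `stats.t.interval` on synthetic predictions; `y_true` and `y_pred` are made-up arrays for illustration, not the notebook's results.

```python
import numpy as np
from scipy import stats

# Synthetic targets and predictions (assumption: not the real test set).
rng = np.random.default_rng(1)
y_true = rng.uniform(100_000, 500_000, size=2_000)
y_pred = y_true + rng.normal(0, 50_000, size=2_000)

confidence = 0.95
squared_err = (y_pred - y_true) ** 2
n = len(squared_err)
mean_se = squared_err.mean()
# Standard error of the mean; stats.sem uses ddof=1 by default.
sem = squared_err.std(ddof=1) / np.sqrt(n)

# Two-sided t critical value for the chosen confidence level.
t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
manual = np.sqrt([mean_se - t_crit * sem, mean_se + t_crit * sem])

# Same interval computed by scipy in one call.
via_scipy = np.sqrt(stats.t.interval(confidence, n - 1, loc=mean_se, scale=sem))

print("manual:", manual)
print("scipy: ", via_scipy)
```

The interval is built on the mean squared error and only then converted to RMSE, so it is not symmetric around the point estimate.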

#### next step:
- data prep: transform heavy-tailed features toward a bell-shaped distribution
- modeling: 
    - Support Vector Machine regressor
    - with Grid Search & Randomized Search
    - Automate the params in GridSearchCV
    - build full pipelines
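
As a starting point for the modeling items, the sketch below wires a scaler and a Support Vector Machine regressor into one `Pipeline` and tunes it with `GridSearchCV`. It runs on small synthetic data; `X_demo`, `y_demo`, the step names and the parameter grid are assumptions chosen for illustration, not tuned values for the housing data.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Small synthetic regression problem (assumption: not the housing data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(0, 0.3, size=200)

# Scaling and the model live in one pipeline, so each CV fold
# fits the scaler only on its own training part.
svr_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svr", SVR()),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {
    "svr__kernel": ["linear", "rbf"],
    "svr__C": [1.0, 10.0],
}

search = GridSearchCV(svr_pipeline, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

print(search.best_params_)
print("CV RMSE:", -search.best_score_)
```

For the housing data, the `ColumnTransformer` from the data preparation section could take the place of the scaler step, so preprocessing options could be searched together with the model's hyperparameters.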

ref: https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb