# Supervised learning - Classification
The goal of this exercise is to complete a hands-on task whose description is similar to the one in the classification project.

We will use a modified version of the House Prices dataset.

Data source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Important attributes description:
* SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
* MSSubClass: The building class
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* Heating: Type of heating
* CentralAir: Central air conditioning
* GrLivArea: Above grade (ground) living area square feet
* BedroomAbvGr: Number of bedrooms above basement level

### Complete the following tasks:
1. **Describe what operations you are performing for each of the features**
    - Mainly focus on categorical features
2. Answer the following questions:
    - **How many values are missing?**
    - **How many instances do you have in each of the classes?**
    - **Which metric score do you propose for the classification model performance evaluation?**
        - Hint: This depends on your previous answer
3. Finish your preprocessing pipeline and split the data into the Input and Output part (i.e. X and y variables)
4. Start with the Decision tree model
    - Use 5-fold cross validation
    - **Will you use *standard* cross validation or *stratified* cross validation? Why?**
    - Compute mean of the obtained score values
5. Select one other algorithm from https://scikit-learn.org/stable/supervised_learning.html
    - Repeat the 5-fold CV
6. **Write down which model is better and why**
7. Do **5 experiments** with hyper-parameters
    - Set the parameters
    - Do the 5-fold CV
    - Note the settings and score in the Markdown cell
8. **Write down which model is the best and why**

In [43]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix, auc
from sklearn.preprocessing import OrdinalEncoder
from sklearn.neural_network import MLPClassifier

## We will use the categorized price as the target variable
- Our goal is to predict whether a house will be sold for more than 250k USD or not

In [44]:
cols = ['SalePrice', 'MSSubClass', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond',
        'YearBuilt', 'Heating', 'CentralAir', 'GrLivArea', 'BedroomAbvGr']
df = pd.read_csv('zsu_cv1_data.csv').loc[:, cols]
# Binary target: 1 if the house sold for more than 250,000 USD, otherwise 0
df['Target'] = (df.SalePrice > 250000).astype(int)
df = df.drop(['SalePrice'], axis=1)

In [45]:
df.head()

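|   | MSSubClass | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | Heating | CentralAir | GrLivArea | BedroomAbvGr | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 1Fam | 2Story | 7 | 5 | 2003 | GasA | Y | 1710 | 3 | 0 |
| 1 | 20 | 1Fam | 1Story | 6 | 8 | 1976 | GasA | Y | 1262 | 3 | 0 |
| 2 | 60 | 1Fam | 2Story | 7 | 5 | 2001 | GasA | Y | 1786 | 3 | 0 |
| 3 | 70 | 1Fam | 2Story | 7 | 5 | 1915 | GasA | Y | 1717 | 3 | 0 |
| 4 | 60 | 1Fam | 2Story | 8 | 5 | 2000 | GasA | Y | 2198 | 4 | 0 |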


## Take a look at the features

In [46]:
df.describe()

|   | MSSubClass | OverallQual | OverallCond | YearBuilt | GrLivArea | BedroomAbvGr | Target |
|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 56.89726 | 6.099315 | 5.575342 | 1971.267808 | 1515.463699 | 2.866438 | 0.14863 |
| std | 42.300571 | 1.382997 | 1.112799 | 30.202904 | 525.480383 | 0.815778 | 0.355845 |
| min | 20.0 | 1.0 | 1.0 | 1872.0 | 334.0 | 0.0 | 0.0 |
| 25% | 20.0 | 5.0 | 5.0 | 1954.0 | 1129.5 | 2.0 | 0.0 |
| 50% | 50.0 | 6.0 | 5.0 | 1973.0 | 1464.0 | 3.0 | 0.0 |
| 75% | 70.0 | 7.0 | 6.0 | 2000.0 | 1776.75 | 3.0 | 0.0 |
| max | 190.0 | 10.0 | 9.0 | 2010.0 | 5642.0 | 8.0 | 1.0 |


In [47]:
df.describe(exclude=np.number)

|   | BldgType | HouseStyle | Heating | CentralAir |
|---|---|---|---|---|
| count | 1460 | 1460 | 1460 | 1460 |
| unique | 5 | 8 | 6 | 2 |
| top | 1Fam | 1Story | GasA | Y |
| freq | 1220 | 726 | 1428 | 1365 |


In [48]:
df.dtypes

MSSubClass       int64
BldgType        object
HouseStyle      object
OverallQual      int64
OverallCond      int64
YearBuilt        int64
Heating         object
CentralAir      object
GrLivArea        int64
BedroomAbvGr     int64
Target           int64
dtype: object

# Task (2p)
- Finish the proposed tasks

**Write down your conclusion in a Markdown cell**

## Convert categorical data

Categorical columns: BldgType, HouseStyle, Heating, CentralAir

### We could order the types by mean house area, but for simplicity we encode them as dummy variables
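The ordering idea from the heading could look roughly like the sketch below: each category is replaced by its rank when the categories are sorted by the mean `GrLivArea` of their houses. The toy frame `demo` and the helper `order_by_mean` are hypothetical and only illustrate the idea; the notebook itself continues with dummy variables.

```python
import pandas as pd


def order_by_mean(frame, cat_col, num_col):
    """Map each category to its rank when categories are sorted by the mean of num_col."""
    means = frame.groupby(cat_col)[num_col].mean().sort_values()
    ranks = {category: rank for rank, category in enumerate(means.index)}
    return frame[cat_col].map(ranks)


# Hypothetical toy data, not taken from the dataset.
demo = pd.DataFrame({
    'BldgType': ['1Fam', 'Twnhs', '1Fam', 'Duplex', 'Twnhs'],
    'GrLivArea': [1700, 900, 1900, 1400, 1100],
})

# Twnhs has the smallest mean area (rank 0), 1Fam the largest (rank 2).
demo['BldgType_ord'] = order_by_mean(demo, 'BldgType', 'GrLivArea')
```

This keeps a single column per feature, at the cost of imposing an order that a tree can split on but that may not be meaningful for every model.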

In [49]:
df.BldgType.unique()

array(['1Fam', '2fmCon', 'Duplex', 'TwnhsE', 'Twnhs'], dtype=object)

In [50]:
df = df.join(pd.get_dummies(df.BldgType, prefix='BldgType')).drop(['BldgType'], axis=1)

In [51]:
df.HouseStyle.unique()

array(['2Story', '1Story', '1.5Fin', '1.5Unf', 'SFoyer', 'SLvl', '2.5Unf',
       '2.5Fin'], dtype=object)

In [52]:
df = df.join(pd.get_dummies(df.HouseStyle, prefix='HouseStyle')).drop(['HouseStyle'], axis=1)

In [53]:
df.Heating.unique()

array(['GasA', 'GasW', 'Grav', 'Wall', 'OthW', 'Floor'], dtype=object)

In [54]:
df = df.join(pd.get_dummies(df.Heating, prefix='Heating')).drop(['Heating'], axis=1)
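The three join/drop steps above can also be written as one call, since `pd.get_dummies` accepts a `columns` argument and drops the encoded originals itself. The sketch below uses a small made-up frame (`demo`) rather than the real data.

```python
import pandas as pd

# Hypothetical toy frame with the same three categorical columns.
demo = pd.DataFrame({
    'BldgType': ['1Fam', 'Duplex'],
    'HouseStyle': ['2Story', '1Story'],
    'Heating': ['GasA', 'GasW'],
    'GrLivArea': [1710, 1262],
})

# One call encodes every listed column and removes the original columns.
# The column name is used as the prefix, e.g. 'BldgType_1Fam'.
encoded = pd.get_dummies(demo, columns=['BldgType', 'HouseStyle', 'Heating'])
```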

### Binary type

In [55]:
df.CentralAir.unique()

array(['Y', 'N'], dtype=object)

In [56]:
# Explicit category order: 'Y' is encoded as 0.0 and 'N' as 1.0
enc_air = OrdinalEncoder(categories=[['Y', 'N']])
df['CentralAir'] = enc_air.fit_transform(df[['CentralAir']])[:, 0]
df.head()

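|   | MSSubClass | OverallQual | OverallCond | YearBuilt | CentralAir | GrLivArea | BedroomAbvGr | Target | BldgType_1Fam | BldgType_2fmCon | ... | HouseStyle_2.5Unf | HouseStyle_2Story | HouseStyle_SFoyer | HouseStyle_SLvl | Heating_Floor | Heating_GasA | Heating_GasW | Heating_Grav | Heating_OthW | Heating_Wall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 7 | 5 | 2003 | 0.0 | 1710 | 3 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 20 | 6 | 8 | 1976 | 0.0 | 1262 | 3 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 60 | 7 | 5 | 2001 | 0.0 | 1786 | 3 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 70 | 7 | 5 | 1915 | 0.0 | 1717 | 3 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 60 | 8 | 5 | 2000 | 0.0 | 2198 | 4 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |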


## Missing values

In [57]:
df.apply(lambda x: x.isna().sum()).sort_values(ascending=False)

MSSubClass           0
HouseStyle_1.5Unf    0
Heating_OthW         0
Heating_Grav         0
Heating_GasW         0
Heating_GasA         0
Heating_Floor        0
HouseStyle_SLvl      0
HouseStyle_SFoyer    0
HouseStyle_2Story    0
HouseStyle_2.5Unf    0
HouseStyle_2.5Fin    0
HouseStyle_1Story    0
HouseStyle_1.5Fin    0
OverallQual          0
BldgType_TwnhsE      0
BldgType_Twnhs       0
BldgType_Duplex      0
BldgType_2fmCon      0
BldgType_1Fam        0
Target               0
BedroomAbvGr         0
GrLivArea            0
CentralAir           0
YearBuilt            0
OverallCond          0
Heating_Wall         0
dtype: int64
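Every column reports zero missing values. For the question "How many values are missing?", a single total can be easier to read than the per-column listing. A minimal sketch on a made-up frame (`demo` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with three missing entries in total.
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', None, None]})

# isna() marks missing cells; the first sum() counts them per column,
# the second sum() adds the per-column counts into one number.
missing_per_column = demo.isna().sum()
total_missing = int(missing_per_column.sum())
```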

In [58]:
df.Target.value_counts()

0    1243
1     217
Name: Target, dtype: int64
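The classes are imbalanced: 217 of 1460 houses (about 15 %) belong to class 1. The sketch below illustrates why this matters for the choice of metric. A classifier that always predicts the majority class reaches an accuracy of about 0.85 without finding a single expensive house, while its F1 score for class 1 is 0. The labels are rebuilt from the counts above; `y_majority` is an illustrative name.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Class counts copied from the value_counts() output above.
y_true = np.array([0] * 1243 + [1] * 217)

# Trivial baseline: always predict the majority class 0.
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)       # 1243 / 1460
baseline_f1 = f1_score(y_true, y_majority, zero_division=0)  # no positives predicted
```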

In [59]:
X, y = df.loc[:, df.columns != 'Target'], df.loc[:, 'Target']

In [60]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1168, 26), (292, 26), (1168,), (292,))
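With imbalanced classes, a plain random split can leave the test set with a different share of positives than the training set. `train_test_split` has a `stratify` parameter that keeps the class proportions the same in both parts. A minimal sketch on synthetic labels (`X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 85 negatives and 15 positives.
y_demo = np.array([0] * 85 + [1] * 15)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the 85:15 ratio in both the train and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)
```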

In [61]:
# Standard (non-stratified) 5-fold CV, for comparison with the stratified variant
kf = KFold(n_splits=5)
scores = list()
for train_index, test_index in kf.split(X, y):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    scores.append(f1_score(y_test, y_pred))
scores

[0.6444444444444444,
 0.7722772277227722,
 0.6818181818181819,
 0.7142857142857142,
 0.6987951807228915]

In [62]:
np.mean(scores)

0.7023241497988009

In [63]:
skf = StratifiedKFold(n_splits=5, shuffle=True)
scores = list()
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    scores.append(f1_score(y_test, y_pred))
    
scores

[0.7472527472527473,
 0.761904761904762,
 0.6666666666666667,
 0.7010309278350515,
 0.6829268292682926]

In [64]:
np.mean(scores)

0.7119563865855041
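The manual loop can be condensed with `cross_val_score`, which fits and scores the model once per fold. The sketch below runs on synthetic imbalanced data from `make_classification`, not on the house data, and fixes `random_state` so that the folds are reproducible (the cells above shuffle without a seed, so their scores change between runs). It also counts the positives per test fold to show what stratification guarantees.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: roughly 15 % positives.
X_demo, y_demo = make_classification(
    n_samples=500, weights=[0.85, 0.15], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One F1 score per fold, then the mean over the folds.
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=cv, scoring='f1')
mean_f1 = scores.mean()

# Stratification keeps the number of positives nearly equal across test folds.
fold_positives = [int(y_demo[test_idx].sum()) for _, test_idx in cv.split(X_demo, y_demo)]
```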

In [65]:
skf = StratifiedKFold(n_splits=5, shuffle=True)
scores = list()
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = MLPClassifier(hidden_layer_sizes=(100, 50, 50, 100), activation='relu',
                        solver='adam', random_state=13, max_iter=10000)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    scores.append(f1_score(y_test, y_pred))
    
scores

[0.3548387096774194,
 0.4615384615384615,
 0.6373626373626373,
 0.20408163265306123,
 0.28571428571428575]

In [66]:
np.mean(scores)

0.38870714538917306

In [67]:
skf = StratifiedKFold(n_splits=5, shuffle=True)
scores = list()
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = MLPClassifier(hidden_layer_sizes=(20, 200, 200, 20), activation='relu',
                        solver='adam', random_state=13, max_iter=10000)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    scores.append(f1_score(y_test, y_pred))
    
scores

[0.6410256410256411,
 0.5287356321839081,
 0.47852760736196315,
 0.4923076923076923,
 0.6329113924050633]

In [72]:
np.mean(scores)

0.4486068091119642

In [68]:
skf = StratifiedKFold(n_splits=5, shuffle=True)
scores = list()
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = MLPClassifier(hidden_layer_sizes=(15,), activation='tanh',
                        solver='adam', random_state=13, max_iter=10000)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    scores.append(f1_score(y_test, y_pred))
    
scores

[0.0, 0.0851063829787234, 0.6097560975609757, 0.08510638297872342, 0.125]

In [69]:
np.mean(scores)

0.1809937727036845

In [70]:
skf = StratifiedKFold(n_splits=5, shuffle=True)
scores = list()
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = MLPClassifier(hidden_layer_sizes=(180, 180, 280), activation='relu',
                        solver='adam', random_state=2, max_iter=100000)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    scores.append(f1_score(y_test, y_pred))
    
scores

[0.5179856115107913,
 0.4137931034482758,
 0.577319587628866,
 0.6506024096385543,
 0.08333333333333334]

In [71]:
np.mean(scores)

0.4486068091119642

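| Algorithm | Hidden layers | Activation | max_iter | Mean F1 |
|---|---|---|---|---|
| Decision Tree | - | - | - | 0.71 |
| MLP | (100, 50, 50, 100) | relu | 10000 | 0.39 |
| MLP | (20, 200, 200, 20) | relu | 10000 | 0.55 |
| MLP | (15,) | tanh | 10000 | 0.18 |
| MLP | (180, 180, 280) | relu | 100000 | 0.45 |

The mean for the (20, 200, 200, 20) network is computed from the five fold scores listed for that run. The `np.mean(scores)` cell below it (In [72]) was executed after the last experiment and therefore repeats that experiment's mean.

The decision tree with default hyper-parameters performed best (mean F1 of about 0.71 with stratified 5-fold CV).
I tried changing a few MLP hyper-parameters (layer sizes, activation function, max_iter), but none of the configurations matched it: the best MLP reached about 0.55, and the MLP scores varied widely between folds.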
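One possible reason for the weak MLP scores, not verified in this notebook, is that the features are on very different scales (for example `YearBuilt` around 2000 next to 0/1 dummy columns), which gradient-based training is sensitive to. A common remedy is to standardize the features inside a pipeline, so the scaler is fitted only on the training part of each fold. The sketch below is an assumption-laden illustration on synthetic data, not a result for the house data; `X_demo`, `y_demo` and `column_scales` are made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data with ten features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.85, 0.15], random_state=0)

# Stretch some columns to mimic features measured on very different scales.
column_scales = np.array([1, 10, 100, 1000, 1, 1, 1, 1, 1, 2000])
X_demo = X_demo * column_scales

# The pipeline refits the scaler on the training folds only, so no
# information from the test fold leaks into the scaling.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(15,), max_iter=2000, random_state=13))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scaled_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='f1')
```

Whether scaling closes the gap to the decision tree on the real data would have to be checked by rerunning the experiments above with such a pipeline.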