# Supervised learning - Classification
The goal of this exercise is to complete a hands-on task whose description is similar to that of the classification project.

We will use a modified version of the House Prices dataset.

Data source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

Description of the important attributes:
* SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
* MSSubClass: The building class
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* Heating: Type of heating
* CentralAir: Central air conditioning
* GrLivArea: Above grade (ground) living area square feet
* BedroomAbvGr: Number of bedrooms above basement level

### Complete the following tasks:
1. **Describe what operations you are performing for each of the features**
    - Mainly focus on categorical features
2. Answer the following questions:
    - **How many values are missing?**
    - **How many instances do you have in each of the classes?**
    - **Which metric do you propose for evaluating the performance of the classification model?**
        - Hint: This depends on your previous answer
3. Finish your preprocessing pipeline and split the data into the Input and Output part (i.e. X and y variables)
4. Start with the Decision tree model
    - Use 5-fold cross validation
    - **Will you use *standard* cross validation or *stratified* cross validation? Why?**
    - Compute the mean of the obtained scores
5. Select one other algorithm from https://scikit-learn.org/stable/supervised_learning.html
    - Repeat the 5-fold CV
6. **Write down which model is better and why**
7. Do **5 experiments** with hyper-parameters
    - Set the parameters
    - Do the 5-fold CV
    - Note the settings and score in the Markdown cell
8. **Write down which model is the best and why**

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix, auc
from sklearn.preprocessing import OrdinalEncoder
from sklearn.neural_network import MLPClassifier

## We will use the categorized price as the target variable
- Our goal is to predict whether the house will be sold for more than 250k USD or not

In [2]:
# Load the data and keep only the attributes described above
df = pd.read_csv('zsu_cv1_data.csv').loc[:, ['SalePrice','MSSubClass','BldgType','HouseStyle','OverallQual','OverallCond','YearBuilt','Heating','CentralAir','GrLivArea','BedroomAbvGr']]
# Binary target: 1 if the house was sold for more than 250,000 USD, otherwise 0
df['Target'] = (df.SalePrice > 250000).astype(int)
df = df.drop(['SalePrice'], axis=1)

In [3]:
df.head()

,MSSubClass,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,Heating,CentralAir,GrLivArea,BedroomAbvGr,Target
0,60,1Fam,2Story,7,5,2003,GasA,Y,1710,3,0
1,20,1Fam,1Story,6,8,1976,GasA,Y,1262,3,0
2,60,1Fam,2Story,7,5,2001,GasA,Y,1786,3,0
3,70,1Fam,2Story,7,5,1915,GasA,Y,1717,3,0
4,60,1Fam,2Story,8,5,2000,GasA,Y,2198,4,0


## Take a look at the features

In [4]:
df.describe()

,MSSubClass,OverallQual,OverallCond,YearBuilt,GrLivArea,BedroomAbvGr,Target
count,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,56.89726,6.099315,5.575342,1971.267808,1515.463699,2.866438,0.14863
std,42.300571,1.382997,1.112799,30.202904,525.480383,0.815778,0.355845
min,20.0,1.0,1.0,1872.0,334.0,0.0,0.0
25%,20.0,5.0,5.0,1954.0,1129.5,2.0,0.0
50%,50.0,6.0,5.0,1973.0,1464.0,3.0,0.0
75%,70.0,7.0,6.0,2000.0,1776.75,3.0,0.0
max,190.0,10.0,9.0,2010.0,5642.0,8.0,1.0


In [5]:
df.describe(exclude=np.number)

,BldgType,HouseStyle,Heating,CentralAir
count,1460,1460,1460,1460
unique,5,8,6,2
top,1Fam,1Story,GasA,Y
freq,1220,726,1428,1365


# Task (2p)
- Finish the proposed tasks

**Write your conclusions in a Markdown cell**
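
For the missing-values question, note that both `describe()` outputs above report a count of 1460 for every column, which is consistent with no values missing in the selected attributes. An explicit check could look like the minimal sketch below. It uses a made-up toy frame (`toy` is hypothetical, not the notebook's `df`), so the numbers it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, np.nan, 1717],
    "Heating": ["GasA", None, "GasA", "GasW"],
    "CentralAir": ["Y", "Y", "N", "Y"],
})

# Missing values per column, then the total over the whole frame.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("total missing:", total_missing)
```

Running the same two lines on `df` instead of `toy` would give the per-column and total counts for the actual dataset.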

#### Number of instances per class in the 'Target' column

In [8]:
df["Target"].value_counts()

0    1243
1     217
Name: Target, dtype: int64
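
The classes are clearly imbalanced (1243 vs. 217), which matters for the choice of metric. As a rough illustration, the sketch below rebuilds a label vector from these counts and scores a hypothetical baseline that always predicts class 0. The baseline is an assumption made for this example only; it is not a model used in the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Class counts taken from the value_counts() output above.
y_true = np.array([0] * 1243 + [1] * 217)

# Hypothetical baseline that always predicts the majority class 0.
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
# zero_division=0 silences the warning raised when no positives are predicted.
baseline_f1 = f1_score(y_true, y_majority, zero_division=0)

print(f"accuracy: {baseline_accuracy:.3f}")
print(f"F1:       {baseline_f1:.3f}")
```

The baseline reaches an accuracy of about 0.85 without ever finding an expensive house, while its F1 score for the positive class is 0. This is one argument for preferring F1 (or precision and recall) over accuracy here.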

#### Encoding all categorical features as integer values

In [16]:
myTransformedData = pd.get_dummies(df)
myTransformedData

,MSSubClass,OverallQual,OverallCond,YearBuilt,GrLivArea,BedroomAbvGr,Target,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,...,HouseStyle_SFoyer,HouseStyle_SLvl,Heating_Floor,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW,Heating_Wall,CentralAir_N,CentralAir_Y
0,60,7,5,2003,1710,3,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
1,20,6,8,1976,1262,3,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
2,60,7,5,2001,1786,3,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
3,70,7,5,1915,1717,3,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
4,60,8,5,2000,2198,4,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,60,6,5,1999,1647,3,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
1456,20,6,6,1978,2073,3,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
1457,70,7,9,1941,2340,4,1,1,0,0,...,0,0,0,1,0,0,0,0,0,1
1458,20,5,6,1950,1078,2,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
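
By default `pd.get_dummies` expands only the object (string) columns, so `MSSubClass` stays a single numeric column in the output above even though it encodes a building class. Whether it should be one-hot encoded as well is a judgment call. The sketch below shows how that could be done on a hypothetical toy frame (`toy` and its values are illustrative only).

```python
import pandas as pd

# Hypothetical toy rows; column names follow the dataset, values are illustrative.
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 70],
    "CentralAir": ["Y", "Y", "N", "Y"],
    "GrLivArea": [1710, 1262, 1786, 1717],
})

# Default call: only the object column CentralAir is expanded.
default_encoded = pd.get_dummies(toy, dtype=int)

# Explicit call: MSSubClass is treated as categorical as well.
explicit_encoded = pd.get_dummies(toy, columns=["MSSubClass", "CentralAir"], dtype=int)

print(list(default_encoded.columns))
print(list(explicit_encoded.columns))
```

Passing `dtype=int` keeps the indicator columns as 0/1 integers, matching the table above, regardless of the pandas version.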


#### Splitting the data into inputs and outputs

In [18]:
X, y = myTransformedData.loc[:, myTransformedData.columns != 'Target'], myTransformedData.loc[:, 'Target']
# X.head(), y.head()

#### Train / test split

In [13]:
# X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=.2, random_state=15)

### Decision Tree

In [20]:
# Stratified 5-fold CV keeps the class ratio similar in every fold
skf = StratifiedKFold(n_splits=5)
scores = list()
for train_index, test_index in skf.split(X,y):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = DecisionTreeClassifier()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # F1 of the positive class (price above 250k USD) on the held-out fold
    scores.append(f1_score(y_test, y_pred))

np.mean(scores), scores

(0.6912148421178049,
 [0.6451612903225806,
  0.7906976744186046,
  0.6097560975609757,
  0.7191011235955056,
  0.691358024691358])
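
The loop above uses `StratifiedKFold` rather than plain `KFold`. One way to see why that is a reasonable choice for imbalanced classes is to count the positive samples in each test fold. The sketch below rebuilds the labels from the class counts reported earlier and uses placeholder features (`X_demo` and `y_demo` are stand-ins, not the notebook's `X` and `y`).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels rebuilt from the class counts reported earlier (1243 negatives, 217 positives).
y_demo = np.array([0] * 1243 + [1] * 217)
# Placeholder features; split() only needs an array of the right length.
X_demo = np.zeros((len(y_demo), 1))

skf_demo = StratifiedKFold(n_splits=5)
positives_per_fold = [
    int(y_demo[test_idx].sum()) for _, test_idx in skf_demo.split(X_demo, y_demo)
]

print(positives_per_fold)
```

Each test fold should contain 43 or 44 of the 217 positive samples, so every fold's F1 score is computed on a comparable number of positives. With a non-stratified split, that number can vary from fold to fold.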

### Random forest algorithm

In [None]:
# Same stratified 5-fold protocol as for the decision tree, with a 50-tree random forest
skf = StratifiedKFold(n_splits=5)
scores = list()
for train_index, test_index in skf.split(X,y):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    clf = RandomForestClassifier(n_estimators=50)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # F1 of the positive class on the held-out fold
    scores.append(f1_score(y_test, y_pred))

np.mean(scores), scores
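
For the five hyper-parameter experiments from the task list, the manual loop can be replaced by `cross_val_score`. The following is only a sketch: it runs on synthetic, imbalanced data from `make_classification` rather than on the house-price data, and the five settings are arbitrary illustrations, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (roughly 85 % / 15 %); not the house-price data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.85, 0.15], random_state=0
)

# Five illustrative settings; the specific values are assumptions, not recommendations.
settings = [
    {"n_estimators": 10},
    {"n_estimators": 50},
    {"n_estimators": 50, "max_depth": 3},
    {"n_estimators": 50, "min_samples_leaf": 5},
    {"n_estimators": 50, "class_weight": "balanced"},
]

cv = StratifiedKFold(n_splits=5)
results = []
for params in settings:
    clf = RandomForestClassifier(random_state=0, **params)
    # One F1 score per fold; the mean is what gets noted for each experiment.
    fold_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="f1")
    results.append((params, fold_scores.mean()))

for params, mean_f1 in results:
    print(params, round(mean_f1, 3))
```

Replacing `X_demo` and `y_demo` with the notebook's `X` and `y` would produce the setting and score pairs to record in the Markdown cell. Fixing `random_state` makes the runs repeatable, which helps when comparing settings.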