# Praca domowa 4
## Wstęp do uczenia maszynowego
### Paweł Morgen

In [3]:
import numpy as np
import pandas as pd

## Wczytanie danych

### Pierwszy zbiór - apartments

In [4]:
apartments = pd.read_csv('apartments.csv')
apartments_test = pd.read_csv('apartments_test.csv')
apartments.head()

Unnamed: 0,m2.price,construction.year,surface,floor,no.rooms,district
0,5897,1953,25,3,1,Srodmiescie
1,1818,1992,143,9,5,Bielany
2,3643,1937,56,1,2,Praga
3,3517,1995,93,7,3,Ochota
4,3013,1992,144,6,5,Mokotow


### Drugi zbiór - drug consumption

Zbiór [drug consumption](https://archive.ics.uci.edu/ml/datasets/Drug+consumption+%28quantified%29) zawiera informacje odnośnie profilu osobowości respondentów (wyniki z testu pięciu wymiarów osobowości), podstawowe informacje (wiek, płeć, rasa itd) oraz dane na temat spożycia narkotyków (zmienne celu). Zmienne informatywne (nie celu) zostały skwantyfikowane i mogą być traktowane jak zmienne numeryczne.

In [7]:
drugs = pd.read_table('drug_consumption.data', 
                      sep = ',',
                     header = None,
                     names = ['ID', 'Age', 'Gender', 'Education', 'Country', 'Ethnicity', 'Nscore', 'Escore', 'Oscore', 'Ascore', 'Cscore', 'Impulsive', 'SS', 'Alcohol', 'Amphet', 'Amyl', 'Benzos', 'Caff', 'Cannabis', 'Choc', 'Coke', 'Crack', 'Ecstasy', 'Heroin', 'Ketamine', 'Legalh', 'LSD', 'Meth', 'Mushrooms', 'Nicotine', 'Semer', 'VSA'])
drugs.head()

Unnamed: 0,ID,Age,Gender,Education,Country,Ethnicity,Nscore,Escore,Oscore,Ascore,...,Ecstasy,Heroin,Ketamine,Legalh,LSD,Meth,Mushrooms,Nicotine,Semer,VSA
0,1,0.49788,0.48246,-0.05921,0.96082,0.126,0.31287,-0.57545,-0.58331,-0.91699,...,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL2,CL0,CL0
1,2,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,...,CL4,CL0,CL2,CL0,CL2,CL3,CL0,CL4,CL0,CL0
2,3,0.49788,-0.48246,-0.05921,0.96082,-0.31685,-0.46725,0.80523,-0.84732,-1.6209,...,CL0,CL0,CL0,CL0,CL0,CL0,CL1,CL0,CL0,CL0
3,4,-0.95197,0.48246,1.16365,0.96082,-0.31685,-0.14882,-0.80615,-0.01928,0.59042,...,CL0,CL0,CL2,CL0,CL0,CL0,CL0,CL2,CL0,CL0
4,5,0.49788,0.48246,1.98437,0.96082,-0.31685,0.73545,-1.6334,-0.45174,-0.30172,...,CL1,CL0,CL0,CL1,CL0,CL0,CL2,CL2,CL0,CL0


#### Inżynieria cech
Zmienne kategoryczne dotyczące spożycia należy rozumieć w następujący sposób:

 * CL0 - nigdy nie używano
 * CL1 - używano ponad 10 lat temu
 * CL2 - używano w ciągu ostatnich 10 lat
 * CL3 - używano w ciągu ostatniego roku
 * CL4 - używano w ciągu ostatniego miesiąca
 * CL5 - używano w ciągu ostatniego tygodnia
 * CL6 - używano w ciągu ostatniego dnia

Autorzy zawarli informacje o każdym typie używek w osobnych kolumnach. W tej pracy domowej skupię się na narkotykach *per se* (a więc nie interesują nas dane nt. spożycia alkoholu, kofeiny czy czekolady). W tym celu powstanie zbiorcza zmienna Drugs, zbierająca dane nt każdego rodzaju narkotyków.

Autorzy pytali również o spożywanie fikcyjnego narkotyku *Semeron*. Respondentów twierdzących, że kiedykolwiek spożywali ten narkotyk, wykluczymy z puli jako niewiarygodnych.

In [8]:
# Usuwamy kłamców
drugs = drugs.loc[drugs.loc[:,'Semer'] == 'CL0'].reset_index()

# Tworzymy zbiorczą kolumnę
def decode_drugs(col):
    col = col.str.replace('CL', '')
    return pd.to_numeric(col)
not_drugs = ['Alcohol','Amyl','Benzos', 'Caff', 'Cannabis', 'Choc', 'Nicotine', 'VSA']
vec = drugs.loc[:,'Alcohol':'VSA'].drop(not_drugs, axis = 1).apply(decode_drugs).apply(axis = 1, func = np.max)

# Pozbywamy się danych nie-numerycznych i dodajemy zbiorczą kolumnę. 
# Osoby, które od co najmniej 10 lat NIE zażywały narkotyków zaliczamy do grupy '0', a resztę - '1'.
conv = np.array([0,0,1,1,1,1])
drugs = drugs.loc[:,"Age":"SS"].assign(Drugs = pd.Series(conv[vec - 1]).astype('category'))


## Trening modelu

In [9]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import matthews_corrcoef, accuracy_score, classification_report

# Podział na zbiór testowy i treningowy

apts_x_train = apartments.drop('district', axis = 1)
apts_y_train = apartments.loc[:,'district']
apts_x_test = apartments_test.drop('district', axis = 1)
apts_y_test = apartments_test.loc[:,'district']

drugs_x_train, drugs_x_test, drugs_y_train, drugs_y_test = train_test_split(drugs.drop('Drugs', axis = 1),
                                                                           drugs.loc[:,'Drugs'],
                                                                           test_size = 0.3)
# Nie ma potrzeby kodowania zmiennych
# Skalujemy ?

steps_scaled = [('scale', StandardScaler()),
               ('SVM', SVC(kernel = 'rbf', class_weight = 'balanced'))]
steps_raw = [('SVM', SVC(kernel = 'rbf', class_weight = 'balanced'))]

pip_scaled = Pipeline(steps_scaled)
pip_raw = Pipeline(steps_raw)

pip_scaled.fit(apts_x_train, apts_y_train)
pip_raw.fit(apts_x_train, apts_y_train)

apts_pred_scaled = pip_scaled.predict(apts_x_test)
apts_pred_raw = pip_raw.predict(apts_x_test)

print("MCC for scaled: {}".format(matthews_corrcoef(apts_y_test, apts_pred_scaled)))
print("MCC for unscaled: {}".format(matthews_corrcoef(apts_y_test, apts_pred_raw)))
print("Accuracy for scaled: {}".format(accuracy_score(apts_y_test, apts_pred_scaled)))
print("Accuracy for unscaled: {}".format(accuracy_score(apts_y_test, apts_pred_raw)))
# print(classification_report(apts_y_test, apts_pred))

MCC for scaled: 0.2258825697442548
MCC for unscaled: 0.14112030799723416
Accuracy for scaled: 0.30177777777777776
Accuracy for unscaled: 0.21955555555555556


In [10]:
pip_scaled = Pipeline(steps_scaled)
pip_raw = Pipeline(steps_raw)

pip_scaled.fit(drugs_x_train, drugs_y_train)
pip_raw.fit(drugs_x_train, drugs_y_train)

drugs_pred_scaled = pip_scaled.predict(drugs_x_test)
drugs_pred_raw = pip_raw.predict(drugs_x_test)

print("MCC for scaled: {}".format(matthews_corrcoef(drugs_y_test, drugs_pred_scaled)))
print("MCC for unscaled: {}".format(matthews_corrcoef(drugs_y_test, drugs_pred_raw)))
print("Accuracy for scaled: {}".format(accuracy_score(drugs_y_test, drugs_pred_scaled)))
print("Accuracy for unscaled: {}".format(accuracy_score(drugs_y_test, drugs_pred_raw)))


MCC for scaled: 0.23663838960962408
MCC for unscaled: 0.2655264573405789
Accuracy for scaled: 0.6134751773049646
Accuracy for unscaled: 0.6365248226950354


Dla zbioru `apartments` obsewrujemy spory spadek w skuteczności. Zbiór `drugs` już jest zbalansowany, zatem brak różnic w skuteczności modelu nas nie zaskakuje.

## Optymalizacja hiperparametrów

In [11]:
from sklearn.model_selection import GridSearchCV

parameters_rbf = {'SVM__C': np.power(10,np.linspace(-2, 2, 10)),
                  'SVM__gamma': np.power(10,np.linspace(-2, 0, 10))}
parameters_poly = {'SVM__C': np.power(10,np.linspace(-2, 2, 5)),
                  'SVM__gamma': np.power(10,np.linspace(-2, 0, 5)),
                  'SVM__degree' : [2,3,4]}
parameters_lin = {'SVM__C': np.power(10,np.linspace(-2, 2, 100))}

steps_rbf = [('scale', StandardScaler()),
            ('SVM', SVC(kernel = 'rbf', class_weight = 'balanced', cache_size = 800))]
steps_poly = [('scale', StandardScaler()),
            ('SVM', SVC(kernel = 'poly', class_weight = 'balanced', cache_size = 800))]
steps_lin = [('scale', StandardScaler()),
            ('SVM', SVC(kernel = 'linear', class_weight = 'balanced', cache_size = 800))]

pip_rbf = Pipeline(steps_rbf)
pip_poly = Pipeline(steps_poly)
pip_lin = Pipeline(steps_lin)

cv_rbf = GridSearchCV(pip_rbf, parameters_rbf, n_jobs = -1, cv = 3)
cv_poly = GridSearchCV(pip_poly, parameters_poly, n_jobs = -1, cv = 3)
cv_lin = GridSearchCV(pip_lin, parameters_lin, n_jobs = -1, cv = 3)

In [19]:
print('Starting fitting')
cv_rbf.fit(apts_x_train, apts_y_train)
print('rbf fitted')
cv_poly.fit(apts_x_train, apts_y_train)
print('poly fitted')
cv_lin.fit(apts_x_train, apts_y_train)
print('linear fitted')

apts_pred_rbf = cv_rbf.predict(apts_x_test)
apts_pred_poly = cv_poly.predict(apts_x_test)
apts_pred_lin = cv_lin.predict(apts_x_test)

print("MCC with gaussian kernel: {}".format(matthews_corrcoef(apts_y_test, apts_pred_rbf)))
print("MCC with polynomial kernel: {}".format(matthews_corrcoef(apts_y_test, apts_pred_poly)))
print("MCC with linear kernel: {}".format(matthews_corrcoef(apts_y_test, apts_pred_lin)))

Starting fitting
rbf fitted
poly fitted
linear fitted
MCC with gaussian kernel: 0.22805194582334865
MCC with polynomial kernel: 0.22868185880021788
MCC with linear kernel: 0.2396963616362826


In [20]:
print(cv_rbf.best_params_)
print(cv_poly.best_params_)
print(cv_lin.best_params_)

{'SVM__C': 100.0, 'SVM__gamma': 0.0774263682681127}
{'SVM__C': 1.0, 'SVM__degree': 3, 'SVM__gamma': 1.0}
{'SVM__C': 0.5462277217684343}


In [12]:
print('Starting fitting')
cv_rbf.fit(drugs_x_train, drugs_y_train)
print('rbf fitted')
cv_poly.fit(drugs_x_train, drugs_y_train)
print('poly fitted')
cv_lin.fit(drugs_x_train, drugs_y_train)
print('linear fitted')

drugs_pred_rbf = cv_rbf.predict(drugs_x_test)
drugs_pred_poly = cv_poly.predict(drugs_x_test)
drugs_pred_lin = cv_lin.predict(drugs_x_test)

print("MCC with gaussian kernel: {}".format(matthews_corrcoef(drugs_y_test, drugs_pred_rbf)))
print("MCC with polynomial kernel: {}".format(matthews_corrcoef(drugs_y_test, drugs_pred_poly)))
print("MCC with linear kernel: {}".format(matthews_corrcoef(drugs_y_test, drugs_pred_lin)))

Starting fitting
rbf fitted
poly fitted
linear fitted
MCC with gaussian kernel: 0.07205760409319376
MCC with polynomial kernel: 0.11315233231990394
MCC with linear kernel: 0.3048142350121146


## Podsumowanie
W obu przypadkach najlepsze okazało się użycie jądra liniowego.

In [21]:
from sklearn.metrics import classification_report, confusion_matrix

print('Report for apartments:')
print(classification_report(apts_y_test, apts_pred_lin))


print('Report for drugs:')
print(classification_report(drugs_y_test, drugs_pred_lin))
print('Confusion matrix for drugs:')
print(confusion_matrix(drugs_y_test, drugs_pred_lin))

Report for apartments:
              precision    recall  f1-score   support

      Bemowo       0.17      0.36      0.23       896
     Bielany       0.17      0.01      0.02       894
     Mokotow       0.35      0.09      0.14       868
      Ochota       0.36      0.49      0.42       909
       Praga       0.19      0.26      0.22       971
 Srodmiescie       1.00      1.00      1.00       924
       Ursus       0.18      0.20      0.19       920
     Ursynow       0.15      0.01      0.02       864
        Wola       0.16      0.19      0.17       892
    Zoliborz       0.34      0.47      0.39       862

    accuracy                           0.31      9000
   macro avg       0.31      0.31      0.28      9000
weighted avg       0.31      0.31      0.28      9000

Report for drugs:
              precision    recall  f1-score   support

           0       0.30      0.84      0.45       107
           1       0.94      0.55      0.69       457

    accuracy                        

Nie zachwyca. Porównajmy z regresją logistyczną:

In [25]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(class_weight = 'balanced')
logreg.fit(drugs_x_train, drugs_y_train)
logreg_pred = logreg.predict(drugs_x_test)

print(classification_report(drugs_y_test, logreg_pred))
print('Confusion matrix:')
print(confusion_matrix(drugs_y_test, logreg_pred))

              precision    recall  f1-score   support

           0       0.30      0.71      0.42       107
           1       0.90      0.61      0.72       457

    accuracy                           0.63       564
   macro avg       0.60      0.66      0.57       564
weighted avg       0.79      0.63      0.67       564

Confusion matrix:
[[ 76  31]
 [180 277]]


Wyniki są porównywalne. Oba modele mają porównywalny f1-score. `SVM` wykrył więcej *true positives*, ale mniej *true negatives*. 
Stąd wniosek, że dane najzwyczajniej w świecie stanowią wyzwanie dla modeli ML.