# Apartments

I downloaded the data from the DALEX package repository.

In [58]:
import rdata

parsed = rdata.parser.parse_file("apartments.rda")
converted = rdata.conversion.convert(parsed)
apartments = converted["apartments"]

In [59]:
apartments.head()

Unnamed: 0,m2.price,construction.year,surface,floor,no.rooms,district
0,5897.0,1953.0,25.0,3,1.0,Srodmiescie
1,1818.0,1992.0,143.0,9,5.0,Bielany
2,3643.0,1937.0,56.0,1,2.0,Praga
3,3517.0,1995.0,93.0,7,3.0,Ochota
4,3013.0,1992.0,144.0,6,5.0,Mokotow


There are 5 numeric feature columns and the target column, district.

In [60]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# split off the column with the given name and label-encode it as the target
def extract_y(data, column):
    y = data[[column]]
    le.fit(y.values.ravel())
    return data.drop([column], axis=1), le.transform(y.values.ravel())

X, y = extract_y(apartments, "district")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2137)

## Before scaling

In [61]:
from sklearn import svm
model = svm.SVC()
model.fit(X_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [63]:
from sklearn.metrics import accuracy_score

y_predicted = model.predict(X_test)
accuracy_score(y_test, y_predicted)

0.185
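
To put an accuracy like this in context, it helps to compare it with a trivial baseline. With k roughly balanced classes, always guessing one class scores about 1/k (for example about 0.1 with ten classes). The sketch below uses synthetic data as a stand-in, since the class balance of the real data is not shown here; `X_demo`, `y_demo` and the ten-class setup are assumptions made only for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 5 features, 10 roughly balanced classes
# (hypothetical data, not the real apartments set).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.integers(0, 10, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Always predicts the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
print(f"Baseline accuracy: {baseline_acc:.3f}")
```

Running the same `DummyClassifier` on the real `X_train`/`y_train` would show how far above chance the SVM actually is.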

The result looks very poor, so let's try tuning the hyperparameters a bit.

In [66]:
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

params = {
    'C': np.arange(0.1, 10, 0.1),
    'degree': np.arange(1, 5, 1),
    'gamma': ["scale", "auto"] + np.arange(0.01, 0.5, 0.01).tolist()
}

# I set cv=10, because lower values somewhat defeat the purpose with this many classes and such a low default accuracy
grid = RandomizedSearchCV(model, params, error_score='raise', cv=10)
grid.fit(X_train, y_train)

print(f"Score: {grid.best_score_}")
grid.best_params_

Score: 0.22125000000000003


{'gamma': 'scale', 'degree': 2, 'C': 2.7}

In [67]:
newmodel = svm.SVC(gamma='scale', degree=2, C=2.7)
newmodel.fit(X_train, y_train)
y_predicted=newmodel.predict(X_test)
accuracy_score(y_test, y_predicted)

0.185

Tuning the hyperparameters gained us nothing: the test accuracy is still 0.185. (In my other attempts the result only got worse :( )
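
One detail that may be worth knowing: `RandomizedSearchCV` refits the best configuration on the whole training set by default (`refit=True`), so the tuned model can be scored directly instead of being rebuilt by hand. The sketch below shows this on synthetic data (`X_demo`, `y_demo` are hypothetical stand-ins). It also leaves `degree` out of the search space, because the default *rbf* kernel does not use it.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic multi-class data standing in for the notebook's X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# degree is omitted: the rbf kernel ignores it.
param_dist = {
    "C": np.arange(0.1, 10, 0.1),
    "gamma": ["scale", "auto"] + np.arange(0.01, 0.5, 0.01).tolist(),
}

search = RandomizedSearchCV(
    svm.SVC(), param_dist, n_iter=5, cv=3, random_state=0
)
search.fit(X_tr, y_tr)

# The refitted best estimator is used by score() and predict().
test_acc = search.score(X_te, y_te)
print(search.best_params_)
print(f"CV score: {search.best_score_:.3f}, test accuracy: {test_acc:.3f}")
```

Comparing the cross-validation score with the test accuracy side by side also makes gaps like 0.221 versus 0.185 easier to spot.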

## After scaling

In [49]:
from sklearn.preprocessing import StandardScaler
import pandas as pd
numerical = ['m2.price', 'construction.year', 'surface', 'floor', 'no.rooms']
scaler = StandardScaler() 

X_scaled = scaler.fit_transform(X[numerical])
X_scaled = pd.DataFrame(X_scaled, columns = numerical)
X_scaled

Unnamed: 0,m2.price,construction.year,surface,floor,no.rooms
0,2.659324,-0.457926,-1.600545,-0.904974,-1.709248
1,-1.841700,1.052614,1.516542,1.165115,1.187782
2,0.172119,-1.077634,-0.781649,-1.595004,-0.984990
3,0.033083,1.168809,0.195742,0.475086,-0.260733
4,-0.523062,1.052614,1.542958,0.130071,1.187782
...,...,...,...,...,...
995,3.164710,-1.697343,-1.098641,-1.249989,-0.984990
996,-0.071746,-1.697343,-0.992977,1.510130,-0.984990
997,-0.429268,0.587832,-0.015585,-0.904974,-0.260733
998,0.777920,-0.883975,-1.309969,0.475086,-1.709248


In [69]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=2137)

model = svm.SVC()
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
accuracy_score(y_test, y_predicted)

0.32

Even without hyperparameter tuning the result is already much higher (0.32 versus 0.185). It looks promising.
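
A side note on the scaling step: above, the scaler was fitted on all rows before the train/test split, so the test rows contributed to the means and variances. The effect is probably small here, but wrapping the scaler and the model in a pipeline keeps the scaler fitted on training rows only. This is a minimal sketch on synthetic data; `X_demo`, `y_demo` and the column scales are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
# Put the columns on very different scales, as in the raw apartments data.
X_demo = X_demo * np.array([1000.0, 1.0, 10.0, 1.0, 0.1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# fit() scales using training statistics only; score() reuses them on the test set.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)
pipe_acc = scaled_svc.score(X_te, y_te)
print(f"Test accuracy: {pipe_acc:.3f}")
```

The same pipeline object can be passed to `RandomizedSearchCV`, in which case the parameter names take the step prefix, for example `svc__C`.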

In [70]:
params = {
    'C': np.arange(0.1, 10, 0.1),
    'degree': np.arange(1, 5, 1),
    'gamma': ["scale", "auto"] + np.arange(0.01, 0.5, 0.01).tolist()
}

grid = RandomizedSearchCV(model, params, error_score='raise', cv=10)
grid.fit(X_train, y_train)

print(f"Score: {grid.best_score_}")
grid.best_params_

Score: 0.3225


{'gamma': 0.19, 'degree': 2, 'C': 4.9}

Note that the gamma found by RandomizedSearchCV (0.19) is very close to the value that *auto* sets, which is 1/n_features = 1/5 = 0.2.
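
The arithmetic behind the two string options can be checked by hand. scikit-learn defines `gamma='auto'` as `1 / n_features` and `gamma='scale'` as `1 / (n_features * X.var())`. After standardization the overall variance is 1, so both come out at 0.2 for five features, which means the defaults were already sitting next to the tuned value. The feature matrix below is synthetic and only mimics the rough magnitudes of the five columns.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 5-column feature matrix on very different scales.
X_raw = rng.normal(
    loc=[3500, 1965, 85, 5, 3], scale=[900, 25, 40, 3, 1], size=(1000, 5)
)

# Column-wise standardization: zero mean and unit variance per column.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

n_features = X_std.shape[1]
gamma_auto = 1.0 / n_features                    # gamma='auto'
gamma_scale = 1.0 / (n_features * X_std.var())   # gamma='scale'
gamma_scale_raw = 1.0 / (n_features * X_raw.var())  # 'scale' on unscaled data

print(gamma_auto, gamma_scale, gamma_scale_raw)
```

On the unscaled matrix the *scale* value is far smaller, because the variance is dominated by the large-valued columns.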

In [72]:
newmodel = svm.SVC(gamma=0.19, degree=2, C=4.9)
newmodel.fit(X_train, y_train)
y_predicted=newmodel.predict(X_test)
accuracy_score(y_test, y_predicted)

0.33

It may not be much better than before tuning, but an accuracy of about 1/3 beats the roughly 1/5 we had before scaling :)

## Quick test of other kernels
### Linear

In [90]:
model = svm.SVC(kernel='linear')
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
print("Default: ", accuracy_score(y_test, y_predicted))

grid = RandomizedSearchCV(model, params, error_score='raise', cv=10)
grid.fit(X_train, y_train)

print(f"Score: {grid.best_score_}")
grid.best_params_

Default:  0.285
Score: 0.31375


{'gamma': 0.47000000000000003, 'degree': 4, 'C': 9.0}

In [91]:
# the linear kernel ignores degree (and gamma as well), so degree can be dropped
newmodel = svm.SVC(kernel='linear', gamma=0.47, C=9)
newmodel.fit(X_train, y_train)
y_predicted=newmodel.predict(X_test)
accuracy_score(y_test, y_predicted)

0.28

We got a worse result on the test set, even though the search score looked promising.
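
The search above also returned values for gamma and degree, but the linear kernel is just the dot product of the two samples and uses neither, so those picks are effectively random and only C matters. A small check on synthetic data (`X_demo`, `y_demo` are hypothetical) illustrates this: two linear SVCs that differ only in gamma and degree make identical predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# Same kernel and C, different gamma and degree.
pred_a = SVC(kernel="linear", C=9, gamma=0.47).fit(X_demo, y_demo).predict(X_demo)
pred_b = SVC(kernel="linear", C=9, gamma=0.01, degree=4).fit(X_demo, y_demo).predict(X_demo)

same = np.array_equal(pred_a, pred_b)
print("Identical predictions:", same)
```

For the linear kernel the search space could therefore be reduced to C alone.
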
### Polynomial

In [93]:
model = svm.SVC(kernel='poly')
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
print("Default: ", accuracy_score(y_test, y_predicted))

grid = RandomizedSearchCV(model, params, error_score='raise', cv=10)
grid.fit(X_train, y_train)

print(f"Score: {grid.best_score_}")
grid.best_params_

Default:  0.29
Score: 0.31375


{'gamma': 0.37, 'degree': 1, 'C': 1.4000000000000001}

In [94]:
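# maybe a linear model isn't so bad after all :P (the search picked degree=1)
# note: gamma stays at 'scale' here rather than the 0.37 found by the search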
newmodel = svm.SVC(kernel='poly', gamma='scale', degree=1, C=1.4)
newmodel.fit(X_train, y_train)
y_predicted=newmodel.predict(X_test)
accuracy_score(y_test, y_predicted)

0.285

Still worse than *rbf*.
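
The polynomial kernel in scikit-learn is `(gamma * <x, x'> + coef0) ** degree`. With `degree=1` and the default `coef0=0` of `SVC` it reduces to gamma times the linear kernel, which would explain why the tuned *poly* and *linear* models land so close to each other (0.285 versus 0.28). A quick numerical check, using random points that are purely illustrative:

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 5))  # six hypothetical samples, five features
gamma = 0.37

# Degree-1 polynomial kernel with coef0=0, as in SVC's defaults.
K_poly = polynomial_kernel(X_demo, degree=1, gamma=gamma, coef0=0)
K_lin = linear_kernel(X_demo)

equal = np.allclose(K_poly, gamma * K_lin)
print("poly(degree=1) == gamma * linear:", equal)
```
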
### Sigmoid

In [95]:
model = svm.SVC(kernel='sigmoid')
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)
print("Default: ", accuracy_score(y_test, y_predicted))

grid = RandomizedSearchCV(model, params, error_score='raise', cv=10)
grid.fit(X_train, y_train)

print(f"Score: {grid.best_score_}")
grid.best_params_

Default:  0.205
Score: 0.28625


{'gamma': 0.060000000000000005, 'degree': 1, 'C': 7.8}

In [96]:
# degree can be dropped here as well
newmodel = svm.SVC(kernel='sigmoid', gamma=0.06, C=7.8)
newmodel.fit(X_train, y_train)
y_predicted=newmodel.predict(X_test)
accuracy_score(y_test, y_predicted)

0.27

*rbf* still performs best. Perhaps these kernels would have coped better before scaling, but I doubt it.
# League of Legends Diamond Ranked Games
Link: https://www.kaggle.com/bobbyscience/league-of-legends-diamond-ranked-games-10-min

Let's take *blueWins* as the target variable. Admittedly that gives us only 2 classes, but I liked this dataset because I recently got back into playing LoL after a long break :)

In [104]:
data = pd.read_csv('high_diamond_ranked_10min.csv')
data.head()

Unnamed: 0,gameId,blueWins,blueWardsPlaced,blueWardsDestroyed,blueFirstBlood,blueKills,blueDeaths,blueAssists,blueEliteMonsters,blueDragons,...,redTowersDestroyed,redTotalGold,redAvgLevel,redTotalExperience,redTotalMinionsKilled,redTotalJungleMinionsKilled,redGoldDiff,redExperienceDiff,redCSPerMin,redGoldPerMin
0,4519157822,0,28,2,1,9,6,11,0,0,...,0,16567,6.8,17047,197,55,-643,8,19.7,1656.7
1,4523371949,0,12,1,0,5,5,5,0,0,...,1,17620,6.8,17438,240,52,2908,1173,24.0,1762.0
2,4521474530,0,15,0,0,7,11,4,1,1,...,0,17285,6.8,17254,203,28,1172,1033,20.3,1728.5
3,4524384067,0,43,1,0,4,5,5,1,0,...,0,16478,7.0,17961,235,47,1321,7,23.5,1647.8
4,4436033771,0,75,4,0,6,6,6,0,0,...,0,17404,7.0,18313,225,67,1004,-230,22.5,1740.4


In [105]:
data = data.drop(['gameId'], axis=1)

y = data[['blueWins']]
X = data.drop(['blueWins'], axis=1)
print(y.head())
print(X.head())

   blueWins
0         0
1         0
2         0
3         0
4         0
   blueWardsPlaced  blueWardsDestroyed  blueFirstBlood  blueKills  blueDeaths  \
0               28                   2               1          9           6   
1               12                   1               0          5           5   
2               15                   0               0          7          11   
3               43                   1               0          4           5   
4               75                   4               0          6           6   

   blueAssists  blueEliteMonsters  blueDragons  blueHeralds  \
0           11                  0            0            0   
1            5                  0            0            0   
2            4                  1            1            0   
3            5                  1            0            1   
4            6                  0            0            0   

   blueTowersDestroyed  ...  redTowersDestroyed  redTotalGold 

In [106]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2137)

## Before scaling

In [107]:
model = svm.SVC()
model.fit(X_train, y_train.values.ravel())
y_predicted = model.predict(X_test)
print("Default: ", accuracy_score(y_test, y_predicted))

grid = RandomizedSearchCV(model, params, error_score='raise', cv=5)
grid.fit(X_train, y_train.values.ravel())

print(f"Score: {grid.best_score_}")
grid.best_params_

Default:  0.7383603238866396
Score: 0.5014550957173396


{'gamma': 0.31, 'degree': 4, 'C': 5.8}

In [109]:
model = svm.SVC(gamma=0.31, degree=4, C=5.8)
model.fit(X_train, y_train.values.ravel())
y_predicted = model.predict(X_test)
accuracy_score(y_test, y_predicted)

0.4893724696356275

The default parameters did much better than the tuned ones (0.738 versus 0.489). I would perhaps try again, but unfortunately my computer can't take any more.
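
A plausible explanation, which I have not verified on the full data: the search range for gamma (0.01 to 0.5) is far too large for unscaled features whose values run into the tens of thousands. The RBF kernel is `exp(-gamma * ||x - z||^2)`, so with gold and experience differences of several hundred units the kernel value underflows to zero for every pair of distinct games, and the model can no longer generalize. The sketch below plugs in the `redTotalGold` and `redTotalExperience` values of the first two rows shown above; the helper `rbf` is written here only for illustration.

```python
import numpy as np

def rbf(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) for two points."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * float(diff @ diff)))

# (redTotalGold, redTotalExperience) of the first two games in data.head().
game_a = [16567.0, 17047.0]
game_b = [17620.0, 17438.0]

k_tuned = rbf(game_a, game_b, gamma=0.31)  # gamma picked by the search
k_small = rbf(game_a, game_b, gamma=1e-7)  # a much smaller gamma

print(k_tuned, k_small)
```

The default `gamma='scale'` divides by the variance of the data, which yields a very small gamma on unscaled features, and that is probably why the untuned model held up.
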
## After scaling

In [111]:
numerical = X.columns
scaler = StandardScaler() 

X_scaled = scaler.fit_transform(X[numerical])
X_scaled = pd.DataFrame(X_scaled, columns = numerical)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=2137)

In [112]:
model = svm.SVC()
model.fit(X_train, y_train.values.ravel())
y_predicted = model.predict(X_test)
print("Default: ", accuracy_score(y_test, y_predicted))

grid = RandomizedSearchCV(model, params, error_score='raise', cv=5)
grid.fit(X_train, y_train.values.ravel())

print(f"Score: {grid.best_score_}")
grid.best_params_

Default:  0.7398785425101214
Score: 0.7226368505752647


{'gamma': 0.03, 'degree': 1, 'C': 0.9}

In [115]:
# rbf ignores degree here as well
model = svm.SVC(gamma=0.03, C=0.9)
model.fit(X_train, y_train.values.ravel())
y_predicted = model.predict(X_test)
accuracy_score(y_test, y_predicted)

0.7373481781376519

# Conclusions on scaling
In the first case scaling gave a significant improvement. In the second there is admittedly no such difference with the default parameters, but hyperparameter tuning clearly worked much better on the scaled data.

## Quick test of other kernels
but without tuning, because this dataset is too big for my computer
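
If tuning is too slow, one option (not tried in this notebook) is to make the search cheaper: tune on a random subsample, lower `n_iter` below its default of 10, use fewer folds, and let `n_jobs` spread the fits across cores. The sketch below uses synthetic binary data as a stand-in; the subsample size of 400 and the other settings are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic binary data standing in for the scaled features (hypothetical).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=0)

# Tune on a random subsample instead of the full training set.
rng = np.random.default_rng(0)
subset = rng.choice(len(X_demo), size=400, replace=False)

N_JOBS = 1  # set to -1 to use all CPU cores

search = RandomizedSearchCV(
    SVC(kernel="poly"),
    {"C": np.arange(0.1, 10, 0.1), "degree": np.arange(1, 5, 1)},
    n_iter=5,       # fewer candidates than the default of 10
    cv=3,           # fewer folds than the cv=5 used earlier
    n_jobs=N_JOBS,
    random_state=0,
)
search.fit(X_demo[subset], y_demo[subset])
print(search.best_params_, round(search.best_score_, 3))
```

The parameters found on the subsample can then be used to fit a single model on the full training set.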

### Linear

In [117]:
model = svm.SVC(kernel='linear')
model.fit(X_train, y_train.values.ravel())
y_predicted = model.predict(X_test)
print("Default: ", accuracy_score(y_test, y_predicted))

Default:  0.736336032388664


### Polynomial

In [119]:
model = svm.SVC(kernel='poly')
model.fit(X_train, y_train.values.ravel())
y_predicted = model.predict(X_test)
print("Default: ", accuracy_score(y_test, y_predicted))

Default:  0.7348178137651822


### Sigmoid

In [120]:
model = svm.SVC(kernel='sigmoid')
model.fit(X_train, y_train.values.ravel())
y_predicted = model.predict(X_test)
print("Default: ", accuracy_score(y_test, y_predicted))

Default:  0.6842105263157895


In this case the models give very similar results; only sigmoid is clearly worse.
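
The three near-identical cells above could also be written as one loop, which makes the comparison easier to read and to extend. A minimal sketch on synthetic data (`X_demo`, `y_demo` are hypothetical and assumed to be on a comparable scale already):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit one default SVC per kernel and collect the test accuracies.
scores = {}
for kernel in ["rbf", "linear", "poly", "sigmoid"]:
    clf = SVC(kernel=kernel)
    clf.fit(X_tr, y_tr)
    scores[kernel] = clf.score(X_te, y_te)

for kernel, acc in scores.items():
    print(f"{kernel:8s} {acc:.3f}")
```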