# 3. hands-on session: **Classification problem: from *Data* to *Inference***

## **Contents**

1. Preprocess the data
1. Select features & reduce dimensions
1. Cross-validate
1. Find best hyperparameters
1. Compare classifiers
1. Combine classifiers
1. Evaluate performance
1. Predict

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [None]:
!pip install corner
import corner

## **Our dataset**

SDSS photometry of point sources: stars vs. quasi-stellar objects (QSOs)

<img src="https://cdn.mos.cms.futurecdn.net/HgaCHZDNppE6e52yeDACo6-970-80.jpg.webp" height=200>

<img src="https://earthsky.org/upl/2021/01/supermassive-black-hole-artist-e1610556964639.jpg" height=200 align=right>



In [None]:
!wget -c "https://drive.google.com/uc?id=1IoQfGFo13ZP2wTyp-xvzQvguPYhE8TWB" -O "sdss_photo.csv"

In [None]:
data = pd.read_csv("sdss_photo.csv")

## **Data preprocessing**

In [None]:
data

In [None]:
data.describe().round(2)

In [None]:
sum(data.target == "star"), sum(data.target == "QSO")

In [None]:
cols = data.columns
fig = corner.corner(data[cols[0:5]][data.target == "star"], color="C0")
corner.corner(data[cols[0:5]][data.target == "QSO"], fig=fig, color="C1");

### task 1: **create `X` and `y`**

```python
data[["u","g","r","i","z"]] -> X
data.target -> y
"QSO" -> 0
"star" -> 1
```

hint: you can use [LabelEncoder()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)


In [None]:
X = data[["u","g","r","i","z"]]
X

In [None]:
y = np.array(data.target == "star").astype(int)
y

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# le.fit(data.target)
# y = le.transform(data.target)
y = le.fit_transform(data.target)
y

In [None]:
le.inverse_transform([0,1])
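
A minimal sketch of why `"QSO"` ends up as 0 and `"star"` as 1: `LabelEncoder` stores the sorted unique labels in `classes_` and uses each label's position there as its integer code. The toy `labels` array and the names `encoder`, `encoded`, and `mapping` are made up for illustration and are not part of the SDSS table.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels (hypothetical, not the SDSS data)
labels = np.array(["star", "QSO", "star", "QSO", "QSO"])

encoder = LabelEncoder().fit(labels)
encoded = encoder.transform(labels)

# classes_ is sorted, so the index of each class is its integer code
mapping = {str(name): int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)
print(encoded)
```

Uppercase letters sort before lowercase ones, which is why `"QSO"` precedes `"star"` here.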

### task 2: **classify with a decision tree (or a linear SVC) & test accuracy**

In [None]:
# from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

In [None]:
# model = DecisionTreeClassifier(random_state=420)
model = SVC(kernel="linear")

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=420)

In [None]:
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
sum(y_pred == y_test) / len(y_pred)

In [None]:
model.score(X_test, y_test)

### task 3: **rescale the data -> `X_scaled` & test score**

hint:\
you can use [`StandardScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)\
`X_scaled = X.copy()`

In [None]:
X_scaled = X.copy()
for col in X.columns:
    X_scaled[col] = (X[col] - np.mean(X[col])) / np.std(X[col])

In [None]:
X_scaled.describe().round(2)

In [None]:
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)
X_scaled = pd.DataFrame(data=X_scaled, columns=X.columns)
X_scaled.describe().round(2)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, stratify=y, random_state=420)

model.fit(X_train, y_train)

model.score(X_test, y_test)

In [None]:
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=420)

model = make_pipeline(StandardScaler(),
                      SVC(kernel="linear"))
                      #DecisionTreeClassifier(random_state=420))

model.fit(X_train, y_train)

model.score(X_test, y_test)
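
One reason to prefer the pipeline over scaling `X` up front is that the scaler is then fitted on the training split only, so no statistics from the test split leak into training. The sketch below checks this on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `pipe` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the five magnitudes (hypothetical data)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
pipe.fit(Xd_train, yd_train)

# make_pipeline names each step after its lowercased class name
scaler = pipe.named_steps["standardscaler"]
uses_train_only = np.allclose(scaler.mean_, Xd_train.mean(axis=0))
print(uses_train_only, pipe.score(Xd_test, yd_test))
```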

## **Feature selection & dimensionality reduction**

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
clf = ExtraTreesClassifier(random_state=420).fit(X,y)
clf.feature_importances_

In [None]:
plt.bar(np.arange(5), clf.feature_importances_, 0.5)
plt.xticks(np.arange(5), X.columns);

#### task 4: **calculate spectral indices & test importance**

hint:
`X_new = X.copy()`



In [None]:
X_new = X.copy()
X_new["u-g"] = X.u - X.g
X_new["u-r"] = X.u - X.r
X_new["u-z"] = X.u - X.z
X_new["i-z"] = X.i - X.z

X_new

In [None]:
clf = ExtraTreesClassifier(random_state=42).fit(X_new,y)
plt.bar(np.arange(9), clf.feature_importances_, 0.5)
plt.xticks(np.arange(9), X_new.columns);

#### task 5: **test score if only *u-r* or *i-z* spectral indices are used**

hint: for single columns use `X_new[["u-r"]]`

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_new[["u-r"]], y, stratify=y, random_state=420)

model.fit(X_train, y_train)

model.score(X_test, y_test)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_new[["i-z"]], y, stratify=y, random_state=420)

model.fit(X_train, y_train)

model.score(X_test, y_test)

#### task 6: **create dummy column & test importance**

hint:
```
X_new2 = X.copy()
X_new2["dummy"] = np.random.randint(10, size=X.r.size)
```

In [None]:
X_new2 = X.copy()

X_new2["dummy"] = np.random.normal(0, 1, size=X.r.size)
X_new2["dummy2"] = np.ones_like(X.r)

In [None]:
clf = ExtraTreesClassifier().fit(X_new2,y)
plt.bar(np.arange(7), clf.feature_importances_, 0.5)
plt.xticks(np.arange(7), X_new2.columns);
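
The same dummy-column check can be sketched on fully synthetic data, where it is known by construction which column carries the signal. `X_toy`, `y_toy`, and `forest` are hypothetical names; the label depends on the first column only, so the random column serves as a baseline for what "unimportant" looks like.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)        # informative column (synthetic)
noise = rng.normal(size=1000)         # dummy column, unrelated to the label
X_toy = np.column_stack([signal, noise])
y_toy = (signal > 0).astype(int)      # label depends on the first column only

forest = ExtraTreesClassifier(random_state=0).fit(X_toy, y_toy)
importances = forest.feature_importances_
print(importances.round(3))
```

The dummy column usually does not score exactly zero, because fully grown trees also split on noise. Real features are worth keeping only if they score clearly above such a baseline.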

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_new2[["dummy"]], y, stratify=y, random_state=420)

model.fit(X_train, y_train)

model.score(X_test, y_test)

### [**Principal component analysis**](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)

<img src="https://programmathically.com/wp-content/uploads/2021/08/pca-2-dimensions-1024x644.png" width=600pt></img>

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_PCA = pca.fit_transform(X)
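
To judge how much information two components retain, one can inspect `explained_variance_ratio_` on the fitted `PCA` object. The sketch below uses synthetic, strongly correlated columns (an assumption standing in for the correlated magnitudes); `X_corr` and `pca_demo` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
# Five columns that share one common component plus small independent noise
X_corr = base + 0.1 * rng.normal(size=(500, 5))

pca_demo = PCA(n_components=2).fit(X_corr)
ratios = pca_demo.explained_variance_ratio_
print(ratios.round(3), ratios.sum().round(3))
```

When the first ratio is close to 1, most of the variance lies along a single direction, and additional components add little.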

In [None]:
plt.scatter(X_PCA[:,0], X_PCA[:,1], c=y)

In [None]:
plt.scatter(X.g, X.u-X.g, c=y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_PCA, y, stratify=y, random_state=420)

model.fit(X_train, y_train)

model.score(X_test, y_test)

### task 7: **integrate `PCA(n_components=2)` into our pipeline**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=420)

model = make_pipeline(PCA(n_components=2),
                      StandardScaler(),
                      SVC(kernel="linear"))
                      #DecisionTreeClassifier(random_state=420))

model.fit(X_train, y_train)

model.score(X_test, y_test)

## [**Cross-validation**](https://scikit-learn.org/stable/modules/cross_validation.html)

<img src="https://miro.medium.com/max/1400/1*AAwIlHM8TpAVe4l2FihNUQ.png" width=800pt></img>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_PCA, y, stratify=y, random_state=420)

model.fit(X_train, y_train)

model.score(X_test, y_test)

### task 8: **use several different random states when splitting data & get average score**

In [None]:
scores = []
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X_PCA, y, stratify=y, random_state=i)

    model.fit(X_train, y_train)

    scores.append(model.score(X_test, y_test))

scores

In [None]:
np.mean(scores), np.std(scores)

In [None]:
from sklearn.model_selection import cross_validate

res = cross_validate(model, X, y, cv=10)

print(res)

In [None]:
np.mean(res["test_score"]), np.std(res["test_score"])
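
What `cross_validate` does with an integer `cv` and a classifier can be sketched as an explicit loop over `StratifiedKFold` splits, fitting a fresh clone of the estimator on each training fold. The data here is synthetic and the names `est`, `manual_scores`, and `auto_scores` are assumptions made for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
est = make_pipeline(StandardScaler(), SVC(kernel="linear"))

# Manual 5-fold loop: fit on four folds, score on the held-out fold
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(est).fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# The helper should reproduce the same per-fold accuracies
auto_scores = cross_validate(est, X_demo, y_demo, cv=5)["test_score"]
print(np.round(manual_scores, 3), np.round(auto_scores, 3))
```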

In [None]:
def score(model, X, y, cv=10):
    res = cross_validate(model, X, y, cv=cv)
    return np.mean(res["test_score"])

In [None]:
score(model, X, y)

## **Tuning hyperparameters**

In [None]:
SVC?

#### task 9: **find SVC hyperparameters with best test score**

In [None]:
def classify(X, y, classifier):
    model = make_pipeline(PCA(n_components=3),
                          StandardScaler(),
                          classifier)

    res = cross_validate(model, X, y, cv=10)
    print(np.mean(res["test_score"]))

In [None]:
classify(X, y, SVC(kernel="linear"))

In [None]:
classify(X, y, SVC(kernel="linear", C=10))

In [None]:
classify(X, y, SVC(kernel="linear", C=0.1))

In [None]:
classify(X, y, SVC(kernel="poly", degree=1))

In [None]:
classify(X, y, SVC(kernel="rbf", C=1))

### **Grid-search + crossvalidation**

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
C = [0.01, 0.1, 1, 10]
kernel = ["linear", "poly", "rbf"]

params = {"C" : C,
          "kernel" : kernel}

model = GridSearchCV(SVC(), params, cv=3, n_jobs=8)
model.fit(X, y)

In [None]:
model.cv_results_

In [None]:
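# use new names so that the `params` grid and the `score()` helper defined above are not overwritten
cv_params, cv_scores = model.cv_results_["params"], model.cv_results_["mean_test_score"]

indices = np.argsort(cv_scores)

for i in indices:
    print(cv_params[i], cv_scores[i].round(3))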

In [None]:
res = model.cv_results_
# GridSearchCV iterates C in the outer loop and kernel in the inner loop,
# so rows correspond to C and columns to kernel
scores_grid = res["mean_test_score"].reshape(len(C), len(kernel))
plt.imshow(scores_grid)
for j, c in enumerate(C):
    for i, k in enumerate(kernel):
        plt.text(i, j, "{0:.2f}".format(scores_grid[j, i]), ha="center")

plt.xticks(np.arange(len(kernel)), kernel);
plt.yticks(np.arange(len(C)), C);

In [None]:
model.best_estimator_
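
The grid search above runs on unscaled magnitudes. A possible variant, sketched here on synthetic data, searches over a whole pipeline so that scaling is refitted inside every fold. Parameters of a pipeline step are addressed as `<step name>__<parameter>`; the grid values and the names `pipe`, `param_grid`, and `search` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())
# "svc" is the step name that make_pipeline derives from the class name
param_grid = {"svc__C": [0.1, 1, 10],
              "svc__kernel": ["linear", "rbf"]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```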

## **Classifier comparison**

In [None]:
from sklearn.neural_network import MLPClassifier # multi-layer perceptron classifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

In [None]:
classifiers = [MLPClassifier(max_iter=1000),
               GaussianNB(),
               DecisionTreeClassifier(),
               KNeighborsClassifier(),
               SVC(kernel="rbf", C=10)]

for classifier in classifiers:
    classify(X, y, classifier)

In [None]:
clf = MLPClassifier(max_iter=1000)
params = {"hidden_layer_sizes" : [5, 10, 50, 100],
          "activation" : ["identity", "logistic", "tanh", "relu"],
          "solver" : ["sgd", "adam"]}
model = GridSearchCV(clf, params, cv=5, n_jobs=8)
model.fit(X, y)

In [None]:
model.best_estimator_, model.best_score_

In [None]:
%time MLPClassifier(activation="tanh", hidden_layer_sizes=10, max_iter=1000).fit(X_train, y_train)

In [None]:
%time SVC(C=10).fit(X_train, y_train)

## **Ensemble methods**

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
model = RandomForestClassifier()

res = cross_validate(model, X, y, cv=10)
np.mean(res["test_score"]), np.std(res["test_score"])

In [None]:
from sklearn.ensemble import StackingClassifier

In [None]:
classifiers = [("MLP", MLPClassifier(max_iter=1000, random_state=42)),
               ("Bayes", GaussianNB()),
               ("RFC", RandomForestClassifier()),
               ("KNN", KNeighborsClassifier()),
               ("SVC", SVC(C=10))]

model = StackingClassifier(classifiers)

res = cross_validate(model, X, y, cv=10)
np.mean(res["test_score"]), np.std(res["test_score"])

In [None]:
classifiers = [("MLP", MLPClassifier(max_iter=1000, random_state=42)),
               ("Bayes", GaussianNB()),
               ("DTC", DecisionTreeClassifier()),
               ("KNN", KNeighborsClassifier()),
               ("SVC", SVC(C=10))]

# train the model
clf = StackingClassifier(classifiers)

model = make_pipeline(StandardScaler(),
                      clf)

res = cross_validate(model, X, y, cv=10)
np.mean(res["test_score"]), np.std(res["test_score"])
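
A short sketch of what stacking does internally: the base classifiers produce predictions, and a final estimator (logistic regression by default in scikit-learn) is trained on those predictions. The example uses synthetic data and only two base classifiers to keep it fast; `stack`, `meta_name`, and `meta_inputs` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

stack = StackingClassifier([("Bayes", GaussianNB()),
                            ("KNN", KNeighborsClassifier())])
stack.fit(X_demo, y_demo)

# The fitted meta-model that combines the base predictions
meta_name = type(stack.final_estimator_).__name__
# transform() returns the base-classifier outputs the meta-model sees
meta_inputs = stack.transform(X_demo)
print(meta_name, meta_inputs.shape)
```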

### task 10: **try [`classification_report`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) and [`ConfusionMatrixDisplay`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html)**

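In [None]:
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

In [None]:
# cross_validate does not fit `model` itself, and y_pred/y_test come from earlier splits,
# so refit on a fresh split of the original features before evaluating
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=420)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred, digits=3))

In [None]:
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, display_labels=["QSO", "star"]);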
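
To read these outputs, it may help to compute a confusion matrix by hand on a tiny example. In scikit-learn the rows are the true classes and the columns the predicted ones. The arrays below are made up (0 = QSO, 1 = star, following the encoding used above) and do not come from the SDSS data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical): 0 = QSO, 1 = star
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 1, 1, 1, 1, 0])

cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)
# [[2 1]    2 QSOs classified correctly, 1 QSO mistaken for a star
#  [1 3]]   1 star mistaken for a QSO, 3 stars classified correctly

# Precision and recall for the "star" class (label 1)
precision_star = precision_score(y_true_toy, y_pred_toy)  # 3 / (3 + 1)
recall_star = recall_score(y_true_toy, y_pred_toy)        # 3 / (3 + 1)
print(precision_star, recall_star)
```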

## **Conclusion**

In [None]:
classifiers = [("MLP", MLPClassifier(max_iter=1000, random_state=42)),
               ("Bayes", GaussianNB()),
               ("DTC", DecisionTreeClassifier()),
               ("KNN", KNeighborsClassifier()),
               ("SVC", SVC(C=10))]

# train the model
clf = StackingClassifier(classifiers)

model = make_pipeline(StandardScaler(),
                      clf)

# fit on all five magnitudes so the model accepts the (u, g, r, i, z) input used for inference below
model.fit(X, y)

## **Model inference**

### task 11: **pick an object from SDSS and classify it**

In [None]:
u = 15.914
g = 15.500
r = 16.2
i = 16.5
z = 17.1

X_real = pd.DataFrame(np.array([[u,g,r,i,z]]), columns=["u","g","r","i","z"])

X_real

In [None]:
pred = model.predict(X_real)

le.inverse_transform(pred)