# Lab assignment №1, part 2

This lab assignment consists of several parts. You will transform the data, train several models, evaluate their quality, and explain your results.

Several comments:
* Don't hesitate to ask questions; it's good practice.
* No private or public sharing, please. Copied assignments will be graded with 0 points.
* Blocks of this lab will be graded separately.

__*This is the second part of the assignment. The first and third parts are waiting for you in the same directory.*__

## Part 2. Data preprocessing, model training and evaluation.

### 1. Reading the data
Today we work with a [dataset](https://archive.ics.uci.edu/ml/datasets/Statlog+%28Vehicle+Silhouettes%29) that describes different cars and poses a multiclass ($k=4$) classification problem. The data is available below.

In [None]:
# If on colab, uncomment the following lines

# ! wget https://raw.githubusercontent.com/girafe-ai/ml-course/22f_made/homeworks/lab01_ml_pipeline/car_data.csv

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('car_data.csv', delimiter=',', header=None).values
data = dataset[:, :-1].astype(int)
target = dataset[:, -1]

print(data.shape, target.shape)

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.35)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

To get some insights about the dataset, `pandas` might be used. The `train` part is transformed to `pd.DataFrame` below.

In [None]:
X_train_pd = pd.DataFrame(X_train)

# First 15 rows of our dataset.
X_train_pd.head(15)

The `describe` and `info` methods give a quick summary of the value ranges, data types, and missing values.

In [None]:
X_train_pd.describe()

In [None]:
X_train_pd.info()

### 2. Machine Learning pipeline
Here you are supposed to perform the desired transformations. Please explain your results briefly after each task.

#### 2.0. Data preprocessing
* Transform the dataset (if necessary). Briefly explain the transformations.

In [None]:
### YOUR CODE HERE
# Here we center and normalize the data. Both the train and the test parts are transformed.
# The test data is normalized with the train statistics, because we assume that the test data
# is not available at fit time; the same train statistics must be reused for any later data.

X_train_norm = (X_train - np.mean(X_train, axis=0)) / np.std(X_train, axis=0)  # center and normalize the train data
X_test_norm = (X_test - np.mean(X_train, axis=0)) / np.std(X_train, axis=0)  # center and normalize the test data with train statistics
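
The manual standardization above should match what `sklearn.preprocessing.StandardScaler` does when it is fitted on the train part only. The following is a minimal sketch on synthetic data; the `*_demo` arrays are hypothetical stand-ins for the car features, not the real dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the car features (assumption: any numeric matrix works).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 5))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(40, 5))

# Manual standardization with train-only statistics.
mean, std = X_train_demo.mean(axis=0), X_train_demo.std(axis=0)
manual_train = (X_train_demo - mean) / std
manual_test = (X_test_demo - mean) / std

# StandardScaler learns the same statistics in fit() and reuses them in transform().
scaler = StandardScaler().fit(X_train_demo)
scaled_train = scaler.transform(X_train_demo)
scaled_test = scaler.transform(X_test_demo)

print(np.allclose(manual_train, scaled_train), np.allclose(manual_test, scaled_test))
```

Using the scaler object makes it harder to apply test statistics by accident, and it can be placed inside a `Pipeline`.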


#### 2.1. Basic logistic regression
* Find optimal hyperparameters for logistic regression with cross-validation on the `train` data (small grid/random search is enough, no need to find the *best* parameters).

* Estimate the model quality with `f1` and `accuracy` scores.
* Plot a ROC-curve for the trained model. For the multiclass case you might use `scikitplot` library (e.g. `scikitplot.metrics.plot_roc(test_labels, predicted_proba)`).

*Note: please use the following hyperparameters for logistic regression: `multi_class='multinomial'`, `solver='saga'`, `tol=1e-3` and `max_iter=500`.*

In [None]:
### YOUR CODE HERE
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
model = LogisticRegressionCV(cv = 9,multi_class='multinomial', solver='saga', tol=1e-3, max_iter=500)
model.fit(X_train_norm, y_train)
y_pred = model.predict(X_test_norm)
# print(y_pred)
print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred,average='micro'))
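
A note on the averaging mode: for single-label multiclass problems, micro-averaged `f1` coincides with accuracy, so the two numbers printed above carry the same information. Macro averaging weights every class equally and can differ. A small sketch with hand-made labels (the `*_demo` lists are illustrative, not taken from the car data):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for four classes; two of the eight predictions are wrong.
y_true_demo = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred_demo = [0, 0, 1, 0, 2, 2, 3, 1]

acc = accuracy_score(y_true_demo, y_pred_demo)
f1_micro = f1_score(y_true_demo, y_pred_demo, average='micro')
f1_macro = f1_score(y_true_demo, y_pred_demo, average='macro')

# Micro F1 pools all decisions, which reduces to accuracy here;
# macro F1 averages the per-class F1 scores instead.
print(acc, f1_micro, f1_macro)
```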

In [None]:
# You might use this command to install scikit-plot. 
# Warning: if you are running locally, don't call pip from within jupyter, call it from the terminal in the corresponding
# virtual environment instead

# ! pip install scikit-plot
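
If `scikit-plot` is not available, one-vs-rest ROC curves can also be computed with `sklearn` alone. The sketch below is an assumption about how one might do it, run on synthetic data from `make_classification` rather than on the car dataset; each class is treated as "this class vs. the rest".

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic 4-class problem (stand-in for the car data).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, n_informative=5,
                                     n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.35, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # one column per class, ordered as clf.classes_

# One indicator column per class, in the same order as the probability columns.
y_te_bin = label_binarize(y_te, classes=clf.classes_)

roc_points, roc_auc = {}, {}
for col, label in enumerate(clf.classes_):
    fpr, tpr, _ = roc_curve(y_te_bin[:, col], proba[:, col])
    roc_points[label] = (fpr, tpr)
    roc_auc[label] = auc(fpr, tpr)
    print(f"class {label}: AUC = {roc_auc[label]:.3f}")
```

Each `(fpr, tpr)` pair in `roc_points` can be passed to `plt.plot` to draw the curve for that class.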

#### 2.2. PCA: explained variance plot
* Apply PCA to the train part of the data. Build the explained variance plot.

In [None]:
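### YOUR CODE HERE
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Standardize first so that no feature dominates the variance because of its scale.
pipe_pca = Pipeline([('standart', StandardScaler()), ('pca', PCA())])

pipe_pca.fit(X_train)
y = pipe_pca['pca'].explained_variance_ratio_
print(y)
# One point per principal component, numbered from 1.
x = np.arange(1, len(y) + 1)
plt.plot(x, y)
plt.scatter(x, y)
plt.ylabel("explained variance ratio")
plt.xlabel("principal component")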

#### 2.3. PCA transformation
* Select the appropriate number of components. Briefly explain your choice. Should you normalize the data?

*Use `fit` and `transform` methods to transform the `train` and `test` parts.*

In [None]:
### YOUR CODE HERE
### Here we apply the PCA transformation. The data is already normalized.
from sklearn.decomposition import PCA
import numpy as np
pca = PCA(n_components = 15)
pca.fit(X_train_norm)
X_train_new = pca.transform(X_train_norm)
X_test_new = pca.transform(X_test_norm)
# pca.transform(X_train_norm)
# pca.transform(X_test_norm).shape
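
One common way to justify the number of components is to keep the smallest number whose cumulative explained variance reaches a chosen threshold. The sketch below uses synthetic data and an assumed threshold of 0.95; both are illustrative choices, not values taken from this assignment.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 observed features driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

X_std = StandardScaler().fit_transform(X_demo)
pca_full = PCA().fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.95  # assumed target share of explained variance
# Index of the first component at which the cumulative share reaches the threshold.
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep, cumulative[n_keep - 1])
```

Passing a float such as `PCA(n_components=0.95)` asks `sklearn` to make a similar selection internally.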

**Note: From this point `sklearn` [Pipeline](https://scikit-learn.org/stable/modules/compose.html) might be useful to perform transformations on the data. Refer to the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for more information.**

#### 2.4. Logistic regression on PCA-preprocessed data.
* Find optimal hyperparameters for logistic regression with cross-validation on the transformed by PCA `train` data.

* Estimate the model quality with `f1` and `accuracy` scores.
* Plot a ROC-curve for the trained model. For the multiclass case you might use `scikitplot` library (e.g. `scikitplot.metrics.plot_roc(test_labels, predicted_proba)`).

*Note: please, use the following hyperparameters for logistic regression: `multi_class='multinomial'`, `solver='saga'` and `tol=1e-3`*

In [None]:
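### YOUR CODE HERE
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling and PCA live inside the pipeline, so they are refitted on each CV training fold.
pipe = Pipeline([('standart', StandardScaler()), ('pca', PCA(n_components=13)), ('logreg', LogisticRegression(multi_class='multinomial',solver='saga',tol=1e-3))])
# Only the regularization strength C of the logistic regression is tuned.
param_grid = dict(logreg__C=[0.01, 0.1, 1, 5, 10, 100])
grid_search = GridSearchCV(pipe, param_grid, refit=True)
grid_search.fit(X_train, y_train)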
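
After the search is fitted, the selected hyperparameters and the test metrics still have to be reported. The sketch below shows that step on synthetic data; the step names, the grid, and `n_components=6` are assumptions for the demo. It omits `multi_class` because recent `sklearn` releases deprecate that argument and, as far as I know, fit a multinomial model for multiclass targets with `saga` by default.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class problem (stand-in for the car data).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, n_informative=6,
                                     n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.35, random_state=0)

pipe_demo = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=6)),
    ('logreg', LogisticRegression(solver='saga', tol=1e-3, max_iter=500)),
])
search = GridSearchCV(pipe_demo, {'logreg__C': [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_tr, y_tr)

# With refit=True (the default) the search object predicts with the best pipeline.
y_pred_demo = search.predict(X_te)
test_acc = accuracy_score(y_te, y_pred_demo)
test_f1 = f1_score(y_te, y_pred_demo, average='macro')
print(search.best_params_, test_acc, test_f1)
```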

#### 2.5. Decision tree
* Now train a decision tree on the same data. Find the optimal tree depth (`max_depth`) using cross-validation.

* Measure the model quality using the same metrics you used above.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# YOUR CODE HERE
tree = DecisionTreeClassifier(max_depth = 15)
tree.fit(X_train_norm, y_train)
y_pred = tree.predict(X_test_norm)
print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred,average='micro'))
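
The cell above fixes `max_depth=15` by hand, while the task asks for a depth chosen by cross-validation. One possible way to do that is a grid search over the depth, sketched here on synthetic data with an assumed depth range of 1 to 15.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class problem (stand-in for the car data).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, n_informative=6,
                                     n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.35, random_state=0)

depth_grid = {'max_depth': list(range(1, 16))}  # assumed search range
depth_search = GridSearchCV(DecisionTreeClassifier(random_state=0), depth_grid,
                            cv=5, scoring='accuracy')
depth_search.fit(X_tr, y_tr)

best_depth = depth_search.best_params_['max_depth']
# Mean cross-validated accuracy of the best depth, and accuracy on the held-out test part.
print(best_depth, depth_search.best_score_, depth_search.score(X_te, y_te))
```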

#### 2.6. Bagging.
Here starts the ensembling part.

First we will use the __Bagging__ approach. Build an ensemble of $N$ algorithms varying N from $N_{min}=2$ to $N_{max}=100$ (with step 5).

We will build two ensembles: of logistic regressions and of decision trees.

*Comment: each ensemble should be constructed from models of the same family, so logistic regressions should not be mixed up with decision trees.*


*Hint 1: To build a __Bagging__ ensemble efficiently while varying its size, you might generate $N_{max}$ subsets of the `train` data (each of the same size as the original dataset) using the bootstrap procedure once. Then you train a new instance of logistic regression/decision tree, with the optimal hyperparameters you estimated before, on each subset (so you train it from scratch). Finally, to get an ensemble of $N$ models, you average the predictions of $N$ out of the $N_{max}$ models.*

*Hint 2: sklearn might help you with this task. Some appropriate function/class might be out there.*

* Plot `f1` and `accuracy` scores plots w.r.t. the size of the ensemble.

* Briefly analyse the plot. What is the optimal number of algorithms? Explain your answer.

* Do you think the hyperparameters you found for the decision tree in 2.5 are optimal for the trees used in the ensemble?

In [None]:
# YOUR CODE HERE
from sklearn.ensemble import BaggingClassifier

n_s = list(range(2, 100, 5))
log_bags = []
tree_bags = []
for n in n_s:
  # The ensemble size must follow the loop variable n.
  # The base estimator is passed positionally because its keyword name differs between sklearn versions.
  log_bag = BaggingClassifier(model, n_estimators=n).fit(X_train_new, y_train)
  tree_bag = BaggingClassifier(tree, n_estimators=n).fit(X_train_new, y_train)
  log_bags.append(log_bag)
  tree_bags.append(tree_bag)

In [None]:
import matplotlib.pyplot as plt

# Predict once per ensemble and reuse the predictions for both metrics.
preds_log = [bag.predict(X_test_new) for bag in log_bags]
preds_tree = [bag.predict(X_test_new) for bag in tree_bags]
accuracies_log = [accuracy_score(y_test, p) for p in preds_log]
accuracies_tree = [accuracy_score(y_test, p) for p in preds_tree]
f1_log = [f1_score(y_test, p, average='micro') for p in preds_log]
f1_tree = [f1_score(y_test, p, average='micro') for p in preds_tree]
plt.plot(n_s, accuracies_log, label="accuracies_log")
plt.plot(n_s, accuracies_tree, label="accuracies_tree")
# Micro-averaged f1 equals accuracy here, so the dashed lines overlap the solid ones.
plt.plot(n_s, f1_log, linestyle="--", label="f1_log")
plt.plot(n_s, f1_tree, linestyle="--", label="f1_tree")
plt.xlabel("number of estimators")
plt.legend()
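
Hint 1 can also be followed by hand instead of refitting a `BaggingClassifier` for every size. The sketch below draws the bootstrap subsets once, trains one tree per subset, and scores every ensemble size from a running sum of predicted probabilities. The data is synthetic, and `n_max = 30` and `max_depth=5` are assumptions chosen to keep the demo fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.35, random_state=0)

rng = np.random.default_rng(0)
classes = np.unique(y_tr)
n_max = 30  # kept small for the demo; the assignment uses 100

# Probabilities of every ensemble member on the test part.
all_proba = np.zeros((n_max, len(X_te), len(classes)))
for m in range(n_max):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))  # bootstrap: sample with replacement
    member = DecisionTreeClassifier(max_depth=5, random_state=m).fit(X_tr[idx], y_tr[idx])
    # Align columns in case a bootstrap sample happens to miss a class.
    cols = np.searchsorted(classes, member.classes_)
    all_proba[m][:, cols] = member.predict_proba(X_te)

# running_sum[n - 1] holds the summed probabilities of the first n members.
running_sum = np.cumsum(all_proba, axis=0)
sizes = list(range(2, n_max + 1, 5))
bagging_acc = []
for n in sizes:
    avg = running_sum[n - 1] / n
    bagging_acc.append(accuracy_score(y_te, classes[avg.argmax(axis=1)]))
print(list(zip(sizes, bagging_acc)))
```

Every model is trained exactly once, so the cost grows linearly with $N_{max}$ rather than with the sum of all ensemble sizes.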

#### 2.7. Random Forest
Now we will work with the Random Forest (its `sklearn` implementation).

* Plot `f1` and `accuracy` scores plots w.r.t. the number of trees in Random Forest.

* What is the optimal number of trees you've got? Is it different from the optimal number of logistic regressions/decision trees in 2.6? Explain the results briefly.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# YOUR CODE HERE
n_s = list(range(1, 550, 5))
# One forest per size; each forest is trained from scratch.
rand_forests = [RandomForestClassifier(n_estimators=n).fit(X_train_new, y_train) for n in n_s]
# Predict once per forest and reuse the predictions for both metrics.
preds = [forest.predict(X_test_new) for forest in rand_forests]
accuracies = [accuracy_score(y_test, p) for p in preds]
f1 = [f1_score(y_test, p, average='micro') for p in preds]
plt.plot(n_s, accuracies, label="accuracies")
plt.plot(n_s, f1, label="f1")
plt.xlabel("number of trees")
plt.legend()
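
Training a separate forest for every size repeats a lot of work. As an alternative sketch, `RandomForestClassifier` accepts `warm_start=True`, which keeps the already fitted trees and only adds the missing ones when `n_estimators` is increased. The data and the list of sizes below are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.35, random_state=0)

forest = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=0)
tree_counts = list(range(1, 52, 10))  # 1, 11, ..., 51 (assumed grid for the demo)
forest_acc = []
for n in tree_counts:
    forest.set_params(n_estimators=n)
    forest.fit(X_tr, y_tr)  # only the trees that are still missing get trained
    forest_acc.append(accuracy_score(y_te, forest.predict(X_te)))
print(list(zip(tree_counts, forest_acc)))
```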

#### 2.8. Learning curve
Your goal is to estimate how the model behaviour changes as the `train` dataset size increases.

* Split the training data into 10 (almost) equal parts. Then train the models from above (Logistic regression, Decision Tree, Random Forest), with the optimal hyperparameters you have selected, on 1 part, 2 parts (combined, so the train size is doubled), 3 parts and so on.

* Build a plot of `accuracy` and `f1` scores on the `test` part, varying the `train` dataset size (so the axes will be score vs. dataset size).

* Analyse the final plot. Can you draw any conclusions from it?

In [None]:
# YOUR CODE HERE
from sklearn.base import clone

# Models to compare. `model` and `tree` come from 2.1 and 2.5; the forest uses sklearn's
# default number of trees (an assumption, replace it with your choice from 2.7).
estimators = {
    'logreg': model,
    'tree': tree,
    'forest': RandomForestClassifier(),
}

# Fractions of the train set: 1 part out of 10, 2 parts, ..., all 10 parts.
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
sizes = [int(round(frac * len(X_train_norm))) for frac in x]

for name, estimator in estimators.items():
    accuracy, f1 = [], []
    for size in sizes:
        # Train a fresh copy on the first `size` rows (the split above already shuffled them).
        fitted = clone(estimator).fit(X_train_norm[:size], y_train[:size])
        y_pred = fitted.predict(X_test_norm)
        accuracy.append(accuracy_score(y_test, y_pred))
        f1.append(f1_score(y_test, y_pred, average='micro'))

    # One figure per metric and model, titled as accuracy_score_tree, f1_score_tree, etc.
    for metric_name, scores in (('accuracy_score', accuracy), ('f1_score', f1)):
        plt.plot(x, scores)
        plt.scatter(x, scores)
        plt.title(f"{metric_name}_{name}")
        plt.xlabel("X_part")
        plt.grid(True)
        plt.show()
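
For comparison, `sklearn.model_selection.learning_curve` automates a similar experiment. Note the difference from the task: it scores each model on cross-validation folds of the data it is given, not on a fixed `test` part. The sketch below uses synthetic data and an assumed `max_depth=5` tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=5, random_state=0),
    X_demo, y_demo,
    train_sizes=np.linspace(0.1, 1.0, 10),  # 10%, 20%, ..., 100% of each training fold
    cv=5,
    scoring='accuracy',
)

# Rows correspond to train sizes, columns to CV folds; average over the folds.
mean_val = val_scores.mean(axis=1)
print(list(zip(train_sizes, mean_val)))
```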