# **Exploratory Analysis**
First, I looked at all the features on kaggle.com. It offers a very convenient view of the data,
with a diagram and summary statistics for each feature.

I found out that there are 3 useless features that have the
exact same value for all of the data points: ['Over18', 'EmployeeCount', 'StandardHours']. So I deleted them.
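
A minimal sketch of how such constant features could be detected and dropped, assuming the data is loaded into a pandas DataFrame. The toy values below are made up for illustration and this is not necessarily the exact code used in the project.

```python
import pandas as pd

# Toy frame standing in for the real data set (values are made up).
df = pd.DataFrame({
    "Age": [41, 49, 37],
    "Over18": ["Y", "Y", "Y"],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
})

# A column with a single distinct value carries no information for a classifier.
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
df = df.drop(columns=constant_cols)
```

Checking `nunique` instead of hard-coding the column names means the same step can be reused unchanged on the test data.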

Also, some of the features are categorical (not something I can efficiently transform into numbers and order)
: ['BusinessTravel', 'Department', 'EducationField', 'Gender',
'JobRole', 'MaritalStatus', 'OverTime']. I moved them to the end of the data array (last columns) to make
later analysis easier.

A lot of features have text values, so I changed them all into numbers that make sense. For example:
'Non-Travel': 0, 'Travel_Rarely': 0.5, 'Travel_Frequently': 1.
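
As an illustration, the mapping from the example above could be applied with pandas as follows. This is a sketch that assumes a DataFrame with a `BusinessTravel` column; the three rows are made up.

```python
import pandas as pd

# Mapping taken from the example above; the ordering reflects travel frequency.
travel_map = {"Non-Travel": 0.0, "Travel_Rarely": 0.5, "Travel_Frequently": 1.0}

df = pd.DataFrame(
    {"BusinessTravel": ["Travel_Rarely", "Non-Travel", "Travel_Frequently"]}
)
df["BusinessTravel"] = df["BusinessTravel"].map(travel_map)

# map() yields NaN for any value missing from the dictionary, so check for that.
unmapped_count = int(df["BusinessTravel"].isna().sum())
```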

After that I rescaled all the features into the [0, 1] range (for the validation data I used the coefficients
obtained from the training data to make sure I was not "leaking" any information from the validation set). I expected
this to put the distances along different features on the same scale. I also tried normalizing the features by
their norms (so that the squared values sum to 1), but it ended up working worse.
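
The rescaling step can be sketched in NumPy as below. The arrays are made-up toy data; the point is only that the minimum and range come from the training set and are then reused for the validation set.

```python
import numpy as np

# Toy data: 2 features on very different scales (values are made up).
X_train = np.array([[20.0, 1000.0], [40.0, 3000.0], [60.0, 5000.0]])
X_valid = np.array([[30.0, 6000.0]])

# Coefficients are computed from the training data only.
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min
col_range[col_range == 0] = 1.0  # guard against constant columns

X_train_scaled = (X_train - col_min) / col_range
X_valid_scaled = (X_valid - col_min) / col_range  # same coefficients, no refit
```

Note that validation values can fall slightly outside [0, 1] (here the second feature becomes 1.25), which is expected when the training coefficients are reused.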

Visually I did not notice anything important from just looking at the data representation on kaggle. So I decided
to look at the correlation between all the remaining features and check whether any more of them were redundant, i.e.
not carrying any additional information. However, it seemed like all the features carry at least some new information
(no pairs of features with very high correlation).
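
A sketch of such a correlation check with pandas. The data is synthetic and the 0.9 threshold is a hypothetical choice, not a value taken from the project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                 # independent feature
})

corr = df.corr().abs()
# Keep only the upper triangle so every pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # hypothetical cut-off for "very high" correlation
redundant_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold  # NaN (lower triangle) compares False
]
```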

One of the most important things I noticed just by looking at the data is that it is very imbalanced. Thus, at some
point during my work, I tried oversampling to make the data balanced. The initial idea was just to copy
some of the less common examples, but then I found some algorithms that do it more effectively. In the end, I
tried the imblearn.over_sampling package to make the data more balanced. I used the SMOTE and ADASYN algorithms from it.
ADASYN worked better for the algorithms I used. I did not try undersampling because I decided that our data
set is not big enough for that.
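
For reference, the initial idea (copying the less common examples) can be sketched in plain NumPy as below. This is only the naive duplication approach; SMOTE and ADASYN instead synthesize new minority points, so this sketch is not a substitute for them. The function name and toy data are hypothetical.

```python
import numpy as np


def random_oversample(X, y, random_state=0):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = np.random.default_rng(random_state)  # fixed seed keeps runs reproducible
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        parts.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(parts)
    return X[idx], y[idx]


# Toy imbalanced data: 8 negatives, 2 positives.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_res, y_res = random_oversample(X, y)
```

The imblearn samplers (`SMOTE`, `ADASYN`) also accept a `random_state` argument, which serves the same purpose as the seed here.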

Moreover, to find which features are 'good' I plotted the ROC curves for all the individual features
using logistic regressions. For the same goal, I also tried a tool that was new to me:
SelectKBest from sklearn.feature_selection with 3 different score functions.
Later I tried throwing away different numbers of the worst-performing features based on all of these
approaches and their combinations.
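
A sketch of how SelectKBest can be run with several score functions. The three score functions below (`f_classif`, `chi2`, `mutual_info_classif`) are an assumption, since the text does not say which ones were used, and the data is synthetic with feature 0 made informative on purpose.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n_samples = 300
y = rng.integers(0, 2, size=n_samples)
X = rng.random((n_samples, 5))              # features in [0, 1] (chi2 needs >= 0)
X[:, 0] = 0.5 * X[:, 0] + 0.5 * y           # make feature 0 depend on the label

score_funcs = {
    "f_classif": f_classif,
    "chi2": chi2,
    "mutual_info": partial(mutual_info_classif, random_state=0),
}

top_features = {}
for name, func in score_funcs.items():
    selector = SelectKBest(score_func=func, k=2).fit(X, y)
    # Indices of the k features with the highest scores.
    top_features[name] = np.flatnonzero(selector.get_support()).tolist()
```

Comparing the selected indices across score functions gives a rough idea of which features are consistently ranked as useful.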

For the same purpose, I also tried SelectFromModel from the same package on some of my best models. This tool
shows which features are the most important for a particular model. So it helped me to retrain my models after
throwing away some of the least important features.
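
A minimal SelectFromModel sketch on synthetic data. The choice of logistic regression as the underlying model and the `"median"` threshold are assumptions made for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
# Only features 0 and 1 influence the label; the other 4 are noise.
y = (3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=400) > 0).astype(int)

# The importance of a feature is taken from the absolute value of its coefficient.
selector = SelectFromModel(LogisticRegression(solver="liblinear"), threshold="median")
selector.fit(X, y)

support = selector.get_support()      # boolean mask of the kept features
X_reduced = selector.transform(X)     # data restricted to the kept features
```

The reduced matrix can then be used to retrain the same model on fewer features.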

# **Models**

The first algorithm I chose was logistic regression (sklearn.linear_model.LogisticRegression). Since this
class is my first hands-on experience with machine learning, I decided to select the algorithm I had the most
experience with from our lectures and homework. Moreover, this algorithm is very well known and has a lot of
well-developed tooling in the same library (sklearn). On top of that, it is one of the fastest algorithms
and it has a wide variety of parameters I can work with. Logistic regression is also much easier to use than a lot of
other algorithms. All of that together allowed me not only to study some methods I had never used before
(such as SelectFromModel and GridSearchCV) but also to try a lot of different approaches more easily and quickly
(for example, it was a good way to learn how oversampling works and which parameters I should choose). Moreover,
because it is so fast, I used it to practice and develop some functions and scripts for the next models.
Also, since I used logistic regression to study the "quality" of the features, it was natural to work with these
good features via the same algorithm.

The second algorithm I used was a random forest classifier. It shares the benefits of logistic regression: it is
well studied, has a good implementation in sklearn, and is simple to use and parametrize (also, I could
reuse a lot of the scripts I wrote for logistic regression). There was one more
thing that guided me. With logistic regression I could not deal with the problem of overfitting that easily, and I thought
that a random forest classifier should be better in terms of overfitting (in the end it did not help that much, but that
is what I thought because it was easy to understand what parameters such as the maximum tree depth were actually doing).


The last algorithm I used was AdaBoost. It has all the advantages I mentioned earlier (although neither
AdaBoost nor random forest is even close to logistic regression in terms of speed). In addition, this method
was a good choice because I had already practiced a lot with decision tree classifiers while parametrizing
the random forest, so I had some idea of how deep my trees should be,
how many nodes and leaves they should have, and so on. I decided
that I could quickly try this algorithm and see if it works well (it was also about 3 times faster
than the random forest classifier). In the end, this method was the one
that performed the best.

# **Training**

For the logistic regression problem I mainly worked with the l2 regularization term and mostly used the LIBLINEAR solver,
which minimizes over $\mathbf{\beta}$ the following cost function:
$\dfrac{1}{n}\sum_{i=1}^n\log(1+\exp(-y_i\mathbf{\beta}^T\mathbf{x}_i)) + \lambda ||\mathbf{\beta}||_2^2$ where
$\lambda$ is the penalty parameter (in sklearn.linear_model.LogisticRegression it is controlled through $C$, which is
the inverse of the regularization strength, so a larger $\lambda$ corresponds to a smaller $C$).
In some cases, the discriminant function of the classifier includes a bias term. LIBLINEAR handles this term by
augmenting the vector $\mathbf{\beta}$ and each instance $\mathbf{x}_i$ with an additional dimension:
$\mathbf{\beta}^T\leftarrow [\mathbf{\beta}^T, b], \quad \mathbf{x}_i^T \leftarrow [\mathbf{x}_i^T, B]$
where $B$ is a constant specified by the user. LIBLINEAR also offers automatic parameter selection, and it applies
the coordinate descent algorithm.
For multi-class classification the problem can be decomposed in 2 possible ways: 1) one-vs-the-rest, 2) Crammer & Singer.
LIBLINEAR also supports linear SVMs.
(Source (LIBLINEAR site): https://www.csie.ntu.edu.tw/~cjlin/liblinear/)
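
To make the cost function concrete, here is a small NumPy sketch that evaluates it. It assumes labels $y_i \in \{-1, +1\}$, as the formula does; the function names are hypothetical and this only evaluates the objective, it is not the LIBLINEAR solver.

```python
import numpy as np


def logistic_cost(beta, X, y, lam):
    """(1/n) * sum_i log(1 + exp(-y_i * beta^T x_i)) + lam * ||beta||_2^2."""
    margins = y * (X @ beta)
    # logaddexp(0, -m) equals log(1 + exp(-m)) but stays stable for large |m|.
    return np.mean(np.logaddexp(0.0, -margins)) + lam * np.dot(beta, beta)


def add_bias_column(X, B=1.0):
    """Augment every instance with a constant B so the bias becomes part of beta."""
    return np.hstack([X, np.full((X.shape[0], 1), B)])


X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
beta = np.array([1.0, -1.0])

cost = logistic_cost(beta, X, y, lam=0.1)
cost_at_zero = logistic_cost(np.zeros(2), X, y, lam=0.1)  # equals log(2)
X_aug = add_bias_column(X)                                # shape (2, 3)
```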

The core principle of AdaBoost is to fit a sequence of weak learners (each at least slightly better than random
guessing) on the data, which is re-weighted at each iteration. After several iterations we combine all the weak
learners using the obtained weights.
AdaBoostClassifier from sklearn.ensemble implements the algorithm known as AdaBoost-SAMME. This training algorithm can be
represented in the following steps:

1) Initialize the observation weights $w_i=1/n$.

2) For $m=1$ to $M$:

   a) fit a classifier $T^m$ to the training data using the weights $w_i$;

   b) compute the error $err^m=\sum_{i=1}^nw_iI(y_i\ne T^m(x_i))/\sum_{i=1}^nw_i$;

   c) compute the coefficient $\alpha^m=\log\dfrac{1-err^m}{err^m}$ (for $K$ classes SAMME adds $\log(K-1)$,
   which is zero in the binary case);

   d) set $w_i\leftarrow w_i \exp(\alpha^m I(y_i\ne T^m(x_i)))$;

   e) re-normalize the weights $w_i$.

The final output is the weighted combination of the classifiers $T^m$ with the coefficients $\alpha^m$.
(Source: J. Zhu, H. Zou, S. Rosset, T. Hastie, “Multi-class AdaBoost”, 2009.)
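
The steps above can be sketched for the binary case as follows. This is a simplified illustration with depth-1 trees and labels in $\{-1, +1\}$, not the sklearn implementation; the function names and the toy 1D data set are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_samme(X, y, n_rounds=30):
    """Binary AdaBoost-SAMME with depth-1 trees; labels must be -1 or +1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # 1) uniform initial weights
    learners, alphas = [], []
    for _ in range(n_rounds):                     # 2) boosting rounds
        tree = DecisionTreeClassifier(max_depth=1, random_state=0)
        tree.fit(X, y, sample_weight=w)           # a) weighted fit
        miss = tree.predict(X) != y
        err = np.sum(w * miss) / np.sum(w)        # b) weighted error
        if err >= 0.5:                            # no better than chance: stop
            break
        alpha = np.log((1.0 - err) / max(err, 1e-10))  # c) learner coefficient
        learners.append(tree)
        alphas.append(alpha)
        if err == 0.0:                            # perfect learner: nothing to re-weight
            break
        w = w * np.exp(alpha * miss)              # d) up-weight the mistakes
        w = w / w.sum()                           # e) re-normalize
    return learners, np.array(alphas)


def predict_samme(learners, alphas, X):
    """Sign of the alpha-weighted vote of all weak learners."""
    votes = np.zeros(X.shape[0])
    for alpha, tree in zip(alphas, learners):
        votes += alpha * tree.predict(X)
    return np.where(votes >= 0, 1, -1)


# Toy problem a single stump cannot solve: positive only inside an interval.
rng = np.random.default_rng(0)
X = rng.random((200, 1))
y = np.where((X[:, 0] > 0.3) & (X[:, 0] < 0.7), 1, -1)

learners, alphas = fit_samme(X, y)
stump_accuracy = np.mean(learners[0].predict(X) == y)
ensemble_accuracy = np.mean(predict_samme(learners, alphas, X) == y)
```

Comparing `stump_accuracy` with `ensemble_accuracy` shows how the weighted vote of several weak learners can fit a pattern that no single stump can.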

The random forest classifier is a black-box algorithm which averages a number of decision trees fitted on various
sub-samples of the features. Moreover, a 2nd level of randomness is achieved by training each tree on a sample
of the training set drawn with replacement. This high level of randomness should help to decrease
the overall variance of the final classifier. The scikit-learn implementation combines the classifiers by
averaging their probabilistic predictions.
(Source: https://scikit-learn.org/ documentation and guides)

Logistic regression was by far the fastest method to use. The average time required for 1 training run on the
whole training data was ~9ms, while for the random forest this time was more than 100 times longer, at ~1s.
AdaBoost was faster than random forest, at around 0.4s on average. (I did not use any extra optimization
or parallelism features for these measurements.)
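
A sketch of how such timings could be measured. The helper name, the synthetic data, and the default model settings are assumptions; the actual numbers depend on the machine, the data, and the hyperparameters.

```python
import time

import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def mean_fit_time(model, X, y, repeats=3):
    """Average wall-clock time of model.fit over several runs, in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        model.fit(X, y)
        durations.append(time.perf_counter() - start)
    return sum(durations) / repeats


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

models = {
    "logistic_regression": LogisticRegression(solver="liblinear"),
    "random_forest": RandomForestClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
}
fit_times = {name: mean_fit_time(model, X, y) for name, model in models.items()}
```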


# **Hyperparameter Selection**

For all the algorithms I used GridSearchCV followed by more precise tuning by hand. The logic
I used to tune the parameters is the following:

**Common ideas for all the algorithms:**

For the class weights, I tried all the values from 1:1 up to 1:5 and also the "balanced" case. In most cases
a ratio around 1:2.5 - 1:30 worked better than the balanced case. After that I used
oversampled data (with a 1:1 ratio between labels).
GridSearchCV allows me to see the F1 score for each combination of parameters (by using verbose=3), but for that
I had to set n_jobs=1 (the per-combination output did not work when running in parallel).

First, I set a range for each of the possible parameters of each algorithm
and looked at the performance across all of them (it may take some time, but it is definitely worth it
to get some idea of how each parameter affects the algorithm). After that, I could cast aside a lot of
values for most of the parameters (if f1 was too low (underfitting) or very close to 1 (overfitting)).

Then I tuned every parameter one by one and looked at the performance on both the training and validation data sets,
trying to make them as high as possible (mostly on the validation set). As soon as I got some good combinations
of parameters, I looked closely at each parameter with finer steps in a small range around these
combinations.
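
A coarse grid search of this kind could look like the sketch below. The data is synthetic and the grid values are hypothetical; they are only meant to show the shape of the search, not the ranges actually used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the real training set.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.84, 0.16], random_state=0
)

# Coarse grid first; a finer grid around the best values would follow.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "class_weight": [None, "balanced", {0: 1, 1: 2.5}],
}
search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    scoring="f1",
    cv=5,
    return_train_score=True,  # lets us compare training and validation f1
    n_jobs=1,
)
search.fit(X, y)

best_params = search.best_params_
n_combinations = len(search.cv_results_["params"])
```

`search.cv_results_` holds `mean_train_score` and `mean_test_score` for every combination, which makes it possible to spot both underfitting and overfitting regions of the grid.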

Also, for all of the algorithms I tuned the number of folds for the cross-validation, and the best value was 5-7 folds.

**Some algorithm-specific notes:**

Random Forest: For this algorithm I was tuning the following parameters: ['n_estimators', 'criterion',
'max_depth', 'min_samples_split', 'min_samples_leaf', 'class_weight', 'max_leaf_nodes', 'max_features'
]. The most important parameters were 'max_depth' and 'max_features'.

AdaBoost: For AdaBoost with the decision tree estimator I already knew where to look for
'max_depth' and 'max_features' values, since I had tuned them in the random forest case. And when I tried some
different combinations it turned out that these values should be in the same range. However,
tuning AdaBoost was more complicated because 2 parameters, 'learning_rate' and 'n_estimators', strongly affect
the accuracy of the algorithm. Even a slight change in either of them may completely change the performance.

Logistic regression: Some of the parameters I tuned were: ['C', 'penalty', 'intercept_scaling', 'solver'].
First, I started with the solvers and found out that they do not affect the f1 score that much, so I settled on the
one which works for both l1 and l2 regularization. Also, l2 performed better, so I worked mostly with it.

Below you can see 2 plots with the f1 scores of my final AdaBoost model vs 2 different hyperparameters (learning rate
and number of estimators). We can see that the best values are 0.97 and 104 for the learning rate and
the number of estimators, respectively. Both the validation data and the training data are shown
(the black line is for the validation data, the red line is for the training data).



# **Data Splits**

My first step was to split the training data into 2 parts: the training part and the validation part. The validation data
was used to estimate the performance of my final algorithms and it was untouched until the very end. For this
split, I tried several values: 10%, 15%, 20%, 25%, and 30% for the validation data. In the end, I settled
on 20% because lower values were not accurate enough to judge my models, and higher values took too much
away from the training data, so the models worked worse. For the final model, I tried using both the training data alone and
the full training + validation data.
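
An 80/20 split of this kind can be sketched with `train_test_split`. The stratification (`stratify=y`) is an assumption added here because the data is imbalanced; the text does not say whether the original split was stratified. The toy arrays are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 80 negatives and 20 positives.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Hold out 20% for validation; stratify keeps the class ratio in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```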

For the cross-validation process, I tried several numbers of folds: from 4 to 10. 10 was definitely too many
(probably because the data is not large enough and it is also unbalanced). The best performing value I settled on
was 6-7 folds. It showed the best results in terms of not overfitting the data too much. To get this result
I trained my models with different parameters on the training data (using different numbers of folds),
checked the average f1 and accuracy values, and then using these models I
looked at the f1 and accuracy values for the validation data. If the values for the training data were very high
while the same values for the validation data were much lower (like 99% vs 50%), I was definitely overfitting the
data too much.
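
The comparison of training and validation scores for different numbers of folds can be sketched with `cross_validate`. The unrestricted decision tree is chosen here on purpose because it overfits, and the data is synthetic; both are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=0
)

results = {}
for n_splits in (4, 6, 10):
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = cross_validate(
        DecisionTreeClassifier(random_state=0),  # no depth limit: prone to overfit
        X,
        y,
        cv=cv,
        scoring=["f1", "accuracy"],
        return_train_score=True,
    )
    results[n_splits] = {
        "train_f1": float(np.mean(scores["train_f1"])),
        "valid_f1": float(np.mean(scores["test_f1"])),
    }
```

A training f1 near 1 together with a clearly lower validation f1 is the overfitting signature described above.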

# **Errors and Mistakes**

While working on this project I made several mistakes, since I am new to ML:

1) I did not preprocess the data the same way for training data and test data. I was not noticing it for quite a
while and could not understand why my scores on kaggle were all below 50%.

2) For the first couple of days I did not split my training data into training and validation (test) parts.
So I could not check the performance of my models other than by looking at the scores on the training data (which were
always 1) and checking the results on Kaggle (which was limited to 10 submissions per day). This slowed me down a lot.

3) I did not know that in GridSearchCV you can increase the number of parallel jobs, so all my training runs were
very slow (when I set the number of jobs to 16 (I have 16 threads), the speed increased about 7-8 times).

4) I was working with a very low number of features (I only used 5-15 of the best features) because I thought it could
make my algorithms perform better, but in reality, almost all the best results I got used 31 features (all of the
usable ones).

5) In the beginning I was not looking at the results of each combination of the parameters and only worked
with the best models. Thus, I could not control well what exactly each parameter was doing.

6) I was changing too many hyperparameters at the same time trying to catch the best result. In the end, I stopped
doing it and started to look at them 1 by 1 and at some small combinations of changes.

7) I did not know that Kaggle has an amazing data representation by default so I was trying to plot all the features
to see what they are.

8) Also, one huge mistake I made: I forgot to pass random_state to the oversampling, so I was always getting
different values when I was learning how to work with imblearn.over_sampling.

# **Predictive Accuracy**

Kaggle nickname: Dakine100500 (Dima Tsvetkov)

If we compare the results using the f1 score on kaggle.com, my best result was achieved with the AdaBoost algorithm ($F_1=0.7$),
my second best was achieved with logistic regression ($F_1=0.66$), and I had the least success with the random forest
($F_1=0.59$). However, these numbers correlate a lot with the time I spent on each algorithm and the order in which I
worked on them. The last algorithm I tried to optimize was AdaBoost, and I also spent the most time on it
compared to the 2 others. So I believe it is definitely possible to achieve better results with the other 2, especially
with the random forest algorithm, where I struggled a lot with parameter optimization.

For my final model we can look at ROC curve for it (with AUC=).
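
For reference, a ROC curve and its AUC can be computed from predicted scores as in the sketch below. The labels and scores are a made-up toy example, not the outputs of the final model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted scores (e.g. from predict_proba or decision_function).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve: false positive rate vs true positive rate.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under that curve; 3 of the 4 (positive, negative) pairs are ranked correctly.
auc = roc_auc_score(y_true, scores)
```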

# **Code**
