## Q3. Machine Learning Challenge

In [1]:
# Imports
import pandas as pd

# Custom implementation using the most popular sklearn tools
from utils import ClassifierGSCV


### Goal and Approach

I will implement a custom classifier using **random forests**, **grid search** and **cross validation** for predicting the degree of toxicity. 

I will **not** try or implement other classifiers, as I do not think that would prove anything for the purposes of this test.

I will **not** focus on model performance. Instead, I will showcase how I would approach a problem like this in a simple way, with a limited amount of time and resources. Why random forests? They are among the best-known classifiers, are widely regarded as strong performers, and handle multi-class classification natively.

In [3]:
# Reading the data and taking a first look
data = pd.read_excel("toxicity_xls.xlsx", engine = "openpyxl", index_col=0)
data.head()

| | flirtation | identity_attack | insult | severe_toxicity | sexually_explicit | threat | label |
|---|---|---|---|---|---|---|---|
| 0 | 0.593828 | 0.563516 | 0.84909 | 0.864632 | 0.777347 | 0.602494 | Offensive |
| 1 | 0.213193 | 0.407253 | 0.92501 | 0.856451 | 0.456983 | 0.592931 | Offensive |
| 2 | 0.474532 | 0.323574 | 0.710831 | 0.747318 | 0.933715 | 0.208848 | Very offensive |
| 3 | 0.503426 | 0.407557 | 0.796685 | 0.854638 | 0.955973 | 0.343336 | Neutral |
| 4 | 0.394807 | 0.170078 | 0.561849 | 0.766563 | 0.4593 | 0.223698 | Profanity |


In [11]:
# Let's describe our data to see if there are any issues that pop
data.describe()

| | flirtation | identity_attack | insult | severe_toxicity | sexually_explicit | threat |
|---|---|---|---|---|---|---|
| count | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 | 12000.0 |
| mean | 0.416813 | 0.438216 | 0.80158 | 0.821883 | 0.556574 | 0.405262 |
| std | 0.185237 | 0.266264 | 0.162062 | 0.095602 | 0.286541 | 0.256686 |
| min | 0.029391 | 0.03771 | 0.02443 | 0.024729 | 0.017585 | 0.026123 |
| 25% | 0.285029 | 0.232623 | 0.699709 | 0.747318 | 0.305679 | 0.224373 |
| 50% | 0.403791 | 0.353419 | 0.843521 | 0.821408 | 0.548136 | 0.307148 |
| 75% | 0.501138 | 0.606927 | 0.936827 | 0.894143 | 0.820784 | 0.497426 |
| max | 0.949213 | 0.993878 | 0.994336 | 0.984462 | 1.0 | 1.0 |


All features range from 0 to 1, which is a good first indication.
Interesting to see that the averages for severe_toxicity and insult are quite high (about 0.82 and 0.80); I wonder if that will affect the feature relevance.

In [12]:
#Computing the correlation between all features
data.corr(method='pearson')

| | flirtation | identity_attack | insult | severe_toxicity | sexually_explicit | threat |
|---|---|---|---|---|---|---|
| flirtation | 1.0 | -0.077118 | -0.20865 | 0.094356 | 0.795262 | 0.081217 |
| identity_attack | -0.077118 | 1.0 | 0.563307 | 0.445643 | 0.025 | 0.178985 |
| insult | -0.20865 | 0.563307 | 1.0 | 0.659208 | 0.036811 | -0.068323 |
| severe_toxicity | 0.094356 | 0.445643 | 0.659208 | 1.0 | 0.352226 | 0.256016 |
| sexually_explicit | 0.795262 | 0.025 | 0.036811 | 0.352226 | 1.0 | -0.072997 |
| threat | 0.081217 | 0.178985 | -0.068323 | 0.256016 | -0.072997 | 1.0 |


It makes sense that **flirtation** has a strong positive linear correlation with **sexually_explicit** (r ≈ 0.80).
**identity_attack, insult and severe_toxicity** are also noticeably correlated with each other (r between about 0.45 and 0.66).

Depending on the algorithm chosen, we might need to drop some of the highly correlated variables, as keeping them could introduce a bias or make the model give extra weight to that specific behaviour.
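
As a minimal sketch of how the highly correlated pairs could be listed programmatically, the snippet below rebuilds the correlation matrix from the values in the table above and filters it by absolute correlation. The helper name `correlated_pairs` and the 0.7 cut-off are illustrative assumptions, not part of **utils.py**.

```python
from itertools import combinations

import pandas as pd

features = ["flirtation", "identity_attack", "insult",
            "severe_toxicity", "sexually_explicit", "threat"]

# Pearson correlations copied from the data.corr() output above.
corr = pd.DataFrame(
    [
        [1.0, -0.077118, -0.20865, 0.094356, 0.795262, 0.081217],
        [-0.077118, 1.0, 0.563307, 0.445643, 0.025, 0.178985],
        [-0.20865, 0.563307, 1.0, 0.659208, 0.036811, -0.068323],
        [0.094356, 0.445643, 0.659208, 1.0, 0.352226, 0.256016],
        [0.795262, 0.025, 0.036811, 0.352226, 1.0, -0.072997],
        [0.081217, 0.178985, -0.068323, 0.256016, -0.072997, 1.0],
    ],
    index=features,
    columns=features,
)


def correlated_pairs(corr, threshold):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold."""
    return [
        (a, b, float(corr.loc[a, b]))
        for a, b in combinations(corr.columns, 2)
        if abs(corr.loc[a, b]) >= threshold
    ]


# 0.7 is an arbitrary cut-off chosen only for illustration.
strong_pairs = correlated_pairs(corr, threshold=0.7)
print(strong_pairs)
```

With a real dataset, `corr` would simply be the result of `data.corr(method='pearson')`; whether a flagged feature should actually be dropped still depends on the algorithm.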

Our variables are pretty clean already: there is no need for scaling, outlier analysis or encoding, and I do not see feature engineering as worthwhile with this dataset.
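
The "clean already" claim can be backed by two quick checks: no missing values, and every feature inside the [0, 1] range. The sketch below runs them on the five rows shown by `data.head()` above so that it is self-contained; on the real data the same two expressions would be applied to `data`. The variable names are illustrative.

```python
import pandas as pd

# The five rows shown by data.head() above.
sample = pd.DataFrame(
    {
        "flirtation": [0.593828, 0.213193, 0.474532, 0.503426, 0.394807],
        "identity_attack": [0.563516, 0.407253, 0.323574, 0.407557, 0.170078],
        "insult": [0.84909, 0.92501, 0.710831, 0.796685, 0.561849],
        "severe_toxicity": [0.864632, 0.856451, 0.747318, 0.854638, 0.766563],
        "sexually_explicit": [0.777347, 0.456983, 0.933715, 0.955973, 0.4593],
        "threat": [0.602494, 0.592931, 0.208848, 0.343336, 0.223698],
        "label": ["Offensive", "Offensive", "Very offensive", "Neutral", "Profanity"],
    }
)
feature_cols = sample.columns.drop("label")

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# True for each feature whose values all lie in [0, 1].
in_unit_range = sample[feature_cols].apply(lambda col: col.between(0, 1).all())

print(missing_per_column)
print(in_unit_range)
```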

Next I will use our custom class to train the model; please see **utils.py**.

First, I will check the target distribution:

In [13]:
data.groupby(['label']).agg({'label': 'count'})

| label | count |
|---|---|
| Extremely offensive | 426 |
| Hate speech | 86 |
| Neutral | 814 |
| Offensive | 5966 |
| Profanity | 3430 |
| Unknown | 140 |
| Very offensive | 1138 |


There are clear differences in the distribution: **Offensive** alone accounts for roughly half of the 12,000 rows, while **Hate speech** has only 86. We would probably need to **oversample** some of these labels to make them more representative and "force" our model to predict them. I will not cover that here, since it is also a question for the "business": how important is it to predict the under-represented classes?
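
For reference, a minimal sketch of random oversampling with plain pandas is shown below. The function name `oversample_to_majority` and the toy data are made up for illustration. In practice this should be applied only to the training portion of each split, so that duplicated rows never leak into the validation data.

```python
import pandas as pd


def oversample_to_majority(df, label_col="label", random_state=0):
    """Duplicate random rows of each minority class until every class
    has as many rows as the largest one. Original rows are all kept."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        parts.append(group)
        extra = target - len(group)
        if extra > 0:
            # Sample with replacement so small classes can be grown.
            parts.append(group.sample(n=extra, replace=True, random_state=random_state))
    return pd.concat(parts, ignore_index=True)


# Toy frame with made-up values and a deliberately skewed label distribution.
toy = pd.DataFrame(
    {
        "insult": [0.9, 0.8, 0.85, 0.7, 0.75, 0.6, 0.95],
        "label": ["Offensive"] * 4 + ["Profanity"] * 2 + ["Hate speech"],
    }
)

balanced = oversample_to_majority(toy)
print(balanced["label"].value_counts())
```

A lighter alternative that needs no resampling is the `class_weight="balanced"` option of sklearn's `RandomForestClassifier`, which reweights classes inversely to their frequency.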

Now we fit the classifier using my grid search and cross validation implementation.
The decision on the best classifier will be **based on the default scoring function from the sklearn implementation of the RandomForestClassifier**, which is mean accuracy.

I will **not customize the scoring function**, so the classifier with the **highest average accuracy across the 5 folds will be selected**.

In summary, **accuracy** will be our decision metric. There are others we could use, such as **precision** and/or **recall**, and they would make sense given the imbalance we observe in the label distribution!
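
To illustrate why accuracy can mislead on imbalanced labels, here is a small made-up example: a "model" that always predicts the majority class. The labels and counts are invented for the illustration; only the sklearn metric functions are real.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up ground truth: 8 majority-class rows and 2 minority-class rows.
y_true = ["Offensive"] * 8 + ["Hate speech"] * 2
# A degenerate classifier that always answers with the majority class.
y_pred = ["Offensive"] * 10

accuracy = accuracy_score(y_true, y_pred)
# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced = balanced_accuracy_score(y_true, y_pred)
macro_recall = recall_score(y_true, y_pred, average="macro")

print(accuracy, balanced, macro_recall)
```

Accuracy looks respectable here even though the minority class is never predicted, while the class-averaged metrics expose the problem. If the selection criterion were to be changed, sklearn's `GridSearchCV` accepts a `scoring` argument such as `"balanced_accuracy"` or `"f1_macro"`; whether `ClassifierGSCV` exposes it is not shown here.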

In [14]:
# Initialise our class from an existing dataframe and the hyperparameters we want to grid-search over.
# The more parameters, the longer the model takes to build.
# Note that the current implementation tries every combination of the selected values,
# so the number of candidates is the product of the list lengths (here 1 x 2 x 2 = 4)
# and grows multiplicatively, not linearly, as hyperparameters are added

clsf = ClassifierGSCV.from_data(data, n_estimators = [200], max_depth = [None, 50], criterion = ['gini', 'entropy'])
clsf.fit_classifier()
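
The contents of **utils.py** are not reproduced here, so the following is only a sketch of what a grid search with 5-fold cross validation over the same grid could look like using sklearn directly. The synthetic data from `make_classification` and the reduced `n_estimators` are assumptions made to keep the example quick; they are not the notebook's data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Same grid as in the cell above.
param_grid = {
    "n_estimators": [200],
    "max_depth": [None, 50],
    "criterion": ["gini", "entropy"],
}

# Every combination is tried: 1 x 2 x 2 candidates, each fitted once per fold.
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5
print(n_candidates, "candidates,", n_candidates * n_folds, "fits")

# Synthetic stand-in for the real features and labels (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Smaller forests than in the notebook, only to keep the sketch fast.
small_grid = dict(param_grid, n_estimators=[20])
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=n_folds)
search.fit(X, y)

# With no `scoring` argument, candidates are ranked by mean accuracy.
print(search.best_params_, search.best_score_)
```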

In [15]:
# Check the results for all the hyperparameters combinations
clsf.check_results()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.242761 | 0.195613 | 0.264093 | 0.032211 | gini | None | 200 | {'criterion': 'gini', 'max_depth': None, 'n_es... | 0.600417 | 0.615 | 0.640417 | 0.63375 | 0.441667 | 0.58625 | 0.07365 | 4 |
| 1 | 8.081723 | 0.417538 | 0.253872 | 0.022 | gini | 50.0 | 200 | {'criterion': 'gini', 'max_depth': 50, 'n_esti... | 0.5975 | 0.620417 | 0.6425 | 0.6325 | 0.4525 | 0.589083 | 0.069923 | 3 |
| 2 | 16.152425 | 0.424774 | 0.25337 | 0.010818 | entropy | None | 200 | {'criterion': 'entropy', 'max_depth': None, 'n... | 0.604583 | 0.61625 | 0.6425 | 0.628333 | 0.457083 | 0.58975 | 0.067519 | 2 |
| 3 | 17.146311 | 1.074815 | 0.26973 | 0.05084 | entropy | 50.0 | 200 | {'criterion': 'entropy', 'max_depth': 50, 'n_e... | 0.605 | 0.6175 | 0.647083 | 0.64375 | 0.458333 | 0.594333 | 0.069816 | 1 |


This approach is fairly resilient to **overfitting**, especially because we are tuning **max depth** and using **cross validation**. This might lower the score metric (here, accuracy): other approaches may well report higher accuracies but generalize worse in a production environment.

Note that we got around 64% accuracy on split 2 and only around 45% on split 4, so this performance metric **is very sensitive to how the data is split into training and testing sets**. Without cross validation the metric would be very **volatile: with a lucky split you get a good accuracy, with an unlucky one a low accuracy**.
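
The spread can be quantified directly from the per-split scores of the top-ranked row (entropy, max_depth 50) in the results table. The short check below recomputes the mean and standard deviation reported in `mean_test_score` and `std_test_score`.

```python
import numpy as np

# split0..split4 test scores of the rank-1 configuration, copied from the table.
split_scores = np.array([0.605, 0.6175, 0.647083, 0.64375, 0.458333])

mean_score = split_scores.mean()
# Population standard deviation (ddof=0), the NumPy default.
std_score = split_scores.std()
# Gap between the luckiest and unluckiest split.
spread = split_scores.max() - split_scores.min()

print(round(mean_score, 6), round(std_score, 6), round(spread, 6))
```

A single train/test split could have landed anywhere in that range, which is why the averaged score is the more trustworthy number. The consistently low last split may also be worth investigating, for example by checking whether the rows are ordered in some way.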

In [4]:
# I am initializing the classifier again just to show that we can predict without having to
# run everything again from scratch; we can start from this point and predict for new data
clsf = ClassifierGSCV(simple = True)
# Running the prediction for the first 10 rows of our data
# I will not run and evaluate on all the data because we have already decided what the best model is
# I did not hold out a chunk of the dataset for testing because cross validation already does several splits internally
# This only gives an idea of what the output looks like; a real evaluation would only make sense
# on brand new data or on reserved data that the model has not seen
clsf.predict_classifier(data.iloc[0:10,0:5].values)

| | Extremely offensive | Hate speech | Neutral | Offensive | Profanity | Unknown | Very offensive |
|---|---|---|---|---|---|---|---|
| 0 | 0.01 | 0.0 | 0.0 | 0.794639 | 0.190361 | 0.0 | 0.005 |
| 1 | 0.0 | 0.0 | 0.0 | 0.88 | 0.095 | 0.0 | 0.025 |
| 2 | 0.0 | 0.0 | 0.0 | 0.161667 | 0.123333 | 0.0 | 0.715 |
| 3 | 0.0 | 0.0 | 0.655 | 0.21 | 0.105 | 0.0 | 0.03 |
| 4 | 0.0 | 0.0 | 0.025 | 0.07 | 0.905 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.19 | 0.805 | 0.0 | 0.005 |
| 6 | 0.005 | 0.0 | 0.03 | 0.2 | 0.75 | 0.0 | 0.015 |
| 7 | 0.95 | 0.04 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 0.01 | 0.03 | 0.96 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.01 | 0.085 | 0.9 | 0.005 | 0.0 |
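
The output above is a table of class probabilities, so a single predicted label per row still has to be derived from it. A minimal sketch, using the first five rows of that table, picks the column with the highest probability. The variable names are illustrative.

```python
import pandas as pd

classes = ["Extremely offensive", "Hate speech", "Neutral", "Offensive",
           "Profanity", "Unknown", "Very offensive"]

# First five rows of the probability table above.
proba = pd.DataFrame(
    [
        [0.01, 0.0, 0.0, 0.794639, 0.190361, 0.0, 0.005],
        [0.0, 0.0, 0.0, 0.88, 0.095, 0.0, 0.025],
        [0.0, 0.0, 0.0, 0.161667, 0.123333, 0.0, 0.715],
        [0.0, 0.0, 0.655, 0.21, 0.105, 0.0, 0.03],
        [0.0, 0.0, 0.025, 0.07, 0.905, 0.0, 0.0],
    ],
    columns=classes,
)

# Most probable class per row, and the probability assigned to it.
predicted = proba.idxmax(axis=1)
confidence = proba.max(axis=1)

print(pd.DataFrame({"predicted": predicted, "confidence": confidence}))
```

For these five rows the predicted labels agree with the `label` column shown by `data.head()`, which is unsurprising because the model saw these rows during training.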


As expected given the nature of the data set, our model will probably **predict the most represented labels very well**. If I had a test data set, it would be nice to see **how the model performs on the least represented labels versus the most represented ones**. Measuring this on the training/validation data might lead to wrong conclusions; I would need a brand new data set, or I would need to reserve a percentage of the main dataset just for this purpose. I did not do so because I would not change my implementation based on that conclusion, so I might as well train on the entire dataset.

Please keep in mind I am speaking about the **testing set**, not the validation set - we had **multiple train and validation splits during the cross validation process**, so there is no issue there!
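
If a test set were reserved, a stratified split would keep the rare labels present in both parts, and a per-class report would show how the under-represented labels fare. The sketch below uses synthetic, deliberately imbalanced data from `make_classification` as a stand-in (an assumption); the 80/20 split and the forest size are likewise arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 3-class data standing in for the real dataset.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_classes=3,
    weights=[0.7, 0.25, 0.05], random_state=0,
)

# stratify=y keeps the class proportions similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Precision, recall and F1 for each class, evaluated on unseen rows only.
report = classification_report(y_test, model.predict(X_test), zero_division=0)
print(report)
```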

### Next Steps

1. Try different algorithms.
2. More hyperparameter tuning.
3. Create more evaluation metrics, especially to evaluate the under-represented labels.
4. Measure and tackle the impact of having a very imbalanced label distribution (oversampling, downsampling, etc.).
5. Deploy the model and implement monitoring.
6. Run and evaluate the model on new test data.