# Train a random-forest classifier

Let's see how a random forest can learn to distinguish misogynistic
speech from generic speech.

In [1]:
import numpy as np
import pandas as pd
import spacy
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

In [2]:
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import auc, f1_score, confusion_matrix, roc_curve
from sklearn.model_selection import GridSearchCV

### Load the data

In [3]:
df = pd.read_csv("../../data/processed/stanford-all.csv", index_col=0)

In [4]:
df.shape

(31014, 2)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31014 entries, 0 to 31013
Data columns (total 2 columns):
content    31014 non-null object
label      31014 non-null int64
dtypes: int64(1), object(1)
memory usage: 726.9+ KB


### Balance classes

To avoid bias toward the majority class (i.e. the algorithm preferring one class over the other), we want an equal number of examples in each class.

In [6]:
df.groupby('label').count()

label,content
0,11273
1,19741


The two classes have different sizes: 11273 examples for class 0
and 19741 for class 1.
Let's fix that by downsampling the majority class.
First, let's count the elements of the minority class.

In [7]:
n_elements = (df['label']==0).sum()
n_elements

11273

Then sample an equal number from the majority class.

In [8]:
df_positive = df[df['label']==1].sample(n_elements, random_state=42)
df_positive.head()

,content,label
10724,@DYKE_37 O que DST?,1
15443,A buzz cut does not a dyke make,1
7968,@my_suhr_guitar @KondratieffWave @NicolaSturge...,1
30805,@lovedayas THIS WASNT EVEN FUNNY BITCH OAK,1
30564,@DreadLegend_ Its gone be all worth it when im...,1


In [9]:
df_balanced = pd.concat((df[df['label']==0], df_positive)).reset_index(drop=True)
df_balanced

,content,label
0,The new Doras cute af,0
1,@minniemonikive well,0
2,@tangletorn We will be killed by a snake 3,0
3,@ATX_fight_club @AOC JFK was a clandestine aus...,0
4,@ocorreia_ @skank_ @duudamarquess_ eu tambm kk...,0
...,...,...
22541,@blackearnside @richardqspencer @BBCsarahsmith...,1
22542,Hope shes having a wonderful time with her won...,1
22543,hnn i want to be humiliated and degraded for b...,1
22544,Message to.. 1. Be good 2. Thanks for making m...,1


We store the balanced data in the original dataframe.

In [10]:
df = df_balanced
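
The same downsampling can also be written in a single step. This is a minimal sketch on a made-up toy dataframe, not the real data; it assumes pandas 1.1 or later, where `groupby(...).sample` is available.

```python
import pandas as pd

# Toy data standing in for the real dataframe: 3 rows of class 0, 5 of class 1
toy = pd.DataFrame({
    "content": [f"text {i}" for i in range(8)],
    "label": [0, 0, 0, 1, 1, 1, 1, 1],
})

# Size of the smallest class
n_min = int(toy["label"].value_counts().min())

# Draw n_min rows from every class, then rebuild a clean 0..N-1 index
toy_balanced = (
    toy.groupby("label")
    .sample(n=n_min, random_state=42)
    .reset_index(drop=True)
)

print(toy_balanced["label"].value_counts().to_dict())
```

Unlike the step-by-step version, this does not need to know in advance which class is the minority.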

Now let's create the spaCy NLP documents from our text data.

First we need to load the language model.

In [11]:
nlp = spacy.load("en_core_web_md")

In [None]:
# The .pipe() method batch processes all the text
docs = list(nlp.pipe(df['content']))

### Vectors visualization

Here we take the document vectors from spaCy.

In [None]:
# One row per document, one column per vector dimension
vector_matrix = np.array([doc.vector for doc in docs])
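
For pipelines that ship word vectors, such as `en_core_web_md`, spaCy's default `Doc.vector` is the average of the token vectors. Here is a minimal NumPy sketch of that averaging, using made-up 4-dimensional vectors in place of the real 300-dimensional ones:

```python
import numpy as np

# Hypothetical token vectors for a 3-token text (one row per token)
token_vectors = np.array([
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
])

# Averaging over tokens gives one fixed-size vector per document
doc_vector = token_vectors.mean(axis=0)
print(doc_vector)  # [2. 2. 1. 2.]
```

Because of the averaging, every text maps to a vector of the same length, but word order is lost.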

In [None]:
vector_matrix.shape

We use PCA to project the vectors onto their first two principal components, so that we can plot them in 2D.

In [None]:
pca = PCA(2)

In [None]:
x_pca = pca.fit_transform(vector_matrix)

In [None]:
x_pca.shape

In [None]:
plt.scatter(x_pca[:, 0], x_pca[:, 1], c=df['label'].tolist(), alpha=.05, cmap='rainbow');
plt.colorbar();
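
A 2D projection usually keeps only part of the variance, so overlapping classes in the plot do not imply that they overlap in the full vector space. The sketch below shows how to check the retained fraction; it uses synthetic random data as a stand-in for `vector_matrix`, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for vector_matrix: 500 samples, 50 features
X = rng.normal(size=(500, 50))

pca_demo = PCA(n_components=2)
X_2d = pca_demo.fit_transform(X)

# Fraction of the total variance captured by each component
ratios = pca_demo.explained_variance_ratio_
retained = ratios.sum()
print(X_2d.shape, ratios, retained)
```

In this notebook the equivalent check would be `pca.explained_variance_ratio_.sum()`.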

## Random forest classifier

Random forests are versatile and generally perform well with little tuning.
We'll keep the default parameters and set the number of trees
explicitly to 100 (the default since scikit-learn 0.22;
earlier versions used 10).

In [None]:
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)

### Train-test split

It's important to stratify the split by label
so that both the training and the test set keep the 50/50 class ratio.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    vector_matrix,
    df['label'].to_numpy(),
    train_size=.8,
    shuffle=True,
    stratify=df['label'],
    random_state=42,
)

In [None]:
x_train.shape

Let's check the class balance:

In [None]:
y_train.mean(), y_test.mean()

And now we fit the model:

In [None]:
rf.fit(x_train, y_train)

Let's try the fitted model on the test set:

In [None]:
y_pred = rf.predict(x_test)

We can also extract class probabilities.
The 0/1 prediction can always be recomputed from them:
it is 1 when the probability of class 1 is greater than 0.5,
and 0 otherwise.

In [None]:
y_proba = rf.predict_proba(x_test)
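
The relation between probabilities and hard predictions can be checked directly. This is a minimal sketch on synthetic data from `make_classification`, which stands in for the document vectors; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem standing in for the document vectors
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

proba = clf.predict_proba(X)
print(clf.classes_)  # column i of proba belongs to clf.classes_[i]

# Threshold the class-1 column at 0.5 to recover the hard predictions
recomputed = (proba[:, 1] > 0.5).astype(int)
agreement = (recomputed == clf.predict(X)).mean()
print(agreement)
```

Keeping the probabilities also makes it possible to try thresholds other than 0.5, trading false positives against false negatives.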

There are several metrics to evaluate the performance of the model.
Common ones include the F1-score, the confusion matrix,
the ROC curve, and the area under the ROC curve (AUC).

In [None]:
f1_score(y_test, y_pred)
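
To make the link between the confusion matrix and the F1-score concrete, here is a small worked example on made-up labels (not the notebook's data):

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1_manual = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)  # 3 1 1 3
print(f1_manual, f1_score(y_true_toy, y_pred_toy))
```

The F1-score is the harmonic mean of precision and recall, so it only becomes high when both are high.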

For comparison, let's compute the F1-score on the training data:

In [None]:
f1_score(y_train, rf.predict(x_train))

This is a very high value, which is expected: the model has already seen these examples during training, so this score says little about how well it generalizes. Only the test score does.

Let's look at what the prediction arrays look like.

In [None]:
y_pred

In [None]:
y_proba

`y_proba` contains two columns: the first is the probability of class 0, the second is the probability
of class 1.

In [None]:
# Convenience function: plots the ROC curve, annotates it with AUC and F1, and returns both scores
def plot_roc_auc_f1(y_test, y_proba, title=None):
    f1 = f1_score(y_test, y_proba[:, 1]>.5)
    fpr, tpr, _ = roc_curve(y_test, y_proba[:, 1])
    auc_score = auc(fpr, tpr)
    fig, ax = plt.subplots(figsize=(6,6))
    if title is not None:
        ax.set_title(title)
    ax.plot([0, 1], [0, 1], '--', label="Random")
    ax.plot(fpr, tpr, label="Your model")
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.annotate(f"AUC: {auc_score:.4}", (.8, 0.05))
    ax.annotate(f"F1: {f1:.4}", (.8, 0.0))
    ax.legend()
    return f1, auc_score

### ROC curve

In [None]:
f1, auc_score = plot_roc_auc_f1(y_test, y_proba);

The dashed diagonal corresponds to the performance of a random classifier (AUC=0.5).
The further the curve lies above it, toward the top-left corner, the better.
The curve looks quite good and AUC=0.85 is a decent value.
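
As a sketch of what the AUC measures, the toy example below (made-up labels and scores) computes it in two equivalent ways: from the curve with `auc`, and directly with `roc_auc_score`.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

roc_y = [0, 0, 1, 1]
roc_scores = [0.1, 0.4, 0.35, 0.8]

# Area under the curve built from the (fpr, tpr) points
fpr_toy, tpr_toy, thresholds_toy = roc_curve(roc_y, roc_scores)
auc_from_curve = auc(fpr_toy, tpr_toy)

# Same quantity computed in one call
auc_direct = roc_auc_score(roc_y, roc_scores)
print(auc_from_curve, auc_direct)  # 0.75 0.75
```

The value 0.75 can be read as a ranking statistic: of the four possible (positive, negative) pairs, three have the positive example scored higher.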

### F1-score, AUC

In [None]:
f1, auc_score

In [None]:
# Number of input features seen during fit (`n_features_` in older scikit-learn versions)
rf.n_features_in_

### A closer look at the predictions

In [None]:
df_train, df_test = train_test_split(
    df,
    train_size=.8,
    shuffle=True,
    stratify=df['label'],
    random_state=42,
)

In [None]:
df_train.shape

In [None]:
df_test.shape

In [None]:
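# Same random_state and stratification as the earlier split, so the rows line up with y_proba
df_test = df_test.assign(prediction=y_proba[:, 1])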

Let's take a look at what the model predicts.

In [None]:
df_test[df_test['prediction']>.5].sort_values('prediction', ascending=False)['content'].tolist()

In [None]:
df_test[df_test['prediction']<=.5].sort_values('prediction', ascending=False)['content'].tolist()

### Visualization of probability distribution

In [None]:
df_test.loc[df_test['label']==1, 'prediction'].plot.hist(bins=99, alpha=.5, label="Misogyny")
df_test.loc[df_test['label']!=1, 'prediction'].plot.hist(bins=99, alpha=.5, label="No misogyny");
plt.legend();

## Parameter optimization

One can fine-tune the model's hyperparameters in order to
find the best possible model.

In [None]:
estimator = RandomForestClassifier(n_jobs=-1, random_state=42)

In [None]:
# BEWARE: the number of combinations is the product of the list lengths, so every extra parameter multiplies the calculation time!
param_grid = {
    "n_estimators": [50, 200],
#     "max_depth": [3, None],
#     "max_features": [1, 10, 100],
#     "min_samples_split": [10, 100, 1000],
#     "bootstrap": [True, False],
#     "criterion": ["gini", "entropy"],
}
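
To get a feel for how quickly the search grows, the number of candidates can be counted before fitting anything. The sketch below uses the grid from this notebook with every line uncommented and assumes 5-fold cross-validation; `full_grid` is an illustrative name.

```python
from sklearn.model_selection import ParameterGrid

# The grid with every parameter enabled
full_grid = {
    "n_estimators": [50, 200],
    "max_depth": [3, None],
    "max_features": [1, 10, 100],
    "min_samples_split": [10, 100, 1000],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
}

# One candidate per combination of values: 2 * 2 * 3 * 3 * 2 * 2
n_candidates = len(ParameterGrid(full_grid))
n_fits = n_candidates * 5  # each candidate is fitted once per fold
print(n_candidates, n_fits)  # 144 720
```

With only `n_estimators` enabled, the same count is 2 candidates and 10 fits.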

In [None]:
grid_search = GridSearchCV(estimator, param_grid=param_grid, cv=5, verbose=2, n_jobs=-1)

In [None]:
%%time
grid_search.fit(x_train, y_train)

In [None]:
grid_search.cv_results_

In [None]:
grid_search.best_estimator_
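
When no `scoring` argument is given, `GridSearchCV` ranks candidates with the estimator's own `score` method, which is accuracy for classifiers. Since this notebook reports the F1-score, one option is to rank by F1 explicitly. The sketch below shows that on a small synthetic problem; the data and the grid values are assumptions chosen so it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem so the search finishes in seconds
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 30]},
    scoring="f1",  # rank candidates by F1 instead of the default accuracy
    cv=3,
)
search.fit(X, y)

# Best parameter combination and its mean cross-validated F1
print(search.best_params_, search.best_score_)
```

With balanced classes, as here, accuracy and F1 tend to pick similar models, so this mostly matters when the classes are skewed.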

In [None]:
f1_score(y_test, grid_search.predict(x_test))

In [None]:
y_proba = grid_search.predict_proba(x_test)

In [None]:
plot_roc_auc_f1(y_test, y_proba, title="Random forest - Grid search best result");

## Predict user input

Remember: 0 is not misogynistic, 1 is misogynistic.

In [None]:
best_rf = grid_search.best_estimator_

In [None]:
insult_vector = nlp("bitch").vector

In [None]:
# predict() expects a 2D array: one row per sample
best_rf.predict(insult_vector.reshape(1, -1))

In [None]:
best_rf.predict(nlp("Have a nice day").vector.reshape(1, -1))
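
For repeated experiments it can be handy to wrap these steps in a small helper that returns the class-1 probability rather than the hard label. This is only a sketch: `predict_misogyny_proba` is a hypothetical name, and the function assumes `nlp` returns an object with a `.vector` attribute and `model` exposes `predict_proba`, as the spaCy pipeline and the random forest used here do.

```python
import numpy as np

def predict_misogyny_proba(text, nlp, model):
    """Return the model's probability that `text` belongs to class 1."""
    # One row = one sample, as scikit-learn expects a 2D array
    vector = np.asarray(nlp(text).vector).reshape(1, -1)
    # Column 1 holds the probability of class 1 (misogynistic)
    return float(model.predict_proba(vector)[0, 1])
```

In this notebook it would be called as `predict_misogyny_proba("Have a nice day", nlp, best_rf)`.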

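Seems to be working on these two examples! Keep in mind that this is only a quick sanity check; the test-set metrics above remain the real measure of performance.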