In [None]:
from libs import *

# Model Building
<a id='contents'></a>
Here we build a machine learning model that classifies Reddit posts into categories. The notebook follows these steps:

1. [Load labelled data](#section1)
2. [Train / Test split](#section2)
3. [Apply cleaning / transformation](#section3)
4. [Train models](#section4)
5. [Tune model hyperparameters](#section5)

<a id='section1'></a>
## 1. Load labelled data
[back](#contents)

In [None]:
df = pd.read_csv('datasets/all_reddit_labelled.csv')

Create dataset containing the following labels for training / prediction:

In [None]:
TARGET = 'label'
LABELS = ['screeners', 
          'bad test', 
          'ratings', 
          'recorder', 
          'live convo', 
          'no test', 
          'mobile', 
          'bug', 
          'payment']

In [None]:
from helpers import DatasetCreator

In [None]:
creator = DatasetCreator(cols_to_drop_na=TARGET, train=True, labels=LABELS)
data = creator.transform(df)

In [None]:
data.head()

### Explore the labels
Are the classes balanced, i.e. do we have roughly the same number of items in each category?

In [None]:
data[TARGET].value_counts()
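
Raw counts can be hard to compare at a glance. The sketch below, which uses a small made-up label series rather than the real `data[TARGET]`, shows one way to express the balance as proportions and as a single imbalance ratio. The names `labels` and `imbalance_ratio` are illustrative only.

```python
import pandas as pd

# Hypothetical toy labels standing in for data[TARGET]
labels = pd.Series(['bug', 'bug', 'bug', 'payment', 'mobile', 'bug', 'payment', 'bug'])

counts = labels.value_counts()
proportions = labels.value_counts(normalize=True)

# Ratio of the most common class to the least common class
imbalance_ratio = counts.max() / counts.min()

print(proportions.round(3))
print(f'Imbalance ratio (largest / smallest class): {imbalance_ratio:.1f}')
```

A ratio close to 1 suggests balanced classes; a large ratio is one reason to consider options such as `class_weight='balanced'`, which the models later in this notebook use.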

## 2. Train-Test Split
<a id="section2"></a>[back](#contents)

A train-test split is used to estimate how well a machine learning algorithm performs when it makes predictions on data that was not used to train the model.

<img src="figures/train_test_split.png" width=500>

* **Train Dataset**: Used to fit the machine learning model.
* **Test Dataset**: Used to evaluate the fitted machine learning model.

#### Cross-validation
When the dataset is small, we can use *k*-fold cross-validation to evaluate performance: we divide the training data into *k* parts, train on *k-1* parts and evaluate on the remaining part, repeating this so that each part is held out once. ([See later](#hyperparameter)) <a id="cv"></a>
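
As a rough illustration of the procedure described above, the following sketch runs 5-fold cross-validation on synthetic data. The dataset, the choice of `k = 5` and the decision tree settings are assumptions made for the example, not values used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the vectorized training data
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=42)

# k = 5: each fold is held out once while the model trains on the other 4
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(max_depth=5, random_state=42),
                         X, y, cv=cv, scoring='accuracy')

print(f'Fold accuracies: {scores.round(3)}')
print(f'Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})')
```

The spread across folds gives a sense of how much a single train/test split could mislead on a small dataset.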

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42, stratify=data[TARGET])

In [None]:
train_df[TARGET].value_counts()

In [None]:
test_df[TARGET].value_counts()
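
The two cells above check that the split kept the class mix intact. A compact way to make the same comparison is sketched below on a made-up, imbalanced toy frame; the column names and the 80/20 class mix are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 'bug' posts and 20 'payment' posts
toy = pd.DataFrame({'text': [f'post {i}' for i in range(100)],
                    'label': ['bug'] * 80 + ['payment'] * 20})

train, test = train_test_split(toy, test_size=0.2, random_state=42,
                               stratify=toy['label'])

# Class proportions in the full data, the training set and the test set
comparison = pd.DataFrame({
    'full': toy['label'].value_counts(normalize=True),
    'train': train['label'].value_counts(normalize=True),
    'test': test['label'].value_counts(normalize=True),
})
print(comparison)
```

With `stratify` set, the three columns should be nearly identical; without it, rare classes can end up under-represented in the test set.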

In [None]:
y_train = train_df[TARGET]
X_train = train_df.drop(columns=TARGET)

y_test = test_df[TARGET]
X_test = test_df.drop(columns=TARGET)

## 3. Apply cleaning / vectorization 
<a id='section3'></a>[back](#contents)

We make use of Simon's text cleaning / vectorizer code and create a scikit-learn [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). This sequentially applies the text cleaning and the vectorizer to create a sparse matrix. In this case we use a vocabulary of the top 1500 tokens, so we have a matrix of dimension `(n_train, 1500)`: one row per training post and one column per token. The matrix is sparse because most posts contain only a few of these tokens.
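
The contents of `vectorizer_pipe` are not shown in this notebook. As a minimal sketch of what such a pipeline might look like, the example below chains a hypothetical cleaning function with scikit-learn's `CountVectorizer`. The `clean_texts` function, the step names and the sample posts are all assumptions, not the actual implementation.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def clean_texts(texts):
    """Lowercase and replace non-letter characters with spaces (hypothetical cleaning step)."""
    return [re.sub(r'[^a-z\s]', ' ', t.lower()) for t in texts]


sketch_pipe = Pipeline([
    ('cleaner', FunctionTransformer(clean_texts)),
    ('vectorizer', CountVectorizer(max_features=1500)),  # keep at most 1500 tokens
])

posts = ['The recorder crashed on my phone!',
         'No tests available for 3 days...',
         'Payment arrived late again']

X_sparse = sketch_pipe.fit_transform(posts)
print(X_sparse.shape)  # (number of posts, vocabulary size)
print(f'Non-zero entries: {X_sparse.nnz} of {X_sparse.shape[0] * X_sparse.shape[1]}')
```

Even in this tiny example most entries are zero, which is why a sparse representation is used.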

In [None]:
from pipeline import vectorizer_pipe

In [None]:
vectorizer_pipe.fit(X_train, y_train)
X_train = vectorizer_pipe.transform(X_train)

In [None]:
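# Note: scikit-learn 1.2 removed get_feature_names(). If the 'vectorizer' step is a
# scikit-learn vectorizer on a recent version, call get_feature_names_out() instead.
feature_names = vectorizer_pipe.named_steps['vectorizer'].get_feature_names()
feature_names[:10]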

In [None]:
X_train.toarray()

In [None]:
X_test = vectorizer_pipe.transform(X_test)

## 4. Train Models
<a id="section4"></a>[back](#contents)

We look at some of the more commonly used machine learning algorithms. In particular, we will be making extensive use of the [scikit-learn](https://scikit-learn.org/stable/index.html) library, one of the most popular machine learning libraries for Python.

Before we get started, we need to define some success criteria. Here we have a multi-class classification problem, so one obvious metric is accuracy: the fraction of posts assigned the correct label. Another useful tool is the confusion matrix, which counts how often each true label is predicted as each possible label and so provides a good way of inspecting prediction errors.
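
To make the two metrics concrete, here is a small worked example on six made-up posts. The labels are illustrative and do not come from the dataset.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for six posts
y_true = ['bug', 'bug', 'payment', 'payment', 'mobile', 'mobile']
y_hat = ['bug', 'payment', 'payment', 'payment', 'mobile', 'bug']

labels = ['bug', 'mobile', 'payment']

# Fraction of posts labelled correctly: 4 of 6
acc = accuracy_score(y_true, y_hat)
print(acc)

# Rows are true labels, columns are predicted labels, both in the order of `labels`
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)
```

The diagonal holds the correct predictions. The off-diagonal entries show which classes get confused: here one `bug` post was predicted as `payment` and one `mobile` post as `bug`.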

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score

def evaluate(clf, plot=True):
    """Evaluate test set performance"""
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'Model accuracy on test set = {accuracy}')
    
    if plot:
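        # Compute and plot the confusion matrix as a heatmap.
        # Pass labels explicitly so rows/columns line up with clf.classes_
        # even if a class is missing from the test set.
        cf = confusion_matrix(y_test, y_pred, labels=clf.classes_)
        df_cf = pd.DataFrame(cf, columns=clf.classes_, index=clf.classes_)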
        fig, ax = plt.subplots(figsize=(8, 6))
        sns.heatmap(df_cf, ax=ax, annot=True, cmap='Blues')
        ax.set_xlabel('Predicted Label')
        ax.set_ylabel('True Label')

    
def evaluate_train(clf):
    """Compute training set accuracy score"""
    y_pred = clf.predict(X_train)
    accuracy = accuracy_score(y_train, y_pred)
    print(f'Model accuracy on training set = {accuracy}')

### Decision Tree Classifier
* Pros: Easy to train, easy to interpret
* Cons: Easy to overfit

<img src="figures/decision_tree.png">

In [None]:
from sklearn import tree
params = {'max_depth': 15, 'class_weight': 'balanced'}
clf = tree.DecisionTreeClassifier(**params)
clf.fit(X_train, y_train)

In [None]:
evaluate_train(clf)

In [None]:
evaluate(clf, plot=True)

The model performs much better on the training set than on the test set. This is called *overfitting*. Essentially, the model has 'memorized' the training data, including its noise, instead of learning patterns that generalize to unseen posts.
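
One way to see overfitting develop is to compare training and test accuracy as the tree is allowed to grow deeper. The sketch below does this on synthetic, deliberately noisy data; the dataset and the list of depths are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (flip_y randomly reassigns about 20% of labels)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

results = []
for depth in [2, 5, 10, None]:  # None lets the tree grow until leaves are pure
    tree_clf = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    results.append((depth, tree_clf.score(X_tr, y_tr), tree_clf.score(X_te, y_te)))

for depth, train_acc, test_acc in results:
    print(f'max_depth={depth}: train={train_acc:.3f}, '
          f'test={test_acc:.3f}, gap={train_acc - test_acc:.3f}')
```

A widening gap between the two scores as depth increases is the usual signature of overfitting.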

### Random Forest Classifiers

These are an ensemble learning method for classification:
* Operate by constructing a multitude of decision trees at training time
* Output the class that is the mode (the most common vote) of the classes predicted by the individual trees
* Correct the tendency of decision trees to overfit
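
The voting idea in the list above can be sketched by collecting the prediction of each individual tree and taking the most common one. Note that scikit-learn's `RandomForestClassifier.predict` averages the trees' class probabilities rather than counting hard votes, so the two can occasionally disagree. The data and settings below are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=42)
forest = RandomForestClassifier(n_estimators=25, max_depth=5, random_state=42).fit(X, y)

# Predictions of each individual tree, shape (n_trees, n_samples).
# The sub-trees predict indices into forest.classes_.
tree_votes = np.array([t.predict(X) for t in forest.estimators_]).astype(int)

# Majority (mode) vote per sample
n_classes = len(forest.classes_)
majority = np.array([np.bincount(col, minlength=n_classes).argmax()
                     for col in tree_votes.T])
majority_labels = forest.classes_[majority]

# Compare the hard vote with scikit-learn's probability-averaged prediction
agreement = (majority_labels == forest.predict(X)).mean()
print(f'Agreement between hard vote and forest.predict: {agreement:.3f}')
```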

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
params = {'n_estimators':100, 'random_state':42, 'max_depth':10, 'class_weight':'balanced'}
rf_clf = RandomForestClassifier(**params)
rf_clf.fit(X_train, y_train)

In [None]:
evaluate_train(rf_clf)

Evaluate [Random Forest model](#eval2) performance: <a id='eval1'></a>

In [None]:
evaluate(rf_clf)

## 5. Hyper-parameter Tuning
<a id="section5"></a>[back](#contents)

Most of the models come with a set of adjustable parameters (or hyper-parameters) that can significantly modify the performance of the model. Some of the important parameters for the models above are: 

**Decision Trees**:
- The depth of the tree (`max_depth`): the deeper the tree, the more likely it is to overfit

**Random Forest**:
- Number of trees in the ensemble (`n_estimators`): more trees generally give more stable predictions, with diminishing returns and a higher training cost
- Number of features considered when splitting a node (`max_features`)
- Depth of the trees (`max_depth`)

<img src="figures/hyperparameter_tuning.png">

It is in our interest to identify the best set of hyperparameters that will yield the highest performing model.

### Random Search Cross Validation 
<a id="hyperparameter"></a>
Recall [*k*-fold cross-validation](#cv) from section 2. We use the `RandomizedSearchCV` class in scikit-learn to sample from a grid of hyperparameter ranges and perform *k*-fold cross-validation with each sampled combination of values.

In [None]:
# Look at the parameters that are currently used:
rf_clf.get_params()

Create the parameter grid to sample from during fitting:

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 20, stop = 200, num = 10)]

# Number of features to consider at every split
# ('auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3)
max_features = ['sqrt', 'log2']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 25, num = 5)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'class_weight': ['balanced']}
pprint(random_grid)

### Random Search Training

In [None]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()

# Random search of parameters, using 3-fold cross-validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator=rf,
                               param_distributions=random_grid,
                               n_iter=100,
                               cv=3,
                               verbose=2,
                               random_state=42,
                               n_jobs=-1)

# Fit the random search model
rf_random.fit(X_train, y_train)

In [None]:
rf_random.best_params_
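
Besides `best_params_`, the fitted search object keeps the cross-validated score of every sampled combination in `cv_results_`. The sketch below shows one way to tabulate them, using a much smaller search on synthetic data; the grid, `n_iter` and dataset are assumptions chosen so the example runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=42)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={'n_estimators': [10, 20, 40], 'max_depth': [3, 5, None]},
    n_iter=5, cv=3, random_state=42,
)
search.fit(X, y)

# One row per sampled combination, best-ranked first
summary = (pd.DataFrame(search.cv_results_)
           [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
           .sort_values('rank_test_score'))
print(summary.to_string(index=False))
print(f'Best CV accuracy: {search.best_score_:.3f}')
```

If the top few rows differ by less than their standard deviation, the search has probably not found a meaningfully better configuration.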

In [None]:
best_random = rf_random.best_estimator_
# evaluate() prints the accuracy and returns None, so there is nothing to assign
evaluate(best_random)

<a id='eval2'></a>
Compared with the [original model](#eval1), the performance is nearly identical, indicating that we haven't really improved the model through hyperparameter tuning. This is perhaps not so surprising, given that the model was already overfitting.

### Save models and data

In [None]:
joblib.dump(vectorizer_pipe, 'trained_models/vectorizer_pipe.pkl')
joblib.dump(rf_clf, 'trained_models/random_forest_classifier.pkl')

In [None]:
joblib.dump(X_test, 'datasets/X_test.pkl')
joblib.dump(y_test, 'datasets/y_test.pkl')
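
The saved artifacts can later be restored with `joblib.load`. The following sketch shows a save-and-reload round trip on a small synthetic model written to a temporary directory; the model, data and path are assumptions for the example and do not touch the files saved above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'random_forest_classifier.pkl')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(f'Predictions match after reload: {same_predictions}')
```

Pickled models are generally only safe to load with the same library versions that created them, and only from trusted sources.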