# Machine Learning Project - Inappropriate Language Classification

Classification using the following models:
- Decision tree model -> sourced from the sklearn.tree module. (This notebook)
- Random forest model -> sourced from the sklearn.ensemble module. (This notebook)
- Support vector machine model -> sourced from the sklearn.svm module. (This notebook)
- LSTM model -> sourced from the tensorflow.keras module. (LSTM notebook)

There are two options for tokenisation and embedding:
- **CountVectorizer** (tokenisation)
- **GloVe model** (tokenisation + embedding)

The **CountVectorizer** builds its dictionary from the entire dataset, so it is limited to the words present in the dataset. That dictionary may contain words that pretrained models have never seen. Conversely, pretrained models have almost certainly seen words that are absent from the dataset. Furthermore, it does not embed the words, so the models need to infer the relations between the words on their own.
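The vocabulary limitation can be illustrated with a minimal sketch. The two-sentence corpus is a hypothetical stand-in for the real dataset, and `CountVectorizer` is used with its default settings, which may differ from what `experiment_baseplate` configures.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the real dataset
toy_corpus = ["you are nice", "you are rude"]

vectorizer = CountVectorizer()
toy_counts = vectorizer.fit_transform(toy_corpus)

# The vocabulary holds only the words seen while fitting
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)

# A sentence made only of unseen words becomes an all-zero count vector
unseen = vectorizer.transform(["completely different words"])
print(unseen.toarray())
```

Words outside the fitted vocabulary are silently dropped, so a new insult that never appeared in the dataset carries no signal for the classifier.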

The **GloVe model** is a pretrained model trained on a large text corpus. It provides both tokenisation and embedding of the words, which gives the classifier an initial notion of how words relate to each other. For example, the embedding of King - Man + Woman lies close to that of Queen. Furthermore, its dictionary contains words that are not present in the dataset, which allows the trained model to be applied more effectively to new inputs containing unseen words. For example, the word *biatch*, even if it does not appear in the dataset, would likely have an embedding close to that of *bitch*, which would allow the model to infer its meaning.
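The analogy can be sketched with hand-made vectors. The values below are invented for illustration and are not real GloVe embeddings; the helper names `cosine` and `nearest` are also hypothetical.

```python
import numpy as np

# Hand-made 2-D toy vectors (NOT real GloVe values):
# axis 0 roughly encodes royalty, axis 1 roughly encodes gender
toy_embeddings = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(vector, exclude=()):
    """Return the word whose embedding is most similar to `vector`."""
    candidates = {w: v for w, v in toy_embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vector, candidates[w]))

target = toy_embeddings["king"] - toy_embeddings["man"] + toy_embeddings["woman"]
result = nearest(target, exclude=("king", "man", "woman"))
print(result)
```

With real embeddings the match is approximate rather than exact, which is why the nearest neighbour is searched by similarity instead of testing for equality.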

This Jupyter Notebook contains the following sections:

## 1. Data Preparation

Here we load the data, choose a tokeniser and preprocess the data by tokenising it. **Choose only one of the tokenisers**: running the second block overwrites the data produced by the first.

### Load Data - Count Vectorizer

In [None]:
from experiment_baseplate import get_split_count_vectorizer

X_train, y_train, X_validate, y_validate, X_test, y_test = get_split_count_vectorizer()

# The testing cell reads this flag; only the SVC cell sets it to True
y_processed = False

### Load Data - GloVe

There are three ways of using the GloVe model:
1. Get the average of the vectors - works best
2. Get the sum of the vectors - works ok
3. Flatten all the vectors - requires padding and training with batches, and doesn't work at all
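The difference between the three options shows up in the shape of the resulting feature matrix. The sketch below uses hypothetical toy sentences (3 and 2 tokens, 4-dimensional vectors) and pads by hand so that it only depends on numpy.

```python
import numpy as np

# Hypothetical input: two sentences with 3 and 2 tokens, each token a 4-D vector
embedding_dim = 4
toy_sentences = [np.ones((3, embedding_dim)), np.full((2, embedding_dim), 2.0)]

# 1. Average: one fixed-size vector per sentence
averaged = np.array([s.mean(axis=0) for s in toy_sentences])

# 2. Sum: same shape, but the magnitude grows with the sentence length
summed = np.array([s.sum(axis=0) for s in toy_sentences])

# 3. Flatten: pad every sentence to max_len tokens, then concatenate the token vectors
max_len = 3
padded = np.zeros((len(toy_sentences), max_len, embedding_dim))
for i, s in enumerate(toy_sentences):
    tail = s[-max_len:]                      # keep at most the last max_len tokens
    padded[i, max_len - len(tail):] = tail   # zeros go in front of the tokens
flattened = padded.reshape(len(toy_sentences), -1)

print(averaged.shape, summed.shape, flattened.shape)
```

Averaging and summing keep the feature count equal to the embedding size, while flattening multiplies it by the padded sentence length, which is why that option needs far more memory.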

In [None]:
# Download the GloVe weights if needed
from experiment_baseplate import get_glove_model

get_glove_model()

In [None]:
from experiment_baseplate import get_split_glove_embedding

X_train, y_train, X_validate, y_validate, X_test, y_test = get_split_glove_embedding()

# The testing cell reads this flag; only the SVC cell sets it to True
y_processed = False

#### 1. Get the average of the vectors

In [None]:
import numpy as np

def post_process_glove(X_values):
    return np.array([np.mean(np.array(v), axis=0) for v in X_values])

#### 2. Get the sum of the vectors

In [None]:
import numpy as np

def post_process_glove(X_values):
    return np.array([np.sum(np.array(v), axis=0) for v in X_values])

#### 3. Flatten all the vectors

If you use this option, we recommend training the model in batches.

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

max_input_length = 100

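def post_process_glove(X_values):
    # dtype="float32" keeps the real-valued embeddings; an integer dtype would truncate them
    X_values = pad_sequences(X_values, maxlen=max_input_length, dtype="float32")
    # Flatten (samples, tokens, dims) into (samples, tokens * dims)
    return X_values.reshape(-1, X_values.shape[1] * X_values.shape[2])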

#### Compute the data

In [None]:
X_train = post_process_glove(X_train)
X_validate = post_process_glove(X_validate)
X_test = post_process_glove(X_test)

## 2. Model selection

### 1. Decision Tree

The decision tree model is built and trained by sklearn. No hyperparameters are passed, so the library defaults are used.

In [None]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
model = "decision tree"
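As a rough illustration of what relying on the defaults means, the sketch below inspects one default and shows how it could be overridden. The values `max_depth=10` and `random_state=0` are arbitrary assumptions, not settings used by this project.

```python
from sklearn.tree import DecisionTreeClassifier

# Defaults chosen by sklearn when no arguments are passed
default_tree = DecisionTreeClassifier()
print(default_tree.get_params()["max_depth"])  # None means no depth limit

# Hypothetical alternative: cap the depth and fix the seed for reproducible runs
limited_tree = DecisionTreeClassifier(max_depth=10, random_state=0)
limited_params = limited_tree.get_params()
print(limited_params["max_depth"], limited_params["random_state"])
```

An unlimited tree can memorise the training set, so comparing the validation scores of both variants is one way to check for overfitting.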

### 2. Random Forest

The random forest model is built and trained by sklearn. No hyperparameters are passed, so the library defaults are used.

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
model = "random forest"

### 3. Support Vector Machine

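The support vector machine model is built and trained by sklearn with the library's default parameters. `SVC` only accepts one-dimensional labels, so the cell below keeps the second column of each label array and sets `y_processed` to `True`, which is later passed to `score`. Run this cell only once per data load, since slicing labels that are already one-dimensional raises an error.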

In [None]:
from sklearn.svm import SVC

# SVC needs 1-D labels: keep only the second column of each label array
y_train = y_train[:, 1]
y_validate = y_validate[:, 1]
y_test = y_test[:, 1]
y_processed = True

clf = SVC()
model = "svc"
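The label slicing above can be checked on a small example. The sketch assumes the labels are one-hot encoded with two columns, and the meaning given to each column is a guess made for illustration.

```python
import numpy as np

# Hypothetical one-hot labels (assumed layout: column 0 = appropriate, column 1 = inappropriate)
toy_one_hot = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])

# Taking column 1 yields a 1-D class vector
toy_flat = toy_one_hot[:, 1]

# For two-class one-hot labels this matches argmax over the columns
matches_argmax = np.array_equal(toy_flat, np.argmax(toy_one_hot, axis=1))
print(toy_flat, matches_argmax)
```

If the labels ever gain a third class, `np.argmax(..., axis=1)` is the variant that still produces a valid class index.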

## 3. Training

Make sure you have loaded the data and selected a model, then click on run, sit back and relax while the model trains.

#### Without Batches

In [None]:
print("Training...")
clf.fit(X_train, y_train) 
print("Training finished...")

#### With Batches

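If the data is too big to train in one go, it can be fed in batches. Note that for these sklearn classifiers every call to `fit` restarts training from scratch, so after the loop below the model only reflects the last batch it saw. Prefer the version without batches whenever the data fits in memory.

In [None]:
import math

print("Training...")

batch_size = 200
# math.ceil also covers the final, possibly smaller, batch
n_batches = math.ceil(X_train.shape[0] / batch_size)
for i in range(n_batches):
    batch = slice(batch_size * i, batch_size * (i + 1))
    # Caution: each fit() call discards what was learned from earlier batches
    clf.fit(X_train[batch], y_train[batch])

print("Training finished...")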
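One possible way to let every batch contribute is the `warm_start` option of `RandomForestClassifier`, which keeps the trees grown by earlier `fit` calls and adds new ones. This is a sketch on hypothetical toy data, not the method used in this notebook. The value of `trees_per_batch` is an assumption, and every batch must contain all classes. `DecisionTreeClassifier` and `SVC` offer no comparable option.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data standing in for X_train / y_train
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(600, 5))
y_toy = (X_toy[:, 0] > 0).astype(int)

batch_size = 200
trees_per_batch = 10  # assumed value, tune for the real data

# warm_start=True keeps the trees grown by earlier fit() calls
forest = RandomForestClassifier(n_estimators=trees_per_batch, warm_start=True, random_state=0)
for i, start in enumerate(range(0, X_toy.shape[0], batch_size)):
    # Raise the target size so this fit() adds trees_per_batch new trees
    forest.set_params(n_estimators=trees_per_batch * (i + 1))
    forest.fit(X_toy[start:start + batch_size], y_toy[start:start + batch_size])

n_trees = len(forest.estimators_)
print(n_trees)
```

Each tree still sees only one batch, so the result is an ensemble of batch-level trees rather than a forest trained on the full dataset.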

## 4. Testing

Once the model is trained, run the cell below to get the results on unseen data. Scores are reported for both the validation and the test set: the validation set was not used during training and is the larger of the two, while the test set is the one to use when comparing against the other models.

In [None]:
from experiment_baseplate import score
import time

# Time the predictions on the validation set
start = time.time_ns()
y_val_predict = clf.predict(X_validate)
val_time = time.time_ns() - start

# Time the predictions on the test set
start = time.time_ns()
y_test_predict = clf.predict(X_test)
test_time = time.time_ns() - start

# inf_time is the average inference time per sample
print(model + " Model")
print(f"Validate values\n\t{score(y_val_predict, y_validate, y_processed=y_processed)} | inf_time : {val_time / X_validate.shape[0]} ns")
print(f"Test values\n\t{score(y_test_predict, y_test, y_processed=y_processed)} | inf_time : {test_time / X_test.shape[0]} ns")
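The per-sample timing can also be wrapped in a small helper. The sketch below is a hypothetical utility, not part of `experiment_baseplate`. It uses `time.perf_counter_ns`, a monotonic clock intended for measuring intervals, and a dummy function stands in for `clf.predict`.

```python
import time

def time_per_sample_ns(predict, X, n_samples):
    """Run predict(X) once and return (predictions, average nanoseconds per sample)."""
    start = time.perf_counter_ns()
    predictions = predict(X)
    elapsed = time.perf_counter_ns() - start
    return predictions, elapsed / n_samples

# Hypothetical stand-in for clf.predict: label 1 when the first feature is positive
def dummy_predict(X):
    return [1 if row[0] > 0 else 0 for row in X]

toy_X = [[0.5], [-1.0], [2.0]]
toy_predictions, ns_per_sample = time_per_sample_ns(dummy_predict, toy_X, len(toy_X))
print(toy_predictions, f"{ns_per_sample:.0f} ns/sample")
```

With the notebook's data the call would look like `time_per_sample_ns(clf.predict, X_validate, X_validate.shape[0])`.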

## 5. Saving

Remember to choose the proper save location!

In [None]:
file_loc = 'data/your_model.pkl'

In [None]:
import pickle

with open(file_loc, 'wb') as f:
    pickle.dump(clf, f)

## 6. Loading

Remember to choose the proper load location!

In [None]:
file_loc = 'data/your_model.pkl'
model = "your model name (for display purposes)"

In [None]:
import pickle

with open(file_loc, "rb") as f:
    clf = pickle.load(f)
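A quick way to check that saving and loading preserve the model is a round trip followed by a prediction comparison. The sketch below uses a hypothetical toy classifier and a temporary directory instead of the notebook's `clf` and `file_loc`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy model standing in for the trained classifier
toy_X = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = np.array([0, 0, 1, 1])
toy_clf = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    # Save the model, then load it back from the same file
    with open(path, "wb") as f:
        pickle.dump(toy_clf, f)
    with open(path, "rb") as f:
        restored_clf = pickle.load(f)

# The restored model should predict exactly like the original
same_predictions = np.array_equal(toy_clf.predict(toy_X), restored_clf.predict(toy_X))
print(same_predictions)
```

Pickled models are tied to the sklearn version that wrote them, so loading with a different version may warn or fail.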