# Machine Learning Project - Inappropriate Language Classification

Classification using 3 different models:
- Decision tree model -> sourced from the sklearn.tree module. (This notebook)
- Random forest model -> sourced from the sklearn.ensemble module. (This notebook)
- LSTM model -> sourced from the tensorflow.keras module. (LSTM notebook)

There are two options for tokenisation and embedding:
- **CountVectorizer** (tokenisation)
- **GloVe model** (tokenisation + embedding)

The **CountVectorizer** builds its dictionary from the entire dataset, so it is limited to the words that are present in the dataset. That dictionary may contain words that pretrained models have never seen. Conversely, pretrained models have certainly seen words that are absent from the dataset. Furthermore, it does not embed the words, so the models have to infer the relations between the words themselves.
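To illustrate why a dictionary built from the dataset cannot represent unseen words, here is a minimal pure-Python sketch of the counting idea. It is not sklearn's `CountVectorizer` nor the code in `experiment_baseplate`; the helper names `build_vocabulary` and `count_vector` and the toy corpus are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct lowercase token in the corpus to a column index."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(document, vocabulary):
    """Return per-word counts; tokens outside the vocabulary are dropped."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical toy corpus, not the project's dataset.
corpus = ["you are great", "you are awful awful"]
vocab = build_vocabulary(corpus)

print(vocab)
print(count_vector("you are awful", vocab))
# "dreadful" is not in the vocabulary, so it leaves no trace in the vector.
print(count_vector("you are dreadful", vocab))
```

The last call shows the limitation: the unseen word is silently ignored, so the downstream model receives no signal from it.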

The **GloVe model** is a pretrained model trained on a large text corpus. It provides both vectorization and embedding of the words, which gives the classifier an initial notion of how different words relate to each other. For example, the embedding of King - Man + Woman lies close to that of Queen. Furthermore, its dictionary contains words that are not present in the dataset, which allows the trained model to be applied more effectively to new inputs containing unseen words. For example, the word *biatch*, even if absent from the dataset, would be expected to have an embedding close to that of *bitch*, which would allow the model to infer its meaning.
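The King - Man + Woman analogy can be sketched with hand-made vectors and cosine similarity. The 3-dimensional `toy_embeddings` below are invented for illustration only; real GloVe vectors are learned from text and have many more dimensions, and the helper names are hypothetical.

```python
import math

# Hypothetical toy vectors with axes (royalty, gender, food).
toy_embeddings = {
    "king":  [0.9,  0.8, 0.0],
    "man":   [0.1,  0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
    "queen": [0.9, -0.8, 0.1],
    "apple": [0.0,  0.0, 1.0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_word(target, embeddings, exclude=()):
    """Return the word whose vector is most similar to `target`."""
    candidates = {w: vec for w, vec in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))


# king - man + woman, computed component by component.
target = [
    k - m + w
    for k, m, w in zip(
        toy_embeddings["king"], toy_embeddings["man"], toy_embeddings["woman"]
    )
]

# The query words are excluded, as is common when evaluating analogies.
print(nearest_word(target, toy_embeddings, exclude=("king", "man", "woman")))
```

The same similarity measure is what makes a misspelled or slang variant useful: if its vector sits near that of a known word, the classifier sees a similar input.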

This Jupyter Notebook contains the following features:
1. Data Preparation (Loading + Tokenisation + Embedding)
    - Using the CountVectorizer
    - Using the GloVe model
2. Model selection
    1. Using a Decision Tree model
        - Training
        - Testing
    2. Using a Random Forest model
        - Training
        - Testing


## 1. Data Preparation

Here we will load the data, choose a tokeniser and preprocess the data by tokenising it. **Choose only one of the tokenisers**: running the other loading block afterwards will overwrite the variables produced by the first.

### Load Data - Count Vectorizer

In [1]:
from experiment_baseplate import get_split_count_vectorizer

X_train, y_train, X_validate, y_validate, X_test, y_test = get_split_count_vectorizer()

### Load Data - GloVe

In [None]:
# Download the GloVe weights if needed.
from experiment_baseplate import get_glove_model

get_glove_model()

In [10]:
from experiment_baseplate import get_split_glove_embedding

X_train, y_train, X_validate, y_validate, X_test, y_test = get_split_glove_embedding()

Loading GloVe model
Done loading GloVe model

Embedding data
Done Embedding data


## 2. Model selection

### 1. Decision Tree

The decision tree model is built and trained with sklearn. No hyperparameters are passed, so the library defaults are used.
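As a hedged illustration of what relying on the library's parameters means, the sketch below inspects two of sklearn's defaults and shows how they could be overridden. The values given to `tuned_tree` are arbitrary examples, not settings used or recommended by this project.

```python
from sklearn.tree import DecisionTreeClassifier

# Same construction as in this notebook: every hyperparameter keeps its default.
default_tree = DecisionTreeClassifier()
params = default_tree.get_params()
print(params["criterion"])   # split quality measure
print(params["max_depth"])   # None means the tree grows until leaves are pure

# Hypothetical alternative: limit the depth and fix the seed for repeatable runs.
tuned_tree = DecisionTreeClassifier(max_depth=20, min_samples_leaf=5, random_state=0)
print(tuned_tree.get_params()["max_depth"])
```

An unbounded depth lets the tree fit the training data very closely, so comparing against a depth-limited variant on the validation set is one way to check for overfitting.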

#### Training

Make sure you have loaded the data, then click on run and sit back while the model trains.

In [14]:
from sklearn.tree import DecisionTreeClassifier

clf_t = DecisionTreeClassifier()

print("Training...")
clf_t.fit(X_train, y_train)
print("Training finished...")

Training...
Training finished...


#### Testing

Once the model is trained, click on run to get the results on unseen data. Results are reported for both the validation and the test set: the validation set was not used during training and it is larger, while the test set is useful for comparison with the other models.

In [15]:
from experiment_baseplate import score

print("Decision Tree Model")
print("Validate values -> " + score( clf_t.predict(X_validate) , y_validate))
print("Test values -> " + score( clf_t.predict(X_test) , y_test))

Validate values -> accuracy : 0.9463639765498316 | precision : 0.9473760721035034 | recall : 0.9290092658588739
Test values -> accuracy : 0.9433703380316827 | precision : 0.9443392102579047 | recall : 0.9249322106464963
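For readers unfamiliar with the three reported metrics, here is a minimal sketch of how accuracy, precision and recall are conventionally computed for a binary task. It assumes that label `1` marks the positive class and that `score` follows these standard definitions; the actual implementation in `experiment_baseplate` is not shown in this notebook and may differ. The function name `binary_metrics` and the sample labels are hypothetical.

```python
def binary_metrics(predictions, labels, positive=1):
    """Return (accuracy, precision, recall) for binary predictions."""
    pairs = list(zip(predictions, labels))
    true_pos = sum(1 for p, y in pairs if p == positive and y == positive)
    false_pos = sum(1 for p, y in pairs if p == positive and y != positive)
    false_neg = sum(1 for p, y in pairs if p != positive and y == positive)
    correct = sum(1 for p, y in pairs if p == y)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return accuracy, precision, recall


# Hypothetical predictions and ground truth, not taken from the dataset.
preds = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 1, 1, 0]
accuracy, precision, recall = binary_metrics(preds, truth)
print(f"accuracy : {accuracy} | precision : {precision} | recall : {recall}")
```

Under these definitions, a precision higher than recall means the model rarely flags clean text by mistake but misses a larger share of the inappropriate samples.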


### 2. Random Forest

The random forest model is built and trained with sklearn. No hyperparameters are passed, so the library defaults are used.

#### Training

Make sure you have loaded the data, then click on run and sit back while the model trains.

In [2]:
from sklearn.ensemble import RandomForestClassifier

clf_rf = RandomForestClassifier()

print("Training...")
clf_rf.fit(X_train, y_train)
print("Training finished...")

Training...
Training finished...


#### Testing

Once the model is trained, click on run to get the results on unseen data. Results are reported for both the validation and the test set: the validation set was not used during training and it is larger, while the test set is useful for comparison with the other models.

In [4]:
from experiment_baseplate import score

print("Random Forest Model")
print("Validate values -> " + score( clf_rf.predict(X_validate) , y_validate))
print("Test values -> " + score( clf_rf.predict(X_test) , y_test))

Random Forest Model
Validate values -> accuracy : 0.9476736934015217 | precision : 0.9643609022556391 | recall : 0.914183891660727
Test values -> accuracy : 0.9487963078458276 | precision : 0.9660940325497287 | recall : 0.9149422006564863
