## Multi Class Classification Problem

Predicting flairs is a multi-class classification task. The flairs are **mutually exclusive**: Reddit attaches a **single flair** to each submission, so every post belongs to exactly one class.

#### Aim

Build a flair detector: a supervised classifier trained on a dataset scraped from the */r/india* subreddit.

#### What does this code block do?

Prepares the dataset to be fed to models:

- Get cleaned data (from previous section)
- Combine text data (title, selftext and comments) - Give higher weight to title and selftext over comments (Discussed in EDA)
- Transform into input (text) and output (flair - one hot encoded)   

In [1]:
import pandas as pd

dataset = pd.read_pickle('submissions_df_clean.pkl')

dataset = dataset[['flair','title_processed','comments_processed','selftext_processed']]
dataset['text'] = 3*dataset['title_processed']+2*dataset['selftext_processed']+dataset['comments_processed']
dataset['text'] = dataset['text'].apply(lambda x: ' '.join([str(elem) for elem in x]))
dataset = dataset[['flair','text']]

dataset = dataset.assign(**pd.get_dummies(dataset['flair']))

dataset.head()

Unnamed: 0,flair,text,/r/all,40 Martyrs,AMA,AskIndia,Business/Finance,CAA-NRC,CAA-NRC-NPR,Coronavirus,...,Politics -- Source in comments,Politics [Megathread],Scheduled,Science/Technology,Sports,Totally real,Unverified,Zoke Tyme,[R]eddiquette,r/all
0,Coronavirus,coronavirus covid-19 megathread news update 4 ...,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,Scheduled,monthly happiness thread randians share good/p...,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,Photography,aerial view gangaikonda cholapuram temple aeri...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Non-Political,fir arnab goswami chhattisgarh create animosit...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Coronavirus,lockdown scene kurnool andhra pradesh 203 case...,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
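
The weighting in the cell above relies on Python list repetition: multiplying a token list by 3 repeats it three times, so title tokens occur three times in the combined text. Below is a minimal sketch on a made-up post, assuming (as the `' '.join` step suggests) that each `*_processed` column holds a list of tokens. The function name `combine_text` and the sample tokens are illustrative and not part of the notebook.

```python
def combine_text(title, selftext, comments, w_title=3, w_selftext=2):
    """Join token lists into one string, repeating title and selftext tokens."""
    tokens = w_title * title + w_selftext * selftext + comments
    return " ".join(str(tok) for tok in tokens)


# Hypothetical tokens, not taken from the dataset
combined = combine_text(["lockdown", "update"], ["kurnool"], ["stay", "safe"])
print(combined)
```

Since the vectorizers used later count term occurrences, the repeated tokens should carry proportionally more weight than comment tokens.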


#### Classes

The submissions will be classified into the top 10 flairs (discussed in EDA). All other submissions, whose flairs fall outside `top_flairs`, are classified as *Others*.

#### What does this code block do?

Load the top flairs saved in the previous section as a list. These serve as the distinct classes of the multi-class classification problem.

In [2]:
top_flairs = pd.read_pickle('top_flairs.pkl')

flairs = top_flairs.index.to_list()
flairs

['Non-Political',
 'Politics',
 'Coronavirus',
 'AskIndia',
 'Policy/Economy',
 'Business/Finance',
 'Photography',
 '[R]eddiquette',
 'Sports',
 'Science/Technology',
 'Others']

#### What does this code block do?

- Transform dataset to allowed classes
- Remove flairs except the ones in `top_flairs`, assign class *Others* to such records.

In [3]:
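# One-hot columns of every flair outside the top 10; their row-wise max marks "Others".
# The text columns are excluded so that max() only sees the 0/1 indicator columns.
other_flairs = dataset.columns.difference(['flair', 'text'] + flairs[:-1])
dataset = dataset[['flair','text']+flairs[:-1]].assign(Others=dataset[other_flairs].max(axis=1))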
dataset.head()

Unnamed: 0,flair,text,Non-Political,Politics,Coronavirus,AskIndia,Policy/Economy,Business/Finance,Photography,[R]eddiquette,Sports,Science/Technology,Others
0,Coronavirus,coronavirus covid-19 megathread news update 4 ...,0,0,1,0,0,0,0,0,0,0,0
1,Scheduled,monthly happiness thread randians share good/p...,0,0,0,0,0,0,0,0,0,0,1
2,Photography,aerial view gangaikonda cholapuram temple aeri...,0,0,0,0,0,0,1,0,0,0,0
3,Non-Political,fir arnab goswami chhattisgarh create animosit...,1,0,0,0,0,0,0,0,0,0,0
4,Coronavirus,lockdown scene kurnool andhra pradesh 203 case...,0,0,1,0,0,0,0,0,0,0,0
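
The same mapping can be written per row, which may make the intent easier to follow: a flair outside the top list becomes *Others*, and the resulting label is one-hot encoded. This is a sketch in plain Python with a shortened class list; `to_class` and `one_hot` are hypothetical helper names, not functions used in the notebook.

```python
def to_class(flair, top_flairs):
    """Return the flair itself if it is a top flair, otherwise 'Others'."""
    return flair if flair in top_flairs else "Others"


def one_hot(label, classes):
    """Encode a label as a 0/1 list aligned with `classes`."""
    return [1 if label == c else 0 for c in classes]


# Shortened, illustrative class list (the notebook uses 10 top flairs + Others)
classes = ["Non-Political", "Politics", "Coronavirus", "Others"]
top = set(classes[:-1])

for flair in ["Coronavirus", "Scheduled"]:
    label = to_class(flair, top)
    print(flair, "->", label, one_hot(label, classes))
```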


#### What does this code block do?

- Split into testing and training data.
- Given the small volume of data, I have gone with an 80:20 train:test split to maximize learning.
- For the more advanced models, this 20% will serve as validation data.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(dataset['text'], dataset.loc[:, ~dataset.columns.isin(['flair', 'text'])], test_size=0.20, random_state=42)
print("Training data size: Input: "+str(X_train.shape)+" Output: "+str(y_train.shape))
print("Testing data size: Input: "+str(X_test.shape)+" Output: "+str(y_test.shape))

Training data size: Input: (1389,) Output: (1389, 11)
Testing data size: Input: (348,) Output: (348, 11)
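
As a quick sanity check, the printed shapes are consistent with an 80:20 split of 1737 rows. The sketch below assumes that the test share is rounded up and the remainder goes to training, which is how I understand scikit-learn to handle a float `test_size`.

```python
import math

n_samples = 1389 + 348          # total rows, from the shapes printed above
test_size = 0.20

n_test = math.ceil(test_size * n_samples)   # test share rounded up
n_train = n_samples - n_test                # remainder goes to training

print(n_samples, n_train, n_test)
```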


#### How did I choose to proceed?

To begin with, I trained some basic binary classification models independently for each flair, to get a sense of the complexity of the data and a motivation to proceed with more advanced models.

#### Understanding the Classifier code for the next few blocks

- **Pipeline** - to automate the workflow (manipulations and transformations)

- The classifiers take a **binary mask** over the flairs as the target. Each prediction is an array of 0s and 1s marking which flair applies to each input sample.

- Vectorizer - The next few blocks of code use the popular TF-IDF vectorizer, a simple bag-of-words representation, hence picked for the simpler models. It computes word counts as **CountVectorizer** does, then the **Inverse Document Frequency** (IDF) values, and finally the TF-IDF scores. Its vocabulary and IDF values are learned from the training text when the pipeline is fit.
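
To make the three steps concrete (counts, IDF, TF-IDF), here is a plain-Python sketch. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 row normalisation, which I believe match scikit-learn's default settings. It splits on whitespace only, so it is not a drop-in replacement for `TfidfVectorizer`, and the toy corpus is made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows (one dict per document)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)                           # term frequency
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


# Toy corpus, made up for illustration
docs = ["lockdown news update", "cricket news", "lockdown lockdown rules"]
for row in tfidf(docs):
    print({t: round(w, 3) for t, w in row.items()})
```

In the second document, "cricket" ends up with a higher weight than "news" because it appears in fewer documents.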

 

#### What does this code block do?

- Classifier - **Binary Naive Bayes Classifier** - MultinomialNB 

- Vectorizer - TF IDF vectorizer 

- OneVsRestClassifier - to wrap for multi class classification

  

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from nltk.corpus import stopwords
stop_words = stopwords.words('english')  # keep as a list; recent scikit-learn versions reject a set here

In [6]:
NB_pipeline = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words)), ('clf', OneVsRestClassifier(MultinomialNB(fit_prior=True, class_prior=None))),])
for flair in flairs:
    print('... Processing '+str(flair))
    NB_pipeline.fit(X_train, y_train[flair])
    prediction = NB_pipeline.predict(X_test)
    print('Test accuracy is '+str(accuracy_score(y_test[flair], prediction)))

... Processing Non-Political
Test accuracy is 0.6609195402298851
... Processing Politics
Test accuracy is 0.8103448275862069
... Processing Coronavirus
Test accuracy is 0.8017241379310345
... Processing AskIndia
Test accuracy is 0.9339080459770115
... Processing Policy/Economy
Test accuracy is 0.9511494252873564
... Processing Business/Finance
Test accuracy is 0.9770114942528736
... Processing Photography
Test accuracy is 0.9626436781609196
... Processing [R]eddiquette
Test accuracy is 0.9885057471264368
... Processing Sports
Test accuracy is 0.9885057471264368
... Processing Science/Technology
Test accuracy is 0.9885057471264368
... Processing Others
Test accuracy is 0.9396551724137931


#### What does this code block do?

- Classifier - **Binary Linear SVC Classifier** - LinearSVC 
- Vectorizer - TF IDF vectorizer 
- OneVsRestClassifier - to wrap for multi class classification

In [7]:
from sklearn.svm import LinearSVC

In [8]:
SVC_pipeline = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words)), ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),])
for flair in flairs:
    print('... Processing '+str(flair))
    SVC_pipeline.fit(X_train, y_train[flair])
    prediction = SVC_pipeline.predict(X_test)
    print('Test accuracy is '+str(accuracy_score(y_test[flair], prediction)))

... Processing Non-Political
Test accuracy is 0.7931034482758621
... Processing Politics
Test accuracy is 0.8994252873563219
... Processing Coronavirus
Test accuracy is 0.8706896551724138
... Processing AskIndia
Test accuracy is 0.9425287356321839
... Processing Policy/Economy
Test accuracy is 0.9482758620689655
... Processing Business/Finance
Test accuracy is 0.9770114942528736
... Processing Photography
Test accuracy is 0.9655172413793104
... Processing [R]eddiquette
Test accuracy is 0.9885057471264368
... Processing Sports
Test accuracy is 0.9971264367816092
... Processing Science/Technology
Test accuracy is 0.9885057471264368
... Processing Others
Test accuracy is 0.9511494252873564


#### What does this code block do?

- Classifier - **Binary Logistic Regression Classifier** - LogisticRegression 
- Vectorizer - TF IDF vectorizer 
- OneVsRestClassifier - to wrap for multi class classification

In [9]:
from sklearn.linear_model import LogisticRegression

In [10]:
LogReg_pipeline = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words)),('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=1)),])
for flair in flairs:
    print('... Processing '+str(flair))
    LogReg_pipeline.fit(X_train, y_train[flair])
    prediction = LogReg_pipeline.predict(X_test)
    print('Test accuracy is '+str(accuracy_score(y_test[flair], prediction)))

... Processing Non-Political
Test accuracy is 0.7442528735632183
... Processing Politics
Test accuracy is 0.8706896551724138
... Processing Coronavirus
Test accuracy is 0.8419540229885057
... Processing AskIndia
Test accuracy is 0.9339080459770115
... Processing Policy/Economy
Test accuracy is 0.9511494252873564
... Processing Business/Finance
Test accuracy is 0.9770114942528736
... Processing Photography
Test accuracy is 0.9626436781609196
... Processing [R]eddiquette
Test accuracy is 0.9885057471264368
... Processing Sports
Test accuracy is 0.9885057471264368
... Processing Science/Technology
Test accuracy is 0.9885057471264368
... Processing Others
Test accuracy is 0.9396551724137931


#### What did I observe?

At first glance the excellent accuracy values suggest that the models perform well. However, they are binary classifiers trained independently for each flair, so the 98% accuracy for *Science/Technology* is not a reasonable measure of multi-class performance. The classifier only tests whether a submission is *Science/Technology* or not, not whether it is *Science/Technology*, *Politics*, or any other flair. For rare flairs, most test posts are negatives, so a high score is reachable without ever detecting the flair.

> I tried this by testing the models with some randomly picked text from /r/india. The models performed identically and poorly: the text was classified as Science/Technology and Politics with similar probabilities.
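
A small sketch shows why per-flair accuracy can mislead. It uses the test-set size of 348 from the split above and a hypothetical count of 4 posts carrying a rare flair (the real per-flair counts are not shown in this section). A classifier that never predicts the flair still scores very high.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction equals the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


n_test = 348        # test-set size from the split above
n_positive = 4      # hypothetical number of posts carrying a rare flair

y_true = [1] * n_positive + [0] * (n_test - n_positive)
always_no = [0] * n_test            # classifier that never predicts the flair

print(accuracy(y_true, always_no))  # high accuracy, yet every positive is missed
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, always_no)) / n_positive
print(recall)
```

The accuracy here is 344/348, about 0.9885, the same value reported above for several rare flairs. That is consistent with those classifiers mostly answering "no", though it does not prove it.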

#### What next?

I decided to move on to basic models that are multi-class in nature rather than binary.
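
The difference between the two setups can be sketched as follows. Independent binary decisions may select no flair or several flairs for a post, while a multi-class decision always picks exactly one. The scores and function names are hypothetical and only illustrate the idea.

```python
def binary_decisions(scores, threshold=0.5):
    """Independent yes/no per flair: may select none, one, or several flairs."""
    return [flair for flair, s in scores.items() if s >= threshold]


def multiclass_decision(scores):
    """Single decision: the flair with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical per-flair scores for one post
scores = {"Politics": 0.48, "Science/Technology": 0.47, "Sports": 0.05}

print(binary_decisions(scores))      # no flair clears the threshold
print(multiclass_decision(scores))
```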

##### Random Forest Classifier

- vectorizer - TF IDF vectorizer
- RandomForestClassifier - Random Forest Classifier

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

In [12]:
RF_pipeline = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words)), ('clf', RandomForestClassifier())])
RF_pipeline.fit(X_train, y_train)
rf_pred = RF_pipeline.predict(X_test)  # predict once and reuse for every metric
print("Accuracy Score "+str(accuracy_score(y_test, rf_pred)))
print("F1 Score (Micro) "+str(f1_score(y_test, rf_pred, average='micro')))
print("F1 Score (Weighted) "+str(f1_score(y_test, rf_pred, average='weighted')))



Accuracy Score 0.27011494252873564
F1 Score (Micro) 0.3900414937759336
F1 Score (Weighted) 0.34871874756600985




##### Linear SVC Classifier

- vectorizer - TF IDF vectorizer
- LinearSVC - Linear SVC Classifier

In [13]:
import numpy as np

In [14]:
LSVC_pipeline = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words)), ('clf', LinearSVC())])
# LinearSVC expects one label per sample, so convert one-hot rows to class indices
y_train_idx = np.argmax(np.array(y_train), axis=1)
y_test_idx = np.argmax(np.array(y_test), axis=1)
LSVC_pipeline.fit(X_train, y_train_idx)
lsvc_pred = LSVC_pipeline.predict(X_test)  # predict once and reuse for every metric
print("Accuracy Score "+str(accuracy_score(y_test_idx, lsvc_pred)))
print("F1 Score (Micro) "+str(f1_score(y_test_idx, lsvc_pred, average='micro')))
print("F1 Score (Weighted) "+str(f1_score(y_test_idx, lsvc_pred, average='weighted')))

Accuracy Score 0.6551724137931034
F1 Score (Micro) 0.6551724137931034
F1 Score (Weighted) 0.6098445344943171
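
Accuracy and micro F1 are identical for the LinearSVC above. That is expected when each sample gets exactly one predicted label: every error counts as one false positive (for the predicted class) and one false negative (for the true class). The sketch below computes both averages by hand on made-up labels; `f1_scores` is an illustrative helper, not scikit-learn's implementation.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro F1, weighted F1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1      # predicted class gains a false positive
            fn[t] += 1      # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Weighted: per-class F1 averaged by each class's share of true labels
    support = Counter(y_true)
    weighted = sum(f1(tp[c], fp[c], fn[c]) * support[c] for c in labels) / len(y_true)
    return micro, weighted


# Toy labels, made up for illustration
y_true = ["Politics", "Politics", "Sports", "AskIndia", "AskIndia", "AskIndia"]
y_pred = ["Politics", "AskIndia", "AskIndia", "AskIndia", "AskIndia", "Politics"]

micro, weighted = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro, accuracy, weighted)
```

The weighted score is lower than the micro score here because "Sports" is never predicted and contributes an F1 of 0.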




#### What did I observe?

The poor results from these models suggest that the data is too complex for simple bag-of-words classifiers and motivate trying neural network models. This direction is also supported by popular research on RNNs for text data.

##### LSTM Model

Motivation behind choosing the model
- LSTMs are designed to learn long-term dependencies in sequences such as text. Their ability to forget, remember and update information puts them a step ahead of plain RNNs.
- An LSTM is a popular starting point for more advanced models.

Steps
- Tokenize text into integer sequences, which the embedding layer maps to vectors
- Build LSTM
    - `embed_dim` : The embedding layer encodes the input sequence into a sequence of dense vectors of dimension `embed_dim`.
    - `lstm_out` : The LSTM transforms the vector sequence into a single vector of size `lstm_out`, containing information about the entire sequence.
    - `softmax` activation function on the output layer
- Fit on training data, check accuracy on validation set
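
The last two steps rely on a softmax output trained with categorical cross-entropy. As a rough sketch of what these compute for a single sample, in plain Python (the raw scores are hypothetical and only 3 classes are used instead of 11):

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))


# Hypothetical raw outputs for 3 classes (the notebook has 11)
probs = softmax([2.0, 1.0, 0.1])
loss = categorical_crossentropy([1, 0, 0], probs)
print([round(p, 3) for p in probs], round(loss, 3))
```

Because softmax yields a single probability distribution over the flairs, it fits the mutually exclusive setup described at the start of this section.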

Resource for [LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

In [15]:
from keras.preprocessing import sequence, text
from keras.models import Sequential
from keras.layers import Dense, Embedding, Input, LSTM
from keras.optimizers import Adam, SGD
from keras.callbacks import EarlyStopping
import pickle

Using TensorFlow backend.


##### Converting text to vector

In the previous models, TF-IDF vectorization was used to convert text to vectors. This section uses a tokenizer that is fit on the corpus of submissions (specific to our dataset).

The test and train dataset are then converted to vector using this tokenizer.
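
Conceptually, the tokenizer builds a word-to-integer index from the corpus and then replaces each word with its index. The following is a simplified stand-in, not the Keras `Tokenizer` itself: it assumes whitespace splitting, frequency-ordered indices starting at 1 (0 is left free for padding), and that unseen words are dropped. The helper names and corpus are made up.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to an integer, most frequent word first; 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace words with their indices, skipping words not seen during fitting."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]


# Toy corpus, made up for illustration
corpus = ["lockdown news lockdown", "cricket news"]
word_index = fit_word_index(corpus)
print(word_index)
print(texts_to_sequences(["lockdown cricket unseen"], word_index))
```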

`pad_sequences` transforms the list of sequences into a 2D NumPy array of shape (*number_of_samples* x 300) to be used in the RNN: shorter sequences are padded with leading zeros and longer ones are truncated. The motivation for selecting 300 is that it is close to the average length of the text in words.
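
The effect of padding on a single sequence can be sketched as below. This assumes the behaviour I understand to be the `pad_sequences` default: zeros are added at the front, and sequences that are too long keep only their last `maxlen` items. `pad_sequence` is an illustrative helper, and `maxlen=5` replaces 300 for readability.

```python
def pad_sequence(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


# Toy sequences with maxlen=5 instead of 300
print(pad_sequence([7, 8, 9], maxlen=5))
print(pad_sequence([1, 2, 3, 4, 5, 6, 7], maxlen=5))
```

This matches the leading zeros visible in the `X_test_sequence` array printed in this section.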

In [16]:
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(dataset['text'])
with open('tokenizer.pkl', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

X_train_2 = tokenizer.texts_to_sequences(X_train)
X_test_2 = tokenizer.texts_to_sequences(X_test)

X_train_sequence = sequence.pad_sequences(X_train_2, maxlen=300)
X_test_sequence = sequence.pad_sequences(X_test_2, maxlen=300)

Find average count of words in the text

In [17]:
dataset['word_count'] = dataset['text'].apply(lambda x: len(x.split()))
dataset['word_count'].mean()

306.3615428900403

In [18]:
print(X_test.shape)
print(X_test_sequence.shape)
X_test_sequence

(348,)
(348, 300)


array([[   0,    0,    0, ..., 1336, 4728,  227],
       [ 148, 2223,   41, ...,  540,  429,   86],
       [   0,    0,    0, ...,   18, 4134,  226],
       ...,
       [   0,    0,    0, ...,  298,  125, 7303],
       [   0,    0,    0, ...,   18, 2052,  715],
       [   0,    0,    0, ...,   85,  522, 1436]])

#### Model Architecture 
One input layer, one embedding layer, one LSTM layer with 200 units and one output layer with 11 neurons, one per flair class in the output.
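
To get a feel for the size of this architecture relative to roughly 1,400 training samples, the trainable parameters per layer can be estimated. The sketch assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights and a bias). The vocabulary size is hypothetical, since the real value of `len(tokenizer.word_index) + 1` is not printed in the notebook.

```python
def lstm_model_params(vocab_size, embed_dim=200, lstm_units=200, n_classes=11):
    """Trainable parameters per layer for Embedding -> LSTM -> Dense."""
    embedding = vocab_size * embed_dim
    # 4 gates, each with input weights, recurrent weights and a bias
    lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
    dense = lstm_units * n_classes + n_classes
    return {"embedding": embedding, "lstm": lstm, "dense": dense}


# vocab_size is hypothetical; the notebook uses len(tokenizer.word_index) + 1
print(lstm_model_params(vocab_size=10_000))
```

Under these assumptions the embedding layer dominates the parameter count, which is worth keeping in mind when reading the overfitting discussion later in this section.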

In [19]:
embedding_length = 200
model = Sequential()
model.add(Embedding( len(tokenizer.word_index)+1, embedding_length ,input_length = X_train_sequence.shape[1]))
model.add(LSTM(embedding_length, dropout=0.2))
model.add(Dense(11,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])

model.fit(X_train_sequence, np.array(y_train), batch_size=64,epochs=25,
          validation_data=(X_test_sequence, np.array(y_test)))

Train on 1389 samples, validate on 348 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.callbacks.History at 0x201f7a91e88>

#### Modifying sequence length to 80

In my runs, the accuracy improved with smaller values of `maxlen`, so the sequences are re-padded to 80 tokens here.

In [23]:
X_train_sequence = sequence.pad_sequences(X_train_2, maxlen=80)
X_test_sequence = sequence.pad_sequences(X_test_2, maxlen=80)
embedding_length = 200
model = Sequential()
model.add(Embedding( len(tokenizer.word_index)+1, embedding_length ,input_length = X_train_sequence.shape[1]))
model.add(LSTM(embedding_length, dropout=0.2))
model.add(Dense(11,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])

model.fit(X_train_sequence, np.array(y_train), batch_size=64,epochs=25,
          validation_data=(X_test_sequence, np.array(y_test)))

Train on 1389 samples, validate on 348 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.callbacks.History at 0x201f9efc408>

In [21]:
model.save('model_lstm.h5')

#### What did I observe?

A common observation is that validation accuracy falls as the number of epochs increases. Usually, the validation metric stops improving after a certain number of epochs and begins to decrease afterward, indicating **overfitting**, i.e. the model has learned patterns specific to the training data that are irrelevant in other data. Some of the solutions I thought of:

1. **Reduce the learning rate** below Adam's default of 0.001, for example to 0.0001.
2. Provide **more data**. (This is restricted by the current volume of data.)
3. Set **dropout rates** to a number like 0.2. **Keep them uniform across the network**.
4. Try **decreasing the batch size**.

#### What next?

I ran the model with early stopping, so that training halts once validation accuracy stops improving, to get the best accuracy possible.
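
The patience rule behind early stopping can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: it assumes a metric that should be maximised and ignores options such as `min_delta`. The function name and the validation scores are made up.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch), 1-indexed, for a metric to be maximised."""
    best, best_epoch, wait = float("-inf"), 0, 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_scores), best_epoch


# Hypothetical validation accuracies: improve for 3 epochs, then plateau
scores = [0.40, 0.55, 0.60, 0.58, 0.59, 0.57, 0.56]
print(early_stopping_epoch(scores, patience=3))
```

Under this rule, a run with `patience=10` that stops at epoch 23 last improved at epoch 13. As far as I know, Keras keeps the weights from the final epoch unless `restore_best_weights=True` is passed to `EarlyStopping`.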

In [22]:
es = EarlyStopping(monitor='val_accuracy', mode='max', verbose=1, patience=10)

model.fit(X_train_sequence, np.array(y_train), batch_size=64,epochs=50,
          validation_data=(X_test_sequence, np.array(y_test)), callbacks=[es])

Train on 1389 samples, validate on 348 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 00023: early stopping


<keras.callbacks.callbacks.History at 0x201edac2b88>

The LSTM model is used to make the predictions in the web service available [here](http://cryptic-earth-17134.herokuapp.com/).

#### **What can be improved?**

- Larger dataset: currently Reddit only allows getting the top() and hot() posts, which limits the number of records that can be extracted. A larger dataset would give the model more examples to learn from.
- More advanced models, possibly other RNN-based architectures and their improved variants
- Modifications to the layers and activation functions in the LSTM model