# Predictive Model Creation

In this notebook, we build predictive models for the topic assignments derived in the previous notebook. We establish a classical ML baseline (Multinomial Naive Bayes) as well as a neural baseline using DistilBERT.

However, before proceeding, we must discuss the validity of this approach. The entire dataset was labeled automatically using LDA and NMF. We are now training models to reproduce the labeling function that these methods produced, so every score below measures agreement with the automatic labels, not with ground truth. For this to be useful, we have to assume that NMF assigned the correct topic to every post. That is a risky assumption to make without manually spot-checking a random sample of the dataset.
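One way to run such a spot check is to draw a few random posts per topic and read them by hand. The sketch below is only an illustration under stated assumptions: the helper name `sample_for_review` and the toy data are hypothetical, and it works on plain lists so that it runs without the notebook's dataframe.

```python
import random
from collections import defaultdict


def sample_for_review(texts, labels, per_topic=3, seed=0):
    """Return {label: [texts]} with up to `per_topic` random texts per label."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    rng = random.Random(seed)  # fixed seed so the same posts are reviewed on each run
    return {
        label: rng.sample(items, min(per_topic, len(items)))
        for label, items in sorted(by_label.items())
    }


# Toy data standing in for the post texts and their NMF topics (hypothetical values).
toy_texts = [f"post {i}" for i in range(20)]
toy_labels = [i % 5 for i in range(20)]

review = sample_for_review(toy_texts, toy_labels, per_topic=3)
for topic, examples in review.items():
    print(topic, examples)
```

With the real data, the two lists would come from the text column and the nmf_topics column of the dataframe.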

In [None]:
df_path = '/content/drive/MyDrive/Wysa/final.pkl'

In [None]:
import pandas as pd

In [None]:
df = pd.read_pickle(df_path)

In [None]:
df.head()

Unnamed: 0,author,created_utc,full_link,is_original_content,is_self,is_video,link_flair_text,num_comments,over_18,score,...,date,word_count,ctitle,lemma_title,tword_count,nmf_topics,lda_topics,nmf_topics_text,lda_topics_text,polarity
0,InternetFreedomIn,1628168541,https://www.reddit.com/r/india/comments/oyh2uq...,False,True,False,Policy/Economy,0,False,1,...,2021-08-05 13:02:21,1180,cybersec charcha a global overview of the sta...,cybersec charcha global overview state surveil...,14,1,0,"government, information bill, privacy, protection","school, problem, family, support",0.0
1,adam0010101,1628452706,https://www.reddit.com/r/india/comments/p0lpok...,False,True,False,Politics,0,False,1,...,2021-08-08 19:58:26,33,indian social divisions and political redresses,indian social division political redress,6,4,3,"population, world, culture","karnataka, culture, population, drugs",0.016667
2,MaharajadhirajaSawai,1629126680,https://www.reddit.com/r/india/comments/p5i0b6...,False,True,False,History,9,False,1,...,2021-08-16 15:11:20,3812,my critique of the carvaka podcasts warhorse e...,critique carvaka podcasts warhorse evolution e...,20,0,3,"life, friend, social problem","karnataka, culture, population, drugs",-0.1
3,InternetFreedomIn,1627026319,https://www.reddit.com/r/india/comments/opxkgc...,False,True,False,Policy/Economy,11,False,1,...,2021-07-23 07:45:19,1657,dear standing committee we have some questions...,dear standing committee question pegasus,9,1,4,"government, information bill, privacy, protection","government, survey, kidnapped, suicide",0.0
4,wanderingmind,1628683922,https://www.reddit.com/r/india/comments/p2btn6...,False,True,False,Non-Political,17,False,1,...,2021-08-11 12:12:02,843,barking biting stray dogs hypocritical doglove...,barking biting stray dog hypocritical doglover...,10,0,4,"life, friend, social problem","government, survey, kidnapped, suicide",0.0


First, we create our train, dev, and test sets with the labels as the NMF topics.

In [None]:
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(df['lemma_text'], df['nmf_topics'], test_size=0.1, stratify = df['nmf_topics'])

In [None]:
xtrain, xdev, ytrain, ydev = train_test_split(xtrain, ytrain, test_size=0.1, stratify=ytrain)

In [None]:
print(xtrain.shape)
print(xdev.shape)
print(xtest.shape)

(1678,)
(187,)
(208,)
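The two chained 10% splits leave roughly 81% of the data for training, 9% for dev, and 10% for test. The sketch below reproduces the shapes printed above with plain arithmetic, assuming that the held-out share is rounded up at each split; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size=0.1, dev_size=0.1):
    """Sizes of train/dev/test after two chained splits (held-out share rounded up)."""
    n_test = math.ceil(test_size * n_samples)
    n_rest = n_samples - n_test
    n_dev = math.ceil(dev_size * n_rest)
    n_train = n_rest - n_dev
    return n_train, n_dev, n_test


n_total = 1678 + 187 + 208  # total rows, recovered from the printed shapes
sizes = split_sizes(n_total)
print(sizes)  # (1678, 187, 208)
print([round(s / n_total, 3) for s in sizes])  # effective train/dev/test proportions
```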


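We create a basic Multinomial Naive Bayes pipeline and tune it with a grid search over three parameters. Only the n-gram range and the smoothing parameter alpha actually vary, which gives four candidates.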

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

In [None]:
naive_bayes = Pipeline([('vect', CountVectorizer(stop_words='english')),
               ('tfidf', TfidfTransformer()),
               ('clf', MultinomialNB(fit_prior=False)),
              ], verbose=True)

parameters = {'vect__ngram_range': [(1, 2), (1, 3)],
               'tfidf__use_idf': [True],
               'clf__alpha': (1, 1e-2)}

gs_naive_bayes = GridSearchCV(naive_bayes, parameters, verbose=3)

gs_naive_bayes = gs_naive_bayes.fit(xtrain, ytrain)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[Pipeline] .............. (step 1 of 3) Processing vect, total=   1.0s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.1s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   0.0s
[CV 1/5] END clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 2);, score=0.676 total time=   1.2s
[Pipeline] .............. (step 1 of 3) Processing vect, total=   1.0s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   0.0s
[CV 2/5] END clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 2);, score=0.670 total time=   1.2s
[Pipeline] .............. (step 1 of 3) Processing vect, total=   1.0s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   0.0s
[CV 3/5] END clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 2);, score=0.696 to

In [None]:
gs_naive_bayes.best_params_

{'clf__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
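The "4 candidates, 20 fits" line in the log above follows from the size of the grid. As a minimal sketch using only the standard library (the grid search itself does this enumeration internally), the candidates can be listed as the Cartesian product of the parameter values:

```python
from itertools import product

# Same grid as in the cell above.
parameters = {
    'vect__ngram_range': [(1, 2), (1, 3)],
    'tfidf__use_idf': [True],
    'clf__alpha': (1, 1e-2),
}

names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

n_folds = 5  # default number of cross-validation folds reported in the log
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
for candidate in candidates:
    print(candidate)
```

Since `tfidf__use_idf` has a single value, it does not enlarge the grid: 2 n-gram ranges times 2 alpha values gives the 4 candidates.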

In [None]:
ydev_pred = gs_naive_bayes.predict(xdev)
print(classification_report(ydev, ydev_pred))

              precision    recall  f1-score   support

           0       0.71      0.74      0.72        62
           1       0.60      0.70      0.65        30
           2       0.77      0.61      0.68        38
           3       0.67      0.29      0.40         7
           4       0.59      0.64      0.62        50

    accuracy                           0.66       187
   macro avg       0.67      0.59      0.61       187
weighted avg       0.67      0.66      0.66       187



In [None]:
ytest_pred = gs_naive_bayes.predict(xtest)
print(classification_report(ytest, ytest_pred))

              precision    recall  f1-score   support

           0       0.61      0.78      0.68        69
           1       0.64      0.55      0.59        33
           2       0.74      0.48      0.58        42
           3       0.50      0.12      0.20         8
           4       0.61      0.68      0.64        56

    accuracy                           0.63       208
   macro avg       0.62      0.52      0.54       208
weighted avg       0.64      0.63      0.62       208



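63% test accuracy is a decent baseline for a five-class problem. The four larger classes are reasonably balanced, and their F-scores are also quite decent (0.58 to 0.68). Class 3 is the exception: it has only 8 test examples, and the model recalls just 12% of them.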
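The gap between the macro and weighted averages in the report comes from how class 3 is counted. The sketch below recomputes both averages from the per-class F1 and support values copied from the test-set report above; no new numbers are introduced.

```python
# Per-class F1 and support copied from the NMF test-set report above.
f1 = {0: 0.68, 1: 0.59, 2: 0.58, 3: 0.20, 4: 0.64}
support = {0: 69, 1: 33, 2: 42, 3: 8, 4: 56}

total = sum(support.values())

# Macro average: every class counts equally, so the weak class 3 pulls it down.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: classes count in proportion to their support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
print(f"class 3 share of test set: {support[3] / total:.1%}")
```

The macro average is the more informative number if all topics matter equally, since the weighted average largely hides the poor result on the smallest class.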

We do the same for LDA topics.

In [None]:
xtrainlda, xtestlda, ytrainlda, ytestlda = train_test_split(df['lemma_title'], df['lda_topics'], test_size=0.1, stratify = df['lda_topics'])
xtrainlda, xdevlda, ytrainlda, ydevlda = train_test_split(xtrainlda, ytrainlda, test_size=0.1, stratify=ytrainlda)

In [None]:
naive_bayes_lda = Pipeline([('vect', CountVectorizer(stop_words='english')),
               ('tfidf', TfidfTransformer()),
               ('clf', MultinomialNB(fit_prior=False)),
              ], verbose=True)

parameters_lda = {'vect__ngram_range': [(1, 2), (1, 3)],
               'tfidf__use_idf': [True],
               'clf__alpha': (1, 1e-2)}

gs_naive_bayes_lda = GridSearchCV(naive_bayes_lda, parameters_lda, verbose=3)

gs_naive_bayes_lda = gs_naive_bayes_lda.fit(xtrainlda, ytrainlda)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[Pipeline] .............. (step 1 of 3) Processing vect, total=   0.1s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   0.0s
[CV 1/5] END clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 2);, score=0.384 total time=   0.1s
[Pipeline] .............. (step 1 of 3) Processing vect, total=   0.1s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   0.0s
[CV 2/5] END clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 2);, score=0.446 total time=   0.2s
[Pipeline] .............. (step 1 of 3) Processing vect, total=   0.1s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.0s
[Pipeline] ............... (step 3 of 3) Processing clf, total=   0.0s
[CV 3/5] END clf__alpha=1, tfidf__use_idf=True, vect__ngram_range=(1, 2);, score=0.440 to

In [None]:
gs_naive_bayes_lda.best_params_

{'clf__alpha': 1, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}

In [None]:
ydev_pred_lda = gs_naive_bayes_lda.predict(xdevlda)
print(classification_report(ydevlda, ydev_pred_lda))

              precision    recall  f1-score   support

           0       0.34      0.57      0.42        46
           1       0.45      0.45      0.45        44
           2       0.58      0.47      0.52        38
           3       0.30      0.12      0.17        25
           4       0.44      0.32      0.37        34

    accuracy                           0.42       187
   macro avg       0.42      0.39      0.39       187
weighted avg       0.43      0.42      0.41       187



In [None]:
ytest_pred_lda = gs_naive_bayes_lda.predict(xtestlda)
print(classification_report(ytestlda, ytest_pred_lda))

              precision    recall  f1-score   support

           0       0.38      0.54      0.45        52
           1       0.47      0.55      0.50        49
           2       0.42      0.38      0.40        42
           3       0.29      0.07      0.12        27
           4       0.41      0.34      0.37        38

    accuracy                           0.41       208
   macro avg       0.39      0.38      0.37       208
weighted avg       0.40      0.41      0.39       208



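LDA topic classification seems to be a bit of a bust: test accuracy is 41% with a macro F1 of 0.37, well below the NMF results. Note that these models were trained on the lemma_title column rather than lemma_text, so the two experiments differ in their input text as well as in their labels.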
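To put both accuracies in context, it helps to compare them with the accuracy of always predicting the most frequent class. The sketch below computes that majority baseline from the support columns of the two test-set reports above; `majority_baseline` is a hypothetical helper.

```python
def majority_baseline(support):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(support.values()) / sum(support.values())


# Support columns copied from the two test-set reports above.
nmf_test_support = {0: 69, 1: 33, 2: 42, 3: 8, 4: 56}
lda_test_support = {0: 52, 1: 49, 2: 42, 3: 27, 4: 38}

# Test accuracies reported by the Naive Bayes models.
reported_accuracy = {"nmf": 0.63, "lda": 0.41}

for name, support in (("nmf", nmf_test_support), ("lda", lda_test_support)):
    base = majority_baseline(support)
    print(f"{name}: majority baseline {base:.2f}, Naive Bayes {reported_accuracy[name]:.2f}")
```

Both models beat their majority baseline, but the margin for the LDA labels is much smaller than for the NMF labels.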


Now that we have established a classical ML baseline, let's also train a neural baseline model for classifying the NMF topics.

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses:

In [None]:
import torch
from tqdm import tqdm
from transformers import DistilBertTokenizerFast

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(list(xtrain), truncation=True, padding=True)
val_encodings = tokenizer(list(xdev), truncation=True, padding=True)

In [None]:
class IndiaDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IndiaDataset(train_encodings, list(ytrain))
val_dataset = IndiaDataset(val_encodings, list(ydev))

In [None]:
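from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Use the DistilBERT config class; BertConfig causes a model-type mismatch warning.
config = DistilBertConfig.from_pretrained('distilbert-base-uncased')
config.num_labels = 5  # one output per NMF topic
# NOTE: building the model from a config gives randomly initialised weights.
# To start from the pretrained checkpoint instead, use
# DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=5).
model = DistilBertForSequenceClassification(config)
model.parameters
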

<bound method Module.parameters of DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dr

In [None]:
from torch.utils.data import DataLoader

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

# Alternative: start from pretrained weights instead of the config-built model:
# model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=5)
model.to(device)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)  # order does not matter for evaluation

# torch.optim.AdamW replaces the deprecated transformers.AdamW;
# weight_decay=0.0 keeps the weight-decay default of the transformers version.
optim = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)

for epoch in range(5):
    all_train_preds = None
    all_train_labels = None
    model.train()
    for batch in tqdm(train_loader):
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        logits = outputs[1]
        preds = torch.argmax(logits, dim=1)

        # Accumulate predictions and labels to report accuracy over the whole epoch.
        if all_train_preds is None:
            all_train_preds = preds
            all_train_labels = labels
        else:
            all_train_preds = torch.cat((all_train_preds, preds))
            all_train_labels = torch.cat((all_train_labels, labels))

        loss.backward()
        optim.step()

    print("train accuracy:", torch.sum(all_train_preds == all_train_labels) / all_train_labels.shape[0])

# Evaluate once on the dev set after the final epoch.
all_val_preds = None
all_val_labels = None
model.eval()
with torch.no_grad():  # no gradients are needed for evaluation
    for batch in tqdm(val_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        val_loss = outputs[0]
        logits = outputs[1]
        preds = torch.argmax(logits, dim=1)

        if all_val_preds is None:
            all_val_preds = preds
            all_val_labels = labels
        else:
            all_val_preds = torch.cat((all_val_preds, preds))
            all_val_labels = torch.cat((all_val_labels, labels))

print("val accuracy:", torch.sum(all_val_preds == all_val_labels) / all_val_labels.shape[0])

cuda


100%|██████████| 105/105 [02:28<00:00,  1.41s/it]


train accuracy: tensor(0.4350, device='cuda:0')


100%|██████████| 105/105 [02:27<00:00,  1.41s/it]


train accuracy: tensor(0.7193, device='cuda:0')


100%|██████████| 105/105 [02:27<00:00,  1.41s/it]


train accuracy: tensor(0.8969, device='cuda:0')


100%|██████████| 105/105 [02:27<00:00,  1.40s/it]


train accuracy: tensor(0.9583, device='cuda:0')


100%|██████████| 105/105 [02:27<00:00,  1.40s/it]


train accuracy: tensor(0.9809, device='cuda:0')


100%|██████████| 12/12 [00:05<00:00,  2.12it/s]


val accuracy: tensor(0.7112, device='cuda:0')


In [None]:
print(classification_report(all_val_labels.cpu(), all_val_preds.cpu()))

              precision    recall  f1-score   support

           0       0.76      0.84      0.80        62
           1       0.45      0.83      0.59        30
           2       0.83      0.76      0.79        38
           3       1.00      0.57      0.73         7
           4       0.92      0.46      0.61        50

    accuracy                           0.71       187
   macro avg       0.79      0.69      0.70       187
weighted avg       0.78      0.71      0.71       187



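Indeed, we improve quite a bit on our Naive Bayes outcome: on the dev set, accuracy rises from 0.66 to 0.71 and macro F1 from 0.61 to 0.70. The gap between the final training accuracy (0.98) and the dev accuracy (0.71) indicates substantial overfitting, though. Note also that the model is built from a config rather than loaded with from_pretrained, so it starts from randomly initialised weights; loading the pretrained checkpoint is an obvious first improvement. Better preprocessing, trying out different pre-trained models more suited to Reddit, and hyperparameter tuning should be able to increase our F-scores further. However, before doing that, we need to establish the validity of our labels more rigorously. In the interest of time, I am concluding this task at this stage. Thank you :)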
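One possible way to quantify label validity, once a sample of posts has been labeled by hand, is a chance-corrected agreement score such as Cohen's kappa between the manual labels and the NMF labels. This is not something done in this notebook; the sketch below is a plain-Python illustration, and both label lists are made-up toy values.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both labelings were independent with these class frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:  # both labelings use a single identical class
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels: a manual spot check versus the automatic NMF topics.
manual_labels = [0, 1, 2, 3, 4, 0, 1, 2, 4, 4]
nmf_labels = [0, 1, 2, 4, 4, 0, 0, 2, 4, 1]

kappa = cohen_kappa(manual_labels, nmf_labels)
print(f"kappa = {kappa:.2f}")
```

A kappa near 1 would support training on the NMF labels, while a value near 0 would mean the labels agree with human judgment no better than chance.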

In [None]:
torch.save(model.state_dict(), '/content/drive/MyDrive/Wysa/distilbert_nmf_topics.pt')