<a href="https://colab.research.google.com/github/sotoodaa-ucb/ucb_mids_w251_homework_4/blob/main/w251_v3_hw04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Homework 04
  
In this homework we shall cover some of the concepts you have seen in this week's async material. 

We will use a dataset from the [Conversation AI](https://conversationai.github.io/) team, a research initiative founded by [Jigsaw](https://jigsaw.google.com/) and Google. Its area of focus is the study of negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion).   
  
Kaggle hosted their [second competition](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge#description) on this research. The challenge was to create a model that is capable of detecting different types of toxicity, like threats, obscenity, insults, and identity-based hate, better than Perspective’s current models. The competition used a dataset of comments from Wikipedia’s talk page edits.

We shall be using this dataset to benchmark a number of ML models. We will measure our models using a binary classification metric, [Area under the Curve (or AUC)](https://www.kaggle.com/learn-forum/53782), which is well suited to unbalanced binary datasets. The two classes are whether the comment is toxic or not. 
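To see why AUC is a better fit than accuracy for unbalanced data, here is a minimal sketch. `pairwise_auc` is a hypothetical helper written only for illustration (the notebook itself uses `roc_auc_score` from scikit-learn), and the labels and scores are made-up toy values. It computes AUC as the fraction of (toxic, non-toxic) pairs in which the toxic comment receives the higher score, counting ties as half.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))

# Toy labels: 1 toxic comment out of 10 (an unbalanced set).
y_true = [0] * 9 + [1]

# A "model" that scores every comment 0.0, i.e. always predicts "not toxic".
constant_scores = [0.0] * 10
accuracy = sum((s > 0.5) == bool(y) for s, y in zip(constant_scores, y_true)) / len(y_true)
constant_auc = pairwise_auc(y_true, constant_scores)

# A "model" that ranks the toxic comment above every non-toxic one.
ranking_scores = [0.1] * 9 + [0.4]
ranking_auc = pairwise_auc(y_true, ranking_scores)

print(accuracy, constant_auc, ranking_auc)
```

The constant model looks good on accuracy simply because most comments are not toxic, yet its AUC shows it has no ranking ability. The second model gets a perfect AUC even though none of its scores cross 0.5, because AUC only depends on how the scores are ordered.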

We have created a Jupyter notebook with some of the tools to model this problem in Deep Learning, using logistic regression and an MLP. Your challenge will be to fill in the models and benchmark the AUC you achieve on these models.

We shall be using PyTorch. If you get stuck, use the online [docs](https://pytorch.org/docs/stable/index.html) as well as the [forums](https://discuss.pytorch.org/) to look for solutions. 

**Have fun!!**


*Disclaimer: the dataset used contains text that may be considered profane, vulgar, or offensive.*

### Set up packages

In [4]:
import os
import sys
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from tqdm.autonotebook import tqdm
from torch.utils.data import Dataset, DataLoader
from random import shuffle
import logging
ch = logging.StreamHandler(sys.stdout)

### Load data
Let's load our data and take a peek.   
The competition metrics and fields have too much detail to cover here.   
We will just focus on the users' comments and whether each one was deemed toxic (target>0.5) or not.   

In [5]:
%%capture
# File with the training samples
!wget -O train.csv.zip "https://www.dropbox.com/s/xei6z41mfrcnxcd/train.csv.zip?dl=1"

In [6]:
# Let's look at what files we have available. 
os.listdir()

['.config', 'train.csv.zip', 'sample_data']

In [7]:
train_df = pd.read_csv('train.csv.zip', 
                       dtype={'comment_text':str},
                       usecols=['comment_text', 'target'],
                       compression = 'zip', 
                       nrows = 200000)
train_df['target'] = (train_df['target']>0.5).astype(int)
print("Dataframe Number of (rows, columns) : {}".format(train_df.shape))

Dataframe Number of (rows, columns) : (200000, 2)


In [8]:
pd.options.display.max_colwidth = 200
train_df.head(10)

| | target | comment_text |
|---|---|---|
| 0 | 0 | This is so cool. It's like, 'would you want your mother to read this??' Really great idea, well done! |
| 1 | 0 | Thank you!! This would make my life a lot less anxiety-inducing. Keep it up, and don't let anyone get in your way! |
| 2 | 0 | This is such an urgent design problem; kudos to you for taking it on. Very impressive! |
| 3 | 0 | Is this something I'll be able to install on my site? When will you be releasing it? |
| 4 | 1 | haha you guys are a bunch of losers. |
| 5 | 1 | ur a sh*tty comment. |
| 6 | 0 | hahahahahahahahhha suck it. |
| 7 | 0 | FFFFUUUUUUUUUUUUUUU |
| 8 | 0 | The ranchers seem motivated by mostly by greed; no one should have the right to allow their animals destroy public land. |
| 9 | 0 | It was a great show. Not a combo I'd of expected to be good together but it was. |


### Create validation data set
Let's randomly split the data into a training set (two thirds) and a validation set (one third).   
**No change needed here - note, please do not change the KFold split parameters, keeping it consistent will help us debug.**
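
As a quick check of the split sizes, the sketch below applies the same `KFold` settings to a toy array of 9 rows. The toy array and the `toy_` variable names are assumptions for illustration only. With `n_splits=3`, the first fold keeps two thirds of the rows for training and one third for validation.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_rows = np.arange(9)  # stand-in for the rows of train_df
toy_cv = KFold(n_splits=3, shuffle=True, random_state=42)

# Take only the first of the three folds, as the notebook does.
toy_trn_ids, toy_val_ids = next(toy_cv.split(toy_rows))

print(len(toy_trn_ids), len(toy_val_ids))

# The two index sets never overlap and together cover every row.
print(sorted(set(toy_trn_ids) | set(toy_val_ids)) == list(range(9)))
```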

In [9]:
cv = KFold(n_splits=3, shuffle=True, random_state=42)
trn_ids, val_ids = next(cv.split(train_df))
x_train, x_valid = train_df['comment_text'][trn_ids], train_df['comment_text'][val_ids]
y_train, y_valid = train_df['target'].values[trn_ids], train_df['target'].values[val_ids]

### Vectorize Count of tokens

![Count Vectorizer](https://cdn-images-1.medium.com/max/1600/1*LD5XaHzOnoniU4p137FL5g.jpeg)  
We shall start off by performing some CPU-based Deep Learning operations. Sparse matrices are better run on CPU.    
Do not underestimate CPU-based Deep Learning such as MLP; these models can be very powerful and can outperform much more complex DL models.   
Here we create a sparse matrix from the text with the 100K most common unigrams and bigrams (`max_features=100000`).  
Below we convert the collection of text documents (found in the `comment_text` field) to a matrix of token counts.  
This is done using the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) in scikit-learn.  
After creating the vectorizer, we fit it on the training text `x_train` and use it to transform both the `x_train` and `x_valid` sets.   
This creates sparse matrices called `X_trn_mat` and `X_val_mat`.
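
To make the token-count matrix concrete, here is a small sketch on a two-sentence toy corpus (the sentences and the `toy_` names are made up for illustration and are not from the dataset). It uses the same `token_pattern` and `ngram_range` as the notebook, so each column is either a single word or a pair of adjacent words.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the cat sat", "the cat ran"]  # toy corpus, not from the dataset
toy_vectorizer = CountVectorizer(token_pattern=r"\w+", ngram_range=(1, 2))
toy_mat = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each unigram/bigram to its column index;
# sorting by that index lists the columns in matrix order.
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)

# One row per document, one column per unigram/bigram, entries are counts.
print(toy_mat.toarray())
```

The result of `fit_transform` is a SciPy sparse matrix; `toarray()` is only used here because the toy matrix is tiny.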


In [10]:
vectorizer = CountVectorizer(max_features=100000, token_pattern=r'\w+', ngram_range=(1, 2))  # raw string avoids an invalid-escape warning
X_trn_mat = vectorizer.fit_transform(x_train)
X_val_mat = vectorizer.transform(x_valid)

In [11]:
print(vectorizer.get_feature_names()[:10])
print(vectorizer.get_feature_names()[100000:100000+10])

['0', '0 0', '0 1', '0 5', '0 c', '00', '00 a', '00 am', '00 and', '00 for']
[]




In [12]:
X_trn_mat

<133333x100000 sparse matrix of type '<class 'numpy.int64'>'
	with 9852197 stored elements in Compressed Sparse Row format>

Below is our dataloader. It collects a single batch at a time (shuffling the data, if specified) and moves the batch to the GPU before making it available to the model.   
Please take a look through the data loader and make sure you understand the steps. 
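
The batching arithmetic can be sketched on its own, without any tensors. `batch_slices` is a hypothetical helper for illustration; it yields one row slice per batch and rounds the number of batches up so that a final, smaller batch is still included.

```python
import math

def batch_slices(n_rows, batch_size):
    """Yield one slice per batch; the last batch may be smaller than batch_size."""
    n_batches = math.ceil(n_rows / batch_size)
    for idx in range(n_batches):
        yield slice(idx * batch_size, min((idx + 1) * batch_size, n_rows))

# 10 rows in batches of 4 gives two full batches and one partial batch.
toy_slices = list(batch_slices(n_rows=10, batch_size=4))
toy_sizes = [s.stop - s.start for s in toy_slices]
print(toy_sizes)

# Number of batches for the training matrix in this notebook
# (133333 rows, batch size 1024).
n_train_batches = math.ceil(133333 / 1024)
print(n_train_batches)
```

A common shortcut, `n_rows // batch_size + 1`, gives the same count unless `n_rows` is an exact multiple of `batch_size`, in which case it adds one empty batch at the end.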

In [13]:
# Write a dataloader
class JigsawDataset(Dataset):
    def __init__(self, X, y, batchsize = 1024, shuffleidx = False):
        self.X, self.y = X, y
        if shuffleidx:
          self.X, self.y = self.shuffle(self.X, self.y)
        self.batch_size = batchsize

    def __len__(self):
        # Number of batches, rounded up so that a final partial batch is included
        return (self.X.shape[0] + self.batch_size - 1) // self.batch_size

    def __getitem__(self, idx):

        # Create the batch input and target
        batchidx = slice(idx*self.batch_size, (idx+1)*self.batch_size)
        x = self.X[batchidx].todense()
        y = self.y[batchidx]

        # Move the data to dtype float and onto the device
        x = torch.tensor(x).float().to(device)
        y = torch.tensor(y).float().to(device)

        return x, y
      
    def shuffle(self, X, y):
        indices = np.arange(X.shape[0]) # row indices of the matrix passed in (not the global X_trn_mat)
        shuffle(indices)
        Xout, yout = X[list(indices)], y[list(indices)]
        return Xout, yout


First we create a simple logistic regression in torch. Note that this is a single linear layer followed by a sigmoid activation. Check the `Deep Learning 101` async material for background on the sigmoid activation. 
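
As a reminder of what that single layer computes, here is a plain-Python sketch with made-up weights (the helper names `sigmoid` and `logistic_predict` and all numbers are assumptions for illustration). The linear part is a weighted sum plus a bias, and the sigmoid squashes it into a probability between 0 and 1.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_predict(x, w, b):
    """Single linear layer (w . x + b) followed by a sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

toy_w, toy_b = [2.0, -1.0], 0.0  # toy weights and bias

# The weighted sum is 2*1 - 1*2 = 0, so the model is undecided.
p_neutral = logistic_predict([1, 2], toy_w, toy_b)

# The weighted sum is 2*3 - 1*0 = 6, so the output is close to 1.
p_confident = logistic_predict([3, 0], toy_w, toy_b)

print(p_neutral, p_confident)
```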

In [None]:
class LogisticRegresion(nn.Module):
    def __init__(self, input_dim = 200000, output_dim = 1):
        super(LogisticRegresion, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)
        self.m = nn.Sigmoid()

    def forward(self, x):
        x = self.linear(x)
        x = self.m(x)
        x = x.flatten()
        return x

In [None]:
model = LogisticRegresion(input_dim = X_trn_mat.shape[1])

Now we choose our loss function, batch size and optimizer. See `Deep Learning 101` async material for details on these. 
Note that we also move our model from CPU to GPU. 
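
For intuition about the loss, here is a plain-Python sketch of binary cross-entropy averaged over a batch, which is what `nn.BCELoss` computes with its default mean reduction. The helper name and the toy probabilities are assumptions for illustration.

```python
import math

def binary_cross_entropy(y_true, y_prob):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

# Predictions that agree with the labels give a small loss.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])

# Predictions that confidently disagree are penalised heavily.
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])

print(round(confident_right, 4), round(confident_wrong, 4))
```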

In [None]:
batch_size = 1024
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.BCELoss()
device = 'cuda'
model = model.to(device)

In [None]:
trndataset = JigsawDataset(X_trn_mat, y_train, batch_size, shuffleidx = True)
valdataset = JigsawDataset(X_val_mat, y_valid, batch_size, shuffleidx = False)

Below is the training code. Step through it to understand how the batches are released and how each gradient descent step is made. You will need to run this yourself afterwards with your own model. 
Note how the gradients of the optimiser are reset every step. 
Also understand how we log our results: it is important to monitor how training progresses in order to debug issues. 
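
The reason for resetting gradients can be shown on a single toy parameter (a sketch only; the tensor `w` and the function `2 * w` are made up for illustration). PyTorch adds each new gradient to whatever is already stored in `.grad`, so without a reset the gradients from earlier steps leak into the current one.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# Two backward passes without resetting: the gradients add up (2 + 2).
(2 * w).backward()
(2 * w).backward()
accumulated = w.grad.item()

# Reset, then one backward pass: only the current step's gradient remains.
w.grad.zero_()
(2 * w).backward()
fresh = w.grad.item()

print(accumulated, fresh)
```

Calling `optimizer.zero_grad()` at the start of each step does this reset for every parameter the optimiser manages.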

In [None]:
for epoch in range(5):
  # Training step
  trn_losses = []
  pbartrn = tqdm(range(len(trndataset)), total = len(trndataset))
  model.train()
  for step in pbartrn: 
    optimizer.zero_grad() # Zero gradients each step 
    # Run through the model, get the loss and make a gradient descent step
    (x,y) = trndataset.__getitem__(step)
    out = model(x)
    loss = criterion(out, y)
    loss.backward()
    optimizer.step()
    trn_losses.append(loss.item())
    pbartrn.set_description(f'Epoch {epoch} train loss: {sum(trn_losses)/(1+step) :.4f}')

  # Validation step
  pbarval = tqdm(range(len(valdataset)), total = len(valdataset))
  model.eval()
  val_losses = []
  preds = []
  for step in pbarval: 
    # Run through the model, get the loss; note, we do not need gradients during inference
    (x,y) = valdataset.__getitem__(step)
    with torch.no_grad():
      out = model(x)
    preds.append(out.detach().cpu())
    loss = criterion(out, y)
    val_losses.append(loss.item())
    pbarval.set_description(f'Epoch {epoch} valid loss: {sum(val_losses)/(1+step) :.4f}')

  y_pred = np.concatenate(preds)
  print(f'Epoch {epoch} valid auc : {roc_auc_score(y_valid, y_pred):.4f}')

Epoch 0 valid auc : 0.7982

Epoch 1 valid auc : 0.8256

Epoch 2 valid auc : 0.8324

Epoch 3 valid auc : 0.8394

Epoch 4 valid auc : 0.8422


Your task is to improve the results by changing the architecture. You can try things like: 


1.   Adding multiple hidden layers (you can reference [this](https://www.kaggle.com/lopuhin/mercari-golf-0-3875-cv-in-75-loc-1900-s) prize-winning architecture for an initial set of layer dimensions).
2.   As you introduce more parameters, you will probably need to drop the learning rate to avoid overfitting. 
3.   Does [dropout](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) help to avoid overfitting? 
4.   Add a [relu](https://pytorch.org/docs/master/generated/torch.nn.functional.relu.html#torch.nn.functional.relu) activation between hidden layers. 
5.   Experiment with increasing or decreasing the batch size. Another good way to regularise is to start with a small batch size and increase it each epoch.  
6.   Add a small weight decay to your [Adam optimiser](https://pytorch.org/docs/stable/optim.html). 
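
Several of these suggestions can be combined in one small sketch. Everything below is an assumption for illustration, not a recommended configuration: the class name `SketchMLP`, the hidden sizes, the dropout rate, the learning rate and the weight decay are placeholders to tune, and the input dimension of 20 just keeps the example tiny.

```python
import torch
import torch.nn as nn

class SketchMLP(nn.Module):
    def __init__(self, input_dim, hidden_dims=(64, 32), dropout=0.2):
        super().__init__()
        layers = []
        prev = input_dim
        for h in hidden_dims:
            # Hidden layer, then ReLU, then dropout
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(dropout)]
            prev = h
        layers.append(nn.Linear(prev, 1))  # single output for binary classification
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Sigmoid turns the output into a probability, matching nn.BCELoss
        return torch.sigmoid(self.net(x)).flatten()

sketch_model = SketchMLP(input_dim=20)

# Adam with a lower learning rate and a small weight decay
sketch_optimizer = torch.optim.Adam(sketch_model.parameters(), lr=1e-3, weight_decay=1e-5)

# Forward pass on a random toy batch of 4 rows; eval() switches dropout off
sketch_model.eval()
with torch.no_grad():
    probs = sketch_model(torch.rand(4, 20))
print(probs.shape)
```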

**After you are happy with the results, download this notebook as an HTML file and submit it to ISVC, together with the highest AUC score you achieved.** 



In [17]:
!pip install wandb
!wandb login

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [40]:
# You can start adding your layers in here....
class MyBombasticMLP(nn.Module):
    def __init__(self, input_dim = 200000, output_dim = 1):
        super(MyBombasticMLP, self).__init__()
        self.input = nn.Linear(input_dim, 32)
        self.linear_stack = nn.Sequential(
            nn.Linear(32, 16),
            # nn.BatchNorm1d(16),
            nn.ReLU(),
            nn.Dropout(0.6),
            nn.Linear(16, 1)
        )
        self.m = nn.Sigmoid()

    def forward(self, x):
        x = self.input(x)
        x = self.linear_stack(x)
        x = self.m(x)
        x = x.flatten()
        return x

In [41]:
import wandb

wandb.config = {
  'learning_rate': 0.01,
  'epochs' : 10,
  'batch_size': 16,
  'log_interval': 1000
}

wandb.init(project='w251-hw4', config=wandb.config)

trndataset = JigsawDataset(X_trn_mat, y_train, wandb.config['batch_size'], shuffleidx = True)
valdataset = JigsawDataset(X_val_mat, y_valid, wandb.config['batch_size'], shuffleidx = False)
model = MyBombasticMLP(input_dim = X_trn_mat.shape[1])
optimizer = torch.optim.SGD(model.parameters(), lr=wandb.config['learning_rate'], momentum=0.9)  # SGD has no `patience` argument
criterion = nn.BCELoss()
device = 'cuda'
model = model.to(device)


0,1
auc,▂▇███▇▁▄
loss,█▃▂▄▁▄▂▃▃▂▂▂▂▁▃▁▂▂▄▂▂▂▄▂▂▁▂▁▂▂▂▂▂▂▂▁▂▁▁▂
val_loss,▄▁▃▅▄▄▁▂▄▄▃▁▂▄▄▃▁▂▄▃▃▁▂▃▆▃▁▂▅▇▃▁▃▃█▂▂▁▅▂

0,1
auc,0.87865
loss,0.06182
val_loss,0.16472


In [42]:
best_auc = 0

for epoch in range(wandb.config['epochs']):
  # Training step
  trn_losses = []
  pbartrn = tqdm(range(len(trndataset)), total = len(trndataset))
  model.train()
  for i, step in enumerate(pbartrn): 
    optimizer.zero_grad() # Zero gradients each step 
    # Run through the model, get the loss and make a gradient descent step
    (x,y) = trndataset.__getitem__(step)
    out = model(x)
    loss = criterion(out, y)
    loss.backward()
    optimizer.step()
    trn_losses.append(loss.item())
    pbartrn.set_description(f'Epoch {epoch} train loss: {sum(trn_losses)/(1+step) :.4f}')

    if i % wandb.config['log_interval'] == 0:
      wandb.log({
          'loss': loss.item()
      })

  # Validation step
  pbarval = tqdm(range(len(valdataset)), total = len(valdataset))
  model.eval()
  val_losses = []
  preds = []
  for i, step in enumerate(pbarval): 
    # Run through the model, get the loss; note, we do not need gradients during inference
    (x,y) = valdataset.__getitem__(step)
    with torch.no_grad():
      out = model(x)
    preds.append(out.detach().cpu())
    loss = criterion(out, y)
    val_losses.append(loss.item())
    pbarval.set_description(f'Epoch {epoch} valid loss: {sum(val_losses)/(1+step) :.4f}')

    if i % wandb.config['log_interval'] == 0:
      wandb.log({
          'val_loss': loss.item()
      })

  y_pred = np.concatenate(preds)
  auc = roc_auc_score(y_valid, y_pred)

  if auc > best_auc:
    best_auc = auc  # remember the best score so only improved models overwrite the checkpoint
    torch.save(model, 'model.pth')

  print(f'Epoch {epoch} valid auc : {auc:.4f}')

  wandb.log({
      'auc': auc
  })

Epoch 0 valid auc : 0.8769

Epoch 1 valid auc : 0.9039

Epoch 2 valid auc : 0.8985

Epoch 3 valid auc : 0.9020

Epoch 4 valid auc : 0.8993

Epoch 5 valid auc : 0.8914

Epoch 6 valid auc : 0.8906

Epoch 7 valid auc : 0.9058

Epoch 8 valid auc : 0.8945

Epoch 9 valid auc : 0.8889
