# Python project for DSTI class 

1. [Data Ingestion](#section1)


2. [Data Preprocessing](#section2)


3. [EDA](#section3)


4. [NLP](#section4)

    4.1 [Tokenizing the text](#section41)
    
    4.2 [Extracting keywords from text : NER](#section42)
    

5. [Deep Learning](#section5)

    5.1. [Encoding the Labels](#section51)

    5.2. [Train and Validation Split](#section52)

    5.3. [BertTokenizer and Encoding the Data](#section53)
    
    5.4. [BERT Pre-trained Model](#section54)
    
    5.5. [Data Loaders](#section55)
    
    5.6. [Optimizer & Scheduler](#section56)
    
    5.7. [Performance Metrics](#section57)
    
    5.8. [Training Loop](#section58)
    
    5.9. [Loading and Evaluating the Model](#section59)

# 1. Data Ingestion
<a id="section1"> </a>

In [2]:
# Importing needed libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from collections import Counter,defaultdict


import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.express as px

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_colwidth', 200)

In [33]:
df = pd.read_json('News_Category_Dataset_v2.json', lines=True)

In [34]:
df.head()

Unnamed: 0,category,headline,authors,link,short_description,date
0,CRIME,"There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV",Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-amanda-painter-mass-shooting_us_5b081ab4e4b0802d69caad89,She left her husband. He killed their children. Just another day in America.,2018-05-26
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song,Andy McDonald,https://www.huffingtonpost.com/entry/will-smith-joins-diplo-and-nicky-jam-for-the-official-2018-world-cup-song_us_5b09726fe4b0fdb2aa541201,Of course it has a song.,2018-05-26
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-grant-marries_us_5b09212ce4b0568a880b9a8c,The actor and his longtime girlfriend Anna Eberstein tied the knot in a civil ceremony.,2018-05-26
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And Democrats In New Artwork,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carrey-adam-schiff-democrats_us_5b0950e8e4b0fdb2aa53e675,The actor gives Dems an ass-kicking for not fighting hard enough against Donald Trump.,2018-05-26
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags To Pick Up After Her Dog,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-margulies-trump-poop-bag_us_5b093ec2e4b0fdb2aa53df70,"The ""Dietland"" actress said using the bags is a ""really cathartic, therapeutic moment.""",2018-05-26


In [35]:
df.shape

(200853, 6)

# 2. Data Preprocessing
<a id="section2"> </a>

In [36]:
df.date.max()

Timestamp('2018-05-26 00:00:00')

In [37]:
df.date.min()

Timestamp('2012-01-28 00:00:00')

In [38]:
# The full dataset is large, so keep only articles published from 2018-01-01 onward
df = df[df['date'] >= pd.Timestamp(2018,1,1)]

In [39]:
df.shape

(8583, 6)

In [40]:
# Drop every row whose short_description occurs more than once (keep=False flags all copies)
df.sort_values('short_description',inplace=True, ascending=False)
duplicated_df = df.duplicated('short_description', keep = False)
df = df[~duplicated_df]
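
The cell above passes `keep=False`, so every copy of a repeated description is flagged and none of them survives the filter. The difference from keeping the first copy can be sketched in plain Python as follows; the function names and the sample list are hypothetical and only illustrate the semantics, not the pandas implementation.

```python
from collections import Counter

def drop_all_duplicates(texts):
    """Keep only texts that occur exactly once (mirrors keep=False)."""
    counts = Counter(texts)
    return [t for t in texts if counts[t] == 1]

def keep_first_occurrence(texts):
    """Keep the first copy of each text (mirrors keep='first')."""
    seen = set()
    kept = []
    for t in texts:
        if t not in seen:
            seen.add(t)
            kept.append(t)
    return kept

# Hypothetical descriptions: "a" appears twice.
sample = ["a", "b", "a", "c"]
unique_only = drop_all_duplicates(sample)
first_kept = keep_first_occurrence(sample)
```

With `keep=False` semantics the repeated item disappears entirely, which is the stricter of the two choices.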

In [41]:
df.shape

(8521, 6)

In [42]:
df = df[df['short_description'].apply(lambda x: len(x.split())>5)]

In [43]:
df.shape

(7828, 6)

In [44]:
df = df.reset_index().drop('index',axis=1)

In [45]:
df.head()

Unnamed: 0,category,headline,authors,link,short_description,date
0,COMEDY,People Of Twitter Accurately Describe 2018 Thus Far With Song Lyrics,Andy McDonald,https://www.huffingtonpost.com/entry/twitter-describes-2018-thus-far-with-song-lyrics_us_5a5912d3e4b04f3c55a26d16,"🎵 ""I wanna be sedated."" 🎵",2018-01-12
1,POLITICS,Protestors Want Green Day's 'American Idiot' To Top UK Chart For Trump Visit,Lee Moran,https://www.huffingtonpost.com/entry/donald-trump-uk-visit-american-idiot_us_5ae8334ee4b055fd7fcf4fab,🎤 Don't wanna be an American idiot... 🎤,2018-05-01
2,WEIRD NEWS,Spanish Woman Looks More Like Trump Than The Donald Himself,David Moye,https://www.huffingtonpost.com/entry/dolores-antelo-donald-trump-lookalike_us_5adf8ed2e4b07be4d4c58c8c,"″My photo seems to have traveled far. I say it is because of the color of my hair,” she said.",2018-04-24
3,BLACK VOICES,Mississippi School Finds No Evidence Principal Cut Student's Hair Without Permission (UPDATE),David Moye,https://www.huffingtonpost.com/entry/mississippi-boy-hair-locs-cut-principal_us_5abbfa33e4b03e2a5c78e34d,"“[W]e found absolutely no evidence ... that his allegations of having his hair cut at school exist.""",2018-03-28
4,ENTERTAINMENT,Deaf Activist Points Out That Marvel’s Diversity Problem Isn’t Just About Race,Elyse Wanshel,https://www.huffingtonpost.com/entry/deaf-activist-points-out-that-marvels-diversity-problem-isnt-just-about-race_us_5afc9816e4b0a59b4e002ee9,"“[People] think diversity has to do with race and gender, but there’s so much more to it.”",2018-05-17


In [46]:
df.tail()

Unnamed: 0,category,headline,authors,link,short_description,date
7823,QUEER VOICES,"John Oliver Didn't Think Pence-Trolling, Gay-Themed Book Would Be A Hit",Curtis M. Wong,https://www.huffingtonpost.com/entry/john-oliver-ellen-degeneres-kids-book_us_5ab138d5e4b09a2c75c9488a,"""A Day In the Life of Marlon Bundo,"" the comedian says, paints the world in an inclusive light.",2018-03-20
7824,ENTERTAINMENT,"Kim Kardashian Wished Kanye West A Happy Anniversary And, Fine, It's Pretty Cute",Cole Delbyck,https://www.huffingtonpost.com/entry/kim-kardashian-wished-kanye-west-happy-anniversary_us_5b06ed58e4b05f0fc845ff3f,"""4 years down and forever to go....""",2018-05-24
7825,ENTERTAINMENT,Jim Carrey Pens New Pledge Of Allegiance For Students In Bullet-Riddled Schools,Ed Mazza,https://www.huffingtonpost.com/entry/jimmy-kimmel-fox-news-hosts_us_5b039bf1e4b07309e05b89b2,"""...with butchery and injustice for the most innocent of all.""",2018-05-22
7826,WOMEN,The 20 Funniest Tweets From Women This Week,Hollis Miller,https://www.huffingtonpost.com/entry/the-20-funniest-tweets-from-women-this-week_us_5abe36e7e4b055e50acd048e,"""'Did I put deodorant on?' -- Me six times a day""",2018-03-30
7827,BLACK VOICES,#InWakanda Hashtag Brings The Blackest Of Nations To Life,Princess-India Alexander,https://www.huffingtonpost.com/entry/inwakanda-hashtag-black-twitter_us_5a821662e4b0892a035202e5,""" #InWakanda ashiness does not exist! There is jojoba oil in the wind.""",2018-02-13


In [47]:
df.isna().sum()

category             0
headline             0
authors              0
link                 0
short_description    0
date                 0
dtype: int64

In [48]:
df[['category','short_description']].to_csv('drive/MyDrive/data.csv',index=False)

# 3. EDA
<a id="section3"> </a>

In [49]:
print("Total number of articles : ", df.shape[0])
print("Total number of unique categories : ", df["category"].nunique())

Total number of articles :  7828
Total number of unique categories :  26


In [50]:
df.category.value_counts()

POLITICS          2923
ENTERTAINMENT     1448
WORLD NEWS         548
QUEER VOICES       440
BLACK VOICES       383
COMEDY             355
SPORTS             321
MEDIA              277
WOMEN              224
CRIME              170
WEIRD NEWS         150
BUSINESS            82
LATINO VOICES       77
IMPACT              72
RELIGION            65
TRAVEL              56
TECH                51
SCIENCE             39
PARENTS             33
EDUCATION           32
GREEN               26
STYLE               25
HEALTHY LIVING      15
ARTS & CULTURE       9
TASTE                6
COLLEGE              1
Name: category, dtype: int64

### Distribution of articles category-wise

In [51]:
fig = go.Figure([go.Bar(x=df["category"].value_counts().index, y=df["category"].value_counts().values)])
fig['layout'].update(title={"text" : 'Distribution of articles category-wise','y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'}, xaxis_title="Category name",yaxis_title="Number of articles")
fig.update_layout(width=800,height=700)

In [52]:
df_per_month = df.resample('m', on = 'date')['short_description'].count()

### Distribution of articles month-wise

In [53]:
fig = go.Figure([go.Bar(x= df_per_month.index.strftime("%b"), y = df_per_month)])
fig['layout'].update(title={"text" : 'Distribution of articles month-wise','y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'}, xaxis_title="Month",yaxis_title="Number of articles")
fig.update_layout(width=500,height=500)

### The probability density function of short description length

It is roughly Gaussian in shape, with most descriptions between 50 and 100 characters long (the plot uses `str.len()`, which counts characters rather than words).

In [54]:
fig = ff.create_distplot([df['short_description'].str.len()], ["ht"],show_hist=False,show_rug=False)
fig['layout'].update(title={'text':'PDF','y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'}, xaxis_title="Length of a Short Description",yaxis_title="probability")
fig.update_layout(showlegend = False,width=500,height=500)

# 4. NLP
<a id="section4"> </a>

In [55]:
# Import the English language class
from spacy.lang.en import English
# Create the nlp object
nlp = English()

### 4.1 Tokenizing the text
<a id="section41"> </a>

In [58]:
df['short_description_nlp_tokens'] = df.short_description.apply(lambda x: [token.text for token in nlp(x)])

In [59]:
df.head()

Unnamed: 0,category,headline,authors,link,short_description,date,short_description_nlp_tokens
0,COMEDY,People Of Twitter Accurately Describe 2018 Thus Far With Song Lyrics,Andy McDonald,https://www.huffingtonpost.com/entry/twitter-describes-2018-thus-far-with-song-lyrics_us_5a5912d3e4b04f3c55a26d16,"🎵 ""I wanna be sedated."" 🎵",2018-01-12,"[🎵, "", I, wanna, be, sedated, ., "", 🎵]"
1,POLITICS,Protestors Want Green Day's 'American Idiot' To Top UK Chart For Trump Visit,Lee Moran,https://www.huffingtonpost.com/entry/donald-trump-uk-visit-american-idiot_us_5ae8334ee4b055fd7fcf4fab,🎤 Don't wanna be an American idiot... 🎤,2018-05-01,"[🎤, Do, n't, wanna, be, an, American, idiot, ..., 🎤]"
2,WEIRD NEWS,Spanish Woman Looks More Like Trump Than The Donald Himself,David Moye,https://www.huffingtonpost.com/entry/dolores-antelo-donald-trump-lookalike_us_5adf8ed2e4b07be4d4c58c8c,"″My photo seems to have traveled far. I say it is because of the color of my hair,” she said.",2018-04-24,"[″My, photo, seems, to, have, traveled, far, ., I, say, it, is, because, of, the, color, of, my, hair, ,, ”, she, said, .]"
3,BLACK VOICES,Mississippi School Finds No Evidence Principal Cut Student's Hair Without Permission (UPDATE),David Moye,https://www.huffingtonpost.com/entry/mississippi-boy-hair-locs-cut-principal_us_5abbfa33e4b03e2a5c78e34d,"“[W]e found absolutely no evidence ... that his allegations of having his hair cut at school exist.""",2018-03-28,"[“, [, W]e, found, absolutely, no, evidence, ..., that, his, allegations, of, having, his, hair, cut, at, school, exist, ., ""]"
4,ENTERTAINMENT,Deaf Activist Points Out That Marvel’s Diversity Problem Isn’t Just About Race,Elyse Wanshel,https://www.huffingtonpost.com/entry/deaf-activist-points-out-that-marvels-diversity-problem-isnt-just-about-race_us_5afc9816e4b0a59b4e002ee9,"“[People] think diversity has to do with race and gender, but there’s so much more to it.”",2018-05-17,"[“, [, People, ], think, diversity, has, to, do, with, race, and, gender, ,, but, there, ’s, so, much, more, to, it, ., ”]"


### 4.2 Extracting keywords from text : NER
<a id="section42"> </a>

In [60]:
import spacy
# Load the small English model
nlp = spacy.load('en_core_web_sm')

In [61]:
df['NER'] = df.short_description.apply(lambda x: {ent.text:ent.label_ for ent in nlp(x).ents})

In [62]:
ner_data = defaultdict(set)

In [63]:
for row in df.itertuples():
    ner_d = row[-1]
    if ner_d:
        for text,label in ner_d.items():
            ner_data[label].add(text)

In [64]:
for label, examples in ner_data.items():
    print(label)
    print(list(examples)[:10])
    print()

ORG
['The Defense Department', 'Fed', 'Quinn Norton', 'Shri Thanedar', "Sheriff Israel's", 'Ebenezer', 'the Senate Intelligence committee', "The Children's Health Insurance Program", 'The State Department', 'Team USA']

NORP
['slur', 'Qatari', 'Republicans', 'Americans', 'Palestinians', 'Czech', 'Kevlar', 'Han', 'Armenians', 'Moroccan']

CARDINAL
['some 60', 'more than 31', 'only 10', 'more than 1,000', 'About two-thirds', 'only 17', '16', '150', 'more than two dozen', 'only one']

PERSON
['Jennifer Roach', 'Francis', 'Mia Amor Mottley', 'Allen', 'Markle', 'Leonardo DiCaprio', 'Tyrone Hankerson Jr.', 'Bebe', 'Ayatollah Ali Khamenei', 'Justina Machado']

GPE
['Myanmar', 'Honduras', 'Montréal', 'Maryland', 'Gus Kenworthy', 'Us', 'Kanye West', 'Wales', 'Sting', 'mosque']

DATE
['zero days', 'July 24', '2,000 years old', 'February', 'A Day', 'last summer', 'every day', '2009', 'week', '1985']

WORK_OF_ART
['Newsnight', 'Whip My Hair', 'PhD', '“A Christmas Carol', 'The Music of What Happens

# 5. Deep Learning
<a id="section5"> </a>

In [66]:
import torch
from tqdm.notebook import tqdm

from transformers import BertTokenizer
from torch.utils.data import TensorDataset

from transformers import BertForSequenceClassification

In [67]:
df = pd.read_csv('data.csv')

In [68]:
df.head()

Unnamed: 0,category,short_description
0,COMEDY,"🎵 ""I wanna be sedated."" 🎵"
1,POLITICS,🎤 Don't wanna be an American idiot... 🎤
2,WEIRD NEWS,"″My photo seems to have traveled far. I say it is because of the color of my hair,” she said."
3,BLACK VOICES,"“[W]e found absolutely no evidence ... that his allegations of having his hair cut at school exist."""
4,ENTERTAINMENT,"“[People] think diversity has to do with race and gender, but there’s so much more to it.”"


In [90]:
df.category.value_counts()

POLITICS          2923
ENTERTAINMENT     1448
WORLD NEWS         548
QUEER VOICES       440
BLACK VOICES       383
COMEDY             355
SPORTS             321
MEDIA              277
WOMEN              224
CRIME              170
WEIRD NEWS         150
BUSINESS            82
LATINO VOICES       77
IMPACT              72
RELIGION            65
TRAVEL              56
TECH                51
SCIENCE             39
PARENTS             33
EDUCATION           32
GREEN               26
STYLE               25
HEALTHY LIVING      15
ARTS & CULTURE       9
TASTE                6
Name: category, dtype: int64

In [91]:
# Keep only the categories with at least 100 articles
category_counts = df.category.value_counts()
good_categories = category_counts[category_counts >= 100].index.tolist()

In [92]:
df = df[df.category.isin(good_categories)]

In [93]:
df.shape

(7239, 4)

### 5.1 Encoding the Labels
<a id="section51"> </a>

In [95]:
possible_labels = df.category.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
label_dict

{'BLACK VOICES': 3,
 'COMEDY': 0,
 'CRIME': 10,
 'ENTERTAINMENT': 4,
 'MEDIA': 5,
 'POLITICS': 1,
 'QUEER VOICES': 7,
 'SPORTS': 6,
 'WEIRD NEWS': 2,
 'WOMEN': 8,
 'WORLD NEWS': 9}

In [96]:
df['label'] = df.category.replace(label_dict)

### 5.2 Train and Validation Split
<a id="section52"> </a>

Because the labels are imbalanced, we split the data set in a stratified fashion, using the class labels as the stratification variable.

The label distribution after the split looks like this.


In [97]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.15, 
                                                  random_state=42, 
                                                  stratify=df.label.values)

df['data_type'] = ['not_set']*df.shape[0]

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

df.groupby(['category', 'label', 'data_type']).count()

category,label,data_type,short_description
BLACK VOICES,3,train,326
BLACK VOICES,3,val,57
COMEDY,0,train,302
COMEDY,0,val,53
CRIME,10,train,144
CRIME,10,val,26
ENTERTAINMENT,4,train,1231
ENTERTAINMENT,4,val,217
MEDIA,5,train,235
MEDIA,5,val,42
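
To illustrate what stratification does, here is a minimal pure-Python sketch that splits each class separately so the validation share is roughly the same for every label. It is not scikit-learn's algorithm; `stratified_split` and the toy labels are hypothetical names introduced for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.15, seed=42):
    """Return (train_idx, val_idx) with about test_size of each class in val."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        n_val = round(len(indices) * test_size)   # per-class validation size
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Hypothetical imbalanced labels: 100 rows of class 0, 20 rows of class 1.
labels = [0] * 100 + [1] * 20
train_idx, val_idx = stratified_split(labels)
val_counts = {c: sum(1 for i in val_idx if labels[i] == c) for c in (0, 1)}
```

Each class contributes about 15% of its rows to validation, so rare categories are not left out of the validation set by chance.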


### 5.3 BertTokenizer and Encoding the Data
<a id="section53"> </a>

- `BertTokenizer` constructs a BERT tokenizer based on WordPiece.
- We load the pre-trained `bert-base-uncased` tokenizer to encode our data.
- To convert the short descriptions from text into encoded form, we use `batch_encode_plus`, and we process the training and validation data separately.
- The first argument to this function is the array of short-description texts.
- `add_special_tokens=True` means the sequences are encoded with the special tokens of the model.
- `return_attention_mask=True` returns the attention mask, which distinguishes real tokens from padding.
- We also pad all the descriptions to a fixed maximum length.
- `max_length=256` is more than we actually need; it is set generously to be safe.
- `return_tensors='pt'` returns PyTorch tensors.
- We then split the encoded data into `input_ids`, `attention_masks` and `labels`.
- Finally, we build the training and validation data sets from the encoded data.
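
The padding and attention-mask steps listed above can be sketched without the library. This is a simplified illustration, not what `batch_encode_plus` does internally: `pad_and_mask` is a hypothetical helper, the token ids are made up, and the naive truncation here does not preserve a trailing special token the way a real tokenizer would.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad one sequence and build its attention mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)            # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

# Hypothetical token ids; 101 and 102 stand in for the start and end markers.
example = [101, 7592, 2088, 102]
input_ids, attention_mask = pad_and_mask(example, max_length=8)
```

Every sequence ends up with the same length, and the mask tells the model which positions to ignore.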

In [98]:
df.head()

Unnamed: 0,category,short_description,label,data_type
0,COMEDY,"🎵 ""I wanna be sedated."" 🎵",0,train
1,POLITICS,🎤 Don't wanna be an American idiot... 🎤,1,train
2,WEIRD NEWS,"″My photo seems to have traveled far. I say it is because of the color of my hair,” she said.",2,train
3,BLACK VOICES,"“[W]e found absolutely no evidence ... that his allegations of having his hair cut at school exist.""",3,train
4,ENTERTAINMENT,"“[People] think diversity has to do with race and gender, but there’s so much more to it.”",4,train


In [99]:
df['short_description'] = df['short_description'].astype(str)

In [100]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)
                                          
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].short_description.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].short_description.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)


input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)

dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.

The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).



### 5.4 BERT Pre-trained Model
<a id="section54"> </a>

We treat each short description as its own sequence, so each sequence is classified into one of the 11 labels (i.e. news categories).
- `bert-base-uncased` is the smaller (base-size) pre-trained model.
- Using num_labels to indicate the number of output labels.
- We don’t really care about output_attentions.
- We also don’t need output_hidden_states.


In [101]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [102]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


### 5.5 Data Loaders
<a id="section55"> </a>

- `DataLoader` combines a dataset and a `sampler`, and provides an iterable over the given dataset.
- We use `RandomSampler` for training and `SequentialSampler` for validation.
- Given the limited memory in my environment, I set `batch_size=12`.

In [103]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 12

dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train), 
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val, 
                                   sampler=SequentialSampler(dataset_val), 
                                   batch_size=batch_size)

### 5.6 Optimizer & Scheduler
<a id="section56"> </a>

- To construct an optimizer, we have to give it an iterable containing the parameters to optimize. Then, we can specify optimizer-specific options such as the learning rate, epsilon, etc.
- I found `epochs=5` works well for this data set.
- Create a schedule with a learning rate that decreases linearly from the initial learning rate set in the optimizer to 0, after a warmup period during which it increases linearly from 0 to the initial learning rate set in the optimizer.
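
The shape of the schedule described in the last bullet can be written as a small function. This is a sketch of the described behaviour rather than the library's code; `linear_schedule_factor` is a hypothetical name, and the figure of 513 batches per epoch is an assumption made for the example.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the initial learning rate at a given step."""
    if step < num_warmup_steps:
        # Warmup: rise linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

initial_lr = 1e-5
total_steps = 513 * 5  # assumed: 513 batches per epoch, 5 epochs

lr_start = initial_lr * linear_schedule_factor(0, 0, total_steps)
lr_mid = initial_lr * linear_schedule_factor(total_steps // 2, 0, total_steps)
lr_end = initial_lr * linear_schedule_factor(total_steps, 0, total_steps)
```

With no warmup steps, the learning rate starts at its initial value and reaches zero on the final training step.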

In [104]:
from transformers import AdamW, get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(),
                  lr=1e-5, 
                  eps=1e-8)
                  
epochs = 5

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)

### 5.7 Performance Metrics
<a id="section57"> </a>

We will use the weighted F1 score and per-class accuracy as performance metrics.

In [105]:
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
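
To make the `average='weighted'` option concrete, here is a minimal pure-Python sketch that computes a per-class F1 and averages it with weights equal to each class's number of true examples. `weighted_f1` and the toy labels are hypothetical; the notebook itself relies on scikit-learn's `f1_score`.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += f1 * n_true / total   # weight by class frequency
    return score

# Hypothetical labels: class 0 dominates, class 1 is rare.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = weighted_f1(y_true, y_pred)
```

Because the weights follow class frequency, large categories such as POLITICS influence this score far more than small ones, which is why the per-class breakdown is reported as well.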

### 5.8 Training Loop
<a id="section58"> </a>

In [106]:
import random
import numpy as np

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals
    
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
        # The loss is already averaged over the batch; len(batch) is the tuple size (3), not the batch size
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
         
        
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')


Epoch 1
Training loss: 1.5620516769834893
Validation loss: 1.335740640268221
F1 Score (Weighted): 0.5160920249289354



Epoch 2
Training loss: 1.1542239864202504
Validation loss: 1.2273200372090707
F1 Score (Weighted): 0.5759235871261247



Epoch 3
Training loss: 0.9454378672161995
Validation loss: 1.2469884347129654
F1 Score (Weighted): 0.590377783763116



Epoch 4
Training loss: 0.7973711476398025
Validation loss: 1.2477638993289444
F1 Score (Weighted): 0.5944456094614607



Epoch 5
Training loss: 0.7048962778515286
Validation loss: 1.2734135605476715
F1 Score (Weighted): 0.5918700209803073
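
The log shows the training loss falling every epoch while the validation loss stops improving after epoch 2, a pattern that usually points to overfitting. One way to pick a checkpoint is to compare the logged validation metrics directly; the sketch below does this with the numbers copied from the log, and the variable names are hypothetical.

```python
# Validation metrics copied from the training log: (epoch, val_loss, weighted_f1).
history = [
    (1, 1.335740640268221, 0.5160920249289354),
    (2, 1.2273200372090707, 0.5759235871261247),
    (3, 1.2469884347129654, 0.590377783763116),
    (4, 1.2477638993289444, 0.5944456094614607),
    (5, 1.2734135605476715, 0.5918700209803073),
]

# Epoch with the highest weighted F1 and epoch with the lowest validation loss.
best_by_f1 = max(history, key=lambda row: row[2])[0]
best_by_loss = min(history, key=lambda row: row[1])[0]
```

The two criteria disagree here: weighted F1 peaks at epoch 4, while validation loss is lowest at epoch 2. The next section loads the epoch 4 checkpoint, which is the one with the highest weighted F1 in this log.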



### 5.9 Loading and Evaluating the Model

<a id="section59"> </a>

In [107]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

model.load_state_dict(torch.load('finetuned_BERT_epoch_4.model', map_location=torch.device('cpu')))

_, predictions, true_vals = evaluate(dataloader_validation)
accuracy_per_class(predictions, true_vals)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Class: COMEDY
Accuracy: 19/53

Class: POLITICS
Accuracy: 350/439

Class: WEIRD NEWS
Accuracy: 1/22

Class: BLACK VOICES
Accuracy: 13/57

Class: ENTERTAINMENT
Accuracy: 158/217

Class: MEDIA
Accuracy: 16/42

Class: SPORTS
Accuracy: 32/48

Class: QUEER VOICES
Accuracy: 16/66

Class: WOMEN
Accuracy: 3/34

Class: WORLD NEWS
Accuracy: 50/82

Class: CRIME
Accuracy: 9/26
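
The counts above are easier to compare as rates. The sketch below converts them to per-class fractions and an overall accuracy; the numbers are copied from the output above and the variable names are hypothetical. Since each fraction is correct predictions over true examples of that class, it is effectively the per-class recall.

```python
# (correct, total) pairs copied from the per-class output.
per_class = {
    "COMEDY": (19, 53),
    "POLITICS": (350, 439),
    "WEIRD NEWS": (1, 22),
    "BLACK VOICES": (13, 57),
    "ENTERTAINMENT": (158, 217),
    "MEDIA": (16, 42),
    "SPORTS": (32, 48),
    "QUEER VOICES": (16, 66),
    "WOMEN": (3, 34),
    "WORLD NEWS": (50, 82),
    "CRIME": (9, 26),
}

rates = {name: correct / total for name, (correct, total) in per_class.items()}
n_correct = sum(correct for correct, _ in per_class.values())
n_total = sum(total for _, total in per_class.values())
overall = n_correct / n_total

# Categories ordered from best to worst recognised.
ranked = sorted(rates, key=rates.get, reverse=True)
for name in ranked:
    print(f"{name:<15} {rates[name]:.1%}")
```

The two largest categories, POLITICS and ENTERTAINMENT, are recognised most reliably, while small categories such as WEIRD NEWS and WOMEN are rarely predicted correctly, which is consistent with the class imbalance seen in the EDA section.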

