# Real or Not? NLP with Disaster Tweets

Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

This Kaggle competition asks for a machine learning model that predicts which Tweets are about **real** disasters and which **ones aren’t**. Participants are given a dataset of 10,000 hand-classified tweets.

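The evaluation metric is F1:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

where:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$

Here TP, FP and FN are the counts of true positives, false positives and false negatives.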
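
The F1 definition above can be checked by hand on a small example. The sketch below computes it directly from confusion counts; `f1_from_counts` is a hypothetical helper and the counts are made up purely for illustration (the notebook itself relies on `sklearn.metrics.f1_score`).

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from true-positive, false-positive and false-negative counts."""
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined),
        # so F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only for illustration.
score = f1_from_counts(tp=80, fp=20, fn=40)
print(round(score, 4))  # 0.7273
```

With precision 0.8 and recall of about 0.667, F1 lands between the two but closer to the lower value, which is why it penalizes a model that trades one for the other.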

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai import *
from fastai.text import *
from utils import *
from sklearn.model_selection import KFold
from pathlib import PosixPath
from sklearn.metrics import f1_score

In [2]:
torch.cuda.is_available()

True

In [3]:
path = PosixPath('/home/shanmugam/fastai/')
lm_fns = ['learn_en_wiki_15000', 'learn_en_wiki_15_vocab']
path


PosixPath('/home/shanmugam/fastai')

In [4]:
train_df = pd.read_csv(path/'train.csv')
train_df.loc[pd.isna(train_df.text),'text']='NA'
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [6]:
test_df = pd.read_csv(path/'test.csv')
test_df.loc[pd.isna(test_df.text),'text']='NA'
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [7]:
df = pd.concat([train_df,test_df], sort=False)

# Creating DataBunch

A DataBunch is created from the dataframe using the fastai data block API; it is then fed into the AWD_LSTM model.

In [8]:
bs=128
data_lm = (TextList.from_df(df, path, cols='text')
    .split_by_rand_pct(0.1, seed=42)
    .label_for_lm()           
    .databunch(bs=bs, num_workers=1))

# Transfer Learning 

Using a pretrained model and vocabulary. The model was trained with a vocabulary of 15,000 words.

In [43]:
lr = 1e-3

The model already knows the English language; here we fine-tune it on the Twitter dataset by unfreezing the layers.

In [45]:
learn_lm.unfreeze()
learn_lm.fit_one_cycle(8, lr, moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,4.219466,3.943643,0.375837,00:18
1,4.038679,3.757119,0.391574,00:18
2,3.871957,3.523306,0.420312,00:18
3,3.672551,3.39792,0.436384,00:18
4,3.493966,3.352379,0.44269,00:18
5,3.34039,3.29679,0.451479,00:18
6,3.229937,3.290707,0.453599,00:18
7,3.149267,3.282875,0.454018,00:18


In [12]:
lang='en'
learn_lm.save(f'{lang}fine_tuned')
learn_lm.save_encoder(f'{lang}fine_tuned_enc')

# Classifier

Now we use the pretrained model plus the Twitter-specific language model to classify whether a tweet is **Real or Not**.

In [14]:
data_clas = (TextList.from_df(train_df, path, vocab=data_lm.vocab, cols='text')
    .split_by_rand_pct(0.1, seed=42)
    .label_from_df(cols='target')
    .databunch(bs=bs, num_workers=1))

data_clas.save(f'{lang}_textlist_class')

In [15]:
data_clas = load_data(path, f'{lang}_textlist_class', bs=bs, num_workers=1)

In [16]:
@np_func
def f1(inp,targ): return f1_score(targ, np.argmax(inp, axis=-1))

In [18]:
learn_c = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5, metrics=[accuracy,f1]).to_fp16()
learn_c.load_encoder(f'{lang}fine_tuned_enc')
learn_c.freeze()

In [19]:
lr=2e-2

In [20]:
learn_c.fit_one_cycle(2, lr, moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.60174,0.544232,0.727989,0.568383,00:01
1,0.56943,0.537054,0.7477,0.609953,00:01


In [21]:
learn_c.fit_one_cycle(2, lr, moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.576629,0.687154,0.688568,0.46706,00:01
1,0.55543,0.532727,0.74113,0.620634,00:01


In [22]:
learn_c.freeze_to(-2)
learn_c.fit_one_cycle(2, slice(lr/(2.6**4),lr), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.548595,0.527342,0.74113,0.58564,00:02
1,0.520642,0.501762,0.760841,0.666952,00:01
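
The `slice(lr/(2.6**4), lr)` argument in the cell above requests discriminative learning rates: earlier layer groups train with smaller rates than the classifier head. The sketch below shows how such a range could be spread across groups. It assumes geometric spacing between the two ends (which is how fastai v1 is understood to expand a slice) and five layer groups in the classifier; `layer_group_lrs` is a hypothetical helper, not a fastai function.

```python
from typing import List


def layer_group_lrs(start: float, stop: float, n_groups: int) -> List[float]:
    """Geometrically spaced learning rates, from the earliest layer group
    (start) to the last layer group (stop)."""
    if n_groups == 1:
        return [stop]
    ratio = (stop / start) ** (1 / (n_groups - 1))
    return [start * ratio ** i for i in range(n_groups)]


lr = 2e-2
n_groups = 5  # assumption: number of layer groups in the text classifier
lrs = layer_group_lrs(lr / (2.6 ** 4), lr, n_groups)
for i, rate in enumerate(lrs):
    print(f"group {i}: lr = {rate:.6f}")
```

Under these assumptions each group trains with a rate 2.6 times larger than the group before it, which is where the `2.6**4` divisor comes from when there are five groups.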


In [23]:
learn_c.freeze_to(-3)
learn_c.fit_one_cycle(2, slice(lr/2/(2.6**4),lr/2), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.499345,0.519685,0.766097,0.6532,00:02
1,0.467862,0.492407,0.781866,0.686761,00:02


In [24]:
learn_c.unfreeze()
learn_c.fit_one_cycle(3, slice(lr/10/(2.6**4),lr/10), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.415215,0.504529,0.785808,0.675545,00:03
1,0.401466,0.514453,0.764783,0.706156,00:03
2,0.359672,0.513721,0.781866,0.700747,00:03


In [25]:
learn_c.save(f'{lang}clas')

The model up to this point was submitted to the Kaggle competition and scored **0.77914** on the public leaderboard.

## Backward

The DataBunch is created with each document in reverse order. For example, "the moving car" becomes "car moving the". Learning to predict the previous word turns out to be as important as learning to predict the next word.
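
The reversal described above can be illustrated on a single sentence. This is a minimal sketch that splits on whitespace; fastai applies the reversal to its own token sequence when `backwards=True` is passed, so `reverse_tokens` is a hypothetical stand-in rather than the library's implementation.

```python
def reverse_tokens(text: str) -> str:
    """Reverse the word order of a whitespace-tokenized string."""
    return " ".join(reversed(text.split()))


print(reverse_tokens("the moving car"))  # car moving the
```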

In [27]:
bs=128
data_lm = (TextList.from_df(df, path, cols='text')
    .split_by_rand_pct(0.1, seed=42)
    .label_for_lm()           
    .databunch(bs=bs, num_workers=1, backwards=True))
#learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=1.0)
learn_lm = language_model_learner(data_lm, AWD_LSTM, config={**awd_lstm_lm_config, 'n_hid': 1152},
                                  pretrained_fnames=lm_fns, drop_mult=1.0)


lr = 1e-3
# lr *= bs/48

learn_lm.fit_one_cycle(2, lr*10, moms=(0.8,0.7))



epoch,train_loss,valid_loss,accuracy,time
0,6.627207,4.457491,0.277986,00:07
1,5.264265,4.181861,0.343778,00:06


In [28]:
learn_lm.unfreeze()
learn_lm.fit_one_cycle(8, lr, moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,4.217378,3.916133,0.384794,00:10
1,4.050858,3.732418,0.401507,00:10
2,3.856618,3.541379,0.436217,00:10
3,3.655553,3.400859,0.452651,00:10
4,3.476488,3.334724,0.463421,00:10
5,3.32372,3.305023,0.466713,00:10
6,3.21598,3.288818,0.469643,00:10
7,3.133718,3.28111,0.469671,00:10


As with the forward model, the backward model is fine-tuned from the pretrained model on the reversed tweets by unfreezing the layers.

In [29]:
lang='en'
learn_lm.save(f'{lang}fine_tuned_bwd')
learn_lm.save_encoder(f'{lang}fine_tuned_enc_bwd')
data_clas = (TextList.from_df(train_df, path, vocab=data_lm.vocab, cols='text')
    .split_by_rand_pct(0.1, seed=42)
    .label_from_df(cols='target')
    .databunch(bs=bs, num_workers=1, backwards=True))

data_clas.save(f'{lang}_textlist_class_bwd')

data_clas = load_data(path, f'{lang}_textlist_class_bwd', bs=bs, num_workers=1, backwards=True)
learn_c = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5, metrics=[accuracy,f1]).to_fp16()
learn_c.load_encoder(f'{lang}fine_tuned_enc_bwd')
learn_c.freeze()

In [30]:
lr=2e-2
lr *= bs/48
learn_c.fit_one_cycle(2, lr, moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.60687,0.598413,0.695138,0.429398,00:01
1,0.571485,0.528714,0.7477,0.585651,00:01


In [31]:
learn_c.freeze_to(-2)
learn_c.fit_one_cycle(2, slice(lr/(2.6**4),lr), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.559272,0.556698,0.706964,0.476851,00:02
1,0.522216,0.484263,0.785808,0.677238,00:01


In [32]:
learn_c.freeze_to(-3)
learn_c.fit_one_cycle(2, slice(lr/2/(2.6**4),lr/2), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.503359,0.494776,0.772668,0.686364,00:02
1,0.476957,0.470973,0.78975,0.711484,00:02


In [35]:
learn_c.unfreeze()
learn_c.fit_one_cycle(3, slice(lr/10/(2.6**4),lr/10), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,f1,time
0,0.314185,0.576438,0.773982,0.686679,00:03
1,0.300418,0.584375,0.771353,0.698708,00:03
2,0.264766,0.622777,0.770039,0.688253,00:03


In [36]:
learn_c.save(f'{lang}clas_bwd')

# Ensemble

### Predictions of the forward and backward models are ensembled by taking their mean

In [60]:
data_clas = load_data(path, f'{lang}_textlist_class', bs=bs, num_workers=1)
data_clas.add_test(test_df)
learn_c = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5, metrics=[accuracy,f1]).to_fp16()
learn_c.load(f'{lang}clas', purge=False);

In [None]:
preds,targs = learn_c.get_preds(DatasetType.Test)

In [62]:
data_clas_bwd = load_data(path, f'{lang}_textlist_class_bwd', bs=bs, num_workers=1, backwards=True)
data_clas_bwd.add_test(test_df)
learn_c_bwd = text_classifier_learner(data_clas_bwd, AWD_LSTM, drop_mult=0.5, metrics=[accuracy,f1]).to_fp16()
learn_c_bwd.load(f'{lang}clas_bwd', purge=False);

In [None]:
preds_b,targs_b = learn_c_bwd.get_preds(DatasetType.Test)

In [65]:
preds_avg = (preds+preds_b)/2
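
To see what the averaging step does, here is a small sketch in plain Python with made-up class probabilities for three tweets. `average_and_classify` is a hypothetical helper; in the notebook the same thing happens on tensors via `(preds+preds_b)/2` followed by `argmax`.

```python
from typing import List, Sequence


def average_and_classify(probs_a: Sequence[Sequence[float]],
                         probs_b: Sequence[Sequence[float]]) -> List[int]:
    """Average two models' per-class probabilities row by row and
    return the index of the highest averaged probability for each row."""
    labels = []
    for row_a, row_b in zip(probs_a, probs_b):
        avg = [(a + b) / 2 for a, b in zip(row_a, row_b)]
        labels.append(max(range(len(avg)), key=avg.__getitem__))
    return labels


# Hypothetical probabilities [P(not disaster), P(disaster)] for three tweets.
fwd = [[0.70, 0.30], [0.45, 0.55], [0.20, 0.80]]
bwd = [[0.60, 0.40], [0.65, 0.35], [0.10, 0.90]]
print(average_and_classify(fwd, bwd))  # [0, 0, 1]
```

For the second tweet the forward model alone would predict class 1, but the more confident backward model pulls the average toward class 0. This is the kind of disagreement that averaging resolves.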

In [69]:
sub = pd.read_csv(path/'sample_submission.csv')

In [72]:
sub.target = preds_avg.argmax(dim=-1)
sub

Unnamed: 0,id,target
0,0,1
1,2,1
2,3,1
3,9,1
4,11,1
...,...,...
3258,10861,1
3259,10865,1
3260,10868,1
3261,10874,1


In [71]:
sub['target'].value_counts()

0    1769
1    1494
Name: target, dtype: int64

In [73]:
sub.to_csv('model_lin.csv', index=False)

### As expected, ensembling improved the score to **0.78936**, up from the previous **0.77914**