<a href="https://colab.research.google.com/github/shuchimishra/Tensorflow_projects/blob/main/Tensorflow_Code/NLP/Disaster_Tweet_analysis_with_Transfer_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A solution to Disaster Twets using Transfer Learning

In this notebook we are going to solve the Geetting Started competition, Natural Language Processing with Disaster Tweets, using Transfer Learning.

We will use the Universal Sentence Encoder from Tensorflow Hub.
https://tfhub.dev/google/universal-sentence-encoder/4

One of the main advantatges of use this model is that we don't need to conver our sentences in embedding, create dictionaris, tokenize.... we avoid a lot of the preprocessment related with NLP.

We are going to do a minimal treatmet to transform the tweets from:

- **'.@WestmdCountyPA land bank targets first #Latrobe building in 20th property acquisition to fight #blight: http://t.co/regDv873Aj'**

to

- **'land bank targets first latrobe building 20th property acquisition fight blight '**

But if you want you can fork the notebook and try to pass the tweets without any transformation. *Most ideas, at the end oh the notebook in the **FORK & IMPROVE** section.*

To achieve this it's necessary to do the next transformations:

- Remove Marks: Remove punctuactions, and symbols.
- Tokenize the sentence: Get all words, we are going to use *TweetTokenizer* from the *NLTK* library that have some intelligence adapted to tweets.
- Remove Stop Words: Downloading the english lybrary *stopwords* from NLTK. We remove a lot of words from the tweets. In the sample above we removed: *in, to*.


In [1]:
#import libraries.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import tensorflow as tf
#Tranfer learning from tensorhub
import tensorflow_hub as hub

from sklearn.model_selection import train_test_split

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

import re
import string

# **[Optional] Copy source data files**

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!mkdir data
!cp '/content/drive/MyDrive/Data Science & Machine Learning/Tensorflow Certification/Repository/Tensorflow_projects/Data/DisasterTweets dataset/train.csv' './data/'
!cp '/content/drive/MyDrive/Data Science & Machine Learning/Tensorflow Certification/Repository/Tensorflow_projects/Data/DisasterTweets dataset/test.csv' './data/'
!cp '/content/drive/MyDrive/Data Science & Machine Learning/Tensorflow Certification/Repository/Tensorflow_projects/Data/DisasterTweets dataset/test.csv' './data/'

# Load & Transform the data

In [4]:
train_tweet= pd.read_csv('./data/train.csv')
test_tweet=pd.read_csv('./data/test.csv')

train_tweet.head(3)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1


We are going to use only the columns *text* and *target* but it's not necessary to drop the other columns, I want to mantain the notebook as simple as possible.

In [6]:
calm, disaster = train_tweet['target'].value_counts()
calm, disaster

(4342, 3271)

In [None]:
fig = plt.figure(figsize=(5,5))
labels = 'Calm', 'Disaster'
sizes = [calm, disaster]
plt.pie(sizes, labels=labels, autopct='%0.0f%%',
        shadow=True, startangle=90)
plt.axis('equal')
plt.show()

As we can see the datase is well balanced. That's all the EDA I'm going to do.
# **Functions to tranform the sentences**

In [7]:
#Remove special Tweets characters.
def remove_marks(sentence):
    sentence_r = re.sub(r'^RT[\s]+', '', sentence)
    sentence_r = re.sub(r'https?://[^\s\n\r]+', '', sentence_r)
    sentence_r = re.sub(r'#', '', sentence_r)

    return sentence_r

In [8]:
#Transform each sentence in a list of words.
TOKENIZER = TweetTokenizer(preserve_case=False, strip_handles=True,
                              reduce_len=True)

def tokenize(sentence):
    sentence_tokens = TOKENIZER.tokenize(sentence)
    return sentence_tokens

In [9]:
#remove stopwords and punctuation
nltk.download('stopwords')
STOPWORDS = stopwords.words('english')

def remove_stop_words(sentence_tokenized):
    no_stop_words = []
    for word in sentence_tokenized:
        if ((word not in STOPWORDS) and (word not in string.punctuation)):
            no_stop_words.append(word)
    return no_stop_words

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [10]:
#Convert list of words in a sentence again.
def words_to_sentence(list_words):
    sentence = ''
    for word in list_words:
        sentence = sentence + word + ' '
    return sentence

In [11]:
#Call this function to fully transform a sentence.
def prepare_sentence(sentence):
    sentence_nomarks = remove_marks(sentence)
    sentence_tokenized = tokenize(sentence_nomarks)
    sentence_ready = remove_stop_words(sentence_tokenized)
    sentence_treated = words_to_sentence(sentence_ready)
    return sentence_nomarks

In [13]:
#testing with one tweet
tweet = train_tweet['text'][1234]
tweet

'Fire destroys two buildings on 2nd Street in #Manchester http://t.co/Tqh5amoknd'

In [14]:
tweet_ready = prepare_sentence(tweet)
print(tweet_ready)

Fire destroys two buildings on 2nd Street in Manchester 


In [15]:
#Transform all tweets
train_tweet['text'] = train_tweet.apply(lambda row : prepare_sentence(row['text']), axis=1)
train_tweet.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this earthquake Ma...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive wildfires evacuation ord...",1
4,7,,,Just got sent this photo from Ruby Alaska as s...,1


# **Training and Validation set split**

In [17]:
#Split twets and targets in train and validation. 20% in validation.
tweets = train_tweet['text']
targets = train_tweet['target']
train_tweets , val_tweets , train_targets, val_targets = train_test_split(tweets.to_numpy(),
                                                                          targets.to_numpy(),
                                                                          test_size=0.2, random_state=42)
print(train_tweets[1:10])

['@ZachZaidman @670TheScore wld b a shame if that golf cart became engulfed in flames. boycottBears'
 "Tell @BarackObama to rescind medals of 'honor' given to US soldiers at the Massacre of Wounded Knee. SIGN NOW &amp; RT! "
 'Worried about how the CA drought might affect you? Extreme Weather: Does it Dampen Our Economy? '
 '@YoungHeroesID Lava Blast &amp; Power Red PantherAttack @JamilAzzaini @alifaditha'
 "Wreckage 'Conclusively Confirmed' as From MH370: Malaysia PM: Investigators and the families of those who were... "
 'Our builder is having a dental emergency. Which has ruined my plan to emotionally blackmail him this afternoon with my bump.'
 'BMX issues Areal Flood Advisory for Shelby [AL] till Aug 5 9:00 PM CDT '
 "360WiseNews : China's Stock Market Crash: Are There Gems In The Rubble? "
 "@RobertONeill31 Getting hit by a foul ball while sitting there is hardly a freak accident. It's a war zone."]


# **Callbacks**

In [None]:
#just clean the session recomendable if we execute some times the model.
tf.keras.backend.clear_session()

#This callback saves the best model based in val_accuracy
BestModel = tf.keras.callbacks.ModelCheckpoint('modelDA.h5',
                                           mode='max', monitor='val_accuracy',
                                           verbose=1,
                                           save_best_only=True)

ES = keras.callbacks.EarlyStopping(
    monitor="val_accuracy",
    patience=5,
    verbose=1,
    mode="auto"
)

# **Import Universal Sentence Encoder from Tensorflow Hub**

Time to import the model from TensorFlow Hub.

I'm indicating in the *trainable* parameter that I don't want to train the model, I preffer to use the weigths of the model.

But if you want, you can test importing the model with *trainable* to **True**. You will see that the train process is much more slower, and i think that we are not going to get any advantatge.

It's a really big model and we have just a few data, for sure that we will get a lot of overfitting if we try to train the model with our tweets. But... try it!

In [16]:
model_hub_path = "https://tfhub.dev/google/universal-sentence-encoder/4"
use_layer = hub.KerasLayer(model_hub_path,
                                        input_shape=[], # shape of inputs coming to our model
                                        dtype=tf.string, # data type of inputs coming to the USE layer
                                        trainable=False, # keep the pretrained weights.
                                        name="USE")

## Join the Pretrained Model with our Specific Layers

Now we can create a sequential Model and indicate that the first layer is the model downloaded.

The rest of the layers are the trainable layer that we can train with our Data.

I decided to use 2 Dense layers with 128 Nodes and 2 dropoout layers in order to reduce the overfitting.

In [None]:
modeltl0 = tf.keras.Sequential()
modeltl0.add(use_layer)
modeltl0.add(tf.keras.layers.Dropout(0.4))
modeltl0.add(tf.keras.layers.Dense(128,activation='relu'))
modeltl0.add(tf.keras.layers.Dropout(0.4))
modeltl0.add(tf.keras.layers.Dense(128,activation='relu'))
modeltl0.add(tf.keras.layers.Dense(1, activation ='sigmoid'))
modeltl0.compile(loss= "binary_crossentropy", optimizer = 'adam',metrics=["accuracy"])
modeltl0.summary()

As you can see in the summary we have a total of **256,880,129** parameters but only **82,305** of the are trainables.


In [None]:
historytl0 = modeltl0.fit(train_tweets, train_targets, epochs=15,
                          validation_data=(val_tweets, val_targets),
                         callbacks=[SaveModel, ReduceLR])

I use this function *plot_los_acc* to print the loss and accuracy curve of the models after training.

You can pass in the parameter *firstepoch* the epoch where the plot starts. Is useful in long training because sometimes is dificult to aprrecite the curve of the plot because of the big improvements in the first epochs.

In [None]:
#prints loss and accuracy curves. You can indicate first epoch to plot.
def plot_loss_acc(history, firstepoch=0):
  '''Plots the training and validation loss and accuracy from a history object'''
  acc = history.history['accuracy']
  acc = acc[firstepoch:]
  val_acc = history.history['val_accuracy']

  val_acc = val_acc[firstepoch:]
  loss = history.history['loss']
  loss=loss[firstepoch:]
  val_loss = history.history['val_loss']
  val_loss = val_loss[firstepoch:]

  epochs = range(len(acc))

  plt.plot(epochs, acc, 'bo-', label='Training accuracy')
  plt.plot(epochs, val_acc, 'go-', label='Validation accuracy')
  plt.title('Training and validation accuracy')
  plt.legend()

  plt.figure()

  plt.plot(epochs, loss, 'bo-', label='Training Loss')
  plt.plot(epochs, val_loss, 'go-', label='Validation Loss')
  plt.title('Training and validation loss')
  plt.legend()

  plt.show()

In [None]:
plot_loss_acc(historytl0)

The best result, with this combnination of Dense and Dropout layers is obtained in the first epochs. The best model is saved in the file *BMTLTweet.h5*  thanks to one  of the callbacks functions.

### Getting predictions & send results

In [None]:
#Load the best model stored
bestModel = tf.keras.models.load_model('BMTLTweet.h5', custom_objects={'KerasLayer': hub.KerasLayer})

In [None]:
#check the test tweets.
test['text'][350]

In [None]:
#Transform each test tweet and check the result
testweets = test.apply(lambda row : prepare_sentence(row['text']), axis=1)
testweets[350]

In [None]:
results = bestModel.predict(testweets)

In [None]:
#Create a Dataframe with the 'id' and 'target' columns
final = pd.DataFrame()
final['id'] = test['id']
final.head()
final['target'] = results
final['target'] = final['target'].apply(lambda x:1 if x>0.5 else 0)
final.head()

In [None]:
#file to submit
final.to_csv('./submission.csv', index=False)

# Conclusions, Fork & improve.
Yeah! The score of this notebook is 0.813.... not bad at all! Is realy good.

We did a little effort to obtain this score, is a simple notebook, where we just did a simple transformation of the tweets (but I think that we can just don't do it). We imported a amazing model from tensorflow hub, add some really simple layers... and that's all!!!!!

If you want you can Fork and try to improve this score, you can do a lot of research and experiments.

- Don't transform the tweets.
- Maybe don't use stop words.
- Change the layers. Why two dense layers? Try with One. Reduce or increase the nodes (try with just two layers of 8 nodes).
- Reduce, Increase the dropout.
- Import the model as Trainable, and train it with the tweets.
- Change the Callback functio to monitorize the Val_loss instead val_accuracy.

Please, if you improve the results, let me know in the comments! I will be really happy to discuss the results, and of course upvote your notebook.

If you like the notebook, considere to **upvote it!** I will be really happy and it encourage me a lot to share more notebooks.

# Inspirations.
For sure that there must be a lot of similar notebooks in this competition, but my main inspiration comes from TensorFlow Developer Specialization that I coursed in Coursera.

Now I'm doing a serie of notebooks related to NLP. This is the second one, if you are interested in more notebooks about NLP, with more advanced techniques, just consider to follow me, or check my profile.

May the Transfer Learning be with you.
