<a href="https://colab.research.google.com/github/surajsarkar/deepLearning/blob/main/notebooks/08_tensorflow_course_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 08 TensorFlow Exercise: NLP

1. Rebuild, compile and train model_1, model_2 and model_5 using the Keras Sequential API instead of the Functional API.
2. Retrain the baseline model with 10% of the training data. How does it perform compared to the Universal Sentence Encoder model with 10% of the training data?
3. Try fine-tuning the TF Hub Universal Sentence Encoder model by setting trainable=True when instantiating it as a Keras layer.



```python
# We can use this encoding layer in place of our text_vectorizer and embedding layer
sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        input_shape=[],
                                        dtype=tf.string,
                                        trainable=True) # turn training on to fine-tune the TensorFlow Hub model
```

4. Retrain the best model you've got so far on the whole training set (no validation split). Then use this trained model to make predictions on the test dataset and format the predictions to match the sample_submission.csv file from Kaggle (see the Files tab in Colab for what sample_submission.csv looks like). Once you've done this, make a submission to the Kaggle competition. How did your model perform?

5. Combine the ensemble predictions using the majority vote (mode). How does this perform compared to averaging the prediction probabilities of each model?

6. Make a confusion matrix with the best performing model's predictions on the validation set and the validation ground truth labels.
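Exercise 5 contrasts two ways of combining an ensemble. The sketch below is a minimal pure-Python illustration of how they can disagree. The three "models" and their probabilities are made up for illustration and are not outputs of the models in this notebook.

```python
# Hypothetical prediction probabilities from three models for four tweets
# (illustrative values only, not results from this notebook).
model_probs = [
    [0.90, 0.40, 0.55, 0.10],  # model A
    [0.80, 0.45, 0.40, 0.20],  # model B
    [0.30, 0.95, 0.45, 0.30],  # model C
]

def average_vote(probs_per_model, threshold=0.5):
    """Average the probabilities per sample, then threshold once."""
    n_models = len(probs_per_model)
    means = [sum(col) / n_models for col in zip(*probs_per_model)]
    return [int(m > threshold) for m in means]

def majority_vote(probs_per_model, threshold=0.5):
    """Threshold each model separately, then keep the label most models agree on."""
    labels = [[int(p > threshold) for p in probs] for probs in probs_per_model]
    n_models = len(labels)
    return [int(sum(col) * 2 > n_models) for col in zip(*labels)]

mean_preds = average_vote(model_probs)
mode_preds = majority_vote(model_probs)
print(mean_preds, mode_preds)
```

The second tweet shows the difference: one very confident model pulls the average above 0.5, while the majority vote follows the two models that lean the other way.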



* **model_1** : Dense layer model (rebuilt below as `model_0`)
* **model_2** : LSTM model (rebuilt below as `model_1`)
* **model_5** : Conv1D model (rebuilt below as `model_2`)

For an **NLP** model we need a **text vectorizer** and an embedding layer.

## Get the helper script

In [None]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2022-07-16 05:35:55--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2022-07-16 05:35:55 (65.3 MB/s) - ‘helper_functions.py’ saved [10246/10246]



In [None]:
from helper_functions import unzip_data

## Get the data

In [None]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
unzip_data("nlp_getting_started.zip")

--2022-07-16 05:35:56--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.68.128, 142.250.4.128, 74.125.24.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.68.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2022-07-16 05:35:56 (97.7 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



## Becoming one with the data

In [None]:
import pandas as pd
train_df = pd.read_csv("train.csv")
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [None]:
shuffled_train_data = train_df.sample(frac=1, random_state=42)
shuffled_train_data.head()

Unnamed: 0,id,keyword,location,text,target
2644,3796,destruction,,So you have a new weapon that can cause un-ima...,1
2227,3185,deluge,,The f$&amp;@ing things I do for #GISHWHES Just...,0
5448,7769,police,UK,DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...,1
132,191,aftershock,,Aftershock back to school kick off was great. ...,0
6845,9810,trauma,"Montgomery County, MD",in response to trauma Children of Addicts deve...,0


In [None]:
X, y = shuffled_train_data["text"], shuffled_train_data["target"]
X[:5]

2644    So you have a new weapon that can cause un-ima...
2227    The f$&amp;@ing things I do for #GISHWHES Just...
5448    DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe...
132     Aftershock back to school kick off was great. ...
6845    in response to trauma Children of Addicts deve...
Name: text, dtype: object

In [None]:
# Splitting the data into training and validation sets (80/20)
from sklearn.model_selection import train_test_split
train_sentences, val_sentences, train_labels, val_labels = train_test_split(
    X,
    y,
    test_size = 0.2,
    random_state=42
)

In [None]:
train_sentences.shape

(6090,)
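The 6090 above follows from how scikit-learn sizes a fractional split: the test share is rounded up and the remainder goes to training. A quick check, assuming `train.csv` has 7613 rows (that count is an assumption here, since the notebook never prints it):

```python
import math

n_samples = 7613   # assumed row count of train.csv (not shown in this notebook)
test_size = 0.2

n_val = math.ceil(test_size * n_samples)  # the test/validation share is rounded up
n_train = n_samples - n_val               # the rest is used for training
print(n_train, n_val)
```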

In [None]:
# vectorization layer
import tensorflow as tf
text_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens = 10000,
    standardize = "lower_and_strip_punctuation",
    split = "whitespace",
    ngrams = None,
    output_mode = "int",
    output_sequence_length = 15,
    pad_to_max_tokens = True, # per the Keras docs, only relevant for the "multi_hot", "count" and "tf_idf" output modes
)

In [None]:
text_vectorizer.adapt(train_sentences)
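To make the vectorization step concrete, here is a rough pure-Python stand-in for what the layer does with these settings: lowercase, strip punctuation, split on whitespace, map tokens to integer ids, then truncate or pad to a fixed length. The helper names (`standardize`, `build_vocab`, `vectorize`) are hypothetical, and the real layer's punctuation rules and tie-breaking may differ. The reserved ids (0 for padding, 1 for out-of-vocabulary tokens) follow the Keras convention for `output_mode="int"`.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip ASCII punctuation (a rough stand-in for the Keras option)."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

def build_vocab(sentences, max_tokens):
    """Reserve id 0 for padding and id 1 for unknown tokens; rank the rest by frequency."""
    counts = Counter(tok for s in sentences for tok in standardize(s).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}

def vectorize(text, vocab, sequence_length):
    """Map tokens to ids, then truncate or zero-pad to sequence_length."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

toy_corpus = ["Forest fire near La Ronge!", "Fire, fire everywhere."]
toy_vocab = build_vocab(toy_corpus, max_tokens=10)
toy_ids = vectorize("Fire near the FOREST", toy_vocab, sequence_length=6)
print(toy_ids)
```

Here "the" never appeared in the toy corpus, so it maps to the unknown id 1, and the two trailing zeros are padding.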

## Embedding layer

In [None]:
embedder = tf.keras.layers.Embedding(
    input_dim = 10000, # size of the vocabulary (matches max_tokens)
    output_dim = 128, # dimensionality of each embedding vector
    input_length=15 # length of each sequence (matches output_sequence_length)
)

## Experiments

### Experiment 1
**Dense layer sequential model**

In [None]:
import tensorflow as tf

# 1. Create a model
model_0 = tf.keras.Sequential(
    [
     tf.keras.layers.Input(shape=(1,), dtype=tf.string, name="input_layer"),
     text_vectorizer,
     embedder,
     tf.keras.layers.GlobalMaxPooling1D(name="pooling"),
     tf.keras.layers.Dense(1, activation="sigmoid")
    ]
)

# 2. Compile a model 
model_0.compile(
    loss = "binary_crossentropy",
    optimizer=tf.keras.optimizers.Adam(),
    metrics=["accuracy"]
)

# 3. Fit the model (done in a later cell, after inspecting the summary)

In [None]:
model_0.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, 15)               0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 pooling (GlobalMaxPooling1D  (None, 128)              0         
 )                                                               
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1,280,129
Trainable params: 1,280,129
Non-trainable params: 0
_________________________________________________________________
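The shapes in the summary can be traced by hand. The sketch below uses tiny stand-in sizes (6 tokens, 4 dimensions, 5 positions) rather than the real 10000, 128 and 15. It treats the embedding as a plain lookup table and global max pooling as a per-feature maximum over the sequence. The random table is illustrative only, not the trained weights.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 4  # tiny stand-ins for 10000 and 128

# An embedding layer is a lookup table with one row per token id.
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

token_ids = [2, 4, 1, 0, 0]                          # one padded "sentence"
embedded = [embedding_table[i] for i in token_ids]   # shape (5, 4)

# Global max pooling keeps the largest value of each feature across the sequence.
pooled = [max(column) for column in zip(*embedded)]  # shape (4,)

# Parameter counts for the real layer sizes used in this notebook.
real_embedding_params = 10000 * 128   # one 128-d vector per token id
real_dense_params = 128 * 1 + 1       # 128 weights plus 1 bias
print(len(embedded), len(pooled), real_embedding_params + real_dense_params)
```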


In [None]:
history_0 = model_0.fit(
    x=train_sentences,
    y=train_labels,
    epochs=5,
    validation_data=(val_sentences, val_labels)
    )

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
embedder(text_vectorizer(["This is a hero"]))[0][0]

<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([-0.02255533, -0.04826234, -0.03311062, -0.02402052, -0.0402505 ,
       -0.01451021, -0.02572479, -0.02905092, -0.01954997, -0.03289918,
       -0.03207925, -0.04809026, -0.03501017, -0.0493761 , -0.02311846,
       -0.01927548, -0.05557725, -0.02816477, -0.04351018, -0.04047216,
       -0.04776376, -0.03119099, -0.02405515, -0.0446592 , -0.03630833,
       -0.05611802, -0.00722387, -0.03057517, -0.0381459 , -0.0137607 ,
       -0.04814136, -0.03556888, -0.04802714, -0.04936073, -0.01896391,
       -0.02120782, -0.04804309, -0.02925539, -0.03638957, -0.01842359,
       -0.0593348 , -0.02599946, -0.03856061, -0.04607638, -0.04641395,
       -0.04466487, -0.02501427,  0.01630779, -0.03373942, -0.02113366,
       -0.01298078, -0.03300235, -0.02011319, -0.02341105, -0.04314491,
       -0.00884046, -0.03512947, -0.00707153, -0.04609659, -0.04954324,
       -0.02706048, -0.03650626, -0.03620201, -0.04992064, -0.02435609,
       -0.011869

### Experiment 2

**LSTM model**

In [None]:
import tensorflow as tf

model_1 = tf.keras.Sequential([
  tf.keras.layers.Input(shape=(1,), dtype=tf.string),
  text_vectorizer,
  embedder,  # same Embedding object as in model_0, so the weights trained there carry over
  tf.keras.layers.LSTM(units=128, activation="tanh", return_sequences=False),
  tf.keras.layers.Dense(1, activation="sigmoid", name="output_layer")
])

In [None]:
model_1.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, 15)               0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 lstm (LSTM)                 (None, 128)               131584    
                                                                 
 output_layer (Dense)        (None, 1)                 129       
                                                                 
Total params: 1,411,713
Trainable params: 1,411,713
Non-trainable params: 0
_________________________________________________________________
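The LSTM parameter count in the summary can be reproduced from the standard formula: four gates, each with an input kernel, a recurrent kernel and a bias. A short check using the layer sizes from this notebook:

```python
input_dim, units = 128, 128  # embedding size fed into the LSTM, and LSTM units

# Each of the 4 gates has (input_dim + units) weights per unit, plus one bias per unit.
lstm_params = 4 * (units * (input_dim + units) + units)

embedding_params = 10000 * 128
output_params = units * 1 + 1
total_params = embedding_params + lstm_params + output_params
print(lstm_params, total_params)
```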


In [None]:
# Compile the model 
model_1.compile(
    loss="binary_crossentropy",
    optimizer=tf.keras.optimizers.Adam(),
    metrics=["accuracy"]
)

# Fit the model 
history_1 = model_1.fit(
    train_sentences,
    train_labels,
    epochs=5,
    validation_data=(val_sentences, val_labels)
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Experiment 3
**Conv1D model**

In [None]:
import tensorflow as tf

# 1. Create a model 
model_2 = tf.keras.Sequential([
  tf.keras.layers.Input(shape=(1,), dtype=tf.string),
  text_vectorizer,
  embedder,  # same Embedding object as in the earlier models, so its trained weights carry over
  tf.keras.layers.Conv1D(filters=10, kernel_size=5, activation="relu"),
  tf.keras.layers.GlobalMaxPooling1D(),
  tf.keras.layers.Dense(1, activation="sigmoid")
])


In [None]:
model_2.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, 15)               0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 15, 128)           1280000   
                                                                 
 conv1d (Conv1D)             (None, 11, 10)            6410      
                                                                 
 global_max_pooling1d (Globa  (None, 10)               0         
 lMaxPooling1D)                                                  
                                                                 
 dense_1 (Dense)             (None, 1)                 11        
                                                                 
Total params: 1,286,421
Trainable params: 1,286,421
Non-trainable params: 0
_________________________________________________________________
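The Conv1D row of the summary can be checked the same way. With the default "valid" padding and a stride of 1, a kernel of size 5 sliding over 15 positions yields 11 outputs, and each filter has one weight per kernel position and input channel, plus a bias:

```python
sequence_length, embedding_dim = 15, 128
filters, kernel_size = 10, 5

conv_output_length = sequence_length - kernel_size + 1         # "valid" padding, stride 1
conv_params = kernel_size * embedding_dim * filters + filters  # kernel weights + biases
dense_params = filters * 1 + 1                                 # after global max pooling

total_params = 10000 * 128 + conv_params + dense_params
print(conv_output_length, conv_params, total_params)
```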

In [None]:
# 2. Compile the model 
model_2.compile(
    loss="binary_crossentropy",
    optimizer=tf.keras.optimizers.Adam(), 
    metrics=["accuracy"]
)

# 3. Fit the model
history_2 = model_2.fit(
    train_sentences,
    train_labels,
    epochs=5,
    validation_data=(val_sentences, val_labels)
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


###### Retrain the baseline model with 10% of the training data. How does it perform compared to the Universal Sentence Encoder model with 10% of the training data?

In [None]:
# Getting 10 percent data
ten_percent = int(len(train_sentences)*0.1)

train_10_percent_sentences = train_sentences[:ten_percent]
train_10_percent_labels = train_labels[:ten_percent]

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

baseline_model = Pipeline([
  ("tfidf", TfidfVectorizer()),
  ("clf", MultinomialNB())
])

baseline_model.fit(train_10_percent_sentences, train_10_percent_labels)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

In [None]:
basemodel_evaluation = baseline_model.score(val_sentences, val_labels)
basemodel_evaluation

0.7603414313854235

Try fine-tuning the TF Hub Universal Sentence Encoder model by setting trainable=True when instantiating it as a Keras layer.

In [None]:
import tensorflow_hub as hub

use_embedder = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder/4",
    input_shape=[],
    trainable=True,
    dtype=tf.string,
    name="use"
)

# Create the model 

model_3 = tf.keras.Sequential([
  use_embedder,
  tf.keras.layers.Dense(1, activation="sigmoid", name="output_layer")
])

# Compile a model 
model_3.compile(
    loss="binary_crossentropy",
    optimizer=tf.keras.optimizers.Adam(),
    metrics=["accuracy"]
)

# Fit the model 
history_3 = model_3.fit(
    train_sentences,
    train_labels,
    epochs=5,
    validation_data=(val_sentences, val_labels)
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
# hub.KerasLayer does not expose a `.layers` attribute, so inspect the model's layers instead
for layer in model_3.layers:
  print(layer.name, layer.trainable)

In [None]:
model_3.layers

[<tensorflow_hub.keras_layer.KerasLayer at 0x7ff503e78d10>,
 <keras.layers.core.dense.Dense at 0x7ff503e78d50>]

In [None]:
baseline_model.score(val_sentences, val_labels)

0.7603414313854235

In [None]:
model_0.evaluate(val_sentences, val_labels)



[0.5387312173843384, 0.7839789986610413]

In [None]:
model_1.evaluate(val_sentences, val_labels)



[1.3892662525177002, 0.7255417108535767]

In [None]:
model_2.evaluate(val_sentences, val_labels)



[0.9155717492103577, 0.743926465511322]

In [None]:
model_3.evaluate(val_sentences, val_labels)



[0.6557867527008057, 0.7892317771911621]
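Exercise 6 asks for a confusion matrix of the best model's validation predictions. A minimal pure-Python sketch of a 2x2 matrix follows, using the same layout as scikit-learn's `confusion_matrix` (rows are true labels, columns are predicted labels). The labels and predictions below are hypothetical. In this notebook `y_true` would be `val_labels` and `y_pred` the rounded output of the chosen model's `predict` on `val_sentences`.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true labels (0, 1); columns are predicted labels (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix_2x2(y_true, y_pred)
(tn, fp), (fn, tp) = cm
accuracy = (tp + tn) / len(y_true)
print(cm, accuracy)
```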

In [None]:
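# Retrain a USE-based model on 100% of the training data (no validation split)
# Unlike model_3, this version adds a Dense(8) hidden layer before the output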
import tensorflow_hub as hub

embedder_layer = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder/4",
    trainable=True, 
    input_shape=[],
    dtype=tf.string,
    name="universal_sentence_encoder"
)

# 1. Create a model 
model_4 = tf.keras.Sequential([
  embedder_layer,
  tf.keras.layers.Dense(8, activation="relu"),
  tf.keras.layers.Dense(1, activation="sigmoid")
])

# 2. Compile a model 
model_4.compile(
    loss="binary_crossentropy",
    optimizer=tf.keras.optimizers.Adam(),
    metrics=["accuracy"]
)

# 3. Fit the model 
history_4 = model_4.fit(
    train_df["text"],
    train_df["target"],
    epochs=5
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
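# Caution: val_sentences is a subset of train_df, which model_4 was trained on,
# so this score reflects data leakage rather than performance on unseen data
model_4.evaluate(val_sentences, val_labels)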



[0.020850183442234993, 0.9901509881019592]

In [None]:
# model_4.predict() requires input data; see the call on test_df["text"] in the following cells

In [None]:
import pandas as pd 
test_df = pd.read_csv("./test.csv")
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [None]:
test_pred_probs = model_4.predict(test_df["text"])
test_pred_probs[:10]

array([[0.98505026],
       [0.9964707 ],
       [0.9968629 ],
       [0.98466206],
       [0.997086  ],
       [0.99637324],
       [0.00225905],
       [0.00181326],
       [0.0018137 ],
       [0.0017986 ]], dtype=float32)

In [None]:
test_preds = tf.cast(tf.squeeze(tf.round(test_pred_probs)), dtype=tf.int32)
test_preds[:10]

<tf.Tensor: shape=(10,), dtype=int32, numpy=array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0], dtype=int32)>
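Rounding a sigmoid output is effectively thresholding at 0.5. The pure-Python sketch below applies that threshold to the ten probabilities printed earlier and gives the same labels. One detail worth knowing: `tf.round` rounds halves to even, so a probability of exactly 0.5 becomes 0, which matches the strict `>` used here.

```python
# First ten probabilities copied from the model_4 output shown earlier.
pred_probs = [0.98505026, 0.9964707, 0.9968629, 0.98466206, 0.997086,
              0.99637324, 0.00225905, 0.00181326, 0.0018137, 0.0017986]

threshold = 0.5
pred_labels = [int(p > threshold) for p in pred_probs]
print(pred_labels)
```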

In [None]:
test_df_with_pred = test_df  # alias, not a copy: the "prediction" column added below also appears on test_df

In [None]:
test_df_with_pred["prediction"] = test_preds.numpy()  # convert the tensor to a NumPy array before assigning

In [None]:
test_df_with_pred.sample(n=10)

Unnamed: 0,id,keyword,location,text,prediction
2946,9758,trapped,,@LauraE303B @SheilaGunnReid A war we'll never ...,0
2641,8824,sirens,,@spookyerica sleeping with sirens?,0
2968,9821,trauma,"Nashville, TN",Esteemed journalist recalls tragic effects of ...,1
550,1795,buildings%20on%20fire,TAIZ - YEMEN,#Taiz\n#Houthi #Saleh indiscriminate shelling ...,1
2096,7034,mayhem,,El Nino is getting stronger! Monster weather s...,1
456,1468,body%20bagging,,Body bagging that I think it's time to bring b...,0
1153,3801,detonate,"Amsterdam, Worldwide",Track : Apollo Brown - Detonate ft. M.O.P. | ...,0
1778,6010,hazardous,,TSA issues Hazardous Weather Outlook (HWO) htt...,1
134,425,apocalypse,Currently Somewhere On Earth,@5SOStag honestly he could say an apocalypse i...,0
1465,4862,explode,,my damn head feel like it's gone explode ??,0


In [None]:
test_df_with_pred.head()

Unnamed: 0,id,keyword,location,text,prediction
0,0,,,Just happened a terrible car crash,1
1,2,,,"Heard about #earthquake is different cities, s...",1
2,3,,,"there is a forest fire at spot pond, geese are...",1
3,9,,,Apocalypse lighting. #Spokane #wildfires,1
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan,1


In [None]:
test_df_with_pred.tail()

Unnamed: 0,id,keyword,location,text,prediction
3258,10861,,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...,1
3259,10865,,,Storm in RI worse than last hurricane. My city...,1
3260,10868,,,Green Line derailment in Chicago http://t.co/U...,1
3261,10874,,,MEG issues Hazardous Weather Outlook (HWO) htt...,1
3262,10875,,,#CityofCalgary has activated its Municipal Eme...,1


In [None]:
test_df.sample(n=10, random_state=42)

Unnamed: 0,id,keyword,location,text,prediction
2406,8051,refugees,,Refugees as citizens - The Hindu http://t.co/G...,0
134,425,apocalypse,Currently Somewhere On Earth,@5SOStag honestly he could say an apocalypse i...,0
411,1330,blown%20up,Scout Team,If you bored as shit don't nobody fuck wit you...,0
203,663,attack,,@RealTwanBrown Yesterday I Had A Heat Attack ?...,0
889,2930,danger,Leeds,The Devil Wears Prada is still one of my favou...,0
1432,4743,evacuate,San Diego,my father fucking died when the north tower co...,1
3024,9981,tsunami,Okinawa,Oh itÛªs a soccer ball? I thought it was the ...,0
2741,9129,suicide%20bomb,Nigeria,#Bestnaijamade: 16yr old PKK suicide bomber wh...,1
463,1490,body%20bags,CLEVELAND,@ComplexMag he asking for a body bags @PUSHA_T,0
291,943,blaze,"Missoula, MT",@JuneSnowpaw Yeah Gimme dat creamy white stuff ;3,0


In [None]:
test_df[["id", "prediction"]].sample(n=10, random_state=42)

Unnamed: 0,id,prediction
2406,8051,0
134,425,0
411,1330,0
203,663,0
889,2930,0
1432,4743,1
3024,9981,0
2741,9129,1
463,1490,0
291,943,0


In [None]:
# Sample whole rows (not just the text column) so each tweet stays paired with its prediction
random_test_samples = test_df[["text", "prediction"]].sample(n=10)
for text, pred in random_test_samples.itertuples(index=False):
  print(f"Text:\n{text}\nPrediction -> {'Disaster' if pred == 1 else 'Not a disaster'}\n")

In [None]:
submission_df = pd.DataFrame({"id": test_df["id"], "target": test_df["prediction"]})
submission_df.head()

Unnamed: 0,id,target
0,0,1
1,2,1
2,3,1
3,9,1
4,11,1


In [None]:
# to_csv returns None when given a path, so there is nothing useful to assign
submission_df.to_csv("submission.csv", sep=",", index=False)
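Before uploading, it can help to confirm the file has the expected `id,target` header and only 0/1 targets. The sketch below does that with the standard library on an in-memory buffer. The rows are the first five shown in `submission_df.head()` above. The exact checks are an assumption about what a valid submission needs, not an official Kaggle specification.

```python
import csv
import io

# First five (id, target) rows from submission_df.head(); the full file has one row per test tweet.
rows = [(0, 1), (2, 1), (3, 1), (9, 1), (11, 1)]

# Write the rows in the same shape as submission.csv, but to memory instead of disk.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "target"])
writer.writerows(rows)

# Read it back and check the header and that every target is 0 or 1.
buffer.seek(0)
reader = csv.DictReader(buffer)
parsed = list(reader)
header_ok = reader.fieldnames == ["id", "target"]
targets_ok = all(row["target"] in {"0", "1"} for row in parsed)
print(header_ok, targets_ok, len(parsed))
```

To run the same checks on the real file, replace the buffer with `open("submission.csv", newline="")`.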