<a href="https://colab.research.google.com/github/terrysu64/Machine-Learning-and-Data-Science-Projects/blob/main/introduction_to_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Fundamentals

In [1]:
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py

--2022-08-17 16:01:16--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py’


2022-08-17 16:01:17 (107 MB/s) - ‘helper_functions.py’ saved [10246/10246]



In [2]:
from helper_functions import create_tensorboard_callback, plot_loss_curves, unzip_data, walk_through_dir

### import dataset
kaggle: https://www.kaggle.com/competitions/nlp-getting-started

In [3]:
!wget https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip

--2022-08-17 16:01:20--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.10.128, 142.251.12.128, 172.217.194.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.10.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2022-08-17 16:01:20 (121 MB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



In [4]:
unzip_data('nlp_getting_started.zip')

### visualize text dataset

In [5]:
import pandas as pd
train_df = pd.read_csv('train.csv').sample(frac=1,
                                           random_state=42)
test_df = pd.read_csv('test.csv').sample(frac=1,
                                         random_state=42)
train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 2644 | 3796 | destruction | NaN | So you have a new weapon that can cause un-ima... | 1 |
| 2227 | 3185 | deluge | NaN | The f$&amp;@ing things I do for #GISHWHES Just... | 0 |
| 5448 | 7769 | police | UK | DT @georgegalloway: RT @Galloway4Mayor: ÛÏThe... | 1 |
| 132 | 191 | aftershock | NaN | Aftershock back to school kick off was great. ... | 0 |
| 6845 | 9810 | trauma | Montgomery County, MD | in response to trauma Children of Addicts deve... | 0 |


In [6]:
train_df.target.value_counts()

0    4342
1    3271
Name: target, dtype: int64
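
The counts above show a mild class imbalance. As a rough sketch of the arithmetic (variable names are illustrative, not from the notebook), the share of each class and the accuracy of a trivial "always predict the majority class" model can be worked out directly from the printed numbers:

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4342, 1: 3271}
total = sum(counts.values())  # number of rows in the training data

# Share of each class in the training data.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most common class.
majority_baseline = max(shares.values())

print(shares)
print(round(majority_baseline, 3))
```

Any real model should beat this majority-class figure to be considered useful.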

In [7]:
len(train_df), len(test_df)

(7613, 3263)

In [8]:
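from random import randint
for _ in range(3):
  i = randint(0, len(train_df) - 1) #random row position (randint includes both ends)
  row = train_df.iloc[i] #positional lookup, independent of the shuffled index labels
  print(f"{'disaster' if row['target'] else 'not disaster'}: {row['text']}")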

not disaster: 'If you go on with this nuclear arms race all you are going to do is make the rubble bounce.' ? Winston Churchill
disaster: @SenateMajLdr let's try to do our best to prevent another outbreak of violence by talking to each other both the people and the politics


### split data

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(train_df.text.to_numpy(),
                                                  train_df.target.to_numpy(),
                                                  test_size=0.1,
                                                  random_state=42)
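
A quick sanity check on the split sizes, assuming scikit-learn rounds the validation count up and gives the remainder to training when only `test_size` is passed (variable names here are illustrative):

```python
import math

n_samples = 7613   # len(train_df), from the output above
test_size = 0.1    # same fraction passed to train_test_split

# Assumption: the held-out count is rounded up, the rest goes to training.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```

In the notebook itself, `len(X_train), len(X_val)` would confirm the actual sizes.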

In [10]:
train_df.head().to_numpy() #changes df => numpy arr

array([[3796, 'destruction', nan,
        'So you have a new weapon that can cause un-imaginable destruction.',
        1],
       [3185, 'deluge', nan,
        'The f$&amp;@ing things I do for #GISHWHES Just got soaked in a deluge going for pads and tampons. Thx @mishacollins @/@',
        0],
       [7769, 'police', 'UK',
        'DT @georgegalloway: RT @Galloway4Mayor: \x89ÛÏThe CoL police can catch a pickpocket in Liverpool Stree... http://t.co/vXIn1gOq4Q',
        1],
       [191, 'aftershock', nan,
        'Aftershock back to school kick off was great. I want to thank everyone for making it possible. What a great night.',
        0],
       [9810, 'trauma', 'Montgomery County, MD',
        'in response to trauma Children of Addicts develop a defensive self - one that decreases vulnerability. (3',
        0]], dtype=object)

### converting text => numbers

In [11]:
#tokenization: map each word to a specific integer
#embedding: a dense vector per token that represents relationships between tokens (size can be limited + values can be learned)

import tensorflow as tf
from tensorflow.keras.layers import TextVectorization #recent TF versions; older releases expose it under layers.experimental.preprocessing

MAX_VOCAB=10000
MAX_LENGTH=15
vectorizer = TextVectorization(max_tokens=MAX_VOCAB, #max vocab size (includes the padding and [UNK] tokens)
                               standardize="lower_and_strip_punctuation",
                               split="whitespace",
                               ngrams=None, #create groups of n words
                               output_mode="int",
                               output_sequence_length=MAX_LENGTH, #pad or truncate each tweet to this many tokens
                               pad_to_max_tokens=True #only applies to multi_hot/count/tf_idf modes; no effect with output_mode="int"
)

In [12]:
vectorizer.adapt(X_train)

In [13]:
vectorizer(['hi there buddy'])

<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[1663,   74,    1,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0]])>

In [14]:
vocab = vectorizer.get_vocabulary()
vocab[:5] #first 5 entries: '' (padding), '[UNK]' (out-of-vocab), then words by descending frequency

['', '[UNK]', 'the', 'a', 'in']
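
To make the word-to-integer mapping concrete, here is a pure-Python sketch of what an integer-mode vectorizer does conceptually: lowercase, strip punctuation, split on whitespace, rank words by frequency, reserve index 0 for padding and 1 for unknown words, then pad or truncate to a fixed length. The function names (`tokenize`, `build_vocab`, `encode`) and the toy corpus are hypothetical, and this is not the Keras implementation.

```python
import string
from collections import Counter

PAD, UNK = "", "[UNK]"  # index 0 = padding, index 1 = out-of-vocabulary

def tokenize(text):
    # Lowercase, drop ASCII punctuation, split on whitespace.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_vocab(texts, max_tokens):
    # Most frequent words first; two slots are reserved for PAD and UNK.
    counts = Counter(tok for t in texts for tok in tokenize(t))
    words = [w for w, _ in counts.most_common(max_tokens - 2)]
    return [PAD, UNK] + words

def encode(text, vocab, length):
    # Map each word to its index, unknown words to 1, then truncate and pad.
    index = {w: i for i, w in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in tokenize(text)][:length]
    return ids + [0] * (length - len(ids))

corpus = ["The flood hit the town", "The town is safe", "Flood warning!"]  # toy data
vocab = build_vocab(corpus, max_tokens=10)
encoded = encode("the flood hit buddy", vocab, length=6)
print(vocab[:3], encoded)
```

As in the `'hi there buddy'` output above, a word missing from the vocabulary maps to 1 and unused positions are filled with 0.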

### creating embedding layer

In [15]:
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=MAX_VOCAB,
                             output_dim=128, #embedding size; multiples of 8 tend to run efficiently on hardware
                             input_length=MAX_LENGTH)

embedding(vectorizer("hi there buddy"))

<tf.Tensor: shape=(15, 128), dtype=float32, numpy=
array([[ 0.00095879,  0.03337276, -0.00845902, ..., -0.02276065,
        -0.02788907,  0.04178109],
       [ 0.0124415 , -0.01177146, -0.03444948, ..., -0.01477469,
        -0.02455767, -0.02829943],
       [ 0.03301611, -0.00591717, -0.01964932, ..., -0.02976345,
         0.02890769, -0.0352829 ],
       ...,
       [ 0.00279107, -0.00278381,  0.0381883 , ...,  0.01146875,
         0.0477636 , -0.03353547],
       [ 0.00279107, -0.00278381,  0.0381883 , ...,  0.01146875,
         0.0477636 , -0.03353547],
       [ 0.00279107, -0.00278381,  0.0381883 , ...,  0.01146875,
         0.0477636 , -0.03353547]], dtype=float32)>
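
Conceptually, an embedding layer is a lookup table with one row per token id. A minimal NumPy sketch of that idea follows; the sizes, the random initialisation range and the token ids are assumptions chosen for illustration, not values from the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, embed_dim = 10, 4  # toy sizes, not the notebook's 10000 x 128

# The layer's weights: one row per token id, random before any training.
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

token_ids = np.array([7, 3, 1, 0, 0, 0])  # hypothetical padded sequence
vectors = table[token_ids]                # row lookup, shape (6, 4)

print(vectors.shape)
# Padding positions share id 0, so they all map to the same row.
print(np.array_equal(vectors[3], vectors[5]))
```

This is why the trailing rows in the output above are identical: every padded position looks up the row for token id 0.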

## building models for the dataset
### multinomial naive bayes using TF-IDF features

note: it is common to try a non-deep-learning baseline first

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB())
])

model.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
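
For intuition about the TF-IDF step in the pipeline above, here is a minimal pure-Python sketch on a toy corpus. It assumes the smoothed variant that `TfidfVectorizer` uses by default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row; the function name `tfidf` and the documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts
        raw = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})  # L2-normalise
    return rows

docs = [["fire", "in", "the", "forest"],
        ["the", "game", "was", "fire"],
        ["the", "flood"]]
weights = tfidf(docs)
# "the" appears in every document, so it gets the lowest weight in each row.
print(min(weights[0], key=weights[0].get))
```

Words that appear in few documents end up with larger weights, which is what lets the naive Bayes classifier focus on distinctive terms.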

In [17]:
baseline_score = model.score(X_val, y_val)
baseline_score

0.7926509186351706

In [18]:
model.predict(X_val)[:5], y_val[:5]

(array([1, 1, 1, 0, 0]), array([0, 0, 1, 1, 1]))
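
For a classifier, `score` reports plain accuracy: the fraction of predictions that equal the labels. As a small worked check on the five samples printed above (variable names are illustrative):

```python
preds = [1, 1, 1, 0, 0]   # model.predict(X_val)[:5], copied from the output above
labels = [0, 0, 1, 1, 1]  # y_val[:5]

# Accuracy = number of matching positions / number of samples.
correct = sum(p == y for p, y in zip(preds, labels))
accuracy = correct / len(labels)
print(correct, accuracy)
```

These five samples happen to be much harder than average; over the full validation set the same calculation gives the baseline score shown earlier.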

In [25]:
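from sklearn.metrics import classification_report

#signature is classification_report(y_true, y_pred); the output recorded below was produced with the
#arguments reversed, so its precision and recall columns are swapped
report = classification_report(y_val,
                      model.predict(X_val))
report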

'              precision    recall  f1-score   support\n\n           0       0.93      0.75      0.83       516\n           1       0.63      0.89      0.73       246\n\n    accuracy                           0.79       762\n   macro avg       0.78      0.82      0.78       762\nweighted avg       0.83      0.79      0.80       762\n'

In [27]:
#               precision    recall  f1-score   support

#            0       0.93      0.75      0.83       516
#            1       0.63      0.89      0.73       246

#     accuracy                           0.79       762
#    macro avg       0.78      0.82      0.78       762
# weighted avg       0.83      0.79      0.80       762
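
The per-class columns in a report like this come from three counts: true positives, false positives and false negatives. A minimal sketch of the formulas, using made-up counts rather than values from the notebook (the function name `prf` is hypothetical):

```python
def prf(tp, fp, fn):
    # Precision: of everything predicted positive, how much was right.
    precision = tp / (tp + fp)
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn)
    # F1: harmonic mean of the two.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for the "disaster" class, not taken from the notebook.
tp, fp, fn = 80, 10, 30
precision, recall, f1 = prf(tp, fp, fn)
print(round(precision, 2), round(recall, 2), round(f1, 2))

# Swapping the roles of labels and predictions swaps fp and fn,
# which exchanges precision and recall but leaves F1 unchanged.
swapped = prf(tp, fn, fp)
```

This is also why argument order matters when calling a metrics function: passing predictions where labels are expected exchanges the precision and recall columns.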


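### model 1: RNN (LSTM)
RNN: connections between nodes form a directed or undirected graph along a temporal sequence, so earlier tokens can influence how later ones are processed. LSTM is an RNN variant with gates that help it keep information across longer sequences.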
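
The recurrence itself can be sketched in NumPy with a plain (non-LSTM) recurrent cell, where the hidden state from one step feeds into the next. All sizes, weights and names below are assumptions for illustration; an LSTM adds gating on top of the same loop.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim, hidden_dim = 15, 8, 4  # toy sizes chosen for illustration

x = rng.normal(size=(seq_len, embed_dim))              # one embedded sequence
W_x = rng.normal(size=(embed_dim, hidden_dim)) * 0.1   # input -> hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden -> hidden weights
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)  # initial hidden state
for x_t in x:
    # The same weights are reused at every time step; h carries context forward.
    h = np.tanh(x_t @ W_x + h @ W_h + b)

print(h.shape)  # final state summarising the whole sequence
```

The final hidden state is what a classifier head would typically consume to predict disaster versus not disaster.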