
# **Day 9: Natural Language Processing (NLP) Project**
---

### **Description**
In today's lab, we'll use transfer learning for a text classification task. You will then review the NLP skills you have learned so far by creating a model that classifies movie reviews as positive or negative. In particular, you will apply:
* Text vectorization and **embedding**
* Neural Network models for NLP (Fully connected, CNN, or pre-trained)
* Text classification

**When using pre-trained models, GPUs can help speed things up significantly. Make sure to select "GPU" when starting your runtime in SageMaker.**

<br>

### **Lab Structure**

**Part 1**: [Text Classification of News Articles](#p1)

**Part 2**: [IMDB Sentiment Classification](#p2)



<br>

### **Goals**
By the end of this lab, you will have practiced a range of deep learning techniques for NLP, including embeddings and transfer learning.

<br>

### **Cheat Sheets**
[Natural Language Processing III](https://docs.google.com/document/d/1QuY0qdG7ICkmsOtShIpyffgYd_EAXjDm86EEzKK5t9s/edit?usp=sharing)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from fastai.text.all import *
from sklearn.datasets import fetch_20newsgroups

import warnings
warnings.filterwarnings('ignore')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Use the GPU if one is available

<a name="p1"></a>

---
## **Part 1: Text Classification of News Articles**
---

In this section, we will return to the 20 News Groups Dataset. This time, we'll look at a subset of three news categories and build a classifier to distinguish between them using a pre-trained model in fast.ai.

<br>


**Run the code provided below to import the dataset.**

In [None]:
categories = ['rec.autos', 'comp.graphics', 'sci.space']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Create a pandas DataFrame with one row per article
df = pd.DataFrame({'text': data.data, 'label': data.target})

# Create TextDataLoaders
dls = TextDataLoaders.from_df(
    df,
    text_col='text',
    label_col='label',
    valid_pct=0.2,
    bs=64,
    seq_len=100,
    seed=42,
    device=device)

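#### **Problem #1.1: Define a pre-trained model**

Use `text_classifier_learner()` with the pre-trained `AWD_LSTM` architecture to define a classifier for the data loaders created above.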



###### **Solution**

In [None]:
learn = text_classifier_learner(
    dls,
    AWD_LSTM,
    drop_mult=0.5, # Amount of Dropout to use
    metrics=accuracy)

#### **Problem #1.2: Train the model**


There are two options when using a pre-trained model for transfer learning.

You can freeze all but the last layer and train using the following:
```
learn.freeze()
learn.fit(n_epochs, lr)
```

For the other option, you can have fast.ai first train the last layer group while the rest of the model is frozen, then unfreeze all layers and continue training to improve the results:

```
learn.fine_tune(n_epochs, lr)
```

The `fine_tune()` method is designed for pre-trained models. The only required input is the number of epochs, but you can also specify a learning rate.
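
As a rough sketch of what happens inside `fine_tune()`, the function below mirrors the sequence of calls found in recent fastai releases: a short training phase with the pre-trained body frozen, followed by training all layers at a lower learning rate. The name `fine_tune_sketch` is hypothetical, the default values are assumptions that may differ between fastai versions, and some scheduling arguments of the real method are left out.

```python
def fine_tune_sketch(learn, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100):
    """Approximate the two phases of fine_tune() on a fastai-style Learner."""
    # Phase 1: train only the last layer group while the pre-trained body is frozen
    learn.freeze()
    learn.fit_one_cycle(freeze_epochs, slice(base_lr))

    # Phase 2: unfreeze everything and train with a halved learning rate;
    # the slice gives earlier layers a smaller rate than later ones
    base_lr /= 2
    learn.unfreeze()
    learn.fit_one_cycle(epochs, slice(base_lr / lr_mult, base_lr))
```

Under these assumptions, calling the sketch with `epochs=5` runs one frozen epoch and then five unfrozen epochs, which is why `fine_tune(5)` prints two separate results tables.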

**Choose a training method and train the model.**

###### **Solution**


In [None]:
learn.fine_tune(5)

epoch,train_loss,valid_loss,accuracy,time
0,0.737487,0.549568,0.80678,00:18


epoch,train_loss,valid_loss,accuracy,time
0,0.508746,0.459237,0.830508,00:31
1,0.468097,0.510489,0.852542,00:31
2,0.414818,0.470096,0.837288,00:31
3,0.366171,0.424428,0.840678,00:31
4,0.326511,0.430942,0.842373,00:31


#### **Problem #1.3: Evaluate the model**


Fill in the code below to evaluate the model.

In [None]:
# Calculate training accuracy
train_loss, train_accuracy = # FILL IN CODE HERE
print(f"Training accuracy: {train_accuracy:.4f}")

# Calculate validation accuracy
valid_loss, valid_accuracy = # FILL IN CODE HERE
print(f"Validation accuracy: {valid_accuracy:.4f}")

###### **Solution**


In [None]:
# Calculate training accuracy
train_loss, train_accuracy = learn.validate(dl=dls.train)
print(f"Training accuracy: {train_accuracy:.4f}")

# Calculate validation accuracy
valid_loss, valid_accuracy = learn.validate(dl=dls.valid)
print(f"Validation accuracy: {valid_accuracy:.4f}")

Training accuracy: 0.9080


Validation accuracy: 0.8424
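
For reference, accuracy is the fraction of examples whose highest-scoring class matches the true label. The plain-Python sketch below illustrates that computation on made-up scores; `accuracy_sketch` is a hypothetical helper for illustration, not the fast.ai metric itself.

```python
def accuracy_sketch(scores, targets):
    """Fraction of rows whose highest-scoring class equals the target label."""
    correct = 0
    for row, target in zip(scores, targets):
        predicted = max(range(len(row)), key=lambda i: row[i])  # argmax over classes
        correct += int(predicted == target)
    return correct / len(targets)

# Made-up class scores for four articles and three classes
scores = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
targets = [0, 1, 2, 1]
print(accuracy_sketch(scores, targets))  # 0.75
```

A training accuracy that is noticeably higher than the validation accuracy, as in the output above (0.9080 vs. 0.8424), can be a sign of overfitting.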


<a name="p2"></a>

---
## **Part 2: IMDB Sentiment Classification**
---

In this part, you will build a model of your choosing and tune hyperparameters to achieve **90% accuracy** (or aim for higher as a challenge) on the IMDB sentiment classification dataset. This is a dataset of 25,000 movie reviews with sentiment labels: 0 for negative and 1 for positive.

We recommend that you use one of the following:
* A fully-connected model with embedding
* A CNN with embedding
* A pre-trained model

*Hint: You may want to use the Lab 8 notebook as a reference.*


<br>


**Run the code provided below to import the dataset.**

In [None]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vTdgncgNHtppfS89LHOh1kGl5tYzoEUrUwmOPOQF7mQ0U5Rzba27H45imvZ06_J2x0-wCJySylP5V3_/pub?gid=1712575053&single=true&output=csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses main...",1
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his ...",1
2,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've ...",1
3,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with so...",0
4,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler's play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of conta...",1


#### **Step #1: Import and split data into training and validation sets**


Use TextDataLoaders to load and split the data.


In [None]:
dls = TextDataLoaders.from_df(
    df,
    text_col='review',
    label_col='sentiment',
    valid_pct=0.2,
    bs=#choose a batch size,
    seq_len=#choose a sequence length,
    device=device
)

###### **Solution**

In [None]:
dls = TextDataLoaders.from_df(
    df,
    text_col='review',
    label_col='sentiment',
    valid_pct=0.2,
    bs=64,
    seq_len=100,
    device=device
)

#### **Step #2: Determine the input dimension of your data**


Print the length of the vocabulary.

###### **Solution**

In [None]:
len(dls.vocab[0])

52000
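
The vocabulary length matters because each token is mapped to an integer id, and an embedding layer needs one row per id. The sketch below illustrates the idea with a toy vocabulary in plain Python. The tokens, the use of id 0 for unknown words, and the 2-dimensional vectors are made up for illustration and do not reproduce fast.ai's tokenizer or its learned embeddings.

```python
# Toy vocabulary: a token's position in the list is its id
vocab = ["xxunk", "the", "movie", "was", "great", "terrible"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def numericalize(tokens):
    """Map tokens to ids, sending unknown tokens to id 0 ('xxunk')."""
    return [token_to_id.get(tok, 0) for tok in tokens]

# Embedding table: one (made-up) 2-d vector per vocabulary entry
embedding = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.3], [0.0, 0.1], [0.9, 0.8], [-0.9, -0.7]]
assert len(embedding) == len(vocab)  # input dimension equals vocabulary length

ids = numericalize("the movie was awesome".split())
vectors = [embedding[i] for i in ids]  # embedding lookup is just row indexing
print(ids)  # [1, 2, 3, 0]
```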

#### **Steps #3-6: Build the neural network**


###### **Solution**

In [None]:
learn = text_classifier_learner(
    dls,
    AWD_LSTM,
    drop_mult=0.4, # Amount of Dropout to use
    metrics=accuracy)

#### **Step #7: Train the model**


###### **Solution**

In [None]:
learn.fine_tune(4)

epoch,train_loss,valid_loss,accuracy,time
0,0.436369,0.376167,0.8349,03:32


epoch,train_loss,valid_loss,accuracy,time
0,0.303863,0.264936,0.8921,09:10
1,0.237286,0.221283,0.9114,09:11
2,0.222718,0.203754,0.9206,09:11
3,0.192003,0.198431,0.9249,09:12


#### **Step #8: Evaluate the model**


Evaluate the model, then go back to previous steps and tune hyperparameters or change the model architecture until you reach 90% accuracy or more on the validation set.

###### **Solution**

In [None]:
# Calculate training accuracy
train_loss, train_accuracy = learn.validate(dl=dls.train)
print(f"Training accuracy: {train_accuracy:.4f}")

# Calculate validation accuracy
valid_loss, valid_accuracy = learn.validate(dl=dls.valid)
print(f"Validation accuracy: {valid_accuracy:.4f}")

Training accuracy: 0.9492


Validation accuracy: 0.9249


#### **Step #9: Test the model**


Write your own text or find a review online and test your model's predictions. Remember that `0` is for a negative review and `1` for a positive review.

In [None]:
# Change to your own text
learn.predict("I really liked that movie!")

('1', tensor(1), tensor([0.1572, 0.8428]))
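
The output above is a tuple containing the decoded label, the index of the predicted class, and the probability of each class. Below is a minimal sketch of turning such a tuple into a readable message, using plain Python values in place of tensors; `describe_prediction` is a hypothetical helper, not part of fast.ai.

```python
def describe_prediction(label, class_idx, probs):
    """Format a (label, index, probabilities) prediction as a short message."""
    sentiment = "positive" if str(label) == "1" else "negative"
    confidence = probs[class_idx]
    return f"{sentiment} ({confidence:.1%} confidence)"

# Values copied from the output above, with tensors replaced by plain numbers
print(describe_prediction("1", 1, [0.1572, 0.8428]))  # positive (84.3% confidence)
```

With a real prediction, you could pass `int(class_idx)` and `probs.tolist()` to convert the tensors to plain Python values first.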

###### **Solution**

If the code above doesn't work (it may only work for models created with `text_classifier_learner()`), use the code below.

In [None]:
# Tokenize and preprocess the input text
text = "I really liked that movie!"
tokens = dls.tokenizer(text)
numericalized_tokens = dls.numericalize(tokens)

# Create a mini-batch with only the input data
test_dl = dls.test_dl([numericalized_tokens], with_labels=False)

# Get the predictions
preds, _ = learn.get_preds(dl=test_dl)

# Get the predicted class and probabilities
pred_class_idx = torch.argmax(preds, dim=-1).item()
pred_class = dls.vocab[1][pred_class_idx]

print(f"Predicted class: {pred_class}")

### **Congratulations!**

You've created a sentiment classifier model that can be used to determine if reviews are positive or negative with over 90% accuracy!

---
### © 2023 The Coding School, All rights reserved