# **Day 9: Natural Language Processing (NLP) Project**
---

### **Description**
In today's lab, we'll use transfer learning for a text classification task. You will then review the NLP skills you have learned so far by creating a model to classify movie reviews as positive or negative. In particular, you will apply:
* Text vectorization and **embedding**
* Neural Network models for NLP (Fully connected, CNN, or pre-trained)
* Text classification

**When using pre-trained models, GPUs can help speed things up significantly. Make sure to select "GPU" when starting your runtime in SageMaker.**

<br>

### **Lab Structure**

**Part 1**: [Text Classification of News Articles](#p1)

**Part 2**: [IMDB Sentiment Classification](#p2)



<br>

### **Goals**
By the end of this lab, you will have practiced a range of deep learning techniques for NLP, including embeddings and transfer learning with pre-trained models.

<br>

### **Cheat Sheets**
* [Natural Language Processing III](https://docs.google.com/document/d/1jrzya_r_97qrmk7RGKqWCPhkHhsKMsMDkrn7YjdZUK0/edit)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from fastai.text.all import *
from sklearn.datasets import fetch_20newsgroups

import warnings
warnings.filterwarnings('ignore')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # use the GPU when available, otherwise the CPU

<a name="p1"></a>

---
## **Part 1: Text Classification of News Articles**
---

In this section, we will return to the 20 Newsgroups dataset. This time, we'll look at a subset of three news categories and build a classifier to distinguish between them using a pre-trained model in fast.ai.

<br>


**Run the code provided below to import the dataset.**

In [None]:
categories = ['rec.autos', 'comp.graphics', 'sci.space']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Create a pandas DataFrame with one row per article
df = pd.DataFrame({'text': data.data, 'label': data.target})

# Create TextDataLoaders
dls = TextDataLoaders.from_df(
    df,
    text_col='text',
    label_col='label',
    valid_pct=0.2,
    bs=64,
    seq_len=100,
    seed=42,
    device=device)

#### **Problem #1.1: Define a pre-trained model**



#### **Problem #1.2: Train the model**


There are two options when using a pre-trained model for transfer learning.

You can freeze all but the last layer and train using the following:
```
learn.freeze()
learn.fit(n_epochs, lr)
```

For the other option, you can have fast.ai first train the last layer with the earlier layers frozen, then unfreeze the earlier layers and continue training the whole model, which often improves the results:

```
learn.fine_tune(n_epochs, lr)
```

The `fine_tune()` method is intended for pre-trained models. The only required input is the number of epochs to train after unfreezing, but you can also specify a base learning rate.
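To see what freezing means mechanically, here is a minimal sketch in plain PyTorch rather than fast.ai. The toy `body` and `head` modules and the `count_trainable` helper are hypothetical stand-ins for a pre-trained network; fast.ai manages freezing through its own layer groups, so treat this only as an illustration of the idea.

```python
import torch
from torch import nn

# Toy stand-in for a pre-trained network (hypothetical, for illustration only):
# a "body" that would hold the pre-trained weights, plus a new classification "head".
body = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
head = nn.Linear(16, 3)
model = nn.Sequential(body, head)


def count_trainable(module):
    """Count the parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Freezing: turn off gradient updates for the body so only the head is trained.
for param in body.parameters():
    param.requires_grad = False
frozen_trainable = count_trainable(model)

# Unfreezing: turn gradient updates back on so the whole network is trained.
for param in body.parameters():
    param.requires_grad = True
unfrozen_trainable = count_trainable(model)

print(f"Trainable parameters while frozen: {frozen_trainable}")
print(f"Trainable parameters after unfreezing: {unfrozen_trainable}")
```

While the body is frozen, only the head's parameters are updated, so each epoch changes far fewer weights than training the full network.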

**Choose a training method and train the model.**

#### **Problem #1.3: Evaluate the model**


Fill in the code below to evaluate the model.

In [None]:
# Calculate training accuracy
train_loss, train_accuracy = # FILL IN CODE HERE
print(f"Training accuracy: {train_accuracy:.4f}")

# Calculate validation accuracy
valid_loss, valid_accuracy = # FILL IN CODE HERE
print(f"Validation accuracy: {valid_accuracy:.4f}")
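As a reminder of what the accuracy metric measures, the sketch below computes it by hand in plain Python. The `accuracy` helper and the sample probabilities are made up for illustration; fast.ai computes the metric for you when it validates a model.

```python
def accuracy(probabilities, targets):
    """Fraction of examples whose highest-probability class matches the target."""
    # For each row, pick the index of the largest probability (the predicted class).
    predicted = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
    correct = sum(p == t for p, t in zip(predicted, targets))
    return correct / len(targets)


# Hypothetical model outputs for four articles and three classes.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.2, 0.6],
]
labels = [0, 2, 0, 2]

acc = accuracy(probs, labels)
print(f"Accuracy: {acc:.4f}")
```

A large gap between training and validation accuracy usually suggests the model is overfitting the training set.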

<a name="p2"></a>

---
## **Part 2: IMDB Sentiment Classification**
---

In this part, you will build a model of your choosing and tune hyperparameters to achieve **90% accuracy** (or aim for higher as a challenge) on the IMDB sentiment classification dataset. This is a dataset of 25,000 movie reviews with sentiment labels: 0 for negative and 1 for positive.

We recommend that you use one of the following:
* A fully-connected model with embedding
* A CNN with embedding
* A pre-trained model
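
As a starting point for the first option, here is a rough sketch of a fully-connected model with an embedding layer, written in plain PyTorch. The class name `EmbeddingClassifier`, the layer sizes, and the padding index of 1 are all assumptions chosen for illustration, not the lab's required architecture; adapt them to your own vocabulary and data.

```python
import torch
from torch import nn


class EmbeddingClassifier(nn.Module):
    """Embed token ids, average over the sequence, then classify (illustrative sketch)."""

    def __init__(self, vocab_size, embed_dim=32, hidden_dim=16, n_classes=2, pad_idx=1):
        super().__init__()
        self.pad_idx = pad_idx  # assumed index of the padding token
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        # Average only over real tokens so padding does not dilute the result.
        mask = (token_ids != self.pad_idx).unsqueeze(-1).float()
        pooled = (embedded * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)  # (batch, n_classes)


# Quick shape check with random token ids (vocabulary of 100, 4 reviews of 10 tokens).
model = EmbeddingClassifier(vocab_size=100)
batch = torch.randint(2, 100, (4, 10))
batch[0, 5:] = 1  # pretend the first review is shorter and padded
logits = model(batch)
print(logits.shape)
```

The first argument, `vocab_size`, is the input dimension you determine in Step #2.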

*Hint: You may want to use the Lab 8 notebook as a reference.*


<br>


**Run the code provided below to import the dataset.**

In [None]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vTdgncgNHtppfS89LHOh1kGl5tYzoEUrUwmOPOQF7mQ0U5Rzba27H45imvZ06_J2x0-wCJySylP5V3_/pub?gid=1712575053&single=true&output=csv'
df = pd.read_csv(url)
df.head()

#### **Step #1: Import and split data into training and validation sets**


Use TextDataLoaders to load and split the data.


In [None]:
dls = TextDataLoaders.from_df(
    df,
    text_col='review',
    label_col='sentiment',
    valid_pct=0.2,
    bs=#choose a batch size,
    seq_len=#choose a sequence length,
    device=device
)

#### **Step #2: Determine the input dimension of your data**


Print the length of the vocabulary.
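
To make clear what the vocabulary length represents, the sketch below builds a tiny vocabulary by hand in plain Python. The `build_vocab` helper, the whitespace tokenization, the frequency cutoff, and the two special tokens are simplifying assumptions; fast.ai's tokenizer is more elaborate, so use the vocabulary from your `dls` for the actual lab.

```python
from collections import Counter


def build_vocab(texts, min_freq=2, specials=("xxunk", "xxpad")):
    """Return special tokens followed by every word seen at least min_freq times."""
    counts = Counter(token for text in texts for token in text.lower().split())
    frequent = sorted(token for token, count in counts.items() if count >= min_freq)
    return list(specials) + frequent


# Hypothetical mini-corpus of three reviews.
reviews = ["a great movie", "a dull movie", "great acting"]
vocab = build_vocab(reviews)

print(vocab)
print(f"Vocabulary length: {len(vocab)}")
```

Each token in the vocabulary gets its own row in the embedding layer, which is why the vocabulary length is the input dimension of the model.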

#### **Steps #3-6: Build the neural network**


#### **Step #7: Train the model**


#### **Step #8: Evaluate the model**


Evaluate the model, then go back to previous steps and tune hyperparameters or change the model architecture until you reach 90% accuracy or more on the validation set.

#### **Step #9: Test the model**


Write your own text or find a review online and test your model's predictions. Remember that `0` is for a negative review and `1` for a positive review.

In [None]:
# Change to your own text
learn.predict("I really liked that movie!")
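
If you want a more readable result, a small helper like the one below can turn a prediction into a sentence. It assumes the prediction is a three-part tuple of (decoded label, class index, per-class probabilities); the `describe_prediction` helper and the example values are hypothetical, so check the actual output of your model before relying on it.

```python
def describe_prediction(prediction, names=("negative", "positive")):
    """Format an assumed (label, class index, probabilities) tuple as readable text."""
    _, class_index, probabilities = prediction
    idx = int(class_index)
    confidence = float(probabilities[idx])
    return f"{names[idx]} ({confidence:.1%} confidence)"


# Made-up stand-in for a model prediction, using plain Python values.
example = ("1", 1, [0.08, 0.92])
summary = describe_prediction(example)
print(summary)
```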

### **Congratulations!**

You've created a sentiment classifier model that can be used to determine if reviews are positive or negative with over 90% accuracy!

# End of Notebook

---
© 2024 The Coding School, All rights reserved