<a href="https://colab.research.google.com/github/sheldonkemper/portfolio/blob/main/CAM_DS_301_Sentiment_analysis_and_text_classification_Activity_2_2_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Activity 2.2.6 Sentiment analysis and text classification

In this activity, you will build a sentiment analysis model using Python and a data set of customer reviews. You will preprocess the data and fine-tune, evaluate, and test the model.


## Objective
In this activity, you will download a data set from Hugging Face and conduct text classification on it. Your objective is to analyse how different parameter choices affect the performance of a sentiment classifier.

You will complete this in your Notebook, where you will:

- create and train sentiment classifier RNN models
- evaluate model performance.



## Assessment criteria

By completing this activity, you will be able to provide evidence that you can:
*   apply various text preprocessing techniques and representation methods to preprocess and analyse textual data.
*   comprehend and implement different types of recurrent neural networks (RNNs) and understand their applications in NLP.
*   build and fine-tune advanced NLP models for specific natural language processing tasks.


## Activity guidance

1. Install the packages needed for this activity.
2. Load the sst5 dataset from Hugging Face (https://huggingface.co/datasets/SetFit/sst5).
3. Create DataFrames from the train and test splits.
4. Split the train DataFrame into train and validation sets in the ratio 8:2.
5. Preprocess the dataset, setting the maximum sequence length to 200 and the vocabulary size to 30,000.
6. During tokenisation, mark out-of-vocabulary words as "[OOV]".
7. Pad your sequences with special tokens.
8. Train a sentiment classifier on the dataset and compare different models for text classification.
9. Train for 5 epochs:
- Train with a vanilla RNN
- Train with an LSTM
- Is there any difference between a GRU and an LSTM?
- Train with a bidirectional LSTM
10. Comment on the performance of all the models.


> Start your activity here. Select the pen from the toolbar to add your entry.

In [1]:
# In this activity, you will download a dataset from Hugging Face and perform text classification on it
# You will study the impact of different parameter choices on the classification performance of a sentiment classifier


# 1. Install the necessary packages that will be useful in this activity
# 2. Load the dataset sst5 from Hugging Face (https://huggingface.co/datasets/SetFit/sst5)
# 3. Create dataframes of the train and test splits
# 4. Split the train dataframe into train and validation in the ratio of 8:2
# 5. Preprocess the dataset, set the maximum size to 200, vocabulary size to 30000
# 6. During tokenization, mark out of vocabulary words as "<OOV>"
# 7. Pad your sequences with special tokens
# 8. Train a sentiment classifier on the dataset and compare different models for text classification
# Train for 5 epochs
#    - Train with a vanilla RNN
#    - Train with an LSTM
#    - Is there any difference between a GRU and an LSTM?
#    - Train with a bidirectional LSTM
# Comment on the performance of all the models

## **Step 1: Install Necessary Packages**



In [2]:
!pip install transformers datasets tensorflow



## **Step 2: Load the Dataset from Hugging Face**

In [3]:
from datasets import load_dataset

# Load the sst5 dataset
dataset = load_dataset("SetFit/sst5")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Repo card metadata block was not found. Setting CardData to empty.


## **Step 3: Create DataFrames for Train and Test Splits**

In [4]:
import pandas as pd

# Convert train and test splits into DataFrames
train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['test'])

## **Step 4: Split Train Data into Train and Validation (80:20)**

In [5]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)
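
As a possible variation on the split above, the sketch below adds stratification so that each sentiment class keeps the same proportion in both subsets. The toy DataFrame and its column names (`text`, `label`) are assumptions made for illustration; it is not the real SST-5 data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the training DataFrame: 5 classes, 10 rows each (illustrative only)
toy_df = pd.DataFrame({
    "text": [f"sample sentence {i}" for i in range(50)],
    "label": [i % 5 for i in range(50)],
})

# 80:20 split; stratify keeps the class proportions equal in both subsets
toy_train, toy_val = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["label"]
)

print(len(toy_train), len(toy_val))
print(toy_val["label"].value_counts().sort_index().tolist())
```

Stratifying tends to matter more when some classes are rare, because a purely random split can leave the validation set with very few examples of them.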


## **Step 5: Preprocess the Dataset**

Set the maximum sequence length to 200 and the vocabulary size to 30,000.

During tokenization, mark out-of-vocabulary (OOV) words as "[OOV]".

1. Tokenize: Use a tokenizer, such as one from the Hugging Face library, with a maximum length and OOV handling.
1. Padding: Add padding tokens to ensure each sequence is exactly 200 tokens long.

In [6]:
from transformers import AutoTokenizer

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", unk_token="[OOV]")
max_len = 200

def preprocess(texts):
    return tokenizer(
        texts,
        max_length=max_len,
        padding="max_length",
        truncation=True,
    )

train_encodings = preprocess(train_df['text'].tolist())
val_encodings = preprocess(val_df['text'].tolist())
test_encodings = preprocess(test_df['text'].tolist())
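
To make the three preprocessing ideas concrete (a capped vocabulary, an out-of-vocabulary marker, and padding to a fixed length), here is a minimal pure-Python sketch. It is a simplified whitespace tokeniser, not the pretrained tokenizer used above; the helper names `build_vocab` and `encode` and the toy sentences are hypothetical.

```python
from collections import Counter

PAD, OOV = "[PAD]", "[OOV]"

def build_vocab(texts, vocab_size):
    """Map the most frequent words to ids; ids 0 and 1 are reserved for PAD and OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {PAD: 0, OOV: 1}
    for word, _ in counts.most_common(vocab_size - 2):
        vocab[word] = len(vocab)
    return vocab

def encode(text, vocab, max_len):
    """Convert text to ids, map unknown words to OOV, then truncate and right-pad."""
    ids = [vocab.get(word, vocab[OOV]) for word in text.lower().split()][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

# Toy corpus (illustrative only); the tiny vocabulary cap forces some words out
toy_texts = ["a good film", "a bad film", "good good fun"]
toy_vocab = build_vocab(toy_texts, vocab_size=5)

# "surprise" never appears in the corpus, so it is encoded as the OOV id
encoded = encode("a good surprise", toy_vocab, max_len=6)
print(encoded)
```

In the activity itself the same roles are played by a vocabulary size of 30,000 and a maximum length of 200.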




## **Step 6: Train a Sentiment Classifier and Compare Different Models**


Train for 5 Epochs

For each model, train on the preprocessed training data for 5 epochs.

Below are the steps for setting up each model type:

1. Vanilla RNN
1. LSTM
1. GRU (Compare its performance with LSTM)
1. Bidirectional LSTM


In [7]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding, SimpleRNN, LSTM, GRU, Bidirectional, Dense

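# The pretrained tokenizer's vocabulary is larger than 30,000 entries, so size the
# embedding from the tokenizer to keep every token id within range
vocab_size = len(tokenizer)
embedding_dim = 128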

def create_model(rnn_type='VanillaRNN'):
    model = tf.keras.Sequential()
    model.add(Embedding(vocab_size, embedding_dim, input_length=max_len))

    if rnn_type == 'VanillaRNN':
        model.add(SimpleRNN(64))
    elif rnn_type == 'LSTM':
        model.add(LSTM(64))
    elif rnn_type == 'GRU':
        model.add(GRU(64))
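    elif rnn_type == 'BidirectionalLSTM':
        model.add(Bidirectional(LSTM(64)))
    else:
        raise ValueError(f"Unknown rnn_type: {rnn_type}")

    # SST-5 has five sentiment classes (labels 0-4), so use a softmax output
    # with sparse categorical cross-entropy for integer labels
    model.add(Dense(5, activation='softmax'))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])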

    return model
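
On the question of how a GRU differs from an LSTM: an LSTM uses three gates plus a candidate cell state (four weight blocks) and carries a separate cell state, whereas a GRU uses two gates plus a candidate state (three weight blocks) and has no separate cell state. One visible consequence is model size. The sketch below counts trainable parameters for the recurrent layers using the standard formulas, with the embedding dimension (128) and unit count (64) from the cell above. The helper names are hypothetical, and the GRU count assumes the Keras default `reset_after=True`, which adds a second bias vector per block.

```python
def simple_rnn_params(input_dim, units):
    # One weight block: input kernel + recurrent kernel + bias
    return units * (input_dim + units) + units

def lstm_params(input_dim, units):
    # Four weight blocks: input, forget, and output gates plus the candidate cell state
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units, reset_after=True):
    # Three weight blocks: update and reset gates plus the candidate state.
    # With reset_after=True (assumed Keras default) each block has two bias vectors.
    biases = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + biases)

embedding_dim, units = 128, 64
param_counts = {
    "VanillaRNN": simple_rnn_params(embedding_dim, units),
    "LSTM": lstm_params(embedding_dim, units),
    "GRU": gru_params(embedding_dim, units),
    # A bidirectional wrapper runs two independent LSTMs (forward and backward)
    "BidirectionalLSTM": 2 * lstm_params(embedding_dim, units),
}

for name, count in param_counts.items():
    print(f"{name}: {count:,}")
```

With these settings the GRU layer has roughly three quarters of the LSTM's parameters, which is one reason it often trains somewhat faster; whether it matches the LSTM's accuracy on this dataset has to be checked empirically.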


## **Step 7: Training and Evaluation**

In [9]:
import numpy as np

# Build model inputs from the token ids and take the integer labels (0-4)
# from the DataFrames; this assumes the label column is named 'label'
train_inputs = np.array(train_encodings['input_ids'])
val_inputs = np.array(val_encodings['input_ids'])
train_labels = train_df['label'].to_numpy()
val_labels = val_df['label'].to_numpy()

# Convert to TensorFlow datasets
train_dataset = tf.data.Dataset.from_tensor_slices((train_inputs, train_labels)).batch(32)
val_dataset = tf.data.Dataset.from_tensor_slices((val_inputs, val_labels)).batch(32)

# Create and train each model
for model_type in ['VanillaRNN', 'LSTM', 'GRU', 'BidirectionalLSTM']:
    print(f"Training {model_type} model...")
    model = create_model(rnn_type=model_type)
    model.fit(
        train_dataset,
        validation_data=val_dataset,
        epochs=5
    )
    val_loss, val_accuracy = model.evaluate(val_dataset)
    print(f"{model_type} Validation Accuracy: {val_accuracy:.4f}")
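
To support the final comparison of models, the following sketch shows one way to turn per-class probabilities into predicted labels and then compute accuracy and a confusion matrix. The helper `evaluate_predictions` is hypothetical, and the probabilities are made-up values for illustration, not model output; in the notebook they would presumably come from `model.predict` on the test token ids.

```python
import numpy as np

def evaluate_predictions(probs, labels, num_classes=5):
    """Return accuracy and a confusion matrix (rows: true class, columns: predicted)."""
    preds = np.argmax(probs, axis=1)  # most probable class for each example
    accuracy = float(np.mean(preds == labels))
    confusion = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(confusion, (labels, preds), 1)  # count each (true, predicted) pair
    return accuracy, confusion

# Made-up probabilities for four examples (illustrative only)
toy_probs = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],  # predicted class 0
    [0.10, 0.60, 0.10, 0.10, 0.10],  # predicted class 1
    [0.10, 0.10, 0.20, 0.50, 0.10],  # predicted class 3
    [0.05, 0.05, 0.10, 0.20, 0.60],  # predicted class 4
])
toy_labels = np.array([0, 1, 2, 4])

accuracy, confusion = evaluate_predictions(toy_probs, toy_labels)
print(accuracy)
print(confusion)
```

For a five-class task, the confusion matrix is often more informative than accuracy alone, since it shows whether errors fall on neighbouring sentiment classes or on opposite ones.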