# Text Classification Challenge

## Overview

Welcome to the Text Classification Challenge! In this task, you will develop a machine learning model to classify IMDb movie reviews into positive or negative sentiments. The challenge is designed to help you demonstrate your skills in natural language processing (NLP) and your ability to work with state-of-the-art transformer models.

### Problem Statement

The task is to build a text classification model that accurately predicts whether a given movie review expresses a positive or negative sentiment. Sentiment analysis is a critical task in NLP with applications in marketing, customer feedback, social media monitoring, and more. Accurately classifying sentiments can provide valuable insights into customer opinions and help businesses make data-driven decisions.

### Why This Task is Important

Understanding customer sentiment through text data is crucial for businesses and organizations to respond effectively to customer needs and preferences. By automating the sentiment analysis process, companies can efficiently analyze vast amounts of data, identify trends, and make informed strategic decisions. For this challenge, we will use the IMDb dataset, a widely used benchmark in sentiment analysis, to train and evaluate our model.

## Dataset Description

The dataset used for this challenge is the IMDb movie reviews dataset, which contains 50,000 reviews labeled as either positive or negative. This dataset is balanced, with an equal number of positive and negative reviews, making it ideal for training and evaluating sentiment analysis models.

- **Columns:**
  - `review`: The text of the movie review.
  - `sentiment`: The sentiment label (`positive` or `negative`).

The IMDb dataset provides a real-world scenario where understanding sentiment can offer insights into public opinion about movies, directors, and actors, as well as broader trends in the entertainment industry.

## Approach

Transformers have revolutionized NLP by allowing models to consider the context of a word based on surrounding words, enabling better understanding and performance on various tasks, including sentiment analysis. Their ability to transfer learning from massive datasets and adapt to specific tasks makes them highly effective for text classification.
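
To make "considering the context of a word based on surrounding words" concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The toy sizes, random weights, and function names are assumptions made for illustration; this is not the model trained later in the notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (seq_len, d_model) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights              # context vectors and attention weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4  # toy sizes, chosen only for illustration
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

context, weights = self_attention(x, w_q, w_k, w_v)
print(context.shape, weights.shape)
```

Each output row is a weighted average of the value vectors of every token in the sequence, which is how one token's representation comes to depend on its neighbours.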

## Your Task

You are required to implement a transformer-based model for sentiment classification on the IMDb dataset. Follow the steps below to complete the challenge:

1. **Data Exploration and Preprocessing:**
   - Load the dataset and perform exploratory data analysis (EDA) to understand its structure.
   - Preprocess the data by cleaning text, encoding labels, and splitting into training and test sets.

2. **Model Implementation:**
   - Implement a transformer-based model for sentiment classification. You should consider writing Transformer blocks from scratch.
   - Implement data loaders and training loops using a deep learning framework like PyTorch or TensorFlow.

3. **Training and Evaluation:**
   - Train your model and optimize hyperparameters for the best performance.
   - Evaluate the model using appropriate metrics.

4. **Documentation:**
   - Document your approach, experiments, and results.
   - Discuss any challenges faced and propose potential improvements.

5. **Prediction and Inference:**
    - Implement a function that takes a movie review as input and predicts the sentiment (positive or negative).
    - Test the function with custom reviews and display the predicted sentiment.

6. **Model Deployment:**
    - Save the trained model and any other necessary files.
    - Prepare the model for deployment (e.g., using Flask or FastAPI).
    - Prepare a basic front-end interface for the deployed model.

7. **Submission:**
    - Create a GitHub repository for your code.
    - Write a detailed README.md file with instructions on how to train, evaluate, and use the model.
    - Include a summary of your approach and the results in the README file.
    - Your code should be well-documented and reproducible.
    - Your repository should include a notebook showcasing the complete process, including data loading, preprocessing, model implementation, training, and evaluation.
    - Apart from the notebook, put all code in `.py` files so that it can be easily integrated with the API.
    - Your submission should also include a Python script for the API.
    - Your submission should also include a basic front-end for the deployed model.
    - Submit the GitHub repository link.

## Getting Started

To get started, follow the structure provided in this notebook, complete each step, and explore additional techniques to enhance your model's performance. Make sure to document your findings and prepare a comprehensive report on your work.

Good luck, and welcome to RealAI!


# Data Exploration and Preprocessing

Let's start by loading the dataset and performing some exploratory data analysis (EDA) to understand its structure and characteristics.
You can download the dataset from the following link: https://drive.google.com/file/d/1aU7Vv7jgodZ0YFOLY7kmSjrPcDDwtRfU/view?usp=sharing

You should provide all the necessary reasoning and code to support your findings.

Finally, you should apply the required preprocessing steps to prepare the data for training the sentiment classification model.
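
As a starting point for the EDA step, the sketch below computes the class balance, review-length statistics, and the share of reviews containing HTML line breaks. It uses only the standard library, and the four rows are made-up stand-ins for the `review` and `sentiment` columns, not real IMDb data.

```python
from collections import Counter
from statistics import mean, median

# Toy rows standing in for the `review` and `sentiment` columns (not real IMDb data).
rows = [
    ("A wonderful little film.<br /><br />Loved it.", "positive"),
    ("Dull plot and wooden acting.", "negative"),
    ("One of the best dramas I have seen in years.", "positive"),
    ("I wanted to like it, but it dragged on forever.", "negative"),
]

def summarize(rows):
    """Return class counts, word-count statistics, and the share of reviews with HTML breaks."""
    labels = Counter(label for _, label in rows)
    lengths = [len(text.split()) for text, _ in rows]  # whitespace-separated word counts
    with_breaks = sum("<br" in text for text, _ in rows)
    return {
        "class_counts": dict(labels),
        "mean_words": mean(lengths),
        "median_words": median(lengths),
        "max_words": max(lengths),
        "share_with_br": with_breaks / len(rows),
    }

summary = summarize(rows)
print(summary)
```

On the real dataset, the length statistics help justify the maximum sequence length, and the share of reviews containing `<br` shows why HTML cleanup belongs in preprocessing.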

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers
from nltk.corpus import stopwords
from nltk.tokenize import ToktokTokenizer
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D, Dropout, Dense, Layer

import nltk
import re
import pickle

# Data Preprocessing
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'<br\s*/?>', ' ', text)  # Replace HTML line breaks with a space
    # Remove square brackets and their contents. This must run before punctuation
    # is stripped, otherwise no brackets are left to match.
    text = re.sub(r'\[[^]]*\]', '', text)
    text = re.sub(r'[^a-z\s]', '', text)  # Remove non-alphabetic characters
    return text

# Requires the NLTK stopword corpus; run nltk.download('stopwords') once if it is missing.
stopword_set = set(stopwords.words('english'))  # a set gives fast membership checks
# Named toktok_tokenizer so it is not shadowed by the Keras Tokenizer created later.
toktok_tokenizer = ToktokTokenizer()

# Remove English stopwords from a piece of text
def remove_stopwords(text, is_lower_case=False):
    tokens = toktok_tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_set]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_set]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
# Load the dataset
file_path = 'IMDB Dataset.csv'  # Update this path if necessary
df = pd.read_csv(file_path)
# Preprocess the text data
df['review'] = df['review'].apply(clean_text)
df['review']=df['review'].apply(remove_stopwords)
# Encode output labels
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

# Define maximum number of words and sequence length
MAX_NUM_WORDS = 20000  # Vocabulary size
MAX_SEQUENCE_LENGTH = 200  # Max length for each review

# Initialize and fit the tokenizer on the text data
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(df['review'].values)

# Convert text data to padded sequences
sequences = tokenizer.texts_to_sequences(df['review'].values)
padded_sequences = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

# Split the dataset into training and held-out sets. The held-out 20% is used for
# validation during training and is reported as the test set below.
X_train, X_val, y_train, y_val = train_test_split(padded_sequences, df['sentiment'].values, test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_val.shape[0]}")



Training set size: 40000
Test set size: 10000


In [23]:
#save the tokenizer for use in FastAPI
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)
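
To show what the tokenization and padding step in the preprocessing cell produces, here is a pure-Python sketch of the same idea. The helper names and toy texts are hypothetical. The sketch mirrors three documented Keras behaviours: only the `num_words - 1` most frequent words are kept, words outside the vocabulary are dropped when no OOV token is configured, and `pad_sequences` pads and truncates at the front by default.

```python
from collections import Counter

def build_word_index(texts, num_words):
    """Map the most frequent words to integer ids starting at 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(num_words - 1)]
    return {word: i + 1 for i, word in enumerate(most_common)}

def texts_to_sequences(texts, word_index):
    # Words missing from the index are dropped, as happens when no OOV token is set.
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` ids of long ones."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

texts = ["great film great cast", "bad film cast", "great great great film"]
word_index = build_word_index(texts, num_words=4)
sequences = texts_to_sequences(texts, word_index)
padded = pad_pre(sequences, maxlen=3)

print(word_index)  # {'great': 1, 'film': 2, 'cast': 3}
print(padded)      # [[2, 1, 3], [0, 2, 3], [1, 1, 2]]
```

Because truncation happens at the front, a review longer than the maximum sequence length keeps its final words rather than its opening ones.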

# Model Implementation

You are required to implement a transformer-based model for sentiment classification from scratch. You can use libraries like PyTorch or TensorFlow to implement the model architecture and training process.

You should include the architecture figure of the proposed model and provide a detailed explanation of why you chose this architecture.
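
Self-attention has no built-in notion of word order, so the model in this notebook adds sinusoidal positional encodings to the token embeddings. The NumPy sketch below builds the same kind of matrix, with sines in the even columns and cosines in the odd columns, so the values can be inspected directly. The function name is an assumption, and the sizes are the notebook's sequence length (200) and embedding dimension (128).

```python
import numpy as np

def sinusoidal_positional_encoding(maxlen, embed_dim):
    """Return a (maxlen, embed_dim) matrix: even columns hold sines, odd columns cosines."""
    if embed_dim % 2 != 0:
        raise ValueError("embed_dim must be even for this sketch")
    positions = np.arange(maxlen, dtype=np.float64)[:, np.newaxis]  # (maxlen, 1)
    # One frequency per sin/cos pair: 10000 ** (-2i / embed_dim)
    div_term = np.exp(
        np.arange(0, embed_dim, 2, dtype=np.float64) * -(np.log(10000.0) / embed_dim)
    )
    angles = positions * div_term  # (maxlen, embed_dim // 2)
    pos_enc = np.zeros((maxlen, embed_dim))
    pos_enc[:, 0::2] = np.sin(angles)
    pos_enc[:, 1::2] = np.cos(angles)
    return pos_enc

pe = sinusoidal_positional_encoding(maxlen=200, embed_dim=128)
print(pe.shape)
```

Position 0 encodes to alternating 0 and 1, and every value stays within [-1, 1], so the encoding can be added to the embeddings without changing their scale much.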

In [2]:

@tf.keras.utils.register_keras_serializable()
class PositionalEncoding(Layer):
    def __init__(self, maxlen, embed_dim, **kwargs):
        super(PositionalEncoding, self).__init__(**kwargs)
        self.maxlen = maxlen
        self.embed_dim = embed_dim

    def call(self, inputs):
        position_indices = tf.range(self.maxlen, dtype=tf.float32)[:, tf.newaxis]
        div_term = tf.exp(tf.range(0, self.embed_dim, 2, dtype=tf.float32) * -(tf.math.log(10000.0) / self.embed_dim))
        
        # Create positional encoding matrix using TensorFlow operations
        sinusoids = tf.expand_dims(position_indices * div_term, -1)
        pos_enc = tf.concat([tf.sin(sinusoids), tf.cos(sinusoids)], axis=-1)
        pos_enc = tf.reshape(pos_enc, [1, self.maxlen, self.embed_dim])
        
        return inputs + pos_enc

    def get_config(self):
        config = super(PositionalEncoding, self).get_config()
        config.update({
            "maxlen": self.maxlen,
            "embed_dim": self.embed_dim
        })
        return config


@tf.keras.utils.register_keras_serializable()
class TransformerBlock(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1, **kwargs):
        super(TransformerBlock, self).__init__(**kwargs)
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.ff_dim = ff_dim
        self.rate = rate  # stored so get_config can round-trip the dropout rate
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = tf.keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=None):
        # Self-attention sub-layer with residual connection and layer normalization
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        # Position-wise feed-forward sub-layer with residual connection and layer normalization
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

    def get_config(self):
        config = super(TransformerBlock, self).get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "ff_dim": self.ff_dim,
            "rate": self.rate,
        })
        return config

In [3]:
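# From-scratch multi-head self-attention, shown for reference. TransformerBlock above
# uses the built-in layers.MultiHeadAttention, so this class is not part of the trained model.
class MultiHeadSelfAttention(tf.keras.layers.Layer):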
    def __init__(self, embed_dim, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        self.num_heads = num_heads
        self.embed_dim = embed_dim
        self.projection_dim = embed_dim // num_heads
        self.query_dense = Dense(embed_dim)
        self.key_dense = Dense(embed_dim)
        self.value_dense = Dense(embed_dim)
        self.combine_heads = Dense(embed_dim)

    def attention(self, query, key, value):
        score = tf.matmul(query, key, transpose_b=True)
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32)
        scaled_score = score / tf.math.sqrt(dim_key)
        weights = tf.nn.softmax(scaled_score, axis=-1)
        output = tf.matmul(weights, value)
        return output, weights

    def separate_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs):
        batch_size = tf.shape(inputs)[0]
        query = self.query_dense(inputs)
        key = self.key_dense(inputs)
        value = self.value_dense(inputs)
        query = self.separate_heads(query, batch_size)
        key = self.separate_heads(key, batch_size)
        value = self.separate_heads(value, batch_size)
        attention, _ = self.attention(query, key, value)
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(attention, (batch_size, -1, self.embed_dim))
        output = self.combine_heads(concat_attention)
        return output



In [4]:

def build_transformer_model(maxlen, vocab_size, embed_dim, num_heads, ff_dim):
    inputs = Input(shape=(maxlen,))
    
    # Embedding layer
    embedding_layer = Embedding(input_dim=vocab_size, output_dim=embed_dim)(inputs)
    
    # Positional encoding
    positional_encoding = PositionalEncoding(maxlen, embed_dim)(embedding_layer)
    
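    # Transformer block. training=None lets Keras supply the flag itself
    # (True during fit, False during predict/evaluate) instead of forcing dropout on.
    transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)(positional_encoding, training=None)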
    
    # Pooling and output layers
    pooling_layer = GlobalAveragePooling1D()(transformer_block)
    dropout_layer = Dropout(0.1)(pooling_layer)
    outputs = Dense(1, activation="sigmoid")(dropout_layer)
    
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Model parameters
embed_dim = 128
num_heads = 4
ff_dim = 128

# Build the model
model = build_transformer_model(MAX_SEQUENCE_LENGTH, MAX_NUM_WORDS, embed_dim, num_heads, ff_dim)
model.summary()

# Training and Evaluation

Train your sentiment classification model on the preprocessed data. You should experiment with different hyperparameters and training configurations to achieve the best performance.

Evaluate your model using appropriate metrics and provide an analysis of the results.
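
Accuracy is a reasonable headline metric on a balanced dataset, but precision, recall, and F1 show whether errors are skewed toward one class. The sketch below derives them from confusion-matrix counts in plain Python. The labels are made-up values for illustration, not model output.

```python
def binary_classification_report(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
    }

# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
report = binary_classification_report(y_true, y_pred)
print(report)
```

With a trained model, `y_pred` would come from thresholding the sigmoid outputs on the held-out set at 0.5.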

In [5]:
# Compile the model
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train the model
history = model.fit(X_train, y_train, batch_size=32, epochs=5, validation_data=(X_val, y_val))
# Evaluate the model
loss, accuracy = model.evaluate(X_val, y_val)
print(f"Validation Accuracy: {accuracy * 100:.2f}%")


Epoch 1/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 161s 128ms/step - accuracy: 0.6992 - loss: 0.5176 - val_accuracy: 0.8855 - val_loss: 0.2693
Epoch 2/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 156s 125ms/step - accuracy: 0.9262 - loss: 0.1963 - val_accuracy: 0.8896 - val_loss: 0.2695
Epoch 3/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 155s 124ms/step - accuracy: 0.9550 - loss: 0.1300 - val_accuracy: 0.8682 - val_loss: 0.3831
Epoch 4/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 160s 128ms/step - accuracy: 0.9701 - loss: 0.0912 - val_accuracy: 0.8822 - val_loss: 0.3839
Epoch 5/5
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 158s 127ms/step - accuracy: 0.9838 - loss: 0.0540 - val_accuracy: 0.8764 - val_loss: 0.4469
313/313 ━━━━━━━━━━━━━━━━━━━━ 15s 47ms/step - accuracy: 0.8777 - loss: 0.4354
Validation Accuracy: 87.64%
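
In the log above, validation loss is lowest after epoch 1 (0.2693) and rises to 0.4469 by epoch 5 while training accuracy climbs to 0.9838, which is the usual signature of overfitting. One common response is patience-based early stopping. The sketch below applies that rule to the logged validation losses in plain Python. The function name and the patience value are assumptions. In Keras the same idea is available through `tf.keras.callbacks.EarlyStopping` with `monitor`, `patience`, and `restore_best_weights`.

```python
def best_epoch_with_patience(val_losses, patience=1, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 1-indexed, for a patience-based stopping rule."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Validation losses copied from the training log above.
val_losses = [0.2693, 0.2695, 0.3831, 0.3839, 0.4469]
best_epoch, stop_epoch = best_epoch_with_patience(val_losses, patience=2)
print(best_epoch, stop_epoch)  # 1 3
```

With a patience of 2, training would have stopped after epoch 3 and the epoch 1 weights would be the ones to keep.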


# Prediction and Inference

Implement a function that takes a movie review as input and predicts the sentiment (positive or negative). Test the function with custom reviews and display the predicted sentiment.

In [20]:
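def preprocess_review(review, tokenizer, max_sequence_length):
    # Apply the same cleaning used on the training data before tokenizing.
    # Stopwords were removed before the tokenizer was fitted, so they are not in
    # its vocabulary and texts_to_sequences drops them automatically.
    cleaned_review = clean_text(review)
    sequence = tokenizer.texts_to_sequences([cleaned_review])
    padded_sequence = pad_sequences(sequence, maxlen=max_sequence_length)
    return padded_sequence

def predict_sentiment(review, model, tokenizer, max_sequence_length):
    processed_review = preprocess_review(review, tokenizer, max_sequence_length)
    # model.predict returns an array of shape (1, 1); extract the scalar probability.
    probability = float(model.predict(processed_review)[0][0])
    sentiment = "positive" if probability > 0.5 else "negative"
    return sentiment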

custom_reviews = [
    "I absolutely loved this movie! The performances were outstanding and the story was captivating.",
    "The film was terrible. The plot made no sense and the acting was worse.",
    "An average movie. It had its moments but could have been better.",
    "A masterpiece! The director has outdone himself with this one.",
    "I was very disappointed with the storyline. It was too predictable."
]

# Test the function with custom reviews
for review in custom_reviews:
    sentiment = predict_sentiment(review, model, tokenizer, MAX_SEQUENCE_LENGTH)
    print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n")


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 75ms/step
Review: I absolutely loved this movie! The performances were outstanding and the story was captivating.
Predicted Sentiment: positive

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step
Review: The film was terrible. The plot made no sense and the acting was worse.
Predicted Sentiment: negative

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step
Review: An average movie. It had its moments but could have been better.
Predicted Sentiment: negative

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step
Review: A masterpiece! The director has outdone himself with this one.
Predicted Sentiment: positive

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step
Review: I was very disappointed with the storyline. It was too predictable.
Predicted Sentiment: negative
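
A bare label hides how close a prediction was to the decision boundary, which matters for mixed reviews such as the "average movie" example. The sketch below turns a sigmoid output into a label plus a confidence score. The function name and the probabilities passed to it are hypothetical, not values produced by the model above.

```python
def describe_prediction(probability, threshold=0.5):
    """Turn a sigmoid output into a label and the model's confidence in that label."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    label = "positive" if probability > threshold else "negative"
    # Confidence is the probability mass assigned to the chosen label.
    confidence = probability if label == "positive" else 1.0 - probability
    return label, confidence

# Hypothetical probabilities, used only to exercise the function.
for p in (0.93, 0.48, 0.05):
    label, confidence = describe_prediction(p)
    print(f"p={p:.2f} -> {label} ({confidence:.0%} confident)")
```

Returning the confidence alongside the label also gives an API or front-end something more informative to display.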



# Model Deployment

Save the trained model and any other necessary files. Prepare the model for deployment using Flask or FastAPI, write a Python script for the API, and include a basic front-end for the API.

In [21]:
model.save('sentiment_model.keras')
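
For the API script, it can help to keep request validation separate from the web framework. The sketch below is a framework-agnostic handler that a Flask or FastAPI route could call. `handle_predict`, `fake_predict`, the status codes chosen, and the length limit are all assumptions made for illustration. In a real service `predict_fn` would wrap `predict_sentiment` with the loaded model and tokenizer.

```python
import json

MAX_REVIEW_CHARS = 20000  # hypothetical limit, not taken from the notebook

def handle_predict(raw_body, predict_fn):
    """Validate a JSON request body and return (status_code, response_dict)."""
    try:
        payload = json.loads(raw_body)
    except (json.JSONDecodeError, TypeError):
        return 400, {"error": "body must be valid JSON"}
    review = payload.get("review") if isinstance(payload, dict) else None
    if not isinstance(review, str) or not review.strip():
        return 422, {"error": "'review' must be a non-empty string"}
    if len(review) > MAX_REVIEW_CHARS:
        return 413, {"error": "review is too long"}
    return 200, {"review": review, "sentiment": predict_fn(review)}

def fake_predict(review):
    # Stand-in for predict_sentiment(review, model, tokenizer, MAX_SEQUENCE_LENGTH).
    return "positive" if "loved" in review.lower() else "negative"

status, body = handle_predict('{"review": "I loved it"}', fake_predict)
print(status, body)
```

When the service starts, the saved files can be restored with `tf.keras.models.load_model('sentiment_model.keras')` and `pickle.load` on `tokenizer.pkl`. The module that defines `PositionalEncoding` and `TransformerBlock` needs to be imported first so that the registered custom layers are available during loading.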

# Submission

You need to create a GitHub repository and submit the link to the repository containing the complete code, documentation, and any other necessary files.

The repository should include:
- A README file with detailed instructions on how to train, evaluate, and use the model.
- A notebook showcasing the complete process, including data loading, preprocessing, model implementation, training, and evaluation.
- Python scripts for the training, evaluation, and inference functions.
- A Python script for the API.
- Front-end code for the API.
