# Overview
This notebook builds a sentiment classifier for the text of 10,000 movie reviews from IMDb. The reviews are labeled as 48.22% **positive** and 50.78% **negative**. The shortest and longest reviews are 6 and 1307 characters long, respectively. The project is split into two parts:
- The first part uses a neural network built with the Python TensorFlow library:
    - Using Dropout methods to counter overfitting
    - Using L2 regularization techniques
- The second part uses a Naive Bayes approach to create a sentiment classifier model

## 1. TensorFlow Neural Network
### 1.1 Database Processing and Analysis
For this project, I used a database of 10,000 movie reviews categorized into **positive** and **negative** reviews. The data was split into 60-20-20% sets for training, cross-validation, and testing, respectively. All letters were converted to lowercase and non-alphanumeric characters (e.g., punctuation) were removed because they are not useful in revealing any sentiment about the movie review. To speed up the training process, reviews longer than 200 words (about 37.1% of the total set) were truncated. The **positive** and **negative** labels were converted to binary values of 1 and 0. The reviews were shuffled, and the number of **positive** and **negative** reviews in each of the three sets was cross-checked to avoid skewed datasets. Finally, the text of the reviews was tokenized using `preprocessing.text.Tokenizer` from the `keras` library. This created an index dictionary of all unique words, with the most frequently used words having the lowest indices. The dictionary was then used to convert each review's text into a sequence of integer indices used for processing.
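
To make the tokenization step concrete, here is a minimal plain-Python sketch of the idea: words are ranked by frequency so that the most common word gets index 1, and each review becomes a fixed-length list of indices. It is an illustration only, not the Keras implementation. The helper names `build_word_index` and `to_padded_sequence` and the toy reviews are assumptions made for this example. The sketch mirrors the Keras defaults, where index 0 is reserved for padding and both padding and truncation happen at the start of a sequence.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency: the most frequent word gets index 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequence(text, word_index, max_len):
    """Map known words to indices, keep the last max_len tokens, and left-pad with zeros."""
    seq = [word_index[w] for w in text.split() if w in word_index]
    seq = seq[-max_len:]                     # 'pre' truncation drops the start of long reviews
    return [0] * (max_len - len(seq)) + seq  # 'pre' padding adds zeros at the front

# Toy reviews, already lowercased and stripped of punctuation
reviews = ["great movie great cast", "bad movie", "great fun"]
word_index = build_word_index(reviews)

print(word_index)
# {'great': 1, 'movie': 2, 'cast': 3, 'bad': 4, 'fun': 5}
print(to_padded_sequence("bad movie", word_index, max_len=4))
# [0, 0, 4, 2]
print(to_padded_sequence("great movie great cast", word_index, max_len=3))
# [2, 1, 3]
```

Note that with start-of-sequence truncation, a review longer than the limit keeps its final words rather than its opening words.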

In [None]:
import pandas as pd
import numpy as np
import re
import pickle
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.regularizers import l2
import tensorflow as tf

In [None]:
def clean_text(text):
    # Lowercase, then drop everything except letters, digits, and whitespace
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text
    
def load_and_preprocess_data(csv_file, max_words=10000, max_len=200):
    """
    Load and preprocess the review data from CSV file with class balance verification
    
    Parameters:
    -----------
    csv_file : str
        Path to CSV file containing reviews and sentiments
    max_words : int, optional (default=10000)
        Maximum number of words to keep in vocabulary
    max_len : int, optional (default=200)
        Maximum length of each sequence
        
    Returns:
    --------
    X : array-like
        Padded sequences of reviews
    y : array-like
        Binary sentiment labels
    tokenizer : Tokenizer
        Fitted tokenizer object
    """
    # Read the CSV file
    print(f"\nLoading data from {csv_file}...")
    df = pd.read_csv(csv_file)
    total_samples = len(df)
    print(f"Total number of reviews: {total_samples:,}")
    
    # Check class distribution before conversion
    sentiment_counts = df['sentiment'].value_counts()
    print("\nOriginal class distribution:")
    for sentiment, count in sentiment_counts.items():
        percentage = (count/total_samples) * 100
        print(f"{sentiment}: {count:,} reviews ({percentage:.1f}%)")

    # Convert sentiment to binary (case-insensitive, so 'positive' and 'Positive' both map to 1)
    df['sentiment'] = (df['sentiment'].str.strip().str.lower() == 'positive').astype(int)

    # Verify expected class balance
    n_positive = (df['sentiment'] == 1).sum()
    n_negative = (df['sentiment'] == 0).sum()
    if n_positive != 5000 or n_negative != 5000:
        print("\nWARNING: Unexpected class distribution!")
        print("Expected: 5,000 positive and 5,000 negative")
        print(f"Found: {n_positive:,} positive and {n_negative:,} negative")
    
    # Get text length statistics
    df['review_length'] = df['review'].str.len()
    print("\nReview length statistics:")
    print(f"Mean length: {df['review_length'].mean():.1f} characters")
    print(f"Median length: {df['review_length'].median():.1f} characters")
    print(f"Max length: {df['review_length'].max():,} characters")
    print(f"Min length: {df['review_length'].min():,} characters")
    
    # Remove special characters and set all to lower case
    df['review'] = df['review'].apply(clean_text)

    # Initialize and fit tokenizer
    print("\nTokenizing reviews...")
    tokenizer = Tokenizer(num_words=max_words)
    tokenizer.fit_on_texts(df['review'])
    
    # Get vocabulary statistics
    vocab_size = len(tokenizer.word_index)
    print(f"Total unique words: {vocab_size:,}")
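    # Keras keeps only the num_words - 1 most frequent words (index 0 is reserved)
    print(f"Keeping top {min(vocab_size, max_words - 1):,} words")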
    
    # Convert text to sequences
    sequences = tokenizer.texts_to_sequences(df['review'])
    
    # Get sequence length statistics before padding
    seq_lengths = [len(seq) for seq in sequences]
    print("\nSequence length statistics (before padding):")
    print(f"Mean length: {np.mean(seq_lengths):.1f} words")
    print(f"Median length: {np.median(seq_lengths):.1f} words")
    print(f"Max length: {max(seq_lengths):,} words")
    print(f"Min length: {min(seq_lengths):,} words")
    
    # Calculate how many sequences will be truncated
    n_truncated = sum(len(seq) > max_len for seq in sequences)
    if n_truncated > 0:
        print(f"\nWARNING: {n_truncated:,} reviews ({(n_truncated/total_samples)*100:.1f}%) "
              f"will be truncated to {max_len} words")
    
    # Pad short sequences and truncate long ones (Keras defaults act on the start of each sequence)
    print(f"\nPadding sequences to length {max_len}...")
    X = pad_sequences(sequences, maxlen=max_len)
    y = df['sentiment'].values
    
    # Final verification of processed data
    print("\nFinal processed data shape:")
    print(f"X shape: {X.shape}")
    print(f"y shape: {y.shape}")
    print(f"Class balance in processed data: {np.mean(y)*100:.1f}% positive")
    
    return X, y, tokenizer
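
One way the 60-20-20 split described in Section 1.1 could be applied to the returned `X` and `y` is sketched below. This is an assumption about the workflow, not necessarily the exact code used in this project. The helper name `split_60_20_20`, the seed, and the toy arrays are illustrative. The sketch calls `train_test_split` twice with `stratify`, so each set keeps the same share of **positive** and **negative** labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_60_20_20(X, y, seed=42):
    """Shuffle and split into 60% train, 20% validation, 20% test, keeping class ratios."""
    # First hold out 20% of the data as the test set
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed)
    # 25% of the remaining 80% equals 20% of the full data set
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

# Toy stand-in for the padded sequences and labels (100 balanced samples)
X_demo = np.arange(100 * 5).reshape(100, 5)
y_demo = np.array([0, 1] * 50)

(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_60_20_20(X_demo, y_demo)
for name, labels in [("train", train_y), ("validation", val_y), ("test", test_y)]:
    print(f"{name}: {len(labels)} samples, {labels.mean() * 100:.1f}% positive")
```

Printing the positive share of each set is a quick way to cross-check that none of the three sets is skewed.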

### 1.2 Creating the Model
One of the main concerns in creating the neural network model was avoiding overfitting, that is, good performance on the training set but poor performance on unseen data. This is why the cross-validation set is used to help choose the hyperparameters of the model. Two approaches are explored to counteract overfitting:
- A Dropout layer that randomly turns off neuron units during training
- L2 regularization that penalizes large weights within the neural layers
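
As a rough numerical illustration of these two techniques, the NumPy sketch below computes an L2 penalty and applies dropout to a small activation vector. It is a conceptual sketch, not the Keras internals. The function names `l2_penalty` and `dropout`, the coefficient `lam`, and the toy arrays are assumptions made for this example. In Keras, the corresponding pieces are the `Dropout` layer and the `l2` regularizer imported at the top of this notebook.

```python
import numpy as np

def l2_penalty(weights, lam):
    """L2 term added to the loss: lam times the sum of squared weights."""
    return lam * np.sum(np.square(weights))

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is only active during training
    keep_mask = rng.random(activations.shape) >= rate
    # Dividing by (1 - rate) keeps the expected activation unchanged
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)

weights = np.array([[0.5, -1.0], [2.0, 0.0]])
penalty = l2_penalty(weights, lam=0.01)  # 0.01 * (0.25 + 1.0 + 4.0 + 0.0)
print(penalty)

activations = np.ones((1, 8))
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)  # each unit is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

Larger weights raise the penalty, so the optimizer is pushed toward smaller weights, while dropout prevents the network from relying on any single unit.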

In [None]:
def create_model(max_words, max_len, embedding_dim=100):
    """
    Create and compile the neural network model
    """
    # Adam optimizer with a low learning rate and gradient clipping; other values are the Keras defaults
    adam_optimizer = tf.keras.optimizers.Adam(
        learning_rate=5e-5,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-7,
        amsgrad=False,
        clipvalue=1.0
    )

    model = Sequential([
        Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=max_len),
        LSTM(64),
        Dense(32, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid')
    ])
    
    model.compile(
        optimizer=adam_optimizer,
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    
    return model