
# Sentiment Analysis Using Naïve Bayes Text Classification



Steps:
1. Import the necessary libraries
2. Load and clean the data
   - Handle missing values.
   - Convert labels from `neg` and `pos` to 0 and 1.
3. Preprocess the text
   - Remove punctuation and symbols using regular expressions.
   - Apply lemmatization, stop word removal, and logical negation handling.
4. Split the data into training and testing sets (80% train, 20% test)
5. Train a Naive Bayes classifier
   - Compute prior and likelihood probabilities.
   - Implement Laplace smoothing.
6. Evaluate the model
   - Test the model on the 20% test set.
   - Generate a confusion matrix and calculate precision, recall, F1 score, and accuracy.
7. Run the experiments with different preprocessing configurations:
   - Without lemmatization, stop word removal, or negation handling.
   - With lemmatization.
   - With lemmatization and stop word removal.
   - With lemmatization, stop word removal, and negation handling.

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score
import spacy
import re

# Load the spacy English model for lemmatization and negation handling
nlp = spacy.load('en_core_web_sm')


Load and clean the data:

In [16]:
# Load the data from the .tsv file
data = pd.read_table('moviereviews.tsv')

# Drop rows that have a missing value in any column
data.dropna(inplace=True)

# Convert the labels from 'neg' to 0 and 'pos' to 1
data['label'] = data['label'].map({'neg': 0, 'pos': 1})

# Quick check on the data (a notebook cell displays only its last expression, here data.shape)
data.head()
data.shape

(1965, 2)

Text Preprocessing Function:

1. Punctuation Removal: Removes punctuation and special symbols using a regex pattern that keeps only word characters and whitespace.
2. Convert to spaCy Document: Converts the text to a spaCy document object for NLP processing.
3. Token Processing: Iterates through the tokens. If negation handling is enabled, a token that spaCy tags as a negation modifier (`dep_ == 'neg'`) is replaced by "NOT_" plus the lemma of the word it modifies. Otherwise the token is lemmatized (reduced to its base form) if lemmatization is enabled, or kept as-is.
4. Stop Word Removal: Removes stop words (common words like "the", "and") if enabled.
5. Output: Returns the processed text as a space-separated string of tokens.
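
As a small illustration of step 1, the sketch below applies the same regex pattern to a made-up sentence. The helper name `strip_punctuation` and the sample text are assumptions for illustration only and are not part of the notebook.

```python
import re

def strip_punctuation(text):
    # Same pattern as in preprocess_text: drop every character that is
    # neither a word character (\w) nor whitespace (\s).
    return re.sub(r'[^\w\s]', '', text)

# Hypothetical sample sentence, not taken from the dataset
sample = "I didn't like it, at all!"
cleaned = strip_punctuation(sample)
print(cleaned)
```

Note that the apostrophe is removed along with the other punctuation, so a contraction such as "didn't" reaches spaCy as "didnt". This may affect whether the parser recognizes the negation in step 3.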

In [17]:
def preprocess_text(text, lemmatize_words=True, remove_stop_words=True, handle_logical_negation=True):
    # Remove punctuation and symbols using regex.
    # Note: this also strips apostrophes, so "didn't" becomes "didnt" before parsing.
    text = re.sub(r'[^\w\s]', '', text)

    # Convert to a spaCy document object
    doc = nlp(text)

    processed_tokens = []

    for token in doc:
        # Handle logical negations if enabled: replace the negation token
        # with 'NOT_' + the lemma of the word it modifies.
        # Note: the modified word is still appended on its own iteration.
        if handle_logical_negation and token.dep_ == 'neg':
            processed_tokens.append('NOT_' + token.head.lemma_)
        elif lemmatize_words:
            processed_tokens.append(token.lemma_)
        else:
            processed_tokens.append(token.text)

    # Remove stop words if the flag is enabled.
    # The lookup is done on the processed strings (lemmas or raw text).
    if remove_stop_words:
        processed_tokens = [token for token in processed_tokens if not nlp.vocab[token].is_stop]

    return ' '.join(processed_tokens)


Train-test Split:

1. The function `prepare_data` takes in a dataset and preprocessing arguments for text data.
2. It preprocesses the 'review' column of the dataset using `preprocess_text` and stores the result in a new column `processed_review`.
3. The data is then split into training (80%) and testing (20%) sets.
4. A `CountVectorizer` is used to transform the text data into feature vectors for both training and testing sets.
5. The function returns the vectorized training and testing sets along with their corresponding labels.
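
To make step 4 more concrete, here is a minimal pure-Python sketch of a bag-of-words transform. It imitates what I understand to be `CountVectorizer`'s default behavior (lowercasing, and tokens of two or more word characters); it is not the library's implementation. The names `tokenize`, `fit_vocabulary`, and `transform`, as well as the toy documents, are hypothetical.

```python
import re
from collections import Counter

# Assumed to mirror CountVectorizer's default token pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r'\b\w\w+\b')

def tokenize(text):
    # Lowercase first, then extract tokens
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(train_docs):
    # The vocabulary is built from the training documents only
    vocab = sorted({tok for doc in train_docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(vocab)}

def transform(docs, vocab):
    # One row of counts per document; words outside the vocabulary are ignored
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocab)
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

# Toy documents for illustration only
train_docs = ["great film great cast", "boring film"]
test_docs = ["great plot", "NOT_great film"]

vocab = fit_vocabulary(train_docs)
X_train = transform(train_docs, vocab)
X_test = transform(test_docs, vocab)
print(vocab)
print(X_train)
print(X_test)
```

Two points are worth noticing in this sketch. Words that appear only in the test documents ("plot") contribute nothing, because the vocabulary is fitted on the training set. Lowercasing also turns a marker such as "NOT_great" into "not_great", which is treated as a single token distinct from "great".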

In [18]:
# Preprocess the reviews, split them into training and testing sets, and vectorize them
def prepare_data(data, preprocess_args):
    # Preprocess reviews based on the provided arguments
    data['processed_review'] = data['review'].apply(lambda x: preprocess_text(x, **preprocess_args))
    
    # Split data into training (80%) and testing (20%)
    X_train, X_test, y_train, y_test = train_test_split(data['processed_review'], data['label'], test_size=0.2, random_state=42)
    
    # Use CountVectorizer to transform text data into feature vectors
    vectorizer = CountVectorizer()
    X_train_vect = vectorizer.fit_transform(X_train)
    X_test_vect = vectorizer.transform(X_test)
    
    return X_train_vect, X_test_vect, y_train, y_test


Training Naive Bayes Classifier:

In [19]:
def train_naive_bayes(X_train_vect, y_train):
    # Initialize the Naive Bayes classifier
    model = MultinomialNB()
    # Train the model
    model.fit(X_train_vect, y_train)
    
    return model
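
The notebook relies on `MultinomialNB` to compute the priors, the likelihoods, and the Laplace smoothing (its `alpha` parameter defaults to 1.0). As a rough sketch of what that involves, the code below fits a multinomial Naive Bayes model by hand on a toy count matrix. The function names and the toy data are assumptions for illustration, and this is not scikit-learn's implementation.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    # X: list of count rows, y: list of class labels
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [row for row, label in zip(X, y) if label == c]
        # Prior: fraction of training documents that belong to class c
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word across the documents of class c
        feature_totals = [sum(col) for col in zip(*rows)]
        # Laplace smoothing: add alpha to every word count so that
        # unseen words do not get a zero probability
        denom = sum(feature_totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in feature_totals]
    return log_prior, log_likelihood

def predict(row, log_prior, log_likelihood):
    # Score = log prior + sum over words of count * log likelihood
    scores = {
        c: log_prior[c] + sum(n * ll for n, ll in zip(row, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy count matrix with hypothetical columns [bad, good, movie]
X_toy = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]]
y_toy = [0, 0, 1, 1]

log_prior, log_likelihood = fit_multinomial_nb(X_toy, y_toy)
pred = predict([0, 1, 1], log_prior, log_likelihood)
print(pred)
```

In this toy example the word "good" never occurs in class 0, yet smoothing gives it a likelihood of 1/8 rather than 0, so a single unseen word cannot rule out a class on its own.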


Evaluate the Model:
The `evaluate_model` function measures a trained classifier's performance on the test set:

1. It takes a trained model, the vectorized test data (`X_test_vect`), and the true labels (`y_test`).
2. The model predicts labels (`y_pred`) for the test data.
3. A confusion matrix is calculated to summarize prediction performance (rows are true labels, columns are predicted labels).
4. It computes precision, recall, F1 score, and accuracy; precision, recall, and F1 are reported for the positive class (label 1).
5. It returns the confusion matrix, precision, recall, F1 score, and accuracy, in that order.

In [20]:
def evaluate_model(model, X_test_vect, y_test):
    # Predict on the test set
    y_pred = model.predict(X_test_vect)
    
    # Confusion matrix
    conf_matrix = confusion_matrix(y_test, y_pred)
    
    # Calculate precision, recall, F1 score, and accuracy
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    
    return conf_matrix, precision, recall, f1, accuracy
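
The four metrics can also be derived directly from the confusion matrix, which helps when checking the printed results. The sketch below assumes the layout `[[TN, FP], [FN, TP]]` for labels 0 and 1 and uses the Scenario 1 matrix reported in the output of this notebook. The helper name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(conf_matrix):
    # Rows are true labels, columns are predicted labels (label order 0, 1)
    (tn, fp), (fn, tp) = conf_matrix
    precision = tp / (tp + fp)   # of the predicted positives, how many are correct
    recall = tp / (tp + fn)      # of the actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Confusion matrix printed for Scenario 1
precision, recall, f1, accuracy = metrics_from_confusion([[169, 33], [50, 141]])
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}, Accuracy: {accuracy:.4f}")
```

For that matrix, the values agree with the reported Scenario 1 figures (precision about 0.810, recall about 0.738, F1 about 0.773, accuracy about 0.789).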


Running the Experiments:

This code defines four preprocessing configurations that toggle lemmatization, stop word removal, and logical negation handling. For each configuration, the data is preprocessed, split into training and test sets, and vectorized. Because `prepare_data` uses a fixed `random_state`, every scenario is evaluated on the same split. A Naive Bayes model is then trained on the training data and evaluated on the test data, and the confusion matrix, precision, recall, F1 score, and accuracy are printed for each scenario.

In [21]:
# Define different preprocessing configurations
scenarios = [
    {"lemmatize_words": False, "remove_stop_words": False, "handle_logical_negation": False},
    {"lemmatize_words": True, "remove_stop_words": False, "handle_logical_negation": False},
    {"lemmatize_words": True, "remove_stop_words": True, "handle_logical_negation": False},
    {"lemmatize_words": True, "remove_stop_words": True, "handle_logical_negation": True}
]

# Loop through each scenario and evaluate the model
for i, preprocess_args in enumerate(scenarios, 1):
    print(f"Scenario {i}: {preprocess_args}")
    
    # Preprocess data and split into training and test sets
    X_train_vect, X_test_vect, y_train, y_test = prepare_data(data, preprocess_args)
    
    # Train the model
    model = train_naive_bayes(X_train_vect, y_train)
    
    # Evaluate the model
    conf_matrix, precision, recall, f1, accuracy = evaluate_model(model, X_test_vect, y_test)
    
    # Print results
    print(f"Confusion Matrix:\n{conf_matrix}")
    print(f"Precision: {precision}, Recall: {recall}, F1 Score: {f1}, Accuracy: {accuracy}\n")


Scenario 1: {'lemmatize_words': False, 'remove_stop_words': False, 'handle_logical_negation': False}
Confusion Matrix:
[[169  33]
 [ 50 141]]
Precision: 0.8103448275862069, Recall: 0.7382198952879581, F1 Score: 0.7726027397260274, Accuracy: 0.7888040712468194

Scenario 2: {'lemmatize_words': True, 'remove_stop_words': False, 'handle_logical_negation': False}
Confusion Matrix:
[[170  32]
 [ 50 141]]
Precision: 0.815028901734104, Recall: 0.7382198952879581, F1 Score: 0.7747252747252747, Accuracy: 0.7913486005089059

Scenario 3: {'lemmatize_words': True, 'remove_stop_words': True, 'handle_logical_negation': False}
Confusion Matrix:
[[166  36]
 [ 46 145]]
Precision: 0.8011049723756906, Recall: 0.7591623036649214, F1 Score: 0.7795698924731181, Accuracy: 0.7913486005089059

Scenario 4: {'lemmatize_words': True, 'remove_stop_words': True, 'handle_logical_negation': True}
Confusion Matrix:
[[165  37]
 [ 46 145]]
Precision: 0.7967032967032966, Recall: 0.7591623036649214, F1 Score: 0.77747989276