## NLP - Emotion Classification in Text
### Objective:
Develop machine learning models to classify emotions in text samples.

### 1. Loading and Preprocessing (3 marks)
Load the dataset and perform necessary preprocessing steps. This should include text cleaning, tokenization, and removal of stopwords. Explain the preprocessing techniques used and their impact on model performance.

In [1]:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

In [2]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
# Load the dataset
data = pd.read_csv("nlp_dataset.csv")
# Display the first few rows of the dataset
print(data.head())

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\91954\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\91954\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


                                             Comment Emotion
0  i seriously hate one subject to death but now ...    fear
1                 im so full of life i feel appalled   anger
2  i sit here to write i start to dig out my feel...    fear
3  ive been really angry with r and i feel like a...     joy
4  i feel suspicious if there is no one outside l...    fear


In [3]:
# Define a function for text cleaning
def clean_text(text):
    # Remove punctuation and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    return text

# Apply text cleaning
data['cleaned_comment'] = data['Comment'].apply(clean_text)

In [4]:
# Tokenization and stopword removal
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

data['processed_comment'] = data['cleaned_comment'].apply(preprocess_text)

# Display the processed comments
print(data[['Comment', 'processed_comment', 'Emotion']].head())

                                             Comment  \
0  i seriously hate one subject to death but now ...   
1                 im so full of life i feel appalled   
2  i sit here to write i start to dig out my feel...   
3  ive been really angry with r and i feel like a...   
4  i feel suspicious if there is no one outside l...   

                                   processed_comment Emotion  
0  seriously hate one subject death feel reluctan...    fear  
1                         im full life feel appalled   anger  
2  sit write start dig feelings think afraid acce...    fear  
3  ive really angry r feel like idiot trusting fi...     joy  
4  feel suspicious one outside like rapture happe...    fear  


### Text Cleaning:
Text cleaning removes characters such as punctuation and digits, which carry little signal for this task, and converts all text to lowercase so that the model does not treat the same word in different cases as distinct (e.g., "Happy" vs. "happy"). This shrinks the vocabulary and makes the feature matrix less sparse, which generally helps a model learn from a limited number of samples.

### Tokenization:
Tokenization splits the cleaned text into individual words, or tokens. Here we use `word_tokenize` from NLTK (Natural Language Toolkit). Working with tokens lets us filter out unwanted words and, later, count word occurrences, which is the structured input that the feature extraction step and the machine learning algorithms need.

### Stopword Removal:
Stopword removal eliminates common words (such as "and," "the," and "is") that carry little meaning on their own and can add noise to the analysis. Using NLTK's predefined list of English stopwords, we filter these words out of the tokenized text. This reduces the dimensionality of the feature space and lets the model focus on the words that contribute to emotional expression. Taken together, these preprocessing steps give the models cleaner input, which tends to improve their ability to classify emotions in text samples.
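One side effect of stopword removal is worth checking for emotion data: NLTK's English list contains negations such as "not" and "no", so removing stopwords can flip the apparent sentiment of a comment. The sketch below illustrates this with a small hand-written stopword set (`TOY_STOPWORDS` and `NEGATIONS` are hypothetical names used only for this example, not part of the notebook).

```python
# Hypothetical, hand-written stopword set for illustration only;
# the notebook itself uses NLTK's full English list.
TOY_STOPWORDS = {"i", "am", "the", "is", "not", "no", "to", "a"}
NEGATIONS = {"not", "no"}

def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)

sentence = "i am not happy"
plain = remove_stopwords(sentence, TOY_STOPWORDS)
keep_negations = remove_stopwords(sentence, TOY_STOPWORDS - NEGATIONS)

print(plain)           # happy
print(keep_negations)  # not happy
```

Whether keeping negations actually helps on this dataset is an open question; it would need to be tested by retraining the models with the modified stopword set.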

### 2. Feature Extraction (2 marks)
Implement feature extraction using CountVectorizer or TfidfVectorizer. Describe how the chosen method transforms the text data into numerical features.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the processed comments into TF-IDF features
X = vectorizer.fit_transform(data['processed_comment'])
y = data['Emotion']

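TF-IDF transforms each comment into a numerical vector with one entry per vocabulary word. Each entry grows with how often the word occurs in that comment (term frequency) and shrinks with how many comments contain the word (inverse document frequency). Words that appear in many comments (for example "feel" in the rows shown above) therefore receive low weights, while words that are distinctive for specific emotions are emphasized.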
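To make the weighting concrete, here is a minimal pure-Python sketch of the computation. It assumes the settings that scikit-learn documents as the `TfidfVectorizer` defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation); the whitespace tokenization and the `tfidf` helper are simplifications introduced for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses raw term counts, the smoothed IDF ln((1 + n) / (1 + df)) + 1,
    and L2 normalisation of each document vector.
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: tf * idf[term] for term, tf in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors

# Made-up toy comments, already cleaned and stopword-free
toy_docs = ["feel happy today", "feel afraid", "feel angry today"]
weights = tfidf(toy_docs)
for doc, vec in zip(toy_docs, weights):
    print(doc, {t: round(w, 2) for t, w in vec.items()})
```

In this toy corpus "feel" occurs in every document, so it gets the smallest weight in each vector, while "happy", "afraid" and "angry" each occur once and get the largest.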

### 3. Model Development (2 marks)
Train the following machine learning models: Naive Bayes and Support Vector Machine.

### Naive Bayes Model

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Make predictions and evaluate the model
nb_predictions = nb_model.predict(X_test)
nb_accuracy = accuracy_score(y_test, nb_predictions)
nb_f1_score = f1_score(y_test, nb_predictions, average='weighted')

print(f'Naive Bayes Accuracy: {nb_accuracy:.2f}, F1 Score: {nb_f1_score:.2f}')

Naive Bayes Accuracy: 0.91, F1 Score: 0.91


### Support Vector Machine Model

In [7]:
from sklearn.svm import SVC

# Initialize and train the SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Make predictions and evaluate the model
svm_predictions = svm_model.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_predictions)
svm_f1_score = f1_score(y_test, svm_predictions, average='weighted')

print(f'SVM Accuracy: {svm_accuracy:.2f}, F1 Score: {svm_f1_score:.2f}')

SVM Accuracy: 0.95, F1 Score: 0.95
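A caveat about the cells above: the `TfidfVectorizer` was fitted on the whole dataset before the train/test split, so the vocabulary and IDF weights have already seen the test comments, and the reported scores may be slightly optimistic. One possible way to avoid this is to put the vectorizer and the classifier in a single scikit-learn pipeline and fit it on the training texts only. The sketch below uses made-up toy texts; `leak_free_model` and the example sentences are assumptions for illustration, not results from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy, made-up examples standing in for data['processed_comment'] and data['Emotion']
train_texts = ["feel afraid dark", "feel scared alone",
               "feel happy today", "feel joyful sunny"]
train_labels = ["fear", "fear", "joy", "joy"]
test_texts = ["feel afraid alone", "feel happy sunny"]

# The vectorizer is fitted inside the pipeline, on the training texts only
leak_free_model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
leak_free_model.fit(train_texts, train_labels)

# Test texts are transformed with the training vocabulary and IDF weights
toy_predictions = leak_free_model.predict(test_texts)
print(list(toy_predictions))
```

In the notebook this would mean calling `train_test_split` on `data['processed_comment']` and `data['Emotion']` first, then fitting the pipeline on the training portion.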


### 4. Model Comparison (2 marks)
Evaluate the model using appropriate metrics (e.g., accuracy, F1-score). Provide a brief explanation of the chosen model and its suitability for emotion classification.

In [8]:
# Print comparison results
print("\nModel Comparison:")
print(f"Naive Bayes - Accuracy: {nb_accuracy:.2f}, F1 Score: {nb_f1_score:.2f}")
print(f"SVM - Accuracy: {svm_accuracy:.2f}, F1 Score: {svm_f1_score:.2f}")


Model Comparison:
Naive Bayes - Accuracy: 0.91, F1 Score: 0.91
SVM - Accuracy: 0.95, F1 Score: 0.95


### Model Comparison
After training our models, we evaluated their performance using two key metrics: accuracy and F1 score.

#### Naive Bayes

- Accuracy: 0.91
- F1 Score: 0.91

#### Support Vector Machine (SVM)

- Accuracy: 0.95
- F1 Score: 0.95

### Explanation of Metrics

#### Accuracy: 
Accuracy is the fraction of predictions the model got right. For example, an accuracy of 0.95 means that the SVM model predicted the correct emotion for about 95 out of every 100 test samples. Higher is better, although accuracy alone can be misleading when some emotions are much more frequent than others.

#### F1 Score: 
The F1 score combines two aspects of model performance, precision and recall, which are computed for each emotion class.

Precision is the share of samples predicted as a given emotion that actually belong to that emotion.

Recall is the share of samples that actually belong to a given emotion that the model identified correctly.

The F1 score is the harmonic mean of precision and recall, so it is high only when both are high. Because the code uses `average='weighted'`, the per-class F1 scores are averaged with weights proportional to the number of test samples in each class. A weighted F1 score of 0.95 for the SVM means that it identifies emotions precisely without missing too many.
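The definitions above can be sketched in a few lines of plain Python. The `weighted_f1` helper and the labels below are made up for illustration; the notebook itself relies on `accuracy_score` and `f1_score` from scikit-learn.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        # True positives: samples of this class that were predicted as this class
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * n_true
    return total / len(y_true)

# Made-up labels for illustration
y_true = ["joy", "joy", "fear", "anger"]
y_pred = ["joy", "fear", "fear", "anger"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                     # 0.75
print(weighted_f1(y_true, y_pred))  # about 0.75 for these labels
```

Here "joy" has perfect precision but a recall of 0.5, while "fear" has perfect recall but a precision of 0.5; both end up with an F1 of about 0.67, which shows how the harmonic mean penalises an imbalance between the two.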

## Conclusion
In this comparison, the SVM model outperformed the Naive Bayes model in both accuracy and F1 score (0.95 vs. 0.91 on the test set). For tasks like identifying emotions expressed in writing, a linear SVM on TF-IDF features therefore appears to be the more reliable choice of the two, based on these results from a single train/test split.