## NLP - Emotion Classification in Text

### To develop machine learning models to classify emotions in text samples, I'll go through the following steps:

- Loading and Preprocessing
- Feature Extraction
- Model Development
- Model Comparison

Step 1: Loading and Preprocessing
First, load the dataset and inspect it. Then clean the text, tokenize it, remove stopwords, and lemmatize the remaining tokens.

Loading the Dataset
I will load the dataset from the provided link and check its structure. The dataset has two columns: Comment (the text sample) and Emotion (the label).

Preprocessing Techniques

- Text Cleaning: Removing punctuation, numbers, and special characters.
- Tokenization: Splitting text into words (tokens).
- Stopwords Removal: Removing common words that contribute little to the meaning (for example, "the" or "is").
- Lemmatization/Stemming: Reducing words to their base form.

Impact on Model Performance

- Text Cleaning reduces noise in the text data, which gives the vectorizer a smaller, cleaner vocabulary.
- Tokenization converts text into a format suitable for feature extraction.
- Stopwords Removal reduces the dimensionality of the text data and focuses on meaningful words.
- Lemmatization/Stemming normalizes words so that inflected forms share one feature, which reduces redundancy and can improve model accuracy.
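
To make the effect of each preprocessing step visible, here is a minimal standard-library sketch. It is not the notebook's implementation: `TOY_STOPWORDS` is a hypothetical, deliberately tiny stop list standing in for NLTK's English list, whitespace splitting stands in for `word_tokenize`, and lemmatization is left out.

```python
import re
import string

# Hypothetical toy stop list (assumption); the notebook uses NLTK's English list.
TOY_STOPWORDS = {"i", "am", "so", "of", "the", "a", "to", "is"}


def clean(text):
    """Strip punctuation and digits, mirroring the text-cleaning step."""
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\d+", "", text)


def tokenize(text):
    """Split on whitespace; a simplification of NLTK's word_tokenize."""
    return text.split()


def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in TOY_STOPWORDS]


sample = "I am SO full of life, I feel appalled!!! 100%"
cleaned = clean(sample)
tokens = tokenize(cleaned)
kept = remove_stopwords(tokens)

print(tokens)  # nine tokens, punctuation and digits already gone
print(kept)    # ['full', 'life', 'feel', 'appalled']
```

Nine tokens shrink to four, which is the dimensionality reduction described above: the vectorizer only ever sees the words that carry the emotional content.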

Step 2: Feature Extraction
CountVectorizer and TfidfVectorizer both transform text data into numerical features. The implementation below uses TfidfVectorizer.

- CountVectorizer: Converts a collection of text documents to a matrix of token counts.
- TfidfVectorizer: Converts a collection of text documents to a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features.

The choice of vectorizer affects performance: TfidfVectorizer downweights words that appear in most documents and gives relatively more weight to words concentrated in a few documents.
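
The difference between raw counts and TF-IDF weights can be worked through by hand. The sketch below uses only the standard library and the smoothed IDF formula that scikit-learn documents as the TfidfVectorizer default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization. The three-sentence corpus is made up for illustration, and whitespace splitting is a simplification (scikit-learn's default tokenizer would also drop one-character tokens such as "i").

```python
import math
from collections import Counter

# Hypothetical toy corpus (assumption), not taken from the dataset.
docs = [
    "i feel happy",
    "i feel scared",
    "i feel angry today",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf_vector(tokens):
    """Term count times IDF, scaled so the vector has unit L2 norm."""
    counts = Counter(tokens)
    raw = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}


counts_doc0 = Counter(tokenized[0])
weights_doc0 = tfidf_vector(tokenized[0])

print(counts_doc0)   # every term in the first document has count 1
print(weights_doc0)  # "happy" outweighs "feel", which occurs in every document
```

With counts alone, "feel" and "happy" look equally important in the first sentence. TF-IDF separates them because "feel" appears in all three documents and so says little about which emotion is expressed.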

Step 3: Model Development
I will train two models:

- Naive Bayes: Suitable for text classification due to its simplicity and effectiveness.
- Support Vector Machine (SVM): Effective in high-dimensional spaces and commonly used for text classification.

Step 4: Model Comparison
I will evaluate the models on a held-out test set using accuracy and weighted F1-score.
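
To show what these two metrics measure, here is a small standard-library sketch that computes accuracy and support-weighted F1 by hand. The labels in `y_true` and `y_pred` are invented for illustration; the notebook itself relies on scikit-learn's `accuracy_score` and `f1_score(average='weighted')`.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Hypothetical labels (assumption), two samples per emotion.
y_true = ["anger", "anger", "fear", "fear", "joy", "joy"]
y_pred = ["anger", "fear", "fear", "fear", "joy", "anger"]

acc = accuracy(y_true, y_pred)
f1w = weighted_f1(y_true, y_pred)
print(round(acc, 3), round(f1w, 3))  # 0.667 0.656
```

Here the per-class F1 values are 0.5 (anger), 0.8 (fear), and about 0.667 (joy). Because every class has the same support, the weighted F1 is their plain mean; with imbalanced classes, larger classes would dominate the average.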

Implementation:

In [7]:
# Import necessary libraries
import pandas as pd
import numpy as np
import string
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, classification_report
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that word_tokenize needs on NLTK 3.9 and later
nltk.download('stopwords')
nltk.download('wordnet')

# Load the dataset
url = "https://drive.google.com/uc?id=1HWczIICsMpaL8EJyu48ZvRFcXx3_pcnb"
df = pd.read_csv(url)

# Inspect the dataset
print("Dataset Columns:", df.columns)
print(df.head())

# Build the stopword set and the lemmatizer once, not once per token or per row
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

# Preprocessing function
def preprocess_text(text):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stopwords (case-insensitive match against the precomputed set)
    tokens = [word for word in tokens if word.lower() not in STOP_WORDS]
    # Lemmatize words (WordNetLemmatizer treats each word as a noun by default)
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

# Apply preprocessing to the text column
df['clean_text'] = df['Comment'].apply(preprocess_text)

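# Feature extraction
# Note: the vectorizer is fitted on the full dataset before the split below,
# so vocabulary and IDF statistics from the test rows are visible during training.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_text'])
y = df['Emotion']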

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
nb_pred = nb_model.predict(X_test)

# Train SVM model
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)

# Evaluate the models
nb_accuracy = accuracy_score(y_test, nb_pred)
nb_f1 = f1_score(y_test, nb_pred, average='weighted')
svm_accuracy = accuracy_score(y_test, svm_pred)
svm_f1 = f1_score(y_test, svm_pred, average='weighted')

print("Naive Bayes Accuracy:", nb_accuracy)
print("Naive Bayes F1 Score:", nb_f1)
print("SVM Accuracy:", svm_accuracy)
print("SVM F1 Score:", svm_f1)
print("Naive Bayes Classification Report:\n", classification_report(y_test, nb_pred))
print("SVM Classification Report:\n", classification_report(y_test, svm_pred))


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Dataset Columns: Index(['Comment', 'Emotion'], dtype='object')
                                             Comment Emotion
0  i seriously hate one subject to death but now ...    fear
1                 im so full of life i feel appalled   anger
2  i sit here to write i start to dig out my feel...    fear
3  ive been really angry with r and i feel like a...     joy
4  i feel suspicious if there is no one outside l...    fear
Naive Bayes Accuracy: 0.9082491582491582
Naive Bayes F1 Score: 0.9082319465206588
SVM Accuracy: 0.9318181818181818
SVM F1 Score: 0.9318506955527953
Naive Bayes Classification Report:
               precision    recall  f1-score   support

       anger       0.89      0.93      0.91       392
        fear       0.92      0.92      0.92       416
         joy       0.93      0.88      0.90       380

    accuracy                           0.91      1188
   macro avg       0.91      0.91      0.91      1188
weighted avg       0.91      0.91      0.91      1188

SVM Cl

### Explanation of the Code
1. Loading and Preprocessing:
   - Loaded the dataset from the provided link.
   - Preprocessed the text by removing punctuation, numbers, and stopwords, and then lemmatizing the words.

2. Feature Extraction:
   - Used TfidfVectorizer to convert the text into numerical features.

3. Model Development:
   - Trained MultinomialNB and SVC models on the training data (80% of the samples).

4. Model Comparison:
   - Evaluated the models using accuracy and weighted F1-score on the 1,188-sample test set.
   - SVM scored higher than Naive Bayes on both metrics: accuracy 0.932 vs. 0.908, weighted F1 0.932 vs. 0.908.
   - Provided classification reports for a detailed evaluation.
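
One caveat about the implementation: it fits TfidfVectorizer on all rows before calling `train_test_split`, so the test rows influence the vocabulary and IDF weights. A possible variation, sketched below under stated assumptions, wraps the vectorizer and classifier in a scikit-learn pipeline so that both are fitted on the training split only. The `texts` and `labels` lists are hypothetical stand-ins for the Comment and Emotion columns, and `stratify=labels` is an added choice, not something the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data (assumption), four samples per emotion.
texts = [
    "i feel furious about this", "this makes me so angry",
    "i am irritated and mad", "he was angry and annoyed",
    "i feel scared of the dark", "she is afraid and nervous",
    "i am terrified and anxious", "they feel scared and uneasy",
    "i feel happy and cheerful", "this is a joyful day",
    "she is delighted and happy", "we feel glad and joyful",
]
labels = ["anger"] * 4 + ["fear"] * 4 + ["joy"] * 4

# Split the raw text first; stratify keeps one test sample per class here.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Each pipeline fits its own vectorizer on the training texts only.
models = {
    "Naive Bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "SVM": make_pipeline(TfidfVectorizer(), SVC()),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out texts

print(scores)
```

The scores on twelve sentences are not meaningful in themselves. The point is the structure: `fit` sees only `X_train`, and words that appear solely in the test texts are ignored at prediction time, which is what happens with genuinely unseen data.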