# Sentiment Analysis Using Bag of Words



Sentiment analysis examines text documents to extract information about the author's sentiment or opinion. It is sometimes referred to as opinion mining.

It is widely used in industry, for example on corporate and feedback surveys, social media data, and reviews of movies, places, hotels, and products.

The sentiment information from texts can be crucial to further decision making in the industry.

**Output of Sentiment Analysis**

- Qualitative: overall sentiment scale (positive/negative)
- Quantitative: sentiment polarity scores


## Dataset : `NepCov19Tweet`

Reference: C Sitaula, A Basnet, A Mainali and TB Shahi, **Deep Learning-based Methods for Sentiment Analysis on Nepali COVID-19-related Tweets**, Computational Intelligence and Neuroscience, 2021. [Link](https://onlinelibrary.wiley.com/doi/full/10.1155/2021/2158184)

Source: https://www.kaggle.com/datasets/mathew11111/nepcov19tweets/data

In [2]:
import pandas as pd

df = pd.read_csv('../../data/covid19_tweeter_dataset.csv', encoding='utf-8')
df.head(5)

Unnamed: 0.1,Unnamed: 0,Label,Datetime,Tweet,Tokanize_tweet
0,0,-1,2021-01-10 22:06:41+00:00,अमेरिकामा कोभिड बाट एकै दिन चार हजारभन्दा बढीक...,"अमेरिकामा,कोभिड,बाट,एकै,दिन,चार,हजारभन्दा,बढीक..."
1,1,-1,2021-01-10 17:49:34+00:00,कोभिड का कारण विदेशमा रहेका नेपालीहरुमा मानसिक...,"कोभिड,का,कारण,विदेशमा,रहेका,नेपालीहरुमा,मानसिक..."
2,2,1,2021-01-10 16:18:34+00:00,नेपालमा क्लोभर बायोफार्मास्युटिकल्स अस्ट्रेलिय...,"नेपालमा,क्लोभर,बायोफार्मास्युटिकल्स,अस्ट्रेलिय..."
3,3,0,2021-01-10 15:12:17+00:00,कोभिड को खोप पनि लगाइयो,"कोभिड,को,खोप,पनि,लगाइयो"
4,4,-1,2021-01-10 15:07:12+00:00,अमेरिकामा कोभिड को नयाँ रेकर्ड एकै दिन हजारभन्...,"अमेरिकामा,कोभिड,को,नयाँ,रेकर्ड,एकै,दिन,हजारभन्..."


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33474 entries, 0 to 33473
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Unnamed: 0      33474 non-null  int64 
 1   Label           33474 non-null  object
 2   Datetime        33474 non-null  object
 3   Tweet           33474 non-null  object
 4   Tokanize_tweet  33471 non-null  object
dtypes: int64(1), object(4)
memory usage: 1.3+ MB
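`df.info()` reports `Label` with dtype `object`, so the column may hold strings or a mix of types. A minimal sketch for inspecting and cleaning such a column is shown here on a small hypothetical frame (the values are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data; the stray "-" stands in for any malformed label.
toy = pd.DataFrame({
    "Label": ["1", "-1", "0", "-", "1"],
    "Tweet": ["a", "b", "c", "d", "e"],
})

# Count each distinct label value, including unexpected ones.
label_counts = toy["Label"].value_counts()

# Coerce to numbers; anything that cannot be parsed becomes NaN.
numeric_label = pd.to_numeric(toy["Label"], errors="coerce")

# Keep only rows whose label is one of the three expected classes.
clean = toy[numeric_label.isin([1, -1, 0])].copy()
clean["Label"] = numeric_label[clean.index].astype(int)
```

Whether the real dataset needs this step depends on what `value_counts()` shows for its `Label` column.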


In [4]:
# `columns=` already implies the column axis, so `axis=1` is not needed
df = df.drop(columns=['Unnamed: 0', 'Datetime', 'Tokanize_tweet'])
df.head(5)

Unnamed: 0,Label,Tweet
0,-1,अमेरिकामा कोभिड बाट एकै दिन चार हजारभन्दा बढीक...
1,-1,कोभिड का कारण विदेशमा रहेका नेपालीहरुमा मानसिक...
2,1,नेपालमा क्लोभर बायोफार्मास्युटिकल्स अस्ट्रेलिय...
3,0,कोभिड को खोप पनि लगाइयो
4,-1,अमेरिकामा कोभिड को नयाँ रेकर्ड एकै दिन हजारभन्...


## Data Pre-Processing

**Tokenize**

In [5]:
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /home/tilak/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/tilak/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/tilak/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [6]:
# df['tokenized_text'] = df['text'].apply(nltk.word_tokenize)
df['tokenized_text'] = df['Tweet'].map(nltk.word_tokenize)
df.head(5)

Unnamed: 0,Label,Tweet,tokenized_text
0,-1,अमेरिकामा कोभिड बाट एकै दिन चार हजारभन्दा बढीक...,"[अमेरिकामा, कोभिड, बाट, एकै, दिन, चार, हजारभन्..."
1,-1,कोभिड का कारण विदेशमा रहेका नेपालीहरुमा मानसिक...,"[कोभिड, का, कारण, विदेशमा, रहेका, नेपालीहरुमा,..."
2,1,नेपालमा क्लोभर बायोफार्मास्युटिकल्स अस्ट्रेलिय...,"[नेपालमा, क्लोभर, बायोफार्मास्युटिकल्स, अस्ट्र..."
3,0,कोभिड को खोप पनि लगाइयो,"[कोभिड, को, खोप, पनि, लगाइयो]"
4,-1,अमेरिकामा कोभिड को नयाँ रेकर्ड एकै दिन हजारभन्...,"[अमेरिकामा, कोभिड, को, नयाँ, रेकर्ड, एकै, दिन,..."


**Stop Word Removal**

In [7]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/tilak/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
from nltk.corpus import stopwords
nepali_stopwords = stopwords.words('nepali')

In [9]:
# nepali_stopwords

In [10]:
def remove_stopwords(tokens):
    return [word for word in tokens if word not in nepali_stopwords]

df['tokenized_text_no_stopwords'] = df['tokenized_text'].apply(remove_stopwords)
df.head(5)

Unnamed: 0,Label,Tweet,tokenized_text,tokenized_text_no_stopwords
0,-1,अमेरिकामा कोभिड बाट एकै दिन चार हजारभन्दा बढीक...,"[अमेरिकामा, कोभिड, बाट, एकै, दिन, चार, हजारभन्...","[अमेरिकामा, कोभिड, बाट, एकै, दिन, हजारभन्दा, ब..."
1,-1,कोभिड का कारण विदेशमा रहेका नेपालीहरुमा मानसिक...,"[कोभिड, का, कारण, विदेशमा, रहेका, नेपालीहरुमा,...","[कोभिड, कारण, विदेशमा, नेपालीहरुमा, मानसिक, स्..."
2,1,नेपालमा क्लोभर बायोफार्मास्युटिकल्स अस्ट्रेलिय...,"[नेपालमा, क्लोभर, बायोफार्मास्युटिकल्स, अस्ट्र...","[नेपालमा, क्लोभर, बायोफार्मास्युटिकल्स, अस्ट्र..."
3,0,कोभिड को खोप पनि लगाइयो,"[कोभिड, को, खोप, पनि, लगाइयो]","[कोभिड, खोप, लगाइयो]"
4,-1,अमेरिकामा कोभिड को नयाँ रेकर्ड एकै दिन हजारभन्...,"[अमेरिकामा, कोभिड, को, नयाँ, रेकर्ड, एकै, दिन,...","[अमेरिकामा, कोभिड, रेकर्ड, एकै, दिन, हजारभन्दा..."
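Membership tests against a Python list scan the whole list for every token. A possible variant, sketched below, converts the stop words to a `frozenset` first. The short stop-word list and the name `remove_stopwords_fast` are assumptions for illustration; the notebook itself uses NLTK's Nepali list.

```python
# Hypothetical stop-word list for illustration only.
sample_stopwords = ["को", "का", "पनि", "छ"]

# A frozenset gives constant-time average membership checks.
stopword_set = frozenset(sample_stopwords)

def remove_stopwords_fast(tokens, stopword_set=stopword_set):
    """Return the tokens that are not stop words, preserving their order."""
    return [word for word in tokens if word not in stopword_set]

filtered = remove_stopwords_fast(["कोभिड", "को", "खोप", "पनि", "लगाइयो"])
```

The result is the same as with a list; only the lookup cost changes, which matters more as the corpus grows.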


### Designing Stemmer for removing the suffixes

- Takes a list of tokens as input.
- Initializes an empty list `stemmed_tokens` to store the stemmed tokens.
- Defines a list of suffixes to remove: `['मा', 'बाट', 'को', 'का', 'हरु']`
- Iterates through each token:

    - For each suffix, checks whether the token ends with it using `endswith`.
    - If a match is found, removes the suffix by slicing: `token[:-len(suffix)]`.
    - `break` stops checking further suffixes for the current token once a suffix has been removed.
    - Appends the stemmed token to `stemmed_tokens`.

- Returns the list of stemmed tokens.

In [11]:
SUFFIXES = ['मा', 'बाट', 'को', 'का', 'हरु']

In [12]:
def rule_based_stemmer(tokens):
    stemmed_tokens = []
    for token in tokens:
        for suffix in SUFFIXES:
            if token.endswith(suffix):
                token = token[:-len(suffix)]
                break  # Move to the next token after removing a suffix
        stemmed_tokens.append(token)
    return stemmed_tokens

In [13]:
df['stemmed_tokens'] = df['tokenized_text_no_stopwords'].apply(rule_based_stemmer)

df.head(10)

Unnamed: 0,Label,Tweet,tokenized_text,tokenized_text_no_stopwords,stemmed_tokens
0,-1,अमेरिकामा कोभिड बाट एकै दिन चार हजारभन्दा बढीक...,"[अमेरिकामा, कोभिड, बाट, एकै, दिन, चार, हजारभन्...","[अमेरिकामा, कोभिड, बाट, एकै, दिन, हजारभन्दा, ब...","[अमेरिका, कोभिड, , एकै, दिन, हजारभन्दा, बढी, म..."
1,-1,कोभिड का कारण विदेशमा रहेका नेपालीहरुमा मानसिक...,"[कोभिड, का, कारण, विदेशमा, रहेका, नेपालीहरुमा,...","[कोभिड, कारण, विदेशमा, नेपालीहरुमा, मानसिक, स्...","[कोभिड, कारण, विदेश, नेपालीहरु, मानसिक, स्वास्..."
2,1,नेपालमा क्लोभर बायोफार्मास्युटिकल्स अस्ट्रेलिय...,"[नेपालमा, क्लोभर, बायोफार्मास्युटिकल्स, अस्ट्र...","[नेपालमा, क्लोभर, बायोफार्मास्युटिकल्स, अस्ट्र...","[नेपाल, क्लोभर, बायोफार्मास्युटिकल्स, अस्ट्रेल..."
3,0,कोभिड को खोप पनि लगाइयो,"[कोभिड, को, खोप, पनि, लगाइयो]","[कोभिड, खोप, लगाइयो]","[कोभिड, खोप, लगाइयो]"
4,-1,अमेरिकामा कोभिड को नयाँ रेकर्ड एकै दिन हजारभन्...,"[अमेरिकामा, कोभिड, को, नयाँ, रेकर्ड, एकै, दिन,...","[अमेरिकामा, कोभिड, रेकर्ड, एकै, दिन, हजारभन्दा...","[अमेरिका, कोभिड, रेकर्ड, एकै, दिन, हजारभन्दा, ..."
5,-1,गण्डकी प्रदेश सरकारले कोभिड बाट प्रभावीतहरुको ...,"[गण्डकी, प्रदेश, सरकारले, कोभिड, बाट, प्रभावीत...","[गण्डकी, प्रदेश, सरकारले, कोभिड, बाट, प्रभावीत...","[गण्डकी, प्रदेश, सरकारले, कोभिड, , प्रभावीतहरु..."
6,-1,नेपालको संचार अमेरिकामा कोभिड को नयाँ रेकर्ड ए...,"[नेपालको, संचार, अमेरिकामा, कोभिड, को, नयाँ, र...","[नेपालको, संचार, अमेरिकामा, कोभिड, रेकर्ड, एकै...","[नेपाल, संचार, अमेरिका, कोभिड, रेकर्ड, एकै, दि..."
7,0,रामेछापमा कोभिड सङ्क्रमितको संख्या पुग्यो,"[रामेछापमा, कोभिड, सङ्क्रमितको, संख्या, पुग्यो]","[रामेछापमा, कोभिड, सङ्क्रमितको, संख्या, पुग्यो]","[रामेछाप, कोभिड, सङ्क्रमित, संख्या, पुग्यो]"
8,1,कोरोना भाइरस भारत माघ गतेदेखि कोभिड विरुद्ध रा...,"[कोरोना, भाइरस, भारत, माघ, गतेदेखि, कोभिड, विर...","[कोरोना, भाइरस, भारत, माघ, गतेदेखि, कोभिड, विर...","[कोरोना, भाइरस, भारत, माघ, गतेदेखि, कोभिड, विर..."
9,0,स्वास्थ्य मन्त्रालयले माग्यो कोभिड को रोकथामका...,"[स्वास्थ्य, मन्त्रालयले, माग्यो, कोभिड, को, रो...","[स्वास्थ्य, मन्त्रालयले, माग्यो, कोभिड, रोकथाम...","[स्वास्थ्य, मन्त्रालयले, माग्यो, कोभिड, रोकथाम..."
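The `stemmed_tokens` column above contains empty strings (for example in rows 0 and 5), because a token that is itself a suffix, such as `बाट`, is stripped down to nothing. A minimal sketch of a guarded variant follows; the function name `guarded_stemmer` and the `min_stem_len` parameter are hypothetical additions, not part of the notebook.

```python
SUFFIXES = ["मा", "बाट", "को", "का", "हरु"]

def guarded_stemmer(tokens, suffixes=SUFFIXES, min_stem_len=1):
    """Strip the first matching suffix, but only if a non-empty stem remains."""
    stemmed = []
    for token in tokens:
        for suffix in suffixes:
            stem_len = len(token) - len(suffix)
            if token.endswith(suffix) and stem_len >= min_stem_len:
                token = token[:-len(suffix)]
                break  # At most one suffix is removed per token
        stemmed.append(token)
    return stemmed

stems = guarded_stemmer(["अमेरिकामा", "बाट", "नेपालको"])
```

Here `बाट` is kept unchanged instead of becoming an empty token, while the other two words are stemmed as before.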


## Preparing Datasets for Training/Validation

We split the entire dataset into two parts: a `training set` and a `testing set`.
- The proportion of the training and testing sets may depend on the corpus size.
- In the train-test split, make sure the class distribution stays proportional in both parts.

In [14]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size = 0.20, random_state=42)

print(f"Train set size: {len(df_train)}")
print(f"Test set size: {len(df_test)}")

Train set size: 26779
Test set size: 6695
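The split above does not pass `stratify`, so class proportions are only approximately preserved. A minimal sketch of a stratified split on a hypothetical toy frame (the data is invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 positive, 6 negative and 4 neutral rows.
toy = pd.DataFrame({
    "Tweet": [f"tweet {i}" for i in range(20)],
    "Label": ["1"] * 10 + ["-1"] * 6 + ["0"] * 4,
})

# stratify= keeps the label proportions the same in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.5, stratify=toy["Label"], random_state=42
)

train_counts = toy_train["Label"].value_counts().to_dict()
test_counts = toy_test["Label"].value_counts().to_dict()
```

Stratification needs at least two rows per class, so very rare or malformed labels would have to be handled before splitting.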


## Vectorize (TF-IDF)


**TF-IDF** stands for Term Frequency-Inverse Document Frequency.

It's a numerical statistic used in Natural Language Processing to reflect how important a word is to a document within a collection of documents (corpus).

It works by considering two factors:

- **Term Frequency (TF)**: How frequently a word appears in a document. Higher frequency generally means higher importance.

- **Inverse Document Frequency (IDF)**: How common or rare a word is across the entire corpus. Words that appear in many documents are less important than words that appear in a few.

$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$


Where:
- $t$ represents the term (word)
- $d$ represents the document
- $D$ represents the corpus (collection of documents)
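To make the formula concrete, the sketch below recomputes one TF-IDF row by hand and compares it with `TfidfVectorizer`. It assumes scikit-learn's defaults: raw term counts for TF, the smoothed IDF $\ln\frac{1+n}{1+\text{df}(t)} + 1$, and L2 normalization of each row. The three short documents are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["covid vaccine arrives", "covid cases rise", "vaccine drive starts"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
vocab = vec.vocabulary_  # term -> column index
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency, as used by scikit-learn by default."""
    doc_freq = sum(term in d.split() for d in docs)
    return np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF(t, d) * IDF(t, D) for every term of the first document.
raw = np.zeros(len(vocab))
for term in docs[0].split():
    raw[vocab[term]] += idf(term)  # each occurrence adds one count

# scikit-learn L2-normalizes each row by default.
manual_row = raw / np.linalg.norm(raw)
matches = np.allclose(manual_row, X[0])
```

`arrives` occurs in only one document, so it receives a larger weight than `covid` or `vaccine`, which each occur in two.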



We will use TF-IDF for vectorization.

To create a bag-of-words TF-IDF vector from the `stemmed_tokens` column, use the `TfidfVectorizer` from `sklearn`.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [16]:
# Join the stemmed tokens back into sentences
df['stemmed_token_joined'] = df['stemmed_tokens'].apply(' '.join)

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

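# Fit the vectorizer.
# Note: this fits on the full dataset, so the vocabulary and IDF weights also
# see the test split; fitting on the training split only is the stricter setup.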
vectorizer.fit(df['stemmed_token_joined'])

TfidfVectorizer()


In [17]:
vector = vectorizer.transform(["कोभिड समस्या पारेको छ"])
print(vector.toarray())
vector.shape

[[0. 0. 0. ... 0. 0. 0.]]


(1, 5551)
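One thing worth checking with Devanagari text: the default `token_pattern` relies on `\w`, which in Python's `re` module does not match combining vowel signs such as `ो` or `ि`. Words can therefore be split into fragments or dropped entirely. The sketch below compares the default with a whitespace-based pattern. The sample sentences are illustrative, and `(?u)\S+` is one possible choice rather than the notebook's setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["कोभिड खोप लगाइयो", "कोभिड समस्या"]

# Default pattern: tokens are runs of two or more \w characters.
default_vec = TfidfVectorizer().fit(docs)

# Alternative: treat every whitespace-separated chunk as one token.
whitespace_vec = TfidfVectorizer(token_pattern=r"(?u)\S+").fit(docs)

default_vocab = sorted(default_vec.vocabulary_)
whitespace_vocab = sorted(whitespace_vec.vocabulary_)
```

With the default pattern, `कोभिड` does not appear in the vocabulary at all, whereas the whitespace pattern keeps the four full words. Since the text has already been tokenized and stemmed, a whitespace-based pattern may be the better fit here, although the resulting feature counts and scores would differ from those shown in this notebook.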

Vectorize train and test dataset

In [18]:
df_train['stemmed_token_joined'] = df_train['stemmed_tokens'].apply(' '.join)
X_train_bow = vectorizer.transform(df_train['stemmed_token_joined'])

df_test['stemmed_token_joined'] = df_test['stemmed_tokens'].apply(' '.join)
X_test_bow  = vectorizer.transform(df_test['stemmed_token_joined'])

In [19]:
X_train_bow.shape, X_test_bow.shape

((26779, 5551), (6695, 5551))

## Visualize the vector

To visualize the TF-IDF matrix, you can use a dimensionality reduction technique such as `PCA` or `t-SNE` to project the high-dimensional matrix into 2D or 3D space.

You can then plot the projected data with a library such as matplotlib or plotly (imported below).

In [20]:
import plotly.express as px
from sklearn.decomposition import PCA

Then, apply PCA to reduce the dimensionality of the TF-IDF matrix:

In [21]:
pca = PCA(n_components=3)  # Reduce to 3 dimensions for visualization
reduced_tfidf = pca.fit_transform(X_test_bow.toarray())  # Convert sparse matrix to dense array
reduced_tfidf.shape

(6695, 3)

In [22]:
# df_3d = pd.DataFrame(reduced_tfidf, columns=['PC1', 'PC2', 'PC3'])

# fig = px.scatter_3d(df_3d, x='PC1', y='PC2', z='PC3')
# fig.show()
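`PCA` needs a dense array, which is why the cell above calls `.toarray()`. For larger matrices, `TruncatedSVD` is a commonly used alternative that accepts sparse input directly. A minimal sketch on a toy corpus (the documents are invented for illustration):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "covid vaccine arrives",
    "covid cases rise again",
    "vaccine drive starts today",
    "hospital beds fill up",
]

# fit_transform returns a SciPy sparse matrix.
X_sparse = TfidfVectorizer().fit_transform(toy_docs)

# TruncatedSVD works on the sparse matrix without converting it to dense.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(X_sparse)
```

Unlike `PCA`, `TruncatedSVD` does not center the data, so the two projections are not identical.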

Generate labels

In [23]:
y_train = df_train['Label']
y_test = df_test['Label']

## Model Training

For our sentiment classifier, we will try a few common classification algorithms:

- Support Vector Machine
- Decision Tree
- Naive Bayes
- Logistic Regression

In [24]:
# from sklearn import svm

# model_svm = svm.SVC(C=8.0, kernel='linear')
# model_svm.fit(X_train_bow, y_train)

Cross validation

In [25]:
# from sklearn.model_selection import cross_val_score
# model_svm_acc = cross_val_score(estimator=model_svm, X=X_train_bow, y=y_train, cv=5, n_jobs=-1)
# model_svm_acc

In [26]:
from sklearn.tree import DecisionTreeClassifier

model_dec = DecisionTreeClassifier(max_depth=10, random_state=0)
model_dec.fit(X_train_bow, y_train)

DecisionTreeClassifier(max_depth=10, random_state=0)


In [27]:
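from sklearn.model_selection import cross_val_score
# Note: this cross-validates on the test split; the commented-out SVM cell above
# uses X_train_bow / y_train, which is the more common choice for model selection.
model_dec_acc = cross_val_score(estimator=model_dec, X=X_test_bow, y=y_test, cv=5, n_jobs=2)
model_dec_acc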



array([0.55713219, 0.5660941 , 0.57132188, 0.5660941 , 0.54742345])
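Cross-validation is usually run on the training data, with the vectorizer refitted inside each fold so that no fold sees vocabulary from its own validation part. A minimal sketch using a scikit-learn pipeline on a hypothetical toy corpus (the texts and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus: 6 "positive" and 6 "negative" documents.
texts = (
    ["good vaccine news"] * 3
    + ["recovery rate improves"] * 3
    + ["death toll rises"] * 3
    + ["cases surge again"] * 3
)
labels = ["1"] * 6 + ["-1"] * 6

# The vectorizer is fitted again on the training part of every fold.
pipeline = make_pipeline(
    TfidfVectorizer(),
    DecisionTreeClassifier(max_depth=10, random_state=0),
)
scores = cross_val_score(pipeline, texts, labels, cv=3)
```

`scores` holds one accuracy value per fold; their mean and spread give a rough idea of how stable the model is.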

## Evaluation of Model
Several common metrics can be used to evaluate each model's performance:

- Precision
- Recall
- F-score
- Accuracy
- Confusion Matrix
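The metrics listed above can be computed with `sklearn.metrics`. The sketch below uses a handful of invented labels, written as strings because the model in this notebook returns string labels such as `'-1'`:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Hypothetical true and predicted labels for six tweets.
y_true_toy = ["1", "1", "-1", "-1", "0", "0"]
y_pred_toy = ["1", "-1", "-1", "-1", "1", "0"]
class_order = ["1", "-1", "0"]  # positive, negative, neutral

accuracy_toy = accuracy_score(y_true_toy, y_pred_toy)

# One precision / recall / F1 / support value per class, in class_order.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true_toy, y_pred_toy, labels=class_order, zero_division=0
)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=class_order)
```

Four of the six predictions are correct, so the accuracy is about 0.67. The confusion matrix shows where the two errors fall: one positive tweet predicted as negative and one neutral tweet predicted as positive.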

In [28]:
# Mean Accuracy
print(model_dec.score(X_test_bow, y_test))

0.5764002987303958


In [29]:
# F1
from sklearn.metrics import f1_score

y_pred = model_dec.predict(X_test_bow)

f1_score(y_test, y_pred,
         average=None,
         labels = [1,-1,0])

array([0.66264751, 0.55258001, 0.0141129 ])

## Inference

- Define the input tweet.
- Apply the same preprocessing used for training: tokenize, remove stop words, and stem.

In [30]:
# df_train.head(100)

In [None]:
# tweet = "नेपालमा कोभिड बढ्नु राम्रो कुरा हो"
# tweet = "नेपालमा कोभिड बढ्नु नराम्रो कुरा हो"

tweet = "कोभिडले सप्तरीमा संक्रमितको लक्षण देखिएका तीनजना मध्धे एक जानाको मृत्यु भएको छ"

tokens          = nltk.word_tokenize(tweet)
tokens          = remove_stopwords(tokens)
stemmed_tokens  = rule_based_stemmer(tokens)
stemmed_tokens

['कोभिडले',
 'सप्तरी',
 'संक्रमित',
 'लक्षण',
 'देखिए',
 'तीनजना',
 'मध्धे',
 'जाना',
 'मृत्यु']

Now vectorize the tokens and pass the result to the model.

In [32]:
X_feature = vectorizer.transform([' '.join(stemmed_tokens)])
X_feature.shape

(1, 5551)

In [33]:
model_dec.predict(X_feature)

array(['-1'], dtype=object)

## Saving the model

In [34]:
import pickle

Save the model

In [35]:
with open('model_dec.pkl', 'wb') as file:
    pickle.dump(model_dec, file)

Inference with saved model

In [36]:
with open('model_dec.pkl', 'rb') as file:
    model_dec_loaded = pickle.load(file)

# Predict with the loaded model (not the in-memory one) to verify the round trip
pred = model_dec_loaded.predict(X_feature)
print(f"prediction from saved model: {pred}")

prediction from saved model: ['-1']
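The pickle file above stores only the classifier, so new text still needs the fitted vectorizer before it can be classified. One option, sketched here on hypothetical toy data, is to pickle a pipeline that bundles both steps. The round trip uses `pickle.dumps` / `pickle.loads` in memory instead of a file to keep the example self-contained.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy training data.
texts = ["good vaccine news", "recovery rate improves",
         "death toll rises", "cases surge again"]
labels = ["1", "1", "-1", "-1"]

# The pipeline keeps the vectorizer and the classifier together.
text_clf = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(random_state=0))
text_clf.fit(texts, labels)

# Serialize and restore the whole pipeline.
payload = pickle.dumps(text_clf)
restored = pickle.loads(payload)

original_pred = text_clf.predict(["vaccine news"])
restored_pred = restored.predict(["vaccine news"])
```

The restored pipeline accepts raw (preprocessed) text directly and gives the same prediction as the original object.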


In [37]:
from sklearn.metrics import accuracy_score

y_pred = model_dec.predict(X_test_bow)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.5764002987303958


# Assignment - Enhancing the Sentiment Analysis System

Now extend this notebook to achieve the following:

**Task 1**: Train the following classifiers and compute the `F1` score and `Accuracy` of each model
- Support Vector Machine
- Decision Tree
- Logistic Regression

Interpret the findings by comparing the accuracy of the models.

**Task 2**: Perform hyperparameter tuning on each of the models from Task 1
- Prepare a table showing the parameters for each classification model

**Task 3**: Feature Engineering
- Instead of `uni-gram` tokens in TF-IDF, use `bi-gram` tokens
- Train the decision tree model
- Tune its hyperparameters to achieve higher accuracy

**Deliverables**
1. Python Notebook
2. Your best model from ***Task 3*** (saved pickle file)



**Evaluation**
- Your model should reach an accuracy of `more than 70%` to get full marks
- The leaderboard will be published
- The top 10% on the leaderboard will receive bonus marks


# Assignment
## Task 1: Train SVM and Logistic Regression

In [38]:
# Train Support Vector Machine (SVM) model
from sklearn import svm
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Train SVM with linear kernel
print("="*60)
print("SUPPORT VECTOR MACHINE (SVM)")
print("="*60)

model_svm = svm.SVC(C=8.0, kernel='linear', random_state=42)
model_svm.fit(X_train_bow, y_train)

# Make predictions
y_pred_svm = model_svm.predict(X_test_bow)

# Calculate accuracy
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print(f"\nAccuracy: {accuracy_svm:.4f}")

# Calculate F1 scores for each class
f1_svm = f1_score(y_test, y_pred_svm, average=None, labels=[1, -1, 0])
print(f"F1 Scores (Positive, Negative, Neutral): {f1_svm}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_svm, labels=[1, -1, 0]))

SUPPORT VECTOR MACHINE (SVM)

Accuracy: 0.6376
F1 Scores (Positive, Negative, Neutral): [0.6885759  0.68390087 0.21868211]

Classification Report:
              precision    recall  f1-score   support

           1       0.65      0.74      0.69      2989
          -1       0.66      0.70      0.68      2723
           0       0.37      0.16      0.22       974

   micro avg       0.64      0.64      0.64      6686
   macro avg       0.56      0.53      0.53      6686
weighted avg       0.61      0.64      0.62      6686



In [None]:
# Train Logistic Regression model
from sklearn.linear_model import LogisticRegression

print("\n" + "="*60)
print("LOGISTIC REGRESSION")
print("="*60)

model_lr = LogisticRegression(max_iter=1000, random_state=42)
model_lr.fit(X_train_bow, y_train)

# Make predictions
y_pred_lr = model_lr.predict(X_test_bow)

# Calculate accuracy
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print(f"\nAccuracy: {accuracy_lr:.4f}")

# Calculate F1 scores
f1_lr = f1_score(y_test, y_pred_lr, average=None, labels=[1, -1, 0])
print(f"F1 Scores (Positive, Negative, Neutral): {f1_lr}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr, labels=[1, -1, 0]))


LOGISTIC REGRESSION

Accuracy: 0.6479
F1 Scores (Positive, Negative, Neutral): [0.70076336 0.69008783 0.1884984 ]

Classification Report:
              precision    recall  f1-score   support

           1       0.64      0.77      0.70      2989
          -1       0.67      0.71      0.69      2723
           0       0.42      0.12      0.19       974

   micro avg       0.65      0.65      0.65      6686
   macro avg       0.58      0.53      0.53      6686
weighted avg       0.62      0.65      0.62      6686



In [40]:
# Train Decision Tree model
print("\n" + "="*60)
print("DECISION TREE")
print("="*60)

# We already trained this earlier, so just evaluate it
y_pred_dt = model_dec.predict(X_test_bow)

# Calculate accuracy
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f"\nAccuracy: {accuracy_dt:.4f}")

# Calculate F1 scores
f1_dt = f1_score(y_test, y_pred_dt, average=None, labels=[1, -1, 0])
print(f"F1 Scores (Positive, Negative, Neutral): {f1_dt}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_dt, labels=[1, -1, 0]))


DECISION TREE

Accuracy: 0.5764
F1 Scores (Positive, Negative, Neutral): [0.66264751 0.55258001 0.0141129 ]

Classification Report:
              precision    recall  f1-score   support

           1       0.54      0.86      0.66      2989
          -1       0.68      0.47      0.55      2723
           0       0.39      0.01      0.01       974

   micro avg       0.58      0.58      0.58      6686
   macro avg       0.53      0.45      0.41      6686
weighted avg       0.57      0.58      0.52      6686



In [41]:
# Model Comparison Summary
import pandas as pd

print("\n" + "="*60)
print("MODEL COMPARISON SUMMARY")
print("="*60)

# Create comparison table
comparison_df = pd.DataFrame({
    'Model': ['SVM', 'Logistic Regression', 'Decision Tree'],
    'Accuracy': [accuracy_svm, accuracy_lr, accuracy_dt],
    'F1 (Positive)': [f1_svm[0], f1_lr[0], f1_dt[0]],
    'F1 (Negative)': [f1_svm[1], f1_lr[1], f1_dt[1]],
    'F1 (Neutral)': [f1_svm[2], f1_lr[2], f1_dt[2]]
})

print("\n", comparison_df)

# Find best model
best_model_idx = comparison_df['Accuracy'].idxmax()
print(f"\nBest Model: {comparison_df.iloc[best_model_idx]['Model']}")
print(f"Best Accuracy: {comparison_df.iloc[best_model_idx]['Accuracy']:.4f}")


MODEL COMPARISON SUMMARY

                  Model  Accuracy  F1 (Positive)  F1 (Negative)  F1 (Neutral)
0                  SVM  0.637640       0.688576       0.683901      0.218682
1  Logistic Regression  0.647946       0.700763       0.690088      0.188498
2        Decision Tree  0.576400       0.662648       0.552580      0.014113

Best Model: Logistic Regression
Best Accuracy: 0.6479


In [42]:
# Interpretation of Results
print("\n" + "="*60)
print("INTERPRETATION")
print("="*60)

print("""
Based on the comparison:

1. ACCURACY: Which model predicts the most correct results overall
   - Highest = Best overall performance
   
2. F1 SCORE: Balance between finding all positives and avoiding false positives
   - Important for imbalanced datasets (some sentiment classes appear more often)
   - Positive: tweets labeled 1
   - Negative: tweets labeled -1
   - Neutral: tweets labeled 0

Key Observations:
- SVM: Often performs well with TF-IDF features
- Logistic Regression: Fast, interpretable, good baseline
- Decision Tree: Can overfit but works well with proper depth tuning

Next: Task 2 will involve hyperparameter tuning to improve these models further.
""")


INTERPRETATION

Based on the comparison:

1. ACCURACY: Which model predicts the most correct results overall
   - Highest = Best overall performance
   
2. F1 SCORE: Balance between finding all positives and avoiding false positives
   - Important for imbalanced datasets (some sentiment classes appear more often)
   - Positive: tweets labeled 1
   - Negative: tweets labeled -1
   - Neutral: tweets labeled 0

Key Observations:
- SVM: Often performs well with TF-IDF features
- Logistic Regression: Fast, interpretable, good baseline
- Decision Tree: Can overfit but works well with proper depth tuning

Next: Task 2 will involve hyperparameter tuning to improve these models further.



## Task 2: Hyperparameter tuning

In [43]:
# Hyperparameter Tuning for SVM
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

print("="*80)
print("TASK 2: HYPERPARAMETER TUNING")
print("="*80)

print("\n" + "="*80)
print("1. SUPPORT VECTOR MACHINE (SVM) - HYPERPARAMETER TUNING")
print("="*80)

# Define parameter grid for SVM
param_grid_svm = {
    'C': [0.1, 1, 10, 100],           # Regularization strength (lower = more regularization)
    'kernel': ['linear', 'rbf'],      # Type of kernel function
    'gamma': ['scale', 'auto']        # Kernel coefficient (for rbf kernel)
}

print("\nParameter Grid:")
print(param_grid_svm)

# Create GridSearchCV object
grid_svm = GridSearchCV(
    estimator=svm.SVC(random_state=42),
    param_grid=param_grid_svm,
    cv=5,                              # 5-fold cross-validation
    n_jobs=-1,                         # Use all processors
    verbose=1
)

print("\nFitting SVM with GridSearchCV (this may take a moment)...")
grid_svm.fit(X_train_bow, y_train)

# Best parameters
print(f"\nBest Parameters for SVM: {grid_svm.best_params_}")
print(f"Best Cross-Validation Score: {grid_svm.best_score_:.4f}")

# Evaluate on test set
y_pred_svm_tuned = grid_svm.predict(X_test_bow)
accuracy_svm_tuned = accuracy_score(y_test, y_pred_svm_tuned)
f1_svm_tuned = f1_score(y_test, y_pred_svm_tuned, average=None, labels=[1, -1, 0])

print(f"Test Accuracy: {accuracy_svm_tuned:.4f}")
print(f"F1 Scores: {f1_svm_tuned}")

# Store for comparison
best_svm_model = grid_svm.best_estimator_

TASK 2: HYPERPARAMETER TUNING

1. SUPPORT VECTOR MACHINE (SVM) - HYPERPARAMETER TUNING

Parameter Grid:
{'C': [0.1, 1, 10, 100], 'kernel': ['linear', 'rbf'], 'gamma': ['scale', 'auto']}

Fitting SVM with GridSearchCV (this may take a moment)...
Fitting 5 folds for each of 16 candidates, totalling 80 fits

Best Parameters for SVM: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
Best Cross-Validation Score: 0.6787
Test Accuracy: 0.6863
F1 Scores: [0.73183198 0.725971   0.36031332]
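`gamma` only affects non-linear kernels such as `rbf`, so the flat grid above fits every linear candidate twice with identical settings. `GridSearchCV` also accepts a list of dictionaries, which lets each kernel have its own parameters. The sketch below only counts the candidates with `ParameterGrid`; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# The flat grid used above: every combination of C, kernel and gamma.
flat_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}

# One sub-grid per kernel; gamma is searched only for rbf.
split_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": ["scale", "auto"]},
]

n_flat = len(ParameterGrid(flat_grid))
n_split = len(ParameterGrid(split_grid))
```

The flat grid has 16 candidates and the split grid has 12, which removes 20 of the 80 fits under 5-fold cross-validation without dropping any distinct model.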


In [44]:
# Hyperparameter Tuning for Logistic Regression
print("\n" + "="*80)
print("2. LOGISTIC REGRESSION - HYPERPARAMETER TUNING")
print("="*80)

# Define parameter grid for Logistic Regression
param_grid_lr = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength
    'penalty': ['l2'],                      # L2 regularization (prevents overfitting)
    'solver': ['lbfgs', 'liblinear'],      # Optimization algorithm
    'max_iter': [500, 1000, 2000]          # Maximum iterations for convergence
}

print("\nParameter Grid:")
print(param_grid_lr)

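# Create GridSearchCV object
# Note: liblinear does not support multi_class='multinomial', so the liblinear
# candidates fail to fit and GridSearchCV scores them as NaN (the warnings are
# suppressed above). multi_class is also deprecated in recent scikit-learn releases.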
grid_lr = GridSearchCV(
    estimator=LogisticRegression(random_state=42, multi_class='multinomial'),
    param_grid=param_grid_lr,
    cv=5,
    n_jobs=-1,
    verbose=1
)

print("\nFitting Logistic Regression with GridSearchCV (this may take a moment)...")
grid_lr.fit(X_train_bow, y_train)

# Best parameters
print(f"\nBest Parameters for Logistic Regression: {grid_lr.best_params_}")
print(f"Best Cross-Validation Score: {grid_lr.best_score_:.4f}")

# Evaluate on test set
y_pred_lr_tuned = grid_lr.predict(X_test_bow)
accuracy_lr_tuned = accuracy_score(y_test, y_pred_lr_tuned)
f1_lr_tuned = f1_score(y_test, y_pred_lr_tuned, average=None, labels=[1, -1, 0])

print(f"Test Accuracy: {accuracy_lr_tuned:.4f}")
print(f"F1 Scores: {f1_lr_tuned}")

# Store for comparison
best_lr_model = grid_lr.best_estimator_


2. LOGISTIC REGRESSION - HYPERPARAMETER TUNING

Parameter Grid:
{'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['l2'], 'solver': ['lbfgs', 'liblinear'], 'max_iter': [500, 1000, 2000]}

Fitting Logistic Regression with GridSearchCV (this may take a moment)...
Fitting 5 folds for each of 36 candidates, totalling 180 fits

Best Parameters for Logistic Regression: {'C': 1, 'max_iter': 500, 'penalty': 'l2', 'solver': 'lbfgs'}
Best Cross-Validation Score: 0.6492
Test Accuracy: 0.6479
F1 Scores: [0.70076336 0.69008783 0.1884984 ]


In [45]:
# Hyperparameter Tuning for Decision Tree
print("\n" + "="*80)
print("3. DECISION TREE - HYPERPARAMETER TUNING")
print("="*80)

# Define parameter grid for Decision Tree
param_grid_dt = {
    'max_depth': [5, 10, 15, 20, None],         # Maximum depth of tree (deeper = more complex)
    'min_samples_split': [2, 5, 10],            # Minimum samples to split a node
    'min_samples_leaf': [1, 2, 4],              # Minimum samples at leaf node
    'criterion': ['gini', 'entropy']            # Function to measure split quality
}

print("\nParameter Grid:")
print(param_grid_dt)

# Create GridSearchCV object
grid_dt = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid_dt,
    cv=5,
    n_jobs=-1,
    verbose=1
)

print("\nFitting Decision Tree with GridSearchCV (this may take a moment)...")
grid_dt.fit(X_train_bow, y_train)

# Best parameters
print(f"\nBest Parameters for Decision Tree: {grid_dt.best_params_}")
print(f"Best Cross-Validation Score: {grid_dt.best_score_:.4f}")

# Evaluate on test set
y_pred_dt_tuned = grid_dt.predict(X_test_bow)
accuracy_dt_tuned = accuracy_score(y_test, y_pred_dt_tuned)
f1_dt_tuned = f1_score(y_test, y_pred_dt_tuned, average=None, labels=[1, -1, 0])

print(f"Test Accuracy: {accuracy_dt_tuned:.4f}")
print(f"F1 Scores: {f1_dt_tuned}")

# Store for comparison
best_dt_model = grid_dt.best_estimator_


3. DECISION TREE - HYPERPARAMETER TUNING

Parameter Grid:
{'max_depth': [5, 10, 15, 20, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'criterion': ['gini', 'entropy']}

Fitting Decision Tree with GridSearchCV (this may take a moment)...
Fitting 5 folds for each of 90 candidates, totalling 450 fits

Best Parameters for Decision Tree: {'criterion': 'gini', 'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 5}
Best Cross-Validation Score: 0.5831
Test Accuracy: 0.5825
F1 Scores: [0.66544933 0.56732419 0.04952381]


In [46]:
# Model comparison after hyperparameter tuning
print("\n" + "="*80)
print("COMPARISON: BEFORE vs AFTER HYPERPARAMETER TUNING")
print("="*80)

# Create comparison dataframe
comparison_tuning = pd.DataFrame({
    'Model': ['SVM', 'SVM (Tuned)', 'Logistic Regression', 'Logistic Regression (Tuned)', 
              'Decision Tree', 'Decision Tree (Tuned)'],
    'Accuracy': [accuracy_svm, accuracy_svm_tuned, accuracy_lr, accuracy_lr_tuned, 
                 accuracy_dt, accuracy_dt_tuned],
    'F1 Positive': [f1_svm[0], f1_svm_tuned[0], f1_lr[0], f1_lr_tuned[0], 
                    f1_dt[0], f1_dt_tuned[0]],
    'F1 Negative': [f1_svm[1], f1_svm_tuned[1], f1_lr[1], f1_lr_tuned[1], 
                    f1_dt[1], f1_dt_tuned[1]],
    'F1 Neutral': [f1_svm[2], f1_svm_tuned[2], f1_lr[2], f1_lr_tuned[2], 
                   f1_dt[2], f1_dt_tuned[2]]
})

print("\n", comparison_tuning.to_string())

# Find best tuned model
best_tuned_idx = comparison_tuning['Accuracy'].idxmax()
print(f"\n{'='*80}")
print(f"BEST TUNED MODEL: {comparison_tuning.iloc[best_tuned_idx]['Model']}")
print(f"BEST ACCURACY: {comparison_tuning.iloc[best_tuned_idx]['Accuracy']:.4f}")
print(f"{'='*80}")


COMPARISON: BEFORE vs AFTER HYPERPARAMETER TUNING

                          Model  Accuracy  F1 Positive  F1 Negative  F1 Neutral
0                          SVM  0.637640     0.688576     0.683901    0.218682
1                  SVM (Tuned)  0.686333     0.731832     0.725971    0.360313
2          Logistic Regression  0.647946     0.700763     0.690088    0.188498
3  Logistic Regression (Tuned)  0.647946     0.700763     0.690088    0.188498
4                Decision Tree  0.576400     0.662648     0.552580    0.014113
5        Decision Tree (Tuned)  0.582524     0.665449     0.567324    0.049524

BEST TUNED MODEL: SVM (Tuned)
BEST ACCURACY: 0.6863


In [47]:
# Hyperparameters Summary
print("\n" + "="*80)
print("BEST HYPERPARAMETERS FOR EACH MODEL")
print("="*80)

# Create summary table
hyperparams_summary = pd.DataFrame({
    'Model': ['SVM', 'Logistic Regression', 'Decision Tree'],
    'Best Parameters': [
        str(grid_svm.best_params_),
        str(grid_lr.best_params_),
        str(grid_dt.best_params_)
    ],
    'CV Score': [
        f"{grid_svm.best_score_:.4f}",
        f"{grid_lr.best_score_:.4f}",
        f"{grid_dt.best_score_:.4f}"
    ],
    'Test Accuracy': [
        f"{accuracy_svm_tuned:.4f}",
        f"{accuracy_lr_tuned:.4f}",
        f"{accuracy_dt_tuned:.4f}"
    ]
})

print("\n", hyperparams_summary.to_string())


BEST HYPERPARAMETERS FOR EACH MODEL

                  Model                                                                        Best Parameters CV Score Test Accuracy
0                  SVM                                           {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}   0.6787        0.6863
1  Logistic Regression                          {'C': 1, 'max_iter': 500, 'penalty': 'l2', 'solver': 'lbfgs'}   0.6492        0.6479
2        Decision Tree  {'criterion': 'gini', 'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 5}   0.5831        0.5825


In [48]:
# Interpretation of Results
print("\n" + "="*80)
print("KEY INSIGHTS FROM HYPERPARAMETER TUNING")
print("="*80)

print("""
WHAT WE DID:
- GridSearchCV tested multiple combinations of hyperparameters
- Used 5-fold cross-validation to evaluate each combination
- Selected the best parameters based on validation performance

IMPORTANT CONCEPTS:

1. SVM Parameters:
   - C: Inverse regularization strength (higher C = fits training data more closely)
   - kernel: Linear vs non-linear decision boundaries
   - gamma: How far the influence of each training point reaches (RBF kernel)
   
2. Logistic Regression Parameters:
   - C: Higher C = weaker regularization, more complex model (may overfit)
   - penalty: Regularization method
   
3. Decision Tree Parameters:
   - max_depth: Deeper trees = more overfitting risk
   - min_samples_split/leaf: Prevents splitting on small groups
   
NEXT STEP (Task 3):
- Use bigrams instead of unigrams in TF-IDF
- Retrain best model with bigrams
- Tune hyperparameters again
- Target: >70% accuracy
""")


KEY INSIGHTS FROM HYPERPARAMETER TUNING

WHAT WE DID:
- GridSearchCV tested multiple combinations of hyperparameters
- Used 5-fold cross-validation to evaluate each combination
- Selected the best parameters based on validation performance

IMPORTANT CONCEPTS:

1. SVM Parameters:
   - C: Inverse regularization strength (higher C = fits training data more closely)
   - kernel: Linear vs non-linear decision boundaries
   - gamma: How far the influence of each training point reaches (RBF kernel)
   
2. Logistic Regression Parameters:
   - C: Higher C = weaker regularization, more complex model (may overfit)
   - penalty: Regularization method
   
3. Decision Tree Parameters:
   - max_depth: Deeper trees = more overfitting risk
   - min_samples_split/leaf: Prevents splitting on small groups
   
NEXT STEP (Task 3):
- Use bigrams instead of unigrams in TF-IDF
- Retrain best model with bigrams
- Tune hyperparameters again
- Target: >70% accuracy



In [49]:
# Feature Engineering with Bigrams
print("="*80)
print("TASK 3: FEATURE ENGINEERING WITH BIGRAMS")
print("="*80)

print("\n" + "="*80)
print("STEP 1: CREATE TF-IDF VECTORIZER WITH BIGRAMS")
print("="*80)

# Create a new TF-IDF vectorizer with bigrams
# ngram_range=(2,2) means only bigrams (pairs of words)
# ngram_range=(1,2) would mean unigrams and bigrams

vectorizer_bigram = TfidfVectorizer(ngram_range=(2, 2), max_features=5000)

print("\nVectorizer Configuration:")
print("  - ngram_range: (2, 2) - This means bigrams only")
print("  - max_features: 5000 - Limit features to top 5000")

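# Fit on the entire dataset. Note: this lets test-set vocabulary and IDF
# statistics leak into the features; fitting on df_train alone avoids that.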
vectorizer_bigram.fit(df['stemmed_token_joined'])

# Transform train and test data
X_train_bigram = vectorizer_bigram.transform(df_train['stemmed_token_joined'])
X_test_bigram = vectorizer_bigram.transform(df_test['stemmed_token_joined'])

print(f"\nBigram Vectorization Complete!")
print(f"Training set shape: {X_train_bigram.shape}")
print(f"Test set shape: {X_test_bigram.shape}")
print(f"\nNumber of bigram features: {X_train_bigram.shape[1]}")

# Show some example bigrams
print("\nExample bigrams from vocabulary:")
bigram_vocab = vectorizer_bigram.get_feature_names_out()
print(bigram_vocab[:20])  # First 20 bigrams

TASK 3: FEATURE ENGINEERING WITH BIGRAMS

STEP 1: CREATE TF-IDF VECTORIZER WITH BIGRAMS

Vectorizer Configuration:
  - ngram_range: (2, 2) - This means bigrams only
  - max_features: 5000 - Limit features to top 5000

Bigram Vectorization Complete!
Training set shape: (26779, 5000)
Test set shape: (6695, 5000)

Number of bigram features: 5000

Example bigrams from vocabulary:
['अक जन' 'अक बर' 'अक सफ' 'अकर मण' 'अकल पन' 'अग बढ' 'अग रज' 'अग रद' 'अग रप'
 'अग रपङ' 'अग रभ' 'अग रम' 'अग रस' 'अघ बढ' 'अघ मन' 'अघ वर' 'अघ सम' 'अच नक'
 'अछ रह' 'अझ बढ']
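Several of the "bigrams" listed above look like fragments of single words rather than pairs of words. A likely cause is that the default `token_pattern` of `TfidfVectorizer`, `r"(?u)\b\w\w+\b"`, relies on `\w`, which in Python's `re` module does not match Devanagari vowel signs or the virama, so a word is split wherever one of those marks occurs. The sketch below reproduces the effect with the standard library only. The sample word is an illustrative guess and is not taken from the dataset; the whitespace pattern is one possible alternative, assuming the tokens in `stemmed_token_joined` are already separated by spaces.

```python
import re

# Default token pattern used by scikit-learn's text vectorizers.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical alternative: treat every whitespace-separated chunk as one token.
WHITESPACE_TOKEN_PATTERN = r"(?u)\S+"

# Illustrative Nepali word ("oxygen"); an assumption, not a row from the dataset.
word = "अक्सिजन"

# The virama and the vowel sign are not matched by \w, so the word is cut at
# those marks, and the single-letter piece in the middle is dropped entirely.
default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, word)

# Splitting on whitespace keeps the word intact.
whitespace_tokens = re.findall(WHITESPACE_TOKEN_PATTERN, word)

print(default_tokens)     # ['अक', 'जन']
print(whitespace_tokens)  # ['अक्सिजन']
```

If this is what happens here, passing `token_pattern=r"(?u)\S+"` to `TfidfVectorizer` would make each n-gram span whole tokens. Whether that improves accuracy on this dataset is untested.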


In [50]:
# Train Decision Tree with Bigrams
print("\n" + "="*80)
print("STEP 2: TRAIN DECISION TREE WITH BIGRAMS")
print("="*80)

# Train a baseline Decision Tree with bigrams
model_dt_bigram_baseline = DecisionTreeClassifier(max_depth=10, random_state=42)
model_dt_bigram_baseline.fit(X_train_bigram, y_train)

# Make predictions
y_pred_dt_bigram_baseline = model_dt_bigram_baseline.predict(X_test_bigram)

# Calculate metrics
accuracy_dt_bigram_baseline = accuracy_score(y_test, y_pred_dt_bigram_baseline)
f1_dt_bigram_baseline = f1_score(y_test, y_pred_dt_bigram_baseline, average=None, labels=[1, -1, 0])

print(f"\nBaseline Decision Tree (Bigrams):")
print(f"Accuracy: {accuracy_dt_bigram_baseline:.4f}")
print(f"F1 Scores: {f1_dt_bigram_baseline}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_dt_bigram_baseline, labels=[1, -1, 0]))


STEP 2: TRAIN DECISION TREE WITH BIGRAMS

Baseline Decision Tree (Bigrams):
Accuracy: 0.5277
F1 Scores: [0.65041201 0.36430804 0.0219342 ]

Classification Report:
              precision    recall  f1-score   support

           1       0.49      0.96      0.65      2989
          -1       0.81      0.24      0.36      2723
           0       0.38      0.01      0.02       974

   micro avg       0.53      0.53      0.53      6686
   macro avg       0.56      0.40      0.35      6686
weighted avg       0.60      0.53      0.44      6686
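
The `macro avg` and `weighted avg` rows in the report can be reproduced from the per-class F1 scores and supports printed above. The sketch below hard-codes those printed values rather than recomputing them from `y_test`, so it only illustrates the arithmetic.

```python
# Per-class F1 scores and supports copied from the cell output above.
f1_per_class = {1: 0.65041201, -1: 0.36430804, 0: 0.0219342}
support = {1: 2989, -1: 2723, 0: 974}

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted average: each class contributes in proportion to its support.
total_support = sum(support.values())
weighted_f1 = sum(f1_per_class[label] * support[label] for label in f1_per_class) / total_support

print(f"Macro F1:    {macro_f1:.2f}")     # 0.35
print(f"Weighted F1: {weighted_f1:.2f}")  # 0.44
```

The gap between the two averages comes from the neutral class: its F1 is close to zero, which pulls the macro average down more than the weighted one because it has the smallest support.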



In [51]:
# Hyperparameter Tuning for Decision Tree with Bigrams
print("\n" + "="*80)
print("STEP 3: HYPERPARAMETER TUNING FOR BIGRAM DECISION TREE")
print("="*80)

# Define parameter grid for Decision Tree with bigrams
param_grid_dt_bigram = {
    'max_depth': [8, 12, 15, 20, 25],           # Deeper trees with more features
    'min_samples_split': [2, 3, 5],             # Allow smaller splits
    'min_samples_leaf': [1, 2],                 # Smaller leaf nodes
    'criterion': ['gini', 'entropy'],           # Split criteria
    'splitter': ['best', 'random']              # Splitter strategy
}

print("\nParameter Grid for Tuning:")
for param, values in param_grid_dt_bigram.items():
    print(f"  {param}: {values}")

# Create GridSearchCV object
grid_dt_bigram = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid_dt_bigram,
    cv=5,
    n_jobs=-1,
    verbose=1,
    scoring='accuracy'  # Optimize for accuracy
)

print("\nFitting Decision Tree with Bigrams (GridSearchCV)...")
print("This may take several minutes...")
grid_dt_bigram.fit(X_train_bigram, y_train)

# Best parameters
print(f"\n{'='*80}")
print(f"Best Parameters: {grid_dt_bigram.best_params_}")
print(f"Best CV Score: {grid_dt_bigram.best_score_:.4f}")
print(f"{'='*80}")

# Evaluate on test set
y_pred_dt_bigram_tuned = grid_dt_bigram.predict(X_test_bigram)
accuracy_dt_bigram_tuned = accuracy_score(y_test, y_pred_dt_bigram_tuned)
f1_dt_bigram_tuned = f1_score(y_test, y_pred_dt_bigram_tuned, average=None, labels=[1, -1, 0])

print(f"\nTuned Decision Tree (Bigrams):")
print(f"Test Accuracy: {accuracy_dt_bigram_tuned:.4f}")
print(f"F1 Scores: {f1_dt_bigram_tuned}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_dt_bigram_tuned, labels=[1, -1, 0]))

# Store best model
best_dt_bigram_model = grid_dt_bigram.best_estimator_


STEP 3: HYPERPARAMETER TUNING FOR BIGRAM DECISION TREE

Parameter Grid for Tuning:
  max_depth: [8, 12, 15, 20, 25]
  min_samples_split: [2, 3, 5]
  min_samples_leaf: [1, 2]
  criterion: ['gini', 'entropy']
  splitter: ['best', 'random']

Fitting Decision Tree with Bigrams (GridSearchCV)...
This may take several minutes...
Fitting 5 folds for each of 120 candidates, totalling 600 fits

Best Parameters: {'criterion': 'gini', 'max_depth': 25, 'min_samples_leaf': 1, 'min_samples_split': 3, 'splitter': 'random'}
Best CV Score: 0.5410

Tuned Decision Tree (Bigrams):
Test Accuracy: 0.5461
F1 Scores: [0.6558253  0.43159304 0.05207329]

Classification Report:
              precision    recall  f1-score   support

           1       0.50      0.94      0.66      2989
          -1       0.80      0.30      0.43      2723
           0       0.43      0.03      0.05       974

   micro avg       0.55      0.55      0.55      6686
   macro avg       0.58      0.42      0.38      6686
weighted avg 
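
The line "Fitting 5 folds for each of 120 candidates, totalling 600 fits" follows directly from the size of the parameter grid. A minimal sketch of that count, using only the standard library and the grid values printed above:

```python
from itertools import product

# Same grid as param_grid_dt_bigram in the cell above.
param_grid = {
    "max_depth": [8, 12, 15, 20, 25],
    "min_samples_split": [2, 3, 5],
    "min_samples_leaf": [1, 2],
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
}
cv_folds = 5

# A grid search tries every combination of the listed values.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)

# Each candidate is trained once per cross-validation fold.
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

Adding one more value to any parameter multiplies the total, which is why the grid grows quickly. With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set after the search.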

In [52]:
# Unigrams vs Bigrams Comparison
print("\n" + "="*80)
print("STEP 4: UNIGRAMS vs BIGRAMS COMPARISON")
print("="*80)

comparison_unigram_bigram = pd.DataFrame({
    'Model': [
        'Decision Tree (Unigram)',
        'Decision Tree (Unigram - Tuned)',
        'Decision Tree (Bigram)',
        'Decision Tree (Bigram - Tuned)'
    ],
    'Accuracy': [
        accuracy_dt,
        accuracy_dt_tuned,
        accuracy_dt_bigram_baseline,
        accuracy_dt_bigram_tuned
    ],
    'F1 Positive': [
        f1_dt[0],
        f1_dt_tuned[0],
        f1_dt_bigram_baseline[0],
        f1_dt_bigram_tuned[0]
    ],
    'F1 Negative': [
        f1_dt[1],
        f1_dt_tuned[1],
        f1_dt_bigram_baseline[1],
        f1_dt_bigram_tuned[1]
    ],
    'F1 Neutral': [
        f1_dt[2],
        f1_dt_tuned[2],
        f1_dt_bigram_baseline[2],
        f1_dt_bigram_tuned[2]
    ]
})

print("\n", comparison_unigram_bigram.to_string())

# Find best overall model
best_overall_idx = comparison_unigram_bigram['Accuracy'].idxmax()
best_overall_model_name = comparison_unigram_bigram.iloc[best_overall_idx]['Model']
best_overall_accuracy = comparison_unigram_bigram.iloc[best_overall_idx]['Accuracy']

print(f"\n{'='*80}")
print(f"BEST OVERALL MODEL: {best_overall_model_name}")
print(f"BEST ACCURACY: {best_overall_accuracy:.4f}")
print(f"{'='*80}")

# Check if we achieved >70% accuracy
if best_overall_accuracy > 0.70:
    print(f"\n✓ SUCCESS! Achieved accuracy > 70% ({best_overall_accuracy:.2%})")
else:
    print(f"\n⚠ Current accuracy: {best_overall_accuracy:.2%} (Target: >70%)")


STEP 4: UNIGRAMS vs BIGRAMS COMPARISON

                              Model  Accuracy  F1 Positive  F1 Negative  F1 Neutral
0          Decision Tree (Unigram)  0.576400     0.662648     0.552580    0.014113
1  Decision Tree (Unigram - Tuned)  0.582524     0.665449     0.567324    0.049524
2           Decision Tree (Bigram)  0.527707     0.650412     0.364308    0.021934
3   Decision Tree (Bigram - Tuned)  0.546079     0.655825     0.431593    0.052073

BEST OVERALL MODEL: Decision Tree (Unigram - Tuned)
BEST ACCURACY: 0.5825

⚠ Current accuracy: 58.25% (Target: >70%)
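
To put these accuracies in context, it helps to compare them with a classifier that always predicts the most frequent class. The sketch below derives that baseline from the class supports in the classification reports above; those supports cover the 6686 test rows labelled 1, -1, or 0, so the figure is approximate for the full test set.

```python
# Test-set class counts copied from the classification reports above.
support = {1: 2989, -1: 2723, 0: 974}

# A majority-class baseline always predicts the most frequent label.
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / sum(support.values())

# Best Decision Tree accuracy from the comparison table above.
best_tree_accuracy = 0.5825

print(f"Majority label: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Gain over baseline: {best_tree_accuracy - baseline_accuracy:.4f}")
```

Seen this way, the tuned unigram tree is about 13.5 percentage points above the baseline, and the very high recall for class 1 in the bigram reports suggests those trees lean heavily on the majority label.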


In [55]:
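# Save the Best Model
import pickle

print("\n" + "="*80)
print("STEP 5: SAVE THE BEST MODEL")
print("="*80)

# Save the tuned bigram Decision Tree and its matching vectorizer.
# Note: in the comparison above, the tuned unigram tree scored higher (0.5825 vs 0.5461).
model_to_save = grid_dt_bigram.best_estimator_
vectorizer_to_save = vectorizer_bigram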

# Save model
with open('best_sentiment_model.pkl', 'wb') as f:
    pickle.dump(model_to_save, f)

# Save vectorizer
with open('best_vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer_to_save, f)

print("\n✓ Best model saved as: best_sentiment_model.pkl")
print("✓ Best vectorizer saved as: best_vectorizer.pkl")

print(f"\nModel Details:")
print(f"  - Type: {type(model_to_save).__name__}")
print(f"  - Best Parameters: {grid_dt_bigram.best_params_}")
print(f"  - Test Accuracy: {accuracy_dt_bigram_tuned:.4f}")


STEP 5: SAVE THE BEST MODEL

✓ Best model saved as: best_sentiment_model.pkl
✓ Best vectorizer saved as: best_vectorizer.pkl

Model Details:
  - Type: DecisionTreeClassifier
  - Best Parameters: {'criterion': 'gini', 'max_depth': 25, 'min_samples_leaf': 1, 'min_samples_split': 3, 'splitter': 'random'}
  - Test Accuracy: 0.5461
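
The saved files can later be loaded to classify new tweets. The sketch below is one way to do that; `load_artifacts` and `predict_sentiment` are hypothetical helper names, not part of the notebook. It assumes the input texts have gone through the same preprocessing (tokenizing, stemming, and joining with spaces) that produced `stemmed_token_joined`, and that the pickle files come from a trusted source, since unpickling can execute arbitrary code.

```python
import pickle


def load_artifacts(model_path="best_sentiment_model.pkl",
                   vectorizer_path="best_vectorizer.pkl"):
    """Load the pickled classifier and vectorizer from disk."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(vectorizer_path, "rb") as f:
        vectorizer = pickle.load(f)
    return model, vectorizer


def predict_sentiment(texts, model, vectorizer):
    """Vectorize already-preprocessed texts and return the predicted labels."""
    features = vectorizer.transform(texts)
    return list(model.predict(features))


# Example usage (requires the two .pkl files written by the cell above):
# model, vectorizer = load_artifacts()
# labels = predict_sentiment(["<preprocessed tweet text>"], model, vectorizer)
```

The model and vectorizer must always be loaded as a pair, because the tree's split indices refer to column positions in this particular vectorizer's vocabulary.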
