# Sentiment Analysis Using Bag of Words



Sentiment analysis examines text documents to extract information about the author's sentiment or opinion. It is sometimes referred to as opinion mining.

It is widely used in industry on sources such as corporate surveys, feedback surveys, social media data, and reviews of movies, places, hotels, and commodities.

The sentiment information from texts can be crucial to further decision making in the industry.

**Output of Sentiment Analysis**

- Qualitative: overall sentiment scale (positive/negative)
- Quantitative: sentiment polarity scores


## Dataset : `NepCov19Tweet`

Reference: C Sitaula, A Basnet, A Mainali and TB Shahi, **Deep Learning-based Methods for Sentiment Analysis on Nepali COVID-19-related Tweets**, Computational Intelligence and Neuroscience, 2021. [Link](https://onlinelibrary.wiley.com/doi/full/10.1155/2021/2158184)

Source: https://www.kaggle.com/datasets/mathew11111/nepcov19tweets/data

In [1]:
import pandas as pd

df = pd.read_csv('../data/covid19_tweeter_dataset.csv', encoding='utf-8')
df.head(5)

Unnamed: 0.1,Unnamed: 0,Label,Datetime,Tweet,Tokanize_tweet
0,0,-1,2021-01-10 22:06:41+00:00,अमेरिकामा कोभिड बाट एकै दिन चार हजारभन्दा बढीक...,"अमेरिकामा,कोभिड,बाट,एकै,दिन,चार,हजारभन्दा,बढीक..."
1,1,-1,2021-01-10 17:49:34+00:00,कोभिड का कारण विदेशमा रहेका नेपालीहरुमा मानसिक...,"कोभिड,का,कारण,विदेशमा,रहेका,नेपालीहरुमा,मानसिक..."
2,2,1,2021-01-10 16:18:34+00:00,नेपालमा क्लोभर बायोफार्मास्युटिकल्स अस्ट्रेलिय...,"नेपालमा,क्लोभर,बायोफार्मास्युटिकल्स,अस्ट्रेलिय..."
3,3,0,2021-01-10 15:12:17+00:00,कोभिड को खोप पनि लगाइयो,"कोभिड,को,खोप,पनि,लगाइयो"
4,4,-1,2021-01-10 15:07:12+00:00,अमेरिकामा कोभिड को नयाँ रेकर्ड एकै दिन हजारभन्...,"अमेरिकामा,कोभिड,को,नयाँ,रेकर्ड,एकै,दिन,हजारभन्..."


In [4]:
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33474 entries, 0 to 33473
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Unnamed: 0      33474 non-null  int64 
 1   Label           33474 non-null  object
 2   Datetime        33474 non-null  object
 3   Tweet           33474 non-null  object
 4   Tokanize_tweet  33471 non-null  object
dtypes: int64(1), object(4)
memory usage: 1.3+ MB


(33474, 5)

In [5]:
# `columns=` already names the axis, so `axis=1` is redundant
df = df.drop(columns=['Unnamed: 0', 'Datetime', 'Tokanize_tweet'])
df.head(5)

Unnamed: 0,Label,Tweet
0,-1,अमेरिकामा कोभिड बाट एकै दिन चार हजारभन्दा बढीक...
1,-1,कोभिड का कारण विदेशमा रहेका नेपालीहरुमा मानसिक...
2,1,नेपालमा क्लोभर बायोफार्मास्युटिकल्स अस्ट्रेलिय...
3,0,कोभिड को खोप पनि लगाइयो
4,-1,अमेरिकामा कोभिड को नयाँ रेकर्ड एकै दिन हजारभन्...


## Data Pre-Processing

**Tokenize**

In [6]:
import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /home/tilak/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/tilak/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/tilak/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [7]:
# df['tokenized_text'] = df['text'].apply(nltk.word_tokenize)
df['tokenized_text'] = df['Tweet'].map(nltk.word_tokenize)
df.head(5)

Unnamed: 0,Label,Tweet,tokenized_text
0,-1,अमेरिकामा कोभिड बाट एकै दिन चार हजारभन्दा बढीक...,"[अमेरिकामा, कोभिड, बाट, एकै, दिन, चार, हजारभन्..."
1,-1,कोभिड का कारण विदेशमा रहेका नेपालीहरुमा मानसिक...,"[कोभिड, का, कारण, विदेशमा, रहेका, नेपालीहरुमा,..."
2,1,नेपालमा क्लोभर बायोफार्मास्युटिकल्स अस्ट्रेलिय...,"[नेपालमा, क्लोभर, बायोफार्मास्युटिकल्स, अस्ट्र..."
3,0,कोभिड को खोप पनि लगाइयो,"[कोभिड, को, खोप, पनि, लगाइयो]"
4,-1,अमेरिकामा कोभिड को नयाँ रेकर्ड एकै दिन हजारभन्...,"[अमेरिकामा, कोभिड, को, नयाँ, रेकर्ड, एकै, दिन,..."


**Stop Word Removal**

In [8]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/tilak/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
from nltk.corpus import stopwords
nepali_stopwords = stopwords.words('nepali')

In [None]:
# nepali_stopwords

['छ',
 'र',
 'पनि',
 'छन्',
 'लागि',
 'भएको',
 'गरेको',
 'भने',
 'गर्न',
 'गर्ने',
 'हो',
 'तथा',
 'यो',
 'रहेको',
 'उनले',
 'थियो',
 'हुने',
 'गरेका',
 'थिए',
 'गर्दै',
 'तर',
 'नै',
 'को',
 'मा',
 'हुन्',
 'भन्ने',
 'हुन',
 'गरी',
 'त',
 'हुन्छ',
 'अब',
 'के',
 'रहेका',
 'गरेर',
 'छैन',
 'दिए',
 'भए',
 'यस',
 'ले',
 'गर्नु',
 'औं',
 'सो',
 'त्यो',
 'कि',
 'जुन',
 'यी',
 'का',
 'गरि',
 'ती',
 'न',
 'छु',
 'छौं',
 'लाई',
 'नि',
 'उप',
 'अक्सर',
 'आदि',
 'कसरी',
 'क्रमशः',
 'चाले',
 'अगाडी',
 'अझै',
 'अनुसार',
 'अन्तर्गत',
 'अन्य',
 'अन्यत्र',
 'अन्यथा',
 'अरु',
 'अरुलाई',
 'अर्को',
 'अर्थात',
 'अर्थात्',
 'अलग',
 'आए',
 'आजको',
 'ओठ',
 'आत्म',
 'आफू',
 'आफूलाई',
 'आफ्नै',
 'आफ्नो',
 'आयो',
 'उदाहरण',
 'उनको',
 'उहालाई',
 'एउटै',
 'एक',
 'एकदम',
 'कतै',
 'कम से कम',
 'कसै',
 'कसैले',
 'कहाँबाट',
 'कहिलेकाहीं',
 'का',
 'किन',
 'किनभने',
 'कुनै',
 'कुरा',
 'कृपया',
 'केही',
 'कोही',
 'गए',
 'गरौं',
 'गर्छ',
 'गर्छु',
 'गर्नुपर्छ',
 'गयौ',
 'गैर',
 'चार',
 'चाहनुहुन्छ',
 'चाहन्छु',
 'चाहिए

In [11]:
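# Build the set once so each membership test does not scan the whole list
nepali_stopword_set = set(nepali_stopwords)

def remove_stopwords(tokens):
    """Return the tokens that are not in the NLTK Nepali stop-word list."""
    return [word for word in tokens if word not in nepali_stopword_set]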

df['tokenized_text_no_stopwords'] = df['tokenized_text'].apply(remove_stopwords)
df.head(5)

Unnamed: 0,Label,Tweet,tokenized_text,tokenized_text_no_stopwords
0,-1,अमेरिकामा कोभिड बाट एकै दिन चार हजारभन्दा बढीक...,"[अमेरिकामा, कोभिड, बाट, एकै, दिन, चार, हजारभन्...","[अमेरिकामा, कोभिड, बाट, एकै, दिन, हजारभन्दा, ब..."
1,-1,कोभिड का कारण विदेशमा रहेका नेपालीहरुमा मानसिक...,"[कोभिड, का, कारण, विदेशमा, रहेका, नेपालीहरुमा,...","[कोभिड, कारण, विदेशमा, नेपालीहरुमा, मानसिक, स्..."
2,1,नेपालमा क्लोभर बायोफार्मास्युटिकल्स अस्ट्रेलिय...,"[नेपालमा, क्लोभर, बायोफार्मास्युटिकल्स, अस्ट्र...","[नेपालमा, क्लोभर, बायोफार्मास्युटिकल्स, अस्ट्र..."
3,0,कोभिड को खोप पनि लगाइयो,"[कोभिड, को, खोप, पनि, लगाइयो]","[कोभिड, खोप, लगाइयो]"
4,-1,अमेरिकामा कोभिड को नयाँ रेकर्ड एकै दिन हजारभन्...,"[अमेरिकामा, कोभिड, को, नयाँ, रेकर्ड, एकै, दिन,...","[अमेरिकामा, कोभिड, रेकर्ड, एकै, दिन, हजारभन्दा..."


### Designing a Stemmer to Remove Suffixes

- Takes a list of tokens as input.
- Initializes an empty list `stemmed_tokens` to store the stemmed tokens.
- Defines a list of suffixes to remove: `['मा', 'बाट', 'को', 'का', 'हरु']`
- Iterates through each token:

    - For each suffix, checks whether the token ends with it using `endswith`.
    - If a match is found, removes the suffix by slicing: `token[:-len(suffix)]`.
    - Uses `break` to stop checking further suffixes for the current token once a suffix is removed.
    - Appends the stemmed token to `stemmed_tokens`.

- Returns the list of stemmed tokens.

In [13]:
SUFFIXES = ['मा', 'बाट', 'को', 'का', 'हरु']

In [12]:
def rule_based_stemmer(tokens):
    stemmed_tokens = []
    for token in tokens:
        for suffix in SUFFIXES:
            if token.endswith(suffix):
                token = token[:-len(suffix)]
                break  # Move to the next token after removing a suffix
        stemmed_tokens.append(token)
    return stemmed_tokens

In [14]:
df['stemmed_tokens'] = df['tokenized_text_no_stopwords'].apply(rule_based_stemmer)

df.head(10)

Unnamed: 0,Label,Tweet,tokenized_text,tokenized_text_no_stopwords,stemmed_tokens
0,-1,अमेरिकामा कोभिड बाट एकै दिन चार हजारभन्दा बढीक...,"[अमेरिकामा, कोभिड, बाट, एकै, दिन, चार, हजारभन्...","[अमेरिकामा, कोभिड, बाट, एकै, दिन, हजारभन्दा, ब...","[अमेरिका, कोभिड, , एकै, दिन, हजारभन्दा, बढी, म..."
1,-1,कोभिड का कारण विदेशमा रहेका नेपालीहरुमा मानसिक...,"[कोभिड, का, कारण, विदेशमा, रहेका, नेपालीहरुमा,...","[कोभिड, कारण, विदेशमा, नेपालीहरुमा, मानसिक, स्...","[कोभिड, कारण, विदेश, नेपालीहरु, मानसिक, स्वास्..."
2,1,नेपालमा क्लोभर बायोफार्मास्युटिकल्स अस्ट्रेलिय...,"[नेपालमा, क्लोभर, बायोफार्मास्युटिकल्स, अस्ट्र...","[नेपालमा, क्लोभर, बायोफार्मास्युटिकल्स, अस्ट्र...","[नेपाल, क्लोभर, बायोफार्मास्युटिकल्स, अस्ट्रेल..."
3,0,कोभिड को खोप पनि लगाइयो,"[कोभिड, को, खोप, पनि, लगाइयो]","[कोभिड, खोप, लगाइयो]","[कोभिड, खोप, लगाइयो]"
4,-1,अमेरिकामा कोभिड को नयाँ रेकर्ड एकै दिन हजारभन्...,"[अमेरिकामा, कोभिड, को, नयाँ, रेकर्ड, एकै, दिन,...","[अमेरिकामा, कोभिड, रेकर्ड, एकै, दिन, हजारभन्दा...","[अमेरिका, कोभिड, रेकर्ड, एकै, दिन, हजारभन्दा, ..."
5,-1,गण्डकी प्रदेश सरकारले कोभिड बाट प्रभावीतहरुको ...,"[गण्डकी, प्रदेश, सरकारले, कोभिड, बाट, प्रभावीत...","[गण्डकी, प्रदेश, सरकारले, कोभिड, बाट, प्रभावीत...","[गण्डकी, प्रदेश, सरकारले, कोभिड, , प्रभावीतहरु..."
6,-1,नेपालको संचार अमेरिकामा कोभिड को नयाँ रेकर्ड ए...,"[नेपालको, संचार, अमेरिकामा, कोभिड, को, नयाँ, र...","[नेपालको, संचार, अमेरिकामा, कोभिड, रेकर्ड, एकै...","[नेपाल, संचार, अमेरिका, कोभिड, रेकर्ड, एकै, दि..."
7,0,रामेछापमा कोभिड सङ्क्रमितको संख्या पुग्यो,"[रामेछापमा, कोभिड, सङ्क्रमितको, संख्या, पुग्यो]","[रामेछापमा, कोभिड, सङ्क्रमितको, संख्या, पुग्यो]","[रामेछाप, कोभिड, सङ्क्रमित, संख्या, पुग्यो]"
8,1,कोरोना भाइरस भारत माघ गतेदेखि कोभिड विरुद्ध रा...,"[कोरोना, भाइरस, भारत, माघ, गतेदेखि, कोभिड, विर...","[कोरोना, भाइरस, भारत, माघ, गतेदेखि, कोभिड, विर...","[कोरोना, भाइरस, भारत, माघ, गतेदेखि, कोभिड, विर..."
9,0,स्वास्थ्य मन्त्रालयले माग्यो कोभिड को रोकथामका...,"[स्वास्थ्य, मन्त्रालयले, माग्यो, कोभिड, को, रो...","[स्वास्थ्य, मन्त्रालयले, माग्यो, कोभिड, रोकथाम...","[स्वास्थ्य, मन्त्रालयले, माग्यो, कोभिड, रोकथाम..."
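In the output above, a token that is itself one of the suffixes (for example `बाट`) is reduced to an empty string. The following is a minimal sketch of a variant that strips a suffix only when a non-empty stem remains. The name `rule_based_stemmer_safe` is hypothetical, and the suffix list is copied from `SUFFIXES`.

```python
SUFFIXES = ['मा', 'बाट', 'को', 'का', 'हरु']

def rule_based_stemmer_safe(tokens, suffixes=SUFFIXES):
    """Strip at most one suffix per token, but never reduce a token to ''."""
    stemmed_tokens = []
    for token in tokens:
        for suffix in suffixes:
            # Require the token to be longer than the suffix so a stem remains.
            if len(token) > len(suffix) and token.endswith(suffix):
                token = token[:-len(suffix)]
                break  # Only one suffix is removed per token
        stemmed_tokens.append(token)
    return stemmed_tokens

result = rule_based_stemmer_safe(['अमेरिकामा', 'कोभिड', 'बाट', 'नेपालीहरुमा'])
print(result)
```

Because the joined text is later split on whitespace again, empty tokens mostly disappear on their own; this variant mainly keeps the intermediate token lists easier to inspect.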


## Preparing Datasets for Training/Validation

We split the entire dataset into two parts: a `training set` and a `testing set`.
- The proportion of training and testing sets may depend on the corpus size.
- In the train-test split, make sure the class distribution is preserved in both sets (for example, via the `stratify` argument of `train_test_split`).

In [16]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size = 0.20, random_state=42)

print(f"Train set size: {len(df_train)}")
print(f"Test set size: {len(df_test)}")

Train set size: 26779
Test set size: 6695
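The split above does not pass `stratify`, so class proportions in the two sets can drift slightly. A minimal sketch of a stratified split follows, using a toy frame (`toy`, with made-up rows) in place of `df`. On the real data the equivalent call would presumably be `stratify=df['Label']`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `df`; labels mimic the -1 / 0 / 1 scheme.
toy = pd.DataFrame({
    'Tweet': [f'tweet {i}' for i in range(100)],
    'Label': [-1] * 50 + [0] * 30 + [1] * 20,
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.20, random_state=42, stratify=toy['Label']
)

# Class proportions should match in both splits.
print(toy_train['Label'].value_counts(normalize=True).sort_index())
print(toy_test['Label'].value_counts(normalize=True).sort_index())
```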


## Vectorize (TF-IDF)


**TF-IDF** stands for Term Frequency-Inverse Document Frequency.

It's a numerical statistic used in Natural Language Processing to reflect how important a word is to a document within a collection of documents (corpus).

It works by considering two factors:

- **Term Frequency (TF)**: How frequently a word appears in a document. Higher frequency generally means higher importance.

- **Inverse Document Frequency (IDF)**: How common or rare a word is across the entire corpus. Words that appear in many documents are less important than words that appear in a few.

$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$


Where:
- $t$ represents the term (word)
- $d$ represents the document
- $D$ represents the corpus (collection of documents)
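As a worked example, the sketch below recomputes the weights for one toy document by hand and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings, where $\text{IDF}(t, D) = \ln\frac{1 + N}{1 + \text{df}(t)} + 1$ (with $N$ documents, and $\text{df}(t)$ of them containing $t$) and each row is then L2-normalized. The English toy documents and the helper `smoothed_idf` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["covid vaccine arrived", "covid cases rising", "vaccine vaccine safe"]

vectorizer_toy = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
X_toy = vectorizer_toy.fit_transform(docs).toarray()
vocab = vectorizer_toy.vocabulary_

n_docs = len(docs)

def smoothed_idf(term):
    """IDF with the +1 smoothing that scikit-learn applies by default."""
    doc_freq = sum(term in doc.split() for doc in docs)
    return np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Recompute the row for the third document by hand.
words = docs[2].split()
terms = sorted(set(words))  # ['safe', 'vaccine']
raw = np.array([words.count(t) * smoothed_idf(t) for t in terms])
manual = raw / np.linalg.norm(raw)  # L2 normalization of the row

library = np.array([X_toy[2, vocab[t]] for t in terms])
print(dict(zip(terms, manual.round(4))))
print(dict(zip(terms, library.round(4))))
```

Both printed dictionaries should agree, which confirms that the library value is the raw count times the smoothed IDF, followed by row normalization.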



We will use TF-IDF for vectorization.

To create a `bag-of-words` based `TF-IDF` vector from the `stemmed_tokens` column, use the `TfidfVectorizer` from `sklearn`.

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [19]:
# Join the stemmed tokens back into sentences
df['stemmed_token_joined'] = df['stemmed_tokens'].apply(' '.join)

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# fit the vectorizer
vectorizer.fit(df['stemmed_token_joined'])

0,1,2
,input,'content'
,encoding,'utf-8'
,decode_error,'strict'
,strip_accents,
,lowercase,True
,preprocessor,
,tokenizer,
,analyzer,'word'
,stop_words,
,token_pattern,'(?u)\\b\\w\\w+\\b'


In [20]:
vector = vectorizer.transform(["कोभिड समस्या पारेको छ"])
print(vector.toarray())
vector.shape

[[0. 0. 0. ... 0. 0. 0.]]


(1, 5551)
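One thing worth checking is the default `token_pattern` listed in the vectorizer parameters above. It relies on `\w`, which in Python's `re` module does not match Devanagari vowel signs or the virama, so Nepali words can be broken into fragments before they are counted. The sketch below compares the default with a whitespace-based pattern. Using `token_pattern=r"(?u)\S+"` is an assumption about what should count as a token here, not a setting taken from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["नेपालमा कोभिड खोप", "कोभिड समस्या"]

# Default pattern: tokens are runs of two or more `\w` characters.
default_vec = TfidfVectorizer().fit(docs)
print(sorted(default_vec.vocabulary_))

# Assumed alternative: every whitespace-separated chunk is one token.
whole_word_vec = TfidfVectorizer(token_pattern=r"(?u)\S+").fit(docs)
print(sorted(whole_word_vec.vocabulary_))
```

If the first vocabulary contains word fragments rather than whole words, the same applies to the vocabulary fitted above, and a whitespace-based pattern (or passing the already tokenized lists through a custom `analyzer`) may be a better fit.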

Vectorize train and test dataset

In [21]:
df_train['stemmed_token_joined'] = df_train['stemmed_tokens'].apply(' '.join)
X_train_bow = vectorizer.transform(df_train['stemmed_token_joined'])

df_test['stemmed_token_joined'] = df_test['stemmed_tokens'].apply(' '.join)
X_test_bow  = vectorizer.transform(df_test['stemmed_token_joined'])

In [22]:
X_train_bow.shape, X_test_bow.shape

((26779, 5551), (6695, 5551))
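An earlier cell fits the vectorizer on the full `df`, so the vocabulary and IDF weights also see rows that end up in the test set. A minimal sketch of the usual alternative follows: fit on the training split only and reuse the fitted vectorizer for the test split. The toy texts and the `*_toy` names are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

texts = ["good vaccine news", "bad covid news", "covid cases rising",
         "vaccine arrived today", "lockdown extended again", "recovery rate improved"]
labels = [1, -1, -1, 1, -1, 1]

train_texts, test_texts, y_train_toy, y_test_toy = train_test_split(
    texts, labels, test_size=2, random_state=42
)

train_only_vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training rows only ...
X_train_toy = train_only_vectorizer.fit_transform(train_texts)
# ... and reuse them, unchanged, for the test rows.
X_test_toy = train_only_vectorizer.transform(test_texts)

print(X_train_toy.shape, X_test_toy.shape)
```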

## Visualize the vector

To visualize the TF-IDF matrix, you can use dimensionality reduction techniques like `PCA` or `t-SNE` to project the high-dimensional matrix into a 2D or 3D space.

Then, you can plot the projected data using a plotting library such as Plotly or matplotlib.

In [24]:
import plotly.express as px
from sklearn.decomposition import PCA

Then, apply PCA to reduce the dimensionality of the TF-IDF matrix:

In [25]:
pca = PCA(n_components=3)  # Reduce to 3 dimensions for visualization
reduced_tfidf = pca.fit_transform(X_test_bow.toarray())  # Convert sparse matrix to dense array
reduced_tfidf.shape

(6695, 3)
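Calling `.toarray()` on a TF-IDF matrix can use a lot of memory for larger corpora. As a hedged alternative, `TruncatedSVD` accepts sparse input directly. Unlike `PCA` it does not center the data, so the projection is similar in spirit but not identical. The random sparse matrix below merely stands in for `X_test_bow`.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for the TF-IDF matrix.
X_sparse = sparse_random(200, 500, density=0.01, format="csr", random_state=0)

svd = TruncatedSVD(n_components=3, random_state=0)
reduced_sparse = svd.fit_transform(X_sparse)  # no dense conversion needed

print(reduced_sparse.shape)
```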

In [None]:
# df_3d = pd.DataFrame(reduced_tfidf, columns=['PC1', 'PC2', 'PC3'])

# fig = px.scatter_3d(df_3d, x='PC1', y='PC2', z='PC3')
# fig.show()

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

Generate labels

In [27]:
y_train = df_train['Label']
y_test = df_test['Label']

## Model Training

For our sentiment classifier, we will try a few common classification algorithms:

- Support Vector Machine
- Decision Tree
- Naive Bayes
- Logistic Regression

In [None]:
# from sklearn import svm

# model_svm = svm.SVC(C=8.0, kernel='linear')
# model_svm.fit(X_train_bow, y_train)

Cross validation

In [None]:
# from sklearn.model_selection import cross_val_score
# model_svm_acc = cross_val_score(estimator=model_svm, X=X_train_bow, y=y_train, cv=5, n_jobs=-1)
# model_svm_acc

In [None]:
from sklearn.tree import DecisionTreeClassifier

model_dec = DecisionTreeClassifier(max_depth=10, random_state=0)
model_dec.fit(X_train_bow, y_train)

In [None]:
from sklearn.model_selection import cross_val_score
# Cross-validate on the training data; the test set stays held out for the final evaluation
model_dec_acc = cross_val_score(estimator=model_dec, X=X_train_bow, y=y_train, cv=5, n_jobs=2)
model_dec_acc

## Evaluation of Model
Several common metrics can be used to evaluate each model's performance:

- Precision
- Recall
- F-score
- Accuracy
- Confusion Matrix

In [None]:
# Mean Accuracy
print(model_dec.score(X_test_bow, y_test))

In [None]:
# F1
from sklearn.metrics import f1_score

y_pred = model_dec.predict(X_test_bow)

f1_score(y_test, y_pred,
         average=None,
         labels = [1,-1,0])
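The remaining metrics from the list (precision, recall, and the confusion matrix) can be obtained in a similar way. The sketch below uses small hand-written label lists (`y_true_toy`, `y_pred_toy`) rather than the notebook's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true_toy = [1, -1, 0, 1, -1, 0, 1, -1]
y_pred_toy = [1, -1, 0, 0, -1, 1, 1, 0]
label_order = [-1, 0, 1]

# Per-class precision, recall and F1 in one table.
print(classification_report(y_true_toy, y_pred_toy, labels=label_order, zero_division=0))

# Rows are true labels, columns are predicted labels, both in `label_order`.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=label_order)
print(cm)
```

Passing `labels` explicitly fixes the row and column order, which makes the matrix easier to read when comparing several models.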

## Inference

- Define the input tweet
- Apply the same pre-processing as for training (tokenize, remove stop words, stem)

In [None]:
# df_train.head(100)

In [None]:
# Alternative example inputs; uncomment one to try it
# tweet = "नेपालमा कोभिड बढ्नु राम्रो कुरा हो"
# tweet = "नेपालमा कोभिड बढ्नु नराम्रो कुरा हो"

tweet = "कोभिडले सप्तरीमा संक्रमितको लक्षण देखिएका तीनजना मध्धे एक जानाको मृत्यु भएको छ"

tokens          = nltk.word_tokenize(tweet)
tokens          = remove_stopwords(tokens)
stemmed_tokens  = rule_based_stemmer(tokens)
stemmed_tokens

Now vectorize the tokens and pass them to the model

In [None]:
X_feature = vectorizer.transform([' '.join(stemmed_tokens)])
X_feature.shape

In [None]:
model_dec.predict(X_feature)

## Saving the model

In [None]:
import pickle

Save the model

In [None]:
with open('model_dec.pkl', 'wb') as file:
    pickle.dump(model_dec, file)

Inference with saved model

In [None]:
with open('model_dec.pkl', 'rb') as file:
    model_dec_loaded = pickle.load(file)

# Predict with the reloaded model, not the in-memory `model_dec`
pred = model_dec_loaded.predict(X_feature)
print(f"prediction from saved model: {pred}")
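The pickle file above stores only the classifier, so a fresh process would also need the fitted vectorizer to turn text into features. One possible approach, sketched here on toy data, is to bundle both steps in a scikit-learn `Pipeline` and pickle that single object. The file name `sentiment_pipeline.pkl` and the toy texts are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

toy_texts = ["good vaccine news", "bad covid news",
             "covid cases rising", "vaccine arrived today"]
toy_labels = [1, -1, -1, 1]

# Vectorizer and classifier travel together as one estimator.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", DecisionTreeClassifier(max_depth=10, random_state=0)),
])
pipeline.fit(toy_texts, toy_labels)

path = os.path.join(tempfile.gettempdir(), "sentiment_pipeline.pkl")
with open(path, "wb") as file:
    pickle.dump(pipeline, file)

with open(path, "rb") as file:
    pipeline_loaded = pickle.load(file)

# The loaded pipeline accepts raw (pre-processed) text directly.
print(pipeline_loaded.predict(["vaccine news"]))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.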

In [None]:
from sklearn.metrics import accuracy_score

y_pred = model_dec.predict(X_test_bow)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Assignment - Enhancing the Sentiment Analysis System

Now extend this notebook to achieve the following:

**Task 1**: Train the following classifiers and compute the `F1` score and `Accuracy` of each model
- Support Vector Machine
- Decision Tree
- Logistic Regression

Write an interpretation of the findings by comparing the models' accuracy.

**Task 2**: Perform hyperparameter tuning on each model from Task 1
- Prepare a table showing the parameters for each classification model
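As a starting point for the tuning task, the sketch below runs a small grid search with `GridSearchCV` on synthetic data. The parameter grid, the `f1_macro` scoring choice, and the synthetic dataset are illustrative assumptions, not required settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the TF-IDF features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=0
)

param_grid = {"max_depth": [5, 10, 20], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",  # averages F1 over the classes
    cv=5,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The `search.cv_results_` dictionary holds the score for every parameter combination, which is convenient for building the requested table.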

**Task 3**: Feature Engineering
- Instead of `uni-gram` tokens in TF-IDF, use `bi-gram` tokens
- Train the decision tree model
- Tune the model's hyperparameters to achieve higher accuracy
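In scikit-learn, the n-gram size is controlled by the `ngram_range` parameter of `TfidfVectorizer`. A minimal sketch on English toy documents shows what a bi-gram-only vocabulary looks like; `ngram_range=(1, 2)` would keep uni-grams as well.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["covid vaccine arrived today", "covid cases rising today"]

# (2, 2) means: use only sequences of exactly two consecutive tokens.
bigram_vectorizer = TfidfVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit(docs)

print(sorted(bigram_vectorizer.vocabulary_))
```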

**Deliverables**
1. Python Notebook
2. Your best model from ***Task 3*** (saved pickle file)



**Evaluation**
- Your model must reach an accuracy of `more than 70%` to get full marks
- The leaderboard will be published
- The top 10% on the leaderboard will get additional bonus marks
