Application: Sentiment Analysis\
Dataset: Download the dataset from Sentiment Analysis Dataset (kaggle.com)\
Task:\
▪ Explain the pipeline for developing a sentiment analysis task.\
▪ Perform cleaning and preprocessing of text.\
▪ Generate representations using:\
• Bag of Words\
• TF-IDF\
• Continuous Bag of Words\
• Skip gram\
• Word2Vec.\
▪ Classify the data using appropriate machine learning techniques to generate labels.\
▪ Analyze the labels and explain the impact of embedding techniques in misclassification.\
▪ Discuss the limitations of each embedding technique and explain the techniques that rectify them.\
Input:\
What is not to like about this product.\
Not bad.\
Not an issue.\
Not buggy.\
Not happy.\
Not user-friendly.\
Not good.\
Is it any good?\
I do not dislike horror movies.\
Disliking horror movies is not uncommon.\
Sometimes I really hate the show.\
I love having to wait two months for the next series to come out!\
The final episode was surprising with a terrible twist at the end.\
The film was easy to watch but I would not recommend it to my friends.\
I LOL’d at the end of the cake scene

1. Data Gathering
2. Data Preprocessing
    - remove irrelevant characters
    - remove stop words
    - lemmatization, stemming
3. Feature Extraction
4. Model
5. Training
6. Testing

In [3]:
import numpy as np
import pandas as pd
import nltk
import gensim
import string as st
import re
from nltk.corpus import stopwords
from nltk import PorterStemmer
from nltk import WordNetLemmatizer
from wordcloud import WordCloud, ImageColorGenerator
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models.word2vec import Word2Vec
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

In [4]:
data = pd.read_csv("ex2-sent_an.csv",
                   encoding='unicode_escape')
data.head()

Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105
2,088c60f138,my boss is bullying me...,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18
3,9642c003ef,what interview! leave me alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26


In [5]:
data = data[['selected_text','sentiment']]
data.head()

Unnamed: 0,selected_text,sentiment
0,"I`d have responded, if I were going",neutral
1,Sooo SAD,negative
2,bullying me,negative
3,leave me alone,negative
4,"Sons of ****,",negative


In [6]:
data.rename(columns={"selected_text":"text"},inplace=True)
data.head()

Unnamed: 0,text,sentiment
0,"I`d have responded, if I were going",neutral
1,Sooo SAD,negative
2,bullying me,negative
3,leave me alone,negative
4,"Sons of ****,",negative


In [8]:
data.dropna(inplace=True)

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27480 entries, 0 to 27480
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       27480 non-null  object
 1   sentiment  27480 non-null  object
dtypes: object(2)
memory usage: 644.1+ KB


### Remove Punctuation

In [10]:
def remove_punctuation(text):
    removed_text = ""
    for char in str(text):
        if char not in st.punctuation:
            removed_text+=char
    return removed_text

In [11]:
data['removed_punc'] = data['text'].apply(remove_punctuation)

In [12]:
data.head()

Unnamed: 0,text,sentiment,removed_punc
0,"I`d have responded, if I were going",neutral,Id have responded if I were going
1,Sooo SAD,negative,Sooo SAD
2,bullying me,negative,bullying me
3,leave me alone,negative,leave me alone
4,"Sons of ****,",negative,Sons of
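
The character-by-character loop above can also be written with `str.translate`, which applies the same filter in a single pass. The sketch below is one possible alternative; `strip_punctuation` is a hypothetical name that does not appear in the notebook. It also shows a caveat that matters for the test sentences later on: `string.punctuation` covers ASCII only, so the curly apostrophe in "LOL’d" is not removed.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Hypothetical helper: remove ASCII punctuation, like remove_punctuation above."""
    return str(text).translate(_PUNCT_TABLE)

print(strip_punctuation("I`d have responded, if I were going"))
print(strip_punctuation("I LOL’d at the end of the cake scene"))
```

If non-ASCII punctuation should be stripped as well, the table would need to be extended with those characters explicitly.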


### Tokenization

In [13]:
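def convert_tokens(text):
    # Lowercase, then split on runs of whitespace.
    # The raw string avoids Python's invalid-escape warning for "\s".
    text = str(text).lower()
    tokens = re.split(r"\s+", text)
    return tokens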

In [14]:
data['Tokens'] = data['removed_punc'].apply(convert_tokens)

In [15]:
data.head()

Unnamed: 0,text,sentiment,removed_punc,Tokens
0,"I`d have responded, if I were going",neutral,Id have responded if I were going,"[id, have, responded, if, i, were, going]"
1,Sooo SAD,negative,Sooo SAD,"[sooo, sad]"
2,bullying me,negative,bullying me,"[bullying, me]"
3,leave me alone,negative,leave me alone,"[leave, me, alone]"
4,"Sons of ****,",negative,Sons of,"[sons, of, ]"
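
The last row above ends with an empty token (`[sons, of, ]`). That happens because the cleaned text still has a trailing space and `re.split` returns an empty string after the final separator. A minimal sketch of the difference, assuming plain whitespace tokenization is all that is needed (both function names are hypothetical):

```python
import re

def tokenize_with_regex(text):
    # Same behaviour as convert_tokens above: lowercase, split on whitespace runs.
    return re.split(r"\s+", str(text).lower())

def tokenize_with_split(text):
    # str.split() with no argument also splits on whitespace runs,
    # but ignores leading and trailing whitespace.
    return str(text).lower().split()

print(tokenize_with_regex("Sons of "))
print(tokenize_with_split("Sons of "))
```

The empty string is harmless for the vectorizers used later, but it does travel through stop-word removal, stemming and lemmatization as a token.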


### Stopword Removal

In [16]:
STOP_WORDS = set(stopwords.words("english"))  # built once; includes negations such as "not"

def remove_stopwords(tokens):
    return [token for token in tokens if token not in STOP_WORDS]

In [17]:
data['removed_stopwords_tokens'] = data['Tokens'].apply(remove_stopwords)

In [18]:
data.head()

Unnamed: 0,text,sentiment,removed_punc,Tokens,removed_stopwords_tokens
0,"I`d have responded, if I were going",neutral,Id have responded if I were going,"[id, have, responded, if, i, were, going]","[id, responded, going]"
1,Sooo SAD,negative,Sooo SAD,"[sooo, sad]","[sooo, sad]"
2,bullying me,negative,bullying me,"[bullying, me]",[bullying]
3,leave me alone,negative,leave me alone,"[leave, me, alone]","[leave, alone]"
4,"Sons of ****,",negative,Sons of,"[sons, of, ]","[sons, ]"
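
NLTK's English stop-word list contains negations such as "not", "no" and "nor", so this step turns "not good" into "good". The sketch below shows one way to keep negations while still removing other stop words. The small stop-word set is an illustrative stand-in for NLTK's full list, and `remove_stopwords_keep_negation` and `NEGATIONS` are hypothetical names.

```python
# Illustrative subset only; the notebook uses NLTK's full English list.
EXAMPLE_STOP_WORDS = {"i", "is", "it", "do", "an", "not", "no", "the", "to"}

# Words that flip sentiment and should survive stop-word removal.
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stopwords_keep_negation(tokens, stop_words=EXAMPLE_STOP_WORDS):
    """Hypothetical variant of remove_stopwords that never drops negation words."""
    return [tok for tok in tokens if tok in NEGATIONS or tok not in stop_words]

tokens = ["not", "good"]
print([t for t in tokens if t not in EXAMPLE_STOP_WORDS])  # plain removal
print(remove_stopwords_keep_negation(tokens))               # negation kept
```

Keeping the negation only helps if the downstream representation can use it, for example through n-gram features.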


### Stemming

In [19]:
def stem_tokens(tokens):
    ps = PorterStemmer()
    tokens = [ps.stem(tok) for tok in tokens]
    return tokens

In [20]:
data['stemming_tokens'] = data['removed_stopwords_tokens'].apply(stem_tokens)

In [21]:
data.head()

Unnamed: 0,text,sentiment,removed_punc,Tokens,removed_stopwords_tokens,stemming_tokens
0,"I`d have responded, if I were going",neutral,Id have responded if I were going,"[id, have, responded, if, i, were, going]","[id, responded, going]","[id, respond, go]"
1,Sooo SAD,negative,Sooo SAD,"[sooo, sad]","[sooo, sad]","[sooo, sad]"
2,bullying me,negative,bullying me,"[bullying, me]",[bullying],[bulli]
3,leave me alone,negative,leave me alone,"[leave, me, alone]","[leave, alone]","[leav, alon]"
4,"Sons of ****,",negative,Sons of,"[sons, of, ]","[sons, ]","[son, ]"


### Lemmatization

In [22]:
def lemmatization(tokens):
    # lemmatize() defaults to pos="n", so verb forms such as "going" pass through unchanged
    wordnet = WordNetLemmatizer()
    tokens = [wordnet.lemmatize(tok) for tok in tokens]
    return tokens

In [24]:
data['lemma_tokens'] = data['removed_stopwords_tokens'].apply(lemmatization)

In [25]:
data.head()

Unnamed: 0,text,sentiment,removed_punc,Tokens,removed_stopwords_tokens,stemming_tokens,lemma_tokens
0,"I`d have responded, if I were going",neutral,Id have responded if I were going,"[id, have, responded, if, i, were, going]","[id, responded, going]","[id, respond, go]","[id, responded, going]"
1,Sooo SAD,negative,Sooo SAD,"[sooo, sad]","[sooo, sad]","[sooo, sad]","[sooo, sad]"
2,bullying me,negative,bullying me,"[bullying, me]",[bullying],[bulli],[bullying]
3,leave me alone,negative,leave me alone,"[leave, me, alone]","[leave, alone]","[leav, alon]","[leave, alone]"
4,"Sons of ****,",negative,Sons of,"[sons, of, ]","[sons, ]","[son, ]","[son, ]"


### Return Preprocessed text

In [26]:
def return_sentence(tokens):
    return " ".join(tokens)

In [28]:
data['pre_processed_text'] = data['lemma_tokens'].apply(return_sentence)

In [30]:
data[['text','pre_processed_text']].head()

Unnamed: 0,text,pre_processed_text
0,"I`d have responded, if I were going",id responded going
1,Sooo SAD,sooo sad
2,bullying me,bullying
3,leave me alone,leave alone
4,"Sons of ****,",son


## Feature representation

### Bag of Words

In [31]:
cv = CountVectorizer()
count_matrix = cv.fit_transform(data['pre_processed_text'])

In [32]:
count_matrix.toarray().shape

(27480, 17424)

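#### Impact of BoW
- BoW counts words without regard to order or context, so it cannot capture semantic meaning; sentences with opposite sentiment can receive nearly identical vectors, which leads to misclassification.
- It treats each word as an independent feature, ignoring relationships between words (synonyms end up in unrelated columns).
- Out-of-vocabulary (OOV) problem: words not seen when the vectorizer was fitted are silently dropped at prediction time.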
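
The loss of word order can be seen in a few lines. The following is a minimal pure-Python sketch of the counting that `CountVectorizer` performs, not its actual implementation; `bag_of_words` is a hypothetical helper and tokenization is simplified to whitespace splitting.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

# Two sentences with opposite meaning but the same words.
docs = ["film good not boring", "film boring not good"]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Both documents map to the same count vector, so no classifier trained on these features can tell them apart.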

### TF-IDF

In [33]:
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(data['pre_processed_text'])

In [34]:
# Dense copy for inspection; at 27480 x 17424 float64 values this needs roughly 3.8 GB of RAM.
tfidf_array = tfidf_matrix.toarray()

In [35]:
tfidf_array.shape

(27480, 17424)

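#### Impact of TF-IDF
- TF-IDF addresses one BoW limitation by down-weighting words that occur in many documents and up-weighting rarer, more distinctive ones, but it still ignores word order and semantic relationships.
- Out-of-vocabulary (OOV) problem: as with BoW, words unseen during fitting are dropped.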
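
To make the weighting concrete, here is a small pure-Python sketch of the formula that scikit-learn documents for `TfidfVectorizer` with its default settings (`smooth_idf=True`, `norm='l2'`): `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The helper `tfidf` is hypothetical and tokenizes on whitespace only, which is simpler than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 row normalization."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # guard against empty rows
        rows.append([v / norm for v in row])
    return vocab, idf, rows

docs = ["sooo sad", "sad day", "happy day"]
vocab, idf, rows = tfidf(docs)
for term in vocab:
    print(term, round(idf[term], 4))
```

Terms that appear in fewer documents ("sooo", "happy") receive a higher IDF than terms shared across documents ("sad", "day"), yet the row vectors still carry no information about word order.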

### Continuous Bag of Words

In [36]:
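# NOTE: Word2Vec expects an iterable of token lists. This column holds strings,
# so gensim iterates over characters (see the vocabulary printed below).
# For word-level vectors, pass data['lemma_tokens'] instead.
cbow = Word2Vec(data['pre_processed_text'],
                vector_size=100,
                window=5,
                min_count=2,
                sg=0)  # sg=0 selects CBOW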

In [41]:
vocab = cbow.wv.index_to_key

In [43]:
vocab[:10]

[' ', 'e', 'o', 'a', 't', 'n', 'i', 'r', 'l', 's']

In [44]:
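def get_mean_vector(model, sentence):
    # Average the vectors of all in-vocabulary items; zero vector if none are known.
    # `sentence` is a string here, so the items are characters, matching the model above.
    words = [word for word in sentence if word in model.wv.key_to_index]
    if len(words) >= 1:
        return np.mean(model.wv[words], axis=0)
    return np.zeros((model.wv.vector_size,))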

In [45]:
cbow_array = []

for sentence in data['pre_processed_text'].values.tolist():
    cbow_array.append(get_mean_vector(cbow, sentence))

In [46]:
cbow_array = np.array(cbow_array)
cbow_array.shape

(27480, 100)
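
The single-character vocabulary printed above comes from the input type: iterating over a string yields characters, while iterating over a token list yields words. The sketch below reproduces that effect without gensim. `vocabulary` is a hypothetical helper, and the commented-out `Word2Vec` call only indicates what a word-level run might look like; it was not executed for this notebook, so the results reported here still come from the character-level models.

```python
def vocabulary(sentences):
    """Collect the distinct items exactly as iterating over each sentence yields them."""
    return sorted({item for sentence in sentences for item in sentence})

as_strings = ["sooo sad", "leave alone"]      # what the string column contains
as_tokens = [s.split() for s in as_strings]   # list of token lists

print(vocabulary(as_strings))
print(vocabulary(as_tokens))

# With gensim, a word-level call would take the token lists (not run here):
# cbow = Word2Vec(sentences=as_tokens, vector_size=100, window=5, min_count=1, sg=0)
```

With word-level vectors, `get_mean_vector` would also need to receive the token list rather than the joined string.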

### Skip-Gram

In [49]:
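# As with CBOW above, the string column makes gensim train on characters;
# pass data['lemma_tokens'] for word-level vectors.
sg = Word2Vec(data['pre_processed_text'],
              vector_size=100,
              window=5,
              min_count=2,
              sg=1)  # sg=1 selects skip-gram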

In [50]:
vocab = sg.wv.index_to_key

In [51]:
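def get_mean_vector(model, sentence):
    # Same helper as in the CBOW section: mean of in-vocabulary vectors, else zeros.
    words = [word for word in sentence if word in model.wv.key_to_index]
    if len(words) >= 1:
        return np.mean(model.wv[words], axis=0)
    return np.zeros((model.wv.vector_size,))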

In [52]:
sg_array = []
for sentence in data['pre_processed_text'].values.tolist():
    sg_array.append(get_mean_vector(sg, sentence))

In [53]:
sg_array = np.array(sg_array)
sg_array.shape

(27480, 100)
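
CBOW and skip-gram differ only in the direction of the prediction: CBOW predicts a target from its surrounding context, while skip-gram predicts each context item from the target. The sketch below generates the training pairs each objective would see for one sentence. It is a simplified illustration with a fixed window, not gensim's implementation, and both function names are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: the target predicts each neighbour separately."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(context list, target) pairs: all neighbours together predict the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            pairs.append((context, target))
    return pairs

tokens = ["leave", "me", "alone"]
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

Skip-gram produces more training examples per sentence, one for every neighbour of every target, whereas CBOW produces one per target.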

### Label Encoding and Train/Test Split

In [54]:
lb = LabelEncoder()
data['sentiment'] = lb.fit_transform(data['sentiment'])

In [55]:
y = data['sentiment']
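
`LabelEncoder` numbers the classes in sorted order, which is why 0, 1 and 2 stand for negative, neutral and positive in the classification reports. A minimal pure-Python sketch of that mapping (the helper `encode_labels` is hypothetical and only mimics the sorted-order behaviour):

```python
def encode_labels(labels):
    """Map each distinct label to an integer by sorted order."""
    classes = sorted(set(labels))
    mapping = {label: idx for idx, label in enumerate(classes)}
    return mapping, [mapping[label] for label in labels]

mapping, encoded = encode_labels(["neutral", "negative", "positive", "negative"])
print(mapping)
print(encoded)
```

With the fitted encoder itself, `lb.classes_` and `lb.inverse_transform` give the same information without hard-coding the numbers.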

In [56]:
x_train_bow, x_test_bow, y_train_bow, y_test_bow = train_test_split(count_matrix, y, test_size=0.2, random_state=9)

In [57]:
x_train_tfidf, x_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(tfidf_array, y, test_size=0.2, random_state=9)

In [59]:

x_train_cbow, x_test_cbow, y_train_cbow, y_test_cbow = train_test_split(cbow_array, y, test_size=0.2, random_state=9)

In [60]:
x_train_skg, x_test_skg, y_train_skg, y_test_skg = train_test_split(sg_array, y, test_size=0.2, random_state=9)

In [61]:
print("Bag of Words (BoW) Shapes:")
print("x_train_bow shape:", x_train_bow.shape)
print("x_test_bow shape:", x_test_bow.shape)
print("y_train_bow shape:", y_train_bow.shape)
print("y_test_bow shape:", y_test_bow.shape)
print("=======================")
print("\nTF-IDF Shapes:")
print("x_train_tfidf shape:", x_train_tfidf.shape)
print("x_test_tfidf shape:", x_test_tfidf.shape)
print("y_train_tfidf shape:", y_train_tfidf.shape)
print("y_test_tfidf shape:", y_test_tfidf.shape)
print("=========================")
print("\nContinuous Bag of Words (CBOW) Shapes:")
print("x_train_cbow shape:", x_train_cbow.shape)
print("x_test_cbow shape:", x_test_cbow.shape)
print("y_train_cbow shape:", y_train_cbow.shape)
print("y_test_cbow shape:", y_test_cbow.shape)
print("========================")
print("\nSkip-Gram Shapes:")
print("x_train_skg shape:", x_train_skg.shape)
print("x_test_skg shape:", x_test_skg.shape)
print("y_train_skg shape:", y_train_skg.shape)
print("y_test_skg shape:", y_test_skg.shape)


Bag of Words (BoW) Shapes:
x_train_bow shape: (21984, 17424)
x_test_bow shape: (5496, 17424)
y_train_bow shape: (21984,)
y_test_bow shape: (5496,)

TF-IDF Shapes:
x_train_tfidf shape: (21984, 17424)
x_test_tfidf shape: (5496, 17424)
y_train_tfidf shape: (21984,)
y_test_tfidf shape: (5496,)

Continuous Bag of Words (CBOW) Shapes:
x_train_cbow shape: (21984, 100)
x_test_cbow shape: (5496, 100)
y_train_cbow shape: (21984,)
y_test_cbow shape: (5496,)

Skip-Gram Shapes:
x_train_skg shape: (21984, 100)
x_test_skg shape: (5496, 100)
y_train_skg shape: (21984,)
y_test_skg shape: (5496,)


### Model

In [62]:
def train_and_evaluate_decision_tree(x_train, x_test, y_train, y_test, representation):
    
    dtclassifier = DecisionTreeClassifier(random_state=9,max_depth=5)
    dtclassifier.fit(x_train, y_train)
    y_pred = dtclassifier.predict(x_test)

    print(f"\nMetrics for {representation}:")
    # "Model Score" is the accuracy on the training split
    print(f"Model Score: {dtclassifier.score(x_train,y_train)}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))

In [63]:
def train_and_evaluate_navie_bayes(x_train, x_test, y_train, y_test, representation):
    
    nbclassifier = MultinomialNB()
    nbclassifier.fit(x_train, y_train)
    y_pred = nbclassifier.predict(x_test)

    print(f"\nMetrics for {representation}:")
    # "Model Score" is the accuracy on the training split
    print(f"Model Score: {nbclassifier.score(x_train,y_train)}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    
    return nbclassifier

In [64]:
train_and_evaluate_decision_tree(x_train_bow, x_test_bow, y_train_bow, y_test_bow, "BoW")


Metrics for BoW:
Model Score: 0.4932678311499272
Accuracy: 0.4798034934497817
Classification Report:
               precision    recall  f1-score   support

           0       0.60      0.00      0.00      1548
           1       0.43      0.94      0.59      2182
           2       0.79      0.33      0.46      1766

    accuracy                           0.48      5496
   macro avg       0.61      0.42      0.35      5496
weighted avg       0.59      0.48      0.39      5496



In [65]:
train_and_evaluate_decision_tree(x_train_tfidf, x_test_tfidf, y_train_tfidf, y_test_tfidf, "TF-IDF")


Metrics for TF-IDF:
Model Score: 0.5020469432314411
Accuracy: 0.4885371179039301
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.00      0.00      1548
           1       0.44      0.99      0.61      2182
           2       0.91      0.30      0.45      1766

    accuracy                           0.49      5496
   macro avg       0.78      0.43      0.35      5496
weighted avg       0.75      0.49      0.39      5496
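
One plausible reason the depth-limited tree collapses towards the neutral class on BoW and TF-IDF: a binary tree of depth d has at most 2^d - 1 splitting nodes, and each split tests a single feature. The calculation below is plain arithmetic on the settings used above, not a measured diagnosis; `max_split_nodes` is a hypothetical helper.

```python
def max_split_nodes(max_depth):
    """Upper bound on internal (splitting) nodes in a binary tree of the given depth."""
    return 2 ** max_depth - 1

n_features = 17424            # BoW / TF-IDF vocabulary size reported above
budget = max_split_nodes(5)   # max_depth=5 as in train_and_evaluate_decision_tree

print(budget)
print(f"{budget / n_features:.2%} of the vocabulary can be tested at most")
```

With so few words available to split on, most short texts contain none of them and fall into the same leaf. The dense 100-dimensional embeddings do not have this problem, because every feature is non-zero for almost every row.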



In [66]:
train_and_evaluate_decision_tree(x_train_cbow, x_test_cbow, y_train_cbow, y_test_cbow, "CBOW")


Metrics for CBOW:
Model Score: 0.6477893013100436
Accuracy: 0.6299126637554585
Classification Report:
               precision    recall  f1-score   support

           0       0.54      0.50      0.52      1548
           1       0.67      0.79      0.72      2182
           2       0.65      0.55      0.60      1766

    accuracy                           0.63      5496
   macro avg       0.62      0.61      0.61      5496
weighted avg       0.63      0.63      0.62      5496



In [67]:
train_and_evaluate_decision_tree(x_train_skg, x_test_skg, y_train_skg, y_test_skg, "Skip-Gram")


Metrics for Skip-Gram:
Model Score: 0.6309133915574964
Accuracy: 0.61608442503639
Classification Report:
               precision    recall  f1-score   support

           0       0.66      0.39      0.49      1548
           1       0.67      0.70      0.69      2182
           2       0.55      0.71      0.62      1766

    accuracy                           0.62      5496
   macro avg       0.63      0.60      0.60      5496
weighted avg       0.63      0.62      0.61      5496



In [68]:
nbc_1 = train_and_evaluate_navie_bayes(x_train_bow, x_test_bow, y_train_bow, y_test_bow, "BoW")


Metrics for BoW:
Model Score: 0.8563045851528385
Accuracy: 0.7530931586608443
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.61      0.69      1548
           1       0.71      0.81      0.76      2182
           2       0.78      0.80      0.79      1766

    accuracy                           0.75      5496
   macro avg       0.76      0.74      0.75      5496
weighted avg       0.76      0.75      0.75      5496



In [75]:
nbc_2 = train_and_evaluate_navie_bayes(x_train_tfidf, x_test_tfidf, y_train_tfidf, y_test_tfidf, "Tf-IDF")


Metrics for Tf-IDF:
Model Score: 0.8605349344978166
Accuracy: 0.774745269286754
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.59      0.70      1548
           1       0.71      0.89      0.79      2182
           2       0.82      0.79      0.81      1766

    accuracy                           0.77      5496
   macro avg       0.80      0.76      0.77      5496
weighted avg       0.79      0.77      0.77      5496



In [76]:
texts = [
    "What is not to like about this product.",
    "Not bad.",
    "Not an issue.",
    "Not buggy.",
    "Not happy.",
    "Not user-friendly.",
    "Not good.",
    "Is it any good?",
    "I do not dislike horror movies.",
    "Disliking horror movies is not uncommon.",
    "Sometimes I really hate the show.",
    "I love having to wait two months for the next series to come out!",
    "The final episode was surprising with a terrible twist at the end.",
    "The film was easy to watch but I would not recommend it to my friends.",
    "I LOL’d at the end of the cake scene"
]

In [82]:
def simple_preprocess(text):
    text = remove_punctuation(text)
    tokens = convert_tokens(text)
    tokens = remove_stopwords(tokens)
    tokens = lemmatization(tokens)
    return return_sentence(tokens)

In [84]:
for text in texts:
    preprocessed_text = simple_preprocess(text)
    transformed_text = tfidf.transform([preprocessed_text]).toarray()
    prediction = nbc_2.predict(transformed_text)[0]
    
    # LabelEncoder sorted the classes alphabetically: 0=negative, 1=neutral, 2=positive
    if prediction == 0:
        print(f"{text}: Negative")
    elif prediction == 1:
        print(f"{text}: Neutral")
    elif prediction == 2:
        print(f"{text}: Positive")

What is not to like about this product.: Positive
Not bad.: Negative
Not an issue.: Neutral
Not buggy.: Neutral
Not happy.: Positive
Not user-friendly.: Neutral
Not good.: Positive
Is it any good?: Positive
I do not dislike horror movies.: Negative
Disliking horror movies is not uncommon.: Negative
Sometimes I really hate the show.: Neutral
I love having to wait two months for the next series to come out!: Neutral
The final episode was surprising with a terrible twist at the end.: Neutral
The film was easy to watch but I would not recommend it to my friends.: Neutral
I LOL’d at the end of the cake scene: Neutral
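
Several of these predictions are consistent with negation being lost during preprocessing: "Not good." is labelled Positive and "Not bad." Negative, which is what a unigram model would predict from "good" and "bad" alone once "not" has been removed as a stop word. One common remedy is to add bigram features so that "not good" becomes a feature of its own. The sketch below shows the idea in pure Python, assuming negations are kept through stop-word removal; `ngrams` and `unigrams_and_bigrams` are hypothetical helpers.

```python
def ngrams(tokens, n=2):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unigrams_and_bigrams(tokens):
    """Feature list containing every single token plus every adjacent pair."""
    return tokens + ngrams(tokens, 2)

print(unigrams_and_bigrams(["not", "good"]))
print(unigrams_and_bigrams(["not", "bad"]))
```

In scikit-learn the same effect is available through the `ngram_range=(1, 2)` parameter of `CountVectorizer` and `TfidfVectorizer`. Sarcasm, as in "I love having to wait two months", is not addressed by n-grams and usually calls for contextual models.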
