<a href="https://colab.research.google.com/github/sanvika15/nlp-lec6/blob/main/Sentimental_Analysis_in_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis in Python

This notebook accompanies a tutorial on my [YouTube channel](https://www.youtube.com/channel/UCxladMszXan-jfgzyeIMyvw). Please check it out!

In this notebook we will do some sentiment analysis in Python using the following techniques:
1. VADER (Valence Aware Dictionary and sEntiment Reasoner) - Bag of words approach
2. RoBERTa Pretrained Model from 🤗
3. Hugging Face Pipeline

# **Upload the DataSet**

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
plt.style.use('ggplot')
import nltk

In [7]:
import pandas as pd

# Load dataset
df = pd.read_csv('/content/sentiment_tweets3.csv')

# Drop the unwanted 'Unnamed: 0' column
df.drop(columns='Unnamed: 0', inplace=True)

In [8]:
df.head()

Unnamed: 0,message,label
0,just had a real good moment. i missssssssss hi...,0
1,is reading manga http://plurk.com/p/mzp1e,0
2,@comeagainjen http://twitpic.com/2y2lx - http:...,0
3,@lapcat Need to send 'em to my accountant tomo...,0
4,ADD ME ON MYSPACE!!! myspace.com/LookThunder,0


# 1. **Preprocessing Data**
## **a) Lowercasing Text**

In [9]:
# Pick an example message
df['message'][9]

"@daNanner Night, darlin'!  Sweet dreams to you "

In [10]:
# Lowercase the message
df['message'][9].lower()

"@dananner night, darlin'!  sweet dreams to you "

In [11]:
# Lowercase the whole corpus using the pandas .str.lower() accessor.
df['message'] = df['message'].str.lower()
df.head()

Unnamed: 0,message,label
0,just had a real good moment. i missssssssss hi...,0
1,is reading manga http://plurk.com/p/mzp1e,0
2,@comeagainjen http://twitpic.com/2y2lx - http:...,0
3,@lapcat need to send 'em to my accountant tomo...,0
4,add me on myspace!!! myspace.com/lookthunder,0


## **b) Remove HTML Tags**



In [12]:
# Import Regular Expression
import re

# Function to remove HTML Tags
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)

In [13]:
# Suppose we have a text Which Contains HTML Tags
text = "<html><body><p> Movie 1</p><p> Actor - Aamir Khan</p><p> Click here to <a href='http://google.com'>download</a></p></body></html>"
text

"<html><body><p> Movie 1</p><p> Actor - Aamir Khan</p><p> Click here to <a href='http://google.com'>download</a></p></body></html>"

In [14]:
# Apply Function to Remove HTML Tags.
remove_html_tags(text)

' Movie 1 Actor - Aamir Khan Click here to download'
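
Scraped text and tweets often contain HTML entities such as `&amp;` in addition to tags, and the tag-stripping regex leaves those untouched. The following is a minimal sketch, assuming entities should be decoded once the tags are gone; the function name `remove_html_tags_and_entities` is hypothetical and not part of this notebook.

```python
import html
import re

TAG_PATTERN = re.compile(r'<.*?>')

def remove_html_tags_and_entities(text):
    """Strip HTML tags, then decode entities such as &amp; and &gt;."""
    without_tags = TAG_PATTERN.sub('', text)
    return html.unescape(without_tags)

sample = "<p>Tom &amp; Jerry &gt; everything</p>"
print(remove_html_tags_and_entities(sample))
```

Decoding after stripping keeps an escaped `&lt;b&gt;` from being mistaken for a real tag and deleted.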

In [15]:
# Apply the function to remove HTML tags from the dataset's 'message' column.
df['message'] = df['message'].apply(remove_html_tags)

In [16]:
df.head()

Unnamed: 0,message,label
0,just had a real good moment. i missssssssss hi...,0
1,is reading manga http://plurk.com/p/mzp1e,0
2,@comeagainjen http://twitpic.com/2y2lx - http:...,0
3,@lapcat need to send 'em to my accountant tomo...,0
4,add me on myspace!!! myspace.com/lookthunder,0


## **c) Remove URLs**

In [17]:
# Here We also Use Regular Expressions to Remove URLs from Text or Whole Corpus.
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

In [18]:
# Suppose we have the following texts containing URLs.
text1 = 'Check out my notebook https://www.kaggle.com/campusx/notebook8223fc1abb'
text2 = 'Check out my notebook http://www.kaggle.com/campusx/notebook8223fc1abb'
text3 = 'Google search here www.google.com'
text4 = 'For notebook click https://www.kaggle.com/campusx/notebook8223fc1abb to search check www.google.com'

In [19]:
# Let's remove the URLs by calling the function
print(remove_url(text1))
print(remove_url(text2))
print(remove_url(text3))
print(remove_url(text4))

Check out my notebook 
Check out my notebook 
Google search here 
For notebook click  to search check 


In [20]:
df.head()

Unnamed: 0,message,label
0,just had a real good moment. i missssssssss hi...,0
1,is reading manga http://plurk.com/p/mzp1e,0
2,@comeagainjen http://twitpic.com/2y2lx - http:...,0
3,@lapcat need to send 'em to my accountant tomo...,0
4,add me on myspace!!! myspace.com/lookthunder,0
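
The `df.head()` output above still shows the URLs because `remove_url` has only been called on the example strings, not on the `message` column. Here is a minimal sketch of applying it to a batch of messages. It uses a plain list so it runs without the dataset, and the two messages are copied from the output above. On the DataFrame, the same pattern used for the HTML step would be `df['message'] = df['message'].apply(remove_url)`.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_url(text):
    """Delete http(s):// and www. URLs from a string."""
    return URL_PATTERN.sub('', text)

messages = [
    "is reading manga http://plurk.com/p/mzp1e",
    "add me on myspace!!! myspace.com/lookthunder",
]
# Strip the trailing space that is left where a URL used to be
cleaned = [remove_url(message).strip() for message in messages]
print(cleaned)
```

The second message is unchanged: a bare domain such as `myspace.com/lookthunder` has no `http://` or `www.` prefix, so this pattern does not match it.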


## **d) Remove Punctuation**

In [21]:
# Import the string module, which provides the list of punctuation characters.
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [22]:
# Storing Punctuation in a Variable
punc = string.punctuation

In [23]:
# remove_punc takes a text input and removes all punctuation characters from it using
# the translate method with a translation table created by str.maketrans.
def remove_punc(text):
    return text.translate(str.maketrans('', '', punc))

In [24]:
# Example on a single message from the dataset.
print(df['message'][9])
# Remove Punctuation
remove_punc(df['message'][9])

@dananner night, darlin'!  sweet dreams to you 


'dananner night darlin  sweet dreams to you '
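
The order of the cleaning steps matters. As a small sketch, assuming both helpers are used together, the example below shows that stripping punctuation before removing URLs destroys the `://` that the URL pattern relies on. The two functions are re-declared here only so that the block is self-contained.

```python
import re
import string

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
PUNC_TABLE = str.maketrans('', '', string.punctuation)

def remove_url(text):
    return URL_PATTERN.sub('', text)

def remove_punc(text):
    return text.translate(PUNC_TABLE)

tweet = "is reading manga http://plurk.com/p/mzp1e"

# Punctuation first: the URL loses ':', '/' and '.', so the pattern no longer matches
punc_first = remove_url(remove_punc(tweet))
# URL first: the link is removed cleanly before punctuation is stripped
url_first = remove_punc(remove_url(tweet))

print(repr(punc_first))
print(repr(url_first))
```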

In [25]:
import re

# Clean non-ASCII characters from 'message' column
df['message'] = df['message'].apply(lambda x: re.sub(r'[^\x00-\x7F]+', '', str(x)))
df['message'] = df['message'].str.strip()


In [26]:
print(df['message'][7])

23 or 24c possible today. nice


In [27]:
df.head()

Unnamed: 0,message,label
0,just had a real good moment. i missssssssss hi...,0
1,is reading manga http://plurk.com/p/mzp1e,0
2,@comeagainjen http://twitpic.com/2y2lx - http:...,0
3,@lapcat need to send 'em to my accountant tomo...,0
4,add me on myspace!!! myspace.com/lookthunder,0


## **d) Handling StopWords**

Stop words are common words such as "the," "is," and "and" that appear frequently in text but carry little meaning on their own. Removing them reduces noise and shrinks the vocabulary, which can help bag-of-words tasks such as topic modeling and text classification. For sentiment analysis the benefit is less clear-cut: NLTK's English list includes negations such as "not", so removing stop words can change the apparent sentiment of a sentence. Treat this step as optional and check its effect on your task.


In [28]:
# We use NLTK library to remove Stopwords.
from nltk.corpus import stopwords
nltk.download('stopwords')

# Load the English stop words. Other languages, such as Spanish, are also available.
stopword = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [29]:
# Replace every stop word with an empty string and rejoin the words.
# Note that the empty strings leave extra spaces behind in the result.
def remove_stopwords(text):
    new_text = ['' if word in stopword else word for word in text.split()]
    return " ".join(new_text)

In [30]:
# Text
text = 'probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it\'s not preachy or boring. it just never gets old, despite my having seen it some 15 or more times'
print(f'Text With Stop Words :{text}')
# Calling Function
remove_stopwords(text)

Text With Stop Words :probably my all-time favorite movie, a story of selflessness, sacrifice and dedication to a noble cause, but it's not preachy or boring. it just never gets old, despite my having seen it some 15 or more times


'probably  all-time favorite movie,  story  selflessness, sacrifice  dedication   noble cause,    preachy  boring.   never gets old, despite   seen   15   times'
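
In the output above, "not preachy or boring" has lost its "not", which matters for sentiment. One possible workaround is to keep negations out of the stop-word list. The sketch below uses a small hand-written stop-word set so that it runs without downloading NLTK data; `STOPWORDS`, `NEGATIONS`, and the `keep` parameter are illustrative assumptions, not part of NLTK.

```python
# Small illustrative stop-word list (hypothetical; NLTK's English list is much longer)
STOPWORDS = {"it", "is", "was", "not", "no", "but", "or", "a", "the", "my"}
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(text, stopwords=STOPWORDS, keep=NEGATIONS):
    """Drop stop words, but never drop a word listed in `keep`."""
    effective = stopwords - keep
    return " ".join(word for word in text.split() if word not in effective)

review = "it is not preachy or boring"
print(remove_stopwords(review, keep=set()))  # plain stop-word removal
print(remove_stopwords(review))              # negations preserved
```

Filtering inside the generator also avoids the doubled spaces that appear when stop words are replaced by empty strings.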

In [31]:
# We can Apply the same Function on Whole Corpus also
df['message'].apply(remove_stopwords)

Unnamed: 0,message
0,"real good moment. missssssssss much,"
1,reading manga http://plurk.com/p/mzp1e
2,@comeagainjen http://twitpic.com/2y2lx - http:...
3,@lapcat need send 'em accountant tomorrow. ...
4,add myspace!!! myspace.com/lookthunder
...,...
10309,"depression g herbo mood on, done stress..."
10310,depression succumbs brain makes feel l...
10311,ketamine nasal spray shows promise depression...
10312,dont mistake bad day depression! everyone 'em!


## **e) Tokenization**

In [32]:
from nltk.tokenize import word_tokenize,sent_tokenize
nltk.download('punkt_tab')

# Some Sentences
sent5 = 'I have a Ph.D in A.I'
sent6 = "We're here to help! mail us at nks@gmail.com"
sent7 = 'A 5km ride cost $10.50'

# Word Tokenize the Sentences
print(word_tokenize(sent5))
print(word_tokenize(sent6))
print(word_tokenize(sent7))

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


['I', 'have', 'a', 'Ph.D', 'in', 'A.I']
['We', "'re", 'here', 'to', 'help', '!', 'mail', 'us', 'at', 'nks', '@', 'gmail.com']
['A', '5km', 'ride', 'cost', '$', '10.50']


In [33]:
# Install spaCy and the English model (only once)
!pip install -U spacy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 34.7 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


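spaCy's tokenizer keeps the email address `nks@gmail.com` as a single token, whereas NLTK's `word_tokenize` split it into `nks`, `@`, and `gmail.com` above. The best tokenizer therefore depends on your problem; you can try both and pick the one that suits your data.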

In [34]:
# Import spaCy and load the English language model
import spacy
nlp = spacy.load('en_core_web_sm')

In [35]:
# Tokenize the Sentences in Words
doc1 = nlp(sent5)
doc2 = nlp(sent6)
doc3 = nlp(sent7)

In [36]:
# Print the generated tokens
for token in doc2:
    print(token.text)

We
're
here
to
help
!
mail
us
at
nks@gmail.com
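
When neither library splits text the way a task needs, a small regular-expression tokenizer is another option. The following is only a sketch under simple assumptions (ASCII-style emails, `@mentions`, and `http` URLs kept whole); `simple_tokenize` is a hypothetical helper, not an NLTK or spaCy function.

```python
import re

TOKEN_PATTERN = re.compile(
    r"""
    [\w.+-]+@[\w-]+\.[\w.-]+   # email addresses
    | @\w+                     # @mentions
    | https?://\S+             # URLs
    | \w+(?:'\w+)?             # words, optionally with an internal apostrophe
    | [^\w\s]                  # any other single non-space symbol
    """,
    re.VERBOSE,
)

def simple_tokenize(text):
    """Return the tokens found by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("We're here to help! mail us at nks@gmail.com"))
print(simple_tokenize("@lapcat need to send 'em"))
```

The alternatives are tried left to right, so the more specific patterns (emails, mentions, URLs) must come before the generic word pattern.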


## **f) Stemming**
Stemming is a text preprocessing technique in NLP used to reduce words to their root or base form, known as a stem, by removing suffixes. It helps in simplifying the vocabulary and reducing word variations, thereby improving the efficiency of downstream NLP tasks like information retrieval and sentiment analysis. By converting words to their common root, stemming increases the overlap between related words, enhancing the generalization ability of models.

In [37]:
# Import PorterStemmer from NLTK Library
from nltk.stem.porter import PorterStemmer

# Initialize the stemmer
stemmer = PorterStemmer()

# This Function Will Stem Words
def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

In [38]:
# A single Sentence
st = "walk walks walking walked"
# Calling Function
stem_words(st)

'walk walk walk walk'

That is how stemming works: all four forms reduce to the same stem, `walk`.

However, stemming may sometimes result in the production of non-existent or incorrect words, known as stemming errors, which need to be carefully managed to avoid impacting the accuracy of NLP applications.
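
To make stemming errors concrete, here is a deliberately naive suffix-stripping stemmer. It is not the Porter algorithm, only a toy written for this illustration; the suffix list and `min_stem_len` are assumptions. It shows over-stemming (unrelated words collapsing to one stem), stems that are not real words, and under-stemming (related forms staying apart).

```python
SUFFIXES = ("ity", "ing", "ed", "es", "e", "s")  # checked in this order

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["walks", "walking", "walked", "university", "universe", "studies", "ran", "run"]:
    print(f"{word:12}{naive_stem(word)}")
```

Here "university" and "universe" collapse to the same stem, "studies" becomes the non-word "studi", and "ran" is never linked to "run".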

## **g) Lemmatization**

Lemmatization is performed in NLP text preprocessing to reduce words to their base or dictionary form (lemma), enhancing consistency and simplifying analysis. Unlike stemming, which truncates words to their root form without considering meaning, lemmatization ensures that words are transformed to their canonical form, considering their part of speech. This process aids in reducing redundancy, improving text normalization, and enhancing the accuracy of downstream NLP tasks such as sentiment analysis, topic modeling, and information retrieval. Overall, lemmatization contributes to refining text data, facilitating more effective linguistic analysis and machine learning model performance.


In [39]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [40]:
# Import WordNetLemmatizer from the NLTK library.
from nltk.stem import WordNetLemmatizer
# Initialize the lemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Sentence
sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

# Punctuation marks to drop
punctuations = "?:!.,;"

# Tokenize the sentence into words
sentence_words = nltk.word_tokenize(sentence)

# Filter out punctuation tokens. Building a new list avoids removing items
# from a list while iterating over it, which can silently skip elements.
sentence_words = [word for word in sentence_words if word not in punctuations]

# Print each word next to its lemma. pos='v' treats every token as a verb,
# so nouns such as "hours" are left unchanged.
print("{0:20}{1:20}".format("Word", "Lemma"))
for word in sentence_words:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos='v')))

Word                Lemma               
He                  He                  
was                 be                  
running             run                 
and                 and                 
eating              eat                 
at                  at                  
same                same                
time                time                
He                  He                  
has                 have                
bad                 bad                 
habit               habit               
of                  of                  
swimming            swim                
after               after               
playing             play                
long                long                
hours               hours               
in                  in                  
the                 the                 
Sun                 Sun                 


# **Training/Validation Split**

In [41]:
from sklearn.model_selection import train_test_split

# Step 1: Split into 70% train and 30% temp (val + test)
train_texts, temp_texts, train_labels, temp_labels = train_test_split(
    df['message'], df['label'], test_size=0.3, random_state=42, stratify=df['label']
)

# Step 2: Split 30% temp into 15% val and 15% test
val_texts, test_texts, val_labels, test_labels = train_test_split(
    temp_texts, temp_labels, test_size=0.5, random_state=42, stratify=temp_labels
)

In [42]:
# Initialize with 'not_set'
df['data_type'] = 'not_set'

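# Assign split labels by matching on text content.
# Caveat: a duplicated message matches more than one split, and a later assignment
# overwrites an earlier one, so the counts below deviate slightly from an exact
# 70/15/15 split. Matching on the index (e.g. train_texts.index) avoids this.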
df.loc[df['message'].isin(train_texts), 'data_type'] = 'train'
df.loc[df['message'].isin(val_texts), 'data_type'] = 'val'
df.loc[df['message'].isin(test_texts), 'data_type'] = 'test'


In [43]:
df.groupby(['label', 'data_type']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,message
label,data_type,Unnamed: 2_level_1
0,test,1201
0,train,5598
0,val,1201
1,test,360
1,train,1605
1,val,349
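
The counts above are not an exact 70/15/15 split, which is consistent with duplicate messages being relabelled when splits are matched by text. The toy example below, with made-up messages and row numbers, sketches the difference between labelling by content and labelling by row index.

```python
# Row 0 and row 2 hold the same text but belong to different splits
messages = ["good day", "so tired", "good day", "feeling low"]
splits = {"train": [0, 1], "val": [2], "test": [3]}

# Content-based labelling, mirroring the isin() approach
by_content = ["not_set"] * len(messages)
for name, rows in splits.items():
    split_texts = {messages[i] for i in rows}
    for row, text in enumerate(messages):
        if text in split_texts:
            by_content[row] = name  # later splits overwrite earlier ones

# Index-based labelling: each row is labelled exactly once
by_index = ["not_set"] * len(messages)
for name, rows in splits.items():
    for row in rows:
        by_index[row] = name

print(by_content)
print(by_index)
```

In pandas, the index-based variant would look like `df.loc[train_texts.index, 'data_type'] = 'train'`, since `train_test_split` preserves the original index of a Series.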


## Loading Tokenizer and Encoding our Data


BERT tokenization converts raw text into the numerical inputs that the BERT model expects. It tokenizes the text and applies some preprocessing to match the model's input format. Some key features of the BERT tokenizer:

- It splits words into subwords, or WordPieces. For example, the word "geeksforgeeks" can be split into "geeks", "##for", and "##geeks". The "##" prefix indicates that the subword continues the previous one. This reduces the vocabulary size and helps the model deal with rare or unknown words.
- It uses special tokens. [CLS] and [SEP] are added to every sequence, while [MASK] is reserved for pre-training:
  - [CLS] is used for classification and represents the entire input, as in sentiment analysis.
  - [SEP] is a separator that marks the boundaries between different sentences or segments.
  - [MASK] is used for masking, i.e. hiding some tokens from the model during pre-training.
- It returns the following components:
  - input_ids: the numerical identifiers of the vocabulary tokens.
  - token_type_ids: identifies which segment or sentence each token belongs to.
  - attention_mask: flags that tell the model which tokens to attend to and which to disregard (padding).
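
The subword splitting described above can be sketched as a greedy longest-match-first search over a vocabulary. The code below is a simplified illustration with a tiny made-up vocabulary, not the Hugging Face implementation; `wordpiece_tokenize` and `toy_vocab` are hypothetical names.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return [unk]  # no piece fits: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"geeks", "##for", "##geeks", "play", "##ing"}
print(wordpiece_tokenize("geeksforgeeks", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```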

In [44]:
from transformers import BertTokenizer
#Tokenize and encode the data using the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [45]:
max_len = 128

# Tokenize and encode the sentences using BERT tokenizer
X_train_encoded = tokenizer.batch_encode_plus(
    train_texts.tolist(),
    padding=True,
    truncation=True,
    max_length=max_len,
    return_tensors='tf'  # Use 'pt' if working with PyTorch
)

X_val_encoded = tokenizer.batch_encode_plus(
    val_texts.tolist(),
    padding=True,
    truncation=True,
    max_length=max_len,
    return_tensors='tf'
)

X_test_encoded = tokenizer.batch_encode_plus(
    test_texts.tolist(),
    padding=True,
    truncation=True,
    max_length=max_len,
    return_tensors='tf'
)
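
To show what `padding=True`, `truncation=True`, and `max_length` do to a batch, here is a simplified pure-Python sketch. It assumes the ids 101 for [CLS], 102 for [SEP], and 0 for padding, which match the notebook output for `bert-base-uncased`. The function `encode_batch` and the token ids in `batch` are made up for illustration.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def encode_batch(batch, max_length):
    """Wrap each sequence in [CLS]/[SEP], truncate, then pad to the longest one."""
    # Truncate so that the two special tokens still fit within max_length
    wrapped = [[CLS_ID] + ids[: max_length - 2] + [SEP_ID] for ids in batch]
    width = max(len(ids) for ids in wrapped)  # padding=True pads to the longest sequence
    return {
        "input_ids": [ids + [PAD_ID] * (width - len(ids)) for ids in wrapped],
        "token_type_ids": [[0] * width for _ in wrapped],  # single-sentence input
        "attention_mask": [[1] * len(ids) + [0] * (width - len(ids)) for ids in wrapped],
    }

batch = [[2129, 6881], [2048, 2111, 2013, 2026]]  # arbitrary token ids
encoded = encode_batch(batch, max_length=5)
for name, rows in encoded.items():
    print(name, rows)
```

The attention mask is 1 for real tokens and 0 for padding, so the model can ignore the padded positions.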


In [46]:
k = 0
print('Training Comment -->>', train_texts.iloc[k])  # Use .iloc if it's a pandas Series
print('\nInput Ids -->>\n', X_train_encoded['input_ids'][k])
print('\nDecoded Ids -->>\n', tokenizer.decode(X_train_encoded['input_ids'][k]))
print('\nAttention Mask -->>\n', X_train_encoded['attention_mask'][k])
print('\nLabel -->>', train_labels.iloc[k])  # Use .iloc if it's a pandas Series

Training Comment -->> @mark_stringer and @geechee_girl how weird/cool that two people from my two lives should meet on twitter

Input Ids -->>
 tf.Tensor(
[  101  1030  2928  1035  5164  2121  1998  1030 20277 25923  1035  2611
  2129  6881  1013  4658  2008  2048  2111  2013  2026  2048  3268  2323
  3113  2006 10474   102     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0], shape=(128,), dtype=int32)

Decoded Ids -->>
 [CLS] @ mark _ str

In [47]:
# Build the classification model
from transformers import TFBertForSequenceClassification

# Initialize the model
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
import tensorflow as tf

# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# Reduce batch size
batch_size = 16   # or 8 if you're using CPU only

# Use fewer samples temporarily for testing
sample_size = 1000
val_sample_size = 200

train_inputs = [
    X_train_encoded['input_ids'][:sample_size],
    X_train_encoded['token_type_ids'][:sample_size],
    X_train_encoded['attention_mask'][:sample_size]
]
train_targets = train_labels[:sample_size]

val_inputs = [
    X_val_encoded['input_ids'][:val_sample_size],
    X_val_encoded['token_type_ids'][:val_sample_size],
    X_val_encoded['attention_mask'][:val_sample_size]
]
val_targets = val_labels[:val_sample_size]

# Use only 1 epoch first to test
history = model.fit(
    train_inputs,
    train_targets,
    validation_data=(val_inputs, val_targets),
    batch_size=batch_size,
    epochs=1
)
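
The loss used above, `SparseCategoricalCrossentropy(from_logits=True)`, takes raw model outputs (logits) and integer class labels. As a rough sketch of the per-example computation, assuming the usual softmax cross-entropy definition, the value is the negative log-softmax of the true class; Keras then averages it over the batch. The function name and the logits below are illustrative only.

```python
import math

def sparse_categorical_crossentropy_from_logits(logits, label):
    """Cross-entropy for one example given raw logits and an integer class label."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    log_prob = logits[label] - log_sum_exp  # log-softmax of the true class
    return -log_prob

logits = [2.0, 0.5]  # made-up logits for classes 0 and 1
loss_if_label_0 = sparse_categorical_crossentropy_from_logits(logits, 0)
loss_if_label_1 = sparse_categorical_crossentropy_from_logits(logits, 1)
print(loss_if_label_0, loss_if_label_1)
```

The loss is small when the larger logit belongs to the true class and grows when the model favours the wrong class.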


In [None]:
#plot training progress
import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Val Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Model Accuracy Over Epochs')
plt.show()


# **Evaluate the model**

In [None]:
# Convert test labels to NumPy array
test_labels = np.array(test_labels)

# Evaluate the model on the test data
test_loss, test_accuracy = model.evaluate(
    [X_test_encoded['input_ids'], X_test_encoded['token_type_ids'], X_test_encoded['attention_mask']],
    test_labels
)

print(f'Test loss: {test_loss}, Test accuracy: {test_accuracy}')


In [None]:
from transformers import BertTokenizer, TFBertForSequenceClassification
from sklearn.metrics import classification_report
import tensorflow as tf
import numpy as np

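# Save the fine-tuned tokenizer and model so they can be loaded from disk below
path = '/content'
tokenizer.save_pretrained(path + '/Tokenizer')
model.save_pretrained(path + '/Model')

# Load tokenizer and model
bert_tokenizer = BertTokenizer.from_pretrained(path + '/Tokenizer')
bert_model = TFBertForSequenceClassification.from_pretrained(path + '/Model')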

# Predict on test data
pred = bert_model.predict([
    X_test_encoded['input_ids'],
    X_test_encoded['token_type_ids'],
    X_test_encoded['attention_mask']
])

# Extract logits and get predicted labels
logits = pred.logits
pred_labels = tf.argmax(logits, axis=1).numpy()

# Map labels to strings (assumes label 1 = positive and 0 = negative;
# check this against the dataset's documentation)
label_map = {1: 'positive', 0: 'Negative'}
pred_labels_str = [label_map[i] for i in pred_labels]

# Ensure test_labels is a NumPy array
test_labels = np.array(test_labels)
actual_labels_str = [label_map[i] for i in test_labels]

# Print classification report
print("Classification Report: \n", classification_report(actual_labels_str, pred_labels_str))

# Print sample predictions
print('Predicted Labels :', pred_labels_str[:10])
print('Actual Labels    :', actual_labels_str[:10])
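
The dataset is imbalanced (8,000 rows with label 0 versus 2,314 with label 1 in the split table), so accuracy alone can look better than the model really is. The sketch below shows, with made-up logits and labels, how logits become predictions via argmax and how precision, recall, and F1 for one class could be computed by hand. It illustrates the metrics that `classification_report` summarizes and is not a replacement for it.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class treated as `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up logits for five examples (columns: class 0, class 1)
logits = [[2.1, -0.3], [0.2, 1.4], [1.0, 0.9], [-0.5, 0.7], [0.3, 0.1]]
y_true = [0, 1, 1, 1, 0]
y_pred = [argmax(row) for row in logits]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(y_pred, accuracy, precision, recall, f1)
```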


# **Prediction with user inputs**

In [None]:
def Get_sentiment(Review, Tokenizer=bert_tokenizer, Model=bert_model):
    # Ensure input is a list
    if not isinstance(Review, list):
        Review = [Review]

    # Tokenize the input text
    encoding = Tokenizer.batch_encode_plus(
        Review,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors='tf'
    )

    input_ids = encoding['input_ids']
    token_type_ids = encoding['token_type_ids']
    attention_mask = encoding['attention_mask']

    # Predict using the model
    prediction = Model.predict([input_ids, token_type_ids, attention_mask])

    # Get the predicted class labels
    pred_indices = tf.argmax(prediction.logits, axis=1).numpy()

    # Label mapping
    label_map = {1: 'positive', 0: 'Negative'}
    pred_labels = [label_map[i] for i in pred_indices]

    return pred_labels


In [None]:
print(Get_sentiment("I love this product!"))       # ['positive']
print(Get_sentiment("This is really bad."))        # ['Negative']
print(Get_sentiment(["Great work!", "Not good."])) # ['positive', 'Negative']
