<a href="https://colab.research.google.com/github/sumukhbhat12/Natural-Language-Processing-Course/blob/main/TextBlob_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Dataset: IMDB Dataset Sentiment Analysis in CSV format (Kaggle)**
https://www.kaggle.com/datasets/columbine/imdb-dataset-sentiment-analysis-in-csv-format

In [1]:
import pandas as pd
from textblob import TextBlob
from nltk.tokenize.toktok import ToktokTokenizer
import re
import spacy

**Loading the data**

Remove the `<br>` and `</br>` tags from the CSV file before loading it; otherwise `pd.read_csv` fails with a tokenizing ParserError.

In [8]:
train = pd.read_csv('/content/Train.csv')

print(train)

                                                    text  label
0      I grew up (b. 1965) watching and loving the Th...      0
1      When I put this movie in my DVD player, and sa...      0
2      Why do people who do not know what a particula...      0
3      Even though I have great interest in Biblical ...      0
4      Im a die hard Dads Army fan and nothing will e...      1
...                                                  ...    ...
39995  "Western Union" is something of a forgotten cl...      1
39996  This movie is an incredible piece of work. It ...      1
39997  My wife and I watched this movie because we pl...      0
39998  When I first watched Flatliners, I was amazed....      1
39999  Why would this film be so good, but only gross...      1

[40000 rows x 2 columns]


Draw 5,000 random samples from each class (label 0 and label 1) and concatenate them into a new training set.
This reduces the size of the dataset while keeping the two classes balanced.

In [9]:
label_0 = train[train['label'] == 0].sample(n=5000)
label_1 = train[train['label'] == 1].sample(n=5000)

train = pd.concat([label_1,label_0])
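
The balanced subsample above changes on every run because `sample` is called without a seed. A minimal sketch of a reproducible variant follows, assuming pandas 1.1 or later for `DataFrameGroupBy.sample`. The helper name `balanced_sample`, the seed 42, and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def balanced_sample(df, label_col="label", n_per_class=5000, seed=42):
    """Draw n_per_class rows from each class, then shuffle the result."""
    sampled = df.groupby(label_col).sample(n=n_per_class, random_state=seed)
    # frac=1 reshuffles all rows so the classes are interleaved
    return sampled.sample(frac=1, random_state=seed)

# Toy data standing in for Train.csv (hypothetical values)
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(20)],
    "label": [0] * 12 + [1] * 8,
})
subset = balanced_sample(toy, n_per_class=5)
print(subset["label"].value_counts())
```

Fixing the seed makes the later cell outputs comparable between runs; dropping `random_state` restores the notebook's behaviour.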

Shuffle the training set so that the two classes are interleaved.

In [10]:
from sklearn.utils import shuffle
train = shuffle(train)

**Data Preprocessing**

In [11]:
train.isnull().sum()

text     0
label    0
dtype: int64

Replace whitespace-only strings with NaN, remove tab, newline, and carriage-return characters (both the real control characters and their literal \t, \n, \r spellings), and then drop every row that contains NaN.

In [12]:
import numpy as np

train.replace(r'^\s*$', np.nan, regex=True, inplace=True)

train.replace(to_replace=[r'\\t|\\n|\\r', '\t|\n|\r'], value=['',''], regex=True, inplace=True)

train.dropna(axis=0, how='any', inplace=True)
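
To see what this cleaning step does, here is a small sketch on a toy frame. The rows are hypothetical, and the two patterns are merged into one character-class regex for brevity; otherwise it follows the same replace-then-dropna idea.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical) covering the cases the cleaning step targets
toy = pd.DataFrame({
    "text": ["good movie", "   ", "line one\nline two", "tab\there"],
    "label": [1, 0, 1, 0],
})

text = toy["text"]
# Whitespace-only strings become NaN so that dropna() can remove the row
text = text.replace(r"^\s*$", np.nan, regex=True)
# Real control characters and their literal backslash spellings are removed
text = text.replace(r"\\[tnr]|[\t\n\r]", "", regex=True)

cleaned = toy.assign(text=text).dropna(axis=0, how="any")
print(cleaned["text"].tolist())
```

The whitespace-only row is dropped. Note that replacing with an empty string fuses the neighbouring words ("line oneline two"), so a single space may be a safer replacement value.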

Filter out non-ASCII characters: encode each string as ASCII, ignoring any character that cannot be converted, then decode the ASCII bytes back into a string.

In [14]:
train['text'] = train['text'].str.encode('ascii','ignore').str.decode('ascii')
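
A short sketch of the effect of this encode/decode round trip, using two made-up strings. The non-ASCII letters are written as escapes so the example does not depend on how the file is saved.

```python
import pandas as pd

# Hypothetical examples; only the first contains non-ASCII characters
# ("na\u00efve" is "naive" with a diaeresis, "caf\u00e9" is "cafe" with an accent)
reviews = pd.Series(["A na\u00efve caf\u00e9 scene", "Plain ASCII text"])

# errors="ignore" silently drops any character that has no ASCII encoding
ascii_only = reviews.str.encode("ascii", "ignore").str.decode("ascii")
print(ascii_only.tolist())
```

The accented letters are dropped rather than transliterated, so words can lose characters. If that matters, normalizing with `unicodedata.normalize("NFKD", ...)` before encoding keeps the base letters.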

Remove punctuation, i.e. every character in string.punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

In [15]:
import string

# Translation table that deletes every punctuation character in one pass
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuations(text):
  return text.translate(PUNCTUATION_TABLE)

train['text'] = train['text'].apply(remove_punctuations)

In [None]:
import nltk
nltk.download('popular')
from nltk.corpus import stopwords

print(stopwords.words('english'))

Remove "no" and "not" from the stop word list so that they stay in the text: negations flip the meaning of a phrase, so sentiment analysis needs them.

In [19]:
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

In [20]:
tokenizer = ToktokTokenizer()

In [23]:
def custom_remove_stopwords(text):
  tokens = tokenizer.tokenize(text)
  tokens = [token.strip() for token in tokens]

  filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
  filtered_text = ' '.join(filtered_tokens)
  return filtered_text

In [24]:
train['text'] = train['text'].apply(custom_remove_stopwords)
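
The effect of keeping the negations can be sketched without NLTK. The stop word set below is a tiny hypothetical stand-in for the English list, and `drop_stopwords` is an illustrative helper that splits on whitespace instead of using `ToktokTokenizer`.

```python
# Tiny stand-in for the NLTK English stop word list (hypothetical subset)
base_stopwords = {"the", "was", "is", "a", "no", "not"}

# Keep the negations, mirroring the two remove() calls on stopword_list
sentiment_stopwords = base_stopwords - {"no", "not"}

def drop_stopwords(text, stopwords):
    """Remove stop words using a whitespace split and a set lookup."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)

review = "The ending was not good"
without_negation = drop_stopwords(review, base_stopwords)
with_negation = drop_stopwords(review, sentiment_stopwords)
print(without_negation)  # negation lost
print(with_negation)     # negation kept
```

Using a set rather than a list also makes each membership test constant time, which helps when filtering thousands of reviews.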

In [25]:
def remove_special_characters(text):
  # Raw string so that \s reaches the regex engine as a whitespace class
  text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
  return text

In [26]:
train['text'] = train['text'].apply(remove_special_characters)

In [27]:
def remove_html(text):
  # Non-greedy match so that each <...> tag is replaced on its own
  html_pattern = re.compile(r'<.*?>')
  return html_pattern.sub(' ', text)

In [28]:
train['text'] = train['text'].apply(remove_html)

\S matches any non-whitespace character (the complement of \s), so the pattern below consumes a URL up to the next whitespace.

In [29]:
def remove_url(text):
  url = re.compile(r'https?://\S+|www\.\S+')
  return url.sub(r' ', text)

In [30]:
train['text'] = train['text'].apply(remove_url)
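
By the time these two cells run, `remove_punctuations` and `remove_special_characters` have already deleted `<`, `>`, `:`, `/` and `.`, so the tag and URL patterns have nothing left to match. One possible ordering that lets them work is sketched below. The function name `strip_markup_then_punctuation` and the sample sentence are illustrative, not part of the notebook.

```python
import re
import string

HTML_TAG = re.compile(r"<.*?>")
URL = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_markup_then_punctuation(text):
    """Remove tags and URLs first, while their delimiters still exist."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    # Collapse the extra spaces left behind by the substitutions
    return " ".join(text.split())

sample = "Great film!<br /><br />See https://example.com/review for more."
markup_first = strip_markup_then_punctuation(sample)
# Stripping punctuation first leaves tag and URL fragments in the text
punct_first = " ".join(sample.translate(PUNCT_TABLE).split())
print(markup_first)
print(punct_first)
```

With punctuation stripped first, fragments such as "filmbr" and "httpsexamplecomreview" survive into the text that is later scored.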

In [33]:
def remove_numbers(text):
  text = ''.join([i for i in text if not i.isdigit()])
  return text

train['text'] = train['text'].apply(remove_numbers)

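Remove every token that still contains a digit. The pattern \D*\d is matched from the start of the token, so it succeeds for any token with a digit anywhere in it, not only for tokens that end with one. Because remove_numbers has already deleted all digits, this step only acts as a safety net.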

In [34]:
# Matches any token that contains at least one digit
DIGIT_TOKEN = re.compile(r'\D*\d')

def cleanse(word):
  # Return '' for tokens with a digit so they can be filtered out
  if DIGIT_TOKEN.match(word):
    return ''
  return word

def remove_alphanumeric(text):
  # Keep only the tokens that cleanse() leaves non-empty
  return ' '.join(word for word in text.split() if cleanse(word))

In [35]:
train['text'] = train['text'].apply(remove_alphanumeric)
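
To check what the `\D*\d` pattern actually catches, the sketch below compares it with a stricter "ends with a digit" pattern. The second pattern and the token list are hypothetical and only serve as a contrast.

```python
import re

contains_digit = re.compile(r"\D*\d")   # the pattern used by cleanse()
ends_with_digit = re.compile(r".*\d$")  # hypothetical stricter alternative

tokens = ["movie", "mp3", "2nd", "se7en"]

# match() anchors at the start; \D* then skips ahead to the first digit
any_digit = [t for t in tokens if contains_digit.match(t)]
trailing_digit = [t for t in tokens if ends_with_digit.match(t)]
print(any_digit)
print(trailing_digit)
```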

In [36]:
train

                                                    text  label
33387  film almost complete waste time studying book ...      0
36849  kidding weight loss thing well might lose weig...      1
33456  nineteen eighty two announced Dismisal going m...      1
30515  fabulous filmwhich watched several times since...      1
31483  Chesty gringo Telly Savalas Frank Cooper USMex...      0
...                                                  ...    ...
7792   really cant remember recommended said one favo...      1
37649  Tooth Fairy ghost old deformed witch lures chi...      0
14795  Oh well thought good action not Although Jeff ...      0
11818  Treading Water beautiful movie would put strai...      1


**Lemmatize the text**

In [47]:
nlp = spacy.load('en_core_web_sm', disable=['ner'])
# rand_text = nlp('Hello, my name is Sumukh, Nice meeting you. How are you doing, I am doing quite well, what about you? what are you doing currently? I am currently at a party')
# for word in rand_text:
  # print(word.lemma_)

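spaCy v2 uses a special lemma, -PRON-, for all pronouns such as "their", "you", "me", and "I"; the function below keeps the original word in that case. spaCy v3 no longer emits -PRON-, so there the check simply never triggers.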

In [48]:
def lemmatize_text(text):
  text = nlp(text)
  text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
  return text


In [49]:
train['text'] = train['text'].apply(lemmatize_text)

**Sentiment Analysis**

In [50]:
train['sentiment'] = train['text'].apply(lambda review: TextBlob(review).sentiment)

**Results**

Polarity is a float in [-1, 1], where -1 is the most negative sentiment and +1 the most positive. Subjectivity is a float in [0, 1], where 0 is very objective and 1 is very subjective, i.e. dominated by personal opinions and judgments.
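
To make the polarity scale concrete, the sketch below turns a few scores into Positive/Negative labels and checks how often they agree with a 0/1 label column. The scores and labels are made up; the 0.2 cut-off is the one applied in the results cell of this notebook, and the agreement check is an illustrative addition.

```python
import numpy as np
import pandas as pd

# Made-up polarity scores; in the notebook they come from TextBlob
scores = pd.DataFrame({
    "polarity": [-0.40, 0.05, 0.20, 0.65],
    "label": [0, 0, 1, 1],
})

THRESHOLD = 0.2  # same cut-off the notebook applies to its results

scores["Sentiment"] = np.where(
    scores["polarity"] >= THRESHOLD, "Positive", "Negative"
)
# Map the string back to 0/1 to compare against the dataset label
predicted = (scores["Sentiment"] == "Positive").astype(int)
agreement = float((predicted == scores["label"]).mean())
print(scores)
print(f"agreement with label: {agreement:.2f}")
```

On real data the same comparison gives a quick read on how well the chosen threshold separates the two classes.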

In [56]:
sentiment_series = train['sentiment'].tolist()

# print(sentiment_series)

# sentiment column is a tuple with values like (-0.353625, 0.783234) etc.

columns = ['polarity', 'subjectivity']

# we separate the sentiment column and make a new dataframe and split the column into polarity and subjectivity
df1 = pd.DataFrame(sentiment_series, columns=columns, index=train.index)

# print(df1.head(10))

# Concatenate train and df1 so that sentiment, polarity and subjectivity sit side by side
result = pd.concat([train, df1], axis=1)

# print(result.head(10))

# The raw sentiment tuple is no longer needed in the final result, so drop it
result.drop(['sentiment'], axis=1, inplace=True)

result.loc[result['polarity'] >= 0.2, 'Sentiment'] = 'Positive'
result.loc[result['polarity'] < 0.2, 'Sentiment'] = 'Negative'

print(result)



                                                    text  label  polarity  \
33387  film almost complete waste time study book eng...      0 -0.096259   
36849  kid weight loss thing well might lose weight n...      1  0.285714   
33456  nineteen eighty two announce Dismisal going ma...      1  0.245455   
30515  fabulous filmwhich watch several time since bu...      1  0.230003   
31483  chesty gringo Telly Savalas Frank Cooper USMex...      0  0.065556   
...                                                  ...    ...       ...   
7792   really can not remember recommend say one favo...      1  0.049669   
37649  Tooth Fairy ghost old deform witch lure child ...      0  0.119801   
14795  oh well think good action not although Jeff Sp...      0  0.183333   
11818  tread Water beautiful movie would put straight...      1  0.375000   
1644   trailer deceive think good film point bring wo...      0 -0.024074   

       subjectivity Sentiment  
33387      0.354082  Negative  
36849      