In this exercise, you will do a sentiment analysis of text comments.

1) Load the data file DailyComments.csv from the Week 4 Data Files into a data frame.

2) Identify a scheme to categorize each comment as positive or negative. You can devise your own scheme or find a commonly used scheme to perform this sentiment analysis. However you decide to do this, make sure to explain the scheme you decide to use.

3) Implement your sentiment analysis with code and display the results. Note: DailyComments.csv is a purposely small file, so you will be able to clearly see why the results are what they are.

4) For up to 5% extra credit, find another set of comments, e.g., some tweets, and perform the same sentiment analysis.

In [59]:
# importing the pandas library for making the dataframe
import pandas as pd

In [60]:
# preparing the array of stop words from the nltk package
from nltk.corpus import stopwords

# Load the stop words
stop_words = stopwords.words('english')

In [61]:
# Build a translation table (for str.translate) that removes all Unicode punctuation
import unicodedata
import sys

# Create a dictionary of punctuation characters
punctuation = dict.fromkeys(i for i in range(sys.maxunicode)
                           if unicodedata.category(chr(i)).startswith('P'))
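
As a quick check of how this translation table behaves, the sketch below rebuilds the same kind of dictionary and applies it to two sample strings. The variable names `sample` and `cleaned` are illustrative only. Because every key maps to `None`, `str.translate` deletes the matching characters, including apostrophes, which is why "I'm" later shows up as "im" in the processed text.

```python
import sys
import unicodedata

# Map every Unicode punctuation code point to None so str.translate deletes it
punctuation = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

sample = "Hello, how are you?"
cleaned = sample.translate(punctuation)
print(cleaned)  # Hello how are you

# Apostrophes are punctuation too, so contractions lose them
print("I'm".translate(punctuation))  # Im
```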

In [62]:
# using NLTK’s PorterStemmer

from nltk.stem.porter import PorterStemmer
# Create stemmer
porter = PorterStemmer()

In [63]:
# Loading the supplied file into a dataframe

filename="DailyComments.csv"
dataframe_comments = pd.read_csv(filename)
dataframe_comments

Unnamed: 0,Day of Week,comments
0,Monday,"Hello, how are you?"
1,Tuesday,Today is a good day!
2,Wednesday,It's my birthday so it's a really special day!
3,Thursday,Today is neither a good day or a bad day!
4,Friday,I'm having a bad day.
5,Saturday,There' s nothing special happening today.
6,Sunday,Today is a SUPER good day!


In [64]:
# displaying the dataframe information

dataframe_comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Day of Week  7 non-null      object
 1   comments     7 non-null      object
dtypes: object(2)
memory usage: 240.0+ bytes


In [65]:
# Preprocess a single comment string: lowercase, strip punctuation, drop stop words, then stem

def preprocess_txt(input_txt):
    preprocessed_text = input_txt
    # A. Convert all text to lowercase letters.
    preprocessed_text = " ".join(word.lower() for word in preprocessed_text.split())
    # B. Remove all punctuation from the text.
    preprocessed_text = " ".join(word.translate(punctuation) for word in preprocessed_text.split())
    # C. Remove stop words.
    preprocessed_text = " ".join(word for word in preprocessed_text.split() if word not in stop_words)
    # D. Apply NLTK’s PorterStemmer.
    preprocessed_text = " ".join(porter.stem(word) for word in preprocessed_text.split())
    return preprocessed_text

In [66]:
# Adding a new column to the dataframe for storing the processed texts
# Apply the preprocessing function to each comment
dataframe_comments['processed_txt'] = dataframe_comments['comments'].apply(preprocess_txt)

In [67]:
dataframe_comments

Unnamed: 0,Day of Week,comments,processed_txt
0,Monday,"Hello, how are you?",hello
1,Tuesday,Today is a good day!,today good day
2,Wednesday,It's my birthday so it's a really special day!,birthday realli special day
3,Thursday,Today is neither a good day or a bad day!,today neither good day bad day
4,Friday,I'm having a bad day.,im bad day
5,Saturday,There' s nothing special happening today.,noth special happen today
6,Sunday,Today is a SUPER good day!,today super good day


To apply machine learning to text, the text must first be transformed into a numeric vector representation. This process is called feature extraction or vectorization.
The CountVectorizer transformer from the sklearn.feature_extraction.text module performs its own tokenization and normalization.

For this exercise, however, I used an off-the-shelf package (TextBlob) to do the sentiment analysis instead of training a model. The polarity column in the dataframe holds TextBlob's polarity score, which ranges from -1 (most negative) to 1 (most positive).
A comment is labeled positive if its polarity is greater than 0, negative if it is less than 0, and neutral if it is exactly 0.
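
To make the idea of vectorization concrete, here is a minimal bag-of-words sketch using only the standard library. It is not CountVectorizer itself and skips the tokenization and normalization options that class offers; the helper name `vectorize` and the three sample documents (taken from the processed comments above) are chosen purely for illustration.

```python
from collections import Counter

# Three already-preprocessed comments from the dataframe above
docs = ["today good day", "im bad day", "today super good day"]

# The vocabulary is the sorted set of all distinct words
vocabulary = sorted({word for doc in docs for word in doc.split()})

def vectorize(doc, vocabulary):
    """Return the word counts of doc, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [vectorize(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['bad', 'day', 'good', 'im', 'super', 'today']
for doc, vector in zip(docs, vectors):
    print(vector, doc)
```

Each comment becomes a fixed-length list of counts, which is the form a numeric learning algorithm would consume.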


In [68]:
from textblob import TextBlob

In [69]:
# TextBlob's default sentiment analyzer scores the polarity of each text from the words it contains

dataframe_comments['polarity'] = dataframe_comments['processed_txt'].apply(lambda x: TextBlob(x).sentiment.polarity).round(2)

In [83]:
# The scheme used here derives the sentiment from the polarity value:
# positive if the polarity is > 0, negative if it is < 0, and neutral if it is exactly 0.
dataframe_comments['senti'] = dataframe_comments['polarity'].apply(lambda x: 'positive' if x > 0 else ('negative' if x < 0 else 'neutral'))


In [84]:
dataframe_comments

Unnamed: 0,Day of Week,comments,processed_txt,polarity,senti
0,Monday,"Hello, how are you?",hello,0.0,neutral
1,Tuesday,Today is a good day!,today good day,0.7,positive
2,Wednesday,It's my birthday so it's a really special day!,birthday realli special day,0.36,positive
3,Thursday,Today is neither a good day or a bad day!,today neither good day bad day,0.0,neutral
4,Friday,I'm having a bad day.,im bad day,-0.7,negative
5,Saturday,There' s nothing special happening today.,noth special happen today,0.36,positive
6,Sunday,Today is a SUPER good day!,today super good day,0.52,positive
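
The labeling rule can also be written as a small named function, which may be easier to read and test than a nested conditional expression. This is a sketch: the function name `polarity_to_label` is hypothetical, and the polarity values are copied from the table above.

```python
def polarity_to_label(polarity):
    """Map a polarity score to 'positive', 'negative', or 'neutral'."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

# Polarity values for Monday through Sunday, as shown in the table above
polarities = [0.0, 0.7, 0.36, 0.0, -0.7, 0.36, 0.52]
labels = [polarity_to_label(p) for p in polarities]
print(labels)
```

Thursday comes out neutral because the scores of "good" and "bad" cancel, and Saturday comes out positive because "special" carries a positive score even though the comment says nothing special is happening.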


In [72]:
# This is a sample (120 rows) of Twitter sentiment data from Kaggle related to global warming and climate change
# The dataset already has a sentiment label defined for each row
# I use the supplied labels only for comparison; no model was trained on this dataset

filename="twitter_sentiment_data.csv"
dataframe_twt = pd.read_csv(filename)
dataframe_twt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  120 non-null    int64 
 1   message    120 non-null    object
 2   tweetid    120 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 2.9+ KB


In [73]:
# Determining the polarity of each tweet

dataframe_twt['polarity'] = round(dataframe_twt['message'].apply(lambda x: TextBlob(x).sentiment.polarity),3)

In [80]:
# Calculating the sentiment for each tweet

dataframe_twt['senti'] = dataframe_twt['polarity'].apply(lambda x: 'positive' if x > 0.0 else ('negative' if x < 0  else 'neutral'))

In [85]:
# The dataframe below shows the calculated sentiment in the senti column
# The supplied sentiment is shown as 0, 1 or -1 in the sentiment column
# In the rows displayed, the TextBlob-based label agrees with the supplied label in most cases (row 116 is an exception)

dataframe_twt.head(120)

Unnamed: 0,sentiment,message,tweetid,polarity,senti
0,0,RT @PlessCatherine: Team energy/climate change...,793140748812226565,0.00,neutral
1,1,RT @retroJACE: global warming real as hell. al...,793141154657239040,0.10,positive
2,1,A web of truths: This is how climate change af...,793141253852594177,0.50,positive
3,1,RT @NatGeoChannel: Watch #BeforeTheFlood right...,793142226599698432,0.29,positive
4,1,RT @NatGeoChannel: Watch #BeforeTheFlood right...,793142726560583680,0.29,positive
...,...,...,...,...,...
115,1,RT @MikeBloomberg: .@LeoDiCaprio's #BeforetheF...,793253307531735040,0.10,positive
116,0,@RepDonBeyer Be sure to personally invite me t...,793253637309038592,0.25,positive
117,1,RT @RisingSign: @MolonLabeNews Will do further...,793253805035032576,0.07,positive
118,1,RT @ClimateCentral: This is what it's like to ...,793253844188880900,0.10,positive
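
One way to quantify how often the two columns agree is to compute an agreement rate. The sketch below assumes that the supplied labels -1, 0 and 1 correspond to negative, neutral and positive, which the dataset output does not confirm. The names `LABEL_NAMES` and `agreement_rate` are hypothetical, and the inputs are only the nine rows visible in the truncated output above, not the full 120 rows.

```python
# Assumed mapping from the supplied numeric labels to the senti labels
LABEL_NAMES = {-1: "negative", 0: "neutral", 1: "positive"}

def agreement_rate(supplied, predicted):
    """Return the fraction of rows where the supplied and predicted labels match."""
    matches = sum(LABEL_NAMES.get(s) == p for s, p in zip(supplied, predicted))
    return matches / len(supplied)

# The nine rows visible above (indices 0-4 and 115-118)
supplied = [0, 1, 1, 1, 1, 1, 0, 1, 1]
predicted = ["neutral", "positive", "positive", "positive", "positive",
             "positive", "positive", "positive", "positive"]

rate = agreement_rate(supplied, predicted)
print(f"{rate:.2f}")  # 0.89
```

On the full dataframe, the same comparison could be written as `(dataframe_twt['sentiment'].map(LABEL_NAMES) == dataframe_twt['senti']).mean()`, under the same mapping assumption.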
