# Ramsey King
# DSC 550 - Data Mining
# September 25, 2021
# Exercise 4.2

Load the data file DailyComments.csv from the Week 4 Data Files into a data frame.


In [1]:
import pandas as pd
import numpy as np

daily_df = pd.read_csv('DailyComments.csv')
daily_df.head(7)

Unnamed: 0,Day of Week,comments
0,Monday,"Hello, how are you?"
1,Tuesday,Today is a good day!
2,Wednesday,It's my birthday so it's a really special day!
3,Thursday,Today is neither a good day or a bad day!
4,Friday,I'm having a bad day.
5,Saturday,There' s nothing special happening today.
6,Sunday,Today is a SUPER good day!


Identify a scheme to categorize each comment as positive or negative. You can devise your own scheme or find a commonly used scheme to perform this sentiment analysis. However you decide to do this, make sure to explain the scheme you decide to use.
Implement your sentiment analysis with code and display the results. Note: DailyComments.csv is a purposely small file, so you will be able to clearly see why the results are what they are.
For up to 5% extra credit, find another set of comments, e.g., some tweets, and perform the same sentiment analysis.

For this exercise I use a seven-level sentiment scale instead of the usual three levels (positive, neutral, negative), to see whether the finer scale reveals differences that the three-level scale hides.  The function "categorizer" maps a polarity score between -1 and 1 to one of the seven categories: a score of exactly 0 is Neutral, and positive and negative scores are each split at 0.25 and 0.5 into Slightly, Moderately, and fully Positive or Negative.  The resulting labels are added as new columns to the data frame created from the DailyComments.csv file.

In [2]:
def categorizer(compound):
    """Map a polarity score in [-1, 1] to one of seven sentiment labels."""
    if 0 < compound <= 0.25:
        return 'Slightly Positive'
    elif 0.25 < compound <= 0.5:
        return 'Moderately Positive'
    elif compound > 0.5:
        return 'Positive'
    elif -0.25 <= compound < 0:
        return 'Slightly Negative'
    elif -0.5 <= compound < -0.25:
        return 'Moderately Negative'
    elif compound < -0.5:
        return 'Negative'
    else:
        # Only a score of exactly 0 (or a missing value) reaches this branch
        return 'Neutral'
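
The boundary behavior of this scheme is easiest to see by evaluating it at the cut points. The sketch below repeats the categorizer thresholds so that it runs on its own; the name `boundary_labels` and the chosen test scores are illustrative assumptions, not part of the exercise.

```python
def categorizer(compound):
    """Same thresholds as the categorizer above, repeated so this runs alone."""
    if 0 < compound <= 0.25:
        return 'Slightly Positive'
    elif 0.25 < compound <= 0.5:
        return 'Moderately Positive'
    elif compound > 0.5:
        return 'Positive'
    elif -0.25 <= compound < 0:
        return 'Slightly Negative'
    elif -0.5 <= compound < -0.25:
        return 'Moderately Negative'
    elif compound < -0.5:
        return 'Negative'
    else:
        return 'Neutral'

# Hypothetical test scores placed on and around each cut point
test_scores = (-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0)
boundary_labels = {score: categorizer(score) for score in test_scores}

for score, label in boundary_labels.items():
    print(f"{score:>6}: {label}")
```

Each cut point belongs to the milder category: 0.25 is Slightly Positive and 0.5 is Moderately Positive, and the negative side mirrors this.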

Next, we score each comment with nltk's SentimentIntensityAnalyzer and pass its compound score to the categorizer function above.

In [3]:
# import nltk; nltk.download('vader_lexicon')  # run once to fetch the lexicon
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# polarity_scores returns a dict of scores; keep only the 'compound' value.
# Assigning the whole column at once avoids chained indexing on the data frame.
daily_df['Sentiment_Score_SIA'] = daily_df['comments'].apply(
    lambda text: sia.polarity_scores(text)['compound'])

daily_df['Sentiment_SIA'] = daily_df['Sentiment_Score_SIA'].apply(categorizer)
daily_df.head(7)



Unnamed: 0,Day of Week,comments,Sentiment_Score_SIA,Sentiment_SIA
0,Monday,"Hello, how are you?",0.0,Neutral
1,Tuesday,Today is a good day!,0.4926,Moderately Positive
2,Wednesday,It's my birthday so it's a really special day!,0.5497,Positive
3,Thursday,Today is neither a good day or a bad day!,-0.735,Negative
4,Friday,I'm having a bad day.,-0.5423,Negative
5,Saturday,There' s nothing special happening today.,-0.3089,Moderately Negative
6,Sunday,Today is a SUPER good day!,0.8327,Positive


We now repeat the scoring with TextBlob, passing its polarity score to the same categorizer function.

In [4]:
from textblob import TextBlob

# TextBlob exposes the polarity score as blob.sentiment.polarity
daily_df['Sentiment_Score_TB'] = daily_df['comments'].apply(
    lambda text: TextBlob(text).sentiment.polarity)

daily_df['Sentiment_TB'] = daily_df['Sentiment_Score_TB'].apply(categorizer)
daily_df.head(7)

Unnamed: 0,Day of Week,comments,Sentiment_Score_SIA,Sentiment_SIA,Sentiment_Score_TB,Sentiment_TB
0,Monday,"Hello, how are you?",0.0,Neutral,0.0,Neutral
1,Tuesday,Today is a good day!,0.4926,Moderately Positive,0.875,Positive
2,Wednesday,It's my birthday so it's a really special day!,0.5497,Positive,0.446429,Moderately Positive
3,Thursday,Today is neither a good day or a bad day!,-0.735,Negative,-0.0875,Slightly Negative
4,Friday,I'm having a bad day.,-0.5423,Negative,-0.7,Negative
5,Saturday,There' s nothing special happening today.,-0.3089,Moderately Negative,0.357143,Moderately Positive
6,Sunday,Today is a SUPER good day!,0.8327,Positive,0.604167,Positive


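As shown, the two sentiment analyzer packages disagree on several comments.  They assign the same category to only three of the seven comments (Monday, Friday, and Sunday), and they disagree on direction for Saturday's comment, which SentimentIntensityAnalyzer scores as Moderately Negative (-0.3089) and TextBlob as Moderately Positive (0.357143).  For Thursday's comment, which describes a day that is neither good nor bad, TextBlob's score of -0.0875 is much closer to neutral than SentimentIntensityAnalyzer's -0.735.  To me, it seems as if TextBlob has a better grasp on the DailyComments dataset.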
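
One way to quantify how far the two packages diverge is to compare their scores day by day. The sketch below is only an illustration: the scores are copied by hand from the output table above, and the helper `direction` and the variable names are assumptions introduced here.

```python
# Scores copied from the DailyComments output table above
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
        'Friday', 'Saturday', 'Sunday']
sia_scores = [0.0, 0.4926, 0.5497, -0.735, -0.5423, -0.3089, 0.8327]
tb_scores = [0.0, 0.875, 0.446429, -0.0875, -0.7, 0.357143, 0.604167]

def direction(score):
    """Collapse a polarity score to its sign: -1, 0, or 1."""
    return (score > 0) - (score < 0)

# Days on which the two packages disagree about the sign of the sentiment
sign_disagreements = [
    day for day, sia, tb in zip(days, sia_scores, tb_scores)
    if direction(sia) != direction(tb)
]

# Absolute score gap per day, largest first
gaps = sorted(
    ((abs(sia - tb), day) for day, sia, tb in zip(days, sia_scores, tb_scores)),
    reverse=True,
)
largest_gap_days = [day for _, day in gaps[:2]]

print("Sign disagreements:", sign_disagreements)
print("Largest score gaps:", largest_gap_days)
```

With only seven comments this is a sanity check rather than an evaluation, but it points to the same rows a reader would pick out by eye.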

Now, for extra credit, we apply the same analysis to Tweets.csv, a Kaggle dataset of tweets about airlines.  Each tweet in the file already carries a sentiment label in the airline_sentiment column.

In [5]:
airline_df = pd.read_csv('Tweets.csv')
airline_df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


As before, we use nltk's SentimentIntensityAnalyzer to score each tweet and pass the compound score to the categorizer function defined above.

In [6]:
sia = SentimentIntensityAnalyzer()

# Keep only the 'compound' value of each polarity_scores dict.
# Assigning the whole column at once avoids chained indexing on the data frame.
airline_df['Sentiment_Score_SIA'] = airline_df['text'].apply(
    lambda text: sia.polarity_scores(text)['compound'])

airline_df['Sentiment_SIA'] = airline_df['Sentiment_Score_SIA'].apply(categorizer)
airline_df.head()


Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,Sentiment_Score_SIA,Sentiment_SIA
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),0.0,Neutral
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),0.0,Neutral
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),0.0,Neutral
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),-0.5984,Negative
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),-0.5829,Negative


We now score the tweets in Tweets.csv with TextBlob and pass its polarity score to the same categorizer function.

In [7]:
# TextBlob exposes the polarity score as blob.sentiment.polarity
airline_df['Sentiment_Score_TB'] = airline_df['text'].apply(
    lambda text: TextBlob(text).sentiment.polarity)

airline_df['Sentiment_TB'] = airline_df['Sentiment_Score_TB'].apply(categorizer)
airline_df.head(10)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone,Sentiment_Score_SIA,Sentiment_SIA,Sentiment_Score_TB,Sentiment_TB
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada),0.0,Neutral,0.0,Neutral
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada),0.0,Neutral,0.0,Neutral
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada),0.0,Neutral,-0.390625,Moderately Negative
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada),-0.5984,Negative,0.00625,Slightly Positive
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada),-0.5829,Negative,-0.35,Moderately Negative
5,570300767074181121,negative,1.0,Can't Tell,0.6842,Virgin America,,jnardino,,0,@VirginAmerica seriously would pay $30 a fligh...,,2015-02-24 11:14:33 -0800,,Pacific Time (US & Canada),-0.5945,Negative,-0.208333,Slightly Negative
6,570300616901320704,positive,0.6745,,0.0,Virgin America,,cjmcginnis,,0,"@VirginAmerica yes, nearly every time I fly VX...",,2015-02-24 11:13:57 -0800,San Francisco CA,Pacific Time (US & Canada),0.6908,Positive,0.466667,Moderately Positive
7,570300248553349120,neutral,0.634,,,Virgin America,,pilot,,0,@VirginAmerica Really missed a prime opportuni...,,2015-02-24 11:12:29 -0800,Los Angeles,Pacific Time (US & Canada),0.1458,Slightly Positive,0.2,Slightly Positive
8,570299953286942721,positive,0.6559,,,Virgin America,,dhepburn,,0,"@virginamerica Well, I didn't…but NOW I DO! :-D",,2015-02-24 11:11:19 -0800,San Diego,Pacific Time (US & Canada),-0.3477,Moderately Negative,1.0,Positive
9,570295459631263746,positive,1.0,,,Virgin America,,YupitsTate,,0,"@VirginAmerica it was amazing, and arrived an ...",,2015-02-24 10:53:27 -0800,Los Angeles,Eastern Time (US & Canada),0.7717,Positive,0.466667,Moderately Positive


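Again, as with the DailyComments data frame, the two sentiment packages disagree on several tweets.  In the ten rows shown, they disagree on direction for rows 3 and 8.  For row 3, SentimentIntensityAnalyzer scores the tweet as Negative (-0.5984), matching its airline_sentiment label, while TextBlob scores it as Slightly Positive (0.00625).  For row 8 the roles reverse: TextBlob scores the tweet as Positive (1.0), matching its label, while SentimentIntensityAnalyzer scores it as Moderately Negative (-0.3477).  It is therefore more difficult to say which sentiment analysis model I would prefer this time around.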
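
Because Tweets.csv includes its own airline_sentiment label, the two packages can be checked against it. The sketch below does this for the ten rows shown above only, with labels and scores copied by hand from that output. The helper names `three_way` and `count_matches` are assumptions, as is the choice to treat a score of exactly 0 as neutral.

```python
# Labels and scores copied from the ten rows of output shown above
labels = ['neutral', 'positive', 'neutral', 'negative', 'negative',
          'negative', 'positive', 'neutral', 'positive', 'positive']
sia_scores = [0.0, 0.0, 0.0, -0.5984, -0.5829,
              -0.5945, 0.6908, 0.1458, -0.3477, 0.7717]
tb_scores = [0.0, 0.0, -0.390625, 0.00625, -0.35,
             -0.208333, 0.466667, 0.2, 1.0, 0.466667]

def three_way(score):
    """Collapse a polarity score to positive, negative, or neutral."""
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'

def count_matches(scores, labels):
    """Count rows where the collapsed score equals the dataset label."""
    return sum(three_way(s) == label for s, label in zip(scores, labels))

sia_matches = count_matches(sia_scores, labels)
tb_matches = count_matches(tb_scores, labels)

print(f"SentimentIntensityAnalyzer: {sia_matches}/{len(labels)}")
print(f"TextBlob: {tb_matches}/{len(labels)}")
```

On these ten rows the count is 7 matches for SentimentIntensityAnalyzer and 6 for TextBlob, which is too small a sample to rank the packages. Running the same comparison over every row of the data frame would give a more meaningful answer.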