## Twitter Sentiment Analysis of US Airlines by Wendy Wong 7 April 2024

* Dataset: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment

### References:

* https://huggingface.co/blog/sentiment-analysis-python
* https://www.datacamp.com/tutorial/text-analytics-beginners-nltk

In [1]:
# Install Python module
!pip install nltk



In [2]:
# Import Python libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from nltk.sentiment.vader import SentimentIntensityAnalyzer

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

from nltk.stem import WordNetLemmatizer




In [3]:
# Load the data into a DataFrame

df = pd.read_csv('Tweets.csv', encoding='utf-8')

## Exploratory Analysis

In [4]:
# Inspect the first 3 rows of the dataframe
df.head(3)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)


In [5]:
# Check for nulls
df.tweet_id.count(), df.isnull().sum()

(14640,
 tweet_id                            0
 airline_sentiment                   0
 airline_sentiment_confidence        0
 negativereason                   5462
 negativereason_confidence        4118
 airline                             0
 airline_sentiment_gold          14600
 name                                0
 negativereason_gold             14608
 retweet_count                       0
 text                                0
 tweet_coord                     13621
 tweet_created                       0
 tweet_location                   4733
 user_timezone                    4820
 dtype: int64)
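
The raw null counts are easier to compare as a share of all rows. A minimal sketch, assuming a small stand-in DataFrame (the `toy` frame and its values are made up for illustration and are not taken from `Tweets.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are illustrative only
toy = pd.DataFrame({
    "airline_sentiment": ["neutral", "positive", "negative", "negative"],
    "negativereason": [np.nan, np.nan, "Late Flight", "Bad Flight"],
    "tweet_coord": [np.nan, np.nan, np.nan, "[40.7, -74.0]"],
})

# Fraction of missing values per column, largest first
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)
```

On the real data the same call would put `tweet_coord` at roughly 93% missing (13621 of 14640) and `negativereason` at roughly 37% (5462 of 14640).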

## Pre-processing text

In [6]:
# Select a subset of columns

df1 = df[['airline_sentiment', 'airline', 'text']].copy()

In [7]:
# Inspect the transformation
df1

Unnamed: 0,airline_sentiment,airline,text
0,neutral,Virgin America,@VirginAmerica What @dhepburn said.
1,positive,Virgin America,@VirginAmerica plus you've added commercials t...
2,neutral,Virgin America,@VirginAmerica I didn't today... Must mean I n...
3,negative,Virgin America,@VirginAmerica it's really aggressive to blast...
4,negative,Virgin America,@VirginAmerica and it's a really big bad thing...
...,...,...,...
14635,positive,American,@AmericanAir thank you we got on a different f...
14636,negative,American,@AmericanAir leaving over 20 minutes Late Flig...
14637,neutral,American,@AmericanAir Please bring American Airlines to...
14638,negative,American,"@AmericanAir you have my money, you change my ..."


In [8]:
# Remove the '@' character from the tweet text

df1['text'] = df1['text'].str.replace('@', '', regex=False)




In [9]:
# Inspect the transformation

df1.head()

Unnamed: 0,airline_sentiment,airline,text
0,neutral,Virgin America,VirginAmerica What dhepburn said.
1,positive,Virgin America,VirginAmerica plus you've added commercials to...
2,neutral,Virgin America,VirginAmerica I didn't today... Must mean I ne...
3,negative,Virgin America,VirginAmerica it's really aggressive to blast ...
4,negative,Virgin America,VirginAmerica and it's a really big bad thing ...
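
Assigning into a frame that was itself selected from another frame is what typically triggers pandas' `SettingWithCopyWarning`. A minimal sketch of one way to avoid it, assuming an explicit `.copy()` of the selected columns; the `toy` frame is a made-up stand-in for `df`:

```python
import pandas as pd

# Hypothetical stand-in for df
toy = pd.DataFrame({
    "airline_sentiment": ["neutral", "negative"],
    "airline": ["Virgin America", "American"],
    "text": ["@VirginAmerica What @dhepburn said.",
             "@AmericanAir you have my money"],
    "retweet_count": [0, 0],
})

# Take an independent copy of the columns of interest
subset = toy[["airline_sentiment", "airline", "text"]].copy()

# Literal (non-regex) removal of the '@' character
subset["text"] = subset["text"].str.replace("@", "", regex=False)

print(subset["text"].tolist())
print(toy["text"].tolist())  # the original frame is unchanged
```

Working on a copy keeps the raw tweets in the original frame available for later comparison.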


In [10]:
# Convert all text to lowercase; df1 now holds only the text column as a Series

df1 = df1['text'].str.lower()

df1

0                        virginamerica what dhepburn said.
1        virginamerica plus you've added commercials to...
2        virginamerica i didn't today... must mean i ne...
3        virginamerica it's really aggressive to blast ...
4        virginamerica and it's a really big bad thing ...
                               ...                        
14635    americanair thank you we got on a different fl...
14636    americanair leaving over 20 minutes late fligh...
14637    americanair please bring american airlines to ...
14638    americanair you have my money, you change my f...
14639    americanair we have 8 ppl so we need 2 know ho...
Name: text, Length: 14640, dtype: object

In [11]:
# Check for null values
df1.isnull().sum()

0

In [12]:
# Keep the pre-processed text under a new name

df2 = df1
df2

0                        virginamerica what dhepburn said.
1        virginamerica plus you've added commercials to...
2        virginamerica i didn't today... must mean i ne...
3        virginamerica it's really aggressive to blast ...
4        virginamerica and it's a really big bad thing ...
                               ...                        
14635    americanair thank you we got on a different fl...
14636    americanair leaving over 20 minutes late fligh...
14637    americanair please bring american airlines to ...
14638    americanair you have my money, you change my f...
14639    americanair we have 8 ppl so we need 2 know ho...
Name: text, Length: 14640, dtype: object

In [13]:
# Inspect the last 5 values

df2.tail()

14635    americanair thank you we got on a different fl...
14636    americanair leaving over 20 minutes late fligh...
14637    americanair please bring american airlines to ...
14638    americanair you have my money, you change my f...
14639    americanair we have 8 ppl so we need 2 know ho...
Name: text, dtype: object

In [14]:
# Inspect the first 20 values

df2.head(20)

0                     virginamerica what dhepburn said.
1     virginamerica plus you've added commercials to...
2     virginamerica i didn't today... must mean i ne...
3     virginamerica it's really aggressive to blast ...
4     virginamerica and it's a really big bad thing ...
5     virginamerica seriously would pay $30 a flight...
6     virginamerica yes, nearly every time i fly vx ...
7     virginamerica really missed a prime opportunit...
8        virginamerica well, i didn't…but now i do! :-d
9     virginamerica it was amazing, and arrived an h...
10    virginamerica did you know that suicide is the...
11    virginamerica i &lt;3 pretty graphics. so much...
12    virginamerica this is such a great deal! alrea...
13    virginamerica virginmedia i'm flying your #fab...
14                                virginamerica thanks!
15         virginamerica sfo-pdx schedule is still mia.
16    virginamerica so excited for my first cross co...
17    virginamerica  i flew from nyc to sfo last

### Tokenizing text

In [15]:
# Break up text into individual words

# Load library
from nltk.tokenize import word_tokenize

In [16]:
# Tokenize into sentences

# Load library
from nltk.tokenize import sent_tokenize

# Create text
string = "VirginAmerica and it's a really big bad thing"

# Tokenize sentences
sent_tokenize(string)

["VirginAmerica and it's a really big bad thing"]

In [17]:
# Tokenize words

string = "VirginAmerica and it's a really big bad thing"

word_tokenize(string)


['VirginAmerica', 'and', 'it', "'s", 'a', 'really', 'big', 'bad', 'thing']
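
`word_tokenize` splits the contraction `it's` into `it` and `'s`. For tweets, NLTK also ships `TweetTokenizer`, which can lowercase text and drop @-handles in one pass. A sketch, assuming the raw tweet still contains its handle (the sample string is adapted from the rows shown earlier):

```python
from nltk.tokenize import TweetTokenizer

# preserve_case=False lowercases tokens; strip_handles=True drops @mentions
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True)

tweet = "@VirginAmerica and it's a really big bad thing"
tokens = tweet_tokenizer.tokenize(tweet)
print(tokens)
```

`TweetTokenizer` is rule-based and does not need any downloaded NLTK data, which makes it convenient for a quick check.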

### Removing stop words

In [18]:
# Remove common words
    
# Load library
from nltk.corpus import stopwords

import nltk

nltk.download('all')

[nltk_data] Downloading collection 'all'

True

In [19]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Wendy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [20]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

tokenized_words=['Virgin America',
                  'and',
                  'it',
                  'a',
                  'really',
                  'big',
                  'bad thing']
# Remove stop words
[word for word in tokenized_words if word not in stop_words]

['Virgin America', 'really', 'big', 'bad thing']
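
The membership test compares whole strings, so a multi-word entry such as `'bad thing'` can never match a stop word, and the match is case-sensitive. A sketch of the same filter on single-word tokens; the small `demo_stop_words` set is a hand-written stand-in (an assumption) for `stopwords.words('english')`, so the snippet runs without downloading NLTK data:

```python
# Hand-written stand-in for NLTK's English stop-word list (assumption)
demo_stop_words = {"and", "it", "a", "the", "is", "'s"}

tokens = ["VirginAmerica", "and", "it", "'s", "a", "really", "big", "bad", "thing"]

# Lowercase before the lookup so that capitalised stop words are also removed
filtered = [tok for tok in tokens if tok.lower() not in demo_stop_words]
print(filtered)
```

The tokens here are the ones produced by `word_tokenize` earlier, one word per entry.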

### Stemming words

In [21]:
# Convert tokenized words into their root form

# Load library
from nltk.stem.porter import PorterStemmer

# Create word tokens
tokenized_words = ['Virgin America',
                  'and',
                  'it',
                  'a',
                  'really',
                  'big',
                  'bad thing']

# Create stemmer
porter = PorterStemmer()

# Apply stemmer
[porter.stem(word) for word in tokenized_words]

['virgin america', 'and', 'it', 'a', 'realli', 'big', 'bad th']
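
Because `'bad thing'` is passed as one string, the stemmer treats it as a single word and trims it to `'bad th'`. A sketch of stemming one word per token instead, using the same `PorterStemmer`:

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# One word per token, e.g. as produced by word_tokenize
single_word_tokens = ["really", "big", "bad", "thing"]

stems = [porter.stem(word) for word in single_word_tokens]
print(stems)
```

With separate tokens, `thing` is left intact because the part before `ing` contains no vowel, which is the condition the Porter rule checks.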

## Sentiment Analyzer

In [22]:
# Install Python module

!pip install vaderSentiment



In [23]:
# Sentiment Check

# Import library

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [24]:
# Create sentiment_scores function

def sentiment_scores(sentence):
 
    # Create a SentimentIntensityAnalyzer object.
    analyzer = SentimentIntensityAnalyzer()
 
    # The polarity_scores method returns a sentiment dictionary
    # that contains neg, neu, pos, and compound scores.
    sentiment_dict = analyzer.polarity_scores(sentence)
     
    print("Overall sentiment dictionary is : ", sentiment_dict)
    print("sentence was rated as ", sentiment_dict['neg']*100, "% Negative")
    print("sentence was rated as ", sentiment_dict['neu']*100, "% Neutral")
    print("sentence was rated as ", sentiment_dict['pos']*100, "% Positive")
 
    print("Sentence Overall Rated As", end = " ")
 
    # Classify the sentence as positive, negative, or neutral
    if sentiment_dict['compound'] >= 0.05:
        print("Positive")

    elif sentiment_dict['compound'] <= -0.05:
        print("Negative")

    else:
        print("Neutral")
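
The threshold logic can also be separated from the printing, so that the label is returned and can be stored. A sketch, assuming the same cut-offs of 0.05 and -0.05 on the `compound` score; `label_from_compound` is a hypothetical helper name, not part of VADER:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Illustrative compound scores of the kind returned by polarity_scores()
for score in (0.4588, 0.0, -0.6):
    print(score, label_from_compound(score))
```

Returning the label rather than printing it makes it possible to apply the function to a whole column of scores.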

In [25]:
# Apply sentiment_scores function

# Driver code
if __name__ == "__main__" :
 
    print("\n1st statement :")
    sentence = "i ❤️ flying virginamerica. ☺️👍."
 
    # function calling
    sentiment_scores(sentence)
 
    print("\n2nd Statement :")
    sentence = "americanair you have my money, you change my f"
    sentiment_scores(sentence)
 
    print("\n3rd Statement :")
    sentence = "americanair leaving over 20 minutes late fligh."
    sentiment_scores(sentence)
    
    print("\n4th Statement :")
    sentence = "virginamerica this is such a great deal! alrea."
    sentiment_scores(sentence)
    
    print("\n5th Statement :")
    sentence = "virginamerica really missed a prime opportunit."
    sentiment_scores(sentence)



1st statement :
Overall sentiment dictionary is :  {'neg': 0.0, 'neu': 0.727, 'pos': 0.273, 'compound': 0.4588}
sentence was rated as  0.0 % Negative
sentence was rated as  72.7 % Neutral
sentence was rated as  27.3 % Positive
Sentence Overall Rated As Positive

2nd Statement :
Overall sentiment dictionary is :  {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
sentence was rated as  0.0 % Negative
sentence was rated as  100.0 % Neutral
sentence was rated as  0.0 % Positive
Sentence Overall Rated As Neutral

3rd Statement :
Overall sentiment dictionary is :  {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
sentence was rated as  0.0 % Negative
sentence was rated as  100.0 % Neutral
sentence was rated as  0.0 % Positive
Sentence Overall Rated As Neutral

4th Statement :
Overall sentiment dictionary is :  {'neg': 0.0, 'neu': 0.614, 'pos': 0.386, 'compound': 0.6588}
sentence was rated as  0.0 % Negative
sentence was rated as  61.4 % Neutral
sentence was rated as  38.6 % Positive

In [26]:
# Export the pre-processed text as a csv file to perform Amazon Comprehend Sentiment Analysis

df2.to_csv('clean.csv', index=False)
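
To keep the sentiment labels next to the cleaned text in the export, the cleaned column can be written together with the label columns. A sketch that writes to an in-memory buffer instead of `clean.csv`; the `toy` frame and the `clean_text` column name are assumptions for illustration:

```python
import io

import pandas as pd

# Hypothetical stand-in for df
toy = pd.DataFrame({
    "airline_sentiment": ["neutral", "negative"],
    "airline": ["Virgin America", "American"],
    "text": ["@VirginAmerica What @dhepburn said.",
             "@AmericanAir you have my money"],
})

# The two cleaning steps used in this notebook: drop '@' and lowercase
toy["clean_text"] = toy["text"].str.replace("@", "", regex=False).str.lower()

buffer = io.StringIO()
toy[["airline_sentiment", "airline", "clean_text"]].to_csv(buffer, index=False)

# Read the CSV back to confirm that the columns round-trip
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip)
```

Passing a file name such as `'clean.csv'` instead of the buffer writes the same content to disk.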