## Twitter sentiment analysis of airlines:
![sentiment analysis: credits: kdnuggets](https://www.kdnuggets.com/wp-content/uploads/sentiment-hero-480.jpg)<br/>
Sentiment analysis is a term that refers to the use of natural language processing, text analysis, and computational linguistics in order to ascertain the attitude of a speaker or writer toward a specific topic.<br/>
<br/>
Basically, it helps to determine whether a text is expressing sentiments that are positive, negative, or neutral. Sentiment analysis is an excellent way to discover how people, particularly consumers, feel about a particular topic, product, or idea.<br/>
<br/>
The origin of sentiment analysis can be traced to the 1950s, when it was primarily applied to written paper documents. Today, however, it is widely used to mine subjective information from content on the Internet, including texts, tweets, blogs, social media, news articles, reviews, and comments. This is done using a variety of techniques, including NLP, statistics, and machine learning methods. Organizations then use the mined information to identify new opportunities and better tailor their message to their target demographics. The Obama Administration used sentiment analysis to predict public response to its policy announcements.<br/>  
## How many types of sentiment analysis are there?
According to [upgrad](https://www.upgrad.com/blog/types-of-sentiment-analysis/), there are four types of sentiment analysis. <br/>
### (1) Fine-grained sentiment analysis:
This analysis gives you a detailed understanding of the feedback you get from customers, with precise results in terms of the polarity of the input. However, the process can be more labor- and cost-intensive than the other types. <br/>
### (2) Emotion Detection Sentiment Analysis

This is a more sophisticated way of identifying the emotion in a piece of text. Lexicons and machine learning are used to determine the sentiment. Lexicons are lists of words that are either positive or negative. This makes it easier to segregate the terms according to their sentiment. The advantage of using this is that a company can also understand why a customer feels a particular way. This is more algorithm-based and might be complex to understand at first.<br/>
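
To make the lexicon idea concrete, here is a minimal sketch of a lexicon-based scorer. The word lists and the `lexicon_score` helper are made up for illustration; real lexicons are far larger and usually carry graded scores rather than plain membership.

```python
# Tiny, hypothetical lexicons used only for illustration.
POSITIVE_WORDS = {"good", "great", "love", "happy", "excellent"}
NEGATIVE_WORDS = {"bad", "late", "hate", "terrible", "rude"}

def lexicon_score(text):
    """Return (number of positive words) - (number of negative words)."""
    words = text.lower().split()
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    return positives - negatives

# One positive word ("great") and two negative words ("late", "rude").
print(lexicon_score("great crew but the flight was late and the staff rude"))
```

A score above zero would be read as positive, below zero as negative, and zero as neutral.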

### (3) Aspect-based analysis

This type of sentiment analysis is usually for one aspect of a service or product. For example, if a company that sells televisions uses this type of sentiment analysis, it could be for one aspect of televisions – like brightness, sound, etc. So they can understand how customers feel about specific attributes of the product. <br/>

### (4) Intent analysis

This is a deeper understanding of the intention of the customer. For example, a company can predict if a customer intends to use the product or not. This means that the intention of a particular customer can be tracked, forming a pattern, and then used for marketing and advertising. <br/>
<br/>
In this notebook, we are going to explore different pretrained sentiment analysis tools, but we will not train a new model. To train a sentiment analysis model from scratch, follow this [notebook](https://www.kaggle.com/shyambhu/sentiment-classification-using-lstm).<br/>
### Here are the contents of this notebook:
(1) [Data analysis and cleaning for airline data](#dataclean)<br/>
(2) [sentiment analysis using NLTK](#nltk)<br/>
(3) [sentiment analysis using textblob](#blob)<br/>
(4) [sentiment analysis using huggingface](#huggingface)<br/>
(5) [sentiment analysis using flair](#flair)<br/>
(6) [conclusion](#conclude)<br/>
### Resources:
(1) [upgrad introduction to sentiment analysis](https://www.upgrad.com/blog/types-of-sentiment-analysis/)<br/>
(2) [lexalytics sentiment analysis introduction](https://www.lexalytics.com/technology/sentiment-analysis)<br/>
(3) [Contractions, a useful python library](https://github.com/kootenpv/contractions)<br/>
(4) [Different sentiment analysis libraries in python](https://www.iflexion.com/blog/sentiment-analysis-python)<br/>
(5) [sentiment analysis using NLTK](https://realpython.com/python-nltk-sentiment-analysis/)<br/>
(6) [monkeylearn api; not implemented here](https://app.monkeylearn.com/main/classifiers/cl_pi3C7JiL/tab/api/)<br/>
(7) [sentiment analysis using textblob](https://www.presentslide.in/2019/08/sentiment-analysis-textblob-library.html)<br/>
(8) [Using pretrained models for sentiment analysis](https://medium.com/@b.terryjack/nlp-pre-trained-sentiment-analysis-1eb52a9d742c)<br/>
(9) [text classification using sentiment analysis](https://towardsdatascience.com/text-classification-with-state-of-the-art-nlp-library-flair-b541d7add21f)
### Acknowledgements:
I would like to thank [vetrivel-ps](https://www.kaggle.com/vetrirah) and [Atif hassan](https://www.kaggle.com/atifhassan) for helping me with some of the resources and ideas for improvements.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
tweet_data = pd.read_csv('/kaggle/input/twitter-airline-sentiment/Tweets.csv')
print("data shape:",tweet_data.shape)
print("what are columns:",tweet_data.columns)
tweet_data.head()

## <a id = dataclean> Data Cleaning</a>
Now, our goal in this work is to detect sentiment from the tweet text. So we will drop all the unnecessary columns from the data as well as clean it a bit.

In [None]:
tweet_data = tweet_data.drop(['tweet_id','retweet_count', 'tweet_coord', 'tweet_created',
                               'tweet_location','name','user_timezone'],axis = 1)

In [None]:
tweet_data.head()

In [None]:
tweet_data['negativereason_gold'].unique()

In [None]:
tweet_data['negativereason'] = tweet_data['negativereason'].fillna('')
tweet_data['negativereason_confidence'] = tweet_data['negativereason_confidence'].fillna(0)
tweet_data['airline_sentiment_gold'] = tweet_data['airline_sentiment_gold'].fillna('')
tweet_data['negativereason_gold'] = tweet_data['negativereason_gold'].fillna('')

In [None]:
tweet_data.head()

In [None]:
print("different topics of negative reasons are:",tweet_data['negativereason'].unique())

In [None]:
!pip install contractions

In [None]:
import re
import contractions

def text_cleaning(text):
    # Stopwords are deliberately kept so that the sentences stay natural.
    if text:
        # Expand contractions, e.g. "don't" -> "do not".
        text = contractions.fix(text)
        # Lowercase and replace every non-alphanumeric character (including '.') with a space.
        text = re.sub(r'[^A-Za-z0-9]', ' ', text.strip().lower())
        # Collapse repeated whitespace and return the list of tokens.
        return re.sub(r'\s+', ' ', text).strip().split()
    return []

In [None]:
tweet_data['text'] = tweet_data['text'].apply(lambda x: ' '.join(text_cleaning(x)))

In [None]:
tweet_data.head(20)
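
To see what the regular-expression steps in `text_cleaning` do on their own, here is a small sketch that leaves out the `contractions` step so it needs only the standard library. The helper name `basic_clean` and the sample tweet are illustrative and not taken from the dataset.

```python
import re

def basic_clean(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = re.sub(r'[^A-Za-z0-9]', ' ', text.strip().lower())
    return re.sub(r'\s+', ' ', text).strip()

print(basic_clean("@VirginAmerica  plus you've added commercials... tacky!"))
```

Note that the apostrophe in "you've" becomes a space, leaving the fragments "you ve". This is why the notebook expands contractions before stripping punctuation.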

## <a id = nltk> Sentiment Analysis with NLTK</a>:
In this section we will perform sentiment analysis using nltk.

In [None]:
import nltk

In [None]:
nltk.download(["names","stopwords","state_union","twitter_samples",
              "movie_reviews","averaged_perceptron_tagger","vader_lexicon",
              "punkt"])

## Description of the nltk packages downloaded:
*     names: A list of common English names compiled by Mark Kantrowitz
*     stopwords: A list of really common words, like articles, pronouns, prepositions, and conjunctions
*     state_union: A sample of transcribed State of the Union addresses by different US presidents, compiled by Kathleen Ahrens
*     twitter_samples: A list of social media phrases posted to Twitter
*     movie_reviews: Two thousand movie reviews categorized by Bo Pang and Lillian Lee
*     averaged_perceptron_tagger: A data model that NLTK uses to categorize words into their part of speech
*     vader_lexicon: A scored list of words and jargon that NLTK references when performing sentiment analysis, created by C.J. Hutto and Eric Gilbert
*     punkt: A data model created by Jan Strunk that NLTK uses to split full texts into word lists


## How to calculate sentiment from nltk:
NLTK already has a built-in, pretrained sentiment analyzer called VADER (Valence Aware Dictionary and sEntiment Reasoner).

Since VADER is pretrained, you can get results more quickly than with many other analyzers. However, VADER is best suited for language used in social media, like short sentences with some slang and abbreviations. It’s less accurate when rating longer, structured sentences, but it’s often a good launching point.

To use VADER, first create an instance of nltk.sentiment.SentimentIntensityAnalyzer, then use .polarity_scores() on a raw string:

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer as SIA
sia = SIA()
print(sia.polarity_scores('wow! this nltk library really works'))

As you can see, we get back a dictionary of different scores. The negative, neutral, and positive scores are related: They all add up to 1 and can’t be negative. The compound score is calculated differently. It’s not just an average, and it can range from -1 to 1.<br/>
### Question 1: comment if you know:
How is the compound score calculated, given that it is clearly not an average?
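
A hedged partial answer, based on the reference VADER implementation as I understand it: the valence scores of the words (after rule-based adjustments for things like negation, punctuation, and capitalization) are summed, and the sum x is squashed into the range (-1, 1) with x / sqrt(x^2 + alpha), where alpha is 15 by default. The sketch below reproduces only that normalization step; the function name `normalize_compound` and the input values are illustrative.

```python
import math

def normalize_compound(valence_sum, alpha=15):
    """Squash an unbounded sum of word valences into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

# Larger sums move towards -1 or +1 but never reach them.
for total in (-4.0, 0.0, 1.5, 10.0):
    print(total, round(normalize_compound(total), 4))
```

This would explain why the compound score is not a simple average of the three other scores: it comes from the summed word valences, not from the `neg`, `neu`, and `pos` proportions.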

Let's append these sentiment scores to our original dataset. 

In [None]:
texts = tweet_data['text'].tolist()
negative_scores = []
neutral_scores = []
positive_scores = []
compound_scores = []
final_tag = []
for text in texts:
    score_dictionary = sia.polarity_scores(text)
    negative_scores.append(score_dictionary['neg'])
    positive_scores.append(score_dictionary['pos'])
    neutral_scores.append(score_dictionary['neu'])
    compound_scores.append(score_dictionary['compound'])
    if score_dictionary['compound']>0:
        final_tag.append('positive')
    elif score_dictionary['compound']<0:
        final_tag.append('negative')
    else:
        final_tag.append('neutral')
tweet_data['negative_score'] = negative_scores
tweet_data['positive_score'] = positive_scores
tweet_data['neutral_score'] = neutral_scores
tweet_data['compound_score'] = compound_scores
tweet_data['final_tag'] = final_tag

In [None]:
tweet_data.head(20)
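
The tagging loop earlier in this section treats any nonzero compound score as positive or negative. The VADER authors suggest a small dead zone instead: positive at 0.05 or above, negative at -0.05 or below, and neutral in between. Here is a minimal sketch of that rule. The helper name `tag_from_compound` is hypothetical, and whether the dead zone improves results on this dataset has not been checked here.

```python
def tag_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a tag, with a neutral band around zero."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

for score in (0.6, 0.02, -0.03, -0.4):
    print(score, tag_from_compound(score))
```

It could be applied to the existing column with `tweet_data['compound_score'].apply(tag_from_compound)`.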

Let's check why some sentences are failing.

In [None]:
texts[17]

This probably fails because the negativity lies in the overall sense of the sentence rather than in any individual word. Before moving on to textblob, let's measure how VADER performs on the whole dataset.

In [None]:
from sklearn.metrics import classification_report as crep
print("sentiment analysis performance for nltk:")
print(crep(tweet_data['airline_sentiment'],tweet_data['final_tag']))
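
The classification report gives per-class precision and recall, but it does not show which classes get confused with each other. Below is a small sketch that counts (true, predicted) pairs using only the standard library. The label lists are made-up stand-ins for `tweet_data['airline_sentiment']` and `tweet_data['final_tag']`, and `confusion_counts` is a hypothetical helper.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

# Hypothetical labels, for illustration only.
true_labels = ['negative', 'negative', 'neutral', 'positive', 'negative']
predicted_labels = ['negative', 'positive', 'positive', 'positive', 'neutral']

counts = confusion_counts(true_labels, predicted_labels)
for (true, predicted), count in sorted(counts.items()):
    print(f"true={true:<8} predicted={predicted:<8} count={count}")
```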

## <a id = 'blob'>Sentiment analysis with Textblob</a>:


In [None]:
!pip install -U textblob
!python -m textblob.download_corpora

In [None]:
from textblob import TextBlob

In [None]:
tb1 = TextBlob('I just am trying textblob first time.')
tb1

So textblob also follows the usual pipeline pattern: we build a TextBlob object from the text and then work with that object for all further processing.

Before doing sentiment analysis, let's explore a few attributes.

### tokenization and sentence segmentation:

In [None]:
tb1.words

In [None]:
tb1.sentences

In [None]:
tb1.noun_phrases

### pos tagging:

In [None]:
tb2=TextBlob("Tags will give the Part of speech for all the words.")
tb2.tags

In [None]:
tb3=TextBlob(" We are learning cool Library . We are enjoying a lot .")
tb3.noun_phrases

In [None]:
type(tb3.noun_phrases)

### Polarity:

Polarity, as discussed earlier, indicates the sentiment the author expresses in the text. It is a float value ranging from -1.0 to +1.0.<br/>

Less than 0 denotes Negative<br/>
Equal to 0 denotes Neutral<br/>
Greater than 0 denotes Positive<br/>
<br/>
A value near +1 indicates stronger positive sentiment than a value near 0, and likewise a value near -1 indicates stronger negative sentiment.<br/>

In [None]:
doc2=TextBlob("We are having fun here")
doc2.polarity

In [None]:
texts = tweet_data['text'].tolist()
textblob_score = []
textblob_tag = []
for text in texts:
    doc_current = TextBlob(text)
    score = doc_current.polarity
    textblob_score.append(score)
    if score > 0:
        textblob_tag.append('positive')
    elif score<0:
        textblob_tag.append('negative')
    else:
        textblob_tag.append('neutral')
tweet_data['textblob_score'] = textblob_score
tweet_data['textblob_sentiment_tag'] = textblob_tag

In [None]:
tweet_data[['airline_sentiment','text','textblob_score','textblob_sentiment_tag']].head(20)

So textblob also matches most of the airline tags, but fails in a few cases. Interestingly, the tweet at index 17, whose negativity lies in its sense rather than in individual words, is missed by textblob as well. Let's check textblob's accuracy.

In [None]:
print("sentiment analysis with textblob:")
print(crep(tweet_data['airline_sentiment'],tweet_data['textblob_sentiment_tag']))

## <a id= 'huggingface'>Huggingface Transformer based sentiment analysis</a>
In this section we will do sentiment analysis using pretrained huggingface transformer models. We will also measure their performance and comment on it.<br/>
(1) [different sentiment models in huggingface](https://huggingface.co/models?search=sentim)<br/>
(2) [huggingface quicktour](https://huggingface.co/transformers/quicktour.html)<br/>
(3) [possible bug to look for](https://github.com/huggingface/transformers/issues/4263)<br/>


In [None]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
classifier("I am so happy to use huggingface today!")[0]['label']

In [None]:
classifier("it was a extremely bad movie!")[0]['label']

In [None]:
classifier("it is a statement")[0]['label']

In [None]:
tweet_data['bert_based_sentiment'] = tweet_data['text'].apply(lambda x: classifier(x)[0]['label'].lower())
tweet_data_cut = tweet_data[tweet_data['airline_sentiment']!='neutral']
print(crep(tweet_data_cut['airline_sentiment'],tweet_data_cut['bert_based_sentiment']))
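
Calling the classifier once per row through `apply` can be slow on a dataset of this size. Hugging Face pipelines also accept a list of strings, so one option is to send the texts in chunks. The sketch below is only a rough illustration: `label_in_batches` is a hypothetical helper, and it assumes nothing more than that `classifier` is a callable taking a list of strings and returning one dict with a `'label'` key per string. A stand-in classifier is used so the example runs without downloading a model.

```python
def label_in_batches(classifier, texts, batch_size=32):
    """Run a list-accepting classifier over texts in chunks; return lowercase labels."""
    labels = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        labels.extend(result['label'].lower() for result in classifier(chunk))
    return labels

def fake_classifier(batch):
    """Stand-in for a real pipeline: flags texts containing 'bad' as NEGATIVE."""
    return [{'label': 'NEGATIVE' if 'bad' in text else 'POSITIVE', 'score': 1.0}
            for text in batch]

print(label_in_batches(fake_classifier, ['bad flight', 'nice crew', 'bad food'], batch_size=2))
```

With the real pipeline, the call would look like `label_in_batches(classifier, tweet_data['text'].tolist())`.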

In [None]:
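# Note: 'facebook/bart-large' is a general pretrained model, not one fine-tuned for sentiment,
# so its predicted labels are not expected to be meaningful here.
classifier_febu = pipeline('sentiment-analysis',model = 'facebook/bart-large')
# Print both results; a notebook cell only displays its last expression.
print(classifier_febu("I hate it!"))
print(classifier_febu("I love you!"))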

In [None]:
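def sentiment_analysis(text):
    sentiment = classifier_febu(text)[0]['label']
    # Assumption: LABEL_0 is read as positive and every other label as negative.
    # This mapping is a guess, since the generic label names carry no stated sentiment meaning.
    if sentiment == 'LABEL_0': return 'positive'
    return 'negative'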

In [None]:
# The model is BART, so the column is named accordingly.
tweet_data['facebook_bart_based_sentiment'] = tweet_data['text'].apply(sentiment_analysis)
tweet_data_cut = tweet_data[tweet_data['airline_sentiment']!='neutral']
print(crep(tweet_data_cut['airline_sentiment'],tweet_data_cut['facebook_bart_based_sentiment']))

In [None]:
classifier_third = pipeline('sentiment-analysis',model = 'cardiffnlp/twitter-roberta-base-sentiment')
print(classifier_third("I hate it!"))
print(classifier_third("I love you!"))
print(classifier_third("I don't know about the routine"))
def sentiment_analysis_third(text):
    sentiment = classifier_third(text)[0]['label']
    if sentiment == 'LABEL_0': return 'negative'
    elif sentiment == 'LABEL_1': return 'neutral'
    elif sentiment == 'LABEL_2': return 'positive'
    return 'neutral'
# Store the RoBERTa predictions in their own column instead of reusing the Facebook model's column.
tweet_data['roberta_based_sentiment'] = tweet_data['text'].apply(sentiment_analysis_third)
print(crep(tweet_data['airline_sentiment'],tweet_data['roberta_based_sentiment']))

## <a id = 'flair'>Flair</a>:
[Flair NLP](https://github.com/flairNLP/flair) is a very simple NLP framework and contains a number of pretrained models. We will use the library for sentiment analysis in this section.

In [None]:
!pip install flair

In [None]:
from flair.models import TextClassifier
from flair.data import Sentence
classifier = TextClassifier.load('en-sentiment')
sentence = Sentence('Flair is pretty neat!')
classifier.predict(sentence)
# print sentence with predicted labels
print('Sentence above is: ', sentence.labels)

In [None]:
# Read the predicted label directly instead of parsing the label's string form.
print(sentence.labels[0].value.lower())

In [None]:
def sentiment_analysis_flair(text):
    sentence = Sentence(text)
    classifier.predict(sentence)
    # Read the predicted label directly instead of parsing the label's string form.
    return sentence.labels[0].value.lower()

In [None]:
tweet_data['flair_based_sentiment'] = tweet_data['text'].apply(sentiment_analysis_flair)
tweet_data_cut = tweet_data[tweet_data['airline_sentiment']!='neutral']
print(crep(tweet_data_cut['airline_sentiment'],tweet_data_cut['flair_based_sentiment']))

## <a id = conclude>Conclusion</a>:
In this notebook, we preprocessed an airline tweet dataset and then used both nltk (the pretrained, VADER-based analyzer) and textblob to score the sentiment of the tweets. We also took a brief look at the performance of both systems. In both cases, tweets whose negativity is only implicit were missed by the models, which hints at the importance of training a custom sentiment model.<br/>
We also tried out a number of pretrained models from huggingface. The vanilla model did not work well for sentiment analysis, while models specifically trained for sentiment performed much better.<br/>
### percentage-wise model comparison:
NLTK and textblob performed at around 50% accuracy. Facebook's bart model seemed unfit for the downstream task of sentiment analysis, whereas the other two huggingface models did better: the default huggingface sentiment model reached 89% accuracy on positive/negative classification (with neutral tweets excluded), and cardiffnlp's Twitter-based roberta model reached 70% accuracy on the 3-class classification.<br/>
With this, our sentiment analysis concludes. We may add further frameworks in later versions of this work.<br/>
In the comments, please let us know what other insights could have been drawn, or what other frameworks we should definitely try in this case. Thanks for reading the notebook. If you liked this work, your appreciation would be great to see.