# Discover NLP with Python Part II


Last time we performed Exploratory Data Analysis (EDA) to understand the specifics of our dataset, plotting its features to reveal underlying patterns.

We also learned basic text processing techniques, including tokenization, lemmatization, and stemming, to prepare text before feeding it to our model.


In this session we will build on what we learned last time to prepare a basic sentiment analysis model.


# Dataset

We are using the US Airlines sentiment dataset from Kaggle. The dataset contains customer reviews on Twitter regarding six US airlines.

There are three sentiments: Positive, Negative and Neutral.

Our task is to analyze the reviews, find the reasons behind negative reviews and classify unseen reviews into the correct category.


Find the dataset here: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

Let's get started by importing the libraries.

In [None]:
import numpy as np
import pandas as pd
import sklearn 
import matplotlib.pyplot as plt
import seaborn as sns
!pip install kaggle



### Downloading the dataset from Kaggle

We can download the dataset from Kaggle directly into our Colab notebook.

The steps are:

- Create an account on Kaggle
- Go to your profile, generate new API token
- Set permissions and download the dataset using API. By now you'd see the zip file
- To access the csv file, unzip the dataset using `!unzip` command

In [None]:
# Upload your Kaggle API token in order to download the required dataset
from google.colab import files
files.upload()  #this will prompt you to upload the kaggle.json

In [None]:
# Preparing the API for downloading
!mkdir -p ~/.kaggle 
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json 

In [None]:
!pip show kaggle

Name: kaggle
Version: 1.5.8
Summary: Kaggle API
Home-page: https://github.com/Kaggle/kaggle-api
Author: Kaggle
Author-email: support@kaggle.com
License: Apache 2.0
Location: /usr/local/lib/python3.6/dist-packages
Requires: requests, tqdm, urllib3, certifi, python-slugify, six, slugify, python-dateutil
Required-by: 


In [None]:
!kaggle datasets download -d crowdflower/twitter-airline-sentiment

Downloading twitter-airline-sentiment.zip to /content
  0% 0.00/2.55M [00:00<?, ?B/s]
100% 2.55M/2.55M [00:00<00:00, 85.1MB/s]


In [None]:
!unzip /content/twitter-airline-sentiment.zip

Archive:  /content/twitter-airline-sentiment.zip
  inflating: Tweets.csv              
  inflating: database.sqlite         


## Analyzing Data

In [None]:
#reading data
data = pd.read_csv("Tweets.csv")

In [None]:
#checking shape of the dataset
data.shape

(14640, 15)

In [None]:
data.head()

| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you've added commercials t... | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | | | Virgin America | | yvonnalynn | | 0 | @VirginAmerica I didn't today... Must mean I n... | | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | | jnardino | | 0 | @VirginAmerica it's really aggressive to blast... | | 2015-02-24 11:15:36 -0800 | | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0 | Can't Tell | 1.0 | Virgin America | | jnardino | | 0 | @VirginAmerica and it's a really big bad thing... | | 2015-02-24 11:14:45 -0800 | | Pacific Time (US & Canada) |


In [None]:
#Let's learn more about the data using data.describe().

data.describe()


Checking for null values. Some columns, like negativereason_confidence, consist mostly of empty values, so imputing them will not help. We will drop these columns later.

In [None]:
data.isna().sum()
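
As a rough sketch of the dropping step mentioned above, the share of missing values per column can be computed with `isna().mean()` and used as a filter. The toy frame and the 0.5 cut-off below are assumptions made for illustration; they are not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the tweets data (hypothetical values).
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "negativereason_gold": [np.nan, np.nan, np.nan, "Late Flight"],
    "retweet_count": [0, 1, 0, 2],
})

# Fraction of missing values in each column.
missing_share = toy.isna().mean()

# Keep only the columns where at most half of the values are missing
# (the 0.5 cut-off is an arbitrary choice for this sketch).
threshold = 0.5
kept = toy.loc[:, missing_share <= threshold]

print(missing_share)
print(list(kept.columns))
```

The same two lines could be applied to `data` if you prefer a rule-based filter over picking columns by hand.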

Let's see how the sentiments are distributed, that is, the number of samples per sentiment.

In [None]:
data['airline_sentiment'].value_counts().plot(kind='bar', color=['red','yellow','green'])


Negative samples far outnumber the neutral and positive ones, which indicates that the data is imbalanced.
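
One common way to quantify imbalance is to compute "balanced" class weights, which many scikit-learn classifiers (including `SVC`) accept through their `class_weight` parameter. The sketch below uses a made-up label sample, not the real counts from this dataset, and this notebook does not apply the weights later on.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label sample: 6 negative, 3 neutral, 1 positive.
toy_labels = np.array(["negative"] * 6 + ["neutral"] * 3 + ["positive"])
classes = np.unique(toy_labels)

# 'balanced' weights are n_samples / (n_classes * count_of_class),
# so rarer classes receive larger weights.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=toy_labels)
class_weights = {str(c): float(w) for c, w in zip(classes, weights)}

print(class_weights)
```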

Let's see the number of samples for all the airlines.

In [None]:
data['airline'].value_counts().plot(kind='bar')


People have tweeted the most about United, followed by US Airways. We'll see whether these tweets were positive, negative or neutral.

## Sentiments per Airline

In [None]:
pd.crosstab(data['airline'],data['airline_sentiment']).plot(kind='bar')

In [None]:
data['airline'].groupby(data['airline_sentiment']).value_counts().plot(kind='bar')
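
Because the airlines differ in how many tweets they received, raw counts can be hard to compare. A possible complement is a row-normalized crosstab, which shows the share of each sentiment per airline. The rows below are invented for illustration only.

```python
import pandas as pd

# Hypothetical mini sample of (airline, sentiment) pairs.
toy = pd.DataFrame({
    "airline": ["United", "United", "United", "United",
                "Virgin America", "Virgin America"],
    "airline_sentiment": ["negative", "negative", "negative", "positive",
                          "negative", "positive"],
})

# normalize='index' divides every row by its row total,
# so each airline's shares sum to 1.
shares = pd.crosstab(toy["airline"], toy["airline_sentiment"], normalize="index")

print(shares)
```

On the real data, `pd.crosstab(data['airline'], data['airline_sentiment'], normalize='index')` would give the equivalent table.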


Now we will see how the individual sentiments are distributed.

## Positive Sentiments

In [None]:
data[data['airline_sentiment']== 'positive'].airline.value_counts().plot(kind='bar')

## Negative Sentiments

In [None]:
data[data['airline_sentiment']== 'negative'].airline.value_counts().plot(kind='bar')

## Neutral Sentiments

In [None]:
data[data['airline_sentiment']== 'neutral'].airline.value_counts().plot(kind='bar')


So clearly people are unhappy with United: most negative tweets are about United.

Southwest received the most positive tweets, which suggests it has done a good job of serving people.

Virgin America, by contrast, shows a fairly balanced mix of negative, positive and neutral tweets; it may simply be the least popular airline in the dataset.

We will now visualize the negative reasons for negative sentiments.

In [None]:
data['negativereason'].value_counts().plot(kind='barh')

So Customer Service Issue is the most common reason people were unhappy. If airlines want to do better, they should focus on customer service.

Negative reasons per airline:

In [None]:
pd.crosstab(data['airline'], data['negativereason'])

In [None]:
pd.crosstab(data['airline'], data['negativereason']).plot(kind='bar')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2)
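
When the bar chart gets crowded, the most frequent reason per airline can also be read off the crosstab with `idxmax`. This is a small sketch on invented rows, so the result says nothing about the real airlines.

```python
import pandas as pd

# Hypothetical complaints, two airlines only.
toy = pd.DataFrame({
    "airline": ["American", "American", "American",
                "United", "United", "United"],
    "negativereason": ["Customer Service Issue", "Customer Service Issue", "Late Flight",
                       "Late Flight", "Late Flight", "Lost Luggage"],
})

reason_counts = pd.crosstab(toy["airline"], toy["negativereason"])

# For every airline (row), pick the column with the highest count.
top_reason = reason_counts.idxmax(axis=1)

print(top_reason)
```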


American has the worst record for Customer Service Issue. Remembering that United got the most negative tweets, the main reason seems to be the same, i.e. Customer Service Issue, followed by Late Flight.

Visualizing the correlation in the data

In [None]:
# correlation is only defined for numeric columns, so select those first
sns.heatmap(data.select_dtypes('number').corr())

## Preprocessing

We are going to pick the 'airline_sentiment' and 'text' columns for our task.

In [None]:
# .copy() gives an independent dataframe that is safe to modify later
df=data[['airline_sentiment','text']].copy()

In [None]:
df.head()


Since the data is collected from Twitter, it naturally contains links and mentions, which are not helpful for the analysis, so we will remove them.

For preprocessing the text we will remove all of the following:

*  stopwords
*  punctuation
*  links
*  mentions (@) and hashtags (#)

We will also perform stemming with the help of nltk library.
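
Before looking at the full function, the removal steps listed above can be tried on a single tweet. The following is a minimal sketch using only the standard-library `re` module; the helper name `strip_twitter_noise` and the sample tweet are made up for illustration, and stopword removal and stemming are deliberately left out.

```python
import re

def strip_twitter_noise(tweet):
    """Remove links, mentions, hashtags and non-letters from one tweet."""
    tweet = re.sub(r"http\S+", " ", tweet)     # links
    tweet = re.sub(r"[@#]\w+", " ", tweet)     # mentions and hashtags
    tweet = re.sub(r"[^a-zA-Z]", " ", tweet)   # punctuation, digits, other symbols
    # Lowercase and collapse the runs of spaces left behind by the substitutions.
    return " ".join(tweet.lower().split())

sample = "@VirginAmerica Thanks for the upgrade! #happy http://t.co/abc123"
cleaned = strip_twitter_noise(sample)
print(cleaned)  # thanks for the upgrade
```

Links are removed first so that the later patterns do not break a URL into stray word fragments.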

In [None]:
# Preprocessing
import re

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

stop = set(stopwords.words('english'))
snow = nltk.stem.SnowballStemmer('english')

def preprocess(doc):
  doc = re.sub(r'@\w+', " ", str(doc))      # remove mentions
  doc = re.sub(r'#\w+', " ", doc)           # remove hashtags
  doc = re.sub(r'http\S+', " ", doc)        # remove links
  doc = re.sub(r'[^a-zA-Z]', " ", doc)      # remove punctuation, digits and other non-letters
  tokens = word_tokenize(doc.lower())       # lowercase first so stopword matching ignores case
  tokens = [t for t in tokens if t not in stop]
  tokens = [snow.stem(t) for t in tokens]   # stem the remaining tokens
  return ' '.join(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Store the cleaned text back in the dataframe so that the later cells use it
df = df.assign(text=df['text'].apply(preprocess))

In [None]:
df.head()

Let's plot the word cloud to see how the words are distributed and what the overall data looks like.



In [None]:
from wordcloud import WordCloud

In [None]:
# Join all tweets into one long string for the word cloud
all_words = ' '.join(df['text'].astype(str))

wordcloud_obj = WordCloud(width=800,
                          height=500,
                          max_font_size=110,
                          collocations=False).generate(all_words)

plt.figure(figsize=(15,10))
plt.imshow(wordcloud_obj, interpolation="bilinear")
plt.axis("off")
plt.show()
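
A word cloud is easy to read but hard to quote numbers from. As a plain-text complement, the most frequent words can be counted with `collections.Counter`. The three short texts below are placeholders, not tweets from the dataset.

```python
from collections import Counter

# Placeholder texts standing in for the cleaned tweets.
toy_texts = ["flight delayed again", "flight cancelled", "great flight great crew"]

# Count every whitespace-separated word across all texts.
word_counts = Counter(word for text in toy_texts for word in text.split())
top_words = word_counts.most_common(2)

print(top_words)  # [('flight', 3), ('great', 2)]
```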

# Modeling

Before sending the data to our model we have to convert text into numerical form. 

In the presentation we saw how we can use Bag of Words for this purpose, but we are going to use Tf-Idf instead.

The reason is that Bag of Words only counts how often each word occurs, so frequent but uninformative words carry as much weight as the words that actually signal sentiment. Both approaches produce a large sparse matrix with one column per unique word.

Tf-Idf weighs each word by how often it occurs in a document and how rare it is across all documents, so words that appear in nearly every tweet are down-weighted.
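
To see how the two representations differ in practice, here is a small sketch on a made-up three-sentence corpus. It compares scikit-learn's `CountVectorizer` (Bag of Words) with `TfidfVectorizer` using default settings; the corpus and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus: "flight" occurs in two documents, "delayed" in only one.
toy_corpus = ["flight delayed again", "great flight great crew", "lost bag again"]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()
counts = count_vec.fit_transform(toy_corpus)    # raw term counts
weights = tfidf_vec.fit_transform(toy_corpus)   # reweighted, length-normalized counts

# Both are sparse matrices with one row per document and one column per word.
print(counts.shape, weights.shape)

# Compare two words inside the first document.
vocab = tfidf_vec.vocabulary_
row0_counts = counts[0].toarray().ravel()
row0_weights = weights[0].toarray().ravel()
print(row0_counts[count_vec.vocabulary_["delayed"]], row0_counts[count_vec.vocabulary_["flight"]])
print(row0_weights[vocab["delayed"]], row0_weights[vocab["flight"]])
```

In the first document both words have a count of 1, but Tf-Idf gives the rarer word "delayed" a larger weight than "flight".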

In [None]:
#importing TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf=TfidfVectorizer()
reviews = tfidf.fit_transform(df['text'])

Now let's convert our labels 'positive', 'negative' and 'neutral' into numerical form.

For this we are going to use scikit-learn's `LabelEncoder`.


In [None]:
from sklearn.preprocessing import LabelEncoder

le=LabelEncoder()
labels=le.fit_transform(df['airline_sentiment'])
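
It helps to know which number stands for which sentiment when reading the evaluation output later. `LabelEncoder` sorts the class names, so the mapping can be recovered from `classes_`. The sketch below uses a separate toy encoder so it does not touch `le`.

```python
from sklearn.preprocessing import LabelEncoder

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(["positive", "negative", "neutral", "negative"])

# classes_ holds the sorted label names; the position of a name is its code.
mapping = {str(label): code for code, label in enumerate(toy_encoder.classes_)}

print(mapping)        # {'negative': 0, 'neutral': 1, 'positive': 2}
print(list(encoded))  # [2, 0, 1, 0]
```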

## Splitting the dataset into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(reviews, labels, test_size=0.25, random_state=0)
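
With imbalanced classes, a plain random split can leave the test set with a slightly different class mix than the training set. An optional variation, not used in this notebook, is to pass `stratify` so that both parts keep the same proportions. The toy labels below are invented for the sketch.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Invented data: 12 negative, 4 neutral and 4 positive samples.
toy_texts = [f"tweet {i}" for i in range(20)]
toy_labels = ["negative"] * 12 + ["neutral"] * 4 + ["positive"] * 4

# stratify=toy_labels keeps the 60/20/20 class mix in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

print(Counter(y_tr))
print(Counter(y_te))
```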

## Training the model

We can use any machine learning algorithm for this task, and we have chosen the Support Vector Classifier for now.

In [None]:
#importing SVC from scikit-learn
from sklearn.svm import SVC
text_classifier = SVC(random_state=0)

In [None]:
text_classifier.fit(X_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False)
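
Note that in the cells above the vectorizer is fitted on the whole dataset before the split, so the test tweets influence the vocabulary and the IDF weights. One way to avoid this, shown here only as a sketch on invented sentences, is to wrap both steps in a scikit-learn pipeline and fit it on the training text alone.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented training sentences and labels.
train_texts = [
    "flight delayed again", "lost my bag", "terrible customer service",
    "great crew", "thanks for the smooth flight", "love the new seats",
]
train_labels = ["negative", "negative", "negative",
                "positive", "positive", "positive"]

# The pipeline fits the vectorizer and the classifier on training data only.
model = make_pipeline(TfidfVectorizer(), SVC(random_state=0))
model.fit(train_texts, train_labels)

# Raw strings go in; vectorizing happens inside the pipeline.
unseen = ["bag lost and flight delayed", "thanks great crew"]
predicted = model.predict(unseen)
print(predicted)
```

With this setup the split would be done on the raw text column rather than on the Tf-Idf matrix.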

## Predictions

Let's do some predictions on the test set.

In [None]:
predictions = text_classifier.predict(X_test)

In [None]:
#comparing the test labels with the predicted labels
dataframe=pd.DataFrame()
dataframe['y_test']=le.inverse_transform(y_test)
dataframe['Predicted']=le.inverse_transform(predictions)
dataframe

| | y_test | Predicted |
|---|---|---|
| 0 | negative | negative |
| 1 | negative | negative |
| 2 | negative | negative |
| 3 | negative | negative |
| 4 | negative | positive |
| ... | ... | ... |
| 3655 | negative | negative |
| 3656 | negative | negative |
| 3657 | positive | negative |
| 3658 | positive | positive |


## Performance

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
print(accuracy_score(y_test, predictions))

[[2227   80   20]
 [ 363  378   31]
 [ 180   71  310]]
              precision    recall  f1-score   support

           0       0.80      0.96      0.87      2327
           1       0.71      0.49      0.58       772
           2       0.86      0.55      0.67       561

    accuracy                           0.80      3660
   macro avg       0.79      0.67      0.71      3660
weighted avg       0.79      0.80      0.78      3660

0.796448087431694
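
To connect the confusion matrix with the report, the per-class figures can be recomputed by hand. The sketch below copies the matrix printed above and assumes the label order 0 = negative, 1 = neutral, 2 = positive, which is the alphabetical order `LabelEncoder` uses.

```python
import numpy as np

# Confusion matrix from the output above: rows are true labels, columns are predictions.
cm = np.array([[2227,   80,   20],
               [ 363,  378,   31],
               [ 180,   71,  310]])
label_names = ["negative", "neutral", "positive"]

recall = cm.diagonal() / cm.sum(axis=1)      # correct / all samples of that true class
precision = cm.diagonal() / cm.sum(axis=0)   # correct / all predictions of that class
accuracy = cm.trace() / cm.sum()             # all correct / all samples

for name, p, r in zip(label_names, precision, recall):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.3f}")
```

The low recall for the neutral and positive rows comes mostly from tweets that were predicted as negative (363 and 180 of them).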


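We got a reasonable accuracy of about 80% (0.796) in this run. Note that recall is much lower for the neutral (0.49) and positive (0.55) classes than for the negative class (0.96), which reflects the class imbalance. This could be further improved by using deep learning techniques and language models.

We are going to cover that in future sessions; the purpose of this session is to show you the complete pipeline of an NLP project and get you comfortable with text manipulation.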



References:

1. [Kaggle dataset in Colab](https://www.geeksforgeeks.org/importing-kaggle-dataset-into-google-colaboratory/)

2. [TfIdfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)