<a href="https://colab.research.google.com/github/sursani/airline-tweet-sentiment/blob/main/airline_tweet_sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Twitter Sentiment Analysis

**Problem statement:** The airline industry had a very hard time sustaining business after COVID-19 because of a long halt in operations. It is very important for airlines to make sure they exceed customer expectations, and the best way to evaluate performance is through customer feedback. You are given a dataset of airline tweets from real customers.

The data comes from a sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February 2015, and contributors were asked to first classify tweets as positive, negative, or neutral, and then to categorize the reasons for negative tweets (such as "late flight" or "rude service").

You will use the text column and the sentiment column to build a classification model that assigns a given tweet to one of three classes: positive, negative, or neutral.

**Understanding the Dataset:**

The dataset contains many columns; the most important ones are:
1. airline_sentiment - defines the sentiment of the tweet
2. negativereason - reason for the negative feedback (if negative)
3. text - tweet text content
4. tweet_location - location from which tweet was posted

You can use more columns in your model training if you want.
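
Before modelling, it can help to check how the three sentiment classes are distributed. The sketch below counts labels with the standard library only. The inline rows are made-up stand-ins for `Tweets.csv` (assuming the same column names), and `count_labels` is a hypothetical helper that is not part of the notebook.

```python
import csv
import io
from collections import Counter

# Made-up sample rows standing in for Tweets.csv (assumption: same column names)
SAMPLE_CSV = """airline_sentiment,text,tweet_location
negative,@airline my flight was delayed again,Boston
positive,@airline thanks for the smooth trip,
neutral,@airline what time does check-in open?,Austin
negative,@airline lost my bag,Chicago
"""

def count_labels(csv_file, label_column="airline_sentiment"):
    """Count how many rows carry each label in the given column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)

label_counts = count_labels(io.StringIO(SAMPLE_CSV))
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

On the real file, the same helper could be called with `open("Tweets.csv", newline="", encoding="utf-8")` in place of the `io.StringIO` object.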


**Steps to perform**
1. Load dataset - https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
2. Clean and preprocess the data, and perform exploratory data analysis (EDA)
3. Vectorise the columns that contain text
4. Train a classification model to predict positive, negative, or neutral
5. Evaluate model
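
Steps 2 and 3 can be illustrated on a toy scale. The following is a minimal sketch rather than the method used later in this notebook: `clean_tweet` and `bag_of_words` are hypothetical helpers, the sample tweets are invented, and the regular expressions are only one reasonable choice. It strips @mentions and URLs, lowercases the text, and turns each tweet into a vector of word counts.

```python
import re
from collections import Counter

MENTION_OR_URL = re.compile(r"@\w+|https?://\S+")
NON_WORD = re.compile(r"[^a-z0-9\s]")

def clean_tweet(text):
    """Lowercase a tweet and drop @mentions, URLs and punctuation."""
    text = MENTION_OR_URL.sub(" ", text)
    text = NON_WORD.sub(" ", text.lower())
    return " ".join(text.split())

def bag_of_words(texts):
    """Return (vocabulary, count vectors) for a list of cleaned texts."""
    vocabulary = sorted({word for text in texts for word in text.split()})
    vectors = []
    for text in texts:
        counts = Counter(text.split())
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Invented sample tweets, for illustration only
sample_tweets = [
    "@JetBlue Thanks for the great flight!",
    "@united delayed again... see http://example.com",
]
cleaned = [clean_tweet(t) for t in sample_tweets]
vocabulary, vectors = bag_of_words(cleaned)
print(cleaned)
print(vocabulary)
print(vectors)
```

In practice a library vectorizer such as scikit-learn's `TfidfVectorizer` would usually replace the hand-written `bag_of_words`.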


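For step 5, per-class precision, recall and F1 are a common summary for a three-class problem. The sketch below computes them from two label lists using their standard definitions. `per_class_metrics` is a hypothetical helper and the labels are invented for illustration.

```python
def per_class_metrics(true_labels, predicted_labels, label):
    """Precision, recall and F1 for one class, treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == label and p == label)
    false_pos = sum(1 for t, p in pairs if t != label and p == label)
    false_neg = sum(1 for t, p in pairs if t == label and p != label)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented labels, for illustration only
y_true = ["negative", "negative", "positive", "neutral", "negative", "positive"]
y_pred = ["negative", "neutral", "positive", "neutral", "negative", "negative"]

for label in ("positive", "negative", "neutral"):
    precision, recall, f1 = per_class_metrics(y_true, y_pred, label)
    print(f"{label:>8}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting the metrics per class matters when one class dominates, because overall accuracy can then hide weak performance on the smaller classes.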

## Steps to download Kaggle datasets using the Kaggle public API

1. Go to your account, scroll to the API section, and click Expire API Token to remove any previous tokens.

2. Click Create New API Token. This downloads a kaggle.json file to your machine.

In [1]:
# install kaggle pkg
!pip install -q kaggle

# Install spacy-lookups-data, which contains additional language-specific
# data that spaCy can use for different tasks, such as tables of verb forms,
# spelling exceptions, stop words, or lemmatization data.
!pip install spacy-lookups-data

Collecting spacy-lookups-data
  Downloading spacy_lookups_data-1.0.3-py2.py3-none-any.whl (98.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.5/98.5 MB 9.3 MB/s eta 0:00:00
Installing collected packages: spacy-lookups-data
Successfully installed spacy-lookups-data-1.0.3


In [2]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [3]:
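# create the Kaggle config directory (-p avoids an error if it already exists)
! mkdir -p ~/.kaggle

# copy the uploaded token into it
! cp kaggle.json ~/.kaggle/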

In [4]:
! chmod 600 ~/.kaggle/kaggle.json
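
The same token setup can also be expressed in Python. This is a sketch under assumptions: `install_kaggle_token` is a hypothetical helper, and the demonstration writes a placeholder token into a temporary directory rather than touching `~/.kaggle`.

```python
import json
import shutil
import stat
import tempfile
from pathlib import Path

def install_kaggle_token(token_path, config_dir):
    """Copy kaggle.json into config_dir and restrict it to the owner."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    target = config_dir / "kaggle.json"
    shutil.copyfile(token_path, target)            # like `cp`
    target.chmod(0o600)                            # like `chmod 600`
    return target

# Demonstration with a placeholder token in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "kaggle.json"
    source.write_text(json.dumps({"username": "example-user", "key": "example-key"}))
    installed = install_kaggle_token(source, Path(tmp) / ".kaggle")
    mode = stat.S_IMODE(installed.stat().st_mode)
    print(installed.name, oct(mode))
```

In the notebook, the equivalent call would be `install_kaggle_token("kaggle.json", Path.home() / ".kaggle")`.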

In [5]:
# download the airline sentiment dataset and unzip it
!kaggle datasets download -d crowdflower/twitter-airline-sentiment
!unzip twitter-airline-sentiment.zip

Downloading twitter-airline-sentiment.zip to /content
  0% 0.00/2.55M [00:00<?, ?B/s]
100% 2.55M/2.55M [00:00<00:00, 67.3MB/s]
Archive:  twitter-airline-sentiment.zip
  inflating: Tweets.csv              
  inflating: database.sqlite         


In [6]:

import spacy
import spacy.lookups

# load the English small model in spaCy
nlp = spacy.load('en_core_web_sm')



Read Tweets.csv and train a spaCy text categorizer on the labelled tweets

In [7]:
import pandas as pd
import random
from spacy.training import Example
from sklearn.model_selection import train_test_split

# The textcat pipe in spaCy, short for "text categorizer",
# is a component of the processing pipeline specifically designed for categorizing text.

# It assigns category labels (or "classes") to texts, which makes it useful for tasks like
# sentiment analysis, spam detection, or genre classification.
if 'textcat' not in nlp.pipe_names:
    nlp.add_pipe('textcat', last=True)
textcat = nlp.get_pipe('textcat')

# Add labels to text classifier
textcat.add_label('POSITIVE')
textcat.add_label('NEGATIVE')
textcat.add_label('NEUTRAL')

# Load the tweets from the CSV file
df = pd.read_csv('Tweets.csv')

# 'text' column contains the tweet text and 'airline_sentiment' column contains the sentiment
tweets = df['text'].tolist()
airline_sentiments = df['airline_sentiment'].tolist()

# Prepare training data in the format (text, {'cats': {label: bool}})
data = [
    (text, {'cats': {'POSITIVE': sentiment == 'positive',
                     'NEGATIVE': sentiment == 'negative',
                     'NEUTRAL': sentiment == 'neutral'}})
    for text, sentiment in zip(tweets, airline_sentiments)
]

# Split the data into a training set and a test set
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

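# Initialize the pipeline. `begin_training` is the deprecated spaCy v2 name;
# `initialize` is the v3 equivalent. It runs initialization for every
# component in the pipeline, not only for textcat.
nlp.initialize()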

# Train the model
for itn in range(1): # Number of training iterations
    random.shuffle(train_data)
    losses = {}
    for text, annotations in train_data:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        nlp.update([example], losses=losses)
    print(losses)


# Save model to local disk of Colab
nlp.to_disk("/model")



{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'ner': 0.0, 'textcat': 1641.6338966369087}
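
The loop above updates the model one tweet at a time, and the losses show that the other pipeline components are included in each update as well. A common alternative is to start from a blank pipeline, so that only the text categorizer is trained, and to update on small batches. The sketch below shows that pattern on invented data. `toy_data`, `toy_nlp`, the batch size and the number of epochs are assumptions for illustration, and results on the real dataset will differ.

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch

# Invented examples in the same (text, {'cats': ...}) shape as train_data
toy_data = [
    ("great flight, thank you", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0, "NEUTRAL": 0.0}}),
    ("lovely crew and smooth trip", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0, "NEUTRAL": 0.0}}),
    ("my flight was delayed again", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0, "NEUTRAL": 0.0}}),
    ("you lost my bag", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0, "NEUTRAL": 0.0}}),
    ("what time does check-in open", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 0.0, "NEUTRAL": 1.0}}),
    ("do you fly to Denver", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 0.0, "NEUTRAL": 1.0}}),
]

# A blank English pipeline contains only a tokenizer until textcat is added
toy_nlp = spacy.blank("en")
toy_textcat = toy_nlp.add_pipe("textcat")
for label in ("POSITIVE", "NEGATIVE", "NEUTRAL"):
    toy_textcat.add_label(label)

examples = [Example.from_dict(toy_nlp.make_doc(text), annot) for text, annot in toy_data]
optimizer = toy_nlp.initialize(lambda: examples)

for epoch in range(5):
    random.shuffle(examples)
    losses = {}
    for batch in minibatch(examples, size=2):
        toy_nlp.update(batch, sgd=optimizer, losses=losses)
    print(epoch, losses)

# Running the trained pipeline fills doc.cats with one score per label
toy_scores = toy_nlp("thank you for the great flight").cats
print(toy_scores)
```

With so few examples the predicted scores are not meaningful; the point is only the structure of the training loop.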


In [8]:
# Use the model's tokenizer to tokenize the text (list of string)
test_docs = [nlp.tokenizer(text) for text, annotation in test_data]

# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores = textcat.predict(test_docs)

# import numpy pkg
# NumPy, which stands for Numerical Python, is a fundamental package for
# scientific computing in Python. It provides support for arrays
# (including multidimensional arrays), along with a collection of mathematical
# functions to operate on these arrays.
import numpy as np

# notes:
# ndarray object: An efficient n-dimensional array object which can be used to
# store homogeneous data, i.e., data of the same type. This object allows for
# efficient vectorized operations, i.e., operations applied on the array as a
# whole, instead of element by element, which makes computations faster and
# code cleaner.

# Define your classes
class_labels = ['POSITIVE', 'NEGATIVE', 'NEUTRAL']

# Get the predicted class indices
predicted_class_indices = np.argmax(scores, axis=1)

# Map the indices to class labels
predicted_classes = [class_labels[index] for index in predicted_class_indices]

# Code for seeing the data set for debugging
print('Print out some sample data (5 rows) to see how the model is doing')

# Now, for each item in your test_data, you can compare the predicted class to the actual class
for i, (text, annotation) in enumerate(test_data[:5]):
    print(f"Text: {text}")
    print(f"Actual class: {max(annotation['cats'], key=annotation['cats'].get)}")
    print(f"Predicted class: {predicted_classes[i]}")
    print()


#######
# sklearn.metrics is a module in the Scikit-learn library in Python that
# contains a number of functions for calculating metrics that can be used in
# model evaluation.
# Import classification report
from sklearn.metrics import classification_report

# True class indices, using the same label order as class_labels.
# (predicted_class_indices was already computed above.)
true_class_labels = [max(annotation['cats'], key=annotation['cats'].get) for _, annotation in test_data]
true_class_indices = [class_labels.index(label) for label in true_class_labels]

# Now, we can use the classification_report function from sklearn.metrics to get precision, recall, and f1-score
report = classification_report(true_class_indices, predicted_class_indices, target_names=['POSITIVE', 'NEGATIVE', 'NEUTRAL'])
print(report)

Print out some sample data (5 rows) to see how the model is doing
Text: @SouthwestAir you're my early frontrunner for best airline! #oscars2016
Actual class: POSITIVE
Predicted class: POSITIVE

Text: @USAirways how is it that my flt to EWR was Cancelled Flightled yet flts to NYC from USAirways are still flying?
Actual class: NEGATIVE
Predicted class: NEGATIVE

Text: @JetBlue what is going on with your BDL to DCA flights yesterday and today?! Why is every single one getting delayed?
Actual class: NEGATIVE
Predicted class: NEGATIVE

Text: @JetBlue do they have to depart from Washington, D.C.??
Actual class: NEUTRAL
Predicted class: NEUTRAL

Text: @JetBlue I can probably find some of them. Are the ticket #s on there?
Actual class: NEGATIVE
Predicted class: NEUTRAL

              precision    recall  f1-score   support

    POSITIVE       0.56      0.73      0.63       459
    NEGATIVE       0.87      0.84      0.86      1889
     NEUTRAL       0.61      0.55      0.58       580

    accur