<a href="https://colab.research.google.com/github/sursani/airline-tweet-sentiment/blob/main/airline_tweet_sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Twitter Sentiment Analysis

**Problem statement:** The airline industry struggled to sustain its business after COVID because of a long halt in operations, so it is very important for airlines to exceed customer expectations. Customer feedback is one of the most direct ways to evaluate their performance. You are given a dataset of airline tweets from real customers.

The dataset comes from a sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February 2015, and contributors were asked to first classify each tweet as positive, negative, or neutral, and then to categorize the reason for each negative tweet (such as "late flight" or "rude service").

You will use the text column and the sentiment column to build a classification model that assigns a given tweet to one of three classes: positive, negative, or neutral.

**Understanding the Dataset:**

The dataset contains many columns; the most important ones are:
1. airline_sentiment - the sentiment of the tweet
2. negativereason - the reason for the negative feedback (if the tweet is negative)
3. text - the tweet text content
4. tweet_location - the location from which the tweet was posted

You can use more columns in your model training if you want.
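Before training, it can help to check how the three sentiment labels are distributed. The following is a minimal sketch using only the standard library; the helper name `label_distribution` is an assumption made for illustration, and the three sample rows are copied from the test-set printout further down in this notebook rather than read from the full file.

```python
import csv
import io
from collections import Counter


def label_distribution(rows, label_column="airline_sentiment"):
    """Count how many rows carry each sentiment label."""
    return Counter(row[label_column] for row in rows)


# Three rows taken from this notebook's own test-set printout,
# reduced to the two columns the model uses.
SAMPLE_CSV = '''airline_sentiment,text
positive,"@SouthwestAir you're my early frontrunner for best airline! #oscars2016"
negative,"@JetBlue what is going on with your BDL to DCA flights yesterday and today?! Why is every single one getting delayed?"
neutral,"@JetBlue do they have to depart from Washington, D.C.??"
'''

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
distribution = label_distribution(sample_rows)
print(distribution)
```

On the real data, the same function could be applied to `csv.DictReader(open('Tweets.csv', newline='', encoding='utf-8'))`; a heavily skewed distribution would suggest looking at per-class metrics rather than accuracy alone.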


**Steps to perform**
1. Load dataset - https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
2. Clean and preprocess the data, and perform EDA
3. Vectorise the columns that contain text
4. Run a classification model to classify tweets as positive, negative, or neutral
5. Evaluate the model
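For the cleaning step, one possible starting point is a small normalisation helper. This is only a sketch: the function name `clean_tweet` and the specific choices (dropping URLs and @mentions, keeping hashtag words, lowercasing) are assumptions for illustration, not something this notebook prescribes.

```python
import re

MENTION_RE = re.compile(r"@\w+")        # e.g. @united
URL_RE = re.compile(r"https?://\S+")    # e.g. http://t.co/abc
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Remove URLs and @mentions, keep hashtag words, collapse spaces, lowercase."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")  # keep the hashtag word, drop the symbol
    text = WHITESPACE_RE.sub(" ", text).strip()
    return text.lower()


cleaned = clean_tweet(
    "@SouthwestAir you're my early frontrunner for best airline! #oscars2016"
)
print(cleaned)  # you're my early frontrunner for best airline! oscars2016
```

Whether to strip the airline handle is a design choice: it removes a token that appears in almost every tweet, but it also discards which airline is being addressed.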



## Steps to download Kaggle datasets using the Kaggle public API

1. Go to your Kaggle account settings, scroll to the API section, and click Expire API Token to revoke any previous tokens.

2. Click Create New API Token. This downloads a kaggle.json file to your machine.

In [17]:
!pip install -q kaggle

!pip install spacy-lookups-data



In [2]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"shaneursani","key":"<redacted>"}'}

In [3]:
! mkdir -p ~/.kaggle

! cp kaggle.json ~/.kaggle/

In [4]:
! chmod 600 ~/.kaggle/kaggle.json
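The three shell commands above (create the directory, copy the token, restrict its permissions) can also be expressed in Python. The sketch below assumes a POSIX file system; `install_kaggle_token` is a hypothetical helper name, and the demonstration writes a placeholder token into a temporary directory so that no real credentials are touched.

```python
import os
import shutil
import stat
import tempfile


def install_kaggle_token(token_path, config_dir):
    """Copy kaggle.json into config_dir and restrict it to owner read/write."""
    os.makedirs(config_dir, exist_ok=True)          # like: mkdir -p ~/.kaggle
    target = os.path.join(config_dir, "kaggle.json")
    shutil.copyfile(token_path, target)             # like: cp kaggle.json ~/.kaggle/
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)   # like: chmod 600
    return target


# Demonstrate against a throwaway directory with a placeholder token.
with tempfile.TemporaryDirectory() as workdir:
    fake_token = os.path.join(workdir, "kaggle.json")
    with open(fake_token, "w") as handle:
        handle.write('{"username": "example", "key": "placeholder"}')
    installed = install_kaggle_token(fake_token, os.path.join(workdir, ".kaggle"))
    mode = stat.S_IMODE(os.stat(installed).st_mode)
    print(oct(mode))
```

In the notebook itself, the equivalent call would be `install_kaggle_token("kaggle.json", os.path.expanduser("~/.kaggle"))`.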

In [5]:
#! kaggle competitions download -c 'name-of-competition'

!kaggle datasets download -d crowdflower/twitter-airline-sentiment
!unzip twitter-airline-sentiment.zip

Downloading twitter-airline-sentiment.zip to /content
  0% 0.00/2.55M [00:00<?, ?B/s]
100% 2.55M/2.55M [00:00<00:00, 205MB/s]
Archive:  twitter-airline-sentiment.zip
  inflating: Tweets.csv              
  inflating: database.sqlite         


In [6]:

import spacy
import spacy.lookups

# load the English model in spaCy
nlp = spacy.load('en_core_web_sm')

# Load the lexeme_norm table
#lexeme_norm = spacy.lookups.load_lookups("lexeme_norm", "en")


Read Tweets.csv and train a spaCy text classifier (textcat) on the labeled tweets

In [9]:
import pandas as pd
import random
from spacy.training import Example
from sklearn.model_selection import train_test_split


# Add a text classifier to the pipeline if it does not already have one
if 'textcat' not in nlp.pipe_names:
    nlp.add_pipe('textcat', last=True)
textcat = nlp.get_pipe('textcat')

# Add labels to text classifier
textcat.add_label('POSITIVE')
textcat.add_label('NEGATIVE')
textcat.add_label('NEUTRAL')

# Load the tweets from the CSV file
df = pd.read_csv('Tweets.csv')

# 'text' column contains the tweet text and 'airline_sentiment' column contains the sentiment
tweets = df['text'].tolist()
airline_sentiments = df['airline_sentiment'].tolist()

# Prepare training data as (text, annotations) pairs with one-hot 'cats' labels
data = [
    (
        tweet,
        {'cats': {
            'POSITIVE': sentiment == 'positive',
            'NEGATIVE': sentiment == 'negative',
            'NEUTRAL': sentiment == 'neutral',
        }},
    )
    for tweet, sentiment in zip(tweets, airline_sentiments)
]

# Split the data into a training set and a test set
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Initialize the pipeline (nlp.begin_training() was renamed to nlp.initialize() in spaCy v3)
nlp.initialize()

# Train the model
for itn in range(1): # Number of training iterations
    random.shuffle(train_data)
    losses = {}
    for text, annotations in train_data:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        nlp.update([example], losses=losses)
    print(losses)


# Save model
nlp.to_disk("/model")



{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'ner': 0.0, 'textcat': 1630.6910126088308}
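The `cats` annotations used for training mark exactly one of the three labels as true for each tweet. A small helper can make that mapping explicit and reject unexpected label values. This is a sketch only: `to_cats` and `LABELS` are hypothetical names, and the helper emits floats (1.0 / 0.0) where the training cell uses booleans.

```python
LABELS = ("POSITIVE", "NEGATIVE", "NEUTRAL")


def to_cats(sentiment, labels=LABELS):
    """Map a sentiment string such as 'negative' to a one-hot 'cats' annotation."""
    sentiment = sentiment.upper()
    if sentiment not in labels:
        raise ValueError(f"unexpected sentiment: {sentiment!r}")
    return {"cats": {label: float(label == sentiment) for label in labels}}


annotation = to_cats("negative")
print(annotation)  # {'cats': {'POSITIVE': 0.0, 'NEGATIVE': 1.0, 'NEUTRAL': 0.0}}
```

Raising on an unknown value surfaces missing or misspelled labels early, instead of silently producing an all-zero annotation.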


In [28]:
# Use the model's tokenizer to tokenize the test texts (produces a list of Doc objects)
test_docs = [nlp.tokenizer(text) for text, annotation in test_data]

# Use textcat to get the scores for each doc.
# In spaCy v3, predict() returns only the scores array (one row per doc, one column per label),
# so unpacking the result into two names raises a ValueError.
textcat = nlp.get_pipe('textcat')
scores = textcat.predict(test_docs)

print(scores)
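Once a score matrix is available from `textcat.predict`, the evaluation step can compare the highest-scoring label for each tweet against the gold `cats` annotation. The sketch below is one way to do that in plain Python. The function names are hypothetical, the column order is assumed to match the order in which the labels were added (in the notebook, `textcat.labels` could be passed instead), and the score rows shown are made-up numbers for illustration, not model output.

```python
LABELS = ("POSITIVE", "NEGATIVE", "NEUTRAL")  # assumed to match textcat.labels


def predicted_label(score_row, labels=LABELS):
    """Return the label whose score is highest in this row."""
    best_index = max(range(len(labels)), key=lambda i: score_row[i])
    return labels[best_index]


def gold_label(annotation):
    """Return the label marked true in a {'cats': {...}} annotation."""
    cats = annotation["cats"]
    return max(cats, key=cats.get)


def accuracy(score_rows, annotations, labels=LABELS):
    """Fraction of rows whose top-scoring label equals the gold label."""
    if not annotations:
        raise ValueError("no examples to evaluate")
    correct = sum(
        predicted_label(row, labels) == gold_label(annotation)
        for row, annotation in zip(score_rows, annotations)
    )
    return correct / len(annotations)


# Hypothetical score rows for illustration only; they are not model output.
example_scores = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.5, 0.3, 0.2]]
example_annotations = [
    {"cats": {"POSITIVE": True, "NEGATIVE": False, "NEUTRAL": False}},
    {"cats": {"POSITIVE": False, "NEGATIVE": True, "NEUTRAL": False}},
    {"cats": {"POSITIVE": False, "NEGATIVE": False, "NEUTRAL": True}},
]
example_accuracy = accuracy(example_scores, example_annotations)
print(example_accuracy)
```

In the notebook, the call would look like `accuracy(scores, [annotation for _, annotation in test_data], labels=textcat.labels)`.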

In [14]:
print(test_data[:10])


[("@SouthwestAir you're my early frontrunner for best airline! #oscars2016", {'cats': {'POSITIVE': True, 'NEGATIVE': False, 'NEUTRAL': False}}), ('@USAirways how is it that my flt to EWR was Cancelled Flightled yet flts to NYC from USAirways are still flying?', {'cats': {'POSITIVE': False, 'NEGATIVE': True, 'NEUTRAL': False}}), ('@JetBlue what is going on with your BDL to DCA flights yesterday and today?! Why is every single one getting delayed?', {'cats': {'POSITIVE': False, 'NEGATIVE': True, 'NEUTRAL': False}}), ('@JetBlue do they have to depart from Washington, D.C.??', {'cats': {'POSITIVE': False, 'NEGATIVE': False, 'NEUTRAL': True}}), ('@JetBlue I can probably find some of them. Are the ticket #s on there?', {'cats': {'POSITIVE': False, 'NEGATIVE': True, 'NEUTRAL': False}}), ('@united still waiting to hear back. My wallet was stolen from one of your planes so would appreciate a resolution here', {'cats': {'POSITIVE': False, 'NEGATIVE': True, 'NEUTRAL': False}}), ("@united Yes my f