# Airline Customer Tweets Data for Sentiment Analysis using Twitter API

This project uses the Twitter API to gather customer tweets for the Twitter handles of various airlines operating in the USA. The tweets are then analyzed with the Google Cloud Natural Language API to understand the sentiment towards each airline. The data collection is automated to run daily for a month, with the results stored in a file named `/airline_twitter.csv`. An exploratory analysis will be conducted in a separate notebook to gain deeper insight into customer sentiment towards each airline.

## Dataset License

**CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License**

This license is one of the Creative Commons licenses and allows users to share and adapt the dataset if they give credit to the copyright holder and do not use the dataset for any commercial purposes.

## Comments

**A Note to Users and Readers**

* I'd appreciate receiving your feedback in comments and messages.
* To run this notebook, you will have to request API credentials from Twitter and Google Cloud.
* You are allowed to download and use the dataset only for non-commercial purposes and with proper attribution. (See the Twitter Developer Guidelines for more info.)
* A detailed EDA will be presented in a separate Notebook.

**Comment on Sentiment Score and Magnitude Accuracy**
* The sentiment score and magnitude were calculated by the Google Cloud Natural Language API. It uses machine learning models trained on large amounts of data, performs a wide range of natural language processing tasks, including sentiment analysis, and offers pre-trained models that can be used out of the box.
* I used Google's NL API for the following reasons:
    - I didn't have labeled data for training a model myself, whereas the Google NL API models are pre-trained on large datasets.
    - The Google NL API models are highly accurate.
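
To make the two metrics concrete: the Natural Language API reports `score` between -1.0 (negative) and 1.0 (positive), and `magnitude` as a non-negative number that reflects the overall strength of emotion in the text. The sketch below shows one possible way to turn such a pair into a coarse label. The function name `label_sentiment` and both cutoffs are illustrative assumptions and are not used anywhere in this notebook.

```python
def label_sentiment(score, magnitude, score_cutoff=0.25, mixed_cutoff=0.5):
    """Map a (score, magnitude) pair to a coarse label.

    The cutoffs are hypothetical defaults chosen for illustration only.
    """
    if score >= score_cutoff:
        return "positive"
    if score <= -score_cutoff:
        return "negative"
    # Near-zero score: a high magnitude suggests mixed feelings,
    # a low magnitude suggests a genuinely neutral text.
    return "mixed" if magnitude >= mixed_cutoff else "neutral"


print(label_sentiment(0.8, 0.9))   # positive
print(label_sentiment(0.0, 1.4))   # mixed
print(label_sentiment(0.1, 0.1))   # neutral
```

Any real cutoffs would need to be tuned against the collected tweets.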



## Table of Contents for Code Cells

1. Import Libraries
2. Configure objects to Authenticate with Twitter API
3. Define Functions  
A. Read Google Cloud API Project Credentials  
B. Translate non-English Tweet Text to English  
C. Preprocess and Clean the Tweet Texts  
D. Obtain User Locations from the Tweets  
E. Analyze Tweet Sentiment  
F. Write Twitter Data and Sentiment Data to CSV File  
G. Read Airline Twitter Handles Stored in a JSON File  
H. Search Today's 100 Tweets for Each Airline  
4. Call Functions for the Processing

## 1. Import Libraries

In [1]:
# Import libraries
import json
import twitter_api_keys as keys
import tweepy
from google.cloud import translate, language_v1
import preprocessor as p
import os
import csv
import datetime
import pandas as pd
import re

## 2. Configure objects to Authenticate with Twitter API

In [2]:
# Creating and Configuring an OAuthHandler to Authenticate with Twitter
auth = tweepy.OAuthHandler(keys.consumer_key,
                           keys.consumer_secret)

auth.set_access_token(keys.access_token,
                      keys.access_token_secret)

## 3. Define Functions

### A. Read Google Cloud API Project Credentials

In [3]:
def google_api():
    """Reads the secured project credentials and keys for the Google
    Cloud API from a JSON file and returns the parent resource path
    used to interact with the Google API.
    (Read the Google documentation for more info on the set-up and
    usage of the Google Cloud API libraries.)
    """
    os.environ['GOOGLE_APPLICATION_CREDENTIALS']='./google_cloud_api.json'
    os.environ['PROJECT_ID'] = keys.project_id
    project_id = os.environ.get("PROJECT_ID", "")
    assert project_id
    parent = f"projects/{project_id}"

    return parent
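
Because `assert` statements are skipped when Python runs with the `-O` flag, the project ID check could also be written with an explicit exception. The following is only a sketch of that alternative. The helper name `project_parent` and its `env` parameter are hypothetical and exist only so the example can run without real credentials.

```python
import os


def project_parent(env=None):
    """Build the 'projects/<id>' parent path from a PROJECT_ID setting."""
    env = os.environ if env is None else env
    project_id = env.get("PROJECT_ID", "")
    if not project_id:
        # Raise instead of assert so the check also runs under `python -O`
        raise RuntimeError("PROJECT_ID is not set")
    return f"projects/{project_id}"


# A plain dict stands in for os.environ in this demo
print(project_parent({"PROJECT_ID": "demo-project"}))  # projects/demo-project
```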

### B. Translate non-English Tweet Text to English

In [4]:
def translate_tweet_text(client, parent, tweet, tweet_text):
    """If tweet language is not English, translate the tweet
    text to English using Google Cloud Translate API library
    and return the translated text.
    """
    if 'en' in tweet.lang:
        # Return the tweet as is if it is already in English
        return tweet_text

    elif 'und' not in tweet.lang:
        # Language is known and not English: translate it
        # with the Google Translate API
        response = client.translate_text(
            contents=[tweet_text],
            target_language_code="en",
            parent=parent
        )
        # Return the tweet translated to English
        return response.translations[0].translated_text

    # Language is undetermined ('und'): return None so the caller can skip it
    return None
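
The function has three outcomes: English text is returned unchanged, text in another known language is translated, and text whose language is undetermined (`und`) yields `None`. The sketch below isolates that routing so it can be checked without calling the Translate API. The helper name `translation_action` and its return labels are hypothetical.

```python
def translation_action(lang):
    """Return 'keep', 'translate', or 'skip' for a tweet language code.

    Mirrors the branch order of translate_tweet_text: English is kept,
    undetermined ('und') text is skipped, everything else is translated.
    """
    if 'en' in lang:
        return 'keep'
    if 'und' in lang:
        return 'skip'
    return 'translate'


for code in ('en', 'es', 'und'):
    print(code, '->', translation_action(code))
```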

### C. Preprocess and Clean the Tweet Texts

In [5]:
def preprocess_tweet_text(tweet_text):
    """Removes URLs, Twitter reserved words, hashtags, and @mentions
    from the tweet text and returns the clean text."""

    # Tweets cleaning and pre-processing
    # remove URLs and Twitter reserved words, e.g. RT, FAV
    p.set_options(p.OPT.URL, p.OPT.RESERVED)
    # p.clean returns the cleaned string, so keep its result
    tweet_text = p.clean(tweet_text)
    tweet_text = re.sub(r"http\S+", '', tweet_text)
    tweet_text = re.sub(r'#\w+', '', tweet_text)
    tweet_text = re.sub(r'@\w+', '', tweet_text)

    return tweet_text
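
As a quick illustration of the three regular-expression steps, here is a standalone sketch applied to a made-up tweet. It leaves out the `preprocessor` library, and the final whitespace clean-up is an added assumption that is not part of the function above. The helper name `clean_text` and the sample text are hypothetical.

```python
import re


def clean_text(text):
    """Strip URLs, hashtags, and @mentions, then tidy the whitespace."""
    text = re.sub(r"http\S+", "", text)   # URLs
    text = re.sub(r"#\w+", "", text)      # hashtags
    text = re.sub(r"@\w+", "", text)      # @mentions
    # Assumption: collapse the gaps left behind by the removals
    return re.sub(r"\s+", " ", text).strip()


sample = "@united lost my bag again #fail https://t.co/abc123"
print(clean_text(sample))  # lost my bag again
```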

### D. Obtain User Locations from the Tweets

In [6]:
def get_user_location(tweet):
    """Returns the location of the tweet, if available."""
    if tweet.place is not None:
        return tweet.place.name

    else:
        return "Location not available"

def get_user_timezone(tweet):
    """Returns the timezone of the user, if available."""
    if tweet.user.time_zone is not None:
        return tweet.user.time_zone

    else:
        return "Time zone not available"
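
Both helpers follow the same pattern: return the attribute when it is present, otherwise return a fallback string. A slightly more defensive variant of that pattern, which also tolerates objects lacking the attribute altogether, might look like the sketch below. The name `safe_place_name` and the `SimpleNamespace` stand-ins for Tweepy status objects are hypothetical.

```python
from types import SimpleNamespace


def safe_place_name(tweet, default="Location not available"):
    """Return tweet.place.name, or default when no place is attached."""
    place = getattr(tweet, "place", None)
    return place.name if place is not None else default


# Hypothetical stand-ins for Tweepy status objects
with_place = SimpleNamespace(place=SimpleNamespace(name="Seattle"))
without_place = SimpleNamespace(place=None)

print(safe_place_name(with_place))     # Seattle
print(safe_place_name(without_place))  # Location not available
```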

### E. Analyze Tweet Sentiment

In [7]:
def analyze_sentiment(client, tweet_text):
    """Performs sentiment analysis on the tweet text using the Google
    Cloud Natural Language API and returns a tuple containing the
    sentiment score and sentiment magnitude.
    """
    # Intended minimum text length; note that it is not enforced below
    min_length = 20

    document = language_v1.Document(
        content=tweet_text,
        type=language_v1.Document.Type.PLAIN_TEXT
    )

    response = client.analyze_sentiment(request={"document":document})
    sentiment = response.document_sentiment
    score = round(sentiment.score, 3)
    magnitude = round(sentiment.magnitude, 3)
    results = (score, magnitude)
    return results
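
The function sets `min_length = 20` but sends every text to the API regardless of its length. If a length guard is wanted, one way to sketch it is a small wrapper that only calls the analyzer for sufficiently long text. The wrapper name `analyze_if_long_enough` and the `fake_analyze` stub are hypothetical. The stub replaces the real API call so the example runs offline.

```python
def analyze_if_long_enough(analyze, text, min_length=20):
    """Call analyze(text) only when text has at least min_length characters."""
    if text is None or len(text.strip()) < min_length:
        # Too short (or missing): skip the API call entirely
        return (None, None)
    return analyze(text)


def fake_analyze(text):
    """Hypothetical stand-in for the real sentiment API call."""
    return (0.5, 0.5)


print(analyze_if_long_enough(fake_analyze, "ok"))  # (None, None)
print(analyze_if_long_enough(fake_analyze, "The flight was delayed for three hours."))  # (0.5, 0.5)
```

In the notebook, `analyze` would be a function that forwards to `analyze_sentiment` with the language client.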

### F. Write Twitter Data and Sentiment Data to CSV File

In [8]:
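def write_to_csv(file, mode, headers, airline_name, tweet_text, sent_score, sent_magnitude):
    """Write one tweet's data to the csv file given by `file`.

    Note: this function reads the module-level variable `tweet` set by
    the processing loop, so it must be called from inside that loop.
    """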
    with open(file, mode, encoding='utf-8', newline='') as f:
        writetweet = csv.writer(f)

        if mode == 'w':
            writetweet.writerow(headers)

        # Get tweet location
        tweet_loc = get_user_location(tweet)

        # Get tweet timezone
        tweet_tz = get_user_timezone(tweet)

        # Write data to csv file
        writetweet.writerow([airline_name,
                             tweet_text,
                             tweet.created_at,
                             sent_score,
                             sent_magnitude,
                             tweet.user.screen_name,
                             tweet.retweet_count,
                             tweet.favorite_count,
                             tweet_loc,
                             tweet_tz
                             ])
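
The header row should be written exactly once, when the file is first created, and every later call should only append. A minimal sketch of that pattern using just the standard library is shown below. It decides per call whether the file is new instead of relying on a `mode` argument. The helper name `append_row` and the demo data are hypothetical.

```python
import csv
import os
import tempfile


def append_row(path, headers, row):
    """Append row to path, writing headers first if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(headers)
        writer.writerow(row)


# Demo in a temporary directory so no real data file is touched
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    append_row(demo_path, ["airline", "score"], ["Demo Air", 0.4])
    append_row(demo_path, ["airline", "score"], ["Demo Air", -0.2])
    with open(demo_path, encoding="utf-8", newline="") as f:
        rows = list(csv.reader(f))

print(rows)  # header row followed by the two data rows
```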

### G. Read Airline Twitter Handles Stored in a JSON File

**Note**: This function requires the `us_airlines.json` file, which is attached as a separate file.

In [9]:
def read_json(json_filepath):
    """Reads the airline names and their corresponding Twitter handles
    from the JSON file and returns the result as a dictionary.
    Note: Download "us_airlines.json" file along with this notebook.
    """
    with open(json_filepath, 'r') as json_file:
        data = json.load(json_file)
    return data
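
The processing loop iterates over this dictionary with `.items()` and skips entries whose value is `None`, so the file is expected to map each airline name to a handle or to `null`. The sketch below illustrates that shape with made-up entries. The airline names, the handle format, and the variable names are assumptions and are not taken from the real `us_airlines.json`.

```python
import json

# Hypothetical sample in the shape the processing loop expects
sample = '{"Example Air": "@ExampleAir", "Defunct Air": null}'

airlines = json.loads(sample)

# JSON null becomes Python None; keep only airlines that have a handle
active = {name: handle for name, handle in airlines.items()
          if handle is not None}

print(active)  # {'Example Air': '@ExampleAir'}
```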

### H. Search Today's 100 Tweets for Each Airline

In [10]:
def search_tweets(api, query):
    """Searches for tweets matching the given query and returns only
    the tweets from today. A maximum of 100 tweets is returned."""
    # Today's date
    today = datetime.datetime.now().date()
    # Empty list
    tweets_today = []
    try:
        # Twitter search query
        tweets = api.search_tweets(q=query, tweet_mode='extended', count=100)
        # Keep only the tweets created today
        tweets_today = [tweet for tweet in tweets
                        if tweet.created_at.date() == today]

    # Tweepy 4, which provides search_tweets, raises TweepyException
    except tweepy.TweepyException as e:
        print(f'Error: {e}')

    return tweets_today
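
The date filter can be tried out offline with stand-in objects. The sketch below assumes that `created_at` is a timezone-aware UTC datetime, in which case comparing against a UTC date (rather than the local date) avoids dropping tweets posted around midnight. The helper name `filter_by_date` and the `SimpleNamespace` tweets are hypothetical.

```python
import datetime
from types import SimpleNamespace


def filter_by_date(tweets, day):
    """Keep only the tweets whose created_at falls on the given date."""
    return [t for t in tweets if t.created_at.date() == day]


utc = datetime.timezone.utc
day = datetime.date(2023, 1, 23)

# Hypothetical stand-ins for Tweepy status objects
tweets = [
    SimpleNamespace(id=1, created_at=datetime.datetime(2023, 1, 23, 9, 0, tzinfo=utc)),
    SimpleNamespace(id=2, created_at=datetime.datetime(2023, 1, 22, 23, 59, tzinfo=utc)),
]

kept = filter_by_date(tweets, day)
print([t.id for t in kept])  # [1]
```

For a live run, `day` could be taken from `datetime.datetime.now(datetime.timezone.utc).date()`.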

## 4. Call Functions for the Processing

In [11]:
# Create an API object
api = tweepy.API(auth, wait_on_rate_limit=True)

# Get the parent parameter for the Google Translate API
parent = google_api()

# Get all US airlines' Twitter handle from the json file.
us_airlines = read_json('us_airlines.json')

# Create service client objects for Google Translation and Language Services
google_translat_client = translate.TranslationServiceClient()
google_language_client = language_v1.LanguageServiceClient()

# Specify csv file name and path
filename = "./airline_twitter1.csv"
# Specify CSV data header row
headers = ['airline', 'tweet_text', 'date', 'score', 'magnitude',
           'user', 'retweet_count', 'likes_count', 'location', 'time_zone']

for key, value in us_airlines.items():
    # Check whether a Twitter handle exists for the airline.
    # If yes, proceed; otherwise skip and move to the next airline.
    if value is not None:
        
        # Write mode
        mode = 'w'
        if os.path.exists(filename):
            # Append mode, if file exists
            mode = 'a'

        # Search today's tweets for each US airline, max count = 100
        print(f"Run date: {datetime.datetime.now()}")
        print(f"{key}")
        print("\tSearching customer tweets...")
        tweets = search_tweets(api=api, query=value)
        print("\tSearching customer tweets. COMPLETED")       

        count = 0
        for tweet in tweets:
            count += 1

            if tweet:

                if (not tweet.retweeted) and ('RT @' not in tweet.full_text):
                    
                    tweet_text = tweet.full_text

                    # Clean tweet text
                    if tweet_text:
                        tweet_text = preprocess_tweet_text(tweet_text)
                    else:
                        tweet_text=""

                    # Translate tweet to English first
                    tweet_text = translate_tweet_text(google_translat_client, parent, tweet, tweet_text)
                    
                    if tweet_text is not None:
                        # Get sentiment analysis scores using Google API
                        score, magnitude = analyze_sentiment(google_language_client, tweet_text)
                        print(f"\tComputing tweet sentiment metrics and writing full data to file. Completed: {count}/100.", end="\r")

                    else:
                        # translate_tweet_text returned None (undetermined language),
                        # so skip sentiment analysis on this tweet
                        score, magnitude = None, None
                        print(f'\tSkipping sentiment analysis of tweet # {count} with text "{tweet_text}" because it is too short.')

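                    # Write the tweet data to csv file. Re-check the mode for
                    # every row: reusing 'w' would overwrite earlier rows.
                    mode = 'a' if os.path.exists(filename) else 'w'
                    write_to_csv(filename, mode, headers, key, tweet_text, score, magnitude)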
                    
        if count == 0:
            print("\tNo tweets found in the past day")
        else:
            print(f"\n\tProcessing and writing data to file COMPLETED. {count}/100.")

Run date: 2023-01-23 23:38:36.239456
AirTran Airways
	Searching customer tweets...
	Searching customer tweets. COMPLETED
	No tweets found in the past day
Run date: 2023-01-23 23:38:36.863850
Alaska Airlines Inc.
	Searching customer tweets...
	Searching customer tweets. COMPLETED
	Skipping sentiment analysis of tweet # 28 with text "None" because it is too short.
	Skipping sentiment analysis of tweet # 41 with text "None" because it is too short.
	Skipping sentiment analysis of tweet # 42 with text "None" because it is too short.
	Skipping sentiment analysis of tweet # 52 with text "None" because it is too short.
	Computing tweet sentiment metrics and writing full data to file. Completed: 91/100.
	Processing and writing data to file COMPLETED. 91/100.
Run date: 2023-01-23 23:39:32.392028
Allegiant Air
	Searching customer tweets...
	Searching customer tweets. COMPLETED
	Computing tweet sentiment metrics and writing full data to file. Completed: 9/100.

KeyboardInterrupt: 