In [22]:
import pandas as pd
import numpy as np
import sys
sys.path.append('../')

from util.dataframe_utils import analyse_columns
from util.datetime_utils import calculate_hour_sin_cos, fractional_hour

df = pd.read_csv('../data/unprocessed/Tweets.csv')

I've decided to model the data by hour of the day, to establish probabilities that relate to various factors. This approach also allows for increasing complexity. Initially, I can fit a simple linear regression on how likely a tweet is to appear at a specific hour of the day. Thereafter, more complex modelling can follow, such as the probability of a certain airline receiving a given sentiment, or the type of negative reason a specific airline might receive.
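As a rough illustration of the first step, the share of tweets per hour of the day can be estimated with a normalised count. This is only a sketch: the `toy` frame and its timestamps are made up for the example and are not taken from `Tweets.csv`, and `hourly_share` is a hypothetical name.

```python
import pandas as pd

# Toy timestamps for illustration only; not taken from Tweets.csv.
toy = pd.DataFrame({
    'tweet_created': pd.to_datetime([
        '2015-02-24 11:35:52',
        '2015-02-24 11:15:59',
        '2015-02-24 14:05:10',
        '2015-02-23 23:59:01',
    ])
})

# Share of tweets falling in each hour of the day (0-23).
# Hours with no tweets are filled with 0 so every hour is represented.
hourly_share = (
    toy['tweet_created'].dt.hour
    .value_counts(normalize=True)
    .reindex(range(24), fill_value=0.0)
    .sort_index()
)
```

The resulting series sums to 1 and could serve as the target of a simple regression on the hour.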

With this in mind, I will do some basic preprocessing to prepare the dataset for this modelling. Naturally, this will include some additional EDA as we explore how to transform the data.

In [23]:
# let's also rename the negative reason fields for consistency

df = df.rename(columns={'negativereason': 'negative_reason', 
                        'negativereason_confidence': 'negative_reason_confidence'})

In [24]:
analyse_columns(df)

Unnamed: 0,Column,Data Type,Missing Values,Missing Ratio (%),Unique Values
0,tweet_id,int64,0,0.00%,14485
1,airline_sentiment,object,0,0.00%,3
2,airline_sentiment_confidence,float64,0,0.00%,1023
3,negative_reason,object,5462,37.31%,10
4,negative_reason_confidence,float64,4118,28.13%,1410
5,airline,object,0,0.00%,6
6,airline_sentiment_gold,object,14600,99.73%,3
7,name,object,0,0.00%,7701
8,negativereason_gold,object,14608,99.78%,13
9,retweet_count,int64,0,0.00%,18


## Missing Values ##

In [25]:
# the following columns were dropped entirely due to the large ratio of missing values

df = df.drop(['airline_sentiment_gold', 'negativereason_gold', 'tweet_coord'], axis=1)

I notice a mismatch between the number of missing values for `negative_reason` and `negative_reason_confidence`. Since more `negative_reason_confidence` values are present than `negative_reason` values, some confidence values must be recorded where no reason is present. This may be a clue to how the original dataset handled missing values for `negative_reason_confidence`.

In [26]:
filtered_df = df[pd.isnull(df['negative_reason']) & pd.notnull(df['negative_reason_confidence'])]

non_missing_confidence_values = filtered_df['negative_reason_confidence']
non_missing_confidence_values.unique()

array([0.])

So, as expected, the dataset attributes 0% confidence to entries that have no negative reason. We might follow this convention, but first let's check whether any entries that do have a negative reason also carry 0% confidence, since that would make the convention ambiguous.

In [27]:
# rows that have a negative reason recorded but a confidence of exactly 0
filtered_df = df[pd.notnull(df['negative_reason']) & (df['negative_reason_confidence'] == 0)]

print(f"Number of entries with a negative reason and 0% confidence: {len(filtered_df)}")


Number of entries with a negative reason and 0% confidence: 0


Let's fill the missing `negative_reason_confidence` values with 0, following the dataset's own convention for entries without a reason. We'll keep track of these sorts of changes in the `ENCODING.md` file found within the directory for future reference.

In [28]:
df['negative_reason_confidence'] = df['negative_reason_confidence'].fillna(0)
df['negative_reason_confidence'].isnull().sum()


0

In [29]:
df['negative_reason'].unique()

array([nan, 'Bad Flight', "Can't Tell", 'Late Flight',
       'Customer Service Issue', 'Flight Booking Problems',
       'Lost Luggage', 'Flight Attendant Complaints', 'Cancelled Flight',
       'Damaged Luggage', 'longlines'], dtype=object)

During the initial EDA, it was established that there were no negative sentiments without a negative reason. The missing values are therefore entries where a reason does not apply, so we can add a new category to `negative_reason` called `Not Applicable`.

In [30]:
df['negative_reason'] = df['negative_reason'].fillna('Not Applicable')
df['negative_reason'].value_counts()

negative_reason
Not Applicable                 5462
Customer Service Issue         2910
Late Flight                    1665
Can't Tell                     1190
Cancelled Flight                847
Lost Luggage                    724
Bad Flight                      580
Flight Booking Problems         529
Flight Attendant Complaints     481
longlines                       178
Damaged Luggage                  74
Name: count, dtype: int64

The last two columns with missing values that we have to deal with are `tweet_location` and `user_timezone`. Since the Missing Ratio (%) is relatively high, I don't feel confident that imputation would accurately reflect the underlying patterns in the dataset. Nearly a third of the instances are missing! For now we'll assign the `Unknown` identifier to these fields. I'm not spending too much time on this, because I won't be using these fields in my model.

In [31]:
df['tweet_location'] = df['tweet_location'].fillna('Unknown')
df['user_timezone'] = df['user_timezone'].fillna('Unknown')
df['tweet_location'].value_counts()
df['user_timezone'].value_counts()

user_timezone
Unknown                       4820
Eastern Time (US & Canada)    3744
Central Time (US & Canada)    1931
Pacific Time (US & Canada)    1208
Quito                          738
                              ... 
Warsaw                           1
Irkutsk                          1
Lisbon                           1
Canberra                         1
Newfoundland                     1
Name: count, Length: 86, dtype: int64

In [32]:
analyse_columns(df)

Unnamed: 0,Column,Data Type,Missing Values,Missing Ratio (%),Unique Values
0,tweet_id,int64,0,0.00%,14485
1,airline_sentiment,object,0,0.00%,3
2,airline_sentiment_confidence,float64,0,0.00%,1023
3,negative_reason,object,0,0.00%,11
4,negative_reason_confidence,float64,0,0.00%,1410
5,airline,object,0,0.00%,6
6,name,object,0,0.00%,7701
7,retweet_count,int64,0,0.00%,18
8,text,object,0,0.00%,14427
9,tweet_created,object,0,0.00%,14247


## Categorical Encoding ##

Let's encode the categorical features we are most inclined to use, namely `airline_sentiment`, `negative_reason`, and `airline`. Fields such as `tweet_id`, `name`, `text`, `tweet_location`, and `user_timezone` will be left untouched for now, as we might want them in their original forms. Certain parts of the data will have to be manipulated depending on the model, but for now we are doing general data preprocessing.

In [33]:
# airline_sentiment

codes, uniques = pd.factorize(df['airline_sentiment'])
df['airline_sentiment'] = codes

mapping = dict(enumerate(uniques))
mapping

{0: 'neutral', 1: 'positive', 2: 'negative'}

In [34]:
# airline

codes, uniques = pd.factorize(df['airline'])
df['airline'] = codes

mapping = dict(enumerate(uniques))
mapping

{0: 'Virgin America',
 1: 'United',
 2: 'Southwest',
 3: 'Delta',
 4: 'US Airways',
 5: 'American'}

In [35]:
# negative_reason

codes, uniques = pd.factorize(df['negative_reason'])
df['negative_reason'] = codes

mapping = dict(enumerate(uniques))
mapping

{0: 'Not Applicable',
 1: 'Bad Flight',
 2: "Can't Tell",
 3: 'Late Flight',
 4: 'Customer Service Issue',
 5: 'Flight Booking Problems',
 6: 'Lost Luggage',
 7: 'Flight Attendant Complaints',
 8: 'Cancelled Flight',
 9: 'Damaged Luggage',
 10: 'longlines'}
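
Since `pd.factorize` assigns codes in order of first appearance, the mappings above are only meaningful if they are stored alongside the data. The sketch below shows one way to collect them per column and decode a column back to its labels. The `toy` frame and the `encodings` dictionary are hypothetical names for this example; how the mappings are actually written to `ENCODING.md` is not shown here.

```python
import pandas as pd

# Toy frame for illustration only; the labels mimic the notebook's categories.
toy = pd.DataFrame({
    'airline_sentiment': ['neutral', 'positive', 'negative', 'negative'],
    'airline': ['Virgin America', 'United', 'United', 'Delta'],
})

encodings = {}  # hypothetical container: column name -> {code: label}
for column in ['airline_sentiment', 'airline']:
    # Codes follow the order in which each label first appears in the column.
    codes, uniques = pd.factorize(toy[column])
    toy[column] = codes
    encodings[column] = dict(enumerate(uniques))

# Decode a column back to its original labels using the stored mapping.
decoded = toy['airline_sentiment'].map(encodings['airline_sentiment'])
```

Keeping every mapping in one dictionary makes it straightforward to log them together or to reverse the encoding later for plots and reports.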

In [36]:
analyse_columns(df)

Unnamed: 0,Column,Data Type,Missing Values,Missing Ratio (%),Unique Values
0,tweet_id,int64,0,0.00%,14485
1,airline_sentiment,int64,0,0.00%,3
2,airline_sentiment_confidence,float64,0,0.00%,1023
3,negative_reason,int64,0,0.00%,11
4,negative_reason_confidence,float64,0,0.00%,1410
5,airline,int64,0,0.00%,6
6,name,object,0,0.00%,7701
7,retweet_count,int64,0,0.00%,18
8,text,object,0,0.00%,14427
9,tweet_created,object,0,0.00%,14247


## Scaling and Standardisation ##

For the current iteration of the dataset, I won't scale or standardise anything. This can be done per model. I think at this stage interpretability is of greater value. Each of the categorical features has a discrete integer code associated with it, and the codes are logged in `ENCODING.md`. The continuous numerical fields such as `airline_sentiment_confidence` and `negative_reason_confidence` are already scaled to the range `0-1`, which is intuitive and interpretable when dealing with percentages.

## Feature Engineering: Temporal ##

Since I've decided to model the data according to the time at which a tweet was created, we can do some preliminary preprocessing and feature engineering to accommodate the modeling process.

In [37]:
# Convert 'tweet_created' to datetime
df['tweet_created'] = pd.to_datetime(df['tweet_created'])

# The focus will be mostly on the hour per day that certain tweets happened, so let's extract these features

df['day'] = df['tweet_created'].dt.day
df['hour'] = df['tweet_created'].dt.hour
df['minute'] = df['tweet_created'].dt.minute
df['second'] = df['tweet_created'].dt.second

Transforming the hour of the day into sine and cosine values (and, when needed, back to the actual hour) is a method often used in feature engineering for cyclic or periodic features. It is particularly useful where the cyclical nature of a variable (time of day, day of week, month of year, etc.) needs to be captured: on a plain 0-24 scale, 23:59 and 00:01 look far apart, whereas on the unit circle they are neighbours. I've defined utility functions for these purposes under the `util` module.
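
The `util` functions themselves are not reproduced in this notebook, so the following is only a minimal sketch of what such helpers might look like, assuming a 24-hour period. The `*_sketch` names are hypothetical and are not the actual `fractional_hour` or `calculate_hour_sin_cos` implementations.

```python
import math

def fractional_hour_sketch(hour, minute, second):
    """Hour of day as a float in [0, 24), e.g. 06:30:00 -> 6.5."""
    return hour + minute / 60 + second / 3600

def hour_sin_cos_sketch(frac_hour):
    """Map a fractional hour onto the unit circle (24-hour period assumed)."""
    angle = 2 * math.pi * frac_hour / 24
    return math.sin(angle), math.cos(angle)

def hour_from_sin_cos_sketch(hour_sin, hour_cos):
    """Recover the fractional hour from its sine and cosine components."""
    angle = math.atan2(hour_sin, hour_cos)  # angle in (-pi, pi]
    return (angle % (2 * math.pi)) * 24 / (2 * math.pi)

# Times on either side of midnight end up close together on the circle.
late = hour_sin_cos_sketch(fractional_hour_sketch(23, 59, 0))
early = hour_sin_cos_sketch(fractional_hour_sketch(0, 1, 0))
gap = math.dist(late, early)
```

Both components are needed: sine alone cannot distinguish, for example, 03:00 from 09:00, while the sine and cosine pair identifies each time of day uniquely.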

In [38]:
# first, let's make a fractional hour based on the associated minutes and seconds for a more precise temporal representation

df['fractional_hour'] = df.apply(lambda row: fractional_hour(row['hour'], row['minute'], row['second']), axis=1)


In [39]:
# now let's get our sine and cosine representations of the fractional hour

df[['hour_sin', 'hour_cos']] = df['fractional_hour'].apply(lambda x: pd.Series(calculate_hour_sin_cos(x)))


I'm also going to drop the `day`, `hour`, `minute`, and `second` columns from the dataframe, since this information is already contained in the `tweet_created` datetime column. This keeps the dataset from becoming unnecessarily bloated and redundant.

In [40]:
df = df.drop(['day', 'hour', 'minute', 'second'], axis=1)

For the current purposes of my modeling, the preprocessing will stop here. There is always a seemingly endless amount of EDA and preprocessing you can do, but the trade-off is time. In my view, the ideal is to be pragmatic. So let's save the newly processed dataset and get to some modeling.

In [41]:
df.to_csv('../data/processed/ProcessedTweets.csv', index=False)