## Import Libraries

In [None]:
import re
import string
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('vader_lexicon')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [None]:
warnings.filterwarnings("ignore")

## Import Dataset from Local

In [None]:
from google.colab import files
file = files.upload()  #upload file

In [None]:
df = pd.read_csv("Tweets.csv")

## Initial Data Exploration

In [None]:
df.describe()

The describe() function provides summary metrics for the numerical variables. Perhaps the most interesting is the count row, which shows how many non-null values each column has. Because we also have a number of categorical variables, we will examine the shape and type info next.

In [None]:
df.info()

We see that there are 15 total features, one of which is the ID of the tweet. Quite a few of these features have many missing rows. We also see that the tweet_created column is an object when it should be a datetime type.

#### Examine First 5 Rows of Data

In [None]:
df.head(5)

In the previous section, we saw the types of the different features, but here we can see the actual values in the rows. For example, the airline_sentiment column holds labels such as positive, negative, and neutral.

#### Examining Unique Values

In [None]:
df.nunique()

From the above, we see that there are three general sentiments, which we saw previously (positive, negative, neutral). It's also interesting to note that out of all of the tweets, we are specifically looking at 6 airlines. Here are the 6 airlines whose tweets we will be examining:

In [None]:
df['airline'].unique()

#### Check for Nulls/NaNs

In [None]:
num_of_nan_missing=df.isnull().sum()
print(num_of_nan_missing)

Based on the above, we see that a substantial number of rows are missing for several features in our dataset. However, we still need to figure out which columns are relevant to our analysis.

## Data Cleaning

#### Configure Datetime Objects

In [None]:
df['tweet_created'] = pd.to_datetime(df['tweet_created'])
df.info()

Now we see that the tweet_created feature is of type datetime64, which will make datetime analysis easier in the exploration phase.

#### Handling the Null Values

First, we're going to find the percentage of null values in each feature.

In [None]:
df.isnull().mean()*100

We see that negativereason_gold, airline_sentiment_gold, and tweet_coord are each more than 90% null, so we will drop those columns.

In [None]:
df = df.drop(df.columns[df.isnull().mean()>0.90], axis=1)

After dropping the columns, we will validate that they have been dropped successfully.

In [None]:
df.isnull().mean()*100

#### Handling the Text Columns

First, we create new columns to separate the Twitter handle from the rest of the tweet.

In [1]:
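# Split each tweet once on the first space: [first word, rest of tweet]
# (n must be passed by keyword in recent pandas versions)
parts = df['text'].str.split(' ', n=1)
df['first_word'] = parts.str[0]
# Keep the first word only when it is a Twitter handle
df.loc[~df['first_word'].str.startswith('@'), 'first_word'] = np.nan
df['remaining_sentence'] = parts.str[1]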


In [None]:
print(df['first_word'])

This is important to the analysis because we want to begin cleaning the tweets to retrieve the sentiment-sensitive text. We'll take a look at the Twitter handles we just retrieved.

In [None]:
unique_values = df['first_word'].unique()
print(unique_values)

Above we see that there are slight variations in the Twitter handles due to differences in capitalization and punctuation. We can assume that the rest of the tweet will have similar issues, so next we convert the text to lowercase, remove URLs, special characters, punctuation, and numbers, and then tokenize, remove stop words, and lemmatize.
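The handles themselves could be normalized in the same spirit. The following is a minimal sketch, not part of the original notebook: `normalize_handle` is a hypothetical helper, and the sample handles are made up to illustrate the capitalization and punctuation variations described above.

```python
import re

def normalize_handle(raw):
    """Lowercase a handle and drop trailing punctuation; return None if no handle is found."""
    if not isinstance(raw, str):
        return None  # NaN / missing values come back as None
    match = re.match(r"@\w+", raw)  # '@' followed by letters, digits, or underscores
    return match.group(0).lower() if match else None

# Illustrative inputs only, not taken from the dataset
samples = ["@VirginAmerica", "@virginamerica,", "@VirginAmerica:", "hello"]
normalized = [normalize_handle(s) for s in samples]
print(normalized)
```

On the DataFrame, this could be applied with `df['first_word'].map(normalize_handle)`.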

In [None]:
# Work on the tweet text with the leading handle removed;
# single-word tweets have no remainder, so fill those with an empty string
tweets = df['remaining_sentence'].fillna('')

# Convert text to lowercase
tweets = tweets.str.lower()

# Remove URLs
tweets = tweets.apply(lambda x: re.sub(r"http\S+|www\S+|https\S+", "", x))

# Remove special characters and punctuation
tweets = tweets.apply(lambda x: re.sub(r"[^\w\s]", "", x))

# Remove digits
tweets = tweets.apply(lambda x: re.sub(r"\d+", "", x))

# Tokenize the tweets
tweets = tweets.apply(word_tokenize)

# Remove stop words
stop_words = set(stopwords.words("english"))
tweets = tweets.apply(lambda x: [word for word in x if word not in stop_words])

# Lemmatize the tokens
lemmatizer = WordNetLemmatizer()
tweets = tweets.apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

# Join the tokens back into sentences
tweets = tweets.apply(lambda x: ' '.join(x))

# Update the DataFrame with the cleaned tweets
df['cleaned_tweets'] = tweets


In [None]:
df['cleaned_tweets']

Above we see that much of the unwanted text (URLs, punctuation, digits, and stop words) has been removed from the tweets and that capitalization has been standardized.
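To see what the regex steps alone do to a single tweet, here is a small sketch that uses only the standard library. The helper name `regex_clean` and the sample tweet are made up for illustration, and the NLTK steps (tokenization, stop-word removal, lemmatization) are deliberately left out.

```python
import re

def regex_clean(text: str) -> str:
    """Apply the lowercase, URL, punctuation, and digit steps to one tweet."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # drop URLs
    text = re.sub(r"[^\w\s]", "", text)         # drop punctuation and symbols
    text = re.sub(r"\d+", "", text)             # drop digits
    return " ".join(text.split())               # collapse leftover whitespace

# Made-up example tweet
sample = "Flight 123 was DELAYED!!! See http://t.co/abc #fail"
cleaned = regex_clean(sample)
print(cleaned)  # flight was delayed see fail
```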

#### Initial Sentiment Analysis

As a preliminary step in the sentiment analysis, we calculate a sentiment score for each cleaned tweet so that we have a baseline to work from.

In [None]:
cleaned_tweets = df['cleaned_tweets']

sid = SentimentIntensityAnalyzer()

sentiment_scores = cleaned_tweets.apply(lambda x: sid.polarity_scores(x))

compound_scores = sentiment_scores.apply(lambda x: x['compound'])

df['sentiment_score'] = compound_scores

In [None]:
df['sentiment_score']

Above we see the sentiment_score column that was added to our dataframe that we can now use for some preliminary EDA.
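One way to turn these compound scores into a baseline is to map them to the same three labels used in airline_sentiment. The sketch below assumes the commonly used ±0.05 cutoff for VADER compound scores; that threshold and the helper name `label_from_compound` are assumptions, not something the notebook defines.

```python
def label_from_compound(score: float, threshold: float = 0.05) -> str:
    """Map a compound score in [-1, 1] to positive / negative / neutral."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Illustrative scores only
example_scores = [0.62, -0.40, 0.0, 0.05]
labels = [label_from_compound(s) for s in example_scores]
print(labels)
```

Applied to the DataFrame, `df['sentiment_score'].map(label_from_compound)` would give predicted labels that can be compared against `df['airline_sentiment']`.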

In [None]:
df_cleaned = df.copy()

## Exploratory Data Analysis

In [None]:
ax = df['airline'].value_counts().plot(kind='bar',
                                    figsize=(7,4),
                                    title="Count of Airline @s")
ax.set_xlabel("Airline Name")
ax.set_ylabel("Frequency")
plt.show()

In [None]:
ax = df['negativereason'].value_counts().plot(kind='barh',
                                    figsize=(7,4),
                                    title="Count of Negative Comments by Reason", color = 'red')
ax.invert_yaxis()
ax.set_xlabel("Frequency")
ax.set_ylabel("Negative Reason")
plt.show()

IDEA
X axis - date
Y axis - avg sentiment score for the day by airline
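
The idea above could be sketched roughly as follows. This is only one possible approach: `daily_mean_sentiment` is a hypothetical helper, and the toy data stands in for `df`, which is assumed to have a datetime `tweet_created` column plus the `airline` and `sentiment_score` columns created earlier.

```python
import pandas as pd

def daily_mean_sentiment(frame: pd.DataFrame) -> pd.DataFrame:
    """Return mean sentiment per calendar day (rows) and airline (columns)."""
    day = frame["tweet_created"].dt.date.rename("date")
    return (
        frame.groupby([day, "airline"])["sentiment_score"]
        .mean()
        .unstack("airline")
    )

# Toy data standing in for df (values are made up for illustration)
toy = pd.DataFrame({
    "tweet_created": pd.to_datetime(
        ["2015-02-17 08:00", "2015-02-17 21:30", "2015-02-18 09:15", "2015-02-18 10:00"]
    ),
    "airline": ["United", "United", "United", "Delta"],
    "sentiment_score": [0.5, -0.1, 0.2, -0.4],
})
daily = daily_mean_sentiment(toy)
print(daily)
```

Calling `daily_mean_sentiment(df).plot(figsize=(10, 4))` would then draw one line per airline, with the date on the x-axis and the average sentiment score on the y-axis.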