# CS 39AA - Notebook 1: Intro to Text Data and Pandas

In this notebook we'll explore the Airline Tweet dataset and try a simple (the simplest?) model that we can come up with to predict whether a tweet will be a) positive, b) neutral, or c) negative. To run this notebook in Google Colab, click the badge below. For a brief intro on what Jupyter notebooks are and how they work, check out [this short tutorial by Jeremy Howard](https://www.kaggle.com/code/jhoward/jupyter-notebook-101).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sgeinitz/CS39AA/blob/main/nb1_text_data_and_pandas.ipynb)

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/sgeinitz/CS39AA/blob/main/nb1_text_data_and_pandas.ipynb)

The basic "model" we are going to create uses an ad hoc approach so we'll only need the pandas Python module for now, which we'll import here.

In [None]:
import pandas as pd

Before getting started let's look at a toy example of a pandas DataFrame. Most often a data frame will be created by reading an input file, but it's also possible to create one manually. Here is a common way of creating one manually that uses a Python dictionary (specifically, the Python data type is "__dict__"). 

In [None]:
toy_dict = {'col_a':[1,2,3,4,5], 'col_b':['blue', 'red', 'red', 'purple', 'red']}
toy_dict

This dictionary can then be used to create a pandas data frame, where each __dict key__ is a column name, and each corresponding __dict__ __value__ (which is a list) defines the data in that column.

In [None]:
toy_df = pd.DataFrame(toy_dict)
toy_df
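
As a quick sketch of the kind of operations used later in this notebook, the following self-contained cell rebuilds the toy data frame and shows column selection and row filtering with a boolean mask. The variable names `col_a_values` and `red_rows` are made up for this illustration.

```python
import pandas as pd

# Rebuild the toy data frame so this cell runs on its own.
toy_df = pd.DataFrame({'col_a': [1, 2, 3, 4, 5],
                       'col_b': ['blue', 'red', 'red', 'purple', 'red']})

# Selecting one column returns a pandas Series.
col_a_values = toy_df['col_a']

# A boolean mask keeps only the rows where the condition is True.
red_rows = toy_df[toy_df['col_b'] == 'red']

print(toy_df.shape)  # (number of rows, number of columns)
print(red_rows)
```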

Now getting back to our problem, let's open the data file. If the data were stored locally, we would use `pd.read_csv("path/to/file/file.csv")` to open it. In this case, the data is online at GitHub. The `pandas` module also knows how to open a file from a URL without any additional parameters, so we can still use the same pandas function but with the URL instead of the local path, as seen below. After loading the data file, check the dimensions of the data frame (i.e. number of rows and number of columns, shown together in a single Python tuple).

In [None]:
data_URL = 'https://raw.githubusercontent.com/sgeinitz/CS39AA/main/data/trainA.csv'
df = pd.read_csv(data_URL)
df.shape

Now let's look at the first few observations (i.e. tweets) in the data frame using the DataFrame's __head__ method. This is the first and most basic step in what is called Exploratory Data Analysis (EDA). This is a really important step for any data science or machine learning project that you'll work on.

Perhaps not surprisingly, this is also a step that LLMs (e.g. ChatGPT) have gotten better and better at over the last few months. Nonetheless, it's still vital to know what some of these basic EDA steps look like.

In [None]:
df.head(10)

To be able to see the full __text__ field/column we need to tell pandas to increase the maximum column width it displays.

In [None]:
pd.set_option("display.max_colwidth", 240)
df.head()

Now let's summarize the observations (i.e. tweets) by their labels (i.e. sentiment). We know there should be three possible values for the labels: positive, neutral, and negative, but let's also see how many of each there are. 

In [None]:
df.sentiment.value_counts(normalize=True)
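
The largest proportion in this table is worth noting, because it is the accuracy we would get by always predicting the most common label. A minimal sketch on made-up labels (the `toy_labels` values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up labels for illustration only; not the real dataset.
toy_labels = pd.Series(['negative', 'negative', 'negative', 'neutral', 'positive'])

# Proportion of each label, sorted from most to least common.
proportions = toy_labels.value_counts(normalize=True)

majority_label = proportions.idxmax()   # the most common label
baseline_accuracy = proportions.max()   # accuracy of always guessing that label

print(majority_label, baseline_accuracy)
```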

Let's now see which words are used most often. Note that this means we need to separate each tweet into words. The most basic way to accomplish this is to use Python's string method, __split__. We can first check whether the _text_ field is in fact a string by looking at the data frame's data types with __dtypes__.

In [None]:
df.dtypes

Note that the _text_ column is not a string data type, but is instead the more general __object__ data type (note that everything in Python inherits from the __object__ data type). Because of Python's friendly dynamic behavior, we don't need to worry about this when we look at a single tweet. That is, Python will allow us to use the `str.split()` method when we are working with a single tweet:


In [None]:
df.iloc[1,1].split()

However, if we try to use split on the entire column we'll run into an issue, as can be seen by the error raised when you uncomment the line in the following cell.

In [None]:
#df['text'][:5].split()

To avoid this we need to use the pandas Series __str__ attribute to allow for string methods to be used. Try playing with the following code cell to see what happens when changing the indexing, or including/excluding __.str__ at the end. 

In [None]:
df['text'][:5].str.split()
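
The difference between the two calls can be sketched on a small, made-up Series (the `toy_text` values are illustrative). Calling `split` directly on the Series raises an `AttributeError`, while going through `.str` applies the string method to each element.

```python
import pandas as pd

# Made-up tweets for illustration only.
toy_text = pd.Series(['great flight today', 'my bag was lost'])

# A Series has no split method of its own, so this raises AttributeError.
try:
    toy_text.split()
    raised = False
except AttributeError:
    raised = True

# The .str accessor applies the string method element by element.
toy_tokens = toy_text.str.split()

print(raised)
print(toy_tokens.tolist())
```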

Let's now add a new column that contains the list of tokens (i.e. words) for each tweet. Note that the __split()__ method splits a string on whitespace by default. This looks alright for now, but we'll see how to improve this later on. Note, however, that we will also convert everything to lowercase before splitting.

In [None]:
df['tokens'] = df['text'].str.lower().str.split()
df.head()

Next we will count how often each word occurs across all of the tweets. We'll use a __dict__ for this as well where each __key__ is a word and each word's __value__ will be the number of times it appears in the entire dataset. 

In [None]:
vocab = dict()
for tweet_tokens in df['tokens']:
    for token in tweet_tokens:
        if token not in vocab:
            vocab[token] = 1
        else:
            vocab[token] += 1

len(vocab)
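
The same counting can be written more compactly with `collections.Counter` from the Python standard library. A minimal sketch on made-up token lists (`toy_tokens` stands in for `df['tokens']`):

```python
from collections import Counter

# Made-up token lists standing in for the 'tokens' column.
toy_tokens = [['the', 'flight', 'was', 'late'],
              ['the', 'crew', 'was', 'great'],
              ['late', 'again']]

# Counter is a dict subclass; update() adds one count per token in the list.
toy_vocab = Counter()
for tweet_tokens in toy_tokens:
    toy_vocab.update(tweet_tokens)

print(len(toy_vocab))            # number of distinct tokens
print(toy_vocab.most_common(3))  # most frequent tokens first
```

`Counter.most_common` also replaces the separate sorting step, since it returns the tokens ordered from most to least frequent.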

Let's sort these by the frequency with which each word (i.e. token) appears and then look at the top 20 or so.

In [None]:
vocab_sorted = dict(sorted(vocab.items(), key=lambda item: item[1], reverse=True))
list(vocab_sorted.items())[:25]

Not very informative, is it? Words such as "_to_", "_the_", "_i_", etc. do not convey a lot of information in terms of sentiment. To prevent such uninformative words from influencing the task at hand, most NLP libraries provide an easy way to remove stop words.
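
The idea behind stop-word removal can be sketched with a simple set lookup. The `toy_stop_words` set below is a tiny, hand-picked assumption for illustration; real stop-word lists are much longer.

```python
# A tiny hand-picked stop-word set (an assumption for illustration only).
toy_stop_words = {'to', 'the', 'i', 'a', 'you', 'for', 'on', 'and'}

# Made-up word counts standing in for the vocab dictionary.
toy_vocab = {'to': 50, 'the': 45, 'delayed': 12, 'i': 40, 'thanks': 9}

# Keep only the tokens that are not stop words.
filtered_vocab = {tok: n for tok, n in toy_vocab.items() if tok not in toy_stop_words}

print(filtered_vocab)
```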

Before resorting to a Python module designed to work with text data, let's first try continuing with our simple manual approach. Let's see if we can see some difference in the word (i.e. token) frequencies when we separate the data frame into positive, neutral, and negative tweets.

In [None]:
df_pos = df[df['sentiment'] == 'positive']
df_neg = df[df['sentiment'] == 'negative']
df_neu = df[df['sentiment'] == 'neutral']

def create_vocab_list(tokens_column):
    vocab = dict()
    for tweet_tokens in tokens_column:
        for token in tweet_tokens:
            if token not in vocab:
                vocab[token] = 1
            else:
                vocab[token] += 1
    return vocab

vocab_pos = dict(sorted(create_vocab_list(df_pos['tokens']).items(), key=lambda item: item[1], reverse=True))
vocab_neg = dict(sorted(create_vocab_list(df_neg['tokens']).items(), key=lambda item: item[1], reverse=True))
vocab_neu = dict(sorted(create_vocab_list(df_neu['tokens']).items(), key=lambda item: item[1], reverse=True))

In [None]:
list(vocab_pos.items())[:20]

In [None]:
list(vocab_neg.items())[:20]

In [None]:
list(vocab_neu.items())[:20]

There are still a lot of stop words in the vocabularies for positive, negative, and neutral tweets. Let's try removing a token from each of these vocabularies if it is also among the most frequent tokens overall (the cutoff is set by `top_n_to_remove` in the next cell).

In [None]:
top_n_to_remove = 200 # 100-bad, #500-good, #1k-better
for i, item in enumerate(vocab_sorted.items()):
    if i == top_n_to_remove:
        break
    #print(f" removing token: {item[0]:15} (w/ freq = {item[1]:5}) from vocabs")
    if item[0] in vocab_pos:
        del vocab_pos[item[0]]
    if item[0] in vocab_neg:
        del vocab_neg[item[0]]
    if item[0] in vocab_neu:
        del vocab_neu[item[0]]


In [None]:
list(vocab_pos.items())[:25]

In [None]:
list(vocab_neg.items())[:25]

In [None]:
list(vocab_neu.items())[:25]

That looks a little better! Now, let's try classifying the tweets by taking each one and counting how many tokens it has from the top k tokens in the vocab_pos, vocab_neg, and vocab_neu sets. Whichever vocab it has the greatest number of tokens from, let's classify it as that.

To accomplish this let's first create a vocabulary for each of the possible label values. Note that below we are including all the tokens for each label but we could easily include just the top k positive tokens, top k negative, etc.

In [None]:
classifier_tokens = {"positive": list(vocab_pos.keys())[:], "negative": list(vocab_neg.keys())[:], "neutral": list(vocab_neu.keys())[:]}
classifier_tokens

As an example, let's see what happens when we try to classify one single tweet.

In [None]:
tweet2classify_i = 5557 #555
tweet2classify = df.iloc[tweet2classify_i,:]['tokens']
df.iloc[tweet2classify_i,:]

In [None]:
pos = 0
neg = 0
neu = 0
for tok in tweet2classify:
    if tok in classifier_tokens['positive']:
        pos += 1
    elif tok in classifier_tokens['negative']:
        neg += 1
    elif tok in classifier_tokens['neutral']:
        neu += 1

print(f"pos: {pos}   neg: {neg}   neu: {neu}")


Here is our simple model. Let's count how many of a tweet's tokens appear in each vocabulary. If there are more positive words than negative and neutral, then classify it as positive. If there are more neutral than positive or negative, classify it as neutral. The rest will be classified as negative. NOTE: We could certainly toy around with this more to determine what the default should be, and whether the threshold should be something other than simply being greater than.

In [None]:
def predict_tweet_sentiment(tweet_tokens):
    pos = 0
    neg = 0
    neu = 0
    for tok in tweet_tokens:
        if tok in classifier_tokens['positive']:
            pos += 1
        elif tok in classifier_tokens['negative']:
            neg += 1
        elif tok in classifier_tokens['neutral']:
            neu += 1
    if pos > neg and pos > neu:
        return "positive"
    elif neu > pos and neu > neg:
        return "neutral"
    else:
        return "negative"
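
As noted above, the default label and the decision rule are choices we could tune. The following is a hypothetical variant, not the function used in the rest of this notebook: `predict_with_margin`, `toy_classifier_tokens`, `margin`, and `default` are all names made up for this sketch. It counts tokens with the same if/elif chain but requires the winning count to lead the others by at least `margin`, and it stores each vocabulary as a set so membership tests are fast.

```python
# Made-up vocabularies for illustration; sets give fast membership tests.
toy_classifier_tokens = {
    'positive': {'awesome', 'thanks', 'great'},
    'negative': {'delayed', 'lost', 'rude'},
    'neutral': {'gate', 'schedule'},
}

def predict_with_margin(tweet_tokens, margin=1, default='negative'):
    """Return a label only if its count leads both others by at least margin."""
    pos = neg = neu = 0
    for tok in tweet_tokens:
        # Same if/elif order as the function above.
        if tok in toy_classifier_tokens['positive']:
            pos += 1
        elif tok in toy_classifier_tokens['negative']:
            neg += 1
        elif tok in toy_classifier_tokens['neutral']:
            neu += 1
    if pos - max(neg, neu) >= margin:
        return 'positive'
    if neu - max(pos, neg) >= margin:
        return 'neutral'
    return default

print(predict_with_margin(['thanks', 'for', 'the', 'great', 'flight']))
print(predict_with_margin(['great', 'crew', 'but', 'delayed'], margin=2))
```

With `margin=1` and `default='negative'` the decision rule is the same as the one above; larger margins send more tweets to the default label.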

Make predictions for all of the tweets in our dataset/dataframe.

In [None]:
df['predicted_sentiment'] = df['tokens'].apply(predict_tweet_sentiment)

Let's look at a few of the predictions as compared to the actual (i.e. true) labels. 

In [None]:
df.head(10)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(df['sentiment'], df['predicted_sentiment'])

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix(df['sentiment'], df['predicted_sentiment']), display_labels=['negative', 'neutral', 'positive'])
disp.plot()

In [None]:
from sklearn.metrics import accuracy_score
mod_accuracy = accuracy_score(df['sentiment'], df['predicted_sentiment'])
print(f"our ad hoc model's accuracy is: {mod_accuracy*100:.2f}%")
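
To make clear what these metrics compute, here is a standard-library sketch on made-up labels (`y_true` and `y_pred` are illustrative, not the notebook's data). Accuracy is the fraction of predictions that match the true label, and each confusion-matrix cell counts one (true label, predicted label) pair.

```python
from collections import Counter

# Made-up true and predicted labels for illustration only.
y_true = ['negative', 'negative', 'neutral', 'positive', 'negative']
y_pred = ['negative', 'neutral', 'neutral', 'negative', 'negative']

# Accuracy: share of positions where the prediction equals the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion counts: how often each (true, predicted) pair occurs.
confusion = Counter(zip(y_true, y_pred))

labels = ['negative', 'neutral', 'positive']  # sorted, as scikit-learn orders them
for t in labels:
    # One row per true label, one column per predicted label.
    print(t, [confusion[(t, p)] for p in labels])
print(f"accuracy: {accuracy:.2f}")
```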

Although it's very quick and easy, there are some areas in which we can improve upon the approach we used above. 

1. __Better Tokenization:__ The tokenization is very crude right now. Simply splitting a tweet into tokens with whitespace as the delimiter means that positive tokens like _"awesome."_ and _"awesome!"_ are considered two different tokens. Cases such as that one are relatively easy to solve since it just means removing some punctuation. However, different forms of a word, such as _"delay"_ and _"delayed"_, can cause some issues too. For our ad hoc modeling approach above, resolving these tokenization issues doesn't seem to affect the outcome too much right now. But once we try to use a proper model (e.g. naive Bayes, neural network, etc.), these tokenization deficiencies can affect model performance even more. For example, having all of the different forms of the verb _"delay"_ will mean that our model needs to have that many more parameters in it.
2. __Vectorization:__ Even with better tokenization, we still need to further modify the data to be able to use other types of models. What we specifically need to do is convert the tokens into a numerical representation of some kind. When working with text data, this process of converting text/tokens into a numerical representation will allow us to use many different types of models, including neural networks.
3. __Modeling Process/Evaluation:__ The modeling process and assessment need to be improved. To start, our accuracy (the `mod_accuracy` value printed above) is not much better than if we simply labeled every tweet as negative (since 65% of all tweets are negative). So we need to be sure that we're comparing our model performance metrics to a suitable baseline. The larger issue, however, is that we don't have a validation and/or test set right now. We used all 10k observations to build our positive, negative, and neutral sets of words; then we checked our accuracy on this exact same set of tweets. To really understand how our model will work for a new tweet that we have never seen, and that is posted in the future, we need to hold out a portion of the dataset from the 'training', then evaluate our model against it.
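
The first two items can be sketched with the standard library alone. This is only an illustration under stated assumptions, not necessarily the approach later notebooks will take: the regular expression, the helper names `simple_tokenize` and `bag_of_words`, and the example tweets are all made up here. The tokenizer lowercases the text and keeps runs of letters, digits, and apostrophes, so _"awesome."_ and _"awesome!"_ become the same token; the vectorizer then turns each token list into a list of counts over a fixed vocabulary.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Made-up tweets for illustration only.
toy_tweets = ["Awesome. Thanks!", "Delayed again, awesome!"]
toy_token_lists = [simple_tokenize(t) for t in toy_tweets]

# Fixed, sorted vocabulary built from every token seen.
vocabulary = sorted({tok for toks in toy_token_lists for tok in toks})
vectors = [bag_of_words(toks, vocabulary) for toks in toy_token_lists]

print(vocabulary)
print(vectors)
```

This sketch does not merge different forms of a word such as _"delay"_ and _"delayed"_; that would need an additional normalization step.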

We'll come back to this dataset in one or more later notebooks and address these, but for now try to think about what impact they are having and how best we can resolve them.
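
As a starting point for the third item, a held-out test set can be sketched with the standard library. The helper name `holdout_split`, the 80/20 proportion, and the seed are assumptions chosen for illustration; the key point is that the word lists would be built from the training rows only and accuracy would be measured on the test rows.

```python
import random

def holdout_split(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(10)
print(len(train_idx), len(test_idx))
```

With a data frame, these index lists could be passed to `df.iloc[train_idx]` and `df.iloc[test_idx]` to obtain the two subsets.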

