# Data Preprocessing

We start with some data preprocessing for the Sentiment140 dataset from Kaggle.

In [1]:
import pandas as pd

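# The file has no header row, so the column names are supplied explicitly.
# It is typically distributed in Latin-1 (ISO-8859-1); reading it with the
# default UTF-8 codec commonly raises a UnicodeDecodeError.
df_raw = pd.read_csv('./training.1600000.processed.noemoticon.csv',
                     header=None,
                     encoding='latin-1',
                     names=['target', 'id', 'date', 'flag', 'user', 'text'])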
df_raw.head()

| | target | id | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


Per Kaggle, our fields are as follows:

1. target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
2. ids: the id of the tweet (e.g. 2087); this is the 'id' column in our DataFrame
3. date: the date of the tweet (e.g. Sat May 16 23:58:44 UTC 2009)
4. flag: the query (e.g. lyx); if there is no query, this value is NO_QUERY
5. user: the user who tweeted (e.g. robotickilldozr)
6. text: the text of the tweet (e.g. Lyx is cool)
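
Because the raw file has no header row, these field names have to be supplied through `names`. Here is a minimal sketch of how `header=None` and `names` work together, using a two-row in-memory sample instead of the real file. The first row is taken from the output above; the second reuses Kaggle's example values, with a `target` of 4 assumed purely for illustration.

```python
import io
import pandas as pd

# Two headerless rows standing in for the real file (illustrative only).
# Row 2 uses Kaggle's example values; its target of 4 is an assumption.
sample_csv = io.StringIO(
    '0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,'
    'my whole body feels itchy and like its on fire\n'
    '4,2087,Sat May 16 23:58:44 UTC 2009,lyx,robotickilldozr,Lyx is cool\n'
)

columns = ['target', 'id', 'date', 'flag', 'user', 'text']

# header=None tells pandas the first line is data, not column names;
# names= then labels the six columns in order.
sample = pd.read_csv(sample_csv, header=None, names=columns)

print(sample[['target', 'user', 'text']])
```

Without `header=None`, pandas would treat the first tweet as the header row and silently drop it from the data.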

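We only need 'target' and 'text' for sentiment analysis. We also recode the sentiment labels from 0/2/4 to 0/1/1, so that 0 means negative and 1 means positive. The mapping sends the neutral label 2 to 1, but this has no effect here because the file contains no neutral tweets. Finally, we draw a random sample of 10,000 tweets, with a fixed seed for reproducibility.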

In [2]:
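df = df_raw[['target', 'text']].copy()
# Binary labels: 0 = negative, 1 = positive (label 2 does not occur in this file)
df['sentiment'] = df['target'].map({0: 0, 2: 1, 4: 1})
df = df.drop('target', axis=1)
# Work with a reproducible random sample of 10,000 tweets
df = df.sample(10000, random_state=734).reset_index(drop=True)
df.head()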

| | text | sentiment |
|---|---|---|
| 0 | @samantharonson hii sam!!! I'm so sad for U &a... | 0 |
| 1 | @guyoseary bye bye ... tell madonna we want a... | 1 |
| 2 | Last night in Budapest and back to Scotland to... | 0 |
| 3 | I have not been this happy in a long time!! I ... | 1 |
| 4 | dont feel bad danny my picture wont up load ... | 0 |
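
Since `Series.map` turns any label missing from the dictionary into `NaN` without raising an error, it can be worth checking the encoding explicitly. The sketch below wraps the mapping in a hypothetical helper, `encode_sentiment` (not part of the original notebook), and runs it on a small made-up frame rather than the real data.

```python
import pandas as pd

LABEL_MAP = {0: 0, 2: 1, 4: 1}  # same mapping as in the cell above

def encode_sentiment(target: pd.Series) -> pd.Series:
    """Map raw polarity labels to 0/1 and fail loudly on unknown labels."""
    encoded = target.map(LABEL_MAP)
    unknown = target[encoded.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unexpected target labels: {sorted(unknown)}")
    return encoded.astype(int)

# Made-up rows for illustration; the real frame is df_raw.
toy = pd.DataFrame({
    'target': [0, 4, 4, 0],
    'text': ['so sad', 'so happy', 'love it', 'feels bad'],
})
toy['sentiment'] = encode_sentiment(toy['target'])

# Class balance after encoding
print(toy['sentiment'].value_counts().sort_index())
```

Running the same `value_counts()` on the sampled frame is a quick way to confirm that the 10,000-row sample is reasonably balanced between the two classes.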


For our simpler models (especially the logistic regression), we clean the text so that the model can focus on the most informative tokens. We also keep the original text for BERT, whose tokenizer is designed to handle more complicated text patterns, including punctuation, special characters, and URLs.

In [3]:
import re

def clean_text(text):
    """Lowercase a tweet and strip URLs, mentions, hashtags, punctuation, and digits."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)     # remove mentions
    text = re.sub(r"#\w+", "", text)     # remove hashtags, including the tag word
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation (underscores are kept, since \w matches "_")
    text = re.sub(r"\d+", "", text)      # remove numbers
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace and trim
    return text

df['clean_text'] = df['text'].apply(clean_text)
df.to_csv("sentiment140_cleaned.csv", index = False)

In [4]:
df.head()

| | text | sentiment | clean_text |
|---|---|---|---|
| 0 | @samantharonson hii sam!!! I'm so sad for U &a... | 0 | hii sam im so sad for u amp lindsay i hope eve... |
| 1 | @guyoseary bye bye ... tell madonna we want a... | 1 | bye bye tell madonna we want anything new love... |
| 2 | Last night in Budapest and back to Scotland to... | 0 | last night in budapest and back to scotland to... |
| 3 | I have not been this happy in a long time!! I ... | 1 | i have not been this happy in a long time i lo... |
| 4 | dont feel bad danny my picture wont up load ... | 0 | dont feel bad danny my picture wont up load ei... |
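
One side effect is visible in the first row: the cleaned text contains the stray token `amp`, which is what an HTML entity such as `&amp;` leaves behind once `&` and `;` are stripped as punctuation. If that matters for a model, one option is to decode entities before cleaning. The sketch below is a hypothetical variant, `clean_text_unescaped`, which assumes the standard-library `html.unescape` is an acceptable way to do this. It is not what the notebook above ran, and the sample tweet is made up.

```python
import html
import re

def clean_text(text):
    """Same steps as the notebook's clean_text, repeated so this runs alone."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"#\w+", "", text)
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", text).strip()

def clean_text_unescaped(text):
    """Hypothetical variant: decode HTML entities (e.g. &amp; -> &) first."""
    return clean_text(html.unescape(text))

sample = "@user Loving #python 3 &amp; pandas!! http://example.com/x1"
print(clean_text(sample))            # loving amp pandas
print(clean_text_unescaped(sample))  # loving pandas
```

Decoding first means the bare `&` is removed along with the rest of the punctuation, so no `amp` token reaches the vocabulary of a bag-of-words model.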
