In [81]:
import re
from string import punctuation

from nltk.tokenize.casual import TweetTokenizer
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook as tqdm_n

tqdm_n().pandas()


# INTRNLP MCO: Twitter Emoji Prediction
## Preprocessing
In this step, we preprocess the raw tweet text into tokens.

Note: Some of the "magic" cells contain Bash shell commands; these may not work on all platforms. However, all cells important in producing the final output have been written in Python for portability.

In [82]:
!ls data-raw

Mapping.csv  OutputFormat.csv  Test.csv  Train.csv


There are two files containing the tweets themselves: Test.csv and Train.csv. However, only Train.csv is labeled with the corresponding emoji, so we use Train.csv as the basis for the final preprocessed dataset.

In [83]:
tweets = pd.read_csv('data-raw/Train.csv').iloc[:, 1:].rename(columns={'TEXT':'text', 'Label':'emoji'})
tweets.head()

Unnamed: 0,text,emoji
0,Vacation wasted ! #vacation2017 #photobomb #ti...,0
1,"Oh Wynwood, you’re so funny! : @user #Wynwood ...",1
2,Been friends since 7th grade. Look at us now w...,2
3,This is what it looks like when someone loves ...,3
4,RT @user this white family was invited to a Bl...,3


In [84]:
tweets.isna().sum()

text     0
emoji    0
dtype: int64

Let's tokenize the text using NLTK's TweetTokenizer, which is designed for tweets: it keeps hashtags and emoticons intact as single tokens.

We set the option `preserve_case` to False, which gives us lowercase tokens.

We use the option `reduce_len`, which does the following:
> Replace[s] repeated character sequences of length 3 or greater with sequences of length 3.

For instance, "yes" and "yesss" remain distinct tokens, while "yessss" is counted as an instance of "yesss". This lets us capture emphasized words without keeping a separate token for every possible spelling. Since this task is closely related to sentiment analysis, we place great value on these variants, which may imply stronger emotions than their plainer counterparts.

We use the option `strip_handles`, which removes @username mentions.
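
To make the `reduce_len` behaviour concrete, here is a minimal standard-library sketch of the rule quoted above. The helper name `shorten_repeats` is hypothetical and is not part of NLTK's API; it only approximates the described effect and is not a substitute for the tokenizer itself.

```python
import re

# Hypothetical helper (not NLTK's API): collapse any run of three or more
# identical characters down to exactly three, as the quoted rule describes.
_REPEATS = re.compile(r"(.)\1{2,}")


def shorten_repeats(text: str) -> str:
    """Return text with runs of 3+ repeated characters cut to length 3."""
    return _REPEATS.sub(r"\1\1\1", text)


for word in ["yes", "yesss", "yessss", "sooooo"]:
    # Lowercasing mimics preserve_case=False.
    print(word, "->", shorten_repeats(word.lower()))
# yes -> yes
# yesss -> yesss
# yessss -> yesss
# sooooo -> sooo
```

Note that "yes" is left alone because it contains no run of three identical characters, which is why it stays distinct from "yesss".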

In [85]:
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)

In [86]:
tokens = tweets['text'].progress_apply(tokenizer.tokenize)\
    .apply(pd.Series).stack().reset_index().drop(['level_1'], axis=1)\
    .rename(columns={'level_0':'index', 0:'token'})
tokens.head()


Unnamed: 0,index,token
0,0,vacation
1,0,wasted
2,0,!
3,0,#vacation2017
4,0,#photobomb
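
The reshaping step above turns one token list per tweet into one row per token, keeping the tweet's row number in `index`. As an alternative sketch, `Series.explode` (available in pandas 0.25 and later) can produce the same long format without building a wide intermediate frame. The toy `token_lists` below is an assumption standing in for the tokenizer output; it is not data from this notebook.

```python
import pandas as pd

# Toy stand-in for tweets['text'].apply(tokenizer.tokenize):
# one list of tokens per tweet.
token_lists = pd.Series([
    ['vacation', 'wasted', '!'],
    [],                       # a tweet that produced no tokens
    ['rt', 'so', 'funny'],
])

tokens = (
    token_lists
    .explode()                # one row per token; the tweet index is repeated
    .dropna()                 # empty lists become NaN; drop them, as stack() would
    .rename('token')          # name the values column
    .rename_axis('index')     # name the tweet index before resetting it
    .reset_index()
)
print(tokens)
```

The explicit `dropna()` matters: `explode` keeps a NaN row for a tweet with no tokens, whereas the `stack()` approach silently drops it.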


Next, we extract the hashtags in each tweet and save them to a separate file.

In [90]:
hashtags = tokens[tokens['token'].str.match('^#.+')]
hashtags.to_csv('data-clean/hashtags.csv')
hashtags.head()

Unnamed: 0,index,token
3,0,#vacation2017
4,0,#photobomb
5,0,#tired
6,0,#vacationwasted
7,0,#mcgar30


Next, we remove the Twitter reserved word "RT" (retweet). Because the tokens were lowercased during tokenization, we match on "rt".

In [91]:
tokens = tokens[tokens['token']!='rt']
tokens.head()

Unnamed: 0,index,token
0,0,vacation
1,0,wasted
2,0,!
3,0,#vacation2017
4,0,#photobomb


Punctuation is retained, as it may carry valuable sentiment-related information.

Tweets and tokens are saved to file for later use.

In [94]:
tokens.to_csv('data-clean/tokens.csv', index=False)
tweets.to_csv('data-clean/tweets.csv', index=False)