# Part II: Kaggle Competition (30%)
Participate in the in-class [Kaggle Competition](https://www.kaggle.com/competitions/dm2022-isa5810-lab2-homework) on Emotion Recognition on Twitter. Scores are assigned according to your place in the Private Leaderboard ranking:
   - **Bottom 40%**: Get 20% of the 30% available for this section.
   - **Top 41% - 100%**: Get (60-x)/6 + 20 points, where x is your ranking in the leaderboard (i.e., if you rank 3rd, your score will be (60-3)/6 + 20 = 29.5% out of 30%).
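
The scoring rule can be sketched as a small function. This is only an illustration: the function name `section_score` is hypothetical, and treating 20 as a floor for every rank past 60 is an assumption, since the rubric gives the bottom tier as a percentage and does not state the number of participants.

```python
def section_score(rank: int) -> float:
    """Score (out of 30) for a given Private Leaderboard rank."""
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    # Formula from the rubric: (60 - x) / 6 + 20.
    # Assumption: ranks where the formula would fall below 20 get the flat 20.
    return max(20.0, (60 - rank) / 6 + 20)


print(section_score(3))  # 29.5, matching the worked example in the rubric
```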

Make your last submission **BEFORE the deadline (Nov. 22nd 11:59 pm, Tuesday)**. Make sure to take a screenshot of your position at the end of the competition, store it as `pic0.png` under the **img** folder of this repository, and rerun the cell **Student Information**.


## 1. Data Preparation

To make later loading faster, this notebook first preprocesses the raw data and saves the result as a `.pkl` file.
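
The preprocess-once, load-many-times idea can be sketched as a small caching helper. The name `load_or_build` and its parameters are hypothetical and not part of this notebook. The cells below simply run the preprocessing and call `to_pickle` at the end.

```python
from pathlib import Path

import pandas as pd


def load_or_build(cache_path, build_fn):
    """Return the cached DataFrame if it exists; otherwise build it and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Fast path: reuse the result of an earlier preprocessing run
        return pd.read_pickle(cache_path)
    df = build_fn()  # slow path: run the actual preprocessing
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(cache_path)
    return df
```

A later notebook could then call `load_or_build("part2_data/cleaned_tweets.pkl", some_preprocessing_function)` and only pay the preprocessing cost on the first run.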

In [1]:
# import libraries
import pandas as pd
import numpy as np
import nltk
%matplotlib inline

In [2]:
# load the data
df_identification = pd.read_csv('part2_data/data_identification.csv')
df_tweets = pd.read_json('part2_data/tweets_DM.json', lines = True)
df_emotion = pd.read_csv('part2_data/emotion.csv')
# df_tweets = pd.read_pickle("part2_data/cleaned_tweets.pkl")

In [3]:
# check the shape of data
print("identification shape:", df_identification.shape)
print("tweets shape:", df_tweets.shape)
print("emotion shape:", df_emotion.shape)

identification shape: (1867535, 2)
tweets shape: (1867535, 5)
emotion shape: (1455563, 2)


## 1.1 Clean the tweets data
The `_source` column holds the main information we need, so let's extract it into separate columns.

In [4]:
# helpers that pull individual fields out of a `_source` record
def get_hashtags(t):
    # return a copy of the hashtag list
    return list(t['tweet']['hashtags'])

def get_id(t):
    return t['tweet']['tweet_id']

def get_text(t):
    return t['tweet']['text']

In [5]:
# get the hashtags, tweet_id, and text from `_source`
df_tweets['hashtags'] = df_tweets['_source'].apply(get_hashtags)
df_tweets['tweet_id'] = df_tweets['_source'].apply(get_id)
df_tweets['text'] = df_tweets['_source'].apply(get_text)
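
As an alternative to the three `apply` calls, the nested records could be flattened in one step with `pd.json_normalize`. The sketch below runs on two made-up records shaped like the `_source` entries. It is not what this notebook does, and the variable names `records`, `toy`, and `flat` are illustrative.

```python
import pandas as pd

# Two hypothetical records shaped like the `_source` column
records = [
    {"tweet": {"hashtags": ["Snapchat"], "tweet_id": "0x376b20", "text": "example one"}},
    {"tweet": {"hashtags": [], "tweet_id": "0x1cd5b0", "text": "example two"}},
]
toy = pd.DataFrame({"_source": records})

# Nested dict keys become dotted column names such as 'tweet.text'
flat = pd.json_normalize(toy["_source"].tolist())
# Keep only the last part of each dotted name: 'tweet.text' -> 'text'
flat.columns = [c.split(".")[-1] for c in flat.columns]
flat = flat[["tweet_id", "text", "hashtags"]]
print(flat)
```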

After extracting the information from `_source`, we can drop the columns we no longer need. Using `groupby` to inspect the columns shows that `_type` and `_index` each hold a single value across all rows, so dropping them gives a cleaner dataframe.

In [6]:
# check which distinct values `_type` and `_index` take
print(df_tweets.groupby(['_type']).count()['_source'])
print(df_tweets.groupby(['_index']).count()['_source'])

_type
tweets    1867535
Name: _source, dtype: int64
_index
hashtag_tweets    1867535
Name: _source, dtype: int64
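
The same check can be written as a reusable helper that reports which columns hold a single distinct value. This is a sketch on toy data. The helper name `constant_columns` is hypothetical and the `_score` values are made up.

```python
import pandas as pd


def constant_columns(frame, candidates):
    """Return the candidate columns that contain exactly one distinct value."""
    return [c for c in candidates if frame[c].nunique(dropna=False) == 1]


toy = pd.DataFrame({
    "_type": ["tweets", "tweets", "tweets"],
    "_index": ["hashtag_tweets"] * 3,
    "_score": [391, 433, 232],  # made-up values for illustration
})
print(constant_columns(toy, ["_type", "_index", "_score"]))  # ['_type', '_index']
```

The candidate list is passed explicitly because `nunique` cannot hash dict-valued columns such as `_source`.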


In [7]:
df_tweets = df_tweets.drop(columns=['_score', '_crawldate', '_type', '_index', '_source'])
df_tweets = df_tweets.reindex(columns=['tweet_id','text','hashtags'])
df_tweets

Unnamed: 0,tweet_id,text,hashtags
0,0x376b20,"People who post ""add me on #Snapchat"" must be ...",[Snapchat]
1,0x2d5350,"@brianklaas As we see, Trump is dangerous to #...","[freepress, TrumpLegacy, CNN]"
2,0x28b412,"Confident of your obedience, I write to you, k...",[bibleverse]
3,0x1cd5b0,Now ISSA is stalking Tasha 😂😂😂 <LH>,[]
4,0x2de201,"""Trust is not the same as faith. A friend is s...",[]
...,...,...,...
1867530,0x316b80,When you buy the last 2 tickets remaining for ...,"[mixedfeeling, butimTHATperson]"
1867531,0x29d0cb,I swear all this hard work gone pay off one da...,[]
1867532,0x2a6a4f,@Parcel2Go no card left when I wasn't in so I ...,[]
1867533,0x24faed,"Ah, corporate life, where you can date <LH> us...",[]


## 1.2 Concatenate the data
To put the text, label, and train/test split side by side, I merge the three tables on `tweet_id`. Both merges are left joins, so test tweets keep a missing `emotion`.

In [8]:
df = df_tweets.merge(df_emotion, on = 'tweet_id', how='left')
df = df.merge(df_identification, on = 'tweet_id', how='left')
df

Unnamed: 0,tweet_id,text,hashtags,emotion,identification
0,0x376b20,"People who post ""add me on #Snapchat"" must be ...",[Snapchat],anticipation,train
1,0x2d5350,"@brianklaas As we see, Trump is dangerous to #...","[freepress, TrumpLegacy, CNN]",sadness,train
2,0x28b412,"Confident of your obedience, I write to you, k...",[bibleverse],,test
3,0x1cd5b0,Now ISSA is stalking Tasha 😂😂😂 <LH>,[],fear,train
4,0x2de201,"""Trust is not the same as faith. A friend is s...",[],,test
...,...,...,...,...,...
1867530,0x316b80,When you buy the last 2 tickets remaining for ...,"[mixedfeeling, butimTHATperson]",,test
1867531,0x29d0cb,I swear all this hard work gone pay off one da...,[],,test
1867532,0x2a6a4f,@Parcel2Go no card left when I wasn't in so I ...,[],,test
1867533,0x24faed,"Ah, corporate life, where you can date <LH> us...",[],joy,train
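
A left join like this can be guarded with the `validate` argument of `merge` and a quick sanity check afterwards. The sketch below uses three toy tables with made-up ids. It assumes each `tweet_id` appears at most once per table, which the notebook does not verify explicitly.

```python
import pandas as pd

tweets = pd.DataFrame({"tweet_id": ["a", "b", "c"], "text": ["t1", "t2", "t3"]})
emotion = pd.DataFrame({"tweet_id": ["a", "c"], "emotion": ["joy", "fear"]})
ident = pd.DataFrame({"tweet_id": ["a", "b", "c"],
                      "identification": ["train", "test", "train"]})

# validate="one_to_one" raises a MergeError if a key is duplicated on either side
merged = (
    tweets
    .merge(emotion, on="tweet_id", how="left", validate="one_to_one")
    .merge(ident, on="tweet_id", how="left", validate="one_to_one")
)

# Sanity checks: every train row should have a label, no test row should
unlabeled_train = merged[(merged["identification"] == "train") & merged["emotion"].isna()]
labeled_test = merged[(merged["identification"] == "test") & merged["emotion"].notna()]
print(len(merged), len(unlabeled_train), len(labeled_test))  # 3 0 0
```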


In [9]:
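def tokenize_text(text, remove_stopwords=False):
    """
    Tokenize text into words using the nltk library.
    Requires NLTK's Punkt tokenizer data to be downloaded beforehand.
    Note: `remove_stopwords` is currently unused.
    """
    tokens = []
    # split into sentences first, then split each sentence into words
    for sentence in nltk.sent_tokenize(text, language='english'):
        for word in nltk.word_tokenize(sentence, language='english'):
            # token-level filters could be added here
            tokens.append(word)
    return tokens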
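
As a possible alternative (not used in this notebook), NLTK also ships a tweet-aware tokenizer, `nltk.tokenize.TweetTokenizer`, which keeps hashtags and @-mentions as single tokens instead of splitting off `#` and `@`. A minimal sketch follows. The wrapper name `tweet_tokens` is hypothetical.

```python
from nltk.tokenize import TweetTokenizer

# Assumption: default settings (case preserved, handles kept)
_tweet_tokenizer = TweetTokenizer()


def tweet_tokens(text):
    """Tokenize a tweet while keeping #hashtags and @mentions intact."""
    return _tweet_tokenizer.tokenize(text)


tokens = tweet_tokens("@brianklaas As we see, Trump is dangerous to #freepress")
print(tokens)
```

Whether this helps depends on the downstream features: the approach in this notebook separates `#` from the tag word, which lets the tag word match ordinary vocabulary.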

In case we need stopword-free text later, I also add a column containing the text with stopwords removed.

In [10]:
from nltk.corpus import stopwords
nltk.download('stopwords')
# use a set so that membership tests are fast on ~1.9M tweets
stopwords_eng = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vivian/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
# remove stopwords and punctuation
import string
df['unigrams'] = df['text'].apply(tokenize_text)
# note: `in string.punctuation` is a substring test, so single punctuation marks
# are dropped, but multi-character tokens such as `` or '' are kept
df['remove_stopwords'] = df['unigrams'].apply(
    lambda x: [item for item in x
               if item.lower() not in stopwords_eng
               and item.lower() not in string.punctuation])
# join the tokens into a single string, separated by ' '
df['remove_stopwords'] = [' '.join(map(str, l)) for l in df['remove_stopwords']]
df

Unnamed: 0,tweet_id,text,hashtags,emotion,identification,unigrams,remove_stopwords
0,0x376b20,"People who post ""add me on #Snapchat"" must be ...",[Snapchat],anticipation,train,"[People, who, post, ``, add, me, on, #, Snapch...",People post `` add Snapchat '' must dehydrated...
1,0x2d5350,"@brianklaas As we see, Trump is dangerous to #...","[freepress, TrumpLegacy, CNN]",sadness,train,"[@, brianklaas, As, we, see, ,, Trump, is, dan...",brianklaas see Trump dangerous freepress aroun...
2,0x28b412,"Confident of your obedience, I write to you, k...",[bibleverse],,test,"[Confident, of, your, obedience, ,, I, write, ...",Confident obedience write knowing even ask Phi...
3,0x1cd5b0,Now ISSA is stalking Tasha 😂😂😂 <LH>,[],fear,train,"[Now, ISSA, is, stalking, Tasha, 😂😂😂, <, LH, >]",ISSA stalking Tasha 😂😂😂 LH
4,0x2de201,"""Trust is not the same as faith. A friend is s...",[],,test,"[``, Trust, is, not, the, same, as, faith, ., ...",`` Trust faith friend someone trust Putting fa...
...,...,...,...,...,...,...,...
1867530,0x316b80,When you buy the last 2 tickets remaining for ...,"[mixedfeeling, butimTHATperson]",,test,"[When, you, buy, the, last, 2, tickets, remain...",buy last 2 tickets remaining show sell .. mixe...
1867531,0x29d0cb,I swear all this hard work gone pay off one da...,[],,test,"[I, swear, all, this, hard, work, gone, pay, o...",swear hard work gone pay one day😈💰💸 LH
1867532,0x2a6a4f,@Parcel2Go no card left when I wasn't in so I ...,[],,test,"[@, Parcel2Go, no, card, left, when, I, was, n...",Parcel2Go card left n't idea get parcel LH
1867533,0x24faed,"Ah, corporate life, where you can date <LH> us...",[],joy,train,"[Ah, ,, corporate, life, ,, where, you, can, d...",Ah corporate life date LH using relative anach...
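
The `remove_stopwords` column above still contains tokens such as `` and '' because the filter in In [11] tests whether a token is a substring of `string.punctuation`, which only matches single marks and a few adjacent pairs. A stricter variant, sketched below on a toy token list, drops every token made up entirely of punctuation characters. The function name and the tiny stopword set are hypothetical, and this is not the filter the notebook applies.

```python
import string

PUNCT = set(string.punctuation)


def drop_stopwords_and_punct(tokens, stopword_set):
    """Keep tokens that are neither stopwords nor made up only of punctuation."""
    kept = []
    for tok in tokens:
        if tok.lower() in stopword_set:
            continue
        # Drops '#', '``', "''", '..' and also empty strings
        if all(ch in PUNCT for ch in tok):
            continue
        kept.append(tok)
    return kept


demo_stopwords = {"who", "me", "on"}  # tiny stand-in for the NLTK list
tokens = ["People", "who", "post", "``", "add", "me", "on", "#", "Snapchat", "''", ".."]
print(drop_stopwords_and_punct(tokens, demo_stopwords))
# ['People', 'post', 'add', 'Snapchat']
```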


In [12]:
# save to pickle file
df.to_pickle("part2_data/cleaned_tweets.pkl")
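
Once the pickle exists, a later notebook can reload it and split it by the `identification` column. The sketch below shows the split on a toy frame with the same column names. The names `train_df` and `test_df` are illustrative, and the real data would come from `pd.read_pickle("part2_data/cleaned_tweets.pkl")`.

```python
import pandas as pd

# Toy stand-in for the frame saved above
toy = pd.DataFrame({
    "tweet_id": ["a", "b", "c"],
    "emotion": ["joy", None, "fear"],
    "identification": ["train", "test", "train"],
})

# Labeled rows for training
train_df = toy[toy["identification"] == "train"].reset_index(drop=True)
# Unlabeled rows to predict; the empty emotion column is dropped
test_df = (
    toy[toy["identification"] == "test"]
    .drop(columns="emotion")
    .reset_index(drop=True)
)
print(train_df.shape, test_df.shape)  # (2, 3) (1, 2)
```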