# MDSS Mini-Datathon 2021

Link to the Mini-Datathon Kaggle competition: https://www.kaggle.com/c/us-election-twitter-mini-datathon-advanced

Submitted by Zac Kao

13 March 2021

## 1. Import libraries

In [1]:
import pandas as pd
import re
import ast
from nltk.corpus import stopwords
from nltk.tokenize.casual import TweetTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, sent_tokenize

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

pd.set_option('display.max_colwidth', None)

## 2. Parse training data

The training data is stored in three files: a CSV file, a JSON file, and an Excel workbook.

We concatenate the data from these files into a single pandas DataFrame.

In [2]:
csv_df = pd.read_csv("train_csv.csv")
json_df = pd.read_json("train_json.json")
xlsx_df = pd.read_excel("train_excel.xlsx")

In [3]:
print("Number of records in CSV:", csv_df.shape[0])
print("Number of records in JSON:", json_df.shape[0])
print("Number of records in Excel:", xlsx_df.shape[0])

Number of records in CSV: 15001
Number of records in JSON: 6400
Number of records in Excel: 4999


In [4]:
# Concatenate the data from the 3 sources to a single pandas DataFrame.
train_df = pd.concat([csv_df, json_df, xlsx_df])
print("Total number of records in training data:", train_df.shape[0])

Total number of records in training data: 26400


We inspect a random sample of the training data.

In [5]:
train_df.sample(10)

Unnamed: 0,tweet,label
2031,b'RT @map2271: Waverly Decorator Quilt Fabric: Chefs Cooking Baker - Available on #Etsy https://t.co/4eqmmxAdgg\n#shopsmall #HodgePodgePam #RO\xe2\x80\xa6',Covid
2015,b'LAUNCH! NOW! What are you waiting for???\n\xf0\x9f\x8c\x8a\xf0\x9f\x98\xb7#WednesdayWisdom #LGBTQ \xf0\x9f\x8f\xb3\xef\xb8\x8f\xe2\x80\x8d\xf0\x9f\x8c\x88 #NOH8 #Resist #FBR #Grassroots #tweetuk\xe2\x80\xa6 https://t.co/gCnrpwTJ4H',BLM
12579,#Trump is the ONLY one who can save us from #Twitter and #Facebook's censorship. Imagine if #Biden wins? He will owe it to Big Tech and will do everything is his power to give them control over our society and this country.,Biden
4431,"b'RT @LingWooLiu: Today @GoogleDoodles is honoring the man behind the mask: my great grandfather, Dr. Wu Lien-teh! An exciting spotlight on a\xe2\x80\xa6'",Covid
346,b'Better Together World! #PiggyBackPositivity #BLM #MaskUp #StaySafe #unlearn #SeeTheGood #BeTheGood\xe2\x80\xa6 https://t.co/vQ7qcox5cu',BLM
9071,"Hijo menor de #Trump, Barron, dio positivo a #COVID19👉👉👉https://t.co/lT4UybUfn8 https://t.co/mdLGU6LQdk",Trump
4735,"b'RT @msulibrary: Who: You!!\nWhat: Cookies, Conversations, and Candidates.\nWhen: March 17, 2021 11am-1pm\nWhere: In front of the MSU Library.\xe2\x80\xa6'",Covid
6536,"@RealJamesWoods New York Post leak is probably from #Trump. #Biden is one of the masterminds of Ukraine's slaughter of Maidan = #Clinton's takeover of #Ukraine, and the message is to be careful because he is trying to do the same in the United States in the wake of the US presidential election",Trump
150,b'RT @seiu_uhw: 300 healthcare workers at Victor Valley Global Medical Center just voted to join our union with a whopping 98% YES VOTE. Thes\xe2\x80\xa6',Covid
10942,"@TeamTrump @realDonaldTrump Democlicans = Republicrats + nothing really\n\n""They never do anything for us anyway, Nothing will fundamentally change”\n\nARE YOU BETTER OFF?"" SURPRISING POLL RESULTS!\n\n@TeamTrump #JoeBiden #DonaldTrump #NancyPelosi #MitchMcConnell #USABananaRepublic\n\nhttps://t.co/WmAJSuhyRj",Biden


In [6]:
train_df.describe()

Unnamed: 0,tweet,label
count,26399,26400
unique,25587,5
top,"b'RT @TalbertSwan: Taylor Enterline, 21, Arrested at a peaceful #BlackLivesMatter protest in PA - $1 Million Bond\n\nRiley Williams, 22 #Capito\xe2\x80\xa6'",Covid
freq,2,9000


There are 5 unique values for the `label` column.

In [7]:
print(train_df["label"].unique())

['BLM' 'Trump' 'Biden' 'Covid' 'Riots']


We observe that there are 26400 records but only 26399 `count` values in the `tweet` column, so one record has a missing tweet.

In [8]:
train_df[train_df["tweet"].isnull()]

Unnamed: 0,tweet,label
10175,,Biden


This record cannot be used for training, so we drop it.

In [9]:
train_df.dropna(how="any", inplace=True)
print("Total number of records in training data:", train_df.shape[0])

Total number of records in training data: 26399


## 3. Data wrangling

Some tweets are stored as the text of a Python bytes literal (for example, `b'...'`). We need to perform literal evaluation **only** on those tweets.

In [10]:
raw_tweet_regex = r"""(?x)
    # Starts with either b' or b"
    (?:b['"])
    # Any characters (greedy)
    (.*)
    # Ends with either the ' or " character
    (?:['"])
"""

raw_tweet_obj_regex = re.compile(raw_tweet_regex)

We define a function that checks whether a tweet is expressed as a bytes literal and, if so, decodes it to a proper string.

In [11]:
def wrangle_text(tweet):
    # match() anchors at the start, so only tweets that begin with b' or b"
    # are treated as bytes literals
    if raw_tweet_obj_regex.match(tweet):
        # evaluate the literal to a bytes object, then decode it to a string
        return ast.literal_eval(tweet).decode("utf-8")
    return tweet  # normal string

We test this with an example.

In [12]:
test_tweet = json_df.iloc[0]["tweet"]
print("Raw tweet:\n", test_tweet, sep="")
wrangled_tweet = wrangle_text(test_tweet)
print("\nWrangled tweet:\n", wrangled_tweet, sep="")

Raw tweet:
b"Reporting 58 new cases, 5 new hospitalizations, and 0 deaths. Today's #COVID19 figures for Medina County: 13,662 cu\xe2\x80\xa6 https://t.co/E19gzssGeT"

Wrangled tweet:
Reporting 58 new cases, 5 new hospitalizations, and 0 deaths. Today's #COVID19 figures for Medina County: 13,662 cu… https://t.co/E19gzssGeT
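
A tweet that merely begins with `b'` (for instance a made-up tweet starting with `b'day`) is not a valid bytes literal, and `ast.literal_eval` raises an exception on such input. The following is a minimal sketch of a more defensive variant, assuming it is acceptable to fall back to the raw text. `decode_bytes_literal` is a hypothetical helper name and is not part of the notebook.

```python
import ast


def decode_bytes_literal(text):
    """Hypothetical helper: decode text that looks like a bytes literal,
    otherwise return it unchanged."""
    # Cheap prefix check first: ordinary tweets are returned untouched.
    if not text.startswith(("b'", 'b"')):
        return text
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Looked like a bytes literal but is not valid Python: keep raw text.
        return text
    if isinstance(value, bytes):
        # errors="replace" is an assumption of this sketch; it avoids a crash
        # on byte sequences that are not valid UTF-8.
        return value.decode("utf-8", errors="replace")
    return text


samples = [
    "b'caf\\xc3\\xa9 #COVID19'",  # text of a valid bytes literal
    "b'day wishes to all",        # starts with b' but is ordinary text
    "club's night",               # ordinary text
]
decoded = [decode_bytes_literal(s) for s in samples]
print(decoded)
```

Only the first sample is decoded; the other two pass through unchanged instead of raising.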


Now we apply this function to the entire training dataset.

In [13]:
train_df["tweet_wrangled"] = train_df["tweet"].apply(wrangle_text)
wrangled_train_df = train_df.drop(["tweet"], axis=1)

In [14]:
wrangled_train_df.head(5)

Unnamed: 0,label,tweet_wrangled
0,BLM,Silencing #BLM : Priti Patel’s anti-protest law - @IanDunt on how a Government keen to 'tackle cancel culture' when… https://t.co/O7TvlOBkro
1,BLM,"@Trillian42_ @Johnbok5 @NadiaWhittomeMP ""'Silly little woke lefty'. Mostly Harmless. Rebel Scum. Gamer. Intersectio… https://t.co/jdhjYk75bM"
2,BLM,"RT @ErrolWebber: Tell me, would this be considered ""Racist?”\n\nDo people who support #BLM think this is OK? https://t.co/DuwAbLJghP"
3,BLM,@APPLE won't let Parler have an App but still keeps @TWitter Who allows all manner of provoking violence with vario… https://t.co/zFNX8KLFhF
4,BLM,@malika_andrews @wojespn Can we get #JLM trending. Maybe the NBA can put that on the b-ball court and have ALL jers… https://t.co/OE2PQk42FQ


We drop duplicate records (rows with the same label and the same wrangled tweet).

In [15]:
wrangled_train_df.drop_duplicates(inplace=True)
print(wrangled_train_df.shape[0])

26177


## 4. Data pre-processing

Now we pre-process the data so that it can be used to train the classification algorithms.

### 4.1 Tokenization using TweetTokenizer

We convert each tweet to lowercase and split it into tokens with NLTK's TweetTokenizer.

In [16]:
tokenizer = TweetTokenizer()
# tokenize and also convert text to lowercase
wrangled_train_df["tokens"] = wrangled_train_df["tweet_wrangled"].apply(lambda x: tokenizer.tokenize(x.lower())) 

### 4.2 Remove stopwords, URLs, percentages, punctuations, numbers

We remove stopwords, URLs, percentages, punctuation, and numbers because they carry little useful signal for training.

In [17]:
en_stopwords = stopwords.words("english") # We use the English stopwords from NLTK

# Regexes for URLs, percentages, punctuations, and numbers.
# Each pattern is applied with .match(), so it is tested against the start of a token.
url_regex = r"^http(s*):\/\/"
curr_perc_regex = r"(?:[A-Z]{1,3})?[\$£€¥]?(?:\d{1,3},)*\d{1,3}(?:\.\d+)?"
punctuations_regex = r"([^\w\s]{1})\B"
numbers_regex = r"[0-9]+((\.[0-9]+)*)"

url_regex_obj = re.compile(url_regex)
curr_perc_regex_obj = re.compile(curr_perc_regex)
punctuations_regex_obj = re.compile(punctuations_regex)
numbers_regex_obj = re.compile(numbers_regex)

In [18]:
# New column containing tokens after the removal
wrangled_train_df["processed_tokens"] = wrangled_train_df["tokens"].apply(
    lambda x: [token for token in x if ((token not in en_stopwords) 
                                        and not (url_regex_obj.match(token) 
                                                 or curr_perc_regex_obj.match(token) 
                                                 or punctuations_regex_obj.match(token)
                                                 or numbers_regex_obj.match(token)))
              ])
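
To make the behaviour of this filter concrete, here is a small standalone sketch that applies the same four patterns to a handful of tokens. The token list and the three-word stopword set are made up for illustration (the notebook uses NLTK's English stopword list). Because `match` only tests the start of a token, a token such as `2020election` is dropped along with pure numbers, while `#blm` and `@iandunt` survive because the `\B` in the punctuation pattern fails before a word character.

```python
import re

# Same patterns as in the cell above.
url_re = re.compile(r"^http(s*):\/\/")
curr_perc_re = re.compile(r"(?:[A-Z]{1,3})?[\$£€¥]?(?:\d{1,3},)*\d{1,3}(?:\.\d+)?")
punct_re = re.compile(r"([^\w\s]{1})\B")
num_re = re.compile(r"[0-9]+((\.[0-9]+)*)")

# Stand-in for the NLTK stopword list (assumption for this sketch).
stop_words = {"on", "how", "a"}

tokens = [
    "silencing", "#blm", ":", "on", "how", "a", "@iandunt",
    "…", "98", "13,662", "2020election", "https://t.co/o7tvlobkro",
]

kept = [
    token for token in tokens
    if token not in stop_words
    and not (url_re.match(token)
             or curr_perc_re.match(token)
             or punct_re.match(token)
             or num_re.match(token))
]
print(kept)
```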

In [19]:
wrangled_train_df.head(5)

Unnamed: 0,label,tweet_wrangled,tokens,processed_tokens
0,BLM,Silencing #BLM : Priti Patel’s anti-protest law - @IanDunt on how a Government keen to 'tackle cancel culture' when… https://t.co/O7TvlOBkro,"[silencing, #blm, :, priti, patel, ’, s, anti-protest, law, -, @iandunt, on, how, a, government, keen, to, ', tackle, cancel, culture, ', when, …, https://t.co/o7tvlobkro]","[silencing, #blm, priti, patel, anti-protest, law, @iandunt, government, keen, tackle, cancel, culture]"
1,BLM,"@Trillian42_ @Johnbok5 @NadiaWhittomeMP ""'Silly little woke lefty'. Mostly Harmless. Rebel Scum. Gamer. Intersectio… https://t.co/jdhjYk75bM","[@trillian42_, @johnbok5, @nadiawhittomemp, "", ', silly, little, woke, lefty, ', ., mostly, harmless, ., rebel, scum, ., gamer, ., intersectio, …, https://t.co/jdhjyk75bm]","[@trillian42_, @johnbok5, @nadiawhittomemp, silly, little, woke, lefty, mostly, harmless, rebel, scum, gamer, intersectio]"
2,BLM,"RT @ErrolWebber: Tell me, would this be considered ""Racist?”\n\nDo people who support #BLM think this is OK? https://t.co/DuwAbLJghP","[rt, @errolwebber, :, tell, me, ,, would, this, be, considered, "", racist, ?, ”, do, people, who, support, #blm, think, this, is, ok, ?, https://t.co/duwabljghp]","[rt, @errolwebber, tell, would, considered, racist, people, support, #blm, think, ok]"
3,BLM,@APPLE won't let Parler have an App but still keeps @TWitter Who allows all manner of provoking violence with vario… https://t.co/zFNX8KLFhF,"[@apple, won't, let, parler, have, an, app, but, still, keeps, @twitter, who, allows, all, manner, of, provoking, violence, with, vario, …, https://t.co/zfnx8klfhf]","[@apple, let, parler, app, still, keeps, @twitter, allows, manner, provoking, violence, vario]"
4,BLM,@malika_andrews @wojespn Can we get #JLM trending. Maybe the NBA can put that on the b-ball court and have ALL jers… https://t.co/OE2PQk42FQ,"[@malika_andrews, @wojespn, can, we, get, #jlm, trending, ., maybe, the, nba, can, put, that, on, the, b-ball, court, and, have, all, jers, …, https://t.co/oe2pqk42fq]","[@malika_andrews, @wojespn, get, #jlm, trending, maybe, nba, put, b-ball, court, jers]"


### 4.3 Stemming

We use the PorterStemmer to perform stemming on the tokens.

In [20]:
stemmer = PorterStemmer()
wrangled_train_df["processed_tokens"] = wrangled_train_df["processed_tokens"].apply(lambda x: [stemmer.stem(token) for token in x])

In [21]:
wrangled_train_df.head(5)

Unnamed: 0,label,tweet_wrangled,tokens,processed_tokens
0,BLM,Silencing #BLM : Priti Patel’s anti-protest law - @IanDunt on how a Government keen to 'tackle cancel culture' when… https://t.co/O7TvlOBkro,"[silencing, #blm, :, priti, patel, ’, s, anti-protest, law, -, @iandunt, on, how, a, government, keen, to, ', tackle, cancel, culture, ', when, …, https://t.co/o7tvlobkro]","[silenc, #blm, priti, patel, anti-protest, law, @iandunt, govern, keen, tackl, cancel, cultur]"
1,BLM,"@Trillian42_ @Johnbok5 @NadiaWhittomeMP ""'Silly little woke lefty'. Mostly Harmless. Rebel Scum. Gamer. Intersectio… https://t.co/jdhjYk75bM","[@trillian42_, @johnbok5, @nadiawhittomemp, "", ', silly, little, woke, lefty, ', ., mostly, harmless, ., rebel, scum, ., gamer, ., intersectio, …, https://t.co/jdhjyk75bm]","[@trillian42_, @johnbok5, @nadiawhittomemp, silli, littl, woke, lefti, mostli, harmless, rebel, scum, gamer, intersectio]"
2,BLM,"RT @ErrolWebber: Tell me, would this be considered ""Racist?”\n\nDo people who support #BLM think this is OK? https://t.co/DuwAbLJghP","[rt, @errolwebber, :, tell, me, ,, would, this, be, considered, "", racist, ?, ”, do, people, who, support, #blm, think, this, is, ok, ?, https://t.co/duwabljghp]","[rt, @errolwebb, tell, would, consid, racist, peopl, support, #blm, think, ok]"
3,BLM,@APPLE won't let Parler have an App but still keeps @TWitter Who allows all manner of provoking violence with vario… https://t.co/zFNX8KLFhF,"[@apple, won't, let, parler, have, an, app, but, still, keeps, @twitter, who, allows, all, manner, of, provoking, violence, with, vario, …, https://t.co/zfnx8klfhf]","[@appl, let, parler, app, still, keep, @twitter, allow, manner, provok, violenc, vario]"
4,BLM,@malika_andrews @wojespn Can we get #JLM trending. Maybe the NBA can put that on the b-ball court and have ALL jers… https://t.co/OE2PQk42FQ,"[@malika_andrews, @wojespn, can, we, get, #jlm, trending, ., maybe, the, nba, can, put, that, on, the, b-ball, court, and, have, all, jers, …, https://t.co/oe2pqk42fq]","[@malika_andrew, @wojespn, get, #jlm, trend, mayb, nba, put, b-ball, court, jer]"


### 4.4 Corpus

Now we join the processed tokens of each tweet back into a single string, which we call the corpus of that tweet.

In [22]:
# join each token with a space character to form a corpus
wrangled_train_df["corpus"] = wrangled_train_df["processed_tokens"].apply(lambda x: " ".join(x))

In [23]:
wrangled_train_df.head(5)

Unnamed: 0,label,tweet_wrangled,tokens,processed_tokens,corpus
0,BLM,Silencing #BLM : Priti Patel’s anti-protest law - @IanDunt on how a Government keen to 'tackle cancel culture' when… https://t.co/O7TvlOBkro,"[silencing, #blm, :, priti, patel, ’, s, anti-protest, law, -, @iandunt, on, how, a, government, keen, to, ', tackle, cancel, culture, ', when, …, https://t.co/o7tvlobkro]","[silenc, #blm, priti, patel, anti-protest, law, @iandunt, govern, keen, tackl, cancel, cultur]",silenc #blm priti patel anti-protest law @iandunt govern keen tackl cancel cultur
1,BLM,"@Trillian42_ @Johnbok5 @NadiaWhittomeMP ""'Silly little woke lefty'. Mostly Harmless. Rebel Scum. Gamer. Intersectio… https://t.co/jdhjYk75bM","[@trillian42_, @johnbok5, @nadiawhittomemp, "", ', silly, little, woke, lefty, ', ., mostly, harmless, ., rebel, scum, ., gamer, ., intersectio, …, https://t.co/jdhjyk75bm]","[@trillian42_, @johnbok5, @nadiawhittomemp, silli, littl, woke, lefti, mostli, harmless, rebel, scum, gamer, intersectio]",@trillian42_ @johnbok5 @nadiawhittomemp silli littl woke lefti mostli harmless rebel scum gamer intersectio
2,BLM,"RT @ErrolWebber: Tell me, would this be considered ""Racist?”\n\nDo people who support #BLM think this is OK? https://t.co/DuwAbLJghP","[rt, @errolwebber, :, tell, me, ,, would, this, be, considered, "", racist, ?, ”, do, people, who, support, #blm, think, this, is, ok, ?, https://t.co/duwabljghp]","[rt, @errolwebb, tell, would, consid, racist, peopl, support, #blm, think, ok]",rt @errolwebb tell would consid racist peopl support #blm think ok
3,BLM,@APPLE won't let Parler have an App but still keeps @TWitter Who allows all manner of provoking violence with vario… https://t.co/zFNX8KLFhF,"[@apple, won't, let, parler, have, an, app, but, still, keeps, @twitter, who, allows, all, manner, of, provoking, violence, with, vario, …, https://t.co/zfnx8klfhf]","[@appl, let, parler, app, still, keep, @twitter, allow, manner, provok, violenc, vario]",@appl let parler app still keep @twitter allow manner provok violenc vario
4,BLM,@malika_andrews @wojespn Can we get #JLM trending. Maybe the NBA can put that on the b-ball court and have ALL jers… https://t.co/OE2PQk42FQ,"[@malika_andrews, @wojespn, can, we, get, #jlm, trending, ., maybe, the, nba, can, put, that, on, the, b-ball, court, and, have, all, jers, …, https://t.co/oe2pqk42fq]","[@malika_andrew, @wojespn, get, #jlm, trend, mayb, nba, put, b-ball, court, jer]",@malika_andrew @wojespn get #jlm trend mayb nba put b-ball court jer


## 5. Develop pipelines for training

We use the TfidfVectorizer (which uses the tf-idf measure) for vectorization and the following algorithms to train models:
- SVC (support vector machine with a kernel; scikit-learn's default is the RBF kernel)
- LinearSVC (support vector machine with a linear kernel)
- SGDClassifier (stochastic gradient descent)
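
As a rough illustration of the tf-idf measure, the sketch below re-implements the weighting in plain Python. It assumes scikit-learn's default settings (raw term counts, smoothed idf of the form `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each document vector). It is not the library's code, and the three tiny documents are made up.

```python
import math
from collections import Counter

# Made-up, already pre-processed documents (assumption for this sketch).
docs = ["biden trump debat", "trump ralli", "covid mask covid"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for doc in tokenized for term in set(doc))


def tfidf(doc_tokens):
    """Return an L2-normalised {term: weight} mapping for one document."""
    counts = Counter(doc_tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vectors = [tfidf(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in vec.items()})
```

In the first document, `trump` receives a lower weight than `biden` because it also appears in the second document, which is the intended effect of the idf factor.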

In [24]:
# vectorizer instance
tfidf_vectorizer = TfidfVectorizer()

# ML classification algorithm instances
svmClassifier = SVC(max_iter=1000)
lsvmClassifier = LinearSVC(max_iter=1000)
mlpClassifier = MLPClassifier(max_iter=500)
sgdClassifier = SGDClassifier(max_iter=1000)

# Create pipelines
pipeline = Pipeline([("tfidf", tfidf_vectorizer),
                    ("svm", svmClassifier)])
pipeline2 = Pipeline([("tfidf", tfidf_vectorizer),
                    ("lsvm", lsvmClassifier)])
# pipeline3 = Pipeline([("tfidf", tfidf_vectorizer),
#                     ("mlp", mlpClassifier)])
pipeline4 = Pipeline([("tfidf", tfidf_vectorizer),
                    ("sgd", sgdClassifier)])
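
One detail worth keeping in mind: a `TfidfVectorizer()` created with default arguments tokenizes each corpus string again with its own pattern, `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters. The sketch below applies that pattern directly with `re` (not through scikit-learn) to part of the first corpus string from step 4.4, assuming the vectorizer's documented default `token_pattern` is in effect.

```python
import re

# Default token pattern documented for scikit-learn's text vectorizers.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

corpus_line = "silenc #blm priti patel anti-protest law @iandunt"
vectorizer_tokens = re.findall(DEFAULT_TOKEN_PATTERN, corpus_line)
print(vectorizer_tokens)
```

Under this assumption, the `#` and `@` prefixes are stripped and `anti-protest` is split into two terms, so hashtags and mentions share features with the plain words. If that is not desired, the vectorizer's `token_pattern` or `tokenizer` arguments could be set explicitly.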

In [25]:
def trainPipelineAndPredict(pipeline, train_features, train_label, test_features, test_label):
    # fit training data to pipeline
    pipeline.fit(train_features, train_label)
    
    # Predict on training data and testing data for evaluation
    predictions_train = pipeline.predict(train_features)
    predictions_test = pipeline.predict(test_features)
    
    # Evaluate accuracy values
    training_accuracy = accuracy_score(train_label, predictions_train)
    testing_accuracy = accuracy_score(test_label, predictions_test)
    
    return (training_accuracy, testing_accuracy)
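
Accuracy here is simply the fraction of predictions that equal the true label. A minimal plain-Python sketch of that calculation, using made-up labels (the `accuracy` function below is illustrative and is not scikit-learn's implementation):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(truth == pred for truth, pred in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: four of the five predictions are correct.
y_true = ["BLM", "Covid", "Trump", "Biden", "Riots"]
y_pred = ["BLM", "Covid", "Biden", "Biden", "Riots"]
score = accuracy(y_true, y_pred)
print(score)
```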

In [26]:
X_train, X_test, y_train, y_test = train_test_split(wrangled_train_df["corpus"], wrangled_train_df["label"], test_size=0.2) 
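
The call above does not pass a `random_state`, so each run produces a different 80/20 split and slightly different accuracy values. The following plain-Python sketch shows the idea of a seeded hold-out split. `holdout_split` is a hypothetical helper, and rounding the test count up is an assumption of this sketch rather than a statement about scikit-learn's internals.

```python
import math
import random


def holdout_split(items, test_size=0.2, seed=0):
    """Shuffle indices with a fixed seed and split them into train and test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(items))  # assumption: round the test count up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


# 26177 is the number of records left after dropping duplicates.
records = list(range(26177))
train_part, test_part = holdout_split(records, test_size=0.2, seed=0)
print(len(train_part), len(test_part))
```

With `train_test_split` itself, passing an integer `random_state` has the same purpose of making the split repeatable.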

We fit each pipeline on the training split and report its (training accuracy, testing accuracy) pair.

In [27]:
trainPipelineAndPredict(pipeline, X_train, y_train, X_test, y_test)



(0.9783200420228261, 0.8869365928189458)

In [28]:
trainPipelineAndPredict(pipeline2, X_train, y_train, X_test, y_test)

(0.9776992502745809, 0.8890374331550802)

In [29]:
trainPipelineAndPredict(pipeline4, X_train, y_train, X_test, y_test)

(0.9514827372140776, 0.8926661573720397)
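
The three (training, testing) accuracy pairs printed above can be compared side by side. The sketch below only does arithmetic on those reported numbers; the dictionary keys are labels chosen for this example, and the values come from one particular random split.

```python
# (training accuracy, testing accuracy) as printed by cells 27 to 29.
results = {
    "SVC": (0.9783200420228261, 0.8869365928189458),
    "LinearSVC": (0.9776992502745809, 0.8890374331550802),
    "SGDClassifier": (0.9514827372140776, 0.8926661573720397),
}

# Gap between training and testing accuracy, a rough indicator of overfitting.
gaps = {name: train - test for name, (train, test) in results.items()}
best_on_test = max(results, key=lambda name: results[name][1])
smallest_gap = min(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name}: test={results[name][1]:.4f}, gap={gap:.4f}")
print("Best testing accuracy:", best_on_test)
```

On this split the SGD pipeline has both the highest testing accuracy and the smallest gap, although the differences between the three models are under one percentage point.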

## 6. Repeat steps 2 to 5 on testing data

### 6.1 Parse testing data

Similar to [step 2](#2.-Parse-training-data).

In [30]:
test_df = pd.read_csv("test_data_advanced.csv")
test_df.sample(10)

Unnamed: 0,Train_id,tweet
3637,3638,"b'RT @Consumers_Kenya: Kenya records 337 new #COVID19 cases from 2,924 samples, total cases now 109,164. 53 patients have recovered, 32 from\xe2\x80\xa6'"
3433,3434,b'Flu cases have markedly decreased this season. It is not a coincidence. Common sense COVID precautions must be part\xe2\x80\xa6 https://t.co/vg9cMEWCEq'
2261,2262,#Trump sul palco è tornato a ruggire. #Biden invece non sa dove andare https://t.co/Blpvn8pjww
5124,5125,"b'RT @JohnPersinos1: In the End, it Was Trump Who Threatened America with Terrorism https://t.co/8vFt3ejgU3 #GOPComplicitTraitors #CapitolRio\xe2\x80\xa6'"
3356,3357,b'#Leralla\nTrain to Elandsfontein departed from Leralla Station \nFirst stopping station is Limindlela Station\xe2\x80\xa6 https://t.co/47LyauHhUR'
3137,3138,b'I can assure you that Spring Break Super Spreaders is not what you think it is. #covidspringbreak #WearAMask #covidisreal'
5804,5805,b'Blatant corruption &amp; a betrayal to the people of #Florida. Anything goes with #GOP! #Sedition #CapitolRiots Using e\xe2\x80\xa6 https://t.co/73mB9I50tw'
3876,3877,"b'RT @CTZebra: Becky ""Darlene"" Myhand, 55yo LPN, Care Center of Aberdeen, MS, died of #covid19 7/12. \nShe enjoyed British TV, reading and se\xe2\x80\xa6'"
3513,3514,b'To see how our masks are manufactured check out the video on our website here: \nhttps://t.co/QeKADogNPQ\xe2\x80\xa6 https://t.co/NVa4ftA762'
2806,2807,#Facebook et #Twitter accusés d’avoir bloqué un article controversé sur #JoeBiden #Election2020 https://t.co/05aA5uzmF8


### 6.2 Data wrangling on testing data

Similar to [step 3](#3.-Data-wrangling).

In [31]:
test_df["tweet_wrangled"] = test_df["tweet"].apply(wrangle_text)
wrangled_test_df = test_df.drop(["tweet"], axis=1)
wrangled_test_df.sample(10)

Unnamed: 0,Train_id,tweet_wrangled
2605,2606,#ElectionInterference \n#HunterBiden \n#JoeBiden \n#NewYorkPost https://t.co/MNMCMJwYeG
5777,5778,Trump And The GE Parses The Capital Riots https://t.co/mK6yufVgoM #CapitolRiots #insurrection #TrumpAndTheGE https://t.co/Mnf5unJS1D
1009,1010,YouGov's Social Change Monitor finds Black women are especially likely to have a high opinion of the… https://t.co/Wa1VjIWQRZ
5652,5653,How about republicans supported #CapitolRiots #ading #insurrectionists is it a criminalizing ? https://t.co/yPNwIXEIi9
3012,3013,"#liars #HunterBiden #JoeBiden #CrookedJoeBiden #dirtbags #DemocratsAreDestroyingAmerica Joe Biden did not push out a Ukrainian prosecutor for investigating his son, The Washington Post confirms https://t.co/dZgdqG3ZbM"
3328,3329,"RT @CTZebra: Darrell Robinson, 64yo Mental Health Counselor at Lake County Jail, Crown Point, Indiana. He was also a Bishop who strove to “…"
139,140,RT @KNEEMAGGIO: I'm a black enby that is struggling to survive 😎 my birthday is so soon and i just need to eat and get to and from work and…
1778,1779,Why is a social media site CENSORING this news #twitter simply appalling leftist biased censorship #Biden #Trump https://t.co/RodgwPo462
821,822,"RT @sabby_carter: Are you looking for a Graphic Designer? HMU or DM me, I can make any type of design as per your requirement at a fair pri…"
1909,1910,"Tak hanya Donald Trump dan sang istri yang terjangkit virus corona, anaknya yang bernama Barron Trump juga positif Covid-19. #KAMUHARUSTAU #BarronTrump #DonaldTrump #VirusCorona\nhttps://t.co/sTnZLJdPHu"


### 6.3 Pre-processing on testing data
Similar to [step 4](#4.-Data-pre-processing).

#### 6.3.1 Tokenization
Similar to [step 4.1](#4.1-Tokenization-using-TweetTokenizer)

In [32]:
# Tokenization
wrangled_test_df["tokens"] = wrangled_test_df["tweet_wrangled"].apply(lambda x: tokenizer.tokenize(x.lower()))

#### 6.3.2 Remove stopwords, URLs, percentages, punctuations, numbers

Similar to [step 4.2](#4.2-Remove-stopwords,-URLs,-percentages,-punctuations,-numbers)

In [33]:
# Remove stopwords, URLs, numbers, percentages, punctuations
# (same filter as step 4.2, so training and testing tokens are processed identically)
wrangled_test_df["processed_tokens"] = wrangled_test_df["tokens"].apply(
    lambda x: [token for token in x if ((token not in en_stopwords) 
                                        and not (url_regex_obj.match(token) 
                                                 or curr_perc_regex_obj.match(token) 
                                                 or punctuations_regex_obj.match(token)
                                                 or numbers_regex_obj.match(token)))
              ])

#### 6.3.3 Stemming
Similar to [step 4.3](#4.3-Stemming).

In [34]:
# Apply stemming to each token
wrangled_test_df["processed_tokens"] = wrangled_test_df["processed_tokens"].apply(lambda x: [stemmer.stem(token) for token in x])

#### 6.3.4 Corpus
Similar to [step 4.4](#4.4-Corpus).

In [35]:
# Corpus of each tweet after stemming (joined by a single space)
wrangled_test_df["corpus"] = wrangled_test_df["processed_tokens"].apply(lambda x: " ".join(x))

### 6.4 Predict testing labels
We define a helper that fits a pipeline on the training data, prints the training accuracy for comparison, and returns the predicted labels for the testing data.

In [36]:
def trainPipelineAndPredictTest(pipeline, train_features, train_label, test_features):
    pipeline.fit(train_features, train_label) # Using training data to fit model
    predictions_test = pipeline.predict(test_features) # Predict using testing data
    
    # For comparison purposes
    predictions_train = pipeline.predict(train_features)
    training_accuracy = accuracy_score(train_label, predictions_train)
    print(training_accuracy)
    
    # Return the predicted labels
    return predictions_test

Now we train each classification model on the entire training set and predict labels for the testing data.

In [37]:
predictions1 = trainPipelineAndPredictTest(pipeline, wrangled_train_df["corpus"], wrangled_train_df["label"], wrangled_test_df["corpus"])



0.974634220880926


In [38]:
predictions2 = trainPipelineAndPredictTest(pipeline2, wrangled_train_df["corpus"], wrangled_train_df["label"], wrangled_test_df["corpus"])

0.9737937884402338


In [39]:
predictions4 = trainPipelineAndPredictTest(pipeline4, wrangled_train_df["corpus"], wrangled_train_df["label"], wrangled_test_df["corpus"])

0.9447224662871987


## 7. Write outputs

Now we assign the predicted labels to the `label` column of the DataFrame and write one predictions CSV file per model.

In [40]:
test_df["label"] = predictions1

In [41]:
test_df[["Train_id", "label"]].to_csv("test_predictions.csv", index=False, header=True, sep=",")

In [42]:
test_df2 = test_df.copy()
test_df2["label"] = predictions2
test_df2[["Train_id", "label"]].to_csv("test_predictions2.csv", index=False, header=True, sep=",")

In [43]:
test_df4 = test_df.copy()
test_df4["label"] = predictions4
test_df4[["Train_id", "label"]].to_csv("test_predictions4.csv", index=False, header=True, sep=",")
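
For reference, the files written above contain a header row followed by one `Train_id,label` pair per line. The sketch below reproduces that layout with the standard `csv` module and an in-memory buffer. The three rows are made up, and `write_predictions` is a hypothetical helper rather than part of the notebook.

```python
import csv
import io


def write_predictions(ids, labels, handle):
    """Write a header and one (Train_id, label) row per prediction."""
    if len(ids) != len(labels):
        raise ValueError("ids and labels must have the same length")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Train_id", "label"])
    writer.writerows(zip(ids, labels))


# Made-up predictions, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_predictions([1, 2, 3], ["Covid", "BLM", "Trump"], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```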

--- End ---