## Subtweet Classifier Training Set Creator Jupyter Notebook-in-Progress

### Goals:
#### Create a training set for a Naive Bayes Classifier which will be implemented in a Scikit-Learn Pipeline

### Methods:
#### Use pandas to manage the CSV tables: label the tweets downloaded by the subtweet downloading script as positive, then label the accusers' tweets and the most positive-sentiment tweets gathered by Alec Go as negative

#### Make the number of positively labelled tweets equal to the number of negatively labelled tweets, and save the dataset

#### Import libraries

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from glob import glob
import pandas as pd
import nltk
import re

#### Find filenames to load

In [2]:
old_data_filenames = glob("../data/data_for_training/downloaded_nightly_data/*.csv")

In [3]:
new_data_filenames = glob("../data/data_for_training/consolidated_data/*.csv")

#### Create a list of dataframes

In [4]:
dataframes_list = [pd.read_csv(f, index_col=0) for f in old_data_filenames + new_data_filenames]

#### Concatenate the dataframes

In [5]:
dataframe_original = pd.concat(dataframes_list, ignore_index=True)

#### Remove duplicates

In [6]:
dataframe_dropped = dataframe_original.drop_duplicates("alleged_subtweet_id", keep="first")

#### Remove friends' tweets

In [7]:
dataframe_dropped = dataframe_dropped[dataframe_dropped.subtweeter_username.str.contains("akrapf96") == False]

In [8]:
dataframe_dropped = dataframe_dropped[dataframe_dropped.subtweeter_username.str.contains("zoeterhune") == False]

In [9]:
dataframe_dropped = dataframe_dropped[dataframe_dropped.subtweeter_username.str.contains("juliaeberry") == False]

In [10]:
dataframe_dropped = dataframe_dropped[dataframe_dropped.subtweeter_username.str.contains("NoahSegalGould") == False]
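
The four filters above each remove one username. As a possible alternative, they could be collapsed into a single filter with `Series.isin`. This is only a sketch on made-up toy data: `excluded_usernames` is a hypothetical name, and `isin` matches whole usernames exactly, whereas the `str.contains` calls above match substrings, so the two are not strictly equivalent.

```python
import pandas as pd

# Toy data: "someone_else" and "another_user" are invented for illustration.
toy = pd.DataFrame({
    "subtweeter_username": ["akrapf96", "someone_else", "zoeterhune", "another_user"],
    "alleged_subtweet": ["a", "b", "c", "d"],
})

# Hypothetical list holding the four usernames filtered out in the notebook.
excluded_usernames = ["akrapf96", "zoeterhune", "juliaeberry", "NoahSegalGould"]

# Keep only rows whose username is NOT in the exclusion list (exact match).
mask = toy["subtweeter_username"].isin(excluded_usernames)
filtered = toy[~mask].reset_index(drop=True)
print(filtered)
```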

#### Drop extra columns

In [11]:
dataframe_final = dataframe_dropped.drop(["accuser_username", "subtweet_evidence", 
                                          "subtweet_evidence_id", "subtweeter_username", 
                                          "alleged_subtweet_id"], axis=1).reset_index(drop=True)

#### Replace leftover HTML encoded characters with normal ones

In [12]:
dataframe_final["alleged_subtweet"] = dataframe_final["alleged_subtweet"].str.replace("&quot;", "\"")

In [13]:
dataframe_final["alleged_subtweet"] = dataframe_final["alleged_subtweet"].str.replace("&amp;", "&")

In [14]:
dataframe_final["alleged_subtweet"] = dataframe_final["alleged_subtweet"].str.replace("&gt;", ">")

In [15]:
dataframe_final["alleged_subtweet"] = dataframe_final["alleged_subtweet"].str.replace("&lt;", "<")
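
The chained replacements above handle only `&quot;`, `&amp;`, `&gt;`, and `&lt;`. As a possible alternative, the standard library's `html.unescape` decodes all named and numeric entities in a single pass. The sketch below is not what the notebook does, and `decode_entities` is a hypothetical helper name.

```python
import html

def decode_entities(text):
    """Decode HTML entities in one pass (hypothetical helper, not part of the notebook)."""
    return html.unescape(text)

sample = "&quot;hi&quot; &amp; &lt;3 &#39;ok&#39;"
decoded = decode_entities(sample)
print(decoded)

# A single pass does not decode twice: "&amp;lt;" becomes "&lt;", not "<".
# The chained replacements would turn it into "<" because "&amp;" is
# replaced before "&lt;".
once = decode_entities("&amp;lt;")
print(once)
```

Assuming the column holds only strings, this could be applied with `dataframe_final["alleged_subtweet"].map(decode_entities)`.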

#### Remove rows containing non-ASCII characters (a rough proxy for non-English text)

In [16]:
def is_english(s):
    # ASCII-only check: a rough proxy for English, so ASCII text in other languages still passes
    return all(ord(char) < 128 for char in s)

In [17]:
dataframe_final = dataframe_final[dataframe_final.alleged_subtweet.map(is_english)].reset_index(drop=True)
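
The filter above tests characters, not language. The sketch below uses `str.isascii` (Python 3.7+), which performs the same check as `is_english`, to show what passes and what does not. `is_ascii_only` is a hypothetical name, and the third example string is invented.

```python
def is_ascii_only(text):
    """Return True when every character is ASCII (the same test is_english performs)."""
    return text.isascii()

examples = [
    "and after all we did for you... sad:/",  # English, ASCII only
    "Ternyata punya follower banyak",         # not English, but still ASCII only
    "caf\u00e9 time",                         # English with one accented character
]
results = [is_ascii_only(text) for text in examples]
print(results)
```

ASCII-only tweets in other languages are kept, while English tweets containing emoji or accented characters are dropped.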

#### Add column for classification

In [18]:
dataframe_final["is_subtweet"] = "positive"

#### Show the bottom of the table

In [19]:
dataframe_final.tail()

Unnamed: 0,alleged_subtweet,is_subtweet
5351,Ternyata punya follower banyak ga bikin kamu b...,positive
5352,I genuinely think there should be an edit butt...,positive
5353,people who use oof on the internet are virgins,positive
5354,and after all we did for you... sad:/,positive
5355,I hate when babes put delete on Twitter but th...,positive


#### Build a second table from the accusers' tweets (the `subtweet_evidence` column)

In [20]:
dataframe_intermediate = dataframe_dropped.drop(["accuser_username", "subtweet_evidence_id", 
                                                 "subtweeter_username", "alleged_subtweet",
                                                 "alleged_subtweet_id"], axis=1).reset_index(drop=True)

#### Repair leftover HTML again

In [21]:
dataframe_intermediate["subtweet_evidence"] = dataframe_intermediate["subtweet_evidence"].str.replace("&quot;", "\"")

In [22]:
dataframe_intermediate["subtweet_evidence"] = dataframe_intermediate["subtweet_evidence"].str.replace("&amp;", "&")

In [23]:
dataframe_intermediate["subtweet_evidence"] = dataframe_intermediate["subtweet_evidence"].str.replace("&gt;", ">")

In [24]:
dataframe_intermediate["subtweet_evidence"] = dataframe_intermediate["subtweet_evidence"].str.replace("&lt;", "<")

#### Rename the column

In [25]:
dataframe_intermediate = dataframe_intermediate.rename(index=str, columns={"subtweet_evidence": "alleged_subtweet"})

#### Remove rows containing non-ASCII characters (a rough proxy for non-English text)

In [26]:
dataframe_intermediate = dataframe_intermediate[dataframe_intermediate.alleged_subtweet.map(is_english)].reset_index(drop=True)

#### Add column for classification

In [27]:
dataframe_intermediate["is_subtweet"] = "negative"

#### Randomly sample half as many rows as there are positively labelled rows

In [28]:
dataframe_intermediate = dataframe_intermediate.sample(n=len(dataframe_final)//2).reset_index(drop=True)
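
The sampling above draws a different subset on every run. If a reproducible training set is wanted, `DataFrame.sample` accepts a `random_state` argument. The sketch below uses toy data; `SEED` and its value are arbitrary assumptions, not something the notebook sets.

```python
import pandas as pd

# Toy stand-in for the table of negatives.
toy = pd.DataFrame({"alleged_subtweet": [f"tweet {i}" for i in range(10)]})

# Mirrors len(dataframe_final)//2 in the notebook, with a toy size of 10.
target_size = 10 // 2

SEED = 42  # arbitrary fixed integer, chosen only for illustration

first = toy.sample(n=target_size, random_state=SEED).reset_index(drop=True)
second = toy.sample(n=target_size, random_state=SEED).reset_index(drop=True)

# With the same seed, both draws select the same rows in the same order.
same_rows = first.equals(second)
print(same_rows)
```

Note that `sample(n=...)` raises a `ValueError` when `n` exceeds the number of available rows, unless `replace=True` is passed.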

#### Show the bottom of the table

In [29]:
dataframe_intermediate.tail()

Unnamed: 0,alleged_subtweet,is_subtweet
2673,@RightWingRadec @ElVandidoSnek Lamp subtweet,negative
2674,@gonzalezgabbiee Tbh I still don't know how to...,negative
2675,@star_curl is that a subtweet for me?,negative
2676,@crazypastor Cool subtweet breh,negative
2677,@HansFaffing This is a subtweet,negative


#### Load the normal tweets dataset made by Alec Go, Richa Bhayani, and Lei Huang 

In [30]:
go_dataframe = pd.read_csv("../data/data_for_training/other_data/go_data.csv", 
                           names=["Sentiment", "ID", "Date", "Query", "Username", "alleged_subtweet"])

#### Keep only the tweets labelled with positive sentiment (`Sentiment == 4`)

In [31]:
go_dataframe = go_dataframe[go_dataframe["Sentiment"] == 4]

#### Remove duplicates

In [32]:
go_dataframe = go_dataframe.drop_duplicates("alleged_subtweet", keep="first")

#### Drop extra columns

In [33]:
go_dataframe_final = go_dataframe.drop(["Sentiment", "ID", "Date", "Query", "Username"], 
                                       axis=1).reset_index(drop=True)

#### Replace leftover HTML encoded characters with normal ones

In [34]:
go_dataframe_final["alleged_subtweet"] = go_dataframe_final["alleged_subtweet"].str.replace("&quot;", "\"")

In [35]:
go_dataframe_final["alleged_subtweet"] = go_dataframe_final["alleged_subtweet"].str.replace("&amp;", "&")

In [36]:
go_dataframe_final["alleged_subtweet"] = go_dataframe_final["alleged_subtweet"].str.replace("&gt;", ">")

In [37]:
go_dataframe_final["alleged_subtweet"] = go_dataframe_final["alleged_subtweet"].str.replace("&lt;", "<")

#### Remove rows containing non-ASCII characters (a rough proxy for non-English text)

In [38]:
go_dataframe_final = go_dataframe_final[go_dataframe_final.alleged_subtweet.map(is_english)].reset_index(drop=True)

#### Add column for new classification

In [39]:
go_dataframe_final["is_subtweet"] = "negative"

#### Remove all rows which contain broken characters
#### Note: mentions of usernames and URLs are deliberately kept in the negatives, so those two filters are commented out

In [40]:
#go_dataframe_final = go_dataframe_final[go_dataframe_final.alleged_subtweet.str.contains("@") == False]

In [41]:
#pattern = "(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
#go_dataframe_final = go_dataframe_final[go_dataframe_final.alleged_subtweet.str.contains(pattern) == False]

In [42]:
go_dataframe_final = go_dataframe_final[go_dataframe_final.alleged_subtweet.str.contains("\uFFFD") == False]

#### Randomly sample half as many rows as there are positively labelled rows

In [43]:
go_dataframe_final = go_dataframe_final.sample(n=len(dataframe_final)//2).reset_index(drop=True)

#### Show the bottom of the table

In [44]:
go_dataframe_final.tail()

Unnamed: 0,alleged_subtweet,is_subtweet
2673,"@stephanosis Mr.Darcy... interesing,i like a b...",negative
2674,Getting up right now. & getting ready for kick...,negative
2675,"morning everyone, beautiful day and i have a g...",negative
2676,@MarcyChen good night!,negative
2677,i want the summer to be over - fall is so much...,negative


#### Combine the three dataframes into one training set

In [45]:
training_dataframe = pd.concat([dataframe_final, dataframe_intermediate, go_dataframe_final], ignore_index=True)
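
Judging by the last index shown in each preview, the positives table has 5356 rows and each negatives table has 2678, so the two classes balance at 5356 rows each. Because each negatives table is sized with floor division, an odd number of positives would leave the negatives one row short. A quick check with `value_counts` could confirm the balance; the sketch below uses made-up toy tables in place of the three real ones.

```python
import pandas as pd

# Toy stand-ins for dataframe_final, dataframe_intermediate and go_dataframe_final.
positives = pd.DataFrame({"alleged_subtweet": ["p1", "p2", "p3", "p4"],
                          "is_subtweet": "positive"})
accuser_negatives = pd.DataFrame({"alleged_subtweet": ["a1", "a2"],
                                  "is_subtweet": "negative"})
go_negatives = pd.DataFrame({"alleged_subtweet": ["g1", "g2"],
                             "is_subtweet": "negative"})

combined = pd.concat([positives, accuser_negatives, go_negatives], ignore_index=True)

# Count how many rows carry each label.
counts = combined["is_subtweet"].value_counts()
print(counts)

balanced = counts["positive"] == counts["negative"]
print(balanced)
```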

#### Scramble the rows

In [46]:
training_dataframe = training_dataframe.sample(frac=1).reset_index(drop=True)

#### Preview the scrambled dataframe

In [47]:
training_dataframe.tail()

Unnamed: 0,alleged_subtweet,is_subtweet
10707,In thirty years there will be a movie that red...,positive
10708,stop pretending u care about the super bowl lmao,positive
10709,Crazy how one person can make you so happy,positive
10710,@Nahirk i consider this to be a subtweet. now ...,negative
10711,@emisicka haha I'm not even mildly surprised y...,negative


#### Save the dataset

In [48]:
training_dataframe.to_csv("../data/data_for_training/final_training_data/Subtweets_Classifier_Training_Data.csv")
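
By default `to_csv` also writes the row index as a first column with an empty header. Reading the file back with `index_col=0`, as the loading cell at the top of this notebook does, restores it as the index; without that argument pandas names the column `Unnamed: 0`. The sketch below shows the round trip on toy data, using an in-memory buffer in place of the real output path.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "alleged_subtweet": ["x", "y"],
    "is_subtweet": ["positive", "negative"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
header = buffer.getvalue().splitlines()[0]
print(header)

# Read back, treating the first column as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

# Read back without index_col: the index becomes an ordinary column.
buffer.seek(0)
with_extra_column = pd.read_csv(buffer)
print(list(with_extra_column.columns))
```

Passing `index=False` to `to_csv` would omit the index column entirely, if the saved row numbers are not needed.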