## Subtweet Classifier Training Set Creator Jupyter Notebook-in-Progress

### Goals:
#### Create a training set for a Naive Bayes Classifier which will be implemented in a Scikit-Learn Pipeline

### Methods:
#### Using Pandas to manage the CSV tables, label the tweets downloaded by the subtweet downloading script as positive, then label the highly positive sentiment tweets gathered by Alec Go as negative

#### Make the number of positively labelled tweets equal to the number of negatively labelled tweets, and save the dataset

#### Import libraries

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from glob import glob
import pandas as pd
import nltk
import re

#### Find filenames to load

In [2]:
old_data_filenames = glob("../data/data_for_training/downloaded_nightly_data/*.csv")

In [3]:
new_data_filenames = glob("../data/data_for_training/consolidated_data/*.csv")

#### Create a list of dataframes

In [4]:
dataframes_list = [pd.read_csv(f, index_col=0) for f in old_data_filenames + new_data_filenames]

#### Concatenate the dataframes

In [5]:
dataframe_original = pd.concat(dataframes_list, ignore_index=True)

#### Remove duplicates

In [6]:
dataframe_dropped = dataframe_original.drop_duplicates("alleged_subtweet_id", keep="first")
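
To illustrate what `keep="first"` does here, the following is a small sketch on made-up rows (the IDs and text are hypothetical, not taken from the real data). When the same `alleged_subtweet_id` shows up more than once, only its earliest row survives.

```python
import pandas as pd

# Hypothetical rows standing in for the concatenated data
example = pd.DataFrame({
    "alleged_subtweet_id": [101, 102, 101, 103],
    "alleged_subtweet": ["first copy", "unique", "second copy", "also unique"],
})

# Rows sharing an ID are collapsed; the earliest row for each ID is kept
deduplicated = example.drop_duplicates("alleged_subtweet_id", keep="first")
print(deduplicated)
```

The original index labels are preserved, so the result keeps gaps where duplicates were removed until `reset_index(drop=True)` is called later.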

#### Remove friends' tweets

In [7]:
dataframe_dropped = dataframe_dropped[dataframe_dropped.subtweeter_username.str.contains("akrapf96") == False]

In [8]:
dataframe_dropped = dataframe_dropped[dataframe_dropped.subtweeter_username.str.contains("zoeterhune") == False]

In [9]:
dataframe_dropped = dataframe_dropped[dataframe_dropped.subtweeter_username.str.contains("juliaeberry") == False]

In [10]:
dataframe_dropped = dataframe_dropped[dataframe_dropped.subtweeter_username.str.contains("NoahSegalGould") == False]
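
The four filters above could probably be collapsed into a single pass. This is only a sketch on a made-up dataframe: the names `example`, `friends`, `pattern`, and `is_friend` are hypothetical, and `na=True` is chosen on the assumption that rows with a missing username should be dropped, which is what the `== False` comparison does.

```python
import re
import pandas as pd

# Hypothetical usernames standing in for dataframe_dropped
example = pd.DataFrame({
    "subtweeter_username": ["akrapf96", "someone_else", "zoeterhune", "another_user"],
})

friends = ["akrapf96", "zoeterhune", "juliaeberry", "NoahSegalGould"]

# Build one alternation pattern; re.escape guards against regex metacharacters
pattern = "|".join(re.escape(name) for name in friends)

# na=True treats missing usernames as matches so those rows are dropped too
is_friend = example["subtweeter_username"].str.contains(pattern, na=True)
filtered = example[~is_friend]
print(filtered)
```

Like the original cells, this matches substrings, so a username that merely contains one of the friends' names would also be removed.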

#### Drop extra columns

In [11]:
dataframe_final = dataframe_dropped.drop(["accuser_username", "subtweet_evidence", 
                                          "subtweet_evidence_id", "subtweeter_username", 
                                          "alleged_subtweet_id"], axis=1).reset_index(drop=True)

#### Replace leftover HTML encoded characters with normal ones

In [12]:
dataframe_final["alleged_subtweet"] = dataframe_final["alleged_subtweet"].str.replace("&quot;", "\"")

In [13]:
dataframe_final["alleged_subtweet"] = dataframe_final["alleged_subtweet"].str.replace("&amp;", "&")

In [14]:
dataframe_final["alleged_subtweet"] = dataframe_final["alleged_subtweet"].str.replace("&gt;", ">")

In [15]:
dataframe_final["alleged_subtweet"] = dataframe_final["alleged_subtweet"].str.replace("&lt;", "<")
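
As an alternative to chaining four replacements, the standard library's `html.unescape` can decode entities in one call. The sketch below uses a made-up series and assumes there are no missing values in the column. It is not a drop-in equivalent: `html.unescape` decodes every HTML entity, not only the four handled above.

```python
import html
import pandas as pd

# Hypothetical tweets containing the same four entities handled above
example = pd.Series([
    "&quot;quoted&quot; &amp; &lt;tagged&gt;",
    "no entities here",
])

# Apply html.unescape to each string in the series
unescaped = example.map(html.unescape)
print(unescaped.tolist())
```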

#### Add column for classification

In [16]:
dataframe_final["is_subtweet"] = "positive"

#### Show the bottom of the table

In [17]:
dataframe_final.tail()

Unnamed: 0,alleged_subtweet,is_subtweet
7804,I genuinely think there should be an edit butt...,positive
7805,people who use oof on the internet are virgins,positive
7806,and after all we did for you... sad:/,positive
7807,Ah ça !! Vivement le jour où un homme pourra d...,positive
7808,I hate when babes put delete on Twitter but th...,positive


#### Load the normal tweets dataset made by Alec Go, Richa Bhayani, and Lei Huang 

In [18]:
go_dataframe = pd.read_csv("../data/data_for_training/other_data/go_data.csv", 
                           names=["Sentiment", "ID", "Date", "Query", "Username", "alleged_subtweet"])

#### Keep only the tweets labelled with the most positive sentiment (`Sentiment == 4`)

In [19]:
go_dataframe = go_dataframe[go_dataframe["Sentiment"] == 4]

#### Remove duplicates

In [20]:
go_dataframe = go_dataframe.drop_duplicates("alleged_subtweet", keep="first")

#### Drop extra columns

In [21]:
go_dataframe_final = go_dataframe.drop(["Sentiment", "ID", "Date", "Query", "Username"], 
                                       axis=1).reset_index(drop=True)

#### Replace leftover HTML encoded characters with normal ones

In [22]:
go_dataframe_final["alleged_subtweet"] = go_dataframe_final["alleged_subtweet"].str.replace("&quot;", "\"")

In [23]:
go_dataframe_final["alleged_subtweet"] = go_dataframe_final["alleged_subtweet"].str.replace("&amp;", "&")

In [24]:
go_dataframe_final["alleged_subtweet"] = go_dataframe_final["alleged_subtweet"].str.replace("&gt;", ">")

In [25]:
go_dataframe_final["alleged_subtweet"] = go_dataframe_final["alleged_subtweet"].str.replace("&lt;", "<")

#### Remove rows that contain non-ASCII characters (a rough proxy for non-English text)

In [26]:
def is_english(s):
    return all(ord(char) < 128 for char in s)

In [27]:
# .copy() avoids a SettingWithCopyWarning when a column is added to the filtered frame
go_dataframe_final = go_dataframe_final[go_dataframe_final.alleged_subtweet.map(is_english)].copy()
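
It may help to see what this check accepts and rejects. The sketch below repeats the function on a few sample strings (the third one is made up for illustration). It is an ASCII test rather than real language detection, so accented text is rejected while unaccented non-English text passes.

```python
def is_english(s):
    # Same check as above: True only if every character is 7-bit ASCII
    return all(ord(char) < 128 for char in s)

samples = [
    "right now, new york is where i wanna be.",  # plain ASCII English
    "Ah ça !!",                                  # contains a non-ASCII character
    "bonjour tout le monde",                     # French, but pure ASCII
]

results = {text: is_english(text) for text in samples}
for text, verdict in results.items():
    print(verdict, text)
```

English tweets containing emoji or curly quotes would be dropped by the same rule. On Python 3.7 and later, `str.isascii()` performs an equivalent test.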

#### Add column for new classification

In [28]:
go_dataframe_final["is_subtweet"] = "negative"

#### Remove all rows which contain broken characters
#### The filters for username mentions and URLs are left commented out because it is important to keep URLs and mentions in the negatives

In [29]:
#go_dataframe_final = go_dataframe_final[go_dataframe_final.alleged_subtweet.str.contains("@") == False]

In [30]:
#pattern = r"(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
#go_dataframe_final = go_dataframe_final[go_dataframe_final.alleged_subtweet.str.contains(pattern) == False]

In [31]:
go_dataframe_final = go_dataframe_final[go_dataframe_final.alleged_subtweet.str.contains("\uFFFD") == False]

#### Randomly sample as many negative rows as there are positive rows in the subtweet data

In [32]:
go_dataframe_final = go_dataframe_final.sample(n=len(dataframe_final)).reset_index(drop=True)
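
Because `sample` draws different rows on every run, the saved training set changes each time the notebook is executed. If reproducibility matters, one option is to pass a seed. This is a minimal sketch on made-up frames, and the seed value `0` is an arbitrary assumption rather than something the notebook uses.

```python
import pandas as pd

# Hypothetical stand-ins for dataframe_final and go_dataframe_final
positives = pd.DataFrame({"alleged_subtweet": ["a", "b", "c"]})
negatives = pd.DataFrame({"alleged_subtweet": [f"tweet {i}" for i in range(10)]})

# Draw exactly as many negatives as there are positives, with a fixed seed
sampled = negatives.sample(n=len(positives), random_state=0).reset_index(drop=True)

# Repeating the call with the same seed returns the same rows
sampled_again = negatives.sample(n=len(positives), random_state=0).reset_index(drop=True)
print(sampled.equals(sampled_again))
```

Note that `sample` raises a `ValueError` when `n` exceeds the number of available rows unless `replace=True` is passed, so this step relies on there being at least as many negatives as positives.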

#### Show the bottom of the table

In [33]:
go_dataframe_final.tail()

Unnamed: 0,alleged_subtweet,is_subtweet
7804,@tweeddelights not literally! But it is seriou...,negative
7805,"right now, new york is where i wanna be.",negative
7806,@Godsrep You working today too?! I'm done f...,negative
7807,WEEE!!! playing crazier on my uke gives me a h...,negative
7808,Very tired from a day of awesomeness and cake ...,negative


#### Combine the two dataframes into one training set

In [34]:
training_dataframe = pd.concat([dataframe_final, go_dataframe_final], ignore_index=True)

#### Scramble the rows

In [35]:
training_dataframe = training_dataframe.sample(frac=1).reset_index(drop=True)
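
A quick sanity check after shuffling is to confirm that both classes are still equally represented and that no rows were lost. The following sketch uses a made-up six-row frame. The names `combined`, `shuffled`, and `counts` are hypothetical, as is the seed.

```python
import pandas as pd

# Hypothetical combined frame with three rows per class
combined = pd.DataFrame({
    "alleged_subtweet": ["p1", "p2", "p3", "n1", "n2", "n3"],
    "is_subtweet": ["positive"] * 3 + ["negative"] * 3,
})

# frac=1 returns every row in random order; reset_index renumbers them
shuffled = combined.sample(frac=1, random_state=0).reset_index(drop=True)

# Count the rows per label to confirm the classes are balanced
counts = shuffled["is_subtweet"].value_counts()
print(counts)
```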

#### Preview the scrambled dataframe

In [36]:
training_dataframe.tail()

Unnamed: 0,alleged_subtweet,is_subtweet
15613,@foodiesarah admittedly it was only 8 pages a ...,negative
15614,Imagine getting a standing ovation for being a...,positive
15615,Tell ya lil friends not to talk shit about my ...,positive
15616,trying to get back in the swing of Twitting! B...,negative
15617,..it was specially made/designed for me! I'm...,negative


#### Save the dataset

In [37]:
training_dataframe.to_csv("../data/data_for_training/final_training_data/Subtweets_Classifier_Training_Data.csv")
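
Since `to_csv` writes the index as an unnamed first column by default, whoever loads this file later will probably want `index_col=0`, as the loading cell near the top of this notebook does. The sketch below shows the difference using an in-memory buffer in place of the real file. The two rows are made up.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for training_dataframe
example = pd.DataFrame({
    "alleged_subtweet": ["an example tweet", "another example"],
    "is_subtweet": ["positive", "negative"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
example.to_csv(buffer)
csv_text = buffer.getvalue()

# Without index_col, the saved index comes back as an "Unnamed: 0" column
naive = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is restored as the index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```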