## Subtweets Downloader Jupyter Notebook-in-Progress

### Goals:
#### Create a corpus of subtweets for use in training.

### Methods:
#### Twitter API searching: When a user replies to a Tweet and uses the word "subtweet" in that reply, the original Tweet at the root of the reply thread is probably an actual subtweet. The Twitter API makes it possible to find such Tweets.

#### Import libraries for Twitter API access, tables, and text cleaning

In [1]:
import tweepy
import pickle
import re
import pandas as pd
from nltk.tokenize import TweetTokenizer

#### Set up access to the API

In [2]:
# Read one credential per line; the with-block closes the file afterwards
with open("credentials.txt") as f:
    consumer_key, consumer_secret, access_token, access_token_secret = f.read().splitlines()
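
A more defensive way to read the four values could look like the sketch below. It assumes `credentials.txt` holds one value per line in the order used above; `load_credentials` is a hypothetical helper, not part of the original notebook. It ignores blank lines and surrounding whitespace, and reports a clear error when fewer than four values are present.

```python
import os
import tempfile

def load_credentials(path):
    """Return the first four non-empty lines of `path` (hypothetical helper)."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    if len(lines) < 4:
        raise ValueError("expected 4 credential lines, found %d" % len(lines))
    return tuple(lines[:4])

# Demonstration with placeholder values written to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("KEY\nSECRET\n\nTOKEN\nTOKEN_SECRET\n")  # blank line on purpose
demo_credentials = load_credentials(tmp.name)
os.remove(tmp.name)
print(demo_credentials)
```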

In [3]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

#### Specifically take advantage of built-in methods to handle Twitter API rate limits

In [4]:
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)

#### Recursively find the first Tweet in a chain of replies, i.e. the one which is not in reply to any other Tweet

In [5]:
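def first_tweet(tweet_id):
    tweet = api.get_status(tweet_id)
    parent_id = tweet._json["in_reply_to_status_id"]
    if parent_id is None: # Reached the root of the thread, so no further API call is needed
        return tweet
    try:
        return first_tweet(parent_id)
    except tweepy.TweepError: # Parent could not be retrieved; fall back to this Tweet
        return tweet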
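
The same walk up a reply chain can be tried offline. The sketch below replaces the API with a hypothetical in-memory dictionary, `fake_thread`, mapping each Tweet ID to the ID it replies to (`None` for a root). `root_of` is an illustrative name, and a loop is used instead of recursion only to avoid deep call stacks on long threads.

```python
# Hypothetical thread: 3 replies to 2, 2 replies to 1, and 1 is a root
fake_thread = {1: None, 2: 1, 3: 2, 10: None}

def root_of(tweet_id, parents):
    """Follow in-reply-to links until a Tweet with no known parent is reached."""
    seen = set()
    while parents.get(tweet_id) is not None:
        if tweet_id in seen:  # guard against malformed, cyclic data
            raise ValueError("cycle detected")
        seen.add(tweet_id)
        tweet_id = parents[tweet_id]
    return tweet_id

print(root_of(3, fake_thread))
```

An ID that is missing from the dictionary is treated as a root, which loosely mirrors falling back to the last retrievable Tweet when a parent cannot be fetched.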

#### Specify some parameters for the search

In [6]:
query = "\"subtweet\" since:2016-12-01"
max_tweets = 1000000

#### Make a list of all Tweets matching the search terms

In [7]:
%%time
statuses = []
for status in tweepy.Cursor(api.search, q=query, lang="en").items(max_tweets):
    # The status must be a reply
    if status._json["in_reply_to_status_id"]:
        statuses.append(status)

Rate limit reached. Sleeping for: 725
Rate limit reached. Sleeping for: 875
Rate limit reached. Sleeping for: 874
Rate limit reached. Sleeping for: 873
Rate limit reached. Sleeping for: 873
Rate limit reached. Sleeping for: 873
Rate limit reached. Sleeping for: 873
Rate limit reached. Sleeping for: 873
CPU times: user 9.93 s, sys: 600 ms, total: 10.5 s
Wall time: 1h 58min 28s
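
Most of that wall time is spent waiting on rate limits rather than computing. A quick check, using only the sleep durations and the wall time printed above:

```python
# Sleep durations (seconds) copied from the output above
sleeps = [725, 875, 874, 873, 873, 873, 873, 873]
wall_seconds = 1 * 3600 + 58 * 60 + 28  # Wall time: 1h 58min 28s

total_sleep = sum(sleeps)
share = total_sleep / wall_seconds
print("Slept %d s of %d s (%.0f%%)" % (total_sleep, wall_seconds, share * 100))
# prints: Slept 6839 s of 7108 s (96%)
```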


#### Save the statuses

In [8]:
with open("statuses.p", "wb") as f: # Close the file once the dump is complete
    pickle.dump(statuses, f)

#### Load the statuses

In [9]:
#with open("statuses.p", "rb") as f:
#    statuses = pickle.load(f)

In [10]:
print("Statuses acquired: " + str(len(statuses)))

Statuses acquired: 5417


#### Remove Tweets which do not contain the exact search term "subtweet". Apparently, the search returns extras

In [11]:
statuses = [status for status in statuses if "subtweet" in status._json["text"]]

In [12]:
print("Statuses actually containing \"subtweet\": " + str(len(statuses)))

Statuses actually containing "subtweet": 4660
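
Note that the membership test above is case-sensitive, so a reply containing "Subtweet" or "SUBTWEET" is dropped along with the genuine extras. If such replies should count, a case-insensitive check is one option. The sketch below uses made-up strings rather than real statuses, and `mentions_subtweet` is a hypothetical helper.

```python
import re

SUBTWEET_RE = re.compile(r"subtweet", re.IGNORECASE)

def mentions_subtweet(text):
    """True if `text` contains "subtweet" in any capitalization."""
    return SUBTWEET_RE.search(text) is not None

examples = ["Is this a subtweet?", "Nice SUBTWEET", "unrelated reply"]
kept = [text for text in examples if mentions_subtweet(text)]
print(kept)
```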


#### Make the list into a dictionary for use in a dataframe

In [13]:
df_dict = {}
accuser_usernames = []
subtweet_evidences = []
subtweet_evidence_ids = []
subtweeter_usernames = []
alleged_subtweets = []
alleged_subtweet_ids = []

In [14]:
%%time
for i in range(len(statuses)):
    status = statuses[i]._json
    
    user = status["user"]["screen_name"]
    tweet_text = status["text"]
    tweet_id = status["id"]
    
    #print(str(i+1) + ": " + tweet_text)
    
    accuser_usernames.append(user)
    subtweet_evidences.append(tweet_text)
    subtweet_evidence_ids.append(tweet_id)
    try:
        first = first_tweet(tweet_id)._json
        first_user = first["user"]["screen_name"]
        first_text = first["text"]
        first_id = first["id"]
        if first_user != user: # Confirm the accuser is not replying to their own thread
            subtweeter_usernames.append(first_user)
            alleged_subtweets.append(first_text)
            alleged_subtweet_ids.append(first_id)
        else:
            del accuser_usernames[-1]
            del subtweet_evidences[-1]
            del subtweet_evidence_ids[-1]
    except tweepy.TweepError:
        del accuser_usernames[-1]
        del subtweet_evidences[-1]
        del subtweet_evidence_ids[-1]

Rate limit reached. Sleeping for: 791
Rate limit reached. Sleeping for: 794
Rate limit reached. Sleeping for: 792
Rate limit reached. Sleeping for: 793
Rate limit reached. Sleeping for: 790
Rate limit reached. Sleeping for: 791
Rate limit reached. Sleeping for: 788
Rate limit reached. Sleeping for: 785
Rate limit reached. Sleeping for: 791
Rate limit reached. Sleeping for: 787
Rate limit reached. Sleeping for: 784
Rate limit reached. Sleeping for: 786
Rate limit reached. Sleeping for: 785
Rate limit reached. Sleeping for: 790
Rate limit reached. Sleeping for: 787
Rate limit reached. Sleeping for: 788
Rate limit reached. Sleeping for: 790
Rate limit reached. Sleeping for: 791
Rate limit reached. Sleeping for: 787
Rate limit reached. Sleeping for: 787
Rate limit reached. Sleeping for: 787
CPU times: user 4min 13s, sys: 10.7 s, total: 4min 23s
Wall time: 5h 17min 51s
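
The append-then-delete bookkeeping above exists to keep six parallel lists in step. An alternative, sketched here with made-up records and a hypothetical `find_root` stand-in for `first_tweet`, is to build one dictionary per accepted pair and append it only once every check has passed.

```python
# Made-up replies and thread roots; none of these are real Tweets
replies = [
    {"id": 11, "user": "alice", "text": "is this a subtweet?"},
    {"id": 12, "user": "bob", "text": "subtweet much"},
    {"id": 13, "user": "carol", "text": "not a subtweet"},
]
roots = {
    11: {"id": 1, "user": "dave", "text": "some people never learn"},
    12: {"id": 2, "user": "bob", "text": "talking to myself"},
    # 13 is missing on purpose, standing in for a failed lookup
}

def find_root(reply_id):
    """Hypothetical stand-in for first_tweet(); raises LookupError on failure."""
    return roots[reply_id]  # KeyError is a subclass of LookupError

rows = []
for reply in replies:
    try:
        root = find_root(reply["id"])
    except LookupError:
        continue  # skip replies whose thread cannot be retrieved
    if root["user"] == reply["user"]:
        continue  # skip users replying to their own thread
    rows.append({
        "accuser_username": reply["user"],
        "subtweet_evidence": reply["text"],
        "subtweet_evidence_id": reply["id"],
        "subtweeter_username": root["user"],
        "alleged_subtweet": root["text"],
        "alleged_subtweet_id": root["id"],
    })

print(len(rows))
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`, so the columns cannot drift out of alignment.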


In [15]:
df_dict = {"accuser_username": accuser_usernames, 
           "subtweet_evidence": subtweet_evidences, 
           "subtweet_evidence_id": subtweet_evidence_ids, 
           "subtweeter_username": subtweeter_usernames,
           "alleged_subtweet": alleged_subtweets,
           "alleged_subtweet_id": alleged_subtweet_ids}

#### Keep only the rows whose alleged subtweet contains no user mention, no URL, and not the word "subtweet", and which is long enough (more than five tokens) to be a subtweet

In [16]:
tokenizer = TweetTokenizer()

In [17]:
df_dict_copy = {"accuser_username": [], 
                "subtweet_evidence": [], 
                "subtweet_evidence_id": [], 
                "subtweeter_username": [],
                "alleged_subtweet": [],
                "alleged_subtweet_id": []}

In [18]:
pattern = re.compile(r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?')

In [19]:
%%time
for i in range(len(df_dict["alleged_subtweet"])):
    if ("@" not in df_dict["alleged_subtweet"][i] and # Subtweets should not contain mentions
        not pattern.findall(df_dict["alleged_subtweet"][i]) and # Subtweets should not contain URLs
        "subtweet" not in df_dict["alleged_subtweet"][i] and # Subtweets which call themselves subtweets... aren't
        len(tokenizer.tokenize(df_dict["alleged_subtweet"][i])) > 5): # Arbitrarily only count longer Tweets
        #print(str(i))
        df_dict_copy["accuser_username"].append(df_dict["accuser_username"][i])
        df_dict_copy["subtweet_evidence"].append(df_dict["subtweet_evidence"][i])
        df_dict_copy["subtweet_evidence_id"].append(df_dict["subtweet_evidence_id"][i])
        df_dict_copy["subtweeter_username"].append(df_dict["subtweeter_username"][i])
        df_dict_copy["alleged_subtweet"].append(df_dict["alleged_subtweet"][i])
        df_dict_copy["alleged_subtweet_id"].append(df_dict["alleged_subtweet_id"][i])

CPU times: user 71 ms, sys: 0 ns, total: 71 ms
Wall time: 70.4 ms
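
The four conditions can also be collected into a single predicate, which makes them easier to test in isolation. In this sketch `looks_like_subtweet` is a hypothetical name, the URL pattern is a simplified one rather than the notebook's, and `str.split` stands in for `TweetTokenizer`, so token counts may differ from the notebook's on punctuation-heavy text.

```python
import re

URL_RE = re.compile(r"(?:https?|ftp)://\S+")  # simplified URL pattern (assumption)

def looks_like_subtweet(text, min_tokens=6):
    """Apply the four filters to one candidate Tweet."""
    return ("@" not in text                        # no user mentions
            and URL_RE.search(text) is None        # no URLs
            and "subtweet" not in text             # does not call itself a subtweet
            and len(text.split()) >= min_tokens)   # at least min_tokens whitespace tokens

# Made-up candidates, one per rejection reason plus one that passes
candidates = [
    "I honestly deserve the best of everything.",
    "@someone this is about you and you know it",
    "read this https://example.com right now please ok",
    "too short",
]
results = [looks_like_subtweet(text) for text in candidates]
print(results)
```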


#### Confirm all the lists are the same length for use in a dataframe

In [20]:
print("Number of accusers (usernames): " + str(len(df_dict_copy["accuser_username"])))
print("Number of evidence Tweets (text): " + str(len(df_dict_copy["subtweet_evidence"])))
print("Number of evidence Tweets (IDs): " + str(len(df_dict_copy["subtweet_evidence_id"])))
print("Number of subtweeters (usernames): " + str(len(df_dict_copy["subtweeter_username"])))
print("Number of subtweets (text): " + str(len(df_dict_copy["alleged_subtweet"])))
print("Number of subtweets (IDs): " + str(len(df_dict_copy["alleged_subtweet_id"])))

Number of accusers (usernames): 1148
Number of evidence Tweets (text): 1148
Number of evidence Tweets (IDs): 1148
Number of subtweeters (usernames): 1148
Number of subtweets (text): 1148
Number of subtweets (IDs): 1148


In [21]:
df = pd.DataFrame(df_dict_copy, columns=["accuser_username", 
                                         "subtweet_evidence", 
                                         "subtweet_evidence_id", 
                                         "subtweeter_username", 
                                         "alleged_subtweet", 
                                         "alleged_subtweet_id"])

#### Attempt to fit more of the strings in each cell

In [22]:
pd.set_option('display.max_colwidth', 500)

#### Show the top of the dataframe

In [23]:
df.head()

| | accuser_username | subtweet_evidence | subtweet_evidence_id | subtweeter_username | alleged_subtweet | alleged_subtweet_id |
|---|---|---|---|---|---|---|
| 0 | Way_too_lazy | @IlluminatifyRBX @SirDeviloper The only tweet he didn't subtweet me in was when he taunted me cause he almost had t… https://t.co/awIH2iWIPa | 946336513117257728 | IlluminatifyRBX | i have to be honest im glad fireable was suspended, he deserved it | 946328768359972864 |
| 1 | DexterAlmighty9 | @Biltawulf Is this a subtweet? | 946333445998997507 | Biltawulf | If you’re tweeting details of your Twitter family then it’s time to put your phone down, go outside and speak to adult humans. | 946317847558533121 |
| 2 | VannaMaKayla | @lifewithkady 🙄🙄 don’t have to subtweet when our elbows are literally touching | 946328159426875392 | lifewithkady | Don’t mind Savannah, y’all 😂 | 946327651685470208 |
| 3 | bubsby | @Moomishii y u gotta subtweet @NasuFriend like that | 946323669978046464 | Moomishii | twitter trap: (picture of a cat rolling on the floor and eating spaghetti)\n\nalso twitter trap: i want a boyfriend that tears out my eyes | 946225427026268161 |
| 4 | JessSchmes | @jtbaxa YES YOU DO AND MORE. FUCK DA HOE. #subtweet #iknowyouseethisbitch #comeatme | 946319036794650624 | jtbaxa | I honestly deserve the best. | 946261637920522240 |


#### Save the dataframe as a CSV

In [24]:
df.to_csv("probably_subtweets.csv")