I ran two instances of this workflow at the same time, one per subreddit, to acquire the data in parallel. This instance also saves the 500 most frequent token strings from its subreddit so they can be compared with the 500 most frequent token strings from the other subreddit. 
The two top-500 lists are used to create a custom stopword list that further preprocesses the corpus before modeling and predicting. 

The processes in this notebook mirror the processes in Notebook 1. 

In [9]:
#Retrieve data online and load
import requests as req 
import json
import re
import time
from datetime import datetime, timezone

#NLP libraries
from nltk.stem import WordNetLemmatizer as lemma
from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize, RegexpTokenizer, TweetTokenizer
from nltk.corpus import stopwords


#Variables to use for NLP
stop = stopwords.words('english')
punc = "?!,.;:)("

#Pandas
import pandas as pd

#Plot and model
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay #plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


In [10]:
#function to request data using pushshift
def so_fresh(n,sub,post_type):
    '''Takes in a number n, the name of the subreddit (all lowercase), and post_type ("submission" in this notebook;
    the "selftext" column kept below is a submission field).
    Creates a request url to obtain data through the pushshift api from reddit.
    Returns a list of n dataframes, each comprised of up to 100 reddit posts.
    
    example: sub = so_fresh(100,"prorevenge","submission") 
    will run 100 iterations of the request loop to obtain posts from the prorevenge subreddit'''
    
    df_1 = []
    start = 1600822671 #unix timestamp in seconds; only posts created before it are requested
    for i in range(n):
        red_url = f"https://api.pushshift.io/reddit/search/{post_type}/?subreddit={sub}&before={start}&size=100"
        reqs = req.get(red_url, timeout = 30)
        stream = reqs.json()
        body = stream["data"]
        #keep only the post text, subreddit name, and creation time
        df = pd.DataFrame(body)[['selftext','subreddit','created_utc']]
        #move the cursor back to the oldest post in this pull
        start=df['created_utc'].min()
        df_1.append(df)
        time.sleep(30) #pause between requests
        print(f"Pull {i} complete") #Chuck suggested to Tanner- incorporated here
  
    return df_1
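
The loop in `so_fresh` pages backwards through time: each request asks only for posts created before the oldest timestamp seen so far. A minimal offline sketch of that cursor logic follows. `fake_fetch` and `paginate` are hypothetical names that stand in for the pushshift request and the loop, and the early stop on an empty page is an addition that the original function does not have.

```python
def fake_fetch(before, size=3):
    """Hypothetical stand-in for the API call: returns up to `size` posts
    created strictly before `before`, newest first."""
    all_posts = [{"created_utc": t, "selftext": f"post {t}"} for t in range(10, 0, -1)]
    older = [p for p in all_posts if p["created_utc"] < before]
    return older[:size]


def paginate(fetch, start, n_pulls):
    """Collect up to n_pulls pages, moving the `before` cursor back each time."""
    pages = []
    before = start
    for _ in range(n_pulls):
        page = fetch(before)
        if not page:  # nothing older is left, so stop instead of failing
            break
        pages.append(page)
        # the oldest timestamp on this page becomes the next cursor
        before = min(p["created_utc"] for p in page)
    return pages


pages = paginate(fake_fetch, start=11, n_pulls=5)
timestamps = [p["created_utc"] for page in pages for p in page]
```

Because the cursor is strictly "before", no post is returned twice across pages in this sketch.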
        
        

In [11]:
#call the function
sub = so_fresh(100, "prorevenge", "submission")

Pull 0 complete
Pull 1 complete
Pull 2 complete
Pull 3 complete
Pull 4 complete
Pull 5 complete
Pull 6 complete
Pull 7 complete
Pull 8 complete
Pull 9 complete
Pull 10 complete
Pull 11 complete
Pull 12 complete
Pull 13 complete
Pull 14 complete
Pull 15 complete
Pull 16 complete
Pull 17 complete
Pull 18 complete
Pull 19 complete
Pull 20 complete
Pull 21 complete
Pull 22 complete
Pull 23 complete
Pull 24 complete
Pull 25 complete
Pull 26 complete
Pull 27 complete
Pull 28 complete
Pull 29 complete
Pull 30 complete
Pull 31 complete
Pull 32 complete
Pull 33 complete
Pull 34 complete
Pull 35 complete
Pull 36 complete
Pull 37 complete
Pull 38 complete
Pull 39 complete
Pull 40 complete
Pull 41 complete
Pull 42 complete
Pull 43 complete
Pull 44 complete
Pull 45 complete
Pull 46 complete
Pull 47 complete
Pull 48 complete
Pull 49 complete
Pull 50 complete
Pull 51 complete
Pull 52 complete
Pull 53 complete
Pull 54 complete
Pull 55 complete
Pull 56 complete
Pull 57 complete
Pull 58 complete
Pull 59

In [60]:
def so_clean(df = sub):
    '''Takes in a list of dataframes, concatenates them, prints the starting shape, then drops duplicates and nulls.
    default value is the sub created when so_fresh is called
    
    example: so_clean(sub)
    returns a single dataframe after dedupe and removing nulls'''
    df_sub= pd.concat(df)
    print (f"Starting dimensions: {df_sub.shape}")
    #drop every row whose "selftext" value appears more than once (keep = False keeps no copy)
    #this removes all the posts that have been deleted
    #or removed, since they share the same text
    df_sub.drop_duplicates(["selftext"], keep = False, inplace = True)
    #drop nulls
    df_sub.dropna(how = "any", inplace = True)
    
    
    return df_sub
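
Because `keep = False` is passed, `drop_duplicates` discards every row whose `selftext` occurs more than once instead of keeping one copy. The plain-Python sketch below mimics that rule on a hypothetical list of post bodies; it is not the pandas implementation, and the placeholder string "[removed]" is an assumption about how deleted posts appear in the data.

```python
from collections import Counter

# Hypothetical post bodies; "[removed]" stands in for moderated posts.
texts = ["story A", "[removed]", "story B", "[removed]", "story A", "story C"]

counts = Counter(texts)
# Keep a text only if it occurs exactly once, mirroring keep=False.
kept = [t for t in texts if counts[t] == 1]
```

One side effect of this rule is that a genuine story posted twice ("story A" here) is dropped entirely as well.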


In [65]:
df_prorevenge = so_clean(sub)
df_prorevenge = df_prorevenge[df_prorevenge["subreddit"] == "ProRevenge"]
print (f"Ending dimensions: {df_prorevenge.shape}")
df_prorevenge.to_csv(r'prorevenge.csv',index = True)

Starting dimensions: (10000, 3)
Ending dimensions: (5883, 3)


In [14]:
def stantokenia(df):
    '''Function accepts dataframe. Tokenizes contents of the "selftext" column into a new "tokens" column. Returns dataframe'''
    #create a column in the data frame of tokens from selftext
    #case is preserved (the TweetTokenizer default), so the tokens are not lowercased
    #reddit text is similar to tweet text as opposed to standard speech
    #used TweetTokenizer to split my strings-http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual
    tokenizer = TweetTokenizer()
    df["tokens"] = df["selftext"].apply(tokenizer.tokenize)
    return df

In [16]:
df_prorevenge = stantokenia(df_prorevenge)

In [18]:
df_prorevenge.head()

|  | selftext | subreddit | created_utc | tokens |
|---|---|---|---|---|
| 1 | I met a guy online in May on a student forum. ... | ProRevenge | 1600797165 | [I, met, a, guy, online, in, May, on, a, stude... |
| 2 | This is a story in the making. My gf recently ... | ProRevenge | 1600757538 | [This, is, a, story, in, the, making, ., My, g... |
| 6 | My now ex-husband, Sean (not real name), came ... | ProRevenge | 1600738776 | [My, now, ex-husband, ,, Sean, (, not, real, n... |
| 8 | It’s bullshit, how will bully’s learn if there... | ProRevenge | 1600712335 | [It, ’, s, bullshit, ,, how, will, bully, ’, s... |
| 9 | Not mine but a friend's story from today this ... | ProRevenge | 1600710776 | [Not, mine, but, a, friend's, story, from, tod... |


In [61]:
def rosa_parks(text):
    '''Lemmatizer. Accepts a list of tokens, returns a list of lemmas'''
    lemmatizer = lemma()
    return [lemmatizer.lemmatize(w) for w in text] #adapted from stack overflow, a function that can be applied to a data frame column

In [62]:
def roses(df):
    '''Function accepts dataframe. Applies lemmatizer to create "lemmas" and filters out stopwords to create
    "unique". The stopword check is case-sensitive, so capitalized words such as "I" or "This" are kept.
    Returns dataframe'''
    
    df["lemmas"]= df["tokens"].apply(rosa_parks)
    df["unique"] = df["lemmas"].apply(lambda x: [word for word in x if word not in (stop)])
    return df
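
The `unique` column shown below still contains words such as "This" and "My" because the NLTK stopword list is lowercase and the check in `roses` is case-sensitive. The sketch below illustrates the difference on one row of tokens. The small `stop_words` set is a hypothetical stand-in for the NLTK list, and lowercasing for the comparison is shown only as one possible option, not as what the notebook does.

```python
# Small hypothetical stopword set; the notebook uses NLTK's English list.
stop_words = {"i", "a", "in", "on", "this", "is", "the", "my"}

tokens = ["This", "is", "a", "story", "in", "the", "making", ".", "My", "gf"]

# Case-sensitive check, as in roses(): capitalized stopwords survive.
case_sensitive = [w for w in tokens if w not in stop_words]

# Lowercasing only for the comparison removes them but keeps original casing.
case_insensitive = [w for w in tokens if w.lower() not in stop_words]
```

Either behavior can be reasonable; what matters is that both subreddits are processed the same way before their top-500 lists are compared.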


In [22]:
df_prorevenge = roses(df_prorevenge)

In [24]:
df_prorevenge.head()

|  | selftext | subreddit | created_utc | tokens | lemmas | unique |
|---|---|---|---|---|---|---|
| 1 | I met a guy online in May on a student forum. ... | ProRevenge | 1600797165 | [I, met, a, guy, online, in, May, on, a, stude... | [I, met, a, guy, online, in, May, on, a, stude... | [I, met, guy, online, May, student, forum, ., ... |
| 2 | This is a story in the making. My gf recently ... | ProRevenge | 1600757538 | [This, is, a, story, in, the, making, ., My, g... | [This, is, a, story, in, the, making, ., My, g... | [This, story, making, ., My, gf, recently, gav... |
| 6 | My now ex-husband, Sean (not real name), came ... | ProRevenge | 1600738776 | [My, now, ex-husband, ,, Sean, (, not, real, n... | [My, now, ex-husband, ,, Sean, (, not, real, n... | [My, ex-husband, ,, Sean, (, real, name, ), ,,... |
| 8 | It’s bullshit, how will bully’s learn if there... | ProRevenge | 1600712335 | [It, ’, s, bullshit, ,, how, will, bully, ’, s... | [It, ’, s, bullshit, ,, how, will, bully, ’, s... | [It, ’, bullshit, ,, bully, ’, learn, ’, conse... |
| 9 | Not mine but a friend's story from today this ... | ProRevenge | 1600710776 | [Not, mine, but, a, friend's, story, from, tod... | [Not, mine, but, a, friend's, story, from, tod... | [Not, mine, friend's, story, today, much, alot... |


Create a list of the top 500 tokens in ProRevenge. Export the list of words for use in [Notebook 1](./Tiffany_Baker_Project_3_Reddit_Notebook_1) to create a custom
stopword list.

In [58]:
full_list = []  # list containing all words of all texts
for elmnt in df_prorevenge['unique']:  # loop over lists in df
    full_list += elmnt  # append elements of lists to full list

val_counts = pd.Series(full_list).value_counts()  # make temporary Series to count

pro_top_500_count = val_counts.head(500).astype(int).tolist()
pro_top_500_index = val_counts.head(500).index.tolist()

pro_top_500 = list(zip(pro_top_500_index, pro_top_500_count))
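
The same frequency table can be built without a temporary Series. Here is a minimal sketch using `collections.Counter` on hypothetical token lists; `top_n` is an illustrative helper and not part of the notebook, and the order of tokens with equal counts may differ from what `value_counts` returns.

```python
from collections import Counter
from itertools import chain


def top_n(token_lists, n):
    """Return the n most frequent tokens as (token, count) pairs."""
    counts = Counter(chain.from_iterable(token_lists))
    return counts.most_common(n)


# Hypothetical stand-in for df_prorevenge["unique"]
rows = [["boss", "revenge", "boss"], ["revenge", "boss", "car"]]
top_two = top_n(rows, 2)
words_only = [word for word, _ in top_two]
```

In this sketch `top_two` plays the role of `pro_top_500` and `words_only` the role of `pro_top_500_index`.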

In [59]:
%store pro_top_500
%store pro_top_500_index

Stored 'pro_top_500' (list)
Stored 'pro_top_500_index' (list)


Import the custom stopword list from [Notebook 1](./Tiffany_Baker_Project_3_Reddit_Notebook_1). Clean the tokens with this list using the remix function, then
export the dataframe for use in [Notebook 1](./Tiffany_Baker_Project_3_Reddit_Notebook_1).

In [54]:
%store -r top_of_the_top

In [63]:
def remix(df):
    '''Function to apply custom stop word list ("top_of_the_top") to tokens. 
    Takes dataframe, returns dataframe. Adds column "custom" containing the tokens in "unique" that are 
    not in the custom stop words list.'''
    
    df["custom"] = df["unique"].apply(lambda x: [word for word in x if word not in (top_of_the_top)])
    return df
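
`remix` reads `top_of_the_top` from the notebook's global namespace and tests membership against a list, which is scanned once per token. The sketch below is a variant under two assumptions: the stopword list is passed in explicitly, and it is converted to a set once so lookups are fast. The name `remove_custom` and the sample words are hypothetical.

```python
def remove_custom(token_lists, custom_stop):
    """Drop every token found in custom_stop from each list of tokens."""
    stop_set = set(custom_stop)  # built once; average O(1) membership tests
    return [[w for w in tokens if w not in stop_set] for tokens in token_lists]


# Hypothetical stand-ins for the "unique" column and top_of_the_top
unique = [["got", "boss", "fired"], ["like", "car", "got"]]
custom = ["got", "like"]
cleaned = remove_custom(unique, custom)
```

The result matches the list-based check; only the lookup cost changes, which may matter when several thousand posts are each filtered against a few hundred stopwords.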

In [57]:
df_prorevenge = remix(df_prorevenge)
%store df_prorevenge

Stored 'df_prorevenge' (DataFrame)
