# Final Project Data Collection
### Authors: Sinclaire Schuetze and Valerie Tseng
#### Date: March 24, 2022

**Table of Contents**

1. [Collecting Data](#sec1)

<a id="sec1"></a>
## 1. Collecting Data

For our analysis, we begin by collecting 10,000 posts from r/AmItheAsshole to serve as our training data. We will use these posts to determine the typical categories of posts.

In [1]:
import pandas as pd              #library for data manipulation
import praw                      #library for the Reddit API
from psaw import PushshiftAPI    #library Pushshift
import datetime as dt            #library for date management
import matplotlib.pyplot as plt  #library for plotting

In [2]:
api = PushshiftAPI()              #Object of the API

In [3]:
#connecting reddit account to obtain posts
reddit = praw.Reddit(client_id='MXxrtdoyASQVpC98DW9TTw', client_secret='xsj3cNAMWAmkIeZEDxNubYR2iq9tVw', user_agent='CS315Project')

The function data_prep_posts was adapted from a Medium article that explains how to use the Pushshift API:  
  https://medium.com/mcd-unison/using-pushshift-api-for-data-analysis-on-reddit-b08d339c48b8

In [4]:
def data_prep_posts(subreddit, start_time, end_time, limit):
    filters = ['id', 'title', 'selftext'] #Columns we retrieve by default

    posts = list(api.search_submissions(
        subreddit=subreddit,   #Subreddit we want to audit
        after=start_time,      #Start date
        before=end_time,       #End date
        filter=filters,        #Column names we want to retrieve
        limit=limit))          #Max number of posts

    return pd.DataFrame(posts) #Return dataframe for analysis

We set the date range to run from August 1, 2021 to March 28, 2022, as in the cell below. This gives us the most recent topics, which are the most relevant to current events, while leaving a window long enough to collect the number of posts we need.

In [5]:
subreddit = "AmItheAsshole"                             #Subreddit we are auditing
start_time = int(dt.datetime(2021, 8, 1).timestamp())   #Starting date for our search
end_time = int(dt.datetime(2022, 3, 28).timestamp())    #Ending date for our search
limit = 10000                                           #Max number of posts we want to receive

#Retrieve the posts that will serve as our training data
aita_training = data_prep_posts(subreddit, start_time, end_time, limit)



In [6]:
aita_training

Unnamed: 0,created_utc,id,selftext,title,created,d_
0,1648439967,tq0pkr,[removed],AITA for not wanting to visit my mother's side...,1.64845e+09,"{'created_utc': 1648439967, 'id': 'tq0pkr', 's..."
1,1648439924,tq0p5z,[removed],AITA For not allowing my daughters bf at our h...,1.64845e+09,"{'created_utc': 1648439924, 'id': 'tq0p5z', 's..."
2,1648439896,tq0oxd,Chaos is my family. Here's the quick version o...,AITA for feeling bad about my dad being kicked...,1.64845e+09,"{'created_utc': 1648439896, 'id': 'tq0oxd', 's..."
3,1648439895,tq0ox2,"I'm 20M, my friend (""Jack"") is 22M. He invited...",AITA for stranding my friend at a party,1.64845e+09,"{'created_utc': 1648439895, 'id': 'tq0ox2', 's..."
4,1648439867,tq0omw,Me (m32) and my girlfriend (f32) have been dat...,"AITA for ""parenting"" my step daughter and lett...",1.64845e+09,"{'created_utc': 1648439867, 'id': 'tq0omw', 's..."
...,...,...,...,...,...,...
9995,1647750745,tid7dm,My close friend and I (both 22f) have been at ...,AITA for wanting my friend to stop copying me?,1.64777e+09,"{'created_utc': 1647750745, 'id': 'tid7dm', 's..."
9996,1647750679,tid6rz,Picture this: you are going through your daily...,"WIBTA If I don't help my ""friends"" in their re...",1.64777e+09,"{'created_utc': 1647750679, 'id': 'tid6rz', 's..."
9997,1647750446,tid4kk,I’m in 11th grade and he’s in 9th.\n\nBoth of ...,WIBTA if I hung out with a 14 year old outside...,1.64776e+09,"{'created_utc': 1647750446, 'id': 'tid4kk', 's..."
9998,1647750353,tid3r4,My close friend and I (both 22f) have been at ...,AITA for wanting my friend to stop copying me?,1.64776e+09,"{'created_utc': 1647750353, 'id': 'tid3r4', 's..."
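The `created_utc` column holds Unix timestamps in seconds, so it can be converted back to dates to check which part of the requested range the returned posts actually cover. The snippet below is a minimal sketch of that check and not part of the original notebook: the `sample` frame is a hypothetical stand-in that copies the first and last `created_utc` values displayed above, and the explicit UTC timezone on the bounds is an assumption (the original cell uses naive datetimes, which `.timestamp()` interprets in the machine's local timezone).

```python
import datetime as dt

import pandas as pd

# Attach UTC explicitly so the bounds do not depend on the local timezone.
start_time = int(dt.datetime(2021, 8, 1, tzinfo=dt.timezone.utc).timestamp())
end_time = int(dt.datetime(2022, 3, 28, tzinfo=dt.timezone.utc).timestamp())

# Hypothetical stand-in: created_utc values copied from the first and last rows shown above.
sample = pd.DataFrame({"created_utc": [1648439967, 1647750353]})

# Convert epoch seconds back to timezone-aware UTC datetimes.
sample["created_dt"] = pd.to_datetime(sample["created_utc"], unit="s", utc=True)

oldest = sample["created_dt"].min()
newest = sample["created_dt"].max()
print("oldest post:", oldest)
print("newest post:", newest)
```

For the two values used here, the displayed posts run from March 20 to March 28, 2022 (UTC), so the 10,000-post limit was reached well before the search got back to the start date.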


We can drop the columns we don't need.

In [7]:
aita_training = aita_training.drop(['created_utc','created','d_'], axis = 1)

In [8]:
aita_training

Unnamed: 0,id,selftext,title
0,tq0pkr,[removed],AITA for not wanting to visit my mother's side...
1,tq0p5z,[removed],AITA For not allowing my daughters bf at our h...
2,tq0oxd,Chaos is my family. Here's the quick version o...,AITA for feeling bad about my dad being kicked...
3,tq0ox2,"I'm 20M, my friend (""Jack"") is 22M. He invited...",AITA for stranding my friend at a party
4,tq0omw,Me (m32) and my girlfriend (f32) have been dat...,"AITA for ""parenting"" my step daughter and lett..."
...,...,...,...
9995,tid7dm,My close friend and I (both 22f) have been at ...,AITA for wanting my friend to stop copying me?
9996,tid6rz,Picture this: you are going through your daily...,"WIBTA If I don't help my ""friends"" in their re..."
9997,tid4kk,I’m in 11th grade and he’s in 9th.\n\nBoth of ...,WIBTA if I hung out with a 14 year old outside...
9998,tid3r4,My close friend and I (both 22f) have been at ...,AITA for wanting my friend to stop copying me?


The above data collection does not retrieve comments, so we need to go back and get the comments of each post.

In [9]:
#collects the top-level comments for each post
import pandas as pd
import numpy as np
aita_training['comments'] = ''
for post in aita_training.index:
    postId = aita_training.at[post, 'id']
    postComments = [] #list to collect comments for each post
    submission = reddit.submission(id=postId)
    submission.comments.replace_more(limit=0) #removes the "more comments" placeholders from the comment list
    for top_level_comment in submission.comments[1:]: #skips the first comment, which we expect to be posted by the subreddit
        postComments.append(top_level_comment.body) #adds comment to list
    aita_training.at[post, 'comments'] = postComments #.at sets a single cell and avoids chained assignment

In [10]:
aita_training.head()

Unnamed: 0,id,selftext,title,comments
0,tq0pkr,[removed],AITA for not wanting to visit my mother's side...,[]
1,tq0p5z,[removed],AITA For not allowing my daughters bf at our h...,[]
2,tq0oxd,Chaos is my family. Here's the quick version o...,AITA for feeling bad about my dad being kicked...,[]
3,tq0ox2,"I'm 20M, my friend (""Jack"") is 22M. He invited...",AITA for stranding my friend at a party,[]
4,tq0omw,Me (m32) and my girlfriend (f32) have been dat...,"AITA for ""parenting"" my step daughter and lett...","[NTA\n\nThis doesn't sound like a ""parenting c..."


In [11]:
#keep only posts that have at least one collected comment and whose text was not removed
has_comments = aita_training['comments'].apply(len) > 0
not_removed = aita_training['selftext'] != "[removed]"
aita_training = aita_training[has_comments & not_removed]

In [12]:
aita_training

Unnamed: 0,id,selftext,title,comments
4,tq0omw,Me (m32) and my girlfriend (f32) have been dat...,"AITA for ""parenting"" my step daughter and lett...","[NTA\n\nThis doesn't sound like a ""parenting c..."
5,tq0odl,I booked a tutoring session at a late time. I ...,AITA for booking a tutoring session late despi...,"[What does ""hours"" mean in this context? It cl..."
6,tq0no3,I (23M) am a graduate student at an ivy league...,AITA for not calling my cousin last week when ...,[^^^^AUTOMOD ***Thanks for posting! This comm...
7,tq0nnf,I know the title sounds bad enough but here me...,AITA for implying my mom looks like monkey,[^^^^AUTOMOD ***Thanks for posting! This comm...
9,tq0n2h,Me and my husband of 4 years have been going t...,AITA my husband went out with our friend group...,[YTA. What's the point of being separated if h...
...,...,...,...,...
9992,tid8o4,"For the past 8 months, I've had a housemate an...",AITA for trying to get my roommate evicted due...,"[NTA OP, this dude’s constant ongoing referenc..."
9993,tid7t7,I am either a huge asshole or been driven craz...,AITA for caring about this? Give it to me stra...,[So this is pretty toxic. \nYou’ve ended thi...
9994,tid7r5,I’m in 11th grade and he’s in 9th.\n\nBoth of ...,WIBTA if I hang out with a 14 year old outside...,[^^^^AUTOMOD ***Thanks for posting! This comm...
9997,tid4kk,I’m in 11th grade and he’s in 9th.\n\nBoth of ...,WIBTA if I hung out with a 14 year old outside...,[^^^^AUTOMOD ***Thanks for posting! This comm...
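Several of the comment lists above still begin with an automated `^^^^AUTOMOD` comment, so skipping the first comment by position does not always remove it. One possible alternative, sketched here under assumptions, is to filter by text instead. The helper name `drop_automod` and the `marker` default are hypothetical, and the example strings are copied from the truncated output displayed above.

```python
def drop_automod(comments, marker="^^^^AUTOMOD"):
    """Return the comments whose text does not start with the automated marker."""
    return [body for body in comments if not body.lstrip().startswith(marker)]


# Example bodies copied from the truncated display above (illustrative only).
example = [
    "^^^^AUTOMOD  ***Thanks for posting! This comm...",
    "NTA\n\nThis doesn't sound like a \"parenting c...",
]

cleaned = drop_automod(example)
print(len(example), "->", len(cleaned))
```

Applied to the dataframe, this could look like `aita_training['comments'].apply(drop_automod)`. Matching on the comment text is only a heuristic, since a user comment could begin with the same characters.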


In [14]:
#send dataframe to csv to be used in data exploration
file_name = "redditPosts.csv"
aita_training.to_csv(file_name, encoding='utf-8', index=False)
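Because the `comments` column holds Python lists, `to_csv` writes each list as its string representation, and reading the file back yields strings and not lists. The sketch below shows one way to restore the lists with `ast.literal_eval`. It uses a toy dataframe and an in-memory buffer as stand-ins for `aita_training` and `redditPosts.csv`, so those names and values are assumptions for illustration only.

```python
import ast
import io

import pandas as pd

# Toy stand-in for aita_training with a list-valued comments column.
toy = pd.DataFrame({
    "id": ["abc123"],
    "comments": [["NTA, you did nothing wrong", "YTA\n\nSorry, but yes"]],
})

# Write to an in-memory buffer in place of redditPosts.csv.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# After reading back, each cell is the string form of the list.
loaded = pd.read_csv(buffer)
type_after_read = type(loaded.at[0, "comments"])

# Parse each string back into a Python list.
loaded["comments"] = loaded["comments"].apply(ast.literal_eval)
type_after_parse = type(loaded.at[0, "comments"])

print(type_after_read, type_after_parse)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of round trip.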