### Gathering Data: Reddit's Pushshift API
---

### Data Dictionary
---

|Feature|Type|Dataset|Description|
|---|---|---|---|
|**subreddit**|*object*|Reddit Pushshift API| Name of the subreddit.|
|**title**|*object*|Reddit Pushshift API| Title of the Reddit post.|
|**selftext**|*object*|Reddit Pushshift API| Text in the body of the post.|
|**author**|*object*|Reddit Pushshift API| Author of the post.|
|**num_comments**|*int64*|Reddit Pushshift API| Number of comments on the post.|
|**num_crossposts**|*int64*|Reddit Pushshift API| Number of crossposts (shares) of the post.|
|**total_awards_received**|*int64*|Reddit Pushshift API| Number of awards the post received.|
|**upvote_ratio**|*float64*|Reddit Pushshift API| Fraction of all votes on the post that are upvotes.|
|**removed_by_category**|*object*|Reddit Pushshift API| Indicates who removed the post from the subreddit, usually a moderator.|
|**created_utc**|*int64*|Reddit Pushshift API| Time the post was published, as a Unix timestamp in UTC.|
|**url**|*object*|Reddit Pushshift API| Direct URL of the Reddit post.|
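
Since `created_utc` is stored as an integer, it can help to see how such a value maps to a readable date. This is a minimal sketch, assuming the value is a Unix timestamp in seconds; the example value is made up and not taken from the scraped data.

```python
from datetime import datetime, timezone

# Hypothetical created_utc value, not taken from the scraped data
created_utc = 1600000000

# Interpret the integer as seconds since the Unix epoch, in UTC
posted_at = datetime.fromtimestamp(created_utc, tz=timezone.utc)
print(posted_at.isoformat())
```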

### Imports
---

In [5]:
import requests
import pandas as pd
import time

In [162]:
# Pushshift API URL for Reddit submissions
url = 'https://api.pushshift.io/reddit/search/submission'

# Function for scraping a subreddit for a given number of posts
def get_posts(amount, sub_reddit):
    count = 0
    posts = []
    while count < amount:
        # First pass: fill the list with the 100 most recent posts
        if len(posts) == 0:
            params = {
                'subreddit': sub_reddit,
                'size': 100
            }
            res = requests.get(url, params=params)
            print('current status code:', res.status_code)
            max_posts_received = res.json()['data']
            posts += max_posts_received

        # Later passes: pause, then fetch the next 100 posts older than the last one collected
        else:
            time.sleep(2)
            params = {
                'subreddit': sub_reddit,
                'size': 100,
                'before': posts[-1]['created_utc']
            }
            res = requests.get(url, params=params)
            print('current status code:', res.status_code)
            max_posts_received = res.json()['data']
            posts += max_posts_received
            print(f"current number of posts:{len(posts)}")

        # Stop if the API returned nothing, otherwise the loop would never end
        if not max_posts_received:
            break
        count = len(posts)
    return posts
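
The paging logic above (request a batch, then ask for posts `before` the oldest one seen) can be sketched separately from the network call. This is a minimal sketch rather than the notebook's implementation: `collect_posts`, `fetch_page`, and `fake_fetch` are hypothetical names, and the fake fetcher stands in for the Pushshift request so the loop can be run offline. The sketch also stops on an empty batch and trims the result to the requested amount.

```python
def collect_posts(fetch_page, amount, page_size=100):
    """Collect up to `amount` posts, paging backwards in time.

    `fetch_page(before, size)` is a hypothetical callable that returns a
    list of post dicts older than `before` (or the newest posts when
    `before` is None), newest first.
    """
    posts = []
    before = None
    while len(posts) < amount:
        batch = fetch_page(before, page_size)
        if not batch:
            # Nothing older is available, so stop instead of looping forever
            break
        posts.extend(batch)
        # The oldest post in this batch becomes the cursor for the next request
        before = batch[-1]['created_utc']
    # The last batch can overshoot the target, so trim to the requested size
    return posts[:amount]


def fake_fetch(before, size):
    """Stand-in for the API: 250 fake posts with timestamps 250 down to 1."""
    newest = 250 if before is None else before - 1
    oldest = max(newest - size + 1, 1)
    return [{'created_utc': t} for t in range(newest, oldest - 1, -1)]


sample = collect_posts(fake_fetch, amount=230)
print(len(sample), sample[0]['created_utc'], sample[-1]['created_utc'])
```

In the notebook's setting, `fetch_page` would wrap the `requests.get` call together with the `time.sleep(2)` pause.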


In [163]:
# Scrape 8,000 posts from r/datingoverforty
dating_ov_40_posts = get_posts(8000, 'datingoverforty')

current status code: 200
current status code: 200
current number of posts:200
current status code: 200
current number of posts:300
current status code: 200
current number of posts:400
current status code: 200
current number of posts:500
current status code: 200
current number of posts:600
current status code: 200
current number of posts:700
current status code: 200
current number of posts:800
current status code: 200
current number of posts:900
current status code: 200
current number of posts:1000
current status code: 200
current number of posts:1100
current status code: 200
current number of posts:1200
current status code: 200
current number of posts:1300
current status code: 200
current number of posts:1400
current status code: 200
current number of posts:1500
current status code: 200
current number of posts:1600
current status code: 200
current number of posts:1700
current status code: 200
current number of posts:1800
current status code: 200
current number of posts:1900
current sta

In [105]:
# Scrape 30,000 posts from r/datingoverthirty
dating_ov_30_posts = get_posts(30000, 'datingoverthirty')

current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200
current status code: 200


In [181]:
# Confirming the number of posts scraped from Reddit
print('The number of dating over 30 posts:',len(dating_ov_30_posts))
print('The number of dating over 40 posts:',len(dating_ov_40_posts))

The number of dating over 30 posts: 30099
The number of dating over 40 posts: 8000


In [205]:
# list of features to keep for future evaluation of the data
features = [
    'subreddit',
    'title', 
    'selftext', 
    'author', 
    'num_comments', 
    'num_crossposts', 
    'total_awards_received', 
    'upvote_ratio', 
    'removed_by_category', 
    'created_utc',
    'url'
]
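
Selecting columns with `[features]` raises a `KeyError` if one of these fields is absent from every scraped post. As a hedged alternative, the sketch below pulls the same fields from raw post dictionaries with `dict.get`, so a missing field becomes `None`. The `raw_posts` records and the helper name `keep_features` are invented for illustration.

```python
features = [
    'subreddit', 'title', 'selftext', 'author', 'num_comments',
    'num_crossposts', 'total_awards_received', 'upvote_ratio',
    'removed_by_category', 'created_utc', 'url',
]


def keep_features(post, wanted):
    """Return only the wanted keys; absent keys are filled with None."""
    return {key: post.get(key) for key in wanted}


# Made-up records: the second one lacks 'upvote_ratio' and has an extra key
raw_posts = [
    {'subreddit': 'datingoverforty', 'title': 'Example A', 'upvote_ratio': 0.9},
    {'subreddit': 'datingoverforty', 'title': 'Example B', 'score': 3},
]

rows = [keep_features(post, features) for post in raw_posts]
print(rows[1]['upvote_ratio'], len(rows[1]))
```

A list of such rows can be passed to `pd.DataFrame` in the same way as the raw posts.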

In [206]:
# Build a DataFrame of the dating over 30 posts, keeping only the features that I think will be helpful for classification.
df_30 = pd.DataFrame(dating_ov_30_posts)[features]

# Writing to CSV file
df_30.to_csv('./Raw Reddit Data/raw_date_over_30.csv', index=False)

In [207]:
# Build a DataFrame of the dating over 40 posts, keeping only the features that I think will be helpful for classification.
df_40 = pd.DataFrame(dating_ov_40_posts)[features]

# Writing to CSV file
df_40.to_csv('./Raw Reddit Data/raw_date_over_40.csv', index=False)

In [208]:
df_40.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   subreddit              8000 non-null   object 
 1   title                  8000 non-null   object 
 2   selftext               7969 non-null   object 
 3   author                 8000 non-null   object 
 4   num_comments           8000 non-null   int64  
 5   num_crossposts         8000 non-null   int64  
 6   total_awards_received  8000 non-null   int64  
 7   upvote_ratio           7619 non-null   float64
 8   removed_by_category    1708 non-null   object 
 9   created_utc            8000 non-null   int64  
 10  url                    8000 non-null   object 
dtypes: float64(1), int64(4), object(6)
memory usage: 687.6+ KB
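
The non-null counts above translate directly into missing-value counts. Here is a small worked calculation that uses only the numbers printed by `df_40.info()`; the variable names are illustrative.

```python
total_rows = 8000

# Non-null counts copied from the df_40.info() output
non_null = {
    'selftext': 7969,
    'upvote_ratio': 7619,
    'removed_by_category': 1708,
}

# Missing values per column are the rows left over after the non-null ones
missing = {column: total_rows - count for column, count in non_null.items()}

for column, n_missing in missing.items():
    print(f"{column}: {n_missing} missing ({n_missing / total_rows:.2%})")
```

Because `removed_by_category` is only filled in for removed posts, its 1,708 non-null rows suggest that roughly 21% of the scraped posts were removed.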


From here, let's move on to EDA and modeling in the EDA & Modeling.ipynb notebook.