# 1. Data Collection

This notebook collects and prepares posts from two subreddits, r/weddingplanning and r/divorce, using [Pushshift's API](https://github.com/pushshift/api).

#### Data Collection

- Was enough data gathered to generate a significant result?
- Was data collected that was useful and relevant to the project?
- Was data collection and storage optimized through custom functions, pipelines, and/or automation?
- Was thought given to the server receiving the requests, such as limiting the number of requests per second?

#### Data Cleaning and EDA

- Are missing values imputed/handled appropriately?
- Are distributions examined and described?
- Are outliers identified and addressed?
- Are appropriate summary statistics provided?
- Are steps taken during data cleaning and EDA framed appropriately?
- Does the student address whether they are likely to be able to answer their problem statement with the provided data, given what they discovered during EDA?

In [2]:
# imports
import requests
import pandas as pd
import time

#### Collect and save data for r/weddingplanning and r/divorce

In [26]:
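def collect_posts(subreddit):
    """Pull posts from a given subreddit using the Pushshift API and return them
    as a Pandas DataFrame.

    args:
        subreddit (str): name of a subreddit

    return:
        df (Pandas DataFrame): up to 4000 of the most recent text posts created before
        epoch time 1650599721 (2022-04-22 03:55:21 UTC, which is Thursday, April 21, 2022
        8:55:21 PM at UTC-7). Fewer rows are returned if a request fails or the posts run out.
    """

    url = 'https://api.pushshift.io/reddit/search/submission'

    pages = []  # one DataFrame per successful request
    min_date = 1650599721
    for _ in range(40):  # 40 requests x 100 posts = 4000 posts at most
        params = {
            'subreddit': subreddit,
            'size': 100,
            'is_self': True,
            'meta_data': True,
            'before': min_date
        }
        res = requests.get(url, params=params)
        if res.status_code != 200:
            print('request_failed')
            break
        posts = pd.DataFrame(res.json()['data'])
        if posts.empty:
            # No older posts left to page through
            break
        pages.append(posts)
        # Page backwards: the next request asks for posts older than this batch
        min_date = int(posts['created_utc'].min())
        time.sleep(1)  # pause between requests to limit load on the server
    # Return an empty DataFrame instead of raising if nothing was collected
    return pd.concat(pages) if pages else pd.DataFrame()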
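
`collect_posts` waits one second between requests and stops at the first response that is not a 200. One alternative, sketched here under assumptions, is to retry a failed request with exponentially growing waits. The helper name `fetch_with_backoff`, its parameters, and the injected `fetch` callable are hypothetical and are not part of this notebook or of the Pushshift API.

```python
import time


def fetch_with_backoff(fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it returns a non-None result or retries run out.

    `fetch` is a hypothetical zero-argument callable that returns parsed data
    on success and None on failure. The wait doubles after each failed try.
    """
    for attempt in range(max_retries + 1):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_retries:
            # Waits of base_delay * 1, 2, 4, ... seconds
            sleep(base_delay * 2 ** attempt)
    return None


# Demonstration with a fake fetcher that fails twice before succeeding.
# The waits are recorded in a list instead of actually sleeping.
outcomes = iter([None, None, {"data": []}])
waits = []
demo_result = fetch_with_backoff(lambda: next(outcomes), sleep=waits.append)
```

In this notebook's setting, `fetch` could wrap the `requests.get` call and return `res.json()['data']` only when `res.status_code == 200`.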

In [25]:
# These function calls were used to collect and save the data

# df_wedding = collect_posts('weddingplanning')

# df_wedding.to_csv('../datasets/wedding.csv')

# df_divorce = collect_posts('divorce')

# df_divorce.to_csv('../datasets/divorce.csv')
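
The `to_csv` calls above use the pandas default of writing the index as an unnamed first column, which comes back as a column named `Unnamed: 0` when the file is read. The sketch below uses an in-memory buffer and made-up rows (not the collected data) to show that effect and how `index=False` avoids it.

```python
import io

import pandas as pd

# Made-up stand-in for the collected posts (not real data).
toy = pd.DataFrame({"subreddit": ["a", "b"], "title": ["x", "y"]})

# Default: the index is written as an extra, unnamed column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the real columns are written.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Passing `index_col=0` to `read_csv` is another option when the file has already been written with the index.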

#### Exploratory Data Analysis

In [27]:
# Import data frames already pulled from API

df_wedding = pd.read_csv('../datasets/wedding.csv')

df_divorce = pd.read_csv('../datasets/divorce.csv')

In [15]:
df_wedding.shape

(4000, 78)

In [23]:
df_wedding.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4000 entries, 0 to 99
Data columns (total 78 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  4000 non-null   object 
 1   allow_live_comments            4000 non-null   bool   
 2   author                         4000 non-null   object 
 3   author_flair_css_class         213 non-null    object 
 4   author_flair_richtext          3988 non-null   object 
 5   author_flair_text              213 non-null    object 
 6   author_flair_type              3988 non-null   object 
 7   author_fullname                3988 non-null   object 
 8   author_is_blocked              4000 non-null   bool   
 9   author_patreon_flair           3988 non-null   object 
 10  author_premium                 3988 non-null   object 
 11  awarders                       4000 non-null   object 
 12  can_mod_post                   4000 non-null   bool

In [28]:
df_wedding.isna().sum()

Unnamed: 0                          0
all_awardings                       0
allow_live_comments                 0
author                              0
author_flair_css_class           3787
                                 ... 
author_flair_background_color    4000
banned_by                        3997
edited                           3995
call_to_action                   4000
category                         4000
Length: 79, dtype: int64
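
Several columns in the summary above are entirely or almost entirely null. One possible way to handle them is to compute the share of missing values per column and drop the columns above a cutoff. The toy frame and the 90% threshold below are assumptions chosen for illustration, not a choice made in this notebook.

```python
import pandas as pd

# Toy frame borrowing a few column names from the output above; values are made up.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "category": [None, None, None, None],          # entirely missing
    "author_flair_text": [None, "x", None, None],  # 75% missing
})

# Fraction of missing values in each column (isna gives booleans, mean gives the share).
missing_share = toy.isna().mean()

# Keep only the columns whose missing share is at or below the threshold.
threshold = 0.9
keep = missing_share[missing_share <= threshold].index
trimmed = toy[keep]
```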

In [16]:
df_wedding[['subreddit','selftext','title']].head()

| | subreddit | selftext | title |
|---|---|---|---|
| 0 | weddingplanning | Hello everyone! I’m wondering if it’s common f... | Requiring guest contact info to confirm a rehe... |
| 1 | weddingplanning | [removed] | Officiant Arriving in Town the Day of Wedding ... |
| 2 | weddingplanning | This is one of those things I've never put muc... | How do you find a good officiant? |
| 3 | weddingplanning | We’re in the process of booking our May 2023 w... | Each time we hire a vendor we really only have... |
| 4 | weddingplanning | I love reading y’alls vents so here’s mine. An... | Vent: So many small things |
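
Row 1 of the r/weddingplanning preview has `[removed]` as its `selftext`, which is a placeholder rather than post text. A minimal sketch of filtering such rows follows. The placeholder list (including `[deleted]`) and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: one real post, two placeholders, one missing value.
toy = pd.DataFrame({
    "title": ["t1", "t2", "t3", "t4"],
    "selftext": ["real text", "[removed]", None, "[deleted]"],
})

# Assumed placeholder strings that stand in for missing post bodies.
placeholders = ["[removed]", "[deleted]"]

# Keep rows whose selftext is present and is not a placeholder.
has_text = toy["selftext"].notna() & ~toy["selftext"].isin(placeholders)
cleaned = toy[has_text].reset_index(drop=True)
```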


In [18]:
df_divorce.shape

(3999, 72)
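
The two frames differ in width (78 columns for r/weddingplanning, 72 for r/divorce), so stacking them for comparison needs a shared set of columns. The sketch below works on toy frames, where the `extra` column is invented, and keeps only the columns present in both before concatenating.

```python
import pandas as pd

# Toy frames standing in for df_wedding and df_divorce; "extra" is a made-up column.
left = pd.DataFrame({"subreddit": ["w"], "title": ["a"], "extra": [1]})
right = pd.DataFrame({"subreddit": ["d"], "title": ["b"]})

# Columns present in both frames, in the order they appear in `left`.
shared = [col for col in left.columns if col in right.columns]

# Columns that exist only in `left`, useful for checking what would be dropped.
only_left = sorted(set(left.columns) - set(right.columns))

# Stack the frames on the shared columns with a fresh 0..n-1 index.
combined = pd.concat([left[shared], right[shared]], ignore_index=True)
```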

In [19]:
df_divorce[['subreddit','selftext','title']].head()

| | subreddit | selftext | title |
|---|---|---|---|
| 0 | Divorce | I can’t handle it, the thought of not being hi... | How to cope |
| 1 | Divorce | If my child has done something that requires p... | Using different punishments for your children ... |
| 2 | Divorce | I think sometimes it’s easy to get caught up i... | What is something you’re looking forward to? |
| 3 | Divorce | My wife and I have a pretty comfortable life b... | So... when do you know it's time to move on? |
| 4 | Divorce | Young married couple, no kids, no house (we re... | Could this be relatively simple? |


In [24]:
df_divorce.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3999 entries, 0 to 99
Data columns (total 72 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  3999 non-null   object 
 1   allow_live_comments            3999 non-null   bool   
 2   author                         3999 non-null   object 
 3   author_flair_css_class         0 non-null      object 
 4   author_flair_richtext          3997 non-null   object 
 5   author_flair_text              29 non-null     object 
 6   author_flair_type              3997 non-null   object 
 7   author_fullname                3997 non-null   object 
 8   author_is_blocked              3999 non-null   bool   
 9   author_patreon_flair           3997 non-null   object 
 10  author_premium                 3997 non-null   object 
 11  awarders                       3999 non-null   object 
 12  can_mod_post                   3999 non-null   bool