# Data Collection & Cleaning

In [1]:
import pandas as pd
import numpy as np

import requests

## Keto Subreddit: Submissions

In [2]:
def submission(subreddit, before):
    """
    Return the submissions from `subreddit` that were created before the `before` UTC timestamp (epoch seconds).
    """
    params = {'subreddit' : subreddit, 'size' : 500, 'before' : before}
    res = requests.get('https://api.pushshift.io/reddit/search/submission', params)
    reddit_submissions = res.json()
    posts = reddit_submissions['data']
    return posts

#### Pulling 1st Submission
- To establish a DataFrame that the rows from the later automated pulls are appended to

In [54]:
keto_post = submission('keto', None)

In [64]:
keto_subs = pd.DataFrame(keto_post, columns = ['subreddit', 'title', 'selftext'])
keto_subs

Unnamed: 0,subreddit,title,selftext
0,keto,Tudo começa com o primeiro passo,[removed]
1,keto,Fathead dough - what do you make with it?,[removed]
2,keto,Water fast to jump start keto?,[removed]
3,keto,Can stress kick you out of keto? Cortisol ques...,So I've been losing a steady 4 lbs a week for ...
4,keto,down almost 20 lbs before and after pics (slig...,"I'm 23, and 5'6. I was someone who struggled w..."
...,...,...,...
95,keto,Keto breakfast. is there a cereal I can eat th...,[removed]
96,keto,Suspicious of Electrolyte Powder,"I am in my third week of keto, feeling amazing..."
97,keto,Back on track!,So back in February I posted on here to figure...
98,keto,Changes in back pain on keto or w/ weight loss?,"Hey everyone,\n\nWanted to start off by saying..."


In [82]:
# grab 'created_utc' from the last post in our first submissions pull to utilize in loop

keto_post[-1]['created_utc']

1619132798

#### Creating a loop for data collection:

In [87]:
# goal was to run enough times to gather over 5k rows of data

# this is the 'created_utc' from the last post in the first submissions pull
before = 1619132798

for i in range(2, 81):
    # step the timestamp back by a large enough amount each time, hoping to grab unique rows
    before -= 200_000

    # append new rows to the dataframe created from the first submissions pull
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    new_rows = pd.DataFrame(submission('keto', before))[['subreddit', 'title', 'selftext']]
    keto_subs = pd.concat([keto_subs, new_rows])
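
The loop above steps `before` back by a fixed 200,000 seconds per request, so it can skip or repeat posts depending on how busy the subreddit was. One possible alternative, sketched here and not part of the original run, is to continue from the oldest `created_utc` in each page. `collect_pages`, `fetch`, and `max_pages` are hypothetical names; `fetch` stands for any callable with the same signature as `submission`, and the sketch assumes the API treats `before` as exclusive.

```python
def collect_pages(fetch, subreddit, before=None, max_pages=80):
    """Page backwards through submissions using each page's oldest timestamp.

    `fetch(subreddit, before)` is assumed to return a list of dicts that
    each carry a 'created_utc' key, as `submission` does above.
    """
    posts = []
    for _ in range(max_pages):
        page = fetch(subreddit, before)
        if not page:  # an empty page means there is nothing older to pull
            break
        posts.extend(page)
        # continue from the oldest post in this page
        before = min(post['created_utc'] for post in page)
    return posts
```

With the function from this notebook, a call would look like `collect_pages(submission, 'keto')`, and the result can be passed to `pd.DataFrame` as before. The `max_pages` cap keeps the number of requests bounded even if a page repeats.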

In [88]:
# check what was collected

keto_subs

Unnamed: 0,subreddit,title,selftext
3,keto,Can stress kick you out of keto? Cortisol ques...,So I've been losing a steady 4 lbs a week for ...
4,keto,down almost 20 lbs before and after pics (slig...,"I'm 23, and 5'6. I was someone who struggled w..."
11,keto,Pulsatile tinnitus and Keto,"I have been doing Keto for around 10 weeks, ha..."
13,keto,Does Keto Work? (Emphatic Yes),Preface - Over 50 Soldier here. (Male)\n\nDepl...
15,keto,Homemade meal replacement type shake recipe thing,I really struggle to get enough calories in on...
...,...,...,...
95,keto,Does anyone with Hypothyroidism have experienc...,[removed]
96,keto,QQ: Currently I'm in a Keto diet and Intermitt...,[removed]
97,keto,Keto success stories -How she did it??,[removed]
98,keto,5 months 172 to 126 and still going,A year ago I was at my biggest and I decided t...


In [89]:
# filter out for '[removed]'

keto_subs = keto_subs[keto_subs['selftext'] != '[removed]']

In [90]:
# filter out for '[deleted]'

keto_subs = keto_subs[keto_subs['selftext'] != '[deleted]']

In [91]:
# check for null values

keto_subs.isna().sum()

subreddit      0
title          0
selftext     110
dtype: int64

In [92]:
# drop null values

keto_subs.dropna(inplace = True)

In [93]:
# ensure null values were dropped

keto_subs.isna().sum()

subreddit    0
title        0
selftext     0
dtype: int64

In [94]:
# check 'title' column

keto_subs['title'].value_counts(ascending = False)

Anyone wanna be weight loss buddy                                           8
Exercise after keto for muscles                                             7
I can’t eat more than 1000 calories daily                                   5
Keto help, I’m desperate                                                    5
Question on how much fat I should be eating.                                5
                                                                           ..
almost 2 months keto progress                                               1
Intermittent Fasting and Keto                                               1
Has anyone had Hyperthyroidism / "Feeling hot all the time" before Keto?    1
Yet another keto success story w/ pictures                                  1
Need help with Salmon Oil dosage                                            1
Name: title, Length: 3941, dtype: int64

In [131]:
# referenced https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
# dropping duplicate titles

keto_subs.drop_duplicates(subset = ['title'], inplace = True)

In [96]:
keto_subs['selftext'].value_counts(ascending = False)

Hello /r/keto Community!\n\nPlease use this support thread to talk freely and support each other. \*\*We've switched up the format to last 2 days so that there's more time for interaction on questions and answers.\*\*\n\nAll visitors, new and old, are kindly reminded to observe the sidebar rules, check the FAQ, and use the Search Bar before creating new posts.\n\n*If you're new to* /r/keto *and need some info, start with* [*Keto in a Nutshell*](https://www.reddit.com/r/keto/wiki/keto_in_a_nutshell) *and* [*the FAQ*](https://www.reddit.com/r/keto/wiki/faq)*. Or, if you have a question that doesn't seem to be covered, head on over to the Community Support thread (pinned to the top of the subreddit) and ask the community!*                                                                                                                                                                                                                                                                               

In [97]:
# also drop duplicates of 'selftext'

keto_subs.drop_duplicates(subset = ['selftext'], inplace = True)

In [134]:
# check shape to see how many rows are left

keto_subs.shape

(3841, 3)

In [132]:
# reset index after all the dropping that was done

keto_subs.reset_index(drop = True, inplace = True)
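
The cleaning cells above can also be expressed as one reusable function, which may help when the same steps are repeated for another subreddit. This is a minimal sketch: `clean_submissions` is a hypothetical helper, not something defined in the original notebook, and it returns a new frame instead of modifying the input in place.

```python
import pandas as pd

def clean_submissions(df):
    """Return a cleaned copy: no removed/deleted/null posts, no duplicates."""
    placeholders = ['[removed]', '[deleted]']
    cleaned = (
        df[~df['selftext'].isin(placeholders)]  # drop placeholder posts
        .dropna()                               # drop rows with null values
        .drop_duplicates(subset=['title'])      # keep the first of each title
        .drop_duplicates(subset=['selftext'])   # keep the first of each body
        .reset_index(drop=True)                 # renumber rows from 0
    )
    return cleaned
```

Usage would be `keto_subs = clean_submissions(keto_subs)`. The order of the steps mirrors the cells above, so titles are de-duplicated before post bodies.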

In [133]:
# final review of the keto df before writing to csv

keto_subs

Unnamed: 0,subreddit,title,selftext
0,keto,Can stress kick you out of keto? Cortisol ques...,So I've been losing a steady 4 lbs a week for ...
1,keto,down almost 20 lbs before and after pics (slig...,"I'm 23, and 5'6. I was someone who struggled w..."
2,keto,Pulsatile tinnitus and Keto,"I have been doing Keto for around 10 weeks, ha..."
3,keto,Does Keto Work? (Emphatic Yes),Preface - Over 50 Soldier here. (Male)\n\nDepl...
4,keto,Homemade meal replacement type shake recipe thing,I really struggle to get enough calories in on...
...,...,...,...
3836,keto,Guess we are all different,So i am starting my third week on keto and i h...
3837,keto,I feel like I’m never full and I’m overwhelmed...,Currently I’m more loosely following Keto whil...
3838,keto,"2 months, &lt;30g carbs per day, 1 lb lost, st...",Okay guys. What am I doing wrong? I’m keeping...
3839,keto,5 months 172 to 126 and still going,A year ago I was at my biggest and I decided t...


In [135]:
# write cleaned keto Submissions to CSV

keto_subs.to_csv('keto_subs.csv', index = False)

---

## Nutrition Subreddit: Submissions

In [7]:
def submission(subreddit, before):
    """
    Return the submissions from `subreddit` that were created before the `before` UTC timestamp (epoch seconds),
    or -1 if the request does not return status code 200.
    """
    params = {'subreddit' : subreddit, 'size' : 500, 'before' : before}
    res = requests.get('https://api.pushshift.io/reddit/search/submission', params)
    
    if res.status_code == 200:
        reddit_submissions = res.json()
        return reddit_submissions['data']
    else:
        return -1
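
Returning `-1` works, but it forces every caller to check for a sentinel before building a DataFrame. A possible variation, shown only as a sketch, returns an empty list on failure so the result is always a list. `fetch_submissions` and `PUSHSHIFT_URL` are hypothetical names, and `get` is assumed to be any callable that behaves like `requests.get`.

```python
PUSHSHIFT_URL = 'https://api.pushshift.io/reddit/search/submission'

def fetch_submissions(get, subreddit, before, size=500):
    """Return the list of posts, or an empty list if the request fails.

    `get` is any callable with the signature of `requests.get`.
    """
    params = {'subreddit': subreddit, 'size': size, 'before': before}
    res = get(PUSHSHIFT_URL, params=params)
    if res.status_code != 200:
        return []  # failed request: nothing to add
    return res.json().get('data', [])
```

In the notebook this would be called as `fetch_submissions(requests.get, 'nutrition', before)`. Passing `get` in as an argument also makes the function easy to exercise with a stand-in response object.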

#### Pulling 1st Submission
- To establish a DataFrame that the rows from the later automated pulls are appended to

In [3]:
nutrition_post = submission('nutrition', None)

In [4]:
nutrition_subs = pd.DataFrame(nutrition_post, columns = ['subreddit', 'title', 'selftext'])
nutrition_subs

Unnamed: 0,subreddit,title,selftext
0,nutrition,Can diet soda/sugar alcohols cause water reten...,Can food replacements for “bad” food lead to w...
1,nutrition,What is the healthiest way to drink alcohol?,"So, alcohol is not something you want as part ..."
2,nutrition,Struggling losing weight,[removed]
3,nutrition,calcium RDI - how would you achieve this figur...,dairy has gotten a bad rap for its saturated f...
4,nutrition,calcium RDI - how would you achieve this figur...,[removed]
...,...,...,...
95,nutrition,Plant-Based News and PCRM - are they trustworthy?,"Ever since I got into nutrition, I have had vi..."
96,nutrition,Is oatmeal healthier with boiled water on the ...,I'm not sure if oatmeal is better by adding bo...
97,nutrition,What do Viruses and pathogens feed on?,"Hello, I have just watched a best selling auth..."
98,nutrition,Do Eggs feed Viruses and Pathogens?,[removed]


In [5]:
# grab 'created_utc' from the last post in our first submissions pull to utilize in loop

nutrition_post[-1]['created_utc']

1619331270

#### Creating a loop for data collection:

In [8]:
# goal was to run enough times to gather over 5k rows of data

# this is the 'created_utc' from the last post in the first submissions pull
before = 1619331270

for i in range(2, 81):
    # step the timestamp back by a large enough amount each time, hoping to grab unique rows
    before -= 200_000

    submission_res = submission('nutrition', before)

    # submission() returns -1 when res.status_code != 200; skip that request
    if submission_res == -1:
        continue

    # append new rows to the dataframe created from the first submissions pull
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    new_rows = pd.DataFrame(submission_res)[['subreddit', 'title', 'selftext']]
    nutrition_subs = pd.concat([nutrition_subs, new_rows])

In [9]:
# check what was collected

nutrition_subs

Unnamed: 0,subreddit,title,selftext
0,nutrition,Can diet soda/sugar alcohols cause water reten...,Can food replacements for “bad” food lead to w...
1,nutrition,What is the healthiest way to drink alcohol?,"So, alcohol is not something you want as part ..."
2,nutrition,Struggling losing weight,[removed]
3,nutrition,calcium RDI - how would you achieve this figur...,dairy has gotten a bad rap for its saturated f...
4,nutrition,calcium RDI - how would you achieve this figur...,[removed]
...,...,...,...
95,nutrition,How much mussels per day/week/month should I e...,[removed]
96,nutrition,Is protein powder good or bad?,[removed]
97,nutrition,Expired protein powder consumption yes or no?,[removed]
98,nutrition,Protein Powder consumption after the Expiry date,[removed]


In [10]:
# filter out for '[removed]'

nutrition_subs = nutrition_subs[nutrition_subs['selftext'] != '[removed]']

In [11]:
# filter out for '[deleted]'

nutrition_subs = nutrition_subs[nutrition_subs['selftext'] != '[deleted]']

In [12]:
# check for null values

nutrition_subs.isna().sum()

subreddit      0
title          0
selftext     211
dtype: int64

In [13]:
# drop null values

nutrition_subs.dropna(inplace = True)

In [14]:
# ensure null values were dropped

nutrition_subs.isna().sum()

subreddit    0
title        0
selftext     0
dtype: int64

In [15]:
# check 'title' column

nutrition_subs['title'].value_counts(ascending = False)

Question about muscles and vitamins.                                                                             24
/r/Nutrition Weekly Personal Nutrition Discussion Post - All Personal Diet Questions Go Here                     13
Seeking diet advice for an suboptimal diet environment.                                                          12
What foods will make me release lots of farts and they smell bad?                                                12
Is it okay to workout on an empty stomach in the morning, and have a small mid workout snack?                    12
                                                                                                                 ..
Calculate macro percentages from nutrition label?                                                                 1
Cookbook meal prep healthy gut?                                                                                   1
Daily macro goals                                                       

In [16]:
# referenced https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
# dropping duplicate titles

nutrition_subs.drop_duplicates(subset = ['title'], inplace = True)

In [17]:
# check 'selftext' column

nutrition_subs['selftext'].value_counts(ascending = False)

Welcome to Science Friday here in /r/Nutrition.  This is the weekly post for science supported discussion on the latest news, developments, and research in nutrition science.   \n\n**Rules for Discussion**\n\n* This post is only for discussion of recent nutrition news and research. \n\n* ALL responses must support any claims made by including links to science based evidence / studies / data. Including those listed below, other sources of nutrition information can be found at the [**USDA Food Composition Database**](https://ndb.nal.usda.gov/ndb/search/list), [**NutritionData**](http://nutritiondata.self.com/), [**Nutrition Journal**](https://nutritionj.biomedcentral.com/), and [**Nutrition.gov**](https://www.nutrition.gov/) (a service of the National Agricultural Library).\n\n* Keep it civil. [reddiquette](https://www.reddit.com/wiki/reddiquette/) is required**. If you disagree about the science, the source(s), or the interpretation(s) then you must do so civilly.  Any personal attacks 

In [18]:
# also drop duplicates of 'selftext'

nutrition_subs.drop_duplicates(subset = ['selftext'], inplace = True)

In [19]:
# check shape to see how many rows are left

nutrition_subs.shape

(3071, 3)

In [20]:
# reset index after dropping all the rows from cleaning

nutrition_subs.reset_index(drop = True, inplace = True)

In [21]:
# final review of the nutrition df before writing to csv

nutrition_subs

Unnamed: 0,subreddit,title,selftext
0,nutrition,Can diet soda/sugar alcohols cause water reten...,Can food replacements for “bad” food lead to w...
1,nutrition,What is the healthiest way to drink alcohol?,"So, alcohol is not something you want as part ..."
2,nutrition,calcium RDI - how would you achieve this figur...,dairy has gotten a bad rap for its saturated f...
3,nutrition,Are there nutritional drawbacks to eating too ...,Hey r/nutrition. I’m a newbie in this field bu...
4,nutrition,You think vega one have all essential vitamins?,Looking for that extra protein powder with som...
...,...,...,...
3066,nutrition,What are y'alls thoughts on this? Any research...,[https://www.youtube.com/watch?v=jwagCofBDj8&a...
3067,nutrition,Nutrition Question,You are watching the Kansas City Chiefs and Sa...
3068,nutrition,Nutrition For Spinal Discs,Have you seen anything about strengthening spi...
3069,nutrition,Nutrients in homemade organ broth,"I don’t like eating solid lamb/beef organs, bu..."


In [22]:
# write cleaned nutrition Submissions to CSV

nutrition_subs.to_csv('nutrition_subs.csv', index = False)

### Summary:
- Collected Keto + Nutrition Subreddit Submissions data
- Basic data cleaning
    - Dropping duplicate entries of 'title' and 'selftext'
    - Dropping '[removed]' and '[deleted]' posts, plus rows with null values

### Next Steps:
- Concat both subreddits and begin modeling!
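
The concatenation step could look something like the sketch below. `combine_subreddits` is a hypothetical helper, not code from this notebook, and it only stacks the frames; any target encoding for modeling is left for the next stage.

```python
import pandas as pd

def combine_subreddits(*frames):
    """Stack the cleaned per-subreddit frames into one frame with a fresh index."""
    return pd.concat(frames, ignore_index=True)
```

Assuming the two CSV files written above, usage would be `combine_subreddits(pd.read_csv('keto_subs.csv'), pd.read_csv('nutrition_subs.csv'))`.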