In [9]:
import pandas as pd
import requests
import time
from datetime import datetime

### Authorizing

To make requests to Reddit's API, we have to authenticate ourselves via OAuth2. A few setup steps are needed before we can request an authorization token:

1. Create a [Reddit](https://www.reddit.com) account.
    - Be sure to remember both your username and password
2. Once you're signed in, [create an application](https://www.reddit.com/prefs/apps) to generate the credentials needed to request an authorization token.
    - Scroll all the way down and click `create another app...`
    - Select `script`
    - Enter a name for your application and enter `http://localhost:8080` as your redirect URI
    - Click `create app`
3. Fill in your credentials in the cell below

In [10]:
client_id = 'REDACTED'
client_secret = 'REDACTED'
user_agent = 'NLP'
username = 'data-potato'
password = 'REDACTED'
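Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, the same five values could instead be read from environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_credentials` are assumptions made for illustration, not something Reddit or `requests` requires.

```python
import os

# Hypothetical environment variable names, mapped to the notebook's variable names.
ENV_NAMES = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
    "username": "REDDIT_USERNAME",
    "password": "REDDIT_PASSWORD",
}

def load_credentials(env=os.environ):
    """Return a dict of credentials, raising if any variable is missing or empty."""
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_NAMES.items()}

# Demonstration with a plain dict standing in for os.environ.
fake_env = {name: "example-" + key for key, name in ENV_NAMES.items()}
creds = load_credentials(fake_env)
print(sorted(creds))
```

In the notebook, `load_credentials()` with no argument would read the real environment, and `creds['client_id']` would replace the hard-coded string.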

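Now we're on our way to retrieving our access token. The client ID and secret are sent using HTTP Basic authentication, while the account username and password go in the request body under the `password` grant type.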

In [11]:
auth = requests.auth.HTTPBasicAuth(client_id, client_secret)

data = {
    'grant_type': 'password',
    'username': username,
    'password': password
}
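To make the Basic authentication step less opaque, here is a small sketch of the header that this scheme produces: the two credentials are joined with a colon and base64-encoded. The helper name `basic_auth_header` is hypothetical, and the UTF-8 encoding is an assumption of this sketch; `requests` builds the equivalent header for us when we pass `auth=auth`.

```python
import base64

def basic_auth_header(client_id, client_secret):
    """Build the value of an HTTP Basic 'Authorization' header."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")

# Placeholder credentials, not real ones.
print(basic_auth_header("my-client-id", "my-client-secret"))
```

Base64 is an encoding, not encryption, so the header is only safe because the request goes over HTTPS.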

In [12]:

# Create an informative User-Agent header for your application
headers = {'User-Agent': 'robert626/0.0.1'}

res = requests.post(
    'https://www.reddit.com/api/v1/access_token',
    auth=auth,
    data=data,
    headers=headers)

print(res)

<Response [200]>


If the request succeeded, you received a `200` response code and can save your token. The `expires_in` field of the response gives the token's lifetime in seconds (86400 in the output below, i.e. 24 hours).

In [13]:
res.json()

{'access_token': 'REDACTED',
 'token_type': 'bearer',
 'expires_in': 86400,
 'scope': '*'}
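Since the token expires, it can help to record when a fresh one will be needed. The sketch below turns `expires_in` into an expiry timestamp; `token_expiry` is a hypothetical helper, and it assumes the lifetime is counted from the moment the response was received.

```python
from datetime import datetime, timedelta, timezone

def token_expiry(token_response, issued_at=None):
    """Return the UTC datetime at which the token stops being valid."""
    if issued_at is None:
        issued_at = datetime.now(timezone.utc)
    return issued_at + timedelta(seconds=token_response["expires_in"])

# Same shape as res.json() above, with the token itself left out.
example_response = {"token_type": "bearer", "expires_in": 86400, "scope": "*"}
issued = datetime(2023, 8, 11, 0, 0, tzinfo=timezone.utc)
print(token_expiry(example_response, issued_at=issued).isoformat())
```

Comparing `datetime.now(timezone.utc)` against this value before a long scraping run is one way to decide whether to request a new token first.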

In [14]:
#retrieve access token
token = res.json()['access_token']

Now let's add your access token to the headers and verify that you can successfully make a call to the API.

In [15]:
headers['Authorization'] = f'bearer {token}'

requests.get('https://oauth.reddit.com/api/v1/me', headers=headers).status_code == 200

True

If all went correctly, we can finally create a simple request.

In [16]:
base_url = 'https://oauth.reddit.com/r/'
subreddit = 'dating'
subreddit2 = 'askculinary'

res = requests.get(base_url+subreddit, headers=headers)

Explore the response object. Where is our submission data? How many posts were retrieved by default?

In [17]:
# Check out the response object: the submissions live under data -> children
res.json()['data']['children']
# res.json()['data'].keys()  # other fields of the listing

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'dating',
   'selftext': 'There has been a slow and steady influx of unwanted and misguided conversation plaguing our boards over the last year or so. I don\'t think this is a surprise to any of you all. While we ultimately encourage healthy discussion around both the positives and negatives of dating the overall spirit of this sub has been lost. Many of our readers have expressed their concern to our moderation team and we honestly feel the same way.\n\nOur "No Soap-boxing or Promoting an Agenda" rule has always been on the sidebar for our users to see but I want to stress our current stance on the topic. **Soap-boxing will and has always included red/black-pill ideology, "alpha-male" talk, and the subset of vocabulary that comes with it.** \n\nThis means that using our board to preach about how there is no hope for men (or women) who are conventionally unattractive is unwanted and will be removed. Using our board to 
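Printing the whole response is hard to read, so a short summary can answer the two questions above more directly. This is a sketch that assumes the listing shape seen in the output (a `data` dict holding `children`, each with a `kind` and its own `data`) plus an `after` cursor in `data`; `summarize_listing` is a hypothetical helper, and the sample listing is made up.

```python
from collections import Counter

def summarize_listing(listing):
    """Count the children of a listing and report their kinds and the paging cursor."""
    data = listing["data"]
    children = data["children"]
    return {
        "n_children": len(children),
        "kinds": dict(Counter(child["kind"] for child in children)),
        "after": data.get("after"),
    }

# Made-up listing with the same nesting as res.json().
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_abc123",
        "children": [
            {"kind": "t3", "data": {"title": "First post", "selftext": ""}},
            {"kind": "t3", "data": {"title": "Second post", "selftext": "Body"}},
        ],
    },
}
print(summarize_listing(sample_listing))
```

Calling `summarize_listing(res.json())` in the notebook would show how many posts came back by default without scrolling through the full payload.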

Next, let's pass a parameters dictionary to increase the number of posts returned per request, then create a DataFrame of our submissions.

## Data Sections

**Exercise**: write a loop to retrieve the 1000 most recent submissions. What parameters of the submissions endpoint will be most helpful for you here? [To the docs!](https://www.reddit.com/dev/api/)

## Dating

In [18]:
# Modify the request: ask for up to 100 posts per page
params = {
    'limit': 100,
    'count': 0
}

res = requests.get(base_url + subreddit,
                   headers=headers,
                   params=params)
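To see what the `params` dictionary does, it helps to look at the URL it produces: each key and value ends up in the query string. The sketch below rebuilds that URL with the standard library; `listing_url` is a hypothetical helper, and `requests` performs a similar encoding internally when given `params=params`.

```python
from urllib.parse import urlencode

def listing_url(base_url, subreddit, params):
    """Return the full request URL with params encoded as a query string."""
    return f"{base_url}{subreddit}?{urlencode(params)}"

print(listing_url("https://oauth.reddit.com/r/", "dating", {"limit": 100, "count": 0}))
```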

In [19]:
def fetch_posts(url, headers, params):
    """Return the parsed JSON listing, or None if the request did not succeed."""
    res = requests.get(url, headers=headers, params=params)
    if res.status_code == 200:
        return res.json()
    return None
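The notebook imports `time` but never pauses between requests. Below is a minimal sketch of a throttle helper, under the assumption that Reddit reports its rate-limit state in the `x-ratelimit-remaining` and `x-ratelimit-reset` response headers (worth confirming in the API docs). `seconds_to_wait` is a hypothetical helper, and it takes a plain dict with lowercase keys so it can be tried without a network call.

```python
def seconds_to_wait(response_headers, min_remaining=1.0):
    """Return how long to sleep before the next request, based on rate-limit headers.

    Assumed header meanings: 'x-ratelimit-remaining' is the number of requests
    left in the current window, 'x-ratelimit-reset' the seconds until it resets.
    """
    remaining = response_headers.get("x-ratelimit-remaining")
    reset = response_headers.get("x-ratelimit-reset")
    if remaining is None or reset is None:
        return 0.0  # no rate-limit information available
    if float(remaining) >= min_remaining:
        return 0.0  # still have budget in this window
    return float(reset)

print(seconds_to_wait({"x-ratelimit-remaining": "0.0", "x-ratelimit-reset": "42"}))
```

Inside a paging loop, the result could be passed to `time.sleep(...)` after each response.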

In [12]:
all_post_data = []
post_count = 1000
params = {'limit': 100}  # start from the first page

# While loop requests pages; the inner for loop unpacks each page's posts
while len(all_post_data) < post_count:
    data = fetch_posts(base_url + subreddit, headers=headers, params=params)
    if not data:
        break

    posts = data['data']['children']
    print("Number of posts retrieved in the current response:", len(posts))
    if not posts:
        break

    # Extract relevant information from each post
    for post in posts:
        combined_text = post['data']['title'] + " " + post['data']['selftext']
        all_post_data.append({
            'Title_and_Selftext': combined_text,
            'Author': post['data']['author'],
            'Score': post['data']['score'],
            'URL': post['data']['url'],
            'Created': post['data']['created_utc'],
            'upvote_ratio': post['data']['upvote_ratio']
        })

    # Ask for the page that follows the last post we received
    params = {'limit': 100, 'after': posts[-1]['data']['name']}

    # A short page means there are no more posts, so break out of the loop
    if len(posts) < 100:
        break

Number of posts retrieved in the current response: 101
Number of posts retrieved in the current response: 100
Number of posts retrieved in the current response: 100
Number of posts retrieved in the current response: 100
Number of posts retrieved in the current response: 100
Number of posts retrieved in the current response: 100
Number of posts retrieved in the current response: 100
Number of posts retrieved in the current response: 82
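The paging logic can also be written so that it is testable without hitting the API. The sketch below separates the cursor handling from the HTTP call: `collect_posts` takes any function that maps a params dict to a listing, follows the listing's `after` cursor, and trims the result to the requested total. Both `collect_posts` and `make_fake_fetcher` are hypothetical names, and the fake fetcher only imitates the listing shape seen earlier.

```python
def collect_posts(fetch_page, limit_total=1000, page_size=100):
    """Follow 'after' cursors until limit_total posts are collected or pages run out."""
    collected = []
    after = None
    while len(collected) < limit_total:
        params = {"limit": page_size}
        if after:
            params["after"] = after
        listing = fetch_page(params)
        if not listing:
            break  # request failed
        children = listing["data"]["children"]
        collected.extend(children)
        after = listing["data"].get("after")
        if not children or after is None:
            break  # no further pages
    return collected[:limit_total]

def make_fake_fetcher(n_posts):
    """Build an offline stand-in for the API that serves n_posts fake submissions."""
    posts = [{"kind": "t3", "data": {"name": f"t3_{i}"}} for i in range(n_posts)]

    def fetch_page(params):
        start = 0
        if "after" in params:
            start = int(params["after"].split("_")[1]) + 1
        end = start + params["limit"]
        page = posts[start:end]
        after = page[-1]["data"]["name"] if end < n_posts else None
        return {"data": {"children": page, "after": after}}

    return fetch_page

print(len(collect_posts(make_fake_fetcher(250))))
```

With the real API, `fetch_page` could be `lambda p: fetch_posts(base_url + subreddit, headers, p)`.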


In [14]:
date = pd.DataFrame(all_post_data)

In [15]:
date.head()

| Unnamed: 0 | Title_and_Selftext | Author | Score | URL | Created | upvote_ratio |
|---|---|---|---|---|---|---|
| 0 | r/Dating is NOT the place to soapbox Incel/Bla... | SyCams | 5511 | https://www.reddit.com/r/dating/comments/es2ce... | 1579646000.0 | 0.98 |
| 1 | Fiancé isn’t sexually adventurous with me like... | RevolutionaryAge2239 | 165 | https://www.reddit.com/r/dating/comments/15ktb... | 1691434000.0 | 0.88 |
| 2 | Should I disclose to a woman I use steroids? I... | thr0waway4dayzz | 324 | https://www.reddit.com/r/dating/comments/15kmh... | 1691419000.0 | 0.9 |
| 3 | Girl who rejected me in 2017 approached me ran... | Special-Tight | 130 | https://www.reddit.com/r/dating/comments/15kqc... | 1691427000.0 | 0.91 |
| 4 | Men are interested in me (F30) at the beginnin... | Familiar_Value4651 | 46 | https://www.reddit.com/r/dating/comments/15kyw... | 1691446000.0 | 1.0 |
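The `Created` column holds `created_utc`, which is a Unix timestamp in seconds. A short sketch of converting one of these values to a readable UTC datetime follows; `utc_from_epoch` is a hypothetical helper, and the input is the rounded value displayed in the first row above.

```python
from datetime import datetime, timezone

def utc_from_epoch(seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

print(utc_from_epoch(1579646000.0).isoformat())
```

On the DataFrame itself, `pd.to_datetime(date['Created'], unit='s', utc=True)` would convert the whole column at once.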


In [16]:
#date.to_csv('date_seven.csv', index=False)

## Ask Culinary

In [20]:
sec_post_data = []
post_count = 1000
# Reset params so the 'after' cursor from r/dating is not reused here
params = {'limit': 100}

# While loop requests pages; the inner for loop unpacks each page's posts
while len(sec_post_data) < post_count:
    data = fetch_posts(base_url + subreddit2, headers=headers, params=params)
    if not data:
        break

    posts = data['data']['children']
    print("Number of posts retrieved in the current response:", len(posts))
    if not posts:
        break

    # Extract relevant information from each post
    for post in posts:
        combined_text = post['data']['title'] + " " + post['data']['selftext']
        sec_post_data.append({
            'Title_and_Selftext': combined_text,
            'Author': post['data']['author'],
            'Score': post['data']['score'],
            'URL': post['data']['url'],
            'Created': post['data']['created_utc'],
            'upvote_ratio': post['data']['upvote_ratio']
        })

    # Ask for the page that follows the last post we received
    params = {'limit': 100, 'after': posts[-1]['data']['name']}

    # A short page means there are no more posts, so break out of the loop
    if len(posts) < 100:
        break

Number of posts retrieved in the current response: 102
Number of posts retrieved in the current response: 100
Number of posts retrieved in the current response: 100
Number of posts retrieved in the current response: 100
Number of posts retrieved in the current response: 100
Number of posts retrieved in the current response: 100
Number of posts retrieved in the current response: 100
Number of posts retrieved in the current response: 100
Number of posts retrieved in the current response: 100
Number of posts retrieved in the current response: 67


In [14]:
cul = pd.DataFrame(sec_post_data)

In [15]:
cul.head()

In [16]:
#cul.to_csv('food_seven.csv', index = False)

## Concatenate the Daily CSV Files for Each Subreddit

In [7]:
def concatenate_csv_files(input_files, output_file):
    """Read each CSV in input_files, stack them, and write the result to output_file."""
    dfs = [pd.read_csv(file_path) for file_path in input_files]
    concatenated_df = pd.concat(dfs, ignore_index=True)
    concatenated_df.to_csv(output_file, index=False)

if __name__ == "__main__":
    # Combine the r/dating files from each collection day
    input_files = [
        "../data_collection_days/date_submissions.csv",
        "../data_collection_days/date_two.csv",
        "../data_collection_days/date_three.csv",
        "../data_collection_days/date_four.csv",
        "../data_collection_days/date_five.csv"
    ]
    output_file = "../data_collection_days/date_combined.csv"
    concatenate_csv_files(input_files, output_file)

In [26]:
# Reuse concatenate_csv_files (defined in the previous cell) for the r/AskCulinary files
if __name__ == "__main__":
    input_files = [
        "../data_collection_days/food_submissions.csv",
        "../data_collection_days/food_two.csv",
        "../data_collection_days/food_three.csv",
        "../data_collection_days/food_four.csv",
        "../data_collection_days/food_five.csv"
    ]
    output_file = "../data_collection_days/food_combined.csv"
    concatenate_csv_files(input_files, output_file)
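Because the same subreddits were pulled on several days, the combined files may contain the same post more than once. That is an assumption about the data rather than something checked here. A minimal sketch of dropping repeats, keyed on the `URL` field of the row dictionaries built earlier, could look like this; `dedupe_rows` is a hypothetical helper and the sample rows are made up.

```python
def dedupe_rows(rows, key="URL"):
    """Keep the first row seen for each distinct value of `key`, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] in seen:
            continue
        seen.add(row[key])
        unique.append(row)
    return unique

# Made-up rows in the same shape as all_post_data.
sample_rows = [
    {"URL": "https://example.com/a", "Score": 10},
    {"URL": "https://example.com/b", "Score": 3},
    {"URL": "https://example.com/a", "Score": 12},  # same post, collected on a later day
]
print(len(dedupe_rows(sample_rows)))
```

On a DataFrame, `drop_duplicates(subset='URL')` does the same job and also keeps the first occurrence by default.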

## Item Access

In [92]:
# Fields we can access on each submission
post = res.json()['data']['children'][0]
list(post['data'].keys())


['approved_at_utc',
 'subreddit',
 'selftext',
 'author_fullname',
 'saved',
 'mod_reason_title',
 'gilded',
 'clicked',
 'title',
 'link_flair_richtext',
 'subreddit_name_prefixed',
 'hidden',
 'pwls',
 'link_flair_css_class',
 'downs',
 'thumbnail_height',
 'top_awarded_type',
 'hide_score',
 'name',
 'quarantine',
 'link_flair_text_color',
 'upvote_ratio',
 'author_flair_background_color',
 'subreddit_type',
 'ups',
 'total_awards_received',
 'media_embed',
 'thumbnail_width',
 'author_flair_template_id',
 'is_original_content',
 'user_reports',
 'secure_media',
 'is_reddit_media_domain',
 'is_meta',
 'category',
 'secure_media_embed',
 'link_flair_text',
 'can_mod_post',
 'score',
 'approved_by',
 'is_created_from_ads_ui',
 'author_premium',
 'thumbnail',
 'edited',
 'author_flair_css_class',
 'author_flair_richtext',
 'gildings',
 'content_categories',
 'is_self',
 'mod_note',
 'created',
 'link_flair_type',
 'wls',
 'removed_by_category',
 'banned_by',
 'author_flair_type',
 'dom