# Summary of this notebook

In this notebook, we collect data from the subreddits [r/composer](https://www.reddit.com/r/composer/) ("Composers") and [r/musicproduction](https://www.reddit.com/r/musicproduction/) ("Producers").  We use [PRAW](https://praw.readthedocs.io/en/stable/) to access the Reddit API.  The other main alternative, [Pushshift](https://reddit-api.readthedocs.io/en/latest/), had recently changed at the time of this project and no longer returned any posts from before November 2022, which would not be nearly enough data for the purposes of this project.

Using PRAW, we can only obtain the most recent ~950 posts on a given subreddit.  This means that in order to collect 1000 or more posts from a given subreddit, we must (1) collect as many posts as we can at a given time, (2) wait a while for new posts to be made, and then (3) add the new posts to our collection of old posts.  We do so below for both the "Composer" and "Music Production" subreddits.

Note that for this notebook to run without errors, you must place a `praw.ini` file in the same folder as this notebook.  The `praw.ini` file should contain your personal Reddit app credentials, which PRAW uses to access Reddit's data.
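
For reference, PRAW reads its settings from a named section of `praw.ini`, and calling `praw.Reddit()` with no arguments (as this notebook does) falls back to the `[DEFAULT]` section. The sketch below writes a placeholder file with the keys a read-only script app typically needs. The helper name `write_praw_ini_template` and all placeholder values are illustrative assumptions, not part of this project; substitute your own app's details.

```python
import configparser
from pathlib import Path


def write_praw_ini_template(path="praw.ini"):
    """Write a placeholder praw.ini; refuse to overwrite an existing file."""
    target = Path(path)
    if target.exists():
        raise FileExistsError(f"{target} already exists; not overwriting it")

    config = configparser.ConfigParser()
    # Placeholder values (assumptions): replace with your own Reddit app details.
    config["DEFAULT"] = {
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "user_agent": "script:subreddit-collector:v0.1 (by u/YOUR_USERNAME)",
    }
    with target.open("w") as f:
        config.write(f)
    return target
```

Keep the real `praw.ini` out of version control, since it holds your client secret.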

## Imports

In [1]:
import requests
import pandas as pd
import praw

## Helper Functions

In [2]:
def extract_info(post):
    output = {}
    output['title'] = post.title
    output['text'] = post.selftext
    output['id'] = post.id
    output['utc'] = post.created_utc
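    #Store the username as a string; author is None when the account was deleted
    output['author'] = str(post.author) if post.author is not None else None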
    
    return output

In [3]:
#For this to work, you need to make a praw.ini file and add it to this folder.
def get_posts(subreddit, lim):
    reddit = praw.Reddit()
    posts = reddit.subreddit(subreddit).new(limit=lim)
    
    output = []
    for post in posts:
        output.append(extract_info(post))
        
    return output

# "Composer" Subreddit

## Get new data with PRAW

In [4]:
composers_results = get_posts('composer', 200)

In [5]:
composers_new = pd.DataFrame(composers_results)
composers_new.set_index('id', inplace=True)

In [6]:
composers_new.head()

id,title,text,utc,author
10dgcto,Please check out my piece “Coming Out Party” b...,Performance: https://youtu.be/0szFCm6kmRs\n\nS...,1673880000.0,1987ScreamBloodyGore
10dg0xg,Bernard Herrmann's scores,Hey guys! I'm trying to find scores (or just e...,1673880000.0,luigii-2000
10ddeo5,My new piano composition,"I wrote this short piece yesterday, and I'd ap...",1673872000.0,RxAxS_TE
10daedt,#InfinitePiChallenge,"Dear Composers,\n\nI am pointing here to a sma...",1673862000.0,musescore1983
10da67m,Disclaimer! I do not know music theory nor can...,I had some ideas and went from there. Musescor...,1673861000.0,MRkaland


## Import old data

The old data was extracted using the same procedure as above.  However, PRAW only allows access to about 950 of the most recent posts.  So to assemble more data, we need to download posts periodically, waiting for new posts to appear between runs.  We can then append these new posts to our old list.
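
A minimal sketch of this accumulate-and-merge step, using small made-up DataFrames rather than real Reddit data. Unlike the timestamp filter used in the cells below, this variant keeps only rows whose post `id` is not already present in the old data. The function name `merge_posts` is a hypothetical helper, not part of this notebook.

```python
import pandas as pd


def merge_posts(old, new):
    """Append rows of `new` whose id is not already in `old` (both indexed by id)."""
    unseen = new[~new.index.isin(old.index)]
    return pd.concat([old, unseen])


# Made-up example data, indexed by post id
old = pd.DataFrame(
    {"title": ["first", "second"], "utc": [100.0, 200.0]},
    index=pd.Index(["a1", "b2"], name="id"),
)
new = pd.DataFrame(
    {"title": ["second", "third"], "utc": [200.0, 300.0]},
    index=pd.Index(["b2", "c3"], name="id"),
)

combined = merge_posts(old, new)
print(list(combined.index))  # ['a1', 'b2', 'c3']
```

Filtering on `id` does not depend on timestamps, so it also behaves sensibly if two posts share the same `utc` value.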

In [7]:
composers_old = pd.read_csv('../data/composers.csv', index_col='id')

#Display the 5 most recent posts from the old data
composers_old.sort_values('utc', ascending=False).head()

id,title,text,utc,author
10dgcto,Please check out my piece “Coming Out Party” b...,Performance: https://youtu.be/0szFCm6kmRs\n\nS...,1673880000.0,1987ScreamBloodyGore
10dg0xg,Bernard Herrmann's scores,Hey guys! I'm trying to find scores (or just e...,1673880000.0,luigii-2000
10ddeo5,My new piano composition,"I wrote this short piece yesterday, and I'd ap...",1673872000.0,RxAxS_TE
10daedt,#InfinitePiChallenge,"Dear Composers,\n\nI am pointing here to a sma...",1673862000.0,musescore1983
10da67m,Disclaimer! I do not know music theory nor can...,I had some ideas and went from there. Musescor...,1673861000.0,MRkaland


## Combine old and new data

In [8]:
#Find the UTC of the most recent post that exists in the old data
newest_old = composers_old.utc.max()
newest_old

1673880471.0

In [9]:
new_posts = composers_new[composers_new['utc']>newest_old]

#Display the oldest posts of the new data
new_posts.sort_values('utc', ascending=True).head()
#The top row here should have a UTC not too much larger than the top row of the
#last displayed dataframe.  If this dataframe is empty, then there haven't
#been any new posts since the last time this notebook was run.

id,title,text,utc,author


In [10]:
composers = pd.concat([composers_old, new_posts])

#How many data points do we now have in total?
len(composers)

1086

In [11]:
#Check that we have no duplicate posts by counting the unique post ids
len(composers.index.unique())
#If there are no duplicates, this cell's output matches the last cell's output

1086
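
If the two counts ever differ, it can help to see which ids are repeated rather than only how many. The sketch below is one way to do that on a made-up DataFrame; `duplicated_ids` is a hypothetical helper name, not part of this notebook.

```python
import pandas as pd


def duplicated_ids(df):
    """Return the index values that occur more than once."""
    return df.index[df.index.duplicated(keep="first")].unique().tolist()


# Made-up example: id 'b2' appears twice
example = pd.DataFrame({"utc": [100.0, 200.0, 200.0]}, index=["a1", "b2", "b2"])
print(duplicated_ids(example))  # ['b2']
print(example.index.is_unique)  # False
```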

## Export results

In [12]:
composers.to_csv('../data/composers.csv', index_label='id')

# "Music Production" Subreddit

## Get new data with PRAW

In [23]:
producers_results = get_posts('musicproduction', 200)

In [24]:
producers_new = pd.DataFrame(producers_results)
producers_new.set_index('id', inplace=True)

In [25]:
producers_new.head()

id,title,text,utc,author
10dggy1,What's the best way to learn music production.,I was hoping to hear some experiences of peopl...,1673881000.0,TraditionAlarming777
10df1s9,I need advice,"So i know basic music theory, How to play a ke...",1673877000.0,EmperorAlpha557
10ddhdn,Need your opinion regarding monitors in untrea...,Been producing for 10 years as a hobby. I have...,1673873000.0,_-RandomDude-_
10db2bk,How to recreate these drums?,Hi. Does anyone know how I can’t recreate dru...,1673864000.0,No_Opportunity6714
10d9v21,What is 'certified producer' and how to become...,"Hey everyone, I recently came across couple of...",1673860000.0,thestrangedavinci


## Import old data

In [26]:
producers_old = pd.read_csv('../data/producers.csv', index_col='id')

#Display the 5 most recent posts from the old data
producers_old.sort_values('utc', ascending=False).head()

id,title,text,utc,author
10dggy1,What's the best way to learn music production.,I was hoping to hear some experiences of peopl...,1673881000.0,TraditionAlarming777
10df1s9,I need advice,"So i know basic music theory, How to play a ke...",1673877000.0,EmperorAlpha557
10ddhdn,Need your opinion regarding monitors in untrea...,Been producing for 10 years as a hobby. I have...,1673873000.0,_-RandomDude-_
10db2bk,How to recreate these drums?,Hi. Does anyone know how I can’t recreate dru...,1673864000.0,No_Opportunity6714
10d9v21,What is 'certified producer' and how to become...,"Hey everyone, I recently came across couple of...",1673860000.0,thestrangedavinci


## Combine old and new data

In [27]:
#Find the UTC of the most recent post that exists in the old data
newest_old = producers_old.utc.max()
newest_old

1673880755.0

In [28]:
new_posts = producers_new[producers_new['utc']>newest_old]

#Display the oldest posts of the new data
new_posts.sort_values('utc', ascending=True).head()
#The top row here should have a UTC not too much larger than the top row of the
#last displayed dataframe.  If this dataframe is empty, then there haven't
#been any new posts since the last time this notebook was run.

id,title,text,utc,author


In [29]:
producers = pd.concat([producers_old, new_posts])

#How many data points do we now have in total?
len(producers)

1245

In [30]:
#Check that we have no duplicate posts by counting the unique post ids
len(producers.index.unique())
#If there are no duplicates, this cell's output matches the last cell's output

1245

## Export results

In [31]:
producers.to_csv('../data/producers.csv', index_label='id')

## What's next?

In the [next notebook](02_data_cleaning.ipynb), we clean the data that we collected in this notebook.