#### Note:
The problem statement and analysis of the data collected here are found in the [second notebook](https://github.com/q-downgrade/language-trans-queer-communities-nlp/blob/master/code/1-modeling-and-analyzing-data.ipynb) of this project.

## Scraping Data

In [1]:
import pandas as pd
import datetime as dt
import time
import requests

To start, I establish the base URL for the Pushshift API (the Reddit archive API used here) and name the two subreddits I'll gather data from. For this project I'll use r/traaaaaaannnnnnnnnns, a popular subreddit for trans people, and r/actuallesbians, a popular subreddit for queer women however they identify.

In [2]:
# establish URL base
base_url = "https://api.pushshift.io/reddit/search/submission" # also known as the "API endpoint" 

# establish parameters
r_traa = "traaaaaaannnnnnnnnns"
r_lesbian = 'actuallesbians'
size = 500

# construct full url
stem_traa = f"{base_url}?subreddit={r_traa}&size={size}"
stem_les = f"{base_url}?subreddit={r_lesbian}&size={size}"
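
The same stems can also be built with the standard library, which URL-encodes parameter values automatically. This is a minimal sketch, assuming a hypothetical helper named `build_stem` that is not part of the original notebook:

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission"

def build_stem(subreddit, size=500, base_url=BASE_URL):
    """Return the query URL for one subreddit (hypothetical helper)."""
    # urlencode keeps the dict's insertion order and escapes special characters
    query = urlencode({"subreddit": subreddit, "size": size})
    return f"{base_url}?{query}"

stem_example = build_stem("actuallesbians")
print(stem_example)
```

For plain subreddit names like the two used here, this should produce the same string as the f-string version.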


In [3]:
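def get_data(subreddit, stem):
    '''
    Query the Pushshift API for one subreddit and return a list of dataframes.

    subreddit: subreddit name (currently unused; kept so existing calls still work)
    stem: query URL without the `after` parameter
    '''
    # loop variables
    day_window = 5   # days between successive `after` offsets
    n = 8            # number of queries to make

    # establish empty list to hold one dataframe per query
    posts_subreddit = []

    # loop from 1 through n
    for i in range(1, n + 1):
        # create custom URL for this offset
        URL = f"{stem}&after={i * day_window}d"
        print("Querying from: " + URL)
        res = requests.get(URL)
        # stop with an HTTPError on any 4xx/5xx response
        res.raise_for_status()
        # named `data` to avoid shadowing the standard-library `json` module
        data = res.json()['data']
        posts_subreddit.append(pd.DataFrame(data))
        # pause between requests to be polite to the API
        time.sleep(2)

    print("Query complete!")
    return posts_subreddit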
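
To see which URLs the loop in `get_data` requests without touching the network, the offsets can be generated on their own. This is a small sketch; `window_urls` is a hypothetical helper introduced only for illustration:

```python
def window_urls(stem, day_window=5, n=8):
    """List the URLs the loop would request (hypothetical helper)."""
    # i runs from 1 through n, so the offsets are 5d, 10d, ..., 40d by default
    return [f"{stem}&after={i * day_window}d" for i in range(1, n + 1)]

stem = "https://api.pushshift.io/reddit/search/submission?subreddit=actuallesbians&size=500"
urls = window_urls(stem)
for url in urls:
    print(url)
```

With the defaults this yields eight URLs, one per 5-day offset.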

In [4]:
posts_traa = get_data(r_traa, stem_traa)

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=traaaaaaannnnnnnnnns&size=500&after=5d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=traaaaaaannnnnnnnnns&size=500&after=10d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=traaaaaaannnnnnnnnns&size=500&after=15d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=traaaaaaannnnnnnnnns&size=500&after=20d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=traaaaaaannnnnnnnnns&size=500&after=25d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=traaaaaaannnnnnnnnns&size=500&after=30d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=traaaaaaannnnnnnnnns&size=500&after=35d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=traaaaaaannnnnnnnnns&size=500&after=40d
Query complete!


Getting data from r/traaaaaaannnnnnnnnns

In [5]:
posts_les = get_data(r_lesbian, stem_les)

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=actuallesbians&size=500&after=5d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=actuallesbians&size=500&after=10d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=actuallesbians&size=500&after=15d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=actuallesbians&size=500&after=20d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=actuallesbians&size=500&after=25d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=actuallesbians&size=500&after=30d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=actuallesbians&size=500&after=35d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=actuallesbians&size=500&after=40d
Query complete!


Getting data from r/actuallesbians

In [6]:
df_traa = pd.concat(posts_traa, sort = False)

In [7]:
df_les = pd.concat(posts_les, sort = False)

Turning both lists into dataframes.

In [8]:
df_traa.shape

(4000, 75)

In [9]:
df_final = pd.concat([df_traa, df_les], sort = False)


Combining the two dataframes into one.

In [10]:
df_final.shape

(8000, 80)
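
The combined frame has more columns than `df_traa` alone (80 versus 75) because `pd.concat` keeps the union of the columns by default and fills the gaps with missing values. A toy sketch with made-up frames, not the scraped data:

```python
import pandas as pd

# Toy frames standing in for the two subreddit dataframes (made-up data)
a = pd.DataFrame({"title": ["x", "y"], "score": [1, 2]})
b = pd.DataFrame({"title": ["z"], "score": [3], "media": ["m"]})

# The default outer join keeps every column that appears in either frame
combined = pd.concat([a, b], sort=False)
print(combined.shape)                   # (3, 3)
print(combined["media"].isna().sum())   # 2 rows from `a` have no `media` value
```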

In [11]:
df_final.keys()

Index(['all_awardings', 'allow_live_comments', 'author', 'author_cakeday',
       'author_flair_background_color', 'author_flair_css_class',
       'author_flair_richtext', 'author_flair_template_id',
       'author_flair_text', 'author_flair_text_color', 'author_flair_type',
       'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders',
       'banned_by', 'can_mod_post', 'contest_mode', 'created_utc',
       'crosspost_parent', 'crosspost_parent_list', 'distinguished', 'domain',
       'edited', 'full_link', 'gilded', 'gildings', 'id', 'is_crosspostable',
       'is_meta', 'is_original_content', 'is_reddit_media_domain',
       'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_css_class',
       'link_flair_richtext', 'link_flair_template_id', 'link_flair_text',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media',
       'media_embed', 'media_metadata', 'media_only', 'no_follow',
       'num_comments', 'nu

Checking the list of columns in the dataframe. Many of these are Reddit-specific, such as "author_cakeday", which flags posts made on the anniversary of the author's Reddit sign-up. We can drop many of these from the final dataframe.

In [12]:
cols = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 
        'author_flair_text', 'score', 'is_self']
df_final = df_final[cols]

Keeping the title and selftext (the body of the post, when it is a text post) for analysis in the next steps. Also keeping author_flair_text in case it proves helpful or interesting later.

In [13]:
df_final.reset_index(inplace = True)

Resetting the index of the final dataframe.
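
After concatenation the row labels repeat, because each API response starts counting from 0 again. The sketch below uses toy frames to show what `reset_index` does with those labels, with and without `drop=True`:

```python
import pandas as pd

# Two toy "responses", each labelled 0..1 (made-up data)
parts = [pd.DataFrame({"title": ["a", "b"]}), pd.DataFrame({"title": ["c", "d"]})]
stacked = pd.concat(parts, sort=False)
print(list(stacked.index))    # [0, 1, 0, 1]

kept = stacked.reset_index()               # old labels saved in an 'index' column
dropped = stacked.reset_index(drop=True)   # old labels discarded
print(list(kept.columns))     # ['index', 'title']
print(list(dropped.columns))  # ['title']
```

Without `drop=True`, the old labels survive as an extra `index` column, which is why that column shows up in the dataframe afterwards.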

In [14]:
# Creating `timestamp` column using `created_utc` column
df_final["timestamp"] = df_final['created_utc'].map(dt.date.fromtimestamp)
df_final['timestamp'].head()

0    2020-01-21
1    2020-01-21
2    2020-01-21
3    2020-01-21
4    2020-01-21
Name: timestamp, dtype: object
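
Note that `dt.date.fromtimestamp` converts using the local time zone of the machine running the notebook, so posts made near midnight can land on different dates on different machines. If UTC dates are wanted instead, one option is to convert with pandas. This is a sketch using two `created_utc` values from the sample rows in this notebook:

```python
import pandas as pd

# Two epoch-second values taken from the sample rows
created_utc = pd.Series([1579648948, 1579650222])

# Parse as seconds since the epoch, mark as UTC, then keep only the date part
utc_dates = pd.to_datetime(created_utc, unit="s", utc=True).dt.date
print(utc_dates.tolist())
```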

In [15]:
df_final.drop_duplicates(inplace = True)
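
One caveat: `drop_duplicates` compares every column, including the `index` column added by the earlier `reset_index`. If the same post came back from two overlapping queries at different positions, the two rows would not count as duplicates. Whether that happened here is not known from the notebook; the sketch below uses made-up rows, and the choice of key columns is an assumption:

```python
import pandas as pd

# The same post returned twice at different positions (made-up data)
posts = pd.DataFrame({
    "index": [0, 499],
    "title": ["same post", "same post"],
    "author": ["user_a", "user_a"],
    "created_utc": [1579648948, 1579648948],
})

# Comparing all columns keeps both rows because `index` differs
print(len(posts.drop_duplicates()))            # 2

# Comparing only the columns that identify a post removes the repeat
key = ["title", "author", "created_utc"]       # assumed identifying columns
print(len(posts.drop_duplicates(subset=key)))  # 1
```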

In [17]:
df_final.head()

| | index | title | selftext | subreddit | created_utc | author | author_flair_text | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | We Vibin | | traaaaaaannnnnnnnnns | 1579648948 | RemIsAMess | Remy\|FtM\|Peecock Gen 4 | 1 | False | 2020-01-21 |
| 1 | 1 | First week of T be like | | traaaaaaannnnnnnnnns | 1579649007 | bubblegumblueart | Spicy FTMemer | 1 | False | 2020-01-21 |
| 2 | 2 | When you put on breast forms for the first time | | traaaaaaannnnnnnnnns | 1579649035 | craft6886 | | 1 | False | 2020-01-21 |
| 3 | 3 | please, I need to know | | traaaaaaannnnnnnnnns | 1579649862 | sociopathic_muffin | cant get a bf, became the bf | 1 | False | 2020-01-21 |
| 4 | 4 | "I don't have E" is such a h*cking mood | | traaaaaaannnnnnnnnns | 1579650222 | A_Wild_Rat_Appeared | mtr, Male To Rat. | 1 | False | 2020-01-21 |


In [20]:
df_final.drop(columns = 'index', inplace = True)
# drop = True discards the old labels instead of adding another `index` column
df_final.reset_index(drop = True, inplace = True)

In [21]:
#df_final.to_csv('../datasets/data.csv', index = False)
# commented out to prevent overwriting the data used by the models

**Note:** All data was obtained on 1/26/20.