# Project 3: Web APIs & NLP

### Notebook Contents:

1. Import the packages I need.
2. Use the Pushshift API to pull posts from the teenagers subreddit, 100 rows per request.
    - A loop repeats the 100-row request many times to collect more data.
    - A sleep call between requests keeps me from overloading the server.
3. Create a dataframe that contains only the subreddit, title, and selftext columns.
4. Clean the teenagers subreddit data: combine title and selftext into one column, and remove nulls and removed posts.
5. Run CountVectorizer on the combined title and selftext column (user_post).
6. Save teenagers_df and export it to my data folder.

### Imports

In [1]:
# Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

import requests
from bs4 import BeautifulSoup

# Use this to control the scrape rate!
from time import sleep



### Subreddit #2: Teenagers

In [2]:
# Pushshift submission search endpoint (from the video Riley Dallas posted)
url = 'https://api.pushshift.io/reddit/search/submission'

In [3]:
# Setting parameters
params = {
    'subreddit' : 'teenagers', # Subreddit to pull from
    'size' : 100, # Posts per request; the API returns at most 100 at a time
    'before' : 1606606392 # Epoch timestamp (from Riley Dallas); only posts created before it are returned, which lets me page back for more than 100 posts
}
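The before value is a Unix epoch timestamp in seconds. As a small sketch using only the standard library (the variable names are illustrative, not from the notebook), it can be converted to a readable UTC time and back when choosing a different cutoff:

```python
from datetime import datetime, timezone

before = 1606606392  # same value as in params above

# Epoch seconds -> timezone-aware UTC datetime
cutoff = datetime.fromtimestamp(before, tz=timezone.utc)
cutoff_text = cutoff.isoformat()

# UTC datetime -> epoch seconds, e.g. to build a new 'before' value
round_trip = int(cutoff.timestamp())

print(cutoff_text)
print(round_trip)
```

Passing tz=timezone.utc avoids the result depending on the local time zone of the machine running the notebook.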

In [4]:
# Send the request with my url and parameters
res = requests.get(url, params)
# Check the status code (200 means the request succeeded)
res.status_code

200

In [5]:
# Setting data as json
data = res.json()
# Posts will be the data from what I collect
posts = data['data']
# Print how many posts I have collected
len(posts)

100

In [6]:
# Visualize 1st post
posts[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'OrangeSheep12',
 'author_flair_css_class': None,
 'author_flair_richtext': [{'e': 'text', 't': '15'}],
 'author_flair_template_id': 'ef54aab6-9bc7-11e1-a725-12313b0c247a',
 'author_flair_text': '15',
 'author_flair_text_color': 'dark',
 'author_flair_type': 'richtext',
 'author_fullname': 't2_5o26os3c',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1606606379,
 'domain': 'self.teenagers',
 'full_link': 'https://www.reddit.com/r/teenagers/comments/k2y95l/i_cant_sleep_im_so_fucking_angry/',
 'gildings': {},
 'id': 'k2y95l',
 'is_crosspostable': False,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': False,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_css_class': 'other',
 'link_flair_richtext': [{'e': 'text', 't': 'Other'}],
 'link_fl

In [7]:
# Loop to grab 100 rows per request; range(49) adds up to 4,900 more posts
for i in range(49):
    # Pause 5 seconds before each request so the server is not overloaded
    sleep(5)
    # created_utc of the oldest post collected so far; the next request asks for posts before it
    utc = posts[-1]['created_utc']
    # Setting parameters
    params = {
        'subreddit': 'teenagers',
        'size': 100,
        'before': utc
    }
    # Request the next batch with the updated parameters
    res = requests.get(url, params)
    # Extract the list of posts from the JSON response
    data = res.json()['data']
    # Append the new batch to the posts collected so far
    posts = posts + data
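The paging logic in the loop above can be sketched as a reusable function. This is only an illustration under stated assumptions: collect_posts, fetch_page, and fake_fetch are hypothetical names, and fake_fetch is an offline stand-in for the API so the example runs without a network call. A real fetch_page would wrap requests.get and keep the sleep between calls.

```python
def collect_posts(fetch_page, start_before, pages, page_size=100):
    """Page backwards in time through posts.

    fetch_page(before, size) is a hypothetical callable that returns a list
    of post dicts, newest first, all with created_utc earlier than `before`.
    """
    posts = []
    before = start_before
    for _ in range(pages):
        batch = fetch_page(before, page_size)
        if not batch:  # nothing older is left, so stop early
            break
        posts.extend(batch)
        # The last post in the batch is the oldest; page back from it
        before = batch[-1]['created_utc']
    return posts


# Offline stand-in for the API: 250 fake posts, one per second
FAKE_POSTS = [{'id': i, 'created_utc': 1000 + i} for i in range(250)]


def fake_fetch(before, size):
    """Return up to `size` fake posts older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    older.sort(key=lambda p: p['created_utc'], reverse=True)
    return older[:size]


collected = collect_posts(fake_fetch, start_before=2000, pages=5)
print(len(collected))
```

Stopping on an empty batch avoids repeating the same request once the data runs out. The sketch assumes every post has a distinct created_utc; posts that share a timestamp with the last post of a batch could be missed when paging this way.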

In [8]:
# Looking at how many rows of data we have.
len(posts)

5000

In [9]:
# Setting posts as a dataframe
df = pd.DataFrame(posts)
# Setting which columns I want to look at
df[['subreddit', 'title', 'selftext']]

Unnamed: 0,subreddit,title,selftext
0,teenagers,I can't sleep I'm so fucking angry,[removed]
1,teenagers,They say that at least one person has a crush ...,But damn I wish whoever had a crush on me woul...
2,teenagers,i need to lose 80 pounds immediately,my ass is 220 pounds and 5’6”. im fucking 15. ...
3,teenagers,i wanna drink hydrochloric acid,​​​​​ ​​​​​​ ​​​​​​ ​​​​​​​​ ​​​​​​​​​​​​​​​​​...
4,teenagers,Question :,[removed]
...,...,...,...
4995,teenagers,SELF DISCIPLINE | NoFap | best motivation | Ep...,[removed]
4996,teenagers,I think I have a crush on her again,She put a picture on Instagram and I just...\n...
4997,teenagers,15m bored af,Anyone wanna talk? I got time to kill and nobo...
4998,teenagers,advice please,should i walk to the supermarket near my house...


In [10]:
# Creating column to add together title and selftext columns
# Instead of only using titles or self text we can use both
df['user_post'] = df['title'] + " " + df['selftext']
df['user_post'][0]

"I can't sleep I'm so fucking angry [removed]"

In [11]:
# Visualizing first 20 rows
df[['subreddit', 'title', 'selftext', 'user_post']][0:20]

Unnamed: 0,subreddit,title,selftext,user_post
0,teenagers,I can't sleep I'm so fucking angry,[removed],I can't sleep I'm so fucking angry [removed]
1,teenagers,They say that at least one person has a crush ...,But damn I wish whoever had a crush on me woul...,They say that at least one person has a crush ...
2,teenagers,i need to lose 80 pounds immediately,my ass is 220 pounds and 5’6”. im fucking 15. ...,i need to lose 80 pounds immediately my ass is...
3,teenagers,i wanna drink hydrochloric acid,​​​​​ ​​​​​​ ​​​​​​ ​​​​​​​​ ​​​​​​​​​​​​​​​​​...,i wanna drink hydrochloric acid ​​​​​ ​​​​​​ ​...
4,teenagers,Question :,[removed],Question : [removed]
5,teenagers,Is a platonic friendship with an opposite sex ...,[removed],Is a platonic friendship with an opposite sex ...
6,teenagers,Ask me weirdly personal questions part 3 (seri...,"Idk why I'm doing this, I guess it's fun and l...",Ask me weirdly personal questions part 3 (seri...
7,teenagers,"Dear teen boys,",If a girl you have a crush on says she's not i...,"Dear teen boys, If a girl you have a crush on ..."
8,teenagers,Star Wars question,Is 2014 late to the party to start Star Wars o...,Star Wars question Is 2014 late to the party t...
9,teenagers,Relationship advice (depressed boyfriend),[removed],Relationship advice (depressed boyfriend) [rem...


In [12]:
# Setting teenagers_df to specific columns within df
teenagers_df = df[['subreddit', 'title', 'selftext', 'user_post']]

In [13]:
# Visualizing DF
teenagers_df

Unnamed: 0,subreddit,title,selftext,user_post
0,teenagers,I can't sleep I'm so fucking angry,[removed],I can't sleep I'm so fucking angry [removed]
1,teenagers,They say that at least one person has a crush ...,But damn I wish whoever had a crush on me woul...,They say that at least one person has a crush ...
2,teenagers,i need to lose 80 pounds immediately,my ass is 220 pounds and 5’6”. im fucking 15. ...,i need to lose 80 pounds immediately my ass is...
3,teenagers,i wanna drink hydrochloric acid,​​​​​ ​​​​​​ ​​​​​​ ​​​​​​​​ ​​​​​​​​​​​​​​​​​...,i wanna drink hydrochloric acid ​​​​​ ​​​​​​ ​...
4,teenagers,Question :,[removed],Question : [removed]
...,...,...,...,...
4995,teenagers,SELF DISCIPLINE | NoFap | best motivation | Ep...,[removed],SELF DISCIPLINE | NoFap | best motivation | Ep...
4996,teenagers,I think I have a crush on her again,She put a picture on Instagram and I just...\n...,I think I have a crush on her again She put a ...
4997,teenagers,15m bored af,Anyone wanna talk? I got time to kill and nobo...,15m bored af Anyone wanna talk? I got time to ...
4998,teenagers,advice please,should i walk to the supermarket near my house...,advice please should i walk to the supermarket...


### Cleaning Teenagers DataFrame

In [14]:
# Visualizing df
teenagers_df.head()

Unnamed: 0,subreddit,title,selftext,user_post
0,teenagers,I can't sleep I'm so fucking angry,[removed],I can't sleep I'm so fucking angry [removed]
1,teenagers,They say that at least one person has a crush ...,But damn I wish whoever had a crush on me woul...,They say that at least one person has a crush ...
2,teenagers,i need to lose 80 pounds immediately,my ass is 220 pounds and 5’6”. im fucking 15. ...,i need to lose 80 pounds immediately my ass is...
3,teenagers,i wanna drink hydrochloric acid,​​​​​ ​​​​​​ ​​​​​​ ​​​​​​​​ ​​​​​​​​​​​​​​​​​...,i wanna drink hydrochloric acid ​​​​​ ​​​​​​ ​...
4,teenagers,Question :,[removed],Question : [removed]


In [15]:
# Removing rows where selftext is null
teenagers_df = teenagers_df[teenagers_df['selftext'].notna()]
# Removing rows where the selftext was removed
teenagers_df = teenagers_df[teenagers_df['selftext'] != '[removed]']
# Removing rows where the title was removed
teenagers_df = teenagers_df[teenagers_df['title'] != '[removed]']
# Safety check only: user_post always contains the title plus a space, so it never equals '[removed]' exactly
teenagers_df = teenagers_df[teenagers_df['user_post'] != '[removed]']
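The same cleaning can be expressed as one combined boolean mask. This is a sketch on a toy dataframe, not the notebook's data; clean_posts and PLACEHOLDERS are hypothetical names, and treating '[deleted]' as a placeholder is an assumption, since the notebook itself only filters '[removed]'.

```python
import pandas as pd

# Placeholder strings to drop. '[deleted]' is an assumption added for
# illustration; the notebook only filters '[removed]'.
PLACEHOLDERS = {'[removed]', '[deleted]'}


def clean_posts(frame):
    """Keep rows whose title and selftext both hold real text."""
    has_text = frame['title'].notna() & frame['selftext'].notna()
    is_placeholder = (
        frame['title'].isin(PLACEHOLDERS) | frame['selftext'].isin(PLACEHOLDERS)
    )
    # .copy() returns an independent dataframe rather than a view of the input
    return frame[has_text & ~is_placeholder].copy()


sample = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'selftext': ['real text', '[removed]', None, '[deleted]'],
})
cleaned = clean_posts(sample)
print(cleaned)
```

Building the mask once filters the dataframe in a single step and keeps the rules in one place if more placeholder strings turn up later.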

In [16]:
# Visualizing changes
teenagers_df

Unnamed: 0,subreddit,title,selftext,user_post
1,teenagers,They say that at least one person has a crush ...,But damn I wish whoever had a crush on me woul...,They say that at least one person has a crush ...
2,teenagers,i need to lose 80 pounds immediately,my ass is 220 pounds and 5’6”. im fucking 15. ...,i need to lose 80 pounds immediately my ass is...
3,teenagers,i wanna drink hydrochloric acid,​​​​​ ​​​​​​ ​​​​​​ ​​​​​​​​ ​​​​​​​​​​​​​​​​​...,i wanna drink hydrochloric acid ​​​​​ ​​​​​​ ​...
6,teenagers,Ask me weirdly personal questions part 3 (seri...,"Idk why I'm doing this, I guess it's fun and l...",Ask me weirdly personal questions part 3 (seri...
7,teenagers,"Dear teen boys,",If a girl you have a crush on says she's not i...,"Dear teen boys, If a girl you have a crush on ..."
...,...,...,...,...
4994,teenagers,when are we going to talk about the fact that ...,filler filler filler filler filler filler fill...,when are we going to talk about the fact that ...
4996,teenagers,I think I have a crush on her again,She put a picture on Instagram and I just...\n...,I think I have a crush on her again She put a ...
4997,teenagers,15m bored af,Anyone wanna talk? I got time to kill and nobo...,15m bored af Anyone wanna talk? I got time to ...
4998,teenagers,advice please,should i walk to the supermarket near my house...,advice please should i walk to the supermarket...


In [17]:
# Text to vectorize: the combined title and selftext column
X = teenagers_df['user_post']
# Instantiating CountVectorizer with default settings
cvec = CountVectorizer()
# Fitting cvec on X to learn the vocabulary
cvec.fit(X)
# Transforming X into a sparse matrix of word counts
X = cvec.transform(X)

In [18]:
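# Building a dataframe of word counts, one column per word
# (get_feature_names_out() needs scikit-learn 1.0+; older versions use get_feature_names())
X_df = pd.DataFrame(X.todense(), columns = cvec.get_feature_names_out())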

In [19]:
# Checking the shape: rows are posts, columns are unique words in the vocabulary
X_df.shape

(3842, 14644)
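With 3,842 rows and 14,644 columns, the dense X_df holds about 56 million cells, most of them zero. One possible alternative, sketched here on a toy corpus rather than the real data, is to sum the columns of the sparse matrix directly. The docs list and the word_counts name are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for teenagers_df['user_post']
docs = ["the cat sat", "the cat ran", "a dog ran"]

cvec = CountVectorizer()
X_sparse = cvec.fit_transform(docs)  # stays sparse, never densified

# Column sums on the sparse matrix give a 1 x n_terms matrix; flatten it
totals = np.asarray(X_sparse.sum(axis=0)).ravel()

# vocabulary_ maps each term to its column index
word_counts = pd.Series(
    {term: int(totals[idx]) for term, idx in cvec.vocabulary_.items()}
).sort_values(ascending=False)

print(word_counts)
```

The resulting series can be sliced the same way as X_df.sum().sort_values(ascending=False) without building the full dense dataframe first.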

In [20]:
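# The top words are going to be commonly used words (to, what, how, why),
# so I skip them and look at the words ranked around 475 to 525, where they should start to differ between subreddits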
X_df.sum().sort_values(ascending=False)[475:525]

answer       64
lonely       64
gave         64
currently    64
subreddit    63
broke        63
link         63
black        63
hell         63
car          63
fact         63
others       62
date         62
talked       62
funny        62
country      62
mental       62
depressed    61
dick         61
listen       61
sucks        61
full         61
number       61
looked       61
worse        61
short        60
less         60
hurt         60
hair         60
check        60
true         60
hour         60
meme         59
dog          59
yourself     59
kill         59
eyes         59
feelings     59
18           58
meet         58
word         58
write        58
posts        58
wouldn       58
pls          58
soon         58
straight     58
happen       57
joke         57
eve          57
dtype: int64
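Tokens such as wouldn and 18 in the output above are consistent with CountVectorizer's default settings, which lowercase the text and keep runs of two or more word characters. The sketch below approximates that behavior with the standard library only; tokenize and count_tokens are hypothetical helper names, and the sample sentences are made up.

```python
import re
from collections import Counter

# Same regular expression CountVectorizer uses by default for token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


def count_tokens(documents):
    """Total token counts across all documents."""
    totals = Counter()
    for doc in documents:
        totals.update(tokenize(doc))
    return totals


docs = ["I wouldn't do that", "She wouldn't listen, I'm 18"]
totals = count_tokens(docs)
print(totals.most_common())
```

The apostrophe splits wouldn't into wouldn and t, and the single-character t is dropped, which is why only wouldn shows up as a feature. Single-letter words such as I are dropped for the same reason.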

### Export To CSV

In [21]:
# Exporting teenagers_df to a CSV in the data folder
teenagers_df.to_csv('./data/teenagers_subreddit.csv', index=False)