# Project 3: Web APIs & NLP

###  Notebook Contents:

1. Import the packages that I need.
2. Use the PushShift API to pull 100 rows from the adulting subreddit.
    - I create a loop that will pull 100 rows of data from the adulting subreddit multiple times.
    - I implement sleep so that I do not overload the server.
3. Create a dataframe that will only contain the subreddit, title, and selftext columns.
4. Clean the adulting subreddit data: remove nulls and combine title and selftext into one column.
5. Run CountVectorizer on the combined title and selftext column (user_post).
6. Save and export adulting_df to my datasets.

### Imports

In [1]:
# Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

import requests
from bs4 import BeautifulSoup

# Use this to control the scrape rate!
from time import sleep



### Subreddit #2.  Adulting

In [2]:
# Setting URL to use pushshift api from video by Riley Dallas
url = 'https://api.pushshift.io/reddit/search/submission'

In [3]:
# Setting parameters
params = {
    'subreddit' : 'adulting', # Choosing my subreddit
    'size' : 100, # Choosing how many posts to collect per request. The API returns at most 100.
    'before' : 1606252870 # Unix timestamp cutoff (from Riley Dallas). Paging on it lets me collect more than 100 posts. The more data the better!
}
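The `before` value is a Unix timestamp in seconds. As a small sketch using only the standard library, this is one way to check which date that cutoff corresponds to (the variable names here are illustrative and not part of the notebook):

```python
from datetime import datetime, timezone

# The same epoch value that is passed as 'before' in params
before = 1606252870

# Interpret the epoch seconds as a timezone-aware UTC datetime
cutoff = datetime.fromtimestamp(before, tz=timezone.utc)
print(cutoff.isoformat())  # 2020-11-24T21:21:10+00:00
```

Only posts created before this moment are returned, which is why the first post's `created_utc` is slightly smaller than the cutoff.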

In [4]:
# Setting variable res as requests from my url and parameters
res = requests.get(url, params)
# Check the status code (200 means the request succeeded)
res.status_code

200

In [5]:
# Setting data as json
data = res.json()
# Posts will be the data from what I collect
posts = data['data']
# Print how many posts I have collected
len(posts)

100

In [6]:
# Visualize 1st post
posts[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'JOSEPHDEPTH',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_6kh2janl',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1606249496,
 'domain': 'self.Adulting',
 'full_link': 'https://www.reddit.com/r/Adulting/comments/k0d05y/how_could_i_start_planning_to_move_out_of_my/',
 'gildings': {},
 'id': 'k0d05y',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 5,
 'num_crossposts': 0,
 'over_18': False,
 'parent_white

In [7]:
# Loop to grab 100 rows 49 more times (range(49)), adding 4,900 posts to the first 100
for i in range(49):
    # Setting sleep to take a break between loops to not overload server
    sleep(1)
    # Taking the utc before our above data
    utc = posts[-1]['created_utc']
    params = {
        'subreddit': 'adulting',
        'size': 100,
        'before': utc
    }
    # Requesting the next page with the updated params
    data = requests.get(url, params)
    # Extracting the list of posts from the JSON response
    data = data.json()['data']
    # Appending the new batch to posts
    posts = posts + data
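The loop above pages backwards by reusing the oldest `created_utc` as the next `before` value. A minimal sketch of the same idea as a reusable function is below. `collect_posts`, `fetch_page`, and `fake_fetch` are hypothetical names introduced for illustration. The fetch step is passed in as a callable so the paging logic can be tried without touching the network, and the loop stops early if a page comes back empty.

```python
from time import sleep

def collect_posts(fetch_page, first_before, pages=50, pause=1.0):
    """Page backwards in time.

    fetch_page(before) is a hypothetical callable that returns a list of
    post dicts (newest first), each with a 'created_utc' key.
    """
    posts = []
    before = first_before
    for _ in range(pages):
        batch = fetch_page(before)
        if not batch:
            # No older posts left, so stop instead of indexing an empty list
            break
        posts.extend(batch)
        # The oldest post in this batch becomes the next cutoff
        before = batch[-1]["created_utc"]
        # Pause between requests so the server is not overloaded
        sleep(pause)
    return posts

def fake_fetch(before, size=3):
    """Stand-in for the API: returns up to `size` fake posts older than `before`."""
    stop = max(before - 1 - size, 0)
    return [{"created_utc": t} for t in range(before - 1, stop, -1)]

demo = collect_posts(fake_fetch, first_before=10, pause=0)
print(len(demo))  # 9
```

In the notebook, `fetch_page` could wrap the existing `requests.get(url, params)` call and return `.json()['data']`.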

In [8]:
# Looking at how many rows of data we have.
len(posts)

5000

In [9]:
# Setting posts as a dataframe
df = pd.DataFrame(posts)
# Setting which columns I want to look at
df[['subreddit', 'title', 'selftext']]

Unnamed: 0,subreddit,title,selftext
0,Adulting,How could i start planning to move out of my p...,"I'm 17, 11th grade and finally picking myself ..."
1,Adulting,Stained countertops,I live in an apartment that was built in the 6...
2,Adulting,Should I just quit? I really don't want to.,"I(19M) currently work at a Costco in my town, ..."
3,Adulting,Question: Fiancé’s doctor,So we moved to my fiancé hometown a while back...
4,Adulting,How I feel. Also shameless plug. Idk if agains...,
...,...,...,...
4995,Adulting,Has anyone tried HelloFresh or any other food ...,I'm starting to get into cooking but I'm kind ...
4996,Adulting,How much in savings should you have,
4997,Adulting,Move and find a job or find a job and move first?,"Hi all, so I've been looking to move recently..."
4998,Adulting,How to convince South Asian parents that I nee...,"Hey guys,\n\nSo i’m a 26 year old (M) with a s..."


In [10]:
# Creating column to add together title and selftext columns
# Instead of only using titles or self text we can use both
df['user_post'] = df['title'] + " " + df['selftext']
# Visualizing if it worked
df['user_post'][0]

"How could i start planning to move out of my parents house to go get an apartment? I'm 17, 11th grade and finally picking myself together to grind for success. I wanna be a TV writer and make shows for HBO or something but anyway I have to start grinding on how to get money, manage money and live on my own.\n\nBut yeah, I currently have no job yet I'm gonna get a job soon and apply at Walmart soon. And I'm still trying to make money online but I found nothing to help me cause I have no specific talents.\n\nI just want to know what's the process of moving out? Any videos u can suggest? Like, how does one learn to deal with a bank account and manage money? How does one learn to pay bills and buy a house? U know the usual stuff...."

In [11]:
# Visualizing the first 20 rows of the columns I want
df[['subreddit', 'title', 'selftext', 'user_post']][0:20]

Unnamed: 0,subreddit,title,selftext,user_post
0,Adulting,How could i start planning to move out of my p...,"I'm 17, 11th grade and finally picking myself ...",How could i start planning to move out of my p...
1,Adulting,Stained countertops,I live in an apartment that was built in the 6...,Stained countertops I live in an apartment tha...
2,Adulting,Should I just quit? I really don't want to.,"I(19M) currently work at a Costco in my town, ...",Should I just quit? I really don't want to. I(...
3,Adulting,Question: Fiancé’s doctor,So we moved to my fiancé hometown a while back...,Question: Fiancé’s doctor So we moved to my fi...
4,Adulting,How I feel. Also shameless plug. Idk if agains...,,How I feel. Also shameless plug. Idk if agains...
5,Adulting,What's *YOUR* personal adulting weekly routine...,* How many times do you do laundry/vacuum/dust...,What's *YOUR* personal adulting weekly routine...
6,Adulting,Fact,,Fact
7,Adulting,"24, autistic and trans. Complete failure in life",long post incoming\n\nI don't know what to do ...,"24, autistic and trans. Complete failure in li..."
8,Adulting,19 years old feeling like a failure in life an...,1. I failed High School poorly mostly because ...,19 years old feeling like a failure in life an...
9,Adulting,I didn’t do my taxes this year....now what?,So I didn’t file my taxes this year. Big no-no...,I didn’t do my taxes this year....now what? So...


In [12]:
# Setting adulting_df to columns that I would like to use from df
adulting_df = df[['subreddit', 'title', 'selftext', 'user_post']]

In [13]:
# Cleaning adulting_df: dropping posts whose selftext is '[removed]'
adulting_df = adulting_df[adulting_df['selftext'] != '[removed]']
#adulting_df['selftext'].fillna("N/A", inplace=True)

In [14]:
# Visualizing rows that need to be cleaned/removed
adulting_df[adulting_df['selftext'].isnull()]

Unnamed: 0,subreddit,title,selftext,user_post
1847,Adulting,Anyone down to chill ?,,
2052,Adulting,Hold the Line,,
2366,Adulting,"My washing/drier machines are down, can’t go t...",,
2431,Adulting,Car insurance expired after accident.,,
2446,Adulting,I’m flying by myself for the first time in a c...,,
3025,Adulting,How can I live on my own or at least move out ...,,
3079,Adulting,Little known great careers?,,
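These rows have a title but a missing `user_post`, because concatenating a string with a missing `selftext` gives a missing result. The notebook drops them. As an alternative sketch, assuming title-only posts were worth keeping, the missing text could be filled with an empty string before combining. The toy dataframe below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data: the first post has a title but no selftext
toy = pd.DataFrame({
    "title": ["Fact", "Stained countertops"],
    "selftext": [np.nan, "I live in an apartment"],
})

# Plain concatenation: the missing selftext makes the whole result missing
naive = toy["title"] + " " + toy["selftext"]

# Filling first keeps the title, and strip() removes the trailing space
filled = (toy["title"] + " " + toy["selftext"].fillna("")).str.strip()

print(naive.isna().tolist())  # [True, False]
print(filled.tolist())
```

Whether to keep or drop such rows is a judgment call. Dropping them, as done here, keeps only posts that have body text.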


In [15]:
# Removing rows with NA values from adulting_df
# (reassigning instead of inplace=True, since adulting_df is a filtered slice of df)
adulting_df = adulting_df.dropna()

In [16]:
# Confirm that no NA values remain in any column
adulting_df.isnull().sum()

subreddit    0
title        0
selftext     0
user_post    0
dtype: int64

In [17]:
# Using the combined title + selftext column (user_post) as the text to vectorize
X = adulting_df['user_post']
# Instantiate countvectorizer
cvec = CountVectorizer()
# Fitting X on countvectorizer
cvec.fit(X)
# Setting variable X as transformed X data
X = cvec.transform(X)

In [18]:
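# Building a dataframe of word counts from the cvec model
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
X_df = pd.DataFrame(X.todense(), columns = cvec.get_feature_names_out())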
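Converting the full matrix to dense form can use a lot of memory once there are thousands of posts and thousands of distinct words. As a sketch, the total count per word can also be computed directly from the sparse matrix. The two toy documents and the name `word_counts` are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents standing in for user_post
docs = ["move out and pay rent", "pay bills and pay rent"]

cvec = CountVectorizer()
X = cvec.fit_transform(docs)  # stays sparse

# Vocabulary in column order (fallback for scikit-learn < 1.0)
try:
    names = cvec.get_feature_names_out()
except AttributeError:
    names = cvec.get_feature_names()

# Sum each column of the sparse matrix, then flatten to a 1-D array
counts = np.asarray(X.sum(axis=0)).ravel()
word_counts = pd.Series(counts, index=names).sort_values(ascending=False)
print(word_counts.head(3))
```

This gives the same totals as `X_df.sum()` without building the dense dataframe first.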

In [19]:
# Checking the shape: one row per post, one column per unique word
X_df.shape

(4688, 16461)

In [20]:
# The top words are going to be commonly used words (to, what, how, why),
# so look further down the ranking (positions 475-525) for more topic-specific words
X_df.sum().sort_values(ascending=False)[475:525]

future         129
university     129
store          128
reddit         128
interview      128
share          127
often          127
roommate       126
minutes        125
eat            124
email          123
personal       123
medical        123
turn           123
graduate       123
advance        122
business       122
happy          122
turned         122
wish           122
laundry        122
talking        120
lived          120
25             120
short          120
three          120
town           120
gotten         120
sort           119
called         119
gets           118
single         118
score          118
100            117
quit           116
renting        116
stuck          115
loans          115
read           115
information    114
financial      114
list           114
cooking        113
field          113
bring          113
budget         113
000            113
cook           113
couldn         112
mean           112
dtype: int64
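Entries such as `couldn` and `000` in the list above come from how CountVectorizer splits text by default: it lowercases, then keeps runs of two or more word characters, so `couldn't` loses its `t` and `1,000` loses its `1`. A rough standard-library sketch of that default tokenization is below. `count_tokens` and the sample sentences are illustrative, not part of the notebook:

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_tokens(texts):
    """Lowercase each text and count tokens of two or more word characters."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_PATTERN.findall(text.lower()))
    return counts

sample = ["I couldn't save $1,000 this year", "Rent is 1,000 a month"]
counts = count_tokens(sample)
print(counts["couldn"], counts["000"])  # 1 2
```

Single-character tokens such as `i`, `a`, `t`, and `1` are dropped entirely, which is why they never show up in the counts.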

### Export To CSV

In [21]:
# Exporting adulting_df to datasets. Used with teenagers_df to create a merged DF.
adulting_df.to_csv('./data/adulting_subreddit.csv', index=False)