<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

---
## Problem Statement

Maker Faire wants to build a fun, interactive app for attendees of its upcoming conference. For attendees who are fans of Arduino or Raspberry Pi, the organizers want a classification model that predicts which of the two devices a person favors, based on the text they enter into the app. The model should be as accurate as possible. The hope is that the app will delight attendees and encourage them to return for future conferences.

---

# Data Wrangling

In [3]:
# Imports
import requests
import pandas as pd
import time

In [1]:
def get_data(subreddit, iters):
    '''
    Collect the subreddit, title, and selftext of posts from the
    Pushshift Reddit API. Return a pandas DataFrame of the subreddit data.

    Params:

    subreddit - name of the subreddit to search

    iters - number of follow-up requests made after the first one;
            each request returns up to 100 posts, so up to
            (iters + 1) * 100 rows are collected
    '''
    posts = []

    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {
        'subreddit': subreddit,
        'size': 100,
    }

    # One initial request plus `iters` follow-up requests
    for _ in range(iters + 1):
        res = requests.get(url, params=params)
        if res.status_code != 200:
            # Raise instead of returning a string, so the caller never
            # mistakes an error message for a DataFrame
            raise RuntimeError(f'Server error: {res.status_code}')

        batch = res.json()['data']  # parse the response body once
        if not batch:
            # No older posts left to page through
            break
        posts += batch

        # Page backwards: the next request asks only for posts older
        # than the oldest post collected so far
        params['before'] = batch[-1]['created_utc']
        time.sleep(10)  # pause between requests
    return pd.DataFrame(posts)[['subreddit', 'title', 'selftext']]
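
The paging idea behind `get_data` (ask for a batch, then ask again for posts older than the oldest one just received) can be tried without touching the network. The sketch below is only an illustration under stated assumptions: `paginate_backwards`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names that do not appear in the notebook, and the fake fetcher stands in for the real API call.

```python
from typing import Callable, Dict, List, Optional


def paginate_backwards(
    fetch_page: Callable[[Optional[int]], List[Dict]],
    max_requests: int,
) -> List[Dict]:
    """Collect posts newest-first by repeatedly asking for older ones.

    fetch_page(before) is a hypothetical callable that returns posts
    created strictly before `before` (or the newest posts when `before`
    is None), sorted newest first.
    """
    posts: List[Dict] = []
    before: Optional[int] = None
    for _ in range(max_requests):
        batch = fetch_page(before)
        if not batch:
            break  # nothing older is available
        posts.extend(batch)
        # The last item in the batch is the oldest; use it as the cursor
        before = batch[-1]["created_utc"]
    return posts


# Stand-in for the real API: ten fake posts with timestamps 10, 9, ..., 1
FAKE_POSTS = [{"created_utc": t, "title": f"post {t}"} for t in range(10, 0, -1)]


def fake_fetch(before: Optional[int], size: int = 4) -> List[Dict]:
    """Return up to `size` fake posts older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if before is None or p["created_utc"] < before]
    return older[:size]


collected = paginate_backwards(fake_fetch, max_requests=5)
print(len(collected))
```

Passing the fetcher in as an argument keeps the cursor logic separate from the HTTP call, which makes it easy to check that no post is collected twice and that the loop stops once the source runs dry.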

In [75]:
# get arduino data
# 40 total runs to pull 4000 rows of data
arduino = get_data('arduino', 39)

In [76]:
# check head of arduino dataframe
arduino.head()

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | arduino | Can I use a usb bluetooth adapter on a Leonardo? | I have a little Beetle board that is a Leonard... |
| 1 | arduino | New to the Arduino universe and need suggestio... | [removed] |
| 2 | arduino | long shot but does anyone have a 3d model of t... | I want to build a 3d enclosure for this board ... |
| 3 | arduino | Need help understanding some variables and fun... | I need to understand the code from [this](http... |
| 4 | arduino | I made an Arduino Uno compatible controller bo... | |


In [77]:
# check number of rows of data
arduino.shape

(4000, 3)

In [78]:
# check nulls in dataset
arduino.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  4000 non-null   object
 1   title      4000 non-null   object
 2   selftext   4000 non-null   object
dtypes: object(3)
memory usage: 93.9+ KB


In [83]:
# export dataframe to csv
arduino.to_csv('data/arduino.csv', index = False)

In [4]:
# get raspberry pi data
# 200 total runs to pull 20000 rows of data
raspberrypi = get_data('raspberry_pi', 199)

In [5]:
# check head of raspberry pi dataframe
raspberrypi.head()

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | raspberry_pi | RP4 w/ Raspbian opens video but no picture | I have a RP4 with a 1080x1920 (portriate) h.2... |
| 1 | raspberry_pi | DeskPi Pro V2 Case for Raspberry Pi 4 Setup: M... | |
| 2 | raspberry_pi | Plex Server and DVD Ripper | My in-laws have a ton of movies. I suggested p... |
| 3 | raspberry_pi | A Windows 10 VM(s) on a Pi 4 Cluster? | [removed] |
| 4 | raspberry_pi | Another Crypto Ticker | [removed] |


In [6]:
# check number of rows of data
raspberrypi.shape

(20000, 3)

In [7]:
# check nulls in dataset
raspberrypi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  20000 non-null  object
 1   title      20000 non-null  object
 2   selftext   19751 non-null  object
dtypes: object(3)
memory usage: 468.9+ KB
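
The `info()` output above reports 19751 non-null `selftext` values out of 20000, so 249 rows are null. It does not count the blank and `[removed]` entries visible in the `head()` output, since those are ordinary strings. A minimal sketch of how all three cases might be tallied follows; `count_unusable_selftext` and the `sample` frame are hypothetical and use made-up rows, not the collected data.

```python
import pandas as pd

# Hypothetical sample rows mimicking the patterns seen in the head() output
sample = pd.DataFrame({
    "subreddit": ["raspberry_pi"] * 4,
    "title": ["a", "b", "c", "d"],
    "selftext": ["Some real post text", "", "[removed]", None],
})


def count_unusable_selftext(df):
    """Count rows whose selftext is null, blank, or a '[removed]' placeholder."""
    is_null = df["selftext"].isna()
    # Replace nulls with empty strings so the string comparisons are safe
    filled = df["selftext"].fillna("")
    is_blank = (filled.str.strip() == "") & ~is_null
    is_removed = filled == "[removed]"
    return {
        "null": int(is_null.sum()),
        "blank": int(is_blank.sum()),
        "removed": int(is_removed.sum()),
    }


summary = count_unusable_selftext(sample)
print(summary)
```

Running the same function on `arduino` and `raspberrypi` would show how many posts in each set carry no usable body text, which matters if the model is meant to learn from `selftext` as well as `title`.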


In [9]:
# export dataframe to csv
raspberrypi.to_csv('../data/raspberrypi.csv', index = False)