<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

---
## Problem Statement

The Maker Faire conference wants to build a fun, interactive app for attendees at its upcoming event. For attendees who are fans of Arduino or Raspberry Pi, the organizers want a classification model that predicts which of the two devices a person favors, based on the text they enter into the app. The model should be as accurate as possible. The hope is that the app delights attendees and helps retain attendance at future conferences.

---

# Data Cleaning

In [24]:
import pandas as pd
import numpy as np

## Arduino Data Cleaning

In [25]:
# read arduino data
arduino = pd.read_csv('../data/arduino.csv')
arduino.head()

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | arduino | Can I use a usb bluetooth adapter on a Leonardo? | I have a little Beetle board that is a Leonard... |
| 1 | arduino | New to the Arduino universe and need suggestio... | [removed] |
| 2 | arduino | long shot but does anyone have a 3d model of t... | I want to build a 3d enclosure for this board ... |
| 3 | arduino | Need help understanding some variables and fun... | I need to understand the code from [this](http... |
| 4 | arduino | I made an Arduino Uno compatible controller bo... | NaN |


In [26]:
# Check dtypes and null values
arduino.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  4000 non-null   object
 1   title      4000 non-null   object
 2   selftext   2562 non-null   object
dtypes: object(3)
memory usage: 93.9+ KB


In [27]:
# Check to see how many "removed" rows exist
arduino[arduino['selftext'] == '[removed]']

| | subreddit | title | selftext |
|---|---|---|---|
| 1 | arduino | New to the Arduino universe and need suggestio... | [removed] |
| 15 | arduino | Grove Help :) | [removed] |
| 16 | arduino | Arduino UNO missing IDE port | [removed] |
| 20 | arduino | How do you convert 24v ac to 5v dc to power ar... | [removed] |
| 24 | arduino | BRG LED Strip Help. | [removed] |
| ... | ... | ... | ... |
| 3697 | arduino | Raspberry Pi for Beginners (Mac + PC) | [removed] |
| 3802 | arduino | Teaching not to doubt about requests to God: | [removed] |
| 3826 | arduino | Technology Promotes Equality... | [removed] |
| 3938 | arduino | I'm working on Iot based smart inverter on Pro... | [removed] |
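
The filter above displays the matching rows rather than counting them. A minimal sketch of how the missing and `[removed]` bodies could be tallied before dropping anything, using a small made-up frame (`posts` and its contents are illustrative, not the scraped data):

```python
import pandas as pd

# Toy stand-in for the scraped posts; the real frame is read from ../data/arduino.csv
posts = pd.DataFrame({
    "subreddit": ["arduino"] * 4,
    "title": ["a", "b", "c", "d"],
    "selftext": ["some body text", "[removed]", None, "[removed]"],
})

n_missing = int(posts["selftext"].isna().sum())            # bodies that are NaN
n_removed = int((posts["selftext"] == "[removed]").sum())  # bodies replaced by '[removed]'
n_usable = len(posts) - n_missing - n_removed              # rows that would survive cleaning

print(n_missing, n_removed, n_usable)  # 1 2 1
```

Run against the real frame, the same three numbers would show how much of each subreddit is lost at each cleaning step.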


In [28]:
# Drop all rows with no selftext
arduino.dropna(subset=['selftext'], inplace=True)

In [29]:
# Drop all rows whose selftext is '[removed]'
arduino = arduino[arduino['selftext'] != '[removed]']

In [30]:
# Check remaining rows
arduino.shape

(2487, 3)

## Raspberry Pi Data Cleaning

In [31]:
# read raspberrypi data
raspberrypi = pd.read_csv('../data/raspberrypi.csv')
raspberrypi.head()

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | raspberry_pi | RP4 w/ Raspbian opens video but no picture | I have a RP4 with a 1080x1920 (portriate) h.2... |
| 1 | raspberry_pi | DeskPi Pro V2 Case for Raspberry Pi 4 Setup: M... | NaN |
| 2 | raspberry_pi | Plex Server and DVD Ripper | My in-laws have a ton of movies. I suggested p... |
| 3 | raspberry_pi | A Windows 10 VM(s) on a Pi 4 Cluster? | [removed] |
| 4 | raspberry_pi | Another Crypto Ticker | [removed] |


In [32]:
# check dtypes and null values
raspberrypi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  20000 non-null  object
 1   title      20000 non-null  object
 2   selftext   14581 non-null  object
dtypes: object(3)
memory usage: 468.9+ KB


In [33]:
# Check to see how many "removed" rows exist
raspberrypi[raspberrypi['selftext'] == '[removed]']

| | subreddit | title | selftext |
|---|---|---|---|
| 3 | raspberry_pi | A Windows 10 VM(s) on a Pi 4 Cluster? | [removed] |
| 4 | raspberry_pi | Another Crypto Ticker | [removed] |
| 5 | raspberry_pi | Raspberry Pi Replacement for Fingbox? | [removed] |
| 7 | raspberry_pi | Won’t boot after shutdown | [removed] |
| 8 | raspberry_pi | What power supply for two RPi3 ? | [removed] |
| ... | ... | ... | ... |
| 19991 | raspberry_pi | Only orange light no green light @ ethernet port | [removed] |
| 19992 | raspberry_pi | Troubleshooting with Argon One Case (MicroSD n... | [removed] |
| 19995 | raspberry_pi | New 8Gb RAM Rpi 4 | [removed] |
| 19997 | raspberry_pi | [Help Needed] RasberryPi4 with retropie does n... | [removed] |


In [34]:
# Drop all rows with no selftext
raspberrypi.dropna(subset=['selftext'], inplace=True)

In [35]:
# Drop all rows whose selftext is '[removed]'
raspberrypi = raspberrypi[raspberrypi['selftext'] != '[removed]']

In [36]:
# Check remaining rows
raspberrypi.shape

(2689, 3)
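
Both subreddits go through the same steps: drop rows with a missing `selftext`, then drop rows whose `selftext` is `[removed]`. One way to avoid repeating the cells is a small helper. This is a sketch only: `clean_posts` is a hypothetical name, not something defined in this notebook, and it checks only `selftext`. That matches the cells above, because `info()` reports no nulls in the other columns.

```python
import pandas as pd


def clean_posts(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows with a usable selftext (hypothetical helper)."""
    has_text = df["selftext"].notna()            # body is not NaN
    not_removed = df["selftext"] != "[removed]"  # body was not replaced by '[removed]'
    # .copy() returns an independent frame, so later edits do not touch df
    return df[has_text & not_removed].copy()


# Made-up rows for illustration only
demo = pd.DataFrame({
    "subreddit": ["raspberry_pi"] * 3,
    "title": ["t1", "t2", "t3"],
    "selftext": ["body", None, "[removed]"],
})
cleaned = clean_posts(demo)
print(cleaned.shape)  # (1, 3)
```

Because the helper returns a new frame instead of using `inplace=True`, the raw data stays available for comparison.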

## Final Subreddit DataFrame

In [38]:
# Combine cleaned subreddit dataframes into one
subreddits = pd.concat([arduino, raspberrypi], axis=0)
subreddits.head()

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | arduino | Can I use a usb bluetooth adapter on a Leonardo? | I have a little Beetle board that is a Leonard... |
| 2 | arduino | long shot but does anyone have a 3d model of t... | I want to build a 3d enclosure for this board ... |
| 3 | arduino | Need help understanding some variables and fun... | I need to understand the code from [this](http... |
| 6 | arduino | A picture book teaches C programming basics | When I was learning C programming, I found tha... |
| 7 | arduino | How to trick an airbag controller | Not exactly Arduino related, but hopefully som... |
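
The index labels in this output (0, 2, 3, 6, 7) are carried over from the original frames, so a label such as 0 appears once for each subreddit in the combined frame. The export writes `index=False`, so the CSV is unaffected, but label-based lookups on `subreddits` can return more than one row. A small sketch with made-up frames (`a` and `b` are illustrative) shows the difference `ignore_index=True` would make:

```python
import pandas as pd

# Two tiny frames whose indexes have gaps, as the cleaned frames do
a = pd.DataFrame({"subreddit": ["arduino"] * 2, "title": ["a0", "a2"]}, index=[0, 2])
b = pd.DataFrame({"subreddit": ["raspberry_pi"] * 2, "title": ["r0", "r2"]}, index=[0, 2])

kept = pd.concat([a, b], axis=0)                      # original labels are kept
reset = pd.concat([a, b], axis=0, ignore_index=True)  # labels renumbered from 0

print(kept.index.tolist())   # [0, 2, 0, 2]
print(kept.index.is_unique)  # False
print(len(kept.loc[[0]]))    # 2 (one row from each frame)
print(reset.index.tolist())  # [0, 1, 2, 3]
```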


In [39]:
# check final shape of full corpus
subreddits.shape

(5176, 3)
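
Since the stated goal is accuracy, the two cleaned row counts give a useful reference point: a model that always predicts the larger class would be right about 52% of the time on this corpus. A quick check of that arithmetic, using only the shapes printed above (the variable names are illustrative):

```python
n_arduino = 2487      # rows left after cleaning the arduino posts
n_raspberrypi = 2689  # rows left after cleaning the raspberry_pi posts

total = n_arduino + n_raspberrypi
baseline = max(n_arduino, n_raspberrypi) / total  # majority-class accuracy

print(total)               # 5176
print(round(baseline, 4))  # 0.5195
```

The classes are close to balanced, so plain accuracy is a reasonable metric here, and any model should be judged against roughly 0.52 rather than 0.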

In [40]:
# export dataframe to csv
subreddits.to_csv('../data/subreddits.csv', index=False)
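
Post bodies often contain commas and line breaks, so it can be worth confirming that the exported file reads back with the same shape and no new nulls. The sketch below does this on a made-up two-row frame written to a temporary directory. `corpus`, `restored`, and the temporary path are illustrative, and the real file lives at `../data/subreddits.csv`.

```python
import os
import tempfile

import pandas as pd

# Made-up rows; the first body contains a newline and a comma
corpus = pd.DataFrame({
    "subreddit": ["arduino", "raspberry_pi"],
    "title": ["t1", "t2"],
    "selftext": ["line one\nline two, with a comma", "plain body"],
}, index=[0, 7])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "subreddits.csv")
    corpus.to_csv(path, index=False)  # index=False: no extra index column is written
    restored = pd.read_csv(path)

print(restored.columns.tolist())  # ['subreddit', 'title', 'selftext']
print(restored.shape)             # (2, 3)
print(int(restored["selftext"].isna().sum()))  # 0
```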