# Project 3: Subreddit Classification (Part 1 of 2)


> By:  Rohazeanti Mohamad Jenpire

## Table of Contents
- [Problem Statement](#Problem-Statement)
- [Imports](#Imports)
- [Data Collection](#Data-Collection)

## Problem Statement

Develop a classification model that can predict which of two subreddits, r/diet or r/exercise, a given post belongs to.

Diet and exercise are both essential for good health, and people often combine the two in their daily lives. The word "diet" is often associated with an unpleasant weight-loss regimen. For example, in food marketing the term "diet" usually describes foods low in calories, such as diet soda. But the word has another meaning: diet can also refer to the food and drink a person consumes daily and the mental and physical circumstances connected to eating. Nutrition involves more than simply eating a "good" diet; it is about nourishment on every level.

Both exercise and diet are associated with weight loss, fat reduction, and burning calories (through food choices or types of exercise). As a result, diet-related posts are sometimes posted in r/exercise and vice versa.

As the data scientist on the team, I have been assigned the task of building a classification model that can accurately assign posts to the correct subreddit, so that r/diet and r/exercise stay free of unrelated posts and both forums can continue to offer relevant content.

## Imports

In [2]:
import requests
import time
import pandas as pd
import numpy as np

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Data Collection

In [3]:
def get_posts(subreddit, number=100):
    """Collect up to `number` unique posts from `subreddit` via the Pushshift API."""
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {
        'subreddit': subreddit,
        'size': 100
    }  # 'before' is added inside the while loop
    columns = ['subreddit', 'selftext', 'title', 'created_utc']
    returned = pd.DataFrame()  # accumulates posts across requests
    while len(returned) < number:
        # Random pause before each request to avoid hammering the API
        time.sleep(np.random.randint(5, 30))

        res = requests.get(url, params=params)
        res.raise_for_status()  # fail loudly on HTTP errors
        posts = res.json()['data']
        if not posts:  # no older posts left, so stop paging
            break
        df = pd.DataFrame(posts)

        # Page backwards in time: the next request only asks for posts
        # older than the oldest post in this batch (a plain int, not a Series)
        params['before'] = int(df['created_utc'].min())

        returned = pd.concat([returned, df[columns]], axis=0)
        returned.drop_duplicates(inplace=True)

    returned.reset_index(inplace=True, drop=True)
    n_post = returned[:number]
    print(f"Data collection completed...{len(n_post)} total number of posts collected. ")
    return n_post
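
The paging logic in `get_posts` can be tried out without touching the network. The sketch below is only an illustration: `collect_pages` and `fake_fetch` are hypothetical names that do not appear in the notebook, and `fake_fetch` stands in for the API by serving a toy archive of ten posts, newest first. It assumes each page is a list of dicts with a `created_utc` key.

```python
import pandas as pd

def collect_pages(fetch_page, number, max_pages=50):
    """Page backwards in time until `number` unique rows are collected.

    `fetch_page(before)` is a hypothetical callable that returns a list of
    dicts older than `before` (or the newest ones when `before` is None).
    """
    pages = []
    before = None
    collected = pd.DataFrame()
    for _ in range(max_pages):
        posts = fetch_page(before)
        if not posts:  # nothing older is available
            break
        page = pd.DataFrame(posts)
        pages.append(page)
        # The oldest timestamp on this page becomes the next cut-off
        before = int(page["created_utc"].min())
        collected = pd.concat(pages, ignore_index=True).drop_duplicates()
        if len(collected) >= number:
            break
    return collected.reset_index(drop=True).head(number)

def fake_fetch(before, page_size=3):
    # Toy archive: ten posts with timestamps 10 (newest) down to 1 (oldest)
    stamps = [t for t in range(10, 0, -1) if before is None or t < before]
    return [{"title": f"post {t}", "created_utc": t} for t in stamps[:page_size]]

sample = collect_pages(fake_fetch, number=7)
print(sample)
```

Asking for more rows than the toy archive holds (for example `number=20`) simply returns all ten posts, because the loop stops once a page comes back empty.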
    

### Scrape `diet` posts

In [9]:
diet = get_posts('diet', number=3000)

Data collection completed...3000 total number of posts collected. 


In [10]:
len(diet["selftext"].unique())

2088

In [11]:
diet.head(10)

| | subreddit | selftext | title | created_utc |
|---|---|---|---|---|
| 0 | diet | | Best diet to lose weight before going on vacat... | 1659585910 |
| 1 | diet | [removed] | Does anyone else get sick from foods you used ... | 1659582639 |
| 2 | diet | For those of you that do not know, a fruitaria... | Debunking the fruitarian diet | 1659567365 |
| 3 | diet | [removed] | Is it okay if I stop eating all bread , pasta,... | 1659554541 |
| 4 | diet | [removed] | KETO DIET | 1659546069 |
| 5 | diet | | How do I complet my protein requirement as a v... | 1659545207 |
| 6 | diet | [removed] | Working out on an empty stomach | 1659544720 |
| 7 | diet | Since I stopped smoking and drinking alcohol I... | Cheating &amp; Boredom | 1659542824 |
| 8 | diet | [removed] | Diet idea | 1659539759 |
| 9 | diet | [removed] | Is it important to eat a mix of vegetables? | 1659534798 |


In [12]:
diet.shape

(3000, 4)

In [13]:
# Save to csv
diet.to_csv("diet.csv", index=False)

### Scrape `exercise` posts

In [4]:
exercise = get_posts('exercise', number=3000)

Data collection completed...3000 total number of posts collected. 


In [5]:
len(exercise["selftext"].unique())

1092

In [6]:
exercise["selftext"].value_counts()

             1473
[removed]

In [7]:
exercise.isnull().sum() 

subreddit      0
selftext       8
title          0
created_utc    0
dtype: int64
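
The outputs above suggest that many posts carry no usable body text: the `selftext` is either empty, missing, or a placeholder such as `[removed]`. A minimal sketch of how those cases could be counted together is shown below. The helper name `summarise_selftext` and the toy data are hypothetical, and treating `[deleted]` as a placeholder is an assumption, since it does not appear in the outputs shown here.

```python
import pandas as pd

# Values treated as "no real body text"; "[deleted]" is an assumption
PLACEHOLDERS = {"", "[removed]", "[deleted]"}

def summarise_selftext(df):
    """Count how many rows have placeholder vs. real selftext."""
    # Treat missing values as empty strings and ignore surrounding whitespace
    text = df["selftext"].fillna("").str.strip()
    is_placeholder = text.isin(PLACEHOLDERS)
    return {
        "total": len(df),
        "placeholder": int(is_placeholder.sum()),
        "with_body": int((~is_placeholder).sum()),
    }

# Toy data for illustration only, not taken from the scraped posts
toy = pd.DataFrame({"selftext": ["", "[removed]", "Some body text", None, "[deleted]"]})
summary = summarise_selftext(toy)
print(summary)
```

Calling `summarise_selftext(diet)` or `summarise_selftext(exercise)` would give the same breakdown for the scraped data.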

In [8]:
# Save to csv
exercise.to_csv("exercise.csv", index=False)
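
One thing to keep in mind when these CSV files are loaded back: by default `pd.read_csv` parses empty fields as `NaN`, so posts whose `selftext` is an empty string come back as missing values. The sketch below illustrates this with toy data and an in-memory buffer instead of the real files; passing `keep_default_na=False` is one possible way to keep the empty strings, not necessarily the approach used later in this project.

```python
import io
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "selftext": ["", "[removed]", "body"],
    "title": ["a", "b", "c"],
})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the empty string becomes NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# With keep_default_na=False no strings are parsed as NaN
buffer.seek(0)
kept_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["selftext"].isnull().sum())
print(kept_read["selftext"].isnull().sum())
```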