## Problem Statement

The television shows 'Elementary' (CBS) and 'Sherlock' (BBC) both feature the detective Sherlock Holmes. CBS executives want a model that classifies posts from the subreddits r/elementary and r/sherlock, to determine whether viewers discuss 'Elementary' differently from 'Sherlock'.

This project will develop a model to distinguish between posts from the two subreddits. The positive class will be posts from r/elementary and the negative class will be posts from r/sherlock. Logistic Regression and Naive Bayes models will be trained, and their accuracy scores will be compared to select the better classifier.

The project will also analyse which words the model relied on for successful classification, to help CBS executives identify the terms and topics that resonate with viewers and so sustain the show's popularity.
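The labelling and the accuracy metric described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `label_posts`, `accuracy`, and `baseline_accuracy` are hypothetical, and the majority-class baseline is added here as a reference point that the problem statement does not itself mention.

```python
def label_posts(subreddits, positive='elementary'):
    """Map subreddit names to 1 (positive class) or 0 (negative class)."""
    return [1 if name.lower() == positive else 0 for name in subreddits]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common class."""
    positives = sum(y_true)
    return max(positives, len(y_true) - positives) / len(y_true)
```

A model's accuracy is only informative when compared with this baseline: if one subreddit supplies most of the posts, always guessing that subreddit already scores well.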

## Import libraries

In [1]:
import requests
import pandas as pd
import time
import random


## Scraping the subreddits

In [2]:
sherlock = 'https://www.reddit.com/r/Sherlock/new.json'
elementary = 'https://www.reddit.com/r/elementary/new.json'
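The scraping cells build each page URL by appending `'?after=' + after` to one of these base URLs. A slightly more general sketch uses `urllib.parse.urlencode`, which also makes it easy to pass Reddit's `limit` listing parameter. The helper name `listing_url` is hypothetical and is not part of the original notebook.

```python
from urllib.parse import urlencode


def listing_url(base_url, after=None, limit=None):
    """Build a listing URL, adding only the query parameters that are set."""
    params = {}
    if limit is not None:
        params['limit'] = limit
    if after is not None:
        params['after'] = after
    # urlencode keeps dict insertion order and escapes values where needed
    return base_url + ('?' + urlencode(params) if params else '')
```

For example, `listing_url(sherlock, after='t3_gk0c71')` yields the same URL as the string concatenation used in the notebook.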

In [3]:
# Scrape the sherlock subreddit, one listing page per request

posts = []
after = None  # fullname of the last post seen; None requests the first page

for _ in range(50):
    if after is None:
        current_url = sherlock
    else:
        current_url = sherlock + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})

    if res.status_code != 200:
        print('Status error', res.status_code)
        break

    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    if after is None:
        # No further pages; continuing would restart from the first page
        break
    # Pause for a random 2 to 6 seconds between requests
    sleep_duration = random.randint(2, 6)
    print(sleep_duration)
    time.sleep(sleep_duration)

pd.DataFrame(posts).to_csv('../datasets/sherlock.csv', index=False)
    

https://www.reddit.com/r/Sherlock/new.json
5
https://www.reddit.com/r/Sherlock/new.json?after=t3_gk0c71
3
https://www.reddit.com/r/Sherlock/new.json?after=t3_gij72p
2
https://www.reddit.com/r/Sherlock/new.json?after=t3_ggdbzo
3
https://www.reddit.com/r/Sherlock/new.json?after=t3_geeir5
4
https://www.reddit.com/r/Sherlock/new.json?after=t3_gcw5f4
6
https://www.reddit.com/r/Sherlock/new.json?after=t3_gae507
2
https://www.reddit.com/r/Sherlock/new.json?after=t3_g7ohpg
5
https://www.reddit.com/r/Sherlock/new.json?after=t3_g5kwlu
4
https://www.reddit.com/r/Sherlock/new.json?after=t3_g4p5o1
4
https://www.reddit.com/r/Sherlock/new.json?after=t3_g2on0h
4
https://www.reddit.com/r/Sherlock/new.json?after=t3_g0r9xb
4
https://www.reddit.com/r/Sherlock/new.json?after=t3_fy1fgd
3
https://www.reddit.com/r/Sherlock/new.json?after=t3_fwf71l
6
https://www.reddit.com/r/Sherlock/new.json?after=t3_fu3o7k
5
https://www.reddit.com/r/Sherlock/new.json?after=t3_fsox80
3
https://www.reddit.com/r/Sherlock/new.js
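The scraping loop above can be wrapped in a function so that the same logic serves both subreddits. The following is a minimal sketch, not the notebook's own code: the names `scrape_listing`, `fetch_json`, `fetch`, and `pause` are hypothetical, and it assumes that the listing returns a null `after` token on its last page. The fetcher is passed in as an argument so the paging logic can be exercised without network access.

```python
import random
import time


def fetch_json(url):
    """Default fetcher; imports requests lazily so paging can be tested offline."""
    import requests
    res = requests.get(url, headers={'User-agent': 'Pony Inc 1.0'})
    res.raise_for_status()  # raise on any 4xx/5xx response
    return res.json()


def scrape_listing(base_url, max_pages=50, fetch=fetch_json, pause=(2, 6)):
    """Collect post dictionaries from a listing, following the 'after' token."""
    posts = []
    after = None
    for _ in range(max_pages):
        url = base_url if after is None else base_url + '?after=' + after
        listing = fetch(url)['data']
        posts.extend(child['data'] for child in listing['children'])
        after = listing['after']
        if after is None:
            # Assumed end-of-listing signal: stop instead of restarting at page 1
            break
        time.sleep(random.randint(*pause))  # random pause between requests
    return posts
```

With this in place, a call such as `pd.DataFrame(scrape_listing(sherlock)).to_csv('../datasets/sherlock.csv', index=False)` would replace the cell above.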

In [4]:
# Scrape the elementary subreddit, one listing page per request

posts_e = []
after = None  # fullname of the last post seen; None requests the first page

for _ in range(50):
    if after is None:
        current_url = elementary
    else:
        current_url = elementary + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Sky Inc 1.0'})

    if res.status_code != 200:
        print('Status error', res.status_code)
        break

    current_dict = res.json()
    current_posts_e = [p['data'] for p in current_dict['data']['children']]
    posts_e.extend(current_posts_e)
    after = current_dict['data']['after']
    if after is None:
        # No further pages; continuing would restart from the first page
        break
    # Pause for a random 2 to 6 seconds between requests
    sleep_duration = random.randint(2, 6)
    print(sleep_duration)
    time.sleep(sleep_duration)

pd.DataFrame(posts_e).to_csv('../datasets/elementary.csv', index=False)


https://www.reddit.com/r/elementary/new.json
2
https://www.reddit.com/r/elementary/new.json?after=t3_fy3825
6
https://www.reddit.com/r/elementary/new.json?after=t3_f9h5dc
2
https://www.reddit.com/r/elementary/new.json?after=t3_ekamx2
5
https://www.reddit.com/r/elementary/new.json?after=t3_dzeieb
4
https://www.reddit.com/r/elementary/new.json?after=t3_d3zhvw
4
https://www.reddit.com/r/elementary/new.json?after=t3_ct8r4f
4
https://www.reddit.com/r/elementary/new.json?after=t3_cqx4ww
5
https://www.reddit.com/r/elementary/new.json?after=t3_cldnvr
3
https://www.reddit.com/r/elementary/new.json?after=t3_cdp89x
2
https://www.reddit.com/r/elementary/new.json?after=t3_c78ibp
6
https://www.reddit.com/r/elementary/new.json?after=t3_bwmoyo
6
https://www.reddit.com/r/elementary/new.json?after=t3_brkflr
3
https://www.reddit.com/r/elementary/new.json?after=t3_ba61ow
2
https://www.reddit.com/r/elementary/new.json?after=t3_aluiwb
3
https://www.reddit.com/r/elementary/new.json?after=t3_a4h7lj
2
https://
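Because a listing can change while it is being paged through, and because the scrape may be re-run, the same post can end up stored more than once. A minimal sketch for removing such repeats follows. It assumes each post dictionary carries Reddit's `name` field (the fullname, such as `t3_fy3825`, that also appears as the `after` token in the output above); the helper name `drop_duplicate_posts` is hypothetical.

```python
def drop_duplicate_posts(posts, key='name'):
    """Return posts with repeats removed, keeping the first occurrence of each key."""
    seen = set()
    unique = []
    for post in posts:
        if post[key] not in seen:
            seen.add(post[key])
            unique.append(post)
    return unique
```

Applying it to `posts` and `posts_e` before writing the CSV files would keep duplicated posts from inflating the word counts of either class.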