# Project 3 - Quantifying TV Laughter: A Data-Backed Guide for Brooklyn Nine-Nine and Big Bang Theory Investment

### Contents:
- [Background](#Background)
- [Scraping Reddit using API](#Scraping-Reddit-using-API)
- [Data Import & Cleaning](#Data-Import-and-Cleaning)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)

# Background

# Scraping Reddit using API

In [140]:
from pprint import pprint
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import time
import random
import os
import ipywidgets as widgets
from IPython.display import display


`brooklynninenine` has 113 columns while `bigbangtheory` has 11 columns. For both datasets, we extract only the columns present in `bigbangtheory`:
['title', 'selftext', 'author_flair_text', 'link_flair_text', 'score', 'upvote_ratio', 'distinguished', 'is_original_content', 'is_self', 'num_comments', 'subreddit']
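
As a rough illustration of this column selection, the sketch below trims a raw post dictionary down to the shared columns before it ever reaches a DataFrame. The helper name `keep_shared_columns` and the sample post are hypothetical and used only for illustration; keys missing from a post are filled with `None` so every post ends up with the same shape.

```python
# Columns shared by both subreddits (copied from the list above)
COLS_TO_KEEP = ['title', 'selftext', 'author_flair_text', 'link_flair_text',
                'score', 'upvote_ratio', 'distinguished', 'is_original_content',
                'is_self', 'num_comments', 'subreddit']


def keep_shared_columns(post, columns=COLS_TO_KEEP):
    """Return a new dict holding only the requested keys (hypothetical helper).

    Keys that are absent from the post are set to None.
    """
    return {col: post.get(col) for col in columns}


# Made-up post with one extra field that should be dropped
raw_post = {
    'title': 'Example title',
    'score': 10,
    'subreddit': 'brooklynninenine',
    'thumbnail_height': 140,
}
trimmed = keep_shared_columns(raw_post)
print(trimmed)
```

Filtering at the dictionary level keeps the saved CSV schema identical for both subreddits, whatever extra fields a given post carries.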
    
FYI: Initially I wanted to extract these columns for b99:
- subreddit
- title
- selftext
- link_flair_css_class (e.g. meme, discussion)
- upvote_ratio
- ups
- link_flair_text (e.g. Humour, Discussion)
- post_hint (image, NaN)
- author
- num_comments
- subreddit_subscribers

In [145]:
def scrape(subreddit_name, limit_val):
    url = 'https://www.reddit.com/r/{}.json?t=hot'.format(subreddit_name)

    posts = []
    after = None
    
    # Setting file paths and folders to save the final dataframes in:
    # Construct the file path
    current_directory = os.getcwd()  # Get the current working directory
    target_directory = os.path.join(current_directory, '..', 'data')
    # Create the target directory if it doesn't exist
    if not os.path.exists(target_directory):
        os.makedirs(target_directory)
    # Construct the file path within the target directory
    file_path = os.path.join(target_directory, '{}.csv'.format(subreddit_name+'test'))

    col_to_save = ['title', 'selftext', 'author_flair_text', 'link_flair_text', 'score', 'upvote_ratio', 'distinguished', 'is_original_content', 'is_self', 'num_comments', 'subreddit']
    
    # Setting the number of requests to make (each request returns 25 posts by default)
    range_val = limit_val // 25
    
    # Create a progress bar widget
    print(subreddit_name)
    progress_bar = widgets.IntProgress(min=0, max=range_val, value=0)

    # Display the progress bar
#     display(progress_bar)
    
    for a in range(range_val):
        if after is None:
            current_url = url
        else:
            # The base URL already has a query string (?t=hot), so append with '&'
            current_url = url + '&after=' + after
        res = requests.get(current_url, headers={'User-agent': 'BLEEP BLORP 2.0'})

        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        after = current_dict['data']['after']

        current_df = pd.DataFrame(current_posts)
        if a > 0:
            # Append this page to the posts already saved on disk
            prev_posts = pd.read_csv(file_path)
            df = pd.concat([prev_posts, current_df[col_to_save]], ignore_index=True)
        else:
            df = current_df[col_to_save]
        df.to_csv(file_path, index=False)

        # Sleep for a random duration between requests to look more 'natural'
        sleep_duration = random.randint(2,6)
        time.sleep(sleep_duration)

        # Update the progress bar value
        progress_bar.value = a+1
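
The pagination in `scrape` depends on passing the `after` token from one response into the next request. Below is a minimal sketch of building each page URL with the standard library, so that the query string stays valid even when the base URL already carries parameters such as `?t=hot`. The helper name `build_page_url` and the token `t3_abc123` are hypothetical, chosen only for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_url(base_url, after=None):
    """Return base_url with an 'after' query parameter merged in (hypothetical helper).

    Existing query parameters are preserved; when after is None the URL
    is returned with its original parameters only.
    """
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    if after is not None:
        query['after'] = after
    return urlunsplit(parts._replace(query=urlencode(query)))


base = 'https://www.reddit.com/r/brooklynninenine.json?t=hot'
first_page = build_page_url(base)                # no token yet
next_page = build_page_url(base, 't3_abc123')    # token taken from the previous response
print(first_page)
print(next_page)
```

Merging parameters this way avoids producing a URL with two `?` separators. With a malformed query string the `after` token may not be read as its own parameter, in which case each request would return the first page again.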

In [None]:
# Run this cell only to scrape the data again; it overwrites the CSV files in ../data

scrape('brooklynninenine', 1500)
scrape('bigbangtheory', 1500)

brooklynninenine


# Data Import & Cleaning

In [129]:
current_directory = os.getcwd()
file_path = os.path.join(current_directory, '../data/brooklynnineninetest.csv')
df_b99 = pd.read_csv(file_path)
file_path = os.path.join(current_directory, '../data/bigbangtheorytest.csv')
df_bbt = pd.read_csv(file_path)

In [130]:
df_b99.info()
display(df_b99.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   title                1000 non-null   object 
 1   selftext             480 non-null    object 
 2   author_flair_text    120 non-null    object 
 3   link_flair_text      1000 non-null   object 
 4   score                1000 non-null   int64  
 5   upvote_ratio         1000 non-null   float64
 6   distinguished        0 non-null      float64
 7   is_original_content  1000 non-null   bool   
 8   is_self              1000 non-null   bool   
 9   num_comments         1000 non-null   int64  
 10  subreddit            1000 non-null   object 
dtypes: bool(2), float64(2), int64(2), object(5)
memory usage: 72.4+ KB


| | title | selftext | author_flair_text | link_flair_text | score | upvote_ratio | distinguished | is_original_content | is_self | num_comments | subreddit |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Flat-Top and the Freak supremacy | | | Humour | 4051 | 0.96 | | False | False | 69 | brooklynninenine |
| 1 | We require this. Like, today. | | Cheddar | Humour | 719 | 0.99 | | False | False | 14 | brooklynninenine |
| 2 | Maybe the Pimento quotation I like the best | | | Other | 3355 | 0.98 | | False | False | 85 | brooklynninenine |
| 3 | Brooklyn 99 users coming back to the sub after... | | | Humour | 358 | 0.96 | | False | False | 20 | brooklynninenine |
| 4 | Saw some vindication on my walk today... | She is magnificent! | | Humour | 87 | 1.0 | | False | False | 2 | brooklynninenine |


In [131]:
df_bbt.info()
display(df_bbt.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   title                1080 non-null   object 
 1   selftext             640 non-null    object 
 2   author_flair_text    80 non-null     object 
 3   link_flair_text      1000 non-null   object 
 4   score                1080 non-null   int64  
 5   upvote_ratio         1080 non-null   float64
 6   distinguished        0 non-null      float64
 7   is_original_content  1080 non-null   bool   
 8   is_self              1080 non-null   bool   
 9   num_comments         1080 non-null   int64  
 10  subreddit            1080 non-null   object 
dtypes: bool(2), float64(2), int64(2), object(5)
memory usage: 78.2+ KB


| | title | selftext | author_flair_text | link_flair_text | score | upvote_ratio | distinguished | is_original_content | is_self | num_comments | subreddit |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Official Discord Server for r/bigbangtheory! | Hi all! \nI made a The Big Bang Theory Discor... | pennyisafreeloader | | 202 | 1.0 | | False | True | 17 | bigbangtheory |
| 1 | New 'Big Bang Theory' Spinoff in Development a... | | pennygetyourownwifi | | 286 | 0.95 | | False | False | 142 | bigbangtheory |
| 2 | Switching it up | | | Character discussion | 530 | 0.96 | | False | False | 22 | bigbangtheory |
| 3 | Where did the "Half Sandwich" come from? | | | Episode discussion | 82 | 0.99 | | False | False | 22 | bigbangtheory |
| 4 | I like President Siebert… 🥰 | He knew how to manage his scientists. | | Character discussion | 282 | 0.97 | | False | False | 43 | bigbangtheory |


In [132]:
# Merge title and selftext into a single 'posts' column; only this column is used from here on.

# Replace NaN values with empty strings first; otherwise title + NaN yields NaN
df_bbt['selftext'] = df_bbt['selftext'].fillna('')
df_bbt['posts'] = df_bbt['title'] + ' ' + df_bbt['selftext']
display(df_bbt.head(5))

df_b99['selftext'] = df_b99['selftext'].fillna('')
df_b99['posts'] = df_b99['title'] + ' ' + df_b99['selftext']
display(df_b99.head(5))

| | title | selftext | author_flair_text | link_flair_text | score | upvote_ratio | distinguished | is_original_content | is_self | num_comments | subreddit | posts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Official Discord Server for r/bigbangtheory! | Hi all! \nI made a The Big Bang Theory Discor... | pennyisafreeloader | | 202 | 1.0 | | False | True | 17 | bigbangtheory | Official Discord Server for r/bigbangtheory! H... |
| 1 | New 'Big Bang Theory' Spinoff in Development a... | | pennygetyourownwifi | | 286 | 0.95 | | False | False | 142 | bigbangtheory | New 'Big Bang Theory' Spinoff in Development a... |
| 2 | Switching it up | | | Character discussion | 530 | 0.96 | | False | False | 22 | bigbangtheory | Switching it up |
| 3 | Where did the "Half Sandwich" come from? | | | Episode discussion | 82 | 0.99 | | False | False | 22 | bigbangtheory | Where did the "Half Sandwich" come from? |
| 4 | I like President Siebert… 🥰 | He knew how to manage his scientists. | | Character discussion | 282 | 0.97 | | False | False | 43 | bigbangtheory | I like President Siebert… 🥰 He knew how to man... |


| | title | selftext | author_flair_text | link_flair_text | score | upvote_ratio | distinguished | is_original_content | is_self | num_comments | subreddit | posts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Flat-Top and the Freak supremacy | | | Humour | 4051 | 0.96 | | False | False | 69 | brooklynninenine | Flat-Top and the Freak supremacy |
| 1 | We require this. Like, today. | | Cheddar | Humour | 719 | 0.99 | | False | False | 14 | brooklynninenine | We require this. Like, today. |
| 2 | Maybe the Pimento quotation I like the best | | | Other | 3355 | 0.98 | | False | False | 85 | brooklynninenine | Maybe the Pimento quotation I like the best |
| 3 | Brooklyn 99 users coming back to the sub after... | | | Humour | 358 | 0.96 | | False | False | 20 | brooklynninenine | Brooklyn 99 users coming back to the sub after... |
| 4 | Saw some vindication on my walk today... | She is magnificent! | | Humour | 87 | 1.0 | | False | False | 2 | brooklynninenine | Saw some vindication on my walk today... She i... |


In [133]:
# Preparing one final dataset to use for EDA
df = pd.concat([df_bbt[['posts', 'subreddit']], df_b99[['posts', 'subreddit']]]).reset_index()

In [74]:
df.groupby('subreddit').count()

| subreddit | index | posts |
|---|---|---|
| bigbangtheory | 52 | 52 |
| brooklynninenine | 50 | 50 |


# Exploratory Data Analysis

Features in the posts worth exploring:
- Exclamation marks (`!`)
- Capital letters / all-uppercase words
- Emojis (how to detect emojis?)
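
One possible way to count these signals per post is sketched below, using only the standard library. The emoji pattern is an assumption: it covers a few common Unicode emoji blocks and is not an exhaustive detector. The function name `text_signals` is hypothetical, and "all-uppercase word" is taken here to mean two or more consecutive ASCII capitals, so a lone "I" is not counted.

```python
import re

# Approximate emoji ranges (assumption, not exhaustive):
# U+1F300..U+1FAFF pictographs and emoticons, U+2600..U+27BF misc symbols and dingbats
EMOJI_PATTERN = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF]')

# Words made of two or more ASCII capital letters
ALL_CAPS_PATTERN = re.compile(r'\b[A-Z]{2,}\b')


def text_signals(text):
    """Count exclamation marks, all-caps words and emojis in a string (hypothetical helper)."""
    return {
        'exclamations': text.count('!'),
        'all_caps_words': len(ALL_CAPS_PATTERN.findall(text)),
        'emojis': len(EMOJI_PATTERN.findall(text)),
    }


print(text_signals('I like President Siebert… 🥰'))
print(text_signals('NINE NINE! She is magnificent!'))
```

Applied to the merged column, this could look like `df['posts'].apply(text_signals).apply(pd.Series)`, which would give one numeric column per signal for comparison across the two subreddits.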