# Introduction


This project builds a text classifier using Natural Language Processing (NLP). 

Posts are retrieved from two movie subreddits through Reddit's API, then cleaned and used to train a classifier that can accurately predict the movie genre (subreddit) a post belongs to. 

This is a supervised classification problem, and the workflow covers data acquisition and wrangling, natural language processing, and classification modeling. 



***

## 1.0 Data Collection and Cleaning

Goal: Obtain corpus

- [1.1 Data Collection from Reddit](#1.1-Data-Collection-from-Reddit)
- [1.2 Initial View of the Data](#1.2-Initial-View-of-the-Data)
- [1.3 Cleaning of the Data](#1.3-Cleaning-of-the-Data)


<!-- Back to: [1.0 - Data Collection and Cleaning](#1.0-Data-Collection-and-Cleaning) -->
<!-- Go to [Defined Functions](#Defined-Functions:) -->

Jump to:
<!-- - [1.0 Data Collection and Cleaning](01_data_collection_&_cleaning.ipynb) -->
- [2.0 EDA & Data Preprocessing](02_eda_&_data_processing.ipynb)
- [3.0 Model Building](03_model_building.ipynb)
- [4.0 Reddit Classification Report](04_reddit_classification_report.ipynb)

#### Imports:

In [1]:
import pandas as pd
import requests
import numpy as np

import time
import random

#### Defined Functions:

In [2]:
#function to get data from reddit API, one page of posts per loop (e.g. 40 loops x 25 posts)

def get_data(dict_name, url, file_name, loops):
    """ loop to get data from each page of reddit, saves into csv."""
    posts = []
    after = None
    count_lines =0

    #loop through the pages, using the 'after' token to request the next page
    for _ in range(loops):
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        res= requests.get(current_url, headers={'User-agent': 'Mabbook Pro'})

        if res.status_code != 200:
            print("status error", res.status_code)
            break
        
        
        dict_name= res.json()
        current_posts =  [p['data'] for p in dict_name['data']['children']]
        posts.extend(current_posts)
        after = dict_name['data']['after']
        

        #generate random sleep
        sleep_duration = random.randint(1,3)
        print(f"sleep: {sleep_duration}")
        time.sleep(sleep_duration)
        
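        #adding a counter to help shorten the loop after
        count_lines += 1
        print(f"count: {count_lines} ")

        #stop once there is no further page; otherwise the next request would restart at page one
        if after is None:
            break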
    
        
    csv_name = '../datasets/' + file_name + '.csv'
    pd.DataFrame(posts).to_csv(csv_name, index=False)

    return len(posts)
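
The 'after'-token pagination that `get_data` relies on can be sketched without any network access. This is only an illustration: `collect_pages`, `fetch_page` and `FAKE_PAGES` are hypothetical names, and the fake pages simply imitate the `data -> children / after` layout read above. The sketch assumes the listing reports `after` as `None` once there is nothing more to fetch.

```python
from typing import Callable, Optional


def collect_pages(fetch_page: Callable[[Optional[str]], dict], max_pages: int) -> list:
    """Follow 'after' tokens until max_pages is reached or the listing ends."""
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:  # no further page is advertised
            break
    return posts


# Hypothetical stand-in for the HTTP call, so the sketch runs offline.
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"title": "a"}},
                                 {"data": {"title": "b"}}],
                    "after": "t3_x"}},
    "t3_x": {"data": {"children": [{"data": {"title": "c"}}],
                      "after": None}},
}

titles = [p["title"] for p in collect_pages(lambda after: FAKE_PAGES[after], max_pages=5)]
print(titles)  # ['a', 'b', 'c']
```

Passing the page-fetching step in as a function keeps the looping logic testable offline; in the notebook that step is the `requests.get` call.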
  

In [3]:
# create function for viewing unique

def viewuni(df, column):
    """for viewing unique entries by column and dataframe"""
    print(df[column].unique())

In [4]:
# create function for viewing nulls

def viewnull (df, column):
    """for viewing 'NaN' entries by column and dataframe"""
    return df[df[column].isnull()]

In [5]:
# create function for replacing values

def replace_val(df, column, old_value, new_value):
    """ for replacing values in columns"""
    df[column] = df[column].replace({old_value: new_value})

In [6]:
# create function for printing value counts
def val_count(df, columns):
    """ for value counts"""
    print (df[columns].value_counts(dropna=False))

In [7]:
# create a function to count nulls only
def null_count(df, columns):
    """to count null in column"""
    print(f"{columns} nulls: {df[columns].isna().sum()}")


***

## 1.1 Data Collection from Reddit

- Back to: [1.0 - Data Collection and Cleaning](#1.0-Data-Collection-and-Cleaning)
- Go to [Defined Functions](#Defined-Functions:)

In [8]:
#scifi movies url and horror movies url
scifi_url = 'https://www.reddit.com/r/scifimovies.json'
horror_url = 'https://www.reddit.com/r/HorrorMovies.json'

In [9]:
#use get_data function to retrieve posts from reddit and save them as a csv file. 
get_data('scifi_dict', scifi_url, 'scifi_data', 14)

https://www.reddit.com/r/scifimovies.json
sleep: 3
count: 1 
https://www.reddit.com/r/scifimovies.json?after=t3_gyb8db
sleep: 3
count: 2 
https://www.reddit.com/r/scifimovies.json?after=t3_gmldkf
sleep: 1
count: 3 
https://www.reddit.com/r/scifimovies.json?after=t3_gcdgft
sleep: 3
count: 4 
https://www.reddit.com/r/scifimovies.json?after=t3_fbdwk6
sleep: 2
count: 5 
https://www.reddit.com/r/scifimovies.json?after=t3_eepqw0
sleep: 2
count: 6 
https://www.reddit.com/r/scifimovies.json?after=t3_c6dcjq
sleep: 3
count: 7 
https://www.reddit.com/r/scifimovies.json?after=t3_alyqnr
sleep: 2
count: 8 
https://www.reddit.com/r/scifimovies.json?after=t3_93q8sv
sleep: 3
count: 9 
https://www.reddit.com/r/scifimovies.json?after=t3_85k8mt
sleep: 3
count: 10 
https://www.reddit.com/r/scifimovies.json?after=t3_6nwdw8
sleep: 3
count: 11 
https://www.reddit.com/r/scifimovies.json?after=t3_69moep
sleep: 1
count: 12 
https://www.reddit.com/r/scifimovies.json?after=t3_5nraru
sleep: 1
count: 13 
https://www

347

In [10]:
#use get_data function to retrieve posts from reddit and save them as a csv file. 
get_data('horror_dict', horror_url, 'horror_data', 36)

https://www.reddit.com/r/HorrorMovies.json
sleep: 1
count: 1 
https://www.reddit.com/r/HorrorMovies.json?after=t3_hbs6u3
sleep: 2
count: 2 
https://www.reddit.com/r/HorrorMovies.json?after=t3_h8ruze
sleep: 1
count: 3 
https://www.reddit.com/r/HorrorMovies.json?after=t3_h0k30s
sleep: 3
count: 4 
https://www.reddit.com/r/HorrorMovies.json?after=t3_gsz555
sleep: 3
count: 5 
https://www.reddit.com/r/HorrorMovies.json?after=t3_go74h9
sleep: 2
count: 6 
https://www.reddit.com/r/HorrorMovies.json?after=t3_gkliya
sleep: 3
count: 7 
https://www.reddit.com/r/HorrorMovies.json?after=t3_gfk6gq
sleep: 2
count: 8 
https://www.reddit.com/r/HorrorMovies.json?after=t3_gdfcmd
sleep: 1
count: 9 
https://www.reddit.com/r/HorrorMovies.json?after=t3_gbavs2
sleep: 2
count: 10 
https://www.reddit.com/r/HorrorMovies.json?after=t3_g9duh9
sleep: 2
count: 11 
https://www.reddit.com/r/HorrorMovies.json?after=t3_g70z58
sleep: 3
count: 12 
https://www.reddit.com/r/HorrorMovies.json?after=t3_g53kte
sleep: 2
count: 13

895

<div class="alert alert-block alert-info">
<b>Data Collection:</b> 

- I have collected 347 posts from the subreddit group "scifimovies"
- I have collected 895 posts from the subreddit group "HorrorMovies" 
    
The data is imbalanced: the horror subreddit group has more than twice as many posts as scifimovies. I am not particularly worried about this, as the posts in "scifimovies" appear to be of good quality. 
    
I will avoid using accuracy as the scoring metric for future models, as it can be misleading on imbalanced data. 
    
    
    
</div>
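
To make the concern about accuracy concrete, here is a small arithmetic sketch using the post counts returned by the two `get_data` calls above. The "always predict horror" model is a hypothetical baseline, not something used in this project.

```python
# Post counts returned by the two get_data calls above.
n_scifi, n_horror = 347, 895
total = n_scifi + n_horror

# A hypothetical "model" that always predicts the majority class (horror).
baseline_accuracy = n_horror / total   # every horror post counts as correct
scifi_recall = 0 / n_scifi             # no sci-fi post is ever identified

print(f"baseline accuracy: {baseline_accuracy:.3f}")  # 0.721
print(f"sci-fi recall:     {scifi_recall:.3f}")       # 0.000
```

An accuracy of roughly 72% with zero recall on the minority class is the kind of result that makes plain accuracy a poor guide here.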

In [11]:
#read both data sets
sci_fi = pd.read_csv('../datasets/scifi_data.csv')
horror = pd.read_csv('../datasets/horror_data.csv')

***

## 1.2 Initial View of the Data
- Back to: [1.0 - Data Collection and Cleaning](#1.0-Data-Collection-and-Cleaning)
- Go to [Defined Functions](#Defined-Functions:)

In [12]:
#view first 5 rows
sci_fi.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,crosspost_parent_list,crosspost_parent,media_metadata,author_cakeday
0,,scifimovies,,t2_5s3h6rkp,False,,0,False,Earth vs the Flying Saucers (1956) movie trail...,[],...,https://youtu.be/S2eebFMKpnU,790,1592832000.0,0,"{'type': 'youtube.com', 'oembed': {'provider_u...",False,,,,
1,,scifimovies,,t2_5s3h6rkp,False,,0,False,It Lives Again 1974 movie trailer Plot: An epi...,[],...,https://youtu.be/gebiEuLAyME,790,1592831000.0,0,"{'type': 'youtube.com', 'oembed': {'provider_u...",False,,,,
2,,scifimovies,....,t2_63xg0g0d,False,,0,False,Dredd is a good sci fi movie,[],...,https://www.reddit.com/r/scifimovies/comments/...,790,1592774000.0,0,,False,,,,
3,,scifimovies,,t2_5s3h6rkp,False,,0,False,The Unearthly 1957 movie trailer Plot: Mad doc...,[],...,https://youtu.be/QF_B_Hf59HA,790,1592739000.0,0,"{'type': 'youtube.com', 'oembed': {'provider_u...",False,,,,
4,,scifimovies,,t2_gr6gdel,False,,0,False,Sci-Fi Short Film - C600,[],...,https://youtu.be/pZs4SYfU6pA,790,1592739000.0,0,{'oembed': {'provider_url': 'https://www.youtu...,False,,,,


In [13]:
#view last 5 rows
sci_fi.tail()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,crosspost_parent_list,crosspost_parent,media_metadata,author_cakeday
342,,scifimovies,,t2_3gkcp,False,,0,False,"Shoot First, No Questions Later — The Decline ...",[],...,https://medium.com/geek-empire/863d21d793a6,790,1377654000.0,0,,False,,,,
343,,scifimovies,,t2_3gkcp,False,,0,False,Starship Size Chart,[],...,http://i.imgur.com/itBXFu0.jpg,790,1377487000.0,0,,False,,,,
344,,scifimovies,,t2_3gkcp,False,,0,False,"50 high res pics by Syd Mead, visual futurist ...",[],...,http://imgur.com/a/s9Oyr,790,1377487000.0,0,"{'type': 'imgur.com', 'oembed': {'provider_url...",False,,,,
345,,scifimovies,,t2_3gkcp,False,,0,False,District 9 creator Neill Blomkamp has started ...,[],...,http://www.themarysue.com/district-9-sequel-di...,790,1374418000.0,0,,False,,,,
346,,scifimovies,"This is a list of some good sci-fi ""alien cont...",t2_53wsj,False,,0,False,"Top ""Alien Contact"" Movies",[],...,https://www.reddit.com/r/scifimovies/comments/...,790,1312261000.0,0,,False,,,,


In [14]:
#view information about dataframe
sci_fi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 347 entries, 0 to 346
Columns: 110 entries, approved_at_utc to author_cakeday
dtypes: bool(26), float64(34), int64(8), object(42)
memory usage: 236.7+ KB


In [15]:
#view shape
sci_fi.shape

(347, 110)

In [16]:
#view columns
sci_fi.columns

Index(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved',
       'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext',
       ...
       'url', 'subreddit_subscribers', 'created_utc', 'num_crossposts',
       'media', 'is_video', 'crosspost_parent_list', 'crosspost_parent',
       'media_metadata', 'author_cakeday'],
      dtype='object', length=110)

In [17]:
#check for moderated post
viewuni(sci_fi, 'mod_reason_title')
viewuni(sci_fi, 'mod_reports')
viewuni(sci_fi, 'quarantine')

[nan]
['[]']
[False]


In [18]:
#view null counts
null_count(sci_fi, 'selftext')
null_count(sci_fi,'title')
null_count(sci_fi,'subreddit')

selftext nulls: 268
title nulls: 0
subreddit nulls: 0


In [19]:
#check unique values for 'subreddit'
val_count(sci_fi, 'subreddit')

scifimovies    347
Name: subreddit, dtype: int64


<div class="alert alert-block alert-info">
<b>Initial View (Sci_fi):</b> 

- There are 110 columns; we are only interested in 3 of them: 'subreddit', 'selftext' and 'title'. 
    
- There are no values in the moderation columns.
    
- There is only one unique value under 'subreddit': scifimovies
    
- The 'selftext' column has 268 null values, which we will fill with the string 'None' in the cleaning stage.

</div>

***

In [20]:
#view first 5 rows
horror.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,subreddit_subscribers,created_utc,num_crossposts,media,is_video,post_hint,preview,crosspost_parent_list,crosspost_parent,author_cakeday
0,,HorrorMovies,1. No promotion spamming. This includes YouTub...,t2_3o8fm,False,,0,False,Official Rules.,[],...,6884,1507160000.0,0,,False,,,,,
1,,HorrorMovies,I've been having to remove a bunch of posts an...,t2_3o8fm,False,,0,False,Reminder: read the rules please.,[],...,6884,1592508000.0,0,,False,,,,,
2,,HorrorMovies,,t2_4mb4c1lg,False,,0,False,J.A.S.O.N.,[],...,6884,1592751000.0,0,,False,image,{'images': [{'source': {'url': 'https://previe...,,,
3,,HorrorMovies,"Just finished watching ""The Autopsy of Jane Do...",t2_54h5rewu,False,,0,False,"Thoughts on ""The Autopsy of Jane Doe""",[],...,6884,1592791000.0,0,,False,,,,,
4,,HorrorMovies,,t2_6bj49szw,False,,0,False,Hey...so I know it's a bit late but 1 month af...,[],...,6884,1592757000.0,0,,False,image,{'images': [{'source': {'url': 'https://previe...,,,


In [21]:
#view last 5 rows
horror.tail()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,subreddit_subscribers,created_utc,num_crossposts,media,is_video,post_hint,preview,crosspost_parent_list,crosspost_parent,author_cakeday
890,,HorrorMovies,I recently watched Hereditary. The horror movi...,t2_25zas02y,False,,0,False,Am I the only one who thinks Hereditary sucks?,[],...,6884,1536496000.0,0,,False,,,,,
891,,HorrorMovies,It used to be on the Verizon channel called Fe...,t2_22u3ylhh,False,,0,False,Someone PLEASE help me find the name of this m...,[],...,6884,1536531000.0,0,,False,,,,,
892,,HorrorMovies,"Early today, I finally sat down and watched He...",t2_252c0luy,False,,0,False,Movie Recommendations for those who don't scar...,[],...,6884,1536551000.0,0,,False,,,,,
893,,HorrorMovies,soooo i watched the nun last night and it was ...,t2_7k1vrvc,False,,0,False,wached the nun last night,[],...,6884,1536425000.0,0,,False,,,,,
894,,HorrorMovies,Can you guys help me find the title of the mov...,t2_13x922,False,,0,False,Some halloween movie I watched as a kid,[],...,6884,1536410000.0,0,,False,,,,,


In [22]:
#view information about dataframe
horror.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 895 entries, 0 to 894
Columns: 109 entries, approved_at_utc to author_cakeday
dtypes: bool(26), float64(35), int64(8), object(40)
memory usage: 603.2+ KB


In [23]:
#view shape
horror.shape

(895, 109)

In [24]:
#view columns
horror.columns

Index(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved',
       'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext',
       ...
       'subreddit_subscribers', 'created_utc', 'num_crossposts', 'media',
       'is_video', 'post_hint', 'preview', 'crosspost_parent_list',
       'crosspost_parent', 'author_cakeday'],
      dtype='object', length=109)

In [25]:
#check for moderated post
viewuni(horror, 'mod_reason_title')
viewuni(horror, 'mod_reports')
viewuni(horror, 'quarantine')

[nan]
['[]']
[False]


In [26]:
#view null counts
null_count(horror, 'selftext')
null_count(horror,'title')
null_count(horror,'subreddit')

selftext nulls: 67
title nulls: 0
subreddit nulls: 0


In [27]:
#check unique values for 'subreddit'
val_count(horror, 'subreddit')

HorrorMovies    895
Name: subreddit, dtype: int64


<div class="alert alert-block alert-info">
<b>Initial View (Horror):</b> 
 

- There are 109 columns; we are only interested in 3 of them: 'subreddit', 'selftext' and 'title'. 
    
- There are no values in the moderation columns.
    
- There is only one unique value under 'subreddit': HorrorMovies
    
- The 'selftext' column has 67 null values, which we will fill with the string 'None' in the cleaning stage.
    
- The rows at index 0 and 1 are the subreddit's rules and instructions and will be dropped.

</div>

***

## 1.3 Cleaning of the Data

- Back to: [1.0 - Data Collection and Cleaning](#1.0-Data-Collection-and-Cleaning)
- Go to [Defined Functions](#Defined-Functions:)

In [28]:
#dropping index 0 and 1 from Horror
horror.drop([0,1], axis=0, inplace=True)

In [29]:
#replace nulls with "None"
replace_val(sci_fi, 'selftext', np.nan, 'None')
replace_val(horror, 'selftext', np.nan, 'None')

In [30]:
#concatenate 'title' and 'selftext' into new column 'data'

sci_fi['data'] = sci_fi['title'] + " " +sci_fi['selftext']
horror['data'] = horror['title'] + " " +horror['selftext']

# combine data for new dataframe
red_data = pd.concat([sci_fi[['data', 'subreddit']], horror[['data', 'subreddit']]],
                     axis=0, ignore_index=True)
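
The order of the two cells above matters: nulls are filled before 'title' and 'selftext' are concatenated. The toy frame below is made up for illustration, and `fillna` is used as a stand-in for the notebook's `replace_val` helper. It shows what would happen to a link post if the nulls were left in place.

```python
import numpy as np
import pandas as pd

# Toy frame imitating a link post (no selftext) and a text post.
toy = pd.DataFrame({"title": ["Starship Size Chart", "Top movies"],
                    "selftext": [np.nan, "A short list"]})

# Without filling, string + NaN gives NaN, so the title would be lost too.
unfilled = toy["title"] + " " + toy["selftext"]

# With the placeholder in place, every row keeps its title.
filled = toy["title"] + " " + toy["selftext"].fillna("None")

print(unfilled.isna().tolist())  # [True, False]
print(filled.tolist())
```

One side effect of the placeholder is that link posts gain the extra word "None" in their text; filling with an empty string would avoid that, though it is not the choice made in this notebook.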

In [31]:
#view first 5 rows
red_data.head()

Unnamed: 0,data,subreddit
0,Earth vs the Flying Saucers (1956) movie trail...,scifimovies
1,It Lives Again 1974 movie trailer Plot: An epi...,scifimovies
2,Dredd is a good sci fi movie ....,scifimovies
3,The Unearthly 1957 movie trailer Plot: Mad doc...,scifimovies
4,Sci-Fi Short Film - C600 None,scifimovies


In [32]:
#view last 5 rows
red_data.tail()

Unnamed: 0,data,subreddit
1235,Am I the only one who thinks Hereditary sucks?...,HorrorMovies
1236,Someone PLEASE help me find the name of this m...,HorrorMovies
1237,Movie Recommendations for those who don't scar...,HorrorMovies
1238,wached the nun last night soooo i watched the ...,HorrorMovies
1239,Some halloween movie I watched as a kid Can yo...,HorrorMovies


In [33]:
#view shape of dataframe
red_data.shape

(1240, 2)

In [34]:
#view information about dataframe
red_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1240 entries, 0 to 1239
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   data       1240 non-null   object
 1   subreddit  1240 non-null   object
dtypes: object(2)
memory usage: 19.5+ KB


In [35]:
#view null values (if any)
red_data.isnull().sum()

data         0
subreddit    0
dtype: int64

In [36]:
# count of column values for label
val_count(red_data, 'subreddit')

HorrorMovies    893
scifimovies     347
Name: subreddit, dtype: int64


In [37]:
# change column name from 'subreddit' to 'label'
red_data.rename(columns={'subreddit': 'label'}, inplace=True)

In [38]:
#change column values to lowercase
red_data['data'] = red_data.data.str.lower()
red_data['label'] = red_data.label.str.lower()

In [39]:
#changing target label to binary for modeling
red_data['label'] = red_data['label'].map({'horrormovies':0, 'scifimovies':1})
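
The mapping above uses lowercase keys, so it depends on the lowercasing step in the previous cell. A minimal sketch with made-up labels shows the failure mode it avoids: `Series.map` with a dict returns NaN for any value that is not a key.

```python
import pandas as pd

labels = pd.Series(["horrormovies", "scifimovies", "HorrorMovies"])
mapping = {"horrormovies": 0, "scifimovies": 1}

# 'HorrorMovies' is not a key in the mapping, so it becomes NaN.
encoded = labels.map(mapping)
print(encoded.isna().sum())  # 1

# Lowercasing first makes every value match a key.
encoded_ok = labels.str.lower().map(mapping)
print(encoded_ok.tolist())   # [0, 1, 0]
```

A quick `isna().sum()` on the label column after mapping is a cheap way to confirm that no value was missed.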

In [43]:
#remove duplicates
red_data.drop_duplicates(inplace=True)

In [44]:
#view shape of dataframe
red_data.shape

(1238, 2)
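
The shape is now 1238 rows, yet the later `tail()` output still ends at index 1239. That is expected, because `drop_duplicates` keeps the original index labels of the surviving rows. A toy example (the data is invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"data": ["a", "b", "a", "c"], "label": [1, 1, 1, 0]})
deduped = toy.drop_duplicates()

print(deduped.shape)            # (3, 2)
print(deduped.index.tolist())   # [0, 1, 3]  original labels kept, leaving a gap

# Optional: renumber the rows if a contiguous index is wanted.
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Since the cleaned data is saved with `index=False`, the gaps do not reach the CSV file either way.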

In [46]:
# count of column values for label
val_count(red_data, 'label')

0    893
1    345
Name: label, dtype: int64


In [47]:
#view first 5 rows
red_data.head()

Unnamed: 0,data,label
0,earth vs the flying saucers (1956) movie trail...,1
1,it lives again 1974 movie trailer plot: an epi...,1
2,dredd is a good sci fi movie ....,1
3,the unearthly 1957 movie trailer plot: mad doc...,1
4,sci-fi short film - c600 none,1


In [48]:
#view last 5 rows
red_data.tail()

Unnamed: 0,data,label
1235,am i the only one who thinks hereditary sucks?...,0
1236,someone please help me find the name of this m...,0
1237,movie recommendations for those who don't scar...,0
1238,wached the nun last night soooo i watched the ...,0
1239,some halloween movie i watched as a kid can yo...,0


 <div class="alert alert-block alert-info">
<b>Final view:</b>
    
- The data has no null values
    
- The label has been converted to binary
    
- All text has been converted to lowercase
    
- Both subreddits have been combined into a tidy DataFrame of 2 columns, where each variable forms a column and each observation a row.
    
- There are 1238 posts in total: 893 belonging to Horror and 345 belonging to Scifi. The data is therefore imbalanced. We can mitigate this by avoiding models and scoring methods that do not handle imbalanced data well.

    
</div>
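
As one possible illustration of handling the imbalance, the sketch below computes inverse-frequency class weights from the final label counts. The heuristic `n_samples / (n_classes * count)` is the one scikit-learn documents for `class_weight='balanced'`; whether the later notebooks use weighting at all is not stated here, so treat this as an assumption.

```python
# Label counts from the val_count output above (0 = horror, 1 = sci-fi).
counts = {0: 893, 1: 345}
n_samples = sum(counts.values())
n_classes = len(counts)

# Rarer classes receive proportionally larger weights.
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label, w in weights.items():
    print(label, round(w, 3))
# 0 0.693
# 1 1.794
```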


In [49]:
#saving cleaned data
red_data.to_csv('../datasets/combined_data.csv', index=False)

Back up to: [1.0 - Data Collection and Cleaning](#1.0-Data-Collection-and-Cleaning)

or 

Jump to:
<!-- - [1.0 Data Collection and Cleaning](01_data_collection_&_cleaning.ipynb) -->
- [2.0 EDA & Data Preprocessing](02_eda_&_data_processing.ipynb)
- [3.0 Model Building](03_model_building.ipynb)
- [4.0 Reddit Classification Report](04_reddit_classification_report.ipynb)