# Project Name: Streaming Service Classifier (Web APIs & NLP Part 1)

## Content:
### 1. Overview
### 2. Problem Statement
### 3. Data Collection

## 1. Overview

Reddit is a social media platform that hosts many communities where people explore their interests, hobbies, and passions. These communities, called subreddits, are each dedicated to a specific topic. The objective of this project is to build a machine learning model that classifies posts by topic, using the r/Netflix and r/Disneyplus subreddits, and to draw insights about each topic from the results.

## 2. Problem Statement

Netflix's share of streaming subscribers is under threat from Disneyplus, which has surpassed it in subscriber numbers. Disney's ownership of Hulu and ESPN alongside Disneyplus puts Netflix at a further disadvantage in the streaming industry. Netflix has also lost 18 billion dollars in value, leaving shareholders unhappy with the company. Subscriptions are Netflix's main source of revenue, and that revenue is now threatened.

Sources:
* https://www.forbes.com/sites/qai/2022/09/27/disney-surpasses-netflix-subscriber-count-what-does-that-means-for-investors/?sh=66739a245e0b
* https://www.gamingbible.com/news/tv-and-film/netflix-just-lost-18-billion-in-value-949478-20230721

### Scope
Build a classification model for Netflix and provide key insights so that its marketing campaigns and shows align with keywords and themes closely associated with Disneyplus.


### Goal
* Achieve an F1 score of at least 90 percent.
* Redirect online search traffic from Disneyplus towards Netflix by using keywords and themes that are closely associated with Disneyplus on Reddit.
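
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the function name and the counts are hypothetical and chosen only to illustrate the 90 percent threshold, they are not project results.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1 = 2 * precision * recall / (precision + recall)."""
    if tp == 0:
        # With no true positives, precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only.
example_f1 = f1_from_counts(tp=95, fp=5, fn=10)
meets_goal = example_f1 >= 0.90
print(f"F1 = {example_f1:.3f}, meets goal: {meets_goal}")
```

In practice a library routine such as scikit-learn's `f1_score` would normally be applied to the model's predictions instead of hand-computed counts.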


### Why are we doing this:
* This strategic maneuver will help Netflix solidify its position in the market and attract more viewers to their platform.

### Who we are:
A group of data science consultants engaged by the Netflix marketing team.


### Primary Stakeholders: 
* Netflix Marketing Team

### Secondary Stakeholders: 
* Netflix Content Team


### Data Sources

The data were scraped from Reddit using PRAW (Python Reddit API Wrapper) and written to CSV files, which are then loaded with Pandas.

Subreddits scraped:
* https://www.reddit.com/r/Netflix
* https://www.reddit.com/r/Disneyplus


### Data Files for Data Cleaning and EDA
* netflix_df: Netflix_reddit_submissions.csv
* disneyplus_df: Disneyplus_reddit_submissions.csv

## 3. Data Collection

The datasets were obtained with **PRAW** (Python Reddit API Wrapper), which uses the Reddit API to access and scrape posts from the two subreddits:
* Netflix: https://www.reddit.com/r/Netflix
* Disneyplus: https://www.reddit.com/r/Disneyplus

The data were scraped from five listing categories in each subreddit:
* hot
* new
* rising
* controversial
* top

The features extracted from each submission are:

* subreddit: topic_num
* title: submission.title
* selftext: submission.selftext
* ups: submission.ups
* upvote_ratio: submission.upvote_ratio
* num_comments: submission.num_comments
* author: str(submission.author)
* link_flair_text: submission.link_flair_text
* awards: len(submission.all_awardings)
* is_original_content: submission.is_original_content
* is_video: submission.is_video
* post_type: 'text' if submission.is_self else 'link'
* domain: submission.domain
* created_utc: submission.created_utc
* pinned: submission.pinned
* locked: submission.locked
* stickied: submission.stickied
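
The mapping above can be sketched as a small helper that turns one submission into a flat row. This is an illustration only: `submission_to_row` is a hypothetical name, and the `SimpleNamespace` object with made-up values stands in for a real PRAW submission so the snippet runs without Reddit credentials.

```python
from types import SimpleNamespace


def submission_to_row(submission, topic_num):
    """Map a submission-like object to the 17 fields listed above."""
    return {
        'subreddit': topic_num,
        'title': submission.title,
        'selftext': submission.selftext,
        'ups': submission.ups,
        'upvote_ratio': submission.upvote_ratio,
        'num_comments': submission.num_comments,
        # str() turns a missing author (None) into the string 'None'.
        'author': str(submission.author),
        'link_flair_text': submission.link_flair_text,
        'awards': len(submission.all_awardings),
        'is_original_content': submission.is_original_content,
        'is_video': submission.is_video,
        'post_type': 'text' if submission.is_self else 'link',
        'domain': submission.domain,
        'created_utc': submission.created_utc,
        'pinned': submission.pinned,
        'locked': submission.locked,
        'stickied': submission.stickied,
    }


# Stand-in object with made-up values (not real Reddit data).
fake_submission = SimpleNamespace(
    title='Example title', selftext='Example body', ups=10,
    upvote_ratio=0.9, num_comments=2, author=None, link_flair_text=None,
    all_awardings=[], is_original_content=False, is_video=False,
    is_self=True, domain='self.Netflix', created_utc=1693920000.0,
    pinned=False, locked=False, stickied=False,
)
row = submission_to_row(fake_submission, 'Netflix')
print(row['post_type'], row['author'], row['awards'])
```

Note that `post_type` and `awards` are derived values rather than raw attributes, and that `author` is stored as a string, so a missing author appears as the text `'None'` rather than an empty cell.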
  
Each topic's dataset is then saved to the following files before Data Cleaning and EDA:
* Netflix data: Netflix_reddit_submissions.csv
* Disneyplus data: Disneyplus_reddit_submissions.csv

### Import Libraries

In [1]:
import pandas as pd
import praw
import csv
from datetime import datetime

### Reddit API Credentials

A few credentials are required. They are created when registering an app for the Reddit API at https://www.reddit.com/prefs/apps:

* CLIENT_ID
* SECRET_KEY
* reddit_user_agent

In [2]:
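# Read the credentials from environment variables instead of hard-coding secrets.
# The variable names below are assumptions; set them to match your own environment.
import os

CLIENT_ID = os.environ['REDDIT_CLIENT_ID']
SECRET_KEY = os.environ['REDDIT_SECRET_KEY']
reddit_user_agent = 'MyBot'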

# Specify the subreddit topics
topic_1 = 'Netflix'
topic_2 = 'Disneyplus'

### Create Function (scrape_reddit_post) for Data Scraping and Export as CSV Files

In [3]:
# Create a Reddit instance from the credentials
reddit = praw.Reddit(client_id=CLIENT_ID,
                     client_secret=SECRET_KEY,
                     user_agent=reddit_user_agent)



# Create a user-defined function that acts as the scraper.
# The argument to pass in is the name of the subreddit topic.
def scrape_reddit_post(topic_num):
    chosen_subreddit = reddit.subreddit(topic_num)
    count = 0  
    
    with open(f'./datasets/{topic_num}_reddit_submissions.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['subreddit','title', 'selftext', 'ups', 'upvote_ratio', 'num_comments', 'author', 
                      'link_flair_text', 'awards', 'is_original_content', 'is_video', 'post_type', 
                      'domain', 'created_utc', 'pinned', 'locked', 'stickied']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        
        writer.writeheader()
        
        for category in ['hot', 'new', 'rising', 'controversial', 'top']:
            submissions = getattr(chosen_subreddit, category)(limit=None)
            for submission in submissions:
                writer.writerow({
                    'subreddit': topic_num,
                    'title': submission.title,
                    'selftext': submission.selftext,
                    'ups': submission.ups,
                    'upvote_ratio': submission.upvote_ratio,
                    'num_comments': submission.num_comments,
                    'author': str(submission.author),
                    'link_flair_text': submission.link_flair_text,
                    'awards': len(submission.all_awardings),
                    'is_original_content': submission.is_original_content,
                    'is_video': submission.is_video,
                    'post_type': 'text' if submission.is_self else 'link',
                    'domain': submission.domain,
                    'created_utc': submission.created_utc,
                    'pinned': submission.pinned,
                    'locked': submission.locked,
                    'stickied': submission.stickied
                })
                
                count += 1
                if count >= 10000:
                    break
                    
            if count >= 10000:
                break
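
Because the function iterates over five listings of the same subreddit, a post that appears in more than one listing (for example both hot and top) can be written to the CSV more than once. The sketch below shows one possible way to drop such repeats after loading. It assumes a post can be identified by its title, author, and creation time; that key is an assumption for illustration, and the rows are made up.

```python
import pandas as pd

# Made-up rows: 'Post A' appears twice, as if returned by two listings,
# with a slightly different upvote count on the second pass.
rows = pd.DataFrame({
    'title': ['Post A', 'Post B', 'Post A'],
    'author': ['user1', 'user2', 'user1'],
    'created_utc': [1693920000.0, 1693830000.0, 1693920000.0],
    'ups': [6, 1048, 7],
})

# Keep the first occurrence of each (title, author, created_utc) combination.
deduped = (
    rows.drop_duplicates(subset=['title', 'author', 'created_utc'], keep='first')
        .reset_index(drop=True)
)
print(len(rows), '->', len(deduped))
```

Vote counts can change between listing requests, which is why the key leaves out `ups` and similar fields.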

### Data Scraping and Export as CSV Files

In [4]:
# These calls are commented out because the data have already been scraped,
# and to avoid overwriting the existing files.

#scrape_reddit_post(topic_1)
#scrape_reddit_post(topic_2)

### Load the Collected Data
* netflix_df: Netflix_reddit_submissions.csv
* disneyplus_df: Disneyplus_reddit_submissions.csv

In [5]:
netflix_df = pd.read_csv('datasets/Netflix_reddit_submissions.csv')
disneyplus_df = pd.read_csv('datasets/Disneyplus_reddit_submissions.csv')

### Check First 5 Rows of netflix_df

In [6]:
netflix_df.head()

| | subreddit | title | selftext | ups | upvote_ratio | num_comments | author | link_flair_text | awards | is_original_content | is_video | post_type | domain | created_utc | pinned | locked | stickied |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Netflix | /r/Netflix Discord Server | We are pleased to announce we have affiliated ... | 427 | 0.97 | 181 | N3DSdude | Announcement | 3 | False | False | text | self.netflix | 1619278000.0 | False | False | True |
| 1 | Netflix | Netflix Announces Plans to Crack Down on Passw... | \> \*\*Any post relating to this thread will now ... | 666 | 0.94 | 3059 | UniversallySecluded | Megathread | 0 | False | False | text | self.netflix | 1675331000.0 | False | False | True |
| 2 | Netflix | Netflix's live-action adaptation of One Piece ... | | 389 | 0.89 | 108 | dailymail | | 0 | False | False | link | dailymail.co.uk | 1693906000.0 | False | False | False |
| 3 | Netflix | Aaron Paul has not received any money for Brea... | | 1048 | 0.92 | 218 | Ben__Harlan | | 0 | False | False | link | hardwaresfera.com | 1693830000.0 | False | False | False |
| 4 | Netflix | Why are streaming shows much more downbeat and... | This is very general I know that, and it actua... | 6 | 0.8 | 6 | Beau_bell | | 0 | False | False | text | self.netflix | 1693920000.0 | False | False | False |
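
The `created_utc` column holds Unix epoch seconds, which are hard to read in raw form. A minimal sketch of converting one such value to a timezone-aware timestamp with the standard library follows; `utc_from_epoch` is a hypothetical helper name, and the input is the value displayed for the first row of the preview.

```python
from datetime import datetime, timezone


def utc_from_epoch(seconds: float) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# created_utc as displayed for the first Netflix row.
first_post_time = utc_from_epoch(1619278000.0)
print(first_post_time.isoformat())
```

For a whole column, `pd.to_datetime(netflix_df['created_utc'], unit='s', utc=True)` performs the same conversion in one call.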


### Check netflix_df Data Shape

In [7]:
netflix_df.shape

(3477, 17)

### Check First 5 Rows of disneyplus_df

In [8]:
disneyplus_df.head()

| | subreddit | title | selftext | ups | upvote_ratio | num_comments | author | link_flair_text | awards | is_original_content | is_video | post_type | domain | created_utc | pinned | locked | stickied |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Disneyplus | This is the Weekly Tech Support Thread | All posts regarding tech support belong here.\\... | 3 | 1.0 | 20 | AutoModerator | :Tech: Tech Support | 0 | False | False | text | self.DisneyPlus | 1693415000.0 | False | False | True |
| 1 | Disneyplus | Ahsoka - Episodes 1 and 2 Megathread | Ahsoka is (almost) here!\n\nStart streaming th... | 18 | 0.95 | 27 | anonRedd | :Thread: Mega Thread | 0 | False | False | text | self.DisneyPlus | 1692730000.0 | False | False | True |
| 2 | Disneyplus | Marvel Studios’ Loki Season 2 \| October 6 on D... | | 122 | 0.97 | 8 | Capital_Gate6718 | :Trailer: Official Trailer | 0 | False | False | link | youtube.com | 1693845000.0 | False | False | False |
| 3 | Disneyplus | Can I make a specific watchlist on DisneyPlus? | Is there any way I can make a specific watchli... | 10 | 0.92 | 2 | mickieals | :Question: Question | 0 | False | False | text | self.DisneyPlus | 1693879000.0 | False | False | False |
| 4 | Disneyplus | Does the windows app have atmos and 4K and HDR? | I was trying to load up the Asoka series and I... | 5 | 1.0 | 5 | Behacad | :Discussion: Discussion | 0 | False | False | text | self.DisneyPlus | 1693862000.0 | False | False | False |


### Check disneyplus_df Data Shape

In [9]:
disneyplus_df.shape

(3915, 17)
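
With 3477 Netflix rows and 3915 Disneyplus rows, the two classes are roughly balanced. The short calculation below, a sketch based only on these raw row counts (before any cleaning or duplicate removal), gives each class's share and the accuracy a classifier would reach by always predicting the larger class.

```python
# Raw row counts taken from the two shape outputs.
netflix_rows = 3477
disneyplus_rows = 3915

total_rows = netflix_rows + disneyplus_rows
netflix_share = netflix_rows / total_rows
disneyplus_share = disneyplus_rows / total_rows

# Accuracy of always predicting the majority class.
majority_baseline = max(netflix_share, disneyplus_share)

print(f"total rows: {total_rows}")
print(f"Netflix share: {netflix_share:.4f}")
print(f"Disneyplus share: {disneyplus_share:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

These figures may shift once the data are cleaned, so they are best read as a rough starting reference for judging later model scores.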

### Refer to 02_data_clean_eda:
1. Data Cleaning
2. Exploratory Data Analysis (EDA)