# Project 3: Subreddit Classification with NLP

# The Problem Statement
Reddit is a collection of interest-based communities known as subreddits, with members discussing a great variety of topics. Within each subreddit, users can create text or image posts, and upvote or downvote posts to express approval or disapproval regarding the content of the post. The number of upvotes and downvotes are fed into Reddit's hot-ranking algorithm to determine a score for the post, with higher scoring posts rising to the top of the subreddit.

We are part of the core team within the customer success department of Zoom International.  With MST as our largest and fastest growing competitor, we want to examine what users have been discussing on Reddit by applying NLP techniques. We will then train a classifier to accurately classify content between the two sub-Reddits. Based on the model's  we will make recommendations on two prongs - to the software development team and the marketing team:

1) Software Development Team - to highlight what are the common issues faced by users, as well as any additional features that users would like

2) Marketing - (i) to look at what features MST users have issues with (more than Zoom users) and tweak our campaigns to capitalise on their perceived weaknesses and (ii) to look at which words are closely associated with Zoom and MST. These words can considered for our Search Engine Marketing and Search Engine Optimisation campaigns. To utilise these words as paid keywords such as Google AdWords or organic keywords in our sites.

With the problem statement explained above, we have selected both subreddits -- r/Zoom and r/MicrosoftTeams. Both subreddits contain comments or issues raised quite actively by the community, mostly users of the individual platform.

In this project, I experimented with vectorizers including TfidfVectorizer and CountVectorizer using both Logistic Regression and Random Forest.  I have also leverage on VEDA to perform a initial sentiment analysis to further explore the general perception of our platform (Zoom) versus Microsoft Team.

Since the extraction of the relevant data leveraging on the PushShift API may take long, I have decided to split my notebook into 2 separate copy.  The first notebook would handle the extraction of the respective posting leveraging on PushShift API and the 2nd one includes code to perform standard data cleaning, EDA and Modelling.  Since COVID pandemic, the volume of user using both platform has increased significantly and hence we have chose to analyze posting between the period starting year 2020 April and 2021 March.

# Reddit Web Scraping

In this notebook, I will extract text data from 2 different subreddits - r/Zoom and r/MicrosoftTeams.  I will be leveraging on PushShift API to extract the related data.

In [33]:
# Import libraries needed for webscraping
import requests
import json
from datetime import datetime
import pandas as pd

#### Testing Out the API:

In [34]:
url = 'https://api.pushshift.io/reddit/search/submission'

In [35]:
params = {
    'subreddit': 'MicrosoftTeams',
    'size': 500
}

In [36]:
res = requests.get(url, params)

In [37]:
res.status_code

200

In [38]:
data = res.json()
posts = data['data']

In [39]:
len(posts)

100

In [40]:
#displaying the 1st post retrieved in the post[] array
posts[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'nimjay25',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_bvougu2x',
 'author_is_blocked': False,
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1642647305,
 'domain': 'self.MicrosoftTeams',
 'full_link': 'https://www.reddit.com/r/MicrosoftTeams/comments/s888mu/music_on_hold/',
 'gildings': {},
 'id': 's888mu',
 'is_created_from_ads_ui': False,
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '#ff6138',
 'link_flair_css_class': 'help',
 'link_flair_richtext': [{'e': 'text', 't': 'Question/Help'}],
 'link_flair_template_id': '7b480ed0-9bd0-11ea-b766-0e794b6e423d',
 'link_flair_t

In [42]:
df = pd.DataFrame(posts)

In [43]:
df[['subreddit', 'selftext', 'title']].head()

Unnamed: 0,subreddit,selftext,title
0,MicrosoftTeams,We're porting our phone system to Teams and te...,Music on Hold
1,MicrosoftTeams,"Hey all,\n\nThis may seem like a confusing que...",Scheduling questions.
2,MicrosoftTeams,Hello all. So I am unsure why this is popping ...,"Im confused by 'Available, Out of Office' status"
3,MicrosoftTeams,"Hello,\n\nI use a Jabra bluetooth headset and ...",Ubuntu 20.04 Bluetooth headset
4,MicrosoftTeams,I’m always looking for shortcuts. \nSo I’m cha...,Schedule meeting shortcut?


In [52]:
# Adapted from https://rareloot.medium.com/using-pushshifts-api-to-extract-reddit-submissions-fb517b286563
def getPushshiftData(after, before, sub):
    url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=' \
            +str(sub)+'&size=1000&after='+str(after)+'&before='+str(before)
    
    print(url)
    r = requests.get(url)
    data = json.loads(r.text)
    return data['data']

In [53]:
def post_scrapper(data):
    date = [] #.created_utc
    title = [] #.title
    is_self = [] #.is_self
    selftext = [] # .selftext 
    upvotes = [] #.score
    upvote_ratio = [] #.upvote_ratio
    n_comments = [] #.num_comments
    permalink = [] #.permalink
    author = [] #.author
    
    for post in data:
        date.append(str(datetime.fromtimestamp(post['created_utc'])))
        title.append(post['title'])
        upvotes.append(post['score'])
        try:
            upvote_ratio.append(post['upvote_ratio'])
        except:
            upvote_ratio.append('NA')
        n_comments.append(post['num_comments'])
        is_self.append(post['is_self'])
        try:
            selftext.append(post['selftext'])
        except:
            selftext.append('NA')
        author.append(post['author'])
        permalink.append(post['permalink'])
    
    df = pd.DataFrame({'date':date,
                  'title':title,
                  'selftext':selftext,
                  'is_self':is_self,
                  'upvotes':upvotes,
                  'upvote_ratio': upvote_ratio,
                  'n_comments':n_comments,
                  'permalink':permalink,
                  'author':author})
    
    return df

In [54]:
def parse_posts(after, before, sub):
    
    # Initialise list
    list_of_dfs = []
    data = getPushshiftData(after, before, sub)
    
    while len(data) > 0:
        current_df = post_scrapper(data)
        # Calls getPushshiftData() with the created date of the last submission
        print(len(data))
        print(str(datetime.fromtimestamp(data[-1]['created_utc'])))
        after = data[-1]['created_utc']
        data = getPushshiftData(after, before, sub)
        list_of_dfs.append(post_scrapper(data))
        
    return list_of_dfs

To align with our problem statement, we have decided to retrieve data from both subreddit from 1st April 2020 to 31st March 2021 since we believe both platform would be leveraged on during the onset of the COVID pandemic.  With higher volume, it would certainly make our analysis more relevant and with more data, it benefits our analysis and development of classification model.

Conversion of timestamp on Epoch Converter website:

Epoch timestamp: 1585699200
Timestamp in milliseconds: 1585699200000
Date and time (GMT): Wednesday, April 1, 2020 12:00:00 AM

Epoch timestamp: 1617192000
Timestamp in milliseconds: 1617192000000
Date and time (GMT): Wednesday, March 31, 2021 12:00:00 PM

In [55]:
# Setting date in accordance to above converted timestamp to ensure all posts in timeframe are captured
zoom_post = parse_posts('1585699200', '1617192000', 'Zoom')

https://api.pushshift.io/reddit/search/submission/?subreddit=Zoom&size=1000&after=1585699200&before=1617192000
100
2020-04-02 08:12:46
https://api.pushshift.io/reddit/search/submission/?subreddit=Zoom&size=1000&after=1585786366&before=1617192000
100
2020-04-04 12:36:46
https://api.pushshift.io/reddit/search/submission/?subreddit=Zoom&size=1000&after=1585975006&before=1617192000
99
2020-04-06 14:23:56
https://api.pushshift.io/reddit/search/submission/?subreddit=Zoom&size=1000&after=1586154236&before=1617192000
100
2020-04-07 14:02:45
https://api.pushshift.io/reddit/search/submission/?subreddit=Zoom&size=1000&after=1586239365&before=1617192000
100
2020-04-08 22:54:40
https://api.pushshift.io/reddit/search/submission/?subreddit=Zoom&size=1000&after=1586357680&before=1617192000
100
2020-04-10 04:10:48
https://api.pushshift.io/reddit/search/submission/?subreddit=Zoom&size=1000&after=1586463048&before=1617192000
100
2020-04-12 03:00:54
https://api.pushshift.io/reddit/search/submission/?subre

100
2020-10-07 14:18:32
https://api.pushshift.io/reddit/search/submission/?subreddit=Zoom&size=1000&after=1602051512&before=1617192000
100
2020-10-10 01:21:11
https://api.pushshift.io/reddit/search/submission/?subreddit=Zoom&size=1000&after=1602264071&before=1617192000
100
2020-10-13 16:56:12
https://api.pushshift.io/reddit/search/submission/?subreddit=Zoom&size=1000&after=1602579372&before=1617192000
100
2020-10-15 21:12:05
https://api.pushshift.io/reddit/search/submission/?subreddit=Zoom&size=1000&after=1602767525&before=1617192000
100
2020-10-19 05:39:32
https://api.pushshift.io/reddit/search/submission/?subreddit=Zoom&size=1000&after=1603057172&before=1617192000
100
2020-10-22 22:52:52
https://api.pushshift.io/reddit/search/submission/?subreddit=Zoom&size=1000&after=1603378372&before=1617192000
100
2020-10-26 18:29:27
https://api.pushshift.io/reddit/search/submission/?subreddit=Zoom&size=1000&after=1603708167&before=1617192000
100
2020-10-29 01:43:00
https://api.pushshift.io/reddit

In [56]:
zoom_v = pd.concat(zoom_post, ignore_index=True)

In [60]:
zoom_v.head()

Unnamed: 0,date,title,selftext,is_self,upvotes,upvote_ratio,n_comments,permalink,author
0,2020-04-02 08:17:08,(new) ZOOM Discord got taken down at +500 ppl ...,[https://discord.gg/EfQWuE3](https://discord.g...,1.0,1.0,,0.0,/r/Zoom/comments/ftbzmh/new_zoom_discord_got_t...,Not_Fearr
1,2020-04-02 08:34:15,JOIN ZOOM DISCORD,[https://discord.gg/EfQWuE3](https://discord.g...,1.0,1.0,,0.0,/r/Zoom/comments/ftca3f/join_zoom_discord/,Not_Fearr
2,2020-04-02 08:36:03,join zoom discord,[https://discord.gg/EfQWuE3](https://discord.g...,1.0,1.0,,1.0,/r/Zoom/comments/ftcb3a/join_zoom_discord/,Not_Fearr
3,2020-04-02 08:59:49,How do I use prerecorded footage as my video?,Also does anyone have any good source of front...,1.0,1.0,,7.0,/r/Zoom/comments/ftcp3h/how_do_i_use_prerecord...,Mememan054
4,2020-04-02 09:13:59,How to separately adjust output audio?,"Is there a way to adjust Zoom’s output volume,...",1.0,1.0,,0.0,/r/Zoom/comments/ftcxf0/how_to_separately_adju...,demonroses


In [58]:
#save it to external file so that it can be picked easily by another notebook
zoom_v.to_csv('zoom_v_bigger.csv', index=False)

In [64]:
# Setting date slightly further back to ensure all posts in timeframe are captured
team_post = parse_posts('1585699200', '1617192000', 'MicrosoftTeams')

https://api.pushshift.io/reddit/search/submission/?subreddit=MicrosoftTeams&size=1000&after=1585699200&before=1617192000
100
2020-04-04 21:33:45
https://api.pushshift.io/reddit/search/submission/?subreddit=MicrosoftTeams&size=1000&after=1586007225&before=1617192000
100
2020-04-09 16:00:20
https://api.pushshift.io/reddit/search/submission/?subreddit=MicrosoftTeams&size=1000&after=1586419220&before=1617192000
100
2020-04-14 23:25:00
https://api.pushshift.io/reddit/search/submission/?subreddit=MicrosoftTeams&size=1000&after=1586877900&before=1617192000
100
2020-04-18 03:29:13
https://api.pushshift.io/reddit/search/submission/?subreddit=MicrosoftTeams&size=1000&after=1587151753&before=1617192000
100
2020-04-22 23:52:17
https://api.pushshift.io/reddit/search/submission/?subreddit=MicrosoftTeams&size=1000&after=1587570737&before=1617192000
100
2020-04-27 22:07:51
https://api.pushshift.io/reddit/search/submission/?subreddit=MicrosoftTeams&size=1000&after=1587996471&before=1617192000
100
2020-

100
2021-01-22 15:54:26
https://api.pushshift.io/reddit/search/submission/?subreddit=MicrosoftTeams&size=1000&after=1611302066&before=1617192000
100
2021-01-28 01:47:09
https://api.pushshift.io/reddit/search/submission/?subreddit=MicrosoftTeams&size=1000&after=1611769629&before=1617192000
100
2021-01-31 05:11:30
https://api.pushshift.io/reddit/search/submission/?subreddit=MicrosoftTeams&size=1000&after=1612041090&before=1617192000
100
2021-02-03 18:08:08
https://api.pushshift.io/reddit/search/submission/?subreddit=MicrosoftTeams&size=1000&after=1612346888&before=1617192000
100
2021-02-09 17:59:37
https://api.pushshift.io/reddit/search/submission/?subreddit=MicrosoftTeams&size=1000&after=1612864777&before=1617192000
100
2021-02-12 06:41:22
https://api.pushshift.io/reddit/search/submission/?subreddit=MicrosoftTeams&size=1000&after=1613083282&before=1617192000
100
2021-02-18 04:09:47
https://api.pushshift.io/reddit/search/submission/?subreddit=MicrosoftTeams&size=1000&after=1613592587&bef

In [65]:
team_v = pd.concat(team_post, ignore_index=True)

In [69]:
team_v.head()

Unnamed: 0,date,title,selftext,is_self,upvotes,upvote_ratio,n_comments,permalink,author
0,2021-06-07 19:21:27,Teams invitation via Desktop Outlook in anothe...,I've seen this being asked several times. Ever...,1.0,1.0,1.0,3.0,/r/MicrosoftTeams/comments/nu9ykm/teams_invita...,AriHD
1,2021-06-07 19:30:11,Teams Contacts Issues,"Hi all,\n\nI've recently started using Teams f...",1.0,1.0,1.0,6.0,/r/MicrosoftTeams/comments/nua41j/teams_contac...,Afraid-Bread
2,2021-06-07 20:02:31,Route calls to a Voice-App instead of a call g...,Hello! Is there an option to route calls to a ...,1.0,1.0,1.0,1.0,/r/MicrosoftTeams/comments/nuapnt/route_calls_...,monkeyape
3,2021-06-07 22:18:02,Microsoft Teams for iOS adds webinars support ...,,0.0,1.0,1.0,0.0,/r/MicrosoftTeams/comments/nudncn/microsoft_te...,IT_PRO_21
4,2021-06-07 22:19:50,Recording a PowerPoint Live presentation doesn...,My colleague wishes to record her Presentation...,1.0,1.0,1.0,2.0,/r/MicrosoftTeams/comments/nudor3/recording_a_...,DarrenOL83


In [66]:
#save it to external file so that it can be picked easily by another notebook
team_v.to_csv('team_v_bigger.csv', index=False)

In [67]:
team_v.count()

date            6912
title           6912
selftext        6912
is_self         6912
upvotes         6912
upvote_ratio    6912
n_comments      6912
permalink       6912
author          6912
dtype: int64