<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

# Background

The effects of COVID-19 continue to flow through the world’s health, educational, financial, and commercial institutions, and the global sporting goods industry is no different. In 2020, the sporting goods industry contracted for the first time since the financial crisis of 2007–08, with most brands, retailers, and manufacturers finishing the year significantly in the red, despite a bounce back in activity after the first and before the second wave of COVID-19-related lockdowns. 
[(source)](https://www.mckinsey.com/industries/retail/our-insights/sporting-goods-2021-the-next-normal-for-an-industry-in-flux)

Although economic activity slowed during the pandemic, the pandemic has also led to a surge in e-commerce retail and accelerated digital transformation. A report [(source)](https://iprice.sg/trends/insights/report-how-pandemic-affects-southeast-asias-online-shopping-in-2020/?nocache=0) by iPrice Group, an e-commerce aggregator, reveals that overall website traffic to online shopping platforms increased year-over-year across all countries. The data also showed that online department stores’ web traffic rose by an average of 52% from Q1 of 2020.


---

# Problem Statement



With competition intensifying as many sporting goods companies move to e-commerce retail amid the pandemic, an e-commerce water sporting goods company, which also operates a discussion forum on its website, is looking to increase its sales revenue by staying ahead of its e-commerce competitors. To do this, it has engaged our data science team to use machine learning to predict what kinds of sports apparel and equipment its forum users would be interested in (e.g. wetsuits, masks, surfboards) based on their forum posts, so that it can offer them highly personalized product recommendations.

---

# Executive Summary

In the first notebook, approximately 2000 posts will be scraped from each of two subreddits, r/scubadiving and r/surfing. Because the goal is to classify which category a discussion post belongs to, all columns will be dropped except 'title' (the title of the discussion), 'selftext' (the body of the discussion), and 'subreddit' (the subreddit the post was scraped from).

In the second notebook, the datasets will be checked for missing values, for 'selftext' entries marked '[removed]' or '[deleted]', and for duplicated posts. Missing 'selftext' values and 'selftext' entries of '[removed]' or '[deleted]' will be replaced with empty strings, while duplicated posts will be dropped. A new column, 'text', will be created by combining the 'title' and 'selftext' of each post. Another new column, 'diving', will hold a binary label derived from the 'subreddit' column: '1' for r/scubadiving and '0' for r/surfing.
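
As a rough illustration of the cleaning steps described above, the following sketch applies them to a few made-up rows. The toy data and the names `raw` and `clean` are assumptions for illustration only; the second notebook may implement these steps differently.

```python
import pandas as pd

# Toy rows standing in for the scraped posts (values are made up for illustration)
raw = pd.DataFrame({
    'title': ['Wetsuit advice', 'Wetsuit advice', 'Dawn patrol', 'Mask fogging'],
    'selftext': ['Which thickness?', 'Which thickness?', '[removed]', None],
    'subreddit': ['scubadiving', 'scubadiving', 'surfing', 'scubadiving'],
})

clean = raw.copy()
# Treat missing, '[removed]' and '[deleted]' bodies as empty strings
clean['selftext'] = clean['selftext'].fillna('').replace(['[removed]', '[deleted]'], '')
# Drop duplicated posts and renumber the rows
clean = clean.drop_duplicates().reset_index(drop=True)
# Combine title and body into a single 'text' column
clean['text'] = (clean['title'] + ' ' + clean['selftext']).str.strip()
# Binary target: 1 for r/scubadiving, 0 for r/surfing
clean['diving'] = (clean['subreddit'] == 'scubadiving').astype(int)

print(clean[['text', 'diving']])
```

Replacing the placeholders before building 'text' keeps strings such as '[removed]' from leaking into the model's vocabulary.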

The dataset will then be preprocessed to prepare it for modeling. The preprocessing steps are as follows:
1. Remove special characters, HTML tags, hyperlinks and punctuation marks
2. Convert text to lowercase
3. Lemmatize the words in 'text' using POS tags (stemming and lemmatization were compared)
4. Remove stop words
5. Plot bar charts of the 20 most frequently used unigrams, bigrams and trigrams to identify words that should be added to the stop word list
6. Repeat from step 4 with the updated stop word list
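
A minimal sketch of steps 1, 2, 4 and the n-gram counting behind step 5 is given below, using only the standard library. The stop word list, the regular expressions and the helper names (`clean_text`, `remove_stop_words`, `top_ngrams`) are illustrative assumptions, not the project's actual implementation, and lemmatization (step 3) is omitted because it depends on an external NLP library.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; the list used in the project is not shown here
STOP_WORDS = {'the', 'a', 'to', 'and', 'is', 'i', 'my', 'for', 'in'}

def clean_text(text):
    """Steps 1 and 2: strip markup, links and punctuation, then lowercase and tokenize."""
    text = re.sub(r'<[^>]+>', ' ', text)          # HTML tags
    text = re.sub(r'https?://\S+', ' ', text)     # hyperlinks
    text = re.sub(r'[^A-Za-z0-9\s]', ' ', text)   # punctuation and special characters
    return text.lower().split()

def remove_stop_words(tokens):
    """Step 4: drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def top_ngrams(docs, n, k=20):
    """Step 5 (counting only): the k most common n-grams across tokenized documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)

sample_posts = [
    'Looking for a <b>wetsuit</b> for cold water: https://example.com/x',
    'My wetsuit is too thin for cold water!',
]
docs = [remove_stop_words(clean_text(p)) for p in sample_posts]
print(top_ngrams(docs, 2, k=3))
```

Words that dominate these counts without helping to tell the two subreddits apart are candidates for the stop word list, after which step 4 is run again.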

In the third notebook, the dataset preprocessed in the second notebook will be used to train classification models, such as Logistic Regression, Random Forest Classifier, and KNeighbors Classifier, to classify posts by subreddit. The trained models will then be used to predict which subreddit each 'text' in the test dataset belongs to. The best model will be selected for use on the water sporting goods company's discussion forum to identify the water sports that its users are interested in.
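
The modeling step could look roughly like the sketch below, which assumes scikit-learn and a bag-of-words representation. The toy texts, the choice of `CountVectorizer` and the default hyperparameters are assumptions for illustration; the third notebook may use a different vectorizer and tuning.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training texts; label 1 = r/scubadiving, 0 = r/surfing
train_texts = [
    'regulator and bcd for reef dive',
    'dive computer for deep wreck dive',
    'mask fogging during night dive',
    'new surfboard for small waves',
    'paddle out to the lineup at dawn',
    'wax for my surfboard and leash',
]
train_labels = [1, 1, 1, 0, 0, 0]

# Vectorize the text and fit a classifier in a single pipeline
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

test_texts = ['wreck dive with a new regulator', 'surfboard wax for big waves']
preds = model.predict(test_texts)
print(list(preds))
```

A Random Forest or KNeighbors classifier can be swapped into the same pipeline in place of `LogisticRegression`, which makes it straightforward to compare the models on the same features.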

---

# Data Collection

# Table of Contents:
- [Background](#Background)
- [Problem Statement](#Problem-Statement)
- [Import Libraries](#Import-Libraries)
- [Functions](#Functions)
- [Data Collection](#Data-Collection)
- [Combine Subreddit DataFrames](#Combine-Subreddit-DataFrames)
- [Save DataFrame to CSV](#Save-DataFrame-to-CSV)

---

# Import Libraries

In [1]:
import requests
import pandas as pd
import numpy as np
import timeit

---

# Functions

Although the Pushshift API documentation states that up to 500 results can be returned per request, only 100 were returned in practice. To work around this, the 'before' parameter was used to restrict results to posts created before a specific epoch timestamp, and a loop was run to retrieve the data in blocks.

This function can be reused to retrieve posts from any other subreddit, with the option to change the number of rows to retrieve.

Code adapted from: [source](https://www.reddit.com/r/pushshift/comments/bfc2m1/capping_at_1000_posts/)

In [2]:
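def collect_data(subreddit, rows):
    """Collect at least `rows` submissions from a subreddit via the Pushshift API."""
    start = timeit.default_timer()
    last = ''  # epoch timestamp of the oldest post seen so far; empty on the first request
    posts = []
    rows = int(rows)
    url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=' + subreddit
    while len(posts) < rows:
        # Each request returns the next block of posts created before `last`
        res = requests.get('{}&before={}'.format(url, last))
        res.raise_for_status()
        batch = res.json()['data']
        if not batch:
            # No older posts are left; stop instead of looping forever
            break
        posts.extend(batch)
        last = int(batch[-1]['created_utc'])
    print('Status code: ', res.status_code)
    stop = timeit.default_timer()
    elapsed_time = stop - start
    print('Total elapsed time in seconds:', elapsed_time)
    # Note: this figure is rows requested per second, not HTTP requests per second
    print('Number of requests per second:', rows/elapsed_time)

    return posts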
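
The block-by-block retrieval used by `collect_data` can be tried offline with a stand-in for the API. In the sketch below, `paginate`, `fake_fetch` and `FAKE_POSTS` are hypothetical names invented for illustration, and the 100-item page size mirrors the limit observed above. Because whole blocks are appended, the loop can return more rows than requested, which may be why the collected DataFrame can exceed 2000 rows.

```python
def paginate(fetch_page, rows):
    """Collect at least `rows` items by repeatedly asking for items older than the last one seen."""
    posts = []
    before = None
    while len(posts) < rows:
        batch = fetch_page(before)
        if not batch:
            break  # nothing older is available
        posts.extend(batch)
        before = batch[-1]['created_utc']
    return posts

# Hypothetical stand-in for the API: 250 posts, newest first, at most 100 per call
FAKE_POSTS = [{'id': i, 'created_utc': 1_000_000 - i} for i in range(250)]

def fake_fetch(before, page_size=100):
    older = [p for p in FAKE_POSTS if before is None or p['created_utc'] < before]
    return older[:page_size]

collected = paginate(fake_fetch, 150)
print(len(collected))  # 200: two full blocks of 100 were needed to reach 150
```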

---

# Data Collection

## Subreddit: Scubadiving

In [3]:
posts = collect_data('scubadiving', '2000')

Status code:  200
Total elapsed time in seconds: 257.0713725
Number of requests per second: 7.779940568839496


In [4]:
posts

[{'all_awardings': [],
  'allow_live_comments': False,
  'author': 'traceystaceyy',
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_text': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_g4hvbtbq',
  'author_is_blocked': False,
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1641958924,
  'domain': 'self.scubadiving',
  'full_link': 'https://www.reddit.com/r/scubadiving/comments/s1wc0z/going_diving_tomorrow_for_an_open_water_cert_and/',
  'gildings': {},
  'id': 's1wc0z',
  'is_created_from_ads_ui': False,
  'is_crosspostable': True,
  'is_meta': False,
  'is_original_content': False,
  'is_reddit_media_domain': False,
  'is_robot_indexable': True,
  'is_self': True,
  'is_video': False,
  'link_flair_background_color': '',
  'link_flair_richtext': [],
  'link_flair_text_color': 'dark',
  'link_flair_type': 'text',
  'locked': False,
  'media_o

In [5]:
dive_df = pd.DataFrame(posts)

In [6]:
dive_df['title'][0]

'Going diving tomorrow for an open water cert and doc has told me I have swimmers ear today.'

In [7]:
dive_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2024 entries, 0 to 2023
Data columns (total 87 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  1639 non-null   object 
 1   allow_live_comments            1548 non-null   object 
 2   author                         2024 non-null   object 
 3   author_flair_css_class         0 non-null      object 
 4   author_flair_richtext          2008 non-null   object 
 5   author_flair_text              0 non-null      object 
 6   author_flair_type              2008 non-null   object 
 7   author_fullname                1929 non-null   object 
 8   author_is_blocked              312 non-null    object 
 9   author_patreon_flair           1879 non-null   object 
 10  author_premium                 1382 non-null   object 
 11  awarders                       1458 non-null   object 
 12  can_mod_post                   2024 non-null   b

In [8]:
dive_df = dive_df[['title','subreddit','selftext']]

In [9]:
dive_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2024 entries, 0 to 2023
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      2024 non-null   object
 1   subreddit  2024 non-null   object
 2   selftext   2024 non-null   object
dtypes: object(3)
memory usage: 47.6+ KB


## Subreddit: Surfing

In [10]:
posts = collect_data('surfing', '2000')

Status code:  200
Total elapsed time in seconds: 281.9875748
Number of requests per second: 7.092511084640883


In [11]:
surf_df = pd.DataFrame(posts)

In [12]:
surf_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 81 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  2000 non-null   object 
 1   allow_live_comments            2000 non-null   bool   
 2   author                         2000 non-null   object 
 3   author_flair_css_class         0 non-null      object 
 4   author_flair_richtext          1986 non-null   object 
 5   author_flair_text              176 non-null    object 
 6   author_flair_type              1986 non-null   object 
 7   author_fullname                1986 non-null   object 
 8   author_is_blocked              2000 non-null   bool   
 9   author_patreon_flair           1986 non-null   object 
 10  author_premium                 1986 non-null   object 
 11  awarders                       2000 non-null   object 
 12  can_mod_post                   2000 non-null   b

In [13]:
surf_df = surf_df[['title','subreddit','selftext']]

In [14]:
surf_df

| | title | subreddit | selftext |
|---|---|---|---|
| 0 | Mentawai - Internet | surfing | I run an online business and can't be out of s... |
| 1 | Solo surf trip? | surfing | Hello dearest kooks,\n\nWanting to branch into... |
| 2 | Another from OB yesterday | surfing | |
| 3 | I’m from that place with hobbits down under, b... | surfing | |
| 4 | If only it was this easy to clear the lineup | surfing | |
| ... | ... | ... | ... |
| 1995 | Need to change my stance? | surfing | |
| 1996 | Guadeloupe christmas surf | surfing | Hey!\nAnyone surfed the guadeloupe lately, how... |
| 1997 | Searching for a rideshare for Fuerteventura th... | surfing | [removed] |
| 1998 | Epicly big for the gulf coast! Clearwater beach | surfing | |


---

# Combine Subreddit DataFrames

`pd.concat` is used to concatenate the datasets from r/scubadiving and r/surfing, and the index is reset so that the combined rows are numbered consecutively.

In [15]:
df = pd.concat([dive_df, surf_df]).reset_index(drop=True)

In [16]:
df.columns

Index(['title', 'subreddit', 'selftext'], dtype='object')
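
The effect of `reset_index(drop=True)` after concatenation can be seen on two toy frames. The frames `a` and `b` below are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame({'title': ['t1', 't2'], 'subreddit': ['scubadiving'] * 2})
b = pd.DataFrame({'title': ['t3', 't4'], 'subreddit': ['surfing'] * 2})

# Without resetting, each frame keeps its own row labels, so labels repeat
stacked = pd.concat([a, b])
print(stacked.index.tolist())      # [0, 1, 0, 1]

# Resetting the index gives one consecutive range across both subreddits
renumbered = pd.concat([a, b]).reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]

# Passing ignore_index=True to pd.concat produces the same result in one step
same = pd.concat([a, b], ignore_index=True)
print(same.equals(renumbered))     # True
```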

In [17]:
df.head(10)

| | title | subreddit | selftext |
|---|---|---|---|
| 0 | Going diving tomorrow for an open water cert a... | scubadiving | Going diving tomorrow for an open water cert a... |
| 1 | Diving with swimmers ear | scubadiving | |
| 2 | BIG HUNGRY FISH! | scubadiving | |
| 3 | Dusk at Blue Heron Bridge, Rivera Beach, FL - USA | scubadiving | &amp;#x200B;\n\nhttps://preview.redd.it/1nwkll... |
| 4 | Scuba Diving Evolution | scubadiving | |
| 5 | Mark V \*inspired\* dive helmet lamp decoration ... | scubadiving | |
| 6 | just starting and wondering about a piece of gear | scubadiving | Just looking for some info about a regulator s... |
| 7 | Near miss aka: the importance of making mistak... | scubadiving | This story also reflects the difference betwee... |
| 8 | A Nudibranch Called Phyllidia Ocellata Cuvier | scubadiving | |
| 9 | 36 on 05Jan2022 | scubadiving | |


In [18]:
df.tail(10)

| | title | subreddit | selftext |
|---|---|---|---|
| 4014 | WARM REMINDERS - Wyatt McHale | surfing | |
| 4015 | Remote Work + Surfing in July/August | surfing | [removed] |
| 4016 | Is this a stoke face? Idk but I'm stoked bc a ... | surfing | |
| 4017 | Need feedback on a product for surfers and get... | surfing | [removed] |
| 4018 | Follow Up: Go ghosted by the shaper, should I ... | surfing | My previous post: [https://www.reddit.com/r/su... |
| 4019 | Need to change my stance? | surfing | |
| 4020 | Guadeloupe christmas surf | surfing | Hey!\nAnyone surfed the guadeloupe lately, how... |
| 4021 | Searching for a rideshare for Fuerteventura th... | surfing | [removed] |
| 4022 | Epicly big for the gulf coast! Clearwater beach | surfing | |
| 4023 | Surfmats. Have you ridden one? Would you ride ... | surfing | My buddy is dropping off a board and a surfmat... |


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4024 entries, 0 to 4023
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      4024 non-null   object
 1   subreddit  4024 non-null   object
 2   selftext   4023 non-null   object
dtypes: object(3)
memory usage: 94.4+ KB


---

# Save DataFrame to CSV

In [22]:
df.to_csv('../data/subreddits.csv')
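
One thing to keep in mind when the saved file is loaded in the next notebook: by default `to_csv` also writes the index, which comes back as a column named 'Unnamed: 0' unless it is handled on read. The sketch below demonstrates this with an in-memory buffer instead of a file; `df_demo` is a made-up frame for illustration.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'title': ['a', 'b'], 'subreddit': ['surfing', 'scubadiving']})

# Default behaviour: the index is written as an extra, unnamed first column
buf = io.StringIO()
df_demo.to_csv(buf)

buf.seek(0)
cols_default = pd.read_csv(buf).columns.tolist()
print(cols_default)   # ['Unnamed: 0', 'title', 'subreddit']

# Option 1: tell read_csv that the first column is the index
buf.seek(0)
cols_indexed = pd.read_csv(buf, index_col=0).columns.tolist()
print(cols_indexed)   # ['title', 'subreddit']

# Option 2: skip the index when writing
buf_no_index = io.StringIO()
df_demo.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = pd.read_csv(buf_no_index).columns.tolist()
print(cols_no_index)  # ['title', 'subreddit']
```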