In [5]:
# Imports

import requests
import pandas as pd
import time
import datetime as dt

# Classifying NLP Data for the Service Industry: Comparing 2 Subreddits

## Project Contents:

This project contains the following:

|Name|File Type|Description|
|---|---|---|
|1.intro_and_webscrape|Jupyter Notebook|Provides an introduction to the project, including problem statement and background, and the code used for gathering the data.|
|2.eda|Jupyter Notebook|Displays data cleaning and exploratory data analysis.|
|3.modeling|Jupyter Notebook|Builds 4 classification models.|
|4.sentiment_analysis|Jupyter Notebook|Explores sentiment analysis of the language used in each subreddit.|
|README.md|Markdown|An executive summary of the project.|
|models|folder|Contains saved copies of the models produced.|
|data|folder|Contains csv files of the data used.|

You are currently in notebook #1: Intro and Webscrape.

This notebook contains:

* [Problem Statement](#problemstatement)
* [Background](#background)
* [A description of the data / Data Dictionary](#data)
* [Webscraping Code Used to Collect the Data](#code)
* [Sources](#sources)


## <a name=problemstatement></a>Problem Statement

During the past year, the service industry has seen an unprecedented rate of resignations. I wanted to develop a classification model that could predict whether a piece of written language originates from a service worker in the restaurant industry or in the retail industry. I also wanted to see which language is common to both industries, which language differentiates the two, and what sentiment that language carries. By examining these similarities and differences, businesses can better understand what steps they need to take to retain workers and to attract new employees.


## <a name=background></a>Background

Wages for workers in the service industry have remained stagnant for years. Retail and serving work schedules have always tended to fluctuate (as anyone who has worked in retail or serving knows), and the COVID-19 pandemic made this worse: demand for stores and restaurants shrank, and many businesses were temporarily closed. Many workers lost significant income when their hours were cut, hours that they relied on to support themselves and, in some cases, their families. One reason often mentioned for the resignations is the expanded jobless benefits that came out of the COVID-19 pandemic, which may have reduced the incentive to accept jobs. But states that canceled those benefits early on saw no increase in employment compared with those that didn’t.

Through this study I hoped to gain a better idea of what is actually motivating workers to resign, and therefore help businesses to better understand what changes they can make to retain their workforce. 

Source: https://www.nytimes.com/2021/10/14/opinion/workers-quitting-wages.html?searchResultPosition=3


## <a name=data></a>The Data

To gather the data for this study, I used Pushshift’s API to collect the most recent 180,000 comments from 2 subreddits: TalesFromRetail and TalesFromYourServer. A dataset that large proved impractical for my modeling purposes due to slow fit times, so I shrank the data to the 40,000 most recent comments for each subreddit. TalesFromRetail is a subreddit where retail workers share their experiences, and TalesFromYourServer is the equivalent for servers. Both subreddits have been around for about 10 years.
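
The trimming step described above can be sketched as follows. This is a minimal illustration, not the code actually used for the project: the function name `keep_most_recent` and the toy values are assumptions, and only the `subreddit` and `created_utc` columns from the data dictionary are relied on.

```python
import pandas as pd


def keep_most_recent(df, n):
    """Keep the n most recent rows per subreddit, judged by created_utc."""
    return (
        df.sort_values("created_utc", ascending=False)
          .groupby("subreddit", sort=False)
          .head(n)
          .reset_index(drop=True)
    )


# Toy data standing in for the scraped comments (values are made up).
toy = pd.DataFrame({
    "subreddit": ["TalesFromRetail"] * 3 + ["TalesFromYourServer"] * 3,
    "created_utc": [10, 30, 20, 5, 15, 25],
    "body": list("abcdef"),
})

trimmed = keep_most_recent(toy, n=2)
print(trimmed)
```

On the real data the same call would presumably use `n=40_000`.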

TalesFromRetail:     
* https://www.reddit.com/r/TalesFromRetail/
* Members: 645,000
* Created November 9, 2011
* “A place to exchange stories about your daily experiences in brick & mortar retail.”

TalesFromYourServer:      
* https://www.reddit.com/r/TalesFromYourServer/ 
* Members: 444,000
* Created September 24, 2012
* A subreddit where servers share stories and advice. 


Pushshift API: 
* https://github.com/pushshift/api


### Data Dictionary 

|Feature|Type|Dataset|Description|
|---|---|---|---|
|body|str|stop_words_removed|The text of the comment.|
|subreddit|str|stop_words_removed|The title of the subreddit.|
|author|str|stop_words_removed|The username of the comment author.|
|created_utc|int|stop_words_removed|The time the comment was posted, in Coordinated Universal Time (UTC), stored as a Unix timestamp in seconds.|
|comment_length|int|stop_words_removed|The word count of the comment.|
|comment_tokens|str|stop_words_removed|Cleaned copy of the body column following the EDA process.|
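
As a rough illustration of two of the columns above, the sketch below derives a word count from `body` and converts `created_utc` into a readable timestamp. The whitespace-based word count is an assumption about how `comment_length` might be computed, and `created_dt` is a hypothetical helper column that is not part of the dataset.

```python
import pandas as pd

# Toy rows standing in for scraped comments (the text is made up).
df = pd.DataFrame({
    "body": ["Thanks for sharing this story", "Same here"],
    "created_utc": [1576624719, 1576601489],
})

# Word count of each comment, splitting on whitespace (assumed definition).
df["comment_length"] = df["body"].str.split().str.len()

# Unix seconds -> timezone-aware UTC timestamps (hypothetical extra column).
df["created_dt"] = pd.to_datetime(df["created_utc"], unit="s", utc=True)

print(df[["comment_length", "created_dt"]])
```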


## <a name="sources"></a>Sources

* The New York Times: https://www.nytimes.com/2021/10/14/opinion/workers-quitting-wages.html?searchResultPosition=3

* TalesFromRetail Subreddit: https://www.reddit.com/r/TalesFromRetail/

* TalesFromYourServer Subreddit: https://www.reddit.com/r/TalesFromYourServer/ 

* Pushshift API: https://github.com/pushshift/api

* Source for TOC functionality: https://stackoverflow.com/questions/5319754/cross-reference-named-anchor-in-markdown/7335259#7335259


## <a name='code'></a>Webscraping Code

In [40]:

# Generating initial csv files upon which to build using pushshift's API: https://github.com/pushshift/api
# Code developed by Chuck Dye of General Assembly. 

# Uncomment the code below in order to run. 

# Specifying the subreddit, the beginning date as a UTC Unix timestamp, and the type of pull (comments rather than posts).

# subreddit = 'talesfromyourserver'
# date = '1635356081'
# kind = 'comment'

# stem = 'https://api.pushshift.io'
# slug = f'/reddit/search/{kind}/?subreddit={subreddit}&before={date}&size=100'

# res = requests.get(stem + slug)

# res

# data = res.json()


# posts = []

# for post in data['data']:
#     post_dict = {
#         'body' : post['body'],
#         'subreddit' : post['subreddit'],
#         'author' : post['author'],
#         'created_utc' : post['created_utc']
#     }
#     posts.append(post_dict)
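
The request URL in the cell above is assembled with an f-string. A minimal alternative sketch, assuming the same endpoint layout and query parameters, lets the standard library handle the query string. The helper name `build_search_url` is hypothetical, and the snippet only builds the URL without sending a request.

```python
from urllib.parse import urlencode

PUSHSHIFT_STEM = "https://api.pushshift.io"


def build_search_url(kind, subreddit, before, size=100):
    """Build a Pushshift search URL like the f-string in the cell above."""
    query = urlencode({"subreddit": subreddit, "before": before, "size": size})
    return f"{PUSHSHIFT_STEM}/reddit/search/{kind}/?{query}"


url = build_search_url("comment", "talesfromyourserver", 1635356081)
print(url)
```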


In [47]:
# Saving to a csv file. 
# Uncomment the code below in order to run. 

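# Note: to_csv() writes the row index by default, which is the source of the
# 'Unnamed: 0' column in the nunique() output further down. Pass index=False to omit it.
# pd.DataFrame(posts).to_csv('./talesfromyourserver.csv')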

In [109]:


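# Continuing to build upon the csv file using pushshift's API: https://github.com/pushshift/api
# Uncomment the code below in order to run.
# Note: the `date` argument is overwritten on every iteration by the oldest
# created_utc already saved in og_csv, so og_csv must exist before the first call.

# def pushshift(subreddit, kind='comment', date='1635356081', iters=500, og_csv='./talesfromretail.csv'):
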
#     for scrape in range(1,iters + 1):
#         date = pd.read_csv(og_csv)['created_utc'].min()
#         stem = 'https://api.pushshift.io'
#         slug = f'/reddit/search/{kind}/?subreddit={subreddit}&before={date}&size=100&selftext=true'
        
#         res = requests.get(stem + slug) 
        
#         if res.status_code != 200:
#             print(f'url: {stem + slug} returned {res.status_code}')
#         else:
#             data = res.json()
#             posts = []

#             for post in data['data']:
#                 post_dict = {
#                     'body' : post['body'],
#             #         'title' : post['title'],
#                     'subreddit' : post['subreddit'],
#                     'author' : post['author'],
#             #         'num_comments' : post['num_comments'], 
#                     'created_utc' : post['created_utc']
#                 }
#                 posts.append(post_dict)
                
                
#             og_df = pd.read_csv(og_csv)
#             temp_df = pd.DataFrame(posts)
            
#             new_df = pd.concat([og_df,temp_df])
            
#             new_df.to_csv(og_csv,index=False)
            
#             print("Scrape success")
#             print(f'original data was {og_df.shape[0]} rows')
#             print(f'New dataset is {new_df.shape[0]} rows!')
#             print(new_df['created_utc'].min())
#             time.sleep(10)

# pushshift('talesfromretail')


Scrape success
original data was 120164 rows
New dataset is 120264 rows!
1576624719
Scrape success
original data was 120264 rows
New dataset is 120364 rows!
1576601489
Scrape success
original data was 120364 rows
New dataset is 120464 rows!
1576568096
Scrape success
original data was 120464 rows
New dataset is 120564 rows!
1576538533
Scrape success
original data was 120564 rows
New dataset is 120664 rows!
1576518030
Scrape success
original data was 120664 rows
New dataset is 120764 rows!
1576482989
Scrape success
original data was 120764 rows
New dataset is 120864 rows!
1576458130
Scrape success
original data was 120864 rows
New dataset is 120964 rows!
1576432479
Scrape success
original data was 120964 rows
New dataset is 121064 rows!
1576410811
Scrape success
original data was 121064 rows
New dataset is 121164 rows!
1576388872
Scrape success
original data was 121164 rows
New dataset is 121264 rows!
1576366255
Scrape success
original data was 121264 rows
New dataset is 121364 rows!
157
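
The scraping loop above re-reads the csv on every iteration to find the oldest timestamp, then uses it as the next `before` value. The same cursor idea can be sketched while keeping rows in memory. This is an illustration under stated assumptions, not the project's code: `paginate` and `fake_fetch` are hypothetical names, and `fake_fetch` replaces the HTTP call so the example runs offline.

```python
def paginate(fetch_page, start_before, max_pages):
    """Collect rows page by page, moving the `before` cursor backwards.

    fetch_page(before) must return a list of dicts that each carry a
    'created_utc' key; it is a hypothetical stand-in for the HTTP call.
    """
    rows = []
    before = start_before
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:  # nothing older is left, so stop early
            break
        rows.extend(page)
        # The oldest timestamp seen becomes the next cursor.
        before = min(row["created_utc"] for row in page)
    return rows


# Fake data source so the sketch runs offline: timestamps 1..10.
FAKE_COMMENTS = [{"created_utc": t, "body": f"comment {t}"} for t in range(1, 11)]


def fake_fetch(before, size=4):
    """Return up to `size` fake comments strictly older than `before`."""
    older = [c for c in FAKE_COMMENTS if c["created_utc"] < before]
    # Newest first, like a "most recent before X" query.
    return sorted(older, key=lambda c: c["created_utc"], reverse=True)[:size]


collected = paginate(fake_fetch, start_before=11, max_pages=5)
print(len(collected))
```

Writing the collected rows to disk once at the end, rather than after every page, avoids repeated reads of a growing file. The trade-off is that progress is lost if the run is interrupted.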

In [107]:
pd.read_csv('./talesfromyourserver.csv')['subreddit'].value_counts()

TalesFromYourServer    136000
TalesFromRetail           100
Name: subreddit, dtype: int64

In [108]:
pd.read_csv('./talesfromyourserver.csv').nunique()

Unnamed: 0        100
body           129331
subreddit           2
author          27903
created_utc    135374
dtype: int64

In [102]:
pd.read_csv('./talesfromretail.csv').nunique()

Unnamed: 0        100
body           102206
subreddit           1
author          23767
created_utc    119938
dtype: int64
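
The outputs above show an `Unnamed: 0` column in both files and 100 `TalesFromRetail` rows inside `talesfromyourserver.csv`. One possible cleanup is sketched below. The function name `clean_scrape` is hypothetical, and treating rows with the same author, timestamp, and body as duplicates is an assumption, since the outputs above do not establish whether true duplicates exist.

```python
import pandas as pd


def clean_scrape(df, expected_subreddit):
    """Drop the stray index column, off-target rows, and exact repeats."""
    out = df.drop(columns=["Unnamed: 0"], errors="ignore")
    out = out[out["subreddit"] == expected_subreddit]
    out = out.drop_duplicates(subset=["author", "created_utc", "body"])
    return out.reset_index(drop=True)


# Toy frame mimicking the shape of the scraped csv (values are made up).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "body": ["hi", "hi", "ok", "hi"],
    "subreddit": ["TalesFromYourServer", "TalesFromYourServer",
                  "TalesFromRetail", "TalesFromYourServer"],
    "author": ["a", "a", "b", "c"],
    "created_utc": [100, 100, 101, 102],
})

cleaned = clean_scrape(toy, "TalesFromYourServer")
print(cleaned)
```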