# Project 3: Classifying subreddits

- Part 1: Data collection and basic data cleaning (current)
- Part 2: EDA and modelling

## Problem Statement

For the uninitiated, Reddit is an American mega-forum where people share news and other content and discuss it with other Redditors. Reddit has more than 300 million users and is divided into more than 100,000 sub-forums known as subreddits. As of September 2020, Reddit ranks #17 in global internet traffic and engagement. [(Source)](https://www.alexa.com/siteinfo/reddit.com)

In this project, I attempt to classify posts from two subreddits – `r/AskWomen` and `r/AskWomenOver30`. This is part of a wider project that my team is working on to automate the sorting of Reddit posts.

`r/AskWomen` and `r/AskWomenOver30` are both subreddits where women can comfortably and candidly share their responses in a non-judgmental space, but the latter caters to women aged 30 and over.

I want to see whether these two subreddits are distinct enough for a classifier to tell apart, and to explore whether the classifier can surface meaningful differences in the topics discussed in each.

<BR>
<details><summary>Full description of r/AskWomen and r/AskWomenOver30 from Reddit</summary>

**r/AskWomen**: A subreddit dedicated to asking women questions about their thoughts, lives, and experiences; providing a place where all women can comfortably and candidly share their responses in a non-judgmental space. 

**r/AskWomenOver30**: A place for women redditors aged 30 and over to discuss questions.

</details>
    

In [1]:
import requests, time, re, string
import pandas as pd
import pickle
from nltk.corpus import stopwords

pd.set_option('display.max_columns', None)

## Scrape Reddit posts

First, I will scrape the Reddit posts using Reddit's API. Each request returns 25 posts by default, so the 40 requests made below collect at most 1,000 posts from each subreddit.

### Set up a scraper function

In [2]:
def scraper(url_bit, header_bit):
    lst_name = []
    after = None

    for i in range(40):
        if i % 10 == 0:
            print('Scraping in progress...') # indicate that our scraping is in progress
        if after is None:
            params = {}
        else:
            params = {'after': after}
        url = 'https://www.reddit.com/r/' + url_bit + '.json'
        res = requests.get(url, params=params, headers={'User-agent': header_bit})

        if res.status_code == 200:
            the_json = res.json()
            
            # append the list of children to the end of `lst_name`
            lst_name.extend(p['data'] for p in the_json['data']['children'])
            
            # update `after` so that the next request continues from the last post retrieved
            after = the_json['data']['after'] 
        else:
            print('Status error:', res.status_code)
            break
        time.sleep(1) # sleep 1 second before running the loop again
    
    print(f'Completed scraping a total of {len(lst_name)} posts.')
    return lst_name
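
The pagination logic in `scraper` can be exercised offline with a small sketch. Everything here is illustrative: `paginate`, `fake_fetch`, and the three-page `FAKE_PAGES` listing are hypothetical names, not part of Reddit's API. The sketch also stops as soon as the listing returns no `after` token, which the function above does not do. Without that check, an exhausted listing would send the loop back to the first page and collect duplicates.

```python
# Hypothetical three-page listing, keyed by the `after` token that requests it
FAKE_PAGES = {
    None: {'children': ['post1', 'post2'], 'after': 't3_a'},
    't3_a': {'children': ['post3', 'post4'], 'after': 't3_b'},
    't3_b': {'children': ['post5'], 'after': None},  # last page carries no token
}

def fake_fetch(after):
    # stand-in for requests.get(...).json()['data'] in the real scraper
    return FAKE_PAGES[after]

def paginate(fetch, max_requests=40):
    posts = []
    after = None
    for _ in range(max_requests):
        page = fetch(after)
        posts.extend(page['children'])
        after = page['after']
        if after is None:  # listing exhausted, so stop instead of restarting
            break
    return posts

collected = paginate(fake_fetch)
print(len(collected))
```

Passing the fetch function in as an argument keeps the loop testable without any network access.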

### Gather posts from r/AskWomen
- I will use the "new" filter so as to gather more unique posts for analysis.

(Scraped on 7 September 2020)

In [3]:
askwm = scraper('AskWomen/new', 'grape 1.0')

Scraping in progress...
Scraping in progress...
Scraping in progress...
Scraping in progress...
Completed scraping a total of 991 posts.


In [4]:
# save the posts to a dataframe
askwm_df = pd.DataFrame(askwm)

In [5]:
# total number of posts scraped
len(askwm_df)

991

In [6]:
# number of unique posts based on post title
askwm_df.title.nunique()

985

In [8]:
# drop duplicates
askwm_df.drop_duplicates(subset=['title'], inplace=True)
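
As a small illustration of the two steps above (the toy titles are made up), `nunique` counts the distinct titles, and `drop_duplicates` keeps the first row for each title by default:

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['What is your favourite book?', 'How do you relax?', 'How do you relax?'],
    'ups': [3, 5, 1],
})

n_unique = toy['title'].nunique()
deduped = toy.drop_duplicates(subset=['title'])  # keep='first' is the default

print(n_unique)
print(deduped)
```

Note that deduplicating on the title alone also drops distinct posts that happen to share a title.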

In [9]:
# save a copy of the raw dataframe
askwm_df.to_csv(r'datasets/askwm_original.csv', index=False)

### Gather posts from r/AskWomenOver30
- I will use the "new" filter so as to gather more unique posts for analysis.

(Scraped on 7 September 2020)


In [10]:
askwm30 = scraper('AskWomenOver30/new', 'salmon 1.0')

Scraping in progress...
Scraping in progress...
Scraping in progress...
Scraping in progress...
Completed scraping a total of 987 posts.


In [11]:
# save posts to a dataframe
askwm30_df = pd.DataFrame(askwm30)

In [12]:
# total number of posts scraped
len(askwm30_df)

987

In [13]:
# number of unique posts based on post title
askwm30_df.title.nunique()

986

In [14]:
# drop duplicates
askwm30_df.drop_duplicates(subset=['title'], inplace=True)

In [15]:
# save a copy of the raw dataframe
askwm30_df.to_csv(r'datasets/askwm30_original.csv', index=False)

### Combine the r/AskWomen and r/AskWomenOver30 dataframes 

In [16]:
full_df = pd.concat([askwm_df, askwm30_df]).reset_index(drop=True)

In [17]:
full_df.head(2)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id,author_cakeday,thumbnail_height,thumbnail_width,post_hint,preview,url_overridden_by_dest
0,,AskWomen,,t2_h461s06,False,,0,False,What's the difference between your worst and b...,[],r/AskWomen,False,6,,0,,True,t3_io3irh,False,dark,0.67,,public,1,0,{},,False,[],,False,False,,{},,False,1,,False,,False,,[],{},,True,,1599495000.0,text,6,,,text,self.AskWomen,False,,,top,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2rxrw,,,,io3irh,True,,mahboilucas,,11,True,all_ads,False,[],False,,/r/AskWomen/comments/io3irh/whats_the_differen...,all_ads,False,https://www.reddit.com/r/AskWomen/comments/io3...,1714772,1599466000.0,0,,False,,,,,,,
1,,AskWomen,,t2_7qborbgv,False,,0,False,Would you welcome Queen Elsa and Princess Anna...,[],r/AskWomen,False,6,,0,,True,t3_io3awn,False,dark,0.5,,public,0,0,{},,False,[],,False,False,,{},,False,0,,False,,False,,[],{},,True,,1599493000.0,text,6,,,text,self.AskWomen,False,,,top,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2rxrw,,,,io3awn,True,,boredkate,,6,True,all_ads,False,[],False,,/r/AskWomen/comments/io3awn/would_you_welcome_...,all_ads,False,https://www.reddit.com/r/AskWomen/comments/io3...,1714772,1599465000.0,0,,False,,,,,,,


## Data cleaning and preprocessing

The posts were cleaned and preprocessed as outlined in the steps below to transform the text into a usable format for our classifier models.

* Remove posts by moderators
* Combine post title column and post content column into a single full-text column
* Create target variable column `is_askwomenover30`
* Make text all lowercase
* Remove punctuation
* Remove all non-alphabetical text
* Remove stopwords
* Lemmatize words (not yet applied by the `clean_text` function in this notebook)

In [18]:
full_df.shape

(1971, 109)

### Remove posts by moderators

In [19]:
# check number of moderator posts that we should remove
full_df.distinguished.value_counts()

moderator    4
Name: distinguished, dtype: int64

In [20]:
# filter out the rows that are posted by moderators
full_df = full_df.loc[full_df['distinguished'] != 'moderator', :]
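
One detail worth noting: `distinguished` is empty for ordinary posts, and a `!=` comparison evaluates to `True` for missing values, so those rows are kept by the filter. A toy sketch with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Weekly thread', 'A question', 'Another question'],
    'distinguished': ['moderator', None, None],
})

# rows with a missing `distinguished` value compare as "not equal" and are retained
mask = toy['distinguished'] != 'moderator'
kept = toy.loc[mask, :]

print(mask.tolist())
print(len(kept))
```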

In [21]:
full_df.shape

(1967, 109)

In [23]:
full_df.sample(5)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id,author_cakeday,thumbnail_height,thumbnail_width,post_hint,preview,url_overridden_by_dest
83,,AskWomen,,t2_46lm7gpw,False,,0,False,Have you managed to overcome extreme resentmen...,[],r/AskWomen,False,6,,0,,False,t3_in8gmm,False,dark,0.57,,public,1,0,{},,False,[],,False,False,,{},,False,1,,False,,False,,[],{},,True,,1599367000.0,text,6,,,text,self.AskWomen,False,,,top,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2rxrw,,,,in8gmm,True,,jjnava7,,4,True,all_ads,False,[],False,,/r/AskWomen/comments/in8gmm/have_you_managed_t...,all_ads,False,https://www.reddit.com/r/AskWomen/comments/in8...,1714772,1599338000.0,0,,False,,,,,,,
988,,AskWomenOver30,I (30F) really love writing and to deal with t...,t2_2bhs3iol,False,,0,False,Starting a new thing and looking for encourage...,[],r/AskWomenOver30,False,6,,0,,False,t3_inz4x1,False,dark,0.84,,public,4,0,{},fcee48ae-1a19-11e3-b0ae-12313b04c5c2,False,[],,False,False,,{},,False,4,,False,self,False,female,[],{},,True,,1599475000.0,text,6,,,text,self.AskWomenOver30,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,[],[],False,False,False,False,female 27 - 30,[],False,,,,t5_2ya5k,,,,inz4x1,True,,nj3008,,5,True,all_ads,False,[],False,dark,/r/AskWomenOver30/comments/inz4x1/starting_a_n...,all_ads,False,https://www.reddit.com/r/AskWomenOver30/commen...,79600,1599446000.0,0,,False,,,,,,,
1380,,AskWomenOver30,For the longest time I naturally thought that ...,t2_46lm7gpw,False,,0,False,How did you find your tribe and who does it co...,[],r/AskWomenOver30,False,6,,0,,False,t3_ia5l6h,False,dark,0.87,,public,24,0,{},,False,[],,False,False,,{},,False,24,,False,self,False,,[],{},,True,,1597519000.0,text,6,,,text,self.AskWomenOver30,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2ya5k,,,,ia5l6h,True,,jjnava7,,19,True,all_ads,False,[],False,,/r/AskWomenOver30/comments/ia5l6h/how_did_you_...,all_ads,False,https://www.reddit.com/r/AskWomenOver30/commen...,79600,1597490000.0,0,,False,,,,,,,
723,,AskWomen,,t2_gurab,False,,0,False,When was the last time you asked for a raise? ...,[],r/AskWomen,False,6,,0,,False,t3_ietj9a,False,dark,0.86,,public,5,0,{},,False,[],,False,False,,{},,False,5,,False,,False,,[],{},,True,,1598170000.0,text,6,,,text,self.AskWomen,False,,,top,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2rxrw,,,,ietj9a,True,,amazinglymorgan,,22,True,all_ads,False,[],False,,/r/AskWomen/comments/ietj9a/when_was_the_last_...,all_ads,False,https://www.reddit.com/r/AskWomen/comments/iet...,1714773,1598142000.0,0,,False,,,,,,,
68,,AskWomen,,t2_78zukj7q,False,,0,False,shy women in retail: how do you create your bu...,[],r/AskWomen,False,6,,0,,False,t3_incvd4,False,dark,0.88,,public,17,0,{},8106c61a-c8aa-11e1-a771-12313b0ce1e2,False,[],,False,False,,{},,False,17,,False,,False,female,[],{},,True,,1599383000.0,text,6,,,text,self.AskWomen,False,,,top,,,False,False,False,False,False,[],[],False,False,False,False,♀,[],False,,,,t5_2rxrw,,,,incvd4,True,,The_cuddly_duckling,,26,True,all_ads,False,[],False,dark,/r/AskWomen/comments/incvd4/shy_women_in_retai...,all_ads,False,https://www.reddit.com/r/AskWomen/comments/inc...,1714772,1599354000.0,0,,False,,,,,,,


### Subset only the columns that we need

In [24]:
# save a copy of the original df
full_df_old = full_df.copy()

In [None]:
# full_df = full_df_old.copy()

In [25]:
full_df = full_df[['subreddit', 'selftext', 'title', 'ups', 'created', 'num_comments']]

In [26]:
full_df

Unnamed: 0,subreddit,selftext,title,ups,created,num_comments
0,AskWomen,,What's the difference between your worst and b...,1,1.599495e+09,11
1,AskWomen,,Would you welcome Queen Elsa and Princess Anna...,0,1.599493e+09,6
2,AskWomen,,"Women who had very lax or uninvolved parents, ...",5,1.599493e+09,4
3,AskWomen,,What generalization about our gender irks you ...,6,1.599487e+09,36
4,AskWomen,,How much exercise does your dog get a day and ...,2,1.599480e+09,10
...,...,...,...,...,...,...
1966,AskWomenOver30,,"What is something that annoys, bothers or upse...",6,1.594175e+09,19
1967,AskWomenOver30,My fiancé and I have been together for 2 years...,Not looking for a pity party (I'm the one in t...,10,1.594170e+09,13
1968,AskWomenOver30,I don't know how common it is but I do know it...,Do you celebrate Christmas in July?,6,1.594169e+09,15
1969,AskWomenOver30,Pretty much as the title says. My partner is a...,My (38M) partner (33F) has returned to work in...,13,1.594167e+09,5


In [28]:
# make `subreddit` lowercase
full_df['subreddit'] = full_df['subreddit'].str.lower()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
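
This warning appears because `full_df` was created by selecting from another dataframe, so pandas cannot tell whether the assignment is meant to affect the original. A minimal sketch of one common way to avoid it, taking an explicit `.copy()` when subsetting (the toy dataframe is made up):

```python
import pandas as pd

source = pd.DataFrame({
    'subreddit': ['AskWomen', 'AskWomenOver30'],
    'title': ['First post', 'Second post'],
    'ups': [1, 4],
})

# .copy() makes the subset an independent dataframe, so later assignments are unambiguous
subset = source[['subreddit', 'ups']].copy()
subset['subreddit'] = subset['subreddit'].str.lower()

print(subset['subreddit'].tolist())
print(source['subreddit'].tolist())  # the source dataframe is left unchanged
```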
  


### Combine `title` and `selftext` into one column `fulltext`



In [29]:
full_df['fulltext'] = full_df['title'] + ' ' + full_df['selftext']



In [30]:
full_df.head()

Unnamed: 0,subreddit,selftext,title,ups,created,num_comments,fulltext
0,askwomen,,What's the difference between your worst and b...,1,1599495000.0,11,What's the difference between your worst and b...
1,askwomen,,Would you welcome Queen Elsa and Princess Anna...,0,1599493000.0,6,Would you welcome Queen Elsa and Princess Anna...
2,askwomen,,"Women who had very lax or uninvolved parents, ...",5,1599493000.0,4,"Women who had very lax or uninvolved parents, ..."
3,askwomen,,What generalization about our gender irks you ...,6,1599487000.0,36,What generalization about our gender irks you ...
4,askwomen,,How much exercise does your dog get a day and ...,2,1599480000.0,10,How much exercise does your dog get a day and ...
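
The concatenation works here because posts without body text come back from the scraper with an empty string in `selftext`. If the data were reloaded from the saved CSV instead, those empty fields would be read back as `NaN` by default, and `title + ' ' + NaN` would give `NaN`. A hedged sketch of a more defensive version, using made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['A title-only post', 'A post with a body'],
    'selftext': [None, 'Some body text'],
})

# fill missing bodies with an empty string before concatenating, then trim the trailing space
toy['fulltext'] = (toy['title'] + ' ' + toy['selftext'].fillna('')).str.strip()

print(toy['fulltext'].tolist())
```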


In [31]:
# drop selftext and title
full_df.drop(columns=['selftext', 'title'], inplace=True)



In [32]:
full_df.head(3)

Unnamed: 0,subreddit,ups,created,num_comments,fulltext
0,askwomen,1,1599495000.0,11,What's the difference between your worst and b...
1,askwomen,0,1599493000.0,6,Would you welcome Queen Elsa and Princess Anna...
2,askwomen,5,1599493000.0,4,"Women who had very lax or uninvolved parents, ..."


### Create `is_askwomenover30` to indicate whether subreddit is r/askwomenover30 (1) or r/askwomen (0)

In [33]:
full_df['is_askwomenover30'] = (full_df['subreddit'] == 'askwomenover30').astype(int)



### Create a function to clean text

In natural language processing (NLP), raw text usually contains elements that add little to text classification, such as stopwords, URLs, emoticons, and numbers.

Thus, after scraping the data, we need to clean the text to transform it into a usable format for our classifier models.

In [34]:
def clean_text(text):
    # lowercase text
    text = text.lower()
    
    # add white space before opening bracket so that words will not get joined together after I remove punctuation
    text = text.replace("(", " (")

    # remove all url links
    text = re.sub(r"http\S+", "", text)

    # remove anything that matches a punctuation mark
    # string.punctuation is a string containing the ASCII punctuation characters
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    
    # remove everything that isn't a letter
    # Use regular expressions to do a find-and-replace
    text = re.sub("[^a-zA-Z]", " ", text)   
    
    # split the text into tokens and remove stopwords
    # (build the stopword set once, rather than once per token)
    stop_words = set(stopwords.words('english'))
    meaningful_words = [w for w in text.split() if w not in stop_words]
    
    return (" ".join(meaningful_words))

In [35]:
full_df['fulltext'] = full_df['fulltext'].apply(clean_text)



In [36]:
# check the cleaned text
# (Series.sample is used because random.choices indexes by label, which can fail once rows have been dropped)
full_df['fulltext'].sample(10).tolist()

['last time asked raise get havent asked plan',
 'ever unplanned pregnancy circumstances deal',
 'know look potential date boyfriend eventually husband dont even know start guess admit im struggling relationships far one left aware idea good absolute garbage taste know look still clue good relationship looks like dont know women managed find someone actually good friends longterm boyfriends frank boyfriends seem like unicorns unquestionably loyal devoted upping lifestyle ambitious career attentive present relationships allround positive attitude social life etc see friends kick back relax domains life friendships careers family etc relationship gives stable rock everything need qualities man kinda rare feels absolutely unreplicable im late game women met sos college never dated couple times somehow found great guy like im baffled side see lot women around age straight settling clearly unhappy relationships men unhappy time men generally always happier women together dont think thats co

The output looks relatively clean, although there are a few words wrongly joined together.
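
The joined words come from deleting punctuation outright, so text such as `work/life` collapses into a single token. A minimal sketch of one possible alternative, replacing each punctuation mark with a space instead (the helper names and the sample string are made up for illustration):

```python
import re
import string

PUNCT_PATTERN = '[%s]' % re.escape(string.punctuation)

def strip_punct_joined(text):
    # same approach as clean_text: punctuation is deleted outright
    return re.sub(PUNCT_PATTERN, '', text)

def strip_punct_spaced(text):
    # variant: replace each punctuation mark with a space, then collapse repeated whitespace
    spaced = re.sub(PUNCT_PATTERN, ' ', text)
    return ' '.join(spaced.split())

sample = "work/life balance,really?"
joined = strip_punct_joined(sample)
spaced = strip_punct_spaced(sample)

print(joined)
print(spaced)
```

The trade-off is that contractions are split as well (`don't` becomes `don t`), so the stray single letters would need to be filtered out afterwards.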

### Save our cleaned dataset

Let's save the cleaned dataset and proceed to EDA and modelling in the next notebook.

In [37]:
full_df.to_csv(r'datasets/women_combined_clean.csv', index=False)