**Previous:** [Part 1 - Web Scraping](Part%201%20-%20Web%20Scraping.ipynb)

<img src='http://imgur.com/1ZcRyrc.png' style='float: left; margin: 20px; height: 55px'>

# Project 3: Reddit Web Scraping

## Part 2 - Data Cleaning

---
## Contents
---

### [Part 1 - Web Scraping](Part%201%20-%20Web%20Scraping.ipynb)
1. Introduction
2. Import - Web Scraping using PRAW

### [Part 2 - Data Cleaning](Part%202%20-%20Data%20Cleaning.ipynb)
1. [Import](#1.-Import)
2. [Cleaning - Data Frame and Text](#2.-Cleaning---Data-Frame-and-Text)

### [Part 3 - Exploratory Data Analysis (EDA)](Part%203%20-%20Exploratory%20Data%20Analysis%20(EDA).ipynb)
1. Import
2. Exploratory Data Analysis - Trends
3. Exploratory Data Analysis - Unigrams 
4. Exploratory Data Analysis - Bigrams
5. Exploratory Data Analysis - Trigrams 

### [Part 4 - Pre-processing & Modelling](Part%204%20-%20Pre-processing%20&%20Modelling.ipynb)
1. Import
2. Pre-processing - Binarizing The 2 Classes, Train-test Split
3. Modelling - Feature Engineering, Comparing Against Other Models
4. Conclusion - Summary, Recommendations

---
## 1. Import
---

### 1.1 Libraries

In [1]:
import pandas as pd
import re
import datetime
import requests
from bs4 import BeautifulSoup
import praw
import nltk
import numpy as np
import matplotlib.pyplot as plt
import concurrent.futures
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import time
import itertools
from collections import defaultdict, Counter
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.util import bigrams
from sklearn.feature_extraction.text import CountVectorizer
import string

ModuleNotFoundError: No module named 'praw'

> <font size = 3 color = "crimson"> Some of these imports are not needed for this notebook. While it's not important in this case, importing unnecessary things can slow your notebook down. </font>

### 1.2 Raw data

In [3]:
# Read raw data saved in previous notebook
reddit = pd.read_csv('reddit_raw.csv')

reddit.head()

FileNotFoundError: [Errno 2] No such file or directory: 'reddit_raw.csv'

> <font size = 3 color = "crimson"> File pathing error. Do be careful that your final file organisation matches your code. Fixed below. I note this is the 'final' dataset and not the raw one. Would be good if you included the raw dataset in your uploads as well.</font>
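
One way to make the file path less fragile is to build it relative to a known folder and fail early with a clear message. The following is only a minimal sketch, assuming the notebook sits in a folder next to a `data` folder (as the corrected path in the next cell suggests). `resolve_data_file` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def resolve_data_file(filename, base_dir=None, data_folder="data"):
    """Return the path to a data file, raising a clear error if it is missing.

    Assumes the layout <project>/<data_folder>/<filename>, with the notebook
    in a sibling folder of <data_folder> (hypothetical layout).
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    candidate = base.parent / data_folder / filename
    if not candidate.is_file():
        raise FileNotFoundError(f"Expected data file at {candidate}")
    return candidate
```

With such a helper, the read could be written as `pd.read_csv(resolve_data_file('reddits.csv'))`, so a missing file is reported with the full path that was tried.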

In [5]:
# Read raw data saved in previous notebook
reddit = pd.read_csv('../data/reddits.csv')

reddit.head()

Unnamed: 0,title,post_text,id,score,total_comments,post_url,subreddit,post_type,title_&_text,stemmed_round_1,lemmatized_round_1,stemmed_round_2,lemmatized_round_2,stemmed_round_3,lemmatized_round_3,trending
0,Daily Fasting Check-in!,"* **Type** of fast (water, juice, smoking, etc...",16o7z6r,1,2,https://www.reddit.com/r/intermittentfasting/c...,intermittentfasting,hot,Daily Fasting Check-in! * **Type** of fast (wa...,"['daili', 'fast', 'checkin', 'type', 'fast', '...","['daily', 'fasting', 'checkin', 'type', 'fast'...","['daili', 'fast', 'checkin', 'type', 'fast', '...","['daily', 'fasting', 'checkin', 'type', 'fast'...","['daili', 'fast', 'checkin', 'type', 'fast', '...","['daily', 'fasting', 'checkin', 'type', 'fast'...",2
1,I decided who I wanted to be and I became her 💅🏽,"So a little background: I’m 39, have birthed t...",16ntqoy,1176,36,https://i.redd.it/fclkjnwhmgpb1.jpg,intermittentfasting,hot,I decided who I wanted to be and I became her ...,"['decid', 'want', 'becam', 'littl', 'backgroun...","['decided', 'wanted', 'became', 'little', 'bac...","['decid', 'want', 'becam', 'background', '39',...","['decided', 'wanted', 'became', 'background', ...","['decid', 'want', 'becam', 'littl', 'backgroun...","['decided', 'wanted', 'became', 'little', 'bac...",42336
2,Some photos from a past vacation came up as a ...,I remember being miserable and insecure the en...,16ni914,1505,77,https://www.reddit.com/gallery/16ni914,intermittentfasting,hot,Some photos from a past vacation came up as a ...,"['photo', 'past', 'vacat', 'came', 'memori', '...","['photo', 'past', 'vacation', 'came', 'memory'...","['photo', 'vacat', 'came', 'memori', 'today', ...","['photo', 'vacation', 'came', 'memory', 'today...","['photo', 'past', 'vacat', 'came', 'memori', '...","['photo', 'past', 'vacation', 'came', 'memory'...",115885
3,"Anybody find IF, lose weight, and then lose mo...",I know I am an idiot.,16nuqx9,198,78,https://www.reddit.com/r/intermittentfasting/c...,intermittentfasting,hot,"Anybody find IF, lose weight, and then lose mo...","['anybodi', 'find', 'lose', 'lose', 'motiv', '...","['anybody', 'find', 'lose', 'lose', 'motivatio...","['anybodi', 'find', 'weight', 'motiv', 'know',...","['anybody', 'find', 'weight', 'motivation', 'k...","['anybodi', 'find', 'lose', 'weight', 'lose', ...","['anybody', 'find', 'lose', 'weight', 'lose', ...",15444
4,2 and a half months of IF,From 234 to 211 in 2.5 months. It works! Once ...,16nuxqs,180,12,https://i.redd.it/30yqmtsdvgpb1.jpg,intermittentfasting,hot,2 and a half months of IF From 234 to 211 in 2...,"['2', 'half', '234', '211', '25', 'work', 'dis...","['2', 'half', '234', '211', '25', 'work', 'dis...","['2', 'half', 'month', '234', '211', '25', 'mo...","['2', 'half', 'month', '234', '211', '25', 'mo...","['2', 'half', 'month', '234', '211', '25', 'mo...","['2', 'half', 'month', '234', '211', '25', 'mo...",2160


## Data Dictionary

| Column                             | Datatype  | Explanation                                           |
| ---------------------------------- | --------- | ----------------------------------------------------- |
| **title**                          | object    | Title of the Reddit post                              |
| **post_text**                      | object    | Text content of the Reddit post                       |
| **id**                             | object    | Unique identifier for the post                        |
| **score**                          | int64     | Score or upvotes of the post                          |
| **total_comments**                 | int64     | Total number of comments on the post                  |
| **post_url**                       | object    | URL of the post                                       |
| **subreddit**                      | object    | Subreddit where the post was made                     |
| **post_type**                      | object    | Type or format of the post                            |
| **time_uploaded**                  | object    | Timestamp when the post was uploaded                  |
| **title_&_text**                   | object    | Title and post text joined into a single string       |
| **title_text_stemmed**             | object    | Title and text content after stemming                 |
| **title_text_lemmatized**          | object    | Title and text content after lemmatization            |
| **stemmed_round_1**                | object    | Stemmed title and text after 1st round of feature engineering|
| **lemmatized_round_1**             | object    | Lemmatized title and text after 1st round of feature engineering|
| **stemmed_round_2**                | object    | Stemmed title and text after 2nd round of feature engineering|
| **lemmatized_round_2**             | object    | Lemmatized title and text after 2nd round of feature engineering|
| **stemmed_round_3**                | object    | Stemmed title and text after 3rd round of feature engineering|
| **lemmatized_round_3**             | object    | Lemmatized title and text after 3rd round of feature engineering|

---
## 2. Cleaning - Data Frame and Text
---

Cleaning was done in 2 sections:
1. Cleaning the dataframe
2. Cleaning the texts

### 2.1 Cleaning The Dataframe
The following steps were undertaken to clean the dataframe containing the raw data:
1. Column Headers - Check for trailing space, lowercase and snakecase if needed
2. Data Info (data type and null values) - Check data type (change data type if necessary) and null values (fill or drop null values where necessary)
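
The header check in step 1 can be sketched as a small function. This is an illustration only: `normalize_header` and the sample header names are hypothetical, and the hyphen handling is an assumption that goes slightly beyond what the notebook does.

```python
import re


def normalize_header(name):
    """Strip surrounding whitespace, lowercase, and snake_case a column name."""
    cleaned = name.strip().lower()
    # Collapse any run of whitespace or hyphens into a single underscore
    return re.sub(r"[\s\-]+", "_", cleaned)


# Made-up headers showing the cases the check is meant to catch
headers = [" Post Text", "total comments ", "Time-Uploaded", "score"]
normalized = [normalize_header(h) for h in headers]
print(normalized)  # ['post_text', 'total_comments', 'time_uploaded', 'score']
```

On the dataframe this would be applied as `reddit.columns = [normalize_header(col) for col in reddit.columns]`.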

#### 2.1.1 Column Headers

In [6]:
# Check column headers
reddit.columns

Index(['title', 'post_text', 'id', 'score', 'total_comments', 'post_url',
       'subreddit', 'post_type', 'title_&_text', 'stemmed_round_1',
       'lemmatized_round_1', 'stemmed_round_2', 'lemmatized_round_2',
       'stemmed_round_3', 'lemmatized_round_3', 'trending'],
      dtype='object')

In [4]:
# No trailing space found, so just lowercase and snake-case column headers
reddit.columns = [col.lower().replace(' ', '_') for col in reddit.columns]

reddit.columns

Index(['title', 'post_text', 'id', 'score', 'total_comments', 'post_url',
       'subreddit', 'post_type', 'time_uploaded'],
      dtype='object')

#### 2.1.2 Data Info (data type and null values)

In [5]:
# Check info
reddit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3960 entries, 0 to 3959
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   title           3960 non-null   object
 1   post_text       2496 non-null   object
 2   id              3960 non-null   object
 3   score           3960 non-null   int64 
 4   total_comments  3960 non-null   int64 
 5   post_url        3960 non-null   object
 6   subreddit       3960 non-null   object
 7   post_type       3960 non-null   object
 8   time_uploaded   3960 non-null   object
dtypes: int64(2), object(7)
memory usage: 278.6+ KB


In [6]:
# Check null values
reddit.isnull().sum()

title                0
post_text         1464
id                   0
score                0
total_comments       0
post_url             0
subreddit            0
post_type            0
time_uploaded        0
dtype: int64

* 3960 posts scraped in total
* All data are in the correct type, no need for any change of data type
* 1464 null values found in the 'post_text' column

#### 2.1.3 Create a new column

Before we remove null values, we want to first add a new column 'title_&_text'.

Some users put their message in the post title rather than in the post body to get more traction, so posts that have text only in the title and just an image in the body show up with a null 'post_text'. For instance: <br><br>

<a href='https://www.reddit.com/r/intermittentfasting/comments/16shbmz/i_lost_120_lbsshe_lost_80_one_meal_a_day_from/'>
    <figure>
        <img src='https://preview.redd.it/cft42u8lso151.jpg?width=960&crop=smart&auto=webp&s=7dffccf1f5f70685154e59bddfc63ad84beaaf7b'/>
        <center><figcaption>A typical image-only reddit post</figcaption></center>
    </figure>
</a><br>
To get as many words as possible, we will combine the words in the 'title' and 'post_text' columns to form a 'title_&_text' column. Before that, we first .fillna() the 'post_text' column so that the resulting 'title_&_text' column does not contain any null values.

In [7]:
# Fill null values with empty strings in the 'title' and 'post_text' columns
# (assigning back avoids relying on inplace=True on a selected column)
reddit['title'] = reddit['title'].fillna('')
reddit['post_text'] = reddit['post_text'].fillna('')

# Create new column 'title_&_text', an addition of words from the 'title' and 'post_text' columns
reddit['title_&_text'] = reddit['title'] + ' ' + reddit['post_text']

reddit.head()

Unnamed: 0,title,post_text,id,score,total_comments,post_url,subreddit,post_type,time_uploaded,title_&_text
0,Does taking flavoured creatine break a fast?,"Taking one scoop, roughly 3g. It has sucralose...",16shh83,1,0,https://www.reddit.com/r/intermittentfasting/c...,intermittentfasting,new,2023-09-26 07:57:13,Does taking flavoured creatine break a fast? T...
1,I lost 120 lbs.......she lost 80. One meal a d...,,16shbmz,6,1,https://i.redd.it/cft42u8lso151.jpg,intermittentfasting,new,2023-09-26 07:46:54,I lost 120 lbs.......she lost 80. One meal a d...
2,Does fasting out of spite work?,We’ll see in 4 weeks when I go to a wedding wh...,16sfrlc,0,2,https://www.reddit.com/r/intermittentfasting/c...,intermittentfasting,new,2023-09-26 06:10:27,Does fasting out of spite work? We’ll see in 4...
3,Daily Fasting Check-in!,"* **Type** of fast (water, juice, smoking, etc...",16sfl07,1,0,https://www.reddit.com/r/intermittentfasting/c...,intermittentfasting,new,2023-09-26 06:00:31,Daily Fasting Check-in! * **Type** of fast (wa...
4,90 Days of Intermittent Fasting - IT WORKS!,"Hi Everyone, \n\nToday was the 90th day of my ...",16sdl2e,17,8,https://www.reddit.com/r/intermittentfasting/c...,intermittentfasting,new,2023-09-26 04:10:24,90 Days of Intermittent Fasting - IT WORKS! Hi...


In [8]:
# Check
reddit.isnull().sum()

title             0
post_text         0
id                0
score             0
total_comments    0
post_url          0
subreddit         0
post_type         0
time_uploaded     0
title_&_text      0
dtype: int64

In [9]:
#nltk.download('stopwords')
#nltk.download('wordnet')
#nltk.download('omw-1.4')
#nltk.download('punkt')
# Uncomment the lines above if you have not downloaded these NLTK resources yet.

### 2.2 Cleaning The Texts
A function is defined to run the following cleaning steps:
* remove punctuation
* tokenize
* lowercase
* remove stopwords
* stem
* lemmatize

As part of feature engineering, we added custom stopwords in three rounds, each with its own batch of words. The stopwords were chosen from the most commonly used words. They were removed so that the model is not trained on words that are used so often that they carry little meaning.
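
Picking stopword candidates from the most common words can be sketched with `collections.Counter`. This is a minimal illustration, not the procedure actually used for the lists in this notebook: `most_common_tokens` is a hypothetical helper and the token lists are toy data standing in for a tokenized column.

```python
from collections import Counter


def most_common_tokens(token_lists, top_n=10):
    """Count tokens across all documents and return the top_n most frequent."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(top_n)


# Toy token lists standing in for one tokenized post per row
docs = [
    ["fast", "day", "lost", "week"],
    ["fast", "week", "meal"],
    ["fast", "hungri"],
]
print(most_common_tokens(docs, top_n=2))  # [('fast', 3), ('week', 2)]
```

The resulting list is only a starting point. Each candidate still needs a manual check for whether it carries meaning for the subreddits being compared.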

The model scores for each round are presented in a later notebook (Part 4 - Pre-processing & Modelling) to determine which set of features to use. The set of features with the highest score was chosen.

In [10]:
# Define a function for cleaning that returns the stemmed tokens
# and the lemmatized tokens as two separate columns

ps = nltk.PorterStemmer()
wn = nltk.WordNetLemmatizer()

def clean(text, custom_stopwords):

    # Remove punctuation characters
    remove_punct = ''.join([char for char in text if char not in string.punctuation])

    # Tokenize on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokenize = re.split(r'\W+', remove_punct)
    
    # Lowercase every token
    lowercase = [word.lower() for word in tokenize]
        
    # Remove NLTK English stopwords plus the custom stopwords for this round
    all_stopwords = set(stopwords.words('english') + custom_stopwords)
    no_stopwords = [word for word in lowercase if word not in all_stopwords]
    
    # Stem and lemmatize the remaining tokens (duplicate tokens are kept)
    stemmed = [ps.stem(word) for word in no_stopwords]
    
    lemmatized = [wn.lemmatize(word) for word in no_stopwords]
    
    return pd.Series({'stemmed': stemmed, 'lemmatized': lemmatized})

In [18]:
# Round 1 of cleaning with added stopwords (68 words added)
stopwords_round_1 = ['everyon', 'didnt', 'tri', 'never', 'normal', 
                     'thank', 'say', 'post', 'use', 'els', 
                     'gain', 'thought', 'year', 'lose', 'past', 
                     'life', 'without', 'hope', 'cant', 'love', 
                     'sure', 'get', 'ago', 'week', 'comment', 
                     'around', 'meal', 'work', 'look', 'long', 
                     'littl', 'alway', 'start', 'right', 'thing', 
                     'end', 'stop', 'could', 'peopl', 'made', 
                     'went', 'want', 'almost', 'period', 'find', 
                     'make', 'advic', 'id', 'time', 'actual', 
                     'notic', 'two', 'hard', 'felt', 'come', 
                     'ill', 'pretti', 'healthi', 'anyth', 'enough', 
                     'etc', 'sometim', 'happi', 'mayb', 'hungri', 
                     'experi', 'less', 'that']

reddit[['stemmed_round_1', 'lemmatized_round_1']] = reddit['title_&_text'].apply(clean, custom_stopwords=stopwords_round_1)

# Round 2 of cleaning with added stopwords (20 words added)
stopwords_round_2 = ['made', 'went', 'want', 'almost', 'period', 
                     'find', 'make', 'advic', 'id', 'time', 
                     'actual', 'notic', 'two', 'hard', 'felt', 
                     'come', 'ill', 'pretti', 'healthi', 'anyth']

reddit[['stemmed_round_2', 'lemmatized_round_2']] = reddit['title_&_text'].apply(clean, custom_stopwords=stopwords_round_2)

# Round 3 of cleaning with added stopwords (9 words added)
stopwords_round_3 = ['enough', 'etc', 'sometim', 'happi', 'mayb', 'hungri', 'experi', 'less', 'that']

reddit[['stemmed_round_3', 'lemmatized_round_3']] = reddit['title_&_text'].apply(clean, custom_stopwords=stopwords_round_3)

reddit.head()

Unnamed: 0,score,total_comments,subreddit,post_type,title_&_text,stemmed_round_1,lemmatized_round_1,stemmed_round_2,lemmatized_round_2,stemmed_round_3,lemmatized_round_3
0,1,0,intermittentfasting,new,Does taking flavoured creatine break a fast? T...,"[take, flavour, creatin, break, fast, take, on...","[taking, flavoured, creatine, break, fast, tak...","[take, flavour, creatin, break, fast, take, on...","[taking, flavoured, creatine, break, fast, tak...","[take, flavour, creatin, break, fast, take, on...","[taking, flavoured, creatine, break, fast, tak..."
1,6,1,intermittentfasting,new,I lost 120 lbs.......she lost 80. One meal a d...,"[lost, 120, lbsshe, lost, 80, one, day, ]","[lost, 120, lbsshe, lost, 80, one, day, ]","[lost, 120, lbsshe, lost, 80, one, meal, day, ]","[lost, 120, lbsshe, lost, 80, one, meal, day, ]","[lost, 120, lbsshe, lost, 80, one, meal, day, ]","[lost, 120, lbsshe, lost, 80, one, meal, day, ]"
2,0,2,intermittentfasting,new,Does fasting out of spite work? We’ll see in 4...,"[fast, spite, see, 4, week, go, wed, bh, siste...","[fasting, spite, see, 4, week, go, wedding, bh...","[fast, spite, work, see, 4, week, go, wed, bh,...","[fasting, spite, work, see, 4, week, go, weddi...","[fast, spite, work, see, 4, week, go, wed, bh,...","[fasting, spite, work, see, 4, week, go, weddi..."
3,1,0,intermittentfasting,new,Daily Fasting Check-in! * **Type** of fast (wa...,"[daili, fast, checkin, type, fast, water, juic...","[daily, fasting, checkin, type, fast, water, j...","[daili, fast, checkin, type, fast, water, juic...","[daily, fasting, checkin, type, fast, water, j...","[daili, fast, checkin, type, fast, water, juic...","[daily, fasting, checkin, type, fast, water, j..."
4,17,8,intermittentfasting,new,90 Days of Intermittent Fasting - IT WORKS! Hi...,"[90, day, intermitt, fast, work, hi, everyon, ...","[90, day, intermittent, fasting, work, hi, eve...","[90, day, intermitt, fast, work, hi, everyon, ...","[90, day, intermittent, fasting, work, hi, eve...","[90, day, intermitt, fast, work, hi, everyon, ...","[90, day, intermittent, fasting, work, hi, eve..."
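
In the output above, 'everyon' and 'work' still appear in `stemmed_round_1` for the last row even though both are in `stopwords_round_1`. The custom stopwords are written as stems, but `clean` compares them against lowercased, unstemmed tokens, so 'everyone' and 'works' pass the filter and are stemmed afterwards. One possible alternative, shown only as a sketch, is to stem first and filter on the stem. `filter_by_stem` and `toy_stem` are hypothetical names; in this notebook `ps.stem` would be passed as the stemmer.

```python
def filter_by_stem(tokens, stem, stemmed_stopwords):
    """Stem each token, then drop those whose stem is in stemmed_stopwords."""
    stop_set = set(stemmed_stopwords)  # set gives fast membership checks
    stems = [stem(token) for token in tokens]
    return [s for s in stems if s not in stop_set]


def toy_stem(word):
    """Hypothetical stand-in for ps.stem: strips one trailing 'e' or 's'."""
    return word[:-1] if word.endswith(("e", "s")) else word


tokens = ["hi", "everyone", "it", "works", "fasting"]
print(filter_by_stem(tokens, toy_stem, ["everyon", "work"]))  # ['hi', 'it', 'fasting']
```

Switching to this order would change the contents of the stemmed columns, so the model scores for each round would need to be re-run before comparing them.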


> <font size = 3 color = "crimson"> While I see that you said that stop words were defined via most common words used, there are a few things that come to mind. The first is that some of the words being removed seem to me not to be contributing noise but rather to be crucial words such as 'period', or 'hope', or 'hungri' or 'pretti'. I say these are crucial words because these indicate some sentiment that I, as a casual layperson at least, would expect might have some relation to the mindset of users of these subreddits. Words that you remove should be meaningless words that don't tell you anything about the users. Good examples from your own stop list include words like 'want' or 'advic'.</font>

In [12]:
# Drop columns that we do not need
reddit = reddit.drop(columns=['title', 'post_text', 'time_uploaded', 'post_url', 'id']).copy()

reddit.head()

Unnamed: 0,score,total_comments,subreddit,post_type,title_&_text,stemmed_round_1,lemmatized_round_1,stemmed_round_2,lemmatized_round_2,stemmed_round_3,lemmatized_round_3
0,1,0,intermittentfasting,new,Does taking flavoured creatine break a fast? T...,"[take, flavour, creatin, break, fast, take, on...","[taking, flavoured, creatine, break, fast, tak...","[take, flavour, creatin, break, fast, take, on...","[taking, flavoured, creatine, break, fast, tak...","[take, flavour, creatin, break, fast, take, on...","[taking, flavoured, creatine, break, fast, tak..."
1,6,1,intermittentfasting,new,I lost 120 lbs.......she lost 80. One meal a d...,"[lost, 120, lbsshe, lost, 80, one, day, ]","[lost, 120, lbsshe, lost, 80, one, day, ]","[lost, 120, lbsshe, lost, 80, one, meal, day, ]","[lost, 120, lbsshe, lost, 80, one, meal, day, ]","[lost, 120, lbsshe, lost, 80, one, meal, day, ]","[lost, 120, lbsshe, lost, 80, one, meal, day, ]"
2,0,2,intermittentfasting,new,Does fasting out of spite work? We’ll see in 4...,"[fast, spite, see, 4, week, go, wed, bh, siste...","[fasting, spite, see, 4, week, go, wedding, bh...","[fast, spite, work, see, 4, week, go, wed, bh,...","[fasting, spite, work, see, 4, week, go, weddi...","[fast, spite, work, see, 4, week, go, wed, bh,...","[fasting, spite, work, see, 4, week, go, weddi..."
3,1,0,intermittentfasting,new,Daily Fasting Check-in! * **Type** of fast (wa...,"[daili, fast, checkin, type, fast, water, juic...","[daily, fasting, checkin, type, fast, water, j...","[daili, fast, checkin, type, fast, water, juic...","[daily, fasting, checkin, type, fast, water, j...","[daili, fast, checkin, type, fast, water, juic...","[daily, fasting, checkin, type, fast, water, j..."
4,17,8,intermittentfasting,new,90 Days of Intermittent Fasting - IT WORKS! Hi...,"[90, day, intermitt, fast, work, hi, everyon, ...","[90, day, intermittent, fasting, work, hi, eve...","[90, day, intermitt, fast, work, hi, everyon, ...","[90, day, intermittent, fasting, work, hi, eve...","[90, day, intermitt, fast, work, hi, everyon, ...","[90, day, intermittent, fasting, work, hi, eve..."


In [13]:
# Save the cleaned data in 'reddit_cleaned_final.csv'
reddit.to_csv('reddit_cleaned_final.csv', index=False)
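
Because the stemmed and lemmatized columns hold Python lists, `to_csv` writes them as their string representation. This is why the reloaded file earlier in this notebook shows values like `"['daili', 'fast', ...]"`. Below is a minimal sketch of turning such a string back into a list with the standard library's `ast.literal_eval`. `parse_token_list` is a hypothetical helper and the sample string is made up for illustration.

```python
import ast


def parse_token_list(cell):
    """Convert a stringified list such as "['a', 'b']" back into a list."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"Expected a list, got {type(value).__name__}")
    return value


raw_cell = "['daili', 'fast', 'checkin']"  # as it would be read back from the CSV
tokens = parse_token_list(raw_cell)
print(tokens)       # ['daili', 'fast', 'checkin']
print(len(tokens))  # 3
```

After reading the CSV in a later notebook, this could be applied per column, for example `reddit['stemmed_round_1'].apply(parse_token_list)`.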

---
**Next:** [Part 3 - Exploratory Data Analysis (EDA)](Part%203%20-%20Exploratory%20Data%20Analysis%20(EDA).ipynb)