# 06_Implementation

This notebook demonstrates, end to end, how comments on subreddit submissions are scraped, cleaned, preprocessed, and classified to determine whether a submission should be labeled as "misinformation." It is a proof-of-concept (POC) implementation of the backend process that produces a "misinformation" alert on the monitoring dashboard proposed for subreddit moderators.

The alert is triggered by aggregating the prediction outcomes for the comments within a submission. If at least 55% of a submission's predicted comments are classified as misinformation, the dashboard displays an "Alert," signaling the moderator to prioritize review and take moderation action if necessary. The 55% threshold is configurable and should be reviewed and adjusted periodically for effectiveness and accuracy.
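
As a minimal sketch of this rule, the function below maps a submission's comment counts to a dashboard flag. The function name `alert_level` and its parameter names are hypothetical; the 45% "WARNING" tier mirrors the aggregation code later in this notebook, and returning "N" for a submission without predicted comments is an assumption.

```python
def alert_level(n_misinfo: int, n_total: int,
                warning_pct: float = 45.0, alert_pct: float = 55.0) -> str:
    """Map a submission's share of misinformation comments to a dashboard flag."""
    if n_total <= 0:
        return "N"  # assumption: nothing to flag without predicted comments
    pct = 100.0 * n_misinfo / n_total
    if pct >= alert_pct:
        return "ALERT"
    if pct > warning_pct:
        return "WARNING"
    return "N"


# Counts taken from rows of the result table at the end of this notebook
print(alert_level(13, 18))  # 72.2% of comments
print(alert_level(38, 75))  # 50.7% of comments
print(alert_level(1, 6))    # 16.7% of comments
```

Keeping the thresholds as parameters makes it easy to try other cut-offs when the 55% value is reviewed.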

For illustration purposes, comments on the submissions to subreddit 'r/politics' will be used for the implementation. r/politics is a subreddit for news and discussion about U.S. politics. It serves as a forum for users to share and discuss political news, opinions, analyses, and developments across various political topics. The subreddit is moderated to maintain civility, relevance, and adherence to community guidelines, fostering constructive dialogue and engagement on political issues. We attempt to analyze the comments posted on the submissions for r/politics using our misinformation classification model. The result is presented in a tabular format, resembling a dashboard view for the moderator.  

Based on the general sentiment of the comments posted in r/politics, we gathered the following submissions, which redditors or moderators regarded as "misleading" or "not misleading". These submissions serve as a litmus test for this implementation.

_r/politics posts perceived as misleading based on user comments:_
- https://www.reddit.com/r/politics/comments/1b4828m/biden_calls_for_immediate_ceasefire_in_gaza/
- https://www.reddit.com/r/politics/comments/1ap2dwr/rfk_jr_apologizes_for_super_bowl_ad_thats_still/
- https://www.reddit.com/r/politics/comments/19ajy65/six_top_secret_files_identified_in_donald_trumps/
- https://www.reddit.com/r/politics/comments/1b1aojn/joe_biden_pledges_17_billion_to_end_hunger_across/
- https://www.reddit.com/r/politics/comments/1bmzylv/eric_trump_says_454m_fine_imposed_on_his_father/



_r/politics posts perceived as not misleading based on user comments:_
- https://www.reddit.com/r/politics/comments/1bngesz/trump_bond_reduced_to_175_million_as_he_appeals/
- https://www.reddit.com/r/politics/comments/1bnhop5/its_a_date_trumps_first_felony_trial_will_be/
- https://www.reddit.com/r/politics/comments/1bnid3p/israel_cancels_washington_visit_after_us_allows/
- https://www.reddit.com/r/politics/comments/1bo59n6/biden_gets_some_good_news_in_poll_as_he_gains/

The code also scrapes additional submissions from the subreddit's "Hot" listing to fill out the final dashboard.

---
### Import libraries

In [606]:
import praw
from datetime import datetime
import pandas as pd
import time
import numpy as np
import re
import string
import requests 

from spellchecker import SpellChecker
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

import pickle



---
### Data Scraping

For each identified Reddit submission, the data can be scraped from the submission's JSON representation on reddit.com. The JSON data of a submission can be retrieved by appending `.json` to the post ID in the URL.

For example: 
- URL of a Reddit submission https://www.reddit.com/r/politics/comments/1b4828m/biden_calls_for_immediate_ceasefire_in_gaza/
- JSON data of the Reddit submission https://www.reddit.com/r/politics/comments/1b4828m.json
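
The conversion shown in the example can be sketched as a small helper. The function name `to_json_url` and the regular expression are illustrative assumptions, not part of the original notebook, which builds the JSON URLs directly from post IDs.

```python
import re

# Captures the subreddit name and the post ID from a submission URL.
_SUBMISSION_RE = re.compile(r"reddit\.com/r/([^/]+)/comments/([a-z0-9]+)", re.IGNORECASE)


def to_json_url(submission_url: str) -> str:
    """Turn a submission URL into the URL of its JSON representation."""
    match = _SUBMISSION_RE.search(submission_url)
    if match is None:
        raise ValueError(f"Not a submission URL: {submission_url}")
    subreddit_name, post_id = match.groups()
    return f"https://www.reddit.com/r/{subreddit_name}/comments/{post_id}.json"


print(to_json_url(
    "https://www.reddit.com/r/politics/comments/1b4828m/biden_calls_for_immediate_ceasefire_in_gaza/"
))
```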

For this implementation (and the Streamlit app), scraping the JSON endpoint proved faster and more effective, which keeps the app responsive.

The code below performs the following steps:
1. Perform an HTTP GET request to retrieve the JSON data for each identified submission.
2. From the JSON data, extract the title, the number of comments, and the list of comments (including all child comments in the comment forest).
3. Place all extracted comments in a dataframe.


In [607]:
# Configuration variables that drive the outcome

subreddit = 'politics'   # name of the subreddit to be scraped and analysed
num_submission = 100      # number of submissions in the subreddit to be scraped

Code to extract the list of "Hot" submissions in the target subreddit

In [608]:
url = f"https://www.reddit.com/r/{subreddit}/hot.json?limit={num_submission}"

# Send a GET request to retrieve the JSON data
response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON data
    json_data = response.json()
    
    # Extract submission IDs from the JSON data
    submission_ids = [post['data']['id'] for post in json_data['data']['children']]
    
    # Print the list of submission IDs
    print(f"List of submission IDs for hot submissions in r/{subreddit}:")
    print(submission_ids)
else:
    print("Failed to retrieve data. Status code:", response.status_code)
    

List of submission IDs for hot submissions in r/politics:
['1bnf9ci', '1bp5udi', '1bp9lcx', '1bpef1i', '1bphl49', '1bp218u', '1bphx66', '1bpca0a', '1bpkqtj', '1bpg3ni', '1bp65g6', '1bp2dfu', '1bp70p6', '1bp24vc', '1bp0ioq', '1bp5wft', '1bpbsmi', '1bpjsrl', '1bp9nbl', '1bpbq8s', '1bp2lx8', '1bpizvm', '1bpc5pl', '1bp3ok9', '1bpd68u', '1bpj55q', '1bphf3z', '1bozu7h', '1boxmwp', '1bp4qm4', '1bpl8w1', '1bpasr0', '1bp0anx', '1bp1t3g', '1bp2cop', '1bphlfy', '1bp1jfv', '1bozt1c', '1bpbokz', '1bpbj49', '1boqqmu', '1bpcwvv', '1bor5hj', '1bp7402', '1bpmd50', '1bph7pc', '1bpatrk', '1bpfbwr', '1bp34mr', '1bp3czk', '1bp53m4', '1bp0osc', '1bp6xwx', '1bpa1a1', '1bp4cwr', '1bp2ly6', '1bpg525', '1bpabxc', '1bp6zwe', '1bpdegg', '1bplgv4', '1bp8hj5', '1bp4ioh', '1bpdvb7', '1bpn83q', '1bp5zfm', '1borjiv', '1boi4wv', '1bozzpf', '1bp3g99', '1bplyue', '1bph5dz', '1bpkjlu', '1bp1sek', '1bpi6hi', '1bpoclj', '1bpa3gk', '1bp2myu', '1bpc52r', '1bp67gf', '1bpias5', '1boq2fu', '1bojbha', '1boue6m', '1bol03s', '1bp8s

Consolidating the list of URLs based on Submission ID

In [609]:
submission_urls = [f"https://www.reddit.com/r/{subreddit}/comments/{sid}.json" for sid in submission_ids]

# Append the URLs of the hand-picked submissions listed at the top of this notebook
submission_urls.append('https://www.reddit.com/r/politics/comments/1bmzylv.json')
submission_urls.append('https://www.reddit.com/r/politics/comments/1b4828m.json')
submission_urls.append('https://www.reddit.com/r/politics/comments/1ap2dwr.json')
submission_urls.append('https://www.reddit.com/r/politics/comments/19ajy65.json')
submission_urls.append('https://www.reddit.com/r/politics/comments/1b1aojn.json')
submission_urls.append('https://www.reddit.com/r/politics/comments/1bngesz.json')
submission_urls.append('https://www.reddit.com/r/politics/comments/1bnhop5.json')
submission_urls.append('https://www.reddit.com/r/politics/comments/1bnid3p.json')
submission_urls.append('https://www.reddit.com/r/politics/comments/1bo59n6.json')


The function `extract_comments` recursively extracts a comment and all of its child comments from the comment forest.

In [610]:
def extract_comments(comment, submission_title, comments_list):
    """Recursive function to extract comments and their child comments."""
    if 'body' in comment['data']:  # Check if the comment has a body
        comments_list.append({
            'comment id': comment['data']['id'],
            'title': submission_title,
            'body': comment['data']['body']
        })
    
    # Recursively extract child comments
    if 'replies' in comment['data'] and comment['data']['replies'] != '':
        replies_data = comment['data']['replies']['data']['children']
        for reply in replies_data:
            extract_comments(reply, submission_title, comments_list)
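
To see what the traversal produces without calling Reddit, the sketch below walks a small hand-made forest. It is an iterative variant with an explicit stack (a swapped-in technique, not the notebook's recursive function), so deeply nested threads cannot hit Python's recursion limit. The name `flatten_comment_forest`, the comment ids, and the bodies are made up; the dictionary layout mimics only the fields the original function reads.

```python
def flatten_comment_forest(top_level_comments, submission_title):
    """Collect every comment body in a forest, in the same order as the recursive version."""
    rows = []
    # Reverse so that popping from the end visits comments in their original order.
    stack = list(reversed(top_level_comments))
    while stack:
        node = stack.pop()
        data = node.get('data', {})
        if 'body' in data:  # placeholder nodes without a body are skipped
            rows.append({
                'comment id': data['id'],
                'title': submission_title,
                'body': data['body'],
            })
        replies = data.get('replies')
        if isinstance(replies, dict):  # an empty string means "no replies"
            stack.extend(reversed(replies['data']['children']))
    return rows


# Hand-made forest: one comment with two replies, one without, one body-less placeholder
forest = [
    {'data': {'id': 'a1', 'body': 'top-level comment',
              'replies': {'data': {'children': [
                  {'data': {'id': 'a2', 'body': 'first reply', 'replies': ''}},
                  {'data': {'id': 'a3', 'body': 'second reply', 'replies': ''}},
              ]}}}},
    {'data': {'id': 'b1', 'body': 'another top-level comment', 'replies': ''}},
    {'data': {'children': ['c1', 'c2']}},
]

rows = flatten_comment_forest(forest, 'demo title')
print([row['comment id'] for row in rows])
```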

Code to extract the comments for each submission. All extracted comments are eventually stored in the dataframe `comments_df`

In [611]:

# Initialize an empty list to store comments
all_comments_list = []

# Loop through each submission URL
for url in submission_urls:
    # Send a GET request to retrieve the JSON data
    response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the JSON data
        json_data = response.json()
        
        # Extract submission information
        submission_data = json_data[0]['data']['children'][0]['data']
        submission_title = submission_data['title']
        num_comments = submission_data['num_comments']
        
        print("Submission Title:", submission_title)
        print("Number of Comments:", num_comments)
        
        # Extract comments and child comments recursively
        comments_data = json_data[1]['data']['children']
        for comment in comments_data:
            extract_comments(comment, submission_title, all_comments_list)
        
    else:
        print("Failed to retrieve data from URL:", url)

# Create DataFrame from all comments
comments_df = pd.DataFrame(all_comments_list)

print("All comments from multiple submissions (including comment forests) downloaded into dataframe successfully.")

Submission Title: r/Politics' 2024 US Elections Live Thread, Part 6
Number of Comments: 99
Submission Title: Donald Trump Attacks Judge's Daughter Less Than 24 Hours After Gag Order
Number of Comments: 2915
Submission Title: Trump Launches Fresh Attack on Judge’s Daughter After Gag Order
Number of Comments: 857
Submission Title: Trump hush money judge appearing more irritated
Number of Comments: 268
Submission Title: X Account Trump Cited in Attacks on Judge’s Daughter Isn’t Even Hers: Court
Number of Comments: 95
Submission Title: Donald Trump Selling Bibles Sparks Fury From Christians—'Blasphemous Grift'
Number of Comments: 2263
Submission Title: Fox News host slammed for linking Baltimore Key bridge disaster to immigration: ‘Reprehensible stupidity’
Number of Comments: 105
Submission Title: Hate influencer Chaya Raichik thinks Pete Buttigieg can’t do his job because he loves his husband. He apparently can't lead the federal response to the Baltimore bridge collapse because he and hi
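
The loop above sends one request per submission with no pause and simply skips a URL when the status is not 200. A minimal sketch of a retry wrapper follows, assuming that throttling or transient failures show up as a non-200 status (for example HTTP 429). The helper name `get_json_with_retry` and its parameters are hypothetical; `get` stands for any callable that returns an object with `.status_code` and `.json()`, such as a thin wrapper around `requests.get`.

```python
import time


def get_json_with_retry(get, url, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call get(url) until it returns HTTP 200, waiting longer after each failure.

    Returns the parsed JSON, or None if every attempt fails.
    """
    for attempt in range(max_attempts):
        response = get(url)
        if response.status_code == 200:
            return response.json()
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

In the notebook this could be called as `get_json_with_retry(lambda u: requests.get(u, headers={'User-agent': 'Mozilla/5.0'}, timeout=10), url)`; passing `get` and `sleep` as arguments also lets the helper be tested without network access.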

---
### Data Cleaning

The data cleaning actions performed in 02_DataCleaning and 03_EDA will be carried out in this section, in the proper sequence, namely clean text, lemmatize text, and clean the lemmatized text.  

In [612]:
# verify the comments are in place in the comments_df
comments_df.head(5)

Unnamed: 0,comment id,title,body
0,kww8gj4,"r/Politics' 2024 US Elections Live Thread, Part 6","The politics stuff aside, the way Joe Lieberma..."
1,kwthuxt,"r/Politics' 2024 US Elections Live Thread, Part 6",So... what's Bernie Sanders's move? Is he runn...
2,kwvs0rz,"r/Politics' 2024 US Elections Live Thread, Part 6",It could be why there hasn't been a permanent ...
3,kwudm7d,"r/Politics' 2024 US Elections Live Thread, Part 6",I haven’t looked into it but he is 82 and as f...
4,kwsenqu,"r/Politics' 2024 US Elections Live Thread, Part 6",I'm not an American. But electing the most pow...


In [613]:
# verify the comments are in place in the comments_df
comments_df['body']

0        The politics stuff aside, the way Joe Lieberma...
1        So... what's Bernie Sanders's move? Is he runn...
2        It could be why there hasn't been a permanent ...
3        I haven’t looked into it but he is 82 and as f...
4        I'm not an American. But electing the most pow...
                               ...                        
11041             Who the fuck cares about polls. Its 2024
11042    This isn’t good news for Biden, it’s good news...
11043    In the same way that we criticize polls when t...
11044                                        Good for him.
11045                                   "Gains ground????"
Name: body, Length: 11046, dtype: object

Define the variables that will be used for data cleaning

In [614]:
stop_words = set(stopwords.words('english'))
special_char_list = list(string.punctuation)
special_char_list+=["’","'s","’s","...","$","@$$.","like", "it", "would", "im","“", "”", "u"]


Define the functions that will be used for data cleaning and lemmatization (code was copied from 02_DataCleaning and 03_EDA)

In [615]:
'''
function to perform lemmatization on a text
Using the WordNetLemmatizer by nltk
'''
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)


def sentence_lemmatizer(text):
    if text.strip() == '':
        return np.nan

    lemmatizer = WordNetLemmatizer()
    char_list = word_tokenize(text.lower())

    # Lemmatize list of words and join
    return ' '.join([lemmatizer.lemmatize(w.lower(), get_wordnet_pos(w.lower())) for w in char_list])


In [616]:
'''
function to perform the various text-cleaning actions on a comment extracted from a Reddit submission.
'''
def text_cleaning(text):

    url_regex = re.compile(
    r'((http|https)://'     # Start with http:// or https://
    r'([a-zA-Z0-9.-]+)'     # Match the domain name (alphanumeric characters, dots, and dashes)
    r'(\.[a-zA-Z]{2,})'     # Match the top-level domain (e.g., .com, .net) with at least 2 characters
    r'(:\d+)?'              # Match an optional port number
    r'(/\S*)?'              # Match an optional path (any non-whitespace characters)
    r'(\?[^"\s]*)?)',        # Match an optional query string (attribute-value pairs)
    re.IGNORECASE        # Ignore case sensitivity
    )

    # remove URL
    text = url_regex.sub("", text)

    # Mark the comment with [deleted] or [removed] with pseudo marker "pseudodeleted" and "pseudoremoved"
    # After lemmatization, [deleted] become [ delete ], [removed] become [ remove ]
    # After stemming, [deleted] become [ delet ], [removed] become [ remov ]
    text = text.replace("[deleted]","pseudodeleted").replace("[removed]","pseudoremoved").replace("[ delete ]","pseudodeleted").replace("[ remove ]","pseudoremoved").replace("[ delet ]","pseudodeleted").replace("[ remov ]","pseudoremoved")

    # remove newline 
    text = text.replace("\n", " ").replace("\r", " ").replace("\r\n"," ").replace("_x000D_", " ")

    # remove  "'s"
    text = re.sub(r"(\'s)","", text)
    
    # remove stopword 
    text = ' '.join([word for word in text.split() if word.lower() not in stop_words])

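    # remove punctuation and special characters
    # NOTE: this filter runs per character, so only the single-character entries of
    # special_char_list take effect. That includes "u", which is why the letter "u"
    # is missing from the cleaned output. The logic is kept as copied from 02_DataCleaning.
    text = ''.join([char for char in text if char not in special_char_list])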

    # return the cleaned text
    return text.strip()

In [617]:
'''
function to perform the various text-cleaning actions on the lemmatized comment text.
'''
def text_cleaning_post_lemmatized(text):

    # pattern of a standard footer in a submission template
    pattern = r'pm\sexclude\sme\sexclude\sfrom\ssubreddit\sfaq\sinformation\ssource\s.*downvote\sto\sremove\sv028'
    text = re.sub(pattern, '', text)

    # pattern2 of a standard footer in a submission template
    pattern2 = r'nonmobile\slink\shelperbot\sv11\srhelperbot\si\sbe\sa\sbot\splease\smessage\suswim1929\swith\sany\sfeedback\sandor\shate\scounter\s\d{6}'
    text = re.sub(pattern2, '', text)

    # collapse any word repeated consecutively (two or more times) into a single occurrence
    text = re.sub(r'\b(\w+)(?: \1\b)+', r'\1', text.lower())

    text = re.sub(r'gon\sna', "gonna", text.lower())

    # return the cleaned text
    return text.strip()

> All functions are in place. Ready to perform data cleaning
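
As a quick illustration of the last two substitutions in `text_cleaning_post_lemmatized`, the sketch below isolates them in a hypothetical helper, `collapse_repeats`, and runs it on two made-up strings. The "gon na" case arises because NLTK's `word_tokenize` splits "gonna" into "gon" and "na".

```python
import re


def collapse_repeats(text: str) -> str:
    """Collapse consecutive repeats of a word, then rejoin 'gon na' into 'gonna'."""
    # \1 refers back to the captured word, so "very very very" matches as one run.
    text = re.sub(r'\b(\w+)(?: \1\b)+', r'\1', text.lower())
    return re.sub(r'gon\sna', 'gonna', text)


print(collapse_repeats('Very very very good good news'))  # very good news
print(collapse_repeats('we be gon na vote'))              # we be gonna vote
```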


The `text_cleaning` function performs the text-cleaning actions identified in 02_DataCleaning, e.g. removal of HTTP URLs and newlines, and handling of the [deleted] and [removed] markers.

In [618]:
comments_df['comments_cleaned'] = comments_df['body'].map(text_cleaning)
comments_df['comments_cleaned'].head(10)

0    politics stff aside way Joe Lieberman died sca...
1    So Bernie Sanders move rnning Senate reelectio...
2    cold permanent Labor Secretary yet Jlie S acti...
3    havent looked 82 far seen still mch mentally p...
4    Im American electing powerfll president world ...
5           So thoghts RFK Jrs vp pick Nicole Shanahan
6    people even stand beyond third choice vaccine ...
7    Receiving money Rssia spreading propaganda hel...
8    cold still spoiler effect certainly gaining vo...
9                             legitimately voting gy 🤔
Name: comments_cleaned, dtype: object
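
In the output above the letter "u" has disappeared from every word ("stff", "rnning", "Rssia") because `text_cleaning` filters character by character and "u" is one of the entries in `special_char_list`. The following is a sketch of a token-level alternative, not the notebook's method: punctuation is removed per character and noise words are removed per token. The names `PUNCTUATION`, `NOISE_TOKENS`, and `strip_noise` are hypothetical, and the noise words are taken from `special_char_list`. Swapping this in would change the features, so the CountVectorizer and the model would presumably need to be refitted on text cleaned the same way.

```python
import string

# Characters removed one by one
PUNCTUATION = set(string.punctuation) | {"’", "“", "”"}
# Whole words removed only when they appear as a complete token
NOISE_TOKENS = {"like", "it", "would", "im", "u"}


def strip_noise(text: str) -> str:
    """Remove punctuation per character and noise words per token."""
    no_punct = ''.join(ch for ch in text if ch not in PUNCTUATION)
    return ' '.join(tok for tok in no_punct.split() if tok.lower() not in NOISE_TOKENS)


print(strip_noise("So u would like Russia, truly?"))  # So Russia truly
```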

The cleaned text will then go through lemmatization

In [619]:
comments_df['comments_cleaned_lemmatized'] = comments_df['comments_cleaned'].map(sentence_lemmatizer)
comments_df['comments_cleaned_lemmatized']

0        politics stff aside way joe lieberman die scar...
1        so bernie sander move rnning senate reelection...
2        cold permanent labor secretary yet jlie s act ...
3        havent look 82 far see still mch mentally phys...
4        im american elect powerfll president world sti...
                               ...                        
11041                                   fck care poll 2024
11042    isnt good news biden it good news america im p...
11043    way criticize poll show trmp lead mst criticiz...
11044                                             good him
11045                                           gain grond
Name: comments_cleaned_lemmatized, Length: 11046, dtype: object

If a comment becomes blank after cleaning (e.g. a comment that contained only a URL or only stop words), its row needs to be removed.

In [620]:
comments_df = comments_df.dropna(subset=['comments_cleaned_lemmatized'])
comments_df.reset_index(drop=True, inplace=True)
# confirm that no blank comments remain (expected: 0)
comments_df['comments_cleaned_lemmatized'].isnull().sum()

After lemmatization is done, the text will go through another round of text cleaning on the lemmatized text. This step was deemed necessary after EDA was conducted.  

In [621]:
comments_df['comments_cleaned_lemmatized_cleaned'] = comments_df['comments_cleaned_lemmatized'].map(text_cleaning_post_lemmatized)
comments_df['comments_cleaned_lemmatized_cleaned']

0        politics stff aside way joe lieberman die scar...
1        so bernie sander move rnning senate reelection...
2        cold permanent labor secretary yet jlie s act ...
3        havent look 82 far see still mch mentally phys...
4        im american elect powerfll president world sti...
                               ...                        
11019                                   fck care poll 2024
11020    isnt good news biden it good news america im p...
11021    way criticize poll show trmp lead mst criticiz...
11022                                             good him
11023                                           gain grond
Name: comments_cleaned_lemmatized_cleaned, Length: 11024, dtype: object

All data cleaning is complete. The data is ready for prediction.

---
### Prediction for r/Politics subreddit


Load the pickle file for the fitted CountVectorizer and trained LogisticRegression 

In [622]:
# trained model for misinformation predictor
with open(r"./model.pkl", 'rb') as rf_model:
    model = pickle.load(rf_model)

# fitted transformer - CountVectorizer, to transform the data before calling to model (for prediction)
with open(r"./count_vectorizer.pkl", 'rb') as rf_cv:
    cvec = pickle.load(rf_cv)

Apply the fitted CountVectorizer to the cleaned comments

In [623]:
X = cvec.transform(comments_df['comments_cleaned_lemmatized_cleaned'])


Prediction using the trained model

In [624]:
y_predict = model.predict(X)

In [625]:
y_predict

array([0, 0, 0, ..., 1, 0, 1], dtype=int64)

Merge the predicted output to the `comments_df` dataframe

In [626]:
# merge the predicted output (i.e. the predicted label) into the dataframe
comments_df['label'] = y_predict

Prediction Outcome

In [627]:
def calculate_total(row):
    row['Total'] = row[[0,1]].sum()
    return row

def calculate_percentage(row):
    threshold_1 = 45.0   # above this percentage (and below threshold_2): WARNING
    threshold_2 = 55.0   # at or above this percentage: ALERT
    row['misinformation_percentage'] = 100* row[1]/row['Total']
    if row['misinformation_percentage'] >= threshold_2:
        row['misinformatoin_ALERT'] = "ALERT"
    elif threshold_1 < row['misinformation_percentage'] < threshold_2:
        row['misinformatoin_ALERT'] = "WARNING"
    else:
        row['misinformatoin_ALERT'] = "N"
    return row

# count the comments per submission title and predicted label
t = comments_df.groupby(['title','label']).size()

# unstack 'label' into columns (0 = not misinformation, 1 = misinformation)
dt = t.unstack(['label'])

# first apply calculate_total to create a new column 'Total' with the comment count of each submission
dt2 = dt.apply(calculate_total, axis=1)

# then apply calculate_percentage to compute the misinformation percentage and flag of each submission
dt3 = dt2.apply(calculate_percentage, axis=1)
dt3

title,0,1,Total,misinformation_percentage,misinformatoin_ALERT
"'They should be afraid': Baltimore leader called 'DEI mayor' stands up to right-wing, racist attacks",5.0,13.0,18.0,72.222222,ALERT
A Complete Guide to the Manhattan Trump Election Interference Prosecution,5.0,1.0,6.0,16.666667,N
"A day after bridge collapse, Republicans are blaming Dems, floating unfounded and sometimes racist theories",37.0,38.0,75.0,50.666667,WARNING
"A solution to the retirement crisis? Americans should work for more years, BlackRock CEO says",98.0,87.0,185.0,47.027027,WARNING
"AI Is Making Financial Fraud Easier and More Sophisticated, US Treasury Warns",3.0,4.0,7.0,57.142857,ALERT
...,...,...,...,...,...
X Account Trump Cited in Attacks on Judge’s Daughter Isn’t Even Hers: Court,43.0,47.0,90.0,52.222222,WARNING
[The Washington Post] Judge recommends conservative lawyer John Eastman be disbarred in California\n,2.0,1.0,3.0,33.333333,N
"r/Politics' 2024 US Elections Live Thread, Part 6",42.0,43.0,85.0,50.588235,WARNING
‘Why do they bend their knee?’ 6 Never-Trumpers look back at what went wrong,8.0,13.0,21.0,61.904762,ALERT


> Prediction Outcome:
1. Each comment of a submission is passed to the model, which predicts whether it is "Misinformation".
2. When 55% or more of a submission's comments are predicted as Misinformation, "ALERT" is displayed. When the share is above 45% but below 55%, "WARNING" is displayed. Otherwise (45% or less), "N" is displayed.
3. Of the 5 posts deemed misleading by users, 3 were flagged as "ALERT" (true positives) and 2 were flagged as "WARNING".
4. Of the 4 posts deemed not misleading by users, 3 were not flagged at all (true negatives) and 1 was incorrectly flagged as "ALERT" (false positive).
5. When we ran the model on a larger sample of 100 posts, around 43% of posts were flagged as "ALERT". The model therefore appears highly sensitive when detecting misinformation, producing both true and false positives. The threshold for detecting misinformation should be reviewed periodically as the model improves and more training data becomes available.
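
The litmus-test counts in points 3 and 4 can be turned into the usual rates with a few lines of arithmetic. This sketch assumes that only "ALERT" counts as a positive flag, so the two misleading posts flagged "WARNING" are treated as missed. The variable names are illustrative.

```python
# Counts taken from points 3 and 4 above
true_positive = 3    # misleading posts flagged "ALERT"
false_negative = 2   # misleading posts flagged "WARNING" only
true_negative = 3    # non-misleading posts not flagged
false_positive = 1   # non-misleading post flagged "ALERT"

recall = true_positive / (true_positive + false_negative)
precision = true_positive / (true_positive + false_positive)
false_positive_rate = false_positive / (false_positive + true_negative)

print(f"recall={recall:.2f} precision={precision:.2f} FPR={false_positive_rate:.2f}")
# recall=0.60 precision=0.75 FPR=0.25
```

With only nine hand-picked posts these figures are indicative at best; counting "WARNING" as a positive flag would raise the recall on this set to 5 out of 5.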

### Export Predicted Output to CSV for offline reference


In [None]:
comments_df.to_csv("../data/06_Implementation_Prediction_Outcome.csv", index=False)