In [81]:
# Problem Statement

#     Is it clear what the goal of the project is?
        # Satirical and non-satirical news classifier
        # to distinguish satirical news and news 
        # in light of the recent political events
        # news, in particular fake news are getting harder to discern from satirical news
        # definition of fake and satirical news

#     What type of model will be developed?
        # Classifier model

#     How will success be evaluated?
        # tn tp fn fp

#     Is the scope of the project appropriate?
        # yes?


#     Is it clear who cares about this or why this is important to investigate?
        # helping people identify satirical news
        # https://www.straitstimes.com/singapore/is-satire-fake-news-media-literacy-council-post-sparks-backlash-from-netizens
        # importance in literary history

#     Does the student consider the audience and the primary and secondary stakeholders?


# Satirical News Classifier <a class="anchor" id="project"></a>
---

### Objective of Classifier 

Satirical news has an important part in the literary history. However in today's ridiculous world, it may be hard to discern between satirical and non-satirical news. While this classifier model makes no distinction between fake and real news, the main difference in satirical and non-satirical news is in their intent ion the message to the audience. Satirical news may serve as food for thought in the current political climate, while news serve to inform (and sometimes, sway) the masses. It is not only the masses that may be unable to discern, but also corporation and governments that can be deceived when satire becomes too real and factual.

### Project Description

This project is divided into three notebooks:-
1. Data Collection
2. Data Cleaning & EDA
3. Pre-processing and Modelling

The Data Collection notebook will use Pushshift's API to scrape r/TheOnion and r/worldnews for posts, and collate the scraped posts into a dataset for training up the satirical news classification model. 
The dataset will then be cleaned up to minimize non-english titles and duplicates. NSFW posts will also be removed. 
Exploratory data analysis (EDA) will allow us to gain some insights to the datasets such as the top occurring words that may be considered as stop words (e.g. onion) to the cleaned datasets for modelling.
Lastly, the datasets will be transformed and fitted into different classification models e.g. Multinomial Naive Bayes, Logistic Regression and their success metrics will be evaluated based on reducing Type II errors.  

## Table of Contents <a class="anchor" id="toc"></a>
---
* [The Satirical News Classifier](#project)
* [Overview](#overview)
* [Importing Libraries](#importinglibraries)
* [Creating Custom Functions](#customfunctions)
* [Exploring Pushshift's API](#exploringapi)
* [Using Pushshift's API to scrape from Reddit](#scrape)
    * [r/TheOnion](#theonion)
    * [r/worldnews](#worldnews)
* [Preparing datasets for Cleaning and EDA](#preparingdatasets)
    * [r/TheOnion](#preptheonion)
    * [r/worldnews](#prepworldnews)

# 1. Data Collection

## Overview <a class="anchor" id="overview"></a>
---
[Back to top!](#toc)

Two subreddits will be scraped with the use of [Pushshift's](https://github.com/pushshift/api) API and custom functions to get the desired number of posts for this classification model. <br/>
>#### r/TheOnion
Exported datasets: theonion_scraped.csv, theonion_dataset.csv <br/>
>#### r/worldnews
Exported datasets: worldnews_scraped.csv, worldnews_dataset.csv <br/>

## Importing Libraries <a class="anchor" id="importinglibraries"></a>
---
[Back to top!](#toc)

In [83]:
import numpy as np
import pandas as pd
# allows us to see all rows and columns 
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

import requests
import time
import random

## Exploring Pushshift's API <a class="anchor" id="exploringapi"></a>
---
[Back to top!](#toc) <br/>
<br/>
Credits: 
- [Pushshift's](https://github.com/pushshift/api) API
- [Primer Video](https://youtu.be/AcrjEWsMi_E)

In [84]:
# api to get all posts from reddit 
base_url = "https://api.pushshift.io/reddit/search/submission"

In [85]:
# getting posts only from r/TheOnion
params = {
    'subreddit': 'TheOnion',
    'size': 100,
    'before': 1618614432
}

In [86]:
res = requests.get(base_url, params)

In [87]:
# 200 means good to go!
res.status_code

200

In [88]:
# saving as dictionary
data = res.json()

In [89]:
# accessing posts
posts = data['data']

In [128]:
# checking first 3 posts
posts[:3]

[{'all_awardings': [],
  'allow_live_comments': False,
  'author': 'riiga',
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_text': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_42qfz',
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1618604368,
  'domain': 'theonion.com',
  'full_link': 'https://www.reddit.com/r/TheOnion/comments/msbkrk/no_way_to_prevent_this_says_only_nation_where/',
  'gildings': {},
  'id': 'msbkrk',
  'is_crosspostable': True,
  'is_meta': False,
  'is_original_content': False,
  'is_reddit_media_domain': False,
  'is_robot_indexable': True,
  'is_self': False,
  'is_video': False,
  'link_flair_background_color': '',
  'link_flair_richtext': [],
  'link_flair_text_color': 'dark',
  'link_flair_type': 'text',
  'locked': False,
  'media_only': False,
  'no_follow': True,
  'num_comments': 7,
  'num_crossposts': 0,
  'over

In [90]:
# checking no. of posts
len(posts)

100

In [91]:
# checking first post
posts[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'riiga',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_42qfz',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1618604368,
 'domain': 'theonion.com',
 'full_link': 'https://www.reddit.com/r/TheOnion/comments/msbkrk/no_way_to_prevent_this_says_only_nation_where/',
 'gildings': {},
 'id': 'msbkrk',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 7,
 'num_crossposts': 0,
 'over_18': False,
 'parent_whitelist_sta

In [2]:
type(posts)

NameError: name 'posts' is not defined

In [93]:
# checking last post
posts[-1]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'dwaxe',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_3jamc',
 'author_patreon_flair': False,
 'author_premium': True,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1615576025,
 'domain': 'ogn.theonion.com',
 'full_link': 'https://www.reddit.com/r/TheOnion/comments/m3omh9/hey_gamers_our_source_inside_nintendo_disappeared/',
 'gildings': {},
 'id': 'm3omh9',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 10,
 'num_crossposts': 0,
 'over_18': False,
 'parent_white

In [94]:
# retrieving the created time of last post in scraped posts
posts[-1].get('created_utc')

1615576025

In [95]:
# converting list of dictionary to dataframe
df_posts = pd.DataFrame(posts)
df_posts.head()

## Creating Custom Functions <a class="anchor" id="customfunctions"></a>
---
[Back to top!](#toc) <br/>
<br/>

In [99]:
# defining function to show posts to delete
def has_been_deleted(post):
    '''Input: List of dictionary for checking if the key 'removed_by_category' exists
       Output: True if exists, False if does not exist.'''
    if 'removed_by_category' in post:
        # print("post has been removed")
        return True
    else:
        # print("post has not been removed")
        return False

In [103]:
# defining function to delete posts from has_been_deleted function
def remove_deleted_posts(posts):
    '''Input: List of dictionary for removal of dictionary containing the key 'removed_by_category'
       Output: List of dictionary containing posts not removed by reddit'''
    for i in reversed(range(len(posts)-1)):
        if has_been_deleted(posts[i]):
            del posts[i]

In [108]:
# time of scraping set at 19/04/2021 for sake of this project
# defining function to scrape posts till desired number
def scrape_reddit(subreddit, post_num): 
    '''Input: Subreddit of choice to scrape posts from, No. of posts to scrape
       Output: DataFrame of scraped posts that were not removed by the reddit.'''
    global df # save the dataframe after function ends
    all_posts = [] # initializing empty list to store all scraped posts
    base_url = "https://api.pushshift.io/reddit/search/submission" # set base url
    params = { # setting params for first scrape
        'subreddit': subreddit,
        'size': 100, # max no. of posts per scrape
        'before': 1618604368 # set at this time only for this project
    }
    res = requests.get(base_url, params) 
    print(f'Status Code {res.status_code}') # showing status code
    data = res.json() 
    posts = data['data'] # saving the scraped posts as a list of dictionary
    print(f'Number of posts scraped = {len(posts)}') # should show 100 posts per scrape
    remove_deleted_posts(posts) # calling previously defined function
    all_posts.extend(posts) # adding scraped posts to empty list
    print(f'Total number of posts that has not been removed by category scraped = {len(all_posts)}') # total cumulative posts scraped
    print('---') 
    while len(all_posts) < post_num: # loop for multiple scrapings to hit desired number of posts
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': all_posts[-1].get('created_utc') # to scrape posts before the last in list
            }
        res = requests.get(base_url, params)
        print(f'Status Code {res.status_code}')
        if res.status_code == 200: # ensure that webpage is ready for scraping
            data = res.json()
            posts = data['data']
            print(f'Number of posts scraped = {len(posts)}')
            remove_deleted_posts(posts)
            all_posts.extend(posts) 
            print(f'Total number of posts that has not been removed by category scraped = {len(all_posts)}')
            print('---')
            time.sleep(random.randint(0,5)) # randomizing time between each scrape to not get 'blacklisted'
        else: # if unable to scrape, sleep then try again
            time.sleep(random.randint(0,5)) 
            pass
    else:
        pass
    df = pd.DataFrame(all_posts) # saving df

## Using Pushshift's API to scrape from Reddit <a class="anchor" id="scrape"></a>
---
[Back to top!](#toc) <br/>
<br/>


Each subreddit will have 2,500 posts scraped to ensure sufficient posts from each subreddit will be available for training this satirical news classification model after data cleaning and EDA.

### r/TheOnion <a class="anchor" id="theonion"></a>
---
[Back to top!](#toc) <br/>
<br/>

In [109]:
%%time 
scrape_reddit('TheOnion', 2500) # using custom function to scrape

Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 88
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 166
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 254
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 337
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 424
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 512
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 601
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 691
---
Status Co

In [110]:
# converting list of dictionary into dataframe
df_theonion = df
df_theonion.shape

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media,media_embed,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,retrieved_on,score,secure_media,secure_media_embed,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,thumbnail_height,thumbnail_width,title,total_awards_received,treatment_tags,upvote_ratio,url,url_overridden_by_dest,whitelist_status,wls,removed_by_category,gallery_data,is_gallery,media_metadata,author_cakeday,crosspost_parent,crosspost_parent_list,author_flair_background_color,author_flair_text_color,steward_reports,removed_by,updated_utc,og_description,og_title
0,[],False,nipoxa4654,,[],,text,t2_9oonbjf4,False,False,[],False,False,1618603680,youtu.be,https://www.reddit.com/r/TheOnion/comments/msb...,{},msbcc7,True,False,False,False,True,False,False,,[],dark,text,False,"{'oembed': {'author_name': 'The Onion', 'autho...","{'content': '&lt;iframe width=""267"" height=""20...",False,False,1,0,False,all_ads,/r/TheOnion/comments/msbcc7/representative_wan...,False,rich:video,"{'enabled': False, 'images': [{'id': 'AUJX28Cl...",6,1618603690,1,"{'oembed': {'author_name': 'The Onion', 'autho...","{'content': '&lt;iframe width=""267"" height=""20...",,True,False,False,TheOnion,t5_2qhmj,160538,public,https://a.thumbs.redditmedia.com/_zPQiD4RKFj37...,105.0,140.0,Representative Wants To Meet More Kids Online,0,[],1.0,https://youtu.be/ocK7rZJ6U5w,https://youtu.be/ocK7rZJ6U5w,all_ads,6,,,,,,,,,,,,,,
1,[],False,pi3141592653589,,[],,text,t2_gi7mk,False,True,[],False,False,1618594518,theonion.com,https://www.reddit.com/r/TheOnion/comments/ms8...,{},ms8987,True,False,False,False,True,False,False,,[],dark,text,False,,,False,True,1,0,False,all_ads,/r/TheOnion/comments/ms8987/minnesota_deploys_...,False,link,"{'enabled': False, 'images': [{'id': 'yRgy76Gr...",6,1618594528,1,,,,True,False,False,TheOnion,t5_2qhmj,160536,public,https://b.thumbs.redditmedia.com/BQr5StDDmJxnu...,78.0,140.0,Minnesota Deploys National Guard Ahead Of Next...,0,[],1.0,https://www.theonion.com/minnesota-deploys-nat...,https://www.theonion.com/minnesota-deploys-nat...,all_ads,6,,,,,,,,,,,,,,
2,[],False,dwaxe,,[],,text,t2_3jamc,False,True,[],False,False,1618588625,ogn.theonion.com,https://www.reddit.com/r/TheOnion/comments/ms6...,{},ms65tm,True,False,False,False,True,False,False,,[],dark,text,False,,,False,True,7,0,False,all_ads,/r/TheOnion/comments/ms65tm/small_kindnesses_g...,False,link,"{'enabled': False, 'images': [{'id': 'hX5khf7c...",6,1618588636,1,,,,False,False,False,TheOnion,t5_2qhmj,160531,public,https://b.thumbs.redditmedia.com/jP-PZIfVhdohx...,78.0,140.0,Small Kindnesses: Gamer Shields Ailing Grandmo...,0,[],1.0,https://ogn.theonion.com/small-kindnesses-game...,https://ogn.theonion.com/small-kindnesses-game...,all_ads,6,,,,,,,,,,,,,,
3,[],False,FutureOmelet,,[],,text,t2_f4n4p,False,True,[],False,False,1618584339,sports.theonion.com,https://www.reddit.com/r/TheOnion/comments/ms4...,{},ms4o4o,True,False,False,False,True,False,False,,[],dark,text,False,,,False,True,0,0,False,all_ads,/r/TheOnion/comments/ms4o4o/report_san_diegans...,False,link,"{'enabled': False, 'images': [{'id': 'Gs3bGIXB...",6,1618584350,1,,,,True,False,False,TheOnion,t5_2qhmj,160530,public,https://a.thumbs.redditmedia.com/ZBxDFOKz-0OsB...,78.0,140.0,Report: San Diegans Just Assumed Padres Were I...,0,[],1.0,https://sports.theonion.com/report-san-diegans...,https://sports.theonion.com/report-san-diegans...,all_ads,6,,,,,,,,,,,,,,
4,[],False,mothershipq,,[],,text,t2_4negm,False,True,[],False,False,1618578024,theonion.com,https://www.reddit.com/r/TheOnion/comments/ms2...,{},ms2mpd,True,False,False,False,True,False,False,,[],dark,text,False,,,False,True,4,0,False,all_ads,/r/TheOnion/comments/ms2mpd/colorado_temporari...,False,link,"{'enabled': False, 'images': [{'id': 'bMqj1Xkp...",6,1618578042,1,,,,True,False,False,TheOnion,t5_2qhmj,160527,public,https://b.thumbs.redditmedia.com/IdOdGEIaPjA6b...,78.0,140.0,Colorado Temporarily Re-Bans Marijuana For Sta...,0,[],1.0,https://www.theonion.com/colorado-temporarily-...,https://www.theonion.com/colorado-temporarily-...,all_ads,6,,,,,,,,,,,,,,


In [114]:
# exporting the raw scraped data to csv
df_theonion.to_csv('../data/theonion_scraped.csv')

### r/worldnews <a class="anchor" id="worldnews"></a>
---
[Back to top!](#toc) <br/>
<br/>

In [115]:
%%time
scrape_reddit('worldnews', 2500) # using custom function to scrape

Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 36
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 70
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 105
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 154
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 193
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 249
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 276
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 296
---
Status Cod

Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 2320
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 2347
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 2380
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 2416
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 2467
---
Status Code 200
Number of posts scraped = 100
Total number of posts that has not been removed by category scraped = 2501
---
Wall time: 13min 40s


In [116]:
# converting list of dictionary into dataframe
df_worldnews = df
df_worldnews.shape

(2501, 74)

In [117]:
# exporting the raw scraped data to csv
df_worldnews.to_csv('../data/worldnews_scraped.csv')

## Preparing datasets for Cleaning and EDA <a class="anchor" id="preparingdatasets"></a>
---
[Back to top!](#toc) <br/>
<br/>

### r/TheOnion <a class="anchor" id="preptheonion"></a>
[Back to top!](#toc) <br/>
<br/>

In [118]:
# import csv of scrapped posts
df_theonion = pd.read_csv('../data/theonion_scraped.csv')
df_theonion.head()

Unnamed: 0.1,Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media,media_embed,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,retrieved_on,score,secure_media,secure_media_embed,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,thumbnail_height,thumbnail_width,title,total_awards_received,treatment_tags,upvote_ratio,url,url_overridden_by_dest,whitelist_status,wls,removed_by_category,gallery_data,is_gallery,media_metadata,author_cakeday,crosspost_parent,crosspost_parent_list,author_flair_background_color,author_flair_text_color,steward_reports,removed_by,updated_utc,og_description,og_title
0,0,[],False,nipoxa4654,,[],,text,t2_9oonbjf4,False,False,[],False,False,1618603680,youtu.be,https://www.reddit.com/r/TheOnion/comments/msb...,{},msbcc7,True,False,False,False,True,False,False,,[],dark,text,False,"{'oembed': {'author_name': 'The Onion', 'autho...","{'content': '&lt;iframe width=""267"" height=""20...",False,False,1,0,False,all_ads,/r/TheOnion/comments/msbcc7/representative_wan...,False,rich:video,"{'enabled': False, 'images': [{'id': 'AUJX28Cl...",6,1618603690,1,"{'oembed': {'author_name': 'The Onion', 'autho...","{'content': '&lt;iframe width=""267"" height=""20...",,True,False,False,TheOnion,t5_2qhmj,160538,public,https://a.thumbs.redditmedia.com/_zPQiD4RKFj37...,105.0,140.0,Representative Wants To Meet More Kids Online,0,[],1.0,https://youtu.be/ocK7rZJ6U5w,https://youtu.be/ocK7rZJ6U5w,all_ads,6,,,,,,,,,,,,,,
1,1,[],False,pi3141592653589,,[],,text,t2_gi7mk,False,True,[],False,False,1618594518,theonion.com,https://www.reddit.com/r/TheOnion/comments/ms8...,{},ms8987,True,False,False,False,True,False,False,,[],dark,text,False,,,False,True,1,0,False,all_ads,/r/TheOnion/comments/ms8987/minnesota_deploys_...,False,link,"{'enabled': False, 'images': [{'id': 'yRgy76Gr...",6,1618594528,1,,,,True,False,False,TheOnion,t5_2qhmj,160536,public,https://b.thumbs.redditmedia.com/BQr5StDDmJxnu...,78.0,140.0,Minnesota Deploys National Guard Ahead Of Next...,0,[],1.0,https://www.theonion.com/minnesota-deploys-nat...,https://www.theonion.com/minnesota-deploys-nat...,all_ads,6,,,,,,,,,,,,,,
2,2,[],False,dwaxe,,[],,text,t2_3jamc,False,True,[],False,False,1618588625,ogn.theonion.com,https://www.reddit.com/r/TheOnion/comments/ms6...,{},ms65tm,True,False,False,False,True,False,False,,[],dark,text,False,,,False,True,7,0,False,all_ads,/r/TheOnion/comments/ms65tm/small_kindnesses_g...,False,link,"{'enabled': False, 'images': [{'id': 'hX5khf7c...",6,1618588636,1,,,,False,False,False,TheOnion,t5_2qhmj,160531,public,https://b.thumbs.redditmedia.com/jP-PZIfVhdohx...,78.0,140.0,Small Kindnesses: Gamer Shields Ailing Grandmo...,0,[],1.0,https://ogn.theonion.com/small-kindnesses-game...,https://ogn.theonion.com/small-kindnesses-game...,all_ads,6,,,,,,,,,,,,,,
3,3,[],False,FutureOmelet,,[],,text,t2_f4n4p,False,True,[],False,False,1618584339,sports.theonion.com,https://www.reddit.com/r/TheOnion/comments/ms4...,{},ms4o4o,True,False,False,False,True,False,False,,[],dark,text,False,,,False,True,0,0,False,all_ads,/r/TheOnion/comments/ms4o4o/report_san_diegans...,False,link,"{'enabled': False, 'images': [{'id': 'Gs3bGIXB...",6,1618584350,1,,,,True,False,False,TheOnion,t5_2qhmj,160530,public,https://a.thumbs.redditmedia.com/ZBxDFOKz-0OsB...,78.0,140.0,Report: San Diegans Just Assumed Padres Were I...,0,[],1.0,https://sports.theonion.com/report-san-diegans...,https://sports.theonion.com/report-san-diegans...,all_ads,6,,,,,,,,,,,,,,
4,4,[],False,mothershipq,,[],,text,t2_4negm,False,True,[],False,False,1618578024,theonion.com,https://www.reddit.com/r/TheOnion/comments/ms2...,{},ms2mpd,True,False,False,False,True,False,False,,[],dark,text,False,,,False,True,4,0,False,all_ads,/r/TheOnion/comments/ms2mpd/colorado_temporari...,False,link,"{'enabled': False, 'images': [{'id': 'bMqj1Xkp...",6,1618578042,1,,,,True,False,False,TheOnion,t5_2qhmj,160527,public,https://b.thumbs.redditmedia.com/IdOdGEIaPjA6b...,78.0,140.0,Colorado Temporarily Re-Bans Marijuana For Sta...,0,[],1.0,https://www.theonion.com/colorado-temporarily-...,https://www.theonion.com/colorado-temporarily-...,all_ads,6,,,,,,,,,,,,,,


In [1]:
# basic checking of dataframe features and rows
df_theonion.info()

NameError: name 'df_theonion' is not defined

In [119]:
# removing duplicated posts
df_theonion.drop_duplicates(subset = 'title', inplace=True)

# removing posts that contains non ascii characters in 'title' 
# (highly likely non-english articles)
df_theonion[df_theonion['title'].map(lambda x: x.isascii())]

# checking rows remaining
df_theonion.shape

(2356, 81)

In [125]:
# exporting dataframe
df_theonion.to_csv('../data/theonion_dataset.csv')

### r/worldnews <a class="anchor" id="prepworldnews"></a>
[Back to top!](#toc) <br/>
<br/>

In [121]:
# import csv of scrapped posts
df_worldnews = pd.read_csv('../data/worldnews_scraped.csv')
df_worldnews.head()

Unnamed: 0.1,Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,thumbnail_height,thumbnail_width,title,total_awards_received,treatment_tags,upvote_ratio,url,url_overridden_by_dest,whitelist_status,wls,crosspost_parent,crosspost_parent_list,author_cakeday,removed_by_category,media,media_embed,secure_media,secure_media_embed,author_flair_background_color,author_flair_text_color
0,0,[],False,lurker_bee,,[],,text,t2_c7n3w,False,True,[],False,False,1618603975,bbc.com,https://www.reddit.com/r/worldnews/comments/ms...,{},msbfvz,True,False,False,False,True,False,False,,coronavirus,[],COVID-19,dark,text,False,False,True,71,0,False,all_ads,/r/worldnews/comments/msbfvz/covid_canada_soun...,False,link,"{'enabled': False, 'images': [{'id': '0S0I-2rB...",6,1618603987,1,,True,False,False,worldnews,t5_2qh13,26152293,public,default,78.0,140.0,Covid: Canada sounds the alarm as cases overta...,0,[],1.0,https://www.bbc.com/news/world-us-canada-56779...,https://www.bbc.com/news/world-us-canada-56779...,all_ads,6,,,,,,,,,,
1,1,[],False,Seek_Adventure,,[],,text,t2_fklj5,False,False,[],False,False,1618603709,themoscowtimes.com,https://www.reddit.com/r/worldnews/comments/ms...,{},msbcoq,True,False,False,False,True,False,False,,russia,[],Russia,dark,text,False,False,True,0,0,False,all_ads,/r/worldnews/comments/msbcoq/navalny_ally_jail...,False,link,"{'enabled': False, 'images': [{'id': 'uIArxU4f...",6,1618603720,1,,True,False,False,worldnews,t5_2qh13,26152212,public,default,95.0,140.0,Navalny Ally Jailed 2 Years for Anti-Governmen...,0,[],1.0,https://www.themoscowtimes.com/2021/04/16/nava...,https://www.themoscowtimes.com/2021/04/16/nava...,all_ads,6,,,,,,,,,,
2,2,[],False,pi3141592653589,,[],,text,t2_gi7mk,False,True,[],False,False,1618603551,bbc.com,https://www.reddit.com/r/worldnews/comments/ms...,{},msbasj,True,False,False,False,True,False,False,,,[],,dark,text,False,False,False,3,0,False,all_ads,/r/worldnews/comments/msbasj/raul_castro_steps...,False,link,"{'enabled': False, 'images': [{'id': 'Bt7zy4Xx...",6,1618603562,1,,True,False,False,worldnews,t5_2qh13,26152191,public,default,78.0,140.0,Raul Castro steps down as Cuban Communist Part...,0,[],1.0,https://www.bbc.com/news/world-latin-america-5...,https://www.bbc.com/news/world-latin-america-5...,all_ads,6,,,,,,,,,,
3,3,[],False,avp1982,,[],,text,t2_7csrx7ux,False,False,[],False,False,1618603321,haaretz.com,https://www.reddit.com/r/worldnews/comments/ms...,{},msb80e,True,False,False,False,True,False,False,,palestisrael,[],Israel/Palestine,dark,text,False,False,True,22,0,False,all_ads,/r/worldnews/comments/msb80e/israeli_troops_sh...,False,link,"{'enabled': False, 'images': [{'id': 'BZdP_60z...",6,1618603333,1,,True,False,False,worldnews,t5_2qh13,26152168,public,default,81.0,140.0,Israeli Troops Shot and Killed a Palestinian F...,0,[],1.0,https://www.haaretz.com/israel-news/twilight-z...,https://www.haaretz.com/israel-news/twilight-z...,all_ads,6,,,,,,,,,,
4,4,[],False,neerajanchan,,[],,text,t2_81ll7dwu,False,False,[],False,False,1618603095,m.timesofindia.com,https://www.reddit.com/r/worldnews/comments/ms...,{},msb52i,True,False,False,False,True,False,False,,coronavirus,[],COVID-19,dark,text,False,False,False,2,0,False,all_ads,/r/worldnews/comments/msb52i/covid19_is_predom...,False,link,"{'enabled': False, 'images': [{'id': 'XkdWe1j4...",6,1618603105,1,,True,False,False,worldnews,t5_2qh13,26152161,public,default,75.0,140.0,Covid-19 is predominantly transmitted through ...,0,[],1.0,https://m.timesofindia.com/home/science/covid-...,https://m.timesofindia.com/home/science/covid-...,all_ads,6,t3_msb4lh,"[{'all_awardings': [], 'allow_live_comments': ...",,,,,,,,


In [None]:
# basic checking of dataframe features and rows
df_worldnews.info()

In [122]:
# removing duplicated posts
df_worldnews.drop_duplicates(subset = 'title', inplace=True)

# removing posts that contains non ascii characters in 'title' 
# (highly likely non-english articles)
df_worldnews = df_worldnews[df_worldnews['title'].map(lambda x: x.isascii())]

# checking rows remaining
df_worldnews.shape

(2255, 75)

In [124]:
# exporting dataframe
df_worldnews.to_csv('../data/worldnews_dataset.csv')