# Project 3: Web APIs & NLP

## Problem Statement

Fake news is a prevalent and harmful problem in modern society, often misleading the public on important topics such as healthcare and defense. This can lead to long-standing societal issues that harm nations worldwide.


To tackle the problem, our team aims to develop a model that uses natural language processing and machine learning to classify news as fake or real, helping government bodies and ordinary citizens identify fake news and building a more secure, misinformation-resilient society.


- The goal of this project is to train classifiers in Python that determine whether a post title belongs to r/TheOnion (fake news) or r/nottheonion (real news).
- The following tools are used to create the model. <br>
    Vectorisers: `CountVectorizer`, `TfidfVectorizer`  
    Estimators: `RandomForestClassifier`, `MultinomialNB`, `LogisticRegression`, `KNeighborsClassifier` (k-nearest neighbours), and support vector machines (SVM).
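
As a rough illustration of how a vectoriser and an estimator from the list above fit together, here is a minimal sketch that chains `TfidfVectorizer` and `LogisticRegression` in a scikit-learn `Pipeline`. The four titles and their labels are made up purely for illustration and are not taken from the scraped data; the choice of this particular pairing is also an assumption, not the project's final model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy titles and labels (illustration only, not real scraped posts)
titles = [
    'Area man heroically finishes entire sandwich',
    'Local cat elected head of neighbourhood watch',
    'City council approves budget for new library',
    'Man fined for parking on his own driveway',
]
labels = ['TheOnion', 'TheOnion', 'nottheonion', 'nottheonion']

# The vectoriser turns each title into TF-IDF features,
# and the estimator learns to separate the two subreddits.
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])
model.fit(titles, labels)

# Predict the subreddit of an unseen (also made-up) title
prediction = model.predict(['Area cat approves new budget'])[0]
print(prediction)
```

Swapping `TfidfVectorizer` for `CountVectorizer`, or `LogisticRegression` for another estimator in the list, only changes the corresponding step of the pipeline.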

## Data Collection

In [1]:
### Import libraries
import requests
import pandas as pd

## Extract data from subreddits

Pull about 5,000 posts from each subreddit and combine them into one dataframe.

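def get_pushshift_data(params):
    # query the Pushshift submission search endpoint and return the posts as a dataframe
    url = 'https://api.pushshift.io/reddit/search/submission'
    res = requests.get(url, params=params)
    res.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
    data = res.json()
    posts = data['data']
    df = pd.DataFrame(posts)
    return df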

In [2]:
# function to scrape posts from a subreddit, 100 posts per request
# `number` is the number of requests, so at most number * 100 posts are returned

def reddit_scrape(url, subreddit, utc, number):
    # search parameters for the first request
    params = {
        'subreddit': subreddit,
        'size': 100,
        'before': utc
    }
    res = requests.get(url, params=params)
    data = res.json()
    posts = data['data']
    df = pd.DataFrame(posts)
    
    # repeat the request (number - 1) more times, each time asking for
    # posts older than the last post collected so far
    for i in range(number-1):
        params_2 = {
            'subreddit': subreddit,
            'size': 100,
            'before': df.created_utc.iloc[-1]
        }
        res_2 = requests.get(url, params=params_2)
        data_2 = res_2.json()
        posts_2 = data_2['data']
        df = pd.concat([df, pd.DataFrame(posts_2)], axis=0, ignore_index=True)
    return df
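
The paging logic in `reddit_scrape` can be checked without touching the network. The sketch below separates the cursor loop from the HTTP call: `paginate_before` and `fake_fetch` are hypothetical names introduced here, and `fake_fetch` is a stand-in that assumes, like the function above, that the API returns posts newest first. It also stops early when a page comes back empty, which the original loop does not do.

```python
import time
from typing import Callable, Dict, List


def paginate_before(fetch: Callable[[dict], List[Dict]],
                    subreddit: str,
                    before: int,
                    max_requests: int,
                    page_size: int = 100,
                    pause_seconds: float = 0.0) -> List[Dict]:
    """Collect posts by repeatedly asking for the page older than the last post seen."""
    posts: List[Dict] = []
    for _ in range(max_requests):
        params = {'subreddit': subreddit, 'size': page_size, 'before': before}
        page = fetch(params)
        if not page:  # no older posts left, so stop early
            break
        posts.extend(page)
        before = page[-1]['created_utc']  # cursor: timestamp of the oldest post so far
        time.sleep(pause_seconds)  # optional pause between requests
    return posts


def fake_fetch(params: dict) -> List[Dict]:
    """Hypothetical stand-in for the API: posts with timestamps 1..250, newest first."""
    newest = min(params['before'] - 1, 250)
    oldest = max(newest - params['size'] + 1, 1)
    return [{'created_utc': t} for t in range(newest, oldest - 1, -1)]


collected = paginate_before(fake_fetch, 'TheOnion', before=251, max_requests=5)
print(len(collected))
```

With a real endpoint, `fetch` would wrap the `requests.get` call and return the `data` list from the JSON response.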

In [3]:
# scrape posts from the 1st subreddit - 'TheOnion' (fake news)
# time - before 2022-01-01 00:00:00 UTC
# number of posts - up to 5000 (50 requests x 100 posts)
subreddit_1 = reddit_scrape('https://api.pushshift.io/reddit/search/submission' ,'TheOnion',1640995200, 50)
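
The `before` value `1640995200` is a Unix timestamp in seconds. A small sketch using only the standard library shows how it maps to the cutoff date given in the comments; the variable names are illustrative.

```python
from datetime import datetime, timezone

# Build the cutoff as a timezone-aware datetime and convert it to epoch seconds
cutoff = datetime(2022, 1, 1, tzinfo=timezone.utc)
cutoff_epoch = int(cutoff.timestamp())
print(cutoff_epoch)  # 1640995200

# Convert the epoch value back to a readable UTC date to double-check it
readable = datetime.fromtimestamp(1640995200, tz=timezone.utc).isoformat()
print(readable)  # 2022-01-01T00:00:00+00:00
```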

In [4]:
subreddit_1.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,media_metadata,author_flair_background_color,author_flair_text_color,author_cakeday,steward_reports,removed_by,updated_utc,og_description,og_title,gilded
0,[],False,mothershipq,,[],,text,t2_4negm,False,False,...,,,,,,,,,,
1,[],False,-ImYourHuckleberry-,,[],,text,t2_g3p2c,False,False,...,,,,,,,,,,
2,[],False,dwaxe,,[],,text,t2_3jamc,False,False,...,,,,,,,,,,
3,[],False,dwaxe,,[],,text,t2_3jamc,False,False,...,,,,,,,,,,
4,[],False,dwaxe,,[],,text,t2_3jamc,False,False,...,,,,,,,,,,


In [5]:
subreddit_1.shape

(4996, 83)

In [6]:
# scrape posts from the 2nd subreddit - 'nottheonion' (real news)
# time - before 2022-01-01 00:00:00 UTC
# number of posts - up to 5000 (50 requests x 100 posts)
subreddit_2 =  reddit_scrape('https://api.pushshift.io/reddit/search/submission' ,'nottheonion', 1640995200, 50)

In [7]:
print(subreddit_2.head())
print(subreddit_2.shape)

  all_awardings  allow_live_comments               author  \
0            []                False          Taco_duck68   
1            []                False          BlackNingaa   
2            []                False   Lopsided_File_1642   
3            []                False    SkinnyWhiteGirl19   
4            []                False  kids-cake-and-crazy   

  author_flair_css_class author_flair_richtext author_flair_text  \
0                   None                    []              None   
1                   None                    []              None   
2                   None                    []              None   
3                   None                    []              None   
4                   None                    []              None   

  author_flair_type author_fullname  author_is_blocked author_patreon_flair  \
0              text     t2_bqrj5t0e              False                False   
1              text     t2_4dkbfmoo              False            

In [8]:
subreddit_2.columns #check the unique columns extracted

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_richtext', 'link_flair_text_color', 'link_flair_type',
       'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts',
       'over_18', 'parent_whitelist_status', 'permalink', 'pinned',
       'post_hint', 'preview', 'pwls', 'retrieved_on', 'score', 'selftext',
       'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subreddit_subscribers', 'subreddit_type', 'thumbnail',
   

In [9]:
# select the columns for analysis
subreddit_1 = subreddit_1[['title','selftext','subreddit']]
print(subreddit_1.head())

subreddit_2 = subreddit_2[['title','selftext','subreddit']]
print(subreddit_2.head())


                                               title selftext subreddit
0  Surgeon Kind Of Pissed Patient Seeing Her Defo...           TheOnion
1  McDonald’s blocked from building drive-through...           TheOnion
2  Gwyneth Paltrow Touts New Diamond-Encrusted Tr...           TheOnion
3  Artist Crafting Music Box Hopes It Delights At...           TheOnion
4  Homeowner Trying To Smoke Out Snakes Accidenta...           TheOnion
                                               title selftext    subreddit
0  Man attempts to pay for car with rap, steals p...           nottheonion
1  Former UFC fighter reveals past as sex worker ...           nottheonion
2                                  Log into Facebook           nottheonion
3  McDonald’s blocked from building drive-through...           nottheonion
4  Legendary actress Betty White dies at 99 on Ne...           nottheonion


In [10]:
# Combine the two subreddit dataframes into one
df = pd.concat([subreddit_1,subreddit_2], join='inner', ignore_index = True)
df.shape

(9993, 3)
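
The truncated previews above show what looks like the same headline ("McDonald’s blocked from building drive-through...") under both subreddits, so it may be worth checking for titles that carry both labels before modelling. The sketch below runs on a small made-up dataframe (`toy`, with placeholder headlines) rather than the scraped data, and the same calls would apply to `df`.

```python
import pandas as pd

# Made-up rows standing in for the combined dataframe (illustration only)
toy = pd.DataFrame({
    'title': ['Headline A', 'Headline B', 'Headline B', 'Headline C', 'Headline C'],
    'subreddit': ['TheOnion', 'TheOnion', 'nottheonion', 'nottheonion', 'nottheonion'],
})

# Titles that appear under more than one subreddit have an ambiguous label
labels_per_title = toy.groupby('title')['subreddit'].nunique()
ambiguous_titles = labels_per_title[labels_per_title > 1].index.tolist()
print(ambiguous_titles)

# Exact repeats of the same title within the same subreddit can be dropped
deduplicated = toy.drop_duplicates(subset=['title', 'subreddit'])
print(deduplicated.shape)
```

How to handle an ambiguous title (drop it, or keep one label) is a modelling decision this sketch does not make.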

In [11]:
#export to csv
df.to_csv(r'../datasets/reddit_post.csv', index = False)