# Project 3 : Web APIs & NLP
----------

## Problem Statement

Men In Black (MIB) is a secret association that keeps top-secret information about extraterrestrials under wraps. Some of this information is believed to have leaked onto the internet via Reddit, mostly to the aliens and space subreddits. MIB does not want to shut down Reddit or delete the subreddits, as doing so would raise suspicion. Instead, it would like to monitor what people are talking about. As a data science team working for MIB, we have been asked to explore and develop a model that accurately classifies Reddit posts and identifies the key words that differentiate "Aliens" posts from "Space" posts. Besides predicting the classification, the MIB Board of Directors would also like to know the latest trends, not only to get an idea of what's hot, but also to feed that information to other departments for efforts such as social media marketing, events and even podcasts. The most recent data is therefore collected, working backwards from 1st January 2022.

Three classification models (Logistic Regression, Naive Bayes and Random Forest) will be developed to address the problem statement. The performance of each model will be assessed by its accuracy and F1-score on unseen test data.
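
As a reminder of what the two metrics measure, here is a minimal sketch in plain Python. The function names and the six labels are made up for illustration and are not project data; in practice a library implementation would normally be used instead.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for six posts (illustration only)
y_true = ['aliens', 'aliens', 'aliens', 'space', 'space', 'space']
y_pred = ['aliens', 'aliens', 'space', 'space', 'space', 'aliens']

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred, positive='aliens')
print(acc, f1)
```

Accuracy counts every correct prediction equally, while the F1-score balances false positives against false negatives for one class, which is why the two are reported together.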

## Background
Are there aliens out there? The real starting point for UFO speculation, and government involvement, dates to Roswell, New Mexico, in July 1947. Project Blue Book was a United States Air Force program that investigated UFO sightings from March 1952 until its termination on December 17, 1969. A U.S. government report ([*source*](https://www.dni.gov/files/ODNI/documents/assessments/Prelimary-Assessment-UAP-20210625.pdf)) on UFOs found no evidence of aliens but acknowledged 143 reports of "unidentified aerial phenomena" since 2004 that could not be explained. The report was released on 25th June 2021 by the Office of the Director of National Intelligence with substantial input from the military. The study is part of the most significant public effort so far to deal with decades of speculation, rumor and unhinged conspiracy theories about UFOs. 

Some of the most intriguing cases come from Navy pilots who reported seeing UFOs and filming them off the East Coast of the U.S. over a period of months in 2014 and 2015. The pilots, including some who have spoken publicly, say the mystery objects moved with exceptional speed, agility and acceleration that they had never seen before. And in some incidents, the pilots said the objects went underwater.

## Contents:
### Part 1 Data Collection

1. [Importing Libraries](#1.-Importing-Libraries)
2. [Data Collection](#2.-Data-Collection)

--------

## 1. Importing Libraries

In [48]:
import requests
import re
import time
import pandas as pd

## 2. Data Collection

### 2.1 Define functions to collect data

In [49]:
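url = 'https://api.pushshift.io/reddit/search/submission'

def get_posts(subreddit, n_iters):
    """Collect up to n_iters * 100 submissions from a subreddit, paging backwards in time."""
    frames = []
    # Cutoff in Unix epoch seconds: 1640908800 is 2021-12-31 00:00:00 UTC,
    # so only submissions made before 1st January 2022 are fetched
    current_time = 1640908800

    for i in range(n_iters):
        params = {'subreddit' : subreddit, 'size' : 100, 'before' : current_time}

        res = requests.get(url, params = params)
        print(f'Iteration {i + 1}: Status Code {res.status_code}')
        if res.status_code != 200:
            # Stop early, but keep whatever has been collected so far
            print(f'Error! Status code: {res.status_code}')
            break
        posts = res.json()['data']
        if not posts:
            # No older submissions are left to fetch
            break
        df = pd.DataFrame(posts)
        # The last row is taken to be the oldest post in the batch,
        # so the next request asks for posts before it
        current_time = df['created_utc'].iloc[-1]
        frames.append(df)
        time.sleep(5) # Pause between requests to avoid overloading the API
    return pd.concat(frames).reset_index(drop = True) if frames else pd.DataFrame()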
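
The starting value of `current_time` is a Unix epoch timestamp in seconds. A quick way to check which date such a value represents is the standard `datetime` module. The sketch below only inspects the number used in `get_posts`; the variable names are chosen for illustration.

```python
from datetime import datetime, timezone

# Cutoff value used in get_posts
cutoff = 1640908800

# Interpret the epoch seconds as a UTC datetime
cutoff_utc = datetime.fromtimestamp(cutoff, tz=timezone.utc)
print(cutoff_utc.isoformat())

# Epoch seconds for midnight UTC on 1st January 2022, for comparison
jan_first = int(datetime(2022, 1, 1, tzinfo=timezone.utc).timestamp())
print(jan_first, jan_first - cutoff)
```

The cutoff falls exactly one day (86,400 seconds) before midnight UTC on 1st January 2022, so posts made on 31st December 2021 (UTC) are not part of the collection.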

### 2.2 Aliens : 1,000 Submissions

In [50]:
# Scrape 1,000 submissions from r/aliens and convert to dataframe

df_aliens = get_posts('aliens', 10)

Iteration 1: Status Code 200
Iteration 2: Status Code 200
Iteration 3: Status Code 200
Iteration 4: Status Code 200
Iteration 5: Status Code 200
Iteration 6: Status Code 200
Iteration 7: Status Code 200
Iteration 8: Status Code 200
Iteration 9: Status Code 200
Iteration 10: Status Code 200
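
Each iteration above requests up to 100 submissions older than the oldest one in the previous batch. The cursor logic can be sketched offline with a stand-in for the API. `fake_fetch` and `collect` are hypothetical helpers written for this illustration, and the sketch assumes that results come back newest first and that `before` excludes the cursor timestamp itself.

```python
def fake_fetch(posts, before, size):
    """Hypothetical stand-in for the API: newest-first posts older than `before`."""
    older = [p for p in posts if p['created_utc'] < before]
    older.sort(key=lambda p: p['created_utc'], reverse=True)
    return older[:size]


def collect(posts, start, n_iters, size):
    """Page backwards in time, moving the cursor to the oldest post of each batch."""
    collected = []
    cursor = start
    for _ in range(n_iters):
        batch = fake_fetch(posts, cursor, size)
        if not batch:
            break  # nothing older is left
        collected.extend(batch)
        cursor = batch[-1]['created_utc']
    return collected


# Ten made-up posts with timestamps 1..10 (illustration only)
all_posts = [{'id': i, 'created_utc': i} for i in range(1, 11)]
result = collect(all_posts, start=9, n_iters=3, size=3)
ids = [p['id'] for p in result]
print(ids)
```

Under these assumptions no post is fetched twice, although posts that share the exact timestamp of a batch boundary could be skipped.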


In [51]:
# Check the first 5 rows of aliens dataframe

df_aliens.head()

| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | url_overridden_by_dest | is_gallery | author_cakeday | media_metadata | author_flair_background_color | author_flair_text_color | author_flair_template_id | edited | banned_by | gilded |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | [] | False | mackmick86 | | [] | | text | t2_gk07nnr6 | False | False | ... | | | | | | | | | | |
| 1 | [] | False | Tenshi-Hinanawi | | [] | | text | t2_emuxzm2y | False | False | ... | | | | | | | | | | |
| 2 | [] | False | Outrageous_Resist447 | | [] | | text | t2_daj28e5r | False | False | ... | https://youtube.com/watch?v=weZ-a7ZnQS4&amp;fe... | | | | | | | | | |
| 3 | [] | False | Everystuffcreeker | | [] | | text | t2_cq8w74j8 | False | False | ... | | | | | | | | | | |
| 4 | [] | False | mackmick86 | | [] | | text | t2_gk07nnr6 | False | False | ... | | | | | | | | | | |


In [52]:
# Check the shape of aliens dataframe

df_aliens.shape

(1000, 81)

In [53]:
# Export aliens submission dataframe to a csv

df_aliens.to_csv('../datasets/aliens_sub.csv', index = False)

### 2.3 Space : 1,000 Submissions

In [54]:
# Scrape 1,000 submissions from r/space and convert to dataframe

df_space = get_posts('space', 10)

Iteration 1: Status Code 200
Iteration 2: Status Code 200
Iteration 3: Status Code 200
Iteration 4: Status Code 200
Iteration 5: Status Code 200
Iteration 6: Status Code 200
Iteration 7: Status Code 200
Iteration 8: Status Code 200
Iteration 9: Status Code 200
Iteration 10: Status Code 200


In [55]:
# Check the first 5 rows of space dataframe

df_space.head()

| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | secure_media | secure_media_embed | gallery_data | is_gallery | media_metadata | author_flair_background_color | author_flair_text_color | author_cakeday | suggested_sort | link_flair_template_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | [] | False | TheOGSyphonZero | | [] | | text | t2_1l6cx8pb | False | False | ... | | | | | | | | | | |
| 1 | [] | False | DukkyDrake | | [] | | text | t2_11lmay | False | False | ... | | | | | | | | | | |
| 2 | [] | False | No_Landscape7074 | | [] | | text | t2_bd8j6xo7 | False | False | ... | | | | | | | | | | |
| 3 | [] | False | malcolm58 | | [] | | text | t2_w8l39 | False | False | ... | | | | | | | | | | |
| 4 | [] | False | S077Y | | [] | | text | t2_9xab4jvg | False | False | ... | | | | | | | | | | |


In [56]:
# Check the shape of space dataframe

df_space.shape

(1000, 79)
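
The two dataframes have different column counts (81 for r/aliens, 79 for r/space), so they do not share every column. If they are later stacked into one dataset, one option is to keep only the columns present in both. The sketch below uses two small made-up frames standing in for `df_aliens` and `df_space`; the column names are chosen for illustration and do not claim to show which columns actually differ.

```python
import pandas as pd

# Two small made-up frames standing in for df_aliens and df_space
aliens = pd.DataFrame({
    'title': ['a1', 'a2'],
    'subreddit': ['aliens', 'aliens'],
    'edited': [None, None],
})
space = pd.DataFrame({
    'title': ['s1'],
    'subreddit': ['space'],
    'suggested_sort': [None],
})

# Columns present in both frames, kept in the order of the first frame
shared = [col for col in aliens.columns if col in space.columns]

# Stack the frames on the shared columns only
combined = pd.concat([aliens[shared], space[shared]]).reset_index(drop=True)
print(shared)
print(combined.shape)
```

Concatenating without this step would also work, but pandas would then fill the columns missing from one frame with `NaN`.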

In [57]:
# Export space submission dataframe to a csv

df_space.to_csv('../datasets/space_sub.csv', index = False)