# Problem Statement

### Goal of the project

Men in Black (MIB) is a secret association that keeps top-secret information about extraterrestrials under wraps. Think "Area 51" and the "Roswell incident".
But nothing is foolproof, and information leaks out anyway.

Some of this information leaks onto the internet via Reddit, mostly in the "Aliens" and "Space" subreddits.

MIB does not want to shut down Reddit completely or delete the subreddits. Instead, they would like to monitor what people are talking about.

The goal of the project is to use machine learning to accurately predict whether a post in "Space" or any other subreddit belongs to "Aliens".

Besides classifying posts, they would also like to know the latest trends, not only to see what is popular but also to feed that information to their other departments, for example for marketing on social media, events and even podcasts.

Future work includes sentiment analysis and topic modelling, which will allow MIB to better understand what people are posting and to monitor it effectively.

### Type of model that will be developed

Four classification models will be developed for this project: Naive Bayes, Logistic Regression, Random Forest and CatBoostClassifier.

### Success evaluation

Success will be evaluated on unseen test data, based on each model's precision (whether posts are classified correctly) and its F1 score.
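As a minimal sketch of how these two metrics are computed, the function below derives precision, recall and F1 from raw counts of true positives, false positives and false negatives. The function name and the example counts are hypothetical, and "Aliens" is assumed to be the positive class.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Compute precision, recall and F1 for the positive class from raw counts."""
    # Precision: of the posts predicted as positive, how many really are positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly positive posts, how many were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts, treating "Aliens" as the positive class
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Precision alone can look good while many "Aliens" posts are missed, which is why F1 is reported alongside it.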

### Scope of the project

"1. Using [Pushshift's](https://github.com/pushshift/api) API, you'll collect posts from two subreddits of your choosing." <br>
"2. You'll then use NLP to train a classifier on which subreddit a given post came from. This is a binary classification problem."

##### To fulfill the above requirements, the project will use Pushshift's API through a wrapper called "pmaw" to collect posts from two subreddits of our choosing: "Aliens" and "Space".

##### We will then use NLP techniques such as tokenization, stemming and lemmatization, together with classification models, to train a binary classifier.

##### Four classifier models (Naive Bayes, Logistic Regression, Random Forest and CatBoostClassifier) will be compared to fulfill the requirements.

##### We will then assess each model on its precision and F1 score to determine how accurately it classifies posts.
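To illustrate the tokenization step named in the scope above, here is a small standard-library sketch that lowercases post titles, splits them into word tokens and counts the most frequent ones. The stop-word list, the function name and the example titles are made up for illustration. Stemming and lemmatization are left out because they need an NLP library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real project would use a fuller one
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to", "i"}


def tokenize(text: str) -> list:
    """Lowercase the text and return its alphabetic tokens, minus stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


# Hypothetical post titles
titles = ["I saw a UFO in the sky", "The UFO sighting of 1947"]
counts = Counter(token for title in titles for token in tokenize(title))
print(counts.most_common(3))
```

Token counts like these are also a simple starting point for spotting the trending keywords mentioned in the project goal.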

### Stakeholders and why is it important to investigate

Men In Black (MIB) is a government association that researches & promotes extraterrestrial knowledge.

To fulfill their short term goals (this project):
1. Classify posts accurately, so that suitable posts can be collected for analysis.
2. Obtain the keywords the classifier relies on to classify these posts, to inform MIB's marketing and research efforts.

To fulfill their long term goals (future work):
1. Sentiment analysis of posts about possible extraterrestrials on Earth, to determine whether it is suitable to release information that might cause panic among citizens.
2. Understand what ordinary citizens think about extraterrestrial activities on Earth.
3. Topic modelling to understand what topics ordinary citizens are discussing about aliens, in order to: <br>
    a. Market products <br>
    b. Push their agenda (whether it is necessary to block or release sensitive information) <br>
    c. Gauge whether the public is getting closer to the truth about aliens

### Primary and Secondary stakeholders

1. Men in Black (MIB)
2. Media groups associated with MIB

MIB is the primary stakeholder. Media groups are secondary stakeholders because they are closely linked to the purpose of this classification project and to the future work stated above.

## Executive Summary

Part 1:
1. Problem statement
2. Scope of project
3. Import relevant libraries
4. Web scraping through the Pushshift API and the "pmaw" wrapper, with consideration for the load on the server.
5. Scrape "Aliens" subreddit and "Space" subreddit.

# Data Collection

## Was enough data gathered to generate significant results?

At least 2,000 usable posts from each subreddit must remain after data cleaning for the model to generalize well and classify future unseen data.

Hence, 10,000 posts will be scraped from each subreddit, deliberately more than needed, to allow for empty, deleted or otherwise unusable posts.
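The check against the 2,000-post threshold could look roughly like the sketch below. It is only an illustration: the definition of "usable" (a non-empty title and a selftext that is not a removal placeholder), the marker strings and the sample posts are assumptions, not the cleaning rules used later in the project.

```python
MIN_USABLE_POSTS = 2000  # threshold stated above
REMOVED_MARKERS = {"[removed]", "[deleted]"}  # assumed placeholder strings


def is_usable(post: dict) -> bool:
    """Assumed rule: keep posts with a title whose body was not removed or deleted."""
    title = (post.get("title") or "").strip()
    selftext = (post.get("selftext") or "").strip()
    return bool(title) and selftext.lower() not in REMOVED_MARKERS


def count_usable(posts: list) -> int:
    """Count how many posts pass the usability rule."""
    return sum(1 for post in posts if is_usable(post))


# Hypothetical sample of scraped posts
sample = [
    {"title": "Strange lights", "selftext": "Saw them last night"},
    {"title": "Gone", "selftext": "[removed]"},
    {"title": "", "selftext": ""},
    {"title": "Link post", "selftext": None},
]
usable = count_usable(sample)
print(f"{usable} usable posts, enough data: {usable >= MIN_USABLE_POSTS}")
```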

## Relevance to the project

Data will be collected from each subreddit. Only the "selftext", "title" and "subreddit" fields will be kept, as these are relevant to the project and useful for training the classification model.

# 3. Import libraries and setup conditions

### Import libraries

### Data collection and storage was optimized through custom functions, pipelines, and/or automation:

The "pmaw" wrapper automates post collection beyond Pushshift's limit of 100 posts per request and provides parameters that limit the load on the server receiving the requests.

In [12]:
import requests
import pandas as pd
from pmaw import PushshiftAPI
from pandas import DataFrame as df

### Use PushshiftAPI to scrape beyond the 100-post limit

In [13]:
api = PushshiftAPI()

### Setting the date range for scraping

In [14]:
import datetime as dt
before = int(dt.datetime(2022,1,8,0,0).timestamp())
after = int(dt.datetime(2019,12,1,0,0).timestamp())
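Calling `timestamp()` on a naive `datetime`, as above, interprets the date in the local timezone of the machine running the notebook, so the exact bounds shift from one machine to another. A hedged alternative, assuming UTC bounds are wanted, is to attach the timezone explicitly. The helper name `utc_epoch` is introduced here for illustration.

```python
import datetime as dt


def utc_epoch(year: int, month: int, day: int) -> int:
    """Return the Unix timestamp for midnight UTC on the given date."""
    moment = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(moment.timestamp())


before = utc_epoch(2022, 1, 8)   # upper bound of the scrape window
after = utc_epoch(2019, 12, 1)   # lower bound of the scrape window
print(before, after)
```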

# 4. Scrape Aliens subreddit posts

### Setting scrape limit to 10,000

### Thought given to the server receiving the requests such as considering number of requests per second

### By setting the following parameters:

#### "rate_limit=60, max_sleep=10"

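These parameters are set out of consideration for the server receiving the requests. `rate_limit=60` targets an average of at most 60 requests per minute, and `max_sleep=10` caps the sleep between requests (each request returns up to 100 posts) at 10 seconds. This keeps us from flooding the server and reduces the chance of our requests being rejected as a possible DDoS attack. Note that pmaw's README lists `rate_limit` and `max_sleep` as arguments of the `PushshiftAPI(...)` constructor, so it is worth confirming that they take effect when passed to `search_submissions` as done here.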
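The idea behind rate averaging with a capped sleep can be sketched as follows. This is an illustration of the concept only, not pmaw's actual implementation. The function name and its arguments are hypothetical.

```python
def delay_before_next_request(requests_made: int, elapsed_s: float,
                              rate_limit: int = 60, max_sleep: float = 10.0) -> float:
    """Seconds to wait so the average rate stays at or below rate_limit per minute."""
    # Time that should have elapsed if requests were spaced evenly
    required_elapsed = requests_made * 60.0 / rate_limit
    # Wait only if we are ahead of schedule, and never longer than max_sleep
    wait = max(0.0, required_elapsed - elapsed_s)
    return min(wait, max_sleep)


# 100 requests in 95 s at 60 requests/minute: 5 s ahead of schedule
print(delay_before_next_request(100, 95.0))
```

Capping the wait with `max_sleep` trades strict adherence to the target rate for a bounded delay between requests.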

In [15]:
subreddit="aliens"
limit=10000
aliens_posts = api.search_submissions(subreddit=subreddit, limit=limit, before=before, after=after, rate_limit=60, max_sleep=10)
print(f'Retrieved {len(aliens_posts)} posts from Pushshift')

INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 88.00% - Requests: 100 - Batches: 10 - Items Remaining: 1201
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 87.07% - Requests: 116 - Batches: 12 - Items Remaining: 0
Retrieved 10000 posts from Pushshift


In [16]:
aliens_posts = pd.DataFrame(aliens_posts)

In [17]:
aliens_posts_df = aliens_posts[['subreddit', 'selftext', 'title']]

### Output Aliens posts to csv

In [18]:
aliens_posts_df.to_csv('./datasets/aliens_posts.csv', index=False)

# 5. Scrape Space subreddit posts

### Setting scrape limit to 10,000

In [19]:
subreddit="space"
limit=10000
space_posts = api.search_submissions(subreddit=subreddit, limit=limit, before=before, after=after, rate_limit=60, max_sleep=10)
print(f'Retrieved {len(space_posts)} posts from Pushshift')

INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 89.00% - Requests: 100 - Batches: 10 - Items Remaining: 1108
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 88.60% - Requests: 114 - Batches: 12 - Items Remaining: 0
Retrieved 10000 posts from Pushshift


In [20]:
space_posts = pd.DataFrame(space_posts)

In [21]:
space_posts_df = space_posts[['subreddit', 'selftext', 'title']]

### Output Space posts to csv

In [22]:
space_posts_df.to_csv('./datasets/space_posts.csv', index=False)

# 6. Project will be continued in proj3_part2