<img src="https://i.imgur.com/EGtMXKh.jpg?1" style="float: left; margin: 20px; height: 55px">

# Project 3: Natural Language Processing Challenge (Part 1)

---

## Problem Statement

Our company is building a new wellness and mindfulness app to promote better mental health. The app aims to provide a thought of the day based on philosophies such as Buddhism and Stoicism. As a start, every day the app will let the user write one sentence in response to a prompt on a given topic. The app will then assess whether the user's response is more in line with Buddhist or Stoic philosophy, and return either Buddhist or Stoic advice accordingly.

To help the app learn which kinds of writing align more closely with Buddhist or Stoic philosophy, we will use posts from the subreddits r/Buddhism and r/Stoicism to train a model that classifies a post by the subreddit it came from.

---

# Background of the Project

<b> Buddhism </b>

Buddhism is one of the world’s major religions. It originated in India with Siddhartha Gautama, whose life is traditionally dated to 563–483 B.C.E., and over the following millennia it spread across Asia and the rest of the world. In summary, Buddhists believe that human life is a cycle of suffering and rebirth, but that by achieving a state of enlightenment (nirvana) one can escape this cycle forever. The Buddha taught the Four Noble Truths. The first truth, “Suffering (dukkha),” teaches that everyone in life suffers in some way. The second truth, “Origin of suffering (samudaya),” states that all suffering comes from desire (tanhā). The third truth, “Cessation of suffering (nirodha),” says that it is possible to stop suffering and achieve enlightenment. The fourth truth, “Path to the cessation of suffering (magga),” describes the Middle Way, the set of steps that lead to enlightenment.

<b> Stoicism </b>

Stoicism is a school of Hellenistic philosophy founded by Zeno of Citium in Athens in the early 3rd century BC. It is a philosophy of personal eudaemonic virtue ethics informed by its system of logic and its views on the natural world, asserting that the practice of virtue is both necessary and sufficient to achieve eudaimonia—flourishing by means of living an ethical life. The Stoics identified the path to eudaimonia with a life spent practicing the cardinal virtues and living in accordance with nature. Stoicism teaches the development of self-control and fortitude as a means of overcoming destructive emotions; the philosophy holds that becoming a clear and unbiased thinker allows one to understand the universal reason (logos). Stoicism's primary aspect involves improving the individual's ethical and moral well-being: "Virtue consists in a will that is in agreement with Nature". This principle also applies to the realm of interpersonal relationships; "to be free from anger, envy, and jealousy", and to accept even slaves as "equals of other men, because all men alike are products of nature".


<b> App Development </b>

Buddhism and Stoicism have much in common, and both help people live better lives. Both deal with suffering by understanding its role in life and making peace with it: a Buddhist eliminates suffering by detaching from desire, while a Stoic eliminates suffering by being indifferent to external events. Both also regard meditation or mindfulness as beneficial practices. Studies have also shown that mindfulness and meditation benefit a person's mental health. Hence, we hope that users will be able to follow whichever philosophy they align with and improve their mental wellness.

To get a sense of the types of people and topics associated with each philosophy, we will use the Pushshift API for Reddit to collect posts from the r/Buddhism and r/Stoicism subreddits and apply machine learning so that the app can distinguish the similarities and differences between the users of the two subreddits. As a start, we aim to build a classification model and a topic modelling system.




---

# Import Libraries

We first import the libraries needed to collect the data through the Pushshift API for Reddit.


In [1]:
#Import Libraries
import requests
import pandas as pd
import time
from random import randint



In [7]:
#Set Pandas Options:

pd.set_option("display.max_columns", 500)
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_colwidth", None)

Here, we define a function that pulls posts from each subreddit. We request 100 pages of 100 posts each, which gives about 10,000 posts per subreddit created before 1 June 2022. Fixing the end date keeps the dataset consistent whenever we re-run the scrape. From each post we keep only the subreddit name, title and selftext, and store them in a dataframe.
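
The cut-off passed to the scraper is a Unix timestamp. As a minimal sketch, assuming the cut-off means midnight UTC on 1 June 2022 (the variable name `end_date` is only illustrative), the value can be derived with the standard library:

```python
from datetime import datetime, timezone

# Midnight UTC on 1 June 2022, expressed as seconds since the Unix epoch
end_date = int(datetime(2022, 6, 1, tzinfo=timezone.utc).timestamp())
print(end_date)  # 1654041600
```
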

In [9]:
# Function to webscrape data
# Note: the name `dict` shadows the Python built-in; later cells reference it
dict = {}
# URL to pull reddit submissions
submission_url = "https://api.pushshift.io/reddit/search/submission"

def webscrape(subreddit, size, end_date):
    """Pull up to 100 pages of `size` posts each, walking backwards in time from `end_date`."""
    before = end_date
    for i in range(1, 101):
        params = {"subreddit": subreddit, "size": size, "before": before}
        res = requests.get(submission_url, params=params)
        res.raise_for_status()  # Stop on an HTTP error instead of parsing a bad response
        time.sleep(randint(5, 10))  # Random pause between requests to go easy on the API
        dat = res.json()["data"]
        if not dat:  # No older posts left to fetch
            break
        # The oldest post on this page becomes the cut-off for the next request
        before = dat[-1]["created_utc"]
        # Keep only the subreddit, title and selftext columns
        dict[subreddit + str(i)] = pd.DataFrame(
            dat, columns=["subreddit", "title", "selftext"])
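
The paging pattern used by `webscrape` (take the oldest timestamp on each page as the next `before` value) can be tried offline. The sketch below is only an illustration: `fake_fetch` and `paginate` are hypothetical helpers, and the ten made-up posts stand in for API responses.

```python
def fake_fetch(before, size):
    """Hypothetical stand-in for the API: up to `size` posts older than `before`, newest first."""
    all_posts = [{"created_utc": t, "title": f"post {t}"} for t in range(10, 0, -1)]
    older = [p for p in all_posts if p["created_utc"] < before]
    return older[:size]

def paginate(fetch, end_date, size, max_pages):
    """Walk backwards in time, using the oldest timestamp of each page as the next cut-off."""
    pages, before = [], end_date
    for _ in range(max_pages):
        page = fetch(before, size)
        if not page:  # nothing older is left
            break
        pages.append(page)
        before = page[-1]["created_utc"]
    return pages

pages = paginate(fake_fetch, end_date=11, size=4, max_pages=5)
print([len(p) for p in pages])  # [4, 4, 2]
```

The last page is shorter than `size`, and the loop stops once a request returns nothing, which is one way a scrape can end up with slightly fewer rows than requested.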


In [10]:
#Webscrape 10,000 posts from Buddhism and Stoicism Subreddit each
#1654041600 is 1 June 2022, 00:00:00 UTC
webscrape('Buddhism', 100, 1654041600)

webscrape('Stoicism', 100, 1654041600)

dict.keys()

dict_keys(['Buddhism1', 'Buddhism2', 'Buddhism3', 'Buddhism4', 'Buddhism5', 'Buddhism6', 'Buddhism7', 'Buddhism8', 'Buddhism9', 'Buddhism10', 'Buddhism11', 'Buddhism12', 'Buddhism13', 'Buddhism14', 'Buddhism15', 'Buddhism16', 'Buddhism17', 'Buddhism18', 'Buddhism19', 'Buddhism20', 'Buddhism21', 'Buddhism22', 'Buddhism23', 'Buddhism24', 'Buddhism25', 'Buddhism26', 'Buddhism27', 'Buddhism28', 'Buddhism29', 'Buddhism30', 'Buddhism31', 'Buddhism32', 'Buddhism33', 'Buddhism34', 'Buddhism35', 'Buddhism36', 'Buddhism37', 'Buddhism38', 'Buddhism39', 'Buddhism40', 'Buddhism41', 'Buddhism42', 'Buddhism43', 'Buddhism44', 'Buddhism45', 'Buddhism46', 'Buddhism47', 'Buddhism48', 'Buddhism49', 'Buddhism50', 'Buddhism51', 'Buddhism52', 'Buddhism53', 'Buddhism54', 'Buddhism55', 'Buddhism56', 'Buddhism57', 'Buddhism58', 'Buddhism59', 'Buddhism60', 'Buddhism61', 'Buddhism62', 'Buddhism63', 'Buddhism64', 'Buddhism65', 'Buddhism66', 'Buddhism67', 'Buddhism68', 'Buddhism69', 'Buddhism70', 'Buddhism71', 'Bud

In [11]:
#Subreddit_posts dataframe
subreddit_posts = pd.concat(dict.values(), ignore_index=True)
subreddit_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19979 entries, 0 to 19978
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  19979 non-null  object
 1   title      19979 non-null  object
 2   selftext   19972 non-null  object
dtypes: object(3)
memory usage: 468.4+ KB
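
The `info()` output shows 19,979 rows rather than the 20,000 requested, and seven missing `selftext` values. One way to see how these split across the two subreddits is sketched below on a small made-up frame (the rows are illustrative, not scraped data); the same calls could be run on `subreddit_posts`.

```python
import pandas as pd

# Made-up rows standing in for subreddit_posts
toy = pd.DataFrame({
    "subreddit": ["Buddhism", "Buddhism", "Stoicism"],
    "title": ["a", "b", "c"],
    "selftext": ["text", None, "more text"],
})

# Number of posts collected per subreddit
counts = toy["subreddit"].value_counts()

# Number of missing selftext values per subreddit
missing = toy["selftext"].isna().groupby(toy["subreddit"]).sum()

print(counts)
print(missing)
```
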


We then save the data to a CSV file for EDA and modelling in a separate notebook.

In [12]:
#Save the combined dataframe to CSV

subreddit_posts.to_csv("subreddit_posts.csv", index=False)
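
When the CSV is read back in, both missing values and empty strings come back as NaN under the default `pd.read_csv` settings, so the null count seen later may differ from the one above. A small sketch with an in-memory buffer and made-up rows illustrates this:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "subreddit": ["Buddhism", "Stoicism"],
    "title": ["a", "b"],
    "selftext": ["", None],  # one empty string, one missing value
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored["selftext"].isna().sum())  # 2
```
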