# NLP - Classification and Sentiment Analysis of Reddit Posts

## Problem Statement

Apple and Google are two of today's biggest tech giants. The two companies compete on several fronts, including operating systems, browsers, app stores, and cloud computing ([source](https://www.marketingsociety.com/the-library/apple-v-google-they%E2%80%99re-rivals-many-ways-it%E2%80%99s-not-quite-death-match)).

The present project leverages natural language processing to analyze Google's and Apple's social media presence, in particular on Reddit. It attempts to understand what is discussed in the Google and Apple subreddits. The deliverables are a classification model that predicts whether a post comes from the Google or the Apple subreddit, and a sentiment analysis of users' posts. The insights should shed light on which products and services offered by either company are most talked about on Reddit and how well they are received by users, and can therefore inform future business decisions on product and service improvement.

## Part 1: Web API Data Collection

Part 1: Web API Data Collection <br>
Part 2: Exploratory Data Analysis <br>
Part 3: Baseline Classification Models and Zero Shot Classification  <br>
Part 4: PyCaret Classification Models <br>
Part 5: Sentiment Analysis <br>

---

In part 1, text data from two subreddits, 'Google' and 'Apple', will be collected using [Pushshift's](https://github.com/pushshift/api) API.
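
The extraction function in this part takes its cutoff as a Unix epoch timestamp (`epoch_time`, passed to the API as `before`). As a minimal sketch, assuming the cutoff is first written as a UTC datetime, the conversion in both directions could look like this. The helper names `to_epoch` and `from_epoch` are illustrative and not part of the original notebook.

```python
from datetime import datetime, timezone


def to_epoch(dt: datetime) -> int:
    """Return the Unix timestamp (whole seconds) of a timezone-aware datetime."""
    if dt.tzinfo is None:
        # Naive datetimes are interpreted in local time, which makes the cutoff ambiguous
        raise ValueError("expected a timezone-aware datetime")
    return int(dt.timestamp())


def from_epoch(ts: int) -> datetime:
    """Return the UTC datetime for a Unix timestamp."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# The cutoff value used in this notebook, written as a UTC datetime
cutoff = datetime(2022, 9, 30, 6, 57, 27, tzinfo=timezone.utc)
print(to_epoch(cutoff))                     # 1664521047
print(from_epoch(1664521047).isoformat())   # 2022-09-30T06:57:27+00:00
```

Converting the value back shows that both subreddits are collected backwards from 30 September 2022, 06:57:27 UTC.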

In [2]:
import requests
import pandas as pd

In [2]:
# Function to extract reddit posts
def extract_reddit_posts(epoch_time, subreddit, no_of_extractions):
    url = 'https://api.pushshift.io/reddit/search/submission'
    all_posts = []
    params = {'subreddit': subreddit, 'size': 250, 'before': epoch_time}

    # Repeat the request, moving the 'before' cursor back on each iteration
    for i in range(no_of_extractions):
        res = requests.get(url, params=params)

        # Raise error if connection to API fails
        if res.status_code != 200:
            raise ConnectionError(f'Request failed with status code {res.status_code}')

        # Collect posts and other subreddit data
        data = res.json()
        posts = data['data']

        # Stop early if nothing is returned (posts[-1] would fail on an empty list)
        if not posts:
            break
        all_posts += posts

        # Update the epoch time to the epoch time of the last post in the current batch
        params['before'] = posts[-1]['created_utc']
        print(f'Extraction {i+1} completed')

    # Convert all extracted data to dataframe
    df = pd.DataFrame(all_posts)

    return df
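
The function above stops with a `ConnectionError` on the first non-200 response, so a single transient failure ends a long run. One possible mitigation, sketched here under assumptions and not part of the original notebook, is to retry with an increasing delay. `call_with_retries` and `flaky_fetch` are hypothetical names, and the failing request is simulated so the example runs without network access.

```python
import time


def call_with_retries(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    fetch is any zero-argument callable that raises ConnectionError on failure.
    The wait doubles after each failed attempt (base_delay, 2 * base_delay, ...).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ConnectionError:
            # Give up and re-raise once the last allowed attempt has failed
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for an API request that fails twice before succeeding (simulated)
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated API failure")
    return {"data": []}


# Record the waits instead of actually sleeping, to keep the example fast
waits = []
result = call_with_retries(flaky_fetch, sleep=waits.append)
print(result, waits)
```

In the extraction loop, the request and status check could be wrapped in a small function and passed as `fetch`. Whether retrying is appropriate depends on the API's rate limits, which are not covered here.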

In [3]:
# Extract data from 'Apple' subreddit
apple_posts = extract_reddit_posts(epoch_time=1664521047, subreddit='apple', no_of_extractions=80)

Extraction 1 completed
Extraction 2 completed
Extraction 3 completed
Extraction 4 completed
Extraction 5 completed
Extraction 6 completed
Extraction 7 completed
Extraction 8 completed
Extraction 9 completed
Extraction 10 completed
Extraction 11 completed
Extraction 12 completed
Extraction 13 completed
Extraction 14 completed
Extraction 15 completed
Extraction 16 completed
Extraction 17 completed
Extraction 18 completed
Extraction 19 completed
Extraction 20 completed
Extraction 21 completed
Extraction 22 completed
Extraction 23 completed
Extraction 24 completed
Extraction 25 completed
Extraction 26 completed
Extraction 27 completed
Extraction 28 completed
Extraction 29 completed
Extraction 30 completed
Extraction 31 completed
Extraction 32 completed
Extraction 33 completed
Extraction 34 completed
Extraction 35 completed
Extraction 36 completed
Extraction 37 completed
Extraction 38 completed
Extraction 39 completed
Extraction 40 completed
Extraction 41 completed
Extraction 42 completed
...

In [4]:
# Extract data from 'Google' subreddit
google_posts = extract_reddit_posts(epoch_time=1664521047, subreddit='google', no_of_extractions=80)

Extraction 1 completed
Extraction 2 completed
Extraction 3 completed
Extraction 4 completed
Extraction 5 completed
Extraction 6 completed
Extraction 7 completed
Extraction 8 completed
Extraction 9 completed
Extraction 10 completed
Extraction 11 completed
Extraction 12 completed
Extraction 13 completed
Extraction 14 completed
Extraction 15 completed
Extraction 16 completed
Extraction 17 completed
Extraction 18 completed
Extraction 19 completed
Extraction 20 completed
Extraction 21 completed
Extraction 22 completed
Extraction 23 completed
Extraction 24 completed
Extraction 25 completed
Extraction 26 completed
Extraction 27 completed
Extraction 28 completed
Extraction 29 completed
Extraction 30 completed
Extraction 31 completed
Extraction 32 completed
Extraction 33 completed
Extraction 34 completed
Extraction 35 completed
Extraction 36 completed
Extraction 37 completed
Extraction 38 completed
Extraction 39 completed
Extraction 40 completed
Extraction 41 completed
Extraction 42 completed
...

In [5]:
# Export data
apple_posts.to_csv('Scraped Datasets/apple_posts',index=False)
google_posts.to_csv('Scraped Datasets/google_posts',index=False)
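
`DataFrame.to_csv` does not create missing parent folders, so the export cell assumes that a `Scraped Datasets` directory already exists next to the notebook. A minimal sketch of creating it first is shown below. The helper `ensure_output_dir` is a hypothetical name, and a temporary directory stands in for the project folder so the example has no side effects.

```python
from pathlib import Path
import tempfile


def ensure_output_dir(base, name="Scraped Datasets"):
    """Create the output folder under base if it is missing and return its path."""
    out_dir = Path(base) / name
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


# A temporary directory stands in for the project folder in this sketch
with tempfile.TemporaryDirectory() as tmp:
    out_dir = ensure_output_dir(tmp)
    target = out_dir / "apple_posts"
    dir_created = out_dir.is_dir()
    # In the notebook, the export would then be:
    # apple_posts.to_csv(target, index=False)

print(dir_created, target.name)
```

In the notebook itself, `base` would be the working directory, for example `ensure_output_dir('.')`.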