## Project 3 - Subreddit Classification using NLP

### I: Webscraping from Reddit using PRAW API

#### Introduction
**Reddit** is a network of communities where people can post comments and discuss just about any topic under the sun. Each community is known as a Subreddit.
**AMD** and **Nvidia** are two major competitors in the graphics card market. AMD is widely regarded as the more "budget" option, and the market perception is therefore that its products are of lower quality. This project uses the AMD and Nvidia subreddit communities as its data source.

- Stakeholders: AMD Strategy Team (Consumer Products)
- Who we are: Consultants engaged to help boost AMD sales

#### Problem Statement
AMD has hired our team as consultants to analyse how to improve sales of its products. It is unclear whether the perception of AMD products is driven by a *marketing* or an *engineering* issue. To understand market sentiment towards AMD products, we have decided to use Reddit as one of the data sources, as it is an uncensored forum.

#### Objective
1. Classify whether a comment comes from the AMD or the Nvidia subreddit community. This is a binary classification task: determine which subreddit a piece of text came from based on its content. The aim is to ensure that future deployment of the model is possible and scalable.
2. On top of the classification task in point 1, explore sentiment analysis to understand the differences between the two subreddits, so that we can do a mini root cause analysis of why AMD's sales are lower than Nvidia's. We can then advise the AMD team on which department (i.e. engineering or marketing) to deploy more resources into.

#### Success Metrics
1. F1 Score
2. ROC-AUC Score

We chose these metrics because neither error type matters more than the other here: we are not trying to minimize false negatives at the expense of false positives, or vice versa. The ideal scenario is to minimize both.
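To make the two metrics concrete, here is a minimal pure-Python sketch of how each is computed. The function names, the toy labels, and the convention that AMD is the positive class (1) and Nvidia the negative class (0) are assumptions for illustration only; in practice a library such as scikit-learn provides `f1_score` and `roc_auc_score`.

```python
def f1_from_labels(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc_from_scores(y_true, scores):
    """ROC-AUC = probability that a random positive outscores a random negative
    (ties count as half a win)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 1 = AMD comment, 0 = Nvidia comment
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]                 # hard predictions
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]     # predicted probability of AMD

f1 = f1_from_labels(y_true, y_pred)
auc = roc_auc_from_scores(y_true, scores)
```

F1 penalises false positives and false negatives symmetrically, while ROC-AUC measures how well the predicted scores rank the two classes regardless of the decision threshold.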

#### Workflow:
- Notebook 1: Webscraping from Reddit using PRAW API
- Notebook 2: EDA, Preprocessing and Sentiment Analysis
- Notebook 3: Modelling and Tuning

In [2]:
import requests
import time
import pandas as pd
import datetime

#### Webscraping Using PRAW API
I have opted to use PRAW (the Python Reddit API Wrapper) because 1) it is a widely used Python wrapper around Reddit's official API, so it provides reliable access to Reddit's data and functionality, and 2) it is well documented, making it easy to use.

In [None]:
# pip install praw
# https://www.youtube.com/watch?v=NRgfgtzIhBQ&t=530
# https://praw.readthedocs.io/en/latest/code_overview/models/submission.html
import praw

In [None]:
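# Read the API credentials from environment variables instead of hard-coding them,
# so that secrets are never committed with the notebook.
# (The environment variable names below are illustrative.)
import os

reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     username=os.environ['REDDIT_USERNAME'],
                     password=os.environ['REDDIT_PASSWORD'],
                     user_agent='tiffanyt')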

In [None]:
# check that the API is accessible (prints the authenticated username)
print(reddit.user.me())

intelligentstraw


In [None]:
subreddit_nvidia = reddit.subreddit('nvidia')
new_nvidia = subreddit_nvidia.new(limit=500) # fetch up to 500 of the newest posts in the subreddit

In [None]:
# filter to posts from 2022 onwards

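start_date = '01-01-22 00:00:00'
# post.created_utc is a UTC Unix timestamp, so interpret the cutoff as UTC too
# (a naive datetime would be converted using the local timezone)
start_date = datetime.datetime.strptime(start_date, '%d-%m-%y %H:%M:%S').replace(tzinfo=datetime.timezone.utc).timestamp()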

In [None]:
data = []  # Initialize lists to store data
counter = 0

for post in new_nvidia:
    date = post.created_utc
    try:
        if not post.stickied and date > start_date:   # filter to posts from 2022 onwards only
            post_data = {
                'id': post.id,
                'date': post.created_utc,
                'title': post.title,
                'selftext': post.selftext,
                'n_comments': post.num_comments,
                'author': post.author
            }

            post.comments.replace_more(limit=None)  # Fetch all comments from each post
            for comment in post.comments.list():
                try:
                    comment_text = comment.body.encode("utf-8", errors='ignore').decode("utf-8", errors='ignore')
                    comment_data = post_data.copy()  # Copy post data for each comment 
                    comment_data['comment'] = comment_text
                    data.append(comment_data) # put comments in rows
                except Exception as e:
                    print(f"Error processing comment: {e}")
    except Exception as e:
        print(f"Error processing post: {e}")
        
    counter += 1 # counter of no. of posts being scraped
    print(counter)

# Create a DataFrame 
df_nvidia = pd.DataFrame(data)

# Add a column labelling these rows as coming from the Nvidia subreddit
df_nvidia['Subreddit'] = "Nvidia"

1
2
3
...
255
Error processing post: received 429 HTTP response
256
257
...
264
...
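The `received 429 HTTP response` line in the output above means Reddit rate-limited one request. Because the `except` block only prints the error, that post's comments are skipped and never make it into the DataFrame. A minimal sketch of one way to handle this is to retry the failing call with exponential backoff. The helper name `call_with_backoff` and its parameters are hypothetical, and the exception class to retry on is left to the caller because it depends on the installed PRAW/prawcore version.

```python
import time


def call_with_backoff(func, retry_on, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); if it raises `retry_on`, wait and try again.

    The wait doubles after every failure (base_delay, 2*base_delay, 4*base_delay, ...).
    The last failure is re-raised so the caller can still log or skip the item.
    `sleep` is injectable so the helper can be exercised without real waiting.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Inside the scraping loop, the comment fetch could then be wrapped as `call_with_backoff(lambda: post.comments.replace_more(limit=None), retry_on=Exception)`. Passing the specific rate-limit exception raised by the installed library, rather than the broad `Exception`, would avoid retrying unrelated errors.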

In [None]:
# convert date to datetime format
df_nvidia['date'] = pd.to_datetime(df_nvidia['date'], unit='s')

In [None]:
# check shape of data
df_nvidia.shape

(21823, 8)

In [None]:
df_nvidia = df_nvidia.drop_duplicates() # drop duplicates
df_nvidia.reset_index(drop=True, inplace=True)
df_nvidia.shape # check no. of rows after dropping

(21609, 8)
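Called without arguments, `drop_duplicates()` removes rows that are identical in every column and keeps the first occurrence. Since the scraped rows store the comment text but not a comment ID, two distinct comments with the same text on the same post are indistinguishable and collapse into one row. This is one plausible contributor to the 214 dropped rows, though it is an assumption that the notebook does not verify. A small illustration with made-up data:

```python
import pandas as pd

# Hypothetical rows: the first two are different comments that happen to share the same text
rows = [
    {'id': 'abc', 'title': 'Driver issue', 'comment': '[deleted]'},
    {'id': 'abc', 'title': 'Driver issue', 'comment': '[deleted]'},
    {'id': 'abc', 'title': 'Driver issue', 'comment': 'Try a clean install'},
]
df = pd.DataFrame(rows)

# Rows identical across all columns are treated as duplicates; the first is kept
deduped = df.drop_duplicates().reset_index(drop=True)
n_removed = len(df) - len(deduped)
```

If such comments should be kept as separate rows, storing `comment.id` alongside the text during scraping would make every row unique.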

In [None]:
# export as csv
df_nvidia.to_csv('datasets/nvidia_reddit.csv')

In [None]:
# view the posts with the most comments
selected_nvda = df_nvidia[['title', 'n_comments']].drop_duplicates()
sorted_nvda = selected_nvda.sort_values(by='n_comments', ascending=False)
sorted_nvda.head(20)

In [None]:
# Repeat the same process for AMD

subreddit_amd = reddit.subreddit('AMD')
new_amd = subreddit_amd.new(limit=500) # fetch up to 500 of the newest posts in the subreddit
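Since the AMD scrape repeats the Nvidia loop line for line, the shared logic could be factored into a single helper. The sketch below is one possible refactoring, not the code this notebook actually ran: the function name `comments_to_rows` and its parameters are hypothetical, and the per-post and per-comment error handling is omitted for brevity.

```python
def comments_to_rows(posts, start_ts, label):
    """Flatten an iterable of PRAW-style posts into one dict per comment.

    posts    : iterable of submissions (e.g. the result of subreddit.new(limit=500))
    start_ts : UTC Unix timestamp; posts created at or before it are skipped
    label    : value stored in the 'Subreddit' column of every row
    """
    rows = []
    for post in posts:
        # Skip pinned posts and anything older than the cutoff
        if post.stickied or post.created_utc <= start_ts:
            continue
        post_data = {
            'id': post.id,
            'date': post.created_utc,
            'title': post.title,
            'selftext': post.selftext,
            'n_comments': post.num_comments,
            'author': post.author,
            'Subreddit': label,
        }
        post.comments.replace_more(limit=None)  # expand every "load more comments" stub
        for comment in post.comments.list():
            # Drop any characters that cannot be encoded as UTF-8
            text = comment.body.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore')
            rows.append({**post_data, 'comment': text})
    return rows
```

With this helper, each subreddit becomes a one-liner, for example `df_amd = pd.DataFrame(comments_to_rows(new_amd, start_date, 'AMD'))`, and a wrongly copied label or comment is less likely to slip in.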

In [None]:
data = []  # Initialize lists to store data
counter = 0

for post in new_amd:
    date = post.created_utc
    try:
        if not post.stickied and date > start_date:   # filter to posts from 2022 onwards only
            post_data = {
                'id': post.id,
                'date': post.created_utc,
                'title': post.title,
                'selftext': post.selftext,
                'n_comments': post.num_comments,
                'author': post.author
            }

            post.comments.replace_more(limit=None)  # Fetch all comments from each post
            for comment in post.comments.list():
                try:
                    comment_text = comment.body.encode("utf-8", errors='ignore').decode("utf-8", errors='ignore')
                    comment_data = post_data.copy()  # Copy post data for each comment 
                    comment_data['comment'] = comment_text
                    data.append(comment_data) # put comments in rows
                except Exception as e:
                    print(f"Error processing comment: {e}")
    except Exception as e:
        print(f"Error processing post: {e}")
        
    counter += 1 # counter of no. of posts being scraped
    print(counter)

# Create a DataFrame 
df_amd = pd.DataFrame(data)

# Add a column labelling these rows as coming from the AMD subreddit
df_amd['Subreddit'] = "AMD"

1
2
3
...
112
Error processing post: received 429 HTTP response
113
114
...
264
...

In [None]:
# convert date to datetime format
df_amd['date'] = pd.to_datetime(df_amd['date'], unit='s')

In [None]:
# check shape
df_amd.shape

(38032, 8)

In [None]:
df_amd = df_amd.drop_duplicates() # drop duplicates
df_amd.reset_index(drop=True, inplace=True)
df_amd.shape # check no. of rows after dropping

(37520, 8)

In [None]:
# export to csv
df_amd.to_csv('datasets/amd_reddit.csv')

With that, I have completed the main objective of this notebook: scraping the comments on recent posts from the Nvidia and AMD subreddits. The data has been exported to CSV and will be used for EDA and modelling in Notebooks 2 and 3.