<a href="https://colab.research.google.com/github/umbertoselva/NER-based-Sentiment-Analysis/blob/main/01_Reddit_API.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 01 USING THE REDDIT API TO GET A DATASET

This is part 01 of my NER-based Sentiment Analysis Project: 
https://github.com/umbertoselva/NER-based-Sentiment-Analysis

Our goal in this notebook will be to extract a dataset of movie reviews from the "I Just Watched" subreddit.

In the next few notebooks we will apply Named Entity Recognition to this dataset to extract the 'PERSON' entities (we will be looking for actors and movie directors), then we will carry out a Sentiment Analysis of each review to find out who is more popular among the IJW subreddit users.

TABLE OF CONTENTS

1. Authentication
2. Retrieving posts from a specific subreddit
3. Saving the dataframe

### 1) AUTHENTICATION

In order to use the Reddit API you need to:
1) be registered on the website with a username and password
2) navigate to https://www.reddit.com/prefs/apps, click "Create another app", and note down the client id and the secret key

I saved these four items (username, password, client id, secret key) in txt files, which I uploaded to Google Colab so that we can read them into our notebook.

In [None]:
# Read each credential from its own file; .strip() drops any trailing newline
with open('reddit_username.txt', 'r') as usr:
  reddit_username = usr.read().strip()

with open('reddit_password.txt', 'r') as pwd:
  reddit_password = pwd.read().strip()

with open('client_id.txt', 'r') as clid:
  client_id = clid.read().strip()

with open('secret_key.txt', 'r') as skey:
  secret_key = skey.read().strip()

We use the client id and the secret key for authentication together with login information and headers

In [None]:
import requests

In [None]:
auth = requests.auth.HTTPBasicAuth(client_id, secret_key)
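
For intuition, HTTP Basic authentication sends the two values as a base64-encoded `id:secret` pair in an `Authorization` header, which is roughly what `HTTPBasicAuth` arranges for us. A minimal standard-library sketch, using made-up placeholder credentials (not real ones):

```python
import base64

# Hypothetical placeholder credentials, for illustration only
demo_client_id = 'my_client_id'
demo_secret_key = 'my_secret_key'

# Basic auth joins the two values with a colon and base64-encodes the result
raw_pair = f'{demo_client_id}:{demo_secret_key}'.encode('utf-8')
basic_header = 'Basic ' + base64.b64encode(raw_pair).decode('ascii')

print(basic_header)
```

Note that base64 is an encoding, not encryption, which is why the request goes over HTTPS.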

We now need to prepare a dictionary with the login details:
- login method aka 'grant_type' as 'password'
- username
- password

In [None]:
login = {'grant_type': 'password',
         'username': reddit_username,
         'password': reddit_password}

In [None]:
headers = {'User-Agent': 'GetDataAPI/0.0.1'}

We send a post request to the "access_token" endpoint

In [None]:
response = requests.post('https://www.reddit.com/api/v1/access_token',
                         auth=auth,
                         data=login,
                         headers=headers)

In [None]:
response

<Response [200]>

We got an access token, which will expire in 24 hours (`expires_in` is given in seconds).

In [None]:
response.json()

{'access_token': '1129284099355-9MqAdRN6MI8NoL6J8l96TpIYFM_92w',
 'expires_in': 86400,
 'scope': '*',
 'token_type': 'bearer'}
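
The `expires_in` value is expressed in seconds, so 86400 corresponds to 24 hours. A small sketch of turning it into an expiry time; the helper name `token_expiry`, the issue time, and the sample payload are illustrative and not part of the Reddit API:

```python
from datetime import datetime, timedelta, timezone

def token_expiry(token_payload, issued_at):
    """Return the moment a token stops being valid, given the JSON payload."""
    return issued_at + timedelta(seconds=token_payload['expires_in'])

# Sample payload shaped like the response above (token value is a placeholder)
sample_payload = {'access_token': 'PLACEHOLDER', 'expires_in': 86400}
issued_at = datetime(2022, 7, 15, 12, 0, tzinfo=timezone.utc)

expires_at = token_expiry(sample_payload, issued_at)
print(expires_at.isoformat())  # 2022-07-16T12:00:00+00:00
```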

In [None]:
response.json()['access_token']

'1129284099355-9MqAdRN6MI8NoL6J8l96TpIYFM_92w'

In [None]:
auth_token = response.json()['access_token']

Let's add this access token to our headers, under the 'Authorization' key

In [None]:
headers['Authorization'] = f'bearer {auth_token}'

In [None]:
headers

{'Authorization': 'bearer 1129284099355-9MqAdRN6MI8NoL6J8l96TpIYFM_92w',
 'User-Agent': 'GetDataAPI/0.0.1'}

If everything is correct, sending a GET request to the "me" endpoint will return a 200 response

In [None]:
requests.get('https://oauth.reddit.com/api/v1/me', headers=headers)

<Response [200]>

### 2) RETRIEVING POSTS FROM A SPECIFIC SUBREDDIT

We are going to extract the posts from the "I Just Watched" subreddit, in particular the posts at https://www.reddit.com/r/ijustwatched/new.

Each post is a movie review.

In [None]:
api = 'https://oauth.reddit.com'

In [None]:
requests.get(f'{api}/r/ijustwatched/new', headers=headers)

<Response [200]>

In [None]:
res_25 = requests.get(f'{api}/r/ijustwatched/new', headers=headers)

The returned JSON dict contains a 'data' key, whose value is a dict; within it we find a 'children' key, whose value is a list of the returned posts. By default, 25 posts are returned.

In [None]:
len(res_25.json()['data']['children'])

25

In [None]:
res_25.json()['data']['children'][0]

{'data': {'all_awardings': [],
  'allow_live_comments': False,
  'approved_at_utc': None,
  'approved_by': None,
  'archived': False,
  'author': 'filmgamegeek',
  'author_flair_background_color': None,
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_template_id': None,
  'author_flair_text': None,
  'author_flair_text_color': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_olhmd',
  'author_is_blocked': False,
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'banned_at_utc': None,
  'banned_by': None,
  'can_gild': True,
  'can_mod_post': False,
  'category': None,
  'clicked': False,
  'content_categories': None,
  'contest_mode': False,
  'created': 1657905770.0,
  'created_utc': 1657905770.0,
  'discussion_type': None,
  'distinguished': None,
  'domain': 'self.Ijustwatched',
  'downs': 0,
  'edited': False,
  'gilded': 0,
  'gildings': {},
  'hidden': False,
  'hide_score': False,
  'id': 'vzu4cb',
  'is

We need to go into the value corresponding to the further 'data' key, which is also a dict. And we shall grab the following information from each post:

- `'name'` = we'll need this to identify the earliest post of the batch of 100, so that we can then extract the previous 100 and so on
- `'created_utc'` = the timestamp of when the post was created
- `'subreddit'` = useful in case we then want to grab data from other subreddits
- `'title'` = the title of the post
- `'selftext'` = the raw text of the post.
- `'upvote_ratio'`
- `'ups'` = positive votes
- `'downs'` = negative votes
- `'score'`
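
As an aside on `'created_utc'`: the value is a Unix timestamp in seconds, so it can be turned into a readable UTC datetime with the standard library. A minimal sketch using the timestamp from the sample post shown above:

```python
from datetime import datetime, timezone

created_utc = 1657905770.0  # value taken from the sample post above

# Interpret the epoch seconds as a timezone-aware UTC datetime
created_at = datetime.fromtimestamp(created_utc, tz=timezone.utc)
print(created_at.isoformat())  # 2022-07-15T17:22:50+00:00
```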

But first we need to get more posts, not just 25.

The following is an excerpt from the Reddit API documentation at [https://www.reddit.com/dev/api/](https://www.reddit.com/dev/api/)


```
GET [/r/subreddit]/new

This endpoint is a listing.

after	= fullname of a thing

before	= fullname of a thing

count	= a positive integer (default: 0)

limit	= the maximum number of items desired (default: 25, maximum: 100)

show	= (optional) the string all

sr_detail	= (optional) expand subreddits
```

So if we want to return more than 25 values, we need to specify the `limit` parameter.

We can select to return up to a maximum of 100 posts per request.

In [None]:
res_100 = requests.get(f'{api}/r/ijustwatched/new', 
                   headers=headers,
                   params={'limit': '100'}
                   )

In [None]:
len(res_100.json()['data']['children'])

100

Now let's put these first 100 posts into a dataframe with Pandas

Let's initiate a dataframe

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(
    {
        'name': [],
        'created_utc': [],
        'subreddit': [],
        'title': [],
        'selftext': [],
        'upvote_ratio': [],
        'ups': [],
        'downs': [],
        'score': []
    }
)

In [None]:
for post in res_100.json()['data']['children']:
  df = df.append(
      {
          'name': post['data']['name'],
          'created_utc': int(post['data']['created_utc']),
          'subreddit': post['data']['subreddit'],
          'title': post['data']['title'],
          'selftext': post['data']['selftext'],
          'upvote_ratio': post['data']['upvote_ratio'],
          'ups': post['data']['ups'],
          'downs': post['data']['downs'],
          'score': post['data']['score']
      }, 
      ignore_index=True
  )
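
Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell above only runs on older versions. One version-independent alternative is to collect plain dictionaries first and build the dataframe once at the end. A sketch, where `FIELDS`, `posts_to_rows` and the toy listing are illustrative names and data, not part of the original notebook:

```python
FIELDS = ['name', 'created_utc', 'subreddit', 'title', 'selftext',
          'upvote_ratio', 'ups', 'downs', 'score']

def posts_to_rows(listing):
    """Turn a Reddit listing dict into a list of flat row dicts."""
    rows = []
    for post in listing['data']['children']:
        # Keep only the fields we care about, in a fixed order
        row = {field: post['data'][field] for field in FIELDS}
        row['created_utc'] = int(row['created_utc'])  # drop the fractional part
        rows.append(row)
    return rows

# Toy listing with the same nesting as the API response (values are made up)
toy_listing = {'data': {'children': [
    {'data': {'name': 't3_aaa111', 'created_utc': 1657905770.0,
              'subreddit': 'Ijustwatched', 'title': 'IJW: Example (2022)',
              'selftext': 'Some review text', 'upvote_ratio': 0.86,
              'ups': 5, 'downs': 0, 'score': 5, 'extra_key': 'ignored'}},
]}}

rows = posts_to_rows(toy_listing)
print(rows[0]['name'], rows[0]['created_utc'])
```

With pandas available, `pd.DataFrame(rows, columns=FIELDS)` builds the dataframe in one step, and `pd.concat` can join several such frames.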

In [None]:
df

Unnamed: 0,name,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score
0,t3_vzu4cb,1.657906e+09,Ijustwatched,IJW: Ang Babaeng Nawawala sa Sarili (2022),Source: [https://www.reeladvice.net/2022/07/an...,0.86,5.0,0.0,5.0
1,t3_vz90er,1.657840e+09,Ijustwatched,Ijw: Paws of Fury: The Legend of Hank (2022),"For a very little kid’s first parody/farce, it...",0.89,7.0,0.0,7.0
2,t3_vyxfuj,1.657810e+09,Ijustwatched,IJW: Kitty K7 (2022),Source: [https://www.reeladvice.net/2022/07/ki...,1.00,1.0,0.0,1.0
3,t3_vx6v7n,1.657617e+09,Ijustwatched,IJW : Man from Toronto (2022),"Was a pretty dope movie, watched it online ye...",0.74,4.0,0.0,4.0
4,t3_vwmwkm,1.657558e+09,Ijustwatched,IJW: Thor: Love and Thunder (2022),Source: [https://www.reeladvice.net/2022/07/th...,0.74,4.0,0.0,4.0
...,...,...,...,...,...,...,...,...,...
95,t3_uyu144,1.653642e+09,Ijustwatched,IJW: Emergency (2022),The way this mixes comedy social issues and su...,0.93,11.0,0.0,11.0
96,t3_uyfi3i,1.653594e+09,Ijustwatched,IJW: Top Gun: Maverick (2022),"A sequel forty-six years in the making, Top Gu...",1.00,1.0,0.0,1.0
97,t3_uyahd8,1.653580e+09,Ijustwatched,IJW: Cyber Hell: Exposing an Internet Horror (...,[https://www.reeladvice.net/2022/05/cyber-hell...,1.00,3.0,0.0,3.0
98,t3_uy8ce2,1.653574e+09,Ijustwatched,IJW: Castle in the Sky (1986),In trying to maintain my anime cred I watched ...,0.84,4.0,0.0,4.0


Now if we want to loop back in time and grab 100 posts at a time, we need to use the 'name' of the earliest post (i.e. the post that will appear last in our dataframe at each loop), and set that as the value of the 'after' parameter

In [None]:
df['name'].iloc[len(df)-1]

't3_uy73xm'

In [None]:
# res_100_more = requests.get(f'{api}/r/ijustwatched/new',
#                             headers=headers,
#                             params={
#                                 'limit': '100',
#                                 'after': df['name'].iloc[len(df)-1]
#                             }
#                 )

The above cell by itself will return 100 more posts.

However, we need a loop that will keep going until we get all the data

In [None]:
while True:

  # get a batch of up to 100 posts
  res_100_post_batch = requests.get(f'{api}/r/ijustwatched/new',
                                    headers=headers,
                                    params={
                                        'limit': '100',
                                        'after': df['name'].iloc[len(df)-1]
                                    }
                        )
  
  # keep going until you run out of posts
  if len(res_100_post_batch.json()['data']['children']) == 0:
    break

  # for each of the 100 retrieved posts, populate the dataframe
  for post in res_100_post_batch.json()['data']['children']:
    df = df.append(
        {
            'name': post['data']['name'],
            'created_utc': int(post['data']['created_utc']),
            'subreddit': post['data']['subreddit'],
            'title': post['data']['title'],
            'selftext': post['data']['selftext'],
            'upvote_ratio': post['data']['upvote_ratio'],
            'ups': post['data']['ups'],
            'downs': post['data']['downs'],
            'score': post['data']['score']
        }, 
        ignore_index=True
    )
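
The same pagination logic can also be written as a small function with a page cap and a pause between requests, which may help avoid an endless loop or hammering the API. In this sketch `fetch_page` is a hypothetical callable that takes the `after` value and returns the parsed JSON listing; `max_pages` and `pause_seconds` are assumed safety knobs, not Reddit parameters:

```python
import time

def collect_posts(fetch_page, first_after=None, max_pages=50, pause_seconds=0.0):
    """Follow the 'after' cursor until a page comes back empty."""
    posts = []
    after = first_after
    for _ in range(max_pages):
        children = fetch_page(after)['data']['children']
        if not children:  # no more posts: stop
            break
        posts.extend(child['data'] for child in children)
        after = children[-1]['data']['name']  # earliest post of this batch
        time.sleep(pause_seconds)
    return posts

# Fake two-page source standing in for the real API (made-up fullnames)
fake_pages = {
    None: [{'data': {'name': 't3_a'}}, {'data': {'name': 't3_b'}}],
    't3_b': [{'data': {'name': 't3_c'}}],
    't3_c': [],
}

def fake_fetch(after):
    return {'data': {'children': fake_pages[after]}}

all_posts = collect_posts(fake_fetch)
print([p['name'] for p in all_posts])  # ['t3_a', 't3_b', 't3_c']
```

With `requests`, `fetch_page` could wrap the `requests.get(...)` call from the cell above and return its `.json()`.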

In [None]:
df

Unnamed: 0,name,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score
0,t3_vzu4cb,1.657906e+09,Ijustwatched,IJW: Ang Babaeng Nawawala sa Sarili (2022),Source: [https://www.reeladvice.net/2022/07/an...,0.86,5.0,0.0,5.0
1,t3_vz90er,1.657840e+09,Ijustwatched,Ijw: Paws of Fury: The Legend of Hank (2022),"For a very little kid’s first parody/farce, it...",0.89,7.0,0.0,7.0
2,t3_vyxfuj,1.657810e+09,Ijustwatched,IJW: Kitty K7 (2022),Source: [https://www.reeladvice.net/2022/07/ki...,1.00,1.0,0.0,1.0
3,t3_vx6v7n,1.657617e+09,Ijustwatched,IJW : Man from Toronto (2022),"Was a pretty dope movie, watched it online ye...",0.74,4.0,0.0,4.0
4,t3_vwmwkm,1.657558e+09,Ijustwatched,IJW: Thor: Love and Thunder (2022),Source: [https://www.reeladvice.net/2022/07/th...,0.74,4.0,0.0,4.0
...,...,...,...,...,...,...,...,...,...
992,t3_oj9jvl,1.626156e+09,Ijustwatched,IJW: Fired Up! [2009],Fired Up! is a dramedy romcom type film about ...,1.00,4.0,0.0,4.0
993,t3_oinxgw,1.626083e+09,Ijustwatched,IJW: The 8th Night (2021),Plot is confusing to say the least. It appears...,1.00,5.0,0.0,5.0
994,t3_oilr8d,1.626072e+09,Ijustwatched,IJW: Diary of a Chambermaid [1964],Diary of a Chambermaid is a drama mystery roma...,1.00,3.0,0.0,3.0
995,t3_oiisdi,1.626059e+09,Ijustwatched,IJW: Soldier (1998),I remember watching this growing up. Good acti...,1.00,5.0,0.0,5.0


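We successfully retrieved 997 posts! Note that Reddit listings generally stop at roughly 1,000 items, so this is probably the most recent ~1,000 posts rather than the subreddit's full history.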

Now let's do some cleaning, in case we have collected any invalid data items

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 997 entries, 0 to 996
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          997 non-null    object 
 1   created_utc   997 non-null    float64
 2   subreddit     997 non-null    object 
 3   title         997 non-null    object 
 4   selftext      997 non-null    object 
 5   upvote_ratio  997 non-null    float64
 6   ups           997 non-null    float64
 7   downs         997 non-null    float64
 8   score         997 non-null    float64
dtypes: float64(5), object(4)
memory usage: 70.2+ KB


As we can see, a few columns have Dtype `object`.

This is what pandas assigns by default to columns that contain strings.

However, an `object` column can also hold mixed data types, and a string column may contain empty strings, which might cause issues for us later on.

Let's check for null values

In [None]:
df['selftext'].isnull().values.any()

False

So there are no NaN values.

Indeed, after some tinkering, I found out that all cells contain strings, but a couple of them contain empty strings. This would cause problems later on, so we should fix that.

Let us check if any cell contains an empty string

In [None]:
import numpy as np

In [None]:
np.where(df['selftext'].apply(lambda x: x == ''))

(array([398, 924]),)

An equivalent method is the following:

In [None]:
df[df['selftext'] == ''].index

Int64Index([398, 924], dtype='int64')

Ok, so the cells at rows 398 and 924 both contain an empty string. Let us confirm that and replace the empty string with something.

In [None]:
df['selftext'].iloc[398]

''

In [None]:
df['selftext'].iloc[398] = 'Empty'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


In [None]:
df['selftext'].iloc[398]

'Empty'

In [None]:
df['selftext'].iloc[924]

''

In [None]:
df['selftext'].iloc[924] = 'Empty'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


In [None]:
df['selftext'].iloc[924]

'Empty'

In [None]:
type(df['selftext'].iloc[398])

str

In [None]:
type(df['selftext'].iloc[924])

str

In [None]:
df[df['selftext'] == ''].index

Int64Index([], dtype='int64')
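
The `SettingWithCopyWarning` shown above comes from chained indexing (`df['selftext'].iloc[...] = ...`). One way to avoid it, and to handle every empty cell at once, is a single `.loc` assignment with a boolean mask. A sketch on a small made-up dataframe (`df_demo` is not the real dataset):

```python
import pandas as pd

# Toy data with two empty-string cells
df_demo = pd.DataFrame({'selftext': ['a review', '', 'another review', '']})

# Boolean mask of the rows whose text is an empty string
empty_mask = df_demo['selftext'] == ''

# One .loc assignment selects rows and column together, so no chained indexing
df_demo.loc[empty_mask, 'selftext'] = 'Empty'

print(df_demo['selftext'].tolist())
```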

### 3) SAVING THE DATAFRAME

Let's save our dataframe in a CSV file.

Before doing that, let's replace any '|' characters in our dataframe, so that we can use that character as our delimiter/separator in our CSV file.

In [None]:
# '|' is the regex alternation operator, so it must be escaped to match a literal pipe
df = df.replace({r'\|': ''}, regex=True)
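
One caveat with `regex=True`: in a regular expression a bare `|` is the alternation operator rather than a literal character, so a pattern consisting only of `|` matches the empty string and removes nothing. The pipe therefore needs escaping (`r'\|'`) when the goal is to delete it. A standard-library sketch of the difference (the sample string is made up):

```python
import re

sample = 'good|bad|ugly'

# Unescaped: '|' means "empty OR empty", so the text is left unchanged
unescaped = re.sub('|', '', sample)

# Escaped: the pattern now matches the literal pipe character
escaped = re.sub(r'\|', '', sample)

print(unescaped)       # good|bad|ugly
print(escaped)         # goodbadugly
print(re.escape('|'))  # \|
```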

Let us make sure that we have not just accidentally created any new empty string cells

In [None]:
df[df['selftext'] == ''].index

Int64Index([], dtype='int64')

Ok, we can save our dataframe in a file

In [None]:
df.to_csv('ijw_subreddit.csv', sep='|', encoding='utf-8', index=False)
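
As a rough illustration of how a `|`-delimited file round-trips, here is a sketch with Python's built-in `csv` module rather than pandas, on made-up rows (an in-memory buffer stands in for the file):

```python
import csv
import io

rows = [
    ['name', 'title', 'selftext'],
    ['t3_aaa111', 'IJW: Example (2022)', 'Great film, loved it'],
    ['t3_bbb222', 'IJW: Another (2021)', 'Not for me'],
]

# Write with '|' as the delimiter; commas in the text need no special handling
buffer = io.StringIO()
csv.writer(buffer, delimiter='|').writerows(rows)

# Read the same text back using the same delimiter
buffer.seek(0)
restored = list(csv.reader(buffer, delimiter='|'))

print(restored == rows)  # True
```

In pandas, the saved file can be loaded again with `pd.read_csv('ijw_subreddit.csv', sep='|')`.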

In [None]:
!pwd

/content


In [None]:
!ls

client_id.txt	   reddit_password.txt	sample_data
ijw_subreddit.csv  reddit_username.txt	secret_key.txt


I shall save this file to Google Drive for later use.