<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Reddit - Web Scraping

--- 
# Notebook 2

This second notebook scrapes the data using the Pushshift API and puts it into two dataframes.

---

# 1.0 Data Collection from subreddit Cats
Extract 10,000 posts from the cats subreddit into a dataframe, from which the title will be used.

In [33]:
import requests
import pandas as pd
import time

from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Stating the parameters
url = 'https://api.pushshift.io/reddit/search/submission'
params = {
    'subreddit': 'cats',
    'size': 100,
    'before': 1641398400  # Unix timestamp for 6th Jan, 12am local time
}
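
The `before` parameter is a Unix timestamp. A minimal sketch of how such a value can be derived from a local date is shown here. The notebook does not state the year or the time zone, so 2022 and UTC+8 are assumptions; they are chosen because they reproduce the value used above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "local" means UTC+8; the notebook does not name the time zone.
local_tz = timezone(timedelta(hours=8))

# Assumption: the year is 2022. Midnight on 6 Jan in that zone,
# converted to whole seconds since the Unix epoch.
cutoff = datetime(2022, 1, 6, 0, 0, tzinfo=local_tz)
before = int(cutoff.timestamp())

print(before)  # 1641398400
```

Attaching an explicit time zone to the datetime keeps the result independent of the machine the notebook runs on.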

In [3]:
res = requests.get(url, params)
res.status_code # Check the response status; 200 means the request succeeded

200

In [4]:
data = res.json()
posts = data['data']
len(posts) # Check to see if we extracted 100 posts

100

In [5]:
# Create dataframe for cats
df_cats = pd.DataFrame(posts)

In [6]:
# Initialise a running total of the time taken to scrape the remaining 9900 posts
cats_total_time = 0

# Extract another 9900 posts (99 requests of 100) to get 10000 rows for df_cats
for i in range(99):
    start_time = time.time()
    # Page backwards: ask for posts older than the last one already collected
    params = {'subreddit': 'cats', 'size': 100, 'before': posts[-1]['created_utc']}
    response = requests.get(url, params)
    data = response.json()
    posts = data['data']
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
    df_cats = pd.concat([df_cats, pd.DataFrame(posts)])
    end_time = time.time()
    exe_time = end_time - start_time
    cats_total_time += exe_time
    time.sleep(2)  # Pause between requests

# Check if we have 10000 posts
df_cats.shape

(9995, 83)

We did not manage to scrape 10,000 posts. This might be because some posts were deleted and the gaps they left were not filled. However, we obtained close to 10,000 posts, which is sufficient for our analysis.
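
One way to see where a shortfall comes from is to record how many posts each request returns. The sketch below pages backwards the same way the loop above does, but stops on an empty page and keeps the per-page counts. `fetch_page` is a hypothetical stand-in for the `requests.get(...).json()['data']` call, and `fake_fetch` is invented test data so that the example runs without network access.

```python
def collect_pages(fetch_page, before, max_pages):
    """Page backwards in time; return all posts and the size of each page."""
    all_posts, page_sizes = [], []
    for _ in range(max_pages):
        posts = fetch_page(before)
        page_sizes.append(len(posts))
        if not posts:  # nothing older is available, so stop early
            break
        all_posts.extend(posts)
        # The oldest post on this page becomes the cutoff for the next request.
        before = posts[-1]['created_utc']
    return all_posts, page_sizes


def fake_fetch(before, size=3):
    """Hypothetical stand-in for the API: posts exist at timestamps 1..7."""
    older = [t for t in range(7, 0, -1) if t < before]
    return [{'created_utc': t} for t in older[:size]]


posts, page_sizes = collect_pages(fake_fetch, before=8, max_pages=5)
print(len(posts), page_sizes)  # 7 [3, 3, 1, 0]
```

Any entry in `page_sizes` that is smaller than the requested size marks a request that came back with fewer posts than asked for.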

# 1.1 Data Collection from subreddit Dogs
Extract 10,000 posts from the dogs subreddit into a dataframe, from which the title will be used.

In [7]:
# Changing the parameters for dogs
params = {
    'subreddit': 'dogs',
    'size': 100,
    'before': 1641398400  # Unix timestamp for 6th Jan, 12am local time
}

In [8]:
# Create dataframe for dogs
res = requests.get(url, params)
data = res.json()
posts = data['data']
df_dogs = pd.DataFrame(posts)

In [9]:
# Initialise a running total of the time taken to scrape the remaining 9900 posts
dogs_total_time = 0

# Extract another 9900 posts (99 requests of 100) to get 10000 rows for df_dogs
for i in range(99):
    start_time = time.time()
    # Page backwards: ask for posts older than the last one already collected
    params = {'subreddit': 'dogs', 'size': 100, 'before': posts[-1]['created_utc']}
    response = requests.get(url, params)
    data = response.json()
    posts = data['data']
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
    df_dogs = pd.concat([df_dogs, pd.DataFrame(posts)])
    end_time = time.time()
    exe_time = end_time - start_time
    dogs_total_time += exe_time
    time.sleep(2)  # Pause between requests
    
# Check if we have 10000 posts
df_dogs.shape

(9996, 78)

We only managed to scrape 9,996 posts. This is only 1 more than from the cats subreddit, so the two datasets are similar in size and neither class should overwhelm the other.
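
The balance between the two classes can be checked with simple arithmetic. This is a small sketch that uses the row counts from the two `.shape` outputs above; the variable names are illustrative only.

```python
# Row counts taken from the two .shape outputs above.
n_cats, n_dogs = 9995, 9996

total = n_cats + n_dogs
share_cats = n_cats / total
share_dogs = n_dogs / total

print(f'cats: {share_cats:.2%}, dogs: {share_dogs:.2%}')
```

Both shares are within a hundredth of a percentage point of 50%, so a classifier that always predicts one subreddit would be right about half the time.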

# 2.0 Usefulness of Data

In [10]:
df_cats.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,media,media_embed,secure_media,secure_media_embed,author_flair_background_color,author_flair_text_color,author_flair_template_id,author_cakeday,category,banned_by
0,[],False,BeepBeep-Richie,,[],,text,t2_33zkxlgz,False,False,...,,,,,,,,,,
1,[],False,__katsby,,[],,text,t2_40jfdgd6,False,False,...,,,,,,,,,,
2,[],False,TinyTotoro3,,[],,text,t2_676c778q,False,False,...,,,,,,,,,,
3,[],False,clickbaitbabe,,[],,text,t2_gvp1d1m,False,False,...,,,,,,,,,,
4,[],False,TinyTotoro3,,[],,text,t2_676c778q,False,False,...,,,,,,,,,,


The first 5 rows show empty selftext values. Let's examine the number of empty posts in df_cats using value_counts below.

In [11]:
df_cats['selftext'].value_counts().head(3)

             8593
[removed]     180
[deleted]      14
Name: selftext, dtype: int64

In [12]:
df_dogs.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,post_hint,preview,crosspost_parent,crosspost_parent_list,url_overridden_by_dest,thumbnail_height,thumbnail_width,distinguished,banned_by,edited
0,[],False,Substantial-Koala-32,,[],,text,t2_gp3egwzs,False,False,...,,,,,,,,,,
1,[],False,Unsd,,[],,text,t2_4d18z6g4,False,False,...,,,,,,,,,,
2,[],False,Cubs017,,[],,text,t2_rt2lm,False,False,...,,,,,,,,,,
3,[],False,stuffiwannaknow,,[],,text,t2_cbmsbxv,False,False,...,,,,,,,,,,
4,[],False,seeyouinthesun,,[],,text,t2_ho8lowto,False,False,...,,,,,,,,,,


In [13]:
df_dogs['selftext'].value_counts().head(3)

[removed]    792
              68
[deleted]     43
Name: selftext, dtype: int64

## <u>Usefulness of Title and Selftext columns</u>
* None of the titles from the cats subreddit or the dogs subreddit are empty.
* The selftext from cats has 8593 blanks, 180 removed and 14 deleted, which totals 8787 missing values.
* The blanks are likely due to picture posts, which are not helpful to us.
* The selftext from dogs has 68 blanks, 792 removed and 43 deleted, which totals 903 missing values.

This suggests that selftext from the cats subreddit will not provide enough data: of the 9,995 scraped posts, only about 1,200 have usable text. For this project, we will therefore use only the title and not selftext.
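
The tallies above can also be computed directly from the raw post dictionaries. The following is a minimal sketch, assuming that an empty string, `[removed]` and `[deleted]` are the only markers of missing text; `count_missing` and the sample posts are invented for illustration.

```python
from collections import Counter

# Markers treated as missing: empty string, '[removed]' and '[deleted]'.
MISSING = {'', '[removed]', '[deleted]'}


def count_missing(posts, field='selftext'):
    """Tally missing-value markers in a list of post dicts."""
    values = (str(post.get(field) or '').strip() for post in posts)
    return Counter(v for v in values if v in MISSING)


# Invented sample posts, for illustration only.
sample = [
    {'title': 'My cat', 'selftext': ''},
    {'title': 'Help', 'selftext': '[removed]'},
    {'title': 'Vet advice', 'selftext': 'She stopped eating yesterday.'},
    {'title': 'Photo'},  # posts without the field count as blank
]

missing = count_missing(sample)
print(sum(missing.values()), dict(missing))  # 3 {'': 2, '[removed]': 1}
```

Summing the counter gives the total number of unusable rows in one step, which avoids adding the three categories by hand.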

# 2.1 Relevance of Data

In [36]:
X = df_cats['title']
y = df_cats['subreddit']

In [37]:
cvec = CountVectorizer(ngram_range=(1, 1), stop_words='english')
X = cvec.fit_transform(X)

In [38]:
df_cats_title = pd.DataFrame(X.todense(), columns = cvec.get_feature_names_out())

In [39]:
df_cats_title.sum().sort_values(ascending=False).head(10)

cat       2899
cats       690
new        614
just       500
like       429
year       428
little     402
kitten     345
old        305
kitty      290
dtype: int64

In [40]:
X = df_dogs['title']
y = df_dogs['subreddit']

In [41]:
cvec = CountVectorizer(ngram_range=(1, 1), stop_words='english')
X = cvec.fit_transform(X)

In [42]:
df_dogs_title = pd.DataFrame(X.todense(), columns = cvec.get_feature_names_out())

In [21]:
df_dogs_title.sum().sort_values(ascending=False).head(10)

dog       5264
dogs      1111
help       878
puppy      618
old        425
advice     397
need       337
breed      314
does       304
new        299
dtype: int64

## <u>Relevance of the top 10 Highest Occurrence Words</u>
* The relevant top words for the cats subreddit are cat, cats, kitty and kitten.
* The relevant top words for the dogs subreddit are dog, dogs, puppy and breed.

The remaining top words for cats and dogs are too general and can appear in both subreddits. After lemmatization, we will probably be left with 2 relevant words for cats and 3 for dogs. Although this suggests that it will be harder to identify which subreddit a post comes from, general words can also be learned by the model. For example, the word love might indicate cat owners posting about how they love their cats, while the word food might indicate that dog owners are more concerned with what they feed their dogs.
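
The overlap between the two top-word lists can be made explicit with set operations. The sketch below swaps in `collections.Counter` with a simple regex tokenizer in place of `CountVectorizer`, so its counts will not match the outputs above exactly; the titles and the one-word stop list are invented for illustration.

```python
import re
from collections import Counter


def top_words(titles, n=10, stop_words=frozenset()):
    """Return the n most frequent lowercase word tokens across titles."""
    counts = Counter()
    for title in titles:
        tokens = re.findall(r'[a-z]+', title.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return [word for word, _ in counts.most_common(n)]


# Invented titles, for illustration only.
cat_titles = ['my new cat', 'old cat new home', 'new kitten cat', 'old kitten']
dog_titles = ['my new dog', 'old dog needs help', 'puppy help dog', 'new puppy']

stop = frozenset({'my'})  # hypothetical stop list
top_cats = set(top_words(cat_titles, n=4, stop_words=stop))
top_dogs = set(top_words(dog_titles, n=4, stop_words=stop))

print('shared:', sorted(top_cats & top_dogs))
print('cats only:', sorted(top_cats - top_dogs))
print('dogs only:', sorted(top_dogs - top_cats))
```

Words in the shared set are the general ones that appear in both subreddits, while the two difference sets hold the words that are distinctive for each.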

# 3.0 Storage Optimization

An initial scrape of 100 posts is done and a dataframe is created for each subreddit; a for loop then scrapes the remaining 9,900 posts and appends them to the dataframe. A function is not required as we are only repeating the process twice.
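
Growing a dataframe inside the loop builds a new dataframe on every iteration. An alternative, sketched here under the assumption that all pages fit in memory, is to collect the raw post dictionaries in a list and build the dataframe once at the end. The `pages` data is invented for illustration.

```python
import pandas as pd

# Invented pages of posts; real pages would come from the API responses.
pages = [
    [{'id': 'a', 'title': 'first'}, {'id': 'b', 'title': 'second'}],
    [{'id': 'c', 'title': 'third', 'selftext': 'body text'}],
]

# Collect the raw post dicts first, then build the dataframe once.
all_posts = []
for page in pages:
    all_posts.extend(page)

df = pd.DataFrame(all_posts)
print(df.shape)  # (3, 3)
```

Posts that lack a field, such as `selftext` in the first page here, get a missing value in that column, which matches how the per-page dataframes combine in the loops above.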

# 4.0 Server Load

In [22]:
print(f'The total execution time to scrape 9900 posts from cats subreddit is {cats_total_time/60:0.2f} minutes\n and the mean time for each scrape is {cats_total_time/99:0.2f} seconds')
print(f'The total execution time to scrape 9900 posts from dogs subreddit is {dogs_total_time/60:0.2f} minutes\n and the mean time for each scrape is {dogs_total_time/99:0.2f} seconds')

The total execution time to scrape 9900 posts from cats subreddit is 10.65 minutes
 and the mean time for each scrape is 6.45 seconds
The total execution time to scrape 9900 posts from dogs subreddit is 9.03 minutes
 and the mean time for each scrape is 5.47 seconds


The mean time to scrape 100 posts is under 7 seconds for both subreddits, and the total time to scrape 9,900 posts is under 11 minutes for each. Because the mean time per request is reasonably small, this suggests that the server was not overloaded.

A 2-second pause is also added at the end of each loop iteration to delay the next request and avoid taxing the server.
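
The timing and the pause can be separated from the request logic. The following is a minimal sketch, not the notebook's own code: `timed_calls` is a hypothetical helper, and unlike the loops above it pauses only between calls rather than after every call. The `sleep` argument is replaced with a recorder here so that the example runs instantly.

```python
import time


def timed_calls(calls, delay, sleep=time.sleep, clock=time.perf_counter):
    """Run each callable, pausing `delay` seconds between them.

    Returns the results and the time spent inside each call (pauses excluded).
    """
    results, durations = [], []
    for i, call in enumerate(calls):
        if i > 0:
            sleep(delay)  # pause before every call except the first
        start = clock()
        results.append(call())
        durations.append(clock() - start)
    return results, durations


# Illustration with trivial callables and a recorded, non-blocking sleep.
pauses = []
results, durations = timed_calls(
    [lambda: 'page 1', lambda: 'page 2', lambda: 'page 3'],
    delay=2,
    sleep=pauses.append,
)
mean_time = sum(durations) / len(durations)
print(results, pauses)  # ['page 1', 'page 2', 'page 3'] [2, 2]
```

Keeping the pause outside the timed region means the mean reflects only the time spent in each call, as in the measurements above.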

# 5.0 Saving Data

In [23]:
# Save the datasets to csv so that we do not need to scrape again.
df_cats.to_csv('./dataset/cats.csv', index = False)
df_dogs.to_csv('./dataset/dogs.csv', index = False)