# Webscraping (Android Subreddit)

In this notebook we scrape posts from the Android subreddit (r/Android) on Reddit.com, using the JSON listing that Reddit serves at /r/Android/new.json.
The notebook has three main steps:

1. Initial Webscraping
2. Full Webscraping
3. Webscraping Evaluation and Export

In [1]:
import pandas as pd
import requests
import time
import random
from bs4 import BeautifulSoup

## 1.  Initial Webscraping

In [5]:
url = 'https://www.reddit.com/r/Android/new.json'
res = requests.get(url, headers={'User-agent': 'Zaini Inc'})

In [6]:
res.status_code

200

In [7]:
reddit_dict = res.json()

In [5]:
print(reddit_dict)

{'kind': 'Listing', 'data': {'modhash': '', 'dist': 25, 'children': [{'kind': 't3', 'data': {'approved_at_utc': None, 'subreddit': 'iphone', 'selftext': 'Welcome to the Daily Tech Support thread for /r/iphone. \n\nHave a question you need answered? Ask away! Please remember to adhere to our rules, which can be found in the sidebar. As usual, if you have a serious issue with the subreddit, please contact [the moderators directly.](https://www.reddit.com/message/compose?to=%2Fr%2Fiphone)\n\nPlease be informed that any questions about bypassing iCloud lock,  or anything similar that may infer that you are trying to get access to a locked iPhone, are no longer allowed and will be removed. Thank you.\n\nCheck our [Tech Support FAQ page](https://www.reddit.com/r/iphone/wiki/support-faq)\n\nJoin our Discord room for support:\n\n[Discord](https://discord.gg/iphone)\n\n**Note: Comments are sorted by /new for your convenience.**\n\nThis is the previous [archive](https://www.reddit.com/r/iphone/s

**Initial Run**

We have carried out our initial webscraping and received a response. The result is a deeply nested dictionary, so next we drill into it to find the exact fields we need.

In [8]:
reddit_dict.keys()

dict_keys(['kind', 'data'])

In [9]:
reddit_dict['data']

{'modhash': '',
 'dist': 25,
 'children': [{'kind': 't3',
   'data': {'approved_at_utc': None,
    'subreddit': 'Android',
    'selftext': '',
    'author_fullname': 't2_1unjxir5',
    'saved': False,
    'mod_reason_title': None,
    'gilded': 0,
    'clicked': False,
    'title': 'Samsung trade in program is back again for Baltic countries (but not as good as before)',
    'link_flair_richtext': [],
    'subreddit_name_prefixed': 'r/Android',
    'hidden': False,
    'pwls': 6,
    'link_flair_css_class': None,
    'downs': 0,
    'thumbnail_height': 140,
    'top_awarded_type': None,
    'hide_score': True,
    'name': 't3_k1w6p8',
    'quarantine': False,
    'link_flair_text_color': 'dark',
    'upvote_ratio': 0.5,
    'author_flair_background_color': None,
    'subreddit_type': 'public',
    'ups': 0,
    'total_awards_received': 0,
    'media_embed': {},
    'thumbnail_width': 140,
    'author_flair_template_id': None,
    'is_original_content': False,
    'user_reports': [],
  

In [10]:
reddit_dict['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [11]:
reddit_dict['data']['children']

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'Android',
   'selftext': '',
   'author_fullname': 't2_1unjxir5',
   'saved': False,
   'mod_reason_title': None,
   'gilded': 0,
   'clicked': False,
   'title': 'Samsung trade in program is back again for Baltic countries (but not as good as before)',
   'link_flair_richtext': [],
   'subreddit_name_prefixed': 'r/Android',
   'hidden': False,
   'pwls': 6,
   'link_flair_css_class': None,
   'downs': 0,
   'thumbnail_height': 140,
   'top_awarded_type': None,
   'hide_score': True,
   'name': 't3_k1w6p8',
   'quarantine': False,
   'link_flair_text_color': 'dark',
   'upvote_ratio': 0.5,
   'author_flair_background_color': None,
   'subreddit_type': 'public',
   'ups': 0,
   'total_awards_received': 0,
   'media_embed': {},
   'thumbnail_width': 140,
   'author_flair_template_id': None,
   'is_original_content': False,
   'user_reports': [],
   'secure_media': None,
   'is_reddit_media_domain': False,
   'is_meta': 

In [12]:
reddit_dict['data']['children'][1]['data']

{'approved_at_utc': None,
 'subreddit': 'Android',
 'selftext': '',
 'author_fullname': 't2_56a46frx',
 'saved': False,
 'mod_reason_title': None,
 'gilded': 0,
 'clicked': False,
 'title': 'Vivo unveils Origin Os - Will it be less polarizing than the previous one?',
 'link_flair_richtext': [],
 'subreddit_name_prefixed': 'r/Android',
 'hidden': False,
 'pwls': 6,
 'link_flair_css_class': None,
 'downs': 0,
 'thumbnail_height': 78,
 'top_awarded_type': None,
 'hide_score': False,
 'name': 't3_k1vlqc',
 'quarantine': False,
 'link_flair_text_color': 'dark',
 'upvote_ratio': 0.92,
 'author_flair_background_color': None,
 'subreddit_type': 'public',
 'ups': 21,
 'total_awards_received': 0,
 'media_embed': {},
 'thumbnail_width': 140,
 'author_flair_template_id': None,
 'is_original_content': False,
 'user_reports': [],
 'secure_media': None,
 'is_reddit_media_domain': False,
 'is_meta': False,
 'category': None,
 'secure_media_embed': {},
 'link_flair_text': None,
 'can_mod_post': False,


Here we have managed to find and isolate a single posting on the subreddit.
Next, we try to process the data into a dataframe for easier manipulation.

In [13]:
posts = [p['data'] for p in reddit_dict['data']['children']]  

In [14]:
initialdf = pd.DataFrame(posts)

In [15]:
initialdf.shape

(25, 108)
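The same flattening can be sketched on a miniature listing, keeping only a handful of fields rather than all 108 columns. This is an illustrative sketch: `sample_listing` and `listing_to_frame` are hypothetical names, and the sample values are placeholders shaped like the response shown above.

```python
import pandas as pd

# Hypothetical miniature of the dict returned by res.json(); the real
# listing holds 25 children with many more fields each.
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_k1vlqc",
        "children": [
            {"kind": "t3", "data": {"name": "t3_k1w6p8", "title": "First post",
                                    "selftext": "", "created_utc": 1606460000.0}},
            {"kind": "t3", "data": {"name": "t3_k1vlqc", "title": "Second post",
                                    "selftext": "Body text", "created_utc": 1606457000.0}},
        ],
    },
}


def listing_to_frame(listing, fields=("name", "title", "selftext", "created_utc")):
    """Flatten a listing dict into a DataFrame holding only the chosen fields."""
    rows = [
        # .get() leaves a missing field as None instead of raising KeyError
        {field: child["data"].get(field) for field in fields}
        for child in listing["data"]["children"]
    ]
    return pd.DataFrame(rows, columns=list(fields))


small_df = listing_to_frame(sample_listing)
print(small_df.shape)
```

Selecting fields up front keeps the dataframe narrow, at the cost of having to know in advance which fields matter.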

In [16]:
initialdf.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id
0,,Android,,t2_1unjxir5,False,,0,False,Samsung trade in program is back again for Bal...,[],...,/r/Android/comments/k1w6p8/samsung_trade_in_pr...,all_ads,False,https://www.samsung.com/lt/grazinimas/,2281497,1606460000.0,0,,False,
1,,Android,,t2_56a46frx,False,,0,False,Vivo unveils Origin Os - Will it be less polar...,[],...,/r/Android/comments/k1vlqc/vivo_unveils_origin...,all_ads,False,http://www.techradar.com/news/vivo-unveils-ori...,2281497,1606457000.0,0,,False,
2,,Android,,t2_b8kyv,False,,0,False,Nokia 2.4 hands-on - GSMArena,[],...,/r/Android/comments/k1gn34/nokia_24_handson_gs...,all_ads,False,https://www.gsmarena.com/nokia_24_handson-news...,2281497,1606403000.0,0,,False,
3,,Android,After recently acquiring a 1TB microSD card in...,t2_cqcpv2h,False,,0,False,"Are high-speed/high-capacity (A1, A2, U3) micr...",[],...,/r/Android/comments/k1o0rz/are_highspeedhighca...,all_ads,False,https://www.reddit.com/r/Android/comments/k1o0...,2281497,1606427000.0,0,,False,
4,,Android,,t2_5l3x1fsc,False,,0,False,Android App Bundles could have a drawback for ...,[],...,/r/Android/comments/k1bqzi/android_app_bundles...,all_ads,False,https://www.slashgear.com/android-app-bundles-...,2281497,1606380000.0,0,,False,


**Columns**

The dataframe has 108 columns. We need to go through them to find the features that we want.

In [17]:
columns = initialdf.columns

In [51]:
initialdf[columns[100:]].head()

Unnamed: 0,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id
0,False,https://www.samsung.com/lt/grazinimas/,2281497,1606460000.0,0,,False,
1,False,http://www.techradar.com/news/vivo-unveils-ori...,2281497,1606457000.0,0,,False,
2,False,https://www.gsmarena.com/nokia_24_handson-news...,2281497,1606403000.0,0,,False,
3,False,https://www.reddit.com/r/Android/comments/k1o0...,2281497,1606427000.0,0,,False,
4,False,https://www.slashgear.com/android-app-bundles-...,2281497,1606380000.0,0,,False,


In [35]:
initialdf['created_utc']  
# created_utc is the time the post was created (Unix epoch seconds, UTC); we will need it later on

0     1.606489e+09
1     1.606486e+09
2     1.606432e+09
3     1.606456e+09
4     1.606409e+09
5     1.606419e+09
6     1.606430e+09
7     1.606442e+09
8     1.606436e+09
9     1.606420e+09
10    1.606436e+09
11    1.606432e+09
12    1.606430e+09
13    1.606418e+09
14    1.606405e+09
15    1.606404e+09
16    1.606387e+09
17    1.606399e+09
18    1.606398e+09
19    1.606392e+09
20    1.606377e+09
21    1.606371e+09
22    1.606359e+09
23    1.606360e+09
24    1.606359e+09
Name: created, dtype: float64
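Epoch seconds are hard to read, so a conversion to timestamps is likely to be useful later. A minimal sketch, assuming the values are Unix seconds in UTC (the three sample values are taken from the `head()` output above; `created` and `created_dt` are illustrative names):

```python
import pandas as pd

# Sample created_utc values in Unix epoch seconds
created = pd.Series([1606460000.0, 1606457000.0, 1606403000.0], name="created_utc")

# unit="s" interprets the numbers as seconds; utc=True makes the result timezone-aware
created_dt = pd.to_datetime(created, unit="s", utc=True)
print(created_dt)
```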

In [19]:
text_cols = []

for column in columns:
    if 'text' in column:
        text_cols.append(column)

In [20]:
text_cols

['selftext',
 'link_flair_richtext',
 'link_flair_text_color',
 'link_flair_text',
 'author_flair_richtext',
 'selftext_html',
 'author_flair_text',
 'author_flair_text_color']
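The same substring filter can be written as a list comprehension. The sketch below uses a short, hand-picked list of column names from the dataframe as a stand-in for `initialdf.columns`, and it also shows a limitation of the approach: `title` holds text but does not contain the substring 'text', so it is not picked up.

```python
# Stand-in for initialdf.columns, restricted to a few names seen above
sample_columns = ["selftext", "title", "link_flair_text", "created_utc", "author_flair_text"]

# Keep every column whose name contains the substring 'text'
sample_text_cols = [column for column in sample_columns if "text" in column]
print(sample_text_cols)
```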

In [20]:
initialdf[['author_fullname','selftext']]

Unnamed: 0,author_fullname,selftext
0,t2_6l4z3,Welcome to the Daily Tech Support thread for /...
1,t2_frdyc,I have hooked up a MagSafe charger in my car a...
2,t2_4r7upir5,"Hi guys, if I was to buy a iphone second hand ..."
3,t2_1qd43scz,"Hello everyone, I have the iPhone 11 and until..."
4,t2_4ikub,Proud owner of a new iPhone 11 Pro. First tim...
5,t2_4yrsbgdq,Is there a way to lock the screen so that it s...
6,t2_4urhkkj1,"My current iphone works fine, but I was thinki..."
7,t2_4a5l8jkf,
8,t2_7wdg9wl0,I am really excited with iOS 14's new Back Tap...
9,t2_8dxjwtqt,


**Initial Webscraping**

We have been able to locate the important data features we need in the raw data we scraped from reddit, and we are also able to transform the data into a more manageable format. 

We now move on to carry out a larger scale webscraping to get more data for our analysis.

## 2.  Full Webscraping

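To collect enough data for analysis, we request the listing repeatedly in a for loop. Each response carries an 'after' value that identifies the last post returned; appending it to the URL as the 'after' query parameter fetches the next page of posts.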

In [52]:
url = 'https://www.reddit.com/r/Android/new.json'

In [53]:
posts = []
after = None

for a in range(50):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Zaini Inc 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(6,10)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/Android/new.json
8
https://www.reddit.com/r/Android/new.json?after=t3_k0ybf6
7
https://www.reddit.com/r/Android/new.json?after=t3_jyu0c2
9
https://www.reddit.com/r/Android/new.json?after=t3_jx8hv0
10
https://www.reddit.com/r/Android/new.json?after=t3_jwh3du
7
https://www.reddit.com/r/Android/new.json?after=t3_jv53go
9
https://www.reddit.com/r/Android/new.json?after=t3_jtefir
8
https://www.reddit.com/r/Android/new.json?after=t3_js1sg3
7
https://www.reddit.com/r/Android/new.json?after=t3_jqa25l
6
https://www.reddit.com/r/Android/new.json?after=t3_jox1k7
10
https://www.reddit.com/r/Android/new.json?after=t3_jnbmjl
9
https://www.reddit.com/r/Android/new.json?after=t3_jmau95
8
https://www.reddit.com/r/Android/new.json?after=t3_jksv12
9
https://www.reddit.com/r/Android/new.json?after=t3_jjryih
10
https://www.reddit.com/r/Android/new.json?after=t3_jiv5h6
10
https://www.reddit.com/r/Android/new.json?after=t3_jgx91h
6
https://www.reddit.com/r/Android/new.json?after=t3_j

In [54]:
len(posts)

1225
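The loop above can be factored into a function whose page source is passed in, which makes the pagination logic testable without touching the network. This is a sketch, not the notebook's code: `scrape_listing`, `fetch_page`, and `fake_pages` are hypothetical names, and it assumes that a null 'after' value marks the last page, in which case it stops rather than requesting the first page again.

```python
import random
import time


def scrape_listing(fetch_page, max_pages=50, min_sleep=6, max_sleep=10, sleep=time.sleep):
    """Collect posts page by page by following the 'after' token.

    fetch_page is a hypothetical callable that takes the current 'after'
    token (None for the first page) and returns the decoded JSON listing.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:
            # Assumption: no token means no further pages, so stop here
            break
        # Random pause between requests, as in the loop above
        sleep(random.randint(min_sleep, max_sleep))
    return posts


# Fake two-page source standing in for the real HTTP request
fake_pages = {
    None: {"data": {"children": [{"data": {"name": "t3_a"}}], "after": "t3_a"}},
    "t3_a": {"data": {"children": [{"data": {"name": "t3_b"}}], "after": None}},
}

# sleep is replaced by a no-op so the demonstration runs instantly
demo_posts = scrape_listing(fake_pages.get, max_pages=5, sleep=lambda seconds: None)
print(len(demo_posts))
```

For real scraping, `fetch_page` could wrap the request used in the loop above, for example `lambda after: requests.get(url, params={'after': after}, headers={'User-agent': 'Zaini Inc 1.0'}).json()`.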

## 3.  Webscraping Evaluation and Export

### 3.0  Export Data

We immediately export the raw webscraped data to a CSV file, so that it is saved before any further processing.

In [55]:
df = pd.DataFrame(posts)

In [56]:
df.to_csv('Android-posts.csv')
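When the file is read back, two details of the CSV round trip are worth knowing. By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column, and empty strings come back as missing values. A small sketch on a made-up two-row frame, using an in-memory buffer instead of a file (`demo`, `with_index`, and `without_index` are illustrative names):

```python
import io

import pandas as pd

demo = pd.DataFrame({"title": ["First post", "Second post"],
                     "selftext": ["", "Body text"]})

# Default export: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(demo.to_csv()))

# index=False writes only the data columns
without_index = pd.read_csv(io.StringIO(demo.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
# The empty selftext of the first row is read back as NaN
print(without_index["selftext"].isna().tolist())
```

Whether to pass `index=False` depends on how the next notebook reads the file, so the export above is left as it is.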

### 3.1  Evaluate Data

We need to evaluate our data to see whether it is adequate for analysis.

In [57]:
df[['author_fullname','title','selftext', 'subreddit']]

Unnamed: 0,author_fullname,title,selftext,subreddit
0,t2_1unjxir5,Samsung trade in program is back again for Bal...,,Android
1,t2_56a46frx,Vivo unveils Origin Os - Will it be less polar...,,Android
2,t2_b8kyv,Nokia 2.4 hands-on - GSMArena,,Android
3,t2_cqcpv2h,"Are high-speed/high-capacity (A1, A2, U3) micr...",After recently acquiring a 1TB microSD card in...,Android
4,t2_5l3x1fsc,Android App Bundles could have a drawback for ...,,Android
...,...,...,...,...
1220,t2_qlf0lq6,Google Assistant adds settings page to select ...,,Android
1221,t2_k75n8,OnePlus Buds Z launched at ₹3190 ($44),,Android
1222,t2_5nm7hbhd,Is it me or the gap between iPhone and Android...,(about hardware features) I mean I used to hes...,Android
1223,t2_234mq3b6,I hate how Apple pulls moves like these and in...,1) Headphone jack gone. Headphones are now wir...,Android


**Missing Data**

The first thing we notice is that many posts have an empty selftext column. However, every post still has text in the title column, which is useful for analysis.

We want to combine the selftext and title columns, but first we carry out some data cleaning.

In [58]:
df['selftext'][1221]

''

In [38]:
df['selftext'][2]

'Hi guys, if I was to buy a iphone second hand but brand new in sealed condition would the person selling me the phone be able to claim insurance and block the phone that was sold to me, essentially leaving me with a useless phone? Please help.'

In [59]:
# replace empty or whitespace-only selftext with the placeholder 'NA'
df['selftext'] = df['selftext'].replace(r'^\s*$', 'NA', regex=True)

In [60]:
df['title/text'] = df['selftext'] + df['title']

In [61]:
df[['author_fullname','title','selftext', 'title/text', 'subreddit']]

Unnamed: 0,author_fullname,title,selftext,title/text,subreddit
0,t2_1unjxir5,Samsung trade in program is back again for Bal...,,NASamsung trade in program is back again for B...,Android
1,t2_56a46frx,Vivo unveils Origin Os - Will it be less polar...,,NAVivo unveils Origin Os - Will it be less pol...,Android
2,t2_b8kyv,Nokia 2.4 hands-on - GSMArena,,NANokia 2.4 hands-on - GSMArena,Android
3,t2_cqcpv2h,"Are high-speed/high-capacity (A1, A2, U3) micr...",After recently acquiring a 1TB microSD card in...,After recently acquiring a 1TB microSD card in...,Android
4,t2_5l3x1fsc,Android App Bundles could have a drawback for ...,,NAAndroid App Bundles could have a drawback fo...,Android
...,...,...,...,...,...
1220,t2_qlf0lq6,Google Assistant adds settings page to select ...,,NAGoogle Assistant adds settings page to selec...,Android
1221,t2_k75n8,OnePlus Buds Z launched at ₹3190 ($44),,NAOnePlus Buds Z launched at ₹3190 ($44),Android
1222,t2_5nm7hbhd,Is it me or the gap between iPhone and Android...,(about hardware features) I mean I used to hes...,(about hardware features) I mean I used to hes...,Android
1223,t2_234mq3b6,I hate how Apple pulls moves like these and in...,1) Headphone jack gone. Headphones are now wir...,1) Headphone jack gone. Headphones are now wir...,Android
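In the table above, the 'NA' placeholder is glued directly onto the title ('NASamsung ...') because the two strings are concatenated without a separator. An alternative, shown here only as a sketch and not as the notebook's method, is to leave empty bodies empty and join the parts with a space. The column name `title_text` and the sample rows are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["Nokia 2.4 hands-on - GSMArena", "Are fast microSD cards worth it?"],
    "selftext": ["", "After recently acquiring a 1TB card..."],
})

# Treat missing bodies as empty strings and trim surrounding whitespace
body = demo["selftext"].fillna("").str.strip()

# Join title and body with a space; the final strip removes the trailing
# space left behind when the body is empty
demo["title_text"] = (demo["title"].str.strip() + " " + body).str.strip()
print(demo["title_text"].tolist())
```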


In [62]:
df_filtered = df.drop_duplicates(subset = ['title/text'], keep='first')

In [63]:
df_filtered.shape

(725, 110)
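Duplicates can also be dropped on the post identifier instead of on the combined text. Each post in the response carries a 'name' field (for example 't3_k1w6p8'), and the sketch below assumes it uniquely identifies a post. The sample frame is made up, and counting the dropped rows is an addition for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "name": ["t3_k1w6p8", "t3_k1vlqc", "t3_k1w6p8"],
    "title": ["Post A", "Post B", "Post A"],
})

rows_before = len(demo)
# Keep the first occurrence of each post ID and renumber the rows
deduped = demo.drop_duplicates(subset=["name"], keep="first").reset_index(drop=True)
print(f"dropped {rows_before - len(deduped)} of {rows_before} rows")
```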

In [65]:
df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id,author_cakeday,title/text
0,,Android,,t2_1unjxir5,False,,0,False,Samsung trade in program is back again for Bal...,[],...,False,https://www.samsung.com/lt/grazinimas/,2281500,1606460000.0,0,,False,,,NASamsung trade in program is back again for B...
1,,Android,,t2_56a46frx,False,,0,False,Vivo unveils Origin Os - Will it be less polar...,[],...,False,http://www.techradar.com/news/vivo-unveils-ori...,2281500,1606457000.0,0,,False,,,NAVivo unveils Origin Os - Will it be less pol...
2,,Android,,t2_b8kyv,False,,0,False,Nokia 2.4 hands-on - GSMArena,[],...,False,https://www.gsmarena.com/nokia_24_handson-news...,2281500,1606403000.0,0,,False,,,NANokia 2.4 hands-on - GSMArena
3,,Android,After recently acquiring a 1TB microSD card in...,t2_cqcpv2h,False,,0,False,"Are high-speed/high-capacity (A1, A2, U3) micr...",[],...,False,https://www.reddit.com/r/Android/comments/k1o0...,2281500,1606427000.0,0,,False,,,After recently acquiring a 1TB microSD card in...
4,,Android,,t2_5l3x1fsc,False,,0,False,Android App Bundles could have a drawback for ...,[],...,False,https://www.slashgear.com/android-app-bundles-...,2281500,1606380000.0,0,,False,,,NAAndroid App Bundles could have a drawback fo...


After combining the text columns and dropping duplicates, 725 of the 1225 scraped posts remain, which is still enough for our analysis.

We carry out Exploratory Data Analysis in the next notebook.