# 1. Webscraping

In this section, we scrape Reddit, a social news site where users create and share content. Users post in topic-based communities (subreddits), and members of a community interact by commenting under each post. Data was retrieved from two subreddits (r/DotA2 and r/leagueoflegends) as JSON objects, and the information for each post was extracted and saved to a CSV file.

- This notebook was re-run to scrape data up to 19/3/2023, to demonstrate Reddit webscraping.
- Subsequent notebooks use the 13/3 dataset for data cleaning and modelling.

In [1]:
# import necessary Python modules

import requests
import pandas as pd
import datetime
import time

In [2]:
# instantiate API variables

url = 'https://www.reddit.com/'
post_nature = ['','hot/','new/','top/','rising/','best/']
limit_num = 110  # note: Reddit listings cap 'limit' at 100 items per request
headers = {'User-Agent': 'Mozilla/5.0'}

dota2_url = []  
lol_url = []

# retrieve DotA2 and LoL subreddit json

for sort in post_nature:
    dota2_sub = url + 'r/DotA2/' + sort
    lol_sub = url + 'r/leagueoflegends/' + sort
    dota2_json = dota2_sub + '.json?limit=' + str(limit_num) + '&after=None'
    lol_json = lol_sub + '.json?limit=' + str(limit_num) + '&after=None'

    # follow the 'after' token for up to 20 pages of each DotA2 listing
    for _ in range(20):
        # a URL that was already collected means the listing has no further pages
        if dota2_json in dota2_url:
            break
        dota2_res = requests.get(dota2_json, headers=headers)
        if dota2_res.status_code == 200:
            dota2_url.append(dota2_json)
            dota2_json = dota2_sub + '.json?limit=' + str(limit_num) + '&after=' + str(dota2_res.json()['data']['after'])
        else:
            last_ok = dota2_url[-1] if dota2_url else 'none'  # avoid IndexError on an empty list
            print(f'Connection failed, please try again. Last successful {dota2_sub[len(url)+2:-1]} URL was {last_ok}.')
            break
        time.sleep(6)  # pause between requests to reduce the chance of being rate-limited

    # same procedure for each LoL listing
    for _ in range(20):
        if lol_json in lol_url:
            break
        lol_res = requests.get(lol_json, headers=headers)
        if lol_res.status_code == 200:
            lol_url.append(lol_json)
            lol_json = lol_sub + '.json?limit=' + str(limit_num) + '&after=' + str(lol_res.json()['data']['after'])
        else:
            last_ok = lol_url[-1] if lol_url else 'none'  # avoid IndexError on an empty list
            print(f'Connection failed, please try again. Last successful {lol_sub[len(url)+2:-1]} URL was {last_ok}.')
            break
        time.sleep(6)  # pause between requests to reduce the chance of being rate-limited

In [3]:
# example of Dota2 listing URLs (JSON endpoints)

dota2_url[0:5]+dota2_url[-5:]

['https://www.reddit.com/r/DotA2/.json?limit=110&after=None',
 'https://www.reddit.com/r/DotA2/.json?limit=110&after=t3_11ukk6e',
 'https://www.reddit.com/r/DotA2/.json?limit=110&after=t3_11u8dk9',
 'https://www.reddit.com/r/DotA2/.json?limit=110&after=t3_11tnhdd',
 'https://www.reddit.com/r/DotA2/.json?limit=110&after=t3_11swge5',
 'https://www.reddit.com/r/DotA2/best/.json?limit=110&after=t3_11rsp9p',
 'https://www.reddit.com/r/DotA2/best/.json?limit=110&after=t3_11riddj',
 'https://www.reddit.com/r/DotA2/best/.json?limit=110&after=t3_11qpuv1',
 'https://www.reddit.com/r/DotA2/best/.json?limit=110&after=t3_11phfqd',
 'https://www.reddit.com/r/DotA2/best/.json?limit=110&after=t3_11oqh7g']

In [4]:
# example of LoL listing URLs (JSON endpoints)

lol_url[0:5]+lol_url[-5:]

['https://www.reddit.com/r/leagueoflegends/.json?limit=110&after=None',
 'https://www.reddit.com/r/leagueoflegends/.json?limit=110&after=t3_11twdvu',
 'https://www.reddit.com/r/leagueoflegends/.json?limit=110&after=t3_11usoff',
 'https://www.reddit.com/r/leagueoflegends/.json?limit=110&after=t3_11ts3zt',
 'https://www.reddit.com/r/leagueoflegends/.json?limit=110&after=t3_11t9t0u',
 'https://www.reddit.com/r/leagueoflegends/best/.json?limit=110&after=t3_11u5ljx',
 'https://www.reddit.com/r/leagueoflegends/best/.json?limit=110&after=t3_11slndb',
 'https://www.reddit.com/r/leagueoflegends/best/.json?limit=110&after=t3_11su4p1',
 'https://www.reddit.com/r/leagueoflegends/best/.json?limit=110&after=t3_11s8xev',
 'https://www.reddit.com/r/leagueoflegends/best/.json?limit=110&after=t3_11s2tnl']

In [5]:
# Dota2 post url

dota2_post_url=[]

for i in dota2_url:
    dota2_info=requests.get(i,headers=headers)
    dota2_post_url+=[url[:-1]+x['data']['permalink'] for x in dota2_info.json()['data']['children']]

In [6]:
# LoL post url

lol_post_url=[]

for i in lol_url:
    lol_info=requests.get(i,headers=headers)
    lol_post_url+=[url[:-1]+x['data']['permalink'] for x in lol_info.json()['data']['children']]
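
The two cells above rely on the shape of a listing response: the posts sit under `data -> children`, and each child carries its own `data` dictionary with a `permalink`. The sketch below shows that extraction on a hand-written sample. The sample listing, its post IDs, and the helper name `post_urls_from_listing` are made up for illustration and are not real Reddit data.

```python
BASE_URL = "https://www.reddit.com"


def post_urls_from_listing(listing: dict, base_url: str = BASE_URL) -> list:
    """Build full post URLs from the permalinks in one listing page."""
    return [
        base_url + child["data"]["permalink"]
        for child in listing["data"]["children"]
    ]


# Hypothetical listing page with two posts (IDs and titles are invented).
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_example2",
        "children": [
            {"kind": "t3", "data": {"permalink": "/r/DotA2/comments/example1/first_post/"}},
            {"kind": "t3", "data": {"permalink": "/r/DotA2/comments/example2/second_post/"}},
        ],
    },
}

sample_urls = post_urls_from_listing(sample_listing)
for u in sample_urls:
    print(u)
```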

In [7]:
# example of Dota2 posts url output

list(set(dota2_post_url))[0:5]+list(set(dota2_post_url))[-5:]

['https://www.reddit.com/r/DotA2/comments/11u3yct/io_skin/',
 'https://www.reddit.com/r/DotA2/comments/11rm669/dpc_2023_tour_2_division_i_march_15_matches/',
 'https://www.reddit.com/r/DotA2/comments/11oi4gp/what_hero_has_worst_synergies_with_each_other/',
 'https://www.reddit.com/r/DotA2/comments/11uentd/is_valve_going_to_investigate_other_regions_for/',
 'https://www.reddit.com/r/DotA2/comments/11rlbtc/oldg/',
 'https://www.reddit.com/r/DotA2/comments/11tz68w/who_would_have_thought_this_would_happen_a_year/',
 'https://www.reddit.com/r/DotA2/comments/11v4dar/valve_please_add_the_shard_to_the_sequential_item/',
 'https://www.reddit.com/r/DotA2/comments/11ot29s/wisp_enjoyer/',
 'https://www.reddit.com/r/DotA2/comments/11ufjme/tumblers_toy_outplay_i_thought_it_was_kinda_funny/',
 'https://www.reddit.com/r/DotA2/comments/11oqfut/update_old_g_kitrak_and_resolut1on/']

In [8]:
# example of LoL posts url output

list(set(lol_post_url))[0:5]+list(set(lol_post_url))[-5:]

['https://www.reddit.com/r/leagueoflegends/comments/11qzzta/climbing_in_elo_hell/',
 'https://www.reddit.com/r/leagueoflegends/comments/11uuwob/azael_to_miss_this_weeks_the_dive_and_at_least/',
 'https://www.reddit.com/r/leagueoflegends/comments/11u3w44/want_to_learn_jax/',
 'https://www.reddit.com/r/leagueoflegends/comments/11pjesp/would_pykes_your_cut_be_a_good_addition_to_every/',
 'https://www.reddit.com/r/leagueoflegends/comments/11rfnen/nashor_tooth_used_to_have_ability_haste_then_it/',
 'https://www.reddit.com/r/leagueoflegends/comments/11r9af2/yuumi_shouldve_gotten_tools_that_encourage/',
 'https://www.reddit.com/r/leagueoflegends/comments/11rh05d/help_with_queue_timer_and_accept_match_dont_show/',
 'https://www.reddit.com/r/leagueoflegends/comments/11ucwf4/summoners_sanctuary_is_looking_for_new_members/',
 'https://www.reddit.com/r/leagueoflegends/comments/11thq90/dom_and_yamato_talk_about_blaber/',
 'https://www.reddit.com/r/leagueoflegends/comments/11s5pe4/old_ryze_otp_playe

In [9]:
print(f"There are a total of {len(set(dota2_post_url))} Dota 2 posts and {len(set(lol_post_url))} League of Legends (LoL) posts.")

There are a total of 1022 Dota 2 posts and 990 League of Legends (LoL) posts.
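
The same post can appear under several listings (hot, new, top and so on), which is why the counts above are taken over `set(...)`. A set has no guaranteed order, so an alternative is to de-duplicate once while keeping the order in which posts were first seen. A minimal sketch, assuming a hypothetical helper `unique_in_order` and placeholder URLs:

```python
def unique_in_order(urls):
    """Drop duplicates while keeping the first occurrence of each URL."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))


# Placeholder URLs standing in for scraped post links.
scraped = [
    "https://www.reddit.com/r/DotA2/comments/a/",
    "https://www.reddit.com/r/DotA2/comments/b/",
    "https://www.reddit.com/r/DotA2/comments/a/",
    "https://www.reddit.com/r/DotA2/comments/c/",
    "https://www.reddit.com/r/DotA2/comments/b/",
]

unique_urls = unique_in_order(scraped)
print(len(scraped), len(unique_urls))  # 5 3
```

Storing the de-duplicated list in one variable and reusing it also avoids rebuilding the set in several cells.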


The fields retrieved from the JSON objects are listed in the table below.

| Subreddit features | Details |
| :-: | :-: |
| Subreddit | Subreddit that a post belongs to |
| Title | Title of a post |
| Selftext | Contents of a post |
| Author | The username of a post's creator |
| Created_UTC | Time at which a post was created, in Unix time |
| URL | Web address of a post |

We proceed to collect the above-mentioned subreddit features for both Dota 2 and LoL posts.
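
The JSON for a single post page is a list whose first element is a listing that wraps the post itself, so the fields sit at `[0]['data']['children'][0]['data']`. The sketch below shows one way to pull the five fields from an already-parsed response, with a blank string as the fallback. The helper name `extract_post_fields` and the sample values are hypothetical.

```python
ITEMS = ["subreddit", "title", "selftext", "author", "created_utc"]


def extract_post_fields(post_json, items=ITEMS):
    """Return {field: value}, using '' for anything missing or malformed."""
    try:
        data = post_json[0]["data"]["children"][0]["data"]
    except (KeyError, IndexError, TypeError):
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {key: data.get(key, "") for key in items}


# Hypothetical parsed response: [post listing, comments listing].
sample_post_json = [
    {"data": {"children": [{"data": {
        "subreddit": "DotA2",
        "title": "Example title",
        "selftext": "",
        "author": "example_user",
        "created_utc": 1679087548.0,
    }}]}},
    {"data": {"children": []}},
]

fields = extract_post_fields(sample_post_json)
print(fields["title"], fields["created_utc"])

# An unexpected response (for example an error payload) yields blanks.
blank = extract_post_fields({"error": 404})
print(blank)
```

Returning one record per post, even when the response is unusable, keeps every feature list the same length as the list of URLs.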

In [10]:
#dota2 post features collection

dota2_post_subreddit=[]
dota2_post_title=[]
dota2_post_selftext=[]
dota2_post_author=[]
dota2_post_created_utc=[]

dota2_masterlist=[dota2_post_subreddit,dota2_post_title,dota2_post_selftext,dota2_post_author,dota2_post_created_utc]
items=['subreddit','title','selftext','author','created_utc']

for i in list(set(dota2_post_url)):
    # parse the post JSON once; on any failure fall back to an empty record
    # so that every feature list stays aligned with the URL list
    try:
        post=requests.get(i+'.json',headers=headers).json()[0]['data']['children'][0]['data']
    except Exception:
        post={}
    for j in range(len(items)):
        dota2_masterlist[j].append(post.get(items[j],''))

In [11]:
#dota2 post created in unix time

dota2_post_created_utc[0:5]

[1679087548.0, 1678854944.0, 1678535009.0, 1679114808.0, 1678852453.0]

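The time at which a post was created is given in Unix time: the number of seconds elapsed since midnight UTC on 1 January 1970, at which point Unix time was 0. We can convert Unix time into a formatted datetime string that is easier to read, as seen below. Note that `datetime.datetime.fromtimestamp` called without a `tz` argument converts to the local time zone of the machine running the notebook, so the strings below are local times rather than UTC.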
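
To get strings that do not depend on the machine's time zone, the conversion can be pinned to UTC. A minimal sketch, assuming the same format string as this notebook; `unix_to_utc_string` is a hypothetical helper, not part of the original code:

```python
from datetime import datetime, timezone

FMT = "%d/%m/%y %I:%M:%S %p"


def unix_to_utc_string(ts, fmt=FMT):
    """Format a Unix timestamp as a UTC datetime string; pass blanks through."""
    if ts == "":
        return ""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime(fmt)


first_dota2_ts = 1679087548.0  # first value of dota2_post_created_utc
print(unix_to_utc_string(first_dota2_ts))  # 17/03/23 09:12:28 PM
```

The notebook's own output for this timestamp is `18/03/23 05:12:28 AM`, eight hours ahead of the UTC value, which is consistent with the notebook having been run on a machine set to UTC+8.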

In [12]:
#dota2 post created datetime after unix time conversion

dota2_post_created_datetime=[]
for i in dota2_post_created_utc:
    if i !='':
        dota2_post_created_datetime.append(datetime.datetime.fromtimestamp(i).strftime("%d/%m/%y %I:%M:%S %p"))
    else:
        dota2_post_created_datetime+=['']
    
dota2_post_created_datetime[0:5]

['18/03/23 05:12:28 AM',
 '15/03/23 12:35:44 PM',
 '11/03/23 07:43:29 PM',
 '18/03/23 12:46:48 PM',
 '15/03/23 11:54:13 AM']

In [13]:
# create dota2 dataframe

dota2_df = {'subreddit':dota2_post_subreddit,
            'title': dota2_post_title,
             'selftext': dota2_post_selftext,
             'author':dota2_post_author,
             'create_datetime':dota2_post_created_datetime,
             'url':list(set(dota2_post_url))}
dota2_df=pd.DataFrame(dota2_df)
dota2_df

Unnamed: 0,subreddit,title,selftext,author,create_datetime,url
0,DotA2,IO skin ?,How much is the chance to get the IO skin ?,Spiconz,18/03/23 05:12:28 AM,https://www.reddit.com/r/DotA2/comments/11u3yc...
1,DotA2,DPC 2023 Tour 2: Division I - March 15 Matches,#[Dota Pro Circuit 2023: Tour 2](https://cdn.c...,D2TournamentThreads,15/03/23 12:35:44 PM,https://www.reddit.com/r/DotA2/comments/11rm66...
2,DotA2,What hero has worst synergies with each other?,For the betterment of drafting.,Impress-Solid,11/03/23 07:43:29 PM,https://www.reddit.com/r/DotA2/comments/11oi4g...
3,DotA2,Is valve going to investigate other regions fo...,Just like for CN during the DPC for cheating a...,UserLesser2004,18/03/23 12:46:48 PM,https://www.reddit.com/r/DotA2/comments/11uent...
4,DotA2,Old'G,I finally caught up with Old'G games. It was f...,DifficultyBig4224,15/03/23 11:54:13 AM,https://www.reddit.com/r/DotA2/comments/11rlbt...
...,...,...,...,...,...,...
1017,DotA2,who would have thought this would happen a yea...,,Comfortable_Pin_166,18/03/23 02:20:22 AM,https://www.reddit.com/r/DotA2/comments/11tz68...
1018,DotA2,"Valve, please add the shard to the sequential ...","Title.\n\nOften times, I play heroes that have...",Shamballa93,19/03/23 07:25:49 AM,https://www.reddit.com/r/DotA2/comments/11v4da...
1019,DotA2,wisp enjoyer,,-Rupas-,12/03/23 03:41:45 AM,https://www.reddit.com/r/DotA2/comments/11ot29...
1020,DotA2,Tumblers Toy outplay. (I thought it was kinda ...,,TheRacerX,18/03/23 01:34:22 PM,https://www.reddit.com/r/DotA2/comments/11ufjm...


In [15]:
#LoL post features collection
   
lol_post_subreddit=[]
lol_post_title=[]
lol_post_selftext=[]
lol_post_author=[]
lol_post_created_utc=[]

lol_masterlist=[lol_post_subreddit,lol_post_title,lol_post_selftext,lol_post_author,lol_post_created_utc]
items=['subreddit','title','selftext','author','created_utc']

for i in list(set(lol_post_url)):
    # parse the post JSON once; on any failure fall back to an empty record
    # so that every feature list stays aligned with the URL list
    try:
        post=requests.get(i+'.json',headers=headers).json()[0]['data']['children'][0]['data']
    except Exception:
        post={}
    for j in range(len(items)):
        lol_masterlist[j].append(post.get(items[j],''))

In [16]:
#LoL post created in unix time

lol_post_created_utc[0:5]

[1678779978.0, 1679161707.0, 1679087410.0, 1678640152.0, 1678839331.0]

In [17]:
#LoL post created datetime after unix time conversion

lol_post_created_datetime=[]
for i in lol_post_created_utc:
    if i !='':
        lol_post_created_datetime.append(datetime.datetime.fromtimestamp(i).strftime("%d/%m/%y %I:%M:%S %p"))
    else:
        lol_post_created_datetime+=['']
    
lol_post_created_datetime[0:5]

['14/03/23 03:46:18 PM',
 '19/03/23 01:48:27 AM',
 '18/03/23 05:10:10 AM',
 '13/03/23 12:55:52 AM',
 '15/03/23 08:15:31 AM']

In [18]:
# create LoL dataframe

lol_df = {'subreddit':lol_post_subreddit,
          'title': lol_post_title,
          'selftext': lol_post_selftext,
          'author': lol_post_author,
          'create_datetime': lol_post_created_datetime,
          'url': list(set(lol_post_url))}
lol_df=pd.DataFrame(lol_df)
lol_df

Unnamed: 0,subreddit,title,selftext,author,create_datetime,url
0,leagueoflegends,Climbing in elo hell,So last season i was platinum 1 and this one i...,migga95,14/03/23 03:46:18 PM,https://www.reddit.com/r/leagueoflegends/comme...
1,leagueoflegends,Azael to miss this weeks The Dive and at least...,https://twitter.com/AzaelOfficial/status/16371...,untamedlazyeye,19/03/23 01:48:27 AM,https://www.reddit.com/r/leagueoflegends/comme...
2,leagueoflegends,Want to learn jax,Only 3 days of playing so far. Lvl 20 around 6...,zeepers_13,18/03/23 05:10:10 AM,https://www.reddit.com/r/leagueoflegends/comme...
3,leagueoflegends,"Would Pykes ""Your Cut"" be a good addition to e...",What i am thinking is that every support start...,Slimexsan,13/03/23 12:55:52 AM,https://www.reddit.com/r/leagueoflegends/comme...
4,leagueoflegends,Nashor Tooth used to have ability haste then i...,&amp;#x200B;\n\n[13.6 patch coming next week](...,Boudynasr,15/03/23 08:15:31 AM,https://www.reddit.com/r/leagueoflegends/comme...
...,...,...,...,...,...,...
985,leagueoflegends,Yuumi should've gotten tools that encourage ho...,Yuumi currently has no real reason to hop off ...,TooBad_Vicho,14/03/23 11:31:47 PM,https://www.reddit.com/r/leagueoflegends/comme...
986,leagueoflegends,help with queue timer and accept match don't s...,"It's been a while now , getting into a game be...",ssrx3,15/03/23 08:59:24 AM,https://www.reddit.com/r/leagueoflegends/comme...
987,leagueoflegends,Summoner's Sanctuary is looking for new members!,Hello I'd like to invite you to a new League o...,Adorable-Company-283,18/03/23 11:15:42 AM,https://www.reddit.com/r/leagueoflegends/comme...
988,leagueoflegends,Dom and Yamato talk about Blaber,,BeautifulChocolate87,17/03/23 01:06:23 PM,https://www.reddit.com/r/leagueoflegends/comme...


In [19]:
#save output to csv

reddit_df=pd.concat([dota2_df,lol_df],axis=0)
reddit_df.to_csv("../dataset/webscraping_190323.csv",index=False)