# Scraping Subreddits' Names

This aims to retrieve Swiss-related subreddits from the suggestions in /r/Switzerland

NOTE: We could have simply copied the names from the website, as they are not an unbereable amount to be done by hand, but we thought this would be an OK warm up exercise to touch more different thing of the typical Data Analysis pipeline in our project.

In [1]:
## Importing libraries
import requests
from bs4 import BeautifulSoup
import numpy as np

### Getting the list and Saving to CSV

'Requests' library is used to get the HTML data using a base url. 'BeautifulSoup' is then used to parse the HTML.

In [None]:
form_url = 'https://www.reddit.com/r/Switzerland/'
r = requests.get(form_url)
soup = BeautifulSoup(r.text, 'html.parser')
#soup

We have found a problems with being mistaken by a bot because we ran this several times consecutively!

In [None]:
#The objective is to come up with a SubReddits' name list
subReddits = list()
subReddits.append('Switzerland')

#First we define the categories we are interested in, to avoid including links related to switzerland because of
# neighbouring countries, etc.
list_of_interesting_categories = ['» Other general Swiss Subreddits','» Special Interest Swiss Subreddits', '» Universities and Institutions']

# Then we loop in search for our links
for bq in soup.find_all('blockquote'):
    if bq.h3.text in list_of_interesting_categories:
        
        #getting all the links
        for link in bq.find_all('a', href=True):    
            parts = link['href'].split("/")
            if len(parts) == 3:
                subReddits.append(parts[2])
    
print(subReddits)                

In [None]:
# We save the list to csv for later use
np.savetxt("Swiss_SR.csv", SubReddits, delimiter=",", fmt='%s', header='Swiss SubReddit Names')

### Scraping names of subreddits related to United Kingdom

In [5]:
url = 'https://www.reddit.com/r/unitedkingdom/wiki/british_subreddits'
r = requests.get(url)

In [7]:
soup = BeautifulSoup(r.text, 'html.parser')

In [17]:
links = soup.find_all('a')

In [60]:
uk_subreddits = []
for link in links:
    if link.text.startswith('/r'):
        uk_subreddits.append(link.text.strip('/r'))

In [61]:
uk_subreddits

['AskUK',
 'britpics',
 'SampleSize',
 'AskUK',
 'english_articles',
 'UKcirclejerk',
 'Fringe',
 'Humou',
 'INGLIN',
 'BritishProblems',
 'TheRedLion',
 'gbr4',
 'BritishSuccess',
 'TwoXUK',
 'UKskeptic',
 'PoliceUk',
 'AskUK',
 'GCSE',
 'UKVisa',
 'LegalAdviceUK',
 'UKPersonalFinance',
 'UKHistory',
 'ScottishHistory',
 'HistoryWales',
 'UKLaw',
 'UKLGBT',
 'londonlgbt',
 'bisexualuk',
 'TransgenderUK',
 'BritishMilitary',
 'UKNews',
 'UKGoodNews',
 'PrivateEye',
 'edtop',
 'MHoC',
 'MHoCLabou',
 'MHoCLiberalDemocrats',
 'MHoCConservatives',
 'MHoCUKIP',
 'MHoCGreens',
 'MHoCcommunist',
 'MHoCBIP',
 'MHoCCWL',
 'MHoCIndependents',
 'UKPolitics',
 'BritishPolitics',
 'LabourUK',
 'Tories',
 'LibDem',
 'UKIPParty',
 'UKGreens',
 'PiratePartyUK',
 'tusc',
 'scottishpolitics',
 'upliftingnewsuk',
 'ukgovbriefs',
 'UKandIrishBee',
 'UKBiscuits',
 'ukcigars',
 'UKTrees',
 'UKBike',
 'BristolCycling',
 'cambridgecycling',
 'leedscycling',
 'londoncycling',
 'cyclemc',
 'scottishcycling',
 '

In [63]:
# We save the list to csv for later use
np.savetxt("UK_SR.csv", uk_subreddits, delimiter=",", fmt='%s', header='UK SubReddit Names')