# Scraping

This section retrieves the Swiss-related subreddits from the suggestions listed on /r/Switzerland.

NOTE: We could simply have copied the names from the website, since there are few enough to do by hand, but we thought this would be a good warm-up exercise that touches more parts of a typical data analysis pipeline in our project.

In [8]:
## Importing libraries
import requests
from bs4 import BeautifulSoup
import numpy as np

### Getting the list and Saving to CSV

The 'Requests' library is used to fetch the HTML from a base URL. 'BeautifulSoup' is then used to parse the HTML.

In [None]:
form_url = 'https://www.reddit.com/r/Switzerland/'
r = requests.get(form_url)
soup = BeautifulSoup(r.text, 'html.parser')
#soup

We ran into a problem: because we ran this cell several times in a row, Reddit mistook us for a bot!
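One common way to reduce the chance of being throttled is to wait between retries, backing off a little longer each time. The sketch below is only an illustration and not what was run in this notebook: `fetch_with_backoff`, its parameters, and the choice of status codes treated as retryable are all assumptions.

```python
import time


def fetch_with_backoff(fetch, max_tries=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns a response that is not throttled.

    fetch is any zero-argument callable returning an object with a
    status_code attribute (for example a requests.Response).
    """
    response = None
    for attempt in range(max_tries):
        response = fetch()
        # Assumption: 429 (Too Many Requests) and 503 are worth retrying
        if response.status_code not in (429, 503):
            return response
        if attempt < max_tries - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    # Every attempt was throttled: hand back the last response for inspection
    return response
```

With the cell above, this could be called as `fetch_with_backoff(lambda: requests.get(form_url, headers={"User-Agent": my_user_agent}))`, where sending a descriptive `User-Agent` header is a further assumption about what helps.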

In [None]:
# The objective is to build a list of subreddit names
subReddits = ['Switzerland']

# First we define the categories we are interested in, to avoid including links
# that relate to Switzerland only through neighbouring countries, etc.
list_of_interesting_categories = ['» Other general Swiss Subreddits',
                                  '» Special Interest Swiss Subreddits',
                                  '» Universities and Institutions']

# Then we loop over the blockquotes in search of our links
for bq in soup.find_all('blockquote'):
    # Skip blockquotes without an <h3> heading (bq.h3 is None there)
    if bq.h3 is not None and bq.h3.text in list_of_interesting_categories:
        # Collect every link of the form /r/<name>
        for link in bq.find_all('a', href=True):
            parts = link['href'].split("/")
            if len(parts) == 3:
                subReddits.append(parts[2])

print(subReddits)
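The `len(parts) == 3` check accepts any href with exactly two slashes (so `/u/name` would pass) and rejects `/r/Basel/` because of the trailing slash. A slightly stricter variant is sketched below. The helper name `subreddit_from_href` is hypothetical, and whether the page actually contains such links is not something we verified.

```python
from urllib.parse import urlparse


def subreddit_from_href(href):
    """Return the subreddit name from a link such as '/r/Basel', else None."""
    # urlparse copes with both relative paths and absolute URLs
    path = urlparse(href).path
    # Drop the empty segments produced by leading or trailing slashes
    parts = [p for p in path.split("/") if p]
    # Keep only links that are exactly /r/<name>
    if len(parts) == 2 and parts[0].lower() == "r":
        return parts[1]
    return None
```

Inside the loop, `name = subreddit_from_href(link['href'])` followed by `if name: subReddits.append(name)` would replace the split-and-count logic.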

In [None]:
# We save the list to csv for later use
np.savetxt("Swiss_SR.csv", subReddits, delimiter=",", fmt='%s', header='Swiss SubReddit Names')

### Load if already saved

In [17]:
import numpy as np
subReddits = np.loadtxt("Swiss_SR.csv", delimiter=",", dtype=bytes).astype(str)
subReddits = subReddits.tolist()
print(subReddits)

['Switzerland', 'AskSwitzerland', 'Basel', 'Bern', 'BielBienne', 'Buenzli', 'Frauenfeld', 'Fribourg', 'Geneva', 'Liestal', 'Luzern', 'Morcote', 'Neuchatel', 'Schaffhausen', 'SanktGallen', 'Schwiiz', 'Solothurn', 'Stans', 'Suisse', 'Thun', 'Ticino', 'Winterthur', 'Zermatt', 'Zug', 'Zurich', 'Breitling', 'CHTrees', 'FCBasel', 'MatterhornPorn', 'Migros', 'Schweiz', 'SwissArmy', 'SwissArmyKnives', 'SwissBuyers', 'SwissGaming', 'SwissGuns', 'SwissHockey', 'SwissESports', 'SwissMountainDogs', 'SwissNews', 'SwissProblems', 'SwissRap', 'SwissSuperLeague', 'SwissHistory', 'CERN', 'EPFL', 'ETHZ', 'UZH']
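Since the file holds one name per line, the round trip also works without NumPy. `np.savetxt` writes the header as a line starting with `# `, which `np.loadtxt` skips as a comment. The sketch below mimics that with plain file I/O. The helper names `save_names` and `load_names` are hypothetical, and it writes to a temporary directory rather than the notebook's working directory.

```python
import os
import tempfile


def save_names(path, names):
    """Write one subreddit name per line, preceded by a '#' header line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("# Swiss SubReddit Names\n")
        f.writelines(name + "\n" for name in names)


def load_names(path):
    """Read the names back, skipping blank lines and '#' comment lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f
                if line.strip() and not line.startswith("#")]


names = ["Switzerland", "AskSwitzerland", "Basel"]
path = os.path.join(tempfile.mkdtemp(), "Swiss_SR.csv")
save_names(path, names)
print(load_names(path))
```

This returns plain `str` values directly, so the `dtype=bytes` and `.astype(str)` conversion is not needed.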


# Extract all comments from the subreddits

In this section, we extract all the submissions and comments for each of the subreddits listed above. For ease of use, we rely on the Python Reddit API Wrapper (PRAW) module to do so.

Using the Reddit API requires authentication through OAuth. A unique user agent is required, along with a client ID and secret (both generated when a Reddit application is created).

In [1]:
my_user_agent = "ada:dvd_ada:v1.0.0 (by /u/dk01reddit)"
my_client_id = ""
my_client_secret = ""
my_username = ""
my_password = ""

In [2]:
import requests
import requests.auth
client_auth = requests.auth.HTTPBasicAuth(my_client_id, my_client_secret)
post_data = {"grant_type": "password", "username": my_username, "password": my_password}
headers = {"User-Agent": my_user_agent}
response = requests.post("https://www.reddit.com/api/v1/access_token", auth=client_auth, data=post_data, headers=headers)
#response.json()

In [3]:
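# Use the token returned by the previous request instead of hard-coding it
access_token = response.json()["access_token"]
headers = {"Authorization": "bearer " + access_token, "User-Agent": my_user_agent}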
response = requests.get("https://oauth.reddit.com/api/v1/me", headers=headers)
#response.json()

In [6]:
import praw

r = praw.Reddit(user_agent=my_user_agent,
                     client_id=my_client_id,
                     client_secret=my_client_secret)

Once authenticated, we can get all the comments for each of the subreddits. 

In [23]:
swiss = r.subreddit(subReddits[0])

In [29]:
i = 0
for submission in swiss.new(limit=1000):
    i=i+1
    
print(i)

989
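The loop above only counts submissions. A minimal sketch of how the comments themselves might be gathered follows. It assumes the PRAW 4+ object model (`subreddit.new`, `submission.comments.replace_more`, `submission.comments.list`), and the function name `collect_comments` and the row layout are hypothetical choices, not the notebook's final method.

```python
def collect_comments(subreddit, submission_limit=10, more_limit=0):
    """Return one dict per comment found in the newest submissions.

    subreddit is expected to behave like a praw Subreddit object.
    """
    rows = []
    for submission in subreddit.new(limit=submission_limit):
        # With limit=0 the "load more comments" placeholders are dropped;
        # limit=None would fetch them all at the cost of many extra requests
        submission.comments.replace_more(limit=more_limit)
        # list() flattens the comment tree into a single sequence
        for comment in submission.comments.list():
            rows.append({
                "submission_id": submission.id,
                "comment_id": comment.id,
                "body": comment.body,
            })
    return rows
```

Calling `collect_comments(swiss, submission_limit=5)` on an authenticated session would be a small first test before looping over every name in `subReddits`.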


In [None]:
## Old version!

In [102]:
# These calls belong to the older PRAW 3 API and no longer exist in PRAW 4+
for i in range(0, len(subReddits)):
    sr = r.get_subreddit(subReddits[i])

    #for com in sr.get_comments():


In [4]:
import praw
r = praw.Reddit(user_agent=my_user_agent)

r.set_oauth_app_info(client_id=my_client_id,
                     client_secret=my_client_secret,
                     redirect_uri='http://localhost:8000/')

ClientException: Required configuration setting 'client_id' missing. 
This setting can be provided in a praw.ini file, as a keyword argument to the `Reddit` class constructor, or as an environment variable.
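The error message lists a `praw.ini` file as one place to supply `client_id`. As a sketch of that route, the snippet below writes such a file with the standard `configparser` module. The site name `ada`, the helper `write_praw_ini`, and the placeholder credentials are all hypothetical, and the file is written to a temporary directory rather than next to the notebook.

```python
import configparser
import os
import tempfile


def write_praw_ini(path, site, client_id, client_secret, user_agent):
    """Write a praw.ini with a single named site section."""
    # interpolation=None keeps any '%' characters in values literal
    config = configparser.ConfigParser(interpolation=None)
    config[site] = {
        "client_id": client_id,
        "client_secret": client_secret,
        "user_agent": user_agent,
    }
    with open(path, "w", encoding="utf-8") as f:
        config.write(f)


ini_path = os.path.join(tempfile.mkdtemp(), "praw.ini")
write_praw_ini(ini_path, "ada",
               "YOUR_CLIENT_ID",       # placeholder, not a real credential
               "YOUR_CLIENT_SECRET",   # placeholder, not a real credential
               "ada:dvd_ada:v1.0.0 (by /u/dk01reddit)")
```

With a `praw.ini` like this in the working directory, `praw.Reddit("ada")` should pick up the settings from that section, which also keeps the credentials out of the notebook itself.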