# Webscraping with Python
For [Dataharvest+ 2020](https://dataharvesteijc2020.sched.com/event/dkjh/data-analysis-with-pandas-on-jupyter-3?iframe=no) by [Adriana Homolova](https://twitter.com/naberacka) & [Winny de Jong](https://twitter.com/winnydejong)

Webscraping is the automated downloading of structured information from the open web. You should consider it a last resort for getting the information you need: try calling and asking an organisation for information first. It's the polite thing to do.

## Setup

We're going to do some basic webscraping using the following Python libraries: 
- Requests: to request websites
- BeautifulSoup: to parse the HTML of the requested website. Note that to import this library you have to use `from bs4 import BeautifulSoup`
- Pandas: to do data analysis on the scraped data
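
As a small sketch, the import cell could look like this. The alias `pd` for Pandas is a common convention rather than a requirement.

```python
import requests                  # request websites over HTTP
from bs4 import BeautifulSoup    # parse HTML; the package itself is called bs4
import pandas as pd              # analyse the scraped data as tables
```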

In [1]:
# import all libraries needed


## Scraper 1

Goal: Let's scrape the DataHarvest Program, and save the program to a csv file. 

Data we want to collect for every session: 
- name
- url to sessionpage

### request website

In [None]:
# request webpage, save to variable named r
# use url https://dataharvesteijc2020.sched.com/?iframe=no


### check status request

Check whether the request succeeded by looking at the status code. HTTP, the protocol of the web, has many status codes; 404 (page not found) is probably the one you know. :) If a website works, the status code is 200.
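
To get a feel for what these numbers mean without making a request, here is a minimal sketch using Python's built-in `http.HTTPStatus`. This is not the cell's answer: in the notebook the number comes from the request itself, via `r.status_code`.

```python
from http import HTTPStatus

# Look up the standard description for a few common status codes.
for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase)

# HTTPStatus members behave like plain integers, so they compare with numbers.
status_ok = HTTPStatus.OK
print(status_ok == 200)
```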

In [None]:
# print status code of r


In [None]:
# another way to get the status code is simply printing the request


### parse HTML

We're going to use the Beautiful Soup library to parse the HTML of the webpage. To start, we'll save the HTML of the requested webpage to a variable we'll call 'soup'. Weird? Nah, it's the default name to use...
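
Here is a minimal sketch of the parsing step on a tiny, made-up HTML string. The `html` variable is a hypothetical stand-in for `r.content`; the real page is much longer.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.content.
html = "<html><head><title>Example programme</title></head><body><p>Hello</p></body></html>"

# "html.parser" is the parser that ships with Python, so there is nothing extra to install.
soup = BeautifulSoup(html, "html.parser")

print(soup.title.text)   # the text inside the <title> tag
print(soup.prettify())   # the parsed HTML, indented for easier reading
```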

In [None]:
# let BeautifulSoup take the content of the request called r, 
# parse it as HTML, and save it to a variable called soup


In [None]:
# check what's inside soup, should be the source code of 
# the requested webpage...


### collect data for 1 workshop

Remember, this is the data we want to collect for every session:
- name
- url to sessionpage

Look at the source code of the page to figure out which HTML elements hold the data you're looking for. Every workshop is a `<div class='sched-container'>` in the source code...
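
A sketch of how this could work, on a made-up HTML snippet. Only the `sched-container` class and the base url come from this notebook; the `<a>` tags, the session names and the `href` values are assumptions for illustration, so check the real source code before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the structure inside each div is an assumption.
html = """
<div class="sched-container">
  <a href="event/abc1/example-session-one"> Example session one </a>
</div>
<div class="sched-container">
  <a href="event/abc2/example-session-two"> Example session two </a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first element that matches.
w1 = soup.find("div", class_="sched-container")

link_tag = w1.find("a")
name = link_tag.text.strip()       # text only, without surrounding spaces
relative_url = link_tag["href"]    # the value of the href attribute

# The href is only part of the url; combine two strings with +.
base_url = "https://dataharvesteijc2020.sched.com/"
full_url = base_url + relative_url
print(name, full_url)
```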

In [None]:
# find the first workshop in the soup (read: source code), save to w1


In [None]:
# print content of w1


In [None]:
# get name of first workshop


In [None]:
# print name of first workshop


In [None]:
# no, we only want the name... not all the html!
# let's try again...


In [None]:
# get url of first workshop


In [None]:
# that's not the entire url... now what? 
# you can easily combine strings in Python, like so: 

### collect data for all workshops

If you were to collect all data like this, simply typing it over or copy-pasting would be faster... That's where for-loops come in. For every workshop listed on the page, we want to do the exact same thing. Python makes that very easy to do.

In [None]:
# let's go and collect all 'div' HTML-elements with the class 'sched-container'
# that exist in soup and save these divs to a variable called sessions


In [None]:
# let's look at what's inside the sessions variable


Note how the sessions variable begins and ends with square brackets? That's because it's a list that contains all workshops.
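
A small sketch of that list behaviour, again on a made-up snippet (the contents of the divs are hypothetical). `find_all()` returns a list-like object, so indexing and `len()` work as they do on any Python list.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with three workshops.
html = """
<div class="sched-container"><a href="event/abc1/one">Session one</a></div>
<div class="sched-container"><a href="event/abc2/two">Session two</a></div>
<div class="sched-container"><a href="event/abc3/three">Session three</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() collects every matching element, not just the first one.
sessions = soup.find_all("div", class_="sched-container")

first = sessions[0]        # counting starts at zero
print(first)
print(len(sessions))       # number of workshops found
```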

In [None]:
# since it's a list, you can request to only see the first item
# note how computers start counting at zero...
# if everything works out, this should look familiar...


In [None]:
# since it's a list, you can request its length


OK. So far, so good, right? To get the link and name of every session, we're going to use a **for-loop**. Basically telling our computers to extract the name and the url **for every item** in the list. 
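
As a sketch, a for-loop over a made-up list of sessions could look like this. The HTML is hypothetical; the loop itself is the point.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the real page has many more sessions.
html = """
<div class="sched-container"><a href="event/abc1/one"> Session one </a></div>
<div class="sched-container"><a href="event/abc2/two"> Session two </a></div>
"""
soup = BeautifulSoup(html, "html.parser")
sessions = soup.find_all("div", class_="sched-container")

base_url = "https://dataharvesteijc2020.sched.com/"

# Everything indented under the for-line runs once for every s in sessions.
for s in sessions:
    name = s.find("a").text.strip()
    url = base_url + s.find("a")["href"]
    print(name, url)
```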

In [None]:
# for every item (I named it `s`, but I could have chosen anything really, as long as I'm consistent)
# in the list called sessions, go and do...:

    # select the name

    # select the url

This all looks mighty fine, but how do we get this in a table? Where are we going to store this information? Printing it out does not make it easy to use...

Let's say we create a list for every session, like so:   
`listForSession = [nameSession, linkSession]`

If we then create a list of these lists, like so:  
`listOfLists = [[nameSession, linkSession],
                [nameSession, linkSession],
                [nameSession, linkSession],
                [nameSession, linkSession]]`
                
We can use the Pandas function `pd.DataFrame` to easily create a dataframe (read: table) based on a list of lists, where every list in the list becomes a row. Confused? I'll show you...
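
A minimal sketch with made-up rows, to show how a list of lists turns into a table. The session names, the example.org links and the column name `workshop` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical list of lists: one inner list per session.
inputTable = [
    ["Example session one", "https://example.org/one"],
    ["Example session two", "https://example.org/two"],
]

# Every inner list becomes one row of the dataframe.
df = pd.DataFrame(inputTable)

# Without names the columns are just numbered, so give them names.
df.columns = ["workshop", "link"]
print(df.head())

# Without a filename, to_csv() returns the csv as text;
# pass a filename instead to write a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```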

In [None]:
# create an empty list, to be filled with lists

# for every item (I named it `s`, but I could have chosen anything really, as long as I'm consistent)
# in the list called sessions...
    # ...select the name
    # ...select the url
    # ...create a list that contains name and link
    # ...add this list named sList to the inputTable list
    
# see how the indent of the code ends here? 
# everything that follows is no longer done for every s in sessions.


In [None]:
# if you want to check upon your own work, and see if there is a list for every s in sessions;
# the length of the inputTable list minus the length of the sessions list should be zero


In [None]:
# create a dataframe named df based on a list of lists called inputTable using Pandas


In [None]:
# set new column names


In [None]:
# check head of dataframe


In [None]:
# save dataframe to csv


## Scraper 2

Goal: get summary, speakers, date and time for all DataHarvest 2020 sessions

In [None]:
# show me a sample of dataframe df


In [None]:
# give me the shape of dataframe df


What is it we want to do exactly?
  
For every workshop in the df dataframe, we want to: 
- request the webpage
- parse the html
- extract the data:
    - name Workshop
    - date
    - time
    - summary
    - speakers

### collect data for 1 workshop

Let's try it for 1 workshop first....
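
Before you start, here is a sketch of the techniques the cells below ask for, applied to a made-up session page. All class names, the `id` value and the texts are hypothetical; look at the real source code to find the right selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical session page; the real HTML will look different.
html = """
<div class="session-title">  Example session one  </div>
<div class="session-date" id="2020-05-07"> Thursday 7 May 10:00 - 11:00 </div>
<div class="session-description">What the session is about.</div>
<div class="speaker"><h2>Speaker A</h2></div>
<div class="speaker"><h2>Speaker B</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Text only, without surrounding spaces.
title = soup.find("div", class_="session-title").text.strip()

# An attribute such as id is read with square brackets.
date_div = soup.find("div", class_="session-date")
date = date_div["id"]

# Slicing with [-13:] keeps the last 13 characters of a string.
time = date_div.text.strip()[-13:]

description = soup.find("div", class_="session-description").text.strip()

# Collect all speaker names in a list.
speakers = []
for speaker_div in soup.find_all("div", class_="speaker"):
    speakers.append(speaker_div.text.strip())

print(title, date, time, description, speakers)
```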

In [None]:
# request webpage


In [None]:
# check status code of request


In [None]:
# create soup = parse HTML


In [None]:
# get workshop title


In [None]:
# get workshop title, text only


In [None]:
# get workshop title, text without surrounding spaces


In [None]:
# get date from HTML, by selecting div


In [None]:
# get date from HTML, only get the id of said div


In [None]:
# get date from HTML, by selecting div


In [None]:
# get date from HTML, only get the text of the div


In [None]:
# get date from HTML, remove surrounding spaces of text from div


In [None]:
# get date from HTML, only keep the last 13 letters of the text from the div


In [None]:
# get description from HTML, by selecting div


In [None]:
# get all speakers from HTML, by finding all speaker divs


In [None]:
# create empty list, to save speakers in

# select names from variable called speakersInput

    # add name to list named speakers


### collect data for all workshops

In [None]:
# create an empty list to store all data in

# let's create a for-loop;
# for every link in the column named link from the table named df:

    # request the webpage

    
    # check if request status code is 200, if so

        # parse HTML, create soup
      
    
        # get workshop title
        
            
        # get date from HTML, only get the id of said div
        
            
        # get date from HTML, only keep the last 13 letters of the text from the div
        
            
        # get description from HTML, by selecting div
        
            
        # create empty list, to save speakers in
        
        # get all speakers from HTML, by finding all speaker divs
        
            
        # create a list with all scraped info
        
        
        # append list named workshopInfo to list named detailsWorkshops
            
    # if request status code is not 200...
    
        # print warning + link
        
        
# create dataframe named dh (dataharvest) based on detailsWorkshops list

# check out dh


In [None]:
# set column names


In [None]:
# check out shape of dh dataframe


In [None]:
# save data to csv


## Analyse data

### Beware: explosion ahead ;)

Question: which speaker was involved in most sessions? 

Answering this question would be easy, if you had 1 speaker in every cell in the speakers-column. But our data doesn't look like that... It looks like this... take a sample using `.sample()`

Now what do we do? Well, every entry in the speakers-column is a list. We know so, because we made it so. So we can 'explode' these lists, and create a new row for every item of said list, where all other columns stay the same...

In [None]:
# let's look at the first row, use .head()


In [None]:
# let's look at the speakers column of the first row of the dh dataframe
# use .iloc[] to select the row, and ['columnName'] to select the column


If we were to explode our dataset, this first row would explode into three rows, which will all be identical except for the speaker. There will be one row for every speaker. : )
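
To see the explosion on a small scale, here is a sketch with a made-up dataframe whose first row has three speakers. The data and the column name `workshop` are assumptions; your `dh` dataframe has more columns and rows.

```python
import pandas as pd

# Hypothetical data: the speakers column holds a list in every cell.
dh = pd.DataFrame({
    "workshop": ["Session one", "Session two", "Session three"],
    "speakers": [
        ["Speaker A", "Speaker B", "Speaker C"],
        ["Speaker A"],
        ["Speaker A", "Speaker B"],
    ],
})

# explode() creates one row per list item; the other columns are repeated.
dhS = dh.explode("speakers")
print(dh.shape, dhS.shape)

# Count the workshops per speaker and sort from high to low.
counts = (
    dhS.groupby("speakers", as_index=False)["workshop"]
    .count()
    .sort_values("workshop", ascending=False)
)
print(counts.head(10))
```

Passing `as_index=False` keeps the speakers as a normal column instead of turning them into the index, which makes the result easier to filter and save.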

In [None]:
# let's try it, just to see...
# dear Pandas, explode the dh dataset, using the speakers-column, 
# and only show us the first 3 rows of the result of this explosion
# use .explode() to do this


In [None]:
# let's explode some data, but this time save it 
# in a new dataframe called dhS (short for dataharvest Speakers)


In [None]:
# get the number of rows + columns of dh


In [None]:
# get the number of rows + columns of dhS


In [None]:
# let's 
# group on the speakers column
# not creating an index out of said speakers column
# sort values
# drop all empty rows
# after grouping count the number of workshops
# sort values based on workshop-column 
# sort from high to low (descending)
# filter dataframe, only keep speakers + workshop columns
# only show me the top 10

In [None]:
# now, let's do all that once more but then also save the data. 
