# Webscraping with Python
For [Dataharvest+ 2020](https://dataharvesteijc2020.sched.com/event/dkjh/data-analysis-with-pandas-on-jupyter-3?iframe=no) by [Adriana Homolova](https://twitter.com/naberacka) & [Winny de Jong](https://twitter.com/winnydejong)

Webscraping is the automated downloading of structured information from the open web. You should consider it a last resort for getting the information you need: try calling and asking an organisation for information first. It's the polite thing to do.

## Setup

We're going to do some basic webscraping using the following Python libraries: 
- Requests: to request websites
- BeautifulSoup: to parse the HTML of the requested website
- Pandas: to do data analysis on the scraped data

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Scraper 1

Goal: Let's scrape the DataHarvest Program, and save the program to a csv file. 

Data we want to collect for every session: 
- name
- url to sessionpage

### request website

In [6]:
# request webpage, save to variable named r
r = requests.get('https://dataharvesteijc2020.sched.com/?iframe=no2')

### check status request

Check whether the request succeeded by looking at the status code. HTTP, the protocol of the web, has many status codes; 404 (page not found) is probably the one you know. :) If the request worked, the status code is 200.

In [7]:
# print status code of r
r.status_code

200

In [9]:
# another way to get the status code is simply printing the request
r

<Response [200]>
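
If you run into a status code you don't recognise, Python's standard library can tell you what it means. A minimal sketch using the built-in `http.HTTPStatus` (not part of the original notebook); the helper name `describe_status` is made up for this example:

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # not one of the codes Python knows about
        return f'{code} (unknown status code)'
    return f'{code} {status.phrase}'

print(describe_status(200))
print(describe_status(404))
```

With a real response you would call `describe_status(r.status_code)`.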

### parse HTML

We're going to use the Beautiful Soup library to parse the HTML of the webpage. To start, we'll save the HTML of the requested webpage to a variable we'll call 'soup'. Weird? Nah, it's the default name to use...

In [10]:
# BeautifulSoup, take the content of the request called r,
# parse it as HTML, and save it to a variable called soup
soup = BeautifulSoup(r.content, 'html.parser')

In [11]:
# check what's inside soup, should be the source code of 
# the requested webpage...
soup

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<!--
           _              _
  ___  ___| |__   ___  __| |                 @@@@@@@@@@@@@@
 / __|/ __| '_ \ / _ \/ _` |             @@@@@@@@@@@@@@@@@@@@@@
 \__ \ (__| | | |  __/ (_| |          @@@@@@@@@@@@@@@@@@@@@@@@@@@@
 |___/\___|_| |_|\___|\__,_|        @@@@@@@@@@@@@@@@@@@@@@@@@@@@@
                                  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@
                _               @@@@@@@@@@@@@@@@@@@@@@@@@@@@@    @@
   ___ ___   __| | ___         @@@@@@@@@@@@@@@@@@@@@@@@@@@@    @@@@@@
  / __/ _ \ / _` |/ _ \       @@@@@@@@@@@@@@@@@@@@@@@@@@@   @@@@@@@@@@
 | (_| (_) | (_| |  __/      @@@@@@@@@@@@@@@@@@@@@@@@@@@  @@@@@@@@@@@@@
  \___\___/ \__,_|\___|      @@@@@@@@@@@@@@@@@@@@@@@@@  @@@@@@@@@@@@@@@@
                             @@@@@@@@@@@@@@@@@@@@@@@   @@@@@@@@@@@@@@@@@
                             @@@@@@@@@@@@@@@@@@@@@@  @@@@@@@@@@@@@@@@@@@
                     

### collect data for 1 workshop

Remember, this is the data we want to collect for every session:
- name
- url to sessionpage

Look at the source code of the page to figure out which HTML elements hold the data you're looking for. Every workshop is a `<div class='sched-container'>` in the source code...

In [23]:
# find the first workshop in the soup (read: source code)
w1 = soup.find('div', {'class': 'sched-container'})

In [24]:
# print content of w1
w1

<div class="sched-container"><div class="sched-container-inner">
<span class="event ev_9 ev_9_sub_1"><a class="name" href="event/dkmt/how-to-get-access-to-documents-and-data-in-the-eu" id="b0bafb5a9ab3c073728f4f2c93d513b8">How to get access to documents and data in the EU <span class="vs">TBA</span></a></span>
<br style="clear:both"/> </div></div>

In [34]:
# get name of first workshop
w1Name = w1.find('a')

In [35]:
# print name of first workshop
w1Name

<a class="name" href="event/dkmt/how-to-get-access-to-documents-and-data-in-the-eu" id="b0bafb5a9ab3c073728f4f2c93d513b8">How to get access to documents and data in the EU <span class="vs">TBA</span></a>

In [36]:
# no, we only want the name... not all the html!
# let's try again...
w1Name = w1.find('a').text
w1Name

'How to get access to documents and data in the EU TBA'

In [38]:
# get url of first workshop
w1Link = w1.find('a')['href']
w1Link

'event/dkmt/how-to-get-access-to-documents-and-data-in-the-eu'

In [39]:
# that's not the entire url... now what? 
# you can easily combine strings in Python, like so: 
w1Link = 'https://dataharvesteijc2020.sched.com/' + w1.find('a')['href']
w1Link

'https://dataharvesteijc2020.sched.com/event/dkmt/how-to-get-access-to-documents-and-data-in-the-eu'
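
Gluing strings together works here because every `href` is relative and the base ends with a slash. As an alternative sketch (not what the notebook uses), the standard library's `urljoin` also copes with links that start with a slash or are already complete. The variable names `base`, `href` and `full_link` are made up for this example:

```python
from urllib.parse import urljoin

# base url of the programme (same as in the notebook)
base = 'https://dataharvesteijc2020.sched.com/'
# relative link as found in the href attribute
href = 'event/dkmt/how-to-get-access-to-documents-and-data-in-the-eu'

# combine base and relative link into one complete url
full_link = urljoin(base, href)
print(full_link)
```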

### collect data for all workshops

If you were to collect all data like this, simply typing it over or copy-pasting would be faster... That's where for-loops come in. For every workshop listed on the page, we want to do the exact same thing. Python makes that very easy to do.

In [50]:
# let's go and collect all 'div' HTML-elements with the class 'sched-container'
# that exist in soup and save these divs to a variable called sessions
sessions = soup.findAll('div', {'class': 'sched-container'})

In [51]:
# let's look at what's inside the sessions variable
sessions

[<div class="sched-container"><div class="sched-container-inner">
 <span class="event ev_9 ev_9_sub_1"><a class="name" href="event/dkmt/how-to-get-access-to-documents-and-data-in-the-eu" id="b0bafb5a9ab3c073728f4f2c93d513b8">How to get access to documents and data in the EU <span class="vs">TBA</span></a></span>
 <br style="clear:both"/> </div></div>,
 <div class="sched-container"><div class="sched-container-inner">
 <span class="event ev_9 ev_9_sub_1"><a class="name" href="event/dkmw/using-access-to-environmental-information-rules" id="44cf910d4f2c6a81ec373f91b7121b80">Using access to environmental information rules <span class="vs">TBA</span></a></span>
 <br style="clear:both"/> </div></div>,
 <div class="sched-container"><div class="sched-container-inner">
 <span class="event ev_9 ev_9_sub_3"><a class="name" href="event/doQR/foi-lunch-2-how-to-deal-with-data-access-obstruction" id="b07571bc0c9ff9a1aeff7a93261cf0d0">FOI lunch 2: How to deal with data access obstruction? <span class="

Note how the sessions variable begins and ends with square brackets? That's because it's a list that contains all workshops.

In [54]:
# since it's a list, you can request to only see the first item
# note how computers start counting at zero...
# if everything works out, this should look familiar...
sessions[0]

<div class="sched-container"><div class="sched-container-inner">
<span class="event ev_9 ev_9_sub_1"><a class="name" href="event/dkmt/how-to-get-access-to-documents-and-data-in-the-eu" id="b0bafb5a9ab3c073728f4f2c93d513b8">How to get access to documents and data in the EU <span class="vs">TBA</span></a></span>
<br style="clear:both"/> </div></div>

In [59]:
# since it's a list, you can request its length
len(sessions)

83

OK. So far, so good, right? To get the link and name of every session, we're going to use a **for-loop**. Basically telling our computers to extract the name and the url **for every item** in the list. 

In [64]:
# for every item (I named it `s`, but I could have chosen anything really, as long as I'm consistent)
# in the list called sessions, go and do...:
for s in sessions:
    # select the name
    name = s.find('a').text
    # select the url
    link = 'https://dataharvesteijc2020.sched.com/' + s.find('a')['href']
    print(name)
    print(link)
    print('--')

How to get access to documents and data in the EU TBA
https://dataharvesteijc2020.sched.com/event/dkmt/how-to-get-access-to-documents-and-data-in-the-eu
--
Using access to environmental information rules TBA
https://dataharvesteijc2020.sched.com/event/dkmw/using-access-to-environmental-information-rules
--
FOI lunch 2: How to deal with data access obstruction? TBA
https://dataharvesteijc2020.sched.com/event/doQR/foi-lunch-2-how-to-deal-with-data-access-obstruction
--
Data analysis with Pandas on Jupyter - 2 TBA
https://dataharvesteijc2020.sched.com/event/dkje/data-analysis-with-pandas-on-jupyter-2
--
Going to court to get information - in German and European courts TBA
https://dataharvesteijc2020.sched.com/event/doQa/going-to-court-to-get-information-in-german-and-european-courts
--
Going to court to get information - cases from Spain and Greece TBA
https://dataharvesteijc2020.sched.com/event/eUPQ/going-to-court-to-get-information-cases-from-spain-and-greece
--
FOI lunch 3: Experiences

This all looks mighty fine, but how do we get this in a table? Where are we going to store this information? Printing it out does not make it easy to use...

Let's say we create a list for every session, like so:   
`listForSession = [nameSession, linkSession]`

If we then create a list of these lists, like so:  
`listOfLists = [[nameSession, linkSession],
                [nameSession, linkSession],
                [nameSession, linkSession],
                [nameSession, linkSession]]`
                
We can use the Pandas function `pd.DataFrame` to easily create a dataframe (read: table) based on a list of lists, where every list in the list becomes a row. Confused? I'll show you...

In [65]:
# create an empty list, to be filled with lists
inputTable = []

# for every item (I named it `s`, but I could have chosen anything really, as long as I'm consistent)
# in the list called sessions...
for s in sessions:
    # ...select the name
    name = s.find('a').text
    # ...select the url
    link = 'https://dataharvesteijc2020.sched.com/' + s.find('a')['href']
    # ...create a list that contains name and link
    sList = [name, link]
    # ...add this list named sList to the inputTable list
    inputTable.append(sList)
    
# see how the indent of the code ends here? 
# everything that follows is no longer done for every s in sessions.
print(len(inputTable))

83


In [66]:
# if you want to check your own work, and see if there is a list for every s in sessions:
# the length of the inputTable list minus the length of the sessions list should be zero
len(inputTable) - len(sessions)

0

In [67]:
# create a dataframe named df based on a list of lists called inputTable using Pandas
df = pd.DataFrame(inputTable)
df

Unnamed: 0,0,1
0,How to get access to documents and data in the...,https://dataharvesteijc2020.sched.com/event/dk...
1,Using access to environmental information rule...,https://dataharvesteijc2020.sched.com/event/dk...
2,FOI lunch 2: How to deal with data access obst...,https://dataharvesteijc2020.sched.com/event/do...
3,Data analysis with Pandas on Jupyter - 2 TBA,https://dataharvesteijc2020.sched.com/event/dk...
4,Going to court to get information - in German ...,https://dataharvesteijc2020.sched.com/event/do...
...,...,...
78,How to work in multi-disciplinary teams TBA,https://dataharvesteijc2020.sched.com/event/dk...
79,Overcoming cultural bias in collaborations TBA,https://dataharvesteijc2020.sched.com/event/dk...
80,Working together apart TBA,https://dataharvesteijc2020.sched.com/event/dk...
81,Journalism that empowers TBA,https://dataharvesteijc2020.sched.com/event/dk...


In [68]:
# set new column names
df.columns = ['workshop', 'link']

In [70]:
df.head()

Unnamed: 0,workshop,link
0,How to get access to documents and data in the...,https://dataharvesteijc2020.sched.com/event/dk...
1,Using access to environmental information rule...,https://dataharvesteijc2020.sched.com/event/dk...
2,FOI lunch 2: How to deal with data access obst...,https://dataharvesteijc2020.sched.com/event/do...
3,Data analysis with Pandas on Jupyter - 2 TBA,https://dataharvesteijc2020.sched.com/event/dk...
4,Going to court to get information - in German ...,https://dataharvesteijc2020.sched.com/event/do...


In [71]:
# save data to csv
df.to_csv('names and links sessions dataharvest 2020.csv')
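
Pandas is not strictly needed for this last step. As a rough sketch, the same list of lists could be turned into csv text with Python's built-in `csv` module. The helper name `sessions_to_csv` is made up, and the two rows are copied from the output above only for illustration:

```python
import csv
import io

# two rows in the same shape as inputTable: [name, link]
rows = [
    ['How to get access to documents and data in the EU TBA',
     'https://dataharvesteijc2020.sched.com/event/dkmt/how-to-get-access-to-documents-and-data-in-the-eu'],
    ['Using access to environmental information rules TBA',
     'https://dataharvesteijc2020.sched.com/event/dkmw/using-access-to-environmental-information-rules'],
]

def sessions_to_csv(rows):
    """Return csv text with a header row plus one line per session."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['workshop', 'link'])  # column names
    writer.writerows(rows)                 # one row per session
    return buffer.getvalue()

csv_text = sessions_to_csv(rows)
print(csv_text)
```

To keep the result, you could write `csv_text` to a file opened with `open(..., 'w', newline='')`.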

## Scraper 2

Goal: get summary, speakers, date and time for all DataHarvest 2020 sessions

In [72]:
# show me a sample of dataframe df
df.sample()

Unnamed: 0,workshop,link
70,Digital security: Protect your communication TBA,https://dataharvesteijc2020.sched.com/event/dk...


In [73]:
# give me the shape of dataframe df
df.shape

(83, 2)

What is it we want to do exactly?
  
For every workshop in the df dataframe, we want to: 
- request the webpage
- parse the html
- extract the data:
    - name Workshop
    - date
    - time
    - summary
    - speakers

### collect data for 1 workshop

Let's try it for 1 workshop first....

In [75]:
# request webpage
r = requests.get(df['link'][0])

In [76]:
# check status code of request
r.status_code

200

In [77]:
# create soup = parse HTML
soup = BeautifulSoup(r.content, 'html.parser')
soup


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<!--
           _              _
  ___  ___| |__   ___  __| |                 @@@@@@@@@@@@@@
 / __|/ __| '_ \ / _ \/ _` |             @@@@@@@@@@@@@@@@@@@@@@
 \__ \ (__| | | |  __/ (_| |          @@@@@@@@@@@@@@@@@@@@@@@@@@@@
 |___/\___|_| |_|\___|\__,_|        @@@@@@@@@@@@@@@@@@@@@@@@@@@@@
                                  @@@@@@@@@@@@@@@@@@@@@@@@@@@@@
                _               @@@@@@@@@@@@@@@@@@@@@@@@@@@@@    @@
   ___ ___   __| | ___         @@@@@@@@@@@@@@@@@@@@@@@@@@@@    @@@@@@
  / __/ _ \ / _` |/ _ \       @@@@@@@@@@@@@@@@@@@@@@@@@@@   @@@@@@@@@@
 | (_| (_) | (_| |  __/      @@@@@@@@@@@@@@@@@@@@@@@@@@@  @@@@@@@@@@@@@
  \___\___/ \__,_|\___|      @@@@@@@@@@@@@@@@@@@@@@@@@  @@@@@@@@@@@@@@@@
                             @@@@@@@@@@@@@@@@@@@@@@@   @@@@@@@@@@@@@@@@@
                             @@@@@@@@@@@@@@@@@@@@@@  @@@@@@@@@@@@@@@@@@@
                    

In [111]:
# get workshop title
workshopName = soup.find('a', {'class': 'name'})
workshopName

<a class="name" href="#" id="b0bafb5a9ab3c073728f4f2c93d513b8" onclick="return false">How to get access to documents and data in the EU                      </a>

In [112]:
# get workshop title, text only
workshopName = soup.find('a', {'class': 'name'}).text
workshopName

'How to get access to documents and data in the EU                      '

In [113]:
# get workshop title, text without surrounding spaces
workshopName = soup.find('a', {'class': 'name'}).text.strip()
workshopName

'How to get access to documents and data in the EU'

In [92]:
# get date from HTML, by selecting div
date = soup.find('div', {'class': 'sched-container-header'})
date

<div class="sched-container-header" id="2020-09-23">
<div class="sched-container-dates">
<b>Wednesday</b>, September 23 •
      10:00 - 11:15  </div>
</div>

In [93]:
# get date from HTML, only get the id of said div
date = soup.find('div', {'class': 'sched-container-header'})['id']
date

'2020-09-23'
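
The id is handy because it is already in ISO format (year-month-day). A small sketch, not part of the original notebook, of turning that text into a date object with the standard library so you can sort or compare days. The variable names `date_text` and `day` are made up:

```python
from datetime import date

# the id attribute scraped above, as text
date_text = '2020-09-23'

# turn the text into a real date object
day = date.fromisoformat(date_text)

print(day.year, day.month, day.day)
# weekday() counts from Monday = 0, so Wednesday is 2
print(day.weekday())
```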

In [94]:
# get time from HTML, by selecting div
time = soup.find('div', {'class': 'sched-container-dates'})
time

<div class="sched-container-dates">
<b>Wednesday</b>, September 23 •
      10:00 - 11:15  </div>

In [95]:
# get time from HTML, only get the text of the div
time = soup.find('div', {'class': 'sched-container-dates'}).text
time

'\nWednesday, September 23 •\n      10:00 - 11:15  '

In [96]:
# get time from HTML, remove surrounding spaces of text from div
time = soup.find('div', {'class': 'sched-container-dates'}).text.strip()
time

'Wednesday, September 23 •\n      10:00 - 11:15'

In [97]:
# get time from HTML, only keep the last 13 characters of the text from the div
time = soup.find('div', {'class': 'sched-container-dates'}).text.strip()[-13:]
time 

'10:00 - 11:15'
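
Slicing off the last 13 characters works as long as every time looks exactly like `10:00 - 11:15`. As a hedged alternative (not what the notebook does), a regular expression picks out the two times wherever they sit in the text. The names `raw`, `start` and `end` are made up for this sketch:

```python
import re

# text of the div after .strip(), copied from the output above
raw = 'Wednesday, September 23 •\n      10:00 - 11:15'

# look for two clock times with a dash in between
match = re.search(r'(\d{1,2}:\d{2})\s*-\s*(\d{1,2}:\d{2})', raw)

if match:
    start, end = match.groups()
else:
    # no time range found in this text
    start, end = None, None

print(start, end)
```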

In [100]:
# get description from HTML, by selecting div
descr = soup.find('div', {'class': 'tip-description'}).text.strip()
descr

"EU Regulation 1049/2001 regarding public access to documents is a powerful, yet underused tool to obtain documents from EU institutions – from its regulatory agencies such as the European Medicines Agency up to the Council. Come learn how to use it and get official documents you never dreamed existed. Staffan will talk about the importance of the Council working parties and point at the EU Ombudsman's initiatives to (try to) open them up. Stéphane will show how to have FOIfun with the EU Commission using examples from various investigations (Implant files, lobbying against the regulation of toxic chemicals and pesticides)."

In [101]:
# get all speakers from HTML, by finding all speaker divs
speakersInput = soup.findAll('div', {'class': 'sched-person'})
speakersInput

[<div class="sched-person"><a class="sched-avatar" href="moderator/brigitte30"><span><img alt="avatar for Brigitte Alfter" onerror="this.onerror=null;this.src='//cdn.sched.co/common/img/avatar-empty.png';" src="//avatars.sched.co/3/08/7053403/avatar.jpg?44b"/></span></a>
 <h2><a href="moderator/brigitte30" title="Brigitte Alfter">Brigitte Alfter</a></h2><div class="sched-event-details-role-company">Director, Arena for Journalism in Europe</div><div class="sched-event-details-role-bio"></div></div>,
 <div class="sched-person"><a class="sched-avatar" href="speaker/staffan1"><span><img alt="avatar for Staffan Dahllöf" onerror="this.onerror=null;this.src='//cdn.sched.co/common/img/avatar-empty.png';" src="//pbs.twimg.com/profile_images/2187150867/staffanbild_400x400.jpg"/></span></a>
 <h2><a href="speaker/staffan1" title="Staffan Dahllöf">Staffan Dahllöf</a></h2><div class="sched-event-details-role-company">Freelance reporter, freelance, Denemarken</div><div class="sched-event-details-role

In [108]:
# create empty list, to save speakers in
speakers = []
# select names from variable called speakersInput
for s in speakersInput: 
    name = s.find('h2').find('a').text
    # add name to list named speakers
    speakers.append(name)

### collect data for all workshops

In [116]:
# create an empty list to store all data in
detailsWorkshops = []

# let's create a for-loop;
# for every link in the column named link from the table named df:
for l in df['link']:    
    # request the webpage
    r = requests.get(l)
    
    # check if request status code is 200, if so
    if r.status_code == 200:
        # parse HTML, create soup
        soup = BeautifulSoup(r.content, 'html.parser')
    
        # get workshop title
        try:
            workshopName = soup.find('a', {'class': 'name'}).text.strip()
        # soup.find gives None when nothing matches, and None has no .text
        except AttributeError:
            workshopName = None
            
        # get date from HTML, only get the id of said div
        try:
            date = soup.find('div', {'class': 'sched-container-header'})['id']
        # TypeError: div not found (None), KeyError: div has no id
        except (TypeError, KeyError):
            date = None
            
        # get time from HTML, only keep the last 13 characters of the text from the div
        try:
            time = soup.find('div', {'class': 'sched-container-dates'}).text.strip()[-13:]
        except AttributeError:
            time = None
            
        # get description from HTML, by selecting div
        try:
            descr = soup.find('div', {'class': 'tip-description'}).text.strip()
        except AttributeError:
            descr = None
            
        # create empty list, to save speakers in
        speakers = []
        # get all speakers from HTML, by finding all speaker divs
        try:
            speakersInput = soup.findAll('div', {'class': 'sched-person'})
            # select names from variable called speakersInput
            for s in speakersInput: 
                name = s.find('h2').find('a').text
                # add name to list named speakers
                speakers.append(name)
        # a speaker div without the expected h2 or link
        except AttributeError:
            speakersInput = None
            
        # create a list with all scraped info
        workshopInfo = [workshopName,
                        date,
                        time,
                        descr,
                        speakers]
        
        # append list named workshopInfo to list named detailsWorkshops
        detailsWorkshops.append(workshopInfo)
            
    # if request status code is not 200...
    else:
        # print warning + link
        print('Oh No! ' + l)
        
# create dataframe named dh (dataharvest) based on detailsWorkshops list
dh = pd.DataFrame(detailsWorkshops)
# check out dh
dh.head()

Unnamed: 0,0,1,2,3,4
0,How to get access to documents and data in the EU,2020-09-23,10:00 - 11:15,EU Regulation 1049/2001 regarding public acces...,"[Brigitte Alfter, Staffan Dahllöf, Stéphane Ho..."
1,Using access to environmental information rules,2020-09-23,11:30 - 12:45,Pesticides? Air pollution? Climate emissions? ...,"[Brigitte Alfter, Daniel Simons]"
2,FOI lunch 2: How to deal with data access obst...,2020-09-23,13:00 - 14:00,"""We asked for a database and got a 78.000 page...","[Brigitte Alfter, Nils Mulvad, Helena Bengtsson]"
3,Data analysis with Pandas on Jupyter - 2,2020-09-23,14:00 - 15:15,We will showcase Pandas: the most popular Pyth...,"[Adriana Homolova, Winny de Jong]"
4,Going to court to get information - in German ...,2020-09-24,10:00 - 11:15,Can companies or authorities claim author’s ri...,"[Brigitte Alfter, Arne Semsrott]"
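
The loop above sends one request per session, 83 in total, right after each other. A minimal sketch of pausing between requests, which is a little kinder to the website. The helper `fetch_politely`, the stand-in `fake_get` and the delay values are all assumptions for this example, not part of the original notebook:

```python
from time import sleep

def fetch_politely(links, get, delay=1.0):
    """Call get(link) for every link, waiting `delay` seconds in between."""
    responses = []
    for i, link in enumerate(links):
        if i > 0:
            # wait before every request except the first
            sleep(delay)
        responses.append(get(link))
    return responses

# stand-in for requests.get, so the sketch runs without touching the web
def fake_get(link):
    return 'response for ' + link

results = fetch_politely(['page-1', 'page-2'], fake_get, delay=0.1)
print(results)
```

In the notebook you would pass `requests.get` instead of `fake_get`, for example `fetch_politely(df['link'], requests.get)`. Importing `sleep` directly also avoids a clash with the variable named `time` used above.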


In [117]:
# set column names
dh.columns = ['workshop',
              'date', 
              'time', 
              'description', 
              'speakers']

In [118]:
# check out shape of dh dataframe
dh.shape

(83, 5)

In [140]:
# save data to csv
dh.to_csv('dataharvest 2020 workshop details.csv')

## Analyse data

### Beware: explosion ahead ;)

Question: which speaker was involved in most sessions? 

Answering this question would be easy, if you had 1 speaker in every cell in the speakers-column. But our data doesn't look like that... It looks like this:

In [119]:
dh.sample(3)

Unnamed: 0,workshop,date,time,description,speakers
6,FOI lunch 3: Experiences with online document ...,2020-09-24,13:00 - 14:00,How do freedom-of-information request websites...,"[Brigitte Alfter, Tarjei Leer-Salvesen, Arne S..."
62,Aviation and Shipping industry - why are there...,2020-11-10,12:45 - 13:45,"Shipping, like aviation, is not directly inclu...","[Jelena Prtoric, Faig Abbasov, Chloe Farand]"
18,Worth paying for?,2020-10-01,17:00 - 18:00,Which paid services do you use in your investi...,"[Ruben Brugnera, Marcus Lindemann, Jonathan St..."


Now what do we do? Well, every entry in the speakers-column is a list. We know so, because we made it so. So we can 'explode' these lists and create a new row for every item of said list, where all other columns stay the same...

In [123]:
# let's look at the first row
dh.head(1)

Unnamed: 0,workshop,date,time,description,speakers
0,How to get access to documents and data in the EU,2020-09-23,10:00 - 11:15,EU Regulation 1049/2001 regarding public acces...,"[Brigitte Alfter, Staffan Dahllöf, Stéphane Ho..."


In [125]:
# let's look at the speakers column of the first row of the dh dataframe
dh.iloc[0]['speakers']

['Brigitte Alfter', 'Staffan Dahllöf', 'Stéphane Horel']

If we were to explode our dataset, this first row would explode into three rows, which will all be identical except for the speaker. There will be one row for every speaker. : )

In [127]:
# let's try it, just to see...
# dear Pandas, explode the dh dataset, using the speakers-column, 
# and only show us the first 3 rows of the result of this explosion
dh.explode('speakers').head(3)

Unnamed: 0,workshop,date,time,description,speakers
0,How to get access to documents and data in the EU,2020-09-23,10:00 - 11:15,EU Regulation 1049/2001 regarding public acces...,Brigitte Alfter
0,How to get access to documents and data in the EU,2020-09-23,10:00 - 11:15,EU Regulation 1049/2001 regarding public acces...,Staffan Dahllöf
0,How to get access to documents and data in the EU,2020-09-23,10:00 - 11:15,EU Regulation 1049/2001 regarding public acces...,Stéphane Horel
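
If the explosion feels like magic: in plain Python it boils down to a loop inside a loop. A rough sketch of the idea only (this is not how Pandas does it internally); the names `sessions_sketch` and `exploded` are made up, and the data is the first row shown above:

```python
# one session, shaped like a row of dh: a title plus a list of speakers
sessions_sketch = [
    ('How to get access to documents and data in the EU',
     ['Brigitte Alfter', 'Staffan Dahllöf', 'Stéphane Horel']),
]

exploded = []
for workshop, speakers in sessions_sketch:
    for speaker in speakers:
        # repeat the workshop title once per speaker
        exploded.append((workshop, speaker))

for row in exploded:
    print(row)
```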


In [128]:
# let's explode some data, but this time save it 
# in a new dataframe called dhS (short for dataharvest Speakers)
dhS = dh.explode('speakers')

In [129]:
# get number of rows + columns of dh
dh.shape

(83, 5)

In [130]:
# get number of rows + columns of dhS
dhS.shape

(205, 5)

In [139]:
# let's group data...
dhS.groupby(by='speakers', # group on the speakers column
            as_index=False, # not creating an index out of said speakers column
            sort=True, # sort values
            dropna=True # leave out rows where the speaker is missing
           ).count( # after grouping count the number of workshops
                  ).sort_values(by='workshop', # sort values based on workshop-column 
                                 ascending=False # sort from high to low (descending)
                                )[['speakers', # filter dataframe, only keep speakers + workshop columns
                                   'workshop']].head(10) # only show me the top 10

Unnamed: 0,speakers,workshop
1,Adriana Homolova,15
16,Brigitte Alfter,13
48,Jose Miguel Calatayud,11
103,Trine Smistrup,7
81,Ruben Brugnera,6
45,Jelena Prtoric,6
53,Kim Brice,4
37,Freja Wedenborg,4
80,Robin Van Raaij,4
89,Staffan Dahllöf,3
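
Want to double-check a count like this without Pandas? A small sketch with `collections.Counter` from the standard library, using only the speakers of the first five sessions shown earlier, so the numbers are lower than in the full table. The names `speakers_per_session` and `counts` are made up:

```python
from collections import Counter

# speakers of the first five sessions, copied from the outputs above
speakers_per_session = [
    ['Brigitte Alfter', 'Staffan Dahllöf', 'Stéphane Horel'],
    ['Brigitte Alfter', 'Daniel Simons'],
    ['Brigitte Alfter', 'Nils Mulvad', 'Helena Bengtsson'],
    ['Adriana Homolova', 'Winny de Jong'],
    ['Brigitte Alfter', 'Arne Semsrott'],
]

counts = Counter()
for speakers in speakers_per_session:
    # add 1 for every speaker in this session
    counts.update(speakers)

# the three speakers with the most sessions in this small sample
print(counts.most_common(3))
```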


In [143]:
# now, let's do all that once more but then also save the data. 

# let's group data...
dhS.groupby(by='speakers', # group on the speakers column
            as_index=False, # not creating an index out of said speakers column
            sort=True, # sort values
            dropna=True # leave out rows where the speaker is missing
           ).count( # after grouping count the number of workshops
                  ).sort_values(by='workshop', # sort values based on workshop-column 
                                 ascending=False # sort from high to low (descending)
                                )[['speakers', # filter dataframe, only keep speakers + workshop columns
                                   'workshop']].head(10 # only show me the top 10 and then save data
                                                    ).to_csv('dataharvest 2020 top 10 most active speakers.csv')