# Extracting data from the web

## Programming and Data Management (EDI 3400)

### *Vegard H. Larsen (Department of Data Science and Analytics)*

# 1. Extracting data from the web

## HTML Scraping

In [1]:
# Let's use a package from the Standard Library to open a webpage

from urllib.request import urlopen

In [2]:
# We can look at the HTML content of a webpage

with urlopen('https://www.bi.edu') as response:
    bi_homepage = response.read()

In [5]:
bi_homepage.find('bi')

TypeError: argument should be integer or bytes-like object, not 'str'
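The error occurs because `response.read()` returns a `bytes` object, and `bytes.find` only accepts a bytes-like argument or an integer. The sketch below shows the two usual ways around it. It uses a short made-up byte string (`page_bytes`) as a stand-in for the downloaded page, so it runs without a network connection.

```python
# Stand-in for the downloaded page; the content is made up for illustration.
page_bytes = "<html><body>BI Norwegian Business School, bi.edu</body></html>".encode("utf-8")

# Option 1: search the bytes object with a bytes pattern (note the b prefix).
pos_bytes = page_bytes.find(b"bi")

# Option 2: decode to a str first, then search with an ordinary string.
page_text = page_bytes.decode("utf-8")
pos_text = page_text.find("bi")

print(type(page_bytes), pos_bytes)
print(type(page_text), pos_text)
```

Both searches are case-sensitive, so the upper-case `BI` is skipped. The two positions agree here only because the sample text is pure ASCII. With non-ASCII characters, byte offsets and character offsets can differ.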

In [6]:
# response.read() returns bytes; decode them into a Python string

bi_homepage_as_html_text = bi_homepage.decode('utf-8')

In [7]:
bi_homepage_as_html_text

'\r\n\r\n<!DOCTYPE html>\r\n<html lang="en">\r\n<head lang="en">\r\n    <meta charset="utf-8" />\r\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\r\n    <meta name="format-detection" content="telephone=no">\r\n    <meta name="google-site-verification" content="bGJqkozEiI_vcRg_DpIZznkIZC0ZC2BGyRLG6p343ak" />\r\n    <meta name="msapplication-TileColor" content="#00467f">\r\n    <meta name="theme-color" content="#ffffff">\r\n    \r\n        <link rel="canonical" href="https://www.bi.edu/">\r\n    <meta content="BI Norwegian Business School is Norway&#39;s only triple-accredited school. We offer a range of degrees." name="description">\r\n    <meta content="https://www.bi.edu/" property="og:url"><meta content="BI Norwegian Business School" property="og:title"><meta content="BI Norwegian Business School is Norway&#39;s only triple-accredited school. We offer a range of degrees." property="og:description"><meta content="BI Business School" property="og:site_name"

In [8]:
# We can analyze the content on the webpage 

bi_homepage_as_html_text.count('Oslo')

4

## Making sense of HTML code: BeautifulSoup

In [9]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(bi_homepage_as_html_text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head lang="en">
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="telephone=no" name="format-detection"/>
  <meta content="bGJqkozEiI_vcRg_DpIZznkIZC0ZC2BGyRLG6p343ak" name="google-site-verification">
   <meta content="#00467f" name="msapplication-TileColor"/>
   <meta content="#ffffff" name="theme-color"/>
   <link href="https://www.bi.edu/" rel="canonical"/>
   <meta content="BI Norwegian Business School is Norway's only triple-accredited school. We offer a range of degrees." name="description"/>
   <meta content="https://www.bi.edu/" property="og:url"/>
   <meta content="BI Norwegian Business School" property="og:title"/>
   <meta content="BI Norwegian Business School is Norway's only triple-accredited school. We offer a range of degrees." property="og:description"/>
   <meta content="BI Business School" property="og:site_name"/>
   <meta content="https://www.bi.edu/globalassets/met

In [11]:
soup.find_all('h2')

[<h2 class="category-name" id="submenuButton-f149fedd-d3c9-4492-96e9-01274d2d4615">
                           Programmes and individual courses
                         </h2>,
 <h2 class="category-name" id="submenuButton-a4d078de-bfde-49ea-8c01-be5a2808dbc6">
                           Study at BI
                         </h2>,
 <h2 class="category-name" id="submenuButton-1782caba-241a-4b65-924e-8c24bdc21ec4">
                           Faculty and research
                         </h2>,
 <h2 class="category-name" id="submenuButton-7aa71029-657d-4b40-ad86-c3945b9cce19">
                           Business and Alumni
                         </h2>,
 <h2 class="category-name" id="submenuButton-89c88f42-374b-4a7a-a883-3cf9b4abcd23">
                           About BI
                         </h2>,
 <h2 aria-label="Why study in Norway? - Read more" class="title">Why study in Norway?</h2>,
 <h2 aria-label="Bachelor programmes" class="title">Bachelor programmes</h2>,
 <h2 aria-label="Ma
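The list returned by `find_all` contains whole tags, including attributes and surrounding whitespace. A minimal sketch of pulling out just the text, filtering by class, and collecting link targets is shown below. The HTML snippet (`sample_html`) is a small hand-written stand-in modelled on the output above, not the real page, and the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a page; not the actual BI homepage.
sample_html = """
<html><body>
  <h2 class="category-name">
      Study at BI
  </h2>
  <h2 class="title">Why study in Norway?</h2>
  <a href="https://www.bi.edu/">Home</a>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# get_text(strip=True) returns the tag's text without surrounding whitespace.
headings = [h2.get_text(strip=True) for h2 in sample_soup.find_all("h2")]

# class_ (with a trailing underscore) filters on the HTML class attribute.
title_headings = [
    h2.get_text(strip=True) for h2 in sample_soup.find_all("h2", class_="title")
]

# href=True keeps only <a> tags that actually have an href attribute.
links = [a["href"] for a in sample_soup.find_all("a", href=True)]

print(headings)
print(title_headings)
print(links)
```

The same calls should work on the `soup` object built from the downloaded page, although the exact classes and tags depend on how the site is structured at the time of download.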

## Downloading CSV files from webpages

In [12]:
import pandas as pd

winterolympicsmedals = pd.read_csv('http://winterolympicsmedals.com/medals.csv')

In [16]:
winterolympicsmedals.sample(3)

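|      | Year | City              | Sport   | Discipline     | NOC | Event        | Event gender | Medal  |
|------|------|-------------------|---------|----------------|-----|--------------|--------------|--------|
| 2120 | 2006 | Turin             | Skiing  | Freestyle Ski. | CHN | aerials      | W            | Silver |
| 1999 | 2002 | Salt Lake City    | Skating | Figure skating | RUS | individual   | M            | Silver |
| 322  | 1956 | Cortina d'Ampezzo | Skiing  | Alpine Skiing  | AUT | giant slalom | M            | Bronze |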


## In-class exercises

1. How many of the medals were given to men and how many were given to women?

2. How many unique sports does the data set have? List the unique sports in the data set.

3. Use a loop to iterate through the whole data set and print out the Year and the City for each of the medals given. 

4. Use the code from 3. but now modify it so that you only print out the Year and City if the Year has not been printed out before.
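As a starting point for the exercises, here is a minimal sketch of the building blocks involved: `value_counts`, `unique`, and a plain loop with a `set` to remember what has already been printed. It runs on a tiny made-up DataFrame (`toy_medals`) that borrows only the column names from the data set. Its rows are illustrative and are not the real data.

```python
import pandas as pd

# Tiny made-up table using the same column names as the medals data set.
toy_medals = pd.DataFrame({
    "Year": [1956, 2002, 2006, 2006],
    "City": ["Cortina d'Ampezzo", "Salt Lake City", "Turin", "Turin"],
    "Sport": ["Skiing", "Skating", "Skiing", "Skating"],
    "Event gender": ["M", "M", "W", "W"],
})

# Exercise 1: count rows per value of a column.
gender_counts = toy_medals["Event gender"].value_counts()

# Exercise 2: list the distinct values of a column.
unique_sports = toy_medals["Sport"].unique().tolist()

# Exercises 3 and 4: loop over the rows, printing each Year only once.
seen_years = set()
first_occurrences = []
for year, city in zip(toy_medals["Year"], toy_medals["City"]):
    if year not in seen_years:
        seen_years.add(year)
        first_occurrences.append((year, city))
        print(year, city)
```

Replacing `toy_medals` with `winterolympicsmedals` should give the answers for the full data set, assuming the column names match the sample shown above.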

## Pandas-datareader library

In [17]:
# This library might not be installed.
# If the import fails, install it first, e.g. with: pip install pandas-datareader

from pandas_datareader.data import DataReader