# Webscraping

Author: Justin Chun-ting Ho

Date: 27 Nov 2023

Credit: Some sections are adapted from the slides prepared by Damian Trilling

### What is a website?

Let's take a look at [this](https://ascor.uva.nl/staff/ascor-faculty/ascor-staff---faculty.html)

### Typical Workflow

- Download the source code (HTML)
- Identify the pattern to isolate what we want
- Write a script to extract

## Approach 1: Regular Expression

You probably need [this](https://images.datacamp.com/image/upload/v1665049611/Marketing/Blog/Regular_Expressions_Cheat_Sheet.pdf).

In [None]:
import requests
import re

In [None]:
response = requests.get('https://ascor.uva.nl/staff/ascor-faculty/ascor-staff---faculty.html')

In [None]:
text = response.text

In [None]:
# Capture everything between "mailto:" and the closing double quote (non-greedy)
emails = re.findall(r'mailto:(.*?)"', text)
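To see what the pattern does without hitting the network, here is a minimal sketch on a made-up HTML fragment. The addresses and the `sample_html` variable are illustrative assumptions, not taken from the ASCoR page. The same address often appears more than once in a page, so the sketch also removes duplicates.

```python
import re

# Hypothetical HTML fragment standing in for response.text
sample_html = '''
<a class="mail" href="mailto:a.author@example.org">Email A</a>
<a class="mail" href="mailto:b.writer@example.org">Email B</a>
<a class="mail" href="mailto:a.author@example.org">Email A again</a>
'''

# Non-greedy capture: stop at the first double quote after "mailto:"
found = re.findall(r'mailto:(.*?)"', sample_html)

# dict.fromkeys drops duplicates while keeping first-seen order
unique_emails = list(dict.fromkeys(found))
print(unique_emails)  # ['a.author@example.org', 'b.writer@example.org']
```

The non-greedy `.*?` matters here: a greedy `.*` would run on to the last double quote on the line rather than the first.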

## Approach 2: Modern Packages

### Tools
- Beautiful Soup: `pip install beautifulsoup4` or `conda install -c anaconda beautifulsoup4`
- SelectorGadget: https://selectorgadget.com/

In [None]:
from bs4 import BeautifulSoup 
import csv
import pandas as pd
import numpy as np

In [None]:
URL = 'https://ascor.uva.nl/staff/faculty.html'
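r = requests.get(URL)
# Name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(r.content, 'html.parser')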

In [None]:
r.content

In [None]:
emails = soup.find_all(class_="mail")

In [None]:
emails[0:6]

In [None]:
for email in emails[0:6]:
    print(email['href'])
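SelectorGadget returns CSS selectors, and Beautiful Soup can use them directly through `select()`. The following is a minimal sketch on a made-up HTML fragment. The markup, the addresses, and the names `sample_html`, `soup_demo`, and `addresses` are assumptions for illustration, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name "mail" comes from the cells above
sample_html = '''
<ul>
  <li><a class="mail" href="mailto:a.author@example.org">A</a></li>
  <li><a class="other" href="/about.html">About</a></li>
  <li><a class="mail" href="mailto:b.writer@example.org">B</a></li>
</ul>
'''
soup_demo = BeautifulSoup(sample_html, 'html.parser')

# 'a.mail' is a CSS selector: <a> tags whose class list contains "mail"
mail_links = soup_demo.select('a.mail')

# Strip the leading "mailto:" scheme to keep only the address
addresses = [a['href'].replace('mailto:', '', 1) for a in mail_links]
print(addresses)  # ['a.author@example.org', 'b.writer@example.org']
```

`find_all(class_="mail")` and `select('.mail')` match the same elements. `select()` is convenient when the selector is copied from a browser tool.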

### Another Way

In [None]:
soup = BeautifulSoup(r.content, 'html.parser')
items = soup.find_all(class_="c-item__link")

In [None]:
items[0]

In [None]:
links = []
for i in items:
    link = i['href']
    links.append(link) 

In [None]:
links[0:10]

In [None]:
link = '/profile/h/o/j.c.ho/j.c.ho.html?origin=%2BkELbJiCRnm%2F56cOYZSXzA'
# The link already starts with '/', so the base URL must not end with one
url = 'https://ascor.uva.nl' + link
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
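Building URLs by string concatenation is sensitive to leading and trailing slashes. As a hedged alternative, the standard library's `urljoin` resolves a relative link against a base URL. In this sketch the path is a shortened version of the one above (query string dropped), and `base` and `relative_link` are illustrative names.

```python
from urllib.parse import urljoin

base = 'https://ascor.uva.nl'
# Shortened illustrative path, without the ?origin=... query string
relative_link = '/profile/h/o/j.c.ho/j.c.ho.html'

# The result is the same whether or not the base ends with a slash
url_a = urljoin(base, relative_link)
url_b = urljoin(base + '/', relative_link)

print(url_a)  # https://ascor.uva.nl/profile/h/o/j.c.ho/j.c.ho.html
print(url_a == url_b)  # True
```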

In [None]:
name = soup.find(class_="c-profile__name").get_text()
name

In [None]:
summary = soup.find(class_="c-profile__summary").get_text()
summary

In [None]:
profile = soup.find(id="Profile").get_text()
profile

In [None]:
divs = soup.find_all('div', class_="c-profile__list")
divs

In [None]:
divs[1].find_all('li')

In [None]:
divs[1].find_all('li')[0].get_text()

In [None]:
profiles = []

for link in links:
    print(link)
    url = 'https://ascor.uva.nl' + link
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    name = soup.find(class_="c-profile__name").get_text()
    summary = soup.find(class_="c-profile__summary").get_text()
#    profile = soup.find(id="Profile").get_text()
    divs = soup.find_all('div', class_="c-profile__list")
    email = divs[1].find_all('li')[0].get_text()
    
    profile = {} 
    profile['name'] = name
    profile['summary'] = summary
#     profile['profile'] = profile
    profile['email'] = email
    profiles.append(profile) 

In [None]:
profiles = []

for link in links:
    print(link)
    url = 'https://ascor.uva.nl' + link
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    name = soup.find(class_="c-profile__name").get_text()
    summary = soup.find(class_="c-profile__summary").get_text()
    try:
        profile_text = soup.find(id="Profile").get_text()
    except AttributeError:
        # find() returned None: this page has no element with id="Profile"
        profile_text = np.nan
    divs = soup.find_all('div', class_="c-profile__list")
    try:
        email = divs[1].find_all('li')[0].get_text()
    except IndexError:
        # Fewer than two lists, or the second list has no <li> items
        email = np.nan
    
    profile = {} 
    profile['name'] = name
    profile['summary'] = summary
    profile['profile_text'] = profile_text
    profile['email'] = email
    profiles.append(profile) 
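The repeated `try`/`except` blocks can be factored into a small helper. This is a minimal sketch, not part of the original notebook: `text_or_none` is a hypothetical helper name, and the HTML fragment is made up so the example runs offline. It returns `None` rather than `np.nan`; pandas treats both as missing values in a text column.

```python
from bs4 import BeautifulSoup

def text_or_none(soup, **kwargs):
    """Return the stripped text of the first matching element, or None if absent."""
    element = soup.find(**kwargs)
    return element.get_text(strip=True) if element is not None else None

# Hypothetical profile page that lacks an element with id="Profile"
sample_html = '''
<h1 class="c-profile__name"> Dr. A. Example </h1>
<div class="c-profile__summary">Assistant Professor</div>
'''
soup_demo = BeautifulSoup(sample_html, 'html.parser')

demo_name = text_or_none(soup_demo, class_='c-profile__name')
demo_profile_text = text_or_none(soup_demo, id='Profile')

print(demo_name)          # Dr. A. Example
print(demo_profile_text)  # None
```

Checking for `None` explicitly avoids catching unrelated errors that a broad `except` would silently hide.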

In [None]:
df = pd.DataFrame(profiles)

In [None]:
df

In [None]:
df.to_csv('profiles.csv')

In [None]:
df.to_json('profiles.json', orient='records', lines=True)
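With `orient='records', lines=True`, pandas writes one JSON object per row, one per line (the JSON Lines layout). The sketch below round-trips a tiny made-up table through an in-memory buffer instead of a file. The data and the names `df_demo`, `json_text`, and `restored` are illustrative assumptions.

```python
import io
import pandas as pd

# Tiny made-up table standing in for the scraped profiles
df_demo = pd.DataFrame([
    {'name': 'A. Example', 'email': 'a.example@example.org'},
    {'name': 'B. Sample', 'email': None},
])

# Write JSON Lines to an in-memory buffer rather than to disk
buffer = io.StringIO()
df_demo.to_json(buffer, orient='records', lines=True)
json_text = buffer.getvalue()
print(json_text)

# Read it back; lines=True tells pandas to parse one record per line
restored = pd.read_json(io.StringIO(json_text), lines=True)
print(restored.shape)  # (2, 2)
```

A file written by the cell above can be loaded the same way with `pd.read_json('profiles.json', lines=True)`.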

### Exercise

Get the full text of all the news items listed here: https://ascor.uva.nl/news/newslist.html