# Scraping Websites


The idea behind web scraping is this: gather data quickly and replicably from websites by taking advantage of the fact that web pages are structured documents. While each site is idiosyncratic, its pages tend to be _internally consistent_. This means that web scraping is always a bespoke affair: you can't build a scraper that will simply work for all pages. However, there _are_ general principles that, once practiced, will let you develop a scraper for a given website without too much effort.

The scraper we're going to build today downloads key information about DUSP faculty from the [DUSP website's 'people' list](http://dusp.mit.edu/people). We're going to scrape information about affiliation, navigate through weblinks, and download photos. Along the way, we'll be doing some neat tricks: naming downloaded photos consistently and dealing with missing and inaccessible information.

To do this, we'll be using a couple of Python packages. The first is `bs4`, or Beautiful Soup 4. This is an HTML and XML parser that takes a downloaded web page and gives us objects and methods for navigating its structure intuitively. It's a standard tool in web scraping applications. We'll also be using `wget` to download files and `requests` to request web pages.

The first thing you'll need to do is install these packages. Assuming you've created and activated your virtual environment, you can install them using `pip`.

```sh
pip install requests wget bs4
```

In [4]:
import requests
from pprint import pprint
from bs4 import BeautifulSoup
import wget

In [5]:
base_url = 'http://dusp.mit.edu/people'
base_page = requests.get(base_url)
soup = BeautifulSoup(base_page.content, 'html.parser')

In [6]:
people = soup.find_all('div', class_='row-people')
pprint(people[0:3])

[<div class="views-row views-row-1 views-row-odd views-row-first row-disc-IDG row-people">
<div> <span><div class="bull"></div></span> </div>
<div class="views-field views-field-field-user-picture"> <div class="field-content"><img alt="" src="http://dusp.mit.edu/sites/dusp.mit.edu/files/styles/profile_pic/public/user/pictures/cherie.jpg?itok=tqCmbLrz"/></div> </div>
<div class="views-field views-field-name"> <span class="field-content"><a class="username" href="/faculty/cherie-abbanat" title="View user profile.">Cherie Abbanat</a></span> </div>
<div class="views-field views-field-field-position-and-title-1"> <div class="field-content">Lecturer of International Development and Urban Studies</div> </div>
<div class="views-field views-field-field-position-and-title-2"> <div class="field-content"></div> </div>
<div class="views-field views-field-field-other-division"> <div class="field-content"></div> </div> </div>,
 <div class="views-row views-row-2 views-row-even row-disc-IDG row-people"

Now that we have found all elements with the class `row-people`, we can access each element's components using the element's methods. For example:

In [7]:
pprint(people[0])
pprint(people[0].get_text().strip())

<div class="views-row views-row-1 views-row-odd views-row-first row-disc-IDG row-people">
<div> <span><div class="bull"></div></span> </div>
<div class="views-field views-field-field-user-picture"> <div class="field-content"><img alt="" src="http://dusp.mit.edu/sites/dusp.mit.edu/files/styles/profile_pic/public/user/pictures/cherie.jpg?itok=tqCmbLrz"/></div> </div>
<div class="views-field views-field-name"> <span class="field-content"><a class="username" href="/faculty/cherie-abbanat" title="View user profile.">Cherie Abbanat</a></span> </div>
<div class="views-field views-field-field-position-and-title-1"> <div class="field-content">Lecturer of International Development and Urban Studies</div> </div>
<div class="views-field views-field-field-position-and-title-2"> <div class="field-content"></div> </div>
<div class="views-field views-field-field-other-division"> <div class="field-content"></div> </div> </div>
'Cherie Abbanat \n Lecturer of International Development and Urban Studies'
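
To see what these methods do without depending on the live page, here is a minimal sketch that runs against a small inline HTML string. The markup and the name "Jane Doe" are made up for illustration; they only imitate the shape of one row in the people list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates one row of the people list.
html = """
<div class="row-people">
  <span><a class="username" href="/faculty/jane-doe">Jane Doe</a></span>
  <div class="views-field-field-position-and-title-1">Lecturer of Urban Studies</div>
</div>
"""

soup_demo = BeautifulSoup(html, 'html.parser')
row = soup_demo.find('div', class_='row-people')

link = row.find('a', class_='username')
name = link.get_text()   # the text inside the tag
href = link.get('href')  # the value of an attribute
title = row.find('div', class_='views-field-field-position-and-title-1').get_text()

print(name, href, title)
```

`find` returns the first match, `get_text` returns the text inside a tag, and `get` reads an attribute. These are the same three calls the next cell applies to the real rows.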


In [8]:
for person in people[0:3]:
    name_href = person.find('a', class_='username')
    name = name_href.get_text()
    href = name_href.get('href')
    pos_1 = person.find('div', class_='views-field-field-position-and-title-1').get_text()
    pos_2 = person.find('div', class_='views-field-field-position-and-title-2').get_text()
    other = person.find('div', class_='views-field-field-other-division').get_text()
    print(name, href, pos_1, pos_2, other)

Cherie Abbanat /faculty/cherie-abbanat  Lecturer of International Development and Urban Studies       
Paul Altidor /faculty/paul-altidor  Visiting Lecturer of International Development and Planning        
Mariana Arcaya /faculty/mariana-arcaya  Associate Professor of Urban Planning and Public Health    Associate Department Head    


Nice! But it turns out that we can do more: we can use the `requests` module to automatically comb through each faculty member's personal page to get their biography and their areas of interest.
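
The `href` values we scraped are root-relative paths (`/faculty/...`), so they need the site's scheme and host before they can be requested. Plain string concatenation is enough for this page. As a slightly more general alternative, the sketch below uses `urllib.parse.urljoin` from the standard library, which also copes with links that are already absolute. The `example.com` URL is a placeholder, not something found on the DUSP site.

```python
from urllib.parse import urljoin

base_url = 'http://dusp.mit.edu/people'

# A root-relative path replaces the whole path of the base URL.
profile_url = urljoin(base_url, '/faculty/cherie-abbanat')
print(profile_url)

# A link that is already absolute is returned unchanged.
absolute = urljoin(base_url, 'http://example.com/photo.jpg')
print(absolute)
```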

In [9]:
for person in people[0:3]:
    name_href = person.find('a', class_='username')
    name = name_href.get_text()
    href = name_href.get('href')
    pos_1 = person.find('div', class_='views-field-field-position-and-title-1').get_text()
    pos_2 = person.find('div', class_='views-field-field-position-and-title-2').get_text()
    other = person.find('div', class_='views-field-field-other-division').get_text()
    if href:
        person_soup = BeautifulSoup(requests.get('http://dusp.mit.edu' + href).content, 'html.parser')
        bio = person_soup.find('div', class_='pane-user-field-bio').get_text()
    print(name, href, pos_1, pos_2, other, bio)

Cherie Abbanat /faculty/cherie-abbanat  Lecturer of International Development and Urban Studies        

Cherie is a lecturer at DUSP and in the Department of Architecture where she has been teaching for over fifteen years. Cherie lectures on policy, non-profit management, post-disaster rebuilding in New Orleans and Haiti, and the need for grassroots initiatives. 
As a practitioner, Cherie joined Haiti Projects Inc., a 501 (c)3 non-profit, in 2013 as its CEO to transform Haiti Projects from a fledgling non-profit into a growing social enterprise. Cherie successfully turned Haiti Projects around financially and the non-profit is ready to grow. Haiti Projects boasts 4 employees in the US and close to 90 employees in Haiti. Haiti Projects operates a women's sewing cooperative, a women's health clinic that focuses on family planning, health and hygiene, and a community library. With support from the Kellogg Foundation, Haiti Projects plans to build a new community multi-purpose center in 2
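
One caveat about the cell above: `find` returns `None` when nothing matches, so calling `.get_text()` directly on its result raises an `AttributeError` for any profile that lacks a bio. The following is a minimal sketch of one way to guard against that. The helper name `text_or_none` and the inline profile markup are hypothetical.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, tag, class_name):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = parent.find(tag, class_=class_name)
    return element.get_text().strip() if element is not None else None

# Hypothetical profile page that has a bio but no office listing.
profile_html = '<div class="pane-user-field-bio"> Jane studies housing policy. </div>'
profile = BeautifulSoup(profile_html, 'html.parser')

bio = text_or_none(profile, 'div', 'pane-user-field-bio')
office = text_or_none(profile, 'div', 'views-field-field-office')
print(bio)
print(office)
```

The missing field comes back as `None` instead of stopping the whole scrape.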

In [10]:
for person in people[0:3]:
    name_href = person.find('a', class_='username')
    name = name_href.get_text()
    href = name_href.get('href')
    pos_1 = person.find('div', class_='views-field-field-position-and-title-1').get_text()
    pos_2 = person.find('div', class_='views-field-field-position-and-title-2').get_text()
    other = person.find('div', class_='views-field-field-other-division').get_text()
    if href:
        person_soup = BeautifulSoup(requests.get('http://dusp.mit.edu' + href).content, 'html.parser')
        bio = person_soup.find('div', class_='pane-user-field-bio')
        if bio:
            bio = bio.get_text()
        office = person_soup.find('div', class_='views-field views-field-field-office')
        if office:
            office = office.get_text()
        email = person_soup.find('div', class_='views-field views-field-field-secondary-email')
        if email:
            email = email.get_text()
        interests = person_soup.find('strong', class_='views-label views-label-field-areas-of-interest')
        if interests:
            interests = interests.next_sibling.next_sibling.get_text()
    print(name, href, pos_1, pos_2, other, bio, office, email, interests)

Cherie Abbanat /faculty/cherie-abbanat  Lecturer of International Development and Urban Studies        

Cherie is a lecturer at DUSP and in the Department of Architecture where she has been teaching for over fifteen years. Cherie lectures on policy, non-profit management, post-disaster rebuilding in New Orleans and Haiti, and the need for grassroots initiatives. 
As a practitioner, Cherie joined Haiti Projects Inc., a 501 (c)3 non-profit, in 2013 as its CEO to transform Haiti Projects from a fledgling non-profit into a growing social enterprise. Cherie successfully turned Haiti Projects around financially and the non-profit is ready to grow. Haiti Projects boasts 4 employees in the US and close to 90 employees in Haiti. Haiti Projects operates a women's sewing cooperative, a women's health clinic that focuses on family planning, health and hygiene, and a community library. With support from the Kellogg Foundation, Haiti Projects plans to build a new community multi-purpose center in 2
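
The double `next_sibling` in the interests lookup deserves a note. Beautiful Soup treats the whitespace between two tags as a node of its own, so the first `next_sibling` of the label is usually a newline string and only the second one is the tag holding the value. The sketch below shows this on made-up markup. It assumes the real profile pages have whitespace between the label and its value, which is what the code above implies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a label, a line break, then the value we want.
snippet = """<div><strong class="views-label-field-areas-of-interest">Areas of Interest</strong>
<div class="field-content">Housing, Transportation</div></div>"""

label = BeautifulSoup(snippet, 'html.parser').find('strong')

first = label.next_sibling   # the newline between the two tags
second = first.next_sibling  # the <div> that follows it
print(repr(first))
print(second.get_text())

# find_next_sibling() only considers tags, so it skips the whitespace.
print(label.find_next_sibling('div').get_text())
```

`find_next_sibling` is a possible alternative when the amount of whitespace between the tags is not known in advance.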

In [11]:
import csv
output_csv = 'faculty.csv'

with open(output_csv, 'w', newline='') as f:  # newline='' lets the csv module control line endings
    field_names = ['name', 'href', 'pos_1', 'pos_2', 'other_affil', 'bio', 'office', 'email', 'interests']
    writer = csv.DictWriter(f, field_names)
    writer.writeheader()
    for person in people[0:3]:
        name_href = person.find('a', class_='username')
        name = name_href.get_text()
        href = 'http://dusp.mit.edu' + name_href.get('href')
        pos_1 = person.find('div', class_='views-field-field-position-and-title-1').get_text()
        pos_2 = person.find('div', class_='views-field-field-position-and-title-2').get_text()
        other_affil = person.find('div', class_='views-field-field-other-division').get_text()
        if href:
            person_soup = BeautifulSoup(requests.get(href).content, 'html.parser')
            bio = person_soup.find('div', class_='pane-user-field-bio')
            if bio:
                bio = bio.get_text()
            office = person_soup.find('div', class_='views-field views-field-field-office')
            if office:
                office = office.get_text()
            email = person_soup.find('div', class_='views-field views-field-field-secondary-email')
            if email:
                email = email.get_text()
            interests = person_soup.find('strong', class_='views-label views-label-field-areas-of-interest')
            if interests:
                interests = interests.next_sibling.next_sibling.get_text()
        row = {
            'name': name,
            'href': href,
            'pos_1': pos_1,
            'pos_2': pos_2,
            'other_affil': other_affil,
            'bio': bio,
            'office': office,
            'email': email,
            'interests': interests
        }
        writer.writerow(row)
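
To check how `csv.DictWriter` treats fields that were never found, the sketch below writes a few made-up rows to an in-memory buffer and reads them back. The names, office numbers, and addresses are placeholders. `io.StringIO` stands in for a real file so that nothing is written to disk.

```python
import csv
import io

field_names = ['name', 'office', 'email']
# Hypothetical rows; None stands in for a field that was not found.
rows = [
    {'name': 'Jane Doe', 'office': '9-000', 'email': None},
    {'name': 'John Roe', 'office': None, 'email': 'jroe@example.com'},
]

buffer = io.StringIO(newline='')  # in-memory stand-in for open(path, 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=field_names)
writer.writeheader()
writer.writerows(rows)

buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[0])
```

`None` values are written as empty cells, so a missing office or email becomes an empty string when the file is read back.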

## Download Photos
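
The cell below names each photo after the faculty member by lowercasing the name, replacing spaces with underscores, and dropping periods. That leaves apostrophes, hyphens, and accented letters in the file names. If you want stricter, ASCII-only names, the following sketch is one possible variant. The function name `photo_filename` is hypothetical and this is not what the cell below does.

```python
import re
import unicodedata

def photo_filename(name, extension='.jpg'):
    """Turn a display name into a lowercase, ASCII-only file name."""
    # Split accented letters into base letter + accent, then drop the accents.
    ascii_name = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
    # Collapse every run of characters other than letters and digits into one underscore.
    slug = re.sub(r'[^a-z0-9]+', '_', ascii_name.lower()).strip('_')
    return slug + extension

for example in ['Lawrence S. Bacow', "Alexander D'Hooghe", 'John E. Fernández', 'Janelle Knox-Hayes']:
    print(photo_filename(example))
```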

In [33]:
import csv
import os
import wget
import time

output_csv = 'faculty.csv'

with open(output_csv, 'w', newline='') as f:  # newline='' lets the csv module control line endings
    field_names = ['name', 'href', 'pos_1', 'pos_2', 'other_affil', 
                   'bio', 'office', 'email', 'interests', 'image_file']
    writer = csv.DictWriter(f, field_names)
    writer.writeheader()
    for person in people:
        name_href = person.find('a', class_='username')
        name = name_href.get_text()
        print(f'Scraping {name}...')
        href = 'http://dusp.mit.edu' + name_href.get('href')
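        image_file = None  # stays None when the row has no photo
        image_url = person.find('img')
        if image_url:
            image_url = image_url.get('src')
            out_dir = os.path.join(os.getcwd(), 'images')
            os.makedirs(out_dir, exist_ok=True)  # make sure the folder exists before downloading
            image_file = name.replace(' ', '_').lower().replace('.', '') + '.jpg'
            try:
                wget.download(image_url, os.path.join(out_dir, image_file))
            except Exception as err:
                # An exception cannot be concatenated with a string, so format it instead.
                print(f'Could not download faculty photo: {err}')
                image_file = str(err)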
        pos_1 = person.find('div', class_='views-field-field-position-and-title-1').get_text()
        pos_2 = person.find('div', class_='views-field-field-position-and-title-2').get_text()
        other_affil = person.find('div', class_='views-field-field-other-division').get_text()
        if href:
            person_soup = BeautifulSoup(requests.get(href).content, 'html.parser')
            bio = person_soup.find('div', class_='pane-user-field-bio')
            if bio:
                bio = bio.get_text()
            office = person_soup.find('div', class_='views-field views-field-field-office')
            if office:
                office = office.get_text()
            email = person_soup.find('div', class_='views-field views-field-field-secondary-email')
            if email:
                email = email.get_text()
            interests = person_soup.find('strong', class_='views-label views-label-field-areas-of-interest')
            if interests:
                interests = interests.next_sibling.next_sibling.get_text()
        row = {
            'name': name,
            'href': href,
            'pos_1': pos_1,
            'pos_2': pos_2,
            'other_affil': other_affil,
            'bio': bio,
            'office': office,
            'email': email,
            'interests': interests,
            'image_file': image_file
        }
        writer.writerow(row)
        time.sleep(1.5)

Scraping Cherie Abbanat...
Scraping Paul Altidor...
Scraping Mariana Arcaya...
Scraping Nicholas Ashford...
Scraping Lawrence S. Bacow...
Scraping Eran Ben-Joseph...
Scraping Alan Berger...
Scraping Devin Michelle Bunten...
Scraping Gabriella Carolini...
Scraping Phillip Clay...
Scraping Joseph Coughlin...
Scraping Karilyn Crockett...
Scraping Dayna Cunningham...
Scraping Alexander D'Hooghe...
Scraping Catherine D'Ignazio...
Scraping Mary Jane Daly...
Scraping Michael Dennis...
Scraping Fabio Duarte...
Scraping Louise Elving...
Scraping John E. Fernández...
Scraping Joseph Ferreira...
Scraping Robert Fogelson...
Scraping Dennis Frenchman...
Scraping Ralph Gakenheimer...
Scraping David Geltner...
Scraping Amy Glasmeier...
Scraping Ezra Haber Glenn...
Scraping Gary Hack...
Scraping David Hsu...
Scraping Eric Huntley...
Scraping Jason Jackson...
Scraping Erica Caple James...
Scraping Langley Keyes...
Scraping Melvin King...
Scraping Eric Klopfer...
Scraping Janelle Knox-Hayes...
Scraping 