# Homework 07: Scraping with BeautifulSoup (and friends) 

## Part One

**Scrape the list of US presidents from https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States to an external site using pandas and save them as a CSV.**

In [1]:
import pandas as pd

URL = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"
df = pd.read_html(URL, header=0)[0]

df.head()

Unnamed: 0,No.[a],Portrait,Name (birth–death),Term[16],Party[b][17],Party[b][17].1,Election,Vice President[18]
0,1,,George Washington (1732–1799) [19],"April 30, 1789 – March 4, 1797",,Unaffiliated,1788–891792,John Adams[c]
1,2,,John Adams (1735–1826) [21],"March 4, 1797 – March 4, 1801",,Federalist,1796,Thomas Jefferson[d]
2,3,,Thomas Jefferson (1743–1826) [23],"March 4, 1801 – March 4, 1809",,Democratic- Republican,1800 1804,Aaron BurrGeorge Clinton
3,4,,James Madison (1751–1836) [24],"March 4, 1809 – March 4, 1817",,Democratic- Republican,18081812,"George Clinton[e]Vacant after April 20, 1812El..."
4,5,,James Monroe (1758–1831) [26],"March 4, 1817 – March 4, 1825",,Democratic- Republican,18161820,Daniel D. Tompkins


In [2]:
df.to_csv('presidents.csv', index=False)

## Part Two

**Scrape the content of https://www.lemonde.fr/ and save it as a CSV.**

**We want: titles, subhead, article URL, whether it's premium or not, byline, article type, image URL.**

Bonus, if you want to get fancy:

Make the CSV file auto-updating. Use this tutorial (<a href="https://www.youtube.com/watch?v=QNKxzkNpsko">video</a>, <a href="https://jonathansoma.com/everything/git/auto-updating-scaper-viz/">text</a>) but just ignore the visualization/datawrapper aspect

In [3]:
from bs4 import BeautifulSoup
import requests

response = requests.get('https://www.lemonde.fr/en/')
doc = BeautifulSoup(response.text)

doc

<!DOCTYPE html>
<html lang="en" prefix="og: https://ogp.me/ns#"> <head> <meta charset="utf-8"/> <meta content="IE=edge" http-equiv="X-UA-Compatible"/> <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <meta content="no-referrer-when-downgrade" name="referrer"/> <meta content="#ffffff" name="theme-color"/> <script async="1" data-gdpr-purposes="personalization" data-gdpr-src="https://www.lemonde.fr/en/bucket/resources/english/js/chartbeatMab.88d00fd6428d5de2.js" type="text/plain"></script> <link data-rh="true" href="https://www.lemonde.fr/en/" hreflang="en-US" rel="alternate"/> <link data-rh="true" href="https://www.lemonde.fr/en/" hreflang="en" rel="alternate"/> <link data-rh="true" href="https://www.lemonde.fr/en/" hreflang="en-CA" rel="alternate"/> <link data-rh="true" href="https://www.lemonde.fr/en/" hreflang="en-GB" rel="alternate"/> <link href="//img.lemde.fr" rel="preconnect"/> <link as="image" fetchpriority="high" imagesizes="100vw" imagesrc

In [4]:
items = doc.find_all(class_='article')

items

[<a class="article article--nav" data-suggestion="" href="https://www.lemonde.fr/en/international/article/2025/06/19/in-the-donbas-firefighters-operate-under-the-constant-threat-of-drones_6742482_4.html"> <div class="article__media-container"> <picture class="article__media"> <img alt="" class="teaser__media teaser__media--nav js-media-nav" data-lazy="https://img.lemde.fr/2025/06/17/1/0/4000/2666/180/0/95/0/8ef2188_upload-1-rxik4i39xlrv-sujet-pompiers-kostantynivka-10.jpg" data-lazy-retina="https://img.lemde.fr/2025/06/17/1/0/4000/2666/360/0/95/0/8ef2188_upload-1-rxik4i39xlrv-sujet-pompiers-kostantynivka-10.jpg 2x" height="120" width="180"/> <noscript><img alt="" height="120" src="https://img.lemde.fr/2025/06/17/1/0/4000/2666/180/0/95/0/8ef2188_upload-1-rxik4i39xlrv-sujet-pompiers-kostantynivka-10.jpg" srcset="https://img.lemde.fr/2025/06/17/1/0/4000/2666/360/0/95/0/8ef2188_upload-1-rxik4i39xlrv-sujet-pompiers-kostantynivka-10.jpg 2x" width="180"/></noscript> </picture> </div> <div cla

In [5]:
rows = []

for item in items:
    row = {}

    row["title"] = item.find(class_='article__title').text

    try:
        row["subhead"] = item.find(class_='article__desc').text
    except:
        pass

    try:
        row['url'] = item.find('a').get('href', None)
    except:
        row['url'] = item.get('href', None)

    try:
        row['premium'] = item.find(class_='icon__premium').text
    except:
        pass

    try:
        row['byline'] = item.find(class_='article__byline').text.strip()
    except: 
        pass
        
    try:
        row['article type'] = item.find(class_='article__type').text
    except:
        pass

    try:
        row['image URL'] = item.find('img').get('data-src', None)
    except:
        row['image URL'] = item.get('data-src', None)
    
    rows.append(row)

print(item)

<div class="article article--opinion old__article-square"> <a href="https://www.lemonde.fr/en/international/article/2025/06/16/trump-s-messianic-allies-lead-the-charge-against-gaza_6742393_4.html"> <div class="article__type">Column</div> <span class="icon__premium icon--outside"><span class="sr-only">Subscribe</span></span><p class="article__title">Trump's messianic allies lead the charge against Gaza</p> <div class="article__author"> <img class="article__author-picture" data-src="https://img.lemde.fr/2023/07/31/0/0/0/0/100/100/60/0/d45bb07_1690791602941-eac5c14-1647333440018-jean-pierre-filiu-2019.jpg"/> <div> <div class="article__author-name">Jean-Pierre Filiu</div> <div class="article__author-desc">Historian and professor at Sciences Po Paris</div> </div> </div> </a> </div>


In [6]:
df.to_csv('lemonde.csv', index=False)

In [7]:
df = pd.json_normalize(rows)
df.head(10)

Unnamed: 0,title,url,premium,image URL,subhead,byline,article type
0,How Israel gained air superiority over Iran's sky,https://www.lemonde.fr/en/international/articl...,Subscribers only,,,,
1,"As Israel and Iran clash, global fears of nucl...",https://www.lemonde.fr/en/international/articl...,Subscribers only,,,,
2,Prospect of US military intervention in Iran d...,https://www.lemonde.fr/en/international/articl...,Subscribers only,,,,
3,"Behind Trump and Macron's war of words, G7 rev...",https://www.lemonde.fr/en/international/articl...,Subscribers only,,,,
4,EU lawmakers seek to end statute of limitation...,https://www.lemonde.fr/en/europe/article/2025/...,,,,,
5,Criticism is mounting in the Netherlands again...,https://www.lemonde.fr/en/international/articl...,Subscribers only,,,,
6,"Five years after Brexit, UK and EU aim for a f...",https://www.lemonde.fr/en/international/articl...,Subscribers only,,,,
7,"In Lithuania, the fear of war with Russia lurk...",https://www.lemonde.fr/en/international/articl...,Subscribers only,,,,
8,"From Chenonceau to the Acropolis, heritage sit...",https://www.lemonde.fr/en/culture/article/2025...,Subscribers only,,,,
9,Beloved black Corsican beaches threatened by n...,https://www.lemonde.fr/en/france/article/2025/...,Subscribers only,,,,


## Part Three

**Scrape the list of third party drivers license locations from https://travel-id-documents.az.gov/authorized-third-party-driver-license-locations but include the link. Save as a CSV. Since it's more than just the text from the table, this requires actually using BeautifulSoup :(**

In [31]:
response = requests.get('https://travel-id-documents.az.gov/authorized-third-party-driver-license-locations')
doc = BeautifulSoup(response.text)

items = doc.find('table', class_='table table-striped table-hover table-bordered')

items

<table class="table table-striped table-hover table-bordered" style="width: 100%;"><thead><tr><td><strong>Company</strong></td>
<td><strong>Address</strong></td>
<td><strong>Telephone</strong></td>
<td><strong>Hours</strong></td>
</tr></thead><tbody><tr><td><a href="http://az-mvd.com/" target="_blank">1 Stop Title &amp; Registration Services</a></td>
<td>940 N. Alma School Rd., #104<br/>
			Chandler, AZ 85224</td>
<td>480.821.3288</td>
<td>Mon.-Fri. 8:00 a.m.-6:00 p.m. Sat. 9:00 a.m.-4:30 p.m.</td>
</tr><tr><td><a href="http://az-mvd.com/" target="_blank">1 Stop Title &amp; Registration Services</a></td>
<td>5036 W. Cactus Rd., Ste. 2<br/>
			Glendale, AZ 85304</td>
<td>602.264.2400</td>
<td>Mon.-Fri. 8:00 a.m.-6:00 p.m. Sat. 8:30 a.m.-4:30 p.m.</td>
</tr><tr><td>Academy of Driving Motor Vehicle Center</td>
<td>4733 E. Broadway Blvd.<br/>
			Tucson, AZ 85711</td>
<td>520.750.7572</td>
<td>Mon.-Fri. 9 a.m.-5 p.m. and Sat. 9 a.m.-3 p.m.</td>
</tr><tr><td>Arizona Auto License</td>
<td>133

In [36]:
ArizonaSite = 'https://travel-id-documents.az.gov/authorized-third-party-driver-license-locations'

rows = []

for item in items.tbody.find_all('tr'):
    row = {}

    columns = item.find_all('td')
    if (columns != []):
        row ['company'] = columns[0].text.strip()

        try:
            row['company link'] = ArizonaSite+columns[0].find('a').get('href')
        except:
            pass
        
        row ['address'] = columns[1].text.strip()
        row ['telephone'] = columns[2].text.strip()
        row ['hours'] = columns[3].text.strip()

    rows.append(row)



print(item)

<tr><td>West Valley Motor Vehicle Title Express LLC</td>
<td>12801 W. Bell Rd. Ste. #113<br/>
			Surprise, AZ 85379</td>
<td>623.977.0929 or fax 623.977.4006</td>
<td>Mon.-Fri. 9 a.m.-5 p.m.</td>
</tr>


In [43]:
df = pd.json_normalize(rows)

df = df.replace(to_replace='\n\t\t\t', value=' ', regex=True)
df.head(10)

Unnamed: 0,company,company link,address,telephone,hours
0,1 Stop Title & Registration Services,https://travel-id-documents.az.gov/authorized-...,"940 N. Alma School Rd., #104 Chandler, AZ 85224",480.821.3288,Mon.-Fri. 8:00 a.m.-6:00 p.m. Sat. 9:00 a.m.-4...
1,1 Stop Title & Registration Services,https://travel-id-documents.az.gov/authorized-...,"5036 W. Cactus Rd., Ste. 2 Glendale, AZ 85304",602.264.2400,Mon.-Fri. 8:00 a.m.-6:00 p.m. Sat. 8:30 a.m.-4...
2,Academy of Driving Motor Vehicle Center,,"4733 E. Broadway Blvd. Tucson, AZ 85711",520.750.7572,Mon.-Fri. 9 a.m.-5 p.m. and Sat. 9 a.m.-3 p.m.
3,Arizona Auto License,,"1337 W. Prince Rd Tucson, AZ 85705",520.696.2023,Driver License Hours: Mon.-Fri. 9 a.m.-5 p.m....
4,Arizona Auto License Service LLC,,"1457 N. Eliseo C Felix Jr. Way, Ste. 105 and 1...",623.925.5455 or Fax 623.925.5879,Mon.-Fri. 8 a.m.-5 p.m.
5,Arizona Auto License Service LLC,,"5130 W Baseline Rd. Ste. 105 Laveen, AZ 85339",602.334.1700 or Fax 602.272.2480,Mon.-Sat. 8 a.m.-5 p.m.
6,Arizona Loan Solutions Motor Vehicle Center,,"4401 N. Hwy. 89 Suite 1 Flagstaff, AZ 86004",928.527.3215 or Fax 928.526.1895,Mon.-Fri. 9 a.m.-5 p.m. and Sat. 10 a.m.-2 p.m.
7,Arizona Motor Vehicle Express LLC,https://travel-id-documents.az.gov/authorized-...,"6741 N Thornydale #147 Tucson, AZ 85741",520.219.8852,Driver License Hours: M-F 9:00 a.m.-5:30 p.m. ...
8,Arizona State Express Title & Registration,,"20924 N. John Wayne Parkway Ste. D1 Maricopa,...",520.568.9299,Driver License Hours: M-F 9:00 a.m.- 5:00 p.m...
9,Arizona Tags & Title Inc.,,8307 E. State Route 69 Suite A Prescott Valley...,928.759.9700 or Fax 928.772.5283,Mon.-Fri. 10 a.m.-2 p.m.


## Part Four

**Visit https://www.tnwb.uscourts.gov/Search/Search.aspx and search for "CAR." Scrape the results into a CSV, with four columns: the URL to the case, the name of the case, the category (e.g. "Judge's Opinions), the additional details (terms match/size/pdf URL).**

In [64]:
response = requests.get('https://www.tnwb.uscourts.gov/Search/Search.aspx?zoom_sort=0&zoom_xml=0&zoom_query=CAR&zoom_per_page=132&zoom_and=1&zoom_cat%5B%5D=-1')
doc = BeautifulSoup(response.text)

items = doc.find_all('div', attrs={'class': 'result_block'})

items2 = doc.find_all('div', attrs={'class': 'result_altblock'})

items = items + items2

In [65]:
rows = []

for item in items:
    row = {}

    link_list = item.find_all('a')

    for link in link_list:
         row['url'] = link.get('href', None)

    titles = item.find_all(class_='result_title')

    for title in titles:
        row['name'] = title.text

    categories = item.find_all(class_='category')

    for category in categories:
        row['category'] = category.text

    infolines = item.find_all(class_='infoline')

    for infoline in infolines:
        row['info line'] = infoline.text

    rows.append(row)


In [67]:
df = pd.json_normalize(rows)
df.head()

Unnamed: 0,url,name,category,info line
0,https://www.tnwb.uscourts.gov/Opinions/jdl/pdf...,1. JDL: 04-24318 Jacquelline D. Black [Judges'...,[Judges' Opinions],Terms matched: 1 - 102k - URL: https://ww...
1,https://www.tnwb.uscourts.gov/Opinions/ghb/pdf...,3. GHB: 97-12368 Billy G. Woffard [Judges' Opi...,[Judges' Opinions],Terms matched: 1 - 71k - URL: https://www...
2,https://www.tnwb.uscourts.gov/Opinions/mrh/pdf...,5. MRH: 20-20967 Jacob Braxton Herring 20-0009...,[Judges' Opinions],Terms matched: 1 - 303k - URL: https://ww...
3,https://www.tnwb.uscourts.gov/Opinions/jdl/pdf...,7. JDL: 09-20339 Diane M. Miller [Judges' Opin...,[Judges' Opinions],Terms matched: 1 - 92k - URL: https://www...
4,https://www.tnwb.uscourts.gov/Opinions/ghb/pdf...,"9. GHB: 02-31651 Neil Bond Stewart, Jr. and Ti...",[Judges' Opinions],Terms matched: 1 - 291k - URL: https://ww...


In [13]:
df.to_csv('uscourts.csv', index=False)