# Homework 7: Scraping part 4

## Part Four

Visit https://www.tnwb.uscourts.gov/Search/Search.aspx and search for "CAR". Scrape the results into a CSV with four columns: the URL of the case, the name of the case, the category (e.g. "Judges' Opinions"), and the additional details (terms matched/size/PDF URL).

**Bonuses, if you want to get fancy**:

- Split up the additional details into multiple columns
- Download all of the PDFs of the cases

Tips:

- There are only 10 results, and so many pages! ...maybe there's a secret way to get them all on one page?
- Check whether the class you're selecting on returns as many elements as there are results (it probably doesn't!). Inspect the page to find out why. You have two options: use something like the item.select("h1, h2") pattern we used in class, adapted for classes instead of tags, or write two separate loops.
- .split is often a convenient way to separate semi-structured text
- Downloading PDFs in Python probably does not involve wget (unless you really want to)
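
As an illustration of the .split tip, here is a minimal sketch that breaks one "additional details" line into separate fields. The sample string is the infoline of one result from this search; the function name `split_infoline` and the output keys are made up for this example, and the sketch assumes every infoline follows the same "Terms matched: N - SIZE - URL: ..." layout.

```python
def split_infoline(infoline):
    """Split 'Terms matched: 1 - 725k - URL: ...' into three fields."""
    # maxsplit=2 keeps the URL whole even if it happened to contain " - "
    terms, size, pdf_url = [part.strip() for part in infoline.split(" - ", 2)]
    return {
        # Keep only what follows the first colon in each labelled field
        "terms_matched": int(terms.split(":", 1)[1]),
        "size": size,
        "pdf_url": pdf_url.split(":", 1)[1].strip(),
    }


sample = (
    "Terms matched:  1  -  725k  -  URL: "
    "https://www.tnwb.uscourts.gov/Opinions/jlc/pdf/jlc20150114nn7.pdf"
)
print(split_infoline(sample))
```

Each key could then become its own column in the CSV, which covers the first bonus.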

In [7]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [8]:
#Starting with the first results page, with "results per page" set to 100
url = "https://www.tnwb.uscourts.gov/Search/search.aspx?zoom_query=car&zoom_page=1&zoom_per_page=100&zoom_and=1&zoom_sort=0&zoom_xml=0"

response = requests.get(url)
doc = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning

In [9]:
items = doc.find_all(class_=["result_block", "result_altblock"])

In [10]:
len(items) #Both classes together give every result on the page (one class alone shows just half)

100
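
The same match can also be written as a CSS selector, which is closer to the select("h1, h2") pattern mentioned in the tips. The sketch below runs on a tiny made-up HTML snippet (the links and names are placeholders, not real results) so it works without hitting the site; it assumes the results alternate between the two classes, as the real page appears to.

```python
from bs4 import BeautifulSoup

# Made-up miniature of the results page: alternating block / altblock divs
html = """
<div class="result_block"><a href="/a.pdf">First</a></div>
<div class="result_altblock"><a href="/b.pdf">Second</a></div>
<div class="result_block"><a href="/c.pdf">Third</a></div>
"""

doc = BeautifulSoup(html, "html.parser")

# A comma in a CSS selector means "either of these"; a leading dot means "class"
items = doc.select(".result_block, .result_altblock")
names = [item.find("a").text for item in items]
print(len(items), names)
```

Both approaches return the elements in page order, so either works for a single loop over all results.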

In [11]:
items[-1]

<div class="result_altblock">
<div class="result_title"><b>100.</b> <a href="https://www.tnwb.uscourts.gov/Opinions/jlc/pdf/jlc20150114nn7.pdf#search=%22car%22" target="_blank">JLC: 11-12703 Bobbie L. Baker</a><span class="category"> [Judges' Opinions]</span></div>
<div class="context">
<b>...</b> seeking to collect all disbursements to which a West Tennessee business entity, Highway 64 <span class="highlight">Car</span> and Truck Sales (" Highway 64") , would be entitled, <b>...</b></div>
<div class="infoline">Terms matched:  1  -  725k  -  URL: https://www.tnwb.uscourts.gov/Opinions/jlc/pdf/jlc20150114nn7.pdf</div>
</div>

In [12]:
cases = []
for item in items:
    case = {}
    
    #URL
    case["case_url"] = item.find("a").get("href")

    #Name of the case
    case["case_name"] = item.find("a").text

    #Category
    case["category"] = item.find(class_="category").text.strip()

    #Context snippet showing where the search term matched
    case["terms_match"] = item.find(class_="context").text.strip()

    #Size
    weight = item.find(class_="infoline").text.strip()
    parts = weight.split(" - ")
    weight = parts[1].strip()
    case["weight"] = weight

    cases.append(case) 

In [13]:
len(cases)

100

In [14]:
cases[0]

{'case_url': 'https://www.tnwb.uscourts.gov/Opinions/jdl/pdf/jdl20071024nn1.pdf#search=%22car%22',
 'case_name': 'JDL: 04-24318 Jacquelline D. Black',
 'category': "[Judges' Opinions]",
 'terms_match': "... the basis that the Debtor failed to prove that K's Auto had custody of the car or knew of the whereabouts of the car. This adversary proceeding was administratively ...",
 'weight': '102k'}

In [15]:
#Now for the different pages with results
urls = []

for number in range(1, 3): #range() excludes its end value, so this covers pages 1 and 2; adjust to the number of result pages
    url = f"https://www.tnwb.uscourts.gov/Search/search.aspx?zoom_query=car&zoom_page={number}&zoom_per_page=100&zoom_and=1&zoom_sort=0&zoom_xml=0"
    urls.append(url)

len(urls)

2
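
As an alternative to the long f-string, the query string can be assembled from a dictionary, which makes each parameter easier to read and change. This is only a sketch: the helper name `build_search_url` is made up here, and the parameter names are simply copied from the URL used above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.tnwb.uscourts.gov/Search/search.aspx"


def build_search_url(page, query="car", per_page=100):
    """Build one results-page URL from the parameters seen in the site's URL."""
    params = {
        "zoom_query": query,
        "zoom_page": page,
        "zoom_per_page": per_page,
        "zoom_and": 1,
        "zoom_sort": 0,
        "zoom_xml": 0,
    }
    # urlencode joins the pairs with "&" and escapes special characters
    return f"{BASE_URL}?{urlencode(params)}"


urls = [build_search_url(page) for page in range(1, 3)]
print(urls[0])
```

requests can also take the dictionary directly through its params argument, e.g. requests.get(BASE_URL, params=params).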

In [16]:
cases = []

for url in urls:
    response = requests.get(url)
    doc = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning
    items = doc.find_all(class_=["result_block", "result_altblock"])
    
    for item in items:
        case = {}
        
        #URL
        case["case_url"] = item.find("a").get("href")

        #Name of the case
        case["case_name"] = item.find("a").text

        #Category
        case["category"] = item.find(class_="category").text.strip()

        #Context snippet showing where the search term matched
        case["terms_match"] = item.find(class_="context").text.strip()

        #Size
        weight = item.find(class_="infoline").text.strip()
        parts = weight.split(" - ")
        weight = parts[1].strip()
        case["weight"] = weight

        cases.append(case) 

len(cases)

132

In [17]:
cases[0]

{'case_url': 'https://www.tnwb.uscourts.gov/Opinions/jdl/pdf/jdl20071024nn1.pdf#search=%22car%22',
 'case_name': 'JDL: 04-24318 Jacquelline D. Black',
 'category': "[Judges' Opinions]",
 'terms_match': "... the basis that the Debtor failed to prove that K's Auto had custody of the car or knew of the whereabouts of the car. This adversary proceeding was administratively ...",
 'weight': '102k'}

In [18]:
df = pd.json_normalize(cases)
df.head()

|   | case_url | case_name | category | terms_match | weight |
|---|----------|-----------|----------|-------------|--------|
| 0 | https://www.tnwb.uscourts.gov/Opinions/jdl/pdf... | JDL: 04-24318 Jacquelline D. Black | [Judges' Opinions] | ... the basis that the Debtor failed to prove ... | 102k |
| 1 | https://www.tnwb.uscourts.gov/Opinions/whb/pdf... | WHB: 95-26401 Mary Lucy Cooper | [Judges' Opinions] | ... MARY LUCY COOPER, Plaintiff, v. Adversary ... | 27k |
| 2 | https://www.tnwb.uscourts.gov/Opinions/ghb/pdf... | GHB: 97-12368 Billy G. Woffard | [Judges' Opinions] | ... G. Woffard, (" Woffard") , was partners in... | 71k |
| 3 | https://www.tnwb.uscourts.gov/Opinions/jdl/pdf... | JDL: 97-30580 Mary Chrlis Hurst | [Judges' Opinions] | ... UNITED STATES BANKRUPTCY COURT WESTERN DIS... | 32k |
| 4 | https://www.tnwb.uscourts.gov/Opinions/mrh/pdf... | MRH: 20-20967 Jacob Braxton Herring 20-00094 | [Judges' Opinions] | ... and soon thereafter the contract was assig... | 303k |


In [19]:
df.to_csv("cars_cases_wd_tennessee.csv", index=False)  # index=False keeps the row numbers out of the CSV
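
For the PDF bonus, case names such as "JDL: 04-24318 Jacquelline D. Black" contain colons and spaces, which are awkward (and on some systems invalid) in filenames, and two results can share the same name, so one file would overwrite the other. The sketch below shows one possible way to handle both; `safe_filename` and `unique_filenames` are hypothetical helpers written for this example, not something the assignment requires.

```python
import re


def safe_filename(case_name, extension=".pdf"):
    """Turn a case name into a filename using only letters, digits, '.', '-' and '_'."""
    # Replace every run of other characters (colons, spaces, commas) with "_"
    cleaned = re.sub(r"[^A-Za-z0-9.-]+", "_", case_name).strip("_")
    return cleaned + extension


def unique_filenames(case_names):
    """Add _2, _3, ... to repeated names so no download overwrites another."""
    seen = {}
    filenames = []
    for name in case_names:
        base = safe_filename(name, extension="")
        seen[base] = seen.get(base, 0) + 1
        suffix = "" if seen[base] == 1 else f"_{seen[base]}"
        filenames.append(f"{base}{suffix}.pdf")
    return filenames


print(safe_filename("JDL: 04-24318 Jacquelline D. Black"))
# JDL_04-24318_Jacquelline_D._Black.pdf
```

The filenames from unique_filenames could be paired with the case URLs using zip before writing each response to disk.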

In [20]:
#For downloading the PDFs

for case in cases:
    url = case["case_url"]
    #Single quotes inside the f-string so it also runs on Python versions before 3.12
    filename = f"{case['case_name']}.pdf"

    response = requests.get(url)
    with open(filename, 'wb') as f:
        f.write(response.content)

    print(f"Downloaded {filename} OK")

Downloaded JDL: 04-24318 Jacquelline D. Black.pdf OK
Downloaded WHB: 95-26401 Mary Lucy Cooper.pdf OK
Downloaded GHB: 97-12368 Billy G. Woffard.pdf OK
Downloaded JDL: 97-30580 Mary Chrlis Hurst.pdf OK
Downloaded MRH: 20-20967 Jacob Braxton Herring 20-00094.pdf OK
Downloaded GHB: 95-11365 Melissa L. Bryan.pdf OK
Downloaded JDL: 09-20339 Diane M. Miller.pdf OK
Downloaded GHB: 00-12340 Wanda K. Autry.pdf OK
Downloaded GHB: 02-31651 Neil Bond Stewart, Jr. and Tina R. Stewart.pdf OK
Downloaded GHB: 96-12039 Randy and Janice Willson.pdf OK
Downloaded GHB: 02-12407 Luis Rossi and Evelyn Rossi.pdf OK
Downloaded GHB: 99-12067 James Dean and Patsy Dean.pdf OK
Downloaded WHB: 95-29798 Byron Crumb.pdf OK
Downloaded JDL: 04-23035 Jennifer Ann Jamison-McGee.pdf OK
Downloaded JDL: 04-23035 Jennifer Ann Jamison-McGee.pdf OK
Downloaded GHB: 93-11057 Steven Lynn Hornsby and Teresa Lynn Hornsby.pdf OK
Downloaded JDL: 97-25357 Bobbie Louise Taylor Yarbrough.pdf OK
Downloa