# Homework: Intro Scraping Practice

In this assignment, we'll practice our scraping skills by examining the "Published reproductions" section of Soma's Investigate.ai project: https://investigate.ai/

### 0) Setup

Import `requests`, `BeautifulSoup`, and `pandas`. Remember that, even though we installed the library as `pip install beautifulsoup4`, the import statement uses a slightly different name.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

### 1) Grab the HTML for https://investigate.ai/

Use `requests` to get the HTML, assigning it to a variable.

In [2]:
investigate_requests = requests.get("https://investigate.ai/")
investigate = investigate_requests.text
print(investigate)

<!doctype html>
<html lang="en-US">



<head>
  <!-- Global site tag (gtag.js) - Google Analytics -->
  <script async src="https://www.googletagmanager.com/gtag/js?id=UA-5541738-27"></script>
  <script>
    window.dataLayer = window.dataLayer || [];
    function gtag(){dataLayer.push(arguments);}
    gtag('js', new Date());

    gtag('config', 'UA-5541738-27');
  </script>

  <meta charset="utf-8">
  <title>investigate.ai: Data Science for Journalism</title>
  <link href="/font-awesome/css/all.min.css?v=500d1a92f875b1d96d37a3a3f8f0438c" rel="stylesheet">
  <!-- <link href="https://cdnjs.cloudflare.com/ajax/libs/bulma/0.7.5/css/bulma.min.css" rel="stylesheet">
        <link href="https://fonts.googleapis.com/css?family=Raleway:400,700|Open+Sans:400,700&display=swap" rel="stylesheet"> -->
  <link rel="stylesheet" href="/css/spectre.min.css?v=5cd401d486f79e82913923fe7d7f47ff">
  <link rel="stylesheet" href="/css/spectre-exp.min.css?v=5909d80638a6ae6aa3a455b6f6a6d768">
  <link rel="stylesh

### 2) Use `BeautifulSoup` to convert the HTML into its DOM representation

In [3]:
# Name the parser explicitly so results don't depend on which parsers are installed
soup = BeautifulSoup(investigate, "html.parser")
type(soup)

bs4.BeautifulSoup

### 3) Use `.select(...)` to select *just* the "Published reproductions" section

Assign the selection to a new variable named `projects_section`.

You'll want to use "View Source" or pop open the Element Inspector to figure out which element to target.

Reminder: An element's `class` attribute can contain *multiple* classes, separated by spaces. For example: `<div class="potato hamburger">Hello</div>` has two classes, `potato` and `hamburger`. A CSS selector for *either* class, `soup.select(".potato")` *or* `soup.select(".hamburger")`, will match that element.

Hint: Look for a `div` with a particular class.
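
To make the reminder concrete, here is a small sketch on a made-up HTML snippet (the markup and the `demo_soup` name are illustrative, not taken from investigate.ai). It shows that selecting by either class finds the same element, and that `.select(...)` always returns a list, which is why you need `[0]` to get a single element.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; not the real investigate.ai HTML
html = '<div class="potato hamburger">Hello</div><div class="potato">Bye</div>'
demo_soup = BeautifulSoup(html, "html.parser")

by_potato = demo_soup.select(".potato")          # matches both divs
by_hamburger = demo_soup.select(".hamburger")    # matches only the first div
by_both = demo_soup.select(".potato.hamburger")  # requires both classes at once

print(len(by_potato), len(by_hamburger), len(by_both))
print(by_hamburger[0].text)
```

Chaining classes with no space between them (`.potato.hamburger`) narrows the match to elements that carry both.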

In [6]:
projects_section = soup.select(".section-projects")[0]
print(projects_section.text.strip()[:500])


Published reproductions
Data science and machine learning can be used anywhere! From a small visualization at The Upshot to a year-long investigations by Reveal, let's try to put these new skills in context.






Searching for faulty airbags in vehicle complaints


The National Highway Transportation Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?


                        The New 


### 4) Use `projects_section.select(...)` to select elements that represent a single project

Assign the list to a variable named `project_els`.

In [8]:
project_els = projects_section.select("article")
project_els[:2]
# "article" matches by tag name; a leading "." (as in ".card") would match by class

[<article class="card">
 <div class="card-header">
 <h5><a href="/nyt-takata-airbags/">Searching for faulty airbags in vehicle complaints</a></h5>
 </div>
 <div class="card-body">
 <p>The National Highway Transportation Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?</p>
 </div>
 <div class="card-footer">
                         The New York Times
                     </div>
 </article>,
 <article class="card">
 <div class="card-header">
 <h5><a href="/latimes-crime-classification/">Building a crime classification engine</a></h5>
 </div>
 <div class="card-body">
 <p>Using machine learning as an investigative tool to cast light on years of underreporting by the Los Angeles Police Department.</p>
 </div>
 <div class="card-footer">
                         Los Angeles Times
                     </div>
 </article>]

### 5) Count the number of matching elements, using `len`

Does it match the number of projects you see on the page? (It should.)

In [9]:
len(project_els)

26

### 6) For each project, print its title, publisher, summary, and link

You'll want to construct a `for` loop. In each iteration of the loop, print out something that looks like this:

```
Title: Searching for faulty airbags in vehicle complaints

Publisher: The New York Times

Summary: The National Highway Transportation Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?

Link: https://investigate.ai/nyt-takata-airbags/
---
```

In [12]:
for el in project_els:
    title_el = el.select(".card-header a")[0]
    title = title_el.text.strip()
    # The href already starts with "/", so the base URL must not end with one
    link = "https://investigate.ai" + title_el["href"]
    summary = el.select(".card-body")[0].text.strip()
    publisher = el.select(".card-footer")[0].text.strip()
    print(f"Title: {title}\n")
    print(f"Publisher: {publisher}\n")
    print(f"Summary: {summary}\n")
    print(f"Link: {link}")
    print("---")

Title: Searching for faulty airbags in vehicle complaints

Publisher: The New York Times

Summary: The National Highway Transportation Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?

Link: https://investigate.ai/nyt-takata-airbags/
---
Title: Building a crime classification engine

Publisher: Los Angeles Times

Summary: Using machine learning as an investigative tool to cast light on years of underreporting by the Los Angeles Police Department.

Link: https://investigate.ai/latimes-crime-classification/
---
Title: Chinese museum analysis

Publisher: Caixin

Summary: A word-count analysis of the names of around 4500 museums in China.

Link: https://investigate.ai/caixin-museum-word-count/
---
Title: Analyzing online safety through app store reviews

Publisher: The Washington Post

Summary: After downloading over a hundred thousand reviews of "random chat apps," how to f
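
Building a link by concatenating a base URL and an `href` only works if the slashes line up exactly. As a hedged alternative (not required by the assignment), the standard library's `urllib.parse.urljoin` resolves an `href` against a base URL. The sketch below uses the `href` value from the first card shown earlier; `base_url` is just an illustrative variable name.

```python
from urllib.parse import urljoin

base_url = "https://investigate.ai/"
href = "/nyt-takata-airbags/"  # href value from the first project card

# urljoin resolves the href against the base, so slashes are handled for us
link = urljoin(base_url, href)
print(link)
```

Inside the loop, this would replace the string concatenation with `urljoin(base_url, title_el["href"])`.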

### 7) Now, let's do the same thing, but storing the info in a `pandas` `DataFrame`

Specifically, a `DataFrame` with the columns `title`, `publisher`, `summary`, and `link`.

In [17]:
# An equivalent approach: build a list of dicts in a loop, then convert it
# all_projects = []
# for el in project_els:
#     project = {
#         "title": el.select(".card-header a")[0].text.strip(),
#         "publisher": el.select(".card-footer")[0].text.strip(),
#         "summary": el.select(".card-body")[0].text.strip(),
#         "link": "https://investigate.ai" + el.select(".card-header a")[0]["href"],
#     }
#     all_projects.append(project)
# df = pd.DataFrame(all_projects)
# df

In [18]:
df = pd.DataFrame([{
    "title": el.select(".card-header a")[0].text.strip(),
    "publisher": el.select(".card-footer")[0].text.strip(),
    "summary": el.select(".card-body")[0].text.strip(),
    "link": "https://investigate.ai" + el.select(".card-header a")[0]["href"],
} for el in project_els])

df

| | title | publisher | summary | link |
|---|---|---|---|---|
| 0 | Searching for faulty airbags in vehicle compla... | The New York Times | The National Highway Transportation Safety Adm... | https://investigate.ai/nyt-takata-airbags/ |
| 1 | Building a crime classification engine | Los Angeles Times | Using machine learning as an investigative too... | https://investigate.ai/latimes-crime-classific... |
| 2 | Chinese museum analysis | Caixin | A word-count analysis of the names of around 4... | https://investigate.ai/caixin-museum-word-count/ |
| 3 | Analyzing online safety through app store reviews | The Washington Post | After downloading over a hundred thousand revi... | https://investigate.ai/wapo-app-reviews/ |
| 4 | Uncovering abusive doctors that were allowed t... | Atlanta Journal-Constitution | How to comb through 100,000 disciplinary docum... | https://investigate.ai/ajc-doctors-abuse/ |
| 5 | Analyzing the tone of Trump's speeches | The New York Times | Standard sentiment analysis scores a document ... | https://investigate.ai/upshot-trump-emolex/ |
| 6 | Detecting special interest model legislation i... | USA Today, The Arizona Republic, and the Cente... | Special interest groups use model legislation ... | https://investigate.ai/azcentral-text-reuse-mo... |
| 7 | Detecting bots in FCC comment submissions | | The comment period on the FCC's net neutrality... | https://investigate.ai/fcc-comments/ |
| 8 | Figuring out what Democratic candidates care a... | Bloomberg | In the wide field of Democratic presidential c... | https://investigate.ai/bloomberg-tweet-topics/ |
| 9 | What does Trump tweet about? | The New York Times | What does Trump tweet about? An analysis of ov... | https://investigate.ai/nyt-trump-tweets/ |


### 8) Using that `DataFrame`, calculate:

Who are the most-featured publishers?

In [19]:
df.value_counts("publisher")

publisher
ProPublica                                                              3
The New York Times                                                      3
                                                                        2
Bloomberg                                                               1
BuzzFeed News                                                           1
USA Today, The Arizona Republic, and the Center for Public Integrity    1
The Washington Post                                                     1
Atlanta Journal-Constitution                                            1
The Boston Globe                                                        1
The Associated Press                                                    1
Tampa Bay Times                                                         1
Review of Economic Studies                                              1
Reveal                                                                  1
Reuters                     
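
The unlabeled row in the counts above appears to come from projects whose publisher text is an empty string after `.strip()`. A minimal sketch, on made-up data rather than the scraped results, of one way to leave those out before counting. Whether to drop them is a judgment call for your analysis; `toy` and `named` are illustrative names.

```python
import pandas as pd

# Toy data for illustration; not the scraped results
toy = pd.DataFrame({"publisher": ["ProPublica", "", "ProPublica", "Reuters", ""]})

# Keep only rows with a non-empty publisher, then count
named = toy[toy["publisher"] != ""]
counts = named["publisher"].value_counts()
print(counts)
```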

Which project has the longest summary, in number of characters of text? How long is it?

In [21]:
(
    df
    .assign(
        sumlen = lambda df: df["summary"].apply(len)
    )
    .nlargest(1, "sumlen", keep="all")
    .to_dict("records")
)
# .assign() returns a new DataFrame with the added column;
# it does not modify the original df

[{'title': 'Bias in the jury selection process',
  'publisher': 'APM Reports',
  'summary': 'When selecting a jury, both the defense and the prosecution are allowed to strike potential jurors from the pool. While the potential jurors provide answers to a questionnaire, what kind of role might race play in their selection or rejection?',
  'link': 'https://investigate.ai/apm-reports-jury-bias/',
  'sumlen': 243}]

How many times longer is it than the average summary?

In [22]:
(
    df
    .assign(
        sumlen = lambda df: df["summary"].apply(len),
        sumlen_vs_avg = lambda df: (df["sumlen"] / df["sumlen"].mean()).round(2)
    )
    .nlargest(1, "sumlen", keep="all")
    [[ "sumlen", "sumlen_vs_avg"]]
)

| | sumlen | sumlen_vs_avg |
|---|---|---|
| 21 | 243 | 1.82 |


In [None]:
#lambda is...
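
A short note on the `lambda` used in the cells above: a `lambda` is an anonymous function written inline. When `.assign` is given a function, pandas calls it with the DataFrame built so far, which is how `sumlen_vs_avg` can refer to the `sumlen` column created just before it. The sketch below uses toy data (not the scraped results) and also shows that the original DataFrame is left unchanged; `toy` and `with_len` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({"summary": ["short", "a longer summary", "mid-size"]})

with_len = toy.assign(
    # each lambda receives the DataFrame as built so far
    sumlen=lambda d: d["summary"].str.len(),
    sumlen_vs_avg=lambda d: (d["sumlen"] / d["sumlen"].mean()).round(2),
)

print(with_len)
print(list(toy.columns))  # the original still has only "summary"
```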

---
