# Homework: Intro Scraping Practice

In this assignment, we'll practice our scraping skills by examining the "Published reproductions" section of Soma's Investigate.ai project: https://investigate.ai/

### 0) Setup

Import `requests`, `BeautifulSoup`, and `pandas`. Remember that, even though we installed the library as `pip install beautifulsoup4`, the import statement uses a slightly different name.

In [23]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

### 1) Grab the HTML for https://investigate.ai/

Use `requests` to get the HTML and assign it to a variable.

In [24]:
investigate_html = requests.get("https://investigate.ai/").text
print(investigate_html[:200])

<!doctype html>
<html lang="en-US">



<head>
  <!-- Global site tag (gtag.js) - Google Analytics -->
  <script async src="https://www.googletagmanager.com/gtag/js?id=UA-5541738-27"></script>
  <scrip


### 2) Use `BeautifulSoup` to convert the HTML into its DOM representation

In [25]:
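# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(investigate_html, "html.parser")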
type(soup)

bs4.BeautifulSoup

### 3) Use `.select(...)` to select *just* the "Published reproductions" section


Use "View Source" or open the Element Inspector to figure out which element to target.

Reminder: An element's `class` attribute can contain *multiple* classes, separated by spaces. For example, `<div class="potato hamburger">Hello</div>` has two classes, `potato` and `hamburger`. A CSS selector for *either* class, `soup.select(".potato")` *or* `soup.select(".hamburger")`, will match that element.
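The reminder above can be checked on a tiny document. This is a minimal sketch that parses only the `potato hamburger` snippet; the variable names are made up for illustration, and it assumes `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup

# Tiny stand-alone document taken from the reminder above
snippet = '<div class="potato hamburger">Hello</div>'
demo_soup = BeautifulSoup(snippet, "html.parser")

# Either class on its own matches the element
by_potato = demo_soup.select(".potato")
by_hamburger = demo_soup.select(".hamburger")

# Chaining classes (no space) requires the element to have both
by_both = demo_soup.select(".potato.hamburger")

print(by_potato[0].text)
print(by_potato == by_hamburger)
print(len(by_both))
```

Chaining several classes, as in the last selector, is a handy way to narrow a match when a single class name is shared by many elements.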

Hint: Look for a `div` with a particular class.

In [26]:
projects_section = soup.select(".section-projects")
projects_section

[<div class="content section-projects">
 <div class="columns">
 <div class="column col-sm-auto col-8 col-mx-auto text-center">
 <h3>Published reproductions</h3>
 <p>Data science and machine learning can be used anywhere! From a small visualization at The Upshot to a year-long investigations by Reveal, let's try to put these new skills in context.</p>
 </div>
 <div class="column col-12">
 <div class="columns">
 <div class="column col-4 col-lg-6 col-sm-12">
 <article class="card">
 <div class="card-header">
 <h5><a href="/nyt-takata-airbags/">Searching for faulty airbags in vehicle complaints</a></h5>
 </div>
 <div class="card-body">
 <p>The National Highway Transportation Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?</p>
 </div>
 <div class="card-footer">
                         The New York Times
                     </div>
 </article>
 </div>
 <div class="column col-4 

### 4) Use `projects_section.select(...)` to select elements that represent a single project

Assign the list to a variable named `project_els`.

In [27]:
# Each card wrapper with these classes represents a single project
project_els = soup.select(".column.col-4.col-lg-6.col-sm-12")
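The cell above searches the whole page rather than just the projects section. Because `.select(...)` returns a list, calling `.select(...)` on `projects_section` directly would fail; one option is to take the first element (or use `.select_one(...)`) and search inside it. The sketch below shows the difference on invented HTML, so the markup and variable names are assumptions rather than the real page.

```python
from bs4 import BeautifulSoup

# Invented markup: two sections that reuse the same column classes
html = """
<div class="content section-projects">
  <div class="column col-4"><h5><a href="/a/">Project A</a></h5></div>
  <div class="column col-4"><h5><a href="/b/">Project B</a></h5></div>
</div>
<div class="content section-other">
  <div class="column col-4"><h5><a href="/c/">Not a project</a></h5></div>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first matching element instead of a list
section = demo_soup.select_one(".section-projects")

whole_page = demo_soup.select(".column.col-4")  # searches every section
scoped = section.select(".column.col-4")        # searches only inside the section

print(len(whole_page), len(scoped))
print([el.h5.text for el in scoped])
```

On the real page both approaches may return the same elements, but scoping the search protects against the same classes being reused elsewhere.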

### 5) Count the number of matching elements, using `len`

Does it match the number of projects you see on the page? (It should.)

In [28]:
len(project_els)

26

### 6) For each project, print its title, publisher, summary, and link

You'll want to construct a `for` loop. In each iteration of the loop, print out something that looks like this:

```
Title: Searching for faulty airbags in vehicle complaints

Publisher: The New York Times

Summary: The National Highway Transportation Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?

Link: https://investigate.ai/nyt-takata-airbags/
---
```

In [29]:
for item in project_els:
    # The title and the link both live in the <a> inside the card's <h5>
    link_el = item.find("a")
    publisher = item.find("div", class_="card-footer").text.strip()
    print(f"Title: {link_el.text}\n")
    print(f"Publisher: {publisher}\n")
    print(f"Summary: {item.p.text}\n")
    # href already starts with "/", so don't add another slash after the domain
    print(f"Link: https://investigate.ai{link_el['href']}")
    print("---")

Title: Searching for faulty airbags in vehicle complaints

Publisher: The New York Times

Summary: The National Highway Transportation Safety Administration receives thousands and thousands of vehicle complaints each year. Can we train a computer to filter out leads on Takata airbag malfunctions?

Link: https://investigate.ai/nyt-takata-airbags/
---
Title: Building a crime classification engine

Publisher: Los Angeles Times

Summary: Using machine learning as an investigative tool to cast light on years of underreporting by the Los Angeles Police Department.

Link: https://investigate.ai/latimes-crime-classification/
---
Title: Chinese museum analysis

Publisher: Caixin

Summary: A word-count analysis of the names of around 4500 museums in China.

Link: https://investigate.ai/caixin-museum-word-count/
---
Title: Analyzing online safety through app store reviews

Publisher: The Washington Post

Summary: After downloading over a hundred thousand reviews of "random chat apps," how to find reports of bullying, racism, and unwanted sex

In [30]:
item.h5.text

'Predicting FOIA requests success rates'

In [31]:
item.p.text

'Government agencies seem to fulfill or reject FOIA request without rhyme or reason. Can a journalist use machine learning to improve their chances?'

In [32]:
item

<div class="column col-4 col-lg-6 col-sm-12">
<article class="card">
<div class="card-header">
<h5><a href="/foia-predictor/">Predicting FOIA requests success rates</a></h5>
</div>
<div class="card-body">
<p>Government agencies seem to fulfill or reject FOIA request without rhyme or reason. Can a journalist use machine learning to improve their chances?</p>
</div>
<div class="card-footer">
                        data.world
                    </div>
</article>
</div>

In [33]:
item.div.text

'\nPredicting FOIA requests success rates\n'

In [34]:
item.find('p').text

'Government agencies seem to fulfill or reject FOIA request without rhyme or reason. Can a journalist use machine learning to improve their chances?'

In [35]:
item.find('div', class_="card-footer").text.strip()

'data.world'

In [36]:
item.find("a")["href"]

'/foia-predictor/'

In [37]:
# href already starts with "/", so no extra slash is needed after the domain
print(f"https://investigate.ai{item.find('a')['href']}")

https://investigate.ai/foia-predictor/
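Each `href` on the page already begins with `/`, so gluing it onto a base URL that ends in `/` produces a doubled slash. As an alternative to string formatting, the standard library's `urllib.parse.urljoin` resolves the path against the base. A minimal sketch, with `base_url` and `href` as illustrative names:

```python
from urllib.parse import urljoin

base_url = "https://investigate.ai/"
href = "/foia-predictor/"  # the value returned by item.find("a")["href"]

# urljoin resolves the site-relative path against the base URL
full_url = urljoin(base_url, href)
print(full_url)  # → https://investigate.ai/foia-predictor/
```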


In [38]:
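# Scratch note: projects_section is a one-element list, so this loop runs once
# and prints the literal string. The title and link live in each card's <a>.
for title in projects_section:
    print("a.href")
    print("---")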

a.href
---


In [39]:
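# Scratch note: prints a literal string once. The publisher lives in the
# <div class="card-footer"> of each card.
for publisher in projects_section:
    print("div.class = 'card-footer'")
    print("---")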

div.class = 'card-footer'
---


In [40]:
# Scratch note: prints a literal string once. The summary lives in each card's <p>.
for summary in projects_section:
    print("<p>")
    print("---")

<p>
---


### 7) Now, let's do the same thing, but storing the info in a `pandas` `DataFrame`

Specifically, a `DataFrame` with the columns `title`, `publisher`, `summary`, and `link`.

In [43]:
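# Build one dict per project card, then hand the list of dicts to pandas
tpsl_df = pd.DataFrame([
    {
        "title": item.h5.text,
        "publisher": item.find("div", class_="card-footer").text.strip(),
        "summary": item.p.text,
        # href already starts with "/", so no extra slash after the domain
        "link": f"https://investigate.ai{item.find('a')['href']}",
    }
    for item in project_els
])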
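The cell above relies on `pd.DataFrame` accepting a list of dicts, where each dict becomes a row and the keys become the columns. Here is a small offline sketch of that pattern, using two records copied by hand from the outputs above; `records` and `demo_df` are illustrative names.

```python
import pandas as pd

# Two hand-copied records standing in for the scraped project cards
records = [
    {
        "title": "Chinese museum analysis",
        "publisher": "Caixin",
        "summary": "A word-count analysis of the names of around 4500 museums in China.",
        "link": "https://investigate.ai/caixin-museum-word-count/",
    },
    {
        "title": "Predicting FOIA requests success rates",
        "publisher": "data.world",
        "summary": "Government agencies seem to fulfill or reject FOIA request without rhyme or reason. Can a journalist use machine learning to improve their chances?",
        "link": "https://investigate.ai/foia-predictor/",
    },
]

# Each dict becomes one row; the dict keys become the column names
demo_df = pd.DataFrame(records)

print(demo_df.columns.tolist())
print(demo_df.shape)
```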

### 8) Using that `DataFrame`, calculate:

Who are the most-featured publishers?

In [45]:
tpsl_df['publisher'].value_counts().head(2)

publisher
The New York Times    3
ProPublica            3
Name: count, dtype: int64

Which project has the longest summary, in number of characters of text? How long is it?

In [54]:
tpsl_df['summary'].apply(len)

0     198
1     126
2      67
3     143
4      87
5     193
6     149
7     157
8     126
9      63
10    167
11     53
12     87
13    142
14    149
15    125
16    122
17    150
18     74
19    161
20    136
21    243
22    109
23    123
24    167
25    147
Name: summary, dtype: int64
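The lengths above still leave the question of *which* project is longest. One way to answer it, sketched here on the recorded lengths rather than the live `DataFrame`, is `idxmax()`, which returns the index label of the largest value. The name `lengths` is illustrative.

```python
import pandas as pd

# Summary lengths copied from the output above, in row order (0 through 25)
lengths = pd.Series([
    198, 126, 67, 143, 87, 193, 149, 157, 126, 63, 167, 53, 87,
    142, 149, 125, 122, 150, 74, 161, 136, 243, 109, 123, 167, 147,
])

longest_idx = lengths.idxmax()   # index label of the longest summary
longest_len = lengths.max()      # its length in characters

print(longest_idx, longest_len)  # → 21 243
```

With the real `DataFrame`, something like `tpsl_df.loc[tpsl_df["summary"].str.len().idxmax()]` should return the full row for that project.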

In [55]:
list_number = tpsl_df['summary'].apply(len)

In [58]:
tpsl_df['list_number']= list_number

In [59]:
tpsl_df

| | title | publisher | summary | link | list_number |
|---|---|---|---|---|---|
| 0 | Searching for faulty airbags in vehicle compla... | The New York Times | The National Highway Transportation Safety Adm... | https://investigate.ai/nyt-takata-airbags/ | 198 |
| 1 | Building a crime classification engine | Los Angeles Times | Using machine learning as an investigative too... | https://investigate.ai/latimes-crime-classifi... | 126 |
| 2 | Chinese museum analysis | Caixin | A word-count analysis of the names of around 4... | https://investigate.ai/caixin-museum-word-count/ | 67 |
| 3 | Analyzing online safety through app store reviews | The Washington Post | After downloading over a hundred thousand revi... | https://investigate.ai/wapo-app-reviews/ | 143 |
| 4 | Uncovering abusive doctors that were allowed t... | Atlanta Journal-Constitution | How to comb through 100,000 disciplinary docum... | https://investigate.ai/ajc-doctors-abuse/ | 87 |
| 5 | Analyzing the tone of Trump's speeches | The New York Times | Standard sentiment analysis scores a document ... | https://investigate.ai/upshot-trump-emolex/ | 193 |
| 6 | Detecting special interest model legislation i... | USA Today, The Arizona Republic, and the Cente... | Special interest groups use model legislation ... | https://investigate.ai/azcentral-text-reuse-m... | 149 |
| 7 | Detecting bots in FCC comment submissions | | The comment period on the FCC's net neutrality... | https://investigate.ai/fcc-comments/ | 157 |
| 8 | Figuring out what Democratic candidates care a... | Bloomberg | In the wide field of Democratic presidential c... | https://investigate.ai/bloomberg-tweet-topics/ | 126 |
| 9 | What does Trump tweet about? | The New York Times | What does Trump tweet about? An analysis of ov... | https://investigate.ai/nyt-trump-tweets/ | 63 |


In [62]:
list_number2=[]
for a in tpsl_df['summary']:
    list_number2.append(len(a))
list_number2

[198,
 126,
 67,
 143,
 87,
 193,
 149,
 157,
 126,
 63,
 167,
 53,
 87,
 142,
 149,
 125,
 122,
 150,
 74,
 161,
 136,
 243,
 109,
 123,
 167,
 147]

How many times longer is it than the average summary?

In [75]:
# Sort the summary lengths from longest to shortest
tpsl_df['list_number'].sort_values(ascending=False)

21    243
0     198
5     193
10    167
24    167
19    161
7     157
17    150
6     149
14    149
25    147
3     143
13    142
20    136
1     126
8     126
15    125
23    123
16    122
22    109
12     87
4      87
18     74
2      67
9      63
11     53
Name: list_number, dtype: int64

In [81]:
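# All summary lengths sorted from longest to shortest; the first entry (row 21, 243 characters) is the longest
longest_summary = tpsl_df['list_number'].sort_values(ascending=False)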

In [82]:
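# Note: this divides every length by the median (139.0); use list_number.mean() for the arithmetic mean
longest_summary/list_number.median()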

21    1.748201
0     1.424460
5     1.388489
10    1.201439
24    1.201439
19    1.158273
7     1.129496
17    1.079137
6     1.071942
14    1.071942
25    1.057554
3     1.028777
13    1.021583
20    0.978417
1     0.906475
8     0.906475
15    0.899281
23    0.884892
16    0.877698
22    0.784173
12    0.625899
4     0.625899
18    0.532374
2     0.482014
9     0.453237
11    0.381295
Name: list_number, dtype: float64
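The output above compares every summary to the *median* length. If "average" is read as the arithmetic mean, the ratio for the longest summary comes out a little higher. A minimal sketch using the recorded lengths follows; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Summary lengths copied from the earlier output, in row order (0 through 25)
lengths = pd.Series([
    198, 126, 67, 143, 87, 193, 149, 157, 126, 63, 167, 53, 87,
    142, 149, 125, 122, 150, 74, 161, 136, 243, 109, 123, 167, 147,
])

# Ratio of the single longest summary to each kind of "average"
ratio_to_mean = lengths.max() / lengths.mean()
ratio_to_median = lengths.max() / lengths.median()

print(round(ratio_to_mean, 2), round(ratio_to_median, 2))  # → 1.82 1.75
```

Either answer is defensible as long as the write-up says which average was used.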