# INTRO

Library = an 'add-on' or 'extension' to Python: ready-made code that we can import and use

* **`requests`** : download the website
* **`BeautifulSoup`** : pick out the important parts.
* **`pandas`** : just for exporting the CSV at the end

`requests` methods we'll use:
<br>➡ `requests.get(webpage)` : download the page, where `webpage` is the URL of the page
<br>➡ `.text` : get the HTML out of the download

`BeautifulSoup` methods we'll use:
<br>➡ `BeautifulSoup(downloaded_page_text)` : read the downloaded page
<br>➡ `.select` : select the elements we want to get based on their tags or attributes
<br>➡ `.get()` : get an attribute value
<br>➡ `.text` : get the content
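
To try these methods without downloading anything, here is a minimal sketch that parses a small, made-up HTML snippet. The snippet, its class name `headline` and the `example.com` links are assumptions for illustration only; `"html.parser"` is the parser that ships with Python.

```python
from bs4 import BeautifulSoup

# A tiny made-up page (not real Guardian HTML), just to try the methods on.
html = """
<ul>
  <li><a class="headline" href="https://example.com/one">First story</a></li>
  <li><a class="headline" href="https://example.com/two">Second story</a></li>
</ul>
"""

# Read the HTML; "html.parser" is Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

# .select() takes a CSS selector and returns a list of matching elements.
links = soup.select("a.headline")

# .text gives the content of an element, .get() gives an attribute value.
first_text = links[0].text
first_href = links[0].get("href")

print(len(links), first_text, first_href)
```

Note that `.select()` returns a list even when only one element matches, which is why we pick an element with `[0]` before asking for `.text` or `.get()`.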

![title](../img/html-syntax.png)

See also: 
* [CSS selector reference](https://www.w3schools.com/cssref/css_selectors.asp)
* [BeautifulSoup reference](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Requests reference](https://realpython.com/python-requests/#the-get-request)
* [Pandas export csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)

# Import libraries

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import pandas

# SCRAPER #1 : Get info from multiple pages

Let's get all the headlines from the first 5 pages (the most recent episodes) of The Guardian's 'Today in focus' series: https://www.theguardian.com/news/series/todayinfocus

### 1. Download and read the website

Save the address of the website as a string in variable `URL`
<br>➡ **Important**: we need to take the pagination into account

In [2]:
URL = "https://www.theguardian.com/news/series/todayinfocus"
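
About the pagination: the series is spread over numbered pages, which this notebook reaches further down by adding `?page=` and a page number to the address. A small sketch of building those addresses up front (the names `BASE_URL` and `page_urls` are only illustrative):

```python
# Base address of the series, as used throughout this notebook.
BASE_URL = "https://www.theguardian.com/news/series/todayinfocus"

# range(1, 6) yields 1, 2, 3, 4, 5: the stop value itself is not included.
page_urls = [f"{BASE_URL}?page={n}" for n in range(1, 6)]

for page_url in page_urls:
    print(page_url)
```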

Download the website with `requests.get()` and save the response into variable `r`

In [3]:
r = requests.get(URL)
r

<Response [200]>

Get the HTML of the website using `.text` and save it into variable `website_content`

In [5]:
website_content = r.text
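
The two steps above, downloading and taking `.text`, can be wrapped in one small helper that also checks that the download worked (the `<Response [200]>` shown earlier means HTTP status 200, i.e. success). This is only a sketch: the name `download_html` and the 10-second timeout are assumptions, not part of the session, while `raise_for_status()` and the `timeout` argument are regular `requests` features.

```python
import requests


def download_html(url, timeout=10):
    """Download a page and return its HTML, or raise if the request failed."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx answers instead of silently
    # handing back an error page.
    response.raise_for_status()
    return response.text


# Example call (needs an internet connection):
# website_content = download_html("https://www.theguardian.com/news/series/todayinfocus")
```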

Read the website with `BeautifulSoup` and save it into variable `website_read`

In [10]:
website_read = bs(website_content, "html.parser")
# `bs` is our alias, so this is the same as BeautifulSoup(website_content, "html.parser")
# website_read

### 2. Scrape all the headlines

Use the `website_read.select()` method here

In [12]:
headlines = website_read.select("a.js-headline-text")

In [13]:
len(headlines)

20

In [15]:
headlines[4].text

'The countdown to Cop26: can world leaders save the planet?'

Now get only the content of the elements
<br>➡  Use a `for` loop
<br>➡  Use `print`

In [22]:
for head in headlines:
    print(head.text)

Newcastle fans think they’ve got their club back. But at what cost?
Has England gone back to the office?
What went wrong with the UK’s handling of the Covid pandemic?
Dubai’s ruler and the Pegasus phone hacking exposed in a UK court
The countdown to Cop26: can world leaders save the planet?
The whistleblower who plunged Facebook into crisis
Can women trust the police?
Why everything you’ve heard about panic buying might be wrong
The Pandora papers: who’s giving money to the Conservatives?
Inside the Pandora papers – financial secrets of the rich and powerful
Boris Johnson wants a conference reset. Will reality ruin it?
The conviction of R Kelly
Can China help end the world’s addiction to coal?
Bond is back. Where’s he going next?
The Pegasus project: hacked in London
Keir Starmer’s make-or-break conference week
The energy crisis no one saw coming
Germany decides: who will follow Angela Merkel?
Going nuclear: the secret submarine deal to challenge China 
Finally! Get ready for a new sea

<br>➡  Use a `for` loop
<br>➡  save the elements in a list using `append`

In [None]:
data = []

for head in headlines:
    # add the text of each headline to the list
    data.append(head.text)


In [26]:
data

['Newcastle fans think they’ve got their club back. But at what cost?',
 'Has England gone back to the office?',
 'What went wrong with the UK’s handling of the Covid pandemic?',
 'Dubai’s ruler and the Pegasus phone hacking exposed in a UK court',
 'The countdown to Cop26: can world leaders save the planet?',
 'The whistleblower who plunged Facebook into crisis',
 'Can women trust the police?',
 'Why everything you’ve heard about panic buying might be wrong',
 'The Pandora papers: who’s giving money to the Conservatives?',
 'Inside the Pandora papers – financial secrets of the rich and powerful',
 'Boris Johnson wants a conference reset. Will reality ruin it?',
 'The conviction of R Kelly',
 'Can China help end the world’s addiction to coal?',
 'Bond is back. Where’s he going next?',
 'The Pegasus project: hacked in London',
 'Keir Starmer’s make-or-break conference week',
 'The energy crisis no one saw coming',
 'Germany decides: who will follow Angela Merkel?',
 'Going nuclear: the 
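
The loop-and-`append` pattern can also be written as a list comprehension. A minimal sketch on made-up HTML (the two headlines are copied from the output above; the markup around them is an assumption):

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page: two links with the class used above.
html = """
<a class="js-headline-text">Has England gone back to the office?</a>
<a class="js-headline-text">Can women trust the police?</a>
"""
website_read = BeautifulSoup(html, "html.parser")
headlines = website_read.select("a.js-headline-text")

# One line instead of an empty list, a for loop and append.
data = [head.text for head in headlines]
print(data)
```

Both versions give the same list; the explicit loop is easier to extend once we want more than one value per element.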

<br>➡  Use another `for` loop to loop through the pages
<br>➡ **pro tip**: use the `range()` function
<br>➡ **Important**: we need to request, open and read new pages every time. What does this mean?

In [28]:
data = []

for n in range(1, 6):  # pages 1 to 5

    URL = "https://www.theguardian.com/news/series/todayinfocus?page=" + str(n)

    # request, download and read each page anew
    website_request = requests.get(URL)
    website_content = website_request.text
    website_read = bs(website_content, "html.parser")

    headline_class = "a.js-headline-text"
    headlines = website_read.select(headline_class)

    for h in headlines:
        data.append(h.text)

In [32]:
URL

'https://www.theguardian.com/news/series/todayinfocus?page=5'

In [30]:
len(data)

100

<br>➡  save the list as a csv using `pandas.DataFrame(list).to_csv("filename.csv")`

In [31]:
pandas.DataFrame(data).to_csv("filename.csv")
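
By default `to_csv` also writes the row numbers, and a plain list ends up in a column with the header `0`. A sketch of a slightly tidier export, assuming we want the column to be called `headline` (a name chosen here for illustration):

```python
import pandas

# Two headlines from the output above, standing in for the full list.
data = [
    "Has England gone back to the office?",
    "Can women trust the police?",
]

# Name the column when building the table.
table = pandas.DataFrame(data, columns=["headline"])

# index=False leaves out the row numbers. Without a file name, to_csv()
# returns the CSV as text; pass a file name as first argument to save it.
csv_text = table.to_csv(index=False)
print(csv_text)
```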

# SCRAPER #2 : Get more detailed info from one page

### 1. Download and read the website

The first steps are the same as in scraper #1

### 2. Scrape the article details

Use:
<br>➡ `.select()` : select the elements we want to get based on their tags or attributes
<br>➡ `.get()` : get an attribute value
<br>➡ `.text` : get the content
<br>➡ `.strip()` : clean the text

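For each article we want:

* headline
* link
* intro

Steps:
<br>➡ Make a `list` to store the dictionaries
<br>➡ Use a `for` loop over the articles
<br>➡ Inside the loop, make a `dictionary` with the article information and `append` it to the list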

In [34]:
data = []
URL = "https://www.theguardian.com/news/series/todayinfocus"

website_request = requests.get(URL)
website_content = website_request.text
website_read = bs(website_content, "html.parser")

articles_details = website_read.select("div.fc-item__content")


In [35]:
len(articles_details)

20

In [52]:
articles_details[3].select(".fc-item__standfirst")[0].text.strip()

'A high court judge has ruled that Sheikh Mohammed bin Rashid al-Maktoum hacked the phone of his ex-wife Princess Haya using Pegasus spyware. In this episode we look at the implications of the affair'

In [45]:
articles_details[3].select(".fc-item__link")[0].text

'Sheikh Mohammed bin Rashid al-Maktoum  Dubai’s ruler and the Pegasus phone hacking exposed in a UK court '

In [49]:
articles_details[3].select(".fc-item__link")[0].get("href")

'https://www.theguardian.com/news/audio/2021/oct/12/dubai-ruler-and-pegasus-phone-hacking-exposed-in-a-uk-court'

In [54]:
for article in articles_details:
    details = {}
    
    details["headline"] = article.select(".fc-item__link")[0].text.strip()
    details["intro"] = article.select(".fc-item__standfirst")[0].text.strip()
    details["link"] = article.select(".fc-item__link")[0].get("href")
    
    data.append(details)
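
The loop above indexes with `[0]`, which raises an `IndexError` as soon as one article lacks, say, an intro. A defensive sketch using `select_one()`, which returns `None` when nothing matches; the HTML is a made-up stand-in that reuses the class names above, and the helper `text_or_empty` is hypothetical:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page: the second article has no standfirst.
html = """
<div class="fc-item__content">
  <a class="fc-item__link" href="https://example.com/a"> Headline A </a>
  <div class="fc-item__standfirst"> Intro A </div>
</div>
<div class="fc-item__content">
  <a class="fc-item__link" href="https://example.com/b"> Headline B </a>
</div>
"""


def text_or_empty(element):
    """Return the stripped text of an element, or "" if it is missing."""
    return element.text.strip() if element is not None else ""


soup = BeautifulSoup(html, "html.parser")
data = []

for article in soup.select("div.fc-item__content"):
    # select_one() gives the first match, or None if there is none.
    link = article.select_one(".fc-item__link")
    intro = article.select_one(".fc-item__standfirst")
    data.append({
        "headline": text_or_empty(link),
        "intro": text_or_empty(intro),
        "link": link.get("href") if link is not None else "",
    })

print(data)
```

Whether an empty string is the right placeholder depends on what you do with the CSV afterwards; skipping incomplete articles would be another reasonable choice.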

### 3. Save as CSV

In [58]:
pandas.DataFrame(data).to_csv("details.csv")
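
Because `data` is now a list of dictionaries, `pandas` turns each key into a column. A small sketch with two invented rows (the values are placeholders, not scraped data):

```python
import pandas

# Placeholder rows with the same keys as the dictionaries built above.
data = [
    {"headline": "Headline A", "intro": "Intro A", "link": "https://example.com/a"},
    {"headline": "Headline B", "intro": "Intro B", "link": "https://example.com/b"},
]

table = pandas.DataFrame(data)

# The dictionary keys have become the column names.
print(list(table.columns))

# index=False leaves the row numbers out of the CSV text.
print(table.to_csv(index=False))
```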

# Questions?

---
## Thank you for your attention!
I'd be happy if you could [leave me some feedback](https://goo.gl/forms/OtuNECgexYSyJGjh1) on this session so I can make it better.