# INTRO

Library = an 'addon' or 'extension' to Python

* **`requests`** : download the website
* **`BeautifulSoup`** : pick out the important parts.
* **`pandas`** : export the results as a CSV at the end

`requests` methods we'll use:
<br>➡ `requests.get(webpage)` : download the page, where `webpage` is the URL of the page
<br>➡ `.text` : get the HTML out of the download

`BeautifulSoup` methods we'll use:
<br>➡ `BeautifulSoup(downloaded_page_text)` : read the downloaded page
<br>➡ `.select` : select the elements we want to get based on their tags or attributes
<br>➡ `.get()` : get an attribute value
<br>➡ `.text` : get the content
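
To see how these methods fit together before touching a real website, here is a minimal sketch on a tiny hand-written page. The HTML, the class name `story` and the variable names are made up for illustration, and `"html.parser"` is Python's built-in parser.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, so nothing needs to be downloaded.
html = """
<div class="story">
  <a href="/news/1" title="First story">Read the <b>first</b> story</a>
  <a href="/news/2" title="Second story">Read the second story</a>
</div>
"""

page = BeautifulSoup(html, "html.parser")

# .select() takes a CSS selector and always returns a list of matches.
links = page.select("div.story a")

for link in links:
    # .get() reads an attribute value, .text gives the content without the tags.
    print(link.get("href"), "|", link.get("title"), "|", link.text)
```

Because `.select()` returns a list, you either loop over it or pick one element with an index such as `[0]`.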

![title](img/html-syntax.png)

See also: 
* [CSS selector reference](https://www.w3schools.com/cssref/css_selectors.asp)
* [BeautifulSoup reference](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Requests reference](https://realpython.com/python-requests/#the-get-request)
* [Pandas export csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)

# Import libraries

Copy and paste these in the cell beneath 

`import requests`<br> 
`import pandas`<br> 
`from bs4 import BeautifulSoup as bs`

In [77]:
import requests
from bs4 import BeautifulSoup as bs
import pandas

# SCRAPER #1 : focus on loops

Let's get all the headlines on the 5 most recent pages of the 'Today in focus' series of The Guardian: https://www.theguardian.com/news/series/todayinfocus
<br>➡ **Important**: we need to take the pagination into account

### 1. Download and read the website

Save the address of the website as a string in the variable `URL`

In [79]:
URL = "https://www.theguardian.com/news/series/todayinfocus?page="

Download the website with `requests.get()` and save it into variable `website_request`

In [80]:
website_request = requests.get(URL)

Get the HTML of the website using `.text` and save it into variable `website_content`

In [81]:
website_content = website_request.text
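
A download can fail, for example when the address is wrong or the site is down. Below is a rough sketch of one way to guard against that. The helper name `download_html` and the 10-second timeout are assumptions made for this example, not part of the workshop code.

```python
import requests


def download_html(url):
    """Download a page and return its HTML, or None if the download failed.

    Hypothetical helper for illustration; the 10-second timeout is an
    arbitrary choice.
    """
    try:
        response = requests.get(url, timeout=10)
        # Turn error status codes such as 404 or 500 into an exception.
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print("Download failed:", error)
        return None
    return response.text
```

With a helper like this, `website_content = download_html(URL)` would give you either the HTML or `None`, which you can check before reading the page.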

Read the website with `BeautifulSoup` and save it into variable `website_read`

In [82]:
# Name the parser explicitly so BeautifulSoup does not have to guess one
website_read = bs(website_content, "html.parser")

### 2. Scrape all the headlines

Use the `website_read.select()` method here

In [83]:
website_read.select("span.js-headline-text")

[<span class="js-headline-text">The five brothers forced apart by the war in Syria</span>,
 <span class="js-headline-text">Naomi Klein on how politics can solve the climate crisis</span>,
 <span class="js-headline-text">Are Fox News and Donald Trump falling out of love?</span>,
 <span class="js-headline-text">Is this the end of the road for remainers?</span>,
 <span class="js-headline-text">How did a town in West Virginia become the opioid capital of the US?</span>,
 <span class="js-headline-text">Naming and shaming the polluters</span>,
 <span class="js-headline-text">Will parliament vote for a Brexit deal?</span>,
 <span class="js-headline-text">On the frontline as US troops leave northern Syria</span>,
 <span class="js-headline-text">Hong Kong: the story of one protester</span>,
 <span class="js-headline-text">What is the truth about vaping?</span>,
 <span class="js-headline-text">A fatal crash and the problem of diplomatic immunity</span>,
 <span class="js-headline-text">Brexit and

Now get only the content of the elements
<br>➡  Use a `for` loop
<br>➡  Use `print`

In [84]:
for headline in website_read.select("span.js-headline-text"):
    print(headline.text.strip())

The five brothers forced apart by the war in Syria
Naomi Klein on how politics can solve the climate crisis
Are Fox News and Donald Trump falling out of love?
Is this the end of the road for remainers?
How did a town in West Virginia become the opioid capital of the US?
Naming and shaming the polluters
Will parliament vote for a Brexit deal?
On the frontline as US troops leave northern Syria
Hong Kong: the story of one protester
What is the truth about vaping?
A fatal crash and the problem of diplomatic immunity
Brexit and the Irish border: is there a solution?
Shell, Nigeria and a 24-year fight for justice
Thirteen children have been shot dead in St Louis, Missouri. Why?
The strange world of TikTok: viral videos and Chinese censorship
Reality TV, social media and living your life online with Jia Tolentino
Boris Johnson’s Brexit speech: preparing for an election
Boris Johnson and the Jennifer Arcuri allegations
Could this impeachment inquiry end Trump’s presidency?
Is it over for Justi

Now save the headlines instead of printing them
<br>➡  Use a `for` loop
<br>➡  Save the elements in a list using `append`

In [85]:
data = []
for headline in website_read.select("span.js-headline-text"):
    data.append(headline.text.strip())

<br>➡  Use another `for` loop to loop through the pages
<br>➡ **pro tip**: use the `range()` function
<br>➡ **Important**: we need to request, open and read new pages every time. What does this mean?

In [None]:
data = []
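for page in range(1, 6):  # pages 1 to 5; the end value 6 is not included
    URL = "https://www.theguardian.com/news/series/todayinfocus?page=" + str(page)

    # Request, download and read each page again inside the loop
    website_request = requests.get(URL)
    website_content = website_request.text
    website_read = bs(website_content, "html.parser")

    # Add the headlines of this page to the same list
    for headline in website_read.select("span.js-headline-text"):
        data.append(headline.text.strip())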
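
If the `range()` part is unclear, it can help to build the page addresses on their own first, without downloading anything. A small sketch, where `BASE_URL` and `page_urls` are names chosen for this example:

```python
BASE_URL = "https://www.theguardian.com/news/series/todayinfocus?page="

page_urls = []
# range(1, 6) counts 1, 2, 3, 4, 5: the end value is not included.
for page in range(1, 6):
    # str() turns the number into text so it can be glued onto the address.
    page_urls.append(BASE_URL + str(page))

for url in page_urls:
    print(url)
```

Each of these addresses is a separate page, which is why the request, download and read steps have to be repeated for every one of them.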

<br>➡  Save the list as a csv using `pandas.DataFrame(list).to_csv("filename.csv")`

In [67]:
pandas.DataFrame(data).to_csv("headlines.csv")
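
By default the CSV gets an unnamed column called `0` plus a column of row numbers. If you would rather have a tidy single column, something like the sketch below should work. The column name `headline` is an assumption, and the two headlines stand in for the full list.

```python
import pandas

# Two headlines copied from the output above, standing in for the full list.
data = [
    "The five brothers forced apart by the war in Syria",
    "What is the truth about vaping?",
]

# A dictionary gives the column a name; index=False drops the row numbers.
table = pandas.DataFrame({"headline": data})

# Without a filename, to_csv() returns the CSV as a string instead of writing a file.
csv_text = table.to_csv(index=False)
print(csv_text)
```

Passing a filename, as in `table.to_csv("headlines.csv", index=False)`, writes the same content to disk.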

# SCRAPER #2 : focus on selecting

We're gonna scrape the https://www.purehelp.no website. Click on the link to open it.

### 1. Download and read the website

The first steps are the same

In [75]:
URL = "https://www.purehelp.no/m/solrSearch/search/a/1/Turnover_mer_enn_100_millioner/County_Svalbard/"

In [76]:
website_request = requests.get(URL)
website_content = website_request.text
website_read = bs(website_content, "html.parser")

### 2. Scrape the company details

* name
* sector
* location
* turnover
* link to the details page

Use a `for` loop.<br>Make a `dictionary` to store the information of each company.<br>Make a `list` to store these dictionaries.

Use:
<br>➡ `.select()` : select the elements we want to get based on their tags or attributes
<br>➡ `.get()` : get an attribute value
<br>➡ `.text` : get the content
<br>➡ `tag[attribute^='value']` : get the tag where the attribute begins with some text
<br>➡ `tag:nth-of-type(n)` : select the tag that is the n-th of its type inside its parent (counting starts at 1)
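
The last two selectors are easier to grasp on a small example. The markup below is made up and only loosely modelled on one search result, so the real page may well look different.

```python
from bs4 import BeautifulSoup

# Made-up markup, loosely modelled on one search result; the real page may differ.
html = """
<li>
  <div title="Lokalisert i Longyearbyen"><span>Longyearbyen</span><span>Svalbard</span></div>
  <div title="Omsetning 2018"><span>NOK</span><span>150 000</span></div>
</li>
"""
item = BeautifulSoup(html, "html.parser")

# div[title^='Omsetning'] : a <div> whose title attribute starts with "Omsetning".
turnover_div = item.select("div[title^='Omsetning']")[0]

# span:nth-of-type(2) : a <span> that is the second <span> inside its parent.
turnover = item.select("div[title^='Omsetning'] span:nth-of-type(2)")[0].text
location = item.select("div[title^='Lokalisert'] span:nth-of-type(1)")[0].text

print(location, "|", turnover)
```

A space between two selectors means "somewhere inside", so `div[title^='Omsetning'] span` only looks at spans within the matching div.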

In [41]:
data = []
for company in website_read.select(".cRL li"):
    company_details = {}
    
    company_details["name"] = company.select("a")[0].get("title")
    company_details["link"] = company.select("a")[0].get("href")
    company_details["sector"] = company.select(".d-md-none")[0].text
    company_details["location"] = company.select("div[title^='Lokalisert'] span:nth-of-type(1)")[0].text
    company_details["turnover"] = company.select("div[title^='Omsetning'] span:nth-of-type(2)")[0].text
    
    data.append(company_details)
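
Indexing with `[0]` raises an `IndexError` as soon as one company is missing an element. One possible way around that is `.select_one()`, which returns `None` when nothing matches. The helper `first_text` below is hypothetical and the HTML is made up for the example.

```python
from bs4 import BeautifulSoup


def first_text(element, selector):
    """Return the stripped text of the first match, or None if nothing matches.

    Hypothetical helper for illustration, not part of BeautifulSoup.
    """
    match = element.select_one(selector)  # like .select(...)[0], but None when empty
    if match is None:
        return None
    return match.text.strip()


# Made-up example: the second company has no sector element.
html = """
<ul class="cRL">
  <li><a href="/a" title="Company A">A</a><span class="d-md-none"> Mining </span></li>
  <li><a href="/b" title="Company B">B</a></li>
</ul>
"""
page = BeautifulSoup(html, "html.parser")

sectors = []
for company in page.select(".cRL li"):
    sectors.append(first_text(company, ".d-md-none"))

print(sectors)
```

The missing value ends up as `None` in the list, and pandas will write it as an empty cell in the CSV.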

### 3. Save as CSV

In [42]:
pandas.DataFrame(data).to_csv("companies.csv")
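
This works because pandas knows how to read a list of dictionaries. A short sketch with made-up records (the company names and numbers are invented) shows what the table looks like before it is exported:

```python
import pandas

# Made-up records in the same shape as the dictionaries built in the loop.
data = [
    {"name": "Company A", "sector": "Mining", "turnover": "150 000"},
    {"name": "Company B", "sector": "Tourism", "turnover": "120 000"},
]

table = pandas.DataFrame(data)

# Each dictionary key becomes a column, each dictionary becomes a row.
print(list(table.columns))
print(len(table))
```

So the keys you choose in `company_details` end up as the column headers of `companies.csv`.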

# Questions?

---
## Thank you for your attention!
I'd be happy if you [leave me some feedback](https://goo.gl/forms/OtuNECgexYSyJGjh1) for this session so I can make it better.