> **DO YOU USE GITHUB?**  
If True: print('Remember to make your edits in a personal copy of this notebook')  
Else: print('You don't have to understand. Continue your life.')

# Module 6: Web Scraping 1

In this module you will be introduced to `web scraping`: 
- What is web scraping?
- How do you web scrape?
- Why is web scraping important to master as a data scientist?

Readings for `session 6+7+8`:
- [Python for Data Analysis, chapter 6](https://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf)
- [A Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/)
- [An introduction to web scraping with Python](https://towardsdatascience.com/an-introduction-to-web-scraping-with-python-a2601e8619e5)
- [Introduction to Web Scraping using Selenium](https://medium.com/the-andela-way/introduction-to-web-scraping-using-selenium-7ec377a8cf72)

Video material from `ISDS 2020`:
- [Web Scraping 1](https://bit.ly/ISDS2021_6)
- [Web Scraping 2](https://bit.ly/ISDS2021_7)
- [Web Scraping 3](https://bit.ly/ISDS2021_8)

Other resources:
- [Nicklas' Webpage](https://nicklasjohansen.netlify.app/)
- [Data Driven Organizational Analysis, Fall 2021](https://efteruddannelse.kurser.ku.dk/course/2021-2022/ASTK18379U)
- [Master of Science (MSc) in Social Data Science](https://www.socialdatascience.dk/education)

## Ethical Considerations
* If a regular user can’t access it, we shouldn’t try to get it. [That is considered hacking](https://www.dr.dk/nyheder/penge/gjorde-opmaerksom-paa-cpr-hul-nu-bliver-han-politianmeldt-hacking). 
* Don't hit it too fast: that is essentially a denial-of-service (DoS) attack. [Again, considered hacking](https://www.dr.dk/nyheder/indland/folketingets-hjemmeside-ramt-af-hacker-angreb). 
* Add headers stating your name and email to your requests to ensure transparency. 
* Be careful with copyrighted material.
* Fair use: take only the data you need.
* If you monetize the data, be careful not to compete directly with the site you are taking the data from.

<img src="https://github.com/snorreralund/images/raw/master/Sk%C3%A6rmbillede%202017-08-03%2014.46.32.png"/>

## The Web Scraping Recipe

Scraping information from the web involves three steps:
1. **MAPPING**: Finding URLs of the pages containing the information you want.
2. **DOWNLOAD**: Fetching the pages via HTTP.
3. **PARSE**: Extracting the information from HTML.  
  
  
You could also add `connection`, `storing`, `logging`, etc.        
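
The three steps above can be sketched as three small functions. This is only an illustration: the URL pattern, the `example.com` address, and the helper names (`map_urls`, `download`, `parse`, `fake_fetch`) are made up for this sketch, and a fake fetcher stands in for the network so the cell runs offline.

```python
import re

def map_urls(base, n_pages):
    """MAPPING: build the list of page URLs to visit (hypothetical pattern)."""
    return [f'{base}?page={page}' for page in range(1, n_pages + 1)]

def download(url, fetch):
    """DOWNLOAD: get the raw HTML; `fetch` is any function url -> text."""
    return fetch(url)

def parse(html):
    """PARSE: pull the text of every <h2> headline out of the HTML."""
    return re.findall(r'<h2[^>]*>(.*?)</h2>', html, flags=re.DOTALL)

def fake_fetch(url):
    """Stand-in for the network, returning a tiny made-up page."""
    return f'<html><h2>Results for {url}</h2></html>'

headlines = []
for page_url in map_urls('https://example.com/jobs', 2):
    headlines.extend(parse(download(page_url, fake_fetch)))
print(headlines)
```

With a real site, `fetch` could be something like `lambda url: requests.get(url).text`, and the parsing step would typically use **beautifulsoup** rather than a regular expression.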
   


### Packages used
Today we will mainly build on the Python skills you have acquired so far; tomorrow we will look into more specialized packages.

* for connecting to the internet we use: **requests**
* for parsing: **beautifulsoup** and **regex**
* for automatic browsing / screen scraping: **selenium** 
* for mitigating errors we use: **time**

We will write our scrapers in basic Python. For larger projects, consider looking into the package **scrapy**.

In [None]:
# check that you can import these libraries
# otherwise they can easily be installed using pip
# example: https://pypi.org/project/beautifulsoup4/

import requests
from bs4 import BeautifulSoup
import re
import selenium
import time
import pandas as pd

## Connecting to the Internet


**Connecting to the internet via HTTP**

*URL*: the address line in our browser.

Via HTTP we send a **get** request to an *address* with *instructions* (or rather, our DNS service provider directs our request to the right address).

*Address / Domain*: www.google.com

*Instructions*: /search?q=who+is+mister+miyagi

*Header*: information sent along with the request, including user agent (operating system, browser), cookies, and preferred encoding.

*HTML*: HyperText Markup Language, the language for displaying web content. More on this tomorrow.
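
To make these terms concrete, the standard library can split a URL into its parts. The sketch below combines the example address and instructions from above into one URL; nothing is sent over the network.

```python
from urllib.parse import urlsplit, parse_qs

example_url = 'https://www.google.com/search?q=who+is+mister+miyagi'
parts = urlsplit(example_url)

print(parts.scheme)   # the protocol
print(parts.netloc)   # the address / domain
print(parts.path)     # first part of the instructions
print(parts.query)    # the query string with the parameters

# parse_qs decodes the query string into a dict; '+' is decoded as a space
params = parse_qs(parts.query)
print(params)
```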


In [None]:
import requests
response = requests.get('https://www.google.com')
#response.text

In [None]:
import requests
response = requests.get('https://isdsucph.github.io/isds2021/')
#response.text

##  Static Webpage Example

Visit the following website (https://www.basketball-reference.com/leagues/NBA_2018.html).

The page displays tables of data that we want to collect.
Tomorrow you will see how to parse such a table, but for now I want to show you a neat function that has already implemented this.

In [None]:
url = 'https://www.basketball-reference.com/leagues/NBA_2018.html' # link to the website
dfs = pd.read_html(url) # parses all tables found on the page.
dfs[0]

If we did not have a neat function we would have to navigate the website to point at the data we wanted to collect. Below I show how to find the headline of the table. This is something you will learn about in session 7.

In [None]:
url = 'https://www.basketball-reference.com/leagues/NBA_2018.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('h2')[0].text

## Navigating websites to collect links
Now I will show you a few common ways of finding the links to the pages you want to scrape.

### Building URLs using a recognizable pattern
A nice trick is to understand how urls are constructed to communicate with a server. 

Let's look at how [jobindex.dk](https://www.jobindex.dk/) does it. We simply click around and take note of how the address line changes.

This will allow us to navigate the page, without having to parse information from the html or click any buttons.

* / works like folders on your computer.
* ? marks the start of a query with parameters.
* = assigns a value to a parameter, e.g. page=1000, offset=100 or showNumber=20.
* & separates different parameters.
* \+ is the URL encoding for whitespace.
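
The pieces in the list above can also be assembled programmatically. A minimal sketch, assuming the `page` and `q` parameters seen in the jobindex address line; the helper `build_url` and the search term are made up for illustration.

```python
from urllib.parse import urlencode

base = 'https://www.jobindex.dk/jobsoegning/storkoebenhavn'

def build_url(page, query):
    """Return the search URL for a given page number and search term."""
    # urlencode joins the parameters with '&' and encodes spaces as '+'
    return base + '?' + urlencode({'page': page, 'q': query})

search_urls = [build_url(page, 'data scientist') for page in range(1, 4)]
for search_url in search_urls:
    print(search_url)
```

Letting `urlencode` handle the special characters avoids hand-written string concatenation breaking when a search term contains spaces or symbols.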

In [None]:
# Mapping exercise
url = 'https://www.jobindex.dk/jobsoegning/storkoebenhavn?page=2&q=python'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
#soup

In [None]:
jobs = int(soup.find('span',attrs={'class':'d-md-none'}).text[0:3])
jobs

In [None]:
# 20 jobs per page; round up so that a partly filled last page is counted once
import math
n_pages = math.ceil(jobs/20)
for i in range(n_pages):
    print(i)

In [None]:
for i in range(n_pages):
    print('https://www.jobindex.dk/jobsoegning/storkoebenhavn?page=' + str(i+1) +'&q=python')

## Good practices
* Transparency: send your email and name in the header so webmasters know you are not a malicious actor.
* Rate limiting: make sure you don't hit their servers too hard.
* Reliability: 
    * Make sure the scraper can handle exceptions (e.g. bad connection) without crashing.
    * Keep a log.
    * Store raw data.
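
A minimal sketch of how these practices could fit together in one helper. The function name `polite_get`, its parameters, and the log layout are assumptions made for this illustration; this is not the connector class used in the course.

```python
import csv
import time

def polite_get(url, fetch, log_path='log.csv', delay=0.5, retries=3):
    """Call fetch(url) with a pause, retries, and one log row per attempt.

    `fetch` is any function url -> response. Returns None if every attempt fails.
    """
    for attempt in range(1, retries + 1):
        time.sleep(delay)                  # rate limiting: pause before every call
        start = time.time()
        try:
            response, error = fetch(url), ''
        except Exception as exc:           # reliability: record the error, do not crash
            response, error = None, repr(exc)
        # keep a log: timestamp, url, attempt number, duration in seconds, error
        with open(log_path, 'a', newline='') as f:
            csv.writer(f).writerow([start, url, attempt, time.time() - start, error])
        if response is not None:
            return response
    return None
```

With `requests`, transparency could be added through the fetch function, for example `fetch=lambda u: requests.get(u, headers={'User-Agent': 'Your Name (your@email.com)'}, timeout=10)`. To store raw data, write `response.text` to disk before parsing it.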


## Logging
Even if logging is not important for the exercises below, get in the habit of using this class for connecting to the internet, so that you practice logging your activity.

You should run `pip install scraping_class` to install the module to be used.

In [None]:
import scraping_class
logfile = 'log.csv'  # name your log file
connector = scraping_class.Connector(logfile)

# Exercise Set 6: Web Scraping 1

In this Exercise Set we shall practice our web scraping skills using only basic Python.  
We shall cover variations between static and dynamic pages.

## Exercise Section 6.1: Scraping Jobnet.dk

In this exercise you get to practice locating the request that the JavaScript sends to get the job data that it builds the job listings from. You should use the **>Network Monitor<** tool in your browser. I recommend using Chrome.

Furthermore, you practice spotting how the pagination is done, not by clicking the next page button but by changing a small parameter in the URL.

> **Ex. 6.1.1:** Go to  www.jobnet.dk and investigate the page. Start your `mapping`. Figure out which URL you need to scrape to collect job posting data. Sometimes this can be hard and requires you to inspect the page.

> **Ex. 6.1.2.:** Use the `requests` module to collect the first 20 results and unpack the relevant `json` data into a `pandas` DataFrame.

> **Ex. 6.1.3.:** How many results do you find in total? Store this number as 'n_listings' for later use.

In [None]:
import json

In [None]:
#test_url = 'https://job.jobnet.dk/CV/FindWork?SearchString=python&Offset=0&SortValue=BestMatch'
test_url = 'https://job.jobnet.dk/CV/FindWork/Search?&Offset=&SortValue=BestMatch'
response = requests.get(test_url)
res_json = response.json()
df = pd.DataFrame(res_json['JobPositionPostings'])


In [None]:
df

In [None]:
col1 = df.columns

In [None]:
col1 = sorted(df.columns)
col = ['Abroad', 'AnonymousEmployer', 'AssignmentStartDate', 'AutomatchType', 'Country', 
                              'DetailsUrl', 'EmploymentType', 'FormattedLastDateApplication', 'HasLocationValues', 
                              'HiringOrgCVR', 'HiringOrgName', 'ID', 'IsExternal', 'IsHotjob', 'JobAnnouncementType', 
                              'JobHeadline', 'JobLogUrl', 'JoblogWorkTime', 'LastDateApplication', 'Latitude', 'Location',
                              'Longitude', 'Municipality', 'Occupation', 'OccupationArea', 'OccupationGroup', 
                              'OrganisationId', 'PostalCode', 'PostalCodeName', 'PostingCreated', 'Presentation',
                              'Region', 'ShareUrl', 'Title', 'Url', 'UseWorkPlaceAddressForJoblog', 'UserLoggedIn',
                              'Weight', 'WorkHours', 'WorkPlaceAbroad', 'WorkPlaceAddress', 'WorkPlaceCity',
                              'WorkPlaceNotStatic', 'WorkPlaceOtherAddress', 'WorkPlacePostalCode', 'WorkplaceID']
set(col).symmetric_difference(col1)

In [None]:
assert sorted(df.columns) == ['Abroad', 'AnonymousEmployer', 'AssignmentStartDate', 'AutomatchType', 'Country', 
                              'DetailsUrl', 'EmploymentType', 'FormattedLastDateApplication', 'HasLocationValues', 
                              'HiringOrgCVR', 'HiringOrgName', 'ID', 'IsExternal', 'IsHotjob', 'JobAnnouncementType', 
                              'JobHeadline', 'JobLogUrl', 'JoblogWorkTime', 'LastDateApplication', 'Latitude', 'Location',
                              'Longitude', 'Municipality', 'Occupation', 'OccupationArea', 'OccupationGroup', 
                              'OrganisationId', 'PostalCode', 'PostalCodeName', 'PostingCreated', 'Presentation',
                              'Region', 'ShareUrl', 'Title', 'Url', 'UseWorkPlaceAddressForJoblog', 'UserLoggedIn',
                              'Weight', 'WorkHours', 'WorkPlaceAbroad', 'WorkPlaceAddress', 'WorkPlaceCity',
                              'WorkPlaceNotStatic', 'WorkPlaceOtherAddress', 'WorkPlacePostalCode', 'WorkplaceID']
assert len(df) == 20

In [None]:
res_json = response.json()
res_json.keys()
n_listings = res_json['TotalResultCount']

> **Ex. 6.1.4:** This exercise is about paging the results. We need to understand the website's pagination scheme. 

> Now scroll down the webpage and press the next page button. See how the parameters of the URL change as you turn the pages.

> **Ex. 6.1.5:** Design a `for` loop using the `range` function that changes this paging parameter in the URL. Use 'n_listings' from before to define the limits of the range function. Store these URLs in a container. 

In [None]:
list_of_urls = []
for i in range(0,n_listings,20):  # offsets 0, 20, 40, ...; the last one is below n_listings
    url = f'https://job.jobnet.dk/CV/FindWork/Search?&Offset={i}&SortValue=BestMatch'
    list_of_urls.append(url)

#list_of_urls
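
The boundary of the range is easy to get wrong by one page. The worked example below uses a made-up total (`total = 45`) and the page size of 20 from the exercise to show that the offsets and the page count agree.

```python
import math

total = 45        # made-up number of listings, for illustration only
page_size = 20    # results per request, as in the exercise

n_result_pages = math.ceil(total / page_size)    # 20 + 20 + 5 listings
offsets = list(range(0, total, page_size))       # first listing on each page
print(n_result_pages, offsets)

# when the total is an exact multiple of the page size, no empty page is added
print(list(range(0, 40, page_size)))
```

Using `total + 1` as the upper limit would add one empty request whenever the total is an exact multiple of the page size.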

> **Ex. 6.1.6:** Pick 20 random links using the `random.sample()` function and scrape their content. Use the `time.sleep()` function to limit the rate of your calls. Load all the results into a DataFrame. ***extra***: monitor the time left until the loop completes using the `tqdm.tqdm()` function.


In [None]:
import random

In [None]:
# draw the sample once, pause between calls, and concatenate at the end
frames = []
for url in random.sample(list_of_urls, 20):
    response = requests.get(url)
    res_json = response.json()
    frames.append(pd.json_normalize(res_json['JobPositionPostings']))
    time.sleep(0.5)  # limit the rate of our calls
df = pd.concat(frames, ignore_index=True)

In [None]:
df

> **Ex. 6.1.7:** Snorre Ralund, a researcher at SODAS, has built a connector class. Repeat 6.1.6 but try to use Snorre's connector to log your activity.


In [None]:
raise NotImplementedError()

## Exercise Section 6.2: Scraping Trustpilot.com
Now for a slightly more elaborate, yet still simple scraping problem. Here we want to scrape trustpilot for user reviews. This data is very nice since it provides free labeled data (rating) to train a machine learning model to understand positive and negative sentiment. 

Here you will practice crawling a website collecting the links to each company review page, and finally locate another behind the scenes JavaScript request that gets the review data in a neat json format.

> **Ex. 6.2.1:** Visit the https://www.trustpilot.com/ website and locate the categories page.
From this page you find links to company listings.

> **Ex. 6.2.2:**
Get the category page using the `requests` module and extract each link to a specific category page from the HTML. This can be done using the basic python `.split()` string method. Make sure only links within the ***/categories/*** section are kept, checking each string using the ```if 'pattern' in string``` condition. 

*(Hint: The links are relative. You need to add the domain name)*
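
The `.split()` approach and the hint about relative links can be illustrated on a small string. The HTML snippet below is made up for this sketch; the real page is larger and its markup may differ.

```python
from urllib.parse import urljoin

domain = 'https://www.trustpilot.com'

# hypothetical HTML snippet, not copied from the real page
snippet = (
    '<a href="/categories/animals_pets">Animals</a>'
    '<a href="/about">About</a>'
    '<a href="/categories/food_beverages_tobacco">Food</a>'
)

category_links = []
for chunk in snippet.split('href="')[1:]:    # every chunk now starts with a link
    link = chunk.split('"')[0]               # keep the text up to the closing quote
    if '/categories/' in link:               # keep only links in the categories section
        category_links.append(urljoin(domain, link))   # relative link -> absolute URL
print(category_links)
```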


In [None]:
# Incorrect solution
"""
trust_url = 'https://www.trustpilot.com/categories'
trust_response = requests.get(trust_url)
trust_soup = BeautifulSoup(trust_response.content, 'html.parser')
find_ = trust_soup.find_all('span', attrs={'class':'categories_categoryListItem__1dO4P'})
#qwer = [x.split('<span class="categories_categoryListItem__1dO4P">') for x in find_]
#qwer
listy = []
for i in find_:
    listy.append(" ".join(str(i)[49:].split(' &amp; ')))
    
listy = [x[:-7] for x in listy]
listy
"""

In [None]:
# CORRECT SOLUTION
new_url = 'https://www.trustpilot.com/categories'
new_res = requests.get(new_url)
new_soup = BeautifulSoup(new_res.content, 'html.parser')
categories = new_soup.find_all('div',{'class':'categories_subCategoryItem__2Qwj8'})

# the links are relative, so prepend the domain name
list_of_urls_trust = [f"https://www.trustpilot.com{category.a['href']}" for category in categories]

list_of_urls_trust

In [None]:
# Incorrect solution
"""
links_with_text = []
for a in trust_soup.find_all('a', href=True): 
    if a.text:
        links_with_text.append(a['href'])
links_with_text
"""

> **Ex. 6.2.3:** Get one of the category section links. Write a function to extract the links to the company review page from the HTML.

> **Ex. 6.2.4:** Figure out how the pagination is done by following how the URL changes when pressing the **next page**-button to obtain more company listings. Write a function that builds the links for paging through all the company listing results of each category. This includes parsing the number of subpages of each category and changing the correct parameter in the URL.

(Hint: Find the maximum number of result pages, shown right before the next page button, and write a loop that changes the page parameter of the URL.)


> **Ex. 6.2.5:** Loop through all categories and build the paging links using the above defined function.

> **Ex. 6.2.6:** Randomly pick one of the category listing links you have generated, and get the links to the companies listed using the other function you defined. 

> **Ex. 6.2.7:** Visit one of these links and inspect the **>Network Monitor<** to locate the request that loads the review data. Use the requests module to retrieve this link and unpack the json results to a pandas DataFrame.
