![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)

## Introduction to Text Mining and Natural Language Processing

##### !!!Homework 3 is at the end of this notebook!!!

## Session 7: Scraping

In this session we will learn some of the fundamentals of using web scraping to generate a corpus of text. There are several packages that help you achieve this. I will assume throughout that you have already covered the basics, so we will jump right in. Even if you have never done any scraping, it should be intuitive.

Here is the checklist for this class:
- you have installed beautifulsoup
- you have installed selenium
- you have installed requests
- you can import time (it is part of the Python standard library, so there is nothing to install)
- you have the firefox browser
- you have put the right geckodriver.exe file in the folder where the other code files will be
- you have put the python file scrape_functions.py into that folder
- you have opened the python file scrape_functions.py and entered the location of your firefox.exe

#### Try this:
import requests

from bs4 import BeautifulSoup

import os

from selenium import webdriver

import time

import scrape_functions as kzd


## Scraping UN Resolutions

We will first use beautifulsoup to scrape UN resolutions from the UN security council.

The Security Council has primary responsibility for the maintenance of international peace and security. It has 15 Members, and each Member has one vote. Under the Charter of the United Nations, all Member States are obligated to comply with Council decisions.

The Security Council takes the lead in determining the existence of a threat to the peace or act of aggression. It calls upon the parties to a dispute to settle it by peaceful means and recommends methods of adjustment or terms of settlement. In some cases, the Security Council can resort to imposing sanctions or even authorize the use of force to maintain or restore international peace and security.

In today's class we will first go through the code together that downloads a single UN resolution from the webpage. I then want you to do an exercise, alone or in teams, to scrape all resolutions from the years 2023 and 2022 automatically.

We will then discuss in class the use of metadata and how you would scrape it and put it into a dataframe that can be matched with the text you scraped.

### Step 1: Inspect the Webpage

#### Visual Inspection
The first step is always to check out the webpage. Let's go to the UN webpage of UN security council resolutions. https://www.un.org/securitycouncil/content/resolutions-0

or (once we click on a year)

https://www.un.org/securitycouncil/content/resolutions-adopted-security-council-2023

### Step 2: Inspect using Developer Tools
Second step is to use Developer tools to understand the structure of a website. All modern browsers come with developer tools installed. We will now do this for Chrome.

On macOS, you can open up the developer tools through the menu by selecting View → Developer → Developer Tools. On Windows and Linux, you can access them by clicking the top-right menu button (⋮) and selecting More Tools → Developer Tools. You can also access your developer tools by right-clicking on the page and selecting the Inspect option or using a keyboard shortcut:

Mac: Cmd+Alt+I

Windows/Linux: Ctrl+Shift+I

Try to find where the resolution links sit on the webpage.

### Step 3: Try getting the pdf to download.

We tried (and failed) to use Beautiful Soup for this. Nonetheless, we decided it is useful to show you how to approach webpages with the package.

In the end we will use Selenium. The main difference is that Beautiful Soup parses the HTML code and lets you identify and store items in the resulting object, whereas Selenium drives your web browser as if you were clicking on things by hand.

## Beautiful soup

Beautiful Soup is a Python package that is used for web scraping. It is used to parse and navigate through the HTML or XML code of a webpage, allowing you to extract the desired information. Here is a guide to Beautiful soup
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

When using Beautiful Soup for web scraping, you typically start by making a request to a webpage using a package like **requests**. This allows you to download the HTML or XML code of the webpage, which you can then pass to Beautiful Soup. Once you have the page's code, you can use Beautiful Soup to navigate and search through the code to find the specific information you are looking for.

The requests package is a Python library for sending HTTP requests to a webpage. It supports methods such as GET, POST, PUT, and DELETE, and it lets you add headers and payloads to a request. It is very useful for web scraping because it lets you download the HTML or XML code of a webpage, which can then be parsed with Beautiful Soup.

In [60]:
from bs4 import BeautifulSoup
import requests

# Make a request to a webpage
url = "https://elpais.com/"
page = requests.get(url)

# Create a BeautifulSoup object
soup = BeautifulSoup(page.content, 'html.parser')

To inspect elements on a webpage, you can use the developer tools in your web browser. In most modern browsers, you can right-click on an element and select "Inspect" to open the developer tools. This will allow you to view the HTML source code for the page and see the structure of the elements. You can then use the HTML tags, classes, and ids to identify the elements you want to find with BeautifulSoup.

In [61]:
# Find all elements with a specific tag
tags = soup.find_all('p')
print(tags)

[<p class="c_d">Un total de 750 millones de dólares salieron de la Secretaría de Gobernación, la Policía Federal y la Fiscalía capitalina a través de testaferros durante una década, según la demanda civil presentada por la UIF en una corte de Florida </p>, <p class="c_d">El primer declarante en el juicio contra García Luna, Sergio Villarreal, es un viejo conocido de los juzgados en México. Ha testificado contra generales y policías, además del matrimonio Abarca en el ‘caso Ayotzinapa’</p>, <p class="c_d">La comedia mexicana pierde a uno de sus principales exponentes, un hombre que arrancó risas con montajes misóginos y homofóbicos, que hoy se someten al tamiz de la crítica</p>, <p class="c_d">Una investigación psicológica correlaciona el consumo constante de información política y sesgada con un mayor estrés y deterioro del bienestar emocional</p>, <p class="c_d">Las autoridades federales vigilan la evolución de nueve personas en contacto con un perro infectado en Sonora</p>, <p class=

In [62]:
# Find all elements with a specific tag
tags = soup.find_all('article')
print(tags)

[<article class="c c-d c--m"><figure class="c_m a_m-h"><a class="c_m_c _pr _db" href="/mexico/2023-01-25/mexico-denuncia-en-eeuu-que-garcia-luna-obtuvo-mas-de-400-millones-de-dolares-de-desvios-durante-el-gobierno-de-pena-nieto.html"><img alt="Genaro Garcia Luna se pone la mano en el corazón durante la audiencia en Nueva York." class="c_m_e _re lazyload a_m-h" decoding="auto" height="233" loading="lazy" src="https://imagenes.elpais.com/resizer/G3gpMdLxxrE1znJSd2yjabG9rJ4=/414x233/cloudfront-eu-central-1.images.arcpublishing.com/prisa/Q3QGIGQVARCLVFEW7DBWPABLGI" srcset="https://imagenes.elpais.com/resizer/G3gpMdLxxrE1znJSd2yjabG9rJ4=/414x233/cloudfront-eu-central-1.images.arcpublishing.com/prisa/Q3QGIGQVARCLVFEW7DBWPABLGI 414w,https://imagenes.elpais.com/resizer/_1aK45E3_D06Vo8ZD4de4gY_-OE=/828x466/cloudfront-eu-central-1.images.arcpublishing.com/prisa/Q3QGIGQVARCLVFEW7DBWPABLGI 828w" width="414"/></a><figcaption class="c_m_p"><span>Genaro Garcia Luna se pone la mano en el corazón duran

In [63]:
# Find all elements with a specific class
elements = soup.find_all(class_='ad')
print(elements)

[<div class="ad" id="elpais_gpt-SKIN"></div>, <div class="ad" id="elpais_gpt-INTER"></div>, <div class="ad ad-giga ad-giga-1"><div class="ad ad-ldb ad-ldb1" id="elpais_gpt-LDB1"></div><div class="mldb1-wrapper" id="mldb1-wrapper"><div class="ad ad-mldb ad-mldb1" id="elpais_gpt-MLDB1"></div></div></div>, <div class="ad ad-ldb ad-ldb1" id="elpais_gpt-LDB1"></div>, <div class="ad ad-mldb ad-mldb1" id="elpais_gpt-MLDB1"></div>, <div class="ad ad-nstd-bd _df" id="ad-ntsd-bd"></div>, <div class="ad ad-nstd ad-nstd1" id="elpais_gpt-NSTD1"></div>, <div class="ad ad ad-300" id="elpais_gpt-MPU1"></div>, <div class="ad ad ad-300" id="elpais_gpt-MPU2"></div>, <div class="ad ad ad-300" id="elpais_gpt-MPU3"></div>, <div class="ad ad ad-300" id="elpais_gpt-MPU4"></div>, <div class="ad ad ad-300" id="elpais_gpt-MPU5"></div>, <div class="ad ad-giga ad-giga-2"><div class="ad ad-ldb ad-ldb2" id="elpais_gpt-LDB2"></div><div class="mldb2-wrapper" id="mldb2-wrapper"><div class="ad ad-mldb ad-mldb2" id="elpa

In [64]:
# Find all elements with a specific id
elements = soup.find_all(id="elpais_gpt-SKIN")
print(elements)

[<div class="ad" id="elpais_gpt-SKIN"></div>]


In [65]:
# Find all elements with a specific attribute
# You will use xpath a lot for the homework that is why I introduce it here already.
# Note: BeautifulSoup does not evaluate XPath expressions. This call only looks for
# tags that carry an HTML attribute literally named "xpath", so it finds nothing here.
element = soup.find_all(attrs={'xpath': '//*[@id="fusion-app"]/main/div[1]/section/div[1]/article[1]/header/h2/a'})
print(element)

[]
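Since Beautiful Soup has no XPath support, a path copied from the developer tools can often be rewritten as a CSS selector and passed to `soup.select`. The sketch below illustrates this on a small hand-written HTML snippet (hypothetical, not the real El País page), so the selector and the link targets are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML standing in for a news front page.
html = """
<div id="fusion-app">
  <main>
    <section>
      <article><header><h2><a href="/story-1.html">First headline</a></h2></header></article>
      <article><header><h2><a href="/story-2.html">Second headline</a></h2></header></article>
    </section>
  </main>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: every <a> inside an <h2> inside an <article> under the element with id "fusion-app"
headline_links = soup.select("#fusion-app article h2 a")

# select_one returns only the first match (or None if nothing matches)
first = soup.select_one("#fusion-app article h2 a")

hrefs = [a["href"] for a in headline_links]
print(hrefs)
print(first.get_text())
```

In Selenium, by contrast, the XPath string can be used directly, which is what the next section does.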


## Selenium

Selenium is a Python package that is used for automating web browsers. It allows you to interact with web pages in a programmatic way, which can be useful for web scraping. Selenium can be used to automate tasks such as **downloading**, filling out forms, clicking buttons, and navigating between pages.

When using Selenium for web scraping, you can use the package to open a browser, navigate to a specific webpage, and then extract the desired information from the page's HTML code. The package can also be used to interact with the page, such as clicking buttons or filling out forms, in order to access the desired information.

The browser plays an important role in this process, as Selenium uses the browser's rendering engine to load and display web pages. This allows Selenium to interact with web pages in the same way that a user would, which can be important when trying to scrape information from sites that use **JavaScript** or other client-side technologies to load and display content.

In [73]:
import time  # needed for the time.sleep calls below

from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Start a new browser instance - for me this is the minimal set to start the browser
# You need to change to your location
option = webdriver.FirefoxOptions()
option.binary_location = 'C:\\Program Files\\Mozilla Firefox\\firefox.exe'
driverService = Service(r'C:\Users\santi\Documents\M_DSDM\Term_2\Text_Mining\HW3\geckodriver.exe')
browser = webdriver.Firefox(service=driverService, options=option)

# Navigate to a webpage
browser.get("https://elpais.com/")

#getting past cookies
time.sleep(5)
acc_cookie=browser.find_element('xpath','/html/body/div[1]/div/div/div/div/div[2]/button[2]/span')
acc_cookie.click()
time.sleep(2)

# Find an element on the page using its XPath
element = browser.find_element('xpath','//*[@id="fusion-app"]/main/div[1]/section/div[1]/article[1]/header/h2/a')

# Interact with the element by clicking it
element.click()

### Is it ok to scrape?

It should be noted that some websites may not be designed for scraping, so it is important to check the website's terms of service or robots.txt file to see if web scraping is allowed before attempting to scrape the site. 

My opinion: usually for mild use for non-commercial reasons this seems not to be a huge problem for any webpage. It's more a question of whether you can get around the many barriers put in place. (This is not legal advice!)
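One way to check a robots.txt file programmatically is the standard library's `urllib.robotparser`. The sketch below parses a made-up robots.txt (the rules and the example.com URLs are hypothetical) so that it runs offline. For a real site you would point the parser at the site's own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written here so the example runs offline.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(useragent, url) answers: may this user agent request this URL?
allowed_public = parser.can_fetch("*", "https://example.com/content/resolutions")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")

# crawl_delay returns the requested pause between requests, if the file states one
delay = parser.crawl_delay("*")

print(allowed_public, allowed_private, delay)
```

A stated crawl delay is a natural value to pass to `time.sleep` between requests.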

In [72]:

import requests
from bs4 import BeautifulSoup
import os
from selenium import webdriver
import time
import scrape_functions as kzd



root="https://www.un.org/securitycouncil/content/" # must be defined before the loop uses it

for year in range(2021,2024): #last year of the range should be +1 of the year you want
    resol_year="resolutions-adopted-security-council-"+str(year)
    URL = root+resol_year # after the loop, URL points at the last year (2023)
    #now we make a folder "resolutions" in your scraping folder
    path=r'C:\Users\santi\Documents\M_DSDM\Term_2\Text_Mining\HW3'
    
    # Here I make a new folder to store the resolution pdfs
    new_dir=f'{path}/resolutions'
    if not os.path.exists(new_dir):
        os.makedirs(new_dir)

destination=r'C:\Users\santi\Documents\M_DSDM\Term_2\Text_Mining\HW3\resolutions'


root="https://www.un.org/securitycouncil/content/"
first_year="resolutions-adopted-security-council-2023"
URL = root+first_year

#now we make a folder "resolutions" in your scraping folder
path="C:\\Dropbox\\teaching\\Text Mining DSDM 2023\\Session7_scraping\\"

# Here I make a new folder to store the resolution pdfs
new_dir=f'{path}/resolutions'
if not os.path.exists(new_dir):
    os.makedirs(new_dir)

destination="C:\\Dropbox\\teaching\\Text Mining DSDM 2023\\Session7_scraping\\resolutions\\"


The following header is used to avoid being blocked. By default, the request headers reveal that the call comes from Python. The ones here mimic a regular browser profile that is, in principle, more robust to blockers. When I tried the UN webpage without this I got blocked, and I found the fix here: https://stackoverflow.com/questions/38489386/python-requests-403-forbidden.
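To see what "calling Python" means in practice, the short sketch below prints the User-Agent that requests sends when none is given, and then sets a browser-like one on a session so that it is reused for every call. The shortened User-Agent string is only an illustrative assumption; no request is actually sent here.

```python
import requests

# What requests announces itself as when no User-Agent header is given
default_agent = requests.utils.default_user_agent()
print(default_agent)  # begins with "python-requests/"

# A session lets us set the browser-like header once and reuse it for every call
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
print(session.headers["User-Agent"])

# Afterwards, session.get(URL) would send the header above instead of the default
```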

In [68]:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
response = requests.get(URL, headers=headers)
print(response.text)
# This is the html code that generates the webpage. What shall we do from here?

<!DOCTYPE html>
<html lang="en" dir="ltr">
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <meta http-equiv="X-UA-Compatible" content="IE=edge" />
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="https://www.un.org/securitycouncil/profiles/panopoly/themes/unite/favicon.ico" type="image/vnd.microsoft.icon" />
<meta name="description" content="S/RES/2673 (2023) 11 January 2023 Identical letters dated 19 January 2016 from the Permanent Representative of Colombia to the United Nations addressed to the Secretary-General and the President of the Security Council (S/2016/53) S/RES/2672 (2023) 9 January 2023 The situation in the Middle East" />
<meta name="generator" content="Drupal 7 (https://www.drupal.org)" />
<link rel="canonical" href="https://www.un.org/securitycouncil/content/resolutions-adopted-security-council-2023" />
<link rel="shortlink" href="https://www.un.org/secur

In [40]:
# 1 transform to beautifulsoup object
soup = BeautifulSoup(response.text, 'html.parser')
# This is now a beautiful soup object


# Just print the ones we are interested in :
links=[]
for url in soup.find_all('a',href=True):
    mylink=url['href']
    if mylink.startswith('http://undocs.org/en/S/RES'):
    # if mylink.startswith('2300965'):
        print(mylink)
        links.append(mylink)
        # we only need the last part ( this is for later)
        suffix=mylink.split('/')[-1]

print("We took from this", suffix)    
# clicking on this I even get an error message.

http://undocs.org/en/S/RES/2673(2023)
http://undocs.org/en/S/RES/2672(2023)
We took from this 2672(2023)
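The suffix of each link carries the resolution number and the year, which is handy for naming files later. The following is a minimal sketch of how one might split it with a regular expression. The function names and the `S_RES_<number>_<year>.pdf` naming scheme are hypothetical choices, not part of the course code.

```python
import re


def parse_resolution_link(link):
    """Return (number, year) from a link ending in e.g. '2673(2023)', or None."""
    suffix = link.rstrip("/").split("/")[-1]
    match = re.fullmatch(r"(\d+)\((\d{4})\)", suffix)
    if match is None:
        return None
    return match.group(1), match.group(2)


def pdf_filename(link):
    """Hypothetical naming scheme: S_RES_<number>_<year>.pdf"""
    parsed = parse_resolution_link(link)
    if parsed is None:
        raise ValueError(f"Not a resolution link: {link}")
    number, year = parsed
    return f"S_RES_{number}_{year}.pdf"


print(parse_resolution_link("http://undocs.org/en/S/RES/2673(2023)"))
print(pdf_filename("http://undocs.org/en/S/RES/2672(2023)"))
```

Returning `None` for links that do not match makes it easy to skip other documents that appear in the same table, such as `https://undocs.org/en/S/2016/53`.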


In [74]:
root="https://www.un.org/securitycouncil/content/"

for year in range(2022,2024): #last year of the range should be +1 of the year you want
    resol_year="resolutions-adopted-security-council-"+str(year)
    URL = root+resol_year
    
    #now we make a folder "resolutions" in your scraping folder
    path=r'C:\Users\santi\Documents\M_DSDM\Term_2\Text_Mining\HW3'
    
    # Here I make a new folder to store the resolution pdfs
    new_dir=f'{path}/resolutions'
    if not os.path.exists(new_dir):
        os.makedirs(new_dir)

    destination=r'C:\Users\santi\Documents\M_DSDM\Term_2\Text_Mining\HW3\resolutions'
    
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
    response = requests.get(URL, headers=headers)
    
    # 1 transform to beautifulsoup object
    soup = BeautifulSoup(response.text, 'html.parser')
    #This is now a beautiful soup object
    #Just print the ones we are interested in :
    #Note: links is reset for every year, so after the loop it only holds the last year's links
    links=[]
    
    for url in soup.find_all('a',href=True):
        mylink=url['href']
        if mylink.startswith('http://undocs.org/en/S/RES'):
            # if mylink.startswith('2300965'):
            print(mylink)
            links.append(mylink)
            # we only need the last part ( this is for later)
            suffix=mylink.split('/')[-1]
            print("This is resolution:", suffix)    


http://undocs.org/en/S/RES/2671(2022)
This is resolution: 2671(2022)
http://undocs.org/en/S/RES/2670(2022)
This is resolution: 2670(2022)
http://undocs.org/en/S/RES/2669(2022)
This is resolution: 2669(2022)
http://undocs.org/en/S/RES/2668(2022)
This is resolution: 2668(2022)
http://undocs.org/en/S/RES/2667(2022)
This is resolution: 2667(2022)
http://undocs.org/en/S/RES/2666(2022)
This is resolution: 2666(2022)
http://undocs.org/en/S/RES/2665(2022)
This is resolution: 2665(2022)
http://undocs.org/en/S/RES/2664(2022)
This is resolution: 2664(2022)
http://undocs.org/en/S/RES/2663(2022)
This is resolution: 2663(2022)
http://undocs.org/en/S/RES/2662(2022)
This is resolution: 2662(2022)
http://undocs.org/en/S/RES/2661(2022)
This is resolution: 2661(2022)
http://undocs.org/en/S/RES/2660(2022)
This is resolution: 2660(2022)
http://undocs.org/en/S/RES/2659(2022)
This is resolution: 2659(2022)
http://undocs.org/en/S/RES/2658(2022)
This is resolution: 2658(2022)
http://undocs.org/en/S/RES/2657(20

In [70]:
# Problem
"""
IMPORTANT! How would we proceed from here? We have a link to a pdf. 
Requests should be enough to get the pdf and store it properly in our folder.
Let's try to do it with the command get:
"""

pdf=requests.get(mylink,headers=headers)

print(pdf.content)

"""
This is not a pdf! We get back an HTML page (the UN "Terms of Use" page) instead.
The UN webpage is protected against requests from scraping modules.
Are there ways to avoid this?

- Changing ips
- Proxy server
- Sending cookies to the request
- Random user profiles


I tried several but they don't appear to work. Is there hope?

Yes! Selenium drives a real browser and can escape a lot of these problems.


"""



b'\r\n<!DOCTYPE html>\r\n<html dir="ltr" lang="en"><head profile="http://www.w3.org/1999/xhtml/vocab">\r\n\t\t<meta charset="utf-8"/>\r\n\t\t<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/>\r\n\t\t<meta name="viewport" content="width=device-width, initial-scale=1.0"/>\r\n\r\n\t\t<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->\r\n\t\t<meta name="description" content=""/>\r\n\t\t<meta name="author" content="United Nations"/>\r\n\r\n\t\t<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<meta name="Generator" content="Drupal 7 (http://drupal.org)" />\n<link rel="canonical" href="/en/about-us/terms-of-use" />\n<link rel="shortlink" href="/en/node/133494" />\n<link rel="shortcut icon" href="https://www.un.org/sites/un2.un.org/themes/bootstrap_un2/favicon.ico" type="image/vnd.microsoft.icon" />\n\r\n\t\t<title>Terms of Use | United Nations</title>\t\r\n\r\n                <meta name="DC.Title
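Rather than inspecting the raw bytes by eye, one can test whether a response really is a PDF before saving it: PDF files begin with the marker `%PDF-`. The helper below is a small sketch of that check. The function name is hypothetical and the two byte strings are made-up stand-ins for real responses.

```python
def looks_like_pdf(content: bytes) -> bool:
    """Return True if the bytes start with the PDF marker '%PDF-'."""
    # lstrip() tolerates stray whitespace before the marker
    return content.lstrip()[:5] == b"%PDF-"


# Made-up stand-ins for pdf.content
real_pdf = b"%PDF-1.7\n...binary data..."
html_page = b"\r\n<!DOCTYPE html>\r\n<html><head><title>Terms of Use</title></head></html>"

print(looks_like_pdf(real_pdf))
print(looks_like_pdf(html_page))
```

In the cell above, `looks_like_pdf(pdf.content)` would flag the problem without printing the whole page.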



In [75]:
links

['http://undocs.org/en/S/RES/2673(2023)',
 'http://undocs.org/en/S/RES/2672(2023)']

The following code is calling a function in the python file that is using Selenium. The script takes in three arguments: "dfolder", "link", and "download".

The first thing it does is create a Firefox profile and set the default download directory to the folder specified by "dfolder". Then it creates a FirefoxOptions object, which is used to configure the browser instance, and sets the binary location of Firefox.exe to a specific file path. It also creates a Service object, which is used to start and stop the GeckoDriver server.

If the "download" argument is set to True, it sets a number of preferences related to downloading files, such as the download folder, and the types of files that should be automatically saved to disk without prompting the user. Finally, it creates a Firefox webdriver instance, passing in the service, profile, and options objects.
Since I set my download path within the download preferences of Firefox, it will automatically download files there.

Key part: I force Firefox to download pdfs automatically. This is done within the option download. The browser starts up with all of these preferences already applied.
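To make the download preferences more concrete, here is a hedged sketch of what such a set of Firefox preferences might look like. It is not the content of scrape_functions.py: the function names are hypothetical, and the exact preferences used there may differ. The preference keys themselves are standard Firefox settings, and `set_preference` is the method Selenium's `FirefoxOptions` provides for applying them.

```python
def build_download_prefs(dfolder):
    """Firefox preferences that save PDFs to dfolder without asking."""
    return {
        "browser.download.folderList": 2,  # 2 = use the custom folder given below
        "browser.download.dir": dfolder,
        "browser.helperApps.neverAsk.saveToDisk": "application/pdf",
        "pdfjs.disabled": True,  # do not open PDFs in the built-in viewer
    }


def apply_prefs(options, prefs):
    """Copy each preference onto an options object that has set_preference()."""
    for name, value in prefs.items():
        options.set_preference(name, value)
    return options


prefs = build_download_prefs(r"C:\path\to\resolutions")  # placeholder folder
print(prefs)

# With Selenium installed, the preferences could be applied like this:
# from selenium import webdriver
# options = apply_prefs(webdriver.FirefoxOptions(), prefs)
```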




In [71]:
#calling the program - notice option download

profile, browser, download_path = kzd.start_up(destination, links[0], download=False)


Now suppose you wanted to obtain the resolutions in a loop. How would you do it?

One way is to call browser.quit() at the end of each pdf download. This is very costly, as you need to start a new session each time.

Hence, once the browser is open and you have the list of links, just visit them within the same session.
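Reusing one session could be sketched as follows. The function name `visit_all` and the pause length are assumptions, and whether `browser.get` on a direct pdf link behaves smoothly depends on the download preferences and the site, so treat this as a starting point rather than a finished solution.

```python
import time


def visit_all(browser, links, pause=2.0):
    """Open every link in the same browser session, pausing between visits."""
    visited = []
    for link in links:
        browser.get(link)   # with the download preferences set, Firefox should save the pdf
        time.sleep(pause)   # give the download time to start; tune as needed
        visited.append(link)
    return visited


# Usage, assuming `browser` and `links` exist as in the cells above:
# visit_all(browser, links)
# browser.quit()   # close the session once, at the very end
```

The function only needs an object with a `get` method, so it can be tried out with a dummy object before opening a real browser.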


## Metadata: For class discussion after scraping resolutions

This code uses the BeautifulSoup library to parse an HTML document and extract information from it.

First, it creates a BeautifulSoup object "soup" by passing in the text of an HTTP response (likely obtained via a request library such as requests or httplib2) and specifying that the text should be parsed as HTML using the 'html.parser' parser.

Then, it uses the find() method to locate the first <table> element in the HTML, and assigns it to the variable table. It then uses the find() method again to locate the first <tbody> element within the table element and assigns it to the variable table_body.

Next, it uses the find_all() method to find all <tr> elements within the table_body variable. It then iterates through each <tr> element (stored in the variable rows), and for each row it uses the find_all() method again to find all <td> elements within that row.

For each row, it prints the row itself and then the list of <td> elements it found. This shows the contents of each cell in the table.

In [17]:
#seeing it as a table with links and text
soup = BeautifulSoup(response.text, 'html.parser')

table = soup.find('table')
table_body = table.find('tbody')

#the row command identifies them through <tr>
rows = table_body.find_all('tr')
for row in rows:
    print(row)
    cols = row.find_all('td')
    print(cols)

<tr>
<td><a href="http://undocs.org/en/S/RES/2673(2023)">S/RES/2673 (2023)</a></td>
<td>11 January 2023</td>
<td>Identical letters dated 19 January 2016 from the Permanent Representative of Colombia to the United Nations addressed to the Secretary-General and the President of the Security Council (<a href="https://undocs.org/en/S/2016/53">S/2016/53</a>)</td>
</tr>
[<td><a href="http://undocs.org/en/S/RES/2673(2023)">S/RES/2673 (2023)</a></td>, <td>11 January 2023</td>, <td>Identical letters dated 19 January 2016 from the Permanent Representative of Colombia to the United Nations addressed to the Secretary-General and the President of the Security Council (<a href="https://undocs.org/en/S/2016/53">S/2016/53</a>)</td>]
<tr>
<td><a href="http://undocs.org/en/S/RES/2672(2023)">S/RES/2672 (2023)</a></td>
<td>9 January 2023</td>
<td>The situation in the Middle East</td>
</tr>
[<td><a href="http://undocs.org/en/S/RES/2672(2023)">S/RES/2672 (2023)</a></td>, <td>9 January 2023</td>, <td>The sit

In [18]:
#seeing it as a table with links and text
#let's try regular expressions to extract meta-data
#let's also be a bit economic about iterations
import re
soup = BeautifulSoup(response.text, 'html.parser')

table = soup.find('table')
table_body = table.find('tbody')

rows = table_body.find_all('tr')
metadata=[]
for row in rows:
    cols = row.find_all('td')
    for ele in cols:
        text=ele.text.strip()
        resolution = r'S/RES/(\d+).*\((.*)\)'
        if re.findall(resolution, text):
            info=re.findall(resolution, text)[0]
        else:
            #adds Title to previous tuple (might be error prone)
            info += (text,)
            print(info)
    metadata.append(info)

('2673', '2023', '11 January 2023')
('2673', '2023', '11 January 2023', 'Identical letters dated 19 January 2016 from the Permanent Representative of Colombia to the United Nations addressed to the Secretary-General and the President of the Security Council (S/2016/53)')
('2672', '2023', '9 January 2023')
('2672', '2023', '9 January 2023', 'The situation in the Middle East')


In [19]:
metadata

[('2673',
  '2023',
  '11 January 2023',
  'Identical letters dated 19 January 2016 from the Permanent Representative of Colombia to the United Nations addressed to the Secretary-General and the President of the Security Council (S/2016/53)'),
 ('2672', '2023', '9 January 2023', 'The situation in the Middle East')]
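The cell-by-cell approach above relies on `info` carrying over between cells, which the comment already flags as error prone. A possible alternative, sketched below, builds one record per row and skips rows that do not have the expected shape. The function name, the assumption of exactly three cells per row, and the second (malformed) row in the sample HTML are hypothetical.

```python
import re
from bs4 import BeautifulSoup

RESOLUTION = re.compile(r"S/RES/(\d+)\s*\((\d{4})\)")


def rows_to_metadata(html):
    """Return one (number, year, date, subject) tuple per well-formed table row."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("tr"):
        cells = [td.get_text().strip() for td in row.find_all("td")]
        if len(cells) != 3:
            continue  # skip rows without the three expected cells
        match = RESOLUTION.search(cells[0])
        if match is None:
            continue  # first cell is not a resolution symbol
        number, year = match.groups()
        records.append((number, year, cells[1], cells[2]))
    return records


# One row copied from the output above, plus a made-up malformed row
sample = """
<table><tbody>
<tr><td><a href="http://undocs.org/en/S/RES/2672(2023)">S/RES/2672 (2023)</a></td>
<td>9 January 2023</td><td>The situation in the Middle East</td></tr>
<tr><td>Not a resolution row</td></tr>
</tbody></table>
"""

records = rows_to_metadata(sample)
print(records)
```

Because every tuple is built from a single row, a missing cell cannot leak values from the previous resolution.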

In [20]:
import pandas as pd
df = pd.DataFrame(metadata, columns=['Number', 'Year', 'Date', 'Subject'])


In [21]:
df

| | Number | Year | Date | Subject |
|---|---|---|---|---|
| 0 | 2673 | 2023 | 11 January 2023 | Identical letters dated 19 January 2016 from t... |
| 1 | 2672 | 2023 | 9 January 2023 | The situation in the Middle East |
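Once the metadata sits in a dataframe, it can be matched with the scraped text through a shared key such as the resolution number. The sketch below shows one way this might look. The `texts` dataframe and its placeholder text are hypothetical, and the date parsing assumes English month names.

```python
import pandas as pd

metadata = [
    ("2673", "2023", "11 January 2023", "Identical letters dated 19 January 2016 from t..."),
    ("2672", "2023", "9 January 2023", "The situation in the Middle East"),
]
df = pd.DataFrame(metadata, columns=["Number", "Year", "Date", "Subject"])

# Turn the date strings into proper datetime values (assumes English month names)
df["Date"] = pd.to_datetime(df["Date"], format="%d %B %Y")

# Hypothetical table of extracted texts, keyed by resolution number
texts = pd.DataFrame({
    "Number": ["2672"],
    "Text": ["(placeholder for the text of resolution 2672)"],
})

# A left merge keeps every resolution, even those whose text is still missing
merged = df.merge(texts, on="Number", how="left")
print(merged[["Number", "Date", "Text"]])
```

Rows with a missing `Text` value point to resolutions whose pdf still needs to be downloaded or parsed.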


# Exercise Homework 3

This homework will be intense. You can work on the scraping part of this homework during the TA session with Luis' help. The TA session will be there to help you, so make good use of it: prepare a little beforehand by doing step 2) before the TA session and then work on steps 3) and following.

1) Get together in groups as randomized here: https://docs.google.com/spreadsheets/d/1W0qAKkJ1_J6wXDOs0PhSTnWgwA9buZmMeI3YDGIYAYY/edit?usp=sharing

2) Download the material for the homework and get the booking scrape code to run. Prove that you did by attaching a jupyter notebook. (this is only relevant for those who fail later steps)

Design and implement a mini research project in which you research the effect of a big annual event in Barcelona on rental prices on booking by scraping data for at least two separate weeks (note that search results go across different pages) for Barcelona and at least one more city.

3) Identify an event that makes a lot of people come to Barcelona.

4) Think of the time periods to scrape and what second city to scrape. Explain your choices in writing.

5) Write down a fixed effects regression equation that allows you to derive a difference-in-difference estimate of the effect of the event on prices. Think of controls to add, why is this relevant? Explain why you need a second city for this.

6) Scrape using a modified code that implements a loop across pages of the booking page and searches for different dates and cities.

7) Discuss two additional ideas (no need to implement): a) the use of the text that can be scraped from the hotel/apartment description b) how to decompose the treatment effect into the part that comes from the same hotel changing prices and the part that comes from the composition of places changing.