*Francisco Pereira [camara@dtu.dk], DTU Management*

*Stanislav Borysov [stabo@dtu.dk], DTU Management*
# Advanced Business Analytics

## Lecture 1 - Web Data Mining - Part 3: Web Scraping

Web crawling and scraping are a very flexible way to get content from the Internet. Essentially, a scraper imitates a user who visits different webpages and views their content. The only difference is that Internet companies usually love real users and hate scraping bots, so be prepared to be blocked. **In the worst case, you can get into serious trouble, so please always read the Terms & Conditions and follow the company's policy on automatic data collection (or ask them directly if you are not sure)!**

In this exercise, we will implement our own web scraper. However, if you want to do it properly, use one of the many frameworks available, for example Scrapy (https://scrapy.org/).

Let's try to get some data from the DTU website. In particular, we want the news headlines together with the date and a short description from https://www.dtu.dk/english/news. Hopefully, we will not get banned or sued.

### 1. Navigating the website

By default, the webpage shows the 20 most recent news items. However, you can navigate through the list of news with URL parameters, e.g. to get news items 25 to 45 or, even more interesting, to ask for a specific time window:

`fr=FIRST_ITEM` - from the Nth item, where 0 is the most recent one

`fd=DD-MM-YYYY` - from date

`td=DD-MM-YYYY` - to date

For example, https://www.dtu.dk/english/news?fd=05-01-2021
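
As a minimal sketch, the parameters above can be combined into a single query string with the standard library's `urllib.parse.urlencode`. The helper name `build_news_url` and the example dates are assumptions made for illustration; the code only builds the URL and does not access the website.

```python
from urllib.parse import urlencode

url_base = "https://www.dtu.dk/english/news"


def build_news_url(first_item=None, from_date=None, to_date=None):
    """Build a news URL from the optional fr, fd and td parameters."""
    parameters = {}
    if first_item is not None:
        parameters["fr"] = first_item
    if from_date is not None:
        parameters["fd"] = from_date  # expected format: DD-MM-YYYY
    if to_date is not None:
        parameters["td"] = to_date  # expected format: DD-MM-YYYY
    if not parameters:
        return url_base
    return url_base + "?" + urlencode(parameters)


# Hypothetical time window: news published in January 2021
example_url = build_news_url(from_date="05-01-2021", to_date="31-01-2021")
print(example_url)
```

Using a dictionary keeps the parameter names next to their values, and `urlencode` takes care of joining them with `&` and escaping special characters.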

***Tip: besides "fr", "fd" and "td", you can discover other parameters yourself by using the website manually... ;-) For example, can you figure out how to search for a specific string in the news feed?***

In [53]:
url_base = "https://www.dtu.dk/english/news"

Generate and print a list of URLs to get webpages which contain the last 200 news items. 

***WARNING! Just generate the list and do not access the content!***

In [54]:
import math

urls = []
n_items = 200
items_per_page = 20
start = 0  # fr=0 is the most recent item

for i in range(math.ceil(n_items / items_per_page)):
    parameters = [
        "fr={}".format(start)
    ]
    url = url_base + "?" + "&".join(parameters)
    urls.append(url)
    start += items_per_page

print(urls)

['https://www.dtu.dk/english/news?fr=0', 'https://www.dtu.dk/english/news?fr=20', 'https://www.dtu.dk/english/news?fr=40', 'https://www.dtu.dk/english/news?fr=60', 'https://www.dtu.dk/english/news?fr=80', 'https://www.dtu.dk/english/news?fr=100', 'https://www.dtu.dk/english/news?fr=120', 'https://www.dtu.dk/english/news?fr=140', 'https://www.dtu.dk/english/news?fr=160', 'https://www.dtu.dk/english/news?fr=180']


### 2. Scraping the webpage content

***We will not do it.*** I have already scraped the webpage with the first 20 news items and saved it using the code below.

In [55]:
'''
import urllib.request
url = "https://www.dtu.dk/english/news?fr=0"

contents = urllib.request.urlopen(url).read()
with open("dtu_news.html", "wb") as f:
    f.write(contents)
''';

Note that all styling and images are missing, since we saved only the HTML content.
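
If you do have permission to collect pages automatically, a polite scraper typically identifies itself and pauses between requests. The following is a minimal sketch under those assumptions, not the code used to create `dtu_news.html`: the function names, the `User-Agent` string and the one-second delay are hypothetical choices. Nothing is downloaded unless you call the functions yourself.

```python
import time
import urllib.request


def fetch_page(url, user_agent="aba-course-scraper-example"):
    """Download one page and return its raw bytes (hypothetical helper)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()


def fetch_all(urls, delay_seconds=1.0, fetch=fetch_page):
    """Fetch the URLs one by one, sleeping between consecutive requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # assumed delay; check the site's policy
        pages[url] = fetch(url)
    return pages
```

The `fetch` argument lets you pass a stand-in function, so the loop can be tried out without touching the network.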

### 3. Extracting the data

Load the data from the file

In [56]:
with open("dtu_news.html", "r") as f:
    contents = f.read()

The task is to extract the news items from the contents using either regular expressions or an HTML parser (e.g., PyQuery, BeautifulSoup) and save them to a JSON file.

*HINT: To better understand the structure of the webpage, try:*
- Chrome: right click -> show source
- Safari, Firefox, Edge: right click -> inspect element
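
To illustrate the regular-expression route, here is a minimal sketch that runs on a small hand-written HTML snippet. The snippet is hypothetical: it only mimics the `div.newsItem`, `h2>a`, `p` and `span.date` elements used in this exercise, and the real page may order or nest them differently, so the pattern would need adjusting. Regular expressions are also brittle against markup changes, which is why an HTML parser is usually the safer choice.

```python
import re

# Hypothetical snippet imitating two news items (not the real page source)
sample_html = """
<div class="newsItem">
  <span class="date">16 JUL</span>
  <h2><a href="/english/news/Nyhed?id=123">First headline</a></h2>
  <p>First short description.</p>
</div>
<div class="newsItem">
  <span class="date">15 JUL</span>
  <h2><a href="/english/news/Nyhed?id=456">Second headline</a></h2>
  <p>Second short description.</p>
</div>
"""

# One named group per field; re.DOTALL lets ".*?" span line breaks
item_re = re.compile(
    r'<div class="newsItem">.*?'
    r'<span class="date">(?P<date>.*?)</span>.*?'
    r'<h2><a href="(?P<url>[^"]*)">(?P<title>.*?)</a></h2>.*?'
    r'<p>(?P<desc>.*?)</p>',
    re.DOTALL,
)

# Each match becomes a dictionary with the keys date, url, title and desc
items = [match.groupdict() for match in item_re.finditer(sample_html)]
for item in items:
    print(item)
```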

In [63]:
"""
# Our goal is to end up with an output that should look like the following
news = [
    ...,
    {
        'url': 'https://www.dtu.dk/english/news/Nyhed?id={B4C5541F-8E2F-4118-9DA5-07A2BA6E2407}',
        'title': 'New study underlines sea level rise in the Arctic ocean',
        'desc': 'Sea levels in the Arctic oceans have risen an average of 2.2 millimeters per year over the last 22 years.This is the conclusion reached by a Danish-German research team...',
        'date': '16 JUL'
    },
    ...
]
""";


Hint: the PyQuery package can help you directly extract content from HTML

In [64]:
from pyquery import PyQuery

In [65]:
pq = PyQuery(contents)

In [66]:
newsItems = pq('div.newsItem')
newsItems

[<div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>, <div.newsItem>]

In [67]:
newsItems.eq(2).text()

''

In [62]:
news = []

# Loop through the elements in newsItems, scraping the info needed and adding it to the list news
for i in range(len(newsItems)):
    item = newsItems.eq(i)
    # url
    item_url = item.find('h2>a').attr('href')
    # title
    item_title = item.find('h2>a').text()
    # desc
    item_desc = item.find('p').text()
    # date
    item_date = item.find('span.date').text()
    # 
    new_item = {
        'url': item_url,
        'title': item_title,
        'desc': item_desc,
        'date': item_date
    }
    print(new_item)
    news.append(new_item)

{'url': None, 'title': '', 'desc': '', 'date': ''}
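
The task also asks for the extracted items to be saved to a JSON file. A minimal sketch with the standard `json` module follows. The file name `dtu_news.json` is an assumption, and the one-item `news` list is a stand-in based on the target output shown earlier (with a shortened description); in the notebook you would pass the `news` list built by the loop instead.

```python
import json

# Stand-in for the list built by the extraction loop
news = [
    {
        "url": "https://www.dtu.dk/english/news/Nyhed?id={B4C5541F-8E2F-4118-9DA5-07A2BA6E2407}",
        "title": "New study underlines sea level rise in the Arctic ocean",
        "desc": "Sea levels in the Arctic oceans have risen an average of 2.2 millimeters per year over the last 22 years.",
        "date": "16 JUL",
    }
]

# Write the list to disk; ensure_ascii=False keeps non-ASCII characters readable
with open("dtu_news.json", "w", encoding="utf-8") as f:
    json.dump(news, f, ensure_ascii=False, indent=2)

# Read the file back to check that nothing was lost
with open("dtu_news.json", "r", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == news)
```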
