*Stanislav Borysov [stabo@dtu.dk], DTU Management*
# Advanced Business Analytics

## Web Data Mining - Part 3: Web Scraping

Web crawling and scraping are a very flexible way to get content from the Internet. Essentially, a scraper imitates a user who visits different webpages and views their content. The only difference is that Internet companies usually love real users and hate scraping bots, so be prepared to be blocked. **In the worst case, you can get into serious trouble, so please always read the Terms & Conditions and follow the company's policy on automatic data collection (or ask them directly if you are not sure)!**
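Many websites also publish part of their crawling policy in a machine-readable `robots.txt` file. The sketch below shows how Python's standard `urllib.robotparser` could be used to check such rules. The rules, the bot name `my-course-bot` and the `example.org` URLs are made up for illustration and are not DTU's actual policy; checking `robots.txt` does not replace reading the Terms & Conditions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only (not a real site's file).
example_rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines()

parser = RobotFileParser()
parser.parse(example_rules)  # parse the rules from memory, no network access

# "my-course-bot" is an assumed user-agent name for our scraper.
allowed_news = parser.can_fetch("my-course-bot", "https://example.org/english/news")
allowed_private = parser.can_fetch("my-course-bot", "https://example.org/private/data")
delay = parser.crawl_delay("my-course-bot")

print(allowed_news, allowed_private, delay)
```

For a real website, `set_url(...)` followed by `read()` would load the actual file instead of the in-memory rules used here.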

In this exercise, we will implement our own web scraper. However, if you want to do it properly, use one of the many frameworks available, for example Scrapy (https://scrapy.org/).

Let's try to get some data from the DTU website. In particular, we want to get the news headlines together with their dates and short descriptions from https://www.dtu.dk/english/news. Hopefully, we will not get banned or sued.

### 1. Navigating the website

By default, the webpage shows the 10 most recent news items. The first task is to find out the URL structure so that we can navigate through the list of news automatically, e.g. how to get the next 10 news items, how to show 100 news items instead of 10, etc. Try to figure out which parameters in the URL are responsible for what.

`fr=FIRST_ITEM`

`mr=NUMBER_OF_NEWS_TO_SHOW`

For example, https://www.dtu.dk/english/news?&fr=11&mr=10
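As a small sketch of how these two parameters work together, the example URL above can be taken apart with the standard `urllib.parse` module and used to build the URL of the following page. The variable names are only illustrative, and nothing is downloaded here.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

example_url = "https://www.dtu.dk/english/news?&fr=11&mr=10"

# parse_qs returns a dict that maps each parameter name to a list of string values.
query = parse_qs(urlsplit(example_url).query)
first_item = int(query["fr"][0])      # index of the first news item shown
items_per_page = int(query["mr"][0])  # number of news items shown

# The next page starts right after the last item of the current page.
next_query = urlencode({"fr": first_item + items_per_page, "mr": items_per_page})
next_url = "https://www.dtu.dk/english/news?" + next_query

print(next_url)
```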

In [1]:
url_base = "https://www.dtu.dk/english/news"

Generate and print a list of URLs to get webpages which contain the first 200 news items. 

***WARNING! Just generate the list and do not access the content!***

In [2]:
import math

urls = []

n_items = 200
start = 1
items_to_show = 100 # The webpage cannot show more than 100 items at once
for i in range(math.ceil(n_items / items_to_show)):
    parameters = [
        "fr={}".format(start), 
        "mr={}".format(items_to_show)
    ]
    url = url_base + "?" + "&".join(parameters)
    urls.append(url)
    start += items_to_show

print(urls)

['https://www.dtu.dk/english/news?fr=1&mr=100', 'https://www.dtu.dk/english/news?fr=101&mr=100']


### 2. Scraping the webpage content

***We will not do it.*** I have already scraped a webpage with news items and saved it using the code below.

In [None]:
"""
import urllib.request
url = "https://www.dtu.dk/english/news?fr=101&mr=100"
contents = urllib.request.urlopen(url).read()
with open("dtu_news.html", "wb") as f:
    f.write(contents)
"""

Note that all styling and images are missing, since we saved only the HTML content.
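If several pages had to be downloaded, it would be polite (and would lower the risk of being blocked) to pause between requests. The sketch below is one possible way to structure that; `fetch_all`, `fake_fetch` and the one-second default delay are assumptions for illustration. A stand-in fetcher is used so that the example runs without accessing any website.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        pages[url] = fetch(url)
    return pages

# Stand-in fetcher (hypothetical): returns fake bytes instead of downloading anything.
def fake_fetch(url):
    return "<html>{}</html>".format(url).encode("utf-8")

demo_pages = fetch_all(["page-1", "page-2"], fake_fetch, delay_seconds=0.01)
print(len(demo_pages))
```

In a real run, `fetch` could be a small function wrapping `urllib.request.urlopen(url).read()`, as in the commented-out cell above.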

### 3. Extracting the data

Load the data from the file.

In [3]:
with open("dtu_news.html", "r", encoding="utf-8") as f:  # assumes the saved page is UTF-8 encoded
    contents = f.read()

The task is to extract the news items from the contents using either regular expressions or an HTML parser (e.g., PyQuery, BeautifulSoup) and save them to a JSON file.

*HINT: To better understand the structure of the webpage, try:*
- Chrome: right click -> inspect
- Safari, Firefox, Edge: right click -> inspect element
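To illustrate the regular-expression route, here is a minimal sketch that runs on a small hand-written HTML snippet. The snippet, its `example.org` links and the exact tag layout are assumptions made for the example; the real page is likely to be messier, which is the main reason an HTML parser is usually the more robust choice.

```python
import re

# Hand-written sample markup (an assumption, not the real DTU page).
sample_html = """
<div class="newsItem">
  <span class="date">16 JUL</span>
  <h2><a href="https://example.org/news/1">First headline</a></h2>
  <p>Short description of the first item.</p>
</div>
<div class="newsItem">
  <span class="date">15 JUL</span>
  <h2><a href="https://example.org/news/2">Second headline</a></h2>
  <p>Short description of the second item.</p>
</div>
"""

# Named groups capture the four fields; re.DOTALL lets ".*?" span line breaks.
item_pattern = re.compile(
    r'<span class="date">(?P<date>.*?)</span>.*?'
    r'<h2><a href="(?P<url>.*?)">(?P<title>.*?)</a></h2>.*?'
    r'<p>(?P<desc>.*?)</p>',
    re.DOTALL,
)

sample_news = [match.groupdict() for match in item_pattern.finditer(sample_html)]
print(sample_news)
```

The pattern breaks as soon as attributes are reordered or extra tags appear inside an item, so it should be checked against the actual markup before use.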

In [None]:
"""
# the output should be the following
news = [
    ...,
    {
        'url': 'https://www.dtu.dk/english/news/Nyhed?id={B4C5541F-8E2F-4118-9DA5-07A2BA6E2407}',
        'title': 'New study underlines sea level rise in the Arctic ocean',
        'desc': 'Sea levels in the Arctic oceans have risen an average of 2.2 millimeters per year over the last 22 years.This is the conclusion reached by a Danish-German research team...',
        'date': '16 JUL'
    },
    ...
]
"""

In [4]:
from pyquery import PyQuery

In [5]:
pq = PyQuery(contents)
newsItems = pq('div.newsItem')

In [6]:
news = []
for item in newsItems.items():  # iterate over each news item as a PyQuery object
    link = item.find('h2>a')  # the headline link holds both the URL and the title
    new_item = {
        'url': link.attr('href'),
        'title': link.text(),
        'desc': item.find('p').text(),
        'date': item.find('span.date').text()
    }
    print(new_item)
    news.append(new_item)

{'desc': 'A new DTU energy model has been used to reality-check the climate policy goals announced by the Danish government and six political parties.', 'date': '09 APR', 'title': 'Realistic climate initiatives from political parties? DTU crunches the numbers', 'url': 'https://www.dtu.dk/english/news/Nyhed?id={99BB89DB-206B-4F1F-A9FF-2AB72F060253}'}
{'desc': 'DTU spin-out Tetramer Shop has gone to market with technology that can colour and thereby track immune cells and, thus, measure whether a treatment is working correctly...', 'date': '08 APR', 'title': 'DTU spin-out sells technology to monitor cancer treatments', 'url': 'https://www.dtu.dk/english/news/Nyhed?id={BBB6C6F0-7314-4DCD-8935-BBCDE795455A}'}
{'desc': 'DTU Wind Energy is part of the new project, PivotBuoy. The aim of the project is to\xa0reduce the cost of floating offshore wind.', 'date': '08 APR', 'title': 'PivotBuoy project receives €4m to unlock cost competitive floating wind', 'url': 'https://www.dtu.dk/english/news/N

In [7]:
import json

with open('dtu_news_parsed.json', 'w', encoding='utf-8') as f:
    json.dump(news, f, ensure_ascii=False, indent=4)
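As a quick sanity check of this saving step, the sketch below writes a one-item sample to a temporary file with the same `json.dump` settings and reads it back. The sample item and the file name `sample_news.json` are made up for illustration (the title is borrowed from the output above). With `ensure_ascii=False`, characters such as `€` are stored as-is rather than as `\u20ac` escapes, which is why the file is opened with `encoding='utf-8'`.

```python
import json
import os
import tempfile

# Illustrative sample item; the real list comes from the parsing step.
sample = [{"title": "PivotBuoy project receives €4m to unlock cost competitive floating wind",
           "date": "08 APR"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_news.json")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False, indent=4)
    with open(path, "r", encoding="utf-8") as f:
        raw_text = f.read()

loaded = json.loads(raw_text)
print("€" in raw_text, loaded == sample)
```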