*****************************************************************
#  The Social Web: data representation
- Instructors: Davide Ceolin, Filip Ilievski and Zubaria Inayat.
- TAs: Sandro Barres-Hamers, Alexander Schmatz, Márton Bodó and Danae Mitsea.
- Exercises for Hands-on session 2
*****************************************************************

In this session you are going to mine data in various microformats. You will see the differences in what each of the formats can contain and what purpose they serve. We will start by looking at geographical data.

**Prerequisites:**
- Python 3.8
- Python packages: `requests`, `BeautifulSoup4`, `HTMLParser`, `rdflib`


In [None]:
# If you're using a virtualenv, make sure it's activated before running
# this cell!
!pip install requests
!pip install BeautifulSoup4
!pip install HTMLParser
!pip install rdflib
!pip install cloudscraper

##  Exercise 1

Even if web pages do not use microformats, interesting data can often be extracted from the HTML. You may use packages such as [BeautifulSoup][b] to extract arbitrary pieces of data from any HTML page.  
The example below shows how we can find the URL of the first image in the infobox table of the [Wikipedia page on Amsterdam][a]. 

**Tip:** compare the code below with the HTML source code of the Wikipedia page: the image URL is in the `"src"` attribute of the `"img"` element inside the `"table"` element with `class="infobox"`.

[b]: https://beautiful-soup-4.readthedocs.io/en/latest/
[a]: https://en.wikipedia.org/wiki/Amsterdam

In [1]:
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

# The URL of a page with geotags is hard-coded here; change it to try another page.
URL = 'https://en.wikipedia.org/wiki/Amsterdam'

req = requests.get(URL, headers={'User-Agent' : "Social Web Course Student"})
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(req.text, 'html.parser')
#print(req.text)
# Take the first <img> inside the first table with class "infobox"
image1 = soup.find_all('table', class_='infobox')[0].find('img')
print(image1['src'])
# The src is protocol-relative (it starts with '//'), so prepend the scheme
print('https:' + image1['src'])


//upload.wikimedia.org/wikipedia/commons/thumb/5/57/Imagen_de_los_canales_conc%C3%A9ntricos_en_%C3%81msterdam.png/268px-Imagen_de_los_canales_conc%C3%A9ntricos_en_%C3%81msterdam.png
https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Imagen_de_los_canales_conc%C3%A9ntricos_en_%C3%81msterdam.png/268px-Imagen_de_los_canales_conc%C3%A9ntricos_en_%C3%81msterdam.png
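
The `src` value printed above is protocol-relative (it starts with `//`), which is why the cell prepends `'https:'` by hand. As an alternative sketch, the standard library's `urllib.parse.urljoin` can resolve such a value against the page URL. The helper name `resolve_src` and the example paths are assumptions introduced here for illustration only.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://en.wikipedia.org/wiki/Amsterdam'

def resolve_src(page_url, src):
    """Resolve a relative or protocol-relative src against the page URL."""
    return urljoin(page_url, src)

# Protocol-relative: only the scheme is taken from the page URL.
print(resolve_src(PAGE_URL, '//upload.wikimedia.org/example.png'))
# Root-relative: scheme and host are both taken from the page URL.
print(resolve_src(PAGE_URL, '/static/images/logo.png'))
```

This also covers pages that use root-relative image paths, where simply prepending `'https:'` would not give a valid URL.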


Let's look at another example:

The code below extracts the coordinates that a webpage publishes in the geo microformat (based on Example 8-1 in Mining the Social Web). **Note** that Wikipedia pages may encode lat/long information in different ways.   
One of the ways used by the Amsterdam Wikipedia page is a `span` element that is **not** shown to the user:
```html
<span class="geo">52.367; 4.900</span>
```
This `span` element has a single child (the text itself): 
```python
len(geoTag) == 1
```
and no further structure. Therefore, we have to get the lat/long manually by splitting the string on the `';'` semicolon.

In [None]:
geoTag = soup.find(True, 'geo')
print(geoTag)

if geoTag and len(geoTag) > 1:
    # Structured variant: latitude and longitude live in their own child elements
    lat = geoTag.find(True, 'latitude').string
    lon = geoTag.find(True, 'longitude').string
    print('Location is at', lat, lon)
elif geoTag and len(geoTag) == 1:
    # Plain-text variant: a single "lat; lon" string
    (lat, lon) = geoTag.string.split(';')
    (lat, lon) = (lat.strip(), lon.strip())
    print('Location is at', lat, lon)
else:
    print('Location not found')
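
The cell above only prints the coordinates. To emit them as geo microformat markup with separate `latitude` and `longitude` elements, a minimal sketch could look like the following. It assumes the `lat; lon` text layout shown earlier; `parse_geo_text` and `to_geo_microformat` are hypothetical helper names, not part of any library.

```python
from html import escape

def parse_geo_text(text):
    """Split a 'lat; lon' string into two floats."""
    lat_text, lon_text = text.split(';')
    return float(lat_text.strip()), float(lon_text.strip())

def to_geo_microformat(lat, lon):
    """Render the coordinates as nested geo microformat spans."""
    return (
        '<span class="geo">'
        f'<span class="latitude">{escape(str(lat))}</span>; '
        f'<span class="longitude">{escape(str(lon))}</span>'
        '</span>'
    )

lat, lon = parse_geo_text('52.367; 4.900')
print(to_geo_microformat(lat, lon))
```

Converting to `float` drops trailing zeros (`4.900` becomes `4.9`); keep the values as strings instead if the exact notation matters.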


### Task 1

**1.1 Convert the output of Exercise 1 into KML**   
- Here is the KML documentation: https://developers.google.com/kml/documentation/?csw=1 
- And here you can find a simple example of how it is used: https://renenyffenegger.ch/notes/tools/Google-Earth/kml/index

**1.2 Visualise the point in Google Maps**  
- To do so, use the following code example: https://developers.google.com/maps/documentation/javascript/examples/layer-kml-features

**Note:**
You will have to create your own KML file for the custom map layer and provide a URL to the KML file inside the JavaScript code. This means that you have to upload the file somewhere. You can use a service like http://pastebin.com/ to obtain a URL for your KML file: paste the code there and request the RAW format URL. If that fails to work, you can also use KML viewer websites like https://kmzview.com/.
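
As a starting point for Task 1.1, here is a minimal sketch of how a single point could be written as a KML `Placemark` with the standard library. The function name `point_to_kml` is an assumption made for this illustration, and the KML documentation linked above remains the reference for which elements are allowed. Note that KML lists the longitude before the latitude.

```python
import xml.etree.ElementTree as ET

KML_NS = 'http://www.opengis.net/kml/2.2'

def point_to_kml(name, lat, lon):
    """Build a KML document with a single Placemark."""
    # Serialize the KML namespace as the default namespace (no prefix).
    ET.register_namespace('', KML_NS)
    kml = ET.Element(f'{{{KML_NS}}}kml')
    placemark = ET.SubElement(kml, f'{{{KML_NS}}}Placemark')
    ET.SubElement(placemark, f'{{{KML_NS}}}name').text = name
    point = ET.SubElement(placemark, f'{{{KML_NS}}}Point')
    # KML expects longitude first, then latitude.
    ET.SubElement(point, f'{{{KML_NS}}}coordinates').text = f'{lon},{lat}'
    return ET.tostring(kml, encoding='unicode')

print(point_to_kml('Amsterdam', 52.367, 4.9))
```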

In [5]:
# TODO.

**1.3 Is KML a microformat? Why (not)?**

Answer: (add your answer here)

## Exercise 2 
In order to find information on the web, we can use microformats such as [hRecipe](https://microformats.org/wiki/hrecipe) or Schema.org's [Recipe](https://schema.org/Recipe). But first, we'll show you how to find arbitrary tags in a webpage.

### Task 2 
Parsing data for a <sub><sup>veggie</sup></sub> spaghetti alla carbonara recipe (from Example 2-7 in Mining the Social Web).

In [10]:
import cloudscraper
import json
from bs4 import BeautifulSoup

# A yummy webpage (feel free to change it to your liking).
URL = "https://www.acouplecooks.com/spring-vegetarian-spaghetti-carbonara"

# Create a CloudScraper object
scraper = cloudscraper.create_scraper()

# Use the CloudScraper object to fetch the HTML content
response = scraper.get(URL)

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Now you can work with the 'soup' object as you did before
listchildren = list(soup.children)
#print(listchildren)


We can find any element in the page through **CSS selectors** (you can find them all [here](https://www.w3schools.com/cssref/css_selectors.asp)).

In short, these are:
- `.` for classes
- `#` for ids 
- and the plain element name (e.g. `title`) for elements

You can also combine them: for example, `".class1.class2"` selects all elements that have both classes. For a deeper overview, please check the above link (or google "CSS selectors"). 
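
To see the three kinds of selectors side by side, here is a small sketch that runs them on an inline HTML fragment. The fragment and the name `demo_soup` are made up for illustration and have nothing to do with the recipe page.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment, used only for illustration.
html = """
<div id="recipe">
  <h1 class="title">Carbonara</h1>
  <ul>
    <li class="ingredient">spaghetti</li>
    <li class="ingredient optional">peas</li>
  </ul>
</div>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# Element name: every <li> element.
print([li.text for li in demo_soup.select('li')])
# Class: every element with class "ingredient".
print(len(demo_soup.select('.ingredient')))
# Combined classes: elements that carry both classes.
print([el.text for el in demo_soup.select('.ingredient.optional')])
# Id: the single element with id "recipe".
print(demo_soup.select_one('#recipe').name)
```

`select` always returns a list, while `select_one` returns the first match or `None`.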

In [None]:
print(len(listchildren)) # we can see here how many children the html doc has got.
title_unparsed = soup.select_one("title")
#show the title element
print(title_unparsed)

Not so pretty... Use the `.text` attribute.

In [None]:
print(title_unparsed.text)

The website has a block of `JSON-LD` data embedded. Try to see if you can find it in the soup object. We can load the `JSON-LD` script to work with it more easily.  
Let's get a list of the ingredients.

In [None]:
# Find the script tag containing the JSON-LD data
json_ld_script = soup.find("script", {"class": "yoast-schema-graph"})

# Extract the content of the script tag
script_content = json_ld_script.string

# Load the JSON data from the script content
data = json.loads(script_content)

# Access the "recipeIngredient" list.
# NOTE: the index 7 is specific to this page's "@graph" layout and may differ on other pages.
recipe_ingredients = data["@graph"][7]["recipeIngredient"]

# Print the list of ingredients
for ingredient in recipe_ingredients:
    print(ingredient)
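
The hard-coded index `7` above only works as long as the recipe node stays at that position in the `@graph` list. A more robust sketch, assuming each node carries an `@type` that is either a string or a list of strings, is to search for the node by type. The helper `find_nodes_by_type` and the sample document are hypothetical and only serve as an illustration.

```python
import json

def find_nodes_by_type(data, wanted_type):
    """Return all nodes in a JSON-LD '@graph' list whose '@type' matches."""
    matches = []
    for node in data.get('@graph', []):
        node_type = node.get('@type', [])
        # '@type' may be a single string or a list of strings.
        types = [node_type] if isinstance(node_type, str) else node_type
        if wanted_type in types:
            matches.append(node)
    return matches

# A made-up JSON-LD document, used only for illustration.
sample = json.loads("""
{
  "@context": "https://schema.org",
  "@graph": [
    {"@type": "WebPage", "name": "Demo page"},
    {"@type": ["Recipe"], "recipeIngredient": ["pasta", "eggs"]}
  ]
}
""")

recipe_nodes = find_nodes_by_type(sample, 'Recipe')
print(recipe_nodes[0]['recipeIngredient'])
```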

Let's also print out the instructions.

In [None]:
recipe_instructions = data["@graph"][7]["recipeInstructions"]
# The instructions list contains dictionaries as elements; take a look at how the list is organized
for step in recipe_instructions:
    print(step["text"])

Other websites are going to be structured differently. Look at the following JSON-LD snippet, where the instructions are stored as a single HTML string.

In [15]:
json_example = {
    "title": "The anarchist cookbook",
    "recipeInstructions": "<ol class=\"recipeSteps\"><li>Cook the linguine according to the packet instructions. </li><li>Meanwhile, carefully crack the eggs into a small bowl and beat them with a fork. Season with a little black pepper, then stir in the ricotta finely grate in most of the lemon zest. </li><li>When the pasta has 3 minutes left, add the peas. Reserve a little cooking water, then drain the linguine and peas, and return to the pan. </li><li>Stir in the egg mixture and spinach with a wooden spoon – they'll cook gently in the residual heat. Add a little pasta water to loosen, if needed. </li><li>Share between bowls and serve with a green salad.</li></ol>",
    "ingredients": ["a lot of effort", "the right mindset"]
}

recipe_instructions = json_example["recipeInstructions"]
example_soup = BeautifulSoup(recipe_instructions, 'html.parser')

To get a nice and clean list of the instructions, step by step, we can use the `.find` method to get the first `"ol"` element with the attribute `class="recipeSteps"`, and then use `.find_all` to get all list items in it.  
Lastly, we strip the whitespace from the list items to obtain the instructions.

In [None]:
list_items = example_soup.find('ol', class_='recipeSteps').find_all('li')
instructions = [item.get_text(strip=True) for item in list_items]
print(instructions)

### Task 2.1
Now it's your turn. Create a function that can scrape any recipe webpage from the same website (other websites will use different class names). 

Make sure to:

- Return itemized content (e.g. ingredients) in a list. (You may want to use a list comprehension here.)
- Not all items have been cleaned of their HTML markup (see variables `ingredients` vs. `instructions_unparsed`). Make sure to return a list with human-readable content (e.g. by using the `.text` attribute).


In [None]:
#Here you can see the solution for our example website

URL = "https://www.acouplecooks.com/spring-vegetarian-spaghetti-carbonara"

def parse_website(url):
    # Create a CloudScraper object
    scraper = cloudscraper.create_scraper()

    # Use the CloudScraper object to fetch the HTML content
    response = scraper.get(url)

    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    #Get the title
    title_unparsed = soup.select_one("title")
    fn = title_unparsed.text
    
    json_ld_script = soup.find("script", {"class": "yoast-schema-graph"})

    # Extract the content of the script tag
    script_content = json_ld_script.string

    # Load the JSON data from the script content
    data = json.loads(script_content)

    # Access the "recipeIngredient" list
    recipe_ingredients = data["@graph"][7]["recipeIngredient"]
    
    ingredients = [ingredient for ingredient in recipe_ingredients]
    
    # Access the instructions
    recipe_instructions = data["@graph"][7]["recipeInstructions"]
    # The instructions list contains dictionaries as elements; take a look at how the list is organized
    instructions = [step["text"] for step in recipe_instructions]

    return {'name': fn,
            'ingredients': ingredients,
            'instructions': instructions,
            }
    
recipe = parse_website(URL)
print (recipe)
        

**Implement your function in the cell below:**

In [None]:
# -*- coding: utf-8 -*-

import cloudscraper
import json
from bs4 import BeautifulSoup

# Pass in a URL containing hRecipe, such as
# https://www.jamieoliver.com/recipes/pasta-recipes/veggie-carbonara/

URL = ""  # YOUR RECIPE HERE

# Parse out some of the pertinent information for a recipe.
# See http://microformats.org/wiki/hrecipe.

def parse_website(url):
    # TODO: replace these placeholders with the values scraped from `url`
    fn = None
    ingredients = []
    instructions = []

    return {
            'name': fn,
            'ingredients': ingredients,
            'instructions': instructions,
            }
    
recipe = parse_website(URL)
print (recipe)

#### How Can We Extract Information from Multiple Websites?

**The answer:** microformats.

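Instead of manually extracting information from microformats like `schema.org` or `hRecipe`, you can use a package called `scrape-schema-recipe`. It is not part of the prerequisites cell at the top, so install it first (e.g. `!pip install scrape-schema-recipe`).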

Feel free to give it a try!

### Task 2.2
`hRecipe` is a microformat specifically created for recipes. 

**For this task**, you have to compare different dessert recipe ingredients. (For inspiration, you can look back at the exercises you did in Hands-on session 1 where you compared different sets of tweets.)
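
One possible way to compare two ingredient lists is with set operations and the Jaccard similarity. The sketch below is only an illustration under that assumption: the helper names `normalise` and `jaccard` and the ingredient lists are made up, and real `recipeIngredient` strings usually contain quantities (e.g. "2 eggs") that would need extra cleaning first.

```python
def normalise(ingredients):
    """Lower-case and strip each ingredient so small differences do not matter."""
    return {item.strip().lower() for item in ingredients}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = normalise(a), normalise(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Made-up ingredient lists, used only for illustration.
brownies = ['Butter', 'sugar', 'eggs', 'cocoa', 'flour']
pancakes = ['flour', 'eggs', 'milk', 'butter ']

# Ingredients the two recipes share.
print(sorted(normalise(brownies) & normalise(pancakes)))
# Similarity score between 0 (nothing shared) and 1 (identical sets).
print(jaccard(brownies, pancakes))
```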

In [None]:
import scrape_schema_recipe

# TODO.

## Exercise 3

`Schema.org` is one of the most widely used annotation formats. It is a multipurpose vocabulary that has been created by a consortium consisting of Yahoo!, Google and Microsoft. It can describe `entities`, `events`, `products`, etc. 

Check out the vocabulary specs on [schema.org][s].

[s]: https://schema.org/

### Task 3 - Parsing schema.org microdata

To parse this data, you need the `rdflib` package (which you installed in one of the previous steps).

In [None]:
from rdflib import Graph

# Source: https://www.youtube.com/watch?v=sCU214rbRZ0
# Pass in a URL that serves RDF data (here: a DBpedia resource)
URL = "http://dbpedia.org/resource/Michael_Jackson"

# Initialize a graph
g = Graph()

# Parse the RDF data served by DBpedia into the graph
result = g.parse(location=URL)

# Loop through the first 10 triples in the graph
for index, (sub, pred, obj) in enumerate(g):
    if index == 10:
        break
    print(sub, pred, obj)

In [None]:
# Print the size of the Graph
print(f'Graph has {len(g)} facts')

In [None]:
# Print out the entire Graph in the RDF Turtle format
print(g.serialize(format='ttl'))

### Task 3.1 
Compare the [`schema.org`][s] information about a band (you can choose any) on [`Last.fm`][l] to the `Facebook Open Graph` information about the **same band** on `Facebook`. 
- What are the differences? 
- Which format do you think offers better interoperability?

Be sure to refer to the **Microformat** specifications.
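
Open Graph information is published as `<meta>` elements whose `property` attribute starts with `og:`. As a hedged sketch of how you could collect them for the comparison, the class below uses the standard library's `html.parser`. The class name `OpenGraphParser` and the HTML fragment are made up for illustration; a real page would first have to be downloaded.

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        attributes = dict(attrs)
        prop = attributes.get('property') or ''
        # Keep only Open Graph properties, e.g. og:title.
        if prop.startswith('og:'):
            self.properties[prop] = attributes.get('content')

# A made-up page head, used only for illustration.
html = """
<head>
  <meta property="og:title" content="Some Band" />
  <meta property="og:type" content="website" />
  <meta name="description" content="Not an Open Graph tag" />
</head>
"""

og_parser = OpenGraphParser()
og_parser.feed(html)
print(og_parser.properties)
```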

[l]: https://www.last.fm/
[s]: https://schema.org/


**Answer:** (add your answer here)

### Task 3.2
Explore the various microformats at http://microformats.org/ and compare the output of the exercises with the output of http://microformats.org/. 

Think about possible microformats you want to support in your final assignment and read up on how to parse them.

**Answer:** (add your answer here)