# Extract Collection Data - HTML

This notebook shows how to read in collection data in HTML format and convert it to a Python dictionary, ready for transformation to Linked Art.

Steps in this notebook:
1. Read HTML file
2. Convert HTML file to Python Dictionary


# Python modules


The following Python modules will be used:
* json
* bs4 including BeautifulSoup


## json


The Python `json` module encodes and decodes JSON. This notebook uses it to serialise Python objects to JSON before printing them.
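As a minimal sketch of that encoding step, the snippet below serialises a small dictionary with `json.dumps`. The `record` dictionary and its keys are hypothetical and only stand in for whatever structure the extraction produces; the title is one of those printed later in this notebook.

```python
import json

# Hypothetical record; the keys are illustrative, not a Linked Art structure.
record = {
    "title": "Part of the Façade of the destroyed Church of San Michele in Foro, Lucca, as it appeared in 1845",
    "source": "ashmolean.html",
}

# indent=2 gives readable output; ensure_ascii=False keeps characters
# such as "ç" as they are instead of escaping them to \u00e7.
encoded = json.dumps(record, indent=2, ensure_ascii=False)
print(encoded)

# Decoding the string returns an equal dictionary.
decoded = json.loads(encoded)
```

Without `ensure_ascii=False` the output is still valid JSON, but non-ASCII characters in titles are written as escape sequences, which is harder to read when checking results by eye.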


## bs4

<pre>Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.</pre>
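The three operations named in that description (navigating, searching, modifying) can be sketched on a small inline document. The HTML string below is made up for illustration; it only mirrors the `div` and `h3` structure this notebook looks for later.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from the example data file.
html = '<div class="list-inner"><h3>Example title</h3><p>Example description</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Navigating: step down the tree by tag name.
heading = soup.div.h3

# Searching: collect every matching tag in the document.
paragraphs = soup.find_all("p")

# Modifying: replace the text content of a tag in place.
heading.string = "Renamed title"

print(heading.get_text())
print(len(paragraphs))
print(soup.div["class"])
```

Note that `class` is a multi-valued attribute in HTML, so Beautiful Soup returns it as a list rather than a single string.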

### Further Reading 

* https://docs.python.org/3/library/json.html
* https://www.crummy.com/software/BeautifulSoup/bs4/doc/


In [5]:
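# json is part of the Python standard library, so it never needs installing
import json

try:
    from bs4 import BeautifulSoup
except ImportError:
    # the PyPI package that provides the bs4 module is beautifulsoup4
    %pip install beautifulsoup4
    from bs4 import BeautifulSoup
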

## Read HTML file


The following code demonstrates how to read an HTML file, load its contents into a `BeautifulSoup` object, and print the title of each artwork it describes.

<pre>The BeautifulSoup object represents the parsed document as a whole.</pre>



In [19]:
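file = './data/example/ashmolean.html'

# read the file; the context manager closes it afterwards
# (utf-8 is an assumption here; adjust it if your file uses another encoding)
with open(file, 'r', encoding='utf-8') as f:
    content = f.read()

# create soup
soup = BeautifulSoup(content, 'html.parser')

# iterate through artwork descriptions in HTML
for artwork in soup.find_all('div', attrs={"class": "list-inner"}):

    # title (skip entries that have no <h3> heading)
    heading = artwork.find('h3')
    title = heading.string if heading else None
    if title:
        print("artwork title : " + title)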



artwork title : The Tower of Gloucester Cathedral
artwork title : Near Bassano, Brenner
artwork title : Bergamo and the Alps, from the road to Brescia
artwork title : Bellagio, Lago di Como
artwork title : End of the Lake of Lecco
artwork title : Axmouth Landslip from Dolands Farm
artwork title : Study for Detail of the Piazza delle Erbe, Verona
artwork title : The Palazzo Contarini-Fasan, Venice
artwork title : Outline of Leaves of Oak, touched with Colour
artwork title : Quick Study of Leaf Contour: Bramble
artwork title : Leaf Contour: Laburnum
artwork title : Stone Pines at Sestri, Gulf of Genoa
artwork title : Part of the Façade of the destroyed Church of San Michele in Foro, Lucca, as it appeared in 1845
artwork title : Rough Sketch of Tree Growth: Macugnaga
artwork title : Drawing of the Background of Raphael's 'Virgin and Child with the Infant Saint John' (The 'Madonna del Cardellino')
artwork title : Study of the Marble Inlaying on the Front of the Casa Loredan, Venice
artwork
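
The loop above only prints each title. To reach the Python dictionary this notebook aims for, the same traversal can collect the titles instead. The following is a sketch under assumptions, not the notebook's own method: the function name `extract_artworks`, the `"artworks"` and `"title"` keys, and the inline `sample_html` are all hypothetical, and only the `div` with class `list-inner` containing an `h3` comes from the code above.

```python
import json
from bs4 import BeautifulSoup

# Illustrative markup that mirrors the structure searched for above.
sample_html = """
<div class="list-inner"><h3>The Tower of Gloucester Cathedral</h3></div>
<div class="list-inner"><h3>Near Bassano, Brenner</h3></div>
<div class="list-inner"><p>No heading here</p></div>
"""

def extract_artworks(html):
    """Return a dictionary holding one entry per artwork title found."""
    soup = BeautifulSoup(html, "html.parser")
    artworks = []
    for block in soup.find_all("div", attrs={"class": "list-inner"}):
        heading = block.find("h3")
        if heading is None:
            # skip blocks without a heading instead of raising an error
            continue
        title = heading.get_text(strip=True)
        if title:
            artworks.append({"title": title})
    return {"artworks": artworks}

collection = extract_artworks(sample_html)
print(json.dumps(collection, indent=2, ensure_ascii=False))
```

This sketch uses `get_text(strip=True)` rather than `.string`, because `.string` returns `None` when the heading contains nested tags, which would silently drop that artwork.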

# Try with your own HTML file

If you'd like to try this with your own HTML file, select the file on your local system using the widget below.

The `ipywidgets` Python module will be used for the file upload.

### Further Reading

https://ipywidgets.readthedocs.io/

In [12]:
try:
    import ipywidgets as widgets
except ImportError:
    %pip install ipywidgets
    import ipywidgets as widgets

from ipywidgets import Layout, FileUpload

# IPython is what runs the notebook kernel, so it is already installed
from IPython.display import display, IFrame, HTML, Javascript

import io



## Display file upload widget

In [14]:
# define file upload widget
uploader = widgets.FileUpload(accept='', multiple=False, description='Select file')
uploader.style.button_color = 'orange'

display(uploader)



FileUpload(value={}, description='Select file', style=ButtonStyle(button_color='orange'))

## Read contents of file

The following code reads the contents of the HTML file uploaded using the FileUpload widget, parses it into a `BeautifulSoup` object, and prints the title of each artwork.
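
The code in this section indexes `uploader.value` by filename, which matches the dictionary layout shown in the widget output above (`value={}`). As far as I know, ipywidgets 8.x changed this value to a tuple of per-file dictionaries with `name` and `content` keys, where `content` is a `memoryview`. The helper below is a hypothetical sketch that accepts either layout; `iter_uploads` and `decode_html` are names invented here, and the 8.x layout should be checked against the ipywidgets documentation for your installed version.

```python
def iter_uploads(value):
    """Yield (filename, raw_bytes) pairs from a FileUpload value."""
    if isinstance(value, dict):
        # dictionary layout: {filename: {"metadata": ..., "content": bytes}}
        for name, info in value.items():
            yield name, bytes(info["content"])
    else:
        # assumed tuple layout: ({"name": ..., "content": memoryview, ...}, ...)
        for info in value:
            yield info["name"], bytes(info["content"])


def decode_html(raw, encoding="utf-8"):
    """Decode uploaded bytes to text; utf-8 is an assumption."""
    return raw.decode(encoding, errors="replace")


# Stand-in values for the two layouts, since no widget is needed to test this.
old_style = {"page.html": {"metadata": {}, "content": b"<h3>Title</h3>"}}
new_style = ({"name": "page.html", "content": memoryview(b"<h3>Title</h3>")},)

for name, raw in iter_uploads(old_style):
    print(name, decode_html(raw))
```

Decoding is optional when the result goes straight to `BeautifulSoup`, which accepts bytes and detects the encoding itself; it is useful when the text needs to be inspected or stored first.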


In [18]:

# uploader.value maps each uploaded filename to its metadata and content
for filename in uploader.value:
    if filename != "":
        # raw bytes of the uploaded file; BeautifulSoup accepts bytes directly
        content = uploader.value[filename]["content"]

        # create soup
        soup = BeautifulSoup(content, 'html.parser')

        # iterate through artwork descriptions in HTML
        for artwork in soup.find_all('div', attrs={"class": "list-inner"}):

            # title (skip entries that have no <h3> heading)
            heading = artwork.find('h3')
            title = heading.string if heading else None
            if title:
                print("title : " + title)


title : The Tower of Gloucester Cathedral
title : Near Bassano, Brenner
title : Bergamo and the Alps, from the road to Brescia
title : Bellagio, Lago di Como
title : End of the Lake of Lecco
title : Axmouth Landslip from Dolands Farm
title : Study for Detail of the Piazza delle Erbe, Verona
title : The Palazzo Contarini-Fasan, Venice
title : Outline of Leaves of Oak, touched with Colour
title : Quick Study of Leaf Contour: Bramble
title : Leaf Contour: Laburnum
title : Stone Pines at Sestri, Gulf of Genoa
title : Part of the Façade of the destroyed Church of San Michele in Foro, Lucca, as it appeared in 1845
title : Rough Sketch of Tree Growth: Macugnaga
title : Drawing of the Background of Raphael's 'Virgin and Child with the Infant Saint John' (The 'Madonna del Cardellino')
title : Study of the Marble Inlaying on the Front of the Casa Loredan, Venice
title : The Gryphon bearing the south Shaft of the west Entrance of the Duomo, Verona
title : Part of the Façade of the destroyed Churc