This tutorial shows you how to use ETK to extract information for all soccer teams in Italy. Suppose that you want to construct a list of records containing team name, home city, latitude and longitude for every team in Italy.

We start with a Wikipedia page that lists all soccer teams in Italy:
https://en.wikipedia.org/wiki/List_of_football_clubs_in_Italy. The page has a table for each division. Each table contains the team name and home city, as well as other information that we will ignore for now. The tables don’t contain the latitude and longitude of the cities. Most city names in the tables link to other Wikipedia pages, and we could get the latitude and longitude from there. In this tutorial we will use a different approach: linking the city names to geonames.org, a dataset containing every city in the world.

# Part 1: Extracting The Team Tables

Look at the page, and you will notice that the teams are scattered over multiple tables, one for each division. Fortunately, all the tables have the same structure, which will make our job easier. 

## Defining an ETK module

An ETK module organizes the extraction code for a project so that it can be reused. Large projects often consist of multiple ETK modules for different kinds of documents. In this tutorial we will have only one module.

First, we load the dependencies used throughout this tutorial. We also create a global variable `etk`, which we will use through the whole process.

In [187]:
import requests
import json
import jsonpath_ng.ext as jex
from etk.extractors.table_extractor import TableExtractor
from etk.extractors.glossary_extractor import GlossaryExtractor
from etk.etk import ETK
etk = ETK()

## Reading the HTML file

In [171]:
url = 'https://en.wikipedia.org/wiki/List_of_football_clubs_in_Italy'
response = requests.get(url)
html_page = response.text
html_page

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of football clubs in Italy - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_football_clubs_in_Italy","wgTitle":"List of football clubs in Italy","wgCurRevisionId":857361197,"wgRevisionId":857361197,"wgArticleId":1083851,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Use dmy dates from June 2013","Football clubs in Italy","Lists of association football clubs","Association football in Italy lists"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTran

## Extracting the tables

Extracting the tables in a Web page is very easy, as ETK has a table extractor. We divide this phase into two parts.

The first part is to create an instance of TableExtractor, and use that instance to extract the raw tables.
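For intuition about what "raw tables" means here, the following is a minimal standard-library sketch that turns HTML tables into a `rows`/`cells`/`text` structure similar to the output shown in this section. It is not ETK's implementation: the class name `SimpleTableParser` and the sample HTML are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Hypothetical helper: collects the cell text of every table in a page."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one {"rows": [...]} dict per <table>
        self._row = None   # row currently being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append({"rows": []})
        elif tag == "tr" and self.tables:
            self._row = {"cells": []}
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell).strip()
            self._row["cells"].append({"text": text})
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1]["rows"].append(self._row)
            self._row = None


# Made-up sample page with one two-column table.
sample_html = """
<table>
  <tr><th>Club</th><th>City</th></tr>
  <tr><td><a href="#">Atalanta</a></td><td><a href="#">Bergamo</a></td></tr>
</table>
"""

parser = SimpleTableParser()
parser.feed(sample_html)
print(parser.tables)
```

ETK's `TableExtractor` returns considerably more than this sketch, for example the `features` block and the raw HTML of each cell.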

In [172]:
my_table_extractor = TableExtractor()
# Keep only the first 14 tables returned by the extractor.
tables_in_page = my_table_extractor.extract(html_page)[:14]
print('Number of tables in this page:', len(tables_in_page), '\n')
print('The last of these tables is shown below: \n')
print(json.dumps(tables_in_page[-1].value, indent=2))

Number of tables in this page: 14 

The last of these tables is shown below: 

{
  "features": {
    "no_of_rows": 19,
    "no_of_cells": 95,
    "max_cols_in_a_row": 5,
    "ratio_of_img_tags_to_cells": 0.0,
    "ratio_of_href_tags_to_cells": 0.5894736842105263,
    "ratio_of_input_tags_to_cells": 0.0,
    "ratio_of_select_tags_to_cells": 0.0,
    "ratio_of_colspan_tags_to_cells": 0.0,
    "ratio_of_colons_to_cells": 0.0,
    "avg_cell_len": 12.905263157894737,
    "avg_row_len": 68.42105263157895,
    "avg_row_len_dev": 7.759546336069895,
    "avg_col_len": 245.2,
    "avg_col_len_dev": 4.051833532486392,
    "no_of_cols_containing_num": 2,
    "no_of_cols_empty": 0
  },
  "rows": [
    {
      "cells": [
        {
          "cell": "<th>Club\n</th>",
          "text": "Club",
          "id": "row_0_col_0"
        },
        {
          "cell": "<th>City\n</th>",
          "text": "City",
          "id": "row_0_col_1"
        },
        {
          "cell": "<th>Stadium\n</th>",
     

In the second part, we use a JSON path to pull the text of the cells we need out of each extracted table.

Aside: ETK uses JSON paths to access data in JSON documents. Take a look at the excellent and short introduction to JSON paths: http://goessner.net/articles/JsonPath/
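To see what the path `$.rows[*].cells[0:4].text` selects, here is a plain-Python sketch of the same selection on a tiny hand-made table. It only mimics the path for illustration and does not use ETK or a JSON path library. The helper name `select_cell_texts` and the fifth-column values are assumptions.

```python
# A tiny stand-in for one extracted table (structure as in the output above).
table = {
    "rows": [
        {"cells": [{"text": "Club"}, {"text": "City"}, {"text": "Stadium"},
                   {"text": "Capacity"}, {"text": "Extra"}]},
        {"cells": [{"text": "Atalanta"}, {"text": "Bergamo"},
                   {"text": "Stadio Atleti Azzurri d'Italia"},
                   {"text": "21,300"}, {"text": "ignored"}]},
        {"cells": [{"text": "Bologna"}, {"text": "Bologna"},
                   {"text": "Stadio Renato Dall'Ara"},
                   {"text": "38,279"}, {"text": "ignored"}]},
    ]
}


def select_cell_texts(table, start=0, stop=4):
    """Rough equivalent of $.rows[*].cells[start:stop].text (hypothetical helper)."""
    return [cell["text"]
            for row in table["rows"]
            for cell in row["cells"][start:stop]]


# The selection is one flat list: header cells first, then the data cells.
flat = select_cell_texts(table)

# Drop the four header cells and regroup the rest into records of four fields.
keys = ["Team", "City", "Stadium", "Capacity"]
body = flat[4:]
rows = [dict(zip(keys, body[i:i + 4])) for i in range(0, len(body), 4)]
print(rows)
```

Because the selection is flat, regrouping in fixed chunks of four only works when every row has at least four cells. A row with fewer cells would shift all the records after it.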

In [173]:
extracted_rows = list()
# Select the text of the first four cells in every row of a table.
json_path = '$.rows[*].cells[0:4].text'
keys = ['Team', 'City', 'Stadium', 'Capacity']
for table in tables_in_page:
    doc = etk.create_document(table.value)
    # The first four segments are the header row, so skip them.
    row_values = doc.select_segments(json_path)[4:]
    # Regroup the flat list of segments into rows of four cells.
    for index in range(0, len(row_values), 4):
        one_row = {keys[0]: row_values[index].value,
                   keys[1]: row_values[index + 1].value,
                   keys[2]: row_values[index + 2].value,
                   keys[3]: row_values[index + 3].value}
        extracted_rows.append(one_row)
print('Number of rows extracted from that page', len(extracted_rows), '\n')
print('Sample rows(5):')
print(json.dumps(extracted_rows[:5], indent=2))

Number of rows extracted from that page 261 

Sample rows(5):
[
  {
    "Team": "Atalanta",
    "City": "Bergamo",
    "Stadium": "Stadio Atleti Azzurri d'Italia",
    "Capacity": "7004213000000000000\u2660 21,300"
  },
  {
    "Team": "Bologna",
    "City": "Bologna",
    "Stadium": "Stadio Renato Dall'Ara",
    "Capacity": "7004382790000000000\u2660 38,279"
  },
  {
    "Team": "Cagliari",
    "City": "Cagliari",
    "Stadium": "Sardegna Arena",
    "Capacity": "7004162330000000000\u2660 16,233"
  },
  {
    "Team": "Chievo",
    "City": "Verona",
    "Stadium": "Stadio Marc'Antonio Bentegodi",
    "Capacity": "7004384020000000000\u2660 38,402"
  },
  {
    "Team": "Empoli",
    "City": "Empoli",
    "Stadium": "Stadio Carlo Castellani",
    "Capacity": "7004162840000000000\u2660 16,284"
  }
]
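The `Capacity` values in the sample above carry a long numeric prefix followed by `♠` (`\u2660`), which appears to be a hidden sort key from the Wikipedia table rather than part of the visible number. The tutorial does not clean this field. If you need the capacity as an integer, a minimal sketch, assuming the visible number always follows the last `♠`, could look like this (`parse_capacity` is a hypothetical helper name):

```python
def parse_capacity(raw):
    """Return the visible capacity as an int, or None if it cannot be parsed."""
    # Keep only the part after the last spade character, if there is one.
    visible = raw.split("\u2660")[-1]
    # Remove thousands separators and surrounding whitespace.
    digits = visible.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


print(parse_capacity("7004213000000000000\u2660 21,300"))
```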


The extracted rows are now stored in `extracted_rows` as a list of JSON records.

Next, we load a dict that maps each city name to all geonames records with that name and a population greater than 25,000.

In [69]:
file_name = 'cities_ppl>25000.json'
# Use a context manager so the file is closed even if loading fails.
with open(file_name, 'r') as file:
    city_dataset = json.load(file)
city_list = list(city_dataset.keys())
print('There are', len(city_list), 'cities with population greater than or equal to 25,000.\n')
print('City list samples(10):\n')
print(city_list[:10])

There are 15117 cities with population greater than or equal to 25,000.

City list samples(10):

['Marion', 'Fes', 'Fes al Bali', 'Gravina in Puglia', 'Nawada', 'Pensacola', 'Pedro Betancourt', 'Uriangato', 'Fiditi', 'Wilkes-Barre']


## Identifying the city names in geonames and linking to geonames 

There are many ways to do this step. We will do it using the ETK glossary extractor to illustrate how to use other extractors and how to chain the results of one extractor as input to other extractors.

Using data from the geonames.org web site, we prepared a list of all cities in the world with population greater than 25,000. We use this small glossary to make the code run faster, but you may want to try it with the full list of cities.

First, we need to load the glossary in ETK.
We use ETK's default tokenizer to tokenize the strings, and we set `ngrams` to zero so that the extractor chooses the n-gram length automatically.

In [71]:
my_glossary_extractor = GlossaryExtractor(glossary=city_list, extractor_name='tutorial_glossary', 
                                          tokenizer=etk.default_tokenizer, ngrams=0,
                                          case_sensitive=False)

Now we are going to use the glossary to extract from the `City` column all the strings that match names in geonames. This method allows us to extract the geonames city name from cells that may contain extraneous information.
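To illustrate the idea of glossary matching on a cell with extra text, here is a simplified plain-Python sketch. It is not how ETK's `GlossaryExtractor` is implemented: the function names, the regular-expression tokenizer, and the `max_ngram` parameter are assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def glossary_matches(text, glossary, max_ngram=3):
    """Return glossary entries that occur in text as a run of 1..max_ngram tokens."""
    # Normalize each entry the same way as the text, so that case and
    # punctuation differences do not prevent a match.
    lookup = {" ".join(tokenize(entry)): entry for entry in glossary}
    tokens = tokenize(text)
    matches = []
    # Try longer n-grams first so multi-word names are reported before their parts.
    for n in range(max_ngram, 0, -1):
        for start in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[start:start + n])
            if candidate in lookup:
                matches.append(lookup[candidate])
    return matches


cities = ["Reggio Emilia", "Bergamo", "Wilkes-Barre"]
print(glossary_matches("Reggio Emilia (shared ground)", cities))
```

The sketch checks every n-gram against a dictionary, which is fine for a single table cell. A glossary extractor that runs over long documents would typically use a more efficient structure such as a trie.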


To run the glossary extractor over every `City` cell, we create a document for each extracted row and select its `City` field with a JSON path.
Our list of extractions has the names of cities that we know appear in geonames. Often, different cities in the world have the same name (e.g., Paris, France and Paris, Texas). To get the latitude and longitude, we need to identify the correct city. We know all the cities are in Italy, so we can easily filter.
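The country filter can be sketched in plain Python as follows. The sketch assumes that each city name maps to a list of records with a `country` field, which is what the JSON path filter used in this tutorial implies. The `latitude` and `longitude` field names, the rounded coordinates, and the helper name `records_in_country` are illustrative assumptions, not the actual contents of the glossary file.

```python
# Illustrative stand-in for the city dataset (values rounded, field names assumed).
sample_dataset = {
    "Paris": [
        {"country": "France", "latitude": 48.85, "longitude": 2.35},
        {"country": "United States", "latitude": 33.66, "longitude": -95.56},
    ],
}


def records_in_country(dataset, city_name, country):
    """Return the records for city_name whose country field equals country."""
    return [record
            for record in dataset.get(city_name, [])
            if record.get("country") == country]


print(records_in_country(sample_dataset, "Paris", "France"))
```

A city name that is missing from the dataset, or that has no record in the requested country, yields an empty list, so the caller can skip that row.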

In [195]:
hit_count = 0
# Use the extended JSON path parser, which supports filter expressions such as [?(...)].
etk.parser = jex.parse
cities_doc = etk.create_document(city_dataset)
for one_row in extracted_rows:
    doc = etk.create_document(one_row)
    city = doc.select_segments('City')
    # Look for glossary city names inside the City cell.
    extractions = doc.extract(my_glossary_extractor, city[0])
    if extractions:
        # Keep only the geonames records for this city name that are in Italy.
        path = '$."' + extractions[0].value + '"[?(@.country == "Italy")]'
        city_info = cities_doc.select_segments(path)
        if city_info:
            hit_count+=1
            # Copy every field of the first matching record into the row.
            for key in city_info[0].value:
                one_row['city' + key]= city_info[0].value[key]
print('There\'re', hit_count, 'hits for city_list.\n')
print('Final result sample:\n')
print(json.dumps(extracted_rows[0], indent=2))

There're 143 hits for city_list.

Final result sample:

{
  "Team": "Atalanta",
  "City": "Bergamo",
  "Stadium": "Stadio Atleti Azzurri d'Italia",
  "Capacity": "7004213000000000000\u2660 21,300",
  "provenances": [
    {
      "@id": 0,
      "@type": "extraction_provenance_record",
      "method": "tutorial_glossary",
      "confidence": 1.0,
      "origin_record": {
        "path": "City",
        "start_char": 0,
        "end_char": 7
      }
    },
    {
      "@id": 0,
      "@type": "extraction_provenance_record",
      "method": "tutorial_glossary",
      "confidence": 1.0,
      "origin_record": {
        "path": "City",
        "start_char": 0,
        "end_char": 7
      }
    },
    {
      "@id": 0,
      "@type": "extraction_provenance_record",
      "method": "tutorial_glossary",
      "confidence": 1.0,
      "origin_record": {
        "path": "City",
        "start_char": 0,
        "end_char": 7
      }
    },
    {
      "@id": 0,
      "@type": "extraction_provenan