This tutorial shows you how to use ETK to extract information for all soccer teams in Italy. Suppose that you want to construct a list of records containing team name, home city, latitude and longitude for every team in Italy.

We start with a Wikipedia page that lists all soccer teams in Italy:
https://en.wikipedia.org/wiki/List_of_football_clubs_in_Italy. The page has a table for each division. Each table contains the team name and home city, as well as other information that we will ignore for now. The tables don’t contain the latitude and longitude of the cities. You will notice that most city names in the table are links to other Wikipedia pages, and we could get the latitude and longitude from there. In this tutorial we will use a different approach, linking the city names to geonames.org, a dataset containing every city in the world.

# Part 1: Extracting The Team Tables

Look at the page, and you will notice that the teams are scattered over multiple tables, one for each division. Fortunately, all the tables have the same structure, which will make our job easier. 

## Defining an ETK module

An ETK module organizes the extraction code for a project into a reusable unit. Large projects often consist of multiple ETK modules, one for each kind of document. In this tutorial we will have only one module.

First, we import the dependencies used throughout this tutorial. We also create an `etk` instance, which we will reuse in every step.

In [5]:
import requests
import json
import jsonpath_ng.ext as jex
import re
from etk.extractors.table_extractor import TableExtractor
from etk.extractors.glossary_extractor import GlossaryExtractor
from etk.etk import ETK
from etk.knowledge_graph_schema import KGSchema

kg_schema = KGSchema(json.load(open('./resources/master_config.json')))
etk = ETK(kg_schema=kg_schema)
etk.parser = jex.parse


## Reading the HTML file

We fetch the Wikipedia page that lists the soccer teams and keep the body of the response. In the second part of this tutorial we wrap this HTML in a `cdr` dict that carries the page in its `raw_content` field and the address in its `url` field.

In [138]:
url = 'https://en.wikipedia.org/wiki/List_of_football_clubs_in_Italy'
response = requests.get(url)
html_page = response.text

print('The first 600 chars of the html page:\n')
print(html_page[:600])


The first 600 chars of the html page:

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of football clubs in Italy - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_football_clubs_in_Italy","wgTitle":"List of football clubs in Italy","wgCurRevisionId":859334329,"wgRevisionId":859334329,"wgArticl
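
The `cdr` record mentioned above is an ordinary dict wrapped around the fetched page. Here is a minimal sketch, assuming a hypothetical helper `make_cdr` (it is not part of ETK) and the `dataset` label that the module in Part 2 tests for:

```python
def make_cdr(html, url, dataset="italy_team"):
    """Wrap a fetched page in a CDR-style dict (hypothetical helper, not an ETK API)."""
    return {
        "raw_content": html,  # the page body, e.g. response.text
        "url": url,           # where the page came from
        "dataset": dataset,   # label that a module's document_selector can test
    }


# A tiny stand-in page is used here instead of the real response body.
cdr = make_cdr("<html><body>demo</body></html>",
               "https://en.wikipedia.org/wiki/List_of_football_clubs_in_Italy")
print(sorted(cdr.keys()))
```

Keeping the URL next to the content makes it possible to trace every extracted row back to its source page.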


## Extracting the tables

Extracting the tables in a Web page is very easy, as ETK has a table extractor. We divide this phase into two parts.

The first part is to create an instance of TableExtractor, and use that instance to extract the raw tables.

In [156]:
my_table_extractor = TableExtractor()
tables_in_page = my_table_extractor.extract(html_page)[:14]
print('Number of tables in this page:', len(tables_in_page), '\n')
print('The first table in the page shows below: \n')
print(json.dumps(tables_in_page[0].value, indent=2))


Number of tables in this page: 14 

The first table in the page shows below: 

{
  "features": {
    "no_of_rows": 21,
    "no_of_cells": 105,
    "max_cols_in_a_row": 5,
    "ratio_of_img_tags_to_cells": 0.0,
    "ratio_of_href_tags_to_cells": 0.7238095238095238,
    "ratio_of_input_tags_to_cells": 0.0,
    "ratio_of_select_tags_to_cells": 0.0,
    "ratio_of_colspan_tags_to_cells": 0.0,
    "ratio_of_colons_to_cells": 0.0,
    "avg_cell_len": 14.942857142857143,
    "avg_row_len": 78.71428571428571,
    "avg_row_len_dev": 8.490409488646232,
    "avg_col_len": 313.8,
    "avg_col_len_dev": 3.8774340214067022,
    "no_of_cols_containing_num": 2,
    "no_of_cols_empty": 0
  },
  "rows": [
    {
      "cells": [
        {
          "cell": "<th>Team\n</th>",
          "text": "Team",
          "id": "row_0_col_0"
        },
        {
          "cell": "<th>Home city\n</th>",
          "text": "Home city",
          "id": "row_0_col_1"
        },
        {
          "cell": "<th>Stadium\n<
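
The JSON above has a `rows` list whose entries each hold a `cells` list, and every cell carries its raw HTML (`cell`), its plain text (`text`), and an `id`. Here is a minimal sketch of reading that structure with plain Python. The `sample_table` dict is a hand-written, simplified stand-in shaped like the output above, not real extractor output:

```python
# Hand-written stand-in with the same shape as one extracted table.
sample_table = {
    "rows": [
        {"cells": [
            {"cell": "<th>Team\n</th>", "text": "Team", "id": "row_0_col_0"},
            {"cell": "<th>Home city\n</th>", "text": "Home city", "id": "row_0_col_1"},
        ]},
        {"cells": [
            {"cell": "<td>Atalanta\n</td>", "text": "Atalanta", "id": "row_1_col_0"},
            {"cell": "<td>Bergamo\n</td>", "text": "Bergamo", "id": "row_1_col_1"},
        ]},
    ]
}


def row_texts(row):
    """Return the plain text of every cell in a row, in column order."""
    return [cell["text"] for cell in row["cells"]]


# The first row is the heading; the remaining rows are data.
header = row_texts(sample_table["rows"][0])
records = [dict(zip(header, row_texts(row))) for row in sample_table["rows"][1:]]
print(header)
print(records)
```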

In the second part, we use JSON path to do further table extraction.

Aside: ETK uses JSON paths to access data in JSON documents. Take a look at the excellent and short introduction to JSON paths: http://goessner.net/articles/JsonPath/
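
As a rough guide to the path used in the next cell, `$.cells[0:4].text` starts at the document root (`$`), takes the `cells` array, slices out elements 0 through 3, and reads the `text` field of each. Here is a plain-Python sketch of the same selection. It assumes a hand-written `row` dict shaped like one table row, and the helper name `select_cells_text` is hypothetical:

```python
# Hand-written row with five cells; only the "text" field matters here.
row = {"cells": [
    {"text": "Atalanta"},
    {"text": "Bergamo"},
    {"text": "Stadio Atleti Azzurri d'Italia"},
    {"text": "7004213000000000000\u2660 21,300"},
    {"text": "7th in Serie A"},
]}


def select_cells_text(doc, start, stop):
    """Plain-Python counterpart of the JSON path '$.cells[start:stop].text'."""
    return [cell["text"] for cell in doc["cells"][start:stop] if "text" in cell]


values = select_cells_text(row, 0, 4)
print(values)
```

The slice end is exclusive, so the fifth cell (the league position) is left out.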

In [157]:
all_json_path = '$.cells[0:4].text'
docs = list()
for table in tables_in_page:

    # skipping the first row, the heading
    for row in table.value['rows'][1:]:
        doc = etk.create_document(row)
        row_values = doc.select_segments(all_json_path)

        # add the information we extracted in the knowledge graph of the doc.
        doc.kg.add_value('team', value=row_values[0].value)
        doc.kg.add_value('city_name', value=row_values[1].value)
        doc.kg.add_value('stadium', value=row_values[2].value)
        capacity_split = re.split(' |,', row_values[3].value)
        if capacity_split[-1] != '':
            capacity = int(capacity_split[-2] + capacity_split[-1]) if len(capacity_split) > 1 else int(
                capacity_split[-1])
            doc.kg.add_value('capacity', value=capacity)
        docs.append(doc)

print('Number of rows extracted from that page', len(docs), '\n')
print('Sample rows(5):')
for doc in docs[:5]:
    print(doc.kg.value, '\n')


Number of rows extracted from that page 258 

Sample rows(5):
{'team': [{'value': 'Atalanta', 'key': 'atalanta'}], 'city_name': [{'value': 'Bergamo', 'key': 'bergamo'}], 'stadium': [{'value': "Stadio Atleti Azzurri d'Italia", 'key': "stadio atleti azzurri d'italia"}], 'capacity': [{'value': 21300, 'key': '21300'}]} 

{'team': [{'value': 'Bologna', 'key': 'bologna'}], 'city_name': [{'value': 'Bologna', 'key': 'bologna'}], 'stadium': [{'value': "Stadio Renato Dall'Ara", 'key': "stadio renato dall'ara"}], 'capacity': [{'value': 38279, 'key': '38279'}]} 

{'team': [{'value': 'Cagliari', 'key': 'cagliari'}], 'city_name': [{'value': 'Cagliari', 'key': 'cagliari'}], 'stadium': [{'value': 'Sardegna Arena', 'key': 'sardegna arena'}], 'capacity': [{'value': 16233, 'key': '16233'}]} 

{'team': [{'value': 'Chievo', 'key': 'chievo'}], 'city_name': [{'value': 'Verona', 'key': 'verona'}], 'stadium': [{'value': "Stadio Marc'Antonio Bentegodi", 'key': "stadio marc'antonio bentegodi"}], 'capacity': [{'v

The extracted tables are now stored in your JSON document.
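
The capacity cell needs extra care because its text can carry a hidden sort key in front of the visible number (for example `7004213000000000000♠ 21,300`). The cell above handles this by splitting on spaces and commas. As an alternative sketch, not the tutorial's own code, a hypothetical `parse_capacity` helper can take the last run of digits and commas in the text:

```python
import re


def parse_capacity(text):
    """Return the capacity as an int, or None if the text holds no number.

    Uses the last run of digits and commas, so a sort key that precedes
    the visible number is ignored.
    """
    matches = re.findall(r"\d[\d,]*", text)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


print(parse_capacity("7004213000000000000\u2660 21,300"))
print(parse_capacity("900"))
print(parse_capacity(""))
```

This variant also copes with values below 1,000, which have no comma. A cell that ends in a footnote marker such as `[1]` would still need additional handling.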

Next, we construct a dict that maps each city name to all geonames records with that name and a population greater than 25,000.

In [4]:
file_name = './resources/cities_ppl>25000.json'
with open(file_name, 'r') as file:
    city_dataset = json.load(file)
city_list = list(city_dataset.keys())
print('There are', len(city_list), 'cities with population greater than or equal to 25,000.\n')
print('City list samples(20):\n')
print(city_list[:20])


There are 15117 cities with population greater than or equal to 25,000.

City list samples(20):

['Marion', 'Fes', 'Fes al Bali', 'Gravina in Puglia', 'Nawada', 'Pensacola', 'Pedro Betancourt', 'Uriangato', 'Fiditi', 'Wilkes-Barre', 'Kafue', 'Chipata', 'Sawangan', 'Tuxpan de Rodriguez Cano', 'Rosny-sous-Bois', 'Caete', 'Kafr ad Dawwar', 'Reynoldsburg', 'Simferopol', 'Ouargla']
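
The later cells index `city_dataset` by city name and then filter the matching records by `country`, so each key presumably maps to a list of records. Here is a minimal sketch of building such an index. It assumes a hypothetical flat list of records with an assumed `name` field; the Bergamo values come from the final output of this tutorial, while the two Paris entries are illustrative and carry no coordinates:

```python
from collections import defaultdict

# Illustrative input records; only the Bergamo values appear in this tutorial.
records = [
    {"name": "Bergamo", "population": 114162, "state": "Lombardia",
     "country": "Italy", "latitude": "45.69601", "longitude": "9.66721"},
    {"name": "Paris", "state": "Ile-de-France", "country": "France"},
    {"name": "Paris", "state": "Texas", "country": "United States"},
]


def build_city_index(records):
    """Group records by city name; cities that share a name share a key."""
    index = defaultdict(list)
    for record in records:
        # Store every field except the name, which becomes the dict key.
        fields = {key: value for key, value in record.items() if key != "name"}
        index[record["name"]].append(fields)
    return dict(index)


city_index = build_city_index(records)
print(list(city_index.keys()))
print(len(city_index["Paris"]))
```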


## Identifying the city names in geonames and linking to geonames 

There are many ways to do this step. We will do it using the ETK glossary extractor to illustrate how to use other extractors and how to chain the results of one extractor as input to other extractors.

Using data from the geonames.org web site, we prepared a list of all cities in the world with population greater than 25,000. We use this small glossary to make the code run faster, but you may want to try it with the full list of cities.

First, we need to load the glossary in ETK.
We use the default tokenizer to tokenize the strings and set `ngrams` to 3, so that multi-word city names of up to three tokens (for example, `Gravina in Puglia`) can match.

In [159]:
my_glossary_extractor = GlossaryExtractor(glossary=city_list, extractor_name='tutorial_glossary',
                                          tokenizer=etk.default_tokenizer, ngrams=3,
                                          case_sensitive=False)


Now we are going to use the glossary to extract from the `Home city` column all the strings that match names in geonames. This method will allow us to extract the geonames city name from cells that may contain extraneous information.
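
To illustrate the idea behind glossary matching, here is a plain-Python sketch that slides over the n-grams of a cell and keeps those found in the glossary. It is not ETK's implementation: the function name `glossary_matches` is hypothetical, and it splits on whitespace instead of using the ETK tokenizer:

```python
def glossary_matches(text, glossary, max_ngram=3):
    """Return glossary entries that occur in text, trying longer n-grams first.

    Matching is case-insensitive; tokens are produced by whitespace splitting.
    """
    lookup = {entry.lower(): entry for entry in glossary}
    tokens = text.split()
    found = []
    for n in range(min(max_ngram, len(tokens)), 0, -1):
        for start in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[start:start + n]).lower()
            if candidate in lookup:
                found.append(lookup[candidate])
    return found


glossary = ["Gravina in Puglia", "Marion", "Bergamo"]
print(glossary_matches("Gravina in Puglia", glossary))
print(glossary_matches("bergamo (Lombardy)", glossary))
```

Because the glossary decides what counts as a match, extra text in the cell is simply ignored.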


To run the glossary extractor over the `Home city` cells, we use a JSON path that selects this cell in each row document.

Our list of extractions has the names of cities that we know appear in geonames. Often, different cities in the world have the same name (e.g., Paris, France and Paris, Texas). To get the latitude and longitude, we need to identify the correct city. We know all the cities are in Italy, so we can easily filter by country.

In [160]:
hit_count = 0
for doc in docs:
    city_json_path = '$.cells[1].text'
    row_values = doc.select_segments(city_json_path)

    # use the city field of the doc, run the GlossaryExtractor
    extractions = doc.extract(my_glossary_extractor, row_values[0])
    if extractions:
        path = '$."' + extractions[0].value + '"[?(@.country == "Italy")]'
        jsonpath_expr = jex.parse(path)
        city_match = jsonpath_expr.find(city_dataset)
        if city_match:
            hit_count += 1

            # add corresponding values of city_dataset into knowledge graph of the doc
            for field in city_match[0].value:
                doc.kg.add_value(field, value=city_match[0].value[field])
print('There\'re', hit_count, 'hits for city_list.\n')
print('Final result sample:\n')
print(json.dumps(docs[0].kg.value, indent=2))


There're 138 hits for city_list.

Final result sample:

{
  "team": [
    {
      "value": "Atalanta",
      "key": "atalanta"
    }
  ],
  "city_name": [
    {
      "value": "Bergamo",
      "key": "bergamo"
    }
  ],
  "stadium": [
    {
      "value": "Stadio Atleti Azzurri d'Italia",
      "key": "stadio atleti azzurri d'italia"
    }
  ],
  "capacity": [
    {
      "value": 21300,
      "key": "21300"
    }
  ],
  "population": [
    {
      "value": 114162,
      "key": "114162"
    }
  ],
  "state": [
    {
      "value": "Lombardia",
      "key": "lombardia"
    }
  ],
  "country": [
    {
      "value": "Italy",
      "key": "italy"
    }
  ],
  "latitude": [
    {
      "value": "45.69601",
      "key": "45.69601"
    }
  ],
  "longitude": [
    {
      "value": "9.66721",
      "key": "9.66721"
    }
  ]
}
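
The country filter used above is written as a JSON path expression. The same disambiguation step can be sketched in plain Python, assuming a hand-written `city_index` that maps each name to a list of records and a hypothetical helper `pick_city`:

```python
# Illustrative index; the Paris entries exist only to show a name clash.
city_index = {
    "Bergamo": [{"state": "Lombardia", "country": "Italy",
                 "latitude": "45.69601", "longitude": "9.66721"}],
    "Paris": [{"state": "Ile-de-France", "country": "France"},
              {"state": "Texas", "country": "United States"}],
}


def pick_city(index, name, country="Italy"):
    """Return the first record for `name` located in `country`, or None."""
    for record in index.get(name, []):
        if record.get("country") == country:
            return record
    return None


print(pick_city(city_index, "Bergamo"))
print(pick_city(city_index, "Paris"))
```

Returning `None` when no record matches mirrors the `if city_match:` guard in the cell above: rows whose city cannot be resolved keep only the fields taken from the table.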


# Part 2: ETK Module

In [84]:
import os
import sys
import json
import requests
import jsonpath_ng.ext as jex
import re
from etk.etk import ETK
from etk.document import Document
from etk.etk_module import ETKModule
from etk.knowledge_graph_schema import KGSchema
from etk.utilities import Utility
from etk.extractors.table_extractor import TableExtractor
from etk.extractors.glossary_extractor import GlossaryExtractor


class ItalyTeamsModule(ETKModule):
    def __init__(self, etk):
        ETKModule.__init__(self, etk)
        self.my_table_extractor = TableExtractor()

        file_name = './resources/cities_ppl>25000.json'
        file = open(file_name, 'r')
        self.city_dataset = json.loads(file.read())
        file.close()
        self.city_list = list(self.city_dataset.keys())

        self.my_glossary_extractor = GlossaryExtractor(glossary=self.city_list, extractor_name='tutorial_glossary',
                                                       tokenizer=etk.default_tokenizer, ngrams=3,
                                                       case_sensitive=False)

    def process_document(self, cdr_doc: Document):
        new_docs = list()
        doc_json = cdr_doc.cdr_document

        if 'raw_content' in doc_json and doc_json['raw_content'].strip() != '':
            tables_in_page = self.my_table_extractor.extract(
                doc_json['raw_content'])[:14]
            for table in tables_in_page:

                # skipping the first row, the heading
                for row in table.value['rows'][1:]:
                    doc = self.etk.create_document(row)
                    all_json_path = '$.cells[0:4].text'
                    row_values = doc.select_segments(all_json_path)
                    # add the information we extracted in the knowledge graph of the doc.
                    doc.kg.add_value('team', value=row_values[0].value)
                    doc.kg.add_value('city_name', value=row_values[1].value)
                    doc.kg.add_value('stadium', value=row_values[2].value)
                    capacity_split = re.split(' |,', row_values[3].value)
                    if capacity_split[-1] != '':
                        capacity = int(capacity_split[-2] + capacity_split[-1]) if len(capacity_split) > 1 else int(
                            capacity_split[-1])
                        doc.kg.add_value('capacity', value=capacity)

                    city_json_path = '$.cells[1].text'
                    row_values = doc.select_segments(city_json_path)

                    # use the city field of the doc, run the GlossaryExtractor
                    extractions = doc.extract(
                        self.my_glossary_extractor, row_values[0])
                    if extractions:
                        path = '$."' + \
                            extractions[0].value + '"[?(@.country == "Italy")]'
                        jsonpath_expr = jex.parse(path)
                        city_match = jsonpath_expr.find(self.city_dataset)
                        if city_match:
                            # add corresponding values of city_dataset into knowledge graph of the doc
                            for field in city_match[0].value:
                                doc.kg.add_value(
                                    field, value=city_match[0].value[field])
                    new_docs.append(doc)
        return new_docs

    def document_selector(self, doc) -> bool:
        return doc.cdr_document.get("dataset") == "italy_team"


if __name__ == "__main__":
    url = 'https://en.wikipedia.org/wiki/List_of_football_clubs_in_Italy'
    response = requests.get(url)
    html_page = response.text
    cdr = {
        "raw_content": html_page,
        "url": url,
        "dataset": "italy_team"
    }
    kg_schema = KGSchema(json.load(open('./resources/master_config.json')))
    etk = ETK(modules=ItalyTeamsModule, kg_schema=kg_schema)
    etk.parser = jex.parse
    cdr_doc = Document(etk, cdr_document=cdr, mime_type='json', url=url)
    results = etk.process_ems(cdr_doc)[1:]
    print("Sample result:\n")
    print(json.dumps(results[0].value, indent=2))

Sample result:

{
  "cells": [
    {
      "cell": "<td><a href=\"/wiki/Atalanta_B.C.\" title=\"Atalanta B.C.\">Atalanta</a>\n</td>",
      "text": "Atalanta",
      "id": "row_1_col_0"
    },
    {
      "cell": "<td><a href=\"/wiki/Bergamo\" title=\"Bergamo\">Bergamo</a>\n</td>",
      "text": "Bergamo",
      "id": "row_1_col_1"
    },
    {
      "cell": "<td><a href=\"/wiki/Stadio_Atleti_Azzurri_d%27Italia\" title=\"Stadio Atleti Azzurri d'Italia\">Stadio Atleti Azzurri d'Italia</a>\n</td>",
      "text": "Stadio Atleti Azzurri d'Italia",
      "id": "row_1_col_2"
    },
    {
      "cell": "<td style=\"text-align:center;\"><span class=\"sortkey\" style=\"display:none\">7004213000000000000\u2660</span>21,300\n</td>",
      "text": "7004213000000000000\u2660 21,300",
      "id": "row_1_col_3"
    },
    {
      "cell": "<td><a href=\"/wiki/2017%E2%80%9318_Serie_A\" title=\"2017\u201318 Serie A\">7th in Serie A</a>\n</td>",
      "text": "7th in Serie A",
      "id": "row_1_col_4"
 