# ISI Datamart Demonstration
---
This demonstration illustrates the following capabilities:

- Entity linking to __[Wikidata](http://wikidata.org)__
- Augmentation with data from the __[Wikidata](http://wikidata.org)__ knowledge graph
- Augmentation with data from Excel, CSV and other structured sources
- Augmentation with Wikipedia tables
- Enriching __[Wikidata](http://wikidata.org)__

In [1]:
import sys, os
# sys.stdout = open(os.devnull, 'w')
from wikifier import utils
from datamart.entries_new import D3MDatamart, D3MJoinSpec
import pandas as pd
d3mDatamart = D3MDatamart()
# this is our original input dataset
inputs_ds_loc = "/Users/pszekely/Downloads/datamart_demo/DA_poverty_estimation/TRAIN/dataset_TRAIN/datasetDoc.json"
# pd.set_option('display.max_columns', None)
original_dataset = utils.load_d3m_dataset(inputs_ds_loc)




## Our Original Dataset
---
We start with a dataset that has the number of people in poverty in different counties in the United States. In this demo we are using data from Florida and Georgia only (so it runs faster).

In [2]:
#original_dataset['learningData'].head()
original_dataset['learningData'].sample(n=15)

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016
109,591,12103,FL,Pinellas County,1,125923
35,868,13109,GA,Evans County,6,2500
93,3023,13075,GA,Cook County,6,4268
56,1748,13065,GA,Clinch County,6,1752
3,36,13055,GA,Chattooga County,6,4716
107,525,12007,FL,Bradford County,6,4392
99,33,12015,FL,Charlotte County,3,22087
16,400,13099,GA,Early County,6,3210
60,1968,13227,GA,Pickens County,1,3447
34,826,13163,GA,Jefferson County,6,3867


## Linking To Wikidata
---
Wikidata contains over 80 million identifiers for entities. Datamart can scan a dataset, automatically identify columns containing entity identifiers, and link the identifiers to the appropriate entity in Wikidata.

Here is the wikified data. Clicking on the links takes you to the corresponding Wikidata pages, where you can see all the data available for each entity.

In [3]:
wikified_dataset = utils.wikifier_for_d3m_all(input_ds=original_dataset).value
# wikified_dataset['learningData']
utils.pretty_print(wikified_dataset,"wikifier")

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016,FIPS_wikidata,State_wikidata
0,1,13297,GA,Walton County,1,11385,Q498312,Q1428
1,2,13137,GA,Habersham County,6,6500,Q501096,Q1428
2,6,13059,GA,Clarke County,3,31950,Q112061,Q1428
3,36,13055,GA,Chattooga County,6,4716,Q486179,Q1428
4,46,13067,GA,Cobb County,1,73446,Q484247,Q1428
5,60,13105,GA,Elbert County,6,4197,Q492016,Q1428
6,82,13195,GA,Madison County,3,4255,Q156387,Q1428
7,92,13263,GA,Talbot County,8,1447,Q498356,Q1428
8,116,13211,GA,Morgan County,6,2358,Q493083,Q1428
9,143,13165,GA,Jenkins County,6,2606,Q389551,Q1428
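
To make the linking step concrete, here is a minimal sketch of how county FIPS codes could be looked up in Wikidata. It is not how `wikifier_for_d3m_all` is implemented. It assumes that counties are matched on the Wikidata property P882 (FIPS 6-4, US counties), and the function name `build_fips_lookup_query` is hypothetical. The code only builds the SPARQL query text and does not contact any endpoint.

```python
# Sketch only: builds a SPARQL query string; no network access is performed.
# Assumption: counties are matched on Wikidata property P882 (FIPS 6-4, US counties).

def build_fips_lookup_query(fips_codes):
    """Return a SPARQL query that maps 5-digit county FIPS codes to Wikidata items."""
    # Zero-pad so that codes read as integers (e.g. 1001) become "01001".
    values = " ".join('"{:0>5}"'.format(str(code)) for code in fips_codes)
    return (
        "SELECT ?fips ?county WHERE {\n"
        "  VALUES ?fips { " + values + " }\n"
        "  ?county wdt:P882 ?fips .\n"
        "}"
    )


# FIPS codes taken from the poverty dataset shown earlier.
query = build_fips_lookup_query([13297, 13137, 12103])
print(query)
```

Each `?county` binding returned by such a query would be an item identifier of the kind shown in the `FIPS_wikidata` column.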


## Searching Datamart Using Our Wikified Data
---
Datamart finds multiple datasets that can be used to augment the poverty dataset. The results show the title, the columns available in each dataset, and the columns that will be used to join the candidate dataset to the poverty dataset.

In [4]:
search_results = d3mDatamart.search_with_data(supplied_data=wikified_dataset)
# wiki_search_results.display()
utils.print_search_results(search_results)

Unnamed: 0,title,columns,join columns
0,wikidata search result forFIPS_wikidata,"population ,area ,inception ,violent crime off...",FIPS_wikidata
1,wikidata search result forState_wikidata,"population ,motto text ,demonym ,native label ...",State_wikidata
2,Unemployment and median household income for t...,"FIPStxt ,State ,Area_name ,Rural_urban_continu...",[Area]
3,"Poverty estimates for the U.S., States, and co...","FIPStxt ,State ,Area_Name ,Rural-urban_Continu...",[Area]
4,Educational attainment for adults age 25 and o...,"FIPS Code ,State ,Area name ,2003 Rural-urban ...",[Area]
5,Educational attainment for adults age 25 and o...,"FIPS Code ,State ,Area name ,2003 Rural-urban ...",[FIPS_wikidata]
6,"Poverty estimates for the U.S., States, and co...","FIPStxt ,State ,Area_Name ,Rural-urban_Continu...",[FIPS_wikidata]
7,Unemployment and median household income for t...,"FIPStxt ,State ,Area_name ,Rural_urban_continu...",[FIPS_wikidata]
8,"Population estimates for the U.S., States, and...","FIPS ,State ,Area_Name ,Rural-urban_Continuum ...",[Area]
9,"Population estimates for the U.S., States, and...","FIPS ,State ,Area_Name ,Rural-urban_Continuum ...",[FIPS_wikidata]
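
One plausible way to propose join columns such as `[Area]` is to measure how many values of a supplied column also occur in each candidate column. The sketch below illustrates that idea only; it is an assumption, not Datamart's actual matching algorithm, and the helper names and the candidate table are hypothetical.

```python
def value_overlap(supplied_values, candidate_values):
    """Fraction of distinct supplied values that also occur in the candidate column."""
    supplied = {str(v).strip().lower() for v in supplied_values}
    candidate = {str(v).strip().lower() for v in candidate_values}
    if not supplied:
        return 0.0
    return len(supplied & candidate) / len(supplied)


def rank_join_columns(supplied_column, candidate_table):
    """Rank the columns of a candidate table by overlap with the supplied column."""
    scores = {
        name: value_overlap(supplied_column, values)
        for name, values in candidate_table.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# "Area" values from the poverty dataset; the candidate table is made up for illustration.
area = ["Pinellas County", "Evans County", "Cook County"]
candidate = {
    "State": ["GA", "GA", "GA"],
    "Area_name": ["Evans County", "Cook County", "Fulton County"],
}
ranked = rank_join_columns(area, candidate)
print(ranked[0][0])  # column with the highest overlap
```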


## Using Search Results To Augment Your Data
---
The first search result, from Wikidata, augments our data using population, area and inception date of counties.

In [5]:
wiki_search_result = search_results[0]
augmented_dataset = wiki_search_result.augment(supplied_data=wikified_dataset)
utils.pretty_print(augmented_dataset,"wiki_augment")

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016,FIPS_wikidata,State_wikidata,Aggravated assault,Burglary,Larceny-theft,Motor vehicle theft,Property crime,Robbery,area,inception,murder and non-negligent manslaughter,population,violent crime offenses
0,1,13297,GA,Walton County,1,11385,Q498312,Q1428,55.0,171.0,542.0,82.0,795.0,10.0,,1818-01-01T00:00:00Z,0.0,85754,73.0
1,2,13137,GA,Habersham County,6,6500,Q501096,Q1428,,,,,,,723.0,1818-12-15T00:00:00Z,,43300,
2,6,13059,GA,Clarke County,3,31950,Q112061,Q1428,0.0,0.0,0.0,0.0,0.0,0.0,314.0,1801-01-01T00:00:00Z,0.0,121265,0.0
3,36,13055,GA,Chattooga County,6,4716,Q486179,Q1428,21.0,101.0,233.0,12.0,346.0,3.0,812.0,1838-01-01T00:00:00Z,0.0,25138,26.0
4,46,13067,GA,Cobb County,1,73446,Q484247,Q1428,66.0,8.0,31.0,0.0,39.0,8.0,881.0,1832-12-02T00:00:00Z,0.0,717190,79.0
5,60,13105,GA,Elbert County,6,4197,Q492016,Q1428,26.0,99.0,166.0,30.0,295.0,0.0,970.0,1790-12-10T00:00:00Z,0.0,19599,27.0
6,82,13195,GA,Madison County,3,4255,Q156387,Q1428,43.0,125.0,270.0,41.0,436.0,8.0,740.0,1811-12-05T00:00:00Z,0.0,28057,58.0
7,92,13263,GA,Talbot County,8,1447,Q498356,Q1428,,,,,,,1022.0,1827-12-14T00:00:00Z,,6456,
8,116,13211,GA,Morgan County,6,2358,Q493083,Q1428,,,,,,,918.0,1862-01-01T00:00:00Z,,17781,
9,143,13165,GA,Jenkins County,6,2606,Q389551,Q1428,,,,,,,913.0,1905-08-17T00:00:00Z,,9269,
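
Conceptually, augmentation behaves like a left join on the join column: every original row is kept, and counties without a value for a property get an empty cell. The following is a minimal sketch of that behavior in plain Python. The function `left_join` is hypothetical and is not part of the Datamart API; the values are copied from the output above.

```python
def left_join(rows, lookup, key, new_columns):
    """Keep every input row; add new_columns from lookup, or None when missing."""
    joined = []
    for row in rows:
        extra = lookup.get(row[key], {})
        merged = dict(row)
        for column in new_columns:
            merged[column] = extra.get(column)  # None when no match or no value
        joined.append(merged)
    return joined


rows = [
    {"Area": "Chattooga County", "FIPS_wikidata": "Q486179"},
    {"Area": "Talbot County", "FIPS_wikidata": "Q498356"},
]
# Property values keyed by Wikidata identifier; Talbot County has no crime value.
lookup = {
    "Q486179": {"population": 25138, "violent crime offenses": 26.0},
    "Q498356": {"population": 6456},
}
joined = left_join(rows, lookup, "FIPS_wikidata", ["population", "violent crime offenses"])
for row in joined:
    print(row)
```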


## Repeat Augmentation With Additional Search Results From Wikidata
---
The second search result, also from Wikidata, augments our data with state-level information. This adds columns describing each county's state.

In [6]:
augmented_dataset = search_results[1].augment(supplied_data=augmented_dataset)
utils.pretty_print(augmented_dataset,"wiki_augment")

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016,FIPS_wikidata,State_wikidata,Aggravated assault,Burglary,Larceny-theft,Motor vehicle theft,Property crime,Robbery,area,inception,murder and non-negligent manslaughter,population,violent crime offenses,elevation above sea level,motto text,native label,short name,water as percent of area
0,2649,13233,GA,Polk County,6,7609,Q498395,Q1428,,,,,,,808.0,1851-12-20T00:00:00Z,,41183,,180,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
1,413,13305,GA,Wayne County,6,6217,Q491762,Q1428,,,,,,,649.0,1803-05-11T00:00:00Z,,30077,,180,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
2,116,13211,GA,Morgan County,6,2358,Q493083,Q1428,,,,,,,918.0,1862-01-01T00:00:00Z,,17781,,180,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
3,548,13311,GA,White County,8,4030,Q389365,Q1428,,,,,,,242.0,1857-01-01T00:00:00Z,,27797,,180,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
4,2856,13045,GA,Carroll County,1,16713,Q493088,Q1428,124.0,484.0,786.0,128.0,1398.0,21.0,1305.0,1825-06-09T00:00:00Z,1.0,112355,165.0,180,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
5,2820,13177,GA,Lee County,3,3190,Q491508,Q1428,28.0,112.0,452.0,12.0,576.0,8.0,938.0,1825-06-09T00:00:00Z,0.0,29071,40.0,180,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
6,1570,13309,GA,Wheeler County,9,2111,Q498332,Q1428,,,,,,,,1912-08-14T00:00:00Z,,7909,,180,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
7,1205,13091,GA,Dodge County,7,4730,Q115272,Q1428,19.0,86.0,187.0,11.0,284.0,2.0,1303.0,1870-01-01T00:00:00Z,0.0,21221,22.0,180,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
8,1045,13119,GA,Franklin County,8,4614,Q385931,Q1428,,,,,,,690.0,1784-02-25T00:00:00Z,,22009,,180,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
9,1614,13145,GA,Harris County,2,2883,Q486133,Q1428,39.0,90.0,119.0,18.0,227.0,1.0,1225.0,1827-12-14T00:00:00Z,0.0,32663,41.0,180,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22



## Download Data
---
Some of the search results are datasets indexed from the web. They appear in the search results because they can be joined with our wikified dataset.

Let's inspect a few of these datasets.

__Search result 6 (zero-based index) contains poverty information.__

In [7]:
%%script false

downloaded_dataset = search_results[6].download(supplied_data=wikified_dataset)
utils.pretty_print(downloaded_dataset)
# utils.pretty_print(downloaded_dataset,"download")

CalledProcessError: Command 'b'\ndownloaded_dataset = search_results[6].download(supplied_data=wikified_dataset)\nutils.pretty_print(downloaded_dataset)\n# utils.pretty_print(downloaded_dataset,"download")\n'' returned non-zero exit status 1.

__Search result 7 (zero-based index) contains unemployment information.__

In [None]:
%%script false

downloaded_dataset = search_results[7].download(supplied_data=wikified_dataset)
# utils.pretty_print(downloaded_dataset)
utils.pretty_print(downloaded_dataset,"download")

## Augment With Datasets From The Web
---
Datasets from the web can also be used to augment our original data.

Let's augment with the poverty data, since it is useful for predicting the number of people in poverty. Many new columns appear at the end.

In [None]:
augmented_dataset = search_results[6].augment(supplied_data=augmented_dataset)
utils.pretty_print(augmented_dataset,"wiki_augment")

## Discovering And Using More Data
---
Crime data may be useful to predict poverty, but no crime data is currently available in Datamart.

Searching in Google for `fbi crime statistics by county` produces this search result:

[<img src="images/google-search-fbi.png" alt="Google Search Result" title="Google Search Result" /> ](https://ucr.fbi.gov/crime-in-the-u.s)

After navigating to this page, click on `2016`, then `Crime in the U.S. 2016`, then `Violent Crime`. You can explore the various crime datasets. Let's choose `Table 8`, which has crime data for all states, broken down by county. For example, the Georgia [page](https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016/tables/table-8/table-8-state-cuts/georgia.xls) contains crime data for counties in Georgia.

This crime data can be downloaded as an Excel file using the `Download Excel` [link](https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016/tables/table-8/table-8-state-cuts/georgia.xls/output.xls).

<img src="images/fbi-crime-data-georgia.png" alt="Georgia Crime Data" title="Georgia Crime Data" />

---

Challenges for using this data:
- The data for each state is in a separate file
- The column headers start in row 6
- The spreadsheet has notes at the end, and the notes start in different rows for different states
- The name of the state and the year are in the metadata rows (rows 2 and 4)
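
The challenges listed above can be handled by hand as follows. This is a sketch under stated assumptions, not the table understanding software: it reads CSV text rather than the Excel file, the function `parse_state_sheet` is hypothetical, the sample sheet is a made-up placeholder that only mimics the layout (state in row 2, year in row 4, headers in row 6, notes at the end), and it assumes a notes row can be recognized by having a different number of cells than the header.

```python
import csv
import io

# 1-based row positions, as described in the list above.
STATE_ROW = 2
YEAR_ROW = 4
HEADER_ROW = 6


def parse_state_sheet(text):
    """Extract data rows from one state sheet and attach the state and year."""
    rows = list(csv.reader(io.StringIO(text)))
    state = rows[STATE_ROW - 1][0].strip()
    year = rows[YEAR_ROW - 1][0].strip()
    header = [cell.strip() for cell in rows[HEADER_ROW - 1]]
    records = []
    for cells in rows[HEADER_ROW:]:
        # Assumption: the notes begin at the first row that does not fit the header.
        if len(cells) != len(header):
            break
        record = dict(zip(header, (cell.strip() for cell in cells)))
        record["State"] = state
        record["Year"] = year
        records.append(record)
    return records


# Placeholder content, NOT real FBI data.
SAMPLE = "\n".join([
    "Placeholder title",
    "GEORGIA",
    "Placeholder subtitle",
    "2016",
    "",
    "County,Violent crime,Burglary",
    "County A,10,20",
    "County B,5,7",
    "1 Placeholder footnote text.",
])
records = parse_state_sheet(SAMPLE)
print(records)
```

Because every record carries its own `State` and `Year`, records parsed from different state files can be concatenated into a single list.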

## Automatic Table Understanding (Poster)
---
The automatic table understanding software performs the following tasks on spreadsheets and CSV files:
- Identifies the type of each cell (data, header, attribute, global metadata)
- Segments the table into blocks
- Identifies relationships among blocks

__Add image here with blocks of GEORGIA table__

## Augmenting Wikidata With Data Extracted From Tables
--- 
After the table understanding step, the data can be indexed in Datamart and used for augmentation. The challenge with the FBI crime data is that the data for each state is in a separate file. Augmentation of our original dataset requires combining the data from multiple files.

Datamart addresses this challenge by mapping the table data to Wikidata and uploading the data to Datamart's Wikidata clone, where it can be queried regardless of which file it came from.

### Adding Crime Properties To Wikidata
Wikidata provides a [user interface](https://test.wikidata.org/wiki/Special:NewProperty) to define properties.

We defined [properties to represent crime data](http://tinyurl.com/y5g7juu6).

### Download The FBI Crime Data

In [None]:
# download_fbi_crime_data("Georgia", "Florida")
# this script should print the URL of each downloaded file.

### Use DIG To Map The Spreadsheets To Wikidata
A DIG script converts the spreadsheet data to Wikidata using a simple API for augmenting Wikidata.

In [None]:
# extract_fbi_crime_data_to_wikidata("Georgia", "Florida")
# this script should print a line after processing each state, like the following
# Generated Wikidata RDF triples for Georgia
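
The mapping step can be pictured as turning each spreadsheet row into one statement per crime column, attached to the county's Wikidata item. The sketch below is not the DIG API and does not emit real RDF. The property identifiers `P_BURGLARY` and `P_ROBBERY` are hypothetical placeholders for the crime properties defined earlier, and `row_to_statements` is a hypothetical helper.

```python
# Hypothetical placeholders; the real property identifiers are defined in Wikidata.
CRIME_PROPERTIES = {"Burglary": "P_BURGLARY", "Robbery": "P_ROBBERY"}


def row_to_statements(county_qid, row, year):
    """Return (item, property, value, year) tuples for the crime columns of one row."""
    statements = []
    for column, prop in CRIME_PROPERTIES.items():
        value = row.get(column)
        if value in (None, ""):
            continue  # skip cells with no reported value
        statements.append((county_qid, prop, float(value), year))
    return statements


# Chattooga County (Q486179), with values from the augmented output shown earlier.
statements = row_to_statements("Q486179", {"Burglary": "101", "Robbery": "3"}, 2016)
for statement in statements:
    print(statement)
```

Since each statement is keyed by the county item rather than by the source file, data from all state files ends up in one queryable graph.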

### Upload The RDF Triples To Wikidata

In [None]:
# upload_wikidata_triples()

The uploaded FBI data can be visualized, taking advantage of latitude/longitude coordinates present in Wikidata:

- [Map](http://tinyurl.com/y2a7b7a2) of crime data by county, colored by the file where the data was present
- [Map](http://tinyurl.com/y2k66mcd) of crime data by county, colored by severity
- [Map](http://tinyurl.com/yxwh24vr) of crime data by county, colored by severity, per 100,000 inhabitants
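
The per-100,000 normalization used in the last map divides the offense count by the population and scales by 100,000. Here is a small sketch using counts and populations from the augmented output shown earlier; the function name `rate_per_100k` is an assumption for illustration.

```python
def rate_per_100k(count, population):
    """Offenses per 100,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population * 100_000


# Violent crime offenses and population from the augmented dataset.
chattooga = rate_per_100k(26.0, 25138)
cobb = rate_per_100k(79.0, 717190)
print(round(chattooga, 1), round(cobb, 1))
```

Cobb County has more offenses in absolute terms, but Chattooga County has the higher rate once population is taken into account, which is why the normalized map can look quite different from the raw one.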


## Search Datamart Again
---
The FBI data is now available as new columns to augment the poverty data.

In [None]:
search_results = d3mDatamart.search_with_data(supplied_data=wikified_dataset)
# wiki_search_results.display()
utils.print_search_results(search_results)

## Augment Using The FBI Data
---
The first search result has the FBI data.

In [None]:
wiki_search_result = search_results[0]
fbi_augmented_dataset = wiki_search_result.augment(supplied_data=wikified_dataset)
utils.pretty_print(fbi_augmented_dataset,"wiki_augment")

__Remove the FBI data from the Datamart Wikidata installation__

In [None]:
# remove_datamart_triples()