# ISI Datamart Demonstration
---
This demonstration illustrates the following capabilities:

- Entity linking to [Wikidata](http://wikidata.org), [Wikidata statistics](https://www.wikidata.org/wiki/Special:Statistics), [statistics graphs](https://grafana.wikimedia.org/d/000000175/wikidata-datamodel-statements?refresh=30m&orgId=1), [most viewed pages](https://tools.wmflabs.org/topviews/?project=wikidata.org&platform=all-access&date=last-year&excludes=)
- Augmentation with data from the Wikidata knowledge graph
- Augmentation with data from Excel, CSV and other structured sources
- Enriching Wikidata 

In [1]:
import sys, os
# sys.stdout = open(os.devnull, 'w')
from wikifier import utils
from datamart.entries_new import D3MDatamart, D3MJoinSpec
import pandas as pd
d3mDatamart = D3MDatamart()
# this is our original input dataset
inputs_ds_loc = "/Users/pszekely/Downloads/datamart_demo/DA_poverty_estimation/TRAIN/dataset_TRAIN/datasetDoc.json"
# pd.set_option('display.max_columns', None)
original_dataset = utils.load_d3m_dataset(inputs_ds_loc)




## Our Original Dataset
---
We start with a dataset that has the number of people in poverty in different counties in the United States. In this demo we are using data from Florida and Georgia only (so it runs faster).

In [2]:
#original_dataset['learningData'].head()
original_dataset['learningData'].sample(n=15)

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016
122,1715,12017,FL,Citrus County,4,23472
15,329,13197,GA,Marion County,2,1975
53,1570,13309,GA,Wheeler County,9,2111
75,2445,13005,GA,Bacon County,7,2524
21,505,13255,GA,Spalding County,1,14243
123,1724,12051,FL,Hendry County,4,9712
79,2591,13027,GA,Brooks County,3,3854
102,94,12059,FL,Holmes County,6,4276
13,273,13125,GA,Glascock County,9,539
32,791,13277,GA,Tift County,4,8669


## Linking To Wikidata
---
Wikidata contains over 80 million identifiers for entities. Datamart can scan a dataset, automatically identify columns containing entity identifiers, and link the identifiers to the appropriate entity in Wikidata.

Here is the wikified data. Clicking on the links takes you to the corresponding Wikidata pages, where you can see all the data available for each entity.

In [3]:
wikified_dataset = utils.wikifier_for_d3m_all(input_ds=original_dataset).value
# wikified_dataset['learningData']
utils.pretty_print(wikified_dataset,"wikifier")

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016,FIPS_wikidata,State_wikidata
0,1,13297,GA,Walton County,1,11385,Q498312,Q58428702
1,2,13137,GA,Habersham County,6,6500,Q501096,Q58428702
2,6,13059,GA,Clarke County,3,31950,Q112061,Q58428702
3,36,13055,GA,Chattooga County,6,4716,Q486179,Q58428702
4,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702
5,60,13105,GA,Elbert County,6,4197,Q492016,Q58428702
6,82,13195,GA,Madison County,3,4255,Q156387,Q58428702
7,92,13263,GA,Talbot County,8,1447,Q498356,Q58428702
8,116,13211,GA,Morgan County,6,2358,Q493083,Q58428702
9,143,13165,GA,Jenkins County,6,2606,Q389551,Q58428702
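
The `FIPS_wikidata` and `State_wikidata` columns hold Wikidata Q-identifiers. As a minimal sketch, an identifier can be turned into the URL of the page it links to. The helper name `wikidata_url` is illustrative and is not part of the Datamart or wikifier API; the identifiers are taken from the table above.

```python
import re

# Hypothetical helper for illustration; not part of Datamart or the wikifier.
def wikidata_url(qnode: str) -> str:
    """Return the Wikidata page URL for an item identifier such as 'Q498312'."""
    if not re.fullmatch(r"Q[1-9]\d*", qnode):
        raise ValueError(f"not a Wikidata item identifier: {qnode!r}")
    return f"https://www.wikidata.org/wiki/{qnode}"

# Two rows copied from the wikified output above.
rows = [
    {"Area": "Walton County", "FIPS_wikidata": "Q498312"},
    {"Area": "Cobb County", "FIPS_wikidata": "Q484247"},
]

# Map each county name to the page for its linked entity.
links = {row["Area"]: wikidata_url(row["FIPS_wikidata"]) for row in rows}
for area, url in links.items():
    print(area, url)
```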


## Searching Datamart Using Our Wikified Data
---
Datamart finds multiple datasets that can be used to augment the poverty dataset. The results show the title, the columns available in each dataset, and the columns that will be used to join the candidate dataset to the poverty dataset.

In [4]:
search_results = d3mDatamart.search_with_data(supplied_data=wikified_dataset)
# wiki_search_results.display()
utils.print_search_results(search_results)

Unnamed: 0,title,columns,join columns
0,wikidata search result for FIPS_wikidata,"population, area, inception, violent crime off...",FIPS_wikidata
1,wikidata search result for State_wikidata,publication date,State_wikidata
2,Unemployment and median household income for t...,"FIPStxt, State, Area_name, Rural_urban_continu...",[Area]
3,Unemployment and median household income for t...,"FIPStxt, State, Area_name, Rural_urban_continu...",[Area]
4,"Poverty estimates for the U.S., States, and co...","FIPStxt, State, Area_Name, Rural-urban_Continu...",[Area]
5,"Poverty estimates for the U.S., States, and co...","FIPStxt, State, Area_Name, Rural-urban_Continu...",[Area]
6,Educational attainment for adults age 25 and o...,"FIPS Code, State, Area name, 2003 Rural-urban ...",[Area]
7,Educational attainment for adults age 25 and o...,"FIPS Code, State, Area name, 2003 Rural-urban ...",[Area]
8,PopulationEstimates with q nodes,"FIPS, State, Area_Name, Rural-urban_Continuum ...",[Area]
9,PopulationEstimates with q nodes,"FIPS, State, Area_Name, Rural-urban_Continuum ...",[Area]


## Using Search Results To Augment Your Data
---
The first search result, from Wikidata, augments our data with the population, area, and inception date of each county.

In [5]:
wiki_search_result = search_results[0]
augmented_dataset = wiki_search_result.augment(supplied_data=wikified_dataset)
utils.pretty_print(augmented_dataset,"wiki_augment")

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016,FIPS_wikidata,State_wikidata,Aggravated assault,Burglary,Larceny-theft,Motor vehicle theft,Property crime,Robbery,area,inception,murder and non-negligent manslaughter,population,violent crime offenses
0,1,13297,GA,Walton County,1,11385,Q498312,Q58428702,55.0,171.0,542.0,82.0,795.0,10.0,,1818-01-01T00:00:00Z,0.0,85754,73.0
1,2,13137,GA,Habersham County,6,6500,Q501096,Q58428702,,,,,,,723.0,1818-12-15T00:00:00Z,,43300,
2,6,13059,GA,Clarke County,3,31950,Q112061,Q58428702,0.0,0.0,0.0,0.0,0.0,0.0,314.0,1801-01-01T00:00:00Z,0.0,121265,0.0
3,36,13055,GA,Chattooga County,6,4716,Q486179,Q58428702,21.0,101.0,233.0,12.0,346.0,3.0,812.0,1838-01-01T00:00:00Z,0.0,25138,26.0
4,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702,697.0,8.0,31.0,935.0,39.0,461.0,881.0,1832-12-02T00:00:00Z,16.0,717190,1262.0
5,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702,697.0,2184.0,8490.0,0.0,39.0,461.0,881.0,1832-12-02T00:00:00Z,16.0,717190,79.0
6,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702,66.0,8.0,31.0,935.0,11609.0,461.0,881.0,1832-12-02T00:00:00Z,16.0,717190,79.0
7,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702,66.0,2184.0,31.0,0.0,11609.0,461.0,881.0,1832-12-02T00:00:00Z,0.0,717190,79.0
8,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702,66.0,2184.0,8490.0,935.0,11609.0,8.0,881.0,1832-12-02T00:00:00Z,0.0,717190,79.0
9,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702,697.0,8.0,8490.0,0.0,11609.0,461.0,881.0,1832-12-02T00:00:00Z,16.0,717190,1262.0
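
Conceptually, augmentation behaves like a left join of the supplied data with the candidate dataset on the join column. The sketch below is not Datamart's implementation: `left_join` and the small sample records are illustrative, with values copied from the output above. It keeps at most one candidate record per key, so the result has exactly one row per supplied row; a join against a candidate with several records per key would repeat supplied rows.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

def left_join(supplied: List[Row], candidate: List[Row], key: str) -> List[Row]:
    """Append the candidate's columns to each supplied row that shares `key`."""
    # Index the candidate by join key, keeping the first record seen per key.
    by_key: Dict[Any, Row] = {}
    for row in candidate:
        by_key.setdefault(row[key], row)
    # Every non-key column found in the candidate becomes a new column.
    new_columns = sorted({col for row in candidate for col in row if col != key})
    joined = []
    for row in supplied:
        match = by_key.get(row[key], {})
        merged = dict(row)
        for col in new_columns:
            merged[col] = match.get(col)  # None when there is no match or no value
        joined.append(merged)
    return joined

supplied = [
    {"Area": "Walton County", "FIPS_wikidata": "Q498312"},
    {"Area": "Habersham County", "FIPS_wikidata": "Q501096"},
    {"Area": "Clarke County", "FIPS_wikidata": "Q112061"},
]
# Clarke County is left out of the candidate on purpose to show an unmatched row.
candidate = [
    {"FIPS_wikidata": "Q498312", "population": 85754},
    {"FIPS_wikidata": "Q501096", "population": 43300, "area": 723.0},
]

augmented = left_join(supplied, candidate, "FIPS_wikidata")
for row in augmented:
    print(row)
```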


## Repeat Augmentation With Additional Search Results From Wikidata
---
The second search result, also from Wikidata, augments our data using data from states. This adds columns with information about the states.

In [6]:
augmented_dataset = search_results[1].augment(supplied_data=augmented_dataset)
utils.pretty_print(augmented_dataset,"wiki_augment")

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016,FIPS_wikidata,State_wikidata,Aggravated assault,Burglary,Larceny-theft,Motor vehicle theft,Property crime,Robbery,area,inception,murder and non-negligent manslaughter,population,violent crime offenses
0,1319,13127,GA,Glynn County,3,15916,Q487016,Q58428702,153.0,0.0,0.0,0.0,0.0,36.0,1516,1777-02-05T00:00:00Z,2.0,81508,0.0
1,825,13151,GA,Henry County,1,21101,Q492053,Q58428702,168.0,994.0,2697.0,445.0,4136.0,0.0,840,1821-05-15T00:00:00Z,0.0,211128,0.0
2,182,13115,GA,Floyd County,3,14596,Q486389,Q58428702,11.0,21.0,754.0,4.0,81.0,11.0,1343,1832-12-03T00:00:00Z,1.0,95821,84.0
3,646,13285,GA,Troup County,4,13731,Q498295,Q58428702,,,,,,,1155,1825-06-08T00:00:00Z,,69053,
4,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702,66.0,2184.0,31.0,0.0,11609.0,461.0,881,1832-12-02T00:00:00Z,16.0,717190,1262.0
5,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702,697.0,8.0,31.0,935.0,39.0,8.0,881,1832-12-02T00:00:00Z,16.0,717190,79.0
6,182,13115,GA,Floyd County,3,14596,Q486389,Q58428702,55.0,351.0,754.0,4.0,1216.0,11.0,1343,1832-12-03T00:00:00Z,1.0,95821,16.0
7,1319,13127,GA,Glynn County,3,15916,Q487016,Q58428702,0.0,0.0,0.0,114.0,1852.0,0.0,1516,1777-02-05T00:00:00Z,0.0,81508,201.0
8,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702,697.0,8.0,8490.0,935.0,11609.0,461.0,881,1832-12-02T00:00:00Z,0.0,717190,79.0
9,825,13151,GA,Henry County,1,21101,Q492053,Q58428702,168.0,0.0,135.0,445.0,135.0,120.0,840,1821-05-15T00:00:00Z,0.0,211128,0.0



## Download Data
---
Some of the search results are datasets indexed from the web. They appear in the search results because they can be joined with our wikified dataset.

Let's inspect a few of these datasets.

__Download the search result that contains poverty information.__

Datamart computes a foreign key to join the downloaded dataset with the supplied data (last column).

In [7]:
downloaded_dataset = search_results[12].download(supplied_data=wikified_dataset)
# utils.pretty_print(downloaded_dataset)
utils.pretty_print(downloaded_dataset,"download")

Unnamed: 0,FIPStxt,State,Area_Name,Rural-urban_Continuum_Code_2003,Urban_Influence_Code_2003,Rural-urban_Continuum_Code_2013,Urban_Influence_Code_2013,POVALL_2017,CI90LBAll_2017,CI90UBALL_2017,PCTPOVALL_2017,CI90LBALLP_2017,CI90UBALLP_2017,POV017_2017,CI90LB017_2017,CI90UB017_2017,PCTPOV017_2017,CI90LB017P_2017,CI90UB017P_2017,POV517_2017,CI90LB517_2017,CI90UB517_2017,PCTPOV517_2017,CI90LB517P_2017,CI90UB517P_2017,MEDHHINC_2017,CI90LBINC_2017,CI90UBINC_2017,POV04_2017,CI90LB04_2017,CI90UB04_2017,PCTPOV04_2017,CI90LB04P_2017,CI90UB04P_2017,FIPStxt_wikidata,joining_pairs
391,12121,FL,Suwannee County,6,6,6,6,8299,6497,10101,20.3,15.9,24.7,2728,2061,3395,30.0,22.7,37.3,1884,1384,2384,28.2,20.7,35.7,44144,41156,47132,,,,,,,Q501036,[132]
361,12063,FL,Jackson County,6,6,6,6,7264,5523,9005,18.0,13.7,22.3,2117,1528,2706,24.0,17.3,30.7,1565,1118,2012,24.5,17.5,31.5,41524,37686,45362,,,,,,,Q488537,[131]
378,12095,FL,Orange County,1,1,1,1,201528,188575,214481,15.3,14.3,16.3,65440,58650,72230,21.9,19.6,24.2,44755,39282,50228,20.8,18.3,23.3,54021,52629,55413,,,,,,,Q488543,[130]
371,12083,FL,Marion County,2,2,2,2,55880,49007,62753,16.2,14.2,18.2,17133,13930,20336,26.3,21.4,31.2,12253,9813,14693,25.8,20.7,30.9,43772,42127,45417,,,,,,,Q501014,[129]
357,12055,FL,Highlands County,4,5,3,2,20051,17160,22942,19.8,16.9,22.7,5964,4855,7073,34.0,27.7,40.3,4240,3391,5089,33.1,26.5,39.7,37445,34115,40775,,,,,,,Q488885,[128]
343,12027,FL,DeSoto County,6,5,6,5,8766,7195,10337,26.1,21.4,30.8,2505,2009,3001,37.2,29.8,44.6,1801,1429,2173,36.7,29.1,44.3,37342,33967,40717,,,,,,,Q488796,[127]
345,12031,FL,Duval County,1,1,1,1,138069,128011,148127,15.1,14.0,16.2,48079,42653,53505,23.0,20.4,25.6,31702,27402,36002,21.7,18.8,24.6,52105,50735,53475,,,,,,,Q493605,[126]
350,12041,FL,Gilchrist County,3,2,2,2,2675,2067,3283,16.1,12.4,19.8,871,650,1092,24.4,18.2,30.6,632,464,800,24.8,18.2,31.4,42880,38671,47089,,,,,,,Q111720,[125]
379,12097,FL,Osceola County,1,1,1,1,48892,41590,56194,14.0,11.9,16.1,18351,14938,21764,21.4,17.4,25.4,12571,9944,15198,19.8,15.7,23.9,49284,46191,52377,,,,,,,Q501067,[124]
355,12051,FL,Hendry County,4,5,4,3,9525,7776,11274,23.9,19.5,28.3,3741,2934,4548,34.7,27.2,42.2,2662,2085,3239,34.8,27.3,42.3,38361,34378,42344,,,,,,,Q488488,[123]
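
The `joining_pairs` column appears to hold, for each downloaded row, the index of the matching row in the supplied data: Hendry County is row 123 in the sample of the original dataset and carries `[123]` here. Under that reading, which is an assumption drawn from the outputs rather than from Datamart's documentation, the pairs can be computed by matching Q-identifiers. The function `joining_pairs` below is a hypothetical helper, not the Datamart routine.

```python
from collections import defaultdict

def joining_pairs(supplied, downloaded, supplied_key, downloaded_key):
    """For each downloaded row, list the supplied row indices with the same key value."""
    # Group supplied row indices by the value of their join column.
    positions = defaultdict(list)
    for index, row in supplied.items():
        positions[row[supplied_key]].append(index)
    # .get() does not create entries, so unmatched rows simply yield [].
    return [list(positions.get(row[downloaded_key], [])) for row in downloaded]

# Supplied rows keyed by their row index; a partial sample taken from the tables above.
supplied = {
    0: {"Area": "Walton County", "FIPS_wikidata": "Q498312"},
    4: {"Area": "Cobb County", "FIPS_wikidata": "Q484247"},
    123: {"Area": "Hendry County", "FIPS_wikidata": "Q488488"},
}
downloaded = [
    {"Area_Name": "Hendry County", "FIPStxt_wikidata": "Q488488"},
    {"Area_Name": "Suwannee County", "FIPStxt_wikidata": "Q501036"},
]

pairs = joining_pairs(supplied, downloaded, "FIPS_wikidata", "FIPStxt_wikidata")
# Suwannee County has no match only because this supplied sample is partial.
print(pairs)
```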


__Download the search result that contains unemployment information.__

In [8]:
downloaded_dataset = search_results[13].download(supplied_data=wikified_dataset)
# utils.pretty_print(downloaded_dataset)
utils.pretty_print(downloaded_dataset,"download")

Unnamed: 0,FIPStxt,State,Area_Name,Rural-urban_Continuum_Code_2003,Urban_Influence_Code_2003,Rural-urban_Continuum_Code_2013,Urban_Influence_Code_2013,POVALL_2017,CI90LBAll_2017,CI90UBALL_2017,PCTPOVALL_2017,CI90LBALLP_2017,CI90UBALLP_2017,POV017_2017,CI90LB017_2017,CI90UB017_2017,PCTPOV017_2017,CI90LB017P_2017,CI90UB017P_2017,POV517_2017,CI90LB517_2017,CI90UB517_2017,PCTPOV517_2017,CI90LB517P_2017,CI90UB517P_2017,MEDHHINC_2017,CI90LBINC_2017,CI90UBINC_2017,POV04_2017,CI90LB04_2017,CI90UB04_2017,PCTPOV04_2017,CI90LB04P_2017,CI90UB04P_2017,FIPStxt_wikidata,joining_pairs
391,12121,FL,Suwannee County,6,6,6,6,8299,6497,10101,20.3,15.9,24.7,2728,2061,3395,30.0,22.7,37.3,1884,1384,2384,28.2,20.7,35.7,44144,41156,47132,,,,,,,Q501036,[132]
361,12063,FL,Jackson County,6,6,6,6,7264,5523,9005,18.0,13.7,22.3,2117,1528,2706,24.0,17.3,30.7,1565,1118,2012,24.5,17.5,31.5,41524,37686,45362,,,,,,,Q488537,[131]
378,12095,FL,Orange County,1,1,1,1,201528,188575,214481,15.3,14.3,16.3,65440,58650,72230,21.9,19.6,24.2,44755,39282,50228,20.8,18.3,23.3,54021,52629,55413,,,,,,,Q488543,[130]
371,12083,FL,Marion County,2,2,2,2,55880,49007,62753,16.2,14.2,18.2,17133,13930,20336,26.3,21.4,31.2,12253,9813,14693,25.8,20.7,30.9,43772,42127,45417,,,,,,,Q501014,[129]
357,12055,FL,Highlands County,4,5,3,2,20051,17160,22942,19.8,16.9,22.7,5964,4855,7073,34.0,27.7,40.3,4240,3391,5089,33.1,26.5,39.7,37445,34115,40775,,,,,,,Q488885,[128]
343,12027,FL,DeSoto County,6,5,6,5,8766,7195,10337,26.1,21.4,30.8,2505,2009,3001,37.2,29.8,44.6,1801,1429,2173,36.7,29.1,44.3,37342,33967,40717,,,,,,,Q488796,[127]
345,12031,FL,Duval County,1,1,1,1,138069,128011,148127,15.1,14.0,16.2,48079,42653,53505,23.0,20.4,25.6,31702,27402,36002,21.7,18.8,24.6,52105,50735,53475,,,,,,,Q493605,[126]
350,12041,FL,Gilchrist County,3,2,2,2,2675,2067,3283,16.1,12.4,19.8,871,650,1092,24.4,18.2,30.6,632,464,800,24.8,18.2,31.4,42880,38671,47089,,,,,,,Q111720,[125]
379,12097,FL,Osceola County,1,1,1,1,48892,41590,56194,14.0,11.9,16.1,18351,14938,21764,21.4,17.4,25.4,12571,9944,15198,19.8,15.7,23.9,49284,46191,52377,,,,,,,Q501067,[124]
355,12051,FL,Hendry County,4,5,4,3,9525,7776,11274,23.9,19.5,28.3,3741,2934,4548,34.7,27.2,42.2,2662,2085,3239,34.8,27.3,42.3,38361,34378,42344,,,,,,,Q488488,[123]


## Augment With Datasets From The Web
---
Datasets from the web can also be used to augment our original data.

Let's augment using the poverty data, since it is useful for predicting the number of people in poverty. Many new columns appear at the end.

In [9]:
augmented_dataset = search_results[12].augment(supplied_data=augmented_dataset)
utils.pretty_print(augmented_dataset,"wiki_augment")

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016,FIPS_wikidata,State_wikidata,Aggravated assault,Burglary,Larceny-theft,Motor vehicle theft,Property crime,Robbery,area,inception,murder and non-negligent manslaughter,population,violent crime offenses,Area_Name,CI90LB017P_2017,CI90LB017_2017,CI90LB04P_2017,CI90LB04_2017,CI90LB517P_2017,CI90LB517_2017,CI90LBALLP_2017,CI90LBAll_2017,CI90LBINC_2017,CI90UB017P_2017,CI90UB017_2017,CI90UB04P_2017,CI90UB04_2017,CI90UB517P_2017,CI90UB517_2017,CI90UBALLP_2017,CI90UBALL_2017,CI90UBINC_2017,FIPStxt,FIPStxt_wikidata,MEDHHINC_2017,PCTPOV017_2017,PCTPOV04_2017,PCTPOV517_2017,PCTPOVALL_2017,POV017_2017,POV04_2017,POV517_2017,POVALL_2017,Rural-urban_Continuum_Code_2003,Rural-urban_Continuum_Code_2013,Urban_Influence_Code_2003,Urban_Influence_Code_2013
0,1319,13127,GA,Glynn County,3,15916,Q487016,Q58428702,153,422,0,0,0,36,1516,1777-02-05T00:00:00Z,0,81508,0,Glynn County,23.5,4359,,,20.8,2838,14,11776,47392,35.3,6565,,,32.8,4472,20,16832,54088,13127,Q487016,50740,29.4,,26.8,17,5462,,3655,14304,3,3,2,2
1,1319,13127,GA,Glynn County,3,15916,Q487016,Q58428702,0,0,0,114,1852,36,1516,1777-02-05T00:00:00Z,0,81508,201,Glynn County,23.5,4359,,,20.8,2838,14,11776,47392,35.3,6565,,,32.8,4472,20,16832,54088,13127,Q487016,50740,29.4,,26.8,17,5462,,3655,14304,3,3,2,2
2,1319,13127,GA,Glynn County,3,15916,Q487016,Q58428702,153,0,0,114,0,36,1516,1777-02-05T00:00:00Z,2,81508,201,Glynn County,23.5,4359,,,20.8,2838,14,11776,47392,35.3,6565,,,32.8,4472,20,16832,54088,13127,Q487016,50740,29.4,,26.8,17,5462,,3655,14304,3,3,2,2
3,1319,13127,GA,Glynn County,3,15916,Q487016,Q58428702,153,422,1316,0,0,0,1516,1777-02-05T00:00:00Z,0,81508,0,Glynn County,23.5,4359,,,20.8,2838,14,11776,47392,35.3,6565,,,32.8,4472,20,16832,54088,13127,Q487016,50740,29.4,,26.8,17,5462,,3655,14304,3,3,2,2
4,1319,13127,GA,Glynn County,3,15916,Q487016,Q58428702,0,0,0,114,1852,36,1516,1777-02-05T00:00:00Z,2,81508,0,Glynn County,23.5,4359,,,20.8,2838,14,11776,47392,35.3,6565,,,32.8,4472,20,16832,54088,13127,Q487016,50740,29.4,,26.8,17,5462,,3655,14304,3,3,2,2
5,1319,13127,GA,Glynn County,3,15916,Q487016,Q58428702,0,0,0,0,1852,36,1516,1777-02-05T00:00:00Z,0,81508,201,Glynn County,23.5,4359,,,20.8,2838,14,11776,47392,35.3,6565,,,32.8,4472,20,16832,54088,13127,Q487016,50740,29.4,,26.8,17,5462,,3655,14304,3,3,2,2
6,1319,13127,GA,Glynn County,3,15916,Q487016,Q58428702,153,0,1316,0,0,36,1516,1777-02-05T00:00:00Z,2,81508,0,Glynn County,23.5,4359,,,20.8,2838,14,11776,47392,35.3,6565,,,32.8,4472,20,16832,54088,13127,Q487016,50740,29.4,,26.8,17,5462,,3655,14304,3,3,2,2
7,1319,13127,GA,Glynn County,3,15916,Q487016,Q58428702,0,0,0,0,0,0,1516,1777-02-05T00:00:00Z,0,81508,0,Glynn County,23.5,4359,,,20.8,2838,14,11776,47392,35.3,6565,,,32.8,4472,20,16832,54088,13127,Q487016,50740,29.4,,26.8,17,5462,,3655,14304,3,3,2,2
8,1319,13127,GA,Glynn County,3,15916,Q487016,Q58428702,0,0,0,114,0,36,1516,1777-02-05T00:00:00Z,2,81508,0,Glynn County,23.5,4359,,,20.8,2838,14,11776,47392,35.3,6565,,,32.8,4472,20,16832,54088,13127,Q487016,50740,29.4,,26.8,17,5462,,3655,14304,3,3,2,2
9,1319,13127,GA,Glynn County,3,15916,Q487016,Q58428702,0,422,1316,0,0,36,1516,1777-02-05T00:00:00Z,0,81508,201,Glynn County,23.5,4359,,,20.8,2838,14,11776,47392,35.3,6565,,,32.8,4472,20,16832,54088,13127,Q487016,50740,29.4,,26.8,17,5462,,3655,14304,3,3,2,2


## Discovering And Using More Data
---
Crime data may be useful to predict poverty, but no crime data is currently available in Datamart.

Searching in Google for `fbi crime statistics by county` produces this search result:

[<img src="images/google-search-fbi.png" alt="Google Search Result" title="Google Search Result" /> ](https://ucr.fbi.gov/crime-in-the-u.s)

After navigating to this page, click on `2016`, then `Crime in the U.S. 2016`, then `Violent Crime`. You can explore the various crime datasets. Let's choose `Table 8`, which has crime data for all states, broken down by county. For example, the Georgia [page](https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016/tables/table-8/table-8-state-cuts/georgia.xls) contains crime data for counties in Georgia.

This crime data can be downloaded in Excel using the `Download Excel` [link](https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016/tables/table-8/table-8-state-cuts/georgia.xls/output.xls).

<img src="images/fbi-crime-data-georgia.png" alt="Georgia Crime Data" title="Georgia Crime Data" />

---

Challenges for using this data:
- The data for each state is in a separate file
- The column headers start in row 6
- The spreadsheet has notes at the end, and the notes start in different rows for different states
- The name of the state and the year are in the metadata rows (rows 2 and 4)
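
The layout challenges listed above can be handled with a small parser. This is a sketch under stated assumptions, not the table-understanding step Datamart uses: it assumes the state name is in row 2 and the year in row 4, that headers are in row 6, and that the first row without values beyond its first cell marks the start of the notes. The sheet is represented as a list of rows so the example runs without an Excel file; its layout and footnote text are made up, while the two county figures come from the augmented table earlier in this notebook.

```python
import re

def is_data_row(row, width):
    """A data row is full width, has a name in the first cell and at least one value."""
    cells = [str(cell).strip() for cell in row]
    return len(cells) == width and cells[0] != "" and any(cells[1:])

def parse_state_sheet(rows):
    """Turn one per-state sheet into records tagged with state and year."""
    state = str(rows[1][0]).strip().title()            # row 2 (assumed: state name)
    year_match = re.search(r"\b(?:19|20)\d{2}\b", str(rows[3][0]))  # row 4 (assumed: year)
    year = int(year_match.group(0)) if year_match else None
    header = [str(cell).strip() for cell in rows[5]]   # row 6: column headers
    records = []
    for row in rows[6:]:
        if not is_data_row(row, len(header)):
            break  # blank line or notes: the data block has ended
        record = {"State": state, "Year": year, header[0]: str(row[0]).strip()}
        for name, cell in zip(header[1:], row[1:]):
            text = str(cell).strip().replace(",", "")
            record[name] = int(text) if text else None
        records.append(record)
    return records

# Illustrative sheet: the layout and the footnote are assumptions.
sheet = [
    ["Table 8"],
    ["GEORGIA"],
    ["Offenses Known to Law Enforcement"],
    ["by County, 2016"],
    [""],
    ["County", "Violent crime", "Robbery"],
    ["Walton", "73", "10"],
    ["Chattooga", "26", "3"],
    ["1 Illustrative footnote text.", "", ""],
]

records = parse_state_sheet(sheet)
for record in records:
    print(record)
```

Because the end of the data block is detected from the row contents rather than from a fixed row number, the same function works for states whose notes start in different rows.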

## Augmenting Wikidata With Data Extracted From Tables
--- 
After the table understanding step, the data can be indexed in Datamart and used for augmentation. The challenge with the FBI crime data is that the data for each state is in a separate file. Augmentation of our original dataset requires combining the data from multiple files.

Datamart addresses this challenge by mapping the table data to Wikidata and uploading the data to Datamart's Wikidata clone where it can be queried regardless of the file where it came from.
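
The idea of querying data "regardless of the file where it came from" can be sketched with a tiny in-memory triple store. Everything here is illustrative: the property identifier `P-violent-crime`, the file names, and the item `Q-EXAMPLE-FL` with its placeholder value are assumptions, and the real system uses Datamart's Wikidata clone rather than a Python dictionary. Only the Walton County identifier and figure come from the tables above.

```python
from collections import defaultdict

VIOLENT_CRIME = "P-violent-crime"  # hypothetical property identifier

def rows_to_triples(records, source_file):
    """Yield (subject, property, value, source) for every value in every record."""
    for record in records:
        for prop, value in record["values"].items():
            yield (record["qnode"], prop, value, source_file)

# subject -> property -> value; the source file is no longer needed for lookups.
store = defaultdict(dict)

def load(triples):
    for subject, prop, value, _source in triples:
        store[subject][prop] = value

def lookup(qnodes, prop):
    """Fetch one property for many entities, whichever file supplied it."""
    return {q: store.get(q, {}).get(prop) for q in qnodes}

# One "file" per state, as in the FBI download.
georgia = [{"qnode": "Q498312", "values": {VIOLENT_CRIME: 73}}]      # Walton County
florida = [{"qnode": "Q-EXAMPLE-FL", "values": {VIOLENT_CRIME: 0}}]  # placeholder, not real data

load(rows_to_triples(georgia, "georgia.xls"))
load(rows_to_triples(florida, "florida.xls"))

result = lookup(["Q498312", "Q-EXAMPLE-FL", "Q112061"], VIOLENT_CRIME)
print(result)
```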

### Download The FBI Crime Data

In [10]:
utils.download_FBI_data(["Georgia", "Florida"])
# Without parameters it downloads the data for all the states
# utils.download_FBI_data()

### Use DIG To Map The Spreadsheets To Wikidata
A DIG script converts the spreadsheet data to Wikidata using a simple API for augmenting Wikidata.

In [11]:
utils.generate_FBI_data(["Georgia", "Florida"])
# utils.generate_FBI_data()

### Upload The RDF Triples To Wikidata

In [12]:
utils.upload_FBI_data(["Georgia", "Florida"])
# utils.upload_FBI_data()

### View Of Datamart Additions To Wikidata
Wikidata provides a [user interface](https://test.wikidata.org/wiki/Special:NewProperty) to define properties.

We defined [properties to represent crime data](http://tinyurl.com/y5g7juu6).

The uploaded FBI data can be visualized, taking advantage of latitude/longitude coordinates present in Wikidata:

- [Map](http://tinyurl.com/y2a7b7a2) of crime data by county, colored by the file where the data was present
- [Map](http://tinyurl.com/y2k66mcd) of crime data by county, colored by severity
- [Map](http://tinyurl.com/yxwh24vr) of crime data by county, colored by severity, per 100,000 inhabitants
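
The "per 100,000 inhabitants" normalization divides a crime count by the county population and scales the result. A minimal sketch, using the violent crime counts and populations shown for two Georgia counties in the augmented table above; `rate_per_100k` is an illustrative helper and not the query behind the map.

```python
def rate_per_100k(count, population):
    """Crimes per 100,000 inhabitants, or None when either input is missing."""
    if count is None or not population:
        return None
    return count / population * 100_000

# (violent crime offenses, population) taken from the augmented table.
counties = {
    "Walton County": (73, 85754),
    "Chattooga County": (26, 25138),
    "Habersham County": (None, 43300),  # no crime data in the table
}

rates = {name: rate_per_100k(count, pop) for name, (count, pop) in counties.items()}
for name, rate in rates.items():
    print(name, "n/a" if rate is None else round(rate, 1))
```

Normalizing changes the ranking: Chattooga County has fewer offenses than Walton County but a higher rate, because its population is much smaller.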


## Search Datamart Again
---
The FBI data is now available as new columns to augment the poverty data.

In [13]:
search_results = d3mDatamart.search_with_data(supplied_data=wikified_dataset)
utils.print_search_results(search_results)

Unnamed: 0,title,columns,join columns
0,wikidata search result for FIPS_wikidata,"population, area, inception, violent crime off...",FIPS_wikidata
1,wikidata search result for State_wikidata,publication date,State_wikidata
2,Unemployment and median household income for t...,"FIPStxt, State, Area_name, Rural_urban_continu...",[Area]
3,Unemployment and median household income for t...,"FIPStxt, State, Area_name, Rural_urban_continu...",[Area]
4,"Poverty estimates for the U.S., States, and co...","FIPStxt, State, Area_Name, Rural-urban_Continu...",[Area]
5,"Poverty estimates for the U.S., States, and co...","FIPStxt, State, Area_Name, Rural-urban_Continu...",[Area]
6,Educational attainment for adults age 25 and o...,"FIPS Code, State, Area name, 2003 Rural-urban ...",[Area]
7,Educational attainment for adults age 25 and o...,"FIPS Code, State, Area name, 2003 Rural-urban ...",[Area]
8,PopulationEstimates with q nodes,"FIPS, State, Area_Name, Rural-urban_Continuum ...",[Area]
9,PopulationEstimates with q nodes,"FIPS, State, Area_Name, Rural-urban_Continuum ...",[Area]


## Augment Using The FBI Data
---
The first search result has the FBI data.

In [14]:
wiki_search_result = search_results[0]
fbi_augmented_dataset = wiki_search_result.augment(supplied_data=wikified_dataset)
utils.pretty_print(fbi_augmented_dataset,"wiki_augment")

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016,FIPS_wikidata,State_wikidata,Aggravated assault,Burglary,Larceny-theft,Motor vehicle theft,Property crime,Robbery,area,inception,murder and non-negligent manslaughter,population,violent crime offenses
0,1,13297,GA,Walton County,1,11385,Q498312,Q58428702,55.0,171.0,542.0,82.0,795.0,10.0,,1818-01-01T00:00:00Z,0.0,85754,73.0
1,2,13137,GA,Habersham County,6,6500,Q501096,Q58428702,,,,,,,723.0,1818-12-15T00:00:00Z,,43300,
2,6,13059,GA,Clarke County,3,31950,Q112061,Q58428702,0.0,0.0,0.0,0.0,0.0,0.0,314.0,1801-01-01T00:00:00Z,0.0,121265,0.0
3,36,13055,GA,Chattooga County,6,4716,Q486179,Q58428702,21.0,101.0,233.0,12.0,346.0,3.0,812.0,1838-01-01T00:00:00Z,0.0,25138,26.0
4,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702,697.0,8.0,8490.0,935.0,39.0,461.0,881.0,1832-12-02T00:00:00Z,0.0,717190,79.0
5,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702,697.0,8.0,8490.0,935.0,39.0,461.0,881.0,1832-12-02T00:00:00Z,0.0,717190,1262.0
6,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702,66.0,2184.0,8490.0,935.0,11609.0,461.0,881.0,1832-12-02T00:00:00Z,16.0,717190,1262.0
7,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702,66.0,2184.0,8490.0,0.0,11609.0,461.0,881.0,1832-12-02T00:00:00Z,0.0,717190,79.0
8,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702,697.0,8.0,31.0,935.0,11609.0,461.0,881.0,1832-12-02T00:00:00Z,0.0,717190,1262.0
9,46,13067,GA,Cobb County,1,73446,Q484247,Q58428702,697.0,2184.0,31.0,935.0,11609.0,8.0,881.0,1832-12-02T00:00:00Z,0.0,717190,79.0


__Remove the FBI data from the Datamart Wikidata installation__

In [15]:
# utils.clean_FBI_data()