# ISI Datamart Demonstration
---
This demonstration illustrates the following capabilities:

- Entity linking to  __[Wikidata](http://wikidata.org)__
- Augmentation with data from the __[Wikidata](http://wikidata.org)__ knowledge graph
- Augmentation with data from Excel, CSV and other structured sources
- Augmentation with Wikipedia tables
- Enriching __[Wikidata](http://wikidata.org)__ 

In [1]:
import sys, os
# sys.stdout = open(os.devnull, 'w')
from wikifier import utils
from datamart.entries_new import D3MDatamart, D3MJoinSpec
import pandas as pd
d3mDatamart = D3MDatamart()
# this is our original input dataset
inputs_ds_loc = "/Users/minazuki/Desktop/studies/master/2018Summer/data/customize/\
DA_poverty_estimation/TRAIN/dataset_TRAIN/datasetDoc.json"
# pd.set_option('display.max_columns', None)
original_dataset = utils.load_d3m_dataset(inputs_ds_loc)




## Our Original Dataset
---
We start with a dataset that has the number of people in poverty in different counties in the United States. In this demo we are using data from Florida and Georgia only (so it runs faster).

In [2]:
#original_dataset['learningData'].head()
original_dataset['learningData'].sample(n=15)

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016
122,1715,12017,FL,Citrus County,4,23472
78,2547,13261,GA,Sumter County,6,8290
55,1707,13033,GA,Burke County,2,5986
100,71,12081,FL,Manatee County,2,47042
32,791,13277,GA,Tift County,4,8669
94,3038,13225,GA,Peach County,6,5522
6,82,13195,GA,Madison County,3,4255
81,2625,13073,GA,Columbia County,2,10512
54,1614,13145,GA,Harris County,2,2883
0,1,13297,GA,Walton County,1,11385


## Linking To Wikidata
---
Wikidata contains over 80 million identifiers for entities. Datamart can scan a dataset, automatically identify columns containing entity identifiers, and link the identifiers to the appropriate entity in Wikidata.

Here is the wikified data. Clicking on the links takes you to the corresponding Wikidata pages, where you can see all the data available for each entity.

In [3]:
wikified_dataset = utils.wikifier_for_d3m_all(input_ds=original_dataset).value
# wikified_dataset['learningData']
utils.pretty_print(wikified_dataset,"wikifier")

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016,FIPS_wikidata,State_wikidata
0,1,13297,GA,Walton County,1,11385,Q498312,Q1428
1,2,13137,GA,Habersham County,6,6500,Q501096,Q1428
2,6,13059,GA,Clarke County,3,31950,Q112061,Q1428
3,36,13055,GA,Chattooga County,6,4716,Q486179,Q1428
4,46,13067,GA,Cobb County,1,73446,Q484247,Q1428
5,60,13105,GA,Elbert County,6,4197,Q492016,Q1428
6,82,13195,GA,Madison County,3,4255,Q156387,Q1428
7,92,13263,GA,Talbot County,8,1447,Q498356,Q1428
8,116,13211,GA,Morgan County,6,2358,Q493083,Q1428
9,143,13165,GA,Jenkins County,6,2606,Q389551,Q1428
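
The linking step above can be pictured as a lookup from a column value to a Wikidata item ID. The following is only a toy sketch and not how `utils.wikifier_for_d3m_all` works internally: the function names `wikify_rows` and `wikidata_url` are hypothetical, the lookup table is hard-coded from the output above, and the row with FIPS `99999` is made up to show what happens when no match exists.

```python
# Toy lookup table: county FIPS code -> Wikidata item ID.
# The pairs are copied from the wikified output above; a real linker
# would query Wikidata instead of using a hard-coded dictionary.
FIPS_TO_QID = {
    "13297": "Q498312",  # Walton County, GA
    "13137": "Q501096",  # Habersham County, GA
    "13059": "Q112061",  # Clarke County, GA
    "13067": "Q484247",  # Cobb County, GA
}


def wikify_rows(rows, source_column, lookup):
    """Return copies of rows with a '<source_column>_wikidata' column added.

    Values that are missing from the lookup are linked to None.
    """
    target_column = source_column + "_wikidata"
    linked = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay unchanged
        new_row[target_column] = lookup.get(str(row[source_column]))
        linked.append(new_row)
    return linked


def wikidata_url(qid):
    """Build the public Wikidata page URL for an item ID such as 'Q498312'."""
    return "https://www.wikidata.org/wiki/" + qid


rows = [
    {"d3mIndex": 1, "FIPS": 13297, "Area": "Walton County"},
    {"d3mIndex": 46, "FIPS": 13067, "Area": "Cobb County"},
    {"d3mIndex": 0, "FIPS": 99999, "Area": "Made-up County"},  # hypothetical row
]
wikified_rows = wikify_rows(rows, "FIPS", FIPS_TO_QID)
for row in wikified_rows:
    qid = row["FIPS_wikidata"]
    print(row["Area"], qid, wikidata_url(qid) if qid else "no link")
```

Keeping the linked ID in a separate `_wikidata` column, as the real output does, leaves the original column untouched and gives later steps a stable join key.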


## Searching Datamart Using Our Wikified Data
---
Datamart finds multiple datasets that can be used to augment the poverty dataset. The results show the title, the columns available in each dataset, and the columns that will be used to join the candidate dataset to the poverty dataset.

In [4]:
search_results = d3mDatamart.search_with_data(supplied_data=wikified_dataset)
# wiki_search_results.display()
utils.print_search_results(search_results)

Unnamed: 0,title,columns,join columns
0,wikidata search result for FIPS_wikidata,"population, area, inception",FIPS_wikidata
1,wikidata search result for State_wikidata,"population, motto text, demonym, native label,...",State_wikidata
2,Unemployment and median household income for t...,"FIPStxt, State, Area_name, Rural_urban_continu...",[Area]
3,"Poverty estimates for the U.S., States, and co...","FIPStxt, State, Area_Name, Rural-urban_Continu...",[Area]
4,Educational attainment for adults age 25 and o...,"FIPS Code, State, Area name, 2003 Rural-urban ...",[Area]
5,Educational attainment for adults age 25 and o...,"FIPS Code, State, Area name, 2003 Rural-urban ...",[FIPS_wikidata]
6,"Poverty estimates for the U.S., States, and co...","FIPStxt, State, Area_Name, Rural-urban_Continu...",[FIPS_wikidata]
7,Unemployment and median household income for t...,"FIPStxt, State, Area_name, Rural_urban_continu...",[FIPS_wikidata]
8,"Population estimates for the U.S., States, and...","FIPS, State, Area_Name, Rural-urban_Continuum ...",[Area]
9,"Population estimates for the U.S., States, and...","FIPS, State, Area_Name, Rural-urban_Continuum ...",[FIPS_wikidata]


## Using Search Results To Augment Your Data
---
The first search result, from Wikidata, augments our data using population, area and inception date of counties.

In [5]:
wiki_search_result = search_results[0]
augmented_dataset = wiki_search_result.augment(supplied_data=wikified_dataset)
utils.pretty_print(augmented_dataset,"wiki_augment")

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016,FIPS_wikidata,State_wikidata,area,population
0,1,13297,GA,Walton County,1,11385,Q498312,Q1428,,85754
1,2,13137,GA,Habersham County,6,6500,Q501096,Q1428,723.0,43300
2,6,13059,GA,Clarke County,3,31950,Q112061,Q1428,314.0,121265
3,36,13055,GA,Chattooga County,6,4716,Q486179,Q1428,812.0,25138
4,46,13067,GA,Cobb County,1,73446,Q484247,Q1428,881.0,717190
5,60,13105,GA,Elbert County,6,4197,Q492016,Q1428,970.0,19599
6,82,13195,GA,Madison County,3,4255,Q156387,Q1428,740.0,28057
7,92,13263,GA,Talbot County,8,1447,Q498356,Q1428,1022.0,6456
8,116,13211,GA,Morgan County,6,2358,Q493083,Q1428,918.0,17781
9,143,13165,GA,Jenkins County,6,2606,Q389551,Q1428,913.0,9269
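
Conceptually, the augmentation above behaves like a left join on the join column: every row of the supplied data is kept, and columns from the candidate dataset are filled in where the key matches. The sketch below illustrates that idea in plain Python. It is not the implementation behind `augment`, the helper name `left_join` is hypothetical, and the row keyed `Q_HYPOTHETICAL` is made up to show an unmatched key. The other values are copied from the table above.

```python
def left_join(left_rows, right_rows, key):
    """Left-join two lists of dicts on `key`.

    Every left row is kept. Columns from the right side are added, with None
    where the right side has no matching row or no value for that column.
    Assumes keys on the right side are unique (the last duplicate would win).
    """
    index = {row[key]: row for row in right_rows}
    right_columns = sorted({col for row in right_rows for col in row if col != key})
    joined = []
    for row in left_rows:
        match = index.get(row[key], {})
        new_row = dict(row)
        for column in right_columns:
            new_row[column] = match.get(column)
        joined.append(new_row)
    return joined


supplied = [
    {"Area": "Walton County", "FIPS_wikidata": "Q498312"},
    {"Area": "Habersham County", "FIPS_wikidata": "Q501096"},
    {"Area": "Clarke County", "FIPS_wikidata": "Q112061"},
    {"Area": "Made-up County", "FIPS_wikidata": "Q_HYPOTHETICAL"},  # not a real item
]
candidate = [
    {"FIPS_wikidata": "Q498312", "population": 85754},  # no area value available
    {"FIPS_wikidata": "Q501096", "area": 723.0, "population": 43300},
    {"FIPS_wikidata": "Q112061", "area": 314.0, "population": 121265},
]

augmented = left_join(supplied, candidate, "FIPS_wikidata")
for row in augmented:
    print(row)
```

This also explains the empty `area` cell for Walton County in the output: a missing value on the candidate side shows up as a gap rather than dropping the row.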


## Repeat Augmentation With Additional Search Results From Wikidata
---
The second search result, also from Wikidata, augments our data with information about states. This adds columns describing the state that each county belongs to.

In [6]:
augmented_dataset = search_results[1].augment(supplied_data=augmented_dataset)
utils.pretty_print(augmented_dataset,"wiki_augment")

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016,FIPS_wikidata,State_wikidata,area,population,elevation above sea level,inception,motto text,native label,short name,water as percent of area
0,1707,13033,GA,Burke County,2,5986,Q211360,Q1428,2163,22923,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
1,3106,13187,GA,Lumpkin County,6,4711,Q492040,Q1428,738,30918,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
2,198,13213,GA,Murray County,3,7055,Q493074,Q1428,892,39267,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
3,6,13059,GA,Clarke County,3,31950,Q112061,Q1428,314,121265,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
4,1065,13301,GA,Warren County,8,1511,Q491529,Q1428,287,5558,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
5,2856,13045,GA,Carroll County,1,16713,Q493088,Q1428,1305,112355,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
6,2547,13261,GA,Sumter County,6,8290,Q503076,Q1428,1276,31364,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
7,3051,13149,GA,Heard County,1,2227,Q486348,Q1428,780,11558,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
8,2649,13233,GA,Polk County,6,7609,Q498395,Q1428,808,41183,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22
9,116,13211,GA,Morgan County,6,2358,Q493083,Q1428,918,17781,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22



## Download Data
---
Some of the search results are datasets indexed from the web. They appear in the search results because they can be joined with our wikified dataset.

Let's inspect a few of these datasets.

__The search result at index 6 contains poverty information.__

In [7]:
# %%script false

downloaded_dataset = search_results[6].download(supplied_data=wikified_dataset)
utils.pretty_print(downloaded_dataset)
# utils.pretty_print(downloaded_dataset,"download")

Unnamed: 0,FIPStxt,State,Area_Name,Rural-urban_Continuum_Code_2003,Urban_Influence_Code_2003,Rural-urban_Continuum_Code_2013,Urban_Influence_Code_2013,POVALL_2017,CI90LBAll_2017,CI90UBALL_2017,PCTPOVALL_2017,CI90LBALLP_2017,CI90UBALLP_2017,POV017_2017,CI90LB017_2017,CI90UB017_2017,PCTPOV017_2017,CI90LB017P_2017,CI90UB017P_2017,POV517_2017,CI90LB517_2017,CI90UB517_2017,PCTPOV517_2017,CI90LB517P_2017,CI90UB517P_2017,MEDHHINC_2017,CI90LBINC_2017,CI90UBINC_2017,POV04_2017,CI90LB04_2017,CI90UB04_2017,PCTPOV04_2017,CI90LB04P_2017,CI90UB04P_2017,FIPStxt_wikidata,joining_pairs
0,0,US,United States,,,,,42583651,42342619,42824683,13.4,13.3,13.5,13353202,13229339,13477065,18.4,18.2,18.6,9120503,9033090,9207916,17.3,17.1,17.5,60336,60250,60422,3932969.0,3880645.0,3985293.0,20.2,19.9,20.5,,[]
1,1000,AL,Alabama,,,,,802263,784517,820009,16.9,16.5,17.3,262909,253694,272124,24.4,23.5,25.3,180594,172412,188776,22.8,21.8,23.8,48193,47451,48935,78986.0,75009.0,82963.0,27.7,26.3,29.1,,[]
2,1001,AL,Autauga County,2.0,2.0,2.0,2.0,7390,6147,8633,13.4,11.1,15.7,2542,2081,3003,19.3,15.8,22.8,1842,1492,2192,18.6,15.1,22.1,58343,52121,64565,,,,,,,Q156168,[]
3,1003,AL,Baldwin County,4.0,5.0,3.0,2.0,21199,17444,24954,10.1,8.3,11.9,6734,5079,8389,14.7,11.1,18.3,4871,3641,6101,14.3,10.7,17.9,56607,52439,60775,,,,,,,Q156163,[]
4,1005,AL,Barbour County,6.0,6.0,6.0,6.0,7414,6325,8503,33.4,28.5,38.3,2606,2262,2950,50.3,43.7,56.9,1904,1660,2148,48.8,42.6,55.0,32490,29218,35762,,,,,,,Q109437,[]
5,1007,AL,Bibb County,1.0,1.0,1.0,1.0,4137,3187,5087,20.2,15.5,24.9,1242,936,1548,27.3,20.6,34.0,870,641,1099,26.8,19.8,33.8,45795,40924,50666,,,,,,,Q461204,[]
6,1009,AL,Blount County,1.0,1.0,1.0,1.0,7343,5805,8881,12.8,10.1,15.5,2484,1881,3087,18.5,14.0,23.0,1763,1307,2219,17.7,13.1,22.3,48253,43784,52722,,,,,,,Q111250,[]
7,1011,AL,Bullock County,6.0,6.0,6.0,6.0,2956,2316,3596,34.4,27.0,41.8,1015,790,1240,48.3,37.6,59.0,732,562,902,49.0,37.6,60.4,29113,25929,32297,,,,,,,Q111259,[]
8,1013,AL,Butler County,6.0,6.0,6.0,6.0,4154,3155,5153,21.3,16.2,26.4,1471,1085,1857,33.0,24.3,41.7,1049,759,1339,31.8,23.0,40.6,36842,33405,40279,,,,,,,Q108871,[]
9,1015,AL,Calhoun County,3.0,2.0,3.0,2.0,19832,16938,22726,17.7,15.1,20.3,5932,4701,7163,24.2,19.2,29.2,4021,3066,4976,22.2,16.9,27.5,45937,43419,48455,,,,,,,Q108856,[]


__The search result at index 7 contains unemployment information.__

In [8]:
# %%script false

downloaded_dataset = search_results[7].download(supplied_data=wikified_dataset)
# utils.pretty_print(downloaded_dataset)
utils.pretty_print(downloaded_dataset,"download")

Unnamed: 0,FIPStxt,State,Area_name,Rural_urban_continuum_code_2013,Urban_influence_code_2013,Metro_2013,Civilian_labor_force_2007,Employed_2007,Unemployed_2007,Unemployment_rate_2007,Civilian_labor_force_2008,Employed_2008,Unemployed_2008,Unemployment_rate_2008,Civilian_labor_force_2009,Employed_2009,Unemployed_2009,Unemployment_rate_2009,Civilian_labor_force_2010,Employed_2010,Unemployed_2010,Unemployment_rate_2010,Civilian_labor_force_2011,Employed_2011,Unemployed_2011,Unemployment_rate_2011,Civilian_labor_force_2012,Employed_2012,Unemployed_2012,Unemployment_rate_2012,Civilian_labor_force_2013,Employed_2013,Unemployed_2013,Unemployment_rate_2013,Civilian_labor_force_2014,Employed_2014,Unemployed_2014,Unemployment_rate_2014,Civilian_labor_force_2015,Employed_2015,Unemployed_2015,Unemployment_rate_2015,Civilian_labor_force_2016,Employed_2016,Unemployed_2016,Unemployment_rate_2016,Civilian_labor_force_2017,Employed_2017,Unemployed_2017,Unemployment_rate_2017,Median_Household_Income_2017,Med_HH_Income_Percent_of_State_Total_2017,FIPStxt_wikidata,joining_pairs
394,12121,FL,"Suwannee County, FL",6,6,0,17350,16707,643,3.7,17008,15979,1029,6.1,17450,15765,1685,9.7,19153,17267,1886,9.8,18870,17125,1745,9.2,18342,16871,1471,8.0,18077,16826,1251,6.9,17973,16840,1133,6.3,17937,16957,980,6,18193,17305,888,4.9,18082,17312,770,4.3,"$44,144",84.0,Q501036,[132]
364,12063,FL,"Jackson County, FL",6,6,0,21538,20679,859,4.0,22155,20973,1182,5.3,21958,20328,1630,7.4,18938,17214,1724,9.1,18260,16531,1729,9.5,18067,16502,1565,8.7,17779,16438,1341,7.5,17484,16313,1171,6.7,17313,16291,1022,6,17347,16445,902,5.2,17307,16537,770,4.4,"$41,524",79.0,Q488537,[131]
381,12095,FL,"Orange County, FL",1,1,1,594761,571946,22815,3.8,603287,566854,36433,6.0,596021,533867,62154,10.4,635299,566478,68821,10.8,642178,579251,62927,9.8,653886,600251,53635,8.2,663087,618150,44937,6.8,676557,637038,39519,5.8,686092,651611,34481,5,706819,676243,30576,4.3,731398,704715,26683,3.6,"$54,021",102.7,Q488543,[130]
374,12083,FL,"Marion County, FL",2,2,1,137247,130851,6396,4.7,138375,127220,11155,8.1,135282,118007,17275,12.8,132351,114297,18054,13.6,131031,114905,16126,12.3,130450,117140,13310,10.2,131043,119852,11191,8.5,131605,122005,9600,7.3,130198,121727,8471,7,132083,124426,7657,5.8,133553,126934,6619,5.0,"$43,772",83.2,Q501014,[129]
360,12055,FL,"Highlands County, FL",3,2,1,40249,38284,1965,4.9,40939,37916,3023,7.4,40483,36113,4370,10.8,37247,32587,4660,12.5,38151,33773,4378,11.5,37385,33542,3843,10.3,36128,32688,3440,9.5,35843,32821,3022,8.4,35344,32691,2653,8,35739,33410,2329,6.5,35921,33902,2019,5.6,"$37,445",71.2,Q488885,[128]
346,12027,FL,"DeSoto County, FL",6,5,0,14328,13620,708,4.9,14778,13758,1020,6.9,14967,13458,1509,10.1,13081,11481,1600,12.2,13884,12447,1437,10.4,12899,11616,1283,9.9,13133,12056,1077,8.2,13168,12228,940,7.1,13379,12601,778,6,13478,12770,708,5.3,13833,13225,608,4.4,"$37,342",71.0,Q488796,[127]
348,12031,FL,"Duval County, FL",1,1,1,441216,422676,18540,4.2,446829,418438,28391,6.4,441641,395185,46456,10.5,454798,403171,51627,11.4,456441,409556,46885,10.3,456398,416680,39718,8.7,457588,423256,34332,7.5,459978,429051,30927,6.7,460613,433842,26771,6,467864,444968,22896,4.9,483717,463724,19993,4.1,"$52,105",99.1,Q493605,[126]
353,12041,FL,"Gilchrist County, FL",2,2,1,7613,7318,295,3.9,7828,7373,455,5.8,7712,6993,719,9.3,6874,6132,742,10.8,6857,6151,706,10.3,6717,6108,609,9.1,6623,6103,520,7.9,6556,6115,441,6.7,6499,6131,368,6,6680,6347,333,5.0,6839,6551,288,4.2,"$42,880",81.5,Q111720,[125]
382,12097,FL,"Osceola County, FL",1,1,1,132902,127164,5738,4.3,138080,129094,8986,6.5,138807,123212,15595,11.2,137372,120253,17119,12.5,140286,124469,15817,11.3,143550,129979,13571,9.5,147899,136197,11702,7.9,152667,142399,10268,6.7,157274,148194,9080,6,164327,156211,8116,4.9,169949,162768,7181,4.2,"$49,284",93.7,Q501067,[124]
358,12051,FL,"Hendry County, FL",4,3,0,18008,16654,1354,7.5,17565,15619,1946,11.1,17192,14744,2448,14.2,18361,15786,2575,14.0,17971,15482,2489,13.9,17004,14830,2174,12.8,16651,14661,1990,12.0,15827,14033,1794,11.3,15099,13491,1608,11,15411,14103,1308,8.5,15652,14518,1134,7.2,"$38,361",73.0,Q488488,[123]


## Augment With Datasets From The Web
---
Datasets from the web can also be used to augment our original data.

Let's augment using the poverty data, because it is useful for predicting the number of people in poverty. Many new columns appear at the end.

In [9]:
augmented_dataset = search_results[6].augment(supplied_data=augmented_dataset)
utils.pretty_print(augmented_dataset,"wiki_augment")

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016,FIPS_wikidata,State_wikidata,area,population,elevation above sea level,inception,motto text,native label,short name,water as percent of area,Area_Name,CI90LB017P_2017,CI90LB017_2017,CI90LB04P_2017,CI90LB04_2017,CI90LB517P_2017,CI90LB517_2017,CI90LBALLP_2017,CI90LBAll_2017,CI90LBINC_2017,CI90UB017P_2017,CI90UB017_2017,CI90UB04P_2017,CI90UB04_2017,CI90UB517P_2017,CI90UB517_2017,CI90UBALLP_2017,CI90UBALL_2017,CI90UBINC_2017,FIPStxt,FIPStxt_wikidata,MEDHHINC_2017,PCTPOV017_2017,PCTPOV04_2017,PCTPOV517_2017,PCTPOVALL_2017,POV017_2017,POV04_2017,POV517_2017,POVALL_2017,Rural-urban_Continuum_Code_2003,Rural-urban_Continuum_Code_2013,Urban_Influence_Code_2003,Urban_Influence_Code_2013
0,1707,13033,GA,Burke County,2,5986,Q211360,Q1428,2163,22923,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22,Burke County,26.1,1521,,,23.5,1000,17.8,3948,37783,43.3,2529,,,40.9,1736,28.4,6294,45943,13033,Q211360,41863,34.7,,32.2,23.1,2025,,1368,5121,2,2,2,2
1,3106,13187,GA,Lumpkin County,6,4711,Q492040,Q1428,738,30918,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22,Lumpkin County,14.2,820,,,13.2,558,10.6,3179,50276,24.2,1394,,,23.4,992,17.2,5133,60630,13187,Q492040,55453,19.2,,18.3,13.9,1107,,775,4156,6,6,4,4
2,198,13213,GA,Murray County,3,7055,Q493074,Q1428,892,39267,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22,Murray County,20.4,1985,,,19.5,1408,13.9,5481,41639,30.6,2981,,,29.5,2132,20.5,8049,50311,13213,Q493074,45975,25.5,,24.5,17.2,2483,,1770,6765,3,3,2,2
3,6,13059,GA,Clarke County,3,31950,Q112061,Q1428,314,121265,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22,Clarke County,21.8,4766,,,20.7,3126,23.3,27212,36791,35.2,7684,,,35.3,5328,29.9,34922,43011,13059,Q112061,39901,28.5,,28.0,26.6,6225,,4227,31067,3,3,2,2
4,1065,13301,GA,Warren County,8,1511,Q491529,Q1428,287,5558,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22,Warren County,32.4,348,,,29.2,238,22.1,1152,30804,53.8,578,,,51.0,416,33.7,1762,38556,13301,Q491529,34680,43.1,,40.1,27.9,463,,327,1457,8,8,7,7
5,2856,13045,GA,Carroll County,1,16713,Q493088,Q1428,1305,112355,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22,Carroll County,20.0,5546,,,18.5,3755,14.5,16405,46869,28.8,7976,,,27.7,5627,19.9,22527,55907,13045,Q493088,51388,24.4,,23.1,17.2,6761,,4691,19466,1,1,1,1
6,2547,13261,GA,Sumter County,6,8290,Q503076,Q1428,1276,31364,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22,Sumter County,28.5,1898,,,25.9,1280,20.2,5685,34050,48.1,3198,,,45.5,2246,30.8,8653,39172,13261,Q503076,36611,38.3,,35.7,25.5,2548,,1763,7169,6,6,5,5
7,3051,13149,GA,Heard County,1,2227,Q486348,Q1428,780,11558,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22,Heard County,21.5,560,,,19.0,374,14.9,1726,39967,34.7,906,,,32.0,630,22.9,2648,49183,13149,Q486348,44575,28.1,,25.5,18.9,733,,502,2187,1,1,1,1
8,2649,13233,GA,Polk County,6,7609,Q498395,Q1428,808,41183,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22,Polk County,20.8,2201,,,19.5,1510,15.3,6296,39444,33.8,3567,,,32.7,2524,22.7,9360,47532,13233,Q498395,43488,27.3,,26.1,19.0,2884,,2017,7828,6,4,3,3
9,116,13211,GA,Morgan County,6,2358,Q493083,Q1428,918,17781,180,1788-01-02T00:00:00Z,"Wisdom, Justice, Moderation",State of Georgia,GA,3.22,Morgan County,14.7,603,,,13.7,425,9.1,1659,50158,24.9,1029,,,24.3,751,14.9,2733,60344,13211,Q493083,55251,19.8,,19.0,12.0,816,,588,2196,6,1,4,1


## Discovering And Using More Data
---
Crime data may be useful to predict poverty, but no crime data is currently available in Datamart.

Searching Google for `fbi crime statistics by county` produces this search result:

[<img src="images/google-search-fbi.png" alt="Google Search Result" title="Google Search Result" /> ](https://ucr.fbi.gov/crime-in-the-u.s)

After navigating to this page, click on `2016`, then `Crime in the U.S. 2016`, then `Violent Crime`. You can explore the various crime datasets. Let's choose `Table 8`, which has crime data for all states, broken down by county. For example, the Georgia [page](https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016/tables/table-8/table-8-state-cuts/georgia.xls) contains crime data for counties in Georgia.

This crime data can be downloaded as an Excel file using the `Download Excel` [link](https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-2016/tables/table-8/table-8-state-cuts/georgia.xls/output.xls).

<img src="images/fbi-crime-data-georgia.png" alt="Georgia Crime Data" title="Georgia Crime Data" />

---

Challenges for using this data:
- The data for each state is in a separate file
- The column headers start in row 6
- The spreadsheet has notes at the end, and the notes start in different rows for different states
- The name of the state and the year are in the metadata rows (rows 2 and 4)
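
The layout issues listed above can be handled with a few positional rules. The sketch below is a simplified illustration, not the table understanding software described in the next section. It assumes the sheet has already been read into a list of rows, that the headers occupy a single row, and that the trailing notes leave most cells empty. The sheet contents, the column names, and the helper names `is_data_row` and `parse_state_sheet` are all made up for the example.

```python
HEADER_ROW = 6  # 1-based row where the column headers start
STATE_ROW = 2   # 1-based metadata row holding the state name
YEAR_ROW = 4    # 1-based metadata row holding the year


def is_data_row(row, width):
    """Assumption: a data row has a non-empty value in each of the first `width` cells."""
    if len(row) < width:
        return False
    return all(cell not in ("", None) for cell in row[:width])


def parse_state_sheet(rows):
    """Extract records from one state sheet and attach the state and year to each."""
    state = rows[STATE_ROW - 1][0]
    year = rows[YEAR_ROW - 1][0]
    header = rows[HEADER_ROW - 1]
    records = []
    for row in rows[HEADER_ROW:]:
        if not is_data_row(row, len(header)):
            break  # first non-data row: the notes start here, wherever that is
        record = dict(zip(header, row))
        record["State"] = state
        record["Year"] = year
        records.append(record)
    return records


# A made-up sheet that mimics the layout; none of these values are real FBI data.
sheet = [
    ["Table 8"],                                     # row 1
    ["GEORGIA"],                                     # row 2: state
    ["Offenses Known to Law Enforcement"],           # row 3
    ["2016"],                                        # row 4: year
    [""],                                            # row 5
    ["County", "Violent crime", "Property crime"],   # row 6: headers
    ["Example County A", 10, 100],
    ["Example County B", 3, 40],
    ["1 A made-up footnote that marks the start of the notes."],
]

records = parse_state_sheet(sheet)
for record in records:
    print(record)
```

Stopping at the first non-data row, instead of at a fixed row number, is what lets the same function handle states whose notes start in different rows.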

## Automatic Table Understanding (Poster)
---
The automatic table understanding software performs the following tasks on spreadsheets and CSV files:
- Identifies the type of each cell (data, header, attribute, global metadata)
- Segments the table into blocks
- Identifies relationships among blocks

__Add image here with blocks of GEORGIA table__

## Augmenting Wikidata With Data Extracted From Tables
--- 
After the table understanding step, the data can be indexed in Datamart and used for augmentation. The challenge with the FBI crime data is that the data for each state is in a separate file. Augmentation of our original dataset requires combining the data from multiple files.

Datamart addresses this challenge by mapping the table data to Wikidata and uploading it to Datamart's Wikidata clone, where it can be queried regardless of which file it came from.
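
One way to picture why a shared identifier removes the per-file problem: once every record is keyed by a Wikidata item ID, records from different files can sit in one store and be queried together. The sketch below uses plain tuples as a stand-in for the Wikidata clone. The function names `to_statements` and `query` are hypothetical, the property name is a plain label and not a real Wikidata property ID, and the Florida value is a placeholder, not real FBI data.

```python
def to_statements(records, source_file):
    """Turn per-county records into (subject, property, value, source) tuples."""
    statements = []
    for record in records:
        subject = record["qid"]
        for prop, value in record.items():
            if prop != "qid":
                statements.append((subject, prop, value, source_file))
    return statements


def query(store, prop, subjects):
    """Look up one property for a set of subjects, ignoring the source file."""
    return {s: v for (s, p, v, _source) in store if p == prop and s in subjects}


# One list per state file. Q498312 is Walton County, GA and Q488543 is
# Orange County, FL (both IDs appear elsewhere in this demonstration).
georgia = [{"qid": "Q498312", "violent crime offenses": 73}]
florida = [{"qid": "Q488543", "violent crime offenses": 123}]  # placeholder value

store = to_statements(georgia, "georgia.xls") + to_statements(florida, "florida.xls")

result = query(store, "violent crime offenses", {"Q498312", "Q488543"})
print(result)
```

The source file is kept alongside each statement, so provenance is still available even though queries no longer depend on it.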

### Adding Crime Properties To Wikidata
Wikidata provides a [user interface](https://test.wikidata.org/wiki/Special:NewProperty) to define properties.

We defined [properties to represent crime data](http://tinyurl.com/y5g7juu6).

### Download The FBI Crime Data

In [10]:
# download_fbi_crime_data("Georgia", "Florida")
# this script should print the URL of each downloaded file.

### Use DIG To Map The Spreadsheets To Wikidata
A DIG script converts the spreadsheet data to Wikidata using a simple API for augmenting Wikidata.

In [11]:
# extract_fbi_crime_data_to_wikidata("Georgia", "Florida")
# this script should print a line after processing each state, like the following
# Generated Wikidata RDF triples for Georgia

### Upload The RDF Triples To Wikidata

In [12]:
utils.generate_FBI_data(["georgia", "florida"])

The uploaded FBI data can be visualized, taking advantage of latitude/longitude coordinates present in Wikidata:

- [Map](http://tinyurl.com/y2a7b7a2) of crime data by county, colored by the file where the data was present
- [Map](http://tinyurl.com/y2k66mcd) of crime data by county, colored by severity
- [Map](http://tinyurl.com/yxwh24vr) of crime data by county, colored by severity per 100,000 inhabitants
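
The per-100,000 normalization used in the last map is simple arithmetic: divide the count by the population and multiply by 100,000. The sketch below assumes the rate is computed from the `violent crime offenses` and `population` columns; the exact query behind the map is not shown here, and the function name `rate_per_100k` is hypothetical. The input values are taken from the Walton County and Chattooga County rows of the FBI-augmented table shown later in this demonstration.

```python
def rate_per_100k(count, population):
    """Return the number of incidents per 100,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population * 100_000


# (violent crime offenses, population) pairs from the augmented table
walton_rate = rate_per_100k(73, 85754)
chattooga_rate = rate_per_100k(26, 25138)

print("Walton County:", round(walton_rate, 1))
print("Chattooga County:", round(chattooga_rate, 1))
```

Chattooga County reports fewer offenses than Walton County but has a higher rate, which is why the normalized map can look different from the raw-count map.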


## Search Datamart Again
---
The FBI data is now available as new columns to augment the poverty data.

In [13]:
search_results = d3mDatamart.search_with_data(supplied_data=wikified_dataset)
# wiki_search_results.display()
utils.print_search_results(search_results)

Unnamed: 0,title,columns,join columns
0,wikidata search result for FIPS_wikidata,"population, area, inception, violent crime off...",FIPS_wikidata
1,wikidata search result for State_wikidata,"population, motto text, demonym, native label,...",State_wikidata
2,Unemployment and median household income for t...,"FIPStxt, State, Area_name, Rural_urban_continu...",[Area]
3,"Poverty estimates for the U.S., States, and co...","FIPStxt, State, Area_Name, Rural-urban_Continu...",[Area]
4,Educational attainment for adults age 25 and o...,"FIPS Code, State, Area name, 2003 Rural-urban ...",[Area]
5,Educational attainment for adults age 25 and o...,"FIPS Code, State, Area name, 2003 Rural-urban ...",[FIPS_wikidata]
6,"Poverty estimates for the U.S., States, and co...","FIPStxt, State, Area_Name, Rural-urban_Continu...",[FIPS_wikidata]
7,Unemployment and median household income for t...,"FIPStxt, State, Area_name, Rural_urban_continu...",[FIPS_wikidata]
8,"Population estimates for the U.S., States, and...","FIPS, State, Area_Name, Rural-urban_Continuum ...",[Area]
9,"Population estimates for the U.S., States, and...","FIPS, State, Area_Name, Rural-urban_Continuum ...",[FIPS_wikidata]


## Augment Using The FBI Data
---
The first search result has the FBI data.

In [15]:
wiki_search_result = search_results[0]
fbi_augmented_dataset = wiki_search_result.augment(supplied_data=wikified_dataset)
utils.pretty_print(fbi_augmented_dataset,"wiki_augment")

Unnamed: 0,d3mIndex,FIPS,State,Area,RUCCode,POVALL_2016,FIPS_wikidata,State_wikidata,Aggravated assault,Burglary,Larceny-theft,Motor vehicle theft,Property crime,Robbery,area,inception,murder and non-negligent manslaughter,population,violent crime offenses
0,1,13297,GA,Walton County,1,11385,Q498312,Q1428,55.0,171.0,542.0,82.0,795.0,10.0,,1818-01-01T00:00:00Z,0.0,85754,73.0
1,2,13137,GA,Habersham County,6,6500,Q501096,Q1428,,,,,,,723.0,1818-12-15T00:00:00Z,,43300,
2,6,13059,GA,Clarke County,3,31950,Q112061,Q1428,0.0,0.0,0.0,0.0,0.0,0.0,314.0,1801-01-01T00:00:00Z,0.0,121265,0.0
3,36,13055,GA,Chattooga County,6,4716,Q486179,Q1428,21.0,101.0,233.0,12.0,346.0,3.0,812.0,1838-01-01T00:00:00Z,0.0,25138,26.0
4,46,13067,GA,Cobb County,1,73446,Q484247,Q1428,66.0,8.0,31.0,0.0,39.0,461.0,881.0,1832-12-02T00:00:00Z,0.0,717190,1262.0
5,46,13067,GA,Cobb County,1,73446,Q484247,Q1428,66.0,2184.0,8490.0,935.0,11609.0,461.0,881.0,1832-12-02T00:00:00Z,16.0,717190,1262.0
6,46,13067,GA,Cobb County,1,73446,Q484247,Q1428,66.0,2184.0,8490.0,935.0,39.0,461.0,881.0,1832-12-02T00:00:00Z,16.0,717190,79.0
7,46,13067,GA,Cobb County,1,73446,Q484247,Q1428,697.0,2184.0,8490.0,935.0,39.0,8.0,881.0,1832-12-02T00:00:00Z,16.0,717190,1262.0
8,46,13067,GA,Cobb County,1,73446,Q484247,Q1428,697.0,2184.0,8490.0,935.0,39.0,8.0,881.0,1832-12-02T00:00:00Z,0.0,717190,79.0
9,46,13067,GA,Cobb County,1,73446,Q484247,Q1428,66.0,8.0,31.0,0.0,11609.0,461.0,881.0,1832-12-02T00:00:00Z,0.0,717190,79.0


__Remove the FBI data from the Datamart Wikidata installation__

In [16]:
utils.clean_FBI_data()