# Metapack Access Example



[Metatab](http://metatab.org) is a system for documenting data set metadata, which the program [metapack](https://github.com/CivicKnowledge/metapack) uses to create data packages. You can also use the metapack Python module to access data packages from the web, which gives you easy access to documentation and pandas data frames in Jupyter notebooks.

To start, you'll need to install metatab, which you should be able to do with:

> pip install metatab

The system is under active development, so if you run into problems, you can install the latest development versions of the important modules with the [development requirements.txt file on GitHub](https://github.com/CivicKnowledge/metapack/blob/master/dev/requirements.txt):

> pip install -r https://raw.githubusercontent.com/CivicKnowledge/metapack/master/dev/requirements.txt

After installing the module, you should be able to run the code in this notebook. 

# Metapack Packages

Metapack packages are collections of files that contain data and metadata. Metapack has several package types, including [Excel](http://s3.amazonaws.com/library.metatab.org/aspe.hhs.gov-dementia_prevalence-2.xlsx), [Zip](http://s3.amazonaws.com/library.metatab.org/aspe.hhs.gov-dementia_prevalence-2.zip), CSV, file system and S3. Most of the time with Jupyter notebooks, you will use the CSV packages, but the Zip and Excel packages will also work.

First, you'll need to get a reference to a package. Most often, you'll get these from our [CKAN Data Repository at data.sandiegodata.org](http://data.sandiegodata.org). In this example, we'll use the [Community Reinvestment Act Disclosure Files](http://data.sandiegodata.org/dataset/ffiec-gov-cra_disclosure_smb_orig-2010_2015).

Start by visiting the [data package page in the data repository](http://data.sandiegodata.org/dataset/ffiec-gov-cra_disclosure_smb_orig-2010_2015). The files list will have both data package files and data files. The data package files are the ones that start with the name of the package, `ffiec.gov-cra_disclosure_smb_orig-2010_2015-2`. So these are the package files:

* `ffiec.gov-cra_disclosure_smb_orig-2010_2015-2.csv`
* `ffiec.gov-cra_disclosure_smb_orig-2010_2015-2.zip`
* `ffiec.gov-cra_disclosure_smb_orig-2010_2015-2.xlsx`
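
The naming rule above can be sketched in a few lines of Python. This is only an illustration: the `classify_files` helper and the data file name `sb_loan_orig.csv` are hypothetical and are not part of metapack or of the actual repository listing.

```python
# Hypothetical helper, not part of metapack: split a repository file listing
# into package files and data files using the package-name prefix.
PACKAGE_NAME = 'ffiec.gov-cra_disclosure_smb_orig-2010_2015-2'


def classify_files(filenames, package_name=PACKAGE_NAME):
    """Return (package_files, data_files) based on the name prefix."""
    package_files = [f for f in filenames if f.startswith(package_name + '.')]
    data_files = [f for f in filenames if f not in package_files]
    return package_files, data_files


listing = [
    PACKAGE_NAME + '.csv',
    PACKAGE_NAME + '.zip',
    PACKAGE_NAME + '.xlsx',
    'sb_loan_orig.csv',  # invented data file name, for illustration only
]

package_files, data_files = classify_files(listing)

# Pick out the CSV package, the one usually wanted in a notebook.
csv_package = next(f for f in package_files if f.endswith('.csv'))
print(csv_package)
```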


The first of these package files, the `.csv` file, is the CSV package. Using CSV packages is usually most efficient because you only need to download the data files that you use. So, the first step is to get the CSV package URL. From the [data package page on the CKAN repository](http://data.sandiegodata.org/dataset/ffiec-gov-cra_disclosure_smb_orig-2010_2015) you can do either of the following:

1. Click on the "Explore" button next to the CSV package file, then right-click on "Go to resource" to copy the URL. 
2. Click on the name of the CSV package, then copy the URL at the top of the following page. 

After you have the package URL, pass it into the `open_package` function, as shown in the next cell. The function will return a data package object, which Jupyter will print by showing the package documentation.


In [1]:

import metapack as mp

pkg = mp.open_package('http://library.metatab.org/ffiec.gov-cra_disclosure_smb_orig-2010_2015-2.csv')

pkg

The `Resources` section lists the data files in the package, while the `References` section shows the links to the data files that were used to create the resources. You can use the name of a resource in a call to `pkg.resource` to create a resource object, which, like the package object, can be pretty-printed in Jupyter.

In [2]:
r = pkg.resource('sb_loan_orig')
r


| Header | Type | Description |
|---|---|---|
| table_id | text | Value is D1-1 |
| respondent_id | text | Assigned by regulatory agency (same as HMDAID if applicable); Right justified with leading zeros |
| agency | integer | Values are 1=OCC, 2=FRS, 3=FDIC, or 4=OTS |
| year | integer | Four digit year (e.g. 2012) |
| loan_type | integer | Value is 4 (Small Business) |
| action | integer | Value is 1 (Originations) |
| state | integer | FIPS code with leading zeros or blank for totals across all states |
| county | integer | FIPS code with leading zeros or blank for totals across all counties |
| msa | integer | As defined by OMB; Right justified with leading zeros, NA left justified for areas outside of MSA/MD or blank for totals across all MSA/MDs |
| assessment_area | text | Values are 0001 through 9999; Right justified with leading zeros, NA left justified for areas outside of an Assessment Area (including predominately military areas) OR blank for totals across all Assessment Areas |
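
The schema above types `state` and `county` as `integer` while describing them as FIPS codes with leading zeros. If those zeros turn out to be missing once the values are read as numbers, they can be restored by padding. A minimal sketch, assuming the standard FIPS widths of two digits for states and three for counties; `pad_fips` is a hypothetical helper, not part of metapack:

```python
def pad_fips(value, width):
    """Zero-pad a FIPS code; return None for blank (totals) rows."""
    # Treat None, NaN (the only value not equal to itself) and empty
    # strings as the "blank for totals" case described in the schema.
    if value is None or value != value or str(value).strip() == '':
        return None
    return str(int(value)).zfill(width)


state = pad_fips(6, 2)     # California
county = pad_fips(73, 3)   # San Diego County
print(state + county)      # combined state + county code: 06073
```

Keeping the padded codes as strings makes them safe to join against other data sets keyed by FIPS code.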


The final step is to create a dataframe from the resource by calling its `.dataframe()` method. Note, however, that for this dataset it can take almost 10 minutes to create the whole dataframe, because the data file is very large.

In [3]:
%%time
df = r.dataframe()

CPU times: user 9min 7s, sys: 8.66 s, total: 9min 16s
Wall time: 9min 36s
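
With a load time like this, it may be worth saving the result so that later notebook sessions do not repeat the work. The sketch below is one possible approach using only the standard library; the `cached` helper and the file name `sb_loan_orig.pkl` are hypothetical, and it assumes the object returned by the loader can be pickled, which is generally true of pandas dataframes.

```python
import os
import pickle


def cached(path, loader):
    """Return the pickled object at `path`, calling `loader()` only on a miss."""
    if os.path.exists(path):
        # Cache hit: skip the slow load entirely.
        with open(path, 'rb') as f:
            return pickle.load(f)
    # Cache miss: run the slow loader once and store its result.
    result = loader()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result


# In this notebook the helper might be used as (hypothetical file name):
#     df = cached('sb_loan_orig.pkl', r.dataframe)
```

Delete the cache file whenever the package is updated, since the helper never checks whether the source data has changed.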


In [4]:
len(df)

4391391