# U.S. Census Data Tutorial

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1aQ_3NmphBiqHVA2TrhtUtojZKnIKMKfx#scrollTo=_kl5N3rweK9v)

## Install Dependencies

In [3]:
!pip install pandas
!pip install numpy
!pip install regex
!pip install censusdata

Collecting censusdata
  Downloading CensusData-1.15.tar.gz (26.6 MB)
     |████████████████████████████████| 26.6 MB 1.4 MB/s
Building wheels for collected packages: censusdata
  Building wheel for censusdata (setup.py) ... done
  Created wheel for censusdata: filename=CensusData-1.15-py3-none-any.whl size=28205534 sha256=887a50b8e1f474ece0544305c6235353f76cad9363fe07b8e85b10e880a0dbb6
  Stored in directory: /root/.cache/pip/wheels/17/11/8c/933901298f486bd414f2ab1a62a114085f7d7a19dcbda2dd08
Successfully built censusdata
Installing collected packages: censusdata
Successfully installed censusdata-1.15


### Reference and import

If the ```censusdata``` package is not already in your environment, make sure to run the ```pip install``` lines above to install it.

Reference: the [CensusData library documentation](https://jtleider.github.io/censusdata/index.html)

In [4]:
import pandas as pd
import re
import numpy as np
import censusdata

### Main Methods
[CensusData API Documentation](https://jtleider.github.io/censusdata/api.html)

In [5]:
# Search for ACS 2015-2019 5-year estimate variables where the concept 
# includes the text 'population'.
sample = censusdata.search('acs5', 2019, 'concept', 
                           lambda value: re.search('population', value, re.IGNORECASE))

**Parameters:**	
* src (str) – Census data source: ```'acs1'``` for **ACS 1-year estimates**, ```'acs5'``` for **ACS 5-year estimates**, ```'acs3'``` for **ACS 3-year estimates**, ```'acsse'``` for **ACS 1-year supplemental estimates**, ```'sf1'``` for **SF1 data**.
* year (int) – Year of data.
* field (str) – Field in which to search.
* criterion (str or function) – Search criterion. Either string to search for, or a function which will be passed the value of field and return True if a match and False otherwise.
* tabletype (str, optional) – Type of table from which variables are drawn (only applicable to ACS data). Options are ```'detail'``` (detail tables), ```'subject'``` (subject tables), ```'profile'``` (data profile tables), ```'cprofile'``` (comparison profile tables).

**Returns:**	
List of 3-tuples containing variable names, concepts, and labels matching the search criterion.

**Return type:**	
list
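
Because the criterion can be a function, a reusable predicate can be built once and passed in place of the lambda. The sketch below is only an illustration: `concept_contains` is a hypothetical helper, not part of `censusdata`, and the check for non-string values is a precaution rather than documented behavior.

```python
import re

def concept_contains(*words):
    """Build a predicate that is true when every word appears in the value.

    Hypothetical helper for illustration; it is not part of censusdata.
    """
    patterns = [re.compile(re.escape(word), re.IGNORECASE) for word in words]

    def predicate(value):
        # Precaution: treat anything that is not a string as "no match".
        if not isinstance(value, str):
            return False
        return all(pattern.search(value) for pattern in patterns)

    return predicate

is_total_population = concept_contains('total', 'population')
print(is_total_population('TOTAL POPULATION'))   # True
print(is_total_population('MEDIAN AGE BY SEX'))  # False

# It could then be passed as the criterion, for example:
# censusdata.search('acs5', 2019, 'concept', is_total_population)
```

Returning a plain `True` or `False` follows the description of the criterion parameter above.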

In [6]:
print('Length of the samples:', len(sample))

Length of the samples: 10765


This is the number of variables that match the search. In this case, 10765 variables in the 2015-2019 ACS 5-year estimates have a concept that includes the text 'population'.

In [7]:
print(sample[0])

('B01003_001E', 'TOTAL POPULATION', 'Estimate!!Total')


Let's use the first result as an example. Based on the output above, its variable name is 'B01003_001E', the total population estimate, which belongs to the parent table B01003.
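
The parent table ID is the part of the variable name before the underscore. As a minimal sketch, assuming every variable name follows that pattern, the search results could be grouped by parent table like this (`table_id` and `group_by_table` are hypothetical helper names):

```python
from collections import defaultdict

def table_id(variable_name):
    """Return the part of a variable name before the underscore, e.g. 'B01003'."""
    return variable_name.split('_', 1)[0]

def group_by_table(results):
    """Group (name, concept, label) tuples by their parent table ID."""
    tables = defaultdict(list)
    for name, concept, label in results:
        tables[table_id(name)].append(name)
    return dict(tables)

# Only the one tuple printed above; a real search result has many more entries.
results = [('B01003_001E', 'TOTAL POPULATION', 'Estimate!!Total')]
print(group_by_table(results))  # {'B01003': ['B01003_001E']}
```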

Once you know the parent table you're interested in, you can use the ```printtable``` function to get a clean readout of all its variables and check whether there are others you might be interested in.

In [8]:
censusdata.printtable(censusdata.censustable('acs5', 2019, 'B01003'))

Variable     | Table                          | Label                                                    | Type 
-------------------------------------------------------------------------------------------------------------------
B01003_001E  | TOTAL POPULATION               | !! Estimate Total                                        | int  
-------------------------------------------------------------------------------------------------------------------


### Data download

**Step 1** and **Step 2** are for readers who want to download data for a specific state (**Step 1**) and a specific county (**Step 2**).

**Step 1** If you want to download the data for particular states, you need to find their geography codes. The ```geographies``` function is built for that.

In [9]:
states = censusdata.geographies(censusdata.censusgeo([('state', '*')]), 'acs5', 2019)
print(states['Michigan'])

Summary level: 040, state:26


**Step 2** If you want county-level data, do almost the same thing, but add the county after the state. For example:

In [10]:
counties = censusdata.geographies(censusdata.censusgeo([('state', '26'), ('county', '*')]), 'acs5', 2019)
print(counties['Wayne County, Michigan'])

Summary level: 050, state:26> county:163


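**Step 3** Now it is time to download the data you want. This example uses Wayne County, Michigan.

**NOTE:** If you don't want to restrict the download to a particular state or county, set the state and county to ```'*'```, as is done for the block group. Whether a wildcard is accepted at each level depends on the geography hierarchy the Census API supports for the chosen dataset.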

In [11]:
data = censusdata.download('acs5', 2019, censusdata.censusgeo([('state', '26'),
                                                               ('county', '163'),
                                                               ('block group', '*')]),
                          ['B01003_001E'])

This is the number of rows (one per block group) in the data we downloaded.

In [12]:
len(data)

1822

### Extra (data formatting, slicing)

This part covers some optional extra steps, such as renaming the columns with pandas and slicing the data by Census Tract with ```census_cut``` from ```Help_Functions```.

**NOTE:** If you open this in Colab, the ```census_cut``` function is defined in this notebook.

In [13]:
column_name = ['TOTAL POPULATION']
data.columns = column_name

In [14]:
# Rebuild the index from a plain Python list of the existing index values.
data.index = data.index.tolist()

In [15]:
data.head()

| | TOTAL POPULATION |
|---|---|
| Block Group 0, Census Tract 9901, Wayne County, Michigan: Summary level: 150, state:26> county:163> tract:990100> block group:0 | 0 |
| Block Group 3, Census Tract 5104, Wayne County, Michigan: Summary level: 150, state:26> county:163> tract:510400> block group:3 | 238 |
| Block Group 5, Census Tract 5528, Wayne County, Michigan: Summary level: 150, state:26> county:163> tract:552800> block group:5 | 1546 |
| Block Group 3, Census Tract 5014, Wayne County, Michigan: Summary level: 150, state:26> county:163> tract:501400> block group:3 | 757 |
| Block Group 2, Census Tract 5044, Wayne County, Michigan: Summary level: 150, state:26> county:163> tract:504400> block group:2 | 427 |
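
Each index label ends with the geography codes for the row. If you need to join this data with other sources, one option is to turn the label into a 12-digit block group GEOID (state, county, tract, and block group codes concatenated). The sketch below assumes the label format shown above; `parse_geo_label` and `block_group_geoid` are hypothetical helper names.

```python
import re

# Matches pairs such as 'state:26' or 'block group:3' in an index label.
GEO_PART = re.compile(r'(state|county|tract|block group):(\w+)')

def parse_geo_label(label):
    """Return the geography codes in a label as a dict, e.g. {'state': '26', ...}."""
    return dict(GEO_PART.findall(str(label)))

def block_group_geoid(label):
    """Concatenate state, county, tract, and block group codes into one GEOID."""
    parts = parse_geo_label(label)
    return parts['state'] + parts['county'] + parts['tract'] + parts['block group']

label = ('Block Group 3, Census Tract 5104, Wayne County, Michigan: '
         'Summary level: 150, state:26> county:163> tract:510400> block group:3')
print(parse_geo_label(label))
print(block_group_geoid(label))  # 261635104003
```

Calling `str()` on the label first means the same helper works whether the index holds strings or the original geography objects.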


### ```census_cut``` usage

```Help_Functions``` contains a function called ```census_cut```. It filters the data downloaded with the ```censusdata``` package down to the ```Census Tract``` values you want and returns a new dataframe.

[More information about ```Census Tract```.](https://www2.census.gov/geo/pdfs/education/CensusTracts.pdf)

The ```census_cut``` function looks like this:

In [16]:
def census_cut(tracts, data):
    '''
    Keep only the rows of data downloaded with the censusdata package
    whose index label mentions one of the given Census Tracts.

        Parameters:
            tracts: A list of strings naming the Census Tracts, such as 'Census Tract 0000'.
            data: Data downloaded by using the censusdata package.
        Return:
            result: A new set of data which only includes the rows for the given Census Tracts.
    '''
    mask = []
    for index in data.index:
        string = str(index)
        # Exactly one entry per row, even if several tracts match the same label.
        # Each tract is used as a regular expression pattern by re.search.
        mask.append(any(re.search(tract, string) for tract in tracts))
    result = data[mask]
    return result
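
One caveat with matching on the tract name: ```re.search``` looks for the text anywhere in the label, so a shorter name also matches longer ones, and tract names can carry a decimal suffix. A minimal sketch of a stricter match is shown below; `exact_tract_pattern` is a hypothetical helper, not part of ```Help_Functions```.

```python
import re

def exact_tract_pattern(tract):
    """Compile a pattern that matches the tract name only as a whole.

    Hypothetical helper: the lookahead rejects a following digit or '.',
    so 'Census Tract 53' does not match 'Census Tract 5301'.
    """
    return re.compile(re.escape(tract) + r'(?![\d.])')

label = 'Block Group 3, Census Tract 5301, Wayne County, Michigan'

print(bool(re.search('Census Tract 53', label)))                     # True (prefix match)
print(bool(exact_tract_pattern('Census Tract 53').search(label)))    # False
print(bool(exact_tract_pattern('Census Tract 5301').search(label)))  # True
```

Using ```re.escape``` also keeps any special characters in the tract name from being read as regular expression syntax.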

For example, suppose we want the data for Census Tracts 5301 through 5305:

In [17]:
Tracts = ['Census Tract 5301', 'Census Tract 5302', 'Census Tract 5303', 'Census Tract 5304','Census Tract 5305']

In [18]:
df = census_cut(Tracts, data)
df

| | TOTAL POPULATION |
|---|---|
| Block Group 3, Census Tract 5301, Wayne County, Michigan: Summary level: 150, state:26> county:163> tract:530100> block group:3 | 771 |
| Block Group 1, Census Tract 5302, Wayne County, Michigan: Summary level: 150, state:26> county:163> tract:530200> block group:1 | 1389 |
| Block Group 3, Census Tract 5302, Wayne County, Michigan: Summary level: 150, state:26> county:163> tract:530200> block group:3 | 139 |
| Block Group 2, Census Tract 5305, Wayne County, Michigan: Summary level: 150, state:26> county:163> tract:530500> block group:2 | 890 |
| Block Group 2, Census Tract 5304, Wayne County, Michigan: Summary level: 150, state:26> county:163> tract:530400> block group:2 | 647 |
| Block Group 2, Census Tract 5301, Wayne County, Michigan: Summary level: 150, state:26> county:163> tract:530100> block group:2 | 366 |
| Block Group 1, Census Tract 5301, Wayne County, Michigan: Summary level: 150, state:26> county:163> tract:530100> block group:1 | 704 |
| Block Group 4, Census Tract 5301, Wayne County, Michigan: Summary level: 150, state:26> county:163> tract:530100> block group:4 | 330 |
| Block Group 1, Census Tract 5304, Wayne County, Michigan: Summary level: 150, state:26> county:163> tract:530400> block group:1 | 398 |
| Block Group 3, Census Tract 5303, Wayne County, Michigan: Summary level: 150, state:26> county:163> tract:530300> block group:3 | 702 |


Based on this, we can check the total population of the selected areas.

In [19]:
print('Total population for the selected areas: ', sum(df['TOTAL POPULATION']))

Total population for the selected areas:  10641
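
If a per-tract breakdown is more useful than a single total, the block group rows can be summed by the tract named in each label. The sketch below uses plain Python so that it runs without the download. It takes only three of the rows shown above, with their labels shortened, so its numbers are a subset and not the full total. `population_by_tract` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Captures the tract number that follows 'Census Tract ' in a label.
TRACT_NAME = re.compile(r'Census Tract ([\d.]+)')

def population_by_tract(rows):
    """Sum (label, population) pairs by the tract number found in each label."""
    totals = defaultdict(int)
    for label, population in rows:
        match = TRACT_NAME.search(label)
        if match:
            totals[match.group(1)] += population
    return dict(totals)

# Three block groups taken from the output above (labels shortened).
rows = [
    ('Block Group 3, Census Tract 5301, Wayne County, Michigan', 771),
    ('Block Group 1, Census Tract 5302, Wayne County, Michigan', 1389),
    ('Block Group 3, Census Tract 5302, Wayne County, Michigan', 139),
]
print(population_by_tract(rows))  # {'5301': 771, '5302': 1528}
```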
