# Data Analysis Project with Python

# International Database (United States Census Bureau)

In [1]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Exploring the IDB Single Year database

Variable descriptions: https://api.census.gov/data/timeseries/idb/1year/variables.html

In [2]:
idbsingleyear = pd.read_table("idbzip/idbsingleyear.txt", delimiter="|")
idbsingleyear.head(10)

    #YR       GEO_ID  AGE  SEX  POP
0  1990  W140000WOAD    0    0  504
1  1990  W140000WOAD    0    1  264
2  1990  W140000WOAD    0    2  240
3  1990  W140000WOAD    1    0  550
4  1990  W140000WOAD    1    1  279
5  1990  W140000WOAD    1    2  271
6  1990  W140000WOAD    2    0  489
7  1990  W140000WOAD    2    1  260
8  1990  W140000WOAD    2    2  229
9  1990  W140000WOAD    3    0  515


In [3]:
# Dimensions (number of rows and columns)
idbsingleyear.shape

(7923753, 5)

In [4]:
# Check for data types and other essential information
idbsingleyear.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7923753 entries, 0 to 7923752
Data columns (total 5 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   #YR     int64 
 1   GEO_ID  object
 2   AGE     int64 
 3   SEX     int64 
 4   POP     int64 
dtypes: int64(4), object(1)
memory usage: 302.3+ MB


In [5]:
# Describe the numerical columns
idbsingleyear.describe()

             #YR        AGE        SEX         POP
count  7923753.0  7923753.0  7923753.0   7923753.0
mean    2042.538       50.0        1.0    252635.0
std     33.89533   29.15476  0.8164966   1087511.0
min       1950.0        0.0        0.0         0.0
25%       2014.0       25.0        0.0      1469.0
50%       2043.0       50.0        1.0     24032.0
75%       2072.0       75.0        2.0    134222.0
max       2100.0      100.0        2.0  30630620.0


In [6]:
# Count the unique values and the missing (NaN) values in each column
for col in idbsingleyear:
    print(f'Number of unique values in the column {col} is: {idbsingleyear[col].nunique()}')
    print(f'Number of NaN values in the column {col} is: {idbsingleyear[col].isna().sum()}')
    print('')

Number of unique values in the column #YR is: 151
Number of NaN values in the column #YR is: 0

Number of unique values in the column GEO_ID is: 227
Number of NaN values in the column GEO_ID is: 0

Number of unique values in the column AGE is: 101
Number of NaN values in the column AGE is: 0

Number of unique values in the column SEX is: 3
Number of NaN values in the column SEX is: 0

Number of unique values in the column POP is: 1095125
Number of NaN values in the column POP is: 0



In [7]:
# Show the first 20 unique values of the GEO_ID
idbsingleyear['GEO_ID'].unique()[:20]

array(['W140000WOAD', 'W140000WOAE', 'W140000WOAF', 'W140000WOAG',
       'W140000WOAI', 'W140000WOAL', 'W140000WOAM', 'W140000WOAO',
       'W140000WOAR', 'W140000WOAS', 'W140000WOAT', 'W140000WOAU',
       'W140000WOAW', 'W140000WOAZ', 'W140000WOBA', 'W140000WOBB',
       'W140000WOBD', 'W140000WOBE', 'W140000WOBF', 'W140000WOBG'],
      dtype=object)

The GEO_IDs look very similar: all of the values shown start with 'W140000WO', followed by two distinguishing letters.

After some outside research, I believe the two distinguishing letters are the two-letter country codes listed on this website: https://www.census.gov/programs-surveys/international-programs/about/idb/countries-and-areas.html

In [8]:
# Check if every GEO_ID starts with 'W140000WO' (vectorized over the whole column):
result = idbsingleyear['GEO_ID'].str.startswith('W140000WO').all()
print(result)

True


The check over the whole GEO_ID column confirms my intuition: every GEO_ID starts with 'W140000WO'.
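Since every GEO_ID shares the same prefix, the two trailing letters can be split off into their own column. The following is a minimal sketch on a small illustrative sample rather than the full dataset; the column name `COUNTRY_CODE` and the `sample` frame are assumptions made for this example, and treating the suffix as a country code is still the hypothesis stated above, not something this code verifies.

```python
import pandas as pd

# Small illustrative sample with the same GEO_ID shape as the IDB data
sample = pd.DataFrame({"GEO_ID": ["W140000WOAD", "W140000WOAE", "W140000WOAF"]})

PREFIX = "W140000WO"

# Confirm that each ID is the shared prefix followed by exactly two capital letters
assert sample["GEO_ID"].str.fullmatch(PREFIX + r"[A-Z]{2}").all()

# Strip the prefix, keeping only the two distinguishing letters.
# COUNTRY_CODE is a hypothetical column name, not part of the original data.
sample["COUNTRY_CODE"] = sample["GEO_ID"].str[len(PREFIX):]

print(sample["COUNTRY_CODE"].tolist())
```

On the full data, the same two steps could be applied to `idbsingleyear['GEO_ID']` in place of the sample column.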