## Analyzing 2019 CPS Data 

This notebook contains the code used to wrangle, join, and analyze the November 2019 Current Population Survey (CPS) [Computer and Internet Use Supplement](https://www.census.gov/data/datasets/time-series/demo/cps/cps-supp_cps-repwgt/cps-computer.html).

All CSV files can be found in the "data" folder of the [working](https://github.com/danielgrzenda/broadbandequity/tree/working) branch of our Broadband Equity GitHub repo.

##### Importing Libraries

In [1]:
import os
import sys
sys.path[0] = os.path.join(os.path.abspath(''),'..')  # make sure we can import from our package

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from data_pipeline import fetch_census_data

##### Fetching API data

In [3]:
# If you updated variables in config.ini, rerun the next line to fetch the latest changes 
# fetch_census_data.cps_individual(force_api_call=True)

##### Loading dataset

In [4]:
# loading cps data

census_df=fetch_census_data.cps_individual()

In [5]:
# filtering for Chicago
# 562 rows x 26 columns
# replace -1 with NA

# np.nan is used because the np.NaN alias was removed in NumPy 2.0
census_df = census_df[(census_df['area'] == 16980) & 
                      (census_df['city'] == 1)].replace(-1, np.nan)

census_df.head(5)

Unnamed: 0,area,city,use_internet,hh_id,hh_size,employment,education,hispanic,race,internet_provider,...,no_device_internet,other_no_internet,doctor_internet,health_research_online,health_monitor_online,health_records_online,buy_internet_lower_price,online_classes,state,county
31,16980,1,1,310604531644537,3,7.0,43.0,2,4,1.0,...,,,2.0,2.0,2.0,2.0,,1.0,17,0
32,16980,1,1,310604531644537,3,7.0,43.0,2,4,1.0,...,,,2.0,2.0,2.0,2.0,,,17,0
33,16980,1,1,310604531644537,3,1.0,44.0,2,4,1.0,...,,,2.0,2.0,2.0,2.0,,,17,0
34,16980,1,1,107163554710898,2,2.0,43.0,2,1,1.0,...,,,2.0,1.0,2.0,2.0,,,17,0
35,16980,1,1,107163554710898,2,3.0,44.0,2,1,1.0,...,,,2.0,1.0,2.0,2.0,,1.0,17,0


##### Demographics

In [6]:
# first we look at the overall demographics of the sample

demo_df = census_df[[
                'hh_id',
                'hh_size',
                'employment',
                'education',
                'hispanic',
                'race']]

In [7]:
# collapse employment to 1,2,3 (employed, unemployed, other)
# a dict-based replace maps every code in one pass; .assign() returns a new
# DataFrame, which avoids pandas' SettingWithCopyWarning

demo_df = demo_df.assign(
    employment=demo_df['employment'].replace({2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3}))

In [8]:
# collapse education to 1,2,3 (<=high school, some college, >=bachelors)
# conditions are evaluated on the original codes; missing values stay missing
# .assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning

edu = demo_df['education']
demo_df = demo_df.assign(
    education=edu.mask(edu <= 39, 1).mask(edu.between(40, 42), 2).mask(edu >= 43, 3))


In [9]:
# resulting dataframe

demo_df.head(5)

Unnamed: 0,hh_id,hh_size,employment,education,hispanic,race
31,310604531644537,3,3.0,3.0,2,4
32,310604531644537,3,3.0,3.0,2,4
33,310604531644537,3,1.0,3.0,2,4
34,107163554710898,2,1.0,3.0,2,1
35,107163554710898,2,2.0,3.0,2,1
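
The education collapse above groups codes into three ranges. As an alternative sketch (not the method used in this notebook), `pd.cut` can express the same ranges as bin edges. The series below is toy data, not real CPS rows, and the names `education` and `collapsed` are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy education codes (not real CPS rows); NaN marks a missing answer.
education = pd.Series([31, 39, 40, 42, 43, 46, np.nan])

# Right-closed bins: (-inf, 39] -> 0, (39, 42] -> 1, (42, inf) -> 2.
# labels=False returns the bin index, so adding 1 gives groups 1, 2, 3.
collapsed = pd.cut(education, bins=[-np.inf, 39, 42, np.inf], labels=False) + 1

print(collapsed.tolist())
```

Missing codes fall in no bin, so they remain missing after the collapse.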


Let's look at the distribution of values in each column.

In [10]:
# employment
demo_df.employment.value_counts() / len(demo_df)*100

1.0    52.135231
3.0    32.206406
2.0     1.779359
Name: employment, dtype: float64

In [11]:
# education
demo_df.education.value_counts() / len(demo_df)*100

3.0    35.409253
1.0    33.096085
2.0    17.615658
Name: education, dtype: float64

In [12]:
# hispanic

demo_df.hispanic.value_counts() / len(demo_df)*100

2    71.886121
1    28.113879
Name: hispanic, dtype: float64

In [13]:
# race

demo_df.race.value_counts() / len(demo_df)*100

1    59.964413
2    29.003559
4     8.896797
8     1.067616
6     0.711744
3     0.177936
5     0.177936
Name: race, dtype: float64

In [14]:
# household size average

demo_df.hh_size.mean()

3.0854092526690393

In [15]:
# household size distribution 

demo_df.hh_size.value_counts() / len(demo_df)*100

4    24.021352
2    22.953737
1    20.462633
3    15.836299
6     7.473310
5     6.761566
9     1.245552
7     1.245552
Name: hh_size, dtype: float64

**Demographics Summary**

- 52% of the sample is employed, 32% is disabled, retired, or other, and less than 2% is unemployed. The remaining 14% have no recorded employment status.
- 35% of the sample has at least a bachelor's degree, 33% has a high school diploma/GED or less, and 18% completed some college. The remaining 14% have no recorded education level.
- 72% of the sample is non-Hispanic. 28% identify as Hispanic.
- 60% of the sample identifies as White only, 29% as Black only, and 9% as Asian only.
- The average household size in the sample is about 3 people (3.09). 83% of the sample lives in households of 4 or fewer people.
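
The percentages above divide by the full sample, so rows with no answer are silently left out of the listed categories. A minimal sketch of one way to make the missing share visible, assuming a toy column and a hypothetical `labels` mapping (neither comes from the CPS data):

```python
import numpy as np
import pandas as pd

# Toy employment column already collapsed to 1/2/3, with one missing answer.
employment = pd.Series([1.0, 1.0, 3.0, 2.0, np.nan, 1.0, 3.0, 1.0])

# Hypothetical label mapping for the collapsed codes.
labels = {1.0: "employed", 2.0: "unemployed", 3.0: "other"}

# dropna=False keeps missing answers as their own category, so the
# percentages sum to 100 across the whole sample.
shares = (
    employment.map(labels)
    .value_counts(normalize=True, dropna=False)
    .mul(100)
)

print(shares)
```

With `dropna=False`, the missing answers show up as a `NaN` row instead of having to be inferred from percentages that do not add up to 100.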

##### Internet Access

In [16]:
# 0 = no
# 1 = yes

access_df = census_df[[
                'hh_id',
                'high_speed_service',
                'mobile_plan',
                'dont_need_internet',
                'cant_afford_internet',
                'internet_NA',
                'no_device_internet',
                'other_no_internet']].replace(2,0)

In [17]:
access_df.head(5)

Unnamed: 0,hh_id,high_speed_service,mobile_plan,dont_need_internet,cant_afford_internet,internet_NA,no_device_internet,other_no_internet
31,310604531644537,1.0,1.0,,,,,
32,310604531644537,1.0,1.0,,,,,
33,310604531644537,1.0,1.0,,,,,
34,107163554710898,1.0,1.0,,,,,
35,107163554710898,1.0,1.0,,,,,


Let's look at the distribution of values in each column.

In [18]:
# high speed service 

access_df.high_speed_service.value_counts() / len(access_df)*100

1.0    59.252669
0.0    15.124555
Name: high_speed_service, dtype: float64

In [19]:
# mobile plan

access_df.mobile_plan.value_counts() / len(access_df)*100

1.0    64.056940
0.0    14.946619
Name: mobile_plan, dtype: float64

In [20]:
# dont need internet

access_df.dont_need_internet.value_counts() / len(access_df)*100

1.0    15.302491
0.0    10.320285
Name: dont_need_internet, dtype: float64

In [21]:
# cant afford internet

access_df.cant_afford_internet.value_counts() / len(access_df)*100

0.0    17.615658
1.0     8.007117
Name: cant_afford_internet, dtype: float64

In [22]:
# internet not available

access_df.internet_NA.value_counts() / len(access_df)*100

0.0    25.266904
1.0     0.355872
Name: internet_NA, dtype: float64

In [23]:
# no device for internet

access_df.no_device_internet.value_counts() / len(access_df)*100

0.0    23.309609
1.0     2.313167
Name: no_device_internet, dtype: float64

In [24]:
# other reason for no internet access

access_df.other_no_internet.value_counts() / len(access_df)*100

0.0    23.843416
1.0     1.779359
Name: other_no_internet, dtype: float64

**Internet Access Summary**
- 59% of the sample accesses the Internet through an installed high-speed service such as cable, DSL, or fiber optic. 15% said they do not.
- 64% of the sample said they access the Internet using a data plan for a cell phone or other smart device. This type of Internet service is provided by a wireless carrier and may be part of a package that also includes voice calls from a cell phone or smartphone. 15% said no.
- 15% said they do not have internet access because they do not need it or are not interested. 10% answered no to this.
- 8% said they cannot afford internet access, while 18% said this was not the case.
- Less than 1% said internet service was not available in their area. 25% said this was not the case.
- 2% said they did not have a device to use the internet. 23% said this was not the case for them.
- Less than 2% said there was another reason for no internet access. 24% said this was not the case for them.


Overall, these variables had low response rates: about 26% for the last five variables and roughly 74-79% for the first two.
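
The response rates quoted here can be read off by adding the yes and no shares for each variable. A short sketch of computing them directly, assuming a toy frame that stands in for `access_df` (the values are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for access_df: NaN means the question was not answered.
toy_access = pd.DataFrame({
    "high_speed_service": [1, 0, 1, np.nan],
    "cant_afford_internet": [np.nan, 1, np.nan, np.nan],
})

# Share of rows with a non-missing answer, per column, as a percentage.
response_rate = toy_access.notna().mean().mul(100)

print(response_rate)
```

Applied to the real data, the same expression would give one response rate per column in a single call.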

##### Internet Usage

In [25]:
# 0 = no
# 1 = yes

use_df = census_df[[
                'hh_id',
                'use_internet',
                'doctor_internet',
                'health_research_online',
                'health_monitor_online',
                'health_records_online',
                'buy_internet_lower_price',
                'online_classes']].replace(2,0)

In [26]:
use_df.head(5)

Unnamed: 0,hh_id,use_internet,doctor_internet,health_research_online,health_monitor_online,health_records_online,buy_internet_lower_price,online_classes
31,310604531644537,1,0.0,0.0,0.0,0.0,,1.0
32,310604531644537,1,0.0,0.0,0.0,0.0,,
33,310604531644537,1,0.0,0.0,0.0,0.0,,
34,107163554710898,1,0.0,1.0,0.0,0.0,,
35,107163554710898,1,0.0,1.0,0.0,0.0,,1.0


Let's look at the distribution of values in each column.

In [27]:
# use internet

use_df.use_internet.value_counts() / len(use_df)*100

1    74.377224
0    25.622776
Name: use_internet, dtype: float64

In [28]:
# doctor internet

use_df.doctor_internet.value_counts() / len(use_df)*100

0.0    52.135231
1.0    26.868327
Name: doctor_internet, dtype: float64

In [29]:
# health_research_online

use_df.health_research_online.value_counts() / len(use_df)*100

0.0    52.135231
1.0    26.868327
Name: health_research_online, dtype: float64

In [30]:
# health_monitor_online

use_df.health_monitor_online.value_counts() / len(use_df)*100

0.0    75.088968
1.0     3.914591
Name: health_monitor_online, dtype: float64

In [31]:
# health_records_online

use_df.health_records_online.value_counts() / len(use_df)*100

0.0    49.466192
1.0    29.537367
Name: health_records_online, dtype: float64

In [32]:
# online_classes

use_df.online_classes.value_counts() / len(use_df)*100

0.0    28.113879
1.0     8.007117
Name: online_classes, dtype: float64

In [33]:
# buy internet lower price

use_df.buy_internet_lower_price.value_counts() / len(use_df)*100

0.0    20.640569
1.0     4.982206
Name: buy_internet_lower_price, dtype: float64

**Internet Usage Summary**
- 74% of the sample said there is someone in the household who uses the Internet at home. 26% said there was no one.
- 27% said they communicate with a doctor or other health professional via the internet. 52% said they don't.
- The same shares (27% and 52%) said they do and don't research health information online using WebMD or similar services.
- Only 4% of the sample uses monitoring devices that collect their information and send it to doctors via the internet. 75% said they do not.
- 30% said they access their health records online. 50% said they don't.
- 8% said they take online classes. 28% said they don't.
- 5% said they would buy internet service if it were cheaper. 21% said no to this question.

These variables had better response rates than the internet access variables, but some were still very low. The first five variables were in the 79-100% range; the last two were at about 36% (online classes) and 26% (buying internet at a lower price).
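
Because the response rates differ so much, a share of the whole sample and a share of those who answered can tell different stories. A minimal sketch of the two calculations on a toy column (the values are illustrative, not CPS results):

```python
import numpy as np
import pandas as pd

# Toy yes/no column: 8 rows, of which only 4 answered the question.
answers = pd.Series([1, 0, 0, 0, np.nan, np.nan, np.nan, np.nan])

# Share of the whole sample, as computed throughout this notebook.
share_of_sample = answers.value_counts() / len(answers) * 100

# Share of respondents only: normalize=True divides by non-missing answers.
share_of_respondents = answers.value_counts(normalize=True) * 100

print(share_of_sample)
print(share_of_respondents)
```

Which denominator is appropriate depends on the question being asked; the summaries in this notebook use the whole sample.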

##### Summary

**Overall Summary (highlights)**
- This dataset contains a very small sample: 562 records, against 1,066,829 households in Chicago (about 0.05%). The records are individual respondents, several of whom can share one `hh_id`, so the number of distinct households sampled is smaller still.
- There were very low response rates for the variables asking why people did not have access to the internet, and for a couple of specific uses of the internet. (see below for specifics)
- 60% of the sample identified as White only, 29% identified as Black only, and 28% identified as Hispanic (all races).
- Very few people in the sample were unemployed (less than 2%).
- The average household size in the sample was about 3 people.
- 59% of the sample said they access the Internet using a high-speed internet service.
- 64% said they access the internet using a mobile data plan.
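
Since several rows can share one `hh_id`, row counts and household counts differ. A small sketch of separating the two, using a toy frame built from the two household IDs shown in the `head()` output earlier (the frame itself is illustrative):

```python
import pandas as pd

# Toy frame: five person rows belonging to two households.
toy = pd.DataFrame({
    "hh_id": [310604531644537] * 3 + [107163554710898] * 2,
    "hh_size": [3, 3, 3, 2, 2],
})

n_rows = len(toy)                      # person records
n_households = toy["hh_id"].nunique()  # distinct households

# Averaging over person rows weights larger households more heavily;
# keeping one row per household gives the per-household average.
mean_over_rows = toy["hh_size"].mean()
mean_over_households = toy.drop_duplicates("hh_id")["hh_size"].mean()

print(n_rows, n_households, mean_over_rows, mean_over_households)
```

The same distinction applies to any household-level variable that is repeated on every person row.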



**Demographics Summary**

- 52% of the sample is employed, 32% is disabled, retired, or other, and less than 2% is unemployed. The remaining 14% have no recorded employment status.
- 35% of the sample has at least a bachelor's degree, 33% has a high school diploma/GED or less, and 18% completed some college. The remaining 14% have no recorded education level.
- 72% of the sample is non-Hispanic. 28% identify as Hispanic.
- 60% of the sample identifies as White only, 29% as Black only, and 9% as Asian only.
- The average household size in the sample is about 3 people (3.09). 83% of the sample lives in households of 4 or fewer people.

**Internet Access Summary**
- 59% of the sample accesses the Internet through an installed high-speed service such as cable, DSL, or fiber optic. 15% said they do not.
- 64% of the sample said they access the Internet using a data plan for a cell phone or other smart device. This type of Internet service is provided by a wireless carrier and may be part of a package that also includes voice calls from a cell phone or smartphone. 15% said no.
- 15% said they do not have internet access because they do not need it or are not interested. 10% answered no to this.
- 8% said they cannot afford internet access, while 18% said this was not the case.
- Less than 1% said internet service was not available in their area. 25% said this was not the case.
- 2% said they did not have a device to use the internet. 23% said this was not the case for them.
- Less than 2% said there was another reason for no internet access. 24% said this was not the case for them.


Overall, these variables had low response rates: about 26% for the last five variables and roughly 74-79% for the first two.


**Internet Usage Summary**
- 74% of the sample said there is someone in the household who uses the Internet at home. 26% said there was no one.
- 27% said they communicate with a doctor or other health professional via the internet. 52% said they don't.
- The same shares (27% and 52%) said they do and don't research health information online using WebMD or similar services.
- Only 4% of the sample uses monitoring devices that collect their information and send it to doctors via the internet. 75% said they do not.
- 30% said they access their health records online. 50% said they don't.
- 8% said they take online classes. 28% said they don't.
- 5% said they would buy internet service if it were cheaper. 21% said no to this question.

These variables had better response rates than the internet access variables, but some were still very low. The first five variables were in the 79-100% range; the last two were at about 36% (online classes) and 26% (buying internet at a lower price).