## Analyzing 2019 CPS Data 

This notebook contains the code used to wrangle, join, and analyze the November 2019 Current Population Survey (CPS) [Computer and Internet Use Supplement](https://www.census.gov/data/datasets/time-series/demo/cps/cps-supp_cps-repwgt/cps-computer.html).

All CSV files can be found in the "data" folder of the [working](https://github.com/danielgrzenda/broadbandequity/tree/working) branch of our Broadband Equity GitHub repo.

##### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

##### Fetching API data

In [2]:
#from data_pipeline import fetch_census_data

In [3]:
#fetch_census_data.cps_individual(force_api_call=True)

##### Loading dataset

In [18]:
# loading cps data

census_df=pd.read_csv("data/cps_individual.csv").drop(
                        ['Unnamed: 0', 'state', 'county'], axis=1)

In [19]:
# filter for Chicago, then replace the -1 sentinel with NaN
# result: 562 rows x 26 columns

census_df = census_df[(census_df['area'] == 16980) & 
                      (census_df['city'] == 1)].replace(-1, np.nan)

census_df.head(5)

Unnamed: 0,area,city,use_internet,hh_id,hh_size,employment,education,hispanic,race,internet_provider,...,cant_afford_internet,internet_NA,no_device_internet,other_no_internet,doctor_internet,health_research_online,health_monitor_online,health_records_online,buy_internet_lower_price,online_classes
31,16980,1,1,310604531644537,3,7.0,43.0,2,4,1.0,...,,,,,2.0,2.0,2.0,2.0,,1.0
32,16980,1,1,310604531644537,3,7.0,43.0,2,4,1.0,...,,,,,2.0,2.0,2.0,2.0,,
33,16980,1,1,310604531644537,3,1.0,44.0,2,4,1.0,...,,,,,2.0,2.0,2.0,2.0,,
34,16980,1,1,107163554710898,2,2.0,43.0,2,1,1.0,...,,,,,2.0,1.0,2.0,2.0,,
35,16980,1,1,107163554710898,2,3.0,44.0,2,1,1.0,...,,,,,2.0,1.0,2.0,2.0,,1.0
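
The filter-then-recode step above can be checked on a small made-up frame. The sketch below assumes, as the cell does, that `area == 16980` together with `city == 1` selects Chicago and that `-1` marks a missing answer. The toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cps_individual.csv; all values are made up for illustration.
toy = pd.DataFrame({
    "area": [16980, 16980, 35620, 16980],
    "city": [1, 0, 1, 1],
    "use_internet": [1, 2, 1, -1],
})

# Keep rows matching both conditions, then turn the -1 sentinel into NaN.
is_chicago = (toy["area"] == 16980) & (toy["city"] == 1)
chicago = toy[is_chicago].replace(-1, np.nan)

print(chicago)
```

Building the boolean mask as a named variable first makes it easy to inspect how many rows each condition keeps before the recode is applied.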


##### Demographics

In [75]:
# first we look at the overall demographics of the sample

demo_df = census_df[[
                'hh_id',
                'hh_size',
                'employment',
                'education',
                'hispanic',
                'race']]

In [76]:
# collapse employment to 1,2,3 (employed, unemployed, other)
# .assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning

employment_map = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3}
demo_df = demo_df.assign(employment=demo_df['employment'].map(employment_map))


In [77]:
# collapse education to 1,2,3 (<=high school, some college, >=bachelors)

demo_df.loc[demo_df.education <= 39, 'education'] = 1
demo_df.loc[demo_df.education.between(40, 42), 'education'] = 2
demo_df.loc[demo_df.education >= 43, 'education'] = 3
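
The same three-way collapse can also be written in one step with `pd.cut`. This is a sketch of an alternative, not the notebook's method. The bin edges encode the same cut points as the cell above (codes up to 39, 40 to 42, 43 and above), and the sample codes are made up.

```python
import numpy as np
import pandas as pd

# Made-up education codes, including one missing value.
education = pd.Series([31, 39, 40, 42, 43, 46, np.nan])

# Right-closed bins: (-inf, 39] -> 1, (39, 42] -> 2, (42, inf) -> 3.
collapsed = pd.cut(
    education,
    bins=[-np.inf, 39, 42, np.inf],
    labels=[1, 2, 3],
).astype("float")  # float dtype keeps NaN for missing codes

print(collapsed.tolist())
```

Because the cut points live in a single `bins` list, changing the grouping later means editing one line rather than several assignments.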

In [93]:
# resulting dataframe

demo_df.head(5)

Unnamed: 0,hh_id,hh_size,employment,education,hispanic,race
31,310604531644537,3,3.0,3.0,2,4
32,310604531644537,3,3.0,3.0,2,4
33,310604531644537,3,1.0,3.0,2,4
34,107163554710898,2,1.0,3.0,2,1
35,107163554710898,2,2.0,3.0,2,1


Let's look at the distribution of values in each column.

In [87]:
# employment
demo_df.employment.value_counts() / len(demo_df)*100

1.0    52.135231
3.0    32.206406
2.0     1.779359
Name: employment, dtype: float64
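
Because the cell divides by `len(demo_df)`, rows with missing values stay in the denominator, which is why the percentages above sum to about 86% rather than 100%. A small sketch with made-up codes shows the difference between that calculation and `value_counts(normalize=True)`, which by default drops missing values before computing shares.

```python
import numpy as np
import pandas as pd

# Made-up collapsed employment codes; two of ten values are missing.
employment = pd.Series([1, 1, 1, 1, 1, 3, 3, 2, np.nan, np.nan])

# Share of all rows (the calculation used in the cell above).
share_of_rows = employment.value_counts() / len(employment) * 100

# Share of non-missing answers only.
share_of_answers = employment.value_counts(normalize=True) * 100

# Keeping NaN as its own category makes the gap explicit.
share_with_missing = employment.value_counts(normalize=True, dropna=False) * 100

print(share_of_rows.sum(), share_of_answers.sum())
```

Which denominator is appropriate depends on the question being asked, so it helps to state it next to each reported percentage.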

In [88]:
# education
demo_df.education.value_counts() / len(demo_df)*100

3.0    35.409253
1.0    33.096085
2.0    17.615658
Name: education, dtype: float64

In [89]:
# hispanic

demo_df.hispanic.value_counts() / len(demo_df)*100

2    71.886121
1    28.113879
Name: hispanic, dtype: float64

In [90]:
# race

demo_df.race.value_counts() / len(demo_df)*100

1    59.964413
2    29.003559
4     8.896797
8     1.067616
6     0.711744
3     0.177936
5     0.177936
Name: race, dtype: float64

In [98]:
# household size average

demo_df.hh_size.mean()

3.0854092526690393

In [99]:
# household size distribution 

demo_df.hh_size.value_counts() / len(demo_df)*100

4    24.021352
2    22.953737
1    20.462633
3    15.836299
6     7.473310
5     6.761566
9     1.245552
7     1.245552
Name: hh_size, dtype: float64

52% of the sample is employed, 32% is disabled, retired, or other, and fewer than 2% is unemployed. The remaining 14% has no recorded employment value.

35% of the sample holds at least a bachelor's degree, 33% has a high school diploma/GED or less, and 18% has completed some college. The remaining 14% has no recorded education value.

72% of the sample is non-Hispanic and 28% identifies as Hispanic.

60% of the sample identifies as White only, 29% as Black only, and 9% as Asian only.

The average household size in the sample is about 3.1 people. 83% of the sample lives in a household of four or fewer people.
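
Each row appears to be a person (the file is `cps_individual.csv` and `hh_id` repeats across rows), so `demo_df.hh_size.mean()` counts a household once per member present in the data. If a per-household figure is wanted, one option is to keep a single row per `hh_id` before averaging. The sketch below uses made-up values and assumes `hh_size` is constant within a household.

```python
import pandas as pd

# Made-up person-level rows: household A has 3 members, household B has 1.
people = pd.DataFrame({
    "hh_id": ["A", "A", "A", "B"],
    "hh_size": [3, 3, 3, 1],
})

# Person-level mean: larger households count once per member.
person_mean = people["hh_size"].mean()

# Household-level mean: keep one row per hh_id before averaging.
household_mean = people.drop_duplicates("hh_id")["hh_size"].mean()

print(person_mean, household_mean)
```

The person-level figure answers "how large is the household a typical person lives in", while the household-level figure answers "how large is a typical household".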

##### Internet Access

In [118]:
# 0 = no
# 1 = yes

access_df = census_df[[
                'hh_id',
                'high_speed_service',
                'mobile_plan',
                'dont_need_internet',
                'cant_afford_internet',
                'internet_NA',
                'no_device_internet',
                'other_no_internet']].replace(2,0)

In [119]:
access_df.head(5)

Unnamed: 0,hh_id,high_speed_service,mobile_plan,dont_need_internet,cant_afford_internet,internet_NA,no_device_internet,other_no_internet
31,310604531644537,1.0,1.0,,,,,
32,310604531644537,1.0,1.0,,,,,
33,310604531644537,1.0,1.0,,,,,
34,107163554710898,1.0,1.0,,,,,
35,107163554710898,1.0,1.0,,,,,
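
Calling `.replace(2, 0)` on the whole selection recodes every column, including `hh_id`. That is harmless as long as no identifier equals 2, but a narrower version that touches only the yes/no columns is easy to write. The sketch below uses made-up rows and a hypothetical `yes_no_cols` list.

```python
import numpy as np
import pandas as pd

# Made-up rows using the notebook's coding: 1 = yes, 2 = no, NaN = no answer.
toy = pd.DataFrame({
    "hh_id": [2, 17, 23],  # an id of 2 is contrived to show the risk
    "high_speed_service": [1.0, 2.0, np.nan],
    "mobile_plan": [2.0, 1.0, 1.0],
})

yes_no_cols = ["high_speed_service", "mobile_plan"]  # hypothetical helper list

# Recode 2 -> 0 in the answer columns only; hh_id is left untouched.
recoded = toy.copy()
recoded[yes_no_cols] = recoded[yes_no_cols].replace(2, 0)

print(recoded)
```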


In [114]:
# high speed service 

access_df.high_speed_service.value_counts() / len(access_df)*100

1.0    59.252669
0.0    15.124555
Name: high_speed_service, dtype: float64

In [120]:
# mobile plan

access_df.mobile_plan.value_counts() / len(access_df)*100

1.0    64.056940
0.0    14.946619
Name: mobile_plan, dtype: float64

In [121]:
# dont need internet

access_df.dont_need_internet.value_counts() / len(access_df)*100

1.0    15.302491
0.0    10.320285
Name: dont_need_internet, dtype: float64

In [122]:
# cant afford internet

access_df.cant_afford_internet.value_counts() / len(access_df)*100

0.0    17.615658
1.0     8.007117
Name: cant_afford_internet, dtype: float64

In [123]:
# internet not available

access_df.internet_NA.value_counts() / len(access_df)*100

0.0    25.266904
1.0     0.355872
Name: internet_NA, dtype: float64

In [124]:
# no device for internet

access_df.no_device_internet.value_counts() / len(access_df)*100

0.0    23.309609
1.0     2.313167
Name: no_device_internet, dtype: float64

In [125]:
# other reason for no internet access

access_df.other_no_internet.value_counts() / len(access_df)*100

0.0    23.843416
1.0     1.779359
Name: other_no_internet, dtype: float64
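
Since these columns are coded 0/1 with NaN for rows that have no answer, the mean of a column is the share of "yes" among rows that do have an answer. For example, the `cant_afford_internet` output above implies 8.01 / (17.62 + 8.01), or roughly 31%, of answered rows are "yes". A sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Made-up 0/1 answers; NaN marks rows with no recorded answer.
cant_afford = pd.Series([1, 0, 0, 1, np.nan, np.nan, np.nan, np.nan])

# Series.mean() skips NaN by default, so this is yes / (yes + no).
share_yes_among_answered = cant_afford.mean() * 100

# Dividing the count of "yes" by all rows gives the smaller figure
# comparable to the cells above.
share_yes_of_all_rows = (cant_afford == 1).sum() / len(cant_afford) * 100

print(share_yes_among_answered, share_yes_of_all_rows)
```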

##### Internet Usage

In [109]:
# 0 = no
# 1 = yes

use_df = census_df[[
                'hh_id',
                'use_internet',
                'doctor_internet',
                'health_research_online',
                'health_monitor_online',
                'health_records_online',
                'buy_internet_lower_price',
                'online_classes']].replace(2,0)

In [126]:
use_df.head(5)

Unnamed: 0,hh_id,use_internet,doctor_internet,health_research_online,health_monitor_online,health_records_online,buy_internet_lower_price,online_classes
31,310604531644537,1,0.0,0.0,0.0,0.0,,1.0
32,310604531644537,1,0.0,0.0,0.0,0.0,,
33,310604531644537,1,0.0,0.0,0.0,0.0,,
34,107163554710898,1,0.0,1.0,0.0,0.0,,
35,107163554710898,1,0.0,1.0,0.0,0.0,,1.0


In [127]:
# use internet

use_df.use_internet.value_counts() / len(use_df)*100

1    74.377224
0    25.622776
Name: use_internet, dtype: float64

In [128]:
# doctor internet

use_df.doctor_internet.value_counts() / len(use_df)*100

0.0    52.135231
1.0    26.868327
Name: doctor_internet, dtype: float64

In [129]:
# health_research_online

use_df.health_research_online.value_counts() / len(use_df)*100

0.0    52.135231
1.0    26.868327
Name: health_research_online, dtype: float64

In [130]:
# health_monitor_online

use_df.health_monitor_online.value_counts() / len(use_df)*100

0.0    75.088968
1.0     3.914591
Name: health_monitor_online, dtype: float64

In [131]:
# health_records_online

use_df.health_records_online.value_counts() / len(use_df)*100

0.0    49.466192
1.0    29.537367
Name: health_records_online, dtype: float64

In [132]:
# online_classes

use_df.online_classes.value_counts() / len(use_df)*100

0.0    28.113879
1.0     8.007117
Name: online_classes, dtype: float64

In [133]:
# buy internet lower price

use_df.buy_internet_lower_price.value_counts() / len(use_df)*100

0.0    20.640569
1.0     4.982206
Name: buy_internet_lower_price, dtype: float64

##### Summary

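- Very small sample size (562 rows for Chicago)
- Very low response rates for internet-related variables: for example, only about 26% of rows have a value for the reasons-for-no-internet questions and about 36% for `online_classes`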
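
One way to put numbers on the response-rate point is the share of non-missing values per column. The sketch below uses a made-up frame. On the real data the same expression could be applied to `access_df` or `use_df`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for use_df: NaN means no recorded answer.
toy = pd.DataFrame({
    "use_internet": [1, 0, 1, 1],
    "online_classes": [1.0, np.nan, np.nan, 0.0],
    "buy_internet_lower_price": [np.nan, 0.0, np.nan, np.nan],
})

# notna() gives True/False per cell; the column mean is the answered share.
answered_pct = toy.notna().mean() * 100

print(answered_pct.sort_values())
```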