# Project 1: Digital Divide
### Data Prep

#### Based on PPIC's Just the Facts report ["California's Digital Divide"](https://www.ppic.org/publication/californias-digital-divide/)

## Goal:
* Explore the data file (`acs_data.dta.gz`) and create a _working dataset_ from it.

## Context:
The data come from the American Community Survey (ACS) and were obtained from [IPUMS](https://usa.ipums.org/usa/). <br>
It contains basic demographics:
  - age
  - gender
  - race/ethnicity

and geographic indicators:
  - state
  - county

***

#### Step 1: Set up your working environment.

Import all necessary libraries and create `Path`s to your data directories. 

In [8]:
# setting up working environment
import pandas as pd
from pathlib import Path
from datetime import datetime as dt

today = dt.today().strftime("%d-%b-%y")

In [2]:
# data folder and paths
RAW_DATA_PATH = Path("../data/raw/")
INTERIM_DATA_PATH = Path("../data/interim/")
PROCESSED_DATA_PATH = Path("../data/processed/")
FINAL_DATA_PATH = Path("../data/final/")

In [3]:
# ensure the dirs exist
import os 

def dir_exists(dir_path):
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
    else:
        print(f"{dir_path} found, skipping")
    
dir_exists(INTERIM_DATA_PATH)
dir_exists(PROCESSED_DATA_PATH)
dir_exists(FINAL_DATA_PATH)

../data/interim found, skipping
../data/processed found, skipping
../data/final found, skipping
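
As an alternative to the `os`-based check above, `pathlib` can create the folders directly. This is a minimal sketch, assuming you don't need the "found, skipping" message; `ensure_dirs` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
from pathlib import Path


def ensure_dirs(*dir_paths):
    """Create each directory (and any missing parents); do nothing if it already exists."""
    for dir_path in dir_paths:
        Path(dir_path).mkdir(parents=True, exist_ok=True)
```

With the paths defined earlier, the call would look like `ensure_dirs(INTERIM_DATA_PATH, PROCESSED_DATA_PATH, FINAL_DATA_PATH)`.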


***

#### Step 2: Load the data into a pandas dataframe

In [9]:
data = pd.read_csv(RAW_DATA_PATH / 'acs_data.csv.gz')

In [10]:
# import gzip and load data
import gzip

with gzip.open(RAW_DATA_PATH / 'acs_data.dta.gz') as file:
    data = pd.read_stata(file)
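
To try the `gzip.open` + `pd.read_stata` pattern without the real file, here is a small round-trip sketch. The dataframe `toy`, its values, and the temporary file names are all made up for illustration.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

import pandas as pd

# made-up data, only to exercise the loading pattern
toy = pd.DataFrame({"serial": [1, 2, 3], "hhwt": [51.0, 76.0, 117.0]})

with tempfile.TemporaryDirectory() as tmp:
    dta_path = Path(tmp) / "toy.dta"
    gz_path = Path(tmp) / "toy.dta.gz"

    # write a plain Stata file, then compress it with gzip
    toy.to_stata(dta_path, write_index=False)
    with open(dta_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # same pattern as the cell above: decompress on the fly, then parse
    with gzip.open(gz_path) as file:
        loaded = pd.read_stata(file)
```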

***

#### Step 3: Save the info for the file

In [7]:
data.describe()

| | serial | hhwt | countyfip | pernum | perwt |
|---|---|---|---|---|---|
| count | 3190040.0 | 3190040.0 | 3190040.0 | 3190040.0 | 3190040.0 |
| mean | 691840.0 | 96.14892 | 50.99185 | 2.083905 | 102.105 |
| std | 406411.7 | 75.94648 | 88.09394 | 1.340533 | 83.14945 |
| min | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| 25% | 335000.8 | 51.0 | 0.0 | 1.0 | 53.0 |
| 50% | 692532.0 | 76.0 | 19.0 | 2.0 | 80.0 |
| 75% | 1047493.0 | 117.0 | 73.0 | 3.0 | 124.0 |
| max | 1394399.0 | 2401.0 | 810.0 | 20.0 | 2401.0 |
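
`data.describe()` only displays a summary. If you also want to keep the file's info on disk, one option is to capture the output of `DataFrame.info()` and write it to a text file. This is a sketch, not part of the original notebook; `save_info` and the output file name in the usage note are hypothetical.

```python
import io
from pathlib import Path

import pandas as pd


def save_info(df, out_path):
    """Write the text that df.info() would print to out_path and return that text."""
    buffer = io.StringIO()
    df.info(buf=buffer)  # capture the summary instead of printing it
    text = buffer.getvalue()
    Path(out_path).write_text(text)
    return text
```

A call such as `save_info(data, INTERIM_DATA_PATH / f"acs_data-info-{today}.txt")` would store the column names, dtypes, and non-null counts next to the interim data.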


***

#### Step 4: Trim your data

Right now you're working with your **masterfile**: a dataset containing everything you _could_ need for your analysis. You don't want to modify this dataset because you might use it for other analyses. For example, we're going to analyze access to high-speed internet in a state of your choosing, but next week you might want to run the same analysis on another state or on a single county. To make sure you can **reuse** your data and code later, let's create an _analytical file_ or _working dataset_: a dataset that contains only the data needed for **this** specific analysis.
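
One way to keep this step reusable is to wrap it in a small function that never touches the masterfile. The sketch below uses made-up rows; `make_working_dataset` and `toy_master` are hypothetical names, and only the `statefip` column name is taken from this notebook.

```python
import pandas as pd


def make_working_dataset(masterfile, state, drop_cols=()):
    """Return an independent copy of the rows for one state, minus drop_cols."""
    mask_state = masterfile["statefip"] == state
    working = masterfile[mask_state].copy()  # copy so the masterfile stays untouched
    return working.drop(columns=list(drop_cols))


# made-up rows, for illustration only
toy_master = pd.DataFrame({
    "statefip": ["ohio", "ohio", "california"],
    "countyfip": [35, 49, 37],
    "raced": ["white", "white", "white"],
})
toy_state = make_working_dataset(toy_master, "ohio", drop_cols=["raced"])
```

Running the same analysis on another state then only requires a different `state` argument; the masterfile itself is left unchanged.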

In [23]:
mask_state = (data['statefip'] == 'ohio')
state_data = data[mask_state].copy()

In [26]:
# drop columns not needed for this analysis
state_data.drop(columns=['related', 'raced', 'hispand'], inplace=True)

Because of this, and because most of our observations fall under the 1970 and 1990 household definitions, we'll stick to those two for our analysis.

In [29]:
household_types = ['households under 1970 definition', 'additional households under 1990 definition']
mask_household = state_data['gq'].isin(household_types)

In [30]:
state_data = state_data[mask_household].copy()
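
The claim that most observations fall under these two definitions can be checked with `value_counts(normalize=True)` before filtering. The sketch below uses a made-up `gq` column; the third label and all counts are illustrative, not taken from the ACS file.

```python
import pandas as pd

# made-up gq values, for illustration only
toy_gq = pd.Series([
    "households under 1970 definition",
    "households under 1970 definition",
    "households under 1970 definition",
    "additional households under 1990 definition",
    "some other gq category",
], name="gq")

# share of observations in each category, largest first
gq_shares = toy_gq.value_counts(normalize=True)

household_labels = [
    "households under 1970 definition",
    "additional households under 1990 definition",
]
# combined share of the two household definitions
household_share = gq_shares[household_labels].sum()
```

On the real data, the equivalent call would be `state_data['gq'].value_counts(normalize=True)`, run before the mask is applied.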

Our research question 1 is: "What share of households in X state have access to high-speed internet?"

Mathematically,
$$ \frac{\text{households with high-speed internet}}{\text{households in state}} $$

Your `state_data` dataset contains all you need to find the answer. 
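
As a rough sketch of how that ratio could be computed, the example below works on made-up person-level rows. It assumes the high-speed internet column is called `cihispeed` with "yes"/"no" labels (check your own extract, since neither is confirmed in this notebook), that the row with `pernum == 1` stands in for its household, and that households should be weighted by `hhwt`.

```python
import pandas as pd

# made-up person-level rows; `cihispeed` and its "yes"/"no" labels are assumptions
toy = pd.DataFrame({
    "serial":    [1, 1, 2, 3, 3, 3],
    "pernum":    [1, 2, 1, 1, 2, 3],
    "hhwt":      [50.0, 50.0, 100.0, 50.0, 50.0, 50.0],
    "cihispeed": ["yes", "yes", "no", "yes", "yes", "yes"],
})

# keep one row per household so a household is not counted once per member
households = toy[toy["pernum"] == 1]

# numerator: weighted households with high-speed internet
# denominator: all weighted households
has_hispeed = households["cihispeed"] == "yes"
share = households.loc[has_hispeed, "hhwt"].sum() / households["hhwt"].sum()
```

In this toy example, households 1 and 3 have high-speed internet (weights 50 + 50) out of a total household weight of 200, so `share` is 0.5.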

***

#### Step 5: Save the data

In [32]:
state_data.to_stata(INTERIM_DATA_PATH / f'state_data-{today}.dta', write_index = False)