The purpose of this notebook is to load the data into two formats:

1. Full dataset (excluding rows with missing data, such as exit dates). This is the dataframe called `df`.
    1. Each person's information gets combined into one row per project enrollment.
2. Each person summarized as one row. This is the dataframe called `df_features`.

You can examine how each sheet is loaded using the dataset loading script: `datasci-sf-homeless-project/src/data/dataset.py`

The script assumes you have access to the data via Dropbox, as described in the #datasci-homeless channel of the sfbrigade Slack team. Everyone has read access; talk to Matt, Catherine, or Annalie if you want to be added to the shared folder. If you downloaded the data or keep it somewhere else, pass a `datadir` argument to each `process_data_*` function, e.g.:

```python
df_client = ds.process_data_client(datadir='/path/to/raw/csv/files/')
```

Notes:

- One person can have multiple rows in `df`.
- One person can be enrolled in multiple projects at the same time.
- This notebook does not yet make use of the `Service` or `Bed Inventory` sheets.
- If you save out the CSV at the end (or have it from Dropbox), you can simply load the dataset with the commands:

```python
import os
import pandas as pd
filename = os.path.join(os.getenv('HOME'), 'Dropbox', 'C4SF-datasci-homeless', 'processed', 'homeless_row_per_enrollment.csv')
df = pd.read_csv(filename, header=0, index_col=0, parse_dates=['Entry Date', 'Exit Date', 'Residential Move In Date'])
```
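
To see what `index_col=0` and `parse_dates` do when the CSV is reloaded, here is a minimal round-trip sketch. The two-row frame and its values are made up for illustration, and it writes to an in-memory buffer rather than the Dropbox path:

```python
import io

import pandas as pd

# Hypothetical toy data; the real file has many more columns.
toy = pd.DataFrame(
    {
        'Entry Date': pd.to_datetime(['2012-01-15', '2013-06-01']),
        'Exit Date': pd.to_datetime(['2012-03-01', '2013-07-15']),
    },
    index=pd.Index([101, 102], name='Personal ID'),
)

# Write to CSV text, then read it back the same way the notebook does.
csv_text = toy.to_csv()
reloaded = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0,
                       parse_dates=['Entry Date', 'Exit Date'])

# The date columns come back as datetimes, so date arithmetic works directly.
days = (reloaded['Exit Date'] - reloaded['Entry Date']).dt.days
print(reloaded.dtypes)
print(days)
```

Without `parse_dates`, the date columns would be read back as plain strings and the subtraction above would fail.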

In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'
%load_ext autoreload
# the "2" means: reload all modules (except those excluded by "%aimport") before running code
%autoreload 2

from __future__ import absolute_import, division, print_function
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
# from tqdm import tqdm
# import warnings

sns.set_context("poster", font_scale=1.3)
pd.set_option('display.max_columns', 100)

# add the data functions to the path
src_data_dir = os.path.join(os.getcwd(), os.pardir, 'src/data')
sys.path.append(src_data_dir)

# functions to load the data
import dataset as ds

In [2]:
# load in and process the data in separate sheets

df_client = ds.process_data_client()

df_enroll = ds.process_data_enrollment()
# Only keep rows with entry dates from 2012 onward
df_enroll = df_enroll[df_enroll['Entry Date'] >= '2012']

df_disability = ds.process_data_disability()

df_healthins = ds.process_data_healthins()

df_benefit = ds.process_data_benefit()

df_income = ds.process_data_income()

df_project = ds.process_data_project()

df_service = ds.process_data_service()

df_bedinv = ds.process_data_bedinventory()

In [3]:
# Join the client information with enrollment information.
# Inner join because we want to only keep individuals
# for whom we have both client and enrollment information.
df = df_client.merge(df_enroll, how='inner', left_index=True, right_index=True)

# just choose the first non-cash benefit; this is too simple!
# TODO: join on the exact Project ID and Date
df = df.merge(df_benefit.reset_index().groupby(by=['Personal ID'])[['Non-Cash Benefit']].nth(0),
              how='left', left_index=True, right_index=True)

df['Non-Cash Benefit'] = df['Non-Cash Benefit'].fillna('None')

# add information about their disability status
df = df.merge(df_disability.reset_index().groupby(by=['Personal ID'])[['Disability Type']].nth(0),
              how='left', left_index=True, right_index=True)

df['Disability Type'] = df['Disability Type'].fillna('None')

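# add project details (name, type code, city, postal code) to the DataFrame;
# merge defaults to an inner join, so enrollments whose Project ID
# is missing from df_project are dropped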
df = df.merge(df_project[['Project Name',
                          'Project Type Code',
                          'Address City',
                          'Address Postal Code',
                         ]], left_on='Project ID', right_index=True)

# sort by entry date
df = df.sort_values('Entry Date')
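
The "first record per person, then left join" pattern used above for benefits and disabilities can be sketched on toy data. This version uses `drop_duplicates` instead of `groupby(...).nth(0)`, because the index that `nth` returns differs between pandas versions. It is an alternative sketch, not the notebook's own code, and the IDs and benefit values are made up:

```python
import pandas as pd

# Hypothetical enrollment rows: person 1 has two enrollments,
# person 3 has no benefit record at all.
enroll = pd.DataFrame(
    {'Project ID': [10, 11, 10]},
    index=pd.Index([1, 1, 3], name='Personal ID'),
)

# Hypothetical benefit rows: person 1 has two records, person 2 has one.
benefit = pd.DataFrame(
    {'Non-Cash Benefit': ['Food Stamps', 'WIC', 'Medicaid']},
    index=pd.Index([1, 1, 2], name='Personal ID'),
)

# Keep only the first benefit record per person.
first_benefit = (
    benefit.reset_index()
    .drop_duplicates(subset='Personal ID', keep='first')
    .set_index('Personal ID')
)

# Left join keeps every enrollment row; people without a benefit get NaN,
# which is then replaced by the string 'None'.
merged = enroll.merge(first_benefit, how='left', left_index=True, right_index=True)
merged['Non-Cash Benefit'] = merged['Non-Cash Benefit'].fillna('None')
print(merged)
```

Because the right-hand side has at most one row per person, the left join cannot multiply enrollment rows, so the row count matches the enrollment table.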

In [4]:
# number of rows in the dataset
print(df.shape)

# number of people in the dataset
print(df.index.nunique())

(63324, 32)
11363


In [5]:
# set up to count the number of times a person was in the system
df['Enrollments'] = 1

# create feature vectors for each person by subselecting or aggregating their enrollments;
# one row per person
agg = {
    'In Permanent Housing': 'last',
    'Enrollments': 'sum',
    'Race': 'first',
    'Ethnicity': 'first',
    'Gender': 'first',
    'Veteran Status': 'max',
    'Client Age at Entry': 'last',
    'Days Enrolled': 'sum',
    'Domestic Violence Victim': 'max',
    'Disability Type': 'last',
    'Non-Cash Benefit': 'last',
    'Housing Status @ Project Start': 'last',
    'Living situation before program entry?': 'last',
    'Continuously Homeless One Year': 'max',
    'Chronic Homeless': 'max',
    'Project Name': 'last',
    'Project Type Code': 'last',
    }
df_features = df.reset_index().groupby(by=['Personal ID']).agg(agg)

# rename columns to snake_case (no spaces or special characters)
df_features = df_features.rename(
    columns={
        'In Permanent Housing': 'in_permanent_housing', 
        'Enrollments': 'enrollments',
        'Race': 'race',
        'Ethnicity': 'ethnicity',
        'Gender': 'gender',
        'Veteran Status': 'veteran_status',
        'Client Age at Entry': 'client_age_at_entry',
        'Days Enrolled': 'days_enrolled',
        'Domestic Violence Victim': 'domestic_violence_victim',
        'Disability Type': 'disability_type',
        'Non-Cash Benefit': 'non_cash_benefit',
        'Housing Status @ Project Start': 'housing_status_project_start',
        'Living situation before program entry?': 'living_situation_before_program_entry',
        'Continuously Homeless One Year': 'continuously_homeless_one_year',
        'Chronic Homeless': 'chronic_homeless',
        'Project Name': 'project_name',
        'Project Type Code': 'project_type_code',
        })

# convert booleans to integers
cols = [
    'domestic_violence_victim',
    'veteran_status',
    'in_permanent_housing',
    'continuously_homeless_one_year',
    'chronic_homeless',
    ]
for col in cols:
    df_features[col] = df_features[col].astype(int)
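
Because `df` was sorted by `Entry Date` before grouping, `'last'` picks up the most recent non-null value for each person, while `'sum'` and `'max'` combine across all of that person's enrollments. A minimal sketch with made-up rows (the values are illustrative only):

```python
import pandas as pd

# Hypothetical enrollments, deliberately out of date order.
toy = pd.DataFrame(
    {
        'Personal ID': [1, 2, 1],
        'Entry Date': pd.to_datetime(['2013-05-01', '2012-02-01', '2012-01-01']),
        'Days Enrolled': [30, 10, 60],
        'Veteran Status': [False, False, True],
        'Project Type Code': ['Emergency Shelter', 'Emergency Shelter',
                              'Transitional Housing'],
    }
)

# Sort first so that 'last' means "most recent enrollment".
toy = toy.sort_values('Entry Date')
toy['Enrollments'] = 1

features = toy.groupby('Personal ID').agg({
    'Enrollments': 'sum',         # number of enrollments per person
    'Days Enrolled': 'sum',       # total days across enrollments
    'Veteran Status': 'max',      # True if ever recorded as a veteran
    'Project Type Code': 'last',  # type of the most recent enrollment
})
features['Veteran Status'] = features['Veteran Status'].astype(int)
print(features)
```

Note that `'max'` on a boolean column acts as "any": a single `True` across a person's enrollments is enough to set the flag.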

In [6]:
# number of people (rows) and features (columns)
df_features.shape

(11363, 17)

In [7]:
# glance at the data
df_features.head()

| Personal ID | chronic_homeless | non_cash_benefit | in_permanent_housing | project_type_code | enrollments | continuously_homeless_one_year | disability_type | domestic_violence_victim | veteran_status | days_enrolled | race | client_age_at_entry | ethnicity | gender | project_name | living_situation_before_program_entry | housing_status_project_start |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 173781 | 0 | Food Stamps | 0 | Emergency Shelter | 2 | 0 | | 1 | 0 | 147 | White | 35 | Hispanic/Latino | Female | MOSBE SOP - Natividad Shelter | Emergency shelter, including hotel or motel pa... | Category 1 - Homeless |
| 173782 | 0 | | 0 | Emergency Shelter | 1 | 1 | | 1 | 0 | 147 | White | 10 | Hispanic/Latino | Male | MOSBE SOP - Natividad Shelter | Emergency shelter, including hotel or motel pa... | Category 1 - Homeless |
| 173783 | 0 | | 0 | Emergency Shelter | 1 | 1 | | 1 | 0 | 147 | White | 12 | Hispanic/Latino | Female | MOSBE SOP - Natividad Shelter | Emergency shelter, including hotel or motel pa... | Category 1 - Homeless |
| 173803 | 0 | Food Stamps | 0 | Emergency Shelter | 1 | 0 | | 1 | 0 | 78 | White | 32 | Hispanic/Latino | Female | MOSBE SOP - Natividad Shelter | Staying or living in a friend's room, apartmen... | Category 1 - Homeless |
| 173804 | 0 | | 0 | Emergency Shelter | 1 | 0 | | 0 | 0 | 78 | White | 11 | Hispanic/Latino | Female | MOSBE SOP - Natividad Shelter | Staying or living in a friend's room, apartmen... | Category 1 - Homeless |


In [8]:
# save it for easy loading
filename = os.path.join(os.getenv('HOME'), 'Dropbox', 'C4SF-datasci-homeless', 'processed', 'homeless_row_per_enrollment.csv')
df.to_csv(filename)