The purpose of this notebook is to load the data into two formats:

1. Full dataset (excluding rows with missing data, such as exit dates). This is the dataframe called `df`.
    1. Each person's information gets combined into one row per project enrollment.
2. Each person summarized as one row. This is the dataframe called `df_features`.

You can examine how each sheet is loaded using the dataset loading script: `datasci-sf-homeless-project/src/data/dataset.py`

The script assumes you have access to the data via Dropbox, as described in the #datasci-homeless channel of the sfbrigade Slack team. Everyone has read access; talk to Matt, Catherine, or Annalie if you want to be added to the shared folder. If you download the data or keep it somewhere else, pass a `datadir` argument to each `process_data_*` function, e.g.:

```python
df_client = hd.process_data_client(datadir='/path/to/raw/csv/files/')
```

Notes:

- One person can have multiple rows in `df`.
- One person can be enrolled in multiple projects at the same time.
- This notebook does not yet make use of the `Income`, `Service`, or `Bed Inventory` sheets.
- If you save out the CSV at the end (or have it from Dropbox), you can simply load the dataset with the commands:

```python
import os
import pandas as pd
filename = os.path.join(os.getenv('HOME'), 'Dropbox', 'C4SF-datasci-homeless', 'processed', 'homeless_row_per_enrollment.csv')
df = pd.read_csv(filename, header=0, index_col=0, parse_dates=['Entry Date', 'Exit Date', 'Residential Move In Date'])
```
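
To see what `parse_dates` changes when reloading, here is a minimal sketch that reads a tiny in-memory CSV instead of the Dropbox file. The rows and values are invented for illustration; only the three date column names come from the command above, and `df_raw`, `df_toy`, and `date_cols` are names made up for this example.

```python
import io

import pandas as pd

# Toy stand-in for homeless_row_per_enrollment.csv (values are invented).
csv_text = """Personal ID,Entry Date,Exit Date,Residential Move In Date
1,2013-01-15,2013-03-01,
1,2014-06-01,2014-06-20,2014-06-10
2,2015-02-02,2015-05-30,
"""

date_cols = ['Entry Date', 'Exit Date', 'Residential Move In Date']

# Without parse_dates the date columns are read as text (blank cells become NaN).
df_raw = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0)

# With parse_dates they are read as datetime columns (blank cells become NaT).
df_toy = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0,
                     parse_dates=date_cols)

# Date arithmetic is only possible on the parsed version.
days_enrolled = (df_toy['Exit Date'] - df_toy['Entry Date']).dt.days
print(df_toy.dtypes)
print(days_enrolled.tolist())
```

Comparing `df_raw.dtypes` with `df_toy.dtypes` shows the difference; any later step that subtracts or filters dates relies on the parsed version.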

In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'
%load_ext autoreload
# the "2" means: reload all modules (except those excluded by "%aimport") before executing code
%autoreload 2

from __future__ import absolute_import, division, print_function
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
# from tqdm import tqdm
# import warnings

sns.set_context("poster", font_scale=1.3)
pd.set_option('display.max_columns', 100)

# add the data functions to the path
src_data_dir = os.path.join(os.getcwd(), os.pardir, 'src', 'data')
sys.path.append(src_data_dir)

# functions to load the data
import homeless_dataset as hd

In [2]:
# load in and process the data in separate sheets

df_client = hd.process_data_client()

df_enroll = hd.process_data_enrollment()
# Only keep rows with entry dates starting in 2012
df_enroll = df_enroll[df_enroll['Entry Date'] >= '2012']
# Only keep rows with exit dates before 2016-06-01
df_enroll = df_enroll[df_enroll['Exit Date'] <= '2016-06-01']

df_disability = hd.process_data_disability()

df_healthins = hd.process_data_healthins()

df_benefit = hd.process_data_benefit()

df_income = hd.process_data_income()

df_project = hd.process_data_project()

df_service = hd.process_data_service()

df_bedinv = hd.process_data_bedinventory()
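
The two date filters on `df_enroll` above can be checked on a small example. This is a sketch on invented rows, assuming `Entry Date` and `Exit Date` are datetime columns as they would be after loading; `df_toy`, `start`, and `end` are names made up here. It also shows that rows with a missing exit date fall out of the result, because a comparison against `NaT` evaluates to `False`.

```python
import pandas as pd

# Toy enrollment rows (invented values); None becomes NaT, a missing exit date.
df_toy = pd.DataFrame({
    'Entry Date': pd.to_datetime(['2011-12-31', '2012-01-01', '2014-05-05', '2015-09-09']),
    'Exit Date': pd.to_datetime(['2012-02-01', '2012-03-01', None, '2016-07-01']),
})

start = pd.Timestamp('2012-01-01')
end = pd.Timestamp('2016-06-01')

# Build both conditions once and apply them together.
# Comparisons against NaT are False, so rows without an exit date are dropped.
mask = (df_toy['Entry Date'] >= start) & (df_toy['Exit Date'] <= end)
df_kept = df_toy[mask]

# The string form used in the notebook selects the same rows:
# pandas parses '2012' as the timestamp 2012-01-01.
mask_str = (df_toy['Entry Date'] >= '2012') & (df_toy['Exit Date'] <= '2016-06-01')

print(df_kept)
```

Using explicit `pd.Timestamp` bounds makes the cutoffs unambiguous, at the cost of two extra lines.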

In [3]:
# Join the client information with enrollment information.
# Inner join because we want to only keep individuals
# for whom we have both client and enrollment information.
df = df_client.merge(df_enroll, how='inner', left_index=True, right_index=True)

# just choose the first non-cash benefit; this is too simple!
# TODO: join on the exact Project ID, and possibly Date
df = df.merge(df_benefit.reset_index().groupby(by=['Personal ID'])[['Non-Cash Benefit']].nth(0),
              how='left', left_index=True, right_index=True)
# # possible fix for above, but this isn't working properly (results in too many rows);
# # probably need date too, but they do not align
# df.reset_index().merge(df_benefit.reset_index()[['Personal ID', 'Project Entry ID', 'Non-Cash Benefit']].drop_duplicates(),
#                        how='left',
#                        on=['Personal ID', 'Project Entry ID'],
#                       ).drop_duplicates().set_index('Personal ID')

df['Non-Cash Benefit'] = df['Non-Cash Benefit'].fillna('None')

# add information about their disability status
# just choose the first disability; this is too simple!
# TODO: join on the exact Project ID
df = df.merge(df_disability.reset_index().groupby(by=['Personal ID'])[['Disability Type']].nth(0),
              how='left', left_index=True, right_index=True)
# # possible fix for above, but this isn't working properly (results in too many rows);
# # probably need date too, but they do not align
# df.reset_index().merge(df_disability.reset_index()[['Personal ID', 'Project Entry ID', 'Disability Type']].drop_duplicates(),
#                        how='left',
#                        on=['Personal ID', 'Project Entry ID'],
#                       ).drop_duplicates().set_index('Personal ID')

df['Disability Type'] = df['Disability Type'].fillna('None')

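# add project details (name, type code, city, postal code) to the DataFrame;
# merge defaults to an inner join, so enrollments without a matching Project ID are dropped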
df = df.merge(df_project[['Project Name',
                          'Project Type Code',
                          'Address City',
                          'Address Postal Code',
                         ]], left_on=['Project ID'], right_index=True)

# sort by entry date
df = df.sort_values('Entry Date')
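
The "too many rows" problem noted in the comments above can be reproduced on toy data. This sketch uses invented tables and benefit labels, and it swaps in `drop_duplicates(..., keep='first')` instead of `groupby(...).nth(0)` to pick the first record per person (to my knowledge, the index returned by `nth` differs between pandas releases, so it is avoided here). The `validate` argument is a standard `merge` parameter that raises an error if the right-hand keys are not unique; the `*_toy` names are made up for this example.

```python
import pandas as pd

# Toy tables (invented values). Person 1 has two enrollments and two benefit records.
df_enroll_toy = pd.DataFrame(
    {'Project ID': [10, 11, 10]},
    index=pd.Index([1, 1, 2], name='Personal ID'),
)
df_benefit_toy = pd.DataFrame({
    'Personal ID': [1, 1, 2],
    'Non-Cash Benefit': ['A', 'B', 'A'],
})

# Joining on the person alone pairs every enrollment with every benefit record,
# so the result has more rows than there are enrollments.
df_many = df_enroll_toy.merge(df_benefit_toy.set_index('Personal ID'),
                              how='left', left_index=True, right_index=True)

# Keeping only the first record per person makes the right side unique,
# so the left join preserves the number of enrollment rows.
first_benefit = (df_benefit_toy
                 .drop_duplicates(subset='Personal ID', keep='first')
                 .set_index('Personal ID'))
df_one = df_enroll_toy.merge(first_benefit, how='left',
                             left_index=True, right_index=True,
                             validate='many_to_one')

print(len(df_enroll_toy), len(df_many), len(df_one))
```

Comparing `len(df)` before and after each merge is a cheap way to catch this kind of duplication in the real data.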

In [4]:
# rename the columns to have no spaces
df = hd.rename_columns(df)

In [5]:
# save it for easy loading
filename = os.path.join(os.getenv('HOME'), 'Dropbox', 'C4SF-datasci-homeless', 'processed', 'homeless_row_per_enrollment.csv')
df.to_csv(filename)

In [7]:
# set up to count the number of times a person was in the system
df['enrollments'] = 1

# create feature vectors for each person by subselecting or aggregating their enrollments;
# one row per person
agg = {
    # binary
    'veteran_status': 'max',
    'disabling_condition': 'max',
    'continuously_homeless_one_year': 'max',
    'chronic_homeless': 'max',
    'domestic_violence_victim': 'max',
    'dv_currently_fleeing': 'max',
    'head_of_household': 'max',
    # quantitative
    'enrollments': 'sum',
    'client_age_at_entry': 'last',
    'times_homeless_past_three_years': 'last',
    'months_homeless_this_time': 'last',
    'months_ago_dv_occurred': 'last',
    'days_enrolled': 'sum',
    # categorical
    'race': 'first',
    'ethnicity': 'first',
    'gender': 'first',
    'housing_status_project_start': 'last',
    'living_situation_before_program_entry': 'last',
    'noncash_benefit': 'last',
    'disability_type': 'last',
    'project_type_code': 'last',
    # outcome related
    'in_permanent_housing': 'last',
    'days_to_residential_move_in': 'last',
    }
df_features = df.reset_index().groupby(by=['Personal ID']).agg(agg)

# convert booleans to integers
features_binary = [
    'veteran_status',
    'disabling_condition',
    'continuously_homeless_one_year',
    'chronic_homeless',
    'in_permanent_housing',
    'domestic_violence_victim',
    'dv_currently_fleeing',
    'head_of_household',
]
for col in features_binary:
    df_features[col] = df_features[col].astype(int)
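
The aggregation above can be followed on a toy frame. This is a sketch with invented values and only a handful of the real columns; `df_toy`, `agg_toy`, and `features_toy` are names made up for this example.

```python
import pandas as pd

# Toy row-per-enrollment frame (invented values), already sorted by entry date.
df_toy = pd.DataFrame(
    {
        'veteran_status': [False, True, False],
        'days_enrolled': [30, 12, 78],
        'client_age_at_entry': [34, 35, 11],
        'gender': ['Female', 'Female', 'Male'],
    },
    index=pd.Index([1, 1, 2], name='Personal ID'),
)
df_toy['enrollments'] = 1

agg_toy = {
    'veteran_status': 'max',         # True if flagged in any enrollment
    'enrollments': 'sum',            # number of enrollments
    'days_enrolled': 'sum',          # total days across enrollments
    'client_age_at_entry': 'last',   # value from the most recent enrollment
    'gender': 'first',               # value from the earliest enrollment
}
features_toy = df_toy.reset_index().groupby(by=['Personal ID']).agg(agg_toy)

# convert the boolean flag to an integer, as in the cell above
features_toy['veteran_status'] = features_toy['veteran_status'].astype(int)

print(features_toy)
```

One caveat: the groupby `'first'` and `'last'` aggregations skip missing values, so `'last'` returns the most recent non-missing value for a person rather than strictly the value in their final row.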

In [8]:
# number of people in the dataset
df_features.shape

(11362, 23)

In [9]:
# glance at the data
df_features.head()

Personal ID,ethnicity,days_to_residential_move_in,dv_currently_fleeing,continuously_homeless_one_year,months_ago_dv_occurred,times_homeless_past_three_years,head_of_household,gender,project_type_code,client_age_at_entry,noncash_benefit,living_situation_before_program_entry,chronic_homeless,enrollments,race,in_permanent_housing,days_enrolled,disability_type,disabling_condition,months_homeless_this_time,housing_status_project_start,veteran_status,domestic_violence_victim
173781,Hispanic/Latino,,0,0,12.0,2.0,0,Female,Emergency Shelter,35,Supplemental Nutrition Assistance Program (Foo...,"Emergency shelter, including hotel or motel pa...",0,2,White,0,147,,0,,Category 1 - Homeless,0,1
173782,Hispanic/Latino,,0,1,12.0,2.0,0,Male,Emergency Shelter,10,,"Emergency shelter, including hotel or motel pa...",0,1,White,0,147,,0,,Category 1 - Homeless,0,1
173783,Hispanic/Latino,,0,1,12.0,2.0,0,Female,Emergency Shelter,12,,"Emergency shelter, including hotel or motel pa...",0,1,White,0,147,,0,,Category 1 - Homeless,0,1
173803,Hispanic/Latino,,0,0,12.0,,0,Female,Emergency Shelter,32,Supplemental Nutrition Assistance Program (Foo...,"Staying or living in a friend's room, apartmen...",0,1,White,0,78,,0,,Category 1 - Homeless,0,1
173804,Hispanic/Latino,,0,0,,,0,Female,Emergency Shelter,11,,"Staying or living in a friend's room, apartmen...",0,1,White,0,78,,0,,Category 1 - Homeless,0,0
