
Class_Project: Exploring factors that lead to improved unsheltered homelessness outcomes

Presentation


The US government's Department of Housing and Urban Development (HUD) funds Continuums of Care (CoCs) in major cities and select rural areas across the USA.


As a requirement for receiving funding, every CoC must conduct a Point In Time (PIT) count: one night each year, a group of service providers and volunteers in each city surveys and counts homeless individuals and families. Additionally, CoCs must provide a Housing Inventory Count (HIC) and maintain a Homeless Management Information System (HMIS) database. While the HMIS database is not open to the public, HUD makes the PIT and HIC data from across the country available at https://www.huduser.gov/portal/datasets/ahar/2020-ahar-part-1-pit-estimates-of-homelessness-in-the-us.html

Nationwide, there is a disturbing rising trend in the percentage of homeless individuals who are unsheltered.

[image: national trend in the percentage of unsheltered homeless]

We also see this trend in Indianapolis, especially in 2021. Here in Indianapolis, the agency known as CHIP (Coalition for Homelessness Intervention and Prevention) is the coordinating entity that ensures our city meets HUD reporting requirements. Some information on the 2021 PIT count here in Indy can be found at: https://www.chipindy.org/reports.html

[image: Indianapolis PIT trend]

This project will use the nationwide HUD PIT and HIC datasets for CoCs in the United States, combined with population and homelessness funding allocation data, to evaluate whether machine learning models can identify factors (investment in low barrier shelter beds, investment in housing first units, federal funding levels, etc.) that lead to better outcomes (lower homelessness as a percentage of city population).

2020 PIT dataframe: [image]

2020 HIC dataframe: [image]

Additional datasets pulled into this analysis include HUD funding levels at the programmatic level across the CoCs and population data available from the US Census. County-level and state-level data that can be translated into CoC-level units will be integrated to provide context beyond the HUD information. This includes population, poverty level, and geolocation data that will allow for a more robust analysis and presentation of data at both the CoC level and by census unit. The data will be combined and standardized using FIPS codes to stay consistent across all data sources, which will allow indexing to each HUD CoC.
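As an illustration of the intended FIPS-based merge, the sketch below joins hypothetical county-level census data to CoCs through a county-to-CoC crosswalk using pandas (file and column names are assumptions, not the project's actual files):

```python
import pandas as pd

# Hypothetical inputs: a county-to-CoC crosswalk and county-level census data,
# both keyed on a five-digit FIPS code (read as strings to preserve leading zeros).
crosswalk = pd.read_csv("county_to_coc_crosswalk.csv", dtype={"fips": str})
census = pd.read_csv("county_census_data.csv", dtype={"fips": str})

# Attach census attributes to each county, then aggregate up to the CoC level.
county_level = crosswalk.merge(census, on="fips", how="left")
coc_level = (county_level
             .groupby("coc_number", as_index=False)
             .agg({"population": "sum", "poverty_count": "sum"}))

# coc_level can now be joined to the HUD PIT/HIC tables on the CoC number.
```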

The hypothesis for this project is that we will be able to correlate certain factors across these broad nationwide datasets with homelessness outcomes. Successful homelessness outcomes for this project are defined as cities with the lowest total and/or unsheltered homeless as a percentage of the city's population. These findings would then be the first step in helping to guide additional dialogue and potentially new investments in our city of Indianapolis. Currently the Mayor of Indianapolis has a proposal to spend $12.5 million on a new low barrier shelter that would include additional transitional beds. Does our modeling provide evidence in support of this proposal?

Machine Learning Model

The overall strategy for our machine learning approach is to first establish whether there are key features from the combined PIT/HIC datasets that are correlated with the unsheltered individuals outcome. We felt that random forest models would provide an initial visualization of feature importances. If there were strong correlations with outcomes, we hoped there might be a slim chance of building a deep learning model that could predict whether changing the number of beds/units of certain types in a given city would deliver an improvement in the unsheltered individuals target. Since the CoCs represent areas of greatly differing populations, we anticipated that we would need some way to normalize across CoCs. We envisioned that transforming our data into percentages of total CoC population would be a reasonable way to do this.

Initial Modeling Attempts:

Models based on data as a percentage of total CoC region population

The first step was to connect to our SQLite database and pull the data for all years through 2019:

[image: code connecting to the SQLite database and querying all years through 2019]
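A minimal sketch of that connection step, assuming the combined PIT/HIC table is named pit_hic_combined and has a Year column, would look like:

```python
import sqlite3
import pandas as pd

# Database file and table/column names are assumptions for illustration.
conn = sqlite3.connect("homelessness.sqlite")
df_AllYears = pd.read_sql("SELECT * FROM pit_hic_combined WHERE Year <= 2019", conn)
conn.close()
```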

After dropping some unwanted columns, the data needed to be transformed into percent of population. Exemplar code is shown below:

[image: exemplar percent-of-population transformation code]
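A representative version of that transformation (count and population column names are assumed stand-ins for the HUD field names) divides each count column by the CoC's total population:

```python
# Convert raw counts into percentages of the CoC's total population.
count_columns = ["Unsheltered", "Sheltered_ES", "Sheltered_TH", "Total_Beds"]

for col in count_columns:
    df_AllYears[f"{col}_perc_pop"] = df_AllYears[col] / df_AllYears["CoC_Population"] * 100
```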

After dropping the original columns, the data frame was ready for initial exploration:

[image: dataframe after dropping the original columns]

Using the Random Forest Regressor from sklearn, "Unsheltered_perc_pop" was selected as the target. The train/test split data was not scaled for this model. Results are shown below:

[images: random forest results and feature importances]
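For reference, a minimal sketch of this random forest step (assuming the remaining feature columns are numeric; hyperparameters are illustrative) is:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Target is the unsheltered count as a percent of CoC population.
y = df_AllYears["Unsheltered_perc_pop"]
X = df_AllYears.drop(columns=["Unsheltered_perc_pop"])

# The feature data was not scaled for this model.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestRegressor(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

# Rank features by importance, mirroring the chart above.
importances = sorted(zip(rf.feature_importances_, X.columns), reverse=True)
print(importances[:10])
print("R^2 on the test set:", rf.score(X_test, y_test))
```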

This exercise was also performed on the data for individual years, with some feature importances changing from year to year. These notebook files are available in the ML folder. The next question is whether this data is good enough to be used in a predictive fashion. A neural network model was built using TensorFlow, and evidently the data is not strong enough to create a useful model of this type. Preprocessing for this model included breaking down the "Unsheltered_perc_pop" category into quantiles using the following code:

import numpy as np

# Bin the continuous target into ten quantile bins (deciles)
q = df_AllYears['Unsheltered_perc_pop'].quantile(np.arange(10) / 10)
df_AllYears['UnshelteredPercentQuantile'] = df_AllYears['Unsheltered_perc_pop'].apply(
    lambda x: q.index[np.searchsorted(q, x, side='right') - 1])

The original "Unsheltered_perc_pop" column was dropped and the NN model was run with the quantile as the target. The feature data was scaled (using the fitted X_scaler):

[images: neural network model definition and training output]
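A simplified version of that TensorFlow model, with layer sizes, epochs, and the StandardScaler assumed for illustration, might look like:

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Convert the decile label (0.0-0.9) into an integer class 0-9;
# 'Unsheltered_perc_pop' has already been dropped from the dataframe.
y = (df_AllYears["UnshelteredPercentQuantile"] * 10).round().astype(int)
X = df_AllYears.drop(columns=["UnshelteredPercentQuantile"])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale the feature data (the fitted X_scaler referenced above).
X_scaler = StandardScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

# Small dense classifier over the ten quantile bins.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train_scaled.shape[1],)),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(15, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train_scaled, y_train, epochs=50, verbose=0)
model.evaluate(X_test_scaled, y_test)
```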

Unfortunately, this resulted in a low accuracy model:

[image: neural network accuracy results]

This suggests that perhaps calibrating our data to region population is not a meaningful approach, since there may be many other factors that drive outcomes as a percent of population (for example, average temperature or local policing policies). Before trying a different calibration method, this data was evaluated with a regression model in RStudio:

[image: regression model summary from RStudio]

Looking at the data for one of the features with the strongest level of significance (unemployment), we see that the trend is weak and probably driven by a few major outliers. The visualization below shows the deciles of unsheltered as a percent of population, where 0.0 is the lowest-percentage decile and 0.9 is the highest-percentage decile.

[image: unemployment vs. unsheltered percentage deciles]

Models based on data as a percentage of total number of homeless

The above results based on total population prompted a rethink of how to calibrate the data. Another potential avenue is to calibrate our numbers relative to the total number of homeless. Using a similar approach as above, a random forest model was built:

[image: random forest model setup]

Transform using the total number of homeless:

[image: code transforming counts into percentages of total homeless]
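A representative sketch of that re-calibration (column names assumed) divides each measure by the CoC's overall homeless count instead of its resident population:

```python
# Re-express counts relative to the total number of homeless in each CoC.
# Column names are illustrative stand-ins for the HUD field names.
for col in ["Unsheltered", "Total_Beds", "ES_Beds", "TH_Beds", "SH_Beds"]:
    df_AllYears[f"{col}_perc_homeless"] = (
        df_AllYears[col] / df_AllYears["Overall_Homeless"] * 100)
```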

Build and run the model as before:

[image: model training code and results]

Feature importances:

[image: feature importance chart]

We see that Total Beds as a percentage of the total number of homeless is the strongest feature, which makes sense. Note that total beds wasn't included in the percent-of-population model, since it is just the sum of the ES, TH, and SH bed columns.

[image]

Running a regression model in R shows us that more of the features now achieve significance:

[image: regression model summary from R]

However, applying this approach to our NN model does not really improve the accuracy of our model:

[image: neural network accuracy results]

Clearly this speaks to the noise within the data set. PIT counts are not homogeneous in the methodology used across CoCs, and they usually count individuals on a single night (sometimes two nights). Next steps toward models that can be used prospectively will involve further evaluation and processing of the data to determine a path forward. This may be a case where some of the older data in the set is less relevant (older methodologies) than more recent data and as a result is adding noise to the set. On the to-do list is to use select years (2015-2018) as the training set and 2019 as the test set (a sketch of this split appears after the chart below). Additionally, rural CoC data may be dramatically distinct from urban CoC data, so we may have to break the data down further to improve our predictivity. Below is some high-level data looking at trends in total homeless and total unsheltered by CoC category, which highlights some differences between the regional CoC types:

[image: total homeless and total unsheltered trends by CoC category]
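The planned year-based split from the to-do list above, assuming the dataframe retains a Year column, could be sketched as:

```python
# Train on earlier years and hold out the most recent year as the test set.
train_df = df_AllYears[df_AllYears["Year"].between(2015, 2018)]
test_df = df_AllYears[df_AllYears["Year"] == 2019]

X_train = train_df.drop(columns=["Unsheltered_perc_pop", "Year"])
y_train = train_df["Unsheltered_perc_pop"]
X_test = test_df.drop(columns=["Unsheltered_perc_pop", "Year"])
y_test = test_df["Unsheltered_perc_pop"]
```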

Final ML Solutions:

(Code can be found in the "CheckPoint3_ML_Models" folder)

[images: final machine learning models and results]

Database

The extracted dataframes from the HUD data will be stored in a SQLite database. Once the data is cleaned of irrelevant columns, cleared of non-populated records, and revised for any values that diverge from the expected ranges, the tables will be merged using the CoC indices as our master key for all values. Because the content of our database is expected to be manageable from a data-size standpoint, all files will be saved to the repository, letting any project participant read in the data on their own machine and reducing the need to rely upon a centrally stored database. This will also let each team member manipulate the data on their own branches, ensuring that all content is secure yet able to accommodate independent research without burdensome lockout/tag-out procedures to maintain data integrity. The tables we intend to read into the dataframe include the following (a sketch of the storage and merge step appears after the list):

  • CityData
  • ShelterAvailability
  • HomelessCounts
  • HomelessFunding
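
A minimal sketch of the storage and merge step, assuming the four tables above share a CoC_Number key and that the cleaned dataframes already exist in memory, is:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("homelessness.sqlite")

# city_df, shelter_df, counts_df, and funding_df are the cleaned dataframes
# produced earlier in the pipeline (names are illustrative).
city_df.to_sql("CityData", conn, if_exists="replace", index=False)
shelter_df.to_sql("ShelterAvailability", conn, if_exists="replace", index=False)
counts_df.to_sql("HomelessCounts", conn, if_exists="replace", index=False)
funding_df.to_sql("HomelessFunding", conn, if_exists="replace", index=False)

# Merge the tables on the shared CoC key (column name assumed).
merged = pd.read_sql("""
    SELECT *
    FROM HomelessCounts
    JOIN ShelterAvailability USING (CoC_Number)
    JOIN HomelessFunding USING (CoC_Number)
    JOIN CityData USING (CoC_Number)
""", conn)
conn.close()
```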

Database Diagram

If there is additional need to contextualize the data, we intend to incorporate metropolitan statistical area data that can assist in providing some demographic and socioeconomic context to the areas supported by the specific CoCs of interest in our analysis.

Link to Google Slides Presentation

Link to Tableau Dashboard

  • To create our dashboard, we searched the web for related datasets and uploaded them to Tableau Public. Once uploaded, various bar graphs and maps were created to illustrate the overall homeless population and the funding received to shelter those individuals.
  • One of the many benefits of using Tableau Public is that the program allows you to create interactive data visualizations.
    link to dashboard
