
Class_Project: Exploring factors that lead to improved unsheltered homelessness outcomes

Presentation


The US government's Department of Housing and Urban Development (HUD) funds Continuums of Care (CoCs) in major cities and select rural areas across the USA.


As a requirement for receiving funding, every CoC must conduct a Point In Time (PIT) count: one night each year, a group of service providers and volunteers in each city surveys and counts homeless individuals and families. Additionally, CoCs must provide a Housing Inventory Count (HIC) and maintain a Homeless Management Information System (HMIS) database. While the HMIS database is not open to the public, HUD makes the PIT and HIC data from across the country available at https://www.huduser.gov/portal/datasets/ahar/2020-ahar-part-1-pit-estimates-of-homelessness-in-the-us.html

Nationwide, there is a disturbing rising trend in the percentage of homeless individuals who are unsheltered.

[image: national trend in the percentage of unsheltered homeless]

We also see this trend in Indianapolis, especially in 2021. Here in Indianapolis, the agency known as CHIP (Coalition for Homelessness Intervention and Prevention) is the coordinating entity that ensures our city meets HUD reporting requirements. Some information on the 2021 PIT count here in Indy can be found at: https://www.chipindy.org/reports.html

[image: Indianapolis PIT trend]

This project will use the nationwide HUD PIT and HIC datasets for CoCs in the United States, combined with population and homelessness funding allocation data, to evaluate whether machine learning models can identify factors (investment in low barrier shelter beds, investment in housing first units, federal funding levels, etc.) that lead to better outcomes (lower homelessness as a percentage of city population).

2020 PIT dataframe: [image]

2020 HIC dataframe: [image]

Additional datasets pulled into this analysis include HUD funding levels at the programmatic level across the CoCs and population data available from the US Census. County-level and state-level data that can be translated into CoC-level units will be integrated to provide context beyond the HUD information. This includes population, poverty level, and geolocation data that will allow for a more robust analysis and presentation of data at both the CoC level and by census unit. The data will be combined and standardized using FIPS codes to stay consistent across all data sources, which will allow indexing to each HUD CoC.
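As an illustration of the intended FIPS-based merge, the sketch below joins hypothetical county-level census data to CoCs through a county-to-CoC crosswalk using pandas (file and column names are assumptions, not the project's actual files):

```python
import pandas as pd

# Hypothetical inputs: a county-to-CoC crosswalk and county-level census data,
# both keyed on a five-digit FIPS code (read as strings to preserve leading zeros).
crosswalk = pd.read_csv("county_to_coc_crosswalk.csv", dtype={"fips": str})
census = pd.read_csv("county_census_data.csv", dtype={"fips": str})

# Attach census attributes to each county, then aggregate up to the CoC level.
county_level = crosswalk.merge(census, on="fips", how="left")
coc_level = (county_level
             .groupby("coc_number", as_index=False)
             .agg({"population": "sum", "poverty_count": "sum"}))

# coc_level can now be joined to the HUD PIT/HIC tables on the CoC number.
```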

The hypothesis for this project is that we will be able to correlate certain factors across these broad nationwide datasets with homelessness outcomes. Successful homelessness outcomes for this project are defined as cities with the lowest total and/or unsheltered homeless as a percentage of the city's population. These findings would then be the first step in helping to guide additional dialogue and potentially new investments in our city of Indianapolis. Currently the Mayor of Indianapolis has a proposal to spend $12.5 million on a new low barrier shelter that would include additional transitional beds. Does our modeling provide evidence in support of this proposal?

Machine Learning Model

The overall strategy for our machine learning approach is to first establish whether there are key features from the combined PIT/HIC datasets that are correlated with the unsheltered individuals outcome. We felt that random forest models would provide an initial visualization of feature importances. If there were strong correlations with outcomes, we hoped there might be a slim chance of building a deep learning model that could predict whether changing the number of beds/units of certain types in a given city would deliver an improvement in the unsheltered individuals target. Since the CoCs represent areas of greatly differing populations, we anticipated that we would need some way to normalize across CoCs. We envisioned that transforming our data into percentages of total CoC population would be a reasonable way to do this.

Initial Modeling Attempts:

Models based on data as a percentage of total CoC region population

The first step was to connect to our SQLite database and pull the data for all years through 2019:

[image: code connecting to the SQLite database and querying all years through 2019]
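A minimal sketch of that connection step, assuming the combined PIT/HIC table is named pit_hic_combined and has a Year column, would look like:

```python
import sqlite3
import pandas as pd

# Database file and table/column names are assumptions for illustration.
conn = sqlite3.connect("homelessness.sqlite")
df_AllYears = pd.read_sql("SELECT * FROM pit_hic_combined WHERE Year <= 2019", conn)
conn.close()
```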

After dropping some unwanted columns, the data needed to be transformed into percent of population. Exemplar code is shown below:

[image: exemplar percent-of-population transformation code]
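A representative version of that transformation (count and population column names are assumed stand-ins for the HUD field names) divides each count column by the CoC's total population:

```python
# Convert raw counts into percentages of the CoC's total population.
count_columns = ["Unsheltered", "Sheltered_ES", "Sheltered_TH", "Total_Beds"]

for col in count_columns:
    df_AllYears[f"{col}_perc_pop"] = df_AllYears[col] / df_AllYears["CoC_Population"] * 100
```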

After dropping the original columns, the data frame was ready for initial exploration:

[image: dataframe after dropping the original columns]

Using the Random Forest Regressor from sklearn, "Unsheltered_perc_pop" was selected as the target. The train/test split data was not scaled for this model. Results are shown below:

[images: random forest results and feature importances]
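For reference, a minimal sketch of this random forest step (assuming the remaining feature columns are numeric; hyperparameters are illustrative) is:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Target is the unsheltered count as a percent of CoC population.
y = df_AllYears["Unsheltered_perc_pop"]
X = df_AllYears.drop(columns=["Unsheltered_perc_pop"])

# The feature data was not scaled for this model.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestRegressor(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

# Rank features by importance, mirroring the chart above.
importances = sorted(zip(rf.feature_importances_, X.columns), reverse=True)
print(importances[:10])
print("R^2 on the test set:", rf.score(X_test, y_test))
```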

This exercise was also performed on the data for individual years, with some feature importances changing from year to year. These notebook files are available in the ML folder. The next question is whether this data is good enough to be used in a predictive fashion. A neural network model was built using TensorFlow, and evidently the data is not strong enough to create a useful model of this type. Preprocessing for this model included breaking down the "Unsheltered_perc_pop" category into quantiles using the following code:

import numpy as np

# Bin the continuous target into ten quantile bins (deciles)
q = df_AllYears['Unsheltered_perc_pop'].quantile(np.arange(10) / 10)
df_AllYears['UnshelteredPercentQuantile'] = df_AllYears['Unsheltered_perc_pop'].apply(
    lambda x: q.index[np.searchsorted(q, x, side='right') - 1])

The original "Unsheltered_perc_pop" column was dropped and the NN model was run with the quantile as the target. The feature data was scaled (using the fitted X_scaler):

[images: neural network model definition and training output]
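A simplified version of that TensorFlow model, with layer sizes, epochs, and the StandardScaler assumed for illustration, might look like:

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Convert the decile label (0.0-0.9) into an integer class 0-9;
# 'Unsheltered_perc_pop' has already been dropped from the dataframe.
y = (df_AllYears["UnshelteredPercentQuantile"] * 10).round().astype(int)
X = df_AllYears.drop(columns=["UnshelteredPercentQuantile"])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale the feature data (the fitted X_scaler referenced above).
X_scaler = StandardScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

# Small dense classifier over the ten quantile bins.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train_scaled.shape[1],)),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(15, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train_scaled, y_train, epochs=50, verbose=0)
model.evaluate(X_test_scaled, y_test)
```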

Unfortunately, this resulted in a low accuracy model:

[image: neural network accuracy results]

This suggests that perhaps calibrating our data to region population is not a meaningful approach, since there may be many other factors that drive outcomes as a percent of population (for example, average temperature or local policing policies). Before trying a different calibration method, this data was evaluated with a regression model in RStudio:

[image: regression model summary from RStudio]

Looking at the data for one of the features with the strongest level of significance (unemployment), we see that the trend is weak and probably driven by a few major outliers. The visualization below shows the deciles of unsheltered as a percent of population, where 0.0 is the lowest-percentage decile and 0.9 is the highest-percentage decile.

[image: unemployment vs. unsheltered percentage deciles]

Models based on data as a percentage of total number of homeless

The above results based on total population prompted a rethink of how to calibrate the data. Another potential avenue is to calibrate our numbers relative to the total number of homeless. Using a similar approach as above, a random forest model was built:

[image: random forest model setup]

Transform using the total number of homeless:

[image: code transforming counts into percentages of total homeless]
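A representative sketch of that re-calibration (column names assumed) divides each measure by the CoC's overall homeless count instead of its resident population:

```python
# Re-express counts relative to the total number of homeless in each CoC.
# Column names are illustrative stand-ins for the HUD field names.
for col in ["Unsheltered", "Total_Beds", "ES_Beds", "TH_Beds", "SH_Beds"]:
    df_AllYears[f"{col}_perc_homeless"] = (
        df_AllYears[col] / df_AllYears["Overall_Homeless"] * 100)
```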

Build and run the model as before:

[image: model training code and results]

Feature importances:

[image: feature importance chart]

We see that Total Beds as a percentage of the total number of homeless is the strongest feature, which makes sense. Note that total beds wasn't included in the percent-of-population model, since it is just the sum of the ES, TH, and SH bed columns.

[image]

Running a regression model in R shows us that more of the features now achieve significance:

[image: regression model summary from R]

However, applying this approach to our NN model does not really improve the accuracy of our model:

[image: neural network accuracy results]

Clearly this speaks to the noise within the data set. PIT counts are not homogeneous in the methodology used across CoCs, and they usually count individuals on a single night (sometimes two nights). Next steps toward models that can be used prospectively will involve further evaluation and processing of the data to determine a path forward. This may be a case where some of the older data in the set is less relevant (older methodologies) than more recent data and as a result is adding noise to the set. On the to-do list is to use select years (2015-2018) as the training set and 2019 as the test set (a sketch of this split appears after the chart below). Additionally, rural CoC data may be dramatically distinct from urban CoC data, so we may have to break the data down further to improve our predictivity. Below is some high-level data looking at trends in total homeless and total unsheltered by CoC category, which highlights some differences between the regional CoC types:

[image: total homeless and total unsheltered trends by CoC category]
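The planned year-based split from the to-do list above, assuming the dataframe retains a Year column, could be sketched as:

```python
# Train on earlier years and hold out the most recent year as the test set.
train_df = df_AllYears[df_AllYears["Year"].between(2015, 2018)]
test_df = df_AllYears[df_AllYears["Year"] == 2019]

X_train = train_df.drop(columns=["Unsheltered_perc_pop", "Year"])
y_train = train_df["Unsheltered_perc_pop"]
X_test = test_df.drop(columns=["Unsheltered_perc_pop", "Year"])
y_test = test_df["Unsheltered_perc_pop"]
```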

Final ML Solutions:

(Code can be found in the "CheckPoint3_ML_Models" folder)

[images: final machine learning models and results]

Database

The extracted dataframes from the HUD data will be stored in a SQLite database. Once the data is cleaned of irrelevant columns, cleared of non-populated records, and revised for any values that diverge from the expected ranges, the tables will be merged using the CoC indices as our master key for all values. Because the content of our database is expected to be manageable from a data-size standpoint, all files will be saved to the repository, letting any project participant read in the data on their own machine and reducing the need to rely upon a centrally stored database. This will also let each team member manipulate the data on their own branches, ensuring that all content is secure yet able to accommodate independent research without burdensome lockout/tag-out procedures to maintain data integrity. The tables we intend to read into the dataframe include the following (a sketch of the storage and merge step appears after the list):

  • CityData
  • ShelterAvailability
  • HomelessCounts
  • HomelessFunding
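
A minimal sketch of the storage and merge step, assuming the four tables above share a CoC_Number key and that the cleaned dataframes already exist in memory, is:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("homelessness.sqlite")

# city_df, shelter_df, counts_df, and funding_df are the cleaned dataframes
# produced earlier in the pipeline (names are illustrative).
city_df.to_sql("CityData", conn, if_exists="replace", index=False)
shelter_df.to_sql("ShelterAvailability", conn, if_exists="replace", index=False)
counts_df.to_sql("HomelessCounts", conn, if_exists="replace", index=False)
funding_df.to_sql("HomelessFunding", conn, if_exists="replace", index=False)

# Merge the tables on the shared CoC key (column name assumed).
merged = pd.read_sql("""
    SELECT *
    FROM HomelessCounts
    JOIN ShelterAvailability USING (CoC_Number)
    JOIN HomelessFunding USING (CoC_Number)
    JOIN CityData USING (CoC_Number)
""", conn)
conn.close()
```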

Database Diagram

If there is additional need to contextualize the data, we intend to incorporate metropolitan statistical area data that can assist in providing some demographic and socioeconomic context to the areas supported by the specific CoCs of interest in our analysis.

Link to Google Slides Presentation

Link to Tableau Dashboard

  • To create our dashboard, we searched the web for related datasets and uploaded them to Tableau Public. Once uploaded, various bar graphs and maps were created to illustrate the overall homeless population and the funding received to shelter those individuals.
  • One of the many benefits of using Tableau Public is that the program allows you to create interactive data visualizations.
    link to dashboard
