# Understanding Homelessness Rates

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

## Introduction
**Need to add info about the databases and assumptions of homelessness including personal v macro issues**

### Data Background
- Originally planned to use a dataset from Kaggle, but decided to pull the data directly from the HUD CoC site.
- Used [2007 - 2017 PIT Counts by State](https://www.hudexchange.info/resource/3031/pit-and-hic-data-since-2007/) and combined it into a single dataset.

(borrowed from: https://www.kaggle.com/bltxr9/eda-of-total-homeless-population)
This dataset was generated by the CoCs and provided to HUD. Note: HUD did not conduct a full data quality review of the data submitted by each CoC.

What is the [Continuum of Care (CoC) Program](https://www.hudexchange.info/programs/coc/)?

Original Data: [PIT and HIC Data Since 2007](https://www.hudexchange.info/resource/3031/pit-and-hic-data-since-2007/)

CoC-HUD Summary Reports: [CoC Homeless Populations and Subpopulations Reports](https://www.hudexchange.info/programs/coc/coc-homeless-populations-and-subpopulations-reports/)

**Other Resources**

[Funding Awards](https://www.hudexchange.info/programs/coc/awards-by-component/)

[CoC Dashboard Reports](https://www.hudexchange.info/programs/coc/coc-dashboard-reports/)

[CoC Housing Inventory Count Reports](https://www.hudexchange.info/programs/coc/coc-housing-inventory-count-reports/)

## Data Wrangling

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Inspect Data

In [96]:
df = pd.read_csv('homeless-pit-by-state.csv')
df.head()

Unnamed: 0,State,Year,Number of CoCs,Total Homeless,Sheltered Homeless,Unsheltered Homeless,Homeless Individuals,Sheltered Homeless Individuals,Unsheltered Homeless Individuals,Homeless People in Families,...,Unsheltered Parenting Youth (Under 25),Parenting Youth Under 18,Sheltered Parenting Youth Under 18,Unsheltered Parenting Youth Under 18,Parenting Youth Age 18-24,Sheltered Parenting Youth Age 18-24,Unsheltered Parenting Youth Age 18-24,Children of Parenting Youth,Sheltered Children of Parenting Youth,Unsheltered Children of Parenting Youth
0,AK,2017,2,1845,1551,294,1354,1060,294,491,...,0,0,0,0,22,22,0,39,39,0
1,AL,2017,8,3793,2656,1137,2985,1950,1035,808,...,3,6,6,0,23,20,3,39,35,4
2,AR,2017,6,2467,1273,1194,2068,937,1131,399,...,0,0,0,0,10,10,0,13,13,0
3,AZ,2017,3,8947,5781,3166,6488,3423,3065,2459,...,0,0,0,0,81,81,0,112,112,0
4,CA,2017,43,134278,42636,91642,112756,25022,87734,21522,...,234,16,11,5,874,645,229,1058,782,276


#### Check Data Types and Missing Data

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 605 entries, 0 to 604
Data columns (total 45 columns):
State                                                          605 non-null object
Year                                                           605 non-null int64
Number of CoCs                                                 605 non-null int64
Total Homeless                                                 605 non-null object
Sheltered Homeless                                             605 non-null object
Unsheltered Homeless                                           605 non-null object
Homeless Individuals                                           605 non-null object
Sheltered Homeless Individuals                                 605 non-null object
Unsheltered Homeless Individuals                               605 non-null object
Homeless People in Families                                    605 non-null object
Sheltered Homeless People in Families                          605 

_Observations_
- Missing data is consistent across groups of categories. Visual inspection of the data confirms that this is due to additional categories being added in subsequent years.
- Most of the count columns are stored as `object` and will need to be converted to `float`.
- Not all categories are mutually exclusive; how the columns sum together needs to be confirmed.

Visual inspection of the data shows that all columns are available from 2015 onwards, 2011 - 2014 contains columns up to `Unsheltered Homeless Veterans`, and 2007 - 2013 contains columns up to `Unsheltered Chronically Homeless Individuals`.
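The observation about column availability can also be checked programmatically rather than visually. The following is a minimal sketch on a toy frame: the helper name `columns_reported_by_year` and the sample values are hypothetical and are not part of the original notebook. It assumes missing entries are already `NaN`; at this stage of the notebook the `.` placeholders would still count as reported values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the PIT data; values are placeholders, not real counts.
sample = pd.DataFrame({
    'State': ['AK', 'AK', 'AK'],
    'Year': [2017, 2012, 2008],
    'Total Homeless': [30.0, 20.0, 10.0],
    'Homeless Veterans': [3.0, 2.0, np.nan],
    'Parenting Youth (Under 25)': [1.0, np.nan, np.nan],
})

def columns_reported_by_year(frame, year_col='Year'):
    """Count, for each year, the columns holding at least one non-null value."""
    value_cols = list(frame.columns.drop(['State', year_col]))
    # count() ignores NaN, so a count above zero means the column was reported.
    has_data = frame.groupby(year_col)[value_cols].count() > 0
    return has_data.sum(axis=1)

availability = columns_reported_by_year(sample)
print(availability)
```

On the real data, a jump in this count between two years would mark the year in which a group of categories was introduced.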

#### Check unique values

In [12]:
df.nunique()

State                                                           56
Year                                                            11
Number of CoCs                                                  34
Total Homeless                                                 577
Sheltered Homeless                                             570
Unsheltered Homeless                                           523
Homeless Individuals                                           573
Sheltered Homeless Individuals                                 560
Unsheltered Homeless Individuals                               520
Homeless People in Families                                    558
Sheltered Homeless People in Families                          550
Unsheltered Homeless People in Families                        378
Chronically Homeless                                           522
Sheltered Chronically Homeless                                 461
Unsheltered Chronically Homeless                              

_Observations_
- There are 11 years of data contained in the set (2007 - 2017)
- Need to confirm which states are covered by the `State` column

In [13]:
df['State'].unique()

array(['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
       'GU', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD',
       'ME', 'MI', 'MN', 'MO', 'MP', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH',
       'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR', 'RI', 'SC',
       'SD', 'TN', 'TX', 'UT', 'VA', 'VI', 'VT', 'WA', 'WI', 'WV', 'WY',
       'KS*'], dtype=object)

- It appears that the list includes DC and some US territories. I am not familiar with all of the abbreviations, so state names will be needed for reference.
- One of the state codes is `KS*`. A note in the dataset says: "The number of CoCs in 2017 was 399. However, MO-604 merged in 2016 and covers territory in both MO and KS, contributing to the PIT count in both states." These rows will need to be inspected individually to understand them.
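One way to isolate the footnoted rows for inspection is to filter on the trailing asterisk. This is only a sketch on a toy frame: the `Number of CoCs` values and the `State Code` column are hypothetical, and whether `KS*` should eventually be merged into `KS` is left open here.

```python
import pandas as pd

# Toy frame; the CoC counts are placeholders, not values from the dataset.
sample = pd.DataFrame({
    'State': ['KS', 'KS*', 'MO'],
    'Year': [2017, 2017, 2017],
    'Number of CoCs': [1, 1, 1],
})

# Rows whose state code carries a footnote marker (a trailing asterisk).
flagged = sample[sample['State'].str.endswith('*')]

# A separate cleaned column keeps the original code available for reference.
sample['State Code'] = sample['State'].str.rstrip('*')

print(flagged)
print(sample['State Code'].unique())
```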

### Clean Data
#### Convert Columns to `float`

In [64]:
df.columns

Index(['State', 'Year', 'Number of CoCs', 'Total Homeless',
       'Sheltered Homeless', 'Unsheltered Homeless', 'Homeless Individuals',
       'Sheltered Homeless Individuals', 'Unsheltered Homeless Individuals',
       'Homeless People in Families', 'Sheltered Homeless People in Families',
       'Unsheltered Homeless People in Families', 'Chronically Homeless',
       'Sheltered Chronically Homeless', 'Unsheltered Chronically Homeless',
       'Chronically Homeless Individuals',
       'Sheltered Chronically Homeless Individuals',
       'Unsheltered Chronically Homeless Individuals',
       'Chronically Homeless People in Families',
       'Sheltered Chronically Homeless People in Families',
       'Unsheltered Chronically Homeless People in Families',
       'Homeless Veterans', 'Sheltered Homeless Veterans',
       'Unsheltered Homeless Veterans',
       'Homeless Unaccompanied Youth (Under 25)',
       'Sheltered Homeless Unaccompanied Youth (Under 25)',
       'Unsheltered Home

Missing data for some states is reported as `.`, for example:

In [97]:
df.query('State == "MP"')[:5]

Unnamed: 0,State,Year,Number of CoCs,Total Homeless,Sheltered Homeless,Unsheltered Homeless,Homeless Individuals,Sheltered Homeless Individuals,Unsheltered Homeless Individuals,Homeless People in Families,...,Unsheltered Parenting Youth (Under 25),Parenting Youth Under 18,Sheltered Parenting Youth Under 18,Unsheltered Parenting Youth Under 18,Parenting Youth Age 18-24,Sheltered Parenting Youth Age 18-24,Unsheltered Parenting Youth Age 18-24,Children of Parenting Youth,Sheltered Children of Parenting Youth,Unsheltered Children of Parenting Youth
26,MP,2017,1,672,24,648,208,11,197,464,...,0,0,0,0,0,0,0,0,0,0
81,MP,2016,0,.,.,.,.,.,.,.,...,.,.,.,.,.,.,.,.,.,.
136,MP,2015,0,.,.,.,.,.,.,.,...,.,.,.,.,.,.,.,.,.,.
191,MP,2014,0,.,.,.,.,.,.,.,...,,,,,,,,,,
246,MP,2013,0,.,.,.,.,.,.,.,...,,,,,,,,,,


These needed to be converted to NaN to allow for further data transformation.

In [98]:
df.replace('.', np.nan, inplace=True)
df.query('State == "MP"')[:5]

Unnamed: 0,State,Year,Number of CoCs,Total Homeless,Sheltered Homeless,Unsheltered Homeless,Homeless Individuals,Sheltered Homeless Individuals,Unsheltered Homeless Individuals,Homeless People in Families,...,Unsheltered Parenting Youth (Under 25),Parenting Youth Under 18,Sheltered Parenting Youth Under 18,Unsheltered Parenting Youth Under 18,Parenting Youth Age 18-24,Sheltered Parenting Youth Age 18-24,Unsheltered Parenting Youth Age 18-24,Children of Parenting Youth,Sheltered Children of Parenting Youth,Unsheltered Children of Parenting Youth
26,MP,2017,1,672.0,24.0,648.0,208.0,11.0,197.0,464.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
81,MP,2016,0,,,,,,,,...,,,,,,,,,,
136,MP,2015,0,,,,,,,,...,,,,,,,,,,
191,MP,2014,0,,,,,,,,...,,,,,,,,,,
246,MP,2013,0,,,,,,,,...,,,,,,,,,,
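As an alternative to cleaning after loading, the placeholder and the thousands separators could be handled while reading the file. The sketch below uses an in-memory CSV in place of `homeless-pit-by-state.csv` and assumes the real file quotes its comma-formatted numbers. `na_values` and `thousands` are standard `pd.read_csv` parameters.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV; the CA values are taken from the table above.
raw = io.StringIO(
    'State,Year,Total Homeless,Sheltered Homeless\n'
    'CA,2017,"134,278","42,636"\n'
    'MP,2016,.,.\n'
)

# Treat '.' as missing and strip thousands separators during parsing.
parsed = pd.read_csv(raw, na_values=['.'], thousands=',')
print(parsed.dtypes)
print(parsed)
```

Columns containing a missing value come back as `float64`, which matches the target type of the conversion in this section.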


All commas needed to be removed from the str values to allow conversion to `float` while managing `NaN` values.

In [127]:
df[df.columns] = df[df.columns].replace({',':''}, regex = True)

In [128]:
df[df.columns[1:]] = df[df.columns[1:]].astype(float)

In [130]:
df['Year'] = df['Year'].astype(int)
df.head()

Unnamed: 0,State,Year,Number of CoCs,Total Homeless,Sheltered Homeless,Unsheltered Homeless,Homeless Individuals,Sheltered Homeless Individuals,Unsheltered Homeless Individuals,Homeless People in Families,...,Unsheltered Parenting Youth (Under 25),Parenting Youth Under 18,Sheltered Parenting Youth Under 18,Unsheltered Parenting Youth Under 18,Parenting Youth Age 18-24,Sheltered Parenting Youth Age 18-24,Unsheltered Parenting Youth Age 18-24,Children of Parenting Youth,Sheltered Children of Parenting Youth,Unsheltered Children of Parenting Youth
0,AK,2017,2.0,1845.0,1551.0,294.0,1354.0,1060.0,294.0,491.0,...,0.0,0.0,0.0,0.0,22.0,22.0,0.0,39.0,39.0,0.0
1,AL,2017,8.0,3793.0,2656.0,1137.0,2985.0,1950.0,1035.0,808.0,...,3.0,6.0,6.0,0.0,23.0,20.0,3.0,39.0,35.0,4.0
2,AR,2017,6.0,2467.0,1273.0,1194.0,2068.0,937.0,1131.0,399.0,...,0.0,0.0,0.0,0.0,10.0,10.0,0.0,13.0,13.0,0.0
3,AZ,2017,3.0,8947.0,5781.0,3166.0,6488.0,3423.0,3065.0,2459.0,...,0.0,0.0,0.0,0.0,81.0,81.0,0.0,112.0,112.0,0.0
4,CA,2017,43.0,134278.0,42636.0,91642.0,112756.0,25022.0,87734.0,21522.0,...,234.0,16.0,11.0,5.0,874.0,645.0,229.0,1058.0,782.0,276.0


In [126]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 605 entries, 0 to 604
Data columns (total 45 columns):
State                                                          605 non-null object
Year                                                           605 non-null int32
Number of CoCs                                                 605 non-null float64
Total Homeless                                                 595 non-null float64
Sheltered Homeless                                             595 non-null float64
Unsheltered Homeless                                           595 non-null float64
Homeless Individuals                                           595 non-null float64
Sheltered Homeless Individuals                                 595 non-null float64
Unsheltered Homeless Individuals                               595 non-null float64
Homeless People in Families                                    595 non-null float64
Sheltered Homeless People in Families                     

#### Confirm Column Configurations
To confirm how each of the individual columns are grouped, the first row was tested.

In [132]:
test = df.head(1)
test

Unnamed: 0,State,Year,Number of CoCs,Total Homeless,Sheltered Homeless,Unsheltered Homeless,Homeless Individuals,Sheltered Homeless Individuals,Unsheltered Homeless Individuals,Homeless People in Families,...,Unsheltered Parenting Youth (Under 25),Parenting Youth Under 18,Sheltered Parenting Youth Under 18,Unsheltered Parenting Youth Under 18,Parenting Youth Age 18-24,Sheltered Parenting Youth Age 18-24,Unsheltered Parenting Youth Age 18-24,Children of Parenting Youth,Sheltered Children of Parenting Youth,Unsheltered Children of Parenting Youth
0,AK,2017,2.0,1845.0,1551.0,294.0,1354.0,1060.0,294.0,491.0,...,0.0,0.0,0.0,0.0,22.0,22.0,0.0,39.0,39.0,0.0


A series of tests were completed to confirm the expected combinations:

`Total Homeless` = `Sheltered Homeless` + `Unsheltered Homeless`

In [150]:
print(test['Sheltered Homeless'][0] + test['Unsheltered Homeless'][0])
print(test['Total Homeless'][0])

1845.0
1845.0


`Homeless Individuals` = `Sheltered Homeless Individuals` + `Unsheltered Homeless Individuals`

In [151]:
print(test['Sheltered Homeless Individuals'][0] + test['Unsheltered Homeless Individuals'][0])
print(test['Homeless Individuals'][0])

1354.0
1354.0


`Homeless People in Families` = `Sheltered Homeless People in Families` + `Unsheltered Homeless People in Families`

In [141]:
print(test['Sheltered Homeless People in Families'][0] + test['Unsheltered Homeless People in Families'][0])
print(test['Homeless People in Families'][0])

491.0
491.0


`Total Homeless` = `Homeless Individuals` + `Homeless People in Families`

In [143]:
print(test['Homeless Individuals'][0] + test['Homeless People in Families'][0])
print(test['Total Homeless'][0])

1845.0
1845.0


`Chronically Homeless` = `Sheltered Chronically Homeless` + `Unsheltered Chronically Homeless`

In [144]:
print(test['Sheltered Chronically Homeless'][0] + test['Unsheltered Chronically Homeless'][0])
print(test['Chronically Homeless'][0])

257.0
257.0


It is also expected that `Chronically Homeless` is a subset of `Total Homeless` and so the values for `Chronically Homeless` should be less than that of the total.

In [148]:
test['Chronically Homeless'][0] < test['Total Homeless'][0]

True

As it is not captured, a column for `Not Chronically Homeless` should be added.

`Sheltered Chronically Homeless` = `Sheltered Chronically Homeless Individuals` + `Sheltered Chronically Homeless People in Families`

In [145]:
print(test['Sheltered Chronically Homeless Individuals'][0] + test['Sheltered Chronically Homeless People in Families'][0])
print(test['Sheltered Chronically Homeless'][0])

158.0
158.0


`Unsheltered Chronically Homeless` = `Unsheltered Chronically Homeless Individuals` + `Unsheltered Chronically Homeless People in Families`

In [146]:
print(test['Unsheltered Chronically Homeless Individuals'][0] + test['Unsheltered Chronically Homeless People in Families'][0])
print(test['Unsheltered Chronically Homeless'][0])

99.0
99.0


`Homeless Veterans` = `Sheltered Homeless Veterans` + `Unsheltered Homeless Veterans`

In [147]:
print(test['Sheltered Homeless Veterans'][0] + test['Unsheltered Homeless Veterans'][0])
print(test['Homeless Veterans'][0])

124.0
124.0


It is expected that `Homeless Veterans` is a subset of `Total Homeless` and should be lower.

In [152]:
test['Homeless Veterans'][0] < test['Total Homeless'][0]

True

As it is not captured, a column for `Homeless Non-Veteran` should be added.

`Homeless Unaccompanied Youth (Under 25)` = `Sheltered Homeless Unaccompanied Youth (Under 25)` + `Unsheltered Homeless Unaccompanied Youth (Under 25)`

In [153]:
print(test['Sheltered Homeless Unaccompanied Youth (Under 25)'][0] + test['Unsheltered Homeless Unaccompanied Youth (Under 25)'][0])
print(test['Homeless Unaccompanied Youth (Under 25)'][0])

162.0
162.0


It is expected that `Homeless Unaccompanied Youth (Under 25)` is a subset of `Homeless Individuals` and should be lower.

In [154]:
test['Homeless Unaccompanied Youth (Under 25)'][0] < test['Homeless Individuals'][0]

True

As it is not captured, a column for `Homeless Adult` should be added.
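The three complement columns proposed so far (`Not Chronically Homeless`, `Homeless Non-Veteran`, `Homeless Adult`) could be derived by subtraction. This is a sketch under assumptions: the `complements` mapping is hypothetical, and defining `Homeless Adult` as `Homeless Individuals` minus unaccompanied youth is one reading of the subset relationship tested above, not a definition taken from HUD.

```python
import numpy as np
import pandas as pd

# First row uses the AK 2017 values printed above; second row is all missing.
sample = pd.DataFrame({
    'Total Homeless': [1845.0, np.nan],
    'Chronically Homeless': [257.0, np.nan],
    'Homeless Veterans': [124.0, np.nan],
    'Homeless Individuals': [1354.0, np.nan],
    'Homeless Unaccompanied Youth (Under 25)': [162.0, np.nan],
})

# new column -> (containing group, subset to subtract); mapping is an assumption.
complements = {
    'Not Chronically Homeless': ('Total Homeless', 'Chronically Homeless'),
    'Homeless Non-Veteran': ('Total Homeless', 'Homeless Veterans'),
    'Homeless Adult': ('Homeless Individuals',
                       'Homeless Unaccompanied Youth (Under 25)'),
}

for new_col, (whole, part) in complements.items():
    # NaN propagates, so rows without data stay missing instead of becoming 0.
    sample[new_col] = sample[whole] - sample[part]

print(sample[list(complements)])
```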

`Homeless Unaccompanied Youth (Under 25)` = `Homeless Unaccompanied Children (Under 18)` + `Homeless Unaccompanied Young Adults (Age 18-24)`

In [155]:
print(test['Homeless Unaccompanied Children (Under 18)'][0] + test['Homeless Unaccompanied Young Adults (Age 18-24)'][0])
print(test['Homeless Unaccompanied Youth (Under 25)'][0])

162.0
162.0


`Homeless Unaccompanied Children (Under 18)` = `Sheltered Homeless Unaccompanied Children (Under 18)` + `Unsheltered Homeless Unaccompanied Children (Under 18)`

In [158]:
print(test['Sheltered Homeless Unaccompanied Children (Under 18)'][0] + test['Unsheltered Homeless Unaccompanied Children (Under 18)'][0])
print(test['Homeless Unaccompanied Children (Under 18)'][0])

15.0
15.0


`Homeless Unaccompanied Young Adults (Age 18-24)` = `Sheltered Homeless Unaccompanied Young Adults (Age 18-24)` + `Unsheltered Homeless Unaccompanied Young Adults (Age 18-24)`

In [160]:
print(test['Sheltered Homeless Unaccompanied Young Adults (Age 18-24)'][0] + test['Unsheltered Homeless Unaccompanied Young Adults (Age 18-24)'][0])
print(test['Homeless Unaccompanied Young Adults (Age 18-24)'][0])

147.0
147.0


`Parenting Youth (Under 25)` = `Sheltered Parenting Youth (Under 25)` + `Unsheltered Parenting Youth (Under 25)`

In [161]:
print(test['Sheltered Parenting Youth (Under 25)'][0] + test['Unsheltered Parenting Youth (Under 25)'][0])
print(test['Parenting Youth (Under 25)'][0])

22.0
22.0


`Parenting Youth (Under 25)` = `Parenting Youth Under 18` + `Parenting Youth Age 18-24`

In [162]:
print(test['Parenting Youth Under 18'][0] + test['Parenting Youth Age 18-24'][0])
print(test['Parenting Youth (Under 25)'][0])

22.0
22.0


`Parenting Youth Age 18-24` = `Sheltered Parenting Youth Age 18-24` + `Unsheltered Parenting Youth Age 18-24`

In [163]:
print(test['Sheltered Parenting Youth Age 18-24'][0] + test['Unsheltered Parenting Youth Age 18-24'][0])
print(test['Parenting Youth Age 18-24'][0])

22.0
22.0


`Parenting Youth Under 18` = `Sheltered Parenting Youth Under 18` + `Unsheltered Parenting Youth Under 18`

In [164]:
print(test['Sheltered Parenting Youth Under 18'][0] + test['Unsheltered Parenting Youth Under 18'][0])
print(test['Parenting Youth Under 18'][0])

0.0
0.0


`Children of Parenting Youth` = `Sheltered Children of Parenting Youth` + `Unsheltered Children of Parenting Youth`

In [165]:
print(test['Sheltered Children of Parenting Youth'][0] + test['Unsheltered Children of Parenting Youth'][0])
print(test['Children of Parenting Youth'][0])

39.0
39.0


Children of parenting youth and their parents form a subset of `Homeless People in Families`, so their sum should be less than that total.

In [166]:
(test['Parenting Youth (Under 25)'][0] + test['Children of Parenting Youth'][0]) < test['Homeless People in Families'][0]

True

These comparisons are considered sufficient to confirm that the data is structured as intended. The data can be divided by sheltering type (sheltered or unsheltered), homelessness type (chronic or not), family status (individual or family), veteran status (veteran or not), and age (under 25 or not), but not all segmentations are carried across each category.
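The checks above only test the first row. The same sum rules could be applied to every row at once, as in the following sketch. The `sum_rules` mapping and the `failed_rows` helper are hypothetical names, only one rule is shown, and rows with missing values are skipped rather than reported as failures.

```python
import numpy as np
import pandas as pd

# AK and AL 2017 values from the table above, plus one all-missing row.
sample = pd.DataFrame({
    'Total Homeless': [1845.0, 3793.0, np.nan],
    'Sheltered Homeless': [1551.0, 2656.0, np.nan],
    'Unsheltered Homeless': [294.0, 1137.0, np.nan],
})

# total column -> component columns that should add up to it.
sum_rules = {
    'Total Homeless': ['Sheltered Homeless', 'Unsheltered Homeless'],
}

def failed_rows(frame, rules):
    """Return, per rule, the index labels where the components do not sum to the total."""
    failures = {}
    for total_col, part_cols in rules.items():
        # Only rows with every involved value present can be tested.
        complete = frame.dropna(subset=[total_col] + part_cols)
        mismatch = ~np.isclose(complete[part_cols].sum(axis=1), complete[total_col])
        failures[total_col] = complete.index[mismatch].tolist()
    return failures

print(failed_rows(sample, sum_rules))
```

An empty list for a rule means every fully reported row satisfies it; any listed index labels point to rows worth inspecting by hand.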

The groupings are as follows.