## Exploring Shelter Data
I'm taking it upon myself to learn a bit about data science. 

Since I have a lot of interest in homeless services in Toronto (see [ChalmersCards](chalmerscards.com)), I felt Toronto's [public shelter occupancy dataset](https://www.toronto.ca/ext/open_data/catalog/data_set_files/SMIS_Daily_Occupancy_2017.csv) would make a good learning environment.

---
#### My current understanding of the application of data science:
Data Science: The task of generating insights.

Actionable insights are determined by understanding the values of the client/subject, and using those values to define an appropriate research question. 

1. A research **question** is used to define the research **method**. 
2. Executing the **method** will render a **result and insights**. 
3. From the **insight**, **actions** can be determined. 

---
#### Understanding the client
One of a shelter's greatest concerns is capacity. Apart from shelter names and properties, occupancy against capacity is the only metric tracked in this dataset.

Maybe my question could be:
**Question:** Which shelters are most susceptible to hitting capacity?
or
**Question:** During what time of the year is shelter occupancy highest?
or
**Question:** Which shelter type (male/female/mixed/family) has the highest occupancy?

> I'll tackle these items one at a time

# Which shelters are most susceptible to hitting capacity?

+ Question : Which shelters are most susceptible to hitting capacity?
+ Method : For each shelter, collect occupancy and capacity numbers for the year. Find which shelters spend the most time with their occupancy close to their capacity.

In [7]:
import pandas as pd

In [8]:
import numpy as np

In [9]:
import matplotlib.pyplot as plt

## Importing Libraries
I don't really know what I'm doing, so I'll grab pandas, its dependency NumPy, and a visualization library, Matplotlib.

In [11]:
data = pd.read_csv('data/shelter_Occupancy_2017.csv')
print(data.columns)

Index(['OCCUPANCY_DATE', 'ORGANIZATION_NAME', 'SHELTER_NAME',
       'SHELTER_ADDRESS', 'SHELTER_CITY', 'SHELTER_PROVINCE',
       'SHELTER_POSTAL_CODE', 'FACILITY_NAME', 'PROGRAM_NAME', 'SECTOR',
       'OCCUPANCY', 'CAPACITY'],
      dtype='object')


## Importing dataset
I've grabbed the `shelter_Occupancy_2017.csv` dataset from Toronto's open data portal. I've also printed all the columns for reference.

Now what I need to do is grab every shelter's occupancy/capacity numbers and find which shelters were always close to capacity.

## Defining the method
This may be an unfair assumption, but I'll make it anyway:

---
> A shelter's facilities will be built around its capacity. Therefore, what matters most to the shelter's performance is the proportion of free beds, and not the number of free beds.

For example, using the above assumption, we would infer that:
> Shelter 'A' that is at 68/69 beds capacity

is in worse shape than 
> Shelter 'B' that is at 9/10 beds capacity

because shelter 'A' has about 1.4% of its capacity left (1 free bed out of 69), while shelter 'B' has 10% of its capacity left (1 free bed out of 10).
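
To double-check that comparison, here is a quick sketch of the arithmetic. The helper name `free_share` is made up for this example and is not part of the dataset or of pandas.

```python
def free_share(occupancy, capacity):
    """Return the fraction of beds still free (hypothetical helper)."""
    return (capacity - occupancy) / capacity

shelter_a = free_share(68, 69)  # one free bed out of 69
shelter_b = free_share(9, 10)   # one free bed out of 10

print(f"Shelter A: {shelter_a:.1%} free")
print(f"Shelter B: {shelter_b:.1%} free")
```

Both shelters have exactly one free bed, but as a share of capacity shelter 'A' has far less room.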

---
What I'll try to find out:
1. Which shelters sit at or above the 90th percentile (the top 10%) for occupancy as a percentage of capacity?
2. Of that top 10%, which 3 shelters most often hit 100% capacity?
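
A minimal sketch of how those two steps might look in pandas. The sample numbers are made up, and averaging the ratio per facility is only one possible reading of "highest capacity percentage"; the column names are taken from the dataset.

```python
import pandas as pd

# Made-up sample in the same shape as the dataset (not real shelter figures).
sample = pd.DataFrame({
    "FACILITY_NAME": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "OCCUPANCY":     [10, 10, 9, 5, 6, 5, 20, 19, 20],
    "CAPACITY":      [10, 10, 10, 10, 10, 10, 20, 20, 20],
})
sample["RATIO"] = sample["OCCUPANCY"] / sample["CAPACITY"]

# Step 1: average ratio per facility (an assumption), then keep the
# facilities at or above the 90th percentile of those averages.
mean_ratio = sample.groupby("FACILITY_NAME")["RATIO"].mean()
cutoff = mean_ratio.quantile(0.90)
busiest = mean_ratio[mean_ratio >= cutoff]

# Step 2: among those, count the days at or above 100% and take the top 3.
full_days = (
    sample[sample["RATIO"] >= 1.0]
    .groupby("FACILITY_NAME")
    .size()
    .reindex(busiest.index, fill_value=0)
    .nlargest(3)
)
print(busiest)
print(full_days)
```

With only three facilities in the sample the percentile cut is very coarse; on the full dataset the same two steps would leave a longer shortlist to rank.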

In [39]:
date = data.loc[:,'OCCUPANCY_DATE']
shelter = data.loc[:,'FACILITY_NAME']
df = pd.DataFrame(index=date, columns=['a','b','c'])
df

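| OCCUPANCY_DATE | a | b | c |
|---|---|---|---|
| 01/01/2017 | NaN | NaN | NaN |
| 01/01/2017 | NaN | NaN | NaN |
| 01/01/2017 | NaN | NaN | NaN |
| 01/01/2017 | NaN | NaN | NaN |
| 01/01/2017 | NaN | NaN | NaN |
| 01/01/2017 | NaN | NaN | NaN |
| 01/01/2017 | NaN | NaN | NaN |
| 01/01/2017 | NaN | NaN | NaN |
| 01/01/2017 | NaN | NaN | NaN |
| 01/01/2017 | NaN | NaN | NaN |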


## Planning the dataframe
1. Rows: Date;
2. Columns: Shelter([Shelter_Title, Location], [Occupancy, Capacity])
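
One way this layout could be built is with a pivot table. This is only a sketch on made-up rows; summing with `aggfunc="sum"` assumes a facility may appear more than once per date (for example, once per program), which I have not verified against the real file.

```python
import pandas as pd

# Made-up rows in the dataset's long format (not real shelter figures).
long_df = pd.DataFrame({
    "OCCUPANCY_DATE": pd.to_datetime(
        ["2017-01-01", "2017-01-01", "2017-01-02", "2017-01-02"]
    ),
    "FACILITY_NAME": ["A", "B", "A", "B"],
    "OCCUPANCY": [9, 40, 10, 38],
    "CAPACITY": [10, 40, 10, 40],
})

# One row per date; columns become (measure, facility) pairs.
wide = long_df.pivot_table(
    index="OCCUPANCY_DATE",
    columns="FACILITY_NAME",
    values=["OCCUPANCY", "CAPACITY"],
    aggfunc="sum",
)
print(wide)
```

The result has two column levels, so a single shelter's occupancy series is `wide[("OCCUPANCY", "A")]`.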

> "most at risk of hitting capacity" sounds to me like you want to calculate the ratio OCCUPANCY/CAPACITY for each datum and look for the data with high ratios.

```
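// Pseudocode: compute occupancy/capacity for every shelter on every date
const shelter_ratios = [];
for (let i = 0; i < shelters.length; i++) {
    shelters[i].occ_cap_ratios = [];
    for (let j = 0; j < dates.length; j++) {
        // occupancy divided by capacity, so 1.0 means the shelter is full
        const occ_cap_ratio = shelters[i].occupancy[j] / shelters[i].capacity[j];
        shelters[i].occ_cap_ratios.push({ [dates[j]]: occ_cap_ratio });
    }
    shelter_ratios.push(shelters[i].occ_cap_ratios);
}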
```
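
In pandas the nested loop is probably unnecessary, since dividing one column by another works element-wise. A minimal sketch on made-up rows, computing occupancy divided by capacity as the quote suggests (the `OCC_CAP_RATIO` column name is my own):

```python
import pandas as pd

# Made-up rows (not real shelter figures).
df = pd.DataFrame({
    "OCCUPANCY_DATE": ["01/01/2017", "01/01/2017", "01/02/2017", "01/02/2017"],
    "FACILITY_NAME": ["A", "B", "A", "B"],
    "OCCUPANCY": [68, 9, 69, 8],
    "CAPACITY": [69, 10, 69, 10],
})

# Element-wise division replaces both loops.
df["OCC_CAP_RATIO"] = df["OCCUPANCY"] / df["CAPACITY"]

# Group the ratios by shelter: {facility: {date: ratio}}.
shelter_ratios = {
    name: group.set_index("OCCUPANCY_DATE")["OCC_CAP_RATIO"].to_dict()
    for name, group in df.groupby("FACILITY_NAME")
}
print(shelter_ratios["A"])
```

Rows with a capacity of zero would come out as `inf` or `NaN` here, so they would need to be filtered or handled separately.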

> MisterChalmers: or a rolling window. A shelter with an increasing trend (at, say, 75% now) may be more at risk than another that is consistently at 90%.
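
A rough sketch of what that rolling-window idea could look like. The ratios are invented, and the 3-day window is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up daily occupancy ratios for one shelter trending upward.
dates = pd.date_range("2017-01-01", periods=6, freq="D")
ratio = pd.Series([0.60, 0.65, 0.70, 0.75, 0.80, 0.85], index=dates)

# A 3-day rolling mean smooths out day-to-day noise.
rolling_mean = ratio.rolling(window=3).mean()

# Day-over-day change of the smoothed series; positive means occupancy is climbing.
trend = rolling_mean.diff()

print(pd.DataFrame({"ratio": ratio, "rolling_mean": rolling_mean, "trend": trend}))
```

A shelter whose `trend` stays positive could be flagged even while its ratio is still well below 100%.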

## Index of relevant values
+ Date collected (OCCUPANCY_DATE) at [i,0]
+ Shelter name at [i,2]
+ Shelter address at [i,3]
+ Shelter occupancy at [i,10]
+ Shelter capacity at [i,11]
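
These positions can be read with `iloc`, though selecting by column name is probably less fragile. A small sketch using the column list printed earlier and one row of placeholder values (not real data):

```python
import pandas as pd

columns = ['OCCUPANCY_DATE', 'ORGANIZATION_NAME', 'SHELTER_NAME',
           'SHELTER_ADDRESS', 'SHELTER_CITY', 'SHELTER_PROVINCE',
           'SHELTER_POSTAL_CODE', 'FACILITY_NAME', 'PROGRAM_NAME', 'SECTOR',
           'OCCUPANCY', 'CAPACITY']

# One made-up row; every value here is a placeholder.
row = ["01/01/2017", "Org", "Shelter", "1 Example St", "Toronto", "ON",
       "A1A 1A1", "Facility", "Program", "Men", 9, 10]
df = pd.DataFrame([row], columns=columns)

# Positional lookups matching the index list.
i = 0
print(df.iloc[i, 0], df.iloc[i, 2], df.iloc[i, 3], df.iloc[i, 10], df.iloc[i, 11])

# Name-based selection returns the same values and survives column reordering.
subset = df[["OCCUPANCY_DATE", "SHELTER_NAME", "SHELTER_ADDRESS",
             "OCCUPANCY", "CAPACITY"]]
print(subset)
```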


In [30]:
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['One', 'Orange', 'C', 'D'])
df

| | One | Orange | C | D |
|---|---|---|---|---|
| 2000-01-01 | 0.65477 | 0.926678 | 0.18331 | 1.547004 |
| 2000-01-02 | 0.074499 | -1.22102 | 0.83591 | 0.019799 |
| 2000-01-03 | -2.251502 | 0.542808 | 0.612041 | -0.844552 |
| 2000-01-04 | -0.227076 | -1.277675 | -0.037194 | -1.57453 |
| 2000-01-05 | 0.150048 | -0.820513 | -1.459872 | 0.870769 |
| 2000-01-06 | 0.366685 | 0.407998 | 0.918772 | 0.919961 |
| 2000-01-07 | 0.042024 | 1.911375 | 0.366676 | -1.66773 |
| 2000-01-08 | -0.022837 | 0.256802 | 0.120658 | -2.242906 |
