<a id='Top'></a>
<h1> <center>Analytics Programming: Module 9</center> </h1>
<p><h2><center>Transforming CHS Survey Data</center> 
<center>supported by a <a href="https://github.com/yuleidner/Katz_Data_Analytics/blob/master/M9/README.md">M9 README file </a></center></h2></p>
<center>Alan Leidner Oct 26, 2019</center>

In this notebook I will analyze the New York City Community Health Survey (CHS) data.

The CHS "is a telephone survey conducted annually by the DOHMH, Division of Epidemiology, Bureau of Epidemiology Services. CHS provides robust data on the health of New Yorkers, including neighborhood, borough, and citywide estimates on a broad range of chronic diseases and behavioral risk factors. The data are analyzed and disseminated to influence health program decisions, and increase the understanding of the relationship between health behavior and health status."

1. [Importing](#Importing)
2. [Understanding](#Understanding)
3. [Transforming](#Transforming)
    - [Slice Dataframe](#slice_data)
    - [Cleaning Dataframe](#cleaning)
    - [Arranging Dataframe](#arranging)
4. [Closing Thoughts](#conclusion)
    

Data source: https://data.cityofnewyork.us/Health/Community-Health-Survey/2r9r-m6j4

### Importing<a id='Importing'></a>
First I will import the data from the website using the Socrata API.
<br> I will read the CSV directly into a pandas dataframe.

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None) # show full cell text instead of truncating it; None replaces the deprecated -1
health_survey = pd.read_csv("https://data.cityofnewyork.us/resource/2r9r-m6j4.csv")

### Understanding <a id='Understanding'></a>
Now I'll peek at the top and bottom of the dataframe using the `head` and `tail` methods.
<br>I'm printing both ends as one table using the `concat` function, so be aware that the middle rows still exist but are not shown.

In [2]:
pd.concat([health_survey.head(), health_survey.tail()], axis=0)

Unnamed: 0,survey,question,year,prevalence,lower95_ci,upper95_ci
0,CHS,Health Insurance Coverage,2010,83.3,82.0,84.6
1,CHS,Health Insurance Coverage,2010,83.3,82.0,84.6
2,CHS,Did not get needed medical care,2010,10.3,9.4,11.4
3,CHS,Did not get needed medical care,2010,10.3,9.4,11.4
4,CHS,No Personal Doctor,2010,,,
95,CHS,"Colon cancer screening, adults age 50+ (colonoscopy)",2014,69.9,67.8,72.0
96,CHS,Self-reported Health Status (excellent/very good/good),2014,77.8,76.6,78.9
97,CHS,Self-reported Health Status (excellent/very good/good),2014,77.8,76.6,78.9
98,CHS,"Flu shot in last 12 months, adults ages 65+ (not age-adjusted)",2014,64.2,60.7,67.6
99,CHS,"Flu shot in last 12 months, adults ages 65+ (not age-adjusted)",2014,64.2,60.7,67.6


I seem to have imported the CSV properly, and it has 100 rows, but I don't know what many of the columns mean.
<br>Let's pull in the data dictionary from the website to see if it can explain.

In [3]:
data_dict = pd.read_excel("https://data.cityofnewyork.us/api/views/2r9r-m6j4/files/febc69d7-5928-4191-a678-894790f89a0a?download=true&filename=DOHMHDataDictionary_CHS_5%204%2017.xlsx",
                         sheet_name='Column Info')
data_dict

Unnamed: 0,Data Dictionary - Column Information,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,Column Name,Column Description,"Term, Acronym, or Code Definitions","Additional Notes \n(where applicable, include the range of possible values, units of measure, how to interpret null/zero values, whether there are specific relationships between columns, and information on column source)"
1,Survey,This is a survey of a sample of New York City adults.,"Community Health Survey, CHS",
2,Question,Please see the Codebook tab for detailed descriptions of each question,,
3,Year,The year that the survey was conducted,,
4,Prevalence,The percent of New York City adults who have the characteristic being described,,N/A indicates that the question was not asked in that year.
5,Lower 95% Cl,"Confidence Interval (CI) is a measure of the precision of an estimate: the wider the CI, the more imprecise the estimate. The Lower Bound of the confidence interval is the lower threshold of imprecision. If a different sample of New York City adults were interviewed using the same methodology, 95% of the time their answers would fall between the lower bound and the upper bound of the confidence interval.",,N/A indicates that the question was not asked in that year.
6,Upper 95% Cl,"Confidence Interval (CI) is a measure of the precision of the estimate: the wider the CI, the more imprecise the estimate. The Upper Bound of the confidence interval is the upper threshold of imprecision. If a different sample of New York City adults were interviewed using the same methodology, 95% of the time their answers would fall between the lower bound and the upper bound of the confidence interval.",,N/A indicates that the question was not asked in that year.

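The sheet loads with its real header ("Column Name", "Column Description", ...) sitting in the first data row, which is why the columns show up as `Unnamed: 1` and so on. A minimal sketch of one way to tidy that, using a toy stand-in for the sheet; `promote_first_row` is a hypothetical helper of mine, not part of pandas:

```python
import pandas as pd

# Toy stand-in for the raw sheet: the real header sits in the first data row.
raw = pd.DataFrame(
    [
        ["Column Name", "Column Description"],
        ["Survey", "This is a survey of a sample of New York City adults."],
        ["Year", "The year that the survey was conducted"],
    ],
    columns=["Data Dictionary - Column Information", "Unnamed: 1"],
)


def promote_first_row(df):
    """Use the first row as column labels and drop it from the body."""
    tidy = df.iloc[1:].reset_index(drop=True)
    tidy.columns = df.iloc[0].tolist()
    return tidy


tidy_dict = promote_first_row(raw)
print(tidy_dict)
```

Passing `header=1` to `read_excel` would probably achieve the same thing at load time, assuming the sheet layout matches what is displayed above.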

I believe I understand this data set now. It looks like the CHS survey was given over multiple years. 

In this survey, participants were asked a number of health-related questions, and the aggregated responses were recorded.

**prevalence** is the percent of New York City adults who have the characteristic described by the question.

**lower95_ci** and **upper95_ci** are the lower and upper bounds of the 95% confidence interval, which measures how precise the estimate is for that question in that year. In other words, if the authors threw out their responses and asked a new random sample of New Yorkers the same question using the same methodology, they would expect the answers to fall between the lower number and the upper number 95% of the time.
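
As a rough illustration of how these three columns relate, the sketch below uses two rows copied from the preview above. The `ci_width` and `inside_ci` columns are names I made up for this example. A wider interval means a less precise estimate.

```python
import pandas as pd

# Two 2010 rows copied from the preview of the imported data.
sample = pd.DataFrame(
    {
        "question": ["Health Insurance Coverage", "Did not get needed medical care"],
        "prevalence": [83.3, 10.3],
        "lower95_ci": [82.0, 9.4],
        "upper95_ci": [84.6, 11.4],
    }
)

# Width of the interval in percentage points (hypothetical helper column).
sample["ci_width"] = (sample["upper95_ci"] - sample["lower95_ci"]).round(1)

# Sanity check: the point estimate should sit inside its own interval.
sample["inside_ci"] = sample["prevalence"].between(
    sample["lower95_ci"], sample["upper95_ci"]
)

print(sample[["question", "ci_width", "inside_ci"]])
```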

The only other question I have is if other surveys are nested in this dataset.
I'll double check to make sure we are only using the one survey using the `unique` function on our survey column.
If there is only one value, we'll have confirmed.

In [4]:
health_survey['survey'].unique()

array(['CHS'], dtype=object)
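
The same check can be expressed as a count, which is easier to assert on in a script. This is a small sketch on a toy column rather than the real dataframe:

```python
import pandas as pd

# Toy stand-in for the survey column.
toy = pd.DataFrame({"survey": ["CHS", "CHS", "CHS"]})

# nunique counts distinct non-missing labels.
n_surveys = toy["survey"].nunique()

# value_counts(dropna=False) would also surface any missing labels.
label_counts = toy["survey"].value_counts(dropna=False)

print(n_surveys)
print(label_counts)
```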

### Transforming<a id='Transforming'></a>
Now that we know what the data is, I will start my analysis.

I'll begin by making a copy of the data that keeps just the `question`, `year` and `prevalence` columns.<a id='slice_data'></a>

I'm going to use `copy`, so that I know that I'm not inadvertently modifying the original imported dataset.

In [5]:
basic_chs = health_survey[['question','year', 'prevalence']].copy()
pd.concat([basic_chs.head(), basic_chs.tail()], axis=0)

Unnamed: 0,question,year,prevalence
0,Health Insurance Coverage,2010,83.3
1,Health Insurance Coverage,2010,83.3
2,Did not get needed medical care,2010,10.3
3,Did not get needed medical care,2010,10.3
4,No Personal Doctor,2010,
95,"Colon cancer screening, adults age 50+ (colonoscopy)",2014,69.9
96,Self-reported Health Status (excellent/very good/good),2014,77.8
97,Self-reported Health Status (excellent/very good/good),2014,77.8
98,"Flu shot in last 12 months, adults ages 65+ (not age-adjusted)",2014,64.2
99,"Flu shot in last 12 months, adults ages 65+ (not age-adjusted)",2014,64.2

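To show what `copy` buys us, here is a minimal sketch on a toy dataframe (values taken from the Obesity rows shown later in this notebook): changing the copy leaves the original untouched.

```python
import pandas as pd

# Toy stand-in for the imported dataframe.
original = pd.DataFrame(
    {
        "question": ["Obesity", "Obesity"],
        "year": [2010, 2011],
        "prevalence": [23.4, 23.7],
        "lower95_ci": [22.2, 22.5],  # made-up bounds, for illustration only
    }
)

# Column selection followed by an explicit copy, as in the cell above.
subset = original[["question", "year", "prevalence"]].copy()

# Overwrite a value in the copy only.
subset.loc[0, "prevalence"] = 0.0

print(original.loc[0, "prevalence"])
print(subset.loc[0, "prevalence"])
```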

<a id='cleaning'></a>
I can already see that there are duplicates in this set. I'll start by pruning them out using `drop_duplicates`.

In [6]:
basic_chs.drop_duplicates(keep='first',inplace=True) 
pd.concat([basic_chs.head(), basic_chs.tail()], axis=0)

Unnamed: 0,question,year,prevalence
0,Health Insurance Coverage,2010,83.3
2,Did not get needed medical care,2010,10.3
4,No Personal Doctor,2010,
6,Drinks 1 or more sugar-sweetened beverages per day,2010,30.3
8,Smoking Status (current smokers),2010,14.0
90,Binge Drinking,2014,16.5
92,Obesity,2014,24.7
94,"Colon cancer screening, adults age 50+ (colonoscopy)",2014,69.9
96,Self-reported Health Status (excellent/very good/good),2014,77.8
98,"Flu shot in last 12 months, adults ages 65+ (not age-adjusted)",2014,64.2

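Before dropping rows, it can help to count how many duplicates there are. Here is a sketch on a toy frame built from the first four rows shown above. The `reset_index` call is my own addition to renumber the surviving rows, which the cell above does not do.

```python
import pandas as pd

# First four rows of the sliced dataframe: each row appears twice.
toy = pd.DataFrame(
    {
        "question": ["Health Insurance Coverage"] * 2
        + ["Did not get needed medical care"] * 2,
        "year": [2010] * 4,
        "prevalence": [83.3, 83.3, 10.3, 10.3],
    }
)

# duplicated marks every repeat after the first occurrence.
n_dupes = int(toy.duplicated(keep="first").sum())

# Non-inplace variant that also renumbers the surviving rows from 0.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_dupes)
print(deduped)
```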

It looks like we dropped the duplicate rows. Let's get a sense of what we're working with now using `describe`.

In [7]:
basic_chs.describe()

Unnamed: 0,year,prevalence
count,50.0,48.0
mean,2012.0,41.454167
std,1.428571,28.009998
min,2010.0,9.6
25%,2011.0,16.8
50%,2012.0,24.45
75%,2013.0,68.7
max,2014.0,86.2


<a id='arranging'></a>
It looks like we have 50 rows (10 questions across 5 years), but only 48 non-missing prevalence values.
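
One way to find the rows behind that gap is to filter on `isna`. This is a sketch on three 2010 rows from the dataset, not on the full dataframe:

```python
import pandas as pd

# Three 2010 rows; two of them have no recorded prevalence.
toy = pd.DataFrame(
    {
        "question": ["No Personal Doctor", "Binge Drinking", "Obesity"],
        "year": [2010, 2010, 2010],
        "prevalence": [float("nan"), float("nan"), 23.4],
    }
)

# Keep only the rows where prevalence is missing.
missing = toy[toy["prevalence"].isna()]

print(missing[["question", "year"]])
```

According to the data dictionary, a missing value means the question was not asked in that year.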

The dataframe as it stands is a bit hard to read using `head` and `tail`. The `describe` method gave some summary statistics, but I still don't have a good sense of the data.

To remedy that, I will transform the data into wide form using the `pivot` method, making the question text the column header and the year the row index.

In [8]:
years_by_question = basic_chs.pivot(index='year', columns='question')  # keyword arguments; positional ones fail in recent pandas
years_by_question

Unnamed: 0_level_0,prevalence,prevalence,prevalence,prevalence,prevalence,prevalence,prevalence,prevalence,prevalence,prevalence
question,Binge Drinking,"Colon cancer screening, adults age 50+ (colonoscopy)",Did not get needed medical care,Drinks 1 or more sugar-sweetened beverages per day,"Flu shot in last 12 months, adults ages 65+ (not age-adjusted)",Health Insurance Coverage,No Personal Doctor,Obesity,Self-reported Health Status (excellent/very good/good),Smoking Status (current smokers)
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
2010,,67.5,10.3,30.3,62.3,83.3,,23.4,79.1,14.0
2011,17.9,68.6,10.7,29.9,67.4,81.4,16.9,23.7,78.2,14.8
2012,19.6,68.5,11.1,28.2,61.8,80.2,18.3,24.2,78.7,15.5
2013,18.2,69.0,11.2,23.3,66.8,79.1,19.1,23.4,76.9,16.1
2014,16.5,69.9,9.6,22.5,64.2,86.2,15.6,24.7,77.8,13.9


This is *much* easier to read. Missing values are clear, and comparing the survey results over time is possible at a glance.
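
The output above carries an extra `prevalence` level over the question names because no value column was named. A minimal sketch, on a toy subset of the data, of passing `values` to get flat column labels. It also shows why the duplicates had to go first: `pivot` raises a `ValueError` when an index/column pair appears more than once.

```python
import pandas as pd

# Toy subset in long form (values taken from the table above).
toy = pd.DataFrame(
    {
        "question": ["Obesity", "Obesity", "Binge Drinking", "Binge Drinking"],
        "year": [2013, 2014, 2013, 2014],
        "prevalence": [23.4, 24.7, 18.2, 16.5],
    }
)

# Naming the value column yields a single level of column labels.
wide = toy.pivot(index="year", columns="question", values="prevalence")
print(wide)

# With duplicated rows, pivot cannot reshape and raises ValueError.
try:
    pd.concat([toy, toy]).pivot(index="year", columns="question", values="prevalence")
    pivot_failed = False
except ValueError:
    pivot_failed = True
print(pivot_failed)
```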

However, I believe we may like it better with the year as the column header and the question text as the row index.

In [9]:
questions_by_year = basic_chs.pivot(index='question', columns='year')  # keyword arguments; positional ones fail in recent pandas
questions_by_year

Unnamed: 0_level_0,prevalence,prevalence,prevalence,prevalence,prevalence
year,2010,2011,2012,2013,2014
question,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Binge Drinking,,17.9,19.6,18.2,16.5
"Colon cancer screening, adults age 50+ (colonoscopy)",67.5,68.6,68.5,69.0,69.9
Did not get needed medical care,10.3,10.7,11.1,11.2,9.6
Drinks 1 or more sugar-sweetened beverages per day,30.3,29.9,28.2,23.3,22.5
"Flu shot in last 12 months, adults ages 65+ (not age-adjusted)",62.3,67.4,61.8,66.8,64.2
Health Insurance Coverage,83.3,81.4,80.2,79.1,86.2
No Personal Doctor,,16.9,18.3,19.1,15.6
Obesity,23.4,23.7,24.2,23.4,24.7
Self-reported Health Status (excellent/very good/good),79.1,78.2,78.7,76.9,77.8
Smoking Status (current smokers),14.0,14.8,15.5,16.1,13.9


This way it is *even easier* to compare the numbers! The missing values are just as clear, but comparing questions over time is dead simple!
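
Apart from the extra `prevalence` level, the two layouts are transposes of each other, so a second `pivot` is not strictly needed. Here is a sketch on a toy subset, assuming the value column is named explicitly so the labels are flat:

```python
import pandas as pd

# Toy subset in long form (values taken from the tables above).
toy = pd.DataFrame(
    {
        "question": ["Obesity", "Obesity", "Binge Drinking", "Binge Drinking"],
        "year": [2013, 2014, 2013, 2014],
        "prevalence": [23.4, 24.7, 18.2, 16.5],
    }
)

by_year_rows = toy.pivot(index="year", columns="question", values="prevalence")
by_question_rows = toy.pivot(index="question", columns="year", values="prevalence")

# .T swaps rows and columns of the first layout.
transposed = by_year_rows.T

print(transposed)
print(by_question_rows)
```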

### Closing Thoughts <a id='conclusion'></a>

We've finally transformed the original CSV into a dataframe that we can use. Some examples include:

* We could look to see if health data is getting better or worse over time
* We could look to see which values are correlated
* We could import other data to see if reported health surveys line up with recorded health data from doctor visits, to see if this survey is accurate.
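
As a rough starting point for the first two ideas, the sketch below uses three columns copied from the years-by-question table above. It computes the change between 2010 and 2014 and the pairwise correlations. With only five yearly points per question, the correlations are illustrative at best.

```python
import pandas as pd

# Three questions copied from the wide table, indexed by year.
years_by_q = pd.DataFrame(
    {
        "Drinks 1 or more sugar-sweetened beverages per day": [30.3, 29.9, 28.2, 23.3, 22.5],
        "Obesity": [23.4, 23.7, 24.2, 23.4, 24.7],
        "Smoking Status (current smokers)": [14.0, 14.8, 15.5, 16.1, 13.9],
    },
    index=pd.Index([2010, 2011, 2012, 2013, 2014], name="year"),
)

# Change in prevalence (percentage points) between the first and last year.
change = (years_by_q.loc[2014] - years_by_q.loc[2010]).round(1)
print(change)

# Pairwise Pearson correlation between the question columns.
corr = years_by_q.corr()
print(corr)
```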

Though further analysis is beyond the scope of this notebook, I hope this demonstration and final work may be of some use.

# <center> <br>[Beginning of the page](#Top)</center>