## Vaccination Dataset Analysis

For this analysis we'll look at 2017 immunization data from the CDC. The data file for this assignment is at `assets/NISPUF17.csv`. A data user's guide, which we'll need in order to map the variables in the data to the analysis being done, is available at `assets/NIS-PUF17-DUG.pdf`.

## 1
A function called `proportion_of_education` returns the proportion of children in the dataset whose mother's education level was less than high school (<12), high school (12), more than high school but not a college graduate (>12), or a college degree.

*This function would return a dictionary in the form of:*
```
    {"less than high school":0.2,
    "high school":0.4,
    "more than high school but not college":0.2,
    "college":0.2}
```

In [None]:
def proportion_of_education():
    import pandas as pd
    
    df = pd.read_csv('assets/NISPUF17.csv')
    
    total = len(df)
    
    LTHS = len(df[df['EDUC1'] == 1])    # less than high school
    HS = len(df[df['EDUC1'] == 2])      # high school
    MTHS = len(df[df['EDUC1'] == 3])    # more than high school but not a college graduate
    CLG = len(df[df['EDUC1'] == 4])     # college degree
    
    return {'less than high school':LTHS/total,
            'high school':HS/total,
            'more than high school but not college':MTHS/total,
            'college':CLG/total}

print(proportion_of_education())
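
The same proportions can also be sketched with `value_counts(normalize=True)`. The example below runs on a small made-up DataFrame rather than `NISPUF17.csv`, and the helper name `proportions_from_codes` is hypothetical. One difference to keep in mind: `value_counts` drops missing values by default, so its denominator is the number of non-missing rows, whereas the function above divides by `len(df)`.

```python
import pandas as pd

# Made-up stand-in for NISPUF17.csv (assumption: EDUC1 takes codes 1-4 as above).
toy = pd.DataFrame({"EDUC1": [1, 2, 2, 3, 4, 4, 4, 2, 3, 4]})

# Labels keyed by the EDUC1 codes used in proportion_of_education().
labels = {
    1: "less than high school",
    2: "high school",
    3: "more than high school but not college",
    4: "college",
}

def proportions_from_codes(codes: pd.Series) -> dict:
    """Return the share of non-missing rows for each education label."""
    shares = codes.value_counts(normalize=True)
    # reindex so that all four keys exist even if a code never occurs
    shares = shares.reindex(list(labels), fill_value=0.0)
    return {labels[int(code)]: float(share) for code, share in shares.items()}

result = proportions_from_codes(toy["EDUC1"])
print(result)
```

On the real data the two approaches agree only if `EDUC1` has no missing entries.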

## 2

Let's explore the relationship between being fed breastmilk as a child and getting a seasonal influenza vaccine from a healthcare provider. We will return a tuple of the average number of influenza vaccines for those children we know received breastmilk as a child and those we know did not.

*This function would return a tuple in the form :*
```
(2.5, 0.1)
```

In [None]:
def average_influenza_doses():
    import pandas as pd
    df = pd.read_csv('assets/NISPUF17.csv')
    
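    dfB = df[df['CBF_01']==1]       # breastfed
    dfN = df[df['CBF_01']==2]       # not breastfed
    
    # Average number of influenza doses; rows with a missing P_NUMFLU are
    # skipped by sum() and excluded from the count by the >= 0 filter
    meanB = dfB['P_NUMFLU'].sum()/len(dfB[dfB['P_NUMFLU']>=0])
    meanN = dfN['P_NUMFLU'].sum()/len(dfN[dfN['P_NUMFLU']>=0])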
    
    return (meanB, meanN)

print(average_influenza_doses())
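
To see why dividing the sum by the count of non-missing rows gives a mean that ignores missing values, here is a minimal sketch on made-up data (not the CDC file). It compares the manual calculation with `Series.mean()`, which also skips `NaN` by default, and shows a `groupby` variant.

```python
import numpy as np
import pandas as pd

# Made-up rows: CBF_01 is 1 (breastfed) or 2 (not); some dose counts are missing.
toy = pd.DataFrame({
    "CBF_01": [1, 1, 1, 2, 2, 2],
    "P_NUMFLU": [2.0, 4.0, np.nan, 1.0, np.nan, 3.0],
})

breastfed = toy.loc[toy["CBF_01"] == 1, "P_NUMFLU"]

# Manual version, as in the function above: NaN >= 0 is False, so the
# missing row is left out of the denominator, and sum() skips it too.
manual = breastfed.sum() / len(breastfed[breastfed >= 0])

# Built-in version: mean() skips NaN by default.
builtin = breastfed.mean()

# Both groups at once.
by_group = toy.groupby("CBF_01")["P_NUMFLU"].mean()

print(manual, builtin)
print(by_group)
```

The two agree as long as `P_NUMFLU` contains no negative values, which the `>= 0` filter would otherwise drop from the count.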

## 3
It would be interesting to see if there is any evidence of a link between vaccine effectiveness and sex of the child. We will calculate the ratio of the number of children who contracted chickenpox but were vaccinated against it (at least one varicella dose) versus those who were vaccinated but did not contract chickenpox. Results are returned by sex.

*This function would return a dictionary in the form of:*
```
    {"male":0.2,
    "female":0.4}
```


In [None]:
def chickenpox_by_sex():
    import pandas as pd
    
    df = pd.read_csv('assets/NISPUF17.csv')
    
    dfM = df[df['SEX'] == 1]        # male
    dfF = df[df['SEX'] == 2]        # female
    
    # HAD_CPOX: had chickenpox (1 = yes, 2 = no); P_NUMVRC: number of varicella doses received
    ratioM = len(dfM[(dfM['P_NUMVRC']>=1)&(dfM['HAD_CPOX']==1)])/len(dfM[(dfM['P_NUMVRC']>=1)&(dfM['HAD_CPOX']==2)])
    ratioF = len(dfF[(dfF['P_NUMVRC']>=1)&(dfF['HAD_CPOX']==1)])/len(dfF[(dfF['P_NUMVRC']>=1)&(dfF['HAD_CPOX']==2)])
    
    return {'male':ratioM,
            'female':ratioF}

print(chickenpox_by_sex())
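
The ratio is computed the same way for each sex, so the repeated expression could be factored into a helper. The sketch below uses a small made-up DataFrame, and the name `vaccinated_ratio` is hypothetical. It is one possible restructuring, not the only one.

```python
import pandas as pd

# Made-up rows; SEX is 1 (male) or 2 (female), HAD_CPOX is 1 (yes) or 2 (no).
toy = pd.DataFrame({
    "SEX":      [1, 1, 1, 1, 2, 2, 2, 2],
    "P_NUMVRC": [1, 2, 1, 0, 1, 1, 2, 1],
    "HAD_CPOX": [1, 2, 2, 1, 2, 2, 2, 1],
})

def vaccinated_ratio(group: pd.DataFrame) -> float:
    """Among children with at least one varicella dose, return
    (number who had chickenpox) / (number who did not)."""
    vaccinated = group[group["P_NUMVRC"] >= 1]
    had = (vaccinated["HAD_CPOX"] == 1).sum()
    had_not = (vaccinated["HAD_CPOX"] == 2).sum()
    return float(had / had_not)

ratios = {
    name: vaccinated_ratio(toy[toy["SEX"] == code])
    for name, code in (("male", 1), ("female", 2))
}
print(ratios)
```

Here the unvaccinated male row is excluded before counting, so the male ratio is 1 to 2 and the female ratio is 1 to 3.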

## 4
A correlation is a statistical relationship between two variables. If we wanted to know if vaccines work, we might look at the correlation between the use of the vaccine and whether it results in prevention of the infection or disease [1]. In this question, we check whether there is a correlation between having had chickenpox and the number of chickenpox (varicella) vaccine doses given.

Some notes on interpreting the answer. The `had_chickenpox_column` is either `1` (for yes) or `2` (for no), and the `num_chickenpox_vaccine_column` is the number of doses a child has been given of the varicella vaccine. A positive correlation (e.g., `corr > 0`) means that an increase in `had_chickenpox_column` (which means more no’s) would also increase the values of `num_chickenpox_vaccine_column` (which means more doses of vaccine). If there is a negative correlation (e.g., `corr < 0`), it indicates that having had chickenpox is related to an increase in the number of vaccine doses.

Also, `pval` is the probability of observing a correlation between `had_chickenpox_column` and `num_chickenpox_vaccine_column` at least as strong as the one computed if the two were in fact unrelated. A small `pval` means that the observed correlation is highly unlikely to have occurred by chance. In this case, `pval` should be very small (it will end in `e-18`, indicating a very small number).

[1] This isn’t really the full picture, since we are not looking at when the dose was given. It’s possible that children had chickenpox and then their parents went to get them the vaccine.
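
To make the sign interpretation above concrete, here is a minimal sketch that computes Pearson's coefficient by hand on made-up values. The helper `pearson_r` is hypothetical and is written in plain Python only to keep the example self-contained; it does not compute a p-value.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up data: 1 = had chickenpox, 2 = did not.
had_cpox = [1, 1, 2, 2, 2, 2]
doses    = [0, 1, 1, 2, 2, 2]

r = pearson_r(had_cpox, doses)
print(r)  # positive: the "no" answers (2) go with more doses
```

Because "no" is coded as the larger value, a positive coefficient here means that children who did not have chickenpox tend to have received more doses.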

In [None]:
def corr_chickenpox():
    import scipy.stats as stats
    import numpy as np
    import pandas as pd

    df = pd.read_csv('assets/NISPUF17.csv')
    df = df[(df['HAD_CPOX'] > 0)&(df['P_NUMVRC'] >= 0)&(df['HAD_CPOX'] < 3)]        #handling missing or corrupt values
    
    corr, pval = stats.pearsonr(df['HAD_CPOX'],df['P_NUMVRC'])
    
    return corr