## Contents:

1. Introduction:
    - What questions are we hoping to answer here?
    - How would we ideally proceed?
2. Imports & Reading in the data
3. Understanding and Cleaning the data
    - What do the columns mean?
    - What are the unique values in each column?
    - Which NaNs should be replaced, and with what?
    - Which columns should be dropped?
    - Which rows should be dropped?
4. Helper functions
5. Charting how state-wise consumption of alcohol changed between 2010 and 2014
    - Will require streamlining in terms of which years, DataValueUnits and data descriptions we want to keep
6. Charting how renal failure rates increased between 2011 and 2014
    - Will require streamlining in terms of which years, DataValueUnits and data descriptions we want to keep
7. Plotting the impact of the 2


# Introduction:

### Questions I want to answer with this:
- Alcohol trends across states
    * Are these getting better with time?
- How they correlate with renal failure
    * Did the states that showed an increase in alcohol consumption also show an increase in renal failure?

### The dataset combines results from multiple surveys across multiple years, so it will need cleaning

### Several results use staggered categorization of the data, which isn't uniform
    

# Imports & Reading in the Data: 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!pip3 install plotly

In [None]:
from sklearn import preprocessing

import matplotlib.pyplot as plt
import seaborn as sns
from bokeh.plotting import figure, show, output_file
import plotly
import plotly.express as px

import json

from urllib.request import urlopen

import warnings
warnings.filterwarnings('ignore')

In [None]:
df_source = pd.read_csv('/kaggle/input/chronic-disease/U.S._Chronic_Disease_Indicators.csv')
df_source.head()

# Understanding the data

In [None]:
df_source.info()

In [None]:
for col in df_source:
    print(col)
    print(df_source[col].value_counts())
    print('*******')

Based on the above, the following columns are either entirely empty or of no use to us:
* DataSource
* Response
* DataValueFootnoteSymbol
* DataValueAlt
* StratificationCategory2
* StratificationCategory3
* Stratification2
* Stratification3
* ResponseID
* QuestionID
* TopicID
* StratificationCategoryID1
* StratificationCategoryID2
* StratificationCategoryID3
* StratificationID1
* StratificationID2
* StratificationID3

In [None]:
df = df_source.drop(columns=["DataSource",
"Response",
"DataValueFootnoteSymbol",
"DataValueAlt",
'StratificationCategory2',
"StratificationCategory3",
"Stratification2",
"Stratification3",
"ResponseID",
"QuestionID",
"TopicID",
"StratificationCategoryID1",
"StratificationCategoryID2",
"StratificationCategoryID3",
"StratificationID1",
"StratificationID2",
"StratificationID3"])
print(len(df))

In [None]:
df.info()

How many values are NaN in each column?

In [None]:
print(df.isna().sum())

Things we need to look into:
* What is the data value when its unit is NaN, and does it make sense? If not, remove the entry.
* If an entry has no data value, what are the corresponding topic and question?
* Are the footnotes necessary given the values?
* Do the confidence limits make sense in the context of the data value?
* If there's no geolocation but there is a location, we can probably still plot the entry by its location, so it's okay to leave this column as it is.
* Certain data value cells are empty because of a low number of respondents or a lack of available data, as indicated by the footnotes. Get rid of those.
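
One way to work through the checklist above is to tabulate the footnote text for rows whose `DataValue` is missing, which shows why each value is absent before anything gets dropped. A minimal sketch on a made-up frame (the rows in `toy` are hypothetical; only the column names and footnote strings come from the dataset):

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "DataValue": ["2.3", None, None, "4.1"],
    "DataValueUnit": ["gallons", None, "%", "%"],
    "DatavalueFootnote": [
        None,
        "No data available",
        "Data not shown because of too few respondents or cases",
        None,
    ],
})

# Footnotes attached to the rows that have no data value.
missing = toy[toy["DataValue"].isna()]
footnote_counts = missing["DatavalueFootnote"].value_counts(dropna=False)
print(footnote_counts)

# Number of rows missing both the value and its unit.
both_missing = int((toy["DataValue"].isna() & toy["DataValueUnit"].isna()).sum())
print(both_missing)
```

On the real data, the same two lines applied to `df` would show whether every missing value is explained by a footnote.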

In [None]:
# Drop rows that have neither a data value nor a unit
print(len(df))
df = df[~(df['DataValueUnit'].isna() & df['DataValue'].isna())]
print(len(df))

# Drop any remaining rows without a data value
df = df[df['DataValue'].notna()]
print(len(df))

# Drop rows whose footnote says the value is unavailable or suppressed
df = df[df['DatavalueFootnote']!='No data available']
print(len(df))

df = df[df['DatavalueFootnote']!='Data not shown because of too few respondents or cases']
print(len(df))

# Replace missing footnotes with an empty string (avoids chained assignment)
df = df.assign(DatavalueFootnote=df['DatavalueFootnote'].fillna(''))
print(df.isna().sum())

Number of rows per topic:

In [None]:
df.Topic.value_counts()

### Helper function to find relevant Questions by key words

In [None]:
# words: list of keywords you'd associate with the question (matching is case-sensitive)
def find_ques(words):
    rel = set()
    for q in df.Question.dropna().unique():
        for w in words:
            if str(w) in str(q):
                rel.add(q)

    return list(rel)
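
Note that `find_ques` matches case-sensitively, so the keyword 'alcohol' would miss a question that starts with 'Alcohol'. A possible case-insensitive variant is sketched below. `find_ques_ci` is a hypothetical helper, not part of the notebook, and it takes the questions as an argument instead of reading the global `df`:

```python
def find_ques_ci(questions, words):
    """Return the unique questions containing any keyword, ignoring case.

    Hypothetical variant of find_ques: the questions are passed in
    explicitly, and both sides are lower-cased before matching.
    """
    lowered = [str(w).lower() for w in words]
    return sorted({q for q in questions
                   if any(w in str(q).lower() for w in lowered)})


# Sample question strings used only for illustration.
sample = [
    "Alcohol use among youth",
    "Per capita alcohol consumption among persons aged >= 14 years",
    "Prevalence of chronic kidney disease among adults aged >= 18 years",
]
matches = find_ques_ci(sample, ["alcohol"])
print(matches)
```

In the notebook this could be called as `find_ques_ci(df.Question.dropna().unique(), ['alcohol', 'drinking'])`.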


### Helper function to merge dataframes on Location and keep only the data values

In [None]:
def merge_dfs_loc(df1, df2):
    # keep only location and value; after the merge df1's values are DataValue_x and df2's are DataValue_y
    temp1 = df1[['LocationAbbr','DataValue']].reset_index(drop=True)
    temp2 = df2[['LocationAbbr','DataValue']].reset_index(drop=True)
    return temp1.merge(temp2, on='LocationAbbr')
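
Since both inputs carry a `DataValue` column, the merge renames them with pandas' default suffixes, which is where the `DataValue_x` and `DataValue_y` names used later come from. A small sketch with made-up values (the frames `early` and `late` and the location codes are hypothetical):

```python
import pandas as pd

# Hypothetical values for two locations in an earlier and a later year.
early = pd.DataFrame({"LocationAbbr": ["AA", "BB"], "DataValue": [2.0, 3.0]})
late = pd.DataFrame({"LocationAbbr": ["AA", "BB"], "DataValue": [2.5, 2.8]})

# Same kind of merge as merge_dfs_loc: the overlapping column gets
# suffix _x for the left frame and _y for the right frame.
merged = early.merge(late, on="LocationAbbr")
print(list(merged.columns))  # ['LocationAbbr', 'DataValue_x', 'DataValue_y']

# Locations whose value went up between the two frames.
increased = set(merged.loc[merged["DataValue_y"] > merged["DataValue_x"], "LocationAbbr"])
print(increased)  # {'AA'}
```

So `DataValue_x` always belongs to the first argument and `DataValue_y` to the second, which matters once two such merged frames are merged again.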
    

In [None]:
print(find_ques(['alcohol','drinking']))

# Alcoholism Trends

## Trends over the years and across states

In [None]:
df_alc = pd.concat([df[df['Question']==x] for x in find_ques(['alcohol','drinking']) if x!='Population served by community water systems that receive optimally fluoridated drinking water'])

In [None]:
per_cap_alc = df_alc[df_alc['Question']=='Per capita alcohol consumption among persons aged >= 14 years']
per_cap_alc['YearStart'].value_counts()
# per capita consumption in gallons; data for 2010 and 2014 is available for all 52 locations
# .copy() so the numeric conversion below modifies independent frames, not slices of per_cap_alc
per_cap_alc_2010 = per_cap_alc[per_cap_alc['YearStart']==2010].copy()
per_cap_alc_2014 = per_cap_alc[per_cap_alc['YearStart']==2014].copy()
per_cap_alc_2010['DataValue'] = pd.to_numeric(per_cap_alc_2010['DataValue'])
per_cap_alc_2014['DataValue'] = pd.to_numeric(per_cap_alc_2014['DataValue'])

In [None]:
fig = px.choropleth(
    per_cap_alc_2010,
    locations="LocationAbbr",
    locationmode='USA-states',
    color= 'DataValue',
    color_continuous_scale='Matter',  
    scope='usa',
    hover_name="LocationDesc",
    labels = {'DataValue':'Gallons/capita'},
    title="per capita alcohol consumption in the year 2010 in gallons"
)

fig.show()

fig = px.choropleth(
    per_cap_alc_2014,
    locations="LocationAbbr",
    locationmode='USA-states',
    color= 'DataValue',
    color_continuous_scale='Matter',  
    scope='usa',
    hover_name="LocationDesc",
    labels = {'DataValue':'Gallons/capita'},
    title="per capita alcohol consumption in the year 2014 in gallons"
)

fig.show()



In [None]:
df_alc_change =merge_dfs_loc(per_cap_alc_2010,per_cap_alc_2014)
df_alc_in= set(df_alc_change.LocationAbbr[df_alc_change['DataValue_y']>df_alc_change['DataValue_x']])

In [None]:
f, ax = plt.subplots(figsize=(8, 15))

sns.set_color_codes("muted")
sns.barplot(x="DataValue_y", y="LocationAbbr", data=df_alc_change,
            label="2014", color="b")

sns.set_color_codes("pastel")
sns.barplot(x="DataValue_x", y="LocationAbbr", data=df_alc_change,
            label="2010", color="b")

ax.legend(ncol=2, loc="upper right", frameon=True)
ax.set(xlim =(0,5),ylabel=" State ",
       xlabel="Gallons per Capita")
sns.despine(left=True, bottom=True)

## Kidney Failure

In [None]:
print(find_ques(['renal','kidney']))

In [None]:
df_kidney = df[df['Question']=='Prevalence of chronic kidney disease among adults aged >= 18 years']
# .copy() so the numeric conversion below does not write into a slice of df
df_kidney = df_kidney[(df_kidney['StratificationCategory1']=='Overall') & (df_kidney['DataValueType']=='Age-adjusted Prevalence')].copy()
df_kidney['DataValue']=pd.to_numeric(df_kidney['DataValue'])
df_kidney_2011= df_kidney[df_kidney['YearStart']==2011]
df_kidney_2014= df_kidney[df_kidney['YearStart']==2014]

In [None]:
df_kidney_change = merge_dfs_loc(df_kidney_2011,df_kidney_2014)

In [None]:
f, ax = plt.subplots(figsize=(8, 15))

sns.set_color_codes("muted")
sns.barplot(x="DataValue_y", y="LocationAbbr", data=df_kidney_change,
            label="2014", color="b")

sns.set_color_codes("pastel")
# DataValue_x holds the 2011 values (df_kidney_2011), so label it accordingly
sns.barplot(x="DataValue_x", y="LocationAbbr", data=df_kidney_change,
            label="2011", color="b")

ax.legend(ncol=2, loc="upper right", frameon=True)
ax.set(xlim =(0,5),ylabel=" State ",
       xlabel="% of population with renal failure")
sns.despine(left=True, bottom=True)

In [None]:
df_kidney_in= set(df_kidney_change.LocationAbbr[df_kidney_change['DataValue_y']>df_kidney_change['DataValue_x']])

x= df_kidney_in.intersection(df_alc_in)

In [None]:
# Rename the year columns before merging so each change is computed from the right source
# (with the default suffixes, the kidney and alcohol columns are easy to mix up)
d1 = df_kidney_change.rename(columns={'DataValue_x': 'kid_2011', 'DataValue_y': 'kid_2014'})
d2 = df_alc_change.rename(columns={'DataValue_x': 'alc_2010', 'DataValue_y': 'alc_2014'})
d = d1.merge(d2, on='LocationAbbr')
# keep only the states where both alcohol consumption and kidney disease prevalence increased
d = d[d['LocationAbbr'].isin(x)].copy()
d['inc_alc'] = d['alc_2014'] - d['alc_2010']
d['inc_kid'] = d['kid_2014'] - d['kid_2011']

In [None]:
fig = px.choropleth(
    d,
    locations="LocationAbbr",
    locationmode='USA-states',
    color= 'inc_alc',
    color_continuous_scale='Matter',  
    scope='usa',
    hover_name="LocationAbbr",
    labels = {'inc_alc':'Gallons/capita'},
    title="increase in per capita alcohol consumption in gallons"
)

fig.show()


fig = px.choropleth(
    d,
    locations="LocationAbbr",
    locationmode='USA-states',
    color= 'inc_kid',
    color_continuous_scale='Matter',  
    scope='usa',
    hover_name="LocationAbbr",
    labels = {'inc_kid':'Percentage points'},
    title="increase in renal failure as % of the population"
)

fig.show()

Surprisingly, higher alcohol consumption doesn't seem to have that direct an impact on renal failure.
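
To put a number on this, one option is to correlate the per-state change in alcohol consumption with the per-state change in kidney disease prevalence across all states, rather than only the ones where both went up. A minimal sketch with made-up numbers (the frame `changes` and its values are hypothetical; in the notebook the two columns would be derived from `df_alc_change` and `df_kidney_change`):

```python
import pandas as pd

# Hypothetical per-location changes; not taken from the dataset.
changes = pd.DataFrame({
    "LocationAbbr": ["AA", "BB", "CC", "DD"],
    "inc_alc": [0.10, -0.05, 0.20, 0.00],   # change in gallons per capita
    "inc_kid": [0.3, 0.1, -0.2, 0.4],       # change in prevalence, percentage points
})

# Pearson correlation between the two changes.
# Values near 0 suggest no linear relationship between them.
r = changes["inc_alc"].corr(changes["inc_kid"])
print(round(r, 3))
```

A coefficient close to zero on the real data would back up the observation above, while a clearly positive one would contradict it.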