The health searches dataset contains statistics on Google searches made in the US, covering the most prominent medical conditions in the country.
To start our analysis, let's read the data into a pandas DataFrame and look at the first 3 rows to understand the columns and data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

healthSearchData=pd.read_csv("../input/RegionalInterestByConditionOverTime.csv")
healthSearchData.head(3)

For our study, we do not need the "geoCode" column, so let's drop it. We already have the city name in a separate column, and I would like to keep the data simple.

In [None]:
healthSearchData = healthSearchData.drop(['geoCode'],axis=1)

The dataset covers 9 medical conditions, with search data from 2004 to 2017. It's refreshing to see data spanning more than 10 years. Now let's plot the year-wise change in searches for each condition.

In [None]:
# Columns are named "<year>+<condition>" and cover 2004-2017 for:
# cancer cardiovascular stroke depression rehab vaccine diarrhea obesity diabetes
yearWiseMeam = {}
for col in healthSearchData.columns:
    if '+' in col:
        year = col.split('+')[0]
        disease = col.split('+')[-1]
        if disease not in yearWiseMeam:
            yearWiseMeam[disease] = {}
        if year not in yearWiseMeam[disease]:
            # Mean search interest across all regions for this year and condition
            yearWiseMeam[disease][year] = healthSearchData[col].mean()

plt.figure(figsize=(18, 6))
ax = plt.subplot(111)
plt.title("Year wise Google medical search", fontsize=20)

# One tick per year, labelled with the years found in the column names
years = list(yearWiseMeam['cancer'].keys())
ax.set_xticks(range(len(years)))
ax.set_xticklabels(years)
for disease in yearWiseMeam:
    # Convert the dict view to a list before plotting and label each line for the legend
    plt.plot(list(yearWiseMeam[disease].values()), label=disease)
plt.legend(loc='best')
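
The same year-wise means can also be computed without explicit loops. The sketch below is one possible approach, run on a tiny made-up frame (toy, with hypothetical cities and values) that mimics the "<year>+<condition>" column layout. The names toy, col_means and year_by_condition are illustrative and not part of the original notebook.

```python
import pandas as pd

# Tiny made-up frame that mimics the "<year>+<condition>" column layout
toy = pd.DataFrame({
    "dma": ["City A", "City B"],
    "2004+cancer": [40, 60],
    "2004+diabetes": [10, 30],
    "2005+cancer": [50, 70],
    "2005+diabetes": [20, 40],
})

# Mean of every "<year>+<condition>" column across regions
value_cols = [c for c in toy.columns if "+" in c]
col_means = toy[value_cols].mean()

# Split the column names into (year, condition) pairs and pivot
# into a table with one row per year and one column per condition
col_means.index = pd.MultiIndex.from_tuples(
    [tuple(c.split("+", 1)) for c in col_means.index],
    names=["year", "condition"],
)
year_by_condition = col_means.unstack("condition")
print(year_by_condition)
```

With the data in this shape, calling year_by_condition.plot() should draw one labelled line per condition, so the tick and legend bookkeeping is handled by pandas.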


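The line plot has many uneven jumps. Let's damp them and see how the search trend looks. The cell below pulls every yearly value toward the mean of its series by a factor myLambda, which shrinks the jumps without changing their relative shape. This is purely a visual aid and need not be performed every time.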

In [None]:
plt.figure(figsize=(18, 6))
ax = plt.subplot(111)
plt.title("Year wise Google medical search [smoothed]", fontsize=20)

years = list(yearWiseMeam['cancer'].keys())
ax.set_xticks(range(len(years)))
ax.set_xticklabels(years)
myLambda = 0.7
for disease in yearWiseMeam:
    tempList = list(yearWiseMeam[disease].values())
    localMean = np.mean(tempList)
    # Pull each value toward the series mean by a factor of myLambda
    smoothList = [x + myLambda * (localMean - x) for x in tempList]
    plt.plot(smoothList, label=disease)
plt.legend(loc='best')
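
Pulling values toward the mean only rescales each series: every deviation from the mean is multiplied by (1 - myLambda), so the jumps shrink but keep their relative shape. A moving average is a common alternative that averages each point with its neighbours. The sketch below compares the two on made-up numbers; the series values, the 3-year window and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up yearly values with visible jumps (not taken from the dataset)
series = pd.Series(
    [10.0, 30.0, 20.0, 40.0, 30.0],
    index=["2004", "2005", "2006", "2007", "2008"],
)

# Shrinkage toward the mean, as in the cell above: rescales deviations only
my_lambda = 0.7
shrunk = series + my_lambda * (series.mean() - series)

# Centered 3-year moving average: each point is averaged with its neighbours
# (min_periods=1 lets the two end points use the values that are available)
rolling = series.rolling(window=3, center=True, min_periods=1).mean()

print(pd.DataFrame({"raw": series, "shrunk": shrunk, "rolling": rolling}))
```

The window size trades detail for smoothness; with only 14 yearly points, a small window such as 3 is probably the most that is sensible.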

We see that cancer is the most searched illness overall, whereas cardiovascular searches are the least frequent. Surprisingly, in 2017 diabetes is the most searched illness, overtaking cancer. I believe people are becoming more aware of their health, and many of these searches may be preemptive, made to avoid future illness.

Continuing our analysis, let's plot a heat map to visualise the pattern across geographical locations.

In [None]:
statesData = pd.DataFrame(healthSearchData.iloc[:,0])
healthSearchData = healthSearchData.drop(['dma'],axis=1)

meanDict = {}
yearList = []
illnessList = []
for col in healthSearchData.columns:
    if '+' in col:
        year, illness = col.split('+')[0], col.split('+')[-1]
        # Record each year and illness only once; duplicates would append
        # several means per row and misalign the join with statesData below
        if year not in yearList:
            yearList.append(year)
        if illness not in illnessList:
            illnessList.append(illness)

for index, row in healthSearchData.iterrows():
    for illness in illnessList:
        # Mean search interest for this region and illness across all years
        searchCountList = [row[year + '+' + illness] for year in yearList]
        meanDict.setdefault(illness, []).append(np.mean(searchCountList))
yearWiseMeanDf = pd.DataFrame.from_dict(meanDict, orient='columns')
heatMapData = statesData.join(yearWiseMeanDf)
heatMapData.set_index('dma', inplace=True, drop=True)

import seaborn as sns
plt.figure(figsize=(10, 25))
ax = plt.subplot(111)
# Rows are the 'dma' regions (cities), not states
ax.set_title("City wise illness search", fontsize=16)
for side in ("top", "bottom", "right", "left"):
    ax.spines[side].set_visible(False)
ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()
sns.heatmap(heatMapData, ax=ax)
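
The per-region means that feed the heat map can also be computed without iterating over rows. The sketch below shows one possible approach on a tiny made-up frame (hypothetical cities and values) with the same "<year>+<condition>" column layout; toy and region_means are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Tiny made-up frame indexed by region, mimicking the dataset layout
toy = pd.DataFrame({
    "dma": ["City A", "City B"],
    "2004+cancer": [40, 60],
    "2004+diabetes": [10, 30],
    "2005+cancer": [50, 70],
    "2005+diabetes": [20, 40],
}).set_index("dma")

# Transpose so the "<year>+<condition>" names become row labels, group the
# rows by their condition part, average over the years, then transpose back
region_means = toy.T.groupby(lambda name: name.split("+")[-1]).mean().T
print(region_means)
```

The result has one row per region and one column per condition, which is the shape sns.heatmap expects.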

Here, we see the mean value of each disease across the years plotted against major cities. Next, I would like to relate this to actual statistics on the people affected. If we can get our hands on the actual spread of illness, it may be possible to find a promising correlation with the ground truth.
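
As a rough idea of what that comparison could look like, the sketch below correlates per-city search interest with a per-city prevalence figure. All numbers are made up for illustration, and the prevalence series is hypothetical: no such table is part of this dataset, and a real one would have to be sourced and matched to the city names separately.

```python
import pandas as pd

# Made-up numbers for illustration only; neither series comes from a real source
search_interest = pd.Series(
    {"City A": 45.0, "City B": 65.0, "City C": 55.0, "City D": 80.0},
    name="search_interest",
)
prevalence = pd.Series(
    {"City A": 8.1, "City B": 10.4, "City C": 9.0, "City D": 11.9},
    name="prevalence",
)

# Keep only the cities present in both series
combined = pd.concat([search_interest, prevalence], axis=1, join="inner")

# Pearson measures linear association between the two columns
pearson = combined["search_interest"].corr(combined["prevalence"])

# Spearman computed as the Pearson correlation of the ranks
spearman = combined["search_interest"].rank().corr(combined["prevalence"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

A high correlation here would only show that search interest and prevalence move together across cities; it would not by itself say why.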