# Introduction

![](https://www.dettol.co.in/media/3485/cholera-740x400.jpg?width=1140&height=641&center=0.5,0.5&mode=crop)

Cholera is an acute diarrhoeal infection caused by ingestion of food or water contaminated with the bacterium *Vibrio cholerae*. Cholera remains a global threat to public health and an indicator of inequity and lack of social development. Researchers have estimated that every year there are roughly 1.3 to 4.0 million cases and 21 000 to 143 000 deaths worldwide due to cholera.

# Importing the libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Reading the Cholera Dataset as input

In [None]:
file=pd.read_csv('/kaggle/input/cholera-dataset/data.csv')
file.head()

# Identifying the DataTypes

In [None]:
file.dtypes

# Identifying the Data gaps

In [None]:
file.isnull().sum()

# Data Cleaning

In [None]:
# Treat every missing value as 0
file.fillna(0,inplace=True)

# Data Clean Check

In [None]:
file.isnull().sum()

# Checking if "Unknown" value exists

In [None]:
file[(file['Number of reported deaths from cholera']=='Unknown') | (file['Number of reported cases of cholera']=='Unknown')]

# Correcting the "Unknown" Values

In [None]:
file.replace('Unknown',0,regex=True,inplace=True)

# Identifying the total number of countries present in the Dataset

In [None]:
country_list=file.Country.unique()
len(file.Country.unique())

# Upon Checking the data, we observed an issue with this particular entry

In [None]:
file[file['Number of reported cases of cholera']=="3 5"]

# This entry needs to be cleaned. Since the fatality rate is given as 0.0 and 0.0, we assume that all the values in this entry are 0 and assign them accordingly.

In [None]:
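# .str methods return NaN for non-string cells (such as the 0 fills made earlier), so cast to str first.
# regex=False matches the text literally, so '.' is not treated as a wildcard.
file['Number of reported cases of cholera'] = file['Number of reported cases of cholera'].astype(str).str.replace('3 5','0',regex=False)
file['Number of reported deaths from cholera'] = file['Number of reported deaths from cholera'].astype(str).str.replace('0 0','0',regex=False)
file['Cholera case fatality rate'] = file['Cholera case fatality rate'].astype(str).str.replace('0.0 0.0','0',regex=False)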
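
The cleaning steps above can also be expressed as one numeric coercion per column. The following is a minimal sketch on a small made-up frame, not the notebook's actual method: the rows and the `to_count` helper are illustrative assumptions. It follows the same decision taken above, namely that missing, "Unknown" and malformed entries all count as 0.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the problems found above (not real data)
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Number of reported cases of cholera': ['12', 'Unknown', '3 5', np.nan],
})

def to_count(series):
    # Hypothetical helper: anything that is not a plain number becomes NaN, then 0
    return pd.to_numeric(series, errors='coerce').fillna(0)

toy['cases'] = to_count(toy['Number of reported cases of cholera'])
print(toy[['Country', 'cases']])
```

Coercing to 0 hides how many entries were unusable, so it may be worth counting the NaN values before filling them.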

# Country wise visualization of Total Number of Reported Cases and Total Number of Deaths over the Years

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(30,90))
# enumerate from 1 so that every country, including the last one, gets a subplot slot
for num,c in enumerate(file.Country.unique(), start=1):
    file_req=file[file['Country']==c]
    ax = fig.add_subplot(27,6,num)
    x=file_req['Year']
    y1=pd.to_numeric(file_req['Number of reported cases of cholera'])
    ax.plot(x,y1)
    y2=pd.to_numeric(file_req['Number of reported deaths from cholera'])
    ax.plot(x,y2,color='r')
    ax.legend(['Reported Cases','Deaths'])
    ax.set_title(c)
    ax.set_xlabel('Years')
    ax.set_ylabel('Cases Count')

plt.tight_layout()
plt.show()

This chart contains too much information to interpret at once, so the following sections break it into smaller pieces.

# WHO Region wise visualization of Total Number of Reported Cases and Total Number of Deaths over the Years

In [None]:
fig = plt.figure(figsize=(30,10))
for c,num in zip(file['WHO Region'].unique(), np.arange(1,1+len(file['WHO Region'].unique()))):
    file_req=file[file['WHO Region']==c]
    ax = fig.add_subplot(2,3,num)
    y11=file_req.groupby('Year').apply(lambda x:np.sum(pd.to_numeric(x['Number of reported cases of cholera']))).reset_index(name='Counts')
    # groupby sorts by year, so take x from the same frame to keep x and y aligned
    x=y11['Year']
    y1=y11['Counts']
    # recent seaborn versions require keyword arguments here
    sns.regplot(x=x,y=y1,ax=ax,label='Reported Cases')
    y21=file_req.groupby('Year').apply(lambda x:np.sum(pd.to_numeric(x['Number of reported deaths from cholera']))).reset_index(name='Counts')
    y2=y21['Counts']
    sns.regplot(x=x,y=y2,ax=ax,label='Deaths')
    ax.legend()
    ax.set_title(c)
    ax.set_xlabel('Years')
    ax.set_ylabel('Cases Count')

plt.tight_layout()
plt.show()

**Observation** South-East Asia has the highest number of detected cholera cases and deaths when all-time data is considered. Over the last 10 years, however, the Americas and Africa have been the regions with the highest number of cases, although deaths have been well controlled compared with South-East Asia in the 1980s-2000s.

# To identify the number of Cases Registered in the World (All time data)

In [None]:
file_ctry=file.groupby(['Country']).apply(lambda x:np.sum(pd.to_numeric(x['Number of reported cases of cholera']))).reset_index(name='Count of Cholera Cases')
print('The Number of Cholera Cases for the countries:\n',file_ctry)
plt.figure(figsize=(30,10))
plt.scatter(file_ctry['Country'],file_ctry['Count of Cholera Cases'])
plt.xlabel('Countries',size=20)
plt.ylabel('Cholera Cases Count',size=20)
plt.title('Cholera Cases Distribution- All Time',size=40)
plt.xticks(rotation=90)
plt.show()

**Observation-** The chart above shows that India has the highest number of cholera cases in the world when all-time data is considered, followed by Haiti and Peru.

# Importing a shape file, that will be required for the plotting of the Global Heatmap

In [None]:
import geopandas as gpd
fp = "/kaggle/input/world-countries-shp-file/TM_WORLD_BORDERS-0.3.shp"
map_df = gpd.read_file(fp)

# The country names in the .shp file need to match the country names in the data file. Identifying the mismatched countries and replacing their names.

In [None]:
# Assign the result back: an in-place replace on a selected column may not update map_df
name_fixes = {
    'Burma':'Myanmar',
    'Korea, Republic of':'Republic of Korea',
    'Russia':'Russian Federation',
    'United Kingdom':'United Kingdom of Great Britain and Northern Ireland',
    'United States':'United States of America',
    'Venezuela':'Venezuela (Bolivarian Republic of)',
}
map_df['name'] = map_df['name'].replace(name_fixes,regex=True)
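
One way to find which names need replacing is to compare the two sets of names directly. The sketch below uses made-up lists standing in for `map_df['name']` and `file['Country']`; the variable names are illustrative assumptions, and the real lists will differ.

```python
import pandas as pd

# Made-up name lists standing in for map_df['name'] and file['Country']
shape_names = pd.Series(['Burma', 'India', 'Russia', 'Peru'])
data_names = pd.Series(['Myanmar', 'India', 'Russian Federation', 'Haiti'])

# Names in the data that have no identically spelled shape, and the reverse
unmatched_in_data = sorted(set(data_names) - set(shape_names))
unmatched_in_shapes = sorted(set(shape_names) - set(data_names))

print(unmatched_in_data)
print(unmatched_in_shapes)
```

Reading the two lists side by side makes renamed pairs such as Burma and Myanmar easy to spot. Names that appear in only one source, such as Haiti here, simply have no counterpart.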

# Creating a World Heat Map to display the total number of cases

In [None]:
merged = map_df.set_index('name').join(file_ctry.set_index('Country'))
variable = 'Count of Cholera Cases'
vmin, vmax = np.min(merged['Count of Cholera Cases']), np.max(merged['Count of Cholera Cases'])
fig, ax = plt.subplots(1, figsize=(20, 12))
merged.plot(column=variable, cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8')
ax.axis('off')
ax.set_title('Cholera Cases History: World Level', fontdict={'fontsize': '25', 'fontweight' : '3'})
ax.annotate('Source:Cholera Dataset',xy=(0.1, .08),  xycoords='figure fraction', horizontalalignment='left', verticalalignment='top', fontsize=12, color='#555555')
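sm = plt.cm.ScalarMappable(cmap='Blues', norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm.set_array([])
# pass ax explicitly: recent matplotlib versions cannot infer where to place the colorbar
cbar = fig.colorbar(sm, ax=ax)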

**Observation:** India has the highest number of total cases, followed by Haiti (shown as a small dot in the Caribbean) and then Peru; these are marked with the darkest shades of blue. Countries with no information are shown in white.

# To identify the span of last 10 years from the dataset

In [None]:
file_yrs=file.Year.unique()
file_yrs.sort()
final_10_yrs=file_yrs[-10:]
final_10_yrs

**Observation** The data runs up to 2016, so the last 10 years span 2007-2016.

# Countries with the Least number of Cases in the past 10 years

In [None]:
ten_yr_case=file[file['Year'].isin(final_10_yrs)]
file_yr_ctry=ten_yr_case.groupby(['Country']).apply(lambda x:np.sum(pd.to_numeric(x['Number of reported cases of cholera']))).reset_index(name='Count of Cholera Cases')
best_cases_country=file_yr_ctry.sort_values(by='Count of Cholera Cases',ascending=True)[0:10]
print('Countries having the least number of Cholera Cases in the last 10 years: \n',best_cases_country)

**Observation:** Europe, Brazil and parts of Asia have kept the number of cholera cases comparatively well under control.
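
The selection of recent years and the ranking of countries can be illustrated on a tiny made-up table. This is a sketch under assumptions: the `Cases` column is already numeric, a window of 2 years replaces the notebook's 10, and `nsmallest`/`nlargest` stand in for the sort-and-slice used above.

```python
import pandas as pd

# Made-up yearly counts (not real data)
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Year': [2014, 2016, 2015, 2016, 2013, 2016],
    'Cases': [5, 7, 100, 50, 0, 1],
})

# Keep only the most recent distinct years present in the data
last_years = sorted(toy['Year'].unique())[-2:]
recent = toy[toy['Year'].isin(last_years)]

# Total cases per country over that window, then rank both ways
totals = recent.groupby('Country')['Cases'].sum()
fewest = totals.nsmallest(2)
most = totals.nlargest(2)
print(fewest)
print(most)
```

Note that the window is built from years that actually occur in the data, so gaps in reporting would stretch it over a longer calendar span.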

# Plotting the Countries with the Least Number of Cases in the Last 10 Years

In [None]:
merged = map_df.set_index('name').join(best_cases_country.set_index('Country'))
variable = 'Count of Cholera Cases'
vmin, vmax = np.min(merged['Count of Cholera Cases']), np.max(merged['Count of Cholera Cases'])
fig, ax = plt.subplots(1, figsize=(20, 12))
merged.plot(column=variable, cmap='Greens', linewidth=0.8, ax=ax, edgecolor='0.8')
ax.axis('off')
ax.set_title('Countries with the Least Count of Cholera (Last 10 Years)', fontdict={'fontsize': '25', 'fontweight' : '3'})
ax.annotate('Source:Cholera Dataset',xy=(0.1, .08),  xycoords='figure fraction', horizontalalignment='left', verticalalignment='top', fontsize=12, color='#555555')
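sm = plt.cm.ScalarMappable(cmap='Greens', norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm.set_array([])
# pass ax explicitly: recent matplotlib versions cannot infer where to place the colorbar
cbar = fig.colorbar(sm, ax=ax)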

# List of Countries with the Maximum number of cases in the past 10 Years

In [None]:
worst_cases_country=file_yr_ctry.sort_values(by='Count of Cholera Cases',ascending=False)[0:10]
print('Countries having the most number of Cholera Cases in the last 10 years: \n',worst_cases_country)

**Observation:** Countries in Africa and the Americas (Haiti has the highest count) dominate the list of countries with the most cholera cases detected. We validate this below with the help of a heatmap.

# Plotting the Countries with the Maximum number of Cases in the last 10 Years

In [None]:
merged = map_df.set_index('name').join(worst_cases_country.set_index('Country'))
variable = 'Count of Cholera Cases'
vmin, vmax = np.min(merged['Count of Cholera Cases']), np.max(merged['Count of Cholera Cases'])
fig, ax = plt.subplots(1, figsize=(20, 12))
merged.plot(column=variable, cmap='Reds', linewidth=0.8, ax=ax, edgecolor='0.8')
ax.axis('off')
ax.set_title('Countries with the Highest Count of Cholera (Last 10 Years)', fontdict={'fontsize': '25', 'fontweight' : '3'})
ax.annotate('Source:Cholera Dataset',xy=(0.1, .08),  xycoords='figure fraction', horizontalalignment='left', verticalalignment='top', fontsize=12, color='#555555')
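sm = plt.cm.ScalarMappable(cmap='Reds', norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm.set_array([])
# pass ax explicitly: recent matplotlib versions cannot infer where to place the colorbar
cbar = fig.colorbar(sm, ax=ax)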

# Countries Level Analysis- Deaths from Cholera (All Time) 

In [None]:
file_ctry_deaths=file.groupby(['Country']).apply(lambda x:(pd.to_numeric(x['Number of reported deaths from cholera'])).sum()).reset_index(name='Count of Deaths from Cholera')
print('The Number of Cholera Deaths for the countries:\n',file_ctry_deaths)
plt.figure(figsize=(30,10))
plt.bar(file_ctry_deaths['Country'],file_ctry_deaths['Count of Deaths from Cholera'])
plt.xlabel('Countries',size=30)
plt.ylabel('Cholera Death Cases Count',size=30)
plt.title('Cholera Death Cases Distribution- All Time',size=40)
plt.xticks(rotation=90)
plt.show()
merged = map_df.set_index('name').join(file_ctry_deaths.set_index('Country'))
variable = 'Count of Deaths from Cholera'
vmin, vmax = np.min(merged['Count of Deaths from Cholera']), np.max(merged['Count of Deaths from Cholera'])
fig, ax = plt.subplots(1, figsize=(20, 12))
merged.plot(column=variable, cmap='PuBu', linewidth=0.8, ax=ax, edgecolor='0.8')
ax.axis('off')
ax.set_title('Countries with the Highest Count of Death from Cholera', fontdict={'fontsize': '25', 'fontweight' : '3'})
ax.annotate('Source:Cholera Dataset',xy=(0.1, .08),  xycoords='figure fraction', horizontalalignment='left', verticalalignment='top', fontsize=12, color='#555555')
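sm = plt.cm.ScalarMappable(cmap='PuBu', norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm.set_array([])
# pass ax explicitly: recent matplotlib versions cannot infer where to place the colorbar
cbar = fig.colorbar(sm, ax=ax)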

**Observation:** As per all-time data, India and Bangladesh account for the maximum number of deaths due to cholera. We have also seen that India ranks highest in overall number of cases. We examine the pattern for India in detail below.

# Creating a "WHO Region vs Year Heatmap" for the Total Counts of Cases- Analysis of the Last 10 years

In [None]:
file_yr_region=ten_yr_case.groupby(['WHO Region','Year']).apply(lambda x:np.sum(pd.to_numeric(x['Number of reported cases of cholera']))).reset_index(name='Count of Cholera Cases')
heatmap1_data = pd.pivot_table(file_yr_region, values='Count of Cholera Cases', 
                     index=['WHO Region'],columns='Year')
plt.figure(figsize=(10,10))
sns.heatmap(heatmap1_data,annot=True,fmt='.0f', cmap='YlOrBr').set_title('WHO Region vs Year Heatmap of Cholera Cases- in the last 10 years',size=15)
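
The region-by-year grid that feeds the heatmap can be checked on a small made-up table. This sketch assumes a numeric `Cases` column and sums inside `pivot_table` in one step, whereas the cell above groups first and then pivots; the values are invented.

```python
import pandas as pd

# Made-up rows: several countries can report for the same region and year
toy = pd.DataFrame({
    'WHO Region': ['Africa', 'Africa', 'Americas', 'Americas', 'Africa'],
    'Year': [2015, 2016, 2015, 2016, 2016],
    'Cases': [10, 20, 5, 8, 2],
})

# One row per region, one column per year, cells hold the summed cases
grid = toy.pivot_table(values='Cases', index='WHO Region', columns='Year', aggfunc='sum')
print(grid)
```

Passing `aggfunc='sum'` matters here: the default aggregation of `pivot_table` is the mean, which would understate totals when a region has several rows for one year.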

# Plotting the median fatality rate of each nation over the years

In [None]:
fat_ctry=file.groupby('Country').apply(lambda x:pd.to_numeric(x['Cholera case fatality rate']).median()).reset_index(name='Median Fatality')
fat_ctry.replace(np.nan,0,inplace=True)
plt.figure(figsize=(30,5))
plt.plot(fat_ctry['Country'],fat_ctry['Median Fatality'],color='r')
plt.xlabel('Country',size=25)
plt.ylabel('Fatality Rate (Median)',size=20)
plt.title('Median Fatality Rate- Nation wise',size=40)
plt.xticks(rotation=90)
merged = map_df.set_index('name').join(fat_ctry.set_index('Country'))
variable = 'Median Fatality'
vmin, vmax = np.min(merged['Median Fatality']), np.max(merged['Median Fatality'])
fig, ax = plt.subplots(1, figsize=(20, 12))
merged.plot(column=variable, cmap='OrRd', linewidth=0.8, ax=ax, edgecolor='0.8')
ax.axis('off')
ax.set_title('Countries with the Highest Fatality Rates from Cholera', fontdict={'fontsize': '25', 'fontweight' : '3'})
ax.annotate('Source:Cholera Dataset',xy=(0.1, .08),  xycoords='figure fraction', horizontalalignment='left', verticalalignment='top', fontsize=12, color='#555555')
# use the same colormap as the map above so that the colorbar matches it
sm = plt.cm.ScalarMappable(cmap='OrRd', norm=plt.Normalize(vmin=vmin, vmax=vmax))
sm.set_array([])
cbar = fig.colorbar(sm, ax=ax)

**Observation** African nations clearly show higher fatality rates, followed by the northern part of South America and southern Asia. Many of these nations are developing or underdeveloped, with densely populated areas. A plausible explanation is that water bodies there are more polluted than in developed nations, which raises the chance of cholera spreading and of deaths.
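
For reference, a case fatality rate is conventionally the number of deaths divided by the number of cases, expressed as a percentage. The sketch below computes it on made-up numbers; it assumes, without confirmation from the dataset, that the `Cholera case fatality rate` column follows this definition, and the column names here are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up counts (not real data)
toy = pd.DataFrame({'cases': [200, 50, 0], 'deaths': [4, 5, 0]})

# Replace zero cases with NaN first so that the division yields NaN instead of an error or inf
toy['cfr_percent'] = 100 * toy['deaths'] / toy['cases'].replace(0, np.nan)
print(toy)
```

Leaving rows without cases as NaN keeps them out of a median, whereas filling them with 0 would pull the median down.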

# We will now summarize the overall data for selected nations: those with the highest all-time counts, and the one with the highest count in the past 10 years

# We define a function that returns the counts analysis for a particular country

In [None]:
def country_summary(country):
    # Plot yearly cases and deaths for one country and report the peak years
    country_file=file[file['Country']==country]
    cases=pd.to_numeric(country_file['Number of reported cases of cholera'])
    deaths=pd.to_numeric(country_file['Number of reported deaths from cholera'])
    # Sum per year so that each year yields exactly one row
    country_case_counts=cases.groupby(country_file['Year']).sum().reset_index(name='Total Cases of Cholera')
    country_death_counts=deaths.groupby(country_file['Year']).sum().reset_index(name='Total Deaths from Cholera')
    plt.plot(country_case_counts['Year'],country_case_counts['Total Cases of Cholera'])
    plt.plot(country_death_counts['Year'],country_death_counts['Total Deaths from Cholera'],color='r')
    plt.xlabel('Year')
    plt.ylabel('Total count numbers')
    plt.title('Analysis of Cholera Cases vs Death: {}'.format(country))
    plt.legend(['Total Cases of Cholera','Total Deaths from Cholera'])
    # idxmax gives the row label of the peak; use it to look up the year and the value
    peak_cases=country_case_counts.loc[country_case_counts['Total Cases of Cholera'].idxmax()]
    peak_deaths=country_death_counts.loc[country_death_counts['Total Deaths from Cholera'].idxmax()]
    print('The Year {:.0f} has seen the maximum number of Cholera Cases: {:.0f}'.format(peak_cases['Year'],peak_cases['Total Cases of Cholera']))
    print('The Year {:.0f} has seen the maximum number of deaths due to Cholera: {:.0f}'.format(peak_deaths['Year'],peak_deaths['Total Deaths from Cholera']))
    plt.show()
    # Return the yearly table so that the caller's assignment holds something useful
    return country_case_counts.merge(country_death_counts,on='Year')

# 1. The Country with the highest number of Cases and Deaths of all time- **India**

In [None]:
India_Data=country_summary('India')

**Observation**- India had an extremely tough phase handling cholera in the 1950s and a slightly more controlled phase in 1960-70, and has been able to control the disease in recent years. Both the number of detected cases and the number of deaths have been low in recent times.

# 2. Country with the second highest number of Cases and Deaths of all time data- Bangladesh

In [None]:
Bangladesh_Data=country_summary('Bangladesh')

**Observation** The data for Bangladesh shows a high cholera risk in the 1950-70s. Although cases and deaths decreased significantly over time, we are unsure about the last 10 years because data is unavailable.

# 3. The Country with the highest number of Cases in the Last 10 Years- **Haiti**

In [None]:
Haiti_Data=country_summary('Haiti')

**Observation** The data available for Haiti covers only 2010-2016, so a comparative analysis with a country like India was not feasible. The pattern here is different: cases rose between 2010 and 2011 and then dropped over the next 3 years. There has been a slight increase since 2014, and the trend is upward. With so little data, no further forecast can be made. The death count, however, has stayed well controlled throughout.

# Conclusion

We have seen how the patterns of cholera cases have changed over the years. Some observations we could infer are:
1. Developed nations have very low counts of cholera cases and associated deaths.
2. The highest number of cases and the highest number of deaths are both recorded for India. This is plausible given the high population, sanitation problems and disturbed political scenario of India during the 1950-70 period. Serious steps have since been taken, and conditions have improved.
3. Haiti is a small country in the Caribbean. It does not have a strong economy and, owing to environmental drawbacks, the counts there are still high. Yet the death rate has been kept under control, which is a notable achievement.

From the data standpoint, we would like to highlight some observations:
1. The data is not rich. A very detailed analysis for every country was not possible, since sufficient data was not available for all countries.
2. The data was not very clean, so we had to apply a few cleaning techniques at the beginning of our analysis.
