# 823hw3

# Creating effective visualizations using best practices:
### Create 3 informative visualizations about malaria using Python in a Jupyter notebook, starting with the data sets at https://github.com/rfordatascience/tidytuesday/tree/master/data/2018/2018-11-13. 
### Where appropriate, make the visualizations interactive.

### There are three original datasets: malaria_deaths.csv, malaria_deaths_age.csv, and malaria_inc.csv.
### malaria_deaths.csv contains Entity, Code, Year, and Deaths (age-standardized rate per 100,000 people).
### malaria_deaths_age.csv contains entity, code, year, age_group, and deaths (total count).
### malaria_inc.csv contains Entity, Code, Year, and Incidence of malaria (per 1,000 population at risk).

### After reading in and checking all three datasets, I was particularly interested in how incidence has changed over the years in the most populous countries.
### I looked up a list of countries by population and chose the 10 most populous countries that appear in malaria_inc.csv.
### I store the rows for these 10 countries in popn_top10 using the pandas concat function. I then rename the column with the very long name to the shorter, more descriptive Incidence.

In [91]:
import pandas as pd
import numpy as np
import plotly.express as px

In [92]:
malaria_deaths = pd.read_csv("malaria_deaths.csv")
malaria_deaths_age = pd.read_csv("malaria_deaths_age.csv", index_col=0)
malaria_inc = pd.read_csv("malaria_inc.csv")
malaria_deaths_age.rename(columns={'entity':'Entity','year':'Year'}, inplace=True)

In [93]:
popn_top10 =(pd.concat([malaria_inc[malaria_inc.Entity == "China"], 
                       malaria_inc[malaria_inc.Entity == "India"],
                       malaria_inc[malaria_inc.Entity == "Brazil"],
                       malaria_inc[malaria_inc.Entity == "Indonesia"],
                       malaria_inc[malaria_inc.Entity == "Pakistan"],
                       malaria_inc[malaria_inc.Entity == "Nigeria"],
                       malaria_inc[malaria_inc.Entity == "Bangladesh"],
                       malaria_inc[malaria_inc.Entity == "Ethiopia"],
                       malaria_inc[malaria_inc.Entity == "Mexico"],
                       malaria_inc[malaria_inc.Entity == "Philippines"],]))
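
The ten separate filters above can be collapsed into one `isin` mask. The following is a minimal sketch, assuming a small stand-in frame (`demo_inc`, a hypothetical name with made-up incidence values) in place of `malaria_inc`. One difference to be aware of: `isin` keeps the rows in file order rather than list order, which would also change the order in which plotly assigns colors, so the sketch restores the list order with an ordered categorical.

```python
import pandas as pd

# Stand-in for malaria_inc; the incidence values are illustrative only.
demo_inc = pd.DataFrame({
    "Entity": ["Brazil", "Brazil", "China", "China", "France", "India", "India"],
    "Year": [2000, 2005, 2000, 2005, 2000, 2000, 2005],
    "Incidence": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

countries = ["China", "India", "Brazil"]  # desired order of the countries

# One boolean mask instead of one filter per country.
subset = demo_inc[demo_inc["Entity"].isin(countries)].copy()

# Restore the list order (China, India, Brazil) with an ordered categorical.
subset["Entity"] = pd.Categorical(subset["Entity"], categories=countries, ordered=True)
subset = subset.sort_values(["Entity", "Year"])
subset["Entity"] = subset["Entity"].astype(str)

print(subset)
```

The same pattern should apply to `malaria_inc` with the full list of ten country names.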

#### Check how this dataframe looks

In [94]:
popn_top10.rename(columns={'Incidence of malaria (per 1,000 population at risk) (per 1,000 population at risk)':'Incidence'}, inplace=True)
popn_top10.sample(10)

Unnamed: 0,Entity,Code,Year,Incidence
332,Nigeria,NGA,2000,497.8
79,China,CHN,2015,0.000159
340,Pakistan,PAK,2000,44.8
223,Indonesia,IDN,2015,26.1
22,Bangladesh,BGD,2010,8.6
343,Pakistan,PAK,2015,8.6
146,Ethiopia,ETH,2010,106.2
45,Brazil,BRA,2005,38.7
147,Ethiopia,ETH,2015,58.6
21,Bangladesh,BGD,2005,12.6
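
A random sample shows only a few rows at a time. Pivoting to one row per year and one column per country gives a fuller view before plotting. The sketch below uses a hypothetical frame, `sample_rows`, built from six of the rows sampled above; the years not sampled for each country appear as NaN here, whereas the full `popn_top10` frame would supply them.

```python
import pandas as pd

# Six rows copied from the sample output above; the other years were not sampled.
sample_rows = pd.DataFrame({
    "Entity": ["Pakistan", "Pakistan", "Ethiopia", "Ethiopia", "Bangladesh", "Bangladesh"],
    "Year": [2000, 2015, 2010, 2015, 2005, 2010],
    "Incidence": [44.8, 8.6, 106.2, 58.6, 12.6, 8.6],
})

# One row per year, one column per country.
wide = sample_rows.pivot(index="Year", columns="Entity", values="Incidence")
print(wide)
```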


### After checking the dataframe, I decided to show an interactive bar plot of how the incidence of malaria changes in these countries over the years.
### The data are reported at 5-year intervals, so the plot covers 2000, 2005, 2010, and 2015.
### Feel free to zoom in to the plot and play with it.

In [95]:
fig1 = px.bar(popn_top10, x='Year', y='Incidence', color='Entity', 
             color_discrete_sequence=['#2ca02c','#d62728','#9467bd','#8c564b','#e377c2','#1f77b4','#ff7f0e','#7f7f7f','#bcbd22','#17becf'],
             title='Incidence interactive plot of Malaria in top 10 most populated countries')
fig1.show()

### After seeing the interactive bar plot, I think a scatter plot gives a more complete view of how incidence trends over the years.
### So I drew a scatter plot of the same data and added a linear (OLS) trend line for each country.
### Feel free to zoom in to the plot and play with it.

In [96]:
fig2 = px.scatter(popn_top10, x='Year', y='Incidence', color='Entity',
             color_discrete_sequence=['#2ca02c','#d62728','#9467bd','#8c564b','#e377c2','#1f77b4','#ff7f0e','#7f7f7f','#bcbd22','#17becf'],
             trendline="ols",
             title='Incidence interactive plot of Malaria in top 10 most populated countries')
fig2.show()

### We have now seen how malaria incidence changes in the most populous countries over the years, but what about continents and other regions?
### The dataset also reports incidence for continents and regional groupings, which are informative as well.
### I picked several representative regions and special groupings and drew another scatter plot with trend lines to see how malaria incidence changes in them.

In [97]:
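# Entity labels must match the CSV exactly; if a region is missing from the result, check malaria_inc.Entity.unique().
popn_cont_and_special =(pd.concat([malaria_inc[malaria_inc.Entity == "Heavily indebted poor countries"],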
                       malaria_inc[malaria_inc.Entity == "East Asia & Pacific"],
                       malaria_inc[malaria_inc.Entity == "Latin America & Caribbean"],
                       malaria_inc[malaria_inc.Entity == "Sub-Saharan Africa"],
                       malaria_inc[malaria_inc.Entity == "Fragile and conflict affected situations"]])
           )
popn_cont_and_special.rename(columns={'Incidence of malaria (per 1,000 population at risk) (per 1,000 population at risk)':'Incidence'}, inplace=True)
popn_cont_and_special.sample(10)


Unnamed: 0,Entity,Code,Year,Incidence
250,Latin America & Caribbean,,2010,14.409069
116,East Asia & Pacific,,2000,22.736116
118,East Asia & Pacific,,2010,20.134525
148,Fragile and conflict affected situations,,2000,319.032295
249,Latin America & Caribbean,,2005,25.144594
150,Fragile and conflict affected situations,,2010,247.498298
149,Fragile and conflict affected situations,,2005,304.801393
251,Latin America & Caribbean,,2015,10.026764
423,Sub-Saharan Africa,,2015,234.292105
420,Sub-Saharan Africa,,2000,422.510847


### Feel free to zoom in to the plot and play with it.

In [98]:
fig3 = px.scatter(popn_cont_and_special, x='Year', y='Incidence', color='Entity',
             color_discrete_sequence=['#1f77b4','#ff7f0e','#7f7f7f','#bcbd22','#17becf'],
             trendline="ols",
             title='Incidence interactive plot of Malaria in different continents and special regions')
fig3.show()
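
To get a feel for what the `trendline="ols"` option draws, the slope of one trend line can be approximated by hand. This is a rough sketch, not plotly's own code path (plotly delegates the fit to statsmodels). It uses the three "Fragile and conflict affected situations" values that appear in the sample output above; the 2015 value was not sampled, so the slope is only indicative.

```python
import numpy as np

# Values from the sample output above (the 2015 row was not sampled).
years = np.array([2000, 2005, 2010], dtype=float)
incidence = np.array([319.032295, 304.801393, 247.498298])

# Centre the years so the least-squares fit is well conditioned.
x = years - years.mean()
slope, intercept = np.polyfit(x, incidence, deg=1)

print(f"change in incidence per year: {slope:.2f}")
```

On these three points the fitted slope is about -7.15 cases per 1,000 population at risk per year.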

### We have looked at how malaria incidence changes in the most populous countries, continents, and special regions.
### Now let's see which country, continent, or region has the most deaths over the years.
### The left merge leaves some missing (NaN) values, so I fill them forward and then backward before sorting.

In [101]:
df_merged = pd.merge(malaria_deaths, malaria_inc, how='left',
        left_on=['Entity', 'Year'], right_on=['Entity', 'Year'])
df_merged.rename(columns={'Deaths - Malaria - Sex: Both - Age: Age-standardized (Rate) (per 100,000 people)':'Deaths',
                         'Incidence of malaria (per 1,000 population at risk) (per 1,000 population at risk)':'Incidence'}, inplace=True)
df_merged = df_merged.ffill().bfill()
df_merged_all = pd.merge(df_merged, malaria_deaths_age, how='left',left_on=['Entity', 'Year'], right_on=['Entity', 'Year'])
df_merged_all.sort_values(by=['deaths']).tail(20)


Unnamed: 0,Entity,Code_x,Year,Deaths,Code_y,Incidence,code,age_group,deaths
25840,Sub-Saharan Africa,LKA,2001,95.387495,LKA,422.510847,,Under 5,677487.264137
30280,World,OWID_WRL,1998,13.962049,VNM,0.3,OWID_WRL,Under 5,681065.266685
25875,Sub-Saharan Africa,LKA,2008,81.83778,LKA,354.424146,,Under 5,685669.900363
30335,World,OWID_WRL,2009,13.383327,OWID_WRL,141.256696,OWID_WRL,Under 5,688065.379789
25845,Sub-Saharan Africa,LKA,2002,95.496177,LKA,422.510847,,Under 5,689351.595969
25870,Sub-Saharan Africa,LKA,2007,85.007071,LKA,354.424146,,Under 5,689897.81996
25860,Sub-Saharan Africa,LKA,2005,90.986663,LKA,354.424146,,Under 5,694321.203817
25865,Sub-Saharan Africa,LKA,2006,87.69955,LKA,354.424146,,Under 5,695767.937767
30285,World,OWID_WRL,1999,14.271505,VNM,0.3,OWID_WRL,Under 5,698846.092898
25855,Sub-Saharan Africa,LKA,2004,93.839664,LKA,422.510847,,Under 5,702419.646809
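
In the output above, the Sub-Saharan Africa rows carry the code LKA and two World rows carry VNM in Code_y. This suggests that the whole-frame `ffill().bfill()` copied values across entity boundaries. One possible alternative is to fill within each entity. The sketch below uses a toy frame (`toy`, a hypothetical name with illustrative values) to show the difference; it is not a rerun of the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values; entity B has no value in its first row.
toy = pd.DataFrame({
    "Entity": ["A", "A", "B", "B"],
    "Year": [2000, 2001, 2000, 2001],
    "Incidence": [10.0, np.nan, np.nan, 40.0],
})

# Whole-frame fill: B's 2000 value is carried over from entity A.
leaky = toy.ffill().bfill()

# Per-entity fill: each entity is filled only from its own rows.
grouped = toy.copy()
grouped["Incidence"] = (
    toy.groupby("Entity")["Incidence"].transform(lambda s: s.ffill().bfill())
)

print(leaky)
print(grouped)
```

With the per-entity fill, an entity that has no observations at all keeps its NaN values instead of inheriting a neighbor's.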


### After merging all the datasets and sorting by deaths:
### Apart from the World aggregate, Sub-Saharan Africa has the most malaria deaths over the years.
### Now let's see how many people in each age group died in Sub-Saharan Africa from 1990 to 2015.
### An area plot shows how the deaths are distributed across age groups in Sub-Saharan Africa over that period.

In [100]:
# Keep only the Sub-Saharan Africa rows (a plain boolean filter; no concat needed).
df_merged_all_ssa = df_merged_all[df_merged_all.Entity == "Sub-Saharan Africa"]
fig_ssa = px.area(df_merged_all_ssa, x="Year", y="deaths", color="age_group",
                  labels={
                      "age_group": "Different age groups",
                      "Year": "Sub-Saharan Africa (1990 to 2015)",
                      "deaths": "Total Deaths Count Due To Malaria"
                  },
                  title="Malaria death counts in Sub-Saharan Africa over time, by age group")
fig_ssa