## Homework 3 - Creating effective visualizations using best practices

#### Ruiqi Zhang

#### Blog:  
https://ruiqi-zhang063.github.io  

#### Github repository
https://github.com/ruiqi-zhang063/ruiqi-zhang063.github.io

### Install Package

In [1]:
# ! pip install altair

### Import Packages

In [2]:
import pandas as pd
import altair as alt
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

### Load Data

In [3]:
url_1= 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-13/malaria_deaths.csv'
malaria_deaths = pd.read_csv(url_1)
url_2= 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-13/malaria_deaths_age.csv'
malaria_deaths_age = pd.read_csv(url_2)
url_3= 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-13/malaria_inc.csv'
malaria_inc = pd.read_csv(url_3)

### Look at the first 5 rows of the data

In [4]:
malaria_deaths.head()

| Unnamed: 0 | Entity | Code | Year | Deaths - Malaria - Sex: Both - Age: Age-standardized (Rate) (per 100,000 people) |
|---|---|---|---|---|
| 0 | Afghanistan | AFG | 1990 | 6.80293 |
| 1 | Afghanistan | AFG | 1991 | 6.973494 |
| 2 | Afghanistan | AFG | 1992 | 6.989882 |
| 3 | Afghanistan | AFG | 1993 | 7.088983 |
| 4 | Afghanistan | AFG | 1994 | 7.392472 |


In [5]:
malaria_inc.head()

| Unnamed: 0 | Entity | Code | Year | Incidence of malaria (per 1,000 population at risk) (per 1,000 population at risk) |
|---|---|---|---|---|
| 0 | Afghanistan | AFG | 2000 | 107.1 |
| 1 | Afghanistan | AFG | 2005 | 46.5 |
| 2 | Afghanistan | AFG | 2010 | 23.9 |
| 3 | Afghanistan | AFG | 2015 | 23.6 |
| 4 | Algeria | DZA | 2000 | 0.037746 |


In [6]:
malaria_deaths_age.head()

| Unnamed: 0.1 | Unnamed: 0 | entity | code | year | age_group | deaths |
|---|---|---|---|---|---|---|
| 0 | 1 | Afghanistan | AFG | 1990 | Under 5 | 184.606435 |
| 1 | 2 | Afghanistan | AFG | 1991 | Under 5 | 191.658193 |
| 2 | 3 | Afghanistan | AFG | 1992 | Under 5 | 197.140197 |
| 3 | 4 | Afghanistan | AFG | 1993 | Under 5 | 207.357753 |
| 4 | 5 | Afghanistan | AFG | 1994 | Under 5 | 226.209363 |


### Rename columns

In [7]:
malaria_deaths = malaria_deaths.rename(columns = {
    'Deaths - Malaria - Sex: Both - Age: Age-standardized (Rate) (per 100,000 people)':'Deaths (per 100,000 people)'
})
malaria_inc = malaria_inc.rename(columns = {
    'Incidence of malaria (per 1,000 population at risk) (per 1,000 population at risk)':'Incidence (per 1,000 population at risk)'
})
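
`DataFrame.rename` ignores mapping keys that do not match any column, so a typo in one of the long names above would leave the column unchanged without an error. The sketch below shows one way to guard against that. `rename_checked` is a hypothetical helper, not part of the notebook, and `sample` is a two-row stand-in built from the `head()` output above.

```python
import pandas as pd

LONG_NAME = ('Deaths - Malaria - Sex: Both - Age: Age-standardized (Rate) '
             '(per 100,000 people)')
SHORT_NAME = 'Deaths (per 100,000 people)'

# Two-row stand-in for malaria_deaths (values taken from the head() output)
sample = pd.DataFrame({
    'Entity': ['Afghanistan', 'Afghanistan'],
    'Code': ['AFG', 'AFG'],
    'Year': [1990, 1991],
    LONG_NAME: [6.80293, 6.973494],
})

def rename_checked(df, mapping):
    """Rename columns, raising KeyError if a source column is missing.

    Hypothetical helper for illustration only.
    """
    missing = [old for old in mapping if old not in df.columns]
    if missing:
        raise KeyError(f'Columns not found: {missing}')
    return df.rename(columns=mapping)

renamed = rename_checked(sample, {LONG_NAME: SHORT_NAME})
print(list(renamed.columns))
```

Recent pandas versions also accept `errors='raise'` in `rename`, which should give a similar check without a helper.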

### The first visualization

The first visualization compares malaria death rates or malaria incidence across countries. The first parameter is a list of country names to compare. The second parameter is a string: 'Deaths' compares malaria death rates, and 'Incidence' compares malaria incidence.

The visualization has interactive behavior tied to the x position of the cursor. When the cursor moves over the chart, the values for the year nearest to the cursor are displayed.
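
Because the country list is matched with `isin`, a misspelled name is dropped silently as long as at least one other name matches. The sketch below shows one way to spot such names before plotting. `split_known_countries` is a hypothetical helper, `sample` is a small stand-in built from the `malaria_inc.head()` output, and 'Atlantis' is a deliberately invalid name.

```python
import pandas as pd

# Stand-in for malaria_inc (rows taken from the head() output above)
sample = pd.DataFrame({
    'Entity': ['Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Algeria'],
    'Year': [2000, 2005, 2010, 2015, 2000],
    'Incidence (per 1,000 population at risk)': [107.1, 46.5, 23.9, 23.6, 0.037746],
})

def split_known_countries(df, country_list):
    """Split country_list into names present in df['Entity'] and names that are not.

    Hypothetical helper for illustration only.
    """
    available = set(df['Entity'])
    known = [c for c in country_list if c in available]
    unknown = [c for c in country_list if c not in available]
    return known, unknown

requested = ['Afghanistan', 'Algeria', 'Atlantis']
known, unknown = split_known_countries(sample, requested)

# The same filter the plotting function applies: unmatched names simply vanish
filtered = sample[sample['Entity'].isin(requested)]

print('known:', known)
print('unknown:', unknown)
print('rows kept:', len(filtered))
```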

In [8]:
def Country_Compare(country_list, compare_label):
    '''
    Compare malaria death rates or malaria incidence across countries.
    
    Parameter
    ---------
    country_list: list
        The list of country names that we want to compare.
    compare_label: str
        The data we want to compare, must be either 'Deaths' or 'Incidence'.
        
    Result 
    ---------
    figure: visualization
        The line chart with interactive behavior tied to the x position of the cursor.
    '''
    
    if compare_label == 'Deaths':
        data = malaria_deaths[malaria_deaths['Entity'].isin(country_list)]
        label = 'Deaths (per 100,000 people)'
    elif compare_label == 'Incidence':
        data = malaria_inc[malaria_inc['Entity'].isin(country_list)]
        label = 'Incidence (per 1,000 population at risk)'
    else:
        return 'Compare_label is not correct, it should be either "Deaths" or "Incidence"'
    
    if data.empty:
        return 'No data for the corresponding country'
    
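    # Single selection that snaps to the data point whose Year is nearest to the cursor
    nearest = alt.selection(type='single', nearest=True, on='mouseover',
                            fields=['Year'], empty='none')
    
    # Base layer: one line per country
    line = alt.Chart(data).mark_line().encode(
        x = 'Year', 
        y = label, 
        color = 'Entity',
    )
    
    # Invisible points across the chart; they carry the selection
    selectors = alt.Chart(data).mark_point().encode(
        x = 'Year:Q',
        opacity = alt.value(0),
    ).add_selection(
        nearest
    )
    
    # Show a point on each line only at the selected year
    points = line.mark_point().encode(
        opacity=alt.condition(nearest, alt.value(1), alt.value(0))
    )
    
    # Label each line with its value at the selected year
    text = line.mark_text(align='left', dx=5, dy=-5).encode(
        text=alt.condition(nearest, label, alt.value(' '))
    )
    
    # Vertical rule at the selected year
    rules = alt.Chart(data).mark_rule(color='gray').encode(
        x='Year:Q',
    ).transform_filter(
        nearest
    )
    
    # Combine all layers into one chart
    figure = alt.layer(
        line, selectors, points, rules, text
    )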
    
    return figure

### Example of the first visualization

In [9]:
Country_Compare(['Afghanistan','Algeria','Angola'],'Deaths')

### The second visualization

The second visualization shows the n countries with the highest malaria death rate or incidence in a given year. The first parameter is an int giving the year of interest. The second parameter is an int giving the number of countries to show. The third parameter is a string: 'Deaths' ranks by malaria death rate, and 'Incidence' ranks by malaria incidence.

The visualization uses an interval selection, which causes the chart to include an interactive brush (shown in grey). The brush selection parameterizes the red guideline, which visualizes the average value within the selected interval.
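
The value that the red guideline shows can be reproduced in pandas, which may help when checking the chart. The sketch below assumes a toy DataFrame with placeholder entities 'A' to 'D' and made-up values. `top_n` is a hypothetical helper that mirrors the sort-and-`head` step, and the brush is simulated by picking rows by name.

```python
import pandas as pd

label = 'Deaths (per 100,000 people)'

# Toy data: placeholder entities and made-up values, for illustration only
sample = pd.DataFrame({
    'Entity': ['A', 'B', 'C', 'D'],
    'Year': [2010, 2010, 2010, 2010],
    label: [10.0, 40.0, 30.0, 20.0],
})

def top_n(df, year, n, column):
    """Return the n rows of the given year with the largest values in column.

    Hypothetical helper for illustration only.
    """
    rows = df[df['Year'] == year]
    return rows.sort_values(by=column, ascending=False).head(n)

top3 = top_n(sample, 2010, 3, label)

# With no brush, the guideline averages every bar in the chart
mean_all = top3[label].mean()

# Brushing over the bars for 'B' and 'C' restricts the average to those rows
brushed = top3[top3['Entity'].isin(['B', 'C'])]
mean_brushed = brushed[label].mean()

print(top3['Entity'].tolist())
print(mean_all, mean_brushed)
```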

In [10]:
def Year_Overview(Year, number, compare_label):
    '''
    Obtain the n countries with the highest malaria death rate or incidence in a given year.
    
    Parameter
    ---------
    Year: int
        The year we are interested in.
    number: int
        The number of countries we want to see.
    compare_label: str
        The data we want to see, must be either 'Deaths' or 'Incidence'.
        
    Result 
    ---------
    figure: visualization
        The bar chart with interactive behavior which visualizes the average value within the selected interval.
    '''
    
    if compare_label == 'Deaths':
        data = malaria_deaths[malaria_deaths['Year'] == Year]
        label = 'Deaths (per 100,000 people)'
    elif compare_label == 'Incidence':
        data = malaria_inc[malaria_inc['Year'] == Year]
        label = 'Incidence (per 1,000 population at risk)'
    else:
        return 'Compare_label is not correct, it should be either "Deaths" or "Incidence"'
    
    if data.empty:
        return 'No data for the corresponding Year'
        
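    # Keep the `number` rows with the largest values
    data_sorted = data.sort_values(by = label, ascending = False).head(number)
    
    # Interval selection restricted to the y axis (a vertical brush over the bars)
    brush = alt.selection(type = 'interval', encodings = ['y']) 
    
    # Horizontal bars sorted by value, largest first; bars outside the brush are dimmed
    bars = alt.Chart(data_sorted).mark_bar().encode(
        x = label,
        y = alt.Y(field = 'Entity', type = 'nominal', sort = '-x'), 
        opacity = alt.condition(brush, alt.OpacityValue(1), alt.OpacityValue(0.7)),
    ).add_selection(
        brush
    )
    
    # Red rule at the mean of the brushed bars
    line = alt.Chart().mark_rule(color='firebrick').encode(
        x = f'mean({label}):Q',
        size=alt.SizeValue(3)
    ).transform_filter(
        brush
    )
    
    # Layer the rule over the bars; both layers share data_sorted
    figure = alt.layer(bars, line, data = data_sorted)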
    
    return figure

### Example of the second visualization

In [11]:
Year_Overview(2010,20,'Deaths')

### The third visualization

The third visualization shows the distribution of malaria deaths across age groups for a specified country. The only parameter is a string giving the country of interest.

The visualization has an interactive legend. Clicking a legend entry highlights the area for that age group and fades the others.
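
The stacked areas are built from `sum(deaths)` per year and age group. A rough pandas equivalent of that aggregation is sketched below. It assumes a toy DataFrame with a placeholder country 'Exampleland' and made-up values. `age_table` is a hypothetical helper, and `AGE_ORDER` repeats the legend order used in the chart.

```python
import pandas as pd

AGE_ORDER = ['Under 5', '5-14', '15-49', '50-69', '70 or older']

# Toy data: placeholder country and made-up values, for illustration only
sample = pd.DataFrame({
    'entity': ['Exampleland'] * 4,
    'year': [1990, 1990, 1991, 1991],
    'age_group': ['5-14', 'Under 5', '5-14', 'Under 5'],
    'deaths': [5.0, 100.0, 6.0, 110.0],
})

def age_table(df, country):
    """Return a year-by-age-group table of summed deaths for one country.

    Hypothetical helper for illustration only.
    """
    rows = df[df['entity'] == country]
    table = rows.pivot_table(index='year', columns='age_group',
                             values='deaths', aggfunc='sum')
    # Keep the chart's legend order, skipping groups absent from the data
    ordered = [group for group in AGE_ORDER if group in table.columns]
    return table[ordered]

table = age_table(sample, 'Exampleland')
print(table)

# Selecting one legend entry corresponds to looking at a single column
print(table['Under 5'])
```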

In [12]:
def Age_Distribution(Country):
    '''
    Show the age-group distribution of malaria deaths for the specified country.
    
    Parameter
    ---------
    Country: string
        The country we are interested in.
        
    Result 
    ---------
    figure: visualization
        The area chart with interactive legend.
    '''

    data = malaria_deaths_age[malaria_deaths_age['entity'] == Country]
    
    if data.empty:
        return 'No data for the corresponding country'
    
    # Multi-selection bound to the legend: clicking a legend entry selects that age group
    selection = alt.selection_multi(fields=['age_group'], bind='legend')
    
    # Stacked area chart of deaths per year; unselected age groups are faded
    figure = alt.Chart(data).mark_area().encode(
        x = 'year', 
        y = 'sum(deaths)', 
        color = alt.Color('age_group:N',
                          sort=['Under 5', '5-14', '15-49',
                                '50-69', '70 or older']),
        opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
    ).add_selection(
        selection
    )
    
    return figure

### Example of the third visualization

In [13]:
Age_Distribution('Mozambique')