# Population Example

First, load the dataset.

In [1]:
import pandas as pd
import altair as alt

df = pd.read_csv('data/population.csv')
df.head()

Unnamed: 0,Country Name,1960,1961,1962,1963,1964,1965,1966,1967,1968,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Aruba,54608.0,55811.0,56682.0,57475.0,58178.0,58782.0,59291.0,59522.0,59471.0,...,102112.0,102880.0,103594.0,104257.0,104874.0,105439.0,105962.0,106442.0,106585.0,106537.0
1,Africa Eastern and Southern,130692579.0,134169237.0,137835590.0,141630546.0,145605995.0,149742351.0,153955516.0,158313235.0,162875171.0,...,552530654.0,567891875.0,583650827.0,600008150.0,616377331.0,632746296.0,649756874.0,667242712.0,685112705.0,702976832.0
2,Afghanistan,8622466.0,8790140.0,8969047.0,9157465.0,9355514.0,9565147.0,9783147.0,10010030.0,10247780.0,...,30466479.0,31541209.0,32716210.0,33753499.0,34636207.0,35643418.0,36686784.0,37769499.0,38972230.0,40099462.0
3,Africa Western and Central,97256290.0,99314028.0,101445032.0,103667517.0,105959979.0,108336203.0,110798486.0,113319950.0,115921723.0,...,376797999.0,387204553.0,397855507.0,408690375.0,419778384.0,431138704.0,442646825.0,454306063.0,466189102.0,478185907.0
4,Angola,5357195.0,5441333.0,5521400.0,5599827.0,5673199.0,5736582.0,5787044.0,5827503.0,5868203.0,...,25188292.0,26147002.0,27128337.0,28127721.0,29154746.0,30208628.0,31273533.0,32353588.0,33428486.0,34503774.0


Collapse the year columns into a single column.

Use the `melt()` function to reshape the `df` DataFrame by unpivoting it, keeping `Country Name` as the identifier column. Unpivoting means converting a dataset from a wide format to a long format by turning columns into rows: each year column becomes one row per country, with the year stored in `Year` and the value in `Population`.

In [2]:
df = df.melt(id_vars='Country Name', 
             var_name='Year', 
             value_name='Population')
df.head()

Unnamed: 0,Country Name,Year,Population
0,Aruba,1960,54608.0
1,Africa Eastern and Southern,1960,130692579.0
2,Afghanistan,1960,8622466.0
3,Africa Western and Central,1960,97256290.0
4,Angola,1960,5357195.0
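To see the reshaping on a smaller scale, the sketch below applies the same `melt()` call to a hypothetical two-country table. The country names `A` and `B` and their values are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
wide_df = pd.DataFrame({
    'Country Name': ['A', 'B'],
    '1960': [10.0, 20.0],
    '1961': [11.0, 22.0],
})

# Unpivot: each year column becomes one (Year, Population) row per country.
long_df = wide_df.melt(id_vars='Country Name',
                       var_name='Year',
                       value_name='Population')

print(long_df)
```

Two countries and two year columns give four rows, one for each country and year pair.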


Convert the `Year` column to integers. After `melt()`, the years are still the original column labels, so they are stored as strings.

In [151]:
df['Year'] = df['Year'].astype('int')
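The conversion matters because year labels that come from column names are text, not numbers. A minimal sketch on a hypothetical two-row frame (the values are made up) shows the column type before and after the cast:

```python
import pandas as pd

# Hypothetical long-format frame: the years are strings, as they would be
# after melting columns named '1960', '1961', ...
toy = pd.DataFrame({'Year': ['1960', '1961'], 'Population': [10.0, 11.0]})

was_integer = pd.api.types.is_integer_dtype(toy['Year'])

# Cast the text labels to integers so they can be treated as quantities.
toy['Year'] = toy['Year'].astype('int')

is_integer = pd.api.types.is_integer_dtype(toy['Year'])

print(was_integer, is_integer)
```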

In [152]:
len(df)

16492

Disable Altair's maximum row limit. By default, Altair raises an error when a chart's dataset has more than 5,000 rows, and this DataFrame has 16,492. Calling `disable_max_rows()` removes the limit so that the full dataset can be plotted.

In [153]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')
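As a quick sanity check on the row count, the arithmetic below relates the 16,492 rows to the 1960 to 2021 year range. The number of entities per year is inferred from the division and is not stated in the notebook; the 5,000-row threshold is Altair's default limit.

```python
n_rows = 16492                       # len(df) reported above
n_years = len(range(1960, 2022))     # 1960..2021 inclusive

# Each entity (country or aggregate) contributes one row per year.
n_entities, remainder = divmod(n_rows, n_years)

DEFAULT_MAX_ROWS = 5000              # Altair's default row limit
exceeds_limit = n_rows > DEFAULT_MAX_ROWS

print(n_years, n_entities, remainder, exceeds_limit)
```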

Draw the chart

In [154]:
chart = alt.Chart(df).mark_line().encode(
    x = 'Year:Q',
    y = 'Population:Q',
    color = 'Country Name:N'
)
chart

## From data to information

The previous chart is cluttered and has the following problems:

* too many countries

* too many colors

* no focus


To solve these problems, group countries by continent. The dataset already contains aggregate rows for continents and regions. List the distinct names using `unique()`.

In [155]:
df['Country Name'].unique()

array(['Aruba', 'Africa Eastern and Southern', 'Afghanistan',
       'Africa Western and Central', 'Angola', 'Albania', 'Andorra',
       'Arab World', 'United Arab Emirates', 'Argentina', 'Armenia',
       'American Samoa', 'Antigua and Barbuda', 'Australia', 'Austria',
       'Azerbaijan', 'Burundi', 'Belgium', 'Benin', 'Burkina Faso',
       'Bangladesh', 'Bulgaria', 'Bahrain', 'Bahamas, The',
       'Bosnia and Herzegovina', 'Belarus', 'Belize', 'Bermuda',
       'Bolivia', 'Brazil', 'Barbados', 'Brunei Darussalam', 'Bhutan',
       'Botswana', 'Central African Republic', 'Canada',
       'Central Europe and the Baltics', 'Switzerland', 'Channel Islands',
       'Chile', 'China', "Cote d'Ivoire", 'Cameroon', 'Congo, Dem. Rep.',
       'Congo, Rep.', 'Colombia', 'Comoros', 'Cabo Verde', 'Costa Rica',
       'Caribbean small states', 'Cuba', 'Curacao', 'Cayman Islands',
       'Cyprus', 'Czechia', 'Germany', 'Djibouti', 'Dominica', 'Denmark',
       'Dominican Republic', 'Algeria',
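Scanning the full array by eye is tedious. One possible shortcut, sketched below on a handful of names taken from the output above, is to search for region keywords with `str.contains()`. The keyword list is an assumption chosen for illustration, and in the full dataset a keyword such as `Africa` would also match country names like `Central African Republic`, so the result still needs a manual check.

```python
import pandas as pd

# A few names copied from the unique() output above.
names = pd.Series(['Aruba', 'Africa Eastern and Southern', 'Afghanistan',
                   'Africa Western and Central', 'Angola', 'Arab World'])

# Illustrative keyword pattern (an assumption, not part of the notebook).
keywords = 'Africa|America|Asia|Europe|Pacific'

# Keep only the names that contain at least one keyword.
candidates = names[names.str.contains(keywords)]

print(candidates.tolist())
```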
 

Build a list of continents.

In [156]:
continents = ['Africa Eastern and Southern',
              'Africa Western and Central',
              'Middle East & North Africa',
              'Sub-Saharan Africa',
              'Europe & Central Asia',
              'Latin America & Caribbean',
              'North America',
              'Pacific island small states',
              'East Asia & Pacific']

Filter the dataset by selecting only the continents. Use `isin()` to select continents.

In [157]:
df = df[df['Country Name'].isin(continents)]
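The filter works in two steps: `isin()` builds a boolean mask, and indexing with that mask keeps only the matching rows. A minimal sketch on a hypothetical four-row frame (the values are made up):

```python
import pandas as pd

# Hypothetical long-format frame mixing a country and an aggregate.
toy = pd.DataFrame({
    'Country Name': ['Aruba', 'North America', 'Angola', 'North America'],
    'Year': [1960, 1960, 1961, 1961],
    'Population': [1.0, 2.0, 3.0, 4.0],
})

keep = ['North America']

# True where the name is in the list, False elsewhere.
mask = toy['Country Name'].isin(keep)

# Boolean indexing keeps only the rows where the mask is True.
filtered = toy[mask]

print(filtered)
```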

Draw the chart again.

In [158]:
chart = alt.Chart(df).mark_line().encode(
    x = 'Year:Q',
    y = 'Population:Q',
    color = 'Country Name:N'
)
chart

The chart is now readable, but two problems remain:

* too many colors

* no focus


Focus on North America and average the other regions into a single series.

In [159]:
mask = df['Country Name'].isin(['North America'])
# Average only the numeric column; 'Country Name' is text and cannot be averaged.
df_mean = df[~mask].groupby(by='Year')['Population'].mean().reset_index()

df_grouped = pd.DataFrame({ 
    'Year' : df[mask]['Year'].values,
    'North America' : df[mask]['Population'].values, 
    'World': df_mean['Population'].values
})

df_grouped.head()

Unnamed: 0,Year,North America,World
0,1960,198624756.0,311313900.0
1,1961,202007500.0,315004400.0
2,1962,205198600.0,320388500.0
3,1963,208253700.0,327289100.0
4,1964,211262900.0,334236900.0
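The grouping step can be checked by hand on a small example. The sketch below uses hypothetical regions `Region X` and `Region Y` with made-up values, so that the per-year mean of the non-focus rows is easy to verify.

```python
import pandas as pd

# Hypothetical frame: one focus region and two other regions, two years.
toy = pd.DataFrame({
    'Country Name': ['North America', 'Region X', 'Region Y'] * 2,
    'Year': [1960, 1960, 1960, 1961, 1961, 1961],
    'Population': [100.0, 10.0, 30.0, 110.0, 20.0, 40.0],
})

mask = toy['Country Name'].isin(['North America'])

# Per-year mean of every row that is NOT the focus region.
others_mean = toy[~mask].groupby('Year')['Population'].mean().reset_index()

# Put the focus series and the averaged series side by side.
grouped = pd.DataFrame({
    'Year': toy[mask]['Year'].values,
    'North America': toy[mask]['Population'].values,
    'World': others_mean['Population'].values,
})

print(grouped)
```

For 1960 the mean of 10 and 30 is 20, and for 1961 the mean of 20 and 40 is 30.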


It is still difficult to compare the two series because they start from different values. To put them on a common footing, use 1960 as the baseline: subtract each series' 1960 value from every later value, so that both lines start at zero, and store the result in a new column called `Diff`.
First, reshape `df_grouped` into long format with `melt()`.

In [160]:
df_melt = df_grouped.melt(id_vars='Year', var_name='Continent', value_name='Population')
df_melt.head()

Unnamed: 0,Year,Continent,Population
0,1960,North America,198624756.0
1,1961,North America,202007500.0
2,1962,North America,205198600.0
3,1963,North America,208253700.0
4,1964,North America,211262900.0


In [161]:
colors=['#80C11E', 'grey']
chart = alt.Chart(df_melt).mark_line().encode(
    x = alt.X('Year:Q',
              title=None, 
              axis=alt.Axis(format='.0f',tickMinStep=10)),
    y = alt.Y('Population:Q', 
              axis=alt.Axis(format='.2s')),
    color = alt.Color('Continent:N', 
                      scale=alt.Scale(range=colors),
                      legend=None),
    opacity = alt.condition(alt.datum['Continent'] == 'North America', alt.value(1), alt.value(0.3))
).properties(
    title='Population in North America from 1960 to 2021',
    width=400,
    height=250
).configure_axis(
    grid=False,
    titleFontSize=14,
    labelFontSize=12
).configure_title(
    fontSize=16,
    color='#80C11E'
).configure_view(
    strokeWidth=0
)

chart

Calculate the difference from the 1960 baseline.

In [162]:
baseline = df_melt[df_melt['Year'] == 1960]

In [163]:
continents = ['North America', 'World']
for continent in continents:
    baseline_value = baseline[baseline['Continent'] == continent]['Population'].values[0]
    m = df_melt['Continent'] == continent
    df_melt.loc[m, 'Diff'] = df_melt.loc[m,'Population'] - baseline_value
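An alternative to the explicit loop, offered here only as a sketch, is to let pandas broadcast each group's first value with `groupby(...).transform('first')`. It assumes the rows are sorted by year within each series so that the first row of every group is the baseline year; the frame below is hypothetical.

```python
import pandas as pd

# Hypothetical long-format frame with two series and three years each.
toy = pd.DataFrame({
    'Year': [1960, 1961, 1962, 1960, 1961, 1962],
    'Continent': ['North America'] * 3 + ['World'] * 3,
    'Population': [100.0, 110.0, 125.0, 300.0, 330.0, 390.0],
})

# Make sure the earliest year comes first within each series.
toy = toy.sort_values(['Continent', 'Year'])

# Repeat each series' first (baseline) value on every row of that series.
baseline = toy.groupby('Continent')['Population'].transform('first')

# Difference from the baseline; the baseline year itself becomes zero.
toy['Diff'] = toy['Population'] - baseline

print(toy)
```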

In [164]:
colors=['#80C11E', 'grey']
chart = alt.Chart(df_melt).mark_line().encode(
    x = alt.X('Year:Q',title=None, axis=alt.Axis(format='.0f',tickMinStep=10)),
    y = alt.Y('Diff:Q', title='Difference from 1960',axis=alt.Axis(format='.2s')),
    color = alt.Color('Continent:N', scale=alt.Scale(range=colors),legend=None),
    opacity = alt.condition(alt.datum['Continent'] == 'North America', alt.value(1), alt.value(0.3))
).properties(
    title='Population in North America from 1960 to 2021',
    width=400,
    height=250
).configure_axis(
    grid=False,
    titleFontSize=14,
    labelFontSize=12
).configure_title(
    fontSize=16,
    color='#80C11E'
).configure_view(
    strokeWidth=0
)

chart

In [165]:
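mask = df_melt['Year'] == 2021
# Positional lookup: df_melt lists the North America rows before the World rows.
na = df_melt[mask]['Diff'].values[0]   # North America
oth = df_melt[mask]['Diff'].values[1]  # World (mean of the other regions)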
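Picking values by position depends on the row order of the frame. A label-based lookup, sketched below on a hypothetical frame with made-up values, avoids that dependency by indexing the final-year rows by series name.

```python
import pandas as pd

# Hypothetical long-format frame with a Diff column already computed.
toy = pd.DataFrame({
    'Year': [2020, 2021, 2020, 2021],
    'Continent': ['North America', 'North America', 'World', 'World'],
    'Diff': [5.0, 7.0, 50.0, 70.0],
})

# Keep the final year and index the values by series name.
last = toy[toy['Year'] == 2021].set_index('Continent')['Diff']

na = last['North America']
oth = last['World']

print(na, oth)
```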

In [166]:
df_text = pd.DataFrame({'text' : ['Rest of the world','North America'],
       'x' : [2023,2023],
       'y' : [oth,na]})

df_text

Unnamed: 0,text,x,y
0,Rest of the world,2023,538693400.0
1,North America,2023,171579000.0


In [167]:
text = alt.Chart(df_text).mark_text(fontSize=14, align='left').encode(
    x = 'x',
    y = 'y',
    text = 'text',
    color = alt.condition(alt.datum.text == 'North America', alt.value('#80C11E'), alt.value('grey'))
)

text

In [168]:
chart = alt.Chart(df_melt).mark_line().encode(
    x = alt.X('Year:Q',title=None, axis=alt.Axis(format='.0f',tickMinStep=10)),
    y = alt.Y('Diff:Q', title='Difference from 1960',axis=alt.Axis(format='.2s')),
    color = alt.Color('Continent:N', scale=alt.Scale(range=colors),legend=None),
    opacity = alt.condition(alt.datum['Continent'] == 'North America', alt.value(1), alt.value(0.3))
).properties(
    title='Population in North America from 1960 to 2021',
    width=400,
    height=250
)

In [169]:
total = (chart + text).configure_axis(
    grid=False,
    titleFontSize=14,
    labelFontSize=12
).configure_title(
    fontSize=16,
    color='#80C11E',
    #offset=20
).configure_view(
    strokeWidth=0
)

total

## From information to knowledge

In [170]:
oth - na

367114388.25

In [171]:
offset = 10000000
df_vline = pd.DataFrame({'y' : [oth - offset,na + offset], 
                         'x' : [2021,2021]})

line = alt.Chart(df_vline).mark_line(color='black').encode(
    y = 'y',
    x = 'x'
)

line

In [173]:
chart + text + line

In [174]:
df_ann = pd.DataFrame({'text' : ['367M'],
       'x' : [2022],
       'y' : [na + (oth-na)/2]})

df_ann

Unnamed: 0,text,x,y
0,367M,2022,355136200.0
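The `367M` label and its vertical position can also be derived rather than typed by hand. The sketch below starts from the two 2021 values as displayed in the `df_text` output above, which are rounded, so the gap differs slightly from the exact figure of 367114388.25.

```python
# 2021 differences as displayed (rounded) in the df_text output above.
na = 171579000.0     # North America
oth = 538693400.0    # mean of the other regions

# Gap between the two series in the final year.
gap = oth - na

# Short label in millions, with no decimals.
label = f"{gap / 1e6:.0f}M"

# Vertical position halfway between the two lines.
midpoint = na + gap / 2

print(label, midpoint)
```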


In [175]:
ann = alt.Chart(df_ann).mark_text(fontSize=30, align='left').encode(
    x = 'x',
    y = 'y',
    text = 'text'
)

ann

## Add source

In [186]:
# The URL must not contain leading whitespace, or the link will not resolve.
df_subtitle = pd.DataFrame({'text': ['source: World Bank'],
                            'href': ['https://data.worldbank.org/indicator/SP.POP.TOTL']})
subtitle = alt.Chart(df_subtitle).mark_text(
    y=0
).encode(
    text='text',
    href='href'
)

subtitle

In [194]:
total = chart + text + line + ann + subtitle

In [195]:
total.configure_axis(
    grid=False,
    titleFontSize=14,
    labelFontSize=12
).configure_title(
    fontSize=16,
    color='#80C11E'
).configure_view(
    strokeWidth=0
)

In [196]:
df_context = pd.DataFrame({'text' : ['Why this gap?',
                            '1. Lower Fertility Rate', 
                            '2. Lower Immigration Rate', 
                            '3. Higher Average Age'],
                           'y': [0,1,2,3]})

df_context

Unnamed: 0,text,y
0,Why this gap?,0
1,1. Lower Fertility Rate,1
2,2. Lower Immigration Rate,2
3,3. Higher Average Age,3


In [197]:
context = alt.Chart(df_context).mark_text(fontSize=14, align='left', dy=50).encode(
    y = alt.Y('y:O', axis=None),
    text = 'text',
    stroke = alt.condition(alt.datum.y == 0, alt.value('#80C11E'), alt.value('black')),
    strokeWidth = alt.condition(alt.datum.y == 0, alt.value(1), alt.value(0))
)

context

In [198]:
total = (context | (chart + text + line + ann + subtitle)).configure_axis(
    grid=False,
    titleFontSize=14,
    labelFontSize=12
).configure_title(
    fontSize=16,
    color='#80C11E'
).configure_view(
    strokeWidth=0
)

total

## From knowledge to wisdom

In [199]:
df_cta = pd.DataFrame({
    'Strategy': ['Immigration Development', 'Enhance Family-Friendly Policies', 'Revitalize Rural Areas'],
    'Population Increase': [20, 30, 15]  # Sample population increase percentages
})

# Create a horizontal bar chart, sorted by population increase
cta = alt.Chart(df_cta).mark_bar(color='#80C11E').encode(
    x='Population Increase:Q',
    y=alt.Y('Strategy:N', sort='-x', title=None),
    tooltip=['Strategy', 'Population Increase']
).properties(
    title='Strategies for population growth in North America',
)

# Display the chart
cta


In [200]:
total = alt.vconcat((context | (chart + text + line + ann + subtitle)), cta,center=True).configure_axis(
    grid=False,
    titleFontSize=14,
    labelFontSize=12
).configure_title(
    fontSize=20,
    color='#80C11E',
    offset=10
).configure_view(
    strokeWidth=0
).configure_concat(
    spacing=50
)

total