# Data Visualization with Python


<h3 style="text-align: center;">A picture says more than a thousand words!</h3>

A good visualization often gives good insight into the data being evaluated and also makes it easier to communicate your findings. In this notebook we show several ways to visualize data.
The data is originally taken from https://www.gapminder.org/; most use cases are taken from the Introduction to Data Science course by Rafael A. Irizarry
https://rafalab.dfci.harvard.edu/dsbook/, which in turn refers to the following TED Talk:

https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen?language=en

For this notebook we will mainly use the ggplot routines included in the plotnine package.
It is impossible to remember all plot commands in detail. What matters is knowing what is possible and roughly how it can be done, and then looking up the details, e.g. in this notebook, or asking your preferred AI.

In [9]:
#load the needed packages
import numpy as np # numerical python
import pandas as pd #data science
import matplotlib.pyplot as plt   # graphs
import plotnine as p9 #ggplot2 - graphs

plt.rcParams["figure.figsize"]=12,8 #size of figures in this notebook


We switch off warnings for better readability. In general this should not be done, since warnings help to identify possible or future errors.

In [10]:
import warnings
warnings.filterwarnings("ignore")
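
If only a single call is noisy, a less drastic option is to silence warnings locally instead of globally. The following is a minimal sketch using `warnings.catch_warnings()` from the standard library; the function `noisy_computation` is a hypothetical stand-in for any call that emits a warning.

```python
import warnings

def noisy_computation():
    # hypothetical stand-in for a library call that emits a warning
    warnings.warn("this behaviour will change in a future version", FutureWarning)
    return 42

# warnings are suppressed only inside the with-block
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_computation()

# after the block the previous warning filters are restored
print(result)
```

This keeps warnings visible for all other code, at the price of a little more typing.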

## Data exploration
First we have a look at the data in order to understand what it is all about

In [11]:
gapminder = pd.read_csv('gapminder.csv') #load the data
gapminder.head(4) # show the first 4 lines

Unnamed: 0.1,Unnamed: 0,infant_mortality,life_expectancy,fertility,population,gdp,continent,region
0,Albania_1960,115.4,62.87,6.19,1636054.0,,Europe,Southern Europe
1,Algeria_1960,148.2,47.5,7.65,11124892.0,13828150000.0,Africa,Northern Africa
2,Angola_1960,208.0,35.98,7.32,5270844.0,,Africa,Middle Africa
3,Antigua and Barbuda_1960,,62.97,4.43,54681.0,,Americas,Caribbean


We see that the dataframe contains several socioeconomic indicators for each country_year combination, together with geographical information about the countries.

In [12]:
gapminder = gapminder.rename(columns={'Unnamed: 0': 'country_year'})
gapminder.head(4)

Unnamed: 0,country_year,infant_mortality,life_expectancy,fertility,population,gdp,continent,region
0,Albania_1960,115.4,62.87,6.19,1636054.0,,Europe,Southern Europe
1,Algeria_1960,148.2,47.5,7.65,11124892.0,13828150000.0,Africa,Northern Africa
2,Angola_1960,208.0,35.98,7.32,5270844.0,,Africa,Middle Africa
3,Antigua and Barbuda_1960,,62.97,4.43,54681.0,,Americas,Caribbean


First we split the country_year column into 2 separate columns 

In [13]:
gapminder[['country', 'year']] = gapminder['country_year'].str.split('_', expand=True)
gapminder.head()

Unnamed: 0,country_year,infant_mortality,life_expectancy,fertility,population,gdp,continent,region,country,year
0,Albania_1960,115.4,62.87,6.19,1636054.0,,Europe,Southern Europe,Albania,1960
1,Algeria_1960,148.2,47.5,7.65,11124892.0,13828150000.0,Africa,Northern Africa,Algeria,1960
2,Angola_1960,208.0,35.98,7.32,5270844.0,,Africa,Middle Africa,Angola,1960
3,Antigua and Barbuda_1960,,62.97,4.43,54681.0,,Americas,Caribbean,Antigua and Barbuda,1960
4,Argentina_1960,59.87,65.39,3.11,20619075.0,108322300000.0,Americas,South America,Argentina,1960
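
A minimal sketch on made-up data of how this split behaves. It uses `str.rsplit` with `n=1`, which splits only at the last underscore; this is an assumption on our side, meant as a safeguard in case a country name itself ever contained an underscore. The toy frame and the name `Made_Up Land` are hypothetical.

```python
import pandas as pd

# hypothetical toy data in the same "country_year" format
toy = pd.DataFrame({'country_year': ['Albania_1960', 'Made_Up Land_1961']})

# split only at the last underscore, so underscores inside the name survive
toy[['country', 'year']] = toy['country_year'].str.rsplit('_', n=1, expand=True)

print(toy)
print(type(toy['year'][0]))  # worth checking what type the new column has
```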


In [14]:
gapminder.drop(['country_year'], axis=1, inplace=True)
gapminder.head(4)

Unnamed: 0,infant_mortality,life_expectancy,fertility,population,gdp,continent,region,country,year
0,115.4,62.87,6.19,1636054.0,,Europe,Southern Europe,Albania,1960
1,148.2,47.5,7.65,11124892.0,13828150000.0,Africa,Northern Africa,Algeria,1960
2,208.0,35.98,7.32,5270844.0,,Africa,Middle Africa,Angola,1960
3,,62.97,4.43,54681.0,,Americas,Caribbean,Antigua and Barbuda,1960


In [15]:
display(gapminder.columns) #columns of gapminder dataframe
col_reord = gapminder.columns[[7,8,0,1,2,3,4,5,6]] #columns in new order
display(col_reord)

Index(['infant_mortality', 'life_expectancy', 'fertility', 'population', 'gdp',
       'continent', 'region', 'country', 'year'],
      dtype='object')

Index(['country', 'year', 'infant_mortality', 'life_expectancy', 'fertility',
       'population', 'gdp', 'continent', 'region'],
      dtype='object')

In [16]:
gapminder = gapminder[col_reord]
gapminder.head(4) #now we have the columns in the desired order

Unnamed: 0,country,year,infant_mortality,life_expectancy,fertility,population,gdp,continent,region
0,Albania,1960,115.4,62.87,6.19,1636054.0,,Europe,Southern Europe
1,Algeria,1960,148.2,47.5,7.65,11124892.0,13828150000.0,Africa,Northern Africa
2,Angola,1960,208.0,35.98,7.32,5270844.0,,Africa,Middle Africa
3,Antigua and Barbuda,1960,,62.97,4.43,54681.0,,Americas,Caribbean
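
Reordering by position works, but it breaks silently if the column layout changes. As an alternative, here is a small sketch that reorders by column name; the toy frame and the variable names `first` and `rest` are made up for illustration.

```python
import pandas as pd

# hypothetical toy frame with the key columns at the end
toy = pd.DataFrame({
    'fertility': [6.19, 7.65],
    'continent': ['Europe', 'Africa'],
    'country': ['Albania', 'Algeria'],
    'year': ['1960', '1960'],
})

# columns that should come first; all remaining ones keep their order
first = ['country', 'year']
rest = [col for col in toy.columns if col not in first]
toy = toy[first + rest]

print(list(toy.columns))
```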


In [17]:
#overview
gapminder.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10545 entries, 0 to 10544
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   country           10545 non-null  object 
 1   year              10545 non-null  object 
 2   infant_mortality  9092 non-null   float64
 3   life_expectancy   10545 non-null  float64
 4   fertility         10358 non-null  float64
 5   population        10360 non-null  float64
 6   gdp               7573 non-null   float64
 7   continent         10545 non-null  object 
 8   region            10545 non-null  object 
dtypes: float64(5), object(4)
memory usage: 741.6+ KB


In [18]:
gapminder.describe() #some statistical key values (only numerical columns)

Unnamed: 0,infant_mortality,life_expectancy,fertility,population,gdp
count,9092.0,10545.0,10358.0,10360.0,7573.0
mean,55.308619,64.811623,4.083521,27014610.0,147954400000.0
std,47.728055,10.672495,2.02732,106664900.0,697912800000.0
min,1.5,13.2,0.84,31238.0,40395130.0
25%,16.0,57.5,2.2,1333486.0,1845780000.0
50%,41.5,67.54,3.75,5009043.0,7794215000.0
75%,85.1,73.0,6.0,15231790.0,55399650000.0
max,276.9,83.9,9.22,1376049000.0,11744220000000.0


In [19]:
gapminder.head()

Unnamed: 0,country,year,infant_mortality,life_expectancy,fertility,population,gdp,continent,region
0,Albania,1960,115.4,62.87,6.19,1636054.0,,Europe,Southern Europe
1,Algeria,1960,148.2,47.5,7.65,11124892.0,13828150000.0,Africa,Northern Africa
2,Angola,1960,208.0,35.98,7.32,5270844.0,,Africa,Middle Africa
3,Antigua and Barbuda,1960,,62.97,4.43,54681.0,,Americas,Caribbean
4,Argentina,1960,59.87,65.39,3.11,20619075.0,108322300000.0,Americas,South America


The meaning of the gapminder data columns is
- **infant mortality**: number of infant deaths per 1000 live births (infants are usually counted up to the age of one year)
- **fertility**: average number of children per woman
- **gdp**: gross domestic product, measured in US dollars

In [20]:
# We want to compare the fertility in Germany, the US and Turkey for the years 1960 and 2010
# usage of .query
display(gapminder.query('year in [1960,2010]  & country in ["USA","Germany","Turkey"]')[['year','country','fertility']])
# usage of .isin()
display(gapminder[(gapminder['year'].isin([1960,2010]) ) & gapminder['country'].isin(["USA","Germany","Turkey"])][['year','country','fertility']])

Unnamed: 0,year,country,fertility


Unnamed: 0,year,country,fertility


**Exercise**  
1. Fix the problem(s) in the code above.   
   Hints: Look at the first entry of the year column gapminder['year'][0]   
          with np.unique(gapminder.country) you can have a look at all countries included in the dataframe   
2. Compare infant mortality in 2015 in Sri Lanka and Turkey

## Scatter Plots
How has the world developed economically? How can we distinguish developed countries from undeveloped ones?

undeveloped -> low life expectancy and many children   
developed  ->  high life expectancy and few children 



In [21]:
g1 = p9.ggplot(gapminder.query('year == 1962'), 
          p9.aes(x='fertility', y='life_expectancy')) + p9.geom_point()
g1

<ggplot: (640 x 480)>


In [22]:
g2 = (p9.ggplot(gapminder.query('year == 1962'), 
          p9.aes(x='fertility', y='life_expectancy',color = 'continent')) 
          + p9.geom_point()
)

g2 # different colors for continents

<ggplot: (640 x 480)>


Now we would like to show the data for different years

In [23]:
selected_years = [1962,1980,2015]
plot_data = gapminder[gapminder.year.isin(selected_years)]
gf1=(p9.ggplot(plot_data, 
          p9.aes(x='fertility', y='life_expectancy',color = 'continent')) 
   + p9.geom_point() 
   + p9.facet_grid('.~ year') #different columns for different years
  )
gf1

<ggplot: (640 x 480)>


In [24]:
plot_data = gapminder[gapminder.year.isin(selected_years)]
gf2=(p9.ggplot(plot_data, 
          p9.aes(x='fertility', y='life_expectancy',color = 'continent')) 
   + p9.geom_point() 
   + p9.facet_grid('continent ~ year')  #split continent and year
  )
gf2

<ggplot: (640 x 480)>


**Exercise**   
1. Make the same plots, but compare only Asia and Europe for the given years
2. Make a movie that shows the development of the world in this two-parameter space
   

## Time Series
Now we want to compare how some parameters evolve over time. For this we use time series and their visualization.

In [25]:
# How did the number of children per woman develop in Germany and the US?
plot_data = gapminder[gapminder.country.isin(['United States','Germany'])]
gl0=(p9.ggplot(plot_data, 
          p9.aes(x='year', y='fertility')) 
   + p9.geom_line() 
  )
gl0 # something isn't right...

<ggplot: (640 x 480)>


In [26]:
gl1=(p9.ggplot(plot_data, 
          p9.aes(x='year', y='fertility',color = 'country')) 
   + p9.geom_line() 
  )
gl1

<ggplot: (640 x 480)>


**Exercise**   
- Compare the life expectancy of Germany and the US as a function of time
- Compare the life expectancy of Vietnam and Cambodia as a function of time. Did you expect this?
- Look at the fertility of Vietnam and Cambodia as a function of time. Did you expect this?

## Boxplots
Boxplots are very useful for comparing distributions of a parameter. In this case we want to compare the income distribution for certain regions of the world and see how this distribution changed over time.

**Exercise**  
  - Create the column dollars_per_day = gdp/population/365 
  - Create a new dataframe gapminder_clean in which you drop all rows where this column has no value (NaN)


In [27]:
gb0=(p9.ggplot(gapminder_clean[gapminder_clean.year == 1975], 
          p9.aes(x='region', y='dollars_per_day')) 
   + p9.geom_boxplot() 
  )
gb0

NameError: name 'gapminder_clean' is not defined

In [None]:
#rotate xticks labels
gb1=(p9.ggplot(gapminder_clean[gapminder_clean.year == 1975], 
          p9.aes(x='region', y='dollars_per_day')) 
   + p9.geom_boxplot() 
   + p9.theme(axis_text_x = p9.element_text(rotation = 90, hjust = 1))
)

gb1

In [None]:
# choose different colors for continents
gb2=(p9.ggplot(gapminder_clean[gapminder_clean.year == 1975], 
          p9.aes(x='region', y='dollars_per_day',fill = 'continent')) 
   + p9.geom_boxplot() 
   + p9.theme(axis_text_x = p9.element_text(rotation = 90, hjust = 1))
)

gb2

To get a nicer graph we sort the regions by income and scale the y axis logarithmically to make very small incomes visible. We use log2 because it has an intuitive interpretation. In addition we show all data points to get a better understanding of the populations. This is especially important if there are only very few data points per category.
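
The reason log2 is easy to read is that one step on the axis corresponds to a doubling of income. A tiny sketch with made-up income values (not taken from the gapminder data) illustrates this:

```python
import numpy as np

# hypothetical incomes in dollars per day, each twice the previous one
dollars_per_day = np.array([0.5, 1, 2, 4, 8, 16, 32])

log2_income = np.log2(dollars_per_day)
steps = np.diff(log2_income)  # distance between neighbours on the log2 axis

print(log2_income)
print(steps)  # every doubling is exactly one step
```

On such an axis, the distance between 1 and 2 dollars per day looks the same as the distance between 16 and 32 dollars per day.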

In [None]:
gb3=(p9.ggplot(gapminder_clean[gapminder_clean.year == 1975], 
          p9.aes(x='reorder(region,dollars_per_day)', y='dollars_per_day',fill = 'continent')) #reorder x axis
   + p9.geom_boxplot()  #make boxplots
   + p9.theme(axis_text_x = p9.element_text(rotation = 90, hjust = 1)) #rotate xticks labels
   + p9.scale_y_continuous(trans = 'log2') #scale logarithmically
   + p9.geom_point(show_legend=False)  # show all datapoints
)

gb3

In [None]:
#we want to compare the years 1975 and 2010
gb4=(p9.ggplot(gapminder_clean[gapminder_clean.year.isin([1975,2010])], 
          p9.aes(x='reorder(region,dollars_per_day)', y='dollars_per_day',fill = 'continent')) 
   + p9.geom_boxplot() 
   + p9.theme(axis_text_x = p9.element_text(rotation = 90, hjust = 1))
   + p9.scale_y_continuous(trans = 'log2')
   + p9.geom_point(show_legend=False) 
   + p9.facet_grid('year ~.')
)

gb4

We notice that it is not easy to compare the data. It would be nicer if the boxplots of the two years were displayed side by side for each region.
We are going to do that now:

In [None]:
gb5=(p9.ggplot(gapminder_clean[gapminder_clean.year.isin([1975,2010])], 
          p9.aes(x='reorder(region,dollars_per_day)', y='dollars_per_day',fill = 'factor(year)')) 
   + p9.geom_boxplot() 
   + p9.theme(axis_text_x = p9.element_text(rotation = 90, hjust = 1))
   + p9.scale_y_continuous(trans = 'log2')
)

gb5

**Exercise**  
- Look carefully at the graph above. What does it tell you about the economic development in the different regions?
- What is strange about the graph above? Do you believe everything that is displayed?  What is surprising?
- Discuss with your fellow students and correct the above graph 

## Density plots
To show distributions we can also use density plots (and/or histograms). When comparing several distributions, density plots are not always as clear as boxplots, but they contain more information about the shape of a distribution.
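
To give an idea of what a density plot computes, here is a minimal sketch of a Gaussian kernel density estimate written with NumPy only. It is an illustration under simplifying assumptions and not necessarily how `geom_density()` works internally; the function name `gaussian_kde_sketch` and the sample values are made up.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bw):
    """Average of Gaussian bumps of width bw, one centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # scaled distance of every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bw
    kernels = np.exp(-0.5 * z**2) / (bw * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# hypothetical samples forming two clusters
samples = np.array([0.8, 1.0, 1.2, 3.9, 4.0, 4.3])
grid = np.linspace(-5, 10, 1501)

narrow = gaussian_kde_sketch(samples, grid, bw=0.15)  # spiky, follows single points
wide = gaussian_kde_sketch(samples, grid, bw=0.75)    # smooth, hides fine structure

# both curves enclose an area of about 1
dx = grid[1] - grid[0]
area_narrow = narrow.sum() * dx
area_wide = wide.sum() * dx
print(area_narrow, area_wide)
```

A small bandwidth follows the individual data points closely, while a large bandwidth smooths them into broad humps. The `bw` argument used in the plots below plays this role.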

In [None]:
gapminder_clean.head()

In [None]:
# we want to compare the development of the income distribution of the developing countries and the developed countries which are often called the west.
def sort_countries(row):
    if row['continent'] == 'Europe' or \
       row['region'] in ['Northern America','Australia and New Zealand'] or  \
       row['country'] == 'Japan':
        return 'west'
    else:
        return 'developing'

gapminder_clean['group'] = gapminder_clean.apply(sort_countries,axis = 1)
gapminder_clean.head()

In [None]:
plot_data = gapminder_clean[gapminder_clean.year.isin([1975,2010])]
gd1=(p9.ggplot(plot_data, 
          p9.aes(x='dollars_per_day', y='..count..',fill = 'group')) # y=count -> actual counts (densities are not normalized to 1)
   + p9.geom_density(alpha = 0.5,bw = 0.75) 
   +p9.facet_grid('year ~.')
   #+ p9.theme(axis_text_x = p9.element_text(rotation = 90, hjust = 1))
   + p9.scale_x_continuous(trans = 'log2')
)
gd1

Apparently a whole bunch of the developed countries are (in absolute values) worse off in 2010 than in 1975. This seems very unlikely. Which countries are we comparing?

In [None]:
# are we comparing the same countries?
country_list_1975 = plot_data[plot_data.year == 1975].country.unique()
country_list_2010 = plot_data[plot_data.year == 2010].country.unique()

print(set(country_list_1975) == set(country_list_2010)) #are the countries the same?
print(len(country_list_1975),len(country_list_2010)) #how many countries are there in each year?

We note that in 2010 there are countries in the data for which there is no data for 1975. We are comparing apples and oranges!  
(The same error was made above with the boxplots.)
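
To find out which countries cause such a mismatch, set differences are handy. The following sketch uses a made-up toy frame with hypothetical countries `A`, `B` and `C`; the variable names are ours.

```python
import pandas as pd

# hypothetical toy data: country C has no entry for 1975
toy = pd.DataFrame({
    'country': ['A', 'B', 'A', 'B', 'C'],
    'year': [1975, 1975, 2010, 2010, 2010],
})

countries_1975 = set(toy.loc[toy.year == 1975, 'country'])
countries_2010 = set(toy.loc[toy.year == 2010, 'country'])

only_2010 = sorted(countries_2010 - countries_1975)  # present in 2010 but not in 1975
only_1975 = sorted(countries_1975 - countries_2010)  # present in 1975 but not in 2010
common = sorted(countries_1975 & countries_2010)     # present in both years

print(only_2010, only_1975, common)
```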

In [None]:
comon_countries = np.intersect1d(country_list_1975,country_list_2010) # use only countries for which data of both years is present

gd2=(p9.ggplot(plot_data[plot_data.country.isin(comon_countries)], 
          p9.aes(x='dollars_per_day', y='..count..',fill = 'group')) 
   +p9.geom_density(alpha = 0.5,bw=0.75) 
   +p9.facet_grid('year ~.')
   #+ p9.theme(axis_text_x = p9.element_text(rotation = 90, hjust = 1))
   + p9.scale_x_continuous(trans = 'log2')
)

gd2

We note a slight double hump in the data for the developing countries in the year 2010. This typically indicates that there is underlying structure, and that the data should be split into subgroups for a better understanding.

In [None]:
# we group the developing countries into new subgroups
developing = plot_data[(plot_data.country.isin(comon_countries)) & (plot_data['group']=='developing')].copy()

developing['group'] = np.select(
    [
        developing.region.isin(["Eastern Asia", "South-Eastern Asia"]),
        developing.region.isin(["Caribbean", "Central America", "South America"]),
        (developing.continent == 'Africa') & ~(developing.region.isin(["Northern Africa"]))
    ],
    ['East Asia',
     'Latin America',
     'Sub-Sahara Africa'
    ],
    default = 'others'
)

developing.head()

In [None]:
#show this data with the new subgroups
gd4=(p9.ggplot(developing, 
          p9.aes(x='dollars_per_day', y='..count..',fill = 'group')) 
   +p9.geom_density(alpha = 0.3,bw=0.75) 
   +p9.facet_grid('year ~.')
   + p9.scale_x_continuous(trans = 'log2')
)

gd4

In [None]:
#put the graphs on top of each other using position = stack 
gd5=(p9.ggplot(developing, 
          p9.aes(x='dollars_per_day', y='..count..',fill = 'group')) 
   +p9.geom_density(alpha = 0.3,bw=0.75,position = 'stack') 
   +p9.facet_grid('year ~.')
   + p9.scale_x_continuous(trans = 'log2')
)

gd5

In [None]:
def make_weights(gdf):
    gdf['weights'] = gdf['population']/gdf['population'].sum()
    return gdf

developing = developing.groupby('year',group_keys=False).apply(make_weights)

#check if the weights sum up to 1
developing.groupby('year').weights.sum()

In [None]:
gd6=(p9.ggplot(developing, 
          p9.aes(x='dollars_per_day',fill = 'group',weight = 'weights')) 
   +p9.geom_density(alpha = 0.3,bw=0.75,position = 'stack') 
   +p9.facet_grid('year ~.')
   + p9.scale_x_continuous(trans = 'log2')
)

gd6

In [None]:
gd7=(p9.ggplot(developing, 
          p9.aes(x='dollars_per_day',fill = 'group',weight = 'weights')) 
   +p9.geom_density(alpha = 0.3,bw=0.75) 
   +p9.facet_grid('year ~.')
   + p9.scale_x_continuous(trans = 'log2')
)

gd7

It seems that each subgroup has approximately the same weight, which would correspond to the same number of people. This does not seem to be correct.
Indeed, the weight argument does not work properly. You can find a discussion about it here:
https://github.com/has2k1/plotnine/issues/392

More information about the geom_density() function can be found here:

https://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_density.html

In order to get a weighted graph we generate a new dataframe in which each row is replicated in proportion to its weight. So if we had a dataframe with 3 rows with weights [50%, 25%, 25%], we would generate a new dataframe with 4 rows in which row one is duplicated. For our data this means that a country with twice as many inhabitants as another gets twice as many rows in the new dataframe.
For the weighted representation in the graph we will use the option after_stat('count').
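
The small example with weights [50%, 25%, 25%] can be sketched as follows. The toy frame is made up, and dividing by the smallest weight is an assumption that only works nicely for such simple toy weights; the helper used for the real data further down scales the weights by a power of ten instead.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with three rows and their weights
toy = pd.DataFrame({'name': ['a', 'b', 'c'], 'weights': [0.50, 0.25, 0.25]})

# number of copies per row: weight relative to the smallest weight
n_copies = np.round(toy['weights'] / toy['weights'].min()).astype(int)

# repeat every index label as often as its row should appear
idx = np.repeat(toy.index.to_numpy(), n_copies.to_numpy())
replicated = toy.loc[idx].reset_index(drop=True)

print(replicated['name'].tolist())  # ['a', 'a', 'b', 'c']
```

A density computed from the replicated rows then counts row `a` twice, which has the same effect as giving it twice the weight.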

In [None]:
df = developing[(developing.year.isin([1975,2010]))].copy()

# This helper function replicates the rows according to their weights
def weight_to_frequency(df, wt, precision=3):
    ns = np.round(((wt/sum(wt)) * (10**precision))).astype(int)  # no. times to replicate
    idx = np.repeat(df.index, ns)                     # selection indices
    df = df.loc[idx].reset_index(drop=True)     # replication
    return df

# new dataset with replicated (redundant) rows
df = weight_to_frequency(df, df.weights, precision=3)

In [None]:
gd8=(p9.ggplot(df, 
          p9.aes(x='dollars_per_day',fill = 'group')) 
   +p9.geom_density(p9.aes(y=p9.after_stat('count')),alpha = 0.3,bw=0.75) 
   +p9.facet_grid('year ~.')
   + p9.scale_x_continuous(trans = 'log2')
)
gd8

Let's check the weights of the groups in order to make sure that this graph is correct.

In [None]:
developing.groupby(['year','group']).weights.sum()

We note that East Asia and the "others" together make up about 80% of the whole developing population. Both groups have made significant economic progress, and the progress made in East Asia is especially impressive. We also note that the East Asia region has clearly overtaken the others (on a logarithmic scale) and is about to be on par with Latin America, whereas in 1975 the mean of Latin America was more than 8 times higher than that of East Asia!

We also note that there was little progress in the Sub-Saharan regions.

**Exercise**   
- Have a closer look at the data for the Sub-Saharan regions in order to show whether there really has been so little economic progress. Choose an adequate visualization.
- Have a closer look at the data for the developed countries.
  1. Play with the bw (bandwidth) parameter and use values from [0.15-0.75]. What is the effect?
  2. Which countries are the outliers at the lower end in 1975 and in 2010? Are they the same countries? How did they develop from 1975 to 2010?
  3. Which country is the outlier at the upper end in 2010?