This notebook walks through the creation of an animation in R from the TB data.

As a relative newbie to R, I strongly suspect there are more elegant ways of doing some of this. However, as a first stab, this appears to get the job done.

The first step is to load the data and the libraries. Two data-sets are available; this notebook makes use of the first one. I use **ggplot2** for the plots, **animation** for the animation, and the base package **grid** for making a gradient background to the plots. Note also the use of the **suppressMessages** function, which is quite useful for hiding the start-up messages that packages print when they load...

In [None]:
tb_1 = read.csv('../input/tubercolusis_from 2007_WHO.csv')

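suppressMessages(require('ggplot2'))   # ggplot() and map_data()
suppressMessages(require('animation')) # saveGIF()
suppressMessages(require('grid'))      # rasterGrob() and unit()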

Next I fetch the world map outlines with ggplot2's **map_data** function, which takes them from the **maps** package...

In [None]:
s = suppressMessages(map_data("world"))

Have a look at the structure of this object...

In [None]:
str(s)

It is made up of various components, including a character vector called 'region'. We'll use it later to match countries for the plots. I also want to give it a colour variable for the plots later on...

In [None]:
s$colour = 0

Looking at the data in the TB dataframe for '*number of deaths due to TB excluding HIV*' reveals that the numbers use spaces as thousands separators. These need removing. The code below changes the values from factors into characters, then uses the **gsub** function to replace each space with an empty string in the relevant column. It then changes them to a numeric data format...

In [None]:
tb_1$Number.of.deaths.due.to.tuberculosis..excluding.HIV = as.character(tb_1$Number.of.deaths.due.to.tuberculosis..excluding.HIV)

tb_1$Number.of.deaths.due.to.tuberculosis..excluding.HIV = gsub(" ", "", tb_1$Number.of.deaths.due.to.tuberculosis..excluding.HIV)

tb_1$Number.of.deaths.due.to.tuberculosis..excluding.HIV = as.numeric(tb_1$Number.of.deaths.due.to.tuberculosis..excluding.HIV)
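
As a small illustration of the same clean-up, here is a sketch on a toy vector. The values are made up for the example and are not taken from the WHO file:

```r
# Toy values (made up) that use spaces as thousands separators
raw <- factor(c("1 200", "35 000", "870"))

# Factor -> character -> strip the spaces -> numeric
cleaned <- as.numeric(gsub(" ", "", as.character(raw), fixed = TRUE))

print(cleaned)
```

The **as.character** step matters when the column is a factor: calling **as.numeric** directly on a factor returns the internal level codes rather than the numbers shown.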

I now need a list of all the countries in the TB data. To do this, I use the table function in R and save the results as a data-frame...

In [None]:
t = as.data.frame(table(tb_1$Country))

A quick look reveals that South Sudan has some missing data. To make life easier, I'll remove it...

In [None]:
ex = (t$Var1 == 'South Sudan')
t = t[!ex,]

Next I want to extract the TB death numbers from the main TB data and add them to my list of countries. The following code does this by looking up each country for a given year and writing the value to the appropriate cell of the 't' data-frame...

In [None]:
t$y_2007 = 0
t$y_2008 = 0
t$y_2009 = 0
t$y_2010 = 0
t$y_2011 = 0
t$y_2012 = 0
t$y_2013 = 0
t$y_2014 = 0

i=1

while (i<=length(t$Var1)) {

t[i,3] = tb_1$Number.of.deaths.due.to.tuberculosis..excluding.HIV[tb_1$Country == t[i,1] & tb_1$Year == 2007]
t[i,4] = tb_1$Number.of.deaths.due.to.tuberculosis..excluding.HIV[tb_1$Country == t[i,1] & tb_1$Year == 2008]
t[i,5] = tb_1$Number.of.deaths.due.to.tuberculosis..excluding.HIV[tb_1$Country == t[i,1] & tb_1$Year == 2009]
t[i,6] = tb_1$Number.of.deaths.due.to.tuberculosis..excluding.HIV[tb_1$Country == t[i,1] & tb_1$Year == 2010]
t[i,7] = tb_1$Number.of.deaths.due.to.tuberculosis..excluding.HIV[tb_1$Country == t[i,1] & tb_1$Year == 2011]
t[i,8] = tb_1$Number.of.deaths.due.to.tuberculosis..excluding.HIV[tb_1$Country == t[i,1] & tb_1$Year == 2012]
t[i,9] = tb_1$Number.of.deaths.due.to.tuberculosis..excluding.HIV[tb_1$Country == t[i,1] & tb_1$Year == 2013]
t[i,10] = tb_1$Number.of.deaths.due.to.tuberculosis..excluding.HIV[tb_1$Country == t[i,1] & tb_1$Year == 2014]

i=i+1

}
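
One of the "more elegant ways" might be to let base R's **reshape** function do the long-to-wide conversion in a single call. The following is only a sketch on made-up data; the data-frame `long` and its `deaths` column are hypothetical stand-ins for the TB data, not the real column names:

```r
# Toy long-format data (made up): one row per country and year
long <- data.frame(
  Country = rep(c("A", "B"), each = 3),
  Year    = rep(2007:2009, times = 2),
  deaths  = c(100, 110, 90, 2000, 1800, 1500)
)

# One row per country, one column per year
wide <- reshape(long, idvar = "Country", timevar = "Year", direction = "wide")

# reshape() names the new columns deaths.2007, deaths.2008, ...;
# rename them to the y_2007 style used in this notebook
names(wide) <- sub("^deaths\\.", "y_", names(wide))

print(wide)
```

Unlike the loop, this does not assume that every country has exactly one value for every year; a missing combination simply becomes NA.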

We now have a data-frame of 193 countries and the corresponding TB death data for each year. A quick look shows that a fair few countries have very low numbers of TB deaths. Therefore, let's remove those with fewer than 1000 deaths (an arbitrary cut-off) in the first year...

In [None]:
ex = (t$y_2007 < 1000)
t = t[!ex,]

Later I match the countries in the map data (the 's' data-frame) to those in the 't' data-frame, but not all will match. To check where the mis-matches are, I used the following code. This looks for matches and stores the counts of true and false hits for each country; a country with no match at all ends up with NA in the 'True' column...

In [None]:
z=1

c_check = data.frame(t$Var1)
c_check$False = 0
c_check$True = 0

while (z <= length(t$Var1)) {
  
  temp = as.data.frame(table(s$region == t[z,1]))
  c_check[z,2] = temp[1,2]
  c_check[z,3] = temp[2,2]
  z=z+1  
  
}

table(is.na(c_check$True))
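
A shorter way to list the unmatched names directly might be **setdiff**. This is a sketch on made-up name vectors, where `tb_names` and `map_names` are hypothetical stand-ins for `t$Var1` and `s$region`:

```r
# Toy name vectors (made up for illustration)
tb_names  <- c("Russian Federation", "India", "Viet Nam")
map_names <- c("Russia", "India", "Vietnam", "India")  # map data repeats each region many times

# Names in the TB list that never appear in the map data
unmatched <- setdiff(tb_names, unique(map_names))

print(unmatched)
```

This returns the offending names themselves rather than a count, which makes the manual renaming step easier.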

This shows 9 mis-matches. Therefore, I manually changed them in the 't' data-frame to match the country names in the 's' data-frame...

In [None]:
t$Var1 = as.character(t$Var1)

t$Var1[t$Var1 == 'Congo'] = 'Republic of Congo'
t$Var1[t$Var1 == 'Cote d\'Ivoire'] = 'Ivory Coast'
t$Var1[t$Var1 == 'Democratic People\'s Republic of Korea'] = 'North Korea'
t$Var1[t$Var1 == 'Republic of Korea'] = 'South Korea'
t$Var1[t$Var1 == 'Iran (Islamic Republic of)'] = 'Iran'
t$Var1[t$Var1 == 'Lao People\'s Democratic Republic'] = 'Laos'
t$Var1[t$Var1 == 'Russian Federation'] = 'Russia'
t$Var1[t$Var1 == 'United Republic of Tanzania'] = 'Tanzania'
t$Var1[t$Var1 == 'Viet Nam'] = 'Vietnam' # spelling used in the map data; confirm with unique(s$region)

For my first attempt at this animation, I just plotted the raw numbers. However, I thought it would be more interesting to see which countries are improving and which are deteriorating. Therefore, I changed the raw numbers into percentage changes...

In [None]:
i=1

while (i<=length(t$Var1)) {
  
  t[i,10] = ((t[i,10] - t[i,3]) / t[i,3]) * 100
  t[i,9] = ((t[i,9] - t[i,3]) / t[i,3]) * 100
  t[i,8] = ((t[i,8] - t[i,3]) / t[i,3]) * 100
  t[i,7] = ((t[i,7] - t[i,3]) / t[i,3]) * 100
  t[i,6] = ((t[i,6] - t[i,3]) / t[i,3]) * 100
  t[i,5] = ((t[i,5] - t[i,3]) / t[i,3]) * 100
  t[i,4] = ((t[i,4] - t[i,3]) / t[i,3]) * 100
  t[i,3] = ((t[i,3] - t[i,3]) / t[i,3]) * 100
  
  i=i+1
  
}
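
The same percentage-change calculation can probably be done without a loop, because arithmetic on a data-frame is applied column by column. A minimal sketch on a made-up data-frame `toy` that mimics the layout of 't':

```r
# Toy data-frame (made up) with the same layout as 't'
toy <- data.frame(
  Var1   = c("A", "B"),
  Freq   = c(8, 8),
  y_2007 = c(1000, 2000),
  y_2008 = c(1100, 1500),
  y_2009 = c(900, 1000)
)

year_cols <- grep("^y_", names(toy))  # positions of the year columns
baseline  <- toy$y_2007               # save the baseline before overwriting it

# Percentage change of every year column relative to 2007
toy[year_cols] <- (toy[year_cols] - baseline) / baseline * 100

print(toy)
```

Saving the baseline first avoids the ordering issue in the loop, where the 2007 column has to be converted last.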

Next I created a nice gradient background for the maps (just ... because). I had no idea how to do this but a quick internet search helped...

In [None]:
g <- rasterGrob(blues9, width=unit(1,"npc"), height = unit(1,"npc"), 
                interpolate = TRUE) 

Finally we get to the plotting and the creation of the animation. This has several parts to it. They are, in order,

- Set the colour variable in the 's' data-frame according to the percentage-change numbers in the 't' data-frame. It does this by matching the country name in the 's' data-frame with the Var1 variable in the 't' data-frame (which I've been too lazy to rename as 'country')

- Create a plot, first setting the background to the gradient created above

- Plot the actual map

- Colour the countries according to the now updated colour variable

- Set colours and a gradient for the numeric colour variable. I chose the limits by looking at the upper and lower limits of the data using the **range** function

- Remove the grid lines

- Add a title. Note that I use the **paste** function to combine text with the updating year variable

Crucially, note that these plots are wrapped in the **saveGIF** function from the animation package. After the plotting expression it takes further arguments, such as the name of your animated GIF, the interval between frames, and the frame width and height. Also note that each plot must be wrapped in the print function, because ggplot objects are not drawn automatically inside a loop.

This then uses a program called [ImageMagick][1] to stitch the plots together. You'll need this installed locally for running such code. Thankfully, Kaggle has it installed!


  [1]: http://www.imagemagick.org/script/index.php

In [None]:
i=1

saveGIF(while (i<=8) {

  y=1
  
  while (y<=length(t$Var1)) {
    
    s$colour[t[y,1] == s$region] = (t[y,i+2])
    y=y+1
  }
    
  print(m <- ggplot(s, aes(x=long, y=lat, group=group, fill=colour)) + #Set ggplot2
          
          annotation_custom(g, xmin=-Inf, xmax=Inf, ymin=-Inf, ymax=Inf) +
          
          geom_polygon(alpha=1) + #Set transparency
          
          geom_path(data = s, aes(x=long, y=lat, group=group), colour="black") + #Plot the Earth
          
          scale_fill_gradient(low = "green", high = "red", guide = "colourbar", limits=c(-77,77)) + #Set the colour gradient and its limits
                                
          theme(plot.title = element_text(size = rel(2)),
                panel.grid.major = element_blank(), panel.grid.minor = element_blank()) + #Enlarge the title and remove the grid lines
          
          ggtitle(paste("The Spread of TB: ", 2006+i)))
  
  ani.pause()
  
  i=i+1
  
}, movie.name = "tb_ani.gif", interval = 1.5, convert = "convert", ani.width = 800, 
ani.height = 560)
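
The inner loop that copies each country's value into the colour variable could likely be replaced by a single lookup with **match**. A sketch on made-up data, where `map_df` and `tb_df` are hypothetical stand-ins for 's' and 't':

```r
# Toy stand-ins (made up): map data has many rows per region
map_df <- data.frame(region = c("India", "India", "Russia", "France"),
                     colour = 0)
tb_df  <- data.frame(Var1   = c("Russia", "India"),
                     y_2008 = c(-12.5, 4))

# For each map row, find the row of tb_df with the same country name
idx     <- match(map_df$region, tb_df$Var1)
matched <- !is.na(idx)

# Copy the value across for matched regions; the rest keep colour 0
map_df$colour[matched] <- tb_df$y_2008[idx[matched]]

print(map_df)
```

Regions with no TB data (France in this toy case) keep their initial colour of 0, which is the same behaviour as the loop.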

So, to reiterate, this animation isn't showing raw numbers of TB deaths. Instead, it's showing percentage increases or decreases relative to each country's number of TB deaths in the starting year of 2007. So countries that become red are deteriorating, and countries that become green are improving.

Click on the **Output** tab at the top to see the animated GIF.