In [9]:
options(jupyter.rich_display = FALSE)

# Basic data science with R

## Reshaping data

Multivariate data can be represented in wide or long format:

- Reshaping data from wide format to long format is known as "melting"
- Reshaping data from long format to wide format is known as "casting"

Each shape suits different tasks: wide data is compact and easy to read as a table, while long data is what many grouping, modeling, and plotting functions expect.

### Melting

A chessboard is an 8x8 grid.

Suppose we have a chessboard in a random configuration: each of the 32 pieces (16 per player) is given a unique number from 1 to 32, and empty squares are 0.

First create a random vector:

In [11]:
chess_vec <- integer(64)
chess_vec[1:32] <- 1:32
chess_vec_r <- sample(chess_vec)
chess_vec_r

 [1]  0 20 18  0 28  0 27  4  0  0  0 17 32  0  0 26  0  2 12  0 31  0  8 14 29
[26]  3  9  0 24  0  0  0 10  0  0  0  0  0 16 22 19  0  5  0  0  0  0  0 11  0
[51]  0  0 25  7  1 15  0 23 30  0  0  6 13 21

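And convert to a matrix. Note that matrix() fills the values column by column, and that the output below comes from a different random draw than the vector printed above, so the two do not line up: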

In [10]:
chess_mat <- matrix(chess_vec_r, nrow = 8)
chess_mat

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]  0   19   20   4    22   11    0    6  
[2,] 24    0   15   0     7    8    1    0  
[3,]  0    0    0   0    31   16    0   26  
[4,] 14    0    5   0    13   25    0    0  
[5,]  0   18    3   9     0   32   27    0  
[6,]  0   17    0   0     2    0   28   23  
[7,]  0    0    0   0     0   21    0   30  
[8,] 10    0    0   0     0    0   29   12  

Our task is to convert this matrix into a long format with three columns:
- One for the row index of the original matrix
- One for the column index of the original matrix
- One for the value
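
Before turning to reshape(), the target layout can be sketched directly with the base functions row() and col(). This is only an illustration on a small made-up matrix (`m` and `m_long` are hypothetical names, not objects used elsewhere in this notebook):

```r
# Toy 2x3 matrix standing in for the chessboard (values are made up).
m <- matrix(c(0, 5, 3, 0, 0, 7), nrow = 2)

# row() and col() return index matrices with the same shape as m;
# as.vector() flattens all three column by column, so they stay aligned.
m_long <- data.frame(rows  = as.vector(row(m)),
                     cols  = as.vector(col(m)),
                     piece = as.vector(m))
m_long
```

Each row of `m_long` describes one cell of the matrix, which is the three-column shape we are aiming for with the chessboard.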

We will use the base R function reshape() (from the stats package, which is loaded by default):

In [12]:
?reshape

```
reshape(data, varying = NULL, v.names = NULL, timevar = "time",
             idvar = "id", ids = 1:NROW(data),
             times = seq_along(varying[[1]]),
             drop = NULL, direction, new.row.names = NULL,
             sep = ".",
             split = if (sep == "") {
                 list(regexp = "[A-Za-z][0-9]", include = TRUE)
             } else {
                 list(regexp = sep, include = FALSE, fixed = TRUE)}
             )
     
Arguments:

    data: a data frame

 varying: names of sets of variables in the wide format that correspond
          to single variables in long format (‘time-varying’).  This is
          canonically a list of vectors of variable names, but it can
          optionally be a matrix of names, or a single vector of names.
          In each case, the names can be replaced by indices which are
          interpreted as referring to ‘names(data)’.  See ‘Details’ for
          more details and options.

 v.names: names of variables in the long format that correspond to
          multiple variables in the wide format.  See ‘Details’.

 timevar: the variable in long format that differentiates multiple
          records from the same group or individual.  If more than one
          record matches, the first will be taken (with a warning).

   idvar: Names of one or more variables in long format that identify
          multiple records from the same group/individual.  These
          variables may also be present in wide format.

     ids: the values to use for a newly created ‘idvar’ variable in
          long format.

   times: the values to use for a newly created ‘timevar’ variable in
          long format.  See ‘Details’.

    drop: a vector of names of variables to drop before reshaping.

direction: character string, partially matched to either ‘"wide"’ to
          reshape to wide format, or ‘"long"’ to reshape to long
          format.
```

First, let's convert the matrix into a data.frame:

In [21]:
chess_df <- as.data.frame(chess_mat)
chess_df

  V1 V2 V3 V4 V5 V6 V7 V8
1  0 19 20 4  22 11  0  6
2 24  0 15 0   7  8  1  0
3  0  0  0 0  31 16  0 26
4 14  0  5 0  13 25  0  0
5  0 18  3 9   0 32 27  0
6  0 17  0 0   2  0 28 23
7  0  0  0 0   0 21  0 30
8 10  0  0 0   0  0 29 12

- idvar names the variable(s) identifying each record. If no such column exists in the wide data (as here), reshape() creates it from ids, which defaults to the row numbers
- varying lists the wide-format columns to be stacked into long format
- v.names is the name of the long-format column that will hold the stacked values
- new.row.names can be used to assign row names to the resulting data frame
- direction is the target format, "long" or "wide"
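
As a minimal sketch of how these arguments interact, the call below melts a made-up 2x3 data frame (`wide` and `long` are hypothetical names). It also passes timevar explicitly, which is one way to name the index column directly instead of renaming the default "time" afterwards:

```r
# Toy wide data: 2 rows, 3 value columns (values are made up).
wide <- data.frame(V1 = c(0, 24), V2 = c(19, 0), V3 = c(20, 15))

long <- reshape(wide,
                varying   = names(wide),           # columns to stack
                v.names   = "piece",               # name of the stacked value column
                timevar   = "cols",                # name of the index column (default "time")
                times     = seq_along(wide),       # values for the index column: 1, 2, 3
                idvar     = "rows",                # does not exist yet, so it is created
                ids       = seq_len(nrow(wide)),   # values for the new id column: 1, 2
                direction = "long")
long
```

The result has one row per original cell, with `cols` recording which wide column the value came from and `rows` recording the original row.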

In [41]:
chess_long <- reshape(chess_df,
                      idvar = "rows",
                      varying = 1:8,
                      v.names = "piece",
                      direction = "long")

In [37]:
chess_long

    time piece rows
1.1 1     0    1   
2.1 1    24    2   
3.1 1     0    3   
4.1 1    14    4   
5.1 1     0    5   
6.1 1     0    6   
7.1 1     0    7   
8.1 1    10    8   
1.2 2    19    1   
2.2 2     0    2   
3.2 2     0    3   
4.2 2     0    4   
5.2 2    18    5   
6.2 2    17    6   
7.2 2     0    7   
8.2 2     0    8   
1.3 3    20    1   
2.3 3    15    2   
3.3 3     0    3   
4.3 3     5    4   
5.3 3     3    5   
6.3 3     0    6   
7.3 3     0    7   
8.3 3     0    8   
1.4 4     4    1   
2.4 4     0    2   
3.4 4     0    3   
4.4 4     0    4   
5.4 4     9    5   
6.4 4     0    6   
⋮   ⋮    ⋮     ⋮   
3.5 5    31    3   
4.5 5    13    4   
5.5 5     0    5   
6.5 5     2    6   
7.5 5     0    7   
8.5 5     0    8   
1.6 6    11    1   
2.6 6     8    2   
3.6 6    16    3   
4.6 6    25    4   
5.6 6    32    5   
6.6 6     0    6   
7.6 6    21    7   
8.6 6     0    8   
1.7 7     0    1   
2.7 7     1    2   
3.7 7     0    3   
4.7 7     0    4   


We can change the default name of column "time" to "cols" and rearrange the columns: 

In [42]:
names(chess_long)[1] <- "cols"
chess_long <- chess_long[,c("cols", "rows", "piece")]
chess_long

    cols rows piece
1.1 1    1     0   
2.1 1    2    24   
3.1 1    3     0   
4.1 1    4    14   
5.1 1    5     0   
6.1 1    6     0   
7.1 1    7     0   
8.1 1    8    10   
1.2 2    1    19   
2.2 2    2     0   
3.2 2    3     0   
4.2 2    4     0   
5.2 2    5    18   
6.2 2    6    17   
7.2 2    7     0   
8.2 2    8     0   
1.3 3    1    20   
2.3 3    2    15   
3.3 3    3     0   
4.3 3    4     5   
5.3 3    5     3   
6.3 3    6     0   
7.3 3    7     0   
8.3 3    8     0   
1.4 4    1     4   
2.4 4    2     0   
3.4 4    3     0   
4.4 4    4     0   
5.4 4    5     9   
6.4 4    6     0   
⋮   ⋮    ⋮    ⋮    
3.5 5    3    31   
4.5 5    4    13   
5.5 5    5     0   
6.5 5    6     2   
7.5 5    7     0   
8.5 5    8     0   
1.6 6    1    11   
2.6 6    2     8   
3.6 6    3    16   
4.6 6    4    25   
5.6 6    5    32   
6.6 6    6     0   
7.6 6    7    21   
8.6 6    8     0   
1.7 7    1     0   
2.7 7    2     1   
3.7 7    3     0   
4.7 7    4     0   


### Casting

Now let's convert the long chessboard back to a wide 8x8 one:

- idvar is the variable that identifies each row of the wide result
- timevar is the variable whose distinct values become separate columns in the wide format
- v.names is the variable holding the values to spread across those columns
- direction is the target format, here "wide"

In [51]:
chess_wide <- reshape(chess_long,
       idvar = "rows",
       v.names = "piece",
       timevar = "cols",
       direction = "wide")

chess_wide

    rows piece.1 piece.2 piece.3 piece.4 piece.5 piece.6 piece.7 piece.8
1.1 1     0      19      20      4       22      11       0       6     
2.1 2    24       0      15      0        7       8       1       0     
3.1 3     0       0       0      0       31      16       0      26     
4.1 4    14       0       5      0       13      25       0       0     
5.1 5     0      18       3      9        0      32      27       0     
6.1 6     0      17       0      0        2       0      28      23     
7.1 7     0       0       0      0        0      21       0      30     
8.1 8    10       0       0      0        0       0      29      12     

We delete the first column, which is unnecessary in our example:

In [52]:
chess_wide <- chess_wide[,-1]
chess_wide

    piece.1 piece.2 piece.3 piece.4 piece.5 piece.6 piece.7 piece.8
1.1  0      19      20      4       22      11       0       6     
2.1 24       0      15      0        7       8       1       0     
3.1  0       0       0      0       31      16       0      26     
4.1 14       0       5      0       13      25       0       0     
5.1  0      18       3      9        0      32      27       0     
6.1  0      17       0      0        2       0      28      23     
7.1  0       0       0      0        0      21       0      30     
8.1 10       0       0      0        0       0      29      12     

Finally, let's check that all values match the original matrix:

In [54]:
all(chess_wide == chess_mat)

[1] TRUE
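
Because sample() is random, the printed outputs above depend on when each cell was run. The sketch below repeats the melt-and-cast round trip end to end with a fixed seed so that it is repeatable. The seed value and the `board*` names are arbitrary choices for illustration, not part of the original notebook:

```r
set.seed(42)  # arbitrary seed, used only to make the sketch repeatable

# 32 numbered pieces and 32 empty squares, shuffled into an 8x8 board.
board <- matrix(sample(c(1:32, integer(32))), nrow = 8)
board_df <- as.data.frame(board)

# Melt: one row per square.
board_long <- reshape(board_df,
                      varying   = names(board_df),
                      v.names   = "piece",
                      timevar   = "cols",
                      idvar     = "rows",
                      direction = "long")

# Cast: back to one row per board row.
board_wide <- reshape(board_long,
                      v.names   = "piece",
                      timevar   = "cols",
                      idvar     = "rows",
                      direction = "wide")

# Drop the id column and compare values with the original matrix.
all(as.matrix(board_wide[, -1]) == board)
```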

### Real data example: Reshaping IMF WEO database

First read the data into R with read.delim(), a read.table() wrapper for tab-separated files. Note that the file is read as tab-separated text despite its .xls extension:

In [203]:
?read.table

In [216]:
# read.delim() is read.table() with tab-separated defaults (sep = "\t", header = TRUE)
weo <- read.delim("data/WEO_Data.xls", na.strings = c("n/a", "--", ""), stringsAsFactors = TRUE, dec = ".")

Let's view the structure of the data:

In [219]:
str(weo)

'data.frame':	8730 obs. of  52 variables:
 $ WEO.Country.Code     : int  512 512 512 512 512 512 512 512 512 512 ...
 $ ISO                  : Factor w/ 194 levels "ABW","AFG","AGO",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ WEO.Subject.Code     : Factor w/ 45 levels "BCA","BCA_NGDPD",..: 25 26 22 27 39 23 30 31 29 28 ...
 $ Country              : Factor w/ 194 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Subject.Descriptor   : Factor w/ 29 levels "Current account balance",..: 14 14 15 15 15 16 12 12 13 13 ...
 $ Units                : Factor w/ 12 levels "Index","National currency",..: 2 5 2 12 11 1 2 10 2 12 ...
 $ Scale                : Factor w/ 3 levels "Billions","Millions",..: 1 NA 1 1 1 NA 3 3 3 3 ...
 $ X1980                : num  NA NA NA NA NA NA NA NA NA NA ...
 $ X1981                : num  NA NA NA NA NA NA NA NA NA NA ...
 $ X1982                : num  NA NA NA NA NA NA NA NA NA NA ...
 $ X1983                : num  NA NA NA NA NA NA NA NA NA NA ...
 $ X1984                : 

Let's get the names and indices of factor variables:

In [220]:
factors <- which(sapply(weo, is.factor))
factors

               ISO   WEO.Subject.Code            Country Subject.Descriptor 
                 2                  3                  4                  5 
             Units              Scale 
                 6                  7 

And get the unique values (or levels) of each of these factor variables:

In [221]:
lapply(weo[, factors], levels)

$ISO
  [1] "ABW" "AFG" "AGO" "ALB" "ARE" "ARG" "ARM" "ATG" "AUS" "AUT" "AZE" "BDI"
 [13] "BEL" "BEN" "BFA" "BGD" "BGR" "BHR" "BHS" "BIH" "BLR" "BLZ" "BOL" "BRA"
 [25] "BRB" "BRN" "BTN" "BWA" "CAF" "CAN" "CHE" "CHL" "CHN" "CIV" "CMR" "COD"
 [37] "COG" "COL" "COM" "CPV" "CRI" "CYP" "CZE" "DEU" "DJI" "DMA" "DNK" "DOM"
 [49] "DZA" "ECU" "EGY" "ERI" "ESP" "EST" "ETH" "FIN" "FJI" "FRA" "FSM" "GAB"
 [61] "GBR" "GEO" "GHA" "GIN" "GMB" "GNB" "GNQ" "GRC" "GRD" "GTM" "GUY" "HKG"
 [73] "HND" "HRV" "HTI" "HUN" "IDN" "IND" "IRL" "IRN" "IRQ" "ISL" "ISR" "ITA"
 [85] "JAM" "JOR" "JPN" "KAZ" "KEN" "KGZ" "KHM" "KIR" "KNA" "KOR" "KWT" "LAO"
 [97] "LBN" "LBR" "LBY" "LCA" "LKA" "LSO" "LTU" "LUX" "LVA" "MAC" "MAR" "MDA"
[109] "MDG" "MDV" "MEX" "MHL" "MKD" "MLI" "MLT" "MMR" "MNE" "MNG" "MOZ" "MRT"
[121] "MUS" "MWI" "MYS" "NAM" "NER" "NGA" "NIC" "NLD" "NOR" "NPL" "NRU" "NZL"
[133] "OMN" "PAK" "PAN" "PER" "PHL" "PLW" "PNG" "POL" "PRI" "PRT" "PRY" "QAT"
[145] "ROU" "RUS" "RWA" "SAU" "SDN" "SEN" "SGP" "SLB" "SLE"

#### Melting GDP per capita (PPP)

Now we focus on a single series, the GDP per capita series NGDPRPPPPC, measured in "Purchasing power parity; 2011 international dollar" units.

So we filter the dataset:

In [222]:
weo_ppp <- weo[weo$WEO.Subject.Code == "NGDPRPPPPC",]

In [223]:
str(weo_ppp)

'data.frame':	194 obs. of  52 variables:
 $ WEO.Country.Code     : int  512 914 612 614 311 213 911 314 193 122 ...
 $ ISO                  : Factor w/ 194 levels "ABW","AFG","AGO",..: 2 4 49 3 8 6 7 1 9 10 ...
 $ WEO.Subject.Code     : Factor w/ 45 levels "BCA","BCA_NGDPD",..: 31 31 31 31 31 31 31 31 31 31 ...
 $ Country              : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Subject.Descriptor   : Factor w/ 29 levels "Current account balance",..: 12 12 12 12 12 12 12 12 12 12 ...
 $ Units                : Factor w/ 12 levels "Index","National currency",..: 10 10 10 10 10 10 10 10 10 10 ...
 $ Scale                : Factor w/ 3 levels "Billions","Millions",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ X1980                : num  NA 4675 10804 5277 9082 ...
 $ X1981                : num  NA 4845 10793 4912 9503 ...
 $ X1982                : num  NA 4882 11127 4785 9605 ...
 $ X1983                : num  NA 4837 11355 4859 10268 ...
 $ X1984                : num  NA 4834 1161

Now we want a long data frame with the country names, the country codes (WEO and ISO), the years, and the GDP per capita (PPP) values:

In [224]:
names(weo_ppp)

 [1] "WEO.Country.Code"      "ISO"                   "WEO.Subject.Code"     
 [4] "Country"               "Subject.Descriptor"    "Units"                
 [7] "Scale"                 "X1980"                 "X1981"                
[10] "X1982"                 "X1983"                 "X1984"                
[13] "X1985"                 "X1986"                 "X1987"                
[16] "X1988"                 "X1989"                 "X1990"                
[19] "X1991"                 "X1992"                 "X1993"                
[22] "X1994"                 "X1995"                 "X1996"                
[25] "X1997"                 "X1998"                 "X1999"                
[28] "X2000"                 "X2001"                 "X2002"                
[31] "X2003"                 "X2004"                 "X2005"                
[34] "X2006"                 "X2007"                 "X2008"                
[37] "X2009"                 "X2010"                 "X2011"                

In [225]:
cols <- sprintf("X%s", 1980:2023)
cols

 [1] "X1980" "X1981" "X1982" "X1983" "X1984" "X1985" "X1986" "X1987" "X1988"
[10] "X1989" "X1990" "X1991" "X1992" "X1993" "X1994" "X1995" "X1996" "X1997"
[19] "X1998" "X1999" "X2000" "X2001" "X2002" "X2003" "X2004" "X2005" "X2006"
[28] "X2007" "X2008" "X2009" "X2010" "X2011" "X2012" "X2013" "X2014" "X2015"
[37] "X2016" "X2017" "X2018" "X2019" "X2020" "X2021" "X2022" "X2023"

In [226]:
weo_ppp_long <- reshape(weo_ppp,
                        idvar = c("WEO.Country.Code", "Country"),
                        varying = cols,
                        times = cols,
                        v.names = "NGDPRPPPPC",
                        direction = "long")

In [227]:
str(weo_ppp_long)

'data.frame':	8536 obs. of  10 variables:
 $ WEO.Country.Code     : int  512 914 612 614 311 213 911 314 193 122 ...
 $ ISO                  : Factor w/ 194 levels "ABW","AFG","AGO",..: 2 4 49 3 8 6 7 1 9 10 ...
 $ WEO.Subject.Code     : Factor w/ 45 levels "BCA","BCA_NGDPD",..: 31 31 31 31 31 31 31 31 31 31 ...
 $ Country              : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Subject.Descriptor   : Factor w/ 29 levels "Current account balance",..: 12 12 12 12 12 12 12 12 12 12 ...
 $ Units                : Factor w/ 12 levels "Index","National currency",..: 10 10 10 10 10 10 10 10 10 10 ...
 $ Scale                : Factor w/ 3 levels "Billions","Millions",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Estimates.Start.After: int  2016 2011 2017 2015 2011 2016 2014 2017 2017 2017 ...
 $ time                 : chr  "X1980" "X1980" "X1980" "X1980" ...
 $ NGDPRPPPPC           : num  NA 4675 10804 5277 9082 ...
 - attr(*, "reshapeLong")=List of 4
  ..$ varying:List of 1
  .. ..$

We can clear rownames:

In [228]:
rownames(weo_ppp_long) <- NULL

In [229]:
weo_ppp_long_2 <- weo_ppp_long[,c("Country", "WEO.Country.Code", "ISO", "time", "NGDPRPPPPC")]
weo_ppp_long_2

     Country                  WEO.Country.Code ISO time  NGDPRPPPPC
1    Afghanistan              512              AFG X1980        NA 
2    Albania                  914              ALB X1980  4674.883 
3    Algeria                  612              DZA X1980 10804.443 
4    Angola                   614              AGO X1980  5276.667 
5    Antigua and Barbuda      311              ATG X1980  9081.960 
6    Argentina                213              ARG X1980 14709.172 
7    Armenia                  911              ARM X1980        NA 
8    Aruba                    314              ABW X1980        NA 
9    Australia                193              AUS X1980 24398.012 
10   Austria                  122              AUT X1980 26129.809 
11   Azerbaijan               912              AZE X1980        NA 
12   The Bahamas              313              BHS X1980 27652.318 
13   Bahrain                  419              BHR X1980 40934.983 
14   Bangladesh               513              B

We can convert the year values from "X1980" to 1980 and store them as integers:

In [230]:
weo_ppp_long_2$time <- as.integer(gsub("X", "", weo_ppp_long_2$time))
weo_ppp_long_2

     Country                  WEO.Country.Code ISO time NGDPRPPPPC
1    Afghanistan              512              AFG 1980        NA 
2    Albania                  914              ALB 1980  4674.883 
3    Algeria                  612              DZA 1980 10804.443 
4    Angola                   614              AGO 1980  5276.667 
5    Antigua and Barbuda      311              ATG 1980  9081.960 
6    Argentina                213              ARG 1980 14709.172 
7    Armenia                  911              ARM 1980        NA 
8    Aruba                    314              ABW 1980        NA 
9    Australia                193              AUS 1980 24398.012 
10   Austria                  122              AUT 1980 26129.809 
11   Azerbaijan               912              AZE 1980        NA 
12   The Bahamas              313              BHS 1980 27652.318 
13   Bahrain                  419              BHR 1980 40934.983 
14   Bangladesh               513              BGD 1980  1159.
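
As an alternative sketch, the year conversion can be folded into the reshape() call itself by passing integer values to times and naming the index column with timevar. The data frame below is a made-up three-country extract; `toy`, `year_cols`, and `toy_long` are hypothetical names:

```r
# Hypothetical extract: three countries, three years (values are invented).
toy <- data.frame(Country = c("A", "B", "C"),
                  X1980 = c(1.0, 2.0, NA),
                  X1981 = c(1.1, 2.1, 3.1),
                  X1982 = c(1.2, 2.2, 3.2))

# Pick the year columns by pattern rather than by position.
year_cols <- grep("^X[0-9]{4}$", names(toy), value = TRUE)

toy_long <- reshape(toy,
                    idvar     = "Country",
                    varying   = year_cols,
                    v.names   = "value",
                    timevar   = "year",
                    times     = as.integer(sub("^X", "", year_cols)),
                    direction = "long")
rownames(toy_long) <- NULL
toy_long
```

With this approach the `year` column is already an integer, so no gsub() step is needed afterwards.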

#### Casting 2016 data

Now let's select a single year, e.g. 2016, and create a wide data frame with subjects in the columns and countries in the rows:

In [231]:
names(weo_ppp)

 [1] "WEO.Country.Code"      "ISO"                   "WEO.Subject.Code"     
 [4] "Country"               "Subject.Descriptor"    "Units"                
 [7] "Scale"                 "X1980"                 "X1981"                
[10] "X1982"                 "X1983"                 "X1984"                
[13] "X1985"                 "X1986"                 "X1987"                
[16] "X1988"                 "X1989"                 "X1990"                
[19] "X1991"                 "X1992"                 "X1993"                
[22] "X1994"                 "X1995"                 "X1996"                
[25] "X1997"                 "X1998"                 "X1999"                
[28] "X2000"                 "X2001"                 "X2002"                
[31] "X2003"                 "X2004"                 "X2005"                
[34] "X2006"                 "X2007"                 "X2008"                
[37] "X2009"                 "X2010"                 "X2011"                

In [232]:
weo_2016 <- weo[,c(names(weo_ppp)[1:7], "X2016")]

In [233]:
str(weo_2016)

'data.frame':	8730 obs. of  8 variables:
 $ WEO.Country.Code  : int  512 512 512 512 512 512 512 512 512 512 ...
 $ ISO               : Factor w/ 194 levels "ABW","AFG","AGO",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ WEO.Subject.Code  : Factor w/ 45 levels "BCA","BCA_NGDPD",..: 25 26 22 27 39 23 30 31 29 28 ...
 $ Country           : Factor w/ 194 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Subject.Descriptor: Factor w/ 29 levels "Current account balance",..: 14 14 15 15 15 16 12 12 13 13 ...
 $ Units             : Factor w/ 12 levels "Index","National currency",..: 2 5 2 12 11 1 2 10 2 12 ...
 $ Scale             : Factor w/ 3 levels "Billions","Millions",..: 1 NA 1 1 1 NA 3 3 3 3 ...
 $ X2016             : num  493.07 2.16 1318.48 19.43 66.38 ...


In [234]:
?reshape

In [235]:
weo_2016_wide <- reshape(weo_2016,
                      idvar = c("WEO.Country.Code", "Country"),
                      v.names = "X2016",
                      timevar = "WEO.Subject.Code",
                      drop = c("Subject.Descriptor", "Units", "Scale"),
                      direction = "wide")

In [236]:
str(weo_2016_wide)

'data.frame':	194 obs. of  48 variables:
 $ WEO.Country.Code  : int  512 914 612 614 311 213 911 314 193 122 ...
 $ ISO               : Factor w/ 194 levels "ABW","AFG","AGO",..: 2 4 49 3 8 6 7 1 9 10 ...
 $ Country           : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ X2016.NGDP_R      : num  493.07 760.32 7270.16 1588.96 3.25 ...
 $ X2016.NGDP_RPCH   : num  2.16 3.35 3.2 -2.58 5.35 ...
 $ X2016.NGDP        : num  1318.48 1475.25 17525.1 16549.56 3.94 ...
 $ X2016.NGDPD       : num  19.43 11.88 160.13 101.12 1.46 ...
 $ X2016.PPPGDP      : num  66.38 34.03 609.7 194.89 2.29 ...
 $ X2016.NGDP_D      : num  267 194 241 1042 121 ...
 $ X2016.NGDPRPC     : num  14228 264357 178355 57774 35967 ...
 $ X2016.NGDPRPPPPC  : num  1774 10958 13854 6563 23504 ...
 $ X2016.NGDPPC      : num  38045 512934 429934 601737 43674 ...
 $ X2016.NGDPDPC     : num  561 4132 3928 3677 16176 ...
 $ X2016.PPPPC       : num  1916 11831 14958 7086 25376 ...
 $ X2016.NGAP_NPGDP  : num  NA

In [237]:
weo_2016_wide

     WEO.Country.Code ISO Country                  X2016.NGDP_R X2016.NGDP_RPCH
1    512              AFG Afghanistan                493.073     2.164         
46   914              ALB Albania                    760.317     3.352         
91   612              DZA Algeria                   7270.163     3.200         
136  614              AGO Angola                    1588.961    -2.580         
181  311              ATG Antigua and Barbuda          3.247     5.348         
226  213              ARG Argentina                  708.338    -1.823         
271  911              ARM Armenia                   3367.848     0.261         
316  314              ABW Aruba                        3.310    -0.124         
361  193              AUS Australia                 1678.081     2.606         
406  122              AUT Austria                    317.149     1.451         
451  912              AZE Azerbaijan                  28.464    -3.100         
496  313              BHS The Bahamas   

The new column names carry an "X2016." prefix, which we can strip:

In [238]:
names_2016 <- names(weo_2016_wide)
names_2016

 [1] "WEO.Country.Code"   "ISO"                "Country"           
 [4] "X2016.NGDP_R"       "X2016.NGDP_RPCH"    "X2016.NGDP"        
 [7] "X2016.NGDPD"        "X2016.PPPGDP"       "X2016.NGDP_D"      
[10] "X2016.NGDPRPC"      "X2016.NGDPRPPPPC"   "X2016.NGDPPC"      
[13] "X2016.NGDPDPC"      "X2016.PPPPC"        "X2016.NGAP_NPGDP"  
[16] "X2016.PPPSH"        "X2016.PPPEX"        "X2016.NID_NGDP"    
[19] "X2016.NGSD_NGDP"    "X2016.PCPI"         "X2016.PCPIPCH"     
[22] "X2016.PCPIE"        "X2016.PCPIEPCH"     "X2016.FLIBOR6"     
[25] "X2016.TM_RPCH"      "X2016.TMG_RPCH"     "X2016.TX_RPCH"     
[28] "X2016.TXG_RPCH"     "X2016.LUR"          "X2016.LE"          
[31] "X2016.LP"           "X2016.GGR"          "X2016.GGR_NGDP"    
[34] "X2016.GGX"          "X2016.GGX_NGDP"     "X2016.GGXCNL"      
[37] "X2016.GGXCNL_NGDP"  "X2016.GGSB"         "X2016.GGSB_NPGDP"  
[40] "X2016.GGXONLB"      "X2016.GGXONLB_NGDP" "X2016.GGXWDN"      
[43] "X2016.GGXWDN_NGDP"  "X2016.GGXWDG"       "

In [239]:
names(weo_2016_wide) <- gsub("X2016\\.", "", names_2016)

In [240]:
str(weo_2016_wide)

'data.frame':	194 obs. of  48 variables:
 $ WEO.Country.Code: int  512 914 612 614 311 213 911 314 193 122 ...
 $ ISO             : Factor w/ 194 levels "ABW","AFG","AGO",..: 2 4 49 3 8 6 7 1 9 10 ...
 $ Country         : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ NGDP_R          : num  493.07 760.32 7270.16 1588.96 3.25 ...
 $ NGDP_RPCH       : num  2.16 3.35 3.2 -2.58 5.35 ...
 $ NGDP            : num  1318.48 1475.25 17525.1 16549.56 3.94 ...
 $ NGDPD           : num  19.43 11.88 160.13 101.12 1.46 ...
 $ PPPGDP          : num  66.38 34.03 609.7 194.89 2.29 ...
 $ NGDP_D          : num  267 194 241 1042 121 ...
 $ NGDPRPC         : num  14228 264357 178355 57774 35967 ...
 $ NGDPRPPPPC      : num  1774 10958 13854 6563 23504 ...
 $ NGDPPC          : num  38045 512934 429934 601737 43674 ...
 $ NGDPDPC         : num  561 4132 3928 3677 16176 ...
 $ PPPPC           : num  1916 11831 14958 7086 25376 ...
 $ NGAP_NPGDP      : num  NA NA NA NA NA ...
 $ PPPSH     
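
The cast-and-rename steps above can be sketched on a small made-up data frame. The names `toy_2016`, `toy_wide`, and `Subject` are hypothetical, and the values are invented. Anchoring the pattern with `^` is a small safeguard so that only a leading prefix is removed:

```r
# Hypothetical long data: two countries, two subject codes, one year column.
toy_2016 <- data.frame(Country = rep(c("A", "B"), each = 2),
                       Subject = rep(c("NGDP", "LUR"), times = 2),
                       X2016   = c(10, 5, 20, 7))

# Cast: one row per country, one column per subject code.
toy_wide <- reshape(toy_2016,
                    idvar     = "Country",
                    timevar   = "Subject",
                    v.names   = "X2016",
                    direction = "wide")

# Columns are named "<v.names>.<timevar value>"; drop the leading prefix.
names(toy_wide) <- sub("^X2016\\.", "", names(toy_wide))
toy_wide
```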

## Discretize numeric data

Now we may want to turn GDP per capita (PPP) into a factor with three levels: low income, middle income, and high income countries.

Let's say the cut points are 1000 and 12000:

In [241]:
str(weo_ppp_long_2)

'data.frame':	8536 obs. of  5 variables:
 $ Country         : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ WEO.Country.Code: int  512 914 612 614 311 213 911 314 193 122 ...
 $ ISO             : Factor w/ 194 levels "ABW","AFG","AGO",..: 2 4 49 3 8 6 7 1 9 10 ...
 $ time            : int  1980 1980 1980 1980 1980 1980 1980 1980 1980 1980 ...
 $ NGDPRPPPPC      : num  NA 4675 10804 5277 9082 ...


We will discretize the NGDPRPPPPC column with the cut() function and add the result as a new factor variable:

In [242]:
max(weo_ppp_long_2$NGDPRPPPPC, na.rm = TRUE)

[1] 172994.5

In [243]:
a <- cut(weo_ppp_long_2$NGDPRPPPPC,
         breaks = c(0, 1000, 12000, max(weo_ppp_long_2$NGDPRPPPPC, na.rm = TRUE)))

In [244]:
levels(a)

[1] "(0,1e+03]"          "(1e+03,1.2e+04]"    "(1.2e+04,1.73e+05]"

In [245]:
weo_ppp_long_2$income_levels <- cut(weo_ppp_long_2$NGDPRPPPPC,
                                    breaks = c(0, 1000, 12000, max(weo_ppp_long_2$NGDPRPPPPC, na.rm = TRUE)),
                                    labels = c("L", "M", "H"))

In [246]:
weo_ppp_long_2[,3:6]

     ISO time NGDPRPPPPC income_levels
1    AFG 1980        NA  NA           
2    ALB 1980  4674.883  M            
3    DZA 1980 10804.443  M            
4    AGO 1980  5276.667  M            
5    ATG 1980  9081.960  M            
6    ARG 1980 14709.172  H            
7    ARM 1980        NA  NA           
8    ABW 1980        NA  NA           
9    AUS 1980 24398.012  H            
10   AUT 1980 26129.809  H            
11   AZE 1980        NA  NA           
12   BHS 1980 27652.318  H            
13   BHR 1980 40934.983  H            
14   BGD 1980  1159.722  M            
15   BRB 1980 14066.335  H            
16   BLR 1980        NA  NA           
17   BEL 1980 25189.821  H            
18   BLZ 1980  3183.026  M            
19   BEN 1980  1724.655  M            
20   BTN 1980  1102.657  M            
21   BOL 1980  4866.147  M            
22   BIH 1980        NA  NA           
23   BWA 1980  4265.308  M            
24   BRA 1980 11373.057  M            
25   BRN 1980        NA  
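
By default cut() builds intervals that are open on the left and closed on the right, so with a first break of 0 a value of exactly 0 would fall outside every interval and become NA. The sketch below uses -Inf and Inf as outer breaks, which is one way to avoid that edge case and the max() lookup. The vector `gdp` holds made-up values and `income` is a hypothetical name:

```r
# Made-up values, including NA and values that sit exactly on a break.
gdp <- c(NA, 0, 800, 1000, 4675, 12000, 40935)

# Intervals are (-Inf, 1000], (1000, 12000], (12000, Inf].
income <- cut(gdp,
              breaks = c(-Inf, 1000, 12000, Inf),
              labels = c("L", "M", "H"))

data.frame(gdp, income)
table(income, useNA = "ifany")
```

Values that sit exactly on a cut point (1000 and 12000 here) are assigned to the lower class, and missing inputs stay NA.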

## Joining data

Now let's import a second IMF dataset: the 2016 international investment position (IIP) values for all countries, in US dollars:

In [265]:
iip <- read.csv("data/Data_in_US_Dollars.csv")

In [266]:
str(iip)

'data.frame':	208 obs. of  13 variables:
 $ WEO.Country.Code                                                        : int  512 914 612 614 312 311 213 911 314 193 ...
 $ Assets                                                                  : num  9274 7352 127904 87718 1511 ...
 $ Direct.investment                                                       : num  5.61 1406.26 1281.88 21493.93 71.64 ...
 $ Portfolio.investment                                                    : num  351.4 748.4 NA 7401.8 42.2 ...
 $ Financial.derivatives..other.than.reserves..and.employee.stock.options  : num  NA NA NA 78.2 0 ...
 $ Other.investment                                                        : num  1454 2093 5414 34392 1342 ...
 $ Reserve.assets                                                          : num  7462.8 3105 121208.4 24352.5 55.6 ...
 $ Liabilities                                                             : num  3843 12439 33264 90246 1897 ...
 $ Direct.investment.1              

Similar to SQL joins, we can merge() the weo_2016_wide and iip datasets on their common "WEO.Country.Code" variable.
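
A minimal sketch of such a join with base merge() is shown below. It uses two small toy data frames (`weo_toy` and `iip_toy`, hypothetical stand-ins with illustrative values) rather than the real weo_2016_wide and iip objects:

```r
# Toy stand-ins for the two datasets; only the key column name is shared.
weo_toy <- data.frame(WEO.Country.Code = c(512, 914, 612),
                      NGDPD = c(19.4, 11.9, 160.1))
iip_toy <- data.frame(WEO.Country.Code = c(914, 612, 312),
                      Assets = c(7352, 127904, 1511))

# Inner join: only codes present in both data frames.
inner <- merge(weo_toy, iip_toy, by = "WEO.Country.Code")

# Left join: keep every row of weo_toy; unmatched Assets become NA.
left <- merge(weo_toy, iip_toy, by = "WEO.Country.Code", all.x = TRUE)

# Full outer join: keep rows from both sides.
full <- merge(weo_toy, iip_toy, by = "WEO.Country.Code", all = TRUE)

inner
left
full
```

The all, all.x, and all.y arguments of merge() play the role of the SQL join type: the default corresponds to an inner join, all.x = TRUE to a left join, all.y = TRUE to a right join, and all = TRUE to a full outer join.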

## Aggregating and summarizing data 

## Imputation of missing values

## Sampling