---
title: "STAT545 HW02"
author: "Xinmiao Wang"
date: "`r format(Sys.Date())`"
output: github_document
---

# Navigation

* The main repo for homework: [here](https://github.com/xinmiaow/STAT545-hw-Wang-Xinmiao) 

* Requirement for Homework 02: click [here](http://stat545.com/hw02_explore-gapminder-dplyr.html)

* hw02 folder: [here](https://github.com/xinmiaow/STAT545-hw-Wang-Xinmiao/tree/master/hw02).

* Files inside hw02:

  1. [README.md](https://github.com/xinmiaow/STAT545-hw-Wang-Xinmiao/blob/master/hw02/README.md)

  2. [hw02_Gapminder.md](https://github.com/xinmiaow/STAT545-hw-Wang-Xinmiao/blob/master/hw02/hw02_Gapminder.md)


# Bring Rectangular Data in

In this module, we explore the Gapminder data and practice the functions in dplyr. The data come from the `gapminder` package, and dplyr is loaded as part of the `tidyverse` package. Please make sure that these packages have been installed before loading them.

Install `gapminder` from CRAN:
```{r eval=FALSE}
install.packages("gapminder")
```

Install `tidyverse` from CRAN:
```{r eval=FALSE}
install.packages("tidyverse")
```

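Here, we load those two packages, along with `ggthemes`, which provides `theme_calc()` for the plots below and must also be installed.

```{r load_library, warning=FALSE}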
#load packages
library(gapminder)
library(tidyverse)
library(ggthemes)
```


# Smell Test the Data

### Is it a data.frame, a matrix, a vector, a list?

```{r type_dat}
gapminder
str(gapminder)
```

* A tibble: its class includes `tbl_df`, so it prints in the compact tibble format.

* A data.frame, according to the classes shown in `str(gapminder)`.

* A list, according to `typeof()`, because a data.frame is a special case of a list.
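The three answers above can be checked directly. This is a minimal sketch that assumes only the `gapminder` package; it is an illustration, not part of the assignment.

```{r sketch_structure}
library(gapminder)

is.data.frame(gapminder)       # inherits from data.frame
inherits(gapminder, "tbl_df")  # carries the tibble class
is.list(gapminder)             # a data.frame is built on a list
is.matrix(gapminder)           # but it is not a matrix
typeof(gapminder)              # underlying storage type
```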


### What's its class?

```{r class_data}
class(gapminder)
```
 
* The class of gapminder includes `r class(gapminder)`.


### How many variables/columns?

```{r ncol_data} 
ncol(gapminder)
```

* There are `r ncol(gapminder)` variables.


### How many rows/observations?

```{r nrow_data}
nrow(gapminder)
```

* There are `r nrow(gapminder)` observations.
  
  
### Can you get these facts about "extent" or "size" in more than one way? Can you imagine different functions being useful in different contexts?

```{r other_extent}
dim(gapminder) # number of observations and number of variables
length(gapminder) # number of variables
```

* `dim()`: the number of observations and the number of variables

* `str()`: the number of observations and the number of variables

* `length()`: the number of variables
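As a quick consistency check, the functions listed above should agree with `nrow()` and `ncol()`. A minimal sketch, assuming only the `gapminder` package:

```{r sketch_extent}
library(gapminder)

# dim() returns the row count followed by the column count
dims_agree <- identical(dim(gapminder), c(nrow(gapminder), ncol(gapminder)))

# length() of a data.frame counts its columns, because a data.frame is a list of columns
length_agrees <- length(gapminder) == ncol(gapminder)

dims_agree
length_agrees
```

`dim()` is convenient when both numbers are needed at once, while `nrow()` and `ncol()` read more clearly when only one of them is used.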
  
  
### What data type is each variable?

```{r type_data}
# Attach the data so that later chunks can refer to columns such as lifeExp directly
attach(gapminder)
# Storage type of each variable, as reported by typeof()
a <- rbind(names(gapminder), unname(sapply(gapminder, typeof)))
as.data.frame(a, row.names = c("Variables", "Data Type"))
```

* Integer: country, continent, year, pop

* Double (numeric): lifeExp, gdpPercap

**Note:** `typeof()` reports integer for country and continent, even though their values look like character strings rather than integers. This is because these two variables are stored as factors (plain factors, not ordered ones). R keeps an integer code for each observation, and each code points to a level. For example, for country, 1 represents Afghanistan, and for continent, 1 represents Africa. You can check this with `str(gapminder)`, `levels(gapminder$country)` and `levels(gapminder$continent)`.
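The relationship between a factor's class, its storage type and its levels can be seen on a small example. This sketch uses a hypothetical toy factor `f` (not part of the Gapminder data) and base R only.

```{r sketch_factor}
# Hypothetical toy factor standing in for gapminder$continent
f <- factor(c("Asia", "Africa", "Asia"))

class(f)       # the class R dispatches on
typeof(f)      # how the values are stored internally
levels(f)      # character labels, sorted alphabetically by default
as.integer(f)  # integer codes that index into levels(f)
is.ordered(f)  # whether the factor is an ordered factor
```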


# Explore individual variables

### Categorical Variable: Continent

Here are the summary table and the barchart for Continent. 

There are five continents in the data: Africa, Americas, Asia, Europe and Oceania. Africa has the largest number of observations, and Oceania has the smallest. The barchart below shows the distribution of continent more clearly.

```{r continent}
summary(continent)

ggplot(gapminder, aes(x=continent))+
  geom_bar(aes(color=continent), fill=continent_colors)+
  theme_calc()+
  ggtitle("The Bar Chart of Continent")

```
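The same counts can also be obtained with dplyr. This is a small sketch using `count()`; the name `continent_counts` is introduced here only for illustration.

```{r sketch_continent_count}
library(gapminder)
library(dplyr)

# Number of observations per continent, largest first
continent_counts <- gapminder %>%
  count(continent, sort = TRUE)

continent_counts
```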


### Quantitative Variable: LifeExp

Here are the summary data and the histogram for LifeExp. 

The range of lifeExp is from `r range(lifeExp)[1]` to `r range(lifeExp)[2]`. The mean of lifeExp is `r mean(lifeExp)` with standard deviation `r sd(lifeExp)`, and the median is `r median(lifeExp)`. From the histogram, we can observe that the mode of lifeExp is around 70 and that the distribution is slightly left-skewed. Based on the histogram, I suspected that the minimum of lifeExp might be an outlier. However, by the 1.5 IQR rule, the minimum lies within the fences, so it is not flagged as an outlier.

```{r lifeExp}
summary(lifeExp)
sd(lifeExp)

ggplot(gapminder, aes(x=lifeExp))+
  geom_histogram(binwidth = 1, col = "red", aes(fill = after_stat(count)))+
  scale_fill_gradient("count", low = "green", high = "red")+
  theme_calc()+
  ggtitle("The Histogram of LifeExp")

```
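The 1.5 IQR rule mentioned for lifeExp can be sketched as follows. The names `lower_fence`, `upper_fence` and `n_outside` are introduced here for illustration, and the quartiles come from the default `quantile()` method, which is an assumption about how the rule is applied.

```{r sketch_iqr}
library(gapminder)

# Quartiles and interquartile range of life expectancy
q <- quantile(gapminder$lifeExp, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Fences of the 1.5 IQR rule
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Number of observations that fall outside the fences
n_outside <- sum(gapminder$lifeExp < lower_fence | gapminder$lifeExp > upper_fence)
n_outside

# Is the minimum inside the fences?
min(gapminder$lifeExp) >= lower_fence
```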


# Explore various plot types

## Life Expectancy vs. Year

Here are boxplots of life expectancy for each survey year, every five years from 1952 to 2007. The medians rise over time, which shows an increasing trend in life expectancy all over the world.

```{r year_lifeExp, echo=FALSE}
ggplot(gapminder, aes(x = year, y = lifeExp))+ 
  geom_boxplot(aes(group = year), fill="pink")+
  theme_calc()+
  ggtitle("The Boxplot of LifeExp over each year")

```
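The trend read off the boxplots can be checked numerically. This is a minimal sketch with dplyr; `median_by_year` is a name introduced here for illustration.

```{r sketch_median_year}
library(gapminder)
library(dplyr)

# Median life expectancy in each survey year
median_by_year <- gapminder %>%
  group_by(year) %>%
  summarize(median_lifeExp = median(lifeExp)) %>%
  arrange(year)

knitr::kable(median_by_year)
```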


## Life Expectancy vs. Continent

Here is the boxplot of life expectancy by continent. We can observe that the median life expectancy in Oceania is the highest. However, based on the boxplot alone, we cannot yet be sure that any continent has a higher life expectancy than all the others, because the boxes overlap with each other.

```{r boxplot_continent_lifeExp}
ggplot(gapminder, aes(x=continent, y=lifeExp))+
  geom_boxplot(aes(color=continent), fill=continent_colors)+
  theme_calc()+
  ggtitle("The Boxplot of LifeExp in each Continent")

```

In addition, I plot the density of lifeExp for each continent. We can compare the distribution of lifeExp in each continent.

```{r densityplot_continent_lifeExp}
ggplot(gapminder, aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.2, lwd=0.65)+
  theme_calc()+
  ggtitle("The Density Plot of Continent vs. LifeExp")

```
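To put numbers on the comparison in the boxplot above, the median and IQR of lifeExp can be computed for each continent. A sketch with dplyr; `median_by_continent` is an illustrative name.

```{r sketch_median_continent}
library(gapminder)
library(dplyr)

# Median and spread of life expectancy per continent, highest median first
median_by_continent <- gapminder %>%
  group_by(continent) %>%
  summarize(median_lifeExp = median(lifeExp),
            iqr_lifeExp = IQR(lifeExp)) %>%
  arrange(desc(median_lifeExp))

knitr::kable(median_by_continent)
```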


## Life Expectancy vs. GDP per capita

First, I plot lifeExp against gdpPercap. The points follow a curve that resembles a logarithmic function.

```{r plot_gdpPercap_lifeExp, echo=FALSE}
ggplot(gapminder, aes(x=gdpPercap, y=lifeExp))+
  geom_point(alpha=0.75, aes(color = continent))+
  theme_calc()+
  ggtitle("The Plot of GdpPercap vs. LifeExp")

```

Hence, I plot lifeExp against the log (base 10) of gdpPercap instead, which shows a roughly linear relationship between these two variables.

```{r plot_log_gdpPercap_lifeExp, echo=FALSE}
ggplot(gapminder, aes(x=log10(gdpPercap), y=lifeExp))+
  geom_point(alpha=0.75, aes(color = continent))+
  geom_smooth(method = "lm")+
  theme_calc()+
  ggtitle("The Plot of log(GdpPercap) vs. LifeExp")

```
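One rough way to compare the two scatter plots is the Pearson correlation before and after the log transformation. This sketch is an informal check, not a formal test of linearity; `cor_raw` and `cor_log` are names introduced here.

```{r sketch_cor}
library(gapminder)

# Correlation of lifeExp with gdpPercap on the raw and on the log10 scale
cor_raw <- cor(gapminder$gdpPercap, gapminder$lifeExp)
cor_log <- cor(log10(gapminder$gdpPercap), gapminder$lifeExp)

c(raw = cor_raw, log10 = cor_log)
```

A higher correlation on the log scale is consistent with the straighter pattern seen in the second plot.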


# Use filter(), select() and %>%

Here are the scatter plots of lifeExp against log10(gdpPercap) in the Americas and in Europe. Both show a positive linear relationship between log10(gdpPercap) and lifeExp.

```{r piping}
gapminder %>% 
  filter(continent %in% c("Americas", "Europe") ) %>% 
  select(continent, country, lifeExp, gdpPercap) %>% 
  ggplot(aes(x=log10(gdpPercap), y=lifeExp, color=continent))+
  geom_point()+
  geom_smooth(method="lm")+
  facet_wrap(~continent)+
  theme_calc()+
  ggtitle("The Scatterplot of Log(gdpPerCap) vs. LifeExp in Americas and in Europe")

```


Here are the density plots of log10(gdp) for each continent except Oceania, where gdp is computed as gdpPercap times pop.

```{r piping2}
gapminder %>% 
  filter(continent != "Oceania") %>% 
  select(continent, year, pop, gdpPercap) %>% 
  mutate(gdp = gdpPercap*pop) %>% 
  ggplot(aes(x=log10(gdp), fill=continent))+
  geom_density(alpha=0.5)+
  facet_wrap(~continent)+
  theme_calc()+
  ggtitle("The Density Plots of Log(gdp) for Each Continent Except Oceania")

```


# Extra Question

```{r extra_question}
extra_dat <- filter(gapminder, country == c("Rwanda", "Afghanistan")) %>% 
  arrange(year)
my_dat <- filter(gapminder, country %in% c("Rwanda", "Afghanistan")) %>% 
  arrange(year)
```


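* The answer to this question is NO: filtering with `country == c("Rwanda", "Afghanistan")` does not return all the data for Rwanda and Afghanistan.

* The command `filter(gapminder, country == c("Rwanda", "Afghanistan"))` gives us only `r nrow(extra_dat)` observations. However, there are actually `r nrow(my_dat)` observations for these two countries. This is because `==` compares element by element and recycles the shorter vector `c("Rwanda", "Afghanistan")` along the country column. For example, R compares the country of the first observation with Rwanda, the country of the second observation with Afghanistan, the country of the third observation with Rwanda again, and so on. Only the rows whose country matches the value in its position are kept.

* We can also check this from the tables below.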

```{r extra_queation_table}
nrow(extra_dat)
nrow(my_dat)

knitr::kable(extra_dat)
knitr::kable(my_dat)
```
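The difference between `==` and `%in%` can be reproduced on a small vector. This sketch uses a hypothetical toy vector `country_toy` instead of the real country column and needs base R only.

```{r sketch_recycling}
# Hypothetical toy vector standing in for gapminder$country
country_toy <- c("Rwanda", "Rwanda", "Afghanistan", "Afghanistan")
targets <- c("Rwanda", "Afghanistan")

# `==` recycles targets to Rwanda, Afghanistan, Rwanda, Afghanistan
# and compares position by position
eq_match <- country_toy == targets

# %in% asks, for each element, whether it appears anywhere in targets
in_match <- country_toy %in% targets

eq_match
in_match
```

With `==`, only the first and last elements match the value in their position, whereas `%in%` keeps all four.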


In the following section, I try some other functions in dplyr.

```{r extra_queation_dplyr}
extra_dat %>% 
  group_by(country) %>% 
  summarize(avg_lifeExp = mean(lifeExp)) %>% 
  knitr::kable()

my_dat %>% 
  group_by(country) %>% 
  summarize(avg_lifeExp = mean(lifeExp)) %>%
  knitr::kable()

extra_dat %>%  
  group_by(country) %>% 
  select(country, year, lifeExp) %>% 
  arrange(country) %>% 
  mutate(lifeExp_gain = lifeExp - first(lifeExp)) %>% 
  knitr::kable()

my_dat %>%  
  group_by(country) %>% 
  select(country, year, lifeExp) %>% 
  arrange(country) %>% 
  mutate(lifeExp_gain = lifeExp - first(lifeExp)) %>% 
  knitr::kable()

```


# My Process Report

* The tutorials in HW02 and lecture notes are very helpful for this assignment. I have listed those links below in the reference section.

* I think this assignment is harder than the previous one. It is not harder in a technical way, but it requires us to spend more time working on it, discovering new functions and figuring out which ones can be used properly. However, it is still interesting to do so.

* The type of the data set and the data type of each variable are the two questions that confused me the most, but after reading the lecture notes and doing some research, I think I have given reasonable answers to these questions.


# Reference

- [STAT545: cm005 Notes and Exercises](http://stat545.com/cm005-notes_and_exercises.html)
- [ggplot2 Tutorial](https://github.com/jennybc/ggplot2-tutorial)
- [Gapminder README.md by jennybc](https://github.com/jennybc/gapminder/blob/master/README.md)