---
title: "STAT545 HW02"
author: "Xinmiao Wang"
date: "`r format(Sys.Date())`"
output: github_document
---
# Navigation
* The main repo for homework: [here](https://github.com/xinmiaow/STAT545-hw-Wang-Xinmiao)
* Requirement for Homework 02: click [here](http://stat545.com/hw02_explore-gapminder-dplyr.html)
* hw02 folder: [here](https://github.com/xinmiaow/STAT545-hw-Wang-Xinmiao/tree/master/hw02).
* Files inside hw02:
    1. [README.md](https://github.com/xinmiaow/STAT545-hw-Wang-Xinmiao/blob/master/hw02/README.md)
    2. [hw02_Gapminder.md](https://github.com/xinmiaow/STAT545-hw-Wang-Xinmiao/blob/master/hw02/hw02_Gapminder.md)
# Bring Rectangular Data in
In this module, we explore the Gapminder data and practice the functions in dplyr. The data come from the `gapminder` package, and dplyr is loaded as part of the `tidyverse` package. Please make sure that both packages are installed before loading them.
Install `gapminder` from CRAN:
```{r eval=FALSE}
install.packages("gapminder")
```
Install `tidyverse` from CRAN:
```{r eval=FALSE}
install.packages("tidyverse")
```
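Here, we load those two packages, together with `ggthemes`, which provides the `theme_calc()` theme used in the plots below and must also be installed.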
```{r load_library, warning=FALSE}
#load packages
library(gapminder)
library(tidyverse)
library(ggthemes)
```
# Smell Test the Data
### Is it a data.frame, a matrix, a vector, a list?
```{r type_dat}
gapminder
str(gapminder)
```
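* A tibble: its class includes `tbl_df`, and with the `tidyverse` package loaded it prints as a tibble.
* A data.frame, according to the classes shown in `str(gapminder)`.
* A list, if we use `typeof()`, because a data.frame is a special case of a list.
* Not a matrix and not an atomic vector.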
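
The answers above do not contradict each other, and a quick check may make that easier to see. The sketch below uses a hypothetical toy data frame `toy` (its values are made up and not part of gapminder) so that it runs with base R alone; the same calls can be applied to `gapminder`.

```r
# Toy data frame standing in for gapminder (hypothetical values)
toy <- data.frame(country = factor(c("A", "B")), pop = c(10L, 20L))

class(toy)          # the class attribute: "data.frame"
typeof(toy)         # the underlying storage type: "list"
is.data.frame(toy)  # TRUE
is.list(toy)        # TRUE, a data.frame is a list of equal-length columns
is.matrix(toy)      # FALSE
is.vector(toy)      # FALSE, it carries attributes other than names
```

`class()` reports how R dispatches methods on the object, while `typeof()` reports how it is stored, which is why both "data.frame" and "list" are valid answers.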
### What's its class?
```{r class_data}
class(gapminder)
```
* The class of gapminder includes `r class(gapminder)`.
### How many variables/columns?
```{r ncol_data}
ncol(gapminder)
```
* There are `r ncol(gapminder)` variables.
### How many rows/observations?
```{r nrow_data}
nrow(gapminder)
```
* There are `r nrow(gapminder)` observations.
### Can you get these facts about "extent" or "size" in more than one way? Can you imagine different functions being useful in different contexts?
```{r other_extent}
dim(gapminder) # number of observations and number of variables
length(gapminder) # number of variables
```
* `dim()`: the number of observations and the number of variables
* `str()`: the number of observations and the number of variables
* `length()`: the number of variables
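
As a rough illustration of how these functions relate, the sketch below applies them to a hypothetical toy data frame `toy` (made-up values, base R only). `dim()` is convenient when both numbers are wanted at once, while `nrow()` and `ncol()` tend to read better inside other code.

```r
# Toy data frame with 3 rows and 2 columns (hypothetical values)
toy <- data.frame(x = 1:3, y = c("a", "b", "c"))

dim(toy)     # rows, then columns: 3 2
nrow(toy)    # 3
ncol(toy)    # 2
length(toy)  # 2, because a data.frame is a list with one element per column
```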
### What data type is each variable?
```{r type_data}
attach(gapminder) # later chunks refer to the columns by bare name
# Storage type (typeof) and class of each variable
data.frame(type = sapply(gapminder, typeof), class = sapply(gapminder, class))
```
* Integer: country, continent, year, pop
* Double (numeric): lifeExp, gdpPercap

Note: `typeof()` reports country and continent as integer, even though their values look like character strings rather than integers. This is because these two variables are stored as factors: R keeps a factor as a vector of integer codes plus a set of levels, and each code indexes a level. For example, for country, 1 represents Afghanistan, and for continent, 1 represents Africa. You can check this with `str(gapminder)`, `levels(gapminder$country)` and `levels(gapminder$continent)`.
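
The relation between a factor's integer codes and its levels can be sketched with a small hypothetical factor `cont` (made-up values, not taken from gapminder):

```r
# A small factor; levels are sorted alphabetically by default
cont <- factor(c("Asia", "Africa", "Asia", "Europe"))

typeof(cont)      # "integer", the storage type of the codes
class(cont)       # "factor"
levels(cont)      # "Africa" "Asia" "Europe"
as.integer(cont)  # 2 1 2 3, each code indexes into levels(cont)
is.ordered(cont)  # FALSE, a plain factor carries no ordering
```

The codes only identify levels. They do not imply that one level is greater than another unless the factor is created as an ordered factor.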
# Explore individual variables
### Categorical Variable: Continent
Here are the summary table and the bar chart for continent.
The data cover five continents: Africa, Americas, Asia, Europe and Oceania. Africa has the most observations and Oceania the fewest. The bar chart below shows the distribution across continents more clearly.
```{r continent}
summary(continent)
ggplot(gapminder, aes(x=continent))+
geom_bar(aes(color=continent), fill=continent_colors)+
theme_calc()+
ggtitle("The Bar Chart of Continent")
```
### Quantitative Variable: LifeExp
Here are the summary statistics and the histogram for lifeExp.
The range of lifeExp is from `r range(lifeExp)[1]` to `r range(lifeExp)[2]`. The mean of lifeExp is `r mean(lifeExp)` with standard deviation `r sd(lifeExp)`, and the median is `r median(lifeExp)`. From the histogram, we can observe that the mode of lifeExp is around 70 and that the distribution is slightly left-skewed. Based on the histogram, I suspected that the minimum of lifeExp might be an outlier. However, by the 1.5 IQR rule, the minimum is not an outlier, because it lies within 1.5 IQR of the first quartile.
```{r lifeExp}
summary(lifeExp)
sd(lifeExp)
ggplot(gapminder, aes(x=lifeExp))+
geom_histogram(binwidth = 1, col = "red", aes(fill = after_stat(count)))+
scale_fill_gradient("count", low = "green", high = "red")+
theme_calc()+
ggtitle("The Histogram of LifeExp")
```
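
The 1.5 IQR rule mentioned above can be sketched as a small helper. The function name `iqr_outliers` and the vector `x` are hypothetical and introduced only for illustration; the values in `x` are made up and are not gapminder data.

```r
# Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

x <- c(48, 52, 55, 60, 61, 65, 70, 71, 74, 5)  # hypothetical values
x[iqr_outliers(x)]  # 5 is the only value outside the fences
```

Calling `iqr_outliers(gapminder$lifeExp)` would apply the same check to the life expectancy column.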
# Explore various plot types
## Life Expectancy vs. Year
Here is a boxplot of life expectancy for every fifth year from 1952 to 2007. The boxplot shows that the median life expectancy across all countries increases over time.
```{r year_lifeExp, echo=FALSE}
ggplot(gapminder, aes(x = year, y = lifeExp))+
geom_boxplot(aes(group = year), fill="pink")+
theme_calc()+
ggtitle("The Boxplot of LifeExp over each year")
```
## Life Expectancy vs. Continent
Here is the boxplot of life expectancy for each continent. We can observe that the median life expectancy in Oceania is the highest. However, the boxplot alone does not let us conclude that one continent has a higher life expectancy than all the others, because the boxes overlap.
```{r boxplot_continent_lifeExp}
ggplot(gapminder, aes(x=continent, y=lifeExp))+
geom_boxplot(aes(color=continent), fill=continent_colors)+
theme_calc()+
ggtitle("The Boxplot of LifeExp in each Continent")
```
In addition, I plot the density of lifeExp for each continent. We can compare the distribution of lifeExp in each continent.
```{r densityplot_continent_lifeExp}
ggplot(gapminder, aes(x = lifeExp, fill = continent)) +
geom_density(alpha = 0.2, lwd=0.65)+
theme_calc()+
ggtitle("The Density Plot of Continent vs. LifeExp")
```
## Life Expectancy vs. GDP per capita
First, I plot gdpPercap versus lifeExp. The points follow a curve shaped like a logarithm function.
```{r plot_gdpPercap_lifeExp, echo=FALSE}
ggplot(gapminder, aes(x=gdpPercap, y=lifeExp))+
geom_point(alpha=0.75, aes(color = continent))+
theme_calc()+
ggtitle("The Plot of GdpPercap vs. LifeExp")
```
Hence, I plot the log of gdpPercap (base 10) versus lifeExp instead, which shows a roughly linear relationship between these two variables.
```{r plot_log_gdpPercap_lifeExp, echo=FALSE}
ggplot(gapminder, aes(x=log10(gdpPercap), y=lifeExp))+
geom_point(alpha=0.75, aes(color = continent))+
geom_smooth(method = "lm")+
theme_calc()+
ggtitle("The Plot of log(GdpPercap) vs. LifeExp")
```
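
To see why the log transformation straightens the curve, consider data that follow y = a + b * log10(x) exactly. The sketch below uses made-up values (a = 30, b = 10, not estimates from gapminder) and checks that a linear model on `log10(x)` recovers them.

```r
# Hypothetical data that follow y = 30 + 10 * log10(x) exactly
x <- 10^seq(2, 5, by = 0.5)
y <- 30 + 10 * log10(x)

# Linear in log10(x), so lm() recovers the intercept and slope
fit <- lm(y ~ log10(x))
coef(fit)  # intercept 30, slope 10
```

On this scale the slope reads as the change in y for a tenfold increase in x.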
# Use filter(), select() and %>%
Here are the scatter plots of log10(gdpPercap) vs. lifeExp for the Americas and for Europe, one panel per continent. Both show a positive linear relationship between log10(gdpPercap) and lifeExp.
```{r piping}
gapminder %>%
filter(continent %in% c("Americas", "Europe") ) %>%
select(continent, country, lifeExp, gdpPercap) %>%
ggplot(aes(x=log10(gdpPercap), y=lifeExp, color=continent))+
geom_point()+
geom_smooth(method="lm")+
facet_wrap(~continent)+
theme_calc()+
ggtitle("The Scatterplot of Log(gdpPerCap) vs. LifeExp in Americas and in Europe")
```
Here are the density plots of log10(gdp) for each continent except Oceania, where gdp is computed as gdpPercap times pop.
```{r piping2}
gapminder %>%
  filter(continent != "Oceania") %>%
  select(continent, year, pop, gdpPercap) %>%
  mutate(gdp = gdpPercap*pop) %>%
  ggplot(aes(x=log10(gdp), fill=continent))+
  geom_density(alpha=0.5)+
  facet_wrap(~continent)+
  theme_calc()+
  ggtitle("The Density Plots of Log(gdp) for Each Continent Except Oceania")
```
# Extra Question
```{r extra_question}
extra_dat <- filter(gapminder, country == c("Rwanda", "Afghanistan")) %>%
arrange(year)
my_dat <- filter(gapminder, country %in% c("Rwanda", "Afghanistan")) %>%
arrange(year)
```
* The answer to this question is no: the command does not return all of the data for Rwanda and Afghanistan.
* The command `filter(gapminder, country == c("Rwanda", "Afghanistan"))` gives us only `r nrow(extra_dat)` observations, whereas there are actually `r nrow(my_dat)` observations for these two countries. This is because `==` compares element by element and recycles the shorter vector: the country in the first row is compared with Rwanda, the country in the second row with Afghanistan, the third with Rwanda again, and so on. A row is kept only if its country happens to match the name it is paired with. Using `%in%` instead tests each row against both names.
* We can also check this from the tables below.
```{r extra_queation_table}
nrow(extra_dat)
nrow(my_dat)
knitr::kable(extra_dat)
knitr::kable(my_dat)
```
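
The recycling behaviour described above can be reproduced without gapminder. The sketch below uses a hypothetical character vector `country` with four rows per country (the real data have more rows per country, but the mechanism is the same).

```r
country <- rep(c("Afghanistan", "Rwanda"), each = 4)  # toy vector, 8 rows

# == recycles the right-hand side: Rwanda, Afghanistan, Rwanda, ...
by_eq <- country == c("Rwanda", "Afghanistan")
# %in% tests each element for membership in the set
by_in <- country %in% c("Rwanda", "Afghanistan")

sum(by_eq)    # 4, only half of the rows match
sum(by_in)    # 8, every row matches
which(by_eq)  # 2 4 5 7, the rows whose position lines up with the right name
```

Because the length of `country` is a multiple of 2, R recycles silently and gives no warning, which makes this mistake easy to miss.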
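In the following chunk, I try some other dplyr functions on both data sets: `group_by()` with `summarize()` for the mean life expectancy per country, and `mutate()` with `first()` for the change in life expectancy relative to each country's first row.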
```{r extra_queation_dplyr}
extra_dat %>%
group_by(country) %>%
summarize(avg_lifeExp = mean(lifeExp)) %>%
knitr::kable()
my_dat %>%
group_by(country) %>%
summarize(avg_lifeExp = mean(lifeExp)) %>%
knitr::kable()
extra_dat %>%
group_by(country) %>%
select(country, year, lifeExp) %>%
arrange(country) %>%
mutate(lifeExp_gain = lifeExp - first(lifeExp)) %>%
knitr::kable()
my_dat %>%
group_by(country) %>%
select(country, year, lifeExp) %>%
arrange(country) %>%
mutate(lifeExp_gain = lifeExp - first(lifeExp)) %>%
knitr::kable()
```
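
The `lifeExp - first(lifeExp)` step computes, within each country, the change relative to that country's first row. As a cross-check, the same grouped computation can be sketched in base R with `ave()`. The data frame `toy` and its values are hypothetical, and the rows are assumed to be sorted by year within each country.

```r
# Toy data: two countries, three years each (hypothetical values)
toy <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(c(1952, 1957, 1962), times = 2),
  lifeExp = c(30, 32, 35, 40, 41, 45)
)

# Subtract each group's first value from every value in that group
toy$lifeExp_gain <- ave(toy$lifeExp, toy$country, FUN = function(v) v - v[1])
toy$lifeExp_gain  # 0 2 5 0 1 5
```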
# My Process Report
* The tutorials in HW02 and the lecture notes were very helpful for this assignment. I have listed the links in the reference section below.
* I think this assignment is harder than the previous one. It is not harder technically, but it requires more time to discover new functions and figure out which ones to use. However, it is still interesting to do so.
* The type of the data set and the data type of each variable were the two questions that confused me most, but after reading the lecture notes and doing some research, I think I have given reasonable answers to them.
# References
- [STAT545: cm005 Notes and Exercises](http://stat545.com/cm005-notes_and_exercises.html)
- [ggplot2 Tutorial](https://github.com/jennybc/ggplot2-tutorial)
- [Gapminder README.md by jennybc](https://github.com/jennybc/gapminder/blob/master/README.md)