# 1 Summary Statistics

Summary statistics give you the tools you need to boil massive datasets down to their highlights. In this chapter, you'll explore summary statistics including mean, median, and standard deviation, and learn how to interpret them accurately. You'll also develop your critical thinking skills, allowing you to choose the best summary statistics for your data.

# Guess the correlation

On the right, use the scatterplot to estimate what the correlation is between the variables x and y. Once you've guessed it correctly, use the New Plot button to try out a few more scatterplots. When you're ready, answer the question below to continue to the next exercise.

Which of the following statements is NOT true about correlation?

# Possible answers

( ) If the correlation between `x` and `y` has a high magnitude, the data points will be clustered closely around a line.

( ) Correlation can be written as *r*.

( ) If `x` and `y` are negatively correlated, values of `y` decrease as values of `x` increase.

(x) Correlation cannot be 0.
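
A quick check of the last statement, as a minimal sketch with made-up values (not taken from any course dataset): when `y` follows `x` along a symmetric curve, the downward and upward halves cancel out and base R's `cor()` returns 0.

```r
# Made-up illustrative values: y is x squared, symmetric around x = 0
x <- c(-2, -1, 0, 1, 2)
y <- x^2

# Pearson correlation via base R's cor()
r <- cor(x, y)
r
```

A correlation of 0 is therefore possible, and it signals the absence of a linear relationship rather than the absence of any relationship.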



# Relationships between variables

In this chapter, you'll be working with a dataset world_happiness containing results from the 2019 World Happiness Report. The report scores various countries based on how happy people in that country are. It also ranks each country on various societal aspects such as social support, freedom, corruption, and others. The dataset also includes the GDP per capita and life expectancy for each country.

In this exercise, you'll examine the relationship between a country's life expectancy (life_exp) and happiness score (happiness_score) both visually and quantitatively. Both dplyr and ggplot2 are loaded and world_happiness is available.

# Instructions:

- Create a scatterplot of happiness_score vs. life_exp using ggplot2.

In [None]:
# Create a scatterplot of happiness_score vs. life_exp
ggplot(world_happiness, aes(life_exp, happiness_score)) +
  geom_point()

- Add a linear trendline to the scatterplot, setting se to FALSE.

In [None]:
# Add a linear trendline to scatterplot
ggplot(world_happiness, aes(life_exp, happiness_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Question

Based on the scatterplot, which is most likely the correlation between life_exp and happiness_score?

# Possible answers

( ) 0.3

( ) -0.3

(x) 0.8

( ) -0.8

- Calculate the correlation between life_exp and happiness_score.

In [None]:
# Add a linear trendline to scatterplot
ggplot(world_happiness, aes(life_exp, happiness_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Correlation between life_exp and happiness_score
cor(world_happiness$life_exp, world_happiness$happiness_score)
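
To see what `cor()` computes by default (the Pearson coefficient), the sketch below reproduces it by hand on two short made-up vectors. The names `x`, `y`, and `r_manual` are illustrative assumptions and are not part of the exercise.

```r
# Made-up illustrative values (not from world_happiness)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Deviations of each value from its mean
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson r: summed cross-products scaled by the spread of each variable
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The hand-computed value agrees with base R's cor()
all.equal(r_manual, cor(x, y))
```

Because both sums of squares sit in the denominator, the result is unitless and always falls between -1 and 1.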

# What can't correlation measure?

While the correlation coefficient is a convenient way to quantify the strength of a relationship between two variables, it's far from perfect. In this exercise, you'll explore one of the caveats of the correlation coefficient by examining the relationship between a country's GDP per capita (gdp_per_cap) and life expectancy (life_exp).

Both dplyr and ggplot2 are loaded and world_happiness is available.

# Instructions:

- Create a scatterplot showing the relationship between gdp_per_cap (on the x-axis) and life_exp (on the y-axis).

In [None]:
# Scatterplot of gdp_per_cap and life_exp
ggplot(world_happiness, aes(gdp_per_cap, life_exp)) +
  geom_point()

- Calculate the correlation between gdp_per_cap and life_exp.

In [None]:
# Scatterplot of gdp_per_cap and life_exp
ggplot(world_happiness, aes(gdp_per_cap, life_exp)) +
  geom_point()

# Correlation between gdp_per_cap and life_exp
cor(world_happiness$gdp_per_cap, world_happiness$life_exp)

# Question

The correlation between GDP per capita and life expectancy is 0.7. Why is correlation not the best way to measure the relationship between the two variables?

# Possible answers

( ) Correlation measures how one variable affects another.

(x) Correlation only measures linear relationships.

( ) Correlation cannot properly measure relationships between numeric variables.
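
The point about linearity can be illustrated with a small sketch on made-up values (not from world_happiness). Here `y` is completely determined by `x`, yet the correlation falls short of 1 because the relationship follows a curve rather than a line.

```r
# Made-up illustrative values: y is an exact function of x, but a curved one
x <- 1:100
y <- log(x)

# Correlation on the raw scale understates how tightly y follows x
r_raw <- cor(x, y)

# After putting x on the log scale the relationship is a straight line
r_log <- cor(log(x), y)

c(raw = r_raw, log_scale = r_log)
```

The raw value stays below 1 even though there is no noise at all, while the value on the log scale equals 1 up to floating-point precision.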

# Transforming variables

When variables have skewed distributions, they often need to be transformed before they form a linear relationship with another variable, which is the kind of relationship correlation measures. In this exercise, you'll perform a transformation yourself.

Both dplyr and ggplot2 are loaded and world_happiness is available.

# Instructions:

- Create a scatterplot of happiness_score versus gdp_per_cap.
- Calculate the correlation between happiness_score and gdp_per_cap.

In [None]:
# Scatterplot of happiness_score vs. gdp_per_cap
ggplot(world_happiness, aes(gdp_per_cap, happiness_score)) +
  geom_point()

# Calculate correlation
cor(world_happiness$gdp_per_cap, world_happiness$happiness_score)

- Add a new column to world_happiness called log_gdp_per_cap that contains the log of gdp_per_cap.
- Create a scatterplot of happiness_score versus log_gdp_per_cap.
- Calculate the correlation between happiness_score and log_gdp_per_cap.

In [None]:
# Create log_gdp_per_cap column
world_happiness <- world_happiness %>%
  mutate(log_gdp_per_cap = log(gdp_per_cap))

# Scatterplot of happiness_score vs. log_gdp_per_cap
ggplot(world_happiness, aes(log_gdp_per_cap, happiness_score)) +
  geom_point()

# Calculate correlation
cor(world_happiness$log_gdp_per_cap, world_happiness$happiness_score)

# Does sugar improve happiness?

A new column has been added to world_happiness called grams_sugar_per_day, which contains the average amount of sugar eaten per person per day in each country. In this exercise, you'll examine the relationship between a country's average sugar consumption and its happiness score.

Both dplyr and ggplot2 are loaded and world_happiness is available.

# Instructions:

- Create a scatterplot showing the relationship between grams_sugar_per_day (on the x-axis) and happiness_score (on the y-axis).
- Calculate the correlation between grams_sugar_per_day and happiness_score.

In [None]:
# Scatterplot of grams_sugar_per_day and happiness_score
ggplot(world_happiness, aes(grams_sugar_per_day, happiness_score)) +
  geom_point()

# Correlation between grams_sugar_per_day and happiness_score
cor(world_happiness$grams_sugar_per_day, world_happiness$happiness_score)

# Question

Based on this data, which statement about sugar consumption and happiness scores is true?

# Possible answers

( ) Increased sugar consumption leads to a higher happiness score.

( ) Lower sugar consumption results in a lower happiness score.

(x) Increased sugar consumption is associated with a higher happiness score.

( ) Sugar consumption is not related to happiness.
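
One way two variables can be associated without either causing the other is a hidden third variable that drives both. The simulation below is only a sketch under that assumption: `wealth`, `sugar`, and `happiness` are hypothetical simulated quantities and make no claim about the real data.

```r
set.seed(42)  # fixed seed so the simulation is reproducible
n <- 1000

# Hypothetical hidden variable that drives both measurements
wealth <- rnorm(n)

# Neither simulated variable depends on the other, only on wealth plus noise
sugar <- wealth + rnorm(n)
happiness <- wealth + rnorm(n)

# The two are clearly correlated even though neither causes the other
r_raw <- cor(sugar, happiness)

# Removing the part explained by wealth leaves almost no correlation
r_adjusted <- cor(resid(lm(sugar ~ wealth)), resid(lm(happiness ~ wealth)))

c(raw = r_raw, adjusted = r_adjusted)
```

The raw correlation is clearly positive, while the adjusted one is close to zero, which is why "is associated with" is the safer wording.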

# Study types

While controlled experiments are ideal, many situations and research questions are not conducive to one. In a controlled experiment, causation can likely be inferred if the control and test groups have similar characteristics and no systematic differences between them. Causation cannot usually be inferred from observational studies, however, and their results are often misinterpreted because of this.
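
Random assignment is a common way to get control and test groups with similar characteristics. The sketch below uses simulated participants and a hypothetical `age` variable to show the idea. It is an illustration under those assumptions, not a description of any particular study.

```r
set.seed(1)  # fixed seed so the assignment is reproducible
n <- 1000

# Hypothetical participant characteristic (simulated, not real data)
age <- rnorm(n, mean = 40, sd = 10)

# Randomly assign exactly half of the participants to each group
group <- sample(rep(c("control", "treatment"), each = n / 2))

# Compare the average age in each group; random assignment keeps them close
group_means <- tapply(age, group, mean)
group_means
```

Because chance alone decides who lands in which group, characteristics such as age tend to balance out, so a later difference in outcomes is less likely to come from a systematic difference between the groups.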

In this exercise, you'll practice distinguishing controlled experiments from observational studies.