# ECON 326: Introduction to Basic Statistics using Jupyter


* **Authors**: COMET Team (Emrul Hasan, Jonah Heyl, Shiming Wu)
* Last updated: 15 August 2022
---

## Outline

### Prerequisites
* Introduction to Jupyter 
* Introduction to R

### Outcomes
After completing this worksheet, you will be able to

* Import and load data into R
* Create and customize plots in R
* Compute and visualize descriptive statistics from data
* Perform a t-test in R

## Part 1: Introduction to R and Jupyter


In this tutorial, we will be working with some real-world data: the 2019 Survey of Financial Security, provided by Statistics Canada[<sup id="fn1s">1</sup>](#fn1). First, we will import data into our notebook. We are going to give a brief review of some of the features of R (feel free to skip this or review the introduction notebooks for more details).
In R, we import data using different commands, which are stored in **libraries** that other developers of R have created for us.  Let's import some of the most important libraries now.  

We can do this in a Jupyter notebook by selecting a cell and hitting `shift+enter` or by pressing the `play` button in the menu.

> **Important**: information is shared across cells in a notebook.  However, cells run independently; so, if you run a later cell it doesn't re-evaluate previous cells.  You have to re-run them before you can use the results.  You can re-run all the cells in a notebook with the "fast forward" button.



Try loading some of the packages into memory by running the following cell.

In [None]:
# Run this cell to evaluate the code
 
library(tidyverse)
library(haven)
library(dplyr)
library('scales')

source("functions1.r")

You have now imported the packages into memory, and they are available to use in subsequent cells.  You may also see some output, which tells you about how they have been imported.

As an aside, in this course we try (as much as possible) to use R packages which are part of the [tidyverse](https://www.tidyverse.org/) family of packages.  This is because they are well-supported, consistent, and commonly used in data science.  There are (usually) other packages that provide similar functions.



### Importing Data into R

The first step in an econometric project is to import, tidy, and examine your data.  For this project, we will be using data from the 2019 Survey of Financial Security (SFS), provided by Statistics Canada (see the license notes).  This is _real data_ on real Canadians.

In this course, we will always work with data that is **tidy**.  This is a data-science term that refers to data with a particular format or shape.  Specifically, tidy data is rectangular data in which:
* Each row represents one observation
* Each column represents a single variable

You can imagine this as a spreadsheet in a program like Excel.  This is the way most statistical programs like to receive data - but it _isn't_ a property of the data itself.  It is only a representation; there can be other kinds of representations.  For example, a panel dataset might have each row as a unit (e.g. country) and each column might represent a variable in a year (e.g. unemployment in 2016, unemployment in 2017, etc.).  Often, when we work with real-world data we have to reshape it in order for it to be usable - but in this course, we won't worry about that.
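
To make the reshaping idea concrete, here is a minimal sketch using `pivot_longer()` from the `tidyr` package (part of the tidyverse). The small `wide_panel` table and its unemployment figures are made up for illustration; they are not part of the SFS data.

```r
library(tidyr)

# Hypothetical "wide" panel: one row per country, one column per year
wide_panel <- data.frame(
  country    = c("A", "B"),
  unemp_2016 = c(7.0, 5.5),
  unemp_2017 = c(6.3, 5.1)
)

# Reshape so that each row is a single country-year observation
tidy_panel <- pivot_longer(
  wide_panel,
  cols         = starts_with("unemp_"),  # the columns to stack
  names_to     = "year",                 # old column names go here...
  names_prefix = "unemp_",               # ...with this prefix removed
  values_to    = "unemployment"          # the cell values go here
)

tidy_panel
```

After reshaping, each row is one observation (a country in a year) and each column is one variable, which is the tidy layout described above.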

Data can be stored in many different file formats, which must be interpreted by our statistical programs. For example, data files with a `.dta` suffix are formatted for Stata, data files with a `.csv` suffix are formatted as comma-separated values, and data files with an `.Rda` suffix are formatted for R. Reading such a file into a statistical program is referred to as **importing** data: essentially, we need to tell the program what the data means. There are different methods for importing data in different formats; for example:

* `read_dta()`: Importing data from a ```.dta``` file: this is data that was created by the statistical software Stata and is commonly used in economics.
* `read.csv()`: Importing data from a delimited text file: this is data which is in a text format, but where the columns are *delimited* or separated by a special character.  For example, the ```.csv``` (comma-separated values) format is text, but where the columns are separated by a comma.
* `load()`: Importing data from a ```.Rda``` file: this is data that was created by the statistical software R and is commonly used in statistics.

Each of these formats has a special **method** associated with it.  Methods are commands, stored in R packages, that tell R to do things.  In fact, you've already seen one!  The ```library(...)``` method imports whatever package is in the brackets into memory.
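
As a small illustration of two of these import methods, the sketch below writes a toy data frame to temporary files and reads it back using only base R. The `toy` data and the file paths are hypothetical and exist only for this example.

```r
# Hypothetical toy data, written to temporary files so the example is self-contained
toy <- data.frame(id = 1:3, income = c(52000, 61000, 47500))

# --- .csv: delimited text ---
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
toy_from_csv <- read.csv(csv_path)   # the result must be assigned to a name
toy_from_csv

# --- .Rda: R's own format ---
rda_path <- tempfile(fileext = ".Rda")
save(toy, file = rda_path)
rm(toy)          # remove the object from memory...
load(rda_path)   # ...and load() restores it under its original name, `toy`
toy
```

Note the difference in usage: `read.csv()` returns the data, so you choose the name when you assign it, while `load()` recreates objects under the names they were saved with.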

Let's start by importing our ```.dta``` file into memory, using the ```read_dta``` method.  When we read things into memory, we normally want to use them later, so we have to give them a name.  Let's call this ```SFS_data``` so we can refer to it later on:

In [None]:
# read the file named "SFS_2019_Eng.dta" into memory, and assign it to the object SFS_data
SFS_data <- read_dta("../datasets/SFS_2019_Eng.dta")

### Viewing and Inspecting Data

We have now imported our data into R. To better understand our data, the natural next step is to inspect and view it. We always do this immediately after loading in our data to make sure there are no weird features or problems with it. We will learn three of these commands:

* ```head(...)```
* ```print(...)```
* ```glimpse(...)```

Let's try these methods out to see what each one does!

In [None]:
head(SFS_data)
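
To compare all three commands side by side, here is a minimal sketch on a small made-up tibble (the `toy` data is hypothetical; you can run the same three calls on `SFS_data`).

```r
library(dplyr)

# Hypothetical toy tibble standing in for a real dataset
toy <- tibble(
  household = 1:8,
  wealth    = c(120000, 450000, 80000, 990000, 310000, 67000, 540000, 230000),
  gender    = c("Male", "Female", "Female", "Male", "Male", "Female", "Male", "Female")
)

head(toy, n = 3)  # the first 3 rows (the default is 6)
print(toy)        # the tibble itself, with column types shown under the names
glimpse(toy)      # one line per variable: name, type, and the first few values
```

`glimpse()` is often the most convenient of the three for wide datasets, because every variable gets its own line instead of being cut off at the edge of the screen.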

You can use the following command to see detailed information about a variable.

In [None]:
# If you want to know more about the variables, you can input the name in dictionary(). Sometimes you may need to compile 'source("functions2.r")' twice.
dictionary('pefatinc')

If you want to familiarize yourself with the data, uncomment and run the next loop, which prints out a definition for each variable.

In [None]:
#for (word in names( SFS_data)) {
 #  print(word)
  # print(dictionary(word))
# }

Some important variables in this dataset:

* ```pefmtinc```: income_before_tax
* ```pefatinc```: income_after_tax
* ```pwnetwpg```: wealth
* ```pgdrmie```: gender
* ```peducmie```: education


#### Accessing Variables and Data Frames

If you recall, this dataset is _tidy_ in that each observation is a row, and each column is a variable.  In R, datasets are called *data frames* - this particular one is a special type of data frame called a _tibble_ (like, table).  We don't need to get into too many details about data frames, but basically they collect and organize all of the variables and observations.  Many functions in R need information to be organized into a data frame so that it can be computed.  You will see examples of these later on.

One of the most important things to remember is that you can access the _variables_ in a dataframe in two ways:

1.  First, you can use the `$` operator to directly access the variables
2.  Second, within a command, you can tell the command what data you are working with, then refer to the variables by name.

For example, if we wanted to get the ``mean`` of ``pwnetwpg`` (wealth) (for some reason...) we could do it this way:

In [None]:
mean(SFS_data$pwnetwpg)

This says get the variable ``pwnetwpg`` from the dataframe ``SFS_data``, then compute the mean.  In a command like ``filter`` you can tell the command to work with the data, and then refer to the variables just by their name.  

In [None]:
SFS_data <- filter(SFS_data, !is.na(pefmtinc))  # keep only the observations where before-tax income is not missing (not NA)
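
The two ways of accessing variables can be seen side by side in the following sketch. It uses a small hypothetical tibble, `toy`, rather than the survey data.

```r
library(dplyr)

# Hypothetical toy data with one missing income value
toy <- tibble(
  income = c(40000, NA, 75000, 52000),
  wealth = c(100000, 250000, 900000, 310000)
)

# 1. `$` pulls a single column out of the data frame as a vector
mean(toy$wealth)

# 2. Inside filter(), the data frame comes first and columns are referred to by name
toy_complete <- filter(toy, !is.na(income))
nrow(toy_complete)
```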

In [None]:
SFS_data <- rename(SFS_data, income_before_tax = pefmtinc) # Finally, we rename columns to give them more meaningful names; this is not necessary but is generally good practice
SFS_data <- rename(SFS_data, income_after_tax = pefatinc)
SFS_data <- rename(SFS_data, wealth = pwnetwpg)
SFS_data <- rename(SFS_data, gender = pgdrmie)
SFS_data <- rename(SFS_data, education = peducmie)

## Part 2: Hands-On: The Wealth Gap in Female- and Male-Led Households

Now we can get started on our analysis. Gender inequality is widely discussed in the popular press, economics, and political science. Women are an important part of the workforce, but studies show that there is both a gender income gap and a gender wealth gap. What factors contribute to these gaps? Here we are going to explore the wealth gap between male-led households and female-led households (a male-led household is a household in which a male earns the majority of the income).

The first question we should ask ourselves is: is there a significant gap between male-led household wealth and female-led household wealth? A different way to ask this is: if there were truly no difference between the two groups, how often would we expect a gap at least this large to appear by chance? This describes a $p$-value, so first we will perform a $t$-test to get this value. If this $p$-value indicates a significant gap in wealth, we can go on to ask what is causing this wealth gap.

At a high level, there are only a few things that can cause this wealth gap:

* A difference in income between male-led and female-led households,

* Different saving and investment habits in female-led and male-led households (for instance, male-led households might take on more risk or hold less cash than female-led households),

* Female-led households inheriting less wealth (regardless of income), or

* Female-led households facing additional obstacles to building wealth, such as access to credit.

> **Note**: these factors are not mutually exclusive (for instance, lower earnings and smaller inheritances could both contribute to the gap).

We will be working through this topic for the next three modules. In this one, we will see whether there is a significant wealth gap and begin to ask what causes it, which may involve investigating the income gap between male-led and female-led households. First, we will compute the average family wealth. Then we will compute the average family wealth conditional on being a female-led household versus a male-led household. Next, we will do a $t$-test to see if the wealth gap is significant. Finally, we will briefly investigate what is causing the wealth gap.

In [None]:
mean(SFS_data$wealth) # this computes the sample mean of wealth in SFS_data


We need to tell R that our variables are actually qualitative (factor) variables, so they are easier to read.  Let's try that now using the ``as_factor`` function. This function automatically translates labelled values into factor variables.  Compare the results above with the new dataset below.

In [None]:
SFS_data<-SFS_data[!(SFS_data$education=="9"),]
SFS_data$education <- as.numeric(SFS_data$education)
SFS_data <- SFS_data[order(SFS_data$education),]
SFS_data$education <- as.character(SFS_data$education)
SFS_data$education[SFS_data$education == "1"] <- "Less than high school" 
SFS_data$education[SFS_data$education == "2"] <- "High school"
SFS_data$education[SFS_data$education == "3"] <- "Non-university post-secondary"
SFS_data$education[SFS_data$education == "4"] <- "University"

In [None]:
SFS_data$gender <- as_factor(SFS_data$gender) # this swaps gender to be male or female rather than 0 or 1
SFS_data$education <- as_factor(SFS_data$education)
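
To see what `as_factor` does to labelled values, here is a minimal sketch using a hand-made labelled vector from the `haven` package. The codes and labels below are hypothetical and chosen only for illustration; they are not taken from the SFS codebook.

```r
library(haven)

# Hypothetical labelled vector: numeric codes with text labels attached
coded <- labelled(c(1, 2, 2, 1), labels = c(Male = 1, Female = 2))

coded             # prints the numeric codes along with their labels
as_factor(coded)  # each code is replaced by its label, giving a factor
```

This is why tables and plots made after the conversion show category names rather than numeric codes.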

### Computing the Wealth Gap

We will first describe the gender wealth gap. In R, there are many ways we could compute basic descriptive statistics.  The simplest way is to use ``summarize`` and ``group_by``:

In [None]:
# Next we look at wealth for households with men and women as main earners
results <- 
    SFS_data %>% 
    group_by(gender) %>%
    summarize(m_wealth = mean(wealth), sd_wealth = sd(wealth))

results 

> **Advanced Note**: _piping_ The above example uses a special R operator called a **pipe** (``%>%``).  Mechanically, what piping does is insert the object before the pipe as the first argument of the function after the pipe.  For example, if we have ``z <- f(x,y)`` we could write this using pipes as ``z <- x %>% f(y)``.  Piping is most useful when you are chaining (piping) a series of commands together.  You can think of a pipe as saying _and then_ followed by a command.  The item before the pipe becomes the input to the next command.  This lets you do complex data manipulation in a way which is readable.
> For example, the command above (i) starts with ``SFS_data`` (ii) groups it by ``gender``, (iii) takes the grouped data and summarizes it.  If we wrote this without using a pipe it would look like:
> ``summarize(group_by(SFS_data,gender), m_wealth = mean(wealth), sd_wealth = sd(wealth))``
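
The equivalence between the piped and nested forms can be checked directly. The sketch below uses a small hypothetical tibble, `toy`, in place of the survey data.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  gender = c("Male", "Female", "Male", "Female"),
  wealth = c(500000, 300000, 700000, 500000)
)

# Piped version: read it as "take toy, and then group, and then summarize"
piped <- toy %>%
  group_by(gender) %>%
  summarize(m_wealth = mean(wealth))

# Nested version: the same calls, read from the inside out
nested <- summarize(group_by(toy, gender), m_wealth = mean(wealth))

piped
all.equal(piped, nested)  # checks that the two results match
```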


We can also visualize this, using ``ggplot2``, which can create bar graphs and other visualizations.  Here's a bar graph, a boxplot, and a histogram.

In [None]:
f <- ggplot(data = SFS_data, aes(x = gender, y = wealth)) + xlab("Gender of Main Earner") + ylab("Wealth")
f1 <- f + geom_bar(stat = "summary", fun = "mean", fill = "lightblue") #produce a summary statistic, the mean
f1 <- f1 + coord_flip() #make a horizontal bar graph!

options(repr.plot.width=6,repr.plot.height=3) #this controls the size; you can change 6 and 3 to look better

f2 <- f + geom_boxplot(fill = "lightblue") + coord_flip()

f3 <- ggplot(data = SFS_data, aes(x = wealth)) + geom_histogram(binwidth = 500000) + xlab("Wealth") + ylab("Count") + facet_grid(. ~ gender)

f1
f2
f3

> _Think Deeper:_ What does this tell you about the distribution of wealth in this dataset?  Could this be a problem for our analysis?

This is all interesting to think about.  However, this is not a formal test of the gender-wealth gap.  We need to examine this from a statistical perspective. In other words, we would like to examine if the gender-wealth gap is statistically significant. We can do this through a two sample $t$-test, which can be performed using the `t.test()` command in R: 

In [None]:
t1 = t.test(
       x = filter(SFS_data, gender == "Male")$wealth,
       y = filter(SFS_data, gender == "Female")$wealth,
       alternative = "two.sided",
       mu = 0,
       conf.level = 0.95)

t1 

round(t1$estimate[1] - t1$estimate[2],2) 

As we can see here, the `t.test()` command outputs the $p$-value and test statistic immediately. This particular $t$-test used a 95% confidence level. The gender-wealth gap is 235,285.6 dollars. As you can see, there is a significant gap in wealth between male-led and female-led households.
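
To see where the test statistic comes from, the sketch below runs `t.test()` on simulated data and reproduces the statistic by hand. By default `t.test()` does not assume equal variances (this is the Welch version of the test), so the standard error combines the two sample variances separately. The simulated wealth values are hypothetical and unrelated to the SFS results.

```r
set.seed(326)  # make the simulated data reproducible

# Hypothetical simulated wealth for two groups of households
male_wealth   <- rnorm(200, mean = 900000, sd = 400000)
female_wealth <- rnorm(180, mean = 700000, sd = 350000)

toy_test <- t.test(
  x = male_wealth,
  y = female_wealth,
  alternative = "two.sided",
  mu = 0,
  conf.level = 0.95
)

# The same statistic by hand: difference in means divided by its standard error
se <- sqrt(var(male_wealth) / length(male_wealth) +
           var(female_wealth) / length(female_wealth))
t_by_hand <- (mean(male_wealth) - mean(female_wealth)) / se

toy_test$statistic  # test statistic reported by t.test()
t_by_hand           # matches the value above
toy_test$p.value    # p-value of the two-sided test
toy_test$conf.int   # 95% confidence interval for the difference in means
```

The pieces of the result (`statistic`, `p.value`, `conf.int`, `estimate`) can be extracted with `$`, just as `t1$estimate` is used above.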

### Going Deeper:
> ***What could be the potential causes of the gender-wealth gap in Canada?***

The next step is to understand why or how the gender wealth gap might exist. The natural potential factors to study are education and income. For example, perhaps females are less likely to have a university degree, or perhaps females earn lower incomes.  These reasons could, potentially, create a gender wealth gap.  Let's take a look at education in the survey, then try to understand how it interacts with gender:

In [None]:
# Next we look at wealth for households with different education
results <- 
    SFS_data %>% 
    group_by(education) %>%
    summarize(m_wealth = mean(wealth), sd_wealth = sd(wealth))

results 

The results suggest average wealth increases with education of the main earner. 

> _Think Deeper:_ Why might this be the case?  

We can also see how this breaks down by gender.  Look at the following table - do you see a pattern?

In [None]:
results <- 
    SFS_data %>%
    group_by(education,gender) %>% 
    summarize(m_wealth = mean(wealth), sd_wealth = sd(wealth))

results

options(repr.plot.width=10,repr.plot.height=3)

f <- ggplot(data = SFS_data, aes(x = gender, y = wealth)) + xlab("Gender") + ylab("Wealth")
f <- f + geom_bar(stat = "summary", fun = "mean", fill = "lightblue") #produce a summary statistic, the mean
f <- f + facet_grid(. ~ education) #add a grid by education

f

You can see that female-led households tend to accumulate less wealth than their male-led counterparts.

However, it is worthwhile to look at the wealth gap in percentage terms rather than absolute terms. This is because people at a higher education level earn more, so the absolute gap may appear deceptively large. Below we define a function, ``percentage_table``, which takes three inputs: `result` (a summary table such as `results`), `column1` (the name of the grouping column), and `column3` (the name of the column of means). The function returns the table we want.

In [None]:
percentage_table <- function(result, column1, column3) {
  # gap in percentage: (wealth_male - wealth_female) / wealth_female
  female_wealth <- filter(result, gender == "Female")[[column3]]
  male_wealth <- filter(result, gender == "Male")[[column3]]
  wealth_gap <- (male_wealth - female_wealth) / female_wealth

  # group labels (e.g. education levels), taken from the female rows
  education <- filter(result, gender == "Female")[[column1]]

  gap <- data.frame(education, wealth_gap)
  gap$wealth_gap <- percent(gap$wealth_gap) # show in percentage form
  return(gap)
}

Let's call `percentage_table` and use `results` generated in the previous cell (the $8\times 4$ table) as the input. The following table shows, for each education level, how much more wealth the average male-led household has accumulated than the average female-led household, in percentage terms.

In [None]:
percentage_table(result=results, column1="education", column3="m_wealth")
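
The calculation inside `percentage_table` can be checked on a single pair of numbers. The wealth figures below are hypothetical and chosen to make the arithmetic easy; `percent()` comes from the `scales` package loaded earlier.

```r
library(scales)

# Hypothetical average wealth figures, for illustration only
male_wealth   <- 1250000
female_wealth <- 1000000

# Relative gap: (male - female) / female
gap <- (male_wealth - female_wealth) / female_wealth

gap                            # 0.25
percent(gap, accuracy = 0.1)   # "25.0%"
```

A gap of 25% here means the male figure is 25% larger than the female figure, since the female value is the denominator.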

From the results above, households with a male main earner generally accumulate more wealth than female-led households with a similar educational background. The gender-wealth gap is widest for university graduates, with male-led households holding 38.54% more wealth than female-led households. We can make this even more clear by adding a new variable (``university``) to our dataset.  Frequently, we will want to make new variables to help us analyze the results, especially when a variable is more complicated than we would like it to be.

You can create this in many ways - but a very useful command is the ``case_when`` command.  Here is an example for our ``university`` variable.  Pay attention to the use of the ``as_factor`` command at the end to tell R that this is still a qualitative variable.



In [None]:
SFS_data <- SFS_data %>% 
               mutate( 
               university = case_when(#this is an example of this function
                     education == "University" ~ "Yes", #the ~ separates the condition from the new value
                     education == "Non-university post-secondary" ~ "No",
                     education == "High school" ~ "No",
                     education == "Less than high school" ~ "No")) %>%
             mutate(university = as_factor(university)) #remember, it's a factor!

glimpse(SFS_data$university)
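
As a variation, ``case_when`` conditions are checked in order, so a final catch-all condition can replace the three separate "No" lines. This is a sketch on a hypothetical toy tibble, not a change to the cell above.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(education = c("University", "High school", "Less than high school"))

toy <- toy %>%
  mutate(university = case_when(
    education == "University" ~ "Yes",
    TRUE ~ "No"   # catch-all: every row not matched by an earlier condition
  ))

toy$university
```

One design consideration: a catch-all also assigns "No" to rows where `education` is missing, whereas listing each category explicitly leaves those rows as `NA`.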

Now, let's repeat the analysis we did above by education status; then we can perform a $t$-test on each of these sub-groups:

In [None]:
results <- 
    SFS_data %>%
    group_by(university,gender) %>%
    summarize(m_wealth = mean(wealth), sd_wealth = sd(wealth))

results 

f <- ggplot(data = SFS_data, aes(x = gender, y = wealth)) + xlab("Gender") + ylab("Wealth")
f <- f + geom_bar(stat = "summary", fun = "mean", fill = "lightblue") #produce a summary statistic, the mean
f <- f + facet_grid(. ~ university) #add a grid by education

f

Similarly, let's look at the wealth gap in percentage terms. We use `results` generated in the previous cell (the $4\times 4$ table) as the input this time.

In [None]:
percentage_table(result=results, column1="university", column3="m_wealth")

Without a university degree, male-led households accumulate 21% more wealth than their female-led counterparts, while with a university degree, the gap widens to 39%. Thus, education seems to be associated with a larger gender-wealth gap.

Let's study the gender wealth gap within the "university degree" and "no university degree" subsamples by running a formal two sample $t$-test in each.

In [None]:
university_data = filter(SFS_data, university == "Yes") #university only data 
nuniversity_data = filter(SFS_data, university == "No") #not university data

t2 = t.test(
       x = filter(university_data, gender == "Male")$wealth,
       y = filter(university_data, gender == "Female")$wealth,
       alternative = "two.sided",
       mu = 0,
       conf.level = 0.95)

t2  #test for the wealth gap in university data

round(t2$estimate[1] - t2$estimate[2],2)


t3 = t.test(
       x = filter(nuniversity_data, gender == "Male")$wealth,
       y = filter(nuniversity_data, gender == "Female")$wealth,
       alternative = "two.sided",
       mu = 0,
       conf.level = 0.95)

t3 #test for the wealth gap in non-university data

round(t3$estimate[1] - t3$estimate[2],2)

Consider the results above.  Do you think the results of the Welch two sample $t$-tests are consistent with the descriptive results above?

> _Think Deeper_: How can you explain the effects of education on the gender-wealth gap? What would you need to know in order to rationalize your explanation?  

### Wrapping Up

At this point, we have started to explore the gender wealth gap.  Next, work on the following exercises to learn more. We will touch on the second important factor: income.

## Part 3: Exercises

### Activity 1 
First, examine ``SFS_data`` with a focus on before-tax income. Create a table that tabulates the average before-tax income by gender. A correct ``tab_income`` table will pass the test.  Try looking at how we generated some of the earlier tables for inspiration, if you need a hint.

In [None]:
tab_income <- #fill in the code below; what goes before the %>%?
    SFS_data %>% 
             %>%
    summarize(m_income = mean(income_before_tax), sd_income = sd(income_before_tax))

tab_income

answer1 <- tab_income
test_1() #quiz 1

#### Short Answer 1:

What type of variable is ``gender``? Does it make sense to have ``gender`` as that variable type? Why or why not?  Write your answer in the box below:

<font color="red">Answer here (delete this text)</font>

### Activity 2

The table that we got in the previous activity is fairly clear, but let's illustrate things with a chart. Construct a bar graph that charts the average income before tax by gender. ``income_graph`` will store this plot. You can see it by running the second code chunk below.

In [None]:
SFS_data$gender <- haven::as_factor(SFS_data$gender)
income_graph <- ggplot(data = SFS_data, aes(x = , y = income_before_tax)) + xlab(" ") + ylab("Income before tax")  #what goes in the x = spot?  what goes in xlab("")?
income_graph <- income_graph + geom_bar(stat = "summary", fun = mean, fill = "lightblue")
income_graph <- income_graph + coord_flip()

In [None]:
income_graph

#### Short Answer 2
Examine the graph.  What do we observe when we compare the average before-tax income between genders? What does this suggest?

<font color="red">Answer here (delete this text)</font>

### Activity 3
Now, create a table that tabulates average before-tax income by education level and gender. This table, labelled ``tab_income2``, will be tested for correctness.

In [None]:
tab_income2 <- 
     %>%
    group_by(education,) %>% 
    summarize(m_income = mean(income_before_tax), sd_income = sd(income_before_tax))

tab_income2

answer2 <- tab_income2
test_2() #quiz 2

Next, create a percentage table that compares the average income gap between males and females within education levels (``tab_income_percent`` will store this). Note that most of the syntax is provided -- you simply need to fill in the missing code.

In [None]:
tab_income_percent <- percentage_table(result= , column1="education", column3="m_income")
tab_income_percent

#### Short Answer 3
Examine the table. What do we observe when we compare the before-tax income gap between education levels? What does this suggest? And if you compare the income gap table with the wealth gap table (percentage), what do you find?

<font color="red">Answer here (delete this text)</font>

### Activity 4
Economists are often concerned with two aspects of the relationship between household type (male-led versus female-led) and education:
* The difference in average income between the two household groups
* The difference in returns to education between the two household groups

Let's explore these two topics. First, test whether there are significant differences in income before tax between male-led and female-led households within each education group. Within which education levels do we see significant differences?

_Note_: You will perform the t-tests of the gender income gap on each education group.

In [None]:
#Less than high school 
tlesshs = 

tlesshs

#High school 

ths = 


round(ths$estimate[1] - ths$estimate[2],2)
ths
test_3() #quiz 3



# Non-uni post-secondary

tsocol = 

tsocol

#University

tuni = 

tuni
round(tuni$estimate[1] - tuni$estimate[2],2)
test_4() #quiz 4



### Activity 5
Next, examine whether returns to education differ between genders. For our purposes, we will define:
> **Returns to Education**: The difference in average income before tax between two consecutive education levels.

Run this test for the returns to education of: 
* High school diploma (relative to less than high school) and 
* University degree (relative to non-university post-secondary)

*The following t-test objects will be tested for correctness:* Returns to education of a high school diploma for males (``retHS``) and for females (``retHSF``), and returns to education of a university degree for males (``retU``) and for females (``retUF``).

In [None]:
#Returns to education: High school diploma

##Males

retHS = #what goes here?

retHS
round(retHS$estimate[2] - retHS$estimate[1],2)

test_5() 

##Females

retHSF = #what goes here?

retHSF
round(retHSF$estimate[2] - retHSF$estimate[1],2)

test_6() #quiz 6

In [None]:
#Returns to education: University

##Males

retU = #what goes here?

retU
round(retU$estimate[2] - retU$estimate[1],2)

test_7() 

##Females

retUF = #what goes here?

retUF
round(retUF$estimate[2] - retUF$estimate[1],2)

test_8() #quiz 8

#### Short Answer 4
**Reflect on your analysis:** Interpret the results of the t-tests above. Are the returns to each level of education significant for males? For females? Comment on the difference between returns to education for a high school degree and that for a university degree.

<font color="red">Answer here (delete this text)</font>

#### Short Answer 5
**Discuss your results:** Do the returns to education (at either level) differ between males and females? What differences between the two groups might explain this?

<font color="red">Answer here (delete this text)</font>

### Activity 6
Now, let's repeat Activity 3 with after-tax income, i.e., create a table that tabulates average after-tax income by education level and gender. This table, labelled ``tab_income3``, will be tested for correctness.

In [None]:
tab_income3 <- 
     %>%
    group_by(,gender) %>% 
    summarize(m_income = mean(), sd_income = sd())

tab_income3

answer3 <- tab_income3
test_9() #quiz 9

Next, create a percentage table that compares the average income gap between males and females within education levels (``tab2_income_percent`` will store this). Note that most of the syntax is provided -- you simply need to fill in the missing code.

In [None]:
tab2_income_percent <- percentage_table(result= , column1="education", column3=" ")
tab2_income_percent

#### Short Answer 6
Compare the above table with the one in Activity 3. What do you find and why? And if you compare the after-tax income gap table with the wealth gap table (percentage), what do you find?

<font color="red">Answer here (delete this text)</font>

<span id="fn1">[<sup>1</sup>](#fn1s)Provided under the Statistics Canada Open License (Public). Adapted from Statistics Canada, 2021 Census Public Use Microdata File. This does not constitute an endorsement by Statistics Canada of this product.</span>