# ECON 490: Within Group Analysis (6)

## Prerequisites 
---
1. Inspect and clean the variables of a dataset.
2. Generate basic variables for a variety of purposes.

## Learning objectives:
---
1. Create new variables using the command `egen`.
2. Know when to use the pre-command `by` and when to use `bysort`.
3. Change a panel dataset to a cross-sectional dataset and vice versa.

We'll continue working with the fake data dataset introduced in the previous lecture. Recall that this dataset simulates information on workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.

In [1]:
library(haven)
library(tidyverse)

In [8]:
data <- read_csv("../econ490-stata/fake_data.csv")  #change me!

## 6.1: Key Functions for Group Analysis
---

When we are working on a particular project, it is often important to know how to visualize data for specific groupings, whether of variables or of observations meeting specific conditions. We have seen this briefly in previous notebooks with the `group_by` function. Here we will go into much more depth and look at a variety of functions for conducting group-level analysis. We will rely heavily on the `dplyr` package, which we have implicitly imported through the `tidyverse` package.

### 6.1.1: `arrange`

Before grouping data, we may want to order our dataset based on the values of a particular variable. The `arrange` function helps us achieve this. It takes a dataframe and one or more variables and sorts the dataframe in ascending order by default; to sort in descending order, we wrap the variable in `desc()`. If we want to sort our entire dataset by one of our variables, say `year`, we can do it as below.

In [None]:
# arrange the dataframe by ascending year
data %>% arrange(year)

# arrange the dataframe by descending year
data %>% arrange(desc(year))

We can pass multiple variable parameters to the `arrange` function to indicate how we should further sort our data within each year grouping. For instance, including the `region` variable will further sort each year grouping in order of region.

In [None]:
data %>% arrange(year, region)
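
To see these sort orders without loading the full file, here is a minimal sketch on a small made-up dataframe. The name `toy` and its values are hypothetical and are not taken from `fake_data.csv`.

```r
library(dplyr)

# Hypothetical toy data, not taken from fake_data.csv
toy <- data.frame(
  year   = c(2001, 1999, 2001, 1999),
  region = c(2, 3, 1, 1)
)

# Sort by year, then by region within each year
sorted <- toy %>% arrange(year, region)

# Sort by year in descending order, then by region (ascending) within each year
sorted_desc <- toy %>% arrange(desc(year), region)

print(sorted)
print(sorted_desc)
```

Note that `arrange` only reorders rows; the number of rows and columns is unchanged.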

### 6.1.2: `group_by`

This is one of the most pivotal functions in `dplyr`. It allows you to group a dataframe by the values of a specific variable and perform further operations on those groups. Let's say that we wanted to group our dataset by `region` and count the number of observations in each region. We can simply pass this variable as a parameter to our `group_by` function and pipe the result into the `tally()` function.

In [None]:
data %>% group_by(region) %>% tally()

Notice how the output lists the regions in ascending order automatically. Note also that `group_by` on its own does not change the rows of the dataset; it only records the grouping. It is the subsequent call to `tally()` that collapses the data to one row per group, so it is important not to overwrite our `data` dataframe with this result if we want to preserve our original data. We can also pass multiple arguments to `group_by`. If we pass both `region` and `treated` to our function as inputs, our region groups will be further split into observations which are and are not treated. Let's count the number of treated and untreated observations in each region.

In [None]:
data %>% group_by(region, treated) %>% tally()

Finally, we can pipe a group_by object into another group_by object. In this case, the second group_by will simply overwrite the first. For example, if we wanted to pass our original `region` group_by into a `treated` group_by, we will simply get a dataframe counting the total number of people who are treated and untreated.

In [None]:
data %>% group_by(region) %>% group_by(treated) %>% tally()
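
The overwriting behaviour can be checked directly with `group_vars()`, which lists the variables a dataframe is currently grouped by. The sketch below uses a hypothetical `toy` dataframe and assumes `dplyr` 1.0.0 or later, where `group_by` accepts `.add = TRUE` to extend an existing grouping instead of replacing it.

```r
library(dplyr)

# Hypothetical toy data, not taken from fake_data.csv
toy <- data.frame(
  region  = c(1, 1, 2, 2, 2),
  treated = c(0, 1, 0, 0, 1)
)

# A second group_by() replaces the first: only treated defines the groups
replaced <- toy %>% group_by(region) %>% group_by(treated)
print(group_vars(replaced))

# With .add = TRUE the second call adds to the existing grouping instead
added <- toy %>% group_by(region) %>% group_by(treated, .add = TRUE)
print(group_vars(added))
```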

### 6.1.3: `group_keys`

This function allows us to see the specific groups for a group_by dataframe we have created. For instance, if we wanted to see every year in the data, we could group by `year` and then apply the `group_keys` function.

In [None]:
data %>% group_by(year) %>% group_keys()

This gives the same set of values as using the `unique` function directly on a column of our dataset, as below. The output in this case is a vector rather than another dataframe, and the values appear in the order they occur in the data rather than sorted.

In [None]:
unique(data$year)

### 6.1.4: `ungroup`

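We can even selectively remove grouping variables from a grouped dataframe. Say we grouped our dataframe by `region` and `treated`, then realized that we only wanted counts by `region`. If the grouped dataframe is stored as `A`, we can use `ungroup` to "work backwards": passing `treated` removes the grouping by treatment status and leaves the grouping by `region`, so `tally()` gives a count for just regions. Calling `ungroup()` with no arguments removes all grouping.

In [None]:
# group by region and treatment status, storing the grouped dataframe as A
A <- data %>% group_by(region, treated)

# drop the grouping by treated, then count observations by region only
A %>% ungroup(treated) %>% tally()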
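
As a quick check of what `ungroup` leaves behind, the sketch below inspects the grouping before counting. It uses a hypothetical `toy` dataframe rather than the course data.

```r
library(dplyr)

# Hypothetical toy data, not taken from fake_data.csv
toy <- data.frame(
  region  = c(1, 1, 1, 2, 2),
  treated = c(0, 0, 1, 0, 1)
)

A <- toy %>% group_by(region, treated)

# Remove only the grouping by treated; region remains a grouping variable
by_region <- A %>% ungroup(treated)
print(group_vars(by_region))

# Counts are now per region
counts <- by_region %>% tally()
print(counts)

# ungroup() with no arguments removes all grouping, so tally() counts every row
overall <- A %>% ungroup() %>% tally()
print(overall)
```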

In [39]:
%browse 10

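| | workerid | year | sex | birth_year | age | start_year | region | treated | earnings |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1999 | M | 1944 | 55 | 1997 | 1 | 0 | 39975.012 |
| 2 | 1 | 2001 | M | 1944 | 57 | 1997 | 1 | 0 | 278378.09 |
| 3 | 2 | 2001 | M | 1947 | 54 | 2001 | 4 | 0 | 18682.6 |
| 4 | 2 | 2002 | M | 1947 | 55 | 2001 | 4 | 0 | 293336.41 |
| 5 | 2 | 2003 | M | 1947 | 56 | 2001 | 4 | 0 | 111797.3 |
| 6 | 3 | 2005 | M | 1951 | 54 | 2005 | 5 | 0 | 88351.672 |
| 7 | 3 | 2010 | M | 1951 | 59 | 2005 | 5 | 0 | 46229.57 |
| 8 | 4 | 1997 | M | 1952 | 45 | 1997 | 5 | 1 | 24911.029 |
| 9 | 4 | 2001 | M | 1952 | 49 | 1997 | 5 | 1 | 9908.3623 |
| 10 | 5 | 2009 | M | 1954 | 55 | 1998 | 2 | 1 | 137207.3 |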


In [47]:
%browse 10

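| | workerid | year | sex | birth_year | age | start_year | region | treated | earnings | var_one | obs_number | tot_obs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 376133 | 1995 | F | 1974 | 21 | 1995 | 2 | 1 | 2709.9409 | 1 | 1 | 2861772 |
| 2 | 301719 | 1995 | F | 1971 | 24 | 1995 | 5 | 1 | 12740.87 | 1 | 2 | 2861772 |
| 3 | 314228 | 1995 | F | 1971 | 24 | 1995 | 2 | 1 | 62211.879 | 1 | 3 | 2861772 |
| 4 | 51487 | 1995 | M | 1947 | 48 | 1995 | 4 | 0 | 12464.09 | 1 | 4 | 2861772 |
| 5 | 196269 | 1995 | F | 1963 | 32 | 1995 | 4 | 0 | 9238.9434 | 1 | 5 | 2861772 |
| 6 | 55165 | 1995 | M | 1963 | 32 | 1995 | 1 | 0 | 56436.5 | 1 | 6 | 2861772 |
| 7 | 319139 | 1995 | M | 1970 | 25 | 1995 | 4 | 0 | 22412.1 | 1 | 7 | 2861772 |
| 8 | 314233 | 1995 | M | 1945 | 50 | 1995 | 2 | 0 | 57139.121 | 1 | 8 | 2861772 |
| 9 | 320516 | 1995 | M | 1974 | 21 | 1995 | 3 | 0 | 248878.59 | 1 | 9 | 2861772 |
| 10 | 316625 | 1995 | M | 1959 | 36 | 1995 | 1 | 0 | 125149.4 | 1 | 10 | 2861772 |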


In [48]:
cap drop obs_number 
bysort workerid: gen obs_number = _n 

cap drop tot_obs
bysort workerid: gen tot_obs = _N

As we can see, some workers are observed only 2 times in the data, whereas other workers are observed 8 times. By knowing (and recording in a variable) the number of times a worker has been observed, we can run different analyses based on this information. For example, in some cases you might be interested in keeping only workers who are observed across all time periods.
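
The Stata commands above use `_n` and `_N` within `bysort workerid`. One possible R counterpart, sketched here on a hypothetical `toy` dataframe, is to group by `workerid` and use `row_number()` and `n()` inside `mutate`. The variable names `obs_number` and `tot_obs` mirror the Stata cell; the data values are made up.

```r
library(dplyr)

# Hypothetical toy panel, not taken from fake_data.csv
toy <- data.frame(
  workerid = c(1, 1, 2, 2, 2),
  year     = c(1999, 2001, 2001, 2002, 2003)
)

toy_counts <- toy %>%
  arrange(workerid, year) %>%     # sort first, as bysort does
  group_by(workerid) %>%
  mutate(
    obs_number = row_number(),    # position within worker, like Stata's _n
    tot_obs    = n()              # rows per worker, like Stata's _N
  ) %>%
  ungroup()

print(toy_counts)
```

Because `mutate` keeps every row, the original observations are preserved and the new columns are simply added alongside them.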

## 6.2: Generating Variables for Group Analysis
---
We have already seen how to redefine and add new variables to a dataframe using the `df$var <- ...` format. We have also seen how to use the `mutate` function to add new variables to a dataframe. However, we often want to add new variables to grouped dataframes to display information about the different groups rather than about individual observations from the original dataframe. That is where `summarise` comes in. The `summarise` function lets us apply a variety of common functions to generate one value per group. For instance, we may want to find the mean earnings of each region. To do so, we group on `region` and then add a variable to our grouped dataframe which holds the mean of the `earnings` variable for each region group.

In [None]:
data %>% group_by(region) %>% summarise(meanearnings = mean(earnings))

We may want more detailed information about each region. We can pass several named expressions to `summarise`, and it will generate a variable for each of them. Let's say we want the mean and standard deviation of earnings for each group, as well as the range of earnings in each group (max earnings - min earnings).

In [None]:
data %>% 
    group_by(region) %>% 
    summarise(
        meanearnings = mean(earnings),
        stdevearnings = sd(earnings),
        range = max(earnings) - min(earnings)
    )


The Stata command `egen` (extensions to `generate`) is used whenever we want to create variables that require some function (e.g. mean, standard deviation, min). The basic syntax works as follows:

```
 bysort groupvar: egen new_var = function(arguments), options
```

Let's see an example where we create a new variable called `avg_earnings` which is the `mean` of earnings for every worker.

In [50]:
cap drop avg_earnings
bysort workerid: egen avg_earnings = mean(earnings)

In [51]:
cap drop total_earnings
bysort workerid: egen total_earnings = total(earnings)

By definition, these commands will create variables that use information across different observations. You can check the list of available functions by writing `help egen`.
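
For comparison, here is a sketch of how the same per-worker mean and total could be computed in R. It assumes a `group_by` plus `mutate` pattern stands in for `bysort ...: egen`; the `toy` dataframe and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not taken from fake_data.csv
toy <- data.frame(
  workerid = c(1, 1, 2),
  earnings = c(100, 300, 50)
)

toy_stats <- toy %>%
  group_by(workerid) %>%
  mutate(
    avg_earnings   = mean(earnings),  # same value repeated on every row of a worker
    total_earnings = sum(earnings)
  ) %>%
  ungroup()

print(toy_stats)
```

Using `mutate` rather than `summarise` keeps one row per observation, which matches what `egen` does: the group statistic is attached to every row in the group.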

In this documentation, you will notice that there are some functions that do not allow for `by`. For example, suppose we want to create the total sum across different variables in the same row. 

In [52]:
cap drop sum_of_vars
egen sum_of_vars = rowtotal(start_year region treated)

The variable we are creating for this example has no particular meaning, but notice that the function `rowtotal()` only sums the non-missing values in our variables. If one of the three variables is missing, the sum is taken over the other two. We could also write this command as `gen sum_of_vars = start_year + region + treated`; however, the two commands differ in how they treat missing observations. If you add any number to a missing value (`.`), the result is also missing, so `sum_of_vars` would be missing whenever `start_year`, `region`, or `treated` is missing.
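
The same contrast exists in R, where missing values are `NA`. The sketch below, on a hypothetical `toy` dataframe, compares `rowSums(..., na.rm = TRUE)`, which behaves like `rowtotal()`, with plain `+`, which propagates missing values.

```r
library(dplyr)

# Hypothetical toy data with one missing start_year
toy <- data.frame(
  start_year = c(1997, NA),
  region     = c(1, 4),
  treated    = c(0, 1)
)

toy <- toy %>%
  mutate(
    # Sums only the non-missing values in each row, like egen's rowtotal()
    sum_rowtotal = rowSums(cbind(start_year, region, treated), na.rm = TRUE),
    # Plain addition returns NA if any term is NA, like gen with "+"
    sum_plus = start_year + region + treated
  )

print(toy)
```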

Notice that we can use `by` with a list of variables, not necessarily a single variable.

In [53]:
cap drop regionyear_earnings
bysort year region : egen regionyear_earnings = total(earnings)

## 6.3: Collapsing the Data
---


We can also compute statistics at some group level with the `collapse` command. However, these changes are irreversible. For example, suppose we want to create a dataset at the region-year level using information in the current dataset.

First, we decide which statistics we want to keep from the original dataset. For the sake of explanation, let's suppose we want to keep average earnings, the variance of earnings, and the total employment. 

The syntax is 

```
 collapse (statistic1) new_name = existing_variable (statistic2) new_name2 = existing_variable2 ... , by(group) 
```


We write

In [54]:
collapse (mean) avg_earnings = earnings (sd) sd_earnings = earnings (count) tot_emp = earnings , by(region year)

In [55]:
%browse 10

| | year | region | avg_earnings | sd_earnings | tot_emp |
|---|---|---|---|---|---|
| 1 | 1995 | 1 | 67284.297 | 144739.13 | 28143 |
| 2 | 1996 | 1 | 65632.828 | 139990.23 | 31078 |
| 3 | 1997 | 1 | 67655.398 | 144867.91 | 33583 |
| 4 | 1998 | 1 | 67836.609 | 113053.41 | 35696 |
| 5 | 1999 | 1 | 69703.961 | 130300.19 | 37443 |
| 6 | 2000 | 1 | 70636.844 | 153549.7 | 39559 |
| 7 | 2001 | 1 | 75025.453 | 212302.61 | 40722 |
| 8 | 2002 | 1 | 72514.414 | 148687.64 | 41353 |
| 9 | 2003 | 1 | 75885.875 | 134078.48 | 41242 |
| 10 | 2004 | 1 | 73384.0 | 145342.2 | 41339 |


As you can see above, there's no way to recover the information we previously had. However, we may be interested in analyzing this type of dataset at a group level. If this is not what you intend to do, you should stick to the `by` and `bysort` pre-commands.
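
In R, a comparable region-year dataset can be built with `group_by` and `summarise`. The sketch below is an approximation of the `collapse` call above on a hypothetical `toy` dataframe; it assumes `dplyr` 1.0.0 or later for the `.groups` argument, and it counts non-missing earnings to mimic Stata's `(count)`.

```r
library(dplyr)

# Hypothetical toy data, not taken from fake_data.csv
toy <- data.frame(
  year     = c(1995, 1995, 1995, 1996),
  region   = c(1, 1, 2, 1),
  earnings = c(100, 300, 50, 80)
)

region_year <- toy %>%
  group_by(region, year) %>%
  summarise(
    avg_earnings = mean(earnings),
    sd_earnings  = sd(earnings),          # NA when a group has a single row
    tot_emp      = sum(!is.na(earnings)), # number of non-missing earnings
    .groups = "drop"                      # return an ungrouped dataframe
  )

print(region_year)
```

One difference from Stata is that the result is stored in a new object (`region_year` here), so the original dataframe remains available unless we overwrite it.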

## 6.4: Reshaping
---


> Remember that we have collapsed our data and that this is irreversible, so we need to import the data again to regain access to the full dataset.

In [56]:
clear *
cd "."

import delimited using "fake_data.csv", clear



C:\Users\paulc\Dropbox\Projects\Gitlab\econometrics\econ490-stata

(9 vars, 2,861,772 obs)


Notice that the nature of this particular dataset is a panel (individual workers being followed over many years). Sometimes we are interested in working with a cross section (i.e. have 1 observation per worker). Is there a simple way to go back and forth between these two? Yes!

The command's name is `reshape` and has two main forms: `wide` and `long`. The former is related to a cross-sectional nature, whereas the latter is the usual panel nature. 

Suppose we want to record the earnings of the workers while keeping the information across years.

In [57]:
reshape wide earnings region age birth_year start_year, i(workerid) j(year)

(note: j = 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
>  2009 2010 2011)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                  2.9e+06   ->  868002
Number of variables                   9   ->      88
j variable (17 values)             year   ->   (dropped)
xij variables:
                               earnings   ->   earnings1995 earnings1996 ... ear
> nings2011
                                 region   ->   region1995 region1996 ... region2
> 011
                                    age   ->   age1995 age1996 ... age2011
                             birth_year   ->   birth_year1995 birth_year1996 ...
>  birth_year2011
                             start_year   ->   start_year1995 start_year1996 ...
>  start_year2011
-----------------------------------------------------------------------------


<div class="alert alert-warning">

**Warning:** This command acts on *all* of the variables. If you don't include a variable in the list, Stata assumes that it does not vary across *i* (workers, in this case). If you don't check this beforehand, you may get an error.

</div>
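
For readers following along in R, a similar long-to-wide reshape can be sketched with `pivot_wider` from the `tidyr` package. This is an illustration on a hypothetical `toy` dataframe, not a translation of the exact Stata output: the new column names use the default `_` separator (for example `earnings_1999`), whereas Stata produces `earnings1999`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy panel, not taken from fake_data.csv
toy <- data.frame(
  workerid = c(1, 1, 2),
  year     = c(1999, 2001, 2001),
  sex      = c("M", "M", "F"),
  earnings = c(100, 300, 50),
  age      = c(55, 57, 54)
)

wide <- toy %>%
  pivot_wider(
    id_cols     = c(workerid, sex),  # variables that do not vary within a worker
    names_from  = year,              # plays the role of j(year)
    values_from = c(earnings, age)   # variables that vary across years
  )

print(wide)
```

Years in which a worker is not observed show up as `NA`, just as they show up as `.` in Stata.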

In [58]:
%browse 10

Unnamed: 0,workerid,birth_year1995,age1995,start_year1995,region1995,earnings1995,birth_year1996,age1996,start_year1996,region1996,earnings1996,birth_year1997,age1997,start_year1997,region1997,earnings1997,birth_year1998,age1998,start_year1998,region1998,earnings1998,birth_year1999,age1999,start_year1999,region1999,earnings1999,birth_year2000,age2000,start_year2000,region2000,earnings2000,birth_year2001,age2001,start_year2001,region2001,earnings2001,birth_year2002,age2002,start_year2002,region2002,earnings2002,birth_year2003,age2003,start_year2003,region2003,earnings2003,birth_year2004,age2004,start_year2004,region2004,earnings2004,birth_year2005,age2005,start_year2005,region2005,earnings2005,birth_year2006,age2006,start_year2006,region2006,earnings2006,birth_year2007,age2007,start_year2007,region2007,earnings2007,birth_year2008,age2008,start_year2008,region2008,earnings2008,birth_year2009,age2009,start_year2009,region2009,earnings2009,birth_year2010,age2010,start_year2010,region2010,earnings2010,birth_year2011,age2011,start_year2011,region2011,earnings2011,sex,treated
1,1,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,1944,55,1997,1,39975.012,.,.,.,.,.,1944,57,1997,1,278378.09,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,M,0
2,2,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,1947,54,2001,4,18682.6,1947,55,2001,4,293336.41,1947,56,2001,4,111797.3,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,M,0
3,3,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,1951,54,2005,5,88351.672,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,1951,59,2005,5,46229.57,.,.,.,.,.,M,0
4,4,.,.,.,.,.,.,.,.,.,.,1952,45,1997,5,24911.029,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,1952,49,1997,5,9908.3623,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,M,1
5,5,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,1954,55,1998,2,137207.3,.,.,.,.,.,1954,57,1998,2,5227.6899,M,1
6,6,1954,41,1995,5,53620.359,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,1954,45,1995,5,28902.34,1954,46,1995,5,58023.73,.,.,.,.,.,.,.,.,.,.,1954,49,1995,5,132451.8,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,M,0
7,7,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,1958,49,2006,2,208644.8,.,.,.,.,.,1958,51,2006,2,330875.19,.,.,.,.,.,.,.,.,.,.,M,0
8,8,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,1959,48,1999,4,83872.563,.,.,.,.,.,.,.,.,.,.,1959,51,1999,4,29126.711,.,.,.,.,.,M,0
9,9,.,.,.,.,.,.,.,.,.,.,1962,35,1995,2,296822,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,1962,44,1995,2,165501.41,.,.,.,.,.,1962,46,1995,2,399610.91,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,M,0
10,10,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,1965,34,1995,3,23455.5,.,.,.,.,.,1965,36,1995,3,28081.391,1965,37,1995,3,16827.1,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,M,1


There are so many missing values in the data! Should we worry? Not at all. As a matter of fact, we learned at the beginning of this module that many workers are not observed across all years. That's what these missing values are representing. 

Notice that the variable `year`, which was part of the command line (the `j(year)` part), has disappeared. We now have 1 observation per worker and record the information across years in a cross-sectional way.

How do we go from a `wide` dataset to a regular panel form? We need to indicate the prefixes of the variables, which are formally known as `stubs` in Stata lingo, and use the `reshape long` command. When we write `j(year)`, Stata will create a new variable called `year`.

In [59]:
reshape long earnings region age birth_year start_year, i(workerid) j(year) 

(note: j = 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
>  2009 2010 2011)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                   868002   -> 1.5e+07
Number of variables                  88   ->       9
j variable (17 values)                    ->   year
xij variables:
earnings1995 earnings1996 ... earnings2011->   earnings
   region1995 region1996 ... region2011   ->   region
            age1995 age1996 ... age2011   ->   age
birth_year1995 birth_year1996 ... birth_year2011->birth_year
start_year1995 start_year1996 ... start_year2011->start_year
-----------------------------------------------------------------------------


In [60]:
%browse 10

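| | workerid | year | birth_year | age | start_year | region | earnings | sex | treated |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1995 | . | . | . | . | . | M | 0 |
| 2 | 1 | 1996 | . | . | . | . | . | M | 0 |
| 3 | 1 | 1997 | . | . | . | . | . | M | 0 |
| 4 | 1 | 1998 | . | . | . | . | . | M | 0 |
| 5 | 1 | 1999 | 1944 | 55 | 1997 | 1 | 39975.012 | M | 0 |
| 6 | 1 | 2000 | . | . | . | . | . | M | 0 |
| 7 | 1 | 2001 | 1944 | 57 | 1997 | 1 | 278378.09 | M | 0 |
| 8 | 1 | 2002 | . | . | . | . | . | M | 0 |
| 9 | 1 | 2003 | . | . | . | . | . | M | 0 |
| 10 | 1 | 2004 | . | . | . | . | . | M | 0 |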


Notice that we now have an observation for every worker in every year, although we know some workers are only observed in a subset of these. This is known as a `balanced panel`.  

To retrieve the original dataset, we get rid of such observations with missing values.

In [61]:
keep if !missing(earnings)

(11,894,262 observations deleted)
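
The wide-to-long step and the removal of empty rows can be sketched in R with `pivot_longer` from `tidyr`. The example assumes wide column names of the form `stub_year` (for example `earnings_1999`), which is a naming choice made for this illustration; the `wide` dataframe and its values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per worker, one column per stub and year
wide <- data.frame(
  workerid      = c(1, 2),
  sex           = c("M", "F"),
  earnings_1999 = c(100, NA),
  earnings_2001 = c(300, 50),
  age_1999      = c(55, NA),
  age_2001      = c(57, 54)
)

long <- wide %>%
  pivot_longer(
    cols          = -c(workerid, sex),
    names_to      = c(".value", "year"),  # ".value" keeps the stub as a column name
    names_pattern = "(.*)_(\\d+)"         # split "earnings_1999" into stub and year
  ) %>%
  mutate(year = as.integer(year))

# Every worker now has a row for every year, including unobserved ones
print(long)

# Keep only the worker-years that were actually observed
observed <- long %>% filter(!is.na(earnings))
print(observed)
```

`pivot_longer` also accepts `values_drop_na = TRUE`, which drops rows where all of the reshaped value columns are missing and so can replace the separate `filter` step in a case like this one.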


In [62]:
%browse 10

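| | workerid | year | birth_year | age | start_year | region | earnings | sex | treated |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1999 | 1944 | 55 | 1997 | 1 | 39975.012 | M | 0 |
| 2 | 1 | 2001 | 1944 | 57 | 1997 | 1 | 278378.09 | M | 0 |
| 3 | 2 | 2001 | 1947 | 54 | 2001 | 4 | 18682.6 | M | 0 |
| 4 | 2 | 2002 | 1947 | 55 | 2001 | 4 | 293336.41 | M | 0 |
| 5 | 2 | 2003 | 1947 | 56 | 2001 | 4 | 111797.3 | M | 0 |
| 6 | 3 | 2005 | 1951 | 54 | 2005 | 5 | 88351.672 | M | 0 |
| 7 | 3 | 2010 | 1951 | 59 | 2005 | 5 | 46229.57 | M | 0 |
| 8 | 4 | 1997 | 1952 | 45 | 1997 | 5 | 24911.029 | M | 1 |
| 9 | 4 | 2001 | 1952 | 49 | 1997 | 5 | 9908.3623 | M | 1 |
| 10 | 5 | 2009 | 1954 | 55 | 1998 | 2 | 137207.3 | M | 1 |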


## 6.5: Wrap Up
---
Being able to generate new variables and modify a dataset to fit your specific research is pivotal. Now you should hopefully have more confidence in your ability to perform these tasks. Good job! Next, we will explore the challenges posed by working with multiple datasets at once.