# 7 Within Group Analysis

Before we begin this lecture, let us open the dataset from last time.

In [1]:
clear *
cd "."

import delimited using "fake_data.csv", clear



C:\Users\paulc\Dropbox\Projects\Gitlab\econometrics\econ490-stata

(9 vars, 2,861,772 obs)


When we're working on a particular project, it is important to know how to create variables that are computed within a group. For instance, we may be interested in each worker's average wage across all the years in which they are observed, or we may want to number the observations belonging to each individual.

Stata provides functionality to easily compute such statistics. The key to this analysis is the pre-command `by`, and its only requirement is that the data are sorted by the grouping variable.

In [2]:
%browse

Unnamed: 0,workerid,year,sex,birth_year,age,start_year,region,treated,earnings
1,1,1999,M,1944,55,1997,1,0,39975.012
2,1,2001,M,1944,57,1997,1,0,278378.09
3,2,2001,M,1947,54,2001,4,0,18682.6
4,2,2002,M,1947,55,2001,4,0,293336.41
5,2,2003,M,1947,56,2001,4,0,111797.3
6,3,2005,M,1951,54,2005,5,0,88351.672
7,3,2010,M,1951,59,2005,5,0,46229.57
8,4,1997,M,1952,45,1997,5,1,24911.029
9,4,2001,M,1952,49,1997,5,1,9908.3623
10,5,2009,M,1954,55,1998,2,1,137207.3


## 7.1 Generating Variables using Standard Generate

The command `generate` can be extended with the pre-command `by` as follows: 

In [3]:
cap drop var_one 
by year: gen var_one = 1 



not sorted


r(5);






Wait... what is this message? It says that the data are not sorted. Stata expects us to sort the data so that all observations belonging to the same group (here, the same year) are next to each other. We can use the `sort` command as follows.

In [4]:
sort year 

In [6]:
%browse

Unnamed: 0,workerid,year,sex,birth_year,age,start_year,region,treated,earnings
1,378258,1995,M,1972,23,1995,5,0,52968.539
2,92298,1995,F,1963,32,1995,4,1,10119.29
3,17301,1995,M,1934,61,1995,1,0,49260.602
4,29021,1995,M,1949,46,1995,4,0,352930.69
5,269652,1995,M,1950,45,1995,3,0,238648.5
6,171356,1995,M,1968,27,1995,4,1,7242.2129
7,108381,1995,F,1951,44,1995,2,0,11913.33
8,9393,1995,F,1965,30,1995,2,0,5909.894
9,443061,1995,M,1973,22,1995,4,0,36347.461
10,215870,1995,F,1962,33,1995,3,0,7516.6699


In [7]:
cap drop var_one 
by year: gen var_one = 1 

Now the code works! We can run the whole thing in one step by writing `bysort` instead of `by`.

In [8]:
sort workerid year // Sort the data back into its original order

In [9]:
cap drop var_one 
bysort year: gen var_one = 1 

Awesome! The variable we have created is not interesting by any means, though: it takes the value 1 everywhere. We didn't even need to work within groups to do something like this; we could have just written `gen var_one = 1` and called it a day!

You may not be aware, but Stata keeps track of the observation number in a built-in system variable called `_n`, and of the total number of observations in another system variable called `_N`.

In [10]:
cap drop obs_number 
gen obs_number = _n 

cap drop tot_obs
gen tot_obs = _N

In [11]:
%browse

Unnamed: 0,workerid,year,sex,birth_year,age,start_year,region,treated,earnings,var_one,obs_number,tot_obs
1,247834,1995,M,1949,46,1995,1,0,10224.97,1,1,2861772
2,61219,1995,M,1963,32,1995,5,0,20904.561,1,2,2861772
3,33731,1995,F,1964,31,1995,3,1,61507.25,1,3,2861772
4,861979,1995,M,1975,20,1995,1,1,37619.32,1,4,2861772
5,211635,1995,M,1955,40,1995,3,0,130355.0,1,5,2861772
6,292584,1995,M,1959,36,1995,3,0,82980.039,1,6,2861772
7,100972,1995,M,1942,53,1995,4,1,12670.24,1,7,2861772
8,93840,1995,M,1958,37,1995,1,0,40754.699,1,8,2861772
9,306013,1995,F,1951,44,1995,4,0,38078.621,1,9,2861772
10,191984,1995,F,1969,26,1995,2,0,11864.55,1,10,2861772


As expected, `_n` depends on how the data are sorted! The cool thing is that whenever we use the pre-command `by`, the system variables `_n` and `_N` give the observation number and the total number of observations *within each group* separately.

In [14]:
cap drop obs_number 
bysort workerid: gen obs_number = _n 

cap drop tot_obs
bysort workerid: gen tot_obs = _N

In [15]:
%browse

Unnamed: 0,workerid,year,sex,birth_year,age,start_year,region,treated,earnings,var_one,obs_number,tot_obs
1,1,1999,M,1944,55,1997,1,0,39975.012,1,1,2
2,1,2001,M,1944,57,1997,1,0,278378.09,1,2,2
3,2,2002,M,1947,55,2001,4,0,293336.41,1,1,3
4,2,2001,M,1947,54,2001,4,0,18682.6,1,2,3
5,2,2003,M,1947,56,2001,4,0,111797.3,1,3,3
6,3,2010,M,1951,59,2005,5,0,46229.57,1,1,2
7,3,2005,M,1951,54,2005,5,0,88351.672,1,2,2
8,4,1997,M,1952,45,1997,5,1,24911.029,1,1,2
9,4,2001,M,1952,49,1997,5,1,9908.3623,1,2,2
10,5,2009,M,1954,55,1998,2,1,137207.3,1,1,2


As we can see, some workers are observed only 2 times in the data, whereas other workers are observed 8 times. By knowing (and recording in a variable) how many times a worker has been observed, we can do different analyses based on this information. For example, in some cases you might be interested in keeping only workers who are observed across all time periods.
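
As a minimal sketch of that last idea, the cell below builds a small toy panel (hypothetical values, not taken from `fake_data.csv`), numbers each worker's observations in year order, and keeps only the workers who appear in every year. Writing `bysort workerid (year):` sorts by `workerid` and `year` but groups by `workerid` only, so `_n == 1` is each worker's earliest year. The variable names `year_tag` and `n_years` are illustrative, and the approach assumes each worker appears at most once per year. Because the example starts with `clear`, it is best run in a fresh Stata session.

```stata
* Toy panel (hypothetical data): workers 1 and 3 are seen in 3 years, worker 2 in 2
clear
input workerid year earnings
1 2001 100
1 2002 110
1 2003 120
2 2001 200
2 2003 220
3 2003 330
3 2001 300
3 2002 310
end

* Sort by workerid and year, but group by workerid only
bysort workerid (year): gen obs_number = _n
bysort workerid (year): gen tot_obs = _N

* Count the distinct years in the data:
* tag() marks exactly one observation per distinct year
egen year_tag = tag(year)
egen n_years = total(year_tag)

* Keep workers observed in every year (assumes one row per worker-year)
keep if tot_obs == n_years
```

Here worker 2 is dropped because they are observed in only two of the three years.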

## 7.2 Generating Variables using Extended Generate (egen)

The command `egen` (the extended version of `generate`) is used whenever we want to create variables that require some functions (e.g. mean, standard deviation, min). The basic syntax works as follows:

```
 bysort groupvar: egen new_var = function(arguments) [, options]
```

Let's see an example where we create a new variable called `avg_earnings` which is the `mean` of earnings for every worker.

In [16]:
cap drop avg_earnings
bysort workerid: egen avg_earnings = mean(earnings)

In [17]:
cap drop total_earnings
bysort workerid: egen total_earnings = total(earnings)
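
To see what these two `egen` calls produce, here is a small sketch on toy data (hypothetical values, not taken from `fake_data.csv`) where the group means and totals are easy to check by hand. For contrast, it also shows that `sum()` used with plain `generate` is a running sum rather than a group total; the name `running_earnings` is illustrative. Because the example starts with `clear`, it is best run in a fresh Stata session.

```stata
* Toy data (hypothetical): two workers with 2 and 3 observations
clear
input workerid year earnings
1 2001 100
1 2002 200
2 2001 50
2 2002 70
2 2003 90
end

* Group-level statistics: the same value is repeated on every row of the group
bysort workerid: egen avg_earnings = mean(earnings)
bysort workerid: egen total_earnings = total(earnings)

* For contrast: with generate, sum() accumulates row by row within the group
bysort workerid (year): gen running_earnings = sum(earnings)

list, sepby(workerid)
```

In this toy example worker 2 has `avg_earnings` of 70 and `total_earnings` of 210 on all three rows, while `running_earnings` climbs from 50 to 120 to 210.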

By definition, these commands will create variables that use information across different observations. You can check the list of available functions by writing `help egen`.

In this documentation, you will notice that there are some functions that do not allow for `by`. For example, suppose we want to create the total sum across different variables in the same row. We'll sum some arbitrary variables (i.e. there is no meaning in the variable we are creating) for the sake of explanation.


In [19]:
cap drop sum_of_vars
egen sum_of_vars = rowtotal(start_year region treated)

How is this different from writing `gen sum_of_vars = start_year + region + treated`? The answer lies in the missing values. If you add any number to a missing value (`.`), the result is also missing. The function `rowtotal()` instead treats missing values as zero, so it sums only the non-missing values in each row. In most situations, this is what we're looking for.
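
The difference is easiest to see on a few rows with missing values. The sketch below uses toy variables `a`, `b`, and `c` (hypothetical, not part of the lecture dataset) and compares the `+` operator with `rowtotal()`, with and without its `missing` option. Because the example starts with `clear`, it is best run in a fresh Stata session.

```stata
* Toy data (hypothetical): row 2 has one missing value, row 3 is entirely missing
clear
input a b c
1 2 3
4 . 6
. . .
end

* The + operator returns missing as soon as any term is missing
gen sum_plus = a + b + c

* rowtotal() treats missing values as 0, so an all-missing row sums to 0
egen sum_row = rowtotal(a b c)

* With the missing option, a row whose values are all missing stays missing
egen sum_row_m = rowtotal(a b c), missing

list
```

Row 2 shows the main difference (`sum_plus` is missing while `sum_row` is 10), and row 3 shows why the `missing` option can matter: without it, a row with no information at all is recorded as 0.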

Notice that we can use `by` with a list of variables, not just a single variable.

In [20]:
cap drop regionyear_earnings
bysort year region : egen regionyear_earnings = total(earnings)

## 7.3 Collapsing the Data

We can also compute statistics at some group level with the `collapse` command. However, `collapse` replaces the dataset in memory with the collapsed one, so the change cannot be undone unless the original data were saved beforehand. For example, suppose we want to create a dataset at the region-year level using information in the current dataset.

First, we decide which statistics we want to keep from the original dataset. For the sake of explanation, let's suppose we want to keep average earnings, the standard deviation of earnings, and total employment.

The syntax is 

```
 collapse (statistic1) new_name = existing_variable (statistic2) new_name2 = existing_variable2 ... , by(group) 
```


We write

In [21]:
collapse (mean) avg_earnings = earnings (sd) sd_earnings = earnings (count) tot_emp = earnings , by(region year)

In [22]:
%browse

Unnamed: 0,year,region,avg_earnings,sd_earnings,tot_emp
1,1995,1,67284.297,144739.13,28143
2,1996,1,65632.828,139990.23,31078
3,1997,1,67655.398,144867.91,33583
4,1998,1,67836.609,113053.41,35696
5,1999,1,69703.961,130300.19,37443
6,2000,1,70636.844,153549.7,39559
7,2001,1,75025.453,212302.61,40722
8,2002,1,72514.414,148687.64,41353
9,2003,1,75885.875,134078.48,41242
10,2004,1,73384.0,145342.2,41339
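
If we want the collapsed statistics without giving up the worker-level data, one option is to wrap `collapse` in `preserve` and `restore`. The sketch below does this on toy data (hypothetical values, not taken from `fake_data.csv`); the local macro `n_groups` is an illustrative name. Note that `(count)` counts non-missing values of `earnings`, so a row with missing earnings does not add to `tot_emp`. Because the example starts with `clear`, it is best run in a fresh Stata session.

```stata
* Toy data (hypothetical): 5 observations in 3 region-year cells,
* one of them with missing earnings
clear
input region year earnings
1 2001 100
1 2001 300
1 2002 200
2 2001 50
2 2001 .
end

* Take a snapshot of the data in memory before collapsing
preserve

collapse (mean) avg_earnings = earnings (sd) sd_earnings = earnings (count) tot_emp = earnings, by(region year)
list
local n_groups = _N   // number of region-year cells in the collapsed data

* Bring back the original observation-level data
restore

describe, short
```

After `restore`, the five original observations are back in memory. To keep the collapsed data as well, we could `save` it to a file before calling `restore`.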


As you can see above, there's no way to recover the information we previously had from the collapsed data. However, we may be interested in analyzing this type of dataset at the group level. We may s