# ECON 490: Generating Variables (6)

## Prerequisites 
---
1. Be able to effectively use Stata do files and generate log files.
2. Be able to change your directory so that Stata can find your files.
3. Import datasets in csv and dta format. 
4. Save data files. 

## Learning objectives:
---
1. Explore your data set with commands like `describe`, `browse`, `tabulate`, `codebook` and `lookfor`.
2. Generate dummy (or indicator) variables using the command `generate` or `tabulate`.
3. Create new variables in Stata using `generate` and `replace`.
4. Rename and label variables.


## 6.1 Getting Started

We'll continue working with the fake dataset introduced in the previous lecture. Recall that this dataset simulates information on workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.

Last lecture we introduced three steps to open data in Stata:
1. Clear the workspace
2. Change the directory to the space where the files we will use are located
3. Import the data using commands specific to the file type.

Let's run those commands now so we are all ready to do our analysis. 

In [None]:
* Below you will need to include the path on your own computer to where the data is stored between the quotation marks.

clear *
cd " "
import delimited using "fake_data.csv", clear

## 6.2 Commands to Explore the Dataset

### 6.2.1 `describe`

The first command we are going to use describes the basic characteristics of the variables in the loaded data set.

In [None]:
describe

### 6.2.2 `browse`

In addition to using the `describe` command, in the Stata interface you can also open the data editor and see the raw data as if it were an Excel file. To do so we can type `browse`. This command will open a new Stata window. If we want to do this from within Jupyter, we use the command with `%` before `browse`.


In [None]:
%browse

Opening the data editor has many benefits. Most importantly we get to see the data as a whole, allowing us to have a clearer perspective of the information the dataset is providing us. For example, here we observe that we have unique worker codes, the year in which they are observed, worker characteristics (sex, age, and earnings), and whether or not they participated in the training program.

### 6.2.3 `codebook`

We can further analyze any variable by using the `codebook` command. Let's do this here to learn more about the variable earnings.

In [None]:
codebook earnings

The `codebook` command gives us important information about this variable, such as its type (i.e. string or numeric), how many missing observations it has (very useful to know!), and how many unique values it takes. If the variable is numeric, it will provide some summary statistics. If the variable is a string, it will provide examples of some of the entries.

Try changing the variable name in the cell above to see the codebook entries for the different variables in this data. 
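
If you would rather look at several variables at once, a minimal sketch is shown below. It assumes the variables `sex` and `region` are named exactly that in the loaded data, and it uses the `compact` option, which prints a one-line summary per variable instead of the full report.

In [None]:
* One-line codebook summary for each listed variable:
* number of observations, number of unique values, and (for numeric variables) mean, min and max
codebook sex region, compact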

### 6.2.4 `tabulate`

We can also learn more about the frequency of the different values of a variable by using the command `tabulate`.

In [None]:
tabulate region

Here we can see that there are five regions indicated in this data set, that the most people surveyed came from region 1, and that the fewest came from region 3.

We can actually include two variables in the `tabulate` command if we want more information. When you try this below you will see that there were 234,355 female-identified persons surveyed in region 1 and 425,698 male-identified persons surveyed in region 2.

In [None]:
tabulate region sex
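
Raw counts can be hard to compare across regions of different sizes. As a small sketch using the same two variables, the `row` option adds the share of each sex within each region, and the `missing` option treats missing values as a category of their own rather than dropping them from the table.

In [None]:
* Add row percentages: the share of each sex within each region
tabulate region sex, row

* Keep observations with missing values in the table as their own category
tabulate region sex, missing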

### 6.2.5 `lookfor`

What if there's a gazillion variables and I'm looking for a particular one?

Thankfully, Stata provides a nice command called `lookfor`. Suppose we want to look for a variable that is related to year. 

In [None]:
lookfor year

Stata found three variables that include the word `year` either in the variable name or in the variable label. This is super useful when we are getting to know a dataset!

## 6.3 Generate Dummy Variables

Dummy variables are variables that can only take on two values: 0 and 1. It is useful to think of a dummy variable as being the answer to a question that can be answered "yes" or "no". With a dummy variable the answer yes is coded as "1" and no is coded as "0".

Examples of questions that are used to create dummy variables are:

1. Is the person female? Females are coded "1" and males are coded "0"
2. Does the person have a university degree? People with a degree are coded "1" and everyone else is coded "0"
3. Is the person married? Married people are coded "1" and everyone else is coded "0"
4. Is the person a millennial? People born between 1980 and 1996 are coded "1" and those born in other years are coded "0"

As you have probably already figured out, dummy variables are used primarily for data that is qualitative and cannot be ranked in any way. For example, being married is qualitative and "married" is neither higher nor lower than "single".  But they are also sometimes used for variables that are qualitative and ranked, such as level of education. And sometimes for variables that are quantitative, such as age groupings. 

It is important to remember that dummy variables must always be used when you want to include categorical (qualitative) variables in your analysis. These are variables such as sex, gender, race, marital status, religiosity, immigration status, etc. We can’t use these variables directly, without creating dummy variables, because the results would in no way be meaningful.

### 6.3.1 Creating Dummy Variables using `generate`

Let's do an example where we create a dummy variable that indicates whether the person observed is identified as female. We are going to use the command `generate`, which generates a completely new variable.

In [None]:
generate female = ( sex == "F") 

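Stata evaluates the condition `sex == "F"` for every observation: whenever it holds, our dummy takes the value 1; otherwise it takes the value 0. Note that an observation where `sex` is missing fails the condition, so it is coded 0 as well. Depending on what you're doing, you may prefer the dummy to be missing whenever `sex` is missing. Adding `if !mi(sex)` restricts the assignment to observations where `sex` is not missing, and `generate` leaves every other observation missing.

In [None]:
generate female = ( sex == "F")  if !mi(sex)

Whoops! We got an error. This says that our variable is already defined. Stata does this because it doesn't want you to accidentally overwrite an existing variable. Whenever we want to change an existing variable, we have to use the command `replace`. Because `replace` with an `if` qualifier only changes the observations that satisfy the condition, here we set the dummy to missing directly for the observations where `sex` is missing.

In [None]:
* The existing values of female are kept wherever sex is not missing
replace female = . if mi(sex)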

We could have also used the command `capture drop female` before we use `generate`. The `capture` command tells Stata to ignore any error in the command that immediately follows. In this example, this would do the following: 

- If the variable that is being dropped didn't exist, the `drop female` command would produce an error. The `capture` command tells Stata to ignore that error and carry on.
- If the variable did already exist, the `drop female` command would work just fine, so that line will proceed as normal.
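
Putting the two cases together, a cell like the following can be run as many times as you like without triggering the "already defined" error. This is only a sketch of the pattern described above, using the same `female` and `sex` variables.

In [None]:
* Drop female if it exists; do nothing (and raise no error) if it does not
capture drop female

* Recreate the dummy, leaving it missing wherever sex is missing
generate female = (sex == "F") if !mi(sex)

* Check the result, showing missing values as their own row
tabulate female, missing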

### 6.3.2 Creating Dummy Variables using `tabulate`

We already talked about how to create dummy variables with `generate` and `replace`. Let’s see how this can be done for a whole set of dummy variables, one for each region identified in the data set.

In [None]:
tabulate region, generate(reg)

This command generated five new dummy variables, one for each category of region. We asked Stata to use the prefix "reg", so those five new variables are called reg1, reg2, reg3, reg4, and reg5. When we run the command `des reg*` we will see all of the variables whose names start with "reg" listed. Stata has helpfully labelled each of those variables with the category of region that it represents. You might want to change the names for your own project to something that is more meaningful to you.

In [None]:
des reg*
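
One way to convince yourself that the new dummies behave as expected is sketched below. It assumes the five dummies still carry their default names `reg1` to `reg5`, so it should be run before any of them is renamed. Each observation with a non-missing region should have exactly one of the dummies equal to 1.

In [None]:
* Cross-tabulate the original variable against the first dummy:
* reg1 should equal 1 only in the row for the first region
tabulate region reg1

* Stata stops with an error if any observation violates this condition
assert reg1 + reg2 + reg3 + reg4 + reg5 == 1 if !mi(region)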

## 6.4 Generating Variables based on Expressions

Sometimes we want to generate variables after some transformations (e.g. squaring, taking logs, combining different variables). We can do that by simply writing the expression. For example, let's create a new variable that is simply the natural log of earnings:

In [None]:
gen log_earnings = log(earnings)

In [None]:
summarize earnings log_earnings
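
If the two variables above have different numbers of observations, the transformation is the likely reason: in Stata, `log()` returns a missing value when its argument is zero or negative. The following sketch counts how many observations are affected, using only the `earnings` and `log_earnings` variables.

In [None]:
* Observations whose earnings are zero or negative (the log is undefined for them)
count if earnings <= 0

* Observations where earnings is recorded but the log came out missing
count if mi(log_earnings) & !mi(earnings)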

Let's try a second example. We will create a new variable that is the number of years since the individual started working.

In [None]:
gen experience_proxy = year - start_year

In [None]:
summarize experience_proxy
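
Expressions can also build on variables we have just created. As an illustration of the squaring transformation mentioned at the start of this section, the sketch below first checks the new variable for implausible values and then creates its square. The name `experience_proxy_sq` is our own choice for this example, not a variable in the original data.

In [None]:
* A negative value would mean the worker is observed before their recorded start year
count if experience_proxy < 0

* Create the squared term and compare it with the original variable
generate experience_proxy_sq = experience_proxy^2
summarize experience_proxy experience_proxy_sq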

## 6.5 Following Good Naming Conventions

Choosing good names for your variables is more important, and harder, than you might think! Some of the variables in the original dataset may have names that are hard to recognize, which can be confusing when conducting your research, so you may want to change them before you begin. You will also be creating your own variables, like dummy variables for qualitative measures, and you want to be careful about giving them good names. Finally, once you start generating tables you will want all of your variables to have high-quality names that will carry over to your paper.


You can always rename your variables with the command `rename`. Let's try to rename one of those dummy variables we created above. Maybe we know that if region = 3 then the region is in the west.

In [None]:
rename reg3 west
des west

Don’t think you need to include every piece of information in your variable name. Most of the important information is included in the variable label (more on that in a moment). Avoid variable names that include unnecessary pieces of information and can only be interpreted by you. 


<div class="alert alert-info">

**Pro tip:** Put all of your variable names in lower case to avoid errors (since Stata is case sensitive).
</div>
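
If a dataset arrives with a mix of upper and lower case names, you do not need to rename the variables one by one. A minimal sketch, assuming you want to convert every variable name in the loaded data:

In [None]:
* Convert the names of all variables in memory to lower case
rename *, lower

* Confirm the new names
describe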

## 6.6 Creating Variable Labels

It is important that anyone using your data set knows what each variable measures. You can add a new label, or change an existing variable label, at any time by using the `label variable` command. Continuing the example from above, if I create a new dummy variable that indicates if people are female then I will want to add a label to my new variable. That command would be:

In [None]:
label variable female "Female Dummy"

When we describe the data, we will see this extra information in the variable label column.

In [None]:
des female
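
A variable label describes the variable as a whole. If you also want the individual values 0 and 1 to be displayed with a description, Stata offers value labels, which are defined once and then attached to a variable. The sketch below is one possible way to do this; the label name `yesno` is our own choice and is assumed not to be defined already.

In [None]:
* Define a set of value labels (the name yesno is hypothetical)
label define yesno 0 "No" 1 "Yes"

* Attach those labels to the dummy variable
label values female yesno

* The table now shows "No" and "Yes" in place of 0 and 1
tabulate female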

## Wrapping Up

When we are doing our own research, we *always* have to spend some time working with the data before beginning the analysis. In this module we have learned some important tools for manipulating data to get it ready for that analysis. Like everything else that you do in Stata, these manipulations should be done in a do file, so that you always know exactly what you have done with your data. Losing track of those changes can cause some very serious mistakes when you start to do your research!