
# 16. Stata Workflow Guide

## Prerequisites

- Knowledge of the content of the previous modules: macros, opening datasets, creating graphs, regression analysis. 


## 16.1 Learning Objectives

- Learn foundational skills and practices for workflow management in research and data applications
- Improve coding style, especially for collaborative settings
- Apply conditional operators to automate workflow processes

## 16.2 Introduction to Workflow Management

Structuring your files and folders early on can save you a lot of time and effort throughout the research project. It will make it easier for you to keep track of your progress and help you streamline your process. This is also important if you have co-authors and collaborators, or if you want to make it easy to replicate your code.

In this module, we will discuss how to manage files and scripts as part of your research workflow. We will also cover how to stylize your code to make it easy to read and replicate. While these are not strict rules, consider them guidelines for research and data management.


## 16.3 Directory Structure

Over the course of a research project, we are likely to accumulate numerous files such as raw data, dofiles, tables, graphs and figures. In fact, there are often many versions of each of these files. You should start by creating a main folder or a "root" folder where _all_ your project files and folders are organised. Within the main folder, sort all your files into sub-folders similar to the structure shown below:

![Main directory structure](img/fig1-stata-dir.png  "Main directory structure")

Each sub-folder consists of a specific category of files:

* **data:** contains all the data files
* **scripts:** contains all the Stata dofiles used to process, clean and analyze the data files
* **logfiles:** contains all the Stata logfiles
* **tables:** contains all the regression tables, summary statistics, etc.
* **figures:** contains all the graphs and figures
* **literature:** contains papers and documents related to literature review
* **paper:** contains Word documents or LaTeX files relating to the written part of your paper
* **slides:** contains presentation slides

<div class="alert alert-block alert-info">
    
<b>Note:</b> Avoid spaces, special characters and capital letters in your folder and file names. Where you would otherwise use a space, use an underscore `_` instead.
    
</div>


It is also good practice to number your folders to reflect your workflow. 
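As a minimal sketch, the folder structure above could be created from within Stata. The root path `D:/fakeproject` is a placeholder chosen for illustration, and the root folder itself is assumed to exist already.

```stata
* Hypothetical root folder: replace with your own path
global proj_main "D:/fakeproject"

* Create each sub-folder; capture ignores the error if it already exists
foreach folder in data scripts logfiles tables figures literature paper slides {
    capture mkdir "${proj_main}/`folder'"
}
```

Running the block a second time is harmless, since `capture` suppresses the error that `mkdir` raises for an existing folder.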



## 16.4 File naming conventions

While everyone uses their own naming conventions, it can be helpful to prefix your file names with numbers that align with your workflow and to add a version suffix at the end. Version suffixes could be `_v1`, `_v2`, etc. Or, they could be dates.

<div class="alert alert-block alert-info">
    
<b>Note:</b>  Following the yymmdd (year, month, day) format when using dates makes an alphabetical sort match chronological order, so the latest version always appears at one end of the list. Other date formats (such as ddmmyy) will not sort the files chronologically and thus defeat the purpose of adding a version suffix.
    
</div>
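If you prefer date stamps, one way to avoid typing them by hand is to build the yymmdd string from Stata's system date. This is only a sketch: the local names `today` and `outfile` and the file name are illustrative choices.

```stata
* Build a yymmdd stamp from today's date, e.g. 250214 for 14 Feb 2025
local today : display %tdYYNNDD date(c(current_date), "DMY")

* Use the stamp as a version suffix in a (hypothetical) file name
local outfile "main_data_`today'.dta"
display "`outfile'"
```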

As you make progress with your project, you might find yourself collecting many versions of some of your files. As older versions become redundant, delete them or move them to a temp folder. Creating a temp folder for old dofiles, tables, documents, etc., can be helpful to keep your main folders neat if you are hesitant to delete them or if you are susceptible to digital hoarding (like many of us are).


## 16.5 Task-oriented dofiles

It's almost never a good idea to use one dofile for the entirety of the project. Instead, create different dofiles for different tasks and add descriptive labels to reflect your workflow. As mentioned in the previous section, prefix your files with numbers to align with the workflow sequence.

![Scripts folder with example dofiles](img/fig2-scripts.png "Example dofiles")

In the image above, the first dofile `1_build_data.do` cleans the raw data and generates core variables that will be used in subsequent scripts. The second dofile `2_descriptive.do` generates descriptive statistics and relevant figures. The third dofile `3_results.do` runs the final regressions and generates regression tables. The master dofile `0_master.do` runs all these other dofiles. We will discuss its role in detail in the upcoming section.

### 16.5.1 A note on figures and tables

Some people prefer to use different dofiles for different figures and tables, which is completely fine as long as the files are labelled well. If you are generating different tables and figures within the same dofile, write them into separate code blocks within a dofile, so that they can be easily distinguished.
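A dofile that produces several outputs might be laid out as in the sketch below. It assumes a dataset with the variables `earnings` and `log_earnings` is already in memory and that the working directory is the project's main folder; the output file name is hypothetical.

```stata
*--------------------------------------
* Figure 1: distribution of log earnings
*--------------------------------------
histogram log_earnings
graph export "./figures/fig1_log_earnings.png", replace

*--------------------------------------
* Table 1: summary statistics
*--------------------------------------
summarize earnings log_earnings
```

Each output gets its own clearly labelled block, so a single figure or table can be found and re-run without reading the whole file.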

### 16.5.2 Master dofile

You can think of the master dofile `0_master.do` as a "compiler": it runs all the other dofiles to compile everything in your project. The file should be structured something like this:

```stata
    /* Project info */

    clear

    /* Directory settings: paths to folders, defined as globals */

    /* Project settings: such as global variables and other macros */

    *** Run all dofiles ***

    do ./scripts/1_build_data.do
    do ./scripts/2_descriptive.do
    do ./scripts/3_results.do

    /* FIN */
```

The master file begins with project information usually included in a block comment. Then, we begin writing the script, starting with the `clear` command. This is followed by **directory settings** and then **project settings** — these two components are responsible for automating and simplifying a lot of tasks that we would otherwise do manually. The final component of the script is to **run all the dofiles** in our project.

In the subsequent sections of this module, we will go over each of these components of a master file.


## 16.6 Directory Settings

Now let's consider an example of directory settings in a master dofile:

```stata
* Clear everything in the current Stata session
clear all

****************
* Directories
****************

* You need to change the first two global names according to your computer path

global proj_name "Fake Project"
global proj_main "$DROPBOX/Projects/${proj_name}"
global datadir "${proj_main}/data"                  // Raw Files and Output from those
global figdir "${proj_main}/figures"                // Figure path
global tabledir "${proj_main}/tables"               // Tables Path
global do_dir "${proj_main}/scripts"                // Scripts path
global log_dir "${proj_main}/logfiles"              // Log-file path

```

Take a moment to observe how the directory is organised and labelled. It's okay if it doesn't all make sense just yet.

There are two essential tools utilised in this master file:
1. Relative file paths
2. Macros (i.e. locals and globals)

### 16.6.1 What are relative paths?

Relative paths can be quite practical when working on group projects, as they help automate tasks that would otherwise be easy to miss, such as changing the directory each time a collaborator runs a dofile.

The idea behind relative paths is simple: you should set the main directory once and then everything else should be relative to that directory.

### 16.6.2 Setting the main directory

For example, if your main folder is called `fakeproject` and it is on your D: drive, you can point to this folder in the dofile with the command:

```stata
clear
cd "D:/fakeproject"
```

Now let's consider a different scenario, where your files are synced on Dropbox and you are working across multiple computers at home and at school. It is likely that the file paths will be different across the computers. A simple workaround is to use the `capture` prefix, which can be abbreviated to `cap`.

```stata
clear
cap cd "C:/Program files/Dropbox/fakeproject"        // home
cap cd "D:/Program files/Dropbox/fakeproject"       // school
```

We added the `cap` command so that if Stata cannot find the first path then it can move on to the next line and try to set the directory again with a different path.


>If you are collaborating with others on Dropbox, OneDrive or some other network drive, you can add as many `cap cd` options as needed.


<div class="alert alert-block alert-info">

<b>Note:</b>  Use `//` liberally to comment on your code, as shown in the code block above, especially if you have multiple directory options.

</div>
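One drawback of stacking `cap cd` lines is that if none of the paths exists, Stata carries on silently in whatever directory it started in. A possible safeguard, sketched here with the same two example paths, loops over the candidates and stops the dofile with an error if none works. The local names `path` and `found` are illustrative.

```stata
clear
local found = 0

* Try each candidate path in turn and stop at the first one that exists
foreach path in "C:/Program files/Dropbox/fakeproject" "D:/Program files/Dropbox/fakeproject" {
    capture cd "`path'"
    if _rc == 0 {
        local found = 1
        continue, break
    }
}

* Abort the dofile if no candidate path could be set
if `found' == 0 {
    display as error "Project folder not found: add your path to the list above"
    exit 601
}
```

Collaborators only need to add their own path to the list; the rest of the dofile stays the same.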


### 16.6.3 Relative paths for subdirectories

So far we set the main directory in Stata, which in this case is the `fakeproject` directory. All the subdirectories in the `fakeproject` directory can be accessed using relative paths. Let's say we want to access a data file `fake_data.csv` in the `raw` folder nested in the `data` folder. The command using relative paths will look something like this:

```stata
import delim using ./data/raw/fake_data.csv, clear
```

If you weren't using relative paths, the command would look something like this:

```stata
import delim using "C:/Program files/Dropbox/fakeproject/data/raw/fake_data.csv", clear
```

In the second case, you would have to correct the path if you were working on multiple computers or collaborating with someone.


### 16.6.4 Relative paths in other contexts

Similarly, once you are done generating variables or cleaning data, you can use relative paths to save it in the appropriate folder. For example,

```stata
import delim using ./raw/fake_data.csv, clear

/* data cleaning code, eg., generate and label variables */

*store the final dataset in the appropriate folder & rename the file
save ./final/main_data.dta, replace            

```

In the example above, we assumed the working directory was the `data` subdirectory (for instance, after running `cd ./data`) and used relative paths to navigate to the `raw` and `final` folders within it. Using relative paths this way is convenient, but it can get confusing, especially when you are moving in and out of subdirectories.

A robust solution in this case is to store file paths in macros.


### 16.6.5 Macros (locals and globals)

Macros store information either temporarily (`local`) or for the whole session (`global`).

Locals exist only within the dofile or program in which they are defined and disappear once it finishes running. Globals are stored in memory until you close Stata (or explicitly drop them), hence they are considered "permanent".
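The two kinds of macro are also referenced differently, as the short sketch below shows. The macro names and contents are made up for illustration.

```stata
* Local: referenced with a left quote (`) and an apostrophe (')
local controls "age education"
display "`controls'"

* Global: referenced with a dollar sign; curly braces mark where the name ends
global outcome "log_earnings"
display "${outcome}"
```

Run the block all at once: if you run it line by line from the dofile editor, the local is gone by the time the `display` line executes, whereas the global survives.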

Standard practice is to define key paths in globals in your `0_master.do` directory settings. Let's look at the example directory settings, once again.

```stata
* Clear everything in the current Stata session
clear all

****************
* Directories
****************

* You need to change the first two global names according to your computer path

global proj_name "Fake Project"
global proj_main "$DROPBOX/Projects/${proj_name}"
global datadir "${proj_main}/data"                  // Raw Files and Output from those
global figdir "${proj_main}/figures"                // Figure path
global tabledir "${proj_main}/tables"               // Tables Path
global do_dir "${proj_main}/scripts"                // Scripts path
global log_dir "${proj_main}/logfiles"              // Log-file path
```

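In this example, we define the key paths in globals, where `proj_name` and `proj_main` are unique to your project name and main directory, respectively. Note that `proj_main` includes the full directory path, whereas the rest of the subdirectories such as `datadir` are defined relative to it, as in `${proj_main}/data`. The example also assumes that `$DROPBOX` is a global that already holds the path to your Dropbox folder on the current computer; it is not defined in this snippet.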

In subsequent dofiles, we can completely skip defining the full file paths and use globals to make things convenient. Let's try importing our raw data file in Stata once again using relative paths defined in globals.

```stata
import delimit using "${datadir}/raw/fake_data.csv", clear
```

Notice that we defined `datadir` earlier, and we navigated to a specific file `fake_data.csv` in the `raw` folder in `datadir`.

In this way, using globals can be helpful while navigating different parts of your project. This is especially true if you are working in a group, as you can choose to add the `cap` command (as shown earlier) alongside globals. Or, you could update the file path in the master file once and use that to run all your dofiles.
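The same globals work for output as well as input. The sketch below assumes the directory globals from `0_master.do` are already defined and that `main_data.dta` contains `log_earnings`, `year` and `treated`; the figure file name is hypothetical.

```stata
* Load the final dataset from the data folder
use "${datadir}/final/main_data.dta", clear

* Draw a graph and save it to the figures folder
twoway connected log_earnings year if treated
graph export "${figdir}/earnings_treated.png", replace
```

The paths are wrapped in double quotes because `proj_name` contains a space in this example.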


## 16.7 Running dofiles


Let's circle back to the idea of having different dofiles for performing different tasks. We discussed having a master dofile to run all these other dofiles and compile the project. Now let's explore how to execute this idea.

We can run a dofile by using the command `do` followed by the file path of the appropriate dofile.

```stata
do "${do_dir}/1_build_data.do"
```

The code for running all the dofiles in our master dofile will look something like this:

```stata
******************
* Run the Project
******************

do "${do_dir}/1_build_data.do"

do "${do_dir}/2_descriptive.do"

do "${do_dir}/3_results.do"
```

We simply run all the dofiles in the appropriate order. Notice how the naming convention makes it easy to identify the sequence in which we need to run the dofiles: file names are descriptive and sequentially numbered. We also used Stata's comment feature to create a "section header" of sorts.

While you could run all your dofiles this way, if you decide to use project settings in your master script, this code will look slightly different. 



## 16.8 Project settings

We can simplify our workflow by defining project settings in the master dofile. This can affect which dofiles are run, whether log files are generated, which samples are excluded from your analysis, and so on.

Take a look at the example below, and try to identify any patterns in the code. 


```stata
****************
* Settings:
****************

*Step 1: Build intermediate and final dataset from raw data

global run_build = 1                // 0 = skip build step; 1 = run build.do
global store_log_build = 1          // 0 = don't save logfile; 1 = save log file


*Step 2: Run descriptive analysis

global run_descriptive = 1          // 0 = skip; 1 = run
global store_log_descriptive = 1    // 0 = don't save logfile; 1 = save log file


*Step 3: Run main results (e.g. regressions)

global run_mainresults = 1          // 0 = skip; 1 = run
global store_log_mainresults = 1    // 0 = don't save logfile; 1 = save log file

```

Here, we define two kinds of settings: `run` and `store_log`. For each step we create globals to (1) run that step (`run_build`, `run_descriptive`, `run_mainresults`) and (2) store the log file for that step (`store_log_build`, etc.). 

At this stage, our settings don't do anything yet, even though the comments might lead you to believe otherwise. We have simply created globals and assigned them a value. As we go on to reference these globals in other parts of our master dofile and in other dofiles, these settings will become meaningful. The values you choose to assign to these globals will then determine which actions occur and which don't.

`run` settings are referenced in two cases:
1. In the master dofile under the "run project" section
2. At the beginning of the project dofiles, when required

`store_log` settings are referenced in two cases:
1. Always at the beginning of the project dofiles (excluding the master dofile)
2. Always at the end of the project dofiles (excluding the master dofile)


### 16.8.1 `run` settings

Let's start by mapping the `run` settings, beginning with the master dofile.


```stata
******************
* Run the Project
******************

if ${run_build}==1 {
	do "${do_dir}/1_build_data.do"
}


if ${run_descriptive}==1 {
	do "${do_dir}/2_descriptive.do"
}


if ${run_mainresults}==1 {
	do "${do_dir}/3_results.do"
}
```

This is almost the same as the code block we saw earlier to run all your dofiles. The key difference is that each command is nested within an `if` statement.

The `if` statements correspond to the global settings: IF the statement `${some_global}==1` is TRUE, THEN run the command in the curly brackets, which is `do "filename"`. Can you guess what happens if the statement is FALSE?


There's one missing piece in this story. The comments in the settings say that assigning a value of `0` to a global skips that action. You may have noticed, however, that the `if` statement evaluates to FALSE for any value of `global run_build` other than 1.

We could set `global run_build = 8` and Stata would still evaluate the statement `${run_build}==1` as FALSE. The question remains: when does `0` become relevant?

To understand this, we have to think of our master dofile as a very long script that links all the other dofiles together. Let's consider a scenario where you want to skip the build step. This means our script begins with `2_descriptive.do`. However, `2_descriptive.do` includes commands that work with the dataset we opened in `1_build_data.do`. Note that we don't open the dataset at the beginning of each dofile over and over again. This means we need to add a condition at the beginning of the `2_descriptive.do` script that opens the correct dataset in the event we skip the first step.

```stata
if ${run_build}==0 {
	use "${datadir}/final/main_data.dta", clear
}
```

This clearly defines a situation where, if we skip the build data step then we load the correct dataset in Stata to run `2_descriptive.do` .


Similarly, if we were to skip the first two steps, then we would have to load the correct dataset to run the results (i.e. step 3). We include the following command at the beginning of `3_results.do` to address this problem:

```stata
if ${run_build}==0 & ${run_descriptive}==0 {
	use "${datadir}/final/main_data.dta", clear
}
```

As you might have noticed, every scenario where we skip a step is associated with `if ${some_global}==0`. As a result, we limit the values assigned to the global settings to 0 and 1.
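Because any other value would skip a step without triggering the matching `==0` condition, it may be worth checking the settings before running anything. The following is a minimal sketch, not part of the original master dofile; it repeats the three `run` settings so that it can be run on its own.

```stata
* Settings as defined in the master dofile
global run_build = 1
global run_descriptive = 1
global run_mainresults = 1

* Stop the dofile with an error if any run setting is not 0 or 1
foreach setting in run_build run_descriptive run_mainresults {
    if !inlist(${`setting'}, 0, 1) {
        display as error "`setting' must be 0 or 1"
        exit 198
    }
}
```

Placed directly after the settings section, a check like this catches a mistyped value before any dofile runs.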

### 16.8.2 `store_log` settings

Now let's take a look at the `store_log` settings, which help us automate the process of storing log files.

All the dofiles except `0_master.do` include the `log` command at the beginning and end of the file. The `log` command is nested within an `if` statement related to the global settings, exactly like we saw earlier.

```stata
*If log setting is activated, we record a log file in the log folder
if ${store_log_descriptive}==1 {
	cap log close
	log using "${log_dir}/2_descriptive.log", replace
}

* ... (main body of the dofile) ...

*Close log if needed
if ${store_log_descriptive}==1 {
	cap log close
}

```

First we start with an `if` statement, which puts our global settings into effect. Within the curly brackets we include `cap log close` to ensure that any open log files from prior attempts are closed before we open the log file. Then we use `log using "${log_dir}/2_descriptive.log", replace`, which generates a log file stored in the log directory `log_dir` (we defined this in the master file) and saved under the name `2_descriptive.log`. Finally, at the end of the script we include a command to close the log file.

We include this code within each of the dofiles, changing only the `store_log` global and the name of the logfile to match the appropriate step.
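One caveat, offered as a sketch rather than as part of the original setup: if a project dofile is run on its own, without first running `0_master.do`, the global is undefined, `${store_log_descriptive}` expands to nothing, and `if ==1 {` is a syntax error. A possible guard at the top of the dofile assigns a default value:

```stata
* If the setting was never defined (dofile run on its own), default to no log
if "${store_log_descriptive}" == "" {
    global store_log_descriptive = 0
}
```

The comparison is done on the quoted string, so it works whether or not the global has been defined.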



## 16.9 Stylized code

Following a clear and consistent code style is a foundational skill that new coders sometimes overlook.

There are three core practices that will make it easy to write, edit and understand your code:

1. Adding comments
2. Splitting up your code into multiple lines
3. Indenting and spacing your code

### 16.9.1 Commenting

> Instead of relying on your memory, leave a quick comment. 

There are three ways to comment in Stata:

```stata
* comments on individual lines

// comments on individual lines and after some code

/*
comments on multiple lines
like a "code block"
*/

```

You can also use a series of asterisks `*` to format your dofile and partition your code. In the `0_master.do` example we saw earlier, the directory settings were highlighted as follows:

```stata
****************
* Directories
****************
```

Formatting your dofile in this manner creates visual bookmarks and highlights different sections of your script.

Another use for comments is to "comment out" code that you might be testing or might need later. Use an asterisk to comment out a line:

```stata
*gen log_earnings = log(earnings)
```

Or comment out a block of code:

```stata
/*
label variable workerid "ID"
la var treated "Treatment Dummy"
la var earnings "Earnings"
la var year "Calendar Year"
*/

```

Most importantly, leave comments before or after your code to explain what you did.

```stata
* Open Raw Data
import delimit using "${datadir}/raw/fake_data.csv", clear

la var birth_year "Year of Birth" // label variable

```

As we move on to writing more complex code, leaving comments will become increasingly helpful.


### 16.9.2 Splitting the code across lines

In Stata, we can split code across multiple lines using three forward slashes `///`. This can be particularly useful when making graphs. Let's see an example to understand why.

```stata
twoway (connected log_earnings year if treated) || (connected log_earnings year if !treated), ylabel(#8) xlabel(#10) ytitle("Log-earnings") xtitle("Year") legend( label(1 "Treated") label(2 "Control"))

```

Making a graph involves a lot of small components, and here they are all packed into a single line of code. If we had to go back and change the number of ticks for the x-axis `xlabel(#)`, it would take us a moment to parse through all this code.

Now, let's format this code block using `///` to split it across multiple lines.

```stata
twoway ///
    (connected log_earnings year if treated) || (connected log_earnings year if !treated) , ///
    ylabel(#8)  xlabel(#10) ///
    ytitle("Log-earnings") xtitle("Year") ///
    legend( label(1 "Treated") label(2 "Control"))

```

Is it easier for you to find `xlabel(#)` this time around?

Using `///` is a simple step we can take to make code blocks appear neat and simple to read.
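The same technique applies to any long command, not just graphs. As an illustration, here is a hypothetical regression that reuses the variable names from the earlier examples (it is not a specification taken from the project), with its options moved onto their own line:

```stata
* Regress log earnings on treatment status and year dummies,
* clustering standard errors by worker
regress log_earnings     ///
    treated i.year       ///
    , vce(cluster workerid)
```

Stata reads the three lines as a single command; note that `///` must be preceded by a space.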

### 16.9.3 Indent and space your code

Using indentations in your code and spacing it neatly can improve its readability with little effort. You can use the `tab` button on your keyboard to indent and organise your code. Let's reformat the last example to see this in action.

```stata
twoway                                              ///
    (connected log_earnings year if treated)        ///
    ||                                              ///
    (connected log_earnings year if !treated)       ///
        ,                                           ///
        ylabel(#8)                                  ///
        xlabel(#10)                                 ///
        ytitle("Log-earnings")                      ///
        xtitle("Year")                              ///
        legend(                                     ///
            label(1 "Treated")                      ///
            label(2 "Control")                      ///
        )
```

This is the same code block as before, but it is significantly easier to read this time around. Try to find `xlabel(#)` once again. Do you notice any difference?

You might not want to indent your code on such a granular level, as shown in the example above. That's okay, as long as the code is organised in a way that is clear to you and your collaborators, and generally easy to understand.

### 16.9.4 Putting it all together 

Let's review a final example that combines all the code styling tools we have discussed so far.

```stata
twoway                                              ///
    (connected log_earnings year if treated)        ///     // log earnings, treated vs control group
    ||                                              ///
    (connected log_earnings year if !treated)       ///
        ,                                           ///
        ylabel(#8)                                  ///     // label ticks
        xlabel(#10)                                 ///
        ytitle("Log-earnings")                      ///     // axis titles
        xtitle("Year")                              ///
        legend(                                     ///     // legend labels
            label(1 "Treated")                      ///
            label(2 "Control")                      ///
        )
```

The comments in this example might seem unnecessary, since the code is self-explanatory. However, depending on your familiarity with Stata (or coding in general) and the complexity of the code, adding comments that seem obvious at the time can be helpful when you revisit your work days or weeks later. As students of economics, we understand that there is an opportunity cost to everything — including time spent deciphering code you have already written.
