# ECON 490: Stata Essentials (3)

## Prerequisites 
---
1. Have set up Anaconda Navigator to access JupyterLab
2. Have installed the Stata kernel for use in JupyterLab
3. Understand how to effectively use Stata do files and know how to generate log files

## Learning objectives:
---
1. View the characteristics of any dataset using the command `describe`
2. Use `help` to learn best how to run commands
3. Understand the Stata command syntax using command `summarize`
4. Create loops using the commands `forvalues`, `foreach` and `while`

## 3.1 Describing Your Data

Let's start by opening a dataset that was installed along with Stata on your computer. We will soon move on to importing our own data, but this Stata dataset will help get us started. This is a dataset on automobiles and their characteristics. You can load this dataset by running the command in the cell below:

In [3]:
sysuse auto.dta, clear

(1978 automobile data)


We can begin by checking the characteristics of the dataset we have just loaded. The command `describe` allows us to see the number of observations, the number of variables, a list of variable names and descriptions, and the variable types and labels.

In [5]:
describe 


Contains data from /Applications/Stata/ado/base/a/auto.dta
 Observations:            74                  1978 automobile data
    Variables:            12                  13 Apr 2020 17:45
                                              (_dta has notes)
--------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
--------------------------------------------------------------------------------
make            str18   %-18s                 Make and model
price           int     %8.0gc                Price
mpg             int     %8.0g                 Mileage (mpg)
rep78           int     %8.0g                 Repair record 1978
headroom        float   %6.1f                 Headroom (in.)
trunk           int     %8.0g                 Trunk space (cu. ft.)
weight          int     %8.0gc                Weight (lbs.)
length          int     %8.0g                 Length (i

Notice that this dataset consists of 12 variables and 74 observations. We can see that one of those variables is named `make` and that it indicates the make and model of the vehicle. We can also see that the variable `make` is made up of text because it is a string variable. Other variables are numeric. For example, the variable `mpg`, which indicates the vehicle's mileage (miles per gallon), is an integer. The variable `foreign` is also numeric and it likely only takes the values 0 or 1, indicating whether the car is foreign or domestically made; this is a dummy variable.
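
As a small sketch of how `describe` can be narrowed down, the cell below describes only a few variables and then asks for just the dataset-level summary. The choice of variables is arbitrary, and the `short` option is taken from the `describe` help file rather than from this module.

```stata
* Load the example dataset that ships with Stata
sysuse auto.dta, clear

* Describe only the variables we are interested in
describe make mpg foreign

* Show only the dataset-level information (observations and variables)
describe, short
```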

The numeric variables store numbers of different sizes based on their sub-type. You can see a brief description here:

![](img/data_type_num.png)

The string variables store text of different sizes based on their sub-type. A brief description is provided here:

![](img/data_type_str.png)



## 3.2 Introduction to Stata Command Syntax

### 3.2.1 Using HELP to Understand Your Commands

To help us get comfortable with the syntax used by Stata, let's start with a simple and useful command: `summarize`. This command will give us the basic statistics for any variable in the dataset, such as the variables that we talked about a moment ago.

In [15]:
summarize rep78


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       rep78 |         69    3.405797    .9899323          1          5


The same command can be used to view the statistics for multiple variables at the same time by writing `summarize varname1 varname2` etc. Try this yourself in the empty cell below using some of the variable names we learned about above using `describe`.
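
For instance, a minimal version of this exercise might look like the cell below. The three variables are picked only for illustration; any of the names listed by `describe` would work.

```stata
* Load the example dataset that ships with Stata
sysuse auto.dta, clear

* Summarize several variables in a single command
summarize price mpg weight
```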

Stata has a help manual installed in the program that provides documentation for each command. This information can be reached by typing the command `help` followed by the name of the command we need extra information about.

To see the extra information that is available for `summarize`, run the command below:

In [1]:
help summarize


[R] summarize -- Summary statistics
                 (View complete PDF manual entry)


Syntax
------

        summarize [varlist] [if] [in] [weight] [, options]

    options           Description
    --------------------------------------------------------------------------
    Main
      detail          display additional statistics
      meanonly        suppress the display; calculate only the mean;
                        programmer's option
      format          use variable's display format
      separator(#)    draw separator line after every # variables; default is
                        separator(5)
      display_options control spacing, line width, and base and empty cells

    --------------------------------------------------------------------------
    varlist may contain factor variables; see fvvarlist.
    varlist may contain time-series operators; see tsvarlist.
    by, collect, rolling, and statsby are allowed; see prefix.
  
    aweights, fweights, and iweights are 

You will need to run this command directly in Stata on your computer in order to be able to see all of the information provided by `help`. I suggest that you run it in Stata now so that you can see that output directly.

When you do, you will see that the first one or two letters of the command are often underlined. This underlining indicates the shortest allowable abbreviation for a command (or option).

For example, if you type `help rename` you will see that `rename` can be abbreviated `ren`, `rena`, or `renam`, or it can be spelled out in its entirety. 

Other examples are `g`enerate, `ap`pend, `rot`ate, and `ru`n.

If there is no underline, then no abbreviation is allowed. For example, the command `replace` cannot be abbreviated. The reason for this is that Stata doesn't want you to accidentally make changes to your data by replacing the information in the variable. 

We can write the `summarize` command as its shortest abbreviation `su` or as a longer abbreviation such as `sum`.
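
To make this concrete, the three commands in the sketch below should produce identical output, since each is just a different spelling of `summarize`. The variable `rep78` is used only as an example.

```stata
* Load the example dataset that ships with Stata
sysuse auto.dta, clear

* Full command name
summarize rep78

* Shortest allowable abbreviation
su rep78

* A longer abbreviation
sum rep78
```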

Also in the help output in Stata you will see that some words are written in blue and are enclosed in square brackets. We will talk more about these below, but in Stata you can click directly on those links for more information from help.

Finally, help provides a list of the available options, which here will result in the display of extra information about the variable. We will learn more about this below in section 3.2.4.

### 3.2.2 Imposing IF Conditions

When the syntax of the command allows for `[if]`, we can run the command on a subset of the data that satisfies any condition we choose. The list of conditional operators is the following:

1. Equal: ==
2. Greater than and less than: > and <
3. Greater than or equal and less than or equal: >= and <= 
4. Not Equal: != 

We can also compound different conditions using the list of logical operators:

1. And: & 
2. Or: | 
3. Not: ! or ~ 

Let's look at an example using this new knowledge by summarizing the variable `price` when the car is domestic (i.e. not foreign):

In [25]:
su price if foreign == 0


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         52    6072.423    3097.104       3291      15906


Let's do this again, but now we will impose the additional condition that the mileage must be less than 25.

In [26]:
su price if foreign == 0  & mpg < 25


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         44    6354.568    3273.345       3291      15906
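
The "or" and "not" operators can be used in the same way. The cell below is a sketch with conditions chosen purely for illustration; the thresholds are not taken from the module.

```stata
* Load the example dataset that ships with Stata
sysuse auto.dta, clear

* "Or": cars that are foreign OR have a mileage of at least 30
su price if foreign == 1 | mpg >= 30

* "Not": negate a whole condition by wrapping it in parentheses
su price if !(foreign == 1)
```

Because `foreign` only takes the values 0 and 1 in this dataset, the last command selects the same 52 domestic cars as `foreign == 0`.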


Maybe we want to restrict to a particular list of values. Here we can make use of the function `inlist()` or we can write out all of the conditions using the "or" operator:

In [11]:
su price if inlist(mpg,10,15,25,40)


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |          7    6507.857     1838.25       4482       9735


This works exactly the same way as this command:

In [12]:
su price if mpg == 10 | mpg == 15 | mpg == 25 | mpg == 40


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |          7    6507.857     1838.25       4482       9735


Maybe we want to restrict values to a particular range. Here we can make use of the function `inrange()` or we can write out all of the conditions using the conditional operators:

In [13]:
su price if inrange(mpg,5,25) 


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         60    6577.083    3117.013       3291      15906


This works exactly the same way as this command:

In [14]:
su price if mpg>=5 & mpg<=25


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         60    6577.083    3117.013       3291      15906


Some variables may have no information recorded for particular observations. For example, when we `summarize` our automobile data you will see that there are 74 observations for most variables, but that the variable `rep78` has only 69 observations; for five observations there is no repair record indicated in the dataset.

In [19]:
su price rep78 


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    6165.257    2949.496       3291      15906
       rep78 |         69    3.405797    .9899323          1          5


If we only want to consider observations without missing values, we can use the condition `!missing()`, which combines the function `missing()` with the logical "not" operator "!". The command below summarizes the variable `price` for the observations in which `rep78` is NOT missing.

In [20]:
su price if !missing(rep78)


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         69    6146.043     2912.44       3291      15906


This command can also be written using a conditional operator, since missing numeric values are indicated by a ".". You can see that command here:

In [21]:
su price if rep78!=.


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         69    6146.043     2912.44       3291      15906


Notice that in both cases there are only 69 observations.

For future reference, if you want to do this with string variables, missing values are indicated by "".
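
The sketch below pulls these ideas together. It counts the missing values of a numeric and a string variable, and then shows one reason to handle missing values explicitly: Stata treats a numeric missing value as larger than any number, so a condition such as `rep78 > 3` on its own would also pick up the missing observations. The cutoff of 3 is an arbitrary choice for this example.

```stata
* Load the example dataset that ships with Stata
sysuse auto.dta, clear

* Count the observations where the numeric variable rep78 is missing
count if missing(rep78)

* missing() also works for string variables (equivalent to make == "")
count if missing(make)

* Exclude missing values explicitly when using > or >=
su price if rep78 > 3 & !missing(rep78)
```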

### 3.2.3 Imposing IN Conditions 

We can also subset the data by using the observation number. The example below summarizes the data in observations 1 through 10.

In [16]:
su price in 1/10


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         10      5517.4    2063.518       3799      10372


But be careful! These types of conditions are generally not recommended because they depend on the order of the data.

To show this, let's sort the observations from lowest to highest price by running the command `sort` and then run the same `in` condition:

In [27]:
sort price 
su price in 1/10




    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         10      3726.5    245.9007       3291       3984


You can see that the result changes because the observations in positions 1 through 10 have changed.

Always avoid using `in` whenever you can use `if` instead!
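
By contrast, an `if` condition selects observations by their values, so it should give the same answer no matter how the data are sorted. The cell below is a sketch of that check; the condition `mpg < 20` is an arbitrary choice for this example.

```stata
* Load the example dataset that ships with Stata
sysuse auto.dta, clear

* Summarize price for low-mileage cars in the original order
su price if mpg < 20

* Re-sort the data and run the same command again
sort price
su price if mpg < 20
```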

### 3.2.4 Command Options

When we used the `help` command, we saw that we can introduce some optional arguments after a comma. In the case of the `summarize` command we were shown the options: `d`etail, `mean`only, `f`ormat and `sep`arator(#).

If we want additional statistics apart from the mean, standard deviation, minimum, and maximum, we would use the option `detail`, or, if you prefer, its abbreviation `d`.

In [28]:
su price , d


                            Price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3291           3291
 5%         3748           3299
10%         3895           3667       Obs                  74
25%         4195           3748       Sum of wgt.          74

50%       5006.5                      Mean           6165.257
                        Largest       Std. dev.      2949.496
75%         6342          13466
90%        11385          13594       Variance        8699526
95%        13466          14500       Skewness       1.653434
99%        15906          15906       Kurtosis       4.819188
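
Options can be combined with the other parts of the syntax. The sketch below first adds an `if` condition before the comma, and then uses the `separator(#)` option listed in the help file; the value 3 and the list of variables are arbitrary choices for illustration.

```stata
* Load the example dataset that ships with Stata
sysuse auto.dta, clear

* Detailed statistics for domestic cars only: the condition goes before the comma
su price if foreign == 0, detail

* Draw a separator line after every 3 variables instead of the default 5
su price mpg rep78 headroom trunk weight, separator(3)
```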


## 3.3 Using Loops 

Much like other programming languages, Stata has `for`-style and `while` loops that we can use to iterate through many instances. In particular, the `for`-style loops come in two forms: `forvalues` (iterate across a range of numbers) and `foreach` (iterate across a list of names).

These loops typically create a local scope (i.e. the iteration labels only exist within the loop). A local in Stata is a special variable that temporarily stores information. We'll discuss locals in the next module, but consider this simple example in which the letter "i" is used as a placeholder for the number 95.

In [36]:
local i = 95

display `i'



95


We can also create locals that are strings rather than numeric. Consider this example:

In [37]:
local course = "ECON 490"

display "`course'"



ECON 490



We can store anything inside a local, and when we want to use that information we write the name of the local enclosed in a backtick (\`) and an apostrophe (').

In [39]:
local i = 95
local course = "ECON 490"

display "I am enrolled in `course' and hope my grade will be `i'%!"



I am enrolled in ECON 490 and hope my grade will be 95%!


### 3.3.1 Creating Loops Using Forvalues 

Whenever we want to iterate across a range of values defined as `min_value(step)max_value`, we can write the command below. Here we are iterating from 1 to 10 in increments of 1.

In [43]:
forvalues counter=1(1)10{
    *Notice that now counter is a local variable
    display `counter'
}


1
2
3
4
5
6
7
8
9
10


Notice that the open brace ({) needs to be on the same line as the `forvalues` command, with no comments after it. Similarly, the closing brace (}) needs to be on its own line.

Try running the command above with different minimum or maximum values and/or different increments in the cell below.
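
As one possible version of this exercise, the sketch below first counts in increments of 5 and then uses the loop counter inside an `if` condition. The shorthand `1/5` (meaning 1 to 5 in steps of 1) and the local names `counter` and `r` are choices made for this example.

```stata
* Load the example dataset that ships with Stata
sysuse auto.dta, clear

* Count from 0 to 20 in increments of 5
forvalues counter = 0(5)20 {
    display `counter'
}

* Summarize price separately for each repair record value, 1 through 5
forvalues r = 1/5 {
    display "rep78 = `r'"
    summarize price if rep78 == `r'
}
```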

### 3.3.2 Creating Loops Using Foreach

Whenever we want to iterate across a list of names, we may write the command below which asks Stata to `summarize` for a list of variables: `mpg` and `price`.

In [44]:
foreach name in "mpg" "price"{
    summarize `name'
}



    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         74     21.2973    5.785503         12         41

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    6165.257    2949.496       3291      15906


We can have a list stored in a local variable as well.

In [45]:
local namelist "mpg price"
foreach name in `namelist'{
    summarize `name'
}




    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         74     21.2973    5.785503         12         41

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    6165.257    2949.496       3291      15906
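
When the list is made up of variable names, `foreach` can also be written with `of varlist` instead of `in`. The cell below is a sketch of that form; the local name `var`, the variables, and the `if` condition are arbitrary choices for illustration.

```stata
* Load the example dataset that ships with Stata
sysuse auto.dta, clear

* Loop over a list of existing variables and summarize each for domestic cars
foreach var of varlist price mpg weight {
    summarize `var' if foreign == 0
}
```

With `of varlist`, Stata checks that every name in the list is an existing variable before the loop starts, which can help catch typos early.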


### 3.3.3 Writing Loops With Conditions Using While

Whenever we want to iterate until a condition is met, we may write the command below. The condition here is simply "while counter is less than 5". 

In [36]:
local counter = 1 
while `counter'<5{
    display `counter'
    local counter = `counter'+1
}



1
2
3
4
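
A `while` loop is most useful when the number of iterations is not known in advance. The sketch below doubles a local until it exceeds 100; the local name `x` and the limit are hypothetical choices for this example. Note that the local must be updated inside the loop, otherwise the condition never becomes false and the loop never stops.

```stata
* Start from 1 and keep doubling while the value is at most 100
local x = 1
while `x' <= 100 {
    display `x'
    * Update the local so that the condition eventually fails
    local x = `x' * 2
}
```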


## 3.4 Wrapping Up


In this module we learned how Stata commands work and how their syntax is structured. In general, many Stata commands follow this structure:

```
  name_of_command [varlist] [if] [in] [weight] [, options]
```
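
As a hedged illustration of how this template maps onto a command other than `summarize`, the sketch below uses `list`. The variables, the condition, and the options `clean` and `noobs` are choices made for this example and come from the `list` help file, not from this module; no `in` range or weight is used.

```stata
* Load the example dataset that ships with Stata
sysuse auto.dta, clear

* name_of_command: list
* varlist:         make price mpg
* if:              foreign == 1 & price > 10000
* options:         clean noobs
list make price mpg if foreign == 1 & price > 10000, clean noobs
```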

At this point, you should feel more comfortable reading a documentation file for a Stata command. The question that remains is how to find new commands!

You are encouraged to search for commands using the command `search`. For example, if you are interested in running a regression you can write:

In [20]:
search regress 

You will see that in Stata on your computer a new window pops up, and you can click on the different options it shows to look at the documentation for all of these commands. Try it yourself!


In any of the following modules, whenever a command confuses you, feel free to write `search command` or `help command` to be directed to the documentation.