# ECON 490: Stata Essentials (3)

## Prerequisites 
---
1. Setting up Anaconda and Stata kernel
2. Learning how to use Stata do files

## Outcomes
---
By the end of this module you will be able to:
- View dataset characteristics
- Produce summary statistics 

## 3.1 Stata Types

Before we begin this exploration, let us open a dataset that is provided whenever we install Stata. This is a dataset on automobiles and their characteristics.

In [1]:
sysuse auto.dta, clear

(1978 Automobile Data)


To start, we should check the characteristics of the dataset we just loaded. The command `describe` shows how many observations and variables the data has, along with each variable's storage type and label.

In [None]:
describe 

We observe that our dataset consists of 12 variables and 74 observations, and we get a brief description of each variable. Some of these variables are numeric (Stata's numeric storage types are byte, int, long, float, and double) and some are made of text (string).

The numeric variables can store numbers of different sizes based on their sub-type. You can see a brief description here:

![](img/data_type_num.png)

The string variables can also store text of different sizes based on their sub-type. A brief description is provided here:

![](img/data_type_str.png)


With this knowledge we can infer that the variable `make` probably contains the model name written as text, and the variable `foreign` is probably a variable that takes the values 0 or 1 depending on whether the car is foreign-made (i.e. a dummy variable).
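One way to check these guesses is sketched below. The commands `codebook` and `tabulate` are not otherwise covered in this module, so treat this as an optional preview rather than part of the lesson.

```stata
* Load the example data (same command as above)
sysuse auto.dta, clear

* Restrict describe to the two variables of interest
describe make foreign

* codebook reports the range, distinct values, and missing count
codebook foreign

* tabulate shows how many cars fall in each category
tabulate foreign
```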

## 3.2 Stata Syntax 
A very useful command in Stata is `summarize`, which gives us some basic statistics for a variable. The same command can also be used to view these statistics for multiple variables at once by writing `summarize varname1 varname2` and so on.

In [3]:
summarize foreign


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     foreign |         74    .2972973    .4601885          0          1


In [4]:
summarize foreign length


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     foreign |         74    .2972973    .4601885          0          1
      length |         74    187.9324    22.26634        142        233


Stata ships with built-in documentation for every command. To read it, run the command `help` followed by the name of the command we need extra information about. For example, if we needed extra information about the command `summarize`, we would run:

In [5]:
help summarize

This command will open a whole page of information on the command. We should pay attention to the syntax diagram:

![](img/syntax_summarize.png)

The first thing we can observe from the `help` command is that Stata allows abbreviations. The shortest allowed abbreviation for a command or option is shown by underlining. For example, `rename` can be abbreviated `ren`, `rena`, or `renam`, or it can be spelled out in its entirety. Other examples are `g`enerate, `ap`pend, `rot`ate, and `ru`n.
If there is no underlining, no abbreviation is allowed. For example, `replace` may not be abbreviated, the underlying reason being that `replace` changes the data. This means that we can write the `summarize` command as its shortest abbreviation, `su`, or as a longer abbreviation such as `sum`.
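As a minimal sketch of the abbreviation rule, the three commands below should all produce the same table for `price`, since each one spells `summarize` with at least its underlined letters `su`:

```stata
* Full command name
summarize price

* Longer abbreviation
sum price

* Shortest allowed abbreviation
su price
```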

Also, you will notice that there are some blue names within square brackets. These are optional arguments of the command; an in-depth explanation of them is located below.

Finally, the diagram lists the available options, which display extra information about the variable. For example, if we wanted the statistics of the variable `price` when `weight` is greater than 3000, and wanted additional statistics apart from the mean, standard deviation, min, and max values, we would write the command 
`su price if weight > 3000, detail`. We will learn more about the different conditions we can add to commands (such as >) below.

> As we can see from the syntax diagram, we can also abbreviate `detail` as `d`.

In [None]:
su price if weight > 3000, d 

#### We will now discuss the different conditions that can be added to a particular command.

#### 1. If Conditions

When the syntax of the command allows for `[if]`, it means that we can run the command on a subset of the data that satisfies the condition. The list of conditional operators is the following:

1. Equal sign: ==
2. Greater and Less than: > and <
3. Greater than or equal and Less than or equal: >= and <= 
4. Not Equal: != 

We can also compound different conditions using the list of logical operators:

1. And: & 
2. Or: | 
3. Not: ! or ~ 

Let's look at some examples using this new knowledge:

In [None]:
su price if foreign==0

In [8]:
su price if foreign==0  & mpg<25


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         44    6354.568    3273.345       3291      15906
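The `|` and `!` operators can be combined in the same way. The sketch below is illustrative only (the thresholds are arbitrary choices, not part of the original examples); parentheses make the order of evaluation explicit.

```stata
* Cars that are either foreign or get more than 30 miles per gallon
su price if foreign == 1 | mpg > 30

* Negate the compound condition used above: every car that is NOT
* (domestic with mpg below 25)
su price if !(foreign == 0 & mpg < 25)
```

Since the dataset has 74 observations and the `&` example above used 44 of them, the second command should report the remaining 30.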


- We can make use of the functions `inlist()` and `inrange()` when we want to restrict to a particular list of values or to a particular range.

In [9]:
su price if inlist(mpg,10,15,25,40)

*This command can also be written as: su price if mpg == 10 | mpg == 15 | mpg == 25 | mpg == 40. Try it out!


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |          7    6507.857     1838.25       4482       9735


This works exactly the same way as:

In [10]:
su price if mpg == 10 | mpg == 15 | mpg == 25 | mpg == 40


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |          7    6507.857     1838.25       4482       9735


In [11]:
su price if inrange(mpg,5,25) 

*The command: su price if mpg>=5 & mpg<=25 works the exact same way, try it out!


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         60    6577.083    3117.013       3291      15906


This works exactly the same way as:

In [12]:
su price if mpg>=5 & mpg<=25


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         60    6577.083    3117.013       3291      15906


- We can make use of the function `missing()`, negated as `!missing()`, when we want to restrict to observations where a variable does not have a missing value.

>There will be observations where there is no information recorded for a particular variable. When it is a string variable it will show as `""` (empty text), and when it is a numeric variable it will show as `.` (a single dot). Stata treats numeric missing values as larger than any number. If you write `su price if mpg>5`, it will include any observations where `mpg` is missing! Consider the following example.


In [13]:
su price if rep78>2


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         64    6239.984    2925.843       3291      15906
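The 64 observations above include cars for which `rep78` was never recorded, because a numeric missing value counts as larger than 2. One way to check this is with `count`, a command not otherwise covered in this module; this is a sketch rather than part of the original lesson.

```stata
* How many observations have a missing repair record?
count if missing(rep78)

* How many satisfy the condition once missing values are excluded?
count if rep78 > 2 & !missing(rep78)
```

The two counts should add up to the 64 observations reported by `summarize` above.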


The easiest way to get rid of the problems that could arise from missing values is deleting all the observations (rows) that contain missing values. However, if for some reason we need to keep the observations with missing values, the easiest way to omit missing values from a particular calculation is to negate the function `missing()` with the logical operator `!`.

In [15]:
su price if rep78>2 & !missing(rep78)
* This command can also be written as: su price if rep78>2 & rep78!=.


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         59    6223.847    2880.454       3291      15906


In [None]:
su price if rep78>2 & rep78!=.

#### 2. In Conditions

We can also subset the data in terms of the observation number. 

In [16]:
su price in 1/10


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         10      5517.4    2063.518       3799      10372


Using this type of condition is generally not recommended because it is sensitive to the way the data is sorted. Suppose now we order the data from lowest to highest price and run the same command.

In [17]:
sort price 
su price in 1/10




    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         10      3726.5    245.9007       3291       3984


And you can see that the result changes. This is why you should avoid using `in` whenever you can use an `if` condition instead. 
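To illustrate the difference, here is a small sketch in which the subset is defined by an `if` condition on `price` (the cutoff of 4000 is an arbitrary choice for illustration). Because the condition refers to values rather than positions, both `summarize` calls should return the same result however the data is sorted.

```stata
* Sort by fuel efficiency, then summarize cheap cars
sort mpg
su price if price < 4000

* Re-sort by price; the if-based subset is unchanged
sort price
su price if price < 4000
```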

### 3.2.4 Options

From the documentation file, we observed that we can introduce some optional arguments after a comma. In the case of the `summarize` command, we have the options `d`etail, `mean`only, `f`ormat, and `sep`arator(#).

In [18]:
su price , detail


                            Price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3291           3291
 5%         3748           3299
10%         3895           3667       Obs                  74
25%         4195           3748       Sum of Wgt.          74

50%       5006.5                      Mean           6165.257
                        Largest       Std. Dev.      2949.496
75%         6342          13466
90%        11385          13594       Variance        8699526
95%        13466          14500       Skewness       1.653434
99%        15906          15906       Kurtosis       4.819188


Options can be abbreviated as well:

In [19]:
su price , d


                            Price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3291           3291
 5%         3748           3299
10%         3895           3667       Obs                  74
25%         4195           3748       Sum of Wgt.          74

50%       5006.5                      Mean           6165.257
                        Largest       Std. Dev.      2949.496
75%         6342          13466
90%        11385          13594       Variance        8699526
95%        13466          14500       Skewness       1.653434
99%        15906          15906       Kurtosis       4.819188
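Options can also change what a command stores rather than what it prints. As a sketch, `meanonly` suppresses the table and computes only a few statistics, which `summarize` leaves behind in stored results such as `r(mean)`. Stored results are not covered in this module, so treat this as a preview.

```stata
* Compute the mean of price without printing the summary table
su price, meanonly

* Retrieve the stored mean; up to rounding, it should match the
* mean of price reported in the tables above
display r(mean)
```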


## 3.3 Loops 

Much like other programming languages, Stata has `for` and `while` loops that we can use to iterate through many instances. In particular, `for` loops come in two forms: `forvalues` (iterate across a range of numbers) and `foreach` (iterate across a list of names).

These loops create a local scope (i.e. the iteration labels only exist within a loop). A local variable in Stata (formally, a local macro) is a non-dataset variable that stores information and exists only within certain parts of our do-file run. We'll discuss local variables in much more detail in the next lecture, but consider this simple example:

In [22]:
local i = 13
*Essentially `i' now works as a placeholder for the value 13

display `i'



13


In [28]:
local year = "Calendar Year"

display "`year'"



Calendar Year


Essentially, we can write anything inside a local, and whenever we want to replace it with the actual information, we wrap its name in a backtick (\`) and an apostrophe (').
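Locals become useful once they are substituted into commands. In this sketch, the names `cutoff` and `vars` are hypothetical choices for illustration. Run all the lines together (in the same cell or do-file run), since a local disappears once that run ends.

```stata
* Store a numeric cutoff and a list of variable names in locals
local cutoff = 3000
local vars "price mpg"

* Stata substitutes the contents before running the command, so this
* line is read as: summarize price mpg if weight > 3000
summarize `vars' if weight > `cutoff'
```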

### 3.3.1 Forvalues 

Whenever we want to iterate across a range of values defined as `min_value(steps)max_value`, we can write:

In [30]:
forvalues counter=1(1)10{
    *Notice that now counter is a local variable
    display `counter'
}


1
2
3
4
5
6
7
8
9
10


In [31]:
forvalues counter=1(2)5{
    *Notice that now counter is a local variable
    display `counter'^2
}


1
9
25


Notice that the open brace must be on the same line as the `forvalues` command, and the first command of the loop body must start on a new line. Similarly, the closing brace must be on a line by itself.
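Loop counters are most useful inside other commands. The following sketch summarizes `price` separately for each repair-record category, assuming `rep78` takes the integer values 1 through 5 as it does in the auto dataset. The range `1/5` is shorthand for `1(1)5`.

```stata
forvalues r = 1/5 {
    * Label each block of output with the current value of the counter
    display "Repair record: `r'"
    summarize price if rep78 == `r'
}
```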

### 3.3.2 Foreach

Whenever we want to iterate across a list of names, we may write:

In [32]:
foreach name in "mpg" "price"{
    summarize `name'
}



    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         74     21.2973    5.785503         12         41

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    6165.257    2949.496       3291      15906


We can have a list stored in a local variable as well:

In [35]:
local namelist "mpg price"
foreach name in `namelist'{
    summarize `name'
}




    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         74     21.2973    5.785503         12         41

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    6165.257    2949.496       3291      15906
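`foreach` also has an `of varlist` form, which checks that every name in the list is a variable in the dataset. A minimal sketch follows, combining it with the `meanonly` option and the stored result `r(mean)` (neither is required by the loop itself):

```stata
foreach v of varlist price mpg weight {
    * Compute the mean without printing the table, then display it
    summarize `v', meanonly
    display "Mean of `v': " r(mean)
}
```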


### 3.3.3 While

Whenever we want to iterate until a condition is met, we may write:

In [36]:
local counter = 1 
while `counter'<5{
    display `counter'
    local counter = `counter'+1
}



1
2
3
4
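A `while` loop is handy when the number of iterations is not known in advance. In this sketch the local `x` (a hypothetical name) doubles on every pass, so the loop stops as soon as it reaches 100 or more.

```stata
local x = 1
while `x' < 100 {
    display `x'
    * Update the local used in the condition, otherwise the loop never ends
    local x = `x' * 2
}
```

This should display 1, 2, 4, 8, 16, 32, and 64.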


## 3.4 Wrapping up


In this lecture we learned how Stata commands work and what their syntax looks like. In general, a standard Stata command follows this structure:

```
  name_of_command [varlist] [if] [in] [weight] [, options]
```
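As an illustration of how the pieces fit together, the sketch below labels each part of one `summarize` call. The particular condition and range are arbitrary, and `[weight]` is left out because weights are not covered in this module.

```stata
* command    varlist      if                in         options
  summarize  price mpg    if foreign == 0   in 1/60 ,  detail
```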

At this point, you should feel more comfortable reading a documentation file for a Stata command. The question that remains is how to find new commands!

You are encouraged to search for commands using the command `search`. For example, if you are interested in running a regression you can write:

In [20]:
search regress 

You will see that a new window pops up, and you can click on the different options it shows to look at the documentation for all these commands. The new window should look like this:

![](img/search_regress.png)

In any of the following lectures, whenever a command confuses you, feel free to write `search command` or `help command` to be redirected to the documentation.