# ECON 490: Opening Datasets (5)

## Prerequisites:
---
1. Setting up Anaconda and Stata kernel
2. Learning how to use Stata do files
3. Understanding the basic syntax of Stata commands
4. Understanding local and global variables 

## Learning objectives:
---
By the end of this module you will be able to:
- Import and save datasets in Stata

In this repository there is a folder named data with a sub-folder named raw, which contains two versions of the same dataset. The dataset simulates information on workers over the years 1982-2012 in a fictional country where, in 2003, a policy was enacted that allowed some workers to enter a training program intended to boost their earnings. We'll use this dataset to practice the steps needed to explore and manipulate real-world datasets.

## 5.1 Clearing the Workspace

*Every* do-file must begin with a command that clears any previous work done in Stata. This makes sure that:
1. We do not waste computer memory on things other than the current project.
2. Whatever results we obtain in the current session truly belong to that session.


We can clear many different things from the workspace (see `help clear` if needed). For the purposes of this lecture, the easiest way to clear everything at once is to write the following:

In [2]:
clear *
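
To make the difference between the variants of `clear` concrete, here is a minimal sketch of how the top of a do-file might look. It only uses the commands discussed above. The commented-out line is there for comparison and is not meant to be run in addition to the first one.

```stata
* Top of a do-file: wipe whatever a previous session left in memory.
* "clear *" is a synonym for "clear all".
clear *

* For comparison: "clear" on its own only drops the data and value labels,
* leaving things such as stored results and programs in memory.
* clear
```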

## 5.2 Changing Directories 

Before we start importing data into Stata, it is useful to know how to change the folder that Stata accesses whenever you run a command that opens or saves a file. Once you instruct Stata to change the directory to a specific folder, it will open files from that folder and save files to that folder unless you give a different path. This applies to data files, do-files, and log files (more on these later in this course). Stata keeps using that folder until you either close the program or change to another directory, so you do not have to type the full path in every command. It also means that every time you open Stata you will need to change the directory to the one you want to use.

<div class="alert alert-info">

**Note:** We write the directory path within quotation marks to make sure Stata interprets it as a single string. If we didn't do this, we could run into problems with folder names that include blank spaces.

</div>

Where have you saved the fake_data file? Let's change the directory to the location of that file using the command below. You can change your workspace to a directory named `some_folder/some_sub_folder` by writing `cd "some_folder/some_sub_folder"`.

In [3]:
cd "."

C:\Users\paulc\Dropbox\Projects\Gitlab\econometrics\econ490-stata


Notice that once we change the directory, Stata prints the full path of the directory where we're currently working.
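
Since global macros are a prerequisite for this module, one possible pattern is to store the folder path in a global and reuse it. The sketch below assumes Stata is currently in the root of this repository and that the data sit in `data/raw`. The macro name `rawdir` is a hypothetical choice, and the path will need adjusting to wherever you saved the files.

```stata
* Assumed relative path to the raw data; adjust to your own machine
global rawdir "data/raw"

* Quotation marks keep the path intact even if a folder name has spaces
cd "$rawdir"

* pwd prints the current working directory, so we can confirm the change
pwd
```

Because `cd` with a relative path moves from wherever Stata currently is, running this block twice in a row would fail the second time. An absolute path avoids that.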

## 5.3 Opening Datasets 

#### Excel and CSV files 
When looking for data for your research, you will find that many datasets are not formatted for Stata. In many cases, they are stored as Excel or CSV files. Not surprisingly, the command for reading such files is called `import`, and it has two main versions: `import excel` and `import delimited`.
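
This module works with a CSV file, but for completeness here is a minimal sketch of what reading an Excel workbook might look like. The file name `some_workbook.xlsx` and the sheet name `Sheet1` are hypothetical and do not exist in this repository.

```stata
* Hypothetical workbook and sheet names, for illustration only.
* firstrow treats the first row of the sheet as variable names;
* clear replaces whatever dataset is currently in memory.
import excel using "some_workbook.xlsx", sheet("Sheet1") firstrow clear
```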

Let's open the dataset called `fake_data.csv`. The file extension tells us the file type: this file name ends in `.csv`, so the data is stored as a CSV file and we need `import delimited` to open it in Stata. The syntax of this command is `import delimited [using] filename [, import_delimited_options]`.

We *always* include the option `clear` to make sure we're clearing any previous dataset that was opened before. Recall that to use an option, we include a comma (`,`) after the command line and write the option name. You are welcome to also read the documentation of these commands by writing `help import delimited`.

In [4]:
import delimited using "fake_data.csv", clear

(9 vars, 2,861,772 obs)


Notice that Stata prints a message saying that it found 9 variables and almost 3 million observations. When we open datasets that are *not* in Stata format, it is very important to check whether the first row of the data includes the variable names.

In [6]:
list in 1/3 //List first 3 observations


     +----------------------------------------------------------------------+
  1. | workerid | year | sex | birth_~r | age | start_~r | region | treated |
     |        1 | 1999 |   M |     1944 |  55 |     1997 |      1 |       0 |
     |----------------------------------------------------------------------|
     |                               earnings                               |
     |                               39975.01                               |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  2. | workerid | year | sex | birth_~r | age | start_~r | region | treated |
     |        1 | 2001 |   M |     1944 |  57 |     1997 |      1 |       0 |
     |----------------------------------------------------------------------|
     |                               earnings                               |
     |                               278378.1                 

Here the first row of the file contained the variable names, and Stata read them correctly. By default, `import delimited` tries to detect whether the file includes variable names. If it gets this wrong, we need to include the option `varnames(#|nonames)`, where we replace `#` with the number of the row that contains the names. If the data has no names, the option is `varnames(nonames)`. Don't forget that you can always check the documentation by writing `help import delimited`.
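
The two forms of the `varnames()` option can be sketched as follows. The first command uses the file from this module. The second uses a hypothetical file, `no_header.csv`, that is assumed to have no header row.

```stata
* State explicitly that row 1 of the file holds the variable names
import delimited using "fake_data.csv", varnames(1) clear

* Hypothetical file without a header row: every row is treated as data
* and Stata names the variables v1, v2, and so on
import delimited using "no_header.csv", varnames(nonames) clear
```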

#### Stata files
To open datasets in Stata format we use the command `use`. As the example below shows, we can recognize a dataset stored in Stata format because the file's name ends with `.dta`.

In [8]:
use "fake_data.dta", clear

In [9]:
list in 1/3 //List first 3 observations


     +----------------------------------------------------------------------+
  1. | workerid | year | sex | birth_~r | age | start_~r | region | treated |
     |        1 | 1999 |   M |     1944 |  55 |     1997 |      1 |       0 |
     |----------------------------------------------------------------------|
     |                               earnings                               |
     |                               39975.01                               |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  2. | workerid | year | sex | birth_~r | age | start_~r | region | treated |
     |        1 | 2001 |   M |     1944 |  57 |     1997 |      1 |       0 |
     |----------------------------------------------------------------------|
     |                               earnings                               |
     |                               278378.1                 
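
With almost 3 million observations, it can be handy to inspect a Stata file before loading it, or to load only part of it. The sketch below assumes `fake_data.dta` is in the current directory and uses three of the variable names shown in the output above.

```stata
* Describe the contents of the file without loading it into memory
describe using "fake_data.dta"

* Load only three of the nine variables
use workerid year earnings using "fake_data.dta", clear
```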

## 5.4 Saving Datasets 

You can save any opened dataset in Stata format by writing `save "some_directory/dataset_name.dta", replace`. Unlike `import delimited`, the `save` command takes the file name directly, without `using`. The `replace` option overwrites a previous version of the file.
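
As a minimal sketch, the following opens the dataset and saves a copy of it in the current directory. The output name `fake_data_copy.dta` is a hypothetical choice, made so that the original file is not overwritten.

```stata
* Open the dataset, then save a copy under a new (hypothetical) name
use "fake_data.dta", clear
save "fake_data_copy.dta", replace
```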

We can also save data in other formats with the `export excel` and `export delimited` commands. See `help export excel` and `help export delimited` for details.
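
A rough sketch of both export commands is shown below, assuming the fake data is already open in Stata. The output file names are hypothetical. The Excel export is restricted to the first 1,000 observations because an `.xlsx` sheet holds at most 1,048,576 rows, which is fewer than this dataset contains.

```stata
* Write the full dataset to a CSV file (hypothetical file name)
export delimited using "fake_data_export.csv", replace

* Write only the first 1,000 observations to an Excel file,
* putting the variable names in the first row of the sheet
export excel using "fake_data_sample.xlsx" in 1/1000, firstrow(variables) replace
```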