# Econ 490: Visualisation Types (9)

## Prerequisites 
---
1. Be able to effectively use Stata do files and generate log files.
2. Be able to change your directory so that Stata can find your files.
3. Import datasets in csv and dta format. 
4. Save data files. 

## Learning Objectives 
- Know when to use the following kinds of visualizations to answer specific questions using a data set:
    - scatterplots
    - line plots
    - bar plots
    - histograms
- Generate and fine-tune visualizations using Stata's `twoway` command and its options
- Use `graph export` to save visualizations in various formats including `.svg`, `.png` and `.pdf`

We'll continue working with the `fake_data` dataset introduced in the previous lecture. Recall that this dataset simulates information on workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings. 

In [None]:
clear* 

use fake_data, clear 

Data visualization is an effective way of communicating ideas to your audience, whether it's for an academic paper or a business setting. It can be a powerful medium to motivate your research, illustrate relationships between variables and provide some intuition behind why you applied certain econometric methods.

The real challenge is not understanding how to use Stata to create graphs. The challenge is figuring out which graph will do the best job of telling your empirical story. Before creating any graphs, be sure to identify the message you want the graph to convey. Try to answer the questions: Who is our audience? What is the question we're trying to answer?

<div class="alert alert-block alert-info">
<b>Note:</b> You can also use the drop-down menus to create your graphs. If you want to include the graph in your do-file, simply copy and paste the command that appears in the Command window after you create the graph.
</div>

## 9.1 Types of graphs 
### 1. Scatter Plot

<!-- what is it? and, when to use? --> 
Scatter plots are frequently used to demonstrate how two quantitative variables are related to one another. This plot works great when we are interested in showing relationships and groupings among variables from relatively large datasets.

#### Example
- ![Relationship of country religiosity vs wealth](https://ourworldindata.org/uploads/2013/11/GDP-vs-Religion.png) 
- [Comparing Americans' perceptions of which foods are healthy to the perception of nutritionists](https://www.nytimes.com/2017/10/09/learning/whats-going-on-in-this-graph-oct-10-2017.html)
- [Instagram followers in the fashion industry](https://qz.com/267635/explore-the-hidden-patterns-of-the-fashion-instagram-universe/)

#### 1.1 Creating a scatter plot

Let's say we want to plot the log-earnings by year. We begin by generating a new variable for log earnings. 

In [None]:
gen log_earnings = log(earnings)

la var log_earnings "Log-earnings" // Attach the label "Log-earnings" to the variable log_earnings

In [None]:
preserve

collapse (mean) log_earnings, by(year) // Collapse the data to one observation per year (year is now the unique id)

In [None]:
describe

We can see that the data is now yearly: all observations from the same year have been averaged, so `log_earnings` now holds the mean log-earnings for each year.
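To see what `collapse` does on a small scale, here is a minimal sketch. It uses Stata's built-in `auto` dataset instead of `fake_data` (an assumption made only so the example runs on any installation), and the local macro `n_groups` is a name introduced just for this illustration. It averages `price` within each value of `foreign`, lists the collapsed data, and then restores the original observations.

```stata
* Illustrative sketch using the built-in auto dataset (not fake_data)
sysuse auto, clear

preserve                              // keep a copy of the full dataset
collapse (mean) price, by(foreign)    // one row per value of foreign
list                                  // show the group means
local n_groups = _N                   // number of rows after collapsing
restore                               // bring back the original observations

display "Rows after collapse: `n_groups'; rows after restore: " _N
```

The same pattern applies to `fake_data`: the variable in `by()` defines the groups, and each group becomes one row.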

To create a scatterplot we need to use Stata's `twoway` command. The most important skill with graphing in Stata is to be able to understand the documentation. 

    
>Try using the `help` command to pull up the documentation for `twoway`.
    


In [None]:
help twoway

We can create many types of graphs using the command `twoway (type_of_graph y_variable x_variable)`; note that the y-variable comes first. In this case we want to create a scatterplot that shows log-earnings as the dependent variable and year as the independent variable. The command we use is as follows:

In [None]:
twoway (scatter log_earnings year)

graph export ./img/myscatterplot.svg, replace

It should look something like this: ![myscatterplot](img/myscatterplot.svg)
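Because `twoway` accepts several plots in one call, we can also overlay one plot type on another. The sketch below is only an illustration: it uses the built-in `auto` dataset rather than `fake_data`, and it adds a linear fit with the `lfit` plot type on top of the scatterplot.

```stata
* Illustrative sketch using the built-in auto dataset (not fake_data)
sysuse auto, clear

* Each set of parentheses is one plot; the y-variable is listed first
twoway (scatter mpg weight) (lfit mpg weight),        ///
    xtitle("Weight (lbs.)") ytitle("Mileage (mpg)")   ///
    legend(order(1 "Observed" 2 "Linear fit"))
```

Plots are drawn in the order they are listed, so the fitted line appears on top of the points.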

<div class="alert alert-block alert-info">
    
<b>Your turn:</b> Here's an example of a connected scatterplot. Can you deduce the command from the `twoway` documentation? 
    
</div>

![connected-scatter-plot](./img/myconnectedplot.svg)

In [None]:
** Try the command here!

### 2. Line Plot

<!-- what is it? and, when to use? --> 
Line plots visualize trends with respect to an independent, ordered quantity (e.g., time). This plot works great when one of our variables is ordinal (time-like) or when we want to display multiple series on a common timeline.

#### 2.1 Creating line plots 

Line plots can be generated using Stata's `twoway` command we saw earlier. This time, instead of writing `scatter` for the type of graph, we write `line`.

In [None]:
twoway (line log_earnings year), ///
    xtitle("Year") ytitle("Log-earnings")

graph export ./img/mylineplot.svg, replace

It should look something like this: ![mylineplot](img/mylineplot.svg)

Now let's try creating a line plot with multiple series on a common timeline. Let's set up the dataset to include log-earnings, year and the treatment variable. Then, use your code from the last exercise to complete the code for a multiple series line plot. Export the graph as `multilineplot.svg`.

To create this graph we first need to `restore` our data to the original version of the `fake_data` dataset. Once we have our original dataset, we collapse it by year and treatment status (the unique ids are `treated` and `year`).

In [None]:
restore

In [None]:
preserve

collapse (mean) log_earnings, by(treated year)

describe

Now that we have our collapsed dataset we can create the graph separating the earnings of the treated and the untreated over time. 

In [None]:
* Complete the command by adding the plot type before each variable list
twoway ( log_earnings year if treated) || ( log_earnings year if !treated) 
graph export ./img/multilineplot.svg, replace

It should look something like this: ![multilineplot](img/multilineplot.svg)
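For comparison, here is a sketch of a multiple-series line plot on a different dataset. It assumes the built-in `uslifeexp` dataset (not `fake_data`), where the two series are stored as separate variables rather than selected with `if`. The `sort` option orders the points by the x-variable so the line does not double back on itself.

```stata
* Illustrative sketch using the built-in uslifeexp dataset (not fake_data)
sysuse uslifeexp, clear

* One line per series, both drawn against the same time axis
twoway (line le_male year, sort) (line le_female year, sort), ///
    xtitle("Year") ytitle("Life expectancy (years)")          ///
    legend(order(1 "Males" 2 "Females"))
```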

### 3. Histogram

<!-- what is it? and, when to use? --> 
Histograms visualize the distribution of one quantitative variable. This plot works great when we want to see which values a variable takes and how often they occur. Continuous variables are grouped into bins; for a discrete variable, Stata's `discrete` option gives each distinct value its own bar.

#### 3.1 Creating histograms

Now let's restore the original dataset so that we can plot the distribution of log-earnings. 

In [None]:
restore

describe

In [None]:
histogram log_earnings

graph export ./img/myhistogram.svg, replace

It should look something like this: ![myhistogram](img/myhistogram.svg)
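The shape of a histogram depends on how the values are grouped. The sketch below, which assumes the built-in `auto` dataset rather than `fake_data`, shows two common adjustments: setting the number of bins for a continuous variable with `bin()`, and using `discrete` for a variable that takes only a few distinct values. The `frequency` option puts counts on the y-axis instead of densities.

```stata
* Illustrative sketch using the built-in auto dataset (not fake_data)
sysuse auto, clear

* Continuous variable: choose the number of bins and show counts
histogram mpg, bin(10) frequency xtitle("Mileage (mpg)")

* Discrete variable: one bar per distinct value
histogram rep78, discrete frequency xtitle("Repair record 1978")
```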

### 4. Bar Plot

<!-- what is it? and, when to use? --> 
Bar plots visualize comparisons of amounts. They are useful when we are interested in comparing a few categories as parts of a whole, or across time. 

> Bar plots should always start at 0. Starting bar plots at any number besides 0 is generally considered a misrepresentation of the data.

#### 4.1 Creating a bar plot


In [None]:
help graph bar   /* this is a "traditional" bar plot. You can also create a bar plot using the twoway command.*/

Now let's plot mean earnings by region. Note that the regions are numbered in our dataset. 

In [None]:
graph bar (mean) earnings, over(region)

graph export ./img/mybarchart.svg, replace

![mybarchart](img/mybarchart.svg)
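When categories are stored as numbers, the bars are labelled with those numbers unless the variable has value labels attached. The sketch below shows one way to attach them, using the built-in `auto` dataset rather than `fake_data`. The label name `repair_lbl` and the category names are hypothetical, chosen only for illustration; they are not the region names in our dataset.

```stata
* Illustrative sketch using the built-in auto dataset (not fake_data)
sysuse auto, clear

* Hypothetical names for the numeric categories of rep78
label define repair_lbl 1 "Poor" 2 "Fair" 3 "Average" 4 "Good" 5 "Excellent"
label values rep78 repair_lbl

* The bars are now labelled with the names instead of the numbers
graph bar (mean) price, over(rep78) ytitle("Mean price (USD)")
```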

We can also create a horizontal bar plot by using the command `graph hbar` instead of `graph bar`.

In [None]:
graph hbar (mean) earnings, over(region)

graph export ./img/mybarchart2.svg, replace

![mybarchart2](./img/mybarchart2.svg)

We can also group our bars over another variable (or "category").

In [None]:
graph hbar (mean) earnings,  over(treated) over(region)

graph export ./img/mybarchart3.svg, replace

![mybarchart3](img/mybarchart3.svg)

<div class="alert alert-block alert-warning">
    
<b>Your turn:</b> What happens when you switch the order of categories in the code above? Try this in the following code cell. 
    
</div>

In [None]:
graph hbar (mean) earnings,  over() over()

graph export ./img/mybarchart4.svg, replace

<div class="alert alert-block alert-warning">
    
<b>Your turn:</b> Run the code cell below. Then, try switching the `over` and `by` variables and store it in `mybarchart5.svg` in the next code cell. 
    
</div>

In [None]:
graph hbar (mean) earnings,  over(treated) by(region)

graph export ./img/mybarchart5.svg, replace

## 9.2 Code Format
We can write our code in a single line as shown above. However, graph code can get very lengthy, so to keep things neat we will break up the code into multiple lines using `///` in the next few examples. 

```stata

twoway (scatter log_earnings year), ///
    xtitle("Year") ytitle("Log-earnings")

graph export ./img/myscatterplot2.svg, replace

```

## 9.3 Exporting Format

So far, we have exported our graphs in `.svg` format. We can also export graphs in other formats such as `.jpg`, `.png` and `.pdf`. This may be particularly helpful if you plan to use LaTeX for writing your paper, as `.svg` files cannot be included directly in LaTeX PDF output. 
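As a minimal sketch of exporting to other formats, the code below draws a graph from the built-in `auto` dataset (not `fake_data`) and saves it twice. The file name `example_scatter` is hypothetical. `graph export` picks the format from the file extension, and `width()` sets the pixel width of the `.png` file. Which formats are available can depend on your platform and Stata flavour, so treat the `.png` line as an assumption about your installation.

```stata
* Illustrative sketch using the built-in auto dataset (not fake_data)
sysuse auto, clear
capture mkdir img                      // create ./img if it does not exist yet

twoway (scatter mpg weight)

* The format is chosen from the file extension
graph export ./img/example_scatter.pdf, replace
graph export ./img/example_scatter.png, replace width(1600)
```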

## 9.4 Fine-tuning your graph further

To customise our graph further, we can use the tools in the Stata graph window or the graph options we have been using in this module. We can include and adjust the following: 

- title 
- axis titles
- legend 
- axis 
- scale
- labels 
- theme (i.e. colour, appearance)
- adding lines, text or objects 

While we won't cover each of these in this module, you can always go back to the Stata documentation to explore the options available to you based on your needs. 


In [None]:
help twoway options

Some of the adjustments we can do include:
- Add axis titles using the `ytitle("y_title")` and `xtitle("x_title")` options. 

In [None]:
capture restore   // restore the full data if a preserved copy exists; otherwise do nothing
preserve
collapse (mean) log_earnings, by(year)

twoway (scatter log_earnings year), xtitle("Year") ytitle("Log-earnings")

graph export ./img/myscatterplot2.svg, replace

![myscatterplot2](img/myscatterplot2.svg)

- Change the color of the graph by using the `color()` option.

In [None]:
restore, preserve   // return to the worker-level data but keep the preserved copy
histogram log_earnings, color(emidblue)

graph export ./img/myhistogram2.svg, replace

![myhistogram](img/myhistogram2.svg)

Run the code cell below to view the colorstyle options available in Stata

In [None]:
help colorstyle

- Add a labelled legend to our graphs. To include the legend we use the option `legend(label(number_of_label "label"))`.
- Add an indicator line marking the year treatment began, 2002. To include the indicator line we use the option `xline()`. The line can also have different characteristics. For example, we can change its color and pattern using the options `lcolor()` and `lpattern()`.

For example, we can use the line graph example above.

In [None]:
restore
preserve
collapse (mean) log_earnings, by(treated year)

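* Complete the command: add the plot type before each variable list and the treatment year inside xline()
twoway ( log_earnings year if treated) || ( log_earnings year if !treated), ///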
    xtitle("Year") ytitle("Log-earnings")                                  ///
    legend( label(1 "Treated") label(2 "Control"))                         /// 
    xline( /*treatment year*/, lcolor(cranberry) lpattern(dash_dot))

graph export ./img/multilineplot2.svg, replace

It should look something like this:
![multilineplot2](img/multilineplot2.svg)
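The same options can be combined on any `twoway` graph. The sketch below assumes the built-in `uslifeexp` dataset (not `fake_data`) and adds a title, axis titles, a labelled legend and a vertical reference line. The reference year 1950 is arbitrary and chosen only to show where `xline()` draws the line.

```stata
* Illustrative sketch using the built-in uslifeexp dataset (not fake_data)
sysuse uslifeexp, clear

twoway (line le_male year, sort lpattern(solid))     ///
       (line le_female year, sort lpattern(dash)),   ///
    title("Life expectancy at birth")                ///
    xtitle("Year") ytitle("Years")                   ///
    legend(order(1 "Males" 2 "Females"))             ///
    xline(1950, lcolor(cranberry) lpattern(dash_dot))
```

Options placed inside a plot's parentheses, such as `lpattern()`, apply to that series only. Options after the final comma apply to the whole graph.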

<div class="alert alert-block alert-info">
    
<b>Moment of Reflection:</b> Compare this graph (`multilineplot2`) with `mybarchart3` we generated earlier. Do both visualisations tell the same story? Does one capture the treatment effect better than the other? 
    
</div>

<div class="alert alert-block alert-warning">
    
<b>Your turn:</b> Generate a histogram of the age distribution in our dataset. Try customizing the bar colour and adding titles. Export the graph as `myhistogram3.svg`. 
    
</div>

In [None]:
histogram , color()

graph export , replace

## 9.5 Wrap up
We have learned in this module how to create different types of graphs using commands such as `twoway`, `histogram` and `graph bar`, and how to adjust them with the many options these commands offer. However, the most valuable lesson from this module is understanding when to use a specific type of graph. Graphs can only tell a story if we choose the right type of graph and the right options. 

Remember to check the Stata documentation when creating graphs. The documentation can be your best ally.

## Further reading

- [Make your data speak for itself! Less is more (and people don’t read)](https://towardsdatascience.com/data-visualization-best-practices-less-is-more-and-people-dont-read-ba41b8f29e7b)

## References 

- Timbers, T., Campbell, T., Lee, M. (2022). Data Science: A First Introduction. https://datasciencebook.ca/viz.html
- Schrimpf, Paul. "Data Visualization: Rules and Guidelines." In *QuantEcon DataScience*. Edited by Chase Coleman, Spencer Lyon, and Jesse Perla. https://datascience.quantecon.org/applications/visualization_rules.html
- Kopf, Dan. "A brief history of the scatter plot." *Quartz*. March 31, 2018. https://qz.com/1235712/the-origins-of-the-scatter-plot-data-visualizations-greatest-invention/