# DS102 Statistical Programming in R : Lesson Nine - Data Exploration

### Table of Contents <a class="anchor" id="DS102L9_toc"></a>

* [Table of Contents](#DS102L9_toc)
    * [Page 1 - Introduction](#DS102L9_page_1)
    * [Page 2 - Data Exploration Part I](#DS102L9_page_2)
    * [Page 3 - Data Exploration Part II](#DS102L9_page_3)
    * [Page 4 - Data Exploration Part III](#DS102L9_page_4)
    * [Page 5 - Key Terms](#DS102L9_page_5)
    * [Page 6 - Hands On](#DS102L9_page_6)
    * [Page 7 - The Terminal](#DS102L9_page_7)
    * [Page 8 - Command Line Interface (CLI)](#DS102L9_page_8)
    * [Page 9 - File Organization and Paths](#DS102L9_page_9)
    * [Page 10 - Showing Files and Changing Directories](#DS102L9_page_10)
    * [Page 11 - Learning the CLI](#DS102L9_page_11)
    * [Page 12 - CLI Activity](#DS102L9_page_12)
    * [Page 13 - Key Terms](#DS102L9_page_13)
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 -  Introduction <a class="anchor" id="DS102L9_page_1"></a>

[Back to Top](#DS102L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [13]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Data Exploration
VimeoVideo('247057983', width=720, height=480)

# Introduction

Now that you have all these newfound data skills in R, you will put them to use in a lesson on data exploration.  Often, data scientists are given specific questions that they need to answer, and that will guide their data analysis, but sometimes, you may be given a pile of data and are expected to make something out of it.  That's when data exploration comes in.  

By the end of the first part of this lesson, you will be able to: 

* Employ ```ggplot``` to make effective and informative exploration graphs
* Utilize data manipulation skills from ```dplyr``` to wrangle and explore data
* Compute summary statistics

This lesson will also introduce you to some of the concepts you'll need in later modules about working with your computer through the command line, or terminal (depending on your operating system).  

By the end of the second part of this lesson, you will understand: 

* How to use your terminal
* File organization, directories, and paths

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/427596329"> recorded live workshop </a> that goes over the material in this lesson regarding your terminal. </p>
    </div>
</div>


In [14]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Data Exploration
VimeoVideo('427596329', width=720, height=480)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Data Exploration Part I<a class="anchor" id="DS102L9_page_2"></a>

[Back to Top](#DS102L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [15]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Data Exploration
VimeoVideo('329835610', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L09-pg2tutorial.zip)**.

# Loading in and Examining the Data

The **[GapMinder website](https://www.gapminder.org/)** has a large collection of statistics about health and economic development for countries around the world. Its founder, Hans Rosling, has developed compelling ways to present statistical data. An R data set called ```gapminder``` is available, and you will use what you have learned about data manipulation and graphing to explore it.


The first order of business is to install the ```gapminder``` package and make it available to R. You do this with the following commands:

```{r}
install.packages("gapminder")
library(gapminder)
head(gapminder)
```

Here is the result: 

```text
# A tibble: 6 x 6
  country     continent  year lifeExp      pop gdpPercap
  <fctr>      <fctr>    <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952  28.801  8425333  779.4453
2 Afghanistan Asia       1957  30.332  9240934  820.8530
3 Afghanistan Asia       1962  31.997 10267083  853.1007
4 Afghanistan Asia       1967  34.020 11537966  836.1971
5 Afghanistan Asia       1972  36.088 13079460  739.9811
6 Afghanistan Asia       1977  38.438 14880372  786.1134
```

From the package documentation, you will find that it has the following variables:

* **```country```:** The country in question. This is a factor with 142 levels.
* **```continent```:** The continent in which the country resides. This is a factor with 5 levels.
* **```year```:** The year for the given data. The year values range from 1952 to 2007 in steps of 5 years.
* **```lifeExp```:** The life expectancy at birth.
* **```pop```:** The population of the country.
* **```gdpPercap```:** The per capita gross domestic product of the country; this is a measure of individual productivity and wealth.

You can see all of the countries in the data set if you just type in ```gapminder$country```.  You see that it is a factor.  A factor has levels, which are the different values that the factor can take. The ```levels()``` function shows you the levels of the factor ```country```:

```{r}
levels(gapminder$country)
```

And here is the beginning of the output:

```text
 [1] "Afghanistan" "Albania"     "Algeria"
 [4] "Angola"      "Argentina"   "Australia"
 [7] "Austria"     "Bahrain"     "Bangladesh"
[10] "Belgium"     "Benin"       "Bolivia"
```

The full output lists all 142 countries represented in the data set.

```year``` is a numerical variable, so it does not have levels. You can see what values the year variable takes by using the ```unique()``` function. This function will return a vector of all the unique values in a vector:

```{r}
unique(gapminder$year)
```

[1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007

So you see that the data set starts in 1952 and includes data in five-year increments until 2007.
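
As a small aside, the difference between ```levels()``` and ```unique()``` can be seen on a toy example. The vectors ```region``` and ```visits``` below are made up for illustration and are not part of the gapminder data; this is only a sketch of how the two functions behave.

```{r}
# Toy data, invented for this example (not from gapminder)
region <- factor(c("north", "south", "north", "east"))
visits <- c(3, 5, 3, 8)

levels(region)    # "east" "north" "south"  (distinct categories, sorted)
nlevels(region)   # 3                       (how many categories there are)
unique(visits)    # 3 5 8                   (distinct values, in order of first appearance)
```

Applied to the lesson's data, ```nlevels(gapminder$country)``` would be one way to count the countries rather than reading them off the printed list.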

---

# Graphing the Population

You can get a feel for how the population of the countries has changed over the years by creating a box plot for each year value.

```{r}
library(ggplot2)  # provides ggplot() and geom_boxplot()
ggplot(gapminder, aes(x = factor(year), y = pop)) + geom_boxplot()
```

This code yields the following graph:

![A box plot showing how the population of countries has changed over the years. The x axis is labeled factor open parentheses year close parentheses and runs from nineteen fifty two to two thousand seven in increments of five years. The y axis is labeled pop, for population, and runs from zero e plus zero zero to more than 1 e plus zero nine. There is a separate box plot for each year indicated.](Media/exploration2.png)

This shows that a few countries have a very high population, which squeezes the boxes toward the bottom of the plot and makes it quite difficult to see the population of the smaller countries. You can change the vertical scale to be logarithmic; this allows you to see the small values as well as the large ones:

```{r}
ggplot(gapminder, aes(x = factor(year), y = pop)) + geom_boxplot() +
scale_y_log10()
```

See how things look much more spread out along the y axis? 

![A box plot showing how the population of countries has changed over the years. The x axis is labeled factor open parentheses year close parentheses and runs from nineteen fifty two to two thousand seven in increments of five years. The y axis is labeled pop, for population, and runs from less than one e plus zero five to one e plus zero nine. There is a separate box plot for each year indicated.](Media/exploration1.png)

In the logarithmic scale, each grid line is ten times the one below it. 1e+06 is scientific notation for one million (1,000,000), so the grid line labeled 1e+06 represents 1 million people. The next grid line up, which is largely hidden by the boxes, is ten times one million, or 10 million (10,000,000). The next grid line up is ten times 10 million, or 100 million. And finally, the top line is ten times 100 million, which is 1 billion. You can see that around 1982 one country passed the 1 billion population mark, and around 2002 a second country passed this mark.
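
To make the factor-of-ten spacing concrete, here is a minimal sketch using made-up population values (the vector ```pops``` is hypothetical, chosen to sit exactly on the grid lines). Taking ```log10()``` of each value shows that every tenfold increase moves you up by one unit, which is why the grid lines are evenly spaced on the plot.

```{r}
# Hypothetical populations: 100 thousand up to 1 billion
pops <- c(1e5, 1e6, 1e7, 1e8, 1e9)

log_pops <- log10(pops)
log_pops         # 5 6 7 8 9
diff(log_pops)   # 1 1 1 1  (each step is one power of ten)
```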

---

# Filtering and Arranging by Population

You can find the largest countries in 2007 by first filtering the rows to get those for 2007 and then sorting these rows by the pop variable. This is done with the ```filter()``` and ```arrange()``` functions:

```{r}
library(dplyr)  # provides %>%, filter(), arrange(), and desc()

gm.big <- gapminder %>%
  filter(year == 2007) %>%
  arrange(desc(pop))
```

These commands first pass the ```gapminder``` data frame to the ```filter()``` function with the selection criterion ```year == 2007```; this finds all rows from 2007. These rows are then passed to the ```arrange()``` function, which rearranges the order of the rows. The ```desc(pop)``` argument tells the ```arrange()``` function to order the rows in descending order of population; descending means that the largest number is first. You can look at the resulting tibble as follows; the argument ```n = 10``` to ```head()``` indicates to print out the first 10 rows.

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>If you get an error when running this code, try loading the library dplyr! You'll need it to perform filtering and other operations.</p>
    </div>
</div>

```{r}
head(gm.big, n = 10)
```

Here is the result:

```text
# A tibble: 10 x 6
   country       continent  year lifeExp        pop gdpPercap
   <fctr>        <fctr>    <int>   <dbl>      <int>     <dbl>
 1 China         Asia       2007  72.961 1318683096  4959.115
 2 India         Asia       2007  64.698 1110396331  2452.210
 3 United States Americas   2007  78.242  301139947 42951.653
 4 Indonesia     Asia       2007  70.650  223547000  3540.652
 5 Brazil        Americas   2007  72.390  190010647  9065.801
 6 Pakistan      Asia       2007  65.483  169270617  2605.948
 7 Bangladesh    Asia       2007  64.062  150448339  1391.254
 8 Nigeria       Africa     2007  46.859  135031164  2013.977
 9 Japan         Asia       2007  82.603  127467972 31656.068
10 Mexico        Americas   2007  76.195  108700891 11977.575
```

The list of big countries is topped by China, then India, and then the United States. You can find the smallest countries by using the ```tail()``` function:

```{r}
tail(gm.big, n = 10)
```

This provides the bottom 10 rows of data in the tibble:

```text
# A tibble: 10 x 6
   country               continent  year lifeExp     pop  gdpPercap
   <fctr>                <fctr>    <int>   <dbl>   <int>      <dbl>
 1 Swaziland             Africa     2007  39.613 1133066  4513.4806
 2 Trinidad and Tobago   Americas   2007  69.819 1056608 18008.5092
 3 Reunion               Africa     2007  76.442  798094  7670.1226
 4 Comoros               Africa     2007  65.152  710960   986.1479
 5 Bahrain               Asia       2007  75.635  708573 29796.0483
 6 Montenegro            Europe     2007  74.543  684736  9253.8961
 7 Equatorial Guinea     Africa     2007  51.579  551201 12154.0897
 8 Djibouti              Africa     2007  54.791  496374  2082.4816
 9 Iceland               Europe     2007  81.757  301931 36180.7892
10 Sao Tome and Principe Africa     2007  65.528  199579  1598.4351
```

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Fun Fact!</h3>
    </div>
    <div class="panel-body">
        <p>Instead of printing the bottom of this dataset, you could also choose to arrange by population rather than population descending.  This would put the smallest countries first!</p>
    </div>
</div>
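
The idea in the panel above can be sketched as follows. The name ```gm.small``` is hypothetical, chosen here to mirror ```gm.big```; the only change from the earlier pipeline is dropping ```desc()``` so that the rows are sorted in ascending order of population.

```{r}
library(dplyr)
library(gapminder)

# Same filter as before, but sorted from smallest to largest population
gm.small <- gapminder %>%
  filter(year == 2007) %>%
  arrange(pop)

head(gm.small, n = 10)
```

The first row should be the same country that appeared last in the ```tail()``` output above.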

---

## Determining Outliers in Per Capita GDP

You can plot the per capita GDP for each year:

```{r}
ggplot(gapminder, aes(x = factor(year), y = gdpPercap)) + geom_boxplot()
```

This produces the following graph:

![A box plot showing how the per capita G D P of countries has changed over the years. The x axis is labeled factor open parentheses year close parentheses and runs from nineteen fifty two to two thousand seven in increments of five years. The y axis is labeled G D P per capita, and runs from zero to one hundred twenty thousand. There is a separate box plot for each year indicated.](Media/L09-GDPBoxplot.png)

On this plot, you see that there are outliers with a high per capita GDP between 1952 and 1977. You can find out which country or countries these outliers represent by first calculating what values would be considered outliers. Thinking back a few lessons, you'll remember that you need to find the interquartile range, multiply it by 1.5, and then add it to the third quartile boundary. For ```gdpPercap```, the first quartile is 1202.1 and the third quartile is 9325.5, so:

```text
9325.5 - 1202.1 = 8123.4
8123.4 x 1.5 = 12,185.1
12,185.1 + 9325.5 = 21,510.6
```

You can then find the outliers with the ```filter()``` function by selecting all rows with a ```gdpPercap``` value greater than 21510.6:

```{r}
filter(gapminder, gdpPercap > 21510.6)
```

This yields the following results:

```text
# A tibble: 143 × 6
   country   continent  year lifeExp      pop gdpPercap
 1 Australia Oceania    1987  76.320 16257249  21888.89
 2 Australia Oceania    1992  77.560 17481977  23424.77
 3 Australia Oceania    1997  78.830 18565243  26997.94
 4 Australia Oceania    2002  80.370 19546792  30687.75
 5 Australia Oceania    2007  81.235 20434176  34435.37
 6 Austria   Europe     1982  73.180  7574613  21597.08
 7 Austria   Europe     1987  74.940  7578903  23687.83
 8 Austria   Europe     1992  76.040  7914969  27042.02
 9 Austria   Europe     1997  77.510  8069876  29095.92
10 Austria   Europe     2002  78.980  8148312  32417.61

# ... with 133 more rows
```
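
The outlier boundary worked out by hand earlier can also be computed in R. The sketch below assumes, as the hand calculation does, that the quartiles are taken over all countries and years together; the variable names ```q1```, ```q3```, ```iqr```, and ```upper_fence``` are illustrative.

```{r}
library(gapminder)

# Quartiles of per capita GDP across the whole data set
q1 <- unname(quantile(gapminder$gdpPercap, 0.25))
q3 <- unname(quantile(gapminder$gdpPercap, 0.75))

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # values above this count as high outliers
upper_fence
```

The result should land close to the 21,510.6 calculated by hand, with any small difference coming from rounding the quartiles to one decimal place.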

---

## Graphing Life Expectancy

Finally, you can plot the life expectancy values for each year:

```{r}
ggplot(gapminder, aes(x = factor(year), y = lifeExp)) + geom_boxplot()
```

![A box plot showing how the life expectancy of countries has changed over the years. The x axis is labeled factor open parentheses year close parentheses and runs from nineteen fifty two to two thousand seven in increments of five years. The y axis is labeled life expectancy and runs from twenty to eighty five. There is a separate box plot for each year indicated.](Media/L09-LifeBoxPLot.png)

It appears that one country had a very low life expectancy in 1992; you can find this country as follows:

```{r}
filter(gapminder, lifeExp < 28)
```

This yields a one-row tibble:

```text
# A tibble: 1 x 6
  country continent  year lifeExp     pop gdpPercap
  <fctr>  <fctr>    <int>   <dbl>   <int>     <dbl>
1 Rwanda  Africa     1992  23.599 7290203  737.0686
```
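
The cutoff of 28 was read off the plot by eye. As a sketch of an alternative that does not depend on choosing a cutoff, you could ask directly for the row whose life expectancy equals the overall minimum; the name ```lowest``` is illustrative.

```{r}
library(dplyr)
library(gapminder)

# Keep only the row(s) where life expectancy equals the smallest value in the data
lowest <- filter(gapminder, lifeExp == min(lifeExp))
lowest
```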
---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [1]:
try:
    from DS_Students import MultipleChoice
    from ipynb.fs.full.DS102Questions import *
except:
    !pip install DS_Students
    from DS_Students import MultipleChoice
    from ipynb.fs.full.DS102Questions import *

In [2]:
try:
    display(L9P2Q1)
except:
    pass

<p style="text-align: center">
  <img src="Media/L09-LifeBoxPlot.png" alt="Drawing" style="width: 500px;"/>
</p>

In [3]:
try:
    display(L9P2Q2)
except:
    pass

<p style="text-align: center">
  <img src="Media/L09-LifeBoxPlot.png" alt="Drawing" style="width: 500px;"/>
</p>

In [4]:
try:
    display(L9P2Q3, L9P2Q4)
except:
    pass

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Data Exploration Part II<a class="anchor" id="DS102L9_page_3"></a>

[Back to Top](#DS102L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [16]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Data Exploration
VimeoVideo('329835657', width=720, height=480)

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L09-pg3tutorial.zip)**.

# Comparing Countries

---

## Focusing on Angola

You will continue your exploration of this data set by looking at the data for Angola. As you learned previously, you can get all rows in the ```gapminder``` data set that have data from Angola by using the condition ```country == "Angola"``` in the ```filter()``` function:

```{r}
gm_Angola <- filter(gapminder, country == "Angola")
head(gm_Angola)
```

This provides you with the following data:

```text
# A tibble: 6 x 6
  country continent  year lifeExp     pop gdpPercap
  <fctr>  <fctr>    <int>   <dbl>   <int>     <dbl>
1 Angola  Africa     1952  30.015 4232095  3520.610
2 Angola  Africa     1957  31.999 4561361  3827.940
3 Angola  Africa     1962  34.000 4826015  4269.277
4 Angola  Africa     1967  35.985 5247469  5522.776
5 Angola  Africa     1972  37.928 5894858  5473.288
6 Angola  Africa     1977  39.483 6162675  3008.647
```

As you can see, ```gm_Angola``` is now a tibble that has data only on Angola. You can use this tibble to plot life expectancy as a function of the year:

```{r}
ggplot(gm_Angola) + geom_line(aes(x = year, y = lifeExp)) +
ylab("Life Expectancy") + ggtitle("Life Expectancy in Angola")
```

This creates the plot below:

![A line graph titled life expectancy in Angola. The x axis is labeled year and runs from nineteen fifty to two thousand ten. The y axis is labeled life expectancy and runs from twenty eight to approximately forty three. The line rises at roughly forty five degrees from the bottom left, nineteen fifty, until approximately nineteen seventy seven, when it levels out until about nineteen eighty seven before slowly rising again.](Media/L09-AngolaLE.png)

You can create a similar plot of per capita GDP for Angola:

```{r}
ggplot(gm_Angola) + geom_line(aes(x = year, y = gdpPercap)) +
ylab("Per Capita GDP") + ggtitle("GDP in Angola")
```

![A line graph titled G D P in Angola. The x axis is labeled year and runs from nineteen fifty to two thousand ten. The y axis is labeled per capita G D P and runs from approximately two thousand to approximately five thousand seven hundred. The line starts just after nineteen fifty at three thousand five hundred, rises to approximately five thousand five hundred in approximately nineteen sixty seven, slightly decreases until approximately nineteen seventy two, sharply decreases until about nineteen eighty seven, then slightly increases and decreases to a low of less than two thousand five hundred until about nineteen ninety seven, and then rises up to approximately four thousand eight hundred by approximately two thousand seven.](Media/L09-AngolaGDP.png)

It looks like between the years of 1977 and 2002, the GDP dropped significantly and the life expectancy did not improve. This coincides with a prolonged civil war in Angola.

---

## Comparing Four African Countries

Next, you will do a comparative investigation into four African countries: Angola, Ghana, Ethiopia, and South Africa.

---

### Filtering to African Countries

The first thing to do is to create a data frame with the data from these four countries. You do this with the ```filter()``` function as follows:

```{r}
gm_Africa4 <- filter(gapminder,
country %in% c("Angola", "Ghana", "Ethiopia", "South Africa"))
```

In this case, the condition used for the ```filter()``` function is ```country %in% c("Angola", "Ghana", "Ethiopia", "South Africa")```; this condition will be true if the value of country is in the vector you specified with the names "Angola," "Ghana," "Ethiopia," and "South Africa." This creates a tibble with only those four countries. 
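
To see what ```%in%``` does on its own, here is a minimal sketch with a small made-up vector. The vector ```countries``` is hypothetical and stands in for the ```country``` column; ```wanted``` holds the four names from the lesson.

```{r}
# Hypothetical values standing in for the country column
countries <- c("Angola", "Ghana", "Kenya", "Ethiopia")
wanted <- c("Angola", "Ghana", "Ethiopia", "South Africa")

# One TRUE/FALSE per element of countries
in_wanted <- countries %in% wanted
in_wanted              # TRUE TRUE FALSE TRUE

# Using the result to keep only the matching elements
countries[in_wanted]   # "Angola" "Ghana" "Ethiopia"
```

```filter()``` uses the same kind of TRUE/FALSE vector, one value per row, to decide which rows to keep.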

---

### Eliminating Extraneous Variables

If you like, you can clean up the data frame ```gm_Africa4``` by eliminating the variables that you aren't using. This is not strictly necessary, but it may make it easier later on to work with this data.  If you decide you want to examine more data later, however, it will be a pain in the butt to go back. 

Nevertheless, here's how you'd get only the variables you're interested in at the moment: 

```{r}
gm_AfricaClean <- select(gm_Africa4, country, year, lifeExp, gdpPercap)
head(gm_AfricaClean)
```

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>Always give a new dataset a new name! This way it makes it easy to go backwards if you've made a mistake that's irreversible.</p>
    </div>
</div>

You will not be using the population variable in your computations. Neither will you use the continent variable, since you now have data only from Africa. The ```select()``` function tells R which columns (variables) to keep. Here is the resulting tibble:

```text
# A tibble: 6 x 4
  country  year lifeExp gdpPercap
  <fctr>  <int>   <dbl>     <dbl>
1 Angola   1952  30.015  3520.610
2 Angola   1957  31.999  3827.940
3 Angola   1962  34.000  4269.277
4 Angola   1967  35.985  5522.776
5 Angola   1972  37.928  5473.288
6 Angola   1977  39.483  3008.647
```
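
As a sketch of an alternative, ```select()``` also accepts a minus sign in front of a column name to drop that column instead of listing the ones to keep. The block below recreates ```gm_Africa4``` so that it runs on its own; the name ```gm_AfricaDropped``` is hypothetical.

```{r}
library(dplyr)
library(gapminder)

# Recreate the four-country data frame from earlier in the lesson
gm_Africa4 <- filter(gapminder,
                     country %in% c("Angola", "Ghana", "Ethiopia", "South Africa"))

# Drop the two unused columns rather than naming the four to keep
gm_AfricaDropped <- select(gm_Africa4, -continent, -pop)
names(gm_AfricaDropped)   # "country" "year" "lifeExp" "gdpPercap"
```

Both approaches give the same four columns; which one reads better usually depends on whether you are keeping or discarding more variables.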

--- 

### Combining Filtering and Selecting

If you're feeling sassy, you can combine the filtering and selecting operations into one command using the ```%>%``` operator:

```{r}
gm_AfricaClean <- gapminder %>%
filter(country %in% c("Angola", "Ghana", "Ethiopia", "South Africa")) %>%
select(country, year, lifeExp, gdpPercap)
```

In layman's terms, this code translates to: take the ```gapminder``` data frame and ```filter()``` it to get only the rows for Angola, Ghana, Ethiopia, and South Africa; then, from the resulting data frame, select the ```country```, ```year```, ```lifeExp```, and ```gdpPercap``` variables.
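
One way to convince yourself of this reading is to compare a piped command with the same steps written as nested function calls. The sketch below uses a small made-up data frame, ```toy```, so that the result is easy to inspect; the names ```piped``` and ```nested``` are illustrative.

```{r}
library(dplyr)

# Toy data, invented for this example
toy <- data.frame(country = c("A", "B", "C"),
                  year = c(2007, 2007, 1952),
                  pop = c(10, 20, 30))

# Piped form: read left to right, top to bottom
piped <- toy %>%
  filter(year == 2007) %>%
  select(country, pop)

# Nested form: the same steps, read from the inside out
nested <- select(filter(toy, year == 2007), country, pop)

identical(piped, nested)
```

The pipe simply passes the result on its left as the first argument of the function on its right, so the two forms describe the same computation.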

---

### Plotting Life Expectancy

You can create a plot that shows the life expectancy in each of these four countries as a function of time using this command:

```{r}
ggplot(gm_AfricaClean) + geom_line(aes(x = year, y = lifeExp, color = country)) +
ylab("Life Expectancy") + ggtitle("Life Expectancy in Four Countries")
```

![A line graph titled life expectancy in four countries. Four lines are plotted, one for each country, Angola, Ethiopia, Ghana, and South Africa. The x axis is labeled year and runs from nineteen fifty to two thousand ten. The y axis is labeled life expectancy and runs from thirty to approximately sixty three. The lines for Ghana, Angola, and Ethiopia rise steadily. The line for South Africa rises over time until about nineteen ninety two, when it drops.](Media/L09-FourLE.png)

Note that the argument ```color = country``` in the ```aes()``` function assigns a different color to each country; that country's line is plotted in that color. You can create a plot of per capita GDP in a similar way:

```{r}
ggplot(gm_AfricaClean) + geom_line(aes(x = year, y = gdpPercap, color = country)) +
ylab("Per Capita GDP") + ggtitle("GDP in Four Countries")
```

![A line graph titled G D P in Four Countries. Four lines are plotted, one for each country, Angola, Ethiopia, Ghana, and South Africa. The x axis is labeled year and runs from nineteen fifty to two thousand ten. The y axis is labeled per capita G D P and runs from zero to almost ten thousand. The lines for Ghana and Ethiopia are low, both below one thousand two hundred fifty G D P per capita, and remain relatively flat over time. The line for Angola begins at a higher G D P, at about three thousand five hundred fifty, and rises for about fifteen years before dropping until about nineteen ninety seven, at which point it rises sharply to nearly five thousand per capital G D P. The line for South Africa begins in nineteen fifty at about five thousand per capita G D P and rises to over eight thousand five hundred in about nineteen eighty two. It drops until about nineteen ninety two, then slightly rises until about two thousand and two, and then rises sharply](Media/L09-FourGDP.png)

---

### Using grid.arrange()

When exploring the data, it may be helpful to have both plots together. You can do this as follows with the ```grid.arrange()``` function, which is part of the ```gridExtra``` package. To do this, first install and load ```gridExtra```:

```{r}
install.packages("gridExtra")
library("gridExtra")
```

Then create the two plots and store them in the variables ```life_exp``` and ```GDP```, then arrange them as follows:

```{r}
life_exp <- ggplot(gm_AfricaClean) + geom_line(aes(x = year, y = lifeExp, color = country)) +
  ylab("Life Expectancy")
GDP <- ggplot(gm_AfricaClean) + geom_line(aes(x = year, y = gdpPercap, color = country)) +
  ylab("Per Capita GDP")
grid.arrange(life_exp, GDP, ncol = 1)
```

You call ```grid.arrange()``` with arguments ```life_exp``` and ```GDP```, which are the plots you want to put together. Then add on the argument ```ncol = 1``` to indicate that you want the plots to be in one column, so they are stacked vertically. Simply changing the order of the plots changes which one goes on top. You should get the following plot:

![Two line graphs stacked. On the top, life expectancy over time in four countries, Angola, Ethiopia, Ghana, and South Africa. On the bottom, per capita G D P over time in the same four countries.](Media/L09-FourLEGDP.png)

You can see from this plot that the life expectancy and the per capita GDP declined in South Africa beginning in the early 1990s. This is most likely a consequence of the devastating HIV/AIDS epidemic in sub-Saharan Africa. You will also note that while per capita GDP in Ethiopia and Ghana did not increase dramatically over time, the life expectancy in these two countries did increase dramatically.

---

### Adding Additional Variables to the Plot

You might try an alternate approach to representing both life expectancy and per capita GDP on a single graph. You can plot the life expectancy both as lines and as points, and make the size of the points reflect the per capita GDP. You can do this as follows:

```{r}
ggplot(gm_AfricaClean, aes(x = year, y = lifeExp, color = country)) +
geom_line() + geom_point(aes(size = gdpPercap)) +
ylab("Life Expectancy") + ggtitle("Life Expectancy and GDP in Four Countries")
```

This yields the following graph:

![A line graph that uses both lines and points to show the life expectancy and G D P per capita over time in four countries, Angola, Ethiopia, Ghana, and South Africa.](Media/L09-FourLEGDP2.png)

You'll note that it is much harder with this plot to see the changes in GDP.

---

### Plotting GDP and Life Expectancy Against Each Other

You might try another approach. You can start by plotting the per capita GDP on the horizontal axis and the life expectancy on the vertical axis as follows:

```{r}
ggplot(gm_AfricaClean, aes(x = gdpPercap, y = lifeExp, color = country)) +
geom_point()
```

![A scatter plot that shows data for per capita G D P and life expectancy in four countries, Angola, Ethiopia, Ghana, and South Africa. Per capita G D P is plotted on the horizontal axis and life expectancy is plotted on the vertical axis. The dots are color coded to show which country they represent.](Media/L09-Rosling1.png)

This creates an interesting plot that gives some indication of the changes in GDP and life expectancy over time. 

---

### Connecting Points with geom_line() vs. geom_path()

But it is difficult to see the evolution of these values over time. You might connect the series of points for each country with a line:


```{r}
ggplot(gm_AfricaClean, aes(x = gdpPercap, y = lifeExp, color = country)) +
geom_point() + geom_line()
```

![A scatter plot that shows data for per capita G D P and life expectancy in four countries, Angola, Ethiopia, Ghana, and South Africa. Per capita G D P is plotted on the horizontal axis and life expectancy is plotted on the vertical axis. The dots are color coded to show which country they represent. The dots for each country are connected by a line that matches the color of the dots. Each line connects each point with the next point that has the closest horizontal distance, even though the points do not appear in the data frame in that order](Media/L09-Rosling2.png)

This does not help you see the evolution of the values over time. That is because ```geom_line()``` connects the points in order of their horizontal (x) value, here per capita GDP, rather than in the order in which they appear in the data frame. You can fix this by using ```geom_path()``` instead of ```geom_line()```; ```geom_path()``` draws lines between adjacent rows in the data frame, which for each country are in order by year. Make the change like this:

```{r}
ggplot(gm_AfricaClean, aes(x = gdpPercap, y = lifeExp, color = country)) +
geom_point() + geom_path()
```

This produces the following graphic:

![A scatter plot that shows data for per capita G D P and life expectancy in four countries, Angola, Ethiopia, Ghana, and South Africa. Per capita G D P is plotted on the horizontal axis and life expectancy is plotted on the vertical axis. The dots are color coded to show which country they represent. The dots for each country are connected by a line that matches the color of the dots. The lines are between adjacent values in the data frame.](Media/L09-Rosling3.png)

This helps you see that life expectancy has generally increased (except in South Africa); but it is still hard to clearly see the progression of time. 
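
The difference between ```geom_line()``` and ```geom_path()``` is easiest to see on a tiny made-up data set where the x values are not in increasing order. Everything in the sketch below (```toy```, ```step```, ```p_line```, ```p_path```) is hypothetical and only meant to illustrate the ordering.

```{r}
library(ggplot2)

# Four points recorded in the order step 1, 2, 3, 4; x goes 1, 3, 2, 4
toy <- data.frame(step = 1:4,
                  x = c(1, 3, 2, 4),
                  y = c(1, 2, 3, 4))

toy$x                 # 1 3 2 4  (row order, the order geom_path() follows)
toy$x[order(toy$x)]   # 1 2 3 4  (x order, the order geom_line() follows)

p_line <- ggplot(toy, aes(x = x, y = y)) + geom_point() + geom_line()
p_path <- ggplot(toy, aes(x = x, y = y)) + geom_point() + geom_path()
```

Printing ```p_line``` and ```p_path``` side by side shows the path doubling back between the second and third points, while the line always moves left to right.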

---

### Making Points Transparent

You can convey a sense of the progression of time by changing the transparency of each point, with earlier points being more transparent than later points. The transparency of a point is controlled by the ```alpha=``` argument; you can map ```alpha``` to ```year``` as follows:

```{r}
ggplot(gm_AfricaClean, aes(x = gdpPercap, y = lifeExp, color = country)) +
geom_point(aes(alpha = year)) + geom_path()
```

![A scatter plot that shows data for per capita G D P and life expectancy in four countries, Angola, Ethiopia, Ghana, and South Africa. Per capita G D P is plotted on the horizontal axis and life expectancy is plotted on the vertical axis. The dots are color coded to show which country they represent. The dots for each country are connected by a line that matches the color of the dots. The lines are between adjacent values in the data frame. For each line, earlier points are more transparent than later points.](Media/L09-Rosling4.png)

This looks like progress. It is now easier to see the progression of time, and the legend relating the year to the transparency of the point provides a solid reference for time. However, the early points are hard to see because they are almost completely transparent; also, the points are small enough that it is hard to see the changes in transparency.

You can change the range of transparency values by adding a ```scale_alpha()``` term to the plot; its ```range``` argument gives the lowest and highest alpha values to be used, where a higher alpha means a more opaque point. You will use a range from 0.3 to 1.0. You will also make the points larger with the ```size=``` argument:

```{r}
ggplot(gm_AfricaClean, aes(x = gdpPercap, y = lifeExp, color = country)) +
geom_point(aes(alpha = year), size = 3) + geom_path() +
scale_alpha(range = c(0.3, 1.0))
```

This gives the following:

![A scatter plot that shows data for per capita G D P and life expectancy in four countries, Angola, Ethiopia, Ghana, and South Africa. Per capita G D P is plotted on the horizontal axis and life expectancy is plotted on the vertical axis. The dots are color coded to show which country they represent. The dots for each country are connected by a line that matches the color of the dots. The lines are between adjacent values in the data frame. For each line, earlier points are more transparent than later points. The size of the dots have been increased for added visibility.](Media/L09-Rosling5.png)
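
To get a feel for what the range of 0.3 to 1.0 means, the sketch below rescales the twelve year values onto that range by hand. It assumes the years are mapped to alpha linearly; it is an illustration of the idea, not ggplot2's internal code, and the names ```years```, ```scaled```, and ```alpha``` are made up for the example.

```{r}
years <- seq(1952, 2007, by = 5)   # the twelve years in the data set
alpha_range <- c(0.3, 1.0)         # lowest and highest alpha, as in scale_alpha()

# Put the years on a 0-to-1 scale, then stretch that onto the alpha range
scaled <- (years - min(years)) / (max(years) - min(years))
alpha <- alpha_range[1] + scaled * diff(alpha_range)

round(alpha, 2)
```

The earliest year ends up at 0.3 and the latest at 1.0, so even the 1952 points remain visible.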

---

### Clearly Labeling the Graph

Finally, make the labels on the horizontal and vertical axes meaningful in regular English: 

```{r}
ggplot(gm_AfricaClean, aes(x = gdpPercap, y = lifeExp, color = country)) +
geom_point(aes(alpha = year), size = 3) + geom_path() +
scale_alpha(range = c(0.3, 1.0)) +
xlab("Per capita GDP") + ylab("Life Expectancy")
```

You get the following:

![A scatter plot that shows data for per capita G D P and life expectancy in four countries, Angola, Ethiopia, Ghana, and South Africa. Per capita G D P is plotted on the horizontal axis and life expectancy is plotted on the vertical axis. The dots are color coded to show which country they represent. The dots for each country are connected by a line that matches the color of the dots. The lines are between adjacent values in the data frame. For each line, earlier points are more transparent than later points. The size of the dots have been increased for added visibility. The labels on the axes have been changed for clarity, with the horizontal axis reading per capita G D P and the vertical axis reading life expectancy.](Media/L09-Rosling6.png)

This is now quite an informative plot that shows the different paths that different African countries have taken. In both Ethiopia and Ghana, life expectancy has dramatically increased while GDP has moderately increased over time. In Angola, the increases in life expectancy came to an abrupt halt and GDP decreased dramatically, probably as a consequence of the civil war. In South Africa, the increases in life expectancy have later been reversed because of the HIV/AIDS crisis, and there have been economic impacts as well.

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>Interpreting your plot like the above is one of the most important things you can do as a data scientist! Just producing a visual is not enough; you need to point out what you want others to notice.</p>
    </div>
</div>

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [5]:
try:
    display(L9P3Q1)
except:
    pass


<p style="text-align: center">
  <img src="Media/L09-ExerciseBig3.png" alt="Drawing" style="width: 500px;"/>
</p>

In [6]:
try:
    display(L9P3Q2)
except:
    pass


<p style="text-align: center">
  <img src="Media/L09-ExerciseBig3Points.png" alt="Drawing" style="width: 500px;"/>
</p>

In [7]:
try:
    display(L9P3Q3)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Data Exploration Part III<a class="anchor" id="DS102L9_page_4"></a>

[Back to Top](#DS102L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [17]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Data Exploration
VimeoVideo('329835657', width=720, height=480)

```c-lms
topic: Statistical Summary
video-id: Data Exploration Part III
video-url-mp4: https://player.vimeo.com/external/329835741.hd.mp4?s=b15fcbe9c4e4cd4eb58a8603f71889e5c69a0b71&profile_id=175
video-url-mp4-1080: https://player.vimeo.com/external/329835741.hd.mp4?s=b15fcbe9c4e4cd4eb58a8603f71889e5c69a0b71&profile_id=175
video-url-mp4-720: https://player.vimeo.com/external/329835741.hd.mp4?s=b15fcbe9c4e4cd4eb58a8603f71889e5c69a0b71&profile_id=174
video-url-mp4-540: https://player.vimeo.com/external/329835741.sd.mp4?s=05c08741b791c0a7fab8c653d9a7bce2118be465&profile_id=165
video-url-mp4-360: https://player.vimeo.com/external/329835741.sd.mp4?s=05c08741b791c0a7fab8c653d9a7bce2118be465&profile_id=164
```

The transcript for the above topic tutorial video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO102-L09-pg4tutorial.zip)**.

# A Statistical Summary

The process of data exploration will often raise as many questions as it answers. In the case of your exploration of Angola, Ghana, Ethiopia, and South Africa, you may become curious about how these countries compare to the rest of the countries in Africa. Are they typical? If not, how are they different?

You might ask how the life expectancy and per capita GDP in these four countries compare to the median values for all African countries. You can extract and graph the median values.

Use the following command to create a tibble that has the medians of life expectancy and per capita GDP for all of the countries in Africa:

```{r}
gm_medians <- gapminder %>%
filter(continent == "Africa") %>%
group_by(year) %>%
summarise(life_med = median(lifeExp), gdp_med = median(gdpPercap))
```

This command starts with the whole ```gapminder``` data frame and uses ```filter()``` to keep only the rows where the ```continent``` variable is ```Africa```. It then groups the data by ```year```; the group for a given year contains every row with that year. Finally, ```summarise()``` computes the median of ```lifeExp``` within each group and stores it in the ```life_med``` variable, and likewise computes the median of ```gdpPercap``` and stores it in the ```gdp_med``` variable.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Note the British spelling of "summarise." It is not a typo; dplyr also accepts the American spelling, summarize(), which does exactly the same thing.</p>
    </div>
</div>


The result is the tibble below. Note that each row corresponds to one year and has the median values of life expectancy and GDP over all African countries for that year.

```{r}
gm_medians
```

```text
# A tibble: 12 x 3
    year life_med   gdp_med
   <int>    <dbl>     <dbl>
 1  1952  38.8330  987.0256
 2  1957  40.5925 1024.0230
 3  1962  42.6305 1133.7837
 4  1967  44.6985 1210.3764
 5  1972  47.0315 1443.3725
 6  1977  49.2725 1399.6388
 7  1982  50.7560 1323.7283
 8  1987  51.6395 1219.5856
 9  1992  52.4290 1161.6314
10  1997  52.7590 1179.8831
11  2002  51.2355 1215.6832
12  2007  52.9265 1452.2671
```

---

## Summary

Data exploration is one of those often-called-for but infrequently taught skills that you will use as a data scientist. Packages such as ```ggplot2``` and ```dplyr``` are of great assistance in the data exploration endeavor, and you'll be able to find anomalies in the data that you can then explore or explain further. 

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [8]:
try:
    display(L9P4Q1, L9P4Q2)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Key Terms<a class="anchor" id="DS102L9_page_5"></a>

[Back to Top](#DS102L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


```c-lms
topic: Key Terms
```

# Key R Code 

Below is a list and short description of the important R code learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

---


<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>color=</td>
        <td>An aesthetic argument, set inside aes() in ggplot() or in a geom such as geom_line(), that adds an additional variable to the plot by color.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>grid.arrange()</td>
        <td>Allows you to combine more than one ggplot() together. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>ncol=</td>
        <td>An argument for grid.arrange() that lets you specify how many columns you should make with your plots.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>size=</td>
        <td>An argument to geom_point() that sets the dot size. Used inside aes(), it can instead add an additional variable that is plotted by dot size.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>geom_path()</td>
        <td>A function that draws a line between adjacent values in a data frame. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>alpha=</td>
        <td>An argument to the geom_point() aes() that allows you to set an additional variable on the graph by transparency level. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>scale_alpha()</td>
        <td>Allows you to set a range for transparency. </td>
    </tr>
</table>

---

# Key R Packages

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>gridExtra</td>
        <td>Allows you to put multiple graphs in one. </td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Hands On<a class="anchor" id="DS102L9_page_6"></a>

[Back to Top](#DS102L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [18]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Data Exploration
VimeoVideo('420538888', width=720, height=480)

```c-lms
activity-type: project
activity-name: Lesson 9 Hands-On
points: 12
due-at: 81%
close-at: end-of-module
```

For this Hands-On, you will compare five countries of your choice. You can see the countries in the ```gapminder``` data frame with the below command:

`levels(gapminder$country)`

This Hands-On **will** be graded, so be sure you complete all requirements. Please complete this Hands-On in an R script file, and submit it along with your presentation file below when completed.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---
## Requirements

Choose five countries from the resulting list after running the above command. Then use them to answer each of these questions:

* Which country of the five you chose has the lowest per capita GDP in 1952? In 2007? 
* Which has the highest per capita GDP in 1952? In 2007? 
* Create a line plot with year on the horizontal axis and ```lifeExp``` on the vertical axis for the five countries; give each country a different color line. Describe the variations in life expectancy between the countries.
* On the entire gapminder data frame, compute the median of lifeExp for each year. For what years is the life expectancy for your five countries above the median life expectancy for the entire gapminder data frame? 

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch <a href="https://vimeo.com/420538888">this recorded live workshop</a> before beginning the Hands-On, which goes over a similar example.</p>
    </div>
</div>

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>You will need to load the dplyr, ggplot2, and gapminder libraries to complete this Hands-On!</p>
    </div>
</div>


Create a presentation (MS PowerPoint or equivalent) to report out on your findings. Be sure to include any R code you used to get your findings and create your graphs. Add any additional insight you believe is warranted.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit all your files when finished!</p>
    </div>
</div>

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>To zip your file on <b>Windows</b>, right click on the file and select "Send to", then select "Compressed (zipped) folder". For <b>Mac</b> users, right click on the file and select "Compress", then select your file from the options.</p>
    </div>
</div>



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - The Terminal<a class="anchor" id="DS102L9_page_7"></a>

[Back to Top](#DS102L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [19]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Data Exploration
VimeoVideo('322292263', width=720, height=480)

# The Terminal

Most of the time people will interact with their computers using a graphical user interface (or GUI). This kind of interface is generally a very nice way to use a computer. Learning how to use a mouse or touchscreen is a reasonably low cost to pay to interact with a large number of different applications. It is not a surprise that GUIs are the dominant way people use computers today. However, as great as GUIs are for most use cases, there are several kinds of interactions where they feel somewhat clunky.

Consider the following directory (also called a folder):

![A computer window showing various files, a folder, the date each was created, and the size of each.](Media/directory1.png)

Now imagine you wanted to delete specific files, say all the `.jpg` files. You would probably click on each ```.jpg``` file you see and press delete. Maybe you would sort by file type, and select all the ```.jpg``` files at once and then delete them. Either way, this is generally a more intensive operation than it would be using a `Command Line Interface`.

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Command Line Interface (CLI) <a class="anchor" id="DS102L9_page_8"></a>

[Back to Top](#DS102L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [20]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Command Line Interface
VimeoVideo('293882367', width=720, height=480)

# Command Line Interface (CLI)

Here's what the command line interface looks like:

![A bash window on a mac that shows the command line interface. The first line shows the last login attend. The second line shows the user has entered C D forward slash. The third line shows the user has entered L S, which will list files and directories present in the directory in which the user currently is.](Media/cli.png)

If a GUI makes it easy to navigate a large number of applications using a small set of inputs (moving the mouse, swiping, clicking, right-clicking, tapping, etc.), the CLI is the opposite. Every program has a different interface when using the CLI, but those interfaces are not limited to a small set of interactions; they can be anything typed on your keyboard.

---

## Deleting a Windows File

With a CLI, deleting all the `.jpg` files in a directory on **Windows** takes a single command:

```bash
del *.jpg
```

`del` means to delete all files that match the pattern `*.jpg`.  So, anything that ends with `.jpg`, no matter what comes first, will be deleted.

---

## Deleting a Mac or Linux File

On **MacOS** or **Linux** the command would be:

```bash
rm *.jpg
```

`rm` stands for remove.

---

## Commands

Each command on a CLI often has more than one part. The required part is the name of the command to run. There are commands to do a vast number of tasks, from showing you the files in a directory to copying, renaming, moving, deleting, zipping, unzipping, and transferring them. Each of these commands is a tiny program that does a specialized task.

![Example of a C L I command and parameter. The command is D E L. The parameter shown is asterisk dot J P G.](Media/cli-command1.png)

The parameter(s) are arbitrary and determined by the command. Many commands support a pattern that matches one or more files in a directory. In this case, the parameter specifies a pattern that matches any file whose name ends in `.jpg`, whatever comes before it.

![Visualization of each part of a C L I command with parameters. D E L means delete all files that match the following pattern. Asterisk means starting with any name. Dot J P G means ending in dot J P G.](Media/cli-command2.png)

In the example directory above, this would only match one file, `embarrassing_christmas_photo.jpg`. Another way to match this file would be to list the name explicitly.

```bash
del embarrassing_christmas_photo.jpg
```

Or, if you wanted to match only the `.jpg` files that are named a particular way you could use this:

```bash
del embarrassing*.jpg
```

to match any file that starts with `embarrassing` and ends with `.jpg`. You could even use:

```bash
del *christmas*.jpg
```

to delete any file that has `christmas` as part of the name and ends in `.jpg`.
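To try these patterns without risking real files, here is a minimal sketch for a MacOS/Linux terminal (so it uses `rm` rather than `del`). It assumes a throwaway directory created with `mktemp -d`, and every file name other than `embarrassing_christmas_photo.jpg` is made up for illustration. The `echo` command is used here only to preview what each pattern matches before anything is deleted.

```bash
# Work in a throwaway directory so nothing real is deleted.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Hypothetical files for this illustration (touch creates empty files).
touch embarrassing_christmas_photo.jpg christmas_tree.jpg beach.jpg notes.txt

# Preview which names each pattern expands to.
echo embarrassing*.jpg
echo *christmas*.jpg

# Delete every file that has "christmas" in its name and ends in .jpg.
rm *christmas*.jpg

# Record and show what is left: beach.jpg and notes.txt.
remaining=$(ls)
echo "$remaining"
```

Previewing a pattern before handing it to `rm` is a cautious habit, since files deleted from the terminal do not go to a recycle bin.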

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [9]:
try:
    display(L9P8Q1)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - File Organization and Paths<a class="anchor" id="DS102L9_page_9"></a>

[Back to Top](#DS102L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [21]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Command Line Interface
VimeoVideo('294067838', width=720, height=480)

# Terminal Differences

There are two major families of terminals: the MacOS/Linux versions and the Windows version. Both MacOS and Linux come from an operating system family known as UNIX, while Windows was developed independently. As a result, a large number of commands are the same between MacOS and Linux, while Windows uses different commands. The structure of a command followed by its parameters is the same everywhere, but the particulars of various common commands are different.

---

## File Organization

Before you can talk about using the terminal, it is important to understand how files are organized on a computer. You may have some familiarity with this structure if you've used the file browser application on your operating system to copy, move, rename, or delete files.

![A finder window in a mac showing how folders and files are organized on a computer.](Media/file-browser.png)

The organization of files and directories (aka folders) is a tree-like structure. There is a root starting point that holds some number of directories which, in turn, contain files and other directories.

![A tree like graphic organizer with four levels. The top level is the root level. Below the root level is a level for directories. Below a directory are files and subdirectories. Below a subdirectory are files.](Media/directory-tree.png)

On Windows, the root is a named "disk;" this is what the `C drive` is referencing. The ```C drive``` is the root of a tree of files and folders stored on your hard drive. On Windows, you can have multiple roots that are each identified by a different letter. On MacOS and Linux there is only ever one root which is `/` instead of a letter.

---

## Paths

When describing the location of a file or directory, a *path* is used. A path lists, in order, the directories that must be traversed to reach the file, each separated by a symbol. So if you had a file on your desktop named `notable_quotations_of_13th_century_poets.txt` and your username is `prosescholar`, the full path to the file would be:

**Windows:**

```bash
C:\Users\prosescholar\Desktop\notable_quotations_of_13th_century_poets.txt
```

**MacOS** (most Linux systems use `/home` in place of `/Users`):

```bash
/Users/prosescholar/Desktop/notable_quotations_of_13th_century_poets.txt
```

You may have noticed that the path separator symbol is different between Windows (`\`) and MacOS/Linux (`/`). This is a remnant of their different heritages and a constant source of confusion when switching between operating systems.

Each section of the path describes a directory that is located inside the previous directory. The above example might be visualized as:

![A visualization file path. Each section of the path describes a directory that is located inside the previous directory. Root directory to users directory to prose scholar directory to desktop directory to a file named notable quotations of thirteenth century poets dot T X T.](Media/path-to-file.png)
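The nesting described above can be sketched in a MacOS/Linux terminal by peeling the path apart one level at a time. The `basename` and `dirname` commands are not covered in this lesson; they are standard utilities that work on the text of a path, so the file does not need to exist for this to run.

```bash
file_path="/Users/prosescholar/Desktop/notable_quotations_of_13th_century_poets.txt"

# basename keeps only the last section of the path: the file itself.
file_name=$(basename "$file_path")

# dirname drops the last section, leaving the directory that holds the file.
containing_dir=$(dirname "$file_path")

# Applying dirname again moves one more level toward the root.
user_dir=$(dirname "$containing_dir")

echo "$file_name"
echo "$containing_dir"
echo "$user_dir"
```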

The forward slash separator used by MacOS/Linux is the norm nearly everywhere outside Windows. If you look at a web page URL, you will see the forward slash (`/`) in use there as well. Paths work in much the same way on websites, so this concept is not restricted to files on your hard drive.

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [10]:
try:
    display(L9P9Q1, L9P9Q2)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - Showing Files and Changing Directories<a class="anchor" id="DS102L9_page_10"></a>

[Back to Top](#DS102L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [22]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Showing Files and Changing Directories
VimeoVideo('294071245', width=720, height=480)

# Showing Files and Changing Directories

---

## Opening the CLI

Each operating system has a different way of opening its command line interface. On **Windows** this is called the `Command Prompt`; on **MacOS/Linux** it's called the `Terminal`. This lesson uses the more general term `terminal` for the CLI on whatever operating system you are using.

**Windows:**

1. Click on the Windows Button
2. In the search box, type `cmd`
3. Press enter

**MacOS:**

1. Click on the magnifying glass in the upper right corner of the screen
2. In the box type `terminal`
3. Press enter

**Ubuntu Linux:**

1. Click on the Ubuntu button
2. In the search box, type `terminal`
3. Press enter

---

## Prompt

When you open a `terminal` you will see a prompt which is where you can type commands.

**Windows Prompt:**

```bash
C:\Users\dkoontz\Desktop>
```

![A visualization of a terminal prompt on windows. The drive letter is C colon. The path to the current directory is users backslash D Koontz backslash desktop. The end of prompt is greater than symbol.](Media/windows-prompt.png)

**Mac/Linux Prompt:**

```bash
Work-Laptop:Desktop dkoontz$
```

![A visualization of a terminal prompt on mac or linux. The name of the computer is work dash laptop colon. The current directory is desktop. The current user is D Koontz. The role is dollar sign.](Media/unix-prompt.png)

As you can see, the prompts vary in the information they present. Windows came from a non-networked, single-user world, whereas the MacOS/Linux terminal was borrowed from the world of UNIX, which primarily existed as a shared system supporting multiple simultaneous users. Those users worked across business or university networks, with different permission levels between them.

One of the essential features of the prompt is to tell you which location on your hard drive the terminal considers the `current directory`. When you use commands such as `del` or `rm`, the command will run in the current directory unless you specify a different location.

---

## Showing Files

When you open your command prompt, it will default to your `home` directory. This is usually the directory located at `C:\Users\<your user name>` on **Windows**, `/Users/<username>` on **MacOS**, or `/home/<username>` on **Linux**. On MacOS/Linux systems this directory also has a special shorthand name, `~`.

To list the files in your home directory, you can use the following command:

![The windows command to show files is D I R. The mac O S or linux command to show files is L S.](media/homeDirectory.png)

This will show a listing of files, and depending on your operating system, it will display additional information such as size and last modified date. There are a lot of options for most commands such as `dir`/`ls` that will allow you to specify what kind of information you want. A common variant on **Windows** is `dir /w`, which will show just the file names, and on **MacOS/Linux** `ls -al`, which will show additional information beyond just file names.
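As a minimal sketch of the MacOS/Linux variant, the following compares plain `ls` with `ls -al` in a throwaway directory. The directory comes from `mktemp -d` and both file names are hypothetical, chosen so that one of them starts with a dot.

```bash
# Set up a scratch directory with one ordinary file and one dot-file.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch report.txt .hidden_settings

# Plain ls: names only, and names starting with a dot are skipped.
plain=$(ls)
echo "$plain"

# -a includes dot-files; -l adds permissions, owner, size, and modified date.
detailed=$(ls -al)
echo "$detailed"
```

The two flags are independent, so `ls -a` or `ls -l` can also be used separately when only one kind of extra detail is wanted.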

---

## Changing Directories

Generally, the files that are of interest are not in your home directory. They tend to be in sub-directories such as Documents or Pictures. To use `dir`/`ls` in your Documents folder, the `terminal` must first change the current directory. The command for this, `cd`, is the same on both types of terminals.

```bash
cd Documents
```

The `cd` command instructs the terminal to switch the current directory to Documents. If you are in a different directory or do not have a ```Documents``` folder, you will see an error such as `No such file or directory` on **MacOS/Linux** or `The system cannot find the path specified.` on **Windows**.

Once you have run the `cd Documents` command, you will see your prompt update to reflect that your current directory has changed to ```Documents```.

In the same way that you can traverse into a sub-directory, you can also move up towards the root of the file system. Two special names represent the current directory and the parent directory:

`.` is the current directory.

`..` is the parent of the current directory.
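Here is a small sketch of both names in a MacOS/Linux terminal. It assumes a scratch directory from `mktemp -d` containing a hypothetical `Documents` folder, and it uses `pwd` (print working directory, not covered in this lesson) to show where the terminal currently is.

```bash
# Build a scratch location with a Documents folder inside it.
start_dir=$(mktemp -d)
mkdir "$start_dir/Documents"

cd "$start_dir/Documents"
pwd                      # the path now ends in /Documents

# "." is the current directory, so this changes nothing.
cd .
same_dir=$(pwd)
echo "$same_dir"

# ".." is the parent, so this moves one level up towards the root.
cd ..
parent_dir=$(pwd)
echo "$parent_dir"
```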

---

## Relative versus Absolute Directories

Remember, a command runs in the current directory unless you specify somewhere else. In effect, the `terminal` treats any location you leave out, and any path that does not start at the root, as if it began with (`.\`) or (`./`). This is why the current directory is used for commands such as `dir`/`ls`.

`dir .` is the same as `dir` and `ls .` is the same as `ls`.

So what about (`..`) then? You could use it as the location to change directories to using `cd ..`, which would move the current directory back up to its parent. So if the current directory was `/Users/prosescholar/Documents/awesome_poetry` and you used `cd ..`, the resulting current directory would be `/Users/prosescholar/Documents`.

It is also possible to specify a location when using commands such as `dir`/`ls`. For example, if the current directory is `/Users/prosescholar/Documents` you could use `ls awesome_poetry` to list the files in `/Users/prosescholar/Documents/awesome_poetry`.

Now examine how this works.

First, the relative path `awesome_poetry` is treated as though it had a `./` in front of it, which gives `ls ./awesome_poetry`. Next, the `./` is expanded into what it stands for, the current directory. This results in `ls /Users/prosescholar/Documents/awesome_poetry`.

A path that starts with either `/` on **MacOS/Linux** or a drive letter such as `C:\` on **Windows** is known as an `absolute` path. You could run `ls /Users/prosescholar/Documents/awesome_poetry` from any directory and get the same result. The result of `ls awesome_poetry` would vary based on the current directory; in many directories it would fail because there is no `awesome_poetry` directory there. Paths that are not absolute are called `relative`, as the result of using them depends on what the current directory is.
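The difference can be sketched in a MacOS/Linux terminal as follows. This is an illustration under stated assumptions: the directories live under a scratch location from `mktemp -d` rather than a real home directory, and `sonnet.txt` is a made-up file name.

```bash
# Recreate Documents/awesome_poetry under a scratch location.
base_dir=$(mktemp -d)
mkdir -p "$base_dir/Documents/awesome_poetry"
touch "$base_dir/Documents/awesome_poetry/sonnet.txt"

# From Documents, the relative and absolute paths name the same directory.
cd "$base_dir/Documents"
relative_listing=$(ls awesome_poetry)
absolute_listing=$(ls "$base_dir/Documents/awesome_poetry")

# From one level up, the relative path no longer resolves...
cd "$base_dir"
if ls awesome_poetry > /dev/null 2>&1; then
    relative_from_elsewhere="found"
else
    relative_from_elsewhere="not found"
fi

# ...but the absolute path gives the same result as before.
absolute_from_elsewhere=$(ls "$base_dir/Documents/awesome_poetry")

echo "relative, from Documents: $relative_listing"
echo "relative, from elsewhere: $relative_from_elsewhere"
echo "absolute, from elsewhere: $absolute_from_elsewhere"
```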

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [11]:
try:
    display(L9P10Q1, L9P10Q2, L9P10Q3)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 11 - Learning the CLI<a class="anchor" id="DS102L9_page_11"></a>

[Back to Top](#DS102L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Learning the CLI

The topic of using a terminal is broad, and learning it will be an ongoing process throughout your career. Most of the commands you will use on an everyday basis are a bit more involved than the few below, but these are a great way to get a feel for how the terminal works. You are encouraged to experiment with the following commands.


<table class="table table-striped">
    <tr>
        <th>Windows Command</th>
        <th>MacOS/Linux Command</th>
        <th>Description</th>
    </tr>
    <tr>
        <td>dir</td>
        <td>ls</td>
        <td>Lists the files in a directory (by default the current directory)</td>
    </tr>
    <tr>
        <td>cd</td>
        <td>cd</td>
        <td>Changes the directory where commands are run</td>
    </tr>
    <tr>
        <td>move</td>
        <td>mv</td>
        <td>Moves a file to a new location (possibly with a new name)</td>
    </tr>
    <tr>
        <td>copy</td>
        <td>cp</td>
        <td>Copies a file</td>
    </tr>
    <tr>
        <td>del</td>
        <td>rm</td>
        <td>Deletes a file</td>
    </tr>
    <tr>
        <td>cd ..</td>
        <td>cd ..</td>
        <td>Moves the current directory up to its parent directory</td>
    </tr>
    <tr>
        <td>md</td>
        <td>mkdir</td>
        <td>Creates a directory</td>
    </tr>
</table>
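As a rough sketch of how the MacOS/Linux column of this table fits together, the following runs each command once in a scratch directory. The directory comes from `mktemp -d`, the file and folder names are hypothetical, and `touch` is used only to create an empty file to work with.

```bash
# Start in a throwaway directory.
work_dir=$(mktemp -d)
cd "$work_dir"

mkdir drafts                          # create a directory
touch poem.txt                        # create an empty file to practice on
cp poem.txt poem_backup.txt           # copy a file
mv poem.txt drafts/first_poem.txt     # move a file, giving it a new name
rm poem_backup.txt                    # delete a file

cd drafts                             # change the current directory
drafts_listing=$(ls)                  # list the files here
cd ..                                 # move back up to the parent

final_listing=$(ls)
echo "$drafts_listing"
echo "$final_listing"
```

On Windows the same sequence would use the commands from the first column of the table instead.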

---

## Review
Below is a quiz to review the recently covered material. Quizzes are _not_ graded.

In [12]:
try:
    display(L9P11Q1)
except:
    pass


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 12 - CLI Activity <a class="anchor" id="DS102L9_page_12"></a>

[Back to Top](#DS102L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Activity

Follow the instructions below to practice your newly learned skills!

1. Create a new directory named `fast`.
2. Switch the current directory to `fast`.
3. Create a directory named `furious` within the `fast` directory.
4. Switch the current directory to `furious`.
5. Create a new empty file named `furiosa.txt`.
   1. On **MacOS/Linux**, you can use the `touch your_file.txt` command to create a new empty file.
   2. On **Windows**, you can use the `type nul > your_file.txt` command to create a new empty file.
6. Switch back to the `fast` directory.

When complete, your directory structure should look like:

```text
fast
    furious
        furiosa.txt
```


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 13 - Key Terms <a class="anchor" id="DS102L9_page_13"></a>

[Back to Top](#DS102L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

```c-lms
topic: Command Line Key Terms
```

# Key Terms

Below is a list and short description of the central keywords you have learned in this lesson. Please read through and go back and review any concepts you don't understand fully. Great Work!


<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>CLI</td>
        <td>Command Line Interface, a text-based interface for giving instructions to an operating system.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Directory</td>
        <td>A location on your hard drive that can contain files and other directories.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Folder</td>
        <td>Another name for directory.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Command</td>
        <td>An action for the CLI to carry out.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Parameter</td>
        <td>Additional information to modify the behavior of a command.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Terminal</td>
        <td>Another name for a CLI.</td>
    </tr>
</table>

