#### Muhammad Rady Irawan

* Bioinformatics compilation https://github.com/r4dbot

# BGGN213 UCSD 
## Data visualization with ggplot2 - Week 4 
Dr. Barry Grant
<br>http://thegrantlab.org
<br>course source: https://bioboot.github.io/bggn213_W21/ 

### 1. Overview

The ability to make clear and compelling data visualizations is a vital skill for scientists. The difference between good and bad figures can be the difference between a highly influential or an obscure paper, a grant or contract won or lost, a job interview gone well or poorly. In short, if you want to be successful in any technical field you will need to master these skills!

The goals for this hands-on session are for you to:

* Understand major scientific plot types and when you might want to use them.
* Learn the fundamentals of the ggplot2 package and what it can do for you.

### 2. Background
One of the biggest attractions of the R programming language is the ability to have complete programmatic control over the plotting of complex graphs and figures. R offers a ridiculously large set of tools and packages for data visualization. The core R language already provides a rich set of plotting functions and plot types. These plotting functions require users to specify how to plot each element on the canvas step by step. These “base R plots” offer complete control over virtually every pixel. However, they can be fiddly and time-consuming to get just the way you want. By contrast, the ggplot2 package allows the specification of all plots through a set of common plotting layers that minimally include:

* Specifying the input data (in the form of a data.frame),
* How the data maps to aesthetic features of the plot (the x and y axis, color, plotting character, line type, etc.), and
* The geometric layers used in the plot (such as points, bars, lines etc.)

Data visualization with ggplot always involves these steps. Once you have mastered this sequence of steps we will layer on additional customizations and you will see that beautiful and sophisticated plots come within your reach very quickly. Let’s get cracking!

### 3. Getting Organized
#### Creating a Project
We will begin by getting organized. This entails you opening up RStudio and creating a new RStudio Project, then creating a new R script for storing your work and notes for this session.

>**Side-note**: If you are already familiar with RMarkdown format documents, feel free to use one of these rather than an R script. If you have not yet heard of these, don’t worry: we will be building towards them in our next class.

Begin by opening RStudio and creating a new **Project**: <font style="background-color:lightgrey;">File > New Project > New Directory > New Project</font>. Make sure you are working in the directory (a.k.a. folder!) where you want to keep all your work for this class organized. For example, for me this is a directory on my Desktop with the class name (see animated figure below). We will create our project as a *subdirectory* called <font style="background-color:lightgrey;">class05</font> in this location.

>#### Side note: 

>If you have not already, please watch this week’s intro videos, which cover [why we want to visualize data graphically](https://youtu.be/R_b7g5sGzwY), [what makes an effective figure](https://youtu.be/0WPeOxrboug), and an [introduction to ggplot](https://youtu.be/qtfJa8muH9E). You may also wish to (re)visit our previous [introduction to R](https://youtu.be/Asm2PHOZAcE) and RStudio as well as our video on [major R data structures, data types, and using functions](http://youtu.be/3LOTxeQEHSM).

[video link](https://www.youtube.com/watch?v=qtfJa8muH9E)

<font color =red><label for="data visualization">Q. For which phases is data visualization important in our scientific workflows?:</label>
<select id="data visualization" name="grade" form="Question_data">
  <option value=""></option>
  <option value="0">Communication of Results</option>
  <option value="0">Exploratory Data Analysis (EDA)</option>
  <option value="0">Detection of outliers</option>
  <option value="0">All of the above</option>
</select><br>
<label for="Boolean">Q. True or False? The ggplot2 package comes already installed with R?</label>
<select id="Boolean" name="check" form="Question">
  <option value=""></option>
  <option value="1">True</option>
  <option value="0">False</option>
</select>
</font>


#### More details:

Data visualization is important for all phases of our scientific workflows from exploratory data analysis (EDA), quality control and the detection of outliers, through formal analysis and the communication of results.

The ggplot2 package does not come pre-installed with R. Before you use it for the first time you will need to install it with the <font style="background-color:lightgrey;"> install.packages("ggplot2") </font> command in your R console panel of RStudio.

We will see in next week’s class (and throughout the rest of the course) that lots of useful functionality comes in the form of add-on **R packages**.
You can think of *“base R”* like your smartphone’s OS when you first take it out of the box and *“R packages”* like apps that you can optionally install to allow you to do more cool things.

>**Side-note**: The key step here is to name your project after this class session (i.e. “class05”) and make sure it is a sub-directory of wherever you are organizing all your work for this course.

#### Create an R script
Finally, open a new R script: <font style="background-color:lightgrey;"> File > New File > R Script </font> and save as <font style="background-color:lightgrey;"> class05.R</font>. This is nothing more than a text file where we can write and save our R code. The big advantage is that we will have a record of our work and thus be able to reproduce and automate our analysis later. We will also turn this into a fancy HTML and PDF report for sharing with others (more on this later…)

### 4. Common Plot Types
There are many plot types available that can help you understand different features and relationships in your data.

During the exploratory data analysis phase we typically want to detect the most obvious patterns by looking at each variable in isolation or by detecting relationships between variables. The plot type to use is also determined by the data type of the input variables, such as **continuous numeric** or **discrete categorical**.



#### Scatter Plots
Scatter plots are used to visualize the relationship between two numeric variables. The position of each point represents the value of the variables on the x and y-axis.

#### Line Graphs
Line graphs are used to visualize the trajectory of one numeric variable against another, with consecutive points connected by lines. They are well suited to values that change continuously, such as temperature over time.

#### Bar Charts and Histograms
Bar charts visualize numeric values grouped by categories. Each category is represented by one bar with a height defined by each numeric value. Histograms are specific bar charts that aim to summarize the number of occurrences of numeric values over a set of value ranges (a.k.a. *bins*). They are typically used to examine the distribution of numeric values.

#### Others
Other frequently used plot types in the biosciences include:

* **Box plots**: Show summary distributional information of numeric values grouped in categories as boxes. Great to quickly compare multiple distributions.
* **Violin plots**: Like box plots, but show the shape of each distribution as a mirrored density curve (the “violin”).
* **Heat Maps**: Show interactions of variables - typically correlations - as rastered image highlighting areas of high interaction.
* **Network Graphs**: Show connections (as lines - a.k.a. edges) between nodes (a.k.a. vertices, typically shown as circles).
* **Density plots**: A popular alternative to histograms.
* **Dendrograms**: Also known as cluster trees, these display the results of a clustering analysis (we will have a separate class on these).

<font color =red><label for="data visualization">Q. Which plot types are typically NOT used to compare distributions of numeric variables?</label>
<select id="visualization" name="type" form="Question_visualization">
  <option value=""></option>
  <option value="0">Density plots</option>
  <option value="0">Network graphs</option>
  <option value="0">Histogram</option>
  <option value="0">Violin plots</option>
  <option value="0">Box plots</option>
</select><br>
<label for="gg2plot">Q. Which statement about data visualization with ggplot2 is incorrect?</label>
<select id="gg2plot" name="gg2_type" form="Question_gg2">
  <option value=""></option>
  <option value="1">ggplot2 facilitates the creation of good looking graphs quickly</option>
  <option value="0">ggplot2 is the only way to create plots in R</option>
  <option value="0">ggplot2 enables users to specify plots in a declarative way</option>
  <option value="0">ggplot2 requires users to specify the plotting commands in a step-by-step fashion</option>
</select>
</font>


#### More details:

There are multiple plot types that can be used to examine and compare numeric data distributions. We will examine only a handful of the most common types in this session.

For a detailed discussion and comparison of these approaches see Chapters 7-9 of Claus Wilke’s excellent book [Fundamentals of Data Visualization](https://serialmentor.com/dataviz/boxplots-violins.html).

Due to the importance of data visualization R offers a very large set of tools and packages in this area. Indeed, the core R language itself provides a rich set of plotting functions and plot types. Personally, I love “base R plots” but I have been using them for many, many years and acknowledge that they are not the easiest for newcomers to pick up quickly. The ggplot2 package is a popular alternative but is only one of many. Particularly noteworthy alternatives include interactive packages like [plotly](https://plotly.com/r/), [rgl](https://cran.r-project.org/web/packages/rgl/vignettes/WebGL.html), and various [javascript based R graphics packages](https://www.htmlwidgets.org/showcase_rbokeh.html).

### 5. Creating Scatter Plots
In this section we will focus on:

* Defining a dataset for your plot using the main ggplot() function. 
* Specifying how your data maps to plot aesthetics with the aes() function. 
* Adding geometric layers using the geom_point() function. 
* Combining the above function calls with + operator to make your plot.

#### Introduction to scatter plots
Scatter plots use points to visualize the relationship between two numeric variables. The position of each point represents the value of the variables on the x- and y-axis. Let’s see an example of a scatter plot to understand the relationship between the speed and the stopping distance of cars:

Each point in this plot represents a car. Each car starts to brake at the speed given on the bottom x-axis and travels the distance shown on the side y-axis before coming to a full stop. Looking at all the points in the plot, we can clearly see that faster cars need a longer distance to come to a complete stop.

>Scatter plots like this one allow us to visualize the relationship between two numeric variables (in this case speed and stopping distance). The core of this plot is built with only three short lines of code that involve calling three ggplot2 functions as we will see in a moment (ggplot(), aes() and geom_point()).

We will use these same three functions to produce all sorts of scatter plots like the following plot that shows gene expression changes upon treating a particular cell line with a new anti-viral drug:



In this plot points represent individual genes. The red points are for genes that are up-regulated when treated with the drug (i.e. have more expression) and blue points are for down-regulated genes. Gray points are for genes with a non-significant difference in expression values when we look across replicate experiments. We will learn much more about these types of datasets in an upcoming class.

Our focus for this session is to learn how to produce a plot like this. The steps (and R functions) used are the same as those for the previous plot (namely ggplot(), aes(), and geom_point()). Our next sections will focus on each of these in turn. Once we understand these core functions we will be able to make small changes and additions to make all sorts of cool plots.

#### Specifying a dataset with ggplot()
To create plots with ggplot2 you first need to load the package using library(ggplot2).

After the package has been loaded specify the dataset to be used as an argument of the main ggplot() function. For example, we specify a plot using the in-built cars dataset. You should type (not paste) this code into your R console to see the effect:
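A minimal sketch of this step, assuming ggplot2 is already installed. The object name `p` is our own choice for illustration; typing `ggplot(cars)` directly at the console has the same effect.

```r
library(ggplot2)

# Define the dataset for the plot; no aesthetics or geoms are added yet
p <- ggplot(cars)

# Display the plot (just an empty canvas at this stage)
p
```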

**N.B.** This command does not yet plot anything but a blank gray canvas. The ggplot() function alone just defines the dataset for the plot and creates an empty base on top of which we will add additional layers to build up our plot.

>**Side-note**: If you get an error message at this stage then you likely need to install the ggplot2 package first with install.packages("ggplot2").

#### Specifying aesthetic mappings with aes()
The ggplot2 package uses the concept of aesthetics, which map variables (i.e. columns) from your dataset to the visual features of the plot. The most common aesthetics include x and y that determine the x- and y-axis coordinates of the points to plot (we will see others further below). The aesthetics are mapped with a call to the aes() function.

We will use the columns labeled speed and dist (stopping distance) from the cars dataset to set the x and y aesthetics of our plot. Critically, we combine our call to the aes() function with our previous specification of the input dataset with the ggplot(cars) function call from above.
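As a sketch of this combination (again storing the result in an object we have chosen to call `p`), the dataset and the aesthetic mapping can be joined like this:

```r
library(ggplot2)

# Map the speed column to the x-axis and the dist column to the y-axis
p <- ggplot(cars) +
  aes(x = speed, y = dist)

# Axes and grid now reflect the data ranges, but no points are drawn yet
p
```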

**N.B.** Note how we combine the ggplot(cars) function call and the aes(x=speed, y=dist) function call with the + operator. We will keep adding more layers to our plot in this same way as we will see in a moment.

As we can see above we don’t have any points in our plot yet. Adding them is the job of the geom_point() function discussed next.

#### Specifying a geom layer with <span style="font-weight:normal"><font style="background-color:lightgrey;">geom_point()</font></span>
Next we need to add one of ggplot’s geometric layers (or **geoms**) to define how we want to visualize our dataset. At this early stage we can think of geoms as defining the plot type we actually want to produce. For example, <font style="background-color:lightgrey;">geom_line()</font> produces a line plot, <font style="background-color:lightgrey;">geom_bar()</font> produces a bar plot, <font style="background-color:lightgrey;">geom_boxplot()</font> a box plot (more on this later).

Here, we use <font style="background-color:lightgrey;">geom_point()</font> to add - you guessed it - points! Let’s take our code snippet from above and add a <font style="background-color:lightgrey;">geom_point()</font> function call:
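A minimal sketch of the full three-function recipe on the cars dataset. The object name `p` is our own choice.

```r
library(ggplot2)

# Dataset + aesthetic mapping + point geometry
p <- ggplot(cars) +
  aes(x = speed, y = dist) +
  geom_point()

# One point is drawn per row (car) in the dataset
p
```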

**N.B.** Note again that we use the <font style="background-color:lightgrey;">+</font> operator to connect the <font style="background-color:lightgrey;">ggplot(cars)</font> with <font style="background-color:lightgrey;">aes(x=speed, y=dist)</font> and now the <font style="background-color:lightgrey;">geom_point()</font> lines. Through this linking, ggplot knows that the mapped speed and dist variables are taken from the cars dataset. Finally, the <font style="background-color:lightgrey;">geom_point()</font> line instructs ggplot to plot the mapped variables as points.

>**Side-note**: The required steps to create a scatter plot with ggplot can be summarized as follows:

>* Load the package ggplot2 using library(ggplot2).
>* Specify the dataset to be plotted using ggplot().
>* Use the + operator to add layers to the plot.
>* Map variables from the dataset to aesthetic properties through the aes() function.
>* Add a geometric layer to define the shapes to be plotted. In the case of scatter plots, use geom_point().

<font color =red><label for="data visualization"><b>Q.</b> Which geometric layer should be used to create scatter plots in ggplot2?</label>
<select id="visualization" name="type" form="Question_visualization">
  <option value=""></option>
  <option value="0">geom_line()</option>
  <option value="0">geom_bar()</option>
  <option value="0">geom_boxplot()</option>
  <option value="0">geom_point()</option>
</select></font>
    
<font color =red><b>Q.</b> In your own RStudio can you add a trend line layer to help show the relationship between the plot variables with the <font style="background-color:lightgrey;">geom_smooth()</font> function?</font>
    
<font color =red><b>Q.</b> Can you finish this plot by adding various label annotations with the <font style="background-color:lightgrey;">labs()</font> function and changing the plot look to a more conservative “black & white” theme by adding the <font style="background-color:lightgrey;">theme_bw()</font> function:</font>
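One possible way to approach these two questions is sketched below. The title and axis label text are our own wording (the units come from the documentation of the built-in cars dataset), and the straight-line fit via `method = "lm"` is just one of the smoothing options that `geom_smooth()` accepts.

```r
library(ggplot2)

p <- ggplot(cars) +
  aes(x = speed, y = dist) +
  geom_point() +
  # Add a linear trend line without the shaded standard-error band
  geom_smooth(method = "lm", se = FALSE) +
  # Annotate the plot; the label text here is illustrative
  labs(title = "Speed and stopping distances of cars",
       x = "Speed (mph)",
       y = "Stopping distance (ft)",
       caption = "Dataset: 'cars'") +
  # Switch to the black & white theme
  theme_bw()

p
```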


#### Adding more plot aesthetics through <span style="font-weight:normal"><font style="background-color:lightgrey;">aes()</font></span>
In their most basic form scatter plots can only visualize datasets in two dimensions through the <font style="background-color:lightgrey;">x</font> and <font style="background-color:lightgrey;">y</font> aesthetics passed to the <font style="background-color:lightgrey;">geom_point()</font> layer. However, most data sets have more than two variables and thus might require additional plotting dimensions. Ggplot makes it very easy to map additional variables to different plotting aesthetics like *size*, *transparency alpha* and *color*.

Here we will cover how to:

* Adjust the point size of a scatter plot using the <font style="background-color:lightgrey;">size</font> parameter.
* Change the point color of a scatter plot using the <font style="background-color:lightgrey;">color</font> parameter.
* Set a parameter <font style="background-color:lightgrey;">alpha</font> to change the transparency of all points.

We will also try to stress an important point that is often confusing for newcomers to ggplot: How to differentiate between *aesthetic mappings* (plot features you want mapped to variables in your data) and *constant parameters* (specifications of plot features you want to remain the same or otherwise come from elsewhere - i.e. not from your data).
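The distinction can be sketched with the cars dataset. In the first plot the color is an aesthetic mapping (it comes from a data column inside `aes()`); in the second, color, size and alpha are constant parameters passed directly to `geom_point()`. The object names and the specific values are our own choices.

```r
library(ggplot2)

# Aesthetic mapping: point color varies with the speed column
p_mapped <- ggplot(cars) +
  aes(x = speed, y = dist, color = speed) +
  geom_point()

# Constant parameters: every point gets the same color, size and transparency
p_constant <- ggplot(cars) +
  aes(x = speed, y = dist) +
  geom_point(color = "blue", size = 3, alpha = 0.5)

p_mapped
p_constant
```

A common slip is writing `aes(color = "blue")`, which treats the string as data to be mapped rather than as a fixed color.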

Let’s turn for a moment to a more relevant example data set. The code below reads the results of a differential expression analysis where a new anti-viral drug is being tested.
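The location of the class results file is not given in this document, so the sketch below writes a small made-up stand-in to a temporary file and reads it back with `read.delim()`. The `Gene` column name, the `"down"` and `"unchanging"` labels, and all values are assumptions for illustration; only `Condition1`, `Condition2`, `State` and the `"up"` label are named in the text.

```r
# Hypothetical stand-in for the class results file (made-up values)
toy <- data.frame(
  Gene       = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  Condition1 = c(4.5, 3.9, 4.6, 2.1, 6.0),
  Condition2 = c(4.4, 6.2, 4.7, 0.4, 6.1),
  State      = c("unchanging", "up", "unchanging", "down", "unchanging")
)
path <- tempfile(fileext = ".txt")
write.table(toy, path, sep = "\t", quote = FALSE, row.names = FALSE)

# read.delim() reads tab-delimited text with a header row;
# a URL can be given in place of a local file path
genes <- read.delim(path)
head(genes)
```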

<font color =red><b>Q.</b> Use the <font style="background-color:lightgrey;">nrow()</font> function to find out how many genes are in this dataset. What is your answer?</font>

<font color =red><b>Q.</b> Use the <font style="background-color:lightgrey;">colnames()</font> function and the <font style="background-color:lightgrey;">ncol()</font> function on the <font style="background-color:lightgrey;">genes</font> data frame to find out what the column names are (we will need these later) and how many columns there are. How many columns did you find?</font>

<font color =red><b>Q.</b> Use the <font style="background-color:lightgrey;">table()</font> function on the <font style="background-color:lightgrey;">State</font> column of this data.frame to find out how many ‘up’ regulated genes there are. What is your answer? </font>

<font color =red><b>Q.</b> Using your values above and 2 significant figures, what fraction of total genes is up-regulated in this dataset?</font>
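The functions named in these questions can be tried out on a toy data frame first. The sketch below uses a made-up five-gene stand-in for `genes` (the `Gene` column and all values are assumptions), so the numbers it produces are not the answers for the real dataset.

```r
# Made-up stand-in for the genes data frame
genes <- data.frame(
  Gene       = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  Condition1 = c(4.5, 3.9, 4.6, 2.1, 6.0),
  Condition2 = c(4.4, 6.2, 4.7, 0.4, 6.1),
  State      = c("unchanging", "up", "unchanging", "down", "unchanging")
)

nrow(genes)       # number of genes (rows)
colnames(genes)   # column names
ncol(genes)       # number of columns

# Count how many genes fall into each State category
state_counts <- table(genes$State)
state_counts

# Fraction of genes that are up-regulated, to 2 significant figures
frac_up <- signif(state_counts[["up"]] / nrow(genes), 2)
frac_up
```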

We can make a first basic scatter plot of this dataset by following the same recipe we have already seen, namely:

* Pass the <font style="background-color:lightgrey;">genes</font> data.frame as input to the <font style="background-color:lightgrey;">ggplot()</font> function.
* Then use the <font style="background-color:lightgrey;">aes()</font> function to set the <font style="background-color:lightgrey;">x</font> and <font style="background-color:lightgrey;">y</font> aesthetic mappings to the <font style="background-color:lightgrey;">Condition1</font> and <font style="background-color:lightgrey;">Condition2</font> columns.
* Finally add a <font style="background-color:lightgrey;">geom_point()</font> layer to add points to the plot.
* Don’t forget to add layers step-wise with the <font style="background-color:lightgrey;">+</font> operator at the end of each line.

<font color =red><b>Q.</b> Complete the code below to produce the following plot</font>
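A sketch of the recipe above, run on a made-up stand-in for `genes` (the `Gene` column and all values are assumptions) so that it can be executed on its own. With the real dataset only the data frame changes.

```r
library(ggplot2)

# Made-up stand-in for the genes data frame
genes <- data.frame(
  Gene       = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  Condition1 = c(4.5, 3.9, 4.6, 2.1, 6.0),
  Condition2 = c(4.4, 6.2, 4.7, 0.4, 6.1),
  State      = c("unchanging", "up", "unchanging", "down", "unchanging")
)

# Expression in Condition1 on x, Condition2 on y, one point per gene
p <- ggplot(genes) +
  aes(x = Condition1, y = Condition2) +
  geom_point()
p
```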

There is extra information in this dataset, namely the <font style="background-color:lightgrey;">State</font> column, which tells us whether the difference in expression values between conditions is statistically significant. Let’s map this column to point color:

I am not a big fan of these default colors so let’s change them up by adding another layer to explicitly specify our color scale. Note how we saved our previous plot as the object <font style="background-color:lightgrey;">p</font> and can use it now to add more layers:
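The two steps just described (mapping `State` to color, then overriding the default colors) might look like the sketch below. It uses a made-up stand-in for `genes`; the `"down"` and `"unchanging"` labels, the object name `p2`, and the named color vector are our assumptions, chosen to match the red/blue/gray scheme described earlier.

```r
library(ggplot2)

# Made-up stand-in for the genes data frame
genes <- data.frame(
  Gene       = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  Condition1 = c(4.5, 3.9, 4.6, 2.1, 6.0),
  Condition2 = c(4.4, 6.2, 4.7, 0.4, 6.1),
  State      = c("unchanging", "up", "unchanging", "down", "unchanging")
)

# Step 1: map State to point color and save the plot as p
p <- ggplot(genes) +
  aes(x = Condition1, y = Condition2, color = State) +
  geom_point()

# Step 2: add a manual color scale on top of the saved plot
p2 <- p +
  scale_colour_manual(values = c(down = "blue", unchanging = "gray", up = "red"))
p2
```

Naming the elements of `values` ties each color to a category explicitly instead of relying on alphabetical level order.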

<font color =red><b>Q.</b> Nice, now add some plot annotations to the <font style="background-color:lightgrey;">p</font> object with the <font style="background-color:lightgrey;">labs()</font> function so your plot looks like the following:</font>

### 6. OPTIONAL: Going Further
The following sections are considered optional extensions for motivated students.

The **gapminder** dataset contains economic and demographic data about various countries since 1952. This dataset features in your DataCamp course for this week and you can find out more about the portion we are using [here](https://github.com/jennybc/gapminder).

The data itself is available as either a tab-delimited file online, or via the <font style="background-color:lightgrey;">gapminder</font> package. You can use whichever method you are more comfortable with to obtain the dataset. I show both below:

Or read the TSV file from online:

This dataset covers many years and many countries. Before we make some plots we will use some **dplyr** code to focus in on a single year. You can install the **dplyr** package with the command <font style="background-color:lightgrey;">install.packages("dplyr")</font>. We will learn more about **dplyr** in next week’s class. For now, feel free to just copy and paste the following line that takes the <font style="background-color:lightgrey;">gapminder</font> data frame and filters it to contain only the rows with a <font style="background-color:lightgrey;">year</font> value of 2007.
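A sketch of this filtering step, assuming dplyr is installed. So that it runs on its own, it uses a tiny made-up stand-in for `gapminder` with hypothetical countries and values; with the real data only the first assignment changes.

```r
library(dplyr)

# Made-up stand-in for the gapminder data frame (hypothetical values)
gapminder <- data.frame(
  country   = c("Country A", "Country A", "Country B", "Country B"),
  continent = c("Asia", "Asia", "Europe", "Europe"),
  year      = c(1957, 2007, 1957, 2007),
  lifeExp   = c(50, 72, 68, 80),
  pop       = c(600, 1300, 50, 60),
  gdpPercap = c(600, 5000, 9000, 33000)
)

# Keep only the rows where year is 2007
gapminder_2007 <- gapminder %>% filter(year == 2007)
gapminder_2007
```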

<font color = red>Let’s consider the <font style="background-color:lightgrey;">gapminder_2007</font> dataset which contains the variables GDP per capita <font style="background-color:lightgrey;">gdpPercap</font> and life expectancy <font style="background-color:lightgrey;">lifeExp</font> for 142 countries in the year 2007</font>

<font color = red><b>Q.</b> Complete the code below to produce a first basic scatter plot of this <font style="background-color:lightgrey;">gapminder_2007</font> dataset:</font>

There are quite a few points that are nearly on top of each other in the above plot. One useful approach here is to add an <font style="background-color:lightgrey;">alpha=0.4</font> argument to your <font style="background-color:lightgrey;">geom_point()</font> call to make the points slightly transparent. This will help us see things a little more clearly later on.
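A sketch of the basic scatter plot with semi-transparent points. The `gapminder_2007` data frame here is a made-up five-row stand-in (hypothetical countries and values) so the example runs on its own.

```r
library(ggplot2)

# Made-up stand-in for gapminder_2007 (hypothetical values)
gapminder_2007 <- data.frame(
  country   = c("Country A", "Country B", "Country C", "Country D", "Country E"),
  continent = c("Asia", "Asia", "Europe", "Africa", "Americas"),
  lifeExp   = c(72, 65, 80, 55, 78),
  pop       = c(1300, 1100, 60, 140, 300),
  gdpPercap = c(5000, 2500, 33000, 2000, 43000)
)

# alpha is a constant parameter here, so it goes in geom_point(), not aes()
p <- ggplot(gapminder_2007) +
  aes(x = gdpPercap, y = lifeExp) +
  geom_point(alpha = 0.4)
p
```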

#### Adding more variables to <span style="font-weight:normal"><font style="background-color:lightgrey;">aes()</font></span>
By mapping the <font style="background-color:lightgrey;">continent</font> variable to the point <font style="background-color:lightgrey;">color</font> aesthetic and the population <font style="background-color:lightgrey;">pop</font> (in millions) through the point <font style="background-color:lightgrey;">size</font> argument to <font style="background-color:lightgrey;">aes()</font> we can obtain a much richer plot that now includes 4 different variables from the data set:
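This four-variable mapping might be written as in the sketch below, again on a made-up stand-in for `gapminder_2007` (hypothetical countries and values). The `alpha = 0.5` setting is our own choice.

```r
library(ggplot2)

# Made-up stand-in for gapminder_2007 (hypothetical values)
gapminder_2007 <- data.frame(
  country   = c("Country A", "Country B", "Country C", "Country D", "Country E"),
  continent = c("Asia", "Asia", "Europe", "Africa", "Americas"),
  lifeExp   = c(72, 65, 80, 55, 78),
  pop       = c(1300, 1100, 60, 140, 300),
  gdpPercap = c(5000, 2500, 33000, 2000, 43000)
)

# x, y, color and size each carry a different variable
p <- ggplot(gapminder_2007) +
  aes(x = gdpPercap, y = lifeExp, color = continent, size = pop) +
  geom_point(alpha = 0.5)
p
```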

**N.B.** Here each new aesthetic we add results in a new dimension to our scatter plot. We see that in the resulting plot each point is colored according to the continent of each country. ggplot chooses a discrete color scheme because continent is a categorical variable.

By contrast, let’s see what the plot looks like if we color the points by the numeric variable population pop:
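A sketch of coloring by a numeric variable, using a made-up stand-in for `gapminder_2007` (hypothetical countries and values). The `alpha = 0.8` setting is our own choice.

```r
library(ggplot2)

# Made-up stand-in for gapminder_2007 (hypothetical values)
gapminder_2007 <- data.frame(
  country   = c("Country A", "Country B", "Country C", "Country D", "Country E"),
  continent = c("Asia", "Asia", "Europe", "Africa", "Americas"),
  lifeExp   = c(72, 65, 80, 55, 78),
  pop       = c(1300, 1100, 60, 140, 300),
  gdpPercap = c(5000, 2500, 33000, 2000, 43000)
)

# A numeric variable mapped to color produces a continuous color gradient
p <- ggplot(gapminder_2007) +
  aes(x = gdpPercap, y = lifeExp, color = pop) +
  geom_point(alpha = 0.8)
p
```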

The scale immediately changes to continuous as can be seen in the legend and the light-blue points are now the countries with the highest population number (China and India).

#### Adjusting point size
For the gapminder_2007 dataset we can plot the GDP per capita (<font style="background-color:lightgrey;">x=gdpPercap</font>) vs. the life expectancy (<font style="background-color:lightgrey;">y=lifeExp</font>) and set the point size based on the population (<font style="background-color:lightgrey;">size=pop</font>) of each country:

However, if you look closely you will see that the point sizes in the plot above do not clearly reflect the population differences between countries. If we compare the point size representing a population of 250 million people with the one displaying 750 million, we can see that their sizes are not proportional. Instead, the point sizes are binned by default. To reflect the actual population differences by the point size we can use the <font style="background-color:lightgrey;">scale_size_area()</font> function instead. The scaling information can be added like any other ggplot object with the <font style="background-color:lightgrey;">+</font> operator:
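A sketch of the size mapping with and without `scale_size_area()`, on a made-up stand-in for `gapminder_2007` (hypothetical countries and values). The object names and the choice of `max_size = 10` are ours.

```r
library(ggplot2)

# Made-up stand-in for gapminder_2007 (hypothetical values)
gapminder_2007 <- data.frame(
  country   = c("Country A", "Country B", "Country C", "Country D", "Country E"),
  continent = c("Asia", "Asia", "Europe", "Africa", "Americas"),
  lifeExp   = c(72, 65, 80, 55, 78),
  pop       = c(1300, 1100, 60, 140, 300),
  gdpPercap = c(5000, 2500, 33000, 2000, 43000)
)

# Default size scale
p_default <- ggplot(gapminder_2007) +
  aes(x = gdpPercap, y = lifeExp, size = pop) +
  geom_point(alpha = 0.5)

# Point area proportional to pop; the largest point gets size max_size
p_area <- p_default +
  scale_size_area(max_size = 10)

p_default
p_area
```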

Note that we have also adjusted the max_size of the points, which results in bigger point sizes.

<font color = red><b>Q.</b> Can you adapt the code you have learned thus far to reproduce our gapminder scatter plot for the year 1957? What do you notice about this plot? Is it easy to compare with the one for 2007?

Steps to produce your 1957 plot should include:
<ul>
    <li>Use dplyr to <font style="background-color:lightgrey;">filter</font> the <font style="background-color:lightgrey;">gapminder</font> dataset to include only the year 1957 (check above for how we did this for 2007).</li>
    <li>Save your result as <font style="background-color:lightgrey;">gapminder_1957</font>.</li>
    <li>Use the <font style="background-color:lightgrey;">ggplot()</font> function and specify the <font style="background-color:lightgrey;">gapminder_1957</font> dataset as input</li>
    <li>Add a <font style="background-color:lightgrey;">geom_point()</font> layer to the plot and create a scatter plot showing the GDP per capita <font style="background-color:lightgrey;">gdpPercap</font> on the x-axis and the life expectancy <font style="background-color:lightgrey;">lifeExp</font> on the y-axis</li>
    <li>Use the <font style="background-color:lightgrey;">color</font> aesthetic to indicate each continent by a different color</li>
    <li>Use the <font style="background-color:lightgrey;">size</font> aesthetic to adjust the point size by the population pop</li>
    <li>Use <font style="background-color:lightgrey;">scale_size_area()</font> so that the point sizes reflect the actual population differences and set the <font style="background-color:lightgrey;">max_size</font> of each point to 15</li>
    <li>Set the opacity/transparency of each point to 70% using the <font style="background-color:lightgrey;">alpha=0.7</font> parameter</li>
</ul></font>
 

<font color = red><b>Q.</b> Do the same steps above but include 1957 and 2007 in your input dataset for <font style="background-color:lightgrey;">ggplot()</font>. You should now include the layer <font style="background-color:lightgrey;">facet_wrap(~year)</font> to produce the following plot:</font>

### 7. OPTIONAL: Bar Charts
In this section we will cover:

* How to create bar charts using <font style="background-color:lightgrey;">geom_col()</font>.
* How to fill bars with color using the <font style="background-color:lightgrey;">fill</font> aesthetic.
* Order your bars by the trend you are most interested in.
* Flip (or rotate) your plots to help with presentation clarity.
* Use alternative plot types when bar charts become too crowded.

#### Introduction to bar charts
Bar charts visualize numeric values grouped by categories. Each category is represented by one bar with a height defined by each numeric value.

Bar charts are well suited to comparing values among different groups, e.g. number of votes by party, number of people in different countries, or GDP per capita in different countries. Bar charts take up a lot of space and work best if the number of groups to compare is rather small.

Below you can find an example showing the number of people (in millions) in the five biggest countries by population in 2007:

#### Creating a simple bar chart
In ggplot2, bar charts are created using the geom_col() geometric layer. The geom_col() layer requires the x aesthetic mapping which defines the different bars to be plotted. The height of each bar is defined by the variable specified in the y aesthetic mapping. Both mappings, x and y are required for geom_col(). Let’s create our first bar chart with the gapminder_top5 dataset. It contains population (in millions) and life expectancy data for the biggest countries by population in 2007.
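How `gapminder_top5` is built is not shown in this document, so the sketch below uses a made-up five-row stand-in with hypothetical countries and values. The rows are deliberately not in alphabetical order so that the default bar ordering is visible.

```r
library(ggplot2)

# Made-up stand-in for gapminder_top5 (hypothetical values)
gapminder_top5 <- data.frame(
  country   = c("Country D", "Country B", "Country E", "Country A", "Country C"),
  continent = c("Asia", "Asia", "Americas", "Asia", "Americas"),
  lifeExp   = c(73, 65, 78, 71, 72),
  pop       = c(1300, 1100, 300, 220, 190),
  gdpPercap = c(5000, 2500, 43000, 3500, 9000)
)

# x defines the bars (one per country), y defines the bar heights
p <- ggplot(gapminder_top5) +
  aes(x = country, y = pop) +
  geom_col()
p
```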

We see that the resulting bars are sorted by the country names in alphabetical order by default.

<font color =red><b>Q.</b> Plot life expectancy by country: create a bar chart showing the life expectancy of the five biggest countries by population in 2007.</font>

* <font color =red>Use the ggplot() function and specify the gapminder_top5 dataset as input</font>
* <font color =red>Add a geom_col() layer to the plot </font>
* <font color =red>Plot one bar for each country (x aesthetic)</font>
* <font color =red>Use life expectancy lifeExp as bar height (y aesthetic)</font>

#### Filling bars with color
Like other geoms, <font style="background-color:lightgrey;">geom_col()</font> allows users to map additional dataset variables to the color attribute of the bar. The <font style="background-color:lightgrey;">fill</font> aesthetic can be used to fill the entire bars with color. A common source of confusion is the <font style="background-color:lightgrey;">color</font> aesthetic, which specifies the line color of each bar’s border rather than the <font style="background-color:lightgrey;">fill</font> color.

Based on the <font style="background-color:lightgrey;">gapminder_top5</font> dataset we plot the population (in millions) of the biggest countries and use the continent variable to color each bar:

Since the <font style="background-color:lightgrey;">continent</font> variable is a *categorical variable* the bars have a clear color scheme for each <font style="background-color:lightgrey;">continent</font>. Let’s see what happens if we use a numeric variable like life expectancy <font style="background-color:lightgrey;">lifeExp</font> instead:

The bar colors have now changed according to the **continuous** legend on the right. We see that <font style="background-color:lightgrey;">numeric</font> variables can also be used to <font style="background-color:lightgrey;">fill</font> bars.
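The categorical and numeric fills described above can be sketched side by side. The data frame is a made-up stand-in for `gapminder_top5` (hypothetical countries and values) and the object names are our own.

```r
library(ggplot2)

# Made-up stand-in for gapminder_top5 (hypothetical values)
gapminder_top5 <- data.frame(
  country   = c("Country D", "Country B", "Country E", "Country A", "Country C"),
  continent = c("Asia", "Asia", "Americas", "Asia", "Americas"),
  lifeExp   = c(73, 65, 78, 71, 72),
  pop       = c(1300, 1100, 300, 220, 190),
  gdpPercap = c(5000, 2500, 43000, 3500, 9000)
)

# Categorical fill: one distinct color per continent
p_cat <- ggplot(gapminder_top5) +
  aes(x = country, y = pop, fill = continent) +
  geom_col()

# Numeric fill: a continuous color gradient driven by lifeExp
p_num <- ggplot(gapminder_top5) +
  aes(x = country, y = pop, fill = lifeExp) +
  geom_col()

p_cat
p_num
```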

<font color =red><b>Q.</b> Plot population size by country<br> 
Create a bar chart showing the population (in millions) of the five biggest countries by population in 2007.<br>
<ul>
    <li>Use the <font style="background-color:lightgrey;">ggplot()</font> function and specify the <font style="background-color:lightgrey;">gapminder_top5</font> dataset as input</li>
    <li>Add a <font style="background-color:lightgrey;">geom_col()</font> layer to the plot</li>
    <li>Plot one bar for each <font style="background-color:lightgrey;">country</font> (<font style="background-color:lightgrey;">x</font> aesthetic)</li>
    <li>Use population <font style="background-color:lightgrey;">pop</font> as bar height (<font style="background-color:lightgrey;">y</font> aesthetic)</li>
    <li>Use the GDP per capita <font style="background-color:lightgrey;">gdpPercap</font> as <font style="background-color:lightgrey;">fill</font> aesthetic</li>
</ul></font>





We can also change the order of the bars and fill them by country rather than by a numeric variable.
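A sketch of reordering and filling by country, on a made-up stand-in for `gapminder_top5` (hypothetical countries and values). Using `reorder(country, -pop)` to sort bars from largest to smallest population, the gray border color, and hiding the redundant legend are our own choices.

```r
library(ggplot2)

# Made-up stand-in for gapminder_top5 (hypothetical values)
gapminder_top5 <- data.frame(
  country   = c("Country D", "Country B", "Country E", "Country A", "Country C"),
  continent = c("Asia", "Asia", "Americas", "Asia", "Americas"),
  lifeExp   = c(73, 65, 78, 71, 72),
  pop       = c(1300, 1100, 300, 220, 190),
  gdpPercap = c(5000, 2500, 43000, 3500, 9000)
)

p <- ggplot(gapminder_top5) +
  # reorder() sorts the countries by -pop, i.e. largest population first
  aes(x = reorder(country, -pop), y = pop, fill = country) +
  # col sets the bar border color (a constant parameter)
  geom_col(col = "gray30") +
  # The x-axis already names each country, so drop the fill legend
  guides(fill = "none")
p
```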

#### Flipping bar charts
In some circumstances it might be useful to rotate (or “flip”) your plots to enable a clearer visualization. For this we can use the <font style="background-color:lightgrey;">coord_flip()</font> function. Let’s look at an example considering arrest data in US states. This is another inbuilt dataset called <font style="background-color:lightgrey;">USArrests</font>.
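A sketch of a flipped bar chart using the built-in `USArrests` data. Plotting the `Murder` column, copying the row names into a `State` column, and the object name `arrests` are our own choices for illustration.

```r
library(ggplot2)

# USArrests stores the state names as row names; copy them into a column
arrests <- USArrests
arrests$State <- rownames(arrests)

p <- ggplot(arrests) +
  # Order the states by their Murder value
  aes(x = reorder(State, Murder), y = Murder) +
  geom_col() +
  # Swap the axes so the state names read horizontally
  coord_flip()
p
```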



Hmm… this is too crowded for an effective display in small format. Let’s try an alternative custom visualization by combining <font style="background-color:lightgrey;">geom_point()</font> and <font style="background-color:lightgrey;">geom_segment()</font>:
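One way such a combined plot might be built is sketched below: each state gets a thin segment from zero to its value, capped with a point. The `Murder` column, the blue segment color, and pre-ordering `State` as a factor are our own choices.

```r
library(ggplot2)

# Store the state names as a factor ordered by Murder value
arrests <- USArrests
arrests$State <- reorder(rownames(arrests), arrests$Murder)

p <- ggplot(arrests) +
  aes(x = State, y = Murder) +
  # A point marks each state's value
  geom_point() +
  # A segment runs from y = 0 to the value at the same x position
  geom_segment(aes(xend = State, y = 0, yend = Murder), color = "blue") +
  coord_flip()
p
```

Thin segments use far less ink than full bars, which leaves more room for the 50 state labels.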



We will be exploring and producing different plot types in this way throughout the rest of our course going forward. **Happy ggplotting!**

### About this document
Here we use the <font style="background-color:lightgrey;">sessionInfo()</font> function to report on our R systems setup at the time of document execution.

    sessionInfo()