In [None]:
# ignore the commands below; these just make sure plots fit on the screen
library('repr')
options(repr.plot.width=3, repr.plot.height=3)

# Lesson 7: Visualizing Data

Today:
1. More on bar plots and histograms
    + Bar plots with proportion on the y-axis
    + Understanding histograms with unequal bin width
    + Understanding histograms with density on the y-axis
2. Visualizing grouped datasets
3. Miscellaneous tweaks!

In [1]:
# class starter
# do not modify this cell
my_pets <- data.frame( Name = c('Alex', 'Bert', 'Cate', 'Doug', 'Evan', 'Finn', 'Gregor', 'Hummus', 'Iliad', 'Jamal') ,
                       Species = c('Cat', 'Cat', 'Dog', 'Cat', 'Dog', 'Rabbit', 'Rabbit', 'Rabbit', 'Rabbit', 'Rabbit' ),
                       Weight_lb = c(25, 15, 100, 20, 20, 4, 2, 5, 3, 1),
                       Age = c( 8.5, 3.9, 4.1, 3, 0.7, 2.5, 1.5, 3.1, 2.9, 2.2 ),
                       Spayed_Neutered = c( TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE  )
                     )

my_pets

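Name,Species,Weight_lb,Age,Spayed_Neutered
Alex,Cat,25,8.5,TRUE
Bert,Cat,15,3.9,FALSE
Cate,Dog,100,4.1,TRUE
Doug,Cat,20,3,TRUE
Evan,Dog,20,0.7,FALSE
Finn,Rabbit,4,2.5,TRUE
Gregor,Rabbit,2,1.5,FALSE
Hummus,Rabbit,5,3.1,FALSE
Iliad,Rabbit,3,2.9,TRUE
Jamal,Rabbit,1,2.2,TRUE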


## 1.1. Bar plots with proportion on the y-axis

So far, the height of each bar has been a count. We can also create a bar plot where
+ the x-axis corresponds to the variable/column,
+ the **y-axis (the height of the bar)** is the **proportion of observations** that belong to that group, that is, the count in the group divided by the total number of observations.

To create bar plots with **proportion** on the y-axis:
    
        ggplot( DATAFRAMENAME , aes( x = COLUMNNAME, y = ..prop.., group = 1 ) ) + geom_bar()
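
As a rough illustration of what the proportion on the y-axis means, the sketch below computes the proportions for the `Species` column by hand and then draws the matching bar plot. The names `pets`, `species_prop`, and `p_prop` are made up for this example, and it uses `after_stat(prop)`, which is the newer spelling of `..prop..` (this assumes ggplot2 3.3.0 or later).

```r
library(ggplot2)

# Species column, same values as in the class starter cell
pets <- data.frame(
  Species = c('Cat', 'Cat', 'Dog', 'Cat', 'Dog',
              'Rabbit', 'Rabbit', 'Rabbit', 'Rabbit', 'Rabbit')
)

# Proportion by hand: count in each group divided by the total number of rows
species_prop <- table(pets$Species) / nrow(pets)
species_prop

# Bar plot whose bar heights are those same proportions
# (assumption: after_stat() is available, i.e. ggplot2 >= 3.3.0)
p_prop <- ggplot(pets, aes(x = Species, y = after_stat(prop), group = 1)) +
  geom_bar()
p_prop
```

The bar heights should add up to 1, because every pet belongs to exactly one species.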

**Exercise**

Create a bar plot for the `Spayed_Neutered` column, with proportion on the y-axis 

## 1.2. Understanding histograms with density on the y-axis


To create histograms with density on the y-axis:

    ggplot( DATAFRAMENAME , aes( x = COLUMNNAME, y = ..density.. ) ) + geom_histogram()
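
One way to see what "density" means is to compute it by hand: the density of a bin is the proportion of observations in the bin divided by the bin width, so the bar areas add up to 1. The sketch below does this for `Weight_lb` with four equal bins; the bin endpoints and the names `pets`, `weight_breaks`, `h`, and `p_density` are choices made for this example only, and `after_stat(density)` is the newer spelling of `..density..` (assuming ggplot2 3.3.0 or later).

```r
library(ggplot2)

# Weight_lb column, same values as in the class starter cell
pets <- data.frame(Weight_lb = c(25, 15, 100, 20, 20, 4, 2, 5, 3, 1))

# Four equal bins of width 25 (endpoints chosen for illustration)
weight_breaks <- seq(0, 100, by = 25)

# Base R hist() counts the observations in each bin without drawing anything
h <- hist(pets$Weight_lb, breaks = weight_breaks, plot = FALSE)
bin_width <- diff(h$breaks)

# density = (count / total) / bin width
density_by_hand <- (h$counts / sum(h$counts)) / bin_width
density_by_hand

# area of each bar = density * width; the areas add up to 1
total_area <- sum(density_by_hand * bin_width)
total_area

# The same histogram in ggplot, with density on the y-axis
p_density <- ggplot(pets, aes(x = Weight_lb, y = after_stat(density))) +
  geom_histogram(breaks = weight_breaks)
p_density
```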
    


**Exercise**

Create a histogram for the `Age` column, with density on the y-axis.

## 1.3. Understanding histograms with unequal bin widths


Sometimes, it makes sense to create a histogram where the bins do not all have the same width.

**Example**

<table>
    <tr> 
        <th>Name</th>
        <th>Species</th>
        <th>Weight_lb</th>
        <th>Age</th>
        <th>Spayed_Neutered</th>
    </tr>
    <tr>
        <td>Alex</td>
        <td>Cat</td>
        <td>25</td>
        <td>8.5</td>
        <td>TRUE</td>
    </tr>
    <tr>
        <td>Bert</td>
        <td>Cat</td>
        <td>15</td>
        <td>3.9</td>
        <td>FALSE</td>
    </tr>
    <tr>
        <td>Cate</td>
        <td>Dog</td>
        <td>100</td>
        <td>4.1</td>
        <td>TRUE</td>
    </tr>
    <tr>
        <td>Doug</td>
        <td>Cat</td>
        <td>20</td>
        <td>3</td>
        <td>TRUE</td>
    </tr>
    <tr>
        <td>Evan</td>
        <td>Dog</td>
        <td>20</td>
        <td>0.7</td>
        <td>FALSE</td>
    </tr>
    <tr>
        <td>Finn</td>
        <td>Rabbit</td>
        <td>4</td>
        <td>2.5</td>
        <td>TRUE</td>
    </tr>
    <tr>
        <td>Gregor</td>
        <td>Rabbit</td>
        <td>2</td>
        <td>1.5</td>
        <td>FALSE</td>
    </tr>
    <tr>
        <td>Hummus</td>
        <td>Rabbit</td>
        <td>5</td>
        <td>3.1</td>
        <td>FALSE</td>
    </tr>
    <tr>
        <td>Iliad</td>
        <td>Rabbit</td>
        <td>3</td>
        <td>2.9</td>
        <td>TRUE</td>
    </tr>
    <tr>
        <td>Jamal</td>
        <td>Rabbit</td>
        <td>1</td>
        <td>2.2</td>
        <td>TRUE</td>
    </tr>
</table>

We can specify the bins by specifying the "breaks"/endpoints between the bins:
    
    ggplot( DATAFRAMENAME, aes( x = COLUMNNAME ) ) + geom_histogram( breaks = LISTOFBINENDPOINTS )
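
As a sketch of why unequal bins are usually paired with density, the example below bins `Age` into (0, 2], (2, 4], and (4, 9]. These endpoints are picked only for illustration, and `pets`, `age_breaks`, `bin_summary`, `p_count`, and `p_density` are hypothetical names. With counts on the y-axis the wide last bin looks as tall as the first one, while with density the *area* of each bar is the proportion of pets in that bin.

```r
library(ggplot2)

# Age column, same values as in the class starter cell
pets <- data.frame(Age = c(8.5, 3.9, 4.1, 3, 0.7, 2.5, 1.5, 3.1, 2.9, 2.2))

# Unequal bin endpoints (assumed for this example)
age_breaks <- c(0, 2, 4, 9)

# Summarize each bin: endpoints, width, count, density, and area
h <- hist(pets$Age, breaks = age_breaks, plot = FALSE)
bin_summary <- data.frame(
  left    = head(h$breaks, -1),
  right   = tail(h$breaks, -1),
  width   = diff(h$breaks),
  count   = h$counts,
  density = h$density
)
bin_summary$area <- bin_summary$width * bin_summary$density
bin_summary

# Count on the y-axis: bar height ignores the bin width
p_count <- ggplot(pets, aes(x = Age)) +
  geom_histogram(breaks = age_breaks)

# Density on the y-axis: bar area is the proportion in the bin
p_density <- ggplot(pets, aes(x = Age, y = after_stat(density))) +
  geom_histogram(breaks = age_breaks)

p_count
p_density
```

Comparing the two plots side by side is a quick check on whether a histogram with unequal bins is showing counts or densities.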

In [None]:
# plot a histogram for the Weight_lb column, with 3 bins


# 2. Visualizing grouped data

We are done with the basics of data visualization!

The key goals are for us to be able to:
+ interpret information from a given data visualization
+ construct an appropriate data visualization for a given variable(s)
    + understand the meaning of proportion and density in bar plots and histograms, including histograms with unequal bin width

There are a lot of other tweaks you can apply to your basic ggplot commands.  You won't have to memorize these for the exam, but you are encouraged to use them in your projects.

(This class is in fact NEVER about memorization, and our exam will reflect that.  We are about understanding ideas and understanding how to use computational tools to help us reason and work with data.)

## 2.1. Grouped Scatterplots

Suppose that we would like to create a scatterplot where the shape or color of each point corresponds to a group within a categorical variable.

    ggplot( DATAFRAMENAME , aes(x = NUMERICALVAR1, y = NUMERICALVAR2, color = CATEGORICALVAR )) + geom_point( )
    
    ggplot( DATAFRAMENAME , aes(x = NUMERICALVAR1, y = NUMERICALVAR2, shape = CATEGORICALVAR )) + geom_point( )
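
For instance, the two templates above could be filled in as follows. This is only a sketch: `pets` is a stand-in data frame rebuilt from the class starter values, and `p_color`, `p_shape`, and `n_groups` are names made up for this example.

```r
library(ggplot2)

# Three columns from the class starter cell
pets <- data.frame(
  Species   = c('Cat', 'Cat', 'Dog', 'Cat', 'Dog',
                'Rabbit', 'Rabbit', 'Rabbit', 'Rabbit', 'Rabbit'),
  Weight_lb = c(25, 15, 100, 20, 20, 4, 2, 5, 3, 1),
  Age       = c(8.5, 3.9, 4.1, 3, 0.7, 2.5, 1.5, 3.1, 2.9, 2.2)
)

# One color per species; ggplot adds the legend automatically
p_color <- ggplot(pets, aes(x = Weight_lb, y = Age, color = Species)) +
  geom_point()

# One point shape per species instead of one color
p_shape <- ggplot(pets, aes(x = Weight_lb, y = Age, shape = Species)) +
  geom_point()

# Number of groups, and so the number of legend entries to expect
n_groups <- length(unique(pets$Species))
n_groups

p_color
p_shape
```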

## 2.2. Grouped bar plot

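Suppose that we would like to visualize the distribution of a categorical variable, while breaking down each bar based on a second categorical variable.  By default the pieces of each bar are stacked; with `position='dodge'` they are drawn side by side.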

    ggplot( my_pets_new, aes(x = Species, fill = Spayed_Neutered)) + geom_bar( )
    
    ggplot( my_pets_new, aes(x = Species, fill = Spayed_Neutered)) + geom_bar( position='dodge' )
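
The heights of the colored pieces are just the counts in a two-way table of the two variables. The sketch below shows that table next to the stacked and dodged plots. Since `my_pets_new` is not defined in this part of the notes, the example assumes it has the same `Species` and `Spayed_Neutered` values as the class starter and rebuilds them in a stand-in data frame called `pets`; `seg_counts`, `p_stacked`, and `p_dodged` are likewise hypothetical names.

```r
library(ggplot2)

# Stand-in for my_pets_new (assumption: same values as the class starter)
pets <- data.frame(
  Species = c('Cat', 'Cat', 'Dog', 'Cat', 'Dog',
              'Rabbit', 'Rabbit', 'Rabbit', 'Rabbit', 'Rabbit'),
  Spayed_Neutered = c(TRUE, FALSE, TRUE, TRUE, FALSE,
                      TRUE, FALSE, FALSE, TRUE, TRUE)
)

# Two-way table of counts: rows are species, columns are FALSE / TRUE
seg_counts <- table(pets$Species, pets$Spayed_Neutered)
seg_counts

# Stacked bars: the pieces within a species sit on top of each other
p_stacked <- ggplot(pets, aes(x = Species, fill = Spayed_Neutered)) +
  geom_bar()

# Dodged bars: the pieces within a species sit side by side
p_dodged <- ggplot(pets, aes(x = Species, fill = Spayed_Neutered)) +
  geom_bar(position = 'dodge')

p_stacked
p_dodged
```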

**Example**

In [None]:
ggplot( my_pets_new, aes(x = Species , fill = Spayed_Neutered   )) + geom_bar(  )

In [None]:
ggplot( my_pets_new, aes(x = Species   )) + geom_bar(  )

## 2.3. Other `ggplot` tweaks

### Controlling color, shape, size of points in scatterplots

    ggplot( DATAFRAMENAME, aes(x = NUMERICALVAR1, y = NUMERICALVAR2 )) +
        geom_point( shape = SHAPENUMBER , color = COLORNAME, size = SIZENUMBER )
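
A filled-in version might look like the sketch below. The particular values (shape 17, which is a filled triangle, the color `'darkgreen'`, and size 4) are arbitrary choices for illustration, and `pets` and `p_points` are hypothetical names. Note that these settings go *outside* `aes()`, so they apply to every point rather than varying by group.

```r
library(ggplot2)

# Two numerical columns from the class starter cell
pets <- data.frame(
  Weight_lb = c(25, 15, 100, 20, 20, 4, 2, 5, 3, 1),
  Age       = c(8.5, 3.9, 4.1, 3, 0.7, 2.5, 1.5, 3.1, 2.9, 2.2)
)

# Fixed shape, color, and size for all points (values chosen arbitrarily)
p_points <- ggplot(pets, aes(x = Weight_lb, y = Age)) +
  geom_point(shape = 17, color = 'darkgreen', size = 4)
p_points
```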

In [None]:
# Example

ggplot( my_pets_new, aes( x = Weight_lb, y = Age )) + geom_point(   )

### Controlling fill color in bar plots

    ggplot( DATAFRAMENAME, aes(x = CATEGORICALVAR )) +
        geom_bar( fill = COLORNAME )

In [None]:
ggplot( my_pets_new, aes( x = Species )) + geom_bar( fill = 'blue'  )

There are many others.  If you want to figure out how to tweak a ggplot visualization in a particular way, there is (most likely) a way to do it, so google it!

## Other tweaks: Displaying histogram information (bin endpoints; bin heights; bin areas)

There were questions during class on how one can display more information on histograms (and on other types of visualizations in general).

I will add examples for how to do these below.  Everything below is completely optional for you to know/use.