# An Introduction to Data Visualization with R using ggplot2
* **Modeled on Data Science Dojo's "An Introduction to Data Visualization with R and ggplot2"**
### Contributor: [Research Data Service](https://libraries.uc.edu/research-teaching-support/research-data-services.html), University of Cincinnati Libraries
### Contact: askdata@uc.edu

# Index

* **Introduction to R**
 * What is R 
 * Where R is used
 * Ways to run R
 * Data types
 * Comments<br><br>

* **Hands-on**
 * Install packages
 * Load data
 * Visualize discrete data
 * Visualize continuous data
 * Tell a story using a single plot 
 * Explore other visualizations (Stretch goal 1)
 * Theme and layout (Stretch goal 2)<br><br>

* **Helpful resources**<br><br>

# Introduction to R
### What is R
* General-purpose, high-level
 * [Interpreted Language](https://guide.freecodecamp.org/computer-science/compiled-versus-interpreted-languages/)
 * [Object-oriented Language](https://www.freecodecamp.org/news/object-oriented-programming-concepts-21bb035f7260/)

* Simple, Portable, Open source & Powerful

* Created by **Ross Ihaka and Robert Gentleman**; the first stable version was released in early 2000

### Ways to run R
* Integrated Development Environment (IDE) - RStudio, etc.
* Web-based interactive computational environment like Jupyter Notebooks, JupyterLab, Google Colaboratory, etc

### [Data types in R](https://www.statmethods.net/input/datatypes.html)
* [Scalar (Numeric, Character, Integer, Logical, Complex)](https://study.com/academy/lesson/scalar-data-type-in-r-programming-definition-function.html)
* Vector 
* Matrices
* List
* Data Frame
* Factor
* [Tibble](https://tibble.tidyverse.org/)
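
The data types listed above can be sketched with a few lines of base R. This is only an illustration: all values and variable names are made up, and tibbles are left out because they need the separate `tibble` package.

```r
# Scalars are simply length-one vectors in R
n <- 3.14        # numeric (double)
i <- 42L         # integer (note the L suffix)
s <- "Titanic"   # character
b <- TRUE        # logical
z <- 1 + 2i      # complex

# Vector: ordered values that all share one type
ages <- c(22, 38, 26)

# Matrix: two-dimensional, all values share one type
m <- matrix(1:6, nrow = 2, ncol = 3)

# List: elements may differ in type and length
passenger <- list(name = "Example Passenger", age = 30, survived = TRUE)

# Data frame: a table whose columns are equal-length vectors
df <- data.frame(age = ages, sex = c("male", "female", "female"))

# Factor: categorical data stored as integer codes plus labels
cls <- factor(c("3", "1", "3"), levels = c("1", "2", "3"))

# Inspect the results
class(n); class(i); class(df)
str(passenger)
levels(cls)
```

`str()` is a quick way to check which of these types an object actually is.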

### Comments
* The single-line comment symbol is the hash symbol '#'
* The interpreter ignores everything after # on that line
* Leave comments in your code to make it understandable to other team members and to yourself
* Hotkey to comment or uncomment a line: "Ctrl+/"

In [1]:
# This line is ignored

* Multi-line - Use # at the start of each line of a multiline comment
* Select multiple lines and press "Ctrl+/"

In [2]:
# This is a multiline comment
# This line and the line above are both ignored by the interpreter

# Hands-on
### Install packages

In [3]:
# install.packages("ggplot2")
# install.packages("titanic")
# install.packages("RColorBrewer")
library(ggplot2)
library(titanic)
library(RColorBrewer)

"package 'titanic' was built under R version 3.6.3"

### Load data

In [4]:
# Load the Titanic data set from a local CSV file (adjust the path to your own copy)
titanic <- read.csv("/home/rstudio/DataVizwithRandGgplot2/Data/titanic.csv")

"cannot open file '/home/rstudio/DataVizwithRandGgplot2/Data/titanic.csv': No such file or directory"

ERROR: Error in file(file, "rt"): cannot open the connection


In [None]:
head(titanic)

In [None]:
# Or load the same data from the CRAN package called titanic
titanic <- titanic::titanic_train
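
The two loading options above can be combined so the notebook does not stop when the CSV file is missing. The sketch below is one possible approach, not part of the original workshop: `load_csv_or_fallback()` is a hypothetical helper and the relative path `Data/titanic.csv` is an assumption about your project layout.

```r
# Hypothetical helper: read a CSV if it exists, otherwise call a fallback function
load_csv_or_fallback <- function(path, fallback) {
  if (file.exists(path)) {
    read.csv(path)
  } else {
    message("File not found: ", path, " (using fallback data)")
    fallback()
  }
}

# Assumed relative path; adjust it to where your copy of the file lives
csv_path <- file.path("Data", "titanic.csv")

# Only attempt the load when at least one of the two sources is available
if (file.exists(csv_path) || requireNamespace("titanic", quietly = TRUE)) {
  titanic <- load_csv_or_fallback(csv_path, function() titanic::titanic_train)
}
```

Passing the fallback as a function means the package data is only touched when the file is actually missing.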

In [None]:
#View the data
head(titanic)

In [None]:
#Look at the variable structures
str(titanic)

In [None]:
#Set up factors
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Survived <- as.factor(titanic$Survived)
titanic$Sex <- as.factor(titanic$Sex)
titanic$Embarked <- as.factor(titanic$Embarked)
str(titanic)
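
To see why the factor conversion above matters, here is a small sketch on a made-up vector standing in for a column such as `Pclass`. The values are invented for illustration.

```r
# Toy vector standing in for a numeric-coded category (values are made up)
pclass_raw <- c(3, 1, 3, 2, 1, 3)

# As numbers, R summarizes with min, mean, max, and so on
summary(pclass_raw)

# As a factor, R treats each distinct value as a category
pclass <- as.factor(pclass_raw)
levels(pclass)     # the distinct categories, sorted
summary(pclass)    # a count per category instead of a mean
```

ggplot2 makes the same distinction: a factor mapped to `fill` gets discrete colors, while a numeric column gets a continuous color gradient.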

In [None]:
#Summarize the data
summary(titanic)

### Visualize discrete data

In [None]:
# ggplot2 full template
# ggplot(data = <DATA>) + 
#   <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>),
#                   stat = <STAT>,
#                   position = <POSITION>) +
#   <COORDINATE_FUNCTION> +
#   <FACET_FUNCTION> + 
#   <THEME_FUNCTION>

# Don't freak out! Most plots only need the data, a geom, and the mappings.
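
As a rough illustration of the template above, here is one way every slot could be filled in. The `toy` data frame is made up for this sketch, and most of the arguments shown are simply the defaults written out.

```r
library(ggplot2)

# Toy data, made up for illustration (not the Titanic data)
toy <- data.frame(
  group   = c("a", "a", "b", "b", "b", "c"),
  outcome = c("yes", "no", "yes", "yes", "no", "no")
)

p <- ggplot(data = toy) +                # <DATA>
  geom_bar(mapping = aes(x = group),     # <GEOM_FUNCTION> and <MAPPINGS>
           stat = "count",               # <STAT> (the default for geom_bar)
           position = "stack") +         # <POSITION> (also the default)
  coord_cartesian() +                    # <COORDINATE_FUNCTION> (the default)
  facet_wrap(~outcome) +                 # <FACET_FUNCTION>
  theme_bw()                             # <THEME_FUNCTION>

p
```

Dropping the `stat`, `position`, and `coord_cartesian()` pieces should give the same plot, which is why the shorter calls in the rest of this notebook work.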

* **Question 1: What was the survival rate, or what was the distribution of survived vs. perished?**
 * Do you remember what type of variable Survived is?

In [None]:
#Plot a bar graph of survival 
ggplot(data = titanic, aes(x = Survived)) + geom_bar()

In [None]:
#Sometimes it is difficult to read exact proportions from a graph
#So we can look up the percentages using prop.table()
prop.table(table(titanic$Survived))
#~62% perished and ~38% survived

In [None]:
# Let's clean this graph up a bit
# This simple bit of code produces publishable graphics (You just did data visualization!!)
ggplot(titanic, aes(Survived)) + geom_bar() +
  labs(y = "Passenger Count", title = "Passenger Survival Rate") + 
  theme_bw()

* **Question 2: What was the survival rate by gender?**
 * Remember the adage "women and children first"? Is this true?

In [None]:
ggplot(data = titanic, aes(x=Sex, fill = Survived)) + geom_bar()

In [None]:
ggplot(titanic, aes(Sex, fill =Survived)) + geom_bar() +
  labs(x = "Gender",y = "Passenger Count", title = "Passenger Survival Rate by Gender") + 
  theme_bw()
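
The stacked bars above show passenger counts. If the goal is to compare rates between groups, one option is to compute row-wise proportions and to draw the bars with `position = "fill"`. This is a sketch on made-up data, so the numbers are not the real Titanic figures.

```r
library(ggplot2)

# Toy data, made up for illustration (not the real passenger counts)
toy <- data.frame(
  Sex      = factor(c("female", "female", "female",
                      "male", "male", "male", "male", "male")),
  Survived = factor(c(1, 1, 0, 0, 0, 0, 1, 0))
)

# margin = 1 gives row-wise proportions, so each Sex row sums to 1
rates <- prop.table(table(toy$Sex, toy$Survived), margin = 1)
rates

# position = "fill" rescales every bar to height 1, so segments show shares
p_fill <- ggplot(toy, aes(x = Sex, fill = Survived)) +
  geom_bar(position = "fill") +
  labs(x = "Gender", y = "Share of Passengers") +
  theme_bw()

p_fill
```

Counts and shares answer different questions, so it can be worth looking at both versions of the plot.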

* **Question 3: What was the survival rate by ticket class (proxy for socio-economic status)?**
 * Reproducibility example: copy and paste the gender code and just change the x variable and labels

In [None]:
ggplot(data = titanic, aes(x=Pclass, fill = Survived)) + geom_bar()

In [None]:
ggplot(titanic, aes(Pclass, fill =Survived)) + geom_bar() +
  labs(x = "Ticket Class",y = "Passenger Count", title = "Passenger Survival Rate by Ticket Class") + 
  theme_bw()

* **Question 4: What was the survival rate by gender and ticket class?**
 * Use facet_wrap() to examine how multiple variables interact and to dive deeper into the data

In [None]:
#Look up facet_wrap()
?facet_wrap

In [None]:
ggplot(titanic, aes(Sex, fill =Survived)) + 
  geom_bar() +
  facet_wrap(~Pclass) +
  labs(x = "Gender",y = "Passenger Count", title = "Passenger Survival Rate by Gender and Ticket Class") + 
  theme_bw()

###### What trends do you see? Which gender and ticket class combinations were most and least likely to survive?

### Visualize continuous data
* **Question 5: What was the distribution of passenger ages?**
 * Histograms are similar to bar charts, except they group numerical/continuous data into bins

In [None]:
ggplot(titanic, aes(x = Age)) + geom_histogram()

In [None]:
#Remove rows with a missing Age (the histogram above warns about them)
titanic <- titanic[!is.na(titanic$Age),]

In [None]:
#You can specify the number of bins
ggplot(titanic, aes(x = Age)) + geom_histogram(bins = 10)

In [None]:
#Or define the bin width (column groupings)
ggplot(titanic, aes(x = Age)) + geom_histogram(binwidth = 5)

In [None]:
#Let's clean this graph up
ggplot(titanic, aes(x = Age)) +
  geom_histogram(binwidth = 5, color = "white") +
  labs(x = "Age (binwidth = 5)",y = "Passenger Count", title = "Titanic Age Distribution") + 
  theme_bw()

* **Question 6: What was the survival rate of passengers by age?**

In [None]:
ggplot(titanic, aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 5, color = "white") +
  labs(x = "Age (binwidth = 5)",y = "Passenger Count", title = "Survival Rate by age") + 
  theme_bw()

In [None]:
#Alternative: what if we want basic statistics on survival by age?
#Use a box-and-whisker plot
ggplot(titanic, aes(x = Survived, y = Age)) +
  geom_boxplot() +
  labs(x = "Survived",y = "Age", title = "Survival Rate by age") + 
  theme_bw()
#No real discernible pattern between survival and age

In [None]:
#Need help remembering how to interpret a box-and-whisker graph?
?geom_boxplot #Computed variables
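
As a rough guide to reading the box-and-whisker plot, the numbers it draws can be computed by hand. The sketch below uses made-up ages and follows the description in the `geom_boxplot` help page: the box spans the 25th to 75th percentiles and the whiskers reach at most 1.5 times the interquartile range (IQR) beyond the box.

```r
# Toy data, made up for illustration
ages     <- c(2, 18, 22, 25, 30, 34, 40, 52, 61, 80)
survived <- factor(c(1, 0, 1, 0, 0, 1, 0, 1, 0, 0))

# Box: 25th percentile, median, and 75th percentile
q   <- quantile(ages, probs = c(0.25, 0.5, 0.75))
iqr <- unname(q["75%"] - q["25%"])
q

# Whiskers extend to the most extreme values inside these limits;
# values outside them are drawn as individual outlier points
lower_limit <- q[["25%"]] - 1.5 * iqr
upper_limit <- q[["75%"]] + 1.5 * iqr
outliers    <- ages[ages < lower_limit | ages > upper_limit]

# The same three-number summary for each survival group
tapply(ages, survived, quantile, probs = c(0.25, 0.5, 0.75))
```

Comparing the medians and box heights between groups is usually the quickest way to read the plot.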

### Tell a story using a single plot

In [None]:
ggplot(titanic, aes(x = Age, fill = Survived)) +
  geom_density(alpha = 0.5) + #Alpha is transparency control
  facet_wrap(Sex ~ Pclass) +
  labs(x = "Age",y = "Density", title = "Titanic Survival Rate by Age, Gender, and Ticket Class") + 
  theme_bw()

In [None]:
#Personally, I find the density plot a little difficult to interpret
#So I prefer a histogram
ggplot(titanic, aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 5,color = "white") +
  facet_wrap(Sex ~ Pclass) +
  labs(x = "Age",y = "Passengers", title = "Titanic Survival Rate by Age, Gender, and Ticket Class") + 
  theme_bw()
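
The two plots above pass a two-variable formula to `facet_wrap()`. A possible alternative is `facet_grid()`, which puts one variable on the rows and the other on the columns so the panels line up like a table. The sketch below uses a made-up data frame rather than the Titanic data.

```r
library(ggplot2)

# Toy data, made up for illustration
toy <- data.frame(
  Age      = c(4, 22, 35, 41, 29, 58, 19, 33, 47, 26, 62, 8),
  Sex      = rep(c("female", "male"), each = 6),
  Pclass   = factor(rep(c(1, 2, 3), times = 4)),
  Survived = factor(c(1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1))
)

p_grid <- ggplot(toy, aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 10, color = "white") +
  facet_grid(Sex ~ Pclass) +   # rows = Sex, columns = Pclass
  labs(x = "Age", y = "Passengers") +
  theme_bw()

p_grid
```

Which layout reads better depends on the data; the grid form tends to help when comparing along one variable at a time.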

### Explore other visualizations (Stretch goal 1)

In [None]:
#Scatterplot
ggplot(titanic, aes(Age,Fare)) +geom_point()

In [None]:
#Smoothed average
ggplot(titanic, aes(Age,Fare)) +geom_smooth()

In [None]:
#Combined scatter with smoothed average
ggplot(titanic, aes(Age,Fare)) +
  geom_point() +
  geom_smooth(se = FALSE)

In [None]:
##Violin Graphs
ggplot(titanic, aes(Pclass,Age)) +
  geom_violin()

### Theme and layout (Stretch goal 2)

In [None]:
#Basic Plot
ggplot(titanic, aes(x = Age, fill = Survived)) + 
  geom_histogram(binwidth = 5,color = "white")

In [None]:
#Flipping Axes
#ONLY IF IT MAKES SENSE
ggplot(titanic, aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 5,color = "white") +
  coord_flip()

In [None]:
#Add Title and Axis labels
ggplot(titanic, aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 5,color = "white") +
  labs(x = "Age",y = "Passengers", title = "Passenger Survival Rate by age")

In [None]:
#Add a theme
ggplot(titanic, aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 5,color = "white") +
  labs(x = "Age",y = "Passengers", title = "Passenger Survival Rate by age") +
  theme_minimal()

In [None]:
#Change the axis range to view only ages 10-60: coord_cartesian() zooms in without dropping any data
#xlim() and ylim() instead remove the data outside the range before plotting (NOT RECOMMENDED!)
ggplot(titanic, aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 5,color = "white") +
  labs(x = "Age",y = "Passengers", title = "Passenger Survival Rate by age") +
  theme_classic() + 
  coord_cartesian(xlim = c(10,60))
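
The difference between zooming and removing data can be seen by building both versions of a plot side by side. This is a sketch on made-up ages, and the object names are only for illustration.

```r
library(ggplot2)

# Toy ages, made up for illustration
toy <- data.frame(Age = c(2, 5, 12, 18, 24, 31, 37, 45, 52, 58, 66, 74))

base_plot <- ggplot(toy, aes(x = Age)) +
  geom_histogram(binwidth = 5, color = "white")

# Zoom: every row is binned first, then the view is cropped to 10-60
p_zoom <- base_plot + coord_cartesian(xlim = c(10, 60))

# Filter: rows outside 10-60 are dropped before binning,
# and ggplot2 warns about the removed rows when the plot is drawn
p_drop <- base_plot + xlim(10, 60)

p_zoom
```

For a plain histogram the visible bars may look similar, but for statistics such as smoothed lines or boxplots the dropped rows can change the result.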

In [None]:
#Add color
ggplot(titanic, aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 5,color = "white") +
  labs(x = "Age",y = "Passengers", title = "Passenger Survival Rate by age") +
  theme_minimal() + 
  coord_cartesian(xlim = c(10,60))+
  scale_fill_brewer(palette = "Dark2")

In [None]:
# more options for color
display.brewer.all(colorblindFriendly = TRUE)

In [None]:
#Change colors of bars and store the plot in my_plot (assigning a plot does not display it)
my_plot <- ggplot(titanic, aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 5,color = "white") +
  labs(x = "Age",y = "Passengers", title = "Passenger Survival Rate by age") +
  theme_classic() + 
  coord_cartesian(xlim = c(10,60)) +
  scale_fill_manual(values = c("darkgray", "darkblue"))
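
When colors are given as an unnamed vector, they are matched to the factor levels by position. One way to make the pairing explicit is to label the levels and name the colors, as in this sketch. The data are made up, and the "Perished"/"Survived" labels are an assumption about how 0 and 1 should read.

```r
library(ggplot2)

# Toy data, made up for illustration; 0/1 are relabeled for the legend
toy <- data.frame(
  Age      = c(4, 22, 35, 41, 29, 58, 19, 33),
  Survived = factor(c(1, 0, 0, 1, 1, 0, 0, 1),
                    levels = c(0, 1),
                    labels = c("Perished", "Survived"))
)

# Named vector: each color is tied to a level by name, not by position
status_colors <- c(Perished = "darkgray", Survived = "darkblue")

p_named <- ggplot(toy, aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 10, color = "white") +
  scale_fill_manual(values = status_colors) +
  theme_classic()

p_named
```

The legend then shows readable labels instead of 0 and 1.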

In [None]:
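#Saving Plots
#ggsave() saves the most recent plot unless you pass one with plot =
ggsave("Final_Plot.png", width = 10, height =10)
#Save a specific plot object (this overwrites the file written above)
ggsave("Final_Plot.png", plot = my_plot, width = 10, height =10)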
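
Here is a self-contained sketch of saving a plot with the size units and resolution spelled out. The file name and the toy data are made up, and the file is written to a temporary directory so nothing in your project is overwritten.

```r
library(ggplot2)

# Toy data, made up for illustration
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 3, 5))
p <- ggplot(toy, aes(x, y)) + geom_point()

# Hypothetical output file inside R's temporary directory
out_file <- file.path(tempdir(), "example_plot.png")

# width and height are in inches here; dpi controls the pixel resolution
ggsave(out_file, plot = p, width = 6, height = 4, units = "in", dpi = 300)

file.exists(out_file)
```

Passing `plot =` explicitly avoids accidentally saving whichever plot happened to be created last.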

# Helpful resources
* **[Data Visualization with ggplot2 Cheat Sheet](https://rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf)**
* **[R for Data Science](https://r4ds.had.co.nz/)**
* **[Geocomputation with R](https://geocompr.robinlovelace.net/)**
* **[Intro to GIS and Spatial Analysis](https://mgimond.github.io/Spatial/index.html)**

# $\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;$Questions ??

#  $\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;$[Survey](https://qfreeaccountssjc1.az1.qualtrics.com/jfe/form/SV_6uvE6uJSwMwV1Up)
$\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;$**Thank you for attending the workshop !!**


$\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;$**Your suggestions and feedback are more than welcome**