# Module 4 and 5 Exercise

In these exercises, we will create statistical plots of several data sets. Whether you use `R (ggplot2)` or `Python (plotnine)`, keep your `ggplot2` cheat sheets and documentation handy to find the right parameters for the functions.

Once you have chosen a language, switch the kernel by clicking the top right corner of the notebook menu bar (not the JupyterLab menu bar).

Let's read the house sales data.

In [None]:
library(tidyverse)
County_data <- read.csv("../data/kc_house_data.csv", header=TRUE, sep=",")
head(County_data)  # Check the first six observations (head() defaults to n = 6)

Let's check the observations that have more than 10 bedrooms.


In [None]:
County_data %>% filter(bedrooms > 10)

Two rows in the data set have more than 10 bedrooms, yet their prices are very low. Let's first remove those outliers (anomalies) from the data.

In [None]:
County_data <- County_data %>% filter(bedrooms <= 10)

**Exercise 1:** Run the `str` and `summary` commands on the data set and give your best guess about __(1) which attributes are **nominal** variables__. __(2) **Justify your answer.**__

In [None]:
# Your Code Here
# ------------------------

**Answer (which variables are nominal, and why you think so):**

**Exercise 2:** Map the **sqft_living** and **price** attributes to x and y axes of a scatter plot, respectively. Use some alpha transparency.

In [None]:
# Install the additional packages
# install.packages('ggthemes')

In [None]:
library(ggthemes)
options(repr.plot.width=8, repr.plot.height=8)

In [None]:
# Your Code Here
# ------------------------


Even when the scatter plot is drawn correctly, it is hard to see any structure in it. Let's try the same plot with log scales.

**Exercise 3**: Redraw the plot above with both axes (**x** and **y**) on a log10 scale.


In [None]:
# Your Code Here
# ------------------------


This shows the relationship between price and square footage more clearly.
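
As a rough illustration of why a log scale helps with data spanning several orders of magnitude: on a log10 axis, equal ratios become equal distances. The sketch below uses made-up prices (an assumption for illustration, not values from the house data) and plain arithmetic rather than a plot.

```r
# Hypothetical prices, each ten times the previous one (not from the data set).
price <- c(75000, 750000, 7500000)

# On the raw scale, the second gap is ten times the first ...
raw_gaps <- diff(price)

# ... but on the log10 scale every tenfold step has the same length.
log_gaps <- diff(log10(price))

print(raw_gaps)
print(log_gaps)
```

On a linear axis, a few very large values stretch the axis and squeeze everything else into a corner; the log transform spreads those points out again.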


Let's plot a histogram of price; it shows the distribution of prices across all houses.

**Exercise 4:** Plot a **histogram** of **price** with a **binwidth** of 30000.

In [None]:
# Your Code Here
# ------------------------


We can also plot an estimate of the probability density function of the price; it will look like a smoothed version of the histogram.
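
To make "density" a little more concrete, here is a minimal sketch using base R's `density()`, which computes a kernel density estimate. The values are simulated (an assumption for illustration, not the house prices), and the check only confirms that the area under the estimated curve is close to 1, unlike a default histogram whose bar heights are counts.

```r
# Simulated, right-skewed values standing in for prices (not the real data).
set.seed(1)
x <- rlnorm(500, meanlog = 13, sdlog = 0.5)

# Kernel density estimate evaluated on an evenly spaced grid.
d <- density(x)

# Approximate the area under the curve with a simple Riemann sum.
step <- diff(d$x)[1]
area <- sum(d$y) * step

print(area)
```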

**Exercise 5:** Plot the density of **price** using **geom_density**. 

In [None]:
# Your Code Here
# ------------------------

Let's work on a subset of the data; we will look only at houses priced below $2M.

In [None]:
lowprice_houses <- County_data %>% filter(price < 2000000)


---

We can plot **multiple densities** on the same plot to see how the price distribution differs with respect to some attribute. We should use alpha transparency in geom_density to see the different distributions.

**Exercise 6:** Plot multiple densities of **price** with respect to **number of bedrooms**. 

In [None]:
# Your Code Here
# ------------------------


This works well if there aren't too many classes. Here, we can't see much, so let's use **facet_wrap()** to create **small multiples** and compare the densities for different numbers of bedrooms.

**Exercise 7:** Plot **small multiples** of **price densities** with respect to **number of bedrooms**. 

In [None]:
options(repr.plot.width=16, repr.plot.height=10)

In [None]:
# Your Code Here
# ------------------------


Here, we can see that the price distribution is narrow for houses with up to four bedrooms; beyond that, the variance of the price increases.
Let's look at the distribution by number of floors.

**Exercise 8:** Plot **small multiples** of **price densities** with respect to **number of floors**. 

In [None]:
# Your Code Here
# ------------------------


Let's plot a scatter plot for price vs. square footage using small multiples for the number of bedrooms. Map the number of floors to a *color* visual variable using a **sequential** brewer palette.

**Exercise 9:** Draw a scatter plot of **price** vs. **square footage** with **small multiples** of **bedrooms**.

In [None]:
# Your Code Here
# ------------------------


**Exercise 10:** Now put both axes on a log scale.

In [None]:
# Your Code Here
# ------------------------


---


**Let's look at a different data set. This one comes from a survey of students enrolled in a class; it records information about their behavior, demographics, and so on.**

In [None]:
ecg <- read.csv("../data/eyecolorgenderdata.csv", header=TRUE, sep=",")
head(ecg)

**Exercise 11:** Name the **nominal** and **ordinal** variables in this data set. 

In [None]:
# Your Code Here
# ------------------------

**Answer (nominal and ordinal variables):**

**Exercise 12:** Plot a scatter plot of **gender** vs. **eyecolor**.

In [None]:
options(repr.plot.width=8, repr.plot.height=8)

In [None]:
# Your Code Here
# ------------------------


This didn't work well. When we have an overplotting problem, where many data rows share exactly the same attribute values, we should use geom_jitter() to displace the points randomly in the scatter plot.
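
One way to diagnose overplotting before reaching for jitter is to compare the number of rows with the number of distinct positions they occupy. The sketch below uses a small made-up table (the column names `group` and `shade` are hypothetical, not columns of the survey data) and base R only.

```r
# Hypothetical data: two categorical columns, 100 rows in total.
combos <- expand.grid(group = c("a", "b"), shade = c("light", "dark"),
                      stringsAsFactors = FALSE)
toy <- combos[rep(1:4, times = c(50, 30, 15, 5)), ]

# 100 rows, but a plain scatter plot would draw them on only 4 positions.
n_rows <- nrow(toy)
n_positions <- nrow(unique(toy))

# The counts per position are exactly what the plain scatter plot hides.
counts <- table(toy$group, toy$shade)

print(c(rows = n_rows, positions = n_positions))
print(counts)
```

Jitter adds a small random offset to each point so that these hidden counts become visible as clouds of different density.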

**Exercise 13:** Plot a scatter plot of **gender** vs. **eyecolor** with **jitter** geometry. 

In [None]:
# Your Code Here
# ------------------------


It's intuitive to use the **color** visual variable for the **eyecolor** attribute.

**Exercise 14:** Plot a scatter plot of **gender** vs. **eyecolor** and use **color** for the **eyecolor** attribute. Adjust width and height of **jitter** suitably and add transparency. 

In [None]:
options(repr.plot.width=11, repr.plot.height=11)

In [None]:
# Your Code Here
# ------------------------


The colors should be intuitive, so we will set them manually with **scale_color_manual()**, using the color values
**c("blue", "chocolate4", "green4", "#595c26", "black")**.

For Python users: the `chocolate4` color above is not available. Please use the following customized color palette instead:
**\["blue", "chocolate", "green", "#595c26", "black"\]**.

**Exercise 15:** Add manual colors to the above plot.

In [None]:
# Your Code Here
# ------------------------
# Note: black is not an intuitive color for the category "other", because
# some readers (at least 30% of the world population, based on
# 2018 UN data) would read it as meaning black eyes.
# So instead of black, I use yellow, an eye color that usually occurs only with illness.


**Exercise 16:** Finally, add the **shape** visual variable to encode the **exercise** attribute. **Does it work well? Why?**


In [None]:
# Your Code Here
# ------------------------


**Answer whether the shape visual variable works well and why:**

**Exercise 17:** Plot a scatter plot of **gender** vs. **height** with **small multiples** for **exercise**. 

**Use the techniques you applied in the above exercises. Use intuitive visual variables for the attributes.** 


In [None]:
options(repr.plot.width=10, repr.plot.height=9)

In [None]:
# Your Code Here
# ------------------------


-----

Let's read the Gapminder data from the web resource.

In [None]:
library(plotly)
library(RColorBrewer)

# Read the world population with life expectancy and GDP per capita by each country
world_pop <- read.csv(
    paste(
        "https://raw.githubusercontent.com", 
        "plotly/datasets/master", 
        "gapminderDataFiveYear.csv", 
        sep="/"
    )
)
head(world_pop)

**Exercise 18:** Create a subset of the data between the years 1951 and 1993. Then display the rows for the United States using `country=='United States'`.

In [None]:
# Your Code Here
# ------------------------


**Exercise 19:** Plot **small multiples of histograms** of **life expectancy for each year** for the subset. Use a binwidth of **5**, and use sensible axis labels.

In [None]:
options(repr.plot.width=12.8, repr.plot.height=9.6)

In [None]:
# Your Code Here
# ------------------------


**Exercise 20:** (1) Do the same as above, but plot a density function this time. (2) How do you interpret the change in density over the years?

In [None]:
# Your code for (1) here
# --------------------------------------


**Answer how to interpret**: 

**Exercise 21:** Create a line plot (use both geom_line and geom_point) of year versus population for the **whole** data set. Use a logarithmic scale on the **y axis**, **group by country**, and **color by continent**. Can you see any pattern?

You may want to use the `latex2exp` library here. Use the next code cell to install it.

In [None]:
install.packages('latex2exp')

In [None]:
# Your Code Here
# ------------------------------------


_**Can you see any pattern?**_

**Answer here:**

---

**Aggregate data:** The above plot is too crowded to read. Let's aggregate the data by continent and year so we can plot something meaningful. The following code creates a new data frame by summing the population for each continent and year.

In [None]:
aggdata <- world_pop %>%
    group_by(continent, year) %>%
    summarise(
        total.population = sum(pop, na.rm=TRUE)
    ) %>%
    arrange(year, continent)
head(aggdata)
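
To see what this `group_by()` plus `summarise()` pattern does, here is a minimal sketch on a tiny made-up table (the countries, continents, and numbers are hypothetical). It assumes dplyr 1.0 or later, where `summarise()` accepts `.groups = "drop"` to return an ungrouped result without the grouping message; the cell above omits that argument.

```r
library(dplyr)

# Hypothetical rows shaped like the Gapminder columns used above.
toy <- data.frame(
    country   = c("A", "B", "C", "A", "B", "C"),
    continent = c("X", "X", "Y", "X", "X", "Y"),
    year      = c(2000, 2000, 2000, 2005, 2005, 2005),
    pop       = c(10, 20, 5, 12, NA, 6)
)

# One output row per (continent, year); na.rm = TRUE skips the missing value.
toy_agg <- toy %>%
    group_by(continent, year) %>%
    summarise(total.population = sum(pop, na.rm = TRUE), .groups = "drop") %>%
    arrange(year, continent)

print(toy_agg)
```

Six country-level rows collapse into four continent-level rows, which is the same reduction the cell above applies to the full data set.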

**Exercise 22:** Now (1) **repeat Exercise 21 with this aggregate data** and group and color by continent. (2) Do you see a pattern? 

In [None]:
# Your Code Here
# ------------------------


**Answer (do you see a pattern?):**

**Exercise 23:** Now, plot a **stacked area chart** of the same data. Instead of group and color, use only the **fill** parameter for **continent**, and use **geom_area**.

In [None]:
# Your Code Here
# ------------------------


---

**Find percentages:** The above plot shows actual population numbers, which grow over time. We want to see each continent's share of the total world population, as a percentage, and how that share changes over the years. The code below computes that.

In [None]:
# Using dplyr
aggdata <- world_pop %>%
    group_by(year) %>%
    summarise(world.pop=sum(pop, na.rm=TRUE)) %>%
    right_join(world_pop, by="year") %>%
    group_by(continent, year) %>%
    select(continent, country, year, pop, world.pop) %>%
    summarise(
        total.pop=sum(pop, na.rm=TRUE),
        world.pop=mean(world.pop, na.rm=TRUE)        
    ) %>%
    mutate(percent.pop=(total.pop/world.pop)*100) %>%
    arrange(year, continent)
head(aggdata)
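
A useful sanity check for a table of percentages like this is that the shares within each year add up to 100. The sketch below shows the idea on made-up totals (hypothetical continents and numbers) and computes the shares with a grouped `mutate()`, a simpler route than the join above, used here only for illustration.

```r
library(dplyr)

# Hypothetical continent totals for two years.
toy <- data.frame(
    continent = c("X", "Y", "Z", "X", "Y", "Z"),
    year      = c(2000, 2000, 2000, 2005, 2005, 2005),
    total.pop = c(30, 50, 20, 40, 40, 20)
)

# Share of each continent within its year.
toy_pct <- toy %>%
    group_by(year) %>%
    mutate(
        world.pop   = sum(total.pop),
        percent.pop = total.pop / world.pop * 100
    ) %>%
    ungroup()

# The shares within every year should sum to 100.
check <- toy_pct %>%
    group_by(year) %>%
    summarise(percent.total = sum(percent.pop))

print(check)
```

The same kind of per-year sum could be computed on `aggdata` to confirm the percentages before plotting them.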

**Exercise 24:** Now, **plot the same as Exercise 23** but use **percent.pop** for the y axis.

In [None]:
# Your Code Here
# ------------------------


**Exercise 25:** We will aggregate once more; this time we will compute the **mean GDP per capita for continents and years**. **It's your turn this time.**

In [None]:
# Your Code Here
# ------------------------


**Exercise 26:** Plot a **heatmap** using the **plot_ly** function for **years** vs. **continents**, using the **mean GDP per capita as the z value**.

In [None]:
# Load Viridis library
library(viridis)

In [None]:
# Your Code Here
# ------------------------



**Exercise 27:** Plot a **boxplot** of **GDP per capita** by **continent** using the **plot_ly** function.

Use the **whole** data set, **color by continent**, and make sure the **y-axis is in log scale**. 

**When hovering over the data, what do you notice about the first and third quartiles for each continent? (Hint: think of income inequality.)**

In [None]:
# Your Code Here
# ------------------------


**Answer (your thoughts on the first and third quartiles for each continent):**

### Please write your answers and run all code cells so that every visualization is displayed.
### Save your work. (File->Save Notebook, Ctrl+S or Command+S)
### After saving your notebook, please submit it to Blackboard.