# Data Exploration in R 

Marta Sestelo (msestelo@gradiant.org)

----------------




## Table of contents


1. [Data wrangling](#wrangling)
    - [Data import](#import)
    - [Data structure](#structure)
2. [Exploratory Data Analysis](#EDA)
    - [Dealing with missing values](#NA)
    - [Exploring the variation of my variables](#variation)
    - [Exploring the covariation between my variables](#covariation)
    - [Data visualization](#plots)
   

Before starting, it is necessary to load the required packages. `ggplot2`, `dplyr` and `readr` are all included in the [`tidyverse`](https://www.tidyverse.org/) package collection, so you can load just this package. 

Note: if you are using a Jupyter notebook, you have to load the packages one by one.


In [None]:
# library(tidyverse)
library(dplyr)
library(ggplot2)
library(readr)

<a id="wrangling"></a>

# Data wrangling

This is the art of getting your data into R in a useful form for visualisation and modelling. Data wrangling is very important: without it you can’t work with your own data! 

This phase is divided into data import and the idea of tidy data (how you can organise your data in R).




<a id="import"></a>

### Data import 

In this part, we are going to learn how to get your data from disk into R. There are several functions for loading data in R, and you should choose the one that best fits the data format. If you’re looking for raw speed with a big dataset, try `data.table::fread()`. For hierarchical data such as JSON files, you can use the `jsonlite` package.
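As a minimal sketch of how the choice of reader follows the data format, the cell below parses the same two records from CSV text and from JSON text. The inline strings and the object names (`csv_text`, `json_text`, `from_csv`, `from_json`) are hypothetical stand-ins for files on disk, and the `I()` wrapper and `show_col_types` argument assume `readr` 2.0 or later.

```r
library(readr)
library(jsonlite)

# Hypothetical inline data standing in for files on disk
csv_text <- "id,room_type,price\n1,Private room,50\n2,Entire home/apt,120\n"
json_text <- '[{"id": 1, "price": 50}, {"id": 2, "price": 120}]'

# I() tells readr to treat the string as literal data rather than a file path
from_csv <- read_csv(I(csv_text), show_col_types = FALSE)

# fromJSON() simplifies an array of records into a data frame
from_json <- fromJSON(json_text)

dim(from_csv)    # rows and columns read from the CSV text
names(from_json) # column names recovered from the JSON keys
```

With real files you would pass the path instead of the inline text, as in the `read_csv("airbnb.csv")` call used in this notebook.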


In [None]:
# Download if needed
# download.file("http://data.insideairbnb.com/spain/catalonia/barcelona/2018-09-11/visualisations/listings.csv", 
#             "airbnb.csv")

In [None]:
data <- read_csv("airbnb.csv") 

In [None]:
# data <- as_tibble(data) # not needed here: read_csv() already returns a tibble (a data frame that prints compactly, handy for big datasets)

<a id="structure"></a>

### Data structure

Now we will have a look at the data: its size and its structure. 

In [None]:
dim(data) 

In [None]:
names(data)

In [None]:
str(data)

In [None]:
head(data)

In [None]:
class(data)

Next, you can see some examples of how to select data from a `data.frame`:

In [None]:
data[1, ] # first row 

In [None]:
data[1:3, ] # first three rows

In [None]:
data[, 1] # first column

In [None]:
data[, 2:5] # from column 2 to 5 

In [None]:
data[, "id"] # select column by name

In [None]:
data$id[1:10] # select column by name (another way)

In [None]:
head(data, 6) 

In [None]:
tail(data, 2) 

In [None]:
first(data$host_name) #  first position of a vector

In [None]:
last(data$host_name) # last position of a vector
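One detail worth keeping in mind with the indexing calls above: `read_csv()` returns a tibble, and single-bracket indexing on a tibble always returns another tibble, whereas a base `data.frame` drops a single column to a vector. The sketch below illustrates the difference on a hypothetical toy table (`toy`, `toy_df` and the host names are made up for illustration).

```r
library(dplyr)

# Hypothetical toy data; tibble() is re-exported by dplyr
toy <- tibble(id = 1:3, host_name = c("Ana", "Biel", "Carla"))
toy_df <- as.data.frame(toy)

class(toy[, "id"])     # still a tibble with one column
class(toy_df[, "id"])  # a plain integer vector

toy[["id"]]            # extract the column itself as a vector
pull(toy, host_name)   # dplyr equivalent of $ or [[ ]]
```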

<a id="EDA"></a>

# Exploratory Data Analysis


The goal during Exploratory Data Analysis (EDA) is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

* What type of variation occurs within my variables?

* What type of covariation occurs between my variables?






**Some remarks**:
* A *variable* is a quantity, quality, or property that you can measure.

* A *value* is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

* An *observation* is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a data point.



Often you’ll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with. The sections below show some examples.



<a id="NA"></a>

### Dealing with missing values

Before going any further, note that a common task in data analysis is dealing with missing values. In R, missing values are usually represented by `NA`, although some datasets use a sentinel value instead (e.g. `99`).

The first step is to identify these values. To this end, you can use `is.na()`, which returns a logical vector with `TRUE` at the positions that contain `NA`. `is.na()` works on vectors, lists, matrices, and data frames.

In [None]:
data %>% 
  is.na() %>% 
  colSums()

# colSums(is.na(data)) 

In [None]:
#load(file(source_data("http://dl.dropboxusercontent.com/u/25710348/Blogposts/data/IL2010.Rda")))

Another option to see the number of `NA`s is to apply the `summary()` function, which produces numerical summaries for each of the variables included in the data frame.

In [None]:
summary(data)

Now, we can delete these observations or we can recode them.

In order to recode missing values, or specific indicators that represent missing values, we can use normal subsetting and assignment operations. For example, we can replace the missing values in a vector `x` with the mean of `x` by first subsetting the vector to identify the `NA`s and then assigning a value to these elements. 

In [None]:
(x <- c(1:4, NA, 6:7, NA))

In [None]:
x[is.na(x)] <- mean(x, na.rm = TRUE)

In [None]:
round(x, 2)
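The same subsetting idiom also covers sentinel codes such as the `99` mentioned earlier. The following is a small sketch on a hypothetical vector `y`, where `99` is assumed to mean "missing"; it converts the sentinel into a real `NA` so that `is.na()` and `na.rm = TRUE` behave as expected.

```r
# Hypothetical vector in which 99 encodes "missing"
y <- c(12, 99, 7, 99, 3)

# Turn the sentinel into a real NA so that is.na() and na.rm can see it
y[y == 99] <- NA

sum(is.na(y))          # number of missing values
mean(y, na.rm = TRUE)  # mean of the observed values only
```

If you prefer a `dplyr` verb, `na_if(y, 99)` performs the same replacement.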

In our analysis, we are going to exclude these cases directly. To this end, we can use the `na.omit()` function.

In [None]:
# data <- data[complete.cases(data), ] # another option
data <- na.omit(data)
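Keep in mind that `na.omit()` removes every row that has an `NA` in *any* column, which can discard more observations than intended. The sketch below uses a hypothetical toy data frame (`toy`, with made-up values) to compare it with dropping rows based on a single column.

```r
# Hypothetical toy data with missing values in two different columns
toy <- data.frame(
  id      = 1:4,
  price   = c(50, NA, 80, 120),
  reviews = c(10, 3, NA, 7)
)

complete    <- na.omit(toy)               # drops rows 2 and 3
price_known <- toy[!is.na(toy$price), ]   # drops only row 2

nrow(complete)
nrow(price_known)
```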

<a id="variation"></a>

### Exploring the variation of my variables

At this stage, we'll see some examples of how to analyze a single variable, paying attention to its type (continuous or categorical). For a continuous variable, it is common to use the `summary()` function; for a categorical variable, we can obtain the frequencies of each of its levels.

In [None]:
summary(data$price) # continuous

In [None]:
summary(data$room_type) # categorical; for a character column this only reports length, class and mode

In [None]:
table(data$room_type) # absolute frequencies

In [None]:
prop.table(table(data$room_type)) # relative frequencies
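Because `read_csv()` reads text columns as character vectors, `summary()` on such a column is not very informative. A possible workaround, sketched here on a hypothetical vector of room types, is to convert the column to a factor first, so that `summary()` reports one count per level.

```r
# Hypothetical character vector standing in for data$room_type
room_type <- c("Private room", "Entire home/apt", "Private room", "Shared room")

summary(room_type)           # length, class and mode only

room_factor <- factor(room_type)
summary(room_factor)         # one count per level
levels(room_factor)          # the distinct categories, sorted alphabetically
```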

<a id="covariation"></a>

### Exploring the covariation between my variables 

Now we'll try to understand the relationship between some variables and obtain results for combinations of them. We'll use some very useful functions from the `dplyr` package. The first one is `filter()`, which allows you to subset observations based on their values. 

Note: `dplyr` executes the filtering operation and returns a new data frame. `dplyr` functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, `<-`.

In [None]:
data %>%
  filter(neighbourhood_group == "Ciutat Vella",
         room_type == "Private room")

# ciutat_vella_priv_room <- data %>%
#  filter(neighbourhood_group == "Ciutat Vella",
#         room_type == "Private room")

In [None]:
# data %>%
#  filter(neighbourhood_group == "Ciutat Vella"| neighbourhood_group == "Gràcia" ) # logic operators

# data %>%
#  filter(neighbourhood_group %in% c("Ciutat Vella","Gràcia" )) # logic operators
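To make the two points above concrete, here is a small sketch on a hypothetical toy table (`toy`, with made-up prices): the `|` and `%in%` forms select the same rows, and the input is left untouched unless you assign the result.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  neighbourhood_group = c("Ciutat Vella", "Eixample", "Sants", "Eixample"),
  price = c(60, 45, 90, 70)
)

with_or <- toy %>%
  filter(neighbourhood_group == "Ciutat Vella" | neighbourhood_group == "Eixample")

with_in <- toy %>%
  filter(neighbourhood_group %in% c("Ciutat Vella", "Eixample"))

all.equal(with_or, with_in)  # both forms keep the same rows
nrow(toy)                    # the original table still has all its rows
```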


The function `arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:


In [None]:
data %>% 
  arrange(price, desc(number_of_reviews))
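The tie-breaking behaviour is easier to see on a tiny example. In this sketch (hypothetical data, with two pairs of listings sharing the same price), rows are sorted by increasing `price`, and listings with equal prices are ordered by decreasing `number_of_reviews`.

```r
library(dplyr)

# Hypothetical toy data with tied prices
toy <- tibble(
  name = c("a", "b", "c", "d"),
  price = c(30, 20, 20, 30),
  number_of_reviews = c(5, 1, 8, 12)
)

sorted <- toy %>%
  arrange(price, desc(number_of_reviews))

sorted$name  # "c" "b" "d" "a"
```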

Besides selecting sets of existing columns, sometimes it's very useful to add new columns that are obtained from existing columns. That’s the job of `mutate()`.

In [None]:
data <- data %>% 
  mutate(ri_price = price > quantile(price, probs = 0.25) &
           price < quantile(price, probs = 0.75)) 

In [None]:
head(data)

`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.


In [None]:
data %>% 
    select(name, price)  %>% 
    head()

#data %>% select(price, ri_price)

In order to answer some questions, you'll need to use more than one function at once. For example, how many distinct `neighbourhood` are there in `Ciutat Vella`? 


In [None]:
data %>%
    filter(neighbourhood_group == "Ciutat Vella") %>% 
    select(neighbourhood) %>% 
    unique()

In [None]:
# pay attention!
 data %>%
#    filter(neighbourhood_group == "Ciutat Vella") %>% 
    select(neighbourhood) %>% 
    unique() 
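The two cells above list the distinct neighbourhoods; without the `filter()` step you get those of the whole city rather than those of one district. If you want the count itself rather than the list, one option is `n_distinct()`. The sketch below shows this on a hypothetical toy table (the group/neighbourhood pairs are only illustrative).

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  neighbourhood_group = c("Ciutat Vella", "Ciutat Vella", "Ciutat Vella", "Eixample"),
  neighbourhood = c("el Raval", "el Raval", "la Barceloneta", "el Fort Pienc")
)

# Number of distinct neighbourhoods inside one group
in_group <- toy %>%
  filter(neighbourhood_group == "Ciutat Vella") %>%
  summarise(n_neighbourhoods = n_distinct(neighbourhood))

# Without the filter, neighbourhoods from every group are counted
overall <- toy %>%
  summarise(n_neighbourhoods = n_distinct(neighbourhood))

in_group$n_neighbourhoods
overall$n_neighbourhoods
```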


Finally,  together `group_by()` and `summarise()` provide one of the tools that you’ll use most commonly when working with `dplyr`: **grouped summaries**. 

In [None]:
data %>% 
  group_by(neighbourhood_group) %>% 
  summarise(mean = mean(price))


In [None]:
# adding the number of listings, min and max
data %>% 
  group_by(room_type) %>% 
  summarise(n = n(), mean = mean(price), 
            min = min(price), max = max(price))

In [None]:
# the same but nested grouping (by neighbourhood_group and  type of room)

data %>% 
  group_by(neighbourhood_group, room_type) %>% 
  summarise(mean = mean(price), n = n()) 


In [None]:
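# now we add the relative frequencies (summarise() drops the last grouping level,
# so the proportions are computed within each neighbourhood_group)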

data %>% 
  group_by(neighbourhood_group, room_type) %>% 
  summarise(mean = mean(price), n = n()) %>% 
  mutate(prop = prop.table(n))
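It is worth checking what the proportions in the last cell refer to. After `summarise()`, the result is still grouped by the first grouping variable, so the proportions add up to 1 within each neighbourhood group rather than over the whole table. The sketch below verifies this on hypothetical toy data; the explicit `.groups = "drop_last"` argument (available in `dplyr` 1.0 and later) only spells out that behaviour.

```r
library(dplyr)

# Hypothetical toy data: two groups, two room types
toy <- tibble(
  neighbourhood_group = c("A", "A", "A", "B", "B"),
  room_type = c("Private room", "Private room", "Entire home/apt",
                "Private room", "Entire home/apt"),
  price = c(40, 60, 100, 50, 150)
)

shares <- toy %>%
  group_by(neighbourhood_group, room_type) %>%
  summarise(mean = mean(price), n = n(), .groups = "drop_last") %>%
  mutate(prop = n / sum(n))   # share of each room type within its group

# Still grouped by neighbourhood_group: each group's shares sum to 1
totals <- shares %>%
  summarise(total = sum(prop))

totals
```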

<a id="plots"></a>

### Data visualization

Finally, we will see how to visualise your data using `ggplot2`. R has several systems for making graphs, but `ggplot2` is one of the most elegant and most versatile packages.


#### Univariate analysis

The way that you visualize the distribution of a variable depends on the type of variable you have: categorical or continuous. A variable is **categorical** if it can only take one of a small set of values. In R, categorical variables are usually saved as factors or character vectors. To examine the distribution of a categorical variable, use a bar chart:




In [None]:
options(repr.plot.width = 8, repr.plot.height = 4)

In [None]:
qplot(neighbourhood_group, data = data)


A variable is **continuous** if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. In order to see the distribution of a continuous variable, you can use a histogram or a density plot:
  

In [None]:
qplot(x = price, data = data) # it seems that there are outliers

In [None]:
data_filtered <- data %>% 
  filter(price < 200)

In [None]:
qplot(price, data = data_filtered, geom = "histogram")

In [None]:
qplot(y = price, x = "", data = data_filtered, geom = "boxplot")
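The `qplot()` shortcut used above has been deprecated since `ggplot2` 3.4.0, so you may prefer the full `ggplot()` syntax. The sketch below gives rough equivalents of the univariate plots, plus the density plot mentioned earlier, on a hypothetical toy data frame (`toy` and the plot object names are made up; the number of bins is an arbitrary choice).

```r
library(ggplot2)

# Hypothetical toy data standing in for the listings
toy <- data.frame(
  neighbourhood_group = c("A", "A", "B", "B", "B", "C"),
  price = c(40, 55, 60, 80, 120, 150)
)

# Bar chart for a categorical variable
p_bar <- ggplot(toy, aes(x = neighbourhood_group)) +
  geom_bar()

# Histogram and density plot for a continuous variable
p_hist <- ggplot(toy, aes(x = price)) +
  geom_histogram(bins = 10)

p_dens <- ggplot(toy, aes(x = price)) +
  geom_density()

# Single boxplot of a continuous variable
p_box <- ggplot(toy, aes(x = "", y = price)) +
  geom_boxplot()

p_bar  # printing a plot object draws it
```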

#### Multivariate analysis

From now on, we are going to describe how variables behave together ("covariation"), that is, the tendency for the values of two or more variables to vary together in a related way. The best way to spot it is to visualise the relationship between two or more variables. How you do that again depends on the type of variables involved.

If you want to explore the distribution of a **continuous** variable broken down by a **categorical** variable, you can use some of these plots:

In [None]:
library(ggridges)
data_filtered %>% 
    select(price, room_type) %>%
    ggplot(aes(x = price, y = room_type, fill = room_type)) + 
    geom_density_ridges(scale = 3, alpha = 0.7)

In [None]:
data_filtered  %>% 
    filter(neighbourhood_group == "Gràcia")  %>% 
    qplot(y = price, x = neighbourhood, data = ., geom = "boxplot", fill = neighbourhood)
    

In [None]:
qplot(y = price, x = neighbourhood_group, data = data_filtered, geom = "boxplot", fill = room_type)




To visualise the relation between **categorical variables**, you’ll need to count the number of observations for each combination. One way to do that is to rely on the built-in `geom_count()`:

In [None]:
ggplot(data = data) +
  geom_count(mapping = aes(x = neighbourhood_group, y = neighbourhood))

In [None]:
data %>% 
  count(neighbourhood_group, room_type) %>%  
  ggplot(mapping = aes(x = neighbourhood_group, y = room_type)) +
    geom_tile(mapping = aes(fill = n)) #heat map

Note: if the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns.

One  way to visualise the covariation between two continuous variables is to draw a scatterplot with `geom_point()`. Maybe you can see a pattern in the points. 

In [None]:
qplot(x = price, y = number_of_reviews, data = data_filtered, geom = "point")

In [None]:
qplot(x = price, y = number_of_reviews, data = data_filtered, geom = c("point", "smooth"))

In [None]:
qplot(x = price, y = number_of_reviews, color = room_type, data = data_filtered, geom = c("point"))

In [None]:
library(leaflet)
library(htmlwidgets)

In [None]:
extract <- data %>% 
  select(latitude, longitude, host_name, name, neighbourhood_group)  %>%
    sample_n(50)
  

In [None]:
aux <- leaflet(data = extract) %>% addTiles() %>%
 addMarkers(~longitude, ~latitude, popup = ~as.character(host_name), label = ~as.character(name))

In [None]:
htmlwidgets::saveWidget(aux, "widget.html")

<br/><br/><br/><br/><br/><br/><div style="text-align: right">Material mainly based on the book **[R for Data Science](http://r4ds.had.co.nz/index.html)** (Grolemund and Wickham, 2017).</div>