
<p align="center">
<img src="https://github.com/datacamp/r-live-training-template/blob/master/assets/datacamp.svg?raw=True" alt = "DataCamp icon" width="50%">
<br>
<h1 align="center"><b>Cleaning Data in R Live Training</b></h1>
</p>
<br>


Welcome to this hands-on training where you'll identify issues in a dataset and clean it from start to finish using R. It's often said that data scientists spend 80% of their time cleaning and manipulating data and only about 20% of their time analyzing it, so cleaning data is an important skill to master!

In this session, you will:

- Examine a dataset and identify its problem areas, and what needs to be done to fix them.
- Convert between data types to make analysis easier.
- Correct inconsistencies in categorical data.
- Deal with missing data.
- Perform data validation to ensure every value makes sense.

## **The Dataset**

The dataset we'll use is a CSV file named `nyc_airbnb.csv`, which contains data on [*Airbnb*](https://www.airbnb.com/) listings in New York City. It contains the following columns:

- `listing_id`: The unique identifier for a listing
- `name`: The description used on the listing
- `host_id`: Unique identifier for a host
- `host_name`: Name of host
- `nbhood_full`: Name of borough and neighborhood
- `coordinates`: Coordinates of listing _(latitude, longitude)_
- `room_type`: Type of room 
- `price`: Price per night for listing
- `nb_reviews`: Number of reviews received 
- `last_review`: Date of last review
- `reviews_per_month`: Average number of reviews per month
- `availability_365`: Number of days available per year
- `avg_rating`: Average rating (from 0 to 5)
- `avg_stays_per_month`: Average number of stays per month
- `pct_5_stars`: Percent of reviews that were 5-stars
- `listing_added`: Date when listing was added


In [0]:
# Install non-tidyverse packages
install.packages("visdat")

In [0]:
# Load packages
library(readr)
library(dplyr)
library(stringr)
library(visdat)
library(tidyr)
library(ggplot2)
library(forcats)

In [0]:
# Load dataset
airbnb <- read_csv("https://raw.githubusercontent.com/datacamp/cleaning-data-in-r-live-training/master/assets/nyc_airbnb.csv")

In [0]:
# Examine the first few rows


## **Diagnosing data cleaning problems**


We'll need to get a good look at the data frame in order to identify any problems that may cause issues during an analysis. There are a variety of functions (both from base R and `dplyr`) that can help us with this:

- `head()` to look at the first few rows of the data
- `glimpse()` to get a summary of the variables' data types
- `summary()` to compute summary statistics of each variable and display the number of missing values
- `duplicated()` to find duplicates
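
As a minimal sketch of these four functions, the example below runs them on a small made-up data frame. The `toy` tibble and its values are hypothetical stand-ins, not rows from `nyc_airbnb.csv`.

```r
library(dplyr)

# Hypothetical toy data standing in for the real listings
toy <- tibble(
  listing_id = c(1, 2, 2, 3),
  price      = c("$100", "$85", "$85", "$NA"),
  avg_rating = c(4.5, NA, NA, 5.2)
)

head(toy)     # first few rows
glimpse(toy)  # one line per column, with its data type
summary(toy)  # summary statistics; NA counts are shown for numeric columns

# duplicated() flags every repeat of a value that has already appeared
dup_flags <- duplicated(toy$listing_id)
n_dup <- sum(dup_flags)
n_dup
```

Note that `duplicated()` marks only the second and later occurrences, so the first copy of a repeated `listing_id` is not counted.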


In [0]:
# Print the first few rows of data


In [0]:
# Inspect data types


In [0]:
# Examine summary statistics and missing values


In [0]:
# Count data with duplicated listing_id


*A note on the `%>%` operator:*

This is an operator commonly used in the Tidyverse to make code more readable. The `%>%` takes the result of whatever is before it and inserts it as the first argument in the subsequent function.

We could perform the same counting operation with the nested call below, but nested calls have to be read from the inside out, which makes the order of execution hard to follow. The `%>%` lets us write the functions in the order they're executed.
```r
count(filter(airbnb, duplicated(listing_id)))
```
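
For comparison, here is a small sketch that writes the same count both ways on a hypothetical `toy` tibble (the data is made up for illustration) so the two forms can be read side by side.

```r
library(dplyr)

# Hypothetical data: listing_id 2 appears twice and 3 appears three times
toy <- tibble(listing_id = c(1, 2, 2, 3, 3, 3))

# Nested form: read from the innermost call outward
nested <- count(filter(toy, duplicated(listing_id)))

# Piped form: read top to bottom, in execution order
piped <- toy %>%
  filter(duplicated(listing_id)) %>%
  count()

nested
piped
```

Both versions return a one-row tibble with a single column, `n`, holding the number of repeated rows.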

## **What do we need to do?**

**Data type issues**
- **Task 1:** Split `coordinates` into latitude and longitude and convert to `numeric` data type.
- **Task 2:** Remove `$`s from `price` column and convert to `numeric`.
- **Task 3:** Convert `last_review` and `listing_added` to `Date`.

**Text & categorical data issues**
- **Task 4:** Split `nbhood_full` into separate neighborhood and borough columns.
- **Task 5:** Collapse the categories of `room_type` so that they're consistent.

**Data range issues**
- **Task 6:** Fix the `avg_rating` column so it doesn't exceed `5`.

**Missing data issues**
- **Task 7:** Further investigate the missing data and decide how to handle them.

**Duplicate data issues**
- **Task 8:** Further investigate duplicate data points and decide how to handle them.

***But also...***
- We need to validate our data using various sanity checks

---
<center><h1><b>Q&A</b></h1></center>

---

## **Cleaning the data**


### **Data type issues**


In [0]:
# Reminder: what does the data look like?


#### **Task 1:** Split `coordinates` into latitude and longitude and convert to `numeric` data type.


- `str_remove_all()` removes all instances of a substring from a string.
- `str_split()` will split a string into multiple pieces based on a separation string.
- `as.data.frame()` converts an object into a data frame. In R versions before 4.0.0, it automatically converts any strings to `factor`s, which is not what we want in this case, so we'll set `stringsAsFactors = FALSE` to rule this out (it is already the default from R 4.0.0 onward).
- `rename()` takes arguments of the format `new_col_name = old_col_name` and renames the columns as such.
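
One way these functions could be chained is sketched below. It assumes each coordinate string looks like `"(40.63222, -73.93398)"`, which is a guess based on the column description. The `coords` vector and the column names `latitude` and `longitude` are illustrative choices.

```r
library(dplyr)
library(stringr)

# Hypothetical coordinate strings in an assumed "(lat, lon)" format
coords <- c("(40.63222, -73.93398)", "(40.78761, -73.96862)")

lat_lon <- coords %>%
  str_remove_all("[()]") %>%                      # strip both parentheses
  str_split(", ", simplify = TRUE) %>%            # character matrix, one row per string
  as.data.frame(stringsAsFactors = FALSE) %>%     # columns are named V1 and V2
  rename(latitude = V1, longitude = V2) %>%       # new_col_name = old_col_name
  mutate(latitude  = as.numeric(latitude),
         longitude = as.numeric(longitude))

lat_lon
```

Using `simplify = TRUE` makes `str_split()` return a matrix instead of a list, which is what lets `as.data.frame()` produce one column per piece.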

In [0]:
# Create lat_lon columns


- `cbind()` stands for column bind, which sticks two data frames together horizontally.

<img src="https://raw.githubusercontent.com/datacamp/cleaning-data-in-r-live-training/master/assets/cbind.png" width="500px;"/>
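
A minimal sketch of `cbind()` on two small, made-up data frames follows. The names `listings` and `lat_lon` and their values are hypothetical.

```r
# Hypothetical data frames with the same number of rows, in the same order
listings <- data.frame(listing_id = c(101, 102))
lat_lon  <- data.frame(latitude  = c(40.63222, 40.78761),
                       longitude = c(-73.93398, -73.96862))

# cbind() matches rows by position, not by any key column
combined <- cbind(listings, lat_lon)
combined
```

Because rows are paired purely by position, this only works when both data frames have the same number of rows and have not been reordered.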

In [0]:
# Assign it to dataset


#### **Task 2:** Remove `$`s from `price` column and convert to `numeric`.

In [0]:
# Remove $ and convert to numeric


Notice we get a warning here that values are being converted to `NA`, so before we move on, we need to look into this further to ensure that the values are actually missing and we're not losing data by mistake.

Let's take a look at the values of `price`.


In [0]:
# Look at values of price


It looks like we have a non-standard representation of `NA` here, `$NA`, so these are getting coerced to `NA`s. This is the behavior we want, so we can ignore the warning.
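
The check described above can be sketched on a short made-up vector. The values in `price_raw` are hypothetical, and the sketch assumes prices carry a single leading `$` and no thousands separators.

```r
library(dplyr)
library(stringr)

# Hypothetical raw prices, including the non-standard "$NA"
price_raw <- c("$45", "$135", "$NA")

price_clean <- price_raw %>%
  str_remove("\\$") %>%   # "$" is a regex metacharacter, so it is escaped
  as.numeric()

# Which raw strings ended up as NA after conversion?
lost <- price_raw[is.na(price_clean)]
lost
```

If `lost` contains only placeholders such as `"$NA"`, no real prices were dropped by the conversion.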

In [0]:
# Add to data frame


#### **Task 3:** Convert `last_review` and `listing_added` to `Date`.

Conversion to `Date` is done using `as.Date()`, which takes in a `format` argument. The `format` argument allows us to convert lots of different formats of dates to a `Date` type, like "January 1, 2020" or "01-01-2020". There are special symbols that we use to specify this. Here are a few of them, but you can find all the possible ones by typing `?strptime` into your console.

<img src="https://raw.githubusercontent.com/datacamp/cleaning-data-in-r-live-training/master/assets/date_formats.png" alt="%d = day number, %m = month number, %Y = 4 digit year, %y = 2 digit year, %B = month, %b = month abbreviation" width="250px;"/>

A date like "21 Oct 2020" would be in the format `"%d %b %Y"`.
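
As a short sketch of the `format` argument, the example below parses two made-up date strings. The `"%m/%d/%Y"` layout is only an assumption for illustration, so check the actual layout of `last_review` and `listing_added` before reusing it.

```r
# Month/day/year with numeric fields (assumed layout, for illustration only)
d1 <- as.Date("01/15/2019", format = "%m/%d/%Y")

# Abbreviated month name; %b depends on the session locale,
# so this line assumes English month names
d2 <- as.Date("21 Oct 2020", format = "%d %b %Y")

format(d1, "%Y-%m-%d")  # "2019-01-15"
class(d1)               # "Date"
```

If the format string does not match the input, `as.Date()` returns `NA` rather than raising an error, so it is worth checking for new `NA`s after converting.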


In [0]:
# Look up date formatting symbols


In [0]:
# Examine first rows of date columns


In [0]:
# Convert strings to Dates



---
<center><h1><b>Q&A</b></h1></center>

---

### **Text & categorical data issues**


#### **Task 4:** Split `nbhood_full` into separate `nbhood` and `borough` columns.

In [0]:
# Split borough and neighborhood

In [0]:
# Assign to airbnb

#### **Task 5:** Collapse the categories of `room_type` so that they're consistent.

In [0]:
# Count categories of room_type


- `stringr::str_to_lower()` converts strings to all lowercase, so `"PRIVATE ROOM"` becomes `"private room"`. This saves us the pain of having to go through the dataset and find each different capitalized variation of `"private room"`.
- `forcats::fct_collapse()` will combine multiple categories into one, which is useful when there are a few different values that mean the same thing.
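
Here is a sketch of the two functions working together. The raw values in `room_type` and the new category names (`private_room`, `entire_place`, `shared_room`) are hypothetical, since the real variations should come from counting the column first.

```r
library(dplyr)
library(stringr)
library(forcats)

# Hypothetical inconsistent labels
room_type <- c("Private room", "PRIVATE ROOM", "Private",
               "Entire home/apt", "home", "Shared room")

room_clean <- room_type %>%
  str_to_lower() %>%                               # removes capitalization differences
  fct_collapse(
    private_room = c("private", "private room"),  # new_level = c(old levels)
    entire_place = c("entire home/apt", "home"),
    shared_room  = c("shared room")
  )

table(room_clean)
```

Lowercasing first means each group in `fct_collapse()` only has to list spelling variants, not every capitalization of them.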

In [0]:
# Collapse categorical variables

In [0]:
# Add to data frame

---
<center><h1><b>Q&A</b></h1></center>

---

### **Data range issues**

#### **Task 6:** Fix the `avg_rating` column so it doesn't exceed `5`.

In [0]:
# How many places with avg_rating above 5?


In [0]:
# What does the data for these places look like?


In [0]:
# Remove the rows with rating > 5


### **Missing data issues**

#### **Task 7:** Further investigate the missing data and decide how to handle them.

When dealing with missing data, it's important to understand what type of missingness we might have. Missing values are often related to other dynamics in the dataset, and deciding how to handle them requires some domain knowledge.

The `visdat` package is useful for investigating missing data.
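
A minimal sketch using `visdat::vis_miss()` on a made-up data frame is shown below. The `toy` data is hypothetical and deliberately built so that several columns are missing in the same rows.

```r
library(visdat)

# Hypothetical data: rows 2 and 3 have no review information at all
toy <- data.frame(
  price             = c(100, NA, 85, 60),
  last_review       = as.Date(c("2019-05-01", NA, NA, "2018-11-20")),
  reviews_per_month = c(1.2, NA, NA, 0.4),
  avg_rating        = c(4.5, NA, NA, 4.9)
)

# One column per variable, one row per observation; missing cells are shaded
p <- vis_miss(toy)
p

# A plain count per column is a useful companion to the plot
na_counts <- colSums(is.na(toy))
na_counts
```

Columns whose shaded bands line up across the same rows are candidates for related missingness.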

In [0]:
# See data frame summary again


In [0]:
# Visualize missingness 



It looks like the missingness of `last_review`, `reviews_per_month`, `avg_rating`, and `avg_stays_per_month` is related: these columns tend to be missing in the same rows. This suggests that these are places that have never been visited before (and therefore have no ratings, reviews, or stays).

However, `price` is unrelated to the other columns, so we'll need to take a different approach for that.

In [0]:
# Sanity check that our hypothesis is correct


Now that we have a bit of evidence, we'll assume our hypothesis is true.
- We'll set any missing values in `reviews_per_month` or `avg_stays_per_month` to `0`.
    - Use `tidyr::replace_na()`
- We'll leave `last_review` and `avg_rating` as `NA`.
- We'll create a `logical` (`TRUE`/`FALSE`) column called `is_visited`, indicating whether or not the listing has been visited before.
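
The plan above could be sketched as follows on a made-up tibble. Deriving `is_visited` from whether `last_review` is present is an assumption of this sketch, and the `toy` data is hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: the second listing has never been visited
toy <- tibble(
  last_review         = as.Date(c("2019-05-01", NA)),
  reviews_per_month   = c(1.2, NA),
  avg_stays_per_month = c(2.5, NA),
  avg_rating          = c(4.5, NA)
)

toy_filled <- toy %>%
  # replace_na() takes a named list: column = replacement value
  replace_na(list(reviews_per_month = 0, avg_stays_per_month = 0)) %>%
  # Assumption: a listing counts as visited if it has a last_review date
  mutate(is_visited = !is.na(last_review))

toy_filled
```

`last_review` and `avg_rating` are left as `NA` because no date or rating would be a meaningful substitute for a listing with no visits.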

In [0]:
# Replace missing data


**Treating the `price` column**

There are lots of ways we could do this:
- Remove all rows with missing price values
- Fill in missing prices with the overall average price
- Fill in missing prices based on other columns like `borough` or `room_type`

**Let's examine the relationship between `room_type` and `price`.**

<img src='https://raw.githubusercontent.com/datacamp/cleaning-data-in-r-live-training/master/assets/boxplot.png' alt='Box plot diagram' width='350px;'>

In [0]:
# Create a boxplot showing the distribution of price for each room_type


We'll use the *median* to summarize the `price` for each `room_type`, since the distributions have a number of outliers and the median is more robust to outliers than the mean.

We'll use `ifelse()`, which takes arguments of the form: `ifelse(condition, value if true, value if false)`.
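
A sketch of this grouped fill on made-up data is shown below. The `toy` tibble, its category labels, and the new column name `price_filled` are hypothetical.

```r
library(dplyr)

# Hypothetical prices with one gap per room type
toy <- tibble(
  room_type = c("private", "private", "private", "entire", "entire", "entire"),
  price     = c(50, 70, NA, 150, NA, 250)
)

toy_filled <- toy %>%
  group_by(room_type) %>%
  # Inside a grouped mutate, median() is computed separately for each room_type
  mutate(price_filled = ifelse(is.na(price),
                               median(price, na.rm = TRUE),
                               price)) %>%
  ungroup()

toy_filled
```

The `na.rm = TRUE` argument matters here: without it, the median of any group containing an `NA` would itself be `NA`, and nothing would be filled in.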

In [0]:
# Use a grouped mutate to fill in missing prices with median of their room_type


In [0]:
# Overwrite price column in original data frame


### **Duplicate data issues**


#### **Task 8:** Further investigate duplicate data points and decide how to handle them.

In [0]:
# Find duplicated listing_ids


In [0]:
# Look at duplicated data


***Full duplicates***: All values match.
- To handle these, we can just remove all copies but one

***Partial duplicates***: Identifying values (like `listing_id`) match, but one or more of the others don't. Here, we have inconsistent values in `price`, `avg_rating`, and `listing_added`.
- We can remove them, pick a random copy to keep, or aggregate any inconsistent values. We'll aggregate using `mean()` for `price` and `avg_rating`, and `max()` for `listing_added`.
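
Both cases could be handled along the lines of the sketch below, which uses a made-up tibble. In this hypothetical `toy` data, listing `1` is a full duplicate and listing `2` is a partial duplicate.

```r
library(dplyr)

toy <- tibble(
  listing_id    = c(1, 1, 2, 2, 3),
  price         = c(100, 100, 80, 90, 60),
  avg_rating    = c(4.5, 4.5, 4.0, 4.5, 5.0),
  listing_added = as.Date(c("2018-01-01", "2018-01-01",
                            "2018-03-01", "2018-03-05", "2018-06-01"))
)

deduped <- toy %>%
  distinct() %>%                          # full duplicates: keep one copy
  group_by(listing_id) %>%
  mutate(price         = mean(price),     # partial duplicates: aggregate
         avg_rating    = mean(avg_rating),
         listing_added = max(listing_added)) %>%
  ungroup() %>%
  distinct()                              # aggregated copies are now identical

deduped
```

After the grouped `mutate()`, the partial duplicates hold identical values in every column, which is why a second `distinct()` is enough to collapse them into one row.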

In [0]:
# Remove full duplicates


In [0]:
# Aggregate partial duplicates using grouped mutate


In [0]:
# Check that no duplicates remain


### Take-home practice: **Sanity Checks**
*The data should be consistent both with itself and with what we know about the world.*

- **Is the data consistent with itself?**
    - Are there any `last_review` dates before `listing_added` dates?
- **Is the data consistent with what we know about the world?**
    - Are there any `last_review` dates in the future?
    - Are there any `listing_added` dates in the future?
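
As a starting point for these checks, here is a sketch on a made-up tibble. It assumes both date columns have already been converted to `Date`, and the `toy` rows are hypothetical, built so that two of the three checks find a problem.

```r
library(dplyr)

today <- Sys.Date()

toy <- tibble(
  listing_id    = 1:3,
  listing_added = c(as.Date("2018-01-01"), as.Date("2019-06-01"), today + 30),
  last_review   = c(as.Date("2018-05-01"), as.Date("2019-01-15"), NA)
)

# Consistent with itself: a review cannot predate the listing
review_before_added <- toy %>% filter(last_review < listing_added)

# Consistent with the world: no dates in the future
review_in_future <- toy %>% filter(last_review > today)
added_in_future  <- toy %>% filter(listing_added > today)

review_before_added
review_in_future
added_in_future
```

`filter()` drops rows where the comparison is `NA`, so listings with no `last_review` are not flagged by the first two checks.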



---
<center><h1><b>Q&A</b></h1></center>

---