Skip to content
Branch: master
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
149 lines (96 sloc) 4.87 KB
title: "The R-Ladies Data Cleaning Gauntlet"
author: "Alice Walsh, R-Ladies Philly"
date: "8/2/2018"
output: html_document
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# Intro
Below I have assembled some data cleaning challenges! The data to clean is all available online.
This first code chunk loads the data. In RStudio, you can click a green arrow in the top right corner to run the chunk. You should see that a new variable appears in your Global Environment.
```{r get_data}
data_link <- ''
df <- read.csv(data_link)
```{r load_packages}
# Do you want to load some packages here? Maybe tidyverse?
# Here are the packages I loaded to complete the challenges! You made need more or different ones.
# Challenge one
The first step is to inspect the data. Use 2 or three commands to look at things like...
* What are the variable/column names?
* What data class are the variable/columns?
* How many NA's are there per variable?
```{r inspect}
# Put your code here!
# Challenge two
There is some missing data - blank values that we would like to be represented as NA in our dataset.
Also some data has been converted to factors when it should be character.
Let's re-import the data with read.csv() with different options (or use a different function). We want the variables like NEIGHBORHOOD andd ADDRESS to be class `r character`. The variable ADDRESS_NOTES should have 38 NA's.
```{r reload_data}
# Fill in the options - add stringsAsFactors, na.strings
# Alternatively, use your favorite package to read in the data instead of read.csv()
df <- read.csv(data_link) # add here
# Challenge three
Let's clean up the NEIGHBORHOOD! Use all your favorite commands to look at the values in NEIGHBORHOOD and clean them up. There are some entries with an extra space. There is some inconsistent capitalization.
For example, "Center City", "Center city", and "Center City " are all different.
```{r clean_neighborhood}
# Your code here
# Challenge four
There are 48 values in the MONTHS variable. It looks like this variable contains information about the dates the farmer's market is open (e.g. "May - October").
If you want to plot the number of farmer's markets that are open by month, how will you do it?
Tip: Clean up the MONTHS variable and create some new variables. I also decided to create a new data frame that summarized the number of open markets by month.
```{r clean_months}
# Add code to process data here
```{r plot_by_month}
# Code to plot markets by month -
# Uncomment and Modify the below ggplot2 code or start fresh with your own code
# ggplot(by_month, aes(x = month, y = number)) +
# geom_col()
# Extra challenges
Let's combine this data with another dataset? maybe something from opendataphilly?
What if you were asked to plot the markets by zip code? or by day of week?
What if you wanted to plot the markets on a map using the coordinates provided in X, Y?
# Bonus data reshape challenge
The farmers market data is not a great example to show converting between wide and long data. If you want to practice this, here is an interesting dataset.
* Zillow makes home prices and other data available on their website:
```{r load_widedata}
zillow_source <- ""
zillow_wide <- read.csv(zillow_source)
The data is 10165 rows and 108 columns. Let's filter to just keep Philadelphia data:
```{r filter_zillow}
# This code uses filter from dplyr package - there are other ways you could do this
zillow_wide <- filter(zillow_wide, City == "Philadelphia")
This data is wide - there are columns for every month with home prices (e.g. "X2018.01" and "2016.08"). Instead, we want to have one column with the prices and another with the month and year. How can we convert from wide to long?
This is roughly what I want `r zillow_long` to look like:
```{r dummy_df}
fake_zillow_long <- data_frame(RegionName = c(19143, 19143, 19143, 19143, 19143),
City = c("Philadelphia","Philadelphia","Philadelphia","Philadelphia","Philadelphia"),
year = c(2010,2010,2010,2010,2010),
month = c(1,2,3,4,5),
price = c(79450, 77000, 83500, 79450, 79900))
```{r reshape_zillow}
# Create a new dataframe, zillow_long, that contains reshaped data
```{r plot_zillow}
# Now we can plot the prices over time for each Philadelphia RegionName (zip code)
# Here is some example code you can modify to fit your dataset or change to make prettier
# ggplot(zillow_long, aes(x = month, y = price, group = RegionName)) +
# geom_line() +
# facet_wrap(~year)
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.