
first commit

this is the first commit of the new package, visdat, which supersedes
`footprintr`.
njtierney committed Jan 28, 2016
0 parents commit a350e50d210ab99939b53b22e2e2013e7e492f80
Showing with 245 additions and 0 deletions.
  1. +2 −0 .Rbuildignore
  2. +3 −0 .gitignore
  3. +13 −0 DESCRIPTION
  4. +8 −0 NAMESPACE
  5. +19 −0 R/fingerprint.R
  6. +38 −0 R/viz_dat.R
  7. +40 −0 R/viz_miss.R
  8. +47 −0 README.md
  9. BIN data/example2.rda
  10. +18 −0 man/fingerprint.Rd
  11. +18 −0 man/vis_dat.Rd
  12. +18 −0 man/vis_miss.Rd
  13. +21 −0 visdat.Rproj
@@ -0,0 +1,2 @@
^.*\.Rproj$
^\.Rproj\.user$
@@ -0,0 +1,3 @@
.Rproj.user
.Rhistory
.RData
@@ -0,0 +1,13 @@
Package: visdat
Title: preliminary visualisation of data
Version: 0.0.0.9000
Authors@R: person("Nicholas", "Tierney", email = "nicholas.tierney@gmail.com", role = c("aut", "cre"))
Description: visdat makes it easy to visualise your whole dataset so that you can quickly identify problems visually.
Depends:
R (>= 3.2.2)
License: MIT
LazyData: true
RoxygenNote: 5.0.1
Imports: ggplot2,
tidyr,
dplyr
@@ -0,0 +1,8 @@
# Generated by roxygen2: do not edit by hand

export(fingerprint)
export(vis_dat)
export(vis_miss)
import(dplyr)
import(ggplot2)
importFrom(tidyr,gather)
@@ -0,0 +1,19 @@
#' fingerprint
#'
#' \code{fingerprint} is a utility function for vis_dat
#'
#' @description \code{fingerprint} takes the fingerprint of a vector: it (currently) replaces each value of x with the class of x, unless the value is missing (coded as NA), in which case it is left as NA. The name fingerprint is taken from csv-fingerprint, on which this package is based.
#'
#' @param x a vector
#'
#' @export
fingerprint <- function(x){

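  # is the data missing?
  ifelse(is.na(x),
         # yes? leave it as NA
         yes = NA,
         # no? replace the value with the class of the vector.
         # class(x)[1] keeps a single label per cell when x has several
         # classes (e.g. POSIXct), which would otherwise be recycled across cells
         no = class(x)[1])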

} # end function
@@ -0,0 +1,38 @@
#' vis_dat
#'
#' \code{vis_dat} visualises a data.frame to tell you what it contains.
#'
#' @description \code{vis_dat} gives you an at-a-glance ggplot of what is inside a dataframe, colouring cells according to what class they are and whether the values are missing. As it returns a ggplot object, it is very easy to customize and change labels, etc.
#'
#' @param x a data.frame object
#'
#' @importFrom tidyr gather
#' @import dplyr
#' @import ggplot2
#'
#' @export
vis_dat <- function(x){

# apply the fingerprint to every column in the dataframe
lapply(x, fingerprint) %>%
# coerce it to a dataframe...there's probably a better way
as_data_frame %>%
# create a new column that is numbered from 1 to the number of rows
# this assists in the gathering of rows together
mutate(rows = 1:n()) %>%
# gather the variables together for plotting
# here we now have a column of the row number (rows), then the variable name (variables), then the contents of that variable (value)
gather(key = variables,
value = value,
-rows) %>%
# then we plot it
ggplot(data = .,
aes(x = variables,
y = rows)) +
geom_raster(aes(fill = value)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) +
labs(x = "Variables in Dataset",
y = "Rows / observations")

}
@@ -0,0 +1,40 @@
#' vis_miss
#'
#' \code{vis_miss} visualises a data.frame to display missingness.
#'
#' @description \code{vis_miss} gives you an at-a-glance ggplot of the missingness inside a dataframe, colouring cells according to missingness. As it returns a ggplot object, it is very easy to customize and change labels, etc.
#'
#' @param x a data.frame object
#'
#' @importFrom tidyr gather
#' @import dplyr
#' @import ggplot2
#'
#' @export
vis_miss <- function(x){

x %>%
is.na %>%
as.data.frame %>%
mutate(rows = 1:n()) %>%
# gather the variables together for plotting
# here we now have a column of the row number (rows), then the variable name (variables), then whether that cell is missing (value)
gather(key = variables,
value = value,
-rows) %>%
# then we plot it
ggplot(data = .,
aes(x = variables,
y = rows)) +
geom_raster(aes(fill = value)) +
# change the colours, so that present is dark grey and missing is light grey
scale_fill_grey(name = "",
labels = c("Present",
"Missing")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45,
vjust = 0.5)) +
labs(x = "Variables in Dataset",
y = "Rows / observations")

}
@@ -0,0 +1,47 @@
# visdat

This package is the second iteration of my attempt at cloning the super cool and way sexier "csv-fingerprint" from FlowingData - see [here](https://github.com/setosa/csv-fingerprint) and [here](https://flowingdata.com/2014/08/14/csv-fingerprint-spot-errors-in-your-data-at-a-glance/). Initially I had named the package "footprintr", to keep in the spirit of the name "csv-fingerprint". However, after a little more thought and usage, I felt that "footprintr" didn't actually describe what the package does, and so "visdat" was born.

# What does it do?

visdat is a small R package that visualises a dataframe, displaying missing data and variable classes with different colours. Future work will allow each cell to be coloured according to its type (e.g., strings, factors, integers, decimals, dates, missing data). It would also be really cool to get this function to "intelligently" read in data types.

Part of the name suggests that it could be integrated with testdat and testthat. The idea is that first you visualise your data to spot problems, then you run tests to check and fix them.


# How to install

```
# install.packages("devtools")
library(devtools)
install_github("njtierney/visdat")
```

# Example

Let's explore the missing data

```
library(visdat)
vis_miss(airquality)
```
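
The same missingness can also be summarised as numbers. The following is a minimal base-R sketch, not part of visdat; the names `miss_mat`, `miss_per_var` and `miss_prop` are illustrative. It builds the logical matrix that `vis_miss` starts from and counts the missing values per variable.

```r
# Logical matrix: TRUE where a value is missing.
# This mirrors the first step inside vis_miss (x %>% is.na).
miss_mat <- is.na(airquality)

# Number of missing values in each variable.
miss_per_var <- colSums(miss_mat)

# Proportion of all cells in the data frame that are missing.
miss_prop <- mean(miss_mat)

print(miss_per_var)
print(miss_prop)
```

Variables with a non-zero count here are the ones that show "Missing" cells in the plot.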

Let's see what's inside airquality

```
vis_dat(airquality)
```
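
To see what `vis_dat` colours by, here is a rough base-R sketch of the fingerprint step: each value is replaced by the class of its column, and missing values stay `NA`. The helper name `fingerprint_sketch` is hypothetical, and taking only the first class with `class(x)[1]` is an assumption made here so that columns with several classes (such as date-times) give one label per cell.

```r
# Hypothetical helper, written for illustration only.
fingerprint_sketch <- function(x) {
  # Keep NA where the value is missing; otherwise use the first class of x.
  ifelse(is.na(x), NA_character_, class(x)[1])
}

# Apply it to every column and rebuild a data frame of class labels.
fp <- as.data.frame(lapply(airquality, fingerprint_sketch),
                    stringsAsFactors = FALSE)

head(fp)
```

Each cell of `fp` now holds either a class name or `NA`, which is the value that ends up mapped to the fill colour.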

# Known Issues

**Individual cells do not have an individual class**

Because R coerces every element of a vector to the same type, you cannot keep something like `c("a", 1L, 10.555)` together as a vector: R just converts it to `[1] "a" "1" "10.555"`. This means that you don't get the ideal feature of picking up on nuances such as individual cells that are different classes in the dataframe. Perhaps there is a way to read in a csv as a list so that these features are preserved?
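
The coercion is easy to reproduce, and a list shows what preserving per-cell classes could look like. This is only a sketch of the idea, not something visdat does: the per-cell guessing with `type.convert()` is an assumption about how raw csv text might be handled, and the variable names are illustrative.

```r
# Atomic vectors hold a single type, so mixed input is coerced to character.
v <- c("a", 1L, 10.555)
class(v)

# A list keeps each element's own class.
l <- list("a", 1L, 10.555)
list_classes <- vapply(l, function(el) class(el)[1], character(1))
list_classes

# Cells read from a csv arrive as text; guessing a type one cell at a time
# with type.convert() is one possible way to recover per-cell classes.
cells <- c("a", "1", "10.555")
guessed_classes <- vapply(
  cells,
  function(el) class(type.convert(el, as.is = TRUE))[1],
  character(1),
  USE.NAMES = FALSE
)
guessed_classes
```

Guessing cell by cell is slower than converting a whole column at once, so it would probably only suit the small-to-medium datasets this kind of plot is aimed at.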

**Missing Data not listed in legend**

When running the `vis_dat` example above, the grey bars indicate missing values, but the legend does not currently label them as missing.
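
One possible workaround, sketched here rather than taken from the package, is to give missing values an explicit label instead of leaving them as `NA`. The function name `fingerprint_labelled` and its `missing_label` argument are hypothetical.

```r
# Hypothetical variant of fingerprint: missing values get an explicit label.
fingerprint_labelled <- function(x, missing_label = "NA") {
  # Use the label where the value is missing; otherwise the first class of x.
  ifelse(is.na(x), missing_label, class(x)[1])
}

labels_ozone <- fingerprint_labelled(airquality$Ozone)

# "NA" is now an ordinary category alongside the class name.
table(labels_ozone)
```

Because the label is an ordinary string, a discrete fill scale would treat it as one more category, so it should get its own legend entry.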


BIN +10.6 KB data/example2.rda
Binary file not shown.

@@ -0,0 +1,21 @@
Version: 1.0

RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: knitr
LaTeX: pdfLaTeX

AutoAppendNewline: Yes
StripTrailingWhitespace: Yes

BuildType: Package
PackageUseDevtools: Yes
PackageInstallArgs: --no-multiarch --with-keep.source
PackageRoxygenize: rd,collate,namespace
