R Package of Cleaned OkCupid Data

okcupiddata


R package of cleaned profile data from "OkCupid Profile Data for Introductory Statistics and Data Science Courses" (Journal of Statistics Education, 2015). The data cover 59,946 OkCupid users who were living within 25 miles of San Francisco, had active profiles on June 26, 2012, were online in the previous year, and had at least one picture in their profile.

The data in this package are a "cleaned" version of the original data from the above paper, in that the following variables are modified for easier use by novices:

  • Essay responses: Due to file size restrictions, only the first 140 characters of each user's first essay response (essay0: my self summary) are included
  • Missing income values: Previously coded as -1, they are now coded as NA
  • All other missing values: Previously coded as "", they are now coded as NA
  • offspring and sign: String instances of the HTML entity "&rsquo;" are replaced with apostrophes
  • last_online: Date/time strings are converted to USA/Pacific timezone POSIXct date-time objects
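The cleaning steps listed above can be sketched in base R on a toy data frame. This is only an illustration, not the package's actual cleaning script: the values in `raw` are invented, and the raw `last_online` format (`YYYY-MM-DD-HH-MM`), the `&rsquo;` entity in `sign`, and the use of the `America/Los_Angeles` time zone name for Pacific time are all assumptions.

```r
# Toy stand-in for a few raw columns; all values are invented for illustration.
raw <- data.frame(
  income      = c(-1L, 20000L, -1L),
  sign        = c("leo", "", "gemini but it doesn&rsquo;t matter"),
  essay0      = c(strrep("a", 200), "short summary", ""),
  last_online = c("2012-06-28-20-30", "2012-06-29-21-41", "2012-06-27-09-10"),
  stringsAsFactors = FALSE
)

clean <- raw

# Missing income values: -1 becomes NA.
clean$income[clean$income == -1] <- NA

# All other missing values: empty strings become NA in every character column.
is_chr <- vapply(clean, is.character, logical(1))
clean[is_chr] <- lapply(clean[is_chr], function(x) {
  x[x == ""] <- NA
  x
})

# sign: replace the HTML entity with a plain apostrophe (NA stays NA).
clean$sign <- gsub("&rsquo;", "'", clean$sign, fixed = TRUE)

# Essay responses: keep only the first 140 characters.
clean$essay0 <- substr(clean$essay0, 1, 140)

# last_online: parse the assumed raw format into a Pacific-time POSIXct.
clean$last_online <- as.POSIXct(
  clean$last_online,
  format = "%Y-%m-%d-%H-%M",
  tz = "America/Los_Angeles"
)

str(clean)
```

Recoding empty strings before truncating the essay means an absent essay ends up as NA rather than as an empty string.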

Note:

  • The original data, publication, code, and codebook can be found in the JSE_OkCupid repository on GitHub (https://github.com/rudeboybert/JSE_OkCupid).
  • The original data, and hence also this cleaned data, did not include usernames.
  • Permission to use this data was explicitly granted by OkCupid.

Installation

Get the released version from CRAN:

install.packages("okcupiddata")

Or the development version from GitHub:

# If you haven't installed devtools yet, do so:
# install.packages("devtools")
devtools::install_github("rudeboybert/okcupiddata")

Load Data

To load the profile data, run:

data(profiles)
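Because missing values are coded as NA, standard R tools for missingness apply directly. The following is a minimal sketch: `na_share` is a hypothetical helper (not part of the package), and the toy data frame uses invented values so that the example runs even when the package is not installed.

```r
# Hypothetical helper: share of missing (NA) values per column, largest first.
na_share <- function(df) {
  sort(colMeans(is.na(df)), decreasing = TRUE)
}

# Toy data frame with invented values, so the sketch runs without the package.
toy <- data.frame(
  age    = c(22, 35, 41, 29),
  income = c(NA, 20000, NA, NA),
  sign   = c("leo", NA, "virgo", "aries"),
  stringsAsFactors = FALSE
)
toy_share <- na_share(toy)
print(toy_share)

# With the package installed, the same helper applies to the real data.
if (requireNamespace("okcupiddata", quietly = TRUE)) {
  data(profiles, package = "okcupiddata")
  print(head(na_share(profiles)))
}
```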

If you prefer the originally published Journal of Statistics Education data, which include the complete essay responses, do not use this package; instead, run the following code:

# Download the data (run only once):
url <- "https://github.com/rudeboybert/JSE_OkCupid/blob/master/profiles.csv.zip?raw=true"
temp_zip_file <- tempfile()
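# mode = "wb" keeps the zip intact on Windows, where the URL suffix hides the file type:
download.file(url, temp_zip_file, mode = "wb")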
unzip(temp_zip_file, "profiles.csv")
# Load CSV into R:
profiles <- read.csv(file="profiles.csv", header=TRUE, stringsAsFactors = FALSE)