R package of cleaned profile data from "OkCupid Profile Data for Introductory Statistics and Data Science Courses" (Journal of Statistics Education, 2015): 59,946 OkCupid users who were living within 25 miles of San Francisco, had active profiles on June 26, 2012, were online in the previous year, and had at least one picture in their profile.
The data in this package are a "cleaned" version of the original data from the above paper, in that the following variables are modified for easier use by novices:
- Essay responses: Due to file size restrictions, only the first 140 characters of each user's first essay response (`essay0`: my self summary) are included
- `income` values: Previously coded as `-1`, they are now coded as `NA`
- All other missing values: Previously coded as `""`, they are now coded as `NA`
- `sign`: String instances of `"?’"` are replaced with apostrophes
- `last_online`: Date/time strings are converted to `USA/Pacific` timezone `POSIXct` date-time objects
- The original data, publication, code, and codebook can be found in the rudeboybert/JSE_OkCupid repository on GitHub (https://github.com/rudeboybert/JSE_OkCupid).
- The original data, and hence also this cleaned data, did not include usernames.
- Permission to use this data was explicitly granted by OkCupid.
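The cleaning steps listed above can be sketched in base R as follows. This is an illustrative sketch on a toy data frame, not the package's actual cleaning script: the toy values, the `%Y-%m-%d-%H-%M` date format, and the `America/Los_Angeles` time zone name are all assumptions.

```r
# Toy stand-in for the raw data (values and date format are assumptions).
raw <- data.frame(
  income = c(-1, 50000),
  sign = c("", "leo"),
  essay0 = c(paste(rep("a", 200), collapse = ""), ""),
  last_online = c("2012-06-28-20-30", "2012-06-29-21-41"),
  stringsAsFactors = FALSE
)

cleaned <- raw

# income: recode the -1 placeholder as NA
cleaned$income[cleaned$income == -1] <- NA

# Other missing values: recode empty strings as NA in character columns
for (col in c("sign", "essay0")) {
  cleaned[[col]][cleaned[[col]] == ""] <- NA
}

# Essay responses: keep only the first 140 characters
cleaned$essay0 <- substr(cleaned$essay0, 1, 140)

# last_online: parse the date/time strings into POSIXct in a Pacific time zone
cleaned$last_online <- as.POSIXct(
  cleaned$last_online,
  format = "%Y-%m-%d-%H-%M",
  tz = "America/Los_Angeles"  # assumed Olson name for the Pacific zone
)

str(cleaned)
```

Recoding placeholders as `NA` lets functions such as `mean(x, na.rm = TRUE)` skip missing entries instead of treating `-1` as a real income.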
Get the released version from CRAN:
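Assuming the CRAN package shares the name of the GitHub repository (`okcupiddata`), the standard installation call would be:

```r
# Package name assumed to match the GitHub repository name.
install.packages("okcupiddata")
```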
Or the development version from GitHub:
```r
# If you haven't installed devtools yet, do so:
# install.packages("devtools")
devtools::install_github("rudeboybert/okcupiddata")
```
To load the profile data, run:
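A minimal sketch, assuming the package is installed and that its data frame is named `profiles` (the same name used for the raw CSV). The `requireNamespace()` guard only keeps the snippet from erroring when the package is absent:

```r
# Assumes okcupiddata is installed and exposes a data frame called `profiles`.
has_pkg <- requireNamespace("okcupiddata", quietly = TRUE)

if (has_pkg) {
  # Load the data set into the current environment
  data("profiles", package = "okcupiddata")

  # Inspect a few of the cleaned variables described above
  str(profiles[, c("income", "sign", "last_online")])
}
```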
If you prefer the originally published Journal of Statistics Education data, which also include the complete essay responses, do not use this package; instead, run the following code:
```r
# Download the data (run only once):
url <- "https://github.com/rudeboybert/JSE_OkCupid/blob/master/profiles.csv.zip?raw=true"
temp_zip_file <- tempfile()
download.file(url, temp_zip_file)
unzip(temp_zip_file, "profiles.csv")

# Load CSV into R:
profiles <- read.csv(file = "profiles.csv", header = TRUE, stringsAsFactors = FALSE)
```