OkCupid Profile Data for Intro Stats and Data Science Courses

Albert Y. Kim and Adriana Escobedo-Land

Data and code for "OkCupid Profile Data for Introductory Statistics and Data Science Courses" (Journal of Statistics Education, July 2015, Volume 23, Number 2).

  • JSE.bib: bibliography file
  • JSE.pdf: PDF of document
  • JSE.Rnw: R Sweave document to recreate JSE.pdf
  • JSE.R: R code used in document
  • okcupid_codebook.txt: codebook for all variables
  • CSV file of profile data (unzip this first)


  • Permission to use this data set was explicitly granted by OkCupid.
  • Usernames are not included.
  • The JSE.Rnw Sweave document was compiled using the knitr package. To do the same in RStudio, go to "Tools" -> "Project Options" -> "Sweave" -> "Weave Rnw files using:" and select knitr.


Distribution of Male and Female Heights

Joint Distribution of Sex and Sexual Orientation

A mosaicplot of the cross-classification of the 59,946 users' sex and sexual orientation.
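The cross-classification behind a mosaic plot is a two-way table of counts, and the tile sizes follow from the shares within each group. The sketch below is a minimal Python illustration, not the authors' R code from JSE.R. The field names `sex` and `orientation`, the category labels, and the eight toy records are assumptions made up for the example and are not taken from the profile data.

```python
from collections import Counter


def cross_classify(records, row_var, col_var):
    """Count how many records fall into each (row_var, col_var) pair."""
    return Counter((rec[row_var], rec[col_var]) for rec in records)


def conditional_shares(table):
    """Share of each column category within its row category.

    In a mosaic plot these shares set how a row group's strip is split.
    """
    row_totals = Counter()
    for (row, _col), count in table.items():
        row_totals[row] += count
    return {
        (row, col): count / row_totals[row]
        for (row, col), count in table.items()
    }


# Made-up toy records (hypothetical, NOT from the OkCupid profiles).
toy_profiles = [
    {"sex": "m", "orientation": "straight"},
    {"sex": "m", "orientation": "straight"},
    {"sex": "m", "orientation": "straight"},
    {"sex": "m", "orientation": "gay"},
    {"sex": "f", "orientation": "straight"},
    {"sex": "f", "orientation": "straight"},
    {"sex": "f", "orientation": "bisexual"},
    {"sex": "f", "orientation": "bisexual"},
]

table = cross_classify(toy_profiles, "sex", "orientation")
shares = conditional_shares(table)

for (sex, orientation), count in sorted(table.items()):
    print(sex, orientation, count, round(shares[(sex, orientation)], 2))
```

With the real data, the same two steps would be applied to the rows of the unzipped CSV file.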

Logistic Regression to Predict Gender

Linear regression (in red) and logistic regression (in blue) compared. Note that both the x-axis (height) and the y-axis (is female: 1 if the user is female, 0 if the user is male) have random jitter added to better visualize the number of points for each (height x gender) pair.
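One reason to compare the two fits is that a straight line fitted to a 0/1 outcome is not constrained to stay between 0 and 1. The following is a rough Python sketch of that point using ordinary least squares, not the authors' R code. The twelve heights (assumed to be in inches) and the 0/1 labels are made up for illustration.

```python
def fit_least_squares(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up toy data (hypothetical, NOT from the OkCupid profiles).
heights = list(range(62, 74))                       # 62, 63, ..., 73
is_female = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]    # 1 = female, 0 = male

intercept, slope = fit_least_squares(heights, is_female)

# Evaluate the fitted line at one short and one tall height.
fitted_short = intercept + slope * 58
fitted_tall = intercept + slope * 78
print("slope:", slope)
print("fitted value at 58:", fitted_short)
print("fitted value at 78:", fitted_tall)
```

On this toy data the fitted line exceeds 1 for the short height and drops below 0 for the tall one, so its values cannot be read as probabilities there. A logistic curve does not have this problem because its output always lies strictly between 0 and 1.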

Fitted probabilities p-hat of each user being female, along with a decision threshold (in red) used to predict whether a user is female or not.
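The step from fitted probabilities to predictions can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' method: the model is fitted by plain gradient ascent instead of R's `glm`, the threshold of 0.5 is an assumed value (the text above does not state which threshold is used), and the heights and labels are made up.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, n_iter=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient ascent.

    x is centered internally to keep the updates well behaved; the
    returned coefficients are on the original (uncentered) scale.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    c0 = c1 = 0.0  # coefficients on the centered scale
    for _ in range(n_iter):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            xc = x - mean_x
            err = y - sigmoid(c0 + c1 * xc)
            grad0 += err
            grad1 += err * xc
        c0 += learning_rate * grad0 / n
        c1 += learning_rate * grad1 / n
    return c0 - c1 * mean_x, c1


# Made-up toy data (hypothetical, NOT from the OkCupid profiles).
heights = list(range(62, 74))                       # assumed inches
is_female = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]    # 1 = female, 0 = male

b0, b1 = fit_logistic(heights, is_female)

threshold = 0.5  # assumed decision threshold
p_hat = [sigmoid(b0 + b1 * h) for h in heights]
predictions = [1 if p > threshold else 0 for p in p_hat]
accuracy = sum(p == y for p, y in zip(predictions, is_female)) / len(heights)

print("b0:", b0, "b1:", b1)
print("training accuracy:", accuracy)
```

Moving the threshold up or down trades one kind of misclassification for the other, which is why it is worth treating as a choice and not a fixed constant.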

