Albert Y. Kim and Adriana Escobedo-Land
Data and code for OkCupid Profile Data for Introductory Statistics and Data Science Courses (Journal of Statistics Education July 2015, Volume 23, Number 2).
- JSE.bib: bibliography file
- JSE.pdf: PDF of document
- JSE.Rnw: R Sweave document to recreate JSE.pdf
- JSE.R: R code used in document
- okcupid_codebook.txt: codebook for all variables
- profiles.csv.zip: CSV file of profile data (unzip this first)
Note:
- Permission to use this data set was explicitly granted by OkCupid.
- Usernames are not included.
The JSE.Rnw Sweave document was compiled using the knitr package. In RStudio, go to "Tools" -> "Project Options" -> "Sweave" -> "Weave Rnw files using:" and select knitr.
A mosaic plot of the cross-classification of the 59946 users' sex and sexual orientation.
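
A plot of this kind can be sketched in base R with `table()` and `mosaicplot()`. The snippet below is only an illustration: the `toy` data frame is a made-up eight-row stand-in, not the real profile data, and the column names `sex` and `orientation` are assumptions about how the variables are named.

```r
# Made-up stand-in for the profile data (NOT the real OkCupid counts).
# The column names `sex` and `orientation` are assumptions for illustration.
toy <- data.frame(
  sex = c("m", "m", "m", "f", "f", "f", "m", "f"),
  orientation = c("straight", "gay", "bisexual", "straight",
                  "gay", "bisexual", "straight", "straight")
)

# Cross-classify the two categorical variables into a contingency table.
cross_tab <- table(sex = toy$sex, orientation = toy$orientation)

# Draw the mosaic plot: tile areas are proportional to the cell counts.
mosaicplot(cross_tab, main = "Sex vs. sexual orientation (toy data)")
```

With the real data, the same two calls would presumably be applied to the data frame read from the unzipped profiles.csv.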
Linear regression (in red) and logistic regression (in blue) compared. Note that both the x-axis (height) and the y-axis (is female: 1 if the user is female, 0 if the user is male) have random jitter added to better show the number of points at each (height x gender) pair.
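
The comparison described above can be sketched as follows. This is a minimal example on simulated data, not the code from JSE.R: the data frame `toy`, its sample size, and the height means and standard deviation are all assumptions chosen only to make the example run on its own.

```r
# Simulated stand-in for the profile data (NOT the real OkCupid values).
set.seed(76)
n <- 500
is_female <- rbinom(n, size = 1, prob = 0.4)
# Heights in inches; the means and SD are illustrative assumptions.
height <- round(rnorm(n, mean = ifelse(is_female == 1, 65, 70), sd = 3))
toy <- data.frame(height, is_female)

# Linear regression treats the 0/1 outcome as numeric,
# so its fitted values are not constrained to lie in [0, 1].
linear_fit <- lm(is_female ~ height, data = toy)

# Logistic regression models the log-odds,
# so its fitted probabilities always lie strictly between 0 and 1.
logistic_fit <- glm(is_female ~ height, data = toy, family = binomial)

# Jitter both axes so overlapping (height, is_female) pairs become visible.
plot(jitter(toy$height), jitter(toy$is_female, factor = 0.5),
     xlab = "height (inches)", ylab = "is female", pch = 20, col = "grey50")

# Overlay both fitted curves on a grid of heights.
grid_heights <- data.frame(
  height = seq(min(toy$height), max(toy$height), length.out = 200)
)
lines(grid_heights$height, predict(linear_fit, newdata = grid_heights),
      col = "red")
lines(grid_heights$height,
      predict(logistic_fit, newdata = grid_heights, type = "response"),
      col = "blue")
```

The `type = "response"` argument matters here: without it, `predict()` on a `glm` fit returns values on the log-odds scale rather than probabilities.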
Fitted probabilities p-hat of each user being female, along with a decision threshold (in red) used to predict whether a user is female or not.
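
Turning fitted probabilities into predictions can be sketched as below. Again this uses simulated data rather than the real profiles, and the threshold of 0.5 is an assumption made for illustration; the value used in the document may differ.

```r
# Simulated stand-in for the profile data (NOT the real OkCupid values).
set.seed(76)
n <- 500
is_female <- rbinom(n, size = 1, prob = 0.4)
height <- round(rnorm(n, mean = ifelse(is_female == 1, 65, 70), sd = 3))
toy <- data.frame(height, is_female)

# Fit the logistic regression and extract p-hat for every user.
logistic_fit <- glm(is_female ~ height, data = toy, family = binomial)
p_hat <- fitted(logistic_fit)

# Assumed decision threshold: predict "female" when p-hat reaches it.
threshold <- 0.5
predicted_female <- as.integer(p_hat >= threshold)

# Compare predictions with the truth.
confusion <- table(truth = toy$is_female, predicted = predicted_female)
accuracy <- mean(predicted_female == toy$is_female)

# Histogram of the fitted probabilities with the threshold marked in red.
hist(p_hat, breaks = 20, xlab = "p-hat", main = "Fitted probabilities")
abline(v = threshold, col = "red")
```

Moving `threshold` up or down trades one kind of misclassification for the other, which shows up as counts shifting between the off-diagonal cells of `confusion`.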



