OkCupid Profile Data for Intro Stats and Data Science Courses

Albert Y. Kim & Adriana Escobedo-Land

Data and code for "OkCupid Profile Data for Introductory Statistics and Data Science Courses" (Journal of Statistics Education, July 2015, Volume 23, Number 2).

  • JSE.bib: bibliography file
  • JSE.pdf: PDF of document
  • JSE.Rnw: R Sweave document to recreate JSE.pdf
  • JSE.R: R code used in document
  • okcupid_codebook.txt: codebook for all variables
  • profiles.csv.zip: CSV file of profile data (unzip this first)
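
Since the CSV has to be unzipped before use, here is a minimal sketch of doing both steps from R. The helper name `load_profiles` is hypothetical, and the name of the file inside the archive (`profiles.csv`) is an assumption rather than something stated above.

```r
# Hypothetical helper: unzip the archive and read the CSV it contains.
# Assumption: the archive holds a file named "profiles.csv".
load_profiles <- function(zip_path = "profiles.csv.zip",
                          csv_name = "profiles.csv",
                          exdir = tempdir()) {
  if (!file.exists(zip_path)) {
    stop("Cannot find ", zip_path, "; download it from this repository first.")
  }
  # unzip() returns the path(s) of the extracted file(s)
  csv_path <- unzip(zip_path, files = csv_name, exdir = exdir)
  read.csv(csv_path, stringsAsFactors = FALSE)
}

# Usage, run from the repository root:
# profiles <- load_profiles()
# str(profiles)
```

Extracting into `tempdir()` keeps the working directory clean; pass `exdir = "."` to keep the unzipped CSV next to the archive instead.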

Note:

  • Permission to use this data set was explicitly granted by OkCupid.
  • Usernames are not included.
  • The JSE.Rnw Sweave document was compiled using the knitr package. In RStudio, go to "Tools" -> "Project Options" -> "Sweave" -> "Weave Rnw files using:" and select knitr.
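
For readers who prefer the console to the RStudio menus, the following sketch shows one possible way to compile the document with knitr directly. The wrapper `compile_jse` is a hypothetical name, and the sketch assumes knitr and a working LaTeX installation are available and that it is run from the repository root.

```r
# Hypothetical helper: weave and compile JSE.Rnw with knitr from the console.
compile_jse <- function(rnw = "JSE.Rnw") {
  if (!file.exists(rnw)) {
    stop("Cannot find ", rnw)
  }
  if (!requireNamespace("knitr", quietly = TRUE)) {
    stop("Please install the knitr package first.")
  }
  # knit2pdf() weaves the .Rnw into .tex and then runs LaTeX on the result
  knitr::knit2pdf(rnw)
}

# Usage, run from the repository root:
# compile_jse()
```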

Preview

Distribution of Male and Female Heights
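
As a rough illustration of how such a comparison could be produced, the sketch below uses simulated data rather than the real profiles. The column names `sex` (coded "m"/"f") and `height` (in inches) are assumptions about profiles.csv, the simulated means and spreads are arbitrary, and `plot_heights` is a hypothetical helper.

```r
set.seed(1)

# Simulated stand-in for the real data; values are arbitrary, not estimates.
toy <- data.frame(
  sex    = rep(c("m", "f"), each = 200),
  height = c(rnorm(200, mean = 70, sd = 3), rnorm(200, mean = 65, sd = 3))
)

# Numeric summary of height within each sex
height_summary <- tapply(toy$height, toy$sex, summary)
print(height_summary)

# Hypothetical plotting helper: one histogram per sex on a shared x-axis
plot_heights <- function(df) {
  old <- par(mfrow = c(2, 1))
  on.exit(par(old))
  xlim <- range(df$height, na.rm = TRUE)
  for (s in sort(unique(df$sex))) {
    hist(df$height[df$sex == s], xlim = xlim,
         main = paste("Heights, sex =", s), xlab = "Height (inches)")
  }
  invisible(NULL)
}

# Usage:
# plot_heights(toy)
```

Sharing the x-axis limits across the two panels makes the shift between the distributions easier to see.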

Joint Distribution of Sex and Sexual Orientation

A mosaic plot of the cross-classification of the 59,946 users' sex and sexual orientation:
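
A cross-classification like this can be sketched with base R's `table()` and `mosaicplot()`. The tiny data frame below is made up for illustration, and the column names `sex` and `orientation` and their category labels are assumptions about profiles.csv.

```r
# Made-up toy data; not drawn from the real profiles.
toy <- data.frame(
  sex = c("m", "m", "f", "f", "m", "f", "m", "f"),
  orientation = c("straight", "gay", "straight", "bisexual",
                  "straight", "straight", "bisexual", "gay")
)

# Two-way table of counts: rows are sex, columns are orientation
tab <- table(toy$sex, toy$orientation, dnn = c("sex", "orientation"))
print(tab)

# Share of each orientation within each sex (rows sum to 1)
row_props <- prop.table(tab, margin = 1)
print(row_props)

# A mosaic plot draws tiles whose areas are proportional to the counts:
# mosaicplot(tab, main = "Sex vs. sexual orientation")
```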

Logistic Regression to Predict Gender

Linear regression (in red) and logistic regression (in blue) compared. Note that both the x-axis (height) and the y-axis (is female: 1 if the user is female, 0 if the user is male) have random jitter added to better visualize the number of points at each (height, gender) pair.
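
The comparison can be sketched on simulated data as follows. This is an illustration under stated assumptions, not the code from JSE.R: the variable names `height` and `is_female` and the simulation parameters are invented for the example.

```r
set.seed(2)

# Simulated data: a 0/1 outcome and a height that depends on it.
n <- 500
is_female <- rbinom(n, size = 1, prob = 0.5)
height <- round(rnorm(n, mean = ifelse(is_female == 1, 65, 70), sd = 3))
toy <- data.frame(height, is_female)

# Fit both models to the same binary response
linear_fit   <- lm(is_female ~ height, data = toy)
logistic_fit <- glm(is_female ~ height, data = toy, family = binomial)

# Predict over a grid of heights
grid <- data.frame(height = seq(50, 85, by = 1))
linear_pred   <- predict(linear_fit, newdata = grid)
logistic_pred <- predict(logistic_fit, newdata = grid, type = "response")

# The straight line is unbounded, while the logistic curve stays in [0, 1]
print(range(linear_pred))
print(range(logistic_pred))

# Jittered scatterplot with both fits overlaid:
# plot(jitter(toy$height), jitter(toy$is_female, factor = 0.5),
#      xlab = "height", ylab = "is female")
# lines(grid$height, linear_pred, col = "red")
# lines(grid$height, logistic_pred, col = "blue")
```

Because heights are rounded and the response takes only two values, many points overlap exactly; `jitter()` spreads them out so the density at each pair becomes visible.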

Fitted probabilities p-hat of each user being female, along with a decision threshold (in red) used to predict whether the user is female.
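
Turning fitted probabilities into predictions might look like the sketch below, again on simulated data. The threshold of 0.5 is an assumption chosen for illustration, since the value is not stated here, and the variable names are invented for the example.

```r
set.seed(3)

# Simulated data, as a stand-in for the real profiles
n <- 500
is_female <- rbinom(n, size = 1, prob = 0.5)
height <- round(rnorm(n, mean = ifelse(is_female == 1, 65, 70), sd = 3))
toy <- data.frame(height, is_female)

fit <- glm(is_female ~ height, data = toy, family = binomial)

# Fitted probability that each user is female
p_hat <- fitted(fit)

# Assumed threshold: predict "female" when p_hat is at least 0.5
threshold <- 0.5
predicted_female <- as.integer(p_hat >= threshold)

# Compare predictions with the true labels
confusion <- table(actual = toy$is_female, predicted = predicted_female)
print(confusion)
accuracy <- mean(predicted_female == toy$is_female)
print(accuracy)

# Histogram of fitted probabilities with the threshold marked in red:
# hist(p_hat, xlab = "fitted probability of being female", main = "")
# abline(v = threshold, col = "red")
```

Moving the threshold trades one kind of misclassification for the other, which the confusion table makes easy to inspect.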
