CSC4850 Machine Learning Project

 
 

Researching the OkCupid dating profile data set using machine learning models to predict user star sign 'intensity'.

Table Of Contents

  1. Project Summary
  2. Expectations
  3. Pipeline Info
  4. Dataset and Features
  5. Technologies
  6. Machine Learning Models
  7. Results
  8. Citations
  9. Special Thanks
  10. Contributors

Project Summary

This project assesses twelve machine learning models on the 2012 OkCupid dataset, with the goal of determining which model most effectively classifies a user's 'Star Sign Intensity'. Each model was evaluated on accuracy, precision, recall, F1 score, and the associated learning curves. For each algorithm, model selection compared three independent train/test splits (50-50, 70-30, and 80-20), with 10-fold cross-validation performed on each training set; the resulting metrics were compared and the best model for each algorithm was selected by hand. The findings offer little support for predicting a person's interest in star signs in this context, but they do provide useful insight into how to select machine learning models and algorithms for a given application.

Expectations

Pipeline Info

  • Model initialization: all models were initialized with random_state = 1234 for reproducibility whenever possible.
  • Data splits: each model was trained on three train/test splits of the dataset (50-50, 70-30, and 80-20).
  • Cross-validation: for every split, every model was trained with 10-fold cross-validation on the training set, from which the best model was selected (a minimal sketch of this loop is shown after this list).
  • Metric: for the classification models we primarily used accuracy as the selection metric, since the two classes are nearly balanced (roughly 47% vs. 53%).
  • From the best models chosen for each split, we selected the best model of each model type.
  • Finally, from all of these we chose a single best performer.
  • Note on model evaluation: models that score at or below 0.53 accuracy (the score obtained by always predicting the majority class) are considered poor performers.
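
A minimal sketch of this selection loop, assuming `X` and `y` are the prepared feature matrix and binary labels; only two of the models are shown here, and solver settings and final test-set scoring are omitted, so this is an illustration rather than the notebook's exact code.

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

RANDOM_STATE = 1234
TEST_FRACTIONS = [0.50, 0.30, 0.20]  # the 50-50, 70-30, and 80-20 splits

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=RANDOM_STATE),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=RANDOM_STATE),
}

def evaluate(X, y):
    """Mean 10-fold CV accuracy on the training portion of each split."""
    results = {}
    for test_size in TEST_FRACTIONS:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=RANDOM_STATE)
        for name, model in models.items():
            # 10-fold cross-validation on the training set only
            scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
            results[(name, test_size)] = scores.mean()
    return results
```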

Dataset and Features

About the Dataset

Dataset obtained from Kaggle.com

OkCupid is a mobile dating app. It sets itself apart from other dating apps by using a precomputed compatibility score, calculated from optional questions that users may choose to answer. The dataset contains roughly 60k records with structured information such as age, sex, and orientation, as well as free text from open-ended essay responses.

Raw Data

  • age, status, sex, orientation, body_type, diet, drinks, drugs, education, ethnicity, height, income, job, last_online, location, offspring, pets, religion, sign, smokes, speaks, essay0, essay1, essay2, essay3, essay4, essay5, essay6, essay7, essay8, essay9
  • 59,949 raw entries
  • .csv format

For training and prediction, all features used were converted to numeric or binary values. The converted columns are suffixed with '_data', and these are the columns used to train and test the models.
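
As a rough illustration of this conversion (not the project's exact preprocessing), the sketch below encodes the categorical fields as integer codes and reduces the essays to simple length features; the Kaggle filename and the length-based essay features are assumptions.

```python
import pandas as pd

# Load the raw Kaggle export (filename assumed).
df = pd.read_csv("okcupid_profiles.csv")

# Categorical fields become integer-coded '_data' columns (-1 marks missing values).
for col in ["status", "sex", "orientation", "body_type", "diet", "drinks", "drugs",
            "education", "job", "offspring", "smokes", "speaks"]:
    df[col + "_data"] = df[col].astype("category").cat.codes

# Free-text essays reduced to simple numeric signals, e.g. character length.
essay_cols = [f"essay{i}" for i in range(10)]
for col in essay_cols:
    df[col + "_data"] = df[col].fillna("").str.len()
df["essay_len"] = df[[c + "_data" for c in essay_cols]].sum(axis=1)
```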

Feature Selection

The following fields were used for classification:

'age', 'height', 'income', 'sign_data', 'religion_data', 'religion_intensity', 'status_data', 'sex_data', 'height_data', 'orientation_data', 'body_type_data', 'diet_data', 'drinks_data', 'drugs_data', 'education_data', 'job_data', 'last_online_data', 'offspring_data', 'smokes_data', 'speaks_data', 'essay0_data', 'essay1_data', 'essay2_data', 'essay3_data', 'essay4_data', 'essay5_data', 'essay6_data', 'essay7_data', 'essay8_data', 'essay9_data', 'essay_len'.
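
For concreteness, the same selection can be expressed in code; a sketch assuming the encoded DataFrame `df` from the previous section.

```python
# Feature columns used for classification (as listed above).
FEATURE_COLUMNS = [
    "age", "height", "income", "sign_data", "religion_data", "religion_intensity",
    "status_data", "sex_data", "height_data", "orientation_data", "body_type_data",
    "diet_data", "drinks_data", "drugs_data", "education_data", "job_data",
    "last_online_data", "offspring_data", "smokes_data", "speaks_data",
    *[f"essay{i}_data" for i in range(10)],
    "essay_len",
]

X = df[FEATURE_COLUMNS]
```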

Star Sign Intensity

Since predicting a person's astrological sign proved intractable with this dataset and these techniques, an alternative target was used: Star Sign Intensity. Star Sign Intensity is a composite feature derived from self-reported OkCupid survey data, representing a user's affinity for, or interest in, their zodiac sign. This attitude was reported in the same column as the sign itself, separated by a comma; for example, an entry might read "Leo, and it matters a lot". The three possible options for this sub-field were merged into two to create a binary classification problem: "My sign matters" and "My sign doesn't matter".
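
A minimal sketch of how this binary label could be derived from the raw 'sign' column; the exact phrases matched, the grouping of the three options, and the handling of rows with no stated attitude are assumptions rather than the notebook's definitive logic.

```python
def sign_intensity(raw_sign: str) -> int:
    """Map a raw 'sign' entry to 1 ('my sign matters') or 0 ('my sign doesn't matter')."""
    # The part after the comma holds the attitude, e.g. "leo, and it matters a lot".
    parts = raw_sign.lower().split(",", 1)
    attitude = parts[1] if len(parts) > 1 else ""
    # Assumed grouping: "doesn't matter" (or no stated attitude) -> 0,
    # the other two responses -> 1; the notebook may instead drop blank rows.
    return 0 if "doesn't matter" in attitude or not attitude.strip() else 1

df["sign_intensity"] = df["sign"].fillna("").apply(sign_intensity)
y = df["sign_intensity"]
```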

Technologies

Libraries

Machine Learning Models

Results

The tables below list, for each split, the model of each type selected by 10-fold cross-validation.

Split 0 (50/50)

| Model | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| Decision Tree | 0.59 | 0.59 | 0.59 | 0.59 |
| Perceptron | 0.51 | 0.50 | 0.51 | 0.48 |
| Naive Bayes | 0.51 | 0.52 | 0.51 | 0.48 |
| Logistic Regression | 0.60 | 0.59 | 0.60 | 0.59 |
| SVM - Linear Kernel | 0.59 | 0.59 | 0.59 | 0.59 |
| SVM - RBF Kernel | 0.50 | 0.51 | 0.50 | 0.47 |
| Multilayer Perceptron | 0.53 | 0.54 | 0.53 | 0.51 |
| Gradient Boosting | 0.61 | 0.61 | 0.61 | 0.61 |
| Ridge Regression | 0.60 | 0.60 | 0.60 | 0.60 |
| K-Nearest Neighbors | 0.51 | 0.51 | 0.51 | 0.51 |
| Passive Aggressive | 0.52 | 0.49 | 0.52 | 0.37 |

Split 1 (70/30)

| Model | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| Decision Tree | 0.61 | 0.61 | 0.61 | 0.61 |
| Perceptron | 0.51 | 0.49 | 0.51 | 0.46 |
| Naive Bayes | 0.55 | 0.55 | 0.55 | 0.55 |
| Logistic Regression | 0.60 | 0.60 | 0.60 | 0.60 |
| SVM - Linear Kernel | 0.59 | 0.60 | 0.59 | 0.59 |
| SVM - RBF Kernel | 0.49 | 0.50 | 0.49 | 0.46 |
| Multilayer Perceptron | 0.52 | 0.52 | 0.52 | 0.51 |
| Gradient Boosting | 0.62 | 0.62 | 0.62 | 0.62 |
| Ridge Regression | 0.60 | 0.60 | 0.60 | 0.60 |
| K-Nearest Neighbors | 0.51 | 0.51 | 0.51 | 0.51 |
| Passive Aggressive | 0.53 | 0.52 | 0.53 | 0.45 |

Split 2 (80/20)

| Model | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| Decision Tree | 0.62 | 0.62 | 0.62 | 0.62 |
| Perceptron | 0.50 | 0.48 | 0.50 | 0.45 |
| Naive Bayes | 0.55 | 0.55 | 0.55 | 0.55 |
| Logistic Regression | 0.59 | 0.59 | 0.59 | 0.59 |
| SVM - Linear Kernel | 0.58 | 0.59 | 0.58 | 0.58 |
| SVM - RBF Kernel | 0.49 | 0.50 | 0.49 | 0.46 |
| Multilayer Perceptron | 0.53 | 0.53 | 0.53 | 0.53 |
| Gradient Boosting | 0.61 | 0.61 | 0.61 | 0.61 |
| Ridge Regression | 0.59 | 0.59 | 0.59 | 0.59 |
| K-Nearest Neighbors | 0.50 | 0.50 | 0.50 | 0.50 |
| Passive Aggressive | 0.53 | 0.44 | 0.53 | 0.37 |

Learning Curves for all models per split
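
A minimal sketch of how a learning curve for one model and split can be produced with scikit-learn; `X_train`, `y_train`, and the choice of model here are assumptions, not the notebook's exact plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Training and cross-validation accuracy at increasing training-set sizes (10-fold CV).
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=1234), X_train, y_train,
    cv=10, scoring="accuracy", train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(sizes, train_scores.mean(axis=1), label="training accuracy")
plt.plot(sizes, val_scores.mean(axis=1), label="cross-validation accuracy")
plt.xlabel("training examples")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```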

Best Performers

The best-performing algorithms on this classification problem were Decision Tree (F1 of 0.62 on Split 2), Gradient Boosting (F1 of 0.62 on Split 1), and Logistic Regression (F1 of 0.60 on Split 1). Overall, Decision Tree was chosen as the winning algorithm.

Among the poorer performers were Perceptron and SVM with an RBF kernel, which consistently scored worse than simply predicting a single class. The Passive Aggressive classifier also tended to perform poorly, with an F1-score below 0.4 in two of the three splits.

Citations

F. Pedregosa et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

Related Work & Other Resources

Special Thanks

The 'star sign' team would like to thank our professor, Dr. Juan M. Banda, for guiding us this semester!

Contributors

Mike Doan, Jack Ericson, Robert Tognoni

Contributions to this GitHub repository do not necessarily reflect contributions to the project as a whole. Code and written content were developed collaboratively by all team members and then uploaded here for final submission.
