CSC4850 Machine Learning Project

 
 

Researching the OkCupid dating profile data set using machine learning models to predict user star sign 'intensity'.

Table Of Contents

  1. Project Summary
  2. Expectations
  3. Pipeline Info
  4. Dataset and Features
  5. Technologies
  6. Machine Learning Models
  7. Results
  8. Citations
  9. Special Thanks
  10. Contributors

Project Summary

This project assesses twelve machine learning models on the 2012 OkCupid dataset, with the goal of determining which model most effectively classifies a user's 'Star Sign Intensity'. Each model was evaluated on accuracy, precision, recall, F1 score, and the associated learning curves. For each algorithm, model selection compared three independent train/test splits (50-50, 70-30, and 80-20), with 10-fold cross-validation performed on each training set; the resulting metrics were compared and the best model for each algorithm was selected by hand. The findings offer little support for predicting a person's interest in star signs in this context, but they do provide useful insight into how to select machine learning models and algorithms for a given application.

Expectations

Pipeline Info

  • Model initialization: all models were initialized with random_state = 1234 for reproducibility whenever possible.
  • Data splits: each model was trained on three train/test splits of the dataset (50-50, 70-30, and 80-20).
  • Cross-validation: for every split, every model was trained with 10-fold cross-validation on the training set, from which the best model was selected (a minimal sketch of this loop is shown after this list).
  • Metric: for the classification models we primarily used accuracy as the selection metric, since the two classes are nearly balanced (roughly 47% vs. 53%).
  • From the best models chosen for each split, we selected the best model of each model type.
  • Finally, from all of these we chose a single best performer.
  • Note on model evaluation: models that score at or below 0.53 accuracy (the score obtained by always predicting the majority class) are considered poor performers.
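
A minimal sketch of this selection loop, assuming `X` and `y` are the prepared feature matrix and binary labels; only two of the models are shown here, and solver settings and final test-set scoring are omitted, so this is an illustration rather than the notebook's exact code.

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

RANDOM_STATE = 1234
TEST_FRACTIONS = [0.50, 0.30, 0.20]  # the 50-50, 70-30, and 80-20 splits

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=RANDOM_STATE),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=RANDOM_STATE),
}

def evaluate(X, y):
    """Mean 10-fold CV accuracy on the training portion of each split."""
    results = {}
    for test_size in TEST_FRACTIONS:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=RANDOM_STATE)
        for name, model in models.items():
            # 10-fold cross-validation on the training set only
            scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
            results[(name, test_size)] = scores.mean()
    return results
```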

Dataset and Features

About the Dataset

Dataset obtained from Kaggle.com

OkCupid is a mobile dating app. It sets itself apart from other dating apps by using a precomputed compatibility score, calculated from optional questions that users may choose to answer. The dataset contains roughly 60k records with structured information such as age, sex, and orientation, as well as free text from open-ended essay responses.

Raw Data

  • age, status, sex, orientation, body_type, diet, drinks, drugs, education, ethnicity, height, income, job, last_online, location, offspring, pets, religion, sign, smokes, speaks, essay0, essay1, essay2, essay3, essay4, essay5, essay6, essay7, essay8, essay9
  • 59,949 raw entries
  • .csv format

For training and prediction, all features used were converted to numeric or binary values. The converted columns are suffixed with '_data', and these are the columns used to train and test the models.
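
As a rough illustration of this conversion (not the project's exact preprocessing), the sketch below encodes the categorical fields as integer codes and reduces the essays to simple length features; the Kaggle filename and the length-based essay features are assumptions.

```python
import pandas as pd

# Load the raw Kaggle export (filename assumed).
df = pd.read_csv("okcupid_profiles.csv")

# Categorical fields become integer-coded '_data' columns (-1 marks missing values).
for col in ["status", "sex", "orientation", "body_type", "diet", "drinks", "drugs",
            "education", "job", "offspring", "smokes", "speaks"]:
    df[col + "_data"] = df[col].astype("category").cat.codes

# Free-text essays reduced to simple numeric signals, e.g. character length.
essay_cols = [f"essay{i}" for i in range(10)]
for col in essay_cols:
    df[col + "_data"] = df[col].fillna("").str.len()
df["essay_len"] = df[[c + "_data" for c in essay_cols]].sum(axis=1)
```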

Feature Selection

The following fields were used for classification:

'age', 'height', 'income', 'sign_data', 'religion_data', 'religion_intensity', 'status_data', 'sex_data', 'height_data', 'orientation_data', 'body_type_data', 'diet_data', 'drinks_data', 'drugs_data', 'education_data', 'job_data', 'last_online_data', 'offspring_data', 'smokes_data', 'speaks_data', 'essay0_data', 'essay1_data', 'essay2_data', 'essay3_data', 'essay4_data', 'essay5_data', 'essay6_data', 'essay7_data', 'essay8_data', 'essay9_data', 'essay_len'.
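
For concreteness, the same selection can be expressed in code; a sketch assuming the encoded DataFrame `df` from the previous section.

```python
# Feature columns used for classification (as listed above).
FEATURE_COLUMNS = [
    "age", "height", "income", "sign_data", "religion_data", "religion_intensity",
    "status_data", "sex_data", "height_data", "orientation_data", "body_type_data",
    "diet_data", "drinks_data", "drugs_data", "education_data", "job_data",
    "last_online_data", "offspring_data", "smokes_data", "speaks_data",
    *[f"essay{i}_data" for i in range(10)],
    "essay_len",
]

X = df[FEATURE_COLUMNS]
```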

Star Sign Intensity

Since predicting a person's astrological sign proved intractable with this dataset and these techniques, an alternative target was used: Star Sign Intensity. Star Sign Intensity is a composite feature derived from self-reported OkCupid survey data, representing a user's affinity for, or interest in, their zodiac sign. This attitude was reported in the same column as the sign itself, separated by a comma; for example, an entry might read "Leo, and it matters a lot". The three possible options for this sub-field were merged into two to create a binary classification problem: "My sign matters" and "My sign doesn't matter".
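
A minimal sketch of how this binary label could be derived from the raw 'sign' column; the exact phrases matched, the grouping of the three options, and the handling of rows with no stated attitude are assumptions rather than the notebook's definitive logic.

```python
def sign_intensity(raw_sign: str) -> int:
    """Map a raw 'sign' entry to 1 ('my sign matters') or 0 ('my sign doesn't matter')."""
    # The part after the comma holds the attitude, e.g. "leo, and it matters a lot".
    parts = raw_sign.lower().split(",", 1)
    attitude = parts[1] if len(parts) > 1 else ""
    # Assumed grouping: "doesn't matter" (or no stated attitude) -> 0,
    # the other two responses -> 1; the notebook may instead drop blank rows.
    return 0 if "doesn't matter" in attitude or not attitude.strip() else 1

df["sign_intensity"] = df["sign"].fillna("").apply(sign_intensity)
y = df["sign_intensity"]
```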

Technologies

Libraries

Machine Learning Models

Results

The tables below list, for each split, the model of each type selected by 10-fold cross-validation.

Split 0 (50/50)

| Model | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| Decision Tree | 0.59 | 0.59 | 0.59 | 0.59 |
| Perceptron | 0.51 | 0.50 | 0.51 | 0.48 |
| Naive Bayes | 0.51 | 0.52 | 0.51 | 0.48 |
| Logistic Regression | 0.60 | 0.59 | 0.60 | 0.59 |
| SVM - Linear Kernel | 0.59 | 0.59 | 0.59 | 0.59 |
| SVM - RBF Kernel | 0.50 | 0.51 | 0.50 | 0.47 |
| Multilayer Perceptron | 0.53 | 0.54 | 0.53 | 0.51 |
| Gradient Boosting | 0.61 | 0.61 | 0.61 | 0.61 |
| Ridge Regression | 0.60 | 0.60 | 0.60 | 0.60 |
| K-Nearest Neighbors | 0.51 | 0.51 | 0.51 | 0.51 |
| Passive Aggressive | 0.52 | 0.49 | 0.52 | 0.37 |

Split 1 (70/30)

| Model | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| Decision Tree | 0.61 | 0.61 | 0.61 | 0.61 |
| Perceptron | 0.51 | 0.49 | 0.51 | 0.46 |
| Naive Bayes | 0.55 | 0.55 | 0.55 | 0.55 |
| Logistic Regression | 0.60 | 0.60 | 0.60 | 0.60 |
| SVM - Linear Kernel | 0.59 | 0.60 | 0.59 | 0.59 |
| SVM - RBF Kernel | 0.49 | 0.50 | 0.49 | 0.46 |
| Multilayer Perceptron | 0.52 | 0.52 | 0.52 | 0.51 |
| Gradient Boosting | 0.62 | 0.62 | 0.62 | 0.62 |
| Ridge Regression | 0.60 | 0.60 | 0.60 | 0.60 |
| K-Nearest Neighbors | 0.51 | 0.51 | 0.51 | 0.51 |
| Passive Aggressive | 0.53 | 0.52 | 0.53 | 0.45 |

Split 2 (80/20)

| Model | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| Decision Tree | 0.62 | 0.62 | 0.62 | 0.62 |
| Perceptron | 0.50 | 0.48 | 0.50 | 0.45 |
| Naive Bayes | 0.55 | 0.55 | 0.55 | 0.55 |
| Logistic Regression | 0.59 | 0.59 | 0.59 | 0.59 |
| SVM - Linear Kernel | 0.58 | 0.59 | 0.58 | 0.58 |
| SVM - RBF Kernel | 0.49 | 0.50 | 0.49 | 0.46 |
| Multilayer Perceptron | 0.53 | 0.53 | 0.53 | 0.53 |
| Gradient Boosting | 0.61 | 0.61 | 0.61 | 0.61 |
| Ridge Regression | 0.59 | 0.59 | 0.59 | 0.59 |
| K-Nearest Neighbors | 0.50 | 0.50 | 0.50 | 0.50 |
| Passive Aggressive | 0.53 | 0.44 | 0.53 | 0.37 |

Learning Curves for all models per split
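
A minimal sketch of how a learning curve for one model and split can be produced with scikit-learn; `X_train`, `y_train`, and the choice of model here are assumptions, not the notebook's exact plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Training and cross-validation accuracy at increasing training-set sizes (10-fold CV).
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=1234), X_train, y_train,
    cv=10, scoring="accuracy", train_sizes=np.linspace(0.1, 1.0, 5))

plt.plot(sizes, train_scores.mean(axis=1), label="training accuracy")
plt.plot(sizes, val_scores.mean(axis=1), label="cross-validation accuracy")
plt.xlabel("training examples")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```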

Best Performers

The best-performing algorithms on this classification problem were Decision Tree (F1 of 0.62 on Split 2), Gradient Boosting (F1 of 0.62 on Split 1), and Logistic Regression (F1 of 0.60 on Split 1). Overall, Decision Tree was chosen as the winning algorithm.

Among the poorer performers were Perceptron and SVM with an RBF kernel, which consistently scored worse than simply predicting a single class. The Passive Aggressive classifier also tended to perform poorly, with an F1-score below 0.4 in two of the three splits.

Citations

F. Pedregosa et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

Related Work & Other Resources

Special Thanks

The 'star sign' team would like to thank our professor, Dr. Juan M. Banda, for guiding us this semester!

Contributors

Mike Doan, Jack Ericson, Robert Tognoni

Contributions to this GitHub repository do not necessarily reflect contributions to the project as a whole. Code and written content were developed collaboratively by all team members and then uploaded here for final submission.
