# Predicting Age from OK Cupid Data

## Introduction
The objective of this assignment is to predict the age of an individual from their dating profile, using OK Cupid profile data of users near San Francisco. This dataset was sourced from openml.org at https://www.openml.org/d/42164.

## Table of contents

* [Overview](#Overview)
* [Data preparation](#Data-Preparation)

## Overview
### Data source
The dating profile dataset on openml.org was originally collected from public profiles on www.okcupid.com. It covers users who lived within a 25-mile radius of San Francisco, were active online between 30/6/2011 and 30/6/2012, and had at least one profile picture. The data was collected and made public with the permission of OK Cupid. The CSV linked from openml.org is available at https://github.com/rudeboybert/JSE_OkCupid/blob/master/profiles.csv.zip, alongside a text file that describes all the variables.

This dataset has 59,946 observations (i.e. users), and 31 features (including the target feature which is age). We will use a sample of 5,000 observations for this assignment.
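
A minimal sketch of how a 5,000-row sample could be drawn with pandas. The stand-in frame, the seed value and the variable names are illustrative assumptions, not details taken from the assignment.

```python
import numpy as np
import pandas as pd

# Stand-in for the full dataset (59,946 rows); in practice this would be
# the DataFrame read from the downloaded profiles CSV.
rng = np.random.default_rng(0)
full = pd.DataFrame({"age": rng.integers(18, 70, size=59946)})

# Draw 5,000 rows without replacement; a fixed seed (assumed value) keeps
# the sample reproducible between runs.
sample = full.sample(n=5000, random_state=1)

print(sample.shape)
```

Fixing `random_state` means the same rows are selected every time the notebook is re-run, which keeps later results comparable.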

### Project objective

### Target feature
The target feature is `age`.

### Descriptive features
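There are 30 descriptive features in this dataset. Some categorical features combine two parts in one value: for example, `diet` holds an optional modifier (mostly or strictly) followed by the type of diet. For such features, each part is listed on its own line below.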
- `body_type`: categorical
  - rather not say, thin, overweight, skinny, average, fit, athletic, jacked, a little extra, curvy, full figured, used up
- `diet`: categorical
  - mostly, strictly
  - anything, vegetarian, vegan, kosher, halal, other
- `drinks`: categorical
  - very often, often, socially, rarely, desperately, not at all
- `drugs`: categorical
  - never, sometimes, often
- `education`: categorical
  - graduated from, working on, dropped out of
  - high school, two-year college, university, masters program, law school, med school, Ph.D program, space camp
- `ethnicity`: categorical
  - asian, middle eastern, black, native american, indian, pacific islander, hispanic/latin, white, other
- `height`: continuous, in inches
- `income`: categorical, in US$; -1 means rather not say
  - -1, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 100000, 150000, 250000, 500000, 1000000
- `job`: categorical
  - student, art/music/writing, banking/finance, administration, technology, construction, education, entertainment/media, management, hospitality, law, medicine, military, politics/government, sales/marketing, science/engineering, transportation, unemployed, other, rather not say, retired
- `last_online`:
- `location`:
- `offspring`: categorical
  - has a kid, has kids, doesn't have a kid, doesn't want kids
  - and/but might want them, wants them, doesn't want any, doesn't want more  
- `orientation`: categorical
  - straight, gay, bisexual
- `pets`: categorical
  - has dogs, likes dogs, dislikes dogs
  - has cats, likes cats, dislikes cats
- `religion`: categorical
  - agnosticism, atheism, Christianity, Judaism, Catholicism, Islam, Hinduism, Buddhism, Other
  - and very serious about it, and somewhat serious about it, but not too serious about it, and laughing about it
- `sex`: categorical
  - m, f
- `sign`: categorical
  - aquarius, pisces, aries, taurus, gemini, cancer, leo, virgo, libra, scorpio, sagittarius, capricorn
  - but it doesn’t matter, and it matters a lot, and it’s fun to think about
- `smokes`: categorical
  - yes, sometimes, when drinking, trying to quit, no
- `speaks`: categorical
  - Afrikaans, Albanian, Arabic, Armenian, Basque, Belarusan, Bengali, Breton, Bulgarian, Catalan, Cebuano, Chechen, Chinese, C++, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, Frisian, Georgian, German, Greek, Gujarati, Ancient Greek, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Ilongo, Indonesian, Irish, Italian, Japanese, Khmer, Korean, Latin, Latvian, LISP, Lithuanian, Malay, Maori, Mongolian, Norwegian, Occitan, Other, Persian, Polish, Portuguese, Romanian, Rotuman, Russian, Sanskrit, Sardinian, Serbian, Sign Language, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Tibetan, Turkish, Ukranian, Urdu, Vietnamese, Welsh, Yiddish 
  - fluently, okay, poorly
- `status`: categorical
  - single, seeing someone, married, in an open relationship
- `essay0`: id-like variable
  - "My self summary"
- `essay1`: id-like variable
  - "What I'm doing with my life"
- `essay2`: id-like variable
  - "I'm really good at"
- `essay3`: id-like variable
  - "The first thing people usually notice about me"
- `essay4`: id-like variable
  - "Favorite books, movies, show, music, and food"
- `essay5`: id-like variable
  - "The six things I could never do without"
- `essay6`: id-like variable
  - "I spend a lot of time thinking about"
- `essay7`: id-like variable
  - "On a typical Friday night I am"
- `essay8`: id-like variable
  - "The most private thing I am willing to admit"
- `essay9`: id-like variable
  - "You should message me if..."  
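
For the two-part features listed above, one way to separate the parts is a regular expression with an optional leading modifier. The sketch below does this for `diet` only. The sample values, the column names `modifier` and `diet_type`, and the pattern itself are assumptions for illustration.

```python
import pandas as pd

# A few example values in the style of the diet column, including a
# value without a modifier and a missing entry.
diet = pd.Series(["strictly anything", "mostly other", "anything", "vegetarian", None])

# Optional modifier (mostly/strictly) followed by the type of diet.
pattern = r"^(?:(?P<modifier>mostly|strictly)\s+)?(?P<diet_type>.+)$"
parts = diet.str.extract(pattern)

print(parts)
```

Values with no modifier get a missing `modifier` and keep their full text as `diet_type`; missing inputs stay missing in both columns.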

## Data Preparation

-

### Preliminaries

In [94]:
# Importing modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
import warnings
###
warnings.filterwarnings('ignore')
###
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")

In [95]:
# Specifying the attribute names
attributeNames = [
    'age',
    'body_type',
    'diet',
    'drinks',
    'drugs',
    'education',
    'essay0',
    'essay1',
    'essay2',
    'essay3',
    'essay4',
    'essay5', 
    'essay6',
    'essay7',
    'essay8',
    'essay9',
    'ethnicity',
    'height',
    'income',
    'job',
    'last_online',
    'location',
    'offspring',
    'orientation',
    'pets',
    'religion',
    'sex',
    'sign',
    'smokes',
    'speaks',
    'status',
]

# Read in data
df = pd.read_csv('df.csv', names = attributeNames)
df.head(10)

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
1,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...",...,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
2,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,...,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
3,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
4,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
5,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single
6,29,average,mostly anything,socially,,graduated from college/university,"i'm an australian living in san francisco, but...",building awesome stuff. figuring out what's im...,imagining random shit. laughing at aforementio...,i have a big smile. i also get asked if i'm we...,...,"san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes cats,atheism,m,taurus,no,"english (fluently), chinese (okay)",single
7,32,fit,strictly anything,socially,never,graduated from college/university,life is about the little things. i love to lau...,digging up buried treasure,frolicking<br />\nwitty banter<br />\nusing my...,i am the last unicorn,...,"san francisco, california",,straight,likes dogs and likes cats,,f,virgo,,english,single
8,31,average,mostly anything,socially,never,graduated from college/university,,"writing. meeting new people, spending time wit...","remembering people's birthdays, sending cards,...",i'm rather approachable (a byproduct of being ...,...,"san francisco, california","doesn&rsquo;t have kids, but wants them",straight,likes dogs and likes cats,christianity,f,sagittarius,no,"english, spanish (okay)",single
9,24,,strictly anything,socially,,graduated from college/university,,"oh goodness. at the moment i have 4 jobs, so i...",,i'm freakishly blonde and have the same name a...,...,"belvedere tiburon, california",doesn&rsquo;t have kids,straight,likes dogs and likes cats,christianity but not too serious about it,f,gemini but it doesn&rsquo;t matter,when drinking,english,single


### Data Cleaning and Transformation

Because `names` was passed to `read_csv`, the file's own header row was read in as the first data row. The first step is to remove this row, as it holds column titles rather than data.

In [96]:
df = df.iloc[1:]
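
A possible alternative, sketched under assumptions: the output above suggests that the file's first row holds the column names and its first column is a row index. If so, reading with `header=0` and `index_col=0` keeps the header out of the data and lets pandas infer numeric types. The in-memory CSV below is a made-up stand-in for `df.csv`, and `demo` is a hypothetical name.

```python
import io
import pandas as pd

# Made-up miniature file: an unnamed index column followed by data columns.
csv_text = """,age,height,income,sex
0,22,75,-1,m
1,35,70,80000,m
2,38,68,-1,f
"""

# header=0 takes column names from the first row; index_col=0 uses the
# first column as the row index instead of as a feature.
demo = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0)

print(demo.dtypes)
```

With this approach there is no header row to drop afterwards, and the numeric columns do not need to be cast from `object`.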

In [97]:
print(f'The shape of the dataset is {df.shape}\n')
print('Data types are below:')
print(df.dtypes)

The shape of the dataset is (5000, 31)

Data types are below:
age            object
body_type      object
diet           object
drinks         object
drugs          object
education      object
essay0         object
essay1         object
essay2         object
essay3         object
essay4         object
essay5         object
essay6         object
essay7         object
essay8         object
essay9         object
ethnicity      object
height         object
income         object
job            object
last_online    object
location       object
offspring      object
orientation    object
pets           object
religion       object
sex            object
sign           object
smokes         object
speaks         object
status         object
dtype: object


All columns have been read as `object`, although `age`, `height` and `income` should be integers, so we need to convert them. Casting to `int` fails on missing values, so we first check for and handle these. We also remove the id-like essay variables.

In [98]:
df = df.drop(['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9'], axis = 1)

### Checking for Missing Values

In [99]:
print('Number of missing values for each feature:')
print(df.isnull().sum())

Number of missing values for each feature:
age               0
body_type       459
diet           1965
drinks          256
drugs          1197
education       520
ethnicity       494
height            0
income            0
job             698
last_online       0
location          0
offspring      2941
orientation       0
pets           1695
religion       1706
sex               0
sign            900
smokes          498
speaks            4
status            0
dtype: int64
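
The counts above can also be expressed as a share of rows, which makes the features easier to compare. A minimal sketch on a made-up frame; `toy` and `missing_share` are hypothetical names.

```python
import pandas as pd

# Made-up miniature frame with gaps in two categorical columns.
toy = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "diet": ["strictly anything", None, "anything", None],
    "offspring": [None, None, None, "doesn't want kids"],
})

# The mean of a boolean mask is the fraction of missing entries per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)

print(missing_share)
```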


Many features have missing values: `offspring` alone is missing for 2,941 of the 5,000 observations. We will remove the rows that contain them.

In [100]:
# Identify the rows with missing values
df[df.isna().any(axis = 1)]

Unnamed: 0,age,body_type,diet,drinks,drugs,education,ethnicity,height,income,job,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
3,38,thin,anything,socially,,graduated from masters program,,68,-1,,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
4,23,thin,vegetarian,socially,,working on college/university,white,71,20000,student,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
5,29,athletic,,socially,never,graduated from college/university,"asian, black, other",66,-1,artistic / musical / writer,...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single
6,29,average,mostly anything,socially,,graduated from college/university,white,67,-1,computer / hardware / software,...,"san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes cats,atheism,m,taurus,no,"english (fluently), chinese (okay)",single
7,32,fit,strictly anything,socially,never,graduated from college/university,"white, other",65,-1,,...,"san francisco, california",,straight,likes dogs and likes cats,,f,virgo,,english,single
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4996,31,,,,never,graduated from masters program,indian,71,-1,science / tech / engineering,...,"mountain view, california",,straight,,,m,,no,"english, spanish (poorly)",single
4997,19,fit,mostly anything,socially,,working on high school,"asian, pacific islander",64,-1,student,...,"westlake, california",,straight,,,m,capricorn and it&rsquo;s fun to think about,no,"english (fluently), spanish (poorly)",single
4998,24,fit,,socially,sometimes,working on space camp,hispanic / latin,68,20000,student,...,"oakland, california",,straight,has dogs,catholicism and laughing about it,m,taurus but it doesn&rsquo;t matter,no,"english (fluently), spanish (fluently)",single
4999,28,average,,socially,,working on ph.d program,white,67,-1,,...,"palo alto, california",,straight,,,f,taurus but it doesn&rsquo;t matter,no,"english (fluently), hebrew (fluently), russian...",single


In [101]:
# Remove the rows with missing values and make sure there are no missing values left
df = df.dropna()
df.shape[0]

623
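
Dropping every incomplete row keeps only 623 of the 5,000 sampled profiles. One possible alternative, which this assignment does not use, is to treat an unanswered question as its own category so that those rows are retained. The sketch below is illustrative only: the frame and the placeholder label `"not answered"` are assumptions.

```python
import pandas as pd

# Made-up miniature frame in which every row has at least one gap.
toy = pd.DataFrame({
    "age": [22, 35, 38],
    "diet": ["strictly anything", None, "anything"],
    "pets": [None, "has cats", None],
})

# Replace gaps in the categorical columns with an explicit label.
categorical_cols = ["diet", "pets"]
filled = toy.copy()
filled[categorical_cols] = filled[categorical_cols].fillna("not answered")

# Compare how many rows survive dropna() before and after filling.
print(len(toy.dropna()), len(filled.dropna()))
```

Whether an explicit "not answered" level is appropriate depends on the model; it trades a smaller sample for an extra category per feature.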

In [102]:
df.isnull().sum()

age            0
body_type      0
diet           0
drinks         0
drugs          0
education      0
ethnicity      0
height         0
income         0
job            0
last_online    0
location       0
offspring      0
orientation    0
pets           0
religion       0
sex            0
sign           0
smokes         0
speaks         0
status         0
dtype: int64

Now we can convert the variables to the correct data types so that we can compute summary statistics.

In [103]:
# Transforming data types for age, height, income
df['age'] = df.age.astype(int)
df['height'] = df.height.astype(int)
df['income'] = df.income.astype(int)

In [104]:
# Check that this has worked
df.dtypes

age             int32
body_type      object
diet           object
drinks         object
drugs          object
education      object
ethnicity      object
height          int32
income          int32
job            object
last_online    object
location       object
offspring      object
orientation    object
pets           object
religion       object
sex            object
sign           object
smokes         object
speaks         object
status         object
dtype: object
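
`astype(int)` raises an error if any value cannot be parsed as an integer. A more tolerant option is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN`. The sketch below uses made-up string data; `raw` and `converted` are hypothetical names.

```python
import pandas as pd

# Made-up string columns, one containing a value that is not a number.
raw = pd.DataFrame({"age": ["22", "35", "n/a"], "height": ["75", "70", "68"]})

# Convert each column; entries that cannot be parsed become NaN.
converted = raw.apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)
```

A column that ends up with `NaN` values is stored as a float, so a check on the number of coerced entries is worth doing before modelling.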

### Summary Statistics

-

### Continuous Features

-

### Categorical Features

-

### Dependent Variable

-

## Data Exploration 

-

### Univariate Visualisation

#### Bar Chart

#### Box plot

#### Histogram

### Multivariate Visualisation

#### Scatterplot

#### Categorical Attributes by

#### Facet plots

#### another scatterplot probably

## Statistical Modeling and Performance Evaluation

### Full Model

### Full Model Diagnostic Checks

#### predictive variable - scatterplot

#### actual variable - scatterplot

#### predictive variable - histogram

#### actual variable - histogram

### Backwards Feature Selection

### Reduced Model Diagnostic Checks

#### scatterplot of reduced model

#### histogram of reduced model

## Summary and Conclusions

## References