# Predicting Age from OK Cupid Data

## Introduction
The objective of this assignment is to predict the age of an individual on a dating profile, using OK Cupid profile data of users near San Francisco. This dataset was sourced from openml.org at https://www.openml.org/d/42164.

## Table of contents

* [Overview](#Overview)
* [Data preparation](#Data-Preparation)

## Overview
### Data source
The dating profile dataset available on openml.org was originally sourced from public profiles on www.okcupid.com. It covers users who lived within a 25-mile radius of San Francisco, were active online between 30/6/2011 and 30/6/2012, and had at least one profile picture. The data was collected and made public with the permission of OK Cupid. The CSV linked on openml.org, together with a text file describing all the variables, is available at https://github.com/rudeboybert/JSE_OkCupid/blob/master/profiles.csv.zip

This dataset has 59,946 observations (i.e. users), and 31 features (including the target feature which is age). We will use a sample of 5,000 observations for this assignment.

### Project objective

### Target feature
The target feature is `age`.

### Descriptive features
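There are 30 descriptive features in this dataset. Some categorical features combine two pieces of information in one value, e.g. `diet` holds an optional modifier (mostly or strictly) followed by the type of diet. For these features, each part is listed on its own line below.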
- `body_type`: categorical
  - rather not say, thin, overweight, skinny, average, fit, athletic, jacked, a little extra, curvy, full figured, used up
- `diet`: categorical
  - mostly, strictly
  - anything, vegetarian, vegan, kosher, halal, other
- `drinks`: categorical
  - very often, often, socially, rarely, desperately, not at all
- `drugs`: categorical
  - never, sometimes, often
- `education`: categorical
  - graduated from, working on, dropped out of
  - high school, two-year college, university, masters program, law school, med school, Ph.D program, space camp
- `ethnicity`: categorical
  - asian, middle eastern, black, native american, indian, pacific islander, hispanic/latin, white, other
- `height`: continuous, in inches
- `income`: categorical, in US$; -1 means rather not say
  - -1, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 100000, 150000, 250000, 500000, 1000000
- `job`: categorical
  - student, art/music/writing, banking/finance, administration, technology, construction, education, entertainment/media, management, hospitality, law, medicine, military, politics/government, sales/marketing, science/engineering, transportation, unemployed, other, rather not say, retired
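- `last_online`: date and time the user was last online
  - e.g. 2012-06-27-22-05
- `location`: categorical, city and state
  - e.g. san francisco, california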
- `offspring`: categorical
  - has a kid, has kids, doesn't have a kid, doesn't want kids
  - and/but might want them, wants them, doesn't want any, doesn't want more  
- `orientation`: categorical
  - straight, gay, bisexual
- `pets`: categorical
  - has dogs, likes dogs, dislikes dogs
  - has cats, likes cats, dislikes cats
- `religion`: categorical
  - agnosticism, atheism, Christianity, Judaism, Catholicism, Islam, Hinduism, Buddhism, Other
  - and very serious about it, and somewhat serious about it, but not too serious about it, and laughing about it
- `sex`: categorical
  - m, f
- `sign`: categorical
  - aquarius, pisces, aries, taurus, gemini, cancer, leo, virgo, libra, scorpio, sagittarius, capricorn
  - but it doesn’t matter, and it matters a lot, and it’s fun to think about
- `smokes`: categorical
  - yes, sometimes, when drinking, trying to quit, no
- `speaks`: categorical
  - Afrikaans, Albanian, Arabic, Armenian, Basque, Belarusan, Bengali, Breton, Bulgarian, Catalan, Cebuano, Chechen, Chinese, C++, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, Frisian, Georgian, German, Greek, Gujarati, Ancient Greek, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Ilongo, Indonesian, Irish, Italian, Japanese, Khmer, Korean, Latin, Latvian, LISP, Lithuanian, Malay, Maori, Mongolian, Norwegian, Occitan, Other, Persian, Polish, Portuguese, Romanian, Rotuman, Russian, Sanskrit, Sardinian, Serbian, Sign Language, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Tibetan, Turkish, Ukranian, Urdu, Vietnamese, Welsh, Yiddish 
  - fluently, okay, poorly
- `status`: categorical
  - single, seeing someone, married, in an open relationship
- `essay0`: id-like variable
  - "My self summary"
- `essay1`: id-like variable
  - "What I'm doing with my life"
- `essay2`: id-like variable
  - "I'm really good at"
- `essay3`: id-like variable
  - "The first thing people usually notice about me"
- `essay4`: id-like variable
  - "Favorite books, movies, show, music, and food"
- `essay5`: id-like variable
  - "The six things I could never do without"
- `essay6`: id-like variable
  - "I spend a lot of time thinking about"
- `essay7`: id-like variable
  - "On a typical Friday night I am"
- `essay8`: id-like variable
  - "The most private thing I am willing to admit"
- `essay9`: id-like variable
  - "You should message me if..."  
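
Several of the features above combine two pieces of information in one string (for example, `diet` holds an optional modifier plus a diet type). The following is a minimal sketch of how such values could be split into two columns. The helper name `split_modifier` and the output column names are hypothetical and are not part of the original analysis.

```python
import pandas as pd

def split_modifier(values, modifiers):
    """Split strings such as 'mostly vegan' into (modifier, base) columns.

    `values` is a pandas Series; `modifiers` is a list of known prefixes.
    Values without a known prefix get a missing modifier.
    """
    pattern = r'^(?:(?P<modifier>' + '|'.join(modifiers) + r')\s+)?(?P<base>.+)$'
    return values.str.extract(pattern)

# Toy values copied from the `diet` levels listed above
diet = pd.Series(['mostly vegetarian', 'strictly vegan', 'anything', None])
parts = split_modifier(diet, ['mostly', 'strictly'])
print(parts)
```

Keeping the modifier and the base value in separate columns would reduce the number of distinct levels per feature, at the cost of one extra column per split feature.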

## Data Preparation

-

### Preliminaries

In [1]:
# Importing modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
import warnings
###
warnings.filterwarnings('ignore')
###
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")

In [2]:
# Specifying the attribute names
attributeNames = [
    'age',
    'body_type',
    'diet',
    'drinks',
    'drugs',
    'education',
    'essay0',
    'essay1',
    'essay2',
    'essay3',
    'essay4',
    'essay5', 
    'essay6',
    'essay7',
    'essay8',
    'essay9',
    'ethnicity',
    'height',
    'income',
    'job',
    'last_online',
    'location',
    'offspring',
    'orientation',
    'pets',
    'religion',
    'sex',
    'sign',
    'smokes',
    'speaks',
    'status',
]

# Read in data
fulldf = pd.read_csv('profiles.csv', names = attributeNames)
df = fulldf.sample(n=5000, random_state=1)
df.head(10)

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
16912,39,fit,strictly anything,socially,,graduated from college/university,thanks for stopping by. talking about ourselve...,making time for my passions (see favorites) an...,"- expressing my feelings, thoughts and passion...",i have the pale complexion of my viking fore-b...,...,"oakland, california","has kids, but doesn&rsquo;t want more",straight,likes dogs and likes cats,,m,aquarius,no,"english (fluently), spanish (okay)",single
13297,22,athletic,,socially,never,graduated from college/university,"i'm a outgoing person,loves the outdoors an lo...","manage a record label,cook at a 5**star restau...","making music ,cooking with my culinary skills ...",i'm friendly :),...,"vallejo, california",,straight,likes dogs and likes cats,,m,,yes,"english (fluently), spanish",single
2604,24,fit,,socially,,working on college/university,"""i am an excitable person who only understands...",lol.,being perceptive.,my talons.,...,"san francisco, california",,bisexual,,,m,,no,"english (fluently), french (fluently), spanish...",single
39269,25,curvy,mostly anything,socially,,graduated from college/university,i think i'm too easy going or open-to-being-pl...,i was a high school english teacher in south a...,"being lazy, figuring out puzzles, being creati...",eyes/smile,...,"san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes cats,other but not too serious about it,f,virgo and it&rsquo;s fun to think about,sometimes,"english (fluently), afrikaans (okay)",single
12225,25,average,strictly anything,often,never,working on space camp,"i'm a kind and intelligent person, and am gene...",right now i'm working in the mornings as a del...,i'm a pretty good boxer and my childhood dream...,"""i wonder if that dude has ever screamed in th...",...,"oakland, california","doesn&rsquo;t have kids, and doesn&rsquo;t wan...",straight,likes dogs and likes cats,atheism and laughing about it,m,libra and it matters a lot,no,"english (fluently), spanish (poorly)",single
18970,26,fit,,socially,never,graduated from masters program,"i'm kind of phasing out this account, if you'v...",working to live. living to love. loving to lau...,music...for example:<br />\nstaying in tune<br...,stunningly handsome?<br />\n<br />\nbut seriou...,...,"san carlos, california",,bisexual,,christianity and somewhat serious about it,m,cancer but it doesn&rsquo;t matter,no,"english (fluently), latin (okay), spanish (poo...",single
8501,24,,,,,graduated from college/university,what are three apropos adjectives to describe ...,i work for a tasty start-up in san francisco. ...,i love improv comedy and i perform/learn it in...,i'm animated like a pixar film.,...,"san francisco, california",,straight,likes dogs and likes cats,,m,sagittarius and it&rsquo;s fun to think about,,english,single
10629,30,athletic,,rarely,never,graduated from two-year college,"well, to begin, i read the book 'on the road' ...",i recently returned to school to pursue film m...,"making omelettes. giving <a class=""ilink"" href...","my style and <a class=""ilink"" href=\n""/interes...",...,"san francisco, california",,bisexual,likes dogs and likes cats,other,m,gemini and it&rsquo;s fun to think about,no,"english (fluently), french (poorly)",single
15760,50,,,socially,never,dropped out of ph.d program,"hello there,<br />\n<br />\nlike i said, looki...","working too hard and neglecting <a class=""ilin...","analysis, perhaps not too much to led itslef t...",do you shave your head or you just look like t...,...,"berkeley, california",might want kids,straight,likes dogs and likes cats,agnosticism but not too serious about it,m,,no,english,available
34877,35,curvy,,socially,never,graduated from college/university,i work two jobs and am a single parent of a te...,one of my jobs involves computer data entry fo...,working with children and animals,my confindence &amp; personality,...,"san francisco, california",has a kid,straight,has dogs,catholicism and somewhat serious about it,f,capricorn and it&rsquo;s fun to think about,sometimes,"english (fluently), portuguese (fluently), spa...",single


### Data Cleaning and Transformation

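The first step is to remove the row of column titles, as we cannot use it as data. Because `names` was passed to `read_csv`, the title row of the CSV was read as the first data row of `fulldf`. Note that `df.iloc[1:]` removes the first row of the sample `df`, which is the title row only if that row happened to be sampled first; in the output above, the first sampled row is profile 16912.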

In [3]:
df = df.iloc[1:]
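
A sketch of an alternative way to handle the title row, assuming the first line of the CSV holds the column titles: passing `header=0` together with `names` makes pandas replace the titles instead of reading them as data, so nothing has to be dropped afterwards and numeric columns are typed on load. The three-column CSV text below is made-up toy data, not the real file.

```python
import io
import pandas as pd

# Made-up stand-in for profiles.csv: a title line followed by two rows
csv_text = "age,height,sex\n39,70,m\n22,66,f\n"
names = ['age', 'height', 'sex']

# names only: the title line is read as an ordinary data row
with_header_row = pd.read_csv(io.StringIO(csv_text), names=names)

# header=0 plus names: the title line is consumed and replaced by `names`
clean = pd.read_csv(io.StringIO(csv_text), names=names, header=0)

print(with_header_row.shape, clean.shape)
print(clean.dtypes)
```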

In [4]:
print(f'The shape of the dataset is {df.shape}\n')
print(f'Data types are below:')
print(df.dtypes)

The shape of the dataset is (4999, 31)

Data types are below:
age            object
body_type      object
diet           object
drinks         object
drugs          object
education      object
essay0         object
essay1         object
essay2         object
essay3         object
essay4         object
essay5         object
essay6         object
essay7         object
essay8         object
essay9         object
ethnicity      object
height         object
income         object
job            object
last_online    object
location       object
offspring      object
orientation    object
pets           object
religion       object
sex            object
sign           object
smokes         object
speaks         object
status         object
dtype: object


The numeric features `age`, `height` and `income` have been read as `object` rather than `int64`, so we need to transform them. Before doing this, we remove any missing values. We also remove the id-like essay variables.

In [5]:
df = df.drop(['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9'], axis = 1)
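
The same drop can be written without listing every column. This is a small sketch that assumes all free-text columns share the `essay` prefix, as they do in `attributeNames`; the toy frame is made up for illustration.

```python
import pandas as pd

# Made-up frame with the same naming pattern as the profile data
toy = pd.DataFrame({
    'age': [39, 22],
    'essay0': ['text a', 'text b'],
    'essay1': ['text c', 'text d'],
    'sex': ['m', 'f'],
})

# Collect every column whose name starts with 'essay' and drop them together
essay_cols = [col for col in toy.columns if col.startswith('essay')]
toy = toy.drop(columns=essay_cols)
print(toy.columns.tolist())
```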

### Checking for Missing Values

In [6]:
print(f'Number of missing values for each feature:')
print(df.isnull().sum())

Number of missing values for each feature:
age               0
body_type       482
diet           2061
drinks          270
drugs          1152
education       563
ethnicity       495
height            0
income            0
job             686
last_online       0
location          0
offspring      2978
orientation       0
pets           1662
religion       1633
sex               0
sign            945
smokes          456
speaks            5
status            0
dtype: int64


Many features have missing values, most notably `offspring` (2,978 of 4,999 rows) and `diet` (2,061), so we will remove the rows that contain them.
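
Before dropping rows, it can help to look at the share of missing values per feature rather than the raw counts. The sketch below uses a small made-up frame, and `missing_share` is a hypothetical name. It also shows that `dropna` accepts a `subset` argument, which is one possible way (not the approach taken in this report) to keep more rows when a sparse feature is not needed.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two categorical columns
toy = pd.DataFrame({
    'age': [39, 22, 24, 25],
    'diet': ['strictly anything', np.nan, np.nan, 'mostly anything'],
    'drugs': [np.nan, 'never', np.nan, 'never'],
})

# Fraction of missing values in each column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Dropping any row with a missing value keeps only fully complete rows
complete_rows = toy.dropna()

# Restricting the check to selected columns keeps more rows
complete_on_diet = toy.dropna(subset=['diet'])
print(len(complete_rows), len(complete_on_diet))
```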

In [7]:
# Identify the rows with missing values
df[df.isna().any(axis = 1)]

Unnamed: 0,age,body_type,diet,drinks,drugs,education,ethnicity,height,income,job,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
13297,22,athletic,,socially,never,graduated from college/university,hispanic / latin,66,-1,entertainment / media,...,"vallejo, california",,straight,likes dogs and likes cats,,m,,yes,"english (fluently), spanish",single
2604,24,fit,,socially,,working on college/university,white,74,-1,student,...,"san francisco, california",,bisexual,,,m,,no,"english (fluently), french (fluently), spanish...",single
39269,25,curvy,mostly anything,socially,,graduated from college/university,white,64,-1,other,...,"san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes cats,other but not too serious about it,f,virgo and it&rsquo;s fun to think about,sometimes,"english (fluently), afrikaans (okay)",single
18970,26,fit,,socially,never,graduated from masters program,white,69,-1,science / tech / engineering,...,"san carlos, california",,bisexual,,christianity and somewhat serious about it,m,cancer but it doesn&rsquo;t matter,no,"english (fluently), latin (okay), spanish (poo...",single
8501,24,,,,,graduated from college/university,,74,-1,sales / marketing / biz dev,...,"san francisco, california",,straight,likes dogs and likes cats,,m,sagittarius and it&rsquo;s fun to think about,,english,single
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17024,40,fit,strictly anything,socially,never,graduated from high school,hispanic / latin,68,80000,sales / marketing / biz dev,...,"pleasant hill, california","has kids, and might want more",straight,has cats,christianity and very serious about it,m,,,english,single
25945,26,athletic,mostly anything,,never,,,72,-1,executive / management,...,"san francisco, california",,straight,,buddhism,m,scorpio,,"english (fluently), japanese (okay)",seeing someone
54204,31,average,,socially,never,graduated from masters program,white,70,-1,science / tech / engineering,...,"san mateo, california",,straight,likes dogs and has cats,atheism and laughing about it,m,virgo and it&rsquo;s fun to think about,no,english (fluently),single
59383,28,skinny,halal,socially,never,graduated from college/university,asian,63,20000,student,...,"san francisco, california","doesn&rsquo;t have kids, but might want them",straight,,islam but not too serious about it,f,leo but it doesn&rsquo;t matter,yes,"english (fluently), chinese (fluently), russia...",single


In [8]:
# Remove the rows with missing values and make sure there are no missing values left
df = df.dropna()
df.shape[0]

581

In [9]:
df.isnull().sum()

age            0
body_type      0
diet           0
drinks         0
drugs          0
education      0
ethnicity      0
height         0
income         0
job            0
last_online    0
location       0
offspring      0
orientation    0
pets           0
religion       0
sex            0
sign           0
smokes         0
speaks         0
status         0
dtype: int64

Now we can transform the variables into the correct data types so that we can compute the summary statistics.

In [10]:
# Transforming data types for age, height, income
df['age'] = df.age.astype(int)
df['height'] = df.height.astype(int)
df['income'] = df.income.astype(int)

In [11]:
# Check that this has worked
df.dtypes

age             int32
body_type      object
diet           object
drinks         object
drugs          object
education      object
ethnicity      object
height          int32
income          int32
job            object
last_online    object
location       object
offspring      object
orientation    object
pets           object
religion       object
sex            object
sign           object
smokes         object
speaks         object
status         object
dtype: object
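
`astype(int)` raises an error as soon as one value cannot be parsed as an integer (for example, a stray title string). As a hedged alternative sketch, `pd.to_numeric` with `errors='coerce'` turns unparseable entries into `NaN` so they can be inspected or dropped explicitly. The values below are made up.

```python
import pandas as pd

# Made-up column read as strings, including one entry that is not a number
raw = pd.Series(['39', '22', 'age', '25'])

# errors='coerce' replaces anything unparseable with NaN instead of raising
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric)

# Entries that failed to parse are dropped before converting to int
cleaned = numeric.dropna().astype(int)
print(cleaned.tolist())
```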

### Summary Statistics

In [18]:
from IPython.display import display, HTML
display(HTML('<b>Table 1: Summary of continuous features</b>'))
# Select numeric columns by kind so this also works where int is 64-bit
df.describe(include=[np.number])

Unnamed: 0,age,height,income
count,581.0,581.0,581.0
mean,33.659208,68.070568,34698.113597
std,10.754191,4.24713,133636.793991
min,18.0,55.0,-1.0
25%,26.0,65.0,-1.0
50%,31.0,68.0,-1.0
75%,40.0,71.0,20000.0
max,69.0,95.0,1000000.0


In [21]:
display(HTML('<b>Table 2: Summary of categorical features</b>'))
df.describe(include='object')

Unnamed: 0,body_type,diet,drinks,drugs,education,ethnicity,job,last_online,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
count,581,581,581,581,581,581,581,581,581,581,581,581,581,581,581,581,581,581
unique,12,15,6,3,25,37,21,555,56,15,3,14,42,2,45,5,223,3
top,average,mostly anything,socially,never,graduated from college/university,white,other,2012-06-27-22-05,"san francisco, california",doesn&rsquo;t have kids,straight,likes dogs and likes cats,agnosticism but not too serious about it,m,capricorn but it doesn&rsquo;t matter,no,english,single
freq,166,295,382,465,236,340,89,3,252,126,509,212,53,315,30,460,123,529
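
Several values in Table 2 still contain HTML entities such as `&rsquo;`. A minimal sketch of decoding them with the standard library function `html.unescape` follows, applied to made-up values shaped like the ones above. This step is an illustration and is not part of the original analysis.

```python
import html
import pandas as pd

# Made-up values in the same form as the `offspring` column
offspring = pd.Series([
    'doesn&rsquo;t have kids',
    'has a kid',
    None,
])

# Decode HTML entities in string entries and leave missing values untouched
decoded = offspring.map(lambda v: html.unescape(v) if isinstance(v, str) else v)
print(decoded.tolist())
```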


### Continuous Features

In [22]:
df['age'].describe()

count    581.000000
mean      33.659208
std       10.754191
min       18.000000
25%       26.000000
50%       31.000000
75%       40.000000
max       69.000000
Name: age, dtype: float64

In [23]:
df['height'].describe()

count    581.000000
mean      68.070568
std        4.247130
min       55.000000
25%       65.000000
50%       68.000000
75%       71.000000
max       95.000000
Name: height, dtype: float64

In [24]:
df['income'].describe()

count        581.000000
mean       34698.113597
std       133636.793991
min           -1.000000
25%           -1.000000
50%           -1.000000
75%        20000.000000
max      1000000.000000
Name: income, dtype: float64
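
Because `-1` in `income` means "rather not say" (see the feature list), it pulls the mean and quartiles of `income` down. One way to summarise only the reported incomes, sketched here on made-up values, is to treat `-1` as missing first. This is an illustration, not a step performed in this report.

```python
import numpy as np
import pandas as pd

# Made-up incomes using the same -1 sentinel as the dataset
income = pd.Series([-1, -1, 20000, 80000, -1, 50000])

# Replace the sentinel with NaN so describe() ignores those entries
reported = income.replace(-1, np.nan)
print(reported.describe())

# Share of users who did not report an income
print('share not reported:', (income == -1).mean())
```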

### Categorical Features

In [25]:
categoricalColumns = df.columns[df.dtypes==object].tolist()

for col in categoricalColumns:
    print('Unique values for ' + col)
    print(df[col].unique())
    print('')

Unique values for body_type
['average' 'athletic' 'fit' 'a little extra' 'curvy' 'thin' 'overweight'
 'used up' 'full figured' 'skinny' 'rather not say' 'jacked']

Unique values for diet
['strictly anything' 'mostly anything' 'mostly other' 'mostly vegetarian'
 'anything' 'mostly kosher' 'strictly vegetarian' 'other' 'vegetarian'
 'strictly other' 'strictly vegan' 'mostly vegan' 'vegan' 'mostly halal'
 'strictly kosher']

Unique values for drinks
['often' 'socially' 'very often' 'rarely' 'not at all' 'desperately']

Unique values for drugs
['never' 'sometimes' 'often']

Unique values for education
['working on space camp' 'working on ph.d program'
 'graduated from college/university' 'working on college/university'
 'graduated from masters program' 'graduated from two-year college'
 'graduated from high school' 'dropped out of college/university'
 'graduated from law school' 'graduated from ph.d program'
 'working on masters program' 'dropped out of ph.d program'
 'graduated from space

### Dependent Variable

-

## Data Exploration 

-

### Univariate Visualisation

#### Bar Chart

#### Box plot

#### Histogram

### Multivariate Visualisation

#### Scatterplot

#### Categorical Attributes by

#### Facet plots

#### another scatterplot probably

## Statistical Modeling and Performance Evaluation

### Full Model

### Full Model Diagnostic Checks

#### predictive variable - scatterplot

#### actual variable - scatterplot

#### predictive variable - histogram

#### actual variable - histogram

### Backwards Feature Selection

### Reduced Model Diagnostic Checks

#### scatterplot of reduced model

#### histogram of reduced model

## Summary and Conclusions

## References