# Predicting Age from OK Cupid Data

## Introduction
The objective of this assignment is to predict the age of an individual from their dating profile, using OK Cupid profile data of users near San Francisco. This dataset was sourced from openml.org at https://www.openml.org/d/42164.

## Table of contents

* [Overview](#Overview)
* [Data preparation](#Data-Preparation)

## Overview
### Data source
The dating profile dataset available on openml.org was originally collected from public profiles on www.okcupid.com. It covers users who lived within a 25-mile radius of San Francisco, were active online between 30 June 2011 and 30 June 2012, and had at least one profile picture. The data was collected and made public with the permission of OK Cupid. The CSV linked on openml.org can be found at https://github.com/rudeboybert/JSE_OkCupid/blob/master/profiles.csv.zip. This source also includes a text file describing all the variables.

This dataset has 59,946 observations (i.e. users), and 31 features (including the target feature which is age). We will use a sample of 5,000 observations for this assignment.
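
The notebook does not show how the 5,000-row sample was drawn. As a minimal sketch of one reproducible way to do it, assuming the full table is already loaded into a pandas DataFrame, the example below uses `DataFrame.sample` with a fixed seed. The toy frame `full`, the sample size of 10 and the seed value 42 are all illustrative assumptions, not values taken from this assignment.

```python
import pandas as pd

# Hypothetical stand-in for the full 59,946-row profiles table.
full = pd.DataFrame({"age": range(18, 118)})

# Draw rows without replacement; random_state is an arbitrary seed
# chosen here only so that the draw can be repeated exactly.
sample = full.sample(n=10, random_state=42)
sample_again = full.sample(n=10, random_state=42)

print(len(sample))
print(sample.index.equals(sample_again.index))
```

Fixing the seed means the same rows are selected on every run, which keeps later results comparable between sessions.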

### Project objective

### Target feature
The target feature is `age`.

### Descriptive features
There are 30 descriptive features in this dataset. Some categorical features combine two parts in a single value: for example, `diet` has a qualifier (mostly or strictly) followed by the type of diet.
- `body_type`: categorical
  - rather not say, thin, overweight, skinny, average, fit, athletic, jacked, a little extra, curvy, full figured, used up
- `diet`: categorical
  - mostly, strictly
  - anything, vegetarian, vegan, kosher, halal, other
- `drinks`: categorical
  - very often, often, socially, rarely, desperately, not at all
- `drugs`: categorical
  - never, sometimes, often
- `education`: categorical
  - graduated from, working on, dropped out of
  - high school, two-year college, university, masters program, law school, med school, Ph.D program, space camp
- `ethnicity`: categorical
  - asian, middle eastern, black, native american, indian, pacific islander, hispanic/latin, white, other
- `height`: continuous, in inches
- `income`: categorical, in US$; -1 means rather not say
  - -1, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 100000, 150000, 250000, 500000, 1000000
- `job`: categorical
  - student, art/music/writing, banking/finance, administration, technology, construction, education, entertainment/media, management, hospitality, law, medicine, military, politics/government, sales/marketing, science/engineering, transportation, unemployed, other, rather not say, retired
- `last_online`:
- `location`:
- `offspring`: categorical
  - has a kid, has kids, doesn't have a kid, doesn't want kids
  - and/but might want them, wants them, doesn't want any, doesn't want more  
- `orientation`: categorical
  - straight, gay, bisexual
- `pets`: categorical
  - has dogs, likes dogs, dislikes dogs
  - has cats, likes cats, dislikes cats
- `religion`: categorical
  - agnosticism, atheism, Christianity, Judaism, Catholicism, Islam, Hinduism, Buddhism, Other
  - and very serious about it, and somewhat serious about it, but not too serious about it, and laughing about it
- `sex`: categorical
  - m, f
- `sign`: categorical
  - aquarius, pisces, aries, taurus, gemini, cancer, leo, virgo, libra, scorpio, sagittarius, capricorn
  - but it doesn’t matter, and it matters a lot, and it’s fun to think about
- `smokes`: categorical
  - yes, sometimes, when drinking, trying to quit, no
- `speaks`: categorical
  - Afrikaans, Albanian, Arabic, Armenian, Basque, Belarusan, Bengali, Breton, Bulgarian, Catalan, Cebuano, Chechen, Chinese, C++, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, Frisian, Georgian, German, Greek, Gujarati, Ancient Greek, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Ilongo, Indonesian, Irish, Italian, Japanese, Khmer, Korean, Latin, Latvian, LISP, Lithuanian, Malay, Maori, Mongolian, Norwegian, Occitan, Other, Persian, Polish, Portuguese, Romanian, Rotuman, Russian, Sanskrit, Sardinian, Serbian, Sign Language, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Tibetan, Turkish, Ukranian, Urdu, Vietnamese, Welsh, Yiddish 
  - fluently, okay, poorly
- `status`: categorical
  - single, seeing someone, married, in an open relationship
- `essay0`: id-like variable
  - "My self summary"
- `essay1`: id-like variable
  - "What I'm doing with my life"
- `essay2`: id-like variable
  - "I'm really good at"
- `essay3`: id-like variable
  - "The first thing people usually notice about me"
- `essay4`: id-like variable
  - "Favorite books, movies, show, music, and food"
- `essay5`: id-like variable
  - "The six things I could never do without"
- `essay6`: id-like variable
  - "I spend a lot of time thinking about"
- `essay7`: id-like variable
  - "On a typical Friday night I am"
- `essay8`: id-like variable
  - "The most private thing I am willing to admit"
- `essay9`: id-like variable
  - "You should message me if..."  
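
Several of the features above, such as `diet`, pack a qualifier and a base category into one string (for example "mostly vegetarian"). The sketch below shows one way such a value could be split into its two parts. The helper name `split_compound` and its default qualifier list are hypothetical and only cover the `diet` case described above; other features would need their own qualifier lists.

```python
def split_compound(value, modifiers=("mostly", "strictly")):
    """Split a value such as 'mostly vegetarian' into (qualifier, category).

    Values without a leading qualifier return (None, category).
    Non-string inputs (e.g. NaN or None for missing data) return (None, None).
    """
    if not isinstance(value, str):
        return (None, None)
    text = value.strip()
    first, _, rest = text.partition(" ")
    if first in modifiers and rest:
        return (first, rest)
    return (None, text)


print(split_compound("mostly vegetarian"))
print(split_compound("vegan"))
```

Applied to a pandas column, a helper like this could be used with `Series.map` to create two new features, one for the qualifier and one for the diet type.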

## Data Preparation

-

### Preliminaries

In [16]:
# Importing modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
import warnings
###
warnings.filterwarnings('ignore')
###
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")

In [17]:
# Specifying the attribute names
attributeNames = [
    'age',
    'body_type',
    'diet',
    'drinks',
    'drugs',
    'education',
    'essay0',
    'essay1',
    'essay2',
    'essay3',
    'essay4',
    'essay5', 
    'essay6',
    'essay7',
    'essay8',
    'essay9',
    'ethnicity',
    'height',
    'income',
    'job',
    'last_online',
    'location',
    'offspring',
    'orientation',
    'pets',
    'religion',
    'sex',
    'sign',
    'smokes',
    'speaks',
    'status',
]

# Read in data
df = pd.read_csv('df.csv', names = attributeNames)
df.sample(10)

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
16,39,fit,strictly anything,socially,,graduated from college/university,,"dancing, playing, exploring, smiling, and doin...","obscure dances from the '30's and '40's, laugh...",you tell me:),...,"san francisco, california",doesn&rsquo;t have kids,straight,likes dogs and has cats,atheism and laughing about it,f,aquarius but it doesn&rsquo;t matter,no,"english (fluently), spanish (okay)",single
1162,43,fit,mostly vegetarian,not at all,never,graduated from masters program,"smart, specifically fast smart. deep.<br />\nh...",i've gotten inquiries about what i mean by my ...,"policy analysis. thinking, writing, editing, n...",my charisma,...,"san francisco, california",,straight,dislikes dogs,other and somewhat serious about it,m,,no,english,single
1069,26,skinny,mostly anything,socially,,graduated from college/university,fuck it. if there's something you want to know...,"working as a research assistant, trying to sav...",this is more a list of things i like most; wou...,if you meet me before noon you might notice i'...,...,"menlo park, california",,straight,has dogs,,f,scorpio,when drinking,"english, spanish (poorly)",single
1784,38,average,,socially,,graduated from college/university,"passionate, logical, romantic, guarded, shy, t...",working too much. dreaming too much. not paint...,etch-a-sketching,i'm tall.,...,"san francisco, california",,straight,likes dogs and has cats,agnosticism but not too serious about it,f,aries and it&rsquo;s fun to think about,no,"english (fluently), german (okay), japanese (p...",single
1212,45,average,mostly other,rarely,,graduated from high school,well about my self i live in china town sf. on...,well i am waiting for 1 good women to tack hom...,lessening to people talk playing pool and vide...,that i am clean cut with old school manners al...,...,"san francisco, california",doesn&rsquo;t want kids,straight,likes dogs and likes cats,christianity but not too serious about it,m,capricorn and it&rsquo;s fun to think about,,english (fluently),single
1634,35,average,mostly anything,socially,,graduated from masters program,"i grew up on the east coast, but i love the we...","i'm a closeted adventure seeker, and i seek ac...","...i wish i could say golf, but that would not...",,...,"san francisco, california",doesn&rsquo;t have kids,straight,likes dogs,,f,gemini,,"english (fluently), spanish (poorly)",single
367,48,fit,mostly anything,rarely,never,graduated from masters program,,,,,...,"san mateo, california","doesn&rsquo;t have kids, but might want them",straight,,,m,,no,english (fluently),single
3066,29,fit,,socially,,,child-like curiosity. genuine<br />\n<br />\ne...,most recently: losing bets and having to uploa...,pulling off pranks on close friends<br />\nmak...,,...,"stanford, california",,straight,,,m,scorpio,,english,single
1817,44,average,mostly anything,socially,,graduated from law school,**/important update/**<br />\ni am now over ha...,i am currently riding myrecumbent trike from c...,"translating between lay, scientific and legal ...",i'm tall and people can't tell how old i am by...,...,"berkeley, california","doesn&rsquo;t have kids, but wants them",straight,likes dogs and likes cats,agnosticism and laughing about it,m,aries and it&rsquo;s fun to think about,no,english (fluently),single
3312,20,a little extra,mostly other,socially,sometimes,working on college/university,i'm eager to live life. i live in the moment f...,i'm currently enrolled at the academy of art u...,"design and illustration, tennis and headbanging.",my long black curly hair and my big grin,...,"richmond, california",doesn&rsquo;t have kids,straight,,atheism but not too serious about it,m,leo and it&rsquo;s fun to think about,sometimes,"english (fluently), spanish (fluently)",single
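
The sample above shows that some values still contain HTML residue from the source pages, such as `&rsquo;` in "doesn&rsquo;t have kids" and `<br />` tags in the essays. As a minimal sketch, assuming this residue should be decoded before the categories are compared or counted, the hypothetical helper `clean_text` below uses only the Python standard library.

```python
import html
import re


def clean_text(value):
    """Decode HTML entities and replace <br /> tags with single spaces."""
    if not isinstance(value, str):
        return value  # leave missing values (NaN/None) untouched
    text = html.unescape(value)             # e.g. &rsquo; becomes a right single quote
    text = re.sub(r"<br\s*/?>", " ", text)  # drop line-break tags
    return " ".join(text.split())           # collapse repeated whitespace and newlines


print(clean_text("doesn&rsquo;t have kids"))
print(clean_text("smart.<br />\ndeep."))
```

A helper like this could be applied to each text column with `Series.map` before any category levels are tabulated.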


### Data Cleaning and Transformation

In [18]:
print(f'The shape of the dataset is {df.shape}\n')
print('Data types are below:')
print(df.dtypes)

The shape of the dataset is (5001, 31)

Data types are below:
age            object
body_type      object
diet           object
drinks         object
drugs          object
education      object
essay0         object
essay1         object
essay2         object
essay3         object
essay4         object
essay5         object
essay6         object
essay7         object
essay8         object
essay9         object
ethnicity      object
height         object
income         object
job            object
last_online    object
location       object
offspring      object
orientation    object
pets           object
religion       object
sex            object
sign           object
smokes         object
speaks         object
status         object
dtype: object


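All of the features have been read in as `object`, including `age`, `height` and `income`, which should be numeric, so we need to convert them. A column that contains missing values cannot be cast to `int64`, so any missing values would have to be removed first; the conversion below uses `float` instead, which can hold missing values.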

In [22]:
# Changing data types: convert the numeric columns from object to float
df['age'] = df['age'].astype(float)
df['height'] = df['height'].astype(float)
df['income'] = df['income'].astype(float)

# Check to see that this has worked
df.dtypes

ValueError: could not convert string to float: 'age'
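
The error message mentions the string `'age'`, which suggests that the header row of `df.csv` was read in as a data row. That would also explain why the shape shows 5,001 rows and why every column has type `object`. The sketch below reproduces this on a tiny in-memory CSV, assuming the file has a header row and an unnamed index column (as the `Unnamed: 0` label in the sample output hints). The CSV text and the names `with_names` and `with_header` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for df.csv: a header row plus an unnamed index column.
csv_text = (
    ",age,height,income\n"
    "0,39,65.0,-1\n"
    "1,43,70.0,20000\n"
    "2,26,64.0,-1\n"
)

# Passing names= on its own tells pandas there is no header row,
# so the existing header line is kept as an ordinary data row.
with_names = pd.read_csv(io.StringIO(csv_text), names=["age", "height", "income"])

# Letting pandas read the header itself and using the first column as the index.
with_header = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_names.shape, with_header.shape)
print(with_header.dtypes)
```

Under these assumptions, reading the file with `index_col=0` and without `names` gives numeric columns directly, so the explicit `astype(float)` calls would no longer fail on the header text.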

### Checking for Missing Values

In [23]:
print('Number of missing values for each feature:')
print(df.isnull().sum())

Number of missing values for each feature:
age               0
body_type       459
diet           1965
drinks          256
drugs          1197
education       520
essay0          475
essay1          619
essay2          793
essay3          999
essay4          882
essay5          932
essay6         1172
essay7         1086
essay8         1731
essay9         1057
ethnicity       494
height            0
income            0
job             698
last_online       0
location          0
offspring      2941
orientation       0
pets           1695
religion       1706
sex               0
sign            900
smokes          498
speaks            4
status            0
dtype: int64


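Many features have a large number of missing values (for example, `offspring` is missing in 2,941 rows and `diet` in 1,965), so we will need to remove or otherwise handle these before modelling.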
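
Before deciding how to treat the missing values, it can help to look at the share of missing values per feature and at how many rows would survive if every incomplete row were dropped. The sketch below does this on a hypothetical four-row frame; the values in `toy` are invented for illustration. It also treats the `-1` code in `income` as missing, since the feature description says it means "rather not say", even though `income` reports zero missing values in the counts above.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in for the profile data.
toy = pd.DataFrame({
    "age": [39, 43, 26, 38],
    "income": [-1, 20000, -1, 80000],
    "diet": ["strictly anything", None, "mostly anything", None],
    "offspring": [None, None, None, "doesn't have kids"],
})

# Treat the -1 "rather not say" code in income as a missing value.
toy["income"] = toy["income"].replace(-1, np.nan)

# Share of missing values per feature, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Number of rows left if every row with any missing value is dropped.
complete_rows = len(toy.dropna())
print(complete_rows)
```

In this toy example no complete rows remain, which illustrates why dropping every incomplete row can be too costly when several features are sparsely filled. Dropping the sparsest features, or adding an explicit "missing" level to categorical features, are alternatives worth weighing.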

### Summary Statistics

-

### Continuous Features

-

### Fixing Column Names

-

### Categorical Features

-

### Dependent Variable

-

## Data Exploration 

-

### Univariate Visualisation

#### Bar Chart

#### Box plot

#### Histogram

### Multivariate Visualisation

#### Scatterplot

#### Categorical Attributes by

#### Facet plots

#### another scatterplot probably

## Statistical Modeling and Performance Evaluation

### Full Model

### Full Model Diagnostic Checks

#### predictive variable - scatterplot

#### actual variable - scatterplot

#### predictive variable - histogram

#### actual variable - histogram

### Backwards Feature Selection

### Reduced Model Diagnostic Checks

#### scatterplot of reduced model

#### histogram of reduced model

## Summary and Conclusions

## References