# Predicting Age from OK Cupid Data

## Introduction
The objective of this assignment is to predict a user's age from their dating profile, using OK Cupid profile data of users near San Francisco. This dataset was sourced from openml.org at https://www.openml.org/d/42164.

## Table of contents

* [Overview](#Overview)
* [Data preparation](#Data-Preparation)

## Overview
### Data source
The dating profile dataset on openml.org was originally collected from public profiles on www.okcupid.com. It covers users who lived within a 25-mile radius of San Francisco, were active online between 30 June 2011 and 30 June 2012, and had at least one profile picture. The data was collected and made public with the permission of OK Cupid. The CSV linked from openml.org is available at https://github.com/rudeboybert/JSE_OkCupid/blob/master/profiles.csv.zip, together with a text file that describes all the variables.

This dataset has 59,946 observations (i.e. users), and 31 features (including the target feature which is age). We will use a sample of 5,000 observations for this assignment.
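
A sample of this size could be drawn with pandas as sketched below. The stand-in frame `full` and the seed value are hypothetical; the source does not say how the 5,000 rows were actually selected.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full profiles table (59,946 rows in the real data).
full = pd.DataFrame({"age": np.arange(59946) % 50 + 18})

# Draw 5,000 rows without replacement; random_state is an arbitrary seed
# chosen only so that the sample is reproducible.
sample = full.sample(n=5000, random_state=999)

print(sample.shape)
```

If a sample like this is written out with `DataFrame.to_csv` using its default settings, the row index is saved as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back without `index_col`.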

### Project objective

### Target feature
The target feature is `age`.

### Descriptive features
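There are 30 descriptive features in this dataset. Some categorical features combine two components: for example, `diet` has a modifier (mostly or strictly) as well as the type of diet. Where a feature has two components, each is listed on its own line below.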
- `body_type`: categorical
  - rather not say, thin, overweight, skinny, average, fit, athletic, jacked, a little extra, curvy, full figured, used up
- `diet`: categorical
  - mostly, strictly
  - anything, vegetarian, vegan, kosher, halal, other
- `drinks`: categorical
  - very often, often, socially, rarely, desperately, not at all
- `drugs`: categorical
  - never, sometimes, often
- `education`: categorical
  - graduated from, working on, dropped out of
  - high school, two-year college, university, masters program, law school, med school, Ph.D program, space camp
- `ethnicity`: categorical
  - asian, middle eastern, black, native american, indian, pacific islander, hispanic/latin, white, other
- `height`: continuous, in inches
- `income`: categorical, in US $; -1 means rather not say
  - -1, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 100000, 150000, 250000, 500000, 1000000
- `job`: categorical
  - student, art/music/writing, banking/finance, administration, technology, construction, education, entertainment/media, management, hospitality, law, medicine, military, politics/government, sales/marketing, science/engineering, transportation, unemployed, other, rather not say, retired
- `last_online`:
- `location`:
- `offspring`: categorical
  - has a kid, has kids, doesn't have a kid, doesn't want kids
  - and/but might want them, wants them, doesn't want any, doesn't want more
- `orientation`: categorical
  - straight, gay, bisexual
- `pets`: categorical
  - has dogs, likes dogs, dislikes dogs
  - has cats, likes cats, dislikes cats
- `religion`: categorical
  - agnosticism, atheism, Christianity, Judaism, Catholicism, Islam, Hinduism, Buddhism, Other
  - and very serious about it, and somewhat serious about it, but not too serious about it, and laughing about it
- `sex`: categorical
  - m, f
- `sign`: categorical
  - aquarius, pisces, aries, taurus, gemini, cancer, leo, virgo, libra, scorpio, sagittarius, capricorn
  - but it doesn’t matter, and it matters a lot, and it’s fun to think about
- `smokes`: categorical
  - yes, sometimes, when drinking, trying to quit, no
- `speaks`: categorical
  - Afrikaans, Albanian, Arabic, Armenian, Basque, Belarusan, Bengali, Breton, Bulgarian, Catalan, Cebuano, Chechen, Chinese, C++, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, Frisian, Georgian, German, Greek, Gujarati, Ancient Greek, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Ilongo, Indonesian, Irish, Italian, Japanese, Khmer, Korean, Latin, Latvian, LISP, Lithuanian, Malay, Maori, Mongolian, Norwegian, Occitan, Other, Persian, Polish, Portuguese, Romanian, Rotuman, Russian, Sanskrit, Sardinian, Serbian, Sign Language, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Tibetan, Turkish, Ukranian, Urdu, Vietnamese, Welsh, Yiddish 
  - fluently, okay, poorly
- `status`: categorical
  - single, seeing someone, married, in an open relationship
- `essay0`: id-like variable
  - "My self summary"
- `essay1`: id-like variable
  - "What I'm doing with my life"
- `essay2`: id-like variable
  - "I'm really good at"
- `essay3`: id-like variable
  - "The first thing people usually notice about me"
- `essay4`: id-like variable
  - "Favorite books, movies, show, music, and food"
- `essay5`: id-like variable
  - "The six things I could never do without"
- `essay6`: id-like variable
  - "I spend a lot of time thinking about"
- `essay7`: id-like variable
  - "On a typical Friday night I am"
- `essay8`: id-like variable
  - "The most private thing I am willing to admit"
- `essay9`: id-like variable
  - "You should message me if..."  
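
The two-component features listed above could be separated into a modifier and a base category before modelling. The following is a minimal sketch for `diet` only, assuming the modifier always comes first and is one of `mostly` or `strictly`; the output column names `diet_modifier` and `diet_type` are hypothetical.

```python
import pandas as pd

# Small illustrative series using value formats listed for `diet`.
diet = pd.Series(["mostly anything", "strictly vegan", "vegetarian", None], name="diet")

# Optional leading modifier, followed by the diet type.
pattern = r"^(?:(?P<diet_modifier>mostly|strictly)\s+)?(?P<diet_type>.+)$"
parts = diet.str.extract(pattern)

print(parts)
```

Rows without a modifier keep a missing value in `diet_modifier`, so the absence of a modifier stays distinguishable from a missing answer, where both columns are missing.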

## Data Preparation

-

### Preliminaries

```python
# Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
import warnings

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Render plots inline at retina resolution and use the ggplot style
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
```

```python
# Specify the attribute names
attributeNames = [
    'age',
    'body_type',
    'diet',
    'drinks',
    'drugs',
    'education',
    'essay0',
    'essay1',
    'essay2',
    'essay3',
    'essay4',
    'essay5',
    'essay6',
    'essay7',
    'essay8',
    'essay9',
    'ethnicity',
    'height',
    'income',
    'job',
    'last_online',
    'location',
    'offspring',
    'orientation',
    'pets',
    'religion',
    'sex',
    'sign',
    'smokes',
    'speaks',
    'status',
]

# Read in the data and display ten random rows
df = pd.read_csv('df.csv', names=attributeNames)
df.sample(10)
```

```
Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
1324,37,average,strictly anything,socially,never,graduated from masters program,"hi. in summary, i am fun, cuddly, a little shy...",good freakin' question. what am i doing with m...,seeing the bright side of things.,"like everyone else, my smile. my laugh, too (i...",...,"san francisco, california",,straight,likes dogs and likes cats,christianity but not too serious about it,f,capricorn but it doesn&rsquo;t matter,no,"english (fluently), spanish (poorly)",single
967,25,average,mostly anything,often,sometimes,dropped out of two-year college,,judging things.,,,...,"san francisco, california",,straight,,,f,sagittarius but it doesn&rsquo;t matter,when drinking,"english (fluently), spanish (poorly)",single
2970,31,curvy,mostly anything,socially,never,,"let's leave a little mystery, shall we?","refer to ""self-summary""","i am a master stick figure artist, a writer (b...",,...,"oakland, california",doesn&rsquo;t have kids,straight,likes dogs and has cats,christianity but not too serious about it,f,leo,no,english (fluently),single
3934,44,fit,mostly anything,socially,never,graduated from college/university,"life is good. very good in fact. but am here, ...",trying to keep learning about stuff i don't kn...,"taking good care of those i care most about, k...",i'm told my eyes tell the story...am intereste...,...,"oakland, california",,straight,likes dogs and likes cats,christianity,f,aquarius,no,"english, german (okay), french (poorly)",single
2479,22,athletic,mostly halal,rarely,never,working on college/university,hi im ali. i live in the bay area and im middl...,,,,...,"berkeley, california",doesn&rsquo;t have kids,straight,likes dogs and likes cats,islam,m,aquarius,no,english,single
690,25,thin,strictly anything,socially,,graduated from college/university,"grew up on the east coast, lived in australia ...",i ask myself the same thing,"self-deprecating humor/sarcasm, solving proble...","my height, lack of hair, and green eyes.<br />...",...,"san francisco, california","doesn&rsquo;t have kids, but might want them",gay,likes dogs,atheism but not too serious about it,m,virgo and it&rsquo;s fun to think about,no,english (okay),single
1040,24,,,,never,graduated from college/university,"grew up in the south bay, ucsd for undergrad, ...",laughing a lot.<br />\nwork: challenging mysel...,making people feel comfortable...i think :),"probably that my energy is positive, warm, and...",...,"san francisco, california",,straight,,agnosticism,f,capricorn,,english,single
2601,40,average,vegetarian,often,never,graduated from college/university,"march '11: <a class=""ilink"" href=\n""/interests...","working, but waiting to see what becomes of tu...",karaoke. puns. hugs.,"march '11, at a party: ""i like your mohawk!"" ""...",...,"oakland, california",doesn&rsquo;t want kids,straight,likes dogs and has cats,,m,cancer but it doesn&rsquo;t matter,no,"english (fluently), french (poorly), spanish (...",available
1156,33,athletic,,socially,never,graduated from ph.d program,`one night i was sitting on the bed in my hote...,i'm a scientist working in medical research. i...,definitely i`m a good soccer player..then i li...,american people? my italian accent...,...,"san francisco, california",,straight,,atheism,m,leo,no,"english, italian",single
2486,32,athletic,,socially,never,graduated from law school,i will say that i adore the opera and a wonder...,i believe in making my life's work mean someth...,"will hunting, from good will hunting, describe...",,...,"oakland, california",,straight,likes dogs,other,f,leo and it&rsquo;s fun to think about,no,"english (fluently), farsi (fluently), spanish ...",single
```
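
The sampled rows show HTML entities such as `&rsquo;` and tags such as `<br />` inside the text values. One way these could be normalised is sketched below, using Python's standard `html.unescape`; the helper name `clean_text` is hypothetical and this is not a cleaning step described in the source.

```python
import html
import re

import pandas as pd


def clean_text(value):
    """Decode HTML entities and replace tags such as <br /> with a space."""
    if pd.isna(value):
        return value
    text = html.unescape(value)            # "&rsquo;" becomes a right single quote
    text = re.sub(r"<[^>]+>", " ", text)   # replace any HTML tag with a space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


# Illustrative values in the style of the sample output.
raw = pd.Series(["doesn&rsquo;t have kids", "my height.<br />\nmy eyes", None])
cleaned = raw.map(clean_text)

print(cleaned.tolist())
```

Applying the same function to categorical columns such as `offspring` and `sign` would keep their category labels consistent with the plain-text values listed in the feature descriptions.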


### Data Cleaning and Transformation

```python
# Report the dimensions and column data types
print(f'The shape of the dataset is {df.shape}\n')
print('Data types are below:')
print(df.dtypes)
```

```
The shape of the dataset is (5001, 31)

Data types are below:
age            object
body_type      object
diet           object
drinks         object
drugs          object
education      object
essay0         object
essay1         object
essay2         object
essay3         object
essay4         object
essay5         object
essay6         object
essay7         object
essay8         object
essay9         object
ethnicity      object
height         object
income         object
job            object
last_online    object
location       object
offspring      object
orientation    object
pets           object
religion       object
sex            object
sign           object
smokes         object
speaks         object
status         object
dtype: object
```
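
The shape and data types above are consistent with the CSV's own header row having been read as a data row: there are 5,001 rows rather than the 5,000 sampled, and every column, including `age` and `height`, is typed as `object`. A likely cause, offered as an assumption rather than a confirmed diagnosis, is that `names=` was passed to `pd.read_csv` without telling pandas that the file already has a header. The sketch below reproduces the effect on a tiny in-memory CSV (hypothetical data, with an unnamed index column) and shows one possible fix.

```python
import io

import pandas as pd

# Hypothetical miniature of df.csv: an unnamed index column plus a header row.
csv_text = ",age,height\n0,37,65\n1,25,70\n"
names = ["age", "height"]

# With names= and no header argument, pandas assumes the file has no header,
# so the header line is parsed as the first data row.
with_header_row = pd.read_csv(io.StringIO(csv_text), names=names)

# Using the file's own header and taking the first column as the index
# gives the expected row count and numeric dtypes.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_header_row.shape, clean.shape)
print(clean.dtypes)
```

If the column names in `attributeNames` must be enforced, passing `header=0` alongside `names=` tells pandas to discard the file's header line instead of treating it as data.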


### Checking for Missing Values

-

### Summary Statistics

-

### Continuous Features

-

### Fixing Column Names

-

### Categorical Features

-

### Dependent Variable

-

## Data Exploration 

-

### Univariate Visualisation

#### Bar Chart

#### Box plot

#### Histogram

### Multivariate Visualisation

#### Scatterplot

#### Categorical Attributes by

#### Facet plots

#### another scatterplot probably

## Statistical Modeling and Performance Evaluation

### Full Model

### Full Model Diagnostic Checks

#### predictive variable - scatterplot

#### actual variable - scatterplot

#### predictive variable - histogram

#### actual variable - histogram

### Backwards Feature Selection

### Reduced Model Diagnostic Checks

#### scatterplot of reduced model

#### histogram of reduced model

## Summary and Conclusions

## References