# Predicting Age from OK Cupid Data

## Introduction
The objective of this assignment is to predict the age of an individual on a dating profile, using OK Cupid profile data of users near San Francisco. This dataset was sourced from openml.org at https://www.openml.org/d/42164.

## Table of contents

* [Overview](#Overview)
* [Data preparation](#Data-Preparation)

## Overview
### Data source
The dating profile dataset on openml.org was originally collected from public profiles on www.okcupid.com. It covers users who lived within a 25-mile radius of San Francisco, were active online between 30 June 2011 and 30 June 2012, and had at least one profile picture. The data was collected and made public with the permission of OK Cupid. The CSV linked from openml.org is available at https://github.com/rudeboybert/JSE_OkCupid/blob/master/profiles.csv.zip, and the same source includes a text file that describes all the variables.

This dataset has 59,946 observations (i.e. users) and 31 features, including the target feature, `age`. We will use a sample of 5,000 observations for this assignment.
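
As a minimal sketch of how such a sample could be drawn, the snippet below uses `DataFrame.sample` on a synthetic stand-in table. Only the row counts (59,946 and 5,000) come from the text; the name `full`, the fake `age` values, and the seed `42` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the full profiles table (hypothetical values);
# only the number of rows, 59,946, is taken from the text.
full = pd.DataFrame({"age": np.arange(59_946) % 50 + 18})

# Draw 5,000 rows without replacement. The seed is arbitrary and only
# makes the draw repeatable.
sample = full.sample(n=5_000, random_state=42)
print(sample.shape)
```

Fixing `random_state` means the same 5,000 rows are selected on every run, which keeps later results comparable.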

### Project objective

### Target feature
The target feature is `age`.

### Descriptive features
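There are 30 descriptive features in this dataset. Some categorical features combine two factors in a single value; for example, `diet` holds a modifier (mostly or strictly) as well as the type of diet. For such features, each factor is listed on its own line below.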
- `body_type`: categorical
  - rather not say, thin, overweight, skinny, average, fit, athletic, jacked, a little extra, curvy, full figured, used up
- `diet`: categorical
  - mostly, strictly
  - anything, vegetarian, vegan, kosher, halal, other
- `drinks`: categorical
  - very often, often, socially, rarely, desperately, not at all
- `drugs`: categorical
  - never, sometimes, often
- `education`: categorical
  - graduated from, working on, dropped out of
  - high school, two-year college, university, masters program, law school, med school, Ph.D program, space camp
- `ethnicity`: categorical
  - asian, middle eastern, black, native american, indian, pacific islander, hispanic/latin, white, other
- `height`: continuous, in inches
- `income`: categorical, in US$; -1 means rather not say
  - -1, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 100000, 150000, 250000, 500000, 1000000
- `job`: categorical
  - student, art/music/writing, banking/finance, administration, technology, construction, education, entertainment/media, management, hospitality, law, medicine, military, politics/government, sales/marketing, science/engineering, transportation, unemployed, other, rather not say, retired
- `last_online`:
- `location`:
- `offspring`: categorical
  - has a kid, has kids, doesn't have a kid, doesn't want kids
  - and/but might want them, wants them, doesn't want any, doesn't want more  
- `orientation`: categorical
  - straight, gay, bisexual
- `pets`: categorical
  - has dogs, likes dogs, dislikes dogs
  - has cats, likes cats, dislikes cats
- `religion`: categorical
  - agnosticism, atheism, Christianity, Judaism, Catholicism, Islam, Hinduism, Buddhism, Other
  - and very serious about it, and somewhat serious about it, but not too serious about it, and laughing about it
- `sex`: categorical
  - m, f
- `sign`: categorical
  - aquarius, pisces, aries, taurus, gemini, cancer, leo, virgo, libra, scorpio, sagittarius, capricorn
  - but it doesn’t matter, and it matters a lot, and it’s fun to think about
- `smokes`: categorical
  - yes, sometimes, when drinking, trying to quit, no
- `speaks`: categorical
  - Afrikaans, Albanian, Arabic, Armenian, Basque, Belarusan, Bengali, Breton, Bulgarian, Catalan, Cebuano, Chechen, Chinese, C++, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, Frisian, Georgian, German, Greek, Gujarati, Ancient Greek, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Ilongo, Indonesian, Irish, Italian, Japanese, Khmer, Korean, Latin, Latvian, LISP, Lithuanian, Malay, Maori, Mongolian, Norwegian, Occitan, Other, Persian, Polish, Portuguese, Romanian, Rotuman, Russian, Sanskrit, Sardinian, Serbian, Sign Language, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Tibetan, Turkish, Ukranian, Urdu, Vietnamese, Welsh, Yiddish 
  - fluently, okay, poorly
- `status`: categorical
  - single, seeing someone, married, in an open relationship
- `essay0`: id-like variable
  - "My self summary"
- `essay1`: id-like variable
  - "What I'm doing with my life"
- `essay2`: id-like variable
  - "I'm really good at"
- `essay3`: id-like variable
  - "The first thing people usually notice about me"
- `essay4`: id-like variable
  - "Favorite books, movies, show, music, and food"
- `essay5`: id-like variable
  - "The six things I could never do without"
- `essay6`: id-like variable
  - "I spend a lot of time thinking about"
- `essay7`: id-like variable
  - "On a typical Friday night I am"
- `essay8`: id-like variable
  - "The most private thing I am willing to admit"
- `essay9`: id-like variable
  - "You should message me if..."  
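
Several of the features above (`diet`, `education`, `religion`, `sign`) pack two factors into one string. One possible way to separate them, sketched here for `diet` only, is a regular expression with an optional modifier group. The column names `diet_modifier` and `diet_type` are hypothetical and are not part of the dataset.

```python
import pandas as pd

# A few diet values in the style shown above, plus a missing entry.
diet = pd.Series(["mostly vegetarian", "strictly other", "anything", None])

# Optional leading modifier (mostly/strictly), followed by the diet type.
# Named groups become the column names of the resulting DataFrame.
pattern = r"^(?:(?P<diet_modifier>mostly|strictly)\s+)?(?P<diet_type>.+)$"
parts = diet.str.extract(pattern)
print(parts)
```

Values without a modifier, such as "anything", keep their type and get a missing modifier; missing inputs stay missing in both columns.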

## Data Preparation

-

### Preliminaries

-

In [1]:
# Importing modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
import warnings
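# Suppress warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Render plots inline at retina resolution, using the ggplot style
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")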

In [2]:
# Specifying the attribute names
attributeNames = [
    'age',
    'body_type',
    'diet',
    'drinks',
    'drugs',
    'education',
    'essay0',
    'essay1',
    'essay2',
    'essay3',
    'essay4',
    'essay5', 
    'essay6',
    'essay7',
    'essay8',
    'essay9',
    'ethnicity',
    'height',
    'income',
    'job',
    'last_online',
    'location',
    'offspring',
    'orientation',
    'pets',
    'religion',
    'sex',
    'sign',
    'smokes',
    'speaks',
    'status',
]

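# Read in the data, applying the attribute names above.
# Note: if df.csv already contains a header row, pass header=0 as well,
# so that the row is replaced by these names rather than read as data.
df = pd.read_csv('df.csv', names=attributeNames)
df.sample(10)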

| Unnamed: 0 | age | body_type | diet | drinks | drugs | education | essay0 | essay1 | essay2 | essay3 | ... | location | offspring | orientation | pets | religion | sex | sign | smokes | speaks | status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1564 | 27 | a little extra |  | socially | sometimes | graduated from masters program | <p>i've been in california for a few years now... | <p>still trying to find the right balance betw... | seeing the humor or at least the positive aspe... | <p>i tend to be very open-minded and easy to g... | ... | san francisco, california | doesn&rsquo;t want kids | bisexual | likes dogs and likes cats | agnosticism but not too serious about it | m | libra but it doesn&rsquo;t matter | sometimes | english (fluently), german (fluently) | single |
| 1574 | 38 | athletic | strictly other | socially | never | college/university | come play with me... | i'm enjoying every second and sweatin' to keep... | i'm really good at understanding people, swear... | i'm a friendly person and i wanna talk to ever... | ... | novato, california |  | straight | likes dogs and likes cats | agnosticism and laughing about it | m | capricorn and it&rsquo;s fun to think about | no | english, spanish (okay) | single |
| 1479 | 41 | a little extra | mostly anything | socially | never | graduated from college/university | if i like you...can i keep you?<br />\n<br />\... | i am secure &amp; happy with myself, but...lif... | everything...lol but i am really |  | ... | richmond, california | has kids | straight |  | catholicism and somewhat serious about it | m | virgo but it doesn&rsquo;t matter | no | english | single |
| 3938 | 52 | average | strictly other | socially | never |  | i am having a self improvement kind of year. l... | i seem to have lots of things going on but try... | learning new things - lately i've been making ... | that i type in lower case letters - a lot | ... | oakland, california |  | straight |  |  | f |  | no | english, danish (okay) | single |
| 1918 | 33 | thin |  | socially |  | graduated from med school | let's see life is good. i absolutely love what... | doctoring. my patients are incredible, fascina... | intuition. silence. i wear my heart on my slee... | i use hand gestures when i talk. and my big eyes. | ... | san francisco, california |  | straight | likes dogs |  | f | gemini but it doesn&rsquo;t matter | no | english (fluently), spanish (okay), hindi (poo... | single |
| 4540 | 29 | average |  | socially | never | graduated from college/university | i love self deprecating humor and a good satir... | relocating to san francisco.<br />\n<br />\nin... | blurring the lines between nerdiness and nice ... | that would depend entirely upon their ability ... | ... | san francisco, california |  | straight |  |  | m |  | no | english | single |
| 1995 | 26 | skinny |  | often | never | graduated from space camp | well after creating and deleting 7 intros i'm ... |  | making new friends, i love meeting new people. | well i tend to dye my hair interesting colors | ... | burlingame, california |  | gay | likes dogs and has cats | atheism and somewhat serious about it | m | leo but it doesn&rsquo;t matter | when drinking | english (fluently) | single |
| 2897 | 29 | athletic |  | socially | never | graduated from college/university | i graduated from sacramento state university w... | i am trying to get my cpa licence in near future. | making people laugh and understanding people. | is my smile. | ... | daly city, california |  | straight |  | hinduism and somewhat serious about it | m | capricorn but it doesn&rsquo;t matter | when drinking | english | single |
| 1608 | 31 |  |  | socially | never | graduated from college/university | hey there my name is adam and i'm from palo al... | i went to highschool in the bay area, college ... | i am really good at working with kids, sports,... | i think people usually notice my smile and my ... | ... | burlingame, california |  | straight |  |  | m |  | no | english | single |
| 3594 | 30 | thin | mostly vegetarian | socially |  | graduated from college/university |  |  |  |  | ... | berkeley, california |  | straight | likes dogs and likes cats |  | f | taurus but it doesn&rsquo;t matter | when drinking | english (fluently), french (okay) | single |
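
The sampled rows above still contain HTML remnants such as `&rsquo;`, `&amp;`, `<p>` and `<br />`. A minimal sketch of one way to tidy such text is shown below, assuming that decoding entities and dropping tags is acceptable for this analysis. The helper name `clean_text` is hypothetical.

```python
import html
import re

import pandas as pd


def clean_text(value):
    """Decode HTML entities and drop tags; missing values pass through."""
    if pd.isna(value):
        return value
    text = html.unescape(value)            # e.g. &amp; becomes &
    text = re.sub(r"<[^>]+>", " ", text)   # remove tags such as <p> and <br />
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


raw = pd.Series(["doesn&rsquo;t want kids", "relocating.<br />\n<br />\nin", None])
print(raw.map(clean_text).tolist())
```

Because entities are decoded before tags are removed, any escaped markup in the text (for example `&lt;b&gt;`) would also be stripped; swap the two steps if that is not wanted.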


### Data Cleaning and Transformation

-

### Checking for Missing Values

-
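
As noted in the feature list, `income` uses -1 to mean "rather not say", so a plain missing-value count would overlook it. The sketch below, run on a small made-up frame rather than the real data, shows one way to convert that sentinel to `NaN` before counting. The frame `demo` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Small made-up frame (hypothetical values) mimicking two of the columns.
demo = pd.DataFrame({
    "income": [-1, 20000, -1, 100000],
    "diet": ["mostly anything", None, None, "vegan"],
})

# Treat the -1 sentinel as missing so it is not read as a real income.
demo["income"] = demo["income"].replace(-1, np.nan)

# Count missing values per column.
print(demo.isna().sum())
```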

### Summary Statistics

-

### Continuous Features

-

### Fixing Column Names

-

### Categorical Features

-

### Dependent Variable

-

## Data Exploration 

-

### Univariate Visualisation

#### Bar Chart

#### Box plot

#### Histogram

### Multivariate Visualisation

#### Scatterplot

#### Categorical Attributes by

#### Facet plots

#### another scatterplot probably

## Statistical Modeling and Performance Evaluation

### Full Model

### Full Model Diagnostic Checks

#### predictive variable - scatterplot

#### actual variable - scatterplot

#### predictive variable - histogram

#### actual variable - histogram

### Backwards Feature Selection

### Reduced Model Diagnostic Checks

#### scatterplot of reduced model

#### histogram of reduced model

## Summary and Conclusions

## References