## Getting a feel for the data

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn

In [None]:
! ls ../input

In [None]:
! head ../input/train.csv

Looks quite clean. Hard to know what the DateTime column means (time of registration at animal shelter?)

In [None]:
train_df = pd.read_csv('../input/train.csv')

In [None]:
train_df.head()

In [None]:
# AgeuponOutcome will probably be more useful as an actual numeric variable. Also, it's in multiple units.
# Let's take a look at the types of units..

import math

def is_a_value(pandas_value):
    # pandas uses NaN for missing values, 
    # which is kind of annoying
    if not isinstance(pandas_value, float):
        return True
    return not math.isnan(pandas_value)

ages = train_df.AgeuponOutcome.tolist()
ages = [age for age in ages if is_a_value(age)]
# Each value looks like "<number> <unit>", so the unit is the second token
units = set(age.split()[1] for age in ages)
sorted(units)

Okay, so now we know what units we're dealing with, let's convert all of the records to a common unit. Weeks seems easiest.

In [None]:
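def normalise_age_at_outcome(age):
    """
    Convert an age string such as "3 weeks" to a float number of weeks.
    Missing values (NaN) are passed through unchanged.

    >>> normalise_age_at_outcome("3 weeks")
    3.0
    >>> round(normalise_age_at_outcome("1 month"), 2)
    4.33
    """
    if not is_a_value(age):
        return age
    n, unit = age.split()
    n = int(n)
    if unit.startswith('month'):
        length_of_month_in_weeks = 52.0/12.0
        return n * length_of_month_in_weeks
    elif unit.startswith('year'):
        return n * 52.0
    elif unit.startswith('week'):
        return float(n)
    elif unit.startswith('day'):
        return float(n) / 7.0
    # Fail loudly rather than silently returning None for an unexpected unit
    raise ValueError('Unknown age unit: %r' % unit)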

In [None]:
# a few quick tests
normalise_age_at_outcome('3 days'), normalise_age_at_outcome('7 weeks'), normalise_age_at_outcome('4 months')
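
For reference, the same arithmetic can be written as a lookup table, which makes the conversion factors easy to check by eye. This is only a sketch: `WEEKS_PER_UNIT` and `age_in_weeks` are names made up for this example, and it assumes every unit is one of day, week, month or year (singular or plural).

```python
# Hypothetical lookup table: how many weeks one of each unit is worth.
# A month is taken as 1/12 of a 52-week year, so roughly 4.33 weeks.
WEEKS_PER_UNIT = {
    'day': 1.0 / 7.0,
    'week': 1.0,
    'month': 52.0 / 12.0,
    'year': 52.0,
}

def age_in_weeks(age):
    """Convert a string such as '4 months' to a number of weeks."""
    n, unit = age.split()
    unit = unit.rstrip('s')  # 'months' -> 'month', 'days' -> 'day'
    return int(n) * WEEKS_PER_UNIT[unit]

examples = ['3 days', '7 weeks', '4 months', '2 years']
converted = {age: round(age_in_weeks(age), 2) for age in examples}
print(converted)
```

So 4 months comes out at about 17.33 weeks and 2 years at 104 weeks.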

In [None]:
train_df['age_at_outcome_in_weeks'] = train_df['AgeuponOutcome'].apply(normalise_age_at_outcome)

## Exploring Data

### Effect of species

It seems pretty realistic to assume that different species will be 
treated differently by both the shelter and prospective 
adoptive families.

In [None]:
# factorplot was renamed to catplot in newer seaborn releases; the bars show the mean age per group
sns.catplot(x='OutcomeType', y='age_at_outcome_in_weeks', col='AnimalType', data=train_df, kind='bar')

Wow, it looks like cats are euthanised about a year younger than dogs, on average.
Cats that die in the shelter are also much younger than dogs that do; they often die
at around a year old.

It also looks like a dog older than about 2 years is unlikely 
to be adopted.

:(

### Effect of breed

I wonder if we can identify which breeds 'do best' and whether being purebred is significant.

In [None]:
# Looking at the data (see above), mixed breeds have a Breed value that ends in 'Mix'

def is_mixed_breed(animal):
    return animal.endswith('Mix')

train_df['is_mixed_breed'] = train_df['Breed'].apply(is_mixed_breed)

In [None]:
sns.catplot(x='OutcomeType', y='age_at_outcome_in_weeks', col='AnimalType', hue='is_mixed_breed', data=train_df, kind='bar')

I think this is saying something, but the bars only show mean age. We need to bring the counts per group in as well to have a decent look.
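
One way to get those counts is to tally each (species, mixed or not, outcome) combination. The sketch below does it with a plain `Counter` on a handful of made-up rows (`toy_rows` is not real shelter data) so that it runs on its own. On the real frame, `train_df.groupby(['AnimalType', 'is_mixed_breed', 'OutcomeType']).size()` should give the same kind of table.

```python
from collections import Counter

# Made-up rows standing in for train_df; only the three columns used here.
toy_rows = [
    {'AnimalType': 'Dog', 'is_mixed_breed': True, 'OutcomeType': 'Adoption'},
    {'AnimalType': 'Dog', 'is_mixed_breed': True, 'OutcomeType': 'Adoption'},
    {'AnimalType': 'Dog', 'is_mixed_breed': False, 'OutcomeType': 'Transfer'},
    {'AnimalType': 'Cat', 'is_mixed_breed': True, 'OutcomeType': 'Transfer'},
    {'AnimalType': 'Cat', 'is_mixed_breed': True, 'OutcomeType': 'Adoption'},
]

# Tally how many animals fall into each combination
counts = Counter(
    (row['AnimalType'], row['is_mixed_breed'], row['OutcomeType'])
    for row in toy_rows
)

for key, n in sorted(counts.items()):
    print(key, n)
```

A mean over two or three animals says very little, so groups with small counts are the ones to be wary of in the bar plots.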

How do breeds do anyway?

In [None]:
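# Horizontal bars need the numeric variable on x; this draws one panel per breed, so col_wrap keeps the grid from being a single row
sns.catplot(x='age_at_outcome_in_weeks', y='OutcomeType', col='Breed', col_wrap=4, data=train_df, kind='bar', orient="h")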
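
A plot with one panel per breed gets unwieldy quickly, so it may be worth looking at only the most common breeds first. Here is a rough sketch of that filtering step on a made-up list (`toy_breeds` and `top_n` are invented for the example). On the real frame the equivalent would be something like `train_df['Breed'].value_counts().head(top_n).index` followed by `train_df[train_df['Breed'].isin(...)]`.

```python
from collections import Counter

# Made-up breed labels standing in for train_df['Breed']
toy_breeds = [
    'Breed A Mix', 'Breed A Mix', 'Breed A Mix',
    'Breed B', 'Breed B',
    'Breed C',
]

top_n = 2  # hypothetical cut-off; pick whatever still fits on screen

# The top_n most frequent breeds, most common first
top_breeds = [breed for breed, _ in Counter(toy_breeds).most_common(top_n)]

# Keep only the rows whose breed is in that short list
kept = [breed for breed in toy_breeds if breed in top_breeds]

print(top_breeds)
print(len(kept), 'of', len(toy_breeds), 'rows kept')
```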

More to come...

In [None]:
# TODO .. build a proper classifier

from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(train_df.to_dict(orient='records'))
X
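
To make it clearer what that matrix contains, here is a toy version of the idea behind `DictVectorizer`: string values become one `key=value` indicator column each, and numeric values keep a single column. This is a sketch for illustration, not sklearn's code; `one_hot_records` and `toy_records` are made up for the example.

```python
def one_hot_records(records):
    """Toy illustration of dict vectorising (not sklearn's implementation)."""
    # First pass: collect the column names
    columns = set()
    for rec in records:
        for key, value in rec.items():
            if isinstance(value, str):
                columns.add('%s=%s' % (key, value))  # one column per distinct string
            else:
                columns.add(key)  # numeric features keep one column
    columns = sorted(columns)
    index = {name: i for i, name in enumerate(columns)}

    # Second pass: fill in one row per record
    rows = []
    for rec in records:
        row = [0.0] * len(columns)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index['%s=%s' % (key, value)]] = 1.0
            else:
                row[index[key]] = float(value)
        rows.append(row)
    return columns, rows

# Made-up records shaped like train_df.to_dict(orient='records')
toy_records = [
    {'AnimalType': 'Dog', 'age_at_outcome_in_weeks': 52.0},
    {'AnimalType': 'Cat', 'age_at_outcome_in_weeks': 3.0},
]

columns, rows = one_hot_records(toy_records)
print(columns)
for row in rows:
    print(row)
```

Because every distinct string gets its own column, feeding in free-form columns such as DateTime or Breed unchanged will produce a very wide matrix. It is probably worth choosing the features deliberately before building the classifier.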