# Exploring the data

My impressions so far:

* The biggest challenge is dealing with the Breed, Color and Name variables
* Many of the variables are likely to interact with one another, especially with `AnimalType`. It may therefore be useful to use a model that captures feature interactions automatically (e.g. a random forest) rather than one that requires us to create them by hand (e.g. logistic regression)
* There are few missing values in the training set, except for the animal's name. A missing name (whether it is unknown or the animal has none) is likely to be informative in itself. It may also be worth imputing the neutered status and sex?
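The point about missing names carrying information can be sketched as a simple indicator feature. This is only an illustration on a toy frame: the column name `Name` is assumed from the variable list above, and `HasName` is a hypothetical feature name, not something created elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data; the `Name` column name is assumed.
toy = pd.DataFrame({
    'Name': ['Max', np.nan, 'Bella', np.nan],
    'AnimalType': ['Dog', 'Cat', 'Dog', 'Cat'],
})

def add_has_name(data, column='Name'):
    """Return a copy with a boolean HasName flag (hypothetical feature name)."""
    return data.assign(HasName=data[column].notnull())

toy_named = add_has_name(toy)
print(toy_named[['Name', 'HasName']])
```

Keeping the flag as its own column means the missingness is kept as a signal, whatever is later done with the name itself.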

Data cleaning / feature engineering overview:

* Split `SexuponOutcome` into separate neutered and gender features
* `AgeuponOutcome` is converted into a numeric feature
* `DateTime` is converted into a set of date-specific features (year, month, day of the week, etc.)
* Find strategies for classifying `Breed` into clusters/taxonomies, for dogs and cats separately; this may be difficult

In [None]:
%matplotlib inline

import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import seaborn as sns
import sklearn
from wordcloud import WordCloud, STOPWORDS

train = pd.read_csv('../input/train.csv')

## Overview

See what the data look like.

In [None]:
print('Number of training observations:', len(train.index))
train.describe()

Missing values by column:

In [None]:
train.isnull().sum()

## Examining the dependent variable, `OutcomeType`

What happened to the animals?

In [None]:
sns.countplot(x = "OutcomeType", data = train)

## Examining the independent variables


### `AnimalType`

Whether it's a cat or a dog.

In [None]:
sns.countplot(x = "AnimalType", data = train)

`OutcomeType` by `AnimalType` (relative frequency)

In [None]:
def rel_freq_plot(train, column):
    sns.pointplot(x = 'OutcomeType', y = 'Percent', hue = column, data = (train
        .groupby(['OutcomeType', column])
        .size()
        .reset_index()
        .rename(columns = {0: 'Count'})
        .merge(
            (train
             .groupby([column])
             .size()
             .reset_index()
             .rename(columns = {0: 'Total'})
            ), how = 'inner', on = column)
        .assign(Percent = lambda x: x.Count / x.Total)
    ))
    
rel_freq_plot(train, 'AnimalType')

Dogs are more likely to be returned to their owners and less likely to be transferred than cats.

### `SexuponOutcome`

This variable measures the sex of the animal as well as whether or not it is able to reproduce at the time of the outcome.

In [None]:
sns.countplot(x = "SexuponOutcome", data = train)

I'm thinking that this should be split into the following 2 variables instead of being kept as a single variable:

* `Sex` - Male / Female / Unknown
* `Neutered` - Neutered (which also covers spayed) / Intact / Unknown

In [None]:
def create_sex_variables(data):
    # Missing values are treated like the existing 'Unknown' category
    SexuponOutcome = data['SexuponOutcome'].fillna('Unknown')
    results = []
    for row in SexuponOutcome:
        # e.g. 'Neutered Male' -> ['Neutered', 'Male']
        row = row.split(' ')
        if len(row) == 1:
            row = ['Unknown', 'Unknown']
        results.append(row)
    # 'Spayed' is folded into 'Neutered' so that the status does not encode sex
    Neutered, Sex = zip(
        *[['Neutered', x[1]] if x[0] == 'Spayed' else x for x in results])
    return (data.assign(Neutered = Neutered).assign(Sex = Sex)
            .drop(['SexuponOutcome'], axis = 1))

train = train.pipe(create_sex_variables)

Quick look at the distribution of Neutered animals, and how being neutered affects outcomes:

In [None]:
sns.countplot(x = "Neutered", data = train)

In [None]:

rel_freq_plot(train, 'Neutered')

Quick look at the distribution of animal gender, and how gender affects outcomes:

In [None]:
sns.countplot(x = "Sex", data = train)

In [None]:
rel_freq_plot(train, 'Sex')

### `AgeuponOutcome`

Should be transformed into a numerical variable. There are 18 NAs - given how few, let's not worry too much about them and just impute the median.

In [None]:
def create_age_in_years(ages):
    results = []
    # Number of each unit per year; any other unit (i.e. years) divides by 1
    units = {'day': 365.0, 'week': 52.0, 'month': 12.0}
    for age in ages:
        if age == 'NA':
            results.append('NA')
        else:
            duration, unit = age.split(' ')
            # Strip the plural 's' so that '1 month' and '2 months' both match
            results.append(float(duration) / units.get(unit.rstrip('s'), 1.0))
    impute = np.median([age for age in results if age != 'NA'])
    return [age if age != 'NA' else impute for age in results]

train = (train
         .assign(Age = create_age_in_years(list(train['AgeuponOutcome'].fillna('NA'))))
         .drop(['AgeuponOutcome'], axis = 1))

Look at the distribution of ages:

In [None]:
sns.distplot(train['Age'], bins = 22)

The right skew makes sense in the context of age. If we do a log transformation, can we eliminate the skew so that age follows a roughly Gaussian distribution?

In [None]:
sns.distplot([x if x == 0 else np.log(x) for x in train['Age']], bins = 10)
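The cell above maps an age of 0 to 0, which is also the value the log gives for an age of exactly 1 year. One alternative, sketched here on made-up ages rather than the real data, is `np.log1p`, which computes log(1 + x) and is defined at zero. Whether it gets the real `Age` column close to Gaussian is not something this sketch shows.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed ages in years, including a zero; not taken from the data set.
toy_age = pd.Series([0.0, 0.02, 0.1, 0.25, 0.5, 1.0, 1.0, 2.0, 3.0, 5.0, 8.0, 15.0])

# log1p(x) = log(1 + x): finite at zero and order-preserving.
toy_log_age = np.log1p(toy_age)

# Compare sample skewness before and after the transformation.
print('skew before:', toy_age.skew())
print('skew after: ', toy_log_age.skew())
```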

Are the cats in shelters older than dogs, or vice versa?

In [None]:
sns.boxplot(x = "Age", y = "AnimalType", data = train)

Let's quickly take a look at how age affects outcome:

In [None]:
sns.violinplot(x = "OutcomeType", y = "Age", hue = "AnimalType", data = train, cut = 0, split = True,
              palette = "Set3")

Animals returned to their owners tend to be slightly older. Cats are dying, being adopted, and being transferred at a younger age than dogs.

### `DateTime`

We want to split this into multiple variables:

* Year, month, day of the week, time of day (morning, afternoon, evening/night)

In [None]:
def time_of_day(hour):
    if hour > 4 and hour < 12:
        return 'morning'
    elif hour >= 12 and hour < 18:
        return 'afternoon'
    else:
        return 'evening/night'
    
def day_of_the_week(DateTime):
    return datetime.datetime.strptime(DateTime, '%Y-%m-%d %H:%M:%S').weekday()

train = (train
         .assign(Year = train.DateTime.map(lambda x: x[:4]))
         .assign(Month = train.DateTime.map(lambda x: x[5:7]))
         .assign(Day = train.DateTime.map(lambda x: day_of_the_week(x)))
         .assign(TimeOfDay = train.DateTime.map(lambda x: time_of_day(int(x[11:13]))))
         .drop(['DateTime'], axis = 1))
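As a vectorized alternative to the cell above, the same features can be sketched with `pd.to_datetime` and the `.dt` accessor. The toy timestamps below are invented, and the hour boundaries simply mirror `time_of_day`. One difference to keep in mind: `Year` and `Month` come out as integers here, whereas the string slicing above keeps them as strings.

```python
import numpy as np
import pandas as pd

# Invented timestamps in the same 'YYYY-MM-DD HH:MM:SS' layout parsed above.
toy = pd.DataFrame({'DateTime': ['2014-02-12 18:22:00',
                                 '2013-10-13 12:44:00',
                                 '2015-01-31 03:05:00',
                                 '2014-07-11 09:00:00']})

parsed = pd.to_datetime(toy['DateTime'], format='%Y-%m-%d %H:%M:%S')
hour = parsed.dt.hour

toy_dates = (toy
    .assign(Year=parsed.dt.year)
    .assign(Month=parsed.dt.month)
    .assign(Day=parsed.dt.dayofweek)  # 0 is Monday, 6 is Sunday
    .assign(TimeOfDay=np.select(
        # Same boundaries as time_of_day: 5-11 morning, 12-17 afternoon.
        [(hour > 4) & (hour < 12), (hour >= 12) & (hour < 18)],
        ['morning', 'afternoon'],
        default='evening/night'))
    .drop(['DateTime'], axis=1))

print(toy_dates)
```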

Day of the week (0 is Monday, 6 is Sunday)

In [None]:
sns.countplot(x = "Day", data = train)

Time of the day

In [None]:
sns.countplot(x = "TimeOfDay", data = train)

### `Breed`

Breed only makes sense in the context of the animal's species, i.e. `AnimalType`.

In [None]:
print('Total number of breeds:', len(train['Breed'].unique()))
print('Number of cat breeds:', len(train[train['AnimalType'] == 'Cat']['Breed'].unique()))
print('Number of dog breeds:', len(train[train['AnimalType'] == 'Dog']['Breed'].unique()))


Given the massive number of breeds (mostly for dogs), I am worried about overfitting with regard to breed. Here is the distribution of the (log) number of animals per breed, for both species combined. We can see that a small number of breeds contain a large number of animals, while many breeds contain very few.

In [None]:

sns.distplot(np.log(train.groupby('Breed').size().values), bins = 10)

*Dog breeds*

Let's look at the names of the top breeds to see if we can get any ideas around how to structure this feature.

In [None]:
sns.barplot(x = "Count", y = "Breed", data = (
        train[train['AnimalType'] == 'Dog']
        .groupby(['Breed'])
        .size()
        .reset_index()
        .rename(columns = {0: 'Count'})
        .sort_values(['Count'], ascending = False)
        .head(n = 25)))


Let's try splitting the breed names into words, and seeing which words show up frequently.

In [None]:

def wordcount_dict(wordlist):
    results = {}
    for item in wordlist:
        # Raw string so that \W is not read as an invalid string escape
        item = re.split(r'\W+', item)
        for word in item:
            results[word] = results.get(word, 0) + 1
    return results

def wordcloud_string(wordcount_dict):
    final_list = []
    for word, count in wordcount_dict.items():
        final_list += [word] * count
    return ' '.join(final_list)

def display_wordcloud(wordcloud_string):
    wordcloud = (
        WordCloud(background_color = 'white', stopwords = STOPWORDS, height = 700, width = 1000)
        .generate(wordcloud_string))
    plt.imshow(wordcloud)
    plt.show()
    return None

display_wordcloud(
    wordcloud_string(
        wordcount_dict(list(train.query('AnimalType == "Dog"')['Breed']))
    )
)

Mix (mixed-breed, as opposed to pure-breed) is the most common word, not surprisingly. Now that I am looking at this, I wonder if using n-grams in addition to single words would give us additional information?
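A minimal sketch of the n-gram idea, counting word bigrams over a handful of illustrative breed strings (these are examples, not values pulled from the training file). The helper name `ngram_counts` is hypothetical.

```python
import re
from collections import Counter

def ngram_counts(names, n=2):
    """Count word n-grams across a list of breed strings."""
    counts = Counter()
    for name in names:
        # Split on non-word characters and drop any empty pieces.
        words = [w for w in re.split(r'\W+', name) if w]
        for i in range(len(words) - n + 1):
            counts[' '.join(words[i:i + n])] += 1
    return counts

# Illustrative breed strings only.
toy_breeds = ['Pit Bull Mix', 'Pit Bull', 'Labrador Retriever Mix',
              'German Shepherd/Labrador Retriever']

toy_bigrams = ngram_counts(toy_breeds, n=2)
print(toy_bigrams.most_common(3))
```

Because the split ignores the slash, a cross such as `German Shepherd/Labrador Retriever` also produces the bigram `Shepherd Labrador`, which is an artifact. Splitting on `/` first would avoid that.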

Let's see how many breeds there are for dogs when we split names on forward slashes and remove the word 'Mix':

In [None]:
def common_breeds(breeds):
    breed_counts = {}
    for breed in breeds:
        # Drop the ' Mix' suffix, then count each side of a cross separately
        breed = breed.replace(' Mix', '').split('/')
        for subbreed in breed:
            breed_counts[subbreed] = breed_counts.get(subbreed, 0) + 1
    return breed_counts

breed_counts = common_breeds(list(train[train['AnimalType'] == 'Dog']['Breed']))

Number of dog breeds with >= 30 animals (including mixed breeds) in the training set:

In [None]:
len([breed for breed, count in breed_counts.items() if count >= 30])

**I further tackle dog breeds in a different notebook!**

*Cat breeds*

There are few enough cat breeds that we may be able to manually create features that capture much of the information in breed - the things that come to mind are:

* Mix vs non-mix
* Short/medium/long hair
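The two candidate features above could be derived along these lines. This is a rough sketch: the function name, the returned labels, the choice to treat a slash-separated cross as a mix, and the example breed strings are all assumptions for illustration.

```python
import re

def cat_breed_features(breed):
    """Derive (is_mix, hair_length) from a cat breed string.

    The feature names and matching rules are illustrative assumptions.
    """
    lowered = breed.lower()
    # Assumption: an explicit 'Mix' or a slash-separated cross both count as mixed.
    is_mix = 'mix' in lowered.split() or '/' in breed
    # Matches 'shorthair', 'medium hair', 'longhair' and similar spellings.
    match = re.search(r'(short|medium|long)\s*hair', lowered)
    hair_length = match.group(1) if match else 'unknown'
    return is_mix, hair_length

for example in ['Domestic Shorthair Mix', 'Domestic Medium Hair Mix',
                'Domestic Longhair', 'Siamese']:
    print(example, '->', cat_breed_features(example))
```

Breeds that do not mention hair length at all fall into `'unknown'`, so it is worth checking how many rows end up there before relying on the feature.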

In [None]:
sns.barplot(x = "Count", y = "Breed", data = (
        train[train['AnimalType'] == 'Cat']
        .groupby(['Breed'])
        .size()
        .reset_index()
        .rename(columns = {0: 'Count'})
        .sort_values(['Count'], ascending = False)
        .head(n = 20)))

In [None]:
display_wordcloud(
    wordcloud_string(
        wordcount_dict(list(train.query('AnimalType == "Cat"')['Breed']))
    )
)