# Overview
In this notebook, I analyze trends in the outcome states of animals from the Austin Animal Shelter. The analysis aims to find insights that could help shelters identify animals that need more help than others, and possibly develop strategies to increase their chance of finding a forever home, or at least to improve their situation in the shelter.

#### Contents:
<ul>
<li><a href="#explore">Exploring Data</a></li>
<li><a href="#quality">Addressing Data Quality Issues</a></li>
<li><a href="#visualize">Visualizing Data for Insights</a></li>
</ul>

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

<a id='explore'></a>
# Exploring Data


In [None]:
train_set = pd.read_csv('../input/shelter-animal-outcomes/train.csv.gz', compression='gzip', 
                         header=0, sep=',', quotechar='"')

In [None]:
train_set.info()

In [None]:
train_set.describe()

The AnimalID and DateTime columns are not useful for this exploration or for prediction, so I will drop them. An animal's name doesn't contribute to its outcome state either, but I keep it to look at trends in pet names.

In [None]:
train_set.drop(['AnimalID','DateTime'],axis=1,inplace=True)

In [None]:
train_set.head(8)

Let's take a look at the percentage of missing values in each column.

In [None]:
missing_records=train_set.isnull().sum()
missing_percent=round(100*missing_records/train_set.isnull().count(),2)
missing_data=pd.concat([missing_records,missing_percent],axis=1,keys=['No. missing records', 'missing %'])

In [None]:
missing_data

51% of OutcomeSubtype is missing; let's see why that might be the case.

First, let's check the OutcomeType levels and their corresponding OutcomeSubtypes.

In [None]:
train_set.groupby('OutcomeType').OutcomeSubtype.unique().reset_index()

It turns out that the Return_to_owner OutcomeType doesn't have any corresponding subtypes. Other OutcomeType levels also have null values among their OutcomeSubtypes.
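
The same check can be expressed as a share of missing subtypes per outcome type. This is a minimal sketch on a small made-up table (`toy` and its values are illustrative assumptions, not the shelter data):

```python
import pandas as pd

# Hypothetical stand-in for train_set; values are illustrative only.
toy = pd.DataFrame({
    'OutcomeType': ['Adoption', 'Adoption', 'Return_to_owner', 'Transfer', 'Euthanasia'],
    'OutcomeSubtype': ['Foster', None, None, 'Partner', 'Suffering'],
})

missing_share = (
    toy['OutcomeSubtype'].isnull()      # True where the subtype is missing
    .groupby(toy['OutcomeType'])        # group the flags by outcome type
    .mean()                             # fraction of missing values per group
    .mul(100).round(2)                  # express as a percentage
)
print(missing_share)
```

An outcome type whose share is 100% has no recorded subtypes at all.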

SexuponOutcome is missing one value and AgeuponOutcome is missing 18 values, which could be imputed when needed, depending on how the columns are used in prediction.
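
One possible way to impute the missing ages, sketched under the assumption that ages have already been converted to a numeric column in years (the column name `age_years` and the data are hypothetical), is to fill each gap with the median age of the same animal type:

```python
import numpy as np
import pandas as pd

# Hypothetical data: numeric ages in years with a few gaps.
toy = pd.DataFrame({
    'AnimalType': ['Dog', 'Dog', 'Dog', 'Cat', 'Cat', 'Cat'],
    'age_years': [1.0, 3.0, np.nan, 0.5, np.nan, 2.5],
})

# Median age of each animal type, aligned row by row with the original frame.
group_median = toy.groupby('AnimalType')['age_years'].transform('median')

# Fill only the missing entries; observed ages are left untouched.
toy['age_years_imputed'] = toy['age_years'].fillna(group_median)
print(toy)
```

Whether the median, the mean, or a model-based estimate is appropriate depends on how the feature is used later.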

Now, let's look at the unique values in each column to help shape our plan for cleaning and visualization.

In [None]:
train_set.nunique()

Two columns have a very large number of unique values: Breed and Color. Names also vary a lot among pets!

<a id='quality'></a>

# Addressing Data Quality Issues

Approach: address consistency issues in existing features and create new features that may help the analysis.

#### AgeuponOutcome

AgeuponOutcome is stored as a string. To make its values consistent, I will convert each record to the animal's age in years.

In [None]:
# Converting values of age into numerical year values
output_age_in_years=[]
age_column=train_set['AgeuponOutcome'].tolist()
for record in age_column:
    if type(record)== float:  #nan(null) are of type float
        formatted_age=None
    else:
        # parse the whole leading number, not just its first character
        number=float(record.split(' ')[0])
        if 'year' in record:
            formatted_age=number
        elif 'month' in record:
            formatted_age=round(number/12,3)
        elif 'week' in record:
            formatted_age=round(number/52,3)
        elif 'day' in record:
            formatted_age=round(number/365,3)
        else:
            # unrecognized unit: treat as missing instead of reusing the previous value
            formatted_age=None
    output_age_in_years.append(formatted_age)
# drop the age column and insert the new one
del train_set['AgeuponOutcome']
train_set['outcomeAgeinYears']=output_age_in_years
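
A vectorized sketch of the same conversion, assuming each value looks like a number followed by a unit and taking 12 months, 52 weeks and 365 days per year (the sample strings and the `units_per_year` mapping are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample of age strings, including a missing value.
ages = pd.Series(['1 year', '10 years', '3 weeks', '2 months', '5 days', None])

# Split each string into its number and its (singular) unit.
parts = ages.str.extract(r'(?P<number>\d+)\s+(?P<unit>day|week|month|year)')

# How many of each unit make up one year (an approximation).
units_per_year = {'day': 365, 'week': 52, 'month': 12, 'year': 1}

age_in_years = (
    parts['number'].astype(float) / parts['unit'].map(units_per_year)
).round(3)
print(age_in_years)
```

Extracting the full number matters for two-digit ages such as `'10 years'`, which would be misread if only the first character were used.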

In [None]:
train_set.head(8)

#### SexuponOutcome

In [None]:
train_set['SexuponOutcome'].value_counts()

This column holds two pieces of information about an animal: gender and sterilization/fertility state. I will create two new features from SexuponOutcome: Gender and State.

But first, I impute the missing value in the SexuponOutcome column. Since there is only one missing value, I will impute it with the mode.


In [None]:
mode=train_set['SexuponOutcome'].mode()[0]
train_set['SexuponOutcome']=train_set['SexuponOutcome'].fillna(mode)
assert train_set['SexuponOutcome'].isnull().sum()==0

In [None]:
def create_gender_state_info(gender_info,field):
    
        if gender_info=='Unknown': return 'Unknown'
        else: 
            split_gender=gender_info.split(' ')
            if field=='gender':
                return split_gender[1]
            elif field=='state':
                return split_gender[0]

In [None]:
train_set['Gender']=train_set['SexuponOutcome'].apply(create_gender_state_info,args=('gender',))
train_set['State']=train_set['SexuponOutcome'].apply(create_gender_state_info,args=('state',))
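
The same split can be sketched with pandas string methods instead of `apply`. The sample values below are illustrative; the sketch assumes every value is either `'Unknown'` or a state followed by a gender:

```python
import pandas as pd

# Hypothetical sample of SexuponOutcome values.
sex = pd.Series(['Neutered Male', 'Spayed Female', 'Intact Male', 'Unknown'])

# Split once on the first space: column 0 is the state, column 1 the gender.
parts = sex.str.split(' ', n=1, expand=True)

state = parts[0]                      # 'Unknown' stays 'Unknown'
gender = parts[1].fillna('Unknown')   # no second word means the gender is unknown
print(pd.DataFrame({'State': state, 'Gender': gender}))
```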

Now, drop the original SexuponOutcome column, as I don't need it anymore.

In [None]:
train_set.drop('SexuponOutcome',axis=1,inplace=True)

#### Color

Colors vary a lot, but there is a purity aspect to them that may help reduce the variability for analysis. I will create a ColorPurity column.

In [None]:
def create_color_purity(color):
    # 'Tricolor' and slash-separated colors count as mixed; everything else is pure
    if color=='Tricolor' or '/' in color: return 'Mix'
    else: return 'Pure'

In [None]:
train_set['ColorPurity']=train_set['Color'].apply(create_color_purity)

#### Breed

The same variation applies to the breed of an animal, and so does the idea of purity. I will create a BreedPurity column.

In [None]:
def create_breed_purity(breed):
    if 'Mix' in breed or '/' in breed: return 'Mix'
    else: return 'Pure'

In [None]:
train_set['BreedPurity']=train_set['Breed'].apply(create_breed_purity)

In [None]:
train_set.head()

<a id='visualize'></a>

# Visualizing Data for Insights

In [None]:
# Visualizing outcome type
base_color=sns.color_palette()[0]
bars=sns.countplot(data=train_set,x='OutcomeType',color=base_color);
for i in range(len(bars.patches)):
    count=bars.patches[i].get_height()
    pcnt=100*count/len(train_set['OutcomeType'])
    string='{:0.2f}%'.format(pcnt)
    plt.text(i, count+20,string,ha='center')
patches_heights=[]
for patch in bars.patches:
    height=patch.get_height()
    patches_heights.append(height)
idx_tallest_bar=np.argmax(patches_heights)
bars.patches[idx_tallest_bar].set_facecolor('#a834a8')  
plt.xticks(rotation=20);
plt.title('Frequency of different animal outcome states from the shelter');

Most animals (more than 75%) end up adopted or transferred to another shelter, 17% return to their original owner, and the rest are euthanized or die naturally or for other reasons.
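
The percentages annotated on the bars can also be read directly from a normalized value count. A minimal sketch on made-up outcomes (the counts are illustrative, not the shelter's):

```python
import pandas as pd

# Hypothetical outcomes; proportions chosen for easy checking.
outcomes = pd.Series(
    ['Adoption'] * 4 + ['Transfer'] * 3 + ['Return_to_owner'] * 2 + ['Euthanasia']
)

# normalize=True returns fractions of the total instead of raw counts.
outcome_pcnt = outcomes.value_counts(normalize=True).mul(100).round(2)
print(outcome_pcnt)
```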

### Regarding Adoption

**Which get adopted more, Cats or Dogs?**

In [None]:
adopted_animals=train_set.query('OutcomeType=="Adoption"')
adopted_types=adopted_animals['AnimalType'].value_counts()

In [None]:
adopted_types.plot.pie(autopct="%.1f%%",startangle=90,wedgeprops={'width':0.4},counterclock=False);
plt.title('Adoption percentage across Animals')
plt.ylabel(' ');

Dogs seem to be adopted more than cats from the shelter.

Let's check the outcome subtypes of adopted animals.

In [None]:
adopted_animals.OutcomeSubtype.value_counts()

For the most part, adoptions are either fostering or offsite adoptions; only one animal was adopted into a farm.

**Let's see how color purity and breed purity affect adoption frequency**

In [None]:
breed_color_data=adopted_animals.groupby(['ColorPurity', 'BreedPurity']).count()['OutcomeType'].reset_index()
heat_map_data=breed_color_data.pivot(index='ColorPurity',columns='BreedPurity',values='OutcomeType')

In [None]:
plt.figure(figsize=(12,4));
plt.subplot(1,2,1);
# count adopted animals only, so the bars match the 'Adoption count' label
sns.countplot(data=adopted_animals,x='ColorPurity')
plt.ylabel('Adoption count')

plt.subplot(1,2,2);
sns.countplot(data=adopted_animals,x='BreedPurity')
plt.ylabel('Adoption count');

Mixed breeds are adopted far more than pure breeds. Mixed colors are also adopted relatively more than pure colors.

In [None]:
ax=sns.heatmap(heat_map_data, annot = True,fmt='d',
           cbar_kws = {'label' : 'Number of Adoptions'},linewidths=.5);
ax.set_ylim([0,2]);

Mixed breeds are adopted far more than pure breeds, especially those with mixed colors.

Adoption favor in descending order is:
- Mixed breeds, mixed colors.
- Mixed breeds, pure colors.
- Pure breeds, mixed colors.
- Pure breeds, pure colors.

Pure breeds in general need more attention to help them get adopted.
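
Raw adoption counts also depend on how many animals of each kind enter the shelter, so it may be worth comparing adoption rates alongside them. The sketch below computes the share of animals adopted within each color/breed purity group on a small made-up table (`toy` and its values are assumptions, not results from the shelter data):

```python
import pandas as pd

# Hypothetical records; values are illustrative only.
toy = pd.DataFrame({
    'ColorPurity': ['Mix', 'Mix', 'Mix', 'Mix', 'Pure', 'Pure', 'Pure', 'Pure'],
    'BreedPurity': ['Mix', 'Mix', 'Mix', 'Pure', 'Mix', 'Mix', 'Pure', 'Pure'],
    'OutcomeType': ['Adoption', 'Adoption', 'Transfer', 'Adoption',
                    'Adoption', 'Transfer', 'Transfer', 'Adoption'],
})

adoption_rate = (
    # 1.0 for adopted animals, 0.0 otherwise, so the mean is the adoption rate
    toy.assign(adopted=toy['OutcomeType'].eq('Adoption').astype(float))
       .pivot_table(index='ColorPurity', columns='BreedPurity',
                    values='adopted', aggfunc='mean')
       .mul(100).round(1)
)
print(adoption_rate)
```

A table like this can be passed to `sns.heatmap` in the same way as the count table, with `fmt='.1f'` for the annotations.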

**How does gender of an animal affect their adoption chance?**

In [None]:
sns.countplot(data=adopted_animals,x='Gender');

Among animals with an identified gender, males and females are adopted in roughly equal numbers, whereas animals of Unknown gender don't get adopted at all!
Sexing more animals might improve their chances of being adopted.

Let's see where most Unknown-gender animals end up.

In [None]:
gender_grouped_by_outcome=train_set.groupby('OutcomeType')['Gender'].value_counts().rename('count').reset_index()
gender_grouped_by_outcome=gender_grouped_by_outcome.pivot(index='OutcomeType',columns='Gender', values='count').fillna(0)
gender_grouped_by_outcome.plot(kind='bar',figsize=(8,5));

Most Unknown-gender animals end up transferred to another shelter, which might not be the best outcome, since the animal keeps moving from one shelter to another.

**Let's use the same approach to find outcomes based on sterility/fertility state**

In [None]:
state_grouped_by_outcome=train_set.groupby('OutcomeType')['State'].value_counts().rename('count').reset_index()
state_grouped_by_outcome=state_grouped_by_outcome.pivot(index='OutcomeType',columns='State', values='count').fillna(0)
state_grouped_by_outcome.plot(kind='bar',figsize=(8,5));

Neutered and spayed animals are mostly adopted, whereas Intact animals are the least adopted; instead, most Intact animals are transferred to another shelter, along with animals of Unknown state. This suggests a preference for sterilized animals, so sterilization might increase an animal's chance of adoption.

**Another important question at this point is which breeds get adopted the most and which the least.**

To get a clearer idea of how breed affects outcome state, I will group breeds into three categories (common, medium and rare) and find out how their numbers compare across different outcomes.

In [None]:
def categorize_breeds_for_outcometypes(data,outcomes,idx1,idx2):
        count_per_breed=data['Breed'].value_counts()
        breed_availability={}
        breed_availability['common']=list(count_per_breed.index[0:idx1])
        breed_availability['medium']=list(count_per_breed.index[idx1:idx2])
        breed_availability['rare']=list(count_per_breed.index[idx2:])
        outcome_types=outcomes
        pcnts_outcome_across_breeds_categories=[]
        for outcome in outcome_types:
            outcome_data=data.query('OutcomeType==@outcome')
            pcnts_outcome_across_breed_category=[]
            for category in breed_availability.keys():
                breeds_of_category=breed_availability[category]
                # keep only the rows whose Breed belongs to the current category
                category_outcome_count=outcome_data[outcome_data['Breed'].isin(breeds_of_category)].shape[0]
                category_total_count=data[data['Breed'].isin(breeds_of_category)].shape[0]
                category_outcome_pcnt=100*category_outcome_count/category_total_count
                pcnts_outcome_across_breed_category.append(category_outcome_pcnt)
            pcnts_outcome_across_breeds_categories.append(pcnts_outcome_across_breed_category)
        return pcnts_outcome_across_breeds_categories

In [None]:
def get_category_indices(breeds_count_in_shelter):     
        indices=[]
        for i in [50,5]:
            for index, value in enumerate(breeds_count_in_shelter):
                if value<=i:
                    indices.append(index)
                    break
        return indices

#### Cat breed categories outcome trends

Create a series of breeds with their corresponding counts in descending order, then get the indices of the count values that separate the categories.
There are two indices: the first separates breeds with counts greater than 50 from those with counts of 50 or less, and the second separates breeds with counts of 50 or less (but greater than 5) from those with counts of 5 or less.
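
To make the two thresholds concrete, here is a small worked example. It restates the `get_category_indices` helper so the block runs on its own; the breed labels and counts are made up:

```python
import pandas as pd

def get_category_indices(breeds_count_in_shelter):
    # For each threshold, find the first position whose count is at or below it.
    indices = []
    for threshold in [50, 5]:
        for index, value in enumerate(breeds_count_in_shelter):
            if value <= threshold:
                indices.append(index)
                break
    return indices

# Hypothetical breed counts, already sorted in descending order.
counts = pd.Series([120, 80, 40, 12, 5, 2, 1], index=list('ABCDEFG'))

idx_50, idx_5 = get_category_indices(counts)
common = list(counts.index[:idx_50])        # counts above 50
medium = list(counts.index[idx_50:idx_5])   # counts from 6 to 50
rare = list(counts.index[idx_5:])           # counts of 5 or less
print(common, medium, rare)
```

Note that the helper returns fewer than two indices when no count falls at or below a threshold, so it relies on the data containing both medium and rare breeds.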

In [None]:
cat_data=train_set.query('AnimalType=="Cat"')
cat_count_per_breed=cat_data['Breed'].value_counts()
cat_indices=get_category_indices(cat_count_per_breed)
cat_result=categorize_breeds_for_outcometypes(cat_data,['Adoption','Transfer','Return_to_owner','Died','Euthanasia'],cat_indices[0],cat_indices[1])

In [None]:
column_names=['common_breeds','medium_breeds','rare_breeds']
indices=['Adoption','Transfer','Return to owner','Died','Euthanasia']
cat_breed_outcome_df=pd.DataFrame(cat_result,columns=column_names,index=indices)
cat_breed_outcome_df

In [None]:
cat_breed_outcome_df.transpose().plot.bar(figsize=(7,5))
plt.ylabel('Percent % for breed category')
plt.xlabel('Breed Category')
plt.xticks(rotation=0);
plt.legend(loc="center");
plt.title('Cat Outcome Trends by Breed category');

- Adoption percentage increases with breed rarity: common breeds are the least adopted and rare breeds the most adopted.
- The opposite trend is found for transferred animals: common breeds are transferred to other shelters the most, while rare breeds are transferred the least.
- Rare breeds have the highest return-to-owner percentage and common breeds the lowest. This might be explained by the high percentage of transfers among common breeds, which could make it harder for an owner looking for them to get them back.
- Although the percentages are small, euthanized animals are mostly common breeds, and rare breeds are the least euthanized. The percentage of animals that died is also slightly higher in rare breeds than in common breeds.

In [None]:
# an alternative method using pivoting instead of querying.
def categorize_breeds_for_outcometypes_alternative_function(data,idx1,idx2):
    count_per_breed=data['Breed'].value_counts()
    breed_availability={}
    breed_availability['common']=list(count_per_breed.index[0:idx1])
    breed_availability['medium']=list(count_per_breed.index[idx1:idx2])
    breed_availability['rare']=list(count_per_breed.index[idx2:])
    breeds_count_by_outcome=data.groupby('OutcomeType')['Breed'].value_counts().rename('count').reset_index()
    pivoted_breeds_by_outcome=breeds_count_by_outcome.pivot(index='OutcomeType',columns='Breed', values='count').fillna(0)
        
    pcnts_outcome_across_breeds_categories=[]
    for category in breed_availability.keys():
        breeds_of_category=breed_availability[category]
        pcnts_outcome_across_breed_category=[]
        for idx, outcome in pivoted_breeds_by_outcome.iterrows():
            category_outcome_pcnt=outcome[breeds_of_category].sum()*100/pivoted_breeds_by_outcome[breeds_of_category].sum().sum()
            pcnts_outcome_across_breed_category.append(category_outcome_pcnt)
        pcnts_outcome_across_breeds_categories.append(pcnts_outcome_across_breed_category)
    return pcnts_outcome_across_breeds_categories
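
A more compact sketch of the same idea bins each animal by how common its breed is and lets `pd.crosstab` compute the row percentages. The data and the thresholds (1 and 5 here, scaled down for the toy table) are assumptions for illustration; bins of `[0, 5, 50, inf]` would correspond to the categories used in this notebook:

```python
import pandas as pd

# Hypothetical records: breed 'A' is common, 'B' medium, 'C' rare.
toy = pd.DataFrame({
    'Breed': ['A'] * 6 + ['B'] * 3 + ['C'],
    'OutcomeType': ['Adoption', 'Adoption', 'Transfer', 'Transfer', 'Transfer', 'Euthanasia',
                    'Adoption', 'Adoption', 'Transfer',
                    'Adoption'],
})

# Number of animals of the same breed, attached to every row.
breed_count = toy['Breed'].map(toy['Breed'].value_counts())

# Bin the counts into rarity categories (intervals are closed on the right).
category = pd.cut(breed_count, bins=[0, 1, 5, float('inf')],
                  labels=['rare', 'medium', 'common'])

# Percentage of each outcome within each breed category.
pcnt = pd.crosstab(category, toy['OutcomeType'], normalize='index').mul(100).round(1)
print(pcnt)
```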

#### Dog breed categories outcome trends

In [None]:
dog_data=train_set.query('AnimalType=="Dog"')
dog_count_per_breed=dog_data['Breed'].value_counts()
dog_indices=get_category_indices(dog_count_per_breed)
dog_result=categorize_breeds_for_outcometypes(dog_data,['Adoption','Transfer','Return_to_owner','Died','Euthanasia'],dog_indices[0],dog_indices[1])

In [None]:
column_names=['common_breeds','medium_breeds','rare_breeds']
indices=['Adoption','Transfer','Return to owner','Died','Euthanasia']
dog_breed_outcome_df=pd.DataFrame(dog_result,columns=column_names,index=indices)
dog_breed_outcome_df

In [None]:
dog_breed_outcome_df.transpose().plot.bar(figsize=(7,5))
plt.ylabel('Percent % for breed category')
plt.xlabel('Breed Category')
plt.xticks(rotation=0);
plt.legend(loc="upper left");
plt.title('Dog Outcome Trends by Breed Category');

Adoption follows a pattern similar to that of cats. Transfer percentages are very similar across dog breed categories, and return-to-owner percentages are also very close across the three categories. Euthanasia takes a higher percentage among common breeds and a lower percentage among rare breeds.

In [None]:
grid=sns.FacetGrid(data=train_set,row='OutcomeType',sharey=False,col='AnimalType',height=3,aspect=5/3)
grid.map(plt.hist,'outcomeAgeinYears');


Looking at the histograms, the mean age of adopted cats is lower than that of adopted dogs. This can be an area of work to promote adoption of older animals.

The age distributions of euthanized cats and dogs are very similar, with a mean around 1-2 years.

For transferred animals, the mean age is similar for dogs and cats, but dogs span a wider range of ages than cats.

Cats die at young ages (0-1 years), whereas dogs show very similar numbers of deaths across the 0-3 year range.
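
Reading means off histograms is approximate, so a grouped summary can back up these observations with numbers. A minimal sketch on made-up ages (the values are illustrative, not taken from the shelter data):

```python
import pandas as pd

# Hypothetical records using the same column names as the notebook.
toy = pd.DataFrame({
    'AnimalType': ['Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog'],
    'OutcomeType': ['Adoption', 'Adoption', 'Died', 'Adoption', 'Adoption', 'Died'],
    'outcomeAgeinYears': [0.2, 0.4, 0.1, 1.0, 3.0, 2.0],
})

# Mean, median and number of animals for each outcome and animal type.
age_summary = (
    toy.groupby(['OutcomeType', 'AnimalType'])['outcomeAgeinYears']
       .agg(['mean', 'median', 'count'])
)
print(age_summary)
```

The median is worth reporting next to the mean because age distributions in shelters tend to be skewed toward young animals.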

### Finally, I am interested in viewing trends in dog and cat names

In [None]:
plt.figure(figsize=(7,5))
train_set.query('AnimalType=="Dog"')['Name'].value_counts().head(10).plot.bar();
plt.xticks(rotation=0);
plt.title('Top 10 Dog Names');

In [None]:
plt.figure(figsize=(7,5))
train_set.query('AnimalType=="Cat"')['Name'].value_counts().head(10).plot.bar()
plt.xticks(rotation=0);
plt.title('Top 10 Cat Names');