# Shelter Animal Outcomes

The goal of this Kaggle competition is to predict the outcomes of animals in a shelter in the United States and to understand the trends behind those outcomes.
This will let shelters know which animals need extra effort to find a new home.
To do this we have a train dataset and a test dataset. There are 5 possible values of OutcomeType: return to owner, adoption, transfer, euthanasia, or death.

We divided the work into 3 parts: preprocessing the data, analysing the data, and finally making predictions.

https://www.kaggle.com/c/shelter-animal-outcomes#description

In [None]:
import numpy as np
import random
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import math
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score 
from sklearn.metrics import log_loss
from sklearn import datasets
from sklearn.model_selection import train_test_split 
from sklearn import tree
from sklearn import metrics 
from scipy.stats import itemfreq

We load the data and take a first look at it.

In [None]:
train=pd.read_csv('../input/train.csv')
test=pd.read_csv('../input/test.csv')

In [None]:
print("INFO TRAIN")
train.info()
print()
print('*'*50)
print()
print("INFO TEST")
test.info()

- 'train' has 26,729 rows and 10 columns. Many names and 'OutcomeSubtype' values are missing, along with some ages and one sex.
- 'test' has 11,456 rows and 8 columns. Some ages and names are missing.
- All columns are of object type except the ID in 'test', so during preprocessing we will convert columns to numeric values whenever this does not hurt interpretability.

Let's have a look at the data:

In [None]:
train.head()

In [None]:
test.head()

We can delete the 'OutcomeSubtype' column: it is linked to OutcomeType, we do not have to predict it, and we cannot use it for prediction because it is absent from 'test'.

We can also see which columns need work:
- **Name:** Do we keep each name?
- **DateTime:** it is not stored as a date type. We will split it into several columns (Year, Month, Day and Hour) because we think they may not have the same importance. For example, the year indicates how much time the animal has spent in the shelter, and the longer that is, the lower its chances of being reunited with its owner.
The month matters too, because there are more abandonments during spring and summer (because of the holidays). The day may also matter: at the end of the month, for example, shelters may be more crowded or have less budget and be unable to keep new arrivals. The hour matters as well: early or late in the day there may be fewer workers, so suffering animals may not get enough care. If we kept the date in a single column, we would risk losing this potential information.
- **AnimalType:** We will look at how many animal types we have.
- **SexuponOutcome:** This column holds two pieces of information: the sex and whether the animal was sterilized.
- **AgeuponOutcome:** There are multiple units, so we will have to convert everything to the same unit to be able to draw graphs.
- **Breed:** This column again holds several pieces of information. We will look at it more closely later in the notebook.
- **Color:** Again, this column holds several pieces of information.

In [None]:
del train['OutcomeSubtype']

## Treatment of the data

We join test and train so that both receive the same modifications, because the prediction algorithms need test and train to have the same structure.

In [None]:
data=pd.concat([train,test], ignore_index=True)

In [None]:
data.info()

We have 11 columns because train and test have different ID columns ('AnimalID' and 'ID').
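
The combine-then-split pattern used in this notebook can be sketched on toy data as follows. The two small frames and their values are invented for illustration; only the column names `AnimalID`, `ID`, `AnimalType` and `OutcomeType` come from the dataset.

```python
import pandas as pd

# Toy stand-ins for train and test (values are made up for illustration).
toy_train = pd.DataFrame({"AnimalID": ["A1", "A2"],
                          "AnimalType": ["Dog", "Cat"],
                          "OutcomeType": ["Adoption", "Transfer"]})
toy_test = pd.DataFrame({"ID": [1, 2, 3],
                         "AnimalType": ["Cat", "Dog", "Dog"]})

# Stack both sets so every transformation is applied once to all rows.
toy_data = pd.concat([toy_train, toy_test], ignore_index=True)

# Rows that came from test have no OutcomeType, which lets us split again later.
toy_train_again = toy_data[toy_data["OutcomeType"].notnull()]
toy_test_again = toy_data[toy_data["OutcomeType"].isnull()]

print(toy_data.shape)
```

Columns that exist in only one of the two frames are kept and filled with missing values for the rows of the other frame, which is why the combined frame has more columns than either input.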

### "SexuponOutcome" column

We will create 2 new columns : 'Sex' and 'Neutered'. 

In [None]:
data['Sex']=data.SexuponOutcome.str.extract('. ([A-Za-z]+)', expand=False)
data['Neutered']=data.SexuponOutcome.str.extract('([A-Za-z]+) .', expand=False)
data.loc[(data.Sex.isnull()==1), 'Sex']='Unknow'
data.loc[(data.Neutered.isnull()==1), 'Neutered']='Unknow'

In [None]:
pd.value_counts(data.SexuponOutcome)

'Spayed' means the same as 'Neutered' (it is the term used for females), so each column holds 3 possible values: 'Male', 'Female' and 'Unknow' for 'Sex', and 'Intact', 'Neutered' and 'Unknow' for 'Neutered'.
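
As a compact illustration of the split and of the numeric codes chosen in the next cells, here is a plain-Python sketch. The function name `split_sex_upon_outcome` is hypothetical and not part of the notebook; the codes (1, 0, 2) follow the convention described in the text.

```python
def split_sex_upon_outcome(value):
    """Split e.g. 'Neutered Male' into (neutered_code, sex_code).

    Codes follow the notebook: neutered 1 = sterilized, 0 = intact, 2 = unknown;
    sex 1 = male, 0 = female, 2 = unknown.
    """
    neutered_codes = {"Neutered": 1, "Spayed": 1, "Intact": 0}
    sex_codes = {"Male": 1, "Female": 0}
    # Missing values and the single word 'Unknown' carry no information.
    if not isinstance(value, str) or " " not in value:
        return 2, 2
    status, sex = value.split(" ", 1)
    return neutered_codes.get(status, 2), sex_codes.get(sex, 2)


print(split_sex_upon_outcome("Spayed Female"))
```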

In [None]:
del data['SexuponOutcome']

We will encode the Neutered column numerically: 1 if the animal is sterilized, 0 if not, and 2 if we don't know.

In [None]:
data['Neutered']=data['Neutered'].replace(['Neutered','Spayed'],1)
data['Neutered']=data['Neutered'].replace('Intact',0)
data['Neutered']=data['Neutered'].replace('Unknow',2)

In [None]:
pd.value_counts(data.Neutered)

We do the same for Sex: 1 for a male, 0 for a female, and 2 if unknown.

In [None]:
data['Sex']=data['Sex'].replace('Male',1)
data['Sex']=data['Sex'].replace('Female',0)
data['Sex']=data['Sex'].replace('Unknow',2)

In [None]:
pd.value_counts(data.Sex)

### "DateTime" column

In [None]:
data.head()

We will extract the Year, Month, Day and Hour from the DateTime column :

In [None]:
data['Year']=data.DateTime.str.extract('([0-9]+)-', expand=False)
data['Month']=data.DateTime.str.extract('.-([0-9]+)-', expand=False)
data['Day']=data.DateTime.str.extract('.-([0-9]+) ', expand=False)
data['Hour']=data.DateTime.str.extract('. ([0-9]+):',expand=False)
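
An alternative to the regular expressions above is to parse the timestamp as a real date. The sketch below assumes the strings look like `YYYY-MM-DD HH:MM:SS`, which is what the patterns above suggest; the function name `split_timestamp` and the sample timestamp are made up for illustration.

```python
from datetime import datetime


def split_timestamp(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Return (year, month, day, hour) as integers from a timestamp string."""
    moment = datetime.strptime(text, fmt)
    return moment.year, moment.month, moment.day, moment.hour


print(split_timestamp("2014-02-12 18:22:00"))
```

Unlike the regex version, which produces strings such as '02', this returns integers, so no later numeric conversion is needed.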

In [None]:
del data['DateTime']

In [None]:
data.head()

### "AgeuponOutcome" and "AnimalType" columns

We want only numerical data in these columns :

#### AnimalType

In [None]:
pd.value_counts(data['AnimalType'])

There are only two types of animals: cats (0) and dogs (1).

In [None]:
data['AnimalType']=data['AnimalType'].replace('Cat',0)

data['AnimalType']=data['AnimalType'].replace('Dog',1)

#### AgeuponOutcome

We create two temporary columns, 'valeurs' (values) and 'unités' (units), in order to see which units are used.

In [None]:
data['valeurs']=data.AgeuponOutcome.str.extract('([0-9]+) ', expand=False)
data['unités']=data.AgeuponOutcome.str.extract('. ([A-Za-z]+)', expand=False)

In [None]:
pd.value_counts(data.unités)

We can see that 'years' and 'year' must map to the same value, and likewise 'days' and 'day', etc.

We will convert all ages to a single unit: years. We think this is the best choice despite one drawback:
- It is interpretable: a 104-week-old animal is not easy to picture.
- It avoids false precision: in months, every 2-year-old animal would appear to be "exactly" 24 months old. With the age in years, ages below one year are nearly continuous and ages above one year are discrete.
- The drawback is that the smallest ages (those given in weeks or days) are less readable for us, but for an age of 0.0192 we still know it is a very young puppy or kitten.

We replace every word in the units column with the number of years it represents:

In [None]:
data['unités']=data['unités'].replace(['months','month'],0.0833) #1/12=0.083333
data['unités']=data['unités'].replace(['years','year'],1)
data['unités']=data['unités'].replace(['weeks','week'],0.0192)#1/52=0.0192
data['unités']=data['unités'].replace(['days','day'],0.00274) #1/365=0.00273972

We also convert the 'valeurs' column to float because it is currently a string:

In [None]:
data['valeurs']=data.valeurs.astype(float)

Finally, each age is the product of its value and its unit:

In [None]:
data['AgeInYears']=data['valeurs']*data['unités']
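
The same conversion can be written as one small function, shown here as a sketch rather than as the notebook's method. The name `age_to_years` is hypothetical, and the function uses the exact fractions 1/12, 1/52 and 1/365 instead of the rounded constants above.

```python
import re

# Length of one unit expressed in years (exact fractions instead of rounded constants).
YEARS_PER_UNIT = {"year": 1.0, "month": 1 / 12, "week": 1 / 52, "day": 1 / 365}


def age_to_years(text):
    """Convert strings such as '2 years' or '3 weeks' to a float number of years."""
    if not isinstance(text, str):
        return None  # a missing age stays missing
    match = re.fullmatch(r"(\d+) ([a-z]+?)s?", text.strip())
    if match is None:
        return None
    count, unit = match.groups()
    if unit not in YEARS_PER_UNIT:
        return None
    return int(count) * YEARS_PER_UNIT[unit]


print(age_to_years("2 years"), age_to_years("3 weeks"))
```

The optional trailing `s` in the pattern is what makes 'year' and 'years' map to the same unit.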

We can now delete the initial AgeuponOutcome column and the two temporary columns.

In [None]:
del data['AgeuponOutcome']
del data['unités']
del data['valeurs']

In [None]:
data.head()

### "AnimalID" and "ID" columns

We wanted to see whether there was "hidden" information in the AnimalID column (IDs sometimes encode part of a date, a place, or the animal's number in the shelter). To check, we created temporary columns holding the leading letter(s) and the digits of the ID, to see how many different prefixes there are:

In [None]:
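# [A-Za-z] matches letters only; the range [A-z] would also match characters such as '[' or '_'
data['temp']=data.AnimalID.str.extract('([A-Za-z]+)',expand=False)

data['temp2']=data.AnimalID.str.extract('.([0-9]+)',expand=False)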

In [None]:
pd.value_counts(data.temp)

There is only one letter: A. We can also see with head and tail that there is no pattern in the ID: nothing looks like a date. So we concluded that the ID carries no information.

We kept this column in order to produce the CSV file required by the Kaggle challenge (a dataframe with the AnimalID and the predicted probability of each OutcomeType).
We can delete our temporary columns.

In [None]:
#del data['AnimalID']
#del data['ID']
del data['temp']
del data['temp2']

In [None]:
data.head(40)

### Name

We chose to put 1 if the animal has a name and 0 if not. There are too many different names, and our goal was to have several "independent" columns with as little information as possible in each of them.

In [None]:
data.loc[(data['Name'].isnull()==0), 'Name']=1
data.loc[(data['Name'].isnull()==1), 'Name']=0

In [None]:
data.head()

### Breed

We saw earlier that there are only cats and dogs in the shelter. Moreover, the first rows of the data show that the information in 'Breed' and 'Color' differs depending on the animal type. So we chose to separate the data into two new datasets: cats and dogs.

In [None]:
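# .copy() gives independent dataframes, so adding columns later does not trigger SettingWithCopyWarning
cats=data[data.AnimalType==0].copy()
dogs=data[data.AnimalType==1].copy()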

Let's start with the cats

## Cats ! 

In [None]:
pd.value_counts(cats.Breed)

We can see that there are too many of them. Some contain 'Mix', some have two breeds separated by '/', and there is also information about the hair.

So we will create 4 new columns : 
- Breed 1 : with the first breed if there are two (so the one before '/') and the only breed if not.
- Breed 2 : with the second breed if there is one (so the breed after '/').
- Hair : with the length of the hair (or the texture).
- Mix : a binary column with 1 if the cat has 'Mix' in its Breed column or if it has two breeds.
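
The four target columns can be summarised in one function. This is only a sketch of the idea, not the step-by-step extraction the notebook performs below: the name `parse_cat_breed` and the dictionary keys are hypothetical, and for simplicity the hair information is only searched in the first breed.

```python
import re


def parse_cat_breed(breed):
    """Split a raw Breed string into first breed, second breed, hair and mix flag."""
    is_mix = breed.endswith(" Mix") or "/" in breed
    text = breed[:-len(" Mix")] if breed.endswith(" Mix") else breed
    first, _, second = text.partition("/")
    # Hair is written either as a suffix ('Shorthair') or a separate word ('Medium Hair').
    hair_match = re.search(r"\b([A-Za-z]+?) ?[Hh]air\b", first)
    hair = hair_match.group(1) if hair_match else None
    if hair_match:
        first = first[:hair_match.start()].strip()
    return {"Breed1": first, "Breed2": second or None, "Hair": hair, "Mix": int(is_mix)}


print(parse_cat_breed("Domestic Shorthair Mix"))
```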

Let's start with the hair : we can see that there is shorthair and longhair so we just have to extract the info before 'hair' in the Breed :

In [None]:
cats['Hair']=cats.Breed.str.extract('. ([A-Za-z]+)hair',expand=False)

In [None]:
pd.value_counts(cats.Hair)

We checked the 'Wire' row, since only one cat is concerned:

In [None]:
print(cats[cats.Hair=='Wire'])

So we can see that it comes from the name of the breed: the 'American Wirehair'. We checked the breed on the internet and it really does have wiry hair, so we keep it.

There are also cats with 'Medium Hair', so we also have to extract the information before ' Hair'. We create a temporary column for medium hair:

In [None]:
cats['Mediumtemp']=cats.Breed.str.extract('(Medium) Hair',expand=False)

We check that we successfully extracted all of them:

In [None]:
pd.value_counts(cats.Mediumtemp)

We can see (thanks to the 'pd.value_counts(cats.Breed)' command) that we have :
>Domestic Medium Hair Mix                    1217

>Domestic Medium Hair                          67

So there should be at least 1217+67=1284 values in our 'Mediumtemp' column, so the count looks correct.

Now we can add 'Medium' to the 'Hair' column for the cats where our temporary column is not null:

In [None]:
cats.loc[(cats['Mediumtemp']=='Medium'), 'Hair']='Medium'

In [None]:
pd.value_counts(cats.Hair)

We can see that the 'Hair' column now contains the Medium info, so we can delete our temporary column.

In [None]:
del cats['Mediumtemp']

We will now create our 'Mix' column. First we create temporary columns:
- In 'temp' we extract the word 'Mix' from the breed.
- In 'temp2' we extract the '/' from the breed.

These 2 columns are null except where 'Mix' or '/' is present, so we create the 'Mix' column, fill it with 1 where 'temp' or 'temp2' is not null, and then put 0 wherever 'Mix' is still null.

This way our column contains only 0 and 1.

In [None]:
cats['temp']=cats.Breed.str.extract(' (Mix)$',expand=False)
cats['temp2']=cats.Breed.str.extract('.(/).',expand=False)

In [None]:
cats.loc[(cats['temp'].isnull()==0) | (cats['temp2'].isnull()==0), 'Mix']=1
cats.loc[(cats['Mix'].isnull()==1), 'Mix']=0

Now for the breed itself. We will do this in several steps:

In [None]:
# First, if there is a '/' in the breed (temp2 is not null), we put the first word after it into Race2
cats.loc[(cats['temp2'].isnull()==0), 'Race2']=cats.Breed.str.extract('/([A-Z-a-z]+)', expand=False)
# Then, if the cat is not Mix, has no '/' and has no Hair info, we put Breed into Race1
# (this way we don't have to edit Race1 afterwards to remove 'Hair')
cats.loc[(cats['Mix']==0) & (cats['temp2'].isnull()==1) & (cats['Hair'].isnull()==1), 'Race1']=cats.Breed.str.extract('(.+)',expand=False)
# Next, if the cat has no '/' but has Hair info, we put Breed minus the hair info into Race1
cats.loc[(cats['temp2'].isnull()==1) & (cats['Hair'].isnull()==0), 'Race1']=cats.Breed.str.extract('(.+) .hair',expand=False)
# After that, if the cat has two breeds (temp2 is not null), we put the info before '/' into Race1
cats.loc[(cats['Mix']==0) & (cats['temp2'].isnull()==0), 'Race1']=cats.Breed.str.extract('(.+)/',expand=False)
# Then, if the cat is Mix, we put Breed minus 'Mix' into Race1
# This way we still have the Hair info in Race1, but we will fix this later
cats.loc[(cats['Mix']==1), 'Race1']=cats.Breed.str.extract('(.+) Mix',expand=False)
#cats['Race1']=cats.Breed.str.extract('([A-Za-z]+)/',expand=False)
# Finally, some Race1 values were still missing, so we put the word just before '/' into Race1
cats.loc[(cats['Race1'].isnull()==1), 'Race1']=cats.Breed.str.extract('([A-Za-z]+)/',expand=False)
#cats['Race2']=cats.Breed.str.extract('/([A-Z-a-z]+)', expand=False)

In [None]:
cats.head(40)

We have to remove the Hair info from Race1

In [None]:
cats['Race1']=cats.Race1.str.replace('Domestic Shorthair','Domestic')
cats['Race1']=cats.Race1.str.replace('Domestic Longhair','Domestic')
cats['Race1']=cats.Race1.str.replace('Domestic Medium Hair','Domestic')

In [None]:
cats.loc[(cats.Race1.isnull()==1) & (cats.Breed.str.find('Domestic')>-1), 'Race1']='Domestic'
cats.loc[(cats.Race1.isnull()==1) & (cats.Breed.str.find('British')>-1), 'Race1']='British'

Now let's see how many cats don't have the Hair info :

In [None]:
#We create a temporary dataframe with only cats where Hair is null :
catsnohair=cats[cats.Hair.isnull()]

Let's print them :

We print 'Breed' and not 'Race1' so that we can see that:
- We succeeded in extracting the hair info, because no 'hair' info remains in the breeds here.
- There are not many of them, so we can fill them in manually.
- Some of them have two breeds, so we will take a "mean" of the hair lengths of the two breeds.

In [None]:
pd.value_counts(catsnohair.Breed)

If the cat has both Long and Short hair, we put Medium.
If the cat has both Long and Medium hair, we put Long.

We looked up the hair length of each breed on the internet.
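
The two rules above amount to averaging on an ordinal scale and rounding up. A minimal sketch, assuming the numeric ranks below (they are not in the notebook) and a hypothetical helper name `combine_hair`:

```python
import math

# Ordinal scale for hair length; the numeric codes are an assumption of this sketch.
HAIR_RANK = {"Short": 0, "Medium": 1, "Long": 2}
RANK_TO_HAIR = {rank: name for name, rank in HAIR_RANK.items()}


def combine_hair(first, second):
    """Average two hair lengths, rounding up when the mean falls between two levels."""
    mean_rank = (HAIR_RANK[first] + HAIR_RANK[second]) / 2
    return RANK_TO_HAIR[math.ceil(mean_rank)]


print(combine_hair("Long", "Short"), combine_hair("Long", "Medium"))
```

Under this rounding, Short combined with Medium would give Medium. That case is not covered by the rules stated in the text, so treat it as a side effect of the sketch.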

In [None]:
cats['Siamese']=cats.Breed.str.extract('(Siamese)',expand=False)

In [None]:
cats.loc[(cats.Siamese.notnull()==1), 'Hair']='Short'
cats.loc[(cats.Breed=='Snowshoe Mix'), 'Hair']='Short'
cats.loc[(cats.Breed=='Maine Coon Mix'), 'Hair']='Long'
cats.loc[(cats.Breed=='Manx Mix'), 'Hair']='Medium'
cats.loc[(cats.Breed=='Russian Blue Mix'), 'Hair']='Short'
cats.loc[(cats.Breed=='Himalayan Mix'), 'Hair']='Long'
cats.loc[(cats.Breed=='Ragdoll Mix'), 'Hair']='Long'
cats.loc[(cats.Breed=='Persian Mix'), 'Hair']='Long'
cats.loc[(cats.Breed=='Angora Mix'), 'Hair']='Medium'
cats.loc[(cats.Breed=='Balinese Mix'), 'Hair']='Long'
cats.loc[(cats.Breed=='Bengal Mix'), 'Hair']='Short'
cats.loc[(cats.Breed=='Bombay Mix'), 'Hair']='Short'
cats.loc[(cats.Breed=='Cymric Mix'), 'Hair']='Long'
cats.loc[(cats.Breed=='Devon Rex Mix'), 'Hair']='Wire'
cats.loc[(cats.Breed=='Devon Rex'), 'Hair']='Wire'
cats.loc[(cats.Breed=='Abyssinian Mix'), 'Hair']='Short'
cats.loc[(cats.Breed=='Chartreux Mix '), 'Hair']='Short'
cats.loc[(cats.Breed=='Burmese'), 'Hair']='Short'
cats.loc[(cats.Breed=='Chartreux Mix'), 'Hair']='Short'
cats.loc[(cats.Breed=='Japanese Bobtail Mix'), 'Hair']='Medium'
cats.loc[(cats.Breed=='Maine Coon'), 'Hair']='Long'
cats.loc[(cats.Breed=='Cornish Rex Mix'), 'Hair']='Wire'
cats.loc[(cats.Breed=='Havana Brown Mix'), 'Hair']='Short'
cats.loc[(cats.Breed=='Bengal'), 'Hair']='Short'
cats.loc[(cats.Breed=='Tonkinese Mix'), 'Hair']='Short'
cats.loc[(cats.Breed=='Snowshoe'), 'Hair']='Short'
cats.loc[(cats.Breed=='Persian'), 'Hair']='Long'
cats.loc[(cats.Breed=='Himalayan'), 'Hair']='Long'
cats.loc[(cats.Breed=='Javanese Mix'), 'Hair']='Long'
cats.loc[(cats.Breed=='Turkish Van Mix'), 'Hair']='Medium'
cats.loc[(cats.Breed=='Norwegian Forest Cat Mix'), 'Hair']='Long'
cats.loc[(cats.Breed=='Ragdoll'), 'Hair']='Long'
cats.loc[(cats.Breed=='Sphynx'), 'Hair']='None'
cats.loc[(cats.Breed=='Scottish Fold Mix'), 'Hair']='Medium'
cats.loc[(cats.Breed=='Oriental Sh Mix'), 'Hair']='Short'
cats.loc[(cats.Breed=='Manx'), 'Hair']='Medium'
cats.loc[(cats.Breed=='Angora/Persian'), 'Hair']='Long' # Medium + Long = Long
cats.loc[(cats.Breed=='Snowshoe/Ragdoll'), 'Hair']='Medium' #Long+Short=Medium
cats.loc[(cats.Breed=='Turkish Angora Mix'), 'Hair']='Medium'
cats.loc[(cats.Breed=='Ocicat Mix'), 'Hair']='Short'
cats.loc[(cats.Breed=='Russian Blue'), 'Hair']='Short'

In [None]:
del catsnohair

In [None]:
cats.head()

In [None]:
del cats['Siamese']
del cats['temp']
del cats['temp2']

In [None]:
pd.value_counts(cats.Hair)

In [None]:
cats.info()

We can see that we have filled the Hair, Race1, and Mix columns

For the Race2 column, we chose to put Race1 in it when Race2 is null, so that the second-breed information is never missing. Many animals will then carry the same breed twice, but for purebred animals that is a good thing.

In [None]:
cats.loc[(cats.Race2.isnull()==1), 'Race2']=cats.Race1

### Colors

In [None]:
pd.value_counts(cats.Color)

We have a lot of colors and 3 pieces of information: up to 2 colors (separated by a '/') and a pattern, 'Point' or 'Tabby'. Let's create a new column 'Tabby' holding 1 if the animal is 'Tabby' or 'Point' (there are only a few 'Point') and 0 if not.

In [None]:
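# 'Point' cats are flagged together with 'Tabby' cats, as described in the text
cats.loc[(cats.Color.str.find('Tabby')>-1) | (cats.Color.str.find('Point')>-1), 'Tabby']=1
cats.loc[(cats.Color.str.find('Tabby')==-1) & (cats.Color.str.find('Point')==-1), 'Tabby']=0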

In [None]:
pd.value_counts((cats.Color.str.find('Point')>-1) & (cats.Color.str.find('Tabby')>-1))

We can see that very few cats have both 'Point' and 'Tabby', so losing this double information is not really a problem.

In [None]:
pd.value_counts((cats.Color.str.find('Point')>-1))

And here we can see that there are only 851 cats with 'Point', so we can put them in 'Tabby'. (Basically this column shows whether the cat has a pattern on its fur or just colors.)

We extract the colors into two new columns Color1 and Color2

In [None]:
# We extract the color before the '/' symbol into the first column
cats['Color1']=cats.Color.str.extract('(.+)/',expand=False)
# If there is no '/', there is only one color, so we put it in Color1
cats.loc[(cats.Color.str.find('/')==-1) , 'Color1']=cats['Color']
# And finally, the color after '/' is the second color
cats['Color2']=cats.Color.str.extract('/(.+)',expand=False)

In [None]:
cats.info()

In [None]:
pd.value_counts(cats.Color2)

In [None]:
pd.value_counts(cats.Color1)

We can see that we still need to remove 'Tabby' and 'Point' from the colors :

In [None]:
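cats['Color1']=cats['Color1'].str.replace('Tabby','')
cats['Color2']=cats['Color2'].str.replace('Tabby','')
cats['Color1']=cats['Color1'].str.replace('Point','')
cats['Color2']=cats['Color2'].str.replace('Point','')
# Remove the spaces left behind, e.g. 'Brown Tabby' -> 'Brown ' -> 'Brown'
cats['Color1']=cats['Color1'].str.strip()
cats['Color2']=cats['Color2'].str.strip()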

We do the same as with the breeds: we put Color1 into Color2 when Color2 is null.

In [None]:
cats.loc[(cats.Color2.isnull()==1), 'Color2']=cats.Color1
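
The whole color treatment (pattern flag, two colors, fallback to the first color) can be condensed into one function. This is a sketch with a hypothetical name `split_color`, written in plain Python so it could be applied row by row; it is not the code the notebook runs.

```python
def split_color(color):
    """Return (color1, color2, has_pattern) from a raw Color string."""
    patterns = ("Tabby", "Point")
    has_pattern = int(any(word in color for word in patterns))
    parts = []
    for part in color.split("/"):
        # Drop the pattern words and tidy the leftover whitespace.
        words = [word for word in part.split() if word not in patterns]
        parts.append(" ".join(words))
    color1 = parts[0]
    color2 = parts[1] if len(parts) > 1 else color1  # reuse color1 when there is one color
    return color1, color2, has_pattern


print(split_color("Brown Tabby/White"))
```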

In [None]:
del cats['Color']

In [None]:
cats.head()

So we have finished for the cats ! 

## Dogs !

In [None]:
dogs.head()

### Breed

In [None]:
pd.value_counts(dogs.Breed)

We can see that we have to do exactly the same for the dogs as for the cats.

Let's start with the Mix column

In [None]:
dogs.loc[(dogs.Breed.str.find('Mix')>-1) | (dogs.Breed.str.find('/')>-1), 'Mix']=1

In [None]:
dogs.loc[(dogs.Mix.isnull()==1), 'Mix']=0

In [None]:
dogs.head()

In [None]:
dogs['Race1']=dogs.Breed.str.extract('(.+)/',expand=False)
dogs['Race2']=dogs.Breed.str.extract('/(.+)',expand=False)
dogs.loc[(dogs.Breed.str.find('/')==-1), 'Race1']=dogs['Breed']
dogs['Race1']=dogs.Race1.str.replace('Mix','')

In [None]:
pd.value_counts(dogs.Race1)

We will have to remove 'Mix' from the breed names.

We will try to build a 'Hair' column as we did for the cats.

In [None]:
dogs['Hair']=dogs.Breed.str.extract('([A-Za-z]+)hair',expand=False)

In [None]:
pd.value_counts(dogs.Hair)

In [None]:
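# strip() removes the trailing space left behind, e.g. 'Pit Bull Mix' -> 'Pit Bull ' -> 'Pit Bull'
dogs['Race1']=dogs['Race1'].str.replace('Mix','').str.strip()
dogs['Race2']=dogs['Race2'].str.replace('Mix','').str.strip()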

Too few dogs have this information to build a column from it: to fill it we would have to search manually on the Internet.
We tried to get the info from this site: http://www.dogbreedchart.com/

We succeeded in collecting it into a dataset, but the breed names were not exactly the same (for example, we have Pit Bull and American Pit Bull Terrier).
We chose not to use a Hair variable for the dogs.

In [None]:
dogs.info()

Once again, we will put Race1 into Race2 if it's null.

In [None]:
dogs.loc[(dogs.Race2.isnull()==1), 'Race2']=dogs.Race1
del dogs['Hair']

### Color

In [None]:
dogs.head()

Again, we will extract 2 colors and put Color1 into Color2 if it's empty

In [None]:
dogs['Color1']=dogs.Color.str.extract('(.+)/',expand=False)
dogs.loc[(dogs.Color.str.find('/')==-1), 'Color1']=dogs['Color']
dogs['Color2']=dogs.Color.str.extract('/(.+)',expand=False)

In [None]:
dogs.head()

In [None]:
del dogs['Color']

In [None]:
pd.value_counts(dogs.Color1)

In [None]:
dogs.loc[(dogs.Color2.isnull()==1), 'Color2']=dogs.Color1

In [None]:
dogs.info()

We can see that we managed to fill all the new columns without any missing values.

## Completing missing data

- **Let's see the cats:**

In [None]:
cats.info()

For the age, we use the fillna function with the pad method, which copies the previous valid value forward. We used it on the Titanic challenge and it produced very good results.

In [None]:
cats['AgeInYears']=cats['AgeInYears'].ffill() # forward fill, equivalent to fillna(method='pad')
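
To make the behaviour of the pad (forward-fill) method concrete, here is a small sketch on made-up ages. `Series.ffill()` is the method form of the same operation.

```python
import pandas as pd

# Made-up ages with gaps, in row order.
ages = pd.Series([None, 2.0, None, None, 0.5, None])

# Forward fill: each missing value copies the last observed value above it.
filled = ages.ffill()

print(filled.tolist())
```

Two properties are worth keeping in mind: the result depends on the row order, and a missing value in the very first row stays missing because there is nothing above it to copy.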

We split cats again into cats_train and cats_test datasets

In [None]:
cats_train=cats[(cats.OutcomeType.isnull()==0)].copy()
cats_test=cats[(cats.OutcomeType.isnull()==1)].copy()

In [None]:
#cats_train.to_csv('C:/Users/mathi/Downloads/all1/cats_train2.csv')
#cats_test.to_csv('C:/Users/mathi/Downloads/all1/cats_test2.csv')

- **Now for the dogs:**

In [None]:
dogs.info()

We have only one missing age. We will fill it in manually. Let's see which dog it is:

In [None]:
dogs[dogs.AgeInYears.isnull()==1]

For dogs, lifespan is highly dependent on the breed, so let's plot the ages of Toy Poodle Mix dogs

In [None]:
sns.countplot(data=dogs[dogs.Breed=='Toy Poodle Mix'], x='AgeInYears')

In [None]:
dogs[dogs.Breed=='Toy Poodle Mix']['AgeInYears'].mean()

We can see that the mean is close to 4.5 years, so we use it to replace the unknown age.

In [None]:
dogs.loc[3875,'AgeInYears']=4.5 # .loc avoids chained assignment

In [None]:
dogs.info()

In [None]:
del dogs['Breed']

We separate dogs into train and test datasets

In [None]:
dogs_train=dogs[(dogs.OutcomeType.isnull()==0)].copy()
dogs_test=dogs[(dogs.OutcomeType.isnull()==1)].copy()

In [None]:
#dogs_train.to_csv('C:/Users/mathi/Downloads/all1/dogs_train2.csv')
#dogs_test.to_csv('C:/Users/mathi/Downloads/all1/dogs_test2.csv')

data_train=pd.concat([cats_train,dogs_train])
data_test=pd.concat([cats_test,dogs_test])


In [None]:
data_train.info()

# Data Visualisation : Dogs and Cats / Cats / Dogs

To see the relationship between the OutcomeType variable and the other features, we draw some graphs.

We begin with the common data (cats and dogs together).

We use bar plots to show the proportion of each OutcomeType in our training dataset.

In [None]:
#Animals (Cats and Dogs) Outcomes
OutcomeTypeTrain = data_train.OutcomeType.value_counts() / len(data_train.index)
OutcomeTypeTrain.plot(kind='barh')
#We see that most animals are adopted or transferred
#We note that far fewer animals died or were euthanized
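
The proportions computed above by dividing counts by the number of rows can also be obtained directly with the `normalize` option of `value_counts`. A sketch on made-up outcome labels:

```python
import pandas as pd

# Made-up outcome labels for illustration.
outcomes = pd.Series(["Adoption", "Transfer", "Adoption", "Died", "Adoption"])

# normalize=True divides each count by the number of non-missing rows.
shares = outcomes.value_counts(normalize=True)

print(shares)
```

The two approaches give the same result when the column has no missing values, which is the case for OutcomeType in the training set. With missing values, dividing by `len(...)` would count the missing rows in the denominator while `normalize=True` would not.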

We make the same bar plot of the proportion of each OutcomeType on the cats training dataset and then on the dogs training dataset.

In [None]:
#Cats' outcomes
OutcomeTypeCats = cats_train.OutcomeType.value_counts() / len(cats_train.index)
OutcomeTypeCats.plot(kind='barh')
#About 50% of cats are transferred and ~40% are adopted
#A smaller share of them are euthanized, returned to their owner, or died

In [None]:
#Dogs' outcomes
OutcomeTypeDogs = dogs_train.OutcomeType.value_counts() / len(dogs_train.index)
OutcomeTypeDogs.plot(kind='barh')
#Dogs have more than a 40% chance of being adopted, 28% of being returned to their owner and 25% of being transferred
#The remaining percentage is for dogs who are euthanized or died
#The ranking of the outcomes differs from that of the cats

With the following graphs, we now want to see the relation between OutcomeType and the other variables.

Histograms are commonly used to describe the distribution of a numerical feature such as age (AgeInYears is a continuous quantitative variable).

We get the proportion of animals in each age interval (AgeInYears is automatically binned into subintervals).

In [None]:
data_train["AgeInYears"].plot.hist(weights = np.ones_like(data_train.index) / len(data_train.index))
#Most animals (~60%) have an age between 0 and 2.5 Years 

We now use boxplots to confirm the results obtained with the histogram of the age distribution.

In [None]:
cats_train['Animal']='cats'
dogs_train['Animal']='dogs'
cats_train['AnimalType']='0'
dogs_train['AnimalType']='1'
data_train=pd.concat([cats_train,dogs_train])
sns.boxplot(x = "Animal", y = "AgeInYears", data = data_train) 
#As seen previously, most cats in our training dataset are young (between 0 and 1 year old)
#dogs seem to have more variation in their ages, even if the majority are young

With the last histogram and boxplot, we have seen the age distribution for animals in general.
Now we want to see the relation between age and OutcomeType.

We do this for cats and then for dogs.

In [None]:
#OutcomeType for cats by age
g = sns.FacetGrid(cats_train, col='OutcomeType')
g.map(plt.hist, 'AgeInYears')
#The majority of cats transferred or adopted are young (<2.5 years)

In [None]:
#OutcomeType for dogs by age
g = sns.FacetGrid(dogs_train, col='OutcomeType')
g.map(plt.hist, 'AgeInYears')
#The majority of dogs transferred or adopted are young (<2.5 years)
#Contrary to the cats, most dogs returned to their owner or euthanized are not the youngest

It will be easier to work with the AgeInYears variable if we transform it into a categorical variable.
We choose a variable Stage representing the 6 stages of an animal's life.

In [None]:
cats_train.loc[(cats_train.AgeInYears<=0.5), 'Stage']='Baby'
cats_train.loc[(cats_train.AgeInYears>0.5) & (cats_train.AgeInYears<3), 'Stage']='Junior'
cats_train.loc[(cats_train.AgeInYears>=3) & (cats_train.AgeInYears<7), 'Stage']='Prime'
cats_train.loc[(cats_train.AgeInYears>=7) & (cats_train.AgeInYears<11), 'Stage']='Mature'
cats_train.loc[(cats_train.AgeInYears>=11) & (cats_train.AgeInYears<15), 'Stage']='Senior' # was 'Prime' a second time, which left only 5 distinct stages
cats_train.loc[(cats_train.AgeInYears>=15), 'Stage']='Geriatric'
cats_train.count()
dogs_train.loc[(dogs_train.AgeInYears<=0.5), 'Stage']='Baby'
dogs_train.loc[(dogs_train.AgeInYears>0.5) & (dogs_train.AgeInYears<3), 'Stage']='Junior'
dogs_train.loc[(dogs_train.AgeInYears>=3) & (dogs_train.AgeInYears<7), 'Stage']='Prime'
dogs_train.loc[(dogs_train.AgeInYears>=7) & (dogs_train.AgeInYears<11), 'Stage']='Mature'
dogs_train.loc[(dogs_train.AgeInYears>=11) & (dogs_train.AgeInYears<15), 'Stage']='Senior' # was 'Prime' a second time
dogs_train.loc[(dogs_train.AgeInYears>=15), 'Stage']='Geriatric'
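
The age brackets can also be expressed once as a function and reused for cats and dogs. This is a sketch with a hypothetical name `life_stage`; the bounds follow the cell above, and the label 'Senior' for the 11 to 15 bracket is an assumption of this sketch, chosen so that the six brackets get six distinct names.

```python
def life_stage(age_in_years):
    """Map an age in years to one of six life stages (bounds follow the notebook)."""
    if age_in_years != age_in_years:  # NaN is the only value not equal to itself
        return None
    if age_in_years <= 0.5:
        return "Baby"
    if age_in_years < 3:
        return "Junior"
    if age_in_years < 7:
        return "Prime"
    if age_in_years < 11:
        return "Mature"
    if age_in_years < 15:
        return "Senior"
    return "Geriatric"


print([life_stage(age) for age in (0.25, 2, 5, 9, 12, 16)])
```

It could be applied with something like `cats_train['AgeInYears'].map(life_stage)`, which avoids repeating the six conditions for each dataframe.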

In [None]:
#Age distribution for cats and dogs 
fig=plt.figure(figsize=(13,13))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
ax1.set_title('Age Distribution for Cats')
ax2.set_title('Age Distribution for Dogs')
sns.countplot(x="Stage" , hue="OutcomeType" , data=cats_train,ax=ax1)
sns.countplot(x="Stage" , hue="OutcomeType" , data=dogs_train,ax=ax2)
#Kittens are the most numerous, followed by Juniors.
#Juniors seem the most likely to be transferred, and kittens the most likely to die.
#The majority of adopted cats are kittens.

Now we want to see the relation between the Sex and the OutcomeType 

In [None]:
#Cats' outcomes by sex
sns.countplot(x="Sex",hue='OutcomeType',data=cats_train)
#If the sex of the cat is known (0=Female, 1=Male), it is more likely to be adopted or transferred
#The numbers of female and male cats adopted or transferred are about the same
#Cats of unknown sex are more likely to be transferred or euthanized
#There does not seem to be a relation between the sex and the outcome of the cat when the sex is known

In [None]:
#Dogs' outcomes by sex
sns.countplot(x="Sex",hue='OutcomeType',data=dogs_train)
#If the sex of the dog is known (0=Female, 1=Male), it is more likely to be adopted, returned to its owner or transferred
#No dog of unknown sex gets adopted

In [None]:
#Average age of cats by sex and outcome
sns.barplot(x="Sex",y="AgeInYears" , hue="OutcomeType", data=cats_train)
#The average age of a euthanized or returned cat of known sex is much higher than that of an adopted, transferred or dead one.
#The exception is for the cats of unknown sex.

In [None]:
#Average age of dogs by sex and outcome
sns.barplot(x="Sex",y="AgeInYears" , hue="OutcomeType", data=dogs_train)
#The average age of a euthanized or returned dog of known sex is higher than that of an adopted, transferred or dead one.
#The exception is for the dogs of unknown sex

In [None]:
#Numerical transformation of OutcomeType (to use it in the violin plots)
#We index cats_train and dogs_train directly so that the combined `data` dataframe is not overwritten
cats_train.loc[(cats_train['OutcomeType']=='Adoption'), 'Destin']=0
cats_train.loc[(cats_train['OutcomeType']=='Died'), 'Destin']=1
cats_train.loc[(cats_train['OutcomeType']=='Euthanasia'), 'Destin']=2
cats_train.loc[(cats_train['OutcomeType']=='Return_to_owner'), 'Destin']=3
cats_train.loc[(cats_train['OutcomeType']=='Transfer'), 'Destin']=4

dogs_train.loc[(dogs_train['OutcomeType']=='Adoption'), 'Destin']=0
dogs_train.loc[(dogs_train['OutcomeType']=='Died'), 'Destin']=1
dogs_train.loc[(dogs_train['OutcomeType']=='Euthanasia'), 'Destin']=2
dogs_train.loc[(dogs_train['OutcomeType']=='Return_to_owner'), 'Destin']=3
dogs_train.loc[(dogs_train['OutcomeType']=='Transfer'), 'Destin']=4
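
The same encoding can be stated once as a lookup table. The sketch below uses the outcome labels and codes from the cell above; the names `OUTCOME_CODES` and `encode_outcome` are hypothetical.

```python
# Integer code for each outcome, in the same order as the cell above.
OUTCOME_CODES = {
    "Adoption": 0,
    "Died": 1,
    "Euthanasia": 2,
    "Return_to_owner": 3,
    "Transfer": 4,
}


def encode_outcome(outcome):
    """Return the integer code for an outcome label, or None if it is unknown."""
    return OUTCOME_CODES.get(outcome)


print(encode_outcome("Euthanasia"))
```

With pandas, the dictionary can be passed directly to `Series.map`, for example `cats_train['OutcomeType'].map(OUTCOME_CODES)`, which replaces the five `.loc` assignments per dataframe.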

In [None]:
#Violin plots describing the outcome of cats and dogs depending on their age stage and sex
fig=plt.figure(figsize=(10,10))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
ax1.set_title('Outcome by Sex and Stage for Cats')
ax2.set_title('Outcome by Sex and Stage for Dogs')
sns.violinplot(x='Sex', y='Destin', hue='Stage', data=cats_train,ax=ax1)
sns.violinplot(x='Sex', y='Destin', hue='Stage', data=dogs_train,ax=ax2)            
#As said previously, whether the cat is male or female does not seem to play a role in its outcome: the distributions for males and females look alike.
#If the sex is unknown, the outcomes are not the same.
#The age definitely plays a role: there are variations depending on the age.

Now we examine the relation between OutcomeType and the remaining variables (Neutered, Name).

In [None]:
#Animals' outcomes depending on whether they are intact (0), neutered (1) or unknown (2)
sns.countplot(x="Neutered",hue='OutcomeType',data=data_train)
#Neutered animals are the most often adopted and returned to their owner
#Intact animals and animals whose status is unknown are more likely to be transferred

In [None]:
#Outcome of the animals depending on whether they have a name or not
sns.countplot(x="Name",hue='OutcomeType',data=data_train)
#It's clear that animals with a name are the most often adopted and returned to their owner
#Having a name is an important feature for adoption or return to the owner

Now we want to see the correlation between the Hair and the OutcomeType, for cats only, as it's an additional characteristic of cats

In [None]:
#Outcome of the cats depending on their Hair (Short, Medium,...)
sns.countplot(x="Hair",hue='OutcomeType',data=cats_train)
#The majority of the cats seem to have short hair. They seem to be more likely to be transferred.
#There does not seem to be a difference between the outcomes of cats with long or medium hair.

We now use a heat map to get a quick overview of all the correlations between variables

In [None]:
data_train=pd.concat([cats_train,dogs_train])
del data_train['Tabby'] # we delete the column Tabby, present only for the cats
# we make sure that all the numerical variables have a numeric type (values that cannot be converted become NaN)
for column in ['Year','Month','Hour','AnimalType']:
    data_train[column] = pd.to_numeric(data_train[column], errors='coerce')
data_train.info()

In [None]:
#Heat map for all animals : data_train is composed of the train sets of cats and dogs

plt.figure(figsize=(10,10))
h=sns.heatmap(data_train.corr(),annot=True)
#AgeInYears, AnimalType, Hour, Name, Neutered and Sex are the variables most correlated with the animal's outcome

LEARNING ON CATS AND DOGS TOGETHER

In preparation for the learning methods, we make sure that all the variables are usable (we try to avoid categorical variables, which often can't be handled by the learning methods)

In [None]:
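#We load the data for cats and dogs : rows with a known OutcomeType form the train set, the others the test set
#.copy() gives independent dataframes, so that adding columns later does not raise a SettingWithCopyWarning
cats_train=cats[cats.OutcomeType.notnull()].copy()
cats_test=cats[cats.OutcomeType.isnull()].copy()

dogs_train=dogs[dogs.OutcomeType.notnull()].copy()
dogs_test=dogs[dogs.OutcomeType.isnull()].copy()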
    

In [None]:
#We add AnimalType to distinguish cats ('0') from dogs ('1') once they are in the same dataframe
cats_train['AnimalType']='0'
dogs_train['AnimalType']='1'

cats_test['AnimalType']='0'
dogs_test['AnimalType']='1'

data_train=pd.concat([cats_train,dogs_train])
data_test=pd.concat([cats_test,dogs_test])
data_train.head()

In [None]:
#In order to use the learning methods, we need to convert the object columns (Race, Color...) into integer columns.
#The encoder is fitted on the train and test values together, so that a given label gets the same code in both sets.
from sklearn import preprocessing 
le=preprocessing.LabelEncoder()

for source,target in [('Race1','Race12'),('Race2','Race22'),('Color1','Col1'),('Color2','Col2')]:
    le.fit(pd.concat([data_train[source],data_test[source]]))
    data_train[target]=le.transform(data_train[source])
    data_test[target]=le.transform(data_test[source])

data_train.head()
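
One point to keep in mind with `LabelEncoder`: it numbers the labels it was fitted on in sorted order, so an encoder fitted on the train set and another fitted on the test set can give the same breed two different codes. The plain-Python sketch below mimics that behaviour with made-up breed names (`fit_codes` is a hypothetical helper, not a scikit-learn function) and shows that fitting on the union of both sets keeps the codes aligned.

```python
def fit_codes(labels):
    """Assign integer codes to labels in sorted order (mimics LabelEncoder)."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

# Made-up values standing in for data_train.Race1 and data_test.Race1.
train_breeds = ['Beagle', 'Poodle', 'Siamese']
test_breeds = ['Poodle', 'Siamese']

# Fitted separately, 'Poodle' gets one code in train and another in test.
separate_train = fit_codes(train_breeds)
separate_test = fit_codes(test_breeds)
print(separate_train['Poodle'], separate_test['Poodle'])

# Fitted on the union, every label keeps a single code in both sets.
shared = fit_codes(train_breeds + test_breeds)
train_encoded = [shared[b] for b in train_breeds]
test_encoded = [shared[b] for b in test_breeds]
print(train_encoded, test_encoded)
```

A model trained on the train codes can only interpret the test codes correctly in the second case.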

In [None]:
#We cut the column AgeInYears into 6 categories (Stage)
#and we create a new column Destin containing the OutcomeType as a numeric code (both are stored as strings such as '0')
data_train.loc[(data_train.AgeInYears<=0.5), 'Stage']='0'
data_train.loc[(data_train.AgeInYears>0.5) & (data_train.AgeInYears<3), 'Stage']='1'
data_train.loc[(data_train.AgeInYears>=3) & (data_train.AgeInYears<7), 'Stage']='2'
data_train.loc[(data_train.AgeInYears>=7) & (data_train.AgeInYears<11), 'Stage']='3'
data_train.loc[(data_train.AgeInYears>=11) & (data_train.AgeInYears<15), 'Stage']='4'
data_train.loc[(data_train.AgeInYears>=15), 'Stage']='5'

data_train.loc[(data_train['OutcomeType']=='Adoption'), 'Destin']='0'
data_train.loc[(data_train['OutcomeType']=='Died'), 'Destin']='1'
data_train.loc[(data_train['OutcomeType']=='Euthanasia'), 'Destin']='2'
data_train.loc[(data_train['OutcomeType']=='Return_to_owner'), 'Destin']='3'
data_train.loc[(data_train['OutcomeType']=='Transfer'), 'Destin']='4'

data_test.loc[(data_test.AgeInYears<=0.5), 'Stage']='0'
data_test.loc[(data_test.AgeInYears>0.5) & (data_test.AgeInYears<3), 'Stage']='1'
data_test.loc[(data_test.AgeInYears>=3) & (data_test.AgeInYears<7), 'Stage']='2'
data_test.loc[(data_test.AgeInYears>=7) & (data_test.AgeInYears<11), 'Stage']='3'
data_test.loc[(data_test.AgeInYears>=11) & (data_test.AgeInYears<15), 'Stage']='4'
data_test.loc[(data_test.AgeInYears>=15), 'Stage']='5'
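
The six age stages above can also be expressed as one small function, which makes the boundaries easier to check. A minimal sketch, assuming the same cut points as the cell above; `age_to_stage` is a hypothetical helper, and missing ages are returned as `None` rather than being assigned a stage.

```python
import math
from bisect import bisect_right

# Lower bounds (in years) of stages 2, 3, 4 and 5.
STAGE_BOUNDS = [3, 7, 11, 15]

def age_to_stage(age_in_years):
    """Return the stage label ('0' to '5') for an age, or None if the age is missing."""
    if age_in_years is None or math.isnan(age_in_years):
        return None
    if age_in_years <= 0.5:
        return '0'
    # bisect_right counts how many lower bounds are <= age.
    return str(1 + bisect_right(STAGE_BOUNDS, age_in_years))

for age in [0.1, 0.5, 2, 3, 10, 15, float('nan')]:
    print(age, age_to_stage(age))
```

With pandas it could be applied as `data_train['AgeInYears'].apply(age_to_stage)`, once for the train set and once for the test set.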

We are now going to try different learning methods :
- decision tree
- random Forest
- naive Bayes

We first try the decision tree classifier.
It’s a classification model which creates a set of rules from the training dataset and uses them to predict a target class. We try this method because it’s one of the fastest and best known, and if it’s the method we choose, we can visualize the trained decision tree.

In [None]:
#decision tree
#We create a dataframe df from the data set data_train
df=data_train[['AgeInYears','AnimalType','Race1','Race2','Color1','Color2','Day','Hour','Mix','Month','Name','Neutered','OutcomeType','Sex','Year','Race12','Race22','Col1','Col2','Stage','Destin']]
#X holds the values of the selected predictor variables
#Y holds the target : the outcome code stored in Destin
X=df[['AnimalType','Mix','Month','Day','Hour','Year','Name','Neutered','Sex','Year','Race12','Race22','Col1','Col2','Stage']].values
Y=df['Destin'].values
X_train,X_test,Y_train,Y_test = train_test_split(X,Y)#we split the train dataset in two : X_train and Y_train to train the method
#and X_test and Y_test to measure its accuracy and log loss
classifier = tree.DecisionTreeClassifier()
classifier.fit(X_train,Y_train)
Y_predict=classifier.predict(X_test)#to get the accuracy, we predict the class of the outcomes
accuracy_score(Y_test,Y_predict)# accuracy

In [None]:
predictions=classifier.predict_proba(X_test)#to get the log loss, we predict the probability of each class of the outcome
log_loss(Y_test,predictions)#log loss
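
For reference, the multi-class log loss is the average of the negative log of the probability that the model gave to the true class. The sketch below computes it by hand on made-up probabilities (`manual_log_loss` is a hypothetical helper); it leaves out the safeguards against `log(0)` that a library implementation needs.

```python
import math

def manual_log_loss(true_classes, probabilities):
    """Mean of -log(p), where p is the probability given to the true class."""
    total = 0.0
    for true_class, row in zip(true_classes, probabilities):
        total += -math.log(row[true_class])
    return total / len(true_classes)

# Made-up example: 3 animals, 5 possible outcomes (columns 0 to 4).
y_true = [0, 4, 3]
y_proba = [
    [0.7, 0.0, 0.1, 0.1, 0.1],   # confident and right
    [0.4, 0.0, 0.1, 0.1, 0.4],   # hesitating between two outcomes
    [0.6, 0.1, 0.1, 0.1, 0.1],   # confident and wrong
]
print(manual_log_loss(y_true, y_proba))
```

The third animal contributes most of the loss: a confident wrong answer is penalized heavily. A fully grown decision tree tends to output probabilities of exactly 0 or 1, which is one reason its log loss can be high even when its accuracy is reasonable.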

Now we are going to use a random forest : it builds many decision trees, each on a different bootstrap sample of the training data, and combines their predictions

In [None]:
#randomForest
rfclassifier = RandomForestClassifier(n_estimators=100)
rfclassifier.fit(X_train,Y_train)
predictions = rfclassifier.predict(X_test)
accuracy_score(Y_test, predictions)

In [None]:
predictions=rfclassifier.predict_proba(X_test)
log_loss(Y_test,predictions)
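
To make the "combination" of trees concrete: in scikit-learn, the probabilities returned by a random forest are the mean of the probabilities predicted by its individual trees. The sketch below checks this on synthetic data (`X_toy`, `y_toy` and `forest` are illustrative names; the data comes from `make_classification`, not from the shelter dataset).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for X_train / Y_train (3 classes, 6 features).
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_classes=3, random_state=0)

forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X_toy, y_toy)

# Probabilities predicted by each individual tree, then averaged over the trees.
per_tree = np.array([t.predict_proba(X_toy) for t in forest.estimators_])
averaged = per_tree.mean(axis=0)

# Compare the average with the forest's own probabilities.
print(np.allclose(averaged, forest.predict_proba(X_toy)))
```

This averaging smooths the 0/1 probabilities of single trees, which is why a forest usually gets a better log loss than one decision tree.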

Then we try the naive Bayes classification method. It builds on Bayes' theorem on conditional probabilities.  
It has the advantages of being very fast and of being suited to very high-dimensional datasets : if it’s the method we choose, these characteristics will be useful for the shelters. 
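
Written out, the "naive" assumption is that the features $x_1,\dots,x_n$ are independent given the class $y$, so Bayes' theorem reduces to a product of per-feature terms. The symbols below are notation introduced here, not taken from the notebook. `GaussianNB` additionally models each $P(x_i \mid y)$ as a normal distribution whose mean $\mu_{y,i}$ and variance $\sigma_{y,i}^2$ are estimated for each class.

```latex
P(y \mid x_1,\dots,x_n) \;\propto\; P(y)\prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}}
\exp\!\left(-\frac{(x_i-\mu_{y,i})^{2}}{2\sigma_{y,i}^{2}}\right)
```

Since most features here are integer codes of categories (breed, color), the Gaussian assumption is only a rough fit, which may limit how well this method can do on this dataset.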

In [None]:
#naiveBayes
nbclassifier = GaussianNB()
nbclassifier.fit(X_train,Y_train)
predictions = nbclassifier.predict(X_test)
accuracy_score(Y_test, predictions)

In [None]:
predictions=nbclassifier.predict_proba(X_test)
log_loss(Y_test,predictions)

We see that Random Forest is the method that gets the best result. We are going to use it to predict the Outcome of data_test.

In [None]:
#we delete the column OutcomeType of data_test, we are going to fill it later.
del data_test['OutcomeType']

In [None]:
#we choose the method with the best result : random forest
X_missing = data_test[['AnimalType', 'Mix', 'Month','Day','Hour','Year','Name', 'Neutered', 'Sex',  'Year','Race12','Race22', 'Col1', 'Col2', 'Stage']].values
# we select the same variables for X_missing that we selected earlier for X_train
prediction = rfclassifier.predict(X_missing)

data_test['OutcomeType']=prediction#we are going to assign the prediction to the column OutcomeType
data_test.head()

In [None]:
#we convert the resulting numbers into the associated OutcomeType
data_test.loc[(data_test['OutcomeType']=='0'), 'OutcomeType']='Adoption'
data_test.loc[(data_test['OutcomeType']=='1'), 'OutcomeType']='Died'
data_test.loc[(data_test['OutcomeType']=='2'), 'OutcomeType']='Euthanasia'
data_test.loc[(data_test['OutcomeType']=='3'), 'OutcomeType']='Return_to_owner'
data_test.loc[(data_test['OutcomeType']=='4'), 'OutcomeType']='Transfer'
data_test.head()

In [None]:
#We see that with the prediction model, the majority of animals are still transferred or adopted.
#The difference is that the most frequent outcome here is transfer, and not adoption
OutcomeTypeTest= data_test.OutcomeType.value_counts() / len(data_test.index)
OutcomeTypeTest.plot(kind='barh')

We are now going to write the csv required by the Kaggle challenge : a csv file with the animal ID, all candidate outcome names, and a probability for each outcome

In [None]:
topredict=data_test[['AnimalType', 'Mix', 'Month','Day','Hour','Year','Name', 'Neutered', 'Sex',  'Year','Race12','Race22', 'Col1', 'Col2', 'Stage']].values
pred=rfclassifier.predict_proba(topredict)#we create the prediction of the probabilities of each Outcomes
pred2=pd.DataFrame(pred, columns=['Adoption', 'Died', 'Euthanasia','Return_to_owner','Transfer'])
#we reset the index of data_test (it then starts at zero, like the index of pred2) so that the IDs line up with the rows of pred2
data_test.reset_index(drop=True,inplace=True)
pred2['ID']=data_test['ID']

#we then change the order of the columns, to put ID first
columnsTitles = ['ID','Adoption', 'Died', 'Euthanasia','Return_to_owner','Transfer']
final=pred2.reindex(columns=columnsTitles)
final.ID = final.ID.astype(int)
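
As a sanity check on the submission format, each row should hold one ID followed by five probabilities that sum to 1, with the columns in the same order as the classifier's classes (`rfclassifier.classes_` in scikit-learn). The sketch below writes such a file to an in-memory buffer with made-up values; `write_submission` is a hypothetical helper, not part of the notebook.

```python
import csv
import io

OUTCOMES = ['Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer']

def write_submission(ids, probabilities, handle):
    """Write one row per animal: ID followed by one probability per outcome."""
    writer = csv.writer(handle, lineterminator='\n')
    writer.writerow(['ID'] + OUTCOMES)
    for animal_id, row in zip(ids, probabilities):
        # Each row of probabilities must cover the 5 outcomes and sum to 1.
        assert len(row) == len(OUTCOMES)
        assert abs(sum(row) - 1.0) < 1e-6
        writer.writerow([animal_id] + list(row))

# Made-up predictions for two animals.
buffer = io.StringIO()
write_submission([1, 2],
                 [[0.5, 0.0, 0.1, 0.1, 0.3], [0.2, 0.0, 0.0, 0.6, 0.2]],
                 buffer)
print(buffer.getvalue())
```

The two assertions catch the most common mistakes: a missing outcome column and rows built from class labels instead of probabilities.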

In [None]:
final.head()

In [None]:
#we save the dataframe in a csv
#final.to_csv('C:/Users/mathi/Downloads/all1/final.csv', index=False)

LEARNING ON CATS ONLY  

We want to know why the accuracy score and the log loss are not better.
We decided to work on the data separately to see if we get different scores for one type of animal (thus showing whether the low accuracy can be attributed to the cats or to the dogs) or if it’s the same for the two.


We begin with the cats. The steps are the same as previously, the difference being that we keep the features Hair and Tabby,
which exist only for the cats. We transform Hair the same way that we transformed the Race and Color features.

In [None]:
#We encode the object-type labels with LabelEncoder to get integer columns that the learning methods can use
#The encoder is fitted on the train and test values together, so that a given label gets the same code in both sets
from sklearn import preprocessing 
le=preprocessing.LabelEncoder()

cats_train=cats[cats.OutcomeType.notnull()].copy()
cats_test=cats[cats.OutcomeType.isnull()].copy()

for source,target in [('Hair','Pelage'),('Race1','Race12'),('Race2','Race22'),('Color1','Col1'),('Color2','Col2')]:
    le.fit(pd.concat([cats_train[source],cats_test[source]]))
    cats_train[target]=le.transform(cats_train[source])
    cats_test[target]=le.transform(cats_test[source])

In [None]:
#We summarize the variable AgeInYears in a variable Stage describing the different life stages of a cat, coded as '0' to '5'
#Destin holds the 5 possible outcomes, coded as '0' to '4'
cats_train.loc[(cats_train.AgeInYears<=0.5), 'Stage']='0'
cats_train.loc[(cats_train.AgeInYears>0.5) & (cats_train.AgeInYears<3), 'Stage']='1'
cats_train.loc[(cats_train.AgeInYears>=3) & (cats_train.AgeInYears<7), 'Stage']='2'
cats_train.loc[(cats_train.AgeInYears>=7) & (cats_train.AgeInYears<11), 'Stage']='3'
cats_train.loc[(cats_train.AgeInYears>=11) & (cats_train.AgeInYears<15), 'Stage']='4'
cats_train.loc[(cats_train.AgeInYears>=15), 'Stage']='5'

cats_train.loc[(cats_train['OutcomeType']=='Adoption'), 'Destin']='0'
cats_train.loc[(cats_train['OutcomeType']=='Died'), 'Destin']='1'
cats_train.loc[(cats_train['OutcomeType']=='Euthanasia'), 'Destin']='2'
cats_train.loc[(cats_train['OutcomeType']=='Return_to_owner'), 'Destin']='3'
cats_train.loc[(cats_train['OutcomeType']=='Transfer'), 'Destin']='4'

cats_test.loc[(cats_test.AgeInYears<=0.5), 'Stage']='0'
cats_test.loc[(cats_test.AgeInYears>0.5) & (cats_test.AgeInYears<3), 'Stage']='1'
cats_test.loc[(cats_test.AgeInYears>=3) & (cats_test.AgeInYears<7), 'Stage']='2'
cats_test.loc[(cats_test.AgeInYears>=7) & (cats_test.AgeInYears<11), 'Stage']='3'
cats_test.loc[(cats_test.AgeInYears>=11) & (cats_test.AgeInYears<15), 'Stage']='4'
cats_test.loc[(cats_test.AgeInYears>=15), 'Stage']='5'

In [None]:
#We create a dataframe df with the data set cats_train
df=cats_train[['AgeInYears','Race1', 'Race2', 'Color1', 'Color2','Day', 'Hair', 'Hour', 'Mix', 'Month', 'Name', 'Neutered', 'OutcomeType', 'Sex', 'Year',  'Race12','Race22', 'Col1', 'Col2', 'Stage','Tabby', 'Pelage','Destin']]
#X holds the values of the selected predictor variables
#Y holds the target : the outcome code stored in Destin
X=df[[ 'Mix', 'Month','Day','Hour','Year','Pelage','Tabby','Name', 'Neutered', 'Sex',  'Year',  'Race12', 'Race22', 'Col1', 'Col2', 'Stage']].values
Y=df['Destin'].values

In [None]:
#We split X and Y in 2 train subsets and 2 test subsets 
X_train,X_test,Y_train,Y_test = train_test_split(X,Y)

In [None]:
#decision trees
classifier = tree.DecisionTreeClassifier()
classifier.fit(X_train,Y_train)
Y_predict=classifier.predict(X_test)
accuracy_score(Y_test,Y_predict)

In [None]:
predictions=classifier.predict_proba(X_test)
log_loss(Y_test,predictions)

In [None]:
#randomForest
rfclassifier = RandomForestClassifier(n_estimators=100)
rfclassifier.fit(X_train,Y_train)
predictions = rfclassifier.predict(X_test)
accuracy_score(Y_test, predictions)

In [None]:
predictions=rfclassifier.predict_proba(X_test)
log_loss(Y_test,predictions)

In [None]:
#naiveBayes
nbclassifier = GaussianNB()
nbclassifier.fit(X_train,Y_train)
predictions = nbclassifier.predict(X_test)
accuracy_score(Y_test, predictions)

In [None]:
predictions=nbclassifier.predict_proba(X_test)
log_loss(Y_test,predictions)

Conclusion : for cats, we get the highest accuracy_score with the randomForest method : our model has an accuracy of ~80% and a log loss of 0.75
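
To judge an accuracy figure, it helps to compare it with the score of a trivial model that always predicts the most frequent outcome. A minimal sketch on made-up labels (`majority_baseline_accuracy` is a hypothetical helper); on the real data it would be called on `Y_test`.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up outcome codes: '0' (Adoption) appears 4 times out of 10.
toy_labels = ['0', '4', '0', '3', '4', '0', '2', '0', '4', '1']
print(majority_baseline_accuracy(toy_labels))
```

The gap between the model's accuracy and this baseline says more about the model than the raw accuracy alone.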

We are now going to predict the class of the Outcomes on the cats_test, using Random Forest

In [None]:
#We delete the column OutcomeType of cats_test, it will be completed later
del cats_test['OutcomeType']

In [None]:
#we choose the classifier of the RandomForest method to work with 
X_missing = cats_test[['Mix', 'Month','Day','Hour','Year','Pelage','Tabby','Name', 'Neutered', 'Sex',  'Year',  'Race12', 'Race22', 'Col1', 'Col2', 'Stage']].values
X_missing
prediction = rfclassifier.predict(X_missing)

cats_test['OutcomeType']=prediction
cats_test.head()

In [None]:
#The numbers are converted to the corresponding OutcomeType
cats_test.loc[(cats_test['OutcomeType']=='0'), 'OutcomeType']='Adoption'
cats_test.loc[(cats_test['OutcomeType']=='1'), 'OutcomeType']='Died'
cats_test.loc[(cats_test['OutcomeType']=='2'), 'OutcomeType']='Euthanasia'
cats_test.loc[(cats_test['OutcomeType']=='3'), 'OutcomeType']='Return_to_owner'
cats_test.loc[(cats_test['OutcomeType']=='4'), 'OutcomeType']='Transfer'
cats_test.head()

In [None]:
#We see that with the prediction model, the majority of cats are still transferred or adopted
OutcomeTypeCatsTest= cats_test.OutcomeType.value_counts() / len(cats_test.index)
OutcomeTypeCatsTest.plot(kind='barh')

LEARNING ON DOGS ONLY 

We are now going to do the same steps for the dogs

In [None]:
#We encode the object-type labels with LabelEncoder to get integer columns that the learning methods can use
#The encoder is fitted on the train and test values together, so that a given label gets the same code in both sets
from sklearn import preprocessing 
le=preprocessing.LabelEncoder()

dogs_train=dogs[dogs.OutcomeType.notnull()].copy()
dogs_test=dogs[dogs.OutcomeType.isnull()].copy()

for source,target in [('Race1','Race12'),('Race2','Race22'),('Color1','Col1'),('Color2','Col2')]:
    le.fit(pd.concat([dogs_train[source],dogs_test[source]]))
    dogs_train[target]=le.transform(dogs_train[source])
    dogs_test[target]=le.transform(dogs_test[source])

In [None]:
dogs_train.loc[(dogs_train.AgeInYears<=0.5), 'Stage']='0'
dogs_train.loc[(dogs_train.AgeInYears>0.5) & (dogs_train.AgeInYears<3), 'Stage']='1'
dogs_train.loc[(dogs_train.AgeInYears>=3) & (dogs_train.AgeInYears<7), 'Stage']='2'
dogs_train.loc[(dogs_train.AgeInYears>=7) & (dogs_train.AgeInYears<11), 'Stage']='3'
dogs_train.loc[(dogs_train.AgeInYears>=11) & (dogs_train.AgeInYears<15), 'Stage']='4'
dogs_train.loc[(dogs_train.AgeInYears>=15), 'Stage']='5'

dogs_train.loc[(dogs_train['OutcomeType']=='Adoption'), 'Destin']='0'
dogs_train.loc[(dogs_train['OutcomeType']=='Died'), 'Destin']='1'
dogs_train.loc[(dogs_train['OutcomeType']=='Euthanasia'), 'Destin']='2'
dogs_train.loc[(dogs_train['OutcomeType']=='Return_to_owner'), 'Destin']='3'
dogs_train.loc[(dogs_train['OutcomeType']=='Transfer'), 'Destin']='4'

dogs_test.loc[(dogs_test.AgeInYears<=0.5), 'Stage']='0'
dogs_test.loc[(dogs_test.AgeInYears>0.5) & (dogs_test.AgeInYears<3), 'Stage']='1'
dogs_test.loc[(dogs_test.AgeInYears>=3) & (dogs_test.AgeInYears<7), 'Stage']='2'
dogs_test.loc[(dogs_test.AgeInYears>=7) & (dogs_test.AgeInYears<11), 'Stage']='3'
dogs_test.loc[(dogs_test.AgeInYears>=11) & (dogs_test.AgeInYears<15), 'Stage']='4'
dogs_test.loc[(dogs_test.AgeInYears>=15), 'Stage']='5'

In [None]:
#We create df a dataframe with the dataset dogs_train
df=dogs_train[['AgeInYears','Race1', 'Race2', 'Color1', 'Color2','Day', 'Hour', 'Mix', 'Month', 'Name', 'Neutered', 'OutcomeType', 'Sex', 'Year',  'Race12','Race22', 'Col1', 'Col2', 'Stage', 'Destin']]
X=df[[ 'Mix', 'Month','Day','Hour','Year','Name', 'Neutered', 'Sex',  'Year',  'Race12', 'Race22', 'Col1', 'Col2', 'Stage']].values
Y=df['Destin'].values

In [None]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y)

In [None]:
#decision trees
classifier = tree.DecisionTreeClassifier()
classifier.fit(X_train,Y_train)
Y_predict=classifier.predict(X_test)
accuracy_score(Y_test,Y_predict)

In [None]:
predictions=classifier.predict_proba(X_test)
log_loss(Y_test,predictions)

In [None]:
#randomForest
rfclassifier = RandomForestClassifier(n_estimators=100)
rfclassifier.fit(X_train,Y_train)
predictions = rfclassifier.predict(X_test)
accuracy_score(Y_test, predictions)

In [None]:
predictions=rfclassifier.predict_proba(X_test)
log_loss(Y_test,predictions)

In [None]:
#naiveBayes
nbclassifier = GaussianNB()
nbclassifier.fit(X_train,Y_train)
predictions = nbclassifier.predict(X_test)
accuracy_score(Y_test, predictions)

In [None]:
predictions=nbclassifier.predict_proba(X_test)
log_loss(Y_test,predictions)

Conclusion : For dogs, we get the highest accuracy_score with the randomForest method : our model has an accuracy of ~59% and a log loss of 1.065.
We are now going to predict the class of the Outcomes on the dogs_test, using Random Forest

In [None]:
#We delete the column OutcomeType of dogs_test, it will be completed later
del dogs_test['OutcomeType']

In [None]:
#we choose the classifier of the RandomForest method to work with 
X_missing = dogs_test[[ 'Mix', 'Month','Day','Hour','Year','Name', 'Neutered', 'Sex',  'Year','Race12','Race22', 'Col1', 'Col2', 'Stage']].values
X_missing
prediction = rfclassifier.predict(X_missing)

dogs_test['OutcomeType']=prediction
dogs_test.head()

In [None]:
#The numbers will be converted to the corresponding OutcomeType
dogs_test.loc[(dogs_test['OutcomeType']=='0'), 'OutcomeType']='Adoption'
dogs_test.loc[(dogs_test['OutcomeType']=='1'), 'OutcomeType']='Died'
dogs_test.loc[(dogs_test['OutcomeType']=='2'), 'OutcomeType']='Euthanasia'
dogs_test.loc[(dogs_test['OutcomeType']=='3'), 'OutcomeType']='Return_to_owner'
dogs_test.loc[(dogs_test['OutcomeType']=='4'), 'OutcomeType']='Transfer'
dogs_test

In [None]:
#We see that with the prediction model, we still have the fact that the majority of dogs are adopted or returned to owner
OutcomeTypeDogsTest= dogs_test.OutcomeType.value_counts() / len(dogs_test.index)
OutcomeTypeDogsTest.plot(kind='barh')

Conclusion : We have seen that, for the same method, the cats get much better results than the dogs, which suggests that the randomForest approach itself is sound and that the lower overall score comes mainly from the dogs. It may be because the cats have more features to input in the model (Hair and Tabby).
While it’s been possible to predict the probabilities of the outcomes of the animals with a satisfying precision, improving this precision will require more features (the behavior of the animal (nice, stressed…), whether it has a disease or not...)


In [None]:
#csv result file required by the Kaggle :

#final.to_csv('F:/final.csv', index=False)
