# Pet Finder - Pet Adoption

PetFinder.my is a Malaysian pet adoption platform with a database of more than 150,000 animals.

Animal adoption rates are strongly correlated with the metadata associated with their online profiles, such as the descriptive text and the characteristics of the photos. As an example, PetFinder is currently experimenting with a simple AI tool called the Cuteness Meter, which ranks how cute a pet is based on qualities present in its photos.

In this competition, you will develop algorithms to predict the adoptability of pets; specifically, how quickly is a pet adopted?

Organization website: https://www.petfinder.my/

Data: https://www.kaggle.com/c/petfinder-adoption-prediction



## Initial Script - Exploratory Analysis

The goal of this work is to predict how quickly a pet is adopted, based on the pet's listing on PetFinder.

Sometimes a profile represents a group of pets. In that case, the adoption speed is determined by the speed at which all of the pets are adopted.

The data includes **free text**, **tabular data**, and **images**, which makes it a very rich dataset to explore.

### Examples of EDA (Exploratory Data Analysis) Scripts

* In R: https://www.kaggle.com/jaseziv83/an-extensive-eda-of-petfinder-my-data
* In Python: https://www.kaggle.com/artgor/exploration-of-data-step-by-step

File descriptions:

* train.csv - Tabular/text data for the training set
* test.csv - Tabular/text data for the test set
* sample_submission.csv - A sample submission file in the correct format
* breed_labels.csv - Contains Type, and BreedName for each BreedID. Type 1 is dog, 2 is cat.
* color_labels.csv - Contains ColorName for each ColorID
* state_labels.csv - Contains StateName for each StateID

In [None]:
# Library imports
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns 
from wordcloud import WordCloud

%matplotlib inline
pd.set_option('display.max_columns', 30)
plt.rcParams['figure.figsize'] = [12.0, 8.0]

In [None]:
# Tabular data
train = pd.read_csv('../input/petfinder-adoption-prediction/train/train.csv')
test = pd.read_csv('../input/petfinder-adoption-prediction/test/test.csv')



In [None]:
train.shape

In [None]:
test.shape

In [None]:
train.head() 

In [None]:
train.describe()

### Data Fields

* PetID - Unique hash ID of pet profile
* **AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.**
* Type - Type of animal (1 = Dog, 2 = Cat)
* Name - Name of pet (Empty if not named)
* Age - Age of pet when listed, in months
* Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
* Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
* Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
* Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
* Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
* Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
* MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
* FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
* Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
* Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
* Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
* Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
* Quantity - Number of pets represented in profile
* Fee - Adoption fee (0 = Free)
* State - State location in Malaysia (Refer to StateLabels dictionary)
* RescuerID - Unique hash ID of rescuer
* VideoAmt - Total uploaded videos for this pet
* PhotoAmt - Total uploaded photos for this pet
* Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.

In [None]:
# Bar chart of the target classes; each bar is annotated with its share of the training set
g = train['AdoptionSpeed'].value_counts().sort_index(ascending=False).plot(kind='bar', color='teal');
plt.xticks(rotation = 'horizontal');
plt.title('Adoption speed classes counts (lower is faster)');
ax=g.axes
for p in ax.patches:
     ax.annotate(f"{p.get_height() * 100 / train.shape[0]:.2f}%", (p.get_x() + p.get_width() / 2., p.get_height()),
         ha='center', va='center', fontsize=11, color='gray', rotation=0, xytext=(0, 10),
         textcoords='offset points')  

### Target: AdoptionSpeed

The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way:

* 0 - Pet was adopted on the same day as it was listed.
* 1 - Pet was adopted between 1 and 7 days (1st week) after being listed.
* 2 - Pet was adopted between 8 and 30 days (1st month) after being listed.
* 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
* 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).
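
The binning above can be written out as a small function. This is only a sketch: the dataset ships the class label, not the raw number of days, so `adoption_speed_from_days` and its `None` convention for "never adopted" are hypothetical names introduced here for illustration. It also assumes that any wait longer than 90 days falls in class 4, which is consistent with the note that no pets waited between 90 and 100 days.

```python
from typing import Optional


def adoption_speed_from_days(days_to_adoption: Optional[int]) -> int:
    """Map days between listing and adoption to an AdoptionSpeed class.

    None stands for "never adopted" (a convention assumed for this sketch).
    """
    if days_to_adoption is None:
        return 4
    if days_to_adoption < 0:
        raise ValueError("days_to_adoption must be non-negative")
    if days_to_adoption == 0:
        return 0  # same day
    if days_to_adoption <= 7:
        return 1  # 1st week
    if days_to_adoption <= 30:
        return 2  # 1st month
    if days_to_adoption <= 90:
        return 3  # 2nd and 3rd month
    return 4  # assumption: anything beyond 90 days counts as class 4


for d in (0, 3, 8, 45, 120, None):
    print(d, adoption_speed_from_days(d))
```

Because the classes are ordered (lower is faster), the target can be treated as ordinal rather than purely nominal when choosing a model or a metric.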

In [None]:
# Map the pet type codes to labels (1 = Dog, anything else = Cat)
train['Type'] = train['Type'].apply(lambda x: 'Dog' if x == 1 else 'Cat')

In [None]:
g = sns.countplot(x='Type', data=train);
plt.title('Number of cats and dogs');
ax=g.axes
for p in ax.patches:
     ax.annotate(f"{p.get_height() * 100 / train.shape[0]:.2f}%", (p.get_x() + p.get_width() / 2., p.get_height()),
         ha='center', va='center', fontsize=11, color='gray', rotation=0, xytext=(0, 10),
         textcoords='offset points')  

In [None]:
# Count of cats and dogs within each AdoptionSpeed class
g = sns.countplot(x='AdoptionSpeed', data=train, hue='Type');
plt.xticks(rotation='horizontal')
plt.yticks(fontsize='xx-large')
plt.title('Number of cats and dogs by AdoptionSpeed', fontsize='xx-large')


In [None]:

# Gender distribution, with the numeric codes renamed to readable labels
g = train['Gender'].value_counts().rename({1: 'Male',
                                           2: 'Female',
                                           3: 'Mixed (group)'}).plot(kind='bar',
                                                                     figsize=(15, 6))
plt.xticks(rotation='horizontal')
plt.yticks(fontsize='xx-large')
plt.title('Gender distribution', fontsize='xx-large')
ax=g.axes
for p in ax.patches:
     ax.annotate(f"{p.get_height() * 100 / train.shape[0]:.2f}%", (p.get_x() + p.get_width() / 2., p.get_height()),
         ha='center', va='center', fontsize=11, color='gray', rotation=0, xytext=(0, 10),
         textcoords='offset points')  

In [None]:
print(train.Name.value_counts())

In [None]:
train['Name'].str.contains('No Name Yet').value_counts()

In [None]:
train['Name'].str.contains('Adoption|Puppies|Kittens').value_counts() 

In [None]:
train['Name'].str.contains('Puppies').value_counts()

In [None]:
train['Name'].str.contains('Kittens').value_counts()

In [None]:
plt.subplot(1, 2, 1)
text_cat = ' '.join(train.loc[train['Type'] == 'Cat', 'Name'].fillna('').values)
wordcloud = WordCloud(max_font_size=None, background_color='white',
                      width=1200, height=1000).generate(text_cat)
plt.imshow(wordcloud)
plt.title('Top cat names')
plt.axis("off")

plt.subplot(1, 2, 2)
text_dog = ' '.join(train.loc[train['Type'] == 'Dog', 'Name'].fillna('').values)
wordcloud = WordCloud(max_font_size=None, background_color='white',
                      width=1200, height=1000).generate(text_dog)
plt.imshow(wordcloud)
plt.title('Top dog names')
plt.axis("off")

plt.show()

In [None]:
# Normalize the pet names, particularly the placeholders used for unnamed pets
train['Name'] = train['Name'].fillna('Unnamed')
train['Name'] = train['Name'].replace(['No Name', 'No Name Yet'], 'Unnamed')

# Tiene_Nombre = 1 if the pet has a name, 0 otherwise
train['Tiene_Nombre'] = 1
train.loc[train['Name'] == 'Unnamed', 'Tiene_Nombre'] = 0
sns.countplot(x='Tiene_Nombre', data=train, hue='AdoptionSpeed');

In [None]:
print('Most popular names and their adoption speed distribution')
for n in train['Name'].value_counts().index[:10]:
    print(n)
    print(train.loc[train['Name'] == n, 'AdoptionSpeed'].value_counts().sort_index())
    print('')

## Tasks to Do

#### Preprocessing:

* Check for nulls and decide whether to impute or drop them
* Check the categorical variables and convert them to numeric
* Standardize or normalize
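
The three preprocessing steps can be sketched on a toy DataFrame. The column names come from `train.csv`, but the rows are made up, and the specific choices (a placeholder for missing names, one-hot encoding, z-scores) are assumptions for illustration rather than the required approach.

```python
import pandas as pd

# Toy rows that mimic a few train.csv columns; the values are made up.
toy = pd.DataFrame({
    'Name': ['Milo', None, 'Luna', None],
    'Age': [3, 12, 1, 48],
    'Fee': [0, 100, 0, 50],
    'Vaccinated': [1, 2, 3, 1],
})

# 1. Count nulls per column, then impute (here: a placeholder for missing names).
print(toy.isnull().sum())
toy['Name'] = toy['Name'].fillna('Unnamed')

# 2. One-hot encode a coded categorical column so the codes are not read as ordered numbers.
encoded = pd.get_dummies(toy, columns=['Vaccinated'], prefix='Vaccinated', dtype=int)

# 3. Standardize the numeric columns to zero mean and unit variance (z-score).
num_cols = ['Age', 'Fee']
encoded[num_cols] = (encoded[num_cols] - encoded[num_cols].mean()) / encoded[num_cols].std(ddof=0)

print(encoded)
```

Tree-based models generally do not need step 3, so whether to standardize depends on the model chosen later.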

#### Ideas for EDA:

* Analyze the breed of the pets
* Analyze the gender
* Analyze the 3 colors reported for each pet
* Study the health variables: vaccination, sterilization, deworming
* Understand the impact of the fee charged
* Study the available geographic regions
* Analyze the impact of having photos and/or videos

All of these analyses can be done by species (cat/dog) and against the target.
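
One way to look at a feature by species and against the target is a normalized cross-tabulation. The sketch below uses made-up rows and picks `Vaccinated` as the example feature; any of the columns listed above could take its place.

```python
import pandas as pd

# Made-up rows with the same column names as the training data.
toy = pd.DataFrame({
    'Type': ['Dog', 'Dog', 'Dog', 'Cat', 'Cat', 'Cat'],
    'Vaccinated': [1, 1, 2, 1, 2, 2],
    'AdoptionSpeed': [1, 4, 2, 0, 2, 2],
})

# Rows: (species, feature value); columns: target class; cells: share within the row.
shares = pd.crosstab(
    index=[toy['Type'], toy['Vaccinated']],
    columns=toy['AdoptionSpeed'],
    normalize='index',
)
print(shares.round(2))
```

Normalizing by row makes groups of different sizes comparable, which raw counts in a count plot do not.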


#### Analysis of the free-text description:

* Tokenize and find the most frequent words
* Clean the words
* Bag of Words / TF-IDF
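
The text steps can be sketched with the standard library alone. The descriptions and the stop-word list below are made up, and the weighting is the plain unsmoothed TF-IDF formula, not the variant any particular library implements. Note that the `[a-z]+` tokenizer only keeps Latin letters, so it would discard the Chinese descriptions mentioned in the data fields.

```python
import math
import re
from collections import Counter

# Made-up descriptions standing in for the Description column.
descriptions = [
    "Friendly puppy, loves to play. Vaccinated and dewormed.",
    "Playful kitten looking for a loving home.",
    "Healthy puppy rescued from the street, needs a home.",
]

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "and", "the", "to", "for", "from"}


def tokenize(text):
    """Lowercase, keep alphabetic runs, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


docs = [tokenize(d) for d in descriptions]

# Bag of words: one term-count mapping per document.
bags = [Counter(doc) for doc in docs]

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for bag in bags for term in bag)
n_docs = len(docs)


def tf_idf(bag):
    """Unsmoothed TF-IDF: (count / document length) * log(N / document frequency)."""
    total = sum(bag.values())
    return {t: (c / total) * math.log(n_docs / doc_freq[t]) for t, c in bag.items()}


weights = [tf_idf(bag) for bag in bags]

# Most frequent words overall, then the highest-weighted terms of the first description.
print(Counter(w for doc in docs for w in doc).most_common(3))
print(sorted(weights[0].items(), key=lambda kv: -kv[1])[:3])
```

Terms that appear in several descriptions, such as "puppy", get a lower weight than terms unique to one description, which is the point of the IDF factor.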



In [None]:
train['Age'].value_counts().head(20)

In [None]:
train.Age[train['Age']<=100].plot(kind='hist')
plt.title('Distribution of pets age in Months');

In [None]:
sns.heatmap(train.isnull(), cbar=False)

In [None]:
train.isnull().sum()

### Understanding the impact of the fee charged


In [None]:
sns.histplot(data=train, x="Fee", bins=50)

In [None]:
# Create a new flag: 1 if the pet is offered for free (Fee == 0), 0 if a fee is charged
train['Adopcion_Gratuita'] = np.where(train['Fee'] == 0, 1, 0)
# Number of free listings
train.Fee[train.Fee == 0].value_counts()
#print(train.Fee.max())
#print(train.Fee.min())
#train['Fee'].value_counts(bins=50)

In [None]:
# Distribution of free vs. paid adoptions by animal type
g = sns.countplot(x='Adopcion_Gratuita', data=train, hue='Type');
plt.xticks(rotation='horizontal')
plt.yticks(fontsize='xx-large')
plt.title('Distribution of free vs. paid adoptions by animal type', fontsize='xx-large')
ax = g.axes
for p in ax.patches:
     ax.annotate(f"{p.get_height() * 100 / train.shape[0]:.2f}%", (p.get_x() + p.get_width() / 2., p.get_height()),
         ha='center', va='center', fontsize=11, color='gray', rotation=0, xytext=(0, 10),
         textcoords='offset points') 

In [None]:
# Distribution of free vs. paid adoptions by gender
g = sns.countplot(x='Adopcion_Gratuita', data=train, hue='Gender');
plt.xticks(rotation='horizontal')
plt.yticks(fontsize='xx-large')
plt.title('Distribution of free vs. paid adoptions by gender', fontsize='xx-large')
ax = g.axes
for p in ax.patches:
     ax.annotate(f"{p.get_height() * 100 / train.shape[0]:.2f}%", (p.get_x() + p.get_width() / 2., p.get_height()),
         ha='center', va='center', fontsize=11, color='gray', rotation=0, xytext=(0, 10),
         textcoords='offset points') 

* Study the available geographic regions


In [None]:
# Load the state labels and attach StateName to each row with a left join on the state ID
estados = pd.read_csv('../input/petfinder-adoption-prediction/StateLabels.csv')
#train.join(estados, lsuffix='State', rsuffix='StateID', how ='left')
train = train.merge(estados, how='left', left_on='State', right_on='StateID')

In [None]:
# Distribution of adoptions by region and animal type
g = sns.countplot(y='StateName', data=train, hue='Type');
plt.xticks(rotation='vertical')
plt.yticks(fontsize='xx-large')
plt.title('Distribution of adoptions by region and animal type', fontsize='xx-large')
#ax = g.axes
#for p in ax.patches:
#     ax.annotate(f"{p.get_width() * 100 / train.shape[0]:.2f}%", (p.get_x() + p.get_height() / 2., p.get_width()),
#         ha='center', va='center', fontsize=9, color='gray', rotation=0, xytext=(0, 10),
#         textcoords='offset points') 