# PetFinder.my Adoption Prediction: Exploratory Data Analysis

### David Mora Garrido, Bachelor Dissertation (1st part)

In [None]:
import json
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns

from collections import defaultdict
from tqdm import tqdm

!pip install wordcloud
import utils_tfg_pet_adoption_eda as utils

In [None]:
seed = 27912

First we load the training dataset, and also the .csv files with additional information for the categorical variables related to breeds and colors:

In [None]:
train = pd.read_csv('../input/petfinder-adoption-prediction/train/train.csv')
breeds = pd.read_csv('../input/petfinder-adoption-prediction/PetFinder-BreedLabels.csv')
colors = pd.read_csv('../input/petfinder-adoption-prediction/PetFinder-ColorLabels.csv')
states = pd.read_csv('../input/petfinder-adoption-prediction/PetFinder-StateLabels.csv')

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
train.info()

In [None]:
target = "AdoptionSpeed"

In [None]:
train.sample(10,random_state=seed)

Breed and Color names are in a different .csv, so we will use those names instead of their numerical IDs at least for the EDA. Let's take a look at the content of the breed .csv:

In [None]:
breeds.sample(5,random_state=seed)

`Type` tells us whether the pet is a dog (1) or a cat (2). As the main dataset already has a column with that information, we will ignore it when joining `BreedName` to the main dataset. We use a left join in order to keep those instances that may not have a value for `Breed1` or `Breed2`, and we keep only the names, not the IDs:

In [None]:
utils.left_join(train, breeds, left_on=["Breed1", "Breed2"], right_ID="BreedID", var_to_keep="BreedName")
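
The implementation of `utils.left_join` is not shown in this notebook. As a rough sketch of what such a join could look like, assuming it adds one `BreedName{i}` column per ID column, an ID-to-name lookup with `Series.map` behaves like a left join: IDs that are missing from the label table (such as 0) become `NaN`. The toy tables below are illustrative, not the real data.

```python
import pandas as pd

# Toy stand-ins for the breed label table and the main dataset (illustrative values).
toy_breeds = pd.DataFrame({"BreedID": [1, 2, 3],
                           "BreedName": ["Affenpinscher", "Poodle", "Tabby"]})
toy_train = pd.DataFrame({"Breed1": [2, 3, 0], "Breed2": [0, 2, 1]})

# ID -> name lookup; IDs absent from the lookup map to NaN, as in a left join.
id_to_name = toy_breeds.set_index("BreedID")["BreedName"]
for i, col in enumerate(["Breed1", "Breed2"], start=1):
    toy_train[f"BreedName{i}"] = toy_train[col].map(id_to_name)

print(toy_train)
```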

In [None]:
breeds_set = set(breeds["BreedName"].values) | {np.nan}
print("Number of instances whose BreedID was not specified in the breeds .csv:",
      train[(~train["BreedName1"].isin(breeds_set)) | (~train["BreedName2"].isin(breeds_set))].shape[0])

Now we do the same for `Color1`, `Color2` and `Color3`. First of all, let's see the number of unique values of color:

In [None]:
colors.value_counts()

We can see that there are 7 different colors. There is no color with ID 0, but the sample of the training set contained some 0s, so 0 is the value used for 'NaN' or a missing color.

At the moment, we will join the color labels with each instance for the 3 predicting variables:

In [None]:
utils.left_join(train, colors, left_on=["Color1", "Color2", "Color3"], right_ID="ColorID", var_to_keep="ColorName")

In [None]:
colors_set = set(colors["ColorName"].values) | {np.nan}
print("Number of instances whose ColorID was not specified in the colors .csv:",
      train[(~train["ColorName1"].isin(colors_set)) | (~train["ColorName2"].isin(colors_set)) |
            (~train["ColorName3"].isin(colors_set))].shape[0])

Now, as the majority of variables are categorical but are represented using numbers, we will create a dictionary for each categorical variable that maps each number (key) to its state (value). This way, at least for the EDA, it will be easier to visualize the data:

In [None]:
replace_dict = {
    'Type': {1: 'Dog', 2: 'Cat'},
    'Gender': {1: 'Male', 2: 'Female', 3: 'Mixed'},
    'MaturitySize': {1: 'Small', 2: 'Medium', 3: 'Large', 4: 'Extra Large', 0: 'Not Specified'},
    'FurLength': {1: 'Short', 2: 'Medium', 3: 'Long', 0: 'Not Specified'},
    'Vaccinated': {1: 'Yes', 2: 'No', 3: 'Not Sure'},
    'Dewormed': {1: 'Yes', 2: 'No', 3: 'Not Sure'},
    'Sterilized': {1: 'Yes', 2: 'No', 3: 'Not Sure'},
    'Health': {1: 'Healthy', 2: 'Minor Injury', 3: 'Serious Injury', 0: 'Not Specified'}
}

In [None]:
utils.replace_val_categorical(train, replace_dict)
display(train.sample(5, random_state=seed))

Finally, as the target variable is also a categorical one but is represented using numbers, we will just convert it to categorical in the `DataFrame`:

In [None]:
utils.to_categorical(train, [target])

Now we can proceed with the Exploratory Data Analysis on the base data we are given.

## Tabular data

### Target: AdoptionSpeed

We start by visualizing the target variable, `AdoptionSpeed`. According to the description of the dataset, this variable has 5 possible states:

* `0`: Pet was adopted on the same day as it was listed.
* `1`: Pet was adopted between 1 and 7 days (1st week) after being listed.
* `2`: Pet was adopted between 8 and 30 days (1st month) after being listed.
* `3`: Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
* `4`: No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).

Thus, this is a classification problem. Let's visualize its univariate distribution:

In [None]:
_, ax = plt.subplots(figsize=(10,6))
sns.countplot(x=target, data=train, ax=ax)
# Set the tick label size on both axes (Tick.label was removed in recent Matplotlib releases)
ax.tick_params(axis="both", which="major", labelsize=12)
ax.xaxis.label.set_size(16)
ax.xaxis.labelpad = 10
ax.yaxis.label.set_size(16)
ax.yaxis.labelpad = 10
plt.suptitle("Distribution of values of AdoptionSpeed", fontsize=18);

In relative terms:

In [None]:
train["AdoptionSpeed"].value_counts(normalize=True, dropna=False)

At first sight, we can say that the class is unbalanced, since state 0 represents only 2.7% of the data, while the other values are more or less balanced. We should take this into account, as we may need to oversample the data so that we have enough instances of class `0` to work with.
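
As an illustration of the oversampling idea, the following minimal sketch resamples every class with replacement up to the size of the largest class. The toy class counts are made up, and in practice oversampling should only be applied to the training folds, never to validation data.

```python
import pandas as pd

# Toy target with an under-represented class 0 (counts are illustrative).
toy = pd.DataFrame({"AdoptionSpeed": [0] * 2 + [1] * 10 + [2] * 12 + [3] * 11 + [4] * 13})

# Random oversampling: draw each class with replacement up to the largest class size.
target_size = toy["AdoptionSpeed"].value_counts().max()
parts = [group.sample(n=target_size, replace=True, random_state=0)
         for _, group in toy.groupby("AdoptionSpeed")]
balanced = pd.concat(parts, ignore_index=True)

print(balanced["AdoptionSpeed"].value_counts().sort_index())
```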

### Predicting variables

### Type

This variable indicates if the pet is a `Dog` or a `Cat` (originally noted as `1` or `2`, respectively). First of all, let's see the univariate distribution of values:

In [None]:
type_counts = train["Type"].value_counts(dropna=False)
type_counts

In [None]:
utils.plot_pieplot(type_counts.values, type_counts.index, title="Type")

There are more dogs than cats, but the difference is not too big. We also know that every instance has a type (there are no missing values). Now let's take a look at the distribution of values of the target variable for dogs and cats:

In [None]:
utils.plot_vert_barplot(train, "Type", target)

One interesting thing we can see here is that cats in the sample are adopted faster than dogs: the percentage of cats adopted within the first 30 days after being listed is higher, while the percentage adopted after more than 30 days is lower than for dogs. Dogs are also more likely than cats to end up not being adopted.

### Name

This is the name, if any, the pet was given. It could be interesting to see if the fact of not having a name could influence the adoption speed:

In [None]:
train["Name"].value_counts(dropna=False)

We can see that 1257 pets haven't got a name (`NaN` value), but the real number is a little higher, since other values also indicate a missing or unspecified name. Thus, we should group these values (the name is `NaN`, contains 'no name' or is shorter than 3 characters) and compare the effect of having a name or not on the adoption speed:

In [None]:
np.array(list(filter(lambda x: len(str(x)) < 3, train["Name"])))

In [None]:
# pd.isna is used instead of an identity check against np.nan, which is more robust
no_name_equivalents = set(filter(lambda x: pd.isna(x) or 'no name' in str(x).lower() or len(str(x)) < 3,
                                 train["Name"].unique()))
print("Number of pets without name:", train[train["Name"].isin(no_name_equivalents)].shape[0])

As we can see, the number of pets that either have no name or are listed with something that is not really a name is 1578, approximately 10.5% of the instances in the dataset, so the group is large enough to compare the effect of having a name or not on the target variable:

In [None]:
utils.plot_vert_barplot(train, "Name", target, nan_div=True, nan_eqs=no_name_equivalents)

Not having a name could somehow affect the adoption speed; at least, the percentage of pets that are not adopted after 100 days is 7% higher when the pet hasn't got a name. It could be a good idea to replace this variable with another one that indicates whether the pet has a name or not, at least to discard a variable with too many possible values that make no difference.
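
A minimal sketch of such an indicator follows, reusing the heuristic described above. The column name `HasName`, the helper `has_real_name` and the toy names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy names covering each case of the heuristic (illustrative values).
toy = pd.DataFrame({"Name": ["Nibble", np.nan, "No Name Yet", "Q", "Brisco & Friends"]})

def has_real_name(name) -> bool:
    """Return False when the name is NaN, contains 'no name' or is shorter than 3 characters."""
    if pd.isna(name):
        return False
    text = str(name)
    return "no name" not in text.lower() and len(text) >= 3

# Hypothetical binary feature that could replace the raw Name variable.
toy["HasName"] = toy["Name"].map(has_real_name)
print(toy)
```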

On the other hand, it may not make much sense to study the effect of having a longer or shorter name, as the `Name` length is probably proportional to the number of pets listed in the profile (more than one pet can be given up for adoption in a single profile), as we can see in some of the unique values of `Name` listed above. However, we will take a look at it just to check that this assumption holds:

In [None]:
train["NameLength"] = ''
train.loc[~train["Name"].isin(no_name_equivalents), ["NameLength"]] = train["Name"].astype(str)
train["NameLength"] = train["NameLength"].map(len)

In [None]:
utils.plot_histogram_and_density(data=train, x="NameLength", hue=target)

(the first plot is a 'stacked' histogram, so that the different values of the target don't overlap and we can see all of them, hence the value in the Count axis is actually the count of the aggregation of all values of the target variable)

In [None]:
utils.plot_boxplot(data=train, x="NameLength", y=target)

As we expected, the length of the `Name` variable in those cases that have an appropriate name does not seem to have any effect on the target variable. We could do the same for those instances where the `Name` variable is a single word, which in general will be profiles with a single pet, but let's check it first:

In [None]:
single_name = list(filter(lambda x: len(str(x).split()) == 1 and x not in no_name_equivalents, train["Name"]))
display(np.array(single_name[:50]))
single_name = set(single_name)

We can already see that there are names corresponding to several pets and not just one, for example 'Kitties'. Let's get those instances with a single name:

In [None]:
instances_single_name = train.loc[train["Name"].isin(single_name), ["Name", "Quantity", target]]

instances_single_name["NameLength"] = ''
instances_single_name["NameLength"] = instances_single_name["Name"].astype(str)
instances_single_name["NameLength"] = instances_single_name["NameLength"].map(len)

In [None]:
num_instances_single_name = instances_single_name.shape[0]
num_instances_single_name_qty_more_1 = instances_single_name[instances_single_name["Quantity"] > 1].shape[0]
num_instances_single_name_qty_1 = num_instances_single_name - num_instances_single_name_qty_more_1
num_instances_compound_name_qty_1 = train[(~train["Name"].isin(instances_single_name["Name"].values))
                                          & (~train["Name"].isin(no_name_equivalents))
                                          & (train["Quantity"] == 1)].shape[0]

print("Number of profiles with a single name:", num_instances_single_name)
print("Number of profiles with a single name corresponding to a single pet:", num_instances_single_name_qty_1)
print("Number of profiles with a single name but multiple pets:", num_instances_single_name_qty_more_1)
print("Number of profiles with a compound name but a single pet:", num_instances_compound_name_qty_1)

As we can see, a profile with a single word as `Name` will generally list a single pet. However, these profiles are only a sample of all the single-pet instances (plus, of course, some profiles listing multiple pets), as there are quite a few profiles with a compound `Name` (more than one word) but a single pet.

Thus, let's check the effect of the `Name` length when it is a single word, which mainly covers single-pet profiles, with some multiple-pet profiles:

In [None]:
utils.plot_histogram_and_density(data=instances_single_name, x="NameLength", hue=target,
                           additional_desc="when the name is a single word")

In [None]:
utils.plot_boxplot(data=instances_single_name, x="NameLength", y=target,
             additional_desc="when the name has a single word")

We can definitely discard the `NameLength` variable. Hence, we will stick to a variable which tells us whether the profile has a `Name` or not.

### Age

This is the first numeric variable that we find, so we can check its statistics:

In [None]:
train["Age"].describe()

The first thing that we can notice is that the maximum value could be interpreted as abnormal at first, but the reason is that `Age` is listed in months, not years. The standard deviation is also high compared to the mean. Let's see its distribution:

In [None]:
utils.plot_histogram_and_density(data=train, x="Age", hue=target)

As we can see, the distribution of `Age` is heavily skewed towards small values, in particular less than 12 (which implies that the majority are puppies or kittens). We can also see that there are peaks of instances at multiples of 12 (so the age was probably known in years and converted to months).

We can apply the natural logarithm to this variable in order to see whether it could fit a normal distribution and also to see more clearly the distribution of the target variable. As we cannot apply the logarithm to 0 (`Age` is less than one month), it is in fact the natural logarithm of 1 + the `Age` value:

In [None]:
train_copy = train[["Age", "Type", target]].copy(deep=True)
train_copy["Age"] = np.log(1 + train_copy["Age"])

utils.plot_histogram_and_density(data=train_copy, x="Age", hue=target, additional_desc="(logarithmic scale)")

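We can see that beyond approximately 1.75 or 1.8 on the logarithmic axis the most common outcome is 4, while below those values the outcomes 1, 2 and 3 are almost evenly distributed. Since the axis shows $\ln(1 + \text{Age})$, these values correspond to $e^{1.75} - 1 \approx 4.75$ and $e^{1.8} - 1 \approx 5.05$ months, that is, to pets older than about 5 months. The 0 outcome mainly seems to appear when the `Age` value is small.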
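
Because the plotted variable is $\ln(1 + \text{Age})$, a position on the logarithmic axis maps back to months through $e^{x} - 1$. The short check below uses NumPy's `expm1`, the inverse of `log1p`, on the two thresholds read from the plot.

```python
import numpy as np

# Thresholds read from the log-scale plot.
log_thresholds = np.array([1.75, 1.8])

# np.expm1(x) computes exp(x) - 1, undoing np.log1p(age) = log(1 + age).
ages_in_months = np.expm1(log_thresholds)
print(ages_in_months.round(2))

# Round trip: applying the forward transform recovers the thresholds.
round_trip = np.log1p(ages_in_months)
print(np.allclose(round_trip, log_thresholds))
```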


In [None]:
utils.plot_boxplot(data=train, x="Age", y=target)

An important factor for a pet not being adopted, or at least not soon after publication, could be its `Age`: the two previous plots show that from an `Age` of 6 months onwards the most common class is 4 (not adopted after 100 days since the listing). Thus, we could use this information to discretize `Age` if we need to.
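
One possible discretization is sketched below with `pd.cut`. Only the cut between 5 and 6 months follows from the observation above; the remaining bin edges and the labels are hypothetical choices for illustration.

```python
import pandas as pd

toy_age = pd.Series([0, 2, 5, 6, 12, 36, 120], name="Age")

# Hypothetical bin edges (right-inclusive): (-1, 5], (5, 12], (12, 300].
bins = [-1, 5, 12, 300]
labels = ["0-5 months", "6-12 months", "over 12 months"]
age_group = pd.cut(toy_age, bins=bins, labels=labels)

print(pd.concat([toy_age, age_group.rename("AgeGroup")], axis=1))
```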

Finally, we should distinguish between cats and dogs, as their life expectancy is different:

In [None]:
utils.plot_histogram_and_density(data=train_copy, x="Age", hue="Type")

In [None]:
utils.plot_boxplot(data=train, hue="Type", x="Age", y=target, figsize=(15,10))

We can see that overall the cats that are given up for adoption are younger than the dogs, and in the case of cats the `Age` is probably more relevant to determine the response of the target variable than in the case of dogs.

### Breed

In this case, there are two categorical variables: `BreedName1` and `BreedName2`. Let's see if they have missing values:

In [None]:
print(f"Number of pets missing BreedName1: {train['BreedName1'].isna().sum()}")
print(f"Number of pets missing BreedName2: {train['BreedName2'].isna().sum()}")

We can see that almost every pet has a value for `BreedName1`, so we can call it the primary breed, while more than two-thirds are missing `BreedName2`, which we will call the secondary breed. Thus, we should check whether each instance without a primary breed has a secondary breed, and also whether there are instances whose primary and secondary breeds are the same:

In [None]:
both_nan = train[(train["BreedName1"].isna() & train["BreedName2"].isna())]
print(f"Number of pets with unknown breed: {both_nan.shape[0]}")

same_breeds = train[(train["BreedName1"] == train["BreedName2"])]
print(f"Number of pets with the same primary and secondary breeds: {same_breeds.shape[0]}")

For every pet without a primary breed, there is a secondary breed. Thus, we could impute missing values using the secondary breed or, if there is no secondary breed either, we could use the most frequent value or even a classifier based on the color or on the support images and their metadata.

On the other hand, there are 1510 pets with the same primary and secondary breeds, so we could just use `BreedName1` given that we also saw that 10762 pets haven't been listed with a secondary breed.
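
Both ideas can be sketched on toy data: filling a missing primary breed from the secondary one with `fillna`, and flagging rows whose two breeds coincide. The column names `BreedName1_filled` and `SameBreeds` are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "BreedName1": [np.nan, "Poodle", "Tabby"],
    "BreedName2": ["Siamese", np.nan, "Tabby"],
})

# Use the secondary breed only where the primary breed is missing.
toy["BreedName1_filled"] = toy["BreedName1"].fillna(toy["BreedName2"])

# Rows where both breeds are present and identical carry no extra information in BreedName2.
toy["SameBreeds"] = toy["BreedName1"] == toy["BreedName2"]
print(toy)
```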

In [None]:
train["BreedName1"].value_counts(dropna=False)

In [None]:
train["BreedName2"].value_counts(dropna=False)

For both the primary breed and the secondary breed (when it is specified), the most common breeds are 'Mixed Breed', 'Domestic Short Hair' and 'Domestic Medium Hair'. However, each breed corresponds to a different type of animal (either cat or dog), so let's check the most common breeds by pet type:

In [None]:
count_breed_by_type = (train.groupby(["Type"])["BreedName1"]
                            .value_counts(dropna=False)
                            .rename('count')
                            .reset_index()
                            .sort_values(['count'], ascending=False))
domestic_x_hair = {"Domestic Short Hair", "Domestic Medium Hair", "Domestic Long Hair"}
display(count_breed_by_type[count_breed_by_type["BreedName1"] == "Mixed Breed"])
display(count_breed_by_type[count_breed_by_type["BreedName1"].isin(domestic_x_hair)])

As we can see, every 'Domestic {x} Hair' corresponds to a cat. Something similar happens with 'Mixed Breed', which corresponds to dogs, but in this case there are 4 instances that are cats. On the other hand, since none of these labels refers to a particular breed, we can say that all of these cases indicate that the pet does not have a 'pure' breed, or at least that it is not specified.

Let's check whether a cat can be given the 'Mixed Breed' label in this dataset by looking at the .csv which contains all the possible breeds:

In [None]:
breeds[breeds["BreedName"] == "Mixed Breed"]

As we can see, only dogs can be given the 'Mixed Breed' label for the `BreedName1` or `BreedName2` variables. Thus, the 4 previous instances of cats are anomalies: either they are not actually cats but they are labelled as cats, or they are cats with an incorrect breed.

It is not time yet to study the images that we are given, but let's see those 4 instances in order to see what actually happens:

In [None]:
utils.plot_images(train[(train["Type"] == "Cat") & (train["BreedName1"] == "Mixed Breed")], 1, 4, (20,4),
            "Cats that were given the 'Mixed Breed' label (anomalies)", subtitle_var="PetID")

As we can see, the 4 instances are cats, so their breed value is wrong. We will need to identify these cases and give them a correct value, either just imputing the most common value or for example checking whether the image metadata can give us some clues.

In [None]:
display(count_breed_by_type[count_breed_by_type["Type"] == "Dog"].head(5))
display(count_breed_by_type[count_breed_by_type["Type"] == "Cat"].head(5))

The most common primary 'pure' breeds of dogs are 'Labrador Retriever', 'Shih Tzu' (common in Asia) and 'Poodle', while those of cats are 'Tabby' and 'Siamese'. We can see that the vast majority of pets are those that cannot be considered to have a 'pure' breed.

Let's see the behaviour of the target variable depending on each possible breed:

In [None]:
utils.plot_hor_barplot(train, "BreedName1", target, percentage=False, title_y=0.89)

The primary breed could be a good variable, since the distribution of the target variable varies considerably among the different values of `BreedName1`. However, there are many breeds that do not have many instances (for example 'Affenpinscher', 'Chausie', etc.), so in this case we especially need additional information, or they could be grouped as 'rare breed' or something similar, for example for encoding purposes. The high cardinality of this variable and of `BreedName2` can be a problem.
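
The grouping of infrequent breeds could look like the following sketch. The threshold `min_count` and the label 'Rare Breed' are hypothetical choices, and the toy counts are illustrative.

```python
import pandas as pd

toy_breed = pd.Series(["Mixed Breed"] * 5 + ["Poodle"] * 3 + ["Affenpinscher", "Chausie"],
                      name="BreedName1")

min_count = 3  # hypothetical threshold below which a breed is considered rare
counts = toy_breed.value_counts()
rare = counts[counts < min_count].index

# Keep frequent breeds as they are and replace the rare ones with a single label.
grouped = toy_breed.where(~toy_breed.isin(rare), "Rare Breed")
print(grouped.value_counts())
```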

The secondary breed, `BreedName2`, has many missing or unspecified values. Let's see if the target variable has a different distribution of values depending on whether `BreedName2` is specified or not:

In [None]:
utils.plot_vert_barplot(train, "BreedName2", target, nan_div=True)

Whether `BreedName2` is specified or not does not seem to have a huge effect on the target variable: when it is specified, adoptions seem to happen a little later, or not at all, with a slightly higher probability. However, given that it also has high cardinality (not as high as `BreedName1`, but keeping both could probably be worse), we could just drop this variable for training (although we may need it in the preprocessing if the primary breed is missing and this one isn't).

However, there is some ambiguity when dealing with the breed. According to the documentation, the presence of a secondary breed indicates that the pet is mixed-breed, but the most repeated value of the primary breed (for dogs) is 'Mixed Breed'. To overcome that ambiguity we could create a new variable: when the primary breed is 'Mixed Breed', or it is not but a secondary breed is specified (or the secondary breed is itself 'Mixed Breed'), the pet hasn't got a 'pure breed'. When the only specified breed is 'Domestic {x} Hair' (cats), we could do the same or even use another value to indicate that we are not sure whether the pet has a 'pure breed' or not.

In [None]:
# A pet is considered not to have a 'pure' breed when any of its breeds is 'Mixed Breed'...
mixed_label = (train["BreedName1"] == "Mixed Breed") | (train["BreedName2"] == "Mixed Breed")
# ...or when two breeds are specified (identical primary and secondary breeds count as one breed)
two_breeds = (train["BreedName1"].notna() & train["BreedName2"].notna()
              & (train["BreedName1"] != train["BreedName2"]))
# For cats, the generic 'Domestic {x} Hair' labels are also treated as not 'pure'
domestic_cat = ((train["Type"] == "Cat")
                & (train["BreedName1"].isin(domestic_x_hair) | train["BreedName2"].isin(domestic_x_hair)))

train["PureBreed"] = "Yes"
train.loc[mixed_label | two_breeds | domestic_cat, "PureBreed"] = "No"

In [None]:
utils.plot_vert_barplot(train, "PureBreed", target)

As we can see, pets that have a pure breed or that are listed with a single unequivocal breed are adopted earlier than those that are 'Mixed Breed' or 'Domestic {x} Hair', and it is less probable that they end up not being adopted. We haven't specified the 'Not Sure' value for those cats with 'Domestic {x} Hair' values, as their distribution would be very similar to that of the entire portion of the dataset that are cats.

Finally, we need to tackle the fact that `BreedName1` has a very high cardinality. In this dataset it has 176 different values, but the breeds .csv file has 307 different values, so if we represented it using one-hot encoding we would add 307 columns to the dataset, increasing the dimensionality of the data too much given the relatively small number of instances. On the other hand, we shouldn't use ordinal encoding, because this is a purely categorical variable and we would be introducing an order relationship between values that are not related. Moreover, mean encoding could lead to overfitting, as there are many breeds that appear just once or twice.

Thus, a solution could be using frequency encoding, that is, replacing each breed by the number of times it appears divided by the total number of instances. Moreover, it makes sense as we would actually create a variable telling us whether that breed is common or rare. However, we should take into account that maybe the most common breeds of cats do not have as high values as the most common breed of dogs due to the fact that there are more dogs than cats in the dataset. Thus, we need to discriminate by `Type`:

In [None]:
freq_breed_1 = (train.groupby(["Type"])["BreedName1"]
                .value_counts(normalize=True)
                .reset_index(name='Breed1_freq_encode')
                .sort_values(["Type", "Breed1_freq_encode"], ascending=False))
print(freq_breed_1[freq_breed_1["Type"] == "Dog"])
print(freq_breed_1[freq_breed_1["Type"] == "Cat"])

In [None]:
freq_breed_1 = train.groupby(["Type"])["BreedName1"].value_counts(normalize=True).reset_index(name='Breed1_freq_encode')
train.loc[:, "Breed1_freq_encode"] = train.merge(freq_breed_1, how="left", left_on=["Type", "BreedName1"],
                                                 right_on=["Type", "BreedName1"])['Breed1_freq_encode']
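
To make the per-`Type` frequency encoding concrete, here is a small worked example on toy data. It is an alternative sketch based on `groupby(...).transform("count")`, not the code used above, and the toy rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Type": ["Dog", "Dog", "Dog", "Dog", "Cat", "Cat"],
    "BreedName1": ["Mixed Breed", "Mixed Breed", "Mixed Breed", "Poodle", "Tabby", "Siamese"],
})

# Number of rows sharing the same (Type, BreedName1) pair.
pair_counts = toy.groupby(["Type", "BreedName1"])["Type"].transform("count")
# Number of rows of the same Type with a known breed.
type_counts = toy.groupby("Type")["BreedName1"].transform("count")

# Relative frequency of the breed within its own pet type.
toy["Breed1_freq_encode"] = pair_counts / type_counts
print(toy)
```

Here 'Mixed Breed' gets 3/4 among dogs, while each cat breed gets 1/2, so frequencies are comparable within each type regardless of how many dogs and cats there are.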

In [None]:
utils.plot_histogram_and_density(data=train, x="Breed1_freq_encode", hue=target)

As we can see, it seems that the probability of ending up not being adopted (4) increases (compared to the other values) as the frequency of the primary breed increases. Moreover, the probability of ending up being adopted early, 0 or 1, is higher compared to the other target's values when the breed frequency encoding is closer to 0.0.

### Gender

This is a categorical variable with 3 possible values: 'Male', 'Female' or 'Mixed' (previously 1, 2 or 3, respectively). 'Mixed' means that the instance actually represents more than one pet listed in one adoption profile (so we assume that they can only be adopted as a 'pack' and not individually) and that there is at least one Male and one Female. We should take this into account; nevertheless, we will look at it when studying the variable `Quantity`.

In [None]:
gender_counts = train["Gender"].value_counts(dropna=False)
gender_counts

In [None]:
utils.plot_pieplot(gender_counts.values, gender_counts.index, "Gender")

Almost half of the listed pets are Female, while the Mixed condition represents 14.5% of the total instances. Let's see the relative distribution of the class for each of the previous values:

In [None]:
utils.plot_vert_barplot(train, "Gender", target)

From this plot we can draw two early conclusions:

1. Male pets seem to be adopted earlier than Female pets, and the percentage of Female pets that are not adopted after 100 days is also higher than that of Male pets.
2. The percentage of Mixed profiles that are not adopted after 100 days is higher than that of individual profiles. However, we need to find whether this is due to the fact that Mixed indicates that the profile lists several pets (more than one), or due to the actual meaning of this value, that is, those pets haven't got the same gender. We will check this when studying the `Quantity` variable.
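
One way to separate the two explanations in the second point is to cross-tabulate `Gender` against whether the profile lists more than one pet. The sketch below uses made-up rows and a hypothetical `MultiplePets` column, so it only illustrates the mechanics.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Mixed", "Mixed", "Female", "Male"],
    "Quantity": [1, 1, 2, 5, 3, 1],
})

# Hypothetical flag: does the profile list more than one pet?
toy["MultiplePets"] = toy["Quantity"] > 1

# Counts of single-pet and multi-pet profiles for each gender value.
table = pd.crosstab(toy["Gender"], toy["MultiplePets"])
print(table)
```

Multi-pet profiles of a single gender (such as the Female row with `Quantity` 3) are the ones that allow comparing the effect of `Quantity` independently of the Mixed label.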

### Color

As the pets may have different fur colors, there are 3 variables: `ColorName1`, `ColorName2` and `ColorName3` (previously indicated with the color ID, 'Color1', 'Color2' and 'Color3'). Let's see the number of values of each one:

In [None]:
train["ColorName1"].value_counts(dropna=False)

The vast majority of primary colors are Black and Brown.

In [None]:
train["ColorName2"].value_counts(dropna=False)

In [None]:
train["ColorName3"].value_counts(dropna=False)

As we can see, every instance has a primary color, but may not have a secondary color or a third color. Thus, let's see first if there are some inconsistencies and also the number of pets which only have one fur color:

In [None]:
inconsistencies = train[(train["ColorName2"].isna() & train["ColorName3"].notna())]
instances_one_color = train[(train["ColorName2"].isna() & train["ColorName3"].isna())]

print("Number of instances without a second color but with a third color (inconsistencies):", inconsistencies.shape[0])
print("Number of instances with just one (primary) color:", instances_one_color.shape[0])

There are no instances without a second color but with a third color. Consequently, the instances with just one fur color are exactly those missing the secondary color (4471), since none of them has a third color either.

Let's see the relative distribution of the class for each value of primary color:

In [None]:
utils.plot_vert_barplot(train, "ColorName1", target, display_numbers=False)

In general, pets with less usual colors (those that are not Black or Brown) seem to be adopted earlier, with the exception of Yellow.

It does not make much sense to see the effect of the secondary and third colors by themselves on the target variable without knowing the primary color, so maybe we should check the combination of primary, secondary and third colors:

In [None]:
train_copy = train.copy(deep=True)
train_copy["ColorName123"] = train_copy["ColorName1"].astype(str) + \
                train_copy["ColorName2"].replace({np.nan: ''}).astype(str) + \
                train_copy["ColorName3"].replace({np.nan: ''}).astype(str)
display(train_copy["ColorName123"])

In [None]:
counts = utils.plot_hor_barplot(train_copy, "ColorName123", target, percentage=False,
                                figsize=(15,40), return_counts=True, title_y=0.9)

In [None]:
counts.groupby(['ColorName123'])['count'].sum().reset_index().sort_values(['count'], ascending=False)

There are some differences in the distribution of the class given the different combination of all the colors, but the biggest differences correspond to combinations with few instances, so they may not be very representative.

Finally, as we know that the pets without value for `ColorName2` are the ones that have just one fur color, let's see if there is a difference in the target variable when this condition is met:

In [None]:
utils.plot_vert_barplot(train, "ColorName2", target, nan_div=True)

The difference between having more than one fur color or not is not very significant when the pets end up being adopted, but it seems that the percentage of pets that are not adopted after 100 days is a little bit higher when the pet only has one fur color.

### MaturitySize

This variable indicates the size of the pet when it reaches maturity (which may differ from its size when the profile is published). It has 5 possible values: the sizes 'Small', 'Medium', 'Large' and 'Extra Large', plus 'Not Specified' (previously 1, 2, 3, 4 and 0, respectively). We should probably differentiate between cats and dogs, as size is relative to each type of pet:

In [None]:
train.groupby(["Type"])["MaturitySize"].value_counts(normalize=True, dropna=False)

The majority of instances have a Medium or Small maturity size, while Extra Large is a rare value, with few occurrences. Let's see the distribution of the class given the maturity size for dogs:

In [None]:
utils.plot_vert_barplot(train[train["Type"] == 'Dog'], "MaturitySize", target, additional_desc="for dogs")

It is interesting to see that Medium dogs don't seem to be adopted as early as Large or Small dogs, maybe simply because of people's preferences or because Medium is the most common size. Let's see now what happens with cats:

In [None]:
utils.plot_vert_barplot(train[train["Type"] == 'Cat'], "MaturitySize", target, additional_desc="for cats")

It seems that Small cats are adopted earlier than Medium or Large cats.

### FurLength

This is a variable with 4 possible values: 'Short', 'Medium', 'Long' or 'Not Specified' (previously 1, 2, 3 or 0, respectively). It could be used as a categorical or as a numerical variable, as its values follow an obvious order; for example, we can say 'Medium' < 'Long' (maybe the only drawback of using it as numerical is where to place the value 'Not Specified' in that order). Let's see how many instances have each value:

In [None]:
fur_length_counts = train["FurLength"].value_counts(dropna=False)
fur_length_counts

We can see that there are no unspecified values, so this variable could definitely be used as numerical.
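
A minimal sketch of that numerical use, assuming the order 'Short' < 'Medium' < 'Long': an ordered `pd.Categorical` yields integer codes, which we shift by one to recover the original 1, 2, 3 coding. The name `FurLengthOrdinal` is hypothetical.

```python
import pandas as pd

toy_fur = pd.Series(["Short", "Long", "Medium", "Short"], name="FurLength")

# Explicit category order; the codes are 0, 1, 2 in that order.
fur_order = ["Short", "Medium", "Long"]
fur_cat = pd.Categorical(toy_fur, categories=fur_order, ordered=True)

# Shift the codes by one to match the original 1/2/3 encoding.
fur_numeric = pd.Series(fur_cat.codes + 1, name="FurLengthOrdinal")
print(fur_numeric.tolist())
```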

In [None]:
utils.plot_pieplot(fur_length_counts.values, fur_length_counts.index, "FurLength")

More than half of the instances in the dataset have Short fur, while Long fur is a rather unusual value.

In [None]:
utils.plot_vert_barplot(train, "FurLength", target)

It seems that pets with Long fur are adopted considerably earlier than those with Medium or Short fur. However, this difference is not as representative as we would like, since only 5.5% of the instances have Long fur.

As we saw that many pets are labeled as 'Domestic Medium Hair', 'Domestic Short Hair' or 'Domestic Long Hair' in the variables `BreedName1` or `BreedName2`, which could be redundant given this variable, let's see if they actually match with the corresponding values of `FurLength`:

In [None]:
breed_names = ['Domestic Short Hair', 'Domestic Medium Hair', 'Domestic Long Hair']
fur_lengths = ['Short', 'Medium', 'Long']

inconsistencies = 0
for breed in breed_names:
    for fur_length in fur_lengths:
        num_occurrences = train[(train["BreedName1"] == breed) & (train["FurLength"] == fur_length)].shape[0]
        print(f"Breed: {breed} - Fur length: {fur_length} --> {num_occurrences} occurrences")
        if fur_length not in breed:
            inconsistencies += num_occurrences
            
print("\nNumber of inconsistencies:", inconsistencies)

There are 644 instances whose breed is not correctly labeled according to their fur length. Consequently, we may have to modify the breed so that it matches the fur length (or, if the secondary breed is specified and it is not inconsistent, we could use it as the primary breed), modify `FurLength`, or even add a variable that specifies whether the fur length and the breed match in those cases.

Let's check the last approach:

In [None]:
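# Flag rows whose 'Domestic {x} Hair' primary breed disagrees with FurLength.
# `breed_names` is the list of 'Domestic {x} Hair' breeds defined in the previous cell.
indexes_fur_length_inconsistencies = []
for index, row in train.iterrows():
    fur_length = str(row["FurLength"]).lower()
    breed_name = str(row["BreedName1"])
    if breed_name in breed_names and fur_length not in breed_name.lower():
        indexes_fur_length_inconsistencies.append(index)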

train["BreedMatchesFurLength"] = "Yes"
train.loc[indexes_fur_length_inconsistencies, "BreedMatchesFurLength"] = "No"

These anomalies can only happen when the pet is a cat, since the 'Domestic {x} Hair' breeds always correspond to cats, so let's see the distribution of the target variable for cats depending on whether the breed matches the fur length or not:

In [None]:
utils.plot_vert_barplot(train[train["Type"] == "Cat"], "BreedMatchesFurLength", target, additional_desc="(cats)")

It seems that the probability for the cat to end up not being adopted could slightly increase if the specified breed does not match the `FurLength`, even though this could just be due to randomness.
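
One way to probe whether such a gap could be due to randomness is a permutation test. The following is only a sketch on hypothetical toy arrays (not the real data), where `not_adopted` would be 1 when `AdoptionSpeed` is 4 and `mismatch` would mark the cats whose breed does not match the fur length:

```python
import numpy as np

rng = np.random.default_rng(27912)

# Hypothetical toy data: 1 = not adopted after 100 days, 0 = adopted.
not_adopted = np.array([1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1])
mismatch = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0], dtype=bool)

def rate_gap(outcome, flag):
    """Difference in non-adoption rate between flagged and unflagged rows."""
    return outcome[flag].mean() - outcome[~flag].mean()

observed = rate_gap(not_adopted, mismatch)

# Shuffle the flag many times to see how large a gap arises by chance alone.
n_perm = 2000
perm_gaps = np.array(
    [rate_gap(not_adopted, rng.permutation(mismatch)) for _ in range(n_perm)]
)

# Two-sided p-value: share of shuffles with a gap at least as extreme.
p_value = np.mean(np.abs(perm_gaps) >= abs(observed) - 1e-12)
print(f"observed gap = {observed:.3f}, permutation p-value = {p_value:.3f}")
```

A large p-value on the real data would support the reading that the difference is mostly noise; a small one would suggest that the flag carries some signal.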

### Vaccinated

This is a categorical variable with 3 possible values: 'Yes', 'No' or 'Not Sure' (originally 1, 2 or 3, respectively).

In [None]:
vaccinated_counts = train["Vaccinated"].value_counts(dropna=False)
vaccinated_counts

In [None]:
utils.plot_pieplot(vaccinated_counts.values, vaccinated_counts.index, "Vaccinated")

In [None]:
utils.plot_vert_barplot(train, "Vaccinated", target)

It seems that people tend to adopt non-vaccinated pets earlier than those already vaccinated, which is actually surprising, since we could think that vaccination is an extra expense for the person adopting the pet. Moreover, it seems that the uncertainty about whether the pet is vaccinated could be a reason for people not to adopt it.

### Dewormed

This is also a categorical variable which has the same possible values as `Vaccinated`: 'Yes' (1), 'No' (2), 'Not Sure' (3). Also, the a priori hypothesis is the same one that we had with `Vaccinated`: if the pet is not dewormed, it would imply extra expenses, so it may be a factor not to adopt a pet. Let's see:

In [None]:
dewormed_counts = train["Dewormed"].value_counts(dropna=False)
dewormed_counts

In [None]:
utils.plot_pieplot(dewormed_counts.values, dewormed_counts.index, "Dewormed")

The distribution of values of `Dewormed` is not very different from that of `Vaccinated`; maybe their values usually coincide.

In [None]:
utils.plot_vert_barplot(train, "Dewormed", target)

This is interesting, for two reasons:

1. Pets that are not dewormed still seem to be adopted earlier than pets that have been dewormed, but the difference is not as pronounced as it was for `Vaccinated`.
2. The main information that we can extract here is that not knowing whether the pet is dewormed (uncertainty) increases the probability that the pet is not adopted after 100 days from the profile publication.

If we compare `Vaccinated` and `Dewormed`, their value distributions are more or less similar, but the gap between the target distributions for 'Yes' and 'No' differs between the two variables, so it is worth keeping both.

In fact, we can check how the combination of values affects the target variable in our sample:

In [None]:
vacc_dew = train[["Vaccinated", "Dewormed", "AdoptionSpeed"]].copy(deep=True)
vacc_dew["Vaccinated_Dewormed"] = vacc_dew["Vaccinated"].astype(str) + '_' + vacc_dew["Dewormed"].astype(str)
vacc_dew.drop(["Vaccinated", "Dewormed"], axis=1, inplace=True)
display(vacc_dew["Vaccinated_Dewormed"])

In [None]:
vacc_dew_counts = vacc_dew["Vaccinated_Dewormed"].value_counts(dropna=False)
vacc_dew_counts

As we can see, the most usual cases are those in which the values of `Vaccinated` and `Dewormed` coincide. In the case that they don't coincide, it is probable that the pet is not `Vaccinated` but it is `Dewormed`.
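
The same comparison can be read from a contingency table. This is a small sketch on hypothetical toy rows rather than the real `train` data:

```python
import pandas as pd

# Hypothetical toy rows standing in for train[["Vaccinated", "Dewormed"]].
toy = pd.DataFrame({
    "Vaccinated": ["Yes", "No", "No", "Not Sure", "No", "Yes"],
    "Dewormed":   ["Yes", "No", "Yes", "Not Sure", "Yes", "Yes"],
})

# Rows: Vaccinated, columns: Dewormed; the diagonal holds the matching cases.
table = pd.crosstab(toy["Vaccinated"], toy["Dewormed"])
print(table)

# Share of rows where both variables take the same value.
agreement = (toy["Vaccinated"] == toy["Dewormed"]).mean()
print(f"Agreement rate: {agreement:.2f}")
```

Off-diagonal cells show directly which kind of disagreement dominates, without having to parse the concatenated labels.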

In [None]:
utils.plot_vert_barplot(vacc_dew, "Vaccinated_Dewormed", target, figsize=(18,6), display_numbers=False)

The important thing to note here is that the distribution of values of the target variable is different for each combination (even if some of them are not very reliable due to the small number of instances with those combinations). In particular, the effect of not being `Vaccinated` and not being `Dewormed` is more similar to not being `Vaccinated` but being `Dewormed` on the target variable than to being both `Vaccinated` and `Dewormed`, so maybe `Vaccinated` provides more information than `Dewormed`.

However, we should take into account that pets with 0 or 1 month of `Age` cannot be vaccinated (https://pets.webmd.com/pet-vaccines-schedules-cats-dogs#1). We can take a look also at the relationship between the `Age` and being `Vaccinated` (just for small values):

In [None]:
train_copy = train[train["Age"] < 40]
utils.plot_histogram_and_density(data=train_copy, x="Age", hue="Vaccinated",
                           additional_desc="(Age < 40)")

In [None]:
utils.plot_boxplot(data=train_copy, x="Age", y="Vaccinated", additional_desc="(Age < 40)")

We can see that there is some degree of correlation between `Age` and `Vaccinated`: in general, the pets that are not `Vaccinated` are the young ones. We can also see that the interquartile range and the median of `Age` when `Vaccinated` is 'Not Sure' are higher than those of 'No'. If the pet is 0 months old (i.e., 4 weeks or less) and we don't know whether it is `Vaccinated` or not, it is very probable that it is not.
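
The median and interquartile range read from the boxplot can also be computed explicitly. A minimal sketch, using hypothetical toy rows in place of `train_copy`:

```python
import pandas as pd

# Hypothetical toy rows; in the notebook this would be train_copy[["Age", "Vaccinated"]].
toy = pd.DataFrame({
    "Age": [1, 2, 3, 12, 24, 36, 2, 6, 10],
    "Vaccinated": ["No", "No", "No", "Yes", "Yes", "Yes",
                   "Not Sure", "Not Sure", "Not Sure"],
})

# Quartiles of Age within each Vaccinated group, one column per quantile.
q = toy.groupby("Vaccinated")["Age"].quantile([0.25, 0.5, 0.75]).unstack()
summary = pd.DataFrame({"median": q[0.5], "IQR": q[0.75] - q[0.25]})
print(summary)
```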

### Sterilized

This is another 'Yes' (1), 'No' (2) or 'Not Sure' (3) categorical variable which, given its meaning (if we see it in terms of extra expenses of adoption) and the fact that it is also related to the health, we expect it (or the behaviour of the target variable given this one) to be similar to `Vaccinated` and `Dewormed`. Let's see:

In [None]:
sterilized_counts = train["Sterilized"].value_counts(dropna=False)
sterilized_counts

In [None]:
utils.plot_pieplot(sterilized_counts.values, sterilized_counts.index, "Sterilized")

We can see a big difference here compared to `Vaccinated` and `Dewormed`: the number of 'No' values is now bigger than the number of 'Yes' values, and the proportion is not merely inverted with respect to the previous variables but quite different. Nevertheless, 'Not Sure' still represents (approximately) the same proportion of values.

In [None]:
utils.plot_vert_barplot(train, "Sterilized", target)

Now, we can see that there are also big differences in the distribution of the target variable. First of all, we saw with the two previous variables that when the value is 'No' the adoption takes place earlier than when it is 'Yes', but in this case the effect is even more drastic, both in terms of 'earliness' and in the fact that the percentage of pets that are not adopted after 100 days is considerably lower when it is 'No' than when it is 'Yes'. It seems that adopters in Malaysia do care about whether the pet is sterilized or not, probably because they don't want to rule out their pet having offspring in the future. The 'Not Sure' distribution is more or less similar to those of `Vaccinated` and `Dewormed`, but the percentage of pets that are not adopted after 100 days is also higher.

### Health

This is a categorical variable which can be treated as numerical, given its possible values: 'Healthy' (1), 'Minor Injury' (2), 'Serious Injury' (3) or 'Not Specified' (0). There is an order among the values: confusing 'Healthy' with 'Serious Injury' is not the same as confusing 'Minor Injury' with 'Serious Injury', so predicting 1 when the true value is 3 would be a bigger mistake than predicting 2. Let's see the number of instances for each value and the distribution of the target variable:

In [None]:
train["Health"].value_counts(dropna=False)

The vast majority of instances are 'Healthy', while the number of injured pets is relatively small, especially those with serious injuries, which may have a lower life expectancy and would imply extra expenses for their care. Moreover, there are no unspecified values in the training set.

In [None]:
utils.plot_vert_barplot(train, "Health", target)

As we would expect, the percentage of pets that are not adopted after 100 days is higher when they are injured, especially when it is a serious injury. The difference in the percentages of the remaining values of the target variable may not be that big between 'Healthy' and 'Minor Injury', but it still seems that healthy pets are adopted earlier. In the case of 'Serious Injury', even if the number of instances (34) is too low to draw conclusions, the result is what we expected.

Let's see the `Age` and `AdoptionSpeed` distribution in the case of injured pets:

In [None]:
utils.plot_histogram_and_density(data=train[train["Health"] != 'Healthy'], x="Age", hue=target,
                          additional_desc="when pets are injured")

We can see that the distribution of `Age` is similar to the initial one, but it does not span more than 200 months (there are some cases around 140, but not beyond), so if the pet is older, it is less likely that it is injured.

### Quantity

This is a numerical variable which indicates how many pets are listed in the profile (it is possible to adopt more than one pet in the same operation, for example a whole litter). Thus, if adopting a profile implies adopting every pet listed in it, this variable may influence the target variable:

In [None]:
train["Quantity"].value_counts(dropna=False)

We can see that more than 75% of the profiles list a single pet, and values of 6 or 7 and greater can be considered unusual. If we try to plot the histogram of this variable against the target variable, it is impossible to even perceive the unusual values, so we can discretize this variable into three intervals: $\{1\}$, $[2,6]$, $[7,\infty)$ (maybe not $\infty$):

In [None]:
conditions = [train["Quantity"] == 1, (train["Quantity"] > 1) & (train["Quantity"] < 7), train["Quantity"] >= 7]
subplots_titles = ["1 pet", "Between 2 and 6 pets", "7 or more pets"]
yticks = [0,5,10,15,20,25,30,35,40]
utils.plot_relative_countplot(data=train, x=target, conditions=conditions, 
                        title=f"Comparison of the outcome of {target} given different range of Quantity values",
                        subplots_titles=subplots_titles, yticks=yticks)

We can see that the percentage of profiles that are not adopted after 100 days from publication is lower than in the aggregated distribution when there is only 1 pet, higher when there are between 2 and 6 pets, and higher still when there are 7 or more pets. Actually, the difference between profiles with a single pet and profiles with 2 to 6 pets is not that large (only significant when the target variable is 4), but we can see that profiles with 2 to 6 pets are adopted earlier than those with 7 or more pets. Thus, this discretization could be useful if we needed it.
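
If the discretization were needed as an actual column, it could be sketched with `pd.cut`. The toy values and the label names below are hypothetical choices, not part of the original analysis:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for train["Quantity"].
quantity = pd.Series([1, 2, 6, 7, 20, 1], name="Quantity")

# Right-closed bins (0, 1], (1, 6], (6, inf): for integers, 1, [2, 6] and [7, inf).
quantity_group = pd.cut(
    quantity,
    bins=[0, 1, 6, np.inf],
    labels=["1", "2-6", "7+"],  # label names are an arbitrary choice
)
print(quantity_group.tolist())
```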

Moreover, we saw that the `Gender` variable may have as value 'Mixed' when there is more than one pet given up for adoption and at least one is male and another one is female. We can compare these cases with those where there are several pets but all of them are male or female:

In [None]:
several_same_gender = train[(train["Quantity"] > 1) & (train["Gender"] != "Mixed")]
num_several_same_gender = several_same_gender.shape[0]
print("Number of profiles with more than one pet with the same gender:", num_several_same_gender)

several_different_gender = train[(train["Quantity"] > 1) & (train["Gender"] == "Mixed")]
num_several_different_gender = several_different_gender.shape[0]
print("Number of profiles with more than one pet with different genders:", num_several_different_gender)

data = [several_same_gender, several_different_gender]
conditions = []
subplots_titles = ["Several pets, same gender",
                   "Several pets, different gender"]
yticks = [0,5,10,15,20,25,30,35,40]
utils.plot_relative_countplot(data=data, x=target, conditions=conditions, figsize=(15,6),
                        title=f"Comparison of the outcome of {target} given the Gender when Quantity is greater than 1",
                        subplots_titles=subplots_titles, yticks=yticks, data_already_given=True)

It seems that when the groups of pets have the same gender, the probability for them to end up not being adopted after 100 days of the profile publication is higher than those groups of pets that have different gender ('Mixed'), but the latter ones are not necessarily adopted earlier.

In [None]:
qty_more_1_type_counts = train[train["Quantity"] > 1]["Type"].value_counts()
utils.plot_pieplot(qty_more_1_type_counts.values, qty_more_1_type_counts.index, "Type when Quantity > 1")

Finally, it seems that the original proportion of each `Type` of pet is reversed when the `Quantity` of pets given up for adoption is more than 1, so it is more usual to give up several cats than several dogs for adoption. Thus, the proportion of dogs when `Quantity` is 1 will be greater than in the combined sample:

In [None]:
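# Name the index and the counts explicitly so the column names do not depend on the pandas version.
qty_1_type_counts = (train[train["Quantity"] == 1]["Type"].value_counts()
                     .rename_axis('Type').reset_index(name='count').sort_values(['count']))
utils.plot_pieplot(qty_1_type_counts['count'], qty_1_type_counts['Type'], "Type when Quantity is 1")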

### Fee

This is another numerical variable which indicates the amount of money we have to pay in order to adopt the pet. It is not clear which currency is used, but it is not relevant as long as we can see whether paying more money has an effect on the target variable. We are particularly interested in the comparison between a free adoption (0) and adoptions that are not free.

In [None]:
train["Fee"].describe()

We can see that the vast majority of instances are free to adopt, as all the quartiles are 0. Let's check how many are zero and non-zero:

In [None]:
fee_zero = train.loc[(train["Fee"] == 0),["Fee", "AdoptionSpeed"]]
fee_non_zero = train.loc[(train["Fee"] > 0),["Fee", "AdoptionSpeed"]]

print("Number of instances that were free to adopt:", fee_zero.shape[0])
print("Number of instances that were not free to adopt:", fee_non_zero.shape[0])

There are sufficient non-zero cases to determine the effect of this variable, so let's check the distribution of the target variable when `Fee` is zero and greater than zero:

In [None]:
print("Proportion of each outcome when the adoption is free:")
display(fee_zero[target].value_counts(normalize=True))
print("Proportion of each outcome when the adoption is not free:")
display(fee_non_zero[target].value_counts(normalize=True))

In [None]:
data = [fee_zero, fee_non_zero]
subplots_titles = ["Free adoption", "Non-free adoption"]
yticks = [0,5,10,15,20,25,30,35]
utils.plot_relative_countplot(data=data, x=target, conditions=[], 
                        title=f"Comparison of the outcome of {target} given the adoption Fee",
                        subplots_titles=subplots_titles, yticks=yticks, data_already_given=True)

It's interesting to see that the distribution is very similar: the only difference between free and non-free adoption in our training sample is that the percentage of pets that are not adopted after 100 days is slightly higher in the latter case (about 3.5%). Let's check the distribution of values for both `Fee` and `AdoptionSpeed` when the adoption is not free:

In [None]:
utils.plot_histogram_and_density(data=fee_non_zero, x="Fee", hue=target)

We can see that the distribution is far from being normal, and even more if we depicted the `Fee` = 0 cases. Thus, as the distribution of `Fee` is skewed, let's apply the natural logarithm:

In [None]:
fee_non_zero_log = fee_non_zero.copy(deep=True)
fee_non_zero_log["Fee"] = np.log(fee_non_zero_log["Fee"])
utils.plot_histogram_and_density(data=fee_non_zero_log, x="Fee", hue=target)

If we take a look at the distribution of the class at the different peaks of values, we can see that the distribution is more or less maintained, at least for those groups of values that we can see in the plot, so maybe we don't need to discretize or further divide this interval of values (just leaving it as zero or non-zero if we need to discretize).

In [None]:
utils.plot_boxplot(data=fee_non_zero, x="Fee", y=target, additional_desc="(Fee > 0)")

It's interesting to see that the cases that ended up not being adopted are more concentrated at lower `Fee` values compared to the pets that were adopted (in fact, the 6 profiles with the highest `Fee` all ended up being adopted), but as we saw in the previous plot, there isn't a single "clear" cut or discretization of `Fee` values.

Let's check also the `Fee` when we discriminate the instances by the `Type`:

In [None]:
fee_non_zero_w_type = train.loc[train["Fee"] > 0, ["Fee", "Type", target]]
print(f'Mean Fee for each type of pet:\n{fee_non_zero_w_type.groupby(["Type"])["Fee"].mean()}')

In [None]:
utils.plot_boxplot(data=fee_non_zero_w_type, x="Fee", y=target, hue="Type", additional_desc="(Fee > 0)")

It also seems that when dogs are given up for adoption and there is a fee, it is usually higher than that for cats. The fee does not really seem to be a problem in the case of dogs, and in fact it seems that dogs with higher `Fee` values were adopted earlier. On the other hand, the `Fee` value in the case of cats seems to negatively affect the `AdoptionSpeed`. Thus, it could be interesting to keep this variable's values as they are if we don't really need to discretize.
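
Both representations discussed in this section (free versus non-free, and the log-transformed amount) can coexist as columns. A minimal sketch on hypothetical toy values; `IsFree` and `FeeLog1p` are made-up column names:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for train["Fee"].
fees = pd.DataFrame({"Fee": [0, 0, 50, 300, 0, 1000]})

# Binary view: is the adoption free or not?
fees["IsFree"] = (fees["Fee"] == 0).astype(int)

# Continuous view: log(1 + Fee) keeps the zeros defined, unlike a plain log.
fees["FeeLog1p"] = np.log1p(fees["Fee"])
print(fees)
```

Using `np.log1p` instead of `np.log` avoids having to filter out the free adoptions before transforming.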

### State?

This is a categorical variable that indicates the State in Malaysia in which the pet was sheltered or put up for adoption.

There is one problem with this variable: if our model uses it, it can only be used to predict the adoption time in Malaysia (given the sample of data that we have), but PetFinder operates in many more countries. Thus, we can do one of two things:

1. If it does not provide insights on `AdoptionSpeed`, we could discard it.
2. Use external information so that this variable is converted to another one which can be found for every country or state. In particular, people in the competition have used the GDP per capita of each State of Malaysia, which is a good indicator of the standard of living of its inhabitants, so it could give more information on `AdoptionSpeed`, and it can easily be found on the Internet.

First of all, the number of instances in each State of Malaysia:

In [None]:
train["State"].value_counts(dropna=False)

We can see that they are listed using some numeric code, but we can substitute it using the `.csv` file of states:

In [None]:
utils.left_join(train, states, left_on=["State"], right_ID="StateID", var_to_keep="StateName")

In [None]:
train["StateName"].value_counts(dropna=False)

Let's see the distribution of the target variable in the first 8 or 9 States by number of instances, as the rest may not be very representative due to the small number of instances:

In [None]:
representative_states = {"Selangor", "Kuala Lumpur", "Pulau Pinang", "Johor",
                         "Perak", "Negeri Sembilan", "Melaka", "Kedah", "Pahang"}
utils.plot_vert_barplot(train[train["StateName"].isin(representative_states)], "StateName", target, display_numbers=False)

There are differences in the distribution of `AdoptionSpeed` depending on the State in Malaysia, although, as we could expect, when the number of instances is higher those differences are not that drastic ("Selangor", "Kuala Lumpur", "Pulau Pinang", "Johor"). Now, let's take a different approach and substitute this variable by the GDP per capita of each State (https://en.wikipedia.org/wiki/List_of_Malaysian_states_by_GDP, 2019):

In [None]:
gdp_per_capita = {
    "Kuala Lumpur": 129472,
    "Labuan": 77798,
    "Penang": 55243,
    "Selangor": 54995,
    "Sarawak": 53358,
    "Malacca": 49172,
    "Negeri Sembilan": 45373,
    "Johor": 37342,
    "Pahang": 36474,
    "Perak": 31668,
    "Terengganu": 30933,
    "Perlis": 25656,
    "Sabah": 25326,
    "Kedah": 22412,
    "Kelantan": 14300
}

Let's check if every State in the dataset is listed in the dictionary:

In [None]:
state_coincidences = set(train["StateName"].unique()) & set(gdp_per_capita.keys())
print("Coincidences:", state_coincidences)
states_without_match = set(train["StateName"].unique()) - state_coincidences
print("States that don't match:", states_without_match)

The States were listed in Malay, not in English, so we must first convert "Melaka" to "Malacca" and "Pulau Pinang" to "Penang":

In [None]:
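# Assign the result instead of calling replace(..., inplace=True) on a column selection,
# which may operate on a temporary copy in recent pandas versions.
train["StateGDP"] = train["StateName"].replace({"Melaka": "Malacca", "Pulau Pinang": "Penang"})
display(train["StateGDP"].unique())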

And finally we create another variable with the GDP per capita of the state in which the pet was sheltered:

In [None]:
train["StateGDP"] = train["StateGDP"].replace(gdp_per_capita).astype('int64')
train.drop(["StateName"], axis=1, inplace=True)
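
A slightly more defensive variant of this conversion would report any State left without a GDP value instead of silently keeping its name. The sketch below uses a hypothetical toy frame and only a subset of the GDP figures listed above:

```python
import pandas as pd

# Hypothetical toy rows; StateName uses the Malay spellings found in the dataset.
toy = pd.DataFrame({"StateName": ["Selangor", "Melaka", "Pulau Pinang", "Johor"]})

# Subset of the GDP dictionary above, kept short for the example.
gdp = {"Selangor": 54995, "Malacca": 49172, "Penang": 55243, "Johor": 37342}
malay_to_english = {"Melaka": "Malacca", "Pulau Pinang": "Penang"}

# Normalise the names first, then look up the GDP; unknown states become NaN.
english_names = toy["StateName"].replace(malay_to_english)
toy["StateGDP"] = english_names.map(gdp)

# Fail loudly instead of silently keeping a state name in a numeric column.
unmapped = english_names[toy["StateGDP"].isna()].unique()
assert len(unmapped) == 0, f"States without GDP: {unmapped}"
toy["StateGDP"] = toy["StateGDP"].astype("int64")
print(toy)
```

With `map`, a missing key yields `NaN`, which is easy to detect, whereas `replace` would leave the original string in place.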

### RescuerID?

This is a variable that contains the identifier (some kind of hash) of the person or entity that is putting the pet up for adoption. Thus, we have the same problem again: we may use information about certain rescuers, but in a production environment, or just in the test dataset, we could encounter rescuers that we haven't seen before, so we have to find a way to represent some characteristic of the rescuer other than its identity.

One characteristic that we can extract from each rescuer is the number of times it appears in the sample: that will tell us whether it could be an individual (1 or few occurrences) or a shelter (greater number of occurrences). Thus, we will create a new variable called `RescuerCount`.

In [None]:
rescuer_count = train["RescuerID"].value_counts(dropna=False).rename('RescuerCount').reset_index()
rescuer_count

As we can see, there are 5595 different rescuers, which would make a mean of roughly 3 profiles published by each one, but there are some rescuers that have published just 1 profile and others that have published hundreds.

Let's see how many have published a single profile (which may have more than one pet!) and how many have published a significant number, for example more than 30:

In [None]:
# RescuerCount counts profiles, and a profile may list more than one pet.
print("Number of rescuers who published a single profile in the sample:",
      rescuer_count[rescuer_count["RescuerCount"] == 1].shape[0])
print("Number of rescuers who published more than 30 profiles in the sample:",
      rescuer_count[rescuer_count["RescuerCount"] > 30].shape[0])

Now let's replace the `RescuerID` variable by another which indicates how many times that rescuer has appeared in the training dataset, called `RescuerCount`: 

In [None]:
utils.left_join(train, rescuer_count, left_on=["RescuerID"], right_ID="index", var_to_keep="RescuerCount")
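
Regarding rescuers that were not seen before, one possible way to carry the counts over to new data is sketched below. The frames are hypothetical toys, and treating an unseen rescuer as having 0 previous profiles is an assumption, not a decision taken in this analysis:

```python
import pandas as pd

# Hypothetical toy frames standing in for the training and test sets.
toy_train = pd.DataFrame({"RescuerID": ["a", "a", "b", "a", "c"]})
toy_test = pd.DataFrame({"RescuerID": ["a", "d"]})

# Counts are learned from the training rows only.
counts = toy_train["RescuerID"].value_counts()

toy_train["RescuerCount"] = toy_train["RescuerID"].map(counts)
# Assumption: a rescuer never seen in training gets 0 previous profiles.
toy_test["RescuerCount"] = toy_test["RescuerID"].map(counts).fillna(0).astype(int)
print(toy_test)
```

Whether unseen rescuers should get 0, 1 or a count computed over the test set itself depends on how the feature is meant to be interpreted.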

Let's take a look at the distribution of values of `RescuerCount` and the target variable:

In [None]:
utils.plot_histogram_and_density(data=train, x="RescuerCount", hue=target)

As we saw, the distribution is greatly skewed towards small values, but there are some high values which probably correspond to shelters. Let's apply the natural logarithm to see in more detail the distribution of the target variable:

In [None]:
train_copy = train[["RescuerCount", "AdoptionSpeed"]].copy(deep=True)
train_copy["RescuerCount"] = np.log(train_copy["RescuerCount"])

utils.plot_histogram_and_density(data=train_copy, x="RescuerCount", hue=target, additional_desc="(logarithmic scale)")

In these plots we can see that the probability for a pet not to be adopted after 100 days from the publication of the profile is higher when the rescuer has published a smaller number of profiles (in fact, this probability decreases substantially as `RescuerCount` increases), probably because they are individuals, and most people who want to adopt a pet will probably trust the care of a shelter more than that of a person who may not deal with these types of animals very often, or just because it is easier or safer to go to a shelter than to meet an unknown person.

In [None]:
utils.plot_boxplot(data=train, x="RescuerCount", y=target)

Moreover, when the pets end up being adopted, it seems that the smaller the `RescuerCount` value, the earlier the adoption takes place, maybe because people know that pets can stay a little bit longer in shelters than in an individual's house.

### VideoAmt

This is another numerical variable, which indicates how many videos of the pet (or pets) given up for adoption were included in the profile. Let's see how many different values we have:

In [None]:
train["VideoAmt"].value_counts(dropna=False)

As we can see, just a few profiles (574) were published including at least one video of the pets (the most common case is just one video). Thus, let's check if publishing a video or not has some effect on the target variable:

In [None]:
conditions = [train["VideoAmt"] == 0, train["VideoAmt"] > 0]
subplots_titles = ["Without video", "With at least one video"]
yticks = list(range(0,36,5))
utils.plot_relative_countplot(data=train, x=target, conditions=conditions, figsize=(15,6),
                        title=f"Comparison of the outcome of {target} given the presence or not of videos in the profile",
                        subplots_titles=subplots_titles, yticks=yticks)

As we can see, there is a clear difference between profiles with and without a video: at least, the percentage of pets that are not adopted after 100 days is considerably lower.

Thus, as the number of instances with 3 or more videos is very small, it does not make sense to maintain the original granularity; we could just create a new variable which indicates whether a video was published in the profile or not.

### PhotoAmt

This is the last numerical variable of the main dataset, which indicates how many photographs were published in each profile. First of all, let's see some univariate statistics:

In [None]:
train["PhotoAmt"].describe()

The average number of photographs that are published in a profile is almost 4, while there are cases with no photographs at all or even 30 photographs. Let's see how this could affect the target variable:

In [None]:
no_photo = train.loc[train["PhotoAmt"] == 0, ["PhotoAmt", target]]
print("Profiles without photographs:", no_photo.shape[0])
with_photo = train.loc[train["PhotoAmt"] > 0, ["PhotoAmt", target]]

data = [no_photo, with_photo]
subplots_titles = ["Without photo", "With at least one photo"]
yticks = list(range(0,71,5))
utils.plot_relative_countplot(data=data, x=target, conditions=[], figsize=(15,6),
                        title=f"Comparison of the outcome of {target} given the presence or not of photos in the profile",
                        subplots_titles=subplots_titles, yticks=yticks, data_already_given=True)

As we can see, if no photograph is published, the probability that the pet is not adopted eventually is very high. Let's see now in more detail the distribution of the target variable for each value of `PhotoAmt` when it is not zero:

In [None]:
utils.plot_histogram_and_density(data=with_photo, x="PhotoAmt", hue=target, additional_desc="(PhotoAmt > 0)")

In [None]:
utils.plot_boxplot(data=with_photo, x="PhotoAmt", y="AdoptionSpeed", additional_desc="(PhotoAmt > 0)")

If we don't take into account those profiles with zero photos, we can see that there isn't a clear difference when the amount of photographs increases; maybe if there are 1 or 2 photos the chances of not being adopted after 100 days are slightly higher. Thus, we could also replace this variable by another one telling whether there is at least 1 photo in the profile or not.
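
The binary indicators suggested for `VideoAmt` and `PhotoAmt` could look like this. It is only a sketch on hypothetical toy rows, and `HasVideo` and `HasPhoto` are made-up column names:

```python
import pandas as pd

# Hypothetical toy rows standing in for train[["VideoAmt", "PhotoAmt"]].
toy = pd.DataFrame({"VideoAmt": [0, 1, 0, 3], "PhotoAmt": [0, 2, 5, 30]})

# 1 when the profile includes at least one video / photo, 0 otherwise.
toy["HasVideo"] = (toy["VideoAmt"] > 0).astype(int)
toy["HasPhoto"] = (toy["PhotoAmt"] > 0).astype(int)
print(toy)
```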

## Text

### Description

This variable is the only one that is neither categorical nor numerical (if we don't take into account `Name`, which we discarded and will be replaced by `HasName`), so its processing will be different (if we want to include this information, we will have to employ different methods like TF-IDF, word embeddings, etc.). According to the description of the dataset, the majority of descriptions are written in English, but there are some in Malay or Chinese. Let's take a look at some descriptions:

In [None]:
description_sample = train.sample(10, random_state=seed)
for index, row in description_sample.iterrows():
    print(f'Index {index} - PetID {row["PetID"]}: \n{row["Description"]}\n')

In this small sample, we can see that there are huge differences among the descriptions of different profiles, and there is no structure at all; some descriptions provide information about the context/state of the pet and some others include redundant data or data that is supposed to be already included in other variables. In the former case, we are lucky, as we are provided with extra data coming from the sentiment analysis done by [Google's Natural Language API](https://cloud.google.com/natural-language/docs/basics). In the latter case, there's little we can do to overcome it or extract the non-redundant information; however, we can check whether the description information matches the tabular data:

For example, the descriptions of instances `Index 5868 - PetID 32141c266` and `Index 1584 - PetID 3595648ea` say that the pets are already dewormed and vaccinated. Let's see:

In [None]:
description_sample.loc[[5868, 1584], ["Dewormed", "Vaccinated"]]

The description of `Index 6926 - PetID ac495f25b` says that 5 dogs are given up for adoption, they are a mix of males and females, the breed is also a mix and they are 1 month old: 

In [None]:
description_sample.loc[6926, ["Quantity", "Gender", "BreedName1", "BreedName2", "Age"]]

We will just trust the descriptions: at least the previous ones are consistent with the tabular data, although we cannot ensure that there aren't any inconsistencies.

Let's now see how many profiles don't have a description:

In [None]:
print("Number of profiles without description:", train[train["Description"].isna()].shape[0])

Now, let's create a `WordCloud` of the descriptions, which tells us the most frequent words found in them (the bigger a word is drawn in the picture, the more occurrences it has). As the way of publishing an adoption profile may vary depending on the type of pet, we will create two word clouds: one for dogs and another for cats.

In [None]:
all_descriptions_cats = " ".join(desc for desc in train.loc[train["Type"] == "Cat", "Description"].fillna('').values)
all_descriptions_dogs = " ".join(desc for desc in train.loc[train["Type"] == "Dog", "Description"].fillna('').values)

In [None]:
utils.plot_wordcloud([all_descriptions_cats, all_descriptions_dogs], 1, 2,
               ["Most frequent words for cats", "Most frequent words for dogs"], seed)

As we can see, the most common words for each type of pet are very similar (the name of the type of pet, 'love', 'found', 'kitten' or 'puppy', 'playful', etc.), and we should take into account that these don't really provide much information. We should be particularly interested in words that are not as commonly used, which is the idea behind TF-IDF weighting (TF measures how frequent a word is within a document, while IDF reduces the importance of words that are common across the whole corpus of documents); such words may show a pattern between the target variable and their occurrence (for example, 'street', 'stray', 'abandoned', etc.).
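
To make the TF-IDF idea concrete, here is a minimal sketch in plain Python. The three toy descriptions are made up for illustration (they are not taken from the dataset), and the unsmoothed formula used here is only one of several common TF-IDF variants:

```python
import math
from collections import Counter

# Toy descriptions (made up for illustration, not taken from the dataset).
docs = [
    "playful cat found near my house",
    "stray cat found on the street",
    "cute playful cat loves to play",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many descriptions each word appears.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(tokens):
    """Return {word: TF-IDF weight} for one tokenized description."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }


weights = tf_idf(tokenized[1])
# 'cat' appears in every description, so its IDF (and weight) is 0,
# while 'stray' appears only in this one and gets a positive weight.
print(sorted(weights.items(), key=lambda item: -item[1]))
```

In this toy corpus, the word shared by all descriptions is weighted down to zero, while the rarer context words stand out, which is the behaviour we would want for words like 'stray' or 'abandoned'.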

Another interesting thing to check is the relationship between the target variable and the length of the description in number of characters:

In [None]:
train["DescriptionLength"] = train["Description"].fillna('').map(lambda x: len(str(x)) if str(x) else 0)

In [None]:
train["DescriptionLength"].describe()

In [None]:
utils.plot_histogram_and_density(data=train, x="DescriptionLength", hue=target)

As the distribution of values is right-skewed (it has a long tail towards large values) and the observed domain is too large to clearly see the distribution of the target variable, let's apply the natural logarithm, using $\log(1 + x)$ because there are 12 zero values (the instances that don't have a description):

In [None]:
train_copy = train[["DescriptionLength", "AdoptionSpeed"]].copy(deep=True)
train_copy["DescriptionLength"] = np.log(1+train_copy["DescriptionLength"])

utils.plot_histogram_and_density(data=train_copy, x="DescriptionLength", hue=target, additional_desc="(logarithmic scale)")
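
As a small aside on the transformation used above, the sketch below shows on made-up lengths why adding 1 before taking the logarithm matters: the logarithm is undefined at 0, while $\log(1 + x)$ maps an empty description to 0 and compresses the long tail. The values are illustrative only; `math.log1p` computes the same quantity as `np.log(1 + x)`:

```python
import math

# Made-up description lengths, including an empty description (length 0).
lengths = [0, 12, 117, 431, 6000]

# log(1 + x) is defined at x = 0, unlike log(x).
log_lengths = [math.log1p(n) for n in lengths]

print([round(v, 2) for v in log_lengths])
```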

In [None]:
utils.plot_boxplot(data=train, x="DescriptionLength", y="AdoptionSpeed")

When the target variable is 4, the median and the IQR of `DescriptionLength` seem to be slightly smaller than for the rest of the values. We can see that in the boxplot, and in the density plot we can also see that the probability of that value decreases for longer descriptions in our sample.

### Description metadata

Now we will take a look at each `.json` file in the `train_sentiment` directory, which contains the sentiment analysis carried out by Google's Natural Language API, and extract the useful information from it. The general structure of each of these files can be found [here](https://cloud.google.com/natural-language/docs/basics); the API can perform 5 types of analysis:

- Sentiment analysis
- Entity analysis
- Entity sentiment analysis
- Syntactic analysis
- Content classification

In this case, looking at some of the `json` files (the name of the file corresponds to the PetID of an instance in the main dataset), we are provided with the Sentiment analysis and the Entity analysis; the former corresponds to the `sentences` and `documentSentiment` attributes, while the latter corresponds to the `entities` attribute. Thus, the original request to the API would look like the following (example for the description of the instance with `PetID = 0008c5398`):

In [None]:
train[train["PetID"] == '0008c5398']["Description"].values

```
{
  "document":{
    "type":"PLAIN_TEXT",
    "content":"Ollie was rescued from the construction site behind my house. He is quite the manja type and loves to play. He makes a good companion and playmate for young children. He is quite the handsome chap with a distinct mark on his face like a beauty mark."
  },
  "features":{
    "extractEntities":true,
    "extractDocumentSentiment":true
  },
  "encodingType":"UTF8"
}
```
The `features` attribute of the request contains the type of analysis we want to get; if a certain type of analysis is not specified or it is included as `false`, then it won't be performed.
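
As a hedged illustration, a request body with this shape could be assembled in Python as follows. The helper `build_request` is hypothetical; it only mirrors the field names of the example above and does not call the API:

```python
import json


def build_request(text, entities=True, document_sentiment=True):
    """Build a request body shaped like the example above (no network call)."""
    return {
        "document": {"type": "PLAIN_TEXT", "content": text},
        "features": {
            "extractEntities": entities,
            "extractDocumentSentiment": document_sentiment,
        },
        "encodingType": "UTF8",
    }


body = build_request("Ollie was rescued from the construction site behind my house.")
print(json.dumps(body, indent=2))
```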

The actual response is the following (we will use this one as example to see the composition of each attribute):

```
{
  "sentences": [
    {
      "text": {
        "content": "Ollie was rescued from the construction site behind my house.",
        "beginOffset": -1
      },
      "sentiment": {
        "magnitude": 0,
        "score": 0
      }
    },
    {
      "text": {
        "content": "He is quite the manja type and loves to play.",
        "beginOffset": -1
      },
      "sentiment": {
        "magnitude": 0.9,
        "score": 0.9
      }
    },
    {
      "text": {
        "content": "He makes a good companion and playmate for young children.",
        "beginOffset": -1
      },
      "sentiment": {
        "magnitude": 0.9,
        "score": 0.9
      }
    },
    {
      "text": {
        "content": "He is quite the handsome chap with a distinct mark on his face like a beauty mark.",
        "beginOffset": -1
      },
      "sentiment": {
        "magnitude": 0.8,
        "score": 0.8
      }
    }
  ],
  "tokens": [],
  "entities": [
    {
      "name": "Ollie",
      "type": "PERSON",
      "metadata": {},
      "salience": 0.71007854,
      "mentions": [
        {
          "text": {
            "content": "Ollie",
            "beginOffset": -1
          },
          "type": "PROPER"
        }
      ]
    },
    {
      "name": "construction site",
      "type": "LOCATION",
      "metadata": {},
      "salience": 0.13050501,
      "mentions": [
        {
          "text": {
            "content": "construction site",
            "beginOffset": -1
          },
          "type": "COMMON"
        }
      ]
    },
    // 9 more entities ...
  ],
  "documentSentiment": {
    "magnitude": 2.8,
    "score": 0.7
  },
  "language": "en",
  "categories": []
}
```

#### Sentiment analysis

The `sentences` attribute is a list of objects (one for each sentence) with two attributes: `text` and `sentiment`. The former, in turn, has two attributes: `content`, which is the sentence itself, and `beginOffset`, which is the absolute displacement of the sentence (its first character) from the beginning of the entire text (0); according to the documentation, it is computed according to the `encodingType` of the request and set to -1 when no encoding type is specified. In these files `beginOffset` is always -1, so the offsets carry no information here. We know that a Syntactic analysis wasn't performed on the descriptions because the root object's attribute `tokens` is empty.

On the other hand, the `sentiment` attribute of a sentence object contains two important attributes:

- `magnitude`: according to the documentation, it indicates the strength of emotion (regardless of whether it is positive or negative) of the sentence, in the interval 0.0 to $+\infty$.
- `score`: it indicates the emotional leaning of the sentence, between -1.0 (negative) and 1.0 (positive)

As the number of sentences varies from one description to another, the attribute `documentSentiment` provides the overall `score` and `magnitude` of the entire text. However, we must take into account that `magnitude`, unlike `score`, is not normalized, so longer descriptions may have greater values of overall `magnitude`; we can either normalize the `magnitude` by the description length or the number of sentences, or just keep it as it is if we include the description length in the dataset. In any case, we will extract these two attributes of `documentSentiment` and include them in the main dataset for each instance, in order to analyse their effect on the target variable.

Note that `score` can be negative (down to -1.0), which indicates a 'negative' sentence or emotion. For example, the description of the instance with `PetID = 00156db4a` contains a sentence of this type:
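
One possible normalization is sketched below: the overall `magnitude` divided by the number of sentences. The dictionary is an abridged copy of the response shown earlier, and `magnitude_per_sentence` is a hypothetical helper, not part of `utils`:

```python
# Abridged response for PetID 0008c5398, keeping only the fields used here.
response = {
    "sentences": [
        {"sentiment": {"magnitude": 0, "score": 0}},
        {"sentiment": {"magnitude": 0.9, "score": 0.9}},
        {"sentiment": {"magnitude": 0.9, "score": 0.9}},
        {"sentiment": {"magnitude": 0.8, "score": 0.8}},
    ],
    "documentSentiment": {"magnitude": 2.8, "score": 0.7},
}


def magnitude_per_sentence(resp):
    """Overall magnitude divided by the number of sentences (0.0 if none)."""
    n_sentences = len(resp.get("sentences", []))
    if n_sentences == 0:
        return 0.0
    return resp["documentSentiment"]["magnitude"] / n_sentences


normalized = magnitude_per_sentence(response)
print(round(normalized, 2))
```

Dividing by the number of characters instead would work in the same way; which of the two is preferable is something to decide during feature selection.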

```
{
  "sentences": [
    {
      "text": {
        "content": "I found Pup with 2 other brothers who were abandoned and were in terrible condition.",
        "beginOffset": -1
      },
      "sentiment": {
        "magnitude": 0.7,
        "score": -0.7
      }
    },
    // 4 more sentences...
  ],
  // ...
}
```

However, `magnitude`, as a measure of 'strength', is always non-negative in value (not necessarily in emotion, which is indicated by `score`). Depending on the combination of values of `score` and `magnitude`, we can say how much emotion or 'sentiment' was put in the description. In general, if the `score` is close to 1.0 or -1.0 and the `magnitude` is close to the number of sentences, we can say that the text was overall clearly positive or negative, respectively. However, if `score` is close to 0.0, there are two situations: the text is 'low-emotional' when the `magnitude` is low, but if the `magnitude` is higher then we can say that there are mixed emotions (positive and negative emotions that cancel each other out).
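
The interpretation rules above can be sketched as a small function. The thresholds are arbitrary illustrative values (they come neither from the API documentation nor from the dataset), and `describe_sentiment` is a hypothetical helper:

```python
def describe_sentiment(score, magnitude, n_sentences,
                       score_threshold=0.25, magnitude_threshold=0.5):
    """Label a text from its overall score and magnitude.

    The thresholds are arbitrary illustrative values, not taken from the
    API documentation or from the dataset.
    """
    # Average strength per sentence, so long texts are not favoured.
    strength = magnitude / n_sentences if n_sentences else 0.0
    if score >= score_threshold:
        return "positive"
    if score <= -score_threshold:
        return "negative"
    # Score near zero: distinguish low emotion from cancelling emotions.
    return "mixed" if strength >= magnitude_threshold else "neutral"


print(describe_sentiment(0.7, 2.8, 4))  # the 'Ollie' description
print(describe_sentiment(0.0, 3.0, 4))  # strong emotions that cancel out
print(describe_sentiment(0.0, 0.2, 4))  # little emotion overall
```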

#### Entity analysis

The Entity analysis corresponds to the `entities` list of objects; each of these objects has several attributes:

- `name`: the name of the entity ('cat', 'kittens', the proper name given to the pet, context information like 'street', 'birth', etc.).
- `type`: it can be 'PERSON', 'LOCATION', 'OTHER', etc. If we take a look at some documents, we can see that the proper names given to the pets, or names like 'kittens', are labelled as 'PERSON', but if 'cat' or 'dog' are mentioned, they are labelled as 'OTHER'.
- `salience`: a number in the range $[0,1]$ that indicates the importance or relevance of that entity in the entire text.
- `mentions`: a list with the places in the text where the entity is mentioned (although the positions are not available in this case, as `beginOffset` is always -1) and the `type` of each mention, which is 'PROPER' or 'COMMON'.

Pets can be mentioned by their names or just by the type of pet they are, that is, sometimes they are labelled as 'PERSON' and other times as 'OTHER'; however, 'kids', for example, are also labelled as 'PERSON', and 'Thanks' is labelled as 'OTHER', so maybe it doesn't make sense to extract, for example, the mean `salience` of those types of entities (in fact, some of these types may not appear in the description), but we could extract the maximum `salience` of any entity (if it is low, maybe it is difficult to identify which entities are the most important). On the other hand, we could also extract the number of detected `entities` (even though it will probably be correlated with the `DescriptionLength`).


Additionally, every call to the API, regardless of the type of analysis, returns an attribute `language` which contains the code of the document's language: the one specified in the request or, if none was specified, the one automatically detected by the API. This is another attribute that is interesting to include in the main dataset in order to study how it could affect the target variable.

Now, we will retrieve the attributes `score` and `magnitude` of the overall document or description (in the object `documentSentiment`), the length of the `entities` list, the maximum `salience` of any entity (if present) and the `language` attribute from the `.json` files, and we will include this information in the training dataset. However, we have to take into account that there are some instances that don't have a description, and maybe some other instances that the API could not analyse:

In [None]:
# description_metadata = utils.get_description_metadata()

In [None]:
without_desc_analysis = utils.include_description_metadata(train) #, description_metadata=description_metadata)
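
The actual extraction is done inside `utils`, whose implementation is not shown here. As a rough sketch of what mapping a single response to the five variables could look like, the hypothetical function below uses the column names of this notebook and assumes 0.0 as the default when a value is missing:

```python
import json


def extract_description_metadata(response):
    """Map one parsed response to the five variables described above."""
    sentiment = response.get("documentSentiment", {})
    saliences = [entity["salience"] for entity in response.get("entities", [])]
    return {
        "DescriptionScore": sentiment.get("score", 0.0),
        "DescriptionMagnitude": sentiment.get("magnitude", 0.0),
        "DescriptionLanguage": response.get("language"),
        "DescriptionNumEntities": len(saliences),
        # Assumption: 0.0 when no entity was detected.
        "DescriptionMaxSalience": max(saliences, default=0.0),
    }


# Abridged version of the response shown earlier (2 of its entities).
raw = """
{
  "entities": [
    {"name": "Ollie", "type": "PERSON", "salience": 0.71007854},
    {"name": "construction site", "type": "LOCATION", "salience": 0.13050501}
  ],
  "documentSentiment": {"magnitude": 2.8, "score": 0.7},
  "language": "en"
}
"""
metadata = extract_description_metadata(json.loads(raw))
print(metadata)
```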

Let's take a sample to see that we have included the new variables:

In [None]:
train.sample(10, random_state=seed)[["PetID","DescriptionScore", "DescriptionMagnitude", "DescriptionLanguage",
                                     "DescriptionNumEntities", "DescriptionMaxSalience"]]

Now let's see how many profiles had a description but it could not be analysed by the API:

In [None]:
print("Number of profiles whose description could not be analyzed:",
      len(without_desc_analysis) - train[train["Description"].isna()].shape[0])

In these cases, we have set the `score` and `magnitude` to 0.0 (neutral, or no emotion in the description), and the `language` is just `NaN`. Maybe we can find a reason that explains why the API could not analyse them; let's see some of these descriptions:

In [None]:
sample_without_analysis = train[train.index.isin(without_desc_analysis) & train["Description"].notna()
                               ].sample(15, random_state=seed)

for index, row in sample_without_analysis.iterrows():
    print(f'Index {index} - PetID {row["PetID"]}:\n {row["Description"]}\n')

As we can see, the main reason why these descriptions could not be analysed is that they mix English and Malay, or are written entirely in Malay. However, this is a sample of 15 out of 539 instances, so maybe there are other reasons, for example the difficulty of splitting the text into sentences (for example, `Index 3914 - PetID 039fea576` above).

### DescriptionLanguage

Let's see how many different languages were used in those descriptions that were analysed by the API:

In [None]:
with_desc = train[train["Description"].notna()].copy(deep=True)
with_desc["DescriptionLanguage"].value_counts(dropna=False)

`en` stands for English, `zh` for Chinese (Simplified), `zh-Hant` for Chinese (Traditional) and `de` for German. The number of values is highly unbalanced: English is the language used in the vast majority of descriptions. Let's see now the distribution of the target variable for each one:

In [None]:
utils.plot_vert_barplot(with_desc, "DescriptionLanguage", target, display_numbers=False)

If we ignore the descriptions that are in German, because there are just 2 of them, it seems that when the profiles were published with a description in Chinese (`zh` or `zh-Hant`), the majority of those pets were not eventually adopted. However, given the huge difference between the number of instances in English and in Chinese, and given that the latter case corresponds to just 131 instances, this variable may not provide much information.

Let's see if there is a different distribution when the description cannot be analysed by the API (mainly due to a mix of languages or the language is Malay):

In [None]:
utils.plot_vert_barplot(with_desc, "DescriptionLanguage", target, nan_div=True, display_numbers=False)

There are two things happening here: when pets with a profile whose description is in Malay or in mixed languages ('Without DescriptionLanguage' in the plot) are adopted, they are adopted earlier than those with a description in pure English or Chinese, but the probability of ending up not being adopted is slightly higher in the former case than in the latter. Of course, these are early conclusions, as the difference in the number of cases supporting them is big (539 and 14440, respectively).

### DescriptionScore

First of all, let's check the statistics of this numerical variable:

In [None]:
with_desc["DescriptionScore"].describe()

As we can see, the mean value is positive (and so are the three quartiles), so overall there are more descriptions with a "positive" emotion/sentiment than with a "negative" one, especially if we consider that 539 cases are counted as neutral (0.0) because their description could not be analysed (without them, the mean would be slightly higher). On the other hand, there is no such thing as a "fully positive" or "fully negative" description, as the maximum and minimum values are 0.9 and -0.9, respectively, probably due to the fact that the API will always encounter neutral entities.

Now, let's check the distribution of the target variable given the degree of negative (closer to -1.0), positive (closer to 1.0) or neutral/mixed (closer to 0.0) emotions in the descriptions:

In [None]:
utils.plot_histogram_and_density(data=with_desc, x="DescriptionScore", hue=target)

Just looking at the previous plots, we can say that `DescriptionScore` may not really help, at least by itself, to identify the target variable's value. However, those plots show the distribution of values before the imputation we have to do for the instances whose description could not be analysed (for which we left `DescriptionScore` at 0.0). As those instances actually have a description, it would not be fair to leave these properties at 0. Hence, let's impute the values using the mean of the instances whose description was analysed, and compare the distribution before and after that imputation:

In [None]:
desc_not_analysed_ids = set(train[train.index.isin(without_desc_analysis) & train["Description"].notna()]["PetID"].values)
desc_analysed = with_desc[~with_desc["PetID"].isin(desc_not_analysed_ids)]

after_des_score_imputation = with_desc[["PetID","DescriptionScore", target]].copy(deep=True)
after_des_score_imputation.loc[after_des_score_imputation["PetID"].isin(desc_not_analysed_ids),
                     "DescriptionScore"] = desc_analysed["DescriptionScore"].mean()

comparison_desc_score_imp = with_desc[["PetID","DescriptionScore", target]].copy(deep=True)
comparison_desc_score_imp["Imputation"] = "Before"
after_des_score_imputation["Imputation"] = "After"
comparison_desc_score_imp = pd.concat([comparison_desc_score_imp, after_des_score_imputation])

utils.plot_boxplot(data=comparison_desc_score_imp, x="DescriptionScore", y=target, hue="Imputation", figsize=(20,12),
            show_title=False)

We can see that `DescriptionScore` is more or less similar before and after the imputation (in fact, 4 out of the 5 classes have the same `DescriptionScore` median in the former case, and 3 out of 5 in the latter; the interquartile range remains the same for values 0, 2, 3 and 4 of the target variable). Nevertheless, when the target variable indicates that the pet was not adopted after 100 days, the median and third quartile of `DescriptionScore` are higher than in the other cases both before and after the imputation, so maybe this means that, if there is an overall positive emotion in the description of the pet, people may not see its adoption as being as urgent as that of pets with more negative descriptions.

### DescriptionMagnitude

Let's check now the statistics of the descriptions' `magnitude` or strength of emotions:

In [None]:
with_desc["DescriptionMagnitude"].describe()

As we can see, the mean value of `DescriptionMagnitude` is 2.05, but we cannot really say whether this is good or bad overall, as we know that longer descriptions are likely to have a higher `DescriptionMagnitude` value. For example, the maximum value is 32; let's check the description and the `score` of that case:

In [None]:
example_magnitude_32 = with_desc[with_desc["DescriptionMagnitude"] == 32]
print(f'Instance with magnitude 32, score {example_magnitude_32.iloc[0]["DescriptionScore"]}:\n\
          {example_magnitude_32.iloc[0]["Description"]}')

We can see that the score maybe does not tell us much about the description in this case, because it is rather neutral, but a high `magnitude` could indicate more information (even as a substitute for `DescriptionLength`) or stronger emotions. Let's check the distribution of values of this variable and the target variable:

In [None]:
utils.plot_histogram_and_density(data=with_desc, x="DescriptionMagnitude", hue=target)

The distribution of `DescriptionMagnitude` is very skewed, so let's apply the natural logarithm in order to see more details (even though we can see that the target's value 4 is more skewed to smaller values than for example 2 or 3):

In [None]:
with_desc_copy = with_desc[["DescriptionMagnitude", target]].copy(deep=True)
with_desc_copy["DescriptionMagnitude"] = np.log(1 + with_desc_copy["DescriptionMagnitude"])
utils.plot_histogram_and_density(data=with_desc_copy, x="DescriptionMagnitude", hue=target,
                           additional_desc="(logarithmic scale)")

Now we can see that the 4 value is predominant in smaller values, while its proportion decreases when the `DescriptionMagnitude` is higher. Let's see the boxplot of `DescriptionMagnitude` for each `AdoptionSpeed` value before the imputation (of the instances whose description couldn't be analysed, 0.0) and after (using the mean):

In [None]:
after_des_mag_imputation = with_desc[["PetID","DescriptionMagnitude", target]].copy(deep=True)
after_des_mag_imputation.loc[after_des_mag_imputation["PetID"].isin(desc_not_analysed_ids),
                     "DescriptionMagnitude"] = desc_analysed["DescriptionMagnitude"].mean()

comparison_desc_mag_imp = with_desc[["PetID","DescriptionMagnitude", target]].copy(deep=True)
comparison_desc_mag_imp["Imputation"] = "Before"
after_des_mag_imputation["Imputation"] = "After"
comparison_desc_mag_imp = pd.concat([comparison_desc_mag_imp, after_des_mag_imputation])

utils.plot_boxplot(data=comparison_desc_mag_imp, x="DescriptionMagnitude", y=target, hue="Imputation", figsize=(20,12),
            show_title=False)

We cannot say much beyond the fact that, overall, the pets that ended up not being adopted had a smaller `DescriptionMagnitude`; let's see if the combination with `DescriptionScore` explains some behaviour of the target variable:

In [None]:
utils.plot_pairplot(desc_analysed, ["DescriptionMagnitude", "DescriptionScore"], target)

It seems that, in general, the most usual combination of values is `DescriptionScore` between 0.0 and 0.5 and `DescriptionMagnitude` between 0.0 and 10.0, but there is no clear combination of values that could provide us pure or near-pure partitions of the target variable. We can see that the majority of instances whose `AdoptionSpeed` is 0 are gathered at smaller values of `DescriptionMagnitude` than the other values of the target variable and they also have a neutral or positive emotion. It also seems that when `DescriptionMagnitude` is larger than 10.0, the majority of instances are adopted in the mid or long term (2 and 3 values). However, there is not much correlation between these variables, and their combination does not really seem to give us additional information about the target variable.

Let's check if `DescriptionMagnitude` and `DescriptionLength` are more or less proportional and if they could be redundant:

In [None]:
utils.plot_single_pairplot(desc_analysed, "DescriptionLength", "DescriptionMagnitude", target)

We can see that there is a roughly linear relationship between these variables, so they may be redundant; we will have to determine which one we should use (or whether to use both), for example by means of Feature Subset Selection techniques.
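
One way to put a number on this visual impression is the Pearson correlation coefficient. The sketch below computes it in plain Python on made-up (length, magnitude) pairs, so the printed value says nothing about the real dataset:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up (length, magnitude) pairs for illustration only.
lengths = [40, 120, 300, 450, 900]
magnitudes = [0.3, 0.9, 1.8, 2.9, 5.6]

r = pearson(lengths, magnitudes)
print(round(r, 3))
```

On the real data, `desc_analysed[["DescriptionLength", "DescriptionMagnitude"]].corr()` should give the same coefficient, since Pearson is the default method of `DataFrame.corr`.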

### DescriptionNumEntities

This is the number of entities that the API detected in the descriptions. Let's check its statistics:

In [None]:
with_desc["DescriptionNumEntities"].describe()

We can already see that the coefficient of variation is greater than one and that the maximum value is far above the IQR. Let's see that in the histogram and density plots:
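
The ratio of the standard deviation to the mean (the coefficient of variation) can be read off the `std` and `mean` rows of `describe()`. A minimal sketch on made-up entity counts, where a single very long description is enough to push the ratio above one:

```python
import statistics

# Made-up entity counts for illustration only (one very long description).
num_entities = [0, 1, 2, 2, 3, 4, 5, 60]

mean = statistics.mean(num_entities)
# Sample standard deviation (n - 1 in the denominator), as in pandas' describe().
std = statistics.stdev(num_entities)
cv = std / mean

print(round(cv, 2))
```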

In [None]:
utils.plot_histogram_and_density(data=with_desc, x="DescriptionNumEntities", hue=target)

The density plot is very similar to another one that we already saw: that of `DescriptionLength`. Maybe these variables are correlated; let's check it using the instances that were analysed:

In [None]:
utils.plot_single_pairplot(desc_analysed, "DescriptionLength", "DescriptionNumEntities", hue=target)

As we can see, `DescriptionNumEntities` and `DescriptionLength` are strongly correlated. Thus, as `DescriptionLength` is easier to obtain and we know its value for all non-empty descriptions, we can discard `DescriptionNumEntities`. Moreover, we can still identify the descriptions where no entity was detected using `DescriptionMaxSalience`.

In [None]:
train.drop(["DescriptionNumEntities"], axis=1, inplace=True)

### DescriptionMaxSalience

This variable indicates the maximum importance, within a description, of any entity that was detected by the API. We included it in order to check whether low values indicate that it is not easy to identify the main subject or entity (the "focus") of the description. Let's check its statistics:

In [None]:
with_desc["DescriptionMaxSalience"].describe()

It seems that this variable could be approximately normally distributed; let's see that in the plots:

In [None]:
utils.plot_histogram_and_density(data=with_desc, x="DescriptionMaxSalience", hue=target)

We could say that the variable is approximately normal, but there are two peaks of values: 0.0 and 1.0. The first one, as we know, is due to those instances whose description could not be analysed. However, there are 539 of those, while the plot shows approximately 800, so there are some descriptions with no detected entities:

In [None]:
print("Number of analysed descriptions without entities:",
      desc_analysed[desc_analysed["DescriptionMaxSalience"] == 0].shape[0])

Let's see some of them:

In [None]:
desc_analysed[desc_analysed["DescriptionMaxSalience"] == 0].sample(20, random_state=seed)["Description"].values

It seems that descriptions with no detected entities do not have nouns and mainly use adjectives, and it also seems that the descriptions in this case are short (it is more likely that short descriptions have a smaller number of entities).

Let's check now the case of `DescriptionMaxSalience` = 1:

In [None]:
desc_analysed[desc_analysed["DescriptionMaxSalience"] == 1]["DescriptionLength"].describe()

We can see that the instances whose maximum salience is equal to 1 are those whose description is very short (the IQR in the entire sample was $[117, 431]$, but in this case the bounds are smaller, $[12,45]$), so it could make sense that the salience is 1, since the number of entities will also be very small:

In [None]:
desc_analysed[desc_analysed["DescriptionMaxSalience"] == 1].sample(20, random_state=seed)["Description"].values

In [None]:
salience = desc_analysed["DescriptionMaxSalience"]
# The third group must exclude both bounds, matching its subplot title.
conditions = [salience == 0, salience == 1, (salience > 0) & (salience < 1)]
subplots_titles = ["0", "1", "Non zero or one values"]
yticks = list(range(0,41,5))
utils.plot_relative_countplot(data=desc_analysed, x=target, conditions=conditions, figsize=(20,6),
                        title=f"Comparison of the outcome of {target} given the salience of Description",
                        subplots_titles=subplots_titles, yticks=yticks)

As we can see, we could group the 0.0 and 1.0 values into just one value, but we would probably get the same information by extracting small values of `DescriptionLength`. Moreover, as they are the opposite bounds of the interval, it does not make much sense to do that. On the other hand, we could discretize the variable into 3 values, but the instances with value 0.0 or 1.0 are few (about 1000), so, as we can see, the values other than zero and one yield the same distribution of `AdoptionSpeed` as the entire train dataset. Thus, it does not make much sense to discretize in this case.

Let's check now the boxplot before and after the imputation of `DescriptionMaxSalience` for those instances whose descriptions could not be analysed. As usual, we impute values using the mean, and in this case we will not plot the bounds 0.0 and 1.0, as we have already seen their effect:

In [None]:
after_des_max_sal_imputation = with_desc[["PetID","DescriptionMaxSalience", target]].copy(deep=True)
after_des_max_sal_imputation.loc[after_des_max_sal_imputation["PetID"].isin(desc_not_analysed_ids),
                     "DescriptionMaxSalience"] = desc_analysed["DescriptionMaxSalience"].mean()

comparison_desc_max_sal_imp = with_desc[["PetID","DescriptionMaxSalience", target]].copy(deep=True)
comparison_desc_max_sal_imp["Imputation"] = "Before"
after_des_max_sal_imputation["Imputation"] = "After"
comparison_desc_max_sal_imp = pd.concat([comparison_desc_max_sal_imp, after_des_max_sal_imputation])

utils.plot_boxplot(data=comparison_desc_max_sal_imp[~comparison_desc_max_sal_imp["DescriptionMaxSalience"].isin({0,1})],
             x="DescriptionMaxSalience", y=target, hue="Imputation", figsize=(20,12),show_title=False)

The range of values of `DescriptionMaxSalience` is very similar regardless of the outcome of the target variable, but we can see that the median is greater and the IQR bounds are also larger when the outcome was 4 than when it was smaller, both before and after the imputation.

Finally, let's check if there are some additional correlations between pairs of numerical variables (not including those categorical variables that can be used as numerical, like `MaturitySize` or `FurLength`):

In [None]:
utils.plot_pairplot(train, ["Age", "Quantity", "Fee", "VideoAmt", "PhotoAmt"], target)

We can see that there is little or no correlation at all among the original numerical variables.

In [None]:
# pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

In [None]:
# from pandas_profiling import ProfileReport

# profile = ProfileReport(train, title="Pandas Profiling Report")
# profile.to_file("pet_adoption_report.html")

## Images

As we saw, the vast majority of instances or profiles (14652/14993 = 0.977) have at least one photo of the pet or pets listed in the profile. Each profile may have more than one photo (the name of the file is {PetID}-{num_photo}.jpg), but we are particularly interested in {PetID}-1.jpg, as it is the profile or default photo, and therefore the first one that we see, if it exists.

Nevertheless, when it is time to construct features extracted from the photos, we will have to determine whether we will use just the profile photo or if we will someway aggregate all of them, if there's more than 1.

First of all, before going into the images' properties and metadata, we will see some examples of them, especially filtering by some variables that we have already seen:

In [None]:
utils.plot_images(train[train["Type"] == "Cat"].sample(5, random_state=seed), 1, 5, (20,5),
                  "Sample of cat photos", subtitle_var="PetID")

In [None]:
utils.plot_images(train[train["Type"] == "Dog"].sample(5, random_state=seed), 1, 5, (20,5),
                  "Sample of dog photos", subtitle_var="PetID")

The first thing that we can see from the two previous samples is that the aspect ratio of each photo is different, so we will have to pad them to a fixed aspect ratio (square, as they will be used as input to a CNN to extract features).

In the samples we can see that the pet is in general the main 'protagonist', but it does not always span a big part of the image; for example, the second cat photo shows two cats, but they take up half of the picture or less. Moreover, there may be cases where the pet's face is not even shown (fifth cat photo), but maybe this type of situation is not really significant, because in this problem we are not dealing with the pet's recognition, identification or classification, but with other more abstract properties that could be related to cuteness, physical context, or image metadata like dullness, which may or may not attract people to adopt the pet or even to open the profile.

Talking about dullness, we can see that the third or fourth photo of the cat sample may not be as 'attractive' as the first cat photo, or the first or second dog photo.
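
As a sketch of the padding step, the hypothetical helper below computes how many pixels would have to be added on each side of a `width` x `height` image to make it square while keeping the original content centred. How the border is filled (black, mean colour, reflection, etc.) is a separate choice that is not decided here:

```python
def square_padding(width, height):
    """Return (left, top, right, bottom) padding that makes an image square."""
    side = max(width, height)
    pad_w, pad_h = side - width, side - height
    left, top = pad_w // 2, pad_h // 2
    # Any odd leftover pixel goes to the right/bottom edge.
    return left, top, pad_w - left, pad_h - top


print(square_padding(400, 300))  # landscape photo: padded above and below
print(square_padding(301, 400))  # portrait photo: padded left and right
```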

There may be cases where part of the pet's body cannot be seen in the photo, more than one pet is shown but only one is listed in the profile, there may be collages (like the last dog photo), etc. We expect that CNN models extract the relevant high-level features of the images.

Let's see now some samples of pets given their `Age`. As the majority of instances are 12 months old or younger, but there are cases that are even older than 200 months, we will extract three samples: young, adult and old pets.

In [None]:
utils.plot_images(train[train["Age"] <= 12].sample(5, random_state=seed), 1, 5, (20,3),
            "Sample of young pets", subtitle_var="PetID")

In [None]:
utils.plot_images(train[(train["Age"] > 12) & (train["Age"] < 120)].sample(5, random_state=seed), 1, 5,
            (20,3), "Sample of adult pets", subtitle_var="PetID")

In [None]:
utils.plot_images(train[train["Age"] >= 120].sample(5, random_state=seed), 1, 5, (20,3),
            "Sample of old pets", subtitle_var="PetID")

Some photos are not easy to classify as adult or old (for example the first and last ones of the old pets sample), and in other cases young pets or cubs may be shown with the parent, which will probably span a larger portion of the image (for example the first photo in the young sample).

Let's now see some examples of pets that are listed as 'Domestic {x} Hair' but whose value x does not match the value specified in `FurLength`, in order to decide whether we should fix this type of inconsistency and, in that case, whether we should trust `BreedName1` (or `BreedName2`) or `FurLength`:

In [None]:
for x in ['Short', 'Medium', 'Long']:
    breed = f'Domestic {x} Hair'
    fur_lengths = ['Short', 'Medium', 'Long']
    fur_lengths.remove(x)
    for fur_length in fur_lengths:
        condition = ((train["BreedName1"] == breed) | (train["BreedName2"] == breed)) & (train["FurLength"] == fur_length)
        condition &= (train["PhotoAmt"] > 0)
        utils.plot_images(train[condition].sample(5, random_state=seed), 1, 5, (20,3),
                    f"Sample of {breed} pets but {fur_length} FurLength", subtitle_var="PetID")

There are some cases where the breed seems to describe the actual fur length better than `FurLength`, but in other cases it is quite the opposite. What we can do is add another variable which tells us whether `BreedName1` or `BreedName2` matches the `FurLength` value when the breed is 'Domestic {x} Hair'. In the remaining cases (the pet is a dog, as these breed denominations are exclusive to cats, or neither breed is 'Domestic {x} Hair'), we can say that it simply matches.
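
A minimal sketch of such a flag, assuming breed names and `FurLength` are already expressed as the strings used in the previous cell; the names `DOMESTIC_BREEDS` and `fur_length_matches_breed` are hypothetical and not part of the notebook's `utils` module:

```python
# Map each 'Domestic {x} Hair' breed name to the fur length it implies
DOMESTIC_BREEDS = {f"Domestic {x} Hair": x for x in ("Short", "Medium", "Long")}


def fur_length_matches_breed(breed1, breed2, fur_length):
    """Return False only when a 'Domestic {x} Hair' breed contradicts fur_length."""
    implied = [DOMESTIC_BREEDS[b] for b in (breed1, breed2) if b in DOMESTIC_BREEDS]
    if not implied:
        # Dogs, or cats of any other breed: there is nothing to contradict
        return True
    # It is enough that one of the two breeds agrees with FurLength
    return fur_length in implied


print(fur_length_matches_breed("Domestic Short Hair", None, "Short"))  # True
print(fur_length_matches_breed("Domestic Short Hair", None, "Long"))   # False
print(fur_length_matches_breed("Tabby", None, "Long"))                 # True
```

Treating a pet with two different 'Domestic {x} Hair' breeds as matching when either one agrees is a design choice of this sketch.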

Let's now check a sample of images from profiles with a single pet and another one from profiles that offer more than one pet for adoption:

In [None]:
single_pet_sample = train[train["Quantity"] == 1].sample(10, random_state=seed)
utils.plot_images(single_pet_sample, 2, 5, (20,6), "Sample of profiles with a single pet", subtitle_var="PetID")

In [None]:
several_pets_sample = train[train["Quantity"] > 1].sample(10, random_state=seed)
utils.plot_images(several_pets_sample, 2, 5, (20,6), "Sample of profiles listing more than one pet", subtitle_var="PetID")

As we can see in these samples (and even in some of the previous ones), some profiles that offer just one pet for adoption use as profile image a collage where the pet appears more than once (first sample, fifth image). This can be misleading if the `Quantity` variable is not read, because some profiles that offer several pets have a similar 'collage' profile image in which the pets are so alike that the user could think it is just one pet (for example, the third image of the second sample). In other cases with several pets the profile image is a single photo (second sample, first or last image) rather than a collage (second sample, eighth image). It is also worth noting that many profiles offering more than one pet seem to show just one of them in the profile image.

We have saved the samples in two variables in order to check whether we could identify the aforementioned situations using the image metadata.

Now we will extract again two samples, but this time in order to see the context or 'environmental' conditions in which the pets were photographed when the variable `RescuerCount` has a small value (1, for example) or a big value (more than 30, for example). The objective is to check whether our assumption of individuals vs. shelters could be identified looking at the photos:

In [None]:
seldom_rescuers_sample = train[(train["RescuerCount"] == 1) & (train["PhotoAmt"] > 0)].sample(10, random_state=seed)
utils.plot_images(seldom_rescuers_sample, 2, 5, (20,6),
                  "Sample of profiles published by a seldom rescuer", subtitle_var="PetID")

In [None]:
usual_rescuers_sample = train[train["RescuerCount"] > 30].sample(10, random_state=seed)
utils.plot_images(usual_rescuers_sample, 2, 5, (20,6),
                  "Sample of profiles published by a usual rescuer", subtitle_var="PetID")

By just looking at the photos, it is difficult to tell whether the pets have been rescued by a shelter or by an individual. When there is a grating or something similar to a cage, it may be more likely that the pet is in a shelter (fourth, seventh, eighth and tenth images of the second sample), but it could also be the case of an individual (fifth image of the first sample).

Let's check now some profiles with more than one photo:

In [None]:
several_photos_sample = train[(train["PhotoAmt"] > 1) & (train["PhotoAmt"] < 6)].sample(10, random_state=seed)

for index, row in several_photos_sample.iterrows():
    utils.plot_images(row, 1, int(row["PhotoAmt"]), (20,3), f'Photos of pet {row["PetID"]} (index {index})',
                subtitle_var="PetID", multiple=True)

In the sample, the profile image (the first one) seems to be well chosen, as it is usually the best photo or the one that lets us clearly see the pet and its face. Moreover, it also seems to be the least blurry one. Thus, we may not have to extract features from the rest of the photos when there is more than one; however, there are cases where the additional images could give extra information to the user. For example, the second image of the first instance in the sample shows a dog with a big wound on its head, while in the first image it is difficult to tell whether there is a scar; besides, the `Health` condition of the dog is now 'Healthy' (see the following cell), but that image tells us something that could change the adopter's decision.

In [None]:
several_photos_sample.iloc[0]["Health"]

There are other aspects that could be interesting to include, for example whether there are people in the profile image, as we can see in the seventh instance, and to study whether those conditions could influence the adoption speed; this kind of data can be obtained from the image metadata.

### Image metadata

We are provided with additional information or metadata of each image obtained using the Google Vision API. Every image has been analysed and at least we are given the Label Annotation and Image Properties, and in some cases, the Text Annotation and Face Annotation (only when text or faces of people appear in the image, of course; it could be interesting to see how many images have text or people along with the pets, and whether this could influence the target variable).

The requests to this API are very simple: we just need to perform an HTTP POST request to `https://vision.googleapis.com/v1/images:annotate` with a JSON body that has the following structure:

```
{
  "requests": [
    {
      object (AnnotateImageRequest)
    }
  ]
}
```

Where `requests` is a list of `AnnotateImageRequest` objects:

```
{
  "image": {
    "content": string,
    "source": {
      object (ImageSource)
    }
  },
  "features": [
    {
      "type": enum (Type)
    }
  ]
}
```

`image` is the image data, which may be given as a base64-encoded string in `content` or as a location in the Google Cloud Storage service (`source`). `features` is a list where we specify the types of analysis we want to get, where possible (in this case FACE_DETECTION, LABEL_DETECTION, TEXT_DETECTION and IMAGE_PROPERTIES). There are other attributes to provide additional information or to configure the analysis, but we have kept it simple just to see an example.
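
As an illustration of the structure above, the following sketch builds the JSON body for a single image in Python. It only constructs the payload and does not send it: an actual call would additionally need credentials, which are omitted here. The helper name `build_annotate_request` and the placeholder bytes are hypothetical.

```python
import base64
import json

VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"
FEATURE_TYPES = ["FACE_DETECTION", "LABEL_DETECTION", "TEXT_DETECTION", "IMAGE_PROPERTIES"]


def build_annotate_request(image_bytes: bytes) -> dict:
    """Build the JSON body of an annotate request for a single image."""
    return {
        "requests": [
            {
                # JSON cannot carry raw bytes, so the image is base64-encoded
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": feature_type} for feature_type in FEATURE_TYPES],
            }
        ]
    }


# Placeholder bytes, not a real image; a real call would read the photo file instead
body = build_annotate_request(b"placeholder image bytes")
print(json.dumps(body, indent=2))
```

The printed body would then be POSTed to `VISION_ENDPOINT`.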

In response, we get a JSON file:

```
{
  "responses": [
    {
      object (AnnotateImageResponse)
    }
  ]
}
```

Where the object `AnnotateImageResponse` contains all the analysis or metadata information that we get for a particular image. For example, we will print the metadata of the first image (profile image) of the 7th instance in the last sample, as it shows, in addition to the pet, two people and also some text, probably from an advertisement in the street (the faces and the text are blurred, as the text may contain personal information, like telephone numbers):

```
{
  "cropHintsAnnotation": {
    "cropHints": [
      {
        "boundingPoly": {
          "vertices": [
            {},
            {
              "x": 253
            },
            {
              "x": 253,
              "y": 398
            },
            {
              "y": 398
            }
          ]
        },
        "confidence": 1,
        "importanceFraction": 0.61
      }
    ]
  },
  "faceAnnotations": [
    { // PERSON 1
      "angerLikelihood": "VERY_UNLIKELY",
      "blurredLikelihood": "VERY_UNLIKELY",
      "boundingPoly": {
        "vertices": [
          {
            "x": 121,
            "y": 59
          },
          {
            "x": 172,
            "y": 59
          },
          {
            "x": 172,
            "y": 118
          },
          {
            "x": 121,
            "y": 118
          }
        ]
      },
      "detectionConfidence": 0.69617987,
      "fdBoundingPoly": {
        "vertices": [
          {
            "x": 122,
            "y": 70
          },
          {
            "x": 172,
            "y": 70
          },
          {
            "x": 172,
            "y": 120
          },
          {
            "x": 122,
            "y": 120
          }
        ]
      },
      "headwearLikelihood": "VERY_UNLIKELY",
      "joyLikelihood": "VERY_LIKELY",
      "landmarkingConfidence": 0.5136512,
      "landmarks": [
        {
          "position": {
            "x": 131.91956,
            "y": 92.05938,
            "z": 1.6014292e-05
          },
          "type": "LEFT_EYE"
        },
        {
          "position": {
            "x": 145.53963,
            "y": 86.606606,
            "z": -7.2926764
          },
          "type": "RIGHT_EYE"
        },
        {
          "position": {
            "x": 127.36458,
            "y": 90.031364,
            "z": 3.2862678
          },
          "type": "LEFT_OF_LEFT_EYEBROW"
        },
        // another 31 face parts objects...
      ],
      "panAngle": -26.952665,
      "rollAngle": -22.567362,
      "sorrowLikelihood": "VERY_UNLIKELY",
      "surpriseLikelihood": "VERY_UNLIKELY",
      "tiltAngle": -1.921653,
      "underExposedLikelihood": "VERY_UNLIKELY"
    },
    { // PERSON 2
      "angerLikelihood": "VERY_UNLIKELY",
      "blurredLikelihood": "VERY_UNLIKELY",
      "boundingPoly": {
        "vertices": [
          {
            "x": 68,
            "y": 8
          },
          {
            "x": 130,
            "y": 8
          },
          {
            "x": 130,
            "y": 79
          },
          {
            "x": 68,
            "y": 79
          }
        ]
      },
      "detectionConfidence": 0.55142355,
      "fdBoundingPoly": {
        "vertices": [
          {
            "x": 75,
            "y": 27
          },
          {
            "x": 125,
            "y": 27
          },
          {
            "x": 125,
            "y": 77
          },
          {
            "x": 75,
            "y": 77
          }
        ]
      },
      "headwearLikelihood": "VERY_UNLIKELY",
      "joyLikelihood": "VERY_LIKELY",
      "landmarkingConfidence": 0.4075274,
      "landmarks": [
        {
          "position": {
            "x": 94.42963,
            "y": 41.065422,
            "z": 0.0003771847
          },
          "type": "LEFT_EYE"
        },
        {
          "position": {
            "x": 112.92656,
            "y": 45.479057,
            "z": 2.8220258
          },
          "type": "RIGHT_EYE"
        },
        {
          "position": {
            "x": 89.86982,
            "y": 34.76869,
            "z": 0.25249478
          },
          "type": "LEFT_OF_LEFT_EYEBROW"
        },
        // another 31 face parts objects...
      ],
      "panAngle": 8.648216,
      "rollAngle": 15.458582,
      "sorrowLikelihood": "VERY_UNLIKELY",
      "surpriseLikelihood": "VERY_UNLIKELY",
      "tiltAngle": -6.6151457,
      "underExposedLikelihood": "VERY_UNLIKELY"
    }
  ],
  "imagePropertiesAnnotation": {
    "dominantColors": {
      "colors": [
        {
          "color": {
            "blue": 145,
            "green": 120,
            "red": 100
          },
          "pixelFraction": 0.08687219,
          "score": 0.098936334
        },
        {
          "color": {
            "blue": 25,
            "green": 23,
            "red": 20
          },
          "pixelFraction": 0.07360147,
          "score": 0.047244612
        },
        // another 8 dominant colors...
      ]
    }
  },
  "labelAnnotations": [
    {
      "description": "car",
      "mid": "/m/0k4j",
      "score": 0.96272033,
      "topicality": 0.96272033
    },
    {
      "description": "motor vehicle",
      "mid": "/m/012f08",
      "score": 0.93533605,
      "topicality": 0.93533605
    },
    {
      "description": "mammal",
      "mid": "/m/04rky",
      "score": 0.9234644,
      "topicality": 0.9234644
    },
    {
      "description": "vehicle",
      "mid": "/m/07yv9",
      "score": 0.8976179,
      "topicality": 0.8976179
    },
    {
      "description": "mode of transport",
      "mid": "/m/079bkr",
      "score": 0.87516415,
      "topicality": 0.87516415
    },
    {
      "description": "family car",
      "mid": "/m/088l6h",
      "score": 0.7479246,
      "topicality": 0.7479246
    },
    {
      "description": "product",
      "mid": "/m/01jwgf",
      "score": 0.6130812,
      "topicality": 0.6130812
    },
    {
      "description": "dog like mammal",
      "mid": "/m/01z5f",
      "score": 0.577422,
      "topicality": 0.577422
    }
  ],
  "textAnnotations": [
    {
      "boundingPoly": {
        "vertices": [
          {
            "x": 10,
            "y": 30
          },
          {
            "x": 246,
            "y": 30
          },
          {
            "x": 246,
            "y": 79
          },
          {
            "x": 10,
            "y": 79
          }
        ]
      },
      "description": "XXX\nXXX XXXX\nXXX\nXXX 6\n",
      "locale": "en"
    }
  ]
}
```
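
Note that some vertices in the response above lack an `x` or a `y` key (for example, the first vertex of the crop hint is `{}`). A plausible reading, which is an assumption of this sketch and not something confirmed by the dataset description, is that coordinates equal to zero are simply omitted. Under that assumption, the size of a bounding polygon can be computed as follows (`bounding_box_size` is a hypothetical helper):

```python
def bounding_box_size(bounding_poly: dict) -> tuple:
    """Return (width, height) of a boundingPoly, treating missing x/y as 0 (assumption)."""
    xs = [vertex.get("x", 0) for vertex in bounding_poly["vertices"]]
    ys = [vertex.get("y", 0) for vertex in bounding_poly["vertices"]]
    return max(xs) - min(xs), max(ys) - min(ys)


# Vertices copied from the example response
crop_hint = {"vertices": [{}, {"x": 253}, {"x": 253, "y": 398}, {"y": 398}]}
face_1 = {"vertices": [{"x": 121, "y": 59}, {"x": 172, "y": 59},
                       {"x": 172, "y": 118}, {"x": 121, "y": 118}]}

print(bounding_box_size(crop_hint))  # (253, 398)
print(bounding_box_size(face_1))     # (51, 59)
```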

The first attribute that we can see is `cropHintsAnnotation`, which we are not interested in, as it just gives information about the possible crops of an image. The next attribute, `faceAnnotations`, is present only for images in which a person's face appears; in fact, as two people appear in this particular image, it is a list with two objects, one for each face. For each face, we are given the area of the image in which it is found (`boundingPoly` and `fdBoundingPoly`; the latter is focused just on the skin part of the face), the detection confidence, the `landmarks` (that is, the position of each part of the face, like 'LEFT_EYE', 'NOSE_TIP', etc., together with the confidence of the landmark analysis), the orientation of the face (`panAngle`, `rollAngle` and `tiltAngle`) and a set of attributes related to the likelihood of different conditions and sentiments:

- `joyLikelihood`: it could be interesting to extract this information, maybe seeing happy people in the image gives additional reasons to the user to adopt the pet.

- `sorrowLikelihood`

- `angerLikelihood`

- `surpriseLikelihood`

- `underExposedLikelihood`

- `blurredLikelihood`: we can see that the faces are blurred in the image but this attribute is 'VERY_UNLIKELY'; this is because the analysis was done without blurring the faces, then they were blurred in order not to reveal personal information.

- `headwearLikelihood`

Each of these attributes has one of the following values: 'UNKNOWN', 'VERY_UNLIKELY', 'UNLIKELY', 'POSSIBLE', 'LIKELY' or 'VERY_LIKELY'.

Then, the root object has another attribute called `imagePropertiesAnnotation`, which mainly contains information about the dominant colors in the image. Each color is represented in RGB format and we also get the fraction of pixels the color occupies in the image (`pixelFraction`); it could be interesting to iterate over all these colors and get the sum of the `pixelFraction` values: if the dominant colors occupy a larger fraction of the image, it could be very monotone, but if it is smaller, then the image is colorful (although the former one could mean that the main focus or a large part of the image is occupied by the pet).
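
A minimal sketch of that sum, using only the two dominant colors shown in the example response (the remaining eight are elided there, so the result below is not the total for the real image). The function name `sum_pixel_fraction` is hypothetical:

```python
def sum_pixel_fraction(annotation: dict) -> float:
    """Sum the pixelFraction of all dominant colors; 0.0 if the attribute is missing."""
    colors = (annotation.get("imagePropertiesAnnotation", {})
              .get("dominantColors", {})
              .get("colors", []))
    return sum(color.get("pixelFraction", 0.0) for color in colors)


# Only the two colors visible in the example response
example = {"imagePropertiesAnnotation": {"dominantColors": {"colors": [
    {"color": {"blue": 145, "green": 120, "red": 100},
     "pixelFraction": 0.08687219, "score": 0.098936334},
    {"color": {"blue": 25, "green": 23, "red": 20},
     "pixelFraction": 0.07360147, "score": 0.047244612},
]}}}

print(round(sum_pixel_fraction(example), 6))
print(sum_pixel_fraction({}))  # no image properties at all
```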

Next, we have the `labelAnnotations` list, which contains the entity analysis of the image, that is, the information about objects (but not people or text) that can be found in the image. Each entity has a `description`, a `score` (or confidence in that label) and a `topicality` (which is the entity's relevance given the context of the image; the example given in the API documentation is the following: if there is a "tower", then it is more relevant when we have detected that it is the "Eiffel Tower" than when it is a distant towering building, even though they could have the same `score`). We could extract the `topicality` of the first entity that contains 'cat' or 'dog' in its description in order to have an estimate of the relevance of the pet in the image; in the example, it is just 0.577422, as the fraction of the image occupied by the dog is relatively small and there are other entities that are easier to detect, like a car. We can also extract the number of entities, and even the concatenation of all their descriptions.
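
The pet relevance described above can be sketched as follows. The helper `max_pet_topicality` is hypothetical; it takes the maximum over all matching entities and returns 0.0 when no entity mentions a cat or a dog. The labels are a subset of the example response:

```python
PET_WORDS = ("cat", "dog")


def max_pet_topicality(label_annotations) -> float:
    """Highest topicality among entities whose description contains 'cat' or 'dog'."""
    scores = [
        label.get("topicality", 0.0)
        for label in label_annotations
        # Plain substring matching, so any description containing these letters matches
        if any(word in label.get("description", "").lower() for word in PET_WORDS)
    ]
    return max(scores, default=0.0)


labels = [
    {"description": "car", "topicality": 0.96272033},
    {"description": "mammal", "topicality": 0.9234644},
    {"description": "dog like mammal", "topicality": 0.577422},
]

print(max_pet_topicality(labels))  # 0.577422
print(max_pet_topicality([]))      # 0.0
```

Substring matching is the simplest option; matching whole words would be stricter, at the cost of missing plurals such as 'dogs'.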

Finally, we also have a `textAnnotations` attribute, which is a list of all the text that can be found in the image, even though its content might not be available (this is the case in the previous example: the description of the text is not included, as emails and phone numbers have been anonymized).

Summing up, we will extract:

- Whether there are people in the image (there would be a `faceAnnotations` object), and the average `joyLikelihood` (if there is more than one person).
- The sum of `pixelFraction` of the dominant colors.
- The number of entities.
- The concatenation of the description of those entities.
- The maximum `topicality` of the entities that include 'cat' or 'dog' in their description.
- Whether there is text or not in the image.

We will extract this data from every image, not just the profile image, in case we try to aggregate the features of several images afterwards.
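
The presence flags, the entity count and the concatenation of descriptions are simple lookups on the response object. The sketch below is a hypothetical helper (`extract_simple_features`, with illustrative key names), not the implementation behind `utils.get_image_metadata`:

```python
def extract_simple_features(annotation: dict) -> dict:
    """Extract presence flags and entity information from one image's metadata."""
    labels = annotation.get("labelAnnotations", [])
    return {
        # faceAnnotations and textAnnotations are only present when something was detected
        "has_faces": bool(annotation.get("faceAnnotations")),
        "has_text": bool(annotation.get("textAnnotations")),
        "num_entities": len(labels),
        "desc_concatenation": " ".join(label["description"] for label in labels),
    }


# Reduced version of the example response
annotation = {
    "labelAnnotations": [{"description": "car"}, {"description": "dog like mammal"}],
    "textAnnotations": [{"description": "XXX", "locale": "en"}],
}
print(extract_simple_features(annotation))
```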

First of all let's extract the metadata for a small sample, like `several_pets_sample`, in order to see that everything works:

In [None]:
sample_profile_image_metadata = utils.get_image_metadata(sample=several_pets_sample)

In [None]:
sample_metadata = sorted(sample_profile_image_metadata.items(), key=lambda x: x[0])
for key, val in sample_metadata:
    print(f"PetID {key}:")
    faces = "faces" in val
    print(f'\n\tAre there people? {faces}')
    if faces:
        for face in val["faces"]:
            print(f'\t{face}')
    print(f'\n\tPresence of dominant colors: {val["sum_pixelFraction"]}')
    print(f'\n\tNumber of entities: {val["num_entities"]}')
    print(f'\n\tConcatenation of entities\' descriptions: {val["desc_concatenation"]}')
    print(f'\n\tmax_pet_topicality: {val["max_pet_topicality"]}')
    print(f'\n\tAre there text annotations? {val["has_text"]}\n\n')

We can see that the profile images with text annotations are indeed those that had text in the image (third and eighth images in the sample). None of them have human faces.

Now, let's extract the metadata from all the images, including those that are not the profile photo, just in case we need them in the training or feature engineering:

In [None]:
# all_images_metadata = utils.get_image_metadata(all_images=True)

all_images_metadata = defaultdict(dict, utils.load_json('../input/tfg-pet-adoption-data/train_all_images_metadata.json'))

Let's see how many images include people's faces and/or text, and how many don't have pets or have pets that were not detected by Google Cloud Vision:

In [None]:
num_images_with_faces = 0
num_images_with_text = 0
num_images_without_pet_or_not_detected = 0
num_profile_images_with_faces = 0
num_profile_images_with_text = 0
num_profile_images_without_pet_or_not_detected = 0
images_no_entities = []

for pet_id, metadata in all_images_metadata.items():
    if "faces" in metadata:
        num_images_with_faces += 1
        if pet_id.endswith("-1"):
            num_profile_images_with_faces += 1
    if metadata["has_text"]:
        num_images_with_text += 1
        if pet_id.endswith("-1"):
            num_profile_images_with_text += 1
    if metadata["max_pet_topicality"] == 0:
        if metadata["num_entities"] == 0:
            images_no_entities.append(pet_id)
        num_images_without_pet_or_not_detected += 1
        if pet_id.endswith("-1"):
            num_profile_images_without_pet_or_not_detected += 1
        
print(f"Number of images that have at least a person's face: {num_images_with_faces} (profile images: {num_profile_images_with_faces})")
print(f"Number of images that have text: {num_images_with_text} (profile images: {num_profile_images_with_text})")
print(f"Number of images without pet (or not detected): {num_images_without_pet_or_not_detected} (profile images: {num_profile_images_without_pet_or_not_detected})")
print("Images with no detected entities:\n", images_no_entities)

Of course, we don't consider here those instances that don't have at least one photo. Now we will retrieve the metadata corresponding to the profile images, so we can check how many `PetID` values don't have profile image metadata:

In [None]:
# profile_images_metadata = defaultdict(dict)

# for index, row in tqdm(train.iterrows()):
#     pet_id = row["PetID"]
#     profile_images_metadata[pet_id] = all_images_metadata[f'{pet_id}-1']

# utils.save_as_json(profile_images_metadata, "train_profile_images_metadata", convert_to_dict=True)
  
profile_images_metadata = defaultdict(dict, utils.load_json('../input/tfg-pet-adoption-data/train_profile_images_metadata.json'))
no_images = 0
for index, row in train.iterrows():
    pet_id = row["PetID"]
    if not profile_images_metadata[pet_id]:
        no_images += 1

print("Number of instances without images:", no_images)        

As we can see, the number of instances without profile image metadata is the same as the number of instances that we saw didn't have any photo. Let's now check whether the retrieved metadata has some effect on the target variable.

Now, we will include the metadata we extracted from each profile image (if present) in the dataset:

In [None]:
utils.include_profile_image_metadata(train, image_metadata=profile_images_metadata)

#### Pet entity topicality (maximum)

We have included the maximum topicality of any entity in the profile image whose description includes 'cat' or 'dog' in the train dataset, as `MaxPetTopicality`. Let's see its distribution of values and its relation with the target variable:

In [None]:
utils.plot_histogram_and_density(data=train, x="MaxPetTopicality", hue=target)

As there are 130 profile images with no detected pets and 341 instances that don't have any photo, there is a peak of 471 values at 0.0, which, due to the scale, doesn't let us clearly see what is happening in the most frequent range of values. However, we can already see that the predominant value of the target variable when `MaxPetTopicality` is smaller is 4, while the density of this value decreases as `MaxPetTopicality` gets closer to 1.0.

In [None]:
non_zero_topicality = train[train["MaxPetTopicality"] > 0]
utils.plot_boxplot(data=non_zero_topicality, x="MaxPetTopicality", y=target,
             additional_desc="(MaxPetTopicality > 0)")

The median, the interquartile range, the left whisker and the density of those values that could be interpreted as outliers let us clearly see that `MaxPetTopicality` can be an important variable: the greater its value, the more likely `AdoptionSpeed` will have a smaller value.

Let's check now the distribution of the target variable when `MaxPetTopicality` is 0.0:

In [None]:
conditions = [(train["MaxPetTopicality"] == 0) & (train["PhotoAmt"] == 0),
              (train["MaxPetTopicality"] == 0) & (train["PhotoAmt"] > 0)]
subplots_titles = ["No profile image", "No pet entity detected in the profile image"]
yticks = list(range(0,71,5))
utils.plot_relative_countplot(data=train, x=target, conditions=conditions, figsize=(15,6),
                        title=f"Comparison of the outcome of {target} when MaxPetTopicality is 0.0",
                        subplots_titles=subplots_titles, yticks=yticks)

As we can see, not having a profile image is worse than having a profile image in which the pet may not be easy to detect (assuming that it is Google Cloud Vision's AI that fails in this situation; a person could probably detect the pet in some cases). However, in the latter case the `AdoptionSpeed` is still worse than the one observed in the complete train dataset. We can also see that the greater the `AdoptionSpeed` value, the more instances have that value; as `MaxPetTopicality` is 0.0 here, the minimum it can take, this is consistent with our previous observation: the greater the `MaxPetTopicality` value, the more likely `AdoptionSpeed` will have a smaller value.

Let's see some examples of profile images in which the AI could not detect any pet or any entity at all:

In [None]:
no_detected_pet_profile_image = train[(train["MaxPetTopicality"] == 0) & (train["PhotoAmt"] > 0)]
no_detected_pet_profile_image_sample = no_detected_pet_profile_image.sample(20, random_state=seed)
utils.plot_images(no_detected_pet_profile_image_sample, 4, 5, (20,16), "Profile images with no detected pets",
            subtitle_var="PetID", title_y=0.95)

The problems that we may encounter are:

- The pet cannot be seen at all. For example, the last image of the second row or the second image in the last row.
- The pet occupies a small fraction of the image. For example, the last image of the first row, the fourth image of the second row or fourth image of the last row.
- The pet's face is not clearly shown or it is difficult to distinguish. For example, the third image of the first row, the second image of the second row, fourth image (its head fur color is black, and it is placed in a darker spot than the rest of the body) and last image of the third row, first image and third image of the last row.
- The pet (or its fur color) can be confused with the place in which the photo was taken. For example, the last image of the first row or the first image of the third row. --> Maybe this case can be identified using the variable pixelFraction.
- The image quality or the place illumination is bad. For example, the last image of the first row, the first (maybe this one is just because the number of pets is too big to be adopted) and third images of the second row. --> Maybe we should extract other image properties, like dullness, whiteness, size, dimensions.
- The image is a collage and the pet is not the main part of the collage, or there are elements that get more focus. For example, the second image of the first row or the last image in the sample (we can see in the following cell the detected entities: the AI classifies it as a product, design, material, that is, the collage gets more focus than the pets).
- Mix of the previous ones.

In [None]:
with open('../input/petfinder-adoption-prediction/train_metadata/bc7b17cd5-1.json') as f:
    json_file = json.load(f)
    print(json.dumps(json_file["labelAnnotations"], indent=2, sort_keys=True))

Here we have the list of `PetID` of all the profile images where no pets were detected:

In [None]:
no_detected_pet_profile_image["PetID"].values

#### Is there text?

We saw that there are 1075 profile images which contain text: it may be an indicator of a collage, some contact information or just some text on a T-shirt or an advertisement in the background, for example. We have added a new variable to the train dataset, called `ProfileImageText`, which indicates whether the profile image has text ('Yes') or not ('No'):

In [None]:
utils.plot_vert_barplot(train, "ProfileImageText", target, display_numbers=False)

It seems that profile images with text are not as attractive to the user as the ones that don't have text. Maybe this is not directly caused by the text and it has happened by chance, because the difference in the distributions is not large, but it could also be due to the fact that people prefer to see a clean image of the pet instead of a collage or an image overloaded with text or design.

#### Are there human faces?

We have included in the train dataset `ProfileImageHumanFace`, which tells us whether a human face was detected in the profile image ('Yes') or not ('No'), and `ProfileImageFaceJoyLikelihood`, which indicates how likely it is (in discrete values: 'VERY_UNLIKELY', 'UNLIKELY', 'POSSIBLE', 'LIKELY' or 'VERY_LIKELY') that the human or humans that appear in the profile image show joy (if there is more than one face, we just take the most common value).

Let's extract a sample of those profile images in which at least one human face was detected in order to see how accurate the AI has been:
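
A minimal sketch of how the most common `joyLikelihood` among several faces could be obtained with `collections.Counter`; the helper `most_common_joy` is hypothetical and not taken from the `utils` module. When two values are tied, `Counter.most_common` returns the one encountered first.

```python
from collections import Counter


def most_common_joy(face_annotations):
    """Return the most frequent joyLikelihood among the faces, or None if there are none."""
    values = [face["joyLikelihood"] for face in face_annotations]
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]


# Both faces of the example response are 'VERY_LIKELY'
faces = [{"joyLikelihood": "VERY_LIKELY"}, {"joyLikelihood": "VERY_LIKELY"}]
print(most_common_joy(faces))  # VERY_LIKELY
print(most_common_joy([]))     # None
```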

In [None]:
human_faces_sample = train[train["ProfileImageHumanFace"] == "Yes"].sample(20, random_state=seed)
utils.plot_images(human_faces_sample, 4, 5, (20,16),
                  "Sample of profile images in which at least a human face was detected",
                  subtitle_var="PetID", title_y=0.95)

As we can see, the AI wasn't very accurate: in 10 out of the 20 instances in the sample at least one human face was detected, but we can clearly see that they don't contain human faces. Thus, we need to extract additional information from the metadata: the `detectionConfidence` attribute of each face detected in the image, in order to see whether there is a particular threshold that we can apply to get the correct values for `ProfileImageHumanFace`.

In [None]:
def get_face_detection_confidence(directory):
    images_face_detection_confidence = defaultdict(list)

    for filename in tqdm(os.listdir(directory)):
        pet_id = filename[:-5]
        with open(os.path.join(directory, filename)) as f:
            json_file = json.load(f)
            if "faceAnnotations" in json_file:
                for face in json_file["faceAnnotations"]:
                    images_face_detection_confidence[pet_id].append(face["detectionConfidence"])
    
    return images_face_detection_confidence
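
The function above derives each key by dropping the last five characters of the filename (the `.json` extension), which leaves identifiers of the form `<PetID>-<n>`. As a sketch, the same split can be made explicit with a hypothetical helper, assuming every metadata file follows that naming pattern:

```python
import os


def split_metadata_filename(filename: str):
    """Split '<PetID>-<n>.json' into (PetID, n), where n is the photo number."""
    stem, _extension = os.path.splitext(filename)
    # Split on the last hyphen, so the photo number is always the final part
    pet_id, _, photo_number = stem.rpartition("-")
    return pet_id, int(photo_number)


print(split_metadata_filename("bc7b17cd5-1.json"))  # ('bc7b17cd5', 1)
```

With this split, the profile image is simply the file whose photo number is 1.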

In [None]:
directory = '../input/petfinder-adoption-prediction/train_metadata'
all_images_face_detection_confidence = get_face_detection_confidence(directory)

In [None]:
profile_images_face_detection_confidence = defaultdict(list)

for index, row in train.iterrows():
    pet_id = row["PetID"]
    profile_images_face_detection_confidence[pet_id] = all_images_face_detection_confidence[f'{pet_id}-1']

We add the variable `ProfileImageFaceDetectionConfidence`, which will hold the maximum `detectionConfidence` of any detected human face in the profile image:

In [None]:
train["ProfileImageFaceDetectionConfidence"] = 0.0

for index, row in train.iterrows():
    pet_id = row["PetID"]
    levels_confidence = profile_images_face_detection_confidence[pet_id]
    if len(levels_confidence) > 0:
        train.loc[index, "ProfileImageFaceDetectionConfidence"] = max(levels_confidence)

Now, let's see again the previous sample with the maximum face detection confidence, in order to see if we can apply a threshold to determine whether there is a human face or not just using the aforementioned value:

In [None]:
human_faces_sample = train[train["ProfileImageHumanFace"] == "Yes"].sample(20, random_state=seed)
utils.plot_images(human_faces_sample, 4, 5, (20,16),
                  "Sample of profile images in which at least a human face was detected",
                  subtitle_var="ProfileImageFaceDetectionConfidence", title_y=0.95)

The maximum face detection confidence of an image that does not actually include a human face is 0.575 (in this sample), while all the images with a greater value seem to have a human face (the next greater value is 0.598); however, there is an image with a human face that has a confidence value of 0.549. Let's set the threshold to 0.59 and see how good it is, keeping in mind that we may lose some images that actually have a human face.
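
The trade-off behind this choice can be sketched by counting, for a candidate threshold, how many inspected images would be wrongly kept or wrongly dropped. The helper `threshold_errors` is hypothetical, and the list contains only the three confidence values quoted above, labelled by visual inspection:

```python
def threshold_errors(labelled, threshold):
    """Count (false positives, false negatives) when keeping confidence > threshold."""
    false_positives = sum(1 for confidence, has_face in labelled
                          if confidence > threshold and not has_face)
    false_negatives = sum(1 for confidence, has_face in labelled
                          if confidence <= threshold and has_face)
    return false_positives, false_negatives


# (confidence, is there really a human face?) for the three values quoted in the text
inspected = [(0.575, False), (0.598, True), (0.549, True)]

for threshold in (0.50, 0.59):
    print(threshold, threshold_errors(inspected, threshold))
```

With 0.59 the false detection at 0.575 is removed, at the price of losing the real face at 0.549.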

In particular, the number of profile images where a human face was detected that are above the confidence threshold of 0.59 is:

In [None]:
human_faces_above_threshold = train[(train["ProfileImageHumanFace"] == "Yes") & (train["ProfileImageFaceDetectionConfidence"] > 0.59)]
human_faces_above_threshold.shape[0]

Let's see all of them:

In [None]:
utils.plot_images(human_faces_above_threshold, 15, 5, (20,60),
            "All profile images with a face detection confidence above the threshold of 0.59",
            subtitle_var="ProfileImageFaceDetectionConfidence", title_y=0.9)

There is still a profile image without a human face that has a confidence value of 0.628. Given this inaccuracy and the fact that the proportion of images where we can say there is a human face is very small (75/14993), we can discard this information if the distribution of the target variable is not very different from the original:

In [None]:
for index, row in train.iterrows():
    pet_id = row["PetID"]
    metadata = profile_images_metadata[pet_id]
    if row["ProfileImageHumanFace"] == "Yes" and row["ProfileImageFaceDetectionConfidence"] <= 0.59:
        train.loc[index, "ProfileImageHumanFace"] = "No"
        train.loc[index, "ProfileImageFaceJoyLikelihood"] = np.nan

In [None]:
utils.plot_vert_barplot(train, "ProfileImageHumanFace", target, display_numbers=False)

As we can see, the proportion of pets that end up not being adopted after 100 days from the publication is slightly lower when there is a human face, but when they do end up being adopted, it is not necessarily earlier (0 decreases, 1 and 3 increase). However, the difference between the distributions is small, and given that the one on the right side is based on just 74 instances, we can discard `ProfileImageHumanFace`. There is all the more reason to discard `ProfileImageFaceJoyLikelihood` as well, since the number of instances supporting each of its possible values will be very small:

In [None]:
train["ProfileImageFaceJoyLikelihood"].value_counts(dropna=False)

Hence, we will delete these variables:

In [None]:
train.drop(["ProfileImageHumanFace", "ProfileImageFaceJoyLikelihood", "ProfileImageFaceDetectionConfidence"],
          axis=1, inplace=True)

#### Number of entities

This is the number of entities the AI detected in each profile image. Let's see how many different values there are, and how many instances have each one:

In [None]:
train["ProfileImageNumEntities"].value_counts(dropna=False)

We can see that the AI detected between 8 and 10 entities in the majority of profile images. There are 343 instances where no entity was detected, 2 more than the 341 instances that don't have a profile image; this is because there are 2 profile images where no entity was detected (in particular, images `a977f3331-1` and `dd4b67059-1`).

Let's see the behaviour of the target variable given each value of the previous ones:

In [None]:
utils.plot_vert_barplot(train, "ProfileImageNumEntities", target, display_numbers=False)

We can say that the smaller the number of detected entities in the profile image, the more likely the pet is to end up not being adopted, or at least to be adopted later. This is clearest in the range between 5 and 10; the range between 1 and 4 has few instances and may be noise (although the first assumption still holds in those cases). The distribution given the value 0 also reinforces the first assumption (it is almost identical to the one that we have already seen when the number of photos is 0, with two additional instances).

Let's take a sample of different intervals of the number of detected entities:

In [None]:
num_entities_more_7_sample = train[train["ProfileImageNumEntities"] > 7].sample(10, random_state=seed)
utils.plot_images(num_entities_more_7_sample, 2, 5, (20,6),
                  "Sample of profile images with more than 7 detected entities",
                  subtitle_var="ProfileImageNumEntities")

In [None]:
num_entities_btw_5_7_sample = train[(train["ProfileImageNumEntities"] > 4) &
                                    (train["ProfileImageNumEntities"] <= 7)].sample(10, random_state=seed)
utils.plot_images(num_entities_btw_5_7_sample, 2, 5, (20,6),
                  "Sample of profile images with between 5 and 7 detected entities",
                  subtitle_var="ProfileImageNumEntities")

The difference between the two previous samples is not really visible to us. Maybe it is harder to detect more entities when the pet gets a smaller portion of the image or less focus (4th image), when its face is not as clearly shown (5th, 8th), or when it is in an uncommon position or the image is sideways (8th, 9th).

In [None]:
num_entities_btw_1_4_sample = train[(train["ProfileImageNumEntities"] > 0) &
                                    (train["ProfileImageNumEntities"] <= 4)].sample(10, random_state=seed)
utils.plot_images(num_entities_btw_1_4_sample, 2, 5, (20,6),
                  "Sample of profile images with between 1 and 4 detected entities",
                  subtitle_var="ProfileImageNumEntities")

In [None]:
num_entities_0 = train[(train["PhotoAmt"] > 0) & (train["ProfileImageNumEntities"] == 0)]
utils.plot_images(num_entities_0, 1, 2, (10,4), "Profile images with 0 detected entities",
            subtitle_var="PetID")

In the two previous samples we can see that the reasons that could make the AI detect a smaller number of entities are those that we have already seen when no pet was detected in the image: poor image quality or lighting conditions, a fur color that can be confused with the environment, faces that are not shown, or other elements that get more focus than the pet (especially if it is a collage).

#### Presence of dominant colors

This variable indicates the fraction of the image that is occupied by the most common colors. If it is high (close to 1), it means that there are few colors in the image; if it is closer to 0, it means that the number of different colors in the image is high. Let's see the distribution of values of `DominantColorsPixelFraction` and the target variable:

In [None]:
utils.plot_histogram_and_density(data=train, x="DominantColorsPixelFraction", hue=target)

We can see that the majority of cases are between 0.6 and 0.7; it seems that when `DominantColorsPixelFraction` is greater than approximately 0.8, the probability that the target variable is 4 is greater, while the probability that it is 2 decreases. Let's see the boxplots for each value of the target variable, excluding the 0.0 values of `DominantColorsPixelFraction`:

In [None]:
non_zero_sum_pixel_fraction = train[train["DominantColorsPixelFraction"] > 0]
utils.plot_boxplot(data=non_zero_sum_pixel_fraction, x="DominantColorsPixelFraction", y=target,
             additional_desc="(DominantColorsPixelFraction > 0)")

It seems that, among the pets that end up being adopted, the earlier the adoption, the slightly higher `DominantColorsPixelFraction` tends to be (both the IQR and the median shift in this direction, but the whiskers indicate that all the classes span almost the entire `DominantColorsPixelFraction` range, so this assumption may not be very strong); however, this does not hold for the pets that end up not being adopted.

The following are the profile images where the dominant colors occupy all the image, that is, the ones where the variety of colors is smaller:

In [None]:
utils.plot_images(train[train["PhotoAmt"] > 0].sort_values('DominantColorsPixelFraction',
                                                           ascending = False).head(10),
                  2, 5, (20,6), "Profile images where the detected dominant colors occupy all the pixels",
                  subtitle_var="DominantColorsPixelFraction", title_y=1.05)

It seems that many of them are black and white, with a white or near white background or dark image conditions.

Let's see those that are supposed to have a greater variety of colors:

In [None]:
utils.plot_images(train[train["PhotoAmt"] > 0].sort_values('DominantColorsPixelFraction',
                                                           ascending = True).head(10),
                  2, 5, (20,6), "Profile images where the detected dominant colors occupy the smallest fraction of the pixels",
                  subtitle_var="DominantColorsPixelFraction", title_y=1.05)

#### Concatenation of entities' description

We have included the concatenation of the description of every entity that was detected in each profile image as `ProfileImageDescription`. Let's depict two `WordCloud` objects, as we did with `Description`, given the `Type` of pet, in order to see the most common words, and maybe some that are not as common but could give additional information:

In [None]:
all_profile_image_descriptions_cats = " ".join(desc for desc in train.loc[train["Type"] == "Cat",
                                                                "ProfileImageDescription"].fillna('').values)
all_profile_image_descriptions_dogs = " ".join(desc for desc in train.loc[train["Type"] == "Dog",
                                                                "ProfileImageDescription"].fillna('').values)

In [None]:
utils.plot_wordcloud([all_profile_image_descriptions_cats, all_profile_image_descriptions_dogs], 1, 2,
               ["Most frequent words in cats' profile image descriptions",
                "Most frequent words in dogs' profile image descriptions"],
               seed)

We can see that many frequent words are actually information that we already have, mainly the pet's breed, the fur length and the type of pet. In fact, it is probable that many profiles were completed using this information, so it may be redundant. However, if we take a closer look at the left one, we can see that there is at least one 'dog' word; let's check whether the `Type` value matches the entities' description:

In [None]:
not_matching_type_ids = []

for index, row in train[train["ProfileImageDescription"].notna()].iterrows():
    entities_description = str(row["ProfileImageDescription"]).lower()
    if str(row["Type"]).lower() not in entities_description and row["MaxPetTopicality"] > 0.0:
        not_matching_type_ids.append(row["PetID"])
        
print("Number of instances where the specified type does not match what was detected by the AI:", len(not_matching_type_ids))

Let's take a sample of these instances and see their profile image:

In [None]:
not_matching_type_entities = train[train["PetID"].isin(not_matching_type_ids)]
not_matching_type_entities_sample = not_matching_type_entities.sample(20, random_state=seed)
utils.plot_images(not_matching_type_entities_sample, 4, 5, (20,12),
            "Sample of profile images where the Type does not match any entity detected by the AI",
            subtitle_var="Type", title_y=0.95)

In [None]:
not_matching_type_entities_sample[["PetID","Type","BreedName1"]]

The label of each image is the `Type` value of that instance. In 15 out of 20 instances the pet is a cat that wasn't detected by the AI, while there is one dog that wasn't detected. Moreover, there are 4 instances out of 20 whose `Type` value is incorrect (and the AI detected the right `Type`), in the third and fourth rows. Thus, we can say that the accuracy of the AI is likely to be lower when the pet is a cat than when it is a dog (in this sample, but if we take 40 we get the same overall results). On the other hand, there are some incorrectly labeled pets, for example:

In [None]:
train[train["PetID"] == 'ddbb99929']

We can see in the image that this pet is a cat, but its `Type` is 'Dog'. Let's see its `Description`:

In [None]:
train[train["PetID"] == 'ddbb99929']["Description"].values

In this case, for example, we cannot get the actual `Type` from the description, but if there is a conflict then one way to solve it could be to get the `Type` from the `Description` text. However, when there is a conflict the `ProfileImageDescription` may become useless or it could even add noise when the `Type` value is right (majority of cases when it is actually a cat). Thus, instead of fixing it, as it is likely that `Type` is right in the majority of cases, we could add a new variable telling us whether `Type` is included in `ProfileImageDescription`, if we see that this affects the target variable:

In [None]:
conditions = [~train["PetID"].isin(not_matching_type_ids),
              train["PetID"].isin(not_matching_type_ids)]
subplots_titles = ["Type and entities match", "Type and entities do not match"]
yticks = list(range(0,46,5))
utils.plot_relative_countplot(data=train, x=target, conditions=conditions, figsize=(15,6),
                        title=f"Comparison of the outcome of {target} given the occurrence of Type in ProfileImageDescription",
                        subplots_titles=subplots_titles, yticks=yticks)

We can see that there are some differences: when they don't match, the probability of ending up not being adopted is higher (more than 10% higher); however, when the pets do end up being adopted, it happens earlier. The latter is very probably due to chance, or simply reflects the fact that the AI is more likely than a human to get the type of pet wrong; maybe in those cases we could just delete or replace the wrong type of pet in `ProfileImageDescription`, instead of creating a new variable telling whether they coincide or not, for example.

In fact, there are some cases where the `Type` value is incorrect but we can look at the breed in order to see whether it is a dog or a cat. For example, the following one is a Shih Tzu dog (last image of the third row in the previous sample), but it is labelled as a Cat:

In [None]:
train[train["PetID"] == "6c399cb06"]

Hence, we could do the following to get the right or the most probable `Type`:

- Extract it from `Description`.
- If it is not present in `Description`, then get the type according to the `BreedName1` or `BreedName2` from `breeds` csv file.
- If the `Type` is correct given the breed, then it is likely that the pet is actually a Cat.

Once we have it, we change the value of `Type` if it is not the same, and we replace every occurrence of the wrong type in `ProfileImageDescription`, provided that we actually use this variable, given that a big part of its information is already present in the tabular data.
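
The first two steps of the list above could be sketched as follows. This is only an illustration under several assumptions: `infer_type` is a hypothetical helper, the keyword lists are ours, and `breed_type` is assumed to have already been mapped to the strings "Cat"/"Dog" from the breeds file. When neither source is conclusive, the sketch simply keeps the current `Type`.

```python
import re

CAT_WORDS = r"\b(cat|cats|kitten|kittens)\b"   # hypothetical keyword list
DOG_WORDS = r"\b(dog|dogs|puppy|puppies)\b"    # hypothetical keyword list

def infer_type(description, breed_type=None, current_type=None):
    """Return the most probable pet type: "Cat", "Dog" or the current Type."""
    # Missing descriptions (None or NaN) are treated as empty text
    text = description.lower() if isinstance(description, str) else ""
    mentions_cat = re.search(CAT_WORDS, text) is not None
    mentions_dog = re.search(DOG_WORDS, text) is not None

    # Step 1: trust Description only when exactly one type is mentioned
    if mentions_cat != mentions_dog:
        return "Cat" if mentions_cat else "Dog"
    # Step 2: otherwise fall back to the type associated with the breed
    if breed_type in ("Cat", "Dog"):
        return breed_type
    # Otherwise keep whatever Type was originally specified
    return current_type

print(infer_type("Lovely puppy looking for a home", current_type="Cat"))
print(infer_type("Very playful and healthy", breed_type="Dog", current_type="Cat"))
```

A description that mentions both types (for example, "gets along with dogs and cats") is deliberately treated as inconclusive, so the breed decides in that case.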

### Extra image properties

Finally, there are other characteristics of the profile image that could make it unattractive to the user, for example being too dark or blurry. Thus, we will extract six different image properties, using the functions of this notebook: https://www.kaggle.com/shivamb/ideas-for-image-features-and-image-quality/notebook, and then we will check whether they have some effect on the target variable.

In [None]:
# profile_images_properties = utils.get_profile_image_properties()

In [None]:
utils.include_profile_image_properties(train)

#### Dullness

The first image property that we will look at is the dullness, or in a few words, the amount of 'darkness' or the lack of vividness in the image. For example, we can check the dullness of an image that we know is very dark: the profile image of the pet with ID `f6959f1ca` (from the previous sample of images where no pet was detected):

In [None]:
train[train["PetID"] == "f6959f1ca"]["ProfileImageDullness"]

This gives us an estimate of the dark percentage ('dark' meaning those colors equal to or below (20,20,20) RGB) considering the 1000 most dominant colors (at most, as there may be a smaller number of different colors) in each half of the image; then we calculate the mean of the dullness of those halves to get the estimate for the entire image. Thus, we can see that around 70% of the image is dark (more precisely, that is the average of the darkness percentage among the 1000 most dominant colors of the two halves). In particular, for this example:

In [None]:
profile_image_path = f'../input/petfinder-adoption-prediction/train_images/f6959f1ca-1.jpg'
if os.path.isfile(profile_image_path):
    width, height = utils.get_dimensions(profile_image_path)
    print("Total number of pixels in the image:", width*height)
    dullness, whiteness = utils.perform_color_analysis(profile_image_path, print_info=True, what_info="dullness")

We can see that the overall image can be considered very dull, but especially the first half, which seems to have a smaller variety of colors than the second half, which is dull but not that much. Thus, the dullness value of the first half is 88.8% (this is the percentage of dark colors among the 1000 most dominant ones), which has 7517 different colors, while the dullness of the second half is 52.5%, smaller than the first half as we expected, with 14684 different colors from which we extract the 1000 most dominant ones. Hence, the dullness value of the complete image is the average, 70.65%. We can see that the 10 most dominant colors of both halves are all dark (that is, below (20,20,20)), but in the second half there are larger RGB values than in the first one, and they are not as dominant.

Only the 1000 most dominant colors are considered in order to save computing time: the number of possible colors under the RGB encoding is $2^{24}$ (more than 16 million), and the number of distinct colors in a particular image can be as large as its total number of pixels, if each one had a different RGB value (which is likely to be significantly greater than 1000).
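
To make the procedure concrete, here is a minimal NumPy sketch of the idea. It is not the implementation behind `utils.perform_color_analysis`: the function names are hypothetical, the image is assumed to be split into top and bottom halves, and 'bright' is assumed to mean all channels at or above 240.

```python
import numpy as np

def shade_percentages(pixels, top_k=1000, dark_max=20, bright_min=240):
    """Return (dullness, whiteness) in percent for an (H, W, 3) uint8 region."""
    flat = pixels.reshape(-1, 3)
    # Distinct colors of the region and how many pixels have each of them
    colors, counts = np.unique(flat, axis=0, return_counts=True)
    # Keep only the top_k most frequent (dominant) colors
    top = np.argsort(counts)[::-1][:top_k]
    colors, counts = colors[top], counts[top]
    dark = np.all(colors <= dark_max, axis=1)
    bright = np.all(colors >= bright_min, axis=1)
    total = counts.sum()
    return 100.0 * counts[dark].sum() / total, 100.0 * counts[bright].sum() / total

def image_dullness_whiteness(image, **kwargs):
    """Average the per-half percentages (assumed split: top and bottom halves)."""
    half = image.shape[0] // 2
    first = shade_percentages(image[:half], **kwargs)
    second = shade_percentages(image[half:], **kwargs)
    return tuple((a + b) / 2 for a, b in zip(first, second))

# Toy image: the top half is pure black, the bottom half is mid grey
toy_image = np.zeros((4, 4, 3), dtype=np.uint8)
toy_image[2:] = 100
dullness, whiteness = image_dullness_whiteness(toy_image)
print(dullness, whiteness)
```

In the toy image the first half is entirely dark and the second half is not dark at all, so the averaged dullness lands in the middle, in the same way as the 88.8% and 52.5% halves above average to 70.65%.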

Now that we know what this variable means, let's take a look at its values in the entire dataset:

In [None]:
train["ProfileImageDullness"].describe()

We can see that the dullest image has a value of 92.005, while it seems that there is a significant amount of instances with dullness 0, as the first quartile and the median are very small. However, we know that 341 of those instances are not really 0 (nor any other value, as they don't have a profile image), so maybe we should change their value to the mean dullness of those instances that do have a profile image. But first, let's take a look at the distribution of this variable and the target variable before the imputation of values:

In [None]:
instances_with_profile_image = train[train["PhotoAmt"] > 0].copy(deep=True)
utils.plot_histogram_and_density(data=instances_with_profile_image, x="ProfileImageDullness", hue=target,
                          additional_desc="before imputation")

The distribution is very skewed, so let's apply the natural logarithm:

In [None]:
dullness_log = instances_with_profile_image[["ProfileImageDullness", target]].copy(deep=True)
dullness_log["ProfileImageDullness"] = np.log(1 + dullness_log["ProfileImageDullness"])
utils.plot_histogram_and_density(data=dullness_log, x="ProfileImageDullness", hue=target,
                          additional_desc="before imputation (logarithmic scale)")

The distributions of dullness values for each value of the target variable are, overall, very similar.

Let's take a look at the dullest profile images:

In [None]:
utils.plot_images(train.sort_values('ProfileImageDullness', ascending = False).head(10), 2, 5, (20,6),
            "Dullest (darkest) profile images",
            subtitle_var="ProfileImageDullness", title_y=1.0)

We can see that some profile images are given a high dullness value because they have black bars at the sides of the image (the original aspect ratio was different), even though they are not really that dull compared to the ones that don't have those bars. However, we can keep the variable, as this kind of profile image may not be as attractive either, because the amount of information that is shown could be smaller.

Let's now see the least dull profile images, in particular those whose value is 0.0:

In [None]:
utils.plot_images(train[train["PhotoAmt"] > 0].sort_values('ProfileImageDullness', ascending=True).head(10),
                  2, 5, (20,8), "Less dull profile images (0.0)", subtitle_var="PetID", title_y=1.0)

We can see a big difference compared to the dullest images: we can say that they are more attractive in general, even if this effect may not be 'propagated' to the target variable, as we saw previously. However, some images with a dullness of 0.0 are not very bright either, for example the fifth one:

In [None]:
profile_image_path = f'../input/petfinder-adoption-prediction/train_images/abbe2568f-1.jpg'
if os.path.isfile(profile_image_path):
    width, height = utils.get_dimensions(profile_image_path)
    print("Total number of pixels in the image:", width*height)
    dullness, whiteness = utils.perform_color_analysis(profile_image_path, print_info=True, what_info="dullness")

Thus, the dullness strongly depends on the threshold that we establish, as for example in this image all the different colors are above RGB (20,20,20), but maybe we could say that it is more dull than other images that also have a dullness of 0.0.

Now, let's check the distribution of values of the target variable given the dullness when the fur color of the pet is not black, as black pets could be given a higher `ProfileImageDullness` value than other pets (especially if they occupy a big portion of the image):

In [None]:
instances_with_profile_image_no_main_black = instances_with_profile_image[
    (instances_with_profile_image["ColorName1"] != "Black")]
print("Number of pets without black (main) fur color:", instances_with_profile_image_no_main_black.shape[0])
utils.plot_boxplot(data=instances_with_profile_image_no_main_black, x="ProfileImageDullness", y=target,
            additional_desc="when the main fur color is not black")

We can see that in this sample the amount of dullness values that are greater than 40 is smaller, and in general the variable gives a little bit more information about the target variable than when black pets are included (the difference in the median is more visible than before, it seems to increase a little bit as the adoption speed increases when the pets end up being adopted).

Now, let's impute the `ProfileImageDullness` value of those instances that do not have a profile image using the mean of this variable when the instances have a profile image:

In [None]:
after_dullness_imputation = train[["PetID", "PhotoAmt", "ProfileImageDullness", target]
                                 ].copy(deep=True)
after_dullness_imputation.loc[after_dullness_imputation["PhotoAmt"] == 0,
        "ProfileImageDullness"] = train[train["PhotoAmt"] > 0]["ProfileImageDullness"].mean()

comparison_dullness_imputation = train.loc[train["PhotoAmt"] > 0, ["PetID", "PhotoAmt",
                                "ProfileImageDullness", target]].copy(deep=True)
comparison_dullness_imputation["Imputation"] = "Before"
after_dullness_imputation["Imputation"] = "After"
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
comparison_dullness_imputation = pd.concat([comparison_dullness_imputation, after_dullness_imputation])

utils.plot_boxplot(data=comparison_dullness_imputation, x="ProfileImageDullness", y=target, hue="Imputation",
             show_title=False)

After the imputation of the missing values of dullness using the mean of the instances that have a profile image, we can see that now the median of `ProfileImageDullness` when the target variable is actually 4 increases (as we know that the majority of instances without profile image ended up not being adopted after 100 days), which is good as now the median of `ProfileImageDullness` seems to increase a little bit when `AdoptionSpeed` increases. However, the IQR does not follow this trend, so this variable by itself may not give additional information in the usual range of values, but we can see that beyond a value of 70, it is more likely that the target variable is 4: 

In [None]:
utils.plot_histogram_and_density(data=train[train["ProfileImageDullness"] > 40], x="ProfileImageDullness", hue=target,
                          additional_desc="(ProfileImageDullness > 40)")

#### Whiteness

This variable is obtained in a similar way as the dullness, but in this case, what we measure is the amount of brightness in the image, with the threshold established at RGB (240,240,240) (all colors above are considered 'bright'). We can take an example again:

In [None]:
pet_id = train[train["PhotoAmt"] > 0].sample(1, random_state=seed)["PetID"].values[0]
profile_image_path = f'../input/petfinder-adoption-prediction/train_images/{pet_id}-1.jpg'
if os.path.isfile(profile_image_path):
    width, height = utils.get_dimensions(profile_image_path)
    print("Total number of pixels in the image:", width*height)
    dullness, whiteness = utils.perform_color_analysis(profile_image_path, print_info=True, what_info="whiteness")

As we can see, the image is not very bright, and that is reflected in the `ProfileImageWhiteness` value: 7.125, that is, the estimated percentage of bright colors in the image is 7.125. We can also see that if this estimation were done on the 1000 most dominant colors of the entire image, its value would probably be higher, as there is a big white object in the background (so the whiteness of the first half is greater), while the second half of the image is not bright at all (there are some white or near white colors in the pet's fur, but they are not among the 1000 most dominant ones, because the whiteness of that half is 0.0).

On the other hand, not being bright does not necessarily mean that it is dull or too dark (and vice versa):

In [None]:
profile_image_path = f'../input/petfinder-adoption-prediction/train_images/{pet_id}-1.jpg'
if os.path.isfile(profile_image_path):
    width, height = utils.get_dimensions(profile_image_path)
    print("Total number of pixels in the image:", width*height)
    dullness, whiteness = utils.perform_color_analysis(profile_image_path, print_info=True, what_info="dullness")

In this case the dullness is also small, 6.19 overall, and something similar to what happened with the whiteness occurs here: one half of the image is not dull at all, while the other one is darker, so the overall value is not large. Thus, we can have images which are neither dull nor bright.

Let's now check the distribution of values of `ProfileImageWhiteness`, first without the instances that don't have a profile image (value 0.0), and then we will impute these values using the mean of the rest of the instances:

In [None]:
instances_with_profile_image["ProfileImageWhiteness"].describe()

First of all, it seems that there are a lot of 0.0 values, as the median is really small (0.1). Given that the maximum value is 98.45, the scale of the histogram and density plots lets us extract very little from them:

In [None]:
utils.plot_histogram_and_density(data=instances_with_profile_image, x="ProfileImageWhiteness", hue=target)

Let's apply the natural logarithm:

In [None]:
whiteness_log = instances_with_profile_image[["ProfileImageWhiteness", target]].copy(deep=True)
whiteness_log["ProfileImageWhiteness"] = np.log(1 + whiteness_log["ProfileImageWhiteness"])
utils.plot_histogram_and_density(data=whiteness_log, x="ProfileImageWhiteness", hue=target)

Even after applying the natural logarithm the distribution is still very skewed. We can see that the distribution of the target variable remains more or less the same compared to the entire dataset in those values closer to 0.0.  Let's take a closer look at those instances whose whiteness is greater than 0.1 in the original range of values, the median:

In [None]:
utils.plot_boxplot(data=instances_with_profile_image[instances_with_profile_image["ProfileImageWhiteness"] > 0.1],
            x="ProfileImageWhiteness", y=target, additional_desc="before imputation (and > 0.1)")

It seems that the instances that ended up not being adopted (4) are slightly more gathered in lower `ProfileImageWhiteness` values than the rest of majority values of the target variable, but in general there are no big differences among them.

Let's now see the profile images where white or near white colors are the most dominant:

In [None]:
utils.plot_images(train.sort_values('ProfileImageWhiteness', ascending = False).head(10), 2, 5, (20,6),
            "Profile images where white or near white colors are more dominant",
            subtitle_var="ProfileImageWhiteness", title_y=1.05)

As we can see, a high whiteness value is often due to a collage with white background, or just a white background that gives more focus to the pet. The one with the greatest value is basically noise.

The following is a sample of the profile images that don't have near-white dominant colors:

In [None]:
utils.plot_images(train[train["PhotoAmt"] > 0].sort_values('ProfileImageWhiteness',
                                                           ascending = True).head(10),
                  2, 5, (20,8),
                  "Profile images where white or near white colors are not dominant (0.0)",
                  subtitle_var="ProfileImageWhiteness", title_y=1.0)

We can see that the lighting conditions of the image may affect the whiteness measure more than the dullness: a dark fur color will look dark regardless of the illumination, but a white fur color may not look as white. We can see that in the second image, whose `ProfileImageWhiteness` value is 0.0.

On the other hand, a white fur color does not seem to affect the whiteness value, while a black fur color did affect the dullness:

In [None]:
instances_with_profile_image_no_main_white = instances_with_profile_image.loc[
    (instances_with_profile_image["ColorName1"] != "White") & 
    (instances_with_profile_image["ColorName2"] != "White"),
    ["PetID", "ProfileImageWhiteness", target]].copy(deep=True)
print("Number of pets without white (main) fur color:", instances_with_profile_image_no_main_white.shape[0])
comparison_whiteness = instances_with_profile_image[["PetID", "ProfileImageWhiteness", target]].copy(deep=True)
comparison_whiteness["White Pets"] = "With"
instances_with_profile_image_no_main_white["White Pets"] = "Without"
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
comparison_whiteness = pd.concat([comparison_whiteness, instances_with_profile_image_no_main_white])

utils.plot_boxplot(data=comparison_whiteness, x="ProfileImageWhiteness", y=target, hue="White Pets",
             show_title=False)

Let's impute now the `ProfileImageWhiteness` value of those instances without profile image using the mean value (5.26688) of those that do have it:

In [None]:
after_whiteness_imputation = train[["PetID", "PhotoAmt", "ProfileImageWhiteness", target]
                                 ].copy(deep=True)
after_whiteness_imputation.loc[after_whiteness_imputation["PhotoAmt"] == 0,
        "ProfileImageWhiteness"] = train[train["PhotoAmt"] > 0]["ProfileImageWhiteness"].mean()

comparison_whiteness_imputation = train.loc[train["PhotoAmt"] > 0, ["PetID", "PhotoAmt",
                                "ProfileImageWhiteness", target]].copy(deep=True)
comparison_whiteness_imputation["Imputation"] = "Before"
after_whiteness_imputation["Imputation"] = "After"
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
comparison_whiteness_imputation = pd.concat([comparison_whiteness_imputation, after_whiteness_imputation])

utils.plot_boxplot(data=comparison_whiteness_imputation, x="ProfileImageWhiteness", y=target, hue="Imputation",
             show_title=False)

The imputation has a similar effect to the dullness one: the median and the IQR increase, especially when the target variable is 4.
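
Since the same imputation is applied to both dullness and whiteness, it could be wrapped in a small helper. The following is a sketch on toy data, assuming (as above) that `PhotoAmt == 0` marks the instances without a profile image; `impute_missing_profile_image` is a hypothetical name, not a function of `utils`.

```python
import pandas as pd

def impute_missing_profile_image(df, column, photo_col="PhotoAmt"):
    """Replace `column` with the mean over instances that have a profile image
    wherever `photo_col` is 0. Returns a modified copy of `df`."""
    has_photo = df[photo_col] > 0
    fill_value = df.loc[has_photo, column].mean()
    out = df.copy()
    # Series.where keeps the values where the condition holds, fills elsewhere
    out[column] = out[column].where(has_photo, fill_value)
    return out

toy = pd.DataFrame({"PhotoAmt": [2, 0, 1],
                    "ProfileImageWhiteness": [4.0, 0.0, 8.0]})
imputed = impute_missing_profile_image(toy, "ProfileImageWhiteness")
print(imputed)
```

Returning a copy keeps the original frame untouched, which makes before/after comparisons like the boxplots above straightforward.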

Maybe we should try to build a new variable with dullness and whiteness, since they will be hard to handle due to being very skewed, for example:

In [None]:
instances_with_profile_image["AggDullnessWhiteness"] = np.log(1 +
        (instances_with_profile_image["ProfileImageDullness"] * instances_with_profile_image["ProfileImageWhiteness"]))

In [None]:
utils.plot_boxplot(data=instances_with_profile_image,x="AggDullnessWhiteness", y=target)

This new variable seems to give more information about the target variable than `ProfileImageDullness` and `ProfileImageWhiteness` each one by itself, even though the number of 0.0 values is higher and thus the median is still 0.0:

In [None]:
instances_with_profile_image["AggDullnessWhiteness"].describe()

We could try two different pipelines, one using the original variables and another using their aggregation in place of them, in order to see which approach is better.

#### Blurrness

This variable gives us an estimate of how blurry an image is, computed with the Laplacian filter, which is applied to the image using convolutions. In particular, a small `ProfileImageBlurrness` value means that the image can be considered 'blurry', whereas a high value means that the image is very sharp.
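As an illustration of the idea, the following sketch convolves a grayscale image with a 3x3 Laplacian kernel using plain NumPy and summarises the response by its variance. The function name `laplacian_sharpness`, the specific kernel and the use of the variance as the summary statistic are assumptions made for this example; they are not necessarily how `ProfileImageBlurrness` was computed.

```python
import numpy as np

# 3x3 discrete Laplacian kernel (4-neighbour version); an assumed choice.
LAPLACIAN_KERNEL = np.array([[0.0, 1.0, 0.0],
                             [1.0, -4.0, 1.0],
                             [0.0, 1.0, 0.0]])


def laplacian_sharpness(gray):
    """Variance of the Laplacian response of a 2D grayscale image.

    Low values suggest a blurry image, high values a sharp one.
    """
    gray = np.asarray(gray, dtype=float)
    height, width = gray.shape
    response = np.zeros((height - 2, width - 2))
    # "Valid" convolution: slide the kernel over every interior pixel.
    # The kernel is symmetric, so convolution and correlation coincide.
    for dy in range(3):
        for dx in range(3):
            response += (LAPLACIAN_KERNEL[dy, dx]
                         * gray[dy:dy + height - 2, dx:dx + width - 2])
    return response.var()


# Hypothetical images: a flat one and a checkerboard with sharp edges.
flat_image = np.full((8, 8), 128.0)
checkerboard = (np.indices((8, 8)).sum(axis=0) % 2) * 255.0

print("Flat image:", laplacian_sharpness(flat_image))
print("Checkerboard:", laplacian_sharpness(checkerboard))
```

A flat image has no edges, so its Laplacian response is constant and the score is zero, while the checkerboard, full of abrupt transitions, gets a large score.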

As before, let's study this variable without the instances that don't have a profile image, and then again after the imputation of values for those instances using the mean:

In [None]:
instances_with_profile_image["ProfileImageBlurrness"].describe()

In [None]:
utils.plot_histogram_and_density(data=instances_with_profile_image, x="ProfileImageBlurrness", hue=target,
                          additional_desc="before imputation")

As we can see, there is a wide variety of values, but the distribution is not as strongly right-skewed (with the mass concentrated at low values) as in the two previous variables. In fact, the decrease from the highest count is gradual, and if we apply the natural logarithm the result fits a normal distribution very well:

In [None]:
blurrness_log = instances_with_profile_image[["ProfileImageBlurrness", target]].copy(deep=True)
blurrness_log["ProfileImageBlurrness"] = np.log(1 + blurrness_log["ProfileImageBlurrness"])
utils.plot_histogram_and_density(data=blurrness_log, x="ProfileImageBlurrness", hue=target,
                          additional_desc="before imputation (logarithmic scale)")

In [None]:
utils.plot_boxplot(data=instances_with_profile_image, x="ProfileImageBlurrness", y=target,
            additional_desc="before imputation")

What we see here is not what we expected, but at least there is a clear tendency: the lower the `ProfileImageBlurrness`, the more likely it is that the `AdoptionSpeed` value is smaller. However, this does not mean that profiles with a very blurry image are likely to be adopted earlier: we are talking about the median, the IQR and the range covered by the whiskers, where the images are not really that blurry:

In [None]:
utils.plot_images(instances_with_profile_image[
                (instances_with_profile_image["ProfileImageBlurrness"] > 545) &
                (instances_with_profile_image["ProfileImageBlurrness"] < 585)
            ].sample(10, random_state=seed), 2, 5, (20,8),
            "",
            subtitle_var="ProfileImageBlurrness", title_y=1.0)

However, we can clearly see the difference between these images:

In [None]:
utils.plot_images(train[train["PhotoAmt"] > 0].sort_values('ProfileImageBlurrness',
                                                           ascending = True).head(10),
                  2, 5, (20,8), "Most blurry profile images",
                  subtitle_var="ProfileImageBlurrness", title_y=1.0)

And these ones:

In [None]:
utils.plot_images(train.sort_values('ProfileImageBlurrness', ascending = False).head(10), 2, 5,
                  (20,8), "Sharpest profile images",
                  subtitle_var="ProfileImageBlurrness", title_y=1.0)

Let's impute the `ProfileImageBlurrness` value for the 341 instances without a profile image:

In [None]:
after_blurrness_imputation = train[["PetID", "PhotoAmt", "ProfileImageBlurrness", target]
                                 ].copy(deep=True)
after_blurrness_imputation.loc[after_blurrness_imputation["PhotoAmt"] == 0,
        "ProfileImageBlurrness"] = train[train["PhotoAmt"] > 0]["ProfileImageBlurrness"].mean()

comparison_blurrness_imputation = train.loc[train["PhotoAmt"] > 0, ["PetID", "PhotoAmt",
                                "ProfileImageBlurrness", target]].copy(deep=True)
comparison_blurrness_imputation["Imputation"] = "Before"
after_blurrness_imputation["Imputation"] = "After"
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
comparison_blurrness_imputation = pd.concat([comparison_blurrness_imputation, after_blurrness_imputation])

utils.plot_boxplot(data=comparison_blurrness_imputation, x="ProfileImageBlurrness", y=target, hue="Imputation",
             show_title=False)

After the imputation, the distributions of the usual values of blurrness when the target variable is 0 or 1 seem to be more similar, so the previous assumption holds a little less well.

**\*\*As a remark, for the 3 previous variables we have imputed using the mean, but the coefficient of variation was greater than 1 in every case, so the mean may not be the most representative value. Thus, we should try different methods, or keep using the mean but apply the natural logarithm first, so that the distribution is not as skewed.**
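The remark above can be illustrated with a small sketch on made-up data. It computes the ratio between the standard deviation and the mean, and compares three candidate fill values for a skewed variable: the mean, the median, and the mean computed in logarithmic scale and mapped back. The helper names, the toy values and the use of `NaN` to mark a missing profile image are assumptions made for illustration; they do not come from the dataset.

```python
import numpy as np
import pandas as pd


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return values.std() / values.mean()


def log_scale_mean(values):
    """Mean computed on log(1 + x), mapped back to the original scale."""
    return np.expm1(np.log1p(values).mean())


# Hypothetical skewed variable; NaN marks a missing profile image.
toy = pd.Series([0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 50.0, np.nan, np.nan])
observed = toy.dropna()

candidates = {
    "mean": observed.mean(),
    "median": observed.median(),
    "log-scale mean": log_scale_mean(observed),
}
print("Std / mean ratio:", round(coefficient_of_variation(observed), 2))
for name, fill_value in candidates.items():
    imputed = toy.fillna(fill_value)
    print(f"{name}: fill value = {fill_value:.2f}, "
          f"median after imputation = {imputed.median():.2f}")
```

With heavily skewed data the mean is pulled towards the long tail, so the median or the log-scale mean stays closer to the bulk of the observed values.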

#### Size

This variable indicates the size of the profile image file, so a greater size is likely to indicate better image quality. Let's see the statistics of `ProfileImageSize`:

In [None]:
train[train["PhotoAmt"]>0]["ProfileImageSize"].describe()

A difference between this variable and the three previous ones is that the coefficient of variation is smaller than 1, so it is not as skewed as they are. Hence, the mean is a representative value here and could be used to impute values; however, it now also makes sense to say that the `ProfileImageSize` of the instances without a profile image is 0.0, because there is no image. Thus, in this case we should consider both options. Let's see how the values are distributed along with the response of the target variable, before and after the imputation:

In [None]:
utils.plot_histogram_and_density(data=train, x="ProfileImageSize", hue=target,
                          additional_desc="before imputation")

We can see that when the size is greater than 25000, the target value is more likely to be 4 than 2 (as in the original train distribution), and beyond approximately 40000 the gap between them widens; on the other hand, when the image size is smaller than 25000, the most likely value is 2 and the probabilities of the target being 1, 3 or 4 are more similar. However, when the image size is 0.0 or close to it, the probability of 4 increases again, as we already know. This peak of values may be an obstacle if we want to assume that the variable can be modelled as a normal distribution.

Let's see now the distribution when we use the mean to impute the values of those instances that don't have a profile image:

In [None]:
size_after_imputation = train[["PetID", "PhotoAmt", "ProfileImageSize", target]].copy(deep=True)
size_after_imputation.loc[size_after_imputation["PhotoAmt"] == 0, "ProfileImageSize"] = \
    size_after_imputation.loc[size_after_imputation["PhotoAmt"] > 0, "ProfileImageSize"].mean()

utils.plot_histogram_and_density(data=size_after_imputation, x="ProfileImageSize", hue=target,
                          additional_desc="after imputation")

As we can see, if we impute the values using the mean, we get a more uniform distribution, which in fact is not that different from the previous one (ignoring the 0.0 values): the most likely target value is still 4 around the original mean of the distribution computed without those 341 instances (even though the chances of getting a 4 are now greater). Moreover, we know that the absence of a profile image can be inferred from `PhotoAmt`, especially if we reduce its cardinality by creating a new variable that simply tells us whether there is at least 1 image or not.

In [None]:
comparison_size_imputation = train.loc[train["PhotoAmt"] > 0, ["PetID", "PhotoAmt",
                                "ProfileImageSize", target]].copy(deep=True)
comparison_size_imputation["Imputation"] = "Before"
size_after_imputation["Imputation"] = "After"
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
comparison_size_imputation = pd.concat([comparison_size_imputation, size_after_imputation])

utils.plot_boxplot(data=comparison_size_imputation, x="ProfileImageSize", y=target, hue="Imputation",
             show_title=False)

In both cases we get a similar result: it seems that the smaller the value of `ProfileImageSize`, the higher the probability that the target value is smaller, at least within the usual range of values of `ProfileImageSize` (although the density of what is plotted as outliers, beyond the right whisker, also seems to be higher when the target value is higher, as we know from the previous plots). This is similar to `ProfileImageBlurrness`, in that we would have expected the opposite effect (a higher image quality could have meant a better-looking profile); maybe people tend to adopt earlier the pets that seem to be in worse condition in the profile image, in order to help them and improve their lives.

The following are the profile images with the smallest file size:

In [None]:
utils.plot_images(train[train["PhotoAmt"] > 0].sort_values('ProfileImageSize',
                                                           ascending=True).head(5),
                  1, 5, (20,4), "Smallest profile images", subtitle_var="ProfileImageSize",
                  title_y=1.1)

And these are the ones with the biggest file size:

In [None]:
utils.plot_images(train[train["PhotoAmt"] > 0].sort_values('ProfileImageSize',
                                                           ascending=False).head(5),
                  1, 5, (20,4), "Biggest profile images", subtitle_var="ProfileImageSize",
                  title_y=1.0)

#### Width and Height

These variables represent the dimensions of the profile image of each profile.

In [None]:
train["ProfileImageWidth"].describe()

In [None]:
train["ProfileImageHeight"].describe()

We can see that there is not much variance, and that the usual ranges of values of both variables are very similar. Thus, maybe we should combine them in some way, but first let's see their distributions and the response of the target variable:

In [None]:
utils.plot_histogram_and_density(data=train, x=["ProfileImageWidth", "ProfileImageHeight"], hue=target,
                           comparison=True, figsize=(20,12))

First of all, we can see that their distributions cannot be considered normal: there are many peaks at common dimensions such as 300, 400, 480, etc., and some other dimensions that are very unusual. Thus, in this case imputing the values of the instances without a profile image does not really matter; in fact, it could be better to leave them set to 0.0.

In [None]:
utils.plot_boxplot(data=train, x=["ProfileImageWidth", "ProfileImageHeight"], y=target, figsize=(20,12),
             compare_dif_vars=True, comparison_orient="v")

Moreover, we can see that the usual ranges of values of these variables are almost the same regardless of the outcome of the target variable. We can also see that the number of different values of these variables is high, although we know that only around 6 or 7 of them prevail:

In [None]:
print("Number of unique profile image Width values:", train["ProfileImageWidth"].unique().size)
print("Number of unique profile image Height values:", train["ProfileImageHeight"].unique().size)

Thus, let's create another two variables in which we round the original values to the nearest hundred. This is similar to a discretization; in the test set we could encounter higher values that we didn't see here, so strictly speaking we should use intervals that cover all possible values, but for the current purpose we will just round the values:
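The interval-based alternative mentioned above can be sketched with `pd.cut`, using fixed edges whose last bin is open-ended so that any larger value met later still falls into some interval. The edges, the labels and the helper `discretize_dimension` are hypothetical choices made for illustration; they are not derived from the dataset.

```python
import numpy as np
import pandas as pd

# Assumed bin edges: the last bin is open-ended so that any dimension,
# however large, falls into some interval.
EDGES = [0, 50, 150, 250, 350, 450, np.inf]
LABELS = [0, 100, 200, 300, 400, 500]  # 500 stands for "450 or more"


def discretize_dimension(values):
    """Map raw pixel dimensions to the label of their interval."""
    # right=False makes every interval closed on the left: [low, high)
    binned = pd.cut(values, bins=EDGES, labels=LABELS, right=False)
    return binned.astype(int)


# Hypothetical widths, including one far larger than the rest.
widths = pd.Series([0, 120, 299, 400, 480, 5000])
print(discretize_dimension(widths).tolist())
```

Unlike plain rounding, the set of possible output values is fixed in advance, which is convenient if the result is later treated as a categorical variable.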

In [None]:
# Round each dimension to the nearest hundred (vectorized instead of row by row)
train["NewWidth"] = (train["ProfileImageWidth"] / 100).round().astype(int) * 100
train["NewHeight"] = (train["ProfileImageHeight"] / 100).round().astype(int) * 100

Now we will create two new variables that result from dividing the width by the height (that is, the aspect ratio): one will be computed with the original values and the other with the rounded ones:

In [None]:
# The aspect ratio stays at 0.0 for the instances without a profile image
has_photo = train["PhotoAmt"] > 0
train["ProfileImageAspectRatio"] = 0.0
train.loc[has_photo, "ProfileImageAspectRatio"] = (
    train.loc[has_photo, "ProfileImageWidth"] / train.loc[has_photo, "ProfileImageHeight"])
train["NewAspectRatio"] = 0.0
train.loc[has_photo, "NewAspectRatio"] = (
    train.loc[has_photo, "NewWidth"] / train.loc[has_photo, "NewHeight"])

Let's see now their distribution, and whether one seems to give more information about the target variable than the other:

In [None]:
axes = utils.plot_histogram_and_density(data=train, x=["ProfileImageAspectRatio", "NewAspectRatio"], hue=target,
                           comparison=True, figsize=(20,12), return_axes=True)
for ax in axes:
    ax.set_xticks([0,1,2,3,4,5])

We can see in the density plots that, in general, the differences among the distributions of the target variable's values are more pronounced when the aspect ratio is computed after rounding the width and the height to the nearest hundred.

In [None]:
utils.plot_images(train[train["PhotoAmt"] > 0].sort_values('NewAspectRatio',
                                                           ascending=True).head(5),
                  1, 5, (20,4), "Profile images with the smallest NewAspectRatio",
                  subtitle_var="NewAspectRatio", title_y=1.1)

In [None]:
utils.plot_images(train.sort_values("NewAspectRatio", ascending=False).head(5), 1, 5, (20,4),
                  "Profile images with the largest NewAspectRatio",
                  subtitle_var="NewAspectRatio", title_y=0.90)

Let's take these samples and compare the values that we would have as aspect ratio with the original width and height:

In [None]:
utils.plot_images(train[train["PhotoAmt"] > 0].sort_values('NewAspectRatio', ascending=True).head(5),
                  1, 5, (20,4),
                  "Profile images with the smallest NewAspectRatio (original aspect ratio values indicated)",
                  subtitle_var="ProfileImageAspectRatio", title_y=1.1)

In [None]:
utils.plot_images(train.sort_values("NewAspectRatio", ascending=False).head(5), 1, 5, (20,4),
            "Profile images with the largest NewAspectRatio (original aspect ratio values indicated)",
            subtitle_var="ProfileImageAspectRatio", title_y=0.90)

As we can see, the original aspect ratio of these images varies considerably, but to us they would probably look similar, and that is what we get with the `NewAspectRatio` variable, both for smaller and for larger aspect ratios.

## Summing up

The next step is feature engineering and feature subset selection. Here we sum up all the ideas we have seen, and also some additional remarks:

- `AdoptionSpeed` (target): it is imbalanced, with few instances of class 0, so we may need to explore oversampling techniques.

- `Name`: we will replace it by a variable which indicates whether there is a meaningful name (more than 2 letters) or not.

- `Age`: skewed and many different values (maybe we don't really need a granularity of months, just years), so we can try to discretize it.

- `Breed`: very high cardinality; we could replace it using frequency encoding (this also reduces the risk of overfitting that mean encoding could cause, as many breeds appear just once or a couple of times in the dataset). Only 5 instances in the train dataset lack the primary breed, but more than 10000 lack the secondary breed; we discard the secondary one, but the primary must be imputed using the secondary if it exists, or otherwise using the most common value given the `Type` of animal. We will also create the variable `PureBreed`, which resolves the ambiguity of the dataset description (mixed breed pets are not only those that have a secondary breed, but also those whose breed is specified as 'Mixed Breed' or 'Domestic {x} Hair').

- `Gender`, `Color1`, `Color2`, `Color3`, `Vaccinated`, `Dewormed`, `Sterilized`: one-hot encoding.

- `MaturitySize`, `FurLength` and `Health`: we could try both one-hot and ordinal encoding, as they can be treated as numeric values. As these variables can have a 'Not specified' value (this does not happen in the train dataset), we could replace those values with the mode when we use ordinal encoding.

- `Quantity`: as there are few instances with more than 7 or 8 pets, we can try to discretize this variable into 3 intervals (1 pet, between 2 and 6, and 7 or more, for example).

- `Fee`: we can create a new variable indicating whether there is a fee or not, or just leave it (we saw that there is no clear discretization).

- `State`: if we wanted to use the model in a different context (a different country) we would have to retrain it anyway; still, we need to represent each state in a way that makes the model more robust (we could encounter a state that we didn't see before). One option is, for example, the GDP per capita.

- `RescuerID`: as this is a unique identifier, we will replace it by the number of times each rescuer appears in the training dataset, which may also give us some clue about what type of rescuer each one is (we can also try to discretize this new variable for this purpose: a rescuer that has published 3 profiles will probably be similar to one that has published 4, but maybe not to another that has published 120).

- `VideoAmt` and `PhotoAmt`: they can be replaced by two variables that simply tell us whether there is at least one video or one photo, respectively, in the profile.
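Two of the ideas in the list above, the count encoding of `RescuerID` and the binary presence indicators for `PhotoAmt` and `VideoAmt`, can be sketched as follows. The helper names, the toy data frames, the new column names and the choice of 0 for rescuers unseen during fitting are assumptions made for illustration.

```python
import pandas as pd


def fit_count_encoding(train_column):
    """Learn how many times each category appears in the training data."""
    return train_column.value_counts()


def apply_count_encoding(column, counts):
    """Replace each category by its training count (0 if never seen)."""
    return column.map(counts).fillna(0).astype(int)


def has_any(column):
    """Binary indicator: 1 if the amount is greater than zero, else 0."""
    return (column > 0).astype(int)


# Hypothetical training and validation folds.
train_fold = pd.DataFrame({"RescuerID": ["a", "a", "b", "a", "c"],
                           "PhotoAmt": [0, 3, 1, 0, 7]})
valid_fold = pd.DataFrame({"RescuerID": ["a", "d"],
                           "PhotoAmt": [2, 0]})

# The counts are learned on the training fold only and then reused.
rescuer_counts = fit_count_encoding(train_fold["RescuerID"])
train_fold["RescuerCount"] = apply_count_encoding(train_fold["RescuerID"], rescuer_counts)
valid_fold["RescuerCount"] = apply_count_encoding(valid_fold["RescuerID"], rescuer_counts)
train_fold["HasPhoto"] = has_any(train_fold["PhotoAmt"])
valid_fold["HasPhoto"] = has_any(valid_fold["PhotoAmt"])

print(train_fold)
print(valid_fold)
```

Separating the fitting step from the application step keeps the counts independent of the validation data, which matters because this kind of encoding has to be recomputed in every training + validation split.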

#### Description

We can use the TF-IDF technique to extract information from the description text; of course, this may result in a huge number of columns, so we will have to condense them using SVD or PCA, for example. Moreover, this process will be carried out in each iteration of cross-validation, in order to avoid data leakage from the corresponding validation dataset.
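One way to guarantee that TF-IDF and the dimensionality reduction are refitted in every cross-validation iteration is to place them inside a scikit-learn pipeline. The sketch below is only an illustration under stated assumptions: the descriptions and the binary labels are made up, and the classifier, the number of components and the number of folds are arbitrary choices (the real target has five classes).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical descriptions and binary labels, only to make the sketch runnable.
descriptions = [
    "playful puppy looking for a loving home",
    "friendly kitten found near the market",
    "healthy adult dog vaccinated and dewormed",
    "shy cat needs a quiet and patient owner",
    "active puppy loves children and long walks",
    "calm kitten already litter trained",
    "rescued dog very loyal and obedient",
    "cute cat sterilized and ready for adoption",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Because TF-IDF and SVD live inside the pipeline, they are refitted on the
# training part of every fold and never see the validation descriptions.
text_pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=27912),
    LogisticRegression(),
)

scores = cross_val_score(text_pipeline, descriptions, labels, cv=2)
print(scores)
```

Fitting the vectorizer on the whole dataset before splitting would let the vocabulary and the document frequencies of the validation descriptions leak into training, which is exactly what the pipeline avoids.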

From the metadata, we will include:

- Language. Moreover, as 3.6% of the dataset descriptions could not be analyzed because they were in Malay or a mix of Malay and English, we will include an additional value in this variable so that those cases are also taken into account.

- Score and Magnitude. We saw that they do not provide much information about the target variable by themselves, so we may need to combine them.

- Maximum salience of any entity detected in the description.

We need to impute the values of the previous variables when the description is not empty but could not be analyzed by the AI.

#### Images

In the samples that we saw, the profile image is the sharpest and most representative photo when more than one is included in the profile (and also the only one that is actually shown when we search for pets), so we will extract features only from the profile images, using pre-trained CNNs. For this purpose we will also need to resize the images.

#### Image metadata

We have extracted:

- The sum of `pixelFraction` of the dominant colors.
- The number of entities.
- The concatenation of the description of those entities: this variable will not be used as we would need to apply a similar strategy to that of the `Description` variable, plus we saw that some tabular data was extracted from the image entities' descriptions. However, we may need to use it to correct some `Type` values, as we concluded.
- The maximum `topicality` of the entities that include 'cat' or 'dog' in their description.
- Whether there is text or not in the image.

#### Image properties

- `Dullness` and `Whiteness`: they are very skewed; we can try to combine them, as their distributions are very similar.

- `Blurrness`.

- `Size`.

- `Aspect Ratio` after discretizing `Width` and `Height` (rounding to the nearest hundred).

In general, we can compute the coefficient of variation of the numerical variables and set a threshold on it to decide whether a variable is skewed enough to add a new variable resulting from applying the natural logarithm to it. In fact, when we need to impute values of skewed variables, as their mean won't be very representative, we may need to explore other imputation strategies, such as regression trees, KNN or even a Box-Cox transformation.

We need to take into account that some of the features can be computed once, at the beginning of the process, but others (frequency, count, TF-IDF) can only be computed in each training + validation step, using just the training dataset, in order to avoid data leakage.
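The threshold idea could look like the sketch below, which adds a logarithmic version of every column whose ratio between standard deviation and mean exceeds a given value. The function name `add_log_features`, the `Log` suffix, the default threshold of 1.0 and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_log_features(frame, columns, threshold=1.0):
    """Add a `<name>Log` column for every column whose ratio between
    standard deviation and mean exceeds `threshold`."""
    result = frame.copy()
    for column in columns:
        mean = result[column].mean()
        if mean == 0:
            continue  # the ratio is undefined when the mean is zero
        if result[column].std() / mean > threshold:
            # log(1 + x) keeps zeros at zero and compresses the long tail.
            result[column + "Log"] = np.log1p(result[column])
    return result


# Hypothetical values: one heavily skewed column and one fairly symmetric one.
toy = pd.DataFrame({
    "Skewed": [0.0, 0.0, 1.0, 2.0, 100.0],
    "Symmetric": [9.0, 10.0, 10.0, 11.0, 10.0],
})
transformed = add_log_features(toy, ["Skewed", "Symmetric"])
print(transformed.columns.tolist())
```

Only the skewed column receives a logarithmic counterpart here; the original columns are kept so that both versions can be compared during feature selection.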