Import libraries and load data

In [None]:
#libraries
import numpy as np 
import pandas as pd 
import os
import json
import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import time
import datetime
from PIL import Image
from wordcloud import WordCloud
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
import gc
from catboost import CatBoostClassifier
from tqdm import tqdm_notebook
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import random
import warnings
warnings.filterwarnings("ignore")
from functools import partial
# use the full option names; the short forms are ambiguous in newer pandas versions
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 100)
import scipy as sp
from math import sqrt
from collections import Counter
from sklearn.metrics import confusion_matrix as sk_cmatrix

import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import eli5
from IPython.display import display

nltk.download('punkt')
nltk.download('stopwords')

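# words() without a language returns stop words for every language NLTK ships;
# a set keeps the membership test in cleaning() fast
STOP_WORDS = set(stopwords.words())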

In [None]:
breeds = pd.read_csv('../input/breed_labels.csv')
colors = pd.read_csv('../input/color_labels.csv')
states = pd.read_csv('../input/state_labels.csv')

train = pd.read_csv('../input/train/train.csv')

### Table of Contents

* [Data Overview](#data_overview)
    * [What data files are we dealing with](#1_1)
    * [What are the columns in the main dataset? What is the data size? (Any missing data?)](#1_2)
* [Cats vs Dogs](#cats_vs_dogs)
    * [How many cats and dogs do we have in the dataset?](#2_1)
    * [Adoption speed](#2_2)
* [Maturity Size](#maturity_size)
    * [What are the sizes of pets in pet finder system?](#3_1)
    * [What do the different maturity sizes look like?](#3_2)
* [Age](#age)
    * [What are the ages of pets in pet finder system?](#4_1)
* [Health](#health)
    * [What is the general health condition of pets in pet finder system?](#5_1)
* [Sentiment](#sentiment)
    * [What is the general sentiment of pets in the system?](#6_1)
    * [What are the top words for pets that get adopted?](#6_2)
* [Others](#others)
    * [7.1 Breeds](#7_1)
    * [7.2 Name](#7_2)
    * [7.3 Gender](#7_3)
    * [7.4 Colors](#7_4)
    * [7.5 Fur Length](#7_5)
    * [7.6 Fee](#7_6)
    * [7.7 Location (State)](#7_7)

## 1. Data overview <a class="anchor" id="data_overview"></a>

### 1.1 What data files are we dealing with? <a class="anchor" id="1_1"></a>

In [None]:
print(os.listdir("../input"))

### 1.2 What are the columns in the main dataset? What is the data size? (Any missing data?) <a class="anchor" id="1_2"></a>

In [None]:
train.head()

Some people left descriptions of their pets. These could be insightful, so let's map the sentiment data onto the main dataset.
While we're at it, let's also map breed IDs to breed names.

In [None]:
sentiment_dict = {}
for filename in os.listdir('../input/train_sentiment/'):
    with open('../input/train_sentiment/' + filename, 'r') as f:
        sentiment = json.load(f)
    pet_id = filename.split('.')[0]
    sentiment_dict[pet_id] = {}
    sentiment_dict[pet_id]['sentiment_magnitude'] = sentiment['documentSentiment']['magnitude']
    sentiment_dict[pet_id]['sentiment_score'] = sentiment['documentSentiment']['score']
    sentiment_dict[pet_id]['sentiment_language'] = sentiment['language']
    
print(f'{len(sentiment_dict)} sentiment found')
sentiment = pd.DataFrame.from_dict(sentiment_dict, orient='index')
train = train.merge(sentiment, left_on='PetID', right_index=True, how='left')
# print(train.loc[:5,['PetID','Description','sentiment_score','sentiment_magnitude','sentiment_language','AdoptionSpeed']])

In [None]:
breeds_dict = {k: v for k, v in zip(breeds['BreedID'], breeds['BreedName'])}
train['Breed1_name'] = train['Breed1'].apply(lambda x: '_'.join(breeds_dict[x].split()) if x in breeds_dict else 'Unknown')
# split() before joining, otherwise '_' is inserted between every character of the name
train['Breed2_name'] = train['Breed2'].apply(lambda x: '_'.join(breeds_dict[x].split()) if x in breeds_dict else '-')

In [None]:
train.info()

* We have almost 15k total dogs and cats in the dataset;
* Missing Data:
    * 1k+ pets don't have names (not surprising)
    * around 500 pets don't have sentiment data (but that's ok, we'll work with what we have)
* The dataset contains important information about pets: maturity size, age, breed, color, some characteristics and other things;
* There are images and metadata for pets which we could possibly use;
* There are separate files with labels for breeds, colors and states;
* Descriptions were analyzed using Google's Natural Language API providing sentiments and entities. (We could do this ourselves if we wanted but since the file was provided already, we'll just map it accordingly)
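
The missing-data counts in the list above can be checked directly instead of being read off `train.info()`. A minimal sketch, assuming a small toy frame that stands in for `train` (the rows are invented for illustration):

```python
import pandas as pd

# Toy stand-in for `train`; the values are made up for illustration only.
toy = pd.DataFrame({
    'PetID': ['a1', 'b2', 'c3', 'd4'],
    'Name': ['Rex', None, 'Milo', None],
    'sentiment_score': [0.3, None, 0.1, 0.5],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Running the same two lines on the real `train` frame would list every column with gaps, largest first.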

# Let the exploration begin!
## 2. Cats vs Dogs <a class="anchor" id="cats_vs_dogs"></a>
### 2.1 How many cats and dogs do we have in the dataset? <a class="anchor" id="2_1"></a>

In [None]:
train['Type'] = train['Type'].apply(lambda x: 'Dog' if x == 1 else 'Cat')
train['Adopted'] = train['AdoptionSpeed'].apply(lambda x: False if x == 4 else True)
plt.figure(figsize=(14, 6))
g = sns.countplot(x='Type', data=train, hue='Adopted')
plt.title('Number of cats and dogs in dataset')
ax=g.axes
for p in ax.patches:
     ax.annotate(f"{p.get_height()} -> {p.get_height() * 100 / train.shape[0]:.2f}%", (p.get_x() + p.get_width() / 2., p.get_height()),
         ha='center', va='center', fontsize=11, color='gray', rotation=0, xytext=(0, 10),
         textcoords='offset points')  

Looks promising: more than 70% of pets coming into the pet finder system get adopted. 
There are slightly more dogs than cats. However, cats have a slightly higher chance of getting adopted.
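
The percentages annotated on the plot are shares of the whole dataset, so the adoption rate within each type is easier to read from a grouped mean. A minimal sketch on invented rows (the toy frame is an assumption, only the column names follow the notebook):

```python
import pandas as pd

# Toy rows; only the column names match the notebook.
toy = pd.DataFrame({
    'Type': ['Dog', 'Dog', 'Dog', 'Dog', 'Cat', 'Cat', 'Cat', 'Cat'],
    'AdoptionSpeed': [4, 4, 1, 2, 0, 4, 3, 2],
})
toy['Adopted'] = toy['AdoptionSpeed'] != 4

# The mean of a boolean column is the share of True values,
# i.e. the adoption rate within each pet type.
adoption_rate = toy.groupby('Type')['Adopted'].mean()
print(adoption_rate)
```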

### 2.2 Adoption speed <a class="anchor" id="2_2"></a>

* 0 - Pet was adopted on the same day as it was listed.
* 1 - Pet was adopted between 1 and 7 days (1st week) after being listed.
* 2 - Pet was adopted between 8 and 30 days (1st month) after being listed.
* 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
* 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days). 

In [None]:
def make_count_plot(df, x, hue='AdoptionSpeed', title=''):
    """
    Plotting countplot with correct annotations.
    """
    g = sns.countplot(x=x, data=df, hue=hue)
    plt.title(f'AdoptionSpeed by {title}')
    ax = g.axes

    for p in ax.patches:
        ax.annotate(f"{p.get_height() * 100 / df.shape[0]:.2f}%", (p.get_x() + p.get_width() / 2., p.get_height()),
             ha='center', va='top', fontsize=11, color='grey', rotation=0, xytext=(0, 10),
             textcoords='offset points')

In [None]:
plt.figure(figsize=(14, 6));
g = sns.countplot(x='AdoptionSpeed', data=train)
plt.title('Adoption speed classes rates');
ax=g.axes

for p in ax.patches:
     ax.annotate(f"{p.get_height() * 100 / train.shape[0]:.2f}%", (p.get_x() + p.get_width() / 2., p.get_height()),
         ha='center', va='center', fontsize=11, color='gray', rotation=0, xytext=(0, 10),
         textcoords='offset points')
  
plt.figure(figsize=(18, 6));
plt.subplot(1, 2, 1)
make_count_plot(df=train[train['Type'] == 'Cat'], x='Type', title='pet Type(Cat)')
plt.subplot(1, 2, 2)
make_count_plot(df=train[train['Type'] == 'Dog'], x='Type', title='pet Type(Dog)')
plt.tight_layout()

We can see that some pets were adopted immediately, but these are rare cases: maybe someone wanted to adopt any pet, or the pet was lucky enough to be seen by a person who wanted a similar pet.

Not surprisingly, there is a roughly linear relationship: the higher the class (the longer pets stay in the system), the more pets fall into it.
Interestingly, cats are more likely than dogs to be adopted early. However, more dogs are adopted after several months.

For the purpose of the workshop, let's focus on dogs only from this point on.
## 3. Maturity Size <a class="anchor" id="maturity_size"></a>
Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
### 3.1 What are the sizes of pets in pet finder system? How does that relate to adopted speed? <a class="anchor" id="3_1"></a>

In [None]:
# copy so that columns added to `dog` later don't write into a slice of `train`
dog = train[train['Type'] == 'Dog'].copy()
plt.figure(figsize=(14, 6));
g = sns.countplot(x='MaturitySize', data=dog, hue='Adopted', orient='h')
plt.title('Maturity Size classes rates');
ax=g.axes

for p in ax.patches:
     ax.annotate(f"{p.get_height() * 100 / dog.shape[0]:.2f}%", (p.get_x() + p.get_width() / 2., p.get_height()),
         ha='center', va='center', fontsize=11, color='gray', rotation=0, xytext=(0, 10),
         textcoords='offset points')
  
plt.figure(figsize=(18, 10));
make_count_plot(df=dog, x='MaturitySize', title='Maturity Size')
# tight_layout only has something to adjust once the plot has been drawn
plt.tight_layout()

Interesting, all extra large dogs got adopted (did not expect that!). Most of the pets coming into the system are medium sized.
While extra large dogs always get adopted, next come small dogs (73% adopted), then large dogs (71% adopted), and lastly medium sized dogs (70% adopted).
There is no strong relationship between maturity size and adoption speed though. The distribution is almost the same throughout.

However, keep in mind how few extra large dogs actually come into the system. With so few of them, their adoption rate is not a reliable estimate.
Note that small dogs tend to be adopted earlier than dogs of other sizes.
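
To keep the small-sample caveat visible, the adoption rate can be reported next to the group size. A minimal sketch on invented rows (the toy frame and the names `n` and `adoption_rate` are assumptions for illustration):

```python
import pandas as pd

# Toy rows; size 4 (Extra Large) deliberately has a single dog.
toy = pd.DataFrame({
    'MaturitySize': [1, 1, 2, 2, 2, 2, 3, 3, 4],
    'Adopted': [True, True, True, False, True, False, True, False, True],
})

# Named aggregation: group size and adoption rate side by side.
summary = toy.groupby('MaturitySize')['Adopted'].agg(n='count', adoption_rate='mean')
print(summary)
```

A 100% rate backed by `n = 1` reads very differently from a 100% rate backed by hundreds of dogs, which is the point of showing both columns.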

*Cute doggo pics break*
### 3.2 What do the different maturity sizes look like? <a class="anchor" id="3_2"></a>
Check validity of data (whenever possible)

In [None]:
images = [i.split('-')[0] for i in os.listdir('../input/train_images/')]
size_dict = {1: 'Small', 2: 'Medium', 3: 'Large', 4: 'Extra Large'}
for m in size_dict.keys():
    df = dog.loc[dog['MaturitySize'] == m]
    top_breeds = list(df['Breed1_name'].value_counts().index)[:5]
    m = size_dict[m]
    print(f"Most common {m} Breeds:")

    fig = plt.figure(figsize=(25, 4))

    for i, breed in enumerate(top_breeds):
        # excluding pets without pictures
        b_df = df.loc[(df['Breed1_name'] == breed) & (df['PetID'].isin(images)), 'PetID']
        if len(b_df) > 1:
            pet_id = b_df.values[1]
        else:
            pet_id = b_df.values[0]
        ax = fig.add_subplot(1, 5, i+1, xticks=[], yticks=[])

        im = Image.open("../input/train_images/" + pet_id + '-1.jpg')
        plt.imshow(im)
        ax.set_title(f'Breed: {breed}')
    plt.show();

Ok. Some breeds appear across different maturity sizes, so the data is not entirely correct. But close enough!

## 4. Age <a class="anchor" id="age"></a>
Doggo funfact: Puppy (0-12 months), Adult (12-84 months), Senior (84 months == 7 years and above)
### 4.1 What are the ages of pets in pet finder system? How does that relate to adopted speed? <a class="anchor" id="4_1"></a>

In [None]:
data = []
for a in range(5):
    df = dog.loc[dog['AdoptionSpeed'] == a]

    data.append(go.Scatter(
        x = df['Age'].value_counts().sort_index().index,
        y = df['Age'].value_counts().sort_index().values,
        name = str(a)
    ))
    
layout = go.Layout(dict(title = "AdoptionSpeed trends by Age",
                  xaxis = dict(title = 'Age (months)'),
                  yaxis = dict(title = 'Counts'),
                  )
                  )
py.iplot(dict(data=data, layout=layout), filename='basic-line')

So what are the most common ages in the dataset?

In [None]:
dog['Age'].value_counts().head(10)

* We can see that young pets are adopted quite fast and most of them are adopted;
* Most pets are less than 4 months old with a huge spike at 2 months;
* It seems that a lot of people don't input exact age and write age in years (or multiples of 12)
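
The puppy / adult / senior split mentioned at the top of this section can be turned into a categorical feature with `pd.cut`. A minimal sketch, assuming the boundaries are read as puppy below 12 months, adult from 12 to 83 months and senior from 84 months up (the text leaves the exact edges open, and the ages below are invented):

```python
import pandas as pd

# Invented ages in months, chosen to sit on both sides of each boundary.
ages = pd.Series([0, 2, 11, 12, 36, 83, 84, 120], name='Age')

# right=False makes each bin include its left edge: [0, 12), [12, 84), [84, inf)
age_group = pd.cut(
    ages,
    bins=[0, 12, 84, float('inf')],
    right=False,
    labels=['Puppy', 'Adult', 'Senior'],
)
print(age_group.value_counts().sort_index())
```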

## 5. Health <a class="anchor" id="health"></a>
There are four features showing health of the pets:

* Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
* Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
* Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
* Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
### 5.1 What is the general health condition of pets in pet finder system? How does that relate to adopted speed? <a class="anchor" id="5_1"></a>

In [None]:
plt.figure(figsize=(20, 12));
plt.subplot(2, 2, 1)
make_count_plot(df=dog, x='Vaccinated', title='Vaccinated')
plt.xticks([0, 1, 2], ['Yes', 'No', 'Not sure']);
plt.title('AdoptionSpeed and Vaccinated');

plt.subplot(2, 2, 2)
make_count_plot(df=dog, x='Dewormed', title='Dewormed')
plt.xticks([0, 1, 2], ['Yes', 'No', 'Not sure']);
plt.title('AdoptionSpeed and Dewormed');

plt.subplot(2, 2, 3)
make_count_plot(df=dog, x='Sterilized', title='Sterilized')
plt.xticks([0, 1, 2], ['Yes', 'No', 'Not sure']);
plt.title('AdoptionSpeed and Sterilized');

plt.subplot(2, 2, 4)
make_count_plot(df=dog, x='Health', title='Health')
plt.xticks([0, 1, 2], ['Healthy', 'Minor Injury', 'Serious Injury']);
plt.title('AdoptionSpeed and Health');

plt.suptitle('Adoption Speed and health conditions');

* Almost all pets are healthy! Pets with minor injuries are rare and sadly most of them don't get adopted. Number of pets with serious injuries is negligible.
* It is interesting that people don't mind unvaccinated and unsterilized dogs. Maybe it ties in with the fact that most dogs are adopted at a young age, so adopters don't mind bringing them to the vet. (Or maybe... they want puppies in the future)
* Quite important is the fact that when there is no information about health condition (not sure), the probability of not being adopted is much higher.

Let's have a look at most popular health conditions combinations.
* We're going to combine all 4 health related attributes in this sequence (vaccinated, dewormed, sterilized, health)
* To help put the numbers into perspective, we will measure them against the base adoption speed rates.

Note: This method could apply to earlier stats as well

This is how it works:
* The dog dataset contains 8132 dogs in total;
* As we saw earlier, the base rate of dogs with Adoption speed 0 is 170 / 8132 = 0.0209;
* There are 1462 dogs with health 1_1_1_1 (vaccinated, dewormed, sterilized, healthy), of which 28 have Adoption speed 0, so the rate is 28 / 1462 = 0.0192;
* Dividing the unrounded rates gives 0.916, so dogs with health 1_1_1_1 are about 8% less likely than the base rate to land in adoption speed class 0.
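
The steps above can be reproduced in plain Python. This is only a sketch of the arithmetic, using the counts quoted in the list; the variable names are illustrative and do not appear elsewhere in the notebook:

```python
# Counts quoted in the list above.
total_dogs = 8132      # all dogs
speed0_dogs = 170      # dogs with AdoptionSpeed 0
group_size = 1462      # dogs with health combination 1_1_1_1
group_speed0 = 28      # of those, dogs with AdoptionSpeed 0

base_rate = speed0_dogs / total_dogs
group_rate = group_speed0 / group_size

# Same formula as in prepare_plot_dict: ratio to the base rate, as a percentage change.
relative_change_pct = (group_rate / base_rate) * 100 - 100

print(f"base rate:  {base_rate:.4f}")
print(f"group rate: {group_rate:.4f}")
print(f"change vs base: {relative_change_pct:+.1f}%")
```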

In [None]:
main_count = dog['AdoptionSpeed'].value_counts(normalize=True).sort_index()
def prepare_plot_dict(df, col, main_count):
    """
    Preparing dictionary with data for plotting.
    
    I want to show how much higher/lower the rates of Adoption speed are for the current column compared to the base values (as described above).
    First I calculate the base rates, then for each category in the column I calculate the rates of Adoption speed and find the difference from the base rates.
    
    """
    main_count = dict(main_count)
    plot_dict = {}
    for i in df[col].unique():
        val_count = dict(df.loc[df[col] == i, 'AdoptionSpeed'].value_counts().sort_index())

        for k, v in main_count.items():
            if k in val_count:
                plot_dict[val_count[k]] = ((val_count[k] / sum(val_count.values())) / main_count[k]) * 100 - 100
            else:
                plot_dict[0] = 0

    return plot_dict

def make_factor_plot(df, x, col, title, main_count=main_count, hue=None, ann=True, col_wrap=4):
    """
    Plotting countplot.
    Making annotations is a bit more complicated, because we need to iterate over axes.
    """
    # sns.factorplot was renamed to sns.catplot in seaborn 0.9;
    # hue=None is the default, so a single call covers both cases
    g = sns.catplot(x=col, col=x, data=df, kind='count', col_wrap=col_wrap, hue=hue);
    plt.subplots_adjust(top=0.9);
    plt.suptitle(title);
    ax = g.axes
    plot_dict = prepare_plot_dict(df, x, main_count)
    if ann:
        for a in ax:
            for p in a.patches:
                text = f"{plot_dict[p.get_height()]:.0f}%" if plot_dict[p.get_height()] < 0 else f"+{plot_dict[p.get_height()]:.0f}%"
                a.annotate(text, (p.get_x() + p.get_width() / 2., p.get_height()),
                     ha='center', va='center', fontsize=11, color='green' if plot_dict[p.get_height()] > 0 else 'red', rotation=0, xytext=(0, 10),
                     textcoords='offset points')  

dog['health'] = dog['Vaccinated'].astype(str) + '_' + dog['Dewormed'].astype(str) + '_' + dog['Sterilized'].astype(str) + '_' + dog['Health'].astype(str)

make_factor_plot(df=dog.loc[dog['health'].isin(list(dog.health.value_counts().index[:5]))], x='health', col='AdoptionSpeed', title='Counts of pets by main health conditions and Adoption Speed')

* Healthy, dewormed and non-sterilized pets tend to be adopted faster. (The pattern X_1_2_1 is recurring.)
* Weirdly enough, completely healthy dogs are more likely to not be adopted! Perhaps people pay attention to other characteristics and factors.
* Healthy pets with no information (the "not sure" value) also tend to be adopted less frequently (their non-adoption rate is about 50% above the base rate). Maybe people prefer having information, even if it is negative.

## 6. Sentiment <a class="anchor" id="sentiment"></a>
### 6.1 What is the general sentiment of pets in the system? How does that relate to the adoption speed? <a class="anchor" id="6_1"></a>

In [None]:
# Rewriting the make_count_plot to have it based on base adoption rate
def make_count_plot(df, x, hue='AdoptionSpeed', title='', main_count=main_count):
    """
    Plotting countplot with correct annotations.
    """
    g = sns.countplot(x=x, data=df, hue=hue);
    plt.title(f'AdoptionSpeed {title}');
    ax = g.axes

    plot_dict = prepare_plot_dict(df, x, main_count)

    for p in ax.patches:
        h = p.get_height() if str(p.get_height()) != 'nan' else 0
        text = f"{plot_dict[h]:.0f}%" if plot_dict[h] < 0 else f"+{plot_dict[h]:.0f}%"
        ax.annotate(text, (p.get_x() + p.get_width() / 2., h),
             ha='center', va='center', fontsize=11, color='green' if plot_dict[h] > 0 else 'red', rotation=0, xytext=(0, 10),
             textcoords='offset points')
        
data = []
for a in range(5):
    df = dog.loc[dog['AdoptionSpeed'] == a]

    data.append(go.Scatter(
        x = df['sentiment_score'].value_counts().sort_index().index,
        y = df['sentiment_score'].value_counts().sort_index().values,
        name = str(a)
    ))
    
layout = go.Layout(dict(title = "AdoptionSpeed trends by Sentiment Score",
                  xaxis = dict(title = 'Score'),
                  yaxis = dict(title = 'Counts'),
                  )
                  )
py.iplot(dict(data=data, layout=layout), filename='basic-line')

In [None]:
dog['lang'] = dog['PetID'].apply(lambda x: sentiment_dict[x]['sentiment_language'] if x in sentiment_dict else 'no')
plt.figure(figsize=(18, 10));
plt.tight_layout()
plt.subplot(1, 2, 1)
make_count_plot(df=dog, x='sentiment_language', title='Sentiment language')

plt.figure(figsize=(16, 6));
plt.subplot(1, 2, 1)
sns.violinplot(x="AdoptionSpeed", y="sentiment_score", data=dog);
plt.title('AdoptionSpeed by score');

plt.subplot(1, 2, 2)
sns.violinplot(x="AdoptionSpeed", y="sentiment_magnitude", data=dog);
plt.title('AdoptionSpeed by magnitude of sentiment');

* English seems to be the most common language used to write descriptions. If a description is written in Mandarin, the dog's chances of getting adopted seem to be lower.
* It seems that the lower the sentiment magnitude, the faster pets are adopted, though sentiment doesn't seem to carry much weight across adoption speeds.

### 6.2 What are the top words for pets that get adopted? <a class="anchor" id="6_2"></a>

In [None]:
def cleaning(text):
    """
    Convert to lowercase.
    Remove URL links, special characters and punctuation.
    Tokenize and remove stop words.
    """
    text = text.lower()
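    # raw strings avoid invalid-escape warnings for \S and \.
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # replace newlines with a space so words on adjacent lines don't fuse together
    text = re.sub(r'\n', ' ', text)
    text = re.sub('[’“”…]', '', text)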

    # removing the stop-words
    text_tokens = word_tokenize(text)
    tokens_without_sw = [word for word in text_tokens if word not in STOP_WORDS]
    text = " ".join(tokens_without_sw)

    return text

df_cleaned = dog.loc[dog['Description'].notna(), 'Description'].apply(cleaning)
word_count = Counter(" ".join(df_cleaned).split()).most_common(10)
word_frequency = pd.DataFrame(word_count, columns = ['Word', 'Frequency'])
print(word_frequency)

Typically what you would expect.

## 7. Others <a class="anchor" id="others"></a>
### 7.1 Breeds <a class="anchor" id="7_1"></a>

In [None]:
dog['Pure_breed'] = 0
# build the mask from `dog` itself rather than from `train`
dog.loc[dog['Breed2'] == 0, 'Pure_breed'] = 1

print(f"Rate of pure breed dogs in the data: {dog['Pure_breed'].sum() * 100 / dog['Pure_breed'].shape[0]:.4f}%.")

fig, ax = plt.subplots(figsize = (20, 18))
plt.subplot(1, 2, 1)
text_dog1 = ' '.join(dog.loc[:, 'Breed1_name'].fillna('').values)
wordcloud = WordCloud(max_font_size=None, background_color='black', collocations=False,
                      width=1200, height=1000).generate(text_dog1)
plt.imshow(wordcloud)
plt.title('Top dog breed1 in the system')
plt.axis("off")

plt.subplot(1, 2, 2)
text_dog2 = ' '.join(dog.loc[dog['Adopted'], 'Breed1_name'].fillna('').values)
wordcloud = WordCloud(max_font_size=None, background_color='black', collocations=False,
                      width=1200, height=1000).generate(text_dog2)
plt.imshow(wordcloud)
plt.title('Top dog breed1 adopted')
plt.axis("off")

* As expected, mixed breed is the most common breed both in the system and among adopted dogs. It is interesting, though, that 69% of dogs in the system are purebred, which most likely means they were abandoned :(
* No breed stands out as being adopted more than the others. Generally, the more often a breed appears in the system, the more adoptions it has.

### 7.2 Name <a class="anchor" id="7_2"></a>
Names don't really matter since new owners would most likely rename them.

Let's look at most common names to start

In [None]:
plt.figure(figsize=(16, 12));
text_dog = ' '.join(dog.loc[:, 'Name'].fillna('').values)
wordcloud = WordCloud(max_font_size=None, background_color='white',
                      width=1200, height=1000).generate(text_dog)
plt.imshow(wordcloud)
plt.title('Top dog names')
plt.axis("off")

plt.show()

It is worth noticing some things:
* Quite often people write simply who is there for adoption: "Pup", "Puppies"
* Very often the color of pet is written, sometimes gender
* And it seems that sometimes names can be strange or there is some info written instead of the name. Example: Urgent, Adoption

In [None]:
dog['Name'] = dog['Name'].fillna('Unnamed')
dog['No_name'] = 0
dog.loc[dog['Name'] == 'Unnamed', 'No_name'] = 1
print(f"Rate of unnamed dogs in data: {dog[dog['No_name'] == 1].shape[0] / dog.shape[0] * 100:.4f}%.")

plt.figure(figsize=(18, 8))
make_count_plot(df=dog, x='No_name', title='without name')

Less than 10% of pets don't have names, but they have a higher possibility of not being adopted.

#### "Bad" names

Some shorter names tend to be meaningless, and some names are shorter than 3 characters.
Here are some examples:

In [None]:
print(dog[dog['Name'].apply(lambda x: len(str(x))) == 3]['Name'].value_counts().tail())
print(dog[dog['Name'].apply(lambda x: len(str(x))) < 3]['Name'].unique())

These short names look meaningless, and pets with such names could have less success in adoption.

### 7.3 Gender <a class="anchor" id="7_3"></a>
 1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets

In [None]:
plt.figure(figsize=(18, 8))
make_count_plot(df=dog, x='Gender', title='by gender')

It seems that male pets are adopted faster than female ones. Profiles that represent a mixed-gender group of pets have noticeably lower chances.

### 7.4 Colors <a class="anchor" id="7_4"></a>

In [None]:
colors_dict = {k: v for k, v in zip(colors['ColorID'], colors['ColorName'])}
dog['Color1_name'] = dog['Color1'].apply(lambda x: colors_dict[x] if x in colors_dict else '')
dog['Color2_name'] = dog['Color2'].apply(lambda x: colors_dict[x] if x in colors_dict else '')
dog['Color3_name'] = dog['Color3'].apply(lambda x: colors_dict[x] if x in colors_dict else '')

In [None]:
plt.figure(figsize=(18, 8))
make_count_plot(df=dog, x='Color1_name', title='by main color')

We can see that the most common colors are black and brown. It is interesting that there are almost no gray or yellow dogs.

In a real analysis, we could take this further by seeing which combinations of colors are more popular (remember we have 3 color fields). We could also see how they relate to other characteristics, such as gender.
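
One way the color-combination idea could look, as a rough sketch: join the non-empty color names into a single key and count it. The rows below are invented, and `color_combo` is a hypothetical column name; only the `Color*_name` columns follow the notebook:

```python
import pandas as pd

# Toy rows; an empty string means "no color in this slot", as in the cell above.
toy = pd.DataFrame({
    'Color1_name': ['Black', 'Black', 'Brown', 'Black'],
    'Color2_name': ['Brown', 'Brown', '', 'White'],
    'Color3_name': ['', '', '', ''],
})

color_cols = ['Color1_name', 'Color2_name', 'Color3_name']

# Join the non-empty color names of each row into one combination key.
toy['color_combo'] = toy[color_cols].apply(
    lambda row: '_'.join(c for c in row if c), axis=1
)

combo_counts = toy['color_combo'].value_counts()
print(combo_counts)
```

The resulting `color_combo` column could then be passed to `make_count_plot` like any other categorical column.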

### 7.5 Fur Length <a class="anchor" id="7_5"></a>

 (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)

In [None]:
plt.figure(figsize=(18, 8))
make_count_plot(df=dog, x='FurLength', title='by Fur Length')

* We can see that most of the pets have short fur, and long fur is the least common
* Pets with long fur tend to have a higher chance of being adopted. It could be because they look prettier, or it could be noise due to the low count

Remember that some breeds have the hair length in the breed name? We could cross-check whether the `FurLength` values agree with it and clean the data based on that. (But let's keep it simple for now)
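
A rough sketch of what such a cross-check could look like, assuming a simple keyword rule ("short" in the breed name implies `FurLength` 1, "long" implies 3). Both the rule and the breed names below are invented for illustration and would need adjusting to the real breed labels:

```python
import pandas as pd

# Toy rows with made-up breed names; FurLength: 1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified.
toy = pd.DataFrame({
    'Breed1_name': ['Example_Short_Hair', 'Example_Long_Hair', 'Mixed_Breed', 'Example_Short_Hair'],
    'FurLength': [1, 3, 2, 3],
})

is_short = toy['Breed1_name'].str.contains('short', case=False)
is_long = toy['Breed1_name'].str.contains('long', case=False)
specified = toy['FurLength'] != 0

# Flag rows where the name implies one fur length but the column says another.
mismatch = specified & (
    (is_short & (toy['FurLength'] != 1)) | (is_long & (toy['FurLength'] != 3))
)
print(toy[mismatch])
```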

### 7.6 Fee <a class="anchor" id="7_6"></a>
One of the interesting features is the adoption fee. Some pets can be adopted for free, while adopting others requires paying a certain amount.

In [None]:
dog['Free'] = dog['Fee'].apply(lambda x: 'Free' if x == 0 else 'Not Free')
        
plt.figure(figsize=(18, 8))
make_count_plot(df=dog, x='Free', title='which are Free')

plt.figure(figsize=(18, 8))
sns.violinplot(x="AdoptionSpeed", y="Fee", data=dog)
plt.title('AdoptionSpeed by Fee')

Most pets are free, and it seems that asking for a fee slightly decreases the chance of adoption. 

Some fees go up to 3k (wow!). Maybe adopters need extra time to think about such high fees, which contributes to slower adoption. Let's look at the highest fees in the dataset and try to understand why these dogs are so expensive.

In [None]:
dog.sort_values('Fee', ascending=False)[['Name', 'Description', 'Fee', 'AdoptionSpeed']].head(10)

* It is interesting that pets with high fees tend to be adopted quite fast! Maybe people prefer to pay for "better" pets: healthy, trained and so on
* Most pets are given away for free and fees are usually lower than 100
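
The two statements about fees can be quantified with a couple of boolean means. A minimal sketch on invented fee values (the numbers are not taken from the dataset):

```python
import pandas as pd

# Invented fees; 0 means the pet is free.
fees = pd.Series([0, 0, 0, 50, 80, 150, 3000], name='Fee')

# Share of free listings among all listings.
share_free = (fees == 0).mean()

# Among paid listings only, the share with a fee below 100.
paid = fees[fees > 0]
share_paid_under_100 = (paid < 100).mean()

print(f"free: {share_free:.2%}, paid and under 100: {share_paid_under_100:.2%}")
```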

### 7.7 Location (State) <a class="anchor" id="7_7"></a>

In [None]:
states_dict = {k: v for k, v in zip(states['StateID'], states['StateName'])}
dog['State_name'] = dog['State'].apply(lambda x: '_'.join(states_dict[x].split()) if x in states_dict else 'Unknown')

In [None]:
dog['State_name'].value_counts(normalize=True)

This is the distribution of dogs in the pet finder system. Not surprisingly, most of the dogs are listed in KL and Selangor (the most urbanized areas).

Let's see how location impacts adoption speed for the top 3 states listed.

In [None]:
make_factor_plot(df=dog.loc[dog['State_name'].isin(list(dog.State_name.value_counts().index[:3]))], x='State_name', col='AdoptionSpeed', title='Counts of pets by states and Adoption Speed')

Selangor peeps seem to be more keen on adopting dogs compared to KL and Penang. Sad to see such a high number of unadopted dogs in Penang.