## Import Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing
import os # importing data
import seaborn as sns # visualization
import matplotlib.pyplot as plt # visualization

print('files in input directory:')
print(os.listdir("../input"))

## Read Data and overview

In [None]:
data = pd.read_csv("../input/train/train.csv")
data.head()

In [None]:
print('This dataset has {} rows and {} columns'.format(data.shape[0],data.shape[1]))

In [None]:
print('Number of entries and data type: \n\n')
data.info()

In [None]:
print('Basic statistics of numeric columns:')
data.describe()

* Should AdoptionSpeed be treated as categorical?
* Type, State, Health, Sterilized, Dewormed, Vaccinated, FurLength, MaturitySize and Gender are integer-coded and should be categorical.
* The data is quite clean, with few missing values. Name has the most missing entries.
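
The missing-value observation can be checked directly rather than read off `data.info()`. The snippet below is a minimal sketch on a made-up stand-in frame (the `toy` rows are invented for illustration, not taken from `train.csv`); on the real data, `toy` would be replaced by `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "Name": ["Nibble", np.nan, "Brisco", np.nan],
    "Age": [3, 1, 1, 4],
    "Description": ["Playful kitten", "Found near market", np.nan, "Good guard dog"],
})

# Count of missing values per column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)
# Share of rows that are missing, per column.
missing_share = (missing / len(toy)).round(2)
summary = pd.DataFrame({"missing": missing, "share": missing_share})
print(summary)
```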

## Data Cleaning

In [None]:
# change encoding back to descriptive text
cleanup_cat = {
    "Type": {1: "Dog", 2: "Cat"},
    # per the competition's data description, 0 is the "not specified" code
    # for MaturitySize, FurLength and Health
    "MaturitySize": {1: "Small", 2: "Medium", 3: "Large", 4: "Extra Large", 0: "Unsure"},
    "FurLength": {1: "Short", 2: "Medium", 3: "Long", 0: "Unsure"},
    "Gender": {1: "Male", 2: "Female", 3: "Group"},
    "Vaccinated": {1: "Yes", 2: "No", 3: "Unsure"},
    "Dewormed": {1: "Yes", 2: "No", 3: "Unsure"},
    "Sterilized": {1: "Yes", 2: "No", 3: "Unsure"},
    "Health": {1: "Healthy", 2: "Minor Injury", 3: "Serious Injury", 0: "Unsure"},
}

data.replace(cleanup_cat, inplace = True)



In [None]:
# convert categorical
convert = ['Type','State','Health','Sterilized','Dewormed','Vaccinated',
     'FurLength','MaturitySize','Gender','Color1','Color2','Color3','Breed1','Breed2']
data[convert] = data[convert].apply(lambda x: x.astype('category'))
data.head()

## Visualization

In [None]:
sns.set(font_scale = 5)
cat_col = list(data.select_dtypes("category").columns)
excluded_cat_col = [col for col in cat_col if col not in ['Breed1','Breed2','State']]
f, axes = plt.subplots(len(excluded_cat_col), 2, figsize=(100,550))  # one row per variable: counts | breakdown
axes_list = axes.flatten()  # flatten the 2D grid of axes row by row
f.suptitle('Analysis of AdoptionSpeed for categorical variables', y = 0.89, fontsize = 100)
for i, c in zip(range(0,len(excluded_cat_col)*2,2),excluded_cat_col):
    g1 = sns.countplot(x = c, data = data, ax = axes_list[i])

    # percentage
    total = data[c].count()
    for p in g1.patches:
        height = p.get_height()
        g1.text(p.get_x()+p.get_width()/2., height+40, '{0:.1%}'.format(height/total),ha = 'center')
        
    # stacked bar chart
    counter = data.groupby(c)['AdoptionSpeed'].value_counts().unstack()
    percentage_dist = 100 * counter.divide(counter.sum(axis = 1), axis = 0)
    g2 = percentage_dist.plot.bar(stacked = True, ax = axes_list[i+1], rot = 0)
    #g2 = sns.countplot(x = c, data = data, ax = axes_list[i+1], hue = "AdoptionSpeed", dodge = False)
    for p in g2.patches:
        height = p.get_height()
        g2.annotate('{:.0f} %'.format(height), (p.get_x()+.15*p.get_width(), p.get_y()+.4*height))
    
    

It is helpful to see how each variable interacts with AdoptionSpeed. The count of each variable is plotted on the left and its breakdown by AdoptionSpeed on the right. Breed1, Breed2 and State are excluded because they have too many distinct values. To include them, we could restrict the plots to the 10 most frequent values (a future follow-up).
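
One possible way to do the top-10 follow-up is to keep the most frequent values and lump the rest into a single bucket. This is only a sketch: the helper `keep_top_n`, the label `"Other"` and the breed codes below are hypothetical, not part of the notebook. The helper casts to `object` first because a categorical column would reject a label that is not among its categories.

```python
import pandas as pd

def keep_top_n(series, n=10, other_label="Other"):
    """Keep the n most frequent values and relabel the rest as other_label."""
    top = series.value_counts().nlargest(n).index
    # Cast to object so the new label is accepted even for categorical input.
    as_object = series.astype(object)
    return as_object.where(as_object.isin(top), other_label)

# Made-up breed codes for illustration only.
breeds = pd.Series([307, 307, 307, 266, 266, 265, 299, 141])
grouped = keep_top_n(breeds, n=2)
print(grouped.value_counts())
```

On the real data this could be applied as `keep_top_n(data["Breed1"])` before plotting, so the count and stacked-bar plots stay readable.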

#### AdoptionSpeed 
* 0 - Pet was adopted on the same day as it was listed. 
* 1 - Pet was adopted between 1 and 7 days (1st week) after being listed. 
* 2 - Pet was adopted between 8 and 30 days (1st month) after being listed. 
* 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.
* 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).

#### Findings
* **Overall**: No single variable appears to affect AdoptionSpeed significantly; the distribution of speeds is broadly consistent across the categories of each variable.
* **Type**: More dogs than cats. However, there are more cats than dogs in the test set ([Extensive Pet Finder EDA by JasonZivkovic](https://www.kaggle.com/jaseziv83/extensive-pet-finder-eda)). AdoptionSpeed shows a similar pattern for both types of animal.
* **Gender**: More females than males. Why are some pets grouped together? **Follow up 1**
* **Color**: Color does not seem to affect AdoptionSpeed; the distribution is similar across colors.
* **MaturitySize**: Mostly Medium, followed by Small. Large animals are far fewer. Extra Large pets seem to have the highest adoption rate, but this may be a result of their low count.
* **FurLength**: Animals seem well kept, since most of them have short or medium fur. Similarly, long-furred pets seem to have the highest adoption rate, but it is unclear how reliable this observation is since there are few such pets in the dataset.
* **Vaccinated, Dewormed and Sterilized**: Most animals are dewormed but neither vaccinated nor sterilized, which suggests that deworming is viewed as more important than vaccination or sterilization. It would be interesting to look at how these three variables overlap. ***Follow up 2***. It is also worth pointing out that being Sterilized does not seem to speed up adoption. Most pets are not sterilized (67.2%), yet this group has the highest adoption rate, as implied by its small purple (not adopted) section!
* **Health**: Most pets are healthy, which is good news. A very small portion of pets are seriously injured (0.23%), and these pets have a high risk of not being adopted (41%).
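
Several of these findings come with a caveat about small groups. One way to keep that caveat visible is to report each group's size next to its rate. The sketch below uses invented rows (not the real data) and treats AdoptionSpeed 4 as "not adopted", as in the legend above.

```python
import pandas as pd

# Made-up rows for illustration; the real notebook would use `data`.
toy = pd.DataFrame({
    "MaturitySize": ["Medium"] * 6 + ["Small"] * 3 + ["Extra Large"],
    "AdoptionSpeed": [4, 2, 3, 4, 1, 2, 4, 2, 1, 1],
})

# Share of pets that were not adopted (AdoptionSpeed == 4) per category,
# together with the number of pets the share is based on.
not_adopted = (
    toy["AdoptionSpeed"].eq(4)
    .groupby(toy["MaturitySize"])
    .agg(share_not_adopted="mean", n="size")
)
print(not_adopted.sort_values("n", ascending=False))
```

A rate computed from a single row, like the Extra Large group here, says very little, which is the same concern raised for Extra Large and long-furred pets.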



In [None]:
sns.set()
g = sns.FacetGrid(data, col = "Gender", xlim = (0,30))
g.map(sns.distplot,"Age", kde = False)
g.fig.suptitle("Histogram of Age for different gender")
g.fig.subplots_adjust(top = 0.8) # adjust title position

g2 = sns.FacetGrid(data, col = "Type")
g2.map(sns.countplot,"Gender")
g2.fig.suptitle("Count of Type by Gender")
g2.fig.subplots_adjust(top = 0.8) # adjust title position


### Follow up 1
Some animals are grouped together, presumably so that they can be adopted together. But why?

More cats than dogs are listed as a group (the third Gender value). The age distribution is similar across the three Gender values. However, group listings have fewer entries between 10 and 25 months old (Age is recorded in months).

In [None]:
g = sns.FacetGrid(data, col = "Vaccinated", row = "Dewormed", margin_titles = True)
g.map(sns.countplot,"Sterilized")
g.fig.suptitle("Count of Sterilized Animals")
g.fig.subplots_adjust(top = 0.8)
[plt.setp(ax.texts, text="") for ax in g.axes.flat] # remove the original texts
                                                    # important to add this before setting titles
g.set_titles(row_template = 'Dewormed - {row_name}', col_template = 'Vaccinated - {col_name}')

### Follow up 2

The relationship between Dewormed, Vaccinated and Sterilized is plotted in a FacetGrid. Rows are Dewormed, columns are Vaccinated, and each panel shows the count of animals by Sterilized status. For instance, the panel titled "Dewormed - Yes" and "Vaccinated - Yes" plots animals which are both dewormed and vaccinated. The diagonal panels are the noteworthy ones: dewormed and vaccinated, not dewormed and not vaccinated, and unsure on both. Because the counts concentrate on this diagonal, there is a strong relationship between Dewormed and Vaccinated. The relationship between Sterilized and the other two variables is harder to determine. Generally, animals which are neither dewormed nor vaccinated are likely not to be sterilized either.
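
The strength of these relationships can also be put into a number. The sketch below computes Cramér's V (0 means no association, 1 means perfect association) from a contingency table. This statistic is not used in the notebook itself; the helper `cramers_v` and the toy rows are assumptions for illustration, and no continuity correction is applied.

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Cramér's V for two categorical series (0 = no association, 1 = perfect)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts if the two variables were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(observed.shape) - 1))))

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Dewormed":   ["Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No"],
    "Vaccinated": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"],
    "Sterilized": ["Yes", "No", "No", "No", "Yes", "No", "No", "Yes"],
})
v_dewormed_vaccinated = cramers_v(toy["Dewormed"], toy["Vaccinated"])
v_dewormed_sterilized = cramers_v(toy["Dewormed"], toy["Sterilized"])
print(v_dewormed_vaccinated, v_dewormed_sterilized)
```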

In [None]:
num_col = list(data.select_dtypes(np.number).columns)
n_rows = int(np.ceil(len(num_col) / 3))  # round up so every variable gets an axis
f, axes = plt.subplots(n_rows, 3, figsize = (20,10))
f.suptitle("Distribution of numeric variables")
axes_list = axes.flatten()
for i,c in enumerate(num_col):
    sns.distplot(data[c], ax = axes_list[i])


Bars represent the histogram, while the line represents the kernel density estimate (KDE).
The KDE is a smoothed estimate of the probability density: the higher the curve at a value, the more likely data points are to fall near that value. Read up on it [here](https://mathisonian.github.io/kde/).
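
To make the idea concrete, here is a minimal sketch of a one-dimensional Gaussian KDE in plain NumPy. The helper name `gaussian_kde_1d`, the sample ages and the bandwidth of 3.0 are assumptions for illustration; seaborn chooses its bandwidth automatically, so its curve will look different.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    # Standardised distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

ages = np.array([1, 2, 2, 3, 4, 6, 12, 24])  # made-up ages for illustration
grid = np.linspace(-40, 80, 2401)
density = gaussian_kde_1d(ages, grid, bandwidth=3.0)

# A density should integrate to roughly 1 (simple Riemann sum over the grid).
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```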

#### Findings
* **Age** is a right-skewed distribution where the right tail is long. 
* **Quantity**: Number of pets represented in the profile. Not a normal distribution; the frequency decreases roughly exponentially as Quantity increases.
* **Fee**: Some outliers
* **VideoAmt**: Mostly no video
* **PhotoAmt**: Mostly between 0 to 5. Unsurprisingly, photos are more likely to be posted than video
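
Skewness can be quantified instead of judged by eye. The sketch below compares the sample skewness of a right-skewed column before and after a `log1p` transform. The fee values are invented, and the transform is a common option rather than something this notebook applies.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values (many zeros, one large outlier).
fee = pd.Series([0, 0, 0, 0, 0, 0, 50, 100, 150, 1000], dtype=float)

raw_skew = fee.skew()            # sample skewness of the raw values
log_skew = np.log1p(fee).skew()  # skewness after log(1 + x), which is defined at zero
print(raw_skew, log_skew)
```

A positive value indicates a long right tail; the transformed column is still skewed here, but far less so.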

In [None]:
sns.heatmap(data[num_col].corr())

### Correlation heatmap
No pair of variables is highly correlated. The highest pairwise correlation is between VideoAmt and PhotoAmt.
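
Rather than reading the strongest pair off the heatmap colors, it can be extracted from the correlation matrix. This is a sketch on invented numbers; on the real data, `toy` would be `data[num_col]`.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Age":      [1, 2, 3, 12, 24, 6],
    "PhotoAmt": [1, 3, 5, 2, 8, 4],
    "VideoAmt": [0, 0, 1, 0, 2, 1],
    "Fee":      [0, 100, 0, 50, 0, 0],
})
corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Pair with the largest absolute correlation.
top_pair = pairs.abs().idxmax()
print(top_pair, round(pairs[top_pair], 3))
```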

In [None]:
included_num = ['Age','PhotoAmt']
f, axes = plt.subplots(1, 2, figsize=(25,10))  # create plot
#axes_list = [item for sublist in axes for item in sublist]  # flatten axes
f.suptitle('Analysis of AdoptionSpeed for numerical variable')
for i, c in enumerate(included_num):
    g = sns.boxplot(x = "AdoptionSpeed", y = c, data = data, ax = axes[i], showfliers = False)
    

The numerical variables have many outliers, which are hidden in the box plots (`showfliers = False`). Moreover, Quantity, Fee and VideoAmt are excluded because their values are concentrated at a single value, mostly zero. A boxplot for them would just show a single median line.
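
For reference, the points that a box plot draws as outliers by default are those beyond 1.5 times the IQR from the quartiles. A minimal sketch of that rule follows; the helper `tukey_fences` and the photo counts are made up for illustration.

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Lower and upper limits beyond which a point counts as an outlier."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

photo_amt = np.array([0, 1, 1, 2, 3, 3, 4, 5, 5, 30])  # made-up counts
low, high = tukey_fences(photo_amt)
outliers = photo_amt[(photo_amt < low) | (photo_amt > high)]
print(low, high, outliers)
```

Note that `showfliers = False` only hides these points from the plot; they remain in `data`.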

#### Findings
* **Age**: Pets adopted within a reasonable amount of time (more than a day and less than 3 months after listing) show a smaller IQR than pets adopted on the day of listing or pets which are not adopted. The median age of pets which are not adopted is also slightly higher. This is expected, since adopters tend to prefer younger pets, but thankfully the relationship is not strong.
* **PhotoAmt**: For pets adopted within 30 days (AdoptionSpeed 0, 1 and 2), PhotoAmt mostly ranges from 2 to 5. For pets which are not adopted, it ranges from 1 to 4. Again, the number of photos may not affect AdoptionSpeed significantly.
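
The medians and IQRs discussed above can be tabulated as well as plotted. The sketch below does this for invented rows; the helper `iqr` is hypothetical, and on the real data `toy` would be `data`.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "AdoptionSpeed": [0, 0, 0, 1, 1, 1, 4, 4, 4],
    "Age": [1, 2, 12, 2, 2, 3, 3, 6, 24],
})

def iqr(s):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return s.quantile(0.75) - s.quantile(0.25)

# One row per AdoptionSpeed class with the median and IQR of Age.
summary = toy.groupby("AdoptionSpeed")["Age"].agg(median="median", iqr=iqr)
print(summary)
```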