> One of my assignments is to become an expert on Kaggle, so if possible, please vote for me or give me some advice! Thank you so much.

## Animal Care and Control Adopted Animals

In early 2017, the Bloomington Animal Shelter migrated management software from AnimalShelterNet to Shelter Manager. We attempted to preserve as much information as possible from the old system. The outcome information in the shelter data is scattered across multiple fields rather than one: for example, Dead on arrival, Put to sleep, Movement Type and others are all considered part of the outcome.

By analysing the dataset, we can learn what influences the adoption of animals and which features probably shape the stories of those animals.

This notebook is mainly about: 
* Data processing - Missing data, Data type, Outliers etc
* Data analytics and visualization - Bar, histogram, line, heatmap, time series

I'm still working on a regression for this dataset and will post it soon.

In [None]:
#prepare the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing

In [None]:
df=pd.read_csv('../input/animal-data/animal-data-1.csv')
df.head()

### Data Processing

In [None]:
print('Data Shape',df.shape)
df.info()

1. Several columns contain null values: 'intakereason', 'breedname', 'identichipnumber', 'returndate', 'returnedreason' and 'deceaseddate'.

2. The date features are stored as object dtype. They should be converted to datetime so that feature engineering and analysis are easier later on.

So I'll do three things here:
* Fix the missing data
* Change the data type
* Remove the outliers

In [None]:
df['intakedate'] = pd.to_datetime(df.intakedate)
df['movementdate'] = pd.to_datetime(df.movementdate)
df['returndate'] = pd.to_datetime(df.returndate)
df['deceaseddate'] = pd.to_datetime(df.deceaseddate)
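
If some date strings turn out to be malformed, `pd.to_datetime` raises by default. A minimal sketch on a made-up toy column (not the shelter data) shows how `errors='coerce'` turns unparseable values into `NaT` instead, so they can be counted and handled together with the other missing values:

```python
import pandas as pd

# Toy column: one valid timestamp, one malformed string, one missing value
raw = pd.Series(['2017-03-01 10:15:00', 'not a date', None])

# errors='coerce' converts anything that cannot be parsed into NaT
parsed = pd.to_datetime(raw, errors='coerce')

print(parsed)
print('Unparseable or missing:', parsed.isna().sum())
```

Whether coercing is appropriate depends on how many rows end up as `NaT`; it is worth checking that count before relying on the converted column.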

In [None]:
# Extract year, month and day from the intake and movement dates
df['year_take']=df.intakedate.dt.year
df['month_take']=df.intakedate.dt.month
df['day_take']=df.intakedate.dt.day

df['year_move']=df.movementdate.dt.year
df['month_move']=df.movementdate.dt.month
df['day_move']=df.movementdate.dt.day

After changing the data types, I checked the dataset and found some 'outliers' in the date columns, so I'm going to remove them to make the plots look better.

In [None]:
df['year_take'].unique()
df.groupby('year_take').count()

In [None]:
df['year_move'].unique()
df.groupby('year_move').count()

It seems we don't have much data before 2017, so I will just remove it.

In [None]:
df = df[df['year_take']>2016]
df = df[df['year_move']>2016]
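
As a compact illustration of the same idea, here is a sketch on a made-up toy frame (the values are invented, only the column names match the notebook). It counts rows per year with `value_counts` and then applies both year filters in a single boolean mask:

```python
import pandas as pd

# Toy data with one intake year far earlier than the rest
toy = pd.DataFrame({'year_take': [2009, 2017, 2018, 2018, 2019],
                    'year_move': [2017, 2017, 2018, 2019, 2019]})

# Rows per intake year, ordered by year, to spot sparse years
print(toy['year_take'].value_counts().sort_index())

# Keep rows where both years are after 2016; .copy() gives an independent frame
recent = toy[(toy['year_take'] > 2016) & (toy['year_move'] > 2016)].copy()
print(len(recent))
```

Note that a comparison such as `> 2016` is False for missing years, so rows without a movement date are dropped by this kind of filter as well.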

I have no way to recover the missing values, so I will replace the nulls with 0 or 'missing'.
Normally, when numerical data is missing, you can fill it with the column `mean()` or another statistic you prefer.

In [None]:
df['identichipnumber']=df['identichipnumber'].fillna(0)
df['istrial']=df['istrial'].fillna(0)
df['intakereason']=df['intakereason'].fillna('missing')
df['breedname']=df['breedname'].fillna('missing')
df.info()
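
For a truly numerical column, the mean (or median) imputation mentioned above could look like the sketch below. The column name `weight_kg` and its values are hypothetical and are not part of the shelter dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing values
toy = pd.DataFrame({'weight_kg': [2.0, np.nan, 3.0, np.nan, 10.0]})

# Fill with the mean of the observed values (5.0 here)
mean_filled = toy['weight_kg'].fillna(toy['weight_kg'].mean())

# Fill with the median instead, which is less sensitive to the large value (3.0 here)
median_filled = toy['weight_kg'].fillna(toy['weight_kg'].median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

I did not use this for 'identichipnumber' because a chip number is an identifier, so an average of it would have no meaning.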

### Feature Variables-Data Analytics

First I'm going to check which kind of animal the shelter has the most of: maybe dogs? Or cats?

In [None]:
sp_count=pd.DataFrame(df.groupby(['speciesname'], as_index=False)['id'].count())

df1=pd.DataFrame({'speciesname':sp_count.speciesname,'count':sp_count.id})
df1=df1.sort_values(by=['count'],ascending=False)

plt.figure(figsize=(10,8))
ax=sns.barplot(x=df1['count'],y=df1['speciesname'],palette='Set3')

plt.ylabel('Animal species')
plt.xlabel('count')
plt.title('Number of animals that have been to this shelter, by species')
plt.grid(alpha=0.3)
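
The count-and-sort step can probably be written more briefly with `value_counts`, which already returns the counts in descending order. A sketch on invented toy data:

```python
import pandas as pd

# Toy species column (invented values)
toy = pd.DataFrame({'speciesname': ['Cat', 'Dog', 'Cat', 'House Rabbit', 'Cat', 'Dog']})

# Counts per species, largest first
species_counts = toy['speciesname'].value_counts()

# Same information as a two-column frame, convenient for sns.barplot
species_df = species_counts.rename_axis('speciesname').reset_index(name='count')
print(species_df)
```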

In [None]:
#plt.figure(figsize=(25,10))
#sns.countplot(df.speciesname, palette='Set3')

#plt.xlabel('Animal species',fontsize=15)
#plt.ylabel('count',fontsize=15)
#plt.title('the number of animals has been to this shelter by specie',fontsize=20)
#plt.grid()
#plt.show()

As expected, cats are the most numerous, followed by dogs. Rabbits, rats, guinea pigs and birds are also fairly common in the shelter.
The first question is: why are they here? Is there any connection between the species and the reason for intake?

First we can check why the animals came to the shelter:

In [None]:
plt.figure(figsize=(25,10))
sns.countplot(x=df.intakereason, palette='Set3')

plt.xticks(rotation=90)
plt.xlabel('intakereason',fontsize=15)
plt.ylabel('count',fontsize=15)
plt.title('How the animals ended up at the shelter',fontsize=20)
plt.grid(alpha=0.6)
plt.show()

In [None]:
f, (ax1, ax2) = plt.subplots(1,2, figsize=(18,20))

sns.countplot(data=df, y='intakereason',hue='speciesname', ax=ax1, 
              palette='Set2', alpha=0.6)
sns.countplot(data=df, y='speciesname',hue='intakereason', ax=ax2, 
              palette='Set2', alpha=0.6)

ax1.set_title('Intakereason and Speciesname')
ax1.grid(alpha=0.5)
ax2.grid(alpha=0.5)
plt.show()

Oops, it seems we have too many species. Let me narrow it down to the two most common ones: Cat and Dog.

In [None]:
f, (ax1, ax2) = plt.subplots(1,2, figsize=(18,15))
x_cat=df.loc[(df['speciesname']=='Cat')]
x_dog=df.loc[(df['speciesname']=='Dog')]
#x_rabbit=df.loc[(df['speciesname']=='House Rabbit')]
df2=pd.concat([x_cat,x_dog])

sns.countplot(data=df2, y='intakereason',hue='speciesname', ax=ax1, 
              palette='Set2', alpha=0.6)
sns.countplot(data=df2, y='speciesname',hue='intakereason', ax=ax2, 
              palette='Set2', alpha=0.6)

ax1.set_title('Intakereason and Speciesname')
ax1.grid(alpha=0.5)
ax2.grid(alpha=0.5)
plt.show()
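
Selecting a handful of species can also be done in one step with `isin`, instead of building one frame per species and concatenating them. A sketch on invented toy rows:

```python
import pandas as pd

# Toy rows (invented values)
toy = pd.DataFrame({'speciesname': ['Cat', 'Dog', 'Rat', 'Cat', 'Bird'],
                    'intakereason': ['Stray', 'Stray', 'Moving', 'Moving', 'Stray']})

# Keep only the rows whose species is in the given list
cats_dogs = toy[toy['speciesname'].isin(['Cat', 'Dog'])]
print(cats_dogs)
```

One difference from `pd.concat` is that `isin` keeps the original row order rather than grouping all cats before all dogs, which does not matter for count plots.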

Next we can compare some other things:
* Movementtype and Sexname
* Movementtype and Speciesname(top3)
* Movementtype and Age

In [None]:
f, (ax1, ax2) = plt.subplots(2,1, figsize=(25, 15))
sns.countplot(data=df, x='movementtype',hue='sexname', ax=ax1, 
              palette='Set2', alpha=0.6)
sns.countplot(data=df, x='sexname',hue='movementtype', ax=ax2, 
              palette='Set2', alpha=0.6)

ax1.set_title('Movementtype and Sexname',fontsize=20)
ax1.grid(alpha=0.5)
ax2.grid(alpha=0.5)
plt.show()

The values are all close to each other, but slightly more males than females were reclaimed.
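
To compare the sexes by share rather than by raw count, a row-normalised cross-tabulation is one option. The sketch below uses invented toy rows, and the movement type labels in it are assumptions for illustration:

```python
import pandas as pd

# Toy rows (invented): 4 males and 4 females with different outcomes
toy = pd.DataFrame({
    'sexname': ['Male'] * 4 + ['Female'] * 4,
    'movementtype': ['Adoption', 'Adoption', 'Reclaimed', 'Reclaimed',
                     'Adoption', 'Adoption', 'Adoption', 'Reclaimed'],
})

# normalize='index' makes every row sum to 1, so each cell is a share within that sex
shares = pd.crosstab(toy['sexname'], toy['movementtype'], normalize='index')
print(shares)
```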

In [None]:
f, (ax1, ax2) = plt.subplots(2,1, figsize=(25,20))

x_rabbit=df.loc[(df['speciesname']=='House Rabbit')]
df2=pd.concat([x_cat,x_dog,x_rabbit])

sns.countplot(data=df2, x='movementtype',hue='speciesname', ax=ax1, 
              palette='Set2', alpha=0.6)
sns.countplot(data=df2, x='speciesname',hue='movementtype', ax=ax2, 
              palette='Set2', alpha=0.6)

ax1.set_title('Movementtype and Speciesname',fontsize=20)
ax1.grid(alpha=0.5)
ax2.grid(alpha=0.5)
plt.show()

Next I will put all the ages on the same scale and check the distribution.
I used a twin y-axis plot to see both the distribution and the total count for each age.

In [None]:
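# Keep the leading number of 'animalage' and convert it to float so it can be plotted
df['age']=df['animalage'].str.split(' ',expand=True)[0].astype(float)

plt.figure(figsize=(10,4), dpi= 80)
# Left axis: density histogram with a KDE curve
ax1=sns.histplot(df['age'],stat='density',kde=True,color="g")

# Right axis: raw counts in 60 bins
ax2 = ax1.twinx()
sns.histplot(df['age'], bins = 60, ax=ax2)

plt.title('Distribution of the age of the animals')
plt.grid(alpha=.6)
plt.show()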
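
Taking only the first token throws away any finer detail in the age text. If the strings look like `'3 years 2 months.'` (this format is an assumption, I have not verified it for every row), a sketch like the following could recover a fractional age in years:

```python
import pandas as pd

# Toy strings in the ASSUMED format '<n> years <m> months.'
toy = pd.Series(['3 years 2 months.', '0 years 6 months.', '11 years 0 months.'])

# Pull out the number in front of 'year' and the number in front of 'month'
years = toy.str.extract(r'(\d+)\s+year', expand=False).astype(float)
months = toy.str.extract(r'(\d+)\s+month', expand=False).astype(float)

# Combine into a single age expressed in years
age_years = years + months / 12
print(age_years)
```

Rows that do not match the assumed pattern would come out as NaN, so they would need to be inspected before plotting.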

In [None]:
def calc_age_category(x):
    x = float(x)
    if x < 3.: return 'young'
    if x < 5.: return 'young adult'
    if x < 10.: return 'adult'
    return 'old'
df['AgeCategory'] = df.age.apply(calc_age_category)
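
The same binning can be sketched with `pd.cut`, using the thresholds from `calc_age_category`. The ages below are invented toy values, and `right=False` makes each bin include its left edge, which matches the `<` comparisons in the function:

```python
import numpy as np
import pandas as pd

# Toy ages in years, chosen to sit on and around the thresholds
ages = pd.Series([0.0, 2.9, 3.0, 4.5, 5.0, 9.9, 10.0, 15.0])

# Bins: below 3, [3, 5), [5, 10), 10 and above
bins = [-np.inf, 3, 5, 10, np.inf]
labels = ['young', 'young adult', 'adult', 'old']

categories = pd.cut(ages, bins=bins, labels=labels, right=False)
print(categories.tolist())
```

A side effect of `pd.cut` is that the result is an ordered categorical, so plots tend to show the groups in age order instead of alphabetical order.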

In [None]:
plt.figure(figsize=(10,4))
sns.countplot(x=df.AgeCategory, palette='Set3')

plt.xticks(rotation=90)
plt.xlabel('Age type')
plt.ylabel('count')
plt.grid(alpha=0.6)
plt.show()

In [None]:
f, (ax1, ax2) = plt.subplots(2,1, figsize=(25,20))
sns.countplot(data=df, x='movementtype',hue='AgeCategory', ax=ax1,
             palette='Set1', alpha=0.6)
sns.countplot(data=df, x='AgeCategory',hue='movementtype', ax=ax2,
             palette='Set1', alpha=0.6)

ax1.set_title('Movementtype and Age',fontsize=15)
ax1.grid(alpha=0.5)
ax2.grid(alpha=0.5)
plt.show()

In [None]:
plt.figure(figsize=(10,25))
sns.countplot(data=df2, y='basecolour',hue='movementtype', palette='Set1')

plt.xlabel('count')
plt.ylabel('Color type')
plt.grid(alpha=0.6)
plt.show()

It seems black is the most common base colour and also the most adopted. Orange cats are fairly popular too. The ratio of adoptions to fosters is around 2:1 for all colours.
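
A ratio like this can be checked numerically instead of being read off the bars. The sketch below uses invented toy rows, and the labels `'Adoption'` and `'Foster'` are assumed spellings of the movement types:

```python
import pandas as pd

# Toy rows (invented): 6 black and 3 orange animals
toy = pd.DataFrame({
    'basecolour': ['Black'] * 6 + ['Orange'] * 3,
    'movementtype': ['Adoption'] * 4 + ['Foster'] * 2 + ['Adoption'] * 2 + ['Foster'],
})

# One row per colour, one column per movement type
counts = toy.groupby(['basecolour', 'movementtype']).size().unstack(fill_value=0)

# Adoptions per foster placement for every colour
counts['adoption_to_foster'] = counts['Adoption'] / counts['Foster']
print(counts)
```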

Time series:

Does the month influence the number of animals taken in?

In [None]:
group_year=df.groupby(['month_take','year_take'],as_index=False).count()

plt.figure(figsize=(10,4))
sns.lineplot(x="month_take",y='id', hue='year_take', data=group_year)
plt.grid()
plt.show()

It seems that every year the shelter takes in more animals during spring, probably because that is when animals give birth, which increases the numbers.
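
To look at the seasonal pattern as a table instead of a line plot, one could cross-tabulate intake month against intake year. A sketch on invented toy dates:

```python
import pandas as pd

# Toy intake dates (invented)
toy = pd.DataFrame({'intakedate': pd.to_datetime([
    '2017-04-02', '2017-04-20', '2017-11-05',
    '2018-04-11', '2018-05-01', '2018-05-09',
])})

# Rows are months, columns are years, cells are intake counts
monthly = pd.crosstab(toy['intakedate'].dt.month.rename('month'),
                      toy['intakedate'].dt.year.rename('year'))
print(monthly)
```

With the real data, a spring peak would show up as larger numbers in the rows for the spring months in every year column.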

In [None]:
plt.figure(figsize=(10,4))
df['movementtype'].groupby(df['movementdate']).count().plot(kind="line",alpha=.7)
plt.grid()
plt.show()

In [None]:
# Daily counts, one panel per movement type (seven panels)
monthGroup=df['movementdate'].groupby(df['movementtype'])

fig, axes = plt.subplots(7, 1, figsize=(15,25), sharex=True)
plt.subplots_adjust(hspace=0.7)
colors = list('rgbcmyk')
for i, (name, g) in enumerate(monthGroup):
    g.groupby(df["movementdate"]).count().plot(kind="line", ax=axes[i], title=name,
                                               color=colors[i],grid=True,alpha=.5)

In [None]:
# Monthly counts, one panel per movement type (seven panels)
df_ym=df.movementdate.dt.strftime('%Y-%m')
df_ym_move = df_ym.groupby(df["movementtype"])

fig, axes = plt.subplots(7, 1, figsize=(15, 25), sharex=True)
plt.subplots_adjust(hspace=0.7)
colors = list('rgbcmyk')
for i, (name, g) in enumerate(df_ym_move):
    g.groupby(df_ym).count().plot(kind="line", ax=axes[i], title=name,
                                  color=colors[i], grid=True,alpha=.5)
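
An alternative to formatting the dates as strings is to group on a monthly period, which keeps the index sortable as real dates. A sketch on invented toy rows, with assumed movement type labels:

```python
import pandas as pd

# Toy movements (invented)
toy = pd.DataFrame({
    'movementdate': pd.to_datetime(['2018-01-03', '2018-01-25', '2018-01-30',
                                    '2018-02-10', '2018-02-11']),
    'movementtype': ['Adoption', 'Adoption', 'Foster', 'Adoption', 'Foster'],
})

# Count movements per calendar month and movement type
monthly = (toy.groupby([toy['movementdate'].dt.to_period('M'), 'movementtype'])
              .size()
              .unstack(fill_value=0))
print(monthly)
```

Calling `monthly.plot(subplots=True)` on a table like this would draw one panel per movement type without a manual loop.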

In [None]:
df_heat=df.drop(columns=['intakedate','movementdate','id','identichipnumber',
                         'animalname','animalage','returndate','returnedreason',
                         'deceaseddate','deceasedreason',
                        'year_take','month_take','day_take',
                        'year_move','month_move','day_move',
                        'diedoffshelter','isdoa'])

# Label-encode every categorical column, keeping each fitted encoder for later decoding
categorical_cols = ['speciesname', 'intakereason', 'breedname', 'basecolour',
                    'sexname', 'location', 'movementtype', 'AgeCategory']
encoders = {}
for col in categorical_cols:
    encoders[col] = preprocessing.LabelEncoder()
    df_heat[col] = encoders[col].fit_transform(df_heat[col])
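
To make clear what the label encoding does, here is a small sketch on an invented list of species. `LabelEncoder` sorts the distinct values and assigns each an integer code in that order:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Toy labels (invented); codes follow the sorted order of the distinct values
codes = le.fit_transform(['Dog', 'Cat', 'House Rabbit', 'Cat'])

print(codes.tolist())                          # integer code per row
print(list(le.classes_))                       # position in this list is the code
print(list(le.inverse_transform([0, 1, 2])))   # decode back to the labels
```

Because the codes only reflect alphabetical order, a correlation computed on them depends on that arbitrary ordering, so the heatmap that follows is best read as a rough first look.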

In [None]:
# Correlations with movementtype
corr = df_heat.select_dtypes(include = ['float64', 'int64']).iloc[:, 1:].corr()
cor_dict = corr['movementtype'].to_dict()
del cor_dict['movementtype']
print("Numerical features in descending order of absolute correlation with movementtype:\n")
for ele in sorted(cor_dict.items(), key = lambda x: -abs(x[1])):
    print("{0}: {1}".format(*ele))
    
# Correlation matrix heatmap (numeric columns only)
corrmat = df_heat.select_dtypes(include='number').corr()
plt.figure(figsize=(12, 7))

# Order the variables by their correlation with movementtype, keeping all of them
k = len(corrmat)
cols = corrmat.nlargest(k, 'movementtype')['movementtype'].index
cm = np.corrcoef(df_heat[cols].values.T)

# Generate mask for upper triangle
mask = np.zeros_like(cm, dtype=bool)
mask[np.triu_indices_from(mask)] = True

sns.set(font_scale=1)
sns.heatmap(cm, mask=mask, cbar=True, annot=True, square=True,
            fmt='.2f', annot_kws={'size': 12}, yticklabels=cols.values,
            xticklabels=cols.values, cmap='PuBu', lw=.1)

plt.show()

No feature appears to be strongly correlated with movementtype. That is actually good: it suggests people adopt animals regardless of their age, breed and so on.