## Problem Statement

A leading pet adoption agency is planning to create a virtual tour experience for its customers, showcasing all the animals available in its shelter. To enable this experience, we need to build an ML model that determines the type and breed of each animal based on its physical attributes and other factors.

This is an ML hackathon problem posted on [HackerEarth](https://www.hackerearth.com/challenges/competitive/hackerearth-machine-learning-challenge-pet-adoption/).

In [None]:
# importing Libraries
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
# show all columns and rows when displaying a DataFrame
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

## Data

The data consists of the following columns:
1. pet_id            - Unique pet ID
2. issue_date        - Date on which the pet was issued to the shelter
3. listing_date      - Date on which the pet arrived at the shelter
4. condition         - Condition of the pet
5. color_type        - Color of the pet
6. length(m)         - Length of the pet in meters
7. height(cm)        - Height of the pet in centimeters
8. X1                - Anonymous column 1
9. X2                - Anonymous column 2
10. breed_category   - Target variable 1 - Breed category of the pet
11. pet_category     - Target variable 2 - Category of the pet

In [None]:
# Reading the Data
train=pd.read_csv('../input/hackerearth-ml-challenge-pet-adoption/train.csv',index_col='pet_id',parse_dates=['issue_date','listing_date'])
test=pd.read_csv('../input/hackerearth-ml-challenge-pet-adoption/test.csv',index_col='pet_id',parse_dates=['issue_date','listing_date'])

## Exploring the Dataset

In [None]:
#Displaying first five rows of train Dataset
train.head()

In [None]:
#shape of train and test dataset
train_obs,train_ftr=train.shape
test_obs,test_ftr=test.shape
print("Train Dataset has {} observations and Test Dataset has {} observations".format(train_obs,test_obs))

In [None]:
# column datatype 
train.info()

In [None]:
# Summary Statistics
train.describe()

In [None]:
# Null Values in Dataset
train.isna().sum()

The 'condition' column has 1477 null values, which is about 8% of the training set.
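
A quick way to get that percentage for every column at once is to take the mean of the null mask. This is only a sketch on a made-up toy frame (`demo` and its values are hypothetical stand-ins for `train`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are invented for illustration
demo = pd.DataFrame({
    "condition": [0.0, np.nan, 2.0, np.nan, 1.0],
    "height(cm)": [10.0, 12.5, 30.1, 22.0, 18.3],
})

# isna() gives a boolean mask; its column-wise mean is the fraction of nulls
missing_pct = (demo.isna().mean() * 100).round(1)
print(missing_pct)
```

On the real data the same expression would be `train.isna().mean() * 100`.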

In [None]:
# exploring the NAN values in 'condition' column
train_category_na_df=train[train['condition'].isna()]
train_category_na_df.groupby(['pet_category'])['breed_category'].value_counts()

In [None]:
# groupby to find minimum height
train.groupby(['pet_category','breed_category'])['height(cm)'].min()

In [None]:
# groupby to find maximum height
train.groupby(['pet_category','breed_category'])['height(cm)'].max()

In [None]:
train.groupby(['pet_category'])['breed_category'].value_counts(sort=False)

All NaN values in the 'condition' column belong to breed_category 2, spread across the different 'pet_category' values. If we drop every observation with a NaN value, we lose breed_category 2 for all pets, so we can't remove these observations.
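
One way to double-check this kind of claim is to compute, per breed category, the share of rows that would be lost by dropping NaN conditions. The sketch below uses a hypothetical toy frame (`demo`), not the competition data:

```python
import numpy as np
import pandas as pd

# Toy data: every NaN condition happens to sit in breed_category 2
demo = pd.DataFrame({
    "condition": [np.nan, np.nan, 0.0, 1.0, 2.0, np.nan],
    "breed_category": [2, 2, 0, 1, 0, 2],
})

# Fraction of each breed_category that has a missing condition
lost_share = demo["condition"].isna().groupby(demo["breed_category"]).mean()
print(lost_share)
```

A share of 1.0 for a category means dropping NaN rows would remove that category completely.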

In [None]:
train_length_na_df=train[train['length(m)']==0]
train_length_na_df.groupby(['pet_category'])['breed_category'].value_counts()

93 observations have a value of 0 in the 'length(m)' column. This is not possible, since every animal has some length, so we drop these 93 observations.

In [None]:
# removing observation with length 0
train=train[train['length(m)']!=0]
train.shape

In [None]:
# exploring whether height has 0 cm
train[train['height(cm)']==0].shape

In [None]:
# X1 column - value counts
train.X1.value_counts(sort=False)

In [None]:
# X2 column - value counts
train.X2.value_counts(sort=False)

In [None]:
# train.condition.value_counts()
# train['length(m)'].value_counts()
## Pet Category- Category Distribution
# train.pet_category.value_counts(sort=False)
# train.breed_category.value_counts()

In [None]:
train.groupby(['pet_category'])['color_type'].value_counts(sort=False,normalize=True)*100

'color_type' may be an important feature. The "smoke", "Tabby" and "cream" color patterns belong to pet category 1. Similarly, the other pet categories have their own color types.

## Feature Extraction

* Convert the length from meters to centimeters.
* Create a new area column by multiplying height and length.
* Convert the color name into a hex color code and split it into three columns for the red, green and blue values.
* Extract text features from color_type.
* Create new columns from listing_date and issue_date. These may capture information such as the pet's lifetime.
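
The two color-related ideas could look roughly like the sketch below. It is an assumption-heavy illustration, not part of the pipeline that follows: `name_to_rgb` is a hypothetical helper, the color lookup relies on matplotlib's named colors (`matplotlib.colors.to_rgb`), and the three sample color names are made up.

```python
import numpy as np
import pandas as pd
from matplotlib.colors import to_rgb

def name_to_rgb(name):
    """Hypothetical helper: (r, g, b) in 0-255 for the first color word matplotlib knows, else NaNs."""
    words = str(name).lower().split()
    # Try the name with spaces removed first (e.g. "light blue" -> "lightblue"), then each word
    for candidate in ["".join(words)] + words:
        try:
            return tuple(round(c * 255) for c in to_rgb(candidate))
        except ValueError:
            continue
    return (np.nan, np.nan, np.nan)

# Toy frame standing in for the real data
demo = pd.DataFrame({"color_type": ["Black", "Brown Tabby", "Calico"]})

# RGB columns from the color name
rgb = pd.DataFrame(
    demo["color_type"].apply(name_to_rgb).tolist(),
    columns=["red", "green", "blue"],
    index=demo.index,
)
demo = demo.join(rgb)

# Simple text features: one indicator column per word in color_type
tokens = demo["color_type"].str.lower().str.get_dummies(sep=" ").add_prefix("color_")
demo = demo.join(tokens)
print(demo)
```

Names that matplotlib does not recognize (here "Calico") end up as NaN in the RGB columns, which tree-based models such as XGBoost can handle as missing values.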

In [None]:
# concatenating train and test set
train['train_or_test']='train'
test['train_or_test']='test'
dataframe=pd.concat([train,test],axis=0)

#length(cm)= length(m)*100
dataframe['length(cm)']=dataframe['length(m)']*100

#area(cm^2)= length(cm)*height(cm)
dataframe['area(cm^2)']=dataframe['length(cm)']*dataframe['height(cm)']

In [None]:
# fill NAN value with 3
# assign the result back instead of using inplace=True on a column selection
dataframe['condition']=dataframe['condition'].fillna(3)

In [None]:
# time between issue and listing, in units of 1000 seconds
dataframe['time_for_listing']=(dataframe['listing_date']-dataframe['issue_date']).dt.total_seconds()/1000

In [None]:
dataframe['Year_arrival'] = (dataframe['listing_date']).dt.year
dataframe['Month_arrival'] = (dataframe['listing_date']).dt.month
dataframe['Day_arrival'] = (dataframe['listing_date']).dt.day
dataframe['Dayofweek_arrival'] = (dataframe['listing_date']).dt.dayofweek
dataframe['DayOfyear_arrival'] = (dataframe['listing_date']).dt.dayofyear
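# .dt.week is deprecated and removed in newer pandas; use isocalendar().week instead
dataframe['Week_arrival'] = (dataframe['listing_date']).dt.isocalendar().week.astype(int)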
dataframe['Quarter_arrival'] = (dataframe['listing_date']).dt.quarter 



dataframe['Year_issue'] = (dataframe['issue_date']).dt.year
dataframe['Month_issue'] = (dataframe['issue_date']).dt.month
dataframe['Day_issue'] = (dataframe['issue_date']).dt.day
dataframe['Dayofweek_issue'] = (dataframe['issue_date']).dt.dayofweek
dataframe['DayOfyear_issue'] = (dataframe['issue_date']).dt.dayofyear
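# .dt.week is deprecated and removed in newer pandas; use isocalendar().week instead
dataframe['Week_issue'] = (dataframe['issue_date']).dt.isocalendar().week.astype(int)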
dataframe['Quarter_issue'] = (dataframe['issue_date']).dt.quarter 



dataframe['year_took']=dataframe['Year_arrival']-dataframe['Year_issue']
dataframe['months_took']=dataframe['Month_arrival']-dataframe['Month_issue']
dataframe['days_took']=dataframe['Day_arrival']-dataframe['Day_issue']
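
Note that the component-wise differences above (year, month, day) can be negative or zero even when time has passed, because each one ignores the other components. As a sketch on invented dates, here is how they compare with the total elapsed days (`elapsed_days` is a hypothetical column name, not used elsewhere in this notebook):

```python
import pandas as pd

# Two made-up pets: one crosses a year boundary, one stays within a month
demo = pd.DataFrame({
    "issue_date": pd.to_datetime(["2016-11-20", "2017-03-01"]),
    "listing_date": pd.to_datetime(["2017-01-05", "2017-03-31"]),
})

# Component-wise month difference, as computed for months_took
demo["months_took"] = demo["listing_date"].dt.month - demo["issue_date"].dt.month

# Total elapsed time in whole days
demo["elapsed_days"] = (demo["listing_date"] - demo["issue_date"]).dt.days
print(demo)
```

The first row gets a month difference of -10 although only 46 days passed. Tree models can still combine the components, but a single elapsed-days column is easier to interpret.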

In [None]:
# .copy() so that later column assignments do not act on a slice of `dataframe`
train=dataframe[dataframe['train_or_test']=='train'].copy()
test=dataframe[dataframe['train_or_test']=='test'].copy()

train_X=train.drop(['issue_date','color_type','listing_date','length(m)','breed_category', 'pet_category','train_or_test'],axis=1)
train_y=train['pet_category']
final_test_X=test.drop(['issue_date','color_type','listing_date','length(m)','breed_category', 'pet_category','train_or_test'],axis=1)

In [None]:
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import StandardScaler,LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import xgboost as XGB

X_train,X_test,y_train,y_test=train_test_split(train_X,train_y,test_size=0.2,stratify=train_y)

In [None]:
xg_cl=XGB.XGBClassifier()

xg_cl.fit(X_train,y_train)
y_train_predict=xg_cl.predict(X_train)
y_predict=xg_cl.predict(X_test)
test['pet_category']=xg_cl.predict(final_test_X)
print(accuracy_score(y_train,y_train_predict))
print(accuracy_score(y_test,y_predict))
print(confusion_matrix(y_test,y_predict))
print(f1_score(y_test,y_predict,average='weighted'))
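
As far as I know, recent XGBoost releases expect class labels to be consecutive integers starting at 0 and raise an error otherwise. If the target labels are not consecutive, one option is to encode them with `LabelEncoder` (already imported above) and decode the predictions afterwards. A minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels with a gap (no class 3), only for illustration
y = np.array([0, 1, 2, 4, 4, 0])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)   # consecutive codes 0..n_classes-1

# After predicting on the encoded labels, map codes back to the original labels
y_decoded = encoder.inverse_transform(y_encoded)
print(y_encoded, y_decoded)
```

In this notebook the model would then be fit on the encoded target, and `inverse_transform` applied to the output of `predict` before storing it.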

In [None]:
XGB.plot_importance(xg_cl)

In [None]:
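# second model: predict breed_category, keeping pet_category as an extra feature
train_X=train.drop(['issue_date','color_type','listing_date','length(m)','breed_category','train_or_test'],axis=1)
train_y=train['breed_category']
final_test_X=test.drop(['issue_date','color_type','listing_date','length(m)', 'breed_category','train_or_test'],axis=1)

# re-split, otherwise the model is still trained on the pet_category split from above
X_train,X_test,y_train,y_test=train_test_split(train_X,train_y,test_size=0.2,stratify=train_y)

xg_cl=XGB.XGBClassifier()

xg_cl.fit(X_train,y_train)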
y_train_predict=xg_cl.predict(X_train)
y_predict=xg_cl.predict(X_test)
test['breed_category']=xg_cl.predict(final_test_X)
print(accuracy_score(y_train,y_train_predict))
print(accuracy_score(y_test,y_predict))
print(confusion_matrix(y_test,y_predict))
print(f1_score(y_test,y_predict,average='weighted'))

In [None]:
XGB.plot_importance(xg_cl)

In [None]:
dfy=test[['breed_category','pet_category']]
dfy.head(1000)

In [None]:
dfy.to_csv('output.csv')
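
Because train and test were concatenated, the target columns may end up as floats (for example 1.0 instead of 1). Assuming the submission should contain integer labels, which is my assumption and not something stated above, the frame can be cast before writing. The IDs and values below are invented:

```python
import io
import pandas as pd

# Invented predictions standing in for `dfy`
preds = pd.DataFrame(
    {"breed_category": [0.0, 2.0], "pet_category": [1.0, 4.0]},
    index=pd.Index(["ID_A", "ID_B"], name="pet_id"),
)

# Cast to int and write to an in-memory buffer instead of a file
buffer = io.StringIO()
preds.astype(int).to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

For the real output this would be `dfy.astype(int).to_csv('output.csv')`.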