<img src="https://sterlingshelter-animalshelterinc.netdna-ssl.com/wp-content/uploads/2017/09/adoption.jpg" />

## Problem Statement
<h4>A leading agency in the pet domain is planning to create a virtual tour experience for its customers, showcasing all the animals that are available in the shelter. You are required to build a machine learning model that determines the type and breed of each animal based on its physical attributes and other factors.</h4>
<br>
<h4>Target Variables: <b>breed_category</b>, <b>pet_category</b></h4>

 # Table of Contents
 
 <ol>
    <li><h3>Understanding Data</h3></li>
    <li><h3>Feature Engineering</h3>
    <ul>
        <li><h4>Handling Text id</h4></li>
        <li><h4>Handling Date columns</h4></li>
        <li><h4>Handling missing values</h4></li>
        <li><h4>Handling Categorical values</h4></li>
    </ul>
    </li>
    <li><h3>Exploratory Data Analysis</h3></li>
    <li><h3>MultiOutput Classification</h3>
    <ul>
        <li><h4>Model I creation</h4></li>
        <li><h4>Model II creation</h4></li>
    </ul>
    </li>
 </ol>
        

In [None]:


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Understanding Data

In [None]:
train=pd.read_csv("/kaggle/input/hackerearth-ml-challenge-pet-adoption/train.csv")
train.head()

In [None]:
train.info()

In [None]:
train.shape

#### Unique Values  

In [None]:
for col in train.columns:
    print(col,':',len(train[col].unique()))

<h1 style="color:green;"> Feature Engineering </h1>

## Handling text id

In [None]:
#removing unwanted text in pet_id
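#regex=True is required in pandas>=2.0, where str.replace treats the pattern literally by default
train['pet_id']=train['pet_id'].str.replace('[^0-9]',"",regex=True)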

#converting into int data type
train['pet_id'] = train.pet_id.astype(int)
                                      

## Handling Date columns

In [None]:
#converting both the columns into datetime format
train['issue_date']=pd.to_datetime(train['issue_date'])
train['listing_date']=pd.to_datetime(train['listing_date'])



In [None]:
#taking duration
train['duration']=train['listing_date']-train['issue_date']
train['duration']

In [None]:
#keeping only the whole number of days in the duration
train['duration'] = train['duration'].dt.days
train['duration']

In [None]:
#drop issue_date and listing date columns

train.drop(['issue_date','listing_date'],axis=1,inplace=True)
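
The date handling above can be checked on a tiny example. This is only a sketch: the `toy` frame and its two rows are made up for illustration and are not taken from the competition data. It shows that `.dt.days` keeps the whole-day component (rounding toward negative infinity) and that a listing date earlier than the issue date yields a negative duration, which may be worth inspecting before modelling.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only
toy = pd.DataFrame({
    "issue_date": ["2016-07-10 00:00:00", "2018-01-01 00:00:00"],
    "listing_date": ["2016-09-21 16:25:00", "2017-12-30 09:00:00"],
})

# Same steps as in the notebook: parse, subtract, keep whole days
toy["issue_date"] = pd.to_datetime(toy["issue_date"])
toy["listing_date"] = pd.to_datetime(toy["listing_date"])
toy["duration"] = (toy["listing_date"] - toy["issue_date"]).dt.days

# Rows where the pet was listed before it was issued
negative = toy[toy["duration"] < 0]

print(toy["duration"].tolist())
print(len(negative), "row(s) with a negative duration")
```

If such rows show up in the real data, whether to keep, clip, or drop them is a modelling decision this notebook does not make.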

## Handling Missing Values

In [None]:
#Checking for missing values
train.isna().sum()

The <b>condition</b> feature has <b>1477</b> missing values.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
sns.countplot(x=train['condition'],data=train)
plt.title("Condition values composition")
plt.show()

In [None]:
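#fill the missing condition values with 2.0 (condition is the only column with missing values)
train['condition']=train['condition'].fillna(2.0)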

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
sns.countplot(x=train['condition'],data=train)
plt.title("Condition values composition")
plt.show()

In [None]:
train.isna().sum()

<h2> Handling Categorical Variables</h2>

 <h4>Frequency Encoding</h4>
 Frequency encoding replaces each category with its frequency in the data. When that frequency is related to the target variable, the encoding lets the model weight categories in direct or inverse proportion to how often they occur, depending on the nature of the data.
 <br><br>
 <b>The three steps are:</b>
 <ul>
    <li>Select a categorical variable you would like to transform</li>
    <li>Group by the categorical variable and obtain counts of each category</li>
    <li>Join it back with the training dataset</li>
 </ul>

In [None]:
#frequency Encoding
feq_encode = train.groupby('color_type').size()/len(train)
print(feq_encode)

#plain column assignment replaces the string column with the float frequencies
train['color_type'] = train['color_type'].map(feq_encode)
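
To make the three steps easier to inspect, here is a minimal sketch on a made-up frame. The `toy` data, the `color_type_freq` column name, and the choice to encode unseen categories as `0.0` are all assumptions for illustration; the notebook itself does not say how unseen colours in the test set should be handled.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
toy = pd.DataFrame({"color_type": ["Black", "Brown", "Black", "White", "Black"]})

# Steps 1 and 2: pick the column and compute the relative frequency of each category
freq = toy["color_type"].value_counts(normalize=True)

# Step 3: join the frequencies back onto the rows
toy["color_type_freq"] = toy["color_type"].map(freq)

# A category that never appeared in training maps to NaN;
# filling it with 0.0 is one possible choice (an assumption, not the notebook's method)
new_colors = pd.Series(["Black", "Grey"])
encoded_new = new_colors.map(freq).fillna(0.0)

print(toy)
print(encoded_new.tolist())
```

Note that two categories with the same frequency receive the same code, so the encoding cannot tell them apart.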

 <h1 style="color:blue;">Exploratory Data Analysis</h1>

In [None]:
plt.figure(figsize=(10,8))
sns.histplot(train['length(m)'],kde=True) #distplot is deprecated in recent seaborn releases
plt.title("Length data Distribution")
plt.show()

plt.figure(figsize=(10,8))
sns.histplot(train['height(cm)'],kde=True) #distplot is deprecated in recent seaborn releases
plt.title("Height data Distribution")
plt.show()

In [None]:
sns.set_style('darkgrid')
plt.figure(figsize=(10,8))
sns.countplot(x="condition",hue="pet_category",data=train)
plt.show()

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x="condition",hue="breed_category",data=train)
plt.show()

### Correlation

In [None]:
plt.figure(figsize=(18,10))
sns.heatmap(train.corr(),annot=True)

In [None]:
plt.figure(figsize=(10,8))
sns.regplot(x="X1",y="X2",data=train)
plt.title("Relation between X1 and X2")
plt.show()

# Multi Output Classification

The most straightforward way to handle a multioutput classification problem is to split it into multiple single-output classification problems.

For example, if a multioutput classification problem requires the prediction of two values y1 and y2 given an input X, it can be partitioned into two single-output classification problems:

Problem 1: Given X, predict y1.
<br>
Problem 2: Given X, predict y2.
<br>
There are two main approaches to implementing this technique.

The first approach involves developing a separate classification model for each output value to be predicted. We can think of this as a direct approach, as each target value is modeled directly.

The second approach is an extension of the first method except the models are organized into a chain. The prediction from the first model is taken as part of the input to the second model, and the process of output-to-input dependency repeats along the chain of models.

<b>Direct Multioutput:</b> Develop an independent model for each label to be predicted.
<br>
<b>Chained Multioutput:</b> Develop a sequence of dependent models, one for each label to be predicted.
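
The two approaches can be sketched side by side on synthetic data. Everything here is an assumption for illustration: the random features, the two made-up labels `y1` and `y2`, and the use of decision trees instead of the XGBoost models trained below. The direct version uses scikit-learn's `MultiOutputClassifier`; the chain is written by hand so that the output-to-input dependency is visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: y2 depends on y1, so a chain has something to exploit
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y1 = (X[:, 0] > 0).astype(int)
y2 = (X[:, 1] + y1 > 0.5).astype(int)
Y = np.column_stack([y1, y2])

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# Direct: one independent classifier per label
direct = MultiOutputClassifier(DecisionTreeClassifier(random_state=0))
direct.fit(X_tr, Y_tr)
direct_pred = direct.predict(X_te)

# Chained: the second model receives the first label as an extra feature.
# It is trained on the true y1 and, at prediction time, fed the predicted y1.
model_1 = DecisionTreeClassifier(random_state=0).fit(X_tr, Y_tr[:, 0])
model_2 = DecisionTreeClassifier(random_state=0).fit(
    np.column_stack([X_tr, Y_tr[:, 0]]), Y_tr[:, 1]
)
y1_hat = model_1.predict(X_te)
y2_hat = model_2.predict(np.column_stack([X_te, y1_hat]))
chain_pred = np.column_stack([y1_hat, y2_hat])

print("direct accuracy per label: ", (direct_pred == Y_te).mean(axis=0))
print("chained accuracy per label:", (chain_pred == Y_te).mean(axis=0))
```

The rest of this notebook follows the direct approach: Model I and Model II are trained separately on the same features.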

### Model I

In [None]:
import xgboost as xgb

In [None]:
x=train.drop(['pet_category','breed_category'],axis=1)
y=train.breed_category #target label

In [None]:
x.head(5)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

In [None]:
## Hyper Parameters

params={
 "learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30,0.50 ] ,
 "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
 "min_child_weight" : [ 1, 3, 5, 7 ],
 "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
 "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
    
}

In [None]:
from sklearn.model_selection import RandomizedSearchCV
xgb_model = xgb.XGBClassifier()

random_search=RandomizedSearchCV(xgb_model,param_distributions=params,n_iter=5,n_jobs=-1,cv=5,verbose=3)
random_search.fit(x_train,y_train)

In [None]:
random_search.best_params_ #printing best parameters

In [None]:
random_search.best_estimator_  #best estimator 

In [None]:
first_model=xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.3, gamma=0.0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.15, max_delta_step=0, max_depth=10,
              min_child_weight=1,  monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [None]:
first_model.fit(x_train,y_train)

In [None]:
y_pred=first_model.predict(x_test)

In [None]:
print("Accuracy score:",accuracy_score(y_test,y_pred))
cm=confusion_matrix(y_test,y_pred) #rows: true labels, columns: predicted labels
plt.figure(figsize=(8,6))
sns.heatmap(cm,annot=True)
plt.show()

### Model II

In [None]:
y=train.pet_category #target label
sns.countplot(x='pet_category',data=train)

In [None]:
x.head(5)

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

In [None]:
xgb_model = xgb.XGBClassifier()

random_search=RandomizedSearchCV(xgb_model,param_distributions=params,n_iter=5,n_jobs=-1,cv=5,verbose=3)
random_search.fit(x_train,y_train)

In [None]:
random_search.best_estimator_

In [None]:
random_search.best_params_

In [None]:
second_model = xgb.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.5, gamma=0.2, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.1, max_delta_step=0, max_depth=10,
              min_child_weight=3,monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [None]:
second_model.fit(x_train,y_train)

In [None]:
y_pred=second_model.predict(x_test) #reuse the name y_pred so the accuracy below scores this model

In [None]:
print("Accuracy score:",accuracy_score(y_test,y_pred))

<h1 style="color:blue;">I hope you learned something new. Thank you!</h1>