# HackerEarth Machine Learning challenge: Adopt a buddy

Steps to be taken:


1.   Get the Data from the source.
2.   Exploratory Data Analysis on the data
3.   Handling missing data
4.   Feature Engineering - creating new features
5.   Preparing the data to be fed to the ML model
6.   Handling categorical data
7.   Scaling the data for better ML model performance
8.   Training ML model
9.   Evaluating model
10.   Predicting the values from model
11.   Exporting the data to CSV

Importing libraries

In [1]:
import numpy as np
import pandas as pd

import os

Get the data from the CSV files provided by HackerEarth.

In [2]:
train=pd.read_csv("/content/train.csv") # Training Data
test=pd.read_csv("/content/test.csv") # Testing Data
train.head()

Unnamed: 0,pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2,breed_category,pet_category
0,ANSL_69903,2016-07-10 00:00:00,2016-09-21 16:25:00,2.0,Brown Tabby,0.8,7.78,13,9,0.0,1
1,ANSL_66892,2013-11-21 00:00:00,2018-12-27 17:47:00,1.0,White,0.72,14.19,13,9,0.0,2
2,ANSL_69750,2014-09-28 00:00:00,2016-10-19 08:24:00,,Brown,0.15,40.9,15,4,2.0,4
3,ANSL_71623,2016-12-31 00:00:00,2019-01-25 18:30:00,1.0,White,0.62,17.82,0,1,0.0,2
4,ANSL_57969,2017-09-28 00:00:00,2017-11-19 09:38:00,2.0,Black,0.5,11.06,18,4,0.0,1


In [3]:
test.head()

Unnamed: 0,pet_id,issue_date,listing_date,condition,color_type,length(m),height(cm),X1,X2
0,ANSL_75005,2005-08-17 00:00:00,2017-09-07 15:35:00,0.0,Black,0.87,42.73,0,7
1,ANSL_76663,2018-11-15 00:00:00,2019-05-08 17:24:00,1.0,Orange Tabby,0.06,6.71,0,1
2,ANSL_58259,2012-10-11 00:00:00,2018-04-02 16:51:00,1.0,Black,0.24,41.21,0,7
3,ANSL_67171,2015-02-13 00:00:00,2018-04-06 07:25:00,1.0,Black,0.29,8.46,7,1
4,ANSL_72871,2017-01-18 00:00:00,2018-04-26 13:42:00,1.0,Brown,0.71,30.92,0,7


# Exploring the data

Shape of the data

In [4]:
train.shape,test.shape

((18834, 11), (8072, 9))

## Checking missing values

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18834 entries, 0 to 18833
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   pet_id          18834 non-null  object 
 1   issue_date      18834 non-null  object 
 2   listing_date    18834 non-null  object 
 3   condition       17357 non-null  float64
 4   color_type      18834 non-null  object 
 5   length(m)       18834 non-null  float64
 6   height(cm)      18834 non-null  float64
 7   X1              18834 non-null  int64  
 8   X2              18834 non-null  int64  
 9   breed_category  18834 non-null  float64
 10  pet_category    18834 non-null  int64  
dtypes: float64(4), int64(3), object(4)
memory usage: 1.6+ MB


In [6]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8072 entries, 0 to 8071
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   pet_id        8072 non-null   object 
 1   issue_date    8072 non-null   object 
 2   listing_date  8072 non-null   object 
 3   condition     7453 non-null   float64
 4   color_type    8072 non-null   object 
 5   length(m)     8072 non-null   float64
 6   height(cm)    8072 non-null   float64
 7   X1            8072 non-null   int64  
 8   X2            8072 non-null   int64  
dtypes: float64(3), int64(2), object(4)
memory usage: 567.7+ KB


Only the `condition` column has null values, in both train and test.

Check unique values in ***breed_category***.

In [7]:
train['breed_category'].value_counts()

0.0    9000
1.0    8357
2.0    1477
Name: breed_category, dtype: int64

Check which breed_category values occur in the rows where `condition` is missing.

In [8]:
a=train['breed_category'][(np.isnan(train['condition']))]
a.value_counts()

2.0    1477
Name: breed_category, dtype: int64

So in the training data all the missing values belong to a single label (breed_category 2.0). That is why we can fill them with a distinct value such as -1.
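A quick way to confirm this kind of pattern is to cross-tabulate the missing-value flag against the label. The sketch below uses a tiny made-up frame (not the competition data) with the same column names; it only illustrates the check and the -1 fill.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) mimicking the two train columns used above.
toy = pd.DataFrame({
    "condition": [2.0, 1.0, np.nan, 0.0, np.nan],
    "breed_category": [0.0, 0.0, 2.0, 1.0, 2.0],
})

# Rows: is condition missing? Columns: breed_category.
missing_by_breed = pd.crosstab(toy["condition"].isna(), toy["breed_category"])
print(missing_by_breed)

# Fill the gaps with a sentinel that no real condition value uses.
toy["condition"] = toy["condition"].fillna(-1)
```

If every missing row falls in one column of the table, the sentinel value effectively marks that group.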

Let's combine our data for further analysis.

In [9]:
test_id=test['pet_id'] # keep the test IDs for the submission file
ntrain=train.shape[0] # number of training rows, used later to split the combined data back into train and test

Save the target variables, i.e. the labels.

In [10]:
y1=train['breed_category']
y2=train['pet_category']

Combine test and train data

In [11]:
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['breed_category','pet_category'], axis=1, inplace=True)

The combined data is now ready for feature engineering.

# Feature Engineering

## Handling missing data
Fill the missing values with -1.

In [12]:
all_data['condition'].value_counts()

1.0    9747
0.0    8966
2.0    6097
Name: condition, dtype: int64

In [13]:
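# Assign the result back instead of using inplace=True on a column selection,
# which may not update all_data in recent pandas versions.
all_data['condition']=all_data['condition'].fillna(-1)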

In [14]:
all_data['condition'].value_counts()

 1.0    9747
 0.0    8966
 2.0    6097
-1.0    2096
Name: condition, dtype: int64

In [15]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26906 entries, 0 to 26905
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   pet_id        26906 non-null  object 
 1   issue_date    26906 non-null  object 
 2   listing_date  26906 non-null  object 
 3   condition     26906 non-null  float64
 4   color_type    26906 non-null  object 
 5   length(m)     26906 non-null  float64
 6   height(cm)    26906 non-null  float64
 7   X1            26906 non-null  int64  
 8   X2            26906 non-null  int64  
dtypes: float64(3), int64(2), object(4)
memory usage: 1.8+ MB


Now we have no ***null*** values in our dataset.

## Handling issue_date and listing_date

In [16]:
all_data['issue_date']=pd.to_datetime(all_data['issue_date'])
all_data['listing_date']=pd.to_datetime(all_data['listing_date'])


In [17]:
# Month of the issue date, extracted with the vectorized .dt accessor
all_data['issue_month']=all_data['issue_date'].dt.month

In [18]:
# Month of the listing date, extracted with the vectorized .dt accessor
all_data['listing_month']=all_data['listing_date'].dt.month

**Time difference between issue and listing date (new feature)**

This could be a useful feature. One guess is that the issue date is set around the animal's birth and the listing date is set once it is mature enough to stay in the shelter. If so, the difference between the two reflects how long an animal takes to mature, which is not the same for all animals, so it may help to tell them apart.

In [19]:
# Approximate fractional year: year + month/12 + day/365
listing=all_data['listing_date']
all_data['modified_listing_date']=listing.dt.year+(listing.dt.month/12.0)+(listing.dt.day/365.0)

In [20]:
# Approximate fractional year: year + month/12 + day/365
issue=all_data['issue_date']
all_data['modified_issue_date']=issue.dt.year+(issue.dt.month/12.0)+(issue.dt.day/365.0)

In [21]:
all_data['took_time']=abs(all_data['modified_listing_date']-all_data['modified_issue_date'])
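The fractional-year formula above is an approximation (it treats every month as 1/12 of a year and every day as 1/365 of a year). As a possible alternative, and not what this notebook uses, subtracting the two datetime columns gives the exact elapsed time. The sketch below uses two rows copied from `train.head()`; the column name `took_days` is made up for illustration.

```python
import pandas as pd

# Two rows taken from the train.head() output shown earlier.
sample = pd.DataFrame({
    "issue_date": pd.to_datetime(["2016-07-10 00:00:00", "2013-11-21 00:00:00"]),
    "listing_date": pd.to_datetime(["2016-09-21 16:25:00", "2018-12-27 17:47:00"]),
})

# Subtracting datetimes yields a Timedelta; .dt.days keeps the whole days.
sample["took_days"] = (sample["listing_date"] - sample["issue_date"]).dt.days
print(sample["took_days"].tolist())
```

Whether exact days or the approximate year fraction works better for the model would need to be checked on the validation split.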

## Extracting features from pet_id


In [22]:
all_data['1stnum'] = all_data['pet_id'].str[:6]
all_data['1st2num'] = all_data['pet_id'].str[:7]
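To make the slicing concrete: every ID in the samples shown earlier starts with `ANSL_`, so the first six and seven characters correspond to the prefix plus the first one and two digits of the numeric part. A small sketch on IDs taken from `train.head()`:

```python
import pandas as pd

ids = pd.Series(["ANSL_69903", "ANSL_66892", "ANSL_57969"])

first_digit = ids.str[:6]  # prefix plus the first digit
first_two = ids.str[:7]    # prefix plus the first two digits
print(first_digit.tolist(), first_two.tolist())
```

Both columns stay as strings, which is why they are one-hot encoded later together with `color_type`.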

# Preparing the data to be fed to the ML algorithm

Split the combined data back into train and test.

In [23]:
train = all_data[:ntrain]
test = all_data[ntrain:]

In [24]:
# Drop the raw ID and date columns that are no longer needed
x=train.drop(['pet_id','issue_date','listing_date','modified_issue_date'],axis=1)
test=test.drop(['pet_id','issue_date','listing_date','modified_issue_date'],axis=1)


## Handling categorical data

In [25]:
x.select_dtypes(exclude='number').columns.to_list()

['color_type', '1stnum', '1st2num']

There are many ways to handle categorical variables; here I use one-hot encoding.

In [26]:
x.shape

(18834, 12)

In [27]:
x=pd.get_dummies(x)
test=pd.get_dummies(test)


In [28]:
x.shape,test.shape

((18834, 97), (8072, 95))

Look, the shapes differ: train has 97 columns and test has 95. This means that, after one-hot encoding, train contains 2 columns that test does not. We have to remove those 2 columns from the train data.

In [29]:
a=set(x.columns)-set(test.columns)

In [30]:
a=list(a)
a

['color_type_Brown Tiger', 'color_type_Black Tiger']

In [31]:
x=x.drop(a,axis=1)

In [32]:
x.shape,test.shape

((18834, 95), (8072, 95))
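Dropping the train-only columns works here because test has no extra columns of its own. A more general sketch, assuming small made-up frames rather than the real data, is to let pandas keep only the columns both frames share via `DataFrame.align`:

```python
import pandas as pd

# Toy frames (hypothetical): train has a colour that never appears in test.
train_toy = pd.DataFrame({"color_type": ["Black", "Brown Tiger", "White"]})
test_toy = pd.DataFrame({"color_type": ["Black", "White", "Black"]})

train_enc = pd.get_dummies(train_toy)
test_enc = pd.get_dummies(test_toy)

# join="inner" on axis=1 keeps only the columns present in both frames.
train_enc, test_enc = train_enc.align(test_enc, join="inner", axis=1)
print(train_enc.columns.tolist())
```

This also covers the opposite case, where test contains a category that train never saw.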

In [33]:
# Combine train and test again so both are scaled together
all_data = pd.concat((x, test)).reset_index(drop=True)

## Scaling the data using StandardScaler()

In [34]:
from sklearn import preprocessing
# Get column names first
names = all_data.columns
# Create the Scaler object
scaler = preprocessing.StandardScaler()
# Fit your data on the scaler object
scaled_df = scaler.fit_transform(all_data)
all_data = pd.DataFrame(scaled_df, columns=names)
all_data.head()

Unnamed: 0,condition,length(m),height(cm),X1,X2,issue_month,listing_month,modified_listing_date,took_time,color_type_Agouti,color_type_Apricot,color_type_Black,color_type_Black Brindle,color_type_Black Smoke,color_type_Black Tabby,color_type_Blue,color_type_Blue Cream,color_type_Blue Merle,color_type_Blue Point,color_type_Blue Smoke,color_type_Blue Tabby,color_type_Blue Tick,color_type_Blue Tiger,color_type_Brown,color_type_Brown Brindle,color_type_Brown Merle,color_type_Brown Tabby,color_type_Buff,color_type_Calico,color_type_Calico Point,color_type_Chocolate,color_type_Chocolate Point,color_type_Cream,color_type_Cream Tabby,color_type_Fawn,color_type_Flame Point,color_type_Gold,color_type_Gray,color_type_Gray Tabby,color_type_Green,...,color_type_Tan,color_type_Torbie,color_type_Tortie,color_type_Tortie Point,color_type_Tricolor,color_type_White,color_type_Yellow,color_type_Yellow Brindle,1stnum_ANSL_4,1stnum_ANSL_5,1stnum_ANSL_6,1stnum_ANSL_7,1st2num_ANSL_49,1st2num_ANSL_50,1st2num_ANSL_51,1st2num_ANSL_52,1st2num_ANSL_53,1st2num_ANSL_54,1st2num_ANSL_55,1st2num_ANSL_56,1st2num_ANSL_57,1st2num_ANSL_58,1st2num_ANSL_59,1st2num_ANSL_60,1st2num_ANSL_61,1st2num_ANSL_62,1st2num_ANSL_63,1st2num_ANSL_64,1st2num_ANSL_65,1st2num_ANSL_66,1st2num_ANSL_67,1st2num_ANSL_68,1st2num_ANSL_69,1st2num_ANSL_70,1st2num_ANSL_71,1st2num_ANSL_72,1st2num_ANSL_73,1st2num_ANSL_74,1st2num_ANSL_75,1st2num_ANSL_76
0,1.40918,1.024225,-1.514343,1.169789,1.26275,0.062139,0.622139,-1.457263,-0.713151,-0.014935,-0.024393,-0.568681,-0.058575,-0.03954,-0.056627,-0.219337,-0.023618,-0.070748,-0.02925,-0.021123,-0.145394,-0.039066,-0.020224,-0.326505,-0.165581,-0.047669,3.196171,-0.082753,-0.136902,-0.025144,-0.121592,-0.025144,-0.09487,-0.09878,-0.08975,-0.052516,-0.038586,-0.128341,-0.059839,-0.021986,...,-0.278293,-0.113298,-0.139278,-0.03557,-0.159069,-0.385807,-0.088266,-0.02925,-0.03341,-0.769033,1.300231,-0.585963,-0.03341,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.19637,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,5.089794,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.183557
1,0.292938,0.747384,-1.020842,1.169789,1.26275,1.277405,1.45714,1.235802,0.916711,-0.014935,-0.024393,-0.568681,-0.058575,-0.03954,-0.056627,-0.219337,-0.023618,-0.070748,-0.02925,-0.021123,-0.145394,-0.039066,-0.020224,-0.326505,-0.165581,-0.047669,-0.312874,-0.082753,-0.136902,-0.025144,-0.121592,-0.025144,-0.09487,-0.09878,-0.08975,-0.052516,-0.038586,-0.128341,-0.059839,-0.021986,...,-0.278293,-0.113298,-0.139278,-0.03557,-0.159069,2.591969,-0.088266,-0.02925,-0.03341,-0.769033,1.300231,-0.585963,-0.03341,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.19637,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,5.089794,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.183557
2,-1.939546,-1.225103,1.035542,1.475018,-0.157894,0.669772,0.900472,-1.364754,-0.094221,-0.014935,-0.024393,-0.568681,-0.058575,-0.03954,-0.056627,-0.219337,-0.023618,-0.070748,-0.02925,-0.021123,-0.145394,-0.039066,-0.020224,3.062744,-0.165581,-0.047669,-0.312874,-0.082753,-0.136902,-0.025144,-0.121592,-0.025144,-0.09487,-0.09878,-0.08975,-0.052516,-0.038586,-0.128341,-0.059839,-0.021986,...,-0.278293,-0.113298,-0.139278,-0.03557,-0.159069,-0.385807,-0.088266,-0.02925,-0.03341,-0.769033,1.300231,-0.585963,-0.03341,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.19637,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,5.089794,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.183557
3,0.292938,0.401334,-0.741371,-0.814202,-1.010281,1.581222,-1.604531,1.328311,-0.091489,-0.014935,-0.024393,-0.568681,-0.058575,-0.03954,-0.056627,-0.219337,-0.023618,-0.070748,-0.02925,-0.021123,-0.145394,-0.039066,-0.020224,-0.326505,-0.165581,-0.047669,-0.312874,-0.082753,-0.136902,-0.025144,-0.121592,-0.025144,-0.09487,-0.09878,-0.08975,-0.052516,-0.038586,-0.128341,-0.059839,-0.021986,...,-0.278293,-0.113298,-0.139278,-0.03557,-0.159069,2.591969,-0.088266,-0.02925,-0.03341,-0.769033,-0.769094,1.706594,-0.03341,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.19637,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,5.089794,-0.196472,-0.196472,-0.196472,-0.196472,-0.183557
4,1.40918,-0.013926,-1.261818,1.932862,-0.157894,0.669772,1.178806,-0.077498,-0.731366,-0.014935,-0.024393,1.758456,-0.058575,-0.03954,-0.056627,-0.219337,-0.023618,-0.070748,-0.02925,-0.021123,-0.145394,-0.039066,-0.020224,-0.326505,-0.165581,-0.047669,-0.312874,-0.082753,-0.136902,-0.025144,-0.121592,-0.025144,-0.09487,-0.09878,-0.08975,-0.052516,-0.038586,-0.128341,-0.059839,-0.021986,...,-0.278293,-0.113298,-0.139278,-0.03557,-0.159069,-0.385807,-0.088266,-0.02925,-0.03341,1.300334,-0.769094,-0.585963,-0.03341,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,5.089794,-0.196472,-0.19637,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.196472,-0.183557
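Note that the cell above fits the scaler on train and test combined, so the test rows influence the scaling statistics. A common alternative, shown here only as a sketch on made-up numbers, is to fit on the training rows and reuse those statistics for the test rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers standing in for the train and test feature matrices.
train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_toy = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_toy)  # learn mean/std from train only
test_scaled = scaler.transform(test_toy)        # apply the same mean/std to test
print(test_scaled)
```

For tree-based models such as XGBoost the difference is usually small, since splits do not depend on feature scale.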


# Machine Learning models



## pet_category Prediction

Split the data for the first model, i.e. the pet_category prediction.

In [35]:
x = all_data[:ntrain]
test = all_data[ntrain:]

In [36]:
from sklearn.model_selection import train_test_split
# Note: the target here is y2 (pet_category), so y1_train/y1_test hold pet_category labels
x1_train,x1_test,y1_train,y1_test=train_test_split(x,y2,test_size=0.2,random_state=44,shuffle=True)

In [37]:
from xgboost import XGBClassifier
model1 = XGBClassifier(n_estimators=500, n_jobs=5,learning_rate=0.1)
model1.fit(x1_train, y1_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=500, n_jobs=5,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

## breed_category Prediction

We'll use the output of **pet_category model** as an input feature of **breed_category model**.

In [38]:
#new_feat is the new feature: model 1's predicted pet_category for the train data
new_feat=model1.predict(x)
#output1 is the first output: model 1's predicted pet_category for the test data
output1=model1.predict(test)
#vld1 holds model 1's predictions on the validation split, used later for scoring
vld1=model1.predict(x1_test)

In [39]:
x2 = pd.DataFrame(x, columns=names)
test2 = pd.DataFrame(test, columns=names)

In [40]:
#add model 1's predicted pet_category as an input feature of the train data for model 2
x2['output1']=new_feat
#add model 1's predicted pet_category as an input feature of the test data for model 2
test2['output1']=output1
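One caveat with this kind of stacking: `model1` has already seen most of the training rows, so its predictions on them tend to be more accurate than they will be on unseen data. A commonly used remedy is out-of-fold prediction. The sketch below is not this notebook's method; it uses synthetic data and a small decision tree purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training features and the first target.
X, y_first = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# Each row is predicted by a model that did not see it during fitting.
first_model = DecisionTreeClassifier(max_depth=3, random_state=0)
oof_pred = cross_val_predict(first_model, X, y_first, cv=5)

# Append the out-of-fold prediction as an extra feature for the second model.
X_stacked = np.column_stack([X, oof_pred])
print(X_stacked.shape)
```

With this approach the extra feature seen during training has roughly the same error rate as the one the second model will receive at test time.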

Splitting the data (the target here is `y1`, i.e. breed_category)

In [41]:
x2_train,x2_test,y2_train,y2_test=train_test_split(x2,y1,test_size=0.2,random_state=44)

Now train the model.

In [42]:
model2 = XGBClassifier(n_estimators=450, n_jobs=5)
model2.fit(x2_train, y2_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=450, n_jobs=5,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [43]:
#output2 is model 2's predicted breed_category for the test data
#use test2, which includes the 'output1' feature that model 2 was trained on
output2=model2.predict(test2)
#vld2 holds model 2's predictions on the validation split, used for scoring
vld2=model2.predict(x2_test)

# Check accuracy (weighted F1 score)

In [44]:
from sklearn.metrics import f1_score

In [45]:
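#s1: weighted F1 of model 1 (pet_category); y1_test holds pet_category labels
s1=f1_score(y1_test,vld1,average='weighted')
#s2: weighted F1 of model 2 (breed_category); y2_test holds breed_category labels
s2=f1_score(y2_test,vld2,average='weighted')
#despite the name, this is the mean of the two weighted F1 scores scaled to 0-100
accuracy=100*((s1+s2)/2)
accuracy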

90.57226920753799
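The value above is the mean of two weighted F1 scores scaled to 0-100, not classification accuracy. A small sketch with made-up labels shows the computation; the helper name `mean_weighted_f1` is hypothetical.

```python
from sklearn.metrics import f1_score

def mean_weighted_f1(true_a, pred_a, true_b, pred_b):
    """Average the weighted F1 of two targets and scale to 0-100."""
    f1_a = f1_score(true_a, pred_a, average="weighted")
    f1_b = f1_score(true_b, pred_b, average="weighted")
    return 100 * (f1_a + f1_b) / 2

# Made-up labels for the two targets.
pet_true = [1, 2, 4, 2, 1]
breed_true = [0, 0, 2, 1, 0]

# Perfect predictions on both targets give the maximum score.
perfect = mean_weighted_f1(pet_true, pet_true, breed_true, breed_true)

# One wrong pet_category prediction lowers the combined score.
pet_pred = [1, 2, 4, 1, 1]
imperfect = mean_weighted_f1(pet_true, pet_pred, breed_true, breed_true)
print(perfect, imperfect)
```

Weighted averaging weights each class's F1 by its number of true samples, so frequent classes dominate the score.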

# Create Submission file

In [46]:
sub_new=pd.DataFrame({
    "pet_id":test_id,
    "breed_category":output2,
    "pet_category":output1
})
sub_new.to_csv("/content/result.csv",index=False)