### HackerEarth ML Challenge 2020 - Adopt a Buddy

#### Problem type: Multitarget Multiclass Classification

This was an ML competition on HackerEarth (Jul 30, 2020 - Aug 23, 2020). The task is to build a model that determines the type and breed of an animal from its physical attributes and other factors. The evaluation metric is the average of the two targets' F1 scores, multiplied by 100.
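
To make the metric concrete, here is a minimal sketch of how such a score could be computed locally. The function name `competition_score` is hypothetical, and `average="weighted"` is an assumption: the text above only says "f1_scores", so check the competition page for the exact averaging mode.

```python
from sklearn.metrics import f1_score


def competition_score(breed_true, breed_pred, pet_true, pet_pred, average="weighted"):
    """Average of the two F1 scores, scaled to 0-100.

    `average="weighted"` is an assumption, not something stated in the notebook.
    """
    f1_breed = f1_score(breed_true, breed_pred, average=average)
    f1_pet = f1_score(pet_true, pet_pred, average=average)
    return 100 * (f1_breed + f1_pet) / 2


# Toy labels, for illustration only
breed_true = [0, 0, 1, 1, 2, 2]
breed_pred = [0, 0, 1, 2, 2, 2]  # one mistake
pet_true = [1, 1, 2, 2, 4, 4]

perfect = competition_score(breed_true, breed_true, pet_true, pet_true)
partial = competition_score(breed_true, breed_pred, pet_true, pet_true)
print(perfect, partial)
```

Having the same function available offline makes it easier to compare cross-validation scores with the leaderboard.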

For this competition, the key to climbing the leaderboard is **data analysis and generating new features**, which I cover in this notebook. When the competition ended on Aug 23rd, I had secured Rank 9 with a public leaderboard score of 91.32278.

**Kindly upvote if you find it interesting/helpful and comment your suggestions or any queries.**

In [None]:
#Import libraries
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

In [None]:
#reading data
train= pd.read_csv('/kaggle/input/hackerearth-ml-challenge-pet-adoption/train.csv')
test= pd.read_csv('/kaggle/input/hackerearth-ml-challenge-pet-adoption/test.csv')

print("Train Shape: ",train.shape)
print("Test Shape: ", test.shape)

In [None]:
# Check for columns
print(train.columns)
print(test.columns)

* We have two target labels to predict: breed_category and pet_category.

In [None]:
# Checking the data
train.head()

In [None]:
test.head()

In [None]:
#check for datatypes
print(train.dtypes)
print('*'*30)
print(test.dtypes)

#### Target Variable Analysis

In [None]:
print('Var1: Breed Category')
print(train['breed_category'].value_counts())
print()
print('Var2: Pet Category')
print(train['pet_category'].value_counts())

**The first thing to notice is the imbalanced classes in the target variables. Because of this imbalance, we need to be careful when choosing a validation strategy. StratifiedKFold is a good fit, since it keeps the class proportions roughly equal in every fold. A few things to note:**
* There are 3 classes in breed category -> 0, 1, 2
* There are 4 classes in pet category   -> 0, 1, 2, 4
* There is no class labelled 3 in pet category.
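
Since there are two targets, one way to stratify on both at once is to split on a combined label. The sketch below uses a small toy frame in place of `train`; the combined-label trick and the name `strat_label` are my assumptions, not necessarily the split used for the final submission.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for train with imbalanced targets (illustrative values only)
toy = pd.DataFrame({
    "breed_category": [0] * 12 + [1] * 6 + [2] * 6,
    "pet_category": [1] * 12 + [2] * 6 + [4] * 6,
})

# Combine both targets into one label so each fold preserves their joint distribution
strat_label = toy["breed_category"].astype(str) + "_" + toy["pet_category"].astype(str)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
fold_sizes = []
fold_labels = []
for fold, (train_idx, valid_idx) in enumerate(skf.split(toy, strat_label)):
    valid_counts = strat_label.iloc[valid_idx].value_counts().to_dict()
    fold_sizes.append(len(valid_idx))
    fold_labels.append(set(valid_counts))
    print(f"fold {fold}: {valid_counts}")
```

On the real data, a rare combination with fewer rows than `n_splits` makes scikit-learn emit a warning; in that case stratifying on a single target may be the more practical choice.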


#### Missing Values

In [None]:
# train
total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

In [None]:
# test
total_test = test.isnull().sum().sort_values(ascending=False)
percent_test = (test.isnull().sum()/test.isnull().count()).sort_values(ascending=False)
missing_data_test = pd.concat([total_test, percent_test], axis=1, keys=['Total', 'Percent'])
missing_data_test

* Only one column, condition, has missing values, and it does so in both train and test.

#### Other variables analysis and relation with targets

In [None]:
# Col1: pet_id
print(train.shape)
print(train.pet_id.nunique())

print()

print(test.shape)
print(test.pet_id.nunique())

In [None]:
train.sort_values(by=['pet_id']).head()

In [None]:
test.sort_values(by=['pet_id']).head()

In [None]:
train.sort_values(by=['issue_date']).head()

* I tried sorting by pet_id and issue_date to understand how the train/test split was made, so that the validation split could mimic it. This did not reveal a pattern.
* pet_id is unique for every row in both train and test.
* The alphanumeric pet_id column may allow new features to be generated from its prefix.

In [None]:
# feature engineering
# getting substring from pet_id for new feature
train['nf1_pet_id'] = train['pet_id'].str[:6]
train['nf2_pet_id'] = train['pet_id'].str[:7]

In [None]:
# check for new feature-1
print(train.nf1_pet_id.nunique())
print(train.nf1_pet_id.value_counts())

In [None]:
# check for new feature-2
print(train.nf2_pet_id.nunique())
print(train.nf2_pet_id.value_counts())

In [None]:
train.groupby(['nf1_pet_id', 'pet_category']).size()

In [None]:
test['pet_id'].str[:6].value_counts()

In [None]:
# Col2-3: issue_date and listing_date 

#anomaly detection datetime- train
train['issue_date']= pd.to_datetime(train['issue_date'])
train['listing_date']= pd.to_datetime(train['listing_date'])

train['duration_days'] = (train['listing_date'] - train['issue_date']).dt.days
train.loc[train['listing_date'] < train['issue_date']]

In [None]:
#anomaly detection datetime- test
test['issue_date']= pd.to_datetime(test['issue_date'])
test['listing_date']= pd.to_datetime(test['listing_date'])
test.loc[test['listing_date'] < test['issue_date']]

*TODO: Modelling*

1. Generate multiple datetime features from issue_date and listing_date.
2. Correct the 2 anomalies detected in train.
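
The two TODO items could be sketched roughly as follows. The helper name `add_date_features` and the derived column names are hypothetical, and treating an anomalous row as having its two dates swapped is an assumption; dropping or imputing those rows would be equally defensible.

```python
import pandas as pd


def add_date_features(df):
    """Return a copy with repaired dates and a few derived datetime columns."""
    out = df.copy()
    issue = pd.to_datetime(out["issue_date"])
    listing = pd.to_datetime(out["listing_date"])

    # Assumed repair: where listing precedes issue, treat the two dates as swapped
    swapped = listing < issue
    out["issue_date"] = issue.where(~swapped, listing)
    out["listing_date"] = listing.where(~swapped, issue)

    out["duration_days"] = (out["listing_date"] - out["issue_date"]).dt.days
    for col in ("issue_date", "listing_date"):
        out[f"{col}_year"] = out[col].dt.year
        out[f"{col}_month"] = out[col].dt.month
        out[f"{col}_dayofweek"] = out[col].dt.dayofweek
    return out


# Toy rows; the second one has listing_date before issue_date
toy = pd.DataFrame({
    "issue_date": ["2016-07-10 00:00:00", "2018-03-01 00:00:00"],
    "listing_date": ["2016-09-21 16:25:00", "2017-12-25 09:00:00"],
})
features = add_date_features(toy)
print(features[["issue_date", "listing_date", "duration_days"]])
```

Applying the same helper to both train and test keeps the two frames consistent.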

In [None]:
# Col4: condition
train = train.fillna(-99)
test = test.fillna(-99)
print(train['condition'].value_counts())
print()
print(test['condition'].value_counts())

In [None]:
train.groupby(['condition','pet_category']).size()

In [None]:
train.columns

In [None]:
train.groupby(['condition','X1','X2','breed_category']).size()

**Yaaa!!! Looks like we hit a jackpot here.**

*TODO: Modelling*

Generate 3 binary features for -99, 0.0, 1.0 condition types.
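
A minimal sketch of those three flags, using a toy column in place of the real data. The new column names (`condition_missing`, `condition_0`, `condition_1`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy condition column with one missing value (illustrative only)
toy = pd.DataFrame({"condition": [0.0, 1.0, np.nan, 2.0, 1.0]})
toy["condition"] = toy["condition"].fillna(-99)

# One binary indicator per condition value of interest
flag_names = [(-99, "condition_missing"), (0.0, "condition_0"), (1.0, "condition_1")]
for value, name in flag_names:
    toy[name] = (toy["condition"] == value).astype(int)

print(toy)
```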

In [None]:
# Col5: color_type
print(train['color_type'].value_counts())
print('*'*40)
print(test['color_type'].value_counts())

In [None]:
train.groupby(['color_type', 'pet_category']).size()

In [None]:
train.groupby(['color_type','breed_category']).size()

**Another cool feature found**

*TODO: Modelling*

* Generate new features based on grouped color_type variables. Particularly useful for predicting pet_category.
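
One possible reading of "grouped color_type" features is a per-color count plus the share of each pet_category within a color. The sketch below shows that idea on toy data; the feature names are hypothetical, and this is not necessarily the exact feature set behind the leaderboard score.

```python
import pandas as pd

# Toy stand-ins for train and test (illustrative values only)
toy_train = pd.DataFrame({
    "color_type": ["Black", "Black", "Brown", "Brown", "Brown", "White"],
    "pet_category": [1, 2, 2, 2, 1, 4],
})
toy_test = pd.DataFrame({"color_type": ["Brown", "White", "Grey"]})

# Statistics are learned from train only
color_counts = toy_train["color_type"].value_counts()
color_share = pd.crosstab(
    toy_train["color_type"], toy_train["pet_category"], normalize="index"
)
color_share.columns = [f"color_share_pet_{c}" for c in color_share.columns]


def add_color_features(df):
    """Attach the per-color count and pet_category shares; unseen colors get 0."""
    out = df.copy()
    out["color_count"] = out["color_type"].map(color_counts).fillna(0).astype(int)
    out = out.join(color_share, on="color_type")
    out[color_share.columns] = out[color_share.columns].fillna(0.0)
    return out


test_features = add_color_features(toy_test)
print(test_features)
```

Because the share columns are built from the target, they should be computed inside each training fold during validation; otherwise the validation score will be optimistic.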

In [None]:
print(train['color_type'].nunique())
print(test['color_type'].nunique())

In [None]:
# find the color types that are present in train but not in test
set(train.color_type) - set(test.color_type)

In [None]:
set(test.color_type) - set(train.color_type)

In [None]:
# Col6-7: length(m) and height(cm)
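# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(train['length(m)'], kde=True)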

In [None]:
# work on a copy so that adding a column does not trigger SettingWithCopyWarning
df = train[['length(m)','height(cm)']].copy()
df['length(cm)'] = df['length(m)']*100
df[['length(cm)','height(cm)']].boxplot()

**Many pets have length zero**

In [None]:
train.describe()

In [None]:
print(len(train[train['length(m)'] == 0]))
print(len(test[test['length(m)']==0]))

**93 rows in train and 44 rows in test have a length of zero**

In [None]:
#convert length(m) to length(cm)
train['length(cm)'] = train['length(m)'] * 100
test['length(cm)'] = test['length(m)'] * 100

In [None]:
train.drop('length(m)', axis=1, inplace=True)
test.drop('length(m)', axis=1, inplace=True)

In [None]:
train[train['length(cm)']==0].groupby(['length(cm)','pet_category']).size()

In [None]:
test['length(cm)'].mean()

In [None]:
# replace all 0 length with mean of lengths
val = train['length(cm)'].mean()
train['length(cm)'] = train['length(cm)'].replace(to_replace=0, value=val)
test['length(cm)'] = test['length(cm)'].replace(to_replace=0, value=val)

In [None]:
# check again for 0 length
print(len(train[train['length(cm)'] == 0]))
print(len(test[test['length(cm)']==0]))

In [None]:
train[['length(cm)','height(cm)']].describe()

In [None]:
#new feature
train['ratio_len_height'] = train['length(cm)']/train['height(cm)']

In [None]:
#relation between ratio and pet_category
sns.catplot(x='pet_category',y='ratio_len_height',data=train)

In [None]:
sns.catplot(x='breed_category',y='ratio_len_height',data=train)

* The ratio of length to height is a somewhat distinctive feature. Useful.

* I also found duration_days to be very useful.

In [None]:
sns.catplot(x='pet_category',y='duration_days',data=train)

In [None]:
sns.boxplot(x='breed_category',y='height(cm)',data=train)

*TODO: Modelling*
1. Generate a new ratio feature from length and height; research how pet length and height correlate and whether more features are possible.
2. Check for anomalies and correct them.
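
For the modelling step, the ratio has to be built the same way on train and test. A small helper along these lines would do it; the name `add_ratio` is hypothetical, and mapping a zero height to NaN is my assumption about how to handle that anomaly.

```python
import numpy as np
import pandas as pd


def add_ratio(df, length_col="length(cm)", height_col="height(cm)"):
    """Return a copy with ratio_len_height; a zero height yields NaN instead of inf."""
    out = df.copy()
    height = out[height_col].replace(0, np.nan)
    out["ratio_len_height"] = out[length_col] / height
    return out


# Toy rows; the last one has a zero height to show the guard
toy = pd.DataFrame({
    "length(cm)": [80.0, 50.0, 30.0],
    "height(cm)": [40.0, 25.0, 0.0],
})
toy_ratio = add_ratio(toy)
print(toy_ratio)
```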

In [None]:
# Col8-9: X1, X2 
#X1
print(train['X1'].value_counts())
print('*'*30)
print(test['X1'].value_counts())

In [None]:
#X2
print(train['X2'].value_counts())
print('*'*30)
print(test['X2'].value_counts())

*TODO: Modelling*
1. Research the anonymized features X1 and X2 and try to work out what they represent from their distributions.

In [None]:
#correlation matrix
plt.subplots(figsize=(10,8))
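# restrict to numeric columns; newer pandas versions raise an error on string columns in corr()
sns.heatmap(train.select_dtypes(include='number').corr(), annot= True)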

* After creating many features, the correlation matrix can help with feature selection: from each pair of features that are strongly correlated with each other (corr > 0.9), one can be removed.
* There is a mild correlation between X1 and X2.
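
The corr > 0.9 rule can be sketched as a small filter over the upper triangle of the correlation matrix. The helper name `drop_correlated` is hypothetical, and the toy frame stands in for the engineered feature matrix (targets excluded).

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.9):
    """Drop one feature from every pair whose absolute correlation exceeds threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy features: length(m) is just length(cm) rescaled, so the pair is perfectly correlated
toy = pd.DataFrame({
    "length(cm)": [10.0, 20.0, 30.0, 40.0, 50.0],
    "length(m)": [0.1, 0.2, 0.3, 0.4, 0.5],
    "height(cm)": [5.0, 3.0, 9.0, 1.0, 7.0],
})
reduced, dropped = drop_correlated(toy)
print(dropped)
```

The later column of each correlated pair is the one removed, so column order decides which feature survives.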

#### Additional Notes

1. For feature selection, use any univariate feature selection method: f_classif, chi2, or mutual_info_classif.

**I used ANOVA-F value f_classif in my case.**
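
As a rough illustration of ANOVA-F selection, the sketch below runs scikit-learn's `SelectKBest` with `f_classif` on synthetic data. The feature names, `k=1`, and the data itself are made up for the example; the real notebook would pass the engineered features and one of the two targets.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Synthetic 3-class target with one informative feature and two noise features
y = np.repeat([0, 1, 2], 30)
X = pd.DataFrame({
    "informative": y * 2.0 + rng.normal(scale=0.1, size=y.size),
    "noise_a": rng.normal(size=y.size),
    "noise_b": rng.normal(size=y.size),
})

selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = X.columns[selector.get_support()].tolist()
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
print(selected)
```

With two targets, the selection can be run once per target, since a feature that separates pet_category well may not help with breed_category.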

2. I have not used any model stacking or blending yet. XGB seems to give good results, as always.
3. For encoding categorical variables, go for One Hot Encoding. I also tried mean target encoding with a StratifiedKFold approach and a regularization parameter; it gave results similar to One Hot Encoding.
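
The mean target encoding mentioned in the last point could look roughly like the sketch below. It is a simplified version under stated assumptions: the target is reduced to a binary indicator (the hypothetical `is_pet_2`), the "regularization parameter" is interpreted as additive smoothing towards the global mean, and the function name `oof_target_encode` is my own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def oof_target_encode(cat, target, n_splits=5, smoothing=10.0, seed=42):
    """Out-of-fold smoothed mean encoding of `cat` against a binary `target`."""
    cat = cat.reset_index(drop=True)
    y = target.reset_index(drop=True).astype(int)
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)

    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in skf.split(np.zeros(len(cat)), y):
        fit_y = y.iloc[fit_idx].astype(float)
        prior = fit_y.mean()
        stats = fit_y.groupby(cat.iloc[fit_idx]).agg(["sum", "count"])
        # Smoothing pulls rare categories towards the prior (assumed regularization)
        smooth = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Each row is encoded with statistics from folds that do not contain it
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(smooth).fillna(prior).to_numpy()
    return encoded


# Toy data: Black is always the positive class, White never is
toy = pd.DataFrame({
    "color_type": ["Black", "Brown", "White", "Black", "Brown", "White"] * 5,
    "is_pet_2": [1, 0, 0, 1, 1, 0] * 5,
})
toy["color_te"] = oof_target_encode(toy["color_type"], toy["is_pet_2"])
te_means = toy.groupby("color_type")["color_te"].mean()
print(te_means)
```

For the multiclass targets here, the same encoding can be repeated once per class indicator.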


#### With better feature engineering and a careful validation strategy, it is easy to score better.
#### I hope this notebook was helpful. I will keep updating it and publish my modelling notebook soon.

### Thank you, kindly Upvote and Happy learning :)