#### **FIND THE UPDATED VERSION (WITH CELL OUTPUT), HERE - [GITHUB](https://gist.github.com/abhilash97/11945d1cdfe5658432d59932f1baeb88)**

# Predict the speed of Pet Adoption


The goal of the project is to build a product that predicts how quickly a pet will be adopted, given the description of the pet. The most interesting thing about the project is the dataset: it contains both structured and unstructured data, i.e. tabular as well as image data. The dataset is intuitive with respect to the problem and at the same time quite challenging.

# An overview of the dataset

The dataset used is an open dataset from Kaggle. The following gives an overview of the data fields representing a pet:

PetID - Unique hash ID of pet profile <br>
AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict <br> 
**Type - Type of animal (1 = Dog, 2 = Cat)** <br>
Name - Name of pet (Empty if not named) <br>
Age - Age of pet when listed, in months <br>
Breed1 - Primary breed of pet (Refer to BreedLabels dictionary) <br>
Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary) <br>
Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets) <br>
Color1 - Color 1 of pet (Refer to ColorLabels dictionary) <br>
Color2 - Color 2 of pet (Refer to ColorLabels dictionary) <br>
Color3 - Color 3 of pet (Refer to ColorLabels dictionary) <br>
MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified) <br>
FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified) <br>
Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure) <br>
Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure) <br>
Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure) <br>
Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified) <br>
Quantity - Number of pets represented in profile <br>
Fee - Adoption fee (0 = Free) <br>
State - State location in Malaysia (Refer to StateLabels dictionary) <br>
RescuerID - Unique hash ID of rescuer <br>
VideoAmt - Total uploaded videos for this pet <br>
PhotoAmt - Total uploaded photos for this pet <br>
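
As a quick illustration of how these coded fields read once decoded, here is a minimal sketch on a made-up three-row frame. The `TYPE_LABELS` and `GENDER_LABELS` dictionaries are hypothetical names that simply restate the mappings listed above; the rows are not taken from the dataset.

```python
import pandas as pd

# Code-to-label mappings restated from the field descriptions above.
TYPE_LABELS = {1: "Dog", 2: "Cat"}
GENDER_LABELS = {1: "Male", 2: "Female", 3: "Mixed"}

# Toy rows, made up for illustration only.
toy = pd.DataFrame({"Type": [1, 2, 2], "Gender": [1, 2, 3]})

# Replace the integer codes by their readable labels.
decoded = toy.assign(
    Type=toy["Type"].map(TYPE_LABELS),
    Gender=toy["Gender"].map(GENDER_LABELS),
)
print(decoded)
```

The same pattern would apply to the other coded columns, using the mappings given in the list.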

In [None]:
import numpy as np
import pandas as pd
from matplotlib import image
import matplotlib.pyplot as plt
import re
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from scipy import stats
import cv2
from keras.utils import Sequence

# 1. Load Data

The ETL process is performed locally. The data is extracted from the csv and jpg files. The transformed tabular data is stored in a dataframe and the image data in numpy files, which are then used for modelling.

### 1.1 Training Data 
Tabular/text data for the training set

In [None]:
# load and display the first 5 rows of the train data
df = pd.read_csv("../input/training/input.csv", index_col=0)
df.head(5)

In [None]:
# dataset dimension
n_rows = df.shape[0]
n_cols = df.shape[1]

print("\nNumber of rows - ", n_rows, "\nNumber of columns - ", n_cols)

# dataset datatypes - #int cols, #string cols
# would be useful during one-hot encoding
n_stringTypes = 0 

for i in df.iloc[0]:
    if type(i)==str:
        n_stringTypes+=1
print("Number of string type columns - ", n_stringTypes) 

### 1.2 Breed Labels [Part of training data]
Contains Type, and BreedName for each BreedID. Type 1 is dog, 2 is cat.

In [None]:
filepath = '../input/petfinder-adoption-prediction/PetFinder-BreedLabels.csv'
df_breedLabels = pd.read_csv(filepath)
df_breedLabels.head(5)

In [None]:
# dataset dim
n_df_breedRows = df_breedLabels.shape[0]
n_df_breedCols = df_breedLabels.shape[1]
print("Dataset shape - ", df_breedLabels.shape)
# different number of breeds for dog [type=1] and cat [type=2]

grp1, grp2 = df_breedLabels.groupby('Type').apply(lambda ser: ser['BreedName'].unique())
print("\nDog Breeds - ", grp1[:10], "....","\nNumber of Dog Breeds - ", len(grp1))
print("\nCat Breeds - ", grp2[:10], "....","\nNumber of Cat Breeds - ", len(grp2))

### 1.3 Color Labels [Part of training data] 
 Contains ColorName for each ColorID

In [None]:
filepath = '../input/petfinder-adoption-prediction/PetFinder-ColorLabels.csv'
df_ColorLabels = pd.read_csv(filepath)
print(df_ColorLabels)
print("Shape - ", df_ColorLabels.shape)
print("Number of unique color for pet dataset - ", len(df_ColorLabels))

### 1.4 State Labels [Part of Training Data]
Contains StateName for each StateID

In [None]:
filepath = '../input/petfinder-adoption-prediction/PetFinder-StateLabels.csv'
df_StateLabels = pd.read_csv(filepath)
print(df_StateLabels)
print("Shape - ", df_StateLabels.shape)
print("Number of unique states for pet dataset - ", len(df_StateLabels))

### 1.5 Loading Pet Image Data

The tabular data (training) is loaded in a dataframe above. Now, let's load some image data corresponding to the tabular data. Each row in the dataframe has a PetID value. The images are stored with the PetID in the filename, in the format `<PetID>-<Integer>`. The integer indicates which image of that pet it is; the default image of the pet is the one labelled 1.
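
The naming scheme can be captured in a small helper. This is only a sketch: `pet_image_path` is a hypothetical function, the `.jpg` extension is assumed from the files loaded in this notebook, and the PetID and directory used here are made up.

```python
import os

def pet_image_path(img_dir, pet_id, image_no=1):
    """Build the path of the image_no-th photo of a pet (hypothetical helper)."""
    return os.path.join(img_dir, f"{pet_id}-{image_no}.jpg")

# Default image (number 1) of a made-up PetID.
print(pet_image_path("train_images", "abc123"))
```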

In [None]:
# Loading pet images corresponding to the top 2 rows of the dataframe. This is just for visualization
# Not loading all images at once, as it would take up a lot of memory
def plot_image(filename):
    data = image.imread(filename)
    print("\nData type - ",data.dtype, "\nData shape - ", data.shape)
    # display image
    plt.imshow(data)
    plt.show()
    
img_path = "../input/petfinder-adoption-prediction/train_images/"
dff = df.head(2)
for i in range(0,len(dff)):
    print("\nName - ", dff.loc[i,'Name'], "\nPetID - ",dff.loc[i,'PetID'])
    plot_image(img_path+dff.loc[i,'PetID']+'-'+'1.jpg')
    

# 2. Data Quality Assessment - Exploring the Data
Let's conduct some quality assessment on the data obtained. Based on the assessments the data would be accordingly cleaned and transformed (if required)

#### CHECK FOR MISSING VALUES AND DUPLICATES
For duplicates, we don't need to search every column. We only need to check columns that semantically shouldn't contain duplicate values, i.e. PetID

In [None]:
df.loc[:,'PetID'].duplicated().any() #if False, no duplicate values present

In [None]:
def checkIfNull():
    for i in df.columns:
        if df[i].isnull().any():
            print('Column','"',i,'"',' has missing values')
        else:
            continue
checkIfNull()

In [None]:
#Let's print the column Names
names = df.loc[:,'Name']
names[:28]
# The NaN values are visible

Filling the missing values - The best way to fill the missing names is to give those pets a placeholder name. Some pets are already listed as 'No Name Yet' and some as 'No Name'. It is easier to encode the names if all of these unnamed pets share a single name (e.g. 'Unknown')

In [None]:
# fill missing names first, so every entry is a string
df['Name'] = df['Name'].fillna('Unknown')
# map every name containing "Name" (e.g. 'No Name', 'No Name Yet') to 'Unknown'
df.loc[df['Name'].str.contains('Name'), 'Name'] = 'Unknown'
print(df.loc[:,'Name'][:15])
print(df.loc[:,'Name'].isnull().any())

#### DATA TYPE CHECK
Check if the data types of the columns match their content

In [None]:
dtypes = {}
for i,k in enumerate(df.iloc[0]):
    dtypes[df.columns[i]] = type(k)
print(dtypes)
# from the output it can be stated that the data types of each col match their content

#### SET AND FOREIGN KEY MEMBERSHIP
Set Membership -> Check if only allowed values are chosen for categorical fields.<br>
FK Membership -> Check if only allowed values (with respect to the reference table values) are present in a field

In [None]:
#Set Membership 

# Gender
print('Gender - ', df.loc[:,'Gender'].isin([1,2,3]).all())
#Fur length
print('Fur length - ', df.loc[:,'FurLength'].isin([0,1,2,3]).all())
# MaturitySize
print('Maturity - ', df.loc[:,'MaturitySize'].isin([0,1,2,4,3]).all())
# Vaccinated
print('Vaccinated - ', df.loc[:,'Vaccinated'].isin([1,2,3]).all())
# Dewormed
print('Dewormed - ', df.loc[:,'Dewormed'].isin([1,2,3]).all())
# Sterilized
print('Sterilized - ', df.loc[:,'Sterilized'].isin([1,2,3]).all())
# Health
print('Health - ', df.loc[:,'Health'].isin([0,1,2,3]).all())

In [None]:
# Foreign Key membership - Breed, color, state

#breed
print('Breed 1 - ', df.loc[:,'Breed1'].isin(df_breedLabels.loc[:,'BreedID']).all())
print('Breed 2 - ', df.loc[:,'Breed2'].isin(df_breedLabels.loc[:,'BreedID']).all())

#color
print('Color 1 - ', df.loc[:,'Color1'].isin(df_ColorLabels.loc[:,'ColorID']).all())
print('Color 2 - ', df.loc[:,'Color2'].isin(df_ColorLabels.loc[:,'ColorID']).all())
print('Color 3 - ', df.loc[:,'Color3'].isin(df_ColorLabels.loc[:,'ColorID']).all())

#State
print('State - ', df.loc[:,'State'].isin(df_StateLabels.loc[:,'StateID']).all())

In [None]:
# Columns Breed1, Breed2, Color2 and Color3 need a closer look
def getIndices(col,colname):
    indices = col[col==False].index[:]
    ls = [df.loc[i,colname] for i in indices]
    return ls, indices
        
breed1 = df.loc[:,'Breed1'].isin(df_breedLabels.loc[:,'BreedID'])
breed2 = df.loc[:,'Breed2'].isin(df_breedLabels.loc[:,'BreedID'])
color2 = df.loc[:,'Color2'].isin(df_ColorLabels.loc[:,'ColorID'])
color3 = df.loc[:,'Color3'].isin(df_ColorLabels.loc[:,'ColorID'])

b1,ind = getIndices(breed1, 'Breed1')
b2,ind2 = getIndices(breed2, 'Breed2')
c2,ind3 = getIndices(color2, 'Color2')
c3,ind4 = getIndices(color3, 'Color3')
print('\nb1 values (anomaly) -', b1,' at indices', ind,'\nb2 no of anomalous values - ',len(ind2))
print('c2 no of anomalous values - ', len(ind3), '\nc3 no of anomalous values - ', len(ind4))

Since the number of samples with values outside the allowed set (with respect to the reference tables) is quite large, removing these rows based on these attributes alone would discard a significant chunk of the data, and filling in the values would only make the samples spurious. So, let's keep them as they are and treat these values as breed/color unknown.
Moreover, as the number of out-of-reference values in Breed2 and Color3 is close to the total number of samples, these columns won't contribute much to training and can be dropped.
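
One way to put a number on "close to the total", sketched on made-up values, is to compute the share of entries that do not appear in the reference table. `share_outside_reference` is a hypothetical helper and is not part of the notebook's code.

```python
import pandas as pd

def share_outside_reference(values, allowed):
    """Fraction of entries in `values` that do not appear in `allowed`."""
    return float((~values.isin(allowed)).mean())

toy_breed2 = pd.Series([0, 0, 0, 307, 0])    # made-up values; 0 is not a reference ID here
toy_breed_ids = pd.Series([265, 266, 307])   # made-up reference IDs

print(share_outside_reference(toy_breed2, toy_breed_ids))
```

A share near 1 would support dropping the column, while a small share would suggest keeping it.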

In [None]:
fig, axes = plt.subplots(1,2, figsize=(14,5))
axes[0].hist(df.loc[:,'Breed2'], color='blue')
axes[0].set_xlabel('Breed2 values')
axes[0].set_ylabel('Frequency')
axes[0].set_title('No of pets per Breed2')

axes[1].hist(df.loc[:,'Color3'], bins = df_ColorLabels.shape[0],histtype='barstacked')
axes[1].set_xlabel('Color3 values')
axes[1].set_ylabel('Frequency')
axes[1].set_title('No of pets per Color3')
plt.show()

# 3. Feature Engineering
Feature selection, feature extraction, Normalization, ...

From the plots above, it is clear that most of the values in Breed2 and Color3 are 0, i.e. unknown, and as such provide little information for predicting the adoption speed. 
These features can therefore be removed manually.

In [None]:
# Dropping the Breed2 and Color3 columns
df = df.drop(['Breed2', 'Color3'], axis=1)
df.head(5)

Let's check how the data is distributed in the other categorical columns, especially those with more than 3 unique values (assumption: the data would be approximately uniformly distributed when the number of bins is small)

In [None]:
fig, axes = plt.subplots(1,3, figsize=(17,4))

axes[0].hist(df.loc[:,'FurLength'], bins = range(0,6,1))
axes[0].set_xlabel('FurLength values')
axes[0].set_ylabel('Frequency')
axes[0].set_title('# of Pets per Fur length')

axes[1].hist(df.loc[:,'MaturitySize'], bins = range(0,6,1),color='g')
axes[1].set_xlabel('MaturitySize values')
axes[1].set_ylabel('Frequency')
axes[1].set_title('# of Pets vs MaturitySize')

axes[2].hist(df.loc[:,'Health'], bins = range(0,5,1),color='y')
axes[2].set_xlabel('Health values')
axes[2].set_ylabel('Frequency')
axes[2].set_title('# of Pets vs Health Values')

plt.show()

#### DATA DISTRIBUTION IN COLUMNS - MIN-MAX VALUES, OVERALL STATS
How are the values distributed in each data column? Let's perform some statistical analysis

In [None]:
# non-categorical attributes - Age, Quantity,Fee,VideoAmt,PhotoAmt. Calculating range of values

age_min = df.loc[:,'Age'].min() #age is in months
age_max = df.loc[:,'Age'].max()
print("Pet ages range - ", age_max-age_min, "(max -", age_max,",min - ", age_min,")")
print("Average age of pets - ", df.loc[:,'Age'].mean())

quantity_min = df.loc[:,'Quantity'].min()
quantity_max = df.loc[:,'Quantity'].max()
print("\nMin and Max no. pets in a profile - ", quantity_min,",",quantity_max)

fee_min = df.loc[:,'Fee'].min()
fee_max = df.loc[:,'Fee'].max()
print("\nMin and Max fee - ", fee_min,",",fee_max)
print("Average adoption fee of pets - ", df.loc[:,'Fee'].mean())

video_avg = df.loc[:,'VideoAmt'].mean()
print("\nMean no of videos uploaded for each pet - ", video_avg)

photo_avg = df.loc[:,'PhotoAmt'].mean()
print("Mean no of photos uploaded for each pet - ", photo_avg)

From the above results, the average age (in months) of the pets is very close to the minimum age and far from the maximum age. 
Similarly, the average adoption fee is closer to the minimum adoption fee. 

Outliers may be present in both cases. A few visualizations should help clear the air.

The mean number of videos is ~0. As such, this attribute is unlikely to contribute much to training and can be dropped. 

In [None]:
#AdoptionSpeed vs Distribution of VideoAmt, PhotoAmt
def adoptionSpeedDistribution():
    adoption0 = np.where(df.loc[:,'AdoptionSpeed']==0)
    adoption1 = np.where(df.loc[:,'AdoptionSpeed']==1)
    adoption2 = np.where(df.loc[:,'AdoptionSpeed']==2)
    adoption3 = np.where(df.loc[:,'AdoptionSpeed']==3)
    adoption4 = np.where(df.loc[:,'AdoptionSpeed']==4)
    adoption_ = [adoption0, adoption1, adoption2, adoption3, adoption4]
    return adoption_  

In [None]:
adoption_ = adoptionSpeedDistribution()
n_pets = []
video_amt = []
for i in range(5):
    n_pets += [i]* len(adoption_[i][0])
    video_amt += [i]* df.loc[adoption_[i][0],'VideoAmt'].sum()

fig, axes = plt.subplots(1,2, figsize=(14,5))

# The mean is not plotted for VideoAmt; the sum is plotted instead, since the mean is ~0 for every
# AdoptionSpeed target. The sum (which is also small, as the graph shows) makes it clear that
# VideoAmt has little or no impact on the target

axes[0].hist([n_pets,video_amt],bins = range(0,6,1), color = ['r','y'], label=['pets', 'videos'])
axes[1].plot(range(0,101),df.loc[:100, 'VideoAmt'], label='VideoAmt')
axes[1].plot(range(0,101),df.loc[:100, 'AdoptionSpeed'], label='AdoptionSpeed')

axes[0].set_xlabel('Adoption Speed category (0-4)')
axes[0].set_ylabel('Number of pets, videos')

axes[1].set_xlabel('Pet# (Only till 100)')
axes[1].set_ylabel('Number of videos')
axes[1].legend()
axes[0].legend()
axes[0].set_title('AdoptionSpeed Distribution wrt VideoAmt_sum')
axes[1].set_title('Distribution VideoAmt')
plt.show()

In [None]:
adoption_p = adoptionSpeedDistribution()
n_pets_ = []
photo_amt = []
for i in range(5):
    n_pets_ += [i]* int(len(adoption_p[i][0])/100)
    photo_amt += [i]* int(df.loc[adoption_p[i][0],'PhotoAmt'].astype('int64').mean())

fig, axes = plt.subplots(1,2, figsize=(14,5))

axes[0].hist([n_pets_,photo_amt],bins = range(0,6,1), color = ['r','y'], label=['pets', 'photos'])
axes[1].plot(range(0,101),df.loc[:100, 'PhotoAmt'], label='PhotoAmt')
axes[1].plot(range(0,101),df.loc[:100, 'AdoptionSpeed'], label='AdoptionSpeed')

axes[0].set_xlabel('Adoption Speed category (0-4)')
axes[0].set_ylabel('Number of pets (1/100), photos')

axes[1].set_xlabel('Pet# (Only till 100)')
axes[1].set_ylabel('Number of photos')
axes[1].legend()
axes[0].legend()
axes[0].set_title('AdoptionSpeed Distribution wrt PhotoAmt_mean')
axes[1].set_title('Distribution PhotoAmt')
plt.show()

From the plots, the video amount (summed per bin) at every target is too small to have any impact on the adoption speed. The 2nd VideoAmt graph shows an almost flat curve with respect to AdoptionSpeed. This is not the case for PhotoAmt. Hence, VideoAmt can be dropped

In [None]:
df = df.drop(['VideoAmt'], axis=1)
df.shape

In [None]:
# box-plot for Age and adoption price - to check for outliers
fig, axes = plt.subplots(2,2,figsize=(14,6), sharex=True)

sns.set(style="whitegrid")
sns.boxplot(x=df.loc[:,'Age'], ax = axes[0,0])
axes[0,0].set_title('Age Box-plot')

sns.boxplot(x=df.loc[:,'Fee'], ax = axes[0,1])
axes[0,1].set_title('Adoption Fee Box-plot')

axes[1,0].scatter(range(len(df)), df.loc[:,'Age'])
axes[1,0].set_xlabel('Sample#')
axes[1,0].set_ylabel('Age')
axes[1,0].set_title('Pet Ages vs sample#')

axes[1,1].scatter(range(len(df)), df.loc[:,'Fee'])
axes[1,1].set_xlabel('Sample#')
axes[1,1].set_ylabel('Adoption Fee')
axes[1,1].set_title('Adoption fee vs sample#')
plt.setp(axes, yticks=[])
plt.tight_layout()


As can be seen from the plots, both 'Age' and 'Fee' have certain outliers. Let's keep the outliers for now (as no upper bounds/limits are mentioned for these 2 attributes)
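
To quantify what the box plots show, the usual 1.5 x IQR whisker rule can be applied directly. The sketch below uses made-up ages; `iqr_outlier_mask` is a hypothetical helper and the factor 1.5 is the conventional default, not something tuned for this dataset.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

toy_age = [1, 2, 2, 3, 3, 4, 120]   # made-up ages in months
mask = iqr_outlier_mask(toy_age)
print(int(mask.sum()), "of", len(toy_age), "values flagged")
```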

In [None]:
# Checking adoption speed of cats vs dogs

adoption_ = adoptionSpeedDistribution()
cats = []
dogs = []
for i in range(len(adoption_)):
    cats += [i]*(np.where(df.loc[adoption_[i][0],'Type']==2)[0].shape[0])
    dogs += [i]*(np.where(df.loc[adoption_[i][0],'Type']==1)[0].shape[0])

fig, axes = plt.subplots()
plt.hist([cats,dogs],bins = range(0,6,1), color = ['b','g'], label=['cats', 'dogs'])
plt.xlabel('Adoption Speed category (0-4)')
plt.ylabel('Number of pets')
plt.legend()
plt.title('Cats and Dogs Adoption Speed')
plt.show()

The adoption speed distributions of cats and dogs are broadly comparable. 
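
Because the dataset may contain different numbers of cats and dogs, comparing shares per type can be easier to read than raw counts. A minimal sketch with made-up rows, assuming the same `Type` and `AdoptionSpeed` column names as in the dataframe:

```python
import pandas as pd

# Made-up rows: Type 1 = dog, Type 2 = cat.
toy = pd.DataFrame({
    "Type": [1, 1, 1, 2, 2, 2],
    "AdoptionSpeed": [0, 2, 4, 0, 2, 4],
})

# Each row of the result sums to 1: the share of every adoption-speed class within a type.
shares = pd.crosstab(toy["Type"], toy["AdoptionSpeed"], normalize="index")
print(shares)
```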

#### LABEL ENCODING CATEGORICAL (STRING TYPE) ATTRIBUTES, SCALING ATTRIBUTE VALUES

In [None]:
# Label Encoding 'Name', 'RescuerID', 'State'
def labelEncode(attr):
    enc = LabelEncoder()
    attr_ = list(attr)
    enc.fit(attr_)
    return enc.transform(attr_)

#Not encoding PetID, since it represents the image, which would be separated out as a different dataset
df.loc[:,'Name'] = labelEncode(df.loc[:,'Name'])
df.loc[:,'RescuerID'] = labelEncode(df.loc[:,'RescuerID'])
df.loc[:,'State'] = labelEncode(df.loc[:,'State'].astype(str))
df.head(5)

In [None]:
df.loc[:,'State'].max() # largest encoded state label, i.e. number of unique states - 1

The encoded values of the Name, RescuerID and Breed1 attributes are quite large compared with the other columns. Feeding them in as they are could make model training uneven, so their distributions need to be checked to see whether some values can be removed. 

However, before scaling the values of these columns, let's first check whether these attributes are really important for training the model. 

#### FEATURE SELECTION - IMPACT OF AN ATTRIBUTE IN DETERMINING THE TARGET

In [None]:
# Correlation matrix of the dataset

corr = df.corr(method='pearson')
adoptionSpeed = corr['AdoptionSpeed'][:-1]
#correlation of attributes with adoptionspeed
adoptionSpeed

Instead of setting a threshold on the correlation values to eliminate attributes (most of the values are in a similar range), let's simply eliminate the least correlated attribute (a correlation of 1 is perfectly positive, -1 perfectly negative, 0 none)
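
A minimal sketch of this idea on made-up data, assuming "least correlated" means the smallest absolute Pearson correlation with the target. This reading may differ from the selection made in the next cell, and the `Noise` column and all values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [1, 2, 3, 4, 5, 6],
    "Noise": [3, 1, 3, 1, 3, 1],
    "AdoptionSpeed": [0, 1, 1, 2, 3, 4],
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = toy.corr(method="pearson")["AdoptionSpeed"].drop("AdoptionSpeed")

# The feature whose correlation is closest to zero, regardless of sign.
weakest = corr_with_target.abs().idxmin()
print(weakest)
```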

In [None]:
x = np.where(adoptionSpeed<0)
col = adoptionSpeed[x[0]].idxmax() #column to delete
col

In [None]:
df = df.drop(['RescuerID'],axis=1)
df.shape

In [None]:
#correlation matrix
corr

From the correlation matrix above, it can be seen (through manual observation) that 3 features, Vaccinated, Dewormed and Sterilized, are highly correlated. Thus, instead of keeping all 3, we can keep a single feature that is representative of all of them. Although PCA is the usual choice for feature extraction, here it is done manually, since the number of features is very small (PCA would be more apt for higher-dimensional data, say 500 features) 
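
The manual combination can be pictured as taking the most frequent of the three codes in each row. Below is a small NumPy sketch on made-up codes; `row_mode` is a hypothetical helper, and resolving ties to the smallest code is an assumption of this sketch.

```python
import numpy as np

def row_mode(a):
    """Most frequent value in each row; ties resolve to the smallest value."""
    a = np.asarray(a, dtype=int)
    # bincount counts occurrences of each non-negative code; argmax picks the first maximum.
    return np.apply_along_axis(lambda row: np.bincount(row).argmax(), 1, a)

# Made-up (Vaccinated, Dewormed, Sterilized) codes for four pets.
toy = np.array([[1, 1, 2],
                [2, 2, 2],
                [3, 1, 3],
                [1, 2, 3]])
print(row_mode(toy))
```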

In [None]:
def combineFeatures(dff, col1, col2, col3):
    # stack the three columns side by side -> shape (n_samples, 3)
    a = dff.loc[:, [col1, col2, col3]].to_numpy()
    # representative value per row = most frequent of the three values
    col_new = [int(stats.mode(a[i])[0]) for i in range(a.shape[0])]
    print('Original dataframe - ', dff.shape)
    # use the dataframe passed in (dff), not the global df
    dff = dff.drop([col1, col2, col3],axis=1)
    # Let the new column name be Vaccinated
    dff['Vaccinated'] = col_new
    print('New Dataframe - ', dff.shape)
    return dff

df = combineFeatures(df, 'Vaccinated', 'Dewormed', 'Sterilized')

The next step would be to scale the values of 'Fee' to the range [0-10]. But before that, let's visualize the distribution of the 'Fee' values.

In [None]:
plt.hist(df.loc[:,'Fee'])

As can be seen from the plot above, >14k samples have a value of 0. As such, this attribute won't contribute much to training and might even make it worse; it is similar to having a sparse vector as training data. 
Given the distribution, there is no point in scaling the values and keeping them for training (scaling to a smaller range, say 1-10, would only push most of the non-zero values close to 0). Hence, it is better to remove this attribute, although semantically it might seem important
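
The histogram groups values into wide bins, so a direct count of exact zeros is a useful cross-check. A sketch on made-up fees, not on the real column:

```python
import pandas as pd

toy_fee = pd.Series([0, 0, 0, 50, 0, 0, 300, 0])   # made-up adoption fees

# Share of listings whose fee is exactly zero.
zero_share = float((toy_fee == 0).mean())
print(f"{zero_share:.0%} of the toy fees are zero")
```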

In [None]:
df = df.drop(['Fee'], axis=1)
print('Current Dataframe - ', df.shape)

#### FEATURE ENGINEERING IMAGE DATA

In [None]:
from skimage.color import gray2rgb
from skimage.transform import resize

mean = [0.485,0.456,0.406] # standard values, based on ImageNet data
std = [0.229,0.224,0.225] # standard values, based on ImageNet data

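def read_image(path):
    """
    Read an image, scale it to [0, 1] and resize it to 224x224x3
    to feed into the pretrained CNN (DenseNet169 with its ImageNet top)
    """
    default_path = '../input/petfinder-adoption-prediction/train_images/86e1089a3-1.jpg'
    try:
        img = image.imread(path)
    except FileNotFoundError:
        img = image.imread(default_path) #read this default img (randomly selected to fill missing data)
    # size now matches the docstring and the 224x224 input of the ImageNet top
    img = resize(img/255.0, (224,224))
    # convert only true grayscale (2-D) images; RGB images are returned unchanged
    return gray2rgb(img) if img.ndim == 2 else img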

def normalize_image(img):
    
    img[:,:,0] -= mean[0]
    img[:,:,0] /= std[0]
        
    img[:,:,1] -= mean[1]
    img[:,:,1] /= std[1]
        
    img[:,:,2] -= mean[2]
    img[:,:,2] /= std[2]
        
    return img
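
The per-channel arithmetic in `normalize_image` can also be written with NumPy broadcasting. This is an equivalent sketch rather than a replacement: `normalize_image_vectorized` is a hypothetical name, and unlike the in-place version above it returns a new array.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize_image_vectorized(img):
    """Return a normalised copy of an (H, W, 3) image with values in [0, 1]."""
    # The (3,) mean/std arrays broadcast over the last (channel) axis.
    return (np.asarray(img, dtype=np.float64) - IMAGENET_MEAN) / IMAGENET_STD

toy_img = np.full((2, 2, 3), 0.5)   # made-up 2x2 RGB image
out = normalize_image_vectorized(toy_img)
print(out[0, 0])
```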

# 4. Data Preparation

Preparing the data in the right format to feed into the neural network model, plus utility functions to help during training, plotting, etc.

In [None]:
from sklearn.model_selection import train_test_split

# separate the target and features
target = df.loc[:, 'AdoptionSpeed'].to_numpy()
features = df.drop(['AdoptionSpeed'], axis=1).to_numpy()

train_x, val_x, train_y, val_y = train_test_split(features, target, test_size=0.10, random_state=42)

In [None]:
# remove the image col from the numpy arrays (index 12 in a row)
train_images = train_x[:, 12]
val_images = val_x[:,12]
train_x = np.delete(train_x, 12, 1)
val_x = np.delete(val_x, 12, 1)

In [None]:
# One Hot Encoding the labels (fit on the training labels only, then reuse the fitted encoder)
enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
y_train = enc.fit_transform(train_y.reshape(-1,1))
y_val = enc.transform(val_y.reshape(-1,1))
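
For intuition, one-hot encoding of the five adoption-speed classes can be reproduced with plain NumPy. A sketch with made-up labels, assuming the classes are the integers 0 to 4:

```python
import numpy as np

labels = np.array([0, 3, 4, 1])     # made-up AdoptionSpeed labels

# Row i of the identity matrix is the one-hot vector of class i.
one_hot = np.eye(5)[labels]

# argmax along each row recovers the original integer label.
recovered = one_hot.argmax(axis=1)
print(one_hot)
print(recovered)
```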

In [None]:
print('Train X - ', train_x.shape, '\nVal X - ', val_x.shape, '\nTrain Y - ', y_train.shape, '\nVal Y - '
     , y_val.shape, '\nTrain images - ', train_images.shape, '\nVal images - ', val_images.shape)

In [None]:
# get the output of a neural network layer. Would be useful for data generator
def get_layer_output(model, data, layer = 'dense_4'):
    
    network_output = model.get_layer(layer).output
    feature_extraction_model = Model(model.input, network_output)
    prediction = feature_extraction_model.predict(np.asarray(data).astype(np.float32))
    #print(type(prediction))
    #print(prediction.shape)
    return np.asarray(prediction).astype(np.float32)

In [None]:
# Generator to read and preprocess data in batches for model training
class DataGenerator(Sequence) :
     
    def __init__(self, trainX, train_imgs, y_train, model, densenet, batch_size) :
        self.trainx = trainX
        self.imgs = train_imgs
        self.labels = y_train
        self.model_1 = model
        self.densenet = densenet
        self.batch_size = batch_size
    
    def __len__(self) :
        # np.int was removed from recent NumPy releases; the built-in int works everywhere
        return int(np.ceil(len(self.imgs) / float(self.batch_size)))
  
    def __getitem__(self, idx) :
        batch_x_imgs = self.imgs[idx * self.batch_size : (idx+1) * self.batch_size]
        batch_x = self.trainx[idx * self.batch_size : (idx+1) * self.batch_size]
        batch_y = self.labels[idx * self.batch_size : (idx+1) * self.batch_size]
        
        NN_1_data = get_layer_output(self.model_1, batch_x)
        images = []
        for file in batch_x_imgs:
            images.append(normalize_image(read_image(img_path+file+'-1.jpg')))
        
        DenseNet_data = get_layer_output(self.densenet, np.array(images), layer='avg_pool')
        
        return np.concatenate((NN_1_data, DenseNet_data), axis=1), np.array(batch_y)
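
The index arithmetic used by `__len__` and `__getitem__` is easiest to see on small numbers. A sketch with a hypothetical `batch_bounds` helper and a made-up sample count:

```python
import math

def batch_bounds(n_samples, batch_size):
    """Start/end indices of every batch; the last batch may be smaller."""
    n_batches = math.ceil(n_samples / batch_size)
    return [(i * batch_size, min((i + 1) * batch_size, n_samples))
            for i in range(n_batches)]

bounds = batch_bounds(150, 64)
print(bounds)   # [(0, 64), (64, 128), (128, 150)]
```

Python slicing clips at the end of the array, which is why the generator's slices need no explicit `min`.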

In [None]:
def plot_history(history):
    # summarize history for accuracy
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'val'], loc='upper left')
    plt.show()
    # summarize history for loss
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'val'], loc='upper left')
    plt.show()

# 5. Model Training

The proposed model is similar to the architecture shown in the figure below. The ordered data ('categorical data A' in the figure) is fed into the 1st neural network (NN-1), which is trained on the actual targets. After NN-1 is trained, the output (Out-1) of one of its layers is taken to prepare the input for the 2nd neural network (NN-2). 
The image data is fed into a pretrained convolutional neural network (e.g. ResNet50; this notebook uses DenseNet169), and the output of its last hidden layer is extracted (Out-2). This output, combined with the NN-1 output (Out-1 + Out-2), is then fed into NN-2 as input. NN-2 then outputs the class probabilities


Fig credits - [[StackExchange](https://datascience.stackexchange.com/questions/29634/how-to-combine-categorical-and-continuous-input-features-for-neural-network-trai)]

![model](https://i.stack.imgur.com/QgQFq.png)
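
In terms of array shapes, the combination step is a plain concatenation along the feature axis. The sketch below uses zero-filled stand-ins and assumes Out-1 has 64 features (the last hidden layer of NN-1 in this notebook) and Out-2 has 1664 features (the width of DenseNet169's `avg_pool` output), which together give the 1728-wide NN-2 input.

```python
import numpy as np

batch = 4                                # made-up batch size
out_1 = np.zeros((batch, 64))            # stand-in for the NN-1 hidden-layer output
out_2 = np.zeros((batch, 1664))          # stand-in for the DenseNet169 'avg_pool' output

# Concatenate per sample, along the feature axis.
nn2_input = np.concatenate((out_1, out_2), axis=1)
print(nn2_input.shape)   # (4, 1728)
```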

In [None]:
from keras.models import Model, Sequential
from keras.layers import Flatten, Dense, Activation, Input, Dropout, BatchNormalization
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.optimizers import Adam, SGD
from keras.applications import DenseNet169
from keras.models import load_model

#### NEURAL NETWORK MODEL 1 (PET ATTRIBUTES DATA)

In [None]:
# NN-1 model
inputs = Input(shape=(14,))
x = Dense(32, activation='sigmoid')(inputs)
#x = Dropout(0.2)(x)
x = Dense(64, activation='sigmoid')(x)
#x = Dense(128)(x) ## Deeper networks degrade the model. Doesn't fit the data well
#x = Dense(256, activation='tanh')(x)
#x = Dropout(0.25)(x)
#x = Dense(512, activation='sigmoid')(x)
#x = Dense(20, activation='tanh')(x)
#x = Dropout(0.25)(x)
out = Dense(5, activation='softmax')(x)

model = Model(inputs=inputs, outputs=out)
model.summary()

In [None]:
# NN-1 model
lr = 0.001
# monitor 'val_accuracy' to match the metric name used in compile() and in plot_history()
checkpt = ModelCheckpoint(filepath='../input/output/models/best_model.h5',monitor='val_accuracy',save_best_only=True)
earlystop = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=10, verbose=1, mode='min')
model.compile(optimizer=Adam(learning_rate=lr),loss='categorical_crossentropy',metrics=['accuracy'])

In [None]:
# EXPERIMENTAL
# scaling down the name column of training data
def scaleData(X):
    X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    return X_std * (10)
train_x[:,3] = scaleData(train_x[:, 3])
train_x[:,1] = scaleData(train_x[:, 1])

# After scaling, all other values remain the same, except that training accuracy increases by 8%
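
The cell above rescales only `train_x`. If the validation columns are meant to be on the same scale, one option is to reuse the training minimum and maximum, as in this sketch. `fit_min_max` and `apply_min_max` are hypothetical helpers and the numbers are made up.

```python
import numpy as np

def fit_min_max(col):
    """Minimum and maximum of a training column."""
    return float(np.min(col)), float(np.max(col))

def apply_min_max(col, lo, hi, upper=10.0):
    """Scale a column to [0, upper] using previously fitted bounds."""
    return (np.asarray(col, dtype=float) - lo) / (hi - lo) * upper

toy_train = np.array([0.0, 50.0, 100.0])   # made-up training column
toy_val = np.array([25.0, 100.0])          # made-up validation column

lo, hi = fit_min_max(toy_train)
train_scaled = apply_min_max(toy_train, lo, hi)
val_scaled = apply_min_max(toy_val, lo, hi)   # same bounds as the training data
print(train_scaled, val_scaled)
```

Validation values outside the training range would fall outside [0, 10] with this approach; whether to clip them is a separate design choice.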

In [None]:
# NN-1 model
history = model.fit(np.asarray(train_x).astype(np.float32),np.asarray(y_train),epochs=200,batch_size=32,shuffle=True,
          validation_data=(np.asarray(val_x).astype(np.float32),np.asarray(y_val)),callbacks=[checkpt])

In [None]:
plot_history(history) #an averaged out curve might look less noisy

In [None]:
model.evaluate(np.array(val_x).astype(np.float32), np.array(y_val), batch_size=32)

In [None]:
model.save('best_model_latest.h5')
#model = load_model('../input/models-files/best_model_latest.h5')

#### NEURAL NETWORK MODEL 2 (COMBINES ATTRIBUTE DATA FEATURES AND IMAGE FEATURES)

In [None]:
# NN-2 model
model_densenet = DenseNet169(include_top=True, weights="imagenet")
model_densenet.summary()

In [None]:
# NN-2 model
batch_size = 64

train_gen = DataGenerator(train_x, train_images, y_train, model, model_densenet, batch_size)
val_gen = DataGenerator(val_x, val_images, y_val, model, model_densenet,batch_size)

In [None]:
# NN-2 model
# Model architecture - input : (1728,)
inputs = Input(shape=(1728,))

x = Dense(1024, activation='sigmoid')(inputs)
x = Dropout(0.2)(x)

x = Dense(512, activation='sigmoid')(x)

x = Dense(256, activation='tanh')(x)
x = Dropout(0.25)(x)

x = Dense(256, activation='sigmoid')(x)

x = Dense(64, activation='tanh')(x)
x = Dropout(0.25)(x)

out = Dense(5, activation='softmax')(x)

model_2 = Model(inputs=inputs, outputs=out)
model_2.summary()

In [None]:
# NN-2 model
lr = 0.001
# monitor 'val_accuracy' to match the metric name used in compile()
checkpt = ModelCheckpoint(filepath='../input/output/models/best_model_Part2.h5',monitor='val_accuracy',save_best_only=True)
earlystop = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=10, verbose=1, mode='min')
model_2.compile(optimizer=Adam(learning_rate=lr),loss='categorical_crossentropy',metrics=['accuracy'])

In [None]:
# NN-2 model
# steps are counted in batches; len() of a Sequence is its number of batches
history = model_2.fit_generator(generator=train_gen,steps_per_epoch = len(train_gen),epochs = 20,
                   verbose = 1,validation_data = val_gen,validation_steps = len(val_gen))

In [None]:
model_2.save('final_model_latest.h5')

#### Note: Due to certain resource constraints (memory errors, long training times and connectivity issues) the 1st model training was halted. The model was saved and trained separately over several days, for a smaller number of epochs (8 or 10) each time, on Kaggle. Since an editable Kaggle notebook has no provision for saving cell output, the model training output could not be included in this main notebook

# 6. Model Evaluation
 
The model performance is measured using classification metrics like classification report, confusion matrix

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import multilabel_confusion_matrix

In [None]:
# load the trained model
model_2 = load_model('final_model_latest.h5')

In [None]:
# evaluate model
# steps counts batches, not samples: one pass over the validation generator
model_2.evaluate_generator(val_gen, steps=len(val_gen))

In [None]:
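# one pass over the validation generator, so y_pred has one row per validation sample
y_pred = model_2.predict_generator(val_gen, steps=len(val_gen))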
y_pred.shape

In [None]:
# Converting predictions into suitable format to feed into sklearn libraries
def process_predictions():
    prediction = []
    for i in range(len(y_pred)):
        t = np.zeros(5)
        t[np.argmax(y_pred[i])] = 1
        prediction.append(t)
    return np.array(prediction)

prediction = process_predictions()

In [None]:
# Classification Report
print(classification_report(y_val, prediction))

#### As per the Classification Report, the trained model shows : <br>

1. Accuracy : 67% <br>
2. Recall : 67% <br>
3. F1-Score : 67% <br>

These figures are the sample averages. Class-wise, the model has the highest accuracy, recall and F1-score for class 'Adoption Speed 4', as can be seen from the report. 

Given the complexity of the data, the performance of the model seems decent (especially with limited resources to experiment). The scores can likely be improved further with a better architecture, preprocessing and tuning

In [None]:
# Confusion Matrix
print(multilabel_confusion_matrix(y_val,prediction,labels=[0,1,2,3,4]))

#### TRYING OUT A DIFFERENT TECHNIQUE

Training different ML classifiers, such as AdaBoost and Gradient Boosting, and combining their predictions with the predictions of the NN model. Only the ordered data is fed into these classifiers during training

However, the performances are nearly the same; in fact, combining worsens the performance. This is just an experimental technique and isn't fully correct, as combining 2 different predictions (from 2 different models) just by averaging the labels isn't semantically meaningful. A more robust architecture would be to train an ML classifier (like a decision tree) only on the ordered data and a neural network only on the image data, with the same labels in both cases. Combining the predictions would still be an issue in that case, but each type of data (ordered and unordered) would get a classifier suited to it
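
One commonly used alternative to averaging predicted labels is soft voting, i.e. averaging the class probabilities of the models and then taking the most probable class. This is not what the notebook does; the sketch below only contrasts the two approaches on made-up probabilities for two samples and five classes.

```python
import numpy as np

# Made-up class probabilities from two models (rows: samples, columns: classes 0..4).
proba_a = np.array([[0.6, 0.1, 0.1, 0.1, 0.1],
                    [0.1, 0.1, 0.1, 0.1, 0.6]])
proba_b = np.array([[0.2, 0.1, 0.1, 0.1, 0.5],
                    [0.1, 0.1, 0.1, 0.2, 0.5]])

# Soft voting: average the probabilities, then pick the most probable class.
soft_vote = ((proba_a + proba_b) / 2).argmax(axis=1)

# Label averaging: pick each model's class first, then average the labels.
label_avg = (proba_a.argmax(axis=1) + proba_b.argmax(axis=1)) // 2

print(soft_vote)   # [0 4]
print(label_avg)   # [2 4]
```

For the first sample the two models disagree (classes 0 and 4); label averaging lands on class 2, which neither model favoured, while soft voting stays with class 0.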

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
clf1 = AdaBoostClassifier(n_estimators=100, random_state=0)
clf2 = GradientBoostingClassifier(min_samples_split = 8, random_state=0)

In [None]:
# transform One hot encoded data to labelled data (as in labelEncoded data)
def decode(y):
    ys = []
    for i in range(len(y)):
        ys.append(np.argmax(y[i]))
    return np.array(ys)
y_train = decode(y_train)
y_val = decode(y_val)
y_pred = decode(prediction) # NN model predictions

In [None]:
clf1.fit(train_x, y_train) # fit the classifier

In [None]:
clf2.fit(train_x, y_train) # fit the classifier

In [None]:
pred1 = clf1.predict(val_x) # AdaBoost predictions

In [None]:
pred2 = clf2.predict(val_x) # GB predictions

In [None]:
#combining both models' predictions
def combinePreds(pred, y_pred):
    pred_combined = []
    for i in range(len(pred)):
        if pred[i]==y_pred[i]:
            pred_combined.append(pred[i])
        else:
            pred_combined.append(int((pred[i]+y_pred[i])//2))
    
    return pred_combined

pred_combined1 = combinePreds(pred1, y_pred)
pred_combined2 = combinePreds(pred2, y_pred)

In [None]:
# Classification report for the combined GradientBoosting + NN predictions
print(classification_report(y_val, pred_combined2))