## <center>Feature Extractor : VGG16 + ML Algorithm [Training Notebook]</center>

### <center>Welcome Curious Reader!</center>

- You are about to explore this Training Notebook, created for the competition : [PetFinder.my - Pawpularity](https://www.kaggle.com/c/petfinder-pawpularity-score)  

- The competition poses the challenge of using Image Data & Categorical Data to predict a Continuous Value. 

- **Aim** : To understand the approach of combining a Feature Extractor with a Machine Learning Algorithm to predict the output feature, using only Image Data to train the models.

## <center>Import The Necessary Libraries & Define Data Access Variables </center>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import cv2
import os
from tqdm import tqdm
import tensorflow as tf
import pickle
import time

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

train = pd.read_csv('../input/petfinder-pawpularity-score/train.csv')
train_images_path = '../input/petfinder-pawpularity-score/train'
train_images_list = os.listdir(train_images_path)
print('Total Number of Training Images : ',len(train_images_list))

## <center>Create Train Image Batches & Output Feature Variable</center>

In [None]:
train_images = []
# Read the images in train.csv order so that image i matches label i
# (os.listdir returns files in arbitrary order); files are named '<Id>.jpg'.
for image_id in tqdm(train['Id']):
    path = os.path.join(train_images_path,image_id + '.jpg')
    image = cv2.imread(path)
    image = image / 255
    image = cv2.resize(image,(128,128))
    train_images.append(image)
train_images = np.array(train_images)

train_label = train['Pawpularity'] / 100

- The output feature variable is normalized by dividing it by 100, which brings its values into the range [0, 1]. 
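
Because the target is divided by 100, an RMSE computed on it is on the scaled target as well. Here is a minimal sketch of converting such an RMSE back to Pawpularity points. The `rmse` helper and the toy values are assumptions made up for illustration, not results from this notebook.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy Pawpularity scores on the original 1-100 scale (illustrative values only).
pawpularity_true = [30, 50, 70, 90]
pawpularity_pred = [40, 40, 80, 80]

# The same values after the scaling step (division by 100).
scaled_true = [v / 100 for v in pawpularity_true]
scaled_pred = [v / 100 for v in pawpularity_pred]

rmse_original = rmse(pawpularity_true, pawpularity_pred)
rmse_scaled = rmse(scaled_true, scaled_pred)

# RMSE scales linearly with the target, so multiplying by 100 undoes the scaling.
print(rmse_original, rmse_scaled * 100)
```

So an RMSE of 0.2 on the scaled target corresponds to roughly 20 Pawpularity points.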

## <center>Feature Extractor : VGG16</center>

- We are using VGG16 as Image Feature Extractor which requires a batch of images as input.
    
- To define VGG16, we use [Keras Applications](https://keras.io/api/applications/) to call VGG16 and set its parameters.

In [None]:
from tensorflow.keras.applications.vgg16 import VGG16

# Convolutional base only (include_top = False), with weights loaded from a local file.
model = VGG16(include_top = False,
              input_shape = (128,128,3),
              weights = '../input/vgg16-no-top-weights/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5')
for layer in model.layers:
    layer.trainable = False

# Extract one feature map per image, then flatten it to one row per image.
train_feature_extractor = model.predict(train_images)
train_features = train_feature_extractor.reshape(train_feature_extractor.shape[0],-1)

In [None]:
print('Input to Feature Extractor Shape : ',train_images.shape)
print('Output of Feature Extractor Shape : ',train_feature_extractor.shape)
print('Input to Machine Learning Algorithm Shape : ',train_features.shape)

- The VGG16 model is manually assigned the weights of the VGG16 variant that does not have the dense layers used for classification. For more information on this part, check out [Keras Applications](https://keras.io/api/applications/)

- The source for downloading the weights can be found [here!](https://github.com/fchollet/deep-learning-models/releases/tag/v0.1)

- To make sure that we don't lose what the pretrained weights have learned, we use a for loop and assign **layer.trainable = False**, which freezes each layer so that its weights are not overwritten.
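
To get a feel for the shapes involved, here is a small sketch of the arithmetic behind the feature extractor's output. It relies on the standard VGG16 layout (five convolutional blocks, each ending in a 2x2 max-pooling layer, with 512 channels in the last block). The helper name `vgg16_notop_output_shape` is made up for this illustration and is not a Keras function.

```python
def vgg16_notop_output_shape(height, width):
    """Output shape of VGG16 without its dense layers for a (height, width, 3) input.

    Each of the five convolutional blocks ends in a 2x2 max-pooling layer that
    halves the spatial size, and the last block has 512 channels.
    """
    n_pooling_layers = 5
    final_channels = 512
    for _ in range(n_pooling_layers):
        height //= 2
        width //= 2
    return height, width, final_channels

h, w, c = vgg16_notop_output_shape(128, 128)
n_features = h * w * c  # length of each row after reshape(n_images, -1)
print((h, w, c), n_features)  # (4, 4, 512) 8192
```

Each 128 x 128 image therefore becomes a row of 8192 features for the Machine Learning Algorithm.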

## <center>Regression Algorithms</center>

- **Machine Learning Algorithms** : Linear Regression, Support Vector Regressor, Random Forest Regressor, Xgboost Regressor, LGBM Regressor, Stack of Regressors.

- KFold is used to divide the dataset into folds, so that every image is used for both training and testing, which makes the model evaluation more robust.

- **train_test_split** from sklearn was used to create the data for hyperparameter tuning & cross validation, with a high test size of 75%.

- This was done to get a random 25% of the dataset. [Just an approach! You can try different methods as well!]
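
To see how KFold divides the dataset, here is a pure-Python sketch that mirrors the default, unshuffled behaviour of sklearn's `KFold`. The helper `kfold_indices` is a made-up stand-in for illustration and not part of sklearn.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds.

    Mirrors the default behaviour of sklearn's KFold(shuffle=False): the first
    n_samples % n_splits folds get one extra sample.
    """
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Ten toy samples split into 4 folds, as with KFold(n_splits = 4).
folds = list(kfold_indices(10, 4))
for train_idx, test_idx in folds:
    print('train', train_idx, 'test', test_idx)

# Every sample lands in exactly one test fold.
all_test = sorted(i for _, test_idx in folds for i in test_idx)
print(all_test == list(range(10)))  # True
```

Since the folds are contiguous blocks, the row order of the dataset decides which images end up together in a fold.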

### <font color = 'red'>Note</font> : The cells below were executed at different times. They were then commented out and the notebook was executed once again before publishing. Thus, you won't be able to see the outputs of the print statements and model definitions. 

### Common Steps used for all the Regression Algorithms :

1. Define the model.

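2. Using hyperparameter tuning & cross validation in combination with train_test_split, find the best parameters for the model.

3. Fit and evaluate the model on each of the train / test combinations generated by the KFold, so that the complete dataset is used. Each call to `fit` retrains the model from scratch, so the model that is pickled afterwards is the one fitted on the last combination.

4. A pickle file of the model is generated, which is then referenced in the inference notebook for prediction purposes. 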
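
The pickle step in the list above can be sketched as a simple round trip. The dictionary below is only a stand-in for a fitted model, and the file name `toy_pickle.pkl` is hypothetical. The real pickle files are created in the cells further down.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable Python object behaves the same way.
fitted_model = {'name': 'toy_model', 'coefficients': [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, 'toy_pickle.pkl')  # hypothetical file name

    # Training notebook side: serialise the object to disk.
    with open(pickle_path, 'wb') as f:
        pickle.dump(fitted_model, f)

    # Inference notebook side: restore the object from the file.
    with open(pickle_path, 'rb') as f:
        restored_model = pickle.load(f)

print(restored_model == fitted_model)  # True
```

When a real estimator is unpickled, the library that defines it (sklearn, xgboost, lightgbm, mlxtend) has to be installed in the inference environment, ideally in a compatible version.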

In [None]:
from sklearn.model_selection import KFold
kf = KFold(n_splits = 4)

x_train_cv,x_test_cv,y_train_cv,y_test_cv = train_test_split(train_features,train_label,test_size = 0.75)

## <center>Linear Regression</center>

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

In [None]:
for train_index,test_index in tqdm(kf.split(train_features)):
    
    # fit() retrains from scratch on this fold's training rows
    lr.fit(train_features[train_index],train_label[train_index])
    pred = lr.predict(train_features[test_index])
    y_true = train_label[test_index]
    print('RMSE {:.2f}'.format(mean_squared_error(y_true,pred,squared = False)))

- Default parameters were used, as it was unclear which parameters to tune. 
- Average RMSE : 1.56

### <center>Pickle File</center>

In [None]:
with open('lr_pickle.pkl','wb') as f:
    pickle.dump(lr,f)

## <center>Support Vector Regressor</center>

In [None]:
from sklearn.svm import SVR
svr = SVR()

#### <center>Hyperparameter Tuning + Cross Validation</center>

In [None]:
param_tuning = {
        'kernel': ['linear','poly'],
        'C': [1,0.1,0.01] }

gsearch = GridSearchCV(estimator = svr,
                       param_grid = param_tuning,                        
                       cv = 5,
                       n_jobs = -1,
                       verbose = 1)
gsearch.fit(x_train_cv,y_train_cv)
gsearch.best_params_

In [None]:
svr = SVR(C=0.01,kernel='poly')

for train_index,test_index in tqdm(kf.split(train_features)):
    
    # fit() retrains from scratch on this fold's training rows
    svr.fit(train_features[train_index],train_label[train_index])
    pred = svr.predict(train_features[test_index])
    y_true = train_label[test_index]
    print('RMSE {:.2f}'.format(mean_squared_error(y_true,pred,squared = False)))

- Redefining the model with tuned hyperparameters & training the model.
- Average RMSE : 0.205
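
To see why the grid search is time consuming, here is a small sketch that counts the model fits behind it. The helper `count_grid_search_fits` is made up for illustration and is not part of sklearn. The two grids are copied from this notebook (the second one is used for the boosting models further below).

```python
from itertools import product

def count_grid_search_fits(param_grid, cv):
    """Return (number of parameter combinations, number of fits during the search).

    Every combination is fitted once per cross validation fold; the final refit
    on the whole search data is not counted.
    """
    combinations = list(product(*param_grid.values()))
    return len(combinations), len(combinations) * cv

svr_grid = {'kernel': ['linear', 'poly'], 'C': [1, 0.1, 0.01]}
boosting_grid = {'learning_rate': [0.001, 0.01, 0.1],
                 'max_depth': [4, 5, 6],
                 'n_estimators': [100, 200]}

print(count_grid_search_fits(svr_grid, cv=5))       # (6, 30)
print(count_grid_search_fits(boosting_grid, cv=5))  # (18, 90)
```

Adding one more value to any parameter list multiplies the number of fits, so the grids are best kept small when each row has thousands of features.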

### <center>Pickle File</center>

In [None]:
with open('svr_pickle.pkl','wb') as f:
    pickle.dump(svr,f)

## <center>Xgboost Regressor</center>

In [None]:
import xgboost 
xgb = xgboost.XGBRegressor()

#### <center>Hyperparameter Tuning + Cross Validation</center>

In [None]:
param_tuning = {
        'learning_rate': [0.001,0.01, 0.1],
        'max_depth': [4,5,6],
        'n_estimators' : [100,200]}

gsearch = GridSearchCV(estimator = xgb,
                       param_grid = param_tuning,                        
                       cv = 5,
                       n_jobs = -1,
                       verbose = 1)
gsearch.fit(x_train_cv,y_train_cv)
gsearch.best_params_

In [None]:
xgb = xgboost.XGBRegressor(learning_rate= 0.01,max_depth= 3,
                             n_estimators = 100)

for train_index,test_index in tqdm(kf.split(train_features)):
    
    # fit() retrains from scratch on this fold's training rows
    xgb.fit(train_features[train_index],train_label[train_index])
    pred = xgb.predict(train_features[test_index])
    y_true = train_label[test_index]
    print('RMSE {:.2f}'.format(mean_squared_error(y_true,pred,squared = False)))

- Redefining the model with tuned hyperparameters & training the model.
- Average RMSE : 0.2125

### <center>Pickle File</center>

In [None]:
with open('xgb_pickle.pkl','wb') as f:
    pickle.dump(xgb,f)

## <center>LGBM Regressor</center>

In [None]:
from lightgbm import LGBMRegressor
lgbm = LGBMRegressor() 

#### <center>Hyperparameter Tuning + Cross Validation</center>

In [None]:
param_tuning = {
        'learning_rate': [0.001,0.01, 0.1],
        'max_depth': [4,5,6],
        'n_estimators' : [100,200]}

gsearch = GridSearchCV(estimator = lgbm,
                       param_grid = param_tuning,                        
                       cv = 5,
                       n_jobs = -1,
                       verbose = 1)
gsearch.fit(x_train_cv,y_train_cv)
gsearch.best_params_

In [None]:
lgbm = LGBMRegressor(learning_rate = 0.01,max_depth = 3,n_estimators = 100)

for train_index,test_index in tqdm(kf.split(train_features)):

    # fit() retrains from scratch on this fold's training rows
    lgbm.fit(train_features[train_index],train_label[train_index])
    pred = lgbm.predict(train_features[test_index])
    y_true = train_label[test_index]
    print('RMSE {:.2f}'.format(mean_squared_error(y_true,pred,squared = False)))

- Redefining the model with tuned hyperparameters & training the model.
- Average RMSE : 0.205

### <center>Pickle File</center>

In [None]:
with open('lgbm_pickle.pkl','wb') as f:
    pickle.dump(lgbm,f)

## <center>Stack : Linear Regression, Support Vector Regressor, Xgboost Regressor, LGBM Regressor</center>

In [None]:
from mlxtend.regressor import StackingCVRegressor
stack = StackingCVRegressor(regressors = (lr,svr,xgb,lgbm),meta_regressor = lr)

In [None]:
for train_index,test_index in tqdm(kf.split(train_features)):

    # fit() retrains from scratch on this fold's training rows
    stack.fit(train_features[train_index],train_label[train_index])
    pred = stack.predict(train_features[test_index])
    y_true = train_label[test_index]
    print('RMSE {:.2f}'.format(mean_squared_error(y_true,pred,squared = False)))

- Did not carry out hyperparameter tuning!
- Average RMSE : 0.205
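
For intuition about the stack, here is a simplified NumPy sketch of the stacking idea: the base models' out-of-fold predictions become the inputs of the meta regressor. It is not the mlxtend implementation. The helpers `fit_linear`, `fit_mean` and `fit_stack`, the two toy base models, and the random data are all assumptions made up for this illustration.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear model with intercept; returns a predict function."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.hstack([X_new, np.ones((X_new.shape[0], 1))]) @ coef

def fit_mean(X, y):
    """Baseline that always predicts the training mean."""
    mean = y.mean()
    return lambda X_new: np.full(X_new.shape[0], mean)

def fit_stack(X, y, base_fitters, n_splits=5):
    """Fit a stack: out-of-fold base predictions become the meta model's inputs."""
    n = X.shape[0]
    folds = np.array_split(np.arange(n), n_splits)
    meta_features = np.zeros((n, len(base_fitters)))
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        for j, fitter in enumerate(base_fitters):
            predict = fitter(X[train_idx], y[train_idx])
            meta_features[test_idx, j] = predict(X[test_idx])
    # The meta model learns how to combine the base predictions.
    meta_predict = fit_linear(meta_features, y)
    # Refit every base model on all rows for use at prediction time.
    base_predicts = [fitter(X, y) for fitter in base_fitters]

    def predict(X_new):
        columns = np.column_stack([p(X_new) for p in base_predicts])
        return meta_predict(columns)
    return predict

# Toy regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 0.3 + rng.normal(scale=0.05, size=40)

stack_predict = fit_stack(X, y, [fit_linear, fit_mean])
print(stack_predict(X[:5]).shape)  # (5,)
```

Because the meta regressor only ever sees predictions made on rows the base models were not trained on, it learns how far each base model can be trusted on unseen data.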

### <center>Pickle File</center>

In [None]:
with open('stack_pickle.pkl','wb') as f:
    pickle.dump(stack,f)

## <center>Conclusion</center>

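1. VGG16 generates a huge number of features : 4 x 4 x 512 = 8192 per image for the 128 x 128 inputs used here.
2. Model fitting & hyperparameter tuning were time-consuming tasks.
3. Model Training Performances (average RMSE on the target scaled to [0, 1]) :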

| Model  |  RMSE |
| :--- | :---: |
|   LR   | 1.56  |
|  SVR   | 0.205 |
|  XGB   | 0.2125|
|  LGBM  | 0.205 |
|  STACK | 0.205 |



Links: 

1. [Inference Notebook](https://www.kaggle.com/tanmay111999/starter-feature-extractor-ml-algo-infer)
2. [Discussion Post](https://www.kaggle.com/c/petfinder-pawpularity-score/discussion/279212)

## <center>If you like the content of the notebook, please do upvote!</center>
### <center>Feedback is appreciated!</center>
### <center>Stay Safe!</center>