### Problem description: 

PetFinder.my uses a basic Cuteness Meter to rank pet photos. It analyzes picture composition and other factors compared to the performance of thousands of pet profiles. 

While this basic tool is helpful, it is still experimental and the algorithm could be improved. Participants need to build an AI model from the provided data to help make the tool better.

**Task** 

The task is to predict engagement with a pet's profile (**Pawpularity**) based on the photograph for that profile.

**Data** 

The dataset for this competition comprises both images and tabular data (hand-labelled metadata for each photo).

The train set contains 9912 pet photos.

The provided test set contains 8 pet photos.
> NOTE: The actual test data comprises about **6800** pet photos similar to the training set photos. 


#### **Previous Notebook**: [*Understanding the problem & EDA*](https://www.kaggle.com/vivmankar/understanding-the-problem-eda)

 
#### **In this notebook and upcoming notebooks we will analyze four different approaches to the problem.**  
 
1. Provided tabular data only (Score: 20.47458)
2. Provided image data only
3. Image + tabular data as inputs to an end-to-end model
4. Image and tabular data models ensembled


## Imports

In [None]:
import numpy as np
import pandas as pd 

import matplotlib.pyplot as plt 
import seaborn as sns 

import os 
import cv2
import random 

from sklearn.model_selection import train_test_split 
from sklearn.ensemble import RandomForestRegressor # used for prediction 
from sklearn.model_selection import RandomizedSearchCV # hyperparameter tuning
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings("ignore")

## Data

In [None]:
data = pd.read_csv('../input/petfinder-pawpularity-score/train.csv')
test = pd.read_csv('../input/petfinder-pawpularity-score/test.csv')
ss = pd.read_csv('../input/petfinder-pawpularity-score/sample_submission.csv')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
X = data[data.columns[1:-1]]  # metadata features (every column except the first, Id, and the last, Pawpularity)
y = data["Pawpularity"]  # target

In [None]:
# Fix the seed so the hold-out split is reproducible between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

In [None]:
X_train.shape , X_test.shape

## RandomForest Regressor

In [None]:
# create the base model to tune
rf = RandomForestRegressor()


### Hyperparameter Tuning

Hyperparameter tuning using `RandomizedSearchCV`.

The random grid defined below is the search space for the best hyperparameters.

In [None]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1500, num = 15)]


# Number of features to consider at every split
# (1.0 = all features; this replaces 'auto', which newer scikit-learn versions no longer accept)
max_features = [1.0, 'sqrt']


# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]
 
# Minimum number of samples required to split a node
min_samples_split = [5, 10 , 15, 20 , 25]


# Minimum number of samples required at each leaf node
min_samples_leaf = [5, 10, 15]

In [None]:
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

print(random_grid)
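
For a sense of scale, the grid above spans 15 × 2 × 6 × 5 × 3 = 2,700 candidate combinations, so a randomized search with `n_iter = 10` only visits a small fraction of them. The sketch below reproduces that arithmetic on its own; the name `grid_sizes` is illustrative and not part of the notebook.

```python
import math

# Number of candidate values per hyperparameter, mirroring the grid above.
grid_sizes = {
    "n_estimators": len(range(100, 1501, 100)),  # 100, 200, ..., 1500
    "max_features": 2,                           # two options in the grid
    "max_depth": len(range(5, 31, 5)),           # 5, 10, ..., 30
    "min_samples_split": len([5, 10, 15, 20, 25]),
    "min_samples_leaf": len([5, 10, 15]),
}

total_combinations = math.prod(grid_sizes.values())
n_iter = 10  # number of settings sampled by the randomized search

print(total_combinations)           # 2700
print(n_iter / total_combinations)  # fraction of the grid that is sampled
```

Raising `n_iter` covers more of the grid at a proportional cost in fitting time.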

In [None]:
# Random search over the grid, using 4-fold cross-validation
 
rf_random = RandomizedSearchCV(estimator = rf, 
                               param_distributions = random_grid, # dict mapping parameter names (str) to distributions or lists of values to try
                               scoring='neg_mean_squared_error', # metric used to score each candidate on the held-out folds
                               n_iter = 10, # number of parameter settings sampled
                               cv = 4, # number of cross-validation folds
                               refit = True, # refit the best estimator on all of the data passed to fit()
                               verbose=2, 
                               random_state=42, 
                               n_jobs = -1 # number of jobs to run in parallel; -1 uses all processors
                              )


In [None]:
rf_random.fit(X_train,y_train)

In [None]:
# Best parameters chosen

rf_random.best_params_

In [None]:
# Best mean cross-validated score (negated MSE, so values closer to 0 are better)

rf_random.best_score_ 
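
Because the search is scored with `neg_mean_squared_error`, `best_score_` is a negated MSE and is never positive. To put it on the same scale as an RMSE, negate it and take the square root. A minimal sketch follows; the helper name `neg_mse_to_rmse` is hypothetical, and the number passed to it is illustrative rather than a result from this notebook.

```python
import math

def neg_mse_to_rmse(neg_mse: float) -> float:
    """Convert a 'neg_mean_squared_error' score into an RMSE."""
    if neg_mse > 0:
        raise ValueError("a negated MSE cannot be positive")
    return math.sqrt(-neg_mse)

# Illustrative value only, not a score produced by this notebook.
print(neg_mse_to_rmse(-400.0))  # 20.0
```

In the notebook this would be called as `neg_mse_to_rmse(rf_random.best_score_)`. Note that this is the square root of the mean fold MSE, which is close to, but not identical with, the mean of the per-fold RMSEs.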

In [None]:
predictions_X_test = rf_random.predict(X_test)

In [None]:
RMSE_model1_RfR = np.sqrt(mean_squared_error(y_test, predictions_X_test))

print(RMSE_model1_RfR)
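
An RMSE is easier to judge next to a trivial baseline. One common sanity check, sketched here as an assumption rather than something the original notebook does, is to predict the mean of the training targets for every hold-out row and compute the RMSE of that constant prediction. The helper name `mean_baseline_rmse` is hypothetical.

```python
import numpy as np

def mean_baseline_rmse(y_train, y_test) -> float:
    """RMSE obtained by predicting the training mean for every test row."""
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    constant_prediction = y_train.mean()
    return float(np.sqrt(np.mean((y_test - constant_prediction) ** 2)))

# Toy numbers for illustration only, not taken from the competition data.
print(mean_baseline_rmse([10, 20, 30], [20, 40]))
```

In the notebook, `mean_baseline_rmse(y_train, y_test)` could be compared with `RMSE_model1_RfR`; a model that barely beats this baseline is extracting little signal from the tabular features.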

### Retrain the final model on the whole training set

In [None]:
final_model = RandomForestRegressor(n_estimators = 100,
                                     min_samples_split = 15,
                                     min_samples_leaf = 10,
                                     max_features = 'sqrt',
                                     max_depth = 5,
                                     n_jobs = -1 )

In [None]:
final_model.fit(X,y)

In [None]:
predictions_final = final_model.predict(test[test.columns[1:]])

In [None]:
predictions_final

In [None]:
ss.head()

In [None]:
f_ss = pd.DataFrame()

f_ss["Id"] = test["Id"]
f_ss["Pawpularity"] = predictions_final

In [None]:
f_ss.head()

In [None]:
f_ss.to_csv('submission.csv', index=False)
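
Before submitting, it can help to confirm that the file has the same layout as the sample submission. The check below is a minimal sketch: `check_submission` is a hypothetical helper, and the two small frames are made-up stand-ins for `f_ss` and `ss`.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> None:
    """Raise AssertionError if the submission does not match the sample's layout."""
    assert list(submission.columns) == list(sample.columns), "column mismatch"
    assert len(submission) == len(sample), "row count mismatch"
    assert set(submission["Id"]) == set(sample["Id"]), "Id mismatch"
    assert submission["Pawpularity"].notna().all(), "missing predictions"

# Made-up stand-ins for the sample submission and the generated submission.
sample = pd.DataFrame({"Id": ["a", "b"], "Pawpularity": [0.0, 0.0]})
candidate = pd.DataFrame({"Id": ["a", "b"], "Pawpularity": [31.5, 42.0]})

check_submission(candidate, sample)  # passes silently when the layout matches
```

In the notebook the equivalent call would be `check_submission(f_ss, ss)`.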