# EDA, Machine Learning and Hyperparameter Tuning for the PetFinder Pawpularity Score
We first explore the data by examining the correlations between the features and ranking the 10 most important features with an ExtraTreesRegressor model. We then fit an XGBRegressor model to the data and finally tune its hyperparameters to improve the predictions.

# Exploratory Data Analysis (EDA)

Importing the essential modules and exploring the training data, loaded with the read_csv function of pandas

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

pet_data = pd.read_csv("../input/petfinder-pawpularity-score/train.csv")
pet_data.head()

In [None]:
pet_data.columns

Separating the target, Pawpularity, from the features

In [None]:
features =['Subject Focus', 'Eyes', 'Face', 'Near', 'Action', 'Accessory',
       'Group', 'Collage', 'Human', 'Occlusion', 'Info', 'Blur']
X = pet_data[features]
y = pet_data.Pawpularity

# The Top 3 Most Correlated Features

In [None]:
plt.figure(figsize = (20, 8))
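# Restrict to the numeric columns; corr() can fail on a non-numeric column in recent pandas
sns.heatmap(pet_data[features + ['Pawpularity']].corr(), annot = True, cmap = "YlGnBu")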
plt.show()

From the heatmap above, the three most strongly correlated feature pairs are:

1. Occlusion and Human
2. Face and Eyes
3. Collage and Info
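
Reading the strongest pairs off a heatmap by eye is error-prone, so the ranking can also be computed directly. The following is a minimal sketch, assuming a small made-up DataFrame (`demo`) in place of `pet_data`; the helper name `top_correlated_pairs` is hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)


# Made-up binary data for illustration only (not the Petfinder data).
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=200)
demo = pd.DataFrame({
    "A": a,
    "B": a,  # identical to A, so the A/B pair is perfectly correlated
    "C": rng.integers(0, 2, size=200),
})

top_pairs = top_correlated_pairs(demo, n=1)
print(top_pairs)
```

Applied to the notebook's data, the same helper could be called on `pet_data[features]` to check the three pairs listed above.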

# The Top 10 Most Important Features

Using the ExtraTreesRegressor model to rank the features by importance

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
Selection = ExtraTreesRegressor()
Selection.fit(X,y)

In [None]:
print(Selection.feature_importances_)

In [None]:
# Plotting the feature importances for better understanding

plt.figure(figsize= (12,8))
feat_importances = pd.Series(Selection.feature_importances_, index = X.columns)
feat_importances.nlargest(10).plot(kind = 'barh')
plt.show()

From the figure above, the **Near** feature is the most important, while **Subject Focus** is the least important of the 10 features shown.
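
Because `ExtraTreesRegressor()` is fitted above without a fixed `random_state`, the importance values, and possibly the ranking, can change from run to run. Below is a minimal sketch of one way to get a steadier ranking, assuming synthetic data and a hypothetical helper `mean_importances` that averages the importances over several seeds. It is an illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor


def mean_importances(X, y, seeds=(0, 1, 2), n_estimators=50):
    """Average impurity-based feature importances over several random seeds."""
    runs = []
    for seed in seeds:
        model = ExtraTreesRegressor(n_estimators=n_estimators, random_state=seed)
        model.fit(X, y)
        runs.append(model.feature_importances_)
    averaged = pd.Series(np.mean(runs, axis=0), index=X.columns)
    return averaged.sort_values(ascending=False)


# Synthetic binary features for illustration only; the target depends on "f0" alone.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.integers(0, 2, size=(300, 4)),
                      columns=["f0", "f1", "f2", "f3"])
y_demo = 10.0 * X_demo["f0"] + rng.normal(0.0, 1.0, size=300)

ranking = mean_importances(X_demo, y_demo)
print(ranking)
```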

# Fitting a model using XGBRegressor
1. Import the model
2. Split the dataset into training and test sets so that predictions can be evaluated on test_X
3. Fit the XGBRegressor model on the training data
4. Predict on test_X
5. Check the RMSE score, the usual error metric for regression
6. Plot the results

In [None]:
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn import metrics

train_X, test_X, train_y, test_y = train_test_split(X, y, random_state = 0)

model = XGBRegressor(n_estimators=6)

model.fit(train_X, train_y, 
             eval_set=[(test_X, test_y)],
             verbose=False)
print("Training Score: ", model.score(train_X, train_y))

preds_y = model.predict(test_X)
print('RMSE:', np.sqrt(metrics.mean_squared_error(test_y, preds_y)))
print("Test Score: ", model.score(test_X, test_y))
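
For reference, `model.score` returns the coefficient of determination (R²) for a regressor, and RMSE is the square root of the mean squared error. The small worked sketch below computes both by hand with NumPy on made-up numbers (not this notebook's results).

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

residuals = y_true - y_pred              # [-2, 2, -3, 3]
mse = np.mean(residuals ** 2)            # (4 + 4 + 9 + 9) / 4 = 6.5
rmse = np.sqrt(mse)

ss_res = np.sum(residuals ** 2)                  # residual sum of squares = 26
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 500
r2 = 1.0 - ss_res / ss_tot

print("RMSE:", round(rmse, 4))  # RMSE: 2.5495
print("R^2:", round(r2, 4))     # R^2: 0.948
```

RMSE is in the same units as the target, so lower is better, while R² is unitless and is at most 1.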

In [None]:
# distplot is deprecated in recent seaborn; histplot with kde=True is the replacement
sns.histplot(test_y - preds_y, kde=True)
plt.show()

In [None]:
plt.scatter(test_y, preds_y, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("y_predictions")
plt.show()

# Hyperparameter Tuning
Steps for hyperparameter tuning:
* Choose one of the following search methods
    1. RandomizedSearchCV (fast)
    2. GridSearchCV
* Assign the hyperparameters in the form of a dictionary
* Fit the model
* Check the best parameters and the best score
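
The speed difference comes from the number of candidates each method evaluates: a grid search tries every combination, while a randomized search samples a fixed number of them. The sketch below uses only the standard library and a made-up grid (`demo_grid`) rather than this notebook's parameters.

```python
import itertools
import random

# Made-up search space, for illustration only.
demo_grid = {
    "n_estimators": [10, 50, 100, 200],
    "learning_rate": [0.1, 0.3, 0.5],
    "max_depth": [3, 6, 9],
}

# Grid search: every combination of the listed values.
keys = list(demo_grid)
all_candidates = [dict(zip(keys, values))
                  for values in itertools.product(*demo_grid.values())]

# Randomized search: a fixed-size random subset of those combinations.
random.seed(0)
n_iter = 10
sampled = random.sample(all_candidates, n_iter)

print(len(all_candidates))  # 36
print(len(sampled))         # 10
```

With 5-fold cross-validation each candidate is fitted five times, so this sketch corresponds to 180 fits for the full grid versus 50 for the random sample.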

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
import time

In [None]:
# A parameter search space for XGBoost.
# n_jobs, nthread and importance_type are left out: they control threading and
# how importances are reported, not the predictions, so searching them adds nothing.
params = {
    'n_estimators': [1, 5, 6, 10, 20, 50, 100, 200, 500, 1000],
    'objective': ['reg:squarederror', 'reg:tweedie'],
    'booster': ['gbtree', 'gblinear'],
    'eval_metric': ['rmse'],
    'learning_rate': [i/10.0 for i in range(3, 6)],  # scikit-learn API name for eta
}

reg = XGBRegressor(random_state = 11)

# run randomized search
n_iter_search = 100
random_search = RandomizedSearchCV(reg, param_distributions=params,
                                   n_iter=n_iter_search, cv=5,
                                   scoring='neg_mean_squared_error')

start = time.time()
random_search.fit(train_X, train_y)
print("RandomizedSearchCV took %.2f seconds for %d candidate"
      " parameter settings." % ((time.time() - start), n_iter_search))

In [None]:
best_regressor = random_search.best_estimator_
best_regressor

In [None]:
val_predicts = best_regressor.predict(test_X)
print('RMSE:', np.sqrt(metrics.mean_squared_error(test_y, val_predicts)))

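This RMSE is lower than the RMSE of the initial XGBRegressor model above, which shows that the tuning improved the predictions.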

In [None]:
x_ax = range(len(test_y))
plt.figure(figsize = (20, 5))
plt.plot(x_ax, test_y, label="original")
plt.plot(x_ax, val_predicts, label="predicted")
plt.title("Pawpularity test and predicted data")
plt.legend()
plt.show()

In [None]:
pet_test = pd.read_csv('../input/petfinder-pawpularity-score/test.csv')
pet_test.head()

In [None]:
X_test= pet_test[features]
predictions = best_regressor.predict(X_test)

In [None]:
# The target column is named "Pawpularity" (capital P), matching train.csv
output = pd.DataFrame({"Id": pet_test.Id, "Pawpularity": predictions})
output.to_csv('submission.csv', index = False)
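
A regressor can predict values outside the range of the training target, so it may be worth checking the submission before writing it. The sketch below uses made-up IDs and predictions and assumes that valid Pawpularity scores lie between 1 and 100; both the data and the clipping step are illustrations, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up IDs and raw predictions, for illustration only.
demo_ids = ["a1", "b2", "c3"]
raw_predictions = np.array([-3.2, 41.7, 104.9])

# Assumption: valid Pawpularity scores lie between 1 and 100.
clipped = np.clip(raw_predictions, 1, 100)

demo_output = pd.DataFrame({"Id": demo_ids, "Pawpularity": clipped})

# Basic sanity checks before saving the file.
assert demo_output["Pawpularity"].between(1, 100).all()
assert not demo_output.isna().any().any()
print(demo_output)
```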

**Note:** Suggestions for improvement are highly appreciated.