# GridSearchCV:

  GridSearchCV is a class in scikit-learn's `model_selection` module. It loops through every combination of the predefined hyperparameter values and fits your estimator (model) on your training set with cross-validation for each one. In the end, you can select the best-scoring combination from the listed hyperparameters.
  
![](https://imgur.com/HSh9mej.png)
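
As a rough illustration of the idea (this is not scikit-learn's implementation), the loop below enumerates every combination in a small grid and keeps the best-scoring one. The `toy_score` function is a made-up stand-in for the cross-validated score that GridSearchCV would obtain by actually fitting the estimator, and the grid values are chosen only for the example.

```python
from itertools import product

# Hypothetical grid, in the same dict-of-lists shape that GridSearchCV accepts.
param_grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "n_estimators": [100, 200, 400],
}

def toy_score(params):
    """Made-up stand-in for a cross-validated score (higher is better)."""
    # Peaks at learning_rate=0.01, n_estimators=200 purely for illustration.
    return (-abs(params["learning_rate"] - 0.01)
            - abs(params["n_estimators"] - 200) / 1000)

names = list(param_grid)
# Cartesian product of all value lists: one dict per candidate combination.
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Score every candidate and keep the best one.
best_params = max(candidates, key=toy_score)
print(len(candidates), "combinations tried")
print("best:", best_params)
```

The real class additionally splits the training data into folds, averages the score over them, and by default refits the best combination on the whole training set.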

# Gradient Boosting Classifier:
  
  Gradient boosting is a machine learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
  
![](https://imgur.com/aPnHLAE.png)
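
To make the "ensemble of weak models" idea concrete, here is a minimal sketch of boosting for regression with squared loss, where each weak model is a one-split decision stump fit to the residuals of the ensemble so far. It is a simplified illustration on made-up data, not what `GradientBoostingClassifier` does internally (the classifier optimises a classification loss and uses deeper trees), and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on xs that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Boosting for squared loss: each stump is fit to the current residuals."""
    base = sum(ys) / len(ys)  # initial constant prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_estimators):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Toy data made up for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

def mse(model):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

weak = gradient_boost(xs, ys, n_estimators=1)
strong = gradient_boost(xs, ys, n_estimators=100)
print("MSE with 1 stump:   ", round(mse(weak), 4))
print("MSE with 100 stumps:", round(mse(strong), 4))
```

The `learning_rate` and `n_estimators` arguments play the same role here as the two hyperparameters tuned later in this notebook: a smaller learning rate usually needs more estimators to reach the same training error.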

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv')
print(df)

In [None]:
df.isnull().sum()

In [None]:
df = df.fillna(0)
df

In [None]:
df.isnull().sum()


In [None]:
x = df.drop('status', axis=1)
x.head(10)

In [None]:
y = df['status']
y.head(10)

# Label Encoding

In [None]:

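from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
# Encode only the non-numeric columns; label-encoding the numeric scores would replace them with rank codes
text_cols = x.select_dtypes(exclude='number').columns
x[text_cols] = x[text_cols].apply(label_encoder.fit_transform)
print(x)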

In [None]:
# Encode the target; LabelEncoder sorts the class labels before numbering them
y = label_encoder.fit_transform(y)
print(y)
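
For intuition, label encoding can be sketched in plain Python as follows. This mirrors the behaviour of `LabelEncoder` (distinct labels are sorted, then numbered from 0) but is only an illustration; the `label_encode` helper is hypothetical, and the sample values assume the target column contains the two labels `Placed` and `Not Placed`.

```python
def label_encode(values):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes

# Made-up sample of target values for illustration.
status = ["Placed", "Not Placed", "Placed", "Placed", "Not Placed"]
codes, classes = label_encode(status)
print(classes)  # ['Not Placed', 'Placed']
print(codes)    # [1, 0, 1, 1, 0]
```

Because the order is alphabetical, `Not Placed` becomes 0 and `Placed` becomes 1 under this assumption, which matters when reading the confusion matrix later.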

# Train and Test Split

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=109)

# GradientBoostingClassifier and GridSearchCV

In [None]:
#Build Model with GradientBoostingClassifier and GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

In [None]:
#separate numerical and categorical columns
numerical_col = ['ssc_p', 'hsc_p', 'degree_p', 'etest_p', 'mba_p']
categorical_col = ['gender', 'ssc_b', 'hsc_b', 'hsc_s', 'degree_t', 'workex', 'specialisation']

In [None]:
#Create pipelines to handle missing data

#impute missing numerical data with the median, then standardise
numerical_transformer = make_pipeline(SimpleImputer(strategy='median'),
                                      StandardScaler())

#impute missing categorical data with the most frequent value of the feature, then one-hot encode
categorical_transformer = make_pipeline(SimpleImputer(strategy='most_frequent'),
                                        OneHotEncoder(handle_unknown='ignore'))

preprocessor = ColumnTransformer(transformers=[('num', numerical_transformer, numerical_col),
                                               ('cat', categorical_transformer, categorical_col)])

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score


In [None]:
clf = Pipeline([
    ('preprocessor', preprocessor),
    ('model', GradientBoostingClassifier())])

In [None]:
#Using GradientBoostingClassifier with GridSearchCV to get better parameters

param_grid = {'model__learning_rate':[0.001, 0.01, 0.1], 
              'model__n_estimators':[100, 150, 200, 300, 350, 400]}

#param_grid = {'model__learning_rate':[0.1], 
#              'model__n_estimators':[150]}

#use accuracy as the selection metric (set scoring='recall' to optimise recall instead)
grid = GridSearchCV(clf, param_grid, cv=10, scoring='accuracy', n_jobs=-1)

In [None]:
grid.fit(x_train, y_train)

In [None]:
grid.best_params_

In [None]:
from sklearn.metrics import classification_report,confusion_matrix
predictions = grid.predict(x_test)

In [None]:
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
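
The numbers in these two reports all derive from four counts. As a small sketch on made-up labels (not the notebook's actual predictions), the snippet below computes them by hand; the `confusion_counts` helper is hypothetical, and label 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

# Made-up labels for illustration (1 = positive class, 0 = negative class).
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)

# Same layout as sklearn's confusion_matrix: rows are true labels, columns are predictions.
print([[tn, fp], [fn, tp]])  # [[2, 1], [1, 4]]
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

`classification_report` prints precision, recall and F1 once per class, treating each class in turn as the positive one.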

# Plotting

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.countplot(x="degree_t", data=df, hue='specialisation')
plt.title("Candidate degree vs Specialisation")
plt.xlabel("Degree type")
plt.ylabel("Number of candidates")
plt.show()

In [None]:
df.plot.scatter(x='salary', y='mba_p',title='Candidate Performance')

In [None]:
df['salary'].plot.hist()

In [None]:
df['status'].value_counts().sort_index().plot.bar()

In [None]:
df.drop(['sl_no','ssc_p','hsc_p','etest_p'], axis=1).plot.line(title='Candidate Performance')

***Pros:***

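* Exhaustive search: every combination in the grid is evaluated, so it finds the best-scoring combination among the listed values, as measured by cross-validation on the training set.
* Simple to set up: list the candidate values, and the search returns the combination that gives the most 'accurate' predictions under the chosen scoring metric.
* More “efficient” use of data, since with cross-validation every training observation is used for both fitting and validation.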

***Cons:***

* Time-consuming, with a danger of overfitting to the validation folds when many combinations are compared.
* Suffers from the curse of dimensionality: the number of combinations to evaluate grows exponentially with the number of hyperparameters.
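
The cost can be estimated before running anything. The sketch below counts the model fits for the grid used in this notebook (3 learning rates, 6 values of `n_estimators`, `cv=10`) and for a hypothetical extension with two more hyperparameters. `max_depth` and `subsample` are real `GradientBoostingClassifier` parameters, but the candidate values are made up for the example, and `n_fits` is a hypothetical helper.

```python
from math import prod

def n_fits(param_grid, cv):
    """Model fits during the search: one per combination per CV fold."""
    return prod(len(values) for values in param_grid.values()) * cv

# The grid used in this notebook.
grid = {"model__learning_rate": [0.001, 0.01, 0.1],
        "model__n_estimators": [100, 150, 200, 300, 350, 400]}
print(n_fits(grid, cv=10))    # 180

# Hypothetical extension with two more hyperparameters (values are illustrative).
bigger = dict(grid,
              model__max_depth=[2, 3, 4, 5],
              model__subsample=[0.6, 0.8, 1.0])
print(n_fits(bigger, cv=10))  # 2160
```

Adding two short lists multiplied the work by twelve, which is why randomized search, compared in the first reference, is often preferred for larger search spaces.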

# References:

1. https://medium.com/better-programming/comparing-grid-and-randomized-search-methods-in-python-cd9fe9c3572d
2. https://medium.com/@kesarimohan87/model-selection-using-cross-validation-and-gridsearchcv-8756aac1e9d7
3. https://medium.com/datadriveninvestor/an-introduction-to-grid-search-ff57adcc0998
