#Defining the problem


**What is the input?**

The input is data describing the results of speed dating sessions, together with the profiles of the two people on each date.


*  The data has 191 columns, and each row describes one date from a speed dating event.
*  The dataset is otherwise clean, but it has numerous missing values that require preprocessing.






**What is the output?** 

In order to implement a recommendation system to better match people at speed dating events, our model must forecast the result of a specific speed dating session based on the profiles of two people. We are going to estimate the likelihood that a successful match will result from the dating session.




**What data mining function is required?** 

Binary classification, with pipelines whose hyperparameters are tuned using grid search, random search, and Bayesian search.

**What could be the challenges?** 

Our algorithm must be effective and scalable enough to extract information from a large dataset, and we need enough knowledge and experience with the algorithms to improve them when necessary. The dataset contains complex data elements, so we also have to make sure that the chosen algorithm is appropriate for the problem at hand.

**What is the impact?** 

Our model's predictions will lead to the implementation of a recommendation system to more effectively match participants in speed dating events, which will facilitate dating.

**What is an ideal solution?** 

The best option is to clean and prepare the data before using it.

The following are a few potential approaches:


*  Fill in the missing data using the imputation strategy selected by hyperparameter tuning
*  Remove unnecessary columns (features) to reduce the dimensionality of the feature set




**What is the experimental protocol used and how was it carried out?** 

After loading, cleaning, and preprocessing the data, the experimental protocol sets a value for k-fold cross-validation (cv) and runs GridSearchCV, RandomizedSearchCV, or BayesSearchCV, using ROC AUC (roc_auc) to assess performance.
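
As a minimal sketch of this protocol, the block below runs a 5-fold cross-validated search scored by ROC AUC. The synthetic data from `make_classification`, the `demo_` names, and the tiny parameter grid are assumptions made for illustration; they are not the speed dating data or the grids used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (an assumption for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# cv=5 splits the data into five folds; each candidate is scored by the
# mean ROC AUC over the five held-out folds.
demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 4, 6]},
    cv=5,
    scoring='roc_auc',
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(demo_search.best_score_)
```

Swapping `GridSearchCV` for `RandomizedSearchCV` or `BayesSearchCV` keeps the same `cv` and `scoring` arguments, which is what makes the trials comparable.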

**What preprocessing steps are used?** 



*  View the data and understand it

*  Use df.info() to get more insight into the data

*   Check for missing data using df.isna().sum()

*   Convert all object columns to categorical columns

*   Extract the numeric and categorical features

*   Define a pipeline for numeric feature preprocessing that applies StandardScaler
*   Define a pipeline for categorical feature preprocessing that applies OneHotEncoder

*   Define the preprocessor and specify which columns go to the categorical and numeric pipelines


*   Use hyperparameter tuning to find the appropriate strategy for filling in the missing data
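
The steps above can be sketched on a toy frame as follows. The column names (`age`, `income`, `field`) and the `'missing'` fill value are hypothetical and chosen only for illustration. The sketch includes the conversion of object columns to the `category` dtype, since `select_dtypes(include=['category'])` only matches columns that have already been converted.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with hypothetical columns; the real data has many more.
toy = pd.DataFrame({
    'age': [21.0, np.nan, 30.0, 25.0],
    'income': [10.0, 20.0, np.nan, 40.0],
    'field': pd.Series(['law', 'art', None, 'law'], dtype=object),
})

# Convert object columns to 'category' so they can be selected by dtype.
for col in toy.select_dtypes(include=['object']).columns:
    toy[col] = toy[col].astype('category')

numeric_cols = list(toy.select_dtypes(include=['number']))
categorical_cols = list(toy.select_dtypes(include=['category']))

# Numeric columns: impute, then scale.
numeric_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
# Categorical columns: fill gaps with a placeholder label, then one-hot encode.
categorical_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_pipe, numeric_cols),
    ('cat', categorical_pipe, categorical_cols),
])

transformed = toy_preprocessor.fit_transform(toy)
print(numeric_cols, categorical_cols)
print(transformed.shape)
```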






#Questions

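**Why is a simple linear regression model (without any activation function) not good for a classification task, compared to a Perceptron or logistic regression?** 

Linear regression predicts continuous, unbounded values, while classification requires discrete labels. Its output therefore cannot be read as a class probability, because it can fall below 0 or above 1. A second problem is that the decision threshold shifts as more data points are added: a few extreme points can tilt the fitted line and change which inputs fall on each side of the cutoff.

In contrast, a perceptron or logistic regression passes the linear output through a step or sigmoid function, so the outcome is a categorical value such as 0 or 1, Yes or No.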
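
A small sketch of both problems, using made-up one-dimensional data (the arrays and the 0.5 cutoff are assumptions for illustration): adding a single extreme but correctly labeled point moves the linear regression cutoff, and the linear output leaves the [0, 1] range while the logistic output does not.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# One feature with binary labels: x <= 3 is class 0, x >= 4 is class 1.
X_small = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_small = np.array([0, 0, 0, 1, 1, 1])

# Same data plus one extreme, correctly labeled positive example.
X_more = np.vstack([X_small, [[50.0]]])
y_more = np.append(y_small, 1)

linear_small = LinearRegression().fit(X_small, y_small)
linear_more = LinearRegression().fit(X_more, y_more)
logistic_more = LogisticRegression().fit(X_more, y_more)

x_query = np.array([[4.0]])
far_query = np.array([[100.0]])

# With a 0.5 cutoff, x = 4 is on the positive side before the extra point
# is added and on the negative side afterwards.
print(linear_small.predict(x_query)[0] > 0.5)
print(linear_more.predict(x_query)[0] > 0.5)

# Linear regression output is unbounded; logistic regression stays in [0, 1].
print(linear_more.predict(far_query)[0])
print(logistic_more.predict_proba(far_query)[0, 1])
```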

**What's a decision tree and how is it different from a logistic regression model?** 

A decision tree is a tool that maps out possible outcomes, including resource costs, utility, and potential consequences, as a tree of decisions. It is a way of expressing an algorithm that consists only of conditional control statements: each branch represents a possible course of action, and in classification each leaf gives a predicted class.

While logistic regression fits a single line that divides the space into two, a decision tree divides the space into progressively smaller regions. For higher-dimensional data, the line naturally generalizes to a plane or hyperplane.
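
To make the contrast concrete, here is a sketch on XOR-style toy data (randomly generated, not the speed dating data), where the class depends on whether two features have the same sign. No single line separates the classes, so logistic regression struggles, while a fully grown tree can carve the square into regions. Accuracy is measured on the training points only, so this shows how each model partitions the space rather than how well it generalizes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# XOR-style data: class 1 when both features have the same sign.
X_xor = rng.uniform(-1, 1, size=(400, 2))
y_xor = (X_xor[:, 0] * X_xor[:, 1] > 0).astype(int)

logistic_xor = LogisticRegression().fit(X_xor, y_xor)
tree_xor = DecisionTreeClassifier(random_state=0).fit(X_xor, y_xor)

# Training accuracy of each model on the same points.
logistic_acc = logistic_xor.score(X_xor, y_xor)
tree_acc = tree_xor.score(X_xor, y_xor)

print('logistic regression:', logistic_acc)
print('decision tree:', tree_acc)
```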

**What's the difference between grid search and random search?** 

*Grid search*


*   Tries every combination of the parameters

*  Computationally expensive
*  Finds the best combination within the given grid


*   Sklearn: model_selection.GridSearchCV

*Random search*


*   Tries a random subset of the combinations

*   Often good enough
*   May miss the best combination within the given range


*   Efficient (fewer trials)


*   Sklearn: model_selection.RandomizedSearchCV

Random search replaces the exhaustive enumeration of all combinations used in grid search with random selection.


Random search limits the number of hyperparameter combinations tested, while grid search tests all possible combinations.

The cost of grid search increases with the size of the search space, in exchange for finding the best combination in the grid. As an alternative, random search produces results that may not be the best in the grid but are often adequate.

Random search is faster than grid search.
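
The difference in the number of candidates can be seen without fitting any model. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` on the decision tree search space from this notebook; `n_iter=5` and `random_state=0` are arbitrary choices for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Search space of the decision tree trials in this notebook.
space = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [6, 7, 8, 10, 12, 14],
}

# Grid search enumerates every combination.
grid_candidates = list(ParameterGrid(space))

# Random search draws a fixed number of combinations from the same space.
random_candidates = list(ParameterSampler(space, n_iter=5, random_state=0))

print(len(grid_candidates))    # 2 * 6 = 12 combinations
print(len(random_candidates))  # 5 sampled combinations
```

With cv=5, the grid costs 60 model fits and the random sample costs 25, and the gap widens quickly as more hyperparameters are added.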









**What's the difference between bayesian search and random search?** 


Bayesian optimization methods are effective because they choose hyperparameters intelligently. Because Bayesian approaches prioritize hyperparameters that seem more promising from prior findings, they can find the best hyperparameters faster (with fewer iterations) than grid search and random search.


#Importing libraries 
These libraries make our tasks much easier, because they provide ready-made functions and models that we would otherwise have to write ourselves.

In [None]:
#import libraries
import pandas as pd # data cleaning and analysis
import numpy as np # a wide variety of mathematical operations on arrays
from sklearn.metrics import confusion_matrix, classification_report # assess the quality of predictions
from sklearn.compose import ColumnTransformer # applies different transformers to different column subsets and concatenates the results
from sklearn.datasets import fetch_openml # fetches an OpenML dataset by name (optionally with a version) or by data_id
from sklearn.pipeline import Pipeline # assembles several steps that can be cross-validated together while setting different parameters
from sklearn.impute import SimpleImputer # completes missing values with simple strategies
from sklearn.preprocessing import StandardScaler, OneHotEncoder # StandardScaler: removes the mean and scales each feature to unit variance; OneHotEncoder: converts categorical data into numerical features
from sklearn.model_selection import train_test_split, GridSearchCV # train_test_split: creates training and test data; GridSearchCV: fits the estimator for every combination of predefined hyperparameters
from sklearn.model_selection import RandomizedSearchCV # samples hyperparameter combinations at random, scores them, and reports the best one
from sklearn.model_selection import PredefinedSplit # provides train/test indices from a user-defined scheme given by the test_fold parameter
from xgboost.sklearn import XGBClassifier # gradient-boosted tree classifier that often produces state-of-the-art predictions
from sklearn import feature_selection # feature selection to improve estimators' accuracy or their performance on very high-dimensional datasets
from sklearn.feature_selection import SelectKBest, chi2, SelectFromModel # SelectKBest: keeps the k highest-scoring features; chi2: chi-squared statistic between each non-negative feature and the class; SelectFromModel: meta-estimator that selects features whose importance exceeds a threshold
from sklearn.feature_selection import f_classif # ANOVA F-value between each feature and the class, testing for a significant difference between group means
from sklearn import linear_model, preprocessing # linear_model: linear models; preprocessing: utilities and transformers that turn raw feature vectors into a representation better suited to downstream estimators
from sklearn import decomposition # matrix decomposition algorithms such as PCA, used for dimensionality reduction

#Load the data
Read the training and test sets from their CSV files.

In [None]:
# reading the training dataset 
df_train = pd.read_csv('/content/train.csv')  
# reading the testing dataset 
df_test = pd.read_csv('/content/test.csv') 

In [None]:
#show the first rows of the training data
df_train.head()

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,8.0,8.0,7.0,8.0,,,,,,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,9.0,8.0,8.0,6.0,,,,,,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828


In [None]:
#show the first rows of the testing data
df_test.head()

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,7.0,8.0,6.0,8.0,,,,,,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,8.0,7.0,7.0,8.0,6.0,7.0,6.0,5.0,5.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052


In [None]:
# show the training data shape
df_train.shape

(5909, 192)

In [None]:
# show the testing data shape
df_test.shape

(2469, 191)

In [None]:
# #import train and test data and merge them together for the preprocessing step..
# df_train = pd.read_csv('/content/train.csv', index_col='id')
# df_test = pd.read_csv('/content/test.csv', index_col='id')
# data =pd.concat([df_train,df_test])

#Check the null values 
Missing values have to be located before modeling, because many estimators cannot handle them and the column dropping and imputation steps below depend on how many there are.

In [None]:
# check the null values in training data
df_train.isnull().sum().sort_values(ascending=True)

gender         0
samerace       0
match          0
partner        0
order          0
            ... 
sinc7_2     4519
amb7_2      4519
expnum      4627
numdat_3    4849
num_in_3    5449
Length: 192, dtype: int64

In [None]:
# show the total number of null values in the training data
df_train.isna().sum().sum()

304971

In [None]:
#display training data information
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 192 entries, gender to id
dtypes: float64(173), int64(11), object(8)
memory usage: 8.7+ MB


In [None]:
#extract numeric features and categorical features names
#for later use
# numeric features can be selected by: (based on the df_train.info() output )
features_numeric = list(df_train.select_dtypes(include=['float64', 'int64']))

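# categorical features can be selected by: (based on the df_train.info() output )
# note: this only matches columns already converted to the 'category' dtype;
# the 8 object columns reported by df_train.info() are not matched until they are converted
features_categorical = list(df_train.select_dtypes(include=['category']))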

print('numeric features:', features_numeric)
print('categorical features:', features_categorical)

numeric features: ['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1', 'order', 'partner', 'pid', 'match', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'race', 'imprace', 'imprelig', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'expnum', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'attr1

In [None]:
# check the null values in testing data
df_test.isnull().sum().sort_values(ascending=True)

gender         0
samerace       0
partner        0
order          0
position       0
            ... 
amb7_2      1904
sinc7_2     1904
expnum      1951
numdat_3    2033
num_in_3    2261
Length: 191, dtype: int64

In [None]:
# show the total number of null values in the testing data
df_test.isna().sum().sum()

127044

In [None]:
#display testing data information
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2469 entries, 0 to 2468
Columns: 191 entries, gender to id
dtypes: float64(173), int64(10), object(8)
memory usage: 3.6+ MB


In [None]:
#extract numeric features and categorical features names
#for later use
# numeric features (note: still selected from df_train, so 'match' appears in the output)
features_numeric = list(df_train.select_dtypes(include=['float64', 'int64']))

# categorical features (note: still selected from df_train)
features_categorical = list(df_train.select_dtypes(include=['category']))

print('numeric features:', features_numeric)
print('categorical features:', features_categorical)

numeric features: ['gender', 'idg', 'condtn', 'wave', 'round', 'position', 'positin1', 'order', 'partner', 'pid', 'match', 'int_corr', 'samerace', 'age_o', 'race_o', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o', 'age', 'field_cd', 'race', 'imprace', 'imprelig', 'goal', 'date', 'go_out', 'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums', 'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater', 'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy', 'expnum', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'attr4_1', 'sinc4_1', 'intel4_1', 'fun4_1', 'amb4_1', 'shar4_1', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1', 'attr3_1', 'sinc3_1', 'fun3_1', 'intel3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'intel5_1', 'fun5_1', 'amb5_1', 'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob', 'met', 'match_es', 'attr1

In [None]:
# Below code gives percentage of null in every column
null_percentage = df_train.isnull().sum()/df_train.shape[0]*100
null_percentage


gender       0.000000
idg          0.000000
condtn       0.000000
wave         0.000000
round        0.000000
              ...    
sinc5_3     76.087324
intel5_3    76.087324
fun5_3      76.087324
amb5_3      76.087324
id           0.000000
Length: 192, dtype: float64

In [None]:
# Below code gives list of columns having more than 60% null
col_to_drop = null_percentage[null_percentage>60].keys()
col_to_drop

Index(['mn_sat', 'expnum', 'attr7_2', 'sinc7_2', 'intel7_2', 'fun7_2',
       'amb7_2', 'shar7_2', 'numdat_3', 'num_in_3', 'attr7_3', 'sinc7_3',
       'intel7_3', 'fun7_3', 'amb7_3', 'shar7_3', 'attr4_3', 'sinc4_3',
       'intel4_3', 'fun4_3', 'amb4_3', 'shar4_3', 'attr2_3', 'sinc2_3',
       'intel2_3', 'fun2_3', 'amb2_3', 'shar2_3', 'attr5_3', 'sinc5_3',
       'intel5_3', 'fun5_3', 'amb5_3'],
      dtype='object')

In [None]:
#training data after dropping the columns with more than 60 percent null values
df_train = df_train.drop(col_to_drop, axis=1)
df_train

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,intel1_3,fun1_3,amb1_3,shar1_3,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,id
0,0,3,2,14,18,2,2.0,14,12,372.0,...,,,,,,,,,,2583
1,1,14,1,3,10,2,,8,8,63.0,...,15.00,20.00,10.00,15.00,6.0,8.0,8.0,7.0,8.0,6830
2,1,14,1,13,10,8,8.0,10,10,331.0,...,,,,,,,,,,4840
3,1,38,2,9,20,18,13.0,6,7,200.0,...,18.37,18.37,14.29,14.29,8.0,9.0,8.0,8.0,6.0,5508
4,1,24,2,14,20,6,6.0,20,17,357.0,...,,,,,,,,,,4828
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5904,0,1,2,9,20,2,2.0,18,1,214.0,...,18.87,15.09,16.98,13.21,12.0,12.0,12.0,9.0,12.0,3390
5905,1,24,2,9,20,19,15.0,5,6,199.0,...,,,,,,,,,,4130
5906,0,13,2,11,21,5,5.0,3,18,290.0,...,,,,,,,,,,1178
5907,1,10,2,7,16,6,14.0,9,10,151.0,...,,,,,,,,,,5016


In [None]:
# show the total number of null values left in the training data
df_train.isna().sum().sum()

163165

In [None]:
# Below code gives percentage of null in every column
null_percentage = df_test.isnull().sum()/df_test.shape[0]*100
null_percentage

gender       0.000000
idg          0.000000
condtn       0.000000
wave         0.000000
round        0.000000
              ...    
sinc3_3     52.612394
intel3_3    52.612394
fun3_3      52.612394
amb3_3      52.612394
id           0.000000
Length: 158, dtype: float64

In [None]:
# Below code gives list of columns having more than 60% null
col_to_drop = null_percentage[null_percentage>60].keys()
col_to_drop

Index([], dtype='object')

In [None]:
#testing data after dropping the columns with more than 60 percent null values
df_test = df_test.drop(col_to_drop, axis=1)
df_test

Unnamed: 0,gender,idg,condtn,wave,round,position,positin1,order,partner,pid,...,intel1_3,fun1_3,amb1_3,shar1_3,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,id
0,0,5,2,2,16,3,,13,13,52.0,...,30.00,15.00,10.00,10.00,5.0,7.0,8.0,6.0,8.0,934
1,0,33,2,14,18,6,6.0,4,8,368.0,...,30.00,15.00,20.00,5.00,6.0,8.0,7.0,7.0,8.0,6539
2,1,6,2,9,20,10,16.0,15,19,212.0,...,,,,,,,,,,6757
3,1,26,2,2,19,15,,8,10,30.0,...,,,,,,,,,,2275
4,0,29,2,7,16,7,7.0,10,5,162.0,...,,,,,,,,,,1052
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2464,0,23,2,15,19,18,18.0,14,11,407.0,...,,,,,,,,,,7982
2465,0,5,1,13,9,4,4.0,4,8,339.0,...,,,,,,,,,,7299
2466,1,26,2,2,19,3,,15,3,23.0,...,,,,,,,,,,1818
2467,0,19,2,9,20,11,11.0,9,2,215.0,...,20.41,20.41,18.37,6.12,9.0,7.0,12.0,12.0,9.0,937


In [None]:
# show the total number of null values left in the testing data
df_test.isna().sum().sum()

68050
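
The cells above compute the drop list separately for each frame, which only yields matching columns when the same columns cross the 60% threshold in both. A minimal sketch of one alternative is to derive the list from the training frame and apply it to both. The toy frames and the `_demo` names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frames with hypothetical columns; 'b' is mostly missing in train only.
train_demo = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [np.nan, np.nan, np.nan, np.nan, 1.0],
    'c': [1.0, np.nan, 3.0, 4.0, 5.0],
})
test_demo = pd.DataFrame({
    'a': [1.0, 2.0],
    'b': [1.0, 2.0],
    'c': [np.nan, 2.0],
})

# Percentage of missing values per column, computed on the training frame only.
null_pct = train_demo.isnull().mean() * 100
cols_to_drop = null_pct[null_pct > 60].index

# Apply the same drop list to both frames so their columns stay aligned.
train_demo = train_demo.drop(columns=cols_to_drop)
test_demo = test_demo.drop(columns=cols_to_drop)

print(list(train_demo.columns))
print(list(test_demo.columns))
```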

In [None]:
#separate the target (match) and drop identifier columns that should not be used as features
x=df_train.drop(['match','pid','idg','partner'],axis=1)
y=df_train['match']

In [None]:
#drop the same identifier columns from the testing data
test_data=df_test.drop(['pid','idg','partner'],axis=1)

In [None]:
df_test.shape

(2469, 158)

In [None]:
x.shape

(5909, 155)

In [None]:
features_categorical = list(x.select_dtypes(include=['category']))
print('categorical features:', features_categorical)

categorical features: []


In [None]:
#split features to numeric and categorical
features_numeric = list(x.select_dtypes(include=['float64', 'int64']))
features_categorical = list(x.select_dtypes(include=['category']))
features_categorical

[]

In [None]:
np.random.seed(0)

# define a pipeline for numeric feature preprocessing
# the steps are named so that their hyperparameters can be set later
#SimpleImputer: univariate imputer for completing missing values with simple strategies.
#StandardScaler: removes the mean and scales each feature/variable to unit variance.
transformer_numeric = Pipeline(
    steps=[
        ('imputer', SimpleImputer()),
        ('scaler', StandardScaler())]
)

# define a pipeline for categorical feature preprocessing
# the steps are named so that their hyperparameters can be set later
transformer_categorical = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)
# define the preprocessor 
# the transformers are named so that their hyperparameters can be set later
# also specify which columns are numeric and which are categorical
preprocessor = ColumnTransformer(
    transformers=[
        ('num', transformer_numeric, features_numeric),
        ('cat', transformer_categorical, features_categorical)
    ]
)


In [None]:
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5909 entries, 0 to 5908
Columns: 155 entries, gender to id
dtypes: float64(140), int64(8), object(7)
memory usage: 7.0+ MB


#First Model: Decision Tree

In [None]:
from sklearn import tree # decision trees predict the target by learning simple decision rules inferred from the data features
#set the list of candidate values for each hyperparameter
criterion = ['gini', 'entropy'] #gini: Gini impurity, the chance of mislabeling a random sample drawn from the node; entropy: a measure of disorder or impurity in a node
max_depth = [6,7,8,10,12,14] #max_depth: the maximum depth of the tree
dec_tree = tree.DecisionTreeClassifier() #the classifier 
parameters = dict(dec_tree__criterion=criterion,
                  dec_tree__max_depth=max_depth)
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                       #use feature selection function to select best features
                       ('feature_selection', SelectKBest(score_func=f_classif,k=140)),
                       ('dec_tree', dec_tree)])
pipe.fit(x,y)

Trial 1: Decision Tree with Random search


In [None]:
#use random search to find optimal hyperparameters
random_search_1 = RandomizedSearchCV(
    pipe, parameters, cv=5, verbose=1, n_jobs=2, 
    # number of random trials
    n_iter=10,
    scoring='roc_auc')

random_search_1.fit(x, y)

print('best score {}'.format(random_search_1.best_score_))
print('best params {}'.format(random_search_1.best_params_))

Fitting 5 folds for each of 10 candidates, totalling 50 fits
best score 0.8012456731407477
best params {'dec_tree__max_depth': 6, 'dec_tree__criterion': 'entropy'}


Trial 2: Decision Tree with Bayesian Search

In [None]:
!pip install scikit-optimize
import skopt
from skopt import BayesSearchCV #scikit-optimize is an open-source Python library for optimization tasks; BayesSearchCV implements “fit” and “score” methods like the other search classes
from skopt.space import Real, Categorical, Integer #search space dimensions: Real for continuous values, Categorical for a fixed set of choices, Integer for integer values
bayes_search_1 = BayesSearchCV(
    pipe, parameters, cv=5, verbose=1, n_jobs=2, 
    # number of parameter settings sampled
    n_iter=15,
    scoring='roc_auc')

bayes_search_1.fit(x, y)

print('best score {}'.format(bayes_search_1.best_score_))
print('best params {}'.format(bayes_search_1.best_params_))

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
best score 0.7994273553209752
best params OrderedDict([('dec_tree__criterion', 'entropy'), ('dec_tree__max_depth', 6)])


Trial 3: Decision Tree with GridSearch


In [None]:
# cv=6 means six-fold cross-validation
# n_jobs is the number of jobs run concurrently
# (on Colab we only have two CPU cores, so we set it to 2)
grid_search_1 = GridSearchCV(
    pipe, parameters, cv=6, verbose=1, n_jobs=2, 
    scoring='roc_auc')

grid_search_1.fit(x, y)
print('best score {}'.format(grid_search_1.best_score_))
print('best params {}'.format(grid_search_1.best_params_))

Fitting 6 folds for each of 12 candidates, totalling 72 fits
best score 0.8061428710743823
best params {'dec_tree__criterion': 'entropy', 'dec_tree__max_depth': 6}


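Based on these results, the decision tree tuned with grid search scored best (ROC AUC of about 0.806, versus 0.801 for random search and 0.799 for Bayesian search). Grid search used cv=6 while the other two used cv=5, so the scores are not strictly comparable.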

#Second Model: XGBoost *Classifier*

In [None]:
# combine the preprocessor with the model as a full tunable pipeline
# we gave them a name so we can set their hyperparameters
full_pipline_2 = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('feature_selection', SelectKBest(score_func=f_classif,k=140)),
                                 ('my_classifier', XGBClassifier())])
full_pipline_2.fit(x,y)
param_grid = {'preprocessor__num__imputer__strategy': ['mean'],
             # preprocessor__num__imputer__strategy points to preprocessor->num (a Pipeline)-> imputer -> strategy
             'my_classifier__n_estimators': [10,20,25, 30, 40,50,60,70,80],  
             # my_classifier__n_estimators points to my_classifier->n_estimators 
             'my_classifier__max_depth':[5,10, 20, 30,40,50] }
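
The double-underscore names used in `param_grid` can be discovered with `get_params()`. The sketch below rebuilds a small stand-in pipeline with the same step names; the `demo_` objects, the column indices, and the decision tree classifier (used here only to keep the sketch free of the xgboost dependency) are assumptions for illustration. It also shows a hypothetical wider grid in which the imputation strategy itself is tuned rather than fixed to `'mean'`.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in pipeline with the same step names as the one above.
demo_numeric = Pipeline(steps=[('imputer', SimpleImputer()),
                               ('scaler', StandardScaler())])
demo_preprocessor = ColumnTransformer(transformers=[('num', demo_numeric, [0, 1])])
demo_pipeline = Pipeline(steps=[('preprocessor', demo_preprocessor),
                                ('my_classifier', DecisionTreeClassifier())])

# Every tunable name follows the pattern step__substep__parameter.
tunable = demo_pipeline.get_params()
print('preprocessor__num__imputer__strategy' in tunable)
print('my_classifier__max_depth' in tunable)

# Hypothetical grid that lets the search choose the imputation strategy.
demo_param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median', 'most_frequent'],
    'my_classifier__max_depth': [3, 5],
}

# The same names work with set_params, which is what the search classes call.
demo_pipeline.set_params(preprocessor__num__imputer__strategy='median')
print(demo_pipeline.get_params()['preprocessor__num__imputer__strategy'])
```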

Trial 4: XGBoost with Random Search

In [None]:
random_search_2 = RandomizedSearchCV(
    full_pipline_2, param_grid, cv=5, verbose=1, n_jobs=2, 
    # number of random trials
    n_iter=10,
    scoring='roc_auc')

random_search_2.fit(x, y)

print('best score {}'.format(random_search_2.best_score_))
print('best params {}'.format(random_search_2.best_params_))

Fitting 5 folds for each of 10 candidates, totalling 50 fits
best score 0.8759327372364328
best params {'preprocessor__num__imputer__strategy': 'mean', 'my_classifier__n_estimators': 80, 'my_classifier__max_depth': 20}


Trial 5: XGBoost with Bayesian Search

In [None]:
bayes_search_2 = BayesSearchCV(
    full_pipline_2, param_grid, cv=5, verbose=1, n_jobs=2, 
    # number of parameter settings sampled
    n_iter=20,
    scoring='roc_auc')

bayes_search_2.fit(x, y)

print('best score {}'.format(bayes_search_2.best_score_))
print('best params {}'.format(bayes_search_2.best_params_))

Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
Fitting 5 folds for each of 1 candidates, totalling 5 fits
best score 0.8781963015034879
best params OrderedDict([('my_classifier__max_depth', 10), ('my_classifier__n_estimators', 80), ('preprocessor__num__imputer__strategy', 'mean')])


In [None]:
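submission = pd.DataFrame()

# use the id column from the test file (df_test.index is only the row position)
submission['id'] = df_test['id']

# predicted probability of a match (the positive class)
submission['match'] = bayes_search_2.predict_proba(df_test)[:,1]

submission.to_csv('Bayesian_search_XG.csv', index=False)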

Trial 6: XGBoost classifier with Grid Search

In [None]:
# cv=3 means three-fold cross-validation
# n_jobs is the number of jobs run concurrently
# (on Colab we only have two CPU cores, so we set it to 2)
grid_search_2 = GridSearchCV(
    full_pipline_2, param_grid, cv=3, verbose=1, n_jobs=2, 
    scoring='roc_auc')

grid_search_2.fit(x, y)
print('best score {}'.format(grid_search_2.best_score_))
print('best params {}'.format(grid_search_2.best_params_))

Fitting 3 folds for each of 54 candidates, totalling 162 fits
best score 0.8754023626858495
best params {'my_classifier__max_depth': 10, 'my_classifier__n_estimators': 70, 'preprocessor__num__imputer__strategy': 'mean'}


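Based on these results, XGBoost tuned with Bayesian search scored best (ROC AUC of about 0.878, versus 0.876 for random search and 0.875 for grid search). Grid search used cv=3 while the other two used cv=5, so the scores are not strictly comparable.

#Comparing the decision tree and XGBoost 
XGBoost is the better classifier: its best cross-validated ROC AUC (about 0.878) is well above the decision tree's best (about 0.806).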