This notebook covers some basic concepts for applying model selection techniques in machine learning.
I will give an overview of K-Fold cross-validation and grid search, both of which are part of the sklearn library.
We will use K-Fold cross-validation to score different classifiers on the same dataset and see how they perform, and then use grid search to find the best parameters for the model we choose based on the cross-validation scores.
The dataset chosen here is the Social Network Ads dataset, which I found among the Kaggle datasets.

In [None]:
# libraries import 
import numpy as np  
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt


In [None]:
# reading the data set
data_set = pd.read_csv('../input/social-network-ad/Social_Network_Ads.csv')

In [None]:
# Analysis of the dataset 
data_set.head()

As we can see, the dataset contains 5 columns in total. USER_ID, Gender, Age and Estimated Salary are the independent attributes, and Purchased is the dependent attribute that we need to predict.

Looking at the attributes, 'Gender' is categorical, so if we want to use it as an input to a machine learning algorithm we have to encode it as numbers first. The other attributes are numeric and can be used as they are.
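If Gender were kept as an input, it would first have to be converted to numbers. Here is a minimal sketch of two common ways to do that, assuming the column holds the strings 'Male' and 'Female'. The tiny DataFrame `toy` is made up for illustration and is not the real dataset.

```python
import pandas as pd

# Hypothetical toy frame that mimics two columns of the dataset (not real data)
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Age': [19, 35, 26, 47],
})

# Option 1: map the two categories to 0/1 by hand
toy['Gender_code'] = toy['Gender'].map({'Female': 0, 'Male': 1})

# Option 2: one-hot encode; drop_first drops the redundant 'Gender_Female' column
encoded = pd.get_dummies(toy[['Gender', 'Age']], columns=['Gender'], drop_first=True)

print(toy)
print(encoded)
```

With only two categories both options carry the same information; one-hot encoding becomes the safer choice once a column has three or more categories.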

In [None]:
# Plot
import plotly.express as px
fig = px.scatter_3d(data_set, x='Age', y= 'EstimatedSalary',z = 'Gender',
              color='Purchased', symbol='Purchased', opacity=0.7)
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()

In the figure above, Gender does not seem to carry much information, as both purchase outcomes are spread roughly equally across the two genders. So next we visualise the relationship between the purchase decision and the age and income of the users.

As we can see in the figure below, the two classes are separable with respect to age and income. So for our analysis we will use Age and Estimated Salary as the only input attributes.

In [None]:
fig = px.scatter(data_set, x='Age', y= 'EstimatedSalary',
              color='Purchased', symbol='Purchased', opacity=0.7)
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()

In [None]:
X = data_set.iloc[:,[2,3]].values
Y = data_set.iloc[:,4].values
# Splitting the dataset into train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)

In [None]:
# Preprocessing the dataset
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# K-Fold Cross Validation

Since this section is about model selection, we will pick two different classifiers

1) Logistic regression

2) Support Vector Machine

and then use the cross-validation scores to decide which algorithm to choose.
Note: From the figure above it is easy to see that logistic regression will not work well on our dataset, as the data does not seem to be linearly separable. It is included only as an example, since the aim here is to learn how to use cross-validation scores to choose between models.
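To make concrete what `cross_val_score` does with `cv=10`, here is a minimal sketch that runs the folds by hand. It uses synthetic data from `make_moons` as a stand-in for the scaled training set (an assumption, not the Social Network Ads data), and the names `X_demo`, `y_demo` and `manual_scores` are illustrative. For a classifier, an integer `cv` is resolved to stratified folds, so the hand-written loop should reproduce the library's scores.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic two-feature data standing in for the scaled training set (assumption)
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)

model = LogisticRegression(random_state=0)
skf = StratifiedKFold(n_splits=10)  # what cv=10 resolves to for a classifier

manual_scores = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # accuracy on the one fold that was held out from training
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(model, X_demo, y_demo, cv=10)
print(manual_scores.mean(), auto_scores.mean())
```

Each fold trains on nine tenths of the data and scores on the remaining tenth. The mean and standard deviation of these ten scores are what we compare between models.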

In [None]:
# Applying the logistic regression classifier 
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=0)
lr.fit(X_train, Y_train)

In [None]:
# Applying the SVM 
from sklearn.svm import SVC
sv = SVC(kernel='rbf', random_state = 0)
sv.fit(X_train, Y_train)

**Now we will apply K-Fold cross-validation to check which model is better for our dataset**

In [None]:
from sklearn.model_selection import cross_val_score
score_lr = cross_val_score(estimator= lr, X= X_train,y=Y_train,cv= 10, n_jobs= -1)
score_sv = cross_val_score(estimator= sv, X= X_train,y=Y_train,cv= 10, n_jobs= -1)


print('SVM : Mean of the accuracies is %.2f percent' % (score_sv.mean()*100))
print('SVM : the standard deviation of the accuracies is %.2f percent' % (score_sv.std()*100))

print('log_Regression: Mean of the accuracies is %.2f percent' % (score_lr.mean()*100))
print('log_Regression: the standard deviation of the accuracies is %.2f percent' % (score_lr.std()*100))

plt.plot(range(len(score_sv)), score_sv, label='SVM')
plt.plot(range(len(score_lr)), score_lr, label = 'logistic regression')
plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.legend()

The cross-validation scores show that the SVM performs much better than the logistic regression model on our dataset, as we expected.
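One caveat with this comparison: the scaler was fitted on the whole training set before cross-validation, so every validation fold has already influenced the scaling. The effect is probably small here, but a stricter variant wraps the scaler and the model in a pipeline so that the scaling is re-fitted inside every fold. The sketch below shows the idea on synthetic `make_moons` data, which stands in for the real features; `svm_pipeline` and `pipeline_scores` are illustrative names.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, unscaled stand-in for the raw Age / EstimatedSalary features (assumption)
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)

# The scaler is fitted only on the training part of each fold
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf', random_state=0))
pipeline_scores = cross_val_score(svm_pipeline, X_demo, y_demo, cv=10)

print('mean accuracy: %.2f percent' % (pipeline_scores.mean() * 100))
```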

# Grid Search
Now that we know the SVM is the better of the two models, we will use grid search to tune its parameters.
An SVM has several parameters. For example, 'C' is the regularisation (penalty) parameter: smaller values regularise more strongly and help to avoid overfitting. Grid search determines the best value of C for our dataset. SVC also supports different kernels, and grid search can determine which kernel works best as well. In the same way, any parameter of an algorithm can be tuned with grid search, which tries every candidate combination and keeps the one with the best cross-validation score.
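To get a feel for how much work a grid search does, the sketch below counts the candidates in the same grid that is used in the next cell. `ParameterGrid` is the sklearn helper that expands a grid into individual parameter combinations; the names `grid`, `candidates` and `n_folds` are only for illustration.

```python
from sklearn.model_selection import ParameterGrid

grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'kernel': ['rbf'],
     'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]},
]

candidates = list(ParameterGrid(grid))
n_folds = 10

# 4 linear candidates + 4 * 7 rbf candidates
print(len(candidates))            # 32
# every candidate is trained and scored once per fold
print(len(candidates) * n_folds)  # 320
print(candidates[0])
```

The cost grows multiplicatively with every extra parameter, which is why it helps to keep the grid coarse at first and refine it around the best values afterwards.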

Let's apply this.


In [None]:
from sklearn.model_selection import GridSearchCV
# list of parameter dictionaries used as an input to grid search
parameters = [
    {'C':[1,10,100,1000], 'kernel': ['linear']},
    {'C':[1,10,100,1000], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]}
]
# applying grid search
gscv = GridSearchCV(estimator = sv, 
                   param_grid= parameters,
                   scoring = 'accuracy', cv= 10, n_jobs= -1)
gscv = gscv.fit(X_train, Y_train)

best_acc = gscv.best_score_
print('best accuracy is %.2f percent ' %(best_acc*100))
best_parameters = gscv.best_params_
print('best parameters are : ', best_parameters)

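Now that we have the best parameters for our model, we can use the tuned model and look at the results. By default GridSearchCV refits the best parameter combination on the whole training set, so gscv.predict uses that tuned model.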
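The refitting behaviour can be checked on a small example. The sketch below uses synthetic `make_moons` data and a deliberately tiny grid (both are assumptions for illustration, not the settings used in this notebook) to show that the search object and its `best_estimator_` give the same predictions.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_moons(n_samples=200, noise=0.3, random_state=0)

search = GridSearchCV(SVC(random_state=0),
                      param_grid={'C': [1, 10], 'gamma': [0.1, 0.5]},
                      scoring='accuracy', cv=5)  # refit=True is the default
search.fit(X_demo, y_demo)

# SVC refitted on all of X_demo with the best parameter combination
best_model = search.best_estimator_

# search.predict simply delegates to the refitted best model
same = bool((search.predict(X_demo) == best_model.predict(X_demo)).all())
print(search.best_params_, same)
```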

In [None]:
predictions_gridsearch = gscv.predict(X_test)
from sklearn.metrics import confusion_matrix
cm_gridsearch = confusion_matrix(Y_test, predictions_gridsearch)
print(cm_gridsearch)

In [None]:
# Visualisation
# Training results 
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, Y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))

plt.contourf(X1, X2, gscv.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('green','yellow')))

plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

for i, j in enumerate(np.unique(y_set)):
    # color= (not c=) avoids matplotlib's warning about a single RGBA tuple
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('green', 'yellow'))(i), label = j)
plt.title('Kernel SVM (Grid Search) -Training set')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

In [None]:

X_set, y_set = X_test, Y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, gscv.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('green','yellow')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    # color= (not c=) avoids matplotlib's warning about a single RGBA tuple
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('green','yellow'))(i), label = j)
plt.title('Kernel SVM (Grid Search)-Test set')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

**Thank you for reading this kernel. If you found it useful, I would really appreciate it if you upvoted it or left a short comment below.**