# Is there a cat in your data?
Because this is such a common task and important skill to master, we've put together a dataset that contains only categorical features, and includes:

- binary features
- low- and high-cardinality nominal features
- low- and high-cardinality ordinal features
- (potentially) cyclical features

https://www.kaggle.com/alexisbcook/categorical-variables <br>
https://towardsdatascience.com/understanding-feature-engineering-part-2-categorical-data-f54324193e63

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# Import Libraries

In [None]:
# core data handling and plotting libraries
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sb
%matplotlib inline
import numpy as np
from pandas import ExcelWriter
from pandas import ExcelFile
import xlrd
from scipy import stats
from datetime import datetime

# feature hashing
from sklearn.feature_extraction import FeatureHasher

# target encoder
import category_encoders as ce

# feature selection
from sklearn.ensemble import ExtraTreesClassifier
from sklearn import feature_selection

# oversampling
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

# building the models
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import tensorflow
# from tensorflow.contrib.keras import models, layers
# from tensorflow.contrib.keras import activations, optimizers, losses

# standardize the variables
from sklearn.preprocessing import StandardScaler

# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split

# validation
from sklearn.metrics import confusion_matrix,classification_report

# Get the Data

In [None]:
train = pd.read_csv('../input/cat-in-the-dat/train.csv')
test = pd.read_csv('../input/cat-in-the-dat/test.csv')
submission = pd.read_csv('../input/cat-in-the-dat/sample_submission.csv')

In [None]:
train.head()

In [None]:
test.head()

In [None]:
submission.head()

In [None]:
train.drop(['id'],axis=1,inplace=True)

In [None]:
test.drop(['id'],axis=1,inplace=True)

In [None]:
train['target'].value_counts()

In [None]:
train.dtypes

# Exploratory Data Analysis
In this section, I clean the data and check for missing values and class imbalance.

## Let's check the dimensions of the dataset

In [None]:
train.shape

In [None]:
test.shape

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.isnull() # Checking missing values

In [None]:
train.isnull().sum() # check the missing values

In [None]:
sb.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

<h4>Evaluating for Missing Data</h4>

When pandas reads the data, missing values are stored as NaN. We use pandas methods to identify these missing values. There are two methods to detect missing data:
<ol>
    <li><b>.isnull()</b></li>
    <li><b>.notnull()</b></li>
</ol>
The output is a DataFrame of boolean values with the same shape as the input, indicating for each value whether it is missing.

In [None]:
missing_data = train.isnull()
missing_data.head(5)

<h4>Count missing values in each column</h4>
<p>
Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, "True" represents a missing value and "False" means the value is present in the dataset. In the body of the for loop, the method ".value_counts()" counts how many "True" and "False" values each column contains.
</p>

In [None]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")    

In [None]:
test.isnull() # Checking missing values

In [None]:
test.isnull().sum() # check the missing values

In [None]:
sb.heatmap(test.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
missing_data = test.isnull()
missing_data.head(5)

<h4>Count missing values in each column of the test set</h4>
<p>
The same for loop, applied to the test set, counts the "True" (missing) and "False" (present) values in each column.
</p>

In [None]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")    

In conclusion, there are no missing values in this dataset, so we don't need to handle them :)

# Converting Categorical Features
We'll need to convert categorical features to numerical features. Otherwise our machine learning algorithm won't be able to directly take in those features as inputs.

In [None]:
train.head()

In [None]:
train.dtypes

## Feature hashing

In [None]:
from sklearn.feature_extraction import FeatureHasher
fh_cols = ['fh1', 'fh2', 'fh3', 'fh4', 'fh5', 'fh6', 'fh7', 'fh8']
fh = FeatureHasher(n_features=8, input_type='string')
# Wrap each value in a list so that the whole string is hashed as a single token
sp = fh.fit_transform([[value] for value in train['ord_5']])
df = pd.DataFrame(sp.toarray(), columns=fh_cols, index=train.index)
# pd.concat returns a new DataFrame, so assign it back to keep the hashed columns
train = pd.concat([train, df], axis=1)
train.drop('ord_5',axis=1,inplace=True)
train

In [None]:
# FeatureHasher is stateless, so the same hasher maps test values to the same columns
sp = fh.transform([[value] for value in test['ord_5']])
df = pd.DataFrame(sp.toarray(), columns=fh_cols, index=test.index)
test = pd.concat([test, df], axis=1)
test.drop('ord_5',axis=1,inplace=True)
test
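
To make the idea behind feature hashing concrete, here is a minimal pure-Python sketch of the hashing trick. It is an illustration only: `hash_feature` is a hypothetical helper, the sample value `"kr"` is made up, and the sketch uses MD5 from `hashlib` for convenience, whereas scikit-learn's `FeatureHasher` uses MurmurHash3. The resulting vectors will therefore not match the `fh1`..`fh8` columns produced above.

```python
import hashlib

def hash_feature(value, n_features=8):
    """Map a string to a fixed-length vector with a single signed entry."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # The first four bytes pick the column, so any string fits into n_features slots
    index = int.from_bytes(digest[:4], "little") % n_features
    # A further byte picks the sign, which reduces the bias caused by collisions
    sign = 1 if digest[4] % 2 == 0 else -1
    vector = [0] * n_features
    vector[index] = sign
    return vector

vec = hash_feature("kr")
print(vec)
```

The vector length is fixed in advance, which is why hashing suits high-cardinality columns: new categories never add columns, at the cost of occasional collisions.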

## One-hot encoding for nominal features

In [None]:
train = pd.get_dummies(train, columns=['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4','ord_3', 'ord_4'],drop_first=True, sparse=True)

In [None]:
train.shape

In [None]:
test = pd.get_dummies(test, columns=['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4','ord_3', 'ord_4'],drop_first=True, sparse=True)

In [None]:
test.shape

## Target encoder
Target-based encoding converts a categorical variable to a number by using the target. Each categorical variable is replaced with a single numerical variable, in which every category is mapped to the probability of the target for that category (if the target is categorical) or to the average of the target (if it is numerical). The main drawbacks of this method are its dependence on the distribution of the target and its lower predictive power compared with the binary encoding method.
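
As a rough illustration of the idea, the sketch below computes a smoothed target encoding by hand on toy data. It uses simple additive smoothing toward the global target rate, which is one common variant and not necessarily the formula that `category_encoders` applies; the function name, the parameter `m`, and the data are all hypothetical.

```python
from collections import defaultdict

def smoothed_target_encoding(categories, targets, m=0.5):
    """Map each category to a smoothed mean of a binary target."""
    prior = sum(targets) / len(targets)  # global target rate
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1
    # Shrink each category mean toward the prior; rare categories shrink the most
    return {cat: (sums[cat] + m * prior) / (counts[cat] + m) for cat in counts}

# Hypothetical toy data: one nominal column and a binary target
cats = ["a", "a", "a", "b", "b", "c"]
y = [1, 1, 0, 0, 0, 1]

encoding = smoothed_target_encoding(cats, y, m=0.5)
encoded_column = [encoding[c] for c in cats]
print(encoding)
```

Category `"c"` appears once with target 1, yet its encoded value stays below 1 because the single observation is pulled toward the global rate.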

In [None]:
cols_ = ['nom_5','nom_6','nom_7','nom_8','nom_9']
ce_target_encoder = ce.TargetEncoder(cols = cols_, smoothing=0.50)
ce_target_encoder.fit(train[cols_], train['target'])
train_nom = ce_target_encoder.transform(train[cols_])
train.drop(['nom_5','nom_6','nom_7','nom_8','nom_9'],axis=1,inplace=True)
train = pd.concat([train, train_nom], axis=1)
train

In [None]:
# Reuse the encoder fitted on the raw training columns in the previous cell.
# Refitting here would fit on train columns that are already target-encoded (numeric),
# so every test category would be treated as unseen.
test_nom = ce_target_encoder.transform(test[cols_])
test_nom

In [None]:
test.drop(['nom_5','nom_6','nom_7','nom_8','nom_9'],axis=1,inplace=True)
test = pd.concat([test, test_nom], axis=1)
test

## Label-encoder

In [None]:
train.head()

In [None]:
# Category variables -> Numerical variables
list_feat=['bin_3','bin_4','ord_1','ord_2']

In [None]:
for feature in list_feat:
    # pandas sorts the categories alphabetically; each one gets its position as an integer code
    labels = train[feature].astype('category').cat.categories.tolist()
    replace_map_comp = {feature : {label: code for code, label in enumerate(labels)}}

    train.replace(replace_map_comp, inplace=True)

In [None]:
list_feat=['bin_3','bin_4','ord_1','ord_2']

In [None]:
for feature in list_feat:
    # Codes match the training set only if test contains the same set of categories
    labels = test[feature].astype('category').cat.categories.tolist()
    replace_map_comp = {feature : {label: code for code, label in enumerate(labels)}}

    test.replace(replace_map_comp, inplace=True)
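
One caveat with the loops above: the codes follow alphabetical order, which may differ from the natural order of an ordinal feature. The sketch below contrasts the two on a made-up ladder of levels. The level names and their order are assumptions for illustration only and should be checked against the actual column values before use.

```python
# Assumed level order, lowest to highest (hypothetical; verify against the data)
assumed_order = ["Novice", "Contributor", "Expert", "Master", "Grandmaster"]

# Explicit mapping that preserves the intended order
ordinal_map = {level: rank for rank, level in enumerate(assumed_order)}
# Mapping obtained when the levels are simply sorted alphabetically
alphabetical_map = {level: rank for rank, level in enumerate(sorted(assumed_order))}

values = ["Expert", "Novice", "Grandmaster"]
ordinal_codes = [ordinal_map[v] for v in values]
alphabetical_codes = [alphabetical_map[v] for v in values]

print(ordinal_codes)       # codes increase with the assumed rank
print(alphabetical_codes)  # codes follow the spelling, not the rank
```

Building one explicit map and applying it to both train and test also guarantees that the two sets share the same codes.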

## Handling cyclical features: Day, month

In [None]:
# Day
train['day_sin'] = np.sin(2 * np.pi * train['day']/7)
train['day_cos'] = np.cos(2 * np.pi * train['day']/7)
# Month
train['month_sin'] = np.sin(2 * np.pi * train['month']/12)
train['month_cos'] = np.cos(2 * np.pi * train['month']/12)

In [None]:
# Day
test['day_sin'] = np.sin(2 * np.pi * test['day']/7)
test['day_cos'] = np.cos(2 * np.pi * test['day']/7)
# Month
test['month_sin'] = np.sin(2 * np.pi * test['month']/12)
test['month_cos'] = np.cos(2 * np.pi * test['month']/12)
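
The sine and cosine pair places each value on a circle, so the last day of the week ends up next to the first one. The short sketch below checks this numerically with the standard library, assuming days are coded 1 to 7; `encode_cyclical` and `distance` are hypothetical helper names.

```python
import math

def encode_cyclical(value, period):
    """Return the (sin, cos) coordinates of a value on a cycle of the given period."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

day1, day2, day4, day7 = (encode_cyclical(d, 7) for d in (1, 2, 4, 7))

gap_consecutive = distance(day1, day2)  # day 1 -> day 2
gap_wraparound = distance(day7, day1)   # day 7 -> day 1, across the week boundary
gap_midweek = distance(day1, day4)      # day 1 -> day 4

print(gap_consecutive, gap_wraparound, gap_midweek)
```

With the raw integer codes, day 7 and day 1 are six units apart; after the transformation they are as close as any other pair of consecutive days.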

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.drop(['day','month'],axis=1,inplace=True)

In [None]:
test.drop(['day','month'],axis=1,inplace=True)

In [None]:
train.head()

In [None]:
# train_target = train['target']
# train_target
# train.drop('target',axis=1,inplace=True)
# train = pd.concat([train, train_target], axis=1)
# train

In [None]:
test.head()

In [None]:
# checking the imbalance
sb.countplot(x='target',data=train,palette='RdBu_r') # Barplot for the dependent variable

In [None]:
train['target'].value_counts()

# Oversampling
A widely adopted technique for dealing with highly unbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and / or adding more examples from the minority class (over-sampling).
Despite the advantage of balancing classes, these techniques also have their weaknesses (there is no free lunch). The simplest implementation of over-sampling is to duplicate random records from the minority class, which can cause overfitting. In under-sampling, the simplest technique involves removing random records from the majority class, which can cause loss of information.
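
One practical consequence of duplicating records: if the data is oversampled before it is split, copies of the same minority row can land in both the training and the validation part, which can make validation scores look better than they are. The sketch below shows an alternative ordering, which differs from the cells in this notebook: hold out a validation set first, then oversample only the training part. The function name and the toy data are hypothetical.

```python
import random

def split_then_oversample(rows, test_fraction=0.3, seed=123):
    """Hold out a validation set first, then balance only the training part.

    Each row is a (row_id, label) pair; label 1 is assumed to be the minority class.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * test_fraction)
    holdout, train_rows = shuffled[:n_holdout], shuffled[n_holdout:]

    majority = [r for r in train_rows if r[1] == 0]
    minority = [r for r in train_rows if r[1] == 1]
    # Draw minority rows with replacement until both classes have the same size
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra, holdout

# Hypothetical toy data: 100 rows, roughly one third in the minority class
rows = [(i, 1 if i % 3 == 0 else 0) for i in range(100)]
balanced_train, holdout = split_then_oversample(rows)
print(len(balanced_train), len(holdout))
```

The held-out rows keep the original class ratio, so metrics computed on them reflect the distribution the model will actually face.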

In [None]:
# Separate the majority class (target == 0) from the minority class (target == 1)
df_majority = train[train['target']==0]
df_minority = train[train['target']==1]

In [None]:
# Oversample the minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority),    # match the size of the majority class
                                 random_state=123) # reproducible results

In [None]:
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

In [None]:
sb.countplot(x='target',data=df_upsampled,palette='RdBu_r')

In [None]:
# Display new class counts
df_upsampled['target'].value_counts()

As you can see, the new DataFrame has more observations than the original, and the ratio of the two classes is now 1:1.

In [None]:
# dataset=df_upsampled._get_values

In [None]:
# Separate input features (X) and target variable (y)
y = df_upsampled.target
X = df_upsampled.drop('target', axis=1)

# Feature Importance

We can get the importance of each feature in our dataset from the `feature_importances_` attribute of a fitted model. It gives a score for each feature: the higher the score, the more relevant the feature is to the output variable. Tree-based classifiers expose this attribute out of the box; we will use an Extra Trees classifier to extract the top features of the dataset.<br>

There are four common feature selection techniques: univariate selection, recursive feature elimination, principal component analysis, and feature importance. Here we use feature importance, computed with an Extra Trees classifier and a Random Forest classifier (the XGBClassifier variant is commented out).

In [None]:
train.shape

In [None]:
# # Build a forest and compute the feature importances
# model1 = ExtraTreesClassifier(n_estimators=250,
#                               random_state=0)

# model1.fit(dataset_train,dataset_label)
# importances = model1.feature_importances_
# std = np.std([tree.feature_importances_ for tree in model1.estimators_],
#              axis=0)
# indices = np.argsort(importances)[::-1]

In [None]:
# Unbalanced dataset
# X = train.iloc[:,np.r_[:,0:8,9:76]]  #independent columns
# y = train.iloc[:,np.r_[:,8]]    #target column
# Balanced dataset
y = df_upsampled.target
X = df_upsampled.drop('target', axis=1)
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model1 = ExtraTreesClassifier()
model1.fit(X,y)
print(model1.feature_importances_) # feature_importances_ is an attribute of fitted tree-based classifiers
# plot the 20 largest feature importances for easier comparison
feat_importances = pd.Series(model1.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')
plt.show()

# Correlation Matrix with Heatmap

Correlation describes how the features are related to each other or to the target variable. Correlation can be positive (an increase in the feature value increases the value of the target variable) or negative (an increase in the feature value decreases the value of the target variable). A heatmap makes it easy to identify which features are most related to the target variable; we will plot a heatmap of the correlations using the seaborn library.

In [None]:
# X = dataset[:,np.r_[:,0:8,9:76]]   #independent columns
# y = dataset[:,np.r_[:,8]]    #target column
y = df_upsampled.target
X = df_upsampled.drop('target', axis=1)
# compute the pairwise correlations between all columns of the training set
corrmat = train.corr()
plt.figure(figsize=(20,20))
# plot the heat map of the correlation matrix
g=sb.heatmap(corrmat,annot=True,cmap="RdYlGn")

In [None]:
# from xgboost import XGBClassifier
# from xgboost import plot_importance

# # X = dataset[:,np.r_[:,0:8,9:76]]   #independent columns
# # y = dataset[:,np.r_[:,8]]    #target column
# y = df_upsampled.target
# X = df_upsampled.drop('target', axis=1)
# # fit model no training data
# model2 = XGBClassifier()
# model2.fit(X,y)
# # feature importance
# print(model2.feature_importances_)
# # plot feature importance

# plt.figure(figsize=(3,6))
# plot_importance(model2,max_num_features=20)
# plt.show()

In [None]:
# from numpy import sort
# from xgboost import XGBClassifier
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import accuracy_score
# from sklearn.feature_selection import SelectFromModel

# # X = train.iloc[:,np.r_[:,0:8,9:76]]  #independent columns
# # Y = train.iloc[:,np.r_[:,8]]    #target column
# y = df_upsampled.target
# X = df_upsampled.drop('target', axis=1)

# # split data into train and test sets
# X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# # fit model on all training data
# model = XGBClassifier()
# model.fit(X_train, y_train)
# # make predictions for test data and evaluate
# y_pred = model.predict(X_test)
# predictions = [round(value) for value in y_pred]
# accuracy = accuracy_score(y_test, predictions)
# print("Accuracy: %.2f%%" % (accuracy * 100.0))
# # Fit model using each importance as a threshold
# thresholds = sort(model.feature_importances_)
# for thresh in thresholds:
#     # select features using threshold
#     selection = SelectFromModel(model, threshold=thresh, prefit=True)
#     select_X_train = selection.transform(X_train)
#     # train model
#     selection_model = XGBClassifier()
#     selection_model.fit(select_X_train, y_train)
#     # eval model
#     select_X_test = selection.transform(X_test)
#     y_pred = selection_model.predict(select_X_test)
#     predictions = [round(value) for value in y_pred]
#     accuracy = accuracy_score(y_test, predictions)
#     print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

In [None]:
## Import the random forest model.
from sklearn.ensemble import RandomForestClassifier 
## This line instantiates the model. 
model3 = RandomForestClassifier() 
## Fit the model on your training data.
model3.fit(X, y)

In [None]:
feature_importances = pd.DataFrame(model3.feature_importances_,
                                   index = X.columns,
                                    columns=['importance']).sort_values('importance',ascending=False)
feature_importances

In [None]:
(pd.Series(model3.feature_importances_, index=X.columns).nlargest(20).plot(kind='barh'))

In [None]:
# display the relative importance of each attribute
output1=model1.feature_importances_

In [None]:
# output2=model2.feature_importances_

In [None]:
output3=model3.feature_importances_

In [None]:
output = output1 + output3 #  + output2

In [None]:
n=18
important_features=np.argsort(output)[::-1][:n]

In [None]:
important_features

In [None]:
training_data = X.iloc[:,important_features]
training_label = y

In [None]:
testing_data=test[training_data.columns] # select the same features by name rather than by position

# Methodology

In [None]:
# train.shape

In [None]:
# training_data = train.iloc[:,np.r_[:,0:8,9:76]]  #independent columns
# training_label = train.iloc[:,np.r_[:,8]]   #target column
training_label = df_upsampled.target
training_data = df_upsampled.drop('target', axis=1)

# 1. Building a Logistic Regression model
Let's build our model using LogisticRegression from the scikit-learn package. This class implements logistic regression and can use different numerical optimizers to find the parameters, including the ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, and ‘saga’ solvers. The scikit-learn documentation describes the pros and cons of each solver.
The scikit-learn version of logistic regression supports regularization, a technique used to reduce overfitting in machine learning models. The C parameter is the inverse of the regularization strength and must be a positive float; smaller values specify stronger regularization. We fit the model after splitting the data into a training set and a test set.

## Train Test Split
Let's start by splitting our data into a training set and a test set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(training_data,training_label,test_size=0.33,random_state=101)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

## Training and Predicting

In [None]:
logmodel = LogisticRegression(C=0.01, solver='liblinear')
logmodel.fit(X_train,y_train)

In [None]:
predictions = logmodel.predict(X_test)

## Evaluation
### confusion matrix
Another way of looking at the accuracy of a classifier is to look at its __confusion matrix__.

In [None]:
print("Accuracy is", accuracy_score(y_test,predictions)*100)

In [None]:
cm1 = confusion_matrix(y_test,predictions)

In [None]:
print(cm1)

In [None]:
print(classification_report(y_test,predictions))

Based on the count of each section, we can calculate precision and recall of each label:


- __Precision__ is a measure of the accuracy provided that a class label has been predicted. It is defined by: precision = TP / (TP + FP)

- __Recall__ is true positive rate. It is defined as: Recall =  TP / (TP + FN)

    
So, we can calculate precision and recall of each class.

__F1 score:__
Now we are in the position to calculate the F1 scores for each label based on the precision and recall of that label. 

The F1 score is the harmonic mean of the precision and recall. It reaches its best value at 1 (perfect precision and recall) and its worst at 0. It is a good way to show that a classifier has a good value for both recall and precision.


Finally, the macro-averaged F1 score for this classifier is the mean of the F1 scores of both labels, which is 0.72 in our case. Note that this is a different quantity from the accuracy, although the two can be close on balanced data.
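
The definitions above can be checked by hand from the four cells of a confusion matrix. The sketch below does so with made-up counts, laid out as [[TN, FP], [FN, TP]] like the matrices plotted in this notebook; the numbers are hypothetical and are not results from this dataset.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 score for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, not taken from the models in this notebook
tn, fp, fn, tp = 50, 10, 20, 20

# Positive class (label 1)
p_pos, r_pos, f1_pos = precision_recall_f1(tp, fp, fn)
# Negative class (label 0): the roles swap, so TN acts as the "true positive" count
p_neg, r_neg, f1_neg = precision_recall_f1(tn, fn, fp)

macro_f1 = (f1_pos + f1_neg) / 2
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(f1_pos, 3), round(f1_neg, 3), round(macro_f1, 3), accuracy)
```

With these counts the accuracy and the macro-averaged F1 score differ, which is why the two should be reported separately.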

In [None]:
plt.clf()
plt.imshow(cm1, interpolation='nearest', cmap=plt.cm.Wistia)
classNames = ['Negative','Positive']
plt.title('Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
tick_marks = np.arange(len(classNames))
plt.xticks(tick_marks, classNames, rotation=45)
plt.yticks(tick_marks, classNames)
s = [['TN','FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        plt.text(j,i, str(s[i][j])+" = "+str(cm1[i][j]))
plt.show()

## 2. Building a K Nearest Neighbors model
It runs very slowly on this dataset, so I didn't use it. The cells below are kept commented out for reference.

### Standardize the Variables

In [None]:
# scaler = StandardScaler()

In [None]:
# scaler.fit(training_data)

In [None]:
# scaled_features = scaler.transform(training_data)

In [None]:
# scaled_features

### Train Test Split

In [None]:
# X_train, X_test, y_train, y_test = train_test_split(scaled_features,training_label,test_size=0.30)
# print ('Train set:', X_train.shape,  y_train.shape)
# print ('Test set:', X_test.shape,  y_test.shape)

### Choosing a K Value
Use the elbow method.

In [None]:
# error_rate = []

# # Will take some time
# for i in range(1,20):
    
#     knn = KNeighborsClassifier(n_neighbors=i)
#     knn.fit(X_train,y_train)
#     pred_i = knn.predict(X_test)
#     error_rate.append(np.mean(pred_i != y_test))

### Using KNN

In [None]:
# knn = KNeighborsClassifier(n_neighbors=1) # n_neighbors = k

In [None]:
# knn.fit(X_train,y_train)

In [None]:
# predictions = knn.predict(X_test)

### Predictions and Evaluations

In [None]:
# from sklearn import metrics
# print("Train set Accuracy: ", metrics.accuracy_score(y_train, knn.predict(X_train)))
# print("Test set Accuracy: ", metrics.accuracy_score(y_test, predictions))

In [None]:
# print("Accuracy is", accuracy_score(y_test,predictions)*100)

In [None]:
# cm2 = confusion_matrix(y_test,predictions)

In [None]:
# print(cm2)

In [None]:
# print(classification_report(y_test,predictions))

In [None]:
# plt.clf()
# plt.imshow(cm2, interpolation='nearest', cmap=plt.cm.Wistia)
# classNames = ['Negative','Positive']
# plt.title('Confusion Matrix')
# plt.ylabel('True label')
# plt.xlabel('Predicted label')
# tick_marks = np.arange(len(classNames))
# plt.xticks(tick_marks, classNames, rotation=45)
# plt.yticks(tick_marks, classNames)
# s = [['TN','FP'], ['FN', 'TP']]
# for i in range(2):
#     for j in range(2):
#         plt.text(j,i, str(s[i][j])+" = "+str(cm2[i][j]))
# plt.show()

## 3. Building the Decision Tree
We'll start just by training a single decision tree.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(training_data,training_label,test_size=0.3,random_state=101)

In [None]:
dtree = DecisionTreeClassifier(criterion='entropy')

In [None]:
dtree.fit(X_train,y_train)

In [None]:
predictions = dtree.predict(X_test)

## Prediction and Evaluation
Let's evaluate our decision tree.

In [None]:
print("Accuracy is", accuracy_score(y_test,predictions)*100)

In [None]:
print(classification_report(y_test,predictions))

In [None]:
cm3 = confusion_matrix(y_test,predictions)

In [None]:
print(cm3)

In [None]:
plt.clf()
plt.imshow(cm3, interpolation='nearest', cmap=plt.cm.Wistia)
classNames = ['Negative','Positive']
plt.title('Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
tick_marks = np.arange(len(classNames))
plt.xticks(tick_marks, classNames, rotation=45)
plt.yticks(tick_marks, classNames)
s = [['TN','FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        plt.text(j,i, str(s[i][j])+" = "+str(cm3[i][j]))
plt.show()

## 4. Building the Random Forests model

In [None]:
rfc = RandomForestClassifier(n_estimators=170)
rfc.fit(X_train,y_train)

In [None]:
rfc_pred = rfc.predict(X_test)

In [None]:
print("Accuracy is", accuracy_score(y_test,rfc_pred)*100)

In [None]:
print(classification_report(y_test,rfc_pred))

In [None]:
cm4 = confusion_matrix(y_test,rfc_pred)

In [None]:
print(cm4)

In [None]:
plt.clf()
plt.imshow(cm4, interpolation='nearest', cmap=plt.cm.Wistia)
classNames = ['Negative','Positive']
plt.title('Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
tick_marks = np.arange(len(classNames))
plt.xticks(tick_marks, classNames, rotation=45)
plt.yticks(tick_marks, classNames)
s = [['TN','FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        plt.text(j,i, str(s[i][j])+" = "+str(cm4[i][j]))
plt.show()

## 5. Building the Support Vector Machines model
It runs very slowly on this dataset, so I didn't use it. The cells below are kept commented out for reference.

### Train Test Split

In [None]:
# X_train, X_test, y_train, y_test = train_test_split(training_data, training_label, test_size=0.30, random_state=101)
# print ('Train set:', X_train.shape,  y_train.shape)
# print ('Test set:', X_test.shape,  y_test.shape)

In [None]:
# model = SVC()

The SVM algorithm offers a choice of kernel functions. Mapping data into a higher-dimensional space is called kernelling, and the mathematical function used for the transformation is known as the kernel function. It can be of different types, such as:

1. Linear
2. Polynomial
3. Radial basis function (RBF)
4. Sigmoid

Each of these functions has its own characteristics, pros and cons, and equation. As there's no easy way of knowing which function performs best on a given dataset, we usually try different functions in turn and compare the results. Let's just use the default, RBF (Radial Basis Function), here.

In [None]:
# model.fit(X_train,y_train) # uses the default RBF kernel; C (default 1.0) must be strictly positive

### Predictions and Evaluations
Now let's predict using the trained model.

In [None]:
# predictions = model.predict(X_test)

In [None]:
# cm5 = confusion_matrix(y_test,predictions)

In [None]:
# print("Accuracy is", accuracy_score(y_test,predictions)*100)

In [None]:
# print(cm5)

In [None]:
# print(classification_report(y_test,predictions))

In [None]:
# plt.clf()
# plt.imshow(cm5, interpolation='nearest', cmap=plt.cm.Wistia)
# classNames = ['Negative','Positive']
# plt.title('Confusion Matrix')
# plt.ylabel('True label')
# plt.xlabel('Predicted label')
# tick_marks = np.arange(len(classNames))
# plt.xticks(tick_marks, classNames, rotation=45)
# plt.yticks(tick_marks, classNames)
# s = [['TN','FP'], ['FN', 'TP']]
# for i in range(2):
#     for j in range(2):
#         plt.text(j,i, str(s[i][j])+" = "+str(cm5[i][j]))
# plt.show()

# Submission 

In [None]:
submission = pd.read_csv('../input/cat-in-the-dat/sample_submission.csv')

In [None]:
final_prediction=rfc.predict(test)

In [None]:
submission["target"] = rfc.predict_proba(test)[:, 1]

In [None]:
# submission["target"] =final_prediction

In [None]:
submission.head()

In [None]:
submission.to_csv('submission.csv', index=False)