## Don't Overfit! EDA + SVM + KNN + XGB

### We use EDA and three classification algorithms

#### Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is usually an early step in a machine learning project. It is a way of visualizing, summarizing, and interpreting the information that is hidden in a table of rows and columns.

<img src="https://csml.princeton.edu/sites/csml/files/events/data-analysis-blog.jpg" width="800px">
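A minimal sketch of the kind of summaries EDA starts with, assuming a small synthetic table in place of `train.csv` (the column names, sizes, and injected missing value are illustrative only):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv: 250 rows, a binary target, 5 continuous columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(250, 5)), columns=[str(i) for i in range(5)])
df.insert(0, "target", rng.integers(0, 2, size=250))
df.loc[3, "2"] = np.nan  # inject one missing value so the check below finds something

print(df.shape)                                         # number of rows and columns
print(df["target"].value_counts())                      # class balance
print(df.isna().sum())                                  # missing values per column
print(df.describe().T[["mean", "std", "min", "max"]])   # per-column summary statistics
```

The same four calls work unchanged on a real DataFrame loaded with `pd.read_csv`.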




#### 1- Support Vector Machines

Support Vector Machines (SVMs), also known as support vector networks, are a family of powerful kernel-based models that can be used for classification and regression problems. They aim to find decision boundaries that separate observations with differing class memberships. In other words, an SVM is a discriminative classifier formally defined by a separating hyperplane.

<img src="https://d2o2utebsixu4k.cloudfront.net/media/images/f64026c1-4f3d-42f7-98b9-0ee5fe46ef92.jpg" width="800px">
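To make the "separating hyperplane" concrete, here is a small sketch on synthetic 2-D data (the clusters and the value `C=1.0` are assumptions for illustration, not the settings used later in this notebook). For a linear kernel, scikit-learn exposes the hyperplane as `coef_` and `intercept_`:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated 2-D clusters (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(20, 2)),
               rng.normal(+2.0, 0.5, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# For a linear kernel the hyperplane is w . x + b = 0.
w, b = clf.coef_[0], clf.intercept_[0]
scores = X @ w + b  # signed score of each point relative to the hyperplane

# Points with a positive score are assigned class 1, the rest class 0.
manual_pred = (scores > 0).astype(int)
print("w =", w, "b =", b)
print("matches clf.predict:", np.array_equal(manual_pred, clf.predict(X)))
print("number of support vectors:", len(clf.support_vectors_))
```

Only the support vectors, the points closest to the boundary, determine where the hyperplane sits.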

#### 2- XGBoost

XGBoost is a widely used implementation of gradient-boosted decision trees that handles both classification and regression problems. It is known for strong performance on tabular data compared with many other machine learning algorithms.

<img src="https://www.analyticssteps.com/backend/media/thumbnail/3327098/5525447_1593423035_XG.jpg" width="800px">
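The idea behind boosting is that trees are added one at a time, each correcting the errors of the ensemble so far. The sketch below illustrates this with scikit-learn's `GradientBoostingClassifier` as a stand-in from the same family; it is not XGBoost itself, and the synthetic data and parameter values are assumptions chosen only to keep the example small:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                 max_depth=2, random_state=0)
gbm.fit(X, y)

# staged_predict_proba yields the ensemble's predictions after 1, 2, ..., 50 trees.
train_loss = [log_loss(y, proba) for proba in gbm.staged_predict_proba(X)]
print("training log loss after 1 tree:  ", round(train_loss[0], 4))
print("training log loss after 50 trees:", round(train_loss[-1], 4))
```

The training loss keeps falling as trees are added, which is exactly why boosted models need care on a 250-row training set: a lower training loss does not by itself mean better predictions on unseen rows.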

#### 3- K-Nearest Neighbor

The K-Nearest Neighbor (KNN) classifier is one of the introductory supervised classifiers that every data science learner should know. The method was introduced by Fix & Hodges in 1951 for pattern classification tasks. It assigns a new sample the majority class among its k nearest training samples, which is where the name comes from.

<img src="https://www.analyticssteps.com/backend/media/thumbnail/8694049/8459793_1587615148_KNN.jpg" width="800px">
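The voting rule is short enough to write out by hand. This is a minimal from-scratch sketch using Euclidean distance; `knn_predict` and the toy points are hypothetical and only illustrate the idea, while the notebook itself uses scikit-learn's `KNeighborsClassifier`:

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict a label for each query row by majority vote among its k nearest training rows."""
    preds = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)  # Euclidean distance to every training row
        nearest = np.argsort(dists)[:k]              # indices of the k closest rows
        votes = np.bincount(y_train[nearest])        # count the labels among the neighbours
        preds.append(int(np.argmax(votes)))          # most common label (ties go to the smaller label)
    return np.array(preds)

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.05, 0.05], [1.0, 0.95]])

print(knn_predict(X_train, y_train, X_query, k=3))  # → [0 1]
```

A larger `k` smooths the decision boundary, which is one reason a fairly large neighbourhood can be a sensible choice on a small, noisy training set.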




### What am I predicting?
You are predicting the binary target associated with each row, without overfitting to the minimal set of training examples provided.


<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Overfitting.svg/1200px-Overfitting.svg.png" width="500px">
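Why overfitting is the central risk here can be shown with pure noise. The sketch below assumes random features and random labels with the same shape as this competition's training set (250 rows, 300 features); a plain least-squares linear fit then reproduces every training label while doing no better than chance on fresh rows. The data and the choice of least squares are illustrative assumptions, not part of the competition:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features = 250, 2000, 300

# Pure noise: the labels are random and unrelated to the features.
X_tr = rng.normal(size=(n_train, n_features))
y_tr = rng.choice([-1.0, 1.0], size=n_train)
X_te = rng.normal(size=(n_test, n_features))
y_te = rng.choice([-1.0, 1.0], size=n_test)

# Least-squares linear fit; with more features than rows it can match every training label.
w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

train_acc = np.mean(np.sign(X_tr @ w) == y_tr)
test_acc = np.mean(np.sign(X_te @ w) == y_te)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A perfect training score therefore says nothing on its own when features outnumber rows, so cross-validated scores and regularized models matter more than the training fit.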


#### Files
* train.csv - the training set. 250 rows.
* test.csv - the test set. 19,750 rows.
* sample_submission.csv - a sample submission file in the correct format


#### Columns
* id - sample id.
* target - a binary target of mysterious origin.
* 0-299 - continuous variables.

#### Dataset Link

[Here](https://www.kaggle.com/c/dont-overfit-ii/data)

In [None]:
!pip install dataprep

In [None]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from dataprep.eda import plot, plot_correlation, plot_missing

sns.set()
warnings.filterwarnings("ignore")

In [None]:
train = pd.read_csv('../input/dont-overfit-ii/train.csv')
test = pd.read_csv('../input/dont-overfit-ii/test.csv')
submission = pd.read_csv('../input/dont-overfit-ii/sample_submission.csv')

In [None]:
print(train.head())
print('_'*80)
print(test.head())
print('_'*80)
print(submission.head())

In [None]:
print(train.info())
print('_'*50)
print(test.info())
print('_'*50)
print(submission.info())

In [None]:
print(train.describe().T)
print('_'*50)
print(test.describe().T)
print('_'*50)
print(submission.describe().T)

In [None]:
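# Total number of missing values in each dataset
# (printing the full boolean frame is too large to read)
print(train.isna().sum().sum())
print('_'*50)
print(test.isna().sum().sum())
print('_'*50)
print(submission.isna().sum().sum())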

In [None]:
#sns.pairplot(train, height=3.5, aspect=1.3)

In [None]:
# plots the distribution of each column and calculates dataset statistics
plot(train)

In [None]:
plot(test)

In [None]:
sns.heatmap(train.corr())

In [None]:
sns.heatmap(test.corr())

In [None]:
#import pandas_profiling as pp
#pp.ProfileReport(train)

In [None]:
#import pandas_profiling as pp
#pp.ProfileReport(test)

In [None]:
X_train = train.drop(['id', 'target'], axis=1).values
y_train = train['target'].values

X_test = test.drop(['id'], axis=1).values

In [None]:
from sklearn.preprocessing import QuantileTransformer
scaler = QuantileTransformer()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
X_train.shape, y_train.shape, X_test.shape

In [None]:
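from sklearn.svm import SVC
# Linear-kernel SVM; probability=True enables predict_proba for the submission.
# Note: gamma is ignored by the linear kernel, and max_iter=100 may stop the
# solver before it has converged.
svm = SVC(C=100, kernel='linear', max_iter=100, gamma='auto', probability=True, random_state=0)
svm.fit(X_train, y_train)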

In [None]:
from sklearn.model_selection import cross_val_score

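# 20-fold cross-validated ROC AUC; report the mean and spread rather than
# the best fold, which would overstate performance.
score = cross_val_score(svm, X_train, y_train, cv=20, scoring='roc_auc')
print(score)
print('-' * 60)
print('mean: {:.4f}, std: {:.4f}'.format(score.mean(), score.std()))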

In [None]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Baseline model with default parameters; this is the model used for predictions later.
xgb_model = xgb.XGBClassifier().fit(X_train, y_train)

# Grid search over XGBoost parameters. Every list below holds a single value,
# so only one configuration is evaluated; add values to scan more.
# Tuning notes:
# - max_depth is usually 6, 7, or 8
# - learning_rate is around 0.05, but small changes may make a big difference
# - min_child_weight, subsample, and colsample_bytree help fight overfitting
# - n_estimators is the number of boosting rounds
# - ensembling XGBoost over multiple seeds may reduce variance
parameters = {'n_jobs': [4],  # with hyperthreading, XGBoost may become slower
              'objective': ['binary:logistic'],
              'learning_rate': [0.05],  # the so-called `eta` value
              'max_depth': [6],
              'min_child_weight': [11],
              'verbosity': [0],  # replaces the deprecated `silent` flag
              'subsample': [0.8],
              'colsample_bytree': [0.7],
              'n_estimators': [5],  # number of trees; change it to 1000 for better results
              'missing': [-999],
              'random_state': [1337]}

# With no `scoring` argument, GridSearchCV scores classifiers by accuracy.
grid_search = GridSearchCV(xgb_model, param_grid=parameters, n_jobs=5)
grid_search.fit(X_train, y_train)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

In [None]:
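from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=30)
knn.fit(X_train, y_train)
y_pred_KNN = knn.predict(X_test)  # hard class labels; probabilities are used for the submission
# 20-fold cross-validated ROC AUC; report the mean and spread rather than the best fold.
score = cross_val_score(knn, X_train, y_train, cv=20, scoring='roc_auc')
print(score)
print('-' * 60)
print('mean: {:.4f}, std: {:.4f}'.format(score.mean(), score.std()))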

In [None]:
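# Probability of the positive class from each model
svm_pred = svm.predict_proba(X_test)[:, 1]
# Note: xgb_model is the default-parameter model, not grid_search.best_estimator_
xgb_pred = xgb_model.predict_proba(X_test)[:, 1]
knn_pred = knn.predict_proba(X_test)[:, 1]
# Simple ensemble: unweighted average of the three probabilities
av_pred = (svm_pred + xgb_pred + knn_pred) / 3
av_pred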

In [None]:
submission['target'] = svm_pred
submission.to_csv('svm_submission.csv', index=False)

In [None]:
submission['target'] = xgb_pred
submission.to_csv('xgb_submission.csv', index=False)

In [None]:
submission['target'] = knn_pred
submission.to_csv('knn_submission.csv', index=False)

In [None]:
submission['target'] = av_pred
submission.to_csv('submission.csv', index=False)

In [None]:
submission.head()