## SPAM detection task
The data contains 100 features extracted from a corpus of emails. Some of the emails are spam and some are normal. The task is to make a spam detector. 
- train.csv - contains 600 emails x 100 features for training the model(s)
- train_labels.csv - contains labels for the 600 training emails (1 = spam, 0 = normal)
- test.csv - contains 4000 emails x 100 features; the spam among them has to be detected

Predictions can be continuous numbers or 0/1 labels. No header is necessary. Submissions are judged on area under the ROC curve. 
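
Because ROC AUC depends only on how the predictions rank the emails, continuous scores can retain information that hard 0/1 labels lose. A minimal sketch on made-up numbers (the arrays below are illustrative, not taken from the competition data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth and scores, invented for illustration only
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.45, 0.8])   # continuous predictions
labels = (scores >= 0.5).astype(int)        # the same predictions thresholded to 0/1

auc_scores = roc_auc_score(y_true, scores)  # every spam email outranks every normal one
auc_labels = roc_auc_score(y_true, labels)  # thresholding creates ties, which lowers the AUC

print(auc_scores, auc_labels)
```

Here the continuous scores rank both spam emails above both normal ones, while thresholding at 0.5 ties one spam email with the normal ones.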

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import scipy.optimize as sp
import xgboost as xgb

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn import linear_model, model_selection, metrics, tree, ensemble 

In [None]:
# Read the data. The files have no header row, so header=None keeps the first row as data.
data = pd.read_csv('../input/just-the-basics-the-after-party/train.csv', header=None)
dataT = pd.read_csv('../input/just-the-basics-the-after-party/test.csv', header=None)
y = pd.read_csv('../input/just-the-basics-the-after-party/train_labels.csv', header=None)
data.head()

In [None]:
# The dataset has no header row, so name the columns 0..99 for easier indexing.
columns = list(range(100))
data.columns = columns
dataT.columns = columns
data.info()

In [None]:
# Fill the missing values in each column with that column's median.
# Assigning the result back avoids relying on in-place changes to a column selection.
data = data.fillna(data.median())
dataT = dataT.fillna(dataT.median())
data.info()
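
The cell above fills each frame with its own medians. A common variation, sketched here as an assumption rather than as this notebook's method, is to compute the medians on the training data only and reuse them for the test data, so that both sets are imputed with the same values. The toy frames are hypothetical stand-ins for `data` and `dataT`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the train and test features
toy_train = pd.DataFrame({0: [1.0, np.nan, 3.0], 1: [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({0: [np.nan, 5.0], 1: [np.nan, 40.0]})

# Per-column medians of the training data (NaN values are skipped)
train_medians = toy_train.median()

# fillna with a Series fills each column with the value at the matching label
toy_train_filled = toy_train.fillna(train_medians)
toy_test_filled = toy_test.fillna(train_medians)  # reuse the training medians

print(toy_train_filled)
print(toy_test_filled)
```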

In [None]:
# Flatten y into the 1-D array that scikit-learn expects

y_train = np.ravel(y)
print(y.shape,type(y), y_train.shape, type(y_train))

# No missing values remain after imputation; outliers are left in place (this choice still needs justification)
X_train = data
X_test = dataT
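
The reshaping step can be checked on a small example. This sketch uses a hypothetical one-column frame in place of `y` to show that `np.ravel` turns an `(n, 1)` DataFrame into an `(n,)` NumPy array:

```python
import numpy as np
import pandas as pd

# Hypothetical one-column label frame standing in for y
labels_df = pd.DataFrame({"label": [1, 0, 0, 1]})

# np.ravel converts the frame to an array and flattens it to one dimension
labels_1d = np.ravel(labels_df)

print(labels_df.shape, labels_1d.shape)
```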

## Modeling
Hyperparameters are tuned with GridSearchCV. Models are scored by area under the ROC curve ('roc_auc').

### LogisticRegression

In [None]:
# Use the L1 (lasso) penalty and tune C, the inverse of the regularization strength
param_grid = {'C': [0.01, 0.05, 0.1, 0.5, 1, 5, 10]}

estimator = linear_model.LogisticRegression(solver='liblinear', penalty = 'l1', random_state = 1)
optimizerL = GridSearchCV(estimator, param_grid, scoring = 'roc_auc',cv = 3)                    
optimizerL.fit(X_train, y_train)

print('score_train_opt', optimizerL.best_score_)
print('param_opt', optimizerL.best_params_)
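
To see what tuning `C` does under the L1 penalty, the following sketch counts the non-zero coefficients for several values of `C`. It runs on synthetic data from `make_classification`, which is an assumption made for illustration and not the competition data; smaller `C` means stronger regularization and typically fewer features kept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the spam features (not the competition data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=1
)

nonzero_by_C = {}
for C in [0.01, 0.1, 1, 10]:
    model = LogisticRegression(solver="liblinear", penalty="l1", C=C, random_state=1)
    model.fit(X_demo, y_demo)
    # Number of features whose coefficient was not shrunk exactly to zero
    nonzero_by_C[C] = int(np.count_nonzero(model.coef_))

print(nonzero_by_C)
```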

### RidgeClassifier

In [None]:
param_grid = {'alpha': [0.01, 0.05, 0.1, 0.5, 1, 2, 5]}

estimator = linear_model.RidgeClassifier( random_state = 1)
optimizerR = GridSearchCV(estimator, param_grid,  scoring = 'roc_auc',cv = 3)                    
optimizerR.fit(X_train, y_train)

print('score_train_opt', optimizerR.best_score_)
print('param_opt', optimizerR.best_params_)

### RandomForestClassifier
A tree should be grown with a loose stopping criterion and then pruned to remove the branches that contribute to overfitting. Pruning trades training accuracy for generalization: the train scores may drop, but the gap between train and test scores shrinks as well, which is what we need. (Details: https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680)
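
The trade-off can be illustrated on a single tree. This is a sketch under stated assumptions: it uses scikit-learn's cost-complexity pruning (`ccp_alpha`) on a `DecisionTreeClassifier` with synthetic data, whereas the notebook itself limits `max_depth` of a random forest. The value `0.01` is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data (not the competition data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

results = {}
for ccp_alpha in [0.0, 0.01]:  # 0.0 = fully grown tree, 0.01 = pruned tree
    tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=1).fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, tree.predict_proba(X_tr)[:, 1])
    auc_te = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
    results[ccp_alpha] = (auc_tr, auc_te, tree.get_n_leaves())
    print(f"ccp_alpha={ccp_alpha}: train={auc_tr:.3f} test={auc_te:.3f} "
          f"leaves={tree.get_n_leaves()}")
```

The unpruned tree fits the training split perfectly; comparing the printed train and test scores for both settings shows how much of that fit carries over to unseen data.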

In [None]:
rf_class = ensemble.RandomForestClassifier(random_state = 1)
depths = list(range(1, 11))
# param_name and param_range are passed as keywords, as recent scikit-learn versions require
train_scores, test_scores = model_selection.validation_curve(rf_class, X_train, y_train, param_name='max_depth', param_range=depths, cv=3, scoring='roc_auc')
print('max_depth=', depths)
print(train_scores.mean(axis = 1))
print(test_scores.mean(axis = 1))

The gap between train and test scores is roughly the same for max_depth = 4 to 9, and the highest ROC AUC is reached at max_depth = 4.
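
Reading the curve by eye can be complemented by computing the gap explicitly. The sketch below does this on synthetic data with a shortened depth list; the data, the depth values, and `n_estimators=30` are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the training data (not the competition data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=1
)

depths = [1, 2, 4, 8]
train_scores, test_scores = validation_curve(
    RandomForestClassifier(n_estimators=30, random_state=1),
    X_demo, y_demo,
    param_name="max_depth", param_range=depths,
    cv=3, scoring="roc_auc",
)

# Average over the CV folds: one value per depth
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
gap = train_mean - test_mean  # a large gap suggests overfitting

for d, tr, te, g in zip(depths, train_mean, test_mean, gap):
    print(f"max_depth={d}: train={tr:.3f} cv={te:.3f} gap={g:.3f}")

# Depth with the highest mean cross-validated ROC AUC
best_depth = depths[int(np.argmax(test_mean))]
print("best depth by CV score:", best_depth)
```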

In [None]:
param_grid = {'n_estimators': list(range(20, 100, 5)), 'min_weight_fraction_leaf': [0.001,  0.005, 0.01, 0.05, 0.1, 0.5] } 

estimator = ensemble.RandomForestClassifier(max_depth=4, random_state = 1)
optimizerRF = GridSearchCV(estimator, param_grid, scoring = 'roc_auc',cv = 3)                    
optimizerRF.fit(X_train, y_train)

print('score_train_opt', optimizerRF.best_score_)
print('param_opt', optimizerRF.best_params_)

### Extreme Gradient Boosting

In [None]:
param_grid = {'max_depth': list(range(1, 7)), 'learning_rate': [0.01, 0.05, 0.1, 0.5, 1, 1.5], 'n_estimators': list(range(10, 100, 5)) }
estimator = xgb.XGBClassifier( random_state = 1, min_child_weight=3)
optimizer = GridSearchCV(estimator, param_grid, scoring = 'roc_auc',cv = 3)                    
optimizer.fit(X_train, y_train)

print('score_train_opt', optimizer.best_score_)
print('param_opt', optimizer.best_params_)

In [None]:
param_grid = {'n_estimators': list(range(10, 100, 5)), 'min_child_weight': list(range(1, 10)) }
estimator = xgb.XGBClassifier( max_depth = 3, random_state = 1, learning_rate=0.1)
optimizer = GridSearchCV(estimator, param_grid, scoring = 'roc_auc',cv = 3)                    
optimizer.fit(X_train, y_train)

print('score_train_opt', optimizer.best_score_)
print('param_opt', optimizer.best_params_) 

The final model is the one with the highest cross-validated ROC AUC: RandomForestClassifier.
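
The comparison can also be made programmatically by collecting the fitted searches and taking the one with the largest `best_score_`. A minimal sketch with two small searches on synthetic data; the dictionary keys, the grids, and the data are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (not the competition data)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=1)

# Hypothetical, deliberately tiny grids to keep the example fast
searches = {
    "logreg_l1": GridSearchCV(
        LogisticRegression(solver="liblinear", penalty="l1", random_state=1),
        {"C": [0.1, 1]}, scoring="roc_auc", cv=3,
    ),
    "ridge": GridSearchCV(
        RidgeClassifier(random_state=1),
        {"alpha": [0.1, 1]}, scoring="roc_auc", cv=3,
    ),
}

for name, search in searches.items():
    search.fit(X_demo, y_demo)
    print(name, round(search.best_score_, 3), search.best_params_)

# best_score_ is the mean cross-validated ROC AUC of the best parameter setting
best_name = max(searches, key=lambda name: searches[name].best_score_)
print("selected:", best_name)
```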


In [None]:
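# Writing answers

# ROC AUC is rank-based, so write the predicted spam probability rather than hard 0/1 labels
ans = optimizerRF.predict_proba(X_test)[:, 1]

# One prediction per line, no header; str(ans) would abbreviate a long array with "..."
np.savetxt("/kaggle/working/answers.csv", ans, fmt="%.6f")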
