After two submissions, this one applies more advanced techniques to improve the result further.

In [1]:
import numpy as np
import pandas as pd

train_dataset = pd.read_csv("train.csv")
test_dataset = pd.read_csv("test.csv")

In [2]:
train_dataset = train_dataset.drop(["id"], axis = 1)
target = train_dataset["target"]

As in the last submission, handle the skewed class distribution by stratifying the train/validation split on the target.

In [3]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    train_dataset.drop('target', axis=1),
    target,
    test_size=0.25,
    random_state=42,
    shuffle=True,
    stratify=target,
)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(187, 300) (63, 300) (187,) (63,)
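
To illustrate what `stratify` does in the split above, here is a minimal sketch on made-up labels. The 90/160 class counts and the placeholder feature matrix are assumptions for illustration only, not the competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 zeros and 160 ones (not the real data).
y = np.array([0] * 90 + [1] * 160)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature matrix

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, shuffle=True, stratify=y
)

# With stratify=y, the share of class 0 is nearly identical in both parts.
train_share = (y_tr == 0).mean()
val_share = (y_val == 0).mean()
print(f"class 0 share: train={train_share:.3f}, validation={val_share:.3f}")
```

Without `stratify`, a small validation part can end up with a noticeably different class balance, which makes AUC estimates less comparable between runs.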


### Feature selection

In the last submission, I used mutual_info_classif from sklearn to select the features. In this submission, recursive feature elimination (RFE from sklearn.feature_selection) with a RandomForestClassifier as the estimator is used instead. The expected number of features is 200.

In [4]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=500, class_weight='balanced', max_depth=5, random_state=42)
selector = RFE(rfc, n_features_to_select=200)
selector.fit(x_train, y_train)
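
As a rough sketch of how RFE arrives at its selection, the example below runs it on a small synthetic dataset (an assumption, generated with `make_classification`) and inspects `ranking_`. The `step` value here is chosen only to keep the example fast; the cell above uses the default step of 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data (assumption): 100 rows, 20 features, 5 informative.
X, y = make_classification(
    n_samples=100, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
# step=5 removes the five least important features per round: 20 -> 15 -> 10 -> 5.
rfe = RFE(forest, n_features_to_select=5, step=5)
rfe.fit(X, y)

kept = rfe.get_support()  # boolean mask, True for retained features
ranks = rfe.ranking_      # 1 for retained features, larger values were dropped earlier
print("kept feature indices:", kept.nonzero()[0])
print("ranking:", ranks)
```

Each round refits the forest and discards the features with the lowest importances, so a larger `step` trades some selection granularity for fewer refits.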

In [5]:
selected_features = selector.get_support()

In [6]:
print('number of selected columns',selected_features.sum())
print('selected columns',x_train.columns[selected_features])

number of selected columns 200
selected columns Index(['0', '1', '2', '4', '5', '6', '7', '9', '13', '15',
       ...
       '284', '286', '287', '288', '289', '290', '292', '295', '297', '298'],
      dtype='object', length=200)


In [7]:
# Drop the rejected features everywhere; reassigning avoids in-place edits on split copies.
dropped_features = x_train.columns[~selected_features]
x_train = x_train.drop(columns=dropped_features)
x_test = x_test.drop(columns=dropped_features)
train_dataset = train_dataset.drop(columns=dropped_features)
test_dataset = test_dataset.drop(columns=dropped_features)

The first submission scaled the features with StandardScaler, which centers on the mean and divides by the standard deviation; both statistics are sensitive to outliers. RobustScaler instead centers on the median and scales by the interquartile range, so it is robust to outliers and may be a better fit here.

In [8]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
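
The difference between the two scalers can be seen on a toy column. This is a minimal sketch; the ten values, including the single extreme outlier, are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy column (assumption): nine ordinary values and one extreme outlier.
col = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

standard = StandardScaler().fit_transform(col).ravel()
robust = RobustScaler().fit_transform(col).ravel()

# Spread of the nine inliers after scaling.
standard_spread = standard[:9].max() - standard[:9].min()
robust_spread = robust[:9].max() - robust[:9].min()
print(f"inlier spread: standard={standard_spread:.4f}, robust={robust_spread:.4f}")
```

The outlier inflates the standard deviation, so StandardScaler squeezes the ordinary values into a very narrow band, while RobustScaler keeps them well spread because the median and interquartile range barely move.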

In [9]:
x_train

array([[-1.4864572 ,  1.13866878, -0.22279793, ...,  1.11662904,
        -0.86406813,  0.91941247],
       [-0.8335139 ,  0.56576862, -0.77202073, ..., -0.09706546,
        -0.75597773, -0.19055181],
       [ 0.60743951,  0.63232964,  0.06822107, ...,  1.01429646,
         0.43105142,  0.37157602],
       ...,
       [ 0.36619718,  0.25277338, -0.56303972, ..., -0.32656132,
        -0.60661644,  1.5045653 ],
       [-0.10906464, -0.32884311,  0.44300518, ..., -0.32355154,
         0.75139207, -0.17467249],
       [ 1.10653666, -0.80744849, -2.25734024, ...,  0.67268623,
         0.61513266, -0.01905518]])

### Model training

In [10]:
import warnings
warnings.filterwarnings("ignore")

I reuse the RBF-kernel SVC model from the last submission. The only changes are setting gamma to 'auto' and enabling probability estimates, because this submission uses the predict_proba method to produce the result.

In [11]:
from sklearn.svm import SVC

svm = SVC(C=100, kernel='rbf', class_weight={0: 1.8, 1: 1.0}, max_iter=100, gamma='auto', probability=True, random_state=42)
svm.fit(x_train, y_train)
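
For reference, a small sketch of what `probability=True` provides, run on synthetic data (an assumption) rather than the competition set. The columns of `predict_proba` follow the order of `classes_`, which is why column 1 is taken as the positive-class probability later on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

# gamma='auto' uses 1 / n_features; probability=True enables predict_proba.
clf = SVC(kernel="rbf", gamma="auto", probability=True, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X[:5])
print(clf.classes_)     # column order of predict_proba
print(proba.shape)      # one row per sample, one column per class
positive = proba[:, 1]  # probability of the class stored in classes_[1]
```

Note that scikit-learn obtains these probabilities through an internal cross-validated calibration step, so fitting is slower and the probabilities may occasionally disagree with `predict`.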

In [12]:
from sklearn.model_selection import cross_val_score

score = cross_val_score(svm, x_train, y_train, cv=20, scoring='roc_auc')
print(score)

[1.         0.91666667 0.875      0.875      0.95833333 1.
 1.         0.77777778 1.         1.         0.66666667 0.94444444
 0.83333333 0.44444444 0.77777778 1.         0.83333333 1.
 0.38888889 0.88888889]
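
With 187 training rows and cv=20, each fold holds only 9 or 10 rows, so individual fold scores vary widely. A short sketch for summarizing the fold scores printed above (the values are copied from that output):

```python
import numpy as np

# Fold scores copied from the cross_val_score output above.
scores = np.array([
    1.0, 0.91666667, 0.875, 0.875, 0.95833333, 1.0,
    1.0, 0.77777778, 1.0, 1.0, 0.66666667, 0.94444444,
    0.83333333, 0.44444444, 0.77777778, 1.0, 0.83333333, 1.0,
    0.38888889, 0.88888889,
])

# Mean and spread give a more stable picture than any single fold.
print(f"mean AUC = {scores.mean():.3f}, std = {scores.std():.3f}")
print(f"min = {scores.min():.3f}, max = {scores.max():.3f}")
```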


In this submission, I also use a second method, logistic regression, to predict the result, and finally take the average of both predictions as the target.

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

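# Base estimator only: GridSearchCV clones it and overrides 'solver' from the grid,
# so it does not need to be fitted beforehand.
lr = LogisticRegression(solver='liblinear', max_iter=1000)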

parameter_grid = {'class_weight' : [{0: 1.8, 1: 1.0}],
                  'penalty' : ['l2'],
                  'C' : [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
                  'solver': ['newton-cg', 'sag', 'lbfgs']
                 }

grid_search = GridSearchCV(lr, param_grid=parameter_grid, cv=20, scoring='roc_auc')
grid_search.fit(x_train, y_train)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

Best score: 0.8569444444444445
Best parameters: {'C': 0.01, 'class_weight': {0: 1.8, 1: 1.0}, 'penalty': 'l2', 'solver': 'newton-cg'}
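
Beyond the best score, the full grid can be inspected through `cv_results_`. The sketch below uses synthetic data and a reduced grid (both assumptions) to show one way of tabulating it.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=150, n_features=10, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0], "solver": ["lbfgs", "newton-cg"]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X, y)

# One row per parameter combination, sorted from best to worst.
results = pd.DataFrame(grid.cv_results_)
cols = ["param_C", "param_solver", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```

Comparing `mean_test_score` against `std_test_score` helps judge whether the best combination is clearly ahead or within fold-to-fold noise of the others.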


### Export result

In [14]:
test_data_id = test_dataset["id"]
test_dataset = test_dataset.drop(["id"], axis = 1)
test_data = scaler.transform(test_dataset)

Average the two models' predicted probabilities for class 1 and use the result as the target column of the submission.

In [15]:
svm_y_pred = svm.predict_proba(test_data)[:, 1]
lr_y_pred = grid_search.predict_proba(test_data)[:, 1]
avg_pred = (svm_y_pred + lr_y_pred) / 2
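
The averaging step can be checked on held-out data before submitting. This is a minimal sketch on synthetic data (an assumption), with default-ish hyperparameters rather than the ones tuned above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

svc = SVC(kernel="rbf", gamma="auto", probability=True, random_state=0).fit(X_tr, y_tr)
logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Positive-class probabilities from each model, then their simple mean.
p_svc = svc.predict_proba(X_te)[:, 1]
p_lr = logreg.predict_proba(X_te)[:, 1]
p_avg = (p_svc + p_lr) / 2

for name, p in [("svc", p_svc), ("logreg", p_lr), ("average", p_avg)]:
    print(f"{name}: AUC = {roc_auc_score(y_te, p):.3f}")
```

scikit-learn's `VotingClassifier` with `voting='soft'` performs the same kind of probability averaging if a single estimator object is preferred.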

In [16]:
submission= pd.DataFrame({'id':np.asarray(test_data_id), 'target':avg_pred})
submission.to_csv("submission.csv", index=False)