Import the data file from the compressed folder

In [1]:
import gzip
import pandas  as pd
with gzip.open('DScasestudy (1) (50).txt.gz') as file:

    data = pd.read_csv(file, sep='\t')

#check the structure of the data
    data.head()

In [2]:
#get general information on the data(# of columns and rows, what data types)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 530 entries, 0 to 529
Columns: 16563 entries, response to V16562
dtypes: int64(16563)
memory usage: 67.0 MB


In [3]:
#remove any data points that is empty since it will affect the model performance in this case
data.dropna(axis=0, how='any', inplace=True)

In [4]:
#check how many unique values in the response column to check whether its a binary or multiclass classification
data['response'].nunique()

2

In [5]:
#get more statistical information on the response column. The results show that there are 530 values with mean 0.234 
#and standard deviation of 0.422. The 1st, 2nd and third quartile are 0 and the max 1. So a class imbalance does exists in response.
data['response'].describe()

count    530.000000
mean       0.232075
std        0.422556
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        1.000000
Name: response, dtype: float64

In [6]:
#analyse the response feature further to understand the class imbalance. There are 123 responses with 1 and 407 with 0 response, in the ratio 1:3 
sum(data['response']== 1), sum(data['response']== 0), ((123/407)*100), int(407/123)

(123, 407, 30.22113022113022, 3)

In [7]:
#Remove some features to improve model performance: I tried to remove the columns that are not relevant for the response ie., 
#they do not change irrespective of positive or negative response. This process removed more than 6000 features that could speed up the training process as well improve the model performance.
for col in data.columns:
    if data[col].nunique() == 1:
        data.drop(col,inplace=True,axis=1)

Handling class imbalance: one of the ways to handle class imbalance is either downsample the minority class or upsample the majority class. We could also try data augmentation to synthesize more minority class samples with some variations. I have tried the downsampling of the majority class here and the upsampling of the minority class here. So that there will be enough number of samples for both classes to train the model.

In [8]:
#split the data based on response
res_0 = data[data['response']== 0] #negative response only
res_1 = data[data['response']== 1] #positive response only
from sklearn.utils import resample
#downsampling the majority class
neg_downsampled = resample(res_0, replace=True, n_samples=123, random_state=50)
# Combine the minority class with downsampled majority class
downsample = pd.concat([neg_downsampled, res_1])
#the new class counts
downsample.response.value_counts()

1    123
0    123
Name: response, dtype: int64

In [9]:
#upsampling the data
pos_upsampled = resample(res_1, replace=True, n_samples=81, random_state=50)
# Combine upsampled minority class with downsampled data
upsample = pd.concat([pos_upsampled, downsample])
#the new class counts
upsample.response.value_counts()

#here we have slightly more positive samples (81)

1    204
0    123
Name: response, dtype: int64

Prepare the data for model

In [10]:
#separating the data into X (independent) and y (dependent) features
X = upsample.loc[:,upsample.columns != 'response'].values
y = upsample.iloc[:,1].values

In [11]:
#handling sparse matrix, since there are a lot of zeros in the independent variables and it helps to 
#convert it to sparse matrix to improve model performance
from scipy.sparse import csr_matrix
X_csr = csr_matrix(X)

Model selection and parameter tuning: I have chosen the xgboost classifier algorithm for this classification problem since it is a fast and flexible algorithm with great performance. It is a gradient boosting decision tree algorithm. The boosting reduces high bias and high variance. The gridsearchcv algorithm is used to tune the parameters of the model for training.

In [12]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

# A parameter grid for XGBoost
params = {
        'min_child_weight': [1, 3, 5],
        'reg_lambda': [0.2,0.4,0.6],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.1, 0.3, 0.5],
        'max_depth': [3, 4, 5],
        'scale_pos_weight': [1,3,5]
        }

cls = XGBClassifier(learning_rate=0.02, n_estimators=100, objective='binary:logistic',  silent=True, nthread = -1)

skf = StratifiedKFold(n_splits=3, shuffle = True, random_state = 5)

# if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used
gridsearch = GridSearchCV(estimator=cls, param_grid=params, scoring='roc_auc', n_jobs=-1, cv=skf.split(X,y), return_train_score=True, verbose=3 )
gridsearch.fit(X, y)
cvresults = pd.DataFrame(gridsearch.cv_results_)

Fitting 3 folds for each of 729 candidates, totalling 2187 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:   20.1s
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 504 tasks      | elapsed:  5.9min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  9.4min
[Parallel(n_jobs=-1)]: Done 1144 tasks      | elapsed: 14.8min
[Parallel(n_jobs=-1)]: Done 1560 tasks      | elapsed: 21.8min
[Parallel(n_jobs=-1)]: Done 2040 tasks      | elapsed: 32.0min
[Parallel(n_jobs=-1)]: Done 2187 out of 2187 | elapsed: 35.0min finished


In [20]:
#splitting the X and y data into train and test set
from sklearn.model_selection import train_test_split
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X_csr,y, test_size=0.30, random_state=5)

In [21]:
#use the best parameters from the gridsearch to train the model
from xgboost import XGBClassifier

model = XGBClassifier(
n_estimators=1400, objective = 'binary:logistic', learning_rate = 0.02, reg_lambda = 0.2, scale_pos_weight = 3, max_depth =3, colsample_bytree = 0.1, #try changing colsample6
            subsample =0.6, min_child_weight =1, tree_method = "auto", nthread = -1, random_state = 50)

eval_set = [(X_test, y_test)]

model.fit(X_train, y_train, eval_metric="error", eval_set=eval_set, early_stopping_rounds = 10, verbose=True)
y_pred = model.predict(X_test)

[0]	validation_0-error:0.010101
Will train until validation_0-error hasn't improved in 10 rounds.
[1]	validation_0-error:0
[2]	validation_0-error:0
[3]	validation_0-error:0
[4]	validation_0-error:0
[5]	validation_0-error:0
[6]	validation_0-error:0
[7]	validation_0-error:0
[8]	validation_0-error:0
[9]	validation_0-error:0
[10]	validation_0-error:0
[11]	validation_0-error:0
Stopping. Best iteration:
[1]	validation_0-error:0



Performance metrices for the model: accuracy, confusion matrix and precision/recall

In [22]:
#performance
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 100.00%


In [23]:
conf_matrix = pd.DataFrame(metrics.confusion_matrix(y_test, y_pred, labels=[1, 0]), index=['true:1', 'true:0'], columns=['pred:1', 'pred:0'])
conf_matrix

Unnamed: 0,pred:1,pred:0
true:1,1,0
true:0,0,98


In [24]:
#precision = TP / (TP+FP)
precision = conf_matrix.iloc[0,0] / (conf_matrix.iloc[0,0]+conf_matrix.iloc[1,0])
#recall = TP / (TP+FN)
recall = conf_matrix.iloc[0,0] / (conf_matrix.iloc[0,0]+conf_matrix.iloc[0,1])
precision, recall

(1.0, 1.0)

In [26]:
#only probabilities can be used for roc_auc
# Predict class probabilities
prob_y = model.predict_proba(X_test)
# Keep only the positive class
prob_y = [p[1] for p in prob_y]
metrics.roc_auc_score(y_test, prob_y)

1.0

Eventhough the prediction accuracy is 100% for this particular sample(small) with more data we may have a better way to adjudge the performance of the model.