This kernel is an effort to show that the so-called "useless" columns, the ones with smaller variances, actually have predictive power. I tried to show this once in [this kernel](https://www.kaggle.com/mhviraf/there-is-predictive-power-in-the-useless-columns). However, Chris pointed out that the results I showed there are actually due to the curse of dimensionality, which he described in detail [here](https://www.kaggle.com/c/instant-gratification/discussion/93379#latest-537928). So I dug deeper to assess whether there is predictive power in the "useless" columns. Here is what I did:
1. filtered the dataset by `wheezy-copper-turtle-magic=0`
2. trained a logistic regression classifier on each column separately (512 10-fold models) and calculated the ROC AUC
3. the results show that 14 of the top 30 features with the highest predictive power are among the "useless" columns.
4. to make sure it's not an accident, I repeated steps 1 to 3 for `wheezy-copper-turtle-magic=111`, and this time 16 of the top 30 most predictive columns were among the "useless" columns.
5. update: I also added a plot of the hyperplane of a logistic regression trained only on the "useless" columns that reached ROC AUC > 0.5 in the table below (condition: ROC AUC > 0.5 & in useful cols == False). The plot shows that these columns give separable classes; hence, they have predictive power.
6. update2: I am making predictions on the test set based on the "useless" columns only. If the ROC AUC is greater than 0.5, then we know for sure that they have predictive power.
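
Steps 1 to 3 can be sketched on synthetic data as follows. This is only a minimal illustration, not the exact code of this kernel: the helper name `per_column_oof_auc` and the toy data are made up for the example. It fits each model on the training fold only and scores out-of-fold probabilities, which is what ROC AUC expects.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def per_column_oof_auc(X, y, n_splits=10, seed=2019):
    """Return one out-of-fold ROC AUC per column of X (hypothetical helper)."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for j in range(X.shape[1]):
        xj = X[:, [j]]                      # keep 2-D shape (n_samples, 1)
        oof = np.zeros(len(y))
        for trn_idx, val_idx in folds.split(xj, y):
            clf = LogisticRegression(solver='liblinear', C=0.1,
                                     class_weight='balanced')
            clf.fit(xj[trn_idx], y[trn_idx])                     # training fold only
            oof[val_idx] = clf.predict_proba(xj[val_idx])[:, 1]  # scores, not labels
        aucs.append(roc_auc_score(y, oof))
    return np.array(aucs)


# Synthetic stand-in data (NOT the competition data):
# column 0 is shifted by the class label, column 1 is pure noise.
rng = np.random.default_rng(0)
n = 600
y = rng.integers(0, 2, size=n)
X = np.column_stack([rng.normal(0, 1, n) + 2.0 * y,
                     rng.normal(0, 1, n)])

aucs = per_column_oof_auc(X, y)
print(aucs)
```

The informative column should score well above 0.5 and the noise column should stay close to 0.5, which is the kind of baseline to compare the "useless" columns against.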

In [12]:
# IMPORT LIBRARIES
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.animation import ArtistAnimation
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import roc_auc_score
from tqdm import tqdm_notebook as tqdm
from sklearn.model_selection import StratifiedKFold

In [13]:
test = pd.read_csv('../input/test.csv')
train = pd.read_csv('../input/train.csv')

In [None]:
train2 = train.loc[train['wheezy-copper-turtle-magic'] == 0, :].reset_index(drop=True)
train2.drop('id', axis=1, inplace=True)
target2 = train2.target
train2.drop('target', axis=1, inplace=True)
train2.drop('wheezy-copper-turtle-magic', axis=1, inplace=True)
cols = np.array([c for c in train2.columns if c not in ['id', 'target', 'wheezy-copper-turtle-magic']])
sel = VarianceThreshold(threshold=1.5).fit(train2[cols])
useful_cols = cols[np.where(sel.variances_ > 1.5)[0]]
train2 = train2[cols]

In [None]:
results = []
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=2019)

for col in tqdm(train2.columns):
    lre_oof_preds = np.zeros(len(train2))
    for fold_, (trn_, val_) in enumerate(folds.split(train2, target2)):
        trn_x, trn_y = train2.loc[trn_, col], target2[trn_]
        val_x, val_y = train2.loc[val_, col], target2[val_]
        clf = LogisticRegression(solver='liblinear',penalty='l2',C=0.1,class_weight='balanced')
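        # fit on the training fold only, so the validation fold stays out-of-fold
        clf.fit(trn_x.values.reshape(-1, 1), trn_y.values)
        # ROC AUC needs scores, not hard 0/1 labels
        lre_oof_preds[val_] = clf.predict_proba(val_x.values.reshape(-1, 1))[:, 1]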
    results.append([col, roc_auc_score(target2, lre_oof_preds), col in useful_cols, np.std(train2[col].values)])

results_pd = pd.DataFrame(results, columns=['variable', 'roc auc', 'in useful cols?', 'std']).sort_values(by='roc auc', ascending=False).reset_index(drop=True)

In [None]:
results_pd.head(30)

In [None]:
results_pd.to_csv('resulting_dude.csv', index=None)

In [None]:
only_new_cols = results_pd.loc[(results_pd['roc auc'] > 0.5) & (results_pd['in useful cols?'] == False), 'variable'].values
train2 = train.loc[train['wheezy-copper-turtle-magic'] == 0, :].reset_index(drop=True)
target2 = train2.target
train2.drop('target', axis=1, inplace=True)
train2 = train2[only_new_cols]
cols = [c for c in train2.columns if c not in ['id', 'target', 'wheezy-copper-turtle-magic']]
train2 = train2[cols].values
# FIND NORMAL TO HYPERPLANE
clf = LogisticRegression(solver='liblinear',penalty='l2',C=0.1,class_weight='balanced')
clf.fit(train2, target2)
u1 = clf.coef_[0]
u1 = u1/np.sqrt(u1.dot(u1))
# CREATE RANDOM DIRECTION PERPENDICULAR TO U1
u2 = np.random.normal(0,1,len(u1))
u2 = u2 - u1.dot(u2)*u1
u2 = u2/np.sqrt(u2.dot(u2))

# CREATE RANDOM DIRECTION PERPENDICULAR TO U1 AND U2
u3 = np.random.normal(0,1,len(u1))
u3 = u3 - u1.dot(u3)*u1 - u2.dot(u3)*u2
u3 = u3/np.sqrt(u3.dot(u3))
idx0 = np.where(target2==0)
idx1 = np.where(target2==1)
# CREATE AN ANIMATION
images = []
steps = 60
fig = plt.figure(figsize=(8,8))
for k in range(steps):
    # CALCULATE NEW ANGLE OF ROTATION
    angR = k*(2*np.pi/steps)
    angD = round(k*(360/steps),0)
    u4 = np.cos(angR)*u1 + np.sin(angR)*u2
    u = np.concatenate([u4,u3]).reshape((2,len(u1)))  
    # PROJECT TRAIN ONTO U4,U3 PLANE
    p = u.dot(train2.transpose())
    # PLOT TRAIN DATA (KEEP CORRECT COLOR IN FRONT)
    if angD<180:
        plt.title('wheezy-copper-turtle-magic=0; "useless" columns only')
        img2 = plt.scatter(p[0,idx1],p[1,idx1],c='yellow')
        img3 = plt.scatter(p[0,idx0],p[1,idx0],c='blue')
    else:
        plt.title('wheezy-copper-turtle-magic=0; "useless" columns only')
        img2 = plt.scatter(p[0,idx0],p[1,idx0],c='blue')
        img3 = plt.scatter(p[0,idx1],p[1,idx1],c='yellow')
    images.append([img2, img3])
# SAVE MOVIE TO FILE
ani = ArtistAnimation(fig, images)
ani.save('all_features.gif', writer='imagemagick', fps=15)
plt.close()
# PROJECT TRAIN ONTO U1, U2 PLANE
u = np.concatenate([u1,u2]).reshape((2,len(u1)))
p = u.dot(train2.transpose())
plt.figure(figsize=(6,6))
plt.title('wheezy-copper-turtle-magic=0; bunch of useless columns')
plt.scatter(p[0,idx1],p[1,idx1],c='yellow', alpha=0.9)
plt.scatter(p[0,idx0],p[1,idx0],c='blue', alpha=0.9)
# IN-SAMPLE ROC AUC (SAME ROWS USED FOR FITTING), SCORED ON PROBABILITIES
print(f'ROC: {roc_auc_score(target2, clf.predict_proba(train2)[:, 1])}')
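
The projection directions in the cell above are built by taking the hyperplane normal and removing its component from random vectors. A small standalone sketch of that idea is below; the function name `orthonormal_frame` and the coefficient vector `w` are hypothetical, and the projections are subtracted one at a time rather than all at once as in the cell above, which is usually a bit more stable numerically.

```python
import numpy as np


def orthonormal_frame(normal, n_extra=2, seed=0):
    """Unit normal plus n_extra random unit directions, all mutually orthogonal."""
    rng = np.random.default_rng(seed)
    basis = [normal / np.linalg.norm(normal)]
    for _ in range(n_extra):
        v = rng.normal(0, 1, len(normal))
        for b in basis:
            v = v - b.dot(v) * b          # remove the component along b
        basis.append(v / np.linalg.norm(v))
    return np.vstack(basis)


# Hypothetical coefficient vector standing in for clf.coef_[0]
w = np.array([3.0, -1.0, 0.5, 2.0, 0.0])
U = orthonormal_frame(w)

# Rows of U are orthonormal, so U @ U.T should be (numerically) the identity
print(np.round(U @ U.T, 6))
```

Projecting the data with `U[:2].dot(X.T)` then gives the same kind of 2-D view as the scatter plot above, with the first axis along the hyperplane normal.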

In [21]:
# check if it applies to the test set
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=2019)
for wctm in tqdm(range(train['wheezy-copper-turtle-magic'].nunique())):
    train2 = train.loc[train['wheezy-copper-turtle-magic'] == wctm, :]
    train2 = train2.reset_index(drop=True)
    test2 = test.loc[test['wheezy-copper-turtle-magic'] == wctm, :]
    test2_ids = test2.index
    test2 = test2.reset_index(drop=True)
    train2.drop('id', axis=1, inplace=True)
    target2 = train2.target
    train2.drop('target', axis=1, inplace=True)
    train2.drop('wheezy-copper-turtle-magic', axis=1, inplace=True)
    cols = np.array([c for c in train2.columns if c not in ['id', 'target', 'wheezy-copper-turtle-magic']])
    sel = VarianceThreshold(threshold=1.5).fit(train2[cols])
    useful_cols = cols[np.where(sel.variances_ > 1.5)[0]]
    train2 = train2[cols]

    results = []
    for col in train2.columns:
        lre_oof_preds = np.zeros(len(train2))
        for fold_, (trn_, val_) in enumerate(folds.split(train2, target2)):
            trn_x, trn_y = train2.loc[trn_, col], target2[trn_]
            val_x, val_y = train2.loc[val_, col], target2[val_]
            clf = LogisticRegression(solver='liblinear',penalty='l2',C=0.1,class_weight='balanced')
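            # fit on the training fold only, so the validation fold stays out-of-fold
            clf.fit(trn_x.values.reshape(-1, 1), trn_y.values)
            # ROC AUC needs scores, not hard 0/1 labels
            lre_oof_preds[val_] = clf.predict_proba(val_x.values.reshape(-1, 1))[:, 1]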
        results.append([col, roc_auc_score(target2, lre_oof_preds), col in useful_cols, np.std(train2[col].values)])

    results_pd = pd.DataFrame(results, columns=['variable', 'roc auc', 'in useful cols?', 'std'])
    only_new_cols = results_pd.loc[(results_pd['roc auc'] > 0.5) & (results_pd['in useful cols?'] == False), 'variable'].values
    clf.fit(train2[only_new_cols], target2)
    
    test.loc[test2_ids, 'target'] = clf.predict_proba(test2[only_new_cols])[:, 1]

KeyboardInterrupt: 

In [23]:
# submit test predictions
sub = pd.read_csv('../input/sample_submission.csv')
sub['target'] = test.target

sub.to_csv('submission_based_on_useless_cols.csv', index=None)

Unnamed: 0,id,target
0,1c13f2701648e0b0d46d8a2a5a131a53,0.5
1,ba88c155ba898fc8b5099893036ef205,0.5
2,7cbab5cea99169139e7e6d8ff74ebb77,0.5
3,ca820ad57809f62eb7b4d13f5d4371a0,0.5
4,7baaf361537fbd8a1aaa2c97a6d4ccc7,0.5
5,8b3116e5e3e92e971dac305d1a093bf6,0.5
6,35cfd7cab9bfa29bc963d1b8c94dd280,0.5
7,83cf532df8ff4642a3e33a70fffdac37,0.5
8,2e1350fe94ec9f2220bec5245e5e9265,0.5
9,e62020afa72eb54a15725473e3a8475b,0.5


In [None]:
train2 = train.loc[train['wheezy-copper-turtle-magic'] == 111, :].reset_index(drop=True)
train2.drop('id', axis=1, inplace=True)
target2 = train2.target
train2.drop('target', axis=1, inplace=True)
train2.drop('wheezy-copper-turtle-magic', axis=1, inplace=True)
cols = np.array([c for c in train2.columns if c not in ['id', 'target', 'wheezy-copper-turtle-magic']])
sel = VarianceThreshold(threshold=1.5).fit(train2[cols])
useful_cols = cols[np.where(sel.variances_ > 1.5)[0]]
train2 = train2[cols]

In [None]:
results = []
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=2019)

for col in tqdm(train2.columns):
    lre_oof_preds = np.zeros(len(train2))
    for fold_, (trn_, val_) in enumerate(folds.split(train2, target2)):
        trn_x, trn_y = train2.loc[trn_, col], target2[trn_]
        val_x, val_y = train2.loc[val_, col], target2[val_]
        clf = LogisticRegression(solver='liblinear',penalty='l2',C=0.1,class_weight='balanced')
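        # fit on the training fold only, so the validation fold stays out-of-fold
        clf.fit(trn_x.values.reshape(-1, 1), trn_y.values)
        # ROC AUC needs scores, not hard 0/1 labels
        lre_oof_preds[val_] = clf.predict_proba(val_x.values.reshape(-1, 1))[:, 1]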
    results.append([col, roc_auc_score(target2, lre_oof_preds), col in useful_cols, np.std(train2[col].values)])

results_pd = pd.DataFrame(results, columns=['variable', 'roc auc', 'in useful cols?', 'std']).sort_values(by='roc auc', ascending=False).reset_index(drop=True)

In [None]:
results_pd.head(30)

In [None]:
results_pd.to_csv('resulting_dude_of_111.csv', index=None)