### Feature Selection

Feature selection is not performed because the number of observations (39,644 overall, 31,715 in training) is much larger than the number of features (58 overall, 60).

For completeness, the F score (which captures linear dependency) and the mutual information score (which also captures non-linear dependency) of each feature against the target are calculated for reference. These scores are not included in the report.
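As a rough illustration of the difference between the two scores, the following sketch evaluates three synthetic features against a binary target. The data, the names `x_linear`, `x_symmetric` and `x_noise`, and the fixed `random_state` are assumptions made for this example; none of them come from the project data.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Synthetic data (an assumption for illustration, not the project data).
x_symmetric = rng.normal(size=n)
# The class depends only on the magnitude of x_symmetric, so both classes
# have roughly the same mean for that feature.
y = (np.abs(x_symmetric) > 1).astype(int)
# x_linear is the class label plus noise, so its class means differ.
x_linear = y + rng.normal(size=n)
# x_noise is unrelated to the class.
x_noise = rng.normal(size=n)

X = np.column_stack([x_linear, x_symmetric, x_noise])
names = ["x_linear", "x_symmetric", "x_noise"]

# ANOVA F-statistic per feature: compares the class means of each feature.
f_scores, p_values = f_classif(X, y)
# Nearest-neighbour estimate of the mutual information per feature.
mi_scores = mutual_info_classif(X, y, random_state=0)

for name, f_score, mi_score in zip(names, f_scores, mi_scores):
    print(f"{name:12s} F = {f_score:10.3f}   MI = {mi_score:.4f}")
```

The F score should rank `x_linear` well above `x_symmetric`, because the F test only compares class means, while the mutual information estimate should rank `x_symmetric` highest, because the class is fully determined by it.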

In [1]:
from project_paths import *

from sklearn.feature_selection import f_classif, mutual_info_classif
import pandas as pd

In [4]:
# Load preprocessed data
X_train = load_preprocessed_data(preprocessed_data_path_train)
y_train = load_preprocessed_data(preprocessed_data_y_train)
all_features_pp = load_list_from_pkl('processed_features_list.pkl')

In [8]:
# Compute the ANOVA F-statistic (and p-value) of each feature against the target
f_test, p_values = f_classif(X_train, y_train)
print("Linear correlations between features and target:")
for score, name in zip(f_test, all_features_pp):
    print(name, " : ", score)

Linear correlations between features and target:
x0_Business  :  1.9821534137585748
x0_Entertainment  :  398.51892432385523
x0_Lifestyle  :  48.495560897342095
x0_No_data_channel  :  304.47369808265097
x0_Social Media  :  418.58225419804654
x0_Tech  :  284.1082390363232
x0_World  :  740.7373607818726
x1_0.0  :  621.8362968564277
x1_1.0  :  621.8362968564348
x2_Friday  :  1.728267237215165
x2_Monday  :  17.307239993355783
x2_Saturday  :  379.49894897341835
x2_Sunday  :  209.78133614081128
x2_Thursday  :  15.918246676221484
x2_Tuesday  :  51.94248425523417
x2_Wednesday  :  54.5103875367296
n_tokens_title  :  63.68545407362967
average_token_length  :  25.09076316466873
num_keywords  :  147.45206335044114
LDA_00  :  126.2322862321775
LDA_01  :  169.04683306053383
LDA_02  :  793.2809924385907
LDA_03  :  114.81125577474629
LDA_04  :  254.94411261830112
title_subjectivity  :  49.693606302960475
title_sentiment_polarity  :  103.53079984369735
abs_title_subjectivity  :  0.0008676827307222182
ab

In [26]:
# Pair each F score with its feature name and show the ten highest scores
linear_scores = list(zip(f_test, all_features_pp))
linear_scores.sort(key=lambda pair: pair[0], reverse=True)
linear_scores[:10]

[(895.3877958211976, 'kw_avg_avg'),
 (793.2809924385907, 'LDA_02'),
 (740.7373607818726, 'x0_World'),
 (621.8362968564348, 'x1_1.0'),
 (621.8362968564277, 'x1_0.0'),
 (418.58225419804654, 'x0_Social Media'),
 (398.51892432385523, 'x0_Entertainment'),
 (379.49894897341835, 'x2_Saturday'),
 (304.47369808265097, 'x0_No_data_channel'),
 (284.1082390363232, 'x0_Tech')]
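Since `pandas` is already imported, the same ranking could also be produced with a labelled `Series`. This is only a sketch: the helper name `rank_features` is hypothetical, and the five scores are copied from the F-score output above purely for illustration.

```python
import pandas as pd

def rank_features(scores, names, top=10):
    """Return the `top` highest scores as a Series indexed by feature name."""
    return pd.Series(scores, index=names).sort_values(ascending=False).head(top)

# A few F scores copied from the output above, for illustration only.
names = ["x0_Business", "x0_World", "x2_Friday", "num_keywords", "LDA_02"]
scores = [1.9821534137585748, 740.7373607818726, 1.728267237215165,
          147.45206335044114, 793.2809924385907]

ranked = rank_features(scores, names, top=3)
print(ranked)
```

In the notebook this would be called as `rank_features(f_test, all_features_pp)`, which keeps the feature names attached to the scores while sorting.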

In [9]:
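# Estimate the mutual information between each feature and the target
# (estimates can vary slightly between runs unless random_state is set)
mi = mutual_info_classif(X_train, y_train)
print("Non-linear correlations between features and target:")
for score, name in zip(mi, all_features_pp):
    print(name, " : ", score)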

Non-linear correlations between features and target:
x0_Business  :  0.0
x0_Entertainment  :  0.008304822443002902
x0_Lifestyle  :  0.00147227583761822
x0_No_data_channel  :  0.008678933505772202
x0_Social Media  :  0.008497602545643135
x0_Tech  :  0.0070146489985516425
x0_World  :  0.01063662409742916
x1_0.0  :  0.012537952390572427
x1_1.0  :  0.006653541870363933
x2_Friday  :  0.00023753172877749584
x2_Monday  :  0.0038459111496298437
x2_Saturday  :  0.007189345410125281
x2_Sunday  :  0.005160273458537201
x2_Thursday  :  0.004815685942957781
x2_Tuesday  :  0.003083994380005395
x2_Wednesday  :  0.0
n_tokens_title  :  0.0
average_token_length  :  0.001499676718353271
num_keywords  :  1.7115050843674595e-05
LDA_00  :  0.0196412425372976
LDA_01  :  0.02482106611361634
LDA_02  :  0.03544867103738181
LDA_03  :  0.02225815582039181
LDA_04  :  0.023051871657786105
title_subjectivity  :  0.0065299170555983554
title_sentiment_polarity  :  0.004213060217826525
abs_title_subjectivity  :  0.00407

In [27]:
# Pair each mutual information score with its feature name and show the ten highest
nonlinear_scores = list(zip(mi, all_features_pp))
nonlinear_scores.sort(key=lambda pair: pair[0], reverse=True)
nonlinear_scores[:10]

[(0.03544867103738181, 'LDA_02'),
 (0.03440798202419848, 'kw_max_avg'),
 (0.026289865089594855, 'self_reference_avg_sharess'),
 (0.02513913091568476, 'kw_avg_avg'),
 (0.02482106611361634, 'LDA_01'),
 (0.023051871657786105, 'LDA_04'),
 (0.02274586757727004, 'kw_max_min'),
 (0.02225815582039181, 'LDA_03'),
 (0.0219413628389169, 'self_reference_max_shares'),
 (0.02114846288203398, 'self_reference_min_shares')]
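In the outputs above, `x1_0.0` and `x1_1.0`, which look like complementary indicator columns, have near-identical F scores but different mutual information estimates (about 0.0125 versus 0.0067). One possible explanation, assuming `X_train` is a dense array, is that `mutual_info_classif` with its default `discrete_features='auto'` treats every column of a dense input as continuous and applies a randomized nearest-neighbour estimator. The sketch below uses synthetic data (an assumption, not the project data) to compare that default with `discrete_features=True`, which scores the columns from their contingency tables instead.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 5000

# Synthetic binary flag and a target that agrees with it about 70% of the time
# (assumptions for illustration only).
flag = rng.integers(0, 2, size=n)
y = np.where(rng.random(n) < 0.7, flag, 1 - flag)

# Two complementary indicator columns, like a one-hot encoded binary feature.
X = np.column_stack([flag, 1 - flag]).astype(float)

# Treat both columns as discrete: the score comes from the contingency table,
# so a column and its complement receive the same value.
mi_discrete = mutual_info_classif(X, y, discrete_features=True, random_state=0)

# Default behaviour for dense input: columns are treated as continuous.
mi_default = mutual_info_classif(X, y, random_state=0)

print("discrete_features=True:", mi_discrete)
print("default (auto):        ", mi_default)
```

If the one-hot columns in the real data are known, passing their indices or a boolean mask as `discrete_features` may give more consistent estimates for them, though that has not been verified on this dataset.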