#### https://www.kaggle.com/code/alexpengxiao/preprocessing-model-averaging-by-xgb-lgb-1-39/notebook

Preprocessing and model averaging with XGBoost + LightGBM

In this notebook, we preprocess the data, feed it to gradient-boosted tree models, and score 1.39 on the public leaderboard.

The workflow is as follows:

##### 1. Data preprocessing. The purpose of preprocessing is to improve time and space efficiency. The steps include rounding, removing constant features, removing duplicate features, and removing insignificant features. The key constraint is that preprocessing must not hurt accuracy.
##### 2. Feature transform. The purpose of the feature transform is to help the models extract the information in the data and to reduce overfitting. The steps include dropping features whose distributions differ between the training and test sets, adding statistical features, and adding low-dimensional representations as features.
##### 3. Modeling. We use two models, xgboost and lightgbm, and average their predictions for the final result.
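
The averaging in step 3 can be sketched as follows. This is a minimal illustration rather than the notebook's modeling code: `pred_xgb` and `pred_lgb` are hypothetical arrays standing in for each model's predictions on the `log1p(target)` scale (the scale the notebook trains on), and the equal weighting is an assumption.

```python
import numpy as np

def average_predictions(pred_a, pred_b, weight_a=0.5):
    """Blend two prediction arrays made on the log1p(target) scale.

    weight_a is a hypothetical blending weight; 0.5 gives a plain average.
    """
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    blended_log = weight_a * pred_a + (1.0 - weight_a) * pred_b
    # Undo the log1p transform to return to the original target scale.
    return np.expm1(blended_log)

# Hypothetical log-scale predictions from two models.
pred_xgb = np.array([14.0, 15.0, 13.5])
pred_lgb = np.array([14.2, 14.8, 13.7])
final = average_predictions(pred_xgb, pred_lgb)
```

Averaging on the log scale before calling `expm1` keeps the blend consistent with the scale the models were fitted on.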


##### Step 1: load the train and test data, drop duplicate columns, and round the features to NUM_OF_DECIMALS decimals. Here NUM_OF_DECIMALS is an empirically chosen value that can be tuned.
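
The rounding step is described above but not shown in the cells that follow. A minimal sketch on a toy frame, assuming a hypothetical `NUM_OF_DECIMALS = 2` (the notebook does not state the value it used), shows why rounding and duplicate removal go together:

```python
import numpy as np
import pandas as pd

NUM_OF_DECIMALS = 2  # hypothetical value; the notebook treats this as tunable

toy = pd.DataFrame({
    "a": [1.234, 5.678, 0.0],
    "b": [1.231, 5.682, 0.0],   # differs from "a" only beyond the 2nd decimal
    "c": [9.999, 0.001, 3.0],
})

rounded = toy.round(NUM_OF_DECIMALS)

# Near-identical columns can become exact duplicates once rounded.
same_before = np.array_equal(toy["a"].values, toy["b"].values)
same_after = np.array_equal(rounded["a"].values, rounded["b"].values)
```

Rounding before the duplicate check lets the check catch columns that are identical up to numerical noise.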

In [8]:
# Processing
import gc

import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, TruncatedSVD, FastICA
from sklearn.manifold import TSNE
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import RobustScaler, StandardScaler, normalize
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

# Plotting
import seaborn as sns

##### Load The Files and Get a Brief Overview

In [9]:
import warnings

warnings.filterwarnings("ignore")
train_df = pd.read_csv('/Users/sangth/Desktop/USF_Springboard/Capstone_2/Dataset/santander-value-prediction-challenge/train.csv')
test_df = pd.read_csv('/Users/sangth/Desktop/USF_Springboard/Capstone_2/Dataset/santander-value-prediction-challenge/test.csv')


In [10]:
train_df.head()

Unnamed: 0,ID,target,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000d6aaf2,38000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
1,000fbd867,600000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
2,0027d6b71,10000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
3,0028cbf45,2000000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0
4,002a68644,14400000.0,0.0,0,0.0,0,0,0,0,0,...,0.0,0.0,0.0,0,0,0,0,0,0,0


In [11]:
test_df.head()


Unnamed: 0,ID,48df886f9,0deb4b6a8,34b15f335,a8cb14b00,2f0771a37,30347e683,d08d1fbe3,6ee66e115,20aa07010,...,3ecc09859,9281abeea,8675bec0b,3a13ed79a,f677d4d13,71b203550,137efaa80,fb36b89d9,7e293fbaf,9fc776466
0,000137c73,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,00021489f,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0004d7953,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,00056a333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,00056d8eb,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


##### Set Up Train & Test X,Y

In [12]:
X_train = train_df.drop(["ID", "target"], axis=1)
y_train = np.log1p(train_df["target"].values)

X_test = test_df.drop(["ID"], axis=1)
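
A common reason to train on `np.log1p(target)` is that plain RMSE on the transformed values equals RMSLE on the raw values. The sketch below checks that identity, assuming the leaderboard metric is RMSLE; the `y_pred` values are made up for illustration.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original scale."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

def rmse(a, b):
    """Plain root mean squared error."""
    return np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

y_true = np.array([38000000.0, 600000.0, 10000000.0])   # targets from train_df.head()
y_pred = np.array([30000000.0, 900000.0, 12000000.0])   # made-up predictions

# RMSE measured on log1p values equals RMSLE on the raw values.
log_space_error = rmse(np.log1p(y_true), np.log1p(y_pred))
raw_space_error = rmsle(y_true, y_pred)
```

Predictions made on this scale must be passed through `np.expm1` before submission.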

In [13]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4459 entries, 0 to 4458
Columns: 4993 entries, ID to 9fc776466
dtypes: float64(1845), int64(3147), object(1)
memory usage: 169.9+ MB


In [14]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49342 entries, 0 to 49341
Columns: 4992 entries, ID to 9fc776466
dtypes: float64(4991), object(1)
memory usage: 1.8+ GB


#### Processing

##### Prepare Data

Checking for NaN values and removing constant features in the training data

In [15]:
# Columns that contain at least one NaN
nan_cols = train_df.columns[train_df.isnull().sum() != 0]
print("Total Train Features with NaN Values = " + str(nan_cols.size))
if nan_cols.size:
    print("Features with NaN => {}".format(list(nan_cols)))
    print(train_df[nan_cols].isnull().sum().sort_values(ascending=False))

Total Train Features with NaN Values = 0


In [16]:
# Count the zero entries of each feature.
# Note: X_train no longer contains "ID" or "target", so columns[2:] skips the first two features.
# The [0] lookup raises IndexError for any column that contains no zeros at all.
zero_count = []
for col in X_train.columns[2:]:
    zero_count.append([i[1] for i in list(X_train[col].value_counts().items()) if i[0] == 0][0])
    
print('{0} features of 4991 have zeroes in 99% or more samples.'.format(len([i for i in zero_count if i >= 4459 * 0.99])))
print('{0} features of 4991 have zeroes in 98% or more samples.'.format(len([i for i in zero_count if i >= 4459 * 0.98])))
print('{0} features of 4991 have zeroes in 97% or more samples.'.format(len([i for i in zero_count if i >= 4459 * 0.97])))
print('{0} features of 4991 have zeroes in 96% or more samples.'.format(len([i for i in zero_count if i >= 4459 * 0.96])))
print('{0} features of 4991 have zeroes in 95% or more samples.'.format(len([i for i in zero_count if i >= 4459 * 0.95])))

# Drop every feature that is zero in at least 98% of the training rows
cols_to_drop = [col for col in X_train.columns[2:] if [i[1] for i in list(X_train[col].value_counts().items()) if i[0] == 0][0] >= 4459 * 0.98]

X_train.drop(cols_to_drop, axis=1, inplace=True)
X_test.drop(cols_to_drop, axis=1, inplace=True)

print('\nTrain shape: {}\nTest shape: {}'.format(X_train.shape, X_test.shape))

2362 features of 4991 have zeroes in 99% or more samples.
2868 features of 4991 have zeroes in 98% or more samples.
3315 features of 4991 have zeroes in 97% or more samples.
3794 features of 4991 have zeroes in 96% or more samples.
3992 features of 4991 have zeroes in 95% or more samples.

Train shape: (4459, 2123)
Test shape: (49342, 2123)
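
The same zero-share filter can be written without per-column loops. The sketch below runs on a toy frame; `ZERO_THRESHOLD` and the column names are hypothetical, and the cut-off is lowered from the notebook's 0.98 so that the small example drops something.

```python
import pandas as pd

toy = pd.DataFrame({
    "f1": [0, 0, 0, 0],
    "f2": [0, 0, 0, 5],
    "f3": [1, 2, 0, 4],
})

ZERO_THRESHOLD = 0.75  # hypothetical cut-off for this toy frame; the notebook uses 0.98

# Boolean mask, then column-wise mean, gives the share of zeros per feature.
zero_fraction = (toy == 0).mean(axis=0)
sparse_cols = zero_fraction[zero_fraction >= ZERO_THRESHOLD].index.tolist()
reduced = toy.drop(columns=sparse_cols)
```

Unlike a `value_counts()` lookup, the mask-based version also handles columns that contain no zeros.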


In [17]:
colsToRemove = []
for col in X_train.columns:
    if X_train[col].std() == 0: 
        colsToRemove.append(col)
        
# remove constant columns from the training features
X_train.drop(colsToRemove, axis=1, inplace=True)

# remove the same columns from the test features
X_test.drop(colsToRemove, axis=1, inplace=True)

print("Removed `{}` Constant Columns\n".format(len(colsToRemove)))
print(colsToRemove)
print('\nTrain shape: {}\nTest shape: {}'.format(X_train.shape, X_test.shape))


Removed `0` Constant Columns

[]

Train shape: (4459, 2123)
Test shape: (49342, 2123)


##### Removing Duplicated Columns

In [18]:
colsToRemove = []
colsScaned = []
dupList = {}

columns = X_train.columns

for i in range(len(columns)-1):
    v = X_train[columns[i]].values
    dupCols = []
    for j in range(i+1,len(columns)):
        if np.array_equal(v, X_train[columns[j]].values):
            colsToRemove.append(columns[j])
            if columns[j] not in colsScaned:
                dupCols.append(columns[j]) 
                colsScaned.append(columns[j])
                dupList[columns[i]] = dupCols
                
# remove duplicate columns in the training set
X_train.drop(colsToRemove, axis=1, inplace=True) 

# remove duplicate columns in the testing set
X_test.drop(colsToRemove, axis=1, inplace=True)

# count the distinct columns that were dropped, not the number of duplicate groups
print("Removed `{}` Duplicate Columns\n".format(len(set(colsToRemove))))
print(dupList)

print('\nTrain shape: {}\nTest shape: {}'.format(X_train.shape, X_test.shape))


Removed `0` Duplicate Columns

{}

Train shape: (4459, 2123)
Test shape: (49342, 2123)
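
The nested loop above compares every pair of columns, which grows quadratically with the number of features. A possible single-pass alternative keys each column by the raw bytes of its values. This is a sketch, not the notebook's method; `find_duplicate_columns` is a hypothetical helper, and it assumes the columns are numeric and free of NaN (which the NaN check above reported for the training data).

```python
import numpy as np
import pandas as pd

def find_duplicate_columns(df):
    """Return columns whose values exactly repeat an earlier column."""
    first_seen = {}   # bytes of column values -> first column name with those values
    duplicates = []
    for col in df.columns:
        # Cast to float64 so an int column and a float column holding the
        # same values produce the same key.
        key = df[col].to_numpy(dtype=np.float64).tobytes()
        if key in first_seen:
            duplicates.append(col)
        else:
            first_seen[key] = col
    return duplicates

toy = pd.DataFrame({"a": [1, 0, 2], "b": [0.5, 0.0, 0.0], "c": [1.0, 0.0, 2.0]})
dup_cols = find_duplicate_columns(toy)
```

Each column is read once, so the cost is linear in the number of features.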


##### Drop Sparse Data

In [19]:
def drop_sparse(train, test):
    """Drop features that take fewer than two distinct values in the training set."""
    flist = [x for x in train.columns if x not in ['ID', 'target']]
    for f in flist:
        if len(np.unique(train[f])) < 2:
            train.drop(f, axis=1, inplace=True)
            test.drop(f, axis=1, inplace=True)
    return train, test

In [20]:
X_train, X_test = drop_sparse(X_train, X_test)

print('\nTrain shape: {}\nTest shape: {}'.format(X_train.shape, X_test.shape))


Train shape: (4459, 2123)
Test shape: (49342, 2123)


##### Add Features

SumZeros and SumValues: per-row counts of zero and non-zero feature values

In [21]:
def add_SumZeros(train, test, features):
    """Add a per-row count of zero-valued features."""
    # exclude the other count feature so it is never counted as an input
    flist = [x for x in train.columns if x not in ['ID', 'target', 'SumZeros', 'SumValues']]
    if 'SumZeros' in features:
        train.insert(1, 'SumZeros', (train[flist] == 0).astype(int).sum(axis=1))
        test.insert(1, 'SumZeros', (test[flist] == 0).astype(int).sum(axis=1))

    return train, test

In [22]:
# X_train, X_test = add_SumZeros(X_train, X_test, ['SumZeros'])

In [23]:
def add_SumValues(train, test, features):
    """Add a per-row count of non-zero features."""
    # exclude the other count feature so it is never counted as an input
    flist = [x for x in train.columns if x not in ['ID', 'target', 'SumZeros', 'SumValues']]
    if 'SumValues' in features:
        train.insert(1, 'SumValues', (train[flist] != 0).astype(int).sum(axis=1))
        test.insert(1, 'SumValues', (test[flist] != 0).astype(int).sum(axis=1))

    return train, test

In [24]:
# X_train, X_test = add_SumValues(X_train, X_test, ['SumValues'])
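
Since both calls are commented out, a toy example may help show what the two features contain. The frame and column names below are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"f1": [0, 3, 0], "f2": [0, 0, 7], "f3": [5, 0, 9]})
feature_cols = list(toy.columns)  # fix the column list before adding new features

# Row-wise counts of zero and non-zero entries over the original features only.
toy["SumZeros"] = (toy[feature_cols] == 0).sum(axis=1)
toy["SumValues"] = (toy[feature_cols] != 0).sum(axis=1)
```

Freezing `feature_cols` first keeps `SumZeros` from being counted as an input when `SumValues` is computed.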

Other Aggregates

In [25]:
def add_OtherAgg(train, test, features):
    """Add row-wise aggregates computed over the original feature columns only."""
    flist = [x for x in train.columns if x not in ['ID', 'target', 'SumZeros', 'SumValues']]
    if 'OtherAgg' in features:
        for df in (train, test):
            values = df[flist]  # fixed column list, so new aggregates do not feed into later ones
            df['Mean'] = values.mean(axis=1)
            df['Median'] = values.median(axis=1)
            df['Mode'] = values.mode(axis=1)[0]  # mode() returns a DataFrame; keep the first mode
            df['Max'] = values.max(axis=1)
            df['Var'] = values.var(axis=1)
            df['Std'] = values.std(axis=1)

    return train, test

##### K-Means

In [26]:
def kmeans(X_Tr, Xte):
    # freeze the feature list so earlier cluster labels are not used as inputs
    flist = [x for x in X_Tr.columns if x not in ['ID', 'target']]
    flist_kmeans = []
    for ncl in range(2, 11):
        cls = KMeans(n_clusters=ncl)
        # fit on the frame passed in, not on the global X_train
        cls.fit(X_Tr[flist].values)
        X_Tr['kmeans_cluster_' + str(ncl)] = cls.predict(X_Tr[flist].values)
        Xte['kmeans_cluster_' + str(ncl)] = cls.predict(Xte[flist].values)
        flist_kmeans.append('kmeans_cluster_' + str(ncl))
    print(flist_kmeans)

    return X_Tr, Xte
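
The fit-on-train, predict-on-test pattern behind the cluster features can be seen on synthetic data. The two blobs below are hypothetical stand-ins for the real feature matrix, and `n_init` and `random_state` are set only to make the sketch reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Two well-separated hypothetical blobs stand in for the real feature matrix.
train = np.vstack([rng.normal(0, 0.1, (20, 3)), rng.normal(5, 0.1, (20, 3))])
test = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
train_labels = model.fit_predict(train)   # fit on training rows only
test_labels = model.predict(test)         # reuse the fitted centroids for test rows
```

Because the centroids come from the training rows alone, a given cluster id means the same thing in both frames.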

##### PCA

In [27]:
def pca(X_Tr, Xte):
    flist = [x for x in X_Tr.columns if x not in ['ID', 'target']]
    n_components = 20
    flist_pca = []
    # StandardScaler is an estimator: fit it on the training features, then reuse it for the test features
    scaler = StandardScaler()
    pca_model = PCA(n_components=n_components)
    x_train_projected = pca_model.fit_transform(scaler.fit_transform(X_Tr[flist]))
    x_test_projected = pca_model.transform(scaler.transform(Xte[flist]))
    for npca in range(0, n_components):
        X_Tr.insert(1, 'PCA_' + str(npca + 1), x_train_projected[:, npca])
        Xte.insert(1, 'PCA_' + str(npca + 1), x_test_projected[:, npca])
        flist_pca.append('PCA_' + str(npca + 1))
    print(flist_pca)

    return X_Tr, Xte
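
One way to keep scaling and projection in step is to chain them in a scikit-learn pipeline. The sketch below uses random toy matrices and a hypothetical 2-component projection; it illustrates the pattern and is not the configuration used in the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(17)
train = rng.normal(size=(50, 6))   # toy stand-in for the training features
test = rng.normal(size=(10, 6))    # toy stand-in for the test features

# The pipeline fits the scaler and the PCA on the training rows together,
# then applies exactly the same two transforms to the test rows.
projector = make_pipeline(StandardScaler(), PCA(n_components=2, random_state=17))
train_proj = projector.fit_transform(train)
test_proj = projector.transform(test)
```

Calling only `transform` on the test rows ensures that no test statistics enter the scaling or the projection.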

In [29]:
print('\nTrain shape: {}\nTest shape: {}'.format(X_train.shape, X_test.shape))


Train shape: (4459, 2123)
Test shape: (49342, 2123)


In [32]:
PERC_TRESHOLD = 0.98   ### Percentage of zeros in each feature ###
N_COMP = 97            ### Number of decomposition components ###

print("\nStart decomposition process...")
print("PCA")
pca = PCA(n_components=N_COMP, random_state=17)
pca_results_train = pca.fit_transform(X_train)
pca_results_test = pca.transform(X_test)
print(pca.explained_variance_ratio_)

print("tSVD")
tsvd = TruncatedSVD(n_components=N_COMP, random_state=17)
tsvd_results_train = tsvd.fit_transform(X_train)
tsvd_results_test = tsvd.transform(X_test)

print("ICA")
ica = FastICA(n_components=N_COMP, random_state=17)
ica_results_train = ica.fit_transform(X_train)
ica_results_test = ica.transform(X_test)

print("GRP")
grp = GaussianRandomProjection(n_components=N_COMP, eps=0.1, random_state=17)
grp_results_train = grp.fit_transform(X_train)
grp_results_test = grp.transform(X_test)

print("SRP")
srp = SparseRandomProjection(n_components=N_COMP, dense_output=True, random_state=17)
srp_results_train = srp.fit_transform(X_train)
srp_results_test = srp.transform(X_test)

print("Append decomposition components to datasets...")
for i in range(1, N_COMP + 1):
    X_train['pca_' + str(i)] = pca_results_train[:, i - 1]
    X_test['pca_' + str(i)] = pca_results_test[:, i - 1]
    
    X_train['ica_' + str(i)] = ica_results_train[:, i - 1]
    X_test['ica_' + str(i)] = ica_results_test[:, i - 1]

    X_train['tsvd_' + str(i)] = tsvd_results_train[:, i - 1]
    X_test['tsvd_' + str(i)] = tsvd_results_test[:, i - 1]

    X_train['grp_' + str(i)] = grp_results_train[:, i - 1]
    X_test['grp_' + str(i)] = grp_results_test[:, i - 1]

    X_train['srp_' + str(i)] = srp_results_train[:, i - 1]
    X_test['srp_' + str(i)] = srp_results_test[:, i - 1]
print('\nTrain shape: {}\nTest shape: {}'.format(X_train.shape, X_test.shape))



Start decomposition process...
PCA
[0.05845484 0.0383209  0.03829271 0.03828013 0.03825617 0.03824973
 0.03823006 0.03821477 0.0381978  0.03816147 0.02786177 0.02214392
 0.01657821 0.0130679  0.01158438 0.01150749 0.01133923 0.01121784
 0.01113108 0.01036719 0.00946003 0.00854243 0.00795691 0.00719521
 0.00695735 0.00646891 0.00634419 0.006164   0.00599398 0.0059037
 0.00575507 0.00565423 0.00541038 0.0053179  0.00497676 0.00483699
 0.00471673 0.0045214  0.00443197 0.00438417 0.00427236 0.00419191
 0.00409204 0.0040596  0.00392973 0.00381137 0.00380971 0.00369276
 0.00360305 0.00348218 0.00341211 0.00322577 0.00320448 0.00308076
 0.00296015 0.00288679 0.0027445  0.00263927 0.0026055  0.00257357
 0.00252259 0.00248433 0.0024462  0.00232029 0.00231383 0.00225798
 0.00222129 0.00218594 0.00212406 0.0020847  0.0020516  0.00199384
 0.00191684 0.00184879 0.00181508 0.00177521 0.00175046 0.00172354
 0.00170877 0.00168471 0.00165735 0.00160635 0.00159153 0.00154126
 0.00153326 0.00150486 0.00

In [33]:
print(tsvd.explained_variance_ratio_)
sum1 = 0
for i in range(len(tsvd.explained_variance_ratio_)):
    sum1 = sum1 + tsvd.explained_variance_ratio_[i]

print(sum1)

[0.05794651 0.03830152 0.03826978 0.03829257 0.03824949 0.03824868
 0.03823126 0.03822519 0.03820725 0.03819589 0.02793447 0.02233509
 0.01658305 0.01311583 0.01158509 0.01150749 0.01133952 0.01121794
 0.0111313  0.01036726 0.00946694 0.00854255 0.00795859 0.00719521
 0.00696026 0.00646913 0.00634428 0.00616397 0.00599341 0.00590364
 0.00575137 0.00566054 0.00541042 0.0053211  0.00497676 0.00483614
 0.00471498 0.0045274  0.00443353 0.00438627 0.00427261 0.00419301
 0.00409213 0.00405966 0.00392977 0.00381137 0.00380971 0.00369268
 0.00360112 0.00347747 0.00341188 0.00320911 0.00320336 0.00309682
 0.00296604 0.00288679 0.0027413  0.00264533 0.00260542 0.00258658
 0.00253314 0.00248415 0.00246391 0.00233773 0.00231945 0.00225922
 0.00222218 0.00218558 0.00212328 0.00208423 0.00205147 0.00199296
 0.00191693 0.00184655 0.00181976 0.0017785  0.00174848 0.00172015
 0.00170724 0.00167963 0.00165302 0.00160432 0.00158966 0.0015377
 0.00152632 0.00149971 0.00144064 0.00141479 0.00139959 0.00139

In [34]:
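# use a name that does not shadow the built-in sum()
pca_variance_total = 0
for i in range(len(pca.explained_variance_ratio_)):
    pca_variance_total = pca_variance_total + pca.explained_variance_ratio_[i]

print(pca_variance_total)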

0.8041603444003791
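
The printed total shows that the 97 PCA components account for about 80% of the variance. A cumulative sum makes it easy to ask the reverse question: how many components are needed for a given share? The helper below is a hypothetical sketch, and the example uses only the first five ratios printed by the PCA cell, rounded to four decimals.

```python
import numpy as np

def components_for_variance(ratios, target=0.80):
    """Smallest number of leading components whose ratios sum to at least `target`.

    Returns None when even all components fall short of the target.
    """
    cumulative = np.cumsum(ratios)
    reached = np.nonzero(cumulative >= target)[0]
    return int(reached[0]) + 1 if reached.size else None

# First five ratios printed by the PCA cell above, rounded to 4 decimals.
ratios = np.array([0.0585, 0.0383, 0.0383, 0.0383, 0.0383])
n_needed = components_for_variance(ratios, target=0.10)
```

Applied to the full `pca.explained_variance_ratio_` array, the same function could guide the choice of `N_COMP`.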
