# LSA, LGBM Baseline + Experiments

**In this notebook, I try TF-IDF followed by SVD (a.k.a. latent semantic analysis, or LSA).** It doesn't seem to work well, probably because the problem requires a deeper analysis of the relationships between words, and LSA doesn't keep track of their relative positions. We can't even compute trigrams, as they don't fit in memory.
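
To make the word-order limitation concrete, here is a minimal standard-library sketch. It does not use `TfidfVectorizer`; the two sentences and the `ngram_counts` helper are made up for illustration. With unigrams only, two sentences with opposite meanings get identical counts. Adding bigrams tells them apart, but the vocabulary grows quickly with the n-gram order.

```python
from collections import Counter

def ngram_counts(text, n_min=1, n_max=1):
    """Count word n-grams of length n_min..n_max (hypothetical helper)."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

a = "dog bites man"
b = "man bites dog"

# Unigrams only: word order is lost, so both sentences look the same.
unigrams_equal = ngram_counts(a) == ngram_counts(b)

# Unigrams + bigrams: the bigrams differ, so the sentences are distinguishable.
bigrams_equal = ngram_counts(a, 1, 2) == ngram_counts(b, 1, 2)

print(unigrams_equal, bigrams_equal)
```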

However, I experimented with a few interesting things. Things that worked:

- **Measuring system usage** while the kernel is running. Head here to see how it works: https://www.kaggle.com/masterscrat/monitoring-system-usage

- **`tqdm` works well for more things than I expected.** For example, it gives useful output when ingesting data in `TfidfVectorizer` or `TruncatedSVD`.

- **Logging output to standard output and *commit logs* at the same time.** This means you can keep track of the progress of your kernel while it's committing. Quite useful in practice.

- **Simple t-SNE viz**
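
The `tqdm` trick works because the wrapped object is still an ordinary iterable: whatever consumes it pulls items one at a time, and the wrapper reports progress as a side effect. Here is a minimal standard-library sketch of that idea. `with_progress` and `count_tokens` are hypothetical stand-ins, not `tqdm` or scikit-learn APIs.

```python
import sys

def with_progress(items, every=2, out=sys.stderr):
    """Yield items unchanged, printing a progress line every `every` items."""
    total = len(items)
    for i, item in enumerate(items, start=1):
        if i % every == 0 or i == total:
            print("processed {}/{}".format(i, total), file=out)
        yield item

def count_tokens(docs):
    """Stand-in for any consumer that simply iterates over its input."""
    return sum(len(doc.split()) for doc in docs)

docs = ["a b c", "d e", "f", "g h i j", "k"]
n_tokens = count_tokens(with_progress(docs))
print(n_tokens)
```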

Things that didn't work out:

- **Parallelizing TF-IDF.** In theory, you can use all available cores for the `TfidfVectorizer` `fit` operation. It does speed things up, but it causes out of memory errors when using the full dataset.

- **Loading LGBM `Dataset`s from files.** Reportedly, this would prevent the massive memory use that occurs when LGBM starts up. In practice, it seems to use more memory, and fails with an out of memory error when using the full dataset (while things load properly using the regular way on that same dataset).

Changelog:

- **v36** Added system monitoring, more LGB leaves, removed cruft

- **v35** Made public

In [31]:
# PARAMETERS
TOY_MODE = False
tfidf_ngram_range = (1, 2) # trigrams don't fit in memory
svd_n_components = 64 # 96 doesn't fit in memory
lgb_num_leaves = 65

PARALLELIZE_TF_IDF = False # doesn't work
LOAD_LGBM_DATA_FROM_FILES = False # doesn't work

competition_files_path = '../input/jigsaw-unintended-bias-in-toxicity-classification/'

In [32]:
### MONITORING ###
import psutil
import numpy as np
import matplotlib.pyplot as plt
import os
import time
import multiprocessing
from IPython.display import clear_output
from collections import deque

class SystemMonitorProcess:
    def __init__(self, start_timestamp, update_interval=0.1):
        self.update_interval = update_interval
        self.cpu_nums = psutil.cpu_count()
        self.max_mem = psutil.virtual_memory().total
        self.sysCpuLogs = deque()
        self.sysMemLogs = deque()
        self.timeLogs = deque()
        self.start_time = start_timestamp

    def get_system_info(self):
        cpu_percent = psutil.cpu_percent(interval=0.0, percpu=False)
        mem_percent = float(psutil.virtual_memory().used) / self.max_mem * 100
        return cpu_percent, mem_percent
        
    def monitor(self):
        while True:
            time.sleep(self.update_interval)
            sCpu, sMem = self.get_system_info()  
            self.sysCpuLogs.append(sCpu)
            self.sysMemLogs.append(sMem)
            self.timeLogs.append(time.time() - self.start_time)
            logs.update({
                'sysCpuLogs': self.sysCpuLogs,
                'sysMemLogs': self.sysMemLogs,
                'time': self.timeLogs
            })
            
class SystemMonitor:
    def __init__(self, update_interval=0.1):
        self.graph = None
        self.update_interval = update_interval
        self.start_timestamp = time.time()
        self.msgs = []
    
    def monitor(self):
        self.graph = SystemMonitorProcess(self.start_timestamp, self.update_interval)
        self.graph.monitor()
        
    def annotate(self, msg):
        self.msgs.append([time.time() - self.start_timestamp, msg])
        
    def plot(self):
        if 'sysCpuLogs' not in logs:
            print('No data yet.')
            return

        fig = plt.figure(figsize=(20,3))
        plt.ylabel('usage (%)')
        
        # FIXME display running time on primary X axis!
        ax = plt.axes()
        #plt.xlabel('running time (s)')
        
        ax2 = ax.twiny()
        ax2.plot(list(logs['time']), logs['sysCpuLogs'], label="cpu")
        ax2.plot(list(logs['time']), logs['sysMemLogs'], label="mem")
        ax2.set_xticks([msg[0] for msg in self.msgs])
        ax2.set_xticklabels([msg[1] for msg in self.msgs], rotation=90)

        ax2.legend(loc='best')
        plt.show()
        
sm = SystemMonitor(1.0) # polling interval, in seconds
logs = multiprocessing.Manager().dict()
smp = multiprocessing.Process(target=sm.monitor)
smp.start()

In [33]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import datetime
import lightgbm as lgb
from scipy import stats
from scipy.sparse import hstack, csr_matrix
from sklearn.model_selection import train_test_split, cross_val_score
from collections import Counter
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
pd.set_option('max_colwidth',400)

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from tqdm import tqdm_notebook as tqdm
import os
import shlex
import gc

# TensorFlow is not used anywhere in this notebook, so only NumPy is seeded
from numpy.random import seed
seed(42)

In [34]:
# print message both to standard output and to Kaggle commit logs
# inspired from https://www.kaggle.com/alexktn/logs-in-commits/log
# shlex.quote exists specifically to shell-escape strings!
def debug(msg):
    os.system('echo ' + shlex.quote(msg))
    sm.annotate(msg)
    print(msg)
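
As a quick illustration of why `shlex.quote` matters here, the sketch below (standard library only; the message is a made-up example) shows that a quoted message survives shell-style parsing as a single literal argument, even when it contains quotes and shell metacharacters.

```python
import shlex

msg = "it's 50% done; rm -rf / && echo $HOME"

quoted = shlex.quote(msg)
command = "echo " + quoted

# shlex.split tokenizes the command line the way a POSIX shell would:
# the whole message comes back as one literal argument.
args = shlex.split(command)
print(args)
```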

In [35]:
import time
debug('Started {}'.format(time.strftime('%d %b %Y at %H:%M:%S', time.localtime())))

Started 06 Apr 2019 at 22:23:07


In [36]:
train = pd.read_csv(competition_files_path + 'train.csv')
test = pd.read_csv(competition_files_path + 'test.csv')

print(train.shape)
print(test.shape)

(1804874, 45)
(97320, 2)


In [37]:
if TOY_MODE:
    train = train[:int(len(train)/32)]
    test = test[:int(len(test)/32)]

In [38]:
train = train[['comment_text', 'target']]
test = test[['comment_text']]

train['comment_text'] = train['comment_text'].astype(str)
test['comment_text'] = test['comment_text'].astype(str)

## TFIDF

In [39]:
debug('Starting TFIDF, ngram range {}...'.format(tfidf_ngram_range))

tfv = TfidfVectorizer(min_df=3,
                      max_features=None, 
                      strip_accents='unicode', 
                      analyzer='word', 
                      token_pattern=r'(?u)\b\w+\b',  
                      ngram_range=tfidf_ngram_range,
                      use_idf=True,       # booleans, not ints: recent scikit-learn validates parameter types
                      smooth_idf=True,
                      sublinear_tf=True
                     )

Starting TFIDF, ngram range (1, 2)...
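
The custom `token_pattern` differs from scikit-learn's default, `(?u)\b\w\w+\b`, in that it also keeps single-character tokens. A small sketch with the standard `re` module (the sample sentence is made up) shows the difference:

```python
import re

text = "I am a U2 fan"

notebook_pattern = r"(?u)\b\w+\b"    # pattern used in this notebook
default_pattern = r"(?u)\b\w\w+\b"   # scikit-learn's default: 2+ characters

tokens_notebook = re.findall(notebook_pattern, text)
tokens_default = re.findall(default_pattern, text)

print(tokens_notebook)
print(tokens_default)
```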


In [40]:
full_text = list(train['comment_text'].values) + list(test['comment_text'].values)

In [41]:
%%time
if PARALLELIZE_TF_IDF:
    debug('Fit...')
    tfv.fit(tqdm(full_text));

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 9.78 µs


In [42]:
%%time
# Tried out parallel transform, but fails with out of memory, to investigate.
# https://github.com/rafaelvalero/ParallelTextProcessing/blob/master/parallelizing_text_processing.ipynb

if PARALLELIZE_TF_IDF:
    import multiprocessing
    from multiprocessing import Pool
    import scipy.sparse as sp

    num_cores = multiprocessing.cpu_count()

    def chunks(l, n):
        for i in range(0, len(l), n):
            yield l[i:i + n]

    def parallelize_dataframe(df, func):
        pool = Pool(num_cores)
        df = sp.vstack(pool.map(func, chunks(df, 512)), format='csr')
        pool.close()
        pool.join()
        return df

    def test_func(data):
        tfidf_matrix = tfv.transform(data)
        return tfidf_matrix

    tfidf_col = parallelize_dataframe(full_text, test_func)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 11 µs


In [43]:
if not PARALLELIZE_TF_IDF:
    tfidf_col = tfv.fit_transform(tqdm(full_text))


In [44]:
del full_text, tfv

In [45]:
print(tfidf_col.shape)
tfidf_col

(59443, 176887)


<59443x176887 sparse matrix of type '<class 'numpy.float64'>'
	with 4739092 stored elements in Compressed Sparse Row format>
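
A back-of-the-envelope calculation from the numbers printed above shows why the TF-IDF matrix has to stay sparse. The sketch assumes 8-byte `float64` values and 4-byte integer indices for the CSR arrays. The index width is an assumption; SciPy may use wider indices for very large matrices.

```python
n_rows, n_cols, nnz = 59443, 176887, 4739092

density = nnz / (n_rows * n_cols)

# Dense float64 storage: one 8-byte value per cell.
dense_bytes = n_rows * n_cols * 8

# CSR storage: value (8 bytes) + column index (4 bytes) per stored element,
# plus a row-pointer array of n_rows + 1 entries (4 bytes each).
csr_bytes = nnz * (8 + 4) + (n_rows + 1) * 4

print("density: {:.4%}".format(density))
print("dense:   {:.1f} GiB".format(dense_bytes / 2**30))
print("CSR:     {:.1f} MiB".format(csr_bytes / 2**20))
```

Under these assumptions the dense form would take tens of gigabytes, while the CSR form takes tens of megabytes.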

## SVD

In [46]:
debug('Starting SVD, reducing feature dimension from {} to {}...'.format(tfidf_col.shape[1], svd_n_components))

Starting SVD, reducing feature dimension from 176887 to 64...


In [None]:
from sklearn.decomposition import TruncatedSVD

svd_ = TruncatedSVD(n_components=svd_n_components, random_state=1337)

svd_col = svd_.fit_transform(tfidf_col)
svd_col = pd.DataFrame(svd_col)
svd_col = svd_col.add_prefix('TFIDF_')

del svd_, tfidf_col

In [None]:
svd_col.head()

In [None]:
X_train = svd_col[0:len(train)]
X_test = svd_col[len(train):]

In [None]:
from sklearn.manifold import TSNE

X_embedded = TSNE(n_components=2).fit_transform(X_train[0:1500])
dftsne = pd.DataFrame(X_embedded, columns=['x','y'])
dftsne['class'] = train['target'] > 0.5

# I'm no t-SNE expert but this looks pretty unpromising
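# x, y and data are passed by keyword: recent seaborn versions reject them as positional arguments
ax = sns.lmplot(x='x', y='y', data=dftsne, hue='class', fit_reg=False, height=8, scatter_kws={'alpha':0.7,'s':60})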

## LGBM

In [None]:
debug('Starting LGBM...')

In [None]:
from math import sqrt
from sklearn.metrics import cohen_kappa_score, mean_squared_error
from sklearn.metrics import roc_auc_score

def rmse(actual, predicted):
    return sqrt(mean_squared_error(actual, predicted))
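
The notebook reports both RMSE and ROC AUC. As a reminder of how they differ, here is a standard-library sketch with made-up labels and scores. `rmse_plain` and `pairwise_auc` are hypothetical helpers, not the scikit-learn functions. AUC depends only on how the predictions rank positives against negatives, while RMSE also depends on their scale.

```python
from math import sqrt

def rmse_plain(actual, predicted):
    """Root mean squared error, written out without scikit-learn."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7]
halved = [s / 2 for s in scores]  # same ranking, different scale

print(pairwise_auc(labels, scores), pairwise_auc(labels, halved))
print(rmse_plain(labels, scores), rmse_plain(labels, halved))
```

Halving every score leaves the AUC unchanged but changes the RMSE, so a model can rank comments well and still score poorly on RMSE.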

In [None]:
# https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst

lgb_params = {'application': 'regression', 
              'boosting': 'gbdt',
              'metric': 'rmse',
              'num_leaves': lgb_num_leaves,
              'max_depth': -1,
              'learning_rate': 0.01,
              'bagging_fraction': 0.85,
              'feature_fraction': 0.8,
              'min_split_gain': 0.02,
              'min_child_samples': 150,
              'min_child_weight': 0.02,
              'lambda_l2': 0.0475,
              'verbosity': -1,
              'data_random_seed': 17,
              'early_stop': 600,
              'max_bin': 255,
              'verbose_eval': 500,
              'num_rounds': 10000,
              
              # only used when loading from files
              'has_header': True,
              'label_column': 'name:label',
              'use_two_round_loading': True
             }

In [None]:
# CV code from https://www.kaggle.com/skooch/petfinder-simple-lgbm-baseline

from sklearn.model_selection import KFold

N_SPLITS = 5

def run_cv_model(train, test, target, model_fn, params={}, eval_fn=None, label='model'):
    kf = KFold(n_splits=N_SPLITS, random_state=42, shuffle=True)
    fold_splits = kf.split(train, target)
    cv_scores = []
    roc_scores = []
    pred_full_test = 0
    pred_train = np.zeros((train.shape[0], N_SPLITS))
    feature_importance_df = pd.DataFrame()
    
    i = 1
    for dev_index, val_index in fold_splits:
        print('{} fold {}/{}'.format(label, i, N_SPLITS))
        
        if isinstance(train, pd.DataFrame):
            dev_X, val_X = train.iloc[dev_index], train.iloc[val_index]
            dev_y, val_y = target[dev_index], target[val_index]
        else:
            dev_X, val_X = train[dev_index], train[val_index]
            dev_y, val_y = target[dev_index], target[val_index]
        params2 = params.copy()
        
        ###
        pred_val_y, pred_test_y, importances, roc = model_fn(dev_X, dev_y, val_X, val_y, test, params2)
        ###
        
        pred_full_test = pred_full_test + pred_test_y
        pred_train[val_index] = pred_val_y
        if eval_fn is not None:
            cv_score = eval_fn(val_y, pred_val_y)
            cv_scores.append(cv_score)
            roc_scores.append(roc)
            debug(label + ' CV score {}/{}: ROC {}'.format(i, N_SPLITS, roc))
            
        fold_importance_df = pd.DataFrame()
        fold_importance_df['feature'] = train.columns.values
        fold_importance_df['importance'] = importances
        fold_importance_df['fold'] = i
        
        feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)        
        i += 1
        
    print('{} CV RMSE scores : {}'.format(label, cv_scores))
    print('{} CV mean RMSE score : {}'.format(label, np.mean(cv_scores)))
    print('{} CV std RMSE score : {}'.format(label, np.std(cv_scores)))
    print('{} CV ROC scores : {}'.format(label,  roc_scores))
    print('{} CV mean ROC score : {}'.format(label, np.mean(roc_scores)))
    print('{} CV std ROC score : {}'.format(label, np.std(roc_scores)))
    
    pred_full_test = pred_full_test / float(N_SPLITS)
    
    results = {'label': label,
               'train': pred_train, 'test': pred_full_test,
                'cv': cv_scores, 'roc': roc_scores,
               'importance': feature_importance_df
              }
    return results
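
For readers new to this pattern, the bookkeeping in `run_cv_model` boils down to two things: each training row gets exactly one out-of-fold prediction, and test predictions are averaged over the folds. A stripped-down sketch using only the standard library follows. The fold splitting and the "model" (which always predicts the mean training target) are simplified stand-ins, not scikit-learn's `KFold` or LightGBM.

```python
import random

def kfold_indices(n_samples, n_splits, seed=42):
    """Shuffle row indices and deal them into n_splits validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_splits] for k in range(n_splits)]

def mean_model(train_targets):
    """Toy 'model': always predicts the mean of the targets it was fit on."""
    return sum(train_targets) / len(train_targets)

targets = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
n_test, n_splits = 3, 5

oof_pred = [None] * len(targets)   # one out-of-fold prediction per training row
test_pred = [0.0] * n_test         # running sum of per-fold test predictions

for val_idx in kfold_indices(len(targets), n_splits):
    val_set = set(val_idx)
    dev_targets = [t for i, t in enumerate(targets) if i not in val_set]
    prediction = mean_model(dev_targets)
    for i in val_idx:
        oof_pred[i] = prediction   # predicted by a model that never saw row i
    test_pred = [p + prediction for p in test_pred]

test_pred = [p / n_splits for p in test_pred]  # average over folds
print(oof_pred)
print(test_pred)
```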

In [None]:
def plot_importance(feature_importance, columns):
    feature_imp = pd.DataFrame(sorted(zip(feature_importance, columns)), columns=['Value','Feature'])

    plt.figure(figsize=(20, 20))
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
    plt.title('LightGBM Features (avg over folds)')
    plt.tight_layout()
    # Save before show(): once the figure has been shown, savefig may write an empty image
    plt.savefig('lgbm_importances-01.png')
    plt.show()

In [None]:
import lightgbm as lgb

def run_lgb(train_X, train_y, test_X, test_y, test_X2, params):
    
    if LOAD_LGBM_DATA_FROM_FILES:
        # Tried dumping data to CSV before loading to LGBM, as recommended from:
        # https://github.com/Microsoft/LightGBM/issues/1032
        # BUT didn't work, made problem worse
        
        # TODO try to *append* to CSV instead of doing concat in memory
        # https://stackoverflow.com/questions/17530542/how-to-add-pandas-data-to-an-existing-csv-file

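        # Build the CSV frames under new names so that test_X and test_y
        # stay available for the predict and scoring calls further down.
        train_df = pd.concat([train_y, train_X], axis=1)
        train_df = train_df.rename(columns={ train_df.columns[0]: "label" })
        train_df.to_csv('train_X.csv', index=False)  # index=False: don't write the row index as an extra column
        del train_df, train_X

        valid_df = pd.concat([test_y, test_X], axis=1)
        valid_df = valid_df.rename(columns={ valid_df.columns[0]: "label" })
        valid_df.to_csv('test_X.csv', index=False)
        del valid_df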

        gc.collect()

        d_train = lgb.Dataset('train_X.csv', free_raw_data=True)
        d_train.raw_data = None

        d_valid = lgb.Dataset('test_X.csv', free_raw_data=True)
        d_valid.raw_data = None

        gc.collect()
    
    else:
        d_train = lgb.Dataset(train_X, label=train_y, free_raw_data=True)
        d_valid = lgb.Dataset(test_X, label=test_y, free_raw_data=True)
        del train_X, train_y
    
    watchlist = [d_train, d_valid]
    num_rounds = params.pop('num_rounds')
    verbose_eval = params.pop('verbose_eval')
    early_stop = None
    if params.get('early_stop'):
        early_stop = params.pop('early_stop')
        
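    # NOTE: verbose_eval and early_stopping_rounds are lgb.train arguments in
    # LightGBM < 4.0; from 4.0 on they are replaced by the lgb.log_evaluation()
    # and lgb.early_stopping() callbacks.
    model = lgb.train(params,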
                      train_set=d_train,
                      num_boost_round=num_rounds,
                      valid_sets=watchlist,
                      verbose_eval=verbose_eval,
                      early_stopping_rounds=early_stop,
                      )
    
    print('Computing score...')
    pred_test_y = model.predict(test_X, num_iteration=model.best_iteration)
    roc = roc_auc_score(test_y > 0.5, pred_test_y)
    
    print('Predicting on test set...')
    pred_test_y2 = model.predict(test_X2, num_iteration=model.best_iteration)
    
    return pred_test_y.reshape(-1, 1), pred_test_y2.reshape(-1, 1), model.feature_importance(), roc

In [None]:
results = run_cv_model(X_train, X_test, train['target']  > 0.5, run_lgb, lgb_params, rmse, 'LGB')

In [None]:
# feature importance
plot_importance(results['importance']['importance'], results['importance']['feature'])

## Save results

In [None]:
test_predictions = [r[0] for r in results['test']]
sub = pd.read_csv(competition_files_path + 'sample_submission.csv')

if not TOY_MODE:
    assert sub.shape[0] == len(test_predictions)
    debug('Saving...')
    sub['prediction'] = test_predictions
    sub.to_csv('submission.csv', index=False)
    print(sub.head())  # a bare expression inside an if block is not displayed by the notebook
    
else:
    print("Toy mode, won't save.")

In [None]:
sm.plot()
smp.terminate()