<h3> In this notebook I write a simple Pandas loop to shard the very large numeric and categorical training data into smaller files. </h3>
    
    
<h5> This approach does not scale well to really large datasets; for those it would have to run overnight, and Spark may offer a faster route. The loop also upsamples the training data so that the rare positive (defective) class is better balanced against the majority class. That lets me run a regular DNNClassifier without writing a custom model. </h5>
    
<h5> I write code to shard the numeric training and test data separately, and later concatenate the numeric and categorical data into a single set of files. </h5>
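
The upsampling idea can be sketched on a toy frame. This is only an illustration under stated assumptions: the helper name `upsample_positive` and the toy data are hypothetical, and it uses pandas' own `DataFrame.sample` in place of the scikit-learn `resample` and `shuffle` helpers that the notebook code relies on.

```python
import pandas as pd

def upsample_positive(df, label_col, n_samples, seed=42):
    """Append n_samples rows drawn with replacement from the positive class."""
    positives = df[df[label_col] == 1]
    if positives.empty:
        # Nothing to resample from, so return the frame unchanged.
        return df
    extra = positives.sample(n=n_samples, replace=True, random_state=seed)
    # sample(frac=1) returns all rows in random order, i.e. a shuffle.
    return pd.concat([df, extra]).sample(frac=1, random_state=seed)

# Hypothetical toy data: 8 negative rows and 2 positive rows.
toy = pd.DataFrame({"feature": range(10), "Response": [0] * 8 + [1] * 2})
balanced = upsample_positive(toy, "Response", n_samples=6)
print(balanced["Response"].value_counts())
```

The guard for an empty positive class is an addition of this sketch; a chunk with no defective rows would otherwise leave nothing to sample from.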

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.utils import resample, shuffle

data_path = os.getcwd() + '/FInal_Numeric_Datsets/shuffled_train_numeric_train.csv'

def get_sharded_datasets(full_data_path):
    ''' Takes data from your data path and shards the numerical data into files after upsampling the defective class
    by adding 25000 resampled rows per chunk. Only use this for training'''
    chunksize = 50000
    for i, chunk in enumerate(pd.read_csv(full_data_path, chunksize=chunksize)):
        print("Chunk Number: {}".format(i))
        chunk = chunk.drop(columns=['Id'])  # drop the Id column
        temp_df = chunk[chunk['Response'] == 1]  # rows of the defective (positive) class
        print(len(temp_df))
        # sample 25000 positive rows with replacement and append them to the chunk
        pos_df = resample(temp_df, n_samples=25000, random_state=42, replace=True)
        full_df = shuffle(pd.concat([chunk, pos_df]), random_state=42)
        print(len(full_df))
        # index=False makes sure an additional index column doesn't get added
        full_df.to_csv(os.path.dirname(full_data_path) + "/numeric_data_50k_{}.csv".format(i),
                       index=False, header=False)
    print("Done")

data_path_cat = os.getcwd() + '/FInal_Numeric_Datsets/shuffled_train_categorical_train.csv'
def get_sharded_categorical_datasets(full_data_path):
    ''' Takes the path to your categorical data and shards it into files for training'''
    chunksize = 50000
    for i, chunk in enumerate(pd.read_csv(full_data_path, chunksize=chunksize)):
        print("Chunk Number: {}".format(i))
        chunk = chunk.drop(columns=['Id'])  # drop the Id column
        # index=False makes sure an additional index column doesn't get added
        chunk.to_csv(os.path.dirname(full_data_path) + "/categorical_data_50k_{}.csv".format(i),
                     index=False, header=False)
    print("Done")

data_path_test = os.getcwd() + '/FInal_Numeric_Datsets/shuffled_train_numeric_test.csv'
def get_sharded_test_sets(full_data_path):
    ''' Generates a sharded test set from the numerical data only'''
    chunksize = 50000
    for i, chunk in enumerate(pd.read_csv(full_data_path, chunksize=chunksize)):
        print("Chunk Number: {}".format(i))
        chunk = chunk.drop(columns=['Id'])  # drop the Id column
        # index=False makes sure an additional index column doesn't get added
        chunk.to_csv(os.path.dirname(full_data_path) + "/test_numeric_data_50k_{}.csv".format(i),
                     index=False, header=False)
    print("Done")

In [None]:
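get_sharded_datasets(data_path)  # upsampled shards, training data only
get_sharded_test_sets(data_path_test)  # test shards, no upsampling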

In [None]:
get_sharded_categorical_datasets(data_path_cat)

<h3> Full Categorical and Numerical Dataset.
    
<h4> Only a tiny subset of the categorical columns actually have ANY data. I use a threshold of 0.001, which leaves me with only 31 categorical columns out of a few thousand. This is how sparse the data is. </h4>
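
One way to see how sparse a set of columns is can be sketched as follows. This is an illustration only: it measures the fraction of non-null rows per column, which is an assumption of this sketch and not necessarily the criterion used by the notebook's own column filter. The helper name `non_null_fraction`, the toy data and the 0.5 cut-off are all hypothetical.

```python
import numpy as np
import pandas as pd

def non_null_fraction(data):
    """Fraction of rows in each column that hold a value."""
    return data.notna().mean()

# Hypothetical toy data: one nearly empty, one empty and one fully populated column.
toy = pd.DataFrame({
    "mostly_empty": [np.nan, np.nan, np.nan, "T1"],
    "empty": [np.nan] * 4,
    "full": ["a", "b", "a", "c"],
})

fractions = non_null_fraction(toy)
# Keep the columns whose non-null fraction reaches the (toy) cut-off.
dense_cols = fractions[fractions >= 0.5].index.tolist()
print(fractions)
print(dense_cols)
```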
    
<h4> In order to run ML at scale, I combine the numerical and categorical columns after loading them in chunks,
     upsample the data to enhance the positive class (for training only) and subsequently save the data into sharded
    CSVs that I can read into TensorFlow. </h4>
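
The chunk-wise combination described above can be sketched with two small in-memory CSVs. This is a minimal illustration, assuming both files list the same rows in the same order; the toy data and variable names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the numeric and categorical files.
numeric_csv = "Id,x,Response\n1,0.1,0\n2,0.2,1\n3,0.3,0\n4,0.4,0\n"
categorical_csv = "Id,c\n1,T1\n2,\n3,T2\n4,T1\n"

numeric_reader = pd.read_csv(io.StringIO(numeric_csv), chunksize=2)
categorical_reader = pd.read_csv(io.StringIO(categorical_csv), chunksize=2, dtype=str)

shards = []
# enumerate supplies the shard number; zip walks both readers in lockstep.
for i, (num_chunk, cat_chunk) in enumerate(zip(numeric_reader, categorical_reader)):
    # Both readers use the same chunksize, so paired chunks share the same row index
    # and a column-wise concat lines the rows up.
    combined = pd.concat(
        [num_chunk.drop(columns=["Id"]), cat_chunk.drop(columns=["Id"])], axis=1
    )
    shards.append(combined)
    print(i, combined.shape)
```

Wrapping `zip` in `enumerate` keeps a shard counter available for naming the output files.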

In [None]:
test_cat_col = pd.read_csv(os.getcwd() + '/FInal_Numeric_Datsets/shuffled_train_categorical_test.csv'
                           , dtype = str, chunksize = 10000).get_chunk()

In [None]:
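def get_dense_columns(data, threshold):
    ''' extracts the dense(r) columns from the dataset'''
    relevant_cols = []
    for column in data.columns:
        # number of distinct non-null values in the column, divided by the number of columns in the data
        ratio = len(data[column].value_counts()) / data.shape[1]
        # keep the column only if the ratio reaches the threshold
        if ratio >= threshold:
            relevant_cols.append(column)
    return relevant_cols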

In [None]:
relevant_cols = get_dense_columns(test_cat_col, 0.001)
len(relevant_cols)

In [None]:
import json
with open('categorical_cols.txt', 'w') as f:
    json.dump(relevant_cols, f)

In [None]:
# Only keep these 31 columns.
for col in relevant_cols:
    print(test_cat_col[col].dtype)

In [1]:
def get_sharded_full_datasets(full_data_path, full_cat_data_path, relevant_cols):
    ''' Training dataset combining all numerical and 31 categorical columns. Upsampled'''
    chunksize = 50000
    numeric_chunks = pd.read_csv(full_data_path, chunksize=chunksize)
    cat_chunks = pd.read_csv(full_cat_data_path, chunksize=chunksize, usecols=relevant_cols, dtype=str)
    # enumerate provides the chunk number i used in the output file name
    for i, (chunk1, chunk2) in enumerate(zip(numeric_chunks, cat_chunks)):
        print("Chunk Number: {}".format(i))
        chunk1 = chunk1.drop(columns=['Id'])
        chunk2 = chunk2.drop(columns=['Id'])
        full_chunk = pd.concat([chunk1, chunk2], axis=1)  # numeric and categorical columns side by side
        temp_df = full_chunk[full_chunk['Response'] == 1]  # rows of the positive class
        print(len(temp_df))
        pos_df = resample(temp_df, n_samples=25000, random_state=42, replace=True)
        full_df = shuffle(pd.concat([full_chunk, pos_df]), random_state=42)
        print(len(full_df))
        # index=False makes sure an additional index column doesn't get added
        full_df.to_csv(os.path.dirname(full_data_path) + "/entire_data_50k_{}.csv".format(i),
                       index=False, header=False)
    print("Done")


In [None]:
get_sharded_full_datasets(data_path, data_path_cat, relevant_cols)

In [2]:
def get_sharded_full_test_datasets(full_data_path, full_cat_data_path, relevant_cols):
    ''' Test set combining numerical and 31 categorical columns. Not upsampled'''
    chunksize = 50000
    numeric_chunks = pd.read_csv(full_data_path, chunksize=chunksize)
    cat_chunks = pd.read_csv(full_cat_data_path, chunksize=chunksize, usecols=relevant_cols, dtype=str)
    # enumerate provides the chunk number i used in the output file name
    for i, (chunk1, chunk2) in enumerate(zip(numeric_chunks, cat_chunks)):
        print("Chunk Number: {}".format(i))
        chunk1 = chunk1.drop(columns=['Id'])
        chunk2 = chunk2.drop(columns=['Id'])
        full_chunk = pd.concat([chunk1, chunk2], axis=1)  # numeric and categorical columns side by side
        # index=False makes sure an additional index column doesn't get added
        full_chunk.to_csv(os.path.dirname(full_data_path) + "/test_entire_data_50k_{}.csv".format(i),
                          index=False, header=False)
    print("Done")

In [None]:
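# pair the numeric test data with the categorical test file (the same file sampled earlier for column selection)
data_path_cat_test = os.getcwd() + '/FInal_Numeric_Datsets/shuffled_train_categorical_test.csv'
get_sharded_full_test_datasets(data_path_test, data_path_cat_test, relevant_cols)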