## Sampling Large Datasets

__NOTE:__ Updated to use koalas, so you can now work with Spark DataFrames.  FYI: the Spark context is not set up in this notebook.


### Down Sampling Large Datasets

Smaller data means faster training times.  Moreover, DataRobot has seen many instances where models trained on a fraction of a dataset perform about the same as models trained on the full dataset.

Additionally, downsampling is necessary when dealing with larger datasets, given DataRobot's file size requirements.

### The Problem

When sampling large datasets, one must consider the type of supervised ML problem.

* Is it regression?  Could the target be considered zero inflated?
* Is it classification?  Is there significant class imbalance?
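A quick way to answer both questions is to look at the share of zeros and the share of the rarest value in the target.  The sketch below is only illustrative: `target_summary` and the toy arrays are made-up names and data, not part of this notebook.

```python
import numpy as np

def target_summary(y):
    """Summarize how concentrated a target is (hypothetical helper)."""
    y = np.asarray(y)
    values, counts = np.unique(y, return_counts=True)
    shares = counts / counts.sum()
    return {
        "n_unique": int(values.size),             # 2 suggests a binary target
        "zero_share": float(np.mean(y == 0)),     # high share suggests zero inflation
        "rarest_share": float(shares.min()),      # small share suggests class imbalance
    }

# Made-up toy targets, for illustration only
binary_y = np.array([0] * 98 + [1] * 2)
regression_y = np.array([0.0] * 90 + [12.5, 3.2, 7.1, 40.0, 1.1, 9.9, 5.5, 2.2, 8.8, 6.6])

print(target_summary(binary_y))
print(target_summary(regression_y))
```

A binary target with a very small `rarest_share`, or a numeric target with a very large `zero_share`, is a hint that plain random sampling will throw away much of the signal.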

### Why this is an issue

Depending on the problem type, you may be throwing out a lot of information based on your approach to downsampling.

### An example

A (very) large dataset is being used for binary classification.  1% of the records have target = 1 and 99% of the records have target = 0.

_An aside: there may be instances where it makes more sense to stratify your data by a given feature instead of downsampling.  For example, we could subset our dataset based on a given feature and, assuming that each subset is less than 10GB, build one project for each subset.  We will not cover this example._


#### Random Sampling 

If the dataset is 100GB and we downsample it to a random 10%, we can get the file into DataRobot and keep a representative distribution of the target, but consider the information lost in the sampling.  You just dropped a significant number of instances with target = 1.
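To make the loss concrete, here is a small simulation on made-up data (the row count, the seed, and the 1% / 10% rates are assumptions chosen to mirror the example above).  The target rate survives the random sample, but most of the positive records do not.

```python
import numpy as np

rng = np.random.default_rng(123)

n_rows = 1_000_000      # assumed size of the toy dataset
positive_rate = 0.01    # 1% of records have target = 1
sample_frac = 0.10      # keep a random 10% of all records

y = rng.binomial(1, positive_rate, size=n_rows)
keep = rng.random(n_rows) < sample_frac   # row-wise random sampling
y_sample = y[keep]

positives_total = int(y.sum())
positives_kept = int(y_sample.sum())

print(f"positives in full data:            {positives_total}")
print(f"positives kept by random sampling: {positives_kept}")
print(f"positive rate, full vs sample:     {y.mean():.4f} vs {y_sample.mean():.4f}")
```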

#### Majority Class Down Sampling 

Retain all records with target = 1 and downsample the rest of the dataset.  This will fundamentally change the distribution of the target variable.  Suppose that majority class downsampling results in a dataset where the target is 1 in 1 out of 20 records.  Any model trained on that data will require us to calibrate its predictions to reflect the distribution of the original dataset.

#### Majority Class Down Sampling and weighting

This is actually the mechanism DataRobot would pursue for the purpose of training models fast when faced with a binary target or a zero inflated target.  

The addition of the weight will be used to "automatically" calibrate the predictions.  

In what follows we will walk through an example of downsampling a dataset outside of DataRobot, and show how to generate a weight for the dataset that will be used in DataRobot for training purposes.

#### A note on spark dataframes

Check out [koalas](https://github.com/databricks/koalas).  It puts a pandas API on Spark DataFrames, so the following code should work.  I have not tested this out.

#### Set up your Spark context

In [None]:
sc.getConf().getAll()

In [2]:
## conda install koalas -c conda-forge
import databricks.koalas as ks

In [3]:
import pandas as pd
import numpy as np
%matplotlib inline 

In [4]:
## toy data.  Target will be highly imbalanced. 

# np.random.seed(123)
# y = np.random.binomial(1, 0.02, size=[1000000,1])
# x = np.random.normal(0, 1, size=[1000000,5])
# df = pd.DataFrame(np.concatenate((y,x), axis=1))
# columns = ["target"]
# columns.extend(["f{}".format(i) for i in range(5)])
# df.columns = columns

In [5]:
# df = pd.read_csv("10K_Lending_Club_Loans.csv", encoding = "ISO-8859-1")
target = "is_bad"

In [8]:
sdf = sqlContext.read.option("header", "true").option("inferSchema", "true").csv("10K_Lending_Club_Loans.csv")

In [9]:
df = ks.DataFrame(sdf)

In [10]:
df["is_bad"].describe()

count    10000.000000
mean         0.129500
std          0.335769
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: is_bad, dtype: float64

## Smart Downsampling Outside of DataRobot

For binary classification (suppose that $Y$ is binary, i.e., 1 is the positive class and 0 is the negative class):

1. Split the data into two datasets:
    * minority class
    * majority class
2. Retain all records in the minority class and randomly sample from the majority class.  This is the smartly downsampled dataset.
3. Create a weighting mechanism such that the weighted sample average of the new dataset equals the sample average of the original dataset.

The weight for minority class records is

$$weight_{min} = 1$$ 

and the weight for majority class records is 

$$weight_{maj} = \frac{ N - \sum y}{n - \sum y}$$

where $N$ is the number of records in the original dataset, $n$ is the number of records in the smart downsampled dataset, and $\sum y$ is the number of minority class records (all of which are retained).
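As a quick check that these weights do what we want: the downsampled dataset has $\sum y$ minority records and $n - \sum y$ majority records, so its weighted sample average is

$$\frac{\sum y \cdot weight_{min}}{\sum y \cdot weight_{min} + (n - \sum y) \cdot weight_{maj}} = \frac{\sum y}{\sum y + (N - \sum y)} = \frac{\sum y}{N}$$

which is exactly the sample average of the original dataset.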

TL;DR version

Up to a common scale factor of $n / N$ (which does not change the weighted average), the above is equivalent to

$weight_{min}$ = old proportion of 1's / new proportion of 1's

$weight_{maj}$ = old proportion of 0's / new proportion of 0's
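Here is the arithmetic for both formulations, using the counts from this notebook's lending club data (10,000 original records, 1,295 with `is_bad` = 1, and 2,600 records after downsampling).  The two sets of weights differ only by the common factor $n / N$, so both recover the original average.

```python
N = 10_000      # records in the original dataset
sum_y = 1_295   # minority (target = 1) records, all retained
n = 2_600       # records in the downsampled dataset

# Formulation 1: minority weight fixed at 1
w_min_1 = 1.0
w_maj_1 = (N - sum_y) / (n - sum_y)

# Formulation 2: old proportion / new proportion
w_min_2 = (sum_y / N) / (sum_y / n)
w_maj_2 = ((N - sum_y) / N) / ((n - sum_y) / n)

def weighted_mean(w_min, w_maj):
    """Weighted average of a binary target in the downsampled data."""
    return (sum_y * w_min) / (sum_y * w_min + (n - sum_y) * w_maj)

print(f"formulation 1: w_min={w_min_1:.4f}, w_maj={w_maj_1:.4f}, weighted mean={weighted_mean(w_min_1, w_maj_1):.4f}")
print(f"formulation 2: w_min={w_min_2:.4f}, w_maj={w_maj_2:.4f}, weighted mean={weighted_mean(w_min_2, w_maj_2):.4f}")
print(f"original mean: {sum_y / N:.4f}")
```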

In [11]:
minority_class = 1 # positive class
majority_class = 0
majority_class_sample_rate = 0.15

In [12]:
min_df = df[df[target] == minority_class]
maj_df = df[df[target] == majority_class]

In [13]:
## downsample the majority class dataframe
maj_df_downsample = maj_df.sample(frac=majority_class_sample_rate)

In [14]:
## concatenate the minority class dataframe with the downsampled majority class dataframe
df_downsample = ks.concat([min_df, maj_df_downsample], axis=0)

In [15]:
df.shape

(10000, 34)

In [16]:
df_downsample.shape

(2600, 34)

In [17]:
## create weight
sum_y = min_df[target].shape[0]  # number of minority class records (all retained)
N = df.shape[0]                  # records in the original dataset
n = df_downsample.shape[0]       # records in the downsampled dataset
# alternative scaling: weight_for_minority_class = n / N
weight_for_minority_class = 1
weight_for_majority_class = weight_for_minority_class * (N - sum_y)/(n - sum_y)
weight = df_downsample[target].map(lambda x: float(weight_for_minority_class) if x > 0 else float(weight_for_majority_class))
# df["weight"] = weight
weight.name = "weight"
# df_down_sample is the downsampled data with the weight column attached
df_down_sample = df_downsample.join(weight)

print("minority class weight: {}, majority class weight: {}".format(weight_for_minority_class, weight_for_majority_class))

minority class weight: 1, majority class weight: 6.670498084291188


In [18]:
weight.unique()

0    1.000000
1    6.670498
Name: weight, dtype: float64

In [19]:
df_down_sample.groupby(target)["weight"].mean()

is_bad
1    1.000000
0    6.670498
Name: weight, dtype: float64

In [20]:
##check
y = df_downsample[target]
print("downsampled weighted average: {:.3f}, original average:{:.3f}".format((y * weight).sum() / weight.sum(), df[target].mean()))

downsampled weighted average: 0.129, original average:0.130


In [21]:
## write the frame that includes the weight column (df_down_sample, not df_downsample)
df_down_sample.to_csv("downsampled_data_with_weight.csv")

In [None]:
df_downsample[target].describe()
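
For reference, the steps above can be collected into a single plain pandas helper.  This is a minimal sketch that assumes a binary 0/1 target and data that fits in memory; `smart_downsample`, its parameters, and the toy data are hypothetical and not part of DataRobot or koalas.

```python
import numpy as np
import pandas as pd

def smart_downsample(df, target, majority_sample_rate, minority_class=1, seed=0):
    """Keep all minority rows, sample the majority rows, and attach a weight column."""
    is_minority = df[target] == minority_class
    min_df = df[is_minority]
    maj_df = df[~is_minority].sample(frac=majority_sample_rate, random_state=seed)
    out = pd.concat([min_df, maj_df], axis=0)

    N, n, sum_y = len(df), len(out), len(min_df)
    weight_maj = (N - sum_y) / (n - sum_y)  # minority weight is fixed at 1
    out["weight"] = np.where(out[target] == minority_class, 1.0, weight_maj)
    return out

# Toy data with roughly 2% positives, similar to the commented-out toy data earlier
rng = np.random.default_rng(123)
toy = pd.DataFrame(rng.normal(size=(100_000, 5)), columns=[f"f{i}" for i in range(5)])
toy["target"] = rng.binomial(1, 0.02, size=len(toy))

down = smart_downsample(toy, "target", majority_sample_rate=0.15)
weighted_mean = (down["target"] * down["weight"]).sum() / down["weight"].sum()

print(f"rows: {len(toy)} -> {len(down)}")
print(f"weighted mean: {weighted_mean:.4f}, original mean: {toy['target'].mean():.4f}")
```

The helper does not guard against a sample rate so small that no majority rows are kept; in that case the majority weight is undefined.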

## Start a Project with a Weight Feature

In [23]:
import datarobot as dr
import os

token = os.environ["DATAROBOT_API_TOKEN"]
endpoint = os.environ["DATAROBOT_ENDPOINT"]

dr.Client(token = token, endpoint = endpoint)

<datarobot.rest.RESTClientObject at 0x1192158d0>

In [24]:
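## upload the frame that contains the "weight" column (df_down_sample), since it is referenced in the advanced options below
project = dr.Project.create(df_down_sample.to_pandas(), project_name = "Using Weight Feature", dataset_filename="lending club down sampled")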

In [None]:
advanced_options = dr.AdvancedOptions(weights = "weight")

In [None]:
project.set_target(target = target, mode = "auto", advanced_options = advanced_options)
project.set_worker_count(-1)