# AutoML with Tabular data - Data Processing

The complete AutoML pipeline that translates raw data into accurate predictions involves many stages abstracted by AutoGluon's one-line `fit()`, such as:
- Data splitting
- Data preprocessing
- Training of individual models
- Hyperparameter-tuning (optional)
- Model ensembling (optional)
- Feature-engineering/selection (optional)

Here we describe some basic principles to improve various stages of the pipeline, starting with how to process the given dataset. This tutorial focuses on subtle yet practically-important issues, assuming you're already familiar with overall forms of standard data processing required in most ML projects. If not, please first look at these resources: ([Rencberoglu, 2019](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114); [D'yakonov & Semenov, 2019](https://www.coursera.org/lecture/competitive-data-science/overview-1Nh5Q)).


## Data Splitting

**How much data to hold-out**: Predictive performance on validation data crucially guides automated decisions about which model (or combination of models) is best, how many training iterations (i.e. epochs/boosting-rounds) to apply for iteratively-optimized models, what hyperparameter-values to use, etc. It's thus critical to use a representative validation set that facilitates accurate estimation of generalization performance. However, we do not want to hold-out too much data for validation, since then less data is available for actually training the models. Once the validation-set is large enough to estimate predictive performance accurately, further increases in its size only marginally improve our estimates. While the size of the validation data determines the *variance* of our performance-estimates, how *biased* these estimates are is determined by the number of modeling decisions we base on validation-performance.
By default, AutoGluon selects the fraction of data to hold-out for validation as follows:

```
if num_train_rows < 5000:
    holdout_frac = max(0.1, min(0.2, 500.0 / num_train_rows))
else:
    holdout_frac = max(0.01, min(0.1, 2500.0 / num_train_rows))

if hyperparameter_tune:
    holdout_frac = min(0.2, holdout_frac * 2)
```

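Between 5,000 and 25,000 examples, we hold-out 10% of the data, so the validation set grows until it reaches 2,500 examples. Between 25,000 and 250,000 examples, the validation set stays at roughly 2,500 examples (the held-out fraction shrinks from 10% to 1%), and for even larger sample-sizes we hold-out 1% of the data, which suffices to accurately estimate validation performance. Below 5,000 examples the same pattern applies at a smaller scale: 20% is held-out until the validation set reaches 500 examples (at 2,500 training rows), after which it stays at 500 examples. If hyperparameter-tuning is performed, we double the held-out fraction (capped at 20%) to mitigate the bias introduced from choosing many hyperparameter-values based on the same validation data.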
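To make these regimes concrete, the rule quoted above can be wrapped in a small function and evaluated at a few dataset sizes. This is only a sketch: the function name `default_holdout_frac` is hypothetical and is not part of AutoGluon's API; the arithmetic is copied from the snippet above.

```python
def default_holdout_frac(num_train_rows, hyperparameter_tune=False):
    """Return the validation fraction given by the rule quoted above."""
    if num_train_rows < 5000:
        holdout_frac = max(0.1, min(0.2, 500.0 / num_train_rows))
    else:
        holdout_frac = max(0.01, min(0.1, 2500.0 / num_train_rows))
    if hyperparameter_tune:
        # Double the fraction, but never hold out more than 20%.
        holdout_frac = min(0.2, holdout_frac * 2)
    return holdout_frac


# Print the fraction and approximate validation-set size for several sizes.
for n in (1000, 4000, 10000, 100000, 1000000):
    frac = default_holdout_frac(n)
    print(f"{n:>9,d} rows: holdout_frac={frac:.4f}, about {round(frac * n):,d} validation rows")
```

Calling the function with `hyperparameter_tune=True` shows the doubling, and the 20% cap for small datasets.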

For smaller datasets, it can be desirable to utilize multiple train/validation splits, i.e. via [k-fold cross-validation](https://www.datavedas.com/k-fold-cross-validation/), which we discuss in the next Notebook. 

**Class-stratification:** In classification problems, we make sure to stratify labels between the training and validation data. Stratification simply means the proportions of each class are matched between training and validation data. This prevents a shift in the class-label-distribution that might otherwise arise simply due to random chance. Extremely rare classes remain an issue, and AutoGluon by default simply discards all data from classes that occur fewer than 10 times (this can be adjusted via the `label_count_threshold` [argument of `fit()`](https://autogluon.mxnet.io/api/autogluon.task.html#autogluon.task.TabularPrediction.fit)). AutoGluon-trained models thus never predict these discarded classes (about which models could anyway learn very little), but AutoGluon still takes them into account when evaluating predictive performance.
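The idea of a stratified split with rare-class filtering can be sketched in plain Python as follows. This is an illustration only, not AutoGluon's implementation: the function `stratified_split` and its parameters (`min_count`, `seed`) are hypothetical names chosen for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, holdout_frac=0.2, min_count=10, seed=0):
    """Return (train_idx, val_idx) with class proportions roughly preserved.

    Rows whose class occurs fewer than `min_count` times are dropped first.
    """
    counts = Counter(labels)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        if counts[label] >= min_count:
            by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out the same fraction of every class (at least one row).
        n_val = max(1, round(holdout_frac * len(indices)))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


labels = ["NO"] * 60 + [">30"] * 30 + ["<30"] * 10 + ["rare"] * 3
train_idx, val_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in val_idx))
```

In this toy example the class named `rare` (3 rows) is dropped entirely, while the three remaining classes keep the same 6:3:1 ratio in both splits.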



**Manually specify validation data:** Whether a random training/validation split is appropriate for a particular application is hard for an AutoML system to determine. Thus AutoGluon places the burden on the user to decide this. If you have reason to believe your future test data will stem from a different distribution than the training data, you should aim to provide the validation set you believe is most representative of the test distribution. For example, data are often collected at different times, the underlying predictive relationships vary temporally, and the goal is to predict on data received at future times. Here a reasonable strategy may be to reserve your most recent data as the validation-set, to mimic the time-difference from the training data that will be encountered during inference. With AutoGluon, we can easily specify a validation-set via the `tuning_data` argument as follows:

In [1]:
import numpy as np
from collections import Counter
import pprint, psutil
from autogluon import TabularPrediction as task

subsample_size = 600 # experiment with larger values to try AutoGluon with larger datasets 

train_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/diabetes/train.csv')
val_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/diabetes/validation.csv')
train_data = train_data.head(subsample_size) # subsample data for faster demo
val_data = val_data.head(subsample_size) # subsample data for faster demo
display(train_data)
display(val_data)

label_column = 'readmitted'
predictor = task.fit(train_data=train_data, tuning_data=val_data, label=label_column, time_limits=30)
val_perf = predictor.leaderboard(silent=True)
print("Performance on your provided val_data:")
display(val_perf)

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/diabetes/train.csv | Columns = 47 / 47 | Rows = 61059 -> 61059
Loaded data from: https://autogluon.s3.amazonaws.com/datasets/diabetes/validation.csv | Columns = 47 / 47 | Rows = 20353 -> 20353


Unnamed: 0,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,Female,[0-10),?,"""6""","""25""","""1""",1.5,?,Pediatrics-Endocrinology,41,...,No,No,No,No,No,No,No,No,No,NO
1,Female,[10-20),?,"""1""","""1""","""7""",3.0,?,?,59,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,Female,[20-30),?,"""1""","""1""","""7""",2.3,?,?,11,...,No,No,No,No,No,No,No,No,Yes,NO
3,Male,[30-40),?,"""1""","""1""","""7""",2.3,?,?,44,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,Male,[40-50),?,"""1""","""1""","""7""",1.7,?,?,51,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,Male,[40-50),?,"""6""","""25""","""1""",3.0,?,InternalMedicine,41,...,No,No,No,No,No,No,No,No,No,NO
596,Male,[70-80),?,"""2""","""1""","""20""",3.2,?,?,11,...,No,No,No,No,No,No,No,No,No,>30
597,Female,[30-40),?,"""6""","""25""","""7""",4.3,?,Psychiatry,30,...,No,Up,No,No,No,No,No,Ch,Yes,NO
598,Female,[90-100),?,"""1""","""6""","""5""",7.7,?,InternalMedicine,34,...,No,No,No,No,No,No,No,No,No,NO


Unnamed: 0,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,Male,[60-70),?,"""2""","""1""","""4""",12.0,MD,Radiologist,68,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
1,Female,[60-70),?,"""2""","""1""","""7""",2.2,MC,Emergency/Trauma,36,...,No,Steady,No,No,No,No,No,Ch,Yes,>30
2,Female,[70-80),?,"""1""","""1""","""7""",3.2,MC,?,59,...,No,No,No,No,No,No,No,No,No,NO
3,Male,[70-80),?,"""2""","""2""","""7""",5.6,SP,Emergency/Trauma,61,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
4,Female,[80-90),?,"""1""","""3""","""7""",12.9,?,InternalMedicine,73,...,No,Up,No,No,No,No,No,Ch,Yes,<30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,Female,[60-70),?,"""1""","""3""","""1""",11.6,MC,?,81,...,No,Up,No,No,No,No,No,Ch,Yes,>30
596,Female,[80-90),?,"""1""","""3""","""7""",6.6,MC,Emergency/Trauma,65,...,No,No,No,No,No,No,No,No,Yes,>30
597,Male,[80-90),?,"""2""","""5""","""7""",6.3,MD,Emergency/Trauma,45,...,No,Down,No,No,No,No,No,Ch,Yes,NO
598,Female,[70-80),?,"""1""","""1""","""7""",5.3,?,InternalMedicine,53,...,No,No,No,No,No,No,No,No,Yes,NO


No output_directory specified. Models will be saved in: AutogluonModels/ag-20200801_195938/
Beginning AutoGluon training ... Time limit = 30s
AutoGluon will save models to AutogluonModels/ag-20200801_195938/
AutoGluon Version:  0.0.13b20200731
Train Data Rows:    600
Train Data Columns: 47
Tuning Data Rows:    600
Tuning Data Columns: 47
Preprocessing data ...
Here are the 3 unique label values in your data:  ['NO', '>30', '<30']
AutoGluon infers your prediction problem is: multiclass  (because dtype of label-column == object).
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Train Data Class Count: 3
Feature Generator processed 1200 data points with 37 features
Original Features (raw dtypes):
	object features: 29
	float64 features: 1
	int64 features: 7
Original Features (inferred dtypes):
	object features: 29
	float features: 1
	int features: 7
Generated Features (special dtypes)

Performance on your provided val_data:


| | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer |
|---|---|---|---|---|---|---|---|---|
| 0 | weighted_ensemble_k0_l1 | 0.56 | 0.177646 | 18.944229 | 0.001663 | 0.519197 | 1 | True |
| 1 | NeuralNetClassifier | 0.558333 | 0.147606 | 11.522898 | 0.147606 | 11.522898 | 0 | True |
| 2 | CatboostClassifier | 0.543333 | 0.028377 | 6.902134 | 0.028377 | 6.902134 | 0 | True |
| 3 | LightGBMClassifierCustom | 0.533333 | 0.033193 | 1.3857 | 0.033193 | 1.3857 | 0 | True |
| 4 | LightGBMClassifier | 0.533333 | 0.035097 | 1.593748 | 0.035097 | 1.593748 | 0 | True |
| 5 | RandomForestClassifierGini | 0.486667 | 0.127345 | 0.7627 | 0.127345 | 0.7627 | 0 | True |
| 6 | ExtraTreesClassifierGini | 0.476667 | 0.151468 | 0.835468 | 0.151468 | 0.835468 | 0 | True |
| 7 | ExtraTreesClassifierEntr | 0.475 | 0.135921 | 0.586289 | 0.135921 | 0.586289 | 0 | True |
| 8 | RandomForestClassifierEntr | 0.47 | 0.126998 | 0.766957 | 0.126998 | 0.766957 | 0 | True |
| 9 | KNeighborsClassifierDist | 0.445 | 0.111001 | 0.004122 | 0.111001 | 0.004122 | 0 | True |
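The example above uses a validation file that was prepared in advance. For the temporal scenario described earlier, a most-recent-data hold-out could be constructed along the following lines. This is a minimal sketch under stated assumptions: the helper `temporal_split` and the field name `when` are hypothetical, and the toy records merely stand in for a dataset that has a timestamp column.

```python
from datetime import date, timedelta


def temporal_split(records, time_key, holdout_frac=0.2):
    """Sort records by time and hold out the most recent `holdout_frac` of them."""
    ordered = sorted(records, key=lambda r: r[time_key])
    n_val = max(1, int(len(ordered) * holdout_frac))
    return ordered[:-n_val], ordered[-n_val:]


# Toy records with shuffled, unique dates (7 and 50 are coprime).
records = [
    {"when": date(2020, 1, 1) + timedelta(days=(7 * i) % 50), "x": i}
    for i in range(50)
]
train_part, val_part = temporal_split(records, time_key="when")
print(len(train_part), len(val_part))
print(max(r["when"] for r in train_part), min(r["when"] for r in val_part))
```

With a DataFrame, the same idea amounts to sorting by the timestamp column and slicing off the tail; the two pieces would then be passed as `train_data` and `tuning_data`.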


**Refit on full dataset:** Regardless of how the validation dataset is selected, a strategy that often boosts performance for models that train stably is to simply refit them to the entire dataset (training + validation) after their optimal hyperparameters (and training-iterations) have been determined based on the validation data. However, the only way to confirm the resulting accuracy boost is through access to labeled test data, as there no longer remain any held-out examples in the dataset that can be utilized for unbiased accuracy estimation. With AutoGluon, one can do this as follows:

In [2]:
refit_models = predictor.refit_full()
test_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/diabetes/test.csv')
test_data = test_data.head(subsample_size) # subsample data for faster demo
all_models_testperf = predictor.leaderboard(test_data, silent=True)
display(all_models_testperf)

Fitting model: RandomForestClassifierGini_FULL ...
	0.86s	 = Training runtime
Fitting model: RandomForestClassifierEntr_FULL ...
	0.85s	 = Training runtime
Fitting model: ExtraTreesClassifierGini_FULL ...
	0.76s	 = Training runtime
Fitting model: ExtraTreesClassifierEntr_FULL ...
	0.75s	 = Training runtime
Fitting model: KNeighborsClassifierUnif_FULL ...
	0.0s	 = Training runtime
Fitting model: KNeighborsClassifierDist_FULL ...
	0.0s	 = Training runtime
Fitting model: LightGBMClassifier_FULL ...
	0.13s	 = Training runtime
Fitting model: CatboostClassifier_FULL ...
	0.54s	 = Training runtime
Fitting model: NeuralNetClassifier_FULL ...
	3.53s	 = Training runtime
Fitting model: LightGBMClassifierCustom_FULL ...
	0.15s	 = Training runtime
Fitting model: weighted_ensemble_FULL_k0_l1 ...
	0.56	 = Validation accuracy score
	0.03s	 = Training runtime
	0.0s	 = Validation runtime
Loaded data from: https://autogluon.s3.amazonaws.com/datasets/diabetes/test.csv | Columns = 47 / 47 | Rows = 20354 ->

| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CatboostClassifier_FULL | 0.601667 | | 0.027697 | | 0.543249 | 0.027697 | | 0.543249 | 0 | True |
| 1 | RandomForestClassifierGini_FULL | 0.575 | | 0.158222 | | 0.861494 | 0.158222 | | 0.861494 | 0 | True |
| 2 | LightGBMClassifier_FULL | 0.573333 | | 0.031525 | | 0.12556 | 0.031525 | | 0.12556 | 0 | True |
| 3 | LightGBMClassifierCustom | 0.573333 | 0.533333 | 0.031607 | 0.033193 | 1.3857 | 0.031607 | 0.033193 | 1.3857 | 0 | True |
| 4 | LightGBMClassifierCustom_FULL | 0.573333 | | 0.032832 | | 0.149797 | 0.032832 | | 0.149797 | 0 | True |
| 5 | LightGBMClassifier | 0.573333 | 0.533333 | 0.0335 | 0.035097 | 1.593748 | 0.0335 | 0.035097 | 1.593748 | 0 | True |
| 6 | RandomForestClassifierEntr_FULL | 0.571667 | | 0.153388 | | 0.848461 | 0.153388 | | 0.848461 | 0 | True |
| 7 | NeuralNetClassifier_FULL | 0.57 | | 0.149691 | | 3.525632 | 0.149691 | | 3.525632 | 0 | True |
| 8 | weighted_ensemble_FULL_k0_l1 | 0.57 | | 0.186244 | | 4.098808 | 0.008856 | | 0.029927 | 1 | True |
| 9 | ExtraTreesClassifierGini_FULL | 0.556667 | | 0.195693 | | 0.7594 | 0.195693 | | 0.7594 | 0 | True |


Above we list the test accuracy of all models/ensembles (including those originally fit to just `train_data` and those refit to the merged `train_data + val_data` indicated by suffix **_FULL**). The **_FULL** models lack validation scores as their performance cannot be reliably estimated without additional test-data.

## Data Preprocessing

Properly processing raw data into a format suitable for ML is crucial for a successful end-to-end AutoML system.  
AutoGluon relies on two sequential stages of data processing: 
- *model-agnostic* preprocessing that transforms the inputs to all models
- *model-specific* preprocessing that is only applied to a copy of the data used to train a particular model. 

Model-agnostic preprocessing classifies each feature as numeric, categorical, text, or date/time, relying partly on the [**dtype** of each column in the DataFrame](https://pbpython.com/pandas_dtypes.html). Columns that fit none of these types are discarded from the data; these are typically non-numeric, non-repeating fields with presumably little predictive value (e.g. UserIDs). To deal with missing values in categorical variables, we create an additional **Unknown** category rather than imputing them. AutoGluon uses the same strategy to handle previously unseen categories at inference-time. Note that observations are often not missing at random, and we want to preserve the evidence of absence (rather than the absence of evidence).
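The **Unknown**-category strategy can be illustrated with a small integer encoder. This is a sketch under stated assumptions rather than AutoGluon's actual code: the names `UNKNOWN`, `fit_categories` and `encode` are hypothetical, and `None` stands in for a missing value.

```python
UNKNOWN = "Unknown"  # hypothetical label for the extra category


def fit_categories(train_values):
    """Map each category seen in training to an integer code; code 0 is UNKNOWN."""
    seen = sorted({v for v in train_values if v is not None})
    return {category: code for code, category in enumerate([UNKNOWN] + seen)}


def encode(values, mapping):
    """Encode values; missing and previously unseen values share the UNKNOWN code."""
    return [mapping.get(v, mapping[UNKNOWN]) for v in values]


mapping = fit_categories(["MC", None, "MD", "MC", "SP"])
codes = encode(["MD", None, "HM", "MC"], mapping)
print(mapping)
print(codes)
```

Here the missing value and the category `HM`, which never appeared in training, both receive the code reserved for **Unknown**, so the model sees "missing or unfamiliar" as a signal of its own rather than an imputed guess.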

**Word of Caution about One-hot Encoding of Categorical Features**: 
Most ML textbooks/tutorials claim that categorical features should be converted to numerical values through [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) (OHE). However, we do *not* recommend this as a generic model-agnostic preprocessing strategy; you should only apply OHE when passing data to a model for which this technique is particularly well-suited. OHE comes with the major downside that it explodes the dimensionality of your data, and numerous alternative categorical-processing techniques exist ([Grover, 2019](https://towardsdatascience.com/getting-deeper-into-categorical-encodings-for-machine-learning-2312acd347c8)). 
Certain models such as [LightGBM](https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#categorical-feature-support)/[Catboost](https://catboost.ai/docs/features/categorical-features.html) provide special handling of categoricals, as does AutoGluon's neural network, which utilizes [learned embeddings](https://www.fast.ai/2018/04/29/categorical-embeddings/) to represent categorical data ([Erickson, 2020](https://arxiv.org/abs/2003.06505)).  

If you do utilize one-hot encoding, make sure to consider how your ML system will handle categorical features with: missing values, a huge number of possible categories, or previously-unseen categories encountered in future test data. One way to handle all of these issues is to provision separate OHE-dimensions to only the top $K$ most commonly-occurring categories and bin all less common (and previously-unseen/missing) categories into a single extra category.
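The top-$K$ scheme just described could look roughly like this in plain Python. It is a sketch only: the helper names `fit_top_k` and `one_hot_top_k` are hypothetical, and `None` stands in for a missing value.

```python
from collections import Counter


def fit_top_k(train_values, k):
    """Return the k most frequent non-missing categories, most common first."""
    counts = Counter(v for v in train_values if v is not None)
    return [category for category, _ in counts.most_common(k)]


def one_hot_top_k(values, top_categories):
    """Encode values into len(top_categories) + 1 columns; the last is 'other'."""
    column = {category: j for j, category in enumerate(top_categories)}
    other = len(top_categories)
    rows = []
    for v in values:
        row = [0] * (other + 1)
        # Rare, unseen and missing values all fall into the final column.
        row[column.get(v, other)] = 1
        rows.append(row)
    return rows


train = ["a", "b", "a", "c", "a", "b", None, "d"]
top = fit_top_k(train, k=2)
encoded = one_hot_top_k(["a", "b", "d", None, "zzz"], top)
print(top)
print(encoded)
```

The encoded width is fixed at $K + 1$ regardless of how many distinct categories appear later, which bounds the dimensionality increase.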

**Special Features**:
We identify text features as columns of mostly unique strings, which on average contain more than 3 non-adjacent whitespace characters. For models that solely operate on numerical/categorical data, the values of each text column are encoded via numeric vectors of [n-gram features](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) (to reduce memory footprint, only n-grams with high overall occurrence in the text columns are retained, 30+ times by default). In addition to n-grams, AutoGluon-Tabular also generates further text features, including the number of whitespaces and the average word length in each text field.
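A rough sketch of these heuristics is given below. It is not AutoGluon's implementation: the function names are hypothetical, the `min_unique_frac=0.5` threshold is an assumed reading of "mostly unique", and counting the gaps between words only approximates "non-adjacent whitespace characters".

```python
def looks_like_text(values, min_unique_frac=0.5, min_avg_gaps=3):
    """Heuristic check: mostly unique strings with several word gaps on average."""
    strings = [v for v in values if isinstance(v, str)]
    if not strings:
        return False
    unique_frac = len(set(strings)) / len(strings)
    # str.split() collapses runs of whitespace, so len(words) - 1 counts
    # the separated gaps between words in each string.
    avg_gaps = sum(max(len(s.split()) - 1, 0) for s in strings) / len(strings)
    return unique_frac > min_unique_frac and avg_gaps > min_avg_gaps


def extra_text_features(s):
    """Compute the whitespace count and average word length of one text field."""
    words = s.split()
    n_whitespace = sum(ch.isspace() for ch in s)
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {"n_whitespace": n_whitespace, "avg_word_len": avg_word_len}


notes = ["the cat sat on the mat", "a dog ran in the park", "it rains a lot in spring"]
codes = ["MC", "MD", "MC", "SP"]
print(looks_like_text(notes), looks_like_text(codes))
print(extra_text_features("ab  cde f"))
```

Short code-like columns fail the check because they contain no word gaps, so they would be treated as ordinary categoricals instead.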


<img src="files/images/ngram.png" width="900" height="400">

Date/time features are also transformed into ordered numeric values via the following simple [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) transformation: `pd.to_numeric(pd.to_datetime(raw_feature))`. Richer date/time feature-engineering is provided in the [fast.ai library](https://docs.fast.ai/tabular.transform.html#add_datepart), but our naive approach works reasonably well in practice. After encoding text and date-times, a copy of the resulting set of numeric and categorical features is passed to model-specific methods for further preprocessing tailored to each model. Below we add a dummy 'text' column to our dataset to show how AutoGluon handles it.

In [3]:
text_feature = [''.join(np.random.choice(['a','b','c','aaa',' '],p=[0.2,0.3,0.2,0.1,0.2],size=30)) for i in range(len(train_data))]
train_data_wtext = train_data.copy()
train_data_wtext['dummytext'] = text_feature
display(train_data_wtext['dummytext'])

predictor = task.fit(train_data=train_data_wtext, label=label_column, time_limits=30)

0      caaaccbcb  baaacc aaa aaa  aaabaaaaca cb
1              baaabcaccc cbbcaccbabbcb abacbbb
2              aacaab a a b caaaabababbaccbcac 
3            bcaaaabaca acbaaa bcccb b cb b c c
4          aaac cbcabcbcbabcccccb bb  aaaa aaa 
                         ...                   
595      b caaabaaaabaaaaaab b a ba b  ccb  acb
596          ababb bc  bcaaabaaabccbcab b abcb 
597           baccbbcaaaaabca c  cba aaaa  baa 
598              c  b bab bccbcbb    ccaa aaaa 
599            cbc aa baccccacbbc babaaabcb cb 
Name: dummytext, Length: 600, dtype: object

No output_directory specified. Models will be saved in: AutogluonModels/ag-20200801_200022/
Beginning AutoGluon training ... Time limit = 30s
AutoGluon will save models to AutogluonModels/ag-20200801_200022/
AutoGluon Version:  0.0.13b20200731
Train Data Rows:    600
Train Data Columns: 48
Preprocessing data ...
Here are the 3 unique label values in your data:  ['NO', '>30', '<30']
AutoGluon infers your prediction problem is: multiclass  (because dtype of label-column == object).
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Train Data Class Count: 3
Feature Generator processed 600 data points with 50 features
Original Features (raw dtypes):
	object features: 26
	float64 features: 1
	int64 features: 7
Original Features (inferred dtypes):
	object features: 25
	float features: 1
	int features: 7
	text features: 1
Generated Features (special dtypes):
	text_as_category features: 1


In [4]:
print("AutoGluon assigned the raw features to the following types:")
pprint.pprint(dict(predictor.feature_types.feature_types_raw))

print("\n AutoGluon generated the following 'special' features:")
pprint.pprint(dict(predictor.feature_types.feature_types_special))

print("\n After model-agnostic processing, the data passed to individual models looks like this:")
processed_features, processed_labels = predictor.load_data_internal()
display(processed_features)

AutoGluon assigned the raw features to the following types:
{'category': ['gender',
              'age',
              'admission_type_id',
              'discharge_disposition_id',
              'admission_source_id',
              'medical_specialty',
              'diag_1',
              'diag_2',
              'diag_3',
              'max_glu_serum',
              'A1Cresult',
              'metformin',
              'repaglinide',
              'glimepiride',
              'glipizide',
              'glyburide',
              'tolbutamide',
              'pioglitazone',
              'rosiglitazone',
              'acarbose',
              'troglitazone',
              'tolazamide',
              'insulin',
              'change',
              'diabetesMed',
              'dummytext'],
 'float': ['time_in_hospital'],
 'int': ['num_lab_procedures',
         'num_procedures',
         'num_medications',
         'number_outpatient',
         'number_emergency',
         'number_inp

Unnamed: 0,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,number_diagnoses,gender,age,...,__nlp__.aaaa,__nlp__.ab,__nlp__.ac,__nlp__.ba,__nlp__.bb,__nlp__.bc,__nlp__.ca,__nlp__.cb,__nlp__.cc,__nlp__._total_
216,3.9,34,0,4,0,0,0,1,0,0,...,0,0,0,1,1,0,0,0,0,2
534,3.4,73,5,14,0,0,0,9,1,6,...,1,0,0,0,0,0,0,0,0,1
462,1.0,48,0,6,0,0,2,7,1,1,...,0,0,0,0,0,0,0,0,0,0
49,11.0,67,2,25,0,0,0,9,1,6,...,0,0,0,0,0,0,0,0,0,0
401,2.4,31,1,17,0,0,0,4,1,8,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
581,8.7,46,3,20,0,0,0,8,1,8,...,0,0,0,0,0,0,0,0,0,1
40,1.9,28,0,15,0,0,0,4,0,7,...,0,0,0,0,0,0,0,0,0,0
536,4.3,37,1,5,0,0,0,7,0,6,...,0,0,0,0,0,0,0,0,0,1
457,5.6,47,1,17,0,0,0,7,1,7,...,0,0,0,0,0,0,0,0,0,0


**Memory usage**: A key concern across AutoML is memory usage, as this often ends up being the bottleneck that causes AutoML systems to fail on certain larger datasets. 
Machines with sizeable RAM are now easily accessible in the cloud (e.g. [AWS m5.24xlarge instance with 384 GB memory](https://aws.amazon.com/blogs/aws/m5-the-next-generation-of-general-purpose-ec2-instances/)), and thus robust AutoML systems ought to run on sizeable datasets even without any chunking/distributed-processing of the data. 

However, things break down without careful consideration of memory. For instance, many AutoML systems train multiple models in parallel, which may lead to forking processes that duplicate datasets in memory. Similarly, when passing the generically-preprocessed data to a model, it is common to copy the dataset so that model-specific preprocessing won't affect the data other models receive. Without caution, additional copies of the dataset are often generated during this process or during model-training. AutoGluon's n-gram feature-generation is one particular area where memory becomes dangerous, as it often adds thousands of additional columns to the table. Here is how it is implemented:

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=30, ngram_range=(1, 3), max_features=10000, dtype=np.uint8)

We use `uint8` to represent n-gram counts in the feature vector, which reduces memory. Before AutoGluon actually featurizes the text fields, we estimate how much memory the n-grams will require and downsample the number of n-grams iteratively until this estimate falls safely below the current available memory: 

In [6]:
def get_ngram_freq(vectorizer, transform_matrix):
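    # Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out().
    names = vectorizer.get_feature_names()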
    frequencies = transform_matrix.sum(axis=0).tolist()[0]
    ngram_freq = {ngram: freq for ngram, freq in zip(names, frequencies)}
    return ngram_freq

def downscale_vectorizer(vectorizer, ngram_freq, vocab_size):
    counter = Counter(ngram_freq)
    top_n = counter.most_common(vocab_size)
    top_n_names = sorted([name for name, _ in top_n])
    new_vocab = {name: i for i, name in enumerate(top_n_names)}
    vectorizer.vocabulary_ = new_vocab

text_data = train_data_wtext['dummytext'].values
vectorizer_fit = vectorizer.fit(text_data)
transform_matrix = vectorizer_fit.transform(text_data)
downsample_ratio = None
predicted_ngrams_memory_usage_bytes = transform_matrix.shape[0] * 8 * (transform_matrix.shape[1] + 1) + 80
mem_avail = psutil.virtual_memory().available
mem_rss = psutil.Process().memory_info().rss
max_memory_percentage = 0.15 # max fraction of available memory the n-grams are allowed to occupy
predicted_rss = mem_rss + predicted_ngrams_memory_usage_bytes
predicted_percentage = predicted_rss / mem_avail
print(f"Predicted fraction of available memory used by n-grams: {predicted_percentage}")
if downsample_ratio is None:
    if predicted_percentage > max_memory_percentage:
        downsample_ratio = max_memory_percentage / predicted_percentage

if downsample_ratio is not None:
    vocab_size = len(vectorizer_fit.vocabulary_)
    downsampled_vocab_size = int(np.floor(vocab_size * downsample_ratio))
    ngram_freq = get_ngram_freq(vectorizer=vectorizer_fit, transform_matrix=transform_matrix)
    downscale_vectorizer(vectorizer=vectorizer_fit, ngram_freq=ngram_freq, vocab_size=downsampled_vocab_size)
    transform_matrix = vectorizer_fit.transform(text_data)


Predicted fraction of available memory used by n-grams: 0.07228432683881864
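On the small demo dataset no downsampling is triggered. To see when it would be, the same arithmetic can be isolated in a pure function and fed hypothetical numbers. The function name `ngram_downsample_ratio` and the memory figures below are assumptions made for illustration; only the formula is taken from the cell above.

```python
import math


def ngram_downsample_ratio(n_rows, n_ngrams, mem_rss, mem_avail, max_memory_percentage=0.15):
    """Return None if the n-gram matrix fits the budget, else the vocabulary shrink ratio."""
    # Same estimate as above: 8 bytes per entry, plus one extra column and a small constant.
    predicted_bytes = n_rows * 8 * (n_ngrams + 1) + 80
    predicted_percentage = (mem_rss + predicted_bytes) / mem_avail
    if predicted_percentage > max_memory_percentage:
        return max_memory_percentage / predicted_percentage
    return None


GiB = 1024 ** 3
# Hypothetical machine: 1 GiB already used by the process, 16 GiB available.
small = ngram_downsample_ratio(n_rows=600, n_ngrams=50, mem_rss=1 * GiB, mem_avail=16 * GiB)
large = ngram_downsample_ratio(n_rows=1_000_000, n_ngrams=10_000, mem_rss=1 * GiB, mem_avail=16 * GiB)
print(small)
print(large, math.floor(10_000 * large))
```

In the large case the estimated matrix far exceeds the budget, so only a small fraction of the 10,000 candidate n-grams (the most frequent ones, per `downscale_vectorizer`) would be kept.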


**Transductive Preprocessing**: While *supervised learning* typically assumes predictions will need to be made on future data that is currently unavailable, *transductive learning* ([Vapnik, 2006](http://axon.cs.byu.edu/~martinez/classes/778/Papers/transductive.pdf)) instead only asks for predictions for one particular test dataset that is available (just without labels). For instance, many [prediction competitions](https://www.kaggle.com/competitions) follow this format, where the (unlabeled) test data is provided to contestants at the outset. While sophisticated learning algorithms have been developed specifically for the transductive setting, some accuracy-gains may be reaped simply by performing all data preprocessing on the combined training and (unlabeled) test datasets. While inappropriate for inductive supervised learning (where test data are merely supposed to provide unbiased predictive performance estimates for future data), such joint preprocessing can help in transductive settings, where some numerical values in the test data may extend beyond their range in the training data and some categorical variables may take values previously unseen in the training data.
This can be done in AutoGluon by passing the *unlabeled* test data as `tuning_data` (rather than *labeled* validation data):

In [7]:
test_data_nolab = test_data.drop(labels=[label_column],axis=1) # delete label column to demonstrate case without labels
predictor = task.fit(train_data=train_data, tuning_data=test_data_nolab, label=label_column, time_limits=30)
predictor.evaluate(test_data)

No output_directory specified. Models will be saved in: AutogluonModels/ag-20200801_200056/
Beginning AutoGluon training ... Time limit = 30s
AutoGluon will save models to AutogluonModels/ag-20200801_200056/
AutoGluon Version:  0.0.13b20200731
Train Data Rows:    600
Train Data Columns: 47
Tuning Data Rows:    600
Tuning Data Columns: 46
Preprocessing data ...
Here are the 3 unique label values in your data:  ['NO', '>30', '<30']
AutoGluon infers your prediction problem is: multiclass  (because dtype of label-column == object).
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Train Data Class Count: 3
Feature Generator processed 1200 data points with 37 features
Original Features (raw dtypes):
	object features: 29
	float64 features: 1
	int64 features: 7
Original Features (inferred dtypes):
	object features: 29
	float features: 1
	int features: 7
Generated Features (special dtypes)

Predictive performance on given dataset: accuracy = 0.53


0.53

## References

[**AutoGluon Documentation** (autogluon.mxnet.io)](https://autogluon.mxnet.io/api/autogluon.task.html)

Rencberoglu, E. [**Fundamental Techniques of Feature Engineering for Machine Learning**](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114). *Towards Data Science*, 2019.

D'yakonov, A., Semenov, S. [**Feature Preprocessing and Generation with Respect to Models**](https://www.coursera.org/lecture/competitive-data-science/overview-1Nh5Q). *From Coursera Course: [How to Win a Data Science Competition](https://sites.google.com/view/raybellwaves/courses/how-to-win-a-data-science-competition)*, 2019.

Grover, P. [**Getting Deeper into Categorical Encodings for Machine Learning**](https://towardsdatascience.com/getting-deeper-into-categorical-encodings-for-machine-learning-2312acd347c8). *Towards Data Science*, 2019.

Thomas, R. [**An Introduction to Deep Learning for Tabular Data**](https://www.fast.ai/2018/04/29/categorical-embeddings/). *fast.ai*, 2018.

Erickson et al. [**AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data**](https://arxiv.org/abs/2003.06505). *arXiv*, 2020.

Vapnik, V. [**Transductive Inference and Semi-Supervised Learning**](http://axon.cs.byu.edu/~martinez/classes/778/Papers/transductive.pdf). *Book Chapter in "Semi-Supervised Learning"*, 2006.