# Example: Memory considerations
--------------------------------

This example shows how to use the `memory` parameter to make efficient use of the available memory.

The data used is a variation on the [Australian weather dataset](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package) from Kaggle. You can download it from [here](https://github.com/tvdboom/ATOM/blob/master/examples/datasets/weatherAUS.csv). The goal is to predict whether or not it will rain tomorrow by training a binary classifier on the target column `RainTomorrow`.

## Load the data

In [1]:
# Import packages
import os
import tempfile
import pandas as pd
from atom import ATOMClassifier

In [2]:
# Load data
X = pd.read_csv("./datasets/weatherAUS.csv")

# Let's have a look
X.head()

|   | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
| 1 | Adelaide | 17.2 | 23.4 | 0.0 |  |  | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 |  |  | 17.7 | 21.9 | No | 0 |
| 2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
| 3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
| 4 | Walpole | 16.4 | 19.9 | 0.0 |  |  | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 |  |  | 17.4 | 18.1 | No | 0 |


In [3]:
# Define a temp directory to store the files in this example
tempdir = tempfile.gettempdir()

In [4]:
def get_size(filepath):
    """Return the size of the object in MB."""
    return f"{os.path.getsize(filepath + '.pkl') / 1e6:.2f}MB"

## Run the pipeline

In [5]:
atom = ATOMClassifier(X, y="RainTomorrow", verbose=2)


Algorithm task: binary classification.

Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 25.03 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)



Note that the dataset takes ~25 MB. We can reduce its size with the `shrink`
method, which converts each column to the smallest dtype that can hold its values.
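
As a rough illustration of what "smallest dtype" means for integer columns, the sketch below picks the narrowest signed integer type whose range covers a column's values. It is a pure-Python approximation and not ATOM's implementation; the helper name `smallest_int_dtype`, the bounds table, and the sample values are assumptions made for this example.

```python
# Hypothetical helper (not part of ATOM): choose the narrowest signed
# integer dtype whose range covers all non-missing values in a column.
INT_BOUNDS = {
    "Int8": (-2**7, 2**7 - 1),
    "Int16": (-2**15, 2**15 - 1),
    "Int32": (-2**31, 2**31 - 1),
    "Int64": (-2**63, 2**63 - 1),
}


def smallest_int_dtype(values):
    """Return the name of the narrowest dtype that fits all non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return "Int8"  # nothing to store, so the narrowest type suffices
    lo, hi = min(present), max(present)
    # Dicts preserve insertion order, so candidates are tried narrow to wide
    for name, (dmin, dmax) in INT_BOUNDS.items():
        if dmin <= lo and hi <= dmax:
            return name
    raise OverflowError("values do not fit in a 64-bit integer")


# Made-up sample values, chosen only to illustrate the two outcomes
humidity = [95, 59, None, 76, 78]    # all values fit in 8 bits
wind_gust_speed = [41, 54, 39, 135]  # 135 exceeds the Int8 maximum of 127

print(smallest_int_dtype(humidity))
print(smallest_int_dtype(wind_gust_speed))
```

The capitalized names that appear in the output of `atom.dtypes` after shrinking (`Int8`, `Float32`) are pandas' nullable extension dtypes, which can hold missing values alongside numbers.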

In [6]:
atom.dtypes

Location          object
MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustDir       object
WindGustSpeed    float64
WindDir9am        object
WindDir3pm        object
WindSpeed9am     float64
WindSpeed3pm     float64
Humidity9am      float64
Humidity3pm      float64
Pressure9am      float64
Pressure3pm      float64
Cloud9am         float64
Cloud3pm         float64
Temp9am          float64
Temp3pm          float64
RainToday         object
RainTomorrow       int64
dtype: object

In [7]:
atom.shrink(str2cat=True)

The column dtypes are successfully converted.


In [8]:
atom.dtypes

Location         category
MinTemp           Float32
MaxTemp           Float32
Rainfall          Float32
Evaporation       Float32
Sunshine          Float32
WindGustDir      category
WindGustSpeed       Int16
WindDir9am       category
WindDir3pm       category
WindSpeed9am        Int16
WindSpeed3pm         Int8
Humidity9am          Int8
Humidity3pm          Int8
Pressure9am       Float32
Pressure3pm       Float32
Cloud9am             Int8
Cloud3pm             Int8
Temp9am           Float32
Temp3pm           Float32
RainToday        category
RainTomorrow         Int8
dtype: object

In [9]:
# Let's check the memory usage again...
# Notice the huge drop!
atom.stats()

Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 9.67 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)


In [10]:
# Now, we create some new branches to train models with different transformers
atom.impute()
atom.encode()
atom.run("LDA")

atom.branch = "b2"
atom.scale()
atom.run("LDA_scaled")

atom.branch = "b3_from_main"
atom.normalize()
atom.run("LDA_norm")

Fitting Imputer...
Imputing missing values...
 --> Dropping 637 samples due to missing values in feature MinTemp.
 --> Dropping 322 samples due to missing values in feature MaxTemp.
 --> Dropping 1406 samples due to missing values in feature Rainfall.
 --> Dropping 60843 samples due to missing values in feature Evaporation.
 --> Dropping 67816 samples due to missing values in feature Sunshine.
 --> Dropping 9330 samples due to missing values in feature WindGustDir.
 --> Dropping 9270 samples due to missing values in feature WindGustSpeed.
 --> Dropping 10013 samples due to missing values in feature WindDir9am.
 --> Dropping 3778 samples due to missing values in feature WindDir3pm.
 --> Dropping 1348 samples due to missing values in feature WindSpeed9am.
 --> Dropping 2630 samples due to missing values in feature WindSpeed3pm.
 --> Dropping 1774 samples due to missing values in feature Humidity9am.
 --> Dropping 3610 samples due to missing values in feature Humidity3pm.
 --> Dropping 14

In [11]:
# If we save atom now, notice the size of the file
# This is because atom keeps a copy of every branch in memory
filename = os.path.join(tempdir, "atom1")
atom.save(filename)
get_size(filename)

ATOMClassifier successfully saved.


'38.99MB'

To avoid high memory usage, set the `memory` parameter. The data of inactive branches is then stored on disk instead of in memory.

In [12]:
atom = ATOMClassifier(X, y="RainTomorrow", memory=tempdir, verbose=1, random_state=1)
atom.shrink(str2cat=True)
atom.impute()
atom.encode()
atom.run("LDA")

atom.branch = "b2"
atom.scale()
atom.run("LDA_scaled")

atom.branch = "b3_from_main"
atom.normalize()
atom.run("LDA_norm")


Algorithm task: binary classification.
Cache storage: C:\Users\Mavs\AppData\Local\Temp\joblib

Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 25.03 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)

The column dtypes are successfully converted.
Retrieving cached Imputer...
Imputing missing values...
Retrieving cached Encoder...
Encoding categorical columns...

Models: LDA
Metric: f1


Results for LinearDiscriminantAnalysis:
Fit ---------------------------------------------
Train evaluation --> f1: 0.6233
Test evaluation --> f1: 0.6248
Time elapsed: 0.221s
-------------------------------------------------
Total time: 0.221s


Total time: 0.221s
-------------------------------------
LinearDiscriminantAnalysis --> f1: 0.6248
Successfully created new branch: b2.
Retrieving cached Scaler...
Scaling features...

Models: LDA_scaled
Metric: f1


Results for LinearDiscriminantAnal

In [13]:
# And now, it only takes a fraction of the previous size
# This is because the data of inactive branches is now stored on disk
filename = os.path.join(tempdir, "atom2")
atom.save(filename)
get_size(filename)

ATOMClassifier successfully saved.


'13.70MB'
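
One way to picture why the saved file shrinks is a container that keeps only the active branch's data in memory and writes the others to disk. The sketch below is a simplified illustration under that assumption, not how ATOM implements branches; the `BranchStore` class and its `switch` method are hypothetical names.

```python
import os
import pickle
import tempfile


class BranchStore:
    """Hypothetical container that keeps only the active branch in memory."""

    def __init__(self, directory):
        self.directory = directory
        self.active_name = None
        self.active_data = None

    def _path(self, name):
        return os.path.join(self.directory, f"{name}.pkl")

    def switch(self, name, data=None):
        """Write the current branch to disk, then activate `name`."""
        if self.active_name is not None:
            with open(self._path(self.active_name), "wb") as f:
                pickle.dump(self.active_data, f)
        if data is None:
            # Reload a branch that was offloaded earlier
            with open(self._path(name), "rb") as f:
                data = pickle.load(f)
        self.active_name, self.active_data = name, data


n_rows = 10_000

with tempfile.TemporaryDirectory() as tmp:
    store = BranchStore(tmp)
    store.switch("main", data=list(range(n_rows)))
    store.switch("b2", data=list(range(n_rows)))  # "main" goes to disk here

    # Only one dataset is held in memory, so serializing it costs less
    # than serializing a copy of every branch.
    in_memory_size = len(pickle.dumps(store.active_data))
    both_size = len(pickle.dumps([list(range(n_rows)), list(range(n_rows))]))
    offloaded = os.path.exists(os.path.join(tmp, "main.pkl"))

    store.switch("main")  # "b2" goes to disk, "main" is read back
    restored_len = len(store.active_data)

print(in_memory_size < both_size, offloaded, restored_len)
```

The trade-off in this design is extra disk reads and writes whenever the active branch changes, in exchange for a smaller memory footprint.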

Additionally, repeated calls to the same transformers with the same data use the cached results.  
Don't forget to specify the `random_state` parameter so that the data stays exactly the same between runs.
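
The link between caching and `random_state` can be sketched with a small disk cache keyed on a hash of the input data. The log output in this example points to a joblib cache directory, but the sketch below uses only the standard library and is not ATOM's mechanism; `cached_call`, `fill_missing`, and `make_train_set` are hypothetical helpers.

```python
import hashlib
import os
import pickle
import random
import tempfile


def cached_call(func, data, cache_dir):
    """Run func(data), reusing a result stored on disk for identical input."""
    key = hashlib.sha256(pickle.dumps((func.__name__, data))).hexdigest()
    path = os.path.join(cache_dir, f"{key}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f), True  # cache hit
    result = func(data)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result, False  # cache miss


def fill_missing(rows):
    """Toy transformer: replace missing values with 0."""
    return [0 if v is None else v for v in rows]


def make_train_set(rows, seed):
    """Shuffle with a fixed seed and keep the first 80% as the train set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[: int(0.8 * len(shuffled))]


rows = [1, None, 3, None, 5, 6, None, 8, 9, 10]

with tempfile.TemporaryDirectory() as cache_dir:
    # Same seed -> identical train set -> the second call hits the cache
    _, hit_first = cached_call(fill_missing, make_train_set(rows, seed=1), cache_dir)
    _, hit_same = cached_call(fill_missing, make_train_set(rows, seed=1), cache_dir)
    # A different seed will generally produce a different train set,
    # which hashes to a different key and is computed from scratch
    _, hit_other = cached_call(fill_missing, make_train_set(rows, seed=2), cache_dir)

print(hit_first, hit_same, hit_other)
```

Without a fixed seed, the train and test split would differ on every run, the hash of the input would change, and the cached result would never be reused.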

In [14]:
atom = ATOMClassifier(X, y="RainTomorrow", memory=tempdir, verbose=1, random_state=1)
atom.shrink(str2cat=True)


Algorithm task: binary classification.
Cache storage: C:\Users\Mavs\AppData\Local\Temp\joblib

Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 25.03 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)

The column dtypes are successfully converted.


In [15]:
# Note that the transformers are no longer fitted;
# instead, the results are read directly from the cache
atom.impute()
atom.encode()

Retrieving cached Imputer...
Imputing missing values...
Retrieving cached Encoder...
Encoding categorical columns...


In [16]:
atom.dataset

|   | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.075703 | 13.0 | 30.5 | 0.0 | 6.8 | 10.0 | 0.271668 | 59 | 0.312069 | 0.273733 | ... | 19 | 8 | 1013.599976 | 1008.0 | 0 | 2 | 19.6 | 29.9 | 0.0 | 0 |
| 1 | 0.245394 | 15.3 | 22.4 | 16.0 | 4.2 | 3.3 | 0.204934 | 39 | 0.236475 | 0.199626 | ... | 83 | 63 | 1025.5 | 1023.599976 | 6 | 6 | 16.9 | 21.1 | 1.0 | 1 |
| 2 | 0.262397 | 27.9 | 34.5 | 0.0 | 9.0 | 7.9 | 0.1737 | 72 | 0.236475 | 0.306935 | ... | 72 | 63 | 1009.0 | 1005.5 | 7 | 7 | 31.0 | 33.099998 | 0.0 | 1 |
| 3 | 0.239174 | 12.9 | 27.9 | 0.0 | 5.4 | 8.6 | 0.269421 | 39 | 0.256213 | 0.286159 | ... | 69 | 56 | 1023.400024 | 1019.799988 | 7 | 7 | 14.7 | 23.4 | 0.0 | 0 |
| 4 | 0.253089 | 7.4 | 14.3 | 0.8 | 2.8 | 4.0 | 0.210095 | 31 | 0.269333 | 0.167808 | ... | 84 | 62 | 1023.599976 | 1023.200012 | 4 | 7 | 9.0 | 13.6 | 0.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56415 | 0.295559 | 23.9 | 28.1 | 0.0 | 2.6 | 7.7 | 0.241448 | 44 | 0.279553 | 0.259391 | ... | 86 | 79 | 1015.900024 | 1013.900024 | 7 | 7 | 25.799999 | 27.5 | 0.0 | 0 |
| 56416 | 0.217037 | 13.6 | 24.6 | 0.0 | 4.4 | 7.8 | 0.1737 | 39 | 0.193908 | 0.197102 | ... | 87 | 61 | 1023.200012 | 1022.599976 | 7 | 3 | 17.299999 | 21.4 | 0.0 | 0 |
| 56417 | 0.112176 | 16.299999 | 38.700001 | 0.0 | 10.2 | 13.4 | 0.1737 | 24 | 0.149795 | 0.168702 | ... | 29 | 8 | 1013.5 | 1010.299988 | 5 | 2 | 26.4 | 36.900002 | 0.0 | 0 |
| 56418 | 0.295559 | 11.5 | 19.200001 | 0.8 | 2.0 | 7.0 | 0.147458 | 22 | 0.13795 | 0.195807 | ... | 73 | 52 | 1021.299988 | 1018.799988 | 3 | 4 | 17.1 | 18.4 | 0.0 | 0 |
