# Example: Memory considerations
--------------------------------

This example shows how to use the `memory` parameter to make efficient use of the available memory.

The data used is a variation on the [Australian weather dataset](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package) from Kaggle. You can download it [here](https://github.com/tvdboom/ATOM/blob/master/examples/datasets/weatherAUS.csv). The goal of this dataset is to predict whether or not it will rain tomorrow by training a binary classifier on the target column `RainTomorrow`.

## Load the data

In [1]:
# Import packages
import sys
import pandas as pd
from atom import ATOMClassifier
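
The cells further down call a `get_size` helper that is not defined anywhere in this notebook. The following is a minimal sketch of what such a helper could look like, assuming it recursively sums `sys.getsizeof` over an object's attributes and containers. The original implementation may differ, so the numbers it reports will not necessarily match the outputs shown in this example.

```python
import sys


def get_size(obj, seen=None):
    """Roughly estimate the memory footprint of obj in bytes (hypothetical helper)."""
    if seen is None:
        seen = set()

    # Count every object only once, even if it is referenced several times
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    seen.add(obj_id)

    size = sys.getsizeof(obj)

    # Add the contents of the built-in containers
    if isinstance(obj, dict):
        size += sum(get_size(k, seen) + get_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(get_size(item, seen) for item in obj)

    # Add the instance attributes, if the object has any
    if hasattr(obj, "__dict__"):
        size += get_size(vars(obj), seen)

    return size
```

This is an estimate only: objects that use `__slots__` or keep their data in C extensions are counted by whatever their own `__sizeof__` reports, and very deeply nested objects can hit Python's recursion limit.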

In [2]:
# Load data
X = pd.read_csv("./datasets/weatherAUS.csv")

# Let's have a look
X.head()

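| | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MelbourneAirport | 18.0 | 26.9 | 21.4 | 7.0 | 8.9 | SSE | 41.0 | W | SSE | ... | 95.0 | 54.0 | 1019.5 | 1017.0 | 8.0 | 5.0 | 18.5 | 26.0 | Yes | 0 |
| 1 | Adelaide | 17.2 | 23.4 | 0.0 | NaN | NaN | S | 41.0 | S | WSW | ... | 59.0 | 36.0 | 1015.7 | 1015.7 | NaN | NaN | 17.7 | 21.9 | No | 0 |
| 2 | Cairns | 18.6 | 24.6 | 7.4 | 3.0 | 6.1 | SSE | 54.0 | SSE | SE | ... | 78.0 | 57.0 | 1018.7 | 1016.6 | 3.0 | 3.0 | 20.8 | 24.1 | Yes | 0 |
| 3 | Portland | 13.6 | 16.8 | 4.2 | 1.2 | 0.0 | ESE | 39.0 | ESE | ESE | ... | 76.0 | 74.0 | 1021.4 | 1020.5 | 7.0 | 8.0 | 15.6 | 16.0 | Yes | 1 |
| 4 | Walpole | 16.4 | 19.9 | 0.0 | NaN | NaN | SE | 44.0 | SE | SE | ... | 78.0 | 70.0 | 1019.4 | 1018.9 | NaN | NaN | 17.4 | 18.1 | No | 0 |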


## Run the pipeline

In [35]:
# Note the dataset takes >60MB of memory
atom = ATOMClassifier(X, y="RainTomorrow", memory=False, verbose=2, random_state=1)


Algorithm task: binary classification.

Shape: (142193, 22)
Train set size: 113755
Test set size: 28438
-------------------------------------
Memory: 61.69 MB
Scaled: False
Missing values: 316559 (10.1%)
Categorical features: 5 (23.8%)
Duplicates: 45 (0.0%)



In [32]:
atom.branch._data.data = None

In [36]:
print(f"Size of atom is: {round(get_size(atom) / 1e6, 2)}MB")

Size of atom is: 142.77MB


In [37]:
atom.save("2")

ATOMClassifier successfully saved.


In [17]:
atom.dataset.memory_usage(deep=False).sum() / 1e6

25.026096

In [18]:
atom.save_data("2")

Data set successfully saved.


In [19]:
atom.shrink()

The column dtypes are successfully converted.


In [20]:
atom.dataset.memory_usage(deep=False).sum() / 1e6

14.646007
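
The drop from about 25 MB to about 14.6 MB comes from storing the columns in narrower dtypes. The sketch below is not ATOM's implementation of `shrink`. It only uses the standard-library `array` module to illustrate the underlying principle that the same values take fewer bytes in a smaller element type. The sample values are the first five `MinTemp` and `RainTomorrow` entries from the preview above.

```python
from array import array

min_temp = [18.0, 17.2, 18.6, 13.6, 16.4]
rain_tomorrow = [0, 0, 0, 1, 0]


def nbytes(arr):
    """Bytes used by the raw element buffer of an array."""
    return arr.itemsize * len(arr)


# Floats: 8-byte doubles vs 4-byte single precision
as_float64 = array("d", min_temp)
as_float32 = array("f", min_temp)

# Integers: 8-byte signed integers vs 1-byte signed integers
as_int64 = array("q", rain_tomorrow)
as_int8 = array("b", rain_tomorrow)

print("float64:", nbytes(as_float64), "float32:", nbytes(as_float32))
print("int64:", nbytes(as_int64), "int8:", nbytes(as_int8))
```

Narrowing integer columns is lossless as long as the values fit in the smaller type. Narrowing floats can lose precision: `17.2` is no longer exactly `17.2` once stored as a 4-byte float, which is usually acceptable for model training but worth keeping in mind.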

In [21]:
atom.save_data("3")

Data set successfully saved.


In [10]:
# Now, we create a new branch
atom.branch = "b2"

Successfully created new branch: b2.


In [11]:
print(f"Size of atom is: {round(get_size(atom) / 1e6, 2)}MB")

Size of atom is: 372.5MB


In [12]:
atom.scale()

Fitting Scaler...
Scaling features...


In [13]:
atom.run("LR")


Models: LR
Metric: f1


Results for LogisticRegression:
Fit ---------------------------------------------

Exception encountered while running the LR model.
ValueError: could not convert string to float: 'Woomera'


RuntimeError: All models failed to run. Use the logger to investigate the exceptions.
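
The `ValueError` above points at the cause of the failure: the dataset still contains string columns (the preview shows `Location`, for example), and the estimator needs numeric input. The sketch below reproduces the same error with plain Python and shows one simple way to map categories to numbers. It is an illustration only, not ATOM's encoder, and the `locations` list is made-up sample data.

```python
# Hypothetical sample of a categorical column
locations = ["MelbourneAirport", "Adelaide", "Woomera", "Adelaide"]

# Converting a category name to a float fails with the same message
try:
    float(locations[2])
except ValueError as exc:
    message = str(exc)
    print(message)  # could not convert string to float: 'Woomera'

# Ordinal encoding: assign every distinct category an integer code
codes = {name: i for i, name in enumerate(sorted(set(locations)))}
encoded = [float(codes[name]) for name in locations]
print(encoded)  # [1.0, 0.0, 2.0, 0.0]
```

In ATOM, the categorical columns would normally be encoded (and the missing values imputed) before calling `run`. See the ATOM documentation for the `encode` and `impute` methods and their options.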

In [None]:
# See that the change is small
print(f"Size of atom is: {round(get_size(atom) / 1e6, 2)}MB")

## Analyze the results

In [None]:
# Let's compare the differences between the models
print(atom.lr.scaler)
print(atom.bag.scaler)
print(atom.lr2.scaler)

In [None]:
# And the data they use is different
print(atom.lr.X.iloc[:5, :3])
print("-----------------------------")
print(atom.bag.X.iloc[:5, :3])
print("-----------------------------")
print(atom.lr2.X_train.equals(atom.lr.X_train))

In [None]:
# Note that the scaler is included in the model's pipeline
print(atom.lr.pipeline)
print("-----------------------------")
print(atom.bag.pipeline)
print("-----------------------------")
print(atom.lr2.pipeline)

In [None]:
atom.plot_pipeline()