# Example: Getting started
--------------------------

This example shows how to get started with the atom-ml library.

The data used is a variation on the [Australian weather dataset](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package) from Kaggle. You can download it from [here](https://github.com/tvdboom/ATOM/blob/master/examples/datasets/weatherAUS.csv). The goal is to predict whether or not it will rain tomorrow by training a binary classifier on the target column `RainTomorrow`.

In [1]:
import pandas as pd
from atom import ATOMClassifier

# Load the Australian Weather dataset
X = pd.read_csv("https://raw.githubusercontent.com/tvdboom/ATOM/master/examples/datasets/weatherAUS.csv")

In [2]:
atom = ATOMClassifier(X, y="RainTomorrow", n_rows=1000, verbose=2)

Algorithm task: binary classification.

Shape: (1000, 22)
Train set size: 800
Test set size: 200
-------------------------------------
Memory: 433.89 kB
Scaled: False
Missing values: 2232 (10.1%)
Categorical features: 5 (23.8%)
Outlier values: 3.0 (0.0%)
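
The percentages in this summary appear to follow directly from the dataset shape. The sketch below reproduces them under two assumptions: missing values are counted over all cells, and categorical features are counted over the 21 non-target columns. It is a plausibility check on the printed numbers, not ATOM's own code.

```python
# Numbers copied from the summary printed above.
n_rows, n_cols = 1000, 22   # Shape: (1000, 22), target column included
n_missing = 2232
n_categorical = 5
n_train, n_test = 800, 200

# Assumption: the missing-value share is taken over every cell in the dataset.
missing_pct = round(100 * n_missing / (n_rows * n_cols), 1)

# Assumption: the categorical share is taken over the feature columns only,
# i.e. all columns except the target RainTomorrow.
categorical_pct = round(100 * n_categorical / (n_cols - 1), 1)

# Fraction of the sampled rows held out for testing.
test_fraction = n_test / (n_train + n_test)

print(missing_pct, categorical_pct, test_fraction)  # 10.1 23.8 0.2
```

Both percentages match the printed summary, which suggests the assumptions are reasonable.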



In [3]:
atom.impute(strat_num="median", strat_cat="most_frequent")  
atom.encode(strategy="LeaveOneOut", max_onehot=8)

Fitting Imputer...
Imputing missing values...
 --> Imputing 1 missing values with median (12.0) in feature MinTemp.
 --> Imputing 10 missing values with median (0.0) in feature Rainfall.
 --> Imputing 424 missing values with median (4.8) in feature Evaporation.
 --> Imputing 470 missing values with median (8.4) in feature Sunshine.
 --> Imputing 72 missing values with most_frequent (SE) in feature WindGustDir.
 --> Imputing 72 missing values with median (37.0) in feature WindGustSpeed.
 --> Imputing 68 missing values with most_frequent (N) in feature WindDir9am.
 --> Imputing 29 missing values with most_frequent (W) in feature WindDir3pm.
 --> Imputing 8 missing values with median (13.0) in feature WindSpeed9am.
 --> Imputing 20 missing values with median (19.0) in feature WindSpeed3pm.
 --> Imputing 11 missing values with median (70.0) in feature Humidity9am.
 --> Imputing 27 missing values with median (52.0) in feature Humidity3pm.
 --> Imputing 112 missing values with median (1017.4
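
As a rough illustration of what the two strategies mean, the sketch below fills gaps in a numeric column with its median and in a categorical column with its most frequent value. It uses only the standard library and made-up toy values. The `impute_column` helper is hypothetical and is not how ATOM implements `atom.impute`.

```python
from statistics import median, mode


def impute_column(values, strategy):
    """Replace None entries with the median or the most frequent observed value.

    Hypothetical helper for illustration only; returns the filled column
    and the value that was used to fill the gaps.
    """
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = median(observed)
    elif strategy == "most_frequent":
        fill = mode(observed)
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values], fill


# Toy columns (invented values, not taken from the weather dataset).
min_temp = [12.0, None, 9.5, 14.1, 12.0]
wind_dir = ["SE", "N", None, "SE", "W"]

min_temp_filled, num_fill = impute_column(min_temp, "median")
wind_dir_filled, cat_fill = impute_column(wind_dir, "most_frequent")

print(num_fill, min_temp_filled)
print(cat_fill, wind_dir_filled)
```

Numeric columns map to `strat_num` and categorical columns to `strat_cat` in the call above.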

In [4]:
atom.run(models=["LDA", "AdaB"], metric="auc", n_trials=10)


Models: LDA, AdaB
Metric: roc_auc


Running hyperparameter tuning for LinearDiscriminantAnalysis...
| trial |  solver | shrinkage | roc_auc | best_roc_auc | time_trial | time_ht |    state |
| ----- | ------- | --------- | ------- | ------------ | ---------- | ------- | -------- |
| 0     |   eigen |      None |  0.8322 |       0.8322 |     0.152s |  0.152s | COMPLETE |
| 1     |    lsqr |       1.0 |  0.7342 |       0.8322 |     0.146s |  0.298s | COMPLETE |
| 2     |    lsqr |       0.7 |  0.8683 |       0.8683 |     0.143s |  0.441s | COMPLETE |
| 3     |   eigen |       0.9 |  0.9182 |       0.9182 |     0.144s |  0.586s | COMPLETE |
| 4     |   eigen |      None |  0.8322 |       0.9182 |     0.003s |  0.589s | COMPLETE |
| 5     |     svd |       --- |  0.7723 |       0.9182 |     0.146s |  0.735s | COMPLETE |
| 6     |     svd |       --- |  0.7723 |       0.9182 |     0.002s |  0.737s | COMPLETE |
| 7     |     svd |       --- |  0.7723 |       0.9182 |     0.002s |  0.739s | 
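
The `best_roc_auc` column is consistent with a running maximum of the per-trial `roc_auc` scores. A minimal check, using only the eight scores visible in the table above:

```python
from itertools import accumulate

# roc_auc per trial, copied from trials 0 to 7 of the table above.
roc_auc = [0.8322, 0.7342, 0.8683, 0.9182, 0.8322, 0.7723, 0.7723, 0.7723]

# Running maximum: the best score seen up to and including each trial.
best_so_far = list(accumulate(roc_auc, max))

# Index of the first trial that reached the overall best score.
best_trial = roc_auc.index(max(roc_auc))

print(best_so_far)
print(best_trial)
```

Among the trials shown, trial 3 (`solver="eigen"`, `shrinkage=0.9`) holds the best score.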

In [5]:
atom.evaluate()

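|      | accuracy | average_precision | balanced_accuracy |     f1 | jaccard | matthews_corrcoef | precision | recall | roc_auc |
| ---- | -------- | ----------------- | ----------------- | ------ | ------- | ----------------- | --------- | ------ | ------- |
| LDA  |    0.775 |            0.6305 |            0.7334 | 0.5631 |  0.3919 |             0.424 |    0.4915 | 0.6591 |  0.8364 |
| AdaB |    0.815 |            0.5364 |            0.6122 | 0.3729 |  0.2292 |            0.3529 |    0.7333 |   0.25 |    0.81 |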
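
Several of these metrics are related. For binary classification, F1 is the harmonic mean of precision and recall, and the Jaccard index can be derived from F1. The sketch below recomputes both from the precision and recall values reported above. Small deviations are possible because the inputs are already rounded to four decimals.

```python
def f1_from(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def jaccard_from_f1(f1):
    """Jaccard index expressed through F1: J = F1 / (2 - F1)."""
    return f1 / (2 - f1)


# Precision and recall copied from the evaluation results above.
reported = {
    "LDA": {"precision": 0.4915, "recall": 0.6591},
    "AdaB": {"precision": 0.7333, "recall": 0.25},
}

derived = {}
for name, scores in reported.items():
    f1 = f1_from(scores["precision"], scores["recall"])
    derived[name] = {"f1": f1, "jaccard": jaccard_from_f1(f1)}
    print(name, round(f1, 4), round(derived[name]["jaccard"], 4))
```

The recomputed values agree with the `f1` and `jaccard` columns. They also show why AdaB scores higher on accuracy but lower on F1 than LDA: its recall of 0.25 pulls the harmonic mean down.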
