## Try this Notebook in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truefoundry/mlfoundry-examples/blob/main/examples/sklearn/wine_classification.ipynb)

## Install dependencies

In [4]:
! pip install --quiet "numpy>=1.0.0,<2.0.0" "pandas>=1.0.0,<2.0.0" scikit-learn shap==0.40.0
! pip install -U mlfoundry


## Initialize MLFoundry Client

In [5]:
import os
import getpass
import urllib.parse
import mlfoundry as mlf

In [6]:
TFY_URL = os.environ.get('TFY_URL', 'https://app.truefoundry.com/')
TFY_API_KEY = os.environ.get('TFY_API_KEY')
if not TFY_API_KEY:
    print(f'Paste your TrueFoundry API key\nYou can find it over at {urllib.parse.urljoin(TFY_URL, "settings")}')
    TFY_API_KEY = getpass.getpass()

Paste your TrueFoundry API key
You can find it over at https://app.truefoundry.com/settings


 ····························································
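One detail of the `urllib.parse.urljoin` call used to build the settings link: the result depends on whether the base URL ends with a slash. This matters if `TFY_URL` is set to a custom value that includes a path. The sketch below uses `example.com` as a placeholder host (an assumption for illustration, not a TrueFoundry endpoint).

```python
from urllib.parse import urljoin

# Base ends with "/": the relative part is appended to the path.
with_slash = urljoin("https://example.com/app/", "settings")

# Base has no trailing "/": the last path segment ("app") is replaced.
without_slash = urljoin("https://example.com/app", "settings")

# A bare host has no path segment to replace, so the part is simply appended.
bare_host = urljoin("https://example.com", "settings")

print(with_slash)
print(without_slash)
print(bare_host)
```

If a custom base URL carries a path, keeping the trailing slash is the safer choice.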


In [7]:
client = mlf.get_client(api_key=TFY_API_KEY)

---

## Wine recognition as a Classification problem

In [8]:
import shap
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score


### Loading data and preprocessing

In [9]:
data = datasets.load_wine()
print(data.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])


In [10]:
print(data.DESCR) 

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

In [11]:
# Build a DataFrame from the feature matrix, using the feature names as column labels
df = pd.DataFrame(data.data, columns=data.feature_names)
# Add the class labels as a 'target' column
df['target'] = data.target
# Show the first five rows
df.head()

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |
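The "Number of Instances" line in the dataset description printed earlier suggests 50 samples per class, but the three classes are not balanced. A quick way to check the actual distribution is to count the labels directly. This is a minimal sketch that reloads the dataset so it can run on its own; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# Count how many samples carry each integer label (0, 1, 2).
class_counts = np.bincount(wine.target)

for name, count in zip(wine.target_names, class_counts):
    print(f"{name}: {count}")
```

Because the classes differ in size, it is worth keeping class proportions in mind when splitting the data and when choosing an averaging mode for F1.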


### Start MLFoundry Run(s)

In [12]:
run = client.create_run(project_name='wine-recognition-project')
print('RUN 1 ID:', run.run_id)
print(f'You can track your runs live at {urllib.parse.urljoin(TFY_URL, "mlfoundry")}')

[mlfoundry] 2022-05-13T17:24:48+0530 INFO project wine-recognition-project does not exist. Creating wine-recognition-project.
[mlfoundry] 2022-05-13T17:24:54+0530 INFO Run is created with id '2391dc76418447c5b5a8deeef44f3355' and name 'meet-better-game'
RUN 1 ID: 2391dc76418447c5b5a8deeef44f3355
You can track your runs live at https://app.truefoundry.com/mlfoundry


### Log the dataset

In [13]:
# Feature matrix as a DataFrame with named columns
X = pd.DataFrame(data.data, columns=data.feature_names)
# Target labels (integer class ids 0, 1, 2)
y = data.target

run.log_dataset(
    dataset_name='wine_recognition_dataset',
    features=X,
    actuals=y
)

### Split Dataset into Training and Test Sets

In [14]:
# split the data using scikit-learn's train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, stratify=y, random_state=42)
print('Train samples:', len(X_train))
print('Test samples:', len(X_test))

Train samples: 142
Test samples: 36
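The `stratify=y` argument asks `train_test_split` to keep the class proportions roughly equal in both splits. The sketch below illustrates this by comparing class shares in the full data, the training split, and the test split. It reloads the data so it is self-contained; `class_shares` is a helper introduced here for illustration only.

```python
from collections import Counter

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42
)

def class_shares(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels.tolist())
    total = len(labels)
    return {cls: round(counts[cls] / total, 3) for cls in sorted(counts)}

full_shares = class_shares(y)
train_shares = class_shares(y_train)
test_shares = class_shares(y_test)

print("full :", full_shares)
print("train:", train_shares)
print("test :", test_shares)
```

The shares cannot match exactly because each split holds a whole number of samples per class, but they should be close.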


### Setting tags and Logging parameters

In [15]:
clf = RandomForestClassifier(n_estimators=100, max_depth=15)
run.set_tags({'framework': 'sklearn', 'task': 'classification', 'model': 'RandomForestClassifier'})
run.log_params({'n_estimators': 100, 'max_depth': 15})

[mlfoundry] 2022-05-13T17:25:13+0530 INFO Parameters logged successfully
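The cell above writes the hyperparameter values twice: once in the constructor and once in the dictionary passed to `run.log_params`. One way to keep the two from drifting apart is to read the values back from the estimator with scikit-learn's `get_params()`. This is a sketch of that idea; the `tracked` tuple is our own choice of which parameters to record, not something mlfoundry requires.

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=15)

# Names of the hyperparameters we want to track (an illustrative choice).
tracked = ("n_estimators", "max_depth")

# get_params() returns every constructor argument of the estimator.
all_params = clf.get_params()
params_to_log = {name: all_params[name] for name in tracked}

print(params_to_log)
# params_to_log could then be passed to run.log_params(...) as in the cell above.
```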


### Training model and logging model

In [16]:
clf.fit(X_train, y_train)
run.log_model(clf, framework=mlf.ModelFramework.SKLEARN)

[mlfoundry] 2022-05-13T17:25:33+0530 INFO Model logged successfully


### Computing predictions

In [17]:
# Compute predictions on the training and test splits
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)

### Logging metrics

In [20]:
metrics = {
    'train/accuracy_score': accuracy_score(y_train, y_pred_train),
    'train/f1': f1_score(y_train, y_pred_train, average='weighted'),
    'test/accuracy_score': accuracy_score(y_test, y_pred_test),
    'test/f1': f1_score(y_test, y_pred_test, average='weighted'),
}
print('Random Forest metrics:', metrics)
run.log_metrics(metrics)

Random Forest metrics: {'train/accuracy_score': 1.0, 'train/f1': 1.0, 'test/accuracy_score': 1.0, 'test/f1': 1.0}
[mlfoundry] 2022-05-13T17:26:44+0530 INFO Metrics logged successfully
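The F1 scores above use `average='weighted'`, which computes F1 for each class separately and then averages those values weighted by the number of true samples in each class. To make that concrete, here is a small pure-Python sketch of the same calculation on toy labels. `weighted_f1` is a hypothetical helper written for illustration; it is not the scikit-learn implementation.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score

# Toy labels: one class-0 sample is misclassified as class 1.
toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 2]
toy_score = weighted_f1(toy_true, toy_pred)
print(round(toy_score, 4))
```

Here the per-class F1 values are 0.8, 0.8 and 1.0 with supports 3, 2 and 1, so the weighted score is 5/6.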


### Log Test dataset stats

In [21]:
X_test_df = X_test.copy()
X_test_df['targets'] = list(y_test)
X_test_df['predictions'] = list(y_pred_test)
X_test_df['prediction_probabilities'] = list(clf.predict_proba(X_test))

# Compute SHAP values for the random forest on the test set
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)

run.log_dataset_stats(
    X_test_df, 
    data_slice='test',
    data_schema=mlf.Schema(
        feature_column_names=list(data.feature_names),
        prediction_column_name='predictions',
        actual_column_name='targets',
        prediction_probability_column_name='prediction_probabilities'
    ),
    shap_values=shap_values,
    model_type='multiclass_classification',
)

WARN: Missing config
[mlfoundry] 2022-05-13T17:27:09+0530 INFO Metrics logged successfully
[mlfoundry] 2022-05-13T17:27:16+0530 INFO Dataset stats have been successfully computed and logged
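Each entry in the `prediction_probabilities` column is one row of `predict_proba`: a vector with one probability per class. The sketch below shows how those rows relate to the hard predictions for a random forest. It retrains a model locally so it runs on its own; `random_state=0` on the classifier is an assumption added here for repeatability and is not part of the notebook's model.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, max_depth=15, random_state=0)
clf.fit(X_train, y_train)

# One row per test sample, one column per class.
proba = clf.predict_proba(X_test)

# Each row is a probability distribution over the classes.
row_sums = proba.sum(axis=1)

# The predicted label is the class with the highest probability.
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

print(proba.shape)
print(np.allclose(row_sums, 1.0))
print(np.array_equal(labels_from_proba, clf.predict(X_test)))
```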


In [22]:
run.end()

[mlfoundry] 2022-05-13T17:27:50+0530 INFO Shutting down background jobs and syncing data for run with id '2391dc76418447c5b5a8deeef44f3355', please don't kill this process...
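If a cell raises an exception before `run.end()` is reached, the run is left open. One possible safeguard is a small context manager that always ends the run. The sketch below assumes only that the run object exposes an `end()` method, as used above; `managed_run` and `FakeRun` are hypothetical names introduced here so the example runs without a TrueFoundry account.

```python
from contextlib import contextmanager

@contextmanager
def managed_run(run):
    """Yield `run` and always call run.end(), even if the body raises."""
    try:
        yield run
    finally:
        run.end()

class FakeRun:
    """Hypothetical stand-in for an mlfoundry run object."""

    def __init__(self):
        self.ended = False

    def end(self):
        self.ended = True

fake = FakeRun()
try:
    with managed_run(fake) as run:
        # Simulate a failure in the middle of training or logging.
        raise RuntimeError("training failed")
except RuntimeError:
    pass

print(fake.ended)
```

With a real client, the object returned by `client.create_run(...)` would take the place of `FakeRun()`.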


## Training a KNN model

In [24]:
run = client.create_run(project_name='wine-recognition-project')
print('RUN 2 ID:', run.run_id)

# Log the dataset (the same wine data as in run 1)
run.log_dataset(
    dataset_name='wine_recognition_dataset',
    features=X,
    actuals=y
)

# Keep using the run created above so that the dataset, parameters,
# model and metrics are all recorded under the same run
clf = KNeighborsClassifier(n_neighbors=8)
run.set_tags({'framework': 'sklearn', 'task': 'classification', 'model': 'KNeighborsClassifier'})
run.log_params({'n_neighbors': 8})

clf.fit(X_train, y_train)
run.log_model(clf, framework=mlf.ModelFramework.SKLEARN)

y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)

metrics = {
    'train/accuracy_score': accuracy_score(y_train, y_pred_train),
    'train/f1': f1_score(y_train, y_pred_train, average='weighted'),
    'test/accuracy_score': accuracy_score(y_test, y_pred_test),
    'test/f1': f1_score(y_test, y_pred_test, average='weighted'),
}
print('KNN metrics:', metrics)
run.log_metrics(metrics)


X_test_df = X_test.copy()
X_test_df['targets'] = list(y_test)
X_test_df['predictions'] = list(y_pred_test)
X_test_df['prediction_probabilities'] = list(clf.predict_proba(X_test))

# SHAP values are not passed for the KNN model
run.log_dataset_stats(
    X_test_df,
    data_slice='test',
    data_schema=mlf.Schema(
        feature_column_names=list(data.feature_names),
        prediction_column_name='predictions',
        actual_column_name='targets',
        prediction_probability_column_name='prediction_probabilities'
    ),
    model_type='multiclass_classification',
)

run.end()

[mlfoundry] 2022-05-13T17:35:57+0530 INFO Run is created with id 'aa4f55d7e42b43abb6a37376f3b5ccde' and name 'send-medical-father'
[mlfoundry] 2022-05-13T17:35:58+0530 INFO Shutting down background jobs and syncing data for run with id '6da4b742c735439c852b0ebb4249208b', please don't kill this process...
RUN 2 ID: aa4f55d7e42b43abb6a37376f3b5ccde
[mlfoundry] 2022-05-13T17:36:27+0530 INFO Parameters logged successfully
[mlfoundry] 2022-05-13T17:36:43+0530 INFO Model logged successfully
KNN metrics: {'train/accuracy_score': 0.7887323943661971, 'train/f1': 0.784706687620442, 'test/accuracy_score': 0.75, 'test/f1': 0.7419569817230636}
[mlfoundry] 2022-05-13T17:36:44+0530 INFO Metrics logged su
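Both runs build the same four-entry metrics dictionary by hand. A small helper can remove that duplication and keep the metric names consistent across runs. The sketch below is one way to do it; `split_metrics` is a hypothetical helper, and the toy labels exist only to make the example runnable on its own.

```python
from sklearn.metrics import accuracy_score, f1_score

def split_metrics(y_true, y_pred, prefix):
    """Accuracy and weighted F1 for one data split, keyed as '<prefix>/<metric>'."""
    return {
        f"{prefix}/accuracy_score": accuracy_score(y_true, y_pred),
        f"{prefix}/f1": f1_score(y_true, y_pred, average="weighted"),
    }

# Toy labels standing in for real train/test predictions.
toy_true = [0, 1, 2, 2]
toy_train_pred = [0, 1, 2, 2]  # all correct
toy_test_pred = [0, 1, 2, 1]   # one mistake

metrics = {
    **split_metrics(toy_true, toy_train_pred, "train"),
    **split_metrics(toy_true, toy_test_pred, "test"),
}
print(metrics)
```

The resulting dictionary has the same keys as the ones logged above, so it could be passed to `run.log_metrics(...)` unchanged.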