# Part 3: Finer-Grained Annotating and Versioning

Now that we've learned how to use the expressive syntax of Flor, we will annotate and version, in as much detail as possible, the experiment we have been developing.

## Prepare your environment before starting the activities.

We're going to start by importing Flor and letting it know the name of our notebook.

In [None]:
# Import Flor
import flor

# If the notebook name has not already been set, you can set it in code.
flor.setNotebookName('tutorial_3.ipynb')

## We'll declare all our code in advance

The first thing we'll do is split the data. For now, pay attention to how Flor reads and writes data. Later, when we specify the Flor Plan, we'll tell Flor what an input is and what an output is. Also, notice that in the past the resulting training/testing split (and its parameters) were lost, but here we're able to track them.

In [None]:
@flor.func
def split(intermediate_X, intermediate_y, test_size, random_state, 
          X_train, X_test, y_train, y_test, **kwargs):
    import json
    from sklearn.model_selection import train_test_split
    
    # Read the inputs (the with-statement closes each file automatically)
    with open(intermediate_X) as f:
        X = json.load(f)

    with open(intermediate_y) as f:
        y = json.load(f)
        
    # Split the data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=random_state)
    
    # Write the outputs
    with open(X_train, 'w') as f:
        json.dump(X_tr, f)
    with open(X_test, 'w') as f:
        json.dump(X_te, f)
    with open(y_train, 'w') as f:
        json.dump(y_tr, f)
    with open(y_test, 'w') as f:
        json.dump(y_te, f)
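
The read-from-path, write-to-path pattern used by `split` can be tried without Flor. The following is a minimal sketch in plain Python, not Flor's API: `split_by_path` and `run_once` are hypothetical names, and a seeded shuffle from the standard library stands in for `train_test_split`. It shows why recording `test_size` and `random_state` matters: with the same values, the same split comes back.

```python
import json
import os
import random
import tempfile


def split_by_path(in_path, train_path, test_path, test_size, random_state):
    """Read a JSON list from in_path and write a shuffled train/test split."""
    with open(in_path) as f:
        rows = json.load(f)

    # Seeded shuffle: the same random_state always yields the same order.
    shuffled = list(rows)
    random.Random(random_state).shuffle(shuffled)

    # Simplified sizing rule (an assumption, not scikit-learn's exact rule).
    n_test = int(round(len(shuffled) * test_size))
    with open(test_path, 'w') as f:
        json.dump(shuffled[:n_test], f)
    with open(train_path, 'w') as f:
        json.dump(shuffled[n_test:], f)


def run_once(random_state):
    """Run the split in a temporary directory and return (train, test)."""
    with tempfile.TemporaryDirectory() as tmp:
        in_path = os.path.join(tmp, 'data.json')
        train_path = os.path.join(tmp, 'train.json')
        test_path = os.path.join(tmp, 'test.json')
        with open(in_path, 'w') as f:
            json.dump(list(range(10)), f)
        split_by_path(in_path, train_path, test_path,
                      test_size=0.20, random_state=random_state)
        with open(train_path) as f:
            train = json.load(f)
        with open(test_path) as f:
            test = json.load(f)
    return train, test


# Two runs with the same parameters produce identical splits.
first = run_once(92)
second = run_once(92)
```

Because the function only sees paths, whatever calls it decides where the inputs come from and where the outputs go.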

Here, we can see how two outputs of `split`, `X_train` and `y_train`, are passed as inputs to `train`. We're also going to serialize the model and vectorizer so we can retrieve them in the future rather than having to re-run the experiment.

In [None]:
@flor.func
def train(X_train, y_train, n_estimators, max_depth, model, vectorizer, **kwargs):
    import json
    import cloudpickle

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.ensemble import RandomForestClassifier
    
    # Read the inputs
    with open(X_train, 'r') as f:
        X_tr = json.load(f)
    with open(y_train, 'r') as f:
        y_tr = json.load(f)
    
    # Fit the vectorizer
    vec = TfidfVectorizer()
    vec.fit(X_tr)
    
    # Transform the training data
    X_tr = vec.transform(X_tr)
    
    # Train the model
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth).fit(X_tr, y_tr)
    
    # Write the output
    with open(model, 'wb') as f:
        cloudpickle.dump(clf, f)
    with open(vectorizer, 'wb') as f:
        cloudpickle.dump(vec, f)
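
The serialize-now, retrieve-later idea behind `train` can be sketched with the standard library alone. This sketch swaps in `pickle` for `cloudpickle` (both expose `dump` and `load`), and the `fitted` dictionary is a made-up stand-in for a trained model, not anything the tutorial produces.

```python
import os
import pickle
import tempfile

# A stand-in "fitted model" made of plain data (hypothetical values).
fitted = {'weights': {'great': 1.5, 'awful': -2.0}, 'bias': 0.1}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'model.pkl')

    # Pickle files are binary, hence the 'wb' and 'rb' modes.
    with open(path, 'wb') as f:
        pickle.dump(fitted, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)
```

The restored object is an equal but separate copy, which is what lets a later notebook session pick up the model without refitting it.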

Finally, we'll take the test data, model, and vectorizer to evaluate the model we just trained. We're also going to return the score to facilitate interpretation of the experiment.

In [None]:
@flor.func
def eval(X_test, y_test, model, vectorizer, **kwargs):
    import json
    import cloudpickle
    
    # Read the inputs
    with open(X_test, 'r') as f:
        X_te = json.load(f)
    with open(y_test, 'r') as f:
        y_te = json.load(f)
    with open(model, 'rb') as f:
        clf = cloudpickle.load(f)
    with open(vectorizer, 'rb') as f:
        vec = cloudpickle.load(f)
    
    # Test the model
    X_te = vec.transform(X_te)
    score = clf.score(X_te, y_te)
    
    print(score)
    
    # Return the score
    return {'score': score}
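
For scikit-learn classifiers, `score` returns the mean accuracy on the given test data and labels. As a rough illustration of what that number means, here is the same quantity computed by hand; `mean_accuracy` is a hypothetical helper, and the predictions and labels are made up.

```python
def mean_accuracy(predictions, labels):
    """Fraction of predictions that equal the corresponding label."""
    if len(predictions) != len(labels):
        raise ValueError('predictions and labels must have the same length')
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Hypothetical predictions and true labels, for illustration only.
y_pred = [1, 0, 1, 1]
y_true = [1, 0, 0, 1]
accuracy = mean_accuracy(y_pred, y_true)  # 3 of 4 match
```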

## We'll build a detailed Flor Plan, exposing all hyper-parameters and intermediary artifacts

Now that we've declared the code, and know which resources to read (Bob's latest preprocessed data), we can specify the Flor Plan. Pay close attention to how we indicate the outputs of an action, and how we pass the output of some action to another action.

In [None]:
with flor.Experiment('bob_preproc') as bob, flor.Experiment('risecamp_demo') as ex:
    # This is how we tell Flor we will be using Bob's derived artifacts
    data_x = bob.artifact('data_clean_X.json', 'intermediate_X', label="second_preproc")
    data_y = bob.artifact('data_clean_y.json', 'intermediate_y', label="second_preproc")
    
    # Here we declare all the static literals
    random_state = ex.literal(92, 'random_state')
    test_size = ex.literal(0.20, 'test_size')
    n_estimators = ex.literal(7, 'n_estimators') 
    max_depth = ex.literal(100, 'max_depth')
    
    # Now we connect the Flor Plan
    do_split = ex.action(func=split, in_artifacts=[data_x, data_y, test_size, random_state])
    X_train = ex.artifact(loc='X_train.json', name='X_train', parent=do_split)
    X_test = ex.artifact(loc='X_test.json', name='X_test', parent=do_split)
    y_train = ex.artifact(loc='y_train.json', name='y_train', parent=do_split)
    y_test = ex.artifact(loc='y_test.json', name='y_test', parent=do_split)
    
    do_train = ex.action(func=train, in_artifacts=[X_train, y_train, n_estimators, max_depth])
    model = ex.artifact(loc='model.pkl', name='model', parent=do_train)
    vectorizer = ex.artifact(loc='vectorizer.pkl', name='vectorizer', parent=do_train)
    
    do_eval = ex.action(func=eval, in_artifacts=[X_test, y_test, model, vectorizer])
    score = ex.literal(name='score', parent=do_eval)

So far we've only seen Flor Plans with a single action. This is what a detailed Flor Plan for our experiment looks like. The rectangles correspond to artifacts, which the user is responsible for reading and writing, but which Flor automatically tracks and versions. The underlined items correspond to literals, which Flor manages completely and by value. As a rule of thumb, if a value can be serialized as a string without losing information, then it can be a literal. Otherwise, it's best to represent it as an artifact.

In [None]:
score.plot()
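
The rule of thumb for choosing between a literal and an artifact can be approximated in plain Python. This is only a sketch of the idea, not a check that Flor performs: `round_trips_as_string` is a hypothetical helper that asks whether a value survives a trip through its string form.

```python
import ast


def round_trips_as_string(value):
    """Return True if repr(value) can be parsed back into an equal value."""
    try:
        return ast.literal_eval(repr(value)) == value
    except (ValueError, SyntaxError):
        # The string form is not a plain Python literal.
        return False


checks = {
    'random_state': round_trips_as_string(92),
    'test_size': round_trips_as_string(0.20),
    'opaque object': round_trips_as_string(object()),
}
```

Numbers such as `92` and `0.20` pass, so they fit the literal category; an opaque object such as a fitted model does not, which is why the plan stores the model as an artifact.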

Now we pull the score to execute the experiment we just defined.

In [None]:
score.pull('seventh_pull')
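
Pulling the final node runs everything it depends on. The toy sketch below illustrates that pull-based idea under stated assumptions; it is not Flor's implementation, and the `Step` class, its `pull` signature, and the lambda bodies are all hypothetical. Each step runs at most once, and only after its parents have run.

```python
class Step:
    """One node in a toy plan: a function plus the steps it depends on."""

    def __init__(self, name, func, parents=()):
        self.name = name
        self.func = func
        self.parents = list(parents)

    def pull(self, cache, order):
        """Compute this step, running each upstream step at most once."""
        if self.name not in cache:
            args = [parent.pull(cache, order) for parent in self.parents]
            cache[self.name] = self.func(*args)
            order.append(self.name)  # record execution order
        return cache[self.name]


# A four-step plan shaped like split -> train -> eval (toy computations).
data = Step('data', lambda: [3, 1, 2])
do_split = Step('split', lambda rows: (rows[:2], rows[2:]), [data])
do_train = Step('train', lambda parts: max(parts[0]), [do_split])
do_eval = Step('eval', lambda parts, model: model >= max(parts[1]),
               [do_split, do_train])

cache, order = {}, []
result = do_eval.pull(cache, order)
```

Although `split` feeds both `train` and `eval`, it appears only once in `order`, because its result is cached after the first pull.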

We see the score is still hovering around 75%.

In [None]:
flor.Experiment('risecamp_demo').summarize()

Notice the benefits of the finer-grained annotations. Compared to the very first experiment we ran, this experiment is easier to interpret without having to read code.

## Because we now track intermediary artifacts, such as the model, it is no longer necessary to re-run an experiment and retrain the model; instead, we can just retrieve it.

## Checkout best model and serve it
Here, we're going to retrieve the model and vectorizer we just fitted with Flor.

In [None]:
with flor.Experiment('risecamp_demo') as ex:
    model = ex.artifact('model.pkl', 'model', label='seventh_pull')
    vectorizer = ex.artifact('vectorizer.pkl', 'vectorizer', label='seventh_pull')
model = model.peek()
vec = vectorizer.peek()

Now we deploy the model into a toy application that tries to predict the sentiment from your phrases. Enter `nothing` to exit.

In [None]:
PROMPT = "What's on your mind? "

phrase = input(PROMPT)
# Keep prompting until the input starts with 'nothing' (case-insensitive)
while not phrase.lower().startswith('nothing'):
    # Vectorize the raw phrase before passing it to the model
    features = vec.transform([phrase])
    positive = model.predict(features)[0]
    if positive:
        print('Happy to hear that!\n')
    else:
        print("Sorry about that...\n")
    phrase = input(PROMPT)
print('you said nothing.')

To summarize, we've been reminded of the value of annotating and versioning experiments, and seen how Flor can automate these responsibilities with as few as three lines of code. Moreover, when finer-grained tracking is desired, Flor can capture more metadata and store derived artifacts for future retrieval. Flor makes experiments easier to interpret and lets you reuse artifacts across experiments.

## Thank you for your time. We hope you have a good evening.