# Ground RISE Camp Tutorial

For background, please see the slides from this morning's talk on Ground. You can find them [here](). (***TODO: Add link to slides.***) This Jupyter notebook is running in a Docker container that already has a Ground instance as well as a Postgres server up and running. There isn't any more setup for us to do, so let's jump right in!

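In this tutorial, we will work through some basic exercises with existing Aboveground tools, use Ground to debug a machine learning pipeline, and then extend Ground with our own data context.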

## Basic Exercises

To get started with Ground, we will use some of the "Aboveground" services that we have already developed. Aboveground services are tools that users use to interface with Ground at a higher semantic level than the simple node-and-edge-based API.

We will begin by using a tool that autopopulates GitHub repositories into Ground.

In [None]:
import ground_git_client

REPO_NAME = ""
ground_git_client.add_repo(REPO_NAME)
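
To give a rough picture of what populating a repository could involve, the sketch below turns a list of commits into version records that point at their parents, and then finds the latest versions. It is a toy illustration only: `commits_to_versions`, `latest_versions`, the record layout, and the sample commit IDs are all hypothetical, and the real `ground_git_client` may work quite differently.

```python
def commits_to_versions(commits):
    """Map each commit to a version record that points at its parent versions.

    `commits` is a list of (commit_id, parent_ids) pairs, oldest first.
    The record layout is hypothetical and is not Ground's actual schema.
    """
    versions = {}
    for commit_id, parent_ids in commits:
        missing = [p for p in parent_ids if p not in versions]
        if missing:
            raise ValueError("unknown parent commits: %s" % ", ".join(missing))
        versions[commit_id] = {"id": commit_id, "parents": list(parent_ids)}
    return versions


def latest_versions(versions):
    """Return the ids of versions that no other version lists as a parent."""
    referenced = set()
    for record in versions.values():
        referenced.update(record["parents"])
    return sorted(v for v in versions if v not in referenced)


# Made-up history: c1 <- c2 <- c3, with c4 branching off c2.
history = [("c1", []), ("c2", ["c1"]), ("c3", ["c2"]), ("c4", ["c2"])]
versions = commits_to_versions(history)
leaves = latest_versions(versions)
```

In this made-up history both `c3` and `c4` are "latest", which is why a version history is better thought of as a graph than as a simple list.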

Now that we have some code that Ground is aware of, we are going to want to do something with that code. The particular repository that we populated has some simple Python scripts that are "Ground-aware"\* as well as a small amount of data for us to analyze in the form of a CSV file.

We're going to download that repository locally using the `download_repo` command below. You can find the repo online [here](). We will run a simple script that's going to take our CSV data and split up our currently single-column data into the following fields:

* 1
* 2
* 3

However, before doing that, we need to make sure that Ground knows about the base dataset that we are transforming. Using another Aboveground tool that we have already developed, you can automatically let Ground know about this new dataset. This tool will populate Ground with some useful information about the file including the file type, the size of the file, and the path to the file.

\*When we say that these scripts are Ground-aware, we mean that we have instrumented them to interact with Ground and automatically publish useful data context into Ground in the course of their execution.

In [None]:
import ground_file_client

FILE_PATH = ""
ground_file_client.add_file(FILE_PATH)
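
To make the kind of information mentioned above (file type, size, and path) concrete, here is a minimal sketch that gathers those three facts using only the Python standard library. The `describe_file` helper and the dictionary keys are hypothetical; this is not the implementation of `ground_file_client`, and it does not talk to Ground at all.

```python
import mimetypes
import os
import tempfile


def describe_file(path):
    """Collect basic facts about a file: absolute path, size in bytes, guessed type."""
    guessed_type, _ = mimetypes.guess_type(path)
    return {
        "path": os.path.abspath(path),
        "size_bytes": os.path.getsize(path),
        "file_type": guessed_type,  # None when the extension is not recognized
    }


# Demonstrate on a small throwaway CSV file.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as handle:
    handle.write(b"a,b\n1,2\n")
info = describe_file(handle.name)
os.remove(handle.name)
```

The file type here is only guessed from the file extension, so the exact value can vary between systems.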

Now that Ground knows about our base dataset, we can go about transforming it. Since the scripts that we are using are Ground-aware, they are going to generate lineage information in Ground as a part of transforming the data. They will tell Ground that they have created a new dataset based on the old input dataset, and they will associate this lineage information with the latest version of the source code that was used for the transformation.
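
As a rough mental model of the lineage information described above, the sketch below records "this output was derived from this input by this version of the code" and then walks those records backwards. Everything here is hypothetical: the function names, the record layout, and the version strings are made up for illustration and are not Ground's API or schema.

```python
def make_lineage_record(input_version, output_version, code_version):
    """Describe that `output_version` was derived from `input_version` by `code_version`."""
    return {
        "from": input_version,
        "to": output_version,
        "code": code_version,
    }


def upstream_of(records, version):
    """Follow lineage records backwards and list every ancestor of `version`."""
    ancestors = []
    frontier = [version]
    while frontier:
        current = frontier.pop()
        for record in records:
            if record["to"] == current and record["from"] not in ancestors:
                ancestors.append(record["from"])
                frontier.append(record["from"])
    return ancestors


# Made-up dataset and script versions.
records = [
    make_lineage_record("raw.csv@v1", "split.csv@v1", "split_script@abc123"),
    make_lineage_record("split.csv@v1", "report.csv@v1", "report_script@def456"),
]
ancestors = upstream_of(records, "report.csv@v1")
```

Keeping the code version on each record is what makes it possible to ask later which script, at which revision, produced a given dataset.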

In [None]:
# execute the Python script in the repository cloned above

Now that we've spent a bunch of time populating information into Ground, it's time to see everything we've done. Using the Ground API client, for which you can find complete documentation [here](), determine the following pieces of information:

In [None]:
import ground

# assumes GroundClient is exposed at the top level of the ground package
gc = ground.GroundClient()

# the id of the node version for the base dataset (hint: you can use the latest API)

# the id of the lineage edge version that connects the base dataset to the derived dataset

# all of the tags of the derived dataset

## Ground & ML Models

One use case we have been exploring as a part of the Ground agenda recently is managing the lifecycles of machine learning models using Ground. In particular, we are interested in tracking the code and the data that are combined to produce a particular model. As we've already learned, Ground treats versioning as a first-class citizen. As a result, it is easy to imagine a scenario in which Ground would help users track which particular version of data was used to train a model for reproducibility purposes.

As an aside, it is an interesting and open research question how we track and version datasets. 
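
One naive way to tell dataset versions apart, and not necessarily how Ground does it, is to fingerprint the raw bytes of a file with a cryptographic hash. The sketch below uses the standard library's `hashlib`; the `fingerprint` helper and the sample data are hypothetical.

```python
import hashlib
import io


def fingerprint(stream, chunk_size=65536):
    """Return the SHA-256 hex digest of a binary stream, read in fixed-size chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Identical bytes give identical fingerprints; any edit changes the fingerprint.
original = fingerprint(io.BytesIO(b"id,text\n1,hello\n"))
same_again = fingerprint(io.BytesIO(b"id,text\n1,hello\n"))
edited = fingerprint(io.BytesIO(b"id,text\n1,hello!\n"))
```

This approach treats any byte-level difference as a new version, so merely reordering rows would count as a change. That limitation is part of why dataset versioning remains an open question.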

In this particular example, we will be toying with a model that predicts a tweet's location based on the content of the tweet. Below is a rough description of the pipeline that we have put together:

1. Tweets are crawled from Twitter to generate a training set and a test set.
2. Those tweets are cleaned and normalized.
3. The model is trained on the cleaned training set. 
4. The model trained in step 3 is validated on the cleaned testing data.

![Model training pipeline](ml/target_test_simple.png)

As a part of these exercises, we have pre-built a number of helper functions that you might find useful as you go through the steps below. Make sure you read these function definitions before continuing!

* `setup`: prepares and configures the system and data for this tutorial
* `show_me_data`: displays a dataframe containing the data we will use throughout this tutorial
* `get_ground_metadata`: queries Ground and displays all relevant metadata for this tutorial
* `test_model`: executes the machine learning pipeline to train and test a model, and reports prediction accuracy

In [None]:
from ml import tutorial

In [None]:
tutorial.setup()

In [None]:
tutorial.show_me_data()

In [None]:
output = tutorial.test_model()
print(output)

Okay, so we have a baseline model, and it does pretty well! The default case would be to guess the United States, since about 35% of tweets come from the US. We're clearly doing a good bit better than that. However, we're not satisfied with this quite yet; we'd like to improve this, and we have a guess that improving the cleaning process will help improve our model accuracy. We've set up the skeleton of a `clean` function below for you to fill in. You're welcome to try anything you'd like to improve the cleaning process!
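
To make the "default case" comparison concrete, the sketch below computes the accuracy of always guessing the most common label. The `majority_baseline` helper and the label counts are made up for illustration (only the 35% share for the US comes from the text); the tutorial's real data may be distributed differently.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always guessing it."""
    if not labels:
        raise ValueError("labels must not be empty")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / float(len(labels))


# Made-up labels in which 35% of tweets come from the US.
labels = ["US"] * 35 + ["BR"] * 25 + ["GB"] * 20 + ["JP"] * 20
guess, accuracy = majority_baseline(labels)
```

Any model worth keeping should beat this number, since the baseline needs no training at all.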

For those who might be less familiar with data cleaning and ETL, here's a simple suggestion and code snippet that you can try out: 

***THIS NEEDS TO BE FILLED IN.***
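
One possible starting point, offered as a sketch rather than as the tutorial's official suggestion, is to normalize the tweet text: lowercase it, drop URLs and @mentions, and collapse repeated whitespace. The `normalize_tweet` helper below is hypothetical and uses only the standard library's `re` module.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def normalize_tweet(text):
    """Lowercase a tweet, drop URLs and @mentions, and collapse whitespace."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    return WHITESPACE_PATTERN.sub(" ", text).strip()


cleaned = normalize_tweet("Hello @friend  Greetings from Lisbon http://example.com/x")
```

Inside `clean`, a helper like this could be applied to whichever column holds the tweet text, for example `df["text"] = df["text"].apply(normalize_tweet)`. Note that `text` is an assumed column name; check the output of `show_me_data` for the real one.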

In [None]:
%%writefile ml/my_cleaner.py
#!/usr/bin/env python
import pandas as pd
import numpy as np

def clean(df):
    pass


Now that we've defined this function, let's test the model again. It's okay if you have to run through these steps a few times while testing your cleaning code.

In [None]:
output = tutorial.test_model()
print(output)

Surprisingly enough, no matter what we put into the `clean` method above, we see that the accuracy of the model that we're training has plummeted, likely through no fault of our own given that we haven't changed much here.

The question we have to answer next is what changed that caused our pipeline to break. We can come up with a long list of things that might have broken. If you're stuck, we've written a description below that will help walk you through the investigative steps. 

**HINT**: You're probably going to find the tutorial APIs above very helpful.

### THE REST OF THE CODE IS A SOLUTION AND SHOULD BE HIDDEN FROM THE PEOPLE TAKING THE TUTORIAL.

We can probably recommend some options, like getting the metadata using `get_ground_metadata`.

In [None]:
md = tutorial.get_ground_metadata()

In [None]:
tutorial.show_me_data()

In [None]:
%%writefile ml/my_cleaner.py
#!/usr/bin/env python
import pandas as pd
import numpy as np

def clean(df):
    df["code"] = df["country"]


In [None]:
output = tutorial.test_model()
print(output)
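
The fix above copies the `country` column into `code`, which suggests that the pipeline expects a `code` column that the current data no longer provides. The sketch below shows one way such a schema change could be spotted by comparing column names. The `diff_columns` helper and the `text` column are hypothetical; with pandas, the inputs would be something like `list(df.columns)` taken from each version of the data.

```python
def diff_columns(expected, actual):
    """Report which column names disappeared and which appeared."""
    expected_set, actual_set = set(expected), set(actual)
    return {
        "missing": sorted(expected_set - actual_set),
        "unexpected": sorted(actual_set - expected_set),
    }


# Hypothetical column lists; "text" is made up, "code" and "country" come from the fix above.
report = diff_columns(["text", "code"], ["text", "country"])
```

A check like this only compares names, so it would flag a renamed column but not a column whose meaning or contents changed.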

## Extending Ground

In this section, we will walk you through how you might go about extending Ground to populate your own data context. Before we go any further, let's first reset our Ground instance. If you mistakenly add data to Ground as you do this exercise, you can run the following cell to wipe Ground and start over:

In [None]:
from ground_setup import reset_ground

reset_ground()

Before we start writing our own Aboveground tool, let's first dig a little deeper into how the Ground file populator component works. Let's begin by opening the `ground_file_client.py` file in another tab. After walking through the comments there, return here to continue with the exercises.