# {{ cookiecutter.project_slug }} analysis

Welcome to the collaboration!

Some of the steps in this analysis are taken from the [Ballet Contributor Guide](https://hdi-project.github.io/ballet/contributor_guide.html); consult it for more information.

In [None]:
# some preliminaries...
from ballet.util.log import enable as enable_logger
enable_logger('ballet')
enable_logger('{{ cookiecutter.package_slug }}')
import pandas as pd
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [None]:
from ballet import b

## Explore the data

In [None]:
X_df, y_df = b.api.load_data()

In [None]:
X_df.head()

In [None]:
y_df.head()

## Explore existing features

In [None]:
result = b.api.engineer_features(X_df, y_df)
X_train, y_train = result.X, result.y

In [None]:
print('Number of existing features:', len(result.features))
print('Number of columns in feature matrix:', X_train.shape[1])

## Write a new feature

Now it's time to write your own feature!

🚧 The content of the feature cell below must be a standalone Python module, because it is placed in an empty Python source file when it is submitted. Any imports or helper functions must therefore be defined (or redefined) within this cell; otherwise, your submitted feature will fail to validate due to missing imports or helpers. 🚧 For example, if you use some functionality from numpy, you must (re)import numpy within the code cell, even if you imported it elsewhere in the notebook.
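To get a rough feel for the standalone requirement, the sketch below runs a cell's source in a fresh, empty namespace, much like an empty source file would. The helper `is_standalone` and the two sample cells are hypothetical illustrations, not part of Ballet, and the standard-library `math` module stands in for numpy so that the sketch runs anywhere. It only catches names that are actually used when the cell runs; a missing import inside a function that is never called goes unnoticed.

```python
from typing import Tuple


def is_standalone(cell_source: str) -> Tuple[bool, str]:
    """Run cell_source in an empty namespace (hypothetical helper, not Ballet's API).

    Returns (True, "") if the code ran, or (False, reason) if it hit a
    missing name or a failed import.
    """
    namespace = {"__name__": "__cell__"}  # nothing from the notebook leaks in
    try:
        exec(compile(cell_source, "<cell>", "exec"), namespace)
    except (NameError, ImportError) as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


# This cell imports everything it uses, so it runs on its own.
good_cell = """
import math

def log_value(x):
    return math.log(x)

value = log_value(1.0)
"""

# This cell relies on an import made elsewhere in the notebook.
bad_cell = """
def log_value(x):
    return math.log(x)

value = log_value(1.0)
"""

print(is_standalone(good_cell))
print(is_standalone(bad_cell))
```

The first call reports success and the second reports a `NameError` for `math`, which is the kind of failure a submitted feature would hit if its imports lived in another cell.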

If you have questions about feature engineering, see the [Feature Engineering Guide](https://hdi-project.github.io/ballet/feature_engineering_guide.html).

In [None]:
from ballet import Feature

input = None  # TODO - str or list of str
transformer = None  # TODO - function, transformer-like, or list thereof
name = None  # TODO - str
feature = Feature(input, transformer, name)

Let's play around with the feature we have created... what values does it extract?

In [None]:
feature_values = feature.as_feature_engineering_pipeline().fit_transform(X_df, y_df)
feature_values

## Test your feature

You probably want to make sure that your feature does not have bugs before you submit it to the upstream project.

This first command will check that your feature conforms to the feature API. (Assumes you have assigned your feature to a variable `feature`.) Even if your feature passes the tests here in the notebook, it may still fail to be validated, especially if you have not included all your imports/helpers in the code cell that you submit.

In [None]:
b.validate_feature_api(feature)

This second command will evaluate the ML performance of your feature. (Assumes you have assigned your feature to a variable `feature`.) Usually, this consists of computing the mutual information of your feature values with the target, conditional on the feature values extracted by the existing feature engineering pipeline. If this is above a certain threshold, your feature is determined to be contributing positively to the ML performance of the project. In this demo, since we want to be nice to new collaborators, every feature will be accepted with respect to ML performance!
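To make the idea of conditional mutual information concrete, here is a minimal sketch for discrete values using a simple plug-in (count-based) estimate. It is an illustration only: the function name, the toy data, and the threshold value are assumptions, and the estimator that Ballet actually uses may differ.

```python
import math
from collections import Counter


def conditional_mutual_information(z, y, x):
    """Plug-in estimate of I(Z; Y | X) in nats for discrete sequences.

    z: candidate feature values, y: target values,
    x: values of the existing feature(s). All must have the same length.
    """
    n = len(z)
    c_xyz = Counter(zip(x, y, z))
    c_xz = Counter(zip(x, z))
    c_xy = Counter(zip(x, y))
    c_x = Counter(x)
    total = 0.0
    for (xi, yi, zi), count in c_xyz.items():
        # p(x,y,z) * log[ p(x) p(x,y,z) / (p(x,z) p(x,y)) ], written with counts
        ratio = (c_x[xi] * count) / (c_xz[(xi, zi)] * c_xy[(xi, yi)])
        total += (count / n) * math.log(ratio)
    return total


target = [0, 1, 0, 1]
existing = [0, 0, 0, 0]          # existing feature carries no information

informative = [0, 1, 0, 1]       # new feature that tracks the target
redundant = list(existing)       # new feature that copies the existing one

threshold = 0.05                 # hypothetical acceptance threshold
for name, values in [("informative", informative), ("redundant", redundant)]:
    cmi = conditional_mutual_information(values, target, existing)
    print(name, round(cmi, 4), "accept" if cmi > threshold else "reject")
```

A feature that adds information about the target beyond the existing features scores well above zero (here close to log 2), while one that merely repeats an existing feature scores zero.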

In [None]:
b.validate_feature_acceptance(feature)

## Submit your feature

When you are ready to submit your feature, look for the "Submit" button on the right side of your notebook toolbar. First, select the code cell that contains the feature you have written (you will have to scroll up). Then press the "Submit" button and confirm that the feature code shown is what you want to submit. After submission, you will be shown a URL that takes you to the corresponding pull request that has been created.

When you submit your feature, the source code you have written is extracted from the selected code cell. It is then sent to an app that structures your feature as a pull request to the {{ cookiecutter.github_owner }}/{{ cookiecutter.project_slug }} repository and opens it on your behalf. The pull request is validated by Travis CI, a continuous integration service, using the test suite defined by the ballet framework, which tests the feature's API and runs a streaming logical feature selection (SLFS) algorithm. If the feature is accepted, the GitHub app ballet-bot responds by automatically merging the pull request into the project. Otherwise, it automatically closes it. After an accepted feature is merged, the CI service runs another ballet-defined script that prunes redundant features using the same SLFS algorithm.

## Thanks for contributing

If you're reading this, you have already done some great data science work and contributed one (or more!) features to the upstream {{ cookiecutter.project_slug }} project.