## Sample project based on IRIS data

### Visualization of the data

In this notebook, we create a sample project: we set up a pipeline, train a model, and test that model on new data.
First we import dvb.datascience to start our project.

In [None]:
import dvb.datascience as ds

We create a pipeline, and read in the standard IRIS dataset included in this project.

In [None]:
p = ds.Pipeline()
p.addPipe("read", ds.data.SampleData(dataset_name="iris"))

We can plot the data to see what we have.
To do that, we add a pipe that splits the data, holding back a specified fraction as test data (here test_size=0.3), and another pipe that draws the scatter plots. Each pipe has to be told where its input comes from, hence the [("read", "df", "df")] when adding the split pipe: it connects the "df" output of the "read" pipe to the "df" input of the "split" pipe. After adding the necessary pipes, we fit and transform on the train data, and then transform the test data, to display scatter plots for both.

In [None]:
p.addPipe('split', ds.transform.RandomTrainTestSplit(test_size=0.3), [("read", "df", "df")]) 
p.addPipe('scatter', ds.eda.ScatterPlots(), [("split", "df", "df")]) 
p.fit_transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TRAIN}})
p.transform(transform_params={'split': {'split': ds.transform.split.TrainTestSplitBase.TEST}}, name='test', close_plt=True)
p.get_pipe_output("read") # we can use the function get_pipe_output(name_pipe) to read the output.
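
The input tuples can be read as (source pipe, output key, input key). The following is a minimal plain-Python sketch of that wiring idea, assuming pipes run in the order they were added. The `run_pipes` helper and the toy pipes are hypothetical stand-ins for illustration, not dvb.datascience code.

```python
# Hypothetical, simplified stand-in for pipe wiring; not dvb.datascience code.
def run_pipes(pipes, wiring):
    """Run pipes in insertion order, routing named outputs to named inputs.

    pipes:  {pipe_name: function(**inputs) -> dict of named outputs}
    wiring: {pipe_name: [(source_pipe, output_key, input_key), ...]}
    """
    outputs = {}
    for name, func in pipes.items():
        # Look up each requested output of an earlier pipe and pass it in
        # under the input key this pipe expects.
        kwargs = {
            input_key: outputs[source][output_key]
            for source, output_key, input_key in wiring.get(name, [])
        }
        outputs[name] = func(**kwargs)
    return outputs


pipes = {
    "read": lambda: {"df": [5.1, 4.9, 6.3, 5.8]},
    "split": lambda df: {"df": df[:3]},  # toy "split": keep the first three rows
    "count": lambda df: {"n_rows": len(df)},
}
wiring = {
    "split": [("read", "df", "df")],
    "count": [("split", "df", "df")],
}

results = run_pipes(pipes, wiring)
print(results["count"]["n_rows"])  # prints 3
```

Keeping the output key and the input key separate is what later allows two "df" outputs to feed one pipe under different input names, such as "df0" and "df1".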

### Adding custom pipes

This pipeline has now been used, so we cannot add pipes with the same names again. We set up a new pipeline to show more of the possibilities and to train a classifier on the data.

In [None]:
p = ds.Pipeline()
p.addPipe("read", ds.data.SampleData(dataset_name="iris"))

If we want to filter out certain observations, drop whole columns, or add new columns, we can add the following pipes.

In the add_total_sepal_size pipe, we compute a new column, total_size, for every row.

In the target_0 pipe, we keep only the rows that satisfy a condition, in this case row["target"] == 0.

In the drop_petal_length pipe, we drop the "petal length (cm)" column.

In [None]:
def get_total_size(row):
    # Product of sepal length and sepal width for a single row
    return float(row["sepal length (cm)"]) * float(row["sepal width (cm)"])

p.addPipe(
    "add_total_sepal_size",
    ds.transform.ComputeFeature("total_size", get_total_size),
    [("read", "df", "df")],
)

p.addPipe(
    "target_0",  # Note: this pipe keeps the observations for which the function returns True
    ds.transform.FilterObservations(lambda row: row["target"] == 0),
    [("add_total_sepal_size", "df", "df")],
)

p.addPipe(
    "drop_petal_length",
    ds.transform.DropFeatures(["petal length (cm)"]),
    [("target_0", "df", "df")],
)
p.fit_transform()
p.get_pipe_output("drop_petal_length") 
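
To make the three steps concrete, here is a plain-Python sketch of what they do to the data, using a few made-up rows instead of the real data set. It illustrates the idea only and is not how dvb.datascience implements these pipes.

```python
# Plain-Python illustration with made-up rows; not the dvb.datascience implementation.
rows = [
    {"sepal length (cm)": 5.0, "sepal width (cm)": 3.0, "petal length (cm)": 1.4, "target": 0},
    {"sepal length (cm)": 6.0, "sepal width (cm)": 2.5, "petal length (cm)": 4.5, "target": 1},
    {"sepal length (cm)": 4.0, "sepal width (cm)": 3.5, "petal length (cm)": 1.3, "target": 0},
]

# 1. Compute a feature: add a total_size column to every row.
with_total = [
    {**row, "total_size": row["sepal length (cm)"] * row["sepal width (cm)"]}
    for row in rows
]

# 2. Filter observations: keep only rows where the condition is True.
target_0 = [row for row in with_total if row["target"] == 0]

# 3. Drop a feature: remove the petal length column from every remaining row.
result = [
    {key: value for key, value in row.items() if key != "petal length (cm)"}
    for row in target_0
]

for row in result:
    print(row)
```

Two of the three rows survive the filter, each with a new total_size value and without the petal length column.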

## Generating a model

Now we will actually train a model and make predictions. The cell below is a full pipeline, from reading the data to generating and scoring a model.

A few caveats first: multi-class prediction and regression models are not yet implemented. Only classifiers with binary targets are available for now.

In [None]:
%matplotlib inline

import dvb.datascience as ds # Import our code and sklearn classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

p = ds.Pipeline() # Generate a new pipeline
p.addPipe("read", ds.data.SampleData(dataset_name="iris")) # Read in your data, in this case SampleData Iris.

# If you want to read from a datafile, you can use the following pipe:
# p.addPipe("read", ds.data.CSVDataImportPipe(filename, index_col=index_column_to_be_used))

# We add a metadata pipe, which tells the pipeline which column is the prediction target.
p.addPipe("metadata", ds.data.DataPipe("df_metadata", {"y_true_label": "target"}))

# Split our data into a training set and a test set, in this case 70% training and 30% test data.
p.addPipe(
    "split",
    ds.transform.RandomTrainTestSplit(test_size=0.3),
    inputs=[("read", "df", "df")],
)

# As mentioned, only binary labels are supported for now, so we filter out one of the three targets.
p.addPipe(
    "target_0_1",  # Note: this pipe keeps the rows which result in True and discards rows which result in False
    ds.transform.FilterObservations(lambda row: row["target"] == 0 or row["target"] == 1),
    [("split", "df", "df")],
)

# Now let's say that we want to normalize our values. We don't want to normalize the target values,
# so we temporarily remove them from our data and merge them back in afterwards.
p.addPipe(
    "remove_label", ds.transform.DropFeatures(["target"]), [("target_0_1", "df", "df")]
)
p.addPipe(
    "keep_label", ds.transform.FilterFeatures(["target"]), [("target_0_1", "df", "df")]
)
p.addPipe(
    "scaler",
    ds.transform.SKLearnWrapper(StandardScaler),
    [("remove_label", "df", "df")],
)
p.addPipe(
    "merge",
    ds.transform.Union(2, "inner"),
    [("keep_label", "df", "df0"), ("scaler", "df", "df1")],
)

# We insert a correlation matrix, to give us a better overview of our data
p.addPipe("corrmatrix", ds.eda.CorrMatrixPlot(), inputs=[("merge", "df", "df")])

# We use a LogisticRegression classifier in this model
p.addPipe(
    "LogisticRegression",
    ds.predictor.SklearnClassifier(
        LogisticRegression,
    ),
    [
        ("merge", "df", "df"),
        ("metadata", "df_metadata", "df_metadata"),
    ],
)

# Compute and display the classification score so we can evaluate the model
p.addPipe(
    "LogisticRegression_score",
    ds.score.ClassificationScore(),
    [("LogisticRegression", "predict", "predict"), ("LogisticRegression", "predict_metadata", "predict_metadata")],
)

# We fit our model on the TRAIN data set which we split off earlier
p.fit_transform(
    name="train",
    transform_params={
        "split": {"split": ds.transform.split.TrainTestSplitBase.TRAIN},
    },
)

# Here we test our model, using our TEST data set
p.transform(
    name="validate",
    transform_params={
        "split": {"split": ds.transform.split.TrainTestSplitBase.TEST},
    },
)
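
Two ideas in the cell above are easy to miss: the label is set aside so that only the features are scaled, and whatever is learned during fitting comes from the TRAIN rows only and is then reused on the TEST rows. Below is a minimal plain-Python sketch of both, using made-up data and the standard library. It assumes the usual fit/transform behaviour and the population standard deviation that scikit-learn's StandardScaler uses. It does not show how dvb.datascience implements the split or the scaler.

```python
import random
from statistics import mean, pstdev

# Made-up rows of (feature value, label); stands in for the real data set.
rows = [(float(i), i % 2) for i in range(10)]

# Random 70/30 split with a fixed seed so the example is repeatable.
rng = random.Random(0)
shuffled = rows[:]
rng.shuffle(shuffled)
n_test = round(len(shuffled) * 0.3)
test, train = shuffled[:n_test], shuffled[n_test:]

# "Fit": learn mean and standard deviation from the TRAIN features only.
train_features = [x for x, _ in train]
mu = mean(train_features)
sigma = pstdev(train_features)  # population standard deviation


def scale(data):
    """Scale the feature with the TRAIN statistics; leave the label untouched."""
    return [((x - mu) / sigma, label) for x, label in data]


# "Transform": the same statistics are applied to TRAIN and TEST.
train_scaled = scale(train)
test_scaled = scale(test)

print(len(train_scaled), len(test_scaled))  # prints 7 3
```

The scaled TRAIN features have mean 0 and standard deviation 1 by construction. The TEST features generally do not, because they are scaled with statistics they did not contribute to.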

And that's a basic model. You can add many more pipes, split your data into more than TRAIN/TEST, use different classifiers, and visualize the results using the plotting pipes.
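
For readers who have not worked with binary classification scores before, here is a small plain-Python sketch of the kind of numbers a scoring step reports for 0/1 labels. The labels and predictions are made up, and which metrics ClassificationScore actually computes is not shown in this notebook, so treat this as background rather than a description of that pipe.

```python
# Made-up true labels and predictions for a binary problem.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)  # share of correct predictions
precision = tp / (tp + fp)         # share of predicted 1s that are truly 1
recall = tp / (tp + fn)            # share of true 1s that were found

print(tp, tn, fp, fn)  # prints 2 2 1 1
print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```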