# Extract train and test datasets from 20newsgroups

This pipeline step is based on the 01_Extract_dataset.ipynb Jupyter Notebook. It loads
the input data (downloaded during the setup steps) and splits it into two datasets: one to train the model and one to test it.

### Parameters

Thanks to mlvtool, a Python script and a DVC command are generated from this notebook, which makes the classifier pipeline easier to version
and execute. The work/research can still be done in the notebook and then automatically replicated into the script.

The first code cell contains a Docstring that describes the parameters of this pipeline step and its tracked inputs and outputs.

Python script:

**param** is used to declare the parameters of the corresponding Python 3 script and command. 
`#:param [type]? [param_name]: [description]?`
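
As an illustration, the three parameters declared in this notebook (`subset`, `data_home`, `output_path`) correspond to the `--subset`, `--data-home` and `--output-path` options used in the DVC command of the first code cell. The block below is only a minimal sketch of such a command-line interface, assuming `argparse`; `build_parser` is a hypothetical name, and the script actually generated by mlvtool may be organized differently.

```python
import argparse


def build_parser():
    # Hypothetical helper: one option per ':param' declaration of the Docstring.
    parser = argparse.ArgumentParser(description='Extract a 20newsgroups subset')
    parser.add_argument('--subset', type=str, required=True,
                        help='Subset of data to load')
    parser.add_argument('--data-home', type=str, required=True,
                        help='Path to parent directory to cache file')
    parser.add_argument('--output-path', type=str, required=True,
                        help='Path to output file')
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(['--subset', 'train',
                                  '--data-home', './poc/data',
                                  '--output-path', './poc/data/data_train.csv'])
print(args.subset, args.data_home, args.output_path)
```

Note that `argparse` exposes `--data-home` as `args.data_home`, which matches the Python variable names used in the notebook.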

DVC command:
```
[:dvc-[in|out][\s{related_param}]?:[\s{file_path}]?]*
[:dvc-extra: {python_other_param}]?

OR (for special cases it is possible to override the default behavior providing the full dvc command line)
[:dvc-cmd: {full command line}]

```
**dvc-cmd** provides the whole DVC command when it is not generic (it overrides the default behavior). It is used here because this pipeline step runs the same script twice with different parameters.
In this case, it is possible (and strongly recommended) to use the variable **$MLV_PY_CMD_PATH** to designate the Python command line path.


To have a better understanding of those parameters, see the MLV-Tools [documentation](https://github.com/peopledoc/ml-versioning-tools) and have a look at the corresponding generated DVC command line.

In [None]:
# Parameters:
# The Docstring below will be parsed by mlvtool to build the python script corresponding to this notebook with the proper parameters.
"""
:param str subset: Subset of data to load
:param str data_home: Path to parent directory to cache file
:param str output_path: Path to output file
:dvc-cmd: dvc run -f $MLV_DVC_META_FILENAME -o ./poc/data/data_train.csv -o ./poc/data/data_test.csv
                -d ./poc/data/20news-bydate_py3.pkz
       "$MLV_PY_CMD_PATH --subset train
            --data-home ./poc/data --output-path ./poc/data/data_train.csv &&
        $MLV_PY_CMD_PATH --subset test
            --data-home ./poc/data --output-path ./poc/data/data_test.csv"
"""
# Value of parameters for this Jupyter Notebook only
# the notebook is placed in ./poc/pipeline/notebooks
subset = 'train'
data_home = '../../data/'
# Output:
output_path = '../../data/data_train.csv'

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

In [None]:
newsgroups_train = fetch_20newsgroups(subset=subset,
                                      data_home=data_home,
                                      download_if_missing=True,
                                      remove=('headers', 'footers', 'quotes'))

In [None]:
# No effect
newsgroups_train.keys()

In [None]:
# No effect
len(newsgroups_train.data)

In [None]:
# No effect
newsgroups_train.target

In [None]:
df_train = pd.DataFrame(newsgroups_train.data, columns=['data'])

In [None]:
df_train['target'] = newsgroups_train.target

In [None]:
df_train['targetnames'] = df_train['target'].apply(lambda n: newsgroups_train.target_names[n])

In [None]:
# No effect
df_train.head()

In [None]:
df_train.to_csv(output_path, index=False)
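
The DataFrame construction and CSV export above can be tried without downloading the dataset. The block below is a minimal sketch with made-up toy data: `toy` is a hypothetical stand-in for the object returned by `fetch_20newsgroups`, and the CSV is written to an in-memory buffer rather than to `output_path`.

```python
import io
from types import SimpleNamespace

import pandas as pd

# Hypothetical stand-in for the result of fetch_20newsgroups (toy values).
toy = SimpleNamespace(
    data=['first post', 'second post', 'third post'],
    target=[1, 0, 1],
    target_names=['alt.atheism', 'sci.space'],
)

# Same three steps as in the notebook: text, numeric label, readable label.
df = pd.DataFrame(toy.data, columns=['data'])
df['target'] = toy.target
df['targetnames'] = df['target'].apply(lambda n: toy.target_names[n])

# Write to an in-memory buffer and read it back to check the round trip.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df_back = pd.read_csv(buffer)
print(df_back)
```

Writing with `index=False` keeps the row index out of the file, so the CSV read back contains only the `data`, `target` and `targetnames` columns.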