# 01 - Prepare data

Imagine you're working for a new credit card company that wants to take a data-driven approach to providing credit card loans to customers.
As part of that approach, you want to use a model that predicts whether a customer is going to default on
their next credit card payment. Based on this model, you want to provide additional services to customers, such as help
with their financial situation.

In this first part of the tutorial we're going to prepare a dataset for the machine learning model.
We'll cover the following topics:

* [Loading data using pandas](#loading-data-using-pandas)
* [Creating a training and a testing dataset](#creating-a-training-and-a-testing-dataset)
* [Storing the dataset on disk](#storing-the-dataset-on-disk)

Let's get started by loading the raw dataset from disk.

## Loading data using pandas
The model that we're about to train uses a dataset that's stored in `../data/raw/UCI_Credit_card.csv`.
Let's load it up and see what it looks like:

* First, import the pandas package (already done for you)
* Next, use the function `pd.read_csv(...)` to load the dataset and assign it to `df_creditcard`.
* Finally, call `df_creditcard.info()` to get some insights into what is in the dataset.


In [2]:
import pandas as pd

In [3]:
df_creditcard = pd.read_csv('../data/raw/UCI_Credit_card.csv')
df_creditcard.info()

You should now have the dataset in memory, ready to use for training.
The output of the `info()` call shows that the dataset contains 30,000 samples, which is enough to train a model.
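
If you want to look a little further than `info()`, the sketch below shows two other common ways to inspect a freshly loaded DataFrame. It reads a tiny in-memory CSV instead of the real file, so the three rows and the column names (`ID`, `LIMIT_BAL`, `AGE`) are illustrative assumptions, not a description of the actual dataset.

```python
import io

import pandas as pd

# Hypothetical three-row sample used only for illustration;
# the real file contains far more rows and columns.
sample_csv = io.StringIO(
    "ID,LIMIT_BAL,AGE\n"
    "1,20000,24\n"
    "2,120000,26\n"
    "3,90000,34\n"
)

# read_csv accepts a file-like object as well as a path
df_sample = pd.read_csv(sample_csv)

# shape is a (rows, columns) tuple
print(df_sample.shape)

# head() returns the first rows so you can eyeball the values
print(df_sample.head())

# info() prints column names, non-null counts and dtypes
df_sample.info()
```

The same calls work unchanged on `df_creditcard`.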

Now that we have loaded the dataset, let's create a test and training dataset.

## Creating a training and a testing dataset
In the previous section we loaded the dataset from disk. Technically, we could use the whole dataset for training.
However, if we did, we would be unable to validate that the model actually works, because we would have no independent data left to test it on.

In this section we're going to split the dataset. Perform the following steps:

* First, import the function [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from scikit-learn (already done for you).
* Next, split the dataset using the `train_test_split` function. Use `test_size=0.1` to put 10% of the data in the test dataset. Store the result in `df_train` and `df_test`. (Note: You can assign the output of a function to multiple variables.)

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
df_train, df_test = train_test_split(df_creditcard, test_size=0.1)
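
The split above is random, so every run produces a different training and testing set. The sketch below shows one possible way to make the split repeatable and to keep the share of defaults the same in both parts. It uses a synthetic 100-row DataFrame, and the column name `default` is a hypothetical stand-in for the target column, which this tutorial does not name. The `random_state` and `stratify` parameters are optional and are not required for the rest of the tutorial.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df_creditcard: 100 rows, 20 of which are defaults.
df_example = pd.DataFrame({
    "LIMIT_BAL": range(100),
    "default": [1] * 20 + [0] * 80,
})

df_train_ex, df_test_ex = train_test_split(
    df_example,
    test_size=0.1,                   # 10% of the rows go to the test set
    random_state=42,                 # fixed seed makes the split repeatable
    stratify=df_example["default"],  # keep the class ratio in both parts
)

# Row counts of the two parts
print(len(df_train_ex), len(df_test_ex))

# Fraction of defaults in each part
print(df_train_ex["default"].mean(), df_test_ex["default"].mean())
```

Stratifying matters most when one class is rare, because a purely random 10% sample could otherwise end up with very few examples of it.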

Now that we have a training and testing dataset, let's save them for later.

## Storing the dataset on disk
In the previous sections we've spent some time getting our data ready. In this section we're going to store the data on disk for use in the next part of the tutorial.
Follow these steps to store your datasets on disk:

* First, use the [to_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) method on `df_train` to store it on disk.
  Specify the filename (`../data/processed/train.csv`)
  and include `index=None` as an additional parameter so that the row index isn't written to the file.
* Next, repeat the previous step for `df_test`, but store it in `../data/processed/test.csv`.


In [9]:
df_train.to_csv('../data/processed/train.csv', index=None)
df_test.to_csv('../data/processed/test.csv', index=None)
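
To check that a file written this way can be read back unchanged, you could do a quick round trip. The sketch below is a minimal example under assumptions: it uses a two-row DataFrame with illustrative column names and a temporary directory instead of `../data/processed`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame used only for illustration
df_small = pd.DataFrame({"LIMIT_BAL": [20000, 120000], "AGE": [24, 26]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.csv")

    # index=None leaves the row index out of the file
    df_small.to_csv(path, index=None)

    # Read the file back while the temporary directory still exists
    df_loaded = pd.read_csv(path)

# Only the data columns come back, with no extra index column
print(list(df_loaded.columns))
print(df_loaded.shape)
```

Note that `to_csv` does not create missing directories, so `../data/processed` has to exist before you run the cell above.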

## Summary

Congratulations! In this first part of the tutorial you've made a dataset suitable for training a machine learning model. You've also made sure that you can test your machine learning model once you've trained it.

In the [next part](./02-train-model.ipynb), we're going to take a look at training a machine learning model.