## Importing the Data
The data for this project was downloaded from the competition page [here](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/data/). Because the competition requires submissions in a specific format, we must retain the index column (`id`) throughout modeling so that our predictions can be validated against the competition's validation data. The data is stored in the `data` folder of the project's [GitHub repository](https://github.com/sethchart/Pump-it-Up-Data-Mining-the-Water-Table).

We will load the data from the downloaded CSV files, perform a 90%-10% train-test split, and pickle the resulting dataframes for use in other notebooks.

### Importing Modules

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import pickle

### Loading Data from CSV Files

In [None]:
features = pd.read_csv('../data/training_features.csv', index_col='id')
targets = pd.read_csv('../data/training_labels.csv', index_col='id')
df = features.join(targets, how='left')
X = df.drop('status_group', axis=1)
y = df['status_group']
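
Since the submission format depends on the `id` index, it can be worth confirming that the join keeps it intact. The following is a minimal sketch on toy data, not the project's actual files: the feature values are hypothetical stand-ins, and the checks are just one possible set of sanity assertions.

```python
import pandas as pd

# Toy stand-ins for the feature and label files (hypothetical values).
features = pd.DataFrame(
    {"amount_tsh": [0.0, 25.0, 50.0]},
    index=pd.Index([101, 102, 103], name="id"),
)
targets = pd.DataFrame(
    {"status_group": ["functional", "non functional", "functional"]},
    index=pd.Index([103, 101, 102], name="id"),  # deliberately out of order
)

# Left join on the shared `id` index, mirroring the cell above.
df = features.join(targets, how="left")

# Checks: no rows gained or lost, ids unique, every row has a label.
assert len(df) == len(features)
assert df.index.is_unique
assert df["status_group"].notna().all()
```

Because `join` aligns rows on the index rather than on position, the labels end up attached to the correct `id` even when the two files are ordered differently.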

### Set Random State

In [None]:
random_state = 42

### Train-Test Split
For model tuning, we hold out 10% of the data for local testing. We keep the test set small because the competition's validation data is also available for model validation.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=random_state)
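
To illustrate that the split keeps the `id` index attached to each row, here is a small sketch on toy data. The ids, the `feature` column, and the label pattern are hypothetical; only the `train_test_split` call mirrors the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a non-trivial `id` index (hypothetical values).
ids = pd.Index(range(1000, 1020), name="id")
X = pd.DataFrame({"feature": range(20)}, index=ids)
y = pd.Series(["functional", "non functional"] * 10, index=ids, name="status_group")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# Features and labels stay aligned, and no id lands in both sets.
assert X_train.index.equals(y_train.index)
assert X_test.index.equals(y_test.index)
assert X_train.index.intersection(X_test.index).empty
print(len(X_train), len(X_test))
```

Fixing `random_state` makes the split reproducible, so other notebooks that rerun this step would see the same rows in each set.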

### Pickling Dataframes 

In [None]:
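# Open in write mode ('wb'). Append mode ('ab') would add a new pickle to the
# file on every run, and pickle.load would only return the first (stale) one.
with open('../data/train_test_split.pkl', mode='wb') as f:
    pickle.dump({'X_train': X_train, 'X_test': X_test, 'y_train': y_train, 'y_test': y_test}, f)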
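
Other notebooks can then read the split back with `pickle.load`. The sketch below shows the round trip on toy data written to a temporary directory, so it does not touch the project's `data` folder. The dataframes are hypothetical placeholders, and only the dictionary keys are taken from the cell above.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the dictionary pickled above (hypothetical values).
ids = pd.Index([1, 2, 3], name="id")
split = {
    "X_train": pd.DataFrame({"feature": [10, 20, 30]}, index=ids),
    "y_train": pd.Series(["functional"] * 3, index=ids, name="status_group"),
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train_test_split.pkl"

    # Write once in binary write mode.
    with open(path, mode="wb") as f:
        pickle.dump(split, f)

    # Read back, as a downstream notebook would.
    with open(path, mode="rb") as f:
        loaded = pickle.load(f)

X_train = loaded["X_train"]
y_train = loaded["y_train"]
```

Only unpickle files from a source you trust, since `pickle.load` can execute arbitrary code embedded in the file.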