# Code to write cleaned datasets

### Creates the following CSV files

1. `train_df.csv` = the original train CSV plus a new column that maps each integer class label to its cover type name
2. `dev_data.csv` = features for a random 10% subset of the original train data (the Y value is excluded)
3. `dev_labels.csv` = Y values (labels) for the same 10% subset
4. `train_data.csv` = features for the remaining 90% of the original train data (the Y value is excluded)
5. `train_labels.csv` = Y values (labels) for the same 90% subset

The goal is to standardize the datasets that each group member uses.

## Load packages

In [1]:
import pandas as pd 
import numpy as np

## Load data and build modeling datasets

In [2]:
# ^^^^^^^^^^^^^^^^^^^^^^^^^
# load data

train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

# ^^^^^^^^^^^^^^^^^^^^^^^^^
# set up modeling datasets

# X = all but last column
X_train = train.iloc[:, :-1].to_numpy()
X_test = test.to_numpy()

# Y = last column only
Y_train = train.iloc[:, -1].to_numpy()

# build dev set based on random subset (10% of train data)
shuffle = np.random.permutation(np.arange(X_train.shape[0]))
X_train, Y_train = X_train[shuffle], Y_train[shuffle]

dev_size = round(X_train.shape[0] * 0.1)
dev_data, dev_labels = X_train[:dev_size], Y_train[:dev_size]
train_data, train_labels = X_train[dev_size:], Y_train[dev_size:]
test_data = X_test

print('Train Data shape: ', train_data.shape)
print('Train Labels shape: ', train_labels.shape)
print()
print('Dev Data shape: ', dev_data.shape)
print('Dev Labels shape: ', dev_labels.shape)
print()
print('Test Data shape: ', test_data.shape)
print()

# check dev split works
print(f'Dev split check status: {dev_data.shape[0] + train_data.shape[0] == X_train.shape[0]}')

Train Data shape:  (13608, 55)
Train Labels shape:  (13608,)

Dev Data shape:  (1512, 55)
Dev Labels shape:  (1512,)

Test Data shape:  (565892, 55)

Dev split check status: True
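
Rerunning the split cell draws a new permutation, so the dev set changes unless a seed is fixed. The sketch below shows one way to make the same shuffle-and-slice split repeatable. The helper name `split_dev`, the seed value `0`, and the toy arrays are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np

def split_dev(X, y, dev_frac=0.1, seed=0):
    """Shuffle X and y together, then hold out the first dev_frac of rows as dev."""
    rng = np.random.default_rng(seed)       # seeded generator (seed is an arbitrary choice)
    order = rng.permutation(X.shape[0])     # one permutation applied to both arrays
    X, y = X[order], y[order]
    dev_size = round(X.shape[0] * dev_frac)
    return X[dev_size:], y[dev_size:], X[:dev_size], y[:dev_size]

# toy arrays standing in for X_train / Y_train
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

train_X, train_y, dev_X, dev_y = split_dev(X_toy, y_toy)
print(train_X.shape, dev_X.shape)  # (18, 2) (2, 2)
```

Calling `split_dev` again with the same seed returns the same rows in the same order, which is what matters when several people regenerate the files independently.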


## Label category column

The cover types (Y) are:
1. Spruce/Fir
2. Lodgepole Pine
3. Ponderosa Pine
4. Cottonwood/Willow
5. Aspen
6. Douglas-fir
7. Krummholz

Merge these names into the `train` DataFrame for use in EDA.

In [3]:
label_categories = ['Spruce/Fir', 
                    'Lodgepole Pine', 
                    'Ponderosa Pine', 
                    'Cottonwood/Willow', 
                    'Aspen',
                    'Douglas-fir',
                    'Krummholz']
# pair each name with its integer label (1-7), then join on the shared key
label_categories = pd.DataFrame({'Cover_Type_Name': label_categories,
                                 'Cover_Type': list(range(1, 8))})
train = train.merge(label_categories, on='Cover_Type')
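
An alternative way to attach the names, sketched here on a toy frame, is to map the integer labels through a dictionary with `Series.map`. This leaves the row order and index untouched and turns any unexpected label into `NaN`, which makes it easy to check. The names `cover_type_names` and `toy` are assumptions for illustration.

```python
import pandas as pd

# integer label -> cover type name, in the order listed above
cover_type_names = {
    1: 'Spruce/Fir',
    2: 'Lodgepole Pine',
    3: 'Ponderosa Pine',
    4: 'Cottonwood/Willow',
    5: 'Aspen',
    6: 'Douglas-fir',
    7: 'Krummholz',
}

# toy frame standing in for the train DataFrame
toy = pd.DataFrame({'Cover_Type': [5, 1, 7, 2]})
toy['Cover_Type_Name'] = toy['Cover_Type'].map(cover_type_names)

# a label missing from the dictionary would appear as NaN here
all_mapped = toy['Cover_Type_Name'].notna().all()
print(toy)
```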

## Write new datasets

In [4]:
train.to_csv('data/train_df.csv')
pd.DataFrame(dev_data).to_csv('data/dev_data.csv')
pd.DataFrame(dev_labels).to_csv('data/dev_labels.csv')
pd.DataFrame(train_data).to_csv('data/train_data.csv')
pd.DataFrame(train_labels).to_csv('data/train_labels.csv')
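
By default `to_csv` writes the DataFrame index as the first column, so each file above carries an extra unnamed index column. A minimal sketch of reading the files back into arrays, using toy data and a temporary directory as stand-ins for the real `data/` folder (both are assumptions for illustration):

```python
import os
import tempfile

import numpy as np
import pandas as pd

toy_data = np.arange(6).reshape(3, 2)   # stands in for dev_data
toy_labels = np.array([1, 2, 3])        # stands in for dev_labels

with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, 'dev_data.csv')
    labels_path = os.path.join(tmp, 'dev_labels.csv')

    # same write pattern as the cell above (index is written by default)
    pd.DataFrame(toy_data).to_csv(data_path)
    pd.DataFrame(toy_labels).to_csv(labels_path)

    # index_col=0 treats the first column as the index instead of a feature
    data_back = pd.read_csv(data_path, index_col=0).to_numpy()
    labels_back = pd.read_csv(labels_path, index_col=0).to_numpy().ravel()

print(data_back.shape, labels_back.shape)  # (3, 2) (3,)
```

Reading without `index_col=0` would add the row index as an extra feature column, so it is worth agreeing on one loading pattern across the group.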