---
# Data Loading
---

### Notebook Summary:

In this notebook, I will perform data pre-processing on a dataset of sign-language images sourced from OpenML.

#### Key Steps:
- **Data Extraction**: Load the sign-language dataset from OpenML.
- **Data Cleaning**: Deal with discontinuity in the target variable.
- **Separate out X (features) and y (target)**: Store the pixel data (features) in variable X and the labels in variable y.
- **Data Storage**: Save X and y into pickle files for easy use in other notebooks.


In [1]:
# List of imports
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
import joblib


### Importing data from OpenML

In [2]:
# Fetching the Sign Language MNIST (American Sign Language) dataset from OpenML
#TODO: update to data folder
data_path = "data/mnist"
mnist = fetch_openml('SignMNIST', data_home=data_path, as_frame=False)


In [3]:
# Setting class (target) variable -> indices of the alphabet, casting the datatype from float to int
my_class = mnist.target.astype(np.int32)

# Accessing image data -> reshape into a 3D array (num_images, pixels_height, pixels_width)
# Each image is 28 by 28 pixels; passing -1 lets NumPy infer the number of images in the dataset
my_image = mnist.data.reshape(-1, 28, 28)

# Sanity checking on shapes
print('my_image shape:', my_image.shape)
print('my_class shape:', my_class.shape)

my_image shape: (34627, 28, 28)
my_class shape: (34627,)
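
To illustrate the `-1` placeholder, here is a minimal sketch on synthetic data. The array `fake_pixels` is hypothetical and only stands in for `mnist.data`; the point is that NumPy infers the number of images from the total element count, and that flattening later recovers the original layout.

```python
import numpy as np

# Hypothetical stand-in for mnist.data: 5 images of 784 pixels each
fake_pixels = np.arange(5 * 784).reshape(5, 784)

# -1 lets NumPy infer the first dimension from the total element count
images = fake_pixels.reshape(-1, 28, 28)
print(images.shape)

# Flattening back recovers the original 2-D layout unchanged
restored = images.reshape(images.shape[0], -1)
print(np.array_equal(restored, fake_pixels))
```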


### Mapping of target variable to remove discontinuity

In [4]:
np.unique(my_class)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24], dtype=int32)

##### Comment: The labels in the array are not contiguous: 9 is missing. When we use neural networks later in the project this will be problematic, as we would get an extra node in our output layer for '9'.
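
A quick way to see the problem, as a sketch only: it assumes the output layer is sized from the largest label index, which is a common convention for integer-encoded labels and not a statement about any specific model used later. The label values are copied from the `np.unique` output above.

```python
import numpy as np

# Label values as printed by np.unique(my_class): 0 to 24 with 9 missing
labels = np.array([v for v in range(25) if v != 9])

n_classes = len(np.unique(labels))  # distinct classes actually present
n_nodes_naive = int(labels.max()) + 1  # output size if labels index nodes directly

print(n_classes, n_nodes_naive)
```

With the gap in place, the naive output size is one larger than the number of classes, so one node would never receive a training example.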

In [5]:
# mapping to avoid the 'skip' in y
mapping = {0:0,
           1:1,
           2:2,
           3:3,
           4:4,
           5:5,
           6:6,
           7:7,
           8:8,
           10:9,
           11:10,
           12:11,
           13:12,
           14:13,
           15:14,
           16:15,
           17:16,
           18:17,
           19:18,
           20:19,
           21:20,
           22:21,
           23:22,
           24:23
           }
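
The same dictionary can also be derived rather than typed by hand. The sketch below is an alternative, not the approach used in this notebook: it relies on `np.unique` returning sorted values, and the names `labels`, `derived_mapping` and `inverse` are illustrative.

```python
import numpy as np

# Stand-in for my_class: every label value seen above, repeated twice
labels = np.array([v for v in range(25) if v != 9] * 2)

# uniques is sorted; inverse holds each label's position within uniques
uniques, inverse = np.unique(labels, return_inverse=True)

# Old label -> new contiguous label, built from the sorted unique values
derived_mapping = {int(old): new for new, old in enumerate(uniques)}

print(derived_mapping[8], derived_mapping[10], derived_mapping[24])
```

Here `inverse` is already the remapped label array, so the dictionary is only needed if the original labels have to be looked up again later.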

In [6]:
my_target_df = pd.DataFrame(my_class, columns=['mnist_target'])

In [7]:
# Using .map on the target column to remap y onto contiguous labels
my_target_df['new_target'] = my_target_df['mnist_target'].map(mapping)

In [8]:
# Checking the new target values are contiguous (0 to 23)
np.unique(my_target_df['new_target'])

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23])
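
One thing worth guarding against when remapping with a dictionary: `Series.map` returns `NaN` for any value that is not a key, which also turns the column into floats. A minimal sketch, using a deliberately incomplete and hypothetical `partial_mapping`:

```python
import pandas as pd

# Hypothetical partial mapping that forgets the key 10
partial_mapping = {0: 0, 8: 8}
labels = pd.Series([0, 8, 10])

mapped = labels.map(partial_mapping)

# Keys absent from the dict become NaN, which also makes the column float
n_missing = int(mapped.isna().sum())
print(n_missing, mapped.dtype)
```

A zero count from `isna().sum()` on the real column would confirm that every original label had an entry in the mapping.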

In [9]:
# Flattening each image back to 784 pixels for the Logistic Regression model
flattened_dataset = my_image.reshape(my_image.shape[0], 28 * 28)

In [10]:
X = flattened_dataset
y = np.array(my_target_df['new_target'])

### Pickling X and y to be used for all models

In [12]:
# storing data in my_files
#TODO: save to data folder as pkl file
joblib.dump(X, '../../model/my_files/X.pkl')
joblib.dump(y, '../../model/my_files/y.pkl')

['../../model/my_files/y.pkl']

### Testing pickling was successful

In [14]:
test = joblib.load('../../model/my_files/X.pkl')
test_y = joblib.load('../../model/my_files/y.pkl')

In [15]:
test_y.shape

(34627,)

In [16]:
y.shape

(34627,)

In [17]:
np.unique(test_y)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23])
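
Shapes and unique values matching is a good sign, but a stricter check compares the arrays element by element. The sketch below does this with small hypothetical arrays (`X_demo`, `y_demo`) written to a temporary directory, so it does not touch the real pickle files.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np

# Small hypothetical arrays standing in for X and y
X_demo = np.arange(12, dtype=np.float64).reshape(3, 4)
y_demo = np.array([0, 9, 23])

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = str(Path(tmp_dir) / "X.pkl")
    y_path = str(Path(tmp_dir) / "y.pkl")
    joblib.dump(X_demo, x_path)
    joblib.dump(y_demo, y_path)

    # Load while the temporary files still exist
    X_loaded = joblib.load(x_path)
    y_loaded = joblib.load(y_path)

# Element-wise equality, including shape
x_ok = np.array_equal(X_loaded, X_demo)
y_ok = np.array_equal(y_loaded, y_demo)
print(x_ok, y_ok)
```

Applied to this notebook, the equivalent check would be `np.array_equal(test, X)` and `np.array_equal(test_y, y)`.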