# Data Cleaning and Normalization

Routines for cleaning and normalizing data for processing.

Objectives:

1. Reading a CSV file into a pandas data frame
2. Normalizing the original data frame and storing the result in a separate data frame
3. Transforming the data frame into a format readable by the autoencoder

In [1]:
# Import the libraries needed for loading and scaling the data
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

## Importing Data

In [2]:
# Chunk size (number of rows) for reading data
chunksize = 10000

# Path to the dataset; change this to point at your copy of the file
dataset_file = '../data/creditcardfraud_raw.csv'

print("Loading dataset '{}'...".format(dataset_file))

# Read each chunk into a list, then concatenate once.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
chunks = []
for i, chunk in enumerate(pd.read_csv(dataset_file, chunksize=chunksize)):
    print("Reading chunk %d" % (i + 1))
    chunks.append(chunk)
data = pd.concat(chunks)

print("Done loading dataset...")
    
# Check for proper value of input dimensionality to be used by model
input_dim = len(data.columns) - 1
print("Input Dimensionality: %d" % (input_dim))
print(data)
print("Dropping Time column")
data = data.drop(['Time'], axis=1)
print(data)

Loading dataset '../data/creditcardfraud_raw.csv'...
Reading chunk 1
Reading chunk 2
Reading chunk 3
Reading chunk 4
Reading chunk 5
Reading chunk 6
Reading chunk 7
Reading chunk 8
Reading chunk 9
Reading chunk 10
Reading chunk 11
Reading chunk 12
Reading chunk 13
Reading chunk 14
Reading chunk 15
Reading chunk 16
Reading chunk 17
Reading chunk 18
Reading chunk 19
Reading chunk 20
Reading chunk 21
Reading chunk 22
Reading chunk 23
Reading chunk 24
Reading chunk 25
Reading chunk 26
Reading chunk 27
Reading chunk 28
Reading chunk 29
Done loading dataset...
Input Dimensionality: 30
            Time         V1         V2        V3        V4        V5  \
0            0.0  -1.359807  -0.072781  2.536347  1.378155 -0.338321   
1            0.0   1.191857   0.266151  0.166480  0.448154  0.060018   
2            1.0  -1.358354  -1.340163  1.773209  0.379780 -0.503198   
3            1.0  -0.966272  -0.185226  1.792993 -0.863291 -0.010309   
4            2.0  -1.158233   0.877737  1.548718  0.40
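
The chunked loading and column drop in the cell above can be tried without the real file. The following is a minimal sketch, assuming a tiny in-memory CSV: the text in `csv_text` and its three columns are hypothetical stand-ins, not the actual dataset. It reads the text in chunks, concatenates the pieces once, and checks that the result matches a single-call read.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for the real dataset file (assumption)
csv_text = (
    "Time,V1,Amount\n"
    "0,-1.36,149.62\n"
    "0,1.19,2.69\n"
    "1,-1.36,378.66\n"
    "1,-0.97,123.50\n"
    "2,-1.16,69.99\n"
)

# Read two rows at a time, collect the chunks, and concatenate once at the end
chunks = list(pd.read_csv(io.StringIO(csv_text), chunksize=2))
chunked = pd.concat(chunks)

# Reading the same text in a single call should give an equal frame
whole = pd.read_csv(io.StringIO(csv_text))
same = chunked.equals(whole)

# Dropping a column by name leaves the remaining columns in their original order
features = chunked.drop(columns=["Time"])
print(len(chunks), same, list(features.columns))
```

Collecting chunks in a list and calling `pd.concat` once avoids copying the growing frame on every iteration, which matters more as the number of chunks increases.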

## Normalizing the data with `MinMaxScaler`

In [3]:
# create a scaler object
scaler = MinMaxScaler()

# fit and transform the data
df_norm = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

print(df_norm)

              V1        V2        V3        V4        V5        V6        V7  \
0       0.935192  0.766490  0.881365  0.313023  0.763439  0.267669  0.266815   
1       0.978542  0.770067  0.840298  0.271796  0.766120  0.262192  0.264875   
2       0.935217  0.753118  0.868141  0.268766  0.762329  0.281122  0.270177   
3       0.941878  0.765304  0.868484  0.213661  0.765647  0.275559  0.266803   
4       0.938617  0.776520  0.864251  0.269796  0.762975  0.263984  0.268968   
...          ...       ...       ...       ...       ...       ...       ...   
284802  0.756448  0.873531  0.666991  0.160317  0.729603  0.236810  0.235393   
284803  0.945845  0.766677  0.872678  0.219189  0.771561  0.273661  0.265504   
284804  0.990905  0.764080  0.781102  0.227202  0.783425  0.293496  0.263547   
284805  0.954209  0.772856  0.849587  0.282508  0.763172  0.269291  0.261175   
284806  0.949232  0.765256  0.849601  0.229488  0.765632  0.256488  0.274963   

              V8        V9       V10  .
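
With its default settings, `MinMaxScaler` rescales each column independently to the range [0, 1] using x' = (x - min) / (max - min). The sketch below checks that formula against the scaler on a small frame; the frame `toy` and its values are hypothetical and only mimic the column naming of the dataset.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame; values are made up for illustration (assumption)
toy = pd.DataFrame({"V1": [-1.0, 0.0, 3.0], "Amount": [10.0, 20.0, 50.0]})

# Manual min-max scaling: (x - min) / (max - min), computed per column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same scaling via MinMaxScaler with its default feature_range of (0, 1)
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# Largest absolute difference between the two results
max_diff = (manual - scaled).abs().max().max()
print(manual)
print(scaled)
print(max_diff)
```

One difference worth keeping in mind: for a constant column the manual formula divides by zero and yields NaN, whereas `MinMaxScaler` handles a zero range internally. The fitted `scaler` can also map scaled values back to the original units with `scaler.inverse_transform`.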