# Data Preparation 04

In this notebook I prepare the data for analysis and modeling (data cleaning, manipulation, feature engineering). The image data comes from:
* train_datascan_ii
* train_datascan_iii
* train_datascan_iv

> NOTE: I skip the data from datascan_i because no additional data is needed, and removing the wooden background in that dataset would require a different threshold setting.

#### Data Cleaning
Only datascan_iii needed a minor cleaning step. Because the frames extracted from the movie can include images in which a card is only partially shown, I did a visual check in Windows Explorer. As the frames are ordered by time, it was a piece of cake to find and eliminate them. My exported data was cleaned in this manner.
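The visual check could in principle be supported by a simple heuristic. The sketch below is not part of the original workflow: it assumes that a partially shown card produces a bounding box that touches the frame border. The function name `touches_border`, the `margin` parameter, and the file names and box values are all hypothetical.

```python
def touches_border(x, y, width, height, org_width, org_height, margin=2):
    """Return True if the bounding box lies within `margin` pixels of any frame edge."""
    return (
        x <= margin
        or y <= margin
        or x + width >= org_width - margin
        or y + height >= org_height - margin
    )


# Hypothetical bounding boxes (x, y, width, height) inside a 640x480 frame.
boxes = {
    "frame_0001.jpg": (120, 80, 200, 300),   # card fully inside the frame
    "frame_0002.jpg": (500, 90, 140, 300),   # right edge reaches x = 640
}
suspects = [name for name, box in boxes.items() if touches_border(*box, 640, 480)]
print(suspects)
```

Frames flagged this way would still need a visual check, since a card that is fully visible but close to the edge is flagged as well.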

#### Feature Engineering

The following data is collected:
* dataSet (internal use)
* cardId (will be the target feature)
* x (x position of the bounding box within the original image)
* y (y position of the bounding box within the original image)
* width (width of the cropped image)
* height (height of the cropped image)
* orgWidth (width of the original image)
* orgHeight (height of the original image)
* red channel histogram data from value r0 to r255
* green channel histogram data from value g0 to g255
* blue channel histogram data from value b0 to b255
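The feature layout above amounts to 8 metadata columns plus 3 x 256 histogram bins per image. The following is a minimal sketch of that layout, assuming an 8-bit image in OpenCV's BGR channel order. It uses `np.bincount` as a stand-in for `cv2.calcHist`, which should give the same per-intensity counts for 256 bins over the range 0 to 255. The helper name `channel_histograms` and the synthetic image are illustrative only.

```python
import numpy as np

# Column layout: 8 metadata columns followed by 3 x 256 histogram bins.
meta_columns = ["dataSet", "cardId", "x", "y", "width", "height", "orgWidth", "orgHeight"]
hist_columns = [f"{channel}{value}" for channel in "rgb" for value in range(256)]
columns = meta_columns + hist_columns


def channel_histograms(img_bgr):
    """Count pixels per intensity (0-255) for each channel, returned in r, g, b order."""
    b, g, r = (img_bgr[:, :, i].ravel() for i in range(3))  # OpenCV stores BGR
    return np.concatenate([np.bincount(c, minlength=256) for c in (r, g, b)])


# Synthetic 4x5 crop (made-up data standing in for a cropped card image).
rng = np.random.default_rng(0)
crop = rng.integers(0, 256, size=(4, 5, 3), dtype=np.uint8)
features = channel_histograms(crop)

print(len(columns))           # 776 columns in total
print(features.shape)         # (768,) histogram values
print(features[:256].sum())   # 20, one count per pixel in the red channel
```

Each channel's 256 bins sum to the number of pixels in the crop, which is a quick sanity check for a feature row.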


In [1]:
import JassSummarizer as js
from IPython.display import display, HTML
import matplotlib.pyplot as plt 
import numpy as np
import pandas as pd
from tqdm import tqdm
import shutil
import os
import cv2                                                                                  # computer vision python library, see README.md dependencies

np.set_printoptions(suppress=True)                                      # do not use scientific notation for numbers in numpy
pathList= [r".\images\02_data_preparation\train_datascan_ii",r".\images\02_data_preparation\train_datascan_iii",r".\images\02_data_preparation\train_datascan_iv"]
color = ('b','g','r')                                                   # the graph enumerator and color
columnTitle=[]
columnTitle=columnTitle+"dataSet,cardId,x,y,width,height,orgWidth,orgHeight".split(",")     # title for the first part of columns
columnTitle=columnTitle+"".join([f"r{num}," for num in range(256)]).split(",")[0:-1]        # adding all the red color numbered labels from r0-r255
columnTitle=columnTitle+"".join([f"g{num}," for num in range(256)]).split(",")[0:-1]        # adding all the green color numbered labels from g0-g255
columnTitle=columnTitle+"".join([f"b{num}," for num in range(256)]).split(",")[0:-1]        # adding all the blue color numbered labels from b0-b255
data={}                                                                                     # dictionary to store the collected data

for path in pathList:
    dataSet=path.split("\\")[-1].replace("train_","")
    fileList = [os.path.join(dp, f) for dp, dn, filenames in os.walk(path) for f in filenames if os.path.splitext(f)[1] == '.jpg']
    print(f"Dataset Name: {dataSet:15} includes {len(fileList):6} images on Path: {path}")    
    data[dataSet]=[]
    with tqdm(total=len(fileList)) as pbar:                                                 # visualize progress
        for file in fileList:                                                               # iterate through all images in dataset
            histr = []                                                                      # reset histogram recordset
            img=cv2.imread(file,cv2.IMREAD_COLOR)                                           # read image
            orgHeight,orgWidth = img.shape[0],img.shape[1]                                  # save original image dimensions
            frameOrg,mask,img_rect,res,crop_img,d=js.analyzeScan(img)                       # analyze image data and receive cropped image
            x,y,width,height=d                                                              # store dimensions
            for i,colorChannel in enumerate(color):                                         # color enumerator
                histr.append(cv2.calcHist([crop_img],[i],None,[256],[0,256]))               # calc histodata
            r=histr[2]; g=histr[1]; b=histr[0]                                              # prepare r,g,b column data
            cardId=file.split("\\")[-1][0:2]                                                # get cardId from filename
            # concatenate all the collected data 
            data[dataSet].append(np.concatenate(([dataSet,cardId,x,y,width,height,orgWidth,orgHeight],r.flatten().astype("uint32"),g.flatten().astype("uint32"),b.flatten().astype("uint32")),axis=0).flatten())
            pbar.update(1)                                                                  # update visualization progress
        df=pd.DataFrame(np.array(data[dataSet]),columns=columnTitle)                        # add numpy array to dataframe
        df.to_csv(dataSet+".csv",header=True)                                               # save dataframe as csv including header infos



Dataset Name: datascan_ii     includes   1154 images on Path: .\images\02_data_preparation\train_datascan_ii


100%|██████████████████████████████████████████████████████████████████████████████| 1154/1154 [00:26<00:00, 43.29it/s]

Dataset Name: datascan_iii    includes  13904 images on Path: .\images\02_data_preparation\train_datascan_iii


100%|████████████████████████████████████████████████████████████████████████████| 13904/13904 [05:59<00:00, 38.63it/s]

Dataset Name: datascan_iv     includes   6480 images on Path: .\images\02_data_preparation\train_datascan_iv


100%|██████████████████████████████████████████████████████████████████████████████| 6480/6480 [02:28<00:00, 43.49it/s]
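
Because `to_csv` is called without `index=False`, each exported file also contains the DataFrame index as its first column. The sketch below shows one way to read such a file back; it uses a tiny made-up frame and an in-memory buffer instead of the real CSV files. Reading `cardId` as a string is an assumption that matters only if the two-character IDs can have leading zeros.

```python
import io
import pandas as pd

# Tiny stand-in for one exported file (made-up values, only a few columns).
df = pd.DataFrame(
    [["datascan_ii", "01", "12", "34"], ["datascan_ii", "10", "56", "78"]],
    columns=["dataSet", "cardId", "x", "y"],
)
buffer = io.StringIO()
df.to_csv(buffer, header=True)   # same call style as above: the index is written too
buffer.seek(0)

# index_col=0 restores the index; dtype keeps cardId as text ("01" stays "01").
loaded = pd.read_csv(buffer, index_col=0, dtype={"cardId": str})
print(loaded.columns.tolist())
print(loaded["cardId"].tolist())
```

Without `index_col=0` the index would come back as an extra unnamed column, and without the `dtype` hint a card ID such as "01" would be parsed as the integer 1.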
