This notebook is designed to be run as the first notebook of four included in this submission.

# Hailey Weinschenk - BrainStation Capstone Notebook 1 - Data Loading and Exploratory Data Analysis

In this notebook, the data is loaded and prepared for later modelling. Some basic EDA follows: null handling, viewing distributions, and correlations. First, import the necessary libraries and load the data. If the file `full_preprocessedML.csv` already exists, only the else clause runs, which takes about 10 minutes to load. Without the file, the if clause runs the preprocessing from scratch, which takes 4-5 hours.

In [1]:
import os.path
from preprocess import preprocess
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
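# Rebuild the preprocessed CSV only if it is missing (slow path, 4-5 hours).
if not os.path.exists('full_preprocessedML.csv'):
    x_dim = 216
    y_dim = 288
    deep = False
    df = preprocess(x_dim, y_dim, deep)
    df.to_csv('full_preprocessedML.csv')
else:
    # Read the cached CSV in 250-row chunks, then combine into one DataFrame.
    df_chunk = pd.read_csv('full_preprocessedML.csv', index_col=0, iterator=True, chunksize=250)
    df = pd.concat(df_chunk)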

In [None]:
df

The loaded data has 2756 rows and 186631 columns. This can be interpreted as 2756 images in the database, each of which has about 186,000 pixel features. The remaining columns are 'is_red', 'suit_num' and 'card_number', plus the helper columns used to obtain them. Even with these disregarded, our column space is massive.
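As a quick sanity check on the column count, the arithmetic below assumes that each image is stored as 216 x 288 pixels with three colour channels. The channel count is an assumption and is not stated in this notebook. Under that assumption, the pixel features account for all but seven of the columns, which would be consistent with a handful of label and helper columns.

```python
# Dimensions passed to preprocess() earlier in this notebook.
x_dim, y_dim = 216, 288
# ASSUMPTION: three colour channels (e.g. RGB) per pixel.
channels = 3
# Column count reported for the loaded frame.
total_columns = 186631

pixel_features = x_dim * y_dim * channels
non_pixel_columns = total_columns - pixel_features

print(pixel_features)     # 186624
print(non_pixel_columns)  # 7
```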

Next, we look at how the images are distributed across cards, suits, and the red/black split.

In [None]:
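# Number of images per card; size() avoids counting every one of the pixel columns.
counts = df.groupby('card_string').size()
# Five most frequent and five least frequent cards.
counts.sort_values(ascending=False).head(), counts.sort_values(ascending=True).head()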

Almost every card has either 51 or 52 images. The exception is the 4 of clubs, which has 10 extra images. This could be an error made when the database was created. Regardless, it shouldn't have a large impact on our problem.
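One way to surface a card like the 4 of clubs without reading the counts by eye is to compare each card's count against the median count. The sketch below uses a small made-up frame rather than the real data. The names `toy`, `typical`, and `outliers` are illustrative, and the cutoff of more than one image away from the median is an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for the real frame: three cards with 52, 51, and 62 rows.
toy = pd.DataFrame({
    'card_string': ['2 of hearts'] * 52 + ['3 of spades'] * 51 + ['4 of clubs'] * 62
})

# Number of rows (images) per card.
counts = toy['card_string'].value_counts()

# Treat the median count as "typical" and flag cards more than one image away.
typical = counts.median()
outliers = counts[(counts - typical).abs() > 1]

print(outliers)
```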

In [None]:
# Number of images per suit.
counts = df.groupby('suit').size()
plt.figure()
plt.title('Suit Distributions')
plt.xlabel('Suits')
plt.ylabel('Frequency')
plt.bar(counts.index, counts.values, color=['green', 'blue', 'red', 'yellow', 'black'])
plt.show()

Our suit distributions are as expected. Each suit other than clubs has roughly $13 \times 51 = 663$ images, or slightly more where some cards have 52. Clubs has a few extra because of the additional 4 of clubs images. The joker has only one rank and therefore only 51 or 52 images.

In [None]:
# Number of images per colour class (0 = black, 1 = red).
counts = df.groupby('is_red').size()
plt.figure()
plt.title('Binary Distributions')
plt.xlabel('Color')
plt.ylabel('Frequency')
plt.xticks(ticks=[0, 1], labels=['Black', 'Red'])
plt.bar(counts.index, counts.values, color=['black', 'red'])
plt.show()

Finally, we see that the extra 4 of clubs images affected the black/red distribution as well. In addition, if more black cards than red cards have 52 images rather than 51, that would add to the difference shown above.

## Null Checks

In [None]:
df.isna().sum()

In [None]:
(df.isna().sum() == 0).all()

Each of the 186631 columns has a null count of zero, so there are no null values in this dataset. This is a clear advantage of working with a 'toy' problem or with image data in general. The difficulty instead comes from managing and manipulating such inconceivably large datasets.
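For a frame this wide, a single boolean answer plus a list of offending columns can be easier to read than a per-column Series. A minimal sketch on a toy frame follows; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in the second column.
toy = pd.DataFrame({'pixel_0': [0.1, 0.5, 0.9], 'pixel_1': [0.2, np.nan, 0.4]})

# Per-column null counts, in the same style as the cells above.
null_counts = toy.isna().sum()

# Collapse to one answer: does any cell contain a null?
has_nulls = bool(toy.isna().to_numpy().any())

# List only the columns that actually contain nulls.
columns_with_nulls = null_counts[null_counts > 0].index.tolist()

print(has_nulls, columns_with_nulls)
```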

With this in mind, the correlation coefficients between features can be examined. Since a correlation matrix over 186 thousand features would be impossible to view, we will take a small slice of our data and view its correlations. The goal is to reveal potential multicollinearity, which can cause issues when modelling. Multicollinearity occurs when features are strongly related to one another, not just to the target. It needs to be dealt with in some way before moving on.
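To make the idea concrete, the sketch below builds three synthetic "pixel" columns, two of which are nearly copies of each other, and lists the column pairs whose absolute correlation exceeds a threshold. The data, the column names, and the 0.9 cutoff are all illustrative assumptions and are not taken from the card dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)

toy = pd.DataFrame({
    'px_a': base,
    'px_b': base + rng.normal(scale=0.05, size=n),  # nearly a copy of px_a
    'px_c': rng.normal(size=n),                     # generated independently
})

# Absolute pairwise Pearson correlations between columns.
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (column, column) pairs and keep the strongly correlated ones.
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]

print(high_pairs)
```

Pairs flagged this way are candidates for dropping a column or for a dimensionality reduction step before modelling.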

In [None]:
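# Small slice of the data: rows 1-49 and 25 consecutive columns.
part_df = df.iloc[1:50, 15000:15025]
# Pairwise correlations between the selected columns.
sns.heatmap(part_df.corr(), cmap='coolwarm')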

Although some pairs of pixels in the heatmap are highly correlated, there is no clear pattern of high or low correlation between two pixels that holds for *each* picture. For example, one image might feature a dark object, which could drive a strongly negative correlation with adjacent pixels for that image but not for the others.

With this in mind, we can move forward with modelling.