---
---
# 1) Initial Data Exploration
This notebook is a brief introduction to the contents of the Virus-MNIST image data set. More detailed analyses are reserved for future notebooks; the exploration here is meant to give a general familiarity with the data's structure and contents.

---
# 2) Installs & Imports

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import glob
import matplotlib.pyplot as plt
from matplotlib import rcParams
import skimage.io as io
from skimage.io.collection import ImageCollection
%matplotlib inline

# Set global plot values
rcParams['figure.facecolor'] = 'lightgray'
rcParams['figure.figsize'] = (13, 5)

---
# 3) Load & View Data

In [None]:
# Load the train, test, and label CSV files as DataFrames and inspect
train = pd.read_csv('../input/virusmnist/train.csv')
test = pd.read_csv('../input/virusmnist/test.csv')
trainLabels = pd.read_csv('../input/virusmnist/trainLabels.csv')

train.head()

In [None]:
test.head()

In [None]:
trainLabels

In [None]:
trainLabels.groupby('Class').count()

The *trainLabels.csv* file does not contain names for the individual examples, so it is not yet clear which entries belong to the train group and which to the test group. Regardless, it can be used to analyze any trends in hash value that help predict an example's class.
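
The per-class count above can be tried on a toy table. This is a minimal sketch: the `Class` column comes from the notebook, while `toy_labels` and the `Id` column are hypothetical stand-ins, not the real contents of *trainLabels.csv*.

```python
import pandas as pd

# Toy stand-in for trainLabels; 'Id' is a hypothetical column name.
toy_labels = pd.DataFrame({
    'Id': ['a', 'b', 'c', 'd', 'e'],
    'Class': [0, 1, 1, 2, 1],
})

# groupby().count() reports the non-null entries per remaining column.
per_class_table = toy_labels.groupby('Class').count()

# value_counts() returns the class sizes as a single Series.
per_class_sizes = toy_labels['Class'].value_counts().sort_index()

print(per_class_table)
print(per_class_sizes)
```

Both calls give the same class sizes here; `value_counts()` is simply more compact when only the sizes are needed.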

---
# 4) Discovery
This section seeks to reveal details such as the number of training examples and features.

* Exploratory Data Analysis (EDA) is performed in a subsequent notebook

## Display Values

In [None]:
# Save numbers of samples and features
m, nx = train.shape
m_test, nx_test = test.shape

print(f'\nTrain Examples: {m}    Features: {nx}')
print(f'\nTest Examples: {m_test}    Features: {nx_test}')

num_classes = train.label.nunique()
num_missing = train.isna().sum().sum()

num_classes_test = test.label.nunique()
num_missing_test = test.isna().sum().sum()

print(f'\nTrain Classes: {num_classes}    Missing: {num_missing}')
print(f'\nTest Classes: {num_classes_test}    Missing: {num_missing_test}')

total_examples = m + m_test

print(f'\nTotal Examples: {total_examples}')
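
The counting calls in the cell above can be tried on a small, made-up DataFrame. This is only a sketch: `toy` and its `pixel_0` and `pixel_1` columns are hypothetical, and only the `label` column name is taken from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing values (assumption: not real Virus-MNIST rows).
toy = pd.DataFrame({
    'pixel_0': [0, 128, np.nan, 255],
    'pixel_1': [12, np.nan, 40, 7],
    'label': [0, 1, 1, 2],
})

# shape returns (number of rows, number of columns).
rows, cols = toy.shape

# The first sum counts missing values per column; the second totals them.
missing_per_column = toy.isna().sum()
missing_total = missing_per_column.sum()

# nunique counts the distinct class labels.
classes = toy['label'].nunique()

print(rows, cols, missing_total, classes)
```

Note that the column count from `shape` includes every column in the frame, so non-pixel columns such as `label` are counted as "features" too.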

## View Images
One example from each class (the first file whose name ends in `1.jpg`) is loaded and displayed.

In [None]:
# Load sample images as arrays: the first file ending in '1.jpg' per class
images = []

plt.figure(figsize = (10, 20), tight_layout = True)
for i in range(num_classes):
    images.append(ImageCollection('../input/virusmnist/train/' +
                                 str(i) + '/*1.jpg'))
    # Place each class example in the next cell of a 5-column grid
    plt.subplot(num_classes, 5, i + 1)
    _ = io.imshow(images[i][0])
    plt.axis('off')
    plt.title(i)
plt.suptitle('Virus Image Examples\n')
plt.show()

### Remarks
The images are grayscale thumbnails of shape (32, 32, 1) that resemble random noise, like static on a television. Some features are evident in certain pictures, such as darker or lighter regions and stripes of solid white or black.


## Information

In [None]:
# Get basic details
train.info()

In [None]:
test.info()

## Basic Descriptive Statistics
* Mean, Median, Standard Deviation

In [None]:
# Calculate per-feature count, mean, std, min, quartiles, and max
description = train.describe()
description

In [None]:
description_test = test.describe()
description_test
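
The summary table does not have a row named "median"; in pandas, `describe()` reports it as the `50%` row. A minimal sketch on made-up values (the `toy` frame and its column names are hypothetical) shows where each statistic appears:

```python
import pandas as pd

# Made-up pixel columns (assumption: not real Virus-MNIST values).
toy = pd.DataFrame({
    'pixel_0': [0, 10, 20, 30, 40],
    'pixel_1': [5, 5, 5, 5, 255],
})

summary = toy.describe()

# Mean and standard deviation have their own rows.
means = summary.loc['mean']
stds = summary.loc['std']

# The median is the 50th percentile row.
median_from_describe = summary.loc['50%']
median_direct = toy.median()

print(summary)
```

In `pixel_1` a single bright value pulls the mean far above the median, which is one reason to look at both.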

---
# 5) Conclusion
We find $51880$ virus (malware) image examples, each with $1026$ features. Most of these features are pixel values "flattened" from a 2D array into a 1D row. The pixel columns number $1024 = 32^2$, a perfect square whose root recovers the picture's original shape of (32, 32, 1); the remaining columns hold non-pixel fields such as the label.
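
The perfect-square relationship can be checked with a short sketch. It assumes a row of 1024 pixel values; `flat_row` is a synthetic stand-in rather than an actual row from *train.csv*.

```python
import math
import numpy as np

# Assumption: each image contributes 1024 pixel values to a row.
num_pixels = 1024

# The integer square root gives the side length of the square image.
side = math.isqrt(num_pixels)
is_perfect_square = side * side == num_pixels

# Synthetic flattened row with values in the 0-255 range.
flat_row = (np.arange(num_pixels) % 256).astype(np.uint8)

# Restore the (height, width, channels) layout of a grayscale image.
image = flat_row.reshape(side, side, 1)

print(side, is_perfect_square, image.shape)
```

Any non-pixel columns, such as the label, would need to be dropped from a real row before reshaping.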




## Next Steps
* Exploratory Data Analysis
* Model Comparisons
---
---