---
---
# 1) Overhead-MNIST Initial Data Exploration
* Purpose:
> This notebook is intended to serve as a data familiarization starting point. It contains general statistics and insights regarding the overall population that can be used to inform subsequent model construction and optimization. There are no machine learning algorithms utilized; hence, an accelerator is not required.

---
# 2) Installs & Imports

In [None]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline

from skimage import io

# Set global plot values for uniformity
rcParams['figure.facecolor'] = 'lightgray'
rcParams['figure.figsize'] = (13, 5)

# Useful functions
def sp(n):
    # Returns a string of n blank spaces
    return ' ' * n

---
# 3) Load & Inspect Files

In [None]:
# Store useful paths
path = '../input/overheadmnist/overhead/'
path_tr = path + 'training'
path_ts = path + 'testing'

# Save files as dataframes
train = pd.read_csv(path + 'train.csv')
test = pd.read_csv(path + 'test.csv')
labels = pd.read_csv(path + 'labels.csv')
classes = pd.read_csv(path + 'classes.csv')

classes2 = classes.drop(['class', 'label'], axis = 1)
classes2.index = classes['class']

# Store names and frames
df_lst = ['train', 'test', 'classes', 'classes2', 'labels']
df_lst2 = [train, test, classes, classes2, labels]

# View dataframes
print(sp(25) + 'DataFrames\n')
for i, df in enumerate(df_lst2):
    print('{}:\n{}\n\n'
          .format(df_lst[i], df.head().iloc[:, :8]), 
          end = '\n')

---
# 4) Data Exploration

## Basic Information & Descriptions

In [None]:
# View info
print(sp(25) + 'DataFrame Info\n')
for i, df in enumerate(df_lst2):
    print(df_lst[i] + ':')
    # df.info() prints its report directly and returns None,
    # so call it on its own rather than formatting its return value
    df.info()
    print('\n')

In [None]:
# View statistics
print(sp(25) + 'Descriptive Statistics\n')
for i, df in enumerate(df_lst2):
    print('{}:\n{}\n\n'
          .format(df_lst[i], df.describe().iloc[:, :5]), end = '\n')

In [None]:
# Save useful values
tot_tr = len(train)
tot_ts = len(test)
tot_pics = tot_tr + tot_ts
num_classes = len(classes)
clss_lst = classes['class'].values

ts_tr_ratio = round(tot_ts / tot_tr, 3)

# Display pertinent data
print('Total pics: {}    Classes: {}    Training: {}    Test: {}'
     .format(tot_pics, num_classes, tot_tr, tot_ts))
print('\nTest/Training ratio: {}\n\nClasses: \n{}'
     .format(ts_tr_ratio, clss_lst))

### Remarks
There are a total of 9584 images in the entire set. The train and test splits contain 8519 and 1065 labeled examples, respectively. Each row in the dataframe represents a labeled, 28x28, grayscale satellite image with no missing values. In the interim, label fidelity is assumed to be above 95%.

There are 10 classes:\
car, harbor, helicopter, oil_gas_field, parking_lot, plane, runway_mark, ship, stadium, and storage_tank.
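
The two structural claims above (no missing values, 28x28 pictures) are easy to check on the flattened rows. The following is a minimal sketch on a small synthetic frame, not on the real files: the column layout (one `label` column followed by 784 pixel columns) and the helper name `row_to_image` are assumptions made for illustration and may not match `train.csv` exactly.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv. ASSUMED layout: one 'label' column
# followed by 784 pixel columns (28 * 28) holding values in 0..255.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(5, 784))
demo_df = pd.DataFrame(pixels,
                       columns=['pixel_{}'.format(i) for i in range(784)])
demo_df.insert(0, 'label', [0, 1, 2, 3, 4])

def row_to_image(row, label_col='label', side=28):
    # Hypothetical helper: drop the label, reshape the flat row to side x side
    flat = row.drop(label_col).to_numpy(dtype=np.uint8)
    return flat.reshape(side, side)

# Count missing values across the whole frame
n_missing = int(demo_df.isna().sum().sum())
img = row_to_image(demo_df.iloc[0])

print('Missing values: {}    Image shape: {}'.format(n_missing, img.shape))
```

Swapping `demo_df` for the loaded `train` frame should work the same way, provided the label column name is adjusted to whatever the file actually uses.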

## Distributions

In [None]:
# Bar chart
# (pandas creates its own figure, so no separate plt.figure() call is needed)
classes2.plot.bar(figsize = (12, 5), ec = 'k', stacked = True, 
                  rot = 90, title = 'Population Comparisons')
plt.xlabel(None)
plt.legend(loc = 'lower right')
plt.tight_layout()
plt.show()

# Pie chart
classes2.plot.pie(subplots = True, figsize = (12, 5), legend = False,  
                  title = 'Population Distributions')
plt.tight_layout()
plt.show()


### Remarks
The training and test sets share the same class distribution, and both contain markedly fewer helicopter examples than any other class. The runway_mark and stadium categories are also slightly underrepresented relative to the average. While notable, these differences are unlikely to affect initial training or evaluation.

## Class Size Comparisons

In [None]:
# Save and display pertinent variables
desc = classes.describe()

tr_mean = int(round(desc.loc['mean', 'train_count']))
ts_mean = int(round(desc.loc['mean', 'test_count']))

tr_least = classes['train_count'].min()
tr_most = classes['train_count'].max()

ts_least = classes['test_count'].min()
ts_most = classes['test_count'].max()

print('Average samples per class - \n{}Train: {}\n{}Test: {}'
     .format(sp(30), tr_mean, sp(30), ts_mean))

print('\nMost represented classes - \n{}\n\n{}\n{}'
      .format(sp(25) + 'harbor, plane, ship', 
              sp(30) + 'Train: ' + str(tr_most),
              sp(30) + 'Test: ' + str(ts_most)))

print('\nLeast represented class - \n{}\n\n{}\n{}'
      .format(sp(25) + 'helicopter', 
              sp(30) + 'Train: ' + str(tr_least),
              sp(30) + 'Test: ' + str(ts_least)))

In [None]:
print('Ratios of pictures in each class compared to the total,' + 
      '\nas well as the least and most populous classes:')

for i in range(num_classes):
    clss = clss_lst[i]
    tr, ts = classes.iloc[i, -2:].values
    tr_frac = round(tr / tot_tr, 3)
    ts_frac = round(ts / tot_ts, 3)
    tr_sm = round(tr / tr_least, 3)
    tr_lg = round(tr / tr_most, 3)
    ts_sm = round(ts / ts_least, 3)
    ts_lg = round(ts / ts_most, 3)
    print('\n{}{} -\n{}ratio to tot -    Train: {}    Test: {}'
          .format(sp(3), clss, sp(10), tr_frac, ts_frac) + 
          '\n{}ratio to min -    Train: {}    Test: {}'
          .format(sp(10), tr_sm, ts_sm) + 
          '\n{}ratio to max -    Train: {}    Test: {}'
          .format(sp(10), tr_lg, ts_lg))

---
# 5) View Sample Images
* Collected and displayed with scikit-image.

In [None]:
# Create multi-pic display
print(sp(40) + 'Training Examples')

# Subplot placement indexer
idx = 1

plt.figure(figsize = (12, 12), tight_layout = True)
for clss in clss_lst:
    ic = io.ImageCollection(path_tr + '/' + clss + '/*.jpg')
    for i in range(6):
        plt.subplot(num_classes, 6, idx)
        _ = io.imshow(ic[i])
        plt.axis('off')
        plt.title(clss)
        idx += 1
plt.show()

In [None]:
# Single pic examples
idx = 1

plt.figure(figsize = (15, 20), tight_layout = False)
for clss in clss_lst:
    ic = io.ImageCollection(path_tr + '/' + clss + '/*.jpg')
    for i in range(1):
        plt.subplot(num_classes, 5, idx)
        _ = io.imshow(ic[i])
        plt.axis('off')
        plt.title(clss)
        idx += 1
plt.show()

## Remarks
Each picture shows its object in a form ranging from mere edges to one or more full instances of the indicated category. The sample images are accurately labeled, which bodes well for testing very basic models and lends confidence to the assumption that the labels are correct.

*Note:*\
The first plot omits the title and picture at the same grid position regardless of the input. Manual inspection of the storage_tank pictures shows that the missing image displays without issue on its own, so the omission is likely a rendering issue in this notebook. If the problem persists, it may indicate a bug.
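
One way to narrow down the missing panel is to draw every picture on an explicitly addressed Axes instead of relying on the implicit current-axes state. The sketch below is only an illustration under stated assumptions: `show_grid` is a hypothetical helper, it uses Matplotlib's own `imshow` rather than scikit-image's, and random arrays stand in for the real pictures. It is not known whether this avoids the issue described above.

```python
import numpy as np
import matplotlib.pyplot as plt

def show_grid(images_by_class, n_cols=6):
    # Hypothetical helper: one row per class, n_cols pictures per row,
    # each drawn on its own explicitly addressed Axes.
    n_rows = len(images_by_class)
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, 2 * n_rows),
                             squeeze=False)
    for r, (clss, imgs) in enumerate(images_by_class.items()):
        for c in range(n_cols):
            ax = axes[r, c]
            ax.imshow(imgs[c], cmap='gray', vmin=0, vmax=255)
            ax.set_title(clss)
            ax.axis('off')
    fig.tight_layout()
    return fig

# Synthetic stand-ins for the real pictures (random 28x28 grayscale arrays)
rng = np.random.default_rng(0)
demo_images = {name: rng.integers(0, 256, size=(6, 28, 28), dtype=np.uint8)
               for name in ['car', 'harbor', 'helicopter']}
fig = show_grid(demo_images)
```

If a panel still goes missing with this layout, the cause probably lies outside the subplot indexing.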

---
# 6) Conclusion
The Overhead-MNIST dataset is composed of 9584 grayscale image arrays of shape (1, 28, 28), four .csv files containing either flattened picture arrays or label mapping/summary data, and ubyte files. There are 784 pixels per picture, and the raw arrays are not normalized. The ratio of test to train data is 0.125, and both splits share matching class distributions.

The average train class size is 852, while the average test class size is 107. Helicopter is the least represented class, with only 655 training and 82 test entries.
This accounts for approximately 7.7% of each split, while car, harbor, oil_gas_field, parking_lot, plane, ship, and storage_tank each make up around 10.4% on average. The remaining two classes, runway_mark and stadium, make up 9.4% and 9.9% of the training data, respectively.
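
The figures in this summary can be reproduced from the counts quoted in the text. The short check below uses only those quoted numbers; the variable names are illustrative and deliberately differ from the ones used earlier in the notebook.

```python
# Counts quoted in the text above
n_train, n_test, n_classes = 8519, 1065, 10
heli_train, heli_test = 655, 82

# Test/train ratio and mean class sizes
ratio = round(n_test / n_train, 3)
mean_train = n_train / n_classes
mean_test = n_test / n_classes

# Helicopter share of each split, in percent
heli_train_pct = round(100 * heli_train / n_train, 1)
heli_test_pct = round(100 * heli_test / n_test, 1)

# Average training share of the seven larger classes, after removing
# helicopter and the quoted runway_mark (9.4%) and stadium (9.9%) shares
rest_pct = round((100 - heli_train_pct - 9.4 - 9.9) / 7, 1)

print(ratio, mean_train, mean_test, heli_train_pct, heli_test_pct, rest_pct)
```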

*Model Creation, Training, & Validation:*\
During baseline establishment, avoid hyperparameter adjustments. Preprocessing of the raw data should be strictly limited to array normalization. The presence of dissimilar class sizes necessitates stratification when further splitting the training set into train and validation sets. The given test set will be held out, unseen, and used only for final model scoring.
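
A stratified split with normalization could look roughly like the sketch below. It uses only NumPy and synthetic data; `stratified_split` and `normalize` are hypothetical helpers, and the 0..255 pixel range and the 20% validation fraction are assumptions. scikit-learn's `train_test_split` with its `stratify` argument is a common ready-made alternative.

```python
import numpy as np

def stratified_split(y, val_frac=0.2, seed=0):
    # Hypothetical helper: returns (train_idx, val_idx) so that each class
    # contributes roughly val_frac of its samples to the validation set.
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_idx, val_idx = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_val = int(round(val_frac * len(idx)))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.sort(train_idx), np.sort(val_idx)

def normalize(x):
    # Scale raw pixel values (ASSUMED to lie in 0..255) to the 0..1 range
    return np.asarray(x, dtype=np.float32) / 255.0

# Synthetic stand-in: imbalanced labels and random flattened 28x28 pictures
rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2], [100, 60, 40])
X = rng.integers(0, 256, size=(len(y), 784))

tr_idx, va_idx = stratified_split(y, val_frac=0.2)
X_tr, X_va = normalize(X[tr_idx]), normalize(X[va_idx])

# Per-class counts in each part keep the original proportions
print(np.bincount(y[tr_idx]), np.bincount(y[va_idx]))
```

Splitting per class rather than over the whole array keeps small classes such as helicopter from being under-sampled in the validation set by chance.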

*Image Processing:*\
Detailed image processing is presently a low priority; it will become a focus only during performance optimization. In future explorations, various techniques will be employed to augment the images. Model performance will then be compared across configurations until optimum hyperparameter settings are found. Normalization will take place during filtering and model ingestion preparation. Multiple filters can be applied for enhanced edge detection, contrast, and more. Model fine-tuning could include determining a combination of image size, filter applications, etc., that satisfies storage space and run time requirements at a cost that is not yet known.
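
As a rough idea of what such filters might look like, the sketch below applies a min-max contrast stretch and a gradient-magnitude edge map to a synthetic 28x28 picture. Both helpers are hypothetical and written with plain NumPy for illustration; they are not the filters this project will necessarily use. The `skimage.filters` module offers ready-made alternatives such as `sobel`.

```python
import numpy as np

def stretch_contrast(img):
    # Min-max stretch to the 0..1 range; a constant image maps to all zeros
    img = np.asarray(img, dtype=np.float32)
    span = img.max() - img.min()
    if span == 0:
        return np.zeros_like(img)
    return (img - img.min()) / span

def edge_magnitude(img):
    # Gradient magnitude computed from finite differences along rows and columns
    gy, gx = np.gradient(np.asarray(img, dtype=np.float32))
    return np.hypot(gx, gy)

# Synthetic 28x28 picture: dark left half, brighter right half
demo_img = np.zeros((28, 28), dtype=np.uint8)
demo_img[:, 14:] = 200

stretched = stretch_contrast(demo_img)
edges = edge_magnitude(stretched)

# The edge response is non-zero only around the vertical boundary
print(stretched.min(), stretched.max(), np.flatnonzero(edges[0] > 0))
```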


## Next Steps
* Model comparison with Pycaret
* Exploratory Data Analysis (EDA)
---
---