## Organizing Labels with pandas

This notebook shows how the labels for the Stage 1 training scans can be organized in a pandas dataframe. By separating the scan IDs from the body zones, we can easily answer questions such as how many threats are in each zone or how many threats each scan contains.

First, let's import some dependencies.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
%matplotlib inline
plt.style.use('ggplot')
pd.options.display.max_rows = 20


Read in the data and create a dataframe that has the body zones for column names and the scan IDs for indices. Note that the IDs in the csv file are in alphanumeric order, so the zones appear in the order Zone1, Zone10, Zone11, etc.


In [3]:
# Read in labels
base_dir = os.path.join('..', 'input')
unsorted_df = pd.read_csv(os.path.join(base_dir, 'stage1_labels.csv'))

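# Get IDs for rows: each scan has 17 consecutive rows (one per zone),
# so every 17th row starts a new scan
s = list(range(0, len(unsorted_df), 17))
obs = unsorted_df.loc[s, 'Id'].str.split('_')
scanID = [x[0] for x in obs]

# Put zones in columns, sorted as strings to match the csv order
columns = sorted(['Zone' + str(i) for i in range(1, 18)])

df = pd.DataFrame(index=scanID, columns=columns)

# Sort labels by zone: rows i, i+17, i+34, ... all belong to the i-th zone
for i in range(17):
    s = list(range(i, len(unsorted_df), 17))
    df.iloc[:, i] = unsorted_df.iloc[s, 1].values

# An empty DataFrame starts out with object dtype; cast to int so that
# sum() and corr() operate on numbers
df = df.astype(int)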

print('Number of labeled scans:', len(df))
df.head()
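
The positional slicing above relies on every scan having exactly 17 consecutive rows in a fixed order. As a rough alternative that does not depend on row order, the `Id` column can be split into scan and zone and then pivoted. The sketch below uses a made-up toy table rather than the real csv, and the `Probability` column name is an assumption about the file layout.

```python
import pandas as pd

# Toy stand-in for stage1_labels.csv (made-up IDs and labels; the real
# file has 17 zones per scan). The 'Probability' column name is an assumption.
toy = pd.DataFrame({
    'Id': ['a1_Zone1', 'a1_Zone10', 'a1_Zone2',
           'b2_Zone1', 'b2_Zone10', 'b2_Zone2'],
    'Probability': [0, 1, 0, 1, 0, 0],
})

# Split "<scanID>_<zone>" into two columns
parts = toy['Id'].str.split('_', expand=True)
toy['scan'] = parts[0]
toy['zone'] = parts[1]

# One row per scan, one column per zone; independent of row order
wide = toy.pivot(index='scan', columns='zone', values='Probability')

# Order columns numerically (Zone1, Zone2, Zone10) rather than as strings
zone_order = sorted(wide.columns, key=lambda z: int(z[len('Zone'):]))
wide = wide[zone_order]
print(wide)
```

Sorting the zone columns numerically is optional, but it tends to make the bar plots easier to read than the string order Zone1, Zone10, Zone11.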


It's now easier to look at what data is available to us for training.  For example, how many threats are in each zone:


In [None]:
nobj_zone = df.sum()
print(nobj_zone)

nobj_zone.plot(kind='bar', width=.75, title='Threat Count in Each Zone')
plt.ylabel('Number of Threats')
plt.xlabel('Zone')
plt.show()


We can also see how many scans contain a given number of threats.


In [None]:
# Total threats per scan (row sums), then how many scans share each total
nobj_scan = df.sum(axis=1).value_counts().sort_values()
print(nobj_scan)

nobj_scan.plot(kind='bar', width=.75, title='Frequency of Threat Counts')
plt.ylabel('Number of Scans')
plt.xlabel('Threat Count')
plt.show()


This dataframe also makes it easier to select scans with threats in particular zones...


In [None]:
df[df['Zone1']==1]
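
The same idea extends to several zones at once by combining boolean masks. The following is a minimal sketch on a made-up table with the same shape as `df` (scan IDs as the index, zones as columns); the IDs and labels are invented for illustration.

```python
import pandas as pd

# Made-up label table shaped like df: scan IDs as index, zones as columns
toy = pd.DataFrame(
    {'Zone1': [1, 0, 1, 0], 'Zone2': [0, 0, 1, 0], 'Zone3': [0, 1, 0, 0]},
    index=['a1', 'b2', 'c3', 'd4'],
)

# Scans with a threat in Zone1 AND Zone2
both = toy[(toy['Zone1'] == 1) & (toy['Zone2'] == 1)]

# Scans with a threat in at least one of the listed zones
any_of = toy[toy[['Zone1', 'Zone3']].any(axis=1)]

# Scans with no threats at all
clean = toy[toy.sum(axis=1) == 0]

print(list(both.index), list(any_of.index), list(clean.index))
```

Each comparison needs its own parentheses, because `&` binds more tightly than `==` in Python.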


...or to search for any correlations...


In [None]:
df.corr()



... although correlations probably shouldn't be taken too seriously unless there's a reason to think they are real. The point here is just to demonstrate how organizing the labels in a dataframe like this makes them more useful.
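
One way to sanity-check a correlation is to look at the raw co-occurrence counts behind it. As a hedged sketch on a made-up table (not the real labels), multiplying the transposed 0/1 label table by itself gives the number of scans in which each pair of zones both contain a threat.

```python
import pandas as pd

# Made-up label table shaped like df: scan IDs as index, zones as columns
toy = pd.DataFrame(
    {'Zone1': [1, 0, 1, 0], 'Zone2': [1, 0, 1, 0], 'Zone3': [0, 1, 0, 0]},
    index=['a1', 'b2', 'c3', 'd4'],
)

# Entry (i, j) counts the scans with a threat in both zone i and zone j;
# the diagonal is the per-zone threat count
cooc = toy.T.dot(toy)
print(cooc)
```

In this toy table Zone1 and Zone2 are perfectly correlated, but the count shows that this rests on only two scans, which is the kind of context a correlation coefficient alone hides.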
