This notebook checks images by their labels.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import itertools
import os
import json
%matplotlib inline 

In [2]:
# Utility function to read a pickled dataset
def read_dataset(filename):
    return pd.read_pickle(filename)

We start by creating the label-to-image mapping dataframe.
The method below takes the data JSON and creates a flat structure of `[imageId, labelId]`.
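The flattening step can be sketched as follows. This is only a minimal illustration, not necessarily the code that produced the pickles: the function name `flatten_annotations` and the JSON layout (an `annotations` list whose entries hold an `imageId` and a list of `labelId` values) are assumptions.

```python
import pandas as pd

def flatten_annotations(data):
    """Return one row per (imageId, labelId) pair.

    Assumed input layout (hypothetical):
    {'annotations': [{'imageId': '1', 'labelId': ['62', '17']}, ...]}
    """
    rows = [
        {'imageId': int(annotation['imageId']), 'labelId': int(label)}
        for annotation in data['annotations']
        for label in annotation['labelId']
    ]
    return pd.DataFrame(rows, columns=['imageId', 'labelId'])

# Tiny made-up sample, for illustration only
sample = {
    'annotations': [
        {'imageId': '1', 'labelId': ['62', '17']},
        {'imageId': '2', 'labelId': ['17']},
    ]
}
flat = flatten_annotations(sample)
print(flat)
```

Each image contributes as many rows as it has labels, which is what makes the later `groupby` counts straightforward.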

In [3]:
TEST_PICKLE = '../data/pickles/test.pickle'
TRAIN_PICKLE = '../data/pickles/train.pickle'
VALIDATION_PICKLE = '../data/pickles/validation.pickle'
TRAIN_LABEL_PICKLE = '../data/pickles/train_label.pickle'
VALIDATION_LABEL_PICKLE = '../data/pickles/validation_label.pickle'
PATH_TO_IMAGES = 'G:\\Data\\data\\train\\'

In [13]:
dataset = read_dataset(VALIDATION_LABEL_PICKLE)
dataset = pd.DataFrame(dataset, dtype='int32')

Now that we have the required structure, we start with some basic analysis:
1. The number of distinct labels in the dataset.
2. The maximum labelId value (this will be used later).
3. The number of distinct images in the dataset.

In [14]:
number_of_labels = dataset['labelId'].nunique()  # Number of distinct labels
maximum_label_id = dataset['labelId'].max()  # The maximum labelId value
number_of_images = dataset['imageId'].nunique()  # Number of distinct images
print('Number of distinct labels in the dataset : ', number_of_labels)
print('Maximum label id in the dataset : ', maximum_label_id)
print('Number of distinct images in the dataset : ', number_of_images)

Number of distinct labels in the dataset :  225
Maximum label id in the dataset :  228
Number of distinct images in the dataset :  9897
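
The distinct-label count (225) is lower than the maximum label id (228), so some ids never occur in this split. A minimal sketch of how the absent ids could be listed is shown below on a toy dataframe; it assumes label ids are numbered from 1 up to the maximum, which the notebook does not state.

```python
import pandas as pd

# Toy stand-in for the flattened dataset; the values are illustrative only.
toy = pd.DataFrame({'imageId': [1, 1, 2, 3], 'labelId': [1, 3, 3, 6]})

max_id = int(toy['labelId'].max())
present = set(toy['labelId'].tolist())

# Assumption: valid label ids run from 1 to max_id inclusive.
missing_label_ids = sorted(set(range(1, max_id + 1)) - present)
print(missing_label_ids)
```

Running the same lines against `dataset` instead of `toy` would list the ids missing from the validation labels, under the same assumption.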


One of the simplest analyses of the dataset is counting. Let us start by counting the rows per label and per image. This is easily done with the `DataFrame.groupby` method that pandas provides out of the box.

In [15]:
# Count rows per image and per label
count_by_image_id = dataset.groupby('imageId')['imageId'].count().reset_index(name="count")
count_by_label_id = dataset.groupby('labelId')['labelId'].count().reset_index(name="count")
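
As a cross-check, the same per-label counts can be obtained with `Series.value_counts`. The sketch below uses a toy dataframe with made-up values and compares the two approaches; it is an alternative illustration, not a replacement for the cell above.

```python
import pandas as pd

# Toy stand-in for the flattened dataset; the values are illustrative only.
toy = pd.DataFrame({'imageId': [1, 1, 2, 3, 3], 'labelId': [5, 7, 5, 5, 9]})

# groupby-based count, mirroring the approach used in this notebook
by_groupby = toy.groupby('labelId')['labelId'].count().reset_index(name='count')

# value_counts-based count; sorted by labelId so both frames line up
by_value_counts = (
    toy['labelId']
    .value_counts()
    .rename_axis('labelId')
    .reset_index(name='count')
    .sort_values('labelId')
    .reset_index(drop=True)
)

print(by_groupby)
print(by_value_counts)
```

Note that `value_counts` sorts by descending count by default, while `groupby` sorts by the group key, hence the explicit `sort_values` before comparing.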

In [16]:
# Labels with the fewest images
print('Labels associated with the smallest number of images ')
count_by_label_id.nsmallest(12, 'count')

Labels associated with the smallest number of images 


     labelId  count
84        86      1
155      157      1
159      163      1
81        83      2
102      104      2
105      107      2
127      129      2
143      145      2
144      146      2
219      223      2


In [17]:
# Percentage of all distinct images that carry each label
count_by_label_id['percentage'] = (count_by_label_id['count'] / number_of_images) * 100

In [20]:
# Rarest labels among those that appear on more than 10 images
count_by_label_id[count_by_label_id['count'] > 10].nsmallest(12, 'count')

     labelId  count  percentage
48        50     11    0.111145
198      202     11    0.111145
21        22     12    0.121249
38        39     12    0.121249
169      173     13    0.131353
193      197     16    0.161665
78        80     17    0.171769
94        96     17    0.171769
110      112     19    0.191977
132      134     19    0.191977
