# Datacheck
authors: Tuguldur Sukhbold

Here we address the open questions from our preliminary analysis:

### (1) Are all the test queries in the Planet set or do we need to be able to process both Sentinel and Planet separately?

In [2]:
# path to the directory where you downloaded all data
datapath = '../d/'

import pandas as pd

In [4]:
fname = 'test.csv'
test = pd.read_csv(f'{datapath}{fname}')
test

Unnamed: 0,ID,Year,PlotSize_acres,Yield
0,id_e7032b10,2016,0.50,0.000000
1,id_ae7cb51e,2018,1.00,
2,id_e59f7730,2018,1.00,
3,id_b9011c86,2018,1.50,
4,id_caaeb9f8,2018,0.25,
...,...,...,...,...
1608,id_36ec7a7b,2018,0.25,
1609,id_70f3d8f7,2018,1.00,
1610,id_2b7f2b96,2017,0.90,0.591875
1611,id_498f3db0,2018,1.75,


So we have 1613 maize fields to test our model. Let's first check whether each of these fields appears in all, or only some, of the Planet data:

In [31]:
import os

# 4 sets of Planet data covering 2 years and 4 epochs
setnames = ['jun17', 'dec17', 'jun18', 'dec18']

# dictionary containing all fields for each set
planet = {}
for name in setnames:
    planet[name] = os.listdir(f'{datapath}planet-{name}')

# all images in planet-jun17 set:
planet['jun17']

['4f92a25f.png',
 '991b244e.png',
 '311f8c97.png',
 'e820d4c4.png',
 '3901ebc2.png',
 '6594c18d.png',
 '7a57357a.png',
 '13392dc7.png',
 'eb1814b6.png',
 'c33950b2.png',
 '457b8682.png',
 'a6748baf.png',
 '6b4e07b9.png',
 'd82580f8.png',
 '9fb700a2.png',
 'c6d1d84f.png',
 'e4998c42.png',
 'ba8d34c7.png',
 'ad49bd77.png',
 'f316f370.png',
 '59ee12ea.png',
 'e937c1cf.png',
 '92bf7bf6.png',
 '8d9e2c2a.png',
 '40b3e94b.png',
 '4c265ff6.png',
 'f1207e3a.png',
 'fa0e790a.png',
 'fb26360c.png',
 '999f040a.png',
 '95ce8b7c.png',
 '15e61724.png',
 '63af4d8e.png',
 'f8b78441.png',
 '17aa80f9.png',
 '8bbbc7ff.png',
 '07bb7b7d.png',
 '45482134.png',
 '4f0834d1.png',
 '5b423a0b.png',
 '5048467e.png',
 '30359ad7.png',
 'ae7cb51e.png',
 '2905837a.png',
 'a280bd3b.png',
 'ee1a3eb4.png',
 '1b42b574.png',
 '84d2bae1.png',
 '2248aa77.png',
 '74cba48e.png',
 '8b437030.png',
 '98baaf36.png',
 'df0aa6b1.png',
 'e9a69667.png',
 '8e866297.png',
 'd87c2ca2.png',
 'caaeb9f8.png',
 'f945f95f.png',
 '9847f033.png

Each set contains the same number of images, 3516:

In [8]:
for name in setnames: print(f'{name} = {len(planet[name])} images')

jun17 = 3516 images
dec17 = 3516 images
jun18 = 3516 images
dec18 = 3516 images
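Note that `os.listdir` returns every entry in a directory, so the counts above would also include any stray non-image file. Whether such files exist in this data is not known; the sketch below only illustrates how the listing could be restricted to `.png` files and reduced to bare field IDs. The helper `list_field_ids` and the demo file names are hypothetical and not part of the original notebook.

```python
import os
import tempfile


def list_field_ids(directory, suffix='.png'):
    """Return the set of field IDs (file names without suffix) in `directory`.

    Entries that do not end with `suffix` are ignored.
    (Hypothetical helper, not part of the original notebook.)
    """
    return {
        entry[:-len(suffix)]
        for entry in os.listdir(directory)
        if entry.endswith(suffix)
    }


# Self-contained demo on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for entry in ['aaaa1111.png', 'bbbb2222.png', 'notes.txt']:
        open(os.path.join(tmp, entry), 'w').close()
    demo_ids = list_field_ids(tmp)

print(sorted(demo_ids))   # ['aaaa1111', 'bbbb2222']
```

On the real data this could be called as `list_field_ids(f'{datapath}planet-jun17')`, assuming the directory layout used in the cells above.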


Not only do these sets have the same number of images, they also appear to cover the same 3516 fields:

In [32]:
print(set(planet[setnames[0]]) ^ set(planet[setnames[1]]) ^ set(planet[setnames[2]]) ^ set(planet[setnames[3]]))
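# this prints an empty set if all four listings are the same; note that chained ^
# can also cancel out when the sets differ in pairs, so an empty result alone is
# a necessary but not a sufficient condition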

set()
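Chained symmetric differences can cancel out: if two sets share one listing and the other two share a different one, `^` across all four still yields an empty set. A stricter check, sketched below on toy data rather than the real listings, compares the union with the intersection; that difference is empty only when every set is identical. The helper name `differing_entries` is hypothetical.

```python
from functools import reduce


def differing_entries(listings):
    """Return entries that are missing from at least one listing.

    `listings` maps a set name to an iterable of file names. The result is
    empty only if every listing contains exactly the same entries.
    (Hypothetical helper, not part of the original notebook.)
    """
    sets = [set(names) for names in listings.values()]
    if not sets:
        return set()
    return set.union(*sets) - set.intersection(*sets)


# Toy data: the sets agree in pairs but not overall.
toy = {
    'jun17': ['a.png', 'b.png'],
    'dec17': ['a.png', 'b.png'],
    'jun18': ['a.png', 'c.png'],
    'dec18': ['a.png', 'c.png'],
}

# Chained XOR cancels out and misses the mismatch ...
chained_xor = reduce(lambda x, y: x ^ y, (set(v) for v in toy.values()))
print(chained_xor)                      # set()

# ... while union minus intersection reports it.
print(sorted(differing_entries(toy)))   # ['b.png', 'c.png']
```

With the notebook's dictionary this would be called as `differing_entries(planet)`; an empty result would confirm that all four epochs list exactly the same files.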


This looks good: the four Planet sets are consistent with each other. That alone does not tell us whether the test fields are among them, so let's check that every test field has a Planet image:

In [33]:
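# since each set has the same set of field images we can just check one of them
available = set(planet[setnames[0]])        # set lookups are faster than list lookups
for field in test.ID:
    fieldName      = field.split('_')[-1]   # e.g. 'id_e7032b10' -> 'e7032b10'
    fieldImageName = f'{fieldName}.png'

    if fieldImageName not in available:
        print(f' we have a problem: {fieldImageName} ')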


Nothing is printed, which confirms that every field in the test set has a corresponding image in the Planet data.
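The loop above checks only the first set and relies on all four sets being identical. As a minimal sketch, the same coverage check can be run against every set directly and return the missing fields instead of printing a message. The helper `missing_images` and the toy inputs are hypothetical; the sketch assumes IDs of the form `id_<hash>` and images named `<hash>.png`, as in the cells above.

```python
def missing_images(test_ids, listings, suffix='.png'):
    """Map each set name to the test IDs that have no image in that set.

    Assumes IDs look like 'id_<hash>' and images are named '<hash>.png'.
    (Hypothetical helper, not part of the original notebook.)
    """
    missing = {}
    for name, files in listings.items():
        available = set(files)   # build the set once per listing
        missing[name] = [
            field for field in test_ids
            if f"{field.split('_')[-1]}{suffix}" not in available
        ]
    return missing


# Toy inputs with made-up IDs; the second set lacks one image.
toy_ids = ['id_aaaa1111', 'id_bbbb2222']
toy_listings = {
    'jun17': ['aaaa1111.png', 'bbbb2222.png'],
    'dec17': ['aaaa1111.png'],
}

report = missing_images(toy_ids, toy_listings)
print(report)   # {'jun17': [], 'dec17': ['id_bbbb2222']}
```

On the notebook's data the call would be `missing_images(test.ID, planet)`; full coverage corresponds to an empty list for every set name.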