<img src='https://radiant-assets.s3-us-west-2.amazonaws.com/PrimaryRadiantMLHubLogo.png' alt='Radiant MLHub Logo' width='300'/>

# A Baseline Model for the Radiant Earth Spot the Crop Challenge [Sentinel-2 version]

This notebook walks you through the steps to load the data and build a baseline Random Forest model on Sentinel-2 data for the `Radiant Earth Spot the Crop Challenge`.

## Radiant MLHub API


The Radiant MLHub API gives access to open Earth imagery training data for machine learning applications. You can learn more about the repository at the [Radiant MLHub site](https://mlhub.earth) and about the organization behind it at the [Radiant Earth Foundation site](https://radiant.earth).

Full documentation for the API is available at [docs.mlhub.earth](https://docs.mlhub.earth).

Each item in our collection is described in JSON format, compliant with the [STAC](https://stacspec.org/) [label extension](https://github.com/radiantearth/stac-spec/tree/master/extensions/label) definition.

## Dependencies

All the dependencies for this notebook are listed in the `requirements.txt` file in this folder.


**You must replace the `YOUR_API_KEY_HERE` text with your API key which you can obtain by creating a free account on the [MLHub Dashboard](https://dashboard.mlhub.earth/) within the `API Keys` tab at the top of the page.**

In [1]:
!pip install --quiet -r /content/Plato_Radiant_Data_Preprocessing_requirements.txt


In [2]:
import datetime
from datetime import timedelta
import rasterio
import numpy as np
import pandas as pd

In [None]:
competition_train_df = pd.read_csv('train_df.csv').sort_values(by = 'tile_id').reset_index(drop = True)
competition_test_df = pd.read_csv('test_df.csv').sort_values(by = 'tile_id').reset_index(drop = True)

In [None]:
competition_train_df.head()

Unnamed: 0,tile_id,datetime,satellite_platform,asset,file_path
0,1,2017-06-20,s2,B11,/content/radiant/ref_south_africa_crops_compet...
1,1,2017-05-14,s2,B05,/content/radiant/ref_south_africa_crops_compet...
2,1,2017-05-14,s2,B04,/content/radiant/ref_south_africa_crops_compet...
3,1,2017-05-11,s2,B06,/content/radiant/ref_south_africa_crops_compet...
4,1,2017-05-11,s2,B05,/content/radiant/ref_south_africa_crops_compet...


In [None]:
competition_test_df.head()

Unnamed: 0,tile_id,datetime,satellite_platform,asset,file_path
0,1,,,field_info_test,/content/radiant/ref_south_africa_crops_compet...
1,1,2017-11-12,s2,B01,/content/radiant/ref_south_africa_crops_compet...
2,1,2017-11-12,s2,B02,/content/radiant/ref_south_africa_crops_compet...
3,1,2017-11-12,s2,B03,/content/radiant/ref_south_africa_crops_compet...
4,1,2017-11-17,s2,B01,/content/radiant/ref_south_africa_crops_compet...


In [None]:
# This DataFrame lists all types of assets including documentation of the data. 
# In the following, we will use the Sentinel-2 bands as well as labels. 
competition_train_df['asset'].unique()

array(['B11', 'B05', 'B04', 'B06', 'raster_values', 'labels', 'B09',
       'B12', 'CLM', 'field_info_train', 'field_ids', 'documentation',
       'B07', 'B08', 'B8A', 'B02', 'B03', 'B01', 'VV', 'VH'], dtype=object)

### Date Analysis for S2

In [None]:
# Change to datetime object
competition_train_df.datetime = pd.to_datetime(competition_train_df.datetime)

# Check number of unique dates for each tile train
train_unique_dates, train_nunique_dates = [], []
for tile in competition_train_df.tile_id.unique():
  train_unique_dates.append(competition_train_df[(competition_train_df.tile_id == tile) & (competition_train_df.satellite_platform == 's2')].datetime.unique())
  train_nunique_dates.append(competition_train_df[(competition_train_df.tile_id == tile) & (competition_train_df.satellite_platform == 's2')].datetime.nunique())

# Check number of unique dates for each tile test
test_unique_dates, test_nunique_dates = [], []
for tile in competition_test_df.tile_id.unique():
  test_unique_dates.append(competition_test_df[(competition_test_df.tile_id == tile) & (competition_test_df.satellite_platform == 's2')].datetime.unique())
  test_nunique_dates.append(competition_test_df[(competition_test_df.tile_id == tile) & (competition_test_df.satellite_platform == 's2')].datetime.nunique())

In [None]:
pd.Series(train_nunique_dates + test_nunique_dates).describe()

count    3787.000000
mean       54.447056
std        18.647749
min        37.000000
25%        38.000000
50%        38.000000
75%        76.000000
max        76.000000
dtype: float64

 - In total there are 3787 tiles
 - Average number of dates per tile is 54
 - Minimum dates available in a tile is 37
 - Maximum dates available in a tile is 76
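
The per-tile counts above come from looping over every tile and filtering the DataFrame each time. As a minimal sketch, the same count can be obtained with a single `groupby`. The `toy_df` frame and its values are made up for illustration and only mimic the columns of `competition_train_df`.

```python
import pandas as pd

# Toy stand-in for competition_train_df (illustrative values, not real data)
toy_df = pd.DataFrame({
    'tile_id': [1, 1, 1, 2, 2, 2, 2],
    'datetime': pd.to_datetime(['2017-04-01', '2017-04-01', '2017-04-11',
                                '2017-04-04', '2017-04-14', '2017-04-24',
                                '2017-04-29']),
    'satellite_platform': ['s2', 's2', 's2', 's2', 's2', 's2', 's1'],
})

# Keep Sentinel-2 rows only, then count distinct dates per tile in one pass
s2_rows = toy_df[toy_df['satellite_platform'] == 's2']
n_dates_per_tile = s2_rows.groupby('tile_id')['datetime'].nunique()
print(n_dates_per_tile.to_dict())
```

One difference from the loop: a tile with no Sentinel-2 rows is simply absent from the grouped result, whereas the loop records a count of 0 for it.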

In [None]:
dates_df =pd.DataFrame({
                        'tile': competition_train_df.tile_id.unique().tolist()+competition_test_df.tile_id.unique().tolist(),
                        'n_unique_dates': train_nunique_dates + test_nunique_dates,
                        'unique_dates': train_unique_dates + test_unique_dates
                        })
dates_df.head()

Unnamed: 0,tile,n_unique_dates,unique_dates
0,1,76,"[2017-06-20T00:00:00.000000000, 2017-05-14T00:..."
1,2,76,"[2017-07-25T00:00:00.000000000, 2017-07-23T00:..."
2,3,76,"[2017-07-08T00:00:00.000000000, 2017-07-05T00:..."
3,4,38,"[2017-10-03T00:00:00.000000000, 2017-10-13T00:..."
4,5,38,"[2017-05-11T00:00:00.000000000, 2017-05-21T00:..."


In [None]:
all_dates = [item for sublist in dates_df.unique_dates.tolist() for item in sublist]

# Number of unique dates overall
len(set(all_dates)), set(all_dates)

(152,
 {'2017-04-01',
  numpy.datetime64('2017-04-01T00:00:00.000000000'),
  '2017-04-04',
  numpy.datetime64('2017-04-04T00:00:00.000000000'),
  '2017-04-11',
  numpy.datetime64('2017-04-11T00:00:00.000000000'),
  '2017-04-14',
  numpy.datetime64('2017-04-14T00:00:00.000000000'),
  '2017-04-21',
  numpy.datetime64('2017-04-21T00:00:00.000000000'),
  '2017-04-24',
  numpy.datetime64('2017-04-24T00:00:00.000000000'),
  '2017-05-01',
  numpy.datetime64('2017-05-01T00:00:00.000000000'),
  '2017-05-04',
  numpy.datetime64('2017-05-04T00:00:00.000000000'),
  '2017-05-11',
  numpy.datetime64('2017-05-11T00:00:00.000000000'),
  '2017-05-14',
  numpy.datetime64('2017-05-14T00:00:00.000000000'),
  '2017-05-21',
  numpy.datetime64('2017-05-21T00:00:00.000000000'),
  '2017-05-24',
  numpy.datetime64('2017-05-24T00:00:00.000000000'),
  '2017-05-31',
  numpy.datetime64('2017-05-31T00:00:00.000000000'),
  '2017-06-03',
  numpy.datetime64('2017-06-03T00:00:00.000000000'),
  '2017-06-10',
  numpy.date

In [None]:
# Maximum and minimum dates 
pd.Series(all_dates).min(), pd.Series(all_dates).max()

(Timestamp('2017-04-01 00:00:00'), Timestamp('2017-11-30 00:00:00'))

In [None]:
# Investigate minimum dates
min_dates = dates_df[dates_df.n_unique_dates == 37].unique_dates.tolist()
min_dates

[array(['2017-11-27T00:00:00.000000000', '2017-11-17T00:00:00.000000000',
        '2017-11-07T00:00:00.000000000', '2017-11-12T00:00:00.000000000',
        '2017-11-22T00:00:00.000000000', '2017-05-11T00:00:00.000000000',
        '2017-05-21T00:00:00.000000000', '2017-05-31T00:00:00.000000000',
        '2017-06-10T00:00:00.000000000', '2017-06-20T00:00:00.000000000',
        '2017-05-01T00:00:00.000000000', '2017-04-01T00:00:00.000000000',
        '2017-04-11T00:00:00.000000000', '2017-04-21T00:00:00.000000000',
        '2017-10-28T00:00:00.000000000', '2017-11-02T00:00:00.000000000',
        '2017-07-25T00:00:00.000000000', '2017-07-30T00:00:00.000000000',
        '2017-08-04T00:00:00.000000000', '2017-07-20T00:00:00.000000000',
        '2017-08-09T00:00:00.000000000', '2017-08-19T00:00:00.000000000',
        '2017-07-15T00:00:00.000000000', '2017-06-30T00:00:00.000000000',
        '2017-07-05T00:00:00.000000000', '2017-07-10T00:00:00.000000000',
        '2017-10-23T00:00:00.000000000

In [None]:
# Data provided is for every 5 days
# Confirmation of 5 day difference
for i in range(5):
  print(pd.to_datetime(min_dates[0][i]) + timedelta(days = 5))

2017-12-02 00:00:00
2017-11-22 00:00:00
2017-11-12 00:00:00
2017-11-17 00:00:00
2017-11-27 00:00:00


 - Some dates are not available at a 5-day interval

In [None]:
# Confirmation of 10 day difference
for i in range(5):
  print(pd.to_datetime(min_dates[0][i]) + timedelta(days = 10))

2017-12-07 00:00:00
2017-11-27 00:00:00
2017-11-17 00:00:00
2017-11-22 00:00:00
2017-12-02 00:00:00


 - Most dates are available at a 10-day interval
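
Adding a fixed offset to a handful of dates is a spot check. A more direct way to inspect the revisit interval is to sort a tile's dates and count the gaps between consecutive observations. The sketch below uses only the standard library, and the `observed` list is an illustrative subset rather than a full tile.

```python
from collections import Counter
from datetime import date

# Illustrative dates only; a real tile would supply its own unique Sentinel-2 dates
observed = [date(2017, 11, 27), date(2017, 11, 17), date(2017, 11, 7),
            date(2017, 11, 12), date(2017, 11, 22), date(2017, 10, 28)]

# Sort the dates, then measure the gap in days between each consecutive pair
ordered = sorted(observed)
gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]

# Tally how often each gap length occurs
gap_counts = Counter(gaps)
print(gap_counts)
```

A tile whose gaps are all 10 follows the 10-day series exactly, while a mix of 5 and 10 suggests the tile switches between the two cadences.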

In [None]:
# Confirm that each tile has a difference of 10 days
dates_df.head()

Unnamed: 0,tile,n_unique_dates,unique_dates
0,1,76,"[2017-06-20T00:00:00.000000000, 2017-05-14T00:..."
1,2,76,"[2017-07-25T00:00:00.000000000, 2017-07-23T00:..."
2,3,76,"[2017-07-08T00:00:00.000000000, 2017-07-05T00:..."
3,4,38,"[2017-10-03T00:00:00.000000000, 2017-10-13T00:..."
4,5,38,"[2017-05-11T00:00:00.000000000, 2017-05-21T00:..."


In [None]:
# Check minimum and maximum dates for each tile
dates_min, dates_max = [], []
for i in range(dates_df.shape[0]):
  dates_min.append(dates_df.unique_dates.loc[i].min())
  dates_max.append(dates_df.unique_dates.loc[i].max())

dates_df['min_date'] = dates_min
dates_df['max_date'] = dates_max
dates_df.head()

Unnamed: 0,tile,n_unique_dates,unique_dates,min_date,max_date
0,1,76,"[2017-06-20T00:00:00.000000000, 2017-05-14T00:...",2017-04-01,2017-11-30
1,2,76,"[2017-07-25T00:00:00.000000000, 2017-07-23T00:...",2017-04-01,2017-11-30
2,3,76,"[2017-07-08T00:00:00.000000000, 2017-07-05T00:...",2017-04-01,2017-11-30
3,4,38,"[2017-10-03T00:00:00.000000000, 2017-10-13T00:...",2017-04-01,2017-11-27
4,5,38,"[2017-05-11T00:00:00.000000000, 2017-05-21T00:...",2017-04-01,2017-11-27


In [None]:
dates_df.min_date.unique(), dates_df.max_date.unique()

(array(['2017-04-01T00:00:00.000000000', '2017-04-04T00:00:00.000000000'],
       dtype='datetime64[ns]'),
 array(['2017-11-30T00:00:00.000000000', '2017-11-27T00:00:00.000000000'],
       dtype='datetime64[ns]'))

 - Tiles have two unique start dates (2017-04-01 and 2017-04-04) and two unique end dates (2017-11-27 and 2017-11-30)

In [None]:
# Create dates at 10-day intervals for each tile, given the start date
# There are 24 ten-day gaps between the start and end dates, giving 25 dates per tile
# If the first date is 2017-04-01
start_date = pd.to_datetime('2017-04-01')
dates_1 = []
dates_1.append(start_date)
for i in range(24):
 start_date = start_date + timedelta(days = 10)
 dates_1.append(start_date)
dates_1

[Timestamp('2017-04-01 00:00:00'),
 Timestamp('2017-04-11 00:00:00'),
 Timestamp('2017-04-21 00:00:00'),
 Timestamp('2017-05-01 00:00:00'),
 Timestamp('2017-05-11 00:00:00'),
 Timestamp('2017-05-21 00:00:00'),
 Timestamp('2017-05-31 00:00:00'),
 Timestamp('2017-06-10 00:00:00'),
 Timestamp('2017-06-20 00:00:00'),
 Timestamp('2017-06-30 00:00:00'),
 Timestamp('2017-07-10 00:00:00'),
 Timestamp('2017-07-20 00:00:00'),
 Timestamp('2017-07-30 00:00:00'),
 Timestamp('2017-08-09 00:00:00'),
 Timestamp('2017-08-19 00:00:00'),
 Timestamp('2017-08-29 00:00:00'),
 Timestamp('2017-09-08 00:00:00'),
 Timestamp('2017-09-18 00:00:00'),
 Timestamp('2017-09-28 00:00:00'),
 Timestamp('2017-10-08 00:00:00'),
 Timestamp('2017-10-18 00:00:00'),
 Timestamp('2017-10-28 00:00:00'),
 Timestamp('2017-11-07 00:00:00'),
 Timestamp('2017-11-17 00:00:00'),
 Timestamp('2017-11-27 00:00:00')]
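
As a small sketch of an alternative, `pd.date_range` can build the same 25-date series without an explicit loop. The variable names `loop_dates` and `range_dates` are introduced here for illustration only.

```python
from datetime import timedelta
import pandas as pd

# Loop-style construction: start date plus multiples of 10 days
start_date = pd.to_datetime('2017-04-01')
loop_dates = [start_date + timedelta(days=10 * i) for i in range(25)]

# Equivalent one-liner: 25 timestamps spaced 10 days apart
range_dates = list(pd.date_range(start='2017-04-01', periods=25, freq='10D'))

print(loop_dates == range_dates, range_dates[-1])
```

Passing `start='2017-04-04'` instead should give the second series, which ends on 2017-11-30.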

In [None]:
start_date = pd.to_datetime('2017-04-04')
dates_2 = []
dates_2.append(start_date)
for i in range(24):
 start_date = start_date + timedelta(days = 10)
 dates_2.append(start_date)
dates_2

[Timestamp('2017-04-04 00:00:00'),
 Timestamp('2017-04-14 00:00:00'),
 Timestamp('2017-04-24 00:00:00'),
 Timestamp('2017-05-04 00:00:00'),
 Timestamp('2017-05-14 00:00:00'),
 Timestamp('2017-05-24 00:00:00'),
 Timestamp('2017-06-03 00:00:00'),
 Timestamp('2017-06-13 00:00:00'),
 Timestamp('2017-06-23 00:00:00'),
 Timestamp('2017-07-03 00:00:00'),
 Timestamp('2017-07-13 00:00:00'),
 Timestamp('2017-07-23 00:00:00'),
 Timestamp('2017-08-02 00:00:00'),
 Timestamp('2017-08-12 00:00:00'),
 Timestamp('2017-08-22 00:00:00'),
 Timestamp('2017-09-01 00:00:00'),
 Timestamp('2017-09-11 00:00:00'),
 Timestamp('2017-09-21 00:00:00'),
 Timestamp('2017-10-01 00:00:00'),
 Timestamp('2017-10-11 00:00:00'),
 Timestamp('2017-10-21 00:00:00'),
 Timestamp('2017-10-31 00:00:00'),
 Timestamp('2017-11-10 00:00:00'),
 Timestamp('2017-11-20 00:00:00'),
 Timestamp('2017-11-30 00:00:00')]

In [None]:
s_dates = []
for i in dates_df.min_date:
  if i == pd.to_datetime('2017-04-01'):
    s_dates.append(dates_1)
  else:
    s_dates.append(dates_2)

dates_df['expected_dates'] = s_dates
dates_df.head()

Unnamed: 0,tile,n_unique_dates,unique_dates,min_date,max_date,expected_dates
0,1,76,"[2017-06-20T00:00:00.000000000, 2017-05-14T00:...",2017-04-01,2017-11-30,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201..."
1,2,76,"[2017-07-25T00:00:00.000000000, 2017-07-23T00:...",2017-04-01,2017-11-30,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201..."
2,3,76,"[2017-07-08T00:00:00.000000000, 2017-07-05T00:...",2017-04-01,2017-11-30,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201..."
3,4,38,"[2017-10-03T00:00:00.000000000, 2017-10-13T00:...",2017-04-01,2017-11-27,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201..."
4,5,38,"[2017-05-11T00:00:00.000000000, 2017-05-21T00:...",2017-04-01,2017-11-27,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201..."


In [None]:
# Confirm that unique dates are in expected dates
unavailable_dates = []
for i in range(dates_df.shape[0]):
  unavailable_dates.append(set(dates_df.expected_dates.loc[i]) - set([pd.to_datetime(x) for x in dates_df.unique_dates.loc[i]]))

In [None]:
dates_df['not_available'] = unavailable_dates 
dates_df.head()

Unnamed: 0,tile,n_unique_dates,unique_dates,min_date,max_date,expected_dates,not_available
0,1,76,"[2017-06-20T00:00:00.000000000, 2017-05-14T00:...",2017-04-01,2017-11-30,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201...",{}
1,2,76,"[2017-07-25T00:00:00.000000000, 2017-07-23T00:...",2017-04-01,2017-11-30,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201...",{}
2,3,76,"[2017-07-08T00:00:00.000000000, 2017-07-05T00:...",2017-04-01,2017-11-30,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201...",{}
3,4,38,"[2017-10-03T00:00:00.000000000, 2017-10-13T00:...",2017-04-01,2017-11-27,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201...",{}
4,5,38,"[2017-05-11T00:00:00.000000000, 2017-05-21T00:...",2017-04-01,2017-11-27,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201...",{}


In [None]:
# Investigate tiles that don't have all the expected dates
dates_df[dates_df.not_available != set()]

Unnamed: 0,tile,n_unique_dates,unique_dates,min_date,max_date,expected_dates,not_available
38,39,68,"[2017-11-15T00:00:00.000000000, 2017-11-17T00:...",2017-04-01,2017-11-30,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201...","{2017-06-30 00:00:00, 2017-11-27 00:00:00}"
185,186,65,"[2017-11-15T00:00:00.000000000, 2017-08-07T00:...",2017-04-01,2017-11-30,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201...","{2017-06-20 00:00:00, 2017-06-30 00:00:00, 201..."
600,601,68,"[2017-11-10T00:00:00.000000000, 2017-11-05T00:...",2017-04-01,2017-11-30,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201...","{2017-06-30 00:00:00, 2017-11-27 00:00:00}"
1823,1824,74,"[2017-11-20T00:00:00.000000000, 2017-11-17T00:...",2017-04-01,2017-11-30,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201...",{2017-06-30 00:00:00}
2290,2291,62,"[2017-10-31T00:00:00.000000000, 2017-11-05T00:...",2017-04-01,2017-11-30,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201...","{2017-06-20 00:00:00, 2017-11-17 00:00:00, 201..."
2515,2516,73,"[2017-08-17T00:00:00.000000000, 2017-09-08T00:...",2017-04-01,2017-11-30,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201...",{2017-06-30 00:00:00}
3229,580,62,"[2017-06-23, 2017-06-13, 2017-05-21, 2017-06-0...",2017-04-01,2017-11-30,"[2017-04-01 00:00:00, 2017-04-11 00:00:00, 201...","{2017-06-20 00:00:00, 2017-11-17 00:00:00, 201..."


In [None]:
n_v_tiles = dates_df[dates_df.not_available != set()].tile.unique().tolist()
n_v_tiles 

[39, 186, 601, 1824, 2291, 2516, 580]
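
The missing-date check above is a set difference between the expected 10-day series and the dates a tile actually has. The sketch below isolates that step with standard-library dates. The helper name `missing_dates` is hypothetical, and the dropped date is chosen purely for illustration.

```python
from datetime import date, timedelta

def missing_dates(expected, observed):
    """Return the expected dates that never appear in observed, sorted."""
    return sorted(set(expected) - set(observed))

# Expected 10-day series starting on 2017-04-01 (25 dates)
expected = [date(2017, 4, 1) + timedelta(days=10 * i) for i in range(25)]

# Simulate a tile that lacks one acquisition
observed = [d for d in expected if d != date(2017, 6, 30)]

print(missing_dates(expected, observed))
```

Both sides must hold the same type for the difference to be meaningful, which is why the notebook converts each tile's dates with `pd.to_datetime` before comparing them with the expected `Timestamp` values.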

In [None]:
# Use unique() rather than nunique(): the loop below needs the tile IDs themselves, not their count
tile_ids_train = competition_train_df['tile_id'].unique()

In [None]:
# Our goal is to develop a pixel-based Random Forest model, so we build a feature matrix X
# in which each row is a pixel and each column is one observation
# (12 spectral bands plus the CLM layer, for each of 25 dates).
# The label vector y has one entry per pixel.
bands = ['B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B09', 'B11', 'B12', 'CLM']

X = np.empty((0, 13 * 25))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))

for tile_id in tile_ids_train:
    tile_df = competition_train_df[competition_train_df['tile_id'] == tile_id]

    # Crop-type label raster for this tile
    with rasterio.open(tile_df[tile_df['asset'] == 'labels']['file_path'].values[0]) as label_src:
        y = np.append(y, label_src.read(1).flatten())

    # Field ID raster for this tile
    with rasterio.open(tile_df[tile_df['asset'] == 'field_ids']['file_path'].values[0]) as field_id_src:
        field_ids = np.append(field_ids, field_id_src.read(1).flatten())

    # Pick 25 dates per tile. Tiles with missing dates take the first date from each of
    # 25 chunks of their available dates; all other tiles use the expected 10-day series.
    tile_date_times = sorted(tile_df[tile_df['satellite_platform'] == 's2']['datetime'].unique())
    if tile_id in n_v_tiles:
        tile_date_times = [x[0] for x in np.array_split(tile_date_times, 25)]
    elif tile_date_times[0] == np.datetime64('2017-04-01'):
        tile_date_times = dates_1.copy()
    else:
        tile_date_times = dates_2.copy()

    # Stack one column per (date, band) pair, in the same band order for every date
    X_tile = np.empty((256 * 256, 0))
    for date_time in tile_date_times:
        for band in bands:
            band_path = tile_df[(tile_df['datetime'] == date_time) & (tile_df['asset'] == band)]['file_path'].values[0]
            with rasterio.open(band_path) as band_src:
                band_array = np.expand_dims(band_src.read(1).flatten(), axis=1)
            X_tile = np.append(X_tile, band_array, axis=1)

    X = np.append(X, X_tile, axis=0)
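
For tiles with missing dates, the loop above falls back to `np.array_split` to reduce whatever dates are available to exactly 25. The sketch below shows what that selection does on a made-up run of 68 consecutive dates. Real tiles have irregular gaps, so the picked dates will differ.

```python
import numpy as np

# 68 hypothetical acquisition dates (2017-04-01 through 2017-06-07), illustrative only
available = np.arange('2017-04-01', '2017-06-08', dtype='datetime64[D]')

# Split into 25 nearly equal chunks and keep the first date of each chunk
chunks = np.array_split(available, 25)
picked = [chunk[0] for chunk in chunks]

print(len(picked), [len(chunk) for chunk in chunks])
```

Because 68 is not divisible by 25, the first 18 chunks hold 3 dates and the remaining 7 hold 2, so the picked dates are only roughly evenly spaced and need not line up with the expected 10-day series used for the other tiles.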

In [None]:
train_data = pd.DataFrame(X)
train_data['crop_type'] = y.astype(int)
train_data['Field ID'] = field_ids
train_data = train_data[train_data['crop_type'] != 0]  # drop pixels that have no label (or corresponding field ID)
train_grouped = train_data.groupby('Field ID').mean().reset_index()
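
The filtering and per-field averaging in this cell can be hard to picture on a 325-column table. Here is a minimal sketch on a toy frame with two feature columns. All values are invented, and the names `pixels`, `labelled` and `per_field` are illustrative.

```python
import pandas as pd

# Toy pixel table: two feature columns plus a label and a field ID (illustrative values)
pixels = pd.DataFrame({
    0: [0.25, 0.75, 0.5, 1.0],
    1: [1.0, 3.0, 5.0, 7.0],
    'crop_type': [2, 2, 0, 4],
    'Field ID': [10, 10, 0, 11],
})

# Drop unlabelled pixels, then average every remaining column within each field
labelled = pixels[pixels['crop_type'] != 0]
per_field = labelled.groupby('Field ID').mean().reset_index()
print(per_field)
```

Averaging leaves `crop_type` unchanged only if every pixel in a field carries the same label, which is the assumption this aggregation relies on.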

In [None]:
# Use unique() rather than nunique(): the loop below needs the tile IDs themselves, not their count
tile_ids_test = competition_test_df['tile_id'].unique()

In [None]:
# Build the same pixel-by-observation feature matrix X for the test tiles:
# each row is a pixel and each column is one observation (13 assets x 25 dates).
bands = ['B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B09', 'B11', 'B12', 'CLM']

# The test dates were still strings in the analysis above; convert them so they
# compare correctly with the Timestamp values in dates_1 and dates_2
competition_test_df.datetime = pd.to_datetime(competition_test_df.datetime)

X = np.empty((0, 13 * 25))
field_ids = np.empty((0, 1))

for tile_id in tile_ids_test:
    tile_df = competition_test_df[competition_test_df['tile_id'] == tile_id]

    # Field ID raster for this tile
    with rasterio.open(tile_df[tile_df['asset'] == 'field_ids']['file_path'].values[0]) as field_id_src:
        field_ids = np.append(field_ids, field_id_src.read(1).flatten())

    # Pick 25 dates per tile. Tiles with missing dates take the first date from each of
    # 25 chunks of their available dates; all other tiles use the expected 10-day series.
    tile_date_times = sorted(tile_df[tile_df['satellite_platform'] == 's2']['datetime'].unique())
    if tile_id in n_v_tiles:
        tile_date_times = [x[0] for x in np.array_split(tile_date_times, 25)]
    elif tile_date_times[0] == np.datetime64('2017-04-01'):
        tile_date_times = dates_1.copy()
    else:
        tile_date_times = dates_2.copy()

    # Stack one column per (date, band) pair, in the same band order for every date
    X_tile = np.empty((256 * 256, 0))
    for date_time in tile_date_times:
        for band in bands:
            band_path = tile_df[(tile_df['datetime'] == date_time) & (tile_df['asset'] == band)]['file_path'].values[0]
            with rasterio.open(band_path) as band_src:
                band_array = np.expand_dims(band_src.read(1).flatten(), axis=1)
            X_tile = np.append(X_tile, band_array, axis=1)

    X = np.append(X, X_tile, axis=0)

In [None]:
test_data = pd.DataFrame(X)
test_data['Field ID'] = field_ids
test_data['crop_type'] = 0
test_grouped = test_data.groupby('Field ID').mean().reset_index()

In [None]:
df = pd.concat([train_data, test_data]).reset_index(drop = True)

In [None]:
df.to_csv('radiant_pixels.csv', index = False)