<img src='https://radiant-assets.s3-us-west-2.amazonaws.com/PrimaryRadiantMLHubLogo.png' alt='Radiant MLHub Logo' width='300'/>

# A Baseline Model for the Radiant Earth Spot the Crop Challenge [Sentinel-2 version]

This notebook walks you through the steps to load the data and build a baseline model based on Sentinel-2 data using Random Forests for the `Radiant Earth Spot the Crop Challenge`.

## Radiant MLHub API


The Radiant MLHub API gives access to open Earth imagery training data for machine learning applications. You can learn more about the repository at the [Radiant MLHub site](https://mlhub.earth) and about the organization behind it at the [Radiant Earth Foundation site](https://radiant.earth).

Full documentation for the API is available at [https://mlhub.earth/docs](https://mlhub.earth/docs).

Each item in our collection is described in JSON format compliant with the [STAC](https://stacspec.org/) [label extension](https://github.com/stac-extensions/label) definition.

## Dependencies

All the dependencies for this notebook are listed in the `requirements.txt` file in this folder.


**You must replace the `YOUR_API_KEY_HERE` text with your API key, which you can obtain by creating a free account on the [MLHub Dashboard](https://mlhub.earth/profile/) and opening the `API Keys` tab at the top of the page.**

In [5]:
# Required libraries
import os
import tarfile
import json
from pathlib import Path
from radiant_mlhub.client import _download as download_file

import datetime
import rasterio
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedShuffleSplit

os.environ['MLHUB_API_KEY'] = 'N/A'

In [103]:
DOWNLOAD_S1 = False # Set this to True to also download the Sentinel-1 data, which is not needed in this notebook.

# Select which imagery bands you'd like to download here:
DOWNLOAD_BANDS = {
    'B01': False,
    'B02': False,
    'B03': True,
    'B04': True,
    'B05': False,
    'B06': False,
    'B07': False,
    'B08': True,
    'B8A': False,
    'B09': False,
    'B11': False,
    'B12': False,
    'CLM': True
}

# In this model we will only use the Green, Red and NIR bands. You can choose to download any number of bands.
# Our choice relies on the fact that vegetation is most sensitive to these bands.
# We also download the CLM (Cloud Mask) layer to exclude cloudy data from the training phase.
# You can also do feature selection and try different combinations of bands to see which ones give better accuracy.
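
One band combination that may be worth trying, since Red (B04) and NIR (B08) are already selected above, is the normalized difference vegetation index, NDVI = (NIR - Red) / (NIR + Red). The sketch below is not part of the baseline: the `ndvi` helper and the sample pixel values are hypothetical, and pixels where both bands are zero are assumed to map to an NDVI of 0.

```python
import numpy as np

def ndvi(red, nir):
    """Return NDVI = (NIR - Red) / (NIR + Red); 0 where both bands are 0."""
    red = np.asarray(red, dtype=np.float64)
    nir = np.asarray(nir, dtype=np.float64)
    total = nir + red
    out = np.zeros_like(total)
    # Divide only where the denominator is non-zero to avoid NaN/inf values
    np.divide(nir - red, total, out=out, where=total != 0)
    return out

# Made-up 2x2 patches standing in for B04 (Red) and B08 (NIR) rasters
red = np.array([[40, 12], [0, 30]])
nir = np.array([[60, 86], [0, 30]])
ndvi_values = ndvi(red, nir)
print(ndvi_values)
```

An NDVI column per observation could be appended to the features alongside, or instead of, the raw bands.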

## Downloading and Loading the Data

In this part, we will download the data from Radiant MLHub and load the properties of each item in the dataset into a DataFrame.


In [3]:
FOLDER_BASE = 'ref_south_africa_crops_competition_v1'

def download_archive(archive_name):
    if os.path.exists(archive_name.replace('.tar.gz', '')):
        return
    
    print(f'Downloading {archive_name} ...')
    download_url = f'https://radiant-mlhub.s3.us-west-2.amazonaws.com/archives/{archive_name}'
    download_file(download_url, '.')
    print(f'Extracting {archive_name} ...')
    with tarfile.open(archive_name) as tfile:
        tfile.extractall()
    os.remove(archive_name)

for split in ['train', 'test']:
    # Download the labels
    labels_archive = f'{FOLDER_BASE}_{split}_labels.tar.gz'
    download_archive(labels_archive)
    
    # Download Sentinel-1 data
    if DOWNLOAD_S1:
        s1_archive = f'{FOLDER_BASE}_{split}_source_s1.tar.gz'
        download_archive(s1_archive)
        

    for band, download in DOWNLOAD_BANDS.items():
        if not download:
            continue
        s2_archive = f'{FOLDER_BASE}_{split}_source_s2_{band}.tar.gz'
        download_archive(s2_archive)
        
def resolve_path(base, path):
    return Path(os.path.join(base, path)).resolve()
        
def load_df(collection_id):
    split = collection_id.split('_')[-2]
    collection = json.load(open(f'{collection_id}/collection.json', 'r'))
    rows = []
    item_links = []
    for link in collection['links']:
        if link['rel'] != 'item':
            continue
        item_links.append(link['href'])
        
    for item_link in item_links:
        item_path = f'{collection_id}/{item_link}'
        current_path = os.path.dirname(item_path)
        item = json.load(open(item_path, 'r'))
        tile_id = item['id'].split('_')[-1]
        for asset_key, asset in item['assets'].items():
            rows.append([
                tile_id,
                None,
                None,
                asset_key,
                str(resolve_path(current_path, asset['href']))
            ])
            
        for link in item['links']:
            if link['rel'] != 'source':
                continue
            source_item_id = link['href'].split('/')[-2]
            
            if source_item_id.find('_s1_') > 0 and not DOWNLOAD_S1:
                continue
            elif source_item_id.find('_s1_') > 0:
                for band in ['VV', 'VH']:
                    asset_path = Path(f'{FOLDER_BASE}_{split}_source_s1/{source_item_id}/{band}.tif').resolve()
                    date = '-'.join(source_item_id.split('_')[10:13])
                    
                    rows.append([
                        tile_id,
                        f'{date}T00:00:00Z',
                        's1',
                        band,
                        asset_path
                    ])
                
            if source_item_id.find('_s2_') > 0:
                for band, download in DOWNLOAD_BANDS.items():
                    if not download:
                        continue
                    
                    asset_path = Path(f'{FOLDER_BASE}_{split}_source_s2_{band}/{source_item_id}_{band}.tif').resolve()
                    date = '-'.join(source_item_id.split('_')[10:13])
                    rows.append([
                        tile_id,
                        f'{date}T00:00:00Z',
                        's2',
                        band,
                        asset_path
                    ])
            
    return pd.DataFrame(rows, columns=['tile_id', 'datetime', 'satellite_platform', 'asset', 'file_path'])

competition_train_df = load_df(f'{FOLDER_BASE}_train_labels')
competition_test_df = load_df(f'{FOLDER_BASE}_test_labels')

Downloading ref_south_africa_crops_competition_v1_train_labels.tar.gz ...


  0%|          | 0/31.4 [00:00<?, ?M/s]

Extracting ref_south_africa_crops_competition_v1_train_labels.tar.gz ...
Downloading ref_south_africa_crops_competition_v1_train_source_s2_B03.tar.gz ...


  0%|          | 0/5775.1 [00:00<?, ?M/s]

Extracting ref_south_africa_crops_competition_v1_train_source_s2_B03.tar.gz ...
Downloading ref_south_africa_crops_competition_v1_train_source_s2_B04.tar.gz ...


  0%|          | 0/6363.4 [00:00<?, ?M/s]

Extracting ref_south_africa_crops_competition_v1_train_source_s2_B04.tar.gz ...
Downloading ref_south_africa_crops_competition_v1_train_source_s2_B08.tar.gz ...


  0%|          | 0/6755.8 [00:00<?, ?M/s]

Extracting ref_south_africa_crops_competition_v1_train_source_s2_B08.tar.gz ...
Downloading ref_south_africa_crops_competition_v1_train_source_s2_CLM.tar.gz ...


  0%|          | 0/24.3 [00:00<?, ?M/s]

Extracting ref_south_africa_crops_competition_v1_train_source_s2_CLM.tar.gz ...
Downloading ref_south_africa_crops_competition_v1_test_labels.tar.gz ...


  0%|          | 0/10.9 [00:00<?, ?M/s]

Extracting ref_south_africa_crops_competition_v1_test_labels.tar.gz ...
Downloading ref_south_africa_crops_competition_v1_test_source_s2_B03.tar.gz ...


  0%|          | 0/2454.4 [00:00<?, ?M/s]

Extracting ref_south_africa_crops_competition_v1_test_source_s2_B03.tar.gz ...
Downloading ref_south_africa_crops_competition_v1_test_source_s2_B04.tar.gz ...


  0%|          | 0/2706.0 [00:00<?, ?M/s]

Extracting ref_south_africa_crops_competition_v1_test_source_s2_B04.tar.gz ...
Downloading ref_south_africa_crops_competition_v1_test_source_s2_B08.tar.gz ...


  0%|          | 0/2877.3 [00:00<?, ?M/s]

Extracting ref_south_africa_crops_competition_v1_test_source_s2_B08.tar.gz ...
Downloading ref_south_africa_crops_competition_v1_test_source_s2_CLM.tar.gz ...


  0%|          | 0/10.4 [00:00<?, ?M/s]

Extracting ref_south_africa_crops_competition_v1_test_source_s2_CLM.tar.gz ...


In [4]:
competition_train_df

Unnamed: 0,tile_id,datetime,satellite_platform,asset,file_path
0,2587,,,documentation,/Users/hamed/Dropbox (Radiant)/Radiant/mlhub/m...
1,2587,,,field_ids,/Users/hamed/Dropbox (Radiant)/Radiant/mlhub/m...
2,2587,,,field_info_train,/Users/hamed/Dropbox (Radiant)/Radiant/mlhub/m...
3,2587,,,labels,/Users/hamed/Dropbox (Radiant)/Radiant/mlhub/m...
4,2587,,,raster_values,/Users/hamed/Dropbox (Radiant)/Radiant/mlhub/m...
...,...,...,...,...,...
591109,2198,2017-11-27T00:00:00Z,s2,CLM,/Users/hamed/Dropbox (Radiant)/Radiant/mlhub/m...
591110,2198,2017-11-30T00:00:00Z,s2,B03,/Users/hamed/Dropbox (Radiant)/Radiant/mlhub/m...
591111,2198,2017-11-30T00:00:00Z,s2,B04,/Users/hamed/Dropbox (Radiant)/Radiant/mlhub/m...
591112,2198,2017-11-30T00:00:00Z,s2,B08,/Users/hamed/Dropbox (Radiant)/Radiant/mlhub/m...


In [7]:
# This DataFrame lists all types of assets including documentation of the data. 
# In the following, we will use the Sentinel-2 bands as well as labels. 
competition_train_df['asset'].unique()

array(['documentation', 'field_ids', 'field_info_train', 'labels',
       'raster_values', 'B03', 'B04', 'B08', 'CLM'], dtype=object)

In [8]:
tile_ids_train = competition_train_df['tile_id'].unique()

In [7]:
# For simplicity, this baseline model uses only 5 images throughout the growing season.
# You can choose to use all of them, select a few of them at specific intervals, or
# load as many as you want and interpolate between them to get a regular temporal frequency.

# Another assumption is that we select the first 5 cloud-free images. Ideally, you should
# select images with the same temporal frequency across the different tiles.
n_obs = 5
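
The "specific intervals" option mentioned above could be sketched as follows. This is a minimal illustration, not what the baseline does: `pick_evenly_spaced` is a hypothetical helper, and the dates are made up to mimic the ISO timestamps stored in the `datetime` column.

```python
import numpy as np

def pick_evenly_spaced(date_times, n_obs):
    """Pick up to n_obs entries spread evenly across a sorted sequence."""
    date_times = sorted(date_times)  # ISO 8601 strings sort chronologically
    if len(date_times) <= n_obs:
        return date_times
    # Evenly spaced positions from the first to the last entry
    idx = np.linspace(0, len(date_times) - 1, n_obs).round().astype(int)
    return [date_times[i] for i in idx]

# Eight made-up cloud-free acquisition dates, April to November
cloud_free = [f'2017-{month:02d}-15T00:00:00Z' for month in range(4, 12)]
selected = pick_evenly_spaced(cloud_free, 5)
print(selected)
```

Applied per tile to the list of cloud-free dates, this would spread the observations over the season rather than taking the earliest ones.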

In [88]:
# Our goal is to develop a pixel-based Random Forest model. So we create an X array
# in which each row is a pixel and each column is one band of one observation.
# The other array, y, holds one label per pixel, so it has as many rows as X.
X = np.empty((0, 3 * n_obs))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))

for tile_id in tile_ids_train:
    tile_df = competition_train_df[competition_train_df['tile_id']==tile_id]

    label_src = rasterio.open(tile_df[tile_df['asset']=='labels']['file_path'].values[0])
    label_array = label_src.read(1)
    y = np.append(y, label_array.flatten())

    field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
    field_id_array = field_id_src.read(1)
    field_ids = np.append(field_ids, field_id_array.flatten())

    tile_date_times = tile_df[tile_df['satellite_platform']=='s2']['datetime'].unique()

    X_tile = np.empty((256 * 256, 0))
    n_X = 0
    for date_time in tile_date_times:
        # Read the cloud mask first. If the image is cloud-free, load the other bands;
        # otherwise move on to the next observation.
        
        clm_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='CLM')]['file_path'].values[0])
        clm_max = np.max(clm_src.read(1))
        # Keep an image only if its maximum per-pixel cloud probability is below 10% (10% * 255 = 25.5).
        if clm_max < 25:
            n_X+=1
            b3_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='B03')]['file_path'].values[0])
            b3_array = np.expand_dims(b3_src.read(1).flatten(), axis=1)

            b4_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='B04')]['file_path'].values[0])
            b4_array = np.expand_dims(b4_src.read(1).flatten(), axis=1)

            b8_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='B08')]['file_path'].values[0])
            b8_array = np.expand_dims(b8_src.read(1).flatten(), axis=1)


            X_tile = np.append(X_tile, b3_array, axis = 1)
            X_tile = np.append(X_tile, b4_array, axis = 1)
            X_tile = np.append(X_tile, b8_array, axis = 1)
        if n_X == n_obs:
            break
        
    X = np.append(X, X_tile, axis=0)
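
The loop above appends the bands in the order B03, B04, B08 for each selected observation, so the columns of `X` are grouped observation by observation. A small sketch of naming the columns accordingly is shown below; `feature_names` and the `obs{i}_{band}` naming scheme are hypothetical and only meant to make the layout explicit.

```python
def feature_names(n_obs, bands=('B03', 'B04', 'B08')):
    """Return column names matching the order in which bands are appended."""
    return [f'obs{obs}_{band}' for obs in range(n_obs) for band in bands]

names = feature_names(5)
print(names[:6])
```

Passing these names when building the DataFrame, for example `pd.DataFrame(X, columns=feature_names(n_obs))`, would replace the integer column labels with readable ones.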

In [89]:
data = pd.DataFrame(X)
data['label'] = y.astype(int)
data['field_id'] = field_ids
data = data[data.label != 0]  # Drop pixels that have no label (or corresponding field ID)
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,label,field_id
2048,29.0,40.0,57.0,29.0,41.0,60.0,24.0,33.0,52.0,27.0,...,12.0,86.0,17.0,11.0,94.0,15.0,10.0,90.0,2,3020.0
2304,30.0,41.0,57.0,30.0,39.0,58.0,25.0,34.0,51.0,29.0,...,13.0,86.0,17.0,12.0,95.0,15.0,10.0,91.0,2,3020.0
2560,30.0,43.0,58.0,30.0,43.0,61.0,25.0,36.0,55.0,30.0,...,16.0,85.0,18.0,13.0,92.0,14.0,11.0,85.0,2,3020.0
2561,32.0,43.0,63.0,30.0,44.0,64.0,24.0,33.0,52.0,30.0,...,13.0,90.0,17.0,11.0,97.0,14.0,10.0,94.0,2,3020.0
2816,28.0,42.0,58.0,32.0,42.0,60.0,25.0,34.0,53.0,28.0,...,16.0,82.0,19.0,16.0,89.0,17.0,16.0,93.0,2,3020.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6553595,44.0,65.0,87.0,35.0,55.0,78.0,41.0,62.0,85.0,39.0,...,59.0,78.0,34.0,47.0,66.0,32.0,47.0,67.0,5,9441.0
6553596,41.0,60.0,84.0,32.0,51.0,76.0,38.0,58.0,83.0,35.0,...,54.0,75.0,31.0,43.0,60.0,30.0,44.0,63.0,5,9441.0
6553597,40.0,60.0,83.0,32.0,51.0,76.0,35.0,55.0,81.0,34.0,...,54.0,77.0,30.0,42.0,62.0,30.0,44.0,65.0,5,9441.0
6553598,40.0,59.0,82.0,32.0,50.0,76.0,36.0,56.0,80.0,33.0,...,53.0,75.0,29.0,42.0,60.0,28.0,41.0,63.0,5,9441.0


## Building the Model

In [90]:
# Each field has several pixels in the data. Here our goal is to build a Random Forest (RF) model using the average values
# of the pixels within each field. So, we use `groupby` to take the mean for each field_id
data_grouped = data.groupby('field_id').mean().reset_index()
data_grouped

Unnamed: 0,field_id,0,1,2,3,4,5,6,7,8,...,21,22,23,24,25,26,27,28,29,label
0,0.0,35.250000,49.500000,78.250000,33.500000,48.250000,74.000000,33.250000,46.500000,70.000000,...,28.750000,36.500000,78.250000,29.750000,37.000000,78.750000,28.500000,36.000000,77.250000,6
1,29.0,28.278689,40.229508,68.836066,23.885246,34.213115,56.737705,25.065574,34.868852,55.147541,...,20.393443,24.426230,48.491803,21.295082,25.918033,51.098361,21.180328,26.245902,51.770492,4
2,78.0,16.300000,19.042857,56.971429,16.757143,19.128571,60.671429,14.714286,17.485714,60.614286,...,12.628571,16.914286,41.742857,23.671429,28.642857,60.057143,23.100000,30.071429,58.671429,4
3,92.0,20.504950,27.009901,66.831683,19.237624,24.960396,61.118812,19.306931,24.673267,61.801980,...,16.247525,12.801980,74.405941,15.891089,12.653465,79.237624,17.732673,13.683168,85.950495,1
4,104.0,21.632110,32.996330,65.098165,18.666972,28.722936,52.577064,18.116514,28.635780,52.334862,...,12.444037,10.877982,55.035780,17.665138,17.108257,80.156881,18.035780,16.777982,76.982569,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3393,122419.0,21.269912,28.845133,66.296460,20.601770,27.411504,60.407080,19.376106,26.141593,55.265487,...,17.318584,16.800885,54.862832,17.203540,17.008850,58.915929,16.893805,16.088496,58.225664,4
3394,122436.0,64.006579,85.838816,105.638158,56.578947,77.072368,97.736842,62.203947,83.967105,105.378289,...,54.957237,70.437500,98.907895,42.730263,55.476974,81.796053,48.256579,63.036184,91.516447,5
3395,122615.0,28.555556,39.755556,64.244444,28.874074,41.118519,65.696296,27.000000,36.607407,60.037037,...,18.385185,21.237037,61.503704,19.125926,20.888889,63.229630,19.400000,21.170370,65.688889,2
3396,122704.0,23.972167,32.038767,53.290258,22.744533,29.574553,48.015905,22.895626,29.428429,51.883698,...,15.236581,13.104374,58.981113,19.040755,15.754473,71.596421,18.061630,13.360835,72.826044,5


In [91]:
# Split train and test
# We use field_ids to split the data into train and test sets. Note that this held-out test set, taken from the
# training data, is different from the test set provided as part of the competition.
train_per = 0.7

n_fields = len(data_grouped['field_id'])
np.random.seed(10)
train_fields = np.random.choice(data_grouped['field_id'], int(n_fields * train_per), replace=False)
test_fields = data_grouped['field_id'][~np.isin(data_grouped['field_id'], train_fields)]

In [92]:
X_train, X_test = data_grouped[data_grouped['field_id'].isin(train_fields)], data_grouped[data_grouped['field_id'].isin(test_fields)]
X_train = X_train.drop(columns=['label', 'field_id'])
X_test = X_test.drop(columns=['label', 'field_id'])
y_train, y_test = data_grouped[data_grouped['field_id'].isin(train_fields)]['label'], data_grouped[data_grouped['field_id'].isin(test_fields)]['label']
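
`StratifiedShuffleSplit` is imported at the top of the notebook but not used by the random split above. If the crop classes are imbalanced, a stratified split keeps the class proportions similar in both parts. The following is a minimal sketch on made-up data, with one row standing in for one field; the array names and class counts are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(10)
# Made-up, imbalanced labels: 60 fields of class 1, 30 of class 2, 10 of class 3
labels = np.repeat([1, 2, 3], [60, 30, 10])
features = rng.normal(size=(labels.size, 4))

splitter = StratifiedShuffleSplit(n_splits=1, train_size=0.7, random_state=10)
train_idx, test_idx = next(splitter.split(features, labels))

print(np.bincount(labels[train_idx])[1:])  # per-class counts in the training part
print(np.bincount(labels[test_idx])[1:])   # per-class counts in the held-out part
```

With the notebook's data, the same splitter could be applied to the rows of `data_grouped` using its `label` column as the stratification target.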

In [93]:
# We ran a simple hyperparameter search over the number of trees and settled on:
n_trees = 50

In [94]:
# Fitting the RF model
rf = RandomForestClassifier(n_estimators = n_trees, random_state = 0, n_jobs = 3)
rf.fit(X_train, y_train.astype(int))

RandomForestClassifier(n_estimators=50, n_jobs=3, random_state=0)
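
The held-out split created earlier (`X_test`, `y_test`) is not scored anywhere in this notebook, although `log_loss` is imported. A hedged sketch of such an evaluation is shown below on synthetic data, so that it runs on its own; the `*_demo` arrays and the way the classes are made separable are assumptions, not part of the dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
y_demo = rng.integers(1, 4, size=300)       # made-up classes 1, 2 and 3
X_demo[np.arange(300), y_demo - 1] += 2.0   # shift one feature per class so the classes are learnable

X_tr, X_te = X_demo[:200], X_demo[200:]
y_tr, y_te = y_demo[:200], y_demo[200:]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)

# predict_proba returns one column per entry of model.classes_
proba = model.predict_proba(X_te)
loss = log_loss(y_te, proba, labels=model.classes_)
print(f'Held-out log loss: {loss:.3f}')
```

In this notebook the equivalent call would be `log_loss(y_test.astype(int), rf.predict_proba(X_test), labels=rf.classes_)`.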

## Competition Test Data

In this part we will load the competition test data (which does not have labels) and predict the crop class for each field.

In [95]:
tile_ids_test = competition_test_df['tile_id'].unique()

In [97]:
X_competition_test = np.empty((0, 3 * n_obs))
field_ids_test = np.empty((0, 1))

for tile_id in tile_ids_test:
    tile_df = competition_test_df[competition_test_df['tile_id']==tile_id]
    
    field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
    field_id_array = field_id_src.read(1)
    field_ids_test = np.append(field_ids_test, field_id_array.flatten())
    
    tile_date_times = tile_df[tile_df['satellite_platform']=='s2']['datetime'].unique()
    
    X_tile = np.empty((256 * 256, 0))
    n_X = 0
    for date_time in tile_date_times:
        # Read the cloud mask first. If the image is cloud-free, load the other bands;
        # otherwise move on to the next observation.
        
        clm_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='CLM')]['file_path'].values[0])
        clm_max = np.max(clm_src.read(1))
        
        if clm_max < 25:
            n_X+=1
            b3_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='B03')]['file_path'].values[0])
            b3_array = np.expand_dims(b3_src.read(1).flatten(), axis=1)

            b4_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='B04')]['file_path'].values[0])
            b4_array = np.expand_dims(b4_src.read(1).flatten(), axis=1)

            b8_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='B08')]['file_path'].values[0])
            b8_array = np.expand_dims(b8_src.read(1).flatten(), axis=1)


            X_tile = np.append(X_tile, b3_array, axis = 1)
            X_tile = np.append(X_tile, b4_array, axis = 1)
            X_tile = np.append(X_tile, b8_array, axis = 1)
        if n_X == n_obs:
            break
        
    X_competition_test = np.append(X_competition_test, X_tile, axis=0)

In [98]:
data_test = pd.DataFrame(X_competition_test)
data_test['field_id'] = field_ids_test
data_test = data_test[data_test.field_id != 0]
data_test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,21,22,23,24,25,26,27,28,29,field_id
72,44.0,62.0,84.0,45.0,61.0,82.0,47.0,61.0,85.0,47.0,...,43.0,60.0,85.0,24.0,33.0,60.0,27.0,37.0,68.0,102896.0
73,45.0,62.0,84.0,42.0,60.0,80.0,46.0,65.0,88.0,45.0,...,47.0,65.0,91.0,25.0,35.0,59.0,28.0,40.0,64.0,102896.0
74,41.0,60.0,80.0,42.0,58.0,77.0,44.0,62.0,84.0,44.0,...,45.0,63.0,89.0,23.0,35.0,55.0,26.0,40.0,61.0,102896.0
75,43.0,59.0,80.0,43.0,58.0,78.0,44.0,62.0,82.0,43.0,...,47.0,67.0,92.0,22.0,34.0,53.0,25.0,39.0,61.0,102896.0
76,44.0,61.0,80.0,44.0,60.0,79.0,44.0,61.0,84.0,45.0,...,50.0,69.0,96.0,23.0,34.0,52.0,26.0,41.0,61.0,102896.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6553595,39.0,54.0,75.0,40.0,56.0,77.0,36.0,51.0,67.0,31.0,...,16.0,12.0,81.0,19.0,14.0,87.0,16.0,12.0,89.0,88366.0
6553596,38.0,53.0,73.0,38.0,53.0,75.0,35.0,48.0,67.0,28.0,...,16.0,11.0,88.0,16.0,12.0,95.0,15.0,11.0,90.0,88366.0
6553597,36.0,52.0,71.0,38.0,52.0,72.0,33.0,46.0,66.0,26.0,...,16.0,9.0,97.0,15.0,9.0,106.0,13.0,7.0,102.0,88366.0
6553598,34.0,48.0,72.0,35.0,51.0,71.0,32.0,44.0,65.0,27.0,...,15.0,9.0,91.0,14.0,8.0,98.0,13.0,8.0,100.0,88366.0


In [99]:
data_test_grouped = data_test.groupby('field_id').mean().reset_index()
data_test_grouped

Unnamed: 0,field_id,0,1,2,3,4,5,6,7,8,...,20,21,22,23,24,25,26,27,28,29
0,56.0,34.798817,48.591716,72.260355,33.911243,47.988166,71.195266,27.236686,39.810651,60.757396,...,57.319527,18.431953,15.739645,83.988166,17.733728,14.579882,79.106509,19.952663,17.739645,84.769231
1,60.0,42.573991,59.477578,75.878924,45.031390,63.495516,81.659193,30.466368,46.995516,65.271300,...,93.013453,19.829596,14.504484,93.795964,15.134529,7.134529,126.852018,14.878924,7.751121,116.264574
2,97.0,39.850665,56.858058,82.050271,34.743223,51.089207,73.808280,36.003943,52.457368,75.269591,...,59.341055,34.829473,45.451454,80.078857,34.769345,42.734845,85.749138,32.508132,39.586003,84.771809
3,103.0,30.953740,49.539370,70.334646,32.636811,52.448819,74.428150,32.040354,51.937992,73.218504,...,68.678150,22.397638,34.785433,57.898622,21.320866,34.351378,58.513780,24.099409,36.169291,61.146654
4,123.0,44.265934,64.771429,87.507692,44.254945,62.773626,84.879121,37.591209,53.936264,69.639560,...,53.962637,23.472527,23.545055,73.571429,23.727473,22.767033,78.646154,24.773626,24.989011,82.648352
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2884,122658.0,43.159352,60.002582,71.037785,43.965501,62.803567,76.265431,38.128843,54.178127,65.432528,...,64.230228,20.162403,21.843933,59.897677,19.902136,16.440272,86.352265,18.163107,13.247125,96.404600
2885,122689.0,30.416058,51.912409,65.518248,31.270073,53.102190,70.868613,34.182482,56.350365,72.299270,...,56.204380,23.781022,41.379562,61.671533,24.233577,44.167883,70.175182,28.744526,47.554745,72.175182
2886,122698.0,33.542553,46.170213,69.861702,32.829787,45.691489,68.563830,31.648936,42.617021,62.276596,...,54.297872,16.808511,20.861702,51.191489,16.882979,14.882979,73.946809,16.946809,12.989362,84.117021
2887,122703.0,39.417191,55.907757,83.333333,41.482180,60.289308,86.528302,41.285115,58.498952,79.402516,...,91.777778,26.677149,25.035639,98.819706,26.127883,24.452830,96.819706,23.062893,19.888889,94.790356


In [100]:
y_competition_prob = rf.predict_proba(data_test_grouped.drop(columns=['field_id']))

In [101]:
# In this part we format the DataFrame to have column names and order similar to the sample submission file. 
pred_df = pd.DataFrame(y_competition_prob)
pred_df = pred_df.rename(columns={
    0:'Crop_ID_1',
    1:'Crop_ID_2', 
    2:'Crop_ID_3',
    3:'Crop_ID_4',
    4:'Crop_ID_5',
    5:'Crop_ID_6',
    6:'Crop_ID_7',
    7:'Crop_ID_8',
    8:'Crop_ID_9'
})
pred_df['field_id']=data_test_grouped['field_id']
pred_df = pred_df[['field_id', 'Crop_ID_1', 'Crop_ID_2', 'Crop_ID_3', 'Crop_ID_4', 'Crop_ID_5', 'Crop_ID_6', 'Crop_ID_7', 'Crop_ID_8', 'Crop_ID_9']]
pred_df

Unnamed: 0,field_id,Crop_ID_1,Crop_ID_2,Crop_ID_3,Crop_ID_4,Crop_ID_5,Crop_ID_6,Crop_ID_7,Crop_ID_8,Crop_ID_9
0,56.0,0.04,0.62,0.00,0.18,0.02,0.02,0.10,0.00,0.02
1,60.0,0.06,0.06,0.04,0.00,0.02,0.32,0.48,0.02,0.00
2,97.0,0.04,0.12,0.08,0.12,0.30,0.06,0.04,0.04,0.20
3,103.0,0.04,0.14,0.16,0.08,0.20,0.12,0.02,0.00,0.24
4,123.0,0.22,0.14,0.08,0.12,0.02,0.18,0.22,0.02,0.00
...,...,...,...,...,...,...,...,...,...,...
2884,122658.0,0.18,0.06,0.02,0.00,0.00,0.10,0.62,0.02,0.00
2885,122689.0,0.08,0.26,0.08,0.02,0.36,0.08,0.00,0.02,0.10
2886,122698.0,0.08,0.22,0.00,0.10,0.04,0.06,0.42,0.04,0.04
2887,122703.0,0.26,0.14,0.02,0.26,0.02,0.16,0.04,0.08,0.02
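
The rename above assumes that column `i` of the `predict_proba` output corresponds to crop class `i + 1`, which holds only when the classifier's `classes_` attribute is exactly 1 through 9. A more defensive sketch that names the columns from the class labels, and fills any class the model never saw with zeros, could look like this. `proba_to_submission` is a hypothetical helper, and the default range of class IDs (1 to 9) is taken from the column names used above.

```python
import numpy as np
import pandas as pd

def proba_to_submission(proba, classes, field_ids, all_classes=range(1, 10)):
    """Build a submission-style frame with one Crop_ID_<class> column per class."""
    df = pd.DataFrame(proba, columns=[f'Crop_ID_{int(c)}' for c in classes])
    expected = [f'Crop_ID_{c}' for c in all_classes]
    # Add missing classes as all-zero columns and enforce the column order
    df = df.reindex(columns=expected, fill_value=0.0)
    df.insert(0, 'field_id', np.asarray(field_ids))
    return df

# Made-up example: a model that only ever saw classes 2 and 5
demo = proba_to_submission(
    proba=np.array([[0.7, 0.3], [0.1, 0.9]]),
    classes=[2, 5],
    field_ids=[56, 60],
)
print(demo)
```

With the fitted model, the call would be `proba_to_submission(y_competition_prob, rf.classes_, data_test_grouped['field_id'])`.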


In [102]:
# Write the predicted probabilities to a CSV file for submission
pred_df.to_csv('baseline_submission.csv', index=False)