<a href="https://colab.research.google.com/github/sameeraltaf/Data-science-progress/blob/main/Starter_notebook_checkpoint_HPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='https://radiant-assets.s3-us-west-2.amazonaws.com/PrimaryRadiantMLHubLogo.png' alt='Radiant MLHub Logo' width='300'/>

# A Baseline Model for the AgrifieldNet India Competition

This notebook walks you through the steps to load the data and build a baseline Random Forest model for the `AgrifieldNet India Competition`.

## Radiant MLHub API


The Radiant MLHub API gives access to open Earth imagery training data for machine learning applications. You can learn more about the repository at the [Radiant MLHub site](https://mlhub.earth) and about the organization behind it at the [Radiant Earth Foundation site](https://radiant.earth).

Full documentation for the API is available at [docs.mlhub.earth](https://docs.mlhub.earth).

Each item in our collection is described in JSON format compliant with the [STAC](https://stacspec.org/) [label extension](https://github.com/radiantearth/stac-spec/tree/master/extensions/label) definition.

## Dependencies

This notebook uses the [`radiant-mlhub` Python client](https://pypi.org/project/radiant-mlhub/) to interact with the API, along with the [`pandas` library](https://pandas.pydata.org/). If you are running this notebook using Binder, these dependencies have already been installed. If you are running it locally, you will need to install them yourself.

See the official [`radiant-mlhub` docs](https://radiant-mlhub.readthedocs.io/) for more documentation of the full functionality of that library.

In [17]:
#import libraries
!pip install radiant_mlhub
!pip install rasterio
import os
import glob
import json
import getpass
import rasterio
import numpy as np
import pandas as pd
from tqdm import tqdm
from radiant_mlhub import Dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# DOWNLOAD DATA FROM MLHUB

In [18]:
# This version keeps all 12 bands; a simpler baseline could use only 'B02', 'B03', 'B04', 'B08'

Full_bands = ['B01', 'B02', 'B03', 'B04','B05', 'B06', 'B07', 'B08','B8A', 'B09', 'B11', 'B12']

selected_bands = Full_bands[0:-1]  + [Full_bands[11]]  # every band but the last, plus the last: all 12
selected_bands

['B01',
 'B02',
 'B03',
 'B04',
 'B05',
 'B06',
 'B07',
 'B08',
 'B8A',
 'B09',
 'B11',
 'B12']
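The band list above can also be narrowed by name instead of by position. The following is a minimal sketch, assuming Sentinel-2 style band naming; the `wanted` subset and the variable names are illustrative and not part of the original notebook.

```python
# All 12 band names, as listed in the cell above
full_bands = ['B01', 'B02', 'B03', 'B04', 'B05', 'B06',
              'B07', 'B08', 'B8A', 'B09', 'B11', 'B12']

# Hypothetical subset chosen for illustration
wanted = {'B02', 'B03', 'B04', 'B08'}

# Keep the original ordering so the feature columns stay predictable
subset_bands = [band for band in full_bands if band in wanted]
print(subset_bands)
```

Selecting by name avoids silent mistakes when slice indices drift away from the bands they were meant to pick.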

In [19]:
#!pip install matplotlib
#%matplotlib inline

In [20]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [21]:
main = '/content/drive/MyDrive/ref_agrifieldnet_competition_v1'

assets = ['field_ids','raster_labels']

# collection folder names (no leading slash, so joined paths contain no double slashes)
source_collection = 'ref_agrifieldnet_competition_v1_source'
train_label_collection = 'ref_agrifieldnet_competition_v1_labels_train'
test_label_collection = 'ref_agrifieldnet_competition_v1_labels_test'

## Prepare Train data

- Load collection.json from the labels_train collection's path and retrieve all unique folder ids into a list.
- Use the unique folder ids to create a list of field_ids.tif and raster_labels.tif paths for all tiles.
- Create the competition_train_data dataframe from folder_ids and field_paths.
- Create the field_crop_pair dataframe using field_crop_extractor.
- Create the train_data dataframe using feature_extractor with arguments (competition_train_data, source_collection).
- Group the processed dataset by field and compute the pixel average across each field.
- Merge the grouped train_data dataframe and the field_crop_pair dataframe on field_id.
- Split the train_df dataframe for model training and evaluation.

In [22]:
os.path.exists('/content/drive/MyDrive/ref_agrifieldnet_competition_v1/ref_agrifieldnet_competition_v1_labels_train/collection.json')

True

In [23]:
#load collection json and retrieve all unique folder ids 
#use all unique folder ids to create a list of field and label paths for all tiles

with open (f'{main}/{train_label_collection}/collection.json') as f:
    train_json = json.load(f)
    
train_folder_ids = [i['href'].split('_')[-1].split('.')[0] for i in train_json['links'][4:]]

train_field_paths = [f'{main}/{train_label_collection}/{train_label_collection}_{i}/field_ids.tif' for i in train_folder_ids]
train_label_paths = [f'{main}/{train_label_collection}/{train_label_collection}_{i}/raster_labels.tif' for i in train_folder_ids]
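The cell above skips the first four links by position. As a hedged alternative, the sketch below filters links by their `rel` value, assuming item links carry `rel == 'item'` as in the STAC specification. The sample document, its hrefs, and the helper name `folder_ids_from_links` are made up for illustration.

```python
# Hypothetical collection document shaped like the one loaded above;
# the hrefs and ids are invented for this example.
sample_collection = {
    'links': [
        {'rel': 'self', 'href': 'collection.json'},
        {'rel': 'root', 'href': '../catalog.json'},
        {'rel': 'item', 'href': 'labels_train_28852/labels_train_28852.json'},
        {'rel': 'item', 'href': 'labels_train_d987c/labels_train_d987c.json'},
    ]
}

def folder_ids_from_links(collection):
    """Return the trailing id of every link whose rel is 'item'."""
    ids = []
    for link in collection['links']:
        if link.get('rel') != 'item':
            continue
        # '..._28852.json' -> '28852'
        ids.append(link['href'].split('_')[-1].split('.')[0])
    return ids

sample_ids = folder_ids_from_links(sample_collection)
print(sample_ids)
```

Filtering on `rel` does not depend on how many non-item links precede the items.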

In [24]:
#create dataset for folder_ids and field_paths

competition_train_data = pd.DataFrame(train_folder_ids, columns=['unique_folder_id'])
competition_train_data['field_paths'] = train_field_paths
competition_train_data.head()

Unnamed: 0,unique_folder_id,field_paths
0,28852,/content/drive/MyDrive/ref_agrifieldnet_compet...
1,d987c,/content/drive/MyDrive/ref_agrifieldnet_compet...
2,ca1d4,/content/drive/MyDrive/ref_agrifieldnet_compet...
3,2ec18,/content/drive/MyDrive/ref_agrifieldnet_compet...
4,7575d,/content/drive/MyDrive/ref_agrifieldnet_compet...


In [25]:
# PREPROCESS FIELDS AND CROPS IN TILES FOR TRAINING

In [26]:
#Extract field_crop Pairs 

def field_crop_extractor(crop_field_files):
    field_crops = {}

    for label_field_file in tqdm(crop_field_files):
        with rasterio.open(f'{main}/{train_label_collection}/{train_label_collection}_{label_field_file}/field_ids.tif') as src:
            field_data = src.read()[0]
        with rasterio.open(f'{main}/{train_label_collection}/{train_label_collection}_{label_field_file}/raster_labels.tif') as src:
            crop_data = src.read()[0]
    
        for x in range(0, crop_data.shape[0]):
            for y in range(0, crop_data.shape[1]):
                field_id = str(field_data[x][y])
                field_crop = crop_data[x][y]

                if field_crops.get(field_id) is None:
                    field_crops[field_id] = []

                if field_crop not in field_crops[field_id]:
                    field_crops[field_id].append(field_crop)
    
    field_crop_map  =[[k, v[0]]  for k, v in field_crops.items() ]
    field_crop = pd.DataFrame(field_crop_map , columns=['field_id','crop_id'])

    return field_crop[field_crop['field_id']!='0']
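The pixel-by-pixel loop above takes several minutes over all tiles. A vectorized version for a single tile can be sketched with `np.unique`, keeping the same rule of using the first crop label seen for each field in row-major order. This is a sketch on synthetic arrays, not the notebook's implementation; `field_crop_pairs` is a hypothetical helper, and unlike the original it does not carry state across tiles.

```python
import numpy as np
import pandas as pd

def field_crop_pairs(field_data, crop_data):
    """Map each non-zero field id to the first crop label seen in row-major order."""
    fields = field_data.ravel()
    crops = crop_data.ravel()
    # return_index gives the position of the first occurrence of each field id
    unique_fields, first_idx = np.unique(fields, return_index=True)
    pairs = pd.DataFrame({
        'field_id': unique_fields.astype(str),
        'crop_id': crops[first_idx],
    })
    # field id 0 marks background pixels, as in the original function
    return pairs[pairs['field_id'] != '0'].reset_index(drop=True)

# Synthetic 3x3 tile: 0 is background, fields 7 and 9 carry crops 4 and 6
field_tile = np.array([[0, 7, 7], [0, 7, 9], [9, 9, 0]])
crop_tile = np.array([[0, 4, 4], [0, 4, 6], [6, 6, 0]])
pairs = field_crop_pairs(field_tile, crop_tile)
print(pairs)
```

To cover all tiles, the per-tile frames would need to be concatenated and de-duplicated on `field_id`, keeping the first row for each field.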

In [27]:
field_crop_pair = field_crop_extractor(train_folder_ids)
field_crop_pair.head()

100%|██████████| 1165/1165 [09:07<00:00,  2.13it/s]


Unnamed: 0,field_id,crop_id
1,757,6
2,756,6
3,1372,5
4,1374,1
5,1986,4


In [28]:
field_crop_pair.shape

(5551, 2)

In [29]:
# Our goal is to develop a Random Forest model from pixel values. So we create an X variable
# in which each row is a pixel and each column is one band observation, mapped to its corresponding field.


img_sh = 256
n_selected_bands= len(selected_bands)

n_obs = 1  #imagery per chip(no time series)

def feature_extractor(data_, path):
    '''
    data_: DataFrame with 'field_paths' and 'unique_folder_id' columns
    path: name of the source collection folder

    returns: pixel DataFrame with the corresponding field_ids
    '''
    X_arrays = []
    field_id_arrays = []

    for idx, tile_id in tqdm(enumerate(data_['unique_folder_id']), total=len(data_)):

        # read the field-id raster for this tile; the context manager closes the file
        with rasterio.open(data_['field_paths'].values[idx]) as field_src:
            field_id_arrays.append(field_src.read(1).flatten())

        # read each selected band and flatten it into one column of pixels
        bands_array = []
        for band in selected_bands:
            with rasterio.open(f'{main}/{path}/{path}_{tile_id}/{band}.tif') as band_src:
                bands_array.append(np.expand_dims(band_src.read(1).flatten(), axis=1))

        # one row per pixel, one column per band
        X_arrays.append(np.hstack(bands_array))

    X = np.concatenate(X_arrays)

    data = pd.DataFrame(X, columns=selected_bands)

    # cast to float so field_id keeps the same dtype as before (e.g. 757.0)
    data['field_id'] = np.concatenate(field_id_arrays).astype(float)

    # field_id 0 marks pixels that belong to no field
    return data[data['field_id']!=0]
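The reshaping step inside `feature_extractor` can be seen in isolation on small arrays. The sketch below uses two synthetic 2x2 arrays in place of rasters read with rasterio; the names `band_a`, `band_b`, and `pixel_table` are illustrative.

```python
import numpy as np

# Two synthetic 2x2 "bands" standing in for rasters read from disk
band_a = np.array([[1, 2], [3, 4]])
band_b = np.array([[10, 20], [30, 40]])

# Flatten each band to a column vector, then place the columns side by side
columns = [np.expand_dims(band.flatten(), axis=1) for band in (band_a, band_b)]
pixel_table = np.hstack(columns)

print(pixel_table.shape)
print(pixel_table)
```

Each row of `pixel_table` holds the values of one pixel across all bands, which is the layout the classifier expects.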

In [30]:
train_data = feature_extractor(competition_train_data, source_collection)
train_data.head()

1165it [45:34,  2.35s/it]


Unnamed: 0,B01,B02,B03,B04,B05,B06,B07,B08,B8A,B09,B11,B12,field_id
11031,43,39,38,38,41,54,63,61,64,12,57,37,757.0
11287,43,39,38,38,42,57,67,63,72,12,63,42,757.0
11288,43,39,38,37,41,59,69,65,78,12,68,43,757.0
11289,43,38,37,36,41,59,69,64,78,12,68,43,757.0
11543,43,39,38,38,42,57,67,64,72,12,63,42,757.0


In [31]:
# Each field has several pixels in the data. Here our goal is to build a Random Forest (RF) model using the average values
# of the pixels within each field. So, we use `groupby` to take the mean for each field_id

train_data_grouped = train_data.groupby(['field_id']).mean().reset_index()
train_data_grouped.field_id = [str(int(i)) for i in train_data_grouped.field_id.values]
train_data_grouped.head()

Unnamed: 0,field_id,B01,B02,B03,B04,B05,B06,B07,B08,B8A,B09,B11,B12
0,1,45.0,42.444444,42.722222,48.0,49.666667,58.0,65.222222,60.277778,71.944444,12.0,80.277778,61.333333
1,2,45.0,42.0,42.166667,47.666667,49.25,59.916667,69.0,63.916667,76.333333,12.833333,79.916667,56.75
2,3,45.0,42.6875,43.5,49.1875,51.4375,62.875,71.625,66.625,79.3125,13.0,82.125,58.0625
3,4,45.866667,42.466667,43.8,47.733333,49.466667,59.733333,68.133333,62.6,73.466667,11.266667,77.6,55.0
4,5,46.0,43.238095,45.238095,49.285714,50.904762,60.904762,68.952381,63.380952,74.547619,11.333333,77.452381,55.809524
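The per-field averaging can be checked by hand on a toy table. This is a minimal sketch with invented pixel values; only the `groupby`/`mean` pattern mirrors the cell above.

```python
import pandas as pd

# Five made-up pixels belonging to two fields
pixels = pd.DataFrame({
    'field_id': [757.0, 757.0, 756.0, 756.0, 756.0],
    'B02': [39, 41, 30, 33, 36],
    'B03': [38, 40, 20, 20, 23],
})

# One row per field, holding the mean of each band
field_means = pixels.groupby('field_id').mean().reset_index()

# Convert float ids such as 757.0 to the string '757' used for merging
field_means['field_id'] = field_means['field_id'].astype(int).astype(str)
print(field_means)
```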


In [32]:
# merge pixel dataframe to field_crop_pair dataframe

train_df = pd.merge(train_data_grouped, field_crop_pair , on='field_id' )
train_df.head()

Unnamed: 0,field_id,B01,B02,B03,B04,B05,B06,B07,B08,B8A,B09,B11,B12,crop_id
0,1,45.0,42.444444,42.722222,48.0,49.666667,58.0,65.222222,60.277778,71.944444,12.0,80.277778,61.333333,1
1,2,45.0,42.0,42.166667,47.666667,49.25,59.916667,69.0,63.916667,76.333333,12.833333,79.916667,56.75,1
2,3,45.0,42.6875,43.5,49.1875,51.4375,62.875,71.625,66.625,79.3125,13.0,82.125,58.0625,1
3,4,45.866667,42.466667,43.8,47.733333,49.466667,59.733333,68.133333,62.6,73.466667,11.266667,77.6,55.0,2
4,5,46.0,43.238095,45.238095,49.285714,50.904762,60.904762,68.952381,63.380952,74.547619,11.333333,77.452381,55.809524,2
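An inner merge silently drops fields that appear on only one side. As a hedged check, the sketch below uses the `indicator` and `validate` options of `pd.merge` on invented data to list unmatched fields and to confirm that each `field_id` occurs once per table. Both frames must hold `field_id` with the same dtype (strings here) for the keys to match.

```python
import pandas as pd

# Made-up feature and label tables with one unmatched field on each side
features = pd.DataFrame({'field_id': ['1', '2', '3'], 'B02': [42.4, 42.0, 42.7]})
labels = pd.DataFrame({'field_id': ['1', '2', '4'], 'crop_id': [1, 1, 2]})

# An outer merge keeps every key; '_merge' records which side each row came from
checked = pd.merge(features, labels, on='field_id', how='outer',
                   indicator=True, validate='one_to_one')

unmatched = checked[checked['_merge'] != 'both']
print(unmatched[['field_id', '_merge']])
```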


In [33]:
train_df.tail(8)

Unnamed: 0,field_id,B01,B02,B03,B04,B05,B06,B07,B08,B8A,B09,B11,B12,crop_id
5543,7322,45.0,36.5,32.25,26.75,31.0,54.0,67.875,64.5,75.0,13.0,49.625,26.75,4
5544,7323,45.0,36.0,31.0,24.090909,27.818182,56.363636,75.454545,71.090909,82.181818,13.0,41.909091,20.909091,4
5545,7324,45.0,36.533333,31.666667,25.333333,29.133333,55.333333,70.8,67.4,77.6,13.866667,43.066667,21.466667,4
5546,7326,46.384615,39.0,34.576923,30.653846,33.153846,51.076923,62.538462,59.269231,66.846154,12.461538,50.615385,33.923077,9
5547,7327,46.0,37.851852,32.62963,26.555556,29.296296,51.185185,65.518519,63.925926,70.444444,12.777778,40.740741,21.518519,9
5548,7328,47.0,40.1,35.1,29.65,31.95,51.6,65.05,60.55,70.7,11.0,39.55,20.0,9
5549,7331,46.652174,40.130435,35.130435,30.0,32.608696,52.347826,66.347826,62.608696,71.521739,11.0,44.434783,25.217391,9
5550,7332,46.076923,39.653846,35.230769,29.423077,31.846154,49.615385,58.615385,53.346154,60.923077,10.461538,35.538462,18.615385,36


In [34]:
train_df.shape

(5551, 14)

In [35]:
# split data for model training and evaluation 

X_train, X_test, y_train, y_test =  train_test_split(train_df.drop(['field_id', 'crop_id'], axis=1), train_df['crop_id'] , test_size=0.3, random_state=24)
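With imbalanced crop classes, a plain random split can leave rare classes under-represented in one part. One option, shown here only as a sketch on synthetic data, is the `stratify` argument of `train_test_split`, which keeps class proportions similar in both parts. Note that stratification raises an error when a class has a single sample, so it may not apply directly to the competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, two classes with an 80/20 imbalance
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 80 + [2] * 20)

# stratify=y_demo preserves the 80/20 ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=24, stratify=y_demo)

print(np.bincount(y_tr)[1:], np.bincount(y_te)[1:])
```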

In [36]:
# random_state fixes the seed: with None the results vary between runs,
# while a fixed value such as 6 makes them reproducible
rf = RandomForestClassifier(random_state = 6)

In [37]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3885, 12), (1666, 12), (3885,), (1666,))

The model is first tested with its default hyperparameter settings. Further tuning is possible, for example a hyperparameter search with cross-validation.
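Cross-validation, as mentioned above, can be sketched with `cross_val_score`. The example runs on synthetic data from `make_classification` rather than the competition features, and the small `n_estimators` value is chosen only to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 12 band features and crop labels
X_cv, y_cv = make_classification(n_samples=200, n_features=12,
                                 n_informative=6, n_classes=3, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=6)

# Five folds: each fold is held out once while the model trains on the rest
scores = cross_val_score(clf, X_cv, y_cv, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```

The spread of the fold scores gives a rough sense of how much a single train/test split can mislead.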

In [38]:
rf.fit(X_train, y_train.astype(int))

RandomForestClassifier(random_state=6)

# MODEL TRAINING

In [39]:
import pandas as pd
import numpy as np 
from sklearn import ensemble
from sklearn import metrics
from sklearn import model_selection

# empty model; n_jobs=-1 lets the forest use all available cores
classifier = ensemble.RandomForestClassifier(n_jobs = -1)
# search space sampled by the randomized search
param_grid = {
    "n_estimators" : np.arange(100,1500,100),
    'criterion' : ['gini', 'entropy'],
    'max_depth': np.arange(2,20),
    'min_samples_leaf': np.arange(1,10,1),
    'min_samples_split' : np.arange(2,10,1),
    # note: 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
    'max_features': ['auto', 'log2']
}

In [40]:
from sklearn import model_selection
model = model_selection.RandomizedSearchCV(
    estimator= classifier,
    param_distributions= param_grid,
    scoring="accuracy",
    verbose=10,
    n_jobs = 1,
    cv=5,
)

0.6301158301158302
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 12, 'max_features': 'log2', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 7, 'min_samples_split': 6, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 1300, 'n_jobs': -1, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
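`RandomizedSearchCV` samples `n_iter` parameter combinations (10 by default) instead of trying the whole grid. The sketch below sets `n_iter` and `random_state` explicitly so that the sampled candidates are repeatable. It uses synthetic data and a deliberately small, hypothetical search space to stay fast; it is not the search configured above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the band features and crop labels
X_rs, y_rs = make_classification(n_samples=150, n_features=12,
                                 n_informative=6, n_classes=3, random_state=0)

# Small illustrative search space
small_grid = {
    'n_estimators': [20, 40],
    'max_depth': np.arange(2, 6),
    'max_features': ['sqrt', 'log2'],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=6),
    param_distributions=small_grid,
    n_iter=4,          # number of sampled combinations
    cv=3,
    scoring='accuracy',
    random_state=0,    # makes the sampled combinations repeatable
)
search.fit(X_rs, y_rs)

print(search.best_params_)
print(round(search.best_score_, 3))
```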

In [41]:
model.fit(X_train,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5; 1/10] START criterion=entropy, max_depth=19, max_features=auto, min_samples_leaf=3, min_samples_split=4, n_estimators=100
[CV 1/5; 1/10] END criterion=entropy, max_depth=19, max_features=auto, min_samples_leaf=3, min_samples_split=4, n_estimators=100;, score=0.629 total time=   4.2s
[CV 2/5; 1/10] START criterion=entropy, max_depth=19, max_features=auto, min_samples_leaf=3, min_samples_split=4, n_estimators=100
[CV 2/5; 1/10] END criterion=entropy, max_depth=19, max_features=auto, min_samples_leaf=3, min_samples_split=4, n_estimators=100;, score=0.642 total time=   2.3s
[CV 3/5; 1/10] START criterion=entropy, max_depth=19, max_features=auto, min_samples_leaf=3, min_samples_split=4, n_estimators=100
[CV 3/5; 1/10] END criterion=entropy, max_depth=19, max_features=auto, min_samples_leaf=3, min_samples_split=4, n_estimators=100;, score=0.650 total time=   2.3s
[CV 4/5; 1/10] START criterion=entropy, max_depth=19, max_fe

RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(n_jobs=-1), n_jobs=1,
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19]),
                                        'max_features': ['auto', 'log2'],
                                        'min_samples_leaf': array([1, 2, 3, 4, 5, 6, 7, 8, 9]),
                                        'min_samples_split': array([2, 3, 4, 5, 6, 7, 8, 9]),
                                        'n_estimators': array([ 100,  200,  300,  400,  500,  600,  700,  800,  900, 1000, 1100,
       1200, 1300, 1400])},
                   scoring='accuracy', verbose=10)

In [None]:
print(model.best_score_)
print(model.best_estimator_.get_params())

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# trained classes

model.classes_

# MODEL EVALUATION

In [None]:
from sklearn.metrics import classification_report
y_pred_crop = model.predict(X_test)

print(classification_report(y_test,y_pred_crop))
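Because the submission holds class probabilities rather than hard labels, a probability-based metric may be informative next to the classification report. The sketch below computes the multi-class log loss with `sklearn.metrics.log_loss` on invented values; whether this matches the competition's scoring is an assumption to verify against the competition rules.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up true labels and predicted probabilities for three classes
y_true_demo = np.array([1, 2, 2, 4])
class_order = [1, 2, 4]
proba_eval = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
])

# labels= tells log_loss which class each probability column refers to
loss = log_loss(y_true_demo, proba_eval, labels=class_order)
print(round(loss, 4))
```

With a fitted classifier, the probabilities would come from `predict_proba` and the column order from its `classes_` attribute.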

## Prepare Test data

- Load the collection JSON and retrieve all unique folder ids.
- Use the unique folder ids to create a list of field_ids.tif paths for all tiles.
- Create the competition_test_data dataframe from folder_ids and field_paths.
- Create the test_data dataframe using feature_extractor with arguments (competition_test_data, source_collection).
- Group the processed dataset by field and compute the pixel average across each field.

In [None]:
with open (f'{main}/{test_label_collection}/collection.json') as f:
    test_json = json.load(f)
    
test_folder_ids = [i['href'].split('_')[-1].split('.')[0] for i in test_json['links'][4:]]

test_field_paths = [f'{main}/{test_label_collection}/{test_label_collection}_{i}/field_ids.tif' for i in test_folder_ids]

In [None]:
competition_test_data = pd.DataFrame(test_folder_ids , columns=['unique_folder_id'])
competition_test_data['field_paths'] = test_field_paths
competition_test_data.head()

In [None]:
test_data = feature_extractor(competition_test_data,  source_collection)
test_data.head()

In [None]:
# Each field has several pixels in the data. As with the training data, we use `groupby` to take
# the mean of the pixel values for each field_id, so the model receives one row per field

test_data_grouped = test_data.groupby(['field_id']).mean().reset_index()
test_data_grouped.field_id = [str(int(i)) for i in test_data_grouped.field_id.values]
test_data_grouped

# Submit predictions with field_ids and class probabilities

- run predictions with the trained model
- write the per-class probability predictions to a table with one row per field_id
- save the output file as submission.csv

In [None]:
# extract crop_id-label dictionary

with open('/content/drive/MyDrive/ref_agrifieldnet_competition_v1/ref_agrifieldnet_competition_v1_labels_train/ref_agrifieldnet_competition_v1_labels_train_001c1/ref_agrifieldnet_competition_v1_labels_train_001c1.json') as ll:
    label_json = json.load(ll)

In [None]:
crop_dict = {asset.get('values')[0]:asset.get('summary') for asset in label_json['assets']['raster_labels']['file:values']}

In [None]:
crop_dict
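The dictionary comprehension above can be tried on a small stand-in document. The sample below only mimics the nesting that the cell accesses (`assets` → `raster_labels` → `file:values`); the crop names are placeholders, not the competition's real labels.

```python
# Hypothetical label item with the same nesting as the one loaded above;
# 'crop_a' and 'crop_b' are placeholder names.
sample_label_item = {
    'assets': {
        'raster_labels': {
            'file:values': [
                {'values': [1], 'summary': 'crop_a'},
                {'values': [2], 'summary': 'crop_b'},
            ]
        }
    }
}

# Map the first value of each entry to its human-readable summary
sample_crop_dict = {
    entry['values'][0]: entry['summary']
    for entry in sample_label_item['assets']['raster_labels']['file:values']
}

print(sample_crop_dict)
# Keys are integers here, so a string lookup finds nothing
print(sample_crop_dict.get(1), sample_crop_dict.get('1'))
```

The key type matters later: lookups must use the same type (integer or string) that the JSON values were stored in.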

In [None]:
def labeler(labeled):
    # look up with integer keys, matching how crop_dict is queried for the submission columns
    crop_label = np.array([crop_dict.get(int(i)) for i in labeled])
    return crop_label

In [None]:
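# Predict class probabilities with the default-parameter forest `rf`.
# To submit the tuned search result instead, use `model` for both predict_proba and classes_.
predictions = rf.predict_proba(test_data_grouped.drop('field_id', axis=1 ))

# take the column order from the same estimator that produced the probabilities
crop_columns = [crop_dict.get(i) for i in rf.classes_]

test_df  = pd.DataFrame(columns= ['field_id'] + crop_columns)

test_df['field_id'] = test_data_grouped.field_id

test_df[crop_columns]= predictions 
test_df.to_csv('/content/drive/MyDrive/submission.csv', index=False)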
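Before uploading, it can help to confirm that the probability columns line up with the class order and that each row sums to one. The following is a minimal sketch on invented values; the field ids, class ids, and crop names are placeholders.

```python
import numpy as np
import pandas as pd

# Placeholder inputs standing in for the real predictions
sub_field_ids = ['11', '12']
sub_classes = np.array([1, 2, 4])                       # order of a classifier's classes_
sub_crop_names = {1: 'crop_a', 2: 'crop_b', 4: 'crop_c'}
sub_proba = np.array([[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]])

# Name each probability column after the class it belongs to
sub_columns = [sub_crop_names[int(c)] for c in sub_classes]
submission = pd.DataFrame(sub_proba, columns=sub_columns)
submission.insert(0, 'field_id', sub_field_ids)

# Every row of probabilities should sum to 1
rows_ok = np.allclose(submission[sub_columns].sum(axis=1), 1.0)
print(submission)
print(rows_ok)
```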

In [None]:
test_df.head()