<a href="https://colab.research.google.com/github/yasminemasmoudi/geoai-hack-2022-crop-type-classification-challenge/blob/master/Geo_AI_Starter_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Starter Notebook: GEO AI Hackathon!


Welcome! This starter notebook is designed to get you started on this challenge, in which you will use time-series Sentinel-2 multispectral data to classify crops. We will take a look at the data, train a model, and use it to make our first submission. After that, we will briefly look at some ways to improve. Let's get started.

---



# IMPORTS

In [1]:
import pandas as pd
import numpy as np

from sklearn.metrics import log_loss , accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

import warnings
warnings.simplefilter('ignore')

# Loading the Data

We're using the pandas library to load the data into dataframes - a tabular data structure that is perfect for this kind of work. Each of the three CSV files from Zindi is loaded into a dataframe and we take a look at the shape of the data (number of rows and columns) as well as a preview of the first 5 rows to get a feel for what we're working with.

---




* If you are not familiar with pandas, I highly recommend watching this [video](https://youtu.be/fwWCw_cE5aI) ✌

In [None]:
train = pd.read_csv('Train.csv')
print(train.shape)
train.head()

(1004, 1202)


,ID,Target,timestep1_B02_mean,timestep1_B03_mean,timestep1_B04_mean,timestep1_B05_mean,timestep1_B06_mean,timestep1_B07_mean,timestep1_B08_mean,timestep1_B8A_mean,...,timestep24_B02_min,timestep24_B03_min,timestep24_B04_min,timestep24_B05_min,timestep24_B06_min,timestep24_B07_min,timestep24_B08_min,timestep24_B8A_min,timestep24_B11_min,timestep24_B12_min
0,6fba03cb,olive+cereals,456.918182,799.361364,825.761364,1420.606818,2478.922727,2882.245455,3084.704545,1989.954545,...,113.0,562.0,206.0,1193.0,1497.0,1588.0,1644.0,2060.0,1157.0,1630.0
1,7ea60a74,plowing_and_sowing,1087.130208,1653.770833,2194.458333,2447.984375,2588.328125,2725.609375,2777.385417,3698.088542,...,328.0,913.0,746.0,1468.0,1523.0,1603.0,1782.0,2308.0,1429.0,1741.0
2,64d18595,olive,917.369369,1383.882883,1742.765766,2012.234234,2388.657658,2592.783784,2644.738739,3111.837838,...,113.0,587.0,312.0,1161.0,2050.0,2182.0,2238.0,2016.0,1134.0,2372.0
3,119c8ec4,arable_soil,1013.77512,1551.863636,2073.009569,2371.222488,2666.064593,2843.196172,2930.330144,3802.370813,...,212.0,592.0,810.0,1289.0,1655.0,1796.0,1824.0,2418.0,1708.0,1965.0
4,d884d98c,olive,801.352113,1243.741784,1557.410798,1827.133803,2248.71831,2433.793427,2471.901408,2846.659624,...,84.0,326.0,480.0,790.0,1132.0,1207.0,1194.0,1595.0,1085.0,1283.0


In train, we have a set of inputs over 24 different time steps and our desired output variable, 'Target'. There are 1004 rows and 1202 columns (the ID, the Target and 1200 features) - lots of juicy data!
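
Judging from the preview above, the feature names appear to follow a `timestep<k>_<band>_<statistic>` pattern (for example `timestep1_B02_mean` or `timestep24_B12_min`). The following is a minimal sketch, under that assumption, of how a name could be split into its parts. `parse_feature_name` is a hypothetical helper, not part of the starter code.

```python
import re

# Assumed naming pattern, inferred from the preview: timestep<k>_<band>_<statistic>
FEATURE_PATTERN = re.compile(r"^timestep(\d+)_([A-Za-z0-9]+)_(\w+)$")

def parse_feature_name(name):
    """Return (timestep, band, statistic), or None if the name does not match."""
    match = FEATURE_PATTERN.match(name)
    if match is None:
        return None
    timestep, band, stat = match.groups()
    return int(timestep), band, stat

example_cols = ["ID", "Target", "timestep1_B02_mean", "timestep24_B8A_min"]
for col in example_cols:
    print(col, "->", parse_feature_name(col))
```

Grouping columns this way makes it easier to select a single band across all time steps, or all bands at one time step.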

In [None]:
test = pd.read_csv('Test.csv')
print(test.shape)
test.head()

(502, 1201)


,ID,timestep1_B02_mean,timestep1_B03_mean,timestep1_B04_mean,timestep1_B05_mean,timestep1_B06_mean,timestep1_B07_mean,timestep1_B08_mean,timestep1_B8A_mean,timestep1_B11_mean,...,timestep24_B02_min,timestep24_B03_min,timestep24_B04_min,timestep24_B05_min,timestep24_B06_min,timestep24_B07_min,timestep24_B08_min,timestep24_B8A_min,timestep24_B11_min,timestep24_B12_min
0,d8da32b5,901.974359,1255.74359,1561.230769,1891.230769,2340.333333,2568.282051,2679.179487,3381.820513,2856.641026,...,345.0,631.0,763.0,1299.0,1803.0,1964.0,2050.0,2384.0,1901.0,2171.0
1,670ad0fb,927.181818,1423.636364,1778.636364,2161.477273,2745.909091,2969.204545,3063.0,3632.977273,2942.5,...,600.0,1048.0,1416.0,2014.0,2367.0,2574.0,2614.0,3160.0,2482.0,2795.0
2,fec40ac9,716.61194,1126.828358,1390.171642,1751.037313,2480.328358,2757.880597,2798.059701,2913.335821,2234.589552,...,228.0,525.0,723.0,1106.0,1514.0,1582.0,1742.0,2141.0,1453.0,1761.0
3,4f6d4495,565.61194,931.238806,999.813433,1558.462687,2627.276119,2965.738806,3134.19403,2584.865672,1740.19403,...,245.0,528.0,723.0,1140.0,1270.0,1420.0,1532.0,2270.0,1387.0,1613.0
4,e56d2db7,943.47343,1371.809179,1709.190821,1926.652174,2180.309179,2325.961353,2378.082126,2846.335749,2323.219807,...,415.0,840.0,1046.0,1417.0,1885.0,2017.0,2080.0,2375.0,1665.0,2157.0


Test looks just like train but without the 'Target' column and with fewer rows.

In [None]:
ss = pd.read_csv('SampleSubmission.csv')
print(ss.shape)
ss.head()

(502, 14)


,ID,arable_soil,cereals,forage_crop,greenhouses,mixed_crops,ochards,olive,olive+arbo,olive+cereals,olive+crops,plowing_and_sowing,vegetable_and_flower,wheat
0,d8da32b5,0,0,0,0,0,0,0,0,1,0,0,0,0
1,670ad0fb,0,0,0,0,0,0,0,0,0,0,1,0,0
2,fec40ac9,0,0,0,0,0,0,1,0,0,0,0,0,0
3,4f6d4495,1,0,0,0,0,0,0,0,0,0,0,0,0
4,e56d2db7,0,0,0,0,0,0,0,1,0,0,0,0,0


# Data Processing

---



In [None]:
def process(train, test, ss):

  # Class names, in the column order of the sample submission
  classes = ss.drop(columns='ID').columns.tolist()

  target_mapper = {name: i for i, name in enumerate(classes)}  # used to encode the train target
  train['Target'] = train['Target'].map(target_mapper)

  Inversetarget_mapper = {i: name for i, name in enumerate(classes)}  # used to create the submission file

  # Features used in training; we will use only time step 1 in this tutorial
  in_cols = train.filter(like='timestep1_').columns.tolist()

  return target_mapper, Inversetarget_mapper, in_cols, train, test, ss

In [None]:
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')
ss = pd.read_csv('SampleSubmission.csv')

target_mapper , Inversetarget_mapper , in_cols , train , test , ss  = process(train,test,ss)
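
The encoding step relies on every label in `Target` also appearing as a column of the sample submission; a label that is missing there would silently become `NaN`. Here is a small sketch of the round trip on toy data (the frames and values below are illustrative stand-ins, not the real files):

```python
import pandas as pd

# Toy stand-ins for SampleSubmission.csv and Train.csv (illustrative values only)
ss_toy = pd.DataFrame({"ID": ["a1"], "cereals": [0], "olive": [1], "wheat": [0]})
train_toy = pd.DataFrame({"ID": ["x", "y", "z"], "Target": ["olive", "wheat", "olive"]})

classes = ss_toy.drop(columns="ID").columns.tolist()
to_int = {name: i for i, name in enumerate(classes)}
to_name = {i: name for name, i in to_int.items()}

encoded = train_toy["Target"].map(to_int)

# Any label missing from the submission header would show up as NaN here
assert encoded.notna().all(), "unmapped class label found"

decoded = encoded.map(to_name)
print(encoded.tolist(), decoded.tolist())
```

Running the same `notna()` check on the real `train['Target']` after `process` is a cheap way to catch a spelling mismatch between the two files.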

# MODELING

---



In [None]:
X, y = train[in_cols], train['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2,stratify = y ,
                                                    random_state=58) # Random state keeps the split consistent
print(X_train.shape, X_test.shape)

(803, 50) (201, 50)


In [None]:
model = RandomForestClassifier(random_state=0) # Create the model
model.fit(X_train, y_train) # Train it (this syntax looks the same for all sklearn models)

RandomForestClassifier(random_state=0)

In [None]:
print('LOCAL log_loss',log_loss(y_test, model.predict_proba(X_test))) # predict_proba returns class probabilities
print('LOCAL accuracy',accuracy_score(y_test, model.predict(X_test)))

LOCAL log_loss 2.9747506719277053
LOCAL accuracy 0.4427860696517413
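
The log loss looks high next to the accuracy. One possible contributor (an assumption, not something verified on this dataset) is that a random forest can assign a probability of exactly 0 to the true class, which log loss punishes very heavily. The sketch below computes multi-class log loss by hand on made-up probabilities to show the effect. `multiclass_log_loss` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1.0)  # avoid log(0)
    proba = proba / proba.sum(axis=1, keepdims=True)           # renormalise rows after clipping
    true_class_proba = proba[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(true_class_proba)))

y_true = np.array([0, 1, 2])

# Every sample gets 0.8 on the correct class
confident_right = np.array([[0.8, 0.1, 0.1],
                            [0.1, 0.8, 0.1],
                            [0.1, 0.1, 0.8]])

# Same, except the last sample gets probability 0 on its true class
one_hard_zero = np.array([[0.8, 0.1, 0.1],
                          [0.1, 0.8, 0.1],
                          [0.5, 0.5, 0.0]])

loss_right = multiclass_log_loss(y_true, confident_right)
loss_zero = multiclass_log_loss(y_true, one_hard_zero)
print(loss_right, loss_zero)
```

A single hard zero dominates the average, so smoothing or calibrating the predicted probabilities may be worth trying if the real predictions contain zeros.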


# SUBMISSION

---



In [None]:
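test_prediction = model.predict_proba(test[in_cols])  # columns follow the order of model.classes_

sub = pd.DataFrame(test_prediction)
sub = sub.rename(columns=Inversetarget_mapper)  # integer class codes -> class names
sub['ID'] = test['ID'].values

# Reuse the column order of the sample submission
submission = pd.read_csv('SampleSubmission.csv')
submission = pd.merge(sub[['ID']], submission, on='ID', how='left')

for col in submission.columns[1:]:
  submission[col] = sub[col]  # overwrite the placeholder values with the predicted probabilities

submission.head()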

,ID,arable_soil,cereals,forage_crop,greenhouses,mixed_crops,ochards,olive,olive+arbo,olive+cereals,olive+crops,plowing_and_sowing,vegetable_and_flower,wheat
0,d8da32b5,0.04,0.03,0.11,0.0,0.0,0.05,0.65,0.03,0.06,0.0,0.01,0.01,0.01
1,670ad0fb,0.09,0.04,0.08,0.01,0.0,0.11,0.26,0.12,0.13,0.03,0.01,0.0,0.12
2,fec40ac9,0.01,0.04,0.01,0.01,0.03,0.3,0.28,0.11,0.02,0.08,0.01,0.09,0.01
3,4f6d4495,0.02,0.15,0.0,0.0,0.03,0.22,0.14,0.24,0.02,0.12,0.01,0.03,0.02
4,e56d2db7,0.04,0.02,0.04,0.1,0.0,0.07,0.09,0.01,0.09,0.06,0.41,0.0,0.07
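
The renaming step above assumes that the columns returned by `predict_proba` are exactly the class codes 0 to 12 in order. That holds when every class is present in the training split, but not otherwise. Here is a hedged sketch, on toy data with made-up class names and IDs, of a variant that labels the columns from `model.classes_` and fills any absent class with 0:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

class_names = ["cereals", "olive", "wheat"]  # toy subset of the real classes
to_name = dict(enumerate(class_names))

rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(30, 4)), columns=list("abcd"))
y_toy = np.array([0, 2] * 15)  # class 1 ("olive") never appears in training

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# predict_proba columns follow model.classes_, which here is [0, 2], not [0, 1, 2]
proba = pd.DataFrame(model.predict_proba(X_toy.head()), columns=model.classes_)
proba = proba.rename(columns=to_name).reindex(columns=class_names, fill_value=0.0)
proba.insert(0, "ID", [f"id{i}" for i in range(len(proba))])  # hypothetical IDs
print(proba)
```

With this approach the column labels come from the fitted model rather than from position, so a class that is missing from the training split cannot shift the remaining columns.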


In [None]:
submission.to_csv('submission.csv',index=False)
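
Before uploading, it can help to sanity-check the file. The sketch below assumes the platform expects the same columns and the same set of IDs as `SampleSubmission.csv`, and probabilities that sum to 1 per row; these requirements are assumptions and are not stated in this notebook. `check_submission` is a hypothetical helper, and the frames are toy data.

```python
import numpy as np
import pandas as pd

def check_submission(submission, sample):
    """Raise AssertionError if the submission does not match the sample layout."""
    assert list(submission.columns) == list(sample.columns), "column names or order differ"
    assert sorted(submission["ID"]) == sorted(sample["ID"]), "IDs differ"
    probs = submission.drop(columns="ID").to_numpy(dtype=float)
    assert not np.isnan(probs).any(), "missing probabilities"
    assert np.allclose(probs.sum(axis=1), 1.0, atol=1e-6), "rows do not sum to 1"
    return True

# Toy stand-ins for SampleSubmission.csv and submission.csv
sample = pd.DataFrame({"ID": ["a", "b"], "olive": [1, 0], "wheat": [0, 1]})
good = pd.DataFrame({"ID": ["a", "b"], "olive": [0.7, 0.2], "wheat": [0.3, 0.8]})

print(check_submission(good, sample))
```

On the real files, the call would be `check_submission(pd.read_csv('submission.csv'), pd.read_csv('SampleSubmission.csv'))`.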

# What Next?

---



*   Use all time steps, not only time step 1
*   Fine-tune the model's hyperparameters
*   Use cross-validation; take a look at this [notebook](https://github.com/ASSAZZIN-01/UmojaHack-Africa-2022/blob/master/Challenge%233%20-%20Faulty%20Air%20Quality%20Sensor/UmojaHack_Challenge_3_Top_3_Notebook.ipynb)
*   Win the battle with vegetation indices (NDVI, WDVI, EVI, etc.)
*   Map each time step to its corresponding date, and create interactions between dates
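
Several of these ideas can be combined. The sketch below runs on synthetic data only, so the scores it prints say nothing about the real challenge. It adds one NDVI feature per time step using the standard formula (NIR - Red) / (NIR + Red), with Sentinel-2 band B08 as near-infrared and B04 as red, then evaluates a random forest with stratified 5-fold cross-validation. The column names follow the `timestep<k>_<band>_mean` pattern seen in the preview, and `add_ndvi` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

def add_ndvi(df, n_timesteps):
    """Append one NDVI column per time step, computed from the B08 and B04 means."""
    out = df.copy()
    for t in range(1, n_timesteps + 1):
        nir = out[f"timestep{t}_B08_mean"]
        red = out[f"timestep{t}_B04_mean"]
        out[f"timestep{t}_NDVI"] = (nir - red) / (nir + red + 1e-9)  # small constant avoids division by zero
    return out

# Synthetic stand-in for Train.csv: 3 time steps, 2 bands, 3 classes
rng = np.random.default_rng(0)
n_rows, n_timesteps, n_classes = 90, 3, 3
data = {}
for t in range(1, n_timesteps + 1):
    data[f"timestep{t}_B04_mean"] = rng.uniform(500, 2500, n_rows)
    data[f"timestep{t}_B08_mean"] = rng.uniform(1500, 4000, n_rows)
toy = add_ndvi(pd.DataFrame(data), n_timesteps)
y = np.arange(n_rows) % n_classes  # balanced integer labels

feature_cols = toy.columns.tolist()  # all time steps, including the new NDVI columns
labels = np.arange(n_classes)

fold_scores = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=58)
for train_idx, valid_idx in skf.split(toy, y):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(toy.iloc[train_idx][feature_cols], y[train_idx])
    proba = model.predict_proba(toy.iloc[valid_idx][feature_cols])
    # Passing labels keeps the metric well defined even if a fold lacks a class
    fold_scores.append(log_loss(y[valid_idx], proba, labels=labels))

print("CV log_loss per fold:", np.round(fold_scores, 3))
print("CV log_loss mean:", np.mean(fold_scores))
```

Averaging over folds gives a steadier estimate than a single 80/20 split, which matters with only about a thousand training rows.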