# EDA: Forestry Commission (RMSC) training data 

Training data is organized by basin. It covers 5 basins in Ghana:
1. Black Volta
2. Pra
3. Sene
4. Tano
5. White Volta


## Top Questions
- How many training data points are there in total for the region?
- What are the different crop categories? How many points per category?
- When was the data gathered?

In [162]:
import pandas as pd
import matplotlib.pyplot as plt
import sys
sys.path.append('../src/')
import ptype_prepare_data as pp
import ptype_visualize as viz

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [25]:
folder = '../data/rmsc_train/'

In [176]:
sene = pd.read_csv(folder + 'sene/sene_comb_raw.csv')
pra = pd.read_csv(folder + 'pra/pra_comb_raw.csv')
tano = pd.read_csv(folder + 'tano/tano_comb_raw.csv')
white = pd.read_csv(folder + 'white_volta/white_volta_comb_raw.csv')
black = pd.read_csv(folder + 'black_volta/black_volta_comb_raw.csv')

In [177]:
frames = [sene, pra, tano, white, black]
df = pd.concat(frames)
df.head()

Unnamed: 0,time,lat,lon,land use,dominant,district,remarks
0,2022-07-24T07:47:59Z,6.648703,-0.743282,,Open Forest,kuahu east,
1,2022-07-24T07:47:59Z,6.696985,-0.745465,,Teak Plantation,kuahu east,
2,2022-07-24T07:47:59Z,6.706291,-0.725214,,Annuals,kuahu east,
3,2022-07-24T07:47:59Z,6.722182,-0.731192,,Grassland,kuahu east,
4,2022-07-24T07:47:59Z,6.685867,-0.768746,,Open Forest,kuahu east,
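
Each basin file carries its own 0-based index, so `pd.concat(frames)` keeps repeated index labels and drops the information about which basin a row came from. The following is a minimal sketch of one way to keep both, using small made-up frames in place of the real CSVs. The `basin` column name and the `*_demo` variables are assumptions for illustration and do not exist in the source files.

```python
import pandas as pd

# Toy stand-ins for two of the basin tables (values are made up)
sene_demo = pd.DataFrame({'lat': [6.6, 6.7],
                          'dominant': ['Open Forest', 'Teak Plantation']})
pra_demo = pd.DataFrame({'lat': [5.9],
                         'dominant': ['Cocoa']})

demo_frames = {'sene': sene_demo, 'pra': pra_demo}

# Tag each row with its basin (hypothetical column), then build a fresh 0..n-1 index
combined = pd.concat(
    [frame.assign(basin=name) for name, frame in demo_frames.items()],
    ignore_index=True,
)
print(combined)
```

With `ignore_index=True` the row labels become unique, at the cost of no longer matching the row numbers in the individual basin files.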


In [178]:
# convert the time column to datetime and normalize to midnight (date only)
df['time'] = pd.to_datetime(df['time']).dt.normalize()

# lowercase all category labels
df['land use'] = df['land use'].str.lower()
df['dominant'] = df['dominant'].str.lower()

# fix specific wording
df['dominant'] = df['dominant'].replace({'teak plantation': 'teak',
                                         'fallow land': 'fallow',
                                         'maiz': 'maize',
                                         'maze': 'maize',
                                         'shrubs': 'shrub',
                                         'shaded': 'shaded cocoa',
                                         'urban': 'settlement',
                                         'grass': 'grassland',
                                         'bare surface': 'bare',
                                         'annual': 'annuals',
                                         'soya beans': 'soyabean',
                                         'soyabbean': 'soyabean',
                                         'palm and rice': 'rice and palm',
                                         'shea tree': 'shea',
                                         'gcorn': 'guinea corn',
                                         'cidrella': 'cidrela',
                                         'citrus orange': 'citrus'})

In [179]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1937 entries, 0 to 242
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype              
---  ------    --------------  -----              
 0   time      374 non-null    datetime64[ns, UTC]
 1   lat       1937 non-null   float64            
 2   lon       1937 non-null   float64            
 3   land use  1530 non-null   object             
 4   dominant  1937 non-null   object             
 5   district  1936 non-null   object             
 6   remarks   1254 non-null   object             
dtypes: datetime64[ns, UTC](1), float64(2), object(4)
memory usage: 121.1+ KB
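
The `info()` output shows that `time` is filled for only 374 of the 1937 rows, so any date-based summary covers a subset of the points. Below is a small sketch of how the share of missing values per column could be checked. The `demo` frame is made up for illustration.

```python
import pandas as pd

# Made-up frame: one row with a timestamp, three without
demo = pd.DataFrame({
    'time': pd.to_datetime(['2022-07-24T07:47:59Z', None, None, None], utc=True),
    'dominant': ['teak', 'cocoa', 'maize', 'rice'],
})

# Fraction of missing values in each column
missing_share = demo.isna().mean()
print(missing_share)

# Rows that do carry a timestamp
dated = demo.dropna(subset=['time'])
print(len(dated), 'of', len(demo), 'rows have a timestamp')
```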


In [183]:
top_25 = df.dominant.value_counts()[:25]

In [185]:
## useful categories
categories_to_use = ['annuals and cocoa',
 'annuals with cocoa',
 'cashew',
 'cassava adj cocoa',
 'cidrela',
 'citrus',
 'citrus and palm',
 'cocoa',
 'cocoa and annual',
 'cocoa and banana',
 'cocoa and ginger',
 'cocoa and rubber',
 'coconut',
 'eucalyptus',
 'forest',
 'forest and cocoa',
 'mahogany',
 'mango',
 'melina',
 'oil palm',
 'oil palm and cocoa',
 'open forest',
 'palm',
 'palm and cocoa',
 'pawpaw',
 'rice and cocoa',
 'rice and cocoa,palm',
 'rice and oil palm',
 'rice and palm',
 'riverine forest',
 'rubber',
 'rubber and cocoa, rice',
 'shaded cocoa',
 'shea',
 'shea and maize',
 'shea dawadawa',
 'teak',
 'teak and cidrela']

priority = df[df.dominant.isin(categories_to_use)]
priority.head()

Unnamed: 0,time,lat,lon,land use,dominant,district,remarks
0,2022-07-24 00:00:00+00:00,6.648703,-0.743282,,open forest,kuahu east,
1,2022-07-24 00:00:00+00:00,6.696985,-0.745465,,teak,kuahu east,
4,2022-07-24 00:00:00+00:00,6.685867,-0.768746,,open forest,kuahu east,
18,2022-07-24 00:00:00+00:00,6.556199,-0.726153,,teak,kuahu east,
21,2022-07-23 00:00:00+00:00,6.588588,-0.695538,,open forest,kwahu south,
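
`isin` silently ignores labels that never occur in the data, so a misspelled entry in `categories_to_use` would simply match nothing. A hedged sketch of a quick sanity check, using made-up labels (`demo` and `wanted` are illustrative names):

```python
import pandas as pd

demo = pd.DataFrame({'dominant': ['teak', 'cocoa', 'maize', 'cocoa']})
wanted = ['teak', 'cocoa', 'rubber']  # 'rubber' is deliberately absent from demo

# Labels in the wish list that never appear in the data
unmatched = sorted(set(wanted) - set(demo['dominant']))
print(unmatched)

# Rows kept by the same kind of isin filter used for `priority`
kept = demo[demo['dominant'].isin(wanted)]
print(len(kept))
```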


In [189]:
print(f'Total points: {len(df)}')
print(f'Total crop categories: {len(df.dominant.value_counts())}')
print(f'The highest number of points are collected for \n{top_25[:10]}')
print('---')
print(f'Total priority points: {len(priority)}')
print(f'Total priority categories: {len(categories_to_use)}')
print('---')
# note: only rows with a non-null time (374 of 1937) contribute to this range
print(f'Data was collected between {df.time.min()} and {df.time.max()}')


Total points: 1937
Total crop categories: 124
The highest number of points are collected for 
cocoa           227
maize           201
shea            196
rice            166
grassland       128
millet           81
palm             55
shaded cocoa     53
soyabean         52
teak             51
Name: dominant, dtype: int64
---
Total priority points: 845
Total priority categories: 38
---
Data was collected between 2022-05-12 00:00:00+00:00 and 2022-07-24 00:00:00+00:00


In [154]:
df.to_csv(folder + 'rmsc_clean.csv')

In [190]:
priority.to_csv(folder + 'rmsc_priority.csv', index=False)

## Other questions / explorations

In [112]:
# what's the distribution of land use categories?
# probably better to use the dominant category rather than land use
df['land use'].value_counts()

cropland                  889
woodland                  237
grassland                 125
wetland                    50
otc                        46
forest                     43
shaded cocoa               38
plantation                 26
woodland with grass        18
natural tree species       13
settlement/baresurface      7
bare surface                6
settlement                  6
woodland                    6
mono cocoa                  3
forest reserve              3
cropland                    2
riverine forest             2
savannah woodland           2
wetland (settlement)        1
mountains                   1
                            1
grassland                   1
maize                       1
plantain                    1
shrub                       1
annuals                     1
Name: land use, dtype: int64
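
Several labels appear twice in this output (`woodland`, `cropland`, `grassland`), and one entry is blank. A likely cause, not verified against the raw files, is stray leading or trailing whitespace, which `str.lower()` does not remove. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up labels: some differ only by surrounding whitespace, one is blank
labels = pd.Series(['cropland', 'cropland ', ' woodland', 'woodland', ' '])

# Before: variants that differ only by whitespace are counted separately
print(labels.value_counts())

# Strip surrounding whitespace, then turn empty strings into missing values
cleaned = labels.str.strip()
cleaned = cleaned.mask(cleaned == '')
print(cleaned.value_counts())
```

If whitespace is indeed the cause, the same `str.strip()` step could be applied to `dominant` before the `replace` mapping, since `replace` only matches whole cell values exactly.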

In [114]:
# plantation category looks like forest plantations
df[df['land use'] == 'plantation']['dominant'].value_counts()

teak          21
melina         1
borassus       1
eucalyptus     1
cassier        1
mahogany       1
Name: dominant, dtype: int64

In [116]:
# otc land use looks like other tree crops
df[df['land use'] == 'otc']['dominant'].value_counts()

rubber      16
coconut     11
palm         9
oil palm     4
cashew       4
citrus       2
Name: dominant, dtype: int64

In [117]:
# cocoa is represented in a few places
df[df['land use'] == 'mono cocoa']['dominant'].value_counts()

cocoa    3
Name: dominant, dtype: int64

In [118]:
df[df['land use'] == 'shaded cocoa']['dominant'].value_counts()

cocoa    38
Name: dominant, dtype: int64

In [121]:
df[df['land use'] == 'cropland']['dominant'].value_counts()[:10]

maize          172
rice           150
cocoa           91
millet          80
soyabean        46
yam             44
cashew          38
guinea corn     27
fallow          21
mango           19
Name: dominant, dtype: int64
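
The per-category filters above can also be summarised in a single table. A sketch with `pd.crosstab` on made-up rows (the `demo` frame is illustrative, not the real data):

```python
import pandas as pd

demo = pd.DataFrame({
    'land use': ['cropland', 'cropland', 'otc', 'plantation', 'shaded cocoa'],
    'dominant': ['maize', 'cocoa', 'rubber', 'teak', 'cocoa'],
})

# Rows: land use, columns: dominant cover, cells: number of points
table = pd.crosstab(demo['land use'], demo['dominant'])
print(table)

# Land use categories in which cocoa appears
cocoa_counts = table['cocoa'][table['cocoa'] > 0]
print(cocoa_counts)
```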

In [167]:
df[df.dominant == 'mango']

Unnamed: 0,time,lat,lon,land use,dominant,district,remarks
96,NaT,6.871911,-0.29579,cropland,mango,afram plains north south,
100,NaT,6.943461,-0.232487,cropland,mango,afram plains north south,
189,NaT,7.014287,-0.144192,cropland,mango,afram plains north south,
196,NaT,7.111356,-0.212417,cropland,mango,afram plains north south,
221,NaT,7.678434,-0.668814,cropland,mango,sene west,
223,NaT,7.770024,-0.575882,cropland,mango,sene west,
21,NaT,7.456335,-1.962908,cropland,mango,Techiman Municipal,Adjoined to Cropland
46,NaT,7.282855,-2.1464,cropland,mango,Tano North,Adjoined to Fallow Land
161,NaT,7.223139,-1.919775,cropland,mango,Ahafo-Ano,Adjoined to Sacred grove
164,NaT,7.228311,-1.89254,cropland,mango,Ahafo-Ano,Adjoined to Cropland


In [166]:
# v20 has 809 plots; at 14 x 14 samples per plot, doesn't that mean 158,564 samples?
809*(14*14)

158564