## Decision Tree: Branch Analyses

1. Reporting Year (Year 0 - baseline)
2. TTC Tree cover
3. Land Use
4. Strategy
5. Imagery or site accessibility

### General Notes
- inactive projects include: africorp-intl, divine-bamboo-group, germark, poel, s3d-niger (7 total in cohort 1)

### Decisions
- An "open" or "closed" designation was assigned at the project level based on the proportion of sites that fell into the open/closed category.
- Project names that did not align were dropped.
- ttc NA - sites without TTC % were dropped. This occurs because TTC tiles are missing; the data will eventually be available.
- Sites where planting occurs in 2025 were dropped. TODO: Check the other sites in the project.
- What does imagery % represent and what is a workable threshold? Does it represent coverage for the polygon or wider AOI?
- plant_start - date that planting is started is available by site, so would have to aggregate for a project. Does it make sense to do the first date of planting?
- This analysis lays out what the decision trees would look like at year0, year3 and year6, but only performs the analysis for year0. To run the subsequent years, we would need to include the date that the TTC analysis was run and check it against the planting date to understand the temporal component for each project (i.e. was the canopy open at plant start).
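
The project-level open/closed designation described above could be sketched roughly as follows. This is an illustration only, not the logic in `decision_trees`: the toy `sites` table, the `designate_projects` helper, the assumption that tree cover below `canopy_threshold` means "open", and the 50% majority rule (ties fall to "closed") are all assumptions.

```python
import pandas as pd

# Hypothetical site-level table; column names mirror the notes but are assumptions.
sites = pd.DataFrame({
    "project": ["a", "a", "a", "b", "b"],
    "sitename": ["s1", "s2", "s3", "s4", "s5"],
    "tree_cover": [10.0, 25.0, 60.0, 80.0, 55.0],
})

canopy_threshold = 40  # same value as the PARAMS cell


def designate_projects(sites, threshold, majority=0.5):
    """Label each project 'open' or 'closed' from its share of open sites."""
    # Assumed rule: a site is "open" when its tree cover is below the threshold.
    is_open = sites["tree_cover"] < threshold
    # Mean of a boolean Series per project = proportion of open sites.
    share_open = is_open.groupby(sites["project"]).mean()
    # Assumed rule: a project is "open" when more than `majority` of its sites are open.
    label = share_open.gt(majority).map({True: "open", False: "closed"})
    return pd.DataFrame({"share_open": share_open, "canopy": label})


designation = designate_projects(sites, canopy_threshold)
print(designation)
```

Keeping `share_open` next to the label makes it easy to spot projects that sit close to the cut-off and may need a manual check.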


**Notes on datasets**
- `target_sys` in ttc csv refers to the current land use (used for error calcs). This is not used.
- when merging ttc and ft_polys using 3 keys - project name, site name and polygon name - there are some duplicates. It seems all values are the same except the slope and aspect stats. These columns are not used for now, so the duplicates are dropped.
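
One possible way to handle the duplicates noted above is sketched below. It assumes the duplicated rows sit in the polygon features table and that the terrain columns are named `slope` and `aspect`; both are guesses, and the toy frames stand in for the real inputs.

```python
import pandas as pd

keys = ["project", "sitename", "poly_name"]
terrain_cols = ["slope", "aspect"]  # hypothetical column names

# Toy polygon features: polygon p1 appears twice, differing only in terrain stats.
feats = pd.DataFrame({
    "project": ["a", "a", "b"],
    "sitename": ["s1", "s1", "s2"],
    "poly_name": ["p1", "p1", "p2"],
    "practice": ["tree-planting", "tree-planting", "direct-seeding"],
    "slope": [3.1, 3.4, 7.0],
    "aspect": [180.0, 175.0, 90.0],
})
ttc = pd.DataFrame({
    "project": ["a", "b"],
    "sitename": ["s1", "s2"],
    "poly_name": ["p1", "p2"],
    "tree_cover": [12.0, 55.0],
})

# Drop the unused terrain stats first so near-identical rows become exact duplicates.
feats_dedup = feats.drop(columns=terrain_cols).drop_duplicates()

# validate="one_to_one" raises a MergeError if any key is still duplicated on either side.
comb = pd.merge(feats_dedup, ttc, on=keys, validate="one_to_one")
print(len(feats), len(comb))
```

Using `validate` turns a silent row multiplication into an explicit error, which is helpful if new duplicates appear in later exports.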

**Next**
- priority 1: pull all inputs from TM API - ttc and features need the same project names
- distributed vs concentrated assignment
- drop projects with NaN target sys (blue-forest and wells for zoe) and print out decision.
- list potential branches to build (aspect, slope, etc.)

In [None]:
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import re
import os
import math

import sys
sys.path.append('../src/')
import image_availability as img
import clean_raw_data as clean
import decision_trees as tree

%load_ext autoreload
%autoreload 2

# PARAMS

In [None]:
imagery_dir = "../data/imagery_availability/cohort1/"         # Rhiannon saved as "../data/afr100_cohort1_imagery_availability/"
ttc_input = '../data/ttc_baseline_110624.csv'                 # these are the ttc statistics
new_poly_feats = "../data/all_projects_TM_081424"             # this contains new land cover & intervention type
old_poly_feats = '../data/features_polygon_071824.csv'        # original polygon feature information
canopy_threshold = 40                                         # threshold for identifying open vs closed canopy projects
cloud_thresh = 50                                             # threshold for identifying image quality
img_count = 1                                                 # threshold for identifying image availability

In [None]:
# load polygon features so the exploratory cells below have `feats` defined
feats = pd.read_csv('../data/features_polygon_100124.csv')
feats.head()

In [None]:
#feats[(feats.target_sys == 'natural-forest')&(feats.practice == 'assisted-natural-regeneration')]

In [None]:
#feats[feats.project_name == 'wildlife action group']

In [None]:
feats.practice.value_counts()

In [None]:
feats.target_sys.value_counts()

In [None]:
feats = feats.rename(columns={'project_name': 'project',
                              'site_name' : 'sitename'})

In [None]:
ttc = clean.clean_ttc_csv(ttc_input, canopy_threshold)  # use the PARAMS value instead of a hard-coded 40
ttc.head()

In [None]:
comb = pd.merge(feats, ttc, on=['project', 'sitename', 'poly_name'])

In [None]:
comb

In [None]:
feats.project.value_counts()

## TTC CHECK

In [None]:
print(pd.__version__)

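In [None]:
# read the raw TTC csv before inspecting it
df = pd.read_csv(ttc_input)
df.info()

In [None]:
# separate polygons with missing tree cover from those with valid values
nulls = df[df.tree_cover.isna()]
ttc = df.dropna(subset=['tree_cover'])
ttc.info()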

# Data Prep & Cleaning

In [None]:
ttc, feats = clean.clean_combine_inputs(ttc_input, old_poly_feats, new_poly_feats, canopy_threshold)

In [None]:
#list(set(feats.project_name))

In [None]:
#list(set(ttc.project_name))

In [None]:
feats

## Create Trees

In [None]:
# assign to `base` (inspected below) rather than rebinding the imported `tree` module
base = tree.year0(ttc_input,
                  old_poly_feats,
                  new_poly_feats,
                  imagery_dir,
                  canopy_threshold,
                  cloud_thresh,
                  img_count)

In [None]:
df.target_sys.value_counts()

In [None]:
base.shape

In [None]:
base

# Issues / QAQC Needs

### Issue 1. Diff in count

In [None]:
# new shapefile has 38 fewer polygons
shp = gpd.read_file(new_poly_feats)
ft_poly = pd.read_csv(old_poly_feats)
display(shp.shape, ft_poly.shape)
display(shp.shape[0] - ft_poly.shape[0])

In [None]:
# compare project, poly and site names across TM polygons
shp_prjlist = list(set(shp.Project))
ft_prjlist = list(set(ft_poly.Project))
shp_polylist = list(set(shp.poly_name))
ft_polylist = list(set(ft_poly.poly_name))
shp_sitelist = list(set(shp.SiteName))
ft_sitelist = list(set(ft_poly.SiteName))
diff_prj = [i for i in ft_prjlist if i not in shp_prjlist]
diff_poly = [i for i in ft_polylist if i not in shp_polylist]
diff_site = [i for i in ft_sitelist if i not in shp_sitelist]

print(diff_prj)
print(diff_poly)
print(diff_site)

In [None]:
# how many polys within the missing sites?
site_df = ft_poly[ft_poly.SiteName.isin(['Bbambula',
                              'Reforestation Project\xa0',
                              'Main_Gate',
                              'koko Regeneration site',
                              'Namwene',
                              'Kikandwa'])]
site_df.shape

### Issue 2. Duplicate polygons
Some polygons are almost exactly the same except for the slope and aspect statistics. Were two methods applied to calculate these stats? Which one to drop?
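
To check which columns actually differ between the duplicated polygons, a diagnostic along these lines may help. The toy `polys` table and its column names (`area_ha`, `slope`, `aspect`) are assumptions for illustration, not the real schema.

```python
import pandas as pd

keys = ["project", "sitename", "poly_name"]

# Toy polygon table: polygon p1 is duplicated with different terrain stats.
polys = pd.DataFrame({
    "project": ["a", "a", "b"],
    "sitename": ["s1", "s1", "s2"],
    "poly_name": ["p1", "p1", "p2"],
    "area_ha": [1.5, 1.5, 4.0],
    "slope": [3.1, 3.4, 7.0],
    "aspect": [180.0, 175.0, 90.0],
})

# Keep every row whose key combination appears more than once.
dupes = polys[polys.duplicated(subset=keys, keep=False)]

# Count distinct values per column within each duplicated key group.
n_distinct = dupes.groupby(keys).nunique(dropna=False)

# A column "differs" if any duplicated group holds more than one distinct value.
differing = [col for col in n_distinct.columns if (n_distinct[col] > 1).any()]
print(differing)
```

If only the terrain columns show up in `differing`, keeping either copy is safe for any analysis that does not use them.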

### Issue 3. TTC is null for some polygons
221 polygons have null values for TTC; these are dropped from the analysis.

# Check Results

# Legacy

In [None]:
    # ## BRANCH 2 ##
    # # not represented: wetland,plantation / urban-forest
    # open_landcovers = ['agroforest', 
    #                    'mangrove', 
    #                    'wetland', 
    #                    'silvopasture', 
    #                    'plantation', 
    #                    'natural-forest', 
    #                    'agroforest,silvopasture',
    #                    'agroforest,wetland',
    #                    ]
    # closed_landcovers = ['plantation', 
    #                      'natural-forest',
    #                      'urban-forest'
    #                      ]
    # open_sys = open_[open_.target_sys.isin(open_landcovers)]
    # closed_sys = closed_[closed_.target_sys.isin(closed_landcovers)]

In [None]:
    af_tp = open_[(open_.target_sys == 'agroforestry') & (open_.practice == 'tree-planting')]
    af_ds = open_[(open_.target_sys == 'agroforestry') & (open_.practice == 'direct-seeding')]
    af_anr = open_[(open_.target_sys == 'agroforestry') & (open_.practice == 'ANR')]
    mgr_tp = open_[(open_.target_sys == 'mangrove') & (open_.practice == 'tree-planting')]
    mgr_anr = open_[(open_.target_sys == 'mangrove') & (open_.practice == 'ANR')]
    nf_tp = open_[(open_.target_sys == 'natural forest') & (open_.practice == 'tree-planting')]
    nf_ds = open_[(open_.target_sys == 'natural forest') & (open_.practice == 'direct-seeding')]
    nf_anr = open_[(open_.target_sys == 'natural forest') & (open_.practice == 'ANR')]
    plant_tp = open_[(open_.target_sys == 'plantation') & (open_.practice == 'tree-planting')]
    wet_tp = open_[(open_.target_sys == 'wetland') & (open_.practice == 'tree-planting')]
    wet_anr = open_[(open_.target_sys == 'wetland') & (open_.practice == 'ANR')]
    silvo_tp = open_[(open_.target_sys == 'silvopasture') & (open_.practice == 'tree-planting')]
    silvo_ds = open_[(open_.target_sys == 'silvopasture') & (open_.practice == 'direct-seeding')]
    silvo_anr = open_[(open_.target_sys == 'silvopasture') & (open_.practice == 'ANR')]
    plant_tp_closed = closed_[(closed_.target_sys == 'plantation') & (closed_.practice == 'tree-planting')]
    nf_tp_closed = closed_[(closed_.target_sys == 'natural forest') & (closed_.practice == 'tree-planting')]
    nf_ds_closed = closed_[(closed_.target_sys == 'natural forest') & (closed_.practice == 'direct-seeding')]
    nf_anr_closed = closed_[(closed_.target_sys == 'natural forest') & (closed_.practice == 'ANR')]

    af = open_[(open_.target_sys == 'agroforestry')]
    mgr = open_[(open_.target_sys == 'mangrove')]
    wet = open_[(open_.target_sys == 'wetland')]
    silvo = open_[(open_.target_sys == 'silvopasture')]
    plant_open = open_[(open_.target_sys == 'plantation')]
    nf_open = open_[(open_.target_sys == 'natural forest')]
    plant_closed = closed_[(closed_.target_sys == 'plantation')]
    nf_closed = closed_[(closed_.target_sys == 'natural forest')]

In [None]:
img_dir = '../data/imagery_availability/cohort1/'
imagery_files = os.listdir(img_dir)
imagery = []
for project in imagery_files:
    df = pd.read_csv(f"{img_dir}/{project}")
    sub_df = df[['Name', 
                 'properties.datetime',
                 'collection', 
                 'properties.eo:cloud_cover', 
                 'properties.off_nadir_avg']]
    sub_df = sub_df.assign(Project=project.replace('afr100_', '').replace('_imagery_availability.csv', ''))
    imagery.append(sub_df)
all_projects_df = pd.concat(imagery).reset_index()
all_projects_df = all_projects_df[['Project', 
                                   'Name', 
                                   'properties.datetime',
                                   'collection', 
                                   'properties.eo:cloud_cover',
                                   'properties.off_nadir_avg']]
all_projects_df = all_projects_df[~pd.isna(all_projects_df['Name'])]
all_projects_df.rename(columns={'Name':'poly_name'}, inplace=True)
all_projects_df.loc[:, 'properties.datetime'] = pd.to_datetime(all_projects_df['properties.datetime'], 
                                                               format='mixed').dt.normalize()
all_projects_df.loc[:, 'properties.datetime'] = all_projects_df['properties.datetime'].apply(lambda x: x.replace(tzinfo=None))

In [None]:
all_projects_df.info()

In [None]:
# create a master csv of all projects for img availability, elevation and sun
img_dir = 'imagery_availability/cohort1/'
csv_list = os.listdir(img_dir)
csv_list = [re.split(r'_', file)[1] for file in csv_list if file.endswith('.csv')]
master_csv = pd.DataFrame()
for i in csv_list[0:1]:
    i = 'bccp'
    df = pd.read_csv(f'{img_dir}afr100_{i}_imagery_availability.csv')
    df = df[['id', 
             'collection', 
             'properties.datetime',
             'properties.eo:cloud_cover',
             'properties.collect_time_end',
             'properties.collect_time_start',
#              'properties.off_nadir_avg', 
#              'properties.off_nadir_end',
#              'properties.off_nadir_max', 
#              'properties.off_nadir_min',
#              'properties.view:sun_elevation_max',
#              'properties.view:sun_elevation_min',
            ]]
    dt_cols = ['properties.datetime',
                 'properties.collect_time_end',
                 'properties.collect_time_start',
               ]
    df[dt_cols] = df[dt_cols].apply(pd.to_datetime, errors='coerce')
    # Ensure 'properties.datetime' column is timezone-naive
    df['properties.datetime'] = df['properties.datetime'].apply(lambda x: x.replace(tzinfo=None) if x.tzinfo else x)
    planting_date = ttc[ttc.project == i]['plantstart']
#    df['baseline_imgs'] = df['properties.datetime'].apply(lambda x: 1 if planting_date <= x <= planting_date + timedelta(days=365) else 0)
#     df['baseline_cloud_free'] = len(df[(df['baseline_imgs'] == 1) & (df['properties.eo:cloud_cover'] == 1)])
#     agg_data = {
#         'project_name': i,
#         'total_imgs': len(df),
#         'total_cloud_free': sum(df['properties.eo:cloud_cover']),
#         'baseline_imgs': sum(df['baseline_imgs']),
#         'baseline_cloud_free':sum(df['baseline_cloud_free']),
#                              }
# #         'early_imgs':,
# #         'early_cloud_free',
# #         'endline_imgs':,
# #         'endline_cloud_free':,
#     agg_table = pd.DataFrame([agg_data])
    

In [None]:
ft_raw.columns = ft_raw.columns.map(lambda x: re.sub(' ', '_', x.lower().strip()))
ft = ft_raw[['project', 'area_ha', 'percent_imagery_coverage', 'agroforestry',
       'tree_planting', 'assisted_natural_regeneration', 'enrichment_planting',
       'reforestation', 'direct_propagules_planting', 'natural_regeneration',
       'applied_nucleation/tree_island', 'woodlot', 'direct_seeding',
       'mangrove_tree_restoration', 'natural_forest', 'riparian_restoration']]
ft = ft.rename(columns={'mangrove_tree_restoration': 'mangrove',
                    'riparian_restoration':'riparian'})

## land use
land_use = ['agroforestry',
            'woodlot',
            'mangrove', 
            'natural_forest', 
            'riparian']

for col in land_use:
    ft[col] = ft[col].apply(lambda x: 1 if x > 0 else x)

## planting strategy
strategy = ['direct_seeding', 
            'tree_planting', 
            'assisted_natural_regeneration', 
            'natural_regeneration', # change this
            'enrichment_planting',
            'direct_propagules_planting',
            'applied_nucleation/tree_island',
            'reforestation',
           ]

for col in strategy:
    ft[col] = ft[col].apply(lambda x: 1 if x > 0 else x)