## 1.3. The data tables for modeling

This notebook builds separate data tables, one per output (response) variable, for machine learning in Python.

In [1]:
library("tidyverse")
library("skimr")

-- Attaching packages --------------------------------------- tidyverse 1.3.0 --

v ggplot2 3.3.0     v purrr   0.3.3
v tibble  2.1.3     v dplyr   0.8.5
v tidyr   1.0.2     v stringr 1.4.0
v readr   1.3.1     v forcats 0.5.0

-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()



The data file __`pr_potato_df.csv`__ was produced by the notebook __`1.2_preprocessing-2.ipynb`__.

In [2]:
data_ml <- read.csv(file = 'output/pr_potato_df.csv')
data_ml$index <- 1:nrow(data_ml)
keys_col <- c("NoEssai", "NoBloc", "NoTraitement", "ID", "ID_bl")
dose_vars <- c('NtotDose', 'PtotDose', 'KtotDose', 'test_type')

Blocks containing only one value are removed. Two unique identifiers are then defined: `ID_bl`, the concatenation of trial and block (`NoEssai-NoBloc`), and `ID`, the concatenation of trial, block and treatment (`NoEssai-NoBloc-NoTraitement`).

In [3]:
data_ml <- subset(data_ml, Bloc_numLevels != 1) %>%
    mutate(ID = paste0(NoEssai, '-', NoBloc, '-', NoTraitement),
           ID_bl = paste0(NoEssai, '-', NoBloc))
nrow(data_ml)
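To illustrate the two identifiers built above, here is a minimal sketch on a made-up three-row table. The values are hypothetical; only the column names come from the notebook. It shows that `ID_bl` is shared by all treatments of a block, while `ID` is unique per row.

```r
# Toy data: hypothetical values, column names as in data_ml
toy <- data.frame(
  NoEssai = c(101, 101, 102),
  NoBloc = c(1, 1, 2),
  NoTraitement = c(1, 2, 1)
)

# Block-level and plot-level identifiers, built as in the cell above
toy$ID_bl <- paste0(toy$NoEssai, "-", toy$NoBloc)
toy$ID <- paste0(toy$NoEssai, "-", toy$NoBloc, "-", toy$NoTraitement)

print(toy)

# ID must be unique per row, whereas ID_bl repeats within a block
stopifnot(!anyDuplicated(toy$ID))
```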

In [4]:
# Time-span in data_ml
min(data_ml$Annee, na.rm = TRUE)
max(data_ml$Annee, na.rm = TRUE)

### Remove dots from column names 

Note that single elements (e.g. `soil_Al`) must be listed after balances, so that they are not replaced before the balance is parsed (if `soil_Al` were parsed first, `soil_Al.P` would be translated to `soil Al.P`).

In [5]:
# Local useful csv table to remove some characters in colnames
translate_col <- read_csv(file = 'data/translate_col.csv')

Parsed with column specification:
cols(
  from_name_mm = col_character(),
  to_name_mm = col_character(),
  to_name_mm_long = col_character(),
  to_name_mm_long_fr = col_character()
)



In [6]:
# Rename each column whose name exactly matches an entry of from_name_mm
for (i in seq_len(nrow(translate_col))) {
  index <- which(colnames(data_ml) == translate_col$from_name_mm[i])
  colnames(data_ml)[index] <- translate_col$to_name_mm[i]
}
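The loop above renames a column only when its name matches `from_name_mm` exactly. A minimal vectorised sketch of the same idea with `match()` follows. The mapping and column names are hypothetical stand-ins, since the content of `translate_col.csv` is not shown here.

```r
# Hypothetical mapping; the real one is read from data/translate_col.csv
translate_col <- data.frame(
  from_name_mm = c("soil_P1_Al.P", "soil_Al"),
  to_name_mm = c("soil_P1_Al_P", "soil_Al_renamed"),
  stringsAsFactors = FALSE
)
old_names <- c("NoEssai", "soil_Al", "soil_P1_Al.P")

# Position of each column name in the mapping (NA when it is not listed)
pos <- match(old_names, translate_col$from_name_mm)

# Replace the matched names and keep the others unchanged
new_names <- ifelse(is.na(pos), old_names, translate_col$to_name_mm[pos])
print(new_names)
```

Because the comparison is exact, `soil_Al` leaves `soil_P1_Al.P` untouched in this sketch.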

### Select useful columns

As a first step, we select the columns that are useful for prediction. The data table is then filtered to remove rows containing at least one missing value. We must therefore make sure to select meaningful columns that do not introduce too many NAs.

In [7]:
var_ml <- c(
    # Response variables
    "RendVendableMaxParEssai",
    ## other outputs are added separately in the code below
    # Keys (Random effect)
    "NoEssai", "NoBloc", "NoTraitement", "ID", "ID_bl",
    # Management
    "DensitePlants", "PrecCropFiveClasses",  
    # Cultivar
    "Cultivar", "Maturity5", "growing.season", 
    # Weather
    'temp_moy_5years', 'prec_tot_5years', 'sdi_5years', 'gdd_5years',
    # Doses
    ## Nitrogen
    "NtotDose",
    ## Phosphorus
    "PtotDose",
    ## Potassium
    "KtotDose",
    # Soil
    "soilTextIlr1", "soilTextIlr2", "soilTextIlr3",
    "soilTypeIlr1_3", "soilTypeIlr2_3",
    "soil_pH",
    "soil_P1_Fv.AlP", "soil_P1_Al.P", #sbp1 
    "soil_K2_FvMgCa.K", "soil_K2_Fv.MgCa", "soil_K2_Mg.Ca", # sbp2
    "soil_P", "soil_K", "soil_Al", "ISP1" # CRAAQ
)

The following pipelines filter out trials whose highest marketable yield (`RendVendableMaxParEssai`) is below 28 Mg/ha. This avoids using data strongly perturbed by factors that were not monitored, such as diseases, management failures or extreme weather events. Unused features are also discarded, then rows containing NAs are removed. The last operation of each pipeline drops unused factor levels, so that factors are not encoded with categories (levels) that no longer occur in the dataset.

Different tables are built for different outputs. The number of complete cases varies depending on the output variables.
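The four steps described above can be sketched on a toy table. All values below are hypothetical and the column `unused` is made up; only the other column names and the 28 Mg/ha threshold come from the notebook.

```r
library(dplyr)

# Toy data: hypothetical values, column names as in data_ml
toy <- data.frame(
  NoEssai = c(1, 1, 2, 2, 3, 3),
  RendVendableMaxParEssai = c(35, 35, 20, 20, 40, 40),
  RendVendable = c(30, 35, 18, 20, NA, 40),
  Cultivar = factor(c("A", "A", "B", "B", "C", "C")),
  unused = 1:6
)

toy_ml <- toy %>%
  filter(RendVendableMaxParEssai >= 28) %>%    # drops trial 2 (maximum 20 Mg/ha)
  select(NoEssai, RendVendable, Cultivar) %>%  # discards the unused feature
  na.omit() %>%                                # drops the row with a missing yield
  droplevels()                                 # forgets cultivar "B"

print(toy_ml)
print(levels(toy_ml$Cultivar))
```

Without `droplevels()`, `Cultivar` would still carry the level `"B"` even though no remaining row uses it.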

### Complete cases with all the outputs
This table gathers trials common to all the outputs.

In [8]:
ml_df <- data_ml %>%
  filter(RendVendableMaxParEssai >= 28) %>%
  select(one_of(c("index", "RendVendable", 'tsizeMS_L', 'tsizeS_M', "PoidsSpec", var_ml, "Annee"))) %>%
  na.omit() %>%
  droplevels()

In [9]:
ml_df <- ml_df %>% select(-Annee)
ml_df$Cultivar <- factor(ml_df$Cultivar)

We merge the `test_type` column (not included previously, to avoid removing rows with a missing `test_type`).

In [10]:
test_type <- data_ml %>%
    select(index, test_type)
nrow(test_type)

In [11]:
ml_df <- left_join(ml_df, test_type, by = "index") %>% 
                    select(-index, -RendVendableMaxParEssai)
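The join above can be sketched on toy tables. The values are hypothetical; only the `index` and `test_type` column names follow the notebook. It shows why `test_type` is joined after `na.omit()`: a row with a missing `test_type` survives.

```r
library(dplyr)

# Hypothetical complete-case table and lookup table sharing an `index` key
ml_toy <- data.frame(index = c(1, 3, 4), RendVendable = c(30, 41, 28))
type_toy <- data.frame(
  index = c(1, 2, 3, 4),
  test_type = c("N", "P", NA, "K"),
  stringsAsFactors = FALSE
)

# The row with index 3 is kept even though its test_type is missing
joined <- left_join(ml_toy, type_toy, by = "index") %>%
  select(-index)
print(joined)
```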

We export this table as `df_all.csv` for multioutput machine learning in Python.

In [12]:
write_csv(ml_df, "output/df_all.csv")

### Marketable yield
We build a data frame with marketable yield (`RendVendable`) as the sole response variable. The processing is identical to that used for the multitask GP table, but only `RendVendable`, the ID columns and `var_ml` are selected. The `test_type` data frame remains the same.

In [13]:
RendVendable_ml_df <- data_ml %>%
  filter(RendVendableMaxParEssai >= 28) %>%
  select(one_of(c("index", "RendVendable", var_ml, "Annee"))) %>%
  na.omit() %>% # many NAs in pH and OM, and some in yield
  droplevels()  # drop unused factor levels so that a start vector has the correct length
RendVendable_ml_df$Cultivar <- factor(RendVendable_ml_df$Cultivar)

Summary statistics of the marketable yield data, per year:

In [14]:
# Time span filtered trials
min(RendVendable_ml_df$Annee, na.rm = TRUE)
max(RendVendable_ml_df$Annee, na.rm = TRUE)

In [15]:
stat_rv <- RendVendable_ml_df %>%
    select("NoEssai", "NoBloc", "NoTraitement", "Cultivar", "Maturity5", "Annee", "DensitePlants",  
           "growing.season", 'temp_moy_5years', 'prec_tot_5years', 'gdd_5years', "RendVendable") %>%
    group_by(Annee) %>%
    summarise(n_samples = n(), 
              n_trials = n_distinct(NoEssai),
              percent = 100*n_samples/nrow(RendVendable_ml_df), 
              min_nblocs = min(n_distinct(NoBloc)), max_nblocs = max(n_distinct(NoBloc)),
              min_ntreat = min(n_distinct(NoTraitement)), max_ntreat = max(n_distinct(NoTraitement)),
              min_GS = min(growing.season), mean_GS = mean(growing.season), max_GS = max(growing.season),
              min_T = min(temp_moy_5years), mean_T = mean(temp_moy_5years), max_T = max(temp_moy_5years),
              min_PPT = min(prec_tot_5years), mean_PPT = mean(prec_tot_5years), max_PPT = max(prec_tot_5years),
              min_GDD = min(gdd_5years), mean_GDD = mean(gdd_5years), max_GDD = max(gdd_5years),
              min_RV = min(RendVendable), mean_RV = mean(RendVendable), max_RV = max(RendVendable))
stat_rv
write_csv(stat_rv, "output/stat_rv.csv")

Annee,n_samples,n_trials,percent,min_nblocs,max_nblocs,min_ntreat,max_ntreat,min_GS,mean_GS,...,max_T,min_PPT,mean_PPT,max_PPT,min_GDD,mean_GDD,max_GDD,min_RV,mean_RV,max_RV
<int>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1979,10,1,0.1691189,3,3,4,4,116,116.0,...,15.75853,365.48,365.48,365.48,1505.2,1505.2,1505.2,8.5,25.0,31.1
1980,10,1,0.1691189,3,3,4,4,116,116.0,...,16.119,354.92,354.92,354.92,1565.0,1565.0,1565.0,5.6,25.08,32.9
1981,30,3,0.5073567,6,6,4,4,116,116.0,...,16.30964,383.66,405.3867,420.98,1569.48,1721.12,1871.88,14.3,29.61667,40.1
1987,8,1,0.1352951,2,2,4,4,110,110.0,...,16.47429,437.06,437.06,437.06,1815.76,1815.76,1815.76,16.98,34.585,44.23
1993,144,6,2.435312,3,3,27,27,121,124.9375,...,16.79694,412.28,419.0437,424.62,1952.96,2046.755,2082.16,14.071038,36.68887,53.30556
1994,84,3,1.4205987,3,3,10,10,124,124.6429,...,17.27429,439.32,440.6571,441.4,2145.36,2151.056,2154.22,7.468124,36.03904,51.93989
1995,81,3,1.369863,3,3,9,9,106,108.6667,...,17.79911,444.46,450.86,463.66,1890.4,1927.6,2002.0,15.519126,27.61283,40.27322
1996,258,8,4.3632674,6,6,11,11,102,120.686,...,18.68296,371.4,445.9235,529.12,1631.0,2099.052,2264.86,3.333333,34.24359,58.94809
1997,306,8,5.1750381,6,6,22,22,98,116.3333,...,18.81005,343.84,470.699,537.64,1627.28,1990.232,2116.28,2.535918,26.87887,46.434
1998,280,14,4.7353289,6,6,5,5,107,122.7714,...,17.94491,358.68,387.7474,443.46,1983.14,2081.409,2132.68,4.295082,33.63736,61.05
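
In the summary above, `min_nblocs` equals `max_nblocs` (and `min_ntreat` equals `max_ntreat`) in every row shown, because `n_distinct()` returns a single number per group, so taking its `min()` or `max()` changes nothing. If the intent was the smallest and largest number of blocks per trial within a year (an assumption, the notebook does not say), one possible sketch counts the blocks per trial first and then summarises per year. The toy values are hypothetical.

```r
library(dplyr)

# Toy data: in 1996, trial 1 has 2 blocks and trial 2 has 3 blocks
toy <- data.frame(
  Annee = rep(1996, 5),
  NoEssai = c(1, 1, 2, 2, 2),
  NoBloc = c(1, 2, 1, 2, 3)
)

# Step 1: number of distinct blocks in each trial
blocks_per_trial <- toy %>%
  group_by(Annee, NoEssai) %>%
  summarise(n_blocs = n_distinct(NoBloc)) %>%
  ungroup()

# Step 2: range of that count across the trials of each year
per_year <- blocks_per_trial %>%
  group_by(Annee) %>%
  summarise(min_nblocs = min(n_blocs), max_nblocs = max(n_blocs))

print(per_year)
```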


In [16]:
RendVendable_ml_df %>%
    select("NoEssai", "NoBloc", "NoTraitement", "Cultivar", "Maturity5", "Annee", "DensitePlants",  
           "growing.season", 'temp_moy_5years', 'prec_tot_5years', 'gdd_5years', "RendVendable") %>%
    group_by(Maturity5) %>%
    summarise(n_samples = n(), 
              n_trials = n_distinct(NoEssai),
              percent = 100*n_samples/nrow(RendVendable_ml_df))

Maturity5,n_samples,n_trials,percent
<fct>,<int>,<int>,<dbl>
early,248,13,4.194148
early mid-season,741,36,12.53171
late,518,26,8.760359
mid-season,3667,162,62.015897
mid-season late,739,36,12.497886


We merge the `test_type` column (not included previously to avoid removing rows with missing test_type).

In [56]:
test_type <- data_ml %>%
    select(index, test_type)
# join it to the data frame
RendVendable_ml_df <- left_join(RendVendable_ml_df, test_type, by = "index") %>% 
                            select(-index, -RendVendableMaxParEssai, -Annee)

Stats 1: overall structure of __`df_RendVend.csv`__, by test type:

In [60]:
stat1 <- RendVendable_ml_df %>%
    select("NoEssai", "NoBloc", "NoTraitement", "NtotDose", "PtotDose", "KtotDose", 'test_type', "RendVendable") %>%
    group_by(test_type) %>%
    summarise(n_samples = n(), 
              n_trials = n_distinct(NoEssai),
              percent = 100*n_samples/nrow(RendVendable_ml_df), 
              N_min = min(NtotDose), N_max = max(NtotDose),
              P_min = min(PtotDose), P_max = max(PtotDose),
              K_min = min(KtotDose), K_max = max(KtotDose))
stat1

"Factor `test_type` contains implicit NA, consider using `forcats::fct_explicit_na`"


test_type,n_samples,n_trials,percent,N_min,N_max,P_min,P_max,K_min,K_max
<fct>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
K,936,45,15.829528,80,260,75,240,0,420
N,3068,151,51.885676,0,250,88,250,57,270
NPK,591,16,9.994926,0,225,0,300,0,300
P,1300,60,21.985456,0,260,0,300,0,270
NA,18,1,0.304414,218,218,110,110,55,55
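
The warning above is raised because `test_type` has missing values that are not a factor level, and it suggests `forcats::fct_explicit_na`. A minimal base R sketch of the same idea uses `addNA()`. The toy vector and the label `"(Missing)"` are hypothetical choices for illustration.

```r
# Toy factor with one missing value (hypothetical data)
test_type <- factor(c("N", "P", NA, "K", "N"))

# addNA() turns NA into a proper level of the factor
explicit <- addNA(test_type)

# Give that level a readable label
levels(explicit)[is.na(levels(explicit))] <- "(Missing)"

print(table(explicit))
```

Grouping on a factor prepared this way reports the missing group under an explicit label.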


We export this table as __`df_RendVend.csv`__ for GP in Python.

In [61]:
write_csv(RendVendable_ml_df, "output/df_RendVend.csv")

### Proportions for tuber size: `tsizeMS_L`, `tsizeS_M`.

In [62]:
tuberSize <- data_ml %>%
  filter(RendVendableMaxParEssai >= 28) %>%
  select(one_of(c("index", 'tsizeMS_L', 'tsizeS_M', var_ml, "Annee"))) %>%
  na.omit() %>%
  droplevels()

tuberSize$Cultivar <- factor(tuberSize$Cultivar)

In [63]:
stat_ts <- tuberSize %>%
    select("NoEssai", "NoBloc", "NoTraitement", "Cultivar", "Maturity5", "Annee", "DensitePlants",  
           "growing.season", 'temp_moy_5years', 'prec_tot_5years', 'gdd_5years') %>%
    group_by(Annee) %>%
    summarise(n_samples = n(), 
              n_trials = n_distinct(NoEssai),
              percent = 100*n_samples/nrow(tuberSize), 
              min_nblocs = min(n_distinct(NoBloc)), max_nblocs = max(n_distinct(NoBloc)),
              min_ntreat = min(n_distinct(NoTraitement)), max_ntreat = max(n_distinct(NoTraitement)),
              min_GS = min(growing.season), mean_GS = mean(growing.season), max_GS = max(growing.season),
              min_T = min(temp_moy_5years), mean_T = mean(temp_moy_5years), max_T = max(temp_moy_5years),
              min_PPT = min(prec_tot_5years), mean_PPT = mean(prec_tot_5years), max_PPT = max(prec_tot_5years),
              min_GDD = min(gdd_5years), mean_GDD = mean(gdd_5years), max_GDD = max(gdd_5years)
              )
stat_ts
write_csv(stat_ts, "output/stat_ts.csv")

Annee,n_samples,n_trials,percent,min_nblocs,max_nblocs,min_ntreat,max_ntreat,min_GS,mean_GS,max_GS,min_T,mean_T,max_T,min_PPT,mean_PPT,max_PPT,min_GDD,mean_GDD,max_GDD
<int>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1993,117,5,2.5674786,3,3,9,9,121,125.1538,127,16.45143,16.5776,16.63368,418.82,420.6046,424.62,1952.96,2038.585,2076.64
1994,84,3,1.843318,3,3,10,10,124,124.6429,125,17.20701,17.23104,17.27429,439.32,440.6571,441.4,2145.36,2151.056,2154.22
1995,81,3,1.7774852,3,3,9,9,106,108.6667,114,17.52958,17.70927,17.79911,444.46,450.86,463.66,1890.4,1927.6,2002.0
1996,60,2,1.3166557,6,6,5,5,102,118.0,134,15.96156,16.42587,16.89019,378.04,453.58,529.12,1631.0,1947.93,2264.86
1997,198,4,4.3449638,3,3,22,22,123,123.0,123,16.90796,17.03372,17.18476,518.32,529.77,537.64,2083.04,2097.78,2116.28
1998,184,8,4.0377441,6,6,5,5,131,131.0,131,17.94491,17.94491,17.94491,358.68,358.68,358.68,2132.68,2132.68,2132.68
1999,288,11,6.3199473,6,6,14,14,120,125.3611,128,16.98877,17.82402,18.26183,377.4,413.6835,506.56,2042.16,2100.595,2165.4
2000,184,8,4.0377441,6,6,5,5,122,122.0,122,18.7281,18.7281,18.7281,279.8,279.8,279.8,1833.22,1833.22,1833.22
2002,165,7,3.6208032,4,4,11,11,98,117.4667,137,16.37548,16.68475,16.94787,308.12,397.5435,504.18,1571.58,1869.148,2225.96
2003,238,15,5.2227343,3,3,10,10,98,111.084,122,16.51102,17.15811,18.5081,296.16,350.3797,394.76,1626.76,1814.324,2035.1


In [64]:
tuberSize <- left_join(tuberSize, test_type, by = "index") %>% 
                    select(-index, -RendVendableMaxParEssai, -Annee)

Overall structure of __`df_tuberSize.csv`__, by test type:

In [65]:
stat2 <- tuberSize %>%
    select("NoEssai", "NoBloc", "NoTraitement", "NtotDose", "PtotDose", "KtotDose", 'test_type') %>%
    group_by(test_type) %>%
    summarise(n_samples = n(), 
              n_trials = n_distinct(NoEssai),
              percent = 100*n_samples/nrow(tuberSize), 
              N_min = min(NtotDose), N_max = max(NtotDose),
              P_min = min(PtotDose), P_max = max(PtotDose),
              K_min = min(KtotDose), K_max = max(KtotDose))
stat2

"Factor `test_type` contains implicit NA, consider using `forcats::fct_explicit_na`"


test_type,n_samples,n_trials,percent,N_min,N_max,P_min,P_max,K_min,K_max
<fct>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
K,901,43,19.7717797,80,220,75,200,0,300
N,2378,122,52.183454,0,250,100,216,57,270
NPK,363,9,7.965767,0,225,0,300,0,300
P,897,33,19.6840026,110,210,0,300,0,270
NA,18,1,0.3949967,218,218,110,110,55,55


We export this table as __`df_tuberSize.csv`__ for GP in Python.

In [66]:
write_csv(tuberSize, "output/df_tuberSize.csv")

### Specific gravity, `PoidsSpec` as response variable

In [67]:
PoidsSpec_ml_df <- data_ml %>%
  filter(RendVendableMaxParEssai >= 28) %>%
  select(one_of(c("index", "PoidsSpec", var_ml, "Annee"))) %>%
  na.omit() %>%
  droplevels()

PoidsSpec_ml_df$Cultivar <- factor(PoidsSpec_ml_df$Cultivar)

In [68]:
stat_sg <- PoidsSpec_ml_df %>%
    select("NoEssai", "NoBloc", "NoTraitement", "Cultivar", "Maturity5", "Annee", "DensitePlants",  
           "growing.season", 'temp_moy_5years', 'prec_tot_5years', 'gdd_5years') %>%
    group_by(Annee) %>%
    summarise(n_samples = n(), 
              n_trials = n_distinct(NoEssai),
              percent = 100*n_samples/nrow(PoidsSpec_ml_df), 
              min_nblocs = min(n_distinct(NoBloc)), max_nblocs = max(n_distinct(NoBloc)),
              min_ntreat = min(n_distinct(NoTraitement)), max_ntreat = max(n_distinct(NoTraitement)),
              min_GS = min(growing.season), mean_GS = mean(growing.season), max_GS = max(growing.season),
              min_T = min(temp_moy_5years), mean_T = mean(temp_moy_5years), max_T = max(temp_moy_5years),
              min_PPT = min(prec_tot_5years), mean_PPT = mean(prec_tot_5years), max_PPT = max(prec_tot_5years),
              min_GDD = min(gdd_5years), mean_GDD = mean(gdd_5years), max_GDD = max(gdd_5years)
              )
stat_sg
write_csv(stat_sg, "output/stat_sg.csv")

Annee,n_samples,n_trials,percent,min_nblocs,max_nblocs,min_ntreat,max_ntreat,min_GS,mean_GS,max_GS,min_T,mean_T,max_T,min_PPT,mean_PPT,max_PPT,min_GDD,mean_GDD,max_GDD
<int>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1993,144,6,3.3850494,3,3,27,27,121,124.9375,127,16.45143,16.61873,16.79694,412.28,419.0437,424.62,1952.96,2046.755,2082.16
1994,84,3,1.9746121,3,3,10,10,124,124.6429,125,17.20701,17.23104,17.27429,439.32,440.6571,441.4,2145.36,2151.056,2154.22
1995,81,3,1.9040903,3,3,9,9,106,108.6667,114,17.52958,17.70927,17.79911,444.46,450.86,463.66,1890.4,1927.6,2002.0
1996,118,8,2.7738599,6,6,11,11,102,119.8729,134,15.96156,17.09373,18.68296,371.4,449.8437,529.12,1631.0,2048.192,2264.86
1997,306,8,7.1932299,6,6,22,22,98,116.3333,123,15.76802,17.15782,18.81005,343.84,470.699,537.64,1627.28,1990.232,2116.28
1998,273,14,6.4174894,6,6,5,5,107,122.7363,131,17.06717,17.64268,17.94491,358.68,387.8716,443.46,1983.14,2081.19,2132.68
1999,285,11,6.6995769,6,6,14,14,120,125.4175,128,16.98877,17.83282,18.26183,377.4,413.1774,506.56,2042.16,2101.21,2165.4
2000,183,8,4.3018336,6,6,5,5,122,122.0,122,18.7281,18.7281,18.7281,279.8,279.8,279.8,1833.22,1833.22,1833.22
2002,68,2,1.5984955,4,4,11,11,137,137.0,137,16.94787,16.94787,16.94787,504.18,504.18,504.18,2225.96,2225.96,2225.96
2003,154,14,3.6201222,3,3,10,10,98,111.5974,122,16.51102,17.36567,18.5081,296.16,351.6496,394.76,1626.76,1839.95,2035.1


In [69]:
PoidsSpec_ml_df <- left_join(PoidsSpec_ml_df, test_type, by = "index") %>% 
                    select(-index, -RendVendableMaxParEssai, -Annee)

Overall structure of __`df_PoidsSpec.csv`__, by test type:

In [70]:
stat3 <- PoidsSpec_ml_df %>%
    select("NoEssai", "NoBloc", "NoTraitement", "NtotDose", "PtotDose", "KtotDose", 'test_type') %>%
    group_by(test_type) %>%
    summarise(n_samples = n(), 
              n_trials = n_distinct(NoEssai),
              percent = 100*n_samples/nrow(PoidsSpec_ml_df), 
              N_min = min(NtotDose), N_max = max(NtotDose),
              P_min = min(PtotDose), P_max = max(PtotDose),
              K_min = min(KtotDose), K_max = max(KtotDose))
stat3

"Factor `test_type` contains implicit NA, consider using `forcats::fct_explicit_na`"


test_type,n_samples,n_trials,percent,N_min,N_max,P_min,P_max,K_min,K_max
<fct>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
K,880,42,20.6864128,80,260,75,240,0,420
N,1956,117,45.9802539,0,250,88,215,57,270
NPK,410,16,9.6379878,0,225,0,300,0,300
P,990,38,23.2722144,110,260,0,300,0,270
NA,18,1,0.4231312,218,218,110,110,55,55


We export this table as __`df_PoidsSpec.csv`__ for GP in Python.

In [71]:
write_csv(PoidsSpec_ml_df, "output/df_PoidsSpec.csv")

### Common table for trial selection in the Python code

In [68]:
sg_inliers <- read.csv('output/sg_inliers.csv') # after the deletion of outliers in the dataset used for specific gravity (1.4.4).

In [23]:
common_df <- RendVendable_ml_df %>%
    filter(NoEssai %in% tuberSize$NoEssai) %>%
    filter(NoEssai %in% sg_inliers$NoEssai) %>%
    select(keys_col, dose_vars, RendVendable)
write_csv(common_df, 'output/common_df.csv')
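
The two `%in%` filters above keep only the rows whose trial appears in all three tables. A minimal base R sketch of that logic follows, on hypothetical trial numbers.

```r
# Hypothetical trial numbers present in each table
yield_trials <- c(1, 1, 2, 3, 4)
size_trials <- c(1, 2, 4)
sg_trials <- c(2, 4, 5)

# Keep the yield rows whose trial appears in both other tables
keep <- yield_trials %in% size_trials & yield_trials %in% sg_trials
print(yield_trials[keep])

# Same set of trials, obtained by successive intersections
common <- Reduce(intersect, list(yield_trials, size_trials, sg_trials))
print(common)
```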

### Other information for the article

In [24]:
t(RendVendable_ml_df %>%
    group_by(test_type) %>%
    select(DensitePlants) %>%
    summarize_all(list(median)))

"Factor `test_type` contains implicit NA, consider using `forcats::fct_explicit_na`"
Adding missing grouping variables: `test_type`


0,1,2,3,4,5
test_type,K,N,NPK,P,NA
DensitePlants,36430,36036,43716,33118,31223.0


In [25]:
summary(RendVendable_ml_df %>% 
            select(RendVendable, growing.season, DensitePlants, temp_moy_5years, prec_tot_5years, soil_pH, soil_P, soil_K, soil_Al))

  RendVendable    growing.season  DensitePlants   temp_moy_5years
 Min.   : 2.536   Min.   : 91.0   Min.   :26667   Min.   :11.75  
 1st Qu.:26.284   1st Qu.:113.0   1st Qu.:31227   1st Qu.:16.67  
 Median :32.557   Median :122.0   Median :36430   Median :17.12  
 Mean   :33.208   Mean   :120.4   Mean   :37580   Mean   :17.20  
 3rd Qu.:39.426   3rd Qu.:130.0   3rd Qu.:43716   3rd Qu.:17.80  
 Max.   :86.157   Max.   :143.0   Max.   :54645   Max.   :19.90  
 prec_tot_5years    soil_pH          soil_P            soil_K      
 Min.   :  0.0   Min.   :4.617   Min.   :  2.153   Min.   : 10.06  
 1st Qu.:358.7   1st Qu.:5.400   1st Qu.: 49.038   1st Qu.: 69.15  
 Median :412.3   Median :5.550   Median :110.000   Median :122.00  
 Mean   :402.6   Mean   :5.645   Mean   :144.843   Mean   :134.43  
 3rd Qu.:461.8   3rd Qu.:5.876   3rd Qu.:169.000   3rd Qu.:179.62  
 Max.   :580.9   Max.   :6.986   Max.   :667.500   Max.   :497.18  
    soil_Al      
 Min.   : 343.2  
 1st Qu.:1280.8  
 Median 

In [28]:
(r_sample <- read.csv('output/r_sample.csv'))
(test_types <- read.csv('output/test_types.csv'))

X,NoEssai,test_type,ID
<int>,<int>,<fct>,<fct>
2540,194,P,194-2-4


X,NoEssai,test_type,ID
<int>,<int>,<fct>,<fct>
1745,8804,N,8804-5-5
5524,412,P,412-3-4
3843,320,K,320-3-3


In [27]:
features <- c("NoEssai", "test_type", "Cultivar", "Maturity5", "growing.season", 
              "DensitePlants", 'temp_moy_5years', 'prec_tot_5years', 
              "soil_pH", "soil_P", "soil_K", "soil_Al", "ISP1", "Texture",
              "N_minDoseTrial", "N_maxDoseTrial", "P_minDoseTrial",
              "P_maxDoseTrial", "K_minDoseTrial", "K_maxDoseTrial")
data_ml %>% 
    filter(NoEssai %in% c(r_sample$NoEssai, test_types$NoEssai) & NoBloc == 1 & NoTraitement == 1) %>% 
    select(features) %>%
    t()

0,1,2,3,4
NoEssai,8804,194,320,412
test_type,N,P,K,P
Cultivar,FL 1533,Superior,Krantz,Goldrush
Maturity5,mid-season,early mid-season,mid-season,mid-season
growing.season,131,102,112,108
DensitePlants,43716,36430,31226,36433
temp_moy_5years,17.94491,15.96156,17.64958,16.39004
prec_tot_5years,358.68,378.04,447.88,363.16
soil_pH,5.5333,5.5000,6.1307,5.7599
soil_P,175.00000,22.56000,348.60023,46.33619


In [23]:
ts <- data_ml %>%
  filter(RendVendableMaxParEssai >= 28) %>%
  select(one_of(c("index", 'tsizeMS_L', 'tsizeS_M', 
                  "RendGros", "RendMoy", "RendPetit",
                  var_ml))) %>%
  na.omit() %>%
  droplevels()

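In [24]:
# Proportion of rows where the large-size yield (RendGros) is zero
ts %>% filter(RendGros == 0) %>% nrow() / nrow(ts)

In [25]:
# Proportion of rows where the medium-size yield (RendMoy) is zero
ts %>% filter(RendMoy == 0) %>% nrow() / nrow(ts)

In [26]:
# Proportion of rows where the small-size yield (RendPetit) is zero
ts %>% filter(RendPetit == 0) %>% nrow() / nrow(ts)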