# Data Gathering & ETL

## Data Sources

The training data is taken from two published sources:

1. The CROPtime model developed by Oregon State University [^1], and
2. 'The effects of temperature on flowering, fruit set and fruit development of glasshouse sweet pepper (Capsicum annuum L.)', published in 1989 [^2].

The training data covers two classes:

* 1 - Field-grown: The C. annuum crop was grown outdoors with no temperature control.
* 2 - Hothouse: The C. annuum crop was grown in a heated structure under artificial lights.

## Sample Size

Unfortunately, this data is very sparse: the field-grown cohort covers only 4 measured crops and the hothouse cohort only 10 samples. The lack of published data is likely due to the way most Capsicum varieties are cultivated and harvested. Outside of a few regions, crops are grown in hothouses where all growth factors are controlled, and the fruits are harvested unripe so that they can be sold green at market. As a result, there is little data on the degree days, temperatures, humidity, and irrigation or soil moisture values required for the plants to produce mature fruits (sweet or hot).

## Processing Raw Data - Finding GDDs for Each Trial

The first task is to convert the field trial data into the format required: a table of **GDDC** and **Days_to_maturity** values, where

* **GDDC** is the total number of growing degree days (calculated in degrees Celsius) experienced by the plants before reaching fruit maturity, and
* **Days_to_maturity** is the number of days elapsed from transplant until the first fruits matured.

The tables given by the study provide the columns **GDDF** and **Days_to_maturity**, where **GDDF** is the number of growing degree days measured in degrees Fahrenheit:

In [1]:
import pandas

field_raw = pandas.read_csv('annuum-field-raw.csv')
field_raw.head()

Unnamed: 0,Gddf_total,Days_to_maturity
0,1998,84
1,1692,79
2,1767,73
3,1682,78


The formula to convert to growing degree days Celsius is:
```
GDDC = (5/9) * GDDF
```
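
The conversion needs only the 5/9 scale factor and no 32° offset, because a degree day measures a temperature difference above the base temperature. A short derivation, assuming the Fahrenheit totals were accumulated with the simple averaging formula and the same base temperature expressed in °F (the symbols $T_{\mathrm{avg}}$ and $T_{\mathrm{base}}$ are introduced here for illustration):

$$
\begin{aligned}
\mathrm{GDDF}_{\mathrm{daily}} &= T_{\mathrm{avg},F} - T_{\mathrm{base},F} \\
&= \left(\tfrac{9}{5}\,T_{\mathrm{avg},C} + 32\right) - \left(\tfrac{9}{5}\,T_{\mathrm{base},C} + 32\right) \\
&= \tfrac{9}{5}\left(T_{\mathrm{avg},C} - T_{\mathrm{base},C}\right) = \tfrac{9}{5}\,\mathrm{GDDC}_{\mathrm{daily}}
\end{aligned}
$$

The two offsets cancel, so summing over all days gives $\mathrm{GDDC} = \tfrac{5}{9}\,\mathrm{GDDF}$. For the first field trial, $1998 \times \tfrac{5}{9} = 1110$.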

In [2]:
Gddc_total = field_raw['Gddf_total'] * (5.0/9.0)
field_raw.insert(1, 'Gddc_total', Gddc_total)
field_raw.head()

Unnamed: 0,Gddf_total,Gddc_total,Days_to_maturity
0,1998,1110.0,84
1,1692,940.0,79
2,1767,981.666667,73
3,1682,934.444444,78


The hothouse trial data is already given in degrees Celsius, so we can directly compute growing degree days Celsius. The formula for daily growing degree days is

```
GDDC_daily = ((daily_temp_max_c + daily_temp_min_c) / 2) - base_temp_c
```

For this study, **daily_temp_max_c** is taken to be the highest daily temperature given, **daily_temp_min_c** the lowest daily temperature given, and **base_temp_c** is the temperature below which no plant growth occurs, given as 52 °F (11.1 °C) by the CROPtime model.
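
As a quick check of the formula, the sketch below evaluates it for a single day in plain Python. The function name `gddc_daily` and the constant `BASE_TEMP_C` are hypothetical names chosen for this illustration, and the example temperatures are simply plausible inputs.

```python
# Base temperature from the text: 52 °F, i.e. about 11.1 °C.
BASE_TEMP_C = 11.1  # hypothetical constant name


def gddc_daily(temp_max_c, temp_min_c, base_temp_c=BASE_TEMP_C):
    """Daily growing degree days (Celsius) by the simple averaging formula.

    The formula is applied exactly as written above, so a mean temperature
    below the base temperature yields a negative value (no clamping).
    """
    return (temp_max_c + temp_min_c) / 2 - base_temp_c


# A day with a maximum of 24.5 °C and a minimum of 15.0 °C
example = gddc_daily(24.5, 15.0)
print(round(example, 2))  # 8.65
```

Keeping the base temperature as a named default makes it easy to try a different threshold without touching the formula itself.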

The total number of growing degree days Celsius (GDDC_total) is then calculated by multiplying GDDC_daily by the number of days until the first fruits matured:

In [3]:
hothouse_raw = pandas.read_csv('annuum-hothouse-raw.csv')

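# Daily GDD: mean of the day and night temperatures minus the 11.1 °C base temperature
Gddc_daily = ((hothouse_raw['Day_temp_c'] + hothouse_raw['Night_temp_c']) / 2) - 11.1
hothouse_raw.insert(2, 'Gddc_daily', Gddc_daily)

# Total GDD: daily GDD multiplied by the number of days to maturity
Gddc_total = hothouse_raw['Gddc_daily'] * hothouse_raw['Days_to_maturity']
hothouse_raw.insert(3, 'Gddc_total', Gddc_total)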
hothouse_raw.head(10)

Unnamed: 0,Day_temp_c,Night_temp_c,Gddc_daily,Gddc_total,Days_to_maturity
0,18.5,20.6,8.45,618.54,73.2
1,21.1,15.4,7.15,607.75,85.0
2,21.1,17.8,8.35,667.165,79.9
3,21.1,20.6,9.75,625.95,64.2
4,24.7,14.0,8.25,645.975,78.3
5,24.5,15.0,8.65,543.22,62.8
6,24.5,17.9,10.1,609.03,60.3
7,24.7,20.7,11.6,671.64,57.9
8,27.9,15.4,10.55,731.115,69.3
9,27.9,20.8,13.25,755.25,57.0
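
Multiplying is valid here only because each hothouse treatment is treated as having one day temperature and one night temperature for the whole trial. That is an assumption of this calculation, not something the code checks. Under it, the sum of daily values collapses to a product, where $N$ stands for **Days_to_maturity**:

$$
\mathrm{GDDC}_{\mathrm{total}} = \sum_{d=1}^{N} \mathrm{GDDC}_{\mathrm{daily}} = N \cdot \mathrm{GDDC}_{\mathrm{daily}}
$$

For the first row of the table: $\tfrac{18.5 + 20.6}{2} - 11.1 = 8.45$ and $8.45 \times 73.2 = 618.54$.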


Now we can add class labels and concatenate the data into one table:

In [4]:
field_raw['Label'] = 1
hothouse_raw['Label'] = 2

# TODO: Keep hothouse data later to create synthetic weather readings - this is for demo only
hothouse_raw = hothouse_raw.drop(columns=['Day_temp_c','Night_temp_c', 'Gddc_daily'])
field_raw = field_raw.drop(columns=['Gddf_total'])

annuum_raw = pandas.concat([field_raw, hothouse_raw])
annuum_raw.to_csv('annuum_raw.csv',index=False)
annuum_raw.head(14)


Unnamed: 0,Gddc_total,Days_to_maturity,Label
0,1110.0,84.0,1
1,940.0,79.0,1
2,981.666667,73.0,1
3,934.444444,78.0,1
0,618.54,73.2,2
1,607.75,85.0,2
2,667.165,79.9,2
3,625.95,64.2,2
4,645.975,78.3,2
5,543.22,62.8,2
6,609.03,60.3,2
7,671.64,57.9,2
8,731.115,69.3,2
9,755.25,57.0,2


We can use the resulting file **annuum_raw.csv** for further preprocessing and training.
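
Before moving on, it can help to confirm that the file holds what we expect. The sketch below is one way to do that, assuming the CSV contains the 14 rows computed above. To stay self-contained it reads the values from an inline string (rounded as displayed in this notebook) rather than from disk, and the variable names are illustrative.

```python
import io

import pandas

# Stand-in for the contents of annuum_raw.csv. The file was written with
# index=False, so it has no index column.
csv_text = """Gddc_total,Days_to_maturity,Label
1110.0,84.0,1
940.0,79.0,1
981.666667,73.0,1
934.444444,78.0,1
618.54,73.2,2
607.75,85.0,2
667.165,79.9,2
625.95,64.2,2
645.975,78.3,2
543.22,62.8,2
609.03,60.3,2
671.64,57.9,2
731.115,69.3,2
755.25,57.0,2
"""

annuum = pandas.read_csv(io.StringIO(csv_text))

# Count the samples in each class: 1 = field-grown, 2 = hothouse.
rows_per_label = annuum.groupby('Label').size().to_dict()
print(rows_per_label)
```

Because the index was not written to the file, reading it back assigns a fresh index from 0 to 13, so the repeated labels visible in the concatenated table do not carry over.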

In the next step, we will pull the historical data for the field test and create synthetic weather data for the hothouse crop.

## Bibliography

[^1]: CROPtime by Oregon State University. https://smallfarms.oregonstate.edu/smallfarms/crops/croptime

[^2]: Bakker, J. C. (1989). *The effects of temperature on flowering, fruit set and fruit development of glasshouse sweet pepper (Capsicum annuum L.)*. Journal of Horticultural Science, 64(3), 313-320.