# Data Gathering & ETL

## Data Sources

The training data is taken from two published sources:

1. The CROPtime model developed by Oregon State University [^1], and
2. 'The effects of temperature on flowering, fruit set and fruit development of glasshouse sweet pepper (Capsicum annuum L.)', published in 1989 [^2].

The training data covers two classes:

* 1 - Field-grown: The C. annuum crop was grown outdoors with no temperature control.
* 2 - Hothouse: The C. annuum crop was grown in a heated structure under artificial lights.

## Sample Size

The data is unfortunately very sparse: the field-grown cohort covers only 4 measured crops and the hothouse cohort covers 10. The lack of published data is likely due to the way most Capsicum varieties are cultivated and harvested. Outside of a few regions, crops are grown in hothouses where all growth factors are controlled, and the fruits are harvested unripe so that they can be sold green at market. As a result, there is little data on the degree days or temperatures, humidity, and irrigation or soil moisture required for the plants to produce mature fruits (sweet or hot).

## Processing Raw Data - Finding GDDs

The first task is to convert the hothouse data into the format required: a table of **GDDF** and **Days_to_maturity** values, where

* **GDDF** is the number of total growing degree days (calculated in degrees Fahrenheit) experienced by the plants before reaching fruit maturity, and
* **Days_to_maturity** is the number of days elapsed from transplant until the fruits matured.

The tables in the study provide the columns **Day_temp**, **Night_temp**, and **Days_to_maturity**. In this study the temperature was held constant at one of two set levels, one for the day and one for the night, so for our purposes we can treat the higher of the two values as the maximum temperature and the lower as the minimum temperature experienced by the plants each day.

Here is the raw table:

In [1]:
import pandas

hothouse_raw = pandas.read_csv('annuum-hothouse-raw.csv')
hothouse_raw.head(10)

Unnamed: 0,Day_temp_c,Night_temp_c,Days_to_maturity
0,18.5,20.6,73.2
1,21.1,15.4,85.0
2,21.1,17.8,79.9
3,21.1,20.6,64.2
4,24.7,14.0,78.3
5,24.5,15.0,62.8
6,24.5,17.9,60.3
7,24.7,20.7,57.9
8,27.9,15.4,69.3
9,27.9,20.8,57.0


The formula for calculating growing degree days is given as: **GDDC = ((Temp_max_c + Temp_min_c) / 2) - Base_c**, where **GDDC** is the growing degree days calculated in Celsius and **Base_c** is the temperature below which no plant growth will occur, given as **52 °F (11.1 °C)**.
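
As a quick check of the formula, the sketch below computes the daily value for the second row of the raw table (day 21.1 °C, night 15.4 °C). The function name `daily_gddc` and the constant `BASE_C` are illustrative and do not appear in the original notebook; the base of 11.1 °C is the value stated above. Because the formula averages the two temperatures, it does not matter which of the two readings is passed as the maximum.

```python
BASE_C = 11.1  # base temperature in Celsius (52 °F), as stated in the text


def daily_gddc(temp_max_c: float, temp_min_c: float, base_c: float = BASE_C) -> float:
    """Growing degree days (Celsius) for one day: mean temperature minus base."""
    return (temp_max_c + temp_min_c) / 2 - base_c


# Second row of the raw table: day 21.1 °C, night 15.4 °C.
gddc = daily_gddc(21.1, 15.4)
print(round(gddc, 2))  # 7.15

# Swapping the arguments gives the same result, since only the mean is used.
assert daily_gddc(15.4, 21.1) == gddc
```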

Once the daily growing degree days are calculated, we can multiply by the recorded number of days to maturity to find the total GDDs (in Celsius) experienced:


In [2]:
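# Daily GDD in Celsius: mean of day and night temperature minus the 11.1 °C base
Gddc_daily = ((hothouse_raw['Day_temp_c'] + hothouse_raw['Night_temp_c']) / 2) - 11.1
hothouse_raw.insert(2, 'Gddc_daily', Gddc_daily)

# Total GDD in Celsius: daily GDD multiplied by the number of days to maturity
Gddc_total = hothouse_raw['Gddc_daily'] * hothouse_raw['Days_to_maturity']
hothouse_raw.insert(3, 'Gddc_total', Gddc_total)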

hothouse_raw.head(10)

Unnamed: 0,Day_temp_c,Night_temp_c,Gddc_daily,Gddc_total,Days_to_maturity
0,18.5,20.6,8.45,618.54,73.2
1,21.1,15.4,7.15,607.75,85.0
2,21.1,17.8,8.35,667.165,79.9
3,21.1,20.6,9.75,625.95,64.2
4,24.7,14.0,8.25,645.975,78.3
5,24.5,15.0,8.65,543.22,62.8
6,24.5,17.9,10.1,609.03,60.3
7,24.7,20.7,11.6,671.64,57.9
8,27.9,15.4,10.55,731.115,69.3
9,27.9,20.8,13.25,755.25,57.0


The formula to convert to growing degree days Fahrenheit is **GDDF = (9/5) GDDC**, so we can find total GDDF by multiplying total GDDC by **1.8**:

In [3]:
Gddf_total = hothouse_raw['Gddc_total'] * 1.8
hothouse_raw.insert(4, 'Gddf_total', Gddf_total)

hothouse_raw.head(10)

Unnamed: 0,Day_temp_c,Night_temp_c,Gddc_daily,Gddc_total,Gddf_total,Days_to_maturity
0,18.5,20.6,8.45,618.54,1113.372,73.2
1,21.1,15.4,7.15,607.75,1093.95,85.0
2,21.1,17.8,8.35,667.165,1200.897,79.9
3,21.1,20.6,9.75,625.95,1126.71,64.2
4,24.7,14.0,8.25,645.975,1162.755,78.3
5,24.5,15.0,8.65,543.22,977.796,62.8
6,24.5,17.9,10.1,609.03,1096.254,60.3
7,24.7,20.7,11.6,671.64,1208.952,57.9
8,27.9,15.4,10.55,731.115,1316.007,69.3
9,27.9,20.8,13.25,755.25,1359.45,57.0
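
The factor of 1.8 works without the usual +32 offset because that offset appears in both the mean temperature and the base temperature, so it cancels in the subtraction. The sketch below checks this numerically for one row. The helper name `c_to_f` is illustrative, and the base is converted exactly from 11.1 °C (51.98 °F) rather than rounded to 52 °F, which is an assumption made so that the two routes can be compared directly.

```python
import math


def c_to_f(temp_c: float) -> float:
    """Convert a temperature (not a temperature difference) to Fahrenheit."""
    return temp_c * 9 / 5 + 32


# Second row of the hothouse table: day 21.1 °C, night 15.4 °C, base 11.1 °C.
day_c, night_c, base_c = 21.1, 15.4, 11.1

# Route 1: compute daily GDD in Celsius, then scale by 9/5.
gddc = (day_c + night_c) / 2 - base_c
gddf_scaled = gddc * 9 / 5

# Route 2: convert every temperature to Fahrenheit first, then apply the formula.
gddf_direct = (c_to_f(day_c) + c_to_f(night_c)) / 2 - c_to_f(base_c)

# The +32 offsets cancel, so both routes give the same daily value.
assert math.isclose(gddf_scaled, gddf_direct, rel_tol=1e-9)
print(round(gddf_scaled, 2))  # 12.87
```

Using the rounded base of 52 °F in the second route instead of 51.98 °F would lower each daily value by about 0.02 GDDF.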


The field-grown data already consists of total growing degree days (Fahrenheit) and days to maturity for each of the four crops grown:

In [4]:
field_raw = pandas.read_csv('annuum-field-raw.csv')
field_raw.head()

Unnamed: 0,Gddf_total,Days_to_maturity
0,1998,84
1,1692,79
2,1767,73
3,1682,78


Once we add class labels and concatenate the data into one table, we have:

In [5]:
field_raw['Label'] = 1
hothouse_raw['Label'] = 2

hothouse_raw = hothouse_raw.drop(columns=['Day_temp_c','Night_temp_c', 'Gddc_daily', 'Gddc_total'])

annuum_raw = pandas.concat([field_raw, hothouse_raw])
annuum_raw.to_csv('annuum_raw.csv',index=False)
annuum_raw.head(14)

Unnamed: 0,Gddf_total,Days_to_maturity,Label
0,1998.0,84.0,1
1,1692.0,79.0,1
2,1767.0,73.0,1
3,1682.0,78.0,1
0,1113.372,73.2,2
1,1093.95,85.0,2
2,1200.897,79.9,2
3,1126.71,64.2,2
4,1162.755,78.3,2
5,977.796,62.8,2
6,1096.254,60.3,2
7,1208.952,57.9,2
8,1316.007,69.3,2
9,1359.45,57.0,2


We can use the resulting file **annuum_raw.csv** for further preprocessing and training.
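
Before training, it may be worth confirming that the combined file has the expected shape: 4 field-grown rows and 10 hothouse rows. The sketch below shows one way such a check might look; it is not part of the original notebook. The function name `count_labels` is hypothetical, and it is demonstrated on a small in-memory sample built from rows shown in the table above rather than on the real file.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file) -> Counter:
    """Count rows per class label in a CSV file that has a 'Label' column."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row["Label"]) for row in reader)


# In-memory sample using a few rows from the combined table.
sample = io.StringIO(
    "Gddf_total,Days_to_maturity,Label\n"
    "1998.0,84.0,1\n"
    "1692.0,79.0,1\n"
    "1113.372,73.2,2\n"
)
counts = count_labels(sample)
print(counts[1], counts[2])  # 2 1
```

Run against the real file, for example with `count_labels(open('annuum_raw.csv', newline=''))`, the counts should come out as 4 for label 1 and 10 for label 2 if the concatenation worked as described.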

## Bibliography

[^1]: CROPtime by Oregon State University. https://smallfarms.oregonstate.edu/smallfarms/crops/croptime

[^2]: Bakker, J. C. (1989). *The effects of temperature on flowering, fruit set and fruit development of glasshouse sweet pepper (Capsicum annuum L.)*. Journal of Horticultural Science, 64(3), 313-320.