# Clustering Time Series

In addition to finding clusters in images, maps, and conventional tabular datasets based on a list of object properties, clustering techniques can also be applied to time series. In this example we will identify distinct regions of the U.S. Southern Great Plains according to the timing of soil water deficit around the period of winter wheat anthesis. This phenological period is known to be the most susceptible to drought conditions, so soil water deficits during this stage can reduce crop yields.

The time series were generated using the Simple Simulation crop Model (SSM, Soltani and Sinclair). The resulting soil water deficit series were then discretized into intervals of biological growing degree days (BD), so that every series has the same length and can be compared point by point.
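The notebook does not show how the discretization was done. As a rough illustration of the idea only, the sketch below puts series of different lengths onto a shared grid with linear interpolation. The interpolation method, the grid values, the function name `resample_to_grid`, and the two made-up series are all assumptions, not the procedure used to build the dataset.

```python
import numpy as np

# Hypothetical shared grid of cumulative biological days (assumed values)
common_bd = np.array([14.0, 18.0, 22.0, 26.0])

def resample_to_grid(bd, ftsw, grid=common_bd):
    """Linearly interpolate one FTSW series onto a shared grid (assumed method)."""
    return np.interp(grid, bd, ftsw)

# Two made-up simulations with different lengths and time steps
bd_a = np.array([10.0, 20.0, 30.0])
ftsw_a = np.array([0.8, 0.6, 0.2])
bd_b = np.array([12.0, 16.0, 20.0, 24.0, 28.0])
ftsw_b = np.array([0.9, 0.7, 0.5, 0.3, 0.1])

# After resampling, both series have the same length and can be stacked
X_demo = np.vstack([resample_to_grid(bd_a, ftsw_a),
                    resample_to_grid(bd_b, ftsw_b)])
print(X_demo.shape)
```

Once every series shares the same grid, each row of the stacked array can be treated as one observation with a fixed number of features, which is the input layout that K-means expects.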

In the dataset, each row is a site-year and each `BD` column is a window of cumulative biological days since anthesis. The soil water deficit is expressed as the fraction of transpirable soil water (FTSW), which ranges from 1 (no water-limiting conditions) to 0 (no soil water available to plant roots).

In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from bokeh.plotting import figure, show, output_notebook
output_notebook()

In [2]:
# Load data
df = pd.read_csv("../datasets/wheat_ftsw.csv")
df.head()


Unnamed: 0,climate_class,city,state,year,BD-14,BD-18,BD-22,BD-26,BD-31,BD-35,BD-39,BD-43,BD-47,BD-51,BD-55,BD-60,BD-64,BD-68,BD-72,yield
0,semi-arid,Akron,CO,1986,0.642,0.707,0.468,0.279,0.285,0.134,0.249,0.225,0.082,0.039,0.031,0.024,0.033,0.068,0.042,2952
1,semi-arid,Byers,CO,1989,0.461,0.486,0.5,0.33,0.13,0.062,0.025,0.092,0.039,0.044,0.031,0.01,0.003,0.007,0.002,1715
2,semi-arid,Lamar,CO,1992,0.796,0.675,0.582,0.395,0.278,0.163,0.155,0.159,0.111,0.157,0.08,0.028,0.034,0.086,0.096,3437
3,semi-arid,Sedgwick,CO,1996,0.775,0.821,0.761,0.523,0.289,0.136,0.059,0.088,0.157,0.201,0.15,0.206,0.126,0.108,0.117,4091
4,semi-arid,Colby,KS,1999,0.714,0.611,0.66,0.397,0.214,0.177,0.099,0.054,0.031,0.012,0.008,0.004,0.003,0.002,0.001,1619


In [3]:
# Select the rows and columns of FTSW for clustering
X = df.iloc[:,4:-1].values
print(X)


[[0.642 0.707 0.468 ... 0.033 0.068 0.042]
 [0.461 0.486 0.5   ... 0.003 0.007 0.002]
 [0.796 0.675 0.582 ... 0.034 0.086 0.096]
 ...
 [0.887 0.968 0.879 ... 0.382 0.336 0.445]
 [0.929 1.027 1.019 ... 0.545 0.502 0.441]
 [0.951 0.999 1.066 ... 0.519 0.52  0.615]]
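Positional slicing such as `df.iloc[:,4:-1]` breaks silently if a column is added or moved. A name-based selection is one possible alternative. The sketch below uses a tiny made-up table (`demo`) with the same column layout as the dataset; the values and the variable names `bd_cols`, `X_demo`, and `cbd_demo` are illustrative assumptions.

```python
import pandas as pd

# Tiny made-up table that mimics the column layout of the dataset
demo = pd.DataFrame({
    "climate_class": ["semi-arid", "dry-subhumid"],
    "city": ["A", "B"],
    "state": ["CO", "KS"],
    "year": [1986, 1999],
    "BD-14": [0.64, 0.71],
    "BD-18": [0.70, 0.61],
    "yield": [2952, 1619],
})

# Keep only the columns whose name starts with "BD-"
bd_cols = [col for col in demo.columns if col.startswith("BD-")]
X_demo = demo[bd_cols].to_numpy()

# Recover the cumulative biological days from the column names
cbd_demo = [float(col.split("-")[1]) for col in bd_cols]
print(bd_cols, X_demo.shape, cbd_demo)
```

Because the selection depends on column names rather than positions, it keeps working after new columns are inserted into the DataFrame.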


In [4]:
# Compute median FTSW after anthesis. 
# Useful to characterize soil moisture conditions in a single value
df.insert(4,'FTSW',np.median(X,axis=1))
df.head()

Unnamed: 0,climate_class,city,state,year,FTSW,BD-14,BD-18,BD-22,BD-26,BD-31,...,BD-39,BD-43,BD-47,BD-51,BD-55,BD-60,BD-64,BD-68,BD-72,yield
0,semi-arid,Akron,CO,1986,0.134,0.642,0.707,0.468,0.279,0.285,...,0.249,0.225,0.082,0.039,0.031,0.024,0.033,0.068,0.042,2952
1,semi-arid,Byers,CO,1989,0.044,0.461,0.486,0.5,0.33,0.13,...,0.025,0.092,0.039,0.044,0.031,0.01,0.003,0.007,0.002,1715
2,semi-arid,Lamar,CO,1992,0.157,0.796,0.675,0.582,0.395,0.278,...,0.155,0.159,0.111,0.157,0.08,0.028,0.034,0.086,0.096,3437
3,semi-arid,Sedgwick,CO,1996,0.157,0.775,0.821,0.761,0.523,0.289,...,0.059,0.088,0.157,0.201,0.15,0.206,0.126,0.108,0.117,4091
4,semi-arid,Colby,KS,1999,0.054,0.714,0.611,0.66,0.397,0.214,...,0.099,0.054,0.031,0.012,0.008,0.004,0.003,0.002,0.001,1619


In [5]:
# Get cumulative biological days for plotting purposes from the headers
cbd = [float(col[3:]) for col in df.columns[5:-1]]  # np.float was removed in NumPy 1.24
print(cbd)


[14.0, 18.0, 22.0, 26.0, 31.0, 35.0, 39.0, 43.0, 47.0, 51.0, 55.0, 60.0, 64.0, 68.0, 72.0]


In [6]:
f = figure(width=400, height=300)
f.line(cbd,X[0,:])
f.xaxis.axis_label = 'Cumulative Biological Days'
f.yaxis.axis_label = 'FTSW'
show(f)

In [7]:
# Plot the relationship between FTSW and crop yield to understand the bigger picture
f = figure(width=400, height=300)
f.scatter(df['FTSW'], df['yield'], marker='circle')
f.xaxis.axis_label = 'FTSW'
f.yaxis.axis_label = 'Wheat Yield (kg/ha)'
show(f)

In [8]:
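k = 3  # To match the number of climate classes
# n_init=10 was the default before scikit-learn 1.4; set it explicitly for consistent results
groups = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)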
print(groups)


[0 0 0 0 0 0 2 0 0 2 0 1 2 0 2 0 2 0 0 2 0 2 0 0 2 2 1 0 2 2 0 2 2 1 2 1 2
 2 1 2 1 2 0 1 2 1 0 2 2 2 1 2 1 0 2 2 1 1 1 1 1 2 1 1 1 2 2 2 2 1 2 1 1 1
 1]
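The integers returned by K-means are arbitrary identifiers, so group 0 is not necessarily the driest or the wettest cluster. One way to make the labels easier to interpret is to renumber them by the mean FTSW of each cluster. The helper `relabel_by_mean` and the small arrays below are hypothetical and only illustrate the idea.

```python
import numpy as np

def relabel_by_mean(X, labels):
    """Renumber cluster labels so that 0 is the cluster with the lowest mean value."""
    labels = np.asarray(labels)
    ids = np.unique(labels)
    means = np.array([X[labels == c].mean() for c in ids])
    order = ids[np.argsort(means)]  # original ids sorted from lowest to highest mean
    mapping = {old: new for new, old in enumerate(order)}
    return np.array([mapping[c] for c in labels])

# Made-up FTSW values and labels for four site-years
X_demo = np.array([[0.9, 0.8], [0.1, 0.2], [0.5, 0.4], [0.85, 0.75]])
labels_demo = np.array([0, 2, 1, 0])

new_labels = relabel_by_mean(X_demo, labels_demo)
print(new_labels)  # [2 0 1 2]
```

After relabeling, a higher label always means a wetter cluster on average, which makes comparisons across runs with different `random_state` values more straightforward.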


In [9]:
# Add groups to DataFrame
df.insert(1,"Kgroup",groups) # Insert next to climate class
df.head()

Unnamed: 0,climate_class,Kgroup,city,state,year,FTSW,BD-14,BD-18,BD-22,BD-26,...,BD-39,BD-43,BD-47,BD-51,BD-55,BD-60,BD-64,BD-68,BD-72,yield
0,semi-arid,0,Akron,CO,1986,0.134,0.642,0.707,0.468,0.279,...,0.249,0.225,0.082,0.039,0.031,0.024,0.033,0.068,0.042,2952
1,semi-arid,0,Byers,CO,1989,0.044,0.461,0.486,0.5,0.33,...,0.025,0.092,0.039,0.044,0.031,0.01,0.003,0.007,0.002,1715
2,semi-arid,0,Lamar,CO,1992,0.157,0.796,0.675,0.582,0.395,...,0.155,0.159,0.111,0.157,0.08,0.028,0.034,0.086,0.096,3437
3,semi-arid,0,Sedgwick,CO,1996,0.157,0.775,0.821,0.761,0.523,...,0.059,0.088,0.157,0.201,0.15,0.206,0.126,0.108,0.117,4091
4,semi-arid,0,Colby,KS,1999,0.054,0.714,0.611,0.66,0.397,...,0.099,0.054,0.031,0.012,0.008,0.004,0.003,0.002,0.001,1619


In [10]:
# Compare each Kgroup with the known climate classification based on the Aridity Index
for i in range(df.shape[0]):
    print(df.loc[i,"city"],': ', df.loc[i,"climate_class"], df.loc[i,"Kgroup"])
    

Akron :  semi-arid 0
Byers :  semi-arid 0
Lamar :  semi-arid 0
Sedgwick :  semi-arid 0
Colby :  semi-arid 0
Dodge City :  semi-arid 0
Elkhart :  semi-arid 2
Garden City :  semi-arid 0
Goodland :  semi-arid 0
Liberal :  semi-arid 2
Ness City :  semi-arid 0
Oakley :  semi-arid 1
St. John :  semi-arid 2
Tribune :  semi-arid 0
Beaver :  semi-arid 2
Boise City :  semi-arid 0
Crosbyton :  semi-arid 2
Dalhart :  semi-arid 0
Dumas :  semi-arid 0
Haskell :  semi-arid 2
Hereford :  semi-arid 0
Muleshoe :  semi-arid 2
Perryton :  semi-arid 0
Plainview :  semi-arid 0
Quanah :  semi-arid 2
Ellsworth :  dry-subhumid 2
Ellsworth :  dry-subhumid 1
Great Bend :  dry-subhumid 0
Greensburg :  dry-subhumid 2
Kiowa :  dry-subhumid 2
Meade :  dry-subhumid 0
Medicine Lodge :  dry-subhumid 2
Norton :  dry-subhumid 2
Pratt :  dry-subhumid 1
Salina :  dry-subhumid 2
Scandia :  dry-subhumid 1
Smith Center :  dry-subhumid 2
Altus :  dry-subhumid 2
Alva :  dry-subhumid 1
Bessie :  dry-subhumid 2
Elk City :  dry-su
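Reading the matches row by row becomes tedious with many site-years. A contingency table summarizes the same comparison in a few numbers. The sketch below uses `pd.crosstab` on a small made-up DataFrame (`demo`) that reuses the `climate_class` and `Kgroup` column names; the counts are illustrative and are not results from the dataset.

```python
import pandas as pd

# Made-up labels for five site-years
demo = pd.DataFrame({
    "climate_class": ["semi-arid", "semi-arid", "semi-arid",
                      "dry-subhumid", "dry-subhumid"],
    "Kgroup": [0, 0, 2, 2, 1],
})

# Rows are climate classes, columns are K-means groups, cells are counts
table = pd.crosstab(demo["climate_class"], demo["Kgroup"])
print(table)
```

A table dominated by one column per row would suggest that the clusters align closely with the climate classes, while counts spread across columns would suggest that the timing of soil water deficit captures something the Aridity Index does not.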

In [11]:
f = figure(width=500, height=400)
f.xaxis.axis_label = 'Cumulative Biological Days'
f.yaxis.axis_label = 'FTSW'

# One color per cluster label
colors = ['red', 'blue', 'green']
for i, g in enumerate(groups):
    f.line(cbd, X[i,:], line_color=colors[g], line_dash='solid')
        
show(f)
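With many overlapping lines it can be hard to see the typical shape of each group. A common complement is to compute the mean curve of every cluster. The helper `cluster_mean_curves` and the small arrays below are hypothetical and only sketch the calculation.

```python
import numpy as np

def cluster_mean_curves(X, labels):
    """Return a dict that maps each cluster label to its mean time series."""
    labels = np.asarray(labels)
    return {int(c): X[labels == c].mean(axis=0) for c in np.unique(labels)}

# Made-up FTSW series: two in cluster 0 and one in cluster 1
X_demo = np.array([[0.8, 0.6, 0.4],
                   [0.6, 0.4, 0.2],
                   [0.2, 0.1, 0.0]])
labels_demo = np.array([0, 0, 1])

curves = cluster_mean_curves(X_demo, labels_demo)
for label, curve in curves.items():
    print(label, curve)
```

In the notebook, each mean curve could then be drawn on top of the individual series, for example with `f.line(cbd, curve, line_width=3)`, assuming `curves` was computed from `X` and `groups`.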

## References

Sciarresi, C., Patrignani, A., Soltani, A., Sinclair, T. and Lollato, R.P., 2019. Plant traits to increase winter wheat yield in semiarid and subhumid environments. Agronomy Journal, 111(4), pp.1728-1740.