<a href="https://colab.research.google.com/github/uteyechea/crime-prediction-using-artificial-intelligence/blob/master/Part3_Temporal_Autocorrelation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Part 3: Get autocorrelated sequences (windowing) to train the RNN on a specific pattern
Unfortunately, training must be repeated for every new sequence, or window, of data points in the time series. We train the RNN to learn one specific class of pattern in the time series instead of the full time series itself. Why? Because I do not trust the RNN to learn the whole time series. I will test that doubt later and see whether it truly can.

##3.1 Dependencies, mount Google Drive and set system path
Import the relevant packages we will use to preprocess the raw data files.

In [45]:
import os
import gc

import pandas as pd
from scipy import stats

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

path='/content/drive/My Drive/Colab Notebooks/crime_prediction'

Mounted at /content/drive


##3.2 Locate time series file and load to memory
Verify file column names and check for nulls.

In [46]:
file_path=os.path.join(path,'data','theft.csv')
file=pd.read_csv(file_path,sep=',',parse_dates=['Date'],index_col='Date')
file.isnull().values.any() # nulls?

False

In [47]:
print(file.columns.get_loc('zone4')) #Series autocorrelation requires a pandas.Series
print(file.columns[4]) #we can either select a pandas.Series from a DataFrame using column name or index.

4
zone4


In [48]:
file #Normalized to [0,1] data file contents.

Date,zone0,zone1,zone2,zone3,zone4,zone5,zone6,zone7,zone8,zone9,zone10,zone11,zone12,zone13,zone14,zone15,zone16,zone17,zone18,zone19
2001-01-01,0.0,0.171429,0.37500,0.100000,0.193548,0.0,0.454545,0.075342,0.40,0.12,0.181818,0.205882,0.206897,0.171429,0.000000,0.294118,0.166667,0.09375,0.304348,0.342105
2001-01-02,0.0,0.000000,0.00000,0.000000,0.000000,0.0,0.045455,0.006849,0.00,0.00,0.000000,0.000000,0.034483,0.000000,0.000000,0.029412,0.055556,0.00000,0.021739,0.000000
2001-01-03,0.0,0.000000,0.03125,0.000000,0.000000,0.0,0.000000,0.000000,0.00,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000
2001-01-05,0.0,0.000000,0.00000,0.000000,0.000000,0.0,0.000000,0.000000,0.00,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.021739,0.000000
2001-01-06,0.0,0.000000,0.00000,0.000000,0.032258,0.0,0.000000,0.000000,0.00,0.00,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.00000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-07-04,0.0,0.114286,0.00000,0.100000,0.096774,0.0,0.090909,0.095890,0.12,0.16,0.000000,0.147059,0.103448,0.257143,0.038462,0.117647,0.166667,0.18750,0.217391,0.210526
2020-07-05,0.0,0.257143,0.06250,0.100000,0.225806,0.0,0.272727,0.089041,0.20,0.12,0.090909,0.235294,0.034483,0.057143,0.038462,0.058824,0.000000,0.15625,0.021739,0.078947
2020-07-06,0.0,0.342857,0.18750,0.200000,0.161290,0.0,0.181818,0.089041,0.16,0.22,0.090909,0.294118,0.068966,0.085714,0.000000,0.147059,0.055556,0.34375,0.239130,0.131579
2020-07-07,0.0,0.142857,0.06250,0.133333,0.064516,0.0,0.181818,0.075342,0.16,0.22,0.000000,0.176471,0.000000,0.171429,0.115385,0.117647,0.000000,0.18750,0.217391,0.184211
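
The preview above jumps from 2001-01-03 to 2001-01-05, so the daily index is not continuous. As a minimal sketch, and assuming a toy index rather than the real `theft.csv`, the missing days can be listed by comparing the index against a complete daily range. The helper name `missing_dates` is introduced here for illustration only.

```python
import pandas as pd

def missing_dates(index):
    """Return the calendar days absent from a DatetimeIndex (hypothetical helper)."""
    full_range = pd.date_range(start=index.min(), end=index.max(), freq='D')
    return full_range.difference(index)

# Toy index mimicking the preview above, where 2001-01-04 is absent.
toy_index = pd.DatetimeIndex(['2001-01-01', '2001-01-02', '2001-01-03',
                              '2001-01-05', '2001-01-06'])
gaps = missing_dates(toy_index)
print(gaps)
```

Whether to fill such gaps or to skip every window that touches them is a modeling choice. The functions in section 3.3 skip them.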


##3.3 Series autocorrelation

1. Get a window of length n ending at t, where t is the end datetime and n is the number of data points in the window (sequence).
2. Compute the correlation between the selected window and every other window of the same length in the same data series.
3. Create a list of timestamps, adding a new timestamp whenever a high correlation is found.
4. Use the timestamps to generate the set of highly correlated sequences.
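
The four steps above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the series is a made-up sine wave with a weekly cycle, the windows are taken by position with `iloc`, and the correlation uses `np.corrcoef` instead of the functions defined in the following subsections.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a 7-day cycle (a stand-in for one zone column).
dates = pd.date_range('2020-01-01', periods=60, freq='D')
series = pd.Series(np.sin(2 * np.pi * np.arange(60) / 7), index=dates)

n = 7                    # window length
min_correlation = 0.99   # threshold for a "high" correlation

# Step 1: reference window of length n ending at the last date t.
reference = series.iloc[-n:].to_numpy()

# Steps 2 and 3: correlate every window of length n with the reference
# and record the end timestamp of each highly correlated window.
matches = []
for end in range(n, len(series) + 1):
    candidate = series.iloc[end - n:end].to_numpy()
    rho = np.corrcoef(candidate, reference)[0, 1]
    if rho >= min_correlation:
        matches.append(series.index[end - 1])

# Step 4: use the timestamps to rebuild the correlated windows.
windows = {ts: series.loc[ts - pd.Timedelta(days=n - 1):ts] for ts in matches}
print(matches)
```

Because the toy series repeats every 7 days, the matching windows end exactly a whole number of weeks before the reference window. The reference window also matches itself.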




###3.3.1 Get window, as in windowing from statistics

In [49]:
def window(dataframe,end_date,lookback_periods,column_name):
  window=dataframe.loc[pd.date_range(start=end_date,periods=lookback_periods,freq='-1D'),column_name]
  return window[::-1] #Sort window by datetime, as in any time series, older values first ...

In [50]:
window(file,'2019-01-01',lookback_periods=10,column_name='zone11')

2018-12-23    0.058824
2018-12-24    0.205882
2018-12-25    0.088235
2018-12-26    0.117647
2018-12-27    0.176471
2018-12-28    0.235294
2018-12-29    0.117647
2018-12-30    0.117647
2018-12-31    0.176471
2019-01-01    0.176471
Freq: D, Name: zone11, dtype: float64
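
The `window` function selects rows with `.loc` and an explicit list of dates, so in pandas 1.0 and later it raises a `KeyError` if any requested day is absent from the index. The sketch below shows this on a hypothetical toy frame (not the theft data), with `reindex` as one possible alternative that returns `NaN` for the missing day instead of failing.

```python
import pandas as pd

# Toy frame with a gap: 2001-01-04 is missing from the index.
toy = pd.DataFrame(
    {'zone11': [0.1, 0.2, 0.3, 0.4]},
    index=pd.DatetimeIndex(['2001-01-01', '2001-01-02', '2001-01-03', '2001-01-05']),
)

# Three days counting back from 2001-01-05, as window() would request them.
wanted = pd.date_range(start='2001-01-05', periods=3, freq='-1D')

try:
    toy.loc[wanted, 'zone11']
    raised = False
except KeyError:
    raised = True  # .loc refuses label lists that contain missing dates

# reindex keeps the requested dates and marks the missing one as NaN.
reindexed = toy['zone11'].reindex(wanted)
print(raised)
print(reindexed)
```

This is why the loops in sections 3.3.3 and 3.3.4 wrap the window lookup in a `try` block.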

###3.3.2 Get windows correlation

In [51]:
def corr_windows(series1,series2):
  #Series.corr aligns both series on their index, so drop the datetime index to compare the windows position by position
  series1=series1.reset_index(drop=True)
  series2=series2.reset_index(drop=True)
  rho=series1.corr(series2) #Pearson correlation coefficient
  #maybe try np.correlate (mode=same)
  return rho
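
The `reset_index(drop=True)` calls matter because `Series.corr` aligns the two series on their index before computing the coefficient. Two windows taken from different dates share no index labels, so without the reset there is nothing left to correlate. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

values = [0.1, 0.3, 0.2, 0.5]
early = pd.Series(values, index=pd.date_range('2018-01-01', periods=4, freq='D'))
late = pd.Series([2 * v for v in values],
                 index=pd.date_range('2019-01-01', periods=4, freq='D'))

# Aligned on dates: the windows have no dates in common, so the result is NaN.
aligned_on_dates = early.corr(late)

# Aligned on position: the second window is a scaled copy of the first.
aligned_on_position = early.reset_index(drop=True).corr(late.reset_index(drop=True))

print(aligned_on_dates)
print(aligned_on_position)
```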

###3.3.3 Get timestamp of autocorrelated time series window

In [75]:
def autocorr_timestamps(dataframe,end_date,lookback_periods,column_name,min_correlation):
  timestamps=[]
  series2=window(dataframe,end_date,lookback_periods,column_name) #reference window ending at end_date
  #loop over the datetime index up to end_date, starting from the index at position lookback_periods
  for timestamp in dataframe.loc[:end_date].index[lookback_periods:]:
    try:
      series1=window(dataframe,timestamp,lookback_periods,column_name)
    except KeyError: #.loc raises KeyError when a date in the window is missing from the index
      print('Missing date in dataframe while looping over time series, i.e. discontinuity in time series')
      continue
    correlation=corr_windows(series1,series2)
    if correlation >= min_correlation:
      #record the datetime value. We will use it to generate all sequences that go as input to the RNN
      timestamps.append(timestamp)
  return timestamps

In [76]:
autocorr_timestamps(dataframe=file,end_date='2019-01-01',lookback_periods=10,column_name='zone11',min_correlation=0.75)

Missing date in dataframe while looping over time series, i.e. discontinuity in time series
...

[Timestamp('2002-05-21 00:00:00'),
 Timestamp('2002-10-09 00:00:00'),
 Timestamp('2003-07-06 00:00:00'),
 Timestamp('2003-10-11 00:00:00'),
 Timestamp('2004-06-29 00:00:00'),
 Timestamp('2004-10-05 00:00:00'),
 Timestamp('2005-04-18 00:00:00'),
 Timestamp('2005-12-03 00:00:00'),
 Timestamp('2006-05-24 00:00:00'),
 Timestamp('2007-10-11 00:00:00'),
 Timestamp('2008-02-19 00:00:00'),
 Timestamp('2008-05-20 00:00:00'),
 Timestamp('2008-06-14 00:00:00'),
 Timestamp('2008-11-25 00:00:00'),
 Timestamp('2009-01-05 00:00:00'),
 Timestamp('2009-05-02 00:00:00'),
 Timestamp('2010-06-22 00:00:00'),
 Timestamp('2010-11-16 00:00:00'),
 Timestamp('2011-01-11 00:00:00'),
 Timestamp('2011-03-05 00:00:00'),
 Timestamp('2011-09-16 00:00:00'),
 Timestamp('2012-03-23 00:00:00'),
 Timestamp('2013-01-14 00:00:00'),
 Timestamp('2013-06-30 00:00:00'),
 Timestamp('2013-08-07 00:00:00'),
 Timestamp('2013-11-05 00:00:00'),
 Timestamp('2014-01-16 00:00:00'),
 Timestamp('2014-02-25 00:00:00'),
 Timestamp('2014-08-

###3.3.4 Get autocorrelated windows

In [87]:
def autocorr_windows(dataframe,end_date,lookback_periods,column_name,min_correlation):
  timestamps=autocorr_timestamps(dataframe,end_date,lookback_periods,column_name,min_correlation)
  windows={}
  for timestamp in timestamps:
    try:
      #lookback_periods days ending at timestamp, newest first
      apriori_window=dataframe.loc[pd.date_range(start=timestamp,periods=lookback_periods,freq='-1D'),column_name]
      #the lookback_periods-1 days after timestamp; replaces closed='right', which pandas 2.0 removed
      aposteriori_window=dataframe.loc[pd.date_range(start=timestamp+pd.Timedelta(days=1),periods=lookback_periods-1,freq='1D'),column_name]
    except KeyError:
      print('Missing date in dataframe while looping over time series, i.e. discontinuity in time series')
      print('Error while constructing window for datetime stamp:',str(timestamp))
      continue
    #pd.concat replaces Series.append, which pandas 2.0 removed
    windows[timestamp]=pd.concat([apriori_window[::-1],aposteriori_window])
  return windows

In [88]:
windows=autocorr_windows(dataframe=file,end_date='2019-01-01',lookback_periods=10,column_name='zone11',min_correlation=0.75)

Missing date in dataframe while looping over time series, i.e. discontinuity in time series
...

In [89]:
windows

{Timestamp('2002-05-21 00:00:00'): 2002-05-12    0.176471
 2002-05-13    0.382353
 2002-05-14    0.205882
 2002-05-15    0.323529
 2002-05-16    0.352941
 2002-05-17    0.441176
 2002-05-18    0.176471
 2002-05-19    0.264706
 2002-05-20    0.588235
 2002-05-21    0.441176
 2002-05-22    0.264706
 2002-05-23    0.617647
 2002-05-24    0.441176
 2002-05-25    0.323529
 2002-05-26    0.235294
 2002-05-27    0.264706
 2002-05-28    0.294118
 2002-05-29    0.382353
 2002-05-30    0.411765
 Freq: D, Name: zone11, dtype: float64,
 Timestamp('2002-10-09 00:00:00'): 2002-09-30    0.382353
 2002-10-01    0.764706
 2002-10-02    0.323529
 2002-10-03    0.470588
 2002-10-04    0.647059
 2002-10-05    0.764706
 2002-10-06    0.411765
 2002-10-07    0.411765
 2002-10-08    0.764706
 2002-10-09    0.500000
 2002-10-10    0.500000
 2002-10-11    0.823529
 2002-10-12    0.500000
 2002-10-13    0.647059
 2002-10-14    0.588235
 2002-10-15    0.382353
 2002-10-16    0.588235
 2002-10-17    0.500000
 200

##3.4 Export autocorrelated windows to file

In [90]:
def save_to_file(windows,file_path):
  with open(file_path,'w') as file:
    for timestamp in windows:
      windows[timestamp].to_csv(file,mode='a',header=False,index=False)
      file.write('\n')

In [103]:
save_to_path=os.path.join(path,'data','training','theft.csv')
save_to_path

In [106]:
save_to_file(windows,save_to_path)
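
The exported file holds one value per line, with a blank line after each window. The sketch below is one way such a file could be read back into separate windows. It is an assumption about how the file will be consumed, not part of the original pipeline, and `load_windows` is a hypothetical helper. An in-memory buffer stands in for the file on Google Drive.

```python
import io
from itertools import groupby

import pandas as pd

def load_windows(handle):
    """Parse blank-line-separated blocks of one value per line (hypothetical helper)."""
    lines = (line.strip() for line in handle)
    # groupby splits the stream into runs of non-empty and empty lines.
    return [[float(value) for value in block]
            for is_data, block in groupby(lines, key=bool) if is_data]

# Round trip through a buffer written the same way as save_to_file.
toy_windows = {'a': pd.Series([0.25, 0.5]), 'b': pd.Series([0.75, 1.0])}
buffer = io.StringIO()
for key in toy_windows:
    toy_windows[key].to_csv(buffer, header=False, index=False)
    buffer.write('\n')
buffer.seek(0)

loaded = load_windows(buffer)
print(loaded)
```

Note that the timestamps are not written to the file, so the windows can only be recovered in order, not by date.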

##*** 3.5 DataFrame correlation (raw code, not currently used; kept here for later reuse elsewhere)

In [None]:
def get_window(dataframe,t,n):
  #if type(t) == int elif type(t)==str then use .loc, instead of iloc. For the time being, we will just assume t and n are integers 
  return dataframe.iloc[t-n:t,:]

def get_correlation(dataframe1,dataframe2):
  assert dataframe1.shape==dataframe2.shape,'Dataframes must have the same shape'
  ro=dataframe1.corrwith(dataframe2.set_index(dataframe1.index),axis=0) #Force alignment. Make sure size is the same for both dataframes
  return ro.mean() #For the time being we will stop with a general correlation among all zones

def get_correlated_endof_sequence_timestamp(dataframe,t,n=10,min_correlation=0.5):
  endof_sequence_timestamp=[]
  dataframe2=get_window(dataframe,t,n)
  for epoch in reversed(range(2*n,t)):
    dataframe1=get_window(dataframe,epoch-n,n)
    correlation=get_correlation(dataframe1,dataframe2)
    if correlation >= min_correlation:
      #record datetime value. We will use this datetime value to generate all sequences that will go as input to the RNN
      #print(dataframe.index[epoch]) 
      endof_sequence_timestamp.append(dataframe.index[epoch])
  return endof_sequence_timestamp

def get_correlated_dataframe_slice(dataframe,endof_sequence_timestamp,n):
  input={}
  output={}
  for timestamp in endof_sequence_timestamp:
    try:
      input[timestamp]=dataframe.loc[pd.date_range(start=timestamp,periods=n,freq='-1D')]
      #the n days after timestamp; replaces closed='right', which pandas 2.0 removed
      output[timestamp]=dataframe.loc[pd.date_range(start=timestamp+pd.Timedelta(days=1),periods=n,freq='1D')]
    except KeyError:
      print('Missing dates at ', str(timestamp))
  #df with all inputs and outputs  
  return input,output

def get_IO_sequence(dataframe,periods,min_correlation=0.5):
  endof_sequence_timestamp=get_correlated_endof_sequence_timestamp(dataframe,len(dataframe),periods,min_correlation)
  input,output=get_correlated_dataframe_slice(dataframe,endof_sequence_timestamp,periods)
  #return input,output
  sequence={}
  for key in endof_sequence_timestamp: #Change from input to timestamp sequence
    try:
      sequence[key]=pd.concat([input[key][::-1],output[key]]) #pd.concat replaces DataFrame.append, which pandas 2.0 removed
    except KeyError:
      print('Error with key',str(key))
  return sequence      

def save_to_file(sequence,file_path):
  with open(file_path,'a') as file:
    for key in sequence:
      sequence[key].to_csv(file,mode='a',header=False,index=False)
      file.write('\n')

It would be ideal to save, for each zone, a list of the dates where the correlation is high. We defer this to future work.

In [None]:
sequence=get_IO_sequence(file,periods=10,min_correlation=0.25)

Missing dates at  2001-12-18 00:00:00
Missing dates at  2001-09-18 00:00:00
Missing dates at  2001-08-24 00:00:00
Missing dates at  2001-05-20 00:00:00
Error with key 2001-12-18 00:00:00
Error with key 2001-09-18 00:00:00
Error with key 2001-08-24 00:00:00
Error with key 2001-05-20 00:00:00


In [None]:
sequence

{Timestamp('2002-02-17 00:00:00'):                zone1     zone2     zone3  ...   zone17    zone18    zone19
 2002-02-08 -1.835100 -2.006852 -2.110972  ... -2.14365 -1.750917 -2.537583
 2002-02-09 -1.614799 -1.817576 -2.110972  ... -1.98516 -1.750917 -2.537583
 2002-02-10 -1.614799 -2.006852 -2.242696  ... -2.14365 -1.750917 -2.537583
 2002-02-11 -1.835100 -2.006852 -2.242696  ... -2.14365 -1.750917 -2.537583
 2002-02-12 -1.835100 -1.817576 -2.242696  ... -2.14365 -1.750917 -2.537583
 2002-02-13 -1.614799 -1.817576 -2.110972  ... -2.14365 -1.750917 -2.456650
 2002-02-14 -1.835100 -2.006852 -2.242696  ... -1.98516 -1.750917 -2.294784
 2002-02-15 -1.835100 -2.006852 -2.110972  ... -2.14365 -1.750917 -2.537583
 2002-02-16 -1.835100 -1.817576 -2.242696  ... -2.14365 -1.750917 -2.456650
 2002-02-17 -1.835100 -2.006852 -2.242696  ... -2.14365 -1.750917 -2.537583
 2002-02-18 -1.835100 -2.006852 -2.242696  ... -2.14365 -1.432137 -2.537583
 2002-02-19 -1.835100 -2.006852 -2.110972  ... -2.1436

In [95]:
file_path=os.path.join(path,'data','training','theft2.csv')


In [94]:
file_path

'/content/drive/My Drive/Colab Notebooks/crime_prediction/data/theft.csv'

In [None]:
save_to_file(sequence,file_path)