<a href="https://colab.research.google.com/github/uteyechea/crime-prediction-using-artificial-intelligence/blob/master/Part3_Temporal_Autocorrelation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Part 3: Get autocorrelated sequences (windowing) to train the RNN on a particular pattern
Training must unfortunately be repeated for every new sequence, or window, of data points in the time series. We train the RNN to learn one specific class of pattern in the time series rather than the full series itself. Why? Because I doubt that the RNN can learn the whole time series. I will nevertheless test that doubt later and see whether it truly can.

##3.1 Dependencies, mount Google Drive and set system path
Import the relevant packages we will use to preprocess the raw data files.

In [1]:
import os
import gc

import pandas as pd
from scipy import stats

from google.colab import drive
drive.mount('/content/drive', force_remount=True)



path='/content/drive/My Drive/Colab Notebooks/crime_prediction'

Mounted at /content/drive


##3.2 Locate time series file and load to memory
Verify the file's column names and check for nulls.

In [2]:
file_path=os.path.join(path,'data','training_data_theft.csv')
file=pd.read_csv(file_path,sep=',',parse_dates=['Date'],index_col='Date')
file.isnull().values.any() # nulls?

False

In [3]:
print(file.columns.get_loc('zone4')) #Series autocorrelation requires a pandas.Series
print(file.columns[4]) #we can either select a pandas.Series from a DataFrame using column name or index.

4
zone4


In [4]:
file #Normalized to [0,1] data file contents.

Date,zone0,zone1,zone2,zone3,zone4,zone5,zone6,zone7,zone8,zone9,zone10,zone11,zone12,zone13,zone14,zone15,zone16,zone17,zone18,zone19
2001-01-01,0.0,0.10,0.324,0.364,0.407,0.333,0.0,0.077,0.152,0.100,0.161,0.107,0.405,0.049,0.2,0.294,0.138,0.294,0.167,0.158
2001-01-02,0.0,0.00,0.000,0.000,0.037,0.033,0.0,0.000,0.000,0.000,0.032,0.000,0.000,0.049,0.0,0.029,0.000,0.000,0.000,0.000
2001-01-03,0.0,0.00,0.000,0.030,0.000,0.000,0.0,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.0,0.000,0.000,0.000,0.000,0.000
2001-01-05,0.0,0.00,0.000,0.000,0.000,0.000,0.0,0.000,0.000,0.000,0.000,0.000,0.024,0.000,0.0,0.000,0.000,0.000,0.000,0.000
2001-01-06,0.0,0.00,0.027,0.000,0.000,0.000,0.0,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.0,0.000,0.000,0.000,0.000,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2018-12-28,0.0,0.34,0.216,0.212,0.370,0.333,0.0,0.315,0.152,0.133,0.290,0.143,0.048,0.341,0.3,0.235,0.379,0.147,0.222,0.105
2018-12-29,0.0,0.40,0.108,0.121,0.259,0.200,0.0,0.287,0.212,0.033,0.032,0.214,0.024,0.244,0.1,0.088,0.207,0.147,0.139,0.053
2018-12-30,0.0,0.22,0.135,0.182,0.333,0.100,0.0,0.224,0.242,0.167,0.194,0.107,0.119,0.195,0.2,0.118,0.241,0.235,0.167,0.211
2018-12-31,0.0,0.24,0.270,0.121,0.185,0.267,0.0,0.224,0.273,0.033,0.226,0.250,0.190,0.341,0.2,0.206,0.310,0.088,0.167,0.105
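
The listing above jumps from 2001-01-03 to 2001-01-05, so the daily index has gaps. A minimal sketch for listing the missing days, assuming a daily `DatetimeIndex`. The helper name `missing_days` and the toy frame are illustrative and not part of the notebook:

```python
import pandas as pd

def missing_days(index):
    """Return the calendar days between index.min() and index.max() that are absent."""
    full_range = pd.date_range(start=index.min(), end=index.max(), freq='D')
    return full_range.difference(index)

# Toy frame mimicking the first zone4 rows shown above (2001-01-04 is absent)
toy = pd.DataFrame(
    {'zone4': [0.407, 0.037, 0.000, 0.000, 0.000]},
    index=pd.to_datetime(['2001-01-01', '2001-01-02', '2001-01-03',
                          '2001-01-05', '2001-01-06']),
)
gaps = missing_days(toy.index)
print(gaps)
```

Running the same helper on `file.index` would show in advance which windows will fail later for lack of a date.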


##3.3 Series autocorrelation

1. Get a window of length n ending at t, where t is the end datetime and n is the length of the window, or sequence, in a data series.
2. Compute the correlation between the selected window and every other window of the same length in the same data series.
3. Create a list of timestamps, adding a new timestamp whenever a high correlation is found.
4. Use the timestamps to generate the set of highly correlated sequences.
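
The four steps above can be sketched on a plain Python list before bringing in dates. This is only an illustration under simplifying assumptions (no gaps, integer positions instead of timestamps). The names `pearson` and `correlated_window_ends` are hypothetical and do not appear in the notebook:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)

def correlated_window_ends(values, n, min_correlation):
    """Positions t where values[t-n+1 : t+1] correlates with the last n values."""
    reference = values[-n:]                  # step 1: window ending at the last point
    ends = []
    for t in range(n - 1, len(values)):      # step 2: every window of length n
        candidate = values[t - n + 1 : t + 1]
        if pearson(candidate, reference) >= min_correlation:
            ends.append(t)                   # step 3: remember where the match ends
    return ends

series = [0, 1, 2, 3, 9, 9, 9, 0, 1, 2, 3]   # toy data; the ramp 0,1,2,3 appears twice
ends = correlated_window_ends(series, n=4, min_correlation=0.99)
# step 4: turn each end position back into the matching window
matches = [series[t - 3 : t + 1] for t in ends]
print(ends, matches)
```

Note that the reference window always matches itself, so its own end position is part of the result.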




###3.3.1 Get window, as in windowing from statistics

In [5]:
def window(dataframe,end_date,lookback_periods,column_name):
  window=dataframe.loc[pd.date_range(start=end_date,periods=lookback_periods,freq='-1D'),column_name]
  return window[::-1] #Sort window by datetime, as in any time series, older values first ...

In [6]:
window(file,'2019-01-01',lookback_periods=10,column_name='zone11')

2018-12-23    0.000
2018-12-24    0.179
2018-12-25    0.000
2018-12-26    0.107
2018-12-27    0.071
2018-12-28    0.143
2018-12-29    0.214
2018-12-30    0.107
2018-12-31    0.250
2019-01-01    0.214
Freq: D, Name: zone11, dtype: float64
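
The `[::-1]` in `window` is needed because a negative frequency makes `pd.date_range` count backwards from `start`. A small sketch with three periods, using dates chosen only for illustration:

```python
import pandas as pd

# A negative frequency walks backwards from the start date (newest first)
dates = pd.date_range(start='2019-01-01', periods=3, freq='-1D')
newest_first = list(dates.strftime('%Y-%m-%d'))
print(newest_first)

# Reversing restores chronological order, as in `return window[::-1]`
oldest_first = list(dates[::-1].strftime('%Y-%m-%d'))
print(oldest_first)
```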

###3.3.2 Get windows correlation

In [7]:
def corr_windows(series1,series2):
  assert len(series1)==len(series2),'Windows must have the same length'
  #Drop the datetime index so that pandas pairs the values by position instead of by date
  series1=series1.reset_index(drop=True)
  series2=series2.reset_index(drop=True)
  ro=series1.corr(series2) #Pearson correlation coefficient
  #maybe try np.correlate (mode=same)
  return ro
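
The `reset_index(drop=True)` calls matter because `Series.corr` aligns both series on their index before correlating. A short sketch of the difference, using made-up values and dates:

```python
import math
import pandas as pd

values = [0.1, 0.3, 0.2, 0.5]
early = pd.Series(values, index=pd.date_range('2018-01-01', periods=4, freq='D'))
late = pd.Series(values, index=pd.date_range('2018-06-01', periods=4, freq='D'))

# No shared dates: pandas aligns on the index, finds no overlap and returns NaN
by_date = early.corr(late)
print(by_date)

# Positional index (0..3) on both sides: the values are compared pairwise
by_position = early.reset_index(drop=True).corr(late.reset_index(drop=True))
print(by_position)
```

Two windows taken from different dates never share index labels, so without the reset every comparison would come back as NaN.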

###3.3.3 Get timestamp of autocorrelated time series window

In [8]:
def autocorr_timestamps(dataframe,end_date,lookback_periods,column_name,min_correlation):
  timestamps=[]
  series2=window(dataframe,end_date,lookback_periods,column_name) #reference window ending at end_date
  #loop over the datetime index up to end_date, starting from the index at position lookback_periods
  for timestamp in dataframe.loc[:end_date].index[lookback_periods:]:
    try:
      series1=window(dataframe,timestamp,lookback_periods,column_name)
      correlation=corr_windows(series1,series2)
      if correlation >= min_correlation:
        #record datetime value. We will use this datetime value to generate all sequences that will go as input to the RNN
        timestamps.append(timestamp)
    except KeyError: #.loc raises KeyError when a requested date is not in the index
      print('Missing date in dataframe while looping over time series, i.e. discontinuity in time series')
  return timestamps

In [9]:
autocorr_timestamps(dataframe=file,end_date='2019-01-01',lookback_periods=10,column_name='zone11',min_correlation=0.75)

Missing date in dataframe while looping over time series, i.e. discontinuity in time series
...

[Timestamp('2002-02-22 00:00:00'),
 Timestamp('2002-05-07 00:00:00'),
 Timestamp('2002-07-30 00:00:00'),
 Timestamp('2002-11-19 00:00:00'),
 Timestamp('2003-01-09 00:00:00'),
 Timestamp('2004-03-02 00:00:00'),
 Timestamp('2004-09-05 00:00:00'),
 Timestamp('2004-11-02 00:00:00'),
 Timestamp('2005-06-17 00:00:00'),
 Timestamp('2005-08-04 00:00:00'),
 Timestamp('2006-03-24 00:00:00'),
 Timestamp('2007-07-26 00:00:00'),
 Timestamp('2008-04-16 00:00:00'),
 Timestamp('2008-11-02 00:00:00'),
 Timestamp('2009-11-20 00:00:00'),
 Timestamp('2010-01-01 00:00:00'),
 Timestamp('2010-03-13 00:00:00'),
 Timestamp('2011-05-10 00:00:00'),
 Timestamp('2011-07-14 00:00:00'),
 Timestamp('2012-09-07 00:00:00'),
 Timestamp('2014-01-14 00:00:00'),
 Timestamp('2014-03-18 00:00:00'),
 Timestamp('2014-07-05 00:00:00'),
 Timestamp('2015-01-20 00:00:00'),
 Timestamp('2015-10-27 00:00:00'),
 Timestamp('2016-06-30 00:00:00'),
 Timestamp('2016-10-07 00:00:00'),
 Timestamp('2017-01-20 00:00:00'),
 Timestamp('2017-09-

###3.3.4 Get autocorrelated windows

In [10]:
def autocorr_windows(dataframe,end_date,lookback_periods,column_name,min_correlation):
  timestamps=autocorr_timestamps(dataframe,end_date,lookback_periods,column_name,min_correlation)
  apriori_windows={}
  aposteriori_windows={}
  windows={}
  for timestamp in timestamps:
    try:
      #lookback_periods days up to and including timestamp, newest first
      apriori_windows[timestamp]=dataframe.loc[pd.date_range(start=timestamp,periods=lookback_periods,freq='-1D'),column_name]
      #the lookback_periods-1 days after timestamp; same dates as closed='right', which pandas 2.0 no longer accepts
      aposteriori_windows[timestamp]=dataframe.loc[pd.date_range(start=timestamp+pd.Timedelta(days=1),periods=lookback_periods-1,freq='1D'),column_name]
    except KeyError:
      print('Missing date in dataframe while looping over time series, i.e. discontinuity in time series')
    try:
      #older values first; pd.concat replaces Series.append, which was removed in pandas 2.0
      windows[timestamp]=pd.concat([apriori_windows[timestamp][::-1],aposteriori_windows[timestamp]])
    except KeyError:
      print('Error while constructing window for datetime stamp:',str(timestamp))
  return windows
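
Each stored window covers the `lookback_periods` days ending at the timestamp plus the `lookback_periods-1` days that follow it, which gives 19 rows for a lookback of 10. A sketch of just that date arithmetic, assuming a gap-free daily calendar and using the first timestamp from the output above:

```python
import pandas as pd

timestamp = pd.Timestamp('2002-02-22')
lookback_periods = 10

# Days up to and including the timestamp, in chronological order
before = pd.date_range(end=timestamp, periods=lookback_periods, freq='D')
# Days strictly after the timestamp
after = pd.date_range(start=timestamp + pd.Timedelta(days=1),
                      periods=lookback_periods - 1, freq='D')
span = before.append(after)
print(span[0].date(), span[-1].date(), len(span))
```

The first `lookback_periods` values can serve as the RNN input and the remaining ones as the target, although how the split is made downstream is not shown in this part.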

In [11]:
windows=autocorr_windows(dataframe=file,end_date='2019-01-01',lookback_periods=10,column_name='zone11',min_correlation=0.75)

Missing date in dataframe while looping over time series, i.e. discontinuity in time series
...

In [12]:
windows

{Timestamp('2002-02-22 00:00:00'): 2002-02-13    0.000
 2002-02-14    0.000
 2002-02-15    0.000
 2002-02-16    0.000
 2002-02-17    0.000
 2002-02-18    0.000
 2002-02-19    0.036
 2002-02-20    0.000
 2002-02-21    0.071
 2002-02-22    0.036
 2002-02-23    0.000
 2002-02-24    0.036
 2002-02-25    0.000
 2002-02-26    0.036
 2002-02-27    0.000
 2002-02-28    0.036
 2002-03-01    0.143
 2002-03-02    0.000
 2002-03-03    0.000
 Freq: D, Name: zone11, dtype: float64,
 Timestamp('2002-05-07 00:00:00'): 2002-04-28    0.286
 2002-04-29    0.429
 2002-04-30    0.321
 2002-05-01    0.393
 2002-05-02    0.357
 2002-05-03    0.536
 2002-05-04    0.536
 2002-05-05    0.357
 2002-05-06    0.464
 2002-05-07    0.607
 2002-05-08    0.607
 2002-05-09    0.429
 2002-05-10    0.571
 2002-05-11    0.357
 2002-05-12    0.357
 2002-05-13    0.607
 2002-05-14    0.393
 2002-05-15    0.500
 2002-05-16    0.286
 Freq: D, Name: zone11, dtype: float64,
 Timestamp('2002-07-30 00:00:00'): 2002-07-21    0.179

In [13]:
len(windows)

36

##3.4 Export autocorrelated windows to file 

Round values to an appropriate number of decimal places.

In [27]:
#Remember to match the format to the value of round(df) used in Part2
def save_to_file(windows,file_path):
  with open(file_path,'w') as file:
    for timestamp in windows:
      windows[timestamp].to_csv(file,mode='a',header=False,index=False,float_format='%.3f')
      file.write('\n')

In [28]:
save_to_path=os.path.join(path,'data','training','theft.csv')
save_to_path

'/content/drive/My Drive/Colab Notebooks/crime_prediction/data/training/theft.csv'

In [29]:
save_to_file(windows,save_to_path)
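
`save_to_file` writes one value per line and separates consecutive windows with a blank line. A minimal sketch of reading that layout back into lists of floats, assuming the file contains nothing else. `read_windows` is a hypothetical helper, not part of the notebook, and the sample values are illustrative:

```python
import io

def read_windows(handle):
    """Parse blocks of one-float-per-line text separated by blank lines."""
    windows, current = [], []
    for line in handle:
        line = line.strip()
        if line:
            current.append(float(line))
        elif current:              # a blank line closes the current window
            windows.append(current)
            current = []
    if current:                    # the file may not end with a blank line
        windows.append(current)
    return windows

# In-memory stand-in for the exported file
sample = io.StringIO("0.000\n0.036\n0.071\n\n0.286\n0.429\n\n")
parsed = read_windows(sample)
print(parsed)
```

For the real file, `open(save_to_path)` would take the place of the `io.StringIO` object.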

##3.5 DataFrame correlation (draft code, not currently used; kept here for later reuse elsewhere)

In [None]:
def get_window(dataframe,t,n):
  #if type(t) == int elif type(t)==str then use .loc, instead of iloc. For the time being, we will just assume t and n are integers 
  return dataframe.iloc[t-n:t,:]

def get_correlation(dataframe1,dataframe2):
  assert dataframe1.shape==dataframe2.shape,'Dataframes must have the same shape'
  ro=dataframe1.corrwith(dataframe2.set_index(dataframe1.index),axis=0) #Force alignment. Make sure size is the same for both dataframes
  return ro.mean() #For the time being we will stop with a general correlation among all zones

def get_correlated_endof_sequence_timestamp(dataframe,t,n=10,min_correlation=0.5):
  endof_sequence_timestamp=[]
  dataframe2=get_window(dataframe,t,n)
  for epoch in reversed(range(2*n,t)):
    dataframe1=get_window(dataframe,epoch-n,n)
    correlation=get_correlation(dataframe1,dataframe2)
    if correlation >= min_correlation:
      #record datetime value. We will use this datetime value to generate all sequences that will go as input to the RNN
      #print(dataframe.index[epoch]) 
      endof_sequence_timestamp.append(dataframe.index[epoch])
  return endof_sequence_timestamp

def get_correlated_dataframe_slice(dataframe,endof_sequence_timestamp,n):
  input={}
  output={}
  for timestamp in endof_sequence_timestamp:
    try:
      input[timestamp]=dataframe.loc[pd.date_range(start=timestamp,periods=n,freq='-1D')]
      output[timestamp]=dataframe.loc[pd.date_range(start=timestamp+pd.Timedelta(days=1),periods=n,freq='1D')] #the n days after timestamp; same dates as closed='right', which pandas 2.0 no longer accepts
    except:
      print('Missing dates at ', str(timestamp))
  #df with all inputs and outputs  
  return input,output

def get_IO_sequence(dataframe,periods,min_correlation=0.5):
  endof_sequence_timestamp=get_correlated_endof_sequence_timestamp(dataframe,len(dataframe),periods,min_correlation)
  input,output=get_correlated_dataframe_slice(dataframe,endof_sequence_timestamp,periods)
  #return input,output
  sequence={}
  for key in endof_sequence_timestamp: #Change from input to timestamp sequence
    try:
      sequence[key]=pd.concat([input[key][::-1],output[key]]) #pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    except:
      print('Error with key',str(key))
  return sequence      

def save_to_file(sequence,file_path):
  with open(file_path,'a') as file:
    for key in sequence:
      sequence[key].to_csv(file,mode='a',header=False,index=False)
      file.write('\n')

Ideally we would save, for each zone, a list of the dates where the correlation is high. We leave this as future work.

In [None]:
sequence=get_IO_sequence(file,periods=10,min_correlation=0.25)

Missing dates at  2001-12-18 00:00:00
Missing dates at  2001-09-18 00:00:00
Missing dates at  2001-08-24 00:00:00
Missing dates at  2001-05-20 00:00:00
Error with key 2001-12-18 00:00:00
Error with key 2001-09-18 00:00:00
Error with key 2001-08-24 00:00:00
Error with key 2001-05-20 00:00:00


In [None]:
sequence

{Timestamp('2002-02-17 00:00:00'):                zone1     zone2     zone3  ...   zone17    zone18    zone19
 2002-02-08 -1.835100 -2.006852 -2.110972  ... -2.14365 -1.750917 -2.537583
 2002-02-09 -1.614799 -1.817576 -2.110972  ... -1.98516 -1.750917 -2.537583
 2002-02-10 -1.614799 -2.006852 -2.242696  ... -2.14365 -1.750917 -2.537583
 2002-02-11 -1.835100 -2.006852 -2.242696  ... -2.14365 -1.750917 -2.537583
 2002-02-12 -1.835100 -1.817576 -2.242696  ... -2.14365 -1.750917 -2.537583
 2002-02-13 -1.614799 -1.817576 -2.110972  ... -2.14365 -1.750917 -2.456650
 2002-02-14 -1.835100 -2.006852 -2.242696  ... -1.98516 -1.750917 -2.294784
 2002-02-15 -1.835100 -2.006852 -2.110972  ... -2.14365 -1.750917 -2.537583
 2002-02-16 -1.835100 -1.817576 -2.242696  ... -2.14365 -1.750917 -2.456650
 2002-02-17 -1.835100 -2.006852 -2.242696  ... -2.14365 -1.750917 -2.537583
 2002-02-18 -1.835100 -2.006852 -2.242696  ... -2.14365 -1.432137 -2.537583
 2002-02-19 -1.835100 -2.006852 -2.110972  ... -2.1436

In [None]:
file_path=os.path.join(path,'data','training','theft2.csv')


In [None]:
file_path

'/content/drive/My Drive/Colab Notebooks/crime_prediction/data/theft.csv'

In [None]:
save_to_file(sequence,file_path)