<a href="https://colab.research.google.com/github/uteyechea/crime-prediction-using-artificial-intelligence/blob/master/Part5_Black_Box_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Part 5: Black-Box Testing

Test crimes predicted sequence against a real sequence of crimes.

##5.1  Dependencies, mount Google Drive and set system path
Import the relevant packages we will use to train the RNN model.

In [1]:
import os
import gc

import pandas as pd
from scipy import stats

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

path='/content/drive/My Drive/Colab Notebooks/crime_prediction'

#Update our path to import from 
import sys
sys.path.append(path+'/libs')
import autocorr as ac


Mounted at /content/drive


##5.1 RNN input sequence parameters

The windowing over the time series used an end_date and some look_back period, all highly correlated windows were put in sequence, such sequence was fed into the RNN for training purposes.

In [2]:
#Input sequence parameters
end_date='2019-01-01'
lookback_periods=10
column_name='zone11'
min_correlation=0.75

##5.2 Import test data 
Import test data sequence, i.e. real data sequence unknown to the RNN at the time of training 

In [3]:
test_file_path=os.path.join(path,'data','theft.csv')
file=pd.read_csv(test_file_path,sep=',',parse_dates=['Date'],index_col='Date')
file.isnull().values.any() # nulls?

False

In [21]:
dataframe=file
timestamp=end_date
apriori_window=dataframe.loc[pd.date_range(start=timestamp,periods=lookback_periods,freq='-1D'),column_name]
apriori_window=apriori_window[::-1]
aposteriori_window=dataframe.loc[pd.date_range(start=timestamp,periods=lookback_periods,freq='1D',closed='right'),column_name]
test_window=apriori_window.append(aposteriori_window)

In [22]:
test_window

2018-12-23    0.058824
2018-12-24    0.205882
2018-12-25    0.088235
2018-12-26    0.117647
2018-12-27    0.176471
2018-12-28    0.235294
2018-12-29    0.117647
2018-12-30    0.117647
2018-12-31    0.176471
2019-01-01    0.176471
2019-01-02    0.117647
2019-01-03    0.235294
2019-01-04    0.147059
2019-01-05    0.147059
2019-01-06    0.117647
2019-01-07    0.352941
2019-01-08    0.323529
2019-01-09    0.235294
2019-01-10    0.294118
Freq: D, Name: zone11, dtype: float64

##5.3 Import forecast data

Data sequence generated by the RNN as the most likely future crime sequence. 

In [30]:
prediction_file_path=os.path.join(path,'data','prediction','rnn_output.txt')
prediction_windows=pd.read_csv(prediction_file_path,sep=',')
print(prediction_windows.shape)

(525, 1)


Remove non-numeric values. It very well can happen that the RNN predicts some number in the wrong format, for example: 1.324.234

In [31]:
prediction_windows.iloc[:,0] = pd.to_numeric(prediction_windows.iloc[:,0], errors='coerce') #Remove non float values, substitute them with NAN
prediction_windows=prediction_windows.dropna()
prediction_windows.shape

(516, 1)

Verify all predicted data to be free of nulls

In [32]:
prediction_windows.isnull().values.any()

False

##5.4 Estimate error in RNN output.

For some date end_date we will compare the predicted sequence vs the real sequence during some period of N time units. 

Fix data types to pandas series

In [33]:
print(test_window.shape)
print(prediction_windows.iloc[:,0].shape)
print(type(test_window))
print(type(prediction_windows.iloc[:,0]))

(19,)
(516,)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


In [34]:
prediction=prediction_windows.iloc[:,0]

In [35]:
assert type(prediction)==type(test_window)

In [36]:
#series1 test_window
#series2 prediction_windows

def correlation(apriori_window,aposteriori_window,rnn_output_series,periods,min_correlation):
  apriori_window=apriori_window.reset_index(drop=True) #Better find a way to simplify this procedure
  aposteriori_window=aposteriori_window.reset_index(drop=True) #Better find a way to simplify this procedure
  for i,row in enumerate(rnn_output_series):
    try:
      predicted_window=rnn_output_series[i-periods:i]
      predicted_window=predicted_window.reset_index(drop=True) #Better find a way to simplify this procedure
      ro=apriori_window.corr(predicted_window)
      if ro > min_correlation:
        predicted_window2=rnn_output_series[i+1:i+periods]
        predicted_window2=predicted_window2.reset_index(drop=True)
        ro2=aposteriori_window.corr(predicted_window2)
        if ro2 > 0.75:
          print(i,ro,ro2)
          print(predicted_window2,aposterioi_window)

    except:
      print('fix the loop indices')




In [47]:
correlation(apriori_window,aposteriori_window,rnn_output_series=prediction,periods=len(apriori_window),min_correlation=0.5)

484 0.785514675815196 0.7868018047042111
fix the loop indices


In [55]:
target_window=prediction[484-len(apriori_window):484+len(apriori_window)-1]

In [57]:
# x and y given as array_like objects
import plotly.express as px
fig = px.scatter(y=test_window, x=target_window)
fig.show()

line chart

In [97]:
df={}
df=pd.DataFrame(df)
df1={}
df1=pd.DataFrame(df1)


df['value']=list(test_window.reset_index(drop=True))
df['code']=['test']*len(test_window)

df1['value']=list(target_window.reset_index(drop=True))
df1['code']=['target']*len(target_window)

df3=df.append(df1)


In [98]:
import plotly.express as px

df = px.data.gapminder().query("continent=='Oceania'")
fig = px.line(df3, x=df3.index, y="value", color='code')
fig.show()