## Getting data

Download logs of failing requests from [GCP logs](https://console.cloud.google.com/logs/query;query=resource.type%3D%22k8s_container%22%0Aresource.labels.project_id%3D%22dfuseio-global%22%0Aresource.labels.location%3D%22us-central1%22%0Aresource.labels.cluster_name%3D%22saas-us-central1%22%0Alabels.%22k8s-pod%2Fname%22%3D%22rpc-proxy%22%0AtextPayload:%20%22timeout%22%0Aresource.labels.namespace_name%3D%22eth-mainnet%22;timeRange=PT3H;summaryFields=resource%252Flabels%252Fnamespace_name:false:32:beginning;cursorTimestamp=2023-03-13T16:40:50.032921377Z;resultsSearch=timeout%20response%20from%20reference?project=dfuseio-global)

Set the current kubectl namespace:
```
kubectl config set-context --current --namespace=eth-mainnet
```


Port-forward the statedb server:
```
kubectl port-forward deploy/statedb-server-v2 9000
```
Then start the EVM executor ([found here](https://github.com/streamingfast/evm-executor)):

```
export DEBUG=.*
go run ./cmd/evmx executor serve json-rpc --listen-addr=:8080 --chain=mainnet  --state-provider-dsn=statedb://localhost:9000  --timeout=60s 2>&1 | grep -e "executing call" -e "fetching storage value"  | tee -a  logs-[logs_suffix].txt
```

Example curl request:
```
curl localhost:8080   -X POST   -H "Content-Type: application/json"   --data '{"jsonrpc":"2.0","method":"eth_call","params":[{"data":"0xa4c3e73c","gas":"0x2faf080","to":"0x0ed8fa7fc63a8eb5487e7f87caf1ab3914ea4eca"},{"blockHash":"0x6d29951406e01bc9f9fad518589a3d1063779fabe915c695a3ce4e3c2f57c7ae"}],"id":1801694}'
```
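
The same request can be built from Python, which is convenient when replaying many calls. This is a minimal sketch using only the standard library; the helper names `build_eth_call` and `post_json` are made up for illustration, and the URL assumes the executor started above is listening on port 8080.

```python
import json
import urllib.request


def build_eth_call(to, data, block_hash, gas="0x2faf080", request_id=1):
    """Build an eth_call JSON-RPC payload pinned to a block hash."""
    return {
        "jsonrpc": "2.0",
        "method": "eth_call",
        "params": [
            {"data": data, "gas": gas, "to": to},
            {"blockHash": block_hash},
        ],
        "id": request_id,
    }


def post_json(url, payload, timeout=60):
    """POST the payload as JSON and return the decoded response (hypothetical helper)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


# Same values as the curl example above
payload = build_eth_call(
    to="0x0ed8fa7fc63a8eb5487e7f87caf1ab3914ea4eca",
    data="0xa4c3e73c",
    block_hash="0x6d29951406e01bc9f9fad518589a3d1063779fabe915c695a3ce4e3c2f57c7ae",
    request_id=1801694,
)

# Uncomment once the executor is running locally:
# print(post_json("http://localhost:8080", payload))
```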


From the VM:
```
kubectl config set-context --current --namespace=eth-mainnet

kubectl get pods -o wide
```
Get the IP of one of the `statedb-server` pods and use it as the statedb host:
```
cd ~/evm-executor
export DEBUG=.*
go run ./cmd/executor serve json-rpc --listen-addr=:8084 --chain=mainnet  --state-provider-dsn=statedb://10.1.173.42:9000  --timeout=600s  2>&1 | grep -e "executing call" -e "fetching storage value" | tee -a  /data/mdm/logs-20230314.txt
```

Run the extractor:
```
cd evx-data
python3 extract_logs.py
```

Once completed, run the following locally to copy the logs:
```
scp vm:/data/mdm/logs-20230310.txt .
```

In [None]:
# import requests
# import re

# logs_suffix = '20230310'

# import json
# with open('downloaded-logs-{}.json'.format(logs_suffix)) as f:
#     logs = json.load(f)
# print('logs length: {}'.format(len(logs)))

# def unpack_payload(text_payload):
#     match = re.search("{.*}", text_payload).group()
#     resp = json.loads(match)
#     return resp['request']

# processed = {}

# url = 'http://localhost:8080'
# headers = {'Content-Type': 'application/json'}

# for idx, log in enumerate(logs):
#     payload = unpack_payload(log['textPayload'])

#     # deduplicate logs
#     if str(payload) in processed.keys():
#         continue

#     print('processing request {}'.format(idx))
#     resp = requests.post(url, data=payload, headers=headers)
#     processed[str(payload)] = 1
    
# len(processed)
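
The commented-out cell above replays each failing request once. The extraction and deduplication steps can be sketched on their own, without contacting the executor. The sample `textPayload` strings below are invented for illustration and assume only what the cell assumes: a JSON object embedded in the text with a `request` field. The name `unique_requests` is likewise hypothetical.

```python
import json
import re


def unpack_payload(text_payload):
    """Return the 'request' field of the JSON object embedded in a log line, or None."""
    match = re.search(r"{.*}", text_payload)
    if match is None:
        return None
    return json.loads(match.group())["request"]


def unique_requests(logs):
    """Yield each distinct request once, in first-seen order."""
    seen = set()
    for log in logs:
        request = unpack_payload(log["textPayload"])
        if request is None:
            continue
        # Serialize with sorted keys so equal requests compare equal
        key = json.dumps(request, sort_keys=True)
        if key not in seen:
            seen.add(key)
            yield request


# Hypothetical log entries, not real output
sample_logs = [
    {"textPayload": 'timeout response from reference {"request": {"method": "eth_call", "id": 1}}'},
    {"textPayload": 'timeout response from reference {"request": {"id": 1, "method": "eth_call"}}'},
    {"textPayload": "no json here"},
]

requests_to_replay = list(unique_requests(sample_logs))
print(len(requests_to_replay))
```

Keying on a canonical JSON string rather than `str(payload)` makes the check independent of key order in the logged request.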

## Transform logs to model input

In [2]:
import pandas as pd
import re
import json

def get_token_key_suffix(x):
    if (x['key_suffix_prev'] != x['key_suffix']) or \
    (x['key_prefix_prev'] != x['key_prefix']) or \
    (x['addr_prev'] != x['addr']) or \
    (x['method_prev'] != x['method']) or \
    (x['trace_id_prev'] != x['trace_id']):
        return ' KS:' + x['key_suffix']
    else:
        return ''

def get_token_key_prefix(x):
    if (x['key_prefix_prev'] != x['key_prefix']) or \
    (x['addr_prev'] != x['addr']) or \
    (x['method_prev'] != x['method']) or \
    (x['trace_id_prev'] != x['trace_id']):
        return ' KP:' + x['key_prefix']
    else:
        return ''
        
def get_token_addr(x):
    if (x['addr_prev'] != x['addr']) or \
    (x['method_prev'] != x['method']) or \
    (x['trace_id_prev'] != x['trace_id']):
        return ' A:' + x['addr']
    else:
        return ''

def get_token_method(x):
    if x['method_prev'] != x['method'] or x['trace_id_prev'] != x['trace_id']:
        return ' M:' + x['method']
    else:
        return ''

def process_logs(filename):
    lines = []
    with open(filename) as f:
        lines = f.readlines()

    print('finished reading file')
    # cleanup
    # drop the last line (assumed to be incomplete)
    if lines:
        del lines[-1]

    executing_calls = []
    fetching_storages = []
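    for line in lines:
        search = re.search(r"{.*}", line)
        if search is None:
            # skip lines without a JSON payload instead of raising AttributeError
            continue
        data_json = json.loads(search.group())

        if 'executing call' in line:
            executing_calls.append(data_json)
        elif 'fetching storage' in line:
            # the first 28 characters of the line hold the timestamp
            data_json['timestamp'] = line[:28]
            fetching_storages.append(data_json)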

    fetching_storages_df = pd.DataFrame(fetching_storages)

    # flatten the nested call params so 'params.data' becomes a column
    executing_calls_df = pd.json_normalize(executing_calls)[['trace_id', 'params.data']]
    executing_calls_df['method'] = executing_calls_df['params.data'].str[:8]

    executing_calls_df['trace_id'] = executing_calls_df['trace_id'].astype('string')
    fetching_storages_df['trace_id'] = fetching_storages_df['trace_id'].astype('string')

    for col in ['addr', 'key']:
        fetching_storages_df[col] = fetching_storages_df[col].str.lower()

    print(len(fetching_storages_df))

    df = pd.merge(left=fetching_storages_df,right=executing_calls_df, on='trace_id', how='inner')[["trace_id", "addr", "key", "timestamp", "method"]]

    print(len(df))

    df = df.astype(
        {
            'addr':'string',
            'key':'string',
            'timestamp':'string',
            'method':'string',
        })

    df['key_prefix'] = df['key'].str[:63]
    df['key_suffix'] = df['key'].str[63:]
    df.sort_values(['trace_id', 'timestamp'], inplace=True)

    # assign the filled result directly; chained inplace fillna may not update df in newer pandas
    df['trace_id_prev'] = df.trace_id.shift(1).fillna('')

    df['trace_token'] = df.apply(lambda x: ' T' if x['trace_id_prev'] != x['trace_id'] else '', axis=1)

    print('processing tokens')

    for col in ['method', 'key_prefix', 'key_suffix', 'addr']:
        # assign the filled result directly; chained inplace fillna may not update df in newer pandas
        df[col+'_prev'] = df[col].shift(1).fillna('')

    df['method_token'] = df.apply(get_token_method, axis=1)

    print('method done')

    df['addr_token'] = df.apply(get_token_addr, axis=1)

    print('addr done')

    df['key_prefix_token'] = df.apply(get_token_key_prefix, axis=1)

    print('key prefix done')

    df['key_suffix_token'] = df.apply(get_token_key_suffix, axis=1)

    print('key suffix done')

    df['token'] = df.trace_token + df.method_token + df.addr_token + df.key_prefix_token + df.key_suffix_token
    display(df.describe())
    display(df.head())

    return df
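
The four `get_token_*` helpers implement a hierarchical delta encoding: a field emits a token only when it, or any field above it in the order trace, method, address, key prefix, key suffix, differs from the previous row. Below is a compact pure-Python sketch of the same rule, run on made-up rows. The values are shortened placeholders, not real addresses or keys, and `LEVELS` and `encode` are illustrative names that do not appear in the notebook.

```python
# Fields from outermost to innermost, with the token label each one emits
LEVELS = [
    ("trace_id", " T"),
    ("method", " M:"),
    ("addr", " A:"),
    ("key_prefix", " KP:"),
    ("key_suffix", " KS:"),
]


def encode(rows):
    """Return one token string per row, emitting only the fields that changed."""
    tokens = []
    prev = {name: "" for name, _ in LEVELS}
    for row in rows:
        changed = False
        token = ""
        for name, label in LEVELS:
            # Once an outer field changes, every inner field is re-emitted
            changed = changed or row[name] != prev[name]
            if changed:
                # The trace marker carries no value, matching ' T' in process_logs
                token += label if name == "trace_id" else label + row[name]
        tokens.append(token)
        prev = row
    return tokens


# Made-up rows, already sorted by trace and timestamp
rows = [
    {"trace_id": "t1", "method": "m1", "addr": "0xaa", "key_prefix": "0x00", "key_suffix": "001"},
    {"trace_id": "t1", "method": "m1", "addr": "0xaa", "key_prefix": "0x00", "key_suffix": "002"},
    {"trace_id": "t1", "method": "m1", "addr": "0xbb", "key_prefix": "0x00", "key_suffix": "002"},
    {"trace_id": "t2", "method": "m1", "addr": "0xbb", "key_prefix": "0x00", "key_suffix": "002"},
]

encoded = encode(rows)
for token in encoded:
    print(repr(token))
```

The second row emits only ` KS:002` because nothing above the key suffix changed, while the third row re-emits the key prefix and suffix because the address changed.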

In [4]:
df = process_logs('data/logs-20230314.txt')

filename = 'valid'
with open(f'data/model_input_{filename}.txt', 'w') as f:
    for _, item in df.token.items():
        f.write(item)
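
Because every token starts with a space and each trace opens with a bare `T`, the file written above is a single whitespace-separated stream that can be split back into per-trace sequences. A small sketch, using a made-up stream rather than real data; `split_traces` is a hypothetical helper name.

```python
def split_traces(stream):
    """Group a whitespace-separated token stream into one token list per trace."""
    traces = []
    for tok in stream.split():
        if tok == "T":
            traces.append([])       # a bare 'T' opens a new trace
        elif traces:
            traces[-1].append(tok)  # tokens before the first 'T' are ignored
    return traces


# Made-up stream in the same shape as the model input file
stream = " T M:m1 A:0xaa KP:0x00 KS:001 KS:002 T M:m2 A:0xbb KP:0x01 KS:0ff"
traces = split_traces(stream)
for trace in traces:
    print(trace)
```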

finished reading file
115552
115552
processing tokens
method done
addr done
key prefix done
key suffix done


| | trace_id | addr | key | timestamp | method | key_prefix | key_suffix | trace_id_prev | trace_token | method_prev | key_prefix_prev | key_suffix_prev | addr_prev | method_token | addr_token | key_prefix_token | key_suffix_token | token |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 115552 | 115552 | 115552 | 115552 | 115552 | 115552 | 115552 | 115552 | 115552.0 | 115552 | 115552 | 115552 | 115552 | 115552.0 | 115552.0 | 115552.0 | 115552 | 115552 |
| unique | 768 | 462 | 8683 | 61139 | 58 | 2564 | 3502 | 769 | 2.0 | 59 | 2565 | 3503 | 463 | 59.0 | 463.0 | 2565.0 | 3502 | 6662 |
| top | 737b0e460f84fa40e3e6a0d36879ab87 | 0xc2edad668740f1aa35e4d8f227fb8e17dca888cd | 0x00000000000000000000000000000000000000000000... | 2023-03-29T16:02:28.444-0400 | 482ba306 | 0x00000000000000000000000000000000000000000000... | 1 | 737b0e460f84fa40e3e6a0d36879ab87 | | 482ba306 | 0x00000000000000000000000000000000000000000000... | 1 | 0xc2edad668740f1aa35e4d8f227fb8e17dca888cd | | | | KS:001 | KP:0x0000000000000000000000000000000000000000... |
| freq | 587 | 12729 | 2307 | 12 | 47956 | 14745 | 2313 | 587 | 114784.0 | 47956 | 14745 | 2313 | 12729 | 114784.0 | 98249.0 | 59534.0 | 2313 | 1572 |


| | trace_id | addr | key | timestamp | method | key_prefix | key_suffix | trace_id_prev | trace_token | method_prev | key_prefix_prev | key_suffix_prev | addr_prev | method_token | addr_token | key_prefix_token | key_suffix_token | token |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 82079 | 002de3b5f12fb6861731cf57df296d2a | 0x25bf7b72815476dd515044f9650bf79bad0df655 | 0x00000000000000000000000000000000000000000000... | 2023-03-29T16:11:17.426-0400 | 278e4153 | 0x00000000000000000000000000000000000000000000... | 002 | | T | | | | | M:278e4153 | A:0x25bf7b72815476dd515044f9650bf79bad0df655 | KP:0x0000000000000000000000000000000000000000... | KS:002 | T M:278e4153 A:0x25bf7b72815476dd515044f9650b... |
| 82080 | 002de3b5f12fb6861731cf57df296d2a | 0x0959158b6040d32d04c301a72cbfd6b39e21c9ae | 0x0df992ebbea8a74cdec8667d120cd9e01b5272f7dcbc... | 2023-03-29T16:11:18.120-0400 | 278e4153 | 0x0df992ebbea8a74cdec8667d120cd9e01b5272f7dcbc... | 751 | 002de3b5f12fb6861731cf57df296d2a | | 278e4153 | 0x00000000000000000000000000000000000000000000... | 002 | 0x25bf7b72815476dd515044f9650bf79bad0df655 | | A:0x0959158b6040d32d04c301a72cbfd6b39e21c9ae | KP:0x0df992ebbea8a74cdec8667d120cd9e01b5272f7... | KS:751 | A:0x0959158b6040d32d04c301a72cbfd6b39e21c9ae ... |
| 82081 | 002de3b5f12fb6861731cf57df296d2a | 0x0959158b6040d32d04c301a72cbfd6b39e21c9ae | 0x88ee56716077ffdb6729512f5649fb881d48812cf508... | 2023-03-29T16:11:18.226-0400 | 278e4153 | 0x88ee56716077ffdb6729512f5649fb881d48812cf508... | 2aa | 002de3b5f12fb6861731cf57df296d2a | | 278e4153 | 0x0df992ebbea8a74cdec8667d120cd9e01b5272f7dcbc... | 751 | 0x0959158b6040d32d04c301a72cbfd6b39e21c9ae | | | KP:0x88ee56716077ffdb6729512f5649fb881d48812c... | KS:2aa | KP:0x88ee56716077ffdb6729512f5649fb881d48812c... |
| 82082 | 002de3b5f12fb6861731cf57df296d2a | 0x0959158b6040d32d04c301a72cbfd6b39e21c9ae | 0xfd5bf0a923e693bd84ac9d5de6744929b917c5ce279f... | 2023-03-29T16:11:18.326-0400 | 278e4153 | 0xfd5bf0a923e693bd84ac9d5de6744929b917c5ce279f... | 6fe | 002de3b5f12fb6861731cf57df296d2a | | 278e4153 | 0x88ee56716077ffdb6729512f5649fb881d48812cf508... | 2aa | 0x0959158b6040d32d04c301a72cbfd6b39e21c9ae | | | KP:0xfd5bf0a923e693bd84ac9d5de6744929b917c5ce... | KS:6fe | KP:0xfd5bf0a923e693bd84ac9d5de6744929b917c5ce... |
| 82083 | 002de3b5f12fb6861731cf57df296d2a | 0x25bf7b72815476dd515044f9650bf79bad0df655 | 0x00000000000000000000000000000000000000000000... | 2023-03-29T16:11:18.430-0400 | 278e4153 | 0x00000000000000000000000000000000000000000000... | 001 | 002de3b5f12fb6861731cf57df296d2a | | 278e4153 | 0xfd5bf0a923e693bd84ac9d5de6744929b917c5ce279f... | 6fe | 0x0959158b6040d32d04c301a72cbfd6b39e21c9ae | | A:0x25bf7b72815476dd515044f9650bf79bad0df655 | KP:0x0000000000000000000000000000000000000000... | KS:001 | A:0x25bf7b72815476dd515044f9650bf79bad0df655 ... |


Now move to the Training notebook to process the logs.

## Getting the data to postgres

In [None]:
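# logs_suffix is only defined in the commented-out cell above; the value here is assumed, adjust it to your log file
logs_suffix = '20230310'
with open('logs-{}.txt'.format(logs_suffix)) as f:
    lines = f.readlines()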

In [None]:
data = []
for l in lines:
    search = re.search("{.*}", l)
    if search is not None:
        match = search.group()
        data_json = json.loads(match)
        data.append(data_json)

df = pd.DataFrame(data)
df.head()

In [None]:
df.to_csv('logs2df_{}.csv'.format(logs_suffix), index_label='idx')