# Big Language Model training using Ethereum Smart Contracts

# Introduction

In this series of notebooks, you will learn how to collect smart contract data from Ethereum and other EVM-compatible networks, and then train a GPT-2 language model on it. The goal is to build a pretrained model that can later be used for specific AI applications. For instance, we will fine-tune the pretrained model to classify smart contracts as normal or "malicious"; in other words, to detect smart contracts that could be used in a cyberattack before they can do any harm. For another way to accomplish the same task with machine learning algorithms, take a look at this repo: [Malicious Smart Contract ML](https://github.com/forta-network/starter-kits/tree/main/malicious-smart-contract-ml-py)

# Smart Contract Training Dataset Collection

First, we collect the data required to train our LLM. We use general smart contract data, as well as data that has already been classified as trusted and data collected from previous on-chain attacks. This notebook collects smart contract bytecode and the opcodes disassembled from it, for normal and malicious contract classification. Pretraining contracts are gathered from Zettablock, and malicious contracts from [Forta Network's labeled datasets GitHub repo](https://github.com/forta-network/labelled-datasets).

Note that our goal in these tutorials is not to train the LLM to reproduce or generate Solidity smart contract code, but to train it on bytecode so that it can be used for other kinds of tasks, such as classification.

If you find something wrong or unclear in these tutorials, please report an issue here: [Tikuna Issues](https://github.com/edenia/tikuna/issues).

In [None]:
# We start by importing the required libraries

import logging
import pickle
import os
import requests
import json
import time

from evmdasm import EvmBytecode
import pandas as pd
from tqdm import tqdm
from web3 import Web3
from dotenv import load_dotenv

# We use dotenv for environment variable configuration, so you need a .env file
# with the secret variables such as ZETTABLOCK_API_KEY=<YOUR_API_KEY>
# Load secrets
dotenv_path = '../.env'
load_dotenv(dotenv_path)

tqdm.pandas()
# disable warning logs from evmdasm tool
logging.getLogger("evmdasm").setLevel(logging.CRITICAL)

# We define where our data is going to be located in the file system
zettablock_data_file = "/data/forta/ethereum/text/pretraining/zettablock_data"
processed_data_file = "/data/forta/ethereum/text/pretraining/big_pretrain_data"

# In this tutorial we use Zettablock to collect smart contract data.
# You can create a free account and get an API key to replicate this
# tutorial (https://app.zettablock.com/v2/explore/projects)
# Configure your Zettablock API key
API_KEY = os.environ.get("ZETTABLOCK_API_KEY")
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    # credentials
    "X-API-KEY": API_KEY
}

# Zettablock data lake query endpoint, also read from the .env file
ZETTABLOCK_DATA_LAKE_ENDPOINT = os.environ.get("ZETTABLOCK_DATA_LAKE_ENDPOINT")

# Configure the blockchains we are interested in
# We will collect data from Ethereum, Polygon and BSC
blockchains = ["ethereum_mainnet", "polygon_mainnet", "bsc_mainnet"]

# Final training and validation files
train_file_path = "/data/forta/ethereum/text/pretraining/pretraining_train.csv"
val_file_path = "/data/forta/ethereum/text/pretraining/pretraining_val.csv"
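
Because `os.environ.get` returns `None` when a variable is missing, a misconfigured `.env` file tends to surface only later, for example as a rejected API request. The following is a minimal sketch of a fail-fast check; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        # Hypothetical error message; adapt it to your own setup
        raise RuntimeError(
            f"Missing environment variable {name}; add it to your .env file"
        )
    return value
```

With this helper, the configuration above could read `API_KEY = require_env("ZETTABLOCK_API_KEY")`, so that a missing key stops the notebook in the first cell.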

# Collect smart contract data

The Zettablock API allows us to download data from different EVM-compatible blockchains and use SQL-like queries to specify exactly the data we need for our training:

In [None]:
# Code taken from Zettablock tutorials
# Poll the query run status until it reaches a terminal state
def get_response(queryrun_id):
    i = 1
    queryrun_status_endpoint = f'https://api.zettablock.com/api/v1/queryruns/{queryrun_id}/status'
    while True:
        res = requests.get(queryrun_status_endpoint, headers=headers)
        state = res.json()['state']
        if state in ('SUCCEEDED', 'FAILED'):
            return state
        # Wait a little longer after each unsuccessful check
        time.sleep(i)
        i += 1

def download_file(url: str, local_file: str, headers=None, params=None):
    resp = requests.get(url, stream=True, headers=headers, params=params)
    total = int(resp.headers.get('content-length', 0))
    with open(local_file, 'ab') as file, tqdm(
        desc=local_file,
        total=total,
        unit='iB',
        unit_scale=True,
        unit_divisor=1024,
    ) as bar:
        for data in resp.iter_content(chunk_size=1024):
            size = file.write(data)
            bar.update(size)

def call_zettablock_api(query_text, blockchain):
    # Get Smart Contract Data from Zettablock for several blockchains
    query = {"query": query_text, "resultCacheExpireMillis": 86400000}
    
    # Create a query with SQL statement, and get query id
    res = requests.post(ZETTABLOCK_DATA_LAKE_ENDPOINT, headers=headers, data=json.dumps(query))
    print(res.text)
    
    # Trigger the query by query id, and get queryrun id
    query_id = res.json()['id']
    data_lake_submission_endpoints = f'https://api.zettablock.com/api/v1/queries/{query_id}/trigger'
    res = requests.post(data_lake_submission_endpoints, headers=headers, data='{}')
    
    # Check status using queryrun id
    queryrun_id = res.json()['queryrunId']
    
    if get_response(queryrun_id) == 'SUCCEEDED':
        # Fetch result from queryrun id
        params = {'includeColumnName': 'true'}
        queryrun_result_endpoint = f'https://api.zettablock.com/api/v1/stream/queryruns/{queryrun_id}/result'
        # The result can be large, so stream it to a file on disk
        download_file(queryrun_result_endpoint, zettablock_data_file+"_"+blockchain+".csv", headers=headers, params=params)
    else:
        print('query failed, please check status message for details')
        print(res.json())
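
The polling loop in `get_response` waits indefinitely if a query run never reaches `SUCCEEDED` or `FAILED`. The sketch below shows one way to bound the wait. It is an illustration under assumptions, not Zettablock's recommended client: `wait_for_state`, `fetch_state`, and `max_wait_seconds` are hypothetical names, and `"RUNNING"` is only a placeholder for any non-terminal state.

```python
import time
from typing import Callable


def wait_for_state(fetch_state: Callable[[], str],
                   max_wait_seconds: float = 600.0,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll fetch_state() until it returns a terminal state or time runs out."""
    waited = 0.0
    delay = 1.0
    while True:
        state = fetch_state()
        if state in ("SUCCEEDED", "FAILED"):
            return state
        if waited + delay > max_wait_seconds:
            raise TimeoutError(
                f"query run still in state {state!r} after {waited:.0f}s"
            )
        sleep(delay)
        waited += delay
        delay += 1.0  # same linearly growing delay as get_response


# Usage with a fake status source instead of the real API
states = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
result = wait_for_state(lambda: next(states), sleep=lambda s: None)
print(result)  # prints SUCCEEDED
```

In the notebook, `fetch_state` would wrap the status request made inside `get_response`; passing it in as a function keeps the timeout logic testable without network access.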

# Preprocess smart contract bytecode

Since we want to train an LLM, a model suitable for natural language processing, we need to extract information from the smart contracts that can be analyzed as text. We decided to use the contracts' bytecode as the model input, but it still needs to be preprocessed first:

In [None]:
# Code provided by the Forta team
# Disassemble the smart contract bytecode into opcodes
def get_opcodes(creation_bytecode) -> str:
    bytecode = creation_bytecode
    if bytecode is None:
        return ''

    try:
        opcodes = EvmBytecode(bytecode).disassemble()
    except Exception:
        return ''
    
    return " ".join([str(op).strip() for op in opcodes])
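
To make the text format concrete, here is a toy disassembler that produces the same kind of space-separated output from a hex string. It is a sketch for illustration only and is not how `evmdasm` is implemented: `toy_disassemble` and `OPCODE_NAMES` are hypothetical names, the opcode table covers only a handful of real EVM opcodes, and the `UNKNOWN_0x..` naming is an assumption chosen to match the tokens handled later in the preprocessing.

```python
# Toy opcode table: only a handful of real EVM opcodes, for illustration
OPCODE_NAMES = {
    0x00: "STOP",
    0x01: "ADD",
    0x34: "CALLVALUE",
    0x52: "MSTORE",
    0x80: "DUP1",
}


def toy_disassemble(bytecode_hex: str) -> str:
    """Turn a hex bytecode string into space-separated opcode text."""
    if bytecode_hex.startswith("0x"):
        bytecode_hex = bytecode_hex[2:]
    code = bytes.fromhex(bytecode_hex)
    tokens = []
    i = 0
    while i < len(code):
        op = code[i]
        i += 1
        if 0x60 <= op <= 0x7F:
            # PUSH1..PUSH32 are followed by 1..32 bytes of immediate data
            size = op - 0x5F
            tokens.append(f"PUSH{size}")
            tokens.append("0x" + code[i:i + size].hex())
            i += size
        else:
            tokens.append(OPCODE_NAMES.get(op, f"UNKNOWN_0x{op:02x}"))
    return " ".join(tokens)


print(toy_disassemble("0x6080604052"))  # prints PUSH1 0x80 PUSH1 0x40 MSTORE
```

The important point is that operands always appear as separate `0x...` tokens right after their `PUSH` opcode, which is what the feature filter relies on.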

In [None]:
# Code provided by the Forta team
# Filter opcodes to get the best features
def get_exp_2_features(row):
    creator = row['contract_creator']
    opcodes = row['decompiled_opcodes'].split()
    mask = '0xffffffffffffffffffffffffffffffffffffffff'
    features = []
    for i in range(len(opcodes)-1):
        first = opcodes[i]
        second = opcodes[i+1]
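        # Operand tokens (starting with '0x') are only added through the
        # PUSH cases below, never on their own
        if first.startswith('0x'):
            continue
        token = first
        if first.startswith('UNKNOWN') or first.startswith('INVALID'):
            token = first.split('_')[0]
        features.append(token)
        # Keep the operands of PUSH4 and PUSH32 as they are, and replace
        # PUSH20 operands with a 'creator', mask or 'address' token
        if first == 'PUSH4':
            features.append(second)
        elif first == 'PUSH20':
            if second == creator:
                features.append('creator')
            elif second == mask:
                features.append(mask)
            else:
                features.append('address')
        elif first == 'PUSH32':
            features.append(second)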
    return " ".join(features)

# Create the queries to download the data

Here, we create the SQL-like queries to download the data from Zettablock. We use smart contract data such as contract address, contract name, and bytecode, among others.

In [None]:
# Create queries for the supported blockchains
# Not all of them expose the same tables and columns
def get_query(blockchain):
    query_text = ""
    if blockchain == "ethereum_mainnet":
        query_text = '''
            SELECT contract.address as contract_address,
                   contract.name as contract_name,
                   contract.creator as contract_creator,
                   tags.name as contract_tag_name, 
                   tags.type as contract_type,
                   contract.code as contract_code
            FROM {}.contracts contract LEFT JOIN {}.labels tags ON tags.address = contract.address
            LIMIT 30000
        '''.format(blockchain, blockchain)
    elif blockchain == "polygon_mainnet":
        query_text = '''
            SELECT contracts.address as contract_address,
                   mappings.contract_name as contract_name,
                   contracts.creator_address as contract_creator,
                   mappings.contract_category as contract_type,
                   contracts.bytecode as contract_code
            FROM polygon_mainnet.contract_creations contracts LEFT JOIN polygon_mainnet.contract_mappings mappings ON mappings.contract_address = contracts.address
            LIMIT 30000
        '''
    else:
        query_text = '''
            SELECT contracts.address as contract_address,
                   contracts.creator_address as contract_creator,
                   contracts.bytecode as contract_code
            FROM bsc_mainnet.contract_creations contracts
            LIMIT 30000
        '''
    return query_text

# Get contracts and filter the data
def get_pretrain_contracts():
    """Collects contracts from Zettablock and their decompiled opcodes."""
    # Get data from Zettablock for 3 different EVM compatible blockchains
    for blockchain in blockchains:
        raw_file = zettablock_data_file+"_"+blockchain+".csv"
        processed_file = processed_data_file+"_"+blockchain+".csv"
        if not os.path.exists(raw_file):
            print("Downloading data from %s..." % (blockchain))
            call_zettablock_api(get_query(blockchain), blockchain)
        if not os.path.exists(processed_file):
            chunksize = 10 ** 6
            print("Decompiling and extracting opcodes for blockchain %s:" % blockchain)
            with pd.read_csv(raw_file, chunksize=chunksize) as contract_reader:
                for chunk_number, contracts in enumerate(contract_reader):
                    contracts['decompiled_opcodes'] = contracts['contract_code'].progress_apply(get_opcodes)
                    contracts = contracts[(contracts['decompiled_opcodes'].notna()) & (contracts['decompiled_opcodes'] != '')]
                    contracts = contracts.drop_duplicates('contract_address')
                    contracts['decompiled_opcodes'] = contracts.progress_apply(get_exp_2_features, axis=1)
                    # Store data so we don't have to download it all the time;
                    # write the CSV header only once, with the first chunk
                    contracts.to_csv(processed_file, mode='a', index=False, header=(chunk_number == 0))
        else:
            print("%s already exists." % processed_file)

# Run the collection and preprocessing code

Finally, run the functions created to collect the data from Zettablock and preprocess it, storing the results that we will use in the next tutorials.

In [None]:
# Actually run the collection code
get_pretrain_contracts()

In [None]:
# Finally prepare data for pretraining phase
pretraining_data = {}
for blockchain in blockchains:
    # Load data from disk
    pretraining_data[blockchain] = pd.read_csv(processed_data_file+"_"+blockchain+".csv")

# Concat data in the same pandas variable
pretraining_data = pd.concat(list(pretraining_data.values()))
print(pretraining_data.shape)
# Shuffle the data so we get mixed, heterogeneous samples from all the blockchains
pretraining_data = pretraining_data.sample(frac = 1)

# Define how many samples to use for training; the rest is used for validation.
# Consider your processing capabilities: if you only have a CPU available,
# keep this value small; with one or more GPUs you can increase it
training_samples = 60000
# Save the data to disk
training_data = pretraining_data[:training_samples]
validation_data = pretraining_data[training_samples:]
training_data['decompiled_opcodes'].to_csv(train_file_path, sep='\t', index=False)
validation_data['decompiled_opcodes'].to_csv(val_file_path, sep='\t', index=False)
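
With a fixed `training_samples` value, the validation set is empty whenever the combined dataset has no more rows than that value. As an alternative, the sketch below holds out a fraction of the rows and fixes the shuffle seed so the split is reproducible. The function name `split_train_val` and its parameters `val_fraction` and `seed` are hypothetical, and the toy DataFrame only stands in for `pretraining_data`.

```python
import pandas as pd


def split_train_val(data: pd.DataFrame, val_fraction: float = 0.1, seed: int = 42):
    """Shuffle reproducibly and hold out a fraction of the rows for validation."""
    shuffled = data.sample(frac=1, random_state=seed).reset_index(drop=True)
    # Always keep at least one validation row
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled.iloc[:-n_val], shuffled.iloc[-n_val:]


# Toy data standing in for the real opcode sequences
toy = pd.DataFrame({"decompiled_opcodes": [f"PUSH1 OP{i}" for i in range(10)]})
train, val = split_train_val(toy)
print(len(train), len(val))  # prints 9 1
```

A fraction-based split scales with whatever amount of data was actually collected, at the cost of no longer capping the training set at a fixed size.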

In the next notebook, [Train a new tokenizer based on smart contract opcodes](notebook_2_GPT_tokenizer.ipynb), we will use the collected data to train a tokenizer, a necessary step before pretraining the LLM.