# **Ethereum Smart Contracts Security Analysis**

In this notebook we analyze how well **LLMs** generate code, specifically Solidity smart contracts for the Ethereum blockchain.

---

(*We currently use a ready-made prompt dataset and no longer generate the prompts ourselves. The old code that generates prompts from a dataset of Solidity smart contracts is still included below.*)

**Huggingface Dataset**: *https://huggingface.co/datasets/braindao/Solidity-Dataset*

The steps will be the following:

1.   *Import contracts and turn them into prompt* (**OPTIONAL**)

1.   Import dataset containing the prompts for the generation

2.   Set up the OpenAI client with an API key

3.   Define the instruction text given to the LLM coder for the generation

4.   **Generate smart contracts**





In [None]:
%%capture

from google.colab import drive
drive.mount('/content/drive')

#UNCOMMENT FIRST THREE LINES IF YOU ARE USING CONTRACT DATASET FOR PROMPT GENERATION

#!unzip /content/drive/MyDrive/Ethereum_smart_contract_datast

#import shutil
#shutil.rmtree('/content/__MACOSX')
#shutil.rmtree('/content/Ethereum_smart_contract_datast')

**Set up OpenAI with its API key**

In [None]:
%%capture

!pip install openai
from openai import OpenAI

client = OpenAI(
    api_key = "YOUR_API_KEY",
)
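Hard-coding the key in the notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the key could instead be read from an environment variable. The helper name `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions made for illustration; they are not part of the original notebook.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    Raises a clear error instead of silently passing an empty key on.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell")
    return api_key

# Hypothetical usage with the client defined above:
# client = OpenAI(api_key=load_api_key())
```

The same helper could serve the DeepSeek client by passing a different variable name.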

## **Contract collection**

The following cell is only needed if we generate the prompts ourselves.
It loads a dataset taken from GitHub that contains a large number of Ethereum smart contracts written in Solidity, and collects a fixed number of them locally to pass to the prompt generator.

**YOU CAN SKIP IT IF YOU ARE USING A PROMPT DATASET**



In [None]:
import os

contract_codes = []
#The dataset is on my Google Drive
source_directory = '/content/Ethereum_smart_contract_datast/contract_dataset_github/'
codes_collected = 0
codes_needed = 500 #To increase the dataset size

for root, dirs, files in os.walk(source_directory):
    for file_name in files:
        if file_name.endswith('.sol'):
            file_path = os.path.join(root, file_name)
            #Use a separate name for the file handle so it does not shadow the loop variable
            with open(file_path, 'r') as contract_file:
                contract_codes.append(contract_file.read()) #Add the contract code to the list contract_codes
                codes_collected += 1
                if codes_collected >= codes_needed:
                    break

    #Stop when you have reached the requested number of contracts
    if codes_collected >= codes_needed:
        break

#print(f"Contracts collected: {len(contract_codes)}")
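The collection step above can also be sketched with `pathlib` and `itertools.islice`, which removes the manual counter and the two `break` statements. The function name `collect_contracts` is hypothetical, and passing `errors="ignore"` is an assumption about how undecodable files should be handled. Unlike the loop above, this version lists all matching files before reading any, so that the order is reproducible.

```python
from itertools import islice
from pathlib import Path

def collect_contracts(source_directory, limit):
    """Read up to `limit` .sol files found anywhere under source_directory."""
    # Sorting gives the same selection on every run, at the cost of walking the whole tree
    sol_files = sorted(Path(source_directory).rglob("*.sol"))
    # errors="ignore" (an assumption) drops undecodable bytes instead of raising
    return [
        path.read_text(encoding="utf-8", errors="ignore")
        for path in islice(sol_files, limit)
    ]

# Hypothetical usage with the variables defined above:
# contract_codes = collect_contracts(source_directory, codes_needed)
```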

In [None]:
#IF YOU WANT TO SEE ONE OF THE CONTRACTS COLLECTED
index_to_print = 300

if 0 <= index_to_print < len(contract_codes):
    print(f"Contract code {index_to_print + 1}:")
    print(contract_codes[index_to_print])
else:
    print("Invalid index")


## **PROMPT GENERATOR**

In this section we proceed to generate the prompts for the LLMs.

**YOU CAN SKIP IT IF YOU ARE USING A PROMPT DATASET**

To do so we define two different sets of instructions, which lead to different levels of detail in the generated prompt. **At this moment** the preferred one is **instructions1**.

In [None]:
instructions1 = """
Generate a prompt for an AI model for creating a contract starting from the smart contract code the user gives you.
Underline that it should be in solidity
The prompt should be a description of the contract, stating the purpose of it and its rules.
If I give the prompt to another AI model it should generate pretty much the same contract.
Specify the rules, the purpose and the general description of the code.
"""

instructions2 = """
Generate a prompt for creating a contract starting from the smart contract code the user gives you.
Underline that it should be in solidity
The prompt should be a description of the contract, stating the purpose of it and its rules.
If I give the prompt to another AI model it should generate pretty much the same contract.
It should be a simple description
Not possible to modify or add anything with respect to the given prompt
"""

The generator takes two parameters:


*   The instructions for the generation
*   The smart contract that should be turned into a prompt



In [None]:
def prompt_generatorv1(smartcontract, generator_instructions):
    response = client.chat.completions.create(
        model = "gpt-3.5-turbo",
        messages = [
          {"role": "system", "content": generator_instructions},
          {"role": "user", "content": smartcontract},
        ]
    )

    generated_prompt = response.choices[0].message.content.strip()

    return generated_prompt

**GENERATE 500 PROMPTS SELECTING A RANDOM CONTRACT AND TURNING IT INTO PROMPT**

In [None]:
import random

prompts = []
how_many_prompts = 500

for _ in range(how_many_prompts):
    #Random contract selection
    contract_index = random.randint(0, len(contract_codes) - 1)
    selected_contract = contract_codes[contract_index]

    prompts.append(prompt_generatorv1(selected_contract, instructions1))
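Note that `random.randint` can return the same index more than once, so the loop above may turn one contract into several prompts. If distinct contracts are preferred, a minimal sketch using sampling without replacement could look like this. The function name `pick_unique_contracts` and the `seed` parameter are hypothetical additions.

```python
import random

def pick_unique_contracts(contracts, how_many, seed=None):
    """Pick up to `how_many` contracts, never choosing the same list position twice."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    # random.sample raises if asked for more items than exist, so cap the request
    return rng.sample(contracts, min(how_many, len(contracts)))

# Hypothetical usage with the names defined above:
# for selected_contract in pick_unique_contracts(contract_codes, how_many_prompts):
#     prompts.append(prompt_generatorv1(selected_contract, instructions1))
```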


**CELL USED TO PRINT A SELECTED PROMPT JUST TO CHECK**

In [None]:
index_to_print = 14

if 0 <= index_to_print < len(prompts):
    print(f"My generated prompt {index_to_print + 1}:")
    print(prompts[index_to_print])
else:
    print("Invalid index")

**PROMPT DATASET EXPORT AS A CSV FILE**

In [None]:
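import pandas as pd #Needed here because pandas is otherwise imported only in the dataset loading section

data = {"Prompt": prompts}
df = pd.DataFrame(data)

#Exporting dataset
csv_file_path = "/content/test_prompts_dataset.csv"  #UNDER CONSTRUCTION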
df.to_csv(csv_file_path, index=False)

print(f"DATASET SAVED IN {csv_file_path}")


## **DATASET LOADING FROM HUGGINGFACE**

In this section we load the prompt dataset from Hugging Face; the link is given at the beginning of this notebook.

In [None]:
import pandas as pd
from IPython.display import display

prompt_dataset = pd.read_parquet("hf://datasets/braindao/Solidity-Dataset/SolidityP.parquet")
#display(prompt_dataset[['average']])



**CELL TO CHECK THE CONTENT OF THE DATASET, WE ARE USING COLUMN AVERAGE**

In [None]:
sample_row = prompt_dataset.sample(n=1).iloc[0]
print(sample_row['average'])

Create a smart contract that follows the ERC20 token standard, allowing for token transfers, approvals, and burns. Use the SafeMath library for mathematical operations. Ensure that the contract tracks balances, allowances, and total supply accurately. Implement a constructor for initializing the contract name, symbol, and decimal places. Include functions for transferring tokens, approving spenders, and burning tokens. Focus on building a basic, functional contract that follows the ERC20 standard.


We keep the first 250 prompts so that both models generate their contracts from the same 250 prompts.

In [None]:
prompts_to_generate = prompt_dataset['average'][:250].tolist()

## **CODE GENERATION WITH GPT-4**

In this section we generate the smart contract code with the GPT-4 model. We define the instructions given to the coder and choose how many contracts to generate; in our case, 250 (one per selected prompt).




In [None]:
#CODER INSTRUCTIONS

coder = """
You will generate a deployable smart contract code in solidity, based on the prompt I give you.
Use Solidity version ^0.8.0

The file should contain only solidity code, no comments or "```sol".

I should be able to copy your response and paste it in a sol file to deploy.
Do not use Import statement, only code, if there's any import, replace it with code for the actual imported contract.
"""

In [None]:
#FOLDER FOR SOLIDITY CONTRACTS
import os
import shutil

output_dir = 'gpt_contracts/'

if os.path.exists(output_dir):
    shutil.rmtree(output_dir)

os.makedirs(output_dir, exist_ok=True)

During manual checking we saw that the code provided by ChatGPT was not always ONLY CODE, but sometimes contained additional text or lines that we did not want. We therefore created a function that removes everything above the first of the two possible opening lines (license or pragma) and everything after the last '}'.

In [None]:
def sanitize_code(code):
    #REMOVE EVERYTHING ABOVE // SPDX
    spdx_index = code.find('// SPDX')
    if spdx_index != -1:
        code = code[spdx_index:]
    else:
        #IF THERE'S NO // SPDX, REMOVE EVERYTHING ABOVE pragma
        pragma_index = code.find('pragma')
        if pragma_index != -1:
            code = code[pragma_index:]

    #REMOVE EVERYTHING AFTER LAST "}"
    last_brace_index = code.rfind('}')
    if last_brace_index != -1:
        code = code[:last_brace_index + 1]

    return code
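To make the trimming concrete, here is a small worked example on a made-up model response. The function body is repeated only so that the example runs on its own; the response text is invented for illustration.

```python
def sanitize_code(code):
    # Same logic as the function defined above, repeated so this example is self-contained
    spdx_index = code.find('// SPDX')
    if spdx_index != -1:
        code = code[spdx_index:]
    else:
        pragma_index = code.find('pragma')
        if pragma_index != -1:
            code = code[pragma_index:]
    last_brace_index = code.rfind('}')
    if last_brace_index != -1:
        code = code[:last_brace_index + 1]
    return code

# Invented response with chatter before and after the contract
raw_response = (
    "Here is the contract you asked for:\n"
    "// SPDX-License-Identifier: MIT\n"
    "pragma solidity ^0.8.0;\n"
    "contract Counter {\n"
    "    uint256 public count;\n"
    "}\n"
    "Let me know if you need changes."
)

cleaned = sanitize_code(raw_response)
print(cleaned)  # starts at the SPDX line and ends at the closing brace
```

The heuristic has limits worth keeping in mind: trailing text that itself contains a `}` survives the cut, and when there is no SPDX line, an introductory sentence containing the word "pragma" becomes the start of the file.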

**CODE GENERATOR PARAMETERS**

*   **Coder instructions** defined above
*   **Prompt** used for the generation



In [None]:
def code_generator(prompt, coder_instructions):
    response = client.chat.completions.create(
        model = "gpt-4",
        messages = [
          {"role": "system", "content": coder_instructions},
          {"role": "user", "content": prompt},
        ]
    )

    generated_contract = response.choices[0].message.content.strip()

    return generated_contract

generated_data = []

contracts_to_generate = 250 #SELECT NUMBER OF CONTRACTS NEEDED
for i in range(contracts_to_generate):
    prompt_used_gpt = prompts_to_generate[i]
    gpt_contract = sanitize_code(code_generator(prompt_used_gpt, coder))
    file_name_gpt = f'contract_{i + 1}.sol'

    #The contract is already sanitized, so it can be written as-is
    with open(f'/content/gpt_contracts/{file_name_gpt}', 'w') as file:
        file.write(gpt_contract)

    generated_data.append([prompt_used_gpt, gpt_contract, file_name_gpt])
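The loop above makes 250 sequential API calls, so a single transient failure (a timeout or a rate limit, for example) stops the whole run. One way to make it more robust is a small retry wrapper with exponential backoff. This is a sketch under assumptions: the helper name `call_with_retries` and its parameters are hypothetical, and it retries on any exception rather than on specific API error types.

```python
import time

def call_with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff when it raises."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then twice that, then four times that, and so on
            sleep(base_delay * (2 ** attempt))

# Hypothetical usage inside the generation loop:
# gpt_contract = sanitize_code(call_with_retries(code_generator, prompts_to_generate[i], coder))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.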

**BUILD A DATAFRAME WITH THE GENERATED DATA**

In [None]:
df = pd.DataFrame(generated_data, columns=['prompt_gpt', 'gpt_contract', 'file_name_gpt'])

**EXPORT THE FOLDER OF GENERATED CONTRACTS**

In [None]:
import shutil
from google.colab import files

shutil.make_archive('/content/gpt_contracts', 'zip', '/content/gpt_contracts')
files.download('gpt_contracts.zip')


print("The contracts have been saved and zipped successfully.")

The contracts have been saved and zipped successfully.


In [None]:
df.to_csv('/content/gpt.csv', index=False)
files.download('/content/gpt.csv')


## **CODE GENERATION WITH DEEPSEEK-CODER**

In this section we generate the smart contract code with the DeepSeek-Coder model. We define the instructions given to the coder and choose how many contracts to generate; in our case, 250 (the same prompts used for GPT-4).

In [None]:
#Remove the opening "```solidity" and closing "```" delimiters

def remove_code_delimiters(generated_code):
    #Split into lines
    lines = generated_code.split('\n')

    #Remove the opening delimiter ("```solidity" or "```sol") if it's in the first line
    if lines and lines[0].strip() in ("```solidity", "```sol"):
        lines = lines[1:]

    #Remove the closing delimiter if it's in the last line
    #(the "lines and" guard avoids an IndexError when nothing is left)
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]

    #Join the lines back together
    cleaned_code = '\n'.join(lines)
    return cleaned_code
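The function above recognizes only the `solidity` and `sol` language tags. As an alternative sketch (not the notebook's method), a regular expression can strip a surrounding Markdown fence with any tag or none. The names `FENCE_RE` and `strip_markdown_fence` are hypothetical.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to embed
FENCE_RE = re.compile(
    r"^" + FENCE + r"[\w+-]*[ \t]*\n(.*?)\n?" + FENCE + r"[ \t]*$",
    re.DOTALL,  # let "." match newlines so the body can span several lines
)

def strip_markdown_fence(text):
    """Return the body of a response wrapped in one Markdown code fence.

    Text that is not wrapped in a fence is returned unchanged (apart from
    leading and trailing whitespace).
    """
    stripped = text.strip()
    match = FENCE_RE.match(stripped)
    return match.group(1) if match else stripped

fenced = FENCE + "solidity\npragma solidity ^0.8.0;\ncontract A {}\n" + FENCE
print(strip_markdown_fence(fenced))
```

This only handles a response that is entirely one fenced block; prose before or after the fence is left for a step such as `sanitize_code`.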

In [None]:
#FOLDER FOR SOLIDITY CONTRACTS
import os
import shutil #Imported here too, so this cell also works when the GPT-4 section is skipped

output_dir = 'deepseek_contracts/'

if os.path.exists(output_dir):
    shutil.rmtree(output_dir)

os.makedirs(output_dir, exist_ok=True)

In [None]:
deepseek_api_endpoint = "https://api.deepseek.com"
deepseek_api_key = "YOUR__API__KEY__HERE"

deepseek_client = OpenAI(api_key=deepseek_api_key, base_url=deepseek_api_endpoint)

deepseek_coder_instructions = """ You will generate a deployable smart contract code in solidity, based on the prompt I give you.
                                  Use Solidity version ^0.8.0

                                  The contract should be made only in Solidity and it must be ready to deploy
                                  only code , do not put "```solidity" at the beginning of the response neither ``` at the end.
                                  I want a fully deployable file with only code.

                                  Do not use Import statement, only code, if there's any import, replace it with code for the actual imported contract.

                                  """

def code_generator_deepseek(prompt, coder_instructions):
      response = deepseek_client.chat.completions.create(
          model="deepseek-coder",
          messages=[
              {"role": "system", "content": coder_instructions},
              {"role": "user", "content": prompt},
          ],
          stream=False
      )

      generated_contract = remove_code_delimiters(response.choices[0].message.content)

      return generated_contract

deepseek_data = []

contracts_to_generate = 250 #SELECT NUMBER OF CONTRACTS NEEDED
for i in range(contracts_to_generate):
      #code_generator_deepseek already removes the code delimiters
      ds_contract = code_generator_deepseek(prompts_to_generate[i], deepseek_coder_instructions)
      prompt_used_ds = prompts_to_generate[i]
      file_name_ds = f'contract_{i + 1}.sol'

      with open(f'/content/deepseek_contracts/contract_{i + 1}.sol', 'w') as file:
        file.write(ds_contract)

      deepseek_data.append([prompt_used_ds, ds_contract, file_name_ds])

**SAVE THE CONTRACTS GENERATED BY DEEPSEEK TO A CSV FILE**

In [None]:
df_deepseek = pd.DataFrame(deepseek_data, columns=['deepseek_prompt', 'deepseek_contract', 'deepseek_file_name'])

df_deepseek.to_csv('/content/deepseek.csv', index=False)

In [None]:
import shutil
from google.colab import files

shutil.make_archive('/content/deepseek_contracts', 'zip', '/content/deepseek_contracts')
files.download('deepseek_contracts.zip')
files.download('/content/deepseek.csv')


print("The contracts have been saved and zipped successfully.")

The contracts have been saved and zipped successfully.


## **COMBINED CSV DOWNLOAD**

In [None]:
import pandas as pd

# Load the two CSV files
gpt_df = pd.read_csv('/content/gpt.csv')
deepseek_df = pd.read_csv('/content/deepseek.csv')

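# Merge the two dataframes on the file name columns written by the two export cells
# ('file_name_gpt' in gpt.csv and 'deepseek_file_name' in deepseek.csv)
merged_df = pd.merge(gpt_df, deepseek_df, left_on='file_name_gpt', right_on='deepseek_file_name')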

# Save the merged dataframe to a new CSV file
merged_csv_path = '/content/generated_contracts.csv'
merged_df.to_csv(merged_csv_path, index=False)

print(f"Merged CSV file has been written to {merged_csv_path}")


Merged CSV file has been written to /content/generated_contracts.csv
