# Step 4.1: Pre-Extract "CPSP" Information of Alloys from Papers

## &#9679; "Composition-Processing-Structure-Performance" (CPSP) Information of Alloy
#### 1. **Letter "C" in "CPSP" (C)**: Alloy Composition, Alloy_Designation, and Alloy_Formula; 
#### 2. **First letter "P" in "CPSP" (P1)**: Processing routes/Heat treatment (Operation actions, Conditions, and Corresponding parameters in each period);
#### 3. **Letter "S" in "CPSP" (S)**: Structure/Microstructure (Average Grain Size, Structure Type (bcc/fcc/hcp), Strengthening Phase, Characteristics of Precipitation
#### (Finer/Uniformly), Grain Boundary (High-angle/Small-angle));
#### 4. **Second letter "P" in "CPSP" (P2)**: Performance/Properties (Mechanical Properties: σb, σs, δ, σbc, σbb, Young’s modulus, shear modulus, Poisson’s ratio,
#### flexural modulus, hardness, reduction of area, conductivity ... Corrosion Properties: Ecorr, Icorr, KISCC, ISSRT ... Fracture Properties: KIC, toughness,
#### fatigue strength, fracture mode (brittle/ductile) ... Strengthening Mechanism: grain strengthening, dislocation strengthening, precipitation strengthening,
#### solid–solution strengthening).
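
One way to hold the four CPSP categories for a single alloy is a small record type. The sketch below is an illustration only: the class name `CPSPRecord`, its field names, and the sample values are hypothetical placeholders and are not taken from the project's code.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class CPSPRecord:
    """Hypothetical container for one alloy's CPSP information."""
    composition: dict = field(default_factory=dict)   # C: designation, formula, element contents
    processing: list = field(default_factory=list)    # P1: ordered steps with conditions/parameters
    structure: dict = field(default_factory=dict)     # S: grain size, structure type, phases
    performance: dict = field(default_factory=dict)   # P2: mechanical/corrosion/fracture properties


# All values below are placeholders chosen for illustration
record = CPSPRecord(
    composition={"Alloy_Designation": "example-alloy"},
    processing=[{"action": "rolling", "reduction": "40 pct."}],
    structure={"Structure type": "fcc"},
    performance={"Yield strength, MPa": "431 ± 0.6"},
)

# Serialize the record as JSON, keeping non-ASCII symbols such as "±" readable
print(json.dumps(asdict(record), indent=2, ensure_ascii=False))
```

Keeping P1 as a list preserves the order of the processing steps, which a plain dictionary of parameters would not make explicit.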

## &#9679; Extract "CPSP" Information in the Field of Metallurgy from Literature by Prompt Engineering
#### (Multiple Publishers: Elsevier, Springer Nature, Royal Society of Chemistry)
#### **Note:** 1. Information on Composition "C" and Processing "P1" is generally found in the "Experimental Methods" section.
#### **Note:** 2. Information on Structure "S" and Performance "P2" is generally found in the "Results and Discussions" section.
#### This code is set up for alloys; to process your own data, modify the prompts and examples in the "Prompt+Example" folder.

### **Step 4.1.1 Pre-Processing**: 
#### **Aims**: Get structured pieces of information from the original papers (JSON files) to reduce GPT usage cost/time and improve extraction accuracy as far as possible.
#### **HOW TO USE**: Change key_words according to the sections you want to extract. Example keyword lists are provided in the code as annotations enclosed in triple quotes (""").
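
The key_words lists in the code enumerate upper- and lower-case variants by hand. As a possible alternative, section names can be lower-cased before matching so that one spelling per keyword suffices. This is a minimal sketch, assuming each section is a dict with a `name` key as in the input JSON; the helper name `matches_keywords` and the sample sections are hypothetical.

```python
def matches_keywords(section_name, key_words):
    """Return True if any keyword occurs in the section name, ignoring case."""
    if not section_name:
        return False
    name = section_name.lower()
    return any(key_word.lower() in name for key_word in key_words)


# One spelling per keyword is enough because matching ignores case
experimental_key_words = ["experiment", "method", "material", "procedure", "heat treatment"]

# Made-up sections that mimic the structure of the input JSON
sections = [
    {"name": "Material and methods", "content": ["sample text"]},
    {"name": "Conclusions", "content": ["sample text"]},
]
selected = [s for s in sections if matches_keywords(s.get("name"), experimental_key_words)]
print([s["name"] for s in selected])  # ['Material and methods']
```

Because this is substring matching, short keywords such as "test" can also match unrelated titles, so the keyword list still needs to be chosen with the target sections in mind.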

### **Step 4.1.2 Pre-Extracting by prompt engineering**: 
#### **Aims**: Extract "C" and "P1" from the pre-extracted "Experimental Methods" sections by prompt engineering;
#### extract "S" and "P2" from the "Results and Discussions" sections by prompt engineering;
#### integrate Table-Info by prompt engineering.
#### **HOW TO USE**: The prompts used in this case are all provided in "Prompt+Example/Prompt.txt", and the examples for GPT are provided in "Prompt+Example/Example".
#### Fill in your own API_key and base_url, and modify the prompts and examples based on your task.
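
The few-shot request described above amounts to a list of chat messages: the system prompt, the worked examples, then the new input. The sketch below assembles such a list without calling any API. It assumes the example CSV has `Input` and `Output` columns, as in the code in section 4.1.2; the function name `build_messages`, the example numbering from 1, and the in-memory CSV contents are illustrative assumptions.

```python
import csv
import io


def build_messages(system_prompt, example_csv_file, new_input):
    """Assemble chat messages: system prompt, example I/O pairs, then the new input."""
    examples = ""
    for number, row in enumerate(csv.DictReader(example_csv_file), start=1):
        examples += f"### Example {number}:\nInput:\n{row['Input']}\nOutput:\n{row['Output']}\n"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": examples},
        {"role": "user", "content": new_input},
    ]


# In-memory stand-in for an example file (contents are made up for this demo)
demo_csv = io.StringIO('Input,Output\n"Table cell text","{""key"": ""value""}"\n')
messages = build_messages("You convert tables to JSON.", demo_csv, "New table text")
print(len(messages))           # 3
print(messages[1]["content"])
```

Building the message list in a separate function makes it possible to inspect the exact prompt before any tokens are spent on a request.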

## 4.1.1 Pre-Processing

In [3]:
import os
import json
import chardet  # Import chardet library for encoding detection

# Set input and output paths
input_path = 'Data/TextJSON'  # Input path
output_path = 'Data/ExperimentalMethod' # Output path

# Detect file encoding
def detect_file_encoding(file_path):
    with open(file_path, 'rb') as f:
        raw_data = f.read()
        result = chardet.detect(raw_data)
        return result['encoding']

def process_text_file(input_file_path, output_file_path):
    print(input_file_path)
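    # chardet returns None when it cannot guess the encoding; default to UTF-8 in that case
    detected_encoding = detect_file_encoding(input_file_path) or "utf-8"
    try:
        try:
            # Use the detected encoding to read the JSON file
            with open(input_file_path, "r", encoding=detected_encoding) as f:
                content = json.load(f)
        except UnicodeDecodeError:
            # The detected encoding was wrong: read the file again as UTF-8 before filtering
            print(f"Encoding {detected_encoding} failed for {input_file_path}; retrying with UTF-8.")
            with open(input_file_path, "r", encoding="utf-8") as f:
                content = json.load(f)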
        
        # Check and filter 'Sections'
        if isinstance(content.get("Sections"), list):
            filtered_sections = []
            for section in content["Sections"]:
                # print(section.get('name'))
                # print(section["content"])
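                # Skip sections whose content is missing or empty (indexing [0] would fail)
                if not section.get('content'):
                    continue
                if isinstance(section['content'][0],dict):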
                    # print("dict")
                    # print(section)
                    flag = False
                    """ 
                    # keywords for Experimental Methods 
                    for key_word in ["test", "Test", "tests", "Tests", "Procedures", "procedures", "procedure","Procedure", "heat-treatment", "Heat-treatment", 
                    "Heat treatment","heat treatment","experiment", "Experiment", "Experimental","experimental", "methodology", "method", "Methodology", "Method", 
                    "material", "processing", "Material", "Processing", "preparation", "Preparation", "process", "Process", "materials", "Materials", 
                    "sample", "samples", "Sample", "Samples"]:
                    
                    # keywords for Results & Discussions
                    for key_word in ["heat-treatment","Heat-treatment","heat treatment","Heat treatment", "results", "Results", "result", "Result", 
                    "discussions", "Discussions", "Discussion", "discussion", "microstructure", "Microstructure", "microstructural", 
                    "Microstructural", "analysis", "Analysis", "Analytical", "analytical", "performance", "Performance", "Properties", 
                    "properties", "Comparison", "comparison","Mechanism", "mechanism", "Mechanisms", "mechanisms", "nanostructure", 
                    "Nanostructure", "Characterization", "characterization"]:
                    
                    # keywords for Abstract
                    for key_word in ["Abstract", "abstract"]:
                    """
                    for key_word in ["XXX","YYY"]: ### keywords of the sections you want to extract
                        if key_word in section.get('name', ''):
                            filtered_sections.append(section)
                            flag = True
                            break  # stop at the first match so the section is added only once
                                    
                    if flag:
                        continue
                    for c in section['content']:
                        # print("c",c)
                        # print(c.get("name"))
                        """
                        # keywords for Experimental Methods 
                        for key_word in ["test", "Test", "tests", "Tests", "Procedures", "procedures", "procedure","Procedure", "heat-treatment", "Heat-treatment", 
                        "Heat treatment","heat treatment","experiment", "Experiment", "Experimental","experimental", "methodology", "method", "Methodology", "Method", 
                        "material", "processing", "Material", "Processing", "preparation", "Preparation", "process", "Process", "materials", "Materials", 
                        "sample", "samples", "Sample", "Samples"]:
                        # keywords for Results & Discussions
                        for key_word in ["heat-treatment","Heat-treatment","heat treatment","Heat treatment", "results", "Results", "result", "Result", 
                        "discussions", "Discussions", "Discussion", "discussion", "microstructure", "Microstructure", "microstructural", 
                        "Microstructural", "analysis", "Analysis", "Analytical", "analytical", "performance", "Performance", "Properties", 
                        "properties", "Comparison", "comparison","Mechanism", "mechanism", "Mechanisms", "mechanisms", "nanostructure", 
                        "Nanostructure", "Characterization", "characterization"]:
                        # keywords for Abstract
                        for key_word in ["Abstract", "abstract"]:
                        """
                        for key_word in ["XXX","YYY"]: ### keywords of the sections you want to extract
                            # Only dict items carry a 'name'; plain-string items are skipped
                            if isinstance(c, dict) and key_word in c.get('name', ''):
                                filtered_sections.append(c)
                                print(c)
                                break
                elif isinstance(section['content'][0],str):
                    # print("str")
                # if isinstance(section[0],dict)
                    # print(section)
                    # print(section.get("name"))
                    """
                    # keywords for Experimental Methods 
                    for key_word in ["test", "Test", "tests", "Tests", "Procedures", "procedures", "procedure","Procedure", "heat-treatment", "Heat-treatment", 
                    "Heat treatment","heat treatment","experiment", "Experiment", "Experimental","experimental", "methodology", "method", "Methodology", "Method", 
                    "material", "processing", "Material", "Processing", "preparation", "Preparation", "process", "Process", "materials", "Materials", 
                    "sample", "samples", "Sample", "Samples"]:
                    # keywords for Results & Discussions
                    for key_word in ["heat-treatment","Heat-treatment","heat treatment","Heat treatment", "results", "Results", "result", "Result", 
                    "discussions", "Discussions", "Discussion", "discussion", "microstructure", "Microstructure", "microstructural", 
                    "Microstructural", "analysis", "Analysis", "Analytical", "analytical", "performance", "Performance", "Properties", 
                    "properties", "Comparison", "comparison","Mechanism", "mechanism", "Mechanisms", "mechanisms", "nanostructure", 
                    "Nanostructure", "Characterization", "characterization"]:
                    # keywords for Abstract
                    for key_word in ["Abstract", "abstract"]:
                    """
                    for key_word in ["XXX","YYY"]: ### keywords of the sections you want to extract
                        if key_word in section.get('name', ''):
                            filtered_sections.append(section)
                            # print(section)
                            break
               
            print("filterd:",filtered_sections)
            content["Sections"] = filtered_sections
            
            # Ensure output directory exists
            os.makedirs(output_path, exist_ok=True)
            
            # Write to output file
            with open(output_file_path, "w", encoding="utf-8") as f:
                json.dump(content, f, indent=4, ensure_ascii=False)
            print(f"Processed and saved: {output_file_path}")
        else:
            print(f"Warning: 'Sections' in {input_file_path} is not a list. Skipping file.")
            
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON in {input_file_path}: {e}")
    except UnicodeDecodeError as e:
        # The file could not be decoded, so it is reported and skipped; no output is written
        print(f"Error with encoding {detected_encoding} in {input_file_path}: {e}. Skipping file.")
    except IOError as e:
        print(f"Error processing file {input_file_path}: {e}")
    except Exception as e:
        print(f"Unexpected error processing {input_file_path}: {e}")

def main():
    # Ensure output directory exists
    os.makedirs(output_path, exist_ok=True)
    
    # Process all JSON files in the folder
    print("Starting to process files...")
    processed_count = 0
    
    for file in os.listdir(input_path):
        if file.endswith(".json"):  # Process only JSON files
            input_file_path = os.path.join(input_path, file)
            output_file_path = os.path.join(output_path, file)
            process_text_file(input_file_path, output_file_path)
            processed_count += 1
    
    print(f"\nProcessing complete. Files processed: {processed_count}")

if __name__ == "__main__":
    main()

Starting to process files...
Data/TextJSON/10.1557-s43578-022-00691-2.json
filterd: [{'name': 'Material and methods', 'content': ["Pure metal sheets of titanium, zirconium, niobium, and tantalum (purity 99.5%, 99.2%, 99.9%, 99.95%, respectively) were purchased from The Nilaco Corporation. They were cut into circular plates of 3\xa0mm in diameter and 0.2\xa0mm thick. They were mechanically mirror-polished with alumina powder of 0.3\xa0µm and ultrasonically rinsed in acetone and ethanol. They were heated at 400–800\xa0°C for 3.6 ks in the air for oxidation. The surface morphology of the samples was observed with scanning electron microscopy (SEM, JXA-6300P, JEOL, Japan). The surface film formed by the heat treatment was examined by X-ray diffractometry (XRD) using a diffractometer (X’pert pro, PANalytical, Japan) operating at 45\xa0kV and 40\xa0mA. The average (arithmetical mean) surface roughness, Ra, and length measured along the surface roughness contour, L', were analyzed using a sur
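
Reading a file whose encoding is uncertain can be isolated in one helper that tries the detected encoding first and then UTF-8, so the filtering code only ever sees parsed content. The following is a sketch under that assumption; `load_json_with_fallback` is a hypothetical helper, not part of the original code, and the demo file contents are made up.

```python
import json
import os
import tempfile


def load_json_with_fallback(file_path, detected_encoding=None):
    """Parse a JSON file, trying the detected encoding first and UTF-8 second."""
    candidates = [enc for enc in (detected_encoding, "utf-8") if enc]
    last_error = None
    for encoding in candidates:
        try:
            with open(file_path, "r", encoding=encoding) as f:
                return json.load(f)
        except (UnicodeDecodeError, LookupError) as e:
            # Wrong or unknown encoding name: remember the error and try the next candidate
            last_error = e
    raise ValueError(f"Could not decode {file_path}") from last_error


# Demo: a UTF-8 file read with a wrong first guess ("ascii") still loads via the fallback
with tempfile.NamedTemporaryFile("w", suffix=".json", encoding="utf-8", delete=False) as tmp:
    json.dump({"Sections": [{"name": "Résultats", "content": ["5 µm"]}]}, tmp, ensure_ascii=False)
content = load_json_with_fallback(tmp.name, detected_encoding="ascii")
os.remove(tmp.name)
print(content["Sections"][0]["name"])
```

Malformed JSON still raises `json.JSONDecodeError` from this helper, so decoding problems and syntax problems remain distinguishable to the caller.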

## 4.1.2 Pre-Extracting by prompt engineering

In [14]:
import csv
import os
import json
from platform import system
import time
import chardet
from tqdm import tqdm
import openai
from openai import OpenAI
from utils import *


# -- global var start --
api_key = "..." ### Fill in API_key
base_url = "..." ### Fill in base_url
input_csv = r'Prompt+Example/Example/OneShot-Table.csv'  # CSV file path containing 'Input' and 'Output' columns
input_data_path = r'Data/Table-Combined'  # Directory containing new input data
output_path = r'Data/Table-Combine'  # Directory to save the results
n_folder_path = r'Data/Table-Combine'  # Folder to store files without 'Table' in the name
system_content = """ ###### """ # Fill in your prompt
system_instructions = {
    "role": "system",
    "content": system_content
}

client = OpenAI(
    api_key=api_key,
    base_url=base_url
)
# -- global var end --


# -- functions start --
# 1. create user content
def create_user_content():
    # Read the CSV file containing I/O pairs
    example_content = ''' ### ''' # Fill in your prompt
    with open(input_csv, mode='r') as file:
        csv_reader = csv.DictReader(file)
        # Collecting I/O pairs from CSV
        example_num = 11
        for row in csv_reader:
            ex1 = row["Input"]
            ex1_output = row["Output"]
            example_content += f"""### Example {example_num}:\nInput:\n{ex1}\nOutput:\n{ex1_output}\n"""
            example_num += 1
    return example_content


# 2. detect_file_encoding
def detect_file_encoding(file_path):
    with open(file_path, 'rb') as f:
        raw_data = f.read()
        result = chardet.detect(raw_data)
        return result['encoding']

# 3. get root dirs
def get_dirs(root_dir):
    dirs = os.listdir(root_dir)
    if '.DS_Store' in dirs:
        dirs.remove('.DS_Store')
    return dirs

# 4. detect_file_exist
def detect_file_exist(file_path):
    return os.path.exists(file_path)

# -- functions end --
def few_shot_from_csv(input_csv, input_data_path, output_path, api_key, n_folder_path):
    # Process new input data and get prediction
    file_list = os.listdir(input_data_path)  # Load input data
    flag = 1
    for file in file_list:
        print(f"current file: {file},index:{flag}")
        if detect_file_exist(os.path.join(output_path,file)):
            print(f" {os.path.join(output_path,file)} Exist")
            continue
        messages = [system_instructions]  # Add system instruction first
        user_instructions = {
            "role": "user",
            "content": create_user_content()
        }
        messages.append(user_instructions)

        # Maintain original file names
        input_file_path = os.path.join(input_data_path, file)
        output_file_name = os.path.splitext(file)[0] + '.json'
        output_file_path = os.path.join(output_path, output_file_name)
        encoding = detect_file_encoding(input_file_path)
        # Read file content
        with open(input_file_path, 'r', encoding=encoding, errors='ignore') as input_file:
            text = input_file.read()
            # print(text)
        print(f"Processing file: {file}")

        # Add the new input to be processed by the model
        messages.append({"role": "user", "content": text})
        print(f"submited {file}")
        response = client.chat.completions.create(
            model="gpt-4-1106-preview",
            messages=messages,
            temperature=0.3,
        #    frequency_penalty=0,
        #    presence_penalty=0
        #    max_completion_tokens = 8192
        )

        prediction = response.choices[0].message.content.strip()
        print(f"predicted {prediction}")
        send_num = 1
        max_retries = 3
        retries = 0
        while True:

            if "..." in prediction:
                send_num += 1
                print(f"Omit,ask again : {send_num}")
                if send_num >= 3:
                    messages = messages[0:-3]
                messages.append({"role": "assistant", "content": prediction})
                messages.append({"role": "user",
                                 "content": "Do not omit and Continue your output, each cell should be converted !!!"})

                if retries < max_retries:
                    try:
                        response = client.chat.completions.create(
                            model="gpt-4-1106-preview",
                            messages=messages,
                            temperature=0.3,
                        #    frequency_penalty=0,
                        #    presence_penalty=0,
                        #    max_completion_tokens = 8192
                        )
                        prediction = response.choices[0].message.content.strip()

                    except openai.InternalServerError as e:
                        retries += 1
                        print(f"Server error: {e}. Retrying... ({retries}/{max_retries})")
                        time.sleep(2 ** retries)  # exponential backoff delay
                        if retries == max_retries:
                            raise Exception("Maximum number of retries reached; the server may be overloaded.")
            else:
                # No "..." in the prediction: the output is complete and can be saved
                break
        # print(prediction)
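        # Create the output directory (including parents) if it does not exist yet
        os.makedirs(output_path, exist_ok=True)
        # Write as UTF-8 so that symbols such as "μ" and "±" are saved correctly on every platform
        with open(os.path.join(output_path,file), "w", encoding="utf-8") as f:
            f.write(prediction)
        print("Completed, saved as:", os.path.join(output_path, file))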

def main():
    """
    Main function to set up parameters, read configuration, and invoke the few-shot learning process.
    """
    # Ensure output and 'N' paths exist
    if not os.path.exists(output_path):
        os.makedirs(output_path)

    if not os.path.exists(n_folder_path):
        os.makedirs(n_folder_path)

    input_json_dirs = get_dirs(input_data_path)
    print("input_json_dirs:",input_json_dirs)
    for input_dir in input_json_dirs:
        print("input_dir:",input_dir)
        input_json_path = os.path.join(input_data_path, input_dir)
        output_json_path = os.path.join(output_path, input_dir)
        few_shot_from_csv(input_csv, input_json_path, output_json_path, api_key, n_folder_path)

    # Call the few-shot function to process the CSV and input data
    #few_shot_from_csv(input_csv, input_data_path, output_path, api_key, n_folder_path)

# Run the main function if this script is executed
if __name__ == "__main__":
    main()


input_json_dirs: ['1']
input_dir: 1
current file: 10.1016-j.msea.2018.12.103.json,index:1
Processing file: 10.1016-j.msea.2018.12.103.json
submited 10.1016-j.msea.2018.12.103.json
predicted JSON for Initial material: 
{
  "Scan step size, μm": "0.25",
  "Map area, μm^2": "400 × 400",
  "Number of pixels": "2,957,724",
  "Number of grains": "614",
  "Yield strength, MPa": "136 ± 1.0",
  "Ultimate tensile strength, MPa": "349 ± 2.0",
  "Ductility, %": "28 ± 1.0"
}

JSON for 30 pct. rolling: 
{
  "Scan step size, μm": "0.7",
  "Map area, μm^2": "400 × 400",
  "Number of pixels": "589,463",
  "Number of grains": "2514"
}

JSON for 40 pct. rolling: 
{
  "Scan step size, μm": "0.25",
  "Map area, μm^2": "300 × 300",
  "Number of pixels": "1,661,683",
  "Number of grains": "6532",
  "Yield strength, MPa": "431 ± 0.6",
  "Ultimate tensile strength, MPa": "478 ± 0.6",
  "Ductility, %": "7.5 ± 0.2",
  "Measured yield strength, MPa": "431",
  "Threshold stress, MPa": "110",
  "Calculated dislocat