# Step Four: 2. Extract + classify Alloy "CPSP" information

## &#9679; "Composition-Processing-Structure-Performance" (CPSP) Information of Alloy
#### 1. **Letter "C" in "CPSP" (C)**: Alloy Composition, Alloy_Designation, and Alloy_Formula;
#### 2. **First letter "P" in "CPSP" (P1)**: Processing routes/Heat treatment (Operation actions, Conditions, and Corresponding parameters in each period);
#### 3. **Letter "S" in "CPSP" (S)**: Structure/Microstructure (Average Grain Size, Structure type (bcc/fcc/hcp), Strengthening Phase, Characteristics of Precipitation
#### (Finer/Uniformly), Grain Boundary (High-angle/Small-angle));
#### 4. **Second letter "P" in "CPSP" (P2)**: Performance/Properties (Mechanical Properties: σb, σs, δ, σbc, σbb, Young’s modulus, shear modulus, Poisson’s ratio,
#### flexural modulus, hardness, reduction of area, conductivity ...; Corrosion Properties: Ecorr, Icorr, KISCC, ISSRT ...; Fracture Properties: KIC, toughness,
#### fatigue strength, fracture mode (brittle/ductile) ...; Strengthening Mechanism: grain strengthening, dislocation strengthening, precipitation strengthening,
#### solid–solution strengthening).
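
The four CPSP categories above can be pictured as one record per alloy sample. The following is only a minimal sketch: the class name `CPSPRecord` and every field and key name are illustrative assumptions, not the schema produced by the prompts in 4.1.2, and the values are placeholders rather than real data.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class CPSPRecord:
    """One alloy sample (hypothetical layout; all names are assumptions)."""
    composition: dict = field(default_factory=dict)   # C: composition, designation, formula
    processing: list = field(default_factory=list)    # P1: ordered processing / heat-treatment steps
    structure: dict = field(default_factory=dict)     # S: grain size, structure type, phases
    performance: dict = field(default_factory=dict)   # P2: mechanical, corrosion, fracture properties

# Placeholder values only; "N/A" marks information that was not reported
record = CPSPRecord(
    composition={"Alloy_Designation": "example-alloy", "Alloy_Formula": "N/A"},
    processing=[{"operation": "aging", "temperature": "N/A", "time": "N/A"}],
    structure={"structure_type": "fcc"},
    performance={"mechanical_properties": {"hardness": "N/A"}},
)

# asdict() turns the record into plain dicts and lists that json.dump can write
record_dict = asdict(record)
```

Keeping the processing steps in a list preserves their order, which matters for multi-stage heat treatments.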

## &#9679; Extraction and Classification of Alloy "CPSP" Information from the Pre-Processed Alloy Corpus in 4.1
### (Multiple Publishers: Elsevier, Springer Nature, Royal Soc Chemistry)
### **Step 4.2.1 Merge and Clean Relevant Info output in Different directories**: 
#### **Aims**: Merge the output for the Abstract, Combined-Tables, Experimental methods, and Results and Discussions sections from 4.1 to get the **complete Alloy "CPSP" Info**.
#### **HOW TO USE**: Replace the input and output directory paths with your own. 
### **Step 4.2.2 Extract Complete "CPSP" Info by prompt engineering**: 
#### **Aims**: Sort the alloy composition, processing, microstructure, and performance of each sample into a separate JSON structure. 
#### **HOW TO USE**: Refer to 4.1.2, fill in your own API_key and base_url, and modify the prompts and examples based on your task.
### **Step 4.2.3 Convert "CPSP" Info of different alloys into tabular format**: 
#### **Aims**: Convert information in each JSON object from output in 4.2.2 into tabular format. 
#### **HOW TO USE**: Use the code in 4.2.3.
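
Step 4.2.3 turns each JSON object into one table row. If the objects are nested, one possible approach is to flatten them into dotted column names first. This is a sketch under that assumption: the helper `flatten` and the sample keys are hypothetical and are not part of the code in 4.2.3.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects and carry the parent key as a prefix
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

# Placeholder sample with hypothetical keys
sample = {
    "composition": "example",
    "performance": {"mechanical_properties": {"hardness": "N/A"}},
}
row = flatten(sample)
```

Each key of `row` can then serve as a column header, so samples with different nesting depth still line up in the same table.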

## 4.2.1 Merge and Clean

In [18]:
import os
import re
import chardet

def detect_encoding(file_path):
    # Read the raw data from the file
    with open(file_path, 'rb') as f:
        raw_data = f.read()
    # Detect the encoding of the file using chardet
    result = chardet.detect(raw_data)
    # chardet reports None when it cannot decide (e.g. for an empty file); fall back to UTF-8
    return result['encoding'] or 'utf-8'

def clean_and_merge_json_files(paths, output_path):
    # Get a set of all file names in the given paths that end with '.json'
    all_files = set()
    for path in paths:
        all_files.update(set([f for f in os.listdir(path) if f.endswith('.json')]))

    # Ensure the output directory exists
    os.makedirs(output_path, exist_ok=True)

    for filename in all_files:
        content_list = []
        
        try:
            # Iterate over each path and merge files with the same name
            for path in paths:
                file_path = os.path.join(path, filename)
                
                if os.path.exists(file_path):
                    # Detect the encoding of the file
                    encoding = detect_encoding(file_path)
                    # Open the file with detected encoding
                    with open(file_path, 'r', encoding=encoding, errors='ignore') as f:
                        content = f.read()
                        
                        # Clean the content by removing code-fence markers and normalizing whitespace
                        cleaned_content = re.sub(r"```json|```", "", content).strip()
                        cleaned_content = re.sub(r"\s+", " ", cleaned_content)  # Collapse whitespace runs (spaces, tabs, newlines) into one space
                        cleaned_content = cleaned_content.replace("\\n", "").replace("\t", " ")  # Drop literal "\n" escape sequences
                        print(cleaned_content)
                        
                        # Store the cleaned text as-is; it is assumed to be JSON or close to JSON
                        # and is not parsed or validated here
                        content_list.append(cleaned_content)

            # Write the merged content to the output file
            output_file = os.path.join(output_path, filename)
            with open(output_file, 'w', encoding='utf-8') as f:
                # Separate contents with newline characters to ensure each cleaned content is written
                f.write("\n".join(content_list))
            print(f"Processed and saved {filename} to {output_path}")

        except Exception as e:
            print(f"Failed to process {filename}: {e}")

# Example usage
paths = [
    "Data/PaperJson/Abstract",  # Abstract Folder
    "Data/PaperJson/Experimental-Method-Results",  # Experimental Folder
    "Data/CombinedTable/GPT-Processed-CombinedTable",  # Table Folder
    "Data/PaperJson/Results-Discussions-Results"  # Discussions Folder

]
output_path = "Data/PaperJson/JSON-All-Results" # Output Folder
clean_and_merge_json_files(paths, output_path)

{ "DOI": "10.1016/j.actamat.2005.02.010", "Journal": "Acta Materialia", "Keywords": [ "Nickel alloy", "Yield strength", "Precipitate free zones", "Transmission electron microscopy", "Computer simulations" ], "Sections": [], "Title": "The effects of a second aging treatment on the yield strength of γ′-hardened NIMONIC PE16-polycrystals having γ′-precipitate free zones", "abstract": "The nickel-base alloy NIMONIC PE16 is strengthened by coherent spherical nano-scale precipitates of the Ll(2)-long range ordered gamma'-phase. Along grain boundaries precipitate free zones (PFZs) may form, which lower the yield strength drastically. In this study it has been attempted to eliminate this softening effect by double aging treatments: the main gamma'-precipitate dispersion is grown at 1079 K and subsequently finer gamma'-particles are precipitated at 949 K - in between the larger 1079 K-gamma'-precipitates and in the PFZs formed at 1079 K. Unfortunately softening by PFZs could not be suppressed b
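
The merge step stores the cleaned text without parsing it, so a malformed fragment only surfaces later. A minimal sketch of a validity check with the standard `json` module follows; the helper name `split_valid_invalid` and the sample strings are assumptions made for illustration.

```python
import json

def split_valid_invalid(fragments):
    """Separate text fragments into parsed JSON values and unparseable strings."""
    valid, invalid = [], []
    for text in fragments:
        try:
            valid.append(json.loads(text))
        except json.JSONDecodeError:
            # Keep the raw text so it can be inspected or re-generated later
            invalid.append(text)
    return valid, invalid

# One well-formed fragment and one truncated fragment (placeholder data)
valid, invalid = split_valid_invalid(['{"DOI": "10.1000/example"}', '{"DOI": '])
```

Running such a check on each entry of `content_list` before writing would make it easy to list the files whose model output needs to be regenerated.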

## 4.2.2 Extract Complete "CPSP" Info by prompt engineering: 
#### Use the Code in 4.1.2.

## 4.2.3 Sort out "CPSP" Info of different alloys into tabular files: 

### 1. Combine Main_Idea Json

In [45]:
import os
import chardet
import re
import json
import shutil

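def detect_encoding(file_path):
    """Detect the file encoding"""
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
    # Fall back to UTF-8 when chardet cannot determine an encoding
    return result['encoding'] or 'utf-8'

def format_doi(filename):
    """Format the filename into a DOI"""
    # Assumption: only the first "-" stands for the "/" after the DOI prefix;
    # later hyphens belong to the suffix (e.g. 10.1007-s11665-021-05930-x)
    doi = filename.replace(".json", "").replace("-", "/", 1)
    return doi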

def combine_main_ideas(path1, output_file, move_path):
    """Combine all main idea JSONs from the files in the directory into one file"""
    # Store all main ideas
    all_main_ideas = []
    
    # Iterate through all files in the directory
    for filename in os.listdir(path1):
        if filename.endswith('.json'):
            file_path = os.path.join(path1, filename)
            
            # Detect the file encoding
            encoding = detect_encoding(file_path)
            
            # Read the file content
            with open(file_path, 'r', encoding=encoding) as f:
                content = f.readlines()
            
            # Remove lines containing a code-fence marker (``` or ```json)
            cleaned_lines = [line for line in content if "```" not in line]
            final_content = "".join(cleaned_lines)
            
            # Search and extract the content block containing "JSON for Main idea:"
            # (the non-greedy pattern stops at the first "}", so the block must not contain nested objects)
            main_idea_match = re.search(r'JSON for Main idea:\s*({.*?})', final_content, re.DOTALL)
            
            if main_idea_match:
                main_idea_content = main_idea_match.group(1)
                
                try:
                    # Convert the extracted content into a dictionary
                    main_idea_data = json.loads(main_idea_content)
                    
                    # Add the formatted DOI field at the first position
                    formatted_doi = format_doi(filename)
                    main_idea_data = {"DOI": formatted_doi, **main_idea_data}
                    
                    # Add to the list
                    all_main_ideas.append(main_idea_data)
                    
                    print(f"Extracted 'Main idea' from: {filename}")
                
                except json.JSONDecodeError:
                    print(f"Error decoding 'Main idea' JSON in file: {filename}")
            else:
                # Move the file to a new path, creating the destination directory if needed
                os.makedirs(move_path, exist_ok=True)
                shutil.move(file_path, os.path.join(move_path, filename))
                print(f"Moved '{filename}' to '{move_path}'")

    # Create the final dictionary containing all main ideas
    final_json = {
        "main_ideas": all_main_ideas
    }
    
    # Save the combined JSON file
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(final_json, f, ensure_ascii=False, indent=4)
    
    print(f"\nSuccessfully combined {len(all_main_ideas)} main ideas into: {output_file}")

def count_main_ideas(file_path):
    """Count the number of JSON objects with a complete structure"""
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    count = 0
    for main_idea in data['main_ideas']:
        if all(key in main_idea for key in ["DOI", "research_object", "processing_route", "performance", "main idea"]):
            count += 1
    
    return count

# Example usage
source_path  = 'C:/Users/UNSW/Desktop/Alloy/HT/SN/JSON-CPSP'
output_json = 'C:/Users/UNSW/Desktop/Alloy/HT/SN/Mainidea-Combined.json'
move_path = 'C:/Users/UNSW/Desktop/Alloy/HT/SN/CPSP-NoMainidea'  # Path for files without main idea
combine_main_ideas(source_path, output_json, move_path)

# Count the number of main ideas with a complete structure in the output JSON file
count = count_main_ideas(output_json)
print(f"Number of main ideas with complete structure: {count}")

Processed: 10.1016-j.ijplas.2004.03.00271.json
Processed: 10.1016-j.ijplas.2004.03.00271.json
Processed: 10.1016-j.ijplas.2004.03.00271.json
Processed: 10.1016-j.ijplas.2004.03.00222.json
Processed: 10.1016-j.ijplas.2004.03.00222.json
Processed: 10.1016-j.ijplas.2004.03.00222.json
Processed: 10.1016-j.ijplas.2004.03.005.json
Processed: 10.1016-j.ijplas.2004.03.005.json
Processed: 10.1016-j.ijplas.2004.03.005.json
All JSON blocks have been combined and saved to /Users/zixuanzhao/Desktop/Prompt+Example/AlloyGPT/Data/CustomizedTable/AllJSON.json
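
The pattern `{.*?}` used to locate the "Main idea" block stops at the first closing brace, so it cuts off any block that contains a nested object. One possible alternative, sketched here, is to let the standard library's `json.JSONDecoder.raw_decode` find the end of the object. The helper name `extract_json_after` and the sample text are assumptions.

```python
import json

def extract_json_after(text, marker):
    """Return the first JSON object that follows `marker`, or None if there is none."""
    pos = text.find(marker)
    if pos == -1:
        return None
    start = text.find("{", pos + len(marker))
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value starting at `start` and ignores what follows it
        obj, _end = json.JSONDecoder().raw_decode(text, start)
    except json.JSONDecodeError:
        return None
    return obj

# Placeholder text with a nested object and trailing content
text = 'JSON for Main idea: {"research_object": "example", "performance": {"hardness": "N/A"}} trailing text'
idea = extract_json_after(text, "JSON for Main idea:")
```

Because the decoder tracks braces and string quoting itself, nested objects and braces inside string values are handled without extra regex logic.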


### 2. Convert Main_Idea Json into tabular format

In [17]:
import json

# Read the JSON file
input_path = "/Users/zixuanzhao/Desktop/Prompt+Example/AlloyGPT/Prompt+Example/untitled folder2/All-CPSP-JSON2.json"
output_path = "/Users/zixuanzhao/Desktop/Prompt+Example/AlloyGPT/Prompt+Example/untitled folder2/All-CPSP-JSON2.txt"

with open(input_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

# Convert each JSON object to a tabular format in a single line
with open(output_path, 'w', encoding='utf-8') as f_out:
    for item in data["main_ideas"]:
        # Convert each key-value pair to the format "key|value"
        line = "|".join([f"{key}|{value}" for key, value in item.items()])
        # Write to the file, with each JSON object occupying one line
        f_out.write(line + "\n")

print(f"Conversion complete, output to file: {output_path}")

Conversion complete, output to file: /Users/zixuanzhao/Desktop/Prompt+Example/AlloyGPT/CombinedMainidea2.txt
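
Joining `key|value` pairs with a plain `"|".join(...)` becomes ambiguous as soon as a value itself contains `|`. A hedged alternative is to let the standard `csv` module do the quoting. The helper `to_pipe_row` and the sample item below are illustrative assumptions.

```python
import csv
import io

def to_pipe_row(item):
    """Serialize one dict as a pipe-delimited line of alternating keys and values."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
    row = []
    for key, value in item.items():
        row.extend([key, str(value)])
    # csv.writer quotes any field that contains the delimiter
    writer.writerow(row)
    return buffer.getvalue()

# Placeholder item whose last value contains the delimiter
line = to_pipe_row({"DOI": "10.1000/example", "main idea": "a|b"})

# Reading the line back with the same delimiter recovers the original fields
fields = next(csv.reader(io.StringIO(line), delimiter="|"))
```

The same `csv.reader` call can be used by whatever tool consumes the tabular file, so the quoting stays consistent on both sides.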


### 3. Combine CPSP Json

In [7]:
import os
import chardet
import re

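def detect_encoding(file_path):
    """Detect the file encoding"""
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
    # Fall back to UTF-8 when chardet cannot determine an encoding
    return result['encoding'] or 'utf-8'

def format_doi(filename):
    """Format the filename into a DOI"""
    # Assumption: only the first "-" stands for the "/" after the DOI prefix;
    # later hyphens belong to the suffix (e.g. 10.1007-s11665-021-05930-x)
    return filename.replace(".json", "").replace("-", "/", 1)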

def clean_and_format_json_blocks(path1, alljson_path):
    """Clean and format JSON blocks from files in the directory, then combine them into one file"""
    all_json_data = []  # Store all processed JSON blocks

    # Iterate through all files in the specified directory
    for filename in os.listdir(path1):
        if filename.endswith('.json'):
            file_path = os.path.join(path1, filename)

            # Detect the file encoding
            encoding = detect_encoding(file_path)

            # Read the file content
            with open(file_path, 'r', encoding=encoding) as f:
                content = f.read()

            # Remove the ```json and ``` fence markers (the rest of each line is kept)
            cleaned_content = re.sub(r'```json\s*|```', '', content)

            # Delete the "JSON for Main idea:" block and its content
            cleaned_content = re.sub(r'JSON for Main idea:\s*\{.*?\}', '', cleaned_content, flags=re.DOTALL)

            # Generate a formatted DOI string
            formatted_doi = format_doi(filename)

            # Add the DOI to each JSON block, ensuring a newline after the DOI
            modified_content = re.sub(
                r'(JSON for [^:]+:\s*\{)',
                rf'\1\n  "DOI": "{formatted_doi}",',
                cleaned_content
            )

            # Add the processed content to the list
            all_json_data.append(modified_content.strip())
            print(f"Processed: {filename}")

    # Combine all JSON data into a single file with an empty line between each block
    with open(alljson_path, 'w', encoding='utf-8') as f:
        f.write("\n\n".join(all_json_data))
    
    print(f"All JSON blocks have been combined and saved to {alljson_path}")

# Example usage
path1 = 'C:/Users/UNSW/Desktop/Alloy/HT/CPSP-Results/3'
alljson_path = 'C:/Users/UNSW/Desktop/Alloy/HT/CPSP-Results/All-CPSP-JSON.json'
clean_and_format_json_blocks(path1, alljson_path)

Processed: 10.1016:j.msea.2016.05.016.json
Processed: 10.1007-s11665-021-05930-x.json
Processed: 10.1007-s11665-021-05930-x copy.json
All JSON blocks have been combined and saved to /Users/zixuanzhao/Desktop/Prompt+Example/AlloyGPT/Prompt+Example/untitled folder2/All-CPSP-JSON.json
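
To make the DOI-injection step easier to follow, here is a small sketch of the same substitution applied to one block. The helper name `inject_doi` and the sample block are assumptions. A function is used as the replacement so that any backslash in the DOI is inserted literally instead of being read as a regex escape.

```python
import re

def inject_doi(text, doi):
    """Insert a "DOI" field as the first key of every 'JSON for <name>: {' block."""
    return re.sub(
        r'(JSON for [^:]+:\s*\{)',
        # group(1) is the block header up to and including the opening brace
        lambda m: f'{m.group(1)}\n  "DOI": "{doi}",',
        text,
    )

# Placeholder block in the layout the combine step expects
block = 'JSON for Sample A: {\n  "composition": "N/A"\n}'
result = inject_doi(block, "10.1000/example")
```

Every block in the file receives the same DOI because all blocks in one file come from the same paper.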


### 4. Clean Combined CPSP JSON

In [9]:
import re

def clean_json_format(alljson_path, cleaned_json_path):
    with open(alljson_path, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Remove leading spaces and newlines from DOI, and combine DOI and composition on the same line
    content = re.sub(
        r'\{\s*"DOI":\s*"([^"]+)",\s*"composition":\s*"([^"]+)"',
        r'{"DOI": "\1", "composition": "\2"',
        content
    )
    
    # Remove keys with value "N/A", including cases where there is no comma after the key-value pair
    content = re.sub(r',?\s*"[a-zA-Z0-9_’‘’\-\s]+":\s*"N/A"\s*', '', content)
    
    # Remove empty objects, such as "corrosion_properties": {}
    content = re.sub(r',?\s*"[a-zA-Z0-9_’‘’\-\s]+":\s*\{\s*\}', '', content)
    
    # Clean up extra commas, commas in empty curly braces, and excessive newlines
    content = re.sub(r'\{\s*,\s*', '{', content)
    content = re.sub(r',\s*\}', '}', content)
    content = re.sub(r'\n\s*\n', '\n', content)

    # Ensure each JSON field is on a separate line and clean up excess whitespace
    content = re.sub(r',\s*"', ',\n"', content)
    content = re.sub(r'{\s*"', '{\n"', content)
    content = re.sub(r'"\s*}', '"\n}', content)

    # Write the cleaned content to a new file
    with open(cleaned_json_path, 'w', encoding='utf-8') as f:
        f.write(content)

    print(f"Cleaned JSON content has been saved to {cleaned_json_path}")

# Updated paths
alljson_path = '/Users/zixuanzhao/Desktop/Prompt+Example/AlloyGPT/Prompt+Example/untitled folder2/All-CPSP-JSON.json'
cleaned_json_path = '/Users/zixuanzhao/Desktop/Prompt+Example/AlloyGPT/Prompt+Example/untitled folder2/All-CPSP-JSON2.json'
clean_json_format(alljson_path, cleaned_json_path)

Cleaned JSON content has been saved to /Users/zixuanzhao/Desktop/Prompt+Example/AlloyGPT/Prompt+Example/untitled folder2/All-CPSP-JSON2.json
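
When the blocks are valid JSON, the same clean-up (dropping "N/A" values and empty objects) can also be done on parsed data instead of with regular expressions. The following is a sketch under that assumption; the function name `prune` and the sample keys are hypothetical.

```python
def prune(value):
    """Recursively drop "N/A" values and empty dicts from nested data."""
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            item = prune(item)
            # Skip placeholders and objects that became empty after pruning
            if item == "N/A" or item == {}:
                continue
            cleaned[key] = item
        return cleaned
    if isinstance(value, list):
        return [prune(item) for item in value]
    return value

# Placeholder sample with hypothetical keys
sample = {
    "DOI": "10.1000/example",
    "composition": "N/A",
    "corrosion_properties": {},
    "performance": {"hardness": "N/A"},
}
pruned = prune(sample)
```

Working on parsed data avoids leftover commas, which the regex version has to repair in a separate pass.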


### 5. Convert Combined-CPSP.json into tabular file

In [39]:
# Input and output file paths
input_file = "/Users/zixuanzhao/Desktop/Prompt+Example/AlloyGPT/Prompt+Example/untitled folder2/CPSP-Formatted_Results3.json"  # Input file path
output_file = "/Users/zixuanzhao/Desktop/Prompt+Example/AlloyGPT/Prompt+Example/untitled folder2/CPSP-Formatted_Results4.json"  # Output file path

# Process the file
with open(input_file, "r", encoding="utf-8") as infile, open(output_file, "w", encoding="utf-8") as outfile:
    buffer = ""
    for line in infile:
        # Strip surrounding whitespace, including the trailing newline
        line = line.strip()
        if line:
            # If the line starts with "JSON for ...:", rewrite it as "#Sample Name# ..."
            if line.startswith("JSON for ") and ":" in line:
                # Replace "JSON for " with "#Sample Name# " and drop the trailing colon
                line = line.replace("JSON for ", "#Sample Name# ", 1)
                line = line.rstrip(":")
            
            # Replace '{"DOI": "' with '#DOI#'
            if '{"DOI": "' in line:
                line = line.replace('{"DOI": "', '#DOI#').replace('",', '#,', 1)  # Only the first '",' (the end of the DOI value) is replaced
                            
            # Replace '#, "Composition": "' with '#Composition#'
            if '#, "Composition": "' in line:
                line = line.replace('#, "Composition": "', '#Composition#')
            
            # Accumulate the content in the buffer
            buffer += line + " "
        else:
            # On a blank line, write the current buffer to the output file and reset it
            if buffer.strip():  # Skip empty buffers
                outfile.write(buffer.strip() + "\n")
                buffer = ""
    # Write any content remaining at the end of the file
    if buffer.strip():
        outfile.write(buffer.strip() + "\n")

print(f"Conversion complete, results saved to {output_file}")

Conversion complete, results saved to /Users/zixuanzhao/Desktop/Prompt+Example/AlloyGPT/Prompt+Example/untitled folder2/CPSP-Formatted_Results4.json


In [52]:
# Input and output file paths
input_file = "/Users/zixuanzhao/Desktop/Prompt+Example/AlloyGPT/Prompt+Example/untitled folder2/CPSP-Formatted_Results3.json"  # Input file path
output_file = "/Users/zixuanzhao/Desktop/Prompt+Example/AlloyGPT/Prompt+Example/untitled folder2/CPSP-Formatted_Results4.json"  # Output file path

# Process the file
with open(input_file, "r", encoding="utf-8") as infile, open(output_file, "w", encoding="utf-8") as outfile:
    buffer = ""
    for line in infile:
        # Strip surrounding whitespace, including the trailing newline
        line = line.strip()
        
        if line:
            # If the line starts with "JSON for ...:", rewrite it as "#Sample Name# ..."
            if line.startswith("JSON for ") and ":" in line:
                # Replace "JSON for " with "#Sample Name# " and drop the trailing colon
                line = line.replace("JSON for ", "#Sample Name# ", 1)
                line = line.rstrip(":")
                
            # Replace '{"DOI": "' with '#DOI#'
            if '{"DOI": "' in line:
                # Close the DOI value with '#' (first '",' only) and remove the space that follows it
                line = line.replace('{"DOI": "', '#DOI#').replace('",', '#', 1)
                line = line.replace("# ", "#")  # Remove any space left after a '#'

            # Replace '#"Composition": "' with '#Composition#'
            # (the comma and space after the DOI were already removed above)
            if '#"Composition": "' in line:
                line = line.replace('#"Composition": "', '#Composition#')
            
            # Accumulate the content in the buffer
            buffer += line + " "
        else:
            # On a blank line, write the current buffer to the output file and reset it
            if buffer.strip():  # Skip empty buffers
                outfile.write(buffer.strip() + "\n")
                buffer = ""
    # Write any content remaining at the end of the file
    if buffer.strip():
        outfile.write(buffer.strip() + "\n")

print(f"Conversion complete, results saved to {output_file}")

Conversion complete, results saved to /Users/zixuanzhao/Desktop/Prompt+Example/AlloyGPT/Prompt+Example/untitled folder2/CPSP-Formatted_Results4.json
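
The buffering logic in the cell above joins every blank-line-separated block into a single output line. As a sketch, the same idea can be written as a small function that works on a string, which makes it easy to try on a short sample before running it on the full file. The name `collapse_blocks` and the sample text are assumptions.

```python
def collapse_blocks(text):
    """Join each blank-line-separated block of `text` into one line."""
    lines_out, buffer = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line:
            buffer.append(line)
        elif buffer:
            # A blank line closes the current block
            lines_out.append(" ".join(buffer))
            buffer = []
    # Flush the last block if the text does not end with a blank line
    if buffer:
        lines_out.append(" ".join(buffer))
    return lines_out

# Two placeholder blocks separated by blank lines
demo = "first block, line 1\n  first block, line 2\n\n\nsecond block\n"
rows = collapse_blocks(demo)
```

The marker substitutions (`#Sample Name#`, `#DOI#`, `#Composition#`) can then be applied to each returned line independently.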
