### Goal: translate technical articles while leaving technical terms untranslated

#### idea:
1. CoT (chain of thought)
2. pydantic model: native_translation, revised_translation

The idea is to translate in multiple steps (a first-pass translation, then a revision) to improve the quality of the translation.

### schema

In [81]:
from pydantic import BaseModel, Field, field_validator
from typing import List, Literal

class OutputSchema(BaseModel):
    native_translation: str = Field(..., description="The native translation of the text")
    revised_translation: str = Field(..., description="The revised translation of the text")

LanguageType = Literal["English", "Japanese", "Chinese"]

class TranslatePayload(OutputSchema):
    source_language: LanguageType = Field(..., description="The original language of the text")
    target_language: LanguageType = Field(..., description="The target language for the translation of the text")

    @field_validator('target_language')
    @classmethod
    def validate_target_language(cls, value, info):
        # info.data holds the fields validated so far; source_language is declared first
        source_language = info.data.get('source_language')
        if source_language and value == source_language:
            raise ValueError("Target language should be different from the source language")
        return value

class TranslatePayloads(BaseModel):
    payloads: List[TranslatePayload]
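
A quick check of the validator, as a minimal sketch assuming pydantic v2. The models are re-declared in compact form so the snippet runs on its own; `try_payload` and the sample strings are hypothetical and only serve the illustration.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError, field_validator

LanguageType = Literal["English", "Japanese", "Chinese"]


class OutputSchema(BaseModel):
    native_translation: str = Field(..., description="First-pass translation")
    revised_translation: str = Field(..., description="Revised translation")


class TranslatePayload(OutputSchema):
    source_language: LanguageType
    target_language: LanguageType

    @field_validator("target_language")
    @classmethod
    def validate_target_language(cls, value, info):
        # info.data contains the fields that have already passed validation
        if info.data.get("source_language") == value:
            raise ValueError("Target language should be different from the source language")
        return value


def try_payload(source: str, target: str) -> bool:
    """Return True if the payload validates, False if pydantic rejects it."""
    try:
        TranslatePayload(
            native_translation="sample",  # placeholder text, not a real translation
            revised_translation="sample",
            source_language=source,
            target_language=target,
        )
        return True
    except ValidationError:
        return False


print(try_payload("English", "Japanese"))  # distinct, supported languages
print(try_payload("English", "English"))   # same language on both sides
print(try_payload("English", "Korean"))    # not part of LanguageType
```

Only the first call should validate: the second is rejected by the custom validator and the third by the `Literal` type itself.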

### parser

In [82]:
from langchain_core.output_parsers import JsonOutputParser

translate_output_parser = JsonOutputParser(pydantic_object=OutputSchema)
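
Conceptually, the parser's job is to pull a JSON object out of the model's reply and return it as a dict. The sketch below is a simplified standard-library stand-in, not LangChain's implementation; `extract_json`, `REQUIRED_KEYS`, and the sample reply are hypothetical, and the key check is an extra step added here for illustration.

```python
import json
import re

REQUIRED_KEYS = {"native_translation", "revised_translation"}

# Models often wrap JSON in a Markdown code fence; build the marker programmatically.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_json(reply: str) -> dict:
    """Parse a JSON object from a model reply, with or without a code fence."""
    match = FENCED_BLOCK.search(reply)
    payload = match.group(1) if match else reply
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


reply = FENCE + 'json\n{"native_translation": "a", "revised_translation": "b"}\n' + FENCE
print(extract_json(reply))
```

Note that `JsonOutputParser` returns a plain dict; validating that dict against `OutputSchema` would be a separate step.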

### prompt

In [83]:
class TranslationPrompt:
    SYSTEM_PROMPT = """Your job is to translate the data provide by a human user from {source_language} to {target_language}, focusing on translating as much as possible while carefully preserving technical context.

1.Proprietary Names & Jargon: Keep all brand names, product names, technical jargon, and acronyms in their original form. ex [EDR(Endpoint Detection and Response), SWG(Secure Web Gateway), TCP/IP(Transmission Control Protocol/Internet Protocol), PM(Product Manager), AI(Artificial Intelligence)]
2.Units of Measurement: Do not translate numerical units (e.g., GB, MB/s, GHz).
3.Code & Command-Line: Leave any code snippets, command-line instructions, or programming syntax unchanged.
4.Abbreviations & Short Forms: Keep all technical abbreviations (e.g., API, TCP/IP) without translation.
5.Consistency: Make sure technical accuracy is prioritized over natural fluency, especially for industry-specific terms.
6.Boolean Values & Data Types: Do not translate Boolean values (“True”/“False”) or NoneType (“None”).
7.Technical Context: For any ambiguous technical terms, keep the original text in parentheses for reference.

Try to translate the data as much as possible, but follow the above guidelines to ensure technical accuracy.

{format_instructions}

You will first generate a native_translation of the text, and then revise it to a revised_translation, ensure technical accuracy and context preservation.

Please ensure the output is formatted as specified, in JSON format.
"""

    HUMAN_PROMPT = """
{article}
"""

### create template

In [84]:
from langchain.prompts.chat import ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate
from langchain_core.prompts import PromptTemplate

system_prompt = TranslationPrompt.SYSTEM_PROMPT
human_prompt = TranslationPrompt.HUMAN_PROMPT

translation_prompt_tempalte = ChatPromptTemplate(
    [
        SystemMessagePromptTemplate(
            prompt=PromptTemplate(
                template=system_prompt,
                input_variables=["source_language", "target_language"],  # {article} belongs to the human prompt
                partial_variables={"format_instructions": translate_output_parser.get_format_instructions()}
            )
        ),
        HumanMessagePromptTemplate.from_template(human_prompt)
    ]
)

In [85]:
test = translation_prompt_tempalte.invoke({"source_language": "English", "target_language": "Japanese", "article": "This is a test article"})
print(test)

messages=[SystemMessage(content='Your job is to translate the data provide by a human user from English to Japanese, focusing on translating as much as possible while carefully preserving technical context.\n\n1.Proprietary Names & Jargon: Keep all brand names, product names, technical jargon, and acronyms in their original form. ex [EDR(Endpoint Detection and Response), SWG(Secure Web Gateway), TCP/IP(Transmission Control Protocol/Internet Protocol), PM(Product Manager), AI(Artificial Intelligence)]\n2.Units of Measurement: Do not translate numerical units (e.g., GB, MB/s, GHz).\n3.Code & Command-Line: Leave any code snippets, command-line instructions, or programming syntax unchanged.\n4.Abbreviations & Short Forms: Keep all technical abbreviations (e.g., API, TCP/IP) without translation.\n5.Consistency: Make sure technical accuracy is prioritized over natural fluency, especially for industry-specific terms.\n6.Boolean Values & Data Types: Do not translate Boolean values (“True”/“Fal

### chain

In [86]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

translation_chain = translation_prompt_tempalte | llm | translate_output_parser

In [87]:
test_article = "Artificial intelligence (AI) has rapidly evolved in recent years, becoming an integral part of various industries. From healthcare to finance, AI is revolutionizing how tasks are performed, improving efficiency and accuracy. In healthcare, AI is used for predictive diagnostics and personalized treatment plans, enabling doctors to provide better patient care. In the financial sector, AI algorithms help detect fraud and make data-driven investment decisions. As AI continues to advance, ethical considerations such as privacy and job displacement become increasingly important. Balancing innovation with these concerns will be key to harnessing AI’s full potential."

In [88]:
test_article_2 = """
In today’s threat landscape, cybersecurity pivots around concepts like EDR, SIEM, and SOC orchestration. EDR (Endpoint Detection and Response) focuses on real-time telemetry, leveraging XDR frameworks to aggregate data from diverse endpoints, thus providing heightened threat visibility. SIEM (Security Information and Event Management) platforms, like Splunk and ArcSight, enable SOC (Security Operations Center) teams to ingest, parse, and correlate logs, streamlining MTTR (Mean Time to Respond).

Attack vectors such as APTs (Advanced Persistent Threats), spear-phishing, and MITM (Man-in-the-Middle) attacks exploit vulnerabilities within an organization’s perimeter and layered defenses. Red and Blue Teams simulate these TTPs (Tactics, Techniques, and Procedures) to test and harden cybersecurity posture. MFA (Multi-Factor Authentication) and IAM (Identity and Access Management) remain pivotal in restricting unauthorized access.

The Zero Trust model, focusing on “never trust, always verify,” is crucial in modern architectures, especially with SASE (Secure Access Service Edge) deployments, integrating SWG (Secure Web Gateway) and CASB (Cloud Access Security Broker) functionalities. Encryption protocols, like TLS 1.3 and AES-256, enforce data confidentiality across untrusted networks.
"""

In [89]:
result = translation_chain.invoke({"source_language": "English", "target_language": "Chinese", "article": test_article})
for key, value in result.items():
    print(f"{key}: {value}")

native_translation: 人工智能(AI)在近年来迅速发展，成为各个行业不可或缺的一部分。从医疗到金融，AI正在彻底改变任务执行的方式，提高效率和准确性。在医疗领域，AI用于预测诊断和个性化治疗计划，使医生能够提供更好的病人护理。在金融行业，AI算法帮助检测欺诈并做出基于数据的投资决策。随着AI的不断进步，隐私和就业替代等伦理问题变得越来越重要。平衡创新与这些问题将是利用AI全部潜力的关键。
revised_translation: 人工智能(AI)在近年来迅速发展，成为各个行业不可或缺的一部分。从医疗到金融，AI正在彻底改变任务执行的方式，提高效率和准确性。在医疗领域，AI用于预测诊断和个性化治疗计划，使医生能够提供更好的患者护理。在金融行业，AI算法帮助检测欺诈并做出基于数据的投资决策。随着AI的不断进步，隐私和就业替代等伦理问题变得越来越重要。平衡创新与这些问题将是发挥AI全部潜力的关键。
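
The two outputs above are nearly identical, so it helps to list only what the revision step changed. A minimal sketch using the standard library's `difflib`; `changed_spans` is a hypothetical helper, and the sample strings are a short excerpt of the output above.

```python
import difflib


def changed_spans(native: str, revised: str) -> list[tuple[str, str]]:
    """Return (before, after) pairs for every span the revision changed."""
    matcher = difflib.SequenceMatcher(a=native, b=revised, autojunk=False)
    return [
        (native[i1:i2], revised[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


native = "提供更好的病人护理"
revised = "提供更好的患者护理"
print(changed_spans(native, revised))  # → [('病人', '患者')]
```

With the chain's result dict, the call would be `changed_spans(result["native_translation"], result["revised_translation"])`.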


In [90]:
result = translation_chain.invoke({"source_language": "English", "target_language": "Chinese", "article": test_article_2})
for key, value in result.items():
    print(f"{key}: {value}")
    print("======================================")

native_translation: 在今天的威胁环境中，网络安全围绕着像EDR、SIEM和SOC编排这样的概念进行转变。EDR（端点检测与响应）专注于实时遥测，利用XDR框架从不同的端点聚合数据，从而提供增强的威胁可见性。SIEM（安全信息与事件管理）平台，如Splunk和ArcSight，使SOC（安全运营中心）团队能够摄取、解析和关联日志，从而简化MTTR（平均响应时间）。

攻击向量，如APTs（高级持续威胁）、鱼叉式网络钓鱼和MITM（中间人）攻击，利用组织周边和分层防御中的漏洞。红队和蓝队模拟这些TTPs（战术、技术和程序），以测试和增强网络安全态势。MFA（多因素认证）和IAM（身份和访问管理）在限制未经授权的访问中仍然至关重要。

零信任模型，专注于“永不信任，总是验证”，在现代架构中至关重要，特别是在SASE（安全接入服务边缘）部署中，集成了SWG（安全网络网关）和CASB（云访问安全代理）功能。加密协议，如TLS 1.3和AES-256，在不受信任的网络上强制执行数据机密性。
revised_translation: 在今天的威胁环境中，网络安全围绕着像EDR、SIEM和SOC编排这样的概念进行转变。EDR（Endpoint Detection and Response）专注于实时遥测，利用XDR框架从不同的端点聚合数据，从而提供增强的威胁可见性。SIEM（Security Information and Event Management）平台，如Splunk和ArcSight，使SOC（Security Operations Center）团队能够摄取、解析和关联日志，从而简化MTTR（Mean Time to Respond）。

攻击向量，如APTs（Advanced Persistent Threats）、鱼叉式网络钓鱼和MITM（Man-in-the-Middle）攻击，利用组织周边和分层防御中的漏洞。红队和蓝队模拟这些TTPs（Tactics, Techniques, and Procedures），以测试和增强网络安全态势。MFA（Multi-Factor Authentication）和IAM（Identity and Access Management）在限制未经授权的访问中仍然至关重要。

零信任模型，专注于“永不信任，总是