In [1]:
import os
from transformers import MarianMTModel, MarianTokenizer, PegasusForConditionalGeneration, PegasusTokenizer

1. model_name_translation = "Helsinki-NLP/opus-mt-zh-en"

	•	Purpose:
	•	Specifies the name of the translation model.
	•	"Helsinki-NLP/opus-mt-zh-en" is a model published by the Helsinki-NLP group on the Hugging Face Hub.
	•	opus-mt-zh-en indicates that the model is trained to translate from Chinese (zh) to English (en).
	•	Why?:
	•	This string will be passed to Hugging Face’s from_pretrained method to load the pre-trained model and tokenizer.
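
The naming convention described above can be made concrete with a small sketch. The helper `parse_opus_mt_name` is hypothetical (it is not part of Transformers) and assumes the simple `opus-mt-{source}-{target}` pattern; names with extra segments or a different prefix return None.

```python
import re

# Hypothetical helper for illustration; not part of the Transformers library.
# Assumes the simple "opus-mt-{source}-{target}" naming pattern.
_OPUS_MT_PATTERN = re.compile(r"^opus-mt-([A-Za-z_]+)-([A-Za-z_]+)$")


def parse_opus_mt_name(model_name):
    """Return (source_lang, target_lang) for a simple opus-mt name, else None."""
    # Drop the organisation prefix ("Helsinki-NLP/") if present
    repo_name = model_name.split("/")[-1]
    match = _OPUS_MT_PATTERN.match(repo_name)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_opus_mt_name("Helsinki-NLP/opus-mt-zh-en"))  # ('zh', 'en')
```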

2. tokenizer_translation = MarianTokenizer.from_pretrained(model_name_translation)

	•	What It Does:
	•	Initializes a tokenizer for the MarianMT (Marian Machine Translation) model.
	•	MarianTokenizer is a tokenizer designed specifically for MarianMT models. It:
	1.	Encodes Input Text: Converts Chinese text into token IDs that the model understands.
	2.	Decodes Output Text: Converts the model’s output token IDs back into readable English text.
	•	The from_pretrained(model_name_translation) method loads the tokenizer’s vocabulary and configuration for this specific model (Helsinki-NLP/opus-mt-zh-en) from the Hugging Face model hub.
	•	Why?:
	•	Tokenization is a required preprocessing step for any NLP model. For translation, the tokenizer ensures:
	•	Chinese text is split into the same subword units the model saw during training.
	•	The model’s output token IDs are mapped back to English text with the matching target vocabulary.


3. model_translation = MarianMTModel.from_pretrained(model_name_translation)

	•	What It Does:
	•	Loads the pre-trained MarianMT model for Chinese-to-English translation.
	•	MarianMTModel is a class specifically designed for MarianMT models in the Hugging Face Transformers library. It provides:
	•	The encoder-decoder architecture required for translation tasks.
	•	The model weights trained on Chinese-to-English translation data.
	•	Why?:
	•	The from_pretrained method ensures that the exact architecture and weights corresponding to the Helsinki-NLP/opus-mt-zh-en model are loaded, so it can perform translations accurately.

In [None]:
# Define the model name
model_name_translation = "Helsinki-NLP/opus-mt-zh-en"

# Load the tokenizer and model
tokenizer_translation = MarianTokenizer.from_pretrained(model_name_translation)
model_translation = MarianMTModel.from_pretrained(model_name_translation)

tokenizer_config.json:   0%|          | 0.00/44.0 [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/805k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/807k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.62M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.39k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/312M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

MarianMTModel.from_pretrained(model_name_translation) downloads the model weights (the `pytorch_model.bin` file) the first time it runs, stores them locally on your computer, and loads them into memory.

When you download a pre-trained model through Hugging Face's Transformers library, the files are cached in your home directory under `~/.cache/huggingface/hub/` (older versions of Transformers used `~/.cache/huggingface/transformers/`). Within this directory, a subdirectory is created for each model repository, for example `models--Helsinki-NLP--opus-mt-zh-en`.

If you specify a custom path when downloading or saving the model (e.g., model.save_pretrained('custom_path')), the file will be stored in that directory.

Once downloaded, the file remains on your disk and is reused whenever you load the model again, unless you manually delete it.

Managing Storage:

- Clear Cache: You can delete unused models from the cache if you need to free up space:
`huggingface-cli delete-cache`

- Move to External Drive: You can also move the cache to an external drive or another location by setting the HF_HOME environment variable before starting Python (older versions of Transformers read TRANSFORMERS_CACHE instead):
`export HF_HOME=/path/to/your/custom/location`
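
To check where a given model would be cached, the lookup can be sketched with the standard library alone. This is a simplified approximation of the layout used by recent versions of `huggingface_hub`, not the library's own code: the helper names `hub_cache_dir` and `model_cache_folder` are hypothetical, and the sketch assumes only the `HF_HUB_CACHE` and `HF_HOME` variables matter (other overrides such as `XDG_CACHE_HOME` are ignored).

```python
import os
from pathlib import Path


def hub_cache_dir(env=None):
    """Resolve the cache root from environment variables (simplified assumption)."""
    env = os.environ if env is None else env
    # An explicit cache location takes priority
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    # Otherwise the cache is the "hub" folder inside the Hugging Face home
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "hub"


def model_cache_folder(repo_id, env=None):
    """Return the folder expected to hold a model repository's cached files."""
    # "Helsinki-NLP/opus-mt-zh-en" -> "models--Helsinki-NLP--opus-mt-zh-en"
    return hub_cache_dir(env) / ("models--" + repo_id.replace("/", "--"))


folder = model_cache_folder("Helsinki-NLP/opus-mt-zh-en")
print(f"Expected cache folder: {folder} (exists: {folder.exists()})")
```

Passing a dictionary as `env` lets you preview the effect of a setting such as `HF_HOME` without changing your shell environment.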


In [3]:
# Example text to translate (Chinese)
text_to_translate = "你好，世界！"  # "Hello, world!"

# Preprocess: Tokenize the text
input_tokens = tokenizer_translation(text_to_translate, return_tensors="pt", padding=True, truncation=True)

# Translate: Generate predictions
translated_tokens = model_translation.generate(**input_tokens)

# Postprocess: Decode the output tokens to readable English
translated_text = tokenizer_translation.decode(translated_tokens[0], skip_special_tokens=True)

print(f"Translated text: {translated_text}")

Translated text: Hello, world!


In [None]:
# Input and output file paths
input_file = "chinese_text.txt"  # File containing Chinese text (one line per sentence)
output_file = "translated_text.txt"  # File to save the English translations

# Read the input file and translate each line
try:
    with open(input_file, "r", encoding="utf-8") as infile, open(output_file, "w", encoding="utf-8") as outfile:
        for line in infile:
            # Remove leading/trailing whitespaces
            line = line.strip()

            # Skip empty lines
            if not line:
                continue

            # Tokenize the input text
            input_tokens = tokenizer_translation(line, return_tensors="pt", padding=True, truncation=True)

            # Generate translation
            translated_tokens = model_translation.generate(**input_tokens)

            # Decode the translated tokens to English text
            translated_text = tokenizer_translation.decode(translated_tokens[0], skip_special_tokens=True)

            # Write the translated text to the output file
            outfile.write(translated_text + "\n")

    print(f"Translation completed. Translated text saved to '{output_file}'.")
except Exception as e:
    print(f"An error occurred: {e}")
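
The loop above calls `generate` once per line. For larger files, grouping several lines into one call is usually faster. A minimal sketch, assuming a batch size of 8 is acceptable for your memory budget; `iter_batches` and `translate_batch` are hypothetical helper names, and the tokenizer and model are passed in as arguments so the helpers work with the objects loaded earlier.

```python
def iter_batches(lines, batch_size=8):
    """Yield lists of up to batch_size stripped, non-empty lines."""
    batch = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines, as in the loop above
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


def translate_batch(batch, tokenizer, model):
    """Translate a list of sentences with a single generate call."""
    inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Usage inside the `with open(...)` block, with the objects loaded earlier:
# for batch in iter_batches(infile):
#     for text in translate_batch(batch, tokenizer_translation, model_translation):
#         outfile.write(text + "\n")
```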

In [None]:
# Define the model name
model_name_summary = "google/pegasus-xsum"

# Load the tokenizer and model
tokenizer_summary = PegasusTokenizer.from_pretrained(model_name_summary)
model_summary = PegasusForConditionalGeneration.from_pretrained(model_name_summary)

tokenizer_config.json:   0%|          | 0.00/87.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.52M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.39k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/259 [00:00<?, ?B/s]
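
The cell above only loads the summarization model. A minimal sketch of how it might be applied to the translated English text follows, mirroring the tokenize, generate, and decode steps used for translation. `summarize` is a hypothetical helper, and `max_length=60` is an arbitrary illustrative value rather than a recommended setting.

```python
def summarize(text, tokenizer, model, max_length=60):
    """Summarize English text with a seq2seq tokenizer/model pair."""
    # Preprocess: tokenize, truncating input that exceeds the model's limit
    inputs = tokenizer(text, return_tensors="pt", truncation=True)

    # Generate the summary token IDs
    summary_ids = model.generate(**inputs, max_length=max_length)

    # Postprocess: decode the first (only) sequence to readable text
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)


# Usage, assuming the objects loaded in the cells above:
# summary = summarize(translated_text, tokenizer_summary, model_summary)
# print(f"Summary: {summary}")
```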

In [6]:
# config.architectures lists the model class name, not the cache location
print(f"Model architecture: {model_translation.config.architectures}")

Model architecture: ['MarianMTModel']
