# Machine Translation: Polish to English
This notebook demonstrates automatic translation of Polish text to English using a pre-trained transformer model from Hugging Face.
## Overview

- Model: Helsinki-NLP/opus-mt-pl-en (Marian MT)
- Task: Translate Polish sentences from a MasterChef transcript to English
- Framework: Transformers library with PyTorch backend

## Use Case
This translation is part of a larger NLP pipeline for emotion classification, where we need English translations to complement Polish text features.

In [None]:
# !pip install datasets
# !pip install transformers
# !pip install sentencepiece
# !pip install transformers[torch]
# !pip install sacrebleu
# !pip install evaluate
# !pip install accelerate -U
# !pip install gradio 
# !pip install kaleido cohere  openai tiktoken typing-extensions==4.5.0

# Dataset Exploration
Before translating, we load a parallel English-Polish dataset to see how translation pairs are structured. Each record holds a `translation` dictionary keyed by language code (`en`, `pl`), which is the input/output shape we expect for our own translation task.

Dataset source: Gregniuki/english-polish-idioms (https://huggingface.co/datasets/Gregniuki/english-polish-idioms)

In [7]:
from datasets import load_dataset
dataset = load_dataset("Gregniuki/english-polish-idioms")
print(dataset['train'][0]) 

{'translation': {'en': 'Hello, how are you?', 'pl': 'Cześć, jak się masz?'}}
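The record structure printed above can be unpacked into (source, target) pairs. The following is a minimal sketch: `to_pairs` and `sample_records` are illustrative names, not part of the `datasets` library, and the sample record simply copies the output shown above.

```python
# Records in the shape printed above: {'translation': {'en': ..., 'pl': ...}}
sample_records = [
    {"translation": {"en": "Hello, how are you?", "pl": "Cześć, jak się masz?"}},
]


def to_pairs(records, src="pl", tgt="en"):
    """Return (source, target) tuples from records in the 'translation' format."""
    return [(r["translation"][src], r["translation"][tgt]) for r in records]


pairs = to_pairs(sample_records)
print(pairs)
```

The same helper should work on rows read from `dataset['train']`, assuming every row follows the structure of the first one.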


# Model and Tokenizer Loading
Loading the pre-trained Marian MT model for Polish-to-English translation.
## Model Details:

- Architecture: Marian Neural Machine Translation (MarianMT)
- Training data: OPUS collection of parallel corpora
- Direction: Polish → English (pl-en)
- Performance: good results on our transcript (see the summary at the end of this notebook); no formal benchmark was run here

## Why This Model:

- Trained specifically on Polish-English sentence pairs
- Good performance on conversational text
- Small enough for reasonably fast inference

In [8]:
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-pl-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
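One way to check the "reasonable size" point is to count the model's parameters. This is a small sketch, not something the original notebook ran: `count_parameters` is a hypothetical helper name, and it relies only on the standard PyTorch `parameters()` and `numel()` methods.

```python
def count_parameters(model):
    """Total number of parameters in a PyTorch module."""
    return sum(p.numel() for p in model.parameters())
```

With the model loaded above, `print(f"{count_parameters(model) / 1e6:.1f}M parameters")` would report the size in millions.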




# Transcript Translation Pipeline
## Process Overview:

1. Load Polish MasterChef transcript from Excel file
2. For each Polish sentence:
   - Tokenize using Marian tokenizer
   - Generate English translation using pre-trained model
   - Decode output tokens back to text
3. Add translations as new column
4. Save enhanced dataset

In [12]:
import pandas as pd
import torch


df = pd.read_excel(r"C:\Users\zosia\Documents\GitHub\fae2-nlpr-group-group-17-1\Task 7\STT_Assembly.xlsx") 

translations = []


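# Translate one sentence at a time
for sent in df['Sentence']:
    # Tokenize; truncation guards against sentences longer than the model's input limit
    inputs = tokenizer(sent, return_tensors='pt', truncation=True, padding=True)
    outputs = model.generate(**inputs)  # unpack input_ids and attention_mask as keyword arguments
    translated = tokenizer.decode(outputs[0], skip_special_tokens=True)  # token ids back to text
    translations.append(translated)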

df['Translation'] = translations

df.head()


| Unnamed: 0 | Sentence | S | I | D | Translation |
|---|---|---|---|---|---|
| 0 | To jest MasterChef. | 0.0 | 0.0 | 0.0 | This is MasterChef. |
| 1 | Szansę na tytuł najlepszego kucharza w Polsce ... | 0.0 | 0.0 | 0.0 | Only 12 people have a chance of being the best... |
| 2 | Oto oni. | 0.0 | 0.0 | 0.0 | There they are. |
| 3 | "Jestem dinozaurem, który chce walczyć, który ... | 0.0 | 0.0 | 0.0 | "I'm a dinosaur who wants to fight, who won't ... |
| 4 | Ten program daje mi ogromną siłę. | 0.0 | 0.0 | 0.0 | This program gives me great strength. |
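The loop above translates one sentence per `generate` call. For longer transcripts, batching several sentences per call is a common alternative. The sketch below is an assumption about how that could look, not what the notebook ran: `translate_in_batches` and `batch_size` are hypothetical names, while the tokenizer call, `generate`, and `batch_decode` are standard Transformers methods.

```python
def translate_in_batches(sentences, tokenizer, model, batch_size=16):
    """Translate a list of strings in batches; returns one translation per input."""
    translations = []
    for start in range(0, len(sentences), batch_size):
        batch = list(sentences[start:start + batch_size])
        # padding=True pads every sentence in the batch to the longest one
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs)
        # batch_decode converts each row of token ids back to text
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

It could replace the loop with `df['Translation'] = translate_in_batches(df['Sentence'].tolist(), tokenizer, model)`, assuming the `Sentence` column contains only strings.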


In [None]:
df = df.drop(columns=['S', 'I','D'])
df.to_excel("STT_Assembly_translated.xlsx", index=False)

# Summary of Translation Model Findings

The translations are mostly good. The errors we found fall into two main types.

## Types of Mistakes:

Speech-to-Text Errors: Most problems originate in the Polish source text rather than in the translation step. When the speech-to-text output contains a wrong word, the model translates that word faithfully, so the error carries over into the English.

Multiple Meaning Words: Some mistakes involve Polish words with more than one meaning. For example, "kropka" can mean "dot" or "period", and the model sometimes picks the sense that does not fit the context.

## Overall Quality:

The translations are grammatically correct: the model handles Polish grammar well and produces well-formed English sentences.

## Main Problems:

Translation quality depends mostly on the accuracy of the Polish input, with word-sense ambiguity as a secondary source of errors. Given clean Polish text, the model translates well.
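Since most errors trace back to the speech-to-text step, it may help to flag rows with known transcription errors before reviewing their translations. The sketch below assumes that the `S`, `I`, and `D` columns hold substitution, insertion, and deletion counts from the speech-to-text step, which the notebook does not state. `flag_stt_errors`, `df_demo`, and the `stt_error` column are hypothetical names, and the demo values are made up for illustration.

```python
import pandas as pd

# Illustrative rows only; the S/I/D values here are made up for the example.
df_demo = pd.DataFrame({
    "Sentence": ["To jest MasterChef.", "Oto oni."],
    "S": [0.0, 1.0],
    "I": [0.0, 0.0],
    "D": [0.0, 0.0],
    "Translation": ["This is MasterChef.", "There they are."],
})


def flag_stt_errors(df, error_cols=("S", "I", "D")):
    """Return a copy with a boolean 'stt_error' column: True if any error count is non-zero."""
    flagged = df.copy()
    flagged["stt_error"] = flagged[list(error_cols)].fillna(0).sum(axis=1) > 0
    return flagged


flagged = flag_stt_errors(df_demo)
print(flagged[["Sentence", "stt_error"]])
```

On the real data this would need to run before the `S`, `I`, and `D` columns are dropped.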
 
 