# Google Translation Model using BERT (Spanish to English)

This guide outlines the process of building a translation model to convert Spanish sentences into English using a fine-tuned BERT model. The steps include dataset preparation, preprocessing, model fine-tuning, evaluation, and inference.

## Dataset Preparation
First, gather a dataset of Spanish-English translation pairs, where each Spanish sentence is matched with its English translation.
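
As a rough illustration of the pair format, the sketch below reads tab-separated Spanish-English pairs and drops incomplete or duplicate rows. The tab-separated layout, the `load_pairs` helper, and the sample sentences are assumptions made for this example; they are not part of the original guide.

```python
import csv
import io

def load_pairs(tsv_file):
    """Read (spanish, english) pairs from a tab-separated file object.

    Rows without exactly two non-empty fields, and exact duplicates, are skipped.
    """
    pairs = []
    seen = set()
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue
        spanish, english = (field.strip() for field in row)
        if not spanish or not english:
            continue
        if (spanish, english) in seen:
            continue
        seen.add((spanish, english))
        pairs.append((spanish, english))
    return pairs

# Hypothetical in-memory sample standing in for a real corpus file.
sample = io.StringIO(
    "Hola, ¿cómo estás?\tHello, how are you?\n"
    "Gracias.\tThank you.\n"
    "Gracias.\tThank you.\n"  # duplicate row, dropped
    "Buenos días.\t\n"         # missing translation, dropped
)
pairs = load_pairs(sample)
print(pairs)
```

With a real corpus, the same function could be called on a file opened with `open(path, encoding="utf-8", newline="")`.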

## Preprocessing
Sentences in both languages must be tokenized and formatted for BERT input, including handling special tokens like `[CLS]`, `[SEP]`, and `[MASK]`.
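
To make the input layout concrete, here is a minimal sketch of how one tokenized sentence is wrapped for BERT: `[CLS]` at the start, `[SEP]` at the end, then `[PAD]` up to a fixed length, with an attention mask marking the real tokens. The `format_for_bert` helper and the whitespace split are stand-ins chosen for illustration. In practice the WordPiece tokenizer adds these tokens itself.

```python
def format_for_bert(tokens, max_len):
    """Wrap tokens as [CLS] ... [SEP], then pad (or truncate) to max_len."""
    if max_len < 2:
        raise ValueError("max_len must leave room for [CLS] and [SEP]")
    # Reserve two positions for [CLS] and [SEP]; truncate the rest if needed.
    body = tokens[: max_len - 2]
    sequence = ["[CLS]"] + body + ["[SEP]"]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(sequence)
    padding = max_len - len(sequence)
    sequence += ["[PAD]"] * padding
    attention_mask += [0] * padding
    return sequence, attention_mask

# Whitespace split is only a stand-in for WordPiece tokenization.
tokens = "Hola , ¿ cómo estás ?".split()
sequence, attention_mask = format_for_bert(tokens, max_len=10)
print(sequence)
print(attention_mask)
```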

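```python
from transformers import BertTokenizerFast

# Multilingual cased BERT tokenizer; its vocabulary covers both Spanish and English
tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-cased')
```

## Inference
The example below translates a single sentence with the pretrained MarianMT model `Helsinki-NLP/opus-mt-es-en`.

```python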
from transformers import MarianMTModel, MarianTokenizer

src_text = 'Hola, ¿cómo estás?'

# Load the tokenizer
tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-es-en')

# Load the model
model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-es-en')

# Tokenize the text
inputs = tokenizer(src_text, return_tensors="pt", padding=True)

# Generate the translation
translated = model.generate(**inputs)

# Decode the output
translated_text = tokenizer.batch_decode(translated, skip_special_tokens=True)

print("Spanish:", src_text)
print("English Translation:", translated_text[0])

```

Output:

```text
Spanish: Hola, ¿cómo estás?
English Translation: Hey, how are you?
```

The same model can be served through a small Flask app that keeps the translation history in the session:

```python
from flask import Flask, request, render_template, session, redirect, url_for
from transformers import MarianMTModel, MarianTokenizer

app = Flask(__name__)
app.secret_key = 'your_secret_key'  # Required to use sessions; replace the placeholder with a random secret

# Load tokenizer and model
tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-es-en')
model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-es-en')

@app.route('/', methods=['GET', 'POST'])
def index():
    if 'history' not in session:
        session['history'] = []  # Initialize conversation history

    if request.method == 'POST':
        src_text = request.form.get('text', '').strip()  # Get text from form; empty string if the field is missing
        if src_text:  # Skip empty or whitespace-only submissions
            translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
            translated_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated][0]
            
            # Update history
            session['history'].append({'spanish': src_text, 'english': translated_text})
            session.modified = True  # To ensure the session gets saved

    return render_template('index.html', history=session['history'])

if __name__ == '__main__':
    # debug=True is meant for local development only
    app.run(debug=True)
```