<a href="https://colab.research.google.com/github/simplysowj/Data_engg/blob/main/Machine_Translation_Wikipedia_Biographies_English_to_Spanish.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Machine Translation - Wikipedia Biographies - English to Spanish

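Project Objective: Translate English Wikipedia biography sentences into Spanish with a pretrained translation model in Spark NLP, and compare the output with the dataset's reference translations.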

Dataset Source: https://www.kaggle.com/datasets/paultimothymooney/translated-wikipedia-biographies?select=Translated+Wikipedia+Biographies+-+EN_ES.csv

##### Import Necessary Libraries

In [None]:
!pip install sparknlp


Collecting sparknlp
  Downloading sparknlp-1.0.0-py3-none-any.whl (1.4 kB)
Collecting spark-nlp (from sparknlp)
  Downloading spark_nlp-5.3.2-py2.py3-none-any.whl (564 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 565.0/565.0 kB 5.6 MB/s eta 0:00:00
Installing collected packages: spark-nlp, sparknlp
Successfully installed spark-nlp-5.3.2 sparknlp-1.0.0


In [None]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 1.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=9830adeefe454bd5ad71f519b8816770ad9f6a15fcf897249a2b899ddf7b639b
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [None]:
import sparknlp
from sparknlp.base import *        # DocumentAssembler
from sparknlp.annotator import *   # SentenceDetectorDLModel, MarianTransformer
from pyspark.ml import Pipeline
import pyspark.sql.functions as F  # column helpers used when selecting predictions

##### Start Spark NLP Session

In [None]:
spark = sparknlp.start(gpu=True)

##### Ingest & Start Preprocessing Data

In [None]:
# File location and type
file_location = "/content/Translated Wikipedia Biographies - EN_ES.csv"
file_type = "csv"

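# CSV options
infer_schema = "false"
first_row_is_header = "true"
# NOTE: the file is comma-separated, so a tab delimiter parses each row
# as one wide string column (visible in the schema printed further down).
delimiter = "\t"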

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

# With the single-column parse above, none of these columns exist, so drop() is a silent no-op.
df = df.drop('sourceLanguage', 'targetLanguage', 'documentID', 'stringID', 'entityName', 'sourceURL')

display(df)

DataFrame[sourceLanguage,targetLanguage,documentID,stringID,sourceText,translatedText,perceivedGender,entityName,sourceURL: string]

##### Return Number of Samples in Dataset

In [None]:
df.count()

1471

In [None]:
df.printSchema()


root
 |-- sourceLanguage,targetLanguage,documentID,stringID,sourceText,translatedText,perceivedGender,entityName,sourceURL: string (nullable = true)



In [None]:
df.show()


+----------------------------------------------------------------------------------------------------------------+
|sourceLanguage,targetLanguage,documentID,stringID,sourceText,translatedText,perceivedGender,entityName,sourceURL|
+----------------------------------------------------------------------------------------------------------------+
|                                                                                            en,es,1,1-1,"Kais...|
|                                                                                            en,es,1,1-2,"Outs...|
|                                                                                            en,es,1,1-3,"Her ...|
|                                                                                            en,es,1,1-4,Mäkär...|
|                                                                                            en,es,1,1-5,She s...|
|                                                                               

In [None]:
# Re-read the file with a comma delimiter so that the nine columns are parsed separately.
df = spark.read.option("delimiter", ",").csv("/content/Translated Wikipedia Biographies - EN_ES.csv", header=True)



In [None]:
df.select('perceivedGender').distinct().count()


47

##### Return Unique Values in 'perceivedGender' Feature (& Number of Unique Values)

In [None]:
unique_label_vals = df.select('perceivedGender').distinct().count()
print(unique_label_vals)
# show() prints the table itself and returns None, so it is not wrapped in print().
df.select('perceivedGender').distinct().show(unique_label_vals)

47
+--------------------+
|     perceivedGender|
+--------------------+
| 17 February 1933...|
|"En 2009, dio a c...|
|"" which contains...|
|              Iraq."|
| ""It Must Have B...|
|"Por entonces, re...|
|Esa misma tempora...|
| and Security"" b...|
| las mujeres no t...|
|Fue la primera ba...|
| o patrimonio cul...|
|"Mientras formaba...|
|Sus seguidores de...|
|Siguiendo el cons...|
|              Female|
|"Su libro Puerta ...|
| ""Pride"" for ""...|
| discussing yamba...|
|               Twins|
|           el pueblo|
| which would late...|
|Qubeka ascendió a...|
|Uno de sus casos ...|
|En octubre de ese...|
| que ha crecido h...|
| Paralamas' prope...|
|             Neutral|
| and ""Fading Lik...|
|Durante el «Best ...|
|           en 2005."|
|"Tarin renunció a...|
|"En 2013, complet...|
| is a Colombian d...|
| published by the...|
|"A pesar de que e...|
|             de 2013|
|Ese mismo año, fo...|
|      held in Kigali|
|                Male|
|"En julio de 1980...|
|Es la r
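
The distinct values above include sentence fragments, which suggests that some rows were split at commas or quotes inside quoted text fields. As far as I know, Spark's CSV reader uses a backslash as its default escape character, so a file that escapes quotes by doubling them (`""`) may need `.option("quote", '"').option("escape", '"')`, and possibly `.option("multiLine", "true")`; that was not tested against this dataset. The standard-library sketch below uses an invented row (not taken from the dataset) in the same column order to show how naive comma splitting shifts `perceivedGender`, while quote-aware parsing does not.

```python
import csv
import io

# Column order as in the dataset header; the row content is invented for illustration.
header = ("sourceLanguage,targetLanguage,documentID,stringID,sourceText,"
          "translatedText,perceivedGender,entityName,sourceURL")
row = ('en,es,1,1-1,"She won ""Best Film"", then retired.",'
       '"Ganó ""Mejor Película"", luego se retiró.",Female,Jane Doe,https://example.org')

columns = header.split(",")
gender_idx = columns.index("perceivedGender")

# Naive split: commas inside the quoted sentences shift every later column.
naive_fields = row.split(",")
naive_gender = naive_fields[gender_idx]

# Quote-aware parse: doubled quotes are unescaped and embedded commas are kept.
parsed_fields = next(csv.reader(io.StringIO(row)))
parsed_gender = parsed_fields[gender_idx]

print(len(naive_fields), repr(naive_gender))
print(len(parsed_fields), repr(parsed_gender))
```

With the naive split, a piece of the Spanish sentence lands in the `perceivedGender` position, which mirrors the kind of values listed in the output above.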

##### Filter Rows to Remove Incorrectly Parsed Samples

In [None]:
genders = ['Female', 'Male']

df = df.filter(df.perceivedGender.isin(genders))

unique_label_vals = df.select('perceivedGender').distinct().count()
print(unique_label_vals)
df.select('perceivedGender').distinct().show(unique_label_vals)

2
+---------------+
|perceivedGender|
+---------------+
|         Female|
|           Male|
+---------------+


##### Return Total Number of Samples in Processed Dataset & Drop Unnecessary Feature

In [None]:
df = df.drop('perceivedGender')
df.count()

1312

##### Split Dataset into Training & Testing Datasets

In [None]:
train_ds, test_ds = df.randomSplit(weights=[0.80, 0.20], seed=42)
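
`randomSplit` assigns rows to the splits by random sampling rather than cutting at a fixed index, so the 80/20 weights are targets and the resulting sizes are only approximate (the predictions table later in this notebook ends at index 224, i.e. 225 test rows out of 1312, roughly 17%). The plain-Python sketch below mimics that per-row assignment. `bernoulli_split` is a hypothetical helper written for illustration, not Spark's sampler, so its counts will not match Spark's.

```python
import random

def bernoulli_split(rows, train_weight=0.8, seed=42):
    """Assign each row independently to train or test (hypothetical helper)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row is a separate coin flip, so the final sizes are not exact.
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1312))  # same row count as the filtered dataset
train, test = bernoulli_split(rows)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, but it does not force an exact 80/20 ratio.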

##### Define Pipeline Stages & Pipeline

In [None]:
doc = DocumentAssembler()\
    .setInputCol("sourceText")\
    .setOutputCol("document")

sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

translator = MarianTransformer.pretrained("opus_mt_en_es", "xx")\
    .setInputCols(["sentence"])\
    .setOutputCol("translation")

en_es_translation_pipeline = Pipeline().setStages([doc, sentence, translator])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
opus_mt_en_es download started this may take some time.
Approximate size to download 398.8 MB
[OK!]


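##### Fit Pipeline

In [None]:
# Every stage is a transformer or a pretrained model, so fit() only assembles
# a PipelineModel here; no model weights are trained on train_ds.
en_es_translation_model = en_es_translation_pipeline.fit(train_ds)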

##### Inference: Predictions Using Test Dataset

In [None]:
preds = en_es_translation_model.transform(test_ds)

##### Return Only Necessary Features & Convert to Pandas DataFrame

In [None]:
preds_in_pandas = preds.select(
    F.col("sourceText").alias("source"),
    F.col("translatedText").alias("ground_truth"),
    F.col("translation.result").alias("predictions"),
).toPandas()

##### Display Condensed Predictions Output

In [None]:
display(preds_in_pandas)

| | source | ground_truth | predictions |
|---|---|---|---|
| 0 | During the 2008–09 Biathlon World Cup, she has... | Durante la Copa Mundial de Biatlón 2008-2009, ... | [Durante la Copa Mundial de Biatlón 2008-09, h... |
| 1 | Mäkäräinen was originally a cross-country skie... | Mäkäräinen era originalmente esquiadora de cam... | [Mäkäräinen fue originalmente un esquiador de ... |
| 2 | In 2004, she made the Finnish National Team. | En 2004 fue parte del equipo nacional finlandés. | [En 2004, hizo la selección nacional finlandesa.] |
| 3 | His first international successes were with th... | Sus primeros éxitos internacionales fueron con... | [Sus primeros éxitos internacionales fueron co... |
| 4 | Born in Paddington to Nigerian parents who wer... | Nacido en Paddington, de padres nigerianos que... | [Nacido en Paddington de padres nigerianos que... |
| ... | ... | ... | ... |
| 221 | He was the first African American elected to C... | Fue el primer afroamericano del Norte de Calif... | [Fue el primer afroamericano elegido al Congre... |
| 222 | He participated in the review of the Children,... | Participó en la revisión de la Ley de Niños, J... | [Participó en el examen de la Ley de 1989 sobr... |
| 223 | Pierre Brizon (16 May 1878 – 1 August 1923) wa... | Pierre Brizon (16 de mayo de 1878–1 de agosto ... | [Pierre Brizon (16 de mayo de 1878 - 1 de agos... |
| 224 | In 1907 he was elected a councilor in the dist... | En 1907, fue electo para el cargo de consejero... | [En 1907 fue elegido concejal en el distrito d... |
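
The table puts each prediction next to its reference translation, but the notebook does not compute a score. One possible check, sketched below, is a token-overlap F1 between the reference and the joined prediction sentences. `token_f1` is a hypothetical helper, and the metric is a crude stand-in for established machine translation metrics such as BLEU or chrF. The example strings are row 2 of the table above.

```python
import re
from collections import Counter

def token_f1(reference, hypothesis):
    """Token-overlap F1 on lowercased word tokens (hypothetical helper)."""
    ref = Counter(re.findall(r"\w+", reference.lower()))
    hyp = Counter(re.findall(r"\w+", hypothesis.lower()))
    overlap = sum((ref & hyp).values())          # tokens shared by both sides
    total = sum(ref.values()) + sum(hyp.values())
    return 2 * overlap / total if total else 0.0

ground_truth = "En 2004 fue parte del equipo nacional finlandés."
# The "predictions" column holds a list with one string per detected sentence.
predictions = ["En 2004, hizo la selección nacional finlandesa."]

score = token_f1(ground_truth, " ".join(predictions))
print(round(score, 2))  # 0.4
```

A low score here does not necessarily mean a bad translation: "hizo la selección nacional" and "fue parte del equipo nacional" are both plausible renderings, which is why overlap metrics are best read in aggregate.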


##### Save Model

In [34]:
OUTPUT_DIR = '/content/model'

In [35]:
en_es_translation_model.save(OUTPUT_DIR)