## Install dependencies

In [None]:
! pip install -r requirements.txt

## Connect to Drive

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
cd drive/MyDrive/NLP

/content/drive/MyDrive


## Source Code

In [10]:
! git clone https://github.com/samzirbo/MT-Dataset-Toolkit

Cloning into 'MT-Dataset-Toolkit'...
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 38 (delta 5), reused 35 (delta 3), pack-reused 0
Receiving objects: 100% (38/38), 429.86 KiB | 1.70 MiB/s, done.
Resolving deltas: 100% (5/5), done.


In [11]:
cd MT-Dataset-Toolkit

/content/drive/MyDrive/MT-Dataset-Toolkit


# ExtractTranscriptsSpider

This Scrapy spider extracts transcripts of TED talks in specified languages and saves them to an output file.

## Usage

Run the spider using the following command:

```bash
scrapy crawl ExtractTranscripts \
    [-a INPUT=data/talks.csv] \
    [-a OUTPUT=data/transcripts.json] \
    [-a LANGUAGES=en,fr,es] \
    [-a MAX_RETRIES=10] \
    [-a MAX_TALKS=100] \
    -o/O transcripts.jsonl:jsonlines
```

### Arguments

- `INPUT` (optional): Path to the input CSV file listing the talks. The file must have a `name` or `id` column and may also have a `gender` column.
- `OUTPUT` (optional): Path to the output file where the transcripts will be saved. Defaults to `transcripts.json`.
- `LANGUAGES` (optional): Comma-separated list of languages to extract the transcripts for. Defaults to `en`.
- `MAX_RETRIES` (optional): Maximum number of retries to request the transcript for a talk. Defaults to 10.
- `MAX_TALKS` (optional): Maximum number of talks to request the transcript for. If not specified, all talks in the input file will be processed.

### Example

To run the spider with specific arguments:

```bash
scrapy crawl ExtractTranscripts -a INPUT=data/talks.csv -a OUTPUT=data/transcripts.json -a LANGUAGES=en,fr,es -a MAX_RETRIES=5 -a MAX_TALKS=50 -o transcripts.json:jsonl
```

## Description

This spider reads a list of TED talks from a CSV file, or from a default JSONL file containing all TED talk IDs, requests the transcript of each talk in the specified languages, and saves the transcripts to an output file.

### Workflow

1. **Initialization**: The spider reads the input file to get a list of talks. If no input file is provided, it reads from a default JSONL file. It initializes the output file and keeps track of finished talks to avoid duplication.

2. **Start Requests**: For each talk in the list, the spider constructs a URL and sends a request to the TED website.

3. **Check Languages**: The spider checks if the required languages are available for the talk. If they are, it requests the transcript in each required language.

4. **Parse Talk**: The spider parses the transcript data and saves it if all required languages are available for the talk. If a transcript request fails or the language does not match, it retries up to the maximum number of retries.
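
The retry behaviour in step 4 can be sketched in plain Python, independent of Scrapy. This is a minimal illustration, not the spider's actual code: `fetch_transcript` is a hypothetical callable standing in for the HTTP request, assumed to return a `(language, text)` pair or `None` on failure. Counting one initial attempt plus `max_retries` retries is also an assumption.

```python
from typing import Callable, Optional, Tuple

# Hypothetical fetcher type: returns (language, text), or None if the request failed.
Fetcher = Callable[[str, str], Optional[Tuple[str, str]]]


def fetch_with_retries(talk_id: str, lang: str, fetch_transcript: Fetcher,
                       max_retries: int = 10) -> Optional[str]:
    """Return the transcript text, or None once the retry budget is exhausted."""
    # One initial attempt plus up to max_retries retries (assumed counting).
    for _ in range(max_retries + 1):
        result = fetch_transcript(talk_id, lang)
        if result is None:
            continue  # request failed: try again
        returned_lang, text = result
        if returned_lang == lang:
            return text  # language matches: done
        # Language mismatch: loop around and retry.
    return None


if __name__ == "__main__":
    # Fake fetcher for demonstration: fails once, returns the wrong
    # language once, then succeeds.
    responses = iter([None, ("en", "Hello"), ("ro", "Salut")])
    print(fetch_with_retries("demo_talk", "ro", lambda t, l: next(responses)))
```

A talk would then be saved only if every required language came back with a non-`None` result.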

### Output

The output is saved in JSONL format, where each line represents a talk with its ID, name, and transcripts in the specified languages.
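
Because every line is an independent JSON object, the file can be read one line at a time without loading it whole. The sketch below assumes the record shape visible in the demo output further down (`TALK-ID`, `TALK-NAME`, and a `TRANSCRIPTS` mapping from language code to text); `read_talks` and `summarize` are illustrative helper names, not part of the toolkit.

```python
import json
import os
import tempfile
from typing import Dict, Iterator


def read_talks(path: str) -> Iterator[Dict]:
    """Yield one talk record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(talk: Dict) -> str:
    """Build a one-line summary: talk ID plus character count per language."""
    sizes = {lang: len(text) for lang, text in talk["TRANSCRIPTS"].items()}
    return f'{talk["TALK-ID"]}: {sizes}'


if __name__ == "__main__":
    # Write a tiny made-up record to a temporary file, then read it back.
    sample = {"TALK-ID": "demo_talk", "TALK-NAME": "demo_talk",
              "TRANSCRIPTS": {"en": "Hello.", "ro": "Salut."}}
    with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False,
                                     encoding="utf-8") as tmp:
        tmp.write(json.dumps(sample, ensure_ascii=False) + "\n")
    for talk in read_talks(tmp.name):
        print(summarize(talk))
    os.remove(tmp.name)
```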

In [28]:
cd scraper

/content/drive/MyDrive/MT-Dataset-Toolkit/scraper


In [29]:
! scrapy crawl ExtractTranscripts -a LANGUAGES=en,ro -a MAX_TALKS=10 -O ../demo.json:jsonl

2024-05-15 21:14:22 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: TEDScraper)
2024-05-15 21:14:22 [scrapy.utils.log] INFO: Versions: lxml 4.9.4.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.1.2, Twisted 24.3.0, Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0], pyOpenSSL 24.1.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.7, Platform Linux-6.1.58+-x86_64-with-glibc2.35
Extracting transcripts for 3758 talks for languages: {'ro', 'en'}
2024-05-15 21:14:22 [scrapy.addons] INFO: Enabled addons:
[]
2024-05-15 21:14:22 [asyncio] DEBUG: Using selector: EpollSelector
2024-05-15 21:14:22 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-05-15 21:14:22 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-05-15 21:14:22 [scrapy.extensions.telnet] INFO: Telnet Password: bad0e0981ce6c6d7
2024-05-15 21:14:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.c

In [31]:
! head -10 ../demo.json

     1	{"TALK-ID": "robert_full_the_sticky_wonder_of_gecko_feet", "TALK-NAME": "robert_full_the_sticky_wonder_of_gecko_feet", "TRANSCRIPTS": {"ro": "Vreau să vă imaginaţi că sunteţi un student în laboratorul meu. Ce vreau eu să faceţi este să creaţi un design inspirat de biologie. Şi iată deci provocarea: Vreau să mă ajutaţi să creez un model de contact complet 3D, dinamic, parametrizat. Traducerea la asta e, puteţi să mă ajutaţi să construiesc o labă a piciorului? Şi asta e o provocare reală şi vreau să ma ajutaţi. Bineînţeles, în provocare există un premiu. Nu e chiar Premiul TED, dar este un tricou unic de la laboratorul nostru. Aşa că, vă rog, trimiteţi-mi ideile voastre despre cum să proiectezi o labă a piciorului. Acum, dacă vrem să proiectăm o labă a piciorului, ce trebuie să facem? Trebuie, mai întâi, să ştim ce este o labă a piciorului. Dacă mergem la dicţionar, el spune, \"Este extremitatea cea mai de jos a piciorului care este în contact direct cu pământul în timpul statului

# Sentence Aligner using Bertalign

This script aligns sentences from transcripts of TED talks in a source language to a target language using the Bertalign model. It reads an input file containing transcripts, aligns the sentences, and writes the aligned sentences to an output file.

## Usage

Run the script using the following command:

```bash
python align_corpus.py --INPUT <input_file> [OPTIONS]
```

### Arguments

- `--INPUT` (str): The input file containing the transcripts to align. This file should be in JSONL format with each line containing a JSON object with at least `TALK-ID`, `TALK-NAME`, and `TRANSCRIPTS` fields.
- `--OUTPUT` (str): The output file where the aligned sentences will be saved.
- `--GENDER` (bool): Whether the input data contains gender information. Default is `False`.
- `--SRC_LANG` (str): The source language code. Default is `en`.
- `--TGT_LANG` (str): The target language code. Default is `es`.
- `--NO_TALKS` (int): The number of talks to align. If not specified, all talks will be processed.
- `--OFFSET` (int): The offset to start aligning the talks. Default is `0`.
- `--MAX_ALIGN` (int): The maximum number of alignments. Default is `5`.
- `--TOP_K` (int): The top k alignments to consider. Default is `3`.
- `--WIN` (int): The window size for the second-pass alignment. Default is `5`.
- `--SKIP` (float): The skip value for alignment. Default is `-0.1`.
- `--MARGIN` (bool): Whether to use the modified cosine similarity. Default is `True`.
- `--LEN_PENALTY` (bool): Add length penalty to the similarity score. Default is `True`.
- `--IS_SPLIT` (bool): Whether the input data is split into sentences. Default is `False`.
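
How `--OFFSET` and `--NO_TALKS` combine is easiest to see as a slice. The sketch below assumes the script skips the first `OFFSET` talks and then processes at most `NO_TALKS` of the remainder; this reading is inferred from the argument descriptions and has not been checked against the script. `select_talks` is a hypothetical helper.

```python
from itertools import islice
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def select_talks(talks: Iterable[T], offset: int = 0,
                 no_talks: Optional[int] = None) -> List[T]:
    """Skip `offset` talks, then keep at most `no_talks` (all remaining if None)."""
    stop = None if no_talks is None else offset + no_talks
    return list(islice(talks, offset, stop))


if __name__ == "__main__":
    ids = [f"talk_{i}" for i in range(10)]
    # Skips talk_0 and talk_1, then keeps the next three talks.
    print(select_talks(ids, offset=2, no_talks=3))
```

Under this reading, a large corpus can be aligned in batches by running the script several times with increasing offsets.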

### Example

To run the script with specific arguments:

```bash
python align_corpus.py --INPUT=data/talks.jsonl --OUTPUT=data/aligned_transcripts.jsonl --GENDER=True --SRC_LANG=en --TGT_LANG=es --NO_TALKS=100 --OFFSET=0 --MAX_ALIGN=5 --TOP_K=3 --WIN=5 --SKIP=-0.1 --MARGIN=True --LEN_PENALTY=True --IS_SPLIT=False
```

## Description

### Functionality

The `align_corpus` function reads the input file to get a list of TED talks, finds the best sentence alignments using the Bertalign model, and saves the aligned sentences to the output file. It ensures that previously aligned talks are not reprocessed.
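
One way to avoid reprocessing is to scan the existing output file for talk IDs before starting. The following is a minimal sketch of that idea, assuming every output line carries a `TALK-ID` field; `load_done_ids` and `pending_talks` are hypothetical names, not functions from the repository.

```python
import json
import os
from typing import Dict, Iterable, Iterator, Set


def load_done_ids(output_path: str) -> Set[str]:
    """Collect the TALK-ID of every record already written to the output file."""
    if not os.path.exists(output_path):
        return set()  # first run: nothing has been aligned yet
    with open(output_path, encoding="utf-8") as f:
        return {json.loads(line)["TALK-ID"] for line in f if line.strip()}


def pending_talks(talks: Iterable[Dict], done: Set[str]) -> Iterator[Dict]:
    """Yield only the talks whose ID is not in the done set."""
    for talk in talks:
        if talk["TALK-ID"] not in done:
            yield talk
```

Opening the output file in append mode then lets an interrupted run resume where it stopped.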

### Output

The output is saved in JSONL format, where each line represents an aligned sentence pair with the talk ID, talk name, and optionally gender information.

In [32]:
cd ../sentence_aligner/

/content/drive/MyDrive/MT-Dataset-Toolkit/sentence_aligner


In [None]:
! python align.py --INPUT ../demo.json --SRC_LANG en --TGT_LANG ro

In [20]:
head -10 ../demo.aligned.json

 aligned_talks.0-200.json     experiments/                MT-Dataset-Toolkit/       wandb/
 aligned_talks.600-800.json   mt5.baseline.constant/      sentence_aligner/
'Colab Notebooks'/            mt5.baseline.more_epochs/   test.json
 dataset.json                 mT5.en-es.pretrained/       transcripts.en-es.jsonl
