# **CAMeL Arabic Text Tokenization**

## **CAMeL Tools Installation**

In [None]:
# Check that the Python version is between 3.8 and 3.12
import sys
print(sys.version)

3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
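
The printed version can also be checked programmatically instead of by eye. The following is a minimal sketch; the helper name `python_version_supported` is hypothetical, and the 3.8 to 3.12 range is taken from the comment in the cell above rather than from any packaging metadata.

```python
import sys


def python_version_supported(version_info=sys.version_info,
                             minimum=(3, 8), maximum=(3, 12)):
    """Return True when major.minor lies in the inclusive range [minimum, maximum]."""
    return minimum <= tuple(version_info[:2]) <= maximum


# Report whether the running interpreter falls inside the assumed range.
print(python_version_supported())
```

Passing an explicit tuple such as `(3, 10, 12)` makes it possible to test the check without changing interpreters.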


### Install the Rust Compiler

In [None]:
!curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

info: downloading installer
info: profile set to 'default'
info: default host triple is x86_64-unknown-linux-gnu
info: syncing channel updates for 'stable-x86_64-unknown-linux-gnu'
info: default toolchain set to 'stable-x86_64-unknown-linux-gnu'

  stable-x86_64-unknown-linux-gnu unchanged - rustc 1.82.0 (f6e511eec 2024-10-15)

Rust is installed now. Great!

To get started you may need to restart your current shell.
This would reload your PATH environment variable to include
Cargo's bin directory ($HOME/.cargo/bin).

To configure your current shell, you need to source
the corresponding env file under $HOME/.cargo.

This is usually done by running one of the following (note the leading DOT):
. "$HOME/.cargo/env"            # For sh/bash/zsh/ash/dash/pdksh
source "$HOME/.cargo/env.fish"  # For fish


In [None]:
# Modify the PATH environment variable by appending the path to Cargo's bin directory
import os
os.environ['PATH'] += ":/root/.cargo/bin"
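
The cell above hardcodes `/root/.cargo/bin`, which matches the Colab root user. As a hedged alternative, the sketch below derives the directory from the current home directory and avoids adding the same entry twice when the cell is re-run. The helper name `append_to_path` is hypothetical.

```python
import os
from pathlib import Path


def append_to_path(directory, env=os.environ):
    """Append directory to PATH once, using the platform's path separator."""
    entries = [entry for entry in env.get("PATH", "").split(os.pathsep) if entry]
    if str(directory) not in entries:
        entries.append(str(directory))
    env["PATH"] = os.pathsep.join(entries)
    return env["PATH"]


# Cargo's default bin directory under the current user's home.
cargo_bin = Path.home() / ".cargo" / "bin"
append_to_path(cargo_bin)
```

The `env` parameter accepts any dictionary, so the behavior can be tried on a copy before touching the real environment.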

In [None]:
# Check that the Rust compiler is installed
!rustc --version

rustc 1.82.0 (f6e511eec 2024-10-15)


### Install CAMeL Tools

In [None]:
!pip install camel-tools



### Install the CAMeL Tools Data Used for Morphology

In [None]:
!camel_data -i light # Install the light dataset for camel_tools

No new packages will be installed.


## **OKVQA-ar** Question Tokenization

### Questions DATA Reading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/MyDrive/ColabData/OKVQA/translated-dataset

/content/drive/MyDrive/ColabData/OKVQA/translated-dataset


In [None]:
import json

# Read the files as UTF-8 explicitly, since they contain Arabic text
with open('question_train.json', 'r', encoding='utf-8') as datafile:
    datafile1 = json.load(datafile)
question_train = datafile1['questions']
with open('question_val.json', 'r', encoding='utf-8') as datafile:
    datafile2 = json.load(datafile)
question_val = datafile2['questions']


In [None]:
print(len(question_train), question_train[0])
print(len(question_val), question_val[0])

9009 {'image_id': 51606, 'question': 'What is the hairstyle of the blond called?', 'question_id': 516065, 'question_ar': 'ماذا تسمى تسريحة الشعر الاشقر؟'}
5046 {'image_id': 297147, 'question': 'What sport can you use this for?', 'question_id': 2971475, 'question_ar': 'في أي رياضة يمكنك استخدام هذا؟'}


### Arabic Text Extraction and concatenation

In [None]:
import json

def extract_question_ar(data):
    question_ar_list = []
    # data is a list of dictionaries, each containing 'question_ar' directly
    for item in data:
        # Access 'question_ar' directly from each dictionary
        question_ar_list.append(item['question_ar'])
    return question_ar_list


# Extract question_ar from each dataset
question_ar_data_list = [
    ("question_train", extract_question_ar(question_train)),
    ("question_val", extract_question_ar(question_val)),
]
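
`extract_question_ar` raises a `KeyError` if any record lacks `question_ar`. A tolerant variant might look like the sketch below, which also reports how many records were skipped. The helper name `extract_field` and the second sample record are hypothetical and only illustrate the idea.

```python
def extract_field(records, field="question_ar"):
    """Collect `field` from each record and count the records that lack it."""
    values = [record[field] for record in records if field in record]
    missing = len(records) - len(values)
    return values, missing


sample = [
    {"question_id": 516065, "question_ar": "ماذا تسمى تسريحة الشعر الاشقر؟"},
    {"question_id": 0},  # hypothetical record with no Arabic translation
]
questions, missing = extract_field(sample)
print(len(questions), missing)  # prints: 1 1
```

A non-zero `missing` count would be a signal to inspect the translated files before computing statistics.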


### Tokenization and Average Length Calculation

In [None]:
# Tokenize the questions and save the results to a JSON file

import sys
import os
from google.colab import drive
import json

from camel_tools.morphology.database import MorphologyDB # Correct import path for MorphologyDB
from camel_tools.tokenizers.word import simple_word_tokenize

from tqdm import tqdm  # Import tqdm for progress tracking



# Load the built-in morphology database (simple_word_tokenize below does not use it)
dataB = MorphologyDB.builtin_db()

# Process each dataset individually
for data_name, question_ar_data in question_ar_data_list:
    print(f"Number of question_ar in {data_name}: {len(question_ar_data)}")

    tokenized_questions = []
    total_tokens = 0

    # Tokenize the questions using CAMeL Tools
    for question in tqdm(question_ar_data, desc=f"Processing questions in {data_name}"):
        tokens = simple_word_tokenize(question)
        tokenized_questions.append(tokens)
        total_tokens += len(tokens)

    avg_question_length = total_tokens / len(question_ar_data) if len(question_ar_data) > 0 else 0
    print(f"Total number of tokens in {data_name}: {total_tokens}")
    print(f"Average length of questions in {data_name}: {avg_question_length}")

    # Create a dictionary to store results
    results = {
        "data_name": data_name,
        "total_tokens": total_tokens,
        "avg_question_length": avg_question_length,
        "tokenized_questions": tokenized_questions
    }

    # Save the results to a JSON file
    json_file_path = f'/content/drive/MyDrive/ColabData/OKVQA/OKVQA_Statistics/{data_name}_CAMeL_tokenized.json'
    with open(json_file_path, 'w', encoding='utf-8') as jsonfile:
        json.dump(results, jsonfile, ensure_ascii=False, indent=4)
    print(f"Results for {data_name} saved to {json_file_path}")

Number of question_ar in question_train: 9009


Processing questions in question_train: 100%|██████████| 9009/9009 [00:22<00:00, 402.54it/s]


Total number of tokens in question_train: 71969
Average length of questions in question_train: 7.988566988566989
Results for question_train saved to /content/drive/MyDrive/ColabData/OKVQA/OKVQA_Statistics/question_train_CAMeL_tokenized.json
Number of question_ar in question_val: 5046


Processing questions in question_val: 100%|██████████| 5046/5046 [00:12<00:00, 402.62it/s]


Total number of tokens in question_val: 40487
Average length of questions in question_val: 8.023583036068173
Results for question_val saved to /content/drive/MyDrive/ColabData/OKVQA/OKVQA_Statistics/question_val_CAMeL_tokenized.json
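
The tokenize-and-average loop recurs for every split in this notebook, so it could be factored into one helper. The sketch below is one possible shape, not the notebook's actual code: `summarize_tokens` is a hypothetical name, and `str.split` is only a whitespace stand-in so the example runs without CAMeL Tools. In the notebook, `simple_word_tokenize` would be passed as `tokenize`; because it is expected to separate punctuation from words, its counts would differ from the whitespace counts shown here.

```python
def summarize_tokens(texts, tokenize=str.split):
    """Tokenize each text and return the token total, the mean length, and the tokens."""
    tokenized = [tokenize(text) for text in texts]
    total_tokens = sum(len(tokens) for tokens in tokenized)
    avg_length = total_tokens / len(texts) if texts else 0
    return {
        "total_tokens": total_tokens,
        "avg_length": avg_length,
        "tokenized": tokenized,
    }


# Two questions taken from the records printed earlier in the notebook.
stats = summarize_tokens(["في أي رياضة يمكنك استخدام هذا؟", "أين ينظر؟"])
print(stats["total_tokens"], stats["avg_length"])  # prints: 8 4.0
```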


## **OKVQA-ar** Answers Tokenization

### Answers DATA Reading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/MyDrive/ColabData/OKVQA/translated-dataset

/content/drive/MyDrive/ColabData/OKVQA/translated-dataset


In [None]:
import json

# Read the files as UTF-8 explicitly, since they contain Arabic text
with open('annotation_train.json', 'r', encoding='utf-8') as datafile:
    datafile1 = json.load(datafile)
annotation_train = datafile1['annotations']
with open('annotation_val.json', 'r', encoding='utf-8') as datafile:
    datafile2 = json.load(datafile)
annotation_val = datafile2['annotations']


In [None]:
print(len(annotation_train), annotation_train[0])
print(len(annotation_val), annotation_val[0])

9009 {'image_id': 51606, 'answer_type': 'other', 'question_type': 'four', 'question_id': 516065, 'answers': [{'answer_id': 1, 'raw_answer': 'pony tail', 'answer_confidence': 'yes', 'answer': 'pony tail', 'raw_answer_ar': 'ذيل حصان', 'answer_ar': 'ذيل حصان'}, {'answer_id': 2, 'raw_answer': 'pony tail', 'answer_confidence': 'yes', 'answer': 'pony tail', 'raw_answer_ar': 'ذيل حصان', 'answer_ar': 'ذيل حصان'}, {'answer_id': 3, 'raw_answer': 'pony tail', 'answer_confidence': 'yes', 'answer': 'pony tail', 'raw_answer_ar': 'ذيل حصان', 'answer_ar': 'ذيل حصان'}, {'answer_id': 4, 'raw_answer': 'pony tail', 'answer_confidence': 'yes', 'answer': 'pony tail', 'raw_answer_ar': 'ذيل حصان', 'answer_ar': 'ذيل حصان'}, {'answer_id': 5, 'raw_answer': 'pony tail', 'answer_confidence': 'yes', 'answer': 'pony tail', 'raw_answer_ar': 'ذيل حصان', 'answer_ar': 'ذيل حصان'}, {'answer_id': 6, 'raw_answer': 'pony tail', 'answer_confidence': 'yes', 'answer': 'pony tail', 'raw_answer_ar': 'ذيل حصان', 'answer_ar': 'ذيل

### Arabic Text Extraction and concatenation

In [None]:
import json


def extract_answer_ar(data):
    answer_ar_list = []

    # Each annotation holds a list of answers; keep only those with an Arabic translation
    for annotation in data:
        for answer in annotation['answers']:
            if 'answer_ar' in answer:
                answer_ar_list.append(answer['answer_ar'])
    return answer_ar_list



# Extract answer_ar from each dataset
answer_ar_data_list = [
    ("annotation_train", extract_answer_ar(annotation_train)),
    ("annotation_val", extract_answer_ar(annotation_val)),
]


In [None]:
# Inspect answer_ar_data_list: print the count and the first five Arabic answers per split
for data_name, answer_ar_data in answer_ar_data_list:
    print(f"Inspecting {data_name}:")
    print(f"Number of Arabic answers: {len(answer_ar_data)}")
    print("First 5 Arabic answers:")
    for i in range(min(5, len(answer_ar_data))):
        print(answer_ar_data[i])
    print("-" * 20)

Inspecting annotation_train:
Number of Arabic answers: 90090
First 5 Arabic answers:
ذيل حصان
ذيل حصان
ذيل حصان
ذيل حصان
ذيل حصان
--------------------
Inspecting annotation_val:
Number of Arabic answers: 50460
First 5 Arabic answers:
سباق
سباق
سباق
سباق
سباق
--------------------
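
The counts above (90090 and 50460) are exactly ten times the number of annotations (9009 and 5046), which is consistent with ten answers per question and no answer missing `answer_ar`. One way to confirm this directly is sketched below; `answers_per_annotation` is a hypothetical helper and the two records are toy data shaped like the annotations printed earlier.

```python
from collections import Counter


def answers_per_annotation(annotations):
    """Map 'number of answers' to 'number of annotations with that many answers'."""
    return Counter(len(annotation["answers"]) for annotation in annotations)


toy_annotations = [
    {"question_id": 1, "answers": [{"answer_ar": "ذيل حصان"}] * 10},
    {"question_id": 2, "answers": [{"answer_ar": "سباق"}] * 10},
]
counts = answers_per_annotation(toy_annotations)
print(counts)  # prints: Counter({10: 2})
```

On the real data, any key other than 10 in the result would point to annotations worth inspecting.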


### Tokenization and Average Length Calculation

In [None]:
# Tokenize the answers and save the results to a JSON file

import sys
import os
from google.colab import drive
import json

from camel_tools.morphology.database import MorphologyDB # Correct import path for MorphologyDB
from camel_tools.tokenizers.word import simple_word_tokenize

from tqdm import tqdm  # Import tqdm for progress tracking



# Load the built-in morphology database (simple_word_tokenize below does not use it)
dataB = MorphologyDB.builtin_db()

# Process each dataset individually
for data_name, answer_ar_data in answer_ar_data_list:
    print(f"Number of answer_ar in {data_name}: {len(answer_ar_data)}")

    tokenized_answers = []
    total_tokens = 0

    # Tokenize the answers using CAMeL Tools
    for answer in tqdm(answer_ar_data, desc=f"Processing answers in {data_name}"):
        tokens = simple_word_tokenize(answer)
        tokenized_answers.append(tokens)
        total_tokens += len(tokens)

    avg_answer_length = total_tokens / len(answer_ar_data) if len(answer_ar_data) > 0 else 0
    print(f"Total number of tokens in {data_name}: {total_tokens}")
    print(f"Average length of answers in {data_name}: {avg_answer_length}")

    # Create a dictionary to store results
    results = {
        "data_name": data_name,
        "total_tokens": total_tokens,
        "avg_answer_length": avg_answer_length,
        "tokenized_answers": tokenized_answers
    }

    # Save the results to a JSON file
    json_file_path = f'/content/drive/MyDrive/ColabData/OKVQA/OKVQA_Statistics/{data_name}_CAMeL_tokenized.json'
    with open(json_file_path, 'w', encoding='utf-8') as jsonfile:
        json.dump(results, jsonfile, ensure_ascii=False, indent=4)
    print(f"Results for {data_name} saved to {json_file_path}")

Number of answer_ar in annotation_train: 90090


Processing answers in annotation_train: 100%|██████████| 90090/90090 [00:16<00:00, 5374.77it/s]


Total number of tokens in annotation_train: 128572
Average length of answers in annotation_train: 1.4271506271506271
Results for annotation_train saved to /content/drive/MyDrive/ColabData/OKVQA/OKVQA_Statistics/annotation_train_CAMeL_tokenized.json
Number of answer_ar in annotation_val: 50460


Processing answers in annotation_val: 100%|██████████| 50460/50460 [00:08<00:00, 5824.66it/s]


Total number of tokens in annotation_val: 70084
Average length of answers in annotation_val: 1.388902100673801
Results for annotation_val saved to /content/drive/MyDrive/ColabData/OKVQA/OKVQA_Statistics/annotation_val_CAMeL_tokenized.json
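
Averages of about 1.43 and 1.39 tokens suggest that most answers are a single word, but a mean alone does not show the spread. The sketch below builds a length histogram from a list shaped like the `tokenized_answers` field saved above. The helper name `length_histogram` and the toy input are assumptions for illustration.

```python
from collections import Counter


def length_histogram(tokenized):
    """Count how many token lists have each length."""
    return Counter(len(tokens) for tokens in tokenized)


# Toy input in the same shape as results["tokenized_answers"].
toy_tokenized = [["ذيل", "حصان"], ["سباق"], ["سباق"], []]
histogram = length_histogram(toy_tokenized)
for length in sorted(histogram):
    print(length, histogram[length])
```

A length of 0 in the histogram would indicate empty answer strings in the translated data.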


## **VQAv2-ar** Question Tokenization

### Questions DATA Reading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/MyDrive/ColabData/VQAv2/translated-dataset

/content/drive/MyDrive/ColabData/VQAv2/translated-dataset


In [None]:
import json

# Read the files as UTF-8 explicitly, since they contain Arabic text
with open('v2_OpenEnded_mscoco_train2014_questions_translated.json', 'r', encoding='utf-8') as datafile:
    question_train = json.load(datafile)
with open('v2_OpenEnded_mscoco_val2014_questions_translated.json', 'r', encoding='utf-8') as datafile:
    question_val = json.load(datafile)
with open('v2_OpenEnded_mscoco_test2015_questions_translated.json', 'r', encoding='utf-8') as datafile:
    question_test = json.load(datafile)
with open('v2_OpenEnded_mscoco_test-dev2015_questions_translated.json', 'r', encoding='utf-8') as datafile:
    question_test_dev = json.load(datafile)


In [None]:
print(len(question_train), question_train[0])
print(len(question_val), question_val[0])
print(len(question_test), question_test[0])
print(len(question_test_dev), question_test_dev[0])

443757 {'image_id': 458752, 'question': 'What is this photo taken looking through?', 'question_id': 458752000, 'question_ar': 'ما هي هذه الصورة التي التقطت من خلال النظر؟'}
214354 {'image_id': 262148, 'question': 'Where is he looking?', 'question_id': 262148000, 'question_ar': 'أين ينظر؟'}
447793 {'image_id': 458752, 'question': 'What is this photo taken looking through?', 'question_id': 458752000, 'question_ar': 'ما هي هذه الصورة التي التقطت من خلال النظر؟'}
107394 {'image_id': 262148, 'question': 'Where is he looking?', 'question_id': 262148000, 'question_ar': 'أين ينظر؟'}


### Arabic Text Extraction and concatenation

In [None]:
import json

def extract_question_ar(data):
    question_ar_list = []
    # data is a list of dictionaries, each containing 'question_ar' directly
    for item in data:
        # Access 'question_ar' directly from each dictionary
        question_ar_list.append(item['question_ar'])
    return question_ar_list


# Extract question_ar from each dataset
question_ar_data_list = [
    ("question_train", extract_question_ar(question_train)),
    ("question_val", extract_question_ar(question_val)),
    ("question_test", extract_question_ar(question_test)),
    ("question_test_dev", extract_question_ar(question_test_dev)),
]


### Tokenization and Average Length Calculation

In [None]:
# Tokenize the questions and save the results to a JSON file

import sys
import os
from google.colab import drive
import json

from camel_tools.morphology.database import MorphologyDB # Correct import path for MorphologyDB
from camel_tools.tokenizers.word import simple_word_tokenize

from tqdm import tqdm  # Import tqdm for progress tracking



# Load the built-in morphology database (simple_word_tokenize below does not use it)
dataB = MorphologyDB.builtin_db()

# Process each dataset individually
for data_name, question_ar_data in question_ar_data_list:
    print(f"Number of question_ar in {data_name}: {len(question_ar_data)}")

    tokenized_questions = []
    total_tokens = 0

    # Tokenize the questions using CAMeL Tools
    for question in tqdm(question_ar_data, desc=f"Processing questions in {data_name}"):
        tokens = simple_word_tokenize(question)
        tokenized_questions.append(tokens)
        total_tokens += len(tokens)

    avg_question_length = total_tokens / len(question_ar_data) if len(question_ar_data) > 0 else 0
    print(f"Total number of tokens in {data_name}: {total_tokens}")
    print(f"Average length of questions in {data_name}: {avg_question_length}")

    # Create a dictionary to store results
    results = {
        "data_name": data_name,
        "total_tokens": total_tokens,
        "avg_question_length": avg_question_length,
        "tokenized_questions": tokenized_questions
    }

    # Save the results to a JSON file
    json_file_path = f'/content/drive/MyDrive/ColabData/VQAv2/VQAv2_Statistics/{data_name}_CAMeL_tokenized.json'
    with open(json_file_path, 'w', encoding='utf-8') as jsonfile:
        json.dump(results, jsonfile, ensure_ascii=False, indent=4)
    print(f"Results for {data_name} saved to {json_file_path}")

Number of question_ar in question_train: 443757


Processing questions in question_train: 100%|██████████| 443757/443757 [13:06<00:00, 564.10it/s]


Total number of tokens in question_train: 2713470
Average length of questions in question_train: 6.114765513558096
Results for question_train saved to /content/drive/MyDrive/ColabData/VQAv2/VQAv2_Statistics/question_train_CAMeL_tokenized.json
Number of question_ar in question_val: 214354


Processing questions in question_val: 100%|██████████| 214354/214354 [06:19<00:00, 565.45it/s]


Total number of tokens in question_val: 1308766
Average length of questions in question_val: 6.10562900622335
Results for question_val saved to /content/drive/MyDrive/ColabData/VQAv2/VQAv2_Statistics/question_val_CAMeL_tokenized.json
Number of question_ar in question_test: 447793


Processing questions in question_test: 100%|██████████| 447793/447793 [12:40<00:00, 589.12it/s]


Total number of tokens in question_test: 2649326
Average length of questions in question_test: 5.916407804498954
Results for question_test saved to /content/drive/MyDrive/ColabData/VQAv2/VQAv2_Statistics/question_test_CAMeL_tokenized.json
Number of question_ar in question_test_dev: 107394


Processing questions in question_test_dev: 100%|██████████| 107394/107394 [03:02<00:00, 587.76it/s]


Total number of tokens in question_test_dev: 636601
Average length of questions in question_test_dev: 5.927714769912658
Results for question_test_dev saved to /content/drive/MyDrive/ColabData/VQAv2/VQAv2_Statistics/question_test_dev_CAMeL_tokenized.json
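
The averages in the log are simply the token total divided by the question count for each split. The sketch below recomputes them from the totals and counts printed above, so the four splits can be compared side by side. The dictionary name `reported` is an assumption; the numbers are copied from the output.

```python
# (total_tokens, number_of_questions) per split, copied from the log above.
reported = {
    "question_train": (2713470, 443757),
    "question_val": (1308766, 214354),
    "question_test": (2649326, 447793),
    "question_test_dev": (636601, 107394),
}

# Average question length = total tokens / number of questions.
averages = {name: tokens / count for name, (tokens, count) in reported.items()}

for name, average in averages.items():
    print(f"{name}: {average:.4f}")
```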


## **VQAv2-ar** Answers Tokenization

### Answers DATA Reading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/MyDrive/ColabData/VQAv2/translated-dataset

/content/drive/MyDrive/ColabData/VQAv2/translated-dataset


In [None]:
import json

# Read the files as UTF-8 explicitly, since they contain Arabic text
with open('v2_mscoco_train2014_annotations_translated2.json', 'r', encoding='utf-8') as datafile:
    datafile1 = json.load(datafile)
annotation_train = datafile1['annotations']
with open('v2_mscoco_val2014_annotations_translated2.json', 'r', encoding='utf-8') as datafile:
    datafile2 = json.load(datafile)
annotation_val = datafile2['annotations']


In [None]:
print(len(annotation_train), annotation_train[0])
print(len(annotation_val), annotation_val[0])

443757 {'question_type': 'what is this', 'multiple_choice_answer': 'net', 'answers': [{'answer': 'net', 'answer_confidence': 'maybe', 'answer_id': 1, 'answer_ar': 'شبكة'}, {'answer': 'net', 'answer_confidence': 'yes', 'answer_id': 2, 'answer_ar': 'شبكة'}, {'answer': 'net', 'answer_confidence': 'yes', 'answer_id': 3, 'answer_ar': 'شبكة'}, {'answer': 'netting', 'answer_confidence': 'yes', 'answer_id': 4, 'answer_ar': 'المعاوضة'}, {'answer': 'net', 'answer_confidence': 'yes', 'answer_id': 5, 'answer_ar': 'شبكة'}, {'answer': 'net', 'answer_confidence': 'yes', 'answer_id': 6, 'answer_ar': 'شبكة'}, {'answer': 'mesh', 'answer_confidence': 'maybe', 'answer_id': 7, 'answer_ar': 'شبكة'}, {'answer': 'net', 'answer_confidence': 'yes', 'answer_id': 8, 'answer_ar': 'شبكة'}, {'answer': 'net', 'answer_confidence': 'yes', 'answer_id': 9, 'answer_ar': 'شبكة'}, {'answer': 'net', 'answer_confidence': 'yes', 'answer_id': 10, 'answer_ar': 'شبكة'}], 'image_id': 458752, 'answer_type': 'other', 'question_id': 

### Arabic Text Extraction and concatenation

In [None]:
import json

# Extract every answer_ar string from the train and val annotations

def extract_answer_ar(data):
    answer_ar_list = []

    # Each annotation holds a list of answers; keep only those with an Arabic translation
    for annotation in data:
        for answer in annotation['answers']:
            if 'answer_ar' in answer:
                answer_ar_list.append(answer['answer_ar'])
    return answer_ar_list



# Extract answer_ar from each dataset
answer_ar_data_list = [
    ("annotation_train", extract_answer_ar(annotation_train)),
    ("annotation_val", extract_answer_ar(annotation_val)),
]


### Tokenization and Average Length Calculation

In [None]:
# Tokenize and save the data into JSON file

import sys
import os
from google.colab import drive
import json

from camel_tools.morphology.database import MorphologyDB # Correct import path for MorphologyDB
from camel_tools.tokenizers.word import simple_word_tokenize

from tqdm import tqdm  # Import tqdm for progress tracking



# Load the built-in morphology database (simple_word_tokenize below does not use it)
dataB = MorphologyDB.builtin_db()

# Process each dataset individually
for data_name, answer_ar_data in answer_ar_data_list:
    print(f"Number of answer_ar in {data_name}: {len(answer_ar_data)}")

    tokenized_answers = []
    total_tokens = 0

    # Tokenize the answers using CAMeL Tools
    for answer in tqdm(answer_ar_data, desc=f"Processing answers in {data_name}"):
        tokens = simple_word_tokenize(answer)
        tokenized_answers.append(tokens)
        total_tokens += len(tokens)

    avg_answer_length = total_tokens / len(answer_ar_data) if len(answer_ar_data) > 0 else 0
    print(f"Total number of tokens in {data_name}: {total_tokens}")
    print(f"Average length of answers in {data_name}: {avg_answer_length}")

    # Create a dictionary to store results
    results = {
        "data_name": data_name,
        "total_tokens": total_tokens,
        "avg_answer_length": avg_answer_length,
        "tokenized_answers": tokenized_answers
    }

    # Save the results to a JSON file
    json_file_path = f'/content/drive/MyDrive/ColabData/VQAv2/VQAv2_Statistics/{data_name}_CAMeL_tokenized.json'
    with open(json_file_path, 'w', encoding='utf-8') as jsonfile:
        json.dump(results, jsonfile, ensure_ascii=False, indent=4)
    print(f"Results for {data_name} saved to {json_file_path}")

Number of answer_ar in annotation_train: 4437570


Processing answers in annotation_train: 100%|██████████| 4437570/4437570 [07:32<00:00, 9802.34it/s] 


Total number of tokens in annotation_train: 5364094
Average length of answers in annotation_train: 1.2087908472429731
Results for annotation_train saved to /content/drive/MyDrive/ColabData/VQAv2/VQAv2_Statistics/annotation_train_CAMeL_tokenized.json
Number of answer_ar in annotation_val: 2143540


Processing answers in annotation_val: 100%|██████████| 2143540/2143540 [03:39<00:00, 9783.58it/s]


Total number of tokens in annotation_val: 2602360
Average length of answers in annotation_val: 1.2140477901042201
Results for annotation_val saved to /content/drive/MyDrive/ColabData/VQAv2/VQAv2_Statistics/annotation_val_CAMeL_tokenized.json
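
With millions of answers, many of them identical (the sample annotation printed earlier repeats شبكة nine times out of ten), the token total can be obtained by tokenizing each distinct string once and weighting by its frequency. The sketch below illustrates that idea only. `count_tokens_deduplicated` is a hypothetical helper, and `str.split` is a whitespace stand-in for `simple_word_tokenize`, so the example runs without CAMeL Tools.

```python
from collections import Counter


def count_tokens_deduplicated(texts, tokenize=str.split):
    """Return (total token count, number of distinct texts), tokenizing each distinct text once."""
    frequency = Counter(texts)
    total_tokens = sum(len(tokenize(text)) * n for text, n in frequency.items())
    return total_tokens, len(frequency)


# Ten answers mirroring the sample annotation: nine identical, one different.
answers = ["شبكة"] * 9 + ["المعاوضة"]
total_tokens, distinct = count_tokens_deduplicated(answers)
print(total_tokens, distinct)  # prints: 10 2
```

This yields only the aggregate statistics. Saving a token list for every answer, as the cell above does, would still need a lookup from each answer to its cached tokens.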




---

