#**Llama 2**

[Llama 2 13B-chat](https://huggingface.co/meta-llama/Llama-2-13b-chat)

The Hugging Face community provides quantized versions of the model, which are small enough to run efficiently on a T4 GPU. Before using any community-published model, check that it comes from a reliable source.
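As a rough sanity check on why quantization matters here, the sketch below estimates model file size from parameter count and bits per weight. The helper `estimate_size_gb` is hypothetical, and the inputs (about 13e9 parameters, about 6 bits per weight for the `q5_1` format once per-block scale and offset are included, 16 bits for fp16) are assumptions for illustration rather than values read from the model files.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in GB (10^9 bytes) of a model's weights."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumed values: ~13e9 parameters, ~6 bits/weight for q5_1,
# 16 bits/weight for an unquantized fp16 checkpoint.
q5_1_gb = estimate_size_gb(13e9, 6.0)
fp16_gb = estimate_size_gb(13e9, 16.0)

print(f"q5_1: ~{q5_1_gb:.2f} GB, fp16: ~{fp16_gb:.2f} GB")
```

Under these assumptions the quantized estimate lands close to the 9.76 GB download shown in Step 3 and fits in the T4's 16 GB of memory, while the fp16 estimate does not.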

There are several variations available, but the ones that interest us are based on the GGML library.

We can browse the available Llama 2 GGML variations [here](https://huggingface.co/models?search=llama%202%20ggml).



In this case, we will use the model called [Llama-2-13B-chat-GGML](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML).

#**Step 1: Install All the Required Packages**

In [None]:
# GPU llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78 numpy==1.23.4 --force-reinstall --upgrade --no-cache-dir --verbose
!pip install huggingface_hub
!pip install llama-cpp-python==0.1.78
!pip install numpy==1.23.4
!pip install transformers
!pip install sentence_transformers
!pip install boto3

Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Collecting llama-cpp-python==0.1.78
  Downloading llama_cpp_python-0.1.78.tar.gz (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 8.6 MB/s eta 0:00:00
  Running command pip subprocess to install build dependencies
  Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
  Collecting setuptools>=42
    Downloading setuptools-69.0.2-py3-none-any.whl (819 kB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 819.5/819.5 kB 4.6 MB/s eta 0:00:00
  Collecting scikit-build>=0.13
    Downloading scikit_build-0.17.6-py3-none-any.whl (84 kB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.3/84.3 kB 11.3 MB/s eta 0:00:00
  Collecting cmake>=3.18
    Downloading cmake-3.27.9-py2.py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (26.1 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26.1/26.1 MB 58.0 MB/s eta 0:00:00
  Coll
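After a long install log like this one, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper, and the pins are the ones from the install cell above.

```python
from importlib import metadata

def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Pins taken from the install cell.
expected = {"llama-cpp-python": "0.1.78", "numpy": "1.23.4"}

for name, pin in expected.items():
    found = installed_version(name)
    status = "ok" if found == pin else f"expected {pin}"
    print(f"{name}: {found} ({status})")
```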

In [None]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin" # the model is in bin format

#**Step 2: Import All the Required Libraries**

In [None]:
from huggingface_hub import hf_hub_download

In [None]:
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer
import scipy.spatial.distance  # explicit import so scipy.spatial.distance.cdist is available
import pandas as pd
import os
import boto3
from tqdm import tqdm
import csv
import re

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#**Step 3: Download the Model**

In [None]:
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

llama-2-13b-chat.ggmlv3.q5_1.bin:   0%|          | 0.00/9.76G [00:00<?, ?B/s]

#**Step 4: Loading the Model**

In [None]:
# GPU
lcpp_llm = None
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2, # CPU cores
    n_batch=512, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_gpu_layers=32, # Change this value based on your model and your GPU VRAM pool.
    n_ctx=2048
    )

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 


In [None]:
# See the number of layers in GPU
lcpp_llm.params.n_gpu_layers

32
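The right `n_gpu_layers` depends on how much VRAM is free. The sketch below is one crude way to estimate an upper bound; `layers_that_fit` is a hypothetical helper, and it assumes that every layer takes an equal share of the model file and that a fixed reserve (a guess of 2 GB here) covers the KV cache and scratch buffers. The inputs of 16 GB for a T4, a 9.76 GB model file, and 40 transformer layers for the 13B model are also stated assumptions.

```python
import math

def layers_that_fit(vram_gb, model_gb, n_layers, reserve_gb=2.0):
    """Estimate how many transformer layers can be offloaded to the GPU.

    Assumes equal-sized layers and keeps reserve_gb free for the KV cache
    and scratch buffers. Both are simplifications.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))

# Assumed inputs: 16 GB T4, 9.76 GB model file, 40 transformer layers.
print(layers_that_fit(vram_gb=16, model_gb=9.76, n_layers=40))
# A smaller, hypothetical 8 GB card for comparison.
print(layers_that_fit(vram_gb=8, model_gb=9.76, n_layers=40))
```

Treat the result as a starting point only; the value of 32 used above leaves extra headroom, and the practical limit is best found by watching GPU memory while loading.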

In [None]:
model = SentenceTransformer('bert-base-nli-mean-tokens')

.gitattributes:   0%|          | 0.00/391 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/3.95k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/399 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

#**Step 5A: French to English Translation**

In [None]:
dataset_fr_en = pd.read_csv("/content/sample_data/en_fr_test.csv")

In [None]:
dataset_fr_en

In [None]:
# randomly sample data
df = dataset_fr_en.sample(n=500, random_state=1).reset_index()

In [None]:
df.to_csv('/content/sample_data/FR-EN-test-sampled.csv')  # same path that is read back below

NameError: ignored

In [None]:
df = pd.read_csv("/content/sample_data/FR-EN-test-sampled.csv")

In [None]:
df

Unnamed: 0.1,Unnamed: 0,index,English,French
0,0,1104,"""This morning we went back across the border t...","""Ce matin nous avons retraversé pour aller dan..."
1,1,2024,Bulgarian consumers will receive gas from Sout...,Les consommateurs bulgares recevront le gaz de...
2,2,1606,Coen Brothers' Homage to Folk Music,L'hommage au folk des frères Coen
3,3,1573,But there has been only a slight shift in the ...,Mais il n'y a eu qu'une légère avancée de l'âg...
4,4,394,"That lasted a long time, then one day he was g...",Ça a duré longtemps puis un jour il est parti.
...,...,...,...,...
495,495,2545,leading to an explosion in a house in Gesves t...,"L'explosion d'une habitation à Gesves, qui a f..."
496,496,2751,"Not long ago, we were asked to collectively de...","Voici peu, nous avons été collectivement convi..."
497,497,120,A survey published by Common Sense Media at th...,Un sondage publié en début de semaine par Comm...
498,498,108,"Under the current rules, seriously injured sol...","En vertu des règles actuelles, les soldats gri..."
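The `index` column in this table comes from `reset_index()`, and the `Unnamed: 0` columns are what pandas produces when a frame is written with its index and then read back. The sketch below reproduces both effects on a toy frame; the data and the file name are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the real test set (hypothetical data).
toy = pd.DataFrame({"English": list("abcdefgh"), "French": list("ABCDEFGH")})

# A fixed random_state makes the sample reproducible; reset_index() keeps
# each row's original position in a new 'index' column.
sampled = toy.sample(n=3, random_state=1).reset_index()
print(sampled)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_sampled.csv")

    # Without the index, the columns survive a round trip unchanged.
    sampled.to_csv(path, index=False)
    reloaded_columns = list(pd.read_csv(path).columns)

    # With the default index=True, an 'Unnamed: 0' column appears on reload.
    sampled.to_csv(path)
    with_index_columns = list(pd.read_csv(path).columns)

print(reloaded_columns)
print(with_index_columns)
```

Passing `index=False` when saving the sampled file would keep only the `index`, `English`, and `French` columns.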


#**Step 6A: Generating Response**

#Prompt 1 - Simple Colon

In [None]:
file_path = f'/content/drive/MyDrive/FR_EN_Simple_Prompt1.csv'

In [None]:
result_df = pd.DataFrame()

for i, row in tqdm(df.iloc[0:500].iterrows()):

  fr_sentence = df['French'][i]
  original_eng_sentence = df['English'][i]
  original_index = df['index'][i]


  #Simple Colon Prompt
  prompt_template=f'''French: {fr_sentence}.
  English:
  '''

  response=lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  echo=True)

  input_text = response["choices"][0]["text"]

  translated_english_sentence = input_text.split("English:")[1]

  org_eng_embeddings = model.encode(original_eng_sentence)
  english_embeddings = model.encode(translated_english_sentence)

  similarity = scipy.spatial.distance.cdist([org_eng_embeddings], [english_embeddings], "cosine")[0]
  similarity_score = 1 - similarity[0]

  row = {'Original_Index': original_index, 'French_Sentence': fr_sentence, 'Original_English_Sentence': original_eng_sentence, 'Translated_Sentence':translated_english_sentence, 'Similarity_Score': similarity_score}

  # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
  result_df = pd.concat([result_df, pd.DataFrame([row])], ignore_index=True)

  with open(file_path, 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(list(row.values()))


1it [02:01, 121.26s/it]Llama.generate: prefix-match hit
2it [02:59, 84.07s/it] Llama.generate: prefix-match hit
3it [05:21, 110.57s/it]Llama.generate: prefix-match hit
4it [06:11, 86.76s/it] Llama.generate: prefix-match hit
5it [06:52, 70.12s/it]Llama.generate: prefix-match hit
6it [08:57, 88.91s/it]Llama.generate: prefix-match hit
7it [09:57, 79.26s/it]Llama.generate: prefix-match hit
8it [12:12, 97.11s/it]Llama.generate: prefix-match hit
9it [14:33, 110.77s/it]Llama.generate: prefix-match hit

Done!





In [None]:
result_df

In [None]:
result_df.to_csv('FR_EN_Simple_Prompt1.csv', index=False)
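The similarity score in the loop above is one minus SciPy's cosine distance, which is the cosine similarity of the two sentence embeddings. As an illustration of what that number means, here is a dependency-free sketch; `cosine_similarity` is a hypothetical helper and the three-dimensional vectors are toy values, not real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (hypothetical values).
reference = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]   # scaled copy of the reference
orthogonal = [3.0, 0.0, -1.0]      # dot product with the reference is zero

sim_same = cosine_similarity(reference, same_direction)
sim_orth = cosine_similarity(reference, orthogonal)
print(sim_same, sim_orth)
```

Because the measure ignores vector length, a score near 1 says the two sentences point in the same direction in embedding space, not that they are word-for-word identical.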

#Prompt 2 - Master Translator

In [None]:
file_path = f'/content/drive/MyDrive/FR_EN_Master_Prompt2.csv'

In [None]:
result_df = pd.DataFrame()

for i, row in tqdm(df.iloc[0:500].iterrows()):

  fr_sentence = df['French'][i]
  original_eng_sentence = df['English'][i]
  original_index = df['index'][i]


  #Master Translator Prompt
  prompt_template=f'''A French phrase is provided: {fr_sentence}.
  The masterful French translator flawlessly translates the phrase
  into English:
  '''

  response=lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  echo=True)

  input_text = response["choices"][0]["text"]

  translated_english_sentence = input_text.split("English:")[1]

  org_eng_embeddings = model.encode(original_eng_sentence)
  english_embeddings = model.encode(translated_english_sentence)

  similarity = scipy.spatial.distance.cdist([org_eng_embeddings], [english_embeddings], "cosine")[0]
  similarity_score = 1 - similarity[0]

  row = {'Original_Index': original_index, 'French_Sentence': fr_sentence, 'Original_English_Sentence': original_eng_sentence, 'Translated_Sentence':translated_english_sentence, 'Similarity_Score': similarity_score}

  result_df = pd.concat([result_df, pd.DataFrame([row])], ignore_index=True)

  with open(file_path, 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(list(row.values()))


1it [02:01, 121.26s/it]Llama.generate: prefix-match hit
2it [02:59, 84.07s/it] Llama.generate: prefix-match hit
3it [05:21, 110.57s/it]Llama.generate: prefix-match hit
4it [06:11, 86.76s/it] Llama.generate: prefix-match hit
5it [06:52, 70.12s/it]Llama.generate: prefix-match hit
6it [08:57, 88.91s/it]Llama.generate: prefix-match hit
7it [09:57, 79.26s/it]Llama.generate: prefix-match hit
8it [12:12, 97.11s/it]Llama.generate: prefix-match hit
9it [14:33, 110.77s/it]Llama.generate: prefix-match hit

Done!





In [None]:
result_df

In [None]:
result_df.to_csv('FR_EN_Master_Prompt2.csv', index=False)

#Prompt 3 - Master Encouraged

In [None]:
file_path = f'/content/drive/MyDrive/FR_EN_Master_Prompt3.csv'

In [None]:
result_df = pd.DataFrame()

for i, row in tqdm(df.iloc[0:500].iterrows()):

  fr_sentence = df['French'][i]
  original_eng_sentence = df['English'][i]
  original_index = df['index'][i]

  #Master Translator Prompt - Enhanced
  prompt_template=f'''A French phrase is provided: {fr_sentence}.
  The extremely efficient masterful French translator flawlessly translates the phrase
  into English:
  '''

  response=lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  echo=True)

  input_text = response["choices"][0]["text"]

  translated_english_sentence = input_text.split("English:")[1]

  org_eng_embeddings = model.encode(original_eng_sentence)
  english_embeddings = model.encode(translated_english_sentence)

  similarity = scipy.spatial.distance.cdist([org_eng_embeddings], [english_embeddings], "cosine")[0]
  similarity_score = 1 - similarity[0]

  row = {'Original_Index': original_index, 'French_Sentence': fr_sentence, 'Original_English_Sentence': original_eng_sentence, 'Translated_Sentence':translated_english_sentence, 'Similarity_Score': similarity_score}

  result_df = pd.concat([result_df, pd.DataFrame([row])], ignore_index=True)

  with open(file_path, 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(list(row.values()))


1it [01:48, 108.13s/it]Llama.generate: prefix-match hit
2it [04:06, 126.11s/it]Llama.generate: prefix-match hit
3it [04:50, 88.43s/it] Llama.generate: prefix-match hit
4it [07:04, 106.55s/it]Llama.generate: prefix-match hit
5it [07:45, 82.84s/it] Llama.generate: prefix-match hit
6it [09:17, 86.08s/it]Llama.generate: prefix-match hit
7it [11:28, 100.66s/it]Llama.generate: prefix-match hit
8it [12:11, 82.12s/it] Llama.generate: prefix-match hit
9it [14:20, 96.93s/it]Llama.generate: prefix-match hit

In [None]:
result_df

In [None]:
result_df.to_csv('FR_EN_Master_Prompt3.csv', index=False)
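Each loop above appends rows to a CSV on Drive so that partial results survive an interrupted run, but it writes no header and always restarts from the first row. The sketch below shows one possible way to add a header and to find which rows are already done; `append_row` and `completed_indices` are hypothetical helpers, and the demo rows are made up.

```python
import csv
import os
import tempfile

FIELDS = ["Original_Index", "French_Sentence", "Original_English_Sentence",
          "Translated_Sentence", "Similarity_Score"]

def append_row(path, row):
    """Append one result row, writing the header only for a new or empty file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)

def completed_indices(path):
    """Return the Original_Index values already saved, for resuming a run."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {int(r["Original_Index"]) for r in csv.DictReader(f)}

# Demo with a temporary file and made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "results.csv")
    for idx in (1104, 2024):
        append_row(demo_path, {"Original_Index": idx, "French_Sentence": "fr",
                               "Original_English_Sentence": "en",
                               "Translated_Sentence": "en",
                               "Similarity_Score": 0.9})
    done = completed_indices(demo_path)

print(sorted(done))
```

In a translation loop, rows whose `index` value is already in `done` could be skipped, which matters when each sentence takes one to two minutes.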

#Prompt 4 - Master Discouraged

In [None]:
file_path = f'/content/drive/MyDrive/FR_EN_Master_Prompt4.csv'

In [None]:
result_df = pd.DataFrame()

for i, row in tqdm(df.iloc[0:500].iterrows()):

  fr_sentence = df['French'][i]
  original_eng_sentence = df['English'][i]
  original_index = df['index'][i]


  #Master Discouraged Prompt
  prompt_template=f'''A French phrase is provided: {fr_sentence}.
  The incompetent French translator translates the phrase into English:
  '''

  response=lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  echo=True)

  input_text = response["choices"][0]["text"]

  translated_english_sentence = input_text.split("English:")[1]

  org_eng_embeddings = model.encode(original_eng_sentence)
  english_embeddings = model.encode(translated_english_sentence)

  similarity = scipy.spatial.distance.cdist([org_eng_embeddings], [english_embeddings], "cosine")[0]
  similarity_score = 1 - similarity[0]

  row = {'Original_Index': original_index, 'French_Sentence': fr_sentence, 'Original_English_Sentence': original_eng_sentence, 'Translated_Sentence':translated_english_sentence, 'Similarity_Score': similarity_score}

  result_df = pd.concat([result_df, pd.DataFrame([row])], ignore_index=True)

  with open(file_path, 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(list(row.values()))


1it [02:36, 156.48s/it]Llama.generate: prefix-match hit
2it [04:49, 142.84s/it]Llama.generate: prefix-match hit
3it [06:02, 110.94s/it]Llama.generate: prefix-match hit
4it [08:18, 120.67s/it]Llama.generate: prefix-match hit
5it [10:31, 125.31s/it]Llama.generate: prefix-match hit
6it [11:11, 96.23s/it] Llama.generate: prefix-match hit
7it [12:18, 86.78s/it]Llama.generate: prefix-match hit
8it [13:11, 76.02s/it]Llama.generate: prefix-match hit
9it [13:50, 64.17s/it]Llama.generate: prefix-match hit

In [None]:
result_df

In [None]:
result_df.to_csv('FR_EN_Master_Prompt4.csv', index=False)
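The four zero-shot loops above differ only in their prompt text, so the templates can be kept in one place. This is a sketch, not the notebook's own structure: `TEMPLATES` and `build_prompt` are hypothetical names, and the wording is copied from the cells above without the incidental indentation that the triple-quoted strings pick up inside the loop body.

```python
# Zero-shot templates; {language} is the source language, {sentence} the input.
TEMPLATES = {
    "simple": "{language}: {sentence}.\nEnglish:\n",
    "master": (
        "A {language} phrase is provided: {sentence}.\n"
        "The masterful {language} translator flawlessly translates the phrase\n"
        "into English:\n"
    ),
    "encouraged": (
        "A {language} phrase is provided: {sentence}.\n"
        "The extremely efficient masterful {language} translator flawlessly "
        "translates the phrase\n"
        "into English:\n"
    ),
    "discouraged": (
        "A {language} phrase is provided: {sentence}.\n"
        "The incompetent {language} translator translates the phrase into English:\n"
    ),
}

def build_prompt(style: str, language: str, sentence: str) -> str:
    """Fill the chosen template with the source language and sentence."""
    return TEMPLATES[style].format(language=language, sentence=sentence)

prompt = build_prompt("master", "French", "L'hommage au folk des frères Coen")
print(prompt)
```

Every template contains the marker `English:` exactly once, which is what the `split("English:")[1]` extraction in the loops relies on.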

#Prompt 5 - Few Shot Prompt (1-Shot Simple Colon)

In [None]:
file_path = f'/content/drive/MyDrive/FR_EN_FewShot_1_Prompt5.csv'

In [None]:
result_df = pd.DataFrame()

for i, row in tqdm(df.iloc[0:500].iterrows()):

  fr_sentence = df['French'][i]
  original_eng_sentence = df['English'][i]
  original_index = df['index'][i]

  #Few Shot Prompt: 1 Shot - Simple Colon
  prompt_template=f'''
  French: « Ce n'est pas comme si nous avions le choix », a déclaré Hasan Ikhrata, directeur général de la Southern California Association of Governments, qui prévoit que l'État commence à enregistrer les miles parcourus par chaque automobiliste californien d'ici 2025.
  English: "It is not a matter of something we might choose to do," said Hasan Ikhrata, executive director of the Southern California Assn. of Governments, which is planning for the state to start tracking miles driven by every California motorist by 2025.

  French: {fr_sentence}.
  English:
  '''
  response=lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  echo=True)

  input_text = response["choices"][0]["text"]

  # Capture everything after the second "English:"; the first one belongs to the one-shot example
  match = re.search(r'(?:English:.*?){2}(.*)', input_text, re.DOTALL)

  if match:
      translated_english_sentence = match.group(1).strip()
  else:
      print("Match not found. The current_index, original index, FR sentence, EN sentence are as follows: ", i, original_index, fr_sentence, original_eng_sentence)
      break

  org_eng_embeddings = model.encode(original_eng_sentence)
  english_embeddings = model.encode(translated_english_sentence)

  similarity = scipy.spatial.distance.cdist([org_eng_embeddings], [english_embeddings], "cosine")[0]
  similarity_score = 1 - similarity[0]

  row = {'Original_Index': original_index, 'French_Sentence': fr_sentence, 'Original_English_Sentence': original_eng_sentence, 'Translated_Sentence':translated_english_sentence, 'Similarity_Score': similarity_score}

  result_df = pd.concat([result_df, pd.DataFrame([row])], ignore_index=True)

  with open(file_path, 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(list(row.values()))


1it [02:01, 121.26s/it]Llama.generate: prefix-match hit
2it [02:59, 84.07s/it] Llama.generate: prefix-match hit
3it [05:21, 110.57s/it]Llama.generate: prefix-match hit
4it [06:11, 86.76s/it] Llama.generate: prefix-match hit
5it [06:52, 70.12s/it]Llama.generate: prefix-match hit
6it [08:57, 88.91s/it]Llama.generate: prefix-match hit
7it [09:57, 79.26s/it]Llama.generate: prefix-match hit
8it [12:12, 97.11s/it]Llama.generate: prefix-match hit
9it [14:33, 110.77s/it]Llama.generate: prefix-match hit

Done!





In [None]:
result_df

In [None]:
result_df.to_csv('FR_EN_FewShot_1_Prompt5.csv', index=False)
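With `echo=True` the returned text begins with the prompt itself, so a one-shot prompt contributes two `English:` markers before the model's answer. The sketch below is one way to pull out the answer for any number of shots; `extract_translation` is a hypothetical helper and the echoed strings are made-up examples, not model output.

```python
def extract_translation(echoed_text: str, n_shots: int = 0, marker: str = "English:"):
    """Return the text after the last prompt marker, or None if it is missing.

    The echoed prompt contains n_shots + 1 occurrences of the marker, so the
    answer is whatever follows occurrence number n_shots + 1.
    """
    parts = echoed_text.split(marker, n_shots + 1)
    if len(parts) < n_shots + 2:
        return None
    return parts[-1].strip()

# Made-up echoed outputs for illustration.
zero_shot = "French: Bonjour.\nEnglish:\nHello."
one_shot = ("French: Merci.\nEnglish: Thank you.\n\n"
            "French: Bonjour.\nEnglish:\nHello.")

print(extract_translation(zero_shot))
print(extract_translation(one_shot, n_shots=1))
print(extract_translation("no marker here"))
```

Returning `None` instead of raising lets the calling loop decide whether to skip the row or stop.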

#**Step 5B: Czech to English Translation**

In [None]:
dataset_cs_en = pd.read_csv("/content/sample_data/en_cs_test.csv")

In [None]:
dataset_cs_en

In [None]:
# randomly sample data
df = dataset_cs_en.sample(n=500, random_state=1).reset_index()

In [None]:
df.to_csv('/content/sample_data/CS-EN-test-sampled.csv')  # same path that is read back below

In [None]:
df = pd.read_csv("/content/sample_data/CS-EN-test-sampled.csv")

In [None]:
df

#**Step 6B: Generating Response**

#Prompt 1 - Simple Colon (0-Shot)

In [None]:
file_path = f'/content/drive/MyDrive/CS_EN_Simple_Prompt1.csv'

In [None]:
result_df = pd.DataFrame()

for i, row in tqdm(df.iloc[0:500].iterrows()):

  cz_sentence = df['Czech'][i]
  original_eng_sentence = df['English'][i]
  original_index = df['index'][i]


  #Simple Colon Prompt
  prompt_template=f'''Czech: {cz_sentence}.
  English:
  '''

  response=lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  echo=True)

  input_text = response["choices"][0]["text"]

  translated_english_sentence = input_text.split("English:")[1]

  org_eng_embeddings = model.encode(original_eng_sentence)
  english_embeddings = model.encode(translated_english_sentence)

  similarity = scipy.spatial.distance.cdist([org_eng_embeddings], [english_embeddings], "cosine")[0]
  similarity_score = 1 - similarity[0]

  row = {'Original_Index': original_index, 'Czech_Sentence': cz_sentence, 'Original_English_Sentence': original_eng_sentence, 'Translated_Sentence':translated_english_sentence, 'Similarity_Score': similarity_score}

  result_df = pd.concat([result_df, pd.DataFrame([row])], ignore_index=True)

  with open(file_path, 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(list(row.values()))


1it [02:01, 121.26s/it]Llama.generate: prefix-match hit
2it [02:59, 84.07s/it] Llama.generate: prefix-match hit
3it [05:21, 110.57s/it]Llama.generate: prefix-match hit
4it [06:11, 86.76s/it] Llama.generate: prefix-match hit
5it [06:52, 70.12s/it]Llama.generate: prefix-match hit
6it [08:57, 88.91s/it]Llama.generate: prefix-match hit
7it [09:57, 79.26s/it]Llama.generate: prefix-match hit
8it [12:12, 97.11s/it]Llama.generate: prefix-match hit
9it [14:33, 110.77s/it]Llama.generate: prefix-match hit

Done!





In [None]:
result_df

In [None]:
result_df.to_csv('CS_EN_Simple_Prompt1.csv', index=False)

#Prompt 2 - Master

In [None]:
file_path = f'/content/drive/MyDrive/CS_EN_Master_Prompt2.csv'

In [None]:
result_df = pd.DataFrame()

for i, row in tqdm(df.iloc[0:500].iterrows()):

  cz_sentence = df['Czech'][i]
  original_eng_sentence = df['English'][i]
  original_index = df['index'][i]

  #Master Translator Prompt
  prompt_template=f'''A Czech phrase is provided: {cz_sentence}.
  The masterful Czech translator flawlessly translates the phrase
  into English:
  '''

  response=lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  echo=True)

  input_text = response["choices"][0]["text"]

  translated_english_sentence = input_text.split("English:")[1]

  org_eng_embeddings = model.encode(original_eng_sentence)
  english_embeddings = model.encode(translated_english_sentence)

  similarity = scipy.spatial.distance.cdist([org_eng_embeddings], [english_embeddings], "cosine")[0]
  similarity_score = 1 - similarity[0]

  row = {'Original_Index': original_index, 'Czech_Sentence': cz_sentence, 'Original_English_Sentence': original_eng_sentence, 'Translated_Sentence':translated_english_sentence, 'Similarity_Score': similarity_score}

  result_df = pd.concat([result_df, pd.DataFrame([row])], ignore_index=True)

  with open(file_path, 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(list(row.values()))


1it [02:01, 121.26s/it]Llama.generate: prefix-match hit
2it [02:59, 84.07s/it] Llama.generate: prefix-match hit
3it [05:21, 110.57s/it]Llama.generate: prefix-match hit
4it [06:11, 86.76s/it] Llama.generate: prefix-match hit
5it [06:52, 70.12s/it]Llama.generate: prefix-match hit
6it [08:57, 88.91s/it]Llama.generate: prefix-match hit
7it [09:57, 79.26s/it]Llama.generate: prefix-match hit
8it [12:12, 97.11s/it]Llama.generate: prefix-match hit
9it [14:33, 110.77s/it]Llama.generate: prefix-match hit

Done!





In [None]:
result_df

In [None]:
result_df.to_csv('CS_EN_Master_Prompt2.csv', index=False)

#Prompt 3 - Master Encouraged

In [None]:
file_path = f'/content/drive/MyDrive/CS_EN_Master_Prompt3.csv'

In [None]:
result_df = pd.DataFrame()

for i, row in tqdm(df.iloc[0:500].iterrows()):

  cz_sentence = df['Czech'][i]
  original_eng_sentence = df['English'][i]
  original_index = df['index'][i]


  #Master Translator Prompt - Enhanced
  prompt_template=f'''A Czech phrase is provided: {cz_sentence}.
  The extremely efficient masterful Czech translator flawlessly translates the phrase
  into English:
  '''

  response=lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  echo=True)

  input_text = response["choices"][0]["text"]

  translated_english_sentence = input_text.split("English:")[1]

  org_eng_embeddings = model.encode(original_eng_sentence)
  english_embeddings = model.encode(translated_english_sentence)

  similarity = scipy.spatial.distance.cdist([org_eng_embeddings], [english_embeddings], "cosine")[0]
  similarity_score = 1 - similarity[0]

  row = {'Original_Index': original_index, 'Czech_Sentence': cz_sentence, 'Original_English_Sentence': original_eng_sentence, 'Translated_Sentence':translated_english_sentence, 'Similarity_Score': similarity_score}

  result_df = pd.concat([result_df, pd.DataFrame([row])], ignore_index=True)

  with open(file_path, 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(list(row.values()))


1it [02:01, 121.26s/it]Llama.generate: prefix-match hit
2it [02:59, 84.07s/it] Llama.generate: prefix-match hit
3it [05:21, 110.57s/it]Llama.generate: prefix-match hit
4it [06:11, 86.76s/it] Llama.generate: prefix-match hit
5it [06:52, 70.12s/it]Llama.generate: prefix-match hit
6it [08:57, 88.91s/it]Llama.generate: prefix-match hit
7it [09:57, 79.26s/it]Llama.generate: prefix-match hit
8it [12:12, 97.11s/it]Llama.generate: prefix-match hit
9it [14:33, 110.77s/it]Llama.generate: prefix-match hit

Done!





In [None]:
result_df

In [None]:
result_df.to_csv('CS_EN_Master_Prompt3.csv', index=False)

#Prompt 4 - Master Discouraged

In [None]:
file_path = f'/content/drive/MyDrive/CS_EN_Master_Prompt4.csv'

In [None]:
result_df = pd.DataFrame()

for i, row in tqdm(df.iloc[0:500].iterrows()):

  cz_sentence = df['Czech'][i]
  original_eng_sentence = df['English'][i]
  original_index = df['index'][i]


  #Master Discouraged Prompt
  prompt_template=f'''A Czech phrase is provided: {cz_sentence}.
  The incompetent Czech translator translates the phrase into English:
  '''

  response=lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  echo=True)

  input_text = response["choices"][0]["text"]

  translated_english_sentence = input_text.split("English:")[1]

  org_eng_embeddings = model.encode(original_eng_sentence)
  english_embeddings = model.encode(translated_english_sentence)

  similarity = scipy.spatial.distance.cdist([org_eng_embeddings], [english_embeddings], "cosine")[0]
  similarity_score = 1 - similarity[0]

  row = {'Original_Index': original_index, 'Czech_Sentence': cz_sentence, 'Original_English_Sentence': original_eng_sentence, 'Translated_Sentence':translated_english_sentence, 'Similarity_Score': similarity_score}

  result_df = pd.concat([result_df, pd.DataFrame([row])], ignore_index=True)

  with open(file_path, 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(list(row.values()))


1it [02:36, 156.48s/it]Llama.generate: prefix-match hit
2it [04:49, 142.84s/it]Llama.generate: prefix-match hit
3it [06:02, 110.94s/it]Llama.generate: prefix-match hit
4it [08:18, 120.67s/it]Llama.generate: prefix-match hit
5it [10:31, 125.31s/it]Llama.generate: prefix-match hit
6it [11:11, 96.23s/it] Llama.generate: prefix-match hit
7it [12:18, 86.78s/it]Llama.generate: prefix-match hit
8it [13:11, 76.02s/it]Llama.generate: prefix-match hit
9it [13:50, 64.17s/it]Llama.generate: prefix-match hit

In [None]:
result_df

In [None]:
result_df.to_csv('CS_EN_Master_Prompt4.csv', index=False)

#Prompt 5 - Few Shot Prompt (1-Shot Simple Colon)

In [None]:
file_path = f'/content/drive/MyDrive/CS_EN_FewShot_1_Prompt5.csv'

In [None]:
result_df = pd.DataFrame()

for i, row in tqdm(df.iloc[0:500].iterrows()):

  cz_sentence = df['Czech'][i]
  original_eng_sentence = df['English'][i]
  original_index = df['index'][i]

  # 1 Few Shot - Simple Colon
  prompt_template=f'''
  Czech: Zatímco američtí silniční projektanti se usilovně snaží najít peníze na opravy rozpadající se dálniční sítě, mnozí začínají vidět řešení v malé černé skříňce, která se snadno vejde do přístrojové desky vašeho auta.
  English: As America's road planners struggle to find the cash to mend a crumbling highway system, many are beginning to see a solution in a little black box that fits neatly by the dashboard of your car.

  Czech: {cz_sentence}.
  English:
  '''

  response=lcpp_llm(prompt=prompt_template, max_tokens=256, temperature=0.5, top_p=0.95,
                  repeat_penalty=1.2, top_k=150,
                  echo=True)

  input_text = response["choices"][0]["text"]

  match = re.search(r'(?:English:.*?){2}(.*)', input_text, re.DOTALL)

  if match:
      translated_english_sentence = match.group(1).strip()
  else:
      print("Match not found. The current_index, original index, CZ sentence, EN sentence are as follows: ", i, original_index, cz_sentence, original_eng_sentence)
      break


  org_eng_embeddings = model.encode(original_eng_sentence)
  english_embeddings = model.encode(translated_english_sentence)

  similarity = scipy.spatial.distance.cdist([org_eng_embeddings], [english_embeddings], "cosine")[0]
  similarity_score = 1 - similarity[0]

  row = {'Original_Index': original_index, 'Czech_Sentence': cz_sentence, 'Original_English_Sentence': original_eng_sentence, 'Translated_Sentence':translated_english_sentence, 'Similarity_Score': similarity_score}

  result_df = pd.concat([result_df, pd.DataFrame([row])], ignore_index=True)

  with open(file_path, 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(list(row.values()))


1it [02:01, 121.26s/it]Llama.generate: prefix-match hit
2it [02:59, 84.07s/it] Llama.generate: prefix-match hit
3it [05:21, 110.57s/it]Llama.generate: prefix-match hit
4it [06:11, 86.76s/it] Llama.generate: prefix-match hit
5it [06:52, 70.12s/it]Llama.generate: prefix-match hit
6it [08:57, 88.91s/it]Llama.generate: prefix-match hit
7it [09:57, 79.26s/it]Llama.generate: prefix-match hit
8it [12:12, 97.11s/it]Llama.generate: prefix-match hit
9it [14:33, 110.77s/it]Llama.generate: prefix-match hit

Done!





In [None]:
result_df

In [None]:
result_df.to_csv('CS_EN_FewShot_1_Prompt5.csv', index=False)
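Once every prompt has its own results file, the prompts can be compared by their mean similarity score. The sketch below assumes each CSV has a `Similarity_Score` header column, as the `result_df.to_csv(...)` files do; `mean_similarity` is a hypothetical helper, and the demo files and scores are made up so the example runs without the real outputs.

```python
import csv
import os
import statistics
import tempfile

def mean_similarity(path):
    """Average the Similarity_Score column of one results file."""
    with open(path, newline="", encoding="utf-8") as f:
        scores = [float(r["Similarity_Score"]) for r in csv.DictReader(f)]
    return statistics.mean(scores)

# Demo on two tiny made-up files; a real run would point at the saved CSVs.
demo = {"Prompt1": [0.80, 0.90], "Prompt4": [0.50, 0.70]}
summary = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, scores in demo.items():
        path = os.path.join(tmp, f"{name}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Similarity_Score"])
            writer.writerows([s] for s in scores)
        summary[name] = mean_similarity(path)

# Print prompts from highest to lowest mean score.
for name, score in sorted(summary.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```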

#**Step 7: BLEU Score Generation**

In [None]:
!pip install sacrebleu
!pip install pytorch-transformers

Collecting sacrebleu
  Downloading sacrebleu-2.3.3-py3-none-any.whl (106 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 106.4/106.4 kB 2.2 MB/s eta 0:00:00
Collecting portalocker (from sacrebleu)
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.6 portalocker-2.8.2 sacrebleu-2.3.3
Collecting pytorch-transformers
  Downloading pytorch_transformers-1.2.0-py3-none-any.whl (176 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.4/176.4 kB 5.4 MB/s eta 0:00

In [None]:
import sacrebleu
from sacremoses import MosesDetokenizer
import pandas as pd

In [None]:
md = MosesDetokenizer(lang='en')

refs_df = pd.read_csv("/content/FR_EN_Test.csv")
refs = []

for line in refs_df['test']:
    line = str(line).strip().split()
    line = md.detokenize(line)
    refs.append(line)

print("Reference 1st sentence:", refs[0])

preds_df = pd.read_csv("/content/FR_EN_Pred_Trans_Master.csv")
preds = []

for line in preds_df['pred']:
    line = str(line).strip().split()
    line = md.detokenize(line)
    preds.append(line)

print("Pred 1st sentence:", preds[0])

with open("bleu.txt", "w+") as output:
    for test, pred in zip(refs, preds):
        print(test, "\t--->\t", pred)
        bleu = sacrebleu.sentence_bleu(pred, [test], smooth_method='exp')
        print(bleu.score, "\n")
        output.write(str(bleu.score) + "\n")

Reference 1st sentence: "This morning we went back across the border to go back to our fields, but the soldiers told us to go back," the AFP was told by Imelda Nyirankusi, surrounded by her nine children, including an infant on her back.
Pred 1st sentence: "This morning we crossed over to go to our fields, but the military told us to retreat", said Imelda Nyirankusi, surrounded by her nine children, including a newborn on her back. In this example, the French phrase is translated word-for-word into English using an interlinear gloss (Ce matin nous avons traversé pour aller dans nos champs) and then paraphrased in more natural English to convey the meaning of the original sentence (This morning we crossed over to go to our fields). The translation includes some minor adjustments to vocabulary and word order to make it easier for an English speaker to understand. Interlinear gloss is a useful tool for translators, especially those who are just starting out or working with languages that 
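The prediction above carries extra commentary after the translation, and BLEU penalizes that because every added word counts against n-gram precision. To make the effect concrete, here is a simplified sentence-level BLEU in plain Python. It is an illustrative sketch only: `simple_sentence_bleu` is a hypothetical function that uses whitespace tokens and add-one smoothing, so its scores will not match sacrebleu's tokenization or `'exp'` smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_sentence_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified BLEU on whitespace tokens, scaled to 0-100."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n > 1:  # add-one smoothing keeps short sentences from scoring 0
            overlap, total = overlap + 1, total + 1
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty applies only when the hypothesis is shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
exact = simple_sentence_bleu(reference, reference)
padded = simple_sentence_bleu(
    reference + " In this example the phrase is translated", reference)
print(exact, padded)
```

The padded hypothesis contains the full reference yet scores lower than the exact match, which is the same effect the trailing explanation has on the scores written to `bleu.txt`.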