## Basic termbase: Simple matching

In this coursework I could not find a way to match n-gram terms (key phrases) through vector space alignment, because the code in [notebook 03](03_aligning_vector_spaces+finding_matches.ipynb) only works for single words.

The workaround below is crude, but it seems to deliver on this particular corpus, where many words and phrases that qualify as terms (names of characters, locations, buildings, etc.) appear as separate segments. I compare every EN segment in the corpus with my previously finalized list of keywords; if a segment is a complete match, I assume that the correct translation for the term is the corresponding RU segment in its entirety.
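
A minimal sketch of this matching rule on toy data. The helper name `simple_match`, the segment pairs and the keyword set are assumptions made for illustration; the notebook's own version follows in the cells below.

```python
from string import punctuation

# Keep double quotes, which are needed in the Russian segments
PUNCT = punctuation.replace('"', '')

def simple_match(segment_pairs, keywords):
    """Return {EN segment: RU segment} for segments that fully match a keyword.

    Hypothetical helper: matching is done on the lowercased segment with
    leading/trailing punctuation removed; inner punctuation is untouched.
    """
    matches = {}
    for en, ru in segment_pairs.items():
        if en.lower().strip(PUNCT) in keywords:
            matches[en.strip(PUNCT)] = ru.strip(PUNCT)
    return matches

# Toy data in the style of the corpus (illustrative only)
toy_pairs = {
    'Beach Chest': 'Пляжный сундук',
    'Collect!': 'Собрать!',
    'Tap the chest to open it': 'Нажмите на сундук, чтобы открыть его',
}
toy_keywords = {'beach chest', 'collect', 'chest'}

toy_termbase = simple_match(toy_pairs, toy_keywords)
print(toy_termbase)  # {'Beach Chest': 'Пляжный сундук', 'Collect': 'Собрать'}
```

The long segment contains the keyword `chest` but is not a complete match, so it is skipped; this is why the approach only picks up terms that occur as standalone segments.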

In [1]:
%run utility_file     # handles module imports and loading .csv files
from utility_file import Preprocess     # custom class for preprocessing text
import numpy
from string import punctuation
import pickle

punctuation = punctuation.replace('\"', "")     # double quotes need to be kept in Russian

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sveta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# loading .csv and dropping segments longer than 3 words

path = 'pi2.csv'
source_lang = 'English'
target_lang = 'Russian'

source_list, target_list = load_separate_corpora_from_csv(path, source_lang, target_lang)
corpus_as_dict = dict(zip(source_list, target_list))
corpus_df = pd.DataFrame(list(corpus_as_dict.items()), columns = [source_lang, target_lang])
corpus_df = corpus_df[corpus_df[source_lang].apply(lambda x: len(x.split()) <= 3)]   # dropping segments longer than 3 words
corpus_df = corpus_df.drop_duplicates(subset=[source_lang, target_lang], keep='last')
corpus_df

Unnamed: 0,English,Russian
0,Sinister Crystal,Зловещий кристалл
2,Beach Chest,Пляжный сундук
5,Wave Stamp,"Марка ""Волна"""
8,Warrior's Chest,Сундук воителя
11,Baby Turtle Stamp,"Марка ""Черепашонок"""
...,...,...
19212,Floating Park,Плавучий парк
19214,Sea Party,Морская вечеринка
19215,Pirate Lunch,Пиратский обед
19216,Island Hiking,Хайкинг по острову


In [3]:
# loading pickled keywords

with open('keywords.pkl', 'rb') as f:
    keywords = pickle.load(f)

In [18]:
# if an entry, lowercased and stripped of leading/trailing punctuation, appears in our list of keywords,
# we add it to the termbase with its original casing

termbase = {}
for segment in corpus_as_dict:
    if segment.lower().strip(punctuation) in keywords:
        termbase[segment.strip(punctuation)] = corpus_as_dict[segment].strip(punctuation)

In [19]:
# taking a look

termbase_df = pd.DataFrame(list(termbase.items()), columns = [source_lang, target_lang])
termbase_df

Unnamed: 0,English,Russian
0,Sinister Crystal,Зловещий кристалл
1,Accessory,Аксессуар
2,Achievement reward,Награда за достижение
3,Collect,Собрать
4,Today,Cегодня
...,...,...
617,Flower Photoshoot,Цветочная фотосессия
618,Entertainment buildings,Развлекательных зданий
619,Best Price,Лучший выбор
620,Sun Set,Солнечный набор


In [9]:
# loading termbase

with open('termbase_unigrams.pkl', 'rb') as f:
    unigrams = pickle.load(f)

In [20]:
# after simple matching, we add the missing terms and their translations from the termbase created previously.

termbase_en_lower = {term.lower() for term in termbase}     # a set makes the membership test below fast
for word in unigrams:
    if word not in termbase_en_lower:
        termbase[word] = unigrams[word]
        
termbase_full_df = pd.DataFrame(list(termbase.items()), columns=[source_lang, target_lang])
termbase_full_df

Unnamed: 0,English,Russian
0,Sinister Crystal,Зловещий кристалл
1,Accessory,Аксессуар
2,Achievement reward,Награда за достижение
3,Collect,Собрать
4,Today,Cегодня
...,...,...
950,lantern,украсить
951,work,работать
952,golem,голь
953,cart,дар


In [46]:
# saving .csv

termbase_full_df.to_csv("termbase_full.csv", encoding = "utf-8")

In [21]:
# Let's see how many entries appear both in our TB and the TB manually created by the translators

manual_tb = pd.read_csv('PI2TB.csv', sep=',')
manual_tb = manual_tb[source_lang].drop_duplicates()
manual_tb_list = manual_tb.tolist()
manual_tb_list_clean = [x.lower().strip() for x in manual_tb_list]
print('Number of entries in the manual TB:', len(manual_tb_list_clean))

our_tb_clean = [x.lower().strip() for x in list(termbase.keys())]
common = [x for x in manual_tb_list_clean if x in our_tb_clean]
print('Number of entries appearing both in our TB and in the manual TB:', len(common))

Number of entries in the manual TB: 2427
Number of entries appearing both in our TB and in the manual TB: 382


Not too much: only 382 of the 2427 manual entries (about 16%) also appear in our termbase. Alas.
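
The same comparison can be written with sets, which also makes it easy to report the overlap as a share of the manual TB. This is only a sketch: `tb_overlap` and the toy lists are hypothetical, and because it normalises before deduplicating, its counts could differ slightly from the cell above if the manual TB contains entries that differ only in case or whitespace.

```python
def tb_overlap(our_terms, manual_terms):
    """Normalise both term lists and return (common terms, share of manual TB covered)."""
    ours = {t.lower().strip() for t in our_terms}
    manual = {t.lower().strip() for t in manual_terms}
    common = ours & manual
    coverage = len(common) / len(manual) if manual else 0.0
    return common, coverage

# Toy lists for illustration only
toy_ours = ['Sinister Crystal', 'golem', 'Sun Set']
toy_manual = ['sinister crystal ', 'Golem', 'Floating Park', 'Sea Party']

common, coverage = tb_overlap(toy_ours, toy_manual)
print(sorted(common), f'{coverage:.0%}')  # ['golem', 'sinister crystal'] 50%
```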

## Adding terms manually

In notebook 03 I created a pool of the 3 most common RU words for each EN term. Sometimes the system fails to pick the correct translation even from this pool, so the function below lets you choose it manually and write it to the termbase, replacing the incorrect entry.

In [16]:
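def options():
    """Look up a term, show its top-3 candidates and overwrite the termbase entry with the chosen one."""
    term = input("Look up term: ")
    with open('termbase_top_3.pkl', 'rb') as f:
        unigrams3 = pickle.load(f)
    if term not in unigrams3:
        print('No such term')
        return
    candidates = unigrams3[term]
    print(candidates)
    try:
        right_num = int(input('Which one is the correct one? Enter 1, 2 or 3. '))
    except ValueError:
        print('That wasn\'t a number')
        return
    # explicit range check: 0 would otherwise become index -1 and silently pick the last candidate
    if not 1 <= right_num <= len(candidates):
        print('Please enter 1, 2 or 3')
        return
    new_terms = {term: candidates[right_num - 1]}
    print('The new entry is ', new_terms)
    # index_col=0 reads the saved index back instead of adding an extra "Unnamed: 0" column
    termbase = pd.read_csv("termbase_full.csv", encoding = "utf-8", index_col = 0)
    termbase.columns = ['source', 'target']
    termbase.loc[termbase['source'] == term, 'target'] = new_terms[term]
    termbase.to_csv("termbase_full.csv", encoding = "utf-8")
    print('Termbase updated successfully')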

In [4]:
with open('termbase_top_3.pkl', 'rb') as f:
    unigrams3 = pickle.load(f)

In [20]:
options()

Look up term: golem
['голем', 'голь', 'сумка']
Which one is the correct one? Enter 1, 2 or 3. 1
The new entry is  {'golem': 'голем'}
Termbase updated successfully
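
One detail worth isolating is the conversion of the 1-based answer into a list index: in Python, an answer of `0` turns into index `-1`, which silently selects the last candidate instead of raising an error. A small sketch of the validation step on its own, where `pick_candidate` is a hypothetical helper and not part of the notebook:

```python
def pick_candidate(candidates, choice):
    """Return the candidate for a 1-based choice, or raise ValueError."""
    number = int(choice)  # raises ValueError for non-numeric input
    if not 1 <= number <= len(candidates):
        raise ValueError(f'choice must be between 1 and {len(candidates)}')
    return candidates[number - 1]

top3 = ['голем', 'голь', 'сумка']
print(pick_candidate(top3, '1'))  # голем
```

Keeping the check in a plain function like this makes it testable without going through `input()`.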


There is also an option to add a term pair that does not appear in the final termbase at all.

In [None]:
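def add_term():
    """Append a new source/target pair to the saved termbase."""
    source_term = input('Enter source term: ')
    target_term = input('Enter target term: ')
    # index_col=0 reads the saved index back instead of adding an extra "Unnamed: 0" column
    termbase = pd.read_csv("termbase_full.csv", encoding = "utf-8", index_col = 0)
    termbase.columns = ['source', 'target']
    # one row keyed by column name; DataFrame.append was removed in pandas 2.0, so pd.concat is used
    new_row = pd.DataFrame([{'source': source_term, 'target': target_term}])
    termbase = pd.concat([termbase, new_row], ignore_index = True)
    termbase.to_csv("termbase_full.csv", encoding = "utf-8")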

In [21]:
termbase = pd.read_csv("termbase_full.csv", encoding = "utf-8")
termbase

Unnamed: 0.1,Unnamed: 0,source,target
0,0,Sinister Crystal,Зловещий кристалл
1,1,Accessory,Аксессуар
2,2,Achievement reward,Награда за достижение
3,3,Collect,Собрать
4,4,Today,Cегодня
...,...,...,...
881,881,hear,слышать
882,882,work,работать
883,883,golem,голем
884,884,cart,дар
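
The extra `Unnamed: 0` column in the output above is what pandas produces when a DataFrame is saved with its index and read back without `index_col`. A small self-contained sketch of that round trip, using an in-memory buffer and two toy rows instead of `termbase_full.csv`:

```python
import io
import pandas as pd

df = pd.DataFrame({'source': ['golem', 'work'], 'target': ['голем', 'работать']})

csv_text = df.to_csv()                                    # index is written as an unnamed first column
naive = pd.read_csv(io.StringIO(csv_text))                # index comes back as a data column
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)   # index is restored as the index

print(list(naive.columns))   # ['Unnamed: 0', 'source', 'target']
print(list(clean.columns))   # ['source', 'target']
```

Either reading with `index_col=0` or saving with `index=False` avoids the extra column; which one to use depends on whether the row numbers are worth keeping in the file.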
