# Report of the merge results

This notebook generates the plots used to analyse the results of the tag generation. It produces the following plots:

1. Distribution of frequency of tokens vs number of tokens 
2. Distribution of number of tokens with and without a correct spelling
3. Comparison plots of _1_ and _2_, before and after merge.
4. Distribution of similarity scores at each round (and iteration) in merging the tokens.

## Imports

In [None]:
import json
import pickle

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

## Correctly spelled words

The list of correctly spelled words is stored as a pickle file at `data/intermediate_steps/words_with_correct_spellings.pickle`. The file is read here so that the set of words can be used in the rest of the notebook.

In [None]:
with open(
    "./../data/intermediate_steps/words_with_correct_spellings.pickle", "rb"
) as outfile:
    words_with_correct_spellings = pickle.load(outfile)

## Before merge

At this stage, most special characters have been cleaned from the profession strings, mistakenly split tokens have been merged, and the tokens have been generated after refiltering with dots and after normalising and de-normalising the tokens that are not present in the dictionary.

### Read the unique tokens before merging

In [None]:
with open("./../data/intermediate_steps/unique_denormalised_tokens.json", encoding="utf8") as f:
    unique_tokens_before_merge_with_count = json.load(f)
    
unique_tokens_before_merge = unique_tokens_before_merge_with_count.keys()
unique_tokens_before_merge_in_dict = [uniq_tok for uniq_tok in unique_tokens_before_merge if uniq_tok in words_with_correct_spellings]
num_of_unique_tokens_before_merge = len(unique_tokens_before_merge)
num_of_unique_tokens_before_merge_in_dict = len(unique_tokens_before_merge_in_dict)
percent_in_dict = round(((num_of_unique_tokens_before_merge_in_dict/num_of_unique_tokens_before_merge)*100),2)

print("The number of unique tokens that represent the professions before merging is {}, \nand {} of them are in the dictionary ({}%).".format(
    num_of_unique_tokens_before_merge, num_of_unique_tokens_before_merge_in_dict, percent_in_dict))

### Distribution of Frequency of Token Frequency

In [None]:
freq_of_token_freq_before_merge = {}

for token, token_prop in unique_tokens_before_merge_with_count.items():
    count = token_prop["count"]
    if count not in freq_of_token_freq_before_merge:
        freq_of_token_freq_before_merge[count] = {"no_of_tokens": 0, "in_dictionary": 0, "not_in_dictionary":0}
    freq_of_token_freq_before_merge[count]["no_of_tokens"] += 1
    if token in words_with_correct_spellings:
        freq_of_token_freq_before_merge[count]["in_dictionary"] += 1
    else:
        freq_of_token_freq_before_merge[count]["not_in_dictionary"] += 1
    
freq_of_token_freq_before_merge_df = pd.DataFrame.from_dict(freq_of_token_freq_before_merge,
                                                  orient='index'
                                                 ).sort_values(by=['no_of_tokens'], ascending=False
                                                              ).rename_axis('token_frequency').reset_index()

In [None]:
freq_of_token_freq_before_merge_df

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)
count_freq_plot_before_merge = sns.scatterplot(x="token_frequency", y="no_of_tokens",
                                            data=freq_of_token_freq_before_merge_df, ax=ax)
count_freq_plot_before_merge.set(yscale='log')
count_freq_plot_before_merge.set(xscale='log')
count_freq_plot_before_merge.set(title="Distribution of token frequency before merge")
count_freq_plot_before_merge.set(xlabel="Frequency of Tokens (log scale)")
count_freq_plot_before_merge.set(ylabel="Number of Tokens (log scale)")
ax.grid()

In the above plot (with both axes on a log scale), the X axis is the frequency of a token in the dataset, i.e. the number of times it appears in the 4M lines. The Y axis is the number of tokens that have a given frequency. The plot shows that nearly 50,000 of the 87,698 tokens occur only once, and the next largest group, about 10,000 tokens, occurs only twice. The tokens that appear fewer than 6 times together make up 80% of the tokens, and 89% of them are not in the dictionary. At the other end, only a few tokens have a very high frequency, and they are also present in the dictionary.

The tokens with a low frequency that are not in the dictionary are wrongly spelt, are foreign words that are not in the French dictionary, or are abbreviated forms of longer words; a few have an old spelling or are words no longer in use. Our aim in normalising the tokens is to correct the spellings of these tokens and to expand the abbreviated forms so that most of them are in the dictionary.
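Shares such as the ones quoted above can be recomputed from a frequency table like `freq_of_token_freq_before_merge_df`. The following is a minimal sketch, assuming the table has the `token_frequency`, `no_of_tokens` and `not_in_dictionary` columns built earlier. The helper name `low_frequency_summary`, its `max_frequency` parameter and the toy numbers are hypothetical and are not the notebook's results.

```python
import pandas as pd


def low_frequency_summary(freq_df, max_frequency=5):
    """Return (share of tokens with frequency <= max_frequency,
    share of those low-frequency tokens that are not in the dictionary),
    both as percentages."""
    low = freq_df[freq_df["token_frequency"] <= max_frequency]
    total_tokens = freq_df["no_of_tokens"].sum()
    low_tokens = low["no_of_tokens"].sum()
    share_low = 100 * low_tokens / total_tokens
    share_low_not_in_dict = 100 * low["not_in_dictionary"].sum() / low_tokens
    return float(round(share_low, 2)), float(round(share_low_not_in_dict, 2))


# Toy frequency table (invented values, same column layout as the notebook's table)
toy_df = pd.DataFrame(
    {
        "token_frequency": [1, 2, 5, 10, 1000],
        "no_of_tokens": [50, 20, 10, 15, 5],
        "in_dictionary": [5, 3, 2, 10, 5],
        "not_in_dictionary": [45, 17, 8, 5, 0],
    }
)

share_low, share_low_not_in_dict = low_frequency_summary(toy_df)
print(share_low, share_low_not_in_dict)
```

With the real data, the same call would be `low_frequency_summary(freq_of_token_freq_before_merge_df)`, where `max_frequency=5` corresponds to tokens that appear fewer than 6 times.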

### Distribution of Tokens in the list of correctly spelled words

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="in_dictionary", label='Has correct spelling', ax=ax)
sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled', ax=ax)

ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Distribution of token spelling correctness before merge")
ax.set(xlabel="Frequency of Tokens (log scale)")
ax.set(ylabel="Number of Tokens (log scale)")
ax.grid()

## After creation of tags


After merging the tokens and filling in the full forms, the resulting tokens are compared with the tokens before the merge.


### Merging tokens with 75% similarity

#### Read the unique tokens after tag generation

In [None]:
with open("./../data/intermediate_steps/sim_score_75/cleaned_unique_tokens.json", encoding="utf8") as f:
    unique_tokens_after_merge_sim75_with_count = json.load(f)
    
unique_tokens_after_merge_sim75 = unique_tokens_after_merge_sim75_with_count.keys()
unique_tokens_after_merge_sim75_in_dict = [uniq_tok for uniq_tok in unique_tokens_after_merge_sim75 if uniq_tok in words_with_correct_spellings]
num_of_unique_tokens_after_merge_sim75 = len(unique_tokens_after_merge_sim75)
num_of_unique_tokens_after_merge_sim75_in_dict = len(unique_tokens_after_merge_sim75_in_dict)

percent_in_dict = round(((num_of_unique_tokens_after_merge_sim75_in_dict/num_of_unique_tokens_after_merge_sim75)*100),2)

print("The number of unique tokens that represent the professions after merging with 75% similarity is {}, \nand {} of them are in the dictionary ({}%).".format(
    num_of_unique_tokens_after_merge_sim75, num_of_unique_tokens_after_merge_sim75_in_dict, percent_in_dict))

#### Distribution of Frequency of Token Frequency

In [None]:
freq_of_token_freq_after_merge_sim75 = {}

for token, count in unique_tokens_after_merge_sim75_with_count.items():
    if count not in freq_of_token_freq_after_merge_sim75:
        freq_of_token_freq_after_merge_sim75[count] = {"no_of_tokens": 0, "in_dictionary": 0, "not_in_dictionary":0}
    freq_of_token_freq_after_merge_sim75[count]["no_of_tokens"] += 1
    if token in words_with_correct_spellings:
        freq_of_token_freq_after_merge_sim75[count]["in_dictionary"] += 1
    else:
        freq_of_token_freq_after_merge_sim75[count]["not_in_dictionary"] += 1
    
freq_of_token_freq_after_merge_sim75_df = pd.DataFrame.from_dict(freq_of_token_freq_after_merge_sim75,
                                                  orient='index'
                                                 ).sort_values(by=['no_of_tokens'], ascending=False
                                                              ).rename_axis('token_frequency').reset_index()

In [None]:
freq_of_token_freq_after_merge_sim75_df

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

sns.scatterplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="no_of_tokens", label='before_merge', ax=ax)
sns.scatterplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="no_of_tokens", label='after_merge_sim75', ax=ax)

ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Comparison of distribution of token frequency before and after merge with 75% similarity threshold")
ax.set(xlabel="Frequency of Tokens (log scale)")
ax.set(ylabel="Number of Tokens (log scale)")
ax.grid()

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="in_dictionary", label='Has correct spelling', ax=ax)
sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled', ax=ax)

ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Distribution of token spelling correctness after merge with 75% similarity threshold")
ax.set(xlabel="Frequency of Tokens (log scale)")
ax.set(ylabel="Number of Tokens (log scale)")
ax.grid()

In [None]:
# sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="in_dictionary", label='before_merge', ax=ax)
# sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="not_in_dictionary", label='after_merge_sim75', ax=ax)

fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="in_dictionary", label='Has correct spelling before merge', ax=ax)
sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled before merge', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="in_dictionary", label='Has correct spelling after merge (75)', ax=ax)
sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled (75)', ax=ax)

ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Comparison of token spelling correctness before and after merge with 75% similarity threshold")
ax.set(xlabel="Frequency of Tokens (log scale)")
ax.set(ylabel="Number of Tokens (log scale)")
ax.grid()

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="in_dictionary", label='Has correct spelling before merge', ax=ax)
# sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled before merge', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="in_dictionary", label='Has correct spelling after merge (75)', ax=ax)
# sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled (75)', ax=ax)

ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Comparison of correctly spelled tokens before and after merge with 75% similarity threshold")
ax.set(xlabel="Frequency of Tokens (log scale)")
ax.set(ylabel="Number of Tokens (log scale)")
ax.grid()

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

# sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="in_dictionary", label='Has correct spelling before merge', ax=ax)
sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled before merge', ax=ax)

# sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="in_dictionary", label='Has correct spelling after merge (75)', ax=ax)
sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled (75)', ax=ax)

ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Comparison of wrongly spelled tokens before and after merge with 75% similarity threshold")
ax.set(xlabel="Frequency of Tokens (log scale)")
ax.set(ylabel="Number of Tokens (log scale)")
ax.grid()

### Merging tokens with 85% similarity


#### Read the unique tokens after merging

In [None]:
with open("./../data/intermediate_steps/sim_score_85/cleaned_unique_tokens.json", encoding="utf8") as f:
    unique_tokens_after_merge_sim85_with_count = json.load(f)
    
unique_tokens_after_merge_sim85 = unique_tokens_after_merge_sim85_with_count.keys()
unique_tokens_after_merge_sim85_in_dict = [uniq_tok for uniq_tok in unique_tokens_after_merge_sim85 if uniq_tok in words_with_correct_spellings]
num_of_unique_tokens_after_merge_sim85 = len(unique_tokens_after_merge_sim85)
num_of_unique_tokens_after_merge_sim85_in_dict = len(unique_tokens_after_merge_sim85_in_dict)

percent_in_dict = round(((num_of_unique_tokens_after_merge_sim85_in_dict/num_of_unique_tokens_after_merge_sim85)*100),2)

print("The number of unique tokens that represent the professions after merging with 85% similarity is {}, \nand {} of them are in the dictionary ({}%).".format(
    num_of_unique_tokens_after_merge_sim85, num_of_unique_tokens_after_merge_sim85_in_dict, percent_in_dict))

#### Distribution of Frequency of Token Frequency

In [None]:
freq_of_token_freq_after_merge_sim85 = {}

for token, count in unique_tokens_after_merge_sim85_with_count.items():
    if count not in freq_of_token_freq_after_merge_sim85:
        freq_of_token_freq_after_merge_sim85[count] = {"no_of_tokens": 0, "in_dictionary": 0, "not_in_dictionary":0}
    freq_of_token_freq_after_merge_sim85[count]["no_of_tokens"] += 1
    if token in words_with_correct_spellings:
        freq_of_token_freq_after_merge_sim85[count]["in_dictionary"] += 1
    else:
        freq_of_token_freq_after_merge_sim85[count]["not_in_dictionary"] += 1
    
freq_of_token_freq_after_merge_sim85_df = pd.DataFrame.from_dict(freq_of_token_freq_after_merge_sim85,
                                                  orient='index'
                                                 ).sort_values(by=['no_of_tokens'], ascending=False
                                                              ).rename_axis('token_frequency').reset_index()

In [None]:
freq_of_token_freq_after_merge_sim85_df

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

sns.scatterplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="no_of_tokens", label='before_merge', ax=ax)
sns.scatterplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="no_of_tokens", label='after_merge_sim85', ax=ax)

ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Comparison of distribution of token frequency before and after merge with 85% similarity threshold")
ax.set(xlabel="Frequency of Tokens (log scale)")
ax.set(ylabel="Number of Tokens (log scale)")
ax.grid()

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="in_dictionary", label='Has correct spelling', ax=ax)
sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled', ax=ax)

ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Distribution of token spelling correctness after merge with 85% similarity threshold")
ax.set(xlabel="Frequency of Tokens (log scale)")
ax.set(ylabel="Number of Tokens (log scale)")
ax.grid()

In [None]:
# sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="in_dictionary", label='before_merge', ax=ax)
# sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="not_in_dictionary", label='after_merge_sim75', ax=ax)

fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="in_dictionary", label='Has correct spelling before merge', ax=ax)
sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled before merge', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="in_dictionary", label='Has correct spelling after merge (85)', ax=ax)
sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled (85)', ax=ax)

ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Comparison of token spelling correctness before and after merge with 85% similarity threshold")
ax.set(xlabel="Frequency of Tokens (log scale)")
ax.set(ylabel="Number of Tokens (log scale)")
ax.grid()

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="in_dictionary", label='Has correct spelling before merge', ax=ax)
# sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled before merge', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="in_dictionary", label='Has correct spelling after merge (85)', ax=ax)
# sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled (85)', ax=ax)

ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Comparison of correctly spelled tokens before and after merge with 85% similarity threshold")
ax.set(xlabel="Frequency of Tokens (log scale)")
ax.set(ylabel="Number of Tokens (log scale)")
ax.grid()

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

# sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="in_dictionary", label='Has correct spelling before merge', ax=ax)
sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled before merge', ax=ax)

# sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="in_dictionary", label='Has correct spelling after merge (75)', ax=ax)
sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled (85)', ax=ax)

ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Comparison of wrongly spelled tokens before and after merge with 85% similarity threshold")
ax.set(xlabel="Frequency of Tokens (log scale)")
ax.set(ylabel="Number of Tokens (log scale)")
ax.grid()

## Comparison before and after merge

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

sns.scatterplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="no_of_tokens", label='before_merge', ax=ax)
sns.scatterplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="no_of_tokens", label='after_merge_sim75', ax=ax)
sns.scatterplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="no_of_tokens", label='after_merge_sim85', ax=ax)

ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Comparison of distribution of token frequency before and after merge")
ax.set(xlabel="Frequency of Tokens (log scale)")
ax.set(ylabel="Number of Tokens (log scale)")
ax.grid()

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="in_dictionary", label='Has correct spelling before merge', ax=ax)
sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled before merge', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="in_dictionary", label='Has correct spelling after merge (75)', ax=ax)
sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled (75)', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="in_dictionary", label='Has correct spelling after merge (85)', ax=ax)
sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled (85)', ax=ax)

ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Distribution of token spelling correctness before and after merge")
ax.set(xlabel="Frequency of Tokens (log scale)")
ax.set(ylabel="Number of Tokens (log scale)")
ax.grid()

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="in_dictionary", label='Has correct spelling before merge', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="in_dictionary", label='Has correct spelling after merge (85)', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="in_dictionary", label='Has correct spelling after merge (75)', ax=ax)


ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Comparison of correctly spelled tokens before and after merge")
ax.set(xlabel="Frequency of Tokens (log scale)")
ax.set(ylabel="Number of Tokens (log scale)")
ax.grid()

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)


sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled before merge', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled (75)', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled (85)', ax=ax)

ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Comparison of wrongly spelled tokens before and after merge")
ax.set(xlabel="Frequency of Tokens (log scale)")
ax.set(ylabel="Number of Tokens (log scale)")
ax.grid()

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="in_dictionary", label='Has correct spelling before merge', ax=ax)
sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled before merge', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="in_dictionary", label='Has correct spelling after merge (75)', ax=ax)
sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled (75)', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="in_dictionary", label='Has correct spelling after merge (85)', ax=ax)
sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled (85)', ax=ax)

# ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Comparison of token spelling correctness before and after merge")
ax.set(xlabel="Frequency of Tokens (log scale)")
# the Y axis is linear in this plot, so the label does not mention a log scale
ax.set(ylabel="Number of Tokens")
ax.grid()

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)

sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="in_dictionary", label='Has correct spelling before merge', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="in_dictionary", label='Has correct spelling after merge (85)', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="in_dictionary", label='Has correct spelling after merge (75)', ax=ax)


# ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Comparison of correctly spelled tokens before and after merge")
ax.set(xlabel="Frequency of Tokens (log scale)")
# the Y axis is linear in this plot, so the label does not mention a log scale
ax.set(ylabel="Number of Tokens")
ax.grid()

In [None]:
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)


sns.lineplot(data=freq_of_token_freq_before_merge_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled before merge', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim75_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled (75)', ax=ax)

sns.lineplot(data=freq_of_token_freq_after_merge_sim85_df, x="token_frequency", y="not_in_dictionary", label='Not correctly Spelled (85)', ax=ax)

# ax.set(yscale='log')
ax.set(xscale='log')
ax.set(title="Comparison of wrongly spelled tokens before and after merge")
ax.set(xlabel="Frequency of Tokens (log scale)")
# the Y axis is linear in this plot, so the label does not mention a log scale
ax.set(ylabel="Number of Tokens")
ax.grid()

## Distribution of round-wise similarity scores during the merge

During each iteration of each round of the merge process, a summary of the distribution of the similarity scores is printed and also saved to a pickle file. That information is summarised here.

Read the replacement scores files. Each file is an array of arrays. The length of the outer array corresponds to the number of iterations that occurred in the round, and the length of each inner array corresponds to the number of unique merges that occurred during that iteration.
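The nested structure described above can be summarised per iteration as sketched below. This is only an illustration: `summarise_round` is a hypothetical helper, and the toy scores (including their scale) are invented, whereas the real scores come from the pickle files.

```python
import numpy as np


def summarise_round(scores_per_iteration):
    """Summarise one round: one entry per iteration (numbered from 1),
    holding the number of unique merges and the mean and median score."""
    summary = {}
    for iteration, scores in enumerate(scores_per_iteration, start=1):
        summary[iteration] = {
            "uniq_mergs": len(scores),
            "average_merge_score": float(np.mean(scores)),
            "median_merge_score": float(np.median(scores)),
        }
    return summary


# Toy round with two iterations: three merges in the first, two in the second
toy_round = [[80, 90, 100], [76, 84]]
toy_summary = summarise_round(toy_round)
print(toy_summary)
```

A helper of this kind could replace the three near-identical loops that fill `round_wise_data` for rounds 1 to 3. It assumes every iteration contains at least one merge; an empty inner array would make the mean and median `nan`.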

### Merging with similarity score of 75

In [None]:
round_wise_data = {1: {}, 2: {}, 3:{}}

with open(
    "./../data/intermediate_steps/sim_score_75/replacement_scores_round_after_round_1.pickle", "rb"
) as outfile:
    replacement_scores_round_after_round_1 = pickle.load(outfile)
    
for itr,merg_scr in enumerate(replacement_scores_round_after_round_1):
    round_wise_data[1][itr+1] = {"uniq_mergs": len(merg_scr), "average_merge_score": np.mean(merg_scr), "median_merge_score": np.median(merg_scr)}
    
with open(
    "./../data/intermediate_steps/sim_score_75/replacement_scores_round_after_round_2.pickle", "rb"
) as outfile:
    replacement_scores_round_after_round_2 = pickle.load(outfile)
    
for itr,merg_scr in enumerate(replacement_scores_round_after_round_2):
    round_wise_data[2][itr+1] = {"uniq_mergs": len(merg_scr), "average_merge_score": np.mean(merg_scr), "median_merge_score": np.median(merg_scr)}
    

with open(
    "./../data/intermediate_steps/sim_score_75/replacement_scores_round_after_round_3.pickle", "rb"
) as outfile:
    replacement_scores_round_after_round_3 = pickle.load(outfile)

for itr,merg_scr in enumerate(replacement_scores_round_after_round_3):
    round_wise_data[3][itr+1] = {"uniq_mergs": len(merg_scr), "average_merge_score": np.mean(merg_scr), "median_merge_score": np.median(merg_scr)}

In [None]:
avg_sim_scr_per_itr = {}

for round_num in range(1,4):
    avg_sim_scr_per_itr["round_" + str(round_num)] = {itr: info["average_merge_score"] for itr, info in round_wise_data[round_num].items()}
    
median_sim_scr_per_itr = {}

for round_num in range(1,4):
    median_sim_scr_per_itr["round_" + str(round_num)] = {itr: info["median_merge_score"] for itr, info in round_wise_data[round_num].items()}
    
    
unq_mergs_per_itr = {}

for round_num in range(1,4):
    unq_mergs_per_itr["round_" + str(round_num)] = {itr: info["uniq_mergs"] for itr, info in round_wise_data[round_num].items()}    

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)

df_mrgs = pd.DataFrame(unq_mergs_per_itr)
ax = df_mrgs.plot.line(title="Unique Merges per iteration (75)", xlabel="Iteration", ylabel="Number of unique merges", ax=ax)
ax.locator_params(integer=True)

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=1)
fig.set_size_inches(11.7, 8.27)

df_avg_scr = pd.DataFrame(avg_sim_scr_per_itr)
ax1 = df_avg_scr.plot.line(title="Average Similarity Score per iteration (75)", xlabel="Iteration", ylabel="Average similarity score", ax=axes[0])
ax1.locator_params(integer=True)
    
df_median_scr = pd.DataFrame(median_sim_scr_per_itr)
ax2 = df_median_scr.plot.line(title="Median Similarity Score per iteration (75)", xlabel="Iteration", ylabel="Median similarity score", ax=axes[1])
ax2.locator_params(integer=True)

plt.tight_layout()

### Merging with similarity score of 85

In [None]:
round_wise_data = {1: {}, 2: {}, 3:{}}

with open(
    "./../data/intermediate_steps/sim_score_85/replacement_scores_round_after_round_1.pickle", "rb"
) as outfile:
    replacement_scores_round_after_round_1 = pickle.load(outfile)
    
for itr,merg_scr in enumerate(replacement_scores_round_after_round_1):
    round_wise_data[1][itr+1] = {"uniq_mergs": len(merg_scr), "average_merge_score": np.mean(merg_scr), "median_merge_score": np.median(merg_scr)}
    
with open(
    "./../data/intermediate_steps/sim_score_85/replacement_scores_round_after_round_2.pickle", "rb"
) as outfile:
    replacement_scores_round_after_round_2 = pickle.load(outfile)
    
for itr,merg_scr in enumerate(replacement_scores_round_after_round_2):
    round_wise_data[2][itr+1] = {"uniq_mergs": len(merg_scr), "average_merge_score": np.mean(merg_scr), "median_merge_score": np.median(merg_scr)}
    

with open(
    "./../data/intermediate_steps/sim_score_85/replacement_scores_round_after_round_3.pickle", "rb"
) as outfile:
    replacement_scores_round_after_round_3 = pickle.load(outfile)

for itr,merg_scr in enumerate(replacement_scores_round_after_round_3):
    round_wise_data[3][itr+1] = {"uniq_mergs": len(merg_scr), "average_merge_score": np.mean(merg_scr), "median_merge_score": np.median(merg_scr)}

In [None]:
avg_sim_scr_per_itr = {}

for round_num in range(1,4):
    avg_sim_scr_per_itr["round_" + str(round_num)] = {itr: info["average_merge_score"] for itr, info in round_wise_data[round_num].items()}
    
median_sim_scr_per_itr = {}

for round_num in range(1,4):
    median_sim_scr_per_itr["round_" + str(round_num)] = {itr: info["median_merge_score"] for itr, info in round_wise_data[round_num].items()}
    
    
unq_mergs_per_itr = {}

for round_num in range(1,4):
    unq_mergs_per_itr["round_" + str(round_num)] = {itr: info["uniq_mergs"] for itr, info in round_wise_data[round_num].items()}    

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)

df_mrgs = pd.DataFrame(unq_mergs_per_itr)
ax = df_mrgs.plot.line(title="Unique Merges per iteration (85)", xlabel="Iteration", ylabel="Number of unique merges", ax=ax)
ax.locator_params(integer=True)

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=1)
fig.set_size_inches(11.7, 8.27)

df_avg_scr = pd.DataFrame(avg_sim_scr_per_itr)
ax1 = df_avg_scr.plot.line(title="Average Similarity Score per iteration (85)", xlabel="Iteration", ylabel="Average similarity score", ax=axes[0])
ax1.locator_params(integer=True)
    
df_median_scr = pd.DataFrame(median_sim_scr_per_itr)
ax2 = df_median_scr.plot.line(title="Median Similarity Score per iteration (85)", xlabel="Iteration", ylabel="Median similarity score", ax=axes[1])
ax2.locator_params(integer=True)

plt.tight_layout()