# Simple BPE

Implement the simple byte-pair encoding (BPE) algorithm from the paper that introduced BPE as a way to handle out-of-vocabulary words.

Paper: [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/pdf/1508.07909)

## Paper code

Try out the reference code given in `Algorithm 1` of the paper.

In [7]:
import re, collections
def get_stats(vocab):
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i],symbols[i+1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    # match the pair only when both symbols are whole, space-delimited tokens
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        # join the two symbols into one; the word's frequency is unchanged
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

vocab = {'l o w </w>' : 5, 'l o w e r </w>' : 2, 'n e w e s t </w>':6, 'w i d e s t </w>':3}

num_merges = 5
for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(f"Best Pair: {best}")
    print(f"New vocab after merge: {vocab}")
    print("*" * 20)

Best Pair: ('e', 's')
New vocab after merge: {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}
********************
Best Pair: ('es', 't')
New vocab after merge: {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}
********************
Best Pair: ('est', '</w>')
New vocab after merge: {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3}
********************
Best Pair: ('l', 'o')
New vocab after merge: {'lo w </w>': 5, 'lo w e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3}
********************
Best Pair: ('lo', 'w')
New vocab after merge: {'low </w>': 5, 'low e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3}
********************
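
The two lookarounds in `merge_vocab` are what stop a merge from firing inside a longer symbol. A minimal check of that behaviour is sketched below. The word `'ae s t e s </w>'` and the helper name `merge_pair` are made up for illustration and do not come from the paper; the regex itself is the one used above.

```python
import re

def merge_pair(pair, word):
    """Merge `pair` inside one space-separated symbol string (same regex as merge_vocab)."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return pattern.sub("".join(pair), word)

# hypothetical word whose symbols are: 'ae', 's', 't', 'e', 's', '</w>'
word = "ae s t e s </w>"

# plain substring replacement also hits the 'e s' that straddles 'ae' and 's'
naive = word.replace("e s", "es")

# the lookarounds require whitespace (or the string edge) on both sides of the pair
safe = merge_pair(("e", "s"), word)

print(naive)  # aes t es </w>
print(safe)   # ae s t es </w>
```

The naive version silently glues `ae` and `s` together even though `('ae', 's')` was never chosen as a merge, which is why the reference code goes through a regex instead of `str.replace`.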


## Personal Implementation (Unit cases)

Implement my own BPE following the same idea, with a few changes.

In [1]:
sample_text = """
Working in Teams: There are 4 assignments in the class. Assignment 1 must be done individually, while Assignments 2, 3, and 4 must be done in teams of 2-3 (individual submissions will not be accepted for these assignments). If you are having trouble finding a group, the instructor and TAs will help you find one after the first initial survey.

Submission Information: To submit your assignment you must submit via canvas a zip file containing:

your code: This should be in a directory “code” in the top directory unless specified otherwise.
system outputs (assignments 1 and 2): The format will be specified separately for each assignment.
a report (assignments 2, 3 and 4, optional for assignment 1): This should be named “report.pdf” in the top directory. This is for assignments 2, 3 and 4, and can be up to 7 pages for assignments 2 and 3 and 9 pages for assignment 4. References are not included in the page count, and it is OK to submit appendices that include supplementary information such as hyperparameter settings or additional output examples, although there is no guarantee that the TAs will read them. Submissions that exceed the page count will be penalized one third grade for each page over (e.g. A to A- or A- to B+). You may also submit report.pdf for assignment 1 if you have any interesting infromation to convey to the TAs, for example if you did anything interesting above and beyond the minimal requirements.
a link to a github repository containing your code (assignments 2, 3 and 4): This should be a single line file “github.txt” in the top directory. Your github repository must be viewable to the TAs in charge of the assignment by the submission deadline. If your repository is private make it accessible to the TAs by the submission deadline. If your repository is not visible to the TAs, your assignment will not be considered complete, so if you are worried please submit well in advance of the deadline so we can confirm the submission is visible. We use this repository to check contributions of all team members.
Late Day Policy: In case there are unforeseen circumstances that don’t let you turn in your assignment on time, 5 late days total for assignments 2 and 3 will be allowed. Note that other than these late days, we will not be making exceptions and extending deadlines except for health reasons, so please try to be frugal with your late days and use them only if necessary. Assignments that are late beyond the allowed late days will be graded down one third-grade per day late (e.g. A to A- for one day, and A to B+ for two days).
"""

In [2]:
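# split on spaces; the </w> end-of-word marker is added to each word a few cells below
# note: split(' ') only splits on the space character, so newlines stay inside some tokens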
words = sample_text.strip().strip('\n').split(' ')
words[:10]

['Working',
 'in',
 'Teams:',
 'There',
 'are',
 '4',
 'assignments',
 'in',
 'the',
 'class.']

In [29]:
dict_words = []

for word in words:
    dict_words.append(" ".join(w for w in word))

dict_words[:10]

['W o r k i n g',
 'i n',
 'T e a m s :',
 'T h e r e',
 'a r e',
 '4',
 'a s s i g n m e n t s',
 'i n',
 't h e',
 'c l a s s .']

In [30]:
dict_words = [w + " </w>" for w in dict_words]
dict_words[:10]

['W o r k i n g </w>',
 'i n </w>',
 'T e a m s : </w>',
 'T h e r e </w>',
 'a r e </w>',
 '4 </w>',
 'a s s i g n m e n t s </w>',
 'i n </w>',
 't h e </w>',
 'c l a s s . </w>']

In [43]:
vocab = {}

for word in dict_words:
    vocab[word] = vocab.get(word, 0) + 1

vocab

{'W o r k i n g </w>': 1,
 'i n </w>': 11,
 'T e a m s : </w>': 1,
 'T h e r e </w>': 1,
 'a r e </w>': 6,
 '4 </w>': 2,
 'a s s i g n m e n t s </w>': 4,
 't h e </w>': 20,
 'c l a s s . </w>': 1,
 'A s s i g n m e n t </w>': 1,
 '1 </w>': 3,
 'm u s t </w>': 4,
 'b e </w>': 15,
 'd o n e </w>': 2,
 'i n d i v i d u a l l y , </w>': 1,
 'w h i l e </w>': 1,
 'A s s i g n m e n t s </w>': 2,
 '2 , </w>': 4,
 '3 , </w>': 1,
 'a n d </w>': 15,
 't e a m s </w>': 1,
 'o f </w>': 4,
 '2 - 3 </w>': 1,
 '( i n d i v i d u a l </w>': 1,
 's u b m i s s i o n s </w>': 1,
 'w i l l </w>': 9,
 'n o t </w>': 5,
 'a c c e p t e d </w>': 1,
 'f o r </w>': 13,
 't h e s e </w>': 2,
 'a s s i g n m e n t s ) . </w>': 1,
 'I f </w>': 3,
 'y o u </w>': 7,
 'h a v i n g </w>': 1,
 't r o u b l e </w>': 1,
 'f i n d i n g </w>': 1,
 'a </w>': 5,
 'g r o u p , </w>': 1,
 'i n s t r u c t o r </w>': 1,
 'T A s </w>': 4,
 'h e l p </w>': 1,
 'f i n d </w>': 1,
 'o n e </w>': 4,
 'a f t e r </w>': 1,
 'f i r

In [34]:
def get_stats(vocab):
    stats = {}
    
    for word, count in vocab.items():
        letters = word.split()
        for i in range(len(letters) - 1):
            stats[(letters[i], letters[i + 1])] = stats.get((letters[i], letters[i + 1]), 0) + count

    return stats

In [36]:
stats = get_stats(vocab)
stats

{('W', 'o'): 1,
 ('o', 'r'): 34,
 ('r', 'k'): 1,
 ('k', 'i'): 2,
 ('i', 'n'): 44,
 ('n', 'g'): 12,
 ('g', '</w>'): 9,
 ('n', '</w>'): 25,
 ('T', 'e'): 1,
 ('e', 'a'): 14,
 ('a', 'm'): 7,
 ('m', 's'): 3,
 ('s', ':'): 1,
 (':', '</w>'): 7,
 ('T', 'h'): 6,
 ('h', 'e'): 33,
 ('e', 'r'): 15,
 ('r', 'e'): 30,
 ('e', '</w>'): 94,
 ('a', 'r'): 12,
 ('4', '</w>'): 2,
 ('a', 's'): 23,
 ('s', 's'): 29,
 ('s', 'i'): 35,
 ('i', 'g'): 19,
 ('g', 'n'): 19,
 ('n', 'm'): 19,
 ('m', 'e'): 25,
 ('e', 'n'): 26,
 ('n', 't'): 29,
 ('t', 's'): 12,
 ('s', '</w>'): 43,
 ('t', 'h'): 45,
 ('c', 'l'): 3,
 ('l', 'a'): 7,
 ('s', '.'): 3,
 ('.', '</w>'): 15,
 ('A', 's'): 9,
 ('t', '</w>'): 38,
 ('1', '</w>'): 3,
 ('m', 'u'): 4,
 ('u', 's'): 6,
 ('s', 't'): 10,
 ('b', 'e'): 18,
 ('d', 'o'): 4,
 ('o', 'n'): 31,
 ('n', 'e'): 12,
 ('n', 'd'): 23,
 ('d', 'i'): 11,
 ('i', 'v'): 3,
 ('v', 'i'): 7,
 ('i', 'd'): 4,
 ('d', 'u'): 2,
 ('u', 'a'): 3,
 ('a', 'l'): 15,
 ('l', 'l'): 14,
 ('l', 'y'): 3,
 ('y', ','): 2,
 (',', '</w>'

In [37]:
best_pair = sorted(stats.items(), key=lambda i:i[1], reverse=True)[0][0]
best_pair

('e', '</w>')
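
Picking the best pair with `sorted(...)[0][0]` and with `max(stats, key=stats.get)` (as the paper code does) should give the same answer, because Python's sort is stable even with `reverse=True` and `max` returns the first maximal item it sees. Both therefore break ties by dictionary insertion order. The sketch below uses a made-up `stats` dict to show this, plus one possible explicit tie-break rule (lexicographic on the pair), which is an assumption of this example and not something the notebook uses.

```python
# hypothetical pair counts with a tie between the first two entries
stats = {("c", "d"): 5, ("b", "c"): 5, ("a", "b"): 3}

# approach used in this notebook: sort by count, take the first item
by_sort = sorted(stats.items(), key=lambda kv: kv[1], reverse=True)[0][0]

# approach used in the paper code: max over keys
by_max = max(stats, key=stats.get)

# optional explicit rule: highest count first, then smallest pair lexicographically
by_rule = min(stats, key=lambda pair: (-stats[pair], pair))

print(by_sort)  # ('c', 'd')
print(by_max)   # ('c', 'd')
print(by_rule)  # ('b', 'c')
```

An explicit rule like `by_rule` only matters if the merge order needs to be reproducible regardless of the order in which words were inserted into the vocabulary.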

In [38]:
def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

In [44]:
num_merges = 50

for merge in range(num_merges):
    stats = get_stats(vocab)
    best = sorted(stats.items(), key=lambda i:i[1], reverse=True)[0][0]
    vocab = merge_vocab(best, vocab)
    print(f"Merge {merge + 1}, Best Pair: {best}, Best Count: {stats[best]}")

Merge 1, Best Pair: ('e', '</w>'), Best Count: 94
Merge 2, Best Pair: ('t', 'h'), Best Count: 45
Merge 3, Best Pair: ('i', 'n'), Best Count: 44
Merge 4, Best Pair: ('s', '</w>'), Best Count: 43
Merge 5, Best Pair: ('t', '</w>'), Best Count: 38
Merge 6, Best Pair: ('d', '</w>'), Best Count: 35
Merge 7, Best Pair: ('o', 'r'), Best Count: 34
Merge 8, Best Pair: ('s', 'i'), Best Count: 34
Merge 9, Best Pair: ('o', 'n'), Best Count: 31
Merge 10, Best Pair: ('o', 'u'), Best Count: 27
Merge 11, Best Pair: ('s', 'si'), Best Count: 26
Merge 12, Best Pair: ('e', 'n'), Best Count: 26
Merge 13, Best Pair: ('a', 'n'), Best Count: 24
Merge 14, Best Pair: ('m', 'en'), Best Count: 21
Merge 15, Best Pair: ('o', '</w>'), Best Count: 21
Merge 16, Best Pair: ('th', 'e</w>'), Best Count: 20
Merge 17, Best Pair: ('ssi', 'g'), Best Count: 19
Merge 18, Best Pair: ('ssig', 'n'), Best Count: 19
Merge 19, Best Pair: ('ssign', 'men'), Best Count: 19
Merge 20, Best Pair: ('r', 'e'), Best Count: 19
Merge 21, Best P

In [42]:
vocab

{'W or k ing</w>': 1,
 'in</w>': 11,
 'T ea m s : </w>': 1,
 'T h e r e</w>': 1,
 'ar e</w>': 6,
 '4 </w>': 2,
 'assignmen ts</w>': 4,
 'the</w>': 20,
 'c l a s s .</w>': 1,
 'A ssignmen t</w>': 1,
 '1 </w>': 3,
 'm u s t</w>': 4,
 'be</w>': 15,
 'd on e</w>': 2,
 'in di v i d u a l l y ,</w>': 1,
 'w h il e</w>': 1,
 'A ssignmen ts</w>': 2,
 '2 ,</w>': 4,
 '3 ,</w>': 1,
 'and</w>': 15,
 't ea m s</w>': 1,
 'o f</w>': 4,
 '2 - 3 </w>': 1,
 '( in di v i d u a l</w>': 1,
 'submi ssi on s</w>': 1,
 'wil l</w>': 9,
 'n o t</w>': 5,
 'a c c e p t ed</w>': 1,
 'for</w>': 13,
 'th e s e</w>': 2,
 'assignmen t s ) .</w>': 1,
 'I f</w>': 3,
 'you </w>': 7,
 'h a v ing</w>': 1,
 't r ou b l e</w>': 1,
 'f in d ing</w>': 1,
 'a </w>': 5,
 'g r ou p ,</w>': 1,
 'in s t r u c tor </w>': 1,
 'T A s</w>': 4,
 'h e l p </w>': 1,
 'f in d</w>': 1,
 'on e</w>': 4,
 'a f t e r</w>': 1,
 'f i r s t</w>': 1,
 'in i t i a l</w>': 1,
 's u r v e y . \n \n S ubmi ssi on </w>': 1,
 'I n for m at i on : </w>': 

## Implement in a class

In [119]:
import re

class BPE:
    def __init__(self):
        self.word_cnt = {}
        self.vocab = {}
        self.merges = {}
        self.idx = 0

    def init_vocab(self, train_text):
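        # split the training text into words on whitespace
        # note: leading/trailing whitespace in train_text yields empty strings here,
        # which end up as a bare ' </w>' entry in word_cnt
        words = re.split(r'\s+|\n', train_text)

        # base vocabulary: every distinct character seen in the training text
        letters = sorted(list(set([w for word in words for w in word])))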
        for letter in letters:
            self.vocab[letter] = self.idx
            self.idx += 1

        # add the eow token
        self.vocab["</w>"] = self.idx
        self.idx += 1

        # add the <unk> token
        self.vocab["<unk>"] = self.idx
        self.idx += 1

        # build the word -> count dict in the same space-separated format as before
        dict_words = []
        for word in words:
            ss_word = " ".join(w for w in word)
            ss_word += " </w>"
            dict_words.append(ss_word)

        for word in dict_words:
            self.word_cnt[word] = self.word_cnt.get(word, 0) + 1

    def get_stats(self):
        # count adjacent symbol pairs over word_cnt, weighted by word frequency
        stats = {}

        for word, count in self.word_cnt.items():
            letters = word.split()
            for i in range(len(letters) - 1):
                pair = (letters[i], letters[i + 1])
                stats[pair] = stats.get(pair, 0) + count

        return stats
    
    def merge(self, top_pair):
        # update vocab and merge
        self.vocab["".join(top_pair)] = self.idx
        self.merges[top_pair] = self.idx
        self.idx += 1

        # apply the merge to every word in word_cnt
        v_out = {}
        bigram = re.escape(' '.join(top_pair))
        p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
        for word in self.word_cnt:
            w_out = p.sub(''.join(top_pair), word)
            v_out[w_out] = self.word_cnt[word]
        self.word_cnt = v_out 

    def get_encoded_size(self):
        encoded_size = 0
        for word, count in self.word_cnt.items():
            letters = word.split()
            encoded_size += len(letters) * count

        return encoded_size
    
    def get_id2token(self):
        id2token = {}
        for token, idx in self.vocab.items():
            id2token[idx] = token

        return id2token

In [120]:
text = """
Terrance Stanley Fox (July 28, 1958 – June 28, 1981) was a Canadian athlete, humanitarian, and cancer research activist. In 1980, having had one leg amputated due to cancer, he embarked on a cross-Canada run to raise money and awareness for cancer research. The annual Terry Fox Run, first held in 1981, has grown to involve millions of participants in over 60 countries and is the world's largest one-day fundraiser for cancer research; over C$900 million has been raised in his name through the Terry Fox Research Institute as of September 2024.[1]

Fox was a distance runner and basketball player for Port Coquitlam Senior Secondary School, later named after him, and Simon Fraser University. His right leg was amputated in 1977 after he was diagnosed with osteosarcoma, though he continued to run using an artificial leg. He also played wheelchair basketball in Vancouver, winning three national championships.

In 1980, he began the Marathon of Hope to raise money for cancer research. He hoped to raise one dollar from each of Canada's 24 million people at the time. He began with little fanfare from St John's, Newfoundland and Labrador, in April that year, and ran the equivalent of a full marathon every day. Fox had become a national star by the time he reached Ontario; he made numerous public appearances with businessmen, athletes, and politicians in his efforts to raise money. He was forced to end his run outside Thunder Bay after the cancer spread to his lungs. Fox died nine months later on June 28, 1981.

Fox was the youngest person named a Companion of the Order of Canada and won the 1980 Lou Marsh Award as the nation's top sportsman. He was named Canada's Newsmaker of the Year in both 1980 and 1981 by The Canadian Press. Considered a national hero, he has had many buildings, statues, roads, and parks named in his honour across the country.

Early life and cancer
Terrance Stanley Fox was born on July 28, 1958, in Winnipeg, Manitoba, to Rolland and Betty Fox. Rolland was a switchman for the Canadian National Railway.[2] Fox spent his childhood in the Transcona suburb of Winnipeg, where he attended Wayoata Elementary School.[3] Fox had an elder brother, Fred, a younger brother, Darrell, and a younger sister, Judith.[4] Fox's maternal grandmother is Métis and Fox's younger brother Darrell has official Métis status.[5]

His family moved to Surrey in British Columbia in 1966, then settled in Port Coquitlam in 1968.[4] He had doting parents,[6] and his father recalled that Fox was extremely competitive.[7] Fox attempted to join his school's basketball team, though struggled because of his height. His coach suggested that Fox try cross-country running, which Fox did in order to impress his coach.[8][9][10] Fox continued to improve on his basketball skills, and in grade 12 he won his high school's athlete of the year award.[4] Fox was unsure whether he wanted to go to university, but Fox's mother convinced him to enrol at Simon Fraser University. He studied kinesiology with the intention of becoming a physical education teacher.[11] He was also a member of the junior varsity basketball team.[4]

a prosthetic leg in a display case
Fox's favourite prosthetic leg used during his Marathon of Hope, on display at the Canadian Museum of History
On November 12, 1976, Fox was driving to the family home in Port Coquitlam when he was distracted by nearby bridge construction and crashed into the back of a pickup truck. Fox injured his right knee in the crash and felt pain in December, but chose to ignore it until the end of basketball season.[12] By March 1977, the pain had intensified and he went to a hospital, where he was diagnosed with osteosarcoma, a form of cancer that often starts near the knees.[4] Fox believed his car accident weakened his knee and left it vulnerable to the disease, though his doctors argued there was no connection.[13] He was told that his leg had to be amputated, he would require chemotherapy treatment, and that recent medical advances meant he had a 50-per cent chance of survival. Fox learned that two years before, the figure would have been only 15 per cent; the improvement in survival rates impressed on him the value of cancer research.[14] With the help of an artificial leg, Fox was walking three weeks after the amputation.[4] Doctors were impressed with Fox's positive outlook, saying it contributed to his rapid recovery.[15] Fox endured sixteen months of chemotherapy and found the time he spent in the British Columbia Cancer Control Agency facility difficult as he watched fellow cancer patients suffer and die from the disease.[16]

In the summer of 1977, Rick Hansen, working with the Canadian Wheelchair Sports Association, invited Fox to try out for his wheelchair basketball team.[17] Less than two months after learning how to play the sport, Fox was named a member of the team for the national championship in Edmonton, Alberta.[18] He won three national titles with the team,[4] and was named an all-star by the North American Wheelchair Basketball Association in 1980.[19]

Marathon of Hope

Fox in 1980
The night before his cancer surgery, Fox had been given an article about Dick Traum, the first amputee to complete the New York City Marathon.[4] The article inspired him; he embarked on a 14-month training program, telling his family he planned to compete in a marathon himself.[2] In private, he devised a more extensive plan. His hospital experiences had made Fox angry at how little money was dedicated to cancer research. He intended to run the length of Canada in the hope of increasing cancer awareness, a goal he initially divulged only to his friend Douglas Alward.[20]

Fox ran with an unusual gait, as he was required to hop-step on his good leg due to the extra time the springs in his artificial leg required to reset after each step.[21] He found the training painful as the additional pressure he had to place on both his good leg and his stump led to bone bruises, blisters, and intense pain. Fox found that after about 20 minutes of each run, he crossed a pain threshold and the run became easier.[22]

On September 2, 1979, Fox competed in a 17-mile (27 km) road race in Prince George. He finished in last place, ten minutes behind his closest competitor, but his effort was met with tears and applause from the other participants.[4] Following the marathon, he revealed his full plan to his family.[23] His mother discouraged him, angering Fox, though she later came to support the project. She recalled, "He said, 'I thought you'd be one of the first persons to believe in me.' And I wasn't. I was the first person who let him down".[24] Fox initially hoped to raise $1 million,[24] then $10 million, but later sought to raise $1 for each of Canada's 24 million citizens.[25]

Preparation
On October 15, 1979, Fox sent a letter to the Canadian Cancer Society in which he announced his goal and appealed for funding. He stated that he would "conquer" his disability, and promised to complete his run, even if he had to "crawl every last mile". Explaining why he wanted to raise money for research, Fox described his personal experience of cancer treatment:

I soon realized that that would only be half my quest, for as I went through the 16 months of the physically and emotionally draining ordeal of chemotherapy, I was rudely awakened by the feelings that surrounded and coursed through the cancer clinic. There were faces with the brave smiles, and the ones who had given up smiling. There were feelings of hopeful denial, and the feelings of despair. My quest would not be a selfish one. I could not leave knowing these faces and feelings would still exist, even though I would be set free from mine. Somewhere the hurting must stop ... and I was determined to take myself to the limit for this cause.[26]

Fox closed his letter with the statement: "We need your help. The people in cancer clinics all over the world need people who believe in miracles. I am not a dreamer, and I am not saying that this will initiate any kind of definitive answer or cure to cancer. I believe in miracles. I have to."[26]


The van used in the Marathon of Hope on display at the Royal British Columbia Museum
The Cancer Society was skeptical of his success but agreed to support Fox once he had acquired sponsors and requested he get a medical certificate from a heart specialist stating that he was fit to attempt the run. Fox was diagnosed with left ventricular hypertrophy – an enlarged heart – a condition commonly associated with athletes. Doctors warned Fox of the potential risks he faced, though they did not consider his condition a significant concern. They endorsed his participation when he promised that he would stop immediately if he began to experience any heart problems.[27]

A second letter was sent to several corporations seeking donations for a vehicle and running shoes, and to cover the other costs of the run.[28] Fox sent other letters asking for grants to buy a running leg.[28] The Ford Motor Company donated a camper van,[6] while Imperial Oil contributed fuel, and Adidas his running shoes.[29] Fox turned away any company that requested he endorse their products and refused any donation that carried conditions, as he insisted that nobody was to profit from his run.[6]

Start of the marathon

Terry Fox Statue at Mile 0 in St. John's, Canada
The Marathon began on April 12, 1980, when Fox dipped his right leg in the Atlantic Ocean near St. John's, Newfoundland and Labrador, and filled two large bottles with ocean water. He intended to keep one as a souvenir and pour the other into the Pacific Ocean upon completing his journey at Victoria, British Columbia.[25] Fox was supported on his run by Doug Alward, who drove the van and cooked meals.[29]

Fox was met with gale-force winds, heavy rain, and a snowstorm in the first days of his run.[2] He was initially disappointed with the reception he received but was heartened upon arriving in Channel-Port aux Basques, Newfoundland and Labrador, where the town's 10,000 residents presented him with a donation of over $10,000.[29] Throughout the trip, Fox frequently expressed his anger and frustration to those he saw as impeding the run, and he fought regularly with Alward. When they reached Nova Scotia, they were barely on speaking terms, and it was arranged for Fox's brother Darrell, then 17, to join them as a buffer.[24]

Fox left the Maritimes on June 10 and faced new challenges upon entering Quebec due to his group's inability to speak French[30] and drivers who continually forced him off the road.[31] Fox arrived in Montreal on June 22, one-third of the way through his 8,000-kilometre (5,000 mi) journey, having collected over $200,000 in donations.[21] Fox's run caught the attention of Isadore Sharp, the founder and CEO of Four Seasons Hotels and Resorts, who lost a son to melanoma in 1978 just a year after Terry's diagnosis.[32] Sharp gave food and accommodation at his hotels to Fox's team. When Fox was discouraged because so few people were making donations, Sharp pledged $2 a mile and persuaded close to 1,000 other corporations to do the same.[33] Fox was convinced by the Canadian Cancer Society that arriving in Ottawa for Canada Day would aid fundraising efforts, so he remained in Montreal for a few extra days.[31]

Ontario and marathon's end

The Terry Fox Monument in Thunder Bay
Fox crossed into Ontario on the last Saturday in June, and he was met by a brass band and thousands of residents who lined the streets to cheer him on, while the Ontario Provincial Police gave him an escort throughout the province.[34] Despite the sweltering heat of summer, he continued to run 26 miles (42 km) per day.[30] On his arrival in Ottawa, Fox met Governor General Ed Schreyer, Prime Minister Pierre Trudeau, and was the guest of honour at numerous sporting events in the city.[34] In front of over 16,000 fans, he performed a ceremonial kickoff at a Canadian Football League game between the Ottawa Rough Riders and Saskatchewan Roughriders,[35] and was given a standing ovation. Fox's journal reflected his growing excitement at the reception he had received.[36]

On July 11, Fox arrived in Toronto where a crowd of 10,000 people met him and he was honoured in Nathan Phillips Square.[37] As he ran to the square, he was joined on the road by many people, including National Hockey League star Darryl Sittler, who presented Fox with his 1980 All-Star Game jersey. The Cancer Society estimated it collected $100,000 in donations that day alone.[4] That evening he threw the ceremonial first pitch at Exhibition Stadium preceding a baseball game between the Toronto Blue Jays and the Cleveland Indians. As he continued through southern Ontario, he was met by Hockey Hall of Fame Hockey player Bobby Orr who presented him with a cheque for $25,000. Fox considered meeting Orr the highlight of his journey.[4]

refer to caption
Fox's path across eastern Canada. He began at St. John's on the east coast and ran west.
As Fox's fame grew, the Cancer Society scheduled him to attend more functions and give more speeches.[38] Fox attempted to accommodate any request that he believed would raise money, no matter how far out of his way it took him.[39] He bristled, however, at what he felt were media intrusions into his personal life, for example when the Toronto Star reported that he had gone on a date.[40] Fox was left unsure whom he could trust in the media after negative articles began to emerge, including one by The Globe and Mail that highlighted tensions with his brother Darrell and claimed he was running because he held a grudge against a doctor who had misdiagnosed his condition, allegations he referred to as "trash".[41][42]

The physical demands of running a marathon every day took their toll on Fox's body. Apart from the rest days in Montreal taken at the request of the Cancer Society, he refused to take a day off, even on his 22nd birthday.[43] He frequently had shin splints and an inflamed knee. He developed cysts on his stump and experienced dizzy spells.[44] At one point, he had a soreness in his ankle that would not go away. Although he feared he had developed a stress fracture, he ran for three more days before seeking medical attention, and was then relieved to learn it was tendonitis and could be treated with painkillers.[45] Fox rejected calls for him to seek regular medical checkups,[46] and dismissed suggestions he was risking his future health.[41] By late August, Fox described that he was exhausted before he began the day's run.[47] On September 1, outside Thunder Bay, he was forced to stop briefly after he had an intense coughing fit and experienced pains in his chest. He resumed running as the crowds along the highway shouted out their encouragement.[48] A few miles later, short of breath and with continued chest pain, he asked Alward to drive him to a hospital.[49] The next day, Fox held a tearful press conference during which he announced that his cancer had returned and spread to his lungs. He was forced to end his run after 143 days and 5,373 kilometres (3,339 mi).[50] Fox refused offers to complete the run in his stead, stating that he wanted to complete his marathon himself.[4]

National response
External videos
video icon " The Terry Fox Story" – Terry Fox Foundation (4:03 min)
Fox had raised $1.7 million (equivalent to $6 million in 2023) when he was forced to abandon the Marathon.[51] A week after his run ended, the CTV Television Network organized a nationwide telethon in support of Fox and the Canadian Cancer Society.[52] Supported by Canadian and international celebrities, the five-hour event raised $10.5 million (equivalent to $37 million in 2023).[4] Among the donations were $1 million each by the governments of British Columbia and Ontario, the former to create a new research institute to be founded in Fox's name and the latter an endowment given to the Ontario Cancer Treatment and Research Foundation.[53] Donations continued throughout the winter, and by April over $23 million had been raised (equivalent to $73 million in 2023).[54]

Supporters and well-wishers from around the world inundated Fox with letters and tokens of support. At one point, he was receiving more mail than the rest of Port Coquitlam combined.[55] Such was his fame that one letter sent from the United States addressed simply to "Terry Fox, Canada" was successfully delivered.[56]

In September 1980, Fox was invested in a special ceremony as a Companion of the Order of Canada; he was, and remains, the youngest person to be so honoured.[57][58] The Lieutenant Governor of British Columbia named him to the Order of the Dogwood, the province's highest award.[59] Canada's Sports Hall of Fame commissioned a permanent exhibit,[60] and Fox was named the winner of the Lou Marsh Award for 1980 as the nation's top athlete.[61] He was named Canada's 1980 Newsmaker of the Year. The Ottawa Citizen described the national response to his marathon as "one of the most powerful outpourings of emotion and generosity in Canada's history".
"""

In [121]:
bpe_tokenizer = BPE()

In [122]:
bpe_tokenizer.init_vocab(train_text=text)

In [123]:
len(bpe_tokenizer.word_cnt)

1215

In [124]:
len(bpe_tokenizer.vocab)

75
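
The 75 entries are the distinct characters of the training text plus the two special tokens `</w>` and `<unk>`, which is also why the first merge later gets id 75. A small sketch of the same construction on a toy string follows; `base_vocab` is a hypothetical standalone helper written for this example and is not a method of the `BPE` class.

```python
def base_vocab(text):
    """Map each distinct non-whitespace character to an id, then add the special tokens."""
    symbols = sorted(set("".join(text.split())))
    vocab = {symbol: idx for idx, symbol in enumerate(symbols)}
    vocab["</w>"] = len(vocab)   # end-of-word marker
    vocab["<unk>"] = len(vocab)  # fallback for characters never seen in training
    return vocab

toy = base_vocab("low lower")
print(toy)  # {'e': 0, 'l': 1, 'o': 2, 'r': 3, 'w': 4, '</w>': 5, '<unk>': 6}
```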

In [125]:
num_merges = 100

old_size = bpe_tokenizer.get_encoded_size()

for i in range(num_merges):
    stats = bpe_tokenizer.get_stats()
    top_pair = sorted(stats.items(), key=lambda i:i[-1], reverse=True)[0][0]
    bpe_tokenizer.merge(top_pair)
    new_size = bpe_tokenizer.get_encoded_size()
    compression = old_size / new_size
    print(f"Iteration {i + 1}, Top Pair: {top_pair}, Compression: {compression:.4f}X")

Iteration 1, Top Pair: ('e', '</w>'), Compression: 1.0274X
Iteration 2, Top Pair: ('d', '</w>'), Compression: 1.0496X
Iteration 3, Top Pair: ('s', '</w>'), Compression: 1.0693X
Iteration 4, Top Pair: ('t', 'h'), Compression: 1.0892X
Iteration 5, Top Pair: ('n', '</w>'), Compression: 1.1072X
Iteration 6, Top Pair: ('e', 'r'), Compression: 1.1237X
Iteration 7, Top Pair: ('a', 'n'), Compression: 1.1393X
Iteration 8, Top Pair: ('t', '</w>'), Compression: 1.1532X
Iteration 9, Top Pair: ('e', 'd</w>'), Compression: 1.1670X
Iteration 10, Top Pair: (',', '</w>'), Compression: 1.1793X
Iteration 11, Top Pair: ('i', 'n'), Compression: 1.1915X
Iteration 12, Top Pair: ('o', 'n'), Compression: 1.2035X
Iteration 13, Top Pair: ('th', 'e</w>'), Compression: 1.2155X
Iteration 14, Top Pair: ('o', '</w>'), Compression: 1.2266X
Iteration 15, Top Pair: ('y', '</w>'), Compression: 1.2377X
Iteration 16, Top Pair: ('a', 'r'), Compression: 1.2491X
Iteration 17, Top Pair: ('er', '</w>'), Compression: 1.2593X
Ite
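
The compression figure is the number of symbols needed to write the corpus before any merges, divided by the number needed after the current merge. A worked example on a two-word toy vocabulary is sketched below; `encoded_size` mirrors `get_encoded_size` as a plain function, and the toy dictionaries are made up for illustration.

```python
def encoded_size(word_cnt):
    """Total symbol count of the corpus: symbols per word times word frequency."""
    return sum(len(word.split()) * count for word, count in word_cnt.items())

# toy corpus before and after merging the pair ('l', 'o')
before = {"l o w </w>": 5, "l o w e r </w>": 2}
after = {"lo w </w>": 5, "lo w e r </w>": 2}

old_size = encoded_size(before)  # 4*5 + 6*2 = 32
new_size = encoded_size(after)   # 3*5 + 5*2 = 25

print(f"Compression: {old_size / new_size:.4f}X")  # Compression: 1.2800X
```

Each merge removes one symbol per occurrence of the merged pair, so the ratio can only grow as merges are added, and it grows more slowly once the frequent pairs are used up.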

In [126]:
list(bpe_tokenizer.merges.values())

[75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 100,
 101,
 102,
 103,
 104,
 105,
 106,
 107,
 108,
 109,
 110,
 111,
 112,
 113,
 114,
 115,
 116,
 117,
 118,
 119,
 120,
 121,
 122,
 123,
 124,
 125,
 126,
 127,
 128,
 129,
 130,
 131,
 132,
 133,
 134,
 135,
 136,
 137,
 138,
 139,
 140,
 141,
 142,
 143,
 144,
 145,
 146,
 147,
 148,
 149,
 150,
 151,
 152,
 153,
 154,
 155,
 156,
 157,
 158,
 159,
 160,
 161,
 162,
 163,
 164,
 165,
 166,
 167,
 168,
 169,
 170,
 171,
 172,
 173,
 174]

In [127]:
bpe_tokenizer.merges

{('e', '</w>'): 75,
 ('d', '</w>'): 76,
 ('s', '</w>'): 77,
 ('t', 'h'): 78,
 ('n', '</w>'): 79,
 ('e', 'r'): 80,
 ('a', 'n'): 81,
 ('t', '</w>'): 82,
 ('e', 'd</w>'): 83,
 (',', '</w>'): 84,
 ('i', 'n'): 85,
 ('o', 'n'): 86,
 ('th', 'e</w>'): 87,
 ('o', '</w>'): 88,
 ('y', '</w>'): 89,
 ('a', 'r'): 90,
 ('er', '</w>'): 91,
 ('o', 'r'): 92,
 ('h', 'i'): 93,
 ('t', 'i'): 94,
 ('r', 'e'): 95,
 ('o', 'u'): 96,
 ('t', 'o</w>'): 97,
 ('an', 'd</w>'): 98,
 ('h', 'e</w>'): 99,
 ('e', 'n'): 100,
 ('a', 'l'): 101,
 ('F', 'o'): 102,
 ('a', '</w>'): 103,
 (']', '</w>'): 104,
 ('Fo', 'x'): 105,
 ('o', 'n</w>'): 106,
 ('a', 's</w>'): 107,
 ('o', 'f'): 108,
 ('.', '['): 109,
 ('i', 'n</w>'): 110,
 ('of', '</w>'): 111,
 ('hi', 's</w>'): 112,
 ('Fox', '</w>'): 113,
 ('g', '</w>'): 114,
 ('l', 'e'): 115,
 ('.', '</w>'): 116,
 ('w', 'as</w>'): 117,
 ('in', 'g</w>'): 118,
 ('c', 'h'): 119,
 ('d', 'i'): 120,
 ('s', 't'): 121,
 ('a', 'm'): 122,
 ('a', 't'): 123,
 ('al', '</w>'): 124,
 ('r', 'o'): 125,
 ('r

### Now let's try encoding some text

In [128]:
test_text = "My name is Soham Mistri."

In [129]:
# separate the text into words (\s+ already matches newlines)
test_words = re.split(r'\s+', test_text)
test_words = [" ".join(w) + " </w>" for w in test_words]
test_words

['M y </w>',
 'n a m e </w>',
 'i s </w>',
 'S o h a m </w>',
 'M i s t r i . </w>']

In [130]:
id2token = bpe_tokenizer.get_id2token()
id2token

{0: '"',
 1: '$',
 2: "'",
 3: '(',
 4: ')',
 5: ',',
 6: '-',
 7: '.',
 8: '0',
 9: '1',
 10: '2',
 11: '3',
 12: '4',
 13: '5',
 14: '6',
 15: '7',
 16: '8',
 17: '9',
 18: ':',
 19: ';',
 20: 'A',
 21: 'B',
 22: 'C',
 23: 'D',
 24: 'E',
 25: 'F',
 26: 'G',
 27: 'H',
 28: 'I',
 29: 'J',
 30: 'L',
 31: 'M',
 32: 'N',
 33: 'O',
 34: 'P',
 35: 'Q',
 36: 'R',
 37: 'S',
 38: 'T',
 39: 'U',
 40: 'V',
 41: 'W',
 42: 'Y',
 43: '[',
 44: ']',
 45: 'a',
 46: 'b',
 47: 'c',
 48: 'd',
 49: 'e',
 50: 'f',
 51: 'g',
 52: 'h',
 53: 'i',
 54: 'j',
 55: 'k',
 56: 'l',
 57: 'm',
 58: 'n',
 59: 'o',
 60: 'p',
 61: 'q',
 62: 'r',
 63: 's',
 64: 't',
 65: 'u',
 66: 'v',
 67: 'w',
 68: 'x',
 69: 'y',
 70: 'z',
 71: 'é',
 72: '–',
 73: '</w>',
 74: '<unk>',
 75: 'e</w>',
 76: 'd</w>',
 77: 's</w>',
 78: 'th',
 79: 'n</w>',
 80: 'er',
 81: 'an',
 82: 't</w>',
 83: 'ed</w>',
 84: ',</w>',
 85: 'in',
 86: 'on',
 87: 'the</w>',
 88: 'o</w>',
 89: 'y</w>',
 90: 'ar',
 91: 'er</w>',
 92: 'or',
 93: 'hi',
 94: 't

In [160]:
def decode(encoded_text):
    # concatenate the token strings, then turn each end-of-word marker back into a space
    decoded_tokens = "".join(id2token[enc] for enc in encoded_text)
    decoded_text = decoded_tokens.split("</w>")
    return " ".join(decoded_text).strip()
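
To make the decoding step concrete, here is a small sketch on a hand-made id table. `toy_id2token` and `toy_decode` are hypothetical names used only for illustration, and the ids are not the ones learned above.

```python
# Hypothetical id table; the real one comes from the trained tokenizer.
toy_id2token = {0: "lo", 1: "w</w>", 2: "est</w>", 3: "new"}

def toy_decode(ids, id2token):
    # Glue the token strings together: "low</w>newest</w>"
    joined = "".join(id2token[i] for i in ids)
    # Every "</w>" marks a word boundary, so splitting on it recovers the words.
    return " ".join(joined.split("</w>")).strip()

decoded = toy_decode([0, 1, 3, 2], toy_id2token)
print(decoded)
```

The final `strip()` removes the trailing space left by the empty string that follows the last `</w>`.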

In [249]:
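def encode(text):
    # split on whitespace, then spell each word as space-separated symbols plus the end-of-word marker
    words = re.split(r'\s+', text)
    words = [" ".join(w) + " </w>" for w in words]

    word_encoding = []
    original_length = 0  # symbol count before any merges are applied

    for word in words:
        tokens = word.split()
        original_length += len(tokens)

        # replay the learned merges in the order they were learned (dicts keep insertion order)
        for pair in bpe_tokenizer.merges:
            i = 0
            while i < len(tokens) - 1:
                if pair[0] == tokens[i] and pair[1] == tokens[i + 1]:
                    # fuse the pair in place; stay at i so the next comparison starts from the fused token
                    tokens[i:i+2] = [pair[0] + pair[1]]
                else:
                    i += 1

        # map tokens to ids, falling back to <unk> for anything outside the vocab
        for token in tokens:
            if token in bpe_tokenizer.vocab:
                word_encoding.append(bpe_tokenizer.vocab[token])
            else:
                word_encoding.append(bpe_tokenizer.vocab["<unk>"])

    compression = original_length / len(word_encoding)
    print(f"Compression: {compression:.4f}X")
    return word_encoding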
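
The encoder replays merges in the order they were learned, and that order matters. Below is a minimal sketch of the same loop on a single word, using the five merges from the paper example at the top of this notebook. `apply_merges` is a hypothetical helper, not part of the tokenizer.

```python
def apply_merges(symbols, merges):
    """Apply each merge rule, in order, to a list of symbols."""
    tokens = list(symbols)
    for left, right in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]  # fuse the pair in place
            else:
                i += 1
    return tokens

# Merge rules in the order the paper example learned them.
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]
word = ["l", "o", "w", "e", "s", "t", "</w>"]

in_order = apply_merges(word, merges)
reversed_order = apply_merges(word, list(reversed(merges)))
print(in_order)
print(reversed_order)
```

In learned order the word collapses to `low` and `est</w>`. In reversed order, later rules such as `('es', 't')` run before the `es` token exists, so they never fire.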

In [268]:
text = """This is a repository of code examples for the 2021 edition of CMU CS 11-711 Advanced NLP."""

In [269]:
word_encoding = encode(text)

Compression: 1.5789X


In [270]:
decode(word_encoding)

'This is a repository of code examples for the 2021 edition of CMU CS 11-711 Advanced NLP.'

In [271]:
assert text == decode(word_encoding)
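
The round trip holds here because the text uses single spaces between words. As a hedged sketch, the character-level version below mimics the split and join steps to show that runs of whitespace are not preserved. Merges are omitted because fusing tokens does not change the concatenated string. `roundtrip` is a hypothetical helper.

```python
import re

def roundtrip(text):
    # Same word splitting as the encoder: any whitespace run is a separator.
    words = [" ".join(w) + " </w>" for w in re.split(r"\s+", text)]
    symbols = [s for w in words for s in w.split()]
    # Same reconstruction as the decoder: "</w>" becomes a single space.
    return " ".join("".join(symbols).split("</w>")).strip()

clean = roundtrip("a b c")
lossy = roundtrip("a  b\nc")
print(repr(clean), repr(lossy))
```

Both inputs come back as `'a b c'`, so the double space and the newline in the second input are lost.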

## Finally, the BPE tokenizer

In [272]:
import re

class BPE:
    def __init__(self):
        self.word_cnt = {}
        self.vocab = {}
        self.merges = {}
        self.idx = 0

    def init_vocab(self, train_text):
        # split the training text into words on whitespace
        words = re.split(r'\s+', train_text)

        # base vocabulary: every distinct character, in sorted order
        letters = sorted({ch for word in words for ch in word})
        for letter in letters:
            self.vocab[letter] = self.idx
            self.idx += 1

        # add the eow token
        self.vocab["</w>"] = self.idx
        self.idx += 1

        # add the <unk> token
        self.vocab["<unk>"] = self.idx
        self.idx += 1

        # build the word-count dict in the same format as before: 'l o w </w>' -> count
        dict_words = []
        for word in words:
            ss_word = " ".join(w for w in word)
            ss_word += " </w>"
            dict_words.append(ss_word)

        for word in dict_words:
            self.word_cnt[word] = self.word_cnt.get(word, 0) + 1

    def get_stats(self):
        # count adjacent symbol pairs, weighting each word by how often it occurs
        stats = {}

        for word, count in self.word_cnt.items():
            letters = word.split()
            for i in range(len(letters) - 1):
                pair = (letters[i], letters[i + 1])
                stats[pair] = stats.get(pair, 0) + count

        return stats
    
    def merge(self, top_pair):
        # register the merged token in the vocab and record the merge rule
        self.vocab["".join(top_pair)] = self.idx
        self.merges[top_pair] = self.idx
        self.idx += 1

        # rewrite every word in word_cnt with the pair fused into one symbol
        v_out = {}
        bigram = re.escape(' '.join(top_pair))
        p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
        for word in self.word_cnt:
            w_out = p.sub(''.join(top_pair), word)
            v_out[w_out] = self.word_cnt[word]
        self.word_cnt = v_out 

    def get_encoded_size(self):
        encoded_size = 0
        for word, count in self.word_cnt.items():
            letters = word.split()
            encoded_size += len(letters) * count

        return encoded_size
    
    def get_id2token(self):
        id2token = {}
        for token, idx in self.vocab.items():
            id2token[idx] = token

        return id2token
    
    def encode(self, text):
        words = re.split(r'\s+|\n', text) 
        words = [" ".join(w) + " </w>" for w in words]

        word_encoding = []
        original_length = 0
        
        for word in words: 
            tokens = word.split()
            original_length += len(tokens)

            for pair in self.merges:
                i = 0
                while i < len(tokens) - 1 :
                    if pair[0] == tokens[i] and pair[1] == tokens[i + 1]:
                        tokens[i:i+2] = [pair[0] + pair[1]]
                    else:
                        i += 1

            for token in tokens:
                if token in self.vocab:
                    word_encoding.append(self.vocab[token])
                else:
                    word_encoding.append(self.vocab["<unk>"])

        compression = original_length / len(word_encoding)
        print(f"Compression: {compression:.4f}X")
        return word_encoding
    
    def decode(self, encoded_text):
        id2token = self.get_id2token()
        decoded_tokens = "".join([id2token[enc] for enc in encoded_text])
        decoded_text = decoded_tokens.split("</w>")
        return " ".join(decoded_text).strip()

In [276]:
text = """
Terrance Stanley Fox (July 28, 1958 – June 28, 1981) was a Canadian athlete, humanitarian, and cancer research activist. In 1980, having had one leg amputated due to cancer, he embarked on a cross-Canada run to raise money and awareness for cancer research. The annual Terry Fox Run, first held in 1981, has grown to involve millions of participants in over 60 countries and is the world's largest one-day fundraiser for cancer research; over C$900 million has been raised in his name through the Terry Fox Research Institute as of September 2024.[1]

Fox was a distance runner and basketball player for Port Coquitlam Senior Secondary School, later named after him, and Simon Fraser University. His right leg was amputated in 1977 after he was diagnosed with osteosarcoma, though he continued to run using an artificial leg. He also played wheelchair basketball in Vancouver, winning three national championships.

In 1980, he began the Marathon of Hope to raise money for cancer research. He hoped to raise one dollar from each of Canada's 24 million people at the time. He began with little fanfare from St John's, Newfoundland and Labrador, in April that year, and ran the equivalent of a full marathon every day. Fox had become a national star by the time he reached Ontario; he made numerous public appearances with businessmen, athletes, and politicians in his efforts to raise money. He was forced to end his run outside Thunder Bay after the cancer spread to his lungs. Fox died nine months later on June 28, 1981.

Fox was the youngest person named a Companion of the Order of Canada and won the 1980 Lou Marsh Award as the nation's top sportsman. He was named Canada's Newsmaker of the Year in both 1980 and 1981 by The Canadian Press. Considered a national hero, he has had many buildings, statues, roads, and parks named in his honour across the country.

Early life and cancer
Terrance Stanley Fox was born on July 28, 1958, in Winnipeg, Manitoba, to Rolland and Betty Fox. Rolland was a switchman for the Canadian National Railway.[2] Fox spent his childhood in the Transcona suburb of Winnipeg, where he attended Wayoata Elementary School.[3] Fox had an elder brother, Fred, a younger brother, Darrell, and a younger sister, Judith.[4] Fox's maternal grandmother is Métis and Fox's younger brother Darrell has official Métis status.[5]

His family moved to Surrey in British Columbia in 1966, then settled in Port Coquitlam in 1968.[4] He had doting parents,[6] and his father recalled that Fox was extremely competitive.[7] Fox attempted to join his school's basketball team, though struggled because of his height. His coach suggested that Fox try cross-country running, which Fox did in order to impress his coach.[8][9][10] Fox continued to improve on his basketball skills, and in grade 12 he won his high school's athlete of the year award.[4] Fox was unsure whether he wanted to go to university, but Fox's mother convinced him to enrol at Simon Fraser University. He studied kinesiology with the intention of becoming a physical education teacher.[11] He was also a member of the junior varsity basketball team.[4]

a prosthetic leg in a display case
Fox's favourite prosthetic leg used during his Marathon of Hope, on display at the Canadian Museum of History
On November 12, 1976, Fox was driving to the family home in Port Coquitlam when he was distracted by nearby bridge construction and crashed into the back of a pickup truck. Fox injured his right knee in the crash and felt pain in December, but chose to ignore it until the end of basketball season.[12] By March 1977, the pain had intensified and he went to a hospital, where he was diagnosed with osteosarcoma, a form of cancer that often starts near the knees.[4] Fox believed his car accident weakened his knee and left it vulnerable to the disease, though his doctors argued there was no connection.[13] He was told that his leg had to be amputated, he would require chemotherapy treatment, and that recent medical advances meant he had a 50-per cent chance of survival. Fox learned that two years before, the figure would have been only 15 per cent; the improvement in survival rates impressed on him the value of cancer research.[14] With the help of an artificial leg, Fox was walking three weeks after the amputation.[4] Doctors were impressed with Fox's positive outlook, saying it contributed to his rapid recovery.[15] Fox endured sixteen months of chemotherapy and found the time he spent in the British Columbia Cancer Control Agency facility difficult as he watched fellow cancer patients suffer and die from the disease.[16]

In the summer of 1977, Rick Hansen, working with the Canadian Wheelchair Sports Association, invited Fox to try out for his wheelchair basketball team.[17] Less than two months after learning how to play the sport, Fox was named a member of the team for the national championship in Edmonton, Alberta.[18] He won three national titles with the team,[4] and was named an all-star by the North American Wheelchair Basketball Association in 1980.[19]

Marathon of Hope

Fox in 1980
The night before his cancer surgery, Fox had been given an article about Dick Traum, the first amputee to complete the New York City Marathon.[4] The article inspired him; he embarked on a 14-month training program, telling his family he planned to compete in a marathon himself.[2] In private, he devised a more extensive plan. His hospital experiences had made Fox angry at how little money was dedicated to cancer research. He intended to run the length of Canada in the hope of increasing cancer awareness, a goal he initially divulged only to his friend Douglas Alward.[20]

Fox ran with an unusual gait, as he was required to hop-step on his good leg due to the extra time the springs in his artificial leg required to reset after each step.[21] He found the training painful as the additional pressure he had to place on both his good leg and his stump led to bone bruises, blisters, and intense pain. Fox found that after about 20 minutes of each run, he crossed a pain threshold and the run became easier.[22]

On September 2, 1979, Fox competed in a 17-mile (27 km) road race in Prince George. He finished in last place, ten minutes behind his closest competitor, but his effort was met with tears and applause from the other participants.[4] Following the marathon, he revealed his full plan to his family.[23] His mother discouraged him, angering Fox, though she later came to support the project. She recalled, "He said, 'I thought you'd be one of the first persons to believe in me.' And I wasn't. I was the first person who let him down".[24] Fox initially hoped to raise $1 million,[24] then $10 million, but later sought to raise $1 for each of Canada's 24 million citizens.[25]

Preparation
On October 15, 1979, Fox sent a letter to the Canadian Cancer Society in which he announced his goal and appealed for funding. He stated that he would "conquer" his disability, and promised to complete his run, even if he had to "crawl every last mile". Explaining why he wanted to raise money for research, Fox described his personal experience of cancer treatment:

I soon realized that that would only be half my quest, for as I went through the 16 months of the physically and emotionally draining ordeal of chemotherapy, I was rudely awakened by the feelings that surrounded and coursed through the cancer clinic. There were faces with the brave smiles, and the ones who had given up smiling. There were feelings of hopeful denial, and the feelings of despair. My quest would not be a selfish one. I could not leave knowing these faces and feelings would still exist, even though I would be set free from mine. Somewhere the hurting must stop ... and I was determined to take myself to the limit for this cause.[26]

Fox closed his letter with the statement: "We need your help. The people in cancer clinics all over the world need people who believe in miracles. I am not a dreamer, and I am not saying that this will initiate any kind of definitive answer or cure to cancer. I believe in miracles. I have to."[26]


The van used in the Marathon of Hope on display at the Royal British Columbia Museum
The Cancer Society was skeptical of his success but agreed to support Fox once he had acquired sponsors and requested he get a medical certificate from a heart specialist stating that he was fit to attempt the run. Fox was diagnosed with left ventricular hypertrophy – an enlarged heart – a condition commonly associated with athletes. Doctors warned Fox of the potential risks he faced, though they did not consider his condition a significant concern. They endorsed his participation when he promised that he would stop immediately if he began to experience any heart problems.[27]

A second letter was sent to several corporations seeking donations for a vehicle and running shoes, and to cover the other costs of the run.[28] Fox sent other letters asking for grants to buy a running leg.[28] The Ford Motor Company donated a camper van,[6] while Imperial Oil contributed fuel, and Adidas his running shoes.[29] Fox turned away any company that requested he endorse their products and refused any donation that carried conditions, as he insisted that nobody was to profit from his run.[6]

Start of the marathon

Terry Fox Statue at Mile 0 in St. John's, Canada
The Marathon began on April 12, 1980, when Fox dipped his right leg in the Atlantic Ocean near St. John's, Newfoundland and Labrador, and filled two large bottles with ocean water. He intended to keep one as a souvenir and pour the other into the Pacific Ocean upon completing his journey at Victoria, British Columbia.[25] Fox was supported on his run by Doug Alward, who drove the van and cooked meals.[29]

Fox was met with gale-force winds, heavy rain, and a snowstorm in the first days of his run.[2] He was initially disappointed with the reception he received but was heartened upon arriving in Channel-Port aux Basques, Newfoundland and Labrador, where the town's 10,000 residents presented him with a donation of over $10,000.[29] Throughout the trip, Fox frequently expressed his anger and frustration to those he saw as impeding the run, and he fought regularly with Alward. When they reached Nova Scotia, they were barely on speaking terms, and it was arranged for Fox's brother Darrell, then 17, to join them as a buffer.[24]

Fox left the Maritimes on June 10 and faced new challenges upon entering Quebec due to his group's inability to speak French[30] and drivers who continually forced him off the road.[31] Fox arrived in Montreal on June 22, one-third of the way through his 8,000-kilometre (5,000 mi) journey, having collected over $200,000 in donations.[21] Fox's run caught the attention of Isadore Sharp, the founder and CEO of Four Seasons Hotels and Resorts, who lost a son to melanoma in 1978 just a year after Terry's diagnosis.[32] Sharp gave food and accommodation at his hotels to Fox's team. When Fox was discouraged because so few people were making donations, Sharp pledged $2 a mile and persuaded close to 1,000 other corporations to do the same.[33] Fox was convinced by the Canadian Cancer Society that arriving in Ottawa for Canada Day would aid fundraising efforts, so he remained in Montreal for a few extra days.[31]

Ontario and marathon's end

The Terry Fox Monument in Thunder Bay
Fox crossed into Ontario on the last Saturday in June, and he was met by a brass band and thousands of residents who lined the streets to cheer him on, while the Ontario Provincial Police gave him an escort throughout the province.[34] Despite the sweltering heat of summer, he continued to run 26 miles (42 km) per day.[30] On his arrival in Ottawa, Fox met Governor General Ed Schreyer, Prime Minister Pierre Trudeau, and was the guest of honour at numerous sporting events in the city.[34] In front of over 16,000 fans, he performed a ceremonial kickoff at a Canadian Football League game between the Ottawa Rough Riders and Saskatchewan Roughriders,[35] and was given a standing ovation. Fox's journal reflected his growing excitement at the reception he had received.[36]

On July 11, Fox arrived in Toronto where a crowd of 10,000 people met him and he was honoured in Nathan Phillips Square.[37] As he ran to the square, he was joined on the road by many people, including National Hockey League star Darryl Sittler, who presented Fox with his 1980 All-Star Game jersey. The Cancer Society estimated it collected $100,000 in donations that day alone.[4] That evening he threw the ceremonial first pitch at Exhibition Stadium preceding a baseball game between the Toronto Blue Jays and the Cleveland Indians. As he continued through southern Ontario, he was met by Hockey Hall of Fame Hockey player Bobby Orr who presented him with a cheque for $25,000. Fox considered meeting Orr the highlight of his journey.[4]

refer to caption
Fox's path across eastern Canada. He began at St. John's on the east coast and ran west.
As Fox's fame grew, the Cancer Society scheduled him to attend more functions and give more speeches.[38] Fox attempted to accommodate any request that he believed would raise money, no matter how far out of his way it took him.[39] He bristled, however, at what he felt were media intrusions into his personal life, for example when the Toronto Star reported that he had gone on a date.[40] Fox was left unsure whom he could trust in the media after negative articles began to emerge, including one by The Globe and Mail that highlighted tensions with his brother Darrell and claimed he was running because he held a grudge against a doctor who had misdiagnosed his condition, allegations he referred to as "trash".[41][42]

The physical demands of running a marathon every day took their toll on Fox's body. Apart from the rest days in Montreal taken at the request of the Cancer Society, he refused to take a day off, even on his 22nd birthday.[43] He frequently had shin splints and an inflamed knee. He developed cysts on his stump and experienced dizzy spells.[44] At one point, he had a soreness in his ankle that would not go away. Although he feared he had developed a stress fracture, he ran for three more days before seeking medical attention, and was then relieved to learn it was tendonitis and could be treated with painkillers.[45] Fox rejected calls for him to seek regular medical checkups,[46] and dismissed suggestions he was risking his future health.[41] By late August, Fox described that he was exhausted before he began the day's run.[47] On September 1, outside Thunder Bay, he was forced to stop briefly after he had an intense coughing fit and experienced pains in his chest. He resumed running as the crowds along the highway shouted out their encouragement.[48] A few miles later, short of breath and with continued chest pain, he asked Alward to drive him to a hospital.[49] The next day, Fox held a tearful press conference during which he announced that his cancer had returned and spread to his lungs. He was forced to end his run after 143 days and 5,373 kilometres (3,339 mi).[50] Fox refused offers to complete the run in his stead, stating that he wanted to complete his marathon himself.[4]

National response
External videos
video icon " The Terry Fox Story" – Terry Fox Foundation (4:03 min)
Fox had raised $1.7 million (equivalent to $6 million in 2023) when he was forced to abandon the Marathon.[51] A week after his run ended, the CTV Television Network organized a nationwide telethon in support of Fox and the Canadian Cancer Society.[52] Supported by Canadian and international celebrities, the five-hour event raised $10.5 million (equivalent to $37 million in 2023).[4] Among the donations were $1 million each by the governments of British Columbia and Ontario, the former to create a new research institute to be founded in Fox's name and the latter an endowment given to the Ontario Cancer Treatment and Research Foundation.[53] Donations continued throughout the winter, and by April over $23 million had been raised (equivalent to $73 million in 2023).[54]

Supporters and well-wishers from around the world inundated Fox with letters and tokens of support. At one point, he was receiving more mail than the rest of Port Coquitlam combined.[55] Such was his fame that one letter sent from the United States addressed simply to "Terry Fox, Canada" was successfully delivered.[56]

In September 1980, Fox was invested in a special ceremony as a Companion of the Order of Canada; he was, and remains, the youngest person to be so honoured.[57][58] The Lieutenant Governor of British Columbia named him to the Order of the Dogwood, the province's highest award.[59] Canada's Sports Hall of Fame commissioned a permanent exhibit,[60] and Fox was named the winner of the Lou Marsh Award for 1980 as the nation's top athlete.[61] He was named Canada's 1980 Newsmaker of the Year. The Ottawa Citizen described the national response to his marathon as "one of the most powerful outpourings of emotion and generosity in Canada's history".
"""

In [277]:
bpe_tokenizer = BPE()

In [278]:
bpe_tokenizer.init_vocab(train_text=text)

In [279]:
num_merges = 100

old_size = bpe_tokenizer.get_encoded_size()

for i in range(num_merges):
    stats = bpe_tokenizer.get_stats()
    top_pair = max(stats, key=stats.get)  # most frequent pair; ties go to the pair seen first
    bpe_tokenizer.merge(top_pair)
    new_size = bpe_tokenizer.get_encoded_size()
    compression = old_size / new_size
    print(f"Iteration {i + 1}, Top Pair: {top_pair}, Compression: {compression:.4f}X")

Iteration 1, Top Pair: ('e', '</w>'), Compression: 1.0274X
Iteration 2, Top Pair: ('d', '</w>'), Compression: 1.0496X
Iteration 3, Top Pair: ('s', '</w>'), Compression: 1.0693X
Iteration 4, Top Pair: ('t', 'h'), Compression: 1.0892X
Iteration 5, Top Pair: ('n', '</w>'), Compression: 1.1072X
Iteration 6, Top Pair: ('e', 'r'), Compression: 1.1237X
Iteration 7, Top Pair: ('a', 'n'), Compression: 1.1393X
Iteration 8, Top Pair: ('t', '</w>'), Compression: 1.1532X
Iteration 9, Top Pair: ('e', 'd</w>'), Compression: 1.1670X
Iteration 10, Top Pair: (',', '</w>'), Compression: 1.1793X
Iteration 11, Top Pair: ('i', 'n'), Compression: 1.1915X
Iteration 12, Top Pair: ('o', 'n'), Compression: 1.2035X
Iteration 13, Top Pair: ('th', 'e</w>'), Compression: 1.2155X
Iteration 14, Top Pair: ('o', '</w>'), Compression: 1.2266X
Iteration 15, Top Pair: ('y', '</w>'), Compression: 1.2377X
Iteration 16, Top Pair: ('a', 'r'), Compression: 1.2491X
Iteration 17, Top Pair: ('er', '</w>'), Compression: 1.2593X
Ite
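
When two pairs have the same count, the choice of top pair depends on tie-breaking. The sketch below uses made-up counts to check that a stable descending sort and `max` both pick the pair that was inserted first. The function names are hypothetical.

```python
def top_pair_sorted(stats):
    # Python's sort is stable even with reverse=True, so tied pairs keep insertion order.
    return sorted(stats.items(), key=lambda kv: kv[1], reverse=True)[0][0]

def top_pair_max(stats):
    # max returns the first maximal key it encounters while iterating the dict.
    return max(stats, key=stats.get)

# Made-up counts with a tie between the first two pairs.
toy_stats = {("e", "s"): 9, ("s", "t"): 9, ("l", "o"): 7}

picked_sorted = top_pair_sorted(toy_stats)
picked_max = top_pair_max(toy_stats)
print(picked_sorted, picked_max)
```

Either way the learned merges are deterministic for a given corpus, because dicts preserve insertion order.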

In [289]:
test_text = """Imagine the downfall man united having that this man is making a part 2 of any video."""

In [290]:
word_encoding = bpe_tokenizer.encode(test_text)

Compression: 1.7551X


In [291]:
bpe_tokenizer.decode(word_encoding)

'Imagine the downfall man united having that this man is making a part 2 of any video.'
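
This sentence decodes cleanly because every character in it also occurs in the training text. The sketch below shows what happens otherwise, using a tiny hand-made vocabulary. `toy_vocab`, `encode_chars` and `decode_ids` are hypothetical, and no merges are applied.

```python
# Hypothetical four-entry vocabulary; the real one is built by init_vocab.
toy_vocab = {"a": 0, "b": 1, "</w>": 2, "<unk>": 3}
toy_id2token = {idx: tok for tok, idx in toy_vocab.items()}

def encode_chars(text):
    ids = []
    for word in text.split():
        for sym in list(word) + ["</w>"]:
            # Characters never seen in training fall back to the <unk> id.
            ids.append(toy_vocab.get(sym, toy_vocab["<unk>"]))
    return ids

def decode_ids(ids):
    joined = "".join(toy_id2token[i] for i in ids)
    return " ".join(joined.split("</w>")).strip()

ids = encode_chars("ab zb")
decoded = decode_ids(ids)
print(ids, decoded)
```

The unseen `z` is encoded as `<unk>` and cannot be recovered, so the decoded string differs from the input.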