# Fetch Rewards Coding Exercise

## The samples:

In [1]:
sample_1 = "The easiest way to earn points with Fetch Rewards is to just shop for the products you already love. If you have any participating brands on your receipt, you'll get points based on the cost of the products. You don't need to clip any coupons or scan individual barcodes. Just scan each grocery receipt after you shop and we'll find the savings for you."
sample_2 = "The easiest way to earn points with Fetch Rewards is to just shop for the items you already buy. If you have any eligible brands on your receipt, you will get points based on the total cost of the products. You do not need to cut out any coupons or scan individual UPCs. Just scan your receipt after you check out and we will find the savings for you."
sample_3 = "We are always looking for opportunities for you to earn more points, which is why we also give you a selection of Special Offers. These Special Offers are opportunities to earn bonus points on top of the regular points you earn every time you purchase a participating brand. No need to pre-select these offers, we'll give you the points whether or not you knew about the offer. We just think it is easier that way."

## The solution:

This function assumes its inputs are strings with no numeric characters. It expands the contractions `n't` and `'ll` into full words, removes periods and commas, and lowercases the text before comparing. The similarity score reflects how alike the two texts are in both word order and vocabulary.
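
As a sketch of what the score computes, let $a$ and $b$ be the two token lists after padding to a common length $n$, and let $V_1$ and $V_2$ be the sets of unique tokens in the first and second sample. This notation is introduced here for illustration only and does not appear in the code.

$$
\text{similarity} = \frac{\sum_{i=1}^{n} [a_i = b_i] \; + \; \sum_{w \in V_1} [w \in V_2]}{n + |V_1|}
$$

Here $[\cdot]$ is 1 when the condition holds and 0 otherwise, and the result is rounded to two decimal places. Because the denominator uses $|V_1|$, swapping the two arguments can change the score when the samples have different vocabulary sizes.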

In [2]:
#We will need this to identify the vocabulary of each sample
def unique(list1):
    unique_list = []
    for x in list1:
        if x not in unique_list:
            unique_list.append(x)
    return unique_list

def similarity(sample_1, sample_2):
    #replace contractions with full words and remove punctuation, would normally use regex for this
    sample_1 = sample_1.replace("n't", " not").replace("'ll", " will").replace(".", "").replace(",", "").lower()
    sample_2 = sample_2.replace("n't", " not").replace("'ll", " will").replace(".", "").replace(",", "").lower()
    
    #tokenize (separate into a list of individual words)
    sample_1_tokens = sample_1.split(" ")
    sample_2_tokens = sample_2.split(" ")    
    
    #Pad the shorter sample with blank strings so both token lists have equal length
    #(when the lengths already match, nothing is added)
    max_len = max(len(sample_1_tokens), len(sample_2_tokens))
    sample_1_padded = sample_1_tokens + [''] * (max_len - len(sample_1_tokens))
    sample_2_padded = sample_2_tokens + [''] * (max_len - len(sample_2_tokens))
    
    #This portion checks whether the samples have the same word in the same position
    #This is why both samples need to be the same length
    sameness = []
    for token in range(len(sample_2_padded)):
        if sample_1_padded[token] == sample_2_padded[token]:
            sameness.append(1)
        else:
            sameness.append(0)
            
    #This portion checks whether the two samples share vocabulary, even if the words are in a different order
    #Each unique word in sample 1 is looked up in the vocabulary of sample 2
    sample_1_unique = unique(sample_1_tokens)
    sample_2_unique = unique(sample_2_tokens)
    for word in sample_1_unique:
        if word in sample_2_unique:
            sameness.append(1)
        else:
            sameness.append(0)
            
    return round((sum(sameness) / len(sameness)), 2)
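
The comment in `similarity` notes that regex would normally handle the text cleanup. A minimal sketch of that step might look like the following. The `normalize` helper and the `CONTRACTIONS` table are hypothetical names introduced here, not part of the solution above. This version also differs slightly in behavior: it strips all punctuation rather than only periods and commas, and it splits on any run of whitespace.

```python
import re

# Hypothetical helper, not part of the original solution.
# Covers the same two contraction patterns as the chained .replace() calls.
CONTRACTIONS = [
    (re.compile(r"n't\b"), " not"),
    (re.compile(r"'ll\b"), " will"),
]

def normalize(text):
    """Expand contractions, strip punctuation, lowercase, and tokenize."""
    for pattern, replacement in CONTRACTIONS:
        text = pattern.sub(replacement, text)
    # Drop every character that is neither a word character nor whitespace
    text = re.sub(r"[^\w\s]", "", text)
    # split() with no argument collapses repeated whitespace
    return text.lower().split()

tokens = normalize("You don't need coupons, we'll find them.")
```

Like the original replacements, this simple pattern mishandles irregular contractions such as "won't" and "can't", so a fuller version would need an explicit lookup table.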

## Test:

In [3]:
similarity(sample_1, sample_2)

0.63

In [4]:
similarity(sample_1, sample_3)

0.16

In [5]:
similarity(sample_2, sample_3)

0.14

Since Samples 1 and 2 are so similar to each other, it makes sense that they receive similar scores when compared to Sample 3. A better comparison could likely be achieved with vector representations such as TF-IDF or BERT embeddings.
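
To illustrate the TF-IDF idea, here is a rough sketch using only the standard library. The function names `tfidf_vectors` and `cosine` are hypothetical, the tokenizer is a plain lowercase whitespace split, and the smoothed IDF weighting is one common choice among several. It is not the method used for the scores above.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (illustrative sketch)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps a nonzero weight for terms found in every document
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy documents, assumed for illustration only
docs = ["scan your receipt", "scan your receipt after you shop", "earn bonus points"]
vecs = tfidf_vectors(docs)
score = cosine(vecs[0], vecs[1])
```

Unlike the positional comparison, cosine similarity over TF-IDF vectors is symmetric and ignores word order, and it gives less weight to words that appear in every document.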