## CS5803 NLP
### Assignment 1
#### Tanmay Garg, Tanmay Goyal, Tanay Yadav
#### Roll no: CS20BTECH11063, AI20BTECH11021, AI20BTECH11026

### **BLEU Score**

Let $x$ and $y$ be two sentences we wish to compare. Then, we define the modified n-gram precision as:

$$p_n = \frac{\sum\limits_{n\text{-gram} \in x\cap y}\min\left(count_x(n\text{-gram}) , count_y(n\text{-gram})\right)}{\sum\limits_{n\text{-gram} \in x}count_x(n\text{-gram})}$$

Here, $y$ is the ground truth (reference) and $x$ is the machine-translated text (candidate).

Define the BLEU Score as:

$$BLEU = BP \times \exp\left(\sum\limits_{n=1}^N w_n \log(p_n)\right)$$
where $N = 4$, $w_n = \frac{1}{N}$, and $BP$ is the brevity penalty, which we set to 1.
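
With uniform weights, the definition above can be rewritten as a geometric mean, which may make the role of each $p_n$ easier to see. This is only an algebraic restatement of the formula, not an additional definition:

$$BLEU = BP \times \exp\left(\frac{1}{N}\sum\limits_{n=1}^N \log(p_n)\right) = BP \times \left(\prod\limits_{n=1}^N p_n\right)^{1/N}$$

In particular, if any single $p_n$ is 0, the product is 0 (and $\log(p_n)$ is undefined), so the score is taken to be 0.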

1. Implement the BLEU score metric and pre-process the text by lower-casing it and removing all punctuation.

2. Use this implementation to find the BLEU score when $x$ = "The boys were playing happily on the ground." and $y$ = "The boys were playing football on the field."

3. Explain why we take a minimum in the numerator.

4. Use the implementation to find the BLEU score for 5 pairs of sentences and explain the disadvantages of the BLEU score.

In [5]:
import re
from math import log, exp

In [6]:
def preprocess_text(text):
    '''
    Function to preprocess the text by removing punctuation and converting to lower case
    '''
    # [^\w\s] matches any character that is neither a word character
    # (\w: letters, digits, underscore) nor whitespace (\s)
    text = re.sub(r'[^\w\s]', '', text).lower()
    return text

def n_gram_dict(text, n):
    '''
    Function to create a dictionary consisting of the n-grams and their counts
    '''
    # split() with no argument also handles repeated whitespace
    text_list = text.split()
    counts = {}
    # slide a window of size n over the tokens; repeated n-grams increment the count
    for i in range(n - 1, len(text_list)):
        key = tuple(text_list[i - n + 1 : i + 1])
        counts[key] = counts.get(key, 0) + 1
    return counts

def modified_ngram_precision(n_gram_dict_x, n_gram_dict_y):
    '''
    Calculates the modified n-gram precision given the n-gram dictionaries for x and y
    '''
    numerator = 0
    denominator = 0
    for n_gram, count_x in n_gram_dict_x.items():
        denominator += count_x
        if n_gram in n_gram_dict_y:
            # clip the count in x at the count in y
            numerator += min(count_x, n_gram_dict_y[n_gram])

    # x has no n-grams of this order (sentence shorter than n): treat the precision as 0
    if denominator == 0:
        return 0
    return numerator / denominator
    
def bleu_score(x, y, N = 4):
    '''
    Function to calculate the BLEU score
    '''
    # preprocessing the text
    x = preprocess_text(x)
    y = preprocess_text(y)
    
    BP = 1
    weights = [1/N for i in range(N)] 

    modified_n_gram_list = []

    for i in range(1, N+1):
        # creating the n-gram dictionaries
        n_gram_dict_x = n_gram_dict(x, i)
        n_gram_dict_y = n_gram_dict(y, i)
        modified_n_gram_list.append(modified_ngram_precision(n_gram_dict_x , n_gram_dict_y))
        
    score = 0
    for (w, p) in zip(weights, modified_n_gram_list):
        # log(0) is undefined, so any zero precision makes the BLEU score 0
        if p == 0:
            return 0
        score += w * log(p)
    return BP * exp(score)

x = "The boys were playing happily on the ground."  
y = "The boys were playing football on the field."

print("The bleu score for above pair of sentences is: ", bleu_score(x, y , 4))


The bleu score for above pair of sentences is:  0.4111336169005197
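
As a sanity check on the value above, the four precisions can be worked out by hand. After pre-processing, $x$ has 8 unigrams, 7 bigrams, 6 trigrams and 5 4-grams; the matched counts below were counted manually from the two sentences (for example, the clipped unigram matches are "the" twice, "boys", "were", "playing" and "on"):

$$p_1 = \frac{6}{8}, \quad p_2 = \frac{4}{7}, \quad p_3 = \frac{2}{6}, \quad p_4 = \frac{1}{5}$$

Since $w_n = \frac{1}{4}$ and $BP = 1$, the exponential of the weighted log-sum is the fourth root of the product:

$$BLEU = \left(\frac{6}{8}\cdot\frac{4}{7}\cdot\frac{2}{6}\cdot\frac{1}{5}\right)^{1/4} = \left(\frac{1}{35}\right)^{1/4} \approx 0.4111$$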


### **Explain why we take a minimum in the numerator**

The modified n-gram precision clips the count of each n-gram in the machine-translated text at its count in the ground truth. Without the minimum, a candidate could inflate its precision simply by repeating an n-gram that occurs in the reference; with $\min(count_x, count_y)$, each occurrence in the reference can be matched at most once. The metric therefore measures overlap rather than the number of times a particular n-gram is repeated.
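
The effect of the minimum can be illustrated with a small, self-contained sketch restricted to unigrams. The helper `unigram_precision` and its `clip` flag are hypothetical names introduced only for this illustration; they are not part of the implementation above.

```python
from collections import Counter

def unigram_precision(candidate, reference, clip=True):
    """Unigram precision of candidate against reference (hypothetical helper)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(cand_counts.values())
    if clip:
        # modified precision: each word is credited at most as often as it occurs in the reference
        matched = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    else:
        # plain precision: every occurrence of a word found in the reference is credited
        matched = sum(c for w, c in cand_counts.items() if w in ref_counts)
    return matched / total

candidate = "the the the the"
reference = "the cat sat on the mat"
print(unigram_precision(candidate, reference, clip=False))  # 1.0
print(unigram_precision(candidate, reference, clip=True))   # 0.5
```

Without clipping, the degenerate candidate gets a perfect score; with clipping, only the two occurrences of "the" in the reference are credited.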


In [7]:
# Experimenting with the BLEU score for different sentences

sentences = [
    ("The boy went to the store", "The boy went to the mall"),
    ("The cat is now sleeping", "The cat is now napping"),
    ("The child is going to school to study", "The kid is going to school to learn"),
    ("The officers have gathered for a meeting", "The officers have gathered for a conference"),
    ("He scored a goal in the football match", "He scored a goal in the soccer game"), 
]

# Calculating the BLEU score for the above sentences
for x, y in sentences:
    # print each pair and its score on separate lines for better readability
    print(f"""The bleu score for the pair of sentences:\n\n{x}\n{y}\nScore: {bleu_score(x, y , 4)}
          """)
    

The bleu score for the pair of sentences:

The boy went to the store
The boy went to the mall
Score: 0.7598356856515925
          
The bleu score for the pair of sentences:

The cat is now sleeping
The cat is now napping
Score: 0.668740304976422
          
The bleu score for the pair of sentences:

The child is going to school to study
The kid is going to school to learn
Score: 0.5410822690539396
          
The bleu score for the pair of sentences:

The officers have gathered for a meeting
The officers have gathered for a conference
Score: 0.8091067115702212
          
The bleu score for the pair of sentences:

He scored a goal in the football match
He scored a goal in the soccer game
Score: 0.6803749333171202
          


### **Disadvantages of BLEU Score**

1. **Lack of Semantic Understanding**: BLEU score only considers the n-gram precision and does not take into account the semantic meaning of the sentence. A pair of sentences may have different words but convey the same meaning. Also, there is no contextual understanding and no consideration for synonyms.

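2. **Unreliable for Single Sentences**: BLEU score was designed as a corpus-level metric. On a single sentence, one $p_n = 0$ (for example, no matching 4-gram) drives the whole score to 0, so sentence-level scores are noisy, especially for short sentences.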

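3. **Captures Only Local Word Order**: BLEU score sees word order only through contiguous n-grams of length up to $N$. It has no notion of sentence structure or long-range dependencies, so a candidate that rearranges whole phrases keeps most of its n-gram matches and may still get a fairly high BLEU score.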

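4. **Sensitive to Length**: Every $p_n$ is normalised by the number of n-grams in the candidate $x$, so a very short candidate made up only of words from the reference gets a high precision. The brevity penalty is meant to correct this, but with BP fixed to 1, as in this assignment, short candidates are rewarded. Conversely, every extra unmatched word in a longer candidate lowers each $p_n$.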
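
Two of the points above (no credit for synonyms, and sensitivity to length) can be reproduced with a compact sketch. The functions `ngram_counts` and `sketch_bleu` and the `use_bp` flag are hypothetical names for this illustration, and the sentences are made-up examples. The optional brevity penalty uses the common definition $BP = 1$ if $c > r$, else $e^{1 - r/c}$, where $c$ and $r$ are the candidate and reference lengths; the assignment itself fixes $BP = 1$.

```python
import re
from collections import Counter
from math import exp, log

def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sketch_bleu(candidate, reference, N=4, use_bp=False):
    """Sentence-level BLEU with uniform weights (hypothetical helper for illustration)."""
    cand = re.sub(r"[^\w\s]", "", candidate).lower().split()
    ref = re.sub(r"[^\w\s]", "", reference).lower().split()
    log_sum = 0.0
    for n in range(1, N + 1):
        c_counts, r_counts = ngram_counts(cand, n), ngram_counts(ref, n)
        total = sum(c_counts.values())
        matched = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # a zero precision makes the whole score 0
        log_sum += log(matched / total) / N
    bp = 1.0
    if use_bp and len(cand) <= len(ref):
        bp = exp(1 - len(ref) / len(cand))  # assumed standard brevity penalty
    return bp * exp(log_sum)

# Same meaning, different words: no trigram matches, so the score collapses to 0.
print(sketch_bleu("The kids are playing outdoors", "The children are playing outside"))  # 0.0

# A truncated candidate: every n-gram matches, so with BP = 1 the score is perfect.
short, ref = "The boys were playing", "The boys were playing football on the field."
print(sketch_bleu(short, ref))                         # 1.0
print(round(sketch_bleu(short, ref, use_bp=True), 3))  # 0.368
```

The first pair shows a paraphrase scoring 0, and the second shows why a brevity penalty is needed when only precision is measured.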