## Problem 1: Calculate the Value of a Complex Formula [2 points]

Given the following formula, which sums a quadratic expression in $i$ over the first $n$ natural numbers:

$$ S(n) = \sum_{i=1}^{n} \left( i^2 + 2i + 1 \right) $$

This formula sums the terms $i^2 + 2i + 1$ for each integer $i$ from 1 to $n$. For example:

$$ S(3) = (1^2 + 2 \cdot 1 + 1) + (2^2 + 2 \cdot 2 + 1) + (3^2 + 2 \cdot 3 + 1) = 29 $$

In this problem, we will write a function `calculate_complex_sum(n)` that takes an integer $n$ as input and calculates the value of $S(n)$ using this formula. We will not use any Python libraries in our implementation.

In [42]:
def calculate_complex_sum(n):
    
    """
    Calculates the sum of a sequence where each term is the sum of
    the square of the index, twice the index, and 1.

    Args:
    n (int): The number of terms to include in the sum.

    Returns:
    int: The calculated sum.
    """

    # We create an accumulator with a value of zero to store the final sum.
    # It is named total rather than sum so that Python's built-in sum() is not shadowed.
    total = 0

    # We iterate over each integer i from 1 to n (inclusive).
    for i in range(1, n + 1):
        a = i * i  # The square of i.
        b = i * 2  # Twice i.
        c = 1  # The constant term from the formula.
        total += a + b + c  # We add this term to the running total.

    return total

In [44]:
import math

assert(math.isclose(calculate_complex_sum(3), 29))

assert(math.isclose(calculate_complex_sum(4), 54))
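
As an optional cross-check (not part of the original solution), note that each term factors as $i^2 + 2i + 1 = (i + 1)^2$, so $S(n)$ is the sum of the squares from $2^2$ to $(n+1)^2$. The sketch below compares a closed form, derived from the standard sum-of-squares identity, against a direct loop. The helper names `closed_form_sum` and `loop_sum` are hypothetical and do not appear in the original notebook.

```python
def closed_form_sum(n):
    """Closed form of S(n) = sum_{i=1}^{n} (i + 1)^2 (hypothetical helper)."""
    # 1^2 + 2^2 + ... + m^2 = m(m + 1)(2m + 1) / 6, with m = n + 1.
    # Subtracting 1 removes the 1^2 term, which S(n) does not include.
    m = n + 1
    return m * (m + 1) * (2 * m + 1) // 6 - 1


def loop_sum(n):
    """Direct evaluation of the same sum, used only for comparison."""
    return sum((i + 1) ** 2 for i in range(1, n + 1))


# The two versions should agree for every n we try.
for n in range(50):
    assert closed_form_sum(n) == loop_sum(n)

print(closed_form_sum(3), closed_form_sum(4))  # 29 54
```

The loop version is what the problem asks for; the closed form is only a convenient way to verify its results for larger $n$.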

## Problem 2: Automatic Detection and Correction of Fused Mathematical Expressions [3 Points]

### Fused Mathematical Expressions

We have been hired by MathCorrector, a company specializing in developing tools to automatically detect and correct specific errors in mathematical expressions. This task focuses on errors where multiplication operators between numbers and variables are missing. 

For instance:
- `2x + 5y` should be converted to `2 * x + 5 * y`.

For simplicity, the input will only contain variables (**each consisting of a single letter**), integers with any number of digits, and basic arithmetic operators: addition (`+`), subtraction (`-`), multiplication (`*`), division (`/`), and the equals sign (`=`).

### Formatting Expression

To make the expression easier to read, you are also required to format the expression to ensure consistent spacing between operators and operands. 

For instance:
- `2*x+ 5*y` should be `2 * x + 5 * y`

We will assume that the input expression is correct except for the problems mentioned above.

### Functional Requirements

Three functions are designed to handle the problems mentioned above. In this problem, we will implement these three functions.

1. **detect_fused_tokens(expr)**:
   - **Purpose**: Detect where multiplication is missing between numbers and variables.
   - **Input**: A string `expr` representing the mathematical expression.
   - **Process**: Identify substrings where numbers are directly followed by variables without an operator (such as `2x`).
   - **Output**: Return a list of substrings where multiplication is missing.

2. **insert_multiplication(expr, fused_tokens)**:
   - **Purpose**: Insert multiplication operators in the correct positions identified by the `detect_fused_tokens` function.
   - **Input**: A string `expr` representing the mathematical expression, and a list `fused_tokens` from the `detect_fused_tokens` function.
   - **Process**: Add multiplication operators between numbers and variables in the substrings identified by the `fused_tokens`.
   - **Output**: Return the modified expression with the appropriate multiplication operators inserted.

3. **format_expression(expr)**:
   - **Purpose**: Format the expression to ensure consistent spacing between operators and operands, making the expression easier to read.
   - **Input**: The expression processed by the `insert_multiplication` function.
   - **Process**: Ensure appropriate spacing between all operators and operands.
   - **Output**: Return the formatted mathematical expression.

#### Example

For the input `2x+ 5y - 3zw`:
- The `detect_fused_tokens` function will detect the missing multiplication operators and return relevant information (e.g., `['2x', '5y', '3zw']`).
- The `insert_multiplication` function will process the expression and convert it to `2*x+ 5*y - 3*z*w`.
- The `format_expression` function will ensure consistent spacing and convert it to `2 * x + 5 * y - 3 * z * w`.



In [46]:
def detect_fused_tokens(expr):
    
    """
    Detects and returns a list of tokens in the expression that are not 
    separated by spaces and are not operators.

    Args:
    expr (str): The input expression as a string.

    Returns:
    list: A list of non-operator tokens without spaces.
    """
    
    # We start with defining the operators users can use (according to the problem explanation) 
    operators = set(['+', '-', '/', '*', '=', ','])
    
    # Then, we create an empty result string, which will hold the fused tokens.
    result = ''
    
    # First, we check whether the expression contains any two adjacent non-space characters.
    # If it does not, there are no fused tokens in the input, so we return an empty list.
    # If it does, we break out of the loop, because the only purpose of this loop is the check.
    for i in range(len(expr) - 1):
        if expr[i] != ' ' and expr[i + 1] != ' ':
            break 
    else:
        return []
    
    # Then, we iterate through each character in the expression. If a character is not an operator, we add it to the result string. 
    # We do this to get rid of the operators. 
    for c in expr:
        if c not in operators:
            result += c

    # Then, we split the result string to get the individual fused tokens. 
    # We also filter out any empty strings that might have been created by the split method.
    fused_tokens = [token for token in result.split() if token]

    # Finally we return the fused tokens, defined previously. 
    return fused_tokens

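def insert_multiplication(expr, fused_tokens):
    """
    Inserts '*' between adjacent non-operator characters of each fused token
    and replaces the token in the expression with the corrected version.

    Args:
    expr (str): The input expression as a string.
    fused_tokens (list): The fused tokens returned by detect_fused_tokens.

    Returns:
    str: The expression with multiplication operators inserted.
    """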
    
    # Again, we start with defining the operators users can use.
    operators = set(['+', '-', '/', '*', '=', ',', ' '])

    # Again, we check whether the expression contains any two adjacent non-space characters.
    for i in range(len(expr) - 1):
        if expr[i] != ' ' and expr[i + 1] != ' ':
            break
    else:
        # If no two non-space characters are adjacent, we return the expression itself,
        # because there is nothing to fix.
        return expr
    
    # Then, we iterate through each token in fused_tokens (the second argument).
    for token in fused_tokens:
        new_token = '' # We create an empty string to hold the new, fixed token.
        # Then, we iterate through the characters of the current individual token.
        # We use range(len(token) - 1) because we're going to look at character pairs,
        # so we stop one character before the end.
        for i in range(len(token) - 1):
            if token[i] not in operators and token[i+1] not in operators: # This means a pair without operators.
                new_token += token[i] + '*'
            else: # If at least one of the characters is an operator,
                new_token += token[i] # we just add the current character to new_token without adding a '*'.
                
        # We add the last character of the token to new_token. 
        # We do this separately because the inner loop doesn't process the last character.  
        new_token += token[-1]

        # Finally, we replace the original token in the expression with the new_token
        expr = expr.replace(token, new_token)

    return expr 

def format_expression(expr):
    """
    Formats a mathematical expression by adding spaces between each character
    within the terms and between the terms themselves.

    Args:
    expr (str): The input expression as a string.

    Returns:
    str: The formatted expression with spaces added.
    """
    
    # We first split the expression into a list of items (tokens and operators) using spaces
    list_expr = expr.split()
    
    # Then, we add spaces between each character in each item
    spaced_items = [' '.join(item) for item in list_expr]
    
    # Finally, we add all spaced items into a single string with spaces between them using join method. 
    new_str = ' '.join(spaced_items)
    
    return new_str

In [48]:
# Tests for detect_fused_tokens
assert(detect_fused_tokens("2x+ 5y = 3z + zw3") == ['2x', '5y', '3z', 'zw3'])
assert(detect_fused_tokens("a * b + 2 * c") == [])
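
The implementation above passes both tests, but because it only strips operators and splits on spaces, it would also report unfused items (for example the lone `5` in `2x + 5`) and would join `a*b` into `ab`. One possible alternative, sketched here under the assumption that a fused token is any unbroken run of letters and digits containing more than one operand, uses the `re` module. The name `detect_fused_tokens_re` is hypothetical.

```python
import re


def detect_fused_tokens_re(expr):
    """Hypothetical regex-based variant of detect_fused_tokens."""
    # Every maximal run of letters and digits is a candidate token.
    runs = re.findall(r"[A-Za-z0-9]+", expr)
    fused = []
    for run in runs:
        # Split the run into operands: a whole integer or a single-letter variable.
        operands = re.findall(r"\d+|[A-Za-z]", run)
        # More than one operand in the same run means a '*' is missing.
        if len(operands) > 1:
            fused.append(run)
    return fused


print(detect_fused_tokens_re("2x+ 5y = 3z + zw3"))  # ['2x', '5y', '3z', 'zw3']
print(detect_fused_tokens_re("12x + 5"))            # ['12x']
print(detect_fused_tokens_re("a*b + 2*c"))          # []
```

Treating a multi-digit integer as a single operand is what keeps `12` from being reported as fused.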

In [50]:
# Tests for insert_multiplication
assert(insert_multiplication("2x+ 5y = 3z + z3w", ['2x', '5y', '3z', 'z3w']) == "2*x+ 5*y = 3*z + z*3*w")
assert(insert_multiplication("a * b + 2 * c", []) == "a * b + 2 * c")
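
Note that the implementation above inserts `*` between every pair of adjacent characters in a token, so a multi-digit integer such as `12x` would become `1*2*x`. A possible workaround, sketched here as an assumption rather than as the intended solution, is to insert `*` only at letter/digit and letter/letter boundaries using zero-width lookarounds. The name `insert_multiplication_re` is hypothetical, and this variant does not need the `fused_tokens` argument.

```python
import re

# Zero-width positions where a '*' is missing:
#   - a letter or digit immediately followed by a letter (2x, zw)
#   - a letter immediately followed by a digit (z3)
# Digit-digit boundaries are deliberately not matched, so 12 stays intact.
MISSING_MUL = re.compile(r"(?<=[A-Za-z0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])")


def insert_multiplication_re(expr):
    """Hypothetical regex-based variant of insert_multiplication."""
    return MISSING_MUL.sub("*", expr)


print(insert_multiplication_re("2x+ 5y = 3z + z3w"))  # 2*x+ 5*y = 3*z + z*3*w
print(insert_multiplication_re("12x - 3zw"))          # 12*x - 3*z*w
```

Because the substitution works on positions rather than on token text, it also avoids the case where `str.replace` rewrites an unrelated occurrence of the same substring.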

In [52]:
# Tests for format_expression
assert(format_expression("2*x+ 5*y = z*3+a*2*b") == "2 * x + 5 * y = z * 3 + a * 2 * b")
assert(format_expression("a*b + 2*c") == "a * b + 2 * c")
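
Since `' '.join(item)` places a space between every character, the implementation above would turn `12*x` into `1 2 * x`. A minimal tokenizer-based sketch that keeps multi-digit integers together is shown below. It assumes the input contains only the characters allowed by the problem statement (anything else is silently dropped), and the name `format_expression_re` is hypothetical.

```python
import re


def format_expression_re(expr):
    """Hypothetical tokenizer-based variant of format_expression."""
    # A token is a whole integer, a single-letter variable, or one operator.
    tokens = re.findall(r"\d+|[A-Za-z]|[-+*/=]", expr)
    # Joining with single spaces gives consistent spacing everywhere.
    return " ".join(tokens)


print(format_expression_re("2*x+ 5*y = z*3+a*2*b"))  # 2 * x + 5 * y = z * 3 + a * 2 * b
print(format_expression_re("12*x + 7"))              # 12 * x + 7
```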

## Problem 3: Post-Processor for the LLM's Response [5 points]

Large Language Models (LLMs) do not always follow user instructions precisely. For a given task, LLMs may provide responses in various formats. We have been tasked with creating a post-processor to extract target information from LLM responses related to summarizing video descriptions. These responses will include keywords from the video descriptions, an overall summary of the video, and some redundant responses (e.g., "Sure, here are the relevant keywords for each video description:", "And here is a new short description of the video in one sentence:").

Our task is to write a function `post_process` that returns a list of keywords and a sentence summarizing the video. The code should be able to handle all formats of LLM responses. We will not rely on the index of data samples, as there are unlimited data samples with different formats in real applications. Given an LLM response, we will classify the format of the response (5 formats in our dataset) and extract the information using different strategies.

```python
keywords, summary = post_process(response)
```

Should return:

```python
['Origami', 'Yellow paper', 'Folding', 'Paper', 'Triangle', 'Person']
"A person skillfully folds a yellow paper into a precise triangle, showcasing their origami skills."
```

We will load the LLM responses from a JSON file. The file contains a list of dicts with the following structure:
```python
dataset = [
    {
        "response": llm response, 
        "keywords": ground-truth keywords,
        "summary": ground-truth summary,
    },
    ...
]
```
Only the "response" item of each dict may be used in the main function. The "keywords" and "summary" items are used only for testing.

In [54]:
import json
file_path = r"./dataset.json" # put the dataset.json and this code file in the same folder
with open(file_path, "r") as f_r:
    dataset = json.load(f_r)

In [56]:
# Check the LLM response from the first data sample
print(dataset[0]['response'])

Here is a bullet-point list of relevant keywords present in the video:

• Dragon
• Flying
• Dog
• Whistle
• Holiday
• Drop

And here is a new short description of the video based on the generated keywords in one sentence:

"A magical dragon soars through the air while a dog whistles in harmony, celebrating a festive holiday with a joyful drop."


In [58]:
# We start by investigating the dataset, mainly the response variable.
# The purpose of this step is to find a pattern in the text that we will pass as the argument of our function.
# If we find a pattern, we can use it to extract the target data.

import pandas as pd

# Because the dataset is a list of dicts, we use pd.json_normalize to convert it into a dataframe.
df = pd.json_normalize(dataset)

df.head()

| | response | keywords | summary |
|---|---|---|---|
| 0 | Here is a bullet-point list of relevant keywor... | [Dragon, Flying, Dog, Whistle, Holiday, Drop] | A magical dragon soars through the air while a... |
| 1 | Here are the relevant keywords present in the ... | [Dragon, Flying, Dog whistle, Holiday, Drop] | A magical dragon soars through the sky, respon... |
| 2 | Here is a bullet-point list of relevant keywor... | [Dragon, Flying, Dog whistle, Holiday, Drop] | A majestic dragon soars through the skies whil... |
| 3 | Here is a bullet-point list of relevant keywor... | [Dragon, Flying, Dog, Whistle, Holiday, Drop] | A magical dragon takes to the skies, accompani... |
| 4 | Here are the relevant keywords present in the ... | [Dragon, Flying, Dog, Whistle, Holiday, Drop] | A magical holiday treat features a majestic dr... |


In [60]:
# Then, we check the unique values inside the response variable.
df['response'].unique()

array(['Here is a bullet-point list of relevant keywords present in the video:\n\n• Dragon\n• Flying\n• Dog\n• Whistle\n• Holiday\n• Drop\n\nAnd here is a new short description of the video based on the generated keywords in one sentence:\n\n"A magical dragon soars through the air while a dog whistles in harmony, celebrating a festive holiday with a joyful drop."',
       'Here are the relevant keywords present in the video:\n\n• Dragon\n• Flying\n• Dog whistle\n• Holiday\n• Drop\n\nAnd here is a new short description of the video based on the generated keywords in one sentence:\n\n"A magical dragon soars through the sky, responding to a dog whistle\'s festive tune on a holiday drop."\n\nBy analyzing the keywords in the video description, you can get a better understanding of the content and purpose of the video, and create more targeted and relevant content for your audience.',
       'Here is a bullet-point list of relevant keywords present in the video:\n\n• Dragon\n• Flying\n• Dog 

In [62]:
# When we check these unique values, we see a pattern. 

# For keywords, we see that every keyword is in between "in the video:" and either "And here…" or "New video…"
# They are listed with either bullet points, numbers, or spaces.

# For summaries, we see that every summary comes after either "in one sentence:" or "New video description:"

# We will use the re library for regular expressions, which is part of Python's standard library.
# Specifically, it will help us with further string manipulation to create the pattern and check the matches. 
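
To illustrate the pattern described above before wrapping it in a function, the sketch below applies two simplified regular expressions to a sample string copied from the first response printed earlier. The variable names are illustrative, and only this one response format is covered here.

```python
import re

# Sample text copied from the first response in the dataset.
sample = (
    "Here is a bullet-point list of relevant keywords present in the video:\n\n"
    "• Dragon\n• Flying\n• Dog\n• Whistle\n• Holiday\n• Drop\n\n"
    "And here is a new short description of the video based on the generated "
    "keywords in one sentence:\n\n"
    '"A magical dragon soars through the air while a dog whistles in harmony, '
    'celebrating a festive holiday with a joyful drop."'
)

# The keywords sit between "in the video:" and the sentence starting with "And here".
# re.DOTALL lets "." match newlines, and re.IGNORECASE matches "And here" as "and here".
block = re.search(r"in the video:(.*?)and here", sample, re.DOTALL | re.IGNORECASE).group(1)
keywords = [line.lstrip("• ").strip() for line in block.splitlines() if line.strip()]

# The summary follows "in one sentence:", optionally wrapped in double quotes.
summary = re.search(r'in one sentence:\s*"?(.+?)"?(?:\n|$)', sample).group(1)

print(keywords)  # ['Dragon', 'Flying', 'Dog', 'Whistle', 'Holiday', 'Drop']
print(summary)
```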

In [64]:
import re

def post_process(response: str):
    """
    args:
        response (str): the full text of one LLM response
    return:
        keywords (list[str]): a list of keywords in the response
        summary (str | None): the overall summary, or None if no summary is found
    """
    # This is the first function inside our main function, which will extract the keywords. 
    def extract_keywords(text):
        # This pattern looks for text starting with "Here", followed by anything until "in the video:"
        # then captures everything until it finds "and here" or "new video".
        pattern = r'Here.*?in the video:(.*?)(?:and here|new video)'
        # We look for the pattern in the text, and we enable the match for any character, including a newline, and case insensitive.
        # We then store this match into a new variable called "match"
        match = re.search(pattern, text, re.DOTALL | re.IGNORECASE) 
        if match: # We check if a match is found. 
            extracted = match.group(1).strip() # We extract the content of the first captured group (group 1) and remove the whitespaces.
            # We then split the extracted text into lines, strip whitespace from each line, and keep only non-empty lines.
            lines = [line.strip() for line in extracted.split('\n') if line.strip()] 
            # We remove the numbering or bullet points at the start, then remove the whitespace again.
            keywords = [re.sub(r'^\d+\.?\s*|\•\s*', '', line).strip() for line in lines]
            # Then we create a new list called keywords. The loop makes sure that we keep only non-empty keywords.
            keywords = [k for k in keywords if k]
            return keywords # We return the new list we created. 
        # Finally, we add a return with an empty list to make sure that 
        # even when no keywords are found a list (although it is an empty one) is still returned.
        return [] 

    # This is the second function inside our main function, which will extract the summaries. 
    def extract_summary(text):
        # This pattern looks for either "in one sentence:" or "New video description:", followed by optional whitespace and quotes
        # then captures everything until the end of the line or string.
        pattern = r'(?:in one sentence:|New video description:)\s*"?(.+?)"?(?:\n|$)'
        # We look for the pattern in the text, and we enable the match case insensitive.
        # We then store this match into a new variable called "match"
        match = re.search(pattern, text, re.IGNORECASE) 
        if match: # We check if a match is found.
            # Then, we extract the content of the first captured group in the text.
            # We also remove any whitespace from this extracted text.
            # We return the cleaned text, which is our output, immediately. 
            return match.group(1).strip()
        return None # If there is no match, we return None. 

    # Finally, we define two variables for keywords and summaries, respectively called as keywords and summary. 
    # The keywords variable is the keywords extracted by our first function. 
    # The summary variable is the summary extracted by our second function. 
    # We define them as variables because we will return them as the output of the main function.
    keywords = extract_keywords(response)
    summary = extract_summary(response)

    return keywords, summary 

In [66]:
# Test your code with the following script
import random 
indexes = list(range(len(dataset)))
random.shuffle(indexes)
for id in indexes:
    line = dataset[id]
    keywords, summary = post_process(line["response"])
    assert keywords == line['keywords'], "Incorrect keywords list"
    assert summary == line['summary'], "Incorrect summary"
print("Done!")

Done!
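
The test script stops at the first failing assertion. While debugging, it can be more convenient to collect every mismatch instead. The sketch below shows one way to do that; the `evaluate` helper, the toy dataset, and `toy_post_process` are all hypothetical stand-ins made up for illustration, not taken from `dataset.json`.

```python
def evaluate(dataset, post_process):
    """Return (index, field) pairs where the output differs from the ground truth."""
    mismatches = []
    for index, sample in enumerate(dataset):
        keywords, summary = post_process(sample["response"])
        if keywords != sample["keywords"]:
            mismatches.append((index, "keywords"))
        if summary != sample["summary"]:
            mismatches.append((index, "summary"))
    return mismatches


# Toy data with the same keys as the real dataset (made up for illustration only).
toy_dataset = [
    {"response": "a, b | first", "keywords": ["a", "b"], "summary": "first"},
    {"response": "c | second", "keywords": ["c"], "summary": "other"},
]


def toy_post_process(response):
    """Trivial stand-in for post_process that matches the toy format above."""
    keyword_part, summary = response.split(" | ")
    return keyword_part.split(", "), summary


print(evaluate(toy_dataset, toy_post_process))  # [(1, 'summary')]
```

With the real data, the call would be `evaluate(dataset, post_process)`, and an empty list would correspond to the script printing `Done!`.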
