## Assignment Summary & Outcomes:

- **Objective**: The goal of this project is to build a model that corrects disfluent questions into their original form.
- **Approach**:  I used the mistral-openorca:latest model with the Ollama server and set a prompt to convert disfluent questions into original questions. No model training was performed. I measured the model's performance based on the contextual similarity between the predictions and the original questions.
- **Requirements**: For instructions on installing Ollama, please visit this link https://ollama.com/download/windows. After installation, you can pull the mistral-openorca:latest model by running the command: ollama pull mistral-openorca:latest. A requirement.txt file is attached. 

In [4]:
# Import all the nicessary libraries to analyse the data. 
import pandas as pd
import json
import os
import sys
import requests
from yachalk import chalk
import client as client
import time
#sys.path.append("/Users/soumendusekharbhattacharjee/Documents/DATA-SCIENCE/chata_ai/codes")
sys.path.append("..")

#### `Converting json file to a dataframe`

In [2]:
# Data set taken from Link: https://github.com/google-research-datasets/Disfl-QA
# Load the JSON data
with open('/Users/soumendusekharbhattacharjee/Documents/DATA-SCIENCE/chata_ai/Disfl-QA-main/dev.json', 'r') as file:
    data = json.load(file)

# Create lists to store the 'original' and 'disfluent' data
original_data = []
disfluent_data = []

# Iterate through the dictionary and store 'original' and 'disfluent' points
for key, value in data.items():
    original_data.append(value['original'])
    disfluent_data.append(value['disfluent'])

# Create a DataFrame with 'original' and 'disfluent' as columns
df = pd.DataFrame({
    'original': original_data,
    'disfluent': disfluent_data
})

# Display the first few rows of the DataFrame
df.head()


Unnamed: 0,original,disfluent
0,What did the government want Thoreau to do?,Who did no What did the government want Thorea...
1,What makes the Wells Fargo Center stand out?,What makes the Bank of America Tower or wait t...
2,What was the Colonia Agrippina's original name?,What was the Colonia Agrippina's original empi...
3,Extended networking benefits helped those that...,"Extended authorization limitations, no sorry n..."
4,Who is the emphasis on when there is a private...,What is the no make that who is the emphasis o...


In [3]:
df.shape

(1000, 2)

In [4]:
#df = df.drop(columns=['original'])


In [5]:
df.head()

Unnamed: 0,original,disfluent
0,What did the government want Thoreau to do?,Who did no What did the government want Thorea...
1,What makes the Wells Fargo Center stand out?,What makes the Bank of America Tower or wait t...
2,What was the Colonia Agrippina's original name?,What was the Colonia Agrippina's original empi...
3,Extended networking benefits helped those that...,"Extended authorization limitations, no sorry n..."
4,Who is the emphasis on when there is a private...,What is the no make that who is the emphasis o...


In [6]:
# The `correctQuestions` function takes a noisy or disfluent question as input (provided in the `prompt` parameter),
# and uses a specified language model (default is "mistral-openorca:latest") to generate a corrected version of the question.
# It does this by sending a system prompt (`SYS_PROMPT`) that instructs the model to fix issues such as typing errors,
# discontinuity in thinking, and irrelevant information, producing a clear, coherent, and grammatically correct question.
# The function returns the corrected question as a plain text string with any leading or trailing whitespace removed.

def correctQuestions(prompt: str, model="mistral-openorca:latest"):
    SYS_PROMPT = (
        "Your task is to correct the given noisy or disfluent question. The question might have typing errors, "
        "discontinuity in thinking, irrelevant information, or other issues that make it unclear. Please generate a "
        "clear, coherent, and grammatically correct version of the question. Ensure the corrected question is "
        "contextually accurate and directly addresses the intended query.\n"
        "Format your output as a plain text question.\n"
    )

    response, _ = client.generate(model_name=model, system=SYS_PROMPT, prompt=prompt)
    return response.strip()

# The `df2CorrectQuestions` function processes a DataFrame that must include a 'disfluent' column containing noisy or disfluent questions.
# It applies the `correctQuestions` function to each entry in this column, utilizing a specified language model (mistral-openorca:latest) 
# to generate corrected versions of the questions.
# The function creates a new column named 'correct_question' in the DataFrame, where these corrected questions are stored.
# Finally, the function returns the updated DataFrame, now containing both the original disfluent questions and their corrected counterparts.


def df2CorrectQuestions(dataframe: pd.DataFrame, model=None) -> pd.DataFrame:
    
    if 'disfluent' not in dataframe.columns:
        raise ValueError("DataFrame must contain a 'disfluent' column.")

    # Generate corrected questions
    results = dataframe.apply(
        lambda row: correctQuestions(row.disfluent, model=model), axis=1
    )
    dataframe['correct_question'] = results

    return dataframe

# The `evaluateQuestions` function assesses the contextual similarity between the 'original' questions and their corresponding 'correct_question' versions in a DataFrame.
# The function expects the DataFrame to contain two specific columns: 'original' and 'correct_question'.
# It uses a system prompt (`SYS_PROMPT`) to instruct a specified language model (default is "mistral-openorca:latest") to evaluate whether each pair of questions is contextually similar.
# The comparison is done by generating a binary output: 1 if the questions are similar, or 0 if they are not.
# The results of this evaluation are stored in a new column named 'evaluation' within the DataFrame, which is then returned.


def evaluateQuestions(dataframe: pd.DataFrame, model="mistral-openorca:latest") -> pd.DataFrame:
    
    if 'original' not in dataframe.columns or 'correct_question' not in dataframe.columns:
        raise ValueError("DataFrame must contain 'original' and 'correct_question' columns.")
    
    # Define the prompt for evaluating context similarity
    SYS_PROMPT = (
        "You are given two questions: the 'original' question and the 'correct_question'. "
        "Your task is to determine whether these two questions are contextually similar or not. "
        "If they are contextually similar, return 1; otherwise, return 0.\n\n"
        "Original Question: {original}\n"
        "Correct Question: {correct}\n"
        "Output: "
    )
    
    def compare_context(row):
        prompt = SYS_PROMPT.format(original=row['original'], correct=row['correct_question'])
        response, _ = client.generate(model_name=model, system=SYS_PROMPT, prompt=prompt)
        try:
            result = int(response.strip())
        except ValueError:
            print(f"\n\nERROR ### Invalid response for context evaluation: {response}\n\n")
            result = 0
        return result
    
    # Apply comparison function
    dataframe['evaluation'] = dataframe.apply(compare_context, axis=1)
    return dataframe


# Process dataframe to generate corrected questions
df_corrected = df2CorrectQuestions(df, model='mistral-openorca:latest')

# Evaluate context similarity
df_corrected = evaluateQuestions(df_corrected)

# Save to CSV in the local directory
output_file_path = "corrected_questions_with_evaluation.csv"
df_corrected.to_csv(output_file_path, sep="|", index=False)

print(f"Corrected questions with evaluation saved to '{output_file_path}'.")


 Who did what, and what did the government want Thoreau to do? What features make the Bank of America Tower or, in fact, the Wells Fargo Center stand out? What was the original name of Colonia Agrippinensium? What networking benefits enabled those who were unable to connect to a specific platform? What is the emphasis on when referring to "no make" in the context of a Private Finance Initiative? What Chinese-inspired dynasties influenced Kublai Khan's government? What is the average density of prime numbers compatible with modulo 8, and how does it compare to modulo 9? What resources did European empires rely on for their supply during their expansion period? What did Wahl and Ammann, along with Karlen and Singer, present to the U.S. Senate? What is the current status of the Haensch study, not to mention Schuenemann's research? Who was responsible for driving new building projects in Jacksonville? Who typically oversees the largest construction projects, considering potential inconveni

In [13]:
df_corrected.head()

Unnamed: 0,original,disfluent,correct_question,evaluation
0,What did the government want Thoreau to do?,Who did no What did the government want Thorea...,"Who did what, and what did the government want...",1
1,What makes the Wells Fargo Center stand out?,What makes the Bank of America Tower or wait t...,What features make the Bank of America Tower o...,1
2,What was the Colonia Agrippina's original name?,What was the Colonia Agrippina's original empi...,What was the original name of Colonia Agrippin...,1
3,Extended networking benefits helped those that...,"Extended authorization limitations, no sorry n...",What networking benefits enabled those who wer...,1
4,Who is the emphasis on when there is a private...,What is the no make that who is the emphasis o...,"What is the emphasis on when referring to ""no ...",0


In [14]:
# Calculate the number of correct evaluations (where evaluation == 1)
Correct = (df_corrected['evaluation'] == 1).sum()

# Calculate accuracy as the percentage of correct evaluations
Accuracy = (Correct / len(df_corrected)) * 100

# Display the result
print("Accuracy is: {:.1f}%".format(Accuracy))


Accuracy is: 92.8%
