# Introduction

The goal of this exercise is to identify whether a set of dimensions is present in letters written by CEOs to shareholders, using an OpenAI language model accessed through OpenAI’s API. The model is first fine-tuned on the three training datasets provided, and the test dataset is held out for evaluation. We then compare the dimensions predicted for the test dataset with the true values and calculate the prediction accuracy.

In [32]:
import pandas as pd
import requests
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from google.colab import drive
from openai import OpenAI
import json
import time

In [12]:
# Mount Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Methodology

## Load and Preprocess the Data:
We use the three provided training datasets as the training set and the test dataset as the test set. In this step, we concatenate the three training datasets into a single training set.

In [13]:
def load_and_preprocess_train_data(train_files):
  # Load data from Excel files
  df_train = pd.concat([pd.read_excel(file) for file in train_files])

  # Handle missing values: drop rows that have no paragraph text
  df_train.dropna(subset=['paragraph'], inplace=True)

  # NLTK resources are not needed for the lower-casing step below;
  # uncomment to download them if tokenization is added later.
  # nltk.download('stopwords')
  # nltk.download('punkt')

  # Preprocess the text data
  # Lower-Case words
  df_train['processed_paragraph'] = df_train['paragraph'].apply(lambda text: text.lower())

  return df_train

In [14]:
folder_path = r'/content/drive/MyDrive/GRA_KU_Assessment/TEST/train_files'
# Get a list of all files in the folder
files_in_folder = os.listdir(folder_path)
# Filter Excel files with specific names (train1, train2, train3, etc.)
train_files = [os.path.join(folder_path, file) for file in files_in_folder if file.startswith('train') and file.endswith('.xlsx')]
df_train = load_and_preprocess_train_data(train_files)

## Prepare Data for Fine-tuning
In this step, we convert our data into the format required for the OpenAI API.

We convert the data from the table format (.xlsx format) provided to a Chat Completions API format that is accepted by OpenAI’s API. This format is a list of messages where each message has a role and content.

In our training set, each example in the required format has three messages, one per role. The *system* message gives the model precise instructions about what we expect it to do. The *user* message contains a portion of a CEO letter; this is the input for the fine-tuning process. The *assistant* message contains the expected answer, which is the target the model is fine-tuned to reproduce.
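
As a minimal sketch of the format described above, one training example serializes to a single JSON Lines record like the one built here. The helper name `build_example` and the shortened instruction, paragraph, and target strings are illustrative assumptions, not the prompt used in this notebook.

```python
import json

def build_example(system_text, user_text, assistant_text):
    """Return one chat-format training example with the three roles."""
    return {
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]
    }

example = build_example(
    "Label each dimension as Yes or No.",                # placeholder instructions
    "we will continue to make long-term investments.",   # placeholder paragraph
    "Goal: No, Activity: Yes",                           # placeholder target
)
line = json.dumps(example)  # one line of the .jsonl training file
print(line)
```

Each such line is written to the `.jsonl` file on its own row, which is the layout the fine-tuning upload expects.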

In [15]:
def prepare_data_for_fine_tuning(df_train):
  # Convert the 'paragraph' column to a list
  paragraphs = df_train['processed_paragraph'].tolist()

  # Create a list to store the formatted data
  train_data = []

  # Iterate through each row in the DataFrame and create prompt-completion pairs
  for index, row in df_train.iterrows():
      prompt = row['processed_paragraph']
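      # Convert 'Yes' and 'No' to 1 and 0, respectively.
      # columns[1:] takes every column after the first, so the string can hold
      # more values than the eight dimensions (the printed example has ten).
      completion = ','.join(['1' if row[col] == 'Yes' else '0' for col in df_train.columns[1:]])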
      train_data.append({"prompt": prompt, "completion": completion})

  # Display a few examples to verify the format
  for example in train_data[:1]:
      print(example)

  return train_data

In [16]:
def convert_to_chat_completion(prompt_completion_data):
    chat_completion_data = []

    for entry in prompt_completion_data:
        prompt = entry['prompt']
        completion = entry['completion']

        # Extracting the completion details and converting them into the desired format
        completion_details = [f"{key}: {'Yes' if value == '1' else 'No'}" for key, value in zip(['Goal', 'Activity', 'Strategy', 'Plan', 'Structure', 'Innovation', 'Tactics', 'Relevance'], completion.split(','))]

        # Joining the completion details into a single string
        completion_text = ', '.join(completion_details)

        # Creating the chat-completion format
        conversation = {
            "messages": [
                {"role": "system", "content": "Use the folowing step-by-step instructon to respond to the user inputs. Step 1 - In the user content which is taken from letters written by CEO to shareholders, you have to identify the existence of dimensions/qualities that are provided in this list given in brackets and that are seperated by commas ['Goal', 'Activity', 'Strategy', 'Plan', 'Structure', 'Innovation', 'Tactics', 'Relevance']. Step 2 - For each of these dimensions, if the dimension exists in the user prompt based on the assistant content I provide to you in the fine-tuning data, answer Yes, otherwise answer No. After step2, this is an example output whose template you must use to provide your answer - ['Goal: No, Activity: Yes, Strategy: Yes, Plan: Yes, Structure: Yes, Innovation: Yes, Tactics: No, Relevance: No']"},
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": completion_text}
            ]
        }

        chat_completion_data.append(conversation)

    return chat_completion_data


In [17]:
train_data = prepare_data_for_fine_tuning(df_train)
train_data = convert_to_chat_completion(train_data)

{'prompt': 'february 24, 2011\nto our shareholders:\n2010 was another challenging year for sears holdings.  our financial results remain at unacceptable levels, and we are working to drive better performance in both the short and long term.  the company generates significant amounts of cash, and we have the ability and flexibility to invest that cash strategically.  we will continue to make long-term investments in key areas that may adversely impact short-term results when we believe they will generate attractive long-term returns.  in particular, we have significantly grown our shop your way rewards program, improved our online and mobile platforms, and re-examined our overall technology infrastructure.  we believe these investments are an important part of transforming sears holdings into a truly integrated retail company, focusing on customers first', 'completion': '0,1,1,1,1,1,0,0,1,0'}
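
The printed completion above holds ten values, while only eight dimension names are zipped against it in `convert_to_chat_completion`, so positions and names may not line up. One way to avoid relying on column order is to select the label columns by name. The sketch below assumes the training sheets have columns named exactly like the eight dimensions and contain 'Yes'/'No' values; both are assumptions to check against the real files, and `labels_to_text` is a hypothetical helper.

```python
import pandas as pd

DIMENSIONS = ['Goal', 'Activity', 'Strategy', 'Plan',
              'Structure', 'Innovation', 'Tactics', 'Relevance']

def labels_to_text(row, dimensions=DIMENSIONS):
    """Build 'Goal: Yes, Activity: No, ...' from named label columns."""
    return ', '.join(
        f"{dim}: {'Yes' if row[dim] == 'Yes' else 'No'}" for dim in dimensions
    )

# Tiny made-up frame standing in for df_train (assumed column names).
toy = pd.DataFrame([{
    'Unnamed: 0': 0, 'paragraph': 'Some text',
    'Goal': 'Yes', 'Activity': 'No', 'Strategy': 'Yes', 'Plan': 'No',
    'Structure': 'No', 'Innovation': 'Yes', 'Tactics': 'No', 'Relevance': 'No',
    'processed_paragraph': 'some text',
}])
completion_text = labels_to_text(toy.iloc[0])
print(completion_text)
```

Because the columns are looked up by name, extra columns such as an index or the processed text cannot shift the labels.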


In [18]:
print(train_data[0])

{'messages': [{'role': 'system', 'content': "Use the folowing step-by-step instructon to respond to the user inputs. Step 1 - In the user content which is taken from letters written by CEO to\xa0shareholders, you have to identify the existence of dimensions/qualities that are provided in this list given in brackets and that are seperated by commas ['Goal', 'Activity', 'Strategy', 'Plan', 'Structure', 'Innovation', 'Tactics', 'Relevance']. Step 2 - For each of these dimensions, if the dimension exists in the user prompt based on the assistant content I provide to you in the fine-tuning data, answer Yes, otherwise answer No. After step2, this is an example output whose template you must use to provide your answer - ['Goal: No, Activity: Yes, Strategy: Yes, Plan: Yes, Structure: Yes, Innovation: Yes, Tactics: No, Relevance: No']"}, {'role': 'user', 'content': 'february 24, 2011\nto our shareholders:\n2010 was another challenging year for sears holdings.  our financial results remain at un

## Fine-tune the model
We upload the training data through the OpenAI API and start a fine-tuning job. The base model used for fine-tuning is gpt-3.5-turbo.

In [19]:
def fine_tune_model(train_data, api_key):
  # train_data holds the chat-format conversations built above
  # Save them in JSON Lines format, one conversation per line
  with open("/content/drive/MyDrive/GRA_KU_Assessment/TEST/mydata.jsonl", "w") as file:
      for example in train_data:
          file.write(json.dumps(example) + "\n")

  # Initialize the OpenAI client
  client = OpenAI(api_key= api_key)

  # Upload the JSON Lines file for fine-tuning
  try:
    # Open the file in a context manager so the handle is closed after upload
    with open("/content/drive/MyDrive/GRA_KU_Assessment/TEST/mydata.jsonl", "rb") as jsonl_file:
      resp1 = client.files.create(
          file=jsonl_file,
          purpose="fine-tune"
      )
    print("File uploaded successfully.")
  except Exception as e:
    print("File upload failed:", e)
    return None, None

  # Create the fine-tuning job
  try:
    resp2 = client.fine_tuning.jobs.create(
    training_file=resp1.id,
    model="gpt-3.5-turbo"
    )
    print("Fine-tuning job created successfully.")
  except Exception as e:
    print("Fine-tuning job creation failed:", e)
    return None, None

  # Check the status of the fine-tuning job
  while True:
    resp3 = client.fine_tuning.jobs.retrieve(resp2.id)
    status = resp3.status
    print("Fine-tuning job status:", status)
    if status == "succeeded":
      print("Fine-tuning job completed successfully.")
      break
    elif status == "failed":
      print("Fine-tuning job failed:", resp3.error)
      break
    elif status == "cancelled":
      print("Fine-tuning job cancelled by user.")
      break
    else:
      print("Fine-tuning job in progress. Please wait...")
      time.sleep(60)

  return resp2, client


In [20]:
# Read the API key from an environment variable rather than hard-coding it in the notebook
api_key = os.environ["OPENAI_API_KEY"]
response, client = fine_tune_model(train_data, api_key)

File uploaded successfully.
Fine-tuning job created successfully.
Fine-tuning job status: validating_files
Fine-tuning job in progress. Please wait...
Fine-tuning job status: running
Fine-tuning job in progress. Please wait...
Fine-tuning job status: running
Fine-tuning job in progress. Please wait...
Fine-tuning job status: running
Fine-tuning job in progress. Please wait...
Fine-tuning job status: running
Fine-tuning job in progress. Please wait...
Fine-tuning job status: running
Fine-tuning job in progress. Please wait...
Fine-tuning job status: running
Fine-tuning job in progress. Please wait...
Fine-tuning job status: running
Fine-tuning job in progress. Please wait...
Fine-tuning job status: running
Fine-tuning job in progress. Please wait...
Fine-tuning job status: running
Fine-tuning job in progress. Please wait...
Fine-tuning job status: running
Fine-tuning job in progress. Please wait...
Fine-tuning job status: running
Fine-tuning job in progress. Please wait...
Fine-tuning j

In [21]:
print(response)

FineTuningJob(id='ftjob-catQFwicDtVbb6bZYZbg2GLC', created_at=1700433191, error=None, fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0613', object='fine_tuning.job', organization_id='org-BNdO85mmDzyE4YWAlVvz9Z2A', result_files=[], status='validating_files', trained_tokens=None, training_file='file-4cqRSlKO9x0qpMQnshCjjOtU', validation_file=None)
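
The job printed above is the object returned when the job was created, which is why it still shows status='validating_files' and fine_tuned_model=None. A minimal sketch of a polling helper that returns the final job object instead is given below. `wait_for_job` and its parameters are hypothetical names; `retrieve` stands for any callable that takes a job id and returns an object with a `status` attribute, for example `client.fine_tuning.jobs.retrieve`. The fake retrieve function and the model name in it are made up so the sketch runs without network access.

```python
import time
from types import SimpleNamespace

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve, job_id, poll_seconds=60, sleep=time.sleep):
    """Poll retrieve(job_id) until the job reaches a terminal status."""
    while True:
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)

# Stand-in for the API: reports 'running' twice, then 'succeeded'.
_states = iter(["running", "running", "succeeded"])

def fake_retrieve(job_id):
    status = next(_states)
    model = "ft:example-model" if status == "succeeded" else None  # made-up name
    return SimpleNamespace(id=job_id, status=status, fine_tuned_model=model)

final_job = wait_for_job(fake_retrieve, "ftjob-example", sleep=lambda s: None)
print(final_job.status, final_job.fine_tuned_model)
```

With the real client, `wait_for_job(client.fine_tuning.jobs.retrieve, response.id).fine_tuned_model` would give the model name to pass to the chat completion calls, assuming the job succeeded.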


## Load and preprocess the test data
We load the test data (test.xlsx) and apply the same preprocessing as for the training data.

In [22]:
def load_and_preprocess_test_data(test_file):
  # Load data from the test file
  df_test = pd.read_excel(test_file)

  # Preprocess the text data similar to the training data
  df_test['processed_paragraph'] = df_test['paragraph'].apply(lambda text: text.lower())

  return df_test

In [23]:
test_file = r"/content/drive/MyDrive/GRA_KU_Assessment/TEST/test_file/test.xlsx"
df_test = load_and_preprocess_test_data(test_file)

## Make Predictions
We now use the fine-tuned model and feed in the test set provided. We will use the results from the test set to assess the model’s performance.



In [24]:
def make_predictions(df_test, fine_tuned_model_id, client):
  # Get the sentences to test from the dataframe
  sentences_to_test = df_test['processed_paragraph'].tolist()
  # Initialize an empty list to store the responses
  responses = []
  # Set the batch size for chat completions
  batch_size = 10
  # Loop through the sentences in batches
  for i in range(0, len(sentences_to_test), batch_size):
    # Get the current batch of sentences
    batch = sentences_to_test[i:i+batch_size]
    # Loop through the sentences in the batch
    for ind, sentence in enumerate(batch):
      # Create the system and user messages for each sentence
      messages = [
        {"role": "system", "content": "Use the folowing step-by-step instructon to respond to the user inputs. Step 1 - In the user content which is taken from letters written by CEO to shareholders, you have to identify the existence of dimensions/qualities that are provided in this list given in brackets and that are seperated by commas ['Goal', 'Activity', 'Strategy', 'Plan', 'Structure', 'Innovation', 'Tactics', 'Relevance']. Step 2 - For each of these dimensions, if the dimension exists in the user prompt based on the assistant content I provide to you in the fine-tuning data, answer Yes, otherwise answer No. After step2, this is an example output whose template you must use to provide your answer - ['Goal: No, Activity: Yes, Strategy: Yes, Plan: Yes, Structure: Yes, Innovation: Yes, Tactics: No, Relevance: No']"},
        {"role": "user", "content": sentence}
      ]
      # Try to make a chat completion for the sentence
      try:
        response = client.chat.completions.create(
        model=fine_tuned_model_id,
        messages=messages
        )
        print("Chat completion succeeded for sentence", ind, " and batch ", i)
      # Handle any errors or exceptions
      except Exception as e:
        print("Chat completion failed for sentence", ind, " and batch ", i, ":", e)
        return None
      # Append the assistant message content to the responses list
      responses.append(response.choices[0].message.content)

  # Return the responses list
  return responses
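
`make_predictions` returns None on the first failed request, which discards every response collected so far. One possible alternative, sketched below, is to retry a failed call a few times with an increasing delay. `call_with_retry` and its parameters are hypothetical; `fn` stands for any zero-argument callable, such as a lambda wrapping `client.chat.completions.create(...)`. The flaky function is a stand-in so the sketch runs without network access.

```python
import time

def call_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait and retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for an API call that fails twice before succeeding.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "Goal: No"

delays = []  # record the requested delays instead of actually sleeping
result = call_with_retry(flaky, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper easy to test; in the notebook the default `time.sleep` would be used.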


In [25]:
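# Look up the name of the fine-tuned model from the finished job;
# passing "gpt-3.5-turbo" here would query the base model instead.
fine_tuned_model_id = client.fine_tuning.jobs.retrieve(response.id).fine_tuned_model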
predictions = make_predictions(df_test, fine_tuned_model_id, client)

Chat completion succeeded for sentence 0  and batch  0
Chat completion succeeded for sentence 1  and batch  0
Chat completion succeeded for sentence 2  and batch  0
Chat completion succeeded for sentence 3  and batch  0
Chat completion succeeded for sentence 4  and batch  0
Chat completion succeeded for sentence 5  and batch  0
Chat completion succeeded for sentence 6  and batch  0
Chat completion succeeded for sentence 7  and batch  0
Chat completion succeeded for sentence 8  and batch  0
Chat completion succeeded for sentence 9  and batch  0
Chat completion succeeded for sentence 0  and batch  10
Chat completion succeeded for sentence 1  and batch  10
Chat completion succeeded for sentence 2  and batch  10
Chat completion succeeded for sentence 3  and batch  10
Chat completion succeeded for sentence 4  and batch  10
Chat completion succeeded for sentence 5  and batch  10
Chat completion succeeded for sentence 6  and batch  10
Chat completion succeeded for sentence 7  and batch  10
Ch

In [26]:
print(predictions)

["['Goal: No, Activity: Yes, Strategy: Yes, Plan: Yes, Structure: Yes, Innovation: Yes, Tactics: No, Relevance: No']", "['Goal: No, Activity: Yes, Strategy: Yes, Plan: Yes, Structure: No, Innovation: Yes, Tactics: No, Relevance: No']", "['Goal: No, Activity: No, Strategy: No, Plan: No, Structure: No, Innovation: No, Tactics: Yes, Relevance: No']", "['Goal: No, Activity: No, Strategy: No, Plan: No, Structure: No, Innovation: No, Tactics: No, Relevance: No']", "['Goal: No, Activity: Yes, Strategy: Yes, Plan: No, Structure: No, Innovation: Yes, Tactics: No, Relevance: No']", "['Goal: No, Activity: Yes, Strategy: Yes, Plan: Yes, Structure: Yes, Innovation: Yes, Tactics: No, Relevance: No']", "['Goal: No, Activity: Yes, Strategy: Yes, Plan: No, Structure: No, Innovation: Yes, Tactics: No, Relevance: No']", "['Goal: No, Activity: No, Strategy: Yes, Plan: Yes, Structure: No, Innovation: No, Tactics: No, Relevance: No']", "['Goal: No, Activity: No, Strategy: No, Plan: No, Structure: No, Innova

## Generate predicted test dataset
The predicted dimensions for each sentence in the test dataset are concatenated to the test dataset, and the result is saved to the Drive folder.

In [37]:
def concatenate_predictions(df_test, predictions):
    # Strip any surrounding brackets and quotes, then split on commas
    cleaned_predictions = [item.strip("[]'\" ").split(', ') for item in predictions]

    # Create a DataFrame with columns based on the cleaned predictions
    df_test_prediction = pd.DataFrame(cleaned_predictions, columns=['Goal', 'Activity', 'Strategy', 'Plan', 'Structure', 'Innovation', 'Tactics', 'Relevance'])

    # Adjusting the values in each column to retain only the value after ':'
    for col in df_test_prediction.columns:
        df_test_prediction[col] = df_test_prediction[col].apply(lambda x: x.split(': ')[1])

    # Concatenate the preprocessed dataset with the predictions
    final_test_df = pd.concat([df_test, df_test_prediction], axis=1)

    return final_test_df
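
The cleanup above depends on the exact shape of each response string. As a more tolerant sketch, a regular expression can pull out each `Dimension: Yes/No` pair wherever it appears and leave a missing dimension as None. `parse_prediction` is a hypothetical helper, not part of the notebook.

```python
import re

DIMENSIONS = ['Goal', 'Activity', 'Strategy', 'Plan',
              'Structure', 'Innovation', 'Tactics', 'Relevance']

def parse_prediction(text, dimensions=DIMENSIONS):
    """Map each dimension to 'Yes', 'No', or None if it is absent."""
    found = dict(re.findall(r"(\w+):\s*(Yes|No)\b", text))
    return {dim: found.get(dim) for dim in dimensions}

bracketed = ("['Goal: No, Activity: Yes, Strategy: Yes, Plan: Yes, "
             "Structure: Yes, Innovation: Yes, Tactics: No, Relevance: No']")
plain = "Goal: Yes, Activity: No"  # no brackets, some dimensions missing

print(parse_prediction(bracketed))
print(parse_prediction(plain))
```

A list of such dictionaries can be passed directly to `pd.DataFrame`, and the None entries make malformed responses easy to spot.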


In [38]:
final_test_df = concatenate_predictions(df_test, predictions)

In [39]:
print(final_test_df.head())

   Unnamed: 0                                          paragraph  \
0           0  Chairman's Letter\nFebruary 26, 2015\nTo our S...   
1           1  For Sears and Kmart, after years of work at be...   
2           2  This isn't new for Sears. An article in the Oc...   
3           3  Time and again, people have proclaimed our com...   
4           4  These old stories got it partially right. Had ...   

                                 processed_paragraph Goal Activity Strategy  \
0  chairman's letter\nfebruary 26, 2015\nto our s...   No      Yes      Yes   
1  for sears and kmart, after years of work at be...   No      Yes      Yes   
2  this isn't new for sears. an article in the oc...   No       No       No   
3  time and again, people have proclaimed our com...   No       No       No   
4  these old stories got it partially right. had ...   No      Yes      Yes   

  Plan Structure Innovation Tactics Relevance  
0  Yes       Yes        Yes      No        No  
1  Yes        No    

In [31]:
loc = "/content/drive/MyDrive/GRA_KU_Assessment/TEST/test_data_predictions.xlsx"
final_test_df.to_excel(loc)
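
The introduction mentions comparing the predicted dimensions with the true values. A minimal sketch of a per-dimension accuracy calculation follows, assuming the true labels are available as a DataFrame with the same 'Yes'/'No' columns and the same row order as the predictions. The toy frames and the name `per_dimension_accuracy` are made up for illustration.

```python
import pandas as pd

def per_dimension_accuracy(df_true, df_pred, dimensions):
    """Fraction of rows where prediction equals truth, per dimension."""
    return pd.Series({
        dim: (df_true[dim].to_numpy() == df_pred[dim].to_numpy()).mean()
        for dim in dimensions
    })

dims = ['Goal', 'Activity']  # shortened list for the toy example
df_true = pd.DataFrame({'Goal': ['Yes', 'No', 'No', 'No'],
                        'Activity': ['Yes', 'Yes', 'No', 'No']})
df_pred = pd.DataFrame({'Goal': ['No', 'No', 'No', 'No'],
                        'Activity': ['Yes', 'Yes', 'No', 'No']})

accuracy = per_dimension_accuracy(df_true, df_pred, dims)
print(accuracy)
```

Reporting accuracy per dimension, rather than a single overall number, shows which dimensions the model handles poorly.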