### OCI Data Science - Useful Tips
<details>
<summary><font size="2">Check for Public Internet Access</font></summary>

```python
import requests
response = requests.get("https://oracle.com")
assert response.status_code==200, "Internet connection failed"
```
</details>
<details>
<summary><font size="2">Helpful Documentation </font></summary>
<ul><li><a href="https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm">Data Science Service Documentation</a></li>
<li><a href="https://docs.cloud.oracle.com/iaas/tools/ads-sdk/latest/index.html">ADS documentation</a></li>
</ul>
</details>
<details>
<summary><font size="2">Typical Cell Imports and Settings for ADS</font></summary>

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

import ads
from ads.dataset.factory import DatasetFactory
from ads.automl.provider import OracleAutoMLProvider
from ads.automl.driver import AutoML
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import ADSData
from ads.explanations.explainer import ADSExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
from ads.catalog.model import ModelCatalog
from ads.common.model_artifact import ModelArtifact
```
</details>
<details>
<summary><font size="2">Useful Environment Variables</font></summary>

```python
import os
print(os.environ["NB_SESSION_COMPARTMENT_OCID"])
print(os.environ["PROJECT_OCID"])
print(os.environ["USER_OCID"])
print(os.environ["TENANCY_OCID"])
print(os.environ["NB_REGION"])
```
</details>

In [17]:
#!pip install -U oci
#!pip install word2number
#!pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-17.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.3 kB)
Downloading pyarrow-17.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (40.0 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-17.0.0


## Import necessary libraries

In [18]:
import pandas as pd
import oci
import re
from word2number import w2n

## Configurations to access OCI resources such as object storage, Gen AI APIs, etc.

In [7]:
CONFIG_PROFILE = "DEFAULT"
config = oci.config.from_file('~/.oci/config', CONFIG_PROFILE)

## Initialize parameters used for Gen AI inferencing

In [23]:
# compartment OCID
# TODO: replace this with your own compartment Id
compartment_id = "ocid1.compartment.xxxx"

# endpoint for calling Gen AI services
endpoint = "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com"

# client for calling Gen AI inferencing service
generative_ai_inference_client = oci.generative_ai_inference.GenerativeAiInferenceClient(config=config, service_endpoint=endpoint, timeout=(10,240))

#OOTB model Id:
model_id="cohere.command-r-16k"

#Endpoint created on the fine-tuned model trained on the 90% dataset:
# TODO: replace this with your own endpoint Id
endpoint_id = "ocid1.generativeaiendpoint.xxxx"

In [24]:
text_gen_params = \
{
    'max_new_tokens': 600,
    'temperature'   : 0.0,
    'top_p'         : 0.75,
    'top_k': 0,
    'frequency_penalty': 0.05,
    'presence_penalty': 0,
    'do_sample'     : False
}

## Prompt templates

### 1. Prompt for generating answers to math questions (this needs to be consistent with the one used for fine-tuning):

In [25]:
prompt_template = """input: "{transcript}"
Solve the above math question. Describe the steps to solve together with the result.
State the final result in the last line in the output. Show the final result as one number (preferred if possible) or one word.
Don't add any extra line after the final result.
"""

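As a quick illustration of how this template is used, the sketch below fills the `{transcript}` placeholder with `str.format`, which is the same call `runCohereChat` makes later. The sample question is made up for demonstration and is not taken from the dataset.

```python
# Same template text as in the cell above, repeated so the sketch is self-contained.
prompt_template = """input: "{transcript}"
Solve the above math question. Describe the steps to solve together with the result.
State the final result in the last line in the output. Show the final result as one number (preferred if possible) or one word.
Don't add any extra line after the final result.
"""

# Hypothetical sample question, used only to show the substitution.
sample_question = "What is 12 plus 30?"

# str.format replaces {transcript}; braces inside the substituted value are left untouched.
prompt = prompt_template.format(transcript=sample_question)
print(prompt)
```

Note that any other literal braces added to the template itself would need to be doubled (`{{` and `}}`), otherwise `str.format` treats them as placeholders.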
### 2. Prompt for extracting keys from answers to math questions (for performance evaluation):

In [26]:
prompt_template_key = """input: "{transcript}"
Read the above math answer. Extract only the final result as a single number or word in the answer. Don't add any symbol, comma, or period unless they are part of the number. 
Note that sometimes the final results are enclosed in a pair of two asterisks, sometimes they are not.
Example #1: 
input: "If Jungkook is in 5th place, then **4** people crossed the finish line faster than him."
output: 4

Example #2: 
input: "Let's call the certain number "x". According to the information given:

A number divided by 10 is 6:
x / 10 = 6

Yoongi got the result by subtracting 15 from x:
Result = x - 15

First, we need to find the value of x. We can do this by solving the first equation:

x / 10 = 6
x = 6 * 10
x = 60

Now that we know x is 60, we can find the result Yoongi got by subtracting 15 from x:

Result = x - 15
Result = 60 - 15
Result = 45

So, the result Yoongi got is **45**."
output: 45

Example #3:
input: "To solve this math problem, we need to determine the exchange rate from US Dollars to Korean Won. Soojeong exchanged 140 USD and received 158,760 KRW.

We can use the following formula to calculate the exchange rate: 
1 USD = amount of KRW

Setting up the equation based on the given information: 
1 USD = 158,760 KRW 

Solving for the exchange rate, we divide both sides by 158,760: 
1 USD / 158,760 KRW = 1 

The exchange rate of the Korean Won to the US Dollar today is approximately **1:158,760** or KRW 158,760 for 1 USD."
output: 1:158,760

Example #4:
input: "The difference between the largest and smallest four-digit numbers that can be made from the digits 2, 0, 3, 5, and 8 is 6497."
output: 6497
"""

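The prompt above delegates key extraction to the LLM. As a possible cross-check, a deterministic heuristic could follow the same rule the prompt describes: prefer the value enclosed in a pair of double asterisks, otherwise fall back to the last number in the text. The helper name `extract_key_heuristic` and both regular expressions are assumptions made for this sketch; the notebook itself does not use them.

```python
import re

# Matches text wrapped in a pair of double asterisks, e.g. **45**.
BOLD_SPAN = re.compile(r"\*\*(.+?)\*\*")
# Matches plain numbers such as 45, 16.5 or 158,760 (signs are not handled).
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def extract_key_heuristic(answer):
    """Return the last bolded span, else the last number, else None."""
    bold = BOLD_SPAN.findall(answer)
    if bold:
        return bold[-1].strip()
    numbers = NUMBER.findall(answer)
    return numbers[-1] if numbers else None

print(extract_key_heuristic("So, the result Yoongi got is **45**."))
print(extract_key_heuristic("The final result: 16.5 inches."))
print(extract_key_heuristic("approximately **1:158,760** or KRW 158,760 for 1 USD."))
```

A heuristic like this cannot return one-word answers that are not bolded, which is one reason an LLM-based extractor may still be the better primary method.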
## Inferencing Function (supports calling either the OOTB model or the fine-tuned model)

In [27]:
def runCohereChat(transcript, 
                  text_gen_params = text_gen_params, 
                  prompt_template = prompt_template, 
                  use_prompt_template = True, 
                  use_FT_model = False,
                  verbose=False, 
                  model_id=model_id,
                  endpoint_id=endpoint_id):
   
    chat_detail = oci.generative_ai_inference.models.ChatDetails()
    cohere_chat_request = oci.generative_ai_inference.models.CohereChatRequest()
    if not use_FT_model:
        chat_detail.serving_mode = oci.generative_ai_inference.models.OnDemandServingMode(model_id=model_id)
    else:
        chat_detail.serving_mode = oci.generative_ai_inference.models.DedicatedServingMode(endpoint_id=endpoint_id)
        
    chat_detail.compartment_id = compartment_id
    
    cohere_chat_request.max_tokens = text_gen_params['max_new_tokens']
    cohere_chat_request.temperature = text_gen_params['temperature']
    cohere_chat_request.frequency_penalty = text_gen_params['frequency_penalty']
    #cohere_chat_request.presence_penalty = text_gen_params['presence_penalty']
    cohere_chat_request.top_p = text_gen_params['top_p']
    #cohere_chat_request.top_k = text_gen_params['top_k']

    if use_prompt_template:
        prompt = prompt_template.format(transcript=transcript)
    else:
        prompt = transcript
    
    cohere_chat_request.message = prompt

    chat_detail.chat_request = cohere_chat_request
    
    try:
        generate_text_response = generative_ai_inference_client.chat(chat_detail)

        # The SDK response exposes its payload on .data; the generated text is in chat_response.text
        output_str = generate_text_response.data.chat_response.text
                               
        if verbose:
            print(f'Prompt: {prompt}\n Output: {output_str}')
        else:
            print(f'Output: {output_str}')
        
        return output_str
    
    except Exception as error:
        print("An exception occurred:", error)
        return error
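
`runCohereChat` returns the exception object when a call fails, and the evaluation in this notebook issues thousands of calls, so transient failures can end up stored as answers. One possible mitigation is a small retry wrapper with exponential backoff. This is a minimal sketch: `call_with_retry`, `max_attempts`, and `base_delay` are hypothetical names and values, not part of the notebook or the OCI SDK, and the demo uses a stand-in function instead of a real API call.

```python
import time

def call_with_retry(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on any exception.

    Hypothetical helper: max_attempts and base_delay are illustrative defaults.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as error:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.1f}s")
            sleep(delay)

# Demo with a stand-in function that fails twice before succeeding.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = call_with_retry(flaky, base_delay=0.0)
print(result)
```

Because `runCohereChat` catches exceptions itself, a wrapper like this would only have an effect around the raw `generative_ai_inference_client.chat(chat_detail)` call inside its `try` block.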

## Performance evaluation Function (comparing key in correct answer with key in LLM answer)

In [28]:
def checkMathAccuracy(data, col1, col2):
    correctCt=0
    DF=data[[col1, col2]].copy(deep=True)
    DF[col1] = DF[col1].str.rstrip('.')
    DF[col2] = DF[col2].str.rstrip('.')

    for idx in range(DF.shape[0]):
        if DF.loc[idx, col1]==DF.loc[idx, col2]:
            correctCt +=1
            continue

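        # Fall back to number-word conversion (e.g. "twenty-four" vs. "24").
        # w2n.word_to_num raises when a key cannot be converted; in that case move on to the next check.
        try:
            if w2n.word_to_num(DF.loc[idx, col1]) == w2n.word_to_num(DF.loc[idx, col2]):
                correctCt +=1
                continue
        except Exception:
            pass

        # Convert only the first key and compare it with the second key as a string
        try:
            if str(w2n.word_to_num(DF.loc[idx, col1])) == DF.loc[idx, col2]:
                correctCt +=1
                continue
        except Exception:
            pass

        # Convert only the second key and compare it with the first key as a string
        try:
            if DF.loc[idx, col1] == str(w2n.word_to_num(DF.loc[idx, col2])):
                correctCt +=1
                continue
        except Exception:
            pass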

        #print(idx)
        #print("Correct Answer: " + DF.loc[idx, col1])
        #print("CmdR Answer: " + DF.loc[idx, col2])
        
    return correctCt
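
`checkMathAccuracy` counts a match only when the two keys are identical strings or agree after `w2n.word_to_num` conversion, so numerically equal keys written differently (for example `16 1/2` and `16.5`) would likely be scored as a miss. The sketch below shows one way such keys could be compared numerically with the standard library. `to_fraction` and `keys_match` are hypothetical helpers; they are not what produced the accuracy figures in this notebook, and using them would change those figures.

```python
from fractions import Fraction

def to_fraction(key):
    """Parse keys such as '16.5', '16 1/2', '3/4' or '158,760' into a Fraction.

    Returns None when the key is not a plain number.
    Negative mixed numbers are not handled in this sketch.
    """
    text = key.strip().rstrip(".").replace(",", "")
    try:
        parts = text.split()
        if len(parts) == 2 and "/" in parts[1]:
            # Mixed number: whole part plus a fractional part, e.g. '16 1/2'.
            return Fraction(parts[0]) + Fraction(parts[1])
        return Fraction(text)
    except (ValueError, ZeroDivisionError):
        return None

def keys_match(expected, predicted):
    """Exact string match first, then numeric equality if both keys parse."""
    if expected == predicted:
        return True
    a, b = to_fraction(expected), to_fraction(predicted)
    return a is not None and a == b

print(keys_match("16.5", "16 1/2"))
print(keys_match("0.505", "0.51"))
print(keys_match("158,760", "158760"))
```

Exact rational arithmetic avoids floating-point surprises, and a rounded answer such as `0.51` is still treated as different from `0.505`, which matches the strictness of the original check.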

## Evaluate the accuracy of the OOTB model vs. the Fine-tuned model over the out-of-bag samples (1K)

In [29]:
# Load the raw dataset and extract 1000 out-of-bag samples from the tail of the dataset
data_path = "../../Data/MathTrainingSet.parquet"
mathQnA = pd.read_parquet(data_path, engine='pyarrow')

last_1k=mathQnA.tail(1000).copy(deep=True)
last_1k=last_1k.reset_index(drop=True)

### First, obtain answers by inferencing from the OOTB model for these out-of-bag samples (1K)

In [94]:
## Obtain answers for the last 1000 samples using the OOTB model:
for idx in range(1000):
    print("\nSample #: " + str(idx))
    try:
        transcript=last_1k.iloc[idx]['question']
        response = runCohereChat(transcript)
    
    except Exception as error:
        print("An exception occurred")
        print(error)
        response = error
        #latency = 0
            
    last_1k.loc[idx, 'CohereCmdR_answer']=response


Sample #: 0
Output: We need to find the difference between John's walking distance and Nina's walking distance. 

John walks 1.74 miles, and Nina walks 1.235 miles, so we can find the difference like this: 
1.74 miles - 1.235 miles = 0.505 miles 

John walks 0.51 miles farther than Nina.

The final result is 0.51.

Sample #: 1
Output: Isha initially had a 31.25-inch pencil, which shrank to 14.75 inches after sharpening. We need to find the difference between these two lengths to determine the amount sharpened off. 

Here are the steps to solve:
1. Subtract the length after sharpening from the initial length: 31.25 inches - 14.75 inches = 16.5 inches.

16 1/2 inches were sharpened off the pencil.
 
The final result: 16.5 inches.

Sample #: 2
Output: Mrs. Santiago has 58 - 10 + 5 = 53 red roses.
Mrs. Garrett has 24 - 10 + 5 = 29 red roses.
Mrs. Santiago has 53 - 29 = 24 more red roses than Mrs. Garrett.

Twenty-four.

Sample #: 3
Output: Diana has enough erasers to share with 14 friends

### Next, obtain answers by inferencing from the Fine-tuned model for these out-of-bag samples (1K)

In [93]:
## Obtain answers for the last 1000 samples using MathQnA_First180K_3ep model fine-tuned using the first 180K samples only:

for idx in range(1000):
    print("\nSample #: " + str(idx))
    try:
        transcript=last_1k.iloc[idx]['question']
        #print("\nQuestion: " + transcript + "\n")
        response = runCohereChat(transcript, prompt_template = prompt_template, use_FT_model = True, endpoint_id = endpoint_id)
        #print("\nCorrect Answer: " + last_1k.iloc[idx]['answer'] + "\n")
    
    except Exception as error:
        print("An exception occurred")
        print(error)
        response = error
            
    last_1k.loc[idx, 'CohereCmdR_FT_answer']=response


Sample #: 0
Output: To find out how much farther John walks than Nina, we need to subtract the distance Nina walks from the distance John walks.

John walks 1.74 miles to school.
Nina walks 1.235 miles to school.

So, John walks 1.74 - 1.235 = 0.505 miles farther than Nina.

Sample #: 1
Output: To find out how much Isha sharpened off her pencil, we need to subtract the length of the pencil after sharpening from the original length of the pencil before sharpening.

So, we calculate:

31.25 inches (original length) - 14.75 inches (length after sharpening) = 16.5 inches

Isha sharpened off 16.5 inches of her pencil.

Sample #: 2
Output: Mrs. Santiago has 58 red roses.
Mrs. Garrett has 24 red roses.

To find out how many more red roses Mrs. Santiago has than Mrs. Garrett, we subtract the number of red roses Mrs. Garrett has from the number of red roses Mrs. Santiago has:

58 (Mrs. Santiago's red roses) - 24 (Mrs. Garrett's red roses) = 34

So, Mrs. Santiago has 34 more red roses than Mrs.

### Next, extract keys from correct answers, OOTB answers, and the Fine-tuned model answers for these out-of-bag samples (1K)

In [96]:
# Extract the keys from answers:

for idx in range(1000):
    print("\nSample #: " + str(idx))
    try:
        transcript=last_1k.iloc[idx]['CohereCmdR_answer']
        print("\nQuestion: " + transcript + "\n")
        print("\nOOTB Model Answer: ")
        response = runCohereChat(transcript, prompt_template = prompt_template_key)
        last_1k.loc[idx, 'CohereCmdR_key']=response

        transcript=last_1k.iloc[idx]['CohereCmdR_FT_answer']
        print("\nFine-tuned Model Answer: ")
        response = runCohereChat(transcript, prompt_template = prompt_template_key)
        last_1k.loc[idx, 'CohereCmdR_FT_key']=response
        
        transcript=last_1k.iloc[idx]['answer']
        print("\nCorrect Answer: ")
        response = runCohereChat(transcript, prompt_template = prompt_template_key)
        last_1k.loc[idx, 'Correct_key']=response
    except Exception as error:
        print("An exception occurred")
        print(error)
        response = error
        latency = 0


Sample #: 0

Question: We need to find the difference between John's walking distance and Nina's walking distance. 

John walks 1.74 miles, and Nina walks 1.235 miles, so we can find the difference like this: 
1.74 miles - 1.235 miles = 0.505 miles 

John walks 0.51 miles farther than Nina.

The final result is 0.51.


OOTB Model Answer: 
Output: 0.51

Fine-tuned Model Answer: 
Output: 0.505

Correct Answer: 
Output: 0.505

Sample #: 1

Question: Isha initially had a 31.25-inch pencil, which shrank to 14.75 inches after sharpening. We need to find the difference between these two lengths to determine the amount sharpened off. 

Here are the steps to solve:
1. Subtract the length after sharpening from the initial length: 31.25 inches - 14.75 inches = 16.5 inches.

16 1/2 inches were sharpened off the pencil.
 
The final result: 16.5 inches.


OOTB Model Answer: 
Output: 16 1/2

Fine-tuned Model Answer: 
Output: 16.5

Correct Answer: 
Output: 16.5

Sample #: 2

Question: Mrs. Santiago h

### Finally, compute the accuracy for both OOTB and fine-tuned models:

In [97]:
# Evaluate the accuracy for OOTB and fine-tuned models:
ootb_accuracy = checkMathAccuracy(last_1k, 'Correct_key', 'CohereCmdR_key')/1000
print(ootb_accuracy)
ft_accuracy = checkMathAccuracy(last_1k, 'Correct_key', 'CohereCmdR_FT_key')/1000
print(ft_accuracy)

# Fine-tuning gives a 10.7-percentage-point absolute (43% relative) increase in accuracy over the OOTB model

0.248
0.355


## Fine-tuning achieved a 10.7-percentage-point absolute increase, or a **43%** relative increase, in accuracy over the OOTB model!
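
For reference, the two figures in the heading follow directly from the accuracies printed above. The short sketch below recomputes them; the variable names `absolute_gain` and `relative_gain` are introduced here for illustration only.

```python
ootb_accuracy = 0.248  # printed above for the OOTB model
ft_accuracy = 0.355    # printed above for the fine-tuned model

# Absolute gain: difference between the two accuracies, in percentage points.
absolute_gain = ft_accuracy - ootb_accuracy
# Relative gain: the same difference expressed as a fraction of the OOTB baseline.
relative_gain = absolute_gain / ootb_accuracy

print(f"Absolute gain: {absolute_gain * 100:.1f} percentage points")
print(f"Relative gain: {relative_gain * 100:.0f}%")
```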

In [98]:
#last_1k.to_csv("../Output/MathQnA_Last1000_WithAnswers.csv")