### OCI Data Science - Useful Tips
<details>
<summary><font size="2">Check for Public Internet Access</font></summary>

```python
import requests
response = requests.get("https://oracle.com", timeout=10)  # fail fast instead of hanging without a route out
assert response.status_code==200, "Internet connection failed"
```
</details>
<details>
<summary><font size="2">Helpful Documentation </font></summary>
<ul><li><a href="https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm">Data Science Service Documentation</a></li>
<li><a href="https://docs.cloud.oracle.com/iaas/tools/ads-sdk/latest/index.html">ADS documentation</a></li>
</ul>
</details>
<details>
<summary><font size="2">Typical Cell Imports and Settings for ADS</font></summary>

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

import ads
from ads.dataset.factory import DatasetFactory
from ads.automl.provider import OracleAutoMLProvider
from ads.automl.driver import AutoML
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import ADSData
from ads.explanations.explainer import ADSExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
from ads.catalog.model import ModelCatalog
from ads.common.model_artifact import ModelArtifact
```
</details>
<details>
<summary><font size="2">Useful Environment Variables</font></summary>

```python
import os
print(os.environ["NB_SESSION_COMPARTMENT_OCID"])
print(os.environ["PROJECT_OCID"])
print(os.environ["USER_OCID"])
print(os.environ["TENANCY_OCID"])
print(os.environ["NB_REGION"])
```
</details>

In [2]:
import pandas as pd

In [3]:
data_path = "../../Data/MathTrainingSet.parquet"
mathQnA = pd.read_parquet(data_path, engine='pyarrow')

In [4]:
print(mathQnA.columns)
mathQnA.head(5)

Index(['question', 'answer'], dtype='object')


|   | question | answer |
|---|---|---|
| 0 | Jungkook is the 5th place. Find the number of ... | If Jungkook is in 5th place, then 4 people cro... |
| 1 | A number divided by 10 is 6. Yoongi got the re... | Let's call the certain number "x". According t... |
| 2 | Dongju selects a piece of paper with a number ... | To find the second smallest and third smallest... |
| 3 | You wanted to subtract 46 from a number, but y... | If you accidentally subtracted 59 instead of 4... |
| 4 | The length of one span of Jinseo is about 12 c... | If one span of Jinseo is about 12 centimeters ... |


In [5]:
prompt_template = """input: "{transcript}"
Solve the above math question. Describe the steps to solve together with the result.
State the final result in the last line in the output.
Don't add any extra line after the final result.
"""

In [6]:
def prompt(transcript):
    # Fill the question text into the template defined above
    return prompt_template.format(transcript=transcript)

In [8]:
ft_data = pd.DataFrame(columns = ["prompt", "completion"])
ft_data['prompt'] = mathQnA["question"].apply(prompt)
ft_data['completion'] = mathQnA["answer"]

In [None]:
ft_data.head(5)

In [9]:
print(ft_data.iloc[0].prompt)

input: "Jungkook is the 5th place. Find the number of people who crossed the finish line faster than Jungkook."
Solve the above math question. Describe the steps to solve together with the result.
State the final result in the last line in the output.
Don't add any extra line after the final result.



In [10]:
ft_data.shape

(200035, 2)

## Select the first 180,000 rows (about 90%) for fine-tuning and leave the rest for evaluation

In [11]:
output_path = "../../Data/math_QnA_set_first_180K.jsonl"

# Write the first 180,000 rows as JSON Lines; the remaining rows are reserved for evaluation
with open(output_path, "w", encoding="utf-8") as f:
    f.write(ft_data[['prompt', 'completion']].head(180000).to_json(orient='records', lines=True, force_ascii=False))
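
The cell above writes only the fine-tuning portion. One way to keep the remaining rows for evaluation is sketched below. The helper names `split_head_tail` and `write_jsonl`, and the small stand-in DataFrame, are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd


def split_head_tail(df: pd.DataFrame, n_train: int):
    """Return (first n_train rows, remaining rows) without shuffling."""
    train = df.iloc[:n_train]
    held_out = df.iloc[n_train:]
    return train, held_out


def write_jsonl(df: pd.DataFrame, path: str) -> None:
    """Write one JSON object per row, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(df.to_json(orient="records", lines=True, force_ascii=False))


# Small stand-in for ft_data so the sketch runs on its own (hypothetical data).
demo = pd.DataFrame({
    "prompt": [f"question {i}" for i in range(10)],
    "completion": [f"answer {i}" for i in range(10)],
})
demo_train, demo_eval = split_head_tail(demo, 9)
print(len(demo_train), len(demo_eval))
```

Applied to this notebook, `split_head_tail(ft_data, 180000)` would leave 200,035 - 180,000 = 20,035 rows in the held-out part, which could then be passed to `write_jsonl` with an evaluation file name of your choosing.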

## Upload the training samples to the Object Storage bucket that holds the fine-tuning dataset

In [13]:
import oci
from oci.object_storage import UploadManager

CONFIG_PROFILE = "DEFAULT"
config = oci.config.from_file('~/.oci/config', CONFIG_PROFILE)

In [15]:
# Initialize service client with default config file
config['region'] = "us-chicago-1"
object_storage_client = oci.object_storage.ObjectStorageClient(config)

# TODO: replace this with your own namespace
Namespace = 'abcdefg'

# TODO: replace this with your own bucket Name
bucketName="FinetuneData"
finetuneDataBucket = object_storage_client.get_bucket(namespace_name=Namespace, bucket_name=bucketName)
print(finetuneDataBucket.data.name)

FinetuneData


In [16]:
upload_manager = UploadManager(object_storage_client, allow_parallel_uploads=True, parallel_process_count=10)
fileName="math_QnA_set_first_180K.jsonl"
upload_manager.upload_file(Namespace, bucketName, fileName, '../../Data/' + fileName)  # same file written to output_path above

<oci.response.Response at 0x7f14598347f0>
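
The `Response` object printed above only shows that the call returned. A possible follow-up check, sketched here, compares the stored object's size with the local file. It assumes the client's `head_object` response carries a `Content-Length` header; the helper name `upload_size_matches` is hypothetical.

```python
import os


def upload_size_matches(client, namespace, bucket, object_name, local_path):
    """Compare the stored object's Content-Length with the local file size."""
    # Assumption: head_object returns a response whose headers include Content-Length.
    response = client.head_object(namespace, bucket, object_name)
    remote_size = int(response.headers["Content-Length"])
    return remote_size == os.path.getsize(local_path)
```

In this notebook it could be called as `upload_size_matches(object_storage_client, Namespace, bucketName, fileName, local_path)`, where `local_path` is the same path that was passed to `upload_file`. A size match is only a coarse check and does not prove the contents are identical.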