# Scenario 6: Analyzing Correlations Between Food Incidents and External Factors
This notebook is complementary material for the walkthrough scenario **Analyzing Correlations Between Food Incidents and External Factors** using the STELAR KLMS.
It is not intended to be run as a standalone notebook. It **requires access to a deployment of STELAR KLMS** and an **account** on the respective instance. 

Some of the instances used during the evaluation period of the STELAR Project are:

Internal Pilot Instance: https://klms.stelar.gr

Public Sandbox Instance: https://sandbox.stelar.gr


*If you don't have an account on the STELAR KLMS, you can create one on the respective instance. 
Kindly note that the internal pilot instance is only accessible to STELAR project members, while the public sandbox instance is open to everyone by registration.*

---
# Overview

This notebook runs the **Correlation Detective** tool to discover correlations in a food incidents dataset provided by Agroknow.

### Prerequisites

- Fill in your account credentials in the block below.
- Select datasets according to the walkthrough directions.
- Ensure you have a modern Python version installed (3.9 or later).
- Install the STELAR Python SDK and any other required libraries (`pip install stelar_client --upgrade`).
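
The last two prerequisites can be checked before running the rest of the notebook. The following is a minimal sketch, not part of the STELAR SDK: the helper name `check_prerequisites` is hypothetical, and the top-level module name `stelar` is assumed from the import statement used in the next cell.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in the prerequisites


def check_prerequisites(module_name="stelar"):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python 3.9 or later is required, found {found}")
    # find_spec returns None when a top-level module cannot be located
    if importlib.util.find_spec(module_name) is None:
        problems.append(f"module '{module_name}' is not installed")
    return problems


for problem in check_prerequisites():
    print("Prerequisite issue:", problem)
```

If nothing is printed, both checks passed.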

### Instantiate a STELAR Client object
**Modify credentials and base URL as needed.**

In [None]:
from stelar.client import Client, Dataset, TaskSpec, Process
from datetime import datetime

# Base URL
# Sandbox: https://sandbox.stelar.gr
# Internal Pilots: https://klms.stelar.gr

BASE_URL = "https://sandbox.stelar.gr"
USERNAME = "your_username"  # Replace with your username
PASSWORD = "your_password"  # Replace with your password

c = Client(base_url=BASE_URL, username=USERNAME, password=PASSWORD)
print(f"Connected to STELAR KLMS @ {c._base_url} as {c._username}")
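
To avoid storing a password in the notebook itself, the connection settings can be read from environment variables instead. This is a minimal sketch under assumptions: the variable names `STELAR_BASE_URL`, `STELAR_USERNAME` and `STELAR_PASSWORD` and the helper `load_credentials` are hypothetical and are not defined by the STELAR SDK.

```python
import os


def load_credentials(env=os.environ):
    """Read connection settings from environment variables (hypothetical names)."""
    return {
        # Default to the public sandbox instance listed at the top of this notebook
        "base_url": env.get("STELAR_BASE_URL", "https://sandbox.stelar.gr"),
        "username": env.get("STELAR_USERNAME"),
        "password": env.get("STELAR_PASSWORD"),
    }


creds = load_credentials()
missing = [key for key, value in creds.items() if not value]
print("Missing settings:", missing or "none")
```

The dictionary keys match the keyword arguments passed to `Client` in the cell above, so once no setting is missing it could be used as `Client(**creds)`.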

### Select a food safety incidents dataset 

In [None]:
food_safety_dataset = c.datasets["food-safety-incidents-products-and-hazards"]
print(f"Selected Dataset: {food_safety_dataset.id} | {food_safety_dataset.title}")
print(f"Browse the dataset at: {c._base_url}/console/v1/catalog/{food_safety_dataset.id}")

### Create/Select a Workflow Process to run the correlation detection task

In [None]:
ORGANIZATION = "stelar-klms"

try:
    proc = c.processes.create(**{
        "title": "Evaluation Workflow for " + c._username,
        "name": "evaluation-workflow-" + c._username,
        "organization": c.organizations[ORGANIZATION]
    })
    print(f"Created new process for evaluation: {proc.id} | {proc.title}")
except Exception as e:
    proc = c.processes["evaluation-workflow-" + c._username]
    print(f"Using existing process for evaluation: {proc.id} | {proc.title}")
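
The cell above tries to create the process first and falls back to a lookup on any exception, which also hides unrelated failures such as network or permission errors. An alternative is to look the entity up first and create it only when the lookup fails. The sketch below illustrates this with a plain dictionary. It assumes that a missing name raises `KeyError`; the exception raised by the STELAR client may differ, so the `missing` parameter is configurable. The helper `get_or_create` is hypothetical.

```python
def get_or_create(collection, name, create, missing=(KeyError,)):
    """Return (entity, created): look `name` up first, call `create()` only if absent.

    `missing` lists the exception types that mean "not found"; any other
    exception propagates to the caller instead of being swallowed.
    """
    try:
        return collection[name], False
    except missing:
        return create(), True


# Demonstration with a plain dict standing in for a client collection
existing = {"evaluation-workflow-alice": "process-1"}

proc_a, created_a = get_or_create(existing, "evaluation-workflow-alice", lambda: "process-new")
proc_b, created_b = get_or_create(existing, "evaluation-workflow-bob", lambda: "process-new")
print(proc_a, created_a)
print(proc_b, created_b)
```

The same pattern applies to the results dataset created in the next step.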

### Create a dataset to store the results of the correlation detection task

In [None]:
ORGANIZATION = "stelar-klms"

try:
    res_dset = c.datasets.create(**{
        "title": "Food Incident Correlations for " + c._username,
        "name": "food-incident-correlations-" + c._username,
        "organization": c.organizations[ORGANIZATION],
        "notes": "Food incidents correlation curated by " + c._username,
    })
    print(f"Created new dataset for correlations: {res_dset.id} | {res_dset.title}")
except Exception as e:
    res_dset = c.datasets["food-incident-correlations-" + c._username]
    print(f"Using existing dataset for correlations: {res_dset.id} | {res_dset.title}")

### Prepare & Run the Correlation Detective task

In [None]:
# Start by building a new task spec for running Correlation Detective
t = TaskSpec(tool='correlation-detective', name='Food incidents correlation detection')

# Define the input, in this case the food safety dataset

# Choose
#   - food_safety_dataset.resources[0] if you want to use the resource for weekly incidents 
#   - food_safety_dataset.resources[1] if you want to use the resource for monthly incidents
#
t.i(data_file=food_safety_dataset.resources[1].id)

# Set a local alias for the dataset in which the results will be stored.
t.d(alias="d0", dset=res_dset.id)


# Set the parameters of the Correlation Algorithm to use and configure.
t.p(simMetricName="multipole",
    maxPLeft=3,
    maxPRight=0,
    queryType="TOPK",
    nVectors=100)

# Create a timestamp to be able to identify the files and not overwrite them in subsequent executions
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")

# Set the destination path for the result files. 
t.o(correlations_file={
    "url": f"s3://klms-bucket/evaluation/experiments/proc-{proc.id}/correlations_{timestamp}_{c._username}.json",
    "resource": {
        "name": "Correlations for monthly food incidents",
        "relation": "correlations",
        "format": "CSV"
    },
    "dataset": "d0"
})


# Run the task in the selected workflow process
cd_task = proc.run(t)
print(f"Task {cd_task.id} is running. Check the status at: {c._base_url}/console/v1/task/{proc.id}/{cd_task.id}")
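
The timestamped destination path used in the cell above can be factored into a small helper, which makes the naming scheme easier to inspect and reuse. This is a minimal sketch: the function name `correlations_url` and its parameters are hypothetical, while the bucket prefix and file name pattern are copied from the cell above.

```python
from datetime import datetime


def correlations_url(proc_id, username, now=None,
                     prefix="s3://klms-bucket/evaluation/experiments"):
    """Build a result path that is unique per process, user and second."""
    now = now or datetime.now()
    # Same format as in the task cell: year, month, day, hour, minute, second
    timestamp = now.strftime("%Y%m%d%H%M%S")
    return f"{prefix}/proc-{proc_id}/correlations_{timestamp}_{username}.json"


# A fixed datetime makes the output reproducible for illustration
print(correlations_url("1234", "alice", now=datetime(2025, 1, 31, 9, 5, 0)))
# s3://klms-bucket/evaluation/experiments/proc-1234/correlations_20250131090500_alice.json
```

Because the timestamp has one-second resolution, two runs started by the same user in the same process within the same second would map to the same path.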