# lilac

Lilac helps you curate data for LLMs, from RAGs to fine-tuning datasets.

https://github.com/lilacai/lilac

https://www.lilacml.com/

## Installation

**Documentation: lilac optional components**

https://github.com/lilacai/lilac/blob/main/pyproject.toml

llms = ["openai"]

langsmith = ["langsmith"]

github = ["llama-index", "llama-hub"]

signals = ["textacy", "detect-secrets", "langdetect", "hdbscan"]

*Individual signals*

lang_detection = ["langdetect"]               # Language detection.

pii = ["detect-secrets", "presidio_analyzer"] # PII.

text_stats = ["textacy"]                      # Text statistics.

*Individual embeddings*

gte = ["sentence-transformers"]

sbert = ["sentence-transformers"]

cohere = ["cohere"]

palm = ["google-generativeai", "google-cloud-aiplatform"]

openai = ["openai"]
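Since each extra above is just a list of PyPI packages, one way to see which optional components an environment already has is to query the installed distributions. The sketch below is not part of Lilac: `EXTRAS` copies only the `signals` and `gte` entries from the list above (the two extras installed later in this guide), and `missing_packages` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Copied from the extras listed above (only the two used later in this guide).
EXTRAS = {
    "signals": ["textacy", "detect-secrets", "langdetect", "hdbscan"],
    "gte": ["sentence-transformers"],
}


def missing_packages(packages):
    """Return the names from `packages` that are not installed."""
    missing = []
    for name in packages:
        try:
            version(name)  # raises PackageNotFoundError when not installed
        except PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for extra, packages in EXTRAS.items():
        print(extra, "missing:", missing_packages(packages))
```

An empty list for an extra suggests its dependencies are in place; anything else can be installed with the matching `pip install lilac[...]` extra.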

### 1. Lilac Server

#### 1.1 Configure the Kubernetes cluster on the host virtual machine

Open a terminal inside the host virtual machine.

*Update the Kubernetes config: add a lilac service mapped to the root URL*

- vi lilac-add.yaml

```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: wordslab-notebooks-ingressroute
  labels:
    wordslab.org/app: wordslab-notebooks
    wordslab.org/component: jupyterlab
  annotations:
    ........
spec:
  entryPoints:
    - web
  routes:
  - match: PathPrefix(`/notebooks`)
    kind: Rule
    services:
    - name: wordslab-notebooks-service
      port: 8888
  ........
  - match: PathPrefix(`/`)
    kind: Rule
    services:
    - name: wordslab-lilac-service
      port: 5432
---
apiVersion: v1
kind: Service
metadata:
  name: wordslab-lilac-service
  labels:
    wordslab.org/app: wordslab-notebooks
    wordslab.org/component: jupyterlab
spec:
  type: ClusterIP
  ports:
    - port: 5432
      targetPort: 5432
      protocol: TCP
  selector:
    wordslab.org/app: wordslab-notebooks
    wordslab.org/component: jupyterlab
```

#### 1.2 Launch the lilac server inside the JupyterLab container

Open a JupyterLab terminal.

*Install lilac server*

- create-workspace-project lilac
- cd /workspace/lilac
- source .venv/bin/activate
- pip install "lilac[signals,gte]"
- vi lilac-start.py

```python
import lilac as ll

project_dir = '/workspace/lilac/bank-project'
server = ll.start_server(host='0.0.0.0', port=5432, project_dir=project_dir)
```

*Start lilac server*

- cd /workspace/lilac
- source .venv/bin/activate
- python lilac-start.py
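Port 5432 is also the default PostgreSQL port, so it may be worth confirming that nothing else is listening on it before starting the server. The following is a minimal sketch using only the standard library; `port_in_use` is a hypothetical helper, not a Lilac function.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something already accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 5432 is the port used in lilac-start.py above.
    print("port 5432 in use:", port_in_use(5432))
```

Run before `python lilac-start.py`, a `True` result would suggest choosing another port (and updating the Kubernetes service accordingly); run afterwards, it indicates the server is listening.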

*Open lilac UI in the browser*

Navigate to the root URL of the host machine:

http://192.168.1.24/ 
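If the page does not load, a quick scripted check can help tell a routing problem from a browser problem. This is a sketch under assumptions: `ui_is_reachable` is a hypothetical helper, and it only checks that some HTTP server answers at the URL, not that the answer comes from Lilac.

```python
import urllib.error
import urllib.request


def ui_is_reachable(url, timeout=5.0):
    """Return True if an HTTP GET on `url` answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, DNS failures and HTTP errors.
        return False


if __name__ == "__main__":
    # Address taken from the guide above; replace it with your own host.
    print(ui_is_reachable("http://192.168.1.24/"))
```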

### 2. Lilac notebook client

#### 2.1 Install and initialize dataset

In [None]:
pip install --upgrade lilac[signals,gte]

In [1]:
import lilac as ll

project_dir = '/workspace/lilac/bank-project'
ll.set_project_dir(project_dir)



In [2]:
with open("/workspace/hftoken", 'r') as file:
    myhftoken = file.read().strip()
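A missing or empty token file only surfaces later as an authentication error from Hugging Face, so a slightly more defensive variant of the cell above may be useful. This is a sketch: `read_token` is a hypothetical helper name, and the path is the one used above.

```python
from pathlib import Path


def read_token(path):
    """Read an access token from a text file, ignoring surrounding whitespace."""
    token_file = Path(path)
    if not token_file.is_file():
        raise FileNotFoundError(f"Token file not found: {token_file}")
    token = token_file.read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"Token file is empty: {token_file}")
    return token
```

With this helper, the cell above would become `myhftoken = read_token("/workspace/hftoken")`.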

In [3]:
source = ll.HuggingFaceSource(dataset_name='frenchtext/banque-fr-2311', split="valid", token=myhftoken)
config = ll.DatasetConfig(namespace='local', name='banque-fr-2311', source=source)
dataset = ll.create_dataset(config)

Resolving data files:   0%|          | 0/42 [00:00<?, ?it/s]

Dataset "banque-fr-2311" written to /workspace/lilac/bank-project/datasets/local/banque-fr-2311


Refresh the Lilac home page in your browser: the new local dataset is now available.
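The dataset can also be checked on disk from the notebook. The sketch below assumes the `<project_dir>/datasets/<namespace>/<name>` layout, which is inferred only from the path printed in the log above; `dataset_dir` is a hypothetical helper, not a Lilac function.

```python
from pathlib import Path


def dataset_dir(project_dir, namespace, name):
    """Build the on-disk location reported by create_dataset in the log above."""
    # Layout assumed from the log line: <project_dir>/datasets/<namespace>/<name>
    return Path(project_dir) / "datasets" / namespace / name


if __name__ == "__main__":
    path = dataset_dir("/workspace/lilac/bank-project", "local", "banque-fr-2311")
    print(path, "exists:", path.is_dir())
```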

In [7]:
source = ll.HuggingFaceSource(dataset_name='frenchtext/bank-en-2401', split="valid", token=myhftoken)
config = ll.DatasetConfig(namespace='local', name='bank-en-2401', source=source)
dataset = ll.create_dataset(config, overwrite=True)

Resolving data files:   0%|          | 0/43 [00:00<?, ?it/s]

Dataset "bank-en-2401" written to /workspace/lilac/bank-project/datasets/local/bank-en-2401
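The two loading cells above differ only in the Hugging Face dataset name, so they could be folded into one loop. The sketch below uses only the Lilac calls that already appear in this notebook (`HuggingFaceSource`, `DatasetConfig`, `create_dataset`); `local_name` and `load_datasets` are hypothetical helper names, and taking the local dataset name from the last part of the repo id is an assumption that happens to match both datasets here.

```python
def local_name(repo_id):
    """Derive the local dataset name from a Hugging Face repo id."""
    return repo_id.split("/")[-1]


def load_datasets(repo_ids, token, split="valid", namespace="local"):
    """Create one Lilac dataset per Hugging Face repo id."""
    import lilac as ll  # imported here so local_name works without Lilac installed

    datasets = {}
    for repo_id in repo_ids:
        source = ll.HuggingFaceSource(dataset_name=repo_id, split=split, token=token)
        config = ll.DatasetConfig(
            namespace=namespace, name=local_name(repo_id), source=source
        )
        # overwrite=True mirrors the second cell above and replaces an existing dataset.
        datasets[repo_id] = ll.create_dataset(config, overwrite=True)
    return datasets
```

For the two datasets in this notebook, the call would be `load_datasets(['frenchtext/banque-fr-2311', 'frenchtext/bank-en-2401'], myhftoken)`.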
