skit-calls

Skit.ai's calls library.

We provide means to sample calls and conversations (aka turns) from a specified environment. This data is required for analysis and for training machine learning models. The library currently offers one thing: calls aggregated with their conversation turns.

We use this project as a component within skit-pipelines.

Installation

The installation is a little quirky because the library is meant to be used within a separate project. You need credentials from skit.ai to get past dvc pull and beyond.

Pre-requisites

  1. Miniconda or any other Python version management tool.
  2. awscli
  3. poetry
  4. S3 access credentials from skit.ai
  5. Tunnel secrets, obtained by mailing skit.ai.
  6. libpq-dev, postgresql-libs, or brew install postgresql on macOS (Intel or M1).
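
Before cloning, it can help to confirm that the command-line prerequisites are on your PATH. The sketch below is only an illustration: the binary names are assumptions (awscli usually installs as aws, and pg_config is assumed to ship with the PostgreSQL client libraries), and it cannot check credentials or secrets.

```shell
# Print each command from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumed binary names for the prerequisites listed above.
missing="$(missing_tools git aws poetry pg_config)"

if [ -n "$missing" ]; then
  # Left unquoted on purpose so the names print on one line.
  echo "Missing prerequisites:" $missing
else
  echo "All checked prerequisites found."
fi
```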

Project setup

git clone git@github.com:skit-ai/skit-calls.git
cd skit-calls

Installing dependencies

poetry install
dvc pull

The dvc pull command creates a secrets/ directory. This is where we store our queries and environment variables.

Local environment

To run against a local environment, change the first line of secrets/env.sh to export DB_HOST="localhost", then source the file:

source secrets/env.sh
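
The first-line change can also be scripted. This is a minimal sketch, assuming the DB_HOST export really is line 1 of the env file; the function name and the demo file contents (including DB_PORT) are made up for illustration. It avoids sed -i so that it behaves the same with GNU and BSD (macOS) sed.

```shell
# Rewrite line 1 of the given env file so DB_HOST points at localhost.
set_local_db_host() {
  env_file="$1"
  tmp_file="$(mktemp)"
  # Replace only the first line; every other line passes through unchanged.
  sed '1s/.*/export DB_HOST="localhost"/' "$env_file" > "$tmp_file" \
    && mv "$tmp_file" "$env_file"
}

# Demonstration on a throwaway file with invented contents.
demo_file="$(mktemp)"
printf 'export DB_HOST="db.example.internal"\nexport DB_PORT="5432"\n' > "$demo_file"
set_local_db_host "$demo_file"
cat "$demo_file"
```

To apply it to the real file, call set_local_db_host secrets/env.sh before sourcing it.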

Now you are ready to use the project.

Usage

After installation, we can see what the tooling provides by running:

skit-calls -h

usage: skit-calls [-h] [-v] [--on-disk] {sample,select} ...

Skit.ai's calls library {'0.2.8'}. We provide means to sample calls and conversations
from a specified environment. Learn about this library at: https://github.com/skit-
ai/skit-calls

positional arguments:
  {sample,select}  Supported means to obtain calls datasets aggregated with their turns.
    sample         Random sample calls with a variety of call/turn filters.
    select         Select calls from known call-ids.

options:
  -h, --help       show this help message and exit
  -v, --verbose    Increase verbosity
  --on-disk        Each record is written directly to disk. Highly recommended for large
                   queries.

To get random samples:

❯ poetry run skit-calls sample -h
usage: skit-calls sample [-h] --lang LANG [--org-ids [ORG_IDS ...]] --start-date START_DATE
                         [--end-date END_DATE] [--timezone TIMEZONE]
                         [--call-quantity CALL_QUANTITY]
                         [--call-type {INBOUND,OUTBOUND,CALL_TEST}]
                         [--ignore-callers [IGNORE_CALLERS ...]] [--reported]
                         [--use-case USE_CASE] [--flow-name FLOW_NAME]
                         [--min-audio-duration MIN_AUDIO_DURATION]
                         [--asr-provider ASR_PROVIDER]

options:
  -h, --help            show this help message and exit
  --lang LANG           Search calls made in the given language.
  --org-ids ORG_IDS     The orgs for which you need the data.
  --start-date START_DATE
                        Search calls made after the given date (YYYY-MM-DD).
  --end-date END_DATE   Search calls made before the given date.
  --timezone TIMEZONE   The timezone to use for the start and end dates.
  --call-quantity CALL_QUANTITY
                        The number of calls to filter.
  --call-type {INBOUND,OUTBOUND,CALL_TEST}
                        The type of call to filter.
  --ignore-callers [IGNORE_CALLERS ...]
                        A comma separated list of callers to ignore.
  --reported            Search only reported calls.
  --use-case USE_CASE   Filter calls by use-case.
  --flow-name FLOW_NAME
                        Filter calls by flow-name.
  --min-audio-duration MIN_AUDIO_DURATION
                        Filter calls longer than given duration.
  --asr-provider ASR_PROVIDER
                        Filter calls served via a specific ASR provider.
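
As an illustration, a sampling run could be assembled as follows. The flags are taken from the help output above; every value (language code, dates, quantity) is a placeholder rather than a known-good input, and the --end-date format is assumed to match --start-date. The run helper is hypothetical and only prints the command while DRY_RUN=1, so nothing touches the database until you opt in.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, execute it otherwise.
DRY_RUN=1
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# --on-disk is a top-level option, so it goes before the sample subcommand.
sample_cmd="$(run poetry run skit-calls --on-disk sample \
  --lang en \
  --start-date 2022-01-01 \
  --end-date 2022-01-31 \
  --call-type INBOUND \
  --call-quantity 200)"

echo "$sample_cmd"
```

Set DRY_RUN=0 once the printed command looks right.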

If you already have specific call-ids in mind:

❯ poetry run skit-calls select -h
usage: skit-calls select [-h] (--call-ids CALL_IDS [CALL_IDS ...] | --csv CSV) [--org-ids [ORG_IDs ...]] [--use-fsm-url]
                         [--domain-url DOMAIN_URL] [--uuid-column UUID_COLUMN] [--history]

optional arguments:
  -h, --help            show this help message and exit
  --call-ids CALL_IDS [CALL_IDS ...]
                        The call-ids to select.
  --csv CSV             CSV file that contains the call-ids to select.
  --org-ids ORG_IDS     The orgs for which you need the data.
  --use-fsm-url         Whether to use turn audio url from fsm or s3 path.
  --domain-url DOMAIN_URL
                        The domain to use while forming public audio_urls
  --uuid-column UUID_COLUMN
                        The column name of the UUID column in the CSV file. Required if --csv is set.
  --history             Collect call history for each turn
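
The two ways of passing call-ids are mutually exclusive, as the usage line shows. The sketch below prints one command for each form. The call-ids, the file name calls.csv, and the column name call_uuid are invented for illustration, and the run helper is a hypothetical dry-run wrapper, not part of skit-calls.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, execute it otherwise.
DRY_RUN=1
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Form 1: pass the call-ids directly on the command line.
by_ids="$(run poetry run skit-calls select --call-ids 1001 1002 1003)"

# Form 2: read call-ids from a CSV; --uuid-column names the column holding them.
by_csv="$(run poetry run skit-calls select --csv calls.csv --uuid-column call_uuid --history)"

echo "$by_ids"
echo "$by_csv"
```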