In [1]:
)clear
⎕PP←4

In [2]:
]link.import # .

# Data Science in APL

**DISCLAIMER** This is a proof-of-concept. Use at your own risk. Send comments to jgl@dyalog.com

## `data` namespace

### Classes

#### `data.Series` class

An instance of the `data.Series` class contains a labelled 1D array.

- **`label`** label of the series.

- **`values`** values of the series. It must be a 1D array.

Values of the array can be accessed by bracket indexing (e.g. `s[2]`).

The **`loc`** property *locates* values.
It returns the indices (the *locations*) of the values given in brackets (e.g. `s.loc['A']`).
It can also be used to assign to the located values.

The **`frames`** property gives access to the instances of `data.Frame` which contain this series.

The monadic **`series`** method returns a new series with the same label and the right argument as values.
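
The members described above could be exercised roughly as follows. This is a minimal sketch, assuming the `data` namespace has been imported as in cell `In [2]` and that the series is built with the dyadic `data.series` function documented under "Functions and operators". The label `'Height'` and all values are made up for illustration, and no output is shown because it depends on the implementation.

```apl
s←'Height' data.series 170 182 165   ⍝ hypothetical label and values
s.label                              ⍝ label of the series
s.values                             ⍝ the underlying 1D array
s[2]                                 ⍝ bracket indexing into the values
s.loc[182]                           ⍝ location (index) of the value 182
s.loc[182]←180                       ⍝ assign to the located value
t←s.series 1.70 1.80 1.65            ⍝ new series: same label, new values
```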

#### `data.Frame` class

An instance of the `data.Frame` class contains a list of series, all of them containing arrays of the same length.

- **`series`** list of series in the frame.

The **`labels`** property gives access to the array of labels of the series, while **`values`** gives access to the values as a list of nested arrays.

The series can be accessed by rank-1 bracket indexing (e.g. `f[⊂'label']`). Rank-2 bracket indexing gives access to the values in the frame as a 2D array (e.g. `f[2 3;'col1' 'col2']←2 2⍴⍳4`).

The **`loc`** property *locates* values.
Rank-1 bracket indexing returns the indices of the corresponding columns (e.g. `f.loc['col1' 'col2']`).
Rank-2 bracket indexing locates values and returns the corresponding indices as a 2D array.
It can also be used to assign to the located values.

The monadic **`frame`** method returns a new frame with the same labels and the right argument as values.
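
The indexing forms described for `data.Frame` could be tried as sketched here. It assumes that the dyadic `data.frame` function (documented under "Functions and operators") accepts the values as a list of equal-length vectors, one per label; the labels `'col1'` and `'col2'` and the numbers are hypothetical, and no output is shown because it depends on the implementation.

```apl
f←'col1' 'col2' data.frame (1 2 3 4)(5 6 7 8)   ⍝ hypothetical labels and values
f.labels                       ⍝ labels of the series
f.values                       ⍝ values, one nested array per series
f[⊂'col1']                     ⍝ rank-1 indexing: the series labelled col1
f[2 3;'col1' 'col2']           ⍝ rank-2 indexing: rows 2 3 of both columns
f[2 3;'col1' 'col2']←2 2⍴⍳4    ⍝ assign a 2×2 array to those positions
f.loc['col1' 'col2']           ⍝ indices of the two columns
```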

### Functions and operators

#### `data.series` function

Create an instance or a list of instances of the `data.Series` class.

- `⍺ data.series ⍵` creates an instance of `data.Series` with label `⍺` and values `⍵`.
- `data.series ⍵` creates an instance of `data.Series` for each of the series in `⍵` and for each of the series contained in each frame in `⍵`. If `⍵` is a 2D array, each column must contain series with the same label, and their values are concatenated. If `⍵` is a string, an attempt is made to read it as CSV.

#### `data.frame` function

Create an instance of the `data.Frame` class.

- `⍺ data.frame ⍵` creates an instance of `data.Frame` with labels `⍺` and values `⍵`.
- `data.frame ⍵` creates an instance of `data.Frame` from the series returned by `data.series ⍵`.

#### `data.sort` operator

Sort data according to the left operand function.

- `⍺ (⍺⍺ data.sort) ⍵` returns the data in `⍵` (a frame or list of series) sorted according to the result of `⍺⍺ ⍺` (typically `⍺⍺` is `⍋` or `⍒`).
- `(⍺⍺ data.sort) ⍵` equivalent to `(⍺⍺ data.sort)⍨⍵`.
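
The idea behind `data.sort` can be illustrated with plain APL primitives: a grading function (`⍋` or `⍒`) is applied to the key, and the resulting permutation reorders the data. The sketch below does not use the `data` namespace; the arrays `names` and `ages` are made up, and `⎕IO←1` (the default after `)clear`) is assumed.

```apl
names←'Carol' 'Alice' 'Bob'   ⍝ data to reorder
ages←35 28 41                 ⍝ sort key
names[⍋ages]                  ⍝ ascending by age:  Alice Carol Bob
names[⍒ages]                  ⍝ descending by age: Bob Carol Alice
```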

#### `data.by` operator

Group data by the values in the right operand and apply the left operand function to each group.

- `⍺ (⍺⍺ data.by ⍵⍵) ⍵` groups the data in `⍵` (a frame or list of series) according to `⍵⍵` (also a frame or list of series) and applies `⍺⍺` to each group. A new frame is returned with the labels given in `⍺` (or `⍺.labels`). If `≢⍺` is less than the number of series, `⍺` must contain either a label for each of the additional series or a label for each of the series not in `⍵⍵`.
- `(⍺⍺ data.by ⍵⍵) ⍵` equivalent to `(⍺⍺ data.by ⍵⍵)⍨⍵`.
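
Grouping and applying a function per group is what the primitive key operator `⌸` does on plain arrays, so it gives a rough picture of the behaviour described for `data.by`. This sketch does not use the `data` namespace and is not necessarily how the operator is implemented; the arrays `major` and `accepted` are hypothetical.

```apl
major←'AABBB'                 ⍝ grouping key, one item per row
accepted←1 0 1 1 0            ⍝ 1 = accepted, 0 = rejected
⍝ One row per group: the key, the group size and the number accepted
major{⍺,(≢⍵),+/⍵}⌸accepted
```

With these inputs the result is a 2×3 matrix whose rows are `'A' 2 1` and `'B' 3 2`.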

#### `data.where` operator

Apply the left operand function to the data that fulfils the condition given as the right operand.

- `⍺ (⍺⍺ data.where ⍵⍵) ⍵` returns the data in `⍵` (a frame or list of series) after applying the function `⍺⍺` to the values which fulfill the condition `⍵⍵ ⍺`.
- `(⍺⍺ data.where ⍵⍵) ⍵` equivalent to `(⍺⍺ data.where ⍵⍵)⍨⍵`.
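
On plain arrays, a comparable "apply only where a condition holds" effect can be obtained with the primitive at operator `@`. The sketch below is an analogy under that assumption rather than the implementation of `data.where`; the arrays `age` and `price` are made up, and the condition is evaluated on one array while the function is applied to another, mirroring the roles of `⍺` and `⍵` above.

```apl
age←15 32 47 12
price←10 10 10 10
mask←age<18                   ⍝ the condition, evaluated on age
0.5∘×@(⍸mask)⊢price           ⍝ halve the prices where mask is 1: 5 10 10 5
```

The `⊢` separates the right operand of `@` from the argument, which would otherwise be stranded together into a single array.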

#### `data.join` operator

Merge two frames (or lists of series).

- `⍺ (⍺⍺ data.join ⍵⍵) ⍵` returns a frame with series labelled `⍺.labels ⍵⍵ ⍵.labels`. If a series on the left and a series on the right have the same label, their values are combined as `⍺.values ⍺⍺ ⍵.values`.
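
The two roles of the operands can be pictured on plain arrays: one function decides the resulting labels, the other combines the values of series that share a label. The sketch below uses `⊣` for the labels and `,` for the values, the same pair that appears in the example at the end of this document. It does not use the `data` namespace, the names `la`, `va`, `lb` and `vb` are hypothetical, and it assumes every left label also occurs on the right.

```apl
la←'Major' 'Applicants'  ⋄  va←'AB' (10 20)     ⍝ left labels and values
lb←'Applicants' 'Major'  ⋄  vb←(,30) (,'T')     ⍝ right labels and values, in another order
labels←la⊣lb                  ⍝ label function ⊣ keeps the left labels
values←va,¨vb[lb⍳labels]      ⍝ align right values by label, then append with ,
```

After these lines `labels` is `'Major' 'Applicants'` and `values` is `'ABT' (10 20 30)`, that is, the right-hand rows have been appended under the matching left-hand labels.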

## Example

In [3]:
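⍝ Load the CSV file into a frame
f←data.frame'berkeley.csv'
⍝ Drop the Year series and group by Major and Gender; per group, count the rows and the values starting with 'A'
a←'Applicants' 'Accepted'{(≢⍵),('A'+.=⊃¨)⍵}data.by'Major' 'Gender'⊢f[]~f[⊂'Year']
⍝ Append per-major totals (labelled with Gender 'T') and sort ascending
g←⍋data.sort a,data.join⊣(⊂'Gender')(+⌿,'T'⍨)data.by(⊂'Major')⊢a[]~a[⊂'Gender']
⍝ Append per-gender totals (labelled with Major 'Total') and sort ascending
m←⍋data.sort g,data.join⊣(⊂'Major')(+⌿,(⊂'Total')⍨)data.by(⊂'Gender')⊢g[]~g[⊂'Major']
⍝ Add the acceptance rate: 100 × Accepted ÷ Applicants
r←data.frame m,'%Accepted'data.series 100×÷/m[;'Accepted' 'Applicants']
⍝ Add the applicants as a percentage of the totals found in the rows whose Major starts with 'T'
data.frame r,'%Applicants'data.series (100×⊢÷≢⍴('T'=⊃¨r[;⊂'Major'])⌿⊢)⊢r[;⊂'Applicants']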