# `data` namespace

<b style="color:red;">THIS IS A PROOF-OF-CONCEPT</b>

The `data` namespace contains classes, functions and operators to facilitate the manipulation and analysis of columnar data.

## Classes

<div style="background-color:#ffffee;border:1px solid #ddddaa;margin:2em;padding:5px;">

### `data.Series` class

An instance of the `data.Series` class contains a labelled array.

Bracket indexing (property `iloc`) of the series gives access to the values of the array.
The **`values`** property is equivalent to `[]`.
The label, which can take any value,
can be accessed through the **`label`** property.
The keyed property **`loc`** returns the indices of the given values and allows assignment to those positions.
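
The properties above can be exercised as in the following minimal sketch. It assumes the `data` namespace is already loaded and behaves as described in this section; the label `'price'` and the values are made up for illustration, and no output is shown because it has not been checked against the implementation.

```apl
s←'price' data.series 10 20 30   ⍝ series with label 'price' and three values
s.label                          ⍝ the label
s.values                         ⍝ the whole values array, documented as equivalent to s[]
s[2]                             ⍝ bracket indexing of the values by position
s.loc[20]                        ⍝ indices at which the value 20 occurs
s.loc[20]←25                     ⍝ assign at those positions, per the description above
```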

</div>
<div style="background-color:#ffffee;border:1px solid #ddddaa;margin:2em;padding:5px;">

### `data.Frame` class

An instance of the `data.Frame` class contains a list of `data.Series` instances.
All the series must contain value arrays of the same length.

The series list can be accessed by bracket indexing (property **`loc`**) of rank 1 using the labels of the series as indices.
Bracket indexing of rank 2 gives access to the values in the series.
The properties **`series`**, **`labels`** and **`values`** are equivalent to `[]`, `[].label` and `[;]`.
The keyed property **`iloc`** is analogous to `loc` but takes positions instead of values as indices.

The property **`index`** is a list of the labels of the series used as the index. The values in those series are then used to index the frame when using `loc` (or `Sel`, or `Loc`). The index values can be read or changed through the property **`indices`**. The property **`columns`** gives access to the labels which are not in the index.

Frames are displayed with shading at row intervals of the size specified by the **`SHADE`** property
and up to a maximum number of lines specified by the **`MAXLINES`** property.
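
As a rough illustration of the indexing forms described above, the sketch below builds a small frame and reads it back. It assumes the `data` namespace is loaded and that `data.frame` accepts a list of column vectors as its right argument (the form produced by `⎕CSV` with `'Invert' 2`); the labels `'x'` and `'y'` are hypothetical.

```apl
df←'x' 'y' data.frame (1 2 3)(4 5 6)  ⍝ two series of equal length
df.labels                             ⍝ labels of the series
df[⊂'x']                              ⍝ rank-1 indexing: series selected by label
df[;⊂'x']                             ⍝ rank-2 indexing: values of the selected series
df.SHADE←0                            ⍝ set the shading interval (0 is the value used in the Iris example)
```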

#### Methods

- `⍺ df.Col ⍵` returns a frame with columns `df[⍵]` (or `⍵` itself if it is an array of frames and series) and index `df[⍺]` (default: none).
- `⍺ df.Sel ⍵` returns a frame with rows `⍵`, using `⍺` as the index (default: `df.index`); if `⍵` is an array of frames and series, it returns the selection `⍺`, using `df.index` as the index, and indexes the frame with it.
- `⍺ df.Loc ⍵` is equivalent to `df.(Sel∘Col)`.
    
</div>

## Functions and operators

<div style="background-color:#eeffee;border:1px solid #aaddaa;margin:2em;padding:5px;">

### `data.series` function

This function returns an instance or a list of instances of the `data.Series` class.

- `⍺ data.series ⍵` creates an instance of `data.Series` with label `⍺` and values `⍵`. If `⍺` is a series, the label is taken from it.
- `data.series ⍵` creates an instance of `data.Series` for each of the series in `⍵` and each of the series contained in each frame in `⍵`. If `⍵` is a rank 2 array, it must contain series with the same label in each column, and their values will be concatenated.

</div>
<div style="background-color:#eeffee;border:1px solid #aaddaa;margin:2em;padding:5px;">

### `data.frame` function

This function returns an instance of the `data.Frame` class.

- `⍺ data.frame ⍵` creates an instance of `data.Frame` with labels `⍺` (or the labels of the series list or frame `⍺`) and values `⍵`. If `⍵` is a string, it is taken as the name of a CSV file: the frame `⍺` is written to the file `⍵`, or the file `⍵` is read as CSV without a header and a frame with labels `⍺` is returned.
- `data.frame ⍵` creates an instance of `data.Frame` with each of the series returned by `data.series ⍵`. If `⍵` is a string, it reads the file `⍵` as CSV with a header and returns a frame.

</div>
<div style="background-color:#eeffee;border:1px solid #aaddaa;margin:2em;padding:5px;">

### `data.index` function

This function returns an instance of the `data.Frame` class with the given index.

- `⍺ data.index ⍵` creates an instance of `data.Frame` with index `⍺` (or the labels of the series list or frame `⍺`) and series `⍵`, which must be a list of series and/or frames.
- `data.index ⍵` creates an instance of `data.Frame` with no index.

</div>
<div style="background-color:#eeffee;border:1px solid #aaddaa;margin:2em;padding:5px;">

### `data.nan` function

- `⍺ data.nan ⍵` substitutes non-numeric values in `⍵` with `⍺`.
- `data.nan ⍵` returns an array with the shape of `⍵.values`, with ones for non-numeric elements and zeros for numeric ones.
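
The idea behind both forms can be sketched in plain Dyalog APL on a nested vector. This is not the implementation of `data.nan`, only an analogy; `IsNum` is a hypothetical helper, and treating "numeric" as "has an odd `⎕DR` code" is an assumption about what the library means by the term.

```apl
IsNum←{2|⎕DR ⍵}¨     ⍝ 1 for numeric items (numeric ⎕DR codes are odd, character and pointer codes even)
v←1 'a' 2.5 'NA' 3   ⍝ mixed vector of numbers and text
~IsNum v             ⍝ mask of non-numeric items: 0 1 0 1 0
0@(~IsNum)v          ⍝ substitute 0 at those positions: 1 0 2.5 0 3
```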

</div>

<div style="background-color:#ffeeff;border:1px solid #ddaadd;margin:2em;padding:5px;">

### `data.at` operator

This operator applies the left operand to the values specified by the right operand.

- `(⍺⍺ data.at ⍵⍵) ⍵` returns `⍵` (a frame or list of series) as a frame after applying `⍺⍺` to the series with labels `⍵⍵⊣⍵.labels`.
- `⍺ (⍺⍺ data.at ⍵⍵) ⍵` is equivalent to `(⍺∘⍺⍺ data.at ⍵⍵) ⍵`.

</div>
<div style="background-color:#ffeeff;border:1px solid #ddaadd;margin:2em;padding:5px;">

### `data.sort` operator

This operator sorts data according to the left function.

- `⍺ (⍺⍺ data.sort) ⍵` returns `⍵` (a frame, list of series, or array) sorted according to the result of `⍺⍺ ⍺` (where `⍺⍺` typically is one of `⍒⍋`).
- `(⍺⍺ data.sort) ⍵` is equivalent to `(⍺⍺ data.sort)⍨⍵`.
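
On plain vectors, the same pattern (grade the left argument, then reorder the right argument) can be sketched as follows. `SortBy` is a hypothetical operator written for this illustration and handles vectors only; it is not the library's `data.sort`.

```apl
SortBy←{⍵[⍺⍺ ⍺]}     ⍝ reorder ⍵ by the permutation that ⍺⍺ (⍋ or ⍒) gives for ⍺
keys←3 1 2
vals←'cab'
keys ⍋SortBy vals    ⍝ ascending by keys: abc
keys ⍒SortBy vals    ⍝ descending by keys: cba
⍋SortBy⍨keys         ⍝ monadic-style use via ⍨, sorting keys by themselves: 1 2 3
```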

</div>
<div style="background-color:#ffeeff;border:1px solid #ddaadd;margin:2em;padding:5px;">

### `data.by` operator

This operator groups data by the right operand and applies the left function.

- `⍺ (⍺⍺ data.by ⍵⍵) ⍵` returns the data in `⍵` (a frame or list of series) grouped according to `⍵⍵` (also a frame or list of series), with `⍺⍺` applied to each group. Labels (either all of them, the ones not in `⍵`, or the ones not in `⍵⍵`) are given in `⍺`, which can be a list of values, a list of series, or a frame. If `⍺⍺` and `⍵⍵` are both arrays, the `⍺` series are stacked, with labels in series `⍺⍺` and values in series `⍵⍵`.
- `(⍺⍺ data.by ⍵⍵) ⍵` is equivalent to `⍬ (⍺⍺ data.by ⍵⍵) ⍵`. If `⍺⍺` and `⍵⍵` are both series or labels, the values in `⍺⍺` are grouped for each value in `⍵⍵` and distributed in series.
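
The basic group-and-apply step resembles what the Key primitive `⌸` does on plain arrays. The sketch below is only that analogy in plain Dyalog APL, with made-up data; it says nothing about how `data.by` is implemented or how it handles labels.

```apl
keys←'aba'           ⍝ grouping values, one per data item
vals←1 2 3           ⍝ data to be grouped
keys{⍺,+/⍵}⌸vals     ⍝ one row per distinct key with the sum of its group
⍝ a 4
⍝ b 2
```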

</div>
<div style="background-color:#ffeeff;border:1px solid #ddaadd;margin:2em;padding:5px;">

### `data.join` operator

This operator merges two frames (or lists of series).

- `⍺ (⍺⍺ data.join ⍵⍵) ⍵` returns a frame with series labelled `⍺.labels ⍵⍵ ⍵.labels`. If two series at left and right have the same label, their values are combined as `⍺.values ⍺⍺ ⍵.values`.
- `(⍺⍺ data.join ⍵⍵) ⍵` returns a series with label `⍵⍵ ⍵.labels` and values `⍺⍺ ⍵.values`.

</div>

## Examples

```apl
)clear
⎕PP←3
] _←link.import # . ⍝ import data namespace from current directory
test.all
```

### Berkeley

```apl
f ←   data.frame'berkeley.csv'                                                              ⍝ load data file
a ←   data.('Applicants' 'Accepted'{(≢⍵),('A'+.=⊃¨)⍵}by'Major' 'Gender'⊢)f[]~f[⊂'Year']     ⍝ group
g ← a data.(⍋sort⊣,join⊣(⊂'Gender')(+⌿,'T'⍨)by(⊂'Major')⊢)a[]~a[⊂'Gender']                  ⍝ totals by gender
m ← g data.(⍋sort⊣,join⊣(⊂'Major')(+⌿,(⊂'Total')⍨)by(⊂'Gender')⊢)g[]~g[⊂'Major']            ⍝ totals by major
r ← m data.(frame⊣,'%Accepted'series⊢)100×÷/m[;'Accepted' 'Applicants']                     ⍝ accepted ratio
b ← r data.(frame⊣,'%Applicants'series⊢)(100×⊢÷≢⍴('T'=⊃¨r[;⊂'Major'])⌿⊢)⊢r[;⊂'Applicants']  ⍝ applicants ratio
b
```

### Iris

```apl
AVG←+⌿÷≢ ⋄ STD←(2*∘÷⍨+⌿÷¯1+≢)2*⍨⊢-⍤1+⌿÷≢     ⍝ average and standard deviation
PCT←{((2÷⍨+/)⊢⌷⍨∘⊂⍋⌷⍨∘⊂∘⌈100÷⍨⍺×0 1+≢)⍵}     ⍝ percentile-⍺
PCC←+.×⍥((⊢÷2*∘÷⍨+.×⍨)⊢-+⌿÷≢)                ⍝ Pearson correlation coefficient
f←'sl' 'sw' 'pl' 'pw' 'class'data.frame'iris.csv'                              ⍝ load data file
aggs←{l f←⍺⍵ ⋄ ⍪,¨⍺⍺{(⍺⍺ ⍵)⍵⍵data.by(f[l])⊢f[⊂⍵]}⍵⍵¨f.labels~l}                ⍝ aggregation operator
s←(⊂'class'){'⌊AS⌈',¨⊂⍵}aggs(⌊⌿,AVG,STD,⌈⌿)f                                   ⍝ statistical summary
p←(⊂'class')('25' '50' '75',⍨¨⊂)aggs(,25 50 75∘.PCT↓∘⍉)f                       ⍝ percentiles
s←s{data.frame⍺.series,1↓⍵.series}¨¨p ⋄ s.SHADE←0                              ⍝ summary for each variable
pcc←{∘.PCC⍨↓⍉⍵}data.by(⊂'class')⊢f[]                                           ⍝ Pearson's correlation coeff
pcc.values⍪←(⊂'Class'),({(∪⍵)⍳⍵}f[;⊂'class'])∘PCC¨f[f.labels~⊂'class'].values  ⍝ PCC for the class
⍕(⍪⊃(⊣,(⊂''),⊢)⌿s)pcc
```

### Google

```apl
g←'date' 'n'data.frame⎕CSV⍠'Invert'2⊢(3↓⊃⎕NGET'google-scotch.csv'1)'N'4  ⍝ read data file
d←{⍲/(∧/∊∘(⎕D,'-'))¨⍵:⎕SIGNAL 11 ⋄ ↑'-'(⍎¨≠⊆⊢)¨⍵}g[;⊂'date']             ⍝ convert date strings to y m d
tt←'n'data.by'month'⊢t←'year' 'month' 'n'+⌿data.by(d[;1 2])⊢g[⊂'n']      ⍝ group by month
tv←(tt[;,⊂'year']⍪⊂'total'),(⊢⍪(+⌿+/¨))(⊢,(+/+/¨))tt[;1↓tt.labels]       ⍝ add total values
tt←(tt.labels,⊂'total')data.frame tv                                     ⍝ data frame with totals
c←'n'data.by'month'⊢t data.frame t[;t.labels~'n'],(⊂⍬),¯2-/t[;'n']       ⍝ calculate change
tt←tt data.frame tt[;]⍪(⊂'')⍪c[;],(⊂⍬)⍪¯2-⌿¯1↓tt[;,⊂'total']             ⍝ data frame with totals and change
tt ⋄ ⍬ ⋄ {'min' 'avg' 'max' 'total'(⌊⌿,(+⌿÷≢),⌈⌿,+⌿)data.by(t[⊂⍵])⊢t[⊂'n']}¨'year' 'month'
```

    jgl@dyalog.com