In [1]:
)clear

In [2]:
T←{⎕PP←3 ⋄ ⍕⍺⍪⍵}  ⍝ print header and data as table

# Berkeley

In 1973, admission data for UC Berkeley's graduate schools appeared to show a clear gender bias: a larger percentage of male applicants was admitted. However, when the admission rates are analysed major by major, this bias is no longer observed. This is a typical example of [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox). To understand it, we must calculate the admission rates by gender, both for each major and in total, as well as the percentage of applicants of each gender who apply to each major.
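
The reversal is easy to reproduce with made-up numbers. The sketch below uses hypothetical counts (not taken from the Berkeley data) for two majors, X and Y; the names `adm` and `app` are illustrative only.

```apl
⍝ hypothetical counts; rows: majors X Y, columns: men women
adm←2 2⍴80 9 2 30          ⍝ admitted
app←2 2⍴100 10 10 100      ⍝ applicants
100×adm÷app                ⍝ % admitted per major: 80 90 in X, 20 30 in Y
100×(+⌿adm)÷+⌿app          ⍝ % admitted overall: men ahead of women
```

Women have the higher admission rate in each major, yet the lower rate overall (about 35% against 75%), because most of them apply to the more selective major.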

In [3]:
d h←⎕CSV'berkeley.csv' ''1 1                      ⍝ read data
a←{⍵[⍋⍵;]}d[;2 3]{⍺,(+/('A'=⊃)¨⍵),≢⍵}⌸d[;4]       ⍝ group admitted and applicants
g←a⍪(⊂'Total'),a[;2]{⍺,+⌿⍵}⌸a[;3 4]               ⍝ totals by gender
m←{⍵[⍋⍵;]}g⍪g[;1]{⍺,'T',+⌿⍵}⌸g[;3 4]              ⍝ totals by major
ar←m,100×m[;3]÷m[;4]                              ⍝ admission rates (%)
mr←ar,100×ar[;4]÷(≢ar)⍴¯3↑ar[;4]                  ⍝ % of each gender's applicants per major
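
The grouping in this cell relies on the key operator `⌸`, which applies its operand once per distinct key, passing the key as `⍺` and the matching rows of the data as `⍵`. A minimal sketch on hypothetical data (the names `gender` and `counts` are illustrative and not part of the notebook):

```apl
gender←'FMFM'                        ⍝ hypothetical keys, one per row
counts←4 2⍴9 10 80 100 30 100 2 10   ⍝ columns: admitted, applicants
gender{⍺,+⌿⍵}⌸counts                 ⍝ one row per key: 'F' 39 110 and 'M' 82 110
```

Keys appear in the result in order of first occurrence, which is why the notebook sorts the grouped rows afterwards.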

In [4]:
h⍪(⊂'...')⍪⍨25↑d

In [5]:
(ha←h[2 3],'Admitted' 'Applicants')T a

In [6]:
ha T g

In [7]:
ha T m

In [8]:
(har←ha,⊂'%Admitted')T ar

In [9]:
(hmr←har,⊂'%Applicants')T mr

**TODO** bar charts. eg https://rpubs.com/dawnwp/1081716

# Iris

The [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) contains measurements of four features (sepal length, sepal width, petal length and petal width) for three classes of Iris flowers. It is frequently used as an example for data classification techniques.

In [10]:
⍝ statistical functions (from APLcart)
AVG←+⌿÷≢ ⋄ STD←(2*∘÷⍨+⌿÷¯1+≢)2*⍨⊢-⍤1+⌿÷≢     ⍝ average and standard deviation
PCT←{((2÷⍨+/)⊢⌷⍨∘⊂⍋⌷⍨∘⊂∘⌈100÷⍨⍺×0 1+≢)⍵}     ⍝ percentile-⍺
PCC←+.×⍥((⊢÷2*∘÷⍨+.×⍨)⊢-+⌿÷≢)                ⍝ Pearson correlation coefficient
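
A quick sanity check of these functions on a small hypothetical sample may help when reading the tacit code. The definitions are repeated so that the snippet runs on its own, and `x` is an illustrative name.

```apl
⍝ definitions copied from the cell above
AVG←+⌿÷≢ ⋄ STD←(2*∘÷⍨+⌿÷¯1+≢)2*⍨⊢-⍤1+⌿÷≢
PCC←+.×⍥((⊢÷2*∘÷⍨+.×⍨)⊢-+⌿÷≢)
x←2 4 4 4 5 5 7 9      ⍝ hypothetical sample
AVG x                  ⍝ 5
STD x                  ⍝ (32÷7)*÷2, about 2.14
x PCC 3+2×x            ⍝ exact linear relation, so this should be 1 up to rounding
```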

In [11]:
_A←{(⍵[;1],∘⍺⍺⌸1↓[2]⍵)⍪(⊂'Total'),⍺⍺1↓[2]⍵}  ⍝ aggregate with total

In [12]:
d←¯1⌽⎕CSV'iris.csv' ''4                      ⍝ read data
s←(⌊⌿,AVG,STD,⌈⌿)_A d                        ⍝ statistical summary
p←{,25 50 75∘.PCT↓⍉⍵}_A d                    ⍝ percentiles
c←⊂⍤2{∘.PCC⍨↓⍉⍵}_A d                         ⍝ correlation matrices
c,←⊂(⊂'Class'),(⊂⍳⍨d[;1])PCC¨↓⍉1↓[2]d        ⍝ class correlation

In [13]:
25↑d

In [14]:
((⊂'class'),,'⌊AS⌈'∘.,'sl' 'sw' 'pl' 'pw')T s
((⊂'class'),,'25' '50' '75'∘.,⍨'sl' 'sw' 'pl' 'pw')T p
('class' 'sl' 'sw' 'pl' 'pw')T⊃⍪/c

**TODO** box plots, bar charts, scatter-plot. eg http://www.lac.inpe.br/~rafael.santos/Docs/CAP394/WholeStory-Iris.html

### Alternative to tacit statistical functions

In [15]:
:Namespace stats                      ⍝ statistical functions namespace
    AVG←{(+⌿⍵)÷≢⍵}                    ⍝   average
    STD←{(÷2)*⍨(+.×⍨⍵-AVG⍵)÷(≢⍵)-1}   ⍝   standard deviation (of the sample)
     SS←{d←⍵-AVG⍵ ⋄ d÷(÷2)*⍨d+.×d}    ⍝   standard score (centred, unit norm)
    PCC←{⍺+.×⍥SS⍵}                    ⍝   Pearson's correlation coefficient
    PCT←{                             ⍝   percentile
        i←⌈(⍺÷100)×0 1+≢⍵             ⍝     indices of two nearest values
        v←(⊂(⊂i)⌷⍋⍵)⌷⍵                ⍝     two nearest values
        (+/v)÷2                       ⍝     return average
    }
:EndNamespace
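
For reference, the textbook quantities that `AVG`, `STD` and `PCC` are meant to compute for samples $x_1,\dots,x_n$ and $y_1,\dots,y_n$ are:

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad
r_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}
$$

Judging from the code, `PCT` does not interpolate: for percentile $p$ it averages the sorted values at positions $\lceil pn/100 \rceil$ and $\lceil p(n+1)/100 \rceil$.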

# Google

A [time series](https://en.wikipedia.org/wiki/Time_series) is a series of data points indexed by time. As an example, we analyse the relative search interest for "scotch" over the last 5 years, as reported by Google Trends: https://trends.google.com/trends/explore?date=today%205-y&q=scotch

In [16]:
(ds n)←⎕CSV⍠'Invert'2⊢(3↓⊃⎕NGET'google-scotch.csv'1)'N'4   ⍝ read data
d←{~∧/(∧/∊∘(⎕D,'-'))¨⍵:⎕SIGNAL 11 ⋄ ↑'-'(⍎¨≠⊆⊢)¨⍵}ds       ⍝ validate and parse dates
t←d[;1 2],∘(+/)⌸n                                          ⍝ group
s←{⍵[⍋⍵;]}¨{t[;⍵],∘(⌊⌿,(+⌿÷≢),⌈⌿,+⌿)⌸t[;3]}¨⍳2             ⍝ summary
c←{¯2-/t[;⍵]+/⍤⊢⌸t[;3]}¨⍳¨⍳2                               ⍝ change
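
The change is computed with N-wise reduction: a left argument to `f/` sets the window length, and a negative length reverses each window before reducing. The same primitive gives a simple moving average. A minimal sketch on hypothetical totals (`v` is an illustrative name, not the notebook's data):

```apl
v←10 13 10 19          ⍝ hypothetical monthly totals
¯2-/v                  ⍝ each value minus the previous one: 3 ¯3 9
(3+/v)÷3               ⍝ 3-month moving average: 11 14
```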

In [17]:
(⍕∘⍪30∘↑)¨ds n

In [18]:
30↑d

In [19]:
'year' 'month' 'n'T t

In [20]:
F←{(⍳12),⍉6 12⍴(⍺⍴'·'),⍵,(12-⍺)⍴'·'} ⋄ ⍕(((⊂'Total'),2018+⍳6)T 7 F t[;3])(((⊂'Change'),2018+⍳6)T 8 F⊃⌽c)
l←'min' 'avg' 'max' 'total' ⋄ ⍕(((⊂'month'),l)T⊃⌽s)(((⊂'year'),l,⊂'change')T(⊃s),'·',⊃c)

**TODO** run chart with moving average. eg https://www.geeksforgeeks.org/how-to-make-a-time-series-plot-with-rolling-average-in-python/