Clustering *lot area* and *sale price* in the [ames](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) dataset

In [1]:
]get ../ames.csv

In [2]:
I ← {(⊂⍺)⌷⍵}
∆ ← {(⊂⍋⍵)⌷⍵}

In [3]:
]box on
⊢/⍴ames
(1⌷ames)⍳⊆'LotArea' 'SalePrice'

In [4]:
⍴feat←ames[;(1⌷ames)⍳⊆'LotArea' 'SalePrice']
3↑feat

In [5]:
(⌈⌿,[0.5]⌊⌿)1↓feat

In [6]:
c←6(?∘≢I⊢)1↓feat
c

In [7]:
⍴d←0.5*⍨+/2*⍨(1↓feat)(-⍤1⍤1 2)c
5↑d

In [8]:
⍴g←d⍳⍤1 0⌊/d
10↑g

In [9]:
k←⌊/d
{(+⌿⍵×[1]⍺=k)÷+/⍺=k}¨⍳⍺ 

SYNTAX ERROR
      {(+⌿⍵×[1]⍺=k)÷+/⍺=k}¨⍳⍺
                            ∧


In [16]:
∇ KMeans←{
⍝ ⍺: (n max)←number of clusters, maximum iterations :: scalar integer
⍝ ⍵: data set          :: numeric matrix 1 column per field
  (n max)←⍺
  End←{
    ⍺≡⍵:1  ⍝ converged
    i=max⊣i+←1   ⍝ maximum iterations reached
  }
  ComputeCentroids←{
    ⍝c←n(?∘≢I⊢)⍺               ⍝ choose random centroids    
    d←0.5*⍨+/2*⍨⍺(-⍤1⍤1 2)⍵   ⍝ distance from points to centroid
    g←d⍳⍤1 0⌊/d                 ⍝ cluster (group) for each data point
    g{(+⌿÷1⌈≢)⍵}⌸⍺           ⍝ new centroids are means of points in each group
  }
  i←0
  c←n(?∘≢I⊢)⍵
  ⍵ ComputeCentroids⍣End⊢c
}
∇

In [17]:
⍴r←(6 100) KMeans 1↓feat

In [18]:
r

In [22]:
d←1↓feat
d←0.5*⍨+/2*⍨d(-⍤1⍤1 2)r
g←d⍳⍤1 0⌊/d

In [30]:
(g{+⌿,⍵}⌸d)÷(g{≢⍵}⌸d)

In [31]:
]runtime -c "g{(+⌿÷1⌈≢)⍵}⌸d" "g{(+⌿⍵)÷1⌈≢⍵}⌸d" "(g{+⌿,⍵}⌸d)÷(g{≢⍵}⌸d)"

There are several additional parameters users often want to control in k-means clustering:

1. Convergence Threshold (ε):
   - Instead of checking for exact equality of centroids (`⍺≡⍵`), users often want to specify a small threshold
   - Centroids would be considered converged if their movement is less than this threshold
   - e.g., if `∥new_centroids - old_centroids∥ < ε`

2. Distance Metric:
   - Currently we're using Euclidean distance
   - Users might want Manhattan distance (L1 norm)
   - Or other measures such as cosine distance
   - This could be passed as a function operand

3. Initialization Method:
   - Currently using random selection from data points
   - Users might want k-means++ initialization 
   - Or their own specific starting centroids

4. Data Scaling/Normalization Method:
   - Since we're dealing with different scales (Lot Area vs Price)
   - Options like:
     - Min-max scaling
     - Z-score standardization
     - No scaling
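
As a rough sketch of items 1 and 2, the metric and the convergence test could each be an ordinary function. The names `Euclidean`, `Manhattan` and `Close`, and the value of `tol`, are assumptions for illustration; none of them appear in the notebook's `KMeans`.

```apl
⍝ Hypothetical helpers; ⍺: points (one per row), ⍵: centroids (one per row)
⍝ Each returns a points-by-centroids matrix of distances
Euclidean←{0.5*⍨+/2*⍨⍺(-⍤1⍤1 2)⍵}   ⍝ same expression as in ComputeCentroids
Manhattan←{+/|⍺(-⍤1⍤1 2)⍵}           ⍝ L1 norm: sum of absolute differences

tol←1E¯6                              ⍝ assumed threshold, pick to suit the data
Close←{tol>⌈/,|⍺-⍵}                   ⍝ 1 if no coordinate moved by tol or more

,(2 2⍴0 0 3 4)Euclidean 1 2⍴0 0      ⍝ 0 5
,(2 2⍴0 0 3 4)Manhattan 1 2⍴0 0      ⍝ 0 7
```

`Close` could stand in for the `⍺≡⍵` test in `End`. It compares the largest coordinate change rather than a norm, and it assumes both centroid matrices have the same shape, which would not hold if a cluster lost all of its points.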


In [None]:
∇ KMeans←{
⍝ ⍺: (n max)←number of clusters, maximum iterations :: scalar integer
⍝ ⍵: data set          :: numeric matrix 1 column per field
  (n max)←⍺
  
  End←{
    ⍺≡⍵:1  ⍝ converged
    i=max⊣i+←1   ⍝ maximum iterations reached
  }
  ComputeCentroids←{
    ⍝c←n(?∘≢I⊢)⍺               ⍝ choose random centroids    
    d←0.5*⍨+/2*⍨⍺(-⍤1⍤1 2)⍵   ⍝ distance from points to centroid
    g←d⍳⍤1 0⌊/d                 ⍝ cluster (group) for each data point
    g{(+⌿÷1⌈≢)⍵}⌸⍺           ⍝ new centroids are means of points in each group
    ⍝ faster to (+⌿÷1⌈≢)¨g{⊂⍵}⌸⍺
    ⍝           (g{+⌿⍵}⌸⍺)÷(g{≢⍵}⌸⍺)
  }
  i←0
  c←n(?∘≢I⊢)⍵
  ⍵ ComputeCentroids⍣End⊢c
}
∇

The Big Questions are:
- what is "the APL way" of doing these things?
- can users take guides or existing code and boil down the necessary pieces for their use case, rather than relying on abstraction?
- are there guidelines for creating TamStat-like APIs that feel nice to use and mesh well together for data science?
- if we are providing Scikit/SciPy/Pandas-like abstractions, is there a benefit to doing so in APL?
- in known data analysis and scientific domains, can we discover and demonstrate value in the APL way?


I think users should do normalisation themselves, or it could be provided as helper functions.

```apl
(n max) KMeans data   ⍝ default
(n max) (KMeans Init) data ⍝ custom initialisation
(n max) (Metric KMeans Init) data
(n max) (KMeans WithDistance 'manhattan') 'zscore' Scale data
(n max)(KMeans _WithMetric_ Metric _WithEndCondition_ End _Init_ Init) ScaleByZScore data
```
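
The scaling step named in the last two lines above is not defined anywhere in the notebook. A minimal sketch of what such helpers might look like follows; `Mean`, `ZScore` and `MinMax` are hypothetical names, and the z-score uses the population standard deviation, which is my assumption.

```apl
⍝ Hypothetical scaling helpers; ⍵: numeric matrix, one column per field
⍝ Both assume no column is constant (otherwise the divisor is 0)
Mean←+⌿÷≢                                     ⍝ column means
ZScore←{c←⍵-⍤1⊢Mean ⍵ ⋄ c÷⍤1⊢0.5*⍨Mean c×c}   ⍝ (x-mean)÷population std dev
MinMax←{lo←⌊⌿⍵ ⋄ (⍵-⍤1⊢lo)÷⍤1⊢(⌈⌿⍵)-lo}       ⍝ map each column onto 0..1

MinMax 3 2⍴1 10 2 20 3 30                     ⍝ rows: 0 0, 0.5 0.5, 1 1
```

With these, the "users do it themselves" option is just function application on the right argument, e.g. `(6 100) KMeans ZScore 1↓feat`.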

Can the modifications be expressed as a functional processing pipeline, as operators, or as a combination of the two?

```apl
(n max) KMeans data
(n max) KMeans ZScale data
(n max) (Metric _KMeans_ InitFn) data
```
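
One way the operator form in the last line might be realised is sketched below, reusing the body of the `KMeans` cell with the metric as left operand and the initialisation as right operand. `_KMeans_` and `RandomRows` are hypothetical names, and this has not been run against the ames data.

```apl
⍝ Sketch only: _KMeans_ and RandomRows are hypothetical names
I←{(⊂⍺)⌷⍵}                          ⍝ as defined earlier in the notebook
Euclidean←{0.5*⍨+/2*⍨⍺(-⍤1⍤1 2)⍵}
RandomRows←?∘≢I⊢                    ⍝ ⍺ distinct random rows of ⍵
_KMeans_←{
  ⍝ ⍺⍺: distance function   ⍵⍵: initialisation function
  ⍝ ⍺: (n max)              ⍵: data, one column per field
  (n max)←⍺ ⋄ data←⍵ ⋄ Dist←⍺⍺ ⋄ i←0
  End←{⍺≡⍵:1 ⋄ i=max⊣i+←1}          ⍝ converged, or iteration limit reached
  Step←{d←data Dist ⍵ ⋄ g←d⍳⍤1 0⌊/d ⋄ g{(+⌿÷1⌈≢)⍵}⌸data}
  Step⍣End⊢n ⍵⍵ data
}
```

A call would then read `(6 100)(Euclidean _KMeans_ RandomRows)1↓feat`, and swapping the metric or the initialisation means swapping an operand rather than adding a parameter.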

"It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures." (Alan Perlis)

Using operators as modifiers is something I tried with APLPhys, and perhaps I should revisit it. It seems like the conventions are either some kind of big ol' set of parameters (named parameters in Python, namespaces in APL), some kind of scripting language (LAMMPS), or some kind of object-oriented model?

With namespaces in Dyalog, we could provide both parameters and functions:

```apl
opts ← (
    n_clusters: 6
    max_iter: 100
    End: {(⍺≡⍵)∨i≥max_iter⊣i+←1}
    Distance: {0.5*⍨+/2*⍨⍺(-⍤1⍤1 2)⍵}
)
opts KMeans data
```
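
The same idea can be sketched with `⎕NS` and dotted assignment, which does not depend on the literal syntax above. The option names are copied from that block; a `KMeans` that accepts `opts` is not written here and would have to read these names itself.

```apl
⍝ Sketch: the same options built with ⎕NS rather than a literal
opts←⎕NS''                           ⍝ empty namespace
opts.n_clusters←6
opts.max_iter←100
opts.Distance←{+/|⍺(-⍤1⍤1 2)⍵}      ⍝ e.g. Manhattan distance

opts.(n_clusters max_iter)           ⍝ 6 100
,(2 2⍴0 0 3 4)opts.Distance 1 2⍴0 0 ⍝ 0 7
```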

How can we attempt to answer the Big Questions?

We can attempt to recreate popular analyses from Pandas/SciPy/Scikit-learn/Julia examples (and from our own personal interests). We can then push the code developed for these analyses further by working through advanced examples from academic papers. Finally, we can reach out to domain experts at conferences, present what we have created and get feedback (and hopefully interest).

There are various approaches we could attempt with this:
1. recreate similar abstractions (e.g. options namespace)
2. create our own unique abstractions
3. attempt to solve the same variety of problems with direct expression

(1) and (2) are attempts at creating a DSL for data science in APL; if it is well designed, it could offer a path for others to build on. (3) is more like creating a framework and a set of conventions, and IMHO is the more interesting investigation. My intuition is that (3) is a way to explore the unique aspects of APL in relation to these problems, whereas (1) and (2) are attempts to meet existing expectations, and to me they generally raise the question "so why do this in APL?"


Future work could explore the [Kaggle advanced regression techniques task](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/).