# Dleto on Toy data

CC-BY Peter Brooksbank, Martin Kassabov, and James B. Wilson

This notebook explores a data set about toys, specifically video game sales data.  This data is available from VGChartz, all rights reserved.

First we need to load Julia and perhaps some necessary packages.

In [17]:
## Uncomment if you do not have IJulia installed.
## It only needs to be installed once; you can re-comment after that.
# using Pkg
# Pkg.add("IJulia")
# This installs Julia's Jupyter kernel without Python dependencies
# println("IJulia installed! Restart VS Code and select Julia kernel.")

# Ensure Julia kernel is properly recognized  
# This notebook requires Julia kernel for execution and export
using IJulia
println("Julia kernel is active!")
println("Julia version: ", VERSION)

include("../../Dleto.jl") 

Julia kernel is active!
Julia version: 1.11.6
Dleto.jl loaded successfully.


## Load some toy data.

We are using a Comma Separated Value (CSV) file of toy data.  We load this with Julia's DataFrames (Julia's version of R's data frame, similar to Python's pandas) and print a few values.  This may require you to install a couple of packages; the cell below does so with `Pkg.add`.  Once those are installed you can remove those steps or comment them out.

In [18]:

## Install the required packages if you do not have them yet.
## This only needs to run once; you can comment these lines out after that.
import Pkg;
Pkg.add("CSV")
Pkg.add("DataFrames")

using CSV, DataFrames

# Load the CSV file
df = CSV.read("Video_Games.csv", DataFrame)
println("Loaded Video_Games.csv with ", nrow(df), " rows")

# Inspect the structure
println("Columns: ", names(df))
println(first(df, 5))


[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`


Loaded Video_Games.csv with 16719 rows
Columns: ["Name", "Platform", "Year_of_Release", "Genre", "Publisher", "NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales", "Global_Sales", "Critic_Score", "Critic_Count", "User_Score", "User_Count", "Developer", "Rating"]
5×16 DataFrame
 Row │ Name                      Platform  Year_of_Release  Genre         Publisher  NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  Critic_Score  Critic_Count  User_Score  User_Count  Developer  Rating
     │ String?                   String7   String7          String15?     String     Float64   Float64   Float64   Float64      Float64       Int64?        Int64?        String3?    Int64?      String?    String7?

## Creating a data tensor

This is a good place to demonstrate creating a tensor from a data set.  
 * We treat every row of the CSV/DataFrame as contributing to an entry in the tensor.  
 * The axes of the tensor are individual columns.  We will demonstrate using "Platform", "Genre", and "Critic_Score".
 * For columns that are categories, the corresponding axis will have one basis vector for each category.  For example, in Platform we have Wii and NES. These could be mapped to $e_1$ and $e_2$.   Users familiar with one-hot encoding will recognize this strategy, only now we apply it to each axis individually.
 * For numeric columns we take the actual values, or push them into ranges of scores that serve as the units.  For example, if scores are out of 100 we might take units of 10, similar to A, B, C, D, F grading. 
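
The encoding described in the list above can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's own code; the helper names `axis_index` and `score_bin` and the bin width of 10 are assumptions chosen for the example.

```julia
# Toy rows standing in for (Platform, Genre, Critic_Score); the values are made up.
rows = [("Wii", "Sports", 76), ("NES", "Platform", 88), ("Wii", "Racing", 82)]

# Hypothetical helper: map each distinct category to a position along its axis,
# so the first category plays the role of e_1, the second of e_2, and so on.
axis_index(values) = Dict(v => i for (i, v) in enumerate(unique(values)))

# Hypothetical helper: push a numeric score into bins of a given width,
# e.g. width 10 sends scores 70-79 to bin 8 (bins are 1-based).
score_bin(score; width = 10) = fld(score, width) + 1

platform_axis = axis_index(first.(rows))
genre_axis = axis_index(getindex.(rows, 2))

# Each row becomes a triple of indices, one per tensor axis.
for (platform, genre, score) in rows
    println((platform_axis[platform], genre_axis[genre], score_bin(score)))
end
```

Binning trades resolution for a shorter axis; the cells that follow keep every distinct critic score as its own coordinate instead.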

In [19]:
# Get unique values for each dimension
platforms = unique(skipmissing(df.Platform))
genres = unique(skipmissing(df.Genre))
scores = unique(skipmissing(df.Critic_Score))


println("Unique Platforms: ", length(platforms))
println("Unique Genres: ", length(genres))
println("Unique Critic Scores: ", length(scores))


Unique Platforms: 31
Unique Genres: 12
Unique Critic Scores: 82


Now we build up the tensor with entries being total sales.  Since each row of the data frame contributes to the data, several rows may share the same platform, genre, and score; their sales are then added together to give the total.  Our data is pretty coarse, so we would not need full 64-bit floating point precision.

In [20]:
# Use Critic_Score for the third dimension
critic_scores = unique(skipmissing(df.Critic_Score))
# Filter out "tbd" or non-numeric scores
critic_scores = filter(x -> tryparse(Float64, string(x)) !== nothing, critic_scores)
critic_scores = sort([parse(Float64, string(x)) for x in critic_scores])

# Initialize the tensor once the axis sizes are known
# (Float64 for simplicity; a narrower type would also hold this coarse data)
t = zeros(Float64, length(platforms), length(genres), length(critic_scores))
# Fill tensor with aggregated sales data
for row in eachrow(df)
    if !ismissing(row.Platform) && !ismissing(row.Genre) && !ismissing(row.Critic_Score)
        # Parse critic score
        score_str = string(row.Critic_Score)
        parsed_score = tryparse(Float64, score_str)
        
        if parsed_score !== nothing
            p_idx = findfirst(==(row.Platform), platforms)
            g_idx = findfirst(==(row.Genre), genres)
            s_idx = findfirst(==(parsed_score), critic_scores)
            
            if !isnothing(p_idx) && !isnothing(g_idx) && !isnothing(s_idx)
                # Aggregate global sales
                sales = get(row, :Global_Sales, 0.0)
                t[p_idx, g_idx, s_idx] += ismissing(sales) ? 0.0 : sales
            end
        end
    end
end

println("Created tensor with dimensions: ", size(t))
println("(", length(platforms), " platforms × ", length(genres), " genres × ", length(critic_scores), " critic scores)")

# plotTensor(t)

Created tensor with dimensions: (31, 12, 82)
(31 platforms × 12 genres × 82 critic scores)
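
The loop above finds each index with `findfirst`, which scans the label vector again for every row. One possible alternative, sketched here on made-up records, is to precompute a dictionary from label to index for each axis. The names `records`, `lookup`, and `tensor_toy` are assumptions for illustration only.

```julia
# Made-up records: (platform, genre, critic score, sales in millions).
records = [("Wii", "Sports", 76.0, 82.5), ("DS", "Puzzle", 77.0, 3.0),
           ("Wii", "Sports", 76.0, 1.5)]

platforms_toy = unique(first.(records))
genres_toy = unique(getindex.(records, 2))
scores_toy = sort(unique(getindex.(records, 3)))

# One dictionary per axis gives constant-time lookups from label to index.
lookup(labels) = Dict(label => i for (i, label) in enumerate(labels))
p_of, g_of, s_of = lookup(platforms_toy), lookup(genres_toy), lookup(scores_toy)

tensor_toy = zeros(Float64, length(platforms_toy), length(genres_toy), length(scores_toy))
for (p, g, s, sales) in records
    # Rows sharing the same (platform, genre, score) accumulate into one entry.
    tensor_toy[p_of[p], g_of[g], s_of[s]] += sales
end

println(size(tensor_toy))
println(tensor_toy[1, 1, 1])
```

For a data set of this size the difference is small, but the dictionary version avoids a linear scan per row as the number of categories grows.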


### Visualization tools

This is a good place to look at the tensor we created.  You will need to have PlotlyJS active in your notebook; uncomment the install command if you do not.

In [21]:
# import Pkg; Pkg.add("PlotlyJS") # Uncomment if PlotlyJS is not installed
plotTensor(t; xlabel="Platforms", ylabel="Genres", zlabel="Critic Scores", 
    title="Video Game Sales Tensor (Platforms × Genres × Critic Scores)")

Plotting 4089 points...


We see here that not all the platforms are actually being used.  There is a lot of missing data.  There are tensor tools to detect such structure but it would be far more efficient to first go through some preliminary data processing to remove obvious issues.  In this case having built the tensor we will simply drop the platforms beyond 23.

In [22]:

# Safely slice the tensor based on actual dimensions
dim1, dim2, dim3 = size(t)
t_trimmed = t[1:min(23, dim1), :, :]
plotTensor(t_trimmed; xlabel="Platforms", ylabel="Genres", zlabel="Critic Scores", 
    title="Sliced Video Game Sales Tensor (Platforms × Genres × Critic Scores)")

Plotting 4089 points...
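
The cutoff of 23 above was read off the plot. As a rough sketch of how empty slices could be found programmatically instead, the example below scans a made-up tensor for slices along the first axis that contain no nonzero entry. The names `toy`, `used`, and `toy_used` are hypothetical.

```julia
# Toy 4×2×3 tensor in which slices 2 and 4 along the first axis are empty.
toy = zeros(4, 2, 3)
toy[1, 1, 2] = 1.5
toy[3, 2, 1] = 0.7

# Indices along axis 1 whose slice holds at least one nonzero entry.
used = [i for i in axes(toy, 1) if any(!iszero, @view toy[i, :, :])]

# Keep only the slices that carry data.
toy_used = toy[used, :, :]

println(used)
println(size(toy_used))
```

Unlike a contiguous range such as `1:23`, this also drops empty slices in the middle of the axis, so the label vector would need to be indexed with the same `used` list to stay aligned.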


When tensors get larger we will want some general statistics to guide our analysis.  Often the data is wide ranging so we might drop small values or renormalize etc.  To explore one option we will simply drop small sales volumes.

In [24]:
dropSmall = x -> abs(x) < 0.05 ? 0 : x 
t_trimmed_big = t_trimmed .|> dropSmall

# Check current tensor dimensions
println("Current tensor size: ", size(t_trimmed_big) )

nonzeros_count = count(!iszero, t_trimmed_big); nonzeros_count_orig = count(!iszero, t)
total_length = length(t_trimmed_big); total_length_orig = length(t)
result = nonzeros_count / total_length; result_orig = nonzeros_count_orig / total_length_orig
println("Number of nonzeros: ", nonzeros_count, " (originally ", nonzeros_count_orig, ")")
println("Total length: ", total_length, " (originally ", total_length_orig, ")")
println("Ratio (nonzeros/length): ", result, " (originally ", result_orig, ")")

Current tensor size: (23, 12, 82)
Number of nonzeros: 3761 (originally 4089)
Total length: 22632 (originally 30504)
Ratio (nonzeros/length): 0.16618062919759632 (originally 0.1340479937057435)
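
The `dropSmall` function above returns the integer `0` for small values and the original float otherwise, which is why broadcasting it can yield arrays with element type `Real`. A possible type-stable variant is sketched below; the names `drop_small` and `density` and the default tolerance are assumptions for illustration.

```julia
# Hypothetical type-stable variant: zero(x) has the same type as x.
drop_small(x; tol = 0.05) = abs(x) < tol ? zero(x) : x

# Fraction of nonzero entries in an array.
density(a) = count(!iszero, a) / length(a)

sample = [0.01 0.2; -0.03 1.4]
cleaned = drop_small.(sample)

println(eltype(cleaned))
println(density(cleaned))
```

Keeping a concrete element type such as `Float64` usually matters for performance once the array is passed on to numerical routines.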


For this size of data we can plot the resulting trimmed large value tensors.  You might notice for top sellers the scores do not go below 30.  We could trim this off as well.


In [25]:
plotTensor(t_trimmed_big, 1.0; xlabel="Platform", ylabel="Genre", zlabel="Critic Score")

Plotting 1268 points...
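
Trimming the low scores mentioned above could be done along the third axis in the same way the platforms were trimmed. The sketch below uses made-up score labels and a toy tensor; the cutoff of 30 and the names `score_labels`, `keep`, and `toy_high` are assumptions.

```julia
# Made-up score labels and a matching toy tensor (2 × 2 × 5).
score_labels = [13.0, 25.0, 30.0, 61.0, 88.0]
toy = reshape(collect(1.0:20.0), 2, 2, 5)

# Positions along the third axis whose score meets the assumed cutoff of 30.
keep = findall(>=(30), score_labels)

# Slice the tensor and its labels together so they stay aligned.
toy_high = toy[:, :, keep]
labels_high = score_labels[keep]

println(keep)
println(size(toy_high))
```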


## Stratification

It is time to see what Dleto can chisel from this data.  Depending on the size of the data and the strategy selected, this could take some time.  Improving this timing is one of our present research questions, so for now treat this as a demonstration of what you get for this investment of time.

We will start with the full tensor, which takes roughly one to two minutes on an Apple M2 (the run recorded below finished in about 59 seconds), with the singular vector step allocating about 2.3 GB.

In [26]:
@time u = stratify(t)


	Building linear system...
	Sizes: (7829, 30504)
  0.199029 seconds (388.35 k allocations: 1.825 GiB, 9.38% gc time, 29.96% compilation time)

	Computing singular vectors for (7829, 30504)...
	
 58.692308 seconds (36.59 k allocations: 2.291 GiB, 0.17% gc time)

	Extracting matrices...
  0.018701 seconds (11.50 k allocations: 652.812 KiB, 99.86% compilation time)
  0.000006 seconds (4 allocations: 2.531 KiB)
  0.000030 seconds (6 allocations: 128.156 KiB)
 59.026459 seconds (1.11 M allocations: 4.152 GiB, 0.20% gc time, 0.33% compilation time)


(tensor = [7.889335476476046e-33 -4.1305439106547555e-33 … 0.0 3.2915724879252086e-33; 3.020843100155467e-20 7.534112207070732e-19 … 0.0 4.941426166354507e-19; … ; -6.737423920658879e-19 3.873887144508017e-18 … 0.0 1.923150981158532e-18; -9.702174664633503e-21 2.104754861185557e-18 … 0.0 1.3103566999534987e-18;;; 1.7136716340225575e-19 -8.972106654457276e-20 … 0.0 7.149745907884779e-20; 1.439972914413811e-19 -7.53912844861872e-20 … 0.0 6.007825681357777e-20; … ; 3.706276773064601e-19 -1.9404598779999194e-19 … 0.0 1.54632525074277e-19; -2.725079305800116e-20 1.4267437056247148e-20 … 0.0 -1.1369520407810759e-20;;; 6.383873353911055e-33 -3.3423435075125905e-33 … 0.0 2.6634666456747473e-33; 5.364274307967549e-33 -2.808521787915427e-33 … 0.0 2.2380716072270346e-33; … ; 1.3806846693405154e-32 -7.228718655054846e-33 … 0.0 5.760464472137407e-33; -1.0151630465374678e-33 5.314992058203166e-34 … 0.0 -4.235442598054678e-34;;; … ;;; -1.3100432362854097e-32 6.858868061154095e-33 … 0.0 2.351779356261
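
The log above shows that nearly all of the time goes into computing singular vectors of a 7829 × 30504 matrix. How `stratify` builds and uses that matrix is not shown in this notebook. Purely as an illustration of this kind of computation, and not as Dleto's implementation, the sketch below uses `svd` from Julia's standard `LinearAlgebra` library on a small made-up matrix to extract the right singular vectors that belong to numerically zero singular values. The matrix and the tolerance are assumptions.

```julia
using LinearAlgebra

# Small made-up system: the third column is the sum of the first two,
# so the matrix has rank 2 and a one-dimensional null space.
A = [1.0 0.0 1.0;
     0.0 1.0 1.0;
     1.0 1.0 2.0;
     2.0 0.0 2.0]

F = svd(A; full = true)

# Assumed relative tolerance for treating a singular value as zero.
tol = 1e-10 * maximum(F.S)
rank_A = count(>(tol), F.S)

# Columns of V beyond the numerical rank span the null space of A.
null_basis = F.V[:, rank_A+1:end]

println(rank_A)
println(norm(A * null_basis) < 1e-8)
```

A dense decomposition of this kind scales poorly as the matrix grows, which fits the observation that the singular vector step dominates the running time.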

In [27]:
plotTensor(u.tensor,0.01; 
          xlabel="Stratified Platforms", 
          ylabel="Stratified Genre", 
          zlabel="Stratified Score", 
          title="Stratified Video Game Sales Tensor")

Plotting 1821 points...


You will see that the result is essentially concentrated in one genre combination but spread along many platforms and scores.  Some of these effects come from using data with so many degeneracies.  Let's recompute on the trimmed tensor to compare.  This tensor is smaller, and the performance improves to roughly 50 to 90 seconds on an Apple M2 (the run recorded below took about 51 seconds), with the singular vector step allocating about 2.2 GB.


In [28]:
@time v = stratify(t_trimmed)


	Building linear system...
	Sizes: (7397, 22632)
  0.083028 seconds (135.89 k allocations: 1.272 GiB, 1.50% gc time)

	Computing singular vectors for (7397, 22632)...
	
 50.934274 seconds (36.53 k allocations: 2.192 GiB, 0.21% gc time)

	Extracting matrices...
  0.000006 seconds (6 allocations: 10.156 KiB)
  0.000001 seconds (4 allocations: 2.531 KiB)
  0.000020 seconds (6 allocations: 128.156 KiB)
 51.020120 seconds (173.15 k allocations: 3.465 GiB, 0.21% gc time)


(tensor = [1.8821723130153493e-25 8.978470256009489e-26 … 0.0 -4.938088840532793e-20; 0.0 0.0 … 0.0 0.0; … ; 4.954383952241529e-9 -4.290160524896584e-9 … 0.0 7.407081359171683e-5; 2.437188114624881e-11 -1.921887949894925e-11 … 0.0 -2.497280245557182e-8;;; 2.4188659016020064e-21 2.464262497768022e-21 … -3.7153282963833327e-20 4.938346061153376e-20; 0.0 0.0 … 0.0 0.0; … ; -5.5184200688763654e-5 3.9536574112492026e-5 … 0.000119308968700645 -7.97136622497025e-5; -3.708491831031371e-6 3.1121067535295833e-6 … 4.556813875773481e-6 -3.7247315047476045e-8;;; 1.876256341199293e-20 -1.5734152324665144e-20 … -2.3144540124228653e-20 5.283312073596825e-20; 0.0 0.0 … 0.0 0.0; … ; -7.813883465560615e-5 6.552507392270168e-5 … 9.637062950697786e-5 -7.976875675433285e-5; -3.587873092833662e-6 3.0086869127441994e-6 … 4.425335641386805e-6 -1.067747336796062e-8;;; … ;;; -5.4756683583014335e-21 4.6413977729461455e-21 … 6.25437205154192e-21 -8.141294187535804e-21; 0.0 0.0 … 0.0 0.0; … ; 2.086243432633001e-6 -

We can plot the result, and we now see two, possibly three, relevant score groups but not much separation in platforms.

In [29]:
plotTensor(v.tensor; 
          xlabel="Stratified Platforms", 
          ylabel="Stratified Genre", 
          zlabel="Stratified Score", 
          title="Trimmed Stratified Video Game Sales Tensor")

Plotting 16 points...



We might now ask a number of questions:
 1. Are these reliable features, or will another run find different structure?
 2. How do I see the actual combinations that lead to these clusters?
 3. Are there other clustering targets?

We can start by looking at the data labels of the new combinations.  These are included in the output under cryptic labels (a future feature is to fix this).  Use `u.Xchange` (Platform groups), `u.Ychange` (Genre groups), and `u.Zchange` (Score groups), and replace `u` with `v` to look at the second stratified tensor. 

If you print `v.Xchange` directly you get a matrix.  Each column is the vector of combinations of the original categories that gives one new coordinate of the resulting tensor, so it may be more instructive to read the matrix column by column.  


In [31]:
v.Xchange[:,1] .|> dropSmall


23-element Vector{Real}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 1.0
 0
 0

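This column is a standard basis vector $e_i$: a single entry equal to 1 and zeros elsewhere.  That means this coordinate did not change, so it represents 100% of the value of one original category.  Since this is the platform axis, the position $i$ of the nonzero entry is the index of that platform in `platforms`, and we can look it up.


In [None]:
idx = findfirst(!iszero, v.Xchange[:,1] .|> dropSmall)
println("Platform at index ", idx, ": ", platforms[idx])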

Perhaps unsurprisingly for Video Game experts, Nintendo NES stands apart from other platforms.  But we should not jump to conclusions.  Perhaps what we are seeing is that no platforms changed.  Here inspecting the whole matrix can help us.

In [None]:
v.Xchange

We see now that while the first several columns are unchanged, the final 3 columns are combinations.  Let's extract one of them.

In [None]:
new_platform = [(platforms[i], v.Xchange[i,21]) for i in 1:size(v.Xchange,1)]

We see that this platform group (column 21) is now a combination of "Wii" (down by 65%), "DS" (up 28%), "X360" (down 64%), and "PS3" (up 28%).  Interpreting this will take a subject matter expert with experience in data science.  However, we might conjecture that the algorithm is identifying a pattern in sales that is stable across this combination of platforms.  For example, Wii and Xbox 360 sales may negatively offset DS and PS3 sales.  We might look at the entire tensor slice here to learn more.
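
To make a column like this easier to read, one option is to pair each label with its weight, drop the negligible entries, and sort by magnitude. The sketch below does this on a made-up 4 × 4 change matrix with generic labels; the helper name `describe_column` and the tolerance are assumptions for illustration and are not part of Dleto.

```julia
# Made-up labels and change matrix; column 4 mixes three of the labels.
labels = ["A", "B", "C", "D"]
change = [1.0 0.0 0.0 -0.6;
          0.0 1.0 0.0  0.3;
          0.0 0.0 1.0 -0.7;
          0.0 0.0 0.0  0.0]

# Hypothetical helper: list (label, weight) pairs of one column, largest weight first.
function describe_column(change, labels, col; tol = 0.05)
    entries = [(labels[i], change[i, col]) for i in axes(change, 1)
               if abs(change[i, col]) >= tol]
    return sort(entries; by = e -> abs(e[2]), rev = true)
end

for (label, weight) in describe_column(change, labels, 4)
    println(rpad(label, 4), weight)
end
```

With the notebook's variables, the same idea would be applied to a column of `v.Xchange` together with `platforms`, assuming the rows of the matrix follow the order of the label vector.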

In [None]:
v.tensor[21, :, :]

Since scores form the most pronounced clusters, we should look at `v.Zchange` (dropping small values to see better).

In [None]:
v.Zchange .|> dropSmall

We should see what row 2 is about.

In [None]:
critic_scores[2]

## Other chisels

We close with other chisels.

In [None]:
@time s = toSurfaceTensor(t_trimmed)

In [None]:

plotTensor(s.tensor, 1.0)

While surface chisels are great at recovering data like embedded surfaces that arise in PDEs, because they only use symmetric matrices and thus rotations, they are less capable of clustering in this setting.

In [None]:
@time f = toFaceCurveTensor(t)

In [None]:
plotTensor(f.tensor)

In [None]:
@time c = toCurveTensor(t)
plotTensor(c.tensor)