Binary Attribute Representation

This is the repository for "A transparent approach to data representation" (http://arxiv.org/abs/2304.14209), a manuscript detailing the binary attribute representation (BAR). BAR factorizes a large MxN matrix as the product of a binary MxL matrix and a real-valued LxN matrix, with L much smaller than M and N. The manuscript describes how to use this model to find a compact representation of movies and viewers from an incomplete matrix of the ratings that viewers have given to movies. The ratings data are from the Netflix Prize (https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data) and comprise 17770 movies and 480189 viewers.

To use the code in this repo, first convert the raw ratings file to matrix form with matrix.c, then use sort.c to put the most important movies first. Compile these programs with gcc -O3 matrix.c -lm -o matrix and gcc -O3 sort.c -lm -o sort.
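As a rough illustration of the factorization described above, the sketch below reconstructs a single rating as the sum of a movie's weights over the attributes that are switched on for a viewer. It is not taken from the repository: the function name bar_predict, the packing of a viewer's attributes into one 32-bit word, and the per-movie weight array are all assumptions made for the example, and the manuscript's model may include details that are not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: bit l of viewer_bits is 1 when attribute l is on
 * for this viewer (one row of the binary MxL matrix), and movie_weights
 * holds the L real-valued weights of one movie (one column of the LxN
 * matrix). */
double bar_predict(uint32_t viewer_bits, const double *movie_weights, int n_attr)
{
    assert(n_attr >= 0 && n_attr <= 32);  /* attributes must fit in one word */

    double rating = 0.0;
    for (int l = 0; l < n_attr; l++) {
        /* A binary attribute either contributes its weight or nothing. */
        if ((viewer_bits >> l) & 1u)
            rating += movie_weights[l];
    }
    return rating;
}
```

With four attributes, a viewer whose attributes 0 and 2 are on would get a predicted rating equal to the sum of the movie's weights 0 and 2.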

bits.c is the code for finding the binary attributes for the viewers. weights.c is the code for finding the best real-valued weights for all the movies, given a set of binary attributes for each viewer. Both of these use OpenMP to take advantage of the algorithm's parallelism; compile them with gcc -fopenmp bits.c -lm -o bits and gcc -fopenmp weights.c -lm -o weights.
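To give a feel for the kind of work that can be spread across threads, here is a minimal sketch of a mean-squared-error pass over the observed ratings, parallelized over viewers with an OpenMP reduction. It is an assumption-laden example rather than the code in bits.c or weights.c: the function name bar_mse, the row-major array layouts, the use of 0 to mark a missing rating, and the 32-bit packing of attributes are all hypothetical. The RMSE is the square root of the returned value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout:
 *   ratings[i * n_movies + j]  rating of movie j by viewer i, 0 if missing
 *   bits[i]                    packed binary attributes of viewer i
 *   weights[l * n_movies + j]  weight of attribute l for movie j
 * Returns the mean squared error over the observed ratings. */
double bar_mse(const uint8_t *ratings, const uint32_t *bits,
               const double *weights, int n_viewers, int n_movies, int n_attr)
{
    assert(n_attr >= 0 && n_attr <= 32);

    double sse = 0.0;  /* sum of squared errors */
    long count = 0;    /* number of observed ratings */

    /* Each viewer's errors are independent, so rows can be split across
     * threads and the partial sums combined by the reduction. Without
     * -fopenmp the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for reduction(+:sse, count)
    for (int i = 0; i < n_viewers; i++) {
        for (int j = 0; j < n_movies; j++) {
            uint8_t r = ratings[(size_t)i * n_movies + j];
            if (r == 0)
                continue;  /* skip missing entries */

            double pred = 0.0;
            for (int l = 0; l < n_attr; l++) {
                if ((bits[i] >> l) & 1u)
                    pred += weights[(size_t)l * n_movies + j];
            }
            double err = pred - (double)r;
            sse += err * err;
            count++;
        }
    }
    return count > 0 ? sse / (double)count : 0.0;
}
```

Evaluating an error measure like this touches every observed rating, which is consistent with the advice below to compute RMSE only every few iterations.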

bits expects nine arguments, in this order:

1. name of the data file
2. number of columns (movies) to use; a few hundred is enough
3. number of rows (viewers) to use; you can use all 480189, or fewer to get quicker diagnostic runs on a sample of the viewers
4. number of attributes to use; 8 or 16 can give decent results (more attributes will take longer)
5. number of iterations; 1000 is usually enough
6. how often (measured in iterations) to compute RMSE; 10 is a good choice (computing RMSE more often makes the algorithm slower)
7. how many trials to run; just one is probably enough if training on all viewers
8. desired name for results files
9. number of threads to use
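The argument list above can be summarized as a small parsing sketch. This is not the parsing code in bits.c; the struct bits_args, the function parse_bits_args, and the validation rules are hypothetical and only mirror the order of the nine arguments. Under that ordering, an invocation could look like bits ratings.bin 500 480189 16 1000 10 1 run1 8, where ratings.bin and run1 are made-up file names.

```c
#include <stdlib.h>

/* Hypothetical container for the nine positional arguments of bits. */
typedef struct {
    const char *data_file;  /* name of the data file */
    int n_movies;           /* number of columns (movies) to use */
    int n_viewers;          /* number of rows (viewers) to use */
    int n_attr;             /* number of binary attributes */
    int n_iter;             /* number of iterations */
    int rmse_every;         /* compute RMSE every this many iterations */
    int n_trials;           /* number of trials to run */
    const char *out_name;   /* desired name for results files */
    int n_threads;          /* number of threads to use */
} bits_args;

/* Returns 0 on success, -1 if the argument count or a value is invalid. */
int parse_bits_args(int argc, char **argv, bits_args *a)
{
    if (argc != 10)  /* program name plus nine arguments */
        return -1;

    a->data_file  = argv[1];
    a->n_movies   = atoi(argv[2]);
    a->n_viewers  = atoi(argv[3]);
    a->n_attr     = atoi(argv[4]);
    a->n_iter     = atoi(argv[5]);
    a->rmse_every = atoi(argv[6]);
    a->n_trials   = atoi(argv[7]);
    a->out_name   = argv[8];
    a->n_threads  = atoi(argv[9]);

    /* Every numeric argument is a count, so it must be positive. */
    if (a->n_movies <= 0 || a->n_viewers <= 0 || a->n_attr <= 0 ||
        a->n_iter <= 0 || a->rmse_every <= 0 || a->n_trials <= 0 ||
        a->n_threads <= 0)
        return -1;
    return 0;
}
```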

weights expects four arguments, in this order:

1. name of the data file
2. name of the attribute bits file (one of the outputs from running bits)
3. desired name for results files
4. number of threads to use
