
Add tensorly.contrib.sparse module #77

Merged: 14 commits from jcrist:sparse-take-2 into tensorly:master on Feb 5, 2019

Conversation

@jcrist (Contributor) commented Oct 2, 2018

This adds a tensorly.contrib.sparse module, mirroring the tensorly namespace, but with sparse functionality instead of dense. This builds on #76, only the last commit is for this PR.

I'm not 100% happy with the mechanism here, but it's easy with #76 (which I think is a good change regardless of whether it's used for this). The gist is that the sparse functions are simple wrappers around the normal tensorly functions, using the relevant sparse backend instead of the dense backend. This means that you need to use tensorly.contrib.sparse.unfold instead of tensorly.unfold, but the sparse versions are just wrapped versions of the dense ones.
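To illustrate the idea, here is a rough sketch of the wrapping pattern (this is not the code in this PR; `wrap` and `using_sparse_backend` are hypothetical names standing in for whatever backend-switching mechanism #76 provides):

    import functools
    import tensorly

    def wrap(dense_func):
        """Return a version of `dense_func` that runs against the sparse backend."""
        @functools.wraps(dense_func)
        def wrapper(*args, **kwargs):
            # Hypothetical context manager that temporarily activates the
            # sparse variant of the current backend.
            with using_sparse_backend():
                return dense_func(*args, **kwargs)
        return wrapper

    # tensorly.contrib.sparse.unfold would then be roughly:
    unfold = wrap(tensorly.unfold)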

So far only the numpy backend implements sparse functionality, using the pydata/sparse library (note that you need master to try this). All backend methods are supported, as well as all top-level tensorly methods (i.e. everything wrapped in the tensorly.contrib.sparse namespace).
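Usage looks the same as the dense API. A minimal sketch (assuming pydata/sparse master and this branch are installed):

    import numpy as np
    import sparse
    from tensorly.contrib.sparse import unfold

    # A 3 x 3 x 3 tensor with three nonzeros on the diagonal.
    coords = np.array([[0, 1, 2],
                       [0, 1, 2],
                       [0, 1, 2]])
    values = np.array([1.0, 2.0, 3.0])
    tensor = sparse.COO(coords, values, shape=(3, 3, 3))

    # Same call signature as tensorly.unfold; the result should stay sparse.
    matrix = unfold(tensor, 0)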

I also included tucker decomposition in the tensorly.contrib.sparse.decomposition namespace. For now this is a wrapper around the dense version (as above), but could be replaced with a sparse-specific implementation later.

Supersedes #64.


Diff of only the changes in this PR: aa076e5

- Add docstrings for all public methods
- A few style cleanups
- Explicitly import things into top-level namespace
- Remove a few backend methods that did not need to be backend-specific.

- Create `tensorly.testing`
- Move all test imports to `tensorly.testing`.
- Use absolute imports for test imports. For tests this makes more sense
than relative imports, and is standard practice in the numerical Python
ecosystem.
- Use classes to hopefully make the backend implementations clearer for
others.
- Add the ability to set the backend for all threads. The default is still
thread/context local, but we may want to change that later (a rough sketch
of this pattern follows).
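A rough sketch of the thread-local-with-global-fallback pattern (the names here are illustrative, not tensorly's actual internals):

    import threading

    _GLOBAL_BACKEND = 'numpy'      # default shared by all threads
    _local = threading.local()     # per-thread override

    def set_backend(name, threadsafe=True):
        """Set the backend for the current thread only, or for all threads."""
        global _GLOBAL_BACKEND
        if threadsafe:
            _local.backend = name      # visible only in the calling thread
        else:
            _GLOBAL_BACKEND = name     # becomes the new default everywhere

    def get_backend():
        # Fall back to the global default when no thread-local value is set.
        return getattr(_local, 'backend', _GLOBAL_BACKEND)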
@coveralls commented Oct 2, 2018

Coverage Status

Coverage decreased (-4.05%) to 92.418% when pulling aa076e5 on jcrist:sparse-take-2 into 5ef625f on tensorly:master.

@jcrist (Contributor, Author) commented Oct 2, 2018

Example

In [1]: %load frostt.py
   ...: import requests
   ...: import os
   ...: import gzip
   ...: import numpy as np
   ...: import sparse
   ...:
   ...:
   ...: DATA_DIR = './_tensor-data/'
   ...:
   ...:
   ...: def download_file(url, local_path=DATA_DIR):
   ...:     local_filename = url.split('/')[-1]
   ...:     path = local_path + local_filename
   ...:     r = requests.get(url, stream=True)
   ...:     total_size = int(r.headers.get('content-length', 0))
   ...:     with open(path, 'wb') as f:
   ...:         chunk_size = 32 * 1024
   ...:         for chunk in r.iter_content(chunk_size):
   ...:             if chunk:
   ...:                 f.write(chunk)
   ...:     return path
   ...:
   ...:
   ...: def frostt(descriptor, data_dir=DATA_DIR):
   ...:     if data_dir == DATA_DIR:
   ...:         try:
   ...:             os.makedirs(DATA_DIR)
   ...:         except FileExistsError:
   ...:             pass
   ...:     files = os.listdir(data_dir)
   ...:     if descriptor + '.tns.gz' in files:
   ...:         return read_dataset(data_dir + descriptor + '.tns.gz',
   ...:                             format=lambda coords, values: (coords - 1, values))
   ...:     prefix = 'https://s3.us-east-2.amazonaws.com/frostt/frostt_data/'
   ...:     url = prefix + descriptor + '/' + descriptor + '.tns.gz'
   ...:     download_file(url, local_path=data_dir)
   ...:     return frostt(descriptor, data_dir=data_dir)
   ...:
   ...:
   ...: def read_dataset(filename, format=None):
   ...:     with gzip.open(filename, 'rb') as f:
   ...:         raw = f.readlines()
   ...:     first_row = [float(x) for x in raw[0].split(b' ')]
   ...:     num_coords = len(first_row) - 1
   ...:     medium_rare = list(map(lambda line: line.strip(b'\n').split(b' '), raw))
   ...:     coords = (int(x) for line in medium_rare for x in line[:-1])
   ...:     values = (float(line[-1]) for line in medium_rare)
   ...:     coords = np.fromiter(coords, dtype=int).reshape(-1, num_coords)
   ...:     values = np.fromiter(values, dtype=float)
   ...:     if format:
   ...:         coords, values = format(coords, values)
   ...:     return sparse.COO(coords.T, data=values)

In [2]: data = frostt('nips')

In [3]: data.nbytes / 1e9  # Sparse memory used in GB
Out[3]: 0.12406436

In [4]: data.size * 8 / 1e9  # Memory used if a dense array
Out[4]: 13559.812193664

In [5]: from tensorly.contrib.sparse.decomposition import partial_tucker

In [6]: %%time
   ...: core, factors = partial_tucker(data, [1, 2],
   ...:                                rank=[5, 5, 100, 17],
   ...:                                verbose=True, init='random',
   ...:                                tol=1e-3)
   ...:
reconstruction error=0.9934568578063805, variation=0.00014067785852067693.
converged in 2 iterations.
CPU times: user 3min 4s, sys: 53.6 s, total: 3min 58s
Wall time: 3min 25s

In [7]: core
Out[7]: <COO: shape=(2482, 5, 5, 17), dtype=float64, nnz=62050, fill_value=0.0>

In [8]: factors
Out[8]:
[array([[-5.91222626e-18,  2.89827035e-18, -2.75691272e-19,
         -8.78649988e-18,  3.94111419e-18],
        [-5.02446340e-18, -1.55209758e-09,  2.90628869e-08,
          1.34380988e-06,  1.68727465e-05],
        [ 7.39479895e-18, -1.79811939e-18,  9.82153292e-19,
          1.40760063e-17, -2.35939229e-18],
        ...,
        [-4.79033419e-18, -4.76625667e-08,  6.99296165e-09,
          5.31161114e-06, -2.72788286e-06],
        [-4.90639724e-18, -9.82272650e-18,  2.91640399e-19,
         -2.72238732e-18, -1.30575311e-17],
        [-7.96563209e-18, -6.38409970e-16,  2.77030659e-16,
          2.90861929e-13, -1.38441587e-13]]),
 array([[-2.17508562e-04, -2.20062345e-04,  1.71913582e-04,
         -2.69793852e-04,  4.48704964e-04],
        [-5.76951439e-04, -7.41918691e-04,  5.09510767e-04,
         -6.76461602e-04, -8.46511610e-04],
        [-5.74204776e-05, -6.89232270e-05,  5.43752349e-05,
         -1.39790241e-05, -3.00200107e-05],
        ...,
        [-1.17397904e-09, -1.19689398e-09,  9.17598542e-10,
         -1.32693291e-09, -5.93684923e-10],
        [-1.59895405e-04, -1.37067503e-04,  1.10396298e-05,
         -3.79328786e-05,  1.03508722e-04],
        [-9.26775928e-07, -1.01001915e-06,  5.39128131e-07,
         -9.26510976e-07,  1.09824611e-06]])]

@jcrist (Contributor, Author) commented Oct 12, 2018

I gave a demo of this functionality today; the notebook used can be found here if you're interested: https://gist.github.com/jcrist/f7f0682ed01f12e96f9a40d8862b2477.

@JeanKossaifi (Member) commented

Thanks for sharing - awesome notebook!
Looking forward to having this merged in TensorLy :)

@asmeurer mentioned this pull request Nov 19, 2018
@JeanKossaifi merged commit ff75e67 into tensorly:master Feb 5, 2019
@JeanKossaifi mentioned this pull request Apr 8, 2019