# Getting started with `svd2vec`

## I - Installation

`svd2vec` can be installed using *pip*:

```shell
pip install svd2vec
```

## II - Usage

`svd2vec` exposes an interface similar to the `word2vec` implementation in [Gensim](https://pypi.org/project/gensim/).
The full documentation is available [here](#).

### A/ Corpus creation

The corpus parameter of `svd2vec` (`documents`) should be a list of documents, where each document is itself a list of words (tokens).
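As an illustration of that list-of-lists shape, here is a minimal sketch that builds a tiny corpus from raw strings. The `tokenize` helper and the `toy_documents` name are hypothetical and not part of `svd2vec`; the regular expression is only one simple way to split text into lowercase words.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens (hypothetical helper)."""
    return re.findall(r"[a-z]+", text.lower())

raw_texts = [
    "Anarchism originated as a term of abuse.",
    "Qualitative impairments in communication.",
]

# one inner list of words per document
toy_documents = [tokenize(text) for text in raw_texts]
print(toy_documents)
```

A corpus this small is only useful for checking the structure; the demo below uses the much larger `text8` file.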

In [11]:
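# download and extract the text8 demo corpus if it is not already present
# (the `!` lines are IPython shell escapes, so this cell must run in Jupyter/IPython)
import os
if not os.path.isfile("text8"):
    !echo "Downloading and extracting the corpus file"
    !curl -O http://mattmahoney.net/dc/text8.zip
    !unzip text8.zip
    !echo "Done"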

In [12]:
# loading the word2vec demo corpus
from svd2vec import FilesIO
documents = FilesIO.load_corpus("text8")

### B/ Creation of the vectors

In [13]:
from svd2vec import svd2vec

In [16]:
# showing first fifteen words of the first two documents
[d[:15] + ['...'] for d in documents[:2]]

[['',
  'anarchism',
  'originated',
  'as',
  'a',
  'term',
  'of',
  'abuse',
  'first',
  'used',
  'against',
  'early',
  'working',
  'class',
  'radicals',
  '...'],
 ['emotional',
  'reciprocity',
  'qualitative',
  'impairments',
  'in',
  'communication',
  'as',
  'manifested',
  'by',
  'at',
  'least',
  'one',
  'of',
  'the',
  'following',
  '...']]

In [17]:
# creating the word representations (can take a while)
svd = svd2vec(documents, window=5, min_count=100, verbose=False)

### C/ Similarity and distance

In [18]:
svd.similarity("bad", "good")

0.5542564783462338

In [19]:
svd.similarity("monday", "friday")

0.8130096497866965

In [20]:
svd.distance("apollo", "moon")

0.44440147394775065

In [21]:
svd.most_similar(positive=["january"], topn=2)

[['march', 0.8295233045546868], ['november', 0.8216695339361217]]
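To give some intuition for the scores above, the sketch below computes a cosine similarity and the matching distance on hand-made toy vectors. Cosine similarity is the usual measure for word embeddings, but whether `svd2vec` computes exactly this, and whether its `distance` equals `1 - similarity`, is an assumption here. The vectors and function names are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Assumed relation: distance is one minus the similarity."""
    return 1.0 - cosine_similarity(u, v)

# toy 3-dimensional vectors, not taken from a trained model
toy_vectors = {
    "monday": [0.9, 0.1, 0.3],
    "friday": [0.8, 0.2, 0.4],
    "moon": [0.1, 0.9, 0.2],
}

print(cosine_similarity(toy_vectors["monday"], toy_vectors["friday"]))
print(cosine_distance(toy_vectors["monday"], toy_vectors["moon"]))
```

Vectors pointing in nearly the same direction score close to 1, which is why the two weekdays come out more similar to each other than to "moon" in this toy setup.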

### D/ Analogy

In [22]:
svd.analogy("paris", "france", "berlin")

[['germany', 0.7408263113851594],
 ['bavaria', 0.6431198555282065],
 ['saxony', 0.5959211171696297],
 ['austria', 0.590324328590732],
 ['brandenburg', 0.5900766294741353],
 ['prussia', 0.5843773122184047],
 ['bohemia', 0.5824459318790264],
 ['hanover', 0.5682950512805615],
 ['cologne', 0.5442204032780137],
 ['reich', 0.5419180170716675]]

In [23]:
svd.analogy("road", "cars", "rail", topn=5)

[['locomotives', 0.7007217961709339],
 ['locomotive', 0.6949958902552571],
 ['trucks', 0.6416710731236377],
 ['passenger', 0.6340002227591348],
 ['diesel', 0.6173040175406118]]

In [24]:
svd.analogy("cow", "cows", "pig")

[['pigs', 0.585718326879587],
 ['rabbit', 0.5419044648732168],
 ['dogs', 0.535370585991924],
 ['cats', 0.5153222256053517],
 ['sheep', 0.5149061151805266],
 ['goat', 0.50844035445475],
 ['deer', 0.5025313268690585],
 ['cat', 0.4944135783694952],
 ['goats', 0.4921548791044705],
 ['cattle', 0.48890884003919893]]

In [25]:
svd.analogy("man", "men", "woman")

[['women', 0.7420544447534583],
 ['couples', 0.5894623989986301],
 ['sex', 0.5854488454516094],
 ['male', 0.5767848480015852],
 ['female', 0.5652073684195842],
 ['sexual', 0.5066287903659428],
 ['sexually', 0.49818155809108344],
 ['intercourse', 0.4814815146911616],
 ['heterosexual', 0.47986486800415895],
 ['lesbian', 0.4657069024650438]]
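The analogy queries above read as "a is to b as c is to ?". A common way to answer them is additive vector arithmetic (often called 3CosAdd): build the target `b - a + c` and rank the remaining words by cosine similarity to it. The sketch below applies that idea to hand-made toy vectors; it is an assumption that `svd2vec` uses this formulation, and the `analogy` function here is a hypothetical stand-in, not the library's method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, a, b, c, topn=2):
    """Rank words by similarity to b - a + c, skipping the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in (a, b, c)
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]

# toy vectors: (is a country, French, German); not taken from a trained model
toy = {
    "paris": [0.0, 1.0, 0.0],
    "france": [1.0, 1.0, 0.0],
    "berlin": [0.0, 0.0, 1.0],
    "germany": [1.0, 0.0, 1.0],
    "austria": [1.0, 0.2, 0.6],
}

print(analogy(toy, "paris", "france", "berlin"))
```

Excluding the query words matters in practice, because the target vector is often closest to one of its own inputs.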

### E/ Saving and loading vectors

In [26]:
# saving to a binary format
svd.save("svd.svd2vec")

In [27]:
# loading from binary file
loaded = svd2vec.load("svd.svd2vec")
loaded.similarity("bad", "good")

0.5542564783462338

In [28]:
# saving to a word2vec-like format
svd.save_word2vec_format("svd.word2vec")
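Files in the plain-text word2vec format usually start with a header line giving the vocabulary size and the vector dimension, followed by one `word v1 v2 ...` line per word. Assuming `save_word2vec_format` writes that text layout (not verified here), a minimal reader could look like the sketch below. The `read_word2vec_text` helper is hypothetical and is demonstrated on an in-memory sample, not on the real output file.

```python
import io

def read_word2vec_text(handle):
    """Parse a text-format word2vec stream into a {word: vector} dict."""
    header = handle.readline().split()
    dim = int(header[1])  # header[0] holds the declared vocabulary size
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# in-memory sample standing in for a saved file
sample = io.StringIO("2 3\ngood 0.1 0.2 0.3\nbad 0.3 0.2 0.1\n")
w2v_vectors = read_word2vec_text(sample)
print(w2v_vectors)
```

With a real file, the same function could be called on `open("svd.word2vec", encoding="utf-8")`, provided the layout assumption holds.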