# Embeddings

https://www.tensorflow.org/programmers_guide/embedding

Embeddings matter because of how machine learning models consume input. Classifiers, and neural networks more generally, work on vectors of real numbers. They train best on dense vectors, where all values contribute to defining an object. However, many important inputs, such as words of text, have no natural vector representation. Embedding functions are the standard, effective way to transform such discrete input objects into useful continuous vectors.
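As a minimal sketch of what an embedding function does, the snippet below maps a few words to rows of a small matrix. The toy vocabulary, the 4-dimensional size, and the random values are illustrative assumptions; in a real model the vectors are learned during training.

```python
import numpy as np

# Hypothetical toy vocabulary: each discrete word gets an integer id.
vocab = {"blue": 0, "blues": 1, "orange": 2, "oranges": 3}

# Embedding table with one dense row per word. The values are random here
# purely for illustration; a trained model would have learned them.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))

def embed(words):
    """Map each word to its dense vector by looking up its row in the table."""
    ids = [vocab[w] for w in words]
    return embedding_table[ids]

vectors = embed(["blue", "oranges"])
print(vectors.shape)  # (2, 4)
```

The mapping itself is nothing more than an index lookup; all of the useful structure lives in the values stored in the table.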

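#### Example:

Each line below lists a word followed by its three nearest neighbors in a word embedding, together with the angle between the two vectors. A smaller angle means the words are more similar.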

```
blue:  (red, 47.6°), (yellow, 51.9°), (purple, 52.4°)
blues:  (jazz, 53.3°), (folk, 59.1°), (bluegrass, 60.6°)
orange:  (yellow, 53.5°), (colored, 58.0°), (bright, 59.9°)
oranges:  (apples, 45.3°), (lemons, 48.3°), (mangoes, 50.4°)
```
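Angles like the ones listed above can be computed from the cosine of the angle between two vectors. The sketch below is one way to do that with NumPy; the 2-dimensional vectors and the helper names `angle_degrees` and `nearest_neighbors` are made up for illustration and are not taken from any real embedding.

```python
import numpy as np

def angle_degrees(u, v):
    """Angle between two vectors, in degrees, via the cosine formula."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against tiny floating-point overshoots outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def nearest_neighbors(word, vectors, k=3):
    """Return the k other words with the smallest angle to `word`."""
    others = [(w, angle_degrees(vectors[word], v))
              for w, v in vectors.items() if w != word]
    return sorted(others, key=lambda pair: pair[1])[:k]

# Hypothetical toy vectors, chosen so the angles are easy to check by hand.
toy = {
    "blue": np.array([1.0, 0.0]),
    "red": np.array([1.0, 1.0]),
    "jazz": np.array([0.0, 1.0]),
    "folk": np.array([-1.0, 1.0]),
}

neighbors = nearest_neighbors("blue", toy, k=2)
print(neighbors)  # "red" (about 45 degrees) comes first, then "jazz" (about 90)
```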

### FAQ:

#### Is "embedding" an action or a thing? Both. 
People talk about embedding words in a vector space (action) and about producing word embeddings (things). Common to both is the notion of embedding as a mapping from discrete objects to vectors. Creating or applying that mapping is an action, but the mapping itself is a thing.

#### Are embeddings high-dimensional or low-dimensional? It depends. 
A 300-dimensional vector space of words and phrases, for instance, is often called low-dimensional (and dense) when compared to the millions of words and phrases it can contain. But mathematically it is high-dimensional, displaying many properties that are dramatically different from what our human intuition has learned about 2- and 3-dimensional spaces.
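One such property, offered here as a well-known fact about high-dimensional geometry rather than something the guide spells out, is that independently drawn random vectors are almost always close to orthogonal. The sketch below compares the spread of angles in 2, 3, and 300 dimensions; the Gaussian sampling and the sample size are arbitrary choices.

```python
import numpy as np

def angle_stats(dim, rng, n_pairs=2000):
    """Mean and standard deviation (degrees) of angles between random vector pairs."""
    u = rng.normal(size=(n_pairs, dim))
    v = rng.normal(size=(n_pairs, dim))
    cos = np.sum(u * v, axis=1) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
    )
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(angles.mean()), float(angles.std())

rng = np.random.default_rng(0)
for dim in (2, 3, 300):
    mean, std = angle_stats(dim, rng)
    print(f"dim={dim:3d}  mean angle={mean:6.1f}  std={std:5.1f}")
```

In 2 dimensions the angles are spread widely between 0° and 180°, while in 300 dimensions they cluster tightly around 90°. This is part of why angles of 45° to 60° between word vectors indicate meaningful similarity.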

#### Is an embedding the same as an embedding layer? No. 
An embedding layer is a part of a neural network, but an embedding is a more general concept.
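To make the distinction concrete, here is a small sketch using PyTorch's `nn.Embedding`. The layer is the network component; the embedding is the mapping stored in its weight matrix. The sizes (5 items, 3 dimensions) are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# An embedding layer: a trainable lookup table with 5 rows of 3 values each.
layer = nn.Embedding(num_embeddings=5, embedding_dim=3)

# Looking up ids returns the matching rows of the weight matrix.
ids = torch.tensor([0, 2, 2])
out = layer(ids)

print(out.shape)                    # torch.Size([3, 3])
print(layer.weight.requires_grad)   # True: the rows are updated during training
```

After training, `layer.weight` can be pulled out and used on its own as an embedding, independent of the network that produced it.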

# MovieLens Dataset

Data available from http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

fast.ai reference code: https://github.com/fastai/fastai/blob/master/fastai/column_data.py

fast.ai lesson notebook: https://github.com/fastai/fastai/blob/master/courses/dl1/lesson5-movielens.ipynb

In [2]:
import pandas as pd
import numpy as np
path = '/Users/timlee/data/movielens/'

In [6]:
def get_data():
    ratings = pd.read_csv(path+'ratings.csv')
    movies = pd.read_csv(path+'movies.csv')
    return ratings, movies

ratings, movies = get_data()

### Making a custom dataset and dataloader in PyTorch using the `Dataset` and `DataLoader` classes

In [None]:
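from torch.utils.data import Dataset, DataLoader

class movielens_ds(Dataset):
    def __init__(self, cat_vars, cont_vars, target):
        # Row count comes from the inputs, so it still works when target is None.
        n = len(cat_vars[0]) if cat_vars else len(cont_vars[0])
        # Categorical columns are integer indices (for embedding lookups);
        # continuous columns are floats. A missing group becomes a zero column.
        self.cat_vars = np.stack(cat_vars, 1).astype(np.int64) if cat_vars else np.zeros((n,1), dtype=np.int64)
        self.cont_vars = np.stack(cont_vars, 1).astype(np.float32) if cont_vars else np.zeros((n,1), dtype=np.float32)
        # Keep the target as a column vector of shape (n, 1).
        self.y = np.zeros((n,1)) if target is None else np.asarray(target)[:,None]

    def __len__(self): return len(self.y)

    def __getitem__(self, idx):
        # One sample: categorical indices, continuous values, target.
        return [self.cat_vars[idx], self.cont_vars[idx], self.y[idx]]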
    
    def import_df(cls)