# word2vec Step by Step

**Written thesis for the CAS Big Data and Machine Learning at the University of Zurich**

Thomas Briner, thomas.briner@gmail.com

Notebook with example calculations


In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
import math
from scipy import stats
import sys
import warnings
import torch
from torch.autograd import Variable
import torch.nn.functional as F
import torch.nn


In [2]:
# Configuring Jupyter environment
%matplotlib inline
warnings.filterwarnings('ignore')

pd.options.display.max_columns = 999
pd.options.display.float_format = '{:,.2f}'.format
np.set_printoptions(precision=2)

from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Example Setup


In [99]:
# Use 4 dimensions for the embedding
dims=['dim1','dim2','dim3','dim4']
D=len(dims) 

In [100]:
# Define the corpus
satz1='unsere katze heisst mimo'
satz2='die katze klettert auf den baum'

corpus=[satz1, satz2]

vokabular=['unsere', 'katze', 'heisst', 'mimo', 'die', 'klettert', 'auf', 'den', 'baum']
V=len(vokabular)

print("Vokabular: {0}".format(vokabular))

Vokabular: ['unsere', 'katze', 'heisst', 'mimo', 'die', 'klettert', 'auf', 'den', 'baum']


In [101]:
# Helper functions for displaying tensors as pandas DataFrames

def df(tensor):
    if tensor.size()[0]==V:
        index = vokabular
    else:
        index = dims
    
    data = tensor.data.numpy()
    if tensor.size()[1]==1:
        cols=['val']
    elif tensor.size()[1]==V:
        cols = vokabular
    else:
        cols = dims

    if is_one_hot(tensor):
        data = data.astype(int)
    return pd.DataFrame(data, columns=cols, index=index)

def is_one_hot(vector):
    all_values = set()
    for x in np.nditer(vector.data.numpy()):
        all_values.add(x.tolist())
    if 1 in all_values:
        all_values.remove(1)
    if 0 in all_values:
        all_values.remove(0)
    return len(all_values) == 0


## Single-Word Context Architecture

### Encoding the Context and Target Word

### Sliding Window

Sentence 2, position 2

|     | context |  target  |     |     |      |
|:---:|:-------:|:--------:|:---:|:---:|:----:|
| die |  katze  | klettert | auf | den | baum |


### One-Hot Encoding of the Context Word: katze

In [102]:
katze = torch.tensor(np.asarray([0,1,0,0,0,0,0,0,0]).reshape(len(vokabular),1), dtype=torch.float)
df(katze)

Unnamed: 0,val
unsere,0
katze,1
heisst,0
mimo,0
die,0
klettert,0
auf,0
den,0
baum,0


### One-Hot Encoding of the Target Word: klettert

In [103]:
klettert = torch.tensor(np.asarray([0,0,0,0,0,1,0,0,0]).reshape(len(vokabular),1), dtype=torch.float)
df(klettert)

Unnamed: 0,val
unsere,0
katze,0
heisst,0
mimo,0
die,0
klettert,1
auf,0
den,0
baum,0


### Similarity on One-Hot Encodings: Useless

The inner product of the one-hot representations of two different words is always 0,
because these vectors are orthogonal to each other.

In [104]:
torch.matmul(katze.t(),klettert)

tensor([[ 0.]])
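
The orthogonality can be checked for the whole vocabulary at once. The following is a minimal sketch in plain Python (lists instead of torch tensors, so that it runs without any dependencies); the helpers `one_hot` and `dot` are illustrative names that do not appear in the notebook.

```python
# Plain-Python sketch, assuming the 9-word vocabulary defined above.
vokabular = ['unsere', 'katze', 'heisst', 'mimo', 'die',
             'klettert', 'auf', 'den', 'baum']

def one_hot(word):
    """Return the one-hot list for `word` (illustrative helper)."""
    return [1.0 if w == word else 0.0 for w in vokabular]

def dot(a, b):
    """Inner product of two equally long lists."""
    return sum(x * y for x, y in zip(a, b))

# All pairwise inner products: 1 on the diagonal, 0 everywhere else
gram = [[dot(one_hot(a), one_hot(b)) for b in vokabular] for a in vokabular]

print(dot(one_hot('katze'), one_hot('klettert')))  # 0.0
print(dot(one_hot('katze'), one_hot('katze')))     # 1.0
```

Every pair of distinct words gets the same score of 0, so the encoding carries no information about which words are related.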

## Forward Pass

### Weight Matrix $W$ for Input => Hidden Layer

In [105]:
r = torch.manual_seed(42)
W = Variable(torch.rand(len(vokabular), len(dims)), requires_grad=True)

df(W)

Unnamed: 0,dim1,dim2,dim3,dim4
unsere,0.88,0.92,0.38,0.96
katze,0.39,0.6,0.26,0.79
heisst,0.94,0.13,0.93,0.59
mimo,0.87,0.57,0.74,0.43
die,0.89,0.57,0.27,0.63
klettert,0.27,0.44,0.3,0.83
auf,0.11,0.27,0.36,0.2
den,0.55,0.01,0.95,0.08
baum,0.89,0.58,0.34,0.81


### Computing the Hidden Layer

$h = W^T \cdot w_c$

In [106]:
h = torch.matmul(W.t(),katze)
df(h)

Unnamed: 0,val
dim1,0.39
dim2,0.6
dim3,0.26
dim4,0.79
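
Because $w_c$ is one-hot, the product $W^T \cdot w_c$ simply picks the row of $W$ that belongs to the context word. A small plain-Python sketch, assuming the rounded values of $W$ printed above (so the numbers are only accurate to two decimals):

```python
vokabular = ['unsere', 'katze', 'heisst', 'mimo', 'die',
             'klettert', 'auf', 'den', 'baum']

# Rounded copy of W (V x D) from the table above
W = [
    [0.88, 0.92, 0.38, 0.96],  # unsere
    [0.39, 0.60, 0.26, 0.79],  # katze
    [0.94, 0.13, 0.93, 0.59],  # heisst
    [0.87, 0.57, 0.74, 0.43],  # mimo
    [0.89, 0.57, 0.27, 0.63],  # die
    [0.27, 0.44, 0.30, 0.83],  # klettert
    [0.11, 0.27, 0.36, 0.20],  # auf
    [0.55, 0.01, 0.95, 0.08],  # den
    [0.89, 0.58, 0.34, 0.81],  # baum
]

# One-hot vector for the context word 'katze'
x = [1.0 if w == 'katze' else 0.0 for w in vokabular]

# h = W^T . x, written out as explicit sums over the vocabulary
h = [sum(W[k][i] * x[k] for k in range(len(vokabular))) for i in range(4)]

# The same result as a plain row lookup
row = W[vokabular.index('katze')]

print(h)  # [0.39, 0.6, 0.26, 0.79]
```

This is why the rows of $W$ can be read directly as the word embeddings.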


### Weight Matrix $W^{\prime}$ for Hidden => Output Layer

In [107]:
r = torch.manual_seed(43)
W_prime = Variable(torch.rand(len(dims), len(vokabular)), requires_grad=True)
df(W_prime)

Unnamed: 0,unsere,katze,heisst,mimo,die,klettert,auf,den,baum
dim1,0.45,0.2,0.92,0.35,0.15,0.09,0.59,0.07,0.75
dim2,0.63,0.94,0.13,0.52,0.53,0.54,0.71,0.43,0.28
dim3,0.84,0.16,0.11,0.73,0.32,0.89,0.13,0.84,0.56
dim4,0.41,0.86,0.28,0.58,0.33,0.68,0.7,0.92,0.23


### Computing $u$

The vector $u$ is the product of the weight matrix $W^{\prime}$ with the hidden layer: $\textbf{u} = \textbf{W'}^T \cdot \textbf{h} = \textbf{W'}^T \cdot (\textbf{W}^T \cdot w_c)$

Each position of this vector is therefore the dot product of the word embedding of the context word $w_c$ (its row in $W$) with the embedding of the respective candidate target word (its column in $W^{\prime}$), which serves as a similarity score.

These values are not normalized and therefore cannot be used as a probability distribution.

In [108]:
u = torch.matmul(W_prime.t(),h)
df(u)

Unnamed: 0,val
unsere,1.1
katze,1.37
heisst,0.69
mimo,1.1
die,0.72
klettert,1.12
auf,1.25
den,1.23
baum,0.79
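
Written out per word, each score is just a dot product between $h$ and one column of $W^{\prime}$. The sketch below recomputes $u$ in plain Python from the rounded values shown above, so the results can deviate from the table in the second decimal:

```python
vokabular = ['unsere', 'katze', 'heisst', 'mimo', 'die',
             'klettert', 'auf', 'den', 'baum']

# Hidden layer for the context word 'katze' (rounded values from above)
h = [0.39, 0.60, 0.26, 0.79]

# Rounded copy of W' (D x V) from the table above
W_prime = [
    [0.45, 0.20, 0.92, 0.35, 0.15, 0.09, 0.59, 0.07, 0.75],  # dim1
    [0.63, 0.94, 0.13, 0.52, 0.53, 0.54, 0.71, 0.43, 0.28],  # dim2
    [0.84, 0.16, 0.11, 0.73, 0.32, 0.89, 0.13, 0.84, 0.56],  # dim3
    [0.41, 0.86, 0.28, 0.58, 0.33, 0.68, 0.70, 0.92, 0.23],  # dim4
]

# u_j = sum_i W'[i][j] * h[i]: dot product of h with column j of W'
u = [sum(W_prime[i][j] * h[i] for i in range(4)) for j in range(9)]

for word, score in zip(vokabular, u):
    print(f"{word:9s} {score:.2f}")
```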


### Predicting the Probability $y_{pred}$ for the Candidate Target Words

To obtain the probabilities of the possible target words, the softmax function is applied to the vector $u$. This yields, for each word $i$, its probability of being the target word.

$p(w_i | w_c) = y_{pred_i} = \frac{e^{u_i}}{\sum_{j=1}^{V}{e^{u_j}}}$

In [109]:
y_pred = F.softmax(u, dim=0)
df(y_pred)

Unnamed: 0,val
unsere,0.11
katze,0.15
heisst,0.08
mimo,0.11
die,0.08
klettert,0.12
auf,0.13
den,0.13
baum,0.08
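
The softmax formula can also be evaluated by hand. The following plain-Python sketch uses the rounded values of $u$ from above; subtracting the maximum before exponentiating is a common numerical safeguard added here as an assumption, and it does not change the result.

```python
import math

# Rounded scores u from the table above (order follows the vocabulary)
u = [1.10, 1.37, 0.69, 1.10, 0.72, 1.12, 1.25, 1.23, 0.79]

def softmax(scores):
    """Softmax over a list; shifting by the maximum avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

y_pred = softmax(u)

# The probabilities are positive and sum to 1
print(sum(y_pred))
print([round(p, 2) for p in y_pred])
```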


### Error Vector from Comparison with the Actual Target Word in One-Hot Encoding

To train the model, the prediction $y_{pred}$ is compared with the actual target word $y$ in one-hot encoding. The difference is the error vector $e$.

$e = y_{pred} - y$


In [110]:
e = y_pred - klettert
df(e)

Unnamed: 0,val
unsere,0.11
katze,0.15
heisst,0.08
mimo,0.11
die,0.08
klettert,-0.88
auf,0.13
den,0.13
baum,0.08


## Backward Pass

### Loss Function

The loss function is defined as $\mathcal{L} =-\log p(w_t | w_c)  = - \log y_{pred_t}$, where $t$ is the index of the target word $w_t$.

In the example, the current loss is thus the negative logarithm of $y_{pred}$ at the position of the target word $w_t$, i.e. 'klettert'.

In [111]:
loss = -math.log(df(y_pred).loc['klettert', 'val'])
loss

2.1406858451589

The closer the probability gets to $1$, the smaller the loss becomes; at a probability of exactly $1$ the loss is $0$.
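
How the loss reacts to the predicted probability can be seen with a few sample values. A minimal sketch in plain Python; `nll` is an illustrative helper name, and 0.1176 is the approximate probability of 'klettert' implied by the loss printed above:

```python
import math

def nll(p):
    """Negative log-likelihood of a single probability p in (0, 1]."""
    return -math.log(p)

# The loss shrinks monotonically as the probability of the target word grows
for p in (0.1176, 0.5, 0.9, 0.99):
    print(f"p = {p:.4f}  ->  loss = {nll(p):.4f}")
```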

### Backprop Output Layer => Hidden Layer

To adjust the weights of $W^{\prime}$, the derivative of $\mathcal{L}$ with respect to $W^{\prime}$ is computed.

This leads to the update rule
$\textbf{v}_{w_i}^{\prime (new)} = \textbf{v}_{w_i}^{\prime (old)} - \eta \cdot e_i \cdot \textbf{h}$

for all $i = 1, 2, \ldots, V$, where $\textbf{v}_{w_i}^{\prime}$ is the column of $W^{\prime}$ belonging to word $w_i$.
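
Why the error vector appears in this update can be sketched with the chain rule. This is a short derivation under the definitions above, where $t$ is the index of the target word, $y_j$ is entry $j$ of the one-hot target vector $y$, and $w_{ij}^{\prime}$ (an index notation chosen here) is the entry of $W^{\prime}$ in row $i$, column $j$:

$$\mathcal{L} = -\log \frac{e^{u_t}}{\sum_{k=1}^{V} e^{u_k}} = -u_t + \log \sum_{k=1}^{V} e^{u_k}$$

$$\frac{\partial \mathcal{L}}{\partial u_j} = \frac{e^{u_j}}{\sum_{k=1}^{V} e^{u_k}} - y_j = y_{pred_j} - y_j = e_j$$

$$u_j = \sum_{i=1}^{D} w_{ij}^{\prime} \, h_i \quad\Rightarrow\quad \frac{\partial \mathcal{L}}{\partial w_{ij}^{\prime}} = e_j \cdot h_i$$

One gradient descent step with learning rate $\eta$ on column $j$ of $W^{\prime}$ then gives the update rule stated above.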

The multiplication of $h$ with $e$ for all words at once can also be expressed as an outer product:


In [112]:
df(torch.ger(h.squeeze(),e.squeeze()))

Unnamed: 0,unsere,katze,heisst,mimo,die,klettert,auf,den,baum
dim1,0.04,0.06,0.03,0.04,0.03,-0.34,0.05,0.05,0.03
dim2,0.07,0.09,0.05,0.07,0.05,-0.53,0.08,0.08,0.05
dim3,0.03,0.04,0.02,0.03,0.02,-0.23,0.03,0.03,0.02
dim4,0.09,0.12,0.06,0.09,0.06,-0.7,0.11,0.1,0.07


To make the effect of the update more visible in the example, an unrealistically large learning rate of $\eta = 0.2$ is used.

In [113]:
eta = 0.2

After this step, the new weight matrix $W^{\prime}$ has the following values:

In [114]:
W_prime_new = W_prime - torch.ger(h.squeeze(),e.squeeze())*eta
df(W_prime_new)

Unnamed: 0,unsere,katze,heisst,mimo,die,klettert,auf,den,baum
dim1,0.45,0.18,0.92,0.34,0.14,0.15,0.58,0.06,0.74
dim2,0.61,0.92,0.12,0.51,0.52,0.64,0.69,0.41,0.27
dim3,0.83,0.15,0.1,0.72,0.32,0.93,0.12,0.83,0.56
dim4,0.4,0.84,0.27,0.57,0.32,0.82,0.68,0.9,0.22
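
The direction of this update can be read off the sign of the error. The plain-Python sketch below repeats the step with the rounded values of $h$, $e$ and $W^{\prime}$ from above (so the last digit may differ from the table): only the column of the target word 'klettert' has a negative error and is therefore increased, all other columns are decreased.

```python
eta = 0.2

# Rounded values from the tables above
h = [0.39, 0.60, 0.26, 0.79]
e = [0.11, 0.15, 0.08, 0.11, 0.08, -0.88, 0.13, 0.13, 0.08]
W_prime = [
    [0.45, 0.20, 0.92, 0.35, 0.15, 0.09, 0.59, 0.07, 0.75],  # dim1
    [0.63, 0.94, 0.13, 0.52, 0.53, 0.54, 0.71, 0.43, 0.28],  # dim2
    [0.84, 0.16, 0.11, 0.73, 0.32, 0.89, 0.13, 0.84, 0.56],  # dim3
    [0.41, 0.86, 0.28, 0.58, 0.33, 0.68, 0.70, 0.92, 0.23],  # dim4
]

# Outer product h e^T: entry (i, j) is h_i * e_j, the gradient w.r.t. W'[i][j]
grad = [[h_i * e_j for e_j in e] for h_i in h]

# One gradient descent step
W_prime_new = [[W_prime[i][j] - eta * grad[i][j] for j in range(9)]
               for i in range(4)]

target = 5  # position of 'klettert' in the vocabulary
print([round(W_prime_new[i][target], 2) for i in range(4)])
```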


### Backprop Hidden Layer => Input Layer

To adjust the weights of $W$, the vector EH is computed first.

$\textbf{EH}_i = \sum_{j=1}^{V} e_j \cdot w_{ij}^{\prime}$

Here $w_{ij}^{\prime}$ is the entry of $W^{\prime}$ in row $i$ and column $j$, for $i = 1, \ldots, D$.

In [115]:
EH = torch.matmul(W_prime, e)
df(EH)

Unnamed: 0,val
dim1,0.28
dim2,0.02
dim3,-0.37
dim4,-0.07
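
The vector EH is the gradient of the loss with respect to the hidden layer. A short sketch of the chain rule behind it, using $x_k$ for entry $k$ of the one-hot input vector, $w_{ki}$ for the entry of $W$ in row $k$, column $i$, and $w_{ij}^{\prime}$ for the entry of $W^{\prime}$ in row $i$, column $j$ (index names chosen here for illustration):

$$\frac{\partial \mathcal{L}}{\partial h_i} = \sum_{j=1}^{V} \frac{\partial \mathcal{L}}{\partial u_j} \cdot \frac{\partial u_j}{\partial h_i} = \sum_{j=1}^{V} e_j \cdot w_{ij}^{\prime} = \textbf{EH}_i$$

$$h_i = \sum_{k=1}^{V} w_{ki} \, x_k \quad\Rightarrow\quad \frac{\partial \mathcal{L}}{\partial w_{ki}} = x_k \cdot \textbf{EH}_i$$

Since $x_k$ is $1$ only for the context word and $0$ otherwise, the gradient is non-zero only in that one row of $W$.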


The update of $W$ is therefore

$\textbf{v}_{w_c}^{(new)} = \textbf{v}_{w_c}^{(old)} - \eta \textbf{EH}^T$

Only the values in the row of the context word $w_c$, i.e. 'katze', are adjusted, since only these entered the computation.

This gives for $W^{(new)}$:

In [116]:
W_new = W - torch.ger(katze.squeeze(), EH.squeeze())*eta
df(W_new)

Unnamed: 0,dim1,dim2,dim3,dim4
unsere,0.88,0.92,0.38,0.96
katze,0.33,0.6,0.33,0.81
heisst,0.94,0.13,0.93,0.59
mimo,0.87,0.57,0.74,0.43
die,0.89,0.57,0.27,0.63
klettert,0.27,0.44,0.3,0.83
auf,0.11,0.27,0.36,0.2
den,0.55,0.01,0.95,0.08
baum,0.89,0.58,0.34,0.81


### Computation Using Autograd

Thanks to the autograd functionality of PyTorch, the same result can be obtained without deriving the gradients by hand:

In [117]:
target_position = torch.tensor([5])

softmax_function = torch.nn.LogSoftmax(dim=1)
loss_function = torch.nn.NLLLoss()
y_pred_pytorch = softmax_function(u.view(1,-1))
loss_pytorch = loss_function(y_pred_pytorch, target_position)
loss_pytorch

tensor(2.1407)

Calling the backward() function now computes the derivatives automatically:

In [118]:
loss_pytorch.backward()

This gives the same result as the manual derivation.

For $W^{\prime}$:

In [119]:
df(W_prime.grad)

Unnamed: 0,unsere,katze,heisst,mimo,die,klettert,auf,den,baum
dim1,0.04,0.06,0.03,0.04,0.03,-0.34,0.05,0.05,0.03
dim2,0.07,0.09,0.05,0.07,0.05,-0.53,0.08,0.08,0.05
dim3,0.03,0.04,0.02,0.03,0.02,-0.23,0.03,0.03,0.02
dim4,0.09,0.12,0.06,0.09,0.06,-0.7,0.11,0.1,0.07


and for $W$:

In [120]:
df(W.grad)

Unnamed: 0,dim1,dim2,dim3,dim4
unsere,0.0,0.0,-0.0,-0.0
katze,0.28,0.02,-0.37,-0.07
heisst,0.0,0.0,-0.0,-0.0
mimo,0.0,0.0,-0.0,-0.0
die,0.0,0.0,-0.0,-0.0
klettert,0.0,0.0,-0.0,-0.0
auf,0.0,0.0,-0.0,-0.0
den,0.0,0.0,-0.0,-0.0
baum,0.0,0.0,-0.0,-0.0
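
The agreement between the manual derivation and autograd can also be cross-checked numerically. The following sketch, in plain Python and with the rounded values of $u$ from the forward pass, compares the analytic gradient $e = y_{pred} - y$ with central finite differences of the loss; `loss_of` and `eps` are names chosen for this illustration.

```python
import math

# Rounded scores u from the forward pass; the target 'klettert' is at index 5
u = [1.10, 1.37, 0.69, 1.10, 0.72, 1.12, 1.25, 1.23, 0.79]
target = 5

def loss_of(scores, t):
    """Negative log-softmax of entry t, computed via log-sum-exp."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[t]

# Analytic gradient with respect to u: softmax(u) minus the one-hot target
m = max(u)
exps = [math.exp(s - m) for s in u]
y_pred = [v / sum(exps) for v in exps]
analytic = [y_pred[j] - (1.0 if j == target else 0.0) for j in range(len(u))]

# Numerical gradient: central difference in each coordinate of u
eps = 1e-6
numeric = []
for j in range(len(u)):
    up, down = list(u), list(u)
    up[j] += eps
    down[j] -= eps
    numeric.append((loss_of(up, target) - loss_of(down, target)) / (2 * eps))

max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)
```

A difference close to zero indicates that the hand-derived gradient and the slope of the loss agree.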


## CBOW

The only change compared to the single-word context architecture is that not just 1 word is used as context, but the c words before and after the target word (here $c = 2$).



|     | context | context  | target | context | context |
|:---:|:-------:|:--------:|:------:|:-------:|:-------:|
| die |  katze  | klettert |  auf   |   den   |  baum   |


These are the input words:

In [121]:
katze = torch.tensor(np.asarray([0,1,0,0,0,0,0,0,0]).reshape(len(vokabular),1), dtype=torch.float)
klettert = torch.tensor(np.asarray([0,0,0,0,0,1,0,0,0]).reshape(len(vokabular),1), dtype=torch.float)
den = torch.tensor(np.asarray([0,0,0,0,0,0,0,1,0]).reshape(len(vokabular),1), dtype=torch.float)
baum = torch.tensor(np.asarray([0,0,0,0,0,0,0,0,1]).reshape(len(vokabular),1), dtype=torch.float)

df(katze)
df(klettert)
df(den)
df(baum)

Unnamed: 0,val
unsere,0
katze,1
heisst,0
mimo,0
die,0
klettert,0
auf,0
den,0
baum,0


Unnamed: 0,val
unsere,0
katze,0
heisst,0
mimo,0
die,0
klettert,1
auf,0
den,0
baum,0


Unnamed: 0,val
unsere,0
katze,0
heisst,0
mimo,0
die,0
klettert,0
auf,0
den,1
baum,0


Unnamed: 0,val
unsere,0
katze,0
heisst,0
mimo,0
die,0
klettert,0
auf,0
den,0
baum,1


### Forward Pass

For the subsequent computation, these one-hot encodings are averaged.

In [122]:
x = (katze+klettert+den+baum)/4
df(x)

Unnamed: 0,val
unsere,0.0
katze,0.25
heisst,0.0
mimo,0.0
die,0.0
klettert,0.25
auf,0.0
den,0.25
baum,0.25


The target word is encoded as before:

In [123]:
auf = torch.tensor(np.asarray([0,0,0,0,0,0,1,0,0]).reshape(len(vokabular),1), dtype=torch.float)
df(auf)

Unnamed: 0,val
unsere,0
katze,0
heisst,0
mimo,0
die,0
klettert,0
auf,1
den,0
baum,0


The computations now proceed in exactly the same way:

In [124]:
r = torch.manual_seed(42)
W = Variable(torch.rand(len(vokabular), len(dims)), requires_grad=True)
df(W)

Unnamed: 0,dim1,dim2,dim3,dim4
unsere,0.88,0.92,0.38,0.96
katze,0.39,0.6,0.26,0.79
heisst,0.94,0.13,0.93,0.59
mimo,0.87,0.57,0.74,0.43
die,0.89,0.57,0.27,0.63
klettert,0.27,0.44,0.3,0.83
auf,0.11,0.27,0.36,0.2
den,0.55,0.01,0.95,0.08
baum,0.89,0.58,0.34,0.81


In [125]:
h = torch.matmul(W.t(),x)
df(h)

Unnamed: 0,val
dim1,0.52
dim2,0.41
dim3,0.46
dim4,0.63
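
Multiplying $W^T$ with the averaged one-hot vector is the same as averaging the embeddings of the context words. A plain-Python sketch with the rounded values of $W$ from above; `h_from_x` and `h_from_rows` are illustrative names:

```python
vokabular = ['unsere', 'katze', 'heisst', 'mimo', 'die',
             'klettert', 'auf', 'den', 'baum']

# Rounded copy of W (V x D) from the table above
W = [
    [0.88, 0.92, 0.38, 0.96],  # unsere
    [0.39, 0.60, 0.26, 0.79],  # katze
    [0.94, 0.13, 0.93, 0.59],  # heisst
    [0.87, 0.57, 0.74, 0.43],  # mimo
    [0.89, 0.57, 0.27, 0.63],  # die
    [0.27, 0.44, 0.30, 0.83],  # klettert
    [0.11, 0.27, 0.36, 0.20],  # auf
    [0.55, 0.01, 0.95, 0.08],  # den
    [0.89, 0.58, 0.34, 0.81],  # baum
]

context = ['katze', 'klettert', 'den', 'baum']

# Variant 1: averaged one-hot input x, then h = W^T . x
x = [context.count(w) / len(context) for w in vokabular]
h_from_x = [sum(W[k][i] * x[k] for k in range(9)) for i in range(4)]

# Variant 2: look up the four embedding rows and average them directly
rows = [W[vokabular.index(w)] for w in context]
h_from_rows = [sum(r[i] for r in rows) / len(rows) for i in range(4)]

print([round(v, 2) for v in h_from_rows])
```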


In [126]:
r = torch.manual_seed(43)
W_prime = Variable(torch.rand(len(dims), len(vokabular)), requires_grad=True)
df(W_prime)

Unnamed: 0,unsere,katze,heisst,mimo,die,klettert,auf,den,baum
dim1,0.45,0.2,0.92,0.35,0.15,0.09,0.59,0.07,0.75
dim2,0.63,0.94,0.13,0.52,0.53,0.54,0.71,0.43,0.28
dim3,0.84,0.16,0.11,0.73,0.32,0.89,0.13,0.84,0.56
dim4,0.41,0.86,0.28,0.58,0.33,0.68,0.7,0.92,0.23


In [127]:
u = torch.matmul(W_prime.t(),h)
df(u)

Unnamed: 0,val
unsere,1.14
katze,1.1
heisst,0.76
mimo,1.09
die,0.65
klettert,1.1
auf,1.1
den,1.17
baum,0.91


In [128]:
y_pred = F.softmax(u, dim=0)
df(y_pred)

Unnamed: 0,val
unsere,0.13
katze,0.12
heisst,0.09
mimo,0.12
die,0.08
klettert,0.12
auf,0.12
den,0.13
baum,0.1


Based on the target word, the error is computed again.

In [129]:
e = y_pred - auf
df(e)

Unnamed: 0,val
unsere,0.13
katze,0.12
heisst,0.09
mimo,0.12
die,0.08
klettert,0.12
auf,-0.88
den,0.13
baum,0.1


### Backward Pass

The backward pass also works analogously:

In [130]:
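# Index 5 selects 'klettert'; the CBOW target 'auf' encoded above sits at
# index 6 (vokabular.index('auf')). The outputs below were recorded with index 5.
target_position = torch.tensor([5])

softmax_function = torch.nn.LogSoftmax(dim=1)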
loss_function = torch.nn.NLLLoss()
y_pred_pytorch = softmax_function(u.view(1,-1))
loss_pytorch = loss_function(y_pred_pytorch, target_position)
loss_pytorch

tensor(2.1155)

In [131]:
loss_pytorch.backward()

In [132]:
df(W_prime.grad)

Unnamed: 0,unsere,katze,heisst,mimo,die,klettert,auf,den,baum
dim1,0.07,0.06,0.05,0.06,0.04,-0.46,0.06,0.07,0.05
dim2,0.05,0.05,0.04,0.05,0.03,-0.36,0.05,0.05,0.04
dim3,0.06,0.06,0.04,0.06,0.04,-0.41,0.06,0.06,0.05
dim4,0.08,0.08,0.05,0.08,0.05,-0.55,0.08,0.08,0.06


In [133]:
df(W.grad)

Unnamed: 0,dim1,dim2,dim3,dim4
unsere,0.0,0.0,-0.0,-0.0
katze,0.07,0.0,-0.09,-0.02
heisst,0.0,0.0,-0.0,-0.0
mimo,0.0,0.0,-0.0,-0.0
die,0.0,0.0,-0.0,-0.0
klettert,0.07,0.0,-0.09,-0.02
auf,0.0,0.0,-0.0,-0.0
den,0.07,0.0,-0.09,-0.02
baum,0.07,0.0,-0.09,-0.02


Notably, the word embeddings of all 4 input words are now adjusted, each by the same gradient.
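
All four non-zero rows of the gradient above are identical, which follows from the averaging step. A brief sketch, writing $C$ for the number of context words (here $C = 4$; the symbol is introduced for illustration) and $w_{c_1}, \ldots, w_{c_C}$ for the context words:

$$\textbf{h} = \frac{1}{C} \sum_{k=1}^{C} \textbf{v}_{w_{c_k}}^T \quad\Rightarrow\quad \frac{\partial \mathcal{L}}{\partial \textbf{v}_{w_{c_k}}} = \frac{1}{C} \, \textbf{EH}^T \quad \text{for } k = 1, \ldots, C$$

$$\textbf{v}_{w_{c_k}}^{(new)} = \textbf{v}_{w_{c_k}}^{(old)} - \frac{1}{C} \cdot \eta \cdot \textbf{EH}^T$$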

## Skip-Gram

In the skip-gram architecture, only a single word is used as input, and based on this word the c neighbouring words before and after the input word are predicted.


|     | output | output   | input    |  output   |  output    |
|:---:|:-------:|:--------:|:---:|:---:|:----:|
| die |  katze  | klettert | auf | den | baum |


### Forward Pass

In [134]:
auf = torch.tensor(np.asarray([0,0,0,0,0,0,1,0,0]).reshape(len(vokabular),1), dtype=torch.float)
df(auf)

Unnamed: 0,val
unsere,0
katze,0
heisst,0
mimo,0
die,0
klettert,0
auf,1
den,0
baum,0


In [135]:
r = torch.manual_seed(42)
W = Variable(torch.rand(len(vokabular), len(dims)), requires_grad=True)
df(W)

Unnamed: 0,dim1,dim2,dim3,dim4
unsere,0.88,0.92,0.38,0.96
katze,0.39,0.6,0.26,0.79
heisst,0.94,0.13,0.93,0.59
mimo,0.87,0.57,0.74,0.43
die,0.89,0.57,0.27,0.63
klettert,0.27,0.44,0.3,0.83
auf,0.11,0.27,0.36,0.2
den,0.55,0.01,0.95,0.08
baum,0.89,0.58,0.34,0.81


In [136]:
h = torch.matmul(W.t(), auf)
df(h)

Unnamed: 0,val
dim1,0.11
dim2,0.27
dim3,0.36
dim4,0.2


In [137]:
r = torch.manual_seed(43)
W_prime = Variable(torch.rand(len(dims), len(vokabular)), requires_grad=True)
df(W_prime)

Unnamed: 0,unsere,katze,heisst,mimo,die,klettert,auf,den,baum
dim1,0.45,0.2,0.92,0.35,0.15,0.09,0.59,0.07,0.75
dim2,0.63,0.94,0.13,0.52,0.53,0.54,0.71,0.43,0.28
dim3,0.84,0.16,0.11,0.73,0.32,0.89,0.13,0.84,0.56
dim4,0.41,0.86,0.28,0.58,0.33,0.68,0.7,0.92,0.23


In [138]:
u = torch.matmul(W_prime.t(),h)
df(u)

Unnamed: 0,val
unsere,0.6
katze,0.5
heisst,0.23
mimo,0.55
die,0.34
klettert,0.61
auf,0.44
den,0.61
baum,0.4


In [139]:
y_pred = F.softmax(u, dim=0)
df(y_pred)

Unnamed: 0,val
unsere,0.12
katze,0.11
heisst,0.09
mimo,0.12
die,0.1
klettert,0.13
auf,0.11
den,0.13
baum,0.1


Based on this prediction vector, the corresponding error vector is now computed for each of the output words.

In [140]:
katze = torch.tensor(np.asarray([0,1,0,0,0,0,0,0,0]).reshape(len(vokabular),1), dtype=torch.float)
klettert = torch.tensor(np.asarray([0,0,0,0,0,1,0,0,0]).reshape(len(vokabular),1), dtype=torch.float)

den = torch.tensor(np.asarray([0,0,0,0,0,0,0,1,0]).reshape(len(vokabular),1), dtype=torch.float)
baum = torch.tensor(np.asarray([0,0,0,0,0,0,0,0,1]).reshape(len(vokabular),1), dtype=torch.float)

In [154]:
y_pred.size()
y_pred.repeat((1,4))

torch.Size([9, 1])

tensor([[ 0.1249,  0.1249,  0.1249,  0.1249],
        [ 0.1132,  0.1132,  0.1132,  0.1132],
        [ 0.0860,  0.0860,  0.0860,  0.0860],
        [ 0.1192,  0.1192,  0.1192,  0.1192],
        [ 0.0964,  0.0964,  0.0964,  0.0964],
        [ 0.1259,  0.1259,  0.1259,  0.1259],
        [ 0.1064,  0.1064,  0.1064,  0.1064],
        [ 0.1256,  0.1256,  0.1256,  0.1256],
        [ 0.1024,  0.1024,  0.1024,  0.1024]])

### Backward Pass

In [None]:
target_position = torch.tensor([1,5,7,8])

softmax_function = torch.nn.LogSoftmax(dim=1)
loss_function = torch.nn.NLLLoss()
y_pred_pytorch = softmax_function(u.view(1,-1))

So that the loss is computed for each of the 4 output words, the vector of log-probabilities is replicated 4 times, one row per output word.

In [149]:
y_pred_pytorch.repeat(4,1)


tensor([[-2.0804, -2.1787, -2.4533, -2.1273, -2.3393, -2.0725, -2.2403,
         -2.0743, -2.2789],
        [-2.0804, -2.1787, -2.4533, -2.1273, -2.3393, -2.0725, -2.2403,
         -2.0743, -2.2789],
        [-2.0804, -2.1787, -2.4533, -2.1273, -2.3393, -2.0725, -2.2403,
         -2.0743, -2.2789],
        [-2.0804, -2.1787, -2.4533, -2.1273, -2.3393, -2.0725, -2.2403,
         -2.0743, -2.2789]])

In [150]:
loss_pytorch = loss_function(y_pred_pytorch.repeat(4,1), target_position)
loss_pytorch

tensor(2.1511)
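
The loss value above can be reproduced by hand from the log-probabilities. `torch.nn.NLLLoss` averages over the rows by default, so the result is the mean of the four negative log-probabilities rather than their sum. A plain-Python sketch using the rounded values printed above:

```python
# Log-probabilities from the LogSoftmax output above (one row; all rows are equal)
log_probs = [-2.0804, -2.1787, -2.4533, -2.1273, -2.3393,
             -2.0725, -2.2403, -2.0743, -2.2789]

# Positions of the output words katze, klettert, den, baum
targets = [1, 5, 7, 8]

# Negative log-likelihood per output word
per_word = [-log_probs[t] for t in targets]

sum_loss = sum(per_word)               # summed over the four output words
mean_loss = sum_loss / len(per_word)   # averaged, as NLLLoss does by default

print(round(mean_loss, 4))  # 2.1511
```

The summed loss is 4 times larger, so gradients under a summed objective would be scaled by the same factor.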

In [151]:
loss_pytorch.backward()

In [152]:
df(W_prime.grad)

Unnamed: 0,unsere,katze,heisst,mimo,die,klettert,auf,den,baum
dim1,0.01,-0.01,0.01,0.01,0.01,-0.01,0.01,-0.01,-0.02
dim2,0.03,-0.04,0.02,0.03,0.03,-0.03,0.03,-0.03,-0.04
dim3,0.04,-0.05,0.03,0.04,0.03,-0.04,0.04,-0.04,-0.05
dim4,0.02,-0.03,0.02,0.02,0.02,-0.02,0.02,-0.02,-0.03
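
The entries of this gradient can be reproduced from the forward pass. With the averaged loss, the error for word $j$ is $y_{pred_j}$ minus the share of output words equal to $j$ (here $1/4$ for each of the four output words, $0$ otherwise), multiplied by $h$. A plain-Python sketch with the rounded values from above; `share` and `err` are illustrative names:

```python
# Rounded values from the skip-gram forward pass above
h = [0.11, 0.27, 0.36, 0.20]
y_pred = [0.1249, 0.1132, 0.0860, 0.1192, 0.0964,
          0.1259, 0.1064, 0.1256, 0.1024]

# Positions of the output words katze, klettert, den, baum
targets = [1, 5, 7, 8]

# Share of the output words that fall on each vocabulary position
share = [targets.count(j) / len(targets) for j in range(9)]

# Averaged error per word, then the outer product with h
err = [y_pred[j] - share[j] for j in range(9)]
grad = [[h_i * err_j for err_j in err] for h_i in h]

for row in grad:
    print([round(v, 2) for v in row])
```

The four output words end up with negative entries, so a gradient descent step increases their scores, while all other columns are pushed down.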


In [153]:
df(W.grad)

Unnamed: 0,dim1,dim2,dim3,dim4
unsere,0.0,-0.0,-0.0,-0.0
katze,0.0,-0.0,-0.0,-0.0
heisst,0.0,-0.0,-0.0,-0.0
mimo,0.0,-0.0,-0.0,-0.0
die,0.0,-0.0,-0.0,-0.0
klettert,0.0,-0.0,-0.0,-0.0
auf,0.1,-0.01,-0.07,-0.1
den,0.0,-0.0,-0.0,-0.0
baum,0.0,-0.0,-0.0,-0.0


For $W$, again only the word embedding of the single input word is adjusted, while the remaining rows stay unchanged.