# Ch 2. Word2Vec

The goal of Word2Vec is to produce good word embeddings. To do this, the skip-gram variant maximizes the average log-probability of seeing a context word $w_{i+j}$ given a target word $w_i$, taken over every position in the text:

$$
\frac{1}{N} \sum_{i=1}^{N} \sum_{-c \leq j \leq c, j \neq 0} \log P(w_{i+j}|w_i)
$$

where $N$ is the number of words in the text and $c$ is the size of the context window, i.e. the number of words considered on each side of the target word.
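
To make the objective concrete, here is a minimal sketch that evaluates it on a toy corpus. It assumes $P(w_{i+j}|w_i)$ is a full softmax over dot products between an input ("target") vector and output ("context") vectors, which is the standard skip-gram formulation but is not spelled out above. The function names, the toy vocabulary, and the random vectors are all hypothetical, and positions that fall outside the text are simply skipped.

```python
import math
import random

def softmax_log_prob(target_vec, context_idx, output_vecs):
    """log P(context | target) under a full softmax over the vocabulary."""
    scores = [sum(t * o for t, o in zip(target_vec, out)) for out in output_vecs]
    max_score = max(scores)  # subtract the max for numerical stability
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return scores[context_idx] - log_norm

def skipgram_objective(token_ids, in_vecs, out_vecs, c):
    """(1/N) * sum_i sum_{-c <= j <= c, j != 0} log P(w_{i+j} | w_i)."""
    n = len(token_ids)
    total = 0.0
    for i in range(n):
        for j in range(-c, c + 1):
            if j == 0 or not 0 <= i + j < n:
                continue  # skip the target itself and positions outside the text
            total += softmax_log_prob(in_vecs[token_ids[i]], token_ids[i + j], out_vecs)
    return total / n

# Toy setup (all values hypothetical): 4-word vocabulary, 3-dimensional vectors.
rng = random.Random(0)
toy_vocab = ["the", "trump", "tariffs", "then"]
toy_ids = [0, 1, 2, 3, 0]  # "the trump tariffs then the"
in_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in toy_vocab]
out_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in toy_vocab]

objective = skipgram_objective(toy_ids, in_vecs, out_vecs, c=2)
print(f"average log-likelihood: {objective:.4f}")
```

Training adjusts the vectors to push this value up towards zero. In practice the full softmax is too expensive for large vocabularies, which is why libraries fall back on approximations such as negative sampling or hierarchical softmax.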

In [1]:
import numpy as np
from gensim.models.word2vec import Word2Vec

## Data

Source: https://www.zerohedge.com/markets/china-canada-retaliate-against-trumps-tariff-war-global-stocks-slide

In [2]:
text = """First there were the Trump Tariffs... then the Retaliation. Then as Europe closed, chatter about a possible Ukraine minerals deal started to gain momentum lifting stocks to the highs.
With a few minutes left in the day, a large MoC sell imbalance sent stocks back down hard...
But, just as we predicted here... Commerce Secretary Howard Lutnick spoke on Fox Business, offering the dovish olive branch: 
"Both the Mexicans and the Canadians were on the phone with me all day today trying to show that they'll do better, and the president's listening, because you know he's very, very fair and very reasonable," Lutnick said in an interview with Fox Business. 
"So I think he's going to work something out with them — it's not going to be a pause, none of that pause stuff, but I think he's going to figure out: you do more and I'll meet you in the middle some way and we're going to probably announcing that tomorrow."
As Bloomberg reports, Lutnick did not explicitly say what President Donald Trump was considering after imposing an across-the-board tariff on all goods from Canada and Mexico that went into effect overnight. 
Lutnick said that the tariffs would likely land "somewhere in the middle" with Trump "moving with the Canadians and Mexicans, but not all the way."
Lutnick discounted the notion that the tariffs would be fully rolled back, pointing instead to the US-Mexico-Canada trade pact negotiated during Trump's first term.
"If you live under those rules, then the president is considering giving you relief," Lutnick said.
"If you haven't lived under those rules, well, then you have to pay the tariff."
and lifted US equity futures higher after the close... And the Peso and Loonie are also bid with both hands and feet... So, the Trump 1.0 tariff playbook analog continues. Asian markets closed lower, European stocks are in the red, and US equity futures are trending lower this morning as worsening global trade war concerns weigh on risk sentiment.
On Monday, President Trump reiterated that he would impose tariffs on imports from Canada and Mexico starting Tuesday, stating that there was "no room left" for negotiation. He also noted that an additional 10% levy would be applied to imports from China.
Fast-forward to Tuesday morning. Trump's 25% tariffs on goods from Mexico and Canada took effect, prompting Canada to retaliate with 25% tariffs on $100 billion worth of US imports. Mexico is expected to respond later.
Trump also introduced an additional 10% tariff on Chinese imports early Tuesday, bringing the total tax to 20% following a similar increase last month. China swiftly retaliated with tariffs on US food and agricultural products and an export ban on some defense firms. 
According to an announcement by the Chinese Ministry of Finance, Beijing imposed new duties of 10% to 15% on US food and agricultural products. 
Here's an excerpt from the announcement: 
15% tariff will be imposed on chicken, wheat, corn, and cotton.
10% tariff will be imposed on sorghum, soybeans, pork, beef, aquatic products, fruits, vegetables, and dairy products
For the imported goods listed in the appendix originating from the United States, corresponding tariffs will be levied on the basis of the current applicable tariff rates. The current bonded and tax reduction and exemption policies remain unchanged, and the additional tariffs will not be reduced or exempted
Goods that have been shipped from the place of departure prior to March 10, 2025, and imported from March 10, 2025 to April 12, 2025, shall not be subject to the additional tariffs prescribed in this announcement
Commenting on China's retaliatory tariffs, Lynn Song, chief economist for Greater China at ING Bank, told clients: "The measures are still relatively measured for now. I think this retaliation shows China remains patient and has refrained from 'flipping the table' so to speak despite the recent escalation."
"China's hit-back isn't exactly aggressive — a 15% tariff on US agricultural goods, but nothing broad-based on tech or autos, suggests to me they're leaving room for negotiation," said Billy Leung, an investment strategist at Global X ETF, adding, "That's probably why Chinese stocks are rebounding instead of selling off harder."
Dilin Wu, a research strategist at Pepperstone Group Ltd., said, "The immediate impact of these new tariffs on China remains manageable — the measures are currently concentrated in specific areas."
"Should Beijing roll out additional pro-growth measures — such as large-scale fiscal stimulus or targeted support for high-tech industries and domestic consumption — it could further bolster market confidence," Wu said. 
Sea of red for global equity futures across most regions. China's Ministry of Finance warned:  "The US's unilateral tariff increase damages the multilateral trading system, increases the burden on US companies and consumers, and undermines the foundation of economic and trade cooperation between China and the US." 
Should Trump respond with retaliatory tariffs, the risk of sentiment continuation could continue. In the US, the growth outlook has dimmed as the troubling narrative of "growth scare and tariffs" takes center stage. 
""".replace('.', '').split()

In [3]:
text[0:5]

['First', 'there', 'were', 'the', 'Trump']

In [4]:
vocab = set(text)
VOCAB_SIZE = len(vocab)
print(f'n_vocab={VOCAB_SIZE}')

n_vocab=471


## Extract skipgrams

NOTE: We don't directly use these in the Word2Vec training below. Word2Vec takes care of skipgrams itself.

In [5]:
CONTEXT_SIZE = 2

In [6]:
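skipgrams = []
# Skip the first and last CONTEXT_SIZE tokens so every target has a full window.
for i in range(CONTEXT_SIZE, len(text) - CONTEXT_SIZE):
    # Context = the CONTEXT_SIZE tokens on each side of position i, excluding i itself.
    context = [text[j] for j in range(i - CONTEXT_SIZE, i + CONTEXT_SIZE + 1) if j != i]
    skipgrams.append((text[i], context))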

for sg in skipgrams[0:5]:
    print(sg)

('were', ['First', 'there', 'the', 'Trump'])
('the', ['there', 'were', 'Trump', 'Tariffs'])
('Trump', ['were', 'the', 'Tariffs', 'then'])
('Tariffs', ['the', 'Trump', 'then', 'the'])
('then', ['Trump', 'Tariffs', 'the', 'Retaliation'])
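
The loop above drops the first and last `CONTEXT_SIZE` tokens as targets. As an alternative, the sketch below flattens the windows into individual (target, context) pairs and truncates the window at the text boundaries instead of skipping those positions. The helper name `skipgram_pairs` and the five-token sample are hypothetical; this is only an illustration and not how gensim builds its training examples internally.

```python
def skipgram_pairs(tokens, context_size):
    """Yield (target, context) pairs, truncating the window at the text boundaries."""
    for i, target in enumerate(tokens):
        lo = max(0, i - context_size)
        hi = min(len(tokens), i + context_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                yield target, tokens[j]

sample = ["First", "there", "were", "the", "Trump"]
pairs = list(skipgram_pairs(sample, context_size=2))
print(pairs[:4])
# [('First', 'there'), ('First', 'were'), ('there', 'First'), ('there', 'were')]
```

Each pair corresponds to one $\log P(w_{i+j}|w_i)$ term in the objective.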


## Train Word2Vec model

This generates the embeddings from the input text.

Word2Vec params:

- `vector_size=10`: This sets the dimensionality of the word vectors (embeddings) to 10.
- `min_count=0`: This specifies that all words, regardless of their frequency, will be included in the training. Typically, a higher value is used to ignore infrequent words.
- `window=2`: This sets the context window size to 2, meaning the model will consider up to 2 words on either side of the target word.
- `workers=2`: This sets the number of worker threads to use for training the model. More workers can speed up the training process.
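- `seed=0`: This sets the random seed. Note that gensim only produces fully reproducible results when training is also limited to a single worker thread (`workers=1`) and, across interpreter launches, when `PYTHONHASHSEED` is fixed. With `workers=2`, thread scheduling can still cause small run-to-run differences.
- `hs=0` (default): Whether to use hierarchical softmax. With `hs=0` and the default `negative=5`, the model is trained with negative sampling; `hs=1` switches to hierarchical softmax.
- `sg=0` (default): The training algorithm. `0` selects CBOW and `1` selects skip-gram, so the call below trains a CBOW model unless `sg=1` is passed.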

In [9]:
model = Word2Vec(
    [text],
    vector_size=10,
    min_count=0,
    window=CONTEXT_SIZE,
    workers=2,
    seed=0,
)
print(f"Shape of W_embed: {model.wv.vectors.shape}")
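# Passing the corpus to the constructor already builds the vocabulary and trains
# for the default number of epochs (epochs=5); this call trains for 10 more.
model.train([text], total_examples=model.corpus_count, epochs=10)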

Shape of W_embed: (471, 10)


(6321, 8340)

In [10]:
for i in range(5):
    print(model.wv[i])

[ 0.07530797  0.0355139   0.00726908 -0.06308061 -0.00896571 -0.07036778
 -0.10100172 -0.11606818 -0.06701231  0.09070873]
[ 0.03735185  0.09289736  0.00495964  0.00534488  0.12825835  0.07173234
  0.01003004 -0.0108858   0.00943153  0.11662143]
[-0.03754295  0.0706495   0.03944238 -0.11592363  0.00564743  0.09371154
 -0.00333768 -0.11029505  0.0513505   0.07139111]
[ 0.07683776 -0.05998567 -0.0788378   0.06031892 -0.07022838  0.02802407
 -0.09824231 -0.05601652 -0.00478628  0.00526517]
[-0.01383741 -0.08972961 -0.09708099 -0.08490784 -0.07819115  0.05189092
 -0.00505868  0.0177676  -0.05141972  0.04116134]
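
A common way to compare two embeddings is cosine similarity. The sketch below is a dependency-free illustration that uses the first two rows printed above; the helper name `cosine_similarity` is hypothetical. With so little training text, the resulting number should not be read as a meaningful statement about the words involved.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# First two embedding rows, copied from the printed output above.
vec_0 = [0.07530797, 0.0355139, 0.00726908, -0.06308061, -0.00896571,
         -0.07036778, -0.10100172, -0.11606818, -0.06701231, 0.09070873]
vec_1 = [0.03735185, 0.09289736, 0.00495964, 0.00534488, 0.12825835,
         0.07173234, 0.01003004, -0.0108858, 0.00943153, 0.11662143]

print(f"cosine similarity: {cosine_similarity(vec_0, vec_1):.4f}")
```

With the trained model at hand, `model.wv.index_to_key[i]` gives the word behind row `i`, and `model.wv.similarity(word_a, word_b)` computes the same quantity directly.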
