### Matrix and Matrix multiplication

#### Set-1

In [1]:
'''
SET-1 : Matrix × Matrix Multiplication (Foundations & Intuition)

Q1. What does matrix × matrix multiplication mean in simple words?
Ans. It means applying one matrix transformation to many vectors at once.
'''
# Example
# Matrix × vector = one transformation
# Matrix × matrix = many such transformations together


'''
Q2. What is the shape rule for matrix × matrix multiplication?
Ans. The number of columns in the first matrix must equal the number of rows in the second.
'''
# Example
# A shape = (m, n)
# B shape = (n, p)
# A × B   = (m, p)


'''
Q3. Why must the inner dimensions match?
Ans. Because each dot product pairs a row of the first matrix with a column of the second, and the two must have the same length.
'''
# Example
# (2, 3) × (3, 2) → valid
# (2, 3) × (4, 2) → ❌ invalid


'''
Q4. How is each element of the result matrix computed?
Ans. Each element is the dot product of one row from A and one column from B.
'''
# Example
row_A    = [1, 2, 3]
column_B = [7, 9, 11]
# Dot product = 1*7 + 2*9 + 3*11 = 58


'''
Q5. What is the step-by-step meaning of matrix multiplication?
Ans. For each row of A and each column of B, compute a dot product.
'''
# Example
A = [
    [1, 2, 3],
    [4, 5, 6]
]
B = [
    [7,  8],
    [9, 10],
    [11,12]
]
# Result shape = (2, 2)


'''
Q6. How can matrix × matrix be seen as many matrix × vector operations?
Ans. Each column of B is treated as a vector and multiplied by A.
'''
# Example
# A × B = [A·b1 | A·b2]
# where b1 and b2 are columns of B


q='''
Q7. Why does matrix multiplication order matter?
Ans. Because row-column alignment changes when the order is reversed.
'''
# Example
# A × B ≠ B × A (in general)
# The shapes can differ, and even when they match the values usually differ
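
To make Q7 concrete, here is a minimal sketch. It reuses the `A` and `B` from Q5 and adds two small square matrices, `P` and `R`, which are illustrative names that do not appear in the original. The first pair shows the shape changing with order; the second shows that the values can differ even when the shape does not.

```python
import numpy as np

# A and B are the matrices from Q5.
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])

# Reversing the order changes the shape of the result.
print((A @ B).shape)   # (2, 2)
print((B @ A).shape)   # (3, 3)

# P and R are illustrative square matrices (an assumption, not from the notebook).
# Both orders give a (2, 2) result, but the values differ.
P = np.array([[1, 2],
              [3, 4]])
R = np.array([[0, 1],
              [1, 0]])
print(P @ R)   # columns of P swapped: rows are 2 1 and 4 3
print(R @ P)   # rows of P swapped:    rows are 3 4 and 1 2
```

Multiplying by `R` on the right swaps columns, while multiplying on the left swaps rows, which is one way to see that the two orders do different things.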


#### Set-2

In [2]:
'''
SET-2 : Matrix × Matrix Multiplication (AI & Practical Usage)

Q1. Why is matrix × matrix multiplication essential in machine learning?
Ans. Because it transforms entire datasets in one operation.
'''
# Example
# X (N, D) × W (D, H) → Output (N, H)


'''
Q2. How does matrix multiplication represent a neural network layer?
Ans. Each column of the weight matrix represents one neuron applied to all samples.
'''
# Example
# Each column in W produces one output feature


'''
Q3. How does matrix × matrix multiplication relate to embeddings?
Ans. It projects embeddings into new spaces like queries, keys, and values.
'''
# Example
# E × Wq → query embeddings
# E × Wk → key embeddings
# E × Wv → value embeddings


'''
Q4. Why is matrix multiplication not commutative?
Ans. Because rows and columns play different roles in the operation.
'''
# Example
# A shape = (2, 3)
# B shape = (3, 2)
# A × B has shape (2, 2); B × A also exists but has shape (3, 3), so the two cannot be equal


'''
Q5. What is a common beginner mistake in matrix × matrix multiplication?
Ans. Confusing it with element-wise multiplication.
'''
# Example
# A * B → element-wise (different operation)
# A × B → row-column dot products


'''
Q6. How does matrix multiplication enable parallel computation on GPUs?
Ans. Each row-column dot product is independent and can be computed in parallel.
'''
# Example
# Thousands of dot products computed simultaneously


q='''
Q7. What is the key mental rule to remember?
Ans. Row of first matrix talks to column of second matrix using dot product.
'''
# Example
# (A × B)[i][j] = row_i(A) · column_j(B)
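
The rule in Q7 and the pitfall in Q5 can be checked side by side. The sketch below uses two illustrative 2×2 matrices, `M` and `N` (names chosen here, not taken from the notebook), verifies the row-column rule for every entry, and contrasts `@` with `*`.

```python
import numpy as np

# M and N are illustrative matrices (an assumption, not from the notebook).
M = np.array([[1, 2],
              [3, 4]])
N = np.array([[5, 6],
              [7, 8]])

product = M @ N        # matrix multiplication: row-column dot products
elementwise = M * N    # element-wise: multiplies matching positions only

# Check (M @ N)[i][j] == row_i(M) . column_j(N) for every entry.
for i in range(M.shape[0]):
    for j in range(N.shape[1]):
        assert product[i, j] == M[i, :] @ N[:, j]

print(product)       # rows are 19 22 and 43 50
print(elementwise)   # rows are 5 12 and 21 32
```

In NumPy, `*` never performs row-column dot products, so using it where `@` was intended gives a wrong result silently whenever the shapes happen to be compatible.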


### Matrix × Vector vs. Matrix × Matrix

In [3]:
import numpy as np

# Matrix × vector (single vector)
A = np.array([[1, 2],
              [3, 4]])
x = np.array([[5],
              [6]])

print(A @ x)

# Matrix × matrix (many vectors at once)
# The columns of B, [5, 7] and [6, 8], are each transformed by A
B = np.array([[5, 6],
              [7, 8]])

print(A @ B)

# Matrix × matrix = batch version of matrix × vector.

[[17]
 [39]]
[[19 22]
 [43 50]]


### Observe Shape

In [4]:
A = np.random.rand(2, 3)
B = np.random.rand(3, 4)

C = A @ B
print(C.shape)   # (2, 4)


(2, 4)
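
The shape rule can also be applied without computing anything. The helper below, `matmul_shape`, is a hypothetical function written for this note (it is not part of NumPy); it is a minimal sketch that handles only 2-D shapes.

```python
def matmul_shape(a_shape, b_shape):
    """Return the shape of A @ B for 2-D shapes, or raise ValueError.

    Hypothetical helper for illustration; not a NumPy function.
    """
    m, n = a_shape
    n_b, p = b_shape
    if n != n_b:
        raise ValueError(f"inner dimensions differ: {n} != {n_b}")
    return (m, p)


print(matmul_shape((2, 3), (3, 4)))   # (2, 4)
```

The outer dimensions `m` and `p` survive into the result, and the shared inner dimension `n` disappears because it is summed over.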


### Why must the inner dimensions match?

In [5]:
# Valid
np.random.rand(2, 3) @ np.random.rand(3, 2)

# Invalid
# np.random.rand(2, 3) @ np.random.rand(4, 2)

# Dot products require equal-length vectors.


array([[0.90250314, 0.71764131],
       [0.69751942, 0.73000733]])
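
The invalid case is commented out above so the cell keeps running. To see what NumPy does with it, one option is to wrap the call in `try`/`except`, as in this small sketch. The names `left`, `right`, and `failed` are illustrative.

```python
import numpy as np

left = np.random.rand(2, 3)
right = np.random.rand(4, 2)   # 4 rows, but left has 3 columns

try:
    left @ right
    failed = False
except ValueError as err:
    # NumPy refuses the product because the inner dimensions (3 and 4) differ.
    failed = True
    print("matmul failed:", err)
```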

### At a small level: How is each element computed?

In [6]:
row_A    = np.array([1, 2, 3])
column_B = np.array([7, 9, 11])

print(row_A @ column_B)   # 58

# One row dotted with one column yields one value

58
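
Repeating that single dot product for every row of A and every column of B fills in the whole result. The sketch below does this with explicit loops, using the `A` and `B` from Set-1, and compares the outcome with `A @ B`. The loops are for understanding only; `@` is the normal way to do this.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])

rows, inner = A.shape
inner_b, cols = B.shape
assert inner == inner_b   # inner dimensions must match

# Fill C one entry at a time: C[i, j] = row i of A dotted with column j of B.
C = np.zeros((rows, cols), dtype=A.dtype)
for i in range(rows):
    for j in range(cols):
        C[i, j] = A[i, :] @ B[:, j]

print(C)                         # rows are 58 64 and 139 154
print(np.array_equal(C, A @ B))  # True
```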


### Each Column Is Computed Independently (Important)

In [7]:
import numpy as np

# --------------------------------------------------
# STEP 1: Define matrices A and B
# --------------------------------------------------
# A is a transformation matrix
A = np.array([
    [1, 2],
    [3, 4]
])

# B is a matrix containing MULTIPLE vectors
# Each column of B is one vector
B = np.array([
    [5,  7],
    [6,  8]
])

# --------------------------------------------------
# STEP 2: Understand what matrix × matrix means
# --------------------------------------------------
# A @ B does NOT mix columns together.
# Instead:
# - Take ONE column of B
# - Apply A to it (matrix × vector)
# - Repeat independently for each column
#
# `Matrix × Matrix` = many `Matrix × Vector` operations

# --------------------------------------------------
# STEP 3: Extract columns of B (vectors)
# --------------------------------------------------
b1 = B[:, 0]   # First column of B
b2 = B[:, 1]   # Second column of B

# --------------------------------------------------
# STEP 4: Apply A to each vector independently
# --------------------------------------------------
out1 = A @ b1
out2 = A @ b2

print(out1)
print(out2)

# --------------------------------------------------
# STEP 5: Stack the results to form final matrix
# --------------------------------------------------
# The final result of A @ B is just:
# [ A@b1 , A@b2 ] as columns
result = np.column_stack([out1, out2])

print(result)

# --------------------------------------------------
# KEY CONCEPT (MOST IMPORTANT PART)
# --------------------------------------------------
# Each column is transformed independently
# No column affects another column
# Same transformation (A) is applied every time
#
# Think of it like:
# - B holds multiple input vectors
# - A is a machine
# - Each vector goes through the machine separately
#
# Final shape rule:
# (m, n) @ (n, k) → (m, k)
#
# k independent transformations happen


[17 39]
[23 53]
[[17 23]
 [39 53]]
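
One way to test the claim that no column affects another is to change a single column of B and see which part of the result moves. In this sketch, `B_changed` and the replacement values `[100, 200]` are illustrative choices.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 7],
              [6, 8]])
C = A @ B

# Replace only the second column of B (illustrative values).
B_changed = B.copy()
B_changed[:, 1] = [100, 200]
C_changed = A @ B_changed

# The first column of the result is untouched.
print(np.array_equal(C[:, 0], C_changed[:, 0]))   # True

# The second column is now A applied to [100, 200].
print(C_changed[:, 1])                            # [ 500 1100]
```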


### Example: Embedding and Projection (Read)

In [8]:
import numpy as np

# --------------------------------------------------
# STEP 1: Embeddings matrix (E)
# --------------------------------------------------
# Think of E as a batch of token embeddings.
#
# Shape: (32, 768)
# - 32   → number of tokens (or samples)
# - 768  → embedding dimension (original semantic space)
#
# Each ROW is ONE token embedding.
# Each row is processed independently.
E = np.random.rand(32, 768)


# --------------------------------------------------
# STEP 2: Projection matrix (Wq)
# --------------------------------------------------
# Wq is a learned weight matrix.
#
# Shape: (768, 64)
# - Input size  = 768
# - Output size = 64
#
# This matrix defines HOW we want to "look at" embeddings.
# It decides what information to keep, compress, or ignore.
Wq = np.random.rand(768, 64)


# --------------------------------------------------
# STEP 3: Matrix multiplication (Projection)
# --------------------------------------------------
# We project embeddings into a NEW space.
#
# (32, 768) @ (768, 64) → (32, 64)
#
# What actually happens:
# - Take ONE embedding vector (row of E)
# - Multiply it with Wq
# - Produce ONE new vector of size 64
# - Repeat this independently for all 32 rows
Q = E @ Wq
print(Q)

# --------------------------------------------------
# STEP 4: Result shape
# --------------------------------------------------
print(Q.shape)   # (32, 64)


# --------------------------------------------------
# KEY CONCEPTUAL TAKEAWAYS (MOST IMPORTANT)
# --------------------------------------------------
# Each embedding (row) is projected independently
# No token mixes with another token here
# Same projection matrix (Wq) is applied to every row
#
# You can think of Wq as:
# - A "lens"
# - A "view"
# - A rule for re-expressing meaning
#
# Original space: 768-dim (general meaning)
# New space:       64-dim (query-specific meaning)
#
# This is why we call it:
# 👉 "Projection into a new semantic space"


# --------------------------------------------------
# ATTENTION CONTEXT (VERY IMPORTANT)
# --------------------------------------------------
# In self-attention:
# - E @ Wq → Queries
# - E @ Wk → Keys
# - E @ Wv → Values
#
# SAME embeddings
# DIFFERENT projection matrices
# DIFFERENT semantic roles
#
# Final mental model:
# Many vectors → same transformation → independent projections


[[187.88881781 198.18904425 195.83182058 ... 194.35469802 197.22938863
  195.32680526]
 [182.30784534 187.20167675 187.55886894 ... 187.0638015  187.12241008
  187.20223914]
 [190.42128291 193.2108744  195.17468294 ... 185.61181384 192.04614823
  190.07780693]
 ...
 [190.59113198 195.38054791 193.91091685 ... 195.87723076 195.27344619
  192.05958177]
 [182.30901371 194.12049983 193.17827976 ... 186.43794409 194.69623906
  193.04994271]
 [183.56434911 186.39175267 185.44567483 ... 178.59677953 185.55858163
  182.24992439]]
(32, 64)
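
The attention context described in the comments can be sketched with three projection matrices. The notebook only defines `Wq`; here `Wk` and `Wv` are assumed to have the same `(768, 64)` shape, and all weights are random stand-ins for learned values. The last two checks illustrate that rows are projected independently.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the run is repeatable
E = rng.random((32, 768))        # 32 token embeddings
Wq = rng.random((768, 64))       # random stand-ins for learned weights
Wk = rng.random((768, 64))       # shape assumed to match Wq
Wv = rng.random((768, 64))       # shape assumed to match Wq

# Same embeddings, three different projections.
Q, K, V = E @ Wq, E @ Wk, E @ Wv
print(Q.shape, K.shape, V.shape)   # (32, 64) (32, 64) (32, 64)

# Projecting one row alone gives the same vector as that row of the batch.
print(np.allclose(E[5] @ Wq, Q[5]))           # True

# Reordering the tokens only reorders the rows of Q: no token mixing.
order = rng.permutation(32)
print(np.allclose(E[order] @ Wq, Q[order]))   # True
```

This covers the projection step only. Tokens do interact later in self-attention, when queries are compared with keys, but that step is outside the scope of this example.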


### Read (Embedding & Projections)
#### A Mental Model

In [9]:
# FINAL SIMPLE MENTAL MODEL
# ------------------------

# Imagine ONE token embedding
# This is ONE row from E
# It is just a list of numbers describing the token
'''
e = [e1, e2, e3, ..., e768]   # shape = (768,)
'''

# Think of this as:
# "All information the model knows about this token"


# Now imagine ONE column from Wq
# This column is ONE learned direction
# It decides what kind of information we want to look at
'''
w = [w1, w2, w3, ..., w768]   # shape = (768,)
'''

# Think of this as:
# "Which parts of the token matter for THIS question?"


# DOT PRODUCT (most important step)
# --------------------------------
# We multiply matching numbers and add them
'''
score = (e1*w1) + (e2*w2) + (e3*w3) + ... + (e768*w768)
'''

# Result:
# ✔ ONE number
# ✔ This number answers ONE question about the token


# WHAT DOES THIS NUMBER MEAN?
# ---------------------------
# Big value   → token strongly matches this question
# Small value → token weakly matches this question

# This is called a PROJECTION
# We are projecting the token onto ONE direction


# FULL PROJECTION
# ---------------
# Wq has MANY columns (64 of them)
# Each column asks a DIFFERENT question

# So we repeat the same dot product 64 times
'''
q = [score1, score2, score3, ..., score64]   # shape = (64,)
'''

# This is the new representation of the token
# Same token, but viewed differently


# BATCH VERSION
# -------------
# When we do:
'''
Q = E @ Wq
'''
#
# This means:
# - Take each token one by one
# - Ask the same 64 questions
# - Do NOT mix tokens
# - Everything is independent


# ONE-LINE MEMORY RULE (IMPORTANT)
# -------------------------------
# Each column of Wq is a question.
# Each dot product is the answer.

imp_mental_model = 'above'
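
The mental model above is written as pseudocode, so here is a small runnable version. It assumes toy sizes (4 numbers per token and 3 questions instead of 768 and 64) and hand-picked values for `e` and `Wq`, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Toy sizes and values, chosen for illustration only.
e = np.array([1.0, 0.0, 2.0, -1.0])     # one token embedding
Wq = np.array([[1.0, 0.0,  0.5],
               [0.0, 1.0,  0.5],
               [1.0, 0.0, -0.5],
               [0.0, 1.0,  0.5]])       # each column is one question

# One dot product per column of Wq: multiply matching numbers, then add.
scores = []
for j in range(Wq.shape[1]):
    w = Wq[:, j]
    scores.append(sum(e_i * w_i for e_i, w_i in zip(e, w)))

q = np.array(scores)
print(q)                        # values are 3.0, -1.0, -1.0
print(np.allclose(q, e @ Wq))   # True
```

The first question gets the largest score because its column has weight exactly where `e` has its large positive entries. The single expression `e @ Wq` computes all three answers at once.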
