<a href="https://colab.research.google.com/github/ravichas/bifx-546/blob/main/Notebooks/Chapter04_LinAlgebra.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Algebra

🎓 Course Context

Prepared for BIFX-546: Machine Learning for Bioinformatics

Instructor: Sarangan Ravichandran, PhD, PMP

# 📘 Attribution & Reading Reference

This notebook is based on concepts, structure, and examples from:

Data Science from Scratch, 2nd Edition by Joel Grus, published by O'Reilly Media, Inc.

# Relevant Reading:

Data Science from Scratch, 2nd Edition, Chapter 4: Linear Algebra

The material in this notebook has been **expanded with additional explanations, new examples, and code adaptations** to support instructional use and execution in a Google Colab environment. Any additions, reformatting, or implementation details beyond the original text are the responsibility of the notebook author.

This notebook is intended for educational use only and does not replace the original book/code examples.

## Code

Portions of the vector and matrix code are adapted from Data Science from Scratch by Joel Grus (GitHub source: https://github.com/joelgrus/data-science-from-scratch).

The domain-specific examples, particularly the healthcare and genomics use cases, were created specifically for this class.

# Linear Algebra

Linear algebra is the branch of mathematics that deals with vector spaces.

## Fundamentals:
* A vector is a quantity with magnitude (length) and direction, or equivalently an ordered list of numbers. For example, `[1.5, 2.5]` is a 2D vector.
* In linear algebra, a vector is almost always rooted at the origin (the point where the horizontal and vertical axes intersect, at least in 2D space).
* In `[1.5, 2.5]`, the first number is the x-coordinate and the second is the y-coordinate.
* The axes are perpendicular to each other.
* Vectors can be scaled, added, and multiplied (for example, with the dot product).


# Vectors

* Vectors are points in some finite-dimensional space.
  * 2D: [x, y]
  * 3D: [x, y, z]; [height, weight, age]
  * 100-dimensional: [gene0, gene1, ..., gene99]
    * gene0: expression value of gene0
* We can add, subtract, or transform vectors.
* Linear algebra almost always treats the origin of a vector as fixed.
* How can we represent vectors in Python?
```
 A vector is a list of scalars (in this case, floats)
```

What does `List[float]` mean?

* `List` comes from the `typing` module, which is part of the Python standard library (available by default in Python 3.5+). Since Python 3.9, the built-in `list[float]` works as well.
* `List[float]` is a type hint meaning: "a list whose elements are floats."
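As a small sketch of what a type hint does and does not do (assuming a standard CPython interpreter): hints document intent and help static type checkers, but Python itself does not enforce them at runtime. The function `first_component` is a hypothetical helper used only for this illustration.

```python
from typing import List

# Type alias: a Vector is a list of floats
Vector = List[float]

def first_component(v: Vector) -> float:
    """Hypothetical helper: return the first element of v."""
    return v[0]

# This call matches the hint
x = first_component([1.5, 2.5])

# This call does NOT match the hint (a list of strings),
# yet Python runs it anyway because hints are not checked at runtime
y = first_component(["a", "b"])

print(x, y)
```

In other words, the hint is documentation for readers and tools, not a runtime check.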

In [None]:
from typing import List

height_weight_age = [70,  # inches,
                     170, # pounds,
                     40 ] # years

grades = [95,   # exam1
          80,   # exam2
          75,   # exam3
          62 ]  # exam4

In [None]:
height_weight_age

[70, 170, 40]

In [None]:
grades

[95, 80, 75, 62]

Python lists are not vectors: they do not support vector arithmetic out of the box. We have to either use a specialized library or build the operations ourselves.
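To see why plain lists fall short, here is a minimal sketch of what the built-in `+` and `*` operators do on lists. The variable names are illustrative only.

```python
v = [1, 2, 3]
w = [4, 5, 6]

# "+" concatenates lists; it does not add element by element
concatenated = v + w

# "*" repeats a list; it does not scale each element
repeated = 2 * v

# Element-wise addition has to be written out explicitly
elementwise = [v_i + w_i for v_i, w_i in zip(v, w)]

print(concatenated)  # [1, 2, 3, 4, 5, 6]
print(repeated)      # [1, 2, 3, 1, 2, 3]
print(elementwise)   # [5, 7, 9]
```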

Key points

* A vector is an ordered list of numbers
* Represents:
  - a data point
  - a feature vector
  - a direction
* No arrows, no geometry yet

In data science, vectors represent observations or feature values.

# Write some code for adding/subtracting vectors.

In [None]:
# Vector: List of floats
Vector = List[float]

In [None]:
# code from
def add(v: Vector, w: Vector) -> Vector:
    """Adds corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"

    return [v_i + w_i for v_i, w_i in zip(v, w)]

assert add([1, 2, 3], [4, 5, 6]) == [5, 7, 9]

# Vector Addition

📊 Combining Effects

Key points
* Add vectors element-wise
* Requires same length
* Models accumulation of effects


Vector addition combines independent contributions feature by feature.

In [None]:
def vector_sum(vectors: List[Vector]) -> Vector:
    """Sums all corresponding elements"""
    # Check that vectors is not empty
    assert vectors, "no vectors provided!"

    # Check the vectors are all the same size
    num_elements = len(vectors[0])
    assert all(len(v) == num_elements for v in vectors), "different sizes!"

    # the i-th element of the result is the sum of every vector[i]
    return [sum(vector[i] for vector in vectors)
            for i in range(num_elements)]

assert vector_sum([[1, 2], [3, 4], [5, 6], [7, 8]]) == [16, 20]

No interaction terms - this is linear

In [None]:
a = [1, 2, 3]
b = [3 * i for i in a]

vector_sum([a, b])

[4, 8, 12]

# Scalar Multiplication

📊 Scaling Without Changing Direction

Key points
* Multiply every component by a constant
* Preserves direction when the constant is positive (a negative constant flips it)
* Otherwise changes magnitude only


Scalar multiplication controls the strength of a vector.

In [None]:
def scalar_multiply(c: float, v: Vector) -> Vector:
    """Multiplies every element by c"""
    return [c * v_i for v_i in v]

assert scalar_multiply(2, [1, 2, 3]) == [2, 4, 6]

# Take-home

Same pattern, louder signal

# Compute the componentwise means of a list of (same-sized) vectors

In [None]:
def vector_mean(vectors: List[Vector]) -> Vector:
    """Computes the element-wise average"""
    n = len(vectors)
    return scalar_multiply(1/n, vector_sum(vectors))

assert vector_mean([[1, 2], [3, 4], [5, 6]]) == [3, 4]

# Dot Product (Core DS Operation)

📊 Similarity & Prediction

Key points
* Produces a scalar
* Measures alignment
* Used in prediction

The dot product is the engine of linear models.

In [None]:
def dot(v: Vector, w: Vector) -> float:
    """Computes v_1 * w_1 + ... + v_n * w_n"""
    assert len(v) == len(w), "vectors must be same length"

    return sum(v_i * w_i for v_i, w_i in zip(v, w))

assert dot([1, 2, 3], [4, 5, 6]) == 32  # 1 * 4 + 2 * 5 + 3 * 6
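As a quick illustration of "measures alignment", the sketch below evaluates the dot product for three made-up pairs of 2D vectors: pointing the same way, perpendicular, and pointing in opposite directions. It redefines `dot` so the block runs on its own.

```python
from typing import List

Vector = List[float]

def dot(v: Vector, w: Vector) -> float:
    """Computes v_1 * w_1 + ... + v_n * w_n"""
    assert len(v) == len(w), "vectors must be same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

same_direction = dot([1, 0], [2, 0])    # positive: vectors are aligned
perpendicular = dot([1, 0], [0, 3])     # zero: vectors share no direction
opposite = dot([1, 0], [-2, 0])         # negative: vectors point apart

print(same_direction, perpendicular, opposite)  # 2 0 -2
```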

# Sum of Squares

Computing a Vector's sum of squares

In [None]:
def sum_of_squares(v: Vector) -> float:
    """Returns v_1 * v_1 + ... + v_n * v_n"""
    return dot(v, v)

assert sum_of_squares([1, 2, 3]) == 14  # 1 * 1 + 2 * 2 + 3 * 3

# Magnitude of a Vector



In [None]:
import math

def magnitude(v: Vector) -> float:
    """Returns the magnitude (or length) of v"""
    return math.sqrt(sum_of_squares(v))   # math.sqrt is square root function

assert magnitude([3, 4]) == 5

# Squared Distance of 2 vectors

In [None]:
# Note: subtract() is defined in a later cell; run that cell before calling this function
def squared_distance(v: Vector, w: Vector) -> float:
    """Computes (v_1 - w_1) ** 2 + ... + (v_n - w_n) ** 2"""
    return sum_of_squares(subtract(v, w))

In [None]:
def distance(v: Vector, w: Vector) -> float:
    """Computes the distance between v and w"""
    return math.sqrt(squared_distance(v, w))


In [None]:
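# Equivalent one-line definition; it replaces the version in the previous cell
def distance(v: Vector, w: Vector) -> float:
    """Computes the distance between v and w"""
    return magnitude(subtract(v, w))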

In [None]:
def subtract(v: Vector, w: Vector) -> Vector:
    """Subtracts corresponding elements"""
    assert len(v) == len(w), "vectors must be the same length"

    return [v_i - w_i for v_i, w_i in zip(v, w)]

assert subtract([5, 7, 9], [4, 5, 6]) == [1, 2, 3]
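The list-based helpers in this notebook are written for learning. For comparison, here is a minimal sketch of the same operations with NumPy arrays, assuming NumPy is installed (it is preinstalled on Google Colab). The variable names are illustrative only.

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])

added = v + w                 # element-wise addition
subtracted = w - v            # element-wise subtraction
scaled = 2 * v                # scalar multiplication
dot_vw = np.dot(v, w)         # dot product: 1*4 + 2*5 + 3*6
length = np.linalg.norm(np.array([3.0, 4.0]))   # magnitude of [3, 4]

print(added, subtracted, scaled, dot_vw, length)
```

With arrays, `+`, `-`, and `*` act element by element, so no hand-written loops are needed.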

# Note:

"*Using lists as vectors is a great way to learn, but it is not good for performance. In production code, you should use the NumPy library, which includes a high-performance array class with all sorts of arithmetic operations included*" (taken from the book)

# Matrices (Datasets)

📊 Matrices = Stacked Vectors

Key points
* Rows = observations
* Columns = features
* A dataset is a matrix


Matrices organize many vectors into structured data.



Note that Python indexing starts at 0, so the element in the first row and first column is at (0, 0) rather than (1, 1).

Matrix: a 2D collection of numbers. We represent it as a list of lists, where each inner list has the same size and represents one row of the matrix.

If A is a matrix, then `A[i][j]` is the element in the ith row and jth column.

Note: Matrices are usually denoted by capital letters, e.g. **A**

In [None]:
# Another type alias
Matrix = List[List[float]]

A = [[1, 2, 3],  # A has 2 rows and 3 columns
     [4, 5, 6]]

B = [[1, 2],     # B has 3 rows and 2 columns
     [3, 4],
     [5, 6]]

In [None]:
A[0][0]

1

In [None]:
B[2][1]

6

In [None]:
len(A) # Num of Rows

2

In [None]:
len(A[0]) # Columns

3

In [None]:
A

[[1, 2, 3], [4, 5, 6]]

In [None]:
from typing import Tuple

def shape(A: Matrix) -> Tuple[int, int]:
    """Returns (# of rows of A, # of columns of A)"""
    num_rows = len(A)
    num_cols = len(A[0]) if A else 0   # number of elements in first row
    return num_rows, num_cols

assert shape([[1, 2, 3], [4, 5, 6]]) == (2, 3)  # 2 rows, 3 columns

# Matrix Shape

* Shape = (rows, columns)
* Controls valid operations
* Most common source of bugs

Linear algebra operations fail loudly when shapes don't match.
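The same idea applies to the list-based helpers in this notebook, where "failing loudly" comes from the `assert` on the lengths. The sketch below shows what would happen without that check (Python's `zip` silently stops at the shorter input) and what happens with it. It redefines `dot` so the block runs on its own.

```python
def dot(v, w):
    """Dot product that refuses vectors of different lengths."""
    assert len(v) == len(w), "vectors must be same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

# zip stops at the shorter input, so without the assert
# a length mismatch would go unnoticed
silent_pairs = list(zip([1, 2, 3], [4, 5]))

# With the assert, the mismatch raises an error instead
try:
    dot([1, 2, 3], [4, 5])
    error_message = None
except AssertionError as err:
    error_message = str(err)

print(silent_pairs)    # [(1, 4), (2, 5)]
print(error_message)   # vectors must be same length
```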

In [None]:
shape(A)

(2, 3)

In [None]:
def get_row(A: Matrix, i: int) -> Vector:
    """Returns the i-th row of A (as a Vector)"""
    return A[i]             # A[i] is already the ith row

def get_column(A: Matrix, j: int) -> Vector:
    """Returns the j-th column of A (as a Vector)"""
    return [A_i[j]          # jth element of row A_i
            for A_i in A]   # for each row A_i

In [None]:
A

[[1, 2, 3], [4, 5, 6]]

In [None]:
get_row(A, 1)

[4, 5, 6]

In [None]:
from typing import Callable

def make_matrix(num_rows: int,
                num_cols: int,
                entry_fn: Callable[[int, int], float]) -> Matrix:
    """
    Returns a num_rows x num_cols matrix
    whose (i,j)-th entry is entry_fn(i, j)
    """
    return [[entry_fn(i, j)             # given i, create a list
             for j in range(num_cols)]  #   [entry_fn(i, 0), ... ]
            for i in range(num_rows)]   # create one list for each i

In [None]:
make_matrix(num_rows=2, num_cols=3, entry_fn=lambda i, j: 1 if i == j else 0)

[[1, 0, 0], [0, 1, 0]]

In [None]:
def identity_matrix(n: int) -> Matrix:
    """Returns the n x n identity matrix"""
    return make_matrix(n, n, lambda i, j: 1 if i == j else 0)

assert identity_matrix(5) == [[1, 0, 0, 0, 0],
                              [0, 1, 0, 0, 0],
                              [0, 0, 1, 0, 0],
                              [0, 0, 0, 1, 0],
                              [0, 0, 0, 0, 1]]

In [None]:
identity_matrix(n=4)

[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

Matrices can be used to represent multiple vectors, for example by treating each row as a vector. If we had collected the heights, weights, and ages of 1000 people, we could put them in a 1000 x 3 matrix:

```
data = [ [70, 170, 40],
         [65, 120, 26],
         [77, 250, 19],
         # ....
]
```



# Matrix-Vector Multiplication

📊 Predictions as Linear Algebra

Key points
* Applies same rule to all rows
* Produces predictions


Matrix-vector multiplication is batch prediction.
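A minimal sketch of "batch prediction": each row of a data matrix is dotted with the same weight vector, giving one prediction per row. The function name `predict_all`, the data, and the weights are all made up for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]

def dot(v: Vector, w: Vector) -> float:
    """Computes v_1 * w_1 + ... + v_n * w_n"""
    assert len(v) == len(w), "vectors must be same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def predict_all(X: Matrix, weights: Vector) -> Vector:
    """Hypothetical helper: one dot product (prediction) per row of X."""
    return [dot(row, weights) for row in X]

# Rows = observations, columns = features (made-up numbers)
X = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
weights = [0.5, 2.0]   # made-up model weights

predictions = predict_all(X, weights)
print(predictions)  # [4.5, 9.5, 14.5]
```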

# Eigenvalues & Eigenvectors

📊 Stable Directions

Key points
* Direction unchanged by transformation
* Only scaled
* Reveals structure

Eigenvectors are directions a matrix acts on cleanly.

A v = λ v

Applying the matrix A to an eigenvector v only rescales it by the eigenvalue λ; v stays on the same line through the origin.
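As a worked check of A v = λ v, the sketch below uses a small made-up symmetric matrix and verifies two eigenvectors by hand, plus one vector that is not an eigenvector. The helper `mat_vec` is defined here only for this illustration.

```python
def mat_vec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]

A = [[2, 1],
     [1, 2]]

# v1 is an eigenvector with eigenvalue 3: A v1 = [3, 3] = 3 * v1
v1 = [1, 1]
Av1 = mat_vec(A, v1)

# v2 is an eigenvector with eigenvalue 1: A v2 = [1, -1] = 1 * v2
v2 = [1, -1]
Av2 = mat_vec(A, v2)

# u is NOT an eigenvector: A u = [2, 1] is not a multiple of [1, 0]
u = [1, 0]
Au = mat_vec(A, u)

print(Av1, Av2, Au)  # [3, 3] [1, -1] [2, 1]
```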

Why this matters in DS

Eigenvectors reveal:
* dominant directions
* structure in data
* variance (PCA)

Eigenvalues tell:
* how important that direction is

### Key Application is PCA

"PCA finds directions where the data varies most.
Those directions are eigenvectors of the data's covariance matrix."
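A minimal sketch of this idea, assuming NumPy is installed: for made-up points lying on the line y = x, the covariance matrix has one eigenvector along that line, and its eigenvalue is the variance in that direction. This is only the core eigen-decomposition step, not a full PCA implementation.

```python
import numpy as np

# Made-up data: three points lying exactly on the line y = x
data = np.array([[1.0, 1.0],
                 [2.0, 2.0],
                 [3.0, 3.0]])

# 2 x 2 sample covariance matrix of the two columns
cov = np.cov(data, rowvar=False)

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

top_variance = eigenvalues[-1]        # largest eigenvalue
top_direction = eigenvectors[:, -1]   # matching eigenvector (a column)

print(top_variance)    # approximately 2.0
print(top_direction)   # proportional to [1, 1]; the overall sign is arbitrary
```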



## Why DS loves linear algebra

Because linear systems are:
* interpretable
* composable
* scalable
* optimizable

Almost every ML model starts as:

linear algebra + nonlinearity

# Healthcare Applications

# 1. Patient lab values as vectors + risk score (dot product)

Idea: Represent a patient's key lab values as a vector, and a simple linear risk model as another vector. The dot product is a risk score.

In [None]:
# Vector = [systolic_bp, LDL_cholesterol, HbA1c]
patient_1: Vector = [140, 160, 7.2]   # Example hypertensive, high LDL, diabetic
patient_2: Vector = [120, 100, 5.8]   # Better-controlled patient

# Simple (made-up) linear risk model:
# weights say how much each lab contributes to overall cardiovascular risk
risk_weights: Vector = [0.3, 0.4, 0.3]

def cardiovascular_risk(labs: Vector, weights: Vector) -> float:
    """Return a simple risk score as a dot product of labs and weights."""
    assert len(labs) == len(weights)
    return dot(labs, weights)

risk_1 = cardiovascular_risk(patient_1, risk_weights)
risk_2 = cardiovascular_risk(patient_2, risk_weights)

print("Risk patient 1:", risk_1)
print("Risk patient 2:", risk_2)

Risk patient 1: 108.16
Risk patient 2: 77.74


# Key take-home message:

* Each patient is a vector.
* Our "model" is another vector.
* Linear algebra (dot product) turns this into a single number that summarizes risk.

# 2. Gene expression profiles as vectors + "signature" score

Idea: Each sample's gene expression is a vector of expression levels. A "gene signature" (e.g., interferon response, proliferation score) is a vector of weights. Again, use the dot product.

In [None]:
# Suppose we measure expression of 4 genes for each patient:
# [GeneA, GeneB, GeneC, GeneD]
patient_tumor_1: Vector = [8.5, 2.1, 0.3, 5.0]
patient_tumor_2: Vector = [5.0, 1.2, 0.1, 2.4]

# A made-up "proliferation signature": how much each gene contributes
proliferation_signature: Vector = [0.6, 0.1, 0.0, 0.3]

def signature_score(expr: Vector, signature: Vector) -> float:
    """Score how strongly this sample expresses a gene signature."""
    assert len(expr) == len(signature)
    return dot(expr, signature)

score_1 = signature_score(patient_tumor_1, proliferation_signature)
score_2 = signature_score(patient_tumor_2, proliferation_signature)

print("Proliferation score (patient 1):", score_1)
print("Proliferation score (patient 2):", score_2)

Proliferation score (patient 1): 6.81
Proliferation score (patient 2): 3.84


# Key take-home message
* Gene expression = vector.
* Gene signature = vector.
* Dot product tells you how aligned the sample is with that signature.

# 3. Distance between patients (clustering / similarity)

**Idea**: Use Euclidean distance between patient feature vectors to say how similar two patients are (for clustering, nearest neighbors, etc.)

In [None]:
# Each patient vector: [BMI, systolic_bp, LDL, HDL]
patient_a: Vector = [30.0, 140, 160, 40]
patient_b: Vector = [25.0, 120, 110, 55]
patient_c: Vector = [31.0, 145, 170, 38]

print("Distance(a, b):", distance(patient_a, patient_b))
print("Distance(a, c):", distance(patient_a, patient_c))

Distance(a, b): 56.124860801609124
Distance(a, c): 11.40175425099138


Key take-home message:
* Smaller distance → patients more similar.
* Distance computations like this are what k-means, k-nearest neighbors (KNN), and similar methods rely on under the hood.
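As a rough sketch of the nearest-neighbor idea, the block below picks the patient closest to a query patient using Euclidean distance. It reuses the made-up values from the example above and redefines `distance` so it runs on its own; the dictionary layout is an assumption made for this illustration.

```python
import math
from typing import Dict, List

Vector = List[float]

def distance(v: Vector, w: Vector) -> float:
    """Euclidean distance between v and w."""
    assert len(v) == len(w), "vectors must be the same length"
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

# Each patient vector: [BMI, systolic_bp, LDL, HDL] (made-up values)
candidates: Dict[str, Vector] = {
    "b": [25.0, 120, 110, 55],
    "c": [31.0, 145, 170, 38],
}
query: Vector = [30.0, 140, 160, 40]   # patient a

# The nearest neighbor is the candidate with the smallest distance to the query
nearest = min(candidates, key=lambda name: distance(query, candidates[name]))
print("Nearest to patient a:", nearest)  # c
```

Because the features are on different scales, the larger-valued ones (here LDL and blood pressure) dominate the distance; rescaling features before computing distances is a common remedy.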

# 4. Matrix of patient × lab values

Idea: Represent a cohort as a matrix: each row is a patient, each column is a lab test.

In [None]:
# Rows = patients, Columns = [systolic_bp, LDL, HDL]
labs: Matrix = [
    [140, 160, 40],  # patient 0
    [120, 100, 55],  # patient 1
    [130, 130, 50],  # patient 2
]

print("Shape of labs matrix:", shape(labs))   # (3, 3)

def patient_vector(lab_matrix: Matrix, patient_id: int) -> Vector:
    return get_row(lab_matrix, patient_id)

def lab_values(lab_matrix: Matrix, lab_index: int) -> Vector:
    return get_column(lab_matrix, lab_index)

patient_0 = patient_vector(labs, 0)
ldl_values = lab_values(labs, 1)

print("Patient 0 labs:", patient_0)
print("LDL for all patients:", ldl_values)

Shape of labs matrix: (3, 3)
Patient 0 labs: [140, 160, 40]
LDL for all patients: [160, 100, 130]


# Key take-home message:
* Row = patient vector.
* Column = all patients' values for one lab test.
* Shows shape, get_row, get_column in a natural clinical context.

# 5. Matrix of samples × genes + per-gene summary

Idea: Each row is a sample (patient, tissue), each column is a gene. Take the column-wise mean to get the average expression per gene (the same quantity vector_mean returns when given the rows).

In [None]:
# Rows = samples, Columns = [GeneA, GeneB, GeneC]
expression_matrix: Matrix = [
    [8.5, 2.1, 0.3],   # sample 0
    [5.0, 1.2, 0.1],   # sample 1
    [6.7, 1.5, 0.2],   # sample 2
]

print("Expression shape:", shape(expression_matrix))  # (3, 3)

# Mean expression per gene (column-wise mean)
def mean_expression_per_gene(expr: Matrix) -> Vector:
    num_rows, num_cols = shape(expr)
    means: Vector = []
    for j in range(num_cols):
        gene_values = get_column(expr, j)
        means.append(sum(gene_values) / len(gene_values))
    return means

gene_means = mean_expression_per_gene(expression_matrix)
print("Mean expression per gene:", gene_means)

Expression shape: (3, 3)
Mean expression per gene: [6.733333333333333, 1.5999999999999999, 0.19999999999999998]


Take-home message:

This example connects Matrix + column operations with basic bioinformatics summary stats.

# 6. Normalizing lab values with a diagonal matrix (gentle intro)

**Idea**: Use make_matrix to build a diagonal scaling matrix that normalizes labs (e.g., divide each lab by a typical value). Then apply it with matrix-vector multiplication written "by hand" (using dot + get_row).


In [None]:
# not in the book
def mat_vec_multiply(A: Matrix, v: Vector) -> Vector:
    """Multiply an n x k matrix A by a k-dimensional vector v."""
    num_rows, num_cols = shape(A)
    assert len(v) == num_cols
    return [dot(get_row(A, i), v) for i in range(num_rows)]

In [None]:
# Typical / reference values for [systolic_bp, LDL, HDL]
reference: Vector = [120.0, 100.0, 50.0]

def scaling_matrix(scales: Vector) -> Matrix:
    n = len(scales)
    return make_matrix(
        n, n,
        lambda i, j: scales[i] if i == j else 0.0
    )

# We want to divide each lab by its reference value
scales = [1.0 / ref for ref in reference]
S = scaling_matrix(scales)

patient_labs: Vector = [140.0, 160.0, 40.0]
normalized = mat_vec_multiply(S, patient_labs)

print("Original labs:", patient_labs)
print("Normalized labs:", normalized)  # roughly [1.17, 1.6, 0.8]

Original labs: [140.0, 160.0, 40.0]
Normalized labs: [1.1666666666666667, 1.6, 0.8]


**Take-home Message**

* This diagonal matrix is a linear transformation that rescales each feature independently.


# 7. Cosine similarity between gene expression profiles

Idea: Use dot and magnitude to define cosine similarity between two gene expression vectors (are two tumors molecularly similar?).

In [None]:
def cosine_similarity(v: Vector, w: Vector) -> float:
    """Cosine of angle between v and w: dot(v,w) / (||v|| * ||w||)."""
    return dot(v, w) / (magnitude(v) * magnitude(w))

tumor_1: Vector = [8.5, 2.1, 0.3, 5.0]
tumor_2: Vector = [8.0, 2.0, 0.4, 4.8]
tumor_3: Vector = [2.0, 9.0, 5.0, 1.0]

print("cos(tumor1, tumor2):", cosine_similarity(tumor_1, tumor_2))
print("cos(tumor1, tumor3):", cosine_similarity(tumor_1, tumor_3))

cos(tumor1, tumor2): 0.9998891270058804
cos(tumor1, tumor3): 0.3989671682444102


In [None]:
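# Using a standard library (scikit-learn) to accomplish the same thing.
# Import under an alias so it does not overwrite the cosine_similarity defined above.

from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity
X = [[0, 0, 0], [1, 1, 1]]
Y = [[1, 0, 0], [1, 1, 0]]
sk_cosine_similarity(X, Y)   # pairwise: each row of X against each row of Y
# array([[0.   , 0.   ],
#        [0.577, 0.816]])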

array([[0.        , 0.        ],
       [0.57735027, 0.81649658]])

# Key take-home message

Cosine near 1 → expression patterns point in a similar "direction" (similar biology), even if overall magnitudes differ.

# Further Exploration

There are several applications. Please refer to the book.