
<div align="center">
  <img src="https://www.dropbox.com/s/vold2f3fm57qp7g/ECE4179_5179_6179_banner.png?dl=1" alt="ECE4179/5179/6179 Banner" style="max-width: 60%;"/>
</div>

<div align="center">

# Let There Be Light!

</div>

Welcome to **ECE4179/5179/6179 Week 1**! In this notebook, we will study some aspects of data processing.

By the end of this notebook, you will have accomplished the following learning outcomes:

- Understanding vector algebra, including operations such as addition, subtraction, norm, inner product, and cosine similarity.
- Applying these vector operations in practical contexts such as audio and text processing.

So, let's **get started**!




In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import librosa

from IPython.display import Audio, display
import ipywidgets as widgets

RND_SEED = 42
np.random.seed(RND_SEED)  # For reproducibility


## Vector Algebra

Recall from your previous math units that in $\mathbb{R}^2$ (the 2-dimensional real plane), vectors can be represented as directed line segments. We briefly discuss basic vector operations such as addition, subtraction, norm, inner product, and cosine similarity.

### Vector Addition
Vector addition involves adding corresponding components of two vectors to form a new vector. Geometrically, this can be visualized by placing the tail of vector $\mathbf{b}$ at the head of vector $\mathbf{a}$; the resultant vector $\mathbf{a} + \mathbf{b}$ then points from the tail of $\mathbf{a}$ to the head of $\mathbf{b}$. Equivalently, if both vectors are drawn from a common origin, $\mathbf{a} + \mathbf{b}$ is the diagonal of the parallelogram formed by $\mathbf{a}$ and $\mathbf{b}$.




### Vector Subtraction
Vector subtraction involves subtracting corresponding components of one vector from another. Geometrically, this can be visualized by placing the tail of vector $\mathbf{b}$ at the tail of vector $\mathbf{a}$. The resultant vector $\mathbf{a} - \mathbf{b}$ points from the head of $\mathbf{b}$ to the head of $\mathbf{a}$.
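
As a quick illustration of the component-wise definitions of addition and subtraction, here is a minimal NumPy sketch. The vectors `u` and `v` are hypothetical values chosen for this example only; they are not the vectors used in the tasks.

```python
import numpy as np

# Illustrative vectors (hypothetical values, not the ones used in the tasks)
u = np.array([2.0, 1.0])
v = np.array([-1.0, 3.0])

# NumPy's + and - act element-wise on arrays of the same shape
u_plus_v = u + v    # [2 + (-1), 1 + 3]
u_minus_v = u - v   # [2 - (-1), 1 - 3]

print("u + v =", u_plus_v)
print("u - v =", u_minus_v)

# Geometric check: starting at the head of v and moving along (u - v)
# should land on the head of u
print(np.allclose(v + u_minus_v, u))
```

The last line checks the geometric picture of subtraction: $\mathbf{u} - \mathbf{v}$ is the vector that takes you from the head of $\mathbf{v}$ to the head of $\mathbf{u}$.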





### Norm (Magnitude)
The norm or magnitude of a vector is a measure of its length. For a vector $\mathbf{v} = \begin{pmatrix} x \\ y \end{pmatrix}$, the norm is given by:

\begin{align}
\|\mathbf{v}\| = \sqrt{x^2 + y^2}
\end{align}
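
The formula can be checked numerically. The sketch below computes the norm of a hypothetical vector `w` twice: once directly from the formula, and once with `numpy.linalg.norm()`, which returns the same Euclidean length by default for a 1-D array.

```python
import numpy as np

# Hypothetical vector chosen so that the norm is a round number
w = np.array([6.0, 8.0])

# Norm written out from the formula sqrt(x^2 + y^2)
norm_from_formula = np.sqrt(w[0] ** 2 + w[1] ** 2)

# Same quantity using NumPy's built-in Euclidean (L2) norm
norm_from_numpy = np.linalg.norm(w)

print(norm_from_formula)  # 10.0
print(norm_from_numpy)    # 10.0
```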



### Inner Product (Dot Product)
The inner product (or dot product) of two vectors $\mathbf{a}$ and $\mathbf{b}$ is a scalar representing the product of their magnitudes and the cosine of the angle between them. It is calculated as:

\begin{align}
\mathbf{a} \cdot \mathbf{b} = a_1b_1 + a_2b_2
\end{align}
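
The geometric description and the component formula describe the same quantity. Writing $\theta$ for the angle between $\mathbf{a}$ and $\mathbf{b}$ (a symbol introduced here for illustration), the geometric form reads:

\begin{align}
\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta
\end{align}

As a small worked example with illustrative vectors $\mathbf{a} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $\mathbf{b} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$:

\begin{align}
\mathbf{a} \cdot \mathbf{b} &= 1 \cdot 1 + 0 \cdot 1 = 1 \\
\|\mathbf{a}\| &= 1, \quad \|\mathbf{b}\| = \sqrt{2} \\
\cos\theta &= \frac{1}{1 \cdot \sqrt{2}} = \frac{1}{\sqrt{2}} \quad \Rightarrow \quad \theta = 45^\circ
\end{align}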



### Cosine Similarity
Cosine similarity measures the cosine of the angle between two vectors, indicating their directional similarity. It is given by the formula:

\begin{align}
\text{cosine\_similarity} = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}
\end{align}
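
To build some intuition for the range of values, the sketch below evaluates the formula for three simple cases. The helper `cosine_sim` and all vectors are hypothetical and only meant for illustration; the helper handles a single pair of 1-D vectors and does not guard against zero-length vectors.

```python
import numpy as np

def cosine_sim(p, q):
    """Cosine of the angle between 1-D vectors p and q (hypothetical helper)."""
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

p = np.array([1.0, 0.0])
same_direction = np.array([5.0, 0.0])   # longer, but points the same way as p
orthogonal = np.array([0.0, 2.0])       # at a right angle to p
opposite = np.array([-3.0, 0.0])        # points the opposite way

print(cosine_sim(p, same_direction))  # 1.0
print(cosine_sim(p, orthogonal))      # 0.0
print(cosine_sim(p, opposite))        # -1.0
```

Note that the lengths of the vectors do not affect the result; only their directions matter.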



<div style="background-color: #3352FF; color: white; padding: 10px; border-radius: 5px;">



### <span style="color: pink;">Task #1.</span>  Practice with Vector Operations

Consider the following two vectors in $\mathbb{R}^2$:

\begin{align}
\mathbf{a} = \begin{pmatrix} 3 \\ 4 \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}
\end{align}

Obtain the following:
- $\mathbf{a} + \mathbf{b}$
- $\mathbf{a} - \mathbf{b}$
- $\|\mathbf{a}\|$ and $\|\mathbf{b}\|$
- $\mathbf{a} \cdot \mathbf{b}$
- cosine similarity between $\mathbf{a}$ and $\mathbf{b}$
</div>

<div style="background-color: #3352FF; color: white; padding: 10px; border-radius: 5px;">



### <span style="color: pink;">Task #2.</span>  Use NumPy for Vector Operations

Define the vectors $\mathbf{a} = \begin{pmatrix} 3 \\ 4 \end{pmatrix}$ and $\mathbf{b} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$ using NumPy. Then, compute the following:

- $\mathbf{a} + \mathbf{b}$
- $\mathbf{a} - \mathbf{b}$
- $\|\mathbf{a}\|$ and $\|\mathbf{b}\|$
- $\mathbf{a} \cdot \mathbf{b}$
- cosine similarity between $\mathbf{a}$ and $\mathbf{b}$

You can use the `numpy.linalg.norm()` and `numpy.dot()` functions to compute the norm and the dot product, respectively.
</div>

In [None]:
# Define two vectors in R2
a = np.array([3, 4])
b = np.array([1, 2])

# Vector Addition
a_plus_b = # TODO <--- write your code here

# Vector Subtraction
a_minus_b = # TODO <--- write your code here

# Norm (Magnitude) of vectors
norm_a = # TODO <--- write your code here
norm_b = # TODO <--- write your code here

# Inner Product (Dot Product)
inner_product = # TODO <--- write your code here

# Cosine Similarity
cosine_similarity = # TODO <--- write your code here

# Print results
print(f"Vector a: {a}")
print(f"Vector b: {b}")
print(f"Vector Addition (a + b): {a_plus_b}")
print(f"Vector Subtraction (a - b): {a_minus_b}")
print(f"Norm of a: {norm_a}")
print(f"Norm of b: {norm_b}")
print(f"Inner Product (a · b): {inner_product}")
print(f"Cosine Similarity: {cosine_similarity}")



## Working with Audio

An audio file is a sequence of values recorded over time; in technical terms, it is time-series data. If you were to open an MP3 file with a text editor, you would see a series of seemingly random characters. These are the bytes of the encoded (compressed) audio, which must be decoded to recover the audio samples that make up the time series. To properly read and play an audio file, we need to understand some formatting details, but let's not worry about that for now. The crucial parameter we need is the sample rate (SR), which denotes the number of samples per second. For example, if the SR is 44100, then there are 44100 samples per second. Higher SR values generally indicate better audio quality.
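
To make the link between samples, sample rate, and duration concrete, here is a minimal sketch that builds a synthetic tone as a plain NumPy vector. The duration and frequency are hypothetical values chosen for illustration; only the sample rate of 44100 comes from the text above.

```python
import numpy as np

sr = 44100          # sample rate in samples per second
duration_s = 2.0    # hypothetical clip length in seconds
freq_hz = 440.0     # hypothetical tone frequency in Hz

# One time stamp per sample: 0, 1/sr, 2/sr, ...
num_samples = int(sr * duration_s)
t = np.arange(num_samples) / sr

# The audio signal is just a 1-D vector of amplitudes
tone = 0.5 * np.sin(2 * np.pi * freq_hz * t)

print(tone.shape)       # (88200,)
print(len(tone) / sr)   # 2.0 (duration in seconds = number of samples / SR)
```

Because the signal is an ordinary vector, the vector operations from the previous section apply to it directly. In a notebook, `Audio(tone, rate=sr)` from `IPython.display` should play the result.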

The [Librosa](https://librosa.org/doc/latest/index.html) library offers a diverse set of functionalities for reading, analyzing, and manipulating audio data. Utilizing Librosa, we can load audio files in various formats, including MP3 or WAV.

First, install Librosa and then study the code below.

In [None]:
# !pip install librosa


# Librosa is a library for audio analysis
# https://librosa.org/doc/latest/index.html
# with this library we can load audio files, extract features, and perform analysis
import librosa
from IPython.display import Audio

# Path to the MP3 file
audio_file = "data/Warm-Memories.mp3"

# Load the audio file
audio, sr = librosa.load(audio_file, sr=None)


# Display the audio player
Audio(audio_file)

<div style="background-color: #3352FF; color: white; padding: 10px; border-radius: 5px;">



### <span style="color: pink;">Task #3.</span>  Be a conductor!
Use the code below to read the `piano.mp3` and `drum.mp3` files. Your tasks are:

1. Play them simultaneously.
2. Play the drums first, then bring in the piano after 10 seconds.
3. Make the piano louder than the drums.
</div>

In [None]:
# Load the first audio file
y1, sr1 = librosa.load('data/piano.mp3', sr=None)

# Load the second audio file
y2, sr2 = librosa.load('data/drum.mp3', sr=None)
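
# Truncate both signals to the same length so they can be added sample by sample
# (this assumes both files share the same sample rate, i.e. sr1 == sr2)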
min_len = min(len(y1), len(y2))
y1 = y1[:min_len]
y2 = y2[:min_len]

In [None]:
# 1. Play both audio clips simultaneously
y_superimposed1 = # TODO <--- write your code here
y_superimposed1 = y_superimposed1 / np.max(np.abs(y_superimposed1)) # Normalize the audio

# Display the audio player
Audio(y_superimposed1, rate=sr1)

In [None]:
# 2. Play the drums first, then bring in the piano after 10 seconds
y_superimposed2 = # TODO <--- write your code here
y_superimposed2 = y_superimposed2 / np.max(np.abs(y_superimposed2)) # Normalize the audio
# Display the audio player
Audio(y_superimposed2, rate=sr1)

In [None]:
# 3. Make the piano louder than the drums

y_superimposed3 = # TODO <--- write your code here
y_superimposed3 = y_superimposed3 / np.max(np.abs(y_superimposed3))

# Display the audio player
Audio(y_superimposed3, rate=sr1)

## Working with Text Data

We will learn how text data can be represented as vectors using a Language Model (LM) known as [BERT](https://arxiv.org/abs/1810.04805). A language model is a statistical model that assigns probabilities to sequences of words. We will study LMs in depth later in the course; for now, our focus is on how text data can be represented as vectors.

### Tokenization
Tokenization is the process of splitting text into smaller units called tokens. These tokens can be words, subwords, or characters. When a language model is trained, a vocabulary is built from the words and subwords that occur most frequently in the training data. Each token in the vocabulary is assigned a unique integer ID. During tokenization, each token in the text is replaced with its corresponding ID.
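
The idea of mapping tokens to integer IDs can be sketched with a toy word-level tokenizer. Everything here is hypothetical: the vocabulary `toy_vocab`, its IDs, and the helper `toy_tokenize` are made up for illustration and are much simpler than the subword tokenizer that BERT actually uses.

```python
# Hypothetical toy vocabulary; a real tokenizer's vocabulary is learned from data
toy_vocab = {"[UNK]": 0, "hello": 1, "my": 2, "friend": 3, "world": 4}

def toy_tokenize(text, vocab):
    """Lower-case, split on whitespace, and map each token to its integer ID."""
    tokens = text.lower().split()
    # Tokens missing from the vocabulary fall back to the [UNK] (unknown) ID
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    return tokens, ids

tokens_demo, ids_demo = toy_tokenize("Hello my friend", toy_vocab)
print(tokens_demo)  # ['hello', 'my', 'friend']
print(ids_demo)     # [1, 2, 3]

# "cat" is not in the toy vocabulary, so it maps to the [UNK] ID
_, ids_unknown = toy_tokenize("Hello my cat", toy_vocab)
print(ids_unknown)  # [1, 2, 0]
```

A subword tokenizer reduces the need for an unknown token by splitting rare words into smaller pieces that are in the vocabulary.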


BERT-base is a 12-layer neural network with 110M parameters, introduced by Google in 2018. It is trained on English text data and can be fine-tuned on specific tasks such as text classification, question-answering, and named entity recognition.

You need to install the transformers library to use the BERT model. You can install it using the following command:

```!pip install transformers```

In [None]:
from transformers import BertTokenizer, BertModel
import torch
# Function to get BERT embeddings for a word
def get_tokens_and_embeddings(text, model, tokenizer):
    tokens = tokenizer(text, return_tensors='pt')
    token_ids = tokens['input_ids']
    with torch.inference_mode():
        token_embs = model.embeddings.word_embeddings(token_ids)
    # Position 0 holds the special [CLS] token, so position 1 is the
    # first token of each word; keep only that embedding
    token_embs = token_embs[:, 1, :]
    return token_ids.squeeze().detach().numpy(), token_embs.squeeze().detach().numpy()


# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
LM_model = BertModel.from_pretrained('bert-base-uncased')





In [None]:
text = "Hello my friend"
tokens = tokenizer(text, return_tensors='pt')['input_ids'].squeeze().detach().numpy()

print(f"Tokens: {tokens}")


Note that the tokenizer has some special tokens such as `[CLS]` and `[SEP]` that are used to mark the beginning and end of a sentence, respectively.

<div style="background-color: #3352FF; color: white; padding: 10px; border-radius: 5px;">



### <span style="color: pink;">Task #4.</span>  Use BERT to embed 

Consider the following words: "france", "germany", "italy", "apple", "orange", "banana", "truck", "car", "bus". Use the BERT model to embed these words into vectors. Use your previous code to create a similarity matrix between these words.


You can use the following two functions ```cosine_similarity``` and ```plot_similarity_matrix``` to compute the cosine similarity and plot the similarity matrix, respectively.

</div>

In [None]:
def cosine_similarity(a, b=None):
    """
    Calculates the cosine similarity between two arrays.

    Parameters:
    a (ndarray): A 2D array of shape (m, n) where each row represents a vector.
    b (ndarray, optional): A 2D array of shape (k, n) where each row represents a vector. If None, defaults to `a`.

    Returns:
    ndarray: A 2D array of shape (m, k) containing the cosine similarity scores.

    This function computes the cosine similarity between each pair of vectors in `a` and `b`. 
    If `b` is not provided, it computes the cosine similarity between each pair of vectors in `a`.

    The cosine similarity is calculated as the dot product of the vectors divided by the product of their magnitudes.

    # Example usage:
    a = np.random.rand(5, 4)  # Random array of shape (5, 4)
    b = np.random.rand(3, 4)  # Random array of shape (3, 4)
    similarity_matrix = cosine_similarity(a, b)
    print("Cosine Similarity Matrix:\n", similarity_matrix)
    """

    if b is None:
        b = a
    
    # Compute the dot product between each pair of vectors in `a` and `b`
    a_times_b = np.dot(a, b.T)  # Shape will be (m, k)
    
    # Compute the L2 norm (magnitude) of each vector in `a` and `b`
    norm_a = np.linalg.norm(a, axis=1, keepdims=True)  # Shape will be (m, 1)
    norm_b = np.linalg.norm(b, axis=1, keepdims=True)  # Shape will be (k, 1)
    
    # Compute the cosine similarity
    cos_sim = a_times_b / (norm_a * norm_b.T)  # Shape will be (m, k)
    
    return cos_sim



def plot_similarity_matrix(words, similarity_matrix):
    """
    Plots a normalized similarity matrix using matplotlib.

    Parameters:
    words (list of str): List of words corresponding to the similarity matrix.
    similarity_matrix (ndarray): A square matrix of similarity scores between words.

    This function normalizes each row of the similarity matrix, plots it using a heatmap, 
    and annotates each cell with the original similarity value up to three decimal places.

    # Example usage
    words = ["word1", "word2", "word3", "word4", "word5"]
    similarity_matrix = cosine_similarity(np.random.rand(5, 4))
    plot_similarity_matrix(words, similarity_matrix)
    """

    # Normalize each row of the similarity matrix
    row_max = similarity_matrix.max(axis=1, keepdims=True)
    row_min = similarity_matrix.min(axis=1, keepdims=True)
    normalized_similarity_matrix = (similarity_matrix - row_min) / (row_max - row_min)

    # Plotting the normalized similarity matrix using matplotlib
    fig, ax = plt.subplots(figsize=(10, 8))
    cax = ax.matshow(normalized_similarity_matrix, cmap='plasma')

    # Add color bar to the plot
    plt.colorbar(cax)

    # Set up the axes with words
    ax.set_xticks(np.arange(len(words)))
    ax.set_yticks(np.arange(len(words)))

    # Label the axes with the words
    ax.set_xticklabels(words)
    ax.set_yticklabels(words)

    # Rotate the tick labels and set their alignment for better readability
    plt.setp(ax.get_xticklabels(), rotation=45, ha="left", rotation_mode="anchor")

    # Annotate each cell with the original similarity value up to three decimal places
    for i in range(len(words)):
        for j in range(len(words)):
            ax.text(j, i, f"{similarity_matrix[i, j]:.3f}", ha="center", va="center", color="black")

    # Show the plot
    plt.show()


In [None]:
words = # TODO <--- write your code here
tokens, token_embs = get_tokens_and_embeddings(words, LM_model, tokenizer)

# Calculate cosine similarity matrix
similarity_matrix = # TODO <--- write your code here


plot_similarity_matrix(words, similarity_matrix)

<div style="background-color: #3352FF; color: white; padding: 10px; border-radius: 5px;">



### <span style="color: pink;">Task #5.</span>  Vector Arithmetic 

Consider this analogy: "The relation of man to woman is as king to ?".

As a human, you can easily identify that the answer is "queen".

Your task is to demonstrate this relationship using vector algebra with pre-trained BERT embeddings. Follow these steps:

1. Embed the words "man", "woman", "king", "queen", "france", "germany", "italy", "apple", "orange", "banana", "truck", "car", and "bus" using BERT.
2. Perform a vector operation that uses the embeddings of "man", "woman", and "king" to find the missing word in the analogy.
3. Calculate the cosine similarity between the resulting vector and the embeddings of "queen", "france", "germany", "italy", "apple", "orange", "banana", "truck", "car", and "bus".
4. Print out their similarities.



</div>

In [None]:
# Words for the analogy and additional words for comparison
words = ["man", "woman", "king", "queen", 
         "france", "germany", "italy", 
         "apple", "orange", "banana", 
         "truck", "car", "bus"]


# TODO <--- write your code here

## Conclusion

In this notebook, we have explored some concepts in vector algebra and their applications in data processing. Here are the key takeaways:

- **Vector Algebra**: We reviewed how to perform basic vector operations and discussed vector norms, inner products, and cosine similarity.

- **Continuous/Discrete Data**: We learned how to work with audio data and text data. The former is an example of continuous data, while the latter is an example of discrete data.

We will see you next week with more exciting topics. Until then, happy learning!
