Please use Markdown cells in your submission to document your thought process. You are expected to follow the clean code and PEP 8 guidelines as much as you can. You should use docstrings for all function declarations.
In this assignment, you will learn several functions from numpy. Please check what they do using their documentation or the help() function, and see if you can apply them to solve the homework problems.

# Cosine Similarity
Cosine similarity measures the similarity between two high-dimensional vectors. It is widely used in applications such as clustering in machine learning and recommendation systems for e-commerce companies. See [Wikipedia](https://en.wikipedia.org/wiki/Cosine_similarity) for more background on cosine similarity.

Write a function named "cosine_similarity". The function takes two 1D numpy arrays as inputs, and returns their cosine similarity. 

**Note**: You are allowed to use np.dot() for the inner product, np.sum() for summation, and np.sqrt() for the square root. Do not use other built-in functions from NumPy such as np.linalg.norm().


In [None]:
import numpy as np

def cosine_similarity(vec1, vec2):
    """
    Calculate the cosine similarity between two 1D numpy arrays.
    
    Cosine similarity measures the cosine of the angle between two vectors
    
    Parameters:
    vec1 : numpy.ndarray
    vec2 : numpy.ndarray
        
    Returns:
    float --> the cosine similarity between vec1 and vec2
        
    """
    # Calculate the dot product of the two vectors
    dot_product = np.dot(vec1, vec2)
    
    # Calculate the magnitude of vec1
    # Magnitude = sqrt(sum of squared elements)
    magnitude_vec1 = np.sqrt(np.sum(vec1 ** 2))
    
    # Calculate the magnitude of vec2
    magnitude_vec2 = np.sqrt(np.sum(vec2 ** 2))
    
    # Calculate cosine similarity
    # Handle division by zero
    if magnitude_vec1 == 0 or magnitude_vec2 == 0:
        return 0.0
    
    similarity = dot_product / (magnitude_vec1 * magnitude_vec2)
    
    return similarity

In [3]:
# Test case 1: Similar vectors
vec_a = np.array([1, 2, 3])
vec_b = np.array([4, 5, 6])
print(f"Test 1 - Similar vectors: {cosine_similarity(vec_a, vec_b)}")

# Test case 2: Identical vectors
vec_c = np.array([1, 1, 1])
vec_d = np.array([1, 1, 1])
print(f"Test 2 - Identical vectors: {cosine_similarity(vec_c, vec_d)}")

# Test case 3: Orthogonal vectors
vec_e = np.array([1, 0])
vec_f = np.array([0, 1])
print(f"Test 3 - Orthogonal vectors: {cosine_similarity(vec_e, vec_f)}")

# Test case 4: Opposite vectors
vec_g = np.array([1, 2, 3])
vec_h = np.array([-1, -2, -3])
print(f"Test 4 - Opposite vectors: {cosine_similarity(vec_g, vec_h)}")

Test 1 - Similar vectors: 0.9746318461970762
Test 2 - Identical vectors: 1.0000000000000002
Test 3 - Orthogonal vectors: 0.0
Test 4 - Opposite vectors: -1.0
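
Test 2 prints a value slightly above 1 rather than exactly 1, most likely because of rounding in the floating-point square roots. A minimal sketch of a tolerance-based check follows; `cos_sim` is a hypothetical compact helper introduced only for this illustration, and it assumes non-zero input vectors.

```python
import numpy as np

def cos_sim(u, v):
    """Compact cosine similarity (hypothetical helper, non-zero inputs assumed)."""
    return np.dot(u, v) / (np.sqrt(np.sum(u ** 2)) * np.sqrt(np.sum(v ** 2)))

ones = np.array([1, 1, 1])
result = cos_sim(ones, ones)

# Compare with a tolerance instead of ==, since the result may differ
# from 1.0 by a few units in the last decimal place
is_one = np.isclose(result, 1.0)

# Cosine similarity depends only on direction, so rescaling one
# vector by a positive factor should leave the value unchanged
scaled = cos_sim(np.array([1, 2, 3]), 10 * np.array([4, 5, 6]))
```

Using `np.isclose` in test assertions avoids spurious failures from values such as 1.0000000000000002.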


# Stock Price Analysis--Part I
Write a function named "count_max_streak". The function takes the historical stock price data, represented as a 1D numpy array, as input, and returns the maximum number of consecutive days when the stock price increased. 

To test your function, please use `tesla_closing_price` as the 1D numpy array input to your function. The historical stock price of Tesla is provided for you to test your code.

```py
import pandas as pd
tesla = pd.read_csv('TSLA.csv') # load Tesla stock price from csv file
tesla_np = tesla.to_numpy() # convert the data to numpy arrays
tesla_closing_price = tesla_np[:, 4] # extract the closing stock price of Tesla
```

In [6]:
import numpy as np
import pandas as pd

def count_max_streak(prices):
    """
    Count the maximum number of consecutive days when stock price increased.
    
    Parameters: 
    prices --> numpy.ndarray
        
    Returns: 
    int --> Maximum number of consecutive days with price increases
    
    """
    # Handle edge cases
    if len(prices) <= 1:
        return 0
    
    max_streak = 0  # Maximum streak found so far
    current_streak = 0  # Current consecutive increase count
    
    # Iterate through prices
    for i in range(1, len(prices)):

        # Check if price increased from previous day
        if prices[i] > prices[i - 1]:
            current_streak += 1
            max_streak = max(max_streak, current_streak)

        else:
            # Price didn't increase, reset current streak
            current_streak = 0
    
    return max_streak

In [8]:
# Test case 1: Simple increasing sequence
test_prices_1 = np.array([100, 105, 110, 115, 120])
print(f"Test 1: {count_max_streak(test_prices_1)}")
# Expected: 4 (four consecutive increases)

# Test case 2: Mixed pattern
test_prices_2 = np.array([100, 105, 110, 108, 112, 115, 120])
print(f"Test 2: {count_max_streak(test_prices_2)}")
# Expected: 3 (longest streak is 108->112->115->120)

# Test case 3: All decreasing
test_prices_3 = np.array([120, 115, 110, 105, 100])
print(f"Test 3: {count_max_streak(test_prices_3)}")
# Expected: 0

# Test case 4: No change
test_prices_4 = np.array([100, 100, 100, 100])
print(f"Test 4: {count_max_streak(test_prices_4)}")
# Expected: 0

# Test case 5: Single element
test_prices_5 = np.array([100])
print(f"Test 5: {count_max_streak(test_prices_5)}")
# Expected: 0

Test 1: 4
Test 2: 3
Test 3: 0
Test 4: 0
Test 5: 0
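
The same streak count can also be computed without an explicit Python loop. The sketch below is an alternative, not the approach used above; `count_max_streak_vectorized` is a hypothetical name, and the conversion to `float` assumes the prices are numeric.

```python
import numpy as np

def count_max_streak_vectorized(prices):
    """Longest run of consecutive day-over-day increases (hypothetical variant)."""
    prices = np.asarray(prices, dtype=float)
    if prices.size <= 1:
        return 0

    # True wherever a price is higher than the previous day's price
    increased = np.diff(prices) > 0

    # Pad with False on both sides so every run of True has a start and an end
    padded = np.concatenate(([False], increased, [False]))
    changes = np.diff(padded.astype(int))

    starts = np.where(changes == 1)[0]   # positions where a run begins
    ends = np.where(changes == -1)[0]    # positions just past where a run ends
    if starts.size == 0:
        return 0
    return int(np.max(ends - starts))

mixed = count_max_streak_vectorized([100, 105, 110, 108, 112, 115, 120])
```

On the five test arrays above, this variant should agree with the loop-based function.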


# Stock Price Analysis--Part 2
Write a function named "detect_crash". The function takes a 2D numpy array as input, where the first column represents the historical opening stock price, and the second column represents the historical closing stock price. The function should return a 2D numpy array whose first column contains the indices at which crashes occurred, and whose second column contains the amount of the price drop.

**Note**: We say there is a stock price crash if the closing price is less than the opening price.

**Hint**: You may find functions np.where() and np.column_stack() helpful. You are also welcome to come up with other solutions without using these functions. 
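
To make the hint concrete, here is a small illustration of what the two functions return on made-up prices. The arrays `opening` and `closing` are invented for this example and are not taken from the Tesla data.

```python
import numpy as np

# Made-up prices for four days (illustration only)
opening = np.array([10.0, 12.0, 11.0, 9.0])
closing = np.array([11.0, 11.5, 11.0, 8.0])

# np.where on a 1D boolean array returns a tuple; element 0 holds the indices
crash_idx = np.where(closing < opening)[0]

# Index both arrays with the crash indices to get the size of each drop
drops = opening[crash_idx] - closing[crash_idx]

# np.column_stack places the two 1D arrays side by side as columns
table = np.column_stack((crash_idx, drops))
```

Note that the integer indices are converted to floats when stacked next to the float price drops, which is why an `int(...)` cast is needed when printing them as whole numbers.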

To test your function, please use `open_close_price` as the 2D numpy array input to your function. You can download the historical stock price of Tesla here.

```py
import pandas as pd
tesla = pd.read_csv('TSLA.csv') # load Tesla stock price from csv file
tesla_np = tesla.to_numpy() # convert the data to numpy arrays
tesla_opening_price = tesla_np[:, 1] # extract the opening stock price of Tesla
tesla_closing_price = tesla_np[:, 4] # extract the closing stock price of Tesla
open_close_price = np.column_stack((tesla_opening_price, tesla_closing_price)) # form a 2D numpy array using opening and closing prices 
```

In [10]:
def detect_crash(price_data):
    """
    Detect stock price crashes and calculate the amount of price drop.
    
    A crash is defined as a day when the closing price is less than 
    the opening price.
    
    Parameters: 
    price_data : numpy.ndarray
        2D numpy array where column 0 holds opening prices and column 1 holds closing prices
        
    Returns:
    numpy.ndarray
        2D numpy array where column 0 holds the indices of crash days and column 1 holds the price drop (opening - closing)

    """
    
    # Get opening and closing prices from the 2D array
    opening_prices = price_data[:, 0]
    closing_prices = price_data[:, 1]
    
    # Find indices where closing price < opening price (crash condition)
    crash_indices = np.where(closing_prices < opening_prices)[0]
    
    # Calculate price drops for crash days
    price_drops = opening_prices[crash_indices] - closing_prices[crash_indices]
    
    # Combine indices and price drops into a 2D array and handle no crash case
    if len(crash_indices) == 0:
        return np.array([]).reshape(0, 2)
    
    result = np.column_stack((crash_indices, price_drops))
    
    return result

In [12]:
# Load Tesla stock price data (got it from Kaggle: https://www.kaggle.com/datasets/varpit94/tesla-stock-data-updated-till-28jun2021?resource=download)
tesla = pd.read_csv('TSLA.csv')  # load Tesla stock price from csv file
tesla_np = tesla.to_numpy()  # convert the data to numpy arrays
tesla_opening_price = tesla_np[:, 1]  # get opening stock price of Tesla
tesla_closing_price = tesla_np[:, 4]  # get closing stock price of Tesla

# Form a 2D numpy array using opening and closing prices
open_close_price = np.column_stack((tesla_opening_price, tesla_closing_price))

# Test the function with Tesla data
crash_data = detect_crash(open_close_price)

print(f"Total number of crash days: {len(crash_data)}")
print(f"\nFirst 10 crash days:")
print(f"{'Index':<10} {'Price Drop':<15}")
print("-" * 25)

# Used chatGPT to help format the table nicer
for i in range(min(10, len(crash_data))):
    print(f"{int(crash_data[i, 0]):<10} ${crash_data[i, 1]:<14.2f}")

# Summary statistics
if len(crash_data) > 0:
    print(f"\nCrash Statistics:")
    print(f"Average price drop: ${np.mean(crash_data[:, 1]):.2f}")
    print(f"Maximum price drop: ${np.max(crash_data[:, 1]):.2f}")
    print(f"Minimum price drop: ${np.min(crash_data[:, 1]):.2f}")

Total number of crash days: 1481

First 10 crash days:
Index      Price Drop     
-------------------------
1          $0.39          
2          $0.61          
3          $0.76          
4          $0.78          
5          $0.12          
7          $0.04          
8          $0.18          
11         $0.01          
12         $0.01          
14         $0.31          

Crash Statistics:
Average price drop: $3.22
Maximum price drop: $150.10
Minimum price drop: $0.00


# Infectious Disease Simulation

Write a function named "simulate_disease" to simulate how an infectious disease may propagate over a network. Given the infection probabilities $P(t)$ of all individuals at time $t$ and a network connection matrix $A$, the infection probabilities at time step $t+1$ are computed as $P(t+1) = A P(t)$. Here are the requirements of this function:

- The function uses a 1D numpy array to denote the probabilities of all individuals within the network of being infected.
- The network connection among individuals is represented using a 2D symmetric, row-stochastic numpy array, with all entries being non-negative (a row-stochastic matrix is one whose elements add up to 1 in each row). If there are N individuals in the network, the matrix is of dimension $N \times N$. The $(i, j)$-th entry of the matrix represents the connection strength between individuals $i$ and $j$.
- Given the initial probabilities, the network connections, and the prediction time horizon, the function returns the probability of each individual in the network being infected at the end of the prediction time horizon.

You can use the following code snippet to generate a random connection matrix of dimension $N \times N$ to verify your function:
```py
np.random.seed(20)
matrix = np.random.rand(N, N)
symmetric_matrix = (matrix + matrix.T) / 2
connection_matrix = symmetric_matrix / symmetric_matrix.sum(axis=1, keepdims=True)
```
You can use the following code snippet to generate an initial infection probability: `np.random.rand(N, 1)`

In [None]:
import numpy as np

def simulate_disease(initial_prob, connection_matrix, time_horizon):
    """
    Simulate infectious disease propagation over a network.
    
    The simulation uses the iterative formula: P(t+1) = A x P(t)
    where A is the connection matrix and P(t) is the infection probability
    vector at time t.
    
    Parameters:
    initial_prob : numpy.ndarray
        1D numpy array of shape (N,) representing initial infection 
        probabilities for N individuals in the network.
        
    connection_matrix : numpy.ndarray
        2D numpy array of shape (N, N) representing network connections.
        
    time_horizon : int
        Number of time steps to simulate the disease propagation.
        
    Returns:
    numpy.ndarray
        1D numpy array of shape (N,) containing infection probabilities
        for each individual after the specified time horizon.
        
    """
    # Make a copy of the initial probabilities to avoid modifying the input
    current_prob = initial_prob.copy()
    
    # Simulate disease propagation for the specified time horizon
    for t in range(time_horizon):
        current_prob = np.dot(connection_matrix, current_prob)
    
    return current_prob
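
Because each step multiplies by the same matrix, iterating the update T times should give the same result as applying the T-th matrix power once. The sketch below checks this on a small random example; the seed, the size `N = 4`, and `steps = 6` are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for the sketch
N, steps = 4, 6

raw = rng.random((N, N))
A = raw / raw.sum(axis=1, keepdims=True)  # row-stochastic: each row sums to 1
p0 = rng.random(N)                        # 1D initial probabilities

# Step-by-step propagation, one matrix-vector product per time step
p_loop = p0.copy()
for _ in range(steps):
    p_loop = A @ p_loop

# Equivalent form: raise A to the power `steps`, then apply it once
p_power = np.linalg.matrix_power(A, steps) @ p0
```

Since every row of `A` is non-negative and sums to 1, each updated entry is a weighted average of the previous entries, so the values stay between the smallest and largest initial probabilities.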

In [14]:
# Set random seed for reproducibility
np.random.seed(20)

# Define network size
N = 5

# Generate random connection matrix
matrix = np.random.rand(N, N)
symmetric_matrix = (matrix + matrix.T) / 2
connection_matrix = symmetric_matrix / symmetric_matrix.sum(axis=1, keepdims=True)

# Verify the connection matrix properties
print("Connection Matrix:")
print(connection_matrix)
print(f"\nIs symmetric: {np.allclose(connection_matrix, connection_matrix.T)}")
print(f"Row sums (should all be 1): {connection_matrix.sum(axis=1)}")
print()

# Generate initial infection probability
initial_infection_prob = np.random.rand(N)
print("Initial Infection Probabilities:")
print(initial_infection_prob)
print()

# Run simulation for different time horizons
time_horizons = [0, 1, 5, 10, 20]

for horizon in time_horizons:
    final_prob = simulate_disease(initial_infection_prob, connection_matrix, horizon)
    print(f"Time horizon = {horizon}:")
    print(f"Final probabilities: {final_prob}")
    print(f"Average infection probability: {np.mean(final_prob):.4f}")
    print()

Connection Matrix:
[[0.20724897 0.28005359 0.20506162 0.15020478 0.15743104]
 [0.28887794 0.13764649 0.2248394  0.14078764 0.20784852]
 [0.16853888 0.17914921 0.22677643 0.2319319  0.19360359]
 [0.2058796  0.1870772  0.3867892  0.11554263 0.10471136]
 [0.16706943 0.21383576 0.24997946 0.081072   0.28804336]]

Is symmetric: False
Row sums (should all be 1): [1. 1. 1. 1. 1.]

Initial Infection Probabilities:
[0.49238104 0.63125307 0.83949792 0.4610394  0.49794007]

Time horizon = 0:
Final probabilities: [0.49238104 0.63125307 0.83949792 0.4610394  0.49794007]
Average infection probability: 0.5844

Time horizon = 1:
Final probabilities: [0.5986205  0.58628476 0.5897849  0.64958269 0.60790925]
Average infection probability: 0.6064

Time horizon = 5:
Final probabilities: [0.60340002 0.60339756 0.60339162 0.60342335 0.603401  ]
Average infection probability: 0.6034

Time horizon = 10:
Final probabilities: [0.60340112 0.60340112 0.60340112 0.60340112 0.60340112]
Average infection probability:

# Music Composition

Write a function named "compose_music" that takes a music sheet in A major scale as input and plays a piece of music based on the music sheet. The music sheet specifies a sequence of notes and their durations (see below for an example). You can use the in-class example "generate_sine" function to generate music notes with `fs = 8000`.

You can test your code using the example music_sheet below (simplified from "A Better Tomorrow (Mark's theme)" by Joseph Koo).
```py
music_sheet = [("Note_Cs", 0.5), ("Note_D", 0.5), ("Note_B", 0.5), ("Note_A_high", 1.5), ("Note_A_high", 0.5), ("Note_Gs", 0.5), ("Note_Fs", 0.5), ("Note_Gs", 0.5), ("Note_E", 0.75)]   
```
**Extra challenge**: Can you revise the function to play chords based on a given music sheet? A chord refers to the combination of two music notes. (The extra challenge is not graded.)
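
One possible way to approach the extra challenge is to add the sine waves of the notes in a chord sample by sample. The sketch below assumes that approach; `generate_chord` is a hypothetical helper, and `generate_sine` is restated here in compact form only so the block runs on its own.

```python
import numpy as np

def generate_sine(freq, duration, fs=8000):
    """Sine wave of the given frequency (Hz) and duration (s)."""
    t = np.arange(int(fs * duration)) / fs
    return np.sin(2 * np.pi * freq * t)

def generate_chord(freqs, duration, fs=8000):
    """Hypothetical helper: mix several notes into one signal.

    Averaging (rather than summing) keeps the amplitude within [-1, 1].
    """
    waves = [generate_sine(f, duration, fs) for f in freqs]
    return np.mean(waves, axis=0)

# A and C# played together for half a second
chord = generate_chord([440.00, 554.37], 0.5)
```

A music sheet for chords could then hold a tuple of note names in place of a single name, though that format is an assumption and not part of the assignment.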

In [27]:
import numpy as np
from IPython.display import Audio, display

def generate_sine(freq, duration, fs=8000):
    """
    Generate a sine wave for a given frequency and duration.
    
    Parameters
    freq : float
        Frequency of the sine wave in Hz
    duration : float
        Duration of the sine wave in seconds
    fs : int, optional
        Sampling frequency in Hz (default is 8000)
        
    Returns
    numpy.ndarray
        1D array containing the sine wave samples
        
    """
    # Calculate number of samples needed
    num_samples = int(fs * duration)
    
    # Generate time array
    t = np.arange(num_samples) / fs
    
    # Generate sine wave
    sine_wave = np.sin(2 * np.pi * freq * t)
    
    return sine_wave


def compose_music(music_sheet, fs=8000):
    """
    Compose and generate music based on a music sheet.
    
    The function takes a music sheet specifying notes and their durations,
    and generates a continuous audio signal
    
    Parameters
    music_sheet : list of tuples
        List where each tuple contains (note_name, duration).
        - note_name (str): Name of the note (e.g., "Note_A", "Note_B")
        - duration (float): Duration of the note in seconds
        
    fs : int, optional
        Sampling frequency in Hz (default is 8000)
        
    Returns
    numpy.ndarray
        1D array containing the complete music signal

    """
    
    # Define frequencies for notes in A major scale
    note_frequencies = {
        "Note_A": 440.00,      # A4
        "Note_B": 493.88,      # B4
        "Note_Cs": 554.37,     # C#5
        "Note_D": 587.33,      # D5
        "Note_E": 659.25,      # E5
        "Note_Fs": 739.99,     # F#5
        "Note_Gs": 830.61,     # G#5
        "Note_A_high": 880.00  # A5
    }
    
    # Initialize empty list to store all note signals
    music_signal = []
    
    # Generate sine wave for each note in the music sheet
    for note_name, duration in music_sheet:
        # Get frequency for the note
        if note_name in note_frequencies:
            freq = note_frequencies[note_name]
            
            # Generate sine wave for this note
            note_signal = generate_sine(freq, duration, fs)
            
            # Add to music signal list
            music_signal.append(note_signal)
        else:
            print(f"Warning: Unknown note '{note_name}' - skipping")
    
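    # Concatenate all notes into a single signal; np.concatenate raises
    # ValueError on an empty list, so return an empty array in that case
    if not music_signal:
        return np.array([])
    complete_music = np.concatenate(music_signal)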
    
    return complete_music
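
The hard-coded frequencies are consistent with twelve-tone equal temperament tuned to A4 = 440 Hz, where a note n semitones above A4 has frequency 440 * 2^(n/12). The short check below assumes that tuning and the major-scale semitone offsets 0, 2, 4, 5, 7, 9, 11, 12.

```python
import numpy as np

note_names = ["Note_A", "Note_B", "Note_Cs", "Note_D",
              "Note_E", "Note_Fs", "Note_Gs", "Note_A_high"]

# Semitone distance of each A major scale note above A4
semitones = np.array([0, 2, 4, 5, 7, 9, 11, 12])

# Equal temperament: each semitone multiplies the frequency by 2 ** (1 / 12)
frequencies = 440.0 * 2.0 ** (semitones / 12)
computed = dict(zip(note_names, frequencies))
```

Computing the table this way makes it straightforward to extend the dictionary to other octaves or scales.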

In [28]:
# Test music sheet
music_sheet = [
    ("Note_Cs", 0.5), 
    ("Note_D", 0.5), 
    ("Note_B", 0.5), 
    ("Note_A_high", 1.5), 
    ("Note_A_high", 0.5), 
    ("Note_Gs", 0.5), 
    ("Note_Fs", 0.5), 
    ("Note_Gs", 0.5), 
    ("Note_E", 0.75)
]

# Generate the music
music = compose_music(music_sheet)

print(f"Total duration: {len(music) / 8000:.2f} seconds")
print(f"Total samples: {len(music)}")
print(f"Music signal shape: {music.shape}")

# Play the music!
display(Audio(music, rate=8000, autoplay=True))

Total duration: 5.75 seconds
Total samples: 46000
Music signal shape: (46000,)


# Rock Paper Scissors
Write a function named `play_rock_paper_scissor` so that a user can play the Rock Paper Scissors game against the computer. Here are the expected functionalities:
- For each round of the game, the function should ask whether the user wants to start playing by inputting 0 or 1. An input of 0 indicates no, and 1 indicates yes.
- If the user wants to start the game, prompt the user to choose an action among Rock, Paper, and Scissors. For now, you can safely assume the user's input is always among these three options. Randomly generate an action among Rock, Paper, and Scissors for the computer. Please refer to our in-class practice problem for an example of random number generation.
- Compare the user input with the computer action, and print the winner based on the following rules:
    - Rock beats Scissors
    - Scissors beats Paper
    - Paper beats Rock
- Prompt the user whether the game should continue or not as we did in the first step.
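
The three rules can be stored in a dictionary that maps each action to the action it beats, which keeps the comparison short. This is a sketch of one possible design; `BEATS` and `decide_winner` are hypothetical names, and the inputs are assumed to be exactly "Rock", "Paper", or "Scissors".

```python
# Each key beats the value it maps to
BEATS = {"Rock": "Scissors", "Scissors": "Paper", "Paper": "Rock"}

def decide_winner(user_choice, computer_choice):
    """Return "tie", "user", or "computer" (hypothetical helper)."""
    if user_choice == computer_choice:
        return "tie"
    if BEATS[user_choice] == computer_choice:
        return "user"
    return "computer"

outcome = decide_winner("Paper", "Rock")
```

With the rules in one place, adding a new action only requires changing the dictionary rather than the chain of comparisons.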

In [29]:
import numpy as np

def play_rock_paper_scissor():
    """
    Play an interactive, standard Rock Paper Scissors game with the computer.
    
    Game Rules
    - Rock beats Scissors
    - Scissors beats Paper
    - Paper beats Rock
    
    User Input
    - 0: Stop playing
    - 1: Start/continue playing
    - Rock, Paper, or Scissors: User's choice for the round

    Used ChatGPT for better formatting and structure.
    
    """
    print("Welcome to Rock Paper Scissors!")
    print("=" * 40)
    
    # Track game statistics
    wins = 0
    losses = 0
    ties = 0
    
    # Main game loop
    while True:
        # Ask user if they want to play
        user_input = input("\nDo you want to play? (0 = No, 1 = Yes): ")
        
        # Check if user wants to continue
        if user_input == "0":
            print("\nThanks for playing!")
            print(f"Final Score - Wins: {wins}, Losses: {losses}, Ties: {ties}")
            break
        elif user_input == "1":
            # User wants to play
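            # Get user's choice; capitalize() normalizes inputs such as
            # "rock" or "ROCK" so they match the entries in `choices`
            user_choice = input(
                "Choose Rock, Paper, or Scissors: "
            ).strip().capitalize()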
            
            # Generate computer's random choice
            # 0 = Rock, 1 = Paper, 2 = Scissors
            choices = ["Rock", "Paper", "Scissors"]
            computer_choice_index = np.random.randint(0, 3)
            computer_choice = choices[computer_choice_index]
            
            # Display choices
            print(f"\nYou chose: {user_choice}")
            print(f"Computer chose: {computer_choice}")
            
            # Determine winner
            if user_choice == computer_choice:
                print("It's a tie!")
                ties += 1
            elif (user_choice == "Rock" and computer_choice == "Scissors"):
                print("You win! Rock beats Scissors")
                wins += 1
            elif (user_choice == "Scissors" and computer_choice == "Paper"):
                print("You win! Scissors beats Paper")
                wins += 1
            elif (user_choice == "Paper" and computer_choice == "Rock"):
                print("You win! Paper beats Rock")
                wins += 1
            else:
                # Computer wins
                if computer_choice == "Rock" and user_choice == "Scissors":
                    print("Computer wins! Rock beats Scissors")
                elif computer_choice == "Scissors" and user_choice == "Paper":
                    print("Computer wins! Scissors beats Paper")
                elif computer_choice == "Paper" and user_choice == "Rock":
                    print("Computer wins! Paper beats Rock")
                losses += 1
            
            # Display current score
            print(f"\nCurrent Score - Wins: {wins}, Losses: {losses}, Ties: {ties}")
        else:
            print("Invalid input. Please enter 0 or 1.")

In [31]:
play_rock_paper_scissor()

Welcome to Rock Paper Scissors!

You chose: Paper
Computer chose: Rock
You win! Paper beats Rock

Current Score - Wins: 1, Losses: 0, Ties: 0
Invalid input. Please enter 0 or 1.

You chose: Rock
Computer chose: Paper
Computer wins! Paper beats Rock

Current Score - Wins: 1, Losses: 1, Ties: 0
Invalid input. Please enter 0 or 1.

Thanks for playing!
Final Score - Wins: 1, Losses: 1, Ties: 0


# Linear Regression
- You are given a straight line learned from linear regression, denoted as $\hat{y}=ax+b$. Your task is to write a Python function named `eval_predict` to assess prediction quality over a dataset. Please first define the metric being used for assessment. Then let your program calculate and return the defined metric value.
- To learn a high-quality straight line in the form of $\hat{y}=ax+b$, we need to find optimal values for $a$ and $b$. In what follows, you will perform a grid search: try many $(a, b)$ pairs on a rectangular grid and pick the one with the best metric (defined in Problem 7) on the dataset. For example, given a ∈ {0.0, 0.5, 1.0} and b ∈ {-1.0, 0.0, 1.0}, you evaluate all 9 combinations, compute the metric for each, and pick the best. Write a Python function named `grid_search` to find the best pair for given options of $a$ and $b$.

In [32]:
import numpy as np

def eval_predict(a, b, x_data, y_data):
    """
    Evaluate the quality of linear regression predictions using Mean Squared Error.
    
    Parameters
    a : float
        Slope of the linear regression line
    b : float
        Intercept of the linear regression line
    x_data : numpy.ndarray
        1D array of input feature values
    y_data : numpy.ndarray
        1D array of actual target values
        
    Returns
    float
        Mean Squared Error (MSE) between predictions and actual values
    
    """
    # Calculate predicted values
    y_predicted = a * x_data + b
    
    # Calculate squared errors
    squared_errors = (y_data - y_predicted) ** 2
    
    # Calculate mean squared error
    mse = np.mean(squared_errors)
    
    return mse


def grid_search(a_options, b_options, x_data, y_data):
    """
    Perform grid search to find optimal parameters for linear regression.
    
    This function tries all possible combinations of slope (a) and intercept (b)
    values from the provided options, evaluates each combination using MSE,
    and returns the pair that gives the lowest error.
    
    Parameters
    a_options : list or numpy.ndarray
        List of possible values for the slope parameter
    b_options : list or numpy.ndarray
        List of possible values for the intercept parameter
    x_data : numpy.ndarray
        1D array of input feature values
    y_data : numpy.ndarray
        1D array of actual target values
        
    Returns
    tuple
        A tuple (best_a, best_b, best_mse) containing:
        - best_a (float): Optimal slope value
        - best_b (float): Optimal intercept value
        - best_mse (float): MSE achieved with optimal parameters
        
    """
    # Initialize variables to track the best parameters
    best_a = None
    best_b = None
    best_mse = np.inf  # Start with infinity (worst possible MSE)
    
    # Try all combinations of a and b
    for a in a_options:
        for b in b_options:
            # Evaluate this combination
            current_mse = eval_predict(a, b, x_data, y_data)
            
            # Update best parameters if this is better
            if current_mse < best_mse:
                best_mse = current_mse
                best_a = a
                best_b = b
    
    return best_a, best_b, best_mse

In [34]:
# Test Case 1: Perfect linear relationship (y = 2x + 1)
print("=" * 50)
print("Test Case 1: Perfect Linear Relationship")
print("=" * 50)

x_test1 = np.array([1, 2, 3, 4, 5])
y_test1 = 2 * x_test1 + 1  # y = 2x + 1

# Test eval_predict with correct parameters
mse_perfect = eval_predict(2, 1, x_test1, y_test1)
print(f"MSE with correct parameters (a=2, b=1): {mse_perfect}")

# Test eval_predict with wrong parameters
mse_wrong = eval_predict(1, 0, x_test1, y_test1)
print(f"MSE with wrong parameters (a=1, b=0): {mse_wrong}")

# Test grid search
a_options_1 = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
b_options_1 = [-1.0, 0.0, 0.5, 1.0, 1.5, 2.0]

best_a, best_b, best_mse = grid_search(a_options_1, b_options_1, x_test1, y_test1)
print(f"\nGrid Search Results:")
print(f"Best a: {best_a}")
print(f"Best b: {best_b}")
print(f"Best MSE: {best_mse}")

# Test Case 2: Noisy linear relationship
print("\n" + "=" * 50)
print("Test Case 2: Noisy Linear Relationship")
print("=" * 50)

np.random.seed(42)
x_test2 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y_test2 = 1.5 * x_test2 + 2 + np.random.normal(0, 0.5, 10)  # y ≈ 1.5x + 2 + noise

print(f"Data: x = {x_test2}")
print(f"      y = {y_test2.round(2)}")

a_options_2 = np.arange(0.5, 2.5, 0.1)  # More fine-grained search
b_options_2 = np.arange(0.0, 4.0, 0.1)

best_a, best_b, best_mse = grid_search(a_options_2, b_options_2, x_test2, y_test2)
print(f"\nGrid Search Results:")
print(f"Best a: {best_a:.2f}")
print(f"Best b: {best_b:.2f}")
print(f"Best MSE: {best_mse:.4f}")
print(f"Best fit line: y = {best_a:.2f}x + {best_b:.2f}")

# Test Case 3: Example from assignment
print("\n" + "=" * 50)
print("Test Case 3: Assignment Example")
print("=" * 50)

x_test3 = np.array([0, 1, 2, 3, 4])
y_test3 = np.array([1, 2, 3, 4, 5])

a_options_3 = [0.0, 0.5, 1.0]
b_options_3 = [-1.0, 0.0, 1.0]

print(f"a options: {a_options_3}")
print(f"b options: {b_options_3}")
print(f"\nEvaluating all {len(a_options_3) * len(b_options_3)} combinations...")

# Show MSE for all combinations
print("\nMSE for each combination:")
print(f"{'a':<6} {'b':<6} {'MSE':<10}")
print("-" * 25)

for a in a_options_3:
    for b in b_options_3:
        mse = eval_predict(a, b, x_test3, y_test3)
        print(f"{a:<6.1f} {b:<6.1f} {mse:<10.4f}")

best_a, best_b, best_mse = grid_search(a_options_3, b_options_3, x_test3, y_test3)
print(f"\nBest combination:")
print(f"a = {best_a}, b = {best_b}, MSE = {best_mse:.4f}")

Test Case 1: Perfect Linear Relationship
MSE with correct parameters (a=2, b=1): 0.0
MSE with wrong parameters (a=1, b=0): 18.0

Grid Search Results:
Best a: 2.0
Best b: 1.0
Best MSE: 0.0

Test Case 2: Noisy Linear Relationship
Data: x = [ 1  2  3  4  5  6  7  8  9 10]
      y = [ 3.75  4.93  6.82  8.76  9.38 10.88 13.29 14.38 15.27 17.27]

Grid Search Results:
Best a: 1.50
Best b: 2.20
Best MSE: 0.1182
Best fit line: y = 1.50x + 2.20

Test Case 3: Assignment Example
a options: [0.0, 0.5, 1.0]
b options: [-1.0, 0.0, 1.0]

Evaluating all 9 combinations...

MSE for each combination:
a      b      MSE       
-------------------------
0.0    -1.0   18.0000   
0.0    0.0    11.0000   
0.0    1.0    6.0000    
0.5    -1.0   9.5000    
0.5    0.0    4.5000    
0.5    1.0    1.5000    
1.0    -1.0   4.0000    
1.0    0.0    1.0000    
1.0    1.0    0.0000    

Best combination:
a = 1.0, b = 1.0, MSE = 0.0000
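
The nine MSE values in Test Case 3 can also be computed in a single step with NumPy broadcasting instead of nested loops. The sketch below is an alternative formulation under that assumption, reusing the same data and options; the variable names are chosen for this illustration.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([1, 2, 3, 4, 5], dtype=float)
a_options = np.array([0.0, 0.5, 1.0])
b_options = np.array([-1.0, 0.0, 1.0])

# Broadcasting builds predictions of shape (len(a_options), len(b_options), len(x))
predictions = a_options[:, None, None] * x + b_options[None, :, None]

# Average the squared errors over the data axis: one MSE per (a, b) pair
mse_grid = np.mean((y - predictions) ** 2, axis=2)

# Convert the flat position of the smallest MSE back to (row, column)
i, j = np.unravel_index(np.argmin(mse_grid), mse_grid.shape)
best_a, best_b = a_options[i], b_options[j]
```

Row `i` of `mse_grid` corresponds to `a_options[i]` and column `j` to `b_options[j]`, matching the order of the printed table.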


# Sign-Up Validator
Imagine you are building a sign-up page for a new app. To keep the user database clean, you must validate all inputs before creating an account.
Instead of relying on advanced libraries, you’ll use pure Python basics.

### Username Validation
Write a function `validate_username(username)` that:
- Returns False if the username is empty, shorter than 3, or longer than 20 characters.
- Returns False if it contains anything besides letters, digits, underscore _, dot ., or hyphen -.
- Returns True if valid.
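
The username rules can be expressed compactly with the standard `string` constants and `all()`. The sketch below is one possible formulation; `ALLOWED` and `is_valid_username` are hypothetical names, and only ASCII letters are assumed to count as letters.

```python
import string

# ASCII letters, digits, and the three permitted punctuation characters
ALLOWED = set(string.ascii_letters + string.digits + "_.-")

def is_valid_username(username):
    """Return True if the username meets the length and character rules."""
    # An empty string fails the length check, so no separate test is needed
    return 3 <= len(username) <= 20 and all(ch in ALLOWED for ch in username)

ok = is_valid_username("alice_01")
```

Using an explicit character set rather than `str.isalnum()` avoids accepting non-ASCII letters and digits.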

### Email Validation (Simple Version)
Write a function `validate_email(email)` that:
- Returns False if it doesn’t contain an "@".
- Splits the email into local part and domain (use .partition("@")).
- Ensures the domain contains at least one "." and doesn’t start/end with ".".
- Returns True if valid, False otherwise.
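
As a quick illustration of how `str.partition` behaves, it always returns a 3-tuple and splits at the first occurrence of the separator. The sample addresses below are made up for this example.

```python
# Normal case: text before "@", the separator itself, and text after it
local, sep, domain = "Alice@example.com".partition("@")

# No "@" present: the whole string comes back first, followed by two empty strings
missing = "no-at-sign.example.com".partition("@")

# More than one "@": only the first one is used as the split point
multi = "a@b@c.com".partition("@")
```

Because the split happens at the first "@", any further "@" characters end up inside the domain part, so a separate check is needed if they should be rejected.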

### Phone Normalization (Simplified)
Write a function `normalize_phone(phone, default_cc="+1")` that:
- Removes all non-digit characters.
- If the number doesn’t start with "+", prepend the default country code.
- Check that the final string has between 9 and 15 digits (excluding +).
- Return the normalized phone number (e.g., "+14155551212"), or "Invalid" if not valid.

### Sign-Up Aggregator
Write a function `validate_signup(user_info)` where `user_info` is a dictionary, e.g.:
```py
{
    "username": "alice_01",
    "email": "Alice@example.com",
    "password": "Strong!Pass1",
    "phone": "(415) 555-1212",
    "country": "us"
}
```
It should:
- Call each validation function.
- Collect any errors in a dictionary.
- Return a result like:
```py
{
    "ok": True,
    "errors": {},
    "normalized": {
        "username": "alice_01",
        "email": "alice@example.com",
        "phone": "+14155551212",
        "country": "US"
    }
}
```
If there are errors, `ok` should be False and errors should explain them.


In [None]:
def validate_username(username):
    """
    Validate a username based on specific rules.
    
    A valid username must:
    - Not be empty
    - Be between 3 and 20 characters long
    - Contain only letters, digits, underscore (_), dot (.), or hyphen (-)
    
    Parameters
    username : str
        The username to validate
        
    Returns
    bool
        True if username is valid, False otherwise

    """
    # Check if username is empty
    if not username:
        return False
    
    # Check length (must be between 3 and 20 characters)
    if len(username) < 3 or len(username) > 20:
        return False
    
    # Check if contains only allowed characters
    # Allowed: letters, digits, underscore, dot, hyphen
    allowed_chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_.-"
    
    for char in username:
        if char not in allowed_chars:
            return False
    
    return True


def validate_email(email):
    """
    Validate an email address using simple rules.
    
    A valid email must:
    - Contain exactly one "@" symbol
    - Have a non-empty local part (before @)
    - Have a domain that contains at least one "."
    - Domain must not start or end with "."
    
    Parameters
    email : str
        The email address to validate
        
    Returns
    bool
        True if email is valid, False otherwise
        
    """
    # Check that email contains exactly one "@"
    if email.count("@") != 1:
        return False
    
    # Split email into local part and domain using partition
    local, separator, domain = email.partition("@")
    
    # Check if local part is empty
    if not local:
        return False
    
    # Check if domain is empty
    if not domain:
        return False
    
    # Check if domain contains at least one "."
    if "." not in domain:
        return False
    
    # Check if domain starts or ends with "."
    if domain.startswith(".") or domain.endswith("."):
        return False
    
    return True


def normalize_phone(phone, default_cc="+1"):
    """
    Normalize a phone number by removing non-digits and adding country code.
    
    The function:
    - Removes all non-digit characters (except leading +)
    - Adds default country code if not present
    - Validates that the number has 9-15 digits
    
    Parameters
    phone : str
        The phone number to normalize
    default_cc : str, optional
        Default country code to prepend (default is "+1")
        
    Returns
    str
        Normalized phone number (e.g., "+14155551212") or "Invalid"

    """
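    # Remember whether the input already carries a country code (leading "+")
    has_plus = phone.strip().startswith("+")
    
    # Keep ASCII digits only; str.isdigit() would also accept characters
    # such as superscript digits or non-Latin numerals
    digits_only = "".join(char for char in phone if char in "0123456789")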
    
    # If original had + at start, we assume it has country code already
    if has_plus:
        normalized = "+" + digits_only
    else:
        # Prepend default country code (remove + from default_cc for concatenation)
        cc_digits = default_cc[1:] if default_cc.startswith("+") else default_cc
        normalized = "+" + cc_digits + digits_only
    
    # Count digits (excluding the +)
    digit_count = len(normalized) - 1  # -1 for the + sign
    
    # Validate digit count (9-15 digits)
    if digit_count < 9 or digit_count > 15:
        return "Invalid"
    
    return normalized


def validate_signup(user_info):
    """
    Validate all sign-up information and normalize data.
    
    This function validates username, email, and phone number, collecting
    any errors and normalizing valid data.
    
    Parameters
    user_info : dict
        Dictionary containing user sign-up information with keys:
        - username (str): User's chosen username
        - email (str): User's email address
        - phone (str): User's phone number
        - country (str, optional): Country code
        - password (str): User's password (not validated in this version)
        
    Returns
    dict
        Dictionary with keys:
        - ok (bool): True if all validations passed
        - errors (dict): Dictionary of field names to error messages
        - normalized (dict): Dictionary of normalized data

    """
    # Initialize result structure
    result = {
        "ok": True,
        "errors": {},
        "normalized": {}
    }
    
    # Validate username
    username = user_info.get("username", "")
    if not validate_username(username):
        result["ok"] = False
        if not username:
            result["errors"]["username"] = "Username is required"
        elif len(username) < 3:
            result["errors"]["username"] = "Username must be at least 3 characters"
        elif len(username) > 20:
            result["errors"]["username"] = "Username must be at most 20 characters"
        else:
            result["errors"]["username"] = "Username contains invalid characters"
    else:
        result["normalized"]["username"] = username
    
    # Validate email
    email = user_info.get("email", "")
    if not validate_email(email):
        result["ok"] = False
        if not email:
            result["errors"]["email"] = "Email is required"
        elif "@" not in email:
            result["errors"]["email"] = "Email must contain @"
        else:
            result["errors"]["email"] = "Email format is invalid"
    else:
        # Normalize email to lowercase
        result["normalized"]["email"] = email.lower()
    
    # Validate and normalize phone
    phone = user_info.get("phone", "")
    country = user_info.get("country", "us")
    
    # Set country code based on country
    country_codes = {
        "us": "+1",
        "uk": "+44",
        "ca": "+1",
        "au": "+61",
        "de": "+49",
        "fr": "+33"
    }
    default_cc = country_codes.get(country.lower(), "+1")
    
    normalized_phone = normalize_phone(phone, default_cc)
    if normalized_phone == "Invalid":
        result["ok"] = False
        if not phone:
            result["errors"]["phone"] = "Phone number is required"
        else:
            result["errors"]["phone"] = "Phone number is invalid (must be 9-15 digits)"
    else:
        result["normalized"]["phone"] = normalized_phone
    
    # Normalize country to uppercase
    if country:
        result["normalized"]["country"] = country.upper()
    
    return result

In [36]:
# Asked ChatGPT to help create comprehensive test cases for the validation functions

print("=" * 60)
print("Testing Individual Validation Functions")
print("=" * 60)

# Test username validation
print("\n--- Username Validation Tests ---")
test_usernames = [
    ("alice_01", True),
    ("ab", False),  # Too short
    ("a" * 21, False),  # Too long
    ("user@name", False),  # Invalid character
    ("valid.user-123", True),
    ("", False),  # Empty
    ("abc", True),  # Minimum valid length
]

for username, expected in test_usernames:
    result = validate_username(username)
    status = "✓" if result == expected else "✗"
    print(f"{status} '{username}': {result} (expected {expected})")

# Test email validation
print("\n--- Email Validation Tests ---")
test_emails = [
    ("alice@example.com", True),
    ("user@domain.co.uk", True),
    ("invalid.email", False),  # No @
    ("@example.com", False),  # No local part
    ("user@", False),  # No domain
    ("user@.com", False),  # Domain starts with .
    ("user@domain.", False),  # Domain ends with .
    ("user@domain", False),  # No . in domain
]

for email, expected in test_emails:
    result = validate_email(email)
    status = "✓" if result == expected else "✗"
    print(f"{status} '{email}': {result} (expected {expected})")

# Test phone normalization
print("\n--- Phone Normalization Tests ---")
test_phones = [
    ("(415) 555-1212", "+14155551212"),
    ("+44 20 1234 5678", "+442012345678"),
    ("123", "Invalid"),  # Too short
    ("123-456-7890", "+11234567890"),
    ("+1 (555) 123-4567", "+15551234567"),
]

for phone, expected in test_phones:
    result = normalize_phone(phone)
    status = "✓" if result == expected else "✗"
    print(f"{status} '{phone}' -> '{result}' (expected '{expected}')")

# Test complete sign-up validation
print("\n" + "=" * 60)
print("Testing Complete Sign-Up Validation")
print("=" * 60)

# Test case 1: Valid sign-up
print("\n--- Test Case 1: Valid Sign-Up ---")
valid_info = {
    "username": "alice_01",
    "email": "Alice@example.com",
    "password": "Strong!Pass1",
    "phone": "(415) 555-1212",
    "country": "us"
}

result1 = validate_signup(valid_info)
print(f"Input: {valid_info}")
print("\nResult:")
print(f"  OK: {result1['ok']}")
print(f"  Errors: {result1['errors']}")
print(f"  Normalized: {result1['normalized']}")

# Test case 2: Invalid username
print("\n--- Test Case 2: Invalid Username ---")
invalid_username_info = {
    "username": "ab",  # Too short
    "email": "user@example.com",
    "phone": "555-1212",  # Only 8 digits with +1, so this fails as well
    "country": "us"
}

result2 = validate_signup(invalid_username_info)
print(f"Input: {invalid_username_info}")
print("\nResult:")
print(f"  OK: {result2['ok']}")
print(f"  Errors: {result2['errors']}")
print(f"  Normalized: {result2['normalized']}")

# Test case 3: Invalid email
print("\n--- Test Case 3: Invalid Email ---")
invalid_email_info = {
    "username": "john_doe",
    "email": "notanemail",  # Missing @
    "phone": "1234567890",
    "country": "us"
}

result3 = validate_signup(invalid_email_info)
print(f"Input: {invalid_email_info}")
print("\nResult:")
print(f"  OK: {result3['ok']}")
print(f"  Errors: {result3['errors']}")
print(f"  Normalized: {result3['normalized']}")

# Test case 4: Multiple errors
print("\n--- Test Case 4: Multiple Validation Errors ---")
multiple_errors_info = {
    "username": "a",  # Too short
    "email": "bademail",  # No @
    "phone": "12",  # Too short
    "country": "uk"
}

result4 = validate_signup(multiple_errors_info)
print(f"Input: {multiple_errors_info}")
print("\nResult:")
print(f"  OK: {result4['ok']}")
print(f"  Errors: {result4['errors']}")
print(f"  Normalized: {result4['normalized']}")

# Test case 5: International phone number
print("\n--- Test Case 5: International Phone (UK) ---")
uk_info = {
    "username": "british_user",
    "email": "user@uk.co.uk",
    "phone": "20 1234 5678",
    "country": "uk"
}

result5 = validate_signup(uk_info)
print(f"Input: {uk_info}")
print("\nResult:")
print(f"  OK: {result5['ok']}")
print(f"  Errors: {result5['errors']}")
print(f"  Normalized: {result5['normalized']}")

# Test case 6: Empty/missing fields
print("\n--- Test Case 6: Missing Fields ---")
empty_info = {
    "username": "",
    "email": "",
    "phone": "",
    "country": "us"
}

result6 = validate_signup(empty_info)
print(f"Input: {empty_info}")
print("\nResult:")
print(f"  OK: {result6['ok']}")
print(f"  Errors: {result6['errors']}")
print(f"  Normalized: {result6['normalized']}")

Testing Individual Validation Functions

--- Username Validation Tests ---
✓ 'alice_01': True (expected True)
✓ 'ab': False (expected False)
✓ 'aaaaaaaaaaaaaaaaaaaaa': False (expected False)
✓ 'user@name': False (expected False)
✓ 'valid.user-123': True (expected True)
✓ '': False (expected False)
✓ 'abc': True (expected True)

--- Email Validation Tests ---
✓ 'alice@example.com': True (expected True)
✓ 'user@domain.co.uk': True (expected True)
✓ 'invalid.email': False (expected False)
✓ '@example.com': False (expected False)
✓ 'user@': False (expected False)
✓ 'user@.com': False (expected False)
✓ 'user@domain.': False (expected False)
✓ 'user@domain': False (expected False)

--- Phone Normalization Tests ---
✓ '(415) 555-1212' -> '+14155551212' (expected '+14155551212')
✓ '+44 20 1234 5678' -> '+442012345678' (expected '+442012345678')
✓ '123' -> 'Invalid' (expected 'Invalid')
✓ '123-456-7890' -> '+11234567890' (expected '+11234567890')
✓ '+1 (555) 123-4567' -> '+15551234567' (expect