# Smoothen Stock Price
Write a function named `smoothen_price`. Given historical stock price data (a 1D NumPy array) and a window size (an integer), the function smoothens the stock price with a moving average filter; a simple moving average is sufficient. The function should plot the original and the smoothened stock price on the same, clearly labeled plot. Do not use built-in moving average filters.

Test your function using the opening price of Tesla from "TSLA.csv" and window size 100. Here is the expected output:
![TSLA](assets/output.png)


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def smoothen_price(prices, window_size):
    """
    Smoothen stock price using simple moving average filter.
    
    Parameters:
    prices: 1D numpy array of stock prices
    window_size: integer, size of the moving average window
    
    Returns:
    smoothed: 1D numpy array of smoothed prices
    """
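    # Guard against a zero or negative window, which would yield empty slices (NaN means)
    if window_size < 1:
        raise ValueError("window_size must be a positive integer")
    n = len(prices)
    smoothed = np.zeros(n)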
    
    # Calculate moving average manually
    for i in range(n):
        if i < window_size - 1:
            # For the first few points, use available data
            smoothed[i] = np.mean(prices[:i+1])
        else:
            # Use the window
            smoothed[i] = np.mean(prices[i-window_size+1:i+1])
    
    # Plot original and smoothed prices
    plt.figure(figsize=(12, 6))
    plt.plot(prices, label='Original Price', alpha=0.7)
    plt.plot(smoothed, label='Smoothed Price', linewidth=2)
    plt.xlabel('Time')
    plt.ylabel('Price')
    plt.title('Stock Price Smoothing')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
    
    return smoothed

# Test with TSLA.csv
tsla_data = pd.read_csv('TSLA.csv')
opening_prices = tsla_data['Open'].values
smoothen_price(opening_prices, 100)
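
The loop in `smoothen_price` recomputes every window mean from scratch, so its cost grows with both the series length and the window size. A minimal sketch of an equivalent cumulative-sum approach is shown below. The name `moving_average_cumsum` is hypothetical, the sketch assumes the same trailing-window convention (shorter windows for the first few points), and it leaves out the plotting.

```python
import numpy as np

def moving_average_cumsum(prices, window_size):
    """Trailing simple moving average via cumulative sums (hypothetical helper)."""
    prices = np.asarray(prices, dtype=float)
    n = len(prices)
    # csum[k] holds the sum of prices[:k], so any window sum is a difference of two entries
    csum = np.concatenate(([0.0], np.cumsum(prices)))
    idx = np.arange(n)
    # The window covers prices[start:i+1]; start is clipped to 0 for the first points
    start = np.maximum(idx - window_size + 1, 0)
    return (csum[idx + 1] - csum[start]) / (idx + 1 - start)

# Small made-up series: the smoothed values are 1, 1.5, 2, 3, 4
demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
demo_smoothed = moving_average_cumsum(demo, 3)
print(demo_smoothed)
```

For very long series, cumulative sums can accumulate floating-point rounding error, so comparing against the loop version with `np.allclose` is a reasonable sanity check.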


# Linear Regression

Write a program to accomplish the following goals:

- Load the "penguins.csv" data. 
- Remove all rows containing empty fields. 
- Fit a linear model that predicts the body mass of penguins (dependent variable) from their flipper length (independent variable).
- Print the mean squared error and the $R^2$ value.
- Choose an appropriate plot to compare the actual body mass with the predicted body mass. Clearly label your plot.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load penguins.csv
penguins = pd.read_csv('penguins.csv')

# Remove rows with empty fields
penguins_clean = penguins.dropna()

# Extract features and target
X = penguins_clean[['flipper_length_mm']].values
y = penguins_clean['body_mass_g'].values

# Fit linear model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Calculate metrics
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.4f}")

# Plot actual vs predicted
plt.figure(figsize=(10, 6))
plt.scatter(y, y_pred, alpha=0.6)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Body Mass (g)')
plt.ylabel('Predicted Body Mass (g)')
plt.title('Actual vs Predicted Body Mass')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
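
For reference, both metrics can also be computed directly from their definitions: the mean squared error is the mean of the squared residuals, and $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below uses a hypothetical helper, `mse_and_r2`, and small made-up numbers rather than the penguin data.

```python
import numpy as np

def mse_and_r2(y_true, y_hat):
    """Return (MSE, R^2) computed from their definitions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    residuals = y_true - y_hat
    ss_res = np.sum(residuals ** 2)                   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return ss_res / len(y_true), 1.0 - ss_res / ss_tot

# Made-up illustrative values, not taken from penguins.csv
y_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_demo_hat = np.array([2.0, 5.0, 8.0, 9.0])
mse_demo, r2_demo = mse_and_r2(y_demo, y_demo_hat)
print(mse_demo, r2_demo)  # MSE is 0.5 and R^2 is 0.9, up to floating-point rounding
```

Running the helper on the penguin arrays `y` and `y_pred` should agree with `mean_squared_error` and `r2_score` up to rounding.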


# Logistic Regression

We have discussed linear regression in class. In this problem, you will explore logistic regression; specifically, you will use pandas to preprocess a dataset for it. Unlike linear regression, which models a continuous dependent variable, logistic regression models a binary one. It is widely used to predict binary outcomes from a set of features (independent variables).
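
As a rough illustration of how a binary prediction arises, logistic regression passes a linear combination of the features through the sigmoid function to obtain a probability, and then thresholds that probability (0.5 is a common default). In the sketch below the weights, bias, and inputs are made-up values that were not fitted to any dataset, and `predict_binary` is a hypothetical helper.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_binary(features, weights, bias, threshold=0.5):
    """Hypothetical helper: probability from a linear score, then a 0/1 label."""
    prob = sigmoid(features @ weights + bias)
    return prob, (prob >= threshold).astype(int)

# Made-up weights and inputs, for illustration only
demo_features = np.array([[0.0, 0.0], [2.0, 1.0], [-3.0, 0.5]])
demo_weights = np.array([1.0, 2.0])
demo_bias = 0.0
demo_prob, demo_label = predict_binary(demo_features, demo_weights, demo_bias)
print(demo_prob)   # the linear scores are 0, 4, and -2
print(demo_label)  # labels are 1, 1, 0
```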

Write a program to accomplish the following:

- Load the "diabetes.csv" file using pandas.
- Extract the following columns as your independent variables: Glucose, BloodPressure, and BMI. Take the first 75% of the rows as the training features, named `feature_train`. For practice, **do not use** the built-in dataset splitting function in this problem. In future projects, you can use `train_test_split` from `sklearn.model_selection` to split the dataset automatically.
- Extract the "Outcome" column as the dependent variable. Take the first 75% of the rows as the training labels, named `outcome_train`. Again, do not use the built-in dataset splitting function.
- Keep the remaining rows as your testing dataset. Name the testing features `feature_test`.
- Instantiate a logistic regression model named `logreg`. Fit the model using your training dataset. You can use the following code snippet:
    ```py
    logreg = LogisticRegression()
    logreg.fit(feature_train, outcome_train)
    ```

- Evaluate the quality of the predictions. Test the fitted model using `outcome_pred = logreg.predict(feature_test)`. Compare `outcome_pred` with the actual outcomes in the testing dataset, and compute the percentage of positive patients that your model predicts correctly. For practice, do not use the built-in report/summary or confusion matrix for this calculation.
- Print the number of false negatives and false positives produced by your model. Again, do not use the built-in report/summary or confusion matrix.

In [None]:
from sklearn.linear_model import LogisticRegression

# Load diabetes.csv
diabetes = pd.read_csv('diabetes.csv')

# Calculate split point (75% for training)
n_total = len(diabetes)
n_train = int(0.75 * n_total)

# Extract features (Glucose, BloodPressure, BMI) for training
feature_train = diabetes[['Glucose', 'BloodPressure', 'BMI']].iloc[:n_train].values

# Extract outcome for training
outcome_train = diabetes['Outcome'].iloc[:n_train].values

# Extract features for testing
feature_test = diabetes[['Glucose', 'BloodPressure', 'BMI']].iloc[n_train:].values

# Extract outcome for testing
outcome_test = diabetes['Outcome'].iloc[n_train:].values

# Instantiate and fit logistic regression model
logreg = LogisticRegression()
logreg.fit(feature_train, outcome_train)

# Make predictions
outcome_pred = logreg.predict(feature_test)

# Calculate percentage of positive patients correctly predicted
# Positive patients are those with Outcome = 1
positive_patients = outcome_test == 1
positive_correct = (outcome_pred[positive_patients] == outcome_test[positive_patients]).sum()
positive_total = positive_patients.sum()
positive_accuracy = (positive_correct / positive_total * 100) if positive_total > 0 else 0

print(f"Percentage of positive patients correctly predicted: {positive_accuracy:.2f}%")

# Calculate false negative and false positive manually
# False negative: predicted 0 but actual is 1
false_negative = ((outcome_pred == 0) & (outcome_test == 1)).sum()

# False positive: predicted 1 but actual is 0
false_positive = ((outcome_pred == 1) & (outcome_test == 0)).sum()

print(f"False Negative: {false_negative}")
print(f"False Positive: {false_positive}")
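
The same boolean-mask idea extends to all four cells of the confusion table, and the percentage of positive patients correctly predicted is the recall (sensitivity), TP / (TP + FN). The sketch below uses a hypothetical helper, `confusion_counts`, and made-up labels rather than the diabetes data.

```python
import numpy as np

def confusion_counts(actual, predicted):
    """Hypothetical helper: count TP, FN, FP, TN for 0/1 labels using boolean masks."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    tp = int(((predicted == 1) & (actual == 1)).sum())  # true positives
    fn = int(((predicted == 0) & (actual == 1)).sum())  # false negatives
    fp = int(((predicted == 1) & (actual == 0)).sum())  # false positives
    tn = int(((predicted == 0) & (actual == 0)).sum())  # true negatives
    return tp, fn, fp, tn

# Made-up labels, for illustration only
demo_actual = [1, 1, 1, 0, 0, 0, 0, 1]
demo_predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(demo_actual, demo_predicted)
recall = tp / (tp + fn)
print(tp, fn, fp, tn, recall)  # 3 1 1 3 0.75
```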
