<a href="https://colab.research.google.com/github/gitmystuff/DTSC4050/blob/main/Week_12-Classification_I/From_Lines_to_Likelihood_Podcast_Companion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np

# --- Probability Calculation Example ---

# Define the parameters from the model
beta_0 = -2  # Intercept (log odds when visits = 0)
beta_1 = 0.5 # Coefficient for website visits

# Number of website visits
visits = 4

# Calculate the log odds (T)
T = beta_0 + beta_1 * visits
print(f"Log Odds (T) using beta_0 + beta_1 * visits where visits = 4: {T}")

# Sigmoid function to calculate probability
def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Calculate the probability
probability = sigmoid(T)
print(f"Probability of clicking: {probability:.2f}")

# --- Coefficient Interpretation ---
# Transcript explanation of the 0.5 coefficient

print(f"For each additional visit, the log odds of clicking increase by {beta_1}")

# Exponentiate the coefficient to get the change in odds
odds_ratio = np.exp(beta_1)
print(f"Odds increase by a factor of {odds_ratio:.2f} or {((odds_ratio - 1) * 100):.0f}%")

# --- Confusion Matrix Example ---

# Define the confusion matrix values
true_positives = 40
false_negatives = 10
false_positives = 5
true_negatives = 45

# Create the confusion matrix as a 2x2 NumPy array
confusion_matrix = np.array([[true_negatives, false_positives],
                             [false_negatives, true_positives]])

print("Confusion Matrix:")
print(confusion_matrix)

# --- Precision and Recall Calculation ---
# As described in the transcript

# Calculate precision
precision = true_positives / (true_positives + false_positives)
print(f"Precision: {precision:.2f}")

# Calculate recall
recall = true_positives / (true_positives + false_negatives)
print(f"Recall: {recall:.2f}")

Log Odds (T) using beta_0 + beta_1 * visits where visits = 4: 0.0
Probability of clicking: 0.50
For each additional visit, the log odds of clicking increase by 0.5
Odds increase by a factor of 1.65 or 65%
Confusion Matrix:
[[45  5]
 [10 40]]
Precision: 0.89
Recall: 0.80


**Explanation:**

1.  **Probability Calculation:**
    
    * Variables `beta_0`, `beta_1`, and `visits` are defined based on the example in the transcript.
    * The log odds (`T`) is calculated using the linear equation.
    * The `sigmoid` function is defined to convert log odds to a probability between 0 and 1.
    * The probability of clicking is calculated and printed.
2.  **Coefficient Interpretation:**
    
    * The code prints the direct interpretation of `beta_1` as the change in log odds.
    * It then calculates the odds ratio by exponentiating `beta_1` to show how the odds of clicking change multiplicatively.
3.  **Confusion Matrix:**
    
    * The values for true positives, false negatives, false positives, and true negatives are taken directly from the transcript.
    * These values are used to create a 2x2 NumPy array representing the confusion matrix.
4.  **Precision and Recall:**
    
    * Precision and recall are calculated using the formulas from the transcript, based on the values in the confusion matrix.
    * The calculated precision and recall values are printed.

**Log Odds:**

* In logistic regression, the model predicts the *log odds* of the outcome (e.g., the log odds of a customer clicking).
* The equation `T = beta_0 + beta_1 * visits` calculates this log odds (T).
* `beta_1` (0.5 in the example) represents the change in the log odds for every one-unit increase in the predictor variable (number of visits).
* So, "For each additional visit, the log odds of clicking increase by 0.5" means that the value of T goes up by 0.5 for each extra visit.
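
As a small illustration of the point above, the sketch below tabulates T and the matching probability for a few visit counts, using the example's `beta_0` and `beta_1`. The `rows` list and the chosen range of visits are illustrative choices, not part of the transcript.

```python
import math

beta_0 = -2   # intercept from the example above
beta_1 = 0.5  # coefficient for website visits

def sigmoid(t):
    """Convert log odds to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-t))

# Each row holds (visits, log odds T, probability) for visits 0 through 5
rows = []
for visits in range(6):
    T = beta_0 + beta_1 * visits
    rows.append((visits, T, sigmoid(T)))

for visits, T, p in rows:
    print(f"visits={visits}: T={T:+.1f}, probability={p:.3f}")
```

T rises by exactly `beta_1` from one row to the next, while the probability column does not move in equal steps.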

**Odds:**

* "Odds" express the chance of an event as the ratio of the probability that it happens to the probability that it does not: odds = p / (1 - p). A probability of 0.5 corresponds to odds of 1.
* To understand the effect of the predictor on the *odds* themselves, we need to transform the coefficient.
* We do this by exponentiating the coefficient (`np.exp(beta_1)` or `e^beta_1`).
* In the example, `exp(0.5)` is approximately 1.65.
* This means that the *odds* of clicking are multiplied by 1.65 for each additional visit.
* We can also express this as a percentage increase: (1.65 - 1) \* 100% = 65%.
* So, "Odds increase by a factor of 1.65 or 65%" means that the odds of clicking, not the probability, increase by 65% for each extra visit.
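
To make the difference between odds and probability concrete, here is a minimal sketch that compares one extra visit at several starting points. The helper `odds` and the `results` list are illustrative names, and the starting visit counts are arbitrary.

```python
import math

beta_0 = -2
beta_1 = 0.5

def sigmoid(t):
    """Convert log odds to a probability."""
    return 1 / (1 + math.exp(-t))

def odds(p):
    """Odds are the probability of the event divided by the probability of no event."""
    return p / (1 - p)

# For each starting visit count, compare the probability and odds after one more visit
results = []
for visits in (0, 4, 8):
    p_now = sigmoid(beta_0 + beta_1 * visits)
    p_next = sigmoid(beta_0 + beta_1 * (visits + 1))
    ratio = odds(p_next) / odds(p_now)
    results.append((visits, p_now, p_next, ratio))
    print(f"visits {visits} -> {visits + 1}: "
          f"probability {p_now:.3f} -> {p_next:.3f}, odds ratio {ratio:.2f}")
```

The odds ratio is the same at every starting point, while the change in probability depends on where you start. Going from 4 to 5 visits, for example, moves the probability from 0.50 to about 0.62, which is not a 65% increase in probability.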

**In summary:**

* Logistic regression models log odds, and the coefficient directly shows the change in log odds.
* To interpret the effect on odds, you exponentiate the coefficient. This gives you the factor by which the odds change, which can also be expressed as a percentage increase.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# --- Generate Synthetic Data ---
# Based on T = beta_0 + beta_1 * visits, but with some randomness

np.random.seed(42)  # For reproducibility

n_samples = 1000  # Number of data points
beta_0 = -2
beta_1 = 0.5

# Generate website visits
visits = np.random.randint(0, 20, n_samples)  # Visits between 0 and 19

# Calculate log odds (T) with some noise added
T = beta_0 + beta_1 * visits + np.random.normal(0, 1, n_samples)

# Calculate probabilities using the sigmoid function
def sigmoid(t):
    return 1 / (1 + np.exp(-t))

probabilities = sigmoid(T)

# Convert probabilities to binary outcomes (0 or 1)
# You can adjust the threshold to see how it affects the confusion matrix
threshold = 0.5
clicks = (probabilities > threshold).astype(int)

# Create a Pandas DataFrame
data = pd.DataFrame({'visits': visits, 'probabilities': probabilities, 'clicks': clicks})

# --- Logistic Regression Model ---
# Using scikit-learn for comparison and to show the process

# Split data into training and testing sets
X = data[['visits']]  # Independent variable
y = data['clicks']    # Dependent variable

# Because the data are randomly generated, the results will differ from the hand-built example above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# --- Confusion Matrix ---

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# --- Interpretation ---
# The confusion matrix is a 2x2 matrix:
# [[TN FP]
#  [FN TP]]
# TN: True Negatives (correctly predicted no click)
# FP: False Positives (predicted click, but no click)
# FN: False Negatives (predicted no click, but click)
# TP: True Positives (correctly predicted click)
# Precision = TP / (TP + FP)
# Recall = TP / (TP + FN)

Confusion Matrix:
[[ 40   6]
 [  9 145]]


**NOTE**:

* Precision: ...out of all the times that our model predicted a yes, a positive outcome, how often was it actually a yes?
* Recall: ...out of all the actual yes cases, how many did our model correctly identify?
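
The two definitions can be applied to the confusion matrix printed above. This sketch hard-codes those four counts rather than refitting the model, and it assumes the `[[TN, FP], [FN, TP]]` layout described in the code comments.

```python
# Counts copied from the printed confusion matrix, laid out as [[TN, FP], [FN, TP]]
cm = [[40, 6],
      [9, 145]]

(tn, fp), (fn, tp) = cm

# Precision: of all predicted clicks, how many were actual clicks?
precision = tp / (tp + fp)

# Recall: of all actual clicks, how many did the model find?
recall = tp / (tp + fn)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
```

With these counts, precision comes out at about 0.96 and recall at about 0.94.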

**Explanation:**

1.  **Generate Synthetic Data:**
    
    * `np.random.seed(42)`: Ensures the data generation is the same every time you run the code, for reproducibility.
    * `n_samples`, `beta_0`, `beta_1`: Define the number of data points and the parameters of the logistic regression equation.
    * `visits = np.random.randint(0, 20, n_samples)`:  Creates an array of random integer values between 0 and 19 (inclusive) to represent the number of website visits for each user.
    * `T = beta_0 + beta_1 * visits + np.random.normal(0, 1, n_samples)`: Calculates the log-odds (T) using the given equation but adds random noise (`np.random.normal(0, 1, n_samples)`) to make the data more realistic. This noise simulates other factors that might influence clicking behavior.
    * `sigmoid(T)`: Converts the log-odds to probabilities using the sigmoid function.
    * `clicks = (probabilities > threshold).astype(int)`:  Turns the probabilities into binary outcomes (0 or 1) by applying a threshold. If the probability is greater than the threshold, it's considered a "click" (1); otherwise, "no click" (0).
    * `pd.DataFrame(...)`:  Organizes the generated data into a Pandas DataFrame for easier handling.
2.  **Logistic Regression Model:**
    
    * `X = data[['visits']]`, `y = data['clicks']`:  Prepares the data for scikit-learn. `X` is the independent variable (visits), and `y` is the dependent variable (clicks).
    * `train_test_split(...)`:  Splits the data into training and testing sets. The model is trained on the training set and evaluated on the testing set.
    * `LogisticRegression()`: Creates a logistic regression model object.
    * `model.fit(X_train, y_train)`: Trains the model on the training data.
    * `model.predict(X_test)`:  Makes predictions on the test data.
3.  **Confusion Matrix:**
    
    * `confusion_matrix(y_test, y_pred)`:  Calculates the confusion matrix by comparing the true outcomes (`y_test`) with the model's predictions (`y_pred`).
    * The confusion matrix is printed to the console.
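
In the generation step described above, each label is fully determined once the noisy T is known, because the threshold is applied directly to `sigmoid(T)`. A common alternative, shown here only as a sketch and not as the notebook's method, is to treat `sigmoid(T)` as a click probability and draw each outcome as a random Bernoulli trial. The use of `np.random.default_rng` and the variable names are assumptions made for this illustration; NumPy is assumed to be installed, as in the cells above.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator for reproducibility

n_samples = 1000
beta_0 = -2
beta_1 = 0.5

# Visits between 0 and 19, as in the cell above
visits = rng.integers(0, 20, n_samples)

# Noise-free log odds and the corresponding click probabilities
T = beta_0 + beta_1 * visits
probabilities = 1 / (1 + np.exp(-T))

# Draw each click as a Bernoulli trial: 1 with probability p, otherwise 0
clicks = (rng.random(n_samples) < probabilities).astype(int)

print(f"Share of clicks: {clicks.mean():.2f}")
```

With labels drawn this way, two users with the same number of visits can end up with different outcomes, which is the situation logistic regression is designed to model.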