# Semester 3 Coding Portfolio Topic 4 Formative Part 2/2:
## Evaluating Logistic Regression Predictions

This notebook covers the following topics:
 - logistic regression

This notebook is expected to take around 5 hours to complete.

<b>Formative section</b><br>
Simply complete the given functions such that they pass the automated tests. This part is graded Pass/Fail; you must get 100% correct!
You can submit your notebook through Canvas as often as you like. Make sure to start doing so early to ensure that your code passes all tests!
You may ask for help from fellow students and TAs on this section, and solutions might be provided later on.

In [None]:
# Import Necessary Libraries
import sys
import math
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import BinaryResultsWrapper
import sklearn
import scipy
from scipy.stats import multivariate_normal
from scipy.special import expit as logistic_sigmoid
from packaging import version
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, brier_score_loss, accuracy_score, roc_curve, auc
from sklearn.model_selection import KFold

In [None]:
# These are the recommended (tested) versions of the libraries
# A separate yaml file is provided for setting up the environment
assert sys.version_info >= (3, 11), "This notebook requires Python 3.11 or above."
assert version.parse(pd.__version__) >= version.parse("2.3.3"), "Needs Pandas >= 2.3.3."
assert version.parse(np.__version__) >= version.parse("2.3.4"), "Needs NumPy >= 2.3.4."
assert version.parse(sm.__version__) >= version.parse("0.14"), "Needs Statsmodels >= 0.14."
assert version.parse(matplotlib.__version__) >= version.parse("3.10"), "Needs Matplotlib >= 3.10."
assert version.parse(sklearn.__version__) >= version.parse("1.7"), "Needs scikit-learn >= 1.7."
assert version.parse(sns.__version__) >= version.parse("0.13"), "Needs Seaborn >= 0.13."
assert version.parse(scipy.__version__) >= version.parse("1.16"), "Needs SciPy >= 1.16."

In [None]:
# Set display option to avoid scientific notation in pandas, show up to 5 decimal points
pd.set_option('display.float_format', lambda x: '%.5f' % x)
# and numpy
np.set_printoptions(suppress=True, precision=5)

# Set random seed for reproducibility
np.random.seed(42)

In this workbook we will attempt to learn a model of <b>conspiracy-spreading tweets</b> for the day of January 6th in the US. The model's job is to preemptively identify whether a tweet is likely to be sharing fake news, without delving into the content of the tweet, using a series of general features instead.

In [None]:
# Load the labeled dataset of tweets 
df_labs = pd.read_csv('sem3_topic4_logreg_formative2_data.csv', low_memory=False)

## Part 1: Data Cleaning & Exploration

Your task is to clean the data. You need to complete the following tasks: 

### Exercise 1A
Drop incomplete records 

In [None]:
# Drop incomplete records, keep the variable name 'df_labs' for the cleaned dataset
df_labs = ...

### Exercise 1B 
Create a dummy variable called `conspiracy_binary`, taking value `1` when the conspiracy-assessment is `Yes`, and `0` otherwise.  

Hint: use `.astype(int)` to ensure the results are numbers, not booleans. 

In [None]:
# Conspiracy spreading flag
conspiracy_binary = ...

Let's have a look at what kinds of tweets we are talking about. 

In [None]:
# Filter rows where 'conspiracy_binary' is 1
conspiracy_texts = df_labs.loc[conspiracy_binary == 1, 'text']

# Sample 10 random texts
random_texts = conspiracy_texts.sample(n=10, random_state=np.random.RandomState())

# Iterate through the selected texts and print each one in full
for index, text in enumerate(random_texts, start=1):
    print(f"Text {index}: {text}\n")

### Exercise 1C
One-hot encode political ideology (retain just the conservative and liberal columns) and sentiment (retain just the negative and positive columns).

Note: Name the new columns `Political Leanings_Conservative`, `Political Leanings_Liberal`, `Sentiment Analysis_Negative`, and `Sentiment Analysis_Positive`.
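
As a minimal sketch of one-hot encoding with `pd.get_dummies`, the snippet below uses a small made-up dataframe (the `Mood` column and its values are hypothetical, not taken from the dataset) and retains only two of the resulting dummy columns.

```python
import pandas as pd

# Hypothetical toy data; not the tweets dataset
toy = pd.DataFrame({"Mood": ["Happy", "Sad", "Neutral", "Happy"]})

# get_dummies names each new column "<prefix>_<category>"
mood_dummies = pd.get_dummies(toy["Mood"], prefix="Mood").astype(int)

# Retain only the two categories of interest; "Neutral" becomes the implicit baseline
mood_kept = mood_dummies[["Mood_Happy", "Mood_Sad"]]
print(mood_kept)
```

Rows where both retained dummies are 0 belong to the category that was left out.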

In [None]:
# Ideology
pol_lean_one_hot = ...

In [None]:
# Sentiment 
sentiment_one_hot = ...

### Exercise 1D
Make a binary variable indicating if the source of the tweet was an Apple device.

Hint: We found 6 different sources associated with Apple. 

In [None]:
# Apple product
apple_binary = ...

In [None]:
# Lexical diversity 
lexical_diversity_likert = df_labs['Lexical Diversity'].astype(int)
# Spelling and Grammar 
spelling_grammar_likert = df_labs['Spelling and Grammar Quality'].astype(int)
# Activity: 
user_active_num = df_labs['statuses_count'].astype(int)
# Popularity: 
user_popular_num = df_labs['followers_count'].astype(int)
# Tweet Popularity
tweet_popular_num = df_labs['retweet_count'].astype(int)

### Exercise 1E
One-hot encode state identifiers, storing the results in a matrix. 
Remember to drop the first dummy (to avoid the dummy variable trap).
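
To illustrate why one dummy is dropped, the sketch below (toy data with a hypothetical `region` column) compares the rank of a design matrix holding a constant plus all dummies with one built using `drop_first=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the tweets dataset
toy = pd.DataFrame({"region": ["A", "B", "C", "A", "B", "C"]})

# All three dummies plus a constant: the dummies sum to the constant column
full = pd.get_dummies(toy["region"], prefix="region").astype(float)
full.insert(0, "const", 1.0)

# Dropping the first dummy makes category "A" the reference level
reduced = pd.get_dummies(toy["region"], prefix="region", drop_first=True).astype(float)
reduced.insert(0, "const", 1.0)

rank_full = np.linalg.matrix_rank(full.to_numpy())
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())
print(full.shape[1], rank_full)        # more columns than rank: perfectly collinear
print(reduced.shape[1], rank_reduced)  # columns equal rank: full column rank
```

When the number of columns exceeds the rank, the coefficients are not uniquely identified, which is the problem that dropping one dummy avoids.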

In [None]:
# One-hot encode state identifiers
states_one_hot = ...

# Filtering to get just the state dummy columns
states_matrix = ...

### Exercise 1F
Concatenate the clean variables into a new dataframe called `X`. Exclude the `states_matrix` for now. 
Do not include the outcome (conspiracy binary).

Hint: There should be 10 columns.

In [None]:
X = ...

### Exercise 1G
Calculate the correlation matrix across the outcome and X. 
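
For reference, here is a small sketch of a pairwise correlation matrix computed with `DataFrame.corr` on made-up numbers (the columns `y`, `a`, and `b` are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the tweets dataset
toy = pd.DataFrame({"y": [0, 0, 1, 1],
                    "a": [1, 2, 3, 4],
                    "b": [4, 3, 2, 1]})

# Pearson correlation between every pair of columns (the default method)
toy_corr = toy.corr()
print(toy_corr)
```

The result is a symmetric matrix with ones on the diagonal; here `a` and `b` are perfectly negatively correlated.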

In [None]:
# Add conspiracy_binary as the first column in X to create a combined DataFrame YX
X['conspiracy_binary'] = conspiracy_binary
YX = X[['conspiracy_binary'] + [c for c in X.columns if c != 'conspiracy_binary']]  # Ensure conspiracy_binary is the first column

# Calculate the Correlation Matrix
corr = ...

# Plotting
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, annot=True, fmt=".2f", annot_kws={"size": 7})
plt.tight_layout()
plt.show()

## Part 2: Model Assessment and Selection

### Exercise 2A 
Set up the full design matrix X, this time including the `states_matrix` and a constant. 
Finally, bind the outcome to it and ensure it is the first column of the resulting dataframe. 

In [None]:
# Design matrix
X = ...

# Add a constant to the feature matrix for statsmodels
X_const = ...

# Get full dataset together 
YX_const =  ...

### Exercise 2B 
Create a training set (75%) and test set (25%). 
Ensure the rows of the full dataset selected for each set are chosen at random (use seed 42).
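
A minimal sketch of a seeded random split with `train_test_split`, on a hypothetical 20-row dataframe rather than the tweets data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; not the tweets dataset
toy = pd.DataFrame({"y": np.arange(20) % 2, "x": np.arange(20)})

# test_size sets the held-out share; random_state fixes which rows are drawn
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)

# Re-running with the same seed reproduces the same partition
toy_train_again, _ = train_test_split(toy, test_size=0.25, random_state=42)
print(len(toy_train), len(toy_test))
```

Passing a whole dataframe keeps the outcome and the features together in each subset.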

In [None]:
# Split data into train and test (75:25)
YX_const_train, YX_const_test = ...

### Exercise 2C
Using a dictionary, define three candidate models in terms of the columns of the design matrix involved in each. 
The first model should be the homogeneous probability model; the second should have all covariates except the states; the third should use all the columns. Name the keys `homogeneous`, `no_states`, and `all`.

In [None]:
# Define predictors for each model variant
predictors = ...

### Exercise 2D
Using 5-fold cross-validation on the training set, compare the models using the following metrics: Brier score, Accuracy, Balanced Accuracy, and AIC.

For this question we are not yet interested in making inference; we only want to understand which model has the best predictive power. You can therefore skip simulation and simply make point-estimate predictions. 

To do this, fit the model with `sm.Logit` and call `model.predict` directly on the fitted model, rather than sampling from the approximate posterior of the betas and then from the posterior predictive of y. 

This will not give you uncertainty estimates around your predictions, but it will allow you to compare models based on their point predictions, which is good enough for model selection purposes. When we want to make inference, we also need access to uncertainty.

In [None]:
y = YX_const_train['Conspiracy Assessment'] # Target variable

# Define K
K = 5

# Setup the KFold cross-validation
kf = KFold(n_splits=K, shuffle=True)

# Initialize dictionaries to store scores
brier_scores = {key: [] for key in predictors}
acc_scores = {key: [] for key in predictors}  
balanced_acc_scores = {key: [] for key in predictors}
aic_scores = {key: [] for key in predictors}  # AIC scores

for key, cols in predictors.items():
    
    for train_index, test_index in kf.split(YX_const_train):
        
        # Split into train and test according to the folds 
        X_train, X_test = YX_const_train.iloc[train_index][cols], YX_const_train.iloc[test_index][cols]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # For each fold split, fit the model
        model = ...

        # Predict probabilities
        y_pred_prob = ...

        # Calculate Brier score
        brier_score = ...
        brier_scores[key].append(brier_score)

        # Convert probabilities to binary predictions (assume simple >0.5 probability as threshold)
        y_pred_binary = ...

        # Calculate Accuracy Score
        acc_score = ...
        acc_scores[key].append(acc_score)
        
        # Calculate Balanced Accuracy Score
        bal_acc_score = ...
        balanced_acc_scores[key].append(bal_acc_score)
        
        # Store AIC
        aic_scores[key].append(model.aic)


In [None]:
# Calculate and print the average scores
results = []
for key in predictors.keys():
    average_brier_score = np.mean(brier_scores[key])
    average_bal_acc_score = np.mean(balanced_acc_scores[key])
    average_acc_score = np.mean(acc_scores[key])
    average_aic_score = np.mean(aic_scores[key])  # Calculate average AIC
    results.append({
        'Model': key,
        'Average Brier Score': average_brier_score,
        'Average Accuracy': average_acc_score,
        'Average Balanced Accuracy': average_bal_acc_score,
        'Average AIC': average_aic_score
    })

# Convert results to DataFrame for nicer display
results_df = pd.DataFrame(results)
results_df

### Exercise 2E 
Re-fit the model with the lowest average AIC to the full training set. 

In [None]:
# Now fit the model to the full training set
model = ...

In [None]:
# Get summary results
summary = model.summary()
print(summary)

## Part 3: Model Evaluation and Estimation of Generalisation Error

### Exercise 3A 
Generate 1000 simulations of the regression coefficients by sampling from the empirical posterior distribution. Use seed 42.

Hint: check the documentation of `scipy.stats.multivariate_normal.rvs`
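
As a rough sketch of how `multivariate_normal.rvs` behaves, the snippet below draws coefficient vectors from a multivariate normal with a made-up mean and covariance (`toy_mean` and `toy_cov` are hypothetical values, not model output).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical mean vector and covariance matrix, for illustration only
toy_mean = np.array([0.5, -1.0])
toy_cov = np.array([[0.04, 0.01],
                    [0.01, 0.09]])

# Each row of the result is one draw of the full coefficient vector
toy_draws = multivariate_normal.rvs(mean=toy_mean, cov=toy_cov, size=1000, random_state=42)

print(toy_draws.shape)         # (1000, 2)
print(toy_draws.mean(axis=0))  # close to toy_mean
```

The `random_state` argument makes the draws reproducible without touching the global NumPy seed.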

In [None]:
# Extract the coefficients (betas) and their covariance matrix from the logistic regression fit
beta_mean = model.params
beta_cov = model.cov_params()

# Number of simulations
n_simulations = 1000

# Simulate beta coefficients
simulated_betas = ...

### Exercise 3B  
For each simulation, generate a predicted probability for the test-set conspiracy assessments. 
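
The mapping from coefficients to probabilities can be sketched on a tiny made-up example (`X_toy` and `beta_toy` are hypothetical): the linear predictor gives the log-odds, and the logistic sigmoid converts them to probabilities.

```python
import numpy as np
from scipy.special import expit

# Hypothetical design matrix (constant + one feature) and coefficient vector
X_toy = np.array([[1.0, -2.0],
                  [1.0,  0.0],
                  [1.0,  2.0]])
beta_toy = np.array([0.0, 1.0])

# Linear predictor: one log-odds value per row of the design matrix
log_odds_toy = X_toy @ beta_toy

# The logistic sigmoid maps log-odds to probabilities in (0, 1)
probs_toy = expit(log_odds_toy)
print(probs_toy)
```

A log-odds of 0 corresponds to a probability of exactly 0.5, and log-odds of equal magnitude but opposite sign give probabilities that sum to 1.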

In [None]:
# Initialize an array to store predictions from each simulation
predictions = np.zeros((n_simulations, YX_const_test.shape[0]))

# Generate predictions for each simulation
for i in range(n_simulations):
    beta_simulation = ...
    
    log_odds = ...
    
    # Convert log-odds to probabilities
    probabilities = ...  
    
    predictions[i] = ...

In [None]:
predictions

For the first 20 assessments in the test set, we will plot the posterior distribution of the probabilities and highlight whether the density of each lies above or below a given `threshold` for classification. 

In [None]:
true_labels = YX_const_test['Conspiracy Assessment']

# Calculate posterior median and the 90% prediction interval for each test-set observation
posterior_medians = np.median(predictions, axis=0)
lower_bounds = np.percentile(predictions, 5, axis=0)
upper_bounds = np.percentile(predictions, 95, axis=0)

# Plotting with the adjustments for the 90% prediction interval to be shown with red lines
fig, axes = plt.subplots(4, 5, figsize=(25, 16))

for i in range(20):
    ax = axes[i // 5, i % 5]
    # Histogram of simulated probabilities for observation i
    ax.hist(predictions[:, i], bins=30, color='skyblue', edgecolor='white', alpha=0.7)
    
    # Draw a line for the decision boundary 
    ax.axvline(x=0.5, color='black', linewidth=1, label='Decision Boundary')
    
    # Draw a thick solid black line at the true label position
    true_label_position = 0 if true_labels.iloc[i] == 0 else 1  # Determine the position based on the true label
    ax.axvline(x=true_label_position, color='black', linewidth=3, label='True Label')
    
    # Add posterior median
    ax.axvline(x=posterior_medians[i], color='red', linestyle='--', label='Posterior Median')
    
    # Marking the 90% prediction interval with red lines instead of shading
    ax.axvline(x=lower_bounds[i], color='red', linestyle='-', linewidth=1, label='90% Prediction Interval' if i == 0 else "")
    ax.axvline(x=upper_bounds[i], color='red', linestyle='-', linewidth=1)
    
    ax.set_xlim(-0.1, 1.1)
    ax.set_title(f'Observation {i+1}')
    if i == 0:  # Add legend to the first subplot only to avoid repetition
        ax.legend()

plt.tight_layout()
plt.show()

### Exercise 3C 
Simulate classes (1s or 0s) for the test-set conspiracy assessments from the posterior predictive distribution. 

Hint: check documentation of `np.random.binomial`
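
As a small illustration of Bernoulli sampling from an array of probabilities, the sketch below uses a made-up 3-by-4 matrix (`probs_toy` is hypothetical). It uses a local `Generator` so that the global seed is left alone; `np.random.binomial` accepts the same `n` and `p` arguments.

```python
import numpy as np

# Hypothetical probabilities: 3 simulations (rows) for 4 observations (columns)
probs_toy = np.array([[0.0, 0.2, 0.8, 1.0],
                      [0.0, 0.3, 0.7, 1.0],
                      [0.0, 0.1, 0.9, 1.0]])

rng = np.random.default_rng(42)

# n=1 makes each draw a single Bernoulli trial; p is applied element-wise
classes_toy = rng.binomial(n=1, p=probs_toy)
print(classes_toy.shape)
```

The output has the same shape as the probability array, with one 0/1 draw per entry.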

In [None]:
# Simulate from the posterior-predictive distribution 
simulated_outcomes = ...

### Exercise 3D
Calculate the generalisation error for Classification. 
Choose <b>one</b> classification error metric from the following list: `[Accuracy, Brier Score, AUC]`. The most basic metric we might be interested in is `accuracy`. 

Hint: We have 1000 simulated predicted classes. For each of those 1000 sets of simulations of the test-set labels, you need to calculate the accuracy. Then you have to plot the histogram of the accuracies. 
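
The per-simulation accuracy described in the hint can be sketched with made-up arrays (`y_true_toy` and `sims_toy` are hypothetical, with 3 simulations instead of 1000):

```python
import numpy as np

# Hypothetical true labels for 4 observations
y_true_toy = np.array([0, 1, 1, 0])

# Hypothetical simulated classes: one row per simulation
sims_toy = np.array([[0, 1, 1, 0],   # all correct
                     [0, 1, 0, 0],   # one mistake
                     [1, 0, 1, 0]])  # two mistakes

# Broadcasting compares every row with the true labels;
# the row mean is the share of correct labels in that simulation
acc_toy = (sims_toy == y_true_toy).mean(axis=1)
print(acc_toy)
```

This yields one accuracy per simulation (here 1.0, 0.75, and 0.5), which is the kind of array a histogram can then summarise.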

In [None]:
def plot_histogram(metric_values, metric_name):
    plt.figure(figsize=(10, 6))
    plt.hist(metric_values, bins=30, color='skyblue', edgecolor='white')
    plt.axvline(x=np.median(metric_values), color='red', label='Median')
    plt.axvline(x=np.percentile(metric_values, 5), color='red', linestyle='--', label='5th percentile')
    plt.axvline(x=np.percentile(metric_values, 95), color='red', linestyle='--', label='95th percentile')
    plt.xlabel(metric_name)
    plt.ylabel('Frequency')
    plt.title(f'Out-of-Sample Posterior Distribution of {metric_name}')
    plt.legend()
    plt.show()

In [None]:
# Calculate selected metric for each simulation and plot histogram (choose from Accuracy, Brier Score, AUC)
accuracies = []
briers = []
aucs = []


Here is an example with the `Generalisation ROC Curve` and corresponding AUC. 

In [None]:
# Initialize lists to store TPRs (True Positive Rate), FPRs (False Positive Rate), and AUCs (Area Under the Curve) for each simulation
tprs = []
fprs = []
aucs = []

# Calculate ROC curve and AUC for each simulation
for i in range(n_simulations):
    fpr, tpr, thresholds = roc_curve(true_labels, predictions[i])
    roc_auc = auc(fpr, tpr)
    tprs.append(tpr)
    fprs.append(fpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, color='lightgray', lw=1, alpha=0.5)  # Plot each ROC curve faintly

# Calculate the mean AUC
mean_auc = np.mean(aucs)

# Plotting
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curve (Mean AUC = {mean_auc:.2f})')
plt.show()