This notebook demonstrates how to perform PCA and extract the top components for further analysis. The extracted components are then used as inputs to a Logistic Regression model, but the results are not what you might hope for: instead of a marginal reduction in model performance, the AUC of the model drops significantly.

### Import packages

In [None]:
# data processing
import pandas as pd
import numpy as np

# modeling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.decomposition import PCA

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid');

### Set-up

In [None]:
infile = 'https://raw.githubusercontent.com/vishal-git/dapt-631/main/data/credit_default_model_data.csv'

target = 'default payment next month'

### Read data

In [None]:
df = pd.read_csv(infile)

y = df[target]
X = df.drop(target, axis=1)
del df

X.head()

### Train-Test partition

In [None]:
# rows flagged 'M' form the training set; rows flagged 'T' form the test set
is_train = X['group'] == 'M'
is_test = X['group'] == 'T'

X_train = X[is_train].drop('group', axis=1)
X_test = X[is_test].drop('group', axis=1)

y_train = y[is_train]
y_test = y[is_test]

len(X_train), len(X_test)

### Logistic Regression model

In [None]:
logit = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=314)

Standardize the input data.

In [None]:
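X_scaler = StandardScaler()

# fit and transform the training data frame
X_train_std = X_scaler.fit_transform(X_train)

# transform the test data frame (reusing the training mean and standard deviation)
X_test_std = X_scaler.transform(X_test)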

[Normalize data before or after split of training and testing data?](https://stackoverflow.com/questions/49444262/normalize-data-before-or-after-split-of-training-and-testing-data)
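The point behind that link can be shown with a minimal sketch on synthetic data: the scaler learns its mean and standard deviation from the training rows only, and the test rows are transformed with those same statistics. The `demo_*` names and the random data are assumptions made for illustration; they are not part of the credit default data set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a train/test split (illustrative values only)
rng = np.random.default_rng(314)
demo_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
demo_test = rng.normal(loc=55.0, scale=12.0, size=(50, 3))

demo_scaler = StandardScaler()

# fit_transform learns the column means and standard deviations from the training rows
demo_train_std = demo_scaler.fit_transform(demo_train)

# transform reuses the training statistics, so no test information leaks into the scaler
demo_test_std = demo_scaler.transform(demo_test)

print("train column means:", demo_train_std.mean(axis=0).round(3))
print("test column means: ", demo_test_std.mean(axis=0).round(3))
```

The standardized training columns have mean 0 and unit standard deviation by construction. The test columns generally do not, which is expected: they are measured against the training distribution.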

Fit the model and get model scores.

In [None]:
logit.fit(X_train_std, y_train)

logit_scores_train = logit.predict_proba(X_train_std)[:, 1]
logit_scores_test = logit.predict_proba(X_test_std)[:, 1]

ROC Curve

In [None]:
logit_fpr_test, logit_tpr_test, _ = roc_curve(y_test, logit_scores_test)
auc_logit = roc_auc_score(y_test, logit_scores_test)

plt.figure(figsize=(7, 7))

plt.plot(logit_fpr_test, logit_tpr_test, color='royalblue', lw=2, linestyle = '-',
         label=f'Test (AUC = {auc_logit:0.3f})')

plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate', fontsize = 14)
plt.ylabel('True Positive Rate', fontsize = 14)
plt.title('Default Risk Model: Logistic Regression', fontsize = 14)
plt.legend(loc="lower right", fontsize = 12);

### Principal Component Analysis (PCA)

In [None]:
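# fit PCA on the raw (unstandardized) training data, keeping all components
pca = PCA(random_state=314)
pca.fit(X_train)

In [None]:
n_cols = len(X_train.columns)

plt.figure(figsize=(9, 6))
sns.lineplot(x=range(1, n_cols + 1), 
             y=pca.explained_variance_ratio_, 
             linewidth=3, 
             color='tomato')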

plt.xlabel('Number of Components', fontsize = 14)
plt.ylabel('Explained Variance', fontsize = 14);

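Very few principal components appear to explain most of the variance in the data. This is a red flag: PCA is sensitive to feature scale, so features measured in large units dominate the leading components.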

We need to *standardize* the data before fitting PCA -- i.e., run PCA on standardized data.
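To see why scale matters, here is a small sketch on synthetic data. The two columns (`big` and `small`) are hypothetical and independent of each other; they differ only in their units.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Two independent, hypothetical features on very different scales
big = rng.normal(0.0, 10000.0, n)   # e.g., an amount in currency units
small = rng.normal(0.0, 1.0, n)     # e.g., a small ratio
toy = np.column_stack([big, small])

# PCA on the raw data: the large-scale column absorbs nearly all the variance
raw_ratio = PCA().fit(toy).explained_variance_ratio_

# PCA on standardized data: both columns contribute on an equal footing
toy_std = StandardScaler().fit_transform(toy)
std_ratio = PCA().fit(toy_std).explained_variance_ratio_

print("raw:         ", raw_ratio.round(4))
print("standardized:", std_ratio.round(4))
```

On the raw data the first component carries almost all of the variance simply because of the units. After standardization the variance is split roughly evenly, as you would expect for two unrelated features.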

In [None]:
pca.fit(X_train_std)

plt.figure(figsize=(9, 6))
sns.lineplot(x=range(1, n_cols + 1), 
             y=pca.explained_variance_ratio_, 
             linewidth=3, 
             color='tomato')

plt.xlabel('Number of Components', fontsize = 14)
plt.ylabel('Explained Variance', fontsize = 14);

Cumulative Variance Explained

In [None]:
plt.figure(figsize=(9, 6))

sns.lineplot(x=range(1, n_cols + 1), 
             y=np.cumsum(pca.explained_variance_ratio_),
             linewidth=3, 
             color='tomato')

plt.xlabel('Number of Components', fontsize = 14)
plt.ylabel('Cumulative Explained Variance', fontsize = 14);

In [None]:
# cumulative variance explained by the first 30 components (zero-based index 29)
np.cumsum(pca.explained_variance_ratio_)[29]

Let's keep the top 30 principal components. By doing so, we will retain about 97% of the total variance.
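Rather than reading the cutoff off the plot, the number of components can also be computed from a variance threshold. The sketch below uses synthetic data (the `toy` matrix and its `scales` are assumptions for illustration), and also shows that scikit-learn's `PCA` accepts a float `n_components` between 0 and 1 with `svd_solver='full'` to make the same selection.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(314)

# Synthetic data: 10 independent columns with decreasing spread (illustrative only)
scales = np.array([10, 8, 6, 4, 3, 2, 1, 0.5, 0.2, 0.1])
toy = rng.normal(size=(500, 10)) * scales

threshold = 0.97

# cum_var[k - 1] is the share of variance retained by the first k components
cum_var = np.cumsum(PCA(svd_solver='full').fit(toy).explained_variance_ratio_)

# smallest k whose cumulative share reaches the threshold
k = int(np.argmax(cum_var >= threshold)) + 1

# scikit-learn makes the same choice when n_components is a float in (0, 1)
auto_pca = PCA(n_components=threshold, svd_solver='full').fit(toy)

print("components needed:", k)
print("chosen by PCA:    ", auto_pca.n_components_)
```

Note the off-by-one between a component count and a zero-based array index: the variance retained by the first `k` components sits at position `k - 1` of the cumulative sum.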

In [None]:
components_to_keep = 30

pca = PCA(n_components=components_to_keep, random_state=314)

In [None]:
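# fit PCA on the standardized training data and project it onto the top components
pca_X_train = pca.fit_transform(X_train_std)
pca_X_train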

In [None]:
# fit the model using principal components
pl_fit = logit.fit(pca_X_train, y_train)

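# calculate model scores (predicted probabilities) on the test data,
# projected with the PCA fitted on the training data
pl_scores_test = logit.predict_proba(pca.transform(X_test_std))[:, 1]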

ROC Curve

In [None]:
pl_fpr_test, pl_tpr_test, _ = roc_curve(y_test, pl_scores_test)
auc_pl = roc_auc_score(y_test, pl_scores_test)

plt.figure(figsize=(9, 6))

plt.plot(logit_fpr_test, logit_tpr_test, color='royalblue', lw=2, 
         label=f'Logistic (AUC = {auc_logit:0.3f})')

plt.plot(pl_fpr_test, pl_tpr_test, color='tomato', lw=2,
         label=f'PCA + Logistic (AUC = {auc_pl:0.3f})')

plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate', fontsize = 14)
plt.ylabel('True Positive Rate', fontsize = 14)
plt.title('Default Risk Model: Logit vs. PCA+Logit', fontsize = 16)
plt.legend(loc="lower right", fontsize = 14);

Why did the model performance get worse?
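One possible explanation, sketched here on synthetic data rather than diagnosed from the credit default results: PCA is unsupervised, so it ranks directions by variance without looking at the target. If the signal lives in a low-variance direction, dropping the trailing components can discard it. All names below (`shared`, `signal`, `toy_X`, and so on) are hypothetical, and the AUCs are computed in-sample for brevity.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(314)
n = 2000

shared = rng.normal(size=n)   # large common factor, unrelated to the label
signal = rng.normal(size=n)   # small factor that drives the label

# Two highly correlated features; they differ only through the small signal term
toy_X = np.column_stack([shared + 0.1 * signal, shared - 0.1 * signal])
toy_y = (signal > 0).astype(int)

toy_X_std = StandardScaler().fit_transform(toy_X)

# Logistic regression on both standardized features can recover the signal
full_model = LogisticRegression(max_iter=1000).fit(toy_X_std, toy_y)
auc_full = roc_auc_score(toy_y, full_model.predict_proba(toy_X_std)[:, 1])

# Keeping only the first principal component retains most of the variance
# but almost none of the information about the label
toy_pca = PCA(n_components=1).fit(toy_X_std)
toy_pc1 = toy_pca.transform(toy_X_std)
pc_model = LogisticRegression(max_iter=1000).fit(toy_pc1, toy_y)
auc_pc1 = roc_auc_score(toy_y, pc_model.predict_proba(toy_pc1)[:, 1])

print(f"variance kept by PC1: {toy_pca.explained_variance_ratio_[0]:0.3f}")
print(f"AUC, both features:   {auc_full:0.3f}")
print(f"AUC, PC1 only:        {auc_pc1:0.3f}")
```

In this toy setting the first component keeps nearly all of the variance while the AUC falls to roughly chance level. Whether the same mechanism explains the drop seen above would need to be checked on the actual data, for example by comparing AUC as more components are added.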