# 31009 - Assignment 3: PCA & Logistic Regression

Build a logistic regression model on the `ProviderInfo` dataset. Then use PCA to reduce dimensionality and build two new models on the PCA-transformed data: one with two components and one with sixteen components. Compare the models.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

In [2]:
df_raw = pd.read_csv('data/ProviderInfo.csv')

In [3]:
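# Keep only the numeric columns and drop any rows with missing values
df = df_raw.copy()
df = df.select_dtypes([np.number])
# Safeguard: map any leftover "NaN"/"NaT" markers to real NaN before dropping rows
df.replace(["NaN", 'NaT'], np.nan, inplace = True)
df = df.dropna(how='any', axis = 0)
# ZIP and PHONE are identifiers, not predictive features
df.drop(columns=['ZIP', 'PHONE'], inplace=True)
df.head()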

| | COUNTY_SSA | BEDCERT | RESTOT | OVERALL_RATING | SURVEY_RATING | QUALITY_RATING | STAFFING_RATING | RN_STAFFING_RATING | AIDHRD | VOCHRD | ... | ADJ_AIDE | ADJ_LPN | ADJ_RN | ADJ_TOTAL | INCIDENT_CNT | CMPLNT_CNT | FINE_CNT | FINE_TOT | PAYDEN_CNT | TOT_PENLTY_CNT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 290.0 | 57.0 | 51.5 | 5.0 | 5.0 | 5.0 | 4.0 | 4.0 | 3.43572 | 1.16495 | ... | 3.11741 | 1.2475 | 0.83853 | 5.13047 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 360.0 | 92.0 | 79.8 | 2.0 | 2.0 | 4.0 | 3.0 | 3.0 | 2.32722 | 0.82104 | ... | 2.40074 | 0.86962 | 0.56463 | 3.83026 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 360.0 | 103.0 | 98.1 | 3.0 | 3.0 | 4.0 | 3.0 | 2.0 | 2.33617 | 0.92407 | ... | 2.55126 | 1.08955 | 0.3036 | 3.95709 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 410.0 | 149.0 | 119.7 | 5.0 | 3.0 | 5.0 | 4.0 | 3.0 | 2.57869 | 1.01443 | ... | 2.56783 | 1.04823 | 0.46444 | 4.07866 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 250.0 | 124.0 | 96.0 | 5.0 | 4.0 | 5.0 | 3.0 | 4.0 | 1.99985 | 0.62768 | ... | 2.12102 | 0.70311 | 0.75448 | 3.52979 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Split into Train & Test

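This assignment specified an 80/20 train/test split. The random mask below assigns each row to the training set with probability 0.8, so the split is approximately, not exactly, 80/20.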

In [4]:
np.random.seed(123)
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test =  df[~msk]
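
As an alternative sketch, not what this notebook ran, scikit-learn's `train_test_split` produces an exact 80/20 split and can stratify on the target so each rating keeps its share in both sets. The toy frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned ProviderInfo frame
rng = np.random.default_rng(123)
toy = pd.DataFrame({
    "BEDCERT": rng.normal(100, 30, size=1000),
    "RESTOT": rng.normal(85, 25, size=1000),
    "OVERALL_RATING": rng.integers(1, 6, size=1000).astype(float),
})

# Exact 80/20 split, stratified on the target column
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, random_state=123, stratify=toy["OVERALL_RATING"]
)

print(len(train_toy), len(test_toy))
```

Stratifying matters most when some ratings are rare; with the random mask, class shares in train and test can drift apart by chance.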

### Scaling

The scaler is fit on the training data only and then applied to both train and test, so no information from the test set leaks into the scaling.

In [5]:
from sklearn.preprocessing import StandardScaler

In [6]:
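# Fit the scaler on the training rows only
scaler = StandardScaler()
scaler.fit(train)

# Scale train, then separate the predictors from the target (OVERALL_RATING)
X_train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns)
X_train_scaled.drop(columns='OVERALL_RATING', inplace=True)
# reset_index() aligns the target with the fresh 0..n-1 index of X_train_scaled
y_train = train.reset_index().OVERALL_RATING

# Apply the same train-fitted scaler to the test rows
X_test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns)
X_test_scaled.drop(columns='OVERALL_RATING', inplace=True)
y_test = test.reset_index().OVERALL_RATING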
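
One way to make the fit-on-train-only rule hard to break is to bundle the scaler and the model in a scikit-learn `Pipeline`. The sketch below is an assumption about how that could look, run on synthetic data from `make_classification` rather than on `ProviderInfo`; the notebook itself scales by hand as shown above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not ProviderInfo)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# fit() learns the scaling statistics from the training rows only;
# score() reuses those same statistics on the test rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

fitted_scaler = pipe.named_steps["standardscaler"]
print("Scaler means match train means:", np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
print("Test accuracy:", round(pipe.score(X_te, y_te), 2))
```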

In [7]:
X_train_scaled.head()

| | COUNTY_SSA | BEDCERT | RESTOT | SURVEY_RATING | QUALITY_RATING | STAFFING_RATING | RN_STAFFING_RATING | AIDHRD | VOCHRD | RNHRD | ... | ADJ_AIDE | ADJ_LPN | ADJ_RN | ADJ_TOTAL | INCIDENT_CNT | CMPLNT_CNT | FINE_CNT | FINE_TOT | PAYDEN_CNT | TOT_PENLTY_CNT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.368529 | -0.84053 | -0.67822 | 1.694526 | 0.826771 | 0.687788 | 0.488219 | 2.092472 | 0.788711 | -0.000448 | ... | 1.448243 | 1.075569 | 0.364616 | 1.492863 | -0.379399 | -0.569688 | -0.561854 | -0.273495 | -0.27574 | -0.552266 |
| 1 | -0.114526 | -0.260406 | -0.148048 | -0.637067 | 0.000141 | -0.14512 | -0.30208 | 0.053773 | -0.152219 | -0.289842 | ... | 0.166911 | -0.027644 | -0.264347 | -0.032 | -0.379399 | -0.437683 | -0.561854 | -0.273495 | -0.27574 | -0.552266 |
| 2 | -0.114526 | -0.078081 | 0.194783 | 0.140131 | 0.000141 | -0.14512 | -1.092378 | 0.070233 | 0.129669 | -0.856345 | ... | 0.436025 | 0.614437 | -0.863757 | 0.116744 | -0.379399 | -0.569688 | -0.561854 | -0.273495 | -0.27574 | -0.552266 |
| 3 | 0.066905 | 0.684368 | 0.599437 | 0.140131 | 0.826771 | 0.687788 | -0.30208 | 0.516264 | 0.376891 | -0.509499 | ... | 0.465651 | 0.493804 | -0.494416 | 0.259319 | -0.379399 | -0.437683 | -0.561854 | -0.273495 | -0.27574 | -0.552266 |
| 4 | -0.513673 | 0.269994 | 0.155442 | 0.917328 | 0.826771 | -0.14512 | 0.488219 | -0.54831 | -0.681248 | -0.143279 | ... | -0.333199 | -0.513767 | 0.17161 | -0.384385 | -0.063247 | -0.437683 | -0.561854 | -0.273495 | -0.27574 | -0.552266 |


## Model 1: Logistic Regression

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as skm

logr = LogisticRegression().fit(X_train_scaled, y_train)

y_train_pred = logr.predict(X_train_scaled)
y_test_pred  = logr.predict(X_test_scaled)

### Train

In [9]:
print("Train Score:", round(logr.score(X_train_scaled, y_train),2))
print("\nTrain Confusion Matrix")
print(skm.confusion_matrix(y_train, y_train_pred))
print("\nTrain Classification Report")
print(skm.classification_report(y_train, y_train_pred))

Train Score: 0.7

Train Confusion Matrix
[[1035  303    1    0    0]
 [ 225 1737  359   32    0]
 [   1  757  284  887    0]
 [   0  382   11 1833  432]
 [   0    0    0   85 3333]]

Train Classification Report
             precision    recall  f1-score   support

        1.0       0.82      0.77      0.80      1339
        2.0       0.55      0.74      0.63      2353
        3.0       0.43      0.15      0.22      1929
        4.0       0.65      0.69      0.67      2658
        5.0       0.89      0.98      0.93      3418

avg / total       0.68      0.70      0.68     11697



### Test

In [10]:
print("Test Score:", round(logr.score(X_test_scaled, y_test),2))
print("\nTest Confusion Matrix")
print(skm.confusion_matrix(y_test, y_test_pred))
print("\nTest Classification Report")
print(skm.classification_report(y_test, y_test_pred))

Test Score: 0.7

Test Confusion Matrix
[[274  85   1   0   0]
 [ 61 402  76   6   0]
 [  0 192  71 201   0]
 [  0 108   0 452 100]
 [  0   0   0  20 811]]

Test Classification Report
             precision    recall  f1-score   support

        1.0       0.82      0.76      0.79       360
        2.0       0.51      0.74      0.60       545
        3.0       0.48      0.15      0.23       464
        4.0       0.67      0.68      0.68       660
        5.0       0.89      0.98      0.93       831

avg / total       0.69      0.70      0.68      2860



The logistic model on the scaled dataset with all the variables does reasonably well: overall accuracy is about 70% on both train and test. Most errors land in a neighboring rating, so the confusion matrix is concentrated near the diagonal. The weak spot is rating 3, with a recall of only 0.15 on both sets. The metrics barely change from train to test, which suggests the model is not noticeably overfitted. Any PCA-based model needs to get close to this score to be a worthwhile replacement.
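
The per-class figures in the classification report can be recomputed directly from the confusion matrix, which makes the weakness on rating 3 easy to see. The sketch below copies the test confusion matrix printed above; rows are true ratings and columns are predicted ratings, as in `sklearn.metrics.confusion_matrix`.

```python
import numpy as np

# Test confusion matrix reported above (rows = true rating, columns = predicted)
cm = np.array([
    [274,  85,   1,   0,   0],
    [ 61, 402,  76,   6,   0],
    [  0, 192,  71, 201,   0],
    [  0, 108,   0, 452, 100],
    [  0,   0,   0,  20, 811],
])

recall = cm.diagonal() / cm.sum(axis=1)     # correct / actual count per rating
precision = cm.diagonal() / cm.sum(axis=0)  # correct / predicted count per rating

for rating, (p, r) in enumerate(zip(precision, recall), start=1):
    print(f"rating {rating}: precision={p:.2f} recall={r:.2f}")
```

For rating 3, only 71 of the 464 true cases are recovered; 192 are predicted as rating 2 and 201 as rating 4.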

## Model 2: PCA w. 2 components + Logistic Regression

In [11]:
# import PCA object from sklearn
from sklearn.decomposition import PCA

In [12]:
# limit PCA object to 2 components 
pca_two = PCA(n_components=2)

# use pca object to fit & apply pca transformation to data
X_train_pca2 = pca_two.fit_transform(X_train_scaled)
X_test_pca2  = pca_two.transform(X_test_scaled)

# Train the logistic model on the PCA data
logr_pca2 = LogisticRegression().fit(X_train_pca2, y_train)

# Calculate predictions from trained model
y_train_pca2_pred = logr_pca2.predict(X_train_pca2)
y_test_pca2_pred  = logr_pca2.predict(X_test_pca2)

### Train

In [13]:
print("Train Score:", round(logr_pca2.score(X_train_pca2, y_train),2))
print("\nTrain Confusion Matrix")
print(skm.confusion_matrix(y_train, y_train_pca2_pred))
print("\nTrain Classification Report")
print(skm.classification_report(y_train, y_train_pca2_pred))

Train Score: 0.38

Train Confusion Matrix
[[ 459  708    6   42  124]
 [ 362 1066    5   83  837]
 [ 131  807    4   93  894]
 [  81  634    1  116 1826]
 [  31  452    1  110 2824]]

Train Classification Report
             precision    recall  f1-score   support

        1.0       0.43      0.34      0.38      1339
        2.0       0.29      0.45      0.35      2353
        3.0       0.24      0.00      0.00      1929
        4.0       0.26      0.04      0.07      2658
        5.0       0.43      0.83      0.57      3418

avg / total       0.33      0.38      0.30     11697



### Test

In [14]:
print("Test Score:", round(logr_pca2.score(X_test_pca2, y_test),2))
print("\nTest Confusion Matrix")
print(skm.confusion_matrix(y_test, y_test_pca2_pred))
print("\nTest Classification Report")
print(skm.classification_report(y_test, y_test_pca2_pred))

Test Score: 0.38

Test Confusion Matrix
[[145 174   2  11  28]
 [ 81 252   0  23 189]
 [ 33 196   0  23 212]
 [ 15 169   0  15 461]
 [  5 121   1  24 680]]

Test Classification Report
             precision    recall  f1-score   support

        1.0       0.52      0.40      0.45       360
        2.0       0.28      0.46      0.35       545
        3.0       0.00      0.00      0.00       464
        4.0       0.16      0.02      0.04       660
        5.0       0.43      0.82      0.57       831

avg / total       0.28      0.38      0.30      2860



With only two components this model falls far short of the full model. Overall accuracy is only 38%, a little over half of the full model's 70%. The predictions are spread across the confusion matrix and skew heavily toward ratings 2 and 5, whereas the full model's matrix had mostly zeros in the corners far from the diagonal. Rating 3 is almost never predicted.
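
One way to check why two components fall short is to look at how much variance they retain. `PCA.explained_variance_ratio_` reports the share of variance captured by each component. The sketch below uses synthetic data, so its numbers say nothing about `ProviderInfo`; on the notebook's data the analogous call would be `pca_two.explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 30 features driven by 8 hidden factors plus noise
rng = np.random.default_rng(0)
factors = rng.normal(size=(1000, 8))
loadings = rng.normal(size=(8, 30))
X = factors @ loadings + 0.5 * rng.normal(size=(1000, 30))
X_scaled = StandardScaler().fit_transform(X)

# Fit a two-component PCA and inspect the variance each component keeps
pca = PCA(n_components=2).fit(X_scaled)
ratios = pca.explained_variance_ratio_
print("Variance share per component:", np.round(ratios, 3))
print("Total retained by 2 components:", round(float(ratios.sum()), 3))
```

Retained variance is only a proxy: PCA ignores the target, so a component with little variance can still carry signal that the classifier needs.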

## Model 3: PCA w. 16 components + Logistic Regression

In [15]:
# limit PCA object to 16 components
pca_16 = PCA(n_components=16)

# use pca object to fit & apply pca transformation to data
X_train_pca16 = pca_16.fit_transform(X_train_scaled)
X_test_pca16  = pca_16.transform(X_test_scaled)

# Build the logistic model on the scaled PCA data
logr_pca16 = LogisticRegression().fit(X_train_pca16, y_train)

# Calculate predictions from the model
y_train_pca16_pred = logr_pca16.predict(X_train_pca16)
y_test_pca16_pred  = logr_pca16.predict(X_test_pca16)

### Train

In [16]:
print("Train Score:", round(logr_pca16.score(X_train_pca16, y_train),2))
print("\nTrain Confusion Matrix")
print(skm.confusion_matrix(y_train, y_train_pca16_pred))
print("\nTrain Classification Report")
print(skm.classification_report(y_train, y_train_pca16_pred))

Train Score: 0.69

Train Confusion Matrix
[[1032  306    1    0    0]
 [ 241 1718  361   33    0]
 [   1  770  262  896    0]
 [   0  386    3 1774  495]
 [   0    0    0   83 3335]]

Train Classification Report
             precision    recall  f1-score   support

        1.0       0.81      0.77      0.79      1339
        2.0       0.54      0.73      0.62      2353
        3.0       0.42      0.14      0.21      1929
        4.0       0.64      0.67      0.65      2658
        5.0       0.87      0.98      0.92      3418

avg / total       0.67      0.69      0.67     11697



### Test

In [17]:
print("Test Score:", round(logr_pca16.score(X_test_pca16, y_test),2))
print("\nTest Confusion Matrix")
print(skm.confusion_matrix(y_test, y_test_pca16_pred))
print("\nTest Classification Report")
print(skm.classification_report(y_test, y_test_pca16_pred))

Test Score: 0.7

Test Confusion Matrix
[[272  88   0   0   0]
 [ 65 401  75   4   0]
 [  0 189  69 206   0]
 [  0 110   0 433 117]
 [  0   0   0  16 815]]

Test Classification Report
             precision    recall  f1-score   support

        1.0       0.81      0.76      0.78       360
        2.0       0.51      0.74      0.60       545
        3.0       0.48      0.15      0.23       464
        4.0       0.66      0.66      0.66       660
        5.0       0.87      0.98      0.92       831

avg / total       0.68      0.70      0.67      2860



This model holds up well. Accuracy is 0.69 on train and 0.70 on test, and the per-class precision and recall are all within about 0.02 of the non-PCA model.
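
To see where accuracy levels off between two and sixteen components, one option is to sweep the component count. The sketch below does this on synthetic five-class data from `make_classification`; the component counts and the data are illustrative assumptions, not results from `ProviderInfo`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 5-class stand-in data (not ProviderInfo)
X, y = make_classification(
    n_samples=2000, n_features=30, n_informative=10,
    n_classes=5, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit scaler + PCA + logistic regression for several component counts
scores = {}
for k in (2, 8, 16, 30):
    model = make_pipeline(
        StandardScaler(), PCA(n_components=k), LogisticRegression(max_iter=1000)
    )
    model.fit(X_tr, y_tr)
    scores[k] = model.score(X_te, y_te)

for k, acc in scores.items():
    print(f"{k:>2} components: test accuracy {acc:.2f}")
```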

## Model Comparison

Model three, the second PCA model with sixteen components, was clearly the better of the two PCA models and essentially matched the full model. Its scores were consistent between train and test, and its rounded test score of 0.70 is the same as the non-transformed full model's. It is notable that the model reaches the same accuracy with half the number of components as the original dataset.
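
Because the printed scores are rounded to two decimals, the models can look identical. A quick way to separate them is to recompute accuracy from the test confusion matrices above as correct predictions (the diagonal) divided by all predictions. The dictionary keys are labels chosen here for illustration.

```python
import numpy as np

# Test confusion matrices copied from the outputs above
test_cms = {
    "full": np.array([
        [274,  85,   1,   0,   0],
        [ 61, 402,  76,   6,   0],
        [  0, 192,  71, 201,   0],
        [  0, 108,   0, 452, 100],
        [  0,   0,   0,  20, 811],
    ]),
    "pca_2": np.array([
        [145, 174,   2,  11,  28],
        [ 81, 252,   0,  23, 189],
        [ 33, 196,   0,  23, 212],
        [ 15, 169,   0,  15, 461],
        [  5, 121,   1,  24, 680],
    ]),
    "pca_16": np.array([
        [272,  88,   0,   0,   0],
        [ 65, 401,  75,   4,   0],
        [  0, 189,  69, 206,   0],
        [  0, 110,   0, 433, 117],
        [  0,   0,   0,  16, 815],
    ]),
}

# Accuracy = sum of the diagonal / total number of test rows
accuracy = {name: cm.trace() / cm.sum() for name, cm in test_cms.items()}
for name, acc in accuracy.items():
    print(f"{name:>6}: {acc:.4f}")
```

On these matrices the full model gets 2,010 of 2,860 test rows right (about 0.703), the sixteen-component model 1,990 (about 0.696), and the two-component model 1,092 (about 0.382), so the full model is ahead of the sixteen-component model by 20 test rows.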