### A tale of two models
A wild prediction on which model will test more accurately.

Based on my (admittedly quite limited) understanding, I think the Logistic Regression model will perform better.

The time facet of the problem is what I'm fixated on here: the models train on 2019 loans and are tested on 2020 Q1 loans. My reasoning is that regression is better at anticipating a pattern in future data because its predictions are not built from averages of values it has already seen. Random Forest averages (or, for a classifier, votes across) the outcomes of its training samples, so unless the sample is large enough to include atypical extremes, its predictions stay within the range of what it already knows. If a trend keeps rising or falling beyond that range, it will struggle.
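To make that intuition concrete, here is a minimal sketch on made-up data (not the loan files). It uses the regressor versions of both models, `LinearRegression` and `RandomForestRegressor`, because extrapolation is easiest to see with a numeric target; the notebook itself uses classifiers, so treat this as an illustration of the idea only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): a clean upward trend, y = 2x, seen only for x = 0..10.
x_seen = np.arange(0, 11, dtype=float).reshape(-1, 1)
y_seen = 2.0 * x_seen.ravel()

linear = LinearRegression().fit(x_seen, y_seen)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(x_seen, y_seen)

# Ask both models about a point well outside the range they were trained on.
x_future = np.array([[20.0]])
linear_pred = linear.predict(x_future)[0]
forest_pred = forest.predict(x_future)[0]

# The linear model continues the trend (about 40).
# The forest averages training targets, so it cannot exceed the largest one it saw (20).
print(f"linear: {linear_pred:.1f}")
print(f"forest: {forest_pred:.1f}")
```

Whether this matters for the loan data depends on whether the 2020 Q1 predictors actually fall outside the 2019 range, which this sketch does not check.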

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

In [2]:
train_df = pd.read_csv(Path('Resources/2019loans.csv'))
test_df = pd.read_csv(Path('Resources/2020Q1loans.csv'))

In [3]:
# Convert categorical data to numeric and separate target feature for training data
x_train = pd.get_dummies(train_df.drop(columns=['target']))
y_train = train_df['target']

In [4]:
# Convert categorical data to numeric and separate target feature for testing data
x_test = pd.get_dummies(test_df.drop(columns=['target']))
y_test = test_df['target']

In [5]:
# add missing dummy variables to testing set
for col in x_train.columns:
    if col not in x_test.columns:
        x_test[col] = 0
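One thing the loop above does not guarantee is that the test columns end up in the same order as the training columns, or that test-only dummy columns are dropped. A possible alternative, sketched here on hypothetical miniature frames (the column names are made up), is `DataFrame.reindex`, which handles all three cases in one call.

```python
import pandas as pd

# Hypothetical miniature frames; "amount" and "status" are illustrative names only.
toy_train = pd.DataFrame({"amount": [1000, 2500, 400],
                          "status": ["own", "rent", "mortgage"]})
toy_test = pd.DataFrame({"amount": [800, 1200],
                         "status": ["rent", "own"]})

toy_x_train = pd.get_dummies(toy_train)
toy_x_test = pd.get_dummies(toy_test)   # has no "status_mortgage" column

# reindex adds train-only columns (filled with 0), drops test-only columns,
# and puts everything in the training column order.
toy_x_test = toy_x_test.reindex(columns=toy_x_train.columns, fill_value=0)

print(list(toy_x_test.columns) == list(toy_x_train.columns))
```

Applied to the notebook's own frames, the equivalent call would be `x_test.reindex(columns=x_train.columns, fill_value=0)`.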

In [6]:
# Train the Logistic Regression model on the unscaled data and print the model score
logreg = LogisticRegression()
logreg.fit(x_train,y_train)
logreg.score(x_test,y_test)

0.5070182900893236

In [7]:
# Train a Random Forest Classifier model and print the model score
forest = RandomForestClassifier()
forest.fit(x_train,y_train)
forest.score(x_test,y_test)

0.6437686091025095

## Thoughts 1 and Prediction 2
In hindsight, the second part of the assignment makes a lot more sense to me now, even with my limited statistical knowledge.

Logistic Regression performed poorly compared to Random Forest, with scores of .507 and .644 respectively.

StandardScaler should center every predictor at a mean of zero and rescale it to unit variance. By my reading, in my own clumsy words: if the data are fairly normally distributed (which I believe the loan data should be, even though the datasets are heavily truncated to give a roughly even split of high- and low-risk loans), then Logistic Regression should improve comparatively, because Random Forest already averages its trees.
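As a small check on what StandardScaler actually does, the sketch below standardizes two made-up columns on very different scales (the values are hypothetical, not taken from the loan data) and compares the result with the formula written out by hand, `z = (x - mean) / std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (say, an amount and a rate).
toy_train = np.array([[1000.0, 0.05], [5000.0, 0.10], [9000.0, 0.30]])
toy_test = np.array([[3000.0, 0.20]])

scaler = StandardScaler().fit(toy_train)   # learns per-column mean and std from train only
scaled_train = scaler.transform(toy_train)
scaled_test = scaler.transform(toy_test)   # reuses the training statistics

# The same transformation written out by hand: z = (x - mean) / std
manual_test = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(scaled_train.mean(axis=0))            # close to 0 for every column
print(scaled_train.std(axis=0))             # close to 1 for every column
print(np.allclose(scaled_test, manual_test))
```

Note that this only shifts and rescales each column. It does not change the shape of the distribution, so skewed data stays skewed after scaling.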

In [8]:
# Scale the data
s_scaler = StandardScaler()
s_scaler.fit(x_train)
scaled_x_train = s_scaler.transform(x_train)
scaled_x_test = s_scaler.transform(x_test)

In [9]:
# Train the Logistic Regression model on the scaled data and print the model score
# Same operations as before, but on the scaled data
logreg = LogisticRegression()
logreg.fit(scaled_x_train,y_train)
logreg.score(scaled_x_test,y_test)

0.7598894087622289

In [10]:
# Train a Random Forest Classifier model on the scaled data and print the model score
forest = RandomForestClassifier()
forest.fit(scaled_x_train,y_train)
forest.score(scaled_x_test,y_test)

0.6301573798383666

## Final Thoughts
I suppose a correct guess, even if it's for the wrong reasons, is a kind of victory.

Logistic Regression did what I hoped it would do initially, with a huge jump to a .760 score, while Random Forest stayed more or less the same (.644 to .630), which I also thought would happen.

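Upon playing with the parameters of StandardScaler (and thinking a little more), it looks like the scaling by standard deviation, not the centering, is what greatly benefits the model. My first instinct was that this points to the predictors being fairly normally distributed, but the scaling itself does not require that. A more likely explanation is that putting every predictor on a comparable scale helps the Logistic Regression solver converge and lets its default L2 penalty treat the predictors evenly, while Random Forest splits on per-feature thresholds and so is largely indifferent to rescaling.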
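The experiment described above can be sketched roughly as follows. The helper `compare_scaler_settings` is a hypothetical name introduced here, and the data is a synthetic stand-in from `make_classification` with columns deliberately stretched to different scales, so the printed scores say nothing about the loan data itself. `with_mean` and `with_std` are the two StandardScaler switches for centering and scaling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def compare_scaler_settings(x_tr, x_te, y_tr, y_te):
    """Score LogisticRegression under each with_mean / with_std combination."""
    scores = {}
    for with_mean in (False, True):
        for with_std in (False, True):
            scaler = StandardScaler(with_mean=with_mean, with_std=with_std).fit(x_tr)
            model = LogisticRegression(max_iter=200)
            model.fit(scaler.transform(x_tr), y_tr)
            scores[(with_mean, with_std)] = model.score(scaler.transform(x_te), y_te)
    return scores


# Synthetic stand-in data (assumption: not the loan files), on mismatched scales.
x, y = make_classification(n_samples=500, n_features=8, random_state=0)
x = x * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])
x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

scores = compare_scaler_settings(x_tr, x_te, y_tr, y_te)
for (with_mean, with_std), score in scores.items():
    print(f"with_mean={with_mean} with_std={with_std} accuracy={score:.3f}")
```

With the notebook's own data, the call would be `compare_scaler_settings(x_train, x_test, y_train, y_test)`. Comparing the `with_std=True` rows against the `with_std=False` rows is what separates the effect of scaling from the effect of centering.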