# Round 2: Addressing Survival Class Imbalance:
Let's take a look at our survival classes:

In [6]:
from IPython.utils import io
with io.capture_output() as captured:
    %run log_reg_01.ipynb

y_train.value_counts(normalize=True)

0    0.618629
1    0.381371
Name: Survived, dtype: float64

As we can see, passenger deaths account for about 62 percent of the classifications. This can bias our model toward predicting that class. Let's try to address the imbalance with random over-sampling:

In [7]:
# Join training data together to resample as a whole:
xidxs = x_train.index
yidxs = y_train.index
osample_df = pd.concat([x_train, y_train], axis = 1)
osample_df

PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Survived
294,0,1,-3.926007e-01,1,0,-0.470230,0
426,0,0,2.388379e-16,1,0,-0.502445,0
499,1,1,-3.237127e-01,0,2,2.402990,0
91,0,0,-4.816080e-02,1,0,-0.486337,0
227,2,0,-7.370406e-01,1,0,-0.437007,1
...,...,...,...,...,...,...,...
147,0,0,-1.859368e-01,1,0,-0.491456,1
762,0,0,7.784949e-01,1,0,-0.504962,0
835,0,0,-8.059285e-01,1,0,-0.481304,0
846,0,0,8.473829e-01,1,0,-0.496405,0


In [8]:
df_lived = osample_df[osample_df.Survived == 1]
df_died = osample_df[osample_df.Survived == 0]

# get class counts; value_counts() sorts descending, so the majority class (died) comes first
died, lived = osample_df.Survived.value_counts()
died, lived

(352, 217)
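
As a quick sanity check, the counts above can be tied back to the proportions printed earlier. The short sketch below only redoes that arithmetic with the two numbers from the output; the variable names `rows_to_add` and `balanced_total` are illustrative and not part of the notebook. It also shows how many extra "survived" rows over-sampling needs to draw to match the majority class.

```python
# Class counts taken from the value_counts() output above.
died, lived = 352, 217
total = died + lived

# Share of each class in the training set.
died_share = died / total
lived_share = lived / total

# Rows the minority class is short by, and the size of a balanced set.
rows_to_add = died - lived
balanced_total = 2 * died

print(f"died share:  {died_share:.6f}")
print(f"lived share: {lived_share:.6f}")
print(rows_to_add, balanced_total)
```

The died share is also the training accuracy of a model that always predicts 0, which is why an imbalance like this can pull a classifier toward the majority class.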

In [9]:
# we want to oversample the minority class, which is "survived"
df_lived = df_lived.sample(died, replace=True, random_state=333)

In [10]:
# get the training data back now that it is evenly distributed:
osample_df = pd.concat([df_lived, df_died])
x_train = osample_df.drop(['Survived'], axis=1)
y_train = osample_df.Survived

In [11]:
# show even distribution of survival class
y_train.value_counts()

1    352
0    352
Name: Survived, dtype: int64
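
The over-sampling steps in the cells above could also be collected into one reusable helper. The following is a minimal sketch under a few assumptions: the function name `oversample_minority` is hypothetical, the toy data is made up to stand in for x_train and y_train, and every class smaller than the largest one is sampled with replacement up to the largest class size.

```python
import pandas as pd


def oversample_minority(features, target, random_state=333):
    """Randomly over-sample each smaller class up to the size of the largest class."""
    # Join features and target so rows are resampled as a whole.
    df = pd.concat([features, target], axis=1)
    counts = df[target.name].value_counts()
    majority_size = int(counts.max())

    parts = []
    for label, count in counts.items():
        rows = df[df[target.name] == label]
        if count < majority_size:
            # Sampling with replacement duplicates some minority rows.
            rows = rows.sample(majority_size, replace=True, random_state=random_state)
        parts.append(rows)

    balanced = pd.concat(parts)
    return balanced.drop(columns=[target.name]), balanced[target.name]


# Toy data standing in for x_train / y_train (values are made up).
toy_x = pd.DataFrame(
    {"Pclass": [0, 1, 2, 0, 1], "Sex": [0, 1, 0, 1, 1]},
    index=pd.Index([10, 11, 12, 13, 14], name="PassengerId"),
)
toy_y = pd.Series([0, 0, 0, 1, 1], index=toy_x.index, name="Survived")

bal_x, bal_y = oversample_minority(toy_x, toy_y)
print(bal_y.value_counts().to_dict())
```

Only the training split should be passed through a helper like this; resampling the test split would distort the evaluation.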

In [12]:
x_train.sort_values(by="PassengerId").head(10)

PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare
1,0,0,-0.530377,0,0,-0.502445
2,1,1,0.571831,0,0,0.786845
3,0,1,-0.254825,1,0,-0.488854
5,0,0,0.365167,1,0,-0.486337
7,1,0,1.674039,1,0,0.395814
11,0,1,-1.77036,0,1,-0.312172
11,0,1,-1.77036,0,1,-0.312172
12,1,1,1.949591,1,0,-0.113846
14,0,0,0.640719,0,3,-0.018709
15,0,1,-1.08148,1,0,-0.49028


The duplicate rows for passenger ID 11 show the effect of over-sampling: some rows from the minority class now appear more than once. The class counts shown earlier (352 each) confirm that the Survived classes are evenly distributed.
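
Rather than spotting duplicates by eye, the index itself can be asked which labels repeat. This is a small sketch on a made-up index standing in for x_train.index; the IDs are illustrative and not taken from the real training set.

```python
import pandas as pd

# Toy index standing in for x_train.index after over-sampling (IDs are illustrative).
idx = pd.Index([1, 2, 11, 11, 12, 15, 15, 15], name="PassengerId")

# keep=False marks every occurrence of a repeated label, not just the later copies.
repeated = idx[idx.duplicated(keep=False)].unique()

# The default keep="first" marks only the extra copies, so the sum counts added rows.
extra_rows = int(idx.duplicated().sum())

print(list(repeated))
print(extra_rows)
```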

The data is already split back into independent and dependent variables, so let's train a new model and see how it does:

In [13]:
models['Logistic Regression 2'] = evaluate(linear_model.LogisticRegression())
models['Logistic Regression 2']['Notes'] = "Minimal Features. Over-sampled to address Survival class imbalance. No Tuning"
pprint(models)

 Logistic Regression 1: 
	 Died: 
		 precision :  0.8085106382978723
		 recall :  0.8636363636363636
		 f1-score :  0.8351648351648351
		 support :  88
	 Survived: 
		 precision :  0.7551020408163265
		 recall :  0.6727272727272727
		 f1-score :  0.7115384615384616
		 support :  55
	 accuracy :  0.7902097902097902
	 macro avg: 
		 precision :  0.7818063395570993
		 recall :  0.7681818181818182
		 f1-score :  0.7733516483516483
		 support :  143
	 weighted avg: 
		 precision :  0.7879688700357393
		 recall :  0.7902097902097902
		 f1-score :  0.7876162299239222
		 support :  143
	 AUC :  0.8022727272727272
	 Classifier :  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
	 Notes :   Minimal Features. No resampling t

## Round 2 Performance:

Compared with Logistic Regression 1, our scores went down by several points on almost all metrics.

Since over-sampling did not help us out, and x_train and y_train were overwritten with the over-sampled data, we need to go back up to the initial train/test split cells and start from scratch.