To start, let's consider k-fold cross-validation. In previous examples we created folds explicitly with the KFold class, which splits the data into a chosen number of folds. We don't have to do that directly, though. We can call the cross_validate function, tell it how many folds to use, and it will handle the splitting of the data for us.
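
As a minimal sketch of what that call returns, the snippet below runs cross_validate on synthetic data rather than the Pima data. The make_classification call, the sample sizes, the number of trees, and the seeds are all assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (an assumption, not the Pima data set)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     weights=[0.65, 0.35], random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)

# cv=4 asks cross_validate to build the four folds itself
cv_out = cross_validate(rf_demo, X_demo, y_demo, cv=4)

print(sorted(cv_out.keys()))   # names of the arrays that were returned
print(cv_out['test_score'])    # one accuracy value per fold
```

With the default arguments the returned dictionary holds fit times, score times, and one test score per fold, which is why the cells below average `test_score`.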

In [4]:
from sklearn import metrics
import pandas as pd
url = "https://raw.githubusercontent.com/steviep42/bios534_spring_2020/master/data/pima.csv"
pm = pd.read_csv(url, sep=',')


# How many people have diabetes ? 
print("Value counts of the diabetes columns:\n")
pm.groupby('diabetes').size()

Value counts of the diabetes columns:



diabetes
neg    500
pos    268
dtype: int64

In [5]:
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X = pm.drop('diabetes',axis=1)
y = pm.diabetes

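# Next we create a training and test pair with 80 / 20 proportions.
# stratify=y is left out for now; it is added in a later cell.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=2)

# Random forest with 200 trees, evaluated with 8-fold cross-validation
rf = RandomForestClassifier(n_estimators=200)
rezults = cross_validate(rf, X_train, y_train, cv=8)
rezults['test_score'].mean().round(3)

# Refit on the full training set, then score on the training and test data.
# Only the last expression in a notebook cell is displayed.
rf.fit(X_train, y_train).score(X_train, y_train)
round(rf.fit(X_train, y_train).score(X_test, y_test), 3)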

0.76

On the other hand, since we do have something of a class imbalance, we need to ensure that each fold reflects the overall proportion of positive to negative cases. In fact, when it is given an integer cv and a classifier, the cross_validate function already stratifies the folds for us where possible, but let's see how we might apply the same idea ourselves when we create the train / test pair.
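
One way to check that claim is to ask scikit-learn which splitter an integer cv resolves to. This is a sketch only: the label vector below is a hypothetical stand-in with the same 500 / 268 imbalance as the diabetes column, and check_cv is used here as a convenient way to inspect the choice of splitter.

```python
import numpy as np
from sklearn.model_selection import check_cv

# Hypothetical labels with the same 500 / 268 imbalance as the diabetes column
y_demo = np.array(['neg'] * 500 + ['pos'] * 268)

# check_cv turns an integer such as cv=8 into a concrete splitter object;
# classifier=True mirrors what happens when the estimator is a classifier
splitter = check_cv(cv=8, y=y_demo, classifier=True)

print(type(splitter).__name__)
print(splitter.get_n_splits())
```

For a binary or multiclass target and a classifier, the resolved splitter is a StratifiedKFold, so the folds keep roughly the same class mix as the full training set.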

In [8]:
# Next we create a training and test pair with 80 / 20 proportions
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                    test_size=0.2,
                                                    random_state=2,
                                                    stratify=y)

train_counts = y_train.value_counts()
test_counts = y_test.value_counts()
print("training group counts: \n",train_counts)
print("testing group counts: \n",test_counts)

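# Percentage of positive (diabetes) cases in each group.
# Index by label rather than by position, which pandas has deprecated.
print("training group diabetes percentages",round(train_counts['pos']/train_counts.sum()*100,2))
print("testing group diabetes percentages",round(test_counts['pos']/test_counts.sum()*100,2))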

rezults = cross_validate(rf,X_train,y_train,cv=8)
rezults['test_score'].mean().round(2)

y_test_preds = rf.fit(X_train,y_train).predict(X_test)

class_report = metrics.classification_report(y_test, 
                                             y_test_preds)

print(class_report)


training group counts: 
 neg    400
pos    214
Name: diabetes, dtype: int64
testing group counts: 
 neg    100
pos     54
Name: diabetes, dtype: int64
training group diabetes percentages 34.85
testing group diabetes percentages 35.06
              precision    recall  f1-score   support

         neg       0.77      0.87      0.82       100
         pos       0.68      0.52      0.59        54

    accuracy                           0.75       154
   macro avg       0.73      0.69      0.70       154
weighted avg       0.74      0.75      0.74       154



In [9]:
# Next we create a training and test pair with 80 / 20 proportions
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                    test_size=0.2,
                                                    random_state=2,
                                                    stratify=y)

train_counts = y_train.value_counts()
test_counts = y_test.value_counts()
print("training group counts: \n",train_counts)
print("testing group counts: \n",test_counts)

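# Percentage of positive (diabetes) cases in each group.
# Index by label rather than by position, which pandas has deprecated.
print("training group diabetes percentages",round(train_counts['pos']/train_counts.sum()*100,2))
print("testing group diabetes percentages",round(test_counts['pos']/test_counts.sum()*100,2))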

# This time we pass an explicit KFold splitter, which does not stratify the folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=8, random_state=None, shuffle=False)

rezults = cross_validate(rf,X_train,y_train,cv=kf)
rezults['test_score'].mean().round(2)

y_test_preds = rf.fit(X_train,y_train).predict(X_test)

class_report = metrics.classification_report(y_test, 
                                             y_test_preds)

print(class_report)

training group counts: 
 neg    400
pos    214
Name: diabetes, dtype: int64
testing group counts: 
 neg    100
pos     54
Name: diabetes, dtype: int64
training group diabetes percentages 34.85
testing group diabetes percentages 35.06
              precision    recall  f1-score   support

         neg       0.77      0.88      0.82       100
         pos       0.69      0.50      0.58        54

    accuracy                           0.75       154
   macro avg       0.73      0.69      0.70       154
weighted avg       0.74      0.75      0.74       154
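
The two classification reports are nearly identical, so the effect of swapping in a plain KFold is easier to see by looking at the folds themselves. The sketch below is an illustration under stated assumptions: the labels are a hypothetical shuffled vector with the same 400 / 214 split as the training group, and the helper pos_fraction_per_fold is not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 400 'neg' and 214 'pos', shuffled with a fixed seed
rng = np.random.default_rng(0)
y_demo = rng.permutation(np.array(['neg'] * 400 + ['pos'] * 214))
X_dummy = np.zeros((len(y_demo), 1))   # features are irrelevant to the split

def pos_fraction_per_fold(splitter):
    """Return the share of 'pos' labels in each validation fold."""
    return [float(np.mean(y_demo[val_idx] == 'pos'))
            for _, val_idx in splitter.split(X_dummy, y_demo)]

kfold_fracs = pos_fraction_per_fold(KFold(n_splits=8))
strat_fracs = pos_fraction_per_fold(StratifiedKFold(n_splits=8))

print("KFold           :", np.round(kfold_fracs, 3))
print("StratifiedKFold :", np.round(strat_fracs, 3))
```

The stratified folds should all sit close to the overall positive rate of about 0.35, while the plain KFold folds are free to drift from it, depending on how the rows happen to be ordered.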

