# Block 1
Importing libraries and loading data

In [28]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

data = sns.load_dataset("geyser")

# Block 2
Create your own k-fold cross validation function

In [105]:
def kfold(X, y, k):
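  """
  Perform k-fold cross validation with a linear regression model and
  print the mean and standard deviation of the test R^2 values.

  X = predictor (a single 1D variable)
  y = variable you want to predict
  k = number of folds
  """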

  # Define our model (linear regression)
  model = LinearRegression()

  # Set a count to 0 (this will allow us to loop through folds)
  count = 0

  # Initialize a list to save the test R^2 values
  test_score_list = []

  # Create a for loop that iterates through the input number of folds
  for i in range(k):
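    # Split X into k contiguous folds; fold `count` is the test set
    X = np.array(X)
    X_split = np.array_split(X, k)
    X_test = X_split[count]
    # Join the remaining folds with list slicing (np.delete on the list
    # needs a ragged array, which newer NumPy versions reject)
    X_train = np.concatenate(X_split[:count] + X_split[count + 1:])

    # Split y the same way so its rows stay aligned with X
    y = np.array(y)
    y_split = np.array_split(y, k)
    y_test = y_split[count]
    y_train = np.concatenate(y_split[:count] + y_split[count + 1:])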

    # Reshape train and test sets into 2D arrays
    X_train = X_train.reshape(-1, 1)
    y_train = y_train.reshape(-1, 1)
    X_test = X_test.reshape(-1, 1)
    y_test = y_test.reshape(-1, 1)

    # increase count by 1 each iteration to loop through folds
    count += 1

    # Fit a linear regression model to the training X and training y data
    model.fit(X_train, y_train)

    # Add the R^2 values to a list
    test_score_list.append(model.score(X_test, y_test))

  # Calculate the average R^2 value
  mean_test_score = np.mean(test_score_list)

  # Calculate the standard deviation of the R^2 values
  score_sd = np.std(test_score_list)

  # Output the average R^2 value and standard deviation of the R^2 values
  print("Average R^2 value:")
  print(mean_test_score)
  print("Standard deviation of R^2 values:")
  print(score_sd)
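
As a sanity check on the hand-written function, the same contiguous, unshuffled splits can be reproduced with scikit-learn's `KFold` and `cross_val_score`. The sketch below is not part of the original analysis: `manual_scores` is a hypothetical helper that mirrors `kfold` but returns the per-fold scores, and `X_demo` / `y_demo` are synthetic stand-in data (an assumption, so that the example runs without downloading the geyser dataset).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): one predictor with a noisy linear response
rng = np.random.default_rng(0)
X_demo = rng.uniform(1.5, 5.0, size=103)
y_demo = 33.0 + 10.7 * X_demo + rng.normal(0.0, 6.0, size=103)


def manual_scores(X, y, k):
    """Hypothetical helper: same contiguous folds as kfold, but returns the R^2 scores."""
    X_split = np.array_split(np.asarray(X, dtype=float), k)
    y_split = np.array_split(np.asarray(y, dtype=float), k)
    scores = []
    for i in range(k):
        # Train on every fold except fold i
        X_train = np.concatenate(X_split[:i] + X_split[i + 1:]).reshape(-1, 1)
        y_train = np.concatenate(y_split[:i] + y_split[i + 1:])
        model = LinearRegression().fit(X_train, y_train)
        # Score on the held-out fold i
        scores.append(model.score(X_split[i].reshape(-1, 1), y_split[i]))
    return np.array(scores)


manual = manual_scores(X_demo, y_demo, 5)

# KFold without shuffling produces the same fold boundaries as np.array_split
builtin = cross_val_score(
    LinearRegression(),
    X_demo.reshape(-1, 1),
    y_demo,
    cv=KFold(n_splits=5, shuffle=False),
    scoring="r2",
)

print("Per-fold R^2 (manual): ", np.round(manual, 4))
print("Per-fold R^2 (sklearn):", np.round(builtin, 4))
print("Scores match:", np.allclose(manual, builtin))
```

If the two sets of scores agree, the manual splitting logic is behaving like the library's unshuffled k-fold. Neither version shuffles the rows, so any ordering in the data carries over into the folds.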

# Block 3
Print the averages and standard deviations for k-folds with k being 3, 5, 10, and 20

In [106]:
# Suppress the ragged-array warning (emitted by older NumPy versions)
# to make the output easier to read
import warnings
warnings.filterwarnings("ignore", message=".*ragged nested sequences")

print("k = 3")
kfold(data["duration"], data["waiting"], 3)
print()

print("k = 5")
kfold(data["duration"], data["waiting"], 5)
print()

print("k = 10")
kfold(data["duration"], data["waiting"], 10)
print()

print("k = 20")
kfold(data["duration"], data["waiting"], 20)

k = 3
Average R^2 value:
0.8000625590143694
Standard deviation of R^2 values:
0.029772518371070162

k = 5
Average R^2 value:
0.8053092844986601
Standard deviation of R^2 values:
0.02413055856676901

k = 10
Average R^2 value:
0.7934416369601104
Standard deviation of R^2 values:
0.08534190424208755

k = 20
Average R^2 value:
0.788593647254244
Standard deviation of R^2 values:
0.09534147944216327


# Block 4
Interpret the results

Based on the results, when using the duration of the previous eruption of the Old Faithful Geyser to predict the amount of time until the next eruption, 5 folds performed best among the values of k tested. With k = 5, the data is split into five approximately equal folds, and the model is trained five times on four of the folds, leaving a different fold out each time. The left-out fold is used to test how well the model predicts data it has not seen. Each iteration gives a new R^2 score, so we compare the values of k using the average of those scores and their standard deviation.

In this case, k = 5 gives the highest average R^2 score (0.805) and the lowest standard deviation of R^2 values (0.024). A higher R^2 is better because it signifies a better-fitting linear regression model. A lower standard deviation is better because it means the R^2 scores vary less from fold to fold. This matters because high variance in R^2 between folds means the score depends heavily on which rows are held out, which could be indicative of overfitting. Thus, of the options tested, 5 folds is the best choice when predicting the amount of time until the next eruption from the duration of the previous eruption, followed by 3 folds, then 10 folds, with 20 folds the worst. The differences in average R^2 are small, however (0.789 to 0.805).
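
One thing to keep in mind when comparing the standard deviations is that the test folds shrink as k grows, and R^2 computed on fewer rows tends to be noisier. That may account for part of the larger spread at k = 10 and k = 20. The following minimal sketch shows the fold sizes that `np.array_split` produces for each k, assuming the 272 rows of seaborn's geyser dataset (`n_rows` and `fold_sizes` are illustrative names, not part of the original code).

```python
import numpy as np

n_rows = 272  # assumed number of rows in seaborn's "geyser" dataset

fold_sizes = {}
for k in (3, 5, 10, 20):
    # Split row indices the same way kfold splits X and y
    folds = np.array_split(np.arange(n_rows), k)
    # Record the distinct fold lengths (folds differ by at most one row)
    fold_sizes[k] = sorted({len(fold) for fold in folds})
    print(f"k = {k:>2}: test folds hold {fold_sizes[k]} rows each")
```

At k = 20 each R^2 value is computed from only 13 or 14 test rows, compared with 54 or 55 rows at k = 5.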

We can also conclude that the duration of the previous eruption is a fairly strong predictor of the amount of time until the next eruption. This can be determined by looking at the average R^2 values found when using k-fold cross validation, which range from ~0.79 to ~0.81. These values indicate that our models were able to explain roughly 80% of the variance in the waiting time on held-out data.
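
To make the "fraction of variance explained" reading concrete, R^2 can be computed by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of y from its mean. The sketch below uses synthetic data (an assumption, not the geyser measurements) and checks the hand calculation against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in data (assumption): noisy linear relationship
rng = np.random.default_rng(1)
x = rng.uniform(1.5, 5.0, size=200)
y = 33.0 + 10.7 * x + rng.normal(0.0, 6.0, size=200)

model = LinearRegression().fit(x.reshape(-1, 1), y)
y_pred = model.predict(x.reshape(-1, 1))

# Residual sum of squares: variation the model fails to capture
ss_res = np.sum((y - y_pred) ** 2)
# Total sum of squares: variation of y around its mean
ss_tot = np.sum((y - y.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
print("R^2 by hand:    ", r2_manual)
print("R^2 via sklearn:", r2_score(y, y_pred))
```

An R^2 of about 0.80 therefore means the squared prediction error is about 20% of the squared error obtained by always predicting the mean of y.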