In [1]:
library(tidyverse)
library(tidymodels)
library(repr)
options(repr.matrix.max.rows = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.4     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   2.0.1     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



ERROR: Error in library(tidymodels): there is no package called ‘tidymodels’


In [2]:
set.seed(100)
# Necessary Code for reproducibility
rock_data_raw <- read_csv("https://raw.githubusercontent.com/yxing6/DSCI_100_Project_Group4/main/data/of_8460_database.csv")

borehole_data_raw <- rock_data_raw %>% 
    select("Location Type *", "MIRA Master Litho 1", "GRAIN DEN Sample Value [g/cm3]",
           "POR Sample Value [%]","MS Sample Value [SI A/m / A/m]",
           "NRM Sample Value [A/m]","RES Sample Value [Ohm.m]","CHG Sample Value [ms]") %>%
    rename("sample_type" = "Location Type *",
          "lithology" = "MIRA Master Litho 1",
           "density" = "GRAIN DEN Sample Value [g/cm3]",
           "porosity" = "POR Sample Value [%]",
           "MS" = "MS Sample Value [SI A/m / A/m]",
           "NRM" = "NRM Sample Value [A/m]",
           "RES" = "RES Sample Value [Ohm.m]",
           "chargeability" = "CHG Sample Value [ms]") %>%
    mutate(sample_type = as_factor(sample_type),
          lithology = as_factor(lithology)) %>%
    filter(lithology != "Other", sample_type == "Borehole")

borehole_data_unscaled <- na.omit(borehole_data_raw) %>%
    select(-sample_type)

borehole_data_split <- initial_split(borehole_data_unscaled, prop = 0.75, strata = lithology)
data_train <- training(borehole_data_split)
data_test <- testing(borehole_data_split)

[1m[1mRows: [1m[22m[34m[34m19653[34m[39m [1m[1mColumns: [1m[22m[34m[34m84[34m[39m

[36m──[39m [1m[1mColumn specification[1m[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (64): 0Phys Props LabID, Sample Name *, Alt Sample Name, SMS Curation Nu...
[32mdbl[39m (20): Date Mapped (yyyy/mm/dd) *, Lat Deg, Long Deg, UTM zone, Easting, ...


[36mℹ[39m Use [30m[47m[30m[47m`spec()`[47m[30m[49m[39m to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set [30m[47m[30m[47m`show_col_types = FALSE`[47m[30m[49m[39m to quiet this message.



ERROR: Error in initial_split(borehole_data_unscaled, prop = 0.75, strata = lithology): could not find function "initial_split"


Down the line you may run into this warning:
> ! Fold1: internal: No observations were detected in `truth` for level(s): 'Other'...

I'll let you figure this one out, but it is closely related to one of the problems you hit when you try to upsample. Because the upsampling step also tries to upsample the `Other` factor level, you end up with an additional couple of hundred `NA` observations.

In [82]:
levels(data_train$lithology)
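When reading the `levels()` output, it may help to remember that filtering rows does not change the set of levels a factor knows about. A minimal sketch on made-up data (the `toy` tibble and its values are hypothetical, not taken from the borehole file):

```r
library(dplyr)

# Toy data standing in for the lithology column (hypothetical values).
toy <- tibble(
  lithology = factor(c("Igneous", "Other", "Sedimentary", "Igneous")),
  density   = c(2.70, 2.50, 2.64, 2.81)
)

# filter() drops the matching rows, but the factor keeps every original level.
toy_filtered <- toy %>% filter(lithology != "Other")

levels(toy_filtered$lithology)   # still lists "Other"
table(toy_filtered$lithology)    # "Other" appears with a count of 0
```

Comparing `levels()` against `table()` like this is one way to spot a level that has no observations behind it.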

If you look closely at the MS column of your training data, you will notice there are 26 observations with negative values. As you may know, you cannot take the logarithm of a negative number, so for observations with a negative MS the recipe will produce `NaN`. For now I'll just remove all the observations that contain a negative MS with `setdiff()`.
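To see the `NaN` behaviour in isolation, here is a small sketch on made-up numbers. The `toy` tibble and the `filter(MS > 0)` alternative are assumptions for illustration, not the approach used in the notebook:

```r
library(dplyr)

# log() of a negative number returns NaN (with a warning); log(0) returns -Inf.
suppressWarnings(log(-1.6e-06))
log(0)

# Hypothetical MS readings; the second one is negative, like the rows in question.
toy <- tibble(id = 1:3, MS = c(1.2e-04, -1.6e-06, 3.4e-05))

# One alternative to setdiff(): keep only the rows whose MS can be logged.
toy_positive <- toy %>% filter(MS > 0)
toy_positive
```

Note that `filter(MS > 0)` would also drop exact zeros, which the `setdiff()` approach keeps. dplyr documents its set operations as removing duplicate rows, so the two approaches can give different row counts if the data contain duplicated rows.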

In [83]:
negative_MS <- data_train %>%
    filter(MS < 0)
negative_MS

lithology,density,porosity,MS,NRM,RES,chargeability
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Metamorphic,2.665,0.35,-1.60e-06,0.000587,3540,1.535
Sedimentary,2.636,10.66,-1.51e-06,0.000392,964,0.299
Sedimentary,2.639,8.51,-1.78e-06,0.000241,1600,0.272
⋮,⋮,⋮,⋮,⋮,⋮,⋮
Metamorphic,2.646,0.33,-1.09e-05,0.000373,20400,3.223
Metamorphic,2.637,0.51,-9.72e-06,0.000335,15100,1.351
Metamorphic,2.644,0.67,-4.34e-06,0.000255,13200,2.330


In [84]:
set.seed(100)

# A new training_dataset that gets rid of the observations with a negative MS.
new_training <- setdiff(data_train, negative_MS)

test <- recipe(lithology ~ ., data = new_training) %>%
         step_upsample(lithology, over_ratio = 1, skip = FALSE) %>%
         step_log(porosity, MS, NRM, RES, chargeability) %>%
         step_scale(all_predictors()) %>%
         step_center(all_predictors()) %>%
         prep() %>%
         bake(new_training)
test

density,porosity,MS,NRM,RES,chargeability,lithology
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
-0.2468731,-0.4021398,1.4664554,1.473474,0.9147524,0.8525974,Igneous
1.1033663,-0.4258526,1.9765107,2.029514,1.1376573,0.6039673,Igneous
-0.6268400,1.0371133,0.6681407,1.199924,-0.3549377,-0.1858356,Igneous
⋮,⋮,⋮,⋮,⋮,⋮,⋮
,,,,,,
,,,,,,
,,,,,,


The `NA` rows appear specifically because of the issue I first mentioned at the top. Fix that issue and rerun the code and the output will have 1,662 observations. Keep in mind the observations with a negative MS, which are why there aren't exactly 1,677 observations, but at this point you shouldn't have whole rows that are all `NA`.
## Don't continue until you fix the above issue
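One way to check whether the baked output still contains rows that are entirely missing is to count them directly. This is only a sketch: `baked_toy` is a hypothetical stand-in for the baked data frame, with its last row mimicking the all-`NA` rows in the output above.

```r
# Hypothetical baked output; the third row is missing in every column.
baked_toy <- data.frame(
  density   = c(-0.25, 1.10, NA),
  MS        = c(1.47, NaN, NA),
  lithology = factor(c("Igneous", "Igneous", NA))
)

# TRUE for rows where every column is missing (is.na() also counts NaN).
all_na_row <- rowSums(is.na(baked_toy)) == ncol(baked_toy)
sum(all_na_row)   # number of entirely-missing rows

# TRUE for rows with at least one NaN in a numeric column,
# for example from log() of a negative value.
numeric_cols <- sapply(baked_toy, is.numeric)
any_nan_row  <- rowSums(sapply(baked_toy[numeric_cols], is.nan)) > 0
```

If `sum(all_na_row)` is zero on your real baked data, the whole-row `NA` problem is gone.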

At this point you should have fixed two of the problems you asked about:
> ! Fold1: recipe: NaNs produced

> ! Fold1: model (predictions): NaNs produced

In [80]:
training_recipe <- recipe(lithology ~ ., data = new_training) %>%
     step_upsample(lithology, over_ratio = 1, skip = FALSE) %>%
     step_log(porosity, MS, NRM, RES, chargeability) %>%
     step_scale(all_predictors()) %>%
     step_center(all_predictors()) %>%
     prep()

training_vfold <- vfold_cv(new_training, v = 5, strata = lithology)

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) %>%
       set_engine("kknn") %>%
       set_mode("classification")

knn_results <- workflow() %>%
       add_recipe(training_recipe) %>%
       add_model(knn_tune) %>%
       fit_resamples(resamples = training_vfold)

x Fold1: model (predictions): Error: Problem with `mutate()` input `.row`.
✖...

x Fold2: model (predictions): Error: Problem with `mutate()` input `.row`.
✖...

x Fold3: model (predictions): Error: Problem with `mutate()` input `.row`.
✖...

x Fold4: model (predictions): Error: Problem with `mutate()` input `.row`.
✖...

x Fold5: model (predictions): Error: Problem with `mutate()` input `.row`.
✖...

“All models failed in [fit_resamples()]. See the `.notes` column.”


The issue seems to be related to `step_upsample()`, so I will ask the rest of the teaching team and get back to you later. For now, I would just remove the `step_upsample()` step. (Your note said Trevor suggested that as well.)
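As a rough sketch of what the recipe looks like with the upsampling step removed, here it is applied to a small made-up data frame. `toy_training` and its values are assumptions for illustration; in the notebook you would use `new_training` and log all five predictors as before.

```r
library(recipes)   # attaches dplyr as well, which provides %>%

set.seed(100)

# Hypothetical stand-in for new_training: strictly positive predictors,
# and every factor level has at least one observation.
toy_training <- data.frame(
  lithology = factor(rep(c("Igneous", "Metamorphic", "Sedimentary"),
                         times = c(12, 5, 3))),
  density   = runif(20, 2.5, 3.0),
  MS        = runif(20, 1e-6, 1e-3),
  RES       = runif(20, 100, 20000)
)

# Same preprocessing steps as in the notebook, minus step_upsample().
toy_recipe <- recipe(lithology ~ ., data = toy_training) %>%
  step_log(MS, RES) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

# new_data = NULL returns the preprocessed training set.
baked <- toy_recipe %>%
  prep() %>%
  bake(new_data = NULL)

nrow(baked)    # same number of rows as toy_training, since nothing is upsampled
anyNA(baked)   # no missing values expected with positive inputs
```

Without the upsampling step the classes stay imbalanced, so it is worth keeping that in mind when you interpret the accuracy from `fit_resamples()`.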