## Machine Learning Methods
This notebook is my entry in a prediction competition for my machine learning class (EDS-232), taught by Matteo Robbins. In this exercise, I demonstrate machine learning methods for predicting dissolved inorganic carbon (DIC) levels in water samples collected by the California Cooperative Oceanic Fisheries Investigations (CalCOFI) program. The data were downloaded from the CalCOFI data portal, where the bottle and cast data were merged and the relevant variables were selected. The data are split into a training set and a testing set. Machine learning models are trained on the training set and then evaluated on the testing set.

### Using multiple methods

The goal of this exercise is to demonstrate and optimize multiple machine learning methods. I am particularly interested in fitting a decision tree, a random forest, and a deep learning model to these data.

### Load Modules

In [47]:
# Data handling
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing and model selection
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

# Models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Metrics
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error, r2_score

### Load Data

In [48]:
# Load the datasets
train = pd.read_csv('data/train.csv')

test = pd.read_csv('data/test.csv')

### Exploring the data

In [49]:
# View head of the dataset
train.head(3)

Unnamed: 0,id,Lat_Dec,Lon_Dec,NO2uM,NO3uM,NH3uM,R_TEMP,R_Depth,R_Sal,R_DYNHT,R_Nuts,R_Oxy_micromol.Kg,Unnamed: 12,PO4uM,SiO3uM,TA1.x,Salinity1,Temperature_degC,DIC
0,1,34.38503,-120.66553,0.03,33.8,0.0,7.79,323,141.2,0.642,0.0,37.40948,,2.77,53.86,2287.45,34.198,7.82,2270.17
1,2,31.418333,-121.998333,0.0,34.7,0.0,7.12,323,140.8,0.767,0.0,64.81441,,2.57,52.5,2279.1,34.074,7.15,2254.1
2,3,34.38503,-120.66553,0.18,14.2,0.0,11.68,50,246.8,0.144,0.0,180.2915,,1.29,13.01,2230.8,33.537,11.68,2111.04


In [50]:
# Check for missing values
train.isnull().sum()

id                      0
Lat_Dec                 0
Lon_Dec                 0
NO2uM                   0
NO3uM                   0
NH3uM                   0
R_TEMP                  0
R_Depth                 0
R_Sal                   0
R_DYNHT                 0
R_Nuts                  0
R_Oxy_micromol.Kg       0
Unnamed: 12          1454
PO4uM                   0
SiO3uM                  0
TA1.x                   0
Salinity1               0
Temperature_degC        0
DIC                     0
dtype: int64

In [51]:
# Check shape of the dataset
train.shape

(1454, 19)

The training data have 1,454 rows and 19 columns. The column `Unnamed: 12` is missing in all 1,454 rows, so it carries no information and we can drop it entirely.
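As an aside, the same cleanup can be done without naming the column. The sketch below uses a toy frame (three made-up rows that borrow a few values from the preview above, not the real training data) to show how pandas can find and drop every column that has no observed values. The names `df`, `all_null_cols`, and `cleaned` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (an assumption for illustration)
df = pd.DataFrame({
    "NO2uM": [0.03, 0.00, 0.18],
    "Unnamed: 12": [np.nan, np.nan, np.nan],  # entirely missing
    "DIC": [2270.17, 2254.10, 2111.04],
})

# Names of the columns in which every value is missing
all_null_cols = df.columns[df.isnull().all()].tolist()

# how="all" drops a column only if it has no observed values at all
cleaned = df.dropna(axis=1, how="all")

print(all_null_cols)
print(cleaned.columns.tolist())
```

Dropping by condition rather than by name may be more robust if the stray column is labeled differently in another export, although naming the column explicitly, as this notebook does, is easier to audit.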

In [52]:
# Drop the column
train = train.drop('Unnamed: 12', axis = 1)

# Check if it dropped
train.isnull().sum()

id                   0
Lat_Dec              0
Lon_Dec              0
NO2uM                0
NO3uM                0
NH3uM                0
R_TEMP               0
R_Depth              0
R_Sal                0
R_DYNHT              0
R_Nuts               0
R_Oxy_micromol.Kg    0
PO4uM                0
SiO3uM               0
TA1.x                0
Salinity1            0
Temperature_degC     0
DIC                  0
dtype: int64

In [53]:
# Check the data types of the columns
train.dtypes

id                     int64
Lat_Dec              float64
Lon_Dec              float64
NO2uM                float64
NO3uM                float64
NH3uM                float64
R_TEMP               float64
R_Depth                int64
R_Sal                float64
R_DYNHT              float64
R_Nuts               float64
R_Oxy_micromol.Kg    float64
PO4uM                float64
SiO3uM               float64
TA1.x                float64
Salinity1            float64
Temperature_degC     float64
DIC                  float64
dtype: object

All of our variables are `float64` or `int64`, so they are numeric. More importantly, the target `DIC` is a continuous quantity, which makes this a regression problem rather than a classification problem, so we need regression models. The models I am initially interested in are a decision tree, a random forest, and a deep learning model. Let's work through them one at a time, starting with a decision tree.

### Train Decision Tree 

In [54]:
# Define features and target
X = train.drop('DIC', axis = 1)
y = train['DIC']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 808)

# Scale data using StandardScaler
scaler = StandardScaler()

# Fit and transform the training data
X_train = scaler.fit_transform(X_train)

# Transform the testing data
X_test = scaler.transform(X_test)
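The key point in the cell above is that the scaler is fitted on the training split only and then reused on the held-out split. One way to make that hard to get wrong is to bundle the scaler and the model in a scikit-learn pipeline. The sketch below is an alternative pattern, not what this notebook does, and it runs on synthetic data (`X_demo`, `y_demo` are made up) with an arbitrary `max_depth`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration only)
rng = np.random.default_rng(808)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=808
)

# The pipeline fits the scaler on the training split only and
# re-applies the same transformation whenever predict() is called
model = make_pipeline(
    StandardScaler(),
    DecisionTreeRegressor(max_depth=4, random_state=808),
)
model.fit(X_tr, y_tr)
preds = model.predict(X_te)

print(preds.shape)
```

Tree splits depend only on the ordering of feature values, so standardizing is not expected to change a decision tree much. The pattern matters more for scale-sensitive models such as neural networks.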

### Define Decision Tree and Parameters

In [55]:
dt = DecisionTreeRegressor()

# Define the hyperparameter search space
# (min_samples_split must be an integer >= 2, so 1 is not a valid candidate)
param_dist = {
    'max_depth': [1, 2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5, 6, 7],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

# Tune the hyperparameters with RandomizedSearchCV
# (samples n_iter = 10 candidate combinations by default)
random_search = RandomizedSearchCV(dt,
                           param_dist,
                           cv = 5, # 5-fold cross-validation
                           n_jobs = -1) # Use all cores

# Fit the model
random_search.fit(X_train, y_train)

# Get the best parameters
best_dt_params = random_search.best_params_
print(f"Best Hyperparameters: {best_dt_params}")

# Keep the refit best model for evaluation
best_dt = random_search.best_estimator_

Best Hyperparameters: {'min_samples_split': 7, 'min_samples_leaf': 1, 'max_features': None, 'max_depth': 4}

The run that produced this output also included `min_samples_split = 1` in the search space. scikit-learn rejects that value (an integer `min_samples_split` must be at least 2), so 10 of the 50 fits failed with a parameter validation error and their scores were set to NaN. The selected combination above does not use the invalid value.
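A randomized search like the one above draws different candidates on every run unless it is seeded, and by default it ranks regressors by R² rather than by RMSE. The sketch below shows one possible variation under stated assumptions: synthetic data, a fixed `random_state`, RMSE-based scoring, and `error_score="raise"` so that an invalid parameter stops the search instead of being silently scored as NaN. None of these settings come from the original notebook.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration only)
rng = np.random.default_rng(808)
X_demo = rng.normal(size=(150, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=150)

# Every value here satisfies scikit-learn's parameter constraints
param_dist = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5, 6, 7],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=808),
    param_dist,
    n_iter=10,                              # number of sampled combinations
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",  # rank candidates by RMSE
    random_state=808,                       # reproducible sampling
    error_score="raise",                    # fail loudly on invalid parameters
)
search.fit(X_demo, y_demo)

# The scorer is negated so that larger is better; flip the sign back
cv_rmse = -search.best_score_
print(search.best_params_)
print(f"Cross-validated RMSE: {cv_rmse:.3f}")
```

Scoring by RMSE keeps the tuning criterion consistent with the metric reported later in the notebook, which may or may not match how the competition itself is scored.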

### Evaluate the model on the held-out split of the training data

In [56]:
# Predict on the held-out split with the best model
y_dt_pred = best_dt.predict(X_test)

# Calculate RMSE from the best model
dt_rmse = np.sqrt(mean_squared_error(y_test, y_dt_pred))
print(f"Decision Tree RMSE: {dt_rmse:.3f}")

Decision Tree RMSE: 10.655
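For reference, RMSE is the square root of the mean of the squared differences between observed and predicted values, so it is expressed in the same units as DIC. The short sketch below computes it by hand. The observed values are the DIC values of the first three training rows shown earlier, while the predictions are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

observed = [2270.17, 2254.10, 2111.04]   # DIC of the first three training rows
predicted = [2260.17, 2264.10, 2111.04]  # hypothetical predictions

# Errors are roughly 10, -10 and 0, so RMSE is about sqrt(200 / 3)
print(round(rmse(observed, predicted), 3))
```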


We have calculated the RMSE on the held-out 30% of the training data. Now, let's use the test set we initially loaded to see how the model handles new, unseen data.

In [57]:
# Compare the testing and training data:
# check whether both datasets have the same columns in the same order
test.columns.equals(train.columns)

False

In [58]:
# Get the column names of both datasets
train_columns = set(train.columns)
test_columns = set(test.columns)

# Find the differences in column names
missing_in_train = test_columns - train_columns
missing_in_test = train_columns - test_columns

# Show the differences
if missing_in_train or missing_in_test:
    if missing_in_train:
        print(f"Columns in test dataset but not in train dataset: {missing_in_train}")
    if missing_in_test:
        print(f"Columns in train dataset but not in test dataset: {missing_in_test}")
else:
    print("The datasets have the same column names.")


Columns in test dataset but not in train dataset: {'TA1'}
Columns in train dataset but not in test dataset: {'DIC', 'TA1.x'}


Now we see which columns are missing from each dataset. It appears that `TA1` and `TA1.x` are the same variable under two different names, which we can fix by renaming one to match the other. `DIC` is also absent from the testing set. This is expected, since it is our target variable.
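Beyond matching names, the test columns also need to be in the same order as the features the model was fitted on. The helper below is a hypothetical sketch (`align_to_features` is not part of pandas or of this notebook) that renames columns, checks that nothing is missing, and returns the features in the training order. The two small frames are made up, using a few values from the previews in this notebook.

```python
import pandas as pd

def align_to_features(df, feature_names, renames=None):
    """Rename columns, then return them in the order the model was fit on."""
    out = df.rename(columns=renames or {})
    missing = [c for c in feature_names if c not in out.columns]
    if missing:
        raise KeyError(f"Missing feature columns: {missing}")
    return out[list(feature_names)]

# Toy frames (an assumption for illustration): note the different
# column order and the TA1 / TA1.x naming mismatch
train_like = pd.DataFrame({"id": [1], "TA1.x": [2287.45], "Salinity1": [34.198]})
test_like = pd.DataFrame({"Salinity1": [33.83], "id": [1455], "TA1": [2244.94]})

aligned = align_to_features(
    test_like, train_like.columns, renames={"TA1": "TA1.x"}
)
print(aligned.columns.tolist())
```

Raising an error on missing columns is a design choice: it surfaces a mismatch immediately instead of letting the model predict on misaligned inputs.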

In [59]:
# Rename TA1 to TA1.x to match our training data
test = test.rename(columns = {'TA1': 'TA1.x'})

# Check to see if the column was renamed
test.head(3)

Unnamed: 0,id,Lat_Dec,Lon_Dec,NO2uM,NO3uM,NH3uM,R_TEMP,R_Depth,R_Sal,R_DYNHT,R_Nuts,R_Oxy_micromol.Kg,PO4uM,SiO3uM,TA1.x,Salinity1,Temperature_degC
0,1455,34.321666,-120.811666,0.02,24.0,0.41,9.51,101,189.9,0.258,0.41,138.8383,1.85,25.5,2244.94,33.83,9.52
1,1456,34.275,-120.033333,0.0,25.1,0.0,9.84,102,185.2,0.264,0.0,102.7092,2.06,28.3,2253.27,33.963,9.85
2,1457,34.275,-120.033333,0.0,31.9,0.0,6.6,514,124.1,0.874,0.0,2.174548,3.4,88.1,2316.95,34.241,6.65


Now we can make predictions on our test data. 

In [60]:
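# Scale the test features with the scaler fitted on the training data,
# keeping the same column order the model was trained on
test_scaled = scaler.transform(test[X.columns])

# Predict on test data
y_pred_dt = best_dt.predict(test_scaled)

# Create a dataframe containing id and DIC for competition submission
test['DIC'] = y_pred_dt
submission = test[['id', 'DIC']]

# Save the submission to a csv
submission.to_csv('submission_dt.csv', index = False)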
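Before uploading, it can help to sanity-check the submission frame. The sketch below assumes the competition expects exactly the two columns written above (`id` and `DIC`) with one prediction per test id. That requirement, and the helper `check_submission`, are assumptions for illustration rather than documented competition rules. The ids come from the test preview above and the predictions are made up.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Raise ValueError if the submission frame looks malformed."""
    if list(submission.columns) != ["id", "DIC"]:
        raise ValueError("expected exactly the columns ['id', 'DIC']")
    if submission["DIC"].isnull().any():
        raise ValueError("DIC contains missing predictions")
    if submission["id"].duplicated().any():
        raise ValueError("duplicate ids in submission")
    if set(submission["id"]) != set(expected_ids):
        raise ValueError("ids do not match the test set")
    return True

# Toy submission with hypothetical predictions
good = pd.DataFrame({"id": [1455, 1456, 1457], "DIC": [2180.0, 2195.5, 2301.2]})
print(check_submission(good, expected_ids=[1455, 1456, 1457]))
```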

