# Keras MLP with Extended Training Data

This kernel tries a simple multi-layer NN with just the tabular data. No images here. 

It extends the training data by taking all the measurements for each patient and creating all possible combinations of them. The thinking is that the goal here is to predict the FVC value for every week from -12 to 133 from just the base FVC value and the image. You are given a set of training data with the base FVC value, and a number of FVC values for the following weeks (or the weeks before). In the test scenario, you must then predict the FVC for all the weeks, given just one week and its FVC.

The extended training data is built by taking each FVC measurement per patient and creating new rows where that week's FVC is the base and each of the other FVC/week measurements of the same patient is the target. It gives about 12k rows of training data vs the 1.5k or so in the original.
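
As a rough illustration of the pairing idea, here is a minimal sketch on made-up values. The names `toy` and `extend_pairs` are hypothetical, and the sketch only carries the week and FVC columns, so it is not the kernel's actual code (which comes later and also copies the other features):

```python
import itertools
import pandas as pd

# Toy data: column names follow the competition's train.csv, values are made up.
toy = pd.DataFrame({
    "Patient": ["A", "A", "A", "B", "B"],
    "Weeks":   [0, 5, 9, 2, 7],
    "FVC":     [3000, 2950, 2900, 2500, 2480],
})

def extend_pairs(df):
    """Build one row per ordered (base visit, target visit) pair within each patient."""
    rows = []
    for patient, visits in df.groupby("Patient"):
        records = visits.to_dict("records")
        for base, target in itertools.permutations(records, 2):
            rows.append({
                "patient_id": patient,
                "base_week": base["Weeks"],
                "base_fvc": base["FVC"],
                "target_week": target["Weeks"],
                "target_fvc": target["FVC"],
            })
    return pd.DataFrame(rows)

toy_pairs = extend_pairs(toy)
print(len(toy_pairs))  # 3*2 + 2*1 = 8 ordered pairs
```

A patient with n visits yields n*(n-1) rows this way, which is why the row count grows so quickly compared to the original data.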

Much clever, such data? Don't know, the results score at around position 500, but maybe I am just bad at it :)

I have also experimented with combining this with my 3D CNN model, but that combination did not turn out to be very useful. This MLP kernel alone scored better than the combination, so something is wrong there I guess. The idea was to train the CNN on the images separately and then combine it with this MLP for experiments. The images would just repeat for all the extended data generated here, which makes it very resource intensive to train with 12k images that just repeat (although my CNN kernel adds augmentation, so it's not 100% the same). I have a kernel that does that (combines this MLP with the CNN), but I'd rather not spam too many kernels with almost the same code.

Hope you find something interesting in this.

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold, GroupKFold, GroupShuffleSplit

import tensorflow
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Dense, Flatten, Input
from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Dropout, Activation
from tensorflow.keras.layers import Conv3D, MaxPooling3D
from tensorflow.keras import layers

from tensorflow.keras.callbacks import (ModelCheckpoint, LearningRateScheduler,
                             EarlyStopping, ReduceLROnPlateau,CSVLogger)

from tensorflow.keras import backend as K

import matplotlib.pylab as plt

from skimage.io import imread
from skimage.transform import resize
from tensorflow.keras.utils import Sequence
import math


from tqdm.auto import tqdm
tqdm.pandas()


# A Few Configuration Variables

In [None]:
#which columns from the train and test dataframe to use for the model (the Dense layers..)
dense_cols = ['base_fvc', 'base_week', 'pct', 'age', 'gender_female', 'gender_male', 'smoking_status', 'target_week']
batch_size = 16
epochs = 30
#how many splits to do on the training data, or how many cross-validation rounds to use
N_SPLITS = 10
#my_test_pct is the fraction of patients held out from the train/validation data, to compare the final model against known results
my_test_pct = 0.05


The preprocessed data is in a [dataset](https://www.kaggle.com/donkeys/osic-pulmonary-fibrosispreprocessed) I previously uploaded. This is where it mounts:

In [None]:
DATA_DIR = "/kaggle/input/osic-pulmonary-fibrosispreprocessed/dataset"

# Brief Overview of the Data



In [None]:
df_train_orig = pd.read_csv("/kaggle/input/osic-pulmonary-fibrosis-progression/train.csv")
df_train = pd.read_csv(f"{DATA_DIR}/df_train_scaled_continous_smoke.csv").drop("Unnamed: 0", axis=1)
df_train.head()

In this case I used a continuous value for the smoking feature, although you could argue it is categorical: 0 for never smoked, 1 for currently smoking, and 0.5 for ex-smoker.
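
One way to make that ordering explicit is a plain dictionary mapping. This is only a sketch: `SMOKING_SCALE` is a hypothetical name, and it is not how the preprocessed dataset was necessarily built. It can be worth doing because scikit-learn's `LabelEncoder` numbers classes in sorted label order, not in the order passed to `fit`.

```python
import pandas as pd

# Hypothetical explicit mapping; the category names are those used in the competition data.
SMOKING_SCALE = {"Never smoked": 0.0, "Ex-smoker": 0.5, "Currently smokes": 1.0}

statuses = pd.Series(["Ex-smoker", "Never smoked", "Currently smokes", "Ex-smoker"])
encoded = statuses.map(SMOKING_SCALE)
print(encoded.tolist())  # [0.5, 0.0, 1.0, 0.5]
```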

In [None]:
df_train["SmokingStatus"].unique()

In [None]:
#the original competition data, not preprocessed or anything else like that
df_test_orig = pd.read_csv("/kaggle/input/osic-pulmonary-fibrosis-progression/test.csv")
df_test_orig.head()

List of patients in the training set. This can be useful, for example, to make patient-grouped data-splits.

In [None]:
patient_ids = df_train["Patient"].unique()
patient_ids.shape

# Extend Training Data

In [None]:
df_train.head()

Here we create new rows for all combinations of base-target FVC values that can be created from the training set. A pair is skipped when the two FVC values are equal, which also covers pairing a row with itself.

In [None]:
patient_ids = df_train["Patient"].unique()
training_rows = []
patient_count = 0
for patient_id in tqdm(patient_ids):
    df_patient = df_train[df_train["Patient"] == patient_id]
    patient_row_count = 0
    for idx, row in df_patient.iterrows():
        row_fvc = row["FVC"]
        for idx2, row2 in df_patient.iterrows():
            if row2["FVC"] == row_fvc:
                continue
            patient_row_count += 1
            training_row = {}
            training_row["patient_id"] = row["Patient"]
            training_row["base_fvc"] = row_fvc
            training_row["base_week"] = row["Weeks"]
            training_row["pct"] = row["Percent"]
            training_row["age"] = row["Age"]
            training_row["gender_female"] = row["Sex_Female"]
            training_row["gender_male"] = row["Sex_Male"]
            training_row["smoking_status"] = row["SmokingStatus"]
            training_row["target_week"] = row2["Weeks"]
            training_row["target_fvc"] = row2["fvc_raw"]
            training_rows.append(training_row)
    print(f"created {patient_row_count} instances for patient {patient_id}")
    patient_count += 1
print(f"processed {patient_count} patients")
print(f"rows before: {df_train.shape}")
print(f"extended rows: {len(training_rows)}")


In [None]:
df_new_train = pd.DataFrame(training_rows)
df_new_train

In [None]:
df_new_train.columns

In [None]:
x_cols = [col for col in df_new_train.columns if col != "target_fvc"]
df_x = df_new_train[x_cols]
df_y = df_new_train["target_fvc"]


In [None]:
df_x.head()

In [None]:
df_y.head()

# Preprocessing the Test Set

Unlike my CNN model, I will submit this one as an example for the competition results. For the submission, the test set is different from the one given in the downloadable data, so I cannot rely on only the preprocessed test set but have to preprocess the live test set here.

When the kernel is submitted for the competition, Kaggle runs the kernel again but with the much larger and real test set. You can actually take the trained model weights from above and build a kernel to do only predictions with the pre-trained model. Running it with the given test set of 5 rows takes about 4 minutes.

Running a prediction-only version with the actual submission data takes somewhere around 2 hours, so that's about the scale difference between the two as well. I believe this is also why you must have internet off, so you cannot upload the real test set and look at it yourself outside the submission, and why there is no execution log to debug errors: you could see information about the actual test set in that log.

In any case, the functions to do the preprocessing:

In [None]:
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder, MultiLabelBinarizer
import pickle

#scale numerical data to 0-1, one-hot categoricals. one-hot smoking if desired, else just 0, 0.5, and 1.
def scale_data(df_train, df_test, scale_smoking):
    df_train["train"] = 1
    df_test["train"] = 0
    df_full = pd.concat([df_train, df_test], axis=0)
    df_full["fvc_raw"] = df_full["FVC"]
    df_full = label_encode_data(df_full, scale_smoking)
    df_full = scale_minmax(df_full, scale_smoking)
    df_full = onehot_data(df_full, scale_smoking)

    df_train = df_full[df_full["train"] == 1].drop("train", axis=1)
    df_test = df_full[df_full["train"] == 0].drop("train", axis=1)
    return df_train, df_test

def scale_minmax(df_full, scale_smoking):
    minmax = MinMaxScaler()
    scale_cols = ["Weeks", "FVC", "Percent", "Age"]
    if scale_smoking:
        scale_cols.append("SmokingStatus")
    df_full[scale_cols] = minmax.fit_transform(df_full[scale_cols])
    with open("minmax.pkl", "wb") as f:
        pickle.dump(minmax, f)
    return df_full

def label_encode_data(df_full, scale_smoking):
    le = LabelEncoder()
    #note: LabelEncoder assigns codes in sorted label order, whatever order is passed to fit: Currently smokes=0, Ex-smoker=1, Never smoked=2
    le.fit(["Never smoked", "Ex-smoker", "Currently smokes"])
    if scale_smoking:
        df_full["SmokingStatus"] = le.transform(df_full["SmokingStatus"])
    with open("smoking_labelencoder.pkl", "wb") as f:
        pickle.dump(le, f)

    # le = LabelEncoder()
    # #fit first to ensure range 0-2 starting from never smoked to currently smoked, so value represent scale of current smoking
    # le.fit(df_full["Sex"])
    # df_full["Sex"] = le.transform(df_full["Sex"])
    # with open("gender_labelencoder.pkl", "wb") as f:
    #     pickle.dump(le, f)
    return df_full

#https://blog.cambridgespark.com/robust-one-hot-encoding-in-python-3e29bfcec77e
def onehot_data(df_full, scale_smoking):
    cols_to_encode = ["Sex"]
    if not scale_smoking:
        cols_to_encode.append("SmokingStatus")
    df_full = pd.get_dummies(df_full, columns=cols_to_encode)
    return df_full


Finally, run the actual scaling on the test data. The pre-processed test set from my dataset cannot be used, as the actual competition submission has a different test dataset.

So the following processes the test data into the same format as the training data. In the test data there is only one FVC per patient, so it cannot be extended (and shouldn't be, it's the test set after all).

In [None]:
df_train_scaled__, df_test_scaled = scale_data(df_train_orig, df_test_orig, True)
df_test = df_test_scaled

In [None]:
df_test_scaled

In [None]:
patient_ids = df_test["Patient"].unique()
test_rows = []
patient_count = 0
for patient_id in tqdm(patient_ids):
    df_patient = df_test[df_test["Patient"] == patient_id]
    patient_row_count = 0
    for idx, row in df_patient.iterrows():
        row_fvc = row["FVC"]
        patient_row_count += 1
        test_row = {}
        test_row["patient_id"] = row["Patient"]
        test_row["base_fvc"] = row_fvc
        test_row["base_week"] = row["Weeks"]
        test_row["pct"] = row["Percent"]
        test_row["age"] = row["Age"]
        test_row["gender_female"] = row["Sex_Female"]
        test_row["gender_male"] = row["Sex_Male"]
        test_row["smoking_status"] = row["SmokingStatus"]
        test_rows.append(test_row)
    print(f"created {patient_row_count} instances for patient {patient_id}")
    patient_count += 1
print(f"processed {patient_count} patients")
print(f"rows before: {df_test.shape}")
print(f"extended rows: {len(test_rows)}")
    #break

# Custom Keras Sequence Generator

Here I define a Keras Sequence generator to provide batches. You don't really need this for this tabular dataset, as the data does not need to be augmented in any way. It would actually be much simpler to use just basic Keras functionality and provide the training data as is.

However, I also wanted to try to combine this with my [3D CNN kernel] model, in which case I needed a generator to provide both the augmented 3D images and the tabular data rows. Doing the generator here allows simple combination of the two when needed. I did another kernel doing that, but it's nothing special and scores less than this one, so I am leaving it unpublished for now.

First, a small utility function to shuffle x and y at the end of each training epoch:

In [None]:
def unison_shuffled_copies(a, b):
    assert len(a) == len(b)
    p = np.random.permutation(len(a))
    new_a = a.iloc[p]
    new_b = b.iloc[p]
    return new_a, new_b
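
A quick way to convince yourself that a shared permutation keeps x and y paired is a toy check like the one below. It is a sketch with made-up data, and `shuffle_together` is a hypothetical stand-in that does the same thing as the function above with a seedable generator:

```python
import numpy as np
import pandas as pd

def shuffle_together(x, y, seed=None):
    """Apply one random permutation to both x and y so rows stay paired."""
    assert len(x) == len(y)
    order = np.random.default_rng(seed).permutation(len(x))
    return x.iloc[order], y.iloc[order]

toy_x = pd.DataFrame({"base_week": [0, 1, 2, 3]})
toy_y = pd.Series([10, 11, 12, 13])  # made-up targets: base_week + 10
shuffled_x, shuffled_y = shuffle_together(toy_x, toy_y, seed=0)

# The pairing survives the shuffle: every target is still base_week + 10.
print((shuffled_y.values == shuffled_x["base_week"].values + 10).all())  # True
```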

And the generator itself:

In [None]:

#https://stackoverflow.com/questions/49404993/keras-how-to-use-fit-generator-with-multiple-inputs
class MySequence3D(Sequence):

    def __init__(self, x_set, y_set, batch_size, mode="train", augment=True):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.mode = mode
        self.max_idx = math.ceil(len(x_set)/batch_size)
        self.augment = augment

    def __len__(self):
        #TODO: check is correct
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        idx2 = idx % self.max_idx
        start = idx2 * self.batch_size
        end = min(start + self.batch_size, len(self.x))
        batch_x = self.x.iloc[start : end]
        batch_y = self.y.iloc[start : end]
        
        next_batch_num = []
        for index, row in batch_x.iterrows():
            #print(row)
            nums = [row[col] for col in dense_cols]
            nums = np.array(nums)
            file_name = row["patient_id"]
            next_batch_num.append(nums)
        np_y = np.array(batch_y)
        np_x_num = np.array(next_batch_num)
        del next_batch_num
        del batch_y
        #print(f"loaded shape: {np_x.shape}, batch={idx}")
        result = np_x_num, np_y

        #print(f"shapes: {result[0].shape}, {result[1].shape}")
        return result

    def on_epoch_end(self):
        self.x, self.y = unison_shuffled_copies(self.x, self.y)
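
The batch arithmetic used by `__len__` and `__getitem__` can be checked in isolation. A minimal sketch, where `batch_bounds` is a hypothetical helper and the row count is made up:

```python
import math

def batch_bounds(n_rows, batch_size):
    """Return (start, end) row offsets for every batch, including a short last one."""
    n_batches = math.ceil(n_rows / batch_size)
    return [(i * batch_size, min((i + 1) * batch_size, n_rows)) for i in range(n_batches)]

bounds = batch_bounds(n_rows=50, batch_size=16)
print(bounds)  # [(0, 16), (16, 32), (32, 48), (48, 50)]
```

Rounding up gives four batches here, the last one holding only two rows. Rounding down with `int(50 / 16)` would give three and skip those two rows.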

# Functions to Create and Train the Model

In [None]:
def create_model():
    num_input = Input(shape=(len(dense_cols),))
    dense = Dense(200, activation='relu')(num_input)
    dense = BatchNormalization()(dense)
    dense = Dropout(0.45)(dense)
    dense = Dense(200, activation='relu')(dense)
    dense = BatchNormalization()(dense)
    dense = Dropout(0.35)(dense)
    #dense = Dense(50, activation='relu')(num_input)
    final_dense = Dense(1)(dense)
    model = keras.Model(
        inputs=[num_input],
        outputs=[final_dense],
    )
    adam = tensorflow.keras.optimizers.Adam(learning_rate=0.001)

    model.compile(loss='mean_squared_error',
                  optimizer=adam,  #keras.optimizers.SGD(lr=0.01),
                  metrics=[tf.keras.metrics.RootMeanSquaredError(name='rmse')])
    return model

In [None]:
def fit_model(model, callbacks_list, df_x_train, df_y_train, df_x_val, df_y_val):
    train_gen = MySequence3D(df_x_train, df_y_train, batch_size, augment=True)
    valid_gen = MySequence3D(df_x_val, df_y_val, batch_size, augment=False)

    #the total number of training rows we have:
    train_size = df_x_train.shape[0]
    #train_steps is how many batches per epoch Keras pulls from the generator. One step is batch_size rows
    #int() rounds down, so a final partial batch is not part of the epoch
    train_steps = int(train_size/batch_size)
    #same for the validation set
    valid_size = df_x_val.shape[0]
    valid_steps = int(valid_size/batch_size)

    #model.fit accepts Sequence objects directly; fit_generator is deprecated in TF 2.1+
    fit_history = model.fit(
        train_gen,
        steps_per_epoch=train_steps,
        epochs=epochs,
        validation_data=valid_gen,
        validation_steps=valid_steps,
        callbacks=callbacks_list,
        verbose=1
    )
    return fit_history


In [None]:

from sklearn.model_selection import train_test_split

# create callbacks list
def create_callbacks(idx):
    checkpoint = ModelCheckpoint(f'../working/weights_best_{idx}.h5', monitor='val_loss', verbose=1, 
                                 save_best_only=True, mode='min', save_weights_only = True)
    #note: created here but not included in callbacks_list below
    reduceLROnPlat = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3,
                                       verbose=1, mode='auto', min_delta=0.0001)
    early = EarlyStopping(monitor="val_loss", 
                          mode="min", 
                          patience=22)

    csv_logger = CSVLogger(filename='../working/training_log.csv',
                           separator=',',
                           append=True)

    callbacks_list = [checkpoint, csv_logger, early]
    return callbacks_list


# Functions to Make Predictions Using Trained Model

In [None]:
def make_predictions(model):
    predictions = []
    for target_week in tqdm(range(-12,134)):
        for row in test_rows:
            row["target_week"] = (target_week+12)/(133+12)
            tab_data = [row[col] for col in dense_cols]
            tab_data = np.array(tab_data)
            patient_id = row["patient_id"]
            tabs = np.array([tab_data])
            pred = model.predict([tabs])
            predictions.append(pred)
            #print(f"target week: {row['target_week']}, pred: {pred}")
    predictions = np.array(predictions)
    predictions = predictions.flatten()
    return predictions


In [None]:
def make_predictions_dict(model, test_rows):
    if isinstance(test_rows, pd.DataFrame):
        test_rows = [row.to_dict() for (idx, row) in test_rows.iterrows()]
    predictions = {}
    col_names = []
    for target_week in tqdm(range(-12,134)):
        for idx, row in enumerate(test_rows):
            row["target_week"] = (target_week+12)/(133+12)
            tab_data = [row[col] for col in dense_cols]
            tab_data = np.array(tab_data)
            patient_id = row["patient_id"]
            tabs = np.array([tab_data])
            pred = model.predict([tabs])
            col_name = f"{patient_id}_{target_week}"
            col_names.append(col_name)
            predictions[col_name] = pred.flatten()[0]
            #print(f"target week: {row['target_week']}, pred: {pred}")
        
    return predictions, col_names
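
The functions above call `model.predict` once per row, which is probably a large part of why the full submission run takes so long. A possible alternative is to build the whole patient/week grid first and predict once. The sketch below is an assumption-laden illustration, not what this kernel does: `build_week_grid` and `fake_predict` are hypothetical names, the example row is made up, and `fake_predict` merely stands in for a real `model.predict`.

```python
import numpy as np

def build_week_grid(base_rows, feature_cols, first_week=-12, last_week=133):
    """Stack one feature row per (patient, target week) into a single matrix."""
    keys, features = [], []
    for week in range(first_week, last_week + 1):
        # same 0-1 scaling as (target_week+12)/(133+12) in the functions above
        scaled_week = (week - first_week) / (last_week - first_week)
        for row in base_rows:
            values = dict(row, target_week=scaled_week)  # copy, so the input rows are not mutated
            features.append([values[col] for col in feature_cols])
            keys.append(f"{row['patient_id']}_{week}")
    return keys, np.array(features, dtype="float32")

def fake_predict(x):
    """Stand-in for model.predict (assumption: maps an (n, d) array to (n, 1))."""
    return x.sum(axis=1, keepdims=True)

# Made-up example row with only three of the feature columns.
cols = ["base_fvc", "base_week", "target_week"]
rows = [{"patient_id": "P1", "base_fvc": 0.4, "base_week": 0.1}]

keys, grid = build_week_grid(rows, cols)
preds = dict(zip(keys, fake_predict(grid).ravel()))
print(grid.shape, len(preds))  # (146, 3) 146
```

With a real model, the single call would be something like `model.predict(grid, batch_size=512)` in place of `fake_predict(grid)`.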

In [None]:
def make_predictions_my_test(model, test_rows):
    test_rows = [row.to_dict() for (idx, row) in test_rows.iterrows()]
    predictions = []
    for idx, row in tqdm(enumerate(test_rows), total=len(test_rows)):
#        row["target_week"] = (target_week+12)/(133+12)
        tab_data = [row[col] for col in dense_cols]
        tab_data = np.array(tab_data)
        patient_id = row["patient_id"]
        tabs = np.array([tab_data])
        pred = model.predict([tabs])
        predictions.append(pred.flatten()[0])
        #print(f"target week: {row['target_week']}, pred: {pred}")
        
    return predictions

# Splitting the Data into N-Folds

In [None]:
indices = np.arange(df_x.shape[0])
# create a set of indexes for a separate, final, test set
#https://github.com/scikit-learn/scikit-learn/issues/9193
train_indices, my_test_indices = next(GroupShuffleSplit(test_size=my_test_pct, random_state=8).split(indices, groups=df_x["patient_id"]))
#train_inds, test_inds = next(GroupShuffleSplit().split(X, groups=groups))
my_test_X = df_x.iloc[my_test_indices]
my_test_y = df_y.iloc[my_test_indices]

full_indices = indices
#use only the remaining ~95% of the data (split by patient) for the N-fold train/validation splits. In the end we test on the held-out my_test rows
indices = train_indices
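
The point of splitting by patient is that no patient ends up on both sides of a split. The sketch below shows the idea with plain NumPy swapped in for scikit-learn's `GroupShuffleSplit` and `GroupKFold`, on made-up patient labels, so it illustrates the principle and is not the code used in this kernel. Note how the fold rows are kept as positions in the full data.

```python
import numpy as np

# Made-up patient labels, one per extended training row.
groups = np.array(["A", "A", "B", "B", "B", "C", "D", "D", "E", "E"])
rng = np.random.default_rng(8)

# Hold out whole patients, not individual rows.
patients = np.unique(groups)
held_out = rng.choice(patients, size=1, replace=False)
test_pos = np.flatnonzero(np.isin(groups, held_out))
train_pos = np.flatnonzero(~np.isin(groups, held_out))

# Assign each remaining patient to one of n_folds folds, then split by membership.
n_folds = 2
remaining = np.unique(groups[train_pos])
fold_of = dict(zip(remaining, np.arange(len(remaining)) % n_folds))
row_fold = np.array([fold_of[g] for g in groups[train_pos]])
for fold in range(n_folds):
    val_rows = train_pos[row_fold == fold]   # positions in the full data
    fit_rows = train_pos[row_fold != fold]
    # no patient appears in both the fitting and the validation rows
    assert not set(groups[val_rows]) & set(groups[fit_rows])
```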



# Actually Training the Model

In [None]:

# First, create the index sets for the N_SPLITS folds
#train_indices, valid_indices = train_test_split(indexes, test_size=0.10, random_state=8, stratify=df_x["patient_id"])

#splits = list(StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=2020).split(indices, df_x.iloc[indices]["patient_id"]))
splits = list(GroupKFold(n_splits=N_SPLITS).split(indices, groups=df_x.iloc[indices]["patient_id"]))
preds_val = []
y_val = []
preds_test = []
preds_my_test = []
col_names = []
fit_histories = []
# If you dont know, enumerate(['a', 'b', 'c']) returns [(0, 'a'), (1, 'b'), (2, 'c')]
for idx, (train_idx, val_idx) in enumerate(splits):
    K.clear_session() # start Keras from clean state in each iteration
    print("Beginning fold {}".format(idx+1))
    # the fold indexes are positions within `indices`, so map them back to row positions in df_x first
    train_rows, val_rows = indices[train_idx], indices[val_idx]
    train_X, train_y, val_X, val_y = df_x.iloc[train_rows], df_y.iloc[train_rows], df_x.iloc[val_rows], df_y.iloc[val_rows]
    # instantiate the model for this fold
    model = create_model()
    callbacks = create_callbacks(idx)
    # Train, train, train
    history = fit_model(model, callbacks, train_X, train_y, val_X, val_y)
    fit_histories.append(history)
    #model.fit(train_X, train_y, batch_size=128, epochs=50, validation_data=[val_X, val_y], callbacks=callbacks)
    # loads the best weights saved by the checkpoint
    model.load_weights(f'weights_best_{idx}.h5')
    # Add the predictions of the validation to the list preds_val
    preds_val.append(model.predict(val_X[dense_cols], batch_size=512))
    # and the val true y
    y_val.append(val_y)
#    pred_test = make_predictions(model)
    pred_test, col_names = make_predictions_dict(model, test_rows)
    preds_test.append(pred_test)
    pred_test = make_predictions_my_test(model, my_test_X)
    preds_my_test.append(pred_test)
    


In [None]:
len(preds_test)

# A Look at the Predictions

In [None]:
#this would be the "fake" Kaggle test set that comes with the downloadable data
df_test_predictions = pd.DataFrame(preds_test)
df_test_predictions = df_test_predictions[col_names] #to get sorted order

In [None]:
df_test_predictions

And `preds_my_test` holds the predictions for my own test set split from the original training data, so the actual results are known and comparable. The split holds out 5% of the patients in the 12k extended rows, so no patient_id in it overlaps with the data used in training.

In [None]:
len(preds_my_test[0])

In [None]:
preds_my_test_mean = np.mean(preds_my_test, axis=0)

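What is the RMSE against the actual data, if we average all folds?

In [None]:
from sklearn.metrics import mean_squared_error

#take the square root explicitly; the squared=False argument is deprecated in newer scikit-learn versions
np.sqrt(mean_squared_error(my_test_y, preds_my_test_mean))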
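
For reference, the same calculation written out by hand on made-up numbers (a sketch, with three pretend folds and four pretend held-out rows):

```python
import numpy as np

# Made-up predictions from 3 folds for 4 held-out rows, plus the known targets.
fold_preds = np.array([
    [2900.0, 3100.0, 2500.0, 2000.0],
    [3000.0, 3000.0, 2600.0, 2100.0],
    [3100.0, 2900.0, 2700.0, 2200.0],
])
actual = np.array([3000.0, 3000.0, 2500.0, 2100.0])

mean_pred = fold_preds.mean(axis=0)                 # average over folds, per row
rmse = np.sqrt(np.mean((actual - mean_pred) ** 2))  # root of the mean squared error
print(mean_pred, rmse)  # [3000. 3000. 2600. 2100.] 50.0
```

Only the third row is off (by 100), so the mean squared error is 10000/4 = 2500 and the RMSE is 50.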


In [None]:
preds_my_test_mean.shape

In [None]:
df_mytest_predictions = pd.DataFrame()
for idx, mtp in enumerate(preds_my_test):
    df_mytest_predictions[f"{idx+1}"] = mtp
#df_mytest_predictions = df_test_predictions[col_names] #to get sorted order


How big is the absolute error per row between each fold's prediction and the actual value?

In [None]:
df_mytest_predictions_diff = pd.DataFrame()
for idx, mtp in enumerate(preds_my_test):
    df_mytest_predictions_diff[f"{idx+1}"] = np.abs(np.array(mtp) - my_test_y.values)
#df_mytest_predictions = df_test_predictions[col_names] #to get sorted order


Raw predictions for my test set:

In [None]:
df_mytest_predictions.T

In [None]:
df_mytest_predictions.T.describe()

Above shows the mean, std, etc. over the N (10 here) folds for each row in my test set.

By using describe() on describe(), we can get aggregated statistics on the overall dataset (so the mean over all the per-row means, etc.):

In [None]:
df_mytest_predictions.T.describe().T.describe()

Above was for the raw predictions. The same stats for the diff:

In [None]:
df_mytest_predictions_diff.T

In [None]:
df_mytest_predictions_diff.T.describe()

In [None]:
df_mytest_predictions_diff.T.describe().T.describe()

# Plotting and Visualizing the Training History

In [None]:
def plot_loss_and_accuracy(fit_history, n=0):
    plt.clf()
    plt.plot(fit_history.history['rmse'][n:])
    plt.plot(fit_history.history['val_rmse'][n:])
    plt.title('model rmse')
    plt.ylabel('rmse')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()
    plt.clf()
    #loss here is MSE, essentially the square of the rmse metric above, so it shows the same trend. so commented out.
    # summarize history for loss
    #plt.plot(fit_history.history['loss'][n:])
    #plt.plot(fit_history.history['val_loss'][n:])
    #plt.title('model loss')
    #plt.ylabel('loss')
    #plt.xlabel('epoch')
    #plt.legend(['train', 'test'], loc='upper left')
    #plt.show()

In [None]:
for idx, fit_history in enumerate(fit_histories):
    print(f"fold {idx}")
#    plot_loss_and_accuracy_last(fit_history)
    plot_loss_and_accuracy(fit_history)


In [None]:
for idx, fit_history in enumerate(fit_histories):
    print(f"fold {idx}")
#    plot_loss_and_accuracy_last(fit_history)
    plot_loss_and_accuracy(fit_history, 4)


In [None]:
for idx, fit_history in enumerate(fit_histories):
    print(f"fold {idx}")
#    plot_loss_and_accuracy_last(fit_history)
    plot_loss_and_accuracy(fit_history, 8)

We will never see the actual test data, but we can always check that the predictions over the fake test data at least look reasonable. This should make a successful submission more likely:

In [None]:
df_test_predictions.describe()

In [None]:
descriptions = df_test_predictions.describe()

In [None]:
descriptions.max(axis=1)

In [None]:
descriptions.T.describe()

In [None]:
#my_test_indices

In [None]:
df_test_predictions

In [None]:
df_test_predictions.shape

Above shape shows we have made predictions for all 5 fake test patients, for all 10 folds:

In [None]:
#this should show that each of the 5 fake patients has a number of predictions matching weeks -12...133:
df_test_predictions.shape[1]/5

# Calculating Submission Confidence

Besides the raw FVC values prediction, the competition requires a confidence value per prediction.

In this case, I take the N folds and pick the range between the min and max value predicted per test set row as the confidence. So if out of 10 folds, the highest prediction for row 1 is 3100 and the lowest prediction of the 10 folds for row 1 is 2900, the confidence becomes 3100-2900=200.

In [None]:
df_confidence = pd.DataFrame(df_test_predictions.max()-df_test_predictions.min())
df_confidence.columns = ["Confidence"]
df_confidence
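
The same range calculation on a toy frame, as a sketch with made-up predictions (`toy_preds` and `toy_confidence` are hypothetical names). Rows are folds and columns are Patient_Week ids, matching the layout of `df_test_predictions`:

```python
import pandas as pd

# Made-up predictions: one row per fold, one column per Patient_Week.
toy_preds = pd.DataFrame(
    [[2900, 2500], [3100, 2550], [3000, 2450]],
    columns=["P1_0", "P1_1"],
)
toy_confidence = (toy_preds.max() - toy_preds.min()).rename("Confidence")
print(toy_confidence.tolist())  # [200, 100]
```

`max()` and `min()` work column-wise by default, so each Patient_Week gets the spread of its fold predictions.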

I am using the average of all folds as the prediction here:

In [None]:
preds_test_mean = np.mean(df_test_predictions, axis=0)

In [None]:
preds_test_mean.shape

In [None]:
preds_test_mean.head()

# Building the Submission

In [None]:
df_my_submission = preds_test_mean.to_frame()
df_my_submission

In [None]:
#this sets the header correctly in the generated CSV file, and merges the confidence correctly
df_my_submission.index.names = ['Patient_Week']

In [None]:
df_my_submission.columns = ["FVC"]
df_my_submission["FVC"] = df_my_submission["FVC"].astype(int)
df_my_submission = df_my_submission.join(df_confidence, how="outer")
df_my_submission["Confidence"] = df_my_submission["Confidence"].astype(int)


In [None]:
df_my_submission.to_csv("submission.csv", index=True)

In [None]:
!ls

In [None]:
!head submission.csv

In [None]:
df_my_submission.describe()