<h1 style='background:#2cab6c; border:0; color:white'><center>Importing Libraries</center></h1>

In [None]:
import json

import pandas as pd
import matplotlib.pyplot as plt

from pathlib import Path
from sklearn.model_selection import StratifiedKFold

<h1 style='background:#2cab6c; border:0; color:white'><center>Paths, files</center></h1>

In [None]:
# Paths to the base directories/files of the dataset
base_dir = Path('/kaggle/input/cassava-leaf-disease-classification')
train_df = pd.read_csv(base_dir / 'train.csv')

In [None]:
# Read the JSON file that maps label numbers to disease names
# (the `with` block closes the file automatically)
with open(base_dir / 'label_num_to_disease_map.json') as f:
    class_names = json.load(f)

<h1 style='background:#2cab6c; border:0; color:white'><center>Data Preprocessing</center></h1>

In [None]:
# Let's show the names of the classes 
class_names

In [None]:
# Let's check our DataFrame with training data
train_df.head()

In [None]:
# Add a new column with the class name for each label
# (JSON object keys are strings, so the integer label is converted with str)
train_df['label_name'] = train_df['label'].apply(lambda x: class_names[str(x)])
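
The `str(x)` conversion above is needed because JSON object keys are always loaded as strings. As a minimal sketch of an alternative, the keys can be converted to integers once and the column built with `map`. The class names and the `toy_df` frame below are hypothetical stand-ins, not the real contents of the dataset files.

```python
import json

import pandas as pd

# Hypothetical JSON content; the real file has its own class names
raw = '{"0": "Disease A", "1": "Disease B", "2": "Healthy"}'
class_names = json.loads(raw)

# JSON keys arrive as strings, so build an int-keyed lookup once
int_to_name = {int(key): name for key, name in class_names.items()}

# Hypothetical stand-in for train_df with integer labels
toy_df = pd.DataFrame({'label': [2, 0, 1, 2]})
toy_df['label_name'] = toy_df['label'].map(int_to_name)

print(toy_df)
```

With `map`, a label missing from the dictionary would yield `NaN` instead of raising a `KeyError`, so which variant fits better depends on whether silent gaps are acceptable.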

In [None]:
# Check the result
train_df.head()

In [None]:
# Let's look at the distribution of classes
train_df.groupby('label')['image_id'].count().plot(kind='bar', title='Class distribution');

<h1 style='background:#2cab6c; border:0; color:white'><center>Stratified K-Folds</center></h1>

In [None]:
# Let's use StratifiedKFold to split the dataset into 4 parts
sk = StratifiedKFold(n_splits=4, random_state=42, shuffle=True)

for fold, (train, val) in enumerate(sk.split(train_df, train_df.label)):
    train_df.loc[val, 'fold'] = fold

In [None]:
# The new 'fold' column was created as float (it starts out NaN-filled), so convert it to int
train_df.fold = train_df.fold.astype(int)

In [None]:
# Check the result
train_df

We have split the dataset into 4 stratified folds using <a href='https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html'>sklearn StratifiedKFold</a>. Let's save the updated dataset for later model training.
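
One way to confirm that the folds are stratified is to count the labels per fold. The sketch below does this on a small, hypothetical `toy_df` with imbalanced labels instead of the real training data. It also initializes the `fold` column with an integer placeholder, which is an assumption of this sketch and avoids the later float-to-int conversion.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for train_df: 40 rows with imbalanced labels
toy_df = pd.DataFrame({
    'image_id': [f'img_{i}.jpg' for i in range(40)],
    'label': [0] * 8 + [1] * 12 + [2] * 20,
})

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

# Integer placeholder keeps the column dtype as int
toy_df['fold'] = -1
for fold, (_, val_idx) in enumerate(skf.split(toy_df, toy_df['label'])):
    toy_df.loc[val_idx, 'fold'] = fold

# Rows are folds, columns are labels; each row should show the same class mix
fold_counts = pd.crosstab(toy_df['fold'], toy_df['label'])
print(fold_counts)
```

Because every class count here is divisible by 4, each fold receives exactly 2, 3 and 5 samples of labels 0, 1 and 2. On the real data the per-fold counts may differ by one where a class size is not a multiple of the number of folds.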

<h1 style='background:#2cab6c; border:0; color:white'><center>Save Dataset</center></h1>

In [None]:
# Save updated dataset
train_df.to_csv('/kaggle/working/train_splitted.csv', index=False)
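
A minimal sketch of how the saved file might be consumed later: read the CSV back and select one fold for validation and the rest for training. The `toy_df` frame, the temporary directory, and the `val_fold` variable are assumptions for illustration; in the notebook the file lives under `/kaggle/working`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the saved dataset
toy_df = pd.DataFrame({
    'image_id': [f'img_{i}.jpg' for i in range(8)],
    'label': [0, 0, 0, 0, 1, 1, 1, 1],
    'fold': [0, 1, 2, 3, 0, 1, 2, 3],
})

# Write and re-read the CSV in a temporary directory instead of /kaggle/working
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'train_splitted.csv'
    toy_df.to_csv(csv_path, index=False)
    folds_df = pd.read_csv(csv_path)

# Hold out one fold for validation and train on the remaining folds
val_fold = 0
train_part = folds_df[folds_df['fold'] != val_fold].reset_index(drop=True)
val_part = folds_df[folds_df['fold'] == val_fold].reset_index(drop=True)

print(len(train_part), len(val_part))
```

Looping `val_fold` over all fold numbers would give one train/validation pair per fold for cross-validation.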