# Tabular DQA  
In this notebook, we evaluate the quality of the tabular data in the dataset from the [Kaggle page](https://www.kaggle.com/c/cassava-leaf-disease-classification/data).

## 1 - Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport
from PIL import Image

In [None]:
REMOVE_DUPLICATES = True
REMOVE_NULL = True
BASE_PATH = "../data"

### 1.1 - Helper Functions

In [None]:
def plot_hist(dataset, columns_to_plot):
    fig, axs = plt.subplots(3, 4, figsize=(20, 15))
    fig.subplots_adjust(hspace=0.5, wspace=0.3)

    for i, column in enumerate(columns_to_plot):
        row = i // 4
        col = i % 4

        axs[row, col].hist(dataset[column], bins=50, alpha=0.7, color="blue", edgecolor="black")
        axs[row, col].set_title(f"Histogram of {column}")
        axs[row, col].set_xlabel(column)
        axs[row, col].set_ylabel("Frequency")
        axs[row, col].set_yscale("log")
        axs[row, col].grid(True)

    for i in range(len(columns_to_plot), 12):
        fig.delaxes(axs.flatten()[i])

    plt.tight_layout()
    plt.show()

In [None]:
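def display_images(df):
    # Nothing to plot: plt.subplots cannot create a grid with zero rows
    if df.empty:
        print("No images to display")
        return

    num_cols = 5
    num_rows = (len(df) - 1) // num_cols + 1

    # squeeze=False keeps axs two-dimensional even when there is only one row
    fig, axs = plt.subplots(num_rows, num_cols, figsize=(15, 20), squeeze=False)
    for i, (_, row) in enumerate(df.iterrows()):
        image_id = int(row["id"])  # avoid shadowing the built-in id()
        image_path = BASE_PATH + f"/train_images/{image_id}.jpeg"
        img = Image.open(image_path)
        ax = axs[i // num_cols, i % num_cols]
        ax.imshow(img)
        ax.axis("off")
        ax.set_title(f"ID: {image_id}")

    # hide the unused axes in the last row
    for j in range(len(df), num_rows * num_cols):
        axs.flatten()[j].set_visible(False)

    plt.tight_layout()
    plt.show()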

## 2 - Data Loading

In [None]:
train = pd.read_csv(BASE_PATH + "/train.csv")
test = pd.read_csv(BASE_PATH + "/test.csv")

In [None]:
TRAIN_TARGET = train.iloc[:, np.r_[0, 164:176]]
COLUMNS_TO_PLOT = [col for col in TRAIN_TARGET if col != "id"]

## 3 - Data Quality Analysis
### 3.1 - Duplicates and Null Values

In [None]:
if REMOVE_DUPLICATES:
    print(f"{train.duplicated().sum()} duplicates in train")
    print(f"{test.duplicated().sum()} duplicates in test")
    if train.duplicated().sum() or test.duplicated().sum():
        train = train.drop_duplicates()
        test = test.drop_duplicates()
        print("Duplicates removed")
if REMOVE_NULL:
    print(f"{train.isnull().sum().sum()} nulls in train")
    print(f"{test.isnull().sum().sum()} nulls in test")
    if train.isnull().sum().sum() or test.isnull().sum().sum():
        train = train.dropna()
        test = test.dropna()
        print("Nulls removed")

In [None]:
plot_hist(TRAIN_TARGET, COLUMNS_TO_PLOT)

### 3.2 - Profiling

In [None]:
train_profile = ProfileReport(train, title="Train.csv Profiling Report", minimal=True)
test_profile = ProfileReport(test, title="Test.csv Profiling Report", minimal=True)

### 3.3 - Plant Physiological Parameters (Target Values)

This notebook outlines several key physiological parameters relevant to plant science studies. These parameters are crucial for understanding plant function, growth, and adaptation to their environment.

#### 3.3.1 - X4: Stem Specific Density (SSD)

**Definition**: Stem Specific Density (SSD) is a measure of wood density, calculated as stem dry mass per stem fresh volume. It provides insight into the structural strength and resource allocation of the plant.

SSD quantifies woodiness and stem water content. It is determined by dividing the oven-dry mass of a stem segment by its fresh volume and is expressed in $g \cdot cm^{-3}$ ([source](https://uol.de/f/5/inst/biologie/ag/landeco/download/LEDA/Standards/Leda-S3-3_stem_traits.pdf)).
Since both mass and volume are positive quantities, no value should be below zero.
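
The ratio described above can be sketched in a few lines. This is only an illustration: the column names `dry_mass_g` and `fresh_volume_cm3` and all numbers are made up and do not come from `train.csv`.

```python
import pandas as pd

# Hypothetical stem measurements, for illustration only (not from train.csv).
stems = pd.DataFrame({
    "dry_mass_g": [1.2, 0.8, 2.5],
    "fresh_volume_cm3": [2.0, 2.0, 5.0],
})

# SSD = oven-dry mass / fresh volume, in g per cm^3.
stems["ssd_g_per_cm3"] = stems["dry_mass_g"] / stems["fresh_volume_cm3"]

# Both inputs are positive physical quantities, so SSD cannot be negative.
negative_ssd = stems[stems["ssd_g_per_cm3"] < 0]
print(stems)
print(f"{len(negative_ssd)} rows with negative SSD")
```

A negative `X4_mean` therefore cannot be a real measurement, which is why such rows are treated as errors.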




In [None]:
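# cleaned_train accumulates every filter below; cleaned_train_dummy stays
# unfiltered so each check reports its outliers against the full training set.
cleaned_train = train
cleaned_train_dummy = train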

ssd_below_zero = cleaned_train_dummy[cleaned_train_dummy["X4_mean"] < 0]
print(f"{len(ssd_below_zero)} rows with X4_mean below 0")

cleaned_train = cleaned_train.drop(cleaned_train[cleaned_train["X4_mean"] < 0].index)
display_images(ssd_below_zero)

#### 3.3.2 - X11: Leaf Area per Leaf Dry Mass (SLA)

**Definition**: Specific Leaf Area (SLA) is calculated as leaf area per leaf dry mass, which makes it the inverse of Leaf Mass per Area (1/LMA). This parameter indicates the efficiency of leaf construction and has implications for photosynthetic capacity and resource use. See the [Wikipedia article](https://en.wikipedia.org/wiki/Specific_leaf_area).
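
As a small worked example of the SLA and LMA relationship, assuming a single made-up leaf measured in mm² and mg (the units of `X11_mean` in the dataset are not stated here, so treat the units as an assumption):

```python
# Hypothetical single-leaf measurement, for illustration only.
leaf_area_mm2 = 1500.0    # one-sided leaf area
leaf_dry_mass_mg = 100.0  # oven-dry mass

sla = leaf_area_mm2 / leaf_dry_mass_mg  # area per mass, mm^2 per mg
lma = leaf_dry_mass_mg / leaf_area_mm2  # mass per area, mg per mm^2

# SLA and LMA are reciprocals, so their product is 1.
print(f"SLA = {sla:.1f} mm^2/mg, LMA = {lma:.4f} mg/mm^2, product = {sla * lma:.1f}")
```

Because SLA is a ratio of two positive quantities, values at or near zero and extremely large values both point to measurement or unit problems rather than real leaves.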



In [None]:
x11_low = cleaned_train_dummy[cleaned_train_dummy["X11_mean"] < 2]
x11_high = cleaned_train_dummy[cleaned_train_dummy["X11_mean"] > 100]
print(f"{len(x11_low)} rows with X11_mean below 2, {len(x11_high)} rows with X11_mean above 100")

# delete all X11_mean below 2 and above 100
cleaned_train = cleaned_train.drop(cleaned_train[cleaned_train["X11_mean"] < 2].index)
cleaned_train = cleaned_train.drop(cleaned_train[cleaned_train["X11_mean"] > 100].index)
display_images(x11_low)
display_images(x11_high)


#### 3.3.3 - X18: Plant Height

**Definition**: Plant height is a straightforward yet vital parameter, representing the vertical growth of a plant. It is essential for assessing competitiveness for light and space in plant communities. 
Plant height refers to the height (PATO:height) of the whole plant (PO:whole plant) as defined on [TraitGloss](https://www.try-db.org/de/TraitGloss.php). 

**Comment**: It's important to specify the component of the plant measured (vegetative or generative) when known. The term "canopy height" is considered polysemic and its use is discouraged in this context.

**Abbreviation**: Not specified.
**Synonyms**: Shoot height.
**Related Terms**: Canopy height.
**Formal Unit**: Length unit.



In [None]:
height_over_80 = cleaned_train_dummy[cleaned_train_dummy["X18_mean"] > 80]
print(f"{len(height_over_80)} rows with a height over 80")

# delete all X18_mean above 80
cleaned_train = cleaned_train.drop(cleaned_train[cleaned_train["X18_mean"] > 80].index)
# print explicitly: only the last expression of a cell is displayed automatically
print(cleaned_train["X18_mean"].sort_values(ascending=False).head(10))
display_images(height_over_80)

#### 3.3.4 - X26: Seed Dry Mass

**Definition**: Seed dry mass is the weight of a seed when completely dried. It is critical for understanding reproductive strategies, seed dispersal, and germination success. Values are provided in mg. A seed mass above 100 g does not seem realistic; the filter below is stricter and removes everything above 10000 mg (10 g).
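
Because `X26_mean` is stored in mg, a limit stated in grams has to be converted before it is compared with the column. A minimal sketch with made-up values, showing what 10 g and 100 g correspond to in mg and how many toy values exceed each limit:

```python
import pandas as pd

MG_PER_G = 1000

# Hypothetical seed masses in mg, for illustration only (not from train.csv).
seed_mass_mg = pd.Series([0.5, 250.0, 12_000.0, 150_000.0], name="X26_mean")
print("masses in g:", (seed_mass_mg / MG_PER_G).tolist())

counts_over_limit = {}
for limit_g in (10, 100):
    limit_mg = limit_g * MG_PER_G
    counts_over_limit[limit_g] = int((seed_mass_mg > limit_mg).sum())
    print(f"{limit_g} g = {limit_mg} mg -> {counts_over_limit[limit_g]} values above the limit")
```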



In [None]:
mass_over_10g = cleaned_train_dummy[cleaned_train_dummy["X26_mean"] > 10000]
print(f"{len(mass_over_10g)} rows with a seed mass over 10 g (10000 mg)")

# delete all X26_mean above 10000 mg (10 g)
cleaned_train = cleaned_train.drop(cleaned_train[cleaned_train["X26_mean"] > 10000].index)
display_images(mass_over_10g)

#### 3.3.5 - X50: Leaf Nitrogen Content per Leaf Area

**Definition**: This parameter measures the nitrogen content of leaves, expressed per unit leaf area. Nitrogen is a crucial nutrient for plant growth, and its allocation can indicate the plant's nutritional status and photosynthetic efficiency.
Formally, it is the ratio of the mass (PATO:mass) of nitrogen (CHEBI:nitrogen atom) in the leaf (PO:leaf) or a component thereof, i.e. leaf lamina or leaflet (PO:leaf lamina, PO:leaflet), per respective area (TOP:leaf area, TOP:leaf lamina area, TOP:leaflet area).

**Comment**: Equivalent and convertible to the quantity per area with formal unit: amount unit / area unit. The term "concentration" is polysemic, and we suggest not using it in this context.


In [None]:
nitro_over_8 = cleaned_train_dummy[cleaned_train_dummy["X50_mean"] > 8]
print(f"{len(nitro_over_8)} rows with nitrogen content over 8")

# remove all X50_mean above 8
cleaned_train = cleaned_train.drop(cleaned_train[cleaned_train["X50_mean"] > 8].index)
display_images(nitro_over_8)

#### 3.3.6 - X3112: Leaf Area

**Definition**: Leaf area is the total surface area of leaves a plant has. For plants with compound leaves, this includes the area of all leaflets. The measurement may or may not include the petiole, depending on the study. It is a fundamental characteristic for understanding photosynthetic potential and water use.

In [None]:
leaf_area_over_75000 = cleaned_train_dummy[cleaned_train_dummy["X3112_mean"] > 75000]
print(f"{len(leaf_area_over_75000)} rows with a leaf area over 75000")

# remove all X3112_mean above 75000
cleaned_train = cleaned_train.drop(cleaned_train[cleaned_train["X3112_mean"] > 75000].index)
display_images(leaf_area_over_75000)

Compare the number of rows in the original training set and the cleaned training set.

In [None]:
print("train:", train.shape)
print("cleaned_train:", cleaned_train.shape)
print("Number of rows removed:", len(train) - len(cleaned_train))
print("test:", test.shape)
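
The totals above do not show which rule removed how many rows. One possible way to get a per-column breakdown is sketched below. The helper `count_out_of_range` and the `PLAUSIBLE_RANGES` table are hypothetical names introduced for this sketch; the bounds mirror the thresholds used in the filter cells of section 3.3, and the toy frame stands in for `train`.

```python
import pandas as pd

# Bounds mirroring the filters in section 3.3: (lower, upper),
# where None means "no bound on this side".
PLAUSIBLE_RANGES = {
    "X4_mean": (0, None),
    "X11_mean": (2, 100),
    "X18_mean": (None, 80),
    "X26_mean": (None, 10000),
    "X50_mean": (None, 8),
    "X3112_mean": (None, 75000),
}

def count_out_of_range(df, ranges):
    """Count the rows violating each column's range (hypothetical helper)."""
    counts = {}
    for column, (lower, upper) in ranges.items():
        mask = pd.Series(False, index=df.index)
        if lower is not None:
            mask |= df[column] < lower
        if upper is not None:
            mask |= df[column] > upper
        counts[column] = int(mask.sum())
    return pd.Series(counts, name="rows_out_of_range")

# Toy frame standing in for `train`; all values are made up.
toy = pd.DataFrame({
    "X4_mean": [0.5, -1.0, 0.7],
    "X11_mean": [1.0, 50.0, 150.0],
    "X18_mean": [10.0, 90.0, 5.0],
    "X26_mean": [5.0, 20000.0, 3.0],
    "X50_mean": [1.5, 2.0, 9.0],
    "X3112_mean": [100.0, 200.0, 80000.0],
})
report = count_out_of_range(toy, PLAUSIBLE_RANGES)
print(report)
```

A row can violate several rules at once, so the per-column counts may add up to more than the number of rows actually removed.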

### 3.4 - Remove Rows from Image DQA
Load the ids of the images that were removed in the image DQA and remove them from the tabular dataset.
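
The removal is an anti-join on `id`: keep every tabular row whose id does not appear in the removal list. A minimal sketch on toy data (the frames and values are made up; only the `id` column name is taken from this notebook):

```python
import pandas as pd

# Toy stand-ins for the tabular data and the list of ids flagged by the image DQA.
tabular = pd.DataFrame({"id": [1, 2, 3, 4], "X4_mean": [0.5, 0.6, 0.7, 0.8]})
ids_to_remove = pd.DataFrame({"id": [2, 4, 99]})

# isin marks rows whose id is in the removal list; ~ inverts the mask.
keep_mask = ~tabular["id"].isin(ids_to_remove["id"])
filtered = tabular[keep_mask]

print(filtered)
print(f"{len(tabular) - len(filtered)} rows removed")
```

Ids in the removal list that do not exist in the tabular data (99 in this example) are simply ignored.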

In [None]:
image_train_removed_id = pd.read_csv(BASE_PATH + "/train_ids_to_remove.csv")
image_test_removed_id = pd.read_csv(BASE_PATH + "/val_ids_to_remove.csv")

In [None]:
cleaned_train = cleaned_train[~cleaned_train["id"].isin(image_train_removed_id["id"])]
test = test[~test["id"].isin(image_test_removed_id["id"])]

In [None]:
CLEANED_TRAIN_TARGET = cleaned_train.iloc[:, np.r_[0, 164:176]]

In [None]:
plot_hist(CLEANED_TRAIN_TARGET, COLUMNS_TO_PLOT)

### 3.5 - Save Cleaned Dataset

In [None]:
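import os
os.makedirs(BASE_PATH + "/cleaned", exist_ok=True)  # to_csv does not create missing directories
cleaned_train.to_csv(BASE_PATH + "/cleaned/cleaned_train.csv", index=False)
test.to_csv(BASE_PATH + "/cleaned/cleaned_test.csv", index=False)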