My name is Robert S.

I have chosen the Iris (flower measurements) dataset from https://archive.ics.uci.edu/dataset/53/iris
--> Fisher, R. (1936). Iris [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C56C76.

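The project processes measurements of iris flowers: four numeric features per sample (sepal length, sepal width, petal length, petal width) plus a species label.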

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("iris/iris.data")
df.head()

Unnamed: 0,5.1,3.5,1.4,0.2,Iris-setosa
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa


In [4]:
print("View existing columns:", df.columns.tolist())
print(f"Dataset shape: {df.shape}")

View existing columns: ['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
Dataset shape: (149, 5)
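
The column list above consists of data values ('5.1', '3.5', ...), which suggests that the file has no header row and that read_csv used the first record as the header. A minimal sketch of one way to avoid this, assuming the column names below (they are not stored in the file and are chosen here only for illustration) and using an in-memory sample in place of iris/iris.data:

```python
import io
import pandas as pd

# Three records in the same layout as the rows shown above (no header row).
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# Assumed column names; they are not part of the file itself.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# header=None keeps the first record as data instead of treating it as a header.
df_named = pd.read_csv(sample, header=None, names=columns)
print(df_named.shape)
print(df_named.columns.tolist())
```

With the real file, the same header=None and names=... arguments would be passed to pd.read_csv("iris/iris.data"), so the first record stays in the data.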


In [3]:
def standardize_column_names(df):
    """
    Standardize column names to snake_case.

    Strips leading and trailing whitespace, converts to lower case, replaces
    spaces with underscores, and removes parentheses. Consistent names help
    reproducibility and reduce processing errors later on.

    Args:
        df (pd.DataFrame): Original data frame. Its columns are renamed in place.

    Returns:
        pd.DataFrame: The same data frame with standardized column names.
    """
    df.columns = [col.strip().lower().replace(' ', '_').replace('(', '').replace(')', '') 
                  for col in df.columns]
    return df

def handle_data_integrity(df, drop_duplicates=True):
    """
    Check and fix data integrity by removing rows with missing values and duplicate rows.

    The dataset can contain duplicate rows. This function identifies and removes them
    to prevent bias in the analysis.

    Args:
        df (pd.DataFrame): The original data frame.
        drop_duplicates (bool): If True, remove duplicate rows. Defaults to True.

    Returns:
        pd.DataFrame: Cleaned data frame with a reset index.
    """
    # Drop rows with missing data (Iris is already clean, but the check is cheap)
    if df.isnull().values.any():
        df = df.dropna()
        
    # handle duplicates
    if drop_duplicates:
        df = df.drop_duplicates()
        
    return df.reset_index(drop=True)
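
To see what the renaming step does to individual labels, here is a small sketch that applies the same chain of string operations to a few hypothetical column labels (the labels are invented for illustration and do not come from the dataset file):

```python
import pandas as pd

# Hypothetical column labels, chosen only to exercise each cleaning step.
toy = pd.DataFrame(
    columns=[" Sepal Length (cm) ", "Petal Width (cm)", "Species", "Iris-setosa"]
)

# Same chain of string operations as in standardize_column_names.
cleaned = [
    col.strip().lower().replace(" ", "_").replace("(", "").replace(")", "")
    for col in toy.columns
]
print(cleaned)
```

Hyphens are left untouched by this chain, so a label such as 'Iris-setosa' becomes 'iris-setosa' rather than 'iris_setosa'.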

In [5]:
df_clean = (df
            .pipe(standardize_column_names)
            .pipe(handle_data_integrity))

# View the result
print("View existing columns:", df_clean.columns.tolist())
print(f"Dataset shape: {df_clean.shape}")
df_clean.head()

View existing columns: ['5.1', '3.5', '1.4', '0.2', 'iris-setosa']
Dataset shape: (146, 5)


Unnamed: 0,5.1,3.5,1.4,0.2,iris-setosa
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa
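
The shape goes from (149, 5) before cleaning to (146, 5) after it, so the cleaning step removed three rows. A minimal sketch of how removed duplicates could be counted and inspected before dropping them, using a toy frame with illustrative values and placeholder column names (both are assumptions, not taken from the file):

```python
import pandas as pd

# Toy rows in the Iris layout; values and column names are illustrative only.
toy = pd.DataFrame(
    [
        [4.9, 3.1, 1.5, 0.1, "Iris-setosa"],
        [5.0, 3.6, 1.4, 0.2, "Iris-setosa"],
        [4.9, 3.1, 1.5, 0.1, "Iris-setosa"],  # exact repeat of the first row
    ],
    columns=["a", "b", "c", "d", "label"],
)

# duplicated() marks every repeat after the first occurrence.
n_repeats = int(toy.duplicated().sum())

# keep=False marks all members of each duplicate group, for inspection.
groups = toy[toy.duplicated(keep=False)]

# Same operations that handle_data_integrity applies for duplicates.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_repeats, len(groups), deduped.shape)
```

Running the same duplicated() calls on df before the pipeline would show whether the three removed rows were duplicates or rows with missing values.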
