## 📘 Ficaria Package: Fuzzy Imputation & Feature Selection
### _Hands-On Examples, Usage Guide, and Method Demonstration_

The Ficaria package provides fuzzy-logic-based methods for numeric missing-value imputation and feature selection.
This notebook demonstrates how to use all implemented functionalities.
It is divided into three main sections:

1. Installation, setup, and example data preparation

2. Missing-value imputation methods

3. Feature-selection methods

All components of the package are fully compatible with the scikit-learn API and implement the standard `fit` and `transform` methods.
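
As a rough illustration of what scikit-learn compatibility makes possible, the sketch below chains a scaler and an imputer in a single pipeline. `ColumnMeanImputer` is a toy stand-in written for this example and is not part of Ficaria; the assumption is that any Ficaria imputer could take its place because it exposes the same `fit` and `transform` methods.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


class ColumnMeanImputer(TransformerMixin, BaseEstimator):
    """Toy stand-in (not part of Ficaria): fills NaNs with column means."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Column means computed over the observed values only
        self.means_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.means_[cols]
        return X


X_demo = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])

# MinMaxScaler ignores NaNs when fitting and passes them through,
# so the imputer receives scaled data that still contains the gaps
pipe = make_pipeline(MinMaxScaler(), ColumnMeanImputer())
X_out = pipe.fit_transform(X_demo)
print(X_out)
```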

### Setup

In [None]:
from ficaria.missing_imputation import FCMCentroidImputer, FCMParameterImputer, FCMRoughParameterImputer
from ficaria.missing_imputation import FCMKIterativeImputer, FCMInterpolationIterativeImputer, FCMDTIterativeImputer
from ficaria.feature_selection import FuzzyGranularitySelector, WeightedFuzzyRoughSelector

from sklearn.datasets import load_wine
from sklearn.metrics import root_mean_squared_error
from sklearn.preprocessing import MinMaxScaler

import numpy as np
import pandas as pd

np.random.seed(42) 

In [2]:
data = load_wine(as_frame=True)
X = data.data
y = data.target
feature_names = X.columns.tolist()

print("Original dataset shape:", X.shape)
X.head()

Original dataset shape: (178, 13)


,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0


---

### Missing Imputation

Missing-value imputation in Ficaria restores incomplete numeric datasets using fuzzy clustering. Each sample belongs to every cluster with a degree of membership, so the uncertainty about where an incomplete sample belongs is represented explicitly instead of being collapsed into a single hard assignment.

Most imputation methods in Ficaria are based on the fuzzy c-means algorithm, which requires specifying the number of clusters. Users may set the number of clusters manually through `n_clusters` or apply the built-in helper function that identifies the optimal value of `n_clusters`.
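
For readers unfamiliar with fuzzy c-means, the following is a minimal NumPy sketch of the textbook algorithm on complete data. It is not Ficaria's implementation and does not handle missing values; the function name `fuzzy_c_means` and its `seed` argument are made up for this example, while `n_clusters`, `m` (the fuzzifier), `max_iter`, and `tol` mirror the parameter names used later in this notebook.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters=3, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Textbook fuzzy c-means on complete data; returns (centroids, memberships)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initial memberships, each row normalised to sum to 1
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Centroids are means weighted by memberships raised to the power m
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every sample to every centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # guard against division by zero
        # Membership update: u_ij is proportional to d_ij^(-2 / (m - 1))
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centroids, U


# Two well-separated blobs around (0, 0) and (1, 1)
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(1.0, 0.1, (20, 2))])
centroids, U = fuzzy_c_means(X_demo, n_clusters=2)
print(centroids)
```

Larger values of `m` give softer memberships; values close to 1 approach a hard assignment similar to k-means.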

In [3]:
def introduce_missingness(X, missing_fraction=0.1):
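    """
    Randomly replaces roughly `missing_fraction` of the entries in X with NaN.
    Cell positions are drawn with replacement, so the same cell can be drawn
    twice and the actual share of NaNs may be slightly below `missing_fraction`.
    Returns:
        X_missing  - dataframe with NaNs
        missing_mask - boolean array where True means a value was removed
    """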
    X_missing = X.copy()
    n_rows, n_cols = X.shape

    n_missing = int(missing_fraction * n_rows * n_cols)

    missing_row_idx = np.random.randint(0, n_rows, n_missing)
    missing_col_idx = np.random.randint(0, n_cols, n_missing)

    missing_mask = np.zeros((n_rows, n_cols), dtype=bool)
    missing_mask[missing_row_idx, missing_col_idx] = True

    # Use DataFrame.mask instead of writing into .values, which may be a copy or read-only
    X_missing = X_missing.mask(missing_mask)

    return X_missing, missing_mask
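
Because the row and column indices above are drawn independently with `np.random.randint`, the same cell can be selected more than once. In the run shown in this notebook, 10% of the 178 × 13 cells would be 231, while the output further down reports 213 missing values. If an exact fraction matters, one possible variant samples distinct cells without replacement. The function name `introduce_missingness_exact` and its `seed` argument are hypothetical, and this variant will not reproduce the missingness pattern used in the rest of the notebook.

```python
import numpy as np
import pandas as pd


def introduce_missingness_exact(X, missing_fraction=0.1, seed=42):
    """Hypothetical variant: masks exactly int(fraction * n_cells) distinct cells."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    n_missing = int(missing_fraction * n_rows * n_cols)

    # Draw distinct flat cell indices, so no cell is picked twice
    flat_idx = rng.choice(n_rows * n_cols, size=n_missing, replace=False)

    missing_mask = np.zeros(n_rows * n_cols, dtype=bool)
    missing_mask[flat_idx] = True
    missing_mask = missing_mask.reshape(n_rows, n_cols)

    # DataFrame.mask returns a new frame with NaN where the mask is True
    return X.mask(missing_mask), missing_mask


X_demo = pd.DataFrame(np.arange(40, dtype=float).reshape(10, 4), columns=list("abcd"))
X_demo_missing, demo_mask = introduce_missingness_exact(X_demo, missing_fraction=0.25)
print(X_demo_missing.isna().sum().sum())
```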

In [4]:
def rmse(X_original, X_imputed, missing_mask):
    """
    Computes a single global RMSE across all features and all rows,
    using only the positions where missing values were introduced.
    """
    X_orig_vals = X_original.values
    X_imp_vals = X_imputed.values

    true_vals = X_orig_vals[missing_mask]
    imputed_vals = X_imp_vals[missing_mask]

    # Return directly rather than rebinding the name `rmse` inside the function
    return root_mean_squared_error(true_vals, imputed_vals)
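
A single global figure can hide columns that are imputed poorly. A per-feature breakdown could look like the sketch below; `rmse_per_feature` is a hypothetical helper written for this example, not something the notebook or the package defines.

```python
import numpy as np
import pandas as pd


def rmse_per_feature(X_original, X_imputed, missing_mask):
    """Hypothetical helper: RMSE per column, over the removed cells only."""
    sq_err = (X_original.to_numpy() - X_imputed.to_numpy()) ** 2
    # Keep squared errors at removed cells, blank out everything else
    masked = np.where(missing_mask, sq_err, np.nan)
    # nanmean skips the blanked cells; a column with no removed cells yields NaN
    return pd.Series(np.sqrt(np.nanmean(masked, axis=0)), index=X_original.columns)


X_true_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
X_imp_demo = pd.DataFrame({"a": [1.0, 2.0, 5.0], "b": [13.0, 20.0, 26.0]})
mask_demo = np.array([[False, True], [False, False], [True, True]])

per_feature = rmse_per_feature(X_true_demo, X_imp_demo, mask_demo)
print(per_feature)
```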

In [5]:
X_missing, missing_mask = introduce_missingness(X, missing_fraction=0.10)

print("Missing values per column:")
X_missing.isna().sum()

Missing values per column:


alcohol                         24
malic_acid                      14
ash                             17
alcalinity_of_ash               15
magnesium                       15
total_phenols                   20
flavanoids                      14
nonflavanoid_phenols            17
proanthocyanins                 15
color_intensity                 19
hue                              9
od280/od315_of_diluted_wines    19
proline                         15
dtype: int64

In [6]:
X_missing.head()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,,2.29,5.64,1.04,3.92,1065.0
1,,1.78,,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,,,1.03,3.17,1185.0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,,1.82,4.32,1.04,2.93,735.0


In [7]:
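# Scale the incomplete data to [0, 1]; MinMaxScaler ignores NaNs when fitting and keeps them in the output
scaler = MinMaxScaler()
X_missing_scaled = pd.DataFrame(
    scaler.fit_transform(X_missing),
    columns=X_missing.columns, index=X_missing.index)

# Scale the complete data as the reference for evaluation.
# Note: fit_transform refits the scaler on X, so its per-column min/max
# can differ from the ones fitted on X_missing above.
X_scaled = pd.DataFrame(
    scaler.fit_transform(X),
    columns=X.columns, index=X.index)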
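
When a removed value happens to be a column's minimum or maximum, fitting the scaler separately on the incomplete and the complete data puts the two frames on slightly different scales. One possible alternative, sketched here on toy data with made-up variable names, is to fit the scaler once on the incomplete data and reuse it for the complete data. Adopting it would change the RMSE values reported in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X_true = pd.DataFrame({"a": [0.0, 5.0, 10.0], "b": [1.0, 2.0, 3.0]})
X_holes = X_true.copy()
X_holes.loc[2, "a"] = np.nan  # the removed value is the column maximum

# Fit once on the incomplete data, then apply the same scaling to both frames
shared = MinMaxScaler().fit(X_holes)
X_holes_scaled = pd.DataFrame(shared.transform(X_holes), columns=X_true.columns)
X_true_shared = pd.DataFrame(shared.transform(X_true), columns=X_true.columns)

# For comparison: a scaler refit on the complete data
X_true_refit = pd.DataFrame(MinMaxScaler().fit_transform(X_true), columns=X_true.columns)

observed = X_holes.notna().to_numpy()
print(X_holes_scaled["a"].tolist())
print(X_true_shared["a"].tolist())
print(X_true_refit["a"].tolist())
```

With the shared scaler, observed cells have identical values in both frames, at the cost that removed true values can fall outside [0, 1].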

#### FCMCentroidImputer

In [8]:
fcmc = FCMCentroidImputer(n_clusters=3, m=2, max_iter=1000, tol=1e-5, random_state=42)
fcmc.fit(X_missing_scaled)
X_imputed_fcmc = fcmc.transform(X_missing_scaled)

X_imputed_fcmc.head()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,0.842105,0.1917,0.572193,0.257732,0.619565,0.627586,0.57384,0.282672,0.737255,0.372014,0.455285,0.970696,0.561341
1,0.722359,0.205534,0.591036,0.030928,0.326087,0.575862,0.510549,0.245283,0.341176,0.264505,0.463415,0.78022,0.615091
2,0.560526,0.320158,0.700535,0.412371,0.336957,0.627586,0.611814,0.320755,0.594665,0.383369,0.447154,0.695971,0.646933
3,0.878947,0.23913,0.609626,0.319588,0.467391,0.989655,0.664557,0.207547,0.694118,0.556314,0.475683,0.798535,0.857347
4,0.581579,0.365613,0.807487,0.536082,0.521739,0.627586,0.495781,0.282672,0.552941,0.259386,0.455285,0.608059,0.325963


In [9]:
print("Missing before:", X_missing_scaled.isna().sum().sum())
print("Missing after :", pd.DataFrame(X_imputed_fcmc, columns=feature_names).isna().sum().sum())

pd.DataFrame({
    "Before imputation": X_missing_scaled.isna().sum(),
    "After imputation": pd.DataFrame(X_imputed_fcmc, columns=feature_names).isna().sum()
})

Missing before: 213
Missing after : 0


,Before imputation,After imputation
alcohol,24,0
malic_acid,14,0
ash,17,0
alcalinity_of_ash,15,0
magnesium,15,0
total_phenols,20,0
flavanoids,14,0
nonflavanoid_phenols,17,0
proanthocyanins,15,0
color_intensity,19,0
hue,9,0
od280/od315_of_diluted_wines,19,0
proline,15,0


In [10]:
rmse_results_fcmc = rmse(X_scaled, X_imputed_fcmc, missing_mask)
print(f"Average RMSE for FCMCentroidImputer: {rmse_results_fcmc:.4f}")

Average RMSE for FCMCentroidImputer: 0.1703


#### FCMParameterImputer

In [11]:
fcmp = FCMParameterImputer(n_clusters=3, m=2, max_iter=1000, tol=1e-5, random_state=42)
fcmp.fit(X_missing_scaled)
X_imputed_fcmp = fcmp.transform(X_missing_scaled)

rmse_results_fcmp = rmse(X_scaled, X_imputed_fcmp, missing_mask)
print(f"Average RMSE for FCMParameterImputer: {rmse_results_fcmp:.4f}")

Average RMSE for FCMParameterImputer: 0.1663


#### FCMRoughParameterImputer

In [12]:
fcmrp = FCMRoughParameterImputer(n_clusters=3, m=2.0, max_iter=100, tol=1e-5, wl=0.6, wb=0.4, 
                                 tau=0.5, random_state=42)
fcmrp.fit(X_missing_scaled)
X_imputed_fcmrp = fcmrp.transform(X_missing_scaled)

rmse_results_fcmrp = rmse(X_scaled, X_imputed_fcmrp, missing_mask)
print(f"Average RMSE for FCMRoughParameterImputer: {rmse_results_fcmrp:.4f}")

Average RMSE for FCMRoughParameterImputer: 0.1837


#### FCMKIterativeImputer

In [13]:
fcki = FCMKIterativeImputer(random_state=42, max_clusters=3, m=2, max_iter=100)
fcki.fit(X_missing_scaled)
X_imputed_fcki = fcki.transform(X_missing_scaled)

rmse_results_fcki = rmse(X_scaled, X_imputed_fcki, missing_mask)
print(f"Average RMSE for FCMKIterativeImputer: {rmse_results_fcki:.4f}")

Average RMSE for FCMKIterativeImputer: 0.1789


#### FCMInterpolationIterativeImputer

In [14]:
fcmii = FCMInterpolationIterativeImputer(n_clusters=3, m=2, alpha=2, max_iter=1000, tol=1e-5, 
                                         max_outer_iter=20, stop_criteria=0.001, sigma=False, random_state=42)
fcmii.fit(X_missing_scaled)
X_imputed_fcmii = fcmii.transform(X_missing_scaled)

rmse_results_fcmii = rmse(X_scaled, X_imputed_fcmii, missing_mask)
print(f"Average RMSE for FCMInterpolationIterativeImputer: {rmse_results_fcmii:.4f}")

Average RMSE for FCMInterpolationIterativeImputer: 0.1972


#### FCMDTIterativeImputer

In [15]:
fcmdti = FCMDTIterativeImputer(random_state=42, min_samples_leaf=3, learning_rate=0.1, m=2, max_clusters=20, 
                               max_iter=1000, stop_threshold=1.0, alpha=1.0)
fcmdti.fit(X_missing_scaled)
X_imputed_fcmdti = fcmdti.transform(X_missing_scaled)

rmse_results_fcmdti = rmse(X_scaled, X_imputed_fcmdti, missing_mask)
print(f"Average RMSE for FCMDTIterativeImputer: {rmse_results_fcmdti:.4f}")

Average RMSE for FCMDTIterativeImputer: 0.1902


In [16]:
rmse_dict = {
    "FCMCentroidImputer": rmse_results_fcmc,
    "FCMParameterImputer": rmse_results_fcmp,
    "FCMRoughParameterImputer": rmse_results_fcmrp,
    "FCMKIterativeImputer": rmse_results_fcki,
    "FCMInterpolationIterativeImputer": rmse_results_fcmii,
    "FCMDTIterativeImputer": rmse_results_fcmdti
}

rmse_df = pd.DataFrame.from_dict(rmse_dict, orient='index', columns=["RMSE"])
rmse_df

,RMSE
FCMCentroidImputer,0.170264
FCMParameterImputer,0.166326
FCMRoughParameterImputer,0.18367
FCMKIterativeImputer,0.178928
FCMInterpolationIterativeImputer,0.197231
FCMDTIterativeImputer,0.190246
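
To make the comparison easier to read, the table can be sorted and expressed relative to the best score. The sketch below rebuilds the table from the values printed above so that it runs on its own; in the notebook, `rmse_df` could be used directly. These numbers come from one dataset, one random seed, and one missingness pattern, so the ranking should not be read as a general statement about the methods.

```python
import pandas as pd

# Values copied from the table above
rmse_demo = pd.DataFrame(
    {"RMSE": [0.170264, 0.166326, 0.18367, 0.178928, 0.197231, 0.190246]},
    index=[
        "FCMCentroidImputer",
        "FCMParameterImputer",
        "FCMRoughParameterImputer",
        "FCMKIterativeImputer",
        "FCMInterpolationIterativeImputer",
        "FCMDTIterativeImputer",
    ],
)

# Lowest RMSE first, plus each score as a multiple of the best one
ranked = rmse_demo.sort_values("RMSE")
ranked["relative_to_best"] = ranked["RMSE"] / ranked["RMSE"].iloc[0]
print(ranked)
```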


---

### Feature Selection

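Both feature-selection methods require the user to specify how many columns should remain in the final dataset. `WeightedFuzzyRoughSelector` takes this as `n_features`; in the `FuzzyGranularitySelector` example below, `k=5` appears to play that role, since five columns are returned.
After calling the `transform` method, a reduced DataFrame containing the selected number of features is returned.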

The feature-selection module leverages fuzzy measures to identify and retain the most informative variables, improving model interpretability and reducing dimensionality in a principled way.
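
The selectors' internals are not reproduced here. As a generic illustration of the kind of fuzzy measure such methods can build on, the sketch below computes a classic fuzzy-rough dependency degree: how well a feature subset separates the classes under a fuzzy similarity relation. The function `fuzzy_rough_dependency` is written for this example and is an assumption about the general technique, not Ficaria's algorithm.

```python
import numpy as np


def fuzzy_rough_dependency(X, y, features):
    """Dependency of the class labels on a feature subset, in [0, 1]."""
    X = np.asarray(X, dtype=float)[:, features]
    y = np.asarray(y)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # constant columns: all samples count as fully similar

    # Per-feature similarity 1 - |a(x) - a(y)| / range, combined across features with min
    diff = np.abs(X[:, None, :] - X[None, :, :]) / span
    similarity = (1.0 - diff).min(axis=2)

    # Lower-approximation membership of each sample to its own class:
    # one minus its similarity to the most similar sample of another class
    other_class = y[:, None] != y[None, :]
    lower = np.where(other_class, 1.0 - similarity, 1.0).min(axis=1)
    return lower.mean()


# Column 0 separates the two classes, column 1 does not
X_demo = np.array([[0.0, 0.3], [0.1, 0.9], [0.9, 0.35], [1.0, 0.85]])
y_demo = np.array([0, 0, 1, 1])

dep_informative = fuzzy_rough_dependency(X_demo, y_demo, [0])
dep_noisy = fuzzy_rough_dependency(X_demo, y_demo, [1])
print(dep_informative, dep_noisy)
```

A higher value means that samples from different classes are less similar on the chosen features, which is the sense in which a feature is considered informative here.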

In [17]:
X.head()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0


#### WeightedFuzzyRoughSelector

In [18]:
wfrfs = WeightedFuzzyRoughSelector(n_features=5, alpha=0.6, k=15)
wfrfs.fit(X, y)  # importances are available afterwards as wfrfs.feature_importances_
X_selected_wfrfs = wfrfs.transform(X)

In [19]:
wfrfs.feature_importances_

,feature,importance
0,proline,0.308772
1,magnesium,0.225567
2,color_intensity,0.209702
3,alcalinity_of_ash,0.165844
4,flavanoids,0.150677
5,alcohol,0.144469
6,malic_acid,0.137205
7,od280/od315_of_diluted_wines,0.135196
8,total_phenols,0.11879
9,proanthocyanins,0.116574


In [20]:
print("Original shape:", X.shape)
print("Selected shape:", X_selected_wfrfs.shape)

Original shape: (178, 13)
Selected shape: (178, 5)


In [21]:
X_selected_wfrfs.head()

,proline,magnesium,alcalinity_of_ash,malic_acid,color_intensity
0,1065.0,127.0,15.6,1.71,5.64
1,1050.0,100.0,11.2,1.78,4.38
2,1185.0,101.0,18.6,2.36,5.68
3,1480.0,113.0,16.8,1.95,7.8
4,735.0,118.0,21.0,2.59,4.32


#### FuzzyGranularitySelector

In [22]:
figfs = FuzzyGranularitySelector(k=5, eps=0.5, d=20, sigma=10, random_state=42)
figfs.fit(X, y)
X_selected_figfs = figfs.transform(X)

In [23]:
X_selected_figfs.head()

,nonflavanoid_phenols,hue,ash,od280/od315_of_diluted_wines,total_phenols
0,0.28,1.04,2.43,3.92,2.8
1,0.26,1.05,2.14,3.4,2.65
2,0.3,1.03,2.67,3.17,2.8
3,0.24,0.86,2.5,3.45,3.85
4,0.39,1.04,2.87,2.93,2.8


In [24]:
print("Original shape:", X.shape)
print("Selected shape:", X_selected_figfs.shape)

Original shape: (178, 13)
Selected shape: (178, 5)
