<a href="https://colab.research.google.com/github/urosgodnov/juypterNotebooks/blob/main/DataMining/Machine_Learning_with_Python_3_Feature_selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Using Python to implement the machine learning process
by Dr. Uros Godnov**

# Feature selection

1. **Filter Methods:**

Filter methods are used to score each feature independently of the model and then select the highest scoring features:
- Variance Threshold
- Correlation Matrix with Heatmap



---


2. **Wrapper Methods:**

Wrapper methods involve training a machine learning model to evaluate the importance of features, where the model's performance serves as a criterion:

- Recursive Feature Elimination (RFE)
- Sequential Feature Selection
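
As a rough illustration of a wrapper method, the sketch below runs forward sequential selection on synthetic data. The data, the variable names (`X_demo`, `y_demo`, `sfs`) and the choice of `LinearRegression` are assumptions made for this example only.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): only columns 0 and 2 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

# Forward selection: start with no features and repeatedly add the one
# that improves the cross-validated score the most
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
)
sfs.fit(X_demo, y_demo)

sfs_selected = np.flatnonzero(sfs.get_support()).tolist()
print("Selected column indices:", sfs_selected)
```

Because the model is refitted for every candidate feature, wrapper methods tend to be considerably slower than filter methods on wide datasets.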



---
3. **Embedded Methods**

Embedded methods are techniques that are performed during model training and can provide feature importance scores directly from the algorithm:

- Feature Importance with Tree-Based Models
- L1 Regularization (Lasso)
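
Both embedded approaches can be sketched on synthetic data. The data, the `alpha` value and the variable names below are assumptions chosen for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# Synthetic data (hypothetical): only columns 0 and 2 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=300)

# L1 regularization can shrink coefficients of uninformative features to exactly zero
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
lasso_kept = np.flatnonzero(lasso.coef_ != 0).tolist()

# Tree ensembles expose impurity-based importances after fitting
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)
top_two = sorted(np.argsort(forest.feature_importances_)[::-1][:2].tolist())

print("Features kept by Lasso:", lasso_kept)
print("Top two features by forest importance:", top_two)
```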



---


4. **Principal Component Analysis (PCA)**

PCA is a dimensionality reduction technique that transforms features into a lower-dimensional space. While it doesn’t technically “select” features, it can help reduce feature dimensionality.
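
A minimal sketch of PCA on synthetic data, assuming four observed columns generated from two underlying factors (all names and values here are made up for the example):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# Four observed columns built from two underlying factors plus small noise
X_demo = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.05 * rng.normal(size=200),
    base[:, 1],
    base[:, 1] + 0.05 * rng.normal(size=200),
])

# PCA is sensitive to scale, so standardize first
X_std = StandardScaler().fit_transform(X_demo)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print("Shape after PCA:", X_pca.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The two components are linear combinations of all four columns, which is why PCA reduces dimensionality without selecting any original feature.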

In [None]:
import pandas as pd
import numpy as np

from google.colab import drive
import sys
# Mount google drive
drive.mount('/content/gdrive')
# Add the modules directory to the import path
sys.path.append('/content/gdrive/MyDrive/Google_Colab_modules')

import sweetviz as sv
import ydata_profiling as ydp

In [None]:
df=pd.read_csv("https://raw.githubusercontent.com/urosgodnov/datasets/refs/heads/master/laptip_prices_without_missing_values.csv")
for col in df.select_dtypes(exclude="number"):
    # Count frequencies including NaN
    counts = df[col].value_counts(dropna=False)

    # Replace categories with frequency ≤ 65, but keep NaNs
    df[col] = df[col].apply(
        lambda x: x if pd.isnull(x) or counts[x] > 65 else "Other"
    )
df.head()

## Correlation matrix

In [None]:
df.select_dtypes(include=['number','float']).corr()

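`DataFrame.corr()` returns **no** p-values.
What is a p-value? It is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.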

Key points about p-values:
- Low p-value (e.g., < 0.05): Suggests that the observed data is unlikely under the null hypothesis, leading to rejection of the null hypothesis in favor of the alternative hypothesis. This implies the result is statistically significant.

- High p-value (e.g., > 0.05): Indicates that the observed data is consistent with the null hypothesis, meaning there is not enough evidence to reject it. The result is considered not statistically significant.

Interpretation:
- p < 0.01: Strong evidence against the null hypothesis.
- p < 0.05: Moderate evidence against the null hypothesis (common threshold for significance).
- p ≥ 0.05: Weak or no evidence against the null hypothesis; you fail to reject it.

While p-values are useful for determining statistical significance, they don't measure the size or importance of an effect and can be affected by sample size. Therefore, it's crucial to interpret them in the context of the study and other statistics (e.g., confidence intervals).
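
The dependence on sample size can be illustrated with a small simulation. This is a sketch on synthetic data; the helper `correlated_sample` and the slope of 0.1 are assumptions introduced only for this example.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

def correlated_sample(n, slope=0.1):
    """Draw n points with a weak linear relationship y = slope * x + noise."""
    x = rng.normal(size=n)
    y = slope * x + rng.normal(size=n)
    return x, y

# Same weak effect, two very different sample sizes
r_small, p_small = pearsonr(*correlated_sample(30))
r_large, p_large = pearsonr(*correlated_sample(100_000))

print(f"n=30:      r={r_small:.3f}, p={p_small:.3g}")
print(f"n=100000:  r={r_large:.3f}, p={p_large:.3g}")
```

With the large sample the correlation is still weak, yet the p-value is tiny: statistical significance says nothing about how strong the relationship is.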

In [None]:
from scipy.stats import pearsonr

dfC=df.select_dtypes(include=['number','float'])

correlation_matrix = dfC.corr()
p_values = pd.DataFrame(index=dfC.columns, columns=dfC.columns)

for col1 in dfC.columns:
    for col2 in dfC.columns:
        if col1 == col2:
            p_values.loc[col1, col2] = None  # p-value for self correlation is not meaningful
        else:
            corr, p_val = pearsonr(df[col1], df[col2])
            p_values.loc[col1, col2] = p_val

print("Correlation Matrix:")
print(correlation_matrix)
print("\nP-Values Matrix:")
print(p_values)

- Strongest correlation: ram_gb has the strongest relationship with price_euro (correlation = 0.7400), and this relationship is highly significant.
- Moderate correlation: cpu_frequency_ghz shows a moderate correlation with price.
- Weak correlations: inches and weight_kg have weak but statistically significant correlations with price_euro.

## T test and ANOVA

**Key Assumptions** (t test):
- The two groups are independent (no overlap).
- The data is approximately normally distributed.
- Variances between the two groups are equal (homogeneity of variance). If not, a Welch's t-test can be used as a variant.

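**Key Points** (ANOVA):
- ANOVA is used when comparing the means of three or more groups.
- It determines if at least one group has a significantly different mean without performing multiple pairwise t-tests, which can increase the risk of Type I errors (false positives).
- It relies on the same assumptions as the t-test: independent groups, approximately normal data within each group, and equal variances across groups.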
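
The assumptions can be checked before choosing a test. The sketch below uses Levene's test for equal variances and the Shapiro-Wilk test for normality on two synthetic groups; the group names, sizes and the 0.05 cut-off are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Two synthetic groups with equal means but very different spreads
rng = np.random.default_rng(0)
group_a = rng.normal(loc=1000, scale=100, size=200)
group_b = rng.normal(loc=1000, scale=400, size=200)

# Levene's test: the null hypothesis is that the groups have equal variances
lev_stat, lev_p = stats.levene(group_a, group_b)

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed
sh_stat, sh_p = stats.shapiro(group_a)

# Fall back to Welch's t-test when equal variances are rejected
equal_var = bool(lev_p >= 0.05)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=equal_var)

print(f"Levene p={lev_p:.3g}, Shapiro p={sh_p:.3g}, equal_var={equal_var}")
print(f"t-statistic={t_stat:.4f}, p-value={p_value:.4g}")
```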

In [None]:
import pandas as pd
from scipy import stats

# Identify object-type columns
object_columns = df.select_dtypes(include='object').columns

# Iterate through each object-type column
for col in object_columns:
    # Drop rows with NaN in the current column or 'price_euro'
    df_clean = df.dropna(subset=[col, 'price_euro'])
    unique_values = df_clean[col].unique()

    if len(unique_values) == 2:
        # Perform t-test for two unique values
        group1 = df_clean[df_clean[col] == unique_values[0]]['price_euro']
        group2 = df_clean[df_clean[col] == unique_values[1]]['price_euro']

        if len(group1) >= 2 and len(group2) >= 2:
            # Perform independent t-test
            t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)  # Welch's t-test

            print(f"T-test for column '{col}', comparing '{unique_values[0]}' vs '{unique_values[1]}':")
            print(f"t-statistic: {t_stat:.4f}, p-value: {p_value:.4e}\n")
        else:
            print(f"Not enough data in one of the groups for column '{col}'.\n")

    elif len(unique_values) >= 3:
        # Perform ANOVA for three or more unique values
        groups = [df_clean[df_clean[col] == value]['price_euro'] for value in unique_values]

        # Check if all groups have at least two observations
        if all(len(group) >= 2 for group in groups):
            # Perform one-way ANOVA
            f_stat, p_value = stats.f_oneway(*groups)

            print(f"ANOVA for column '{col}':")
            print(f"F-statistic: {f_stat:.4f}, p-value: {p_value:.4e}\n")
        else:
            print(f"Not enough data in one of the groups for column '{col}'.\n")


## Regular linear regression

In [None]:
# For demonstration purposes, we use only the numeric columns
# Import statsmodels for OLS regression
import statsmodels.api as sm
from statsmodels.formula.api import ols

df_numeric = df.select_dtypes(include=[np.number])

# Predict 'price_euro' using all other numeric columns.
dependent_var = 'price_euro'
independent_vars = df_numeric.columns.drop(dependent_var)  # all numeric columns except the target

# Construct the formula string for statsmodels,
# e.g. "price_euro ~ inches + ram_gb"
formula = f"{dependent_var} ~ {' + '.join(independent_vars)}"
print("Formula:", formula)

# Fit the linear regression model
model = ols(formula, data=df_numeric).fit()

In [None]:
print(model.summary())

## Scikit feature selection

### SelectKBest with f_regression

f_regression is a function from the sklearn.feature_selection module that evaluates the linear relationship between each feature (independent variable) and the target (dependent variable) using an F-test in a regression setting.

**Compare it to previous linear regression model!**
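
One way to see what f_regression measures: for each feature taken on its own, the F-statistic can be rebuilt from the Pearson correlation r with the target as F = r² / (1 − r²) · (n − 2). The sketch below checks this on synthetic data (the data and names are assumptions for the example). Unlike the multiple regression above, every feature is scored separately, so the scores ignore the other features.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import f_regression

# Synthetic data (hypothetical): only column 0 is related to the target
rng = np.random.default_rng(0)
n = 100
X_demo = rng.normal(size=(n, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=n)

f_scores, p_vals = f_regression(X_demo, y_demo)

# Rebuild the same F-statistics from each feature's Pearson correlation with y
r = np.array([pearsonr(X_demo[:, j], y_demo)[0] for j in range(X_demo.shape[1])])
f_manual = r**2 / (1 - r**2) * (n - 2)

print("f_regression scores:", f_scores)
print("Rebuilt from r:     ", f_manual)
```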

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression


**NOMINAL variables - get_dummies()**



In [None]:
Xdemo = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green'],
    'Size': ['S', 'M', 'L'],
    'Price': [10, 15, 20]
})

print(Xdemo)
Xdemo_dummies = pd.get_dummies(Xdemo)
print(Xdemo_dummies)

In [None]:
y = df['price_euro']

X = df.drop('price_euro', axis=1)

# get_dummies is similar to one-hot encoding
X_dummies = pd.get_dummies(X)
X_dummies.head()

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_dummies)

X_scaled

In [None]:
selector = SelectKBest(score_func=f_regression, k=10)  # Select top 10 features

# Fit the selector to the data
X_new = selector.fit_transform(X_scaled, y)

# Get the boolean mask of selected features
mask = selector.get_support()

# Get the list of selected feature names
selected_features = X_dummies.columns[mask]

selected_features

#### Cheating with ordinal encoding

In [None]:
from sklearn.preprocessing import OrdinalEncoder

categorical_cols=df.select_dtypes(include='object').columns

ordinal_encoder = OrdinalEncoder()

X_encoded = X.copy()

X_encoded[categorical_cols] = ordinal_encoder.fit_transform(X[categorical_cols])

X_encoded.transpose()

In [None]:
# To inspect the mapping, list the original category
# behind each encoded integer
print("\nMapping of Encoded Values to Original Categories:")
for col, categories in zip(categorical_cols, ordinal_encoder.categories_):
    print(f"{col}: {dict(enumerate(categories))}")

In [None]:
selector = SelectKBest(score_func=f_regression, k=7)

X_new = selector.fit_transform(X_encoded, y)


# Get the boolean mask of selected features
mask = selector.get_support()

# Get the list of selected feature names
selected_features = X.columns[mask]

selected_features

**Feature importance**

In [None]:
scores = selector.scores_
pvalues = selector.pvalues_

feature_importances = pd.DataFrame({
    'Feature': X_encoded.columns,  # The original feature names after encoding
    'F-Score': scores,
    'p-Value': pvalues
})

# Sort by F-score to display the most important features
feature_importances_sorted = feature_importances.sort_values(by='F-Score', ascending=False)
feature_importances_sorted.head(7)

# Creating a pipeline

Creating a pipeline with Scikit-learn has several advantages, particularly in machine learning workflows. Here’s why you should use a pipeline:

- Streamlines Workflow: Combines multiple steps like preprocessing, feature selection, and model training into a single object, making the workflow more organized and easier to manage.

- Prevents Data Leakage: Ensures that all transformations (like scaling or encoding) are fitted only on the training data during cross-validation, so no information from the test set leaks into training.

- Consistency: Ensures that the same transformations are applied to both the training and test datasets, maintaining uniformity throughout the machine learning process.

- Cleaner Code: Simplifies code by chaining all preprocessing and modeling steps, reducing the need to manually apply transformations for every step.

- Cross-Validation Compatibility: Allows seamless use of cross-validation (cross_val_score, GridSearchCV, etc.) by treating the entire pipeline as a single entity, making it easy to optimize hyperparameters across all steps.

- Reusable: The pipeline can be reused for both training and prediction, applying all transformations automatically to any new data.

- Modular Design: Each step in the pipeline is modular, making it easier to update or replace specific parts (e.g., swapping out a model or a preprocessing step) without affecting the entire workflow.

- Reduces Human Error: By automating data transformations within the pipeline, it reduces the chance of making mistakes (like forgetting to scale test data) when performing steps manually.
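
The leakage and cross-validation points can be sketched as follows. The data is synthetic and the step names (`scaler`, `select`, `model`) are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): only columns 0 and 2 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=200)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression, k=2)),
    ("model", LinearRegression()),
])

# In each fold the scaler and the selector are fitted on the training part only,
# then applied unchanged to the held-out part
cv_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="r2")
print("Mean R^2 across folds:", cv_scores.mean())
```

Scaling or selecting features on the full dataset before calling `cross_val_score` would let the held-out folds influence the preprocessing, which is exactly the leakage the pipeline avoids.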

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.compose import ColumnTransformer



In [None]:
# pipeline = Pipeline(steps=[
#    ('step_name', transformer_or_model),
#    ('step_name2', transformer_or_model2),
#    # Add more steps as needed
# ])

## Pipeline for preprocessing

In [None]:
import sklearn
dir(sklearn.feature_selection)

### SelectKBest

In [None]:
# data importing and frequency recalculation

df=pd.read_csv("https://raw.githubusercontent.com/urosgodnov/datasets/refs/heads/master/laptip_prices_without_missing_values.csv")

for col in df.select_dtypes(exclude="number"):
    # Count frequencies including NaN
    counts = df[col].value_counts(dropna=False)

    # Replace categories with frequency ≤ 65, but keep NaNs
    df[col] = df[col].apply(
        lambda x: x if pd.isnull(x) or counts[x] > 65 else "Other"
    )

In [None]:
# independent and dependent variables
X = df.drop('price_euro', axis=1)
y = df['price_euro']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

Explanation:

- transformers argument:
 - ('num', StandardScaler(), numerical_cols):
   - Apply StandardScaler to the columns in numerical_cols.
   - StandardScaler standardizes numerical features to have zero mean and unit variance.
 - ('cat', OrdinalEncoder(), categorical_cols):
   - Apply OrdinalEncoder to the columns in categorical_cols.
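   - OrdinalEncoder converts categorical values into integers. By default the categories are sorted (alphabetically for strings), so ['Low', 'Medium', 'High'] is encoded as [1, 2, 0]; pass the `categories` parameter to enforce a meaningful order such as Low → 0, Medium → 1, High → 2.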

- Each tuple in transformers specifies:
 - A name for the transformation ('num' or 'cat').
 - The transformation object (StandardScaler or OrdinalEncoder).
 - The columns to which the transformation will be applied (numerical_cols or categorical_cols).
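
How OrdinalEncoder assigns integers can be checked on a toy column. The `levels` frame below is a made-up example, not part of the laptop dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

levels = pd.DataFrame({"level": ["Low", "Medium", "High", "Low"]})

# Default behaviour: categories are sorted, so High=0, Low=1, Medium=2
default_codes = OrdinalEncoder().fit_transform(levels).ravel().tolist()

# Explicit order passed through `categories`: Low=0, Medium=1, High=2
ordered = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordered_codes = ordered.fit_transform(levels).ravel().tolist()

print("Default codes:", default_codes)
print("Ordered codes:", ordered_codes)
```

For nominal columns such as a brand name, neither order is meaningful, which is why ordinal encoding of such columns is only a shortcut.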

In [None]:
# defining preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OrdinalEncoder(), categorical_cols)
    ])

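# defining pipeline
# Note: f_classif is the ANOVA F-test intended for categorical targets. With the
# continuous target price_euro it treats every distinct price as its own class;
# f_regression is the score function designed for regression targets.
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Preprocessing step
    ('feature_selection', SelectKBest(score_func=f_classif, k=7))  # Feature selection step
])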

In [None]:
pipeline.fit(X, y)

X_selected = pipeline.transform(X)

selector = pipeline.named_steps['feature_selection']

scores = selector.scores_  # F-scores
pvalues = selector.pvalues_  # p-values

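mask = selector.get_support()

# Caution: ColumnTransformer outputs numerical_cols first, then categorical_cols,
# so the mask follows that order, not the order of X.columns. Indexing
# np.array(numerical_cols + categorical_cols) with the mask gives aligned names.
selected_features = X.columns[mask]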

# Output the selected feature names
print("Selected Features:", selected_features)

Selected Features: Index(['product', 'typename', 'inches', 'cpu_frequency_ghz', 'memory',
       'gpu_company', 'opsys'],
      dtype='object')




In [None]:
feature_importances_selectkbest = pd.DataFrame({
    'Feature': X.columns,
    'F-Score': scores,
    'p-Value': pvalues
})

feature_importances_selectkbest = feature_importances_selectkbest[feature_importances_selectkbest['Feature'].isin(selected_features)].sort_values(by='F-Score', ascending=False)
feature_importances_selectkbest.head()

| | Feature | F-Score | p-Value |
|---|---|---|---|
| 2 | typename | 4.332661 | 5.600711e-61 |
| 1 | product | 1.643684 | 1.405849e-09 |
| 7 | cpu_frequency_ghz | 1.417062 | 1.335609e-05 |
| 10 | gpu_company | 1.413896 | 1.501003e-05 |
| 3 | inches | 1.341903 | 0.0001911267 |


### Variance threshold

In [72]:
from sklearn.feature_selection import VarianceThreshold

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Preprocessing step
    ('feature_selection', VarianceThreshold(threshold=0.8))
])

In [73]:
pipeline.fit(X, y)

X_selected = pipeline.transform(X)

selector = pipeline.named_steps['feature_selection']


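mask = selector.get_support()
variances = selector.variances_

# Caution: the mask and variances follow the ColumnTransformer output order
# (numerical_cols, then categorical_cols), not the order of X.columns.
selected_features = X.columns[mask]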

# Output the selected feature names
print("Selected Features:", selected_features)

Selected Features: Index(['company', 'product', 'typename', 'inches', 'screenresolution',
       'cpu_type', 'cpu_frequency_ghz', 'memory', 'gpu_company'],
      dtype='object')


In [74]:
feature_importances_variance = pd.DataFrame({
    'Feature': X.columns,
    'Variance': variances
})

feature_importances_variance[feature_importances_variance["Variance"]>0.5]

| | Feature | Variance |
|---|---|---|
| 0 | company | 1.0 |
| 1 | product | 1.0 |
| 2 | typename | 1.0 |
| 3 | inches | 1.0 |
| 4 | screenresolution | 2.124567 |
| 6 | cpu_type | 1.177254 |
| 7 | cpu_frequency_ghz | 1.083122 |
| 9 | memory | 5.677033 |
| 10 | gpu_company | 4.044047 |
| 12 | opsys | 0.637259 |
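
Four variances in the table are exactly 1.0, which is what StandardScaler produces for any non-constant column. A minimal sketch on synthetic data (names and scales are assumptions for the example) shows that a variance threshold can no longer separate columns once they are standardized:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(scale=10.0, size=500),   # high-variance column
    rng.normal(scale=0.01, size=500),   # nearly constant column
])

# On the raw data the threshold separates the two columns
raw_selector = VarianceThreshold(threshold=0.8).fit(X_demo)
raw_support = raw_selector.get_support().tolist()

# After standardization every column has variance 1, so both pass
X_std = StandardScaler().fit_transform(X_demo)
scaled_selector = VarianceThreshold(threshold=0.8).fit(X_std)
scaled_var = scaled_selector.variances_
scaled_support = scaled_selector.get_support().tolist()

print("Raw variances:   ", raw_selector.variances_, raw_support)
print("Scaled variances:", scaled_var, scaled_support)
```

If the goal is to drop near-constant numerical columns, the threshold has to be applied before scaling, or to features put on a comparable scale by other means such as min-max scaling.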


### Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) is a feature selection technique used to identify the most important features for a predictive model. It works by recursively removing the least important features and fitting the model again on the remaining features.
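
The elimination loop can be written out by hand and compared with scikit-learn's RFE. This is a sketch on synthetic data, assuming a linear model whose absolute coefficients serve as the importance measure; the variable names are made up for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): only columns 0 and 2 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

# Manual loop: refit, drop the feature with the smallest absolute coefficient, repeat
remaining = list(range(X_demo.shape[1]))
while len(remaining) > 2:
    coefs = LinearRegression().fit(X_demo[:, remaining], y_demo).coef_
    remaining.pop(int(np.argmin(np.abs(coefs))))

# The same procedure via scikit-learn
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)
rfe_selected = np.flatnonzero(rfe.support_).tolist()

print("Manual loop keeps:", remaining)
print("RFE keeps:        ", rfe_selected)
```

Coefficient magnitudes depend on the scale of each feature, so features should be on a comparable scale before coefficients are used as importances.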

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Preprocessing step
    ('feature_selection', RFE(estimator=LinearRegression(), n_features_to_select=7))  # Feature selection step
])


In [None]:
pipeline.fit(X, y)

X_selected = pipeline.transform(X)

selector = pipeline.named_steps['feature_selection']

ranking = selector.ranking_
support = selector.support_

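mask = selector.get_support()

# ColumnTransformer outputs numerical_cols first, then categorical_cols,
# so feature names must follow that order rather than X.columns
feature_names = np.array(numerical_cols + categorical_cols)
selected_features = feature_names[mask]

# Output the selected feature names
print("Selected Features:", selected_features)

In [None]:
feature_importances_RFE = pd.DataFrame({
    'Feature': numerical_cols + categorical_cols,  # same order as the transformer output
    'Ranking': ranking,
    'Selected': support
})

feature_importances_RFE = feature_importances_RFE[feature_importances_RFE['Selected']].sort_values(by='Ranking')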
feature_importances_RFE.head()

**Task**

In [None]:
# Use SelectKBest(score_func=f_classif, k=5) to find the 5 best features
# Ignore data quality issues for this exercise
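# Use the raw file URL; the github.com/.../blob/... page returns HTML, not CSV
df_task=pd.read_csv("https://raw.githubusercontent.com/urosgodnov/datasets/refs/heads/master/diabetes.csv")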

df_task.head()