### Feature Selection Exercise

In this exercise, we will apply feature selection to the Iris flowers dataset, where the target variable is Species. Our goal is to identify the features that are most relevant for distinguishing the species of each Iris flower. The dataset is from: https://www.kaggle.com/datasets/uciml/iris

1. Load the dataset from the exercise's GitHub repository (Iris.csv).
2. Using business logic/common sense, drop features that are clearly irrelevant to the target variable.
3. Preprocess your data (split data into training and testing)
4. Apply feature selection with the following methods:
    - Pearson's correlation coefficient (r)
    - Kendall's tau (τ)
    - Mutual Information (MI)
    - Logistic Regression with L1 penalty
5. Compare the results of each feature selection method:
    - What features did you manually drop before applying the feature selection methods? Explain why.
    - Are there any common features selected across multiple methods?
    - Can you explain why certain features were selected based on their characteristics?
(Optional) Visualize the importance of features using techniques like bar charts or heatmaps to make it easier to compare.
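
Steps 1 to 3 could be sketched roughly as follows. This is only one possible setup, not the reference solution. It assumes the CSV has the columns `Id`, `SepalLengthCm`, `SepalWidthCm`, `PetalLengthCm`, `PetalWidthCm` and `Species`, and that `Id` is just a row identifier. To stay runnable without the file, the sketch rebuilds a frame with that layout from scikit-learn's bundled copy of the Iris data; in the exercise you would use `pd.read_csv("Iris.csv")` instead. The 80/20 stratified split and `random_state=42` are arbitrary choices.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in for pd.read_csv("Iris.csv"): rebuild a frame with the assumed
# Kaggle-style column layout from scikit-learn's bundled Iris data.
# (Species labels here are scikit-learn's names, e.g. "setosa".)
iris = load_iris(as_frame=True)
df = iris.frame.copy()
df.columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm", "Species"]
df["Species"] = df["Species"].map(dict(enumerate(iris.target_names)))
df.insert(0, "Id", range(1, len(df) + 1))

# Step 2: drop the row identifier (assumed to carry no information about the species)
X = df.drop(columns=["Id", "Species"])
y = df["Species"]

# Step 3: hold out 20% for testing, keeping class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

Fitting every feature selection method on `X_train` and `y_train` only keeps the test split untouched for any later model evaluation.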



In [None]:
#Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from scipy.stats import kendalltau

In [None]:
#Feature Selection with Pearson's r

# Pearson's r needs a numeric target, so encode the species labels as integer codes.
# The codes impose an arbitrary order on the classes, so treat the scores as a rough guide.
y_train_codes = y_train.astype("category").cat.codes

# Calculate correlations between features and the encoded target
correlations = X_train.corrwith(y_train_codes)

# Select top k features with highest absolute correlation
k = 3
selected_features_r = correlations.abs().nlargest(k).index

print("Features selected by Pearson's r:", list(selected_features_r))

In [None]:
#Feature Selection with Kendall's tau (τ)

# Encode the species labels as integer codes so they can be ranked
y_train_codes = y_train.astype("category").cat.codes

# Calculate Kendall's tau for each feature with target
tau_scores = {}
for feature in X_train.columns:
    tau, _ = kendalltau(X_train[feature], y_train_codes)
    tau_scores[feature] = tau

# Select top k features with highest absolute tau
k = 3
selected_features_tau = pd.Series(tau_scores).abs().nlargest(k).index

# Print selected features
print("Features selected by Kendall's tau:", list(selected_features_tau))

In [None]:
#Feature Selection with Mutual Information (MI)
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Select top k features with highest mutual information with the target
k = 3
selector = SelectKBest(score_func=mutual_info_classif, k=k)
selector.fit(X_train, y_train)

# Mutual Information score for each feature
# (the estimate is randomized, so values can vary slightly between runs)
mi_scores = pd.Series(selector.scores_, index=X_train.columns)
selected_features_mi = X_train.columns[selector.get_support()]

# Print selected features
print("Features selected by Mutual Information:", list(selected_features_mi))

In [None]:
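#Feature Selection with Logistic Regression with L1 penalty
from sklearn.linear_model import LogisticRegression

# Fit Logistic Regression with L1 penalty
# (saga supports L1 with a multiclass target; it may need many iterations on unscaled features)
model = LogisticRegression(penalty='l1', solver='saga', max_iter=5000)
model.fit(X_train, y_train)

# coef_ has one row per class: keep a feature if any class gives it a non-zero coefficient
nonzero = (model.coef_ != 0).any(axis=0)
selected_features_lr = X_train.columns[nonzero]

# Print selected features
print("Features selected by Logistic Regression L1:", list(selected_features_lr))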
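
For the comparison step, one option is to collect the feature names chosen by each method into a single table and count how many methods picked each feature. The sketch below is a hypothetical helper (`selection_table` is not part of any library). The feature names `f1` to `f3` and the two method names are placeholders for illustration only, not results from the Iris data.

```python
import pandas as pd

def selection_table(selections, all_features):
    """Build a features x methods boolean table with a per-feature vote count.

    selections: dict mapping a method name to the feature names it selected.
    all_features: every candidate feature name, in display order.
    """
    all_features = list(all_features)
    table = pd.DataFrame(
        {
            method: [feature in set(chosen) for feature in all_features]
            for method, chosen in selections.items()
        },
        index=all_features,
    )
    # Number of methods that selected each feature
    table["votes"] = table.sum(axis=1)
    # Stable sort keeps the original feature order among ties
    return table.sort_values("votes", ascending=False, kind="stable")

# Placeholder input, purely illustrative
example = {
    "method_a": ["f1", "f2"],
    "method_b": ["f1", "f3"],
}
table = selection_table(example, ["f1", "f2", "f3"])
print(table)
```

In the exercise, the dictionary would hold the feature names selected by each of the four methods, and `all_features` would be `X_train.columns`. Features with the most votes are the ones the methods agree on.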