### Feature Selection Exercise

In this exercise, we will apply feature selection to an Iris flowers dataset in which the target variable is Species. The goal is to identify the features that are most relevant for distinguishing the species of each Iris flower. The dataset is from: https://www.kaggle.com/datasets/uciml/iris

1. Load the dataset from the exercise's GitHub repository (Iris.csv).
2. Using business logic and common sense, drop features that are clearly irrelevant to the target variable.
3. Preprocess your data (split it into training and testing sets).
4. Apply feature selection with the following methods:
    - Pearson's correlation coefficient (r)
    - Kendall's tau (τ)
    - Mutual Information (MI)
    - Logistic Regression with L1 penalty
5. Compare the results of each feature selection method:
    - Which features did you manually drop before applying the feature selection methods? Explain why.
    - Are there any common features selected across multiple methods?
    - Can you explain why certain features were selected based on their characteristics?

(Optional) Visualize feature importance with bar charts or heatmaps to make the methods easier to compare.



In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from scipy.stats import pearsonr, kendalltau

In [2]:
# Load the dataset
iris_df = pd.read_csv("Iris.csv")

# Display the first few rows of the dataset
print(iris_df.head())

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm FlowerColour  \
0   1            5.1           3.5            1.4           0.2       Purple   
1   2            4.9           3.0            1.4           0.2       Orange   
2   3            4.7           3.2            1.3           0.2        Black   
3   4            4.6           3.1            1.5           0.2        White   
4   5            5.0           3.6            1.4           0.2         Teal   

   YearCollected  MonthCollected  StigmaLegnth      Species  
0           2003               2             2  Iris-setosa  
1           1998               9             1  Iris-setosa  
2           1995               5             3  Iris-setosa  
3           2008               3             3  Iris-setosa  
4           2007               9             1  Iris-setosa  


In [3]:
# Drop features that common sense says cannot help identify the species
irrelevant_features = ['Id', 'FlowerColour', 'YearCollected']
iris_df = iris_df.drop(irrelevant_features, axis=1)

# Separate features (X) and target variable (y)
X = iris_df.drop('Species', axis=1)
y = iris_df['Species']

# Preprocess the data: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
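
With only 150 rows, a purely random split can leave the three species unevenly represented in the test set. The following is a minimal sketch of a stratified split. It assumes scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for `Iris.csv`, and the `*_demo`, `X_tr`, and `X_te` names are illustrative only.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled Iris (assumption; the exercise loads Iris.csv)
X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Count how many test rows belong to each class
test_class_counts = Counter(y_te)
print(test_class_counts)
```

Whether stratification matters here depends on the split that `random_state=42` happens to produce, so treat it as an optional refinement rather than a required step.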


In [None]:
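# Method 1: Pearson's correlation coefficient (r)
# pearsonr needs two 1-D numeric arrays: encode the species as integer codes and score each feature separately
y_train_codes = y_train.astype('category').cat.codes
correlation_scores = [pearsonr(X_train[col], y_train_codes)[0] for col in X_train.columns]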

In [None]:
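# Method 2: Kendall's tau (τ)
# kendalltau also compares two 1-D arrays: encode the species as integer codes, then score each feature separately
y_train_codes = y_train.astype('category').cat.codes
tau_scores = [kendalltau(X_train[col], y_train_codes)[0] for col in X_train.columns]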
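
Both correlation measures compare one numeric feature with one numeric target at a time, so a string target has to be encoded first. As a hedged alternative to looping over columns, pandas' `DataFrame.corrwith` can compute one coefficient per feature in a single call. The sketch below uses a toy frame with hypothetical values (`feat_a`, `feat_b`, and the `low`/`mid`/`top` labels are made up for illustration). Note that integer codes impose an order on the classes, which is an assumption when the target is nominal, as Species is.

```python
import pandas as pd

# Toy stand-in data (hypothetical values, not the exercise's Iris.csv)
X_toy = pd.DataFrame({
    "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "feat_b": [2.0, 1.0, 2.5, 1.5, 2.0, 1.0],
})
y_toy = pd.Series(["low", "low", "mid", "mid", "top", "top"])

# Encode the string labels as integer codes (alphabetical: low=0, mid=1, top=2)
y_codes = y_toy.astype("category").cat.codes

# One coefficient per feature column, aligned on the shared index
pearson_r = X_toy.corrwith(y_codes, method="pearson")
kendall_tau = X_toy.corrwith(y_codes, method="kendall")

print(pd.DataFrame({"pearson_r": pearson_r, "kendall_tau": kendall_tau}))
```

When ranking features, compare absolute values, because a strong negative correlation is as informative as a strong positive one.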

In [None]:
# Method 3: Mutual Information (MI)
mi_selector = SelectKBest(mutual_info_classif, k='all')
mi_selector.fit(X_train, y_train)
mi_scores = mi_selector.scores_
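
`mutual_info_classif` estimates MI with a nearest-neighbour method that adds a small amount of random noise to continuous features, so scores can vary slightly between runs. The sketch below fixes the seed by binding `random_state` with `functools.partial`. It assumes scikit-learn's bundled Iris data as a stand-in for `Iris.csv`, and `seeded_mi` and `mi_scores_for` are hypothetical helper names.

```python
from functools import partial

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Stand-in data: scikit-learn's bundled Iris (assumption; the exercise loads Iris.csv)
X_demo, y_demo = load_iris(return_X_y=True, as_frame=True)

# Bind random_state so the estimator's random noise is identical on every run
seeded_mi = partial(mutual_info_classif, random_state=0)


def mi_scores_for(X, y):
    """Return MI scores per feature, highest first (hypothetical helper)."""
    selector = SelectKBest(seeded_mi, k="all").fit(X, y)
    return pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)


print(mi_scores_for(X_demo, y_demo))
```

Labelling the scores with the column names also makes it easier to read the ranking than a bare array.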

In [None]:
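# Method 4: Logistic Regression with L1 penalty
# The L1 penalty can shrink the coefficients of uninformative features to exactly zero
logreg = LogisticRegression(penalty='l1', solver='liblinear')
logreg.fit(X_train, y_train)
# Keep the features whose fitted coefficients are not (near) zero
l1_selector = SelectFromModel(logreg, prefit=True)
# Boolean mask: True for each selected feature
l1_support = l1_selector.get_support()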

In [None]:
# Compare the results
feature_selection_results = pd.DataFrame({
    'Feature': X_train.columns,
    'Pearson_Correlation': correlation_scores,
    'Kendall_Tau': tau_scores,
    'Mutual_Information': mi_scores,
    'L1_Penalty': l1_support
})

# Display the results
print(feature_selection_results)
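
To answer the question about features selected by multiple methods, one option is to count how many methods place each feature among their top scores. The following is a minimal sketch under stated assumptions: it expects a table with the same column names as `feature_selection_results`, the values in `toy_results` are hypothetical placeholders (not results from this exercise), and `summarize_agreement` and `top_k` are illustrative names.

```python
import pandas as pd

# Hypothetical scores for three made-up features (placeholders, not real results)
toy_results = pd.DataFrame({
    "Feature": ["f1", "f2", "f3"],
    "Pearson_Correlation": [0.90, -0.40, 0.05],
    "Kendall_Tau": [0.80, -0.35, 0.02],
    "Mutual_Information": [0.95, 0.30, 0.00],
    "L1_Penalty": [True, True, False],
})


def summarize_agreement(results, top_k=2):
    """Count how many methods rank each feature in their top_k (or select it)."""
    scores = results.set_index("Feature")
    votes = pd.DataFrame(index=scores.index)
    for col in ["Pearson_Correlation", "Kendall_Tau", "Mutual_Information"]:
        # Rank by absolute value so strong negative correlations also count
        votes[col] = scores[col].abs().rank(ascending=False, method="min") <= top_k
    # The L1 column is already a boolean selection mask
    votes["L1_Penalty"] = scores["L1_Penalty"].astype(bool)
    return votes.sum(axis=1).sort_values(ascending=False)


agreement = summarize_agreement(toy_results)
print(agreement)
```

The resulting counts are a simple summary of agreement between methods. If matplotlib is installed, they can also be drawn with `agreement.plot.bar()` for the optional visualization step.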