## <span style="color:red">Feature Selection - Univariate Testing </span>


**Univariate** feature selection (or univariate testing) applies statistical tests to measure the relationship between the output variable and each input variable in isolation. Tests are run one input variable at a time. Which test to use depends on whether you are working on a regression task or a classification task.

### Regression Task

In a regression task, you may be given an F-score and a p-value for each variable, which together indicate the statistical significance of the relationship between each input variable and the output variable. This helps you assess how much confidence to place in the variables you have used in your model.
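To make the F-score more concrete: for a single input variable, the F-statistic reported by scikit-learn's `f_regression` can be derived from the Pearson correlation r between that variable and the output, as F = r^2 / (1 - r^2) * (n - 2), where n is the number of rows. The sketch below checks this on made-up data. The column names `x_signal` and `x_noise`, and the data itself, are illustrative assumptions and are not part of the sample dataset used in this notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.feature_selection import f_regression

# Made-up data (assumption): one informative input and one pure-noise input
rng = np.random.default_rng(42)
n_rows = 200
X_demo = pd.DataFrame({
    "x_signal": rng.normal(size=n_rows),
    "x_noise": rng.normal(size=n_rows),
})
y_demo = 3.0 * X_demo["x_signal"] + rng.normal(size=n_rows)

# F-scores and p-values as computed by scikit-learn
f_scores, p_values = f_regression(X_demo, y_demo)

# The same quantities computed by hand, one input variable at a time
dof = n_rows - 2
manual_f = []
manual_p = []
for column in X_demo.columns:
    r = np.corrcoef(X_demo[column], y_demo)[0, 1]
    f_stat = r ** 2 / (1 - r ** 2) * dof
    manual_f.append(f_stat)
    manual_p.append(stats.f.sf(f_stat, 1, dof))

comparison = pd.DataFrame({
    "input_variable": X_demo.columns,
    "f_score": f_scores,
    "manual_f_score": manual_f,
    "p_value": p_values,
    "manual_p_value": manual_p,
})
print(comparison)
```

The two F-score columns should agree, and `x_signal` should score far higher than `x_noise`. This also shows why the test is called univariate: each score depends only on one input column and the output.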

### Univariate Testing: Regression Task Code Template



In [None]:
# import packages
import pandas as pd

from sklearn.feature_selection import SelectKBest, f_regression

# import data
my_df = pd.read_csv("feature_selection_sample_data.csv")

# split into input variables (X) and output variable (y)
X = my_df.drop(["output"], axis = 1)
y = my_df["output"]

# k = "all" scores every input variable without dropping any of them
feature_selector = SelectKBest(f_regression, k = "all")
fit = feature_selector.fit(X,y)

# collect the p-value and F-score of each input variable in one table
p_values = pd.DataFrame(fit.pvalues_)
scores = pd.DataFrame(fit.scores_)
input_variable_names = pd.DataFrame(X.columns)
summary_stats = pd.concat([input_variable_names, p_values, scores], axis = 1)
summary_stats.columns = ["input_variable", "p_value", "f_score"]
summary_stats.sort_values(by = "p_value", inplace = True)

# example thresholds: adjust these to suit your data
p_value_threshold = 0.05
score_threshold = 5

# keep only the variables that pass both thresholds
selected_variables = summary_stats.loc[(summary_stats["f_score"] >= score_threshold) &
                                       (summary_stats["p_value"] <= p_value_threshold)]
selected_variables = selected_variables["input_variable"].tolist()
X_new = X[selected_variables]

### Classification Task
Depending on which test you use, you might be given a chi-square score and a p-value for each variable. Again, this indicates the statistical significance of the relationship between each input variable and the output variable.

In both regression and classification tasks, this gives you basic information about which variables may be more important than others. You can also set a threshold on the test score, the p-value, or both, so that you only include variables that appear to have a reliable relationship with the output variable you are trying to predict.

Note: A key downside of univariate testing is that it only considers variables in isolation. It does not account for variables that interact with each other.
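The note above can be illustrated with a small sketch. It uses made-up data (an assumption, not the notebook's dataset) in which the output depends only on the product of two inputs, `x1` and `x2`. A hand-built interaction column, `x1_times_x2`, is added purely for comparison; all three column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Balanced design: every combination of x1 and x2 in {-1, +1} appears equally often
x1 = np.tile([-1.0, -1.0, 1.0, 1.0], 50)
x2 = np.tile([-1.0, 1.0, -1.0, 1.0], 50)

# The output depends only on the interaction of x1 and x2, plus a little noise
rng = np.random.default_rng(0)
y_demo = x1 * x2 + rng.normal(scale=0.1, size=x1.size)

X_demo = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "x1_times_x2": x1 * x2,  # hand-built interaction term, for comparison only
})

# Univariate F-test: each column is scored against y_demo on its own
f_scores, p_values = f_regression(X_demo, y_demo)

interaction_demo = pd.DataFrame({
    "input_variable": X_demo.columns,
    "f_score": f_scores,
    "p_value": p_values,
})
print(interaction_demo)
```

Scored on their own, `x1` and `x2` should receive low F-scores even though together they determine the output almost entirely, while the explicit interaction column scores very highly. A threshold on univariate scores would therefore be likely to discard both `x1` and `x2`.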

### Univariate Testing: Classification Task Code Template

In [5]:
# import packages
import pandas as pd

from sklearn.feature_selection import SelectKBest, chi2

# import data
my_df = pd.read_csv("german_credit.csv")

# split into input variables (X) and output variable (y)
X = my_df.drop(["Creditability"], axis = 1)
y = my_df["Creditability"]

# k = "all" scores every input variable without dropping any of them
feature_selector = SelectKBest(chi2, k = "all")
fit = feature_selector.fit(X,y)

# collect the p-value and chi-square score of each input variable in one table
p_values = pd.DataFrame(fit.pvalues_)
scores = pd.DataFrame(fit.scores_)
input_variable_names = pd.DataFrame(X.columns)
summary_stats = pd.concat([input_variable_names, p_values, scores], axis = 1)
summary_stats.columns = ["input_variable", "p_value", "chi2_score"]
summary_stats.sort_values(by = "p_value", inplace = True)

# example thresholds: adjust these to suit your data
p_value_threshold = 0.05
score_threshold = 5

# keep only the variables that pass both thresholds
selected_variables = summary_stats.loc[(summary_stats["chi2_score"] >= score_threshold) &
                                       (summary_stats["p_value"] <= p_value_threshold)]
selected_variables = selected_variables["input_variable"].tolist()
X_new = X[selected_variables]

**Why does the code above not display a result?**

The code is set up correctly for chi-squared feature selection. It shows no output because it only assigns variables (`summary_stats`, `selected_variables`, `X_new`) and never prints or displays them. If the selection is still empty once you print it, a few things are worth checking:

Threshold Values: The values set for score_threshold and p_value_threshold may be too strict, so that no features meet both criteria. Adjust these thresholds based on the characteristics of your data.

Variable Types: Ensure that the target variable y is categorical, as the chi-squared test is typically used for categorical target variables.

Feature Types: Chi-squared is typically applied to categorical features, and scikit-learn's chi2 requires non-negative feature values such as counts or frequencies. If your dataset contains numerical features, you may need to discretize them or convert them into categorical variables.

Here is a modified version of the code with additional comments and print statements that display the selected variables:

In [6]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Import data
my_df = pd.read_csv("german_credit.csv")

# Extract features and target variable
X = my_df.drop(["Creditability"], axis=1)
y = my_df["Creditability"]

# Perform chi-squared feature selection
feature_selector = SelectKBest(chi2, k="all")
fit = feature_selector.fit(X, y)

# Create a DataFrame with the results
p_values = pd.DataFrame(fit.pvalues_)
scores = pd.DataFrame(fit.scores_)
input_variable_names = pd.DataFrame(X.columns)
summary_stats = pd.concat([input_variable_names, p_values, scores], axis=1)
summary_stats.columns = ["input_variable", "p_value", "chi2_score"]
summary_stats.sort_values(by="p_value", inplace=True)

# Example threshold values
p_value_threshold = 0.05
score_threshold = 5

# Select variables based on thresholds
selected_variables = summary_stats.loc[(summary_stats["chi2_score"] >= score_threshold) &
                                       (summary_stats["p_value"] <= p_value_threshold)]

# Display the selected variables
print("Selected Variables:")
print(selected_variables)

# Get the list of selected variable names
selected_variable_names = selected_variables["input_variable"].tolist()

# Create a new DataFrame with only the selected variables
X_new = X[selected_variable_names]


Selected Variables:
                       input_variable       p_value    chi2_score
4                       Credit Amount  0.000000e+00  58264.415475
1          Duration of Credit (month)  8.637197e-72    321.030795
0                     Account Balance  3.702000e-18     75.474269
5                Value Savings/Stocks  7.304944e-10     37.937451
12                        Age (years)  3.941008e-08     30.178268
2   Payment Status of Previous Credit  9.128229e-07     24.103752
11      Most valuable available asset  2.050765e-03      9.503534
6        Length of current employment  1.602710e-02      5.799899
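The advice above about discretizing numerical features before a chi-squared test can be sketched as follows. This is a minimal illustration on made-up data, not the German credit dataset, and it swaps in a different technique from the template: instead of scikit-learn's `chi2`, it bins the feature with `pandas.qcut` and runs a chi-squared test of independence with `scipy.stats.chi2_contingency`. The names `amount`, `amount_bin`, and `target` are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data (assumption): a numeric feature and a binary target that depends on it
rng = np.random.default_rng(1)
n_rows = 400
demo_df = pd.DataFrame({"amount": rng.normal(size=n_rows)})
demo_df["target"] = (
    demo_df["amount"] + rng.normal(scale=0.5, size=n_rows) > 0
).astype(int)

# Discretize the numeric feature into four equal-sized bins (quartiles)
demo_df["amount_bin"] = pd.qcut(
    demo_df["amount"], q=4, labels=["q1", "q2", "q3", "q4"]
).astype(str)

# Contingency table of bin versus class
contingency = pd.crosstab(demo_df["amount_bin"], demo_df["target"])

# Chi-squared test of independence between the binned feature and the target
chi2_stat, p_value, dof, expected = chi2_contingency(contingency)

print(contingency)
print("chi2 statistic:", chi2_stat)
print("p-value:", p_value)
print("degrees of freedom:", dof)
```

Binning in this way keeps the test statistic from depending on the units of the numeric feature. That may be relevant to the output above, where continuous columns such as Credit Amount receive very large chi2 scores, possibly because scikit-learn's `chi2` treats the raw values as frequencies.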
