## Feature selection methods on the SECOM dataset

Dataset Link : https://archive.ics.uci.edu/ml/datasets/SECOM

Drive Link : https://docs.google.com/spreadsheets/d/1dFCe1zgokabsiEr6BbWmMJtiMefkrChpJWLiG_0dDkk/edit?usp=share_link

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd

data = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQtBXo5cBnDsM2fmfHPm6u72KGUS5FjPHNGMxOfYjA9-CAhmnRpwkIw_rOR3sANJIToiUU__6fbBvig/pub?gid=572763137&single=true&output=csv")

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1567 entries, 0 to 1566
Columns: 592 entries, Time to Pass/Fail
dtypes: float64(590), int64(1), object(1)
memory usage: 7.1+ MB


In [None]:
data.head()

Unnamed: 0,Time,0,1,2,3,4,5,6,7,8,...,581,582,583,584,585,586,587,588,589,Pass/Fail
0,2008-07-19 11:55:00,3030.93,2564.0,2187.7333,1411.1265,1.3602,100.0,97.6133,0.1242,1.5005,...,,0.5005,0.0118,0.0035,2.363,,,,,-1
1,2008-07-19 12:32:00,3095.78,2465.14,2230.4222,1463.6606,0.8294,100.0,102.3433,0.1247,1.4966,...,208.2045,0.5019,0.0223,0.0055,4.4447,0.0096,0.0201,0.006,208.2045,-1
2,2008-07-19 13:17:00,2932.61,2559.94,2186.4111,1698.0172,1.5102,100.0,95.4878,0.1241,1.4436,...,82.8602,0.4958,0.0157,0.0039,3.1745,0.0584,0.0484,0.0148,82.8602,1
3,2008-07-19 14:43:00,2988.72,2479.9,2199.0333,909.7926,1.3204,100.0,104.2367,0.1217,1.4882,...,73.8432,0.499,0.0103,0.0025,2.0544,0.0202,0.0149,0.0044,73.8432,-1
4,2008-07-19 15:22:00,3032.24,2502.87,2233.3667,1326.52,1.5334,100.0,100.3967,0.1235,1.5031,...,,0.48,0.4766,0.1045,99.3032,0.0202,0.0149,0.0044,73.8432,-1


In [None]:
#shape of data
data.shape

(1567, 592)

In [None]:
#drop date column
data.drop('Time',axis=1,inplace=True)

In [None]:
#check the null values
data.isnull().sum()

Unnamed: 0,0
0,6
1,7
2,14
3,14
4,14
...,...
586,1
587,1
588,1
589,1


# Handling missing values

In [None]:
# Almost every column has missing values, so dropping the affected rows or columns is not an option.
# Fill each gap with a random number drawn uniformly between the column's min and max.
for i in data.columns:
  n_missing = data[i].isnull().sum()
  if n_missing > 0:
    min_value = data[i].min()
    max_value = data[i].max()
    # Generate random numbers within the range
    random_values = np.random.uniform(min_value, max_value, size=n_missing)

    # Create a Series with the random values, aligned to the missing rows
    random_series = pd.Series(random_values, index=data[i][data[i].isnull()].index)

    # Fill NaN values with the random series (assign back instead of inplace on a column selection)
    data[i] = data[i].fillna(random_series)
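
The random fill above draws different values on every run because no seed is set. A minimal sketch of the same idea on a toy DataFrame, assuming a seeded `np.random.default_rng` generator; the helper name `fill_random_uniform` and the toy data are illustrative and not part of the notebook:

```python
import numpy as np
import pandas as pd


def fill_random_uniform(frame, seed=0):
    """Fill NaNs in each column with uniform draws between that column's min and max."""
    rng = np.random.default_rng(seed)  # fixed seed makes the fill reproducible
    filled = frame.copy()
    for col in filled.columns:
        missing = filled[col].isnull()
        if missing.any():
            low, high = filled[col].min(), filled[col].max()
            # Assign the draws only at the missing positions
            filled.loc[missing, col] = rng.uniform(low, high, size=missing.sum())
    return filled


# Hypothetical toy data standing in for the SECOM sensor columns
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 5.0], "b": [10.0, 20.0, np.nan, np.nan]})
toy_filled = fill_random_uniform(toy, seed=42)
print(toy_filled)
```

Calling the helper twice with the same seed gives the same filled values, which makes later accuracy numbers easier to compare between runs.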

In [None]:
# Print any column that still contains missing values (no output means none remain)
for i in data.columns:
  if data[i].isnull().sum()>0:
    print(i)

In [None]:
#check the duplicated rows
data.duplicated().sum()

0

In [None]:
#check type of dataset
type(data)

In [None]:
# Separate features and target columns
X = data.drop('Pass/Fail', axis=1)
y = data['Pass/Fail']

In [None]:
# Split the dataset into train and test sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
X_train.shape,X_test.shape

((1253, 590), (314, 590))

# Check the accuracy with all 590 features

In [None]:
#apply logistic regression
model=LogisticRegression(max_iter=1000)
model.fit(X_train,y_train)

In [None]:
#prediction
y_pred=model.predict(X_test)

In [None]:
accuracy=accuracy_score(y_test,y_pred)
print("Test accuracy with all features:-", accuracy*100,'%')

Test accuracy with all features:- 90.12738853503186 %



To perform filter-based feature selection on the UCI SECOM dataset, which has 592 columns including the "Time" stamp and the target column "Pass/Fail", we can use the following methods:

1. Duplicate Features:
   - Identify and remove duplicate columns from the dataset. Columns with identical values provide redundant information and do not contribute to the prediction task.

2. Variance Threshold Method:
   - Calculate the variance of each feature.
   - Remove features with low variance, as they tend to have little or no predictive power.
   - Set a threshold value for variance and remove features below that threshold.

3. Correlation:
   - Compute the correlation matrix of the features.
   - Identify highly correlated features and choose one from each highly correlated group.
   - High correlation between features indicates redundancy, and removing one from each correlated group helps reduce multicollinearity.

4. ANOVA (Analysis of Variance):
   - Perform an ANOVA test between each feature and the target variable ("Pass/Fail").
   - Select features with a significant impact on the target variable.
   - Set a significance level (e.g., p-value threshold) for the test to determine the importance of each feature.

5. Chi-Squared:
   - Apply the Chi-Squared test between each feature and the target variable, treating both as categorical. Note that scikit-learn's `chi2` scorer also requires non-negative feature values.
   - Select features with a significant association with the target variable.
   - Set a significance level (e.g., p-value threshold) to determine the importance of each feature.
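
As a rough illustration of the Chi-Squared method from the list above, the sketch below runs `SelectKBest` with `chi2` on synthetic data. The data, the variable names, and the choice of `MinMaxScaler` to make the features non-negative are assumptions made for this example only:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 rows, 5 features, labels in {-1, 1} like Pass/Fail
y_demo = rng.choice([-1, 1], size=200)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 0] += 2.0 * (y_demo == 1)  # make feature 0 informative

# chi2 rejects negative inputs, so rescale each feature to [0, 1] first
X_scaled = MinMaxScaler().fit_transform(X_demo)

# Keep the 2 features with the highest chi-squared statistic
selector = SelectKBest(score_func=chi2, k=2).fit(X_scaled, y_demo)
print(selector.get_support())
print(selector.pvalues_)
```

The p-values in `selector.pvalues_` can be compared against a chosen significance level instead of fixing `k`.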


# 1. Remove duplicate features

In [None]:
# Find duplicated columns: transpose so columns become rows; True marks a column that repeats an earlier one
data.T.duplicated()

Unnamed: 0,0
0,False
1,False
2,False
3,False
4,False
...,...
586,False
587,False
588,False
589,False


In [None]:
# Remove duplicate features
# Get the subset of columns with duplicate values
duplicated_cols = data.columns[data.T.duplicated()]
# Remove the duplicated columns
data = data.drop(columns=duplicated_cols)

In [None]:
# Number of columns after removing the duplicated columns
data.shape[1]

479
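
To make the transpose trick concrete, here is a small sketch on a hypothetical three-column DataFrame (the column names are made up for the example):

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],   # identical to "a"
    "c": [4, 5, 6],
})

# Transpose so each column becomes a row; duplicated() flags the later copies
dup_mask = toy.T.duplicated()
deduped = toy.drop(columns=toy.columns[dup_mask])
print(list(deduped.columns))
```

Only the first column of each identical group is kept, so `"b"` is dropped and `"a"` survives.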

# 2. Variance threshold method

In [None]:
from sklearn.feature_selection import VarianceThreshold

In [None]:
vt_model=VarianceThreshold(threshold=0.02)

In [None]:
#fit the model
sel=vt_model.fit(data)

In [None]:
# Boolean mask of the columns that pass the threshold
sel.get_support()

array([ True,  True,  True,  True,  True, False,  True, False, False,
       False, False, False,  True, False,  True,  True,  True, False,
        True,  True, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True,  True, False,  True,  True, False, False,
        True, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False, False, False, False, False, False, False, False,
        True, False, False, False, False,  True, False,  True, False,
       False, False, False, False,  True,  True, False, False, False,
       False, False, False, False, False, False, False, False,  True,
        True,  True, False, False,  True, False,  True, False, False,
       False, False,  True, False, False,  True,  True, False,  True,
        True, False,

In [None]:
# Names of the columns that passed the variance threshold
columns=data.columns[sel.get_support()]
columns.shape

(311,)

In [None]:
# Apply the variance threshold to the dataset; the result is a NumPy array
data_vt=sel.transform(data)

In [None]:
data_vt

array([[ 3.03093000e+03,  2.56400000e+03,  2.18773330e+03, ...,
         2.36300000e+00,  1.56008292e+02, -1.00000000e+00],
       [ 3.09578000e+03,  2.46514000e+03,  2.23042220e+03, ...,
         4.44470000e+00,  2.08204500e+02, -1.00000000e+00],
       [ 2.93261000e+03,  2.55994000e+03,  2.18641110e+03, ...,
         3.17450000e+00,  8.28602000e+01,  1.00000000e+00],
       ...,
       [ 2.97881000e+03,  2.37978000e+03,  2.20630000e+03, ...,
         3.05900000e+00,  4.35231000e+01, -1.00000000e+00],
       [ 2.89492000e+03,  2.53201000e+03,  2.17703330e+03, ...,
         3.56620000e+00,  9.34941000e+01, -1.00000000e+00],
       [ 2.94492000e+03,  2.45076000e+03,  2.19544440e+03, ...,
         3.62750000e+00,  1.37784400e+02, -1.00000000e+00]])

The result is a NumPy array, so convert it back to a DataFrame with the selected column names.

In [None]:
df=pd.DataFrame(data_vt,columns=columns)

In [None]:
print("Number of Columns after Variance Threshold Method- ",df.shape[1])

Number of Columns after Variance Threshold Method-  311
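
The threshold is compared against each column's raw variance, so the result depends on the scale of the feature. A small sketch with hypothetical columns, using the same `threshold=0.02` as above:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],      # variance 0
    "tiny": [0.50, 0.51, 0.50, 0.51],      # variance about 0.000025
    "wide": [10.0, 20.0, 30.0, 40.0],      # variance 125
})

selector = VarianceThreshold(threshold=0.02).fit(toy)
kept = toy.columns[selector.get_support()]
# Rebuild a DataFrame so the column names survive the transform
toy_vt = pd.DataFrame(selector.transform(toy), columns=kept, index=toy.index)
print(list(kept))
print(selector.variances_)
```

Here `"tiny"` is removed even though it is not constant, simply because its values live on a small scale.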


# 3. Remove highly correlated columns

In [None]:
corr_matrix = df.corr().abs()
corr_matrix

Unnamed: 0,0,1,2,3,4,6,12,14,15,16,...,571,572,573,574,576,577,581,585,589,Pass/Fail
0,1.000000,0.138667,0.000189,0.022339,0.029163,0.000918,0.002732,0.002604,0.021386,0.028544,...,0.021821,0.011220,0.003857,0.012729,0.010749,0.006323,0.046469,0.012819,0.001295,0.027121
1,0.138667,1.000000,0.012057,0.004087,0.035133,0.021492,0.034123,0.039603,0.100452,0.050808,...,0.034179,0.001415,0.011524,0.000963,0.002226,0.010939,0.010250,0.006006,0.043691,0.003013
2,0.000189,0.012057,1.000000,0.282325,0.113482,0.110213,0.018322,0.002030,0.012889,0.004926,...,0.027507,0.001584,0.033401,0.000224,0.001447,0.029699,0.048783,0.019642,0.021813,0.002167
3,0.022339,0.004087,0.282325,1.000000,0.035194,0.613390,0.032956,0.001587,0.021208,0.007755,...,0.024482,0.004051,0.008224,0.003237,0.004564,0.012243,0.033334,0.022905,0.084588,0.027608
4,0.029163,0.035133,0.113482,0.035194,1.000000,0.005753,0.005694,0.008735,0.013601,0.005522,...,0.093813,0.021292,0.026620,0.021541,0.021562,0.013264,0.018239,0.004256,0.060185,0.024338
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
577,0.006323,0.010939,0.029699,0.012243,0.013264,0.014718,0.029766,0.010597,0.017833,0.001670,...,0.121115,0.863768,0.957874,0.851784,0.859278,1.000000,0.014276,0.015041,0.024815,0.049633
581,0.046469,0.010250,0.048783,0.033334,0.018239,0.015231,0.050050,0.005628,0.001992,0.004498,...,0.009184,0.015528,0.013280,0.014570,0.015994,0.014276,1.000000,0.001423,0.154144,0.034262
585,0.012819,0.006006,0.019642,0.022905,0.004256,0.026034,0.002443,0.003850,0.010346,0.000434,...,0.014504,0.014536,0.010286,0.013654,0.013497,0.015041,0.001423,1.000000,0.008115,0.003299
589,0.001295,0.043691,0.021813,0.084588,0.060185,0.050618,0.038482,0.067075,0.002566,0.020040,...,0.010964,0.022756,0.027196,0.020555,0.022653,0.024815,0.154144,0.008115,1.000000,0.002755


In [None]:
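# Keep only the upper triangle (above the diagonal) so each pair of columns is checked once
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Drop every column whose absolute correlation with an earlier column exceeds 0.8
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.8)]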

In [None]:
df= df.drop(to_drop, axis=1)

In [None]:
# Number of columns after the correlation filter
df.shape[1]

198
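
The upper-triangle filter can be wrapped in a small function and checked on data where the answer is known. This is a sketch under assumptions: the helper name `drop_correlated` and the synthetic columns are illustrative only.

```python
import numpy as np
import pandas as pd


def drop_correlated(frame, threshold=0.8):
    """Drop each column whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Strict upper triangle: each pair appears once and the diagonal is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


rng = np.random.default_rng(0)
x = rng.normal(size=100)
toy = pd.DataFrame({
    "x": x,
    "x_scaled": 2.0 * x + 1.0,          # linear function of "x", so correlation is 1
    "noise": rng.normal(size=100),      # unrelated column
})
reduced, dropped = drop_correlated(toy)
print(dropped)
```

Because only the later column of each correlated pair is dropped, the column order decides which member of a group survives.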

# 4. ANOVA (Analysis of Variance)

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

In [None]:
# ANOVA needs the target column separated from the features
X=df.drop(columns=['Pass/Fail'])
y=df['Pass/Fail']

In [None]:
# Apply SelectKBest with the ANOVA F-value
k_best = SelectKBest(score_func=f_classif, k=100)  # k=100 keeps the 100 highest-scoring features

In [None]:
df_k=k_best.fit(X,y)

In [None]:
df_k.get_support()

array([ True, False, False,  True,  True, False, False,  True, False,
       False, False, False,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False, False,  True,  True,
       False,  True, False, False,  True, False,  True, False, False,
       False,  True, False, False,  True,  True,  True, False,  True,
       False,  True,  True, False,  True,  True,  True, False, False,
       False,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True, False,  True,
       False, False,  True,  True,  True,  True, False, False,  True,
        True, False, False,  True, False,  True,  True,  True,  True,
       False, False,  True,  True,  True,  True,  True,  True, False,
       False, False, False, False,  True, False, False,  True,  True,
        True, False, False, False,  True,  True,  True, False, False,
        True,  True,  True,  True, False,  True, False, False, False,
        True,  True,

In [None]:
# Names of the columns selected by the ANOVA test
columns=X.columns[df_k.get_support()]
columns.shape

(100,)
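
`SelectKBest` only returns a mask, but the underlying F-scores and p-values can also be inspected directly with `f_classif`. A minimal sketch on synthetic data (the column names and the data are assumptions made for this example):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
# Synthetic labels in {-1, 1}, mimicking the Pass/Fail encoding
y_demo = pd.Series(rng.choice([-1, 1], size=300), name="Pass/Fail")
X_demo = pd.DataFrame({
    "informative": rng.normal(size=300) + 1.5 * (y_demo == 1),  # mean shifts with the label
    "noise_1": rng.normal(size=300),
    "noise_2": rng.normal(size=300),
})

# One F-statistic and one p-value per feature
f_scores, p_values = f_classif(X_demo, y_demo)
ranking = (
    pd.DataFrame({"feature": X_demo.columns, "F": f_scores, "p_value": p_values})
    .sort_values("F", ascending=False)
    .reset_index(drop=True)
)
print(ranking)
```

Sorting by F-score shows how far apart the selected and rejected features are, which can help when choosing `k` or a p-value cutoff.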

In [None]:
# Leftover from step 2: re-applies the variance-threshold selector to data; data_k is not used below
data_k=sel.transform(data)

In [None]:
# Keep only the ANOVA-selected columns of X
X_k=df_k.transform(X)

In [None]:
# Rebuild a DataFrame with the selected column names
X= pd.DataFrame(X_k,columns=columns)

In [None]:
# Check the shape of the dataset
X.shape

(1567, 100)

In [None]:
X

Unnamed: 0,0,3,4,14,21,22,24,25,26,28,...,547,559,564,565,566,568,569,572,573,581
0,3030.93,1411.1265,1.3602,7.955800,-5419.00,2916.50,751.00,0.8955,1.7730,64.2333,...,395.570,0.4385,10.989265,0.199709,12.488487,11.555681,16.680809,8.95,0.3157,718.294199
1,3095.78,1463.6606,0.8294,10.154800,-5441.50,2604.25,-1640.25,1.2973,2.0143,68.4222,...,408.798,0.1745,1.518442,0.533363,4.323298,8.848036,75.860306,5.92,0.2653,208.204500
2,2932.61,1698.0172,1.5102,9.515700,-5447.75,2701.75,-1916.50,1.3122,2.0295,67.1333,...,411.136,0.3718,1.100000,0.621900,0.412200,0.411900,68.848900,11.21,0.1882,82.860200
3,2988.72,909.7926,1.3204,9.605200,-5468.25,2648.25,-1657.25,1.3137,2.0038,62.9333,...,372.822,0.7288,7.320000,0.163000,3.561100,2.729000,25.036300,9.33,0.1738,73.843200
4,3032.24,1326.5200,1.5334,10.566100,-5476.25,2635.25,117.00,1.2887,1.9912,62.8333,...,399.914,0.2156,17.997907,0.127347,1.397276,10.936339,25.950622,8.83,0.2224,438.215849
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1562,2899.41,3085.3781,1.4843,11.769200,-5418.75,2608.00,356.00,1.2817,1.9540,71.1444,...,401.774,0.3553,4.980000,0.087700,2.090200,1.884400,15.466200,7.98,0.2363,203.172000
1563,3052.31,1124.6595,0.8763,9.162000,-6408.75,2277.50,339.00,1.0870,1.8023,72.8444,...,400.814,0.3105,4.560000,0.130800,1.742000,1.708900,20.911800,5.48,0.3891,179.399084
1564,2978.81,1110.4967,0.8236,18.798112,-5153.25,2707.00,-1226.00,1.2930,1.9435,71.2667,...,391.040,0.1266,11.090000,0.238800,4.412800,4.319700,29.095400,6.49,0.4154,43.523100
1565,2894.92,1183.7287,1.5726,9.735400,-5271.75,2676.50,394.75,1.2875,1.9880,70.5111,...,400.814,0.1920,4.980000,0.087700,2.090200,1.884400,15.466200,9.13,0.3669,93.494100


In [None]:
#split the dataset train and test
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
# Initialize and train logistic regression model
log_reg = LogisticRegression(max_iter=10000)  # Increase max_iter if it doesn't converge
log_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = log_reg.predict(X_test)

# Calculate and print accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy after all feature selection:", accuracy)

Test accuracy after all feature selection: 0.9012738853503185
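
In the steps above each selector is fitted on all rows before the train/test split. One possible alternative, sketched here on synthetic data from `make_classification` rather than the SECOM file, is to chain the selectors and the classifier in a scikit-learn `Pipeline` so that the selection is learned from the training rows only. The step names and parameter values are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the SECOM features (not the real data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=50, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("variance", VarianceThreshold(threshold=0.02)),
    ("anova", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)  # the selectors see only the training rows
demo_accuracy = accuracy_score(y_te, pipe.predict(X_te))
print(demo_accuracy)
```

The fitted pipeline applies the same column selection to the test rows automatically when `predict` is called.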
