<a href="https://colab.research.google.com/github/urosgodnov/BigData/blob/master/Machine_Learning_with_Python_3_Feature_selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Using Python to implement the machine learning process
by Dr. Uros Godnov**

# Feature selection

In [1]:
import pandas as pd
import numpy as np

from google.colab import drive
import sys
# Mount Google Drive
drive.mount('/content/gdrive')
# Add the directory with custom modules to the import path
sys.path.append('/content/gdrive/MyDrive/Google_Colab_modules')

import sweetviz as sv
import ydata_profiling as ydp

Mounted at /content/gdrive


In [3]:
df=pd.read_csv("https://raw.githubusercontent.com/urosgodnov/datasets/refs/heads/master/laptip_prices_without_missing_values.csv")

df.head()

Unnamed: 0,company,product,typename,inches,screenresolution,cpu_company,cpu_type,cpu_frequency_ghz,ram_gb,memory,gpu_company,gpu_type,opsys,weight_kg,price_euro
0,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel,Core i5,2.3,8.0,128GB SSD,Intel,Iris Plus Graphics 640,macOS,1.37,1339.69
1,Apple,Macbook Air,Ultrabook,13.3,1440x900,Intel,Core i5,1.8,8.0,128GB Flash Storage,Intel,HD Graphics 6000,macOS,1.34,898.94
2,HP,250 G6,Notebook,15.6,Full HD 1920x1080,Intel,Core i5 7200U,2.5,8.0,256GB SSD,Intel,HD Graphics 620,No OS,1.86,575.0
3,Apple,MacBook Pro,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel,Core i7,2.7,16.0,512GB SSD,AMD,Radeon Pro 455,macOS,1.83,2537.45
4,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel,Core i5,3.1,8.0,256GB SSD,Intel,Iris Plus Graphics 650,macOS,1.37,1803.6


## Correlation matrix

In [27]:
df.select_dtypes(include=['number','float']).corr()

Unnamed: 0,inches,cpu_frequency_ghz,ram_gb,weight_kg,price_euro
inches,1.0,0.305037,0.241078,0.826638,0.067419
cpu_frequency_ghz,0.305037,1.0,0.366254,0.318649,0.427772
ram_gb,0.241078,0.366254,1.0,0.38937,0.739986
weight_kg,0.826638,0.318649,0.38937,1.0,0.2122
price_euro,0.067419,0.427772,0.739986,0.2122,1.0


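Note that `DataFrame.corr()` reports **no** p-values, so the matrix alone does not tell us whether a correlation is statistically significant.

What is a p-value? It is the probability of observing data at least as extreme as the data actually observed, assuming the null hypothesis is true.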

Key points about p-values:
- Low p-value (e.g., < 0.05): Suggests that the observed data is unlikely under the null hypothesis, leading to rejection of the null hypothesis in favor of the alternative hypothesis. This implies the result is statistically significant.

- High p-value (e.g., > 0.05): Indicates that the observed data is consistent with the null hypothesis, meaning there is not enough evidence to reject it. The result is considered not statistically significant.

Interpretation:
- p < 0.01: Strong evidence against the null hypothesis.
- p < 0.05: Moderate evidence against the null hypothesis (common threshold for significance).
- p ≥ 0.05: Little or no evidence against the null hypothesis; you fail to reject it, which is not the same as showing that it is true.

While p-values are useful for determining statistical significance, they don't measure the size or importance of an effect and can be affected by sample size. Therefore, it's crucial to interpret them in the context of the study and other statistics (e.g., confidence intervals).
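The dependence on sample size can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not part of the laptop dataset: the slope of 0.2, the sample sizes, and the helper name `weak_relationship_p_value` are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def weak_relationship_p_value(n):
    """Simulate n points with a weak linear relationship and return (r, p)."""
    x = rng.normal(size=n)
    y = 0.2 * x + rng.normal(size=n)  # slope 0.2 is an arbitrary illustrative choice
    r, p = pearsonr(x, y)
    return r, p

r_small, p_small = weak_relationship_p_value(20)
r_large, p_large = weak_relationship_p_value(2000)

print(f"n=20:   r={r_small:.3f}, p={p_small:.3g}")
print(f"n=2000: r={r_large:.3f}, p={p_large:.3g}")
```

The underlying relationship is equally weak in both runs, yet the larger sample yields a far smaller p-value. That is why a p-value should be read together with the effect size.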

In [4]:
from scipy.stats import pearsonr

# Keep only the numeric columns
dfC = df.select_dtypes(include='number')

correlation_matrix = dfC.corr()
p_values = pd.DataFrame(index=dfC.columns, columns=dfC.columns)

# pearsonr returns the correlation coefficient and its two-sided p-value
for col1 in dfC.columns:
    for col2 in dfC.columns:
        if col1 == col2:
            p_values.loc[col1, col2] = None  # p-value for self correlation is not meaningful
        else:
            corr, p_val = pearsonr(dfC[col1], dfC[col2])
            p_values.loc[col1, col2] = p_val

print("Correlation Matrix:")
print(correlation_matrix)
print("\nP-Values Matrix:")
print(p_values)

Correlation Matrix:
                     inches  cpu_frequency_ghz    ram_gb  weight_kg  \
inches             1.000000           0.305037  0.241078   0.826638   
cpu_frequency_ghz  0.305037           1.000000  0.366254   0.318649   
ram_gb             0.241078           0.366254  1.000000   0.389370   
weight_kg          0.826638           0.318649  0.389370   1.000000   
price_euro         0.067419           0.427772  0.739986   0.212200   

                   price_euro  
inches               0.067419  
cpu_frequency_ghz    0.427772  
ram_gb               0.739986  
weight_kg            0.212200  
price_euro           1.000000  

P-Values Matrix:
                     inches cpu_frequency_ghz ram_gb weight_kg price_euro
inches                 None               0.0    0.0       0.0   0.016052
cpu_frequency_ghz       0.0              None    0.0       0.0        0.0
ram_gb                  0.0               0.0   None       0.0        0.0
weight_kg               0.0               0.0  

- Strongest correlation: ram_gb has the strongest relationship with price_euro (correlation = 0.7400), and this relationship is highly significant.
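- Moderate correlation: cpu_frequency_ghz shows a moderate correlation with price_euro (correlation = 0.4278).
- Weak correlations: weight_kg (correlation = 0.2122) and inches (correlation = 0.0674) have weak but statistically significant correlations with price_euro. For inches the p-value is about 0.016, so it is significant at the 0.05 level but not at 0.01.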

## T test and ANOVA

**Key Assumptions** (t test):
- The two groups are independent (no overlap).
- The data is approximately normally distributed.
- Variances between the two groups are equal (homogeneity of variance). If not, a Welch's t-test can be used as a variant.

**Key points** (ANOVA):
- ANOVA is used when comparing the means of three or more groups.
- It tests whether at least one group mean differs significantly, without running multiple pairwise t-tests, which would increase the risk of Type I errors (false positives).
- It relies on the same assumptions as the t-test: independent groups, approximately normal data within each group, and equal variances across groups.
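The Type I error inflation mentioned above can be made concrete with a little arithmetic. The sketch below assumes the pairwise tests are independent, which is a simplification, and the helper name `familywise_error_rate` is introduced here for illustration only.

```python
from math import comb

def familywise_error_rate(n_groups, alpha=0.05):
    """Chance of at least one false positive across all pairwise tests.

    Assumes the pairwise tests are independent, which is a simplification.
    """
    n_tests = comb(n_groups, 2)  # number of distinct group pairs
    return n_tests, 1 - (1 - alpha) ** n_tests

for k in (3, 5, 9):
    n_tests, fwer = familywise_error_rate(k)
    print(f"{k} groups -> {n_tests} pairwise tests, family-wise error rate ~ {fwer:.3f}")
```

With only three groups the chance of at least one false positive already rises from 5% to roughly 14%, and it grows quickly with the number of groups. A single ANOVA avoids this by testing all groups at once.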

In [7]:
import pandas as pd
from scipy import stats

# Identify object-type columns
object_columns = df.select_dtypes(include='object').columns

# Iterate through each object-type column
for col in object_columns:
    # Drop rows with NaN in the current column or 'price_euro'
    df_clean = df.dropna(subset=[col, 'price_euro'])
    unique_values = df_clean[col].unique()

    if len(unique_values) == 2:
        # Perform t-test for two unique values
        group1 = df_clean[df_clean[col] == unique_values[0]]['price_euro']
        group2 = df_clean[df_clean[col] == unique_values[1]]['price_euro']

        if len(group1) >= 2 and len(group2) >= 2:
            # Perform independent t-test
            t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)  # Welch's t-test

            print(f"T-test for column '{col}', comparing '{unique_values[0]}' vs '{unique_values[1]}':")
            print(f"t-statistic: {t_stat:.4f}, p-value: {p_value:.4e}\n")
        else:
            print(f"Not enough data in one of the groups for column '{col}'.\n")

    elif len(unique_values) >= 3:
        # Perform ANOVA for three or more unique values
        groups = [df_clean[df_clean[col] == value]['price_euro'] for value in unique_values]

        # Check if all groups have at least two observations
        if all(len(group) >= 2 for group in groups):
            # Perform one-way ANOVA
            f_stat, p_value = stats.f_oneway(*groups)

            print(f"ANOVA for column '{col}':")
            print(f"F-statistic: {f_stat:.4f}, p-value: {p_value:.4e}\n")
        else:
            print(f"Not enough data in one of the groups for column '{col}'.\n")


ANOVA for column 'company':
F-statistic: 13.2786, p-value: 8.9374e-37

Not enough data in one of the groups for column 'product'.

ANOVA for column 'typename':
F-statistic: 154.9003, p-value: 1.4603e-128

Not enough data in one of the groups for column 'screenresolution'.

Not enough data in one of the groups for column 'cpu_company'.

Not enough data in one of the groups for column 'cpu_type'.

Not enough data in one of the groups for column 'memory'.

Not enough data in one of the groups for column 'gpu_company'.

Not enough data in one of the groups for column 'gpu_type'.

ANOVA for column 'opsys':
F-statistic: 18.8354, p-value: 6.4984e-27



## Scikit feature selection

### SelectKBest with f_regression

In [16]:
from sklearn.feature_selection import SelectKBest, f_regression


In [20]:
y = df['price_euro']

X = df.drop('price_euro', axis=1)
# get_dummies is similar to one-hot encoding
X_dummies = pd.get_dummies(X)
X_dummies.head()

Unnamed: 0,inches,cpu_frequency_ghz,ram_gb,weight_kg,company_Acer,company_Apple,company_Asus,company_Chuwi,company_Dell,company_Fujitsu,...,gpu_type_UHD Graphics 620,opsys_Android,opsys_Chrome OS,opsys_Linux,opsys_Mac OS X,opsys_No OS,opsys_Windows 10,opsys_Windows 10 S,opsys_Windows 7,opsys_macOS
0,13.3,2.3,8.0,1.37,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
1,13.3,1.8,8.0,1.34,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
2,15.6,2.5,8.0,1.86,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
3,15.4,2.7,16.0,1.83,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
4,13.3,3.1,8.0,1.37,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True


In [15]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_dummies)

X_scaled

array([[-1.20574632e+00, -5.91760846e-03, -8.64993649e-02, ...,
        -7.94614867e-02, -1.91273014e-01,  9.85276221e+00],
       [-1.20574632e+00, -9.98674290e-01, -8.64993649e-02, ...,
        -7.94614867e-02, -1.91273014e-01,  9.85276221e+00],
       [ 4.03873147e-01,  3.91185064e-01, -8.64993649e-02, ...,
        -7.94614867e-02, -1.91273014e-01, -1.01494381e-01],
       ...,
       [-7.15862135e-01, -1.39577696e+00, -1.26393734e+00, ...,
        -7.94614867e-02, -1.91273014e-01, -1.01494381e-01],
       [ 4.03873147e-01,  3.91185064e-01, -4.78978689e-01, ...,
        -7.94614867e-02, -1.91273014e-01, -1.01494381e-01],
       [ 4.03873147e-01, -1.39577696e+00, -8.71458014e-01, ...,
        -7.94614867e-02, -1.91273014e-01, -1.01494381e-01]])

In [18]:
selector = SelectKBest(score_func=f_regression, k=10)  # Select top 10 features

# Fit the selector to the data
X_new = selector.fit_transform(X_scaled, y)

# Get the boolean mask of selected features
mask = selector.get_support()

# Get the list of selected feature names (the mask refers to the dummy columns)
selected_features = X_dummies.columns[mask]

selected_features

Index(['cpu_frequency_ghz', 'ram_gb', 'typename_Gaming', 'typename_Notebook',
       'screenresolution_1366x768', 'cpu_type_Core i7 7700HQ',
       'memory_1TB HDD', 'memory_512GB SSD', 'gpu_company_Nvidia',
       'gpu_type_GeForce GTX 1070'],
      dtype='object')

**Because we created dummy columns, we get back the names of the dummy columns rather than the names of the original columns.**
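One possible way to trace selected dummy columns back to their source columns is to give `get_dummies` a separator that does not occur in the column names and split on it. This is a minimal sketch on made-up toy data: the separator `__`, the toy values, and the list of "selected" dummies are assumptions for illustration.

```python
import pandas as pd

# Toy data; the values are made up for illustration
X_toy = pd.DataFrame({
    "ram_gb": [8, 16, 4],
    "opsys": ["macOS", "No OS", "Windows 10"],
    "gpu_company": ["Intel", "AMD", "Nvidia"],
})

# A separator that does not occur in the column names keeps the split unambiguous
SEP = "__"
X_toy_dummies = pd.get_dummies(X_toy, prefix_sep=SEP)

# Map every dummy column back to the column it came from
origin = {name: name.split(SEP)[0] for name in X_toy_dummies.columns}

# Pretend these dummy columns were picked by a selector
selected_dummies = ["ram_gb", "opsys__macOS", "gpu_company__Nvidia", "gpu_company__AMD"]
selected_originals = sorted({origin[name] for name in selected_dummies})
print(selected_originals)
```

The default separator `_` would not work here, because names such as `gpu_company` and `ram_gb` already contain an underscore.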

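#### Cheating with ordinal encoding

Ordinal encoding keeps one column per original feature, so the selected names stay readable. It is a shortcut, though: the integer codes follow the sorted category labels, so the order they imply is arbitrary for nominal features such as `company`.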

In [22]:
from sklearn.preprocessing import OrdinalEncoder

categorical_cols=df.select_dtypes(include='object').columns

ordinal_encoder = OrdinalEncoder()

X_encoded = X.copy()

X_encoded[categorical_cols] = ordinal_encoder.fit_transform(X[categorical_cols])

X_encoded

Unnamed: 0,company,product,typename,inches,screenresolution,cpu_company,cpu_type,cpu_frequency_ghz,ram_gb,memory,gpu_company,gpu_type,opsys,weight_kg
0,1.0,300.0,4.0,13.3,23.0,1.0,40.0,2.3,8.0,4.0,2.0,56.0,8.0,1.37
1,1.0,301.0,4.0,13.3,1.0,1.0,40.0,1.8,8.0,2.0,2.0,50.0,8.0,1.34
2,7.0,50.0,3.0,15.6,8.0,1.0,46.0,2.5,8.0,16.0,2.0,52.0,4.0,1.86
3,1.0,300.0,4.0,15.4,25.0,1.0,54.0,2.7,16.0,29.0,0.0,76.0,8.0,1.83
4,1.0,300.0,4.0,13.3,23.0,1.0,40.0,3.1,8.0,16.0,2.0,57.0,8.0,1.37
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1270,10.0,580.0,0.0,14.0,13.0,1.0,55.0,2.5,4.0,4.0,2.0,46.0,5.0,1.80
1271,10.0,588.0,0.0,13.3,19.0,1.0,55.0,2.5,16.0,29.0,2.0,46.0,5.0,1.30
1272,10.0,196.0,3.0,14.0,0.0,1.0,20.0,1.6,2.0,35.0,2.0,39.0,5.0,1.50
1273,7.0,2.0,3.0,15.6,0.0,1.0,55.0,2.5,6.0,10.0,0.0,88.0,5.0,2.19


In [27]:
selector = SelectKBest(score_func=f_regression, k=7)

X_new = selector.fit_transform(X_encoded, y)


# Get the boolean mask of selected features
mask = selector.get_support()

# Get the list of selected feature names (X_encoded has the same columns as X)
selected_features = X_encoded.columns[mask]

selected_features

Index(['screenresolution', 'cpu_type', 'cpu_frequency_ghz', 'ram_gb',
       'gpu_company', 'opsys', 'weight_kg'],
      dtype='object')

**Feature importance**

In [29]:
scores = selector.scores_
pvalues = selector.pvalues_

feature_importances = pd.DataFrame({
    'Feature': X_encoded.columns,  # The original feature names after encoding
    'F-Score': scores,
    'p-Value': pvalues
})

# Sort by F-score to display the most important features
feature_importances_sorted = feature_importances.sort_values(by='F-Score', ascending=False)
feature_importances_sorted.head(7)

Unnamed: 0,Feature,F-Score,p-Value
8,ram_gb,1540.757154,1.707206e-221
6,cpu_type,358.278755,1.338087e-70
7,cpu_frequency_ghz,285.117851,7.077937e-58
10,gpu_company,148.841271,1.862867e-32
4,screenresolution,147.621257,3.228876e-32
12,opsys,117.650866,2.79958e-26
13,weight_kg,60.024633,1.903397e-14
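For a single feature, the F-score reported by `f_regression` follows directly from the Pearson correlation r with the target: F = r² / (1 − r²) · (n − 2). The sketch below checks this relationship on synthetic data. The data, the coefficients, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(42)
n = 500
X_demo = rng.normal(size=(n, 3))
# Third feature has no influence on the target
y_demo = 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(size=n)

f_scores, p_vals = f_regression(X_demo, y_demo)

# Recompute the F-scores by hand from the Pearson correlations
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1] for j in range(X_demo.shape[1])])
f_manual = r**2 / (1 - r**2) * (n - 2)

print(np.round(f_scores, 3))
print(np.round(f_manual, 3))
```

The same relationship connects the two tables in this notebook: for ram_gb, r = 0.739986 gives r² / (1 − r²) ≈ 1.2103, and 1540.757 / 1.2103 ≈ 1273, which plays the role of n − 2. The F-score ranking is therefore the ranking by absolute correlation.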


# Creating a pipeline

Creating a pipeline with Scikit-learn has several advantages, particularly in machine learning workflows. Here’s why you should use a pipeline:

- Streamlines Workflow: Combines multiple steps like preprocessing, feature selection, and model training into a single object, making the workflow more organized and easier to manage.

- Prevents Data Leakage: During cross-validation, every transformation (like scaling or encoding) is fitted on the training folds only and then applied to the held-out fold, so no information from the validation or test data leaks into training.

- Consistency: Ensures that the same transformations are applied to both the training and test datasets, maintaining uniformity throughout the machine learning process.

- Cleaner Code: Simplifies code by chaining all preprocessing and modeling steps, reducing the need to manually apply transformations for every step.

- Cross-Validation Compatibility: Allows seamless use of cross-validation (cross_val_score, GridSearchCV, etc.) by treating the entire pipeline as a single entity, making it easy to optimize hyperparameters across all steps.

- Reusable: The pipeline can be reused for both training and prediction, applying all transformations automatically to any new data.

- Modular Design: Each step in the pipeline is modular, making it easier to update or replace specific parts (e.g., swapping out a model or a preprocessing step) without affecting the entire workflow.

- Reduces Human Error: By automating data transformations within the pipeline, it reduces the chance of making mistakes (like forgetting to scale test data) when performing steps manually.
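Several of these points can be seen in a few lines of code. The following is a minimal sketch on synthetic data generated with `make_regression`. The step names, the choice of `LinearRegression`, and all parameter values are assumptions for illustration rather than the setup used later in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for a real dataset
X_demo, y_demo = make_regression(n_samples=300, n_features=20, n_informative=5,
                                 noise=10.0, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("feature_selection", SelectKBest(score_func=f_regression, k=5)),
    ("model", LinearRegression()),
])

# Each fold refits the scaler and the selector on that fold's training part only
cv_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="r2")
print(cv_scores.round(3))
```

Because the whole pipeline is passed to `cross_val_score`, scaling and feature selection never see the held-out fold. Swapping the model or the selector only requires changing one step.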

In [31]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.compose import ColumnTransformer



In [None]:
# pipeline = Pipeline(steps=[
#    ('step_name', transformer_or_model),
#    ('step_name2', transformer_or_model2),
#    # Add more steps as needed
# ])

## Pipeline for preprocessing

### SelectKBest

In [90]:
df=pd.read_csv("https://raw.githubusercontent.com/urosgodnov/datasets/refs/heads/master/laptip_prices_without_missing_values.csv")


# Separate the features from the target
X = df.drop('price_euro', axis=1)
y = df['price_euro']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

In [91]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OrdinalEncoder(), categorical_cols)
    ])

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Preprocessing step
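    # Feature selection step. Note: f_classif is the ANOVA F-test for a categorical target,
    # so with the continuous price_euro every distinct price is treated as its own class;
    # f_regression is the score function intended for a continuous target.
    ('feature_selection', SelectKBest(score_func=f_classif, k=7))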
])

In [92]:
pipeline.fit(X, y)

X_selected = pipeline.transform(X)

selector = pipeline.named_steps['feature_selection']

scores = selector.scores_  # F-scores
pvalues = selector.pvalues_  # p-values

mask = selector.get_support()

# ColumnTransformer outputs the numerical columns first, then the categorical ones,
# so the mask must be applied to that order rather than to X.columns
feature_names = pd.Index(numerical_cols + categorical_cols)

selected_features = feature_names[mask]

# Output the selected feature names
print("Selected Features:", selected_features)

Selected Features: Index(['cpu_frequency_ghz', 'ram_gb', 'weight_kg', 'screenresolution',
       'cpu_type', 'memory', 'opsys'],
      dtype='object')


In [95]:
# feature_names follows the column order after preprocessing, matching scores and pvalues
feature_importances_selectkbest = pd.DataFrame({
    'Feature': feature_names,
    'F-Score': scores,
    'p-Value': pvalues
})

feature_importances_selectkbest = feature_importances_selectkbest[feature_importances_selectkbest['Feature'].isin(selected_features)].sort_values(by='F-Score', ascending=False)
feature_importances_selectkbest.head()

Unnamed: 0,Feature,F-Score,p-Value
2,ram_gb,4.332661,5.600711e-61
1,cpu_frequency_ghz,1.643684,1.405849e-09
7,screenresolution,1.597559,1.009262e-08
13,opsys,1.596341,1.062513e-08
9,cpu_type,1.474721,1.496231e-06


### Recursive Feature Elimination (RFE)

In [96]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Preprocessing step
    ('feature_selection', RFE(estimator=LinearRegression(), n_features_to_select=7))  # Feature selection step
])


In [99]:
pipeline.fit(X, y)

X_selected = pipeline.transform(X)

selector = pipeline.named_steps['feature_selection']

ranking = selector.ranking_
support = selector.support_

mask = selector.get_support()

# ColumnTransformer outputs the numerical columns first, then the categorical ones,
# so the mask must be applied to that order rather than to X.columns
feature_names = pd.Index(numerical_cols + categorical_cols)

selected_features = feature_names[mask]

# Output the selected feature names
print("Selected Features:", selected_features)

Selected Features: Index(['inches', 'cpu_frequency_ghz', 'ram_gb', 'weight_kg', 'cpu_company',
       'gpu_company', 'opsys'],
      dtype='object')


In [100]:
# feature_names follows the column order after preprocessing, matching ranking and support
feature_importances_RFE = pd.DataFrame({
    'Feature': feature_names,
    'Ranking': ranking,
    'Selected': support
})

feature_importances_RFE = feature_importances_RFE[feature_importances_RFE['Feature'].isin(selected_features)].sort_values(by='Ranking')
feature_importances_RFE.head()

Unnamed: 0,Feature,Ranking,Selected
0,inches,1,True
1,cpu_frequency_ghz,1,True
2,ram_gb,1,True
3,weight_kg,1,True
8,cpu_company,1,True
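Every selected feature has ranking 1, so a table filtered to the selected features cannot show how RFE orders the rest. RFE repeatedly fits the estimator, drops the feature with the smallest importance (for a linear model, the smallest absolute coefficient), and refits on what remains. The sketch below prints the full ranking on synthetic data. The data and parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 features, only 3 of which carry signal
X_demo, y_demo = make_regression(n_samples=200, n_features=8, n_informative=3,
                                 noise=5.0, random_state=0)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)
rfe.fit(X_demo, y_demo)

# ranking_ is 1 for every kept feature; larger numbers were eliminated earlier
for idx, (rank, kept) in enumerate(zip(rfe.ranking_, rfe.support_)):
    print(f"feature {idx}: ranking={rank}, selected={kept}")
```

With the default `step=1` one feature is removed per iteration, so the eliminated features receive rankings 2, 3, and so on, with the highest number going to the feature dropped first.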
