# Feature Combination Correlation Analysis

This notebook explores how well each extracted numerical feature, alone and in combination with others, correlates with the binary classification label `connection_type` (which takes the values `wifi` or `Unknown`).

For each combination size (1, 2, 3, and 4 features), we build a logistic regression classifier using 5‑fold cross‑validation, handling missing values via mean imputation. Then we report the top 10 feature combinations (ranked by F1 score) along with their accuracy, F1 score, and recall.

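Only numerical features are used in the analysis; non-feature columns (e.g. IDs, IP addresses) are dropped. The label is encoded using scikit-learn's LabelEncoder, which assigns integer codes in sorted order, so `Unknown` becomes 0 and `wifi` becomes 1. The F1 score and recall are therefore reported for the `wifi` class.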
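
As a small illustration of the label encoding described above, the sketch below runs `LabelEncoder` on a toy list containing the two label values. The list is made up for the example and is not taken from the dataset; the point is only to show which class ends up as the positive class (code 1) for the F1 and recall metrics.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels using the two values named above (illustrative, not the real data)
labels = ["wifi", "Unknown", "wifi", "Unknown", "Unknown"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted; each label is replaced by its index in that array
mapping = {cls: int(code) for code, cls in enumerate(le.classes_)}

print(list(le.classes_))
print(encoded.tolist())
print(mapping)
```

Because uppercase letters sort before lowercase ones, `Unknown` comes first in `classes_`, and scikit-learn's binary `f1` and `recall` scorers treat code 1 (`wifi`) as the positive class.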

In [1]:
import pandas as pd
import numpy as np
from itertools import combinations
from tqdm import tqdm

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, StratifiedKFold

# Change this path to your processed CSV file
input_csv = 'campus_queue_processed.csv'
df = pd.read_csv(input_csv)

# Display available columns
print('Columns in dataset:', df.columns.tolist())

# Define non-feature columns
non_feature_cols = ['id_upload', 'id_download', 'connection_type', 'IP']

# For all other columns, try to convert to numeric (coerce errors) and keep those with some non-NaN values
feature_cols = []
for col in df.columns:
    if col in non_feature_cols:
        continue
    df[col] = pd.to_numeric(df[col], errors='coerce')
    if df[col].notna().sum() > 0:
        feature_cols.append(col)

print('Identified feature columns:', feature_cols)

# Encode the connection_type label
if 'connection_type' not in df.columns:
    raise ValueError('connection_type column not found in dataset')

le = LabelEncoder()
df['label_enc'] = le.fit_transform(df['connection_type'])

# At this point, df contains numeric features (in feature_cols) and a numeric label in 'label_enc'.


Columns in dataset: ['id_upload', 'id_download', 'connection_type', 'throughput_upload', 'throughput_download', 'IP', 'series_upload_TCP.Backoff_count', 'series_upload_TCP.Backoff_mean', 'series_upload_TCP.Backoff_median', 'series_upload_TCP.Backoff_std', 'series_upload_TCP.Backoff_min', 'series_upload_TCP.Backoff_max', 'series_upload_TCP.Backoff_range', 'series_upload_TCP.Backoff_first', 'series_upload_TCP.Backoff_last', 'series_upload_TCP.Backoff_trend', 'series_upload_TCP.Backoff_skew', 'series_upload_TCP.Backoff_kurtosis', 'series_upload_TCP.Backoff_slope', 'series_upload_TCP.RcvSsThresh_count', 'series_upload_TCP.RcvSsThresh_mean', 'series_upload_TCP.RcvSsThresh_median', 'series_upload_TCP.RcvSsThresh_std', 'series_upload_TCP.RcvSsThresh_min', 'series_upload_TCP.RcvSsThresh_max', 'series_upload_TCP.RcvSsThresh_range', 'series_upload_TCP.RcvSsThresh_first', 'series_upload_TCP.RcvSsThresh_last', 'series_upload_TCP.RcvSsThresh_trend', 'series_upload_TCP.RcvSsThresh_skew', 'series_upl



## Helper Function: Evaluate a Feature Combination

This function takes a tuple of feature names, performs mean imputation on the selected features, and uses logistic regression with 5‑fold cross‑validation to return the average accuracy, F1 score, and recall.

In [2]:
def evaluate_feature_combo(combo):
    """
    Evaluate a given combination of features using logistic regression with 5-fold CV.
    Returns a tuple: (combo, mean_accuracy, mean_f1, mean_recall)
    """
    X = df[list(combo)].values
    y = df['label_enc'].values
    
    # Impute missing values using mean imputation
    imp = SimpleImputer(strategy='mean')
    X_imputed = imp.fit_transform(X)
    
    # Set up 5-fold stratified CV
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    clf = LogisticRegression(max_iter=1000)
    scores = cross_validate(clf, X_imputed, y, cv=cv, 
                             scoring=['accuracy', 'f1', 'recall'], 
                             n_jobs=-1)
    
    mean_accuracy = np.mean(scores['test_accuracy'])
    mean_f1 = np.mean(scores['test_f1'])
    mean_recall = np.mean(scores['test_recall'])
    
    return (combo, mean_accuracy, mean_f1, mean_recall)
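

One possible variation on the helper above, sketched under stated assumptions: the imputer and a `StandardScaler` are wrapped in a scikit-learn pipeline so that both are fitted on the training folds only, and the features are standardized before logistic regression. The function name `evaluate_with_pipeline` and the synthetic data are hypothetical and serve only to make the example runnable; this is not the notebook's own method, and scores from it need not match the outputs shown in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_with_pipeline(X, y, n_splits=5, seed=42):
    """Hypothetical variant: impute and scale inside each CV fold."""
    model = make_pipeline(
        SimpleImputer(strategy="mean"),   # column means come from training folds only
        StandardScaler(),                 # standardize features before the solver runs
        LogisticRegression(max_iter=1000),
    )
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_validate(model, X, y, cv=cv,
                            scoring=["accuracy", "f1", "recall"])
    return {name: float(np.mean(scores[f"test_{name}"]))
            for name in ("accuracy", "f1", "recall")}


# Synthetic stand-in data: two features on very different scales, 10% missing
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2)) * [1.0, 1000.0]
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 1000.0 > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

demo_scores = evaluate_with_pipeline(X_demo, y_demo)
print(demo_scores)
```

Fitting the preprocessing inside the pipeline keeps test-fold statistics out of the imputation step, and scaling is one of the remedies that scikit-learn's convergence warning itself suggests.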


## Evaluate All Feature Combinations

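For each combination size (1, 2, 3, and 4), we iterate over all possible combinations of the features, evaluate each one with the helper above, and record the results. We then sort the results by F1 score (highest first) and print the top 10 combinations for each size. The number of combinations grows quickly with size: the progress bar in the output below reports 1,163,575 pairs at about 2.86 combinations per second, which corresponds to more than 100 hours for size 2 alone.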

In [None]:
from math import comb

results_summary = {}

for r in [1, 2, 3, 4]:
    print(f"Evaluating combinations of size {r}...")
    combo_results = []
    # Iterate over the combinations lazily. Materializing them in a list
    # can exhaust memory for larger r, so tqdm gets the expected count instead.
    n_combos = comb(len(feature_cols), r)
    for combo in tqdm(combinations(feature_cols, r), total=n_combos,
                      desc=f"Size {r} combos", leave=False):
        try:
            res = evaluate_feature_combo(combo)
            combo_results.append(res)
        except Exception as e:
            print(f"Error with combo {combo}: {e}")

    # Sort results by mean F1 score (index 2 of each tuple) in descending order
    combo_results_sorted = sorted(combo_results, key=lambda x: x[2], reverse=True)
    top10 = combo_results_sorted[:10]
    results_summary[r] = top10

    print(f"\nTop 10 combinations for size {r}:")
    for comb_features, acc, f1, recall in top10:
        print(f"Features: {comb_features}, Accuracy: {acc:.3f}, F1: {f1:.3f}, Recall: {recall:.3f}")
    print("\n")


Evaluating combinations of size 1...


ABNORMAL: .

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
                                                                  


Top 10 combinations for size 1:
Features: ('series_upload_TCP.SndMSS_max',), Accuracy: 0.718, F1: 0.273, Recall: 0.187
Features: ('series_upload_TCP.SndMSS_first',), Accuracy: 0.718, F1: 0.273, Recall: 0.187
Features: ('series_upload_TCP.SndMSS_mean',), Accuracy: 0.718, F1: 0.273, Recall: 0.187
Features: ('series_upload_TCP.SndMSS_median',), Accuracy: 0.718, F1: 0.273, Recall: 0.187
Features: ('series_upload_TCP.SndMSS_min',), Accuracy: 0.718, F1: 0.273, Recall: 0.187
Features: ('series_upload_TCP.SndMSS_last',), Accuracy: 0.718, F1: 0.273, Recall: 0.187
Features: ('series_download_TCP.SndMSS_mean',), Accuracy: 0.718, F1: 0.272, Recall: 0.187
Features: ('series_download_TCP.SndMSS_median',), Accuracy: 0.718, F1: 0.272, Recall: 0.187
Features: ('series_download_TCP.SndMSS_min',), Accuracy: 0.718, F1: 0.272, Recall: 0.187
Features: ('series_download_TCP.SndMSS_max',), Accuracy: 0.718, F1: 0.272, Recall: 0.187


Evaluating combinations of size 2...


Size 2 combos:   0%|          | 4232/1163575 [24:33<112:38:21,  2.86it/s]
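

To put the progress bar above in context, the following sketch estimates how many combinations each size involves. The feature count is not stated anywhere in the notebook; here it is inferred from the pair total in the progress bar by solving `comb(n, 2) == 1163575`. The throughput of 2.86 combinations per second is taken from the same line and is assumed, for illustration only, to hold for every size.

```python
from math import comb

# Total number of pairs shown in the progress bar above
pairs_total = 1_163_575

# Infer the feature count n by searching for comb(n, 2) == pairs_total
n_features = next(n for n in range(2, 10_000) if comb(n, 2) == pairs_total)
print(f"inferred number of features: {n_features}")

# Throughput from the progress bar; assumed constant across sizes
rate_per_second = 2.86

for r in (1, 2, 3, 4):
    count = comb(n_features, r)
    hours = count / rate_per_second / 3600
    print(f"size {r}: {count:,} combinations, "
          f"about {hours:,.0f} hours at {rate_per_second} it/s")
```

Under these assumptions, size 3 already involves hundreds of millions of combinations, so an exhaustive search beyond pairs is unlikely to finish in practical time without reducing the feature set first.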

## Results Summary

The dictionary `results_summary` contains, for each combination size that finished evaluating, the top 10 feature combinations (ranked by F1 score) along with their performance metrics. The cell below exports them to a JSON file for further analysis.

In [None]:
import json

# Convert results_summary to a serializable format
results_serializable = {}
for size, combos in results_summary.items():
    results_serializable[size] = []
    for combo, acc, f1, recall in combos:
        results_serializable[size].append({
            "features": list(combo),  # convert tuple to list
            "accuracy": acc,
            "f1": f1,
            "recall": recall
        })

# Save to a JSON file on disk
output_file = "results_summary.json"
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(results_serializable, f, indent=4)

print(f"Results summary saved to {output_file}")
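

As a usage sketch, the exported file can be read back as shown below. The example writes a small stand-in summary to a temporary file so that it runs on its own; the feature names and metric values in `demo_summary` are invented for illustration and are not results from this notebook.

```python
import json
import os
import tempfile

# Stand-in with the same structure as results_summary (values are made up)
demo_summary = {
    1: [(("feat_a",), 0.70, 0.25, 0.18)],
    2: [(("feat_a", "feat_b"), 0.72, 0.30, 0.21)],
}

serializable = {
    size: [{"features": list(c), "accuracy": a, "f1": f, "recall": rc}
           for c, a, f, rc in combos]
    for size, combos in demo_summary.items()
}

path = os.path.join(tempfile.mkdtemp(), "results_summary_demo.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(serializable, fh, indent=4)

with open(path, encoding="utf-8") as fh:
    loaded = json.load(fh)

# JSON object keys are always strings, so convert them back to integers
restored = {int(size): rows for size, rows in loaded.items()}

print(sorted(loaded.keys()))
print(restored[2][0]["features"])
```

The integer combination sizes used as dictionary keys are written as strings by `json.dump`, so code that reloads the file needs the conversion shown in `restored` to look results up by size.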
