Preprocessing Data: Loads the dataset, samples it for quicker processing, encodes categorical variables, handles imbalance with SMOTE, standardizes features, and splits the data.
Model Selection: Uses two models (Random Forest and Logistic Regression) with reduced hyperparameter grids for faster execution. SVC is imported but not trained in this run.
Grid Search: Uses 2-fold cross-validation with a single-value parameter grid per model for quicker searches.
Model Evaluation: Evaluates each model's performance on test data.
Model Saving: Saves the Random Forest model (selected by name in the code, not by score) to a specified file path.
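
The preprocessing described above balances and scales the whole sample before splitting it, so synthetic SMOTE rows can land in the test set. As a hedged illustration of the alternative ordering (split first, then resample only the training part), here is a minimal pure-Python sketch. It uses plain random duplication as a stand-in for SMOTE, and the function name `split_then_oversample` and the toy data are hypothetical, not part of the notebook.

```python
import random
from collections import Counter

def split_then_oversample(rows, labels, test_size=0.2, seed=42):
    """Split first, then balance only the training part by random duplication."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    X_test = [rows[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    X_train = [rows[i] for i in train_idx]
    y_train = [labels[i] for i in train_idx]

    # Duplicate random training rows of each smaller class up to the majority count.
    counts = Counter(y_train)
    majority = max(counts.values())
    for cls, count in counts.items():
        pool = [rows[i] for i in train_idx if labels[i] == cls]
        extra = [rng.choice(pool) for _ in range(majority - count)]
        X_train.extend(extra)
        y_train.extend([cls] * len(extra))

    return X_train, X_test, y_train, y_test

# Toy data (hypothetical): 90 rows of class "a" and 30 rows of class "b".
rows = [(i,) for i in range(120)]
labels = ["a"] * 90 + ["b"] * 30
X_train, X_test, y_train, y_test = split_then_oversample(rows, labels)
print(Counter(y_train))  # both classes end up with the same count
print(len(X_test))       # 24 original rows, none of them duplicated
```

With this ordering the test set contains only original rows, so its accuracy is measured on data the resampling step never touched.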

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from imblearn.over_sampling import SMOTE
import pandas as pd
import joblib

def preprocess_data(df, target_column="nutriscore_grade", test_size=0.2, sample_size=0.1):  # Reduce sample size for faster processing
    df_sample = df.sample(frac=sample_size, random_state=42) if sample_size < 1.0 else df.copy()

    X = df_sample.drop(columns=[target_column])
    y = df_sample[target_column]

    X = pd.get_dummies(X, drop_first=True)
    smote = SMOTE(random_state=42)
    X_balanced, y_balanced = smote.fit_resample(X, y)

    scaler = StandardScaler()
    X_balanced = scaler.fit_transform(X_balanced)

    X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=test_size, random_state=42)

    return X_train, X_test, y_train, y_test

# Load data
file_path = "C:/data/simplon_dev_ia_projects/flask_projects/nutriscore_prediction_app/data/final_csv_2.csv"
df = pd.read_csv(file_path)

# Preprocess data
X_train, X_test, y_train, y_test = preprocess_data(df)

# Define a smaller grid for faster execution
models = {
    "Random Forest": {
        "model": RandomForestClassifier(random_state=42),
        "params": {
            'n_estimators': [50],  # Smaller grid
            'max_depth': [5]
        }
    },
    "Logistic Regression": {
        "model": LogisticRegression(max_iter=1000, random_state=42),
        "params": {
            'C': [1],  # Smaller grid
            'solver': ['liblinear']
        }
    }
}

# Train and evaluate models with a simplified GridSearchCV
best_models = {}
for model_name, config in models.items():
    print(f"Training {model_name} with GridSearchCV...")
    grid_search = GridSearchCV(
        estimator=config['model'],
        param_grid=config['params'],
        cv=2,  # Fewer folds for faster execution
        scoring='accuracy',
        n_jobs=1  # Run without parallelism to reduce resource usage
    )
    grid_search.fit(X_train, y_train)
    
    # Best model and performance
    best_model = grid_search.best_estimator_
    best_score = grid_search.best_score_
    best_params = grid_search.best_params_
    
    print(f"\nBest {model_name} Performance:")
    print("Accuracy on CV data:", best_score)
    print("Best Parameters:", best_params)

    # Evaluate on test set
    y_pred = best_model.predict(X_test)
    print("\nTest Performance:")
    print("Accuracy on test data:", accuracy_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    
    best_models[model_name] = best_model
    print("\n" + "-"*50 + "\n")

# Save the Random Forest model (hard-coded here rather than picked by score)
final_best_model = best_models["Random Forest"]
save_path = "C:/data/simplon_dev_ia_projects/flask_projects/nutriscore_prediction_app/trained_models/best_model_1_1.joblib"
joblib.dump(final_best_model, save_path)
print(f"Best model saved successfully at: {save_path}")


Training Random Forest with GridSearchCV...

Best Random Forest Performance:
Accuracy on CV data: 0.4745585874799358
Best Parameters: {'max_depth': 5, 'n_estimators': 50}

Test Performance:
Accuracy on test data: 0.4812199036918138
Classification Report:
               precision    recall  f1-score   support

           a       0.55      0.54      0.55       604
           b       0.40      0.70      0.51       608
           c       0.59      0.08      0.14       628
           d       0.57      0.30      0.39       650
           e       0.49      0.80      0.61       625

    accuracy                           0.48      3115
   macro avg       0.52      0.48      0.44      3115
weighted avg       0.52      0.48      0.44      3115

Confusion Matrix:
 [[326 246   4   8  20]
 [134 424  12  11  27]
 [103 244  51  91 139]
 [ 17  75  17 195 346]
 [ 11  71   3  37 503]]

--------------------------------------------------

Training Logistic Regression with GridSearchCV...


#### Feature Selection and Model Optimization Report
Objective
The objective was to identify the subset of features that maximizes model performance when predicting nutriscore_grade. We removed selected features, evaluated the model on each resulting subset, and kept the subset that scored best.

Methodology
Step 1: Define the Initial Feature Set
We started with a comprehensive feature set, excluding only the target variable, nutriscore_grade.
This initial set contains every other column in the dataset; each iteration then excludes one to three of them.
Step 2: Iterative Feature Removal
To identify the optimal feature subset, we implemented a systematic loop over combinations of features. For each iteration:

Feature Selection: We removed one, two, or three features and retained the remaining features for that iteration.
Model Preprocessing and Training:
Data was preprocessed, including encoding categorical variables and balancing the dataset with SMOTE.
We split the balanced data into training and test sets.
We trained the model on the training data using the reduced feature set.
Performance Tracking: After training, we evaluated each feature subset by measuring accuracy on the test data. The results, including cross-validation and test accuracies, were saved for each subset.
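
The loop over feature combinations in Step 2 can be sketched as follows. This is a minimal illustration, assuming at most three features are removed per iteration; the helper name `exclusion_sets` and the shortened five-column feature list are hypothetical.

```python
from itertools import combinations
from math import comb

def exclusion_sets(features, max_removed=3):
    """Yield every tuple of 1..max_removed features to exclude."""
    for k in range(1, max_removed + 1):
        yield from combinations(features, k)

# Hypothetical short feature list; the real dataset has more columns.
features = ["pnns_groups_1", "pnns_groups_2", "food_groups", "fat_100g", "sugars_100g"]

subsets = list(exclusion_sets(features))
print(len(subsets))  # 5 + 10 + 10 = 25

# The same count from the binomial coefficients.
expected = sum(comb(len(features), k) for k in range(1, 4))
print(expected)  # 25
```

The number of subsets grows quickly with the number of columns: 15 columns would already give 15 + 105 + 455 = 575 iterations, each with its own preprocessing and grid search.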
Step 3: Grid Search for Model Optimization
For each subset of features:

We applied GridSearchCV with a small parameter grid to expedite the process. Specifically:
A Random Forest model was used as the classifier.
A single-value grid (n_estimators = 50, max_depth = 5) with 2-fold cross-validation was used to keep execution time low.
Cross-validation (CV) scores and test accuracies were recorded for each feature subset.
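
To see why a small grid keeps Step 3 cheap, the number of model fits GridSearchCV performs can be estimated as candidates times folds, plus one final refit when `refit=True` (the scikit-learn default). The helper `count_fits` below is a hypothetical illustration, not a scikit-learn function, and the wider grid is an invented comparison.

```python
from math import prod

def count_fits(param_grid, cv, refit=True):
    """Estimate how many times GridSearchCV fits the estimator for a dict grid."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates * cv + (1 if refit else 0)

# Grid and fold count used in this notebook.
notebook_grid = {"n_estimators": [50], "max_depth": [5]}
print(count_fits(notebook_grid, cv=2))  # 1 candidate * 2 folds + 1 refit = 3

# Hypothetical wider grid for comparison.
wider_grid = {"n_estimators": [50, 100, 200], "max_depth": [5, 10, None]}
print(count_fits(wider_grid, cv=3))  # 9 candidates * 3 folds + 1 refit = 28
```

Since this cost is paid once per feature subset, the single-candidate grid is what keeps the whole search tractable.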
Step 4: Results Compilation
After iterating through all subsets:

Results Storage: The test accuracy for each subset of removed features was stored and ranked.
Best Subset Selection: The feature subset that achieved the highest test accuracy was selected as the optimal subset.
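
The ranking in Step 4 can be sketched with a plain dictionary sort. The accuracies below are the seven single-feature results visible in this notebook's printed log; the full run covers more subsets, so the top entry here is only the best among these seven. The variable name `ranking` is illustrative.

```python
# Test accuracies copied from the notebook's printed output (single-feature removals only).
results = {
    ("pnns_groups_1",): {"test_accuracy": 0.6417335473515249},
    ("pnns_groups_2",): {"test_accuracy": 0.6613162118780096},
    ("food_groups",): {"test_accuracy": 0.6584269662921348},
    ("energy-kcal_100g",): {"test_accuracy": 0.6362760834670947},
    ("fat_100g",): {"test_accuracy": 0.6545746388443018},
    ("saturated-fat_100g",): {"test_accuracy": 0.6205457463884431},
    ("carbohydrates_100g",): {"test_accuracy": 0.6584269662921348},
}

# Sort subsets from highest to lowest test accuracy.
ranking = sorted(results.items(), key=lambda item: item[1]["test_accuracy"], reverse=True)
for excluded, scores in ranking:
    print(f"{scores['test_accuracy']:.4f}  excluded={excluded}")

best_subset = ranking[0][0]
print(best_subset)  # ('pnns_groups_2',)
```

Because the subset is chosen by its test accuracy, that same accuracy is likely an optimistic estimate of the final model's performance; ranking on a separate validation split would avoid this.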
Step 5: Final Model Training and Saving
The model was refit on the training split of the sampled data using the best-performing feature subset.
The final model was saved using joblib for future deployment in the application.
Key Findings
The feature subset that achieved the best test accuracy excluded [excluded_features], demonstrating that removing these features improved model performance.
The final model was trained with this subset and saved for deployment.
Conclusion
This feature selection approach allowed us to systematically identify the optimal feature set, balancing performance and computational efficiency. By leveraging iterative feature exclusion and model performance tracking, we ensured that only the most relevant features contributed to the final model's performance.

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE
import pandas as pd
import joblib
from itertools import combinations

# Define function for preprocessing
def preprocess_data(df, selected_features=None, target_column="nutriscore_grade", test_size=0.2, sample_size=0.1):
    df_sample = df.sample(frac=sample_size, random_state=42) if sample_size < 1.0 else df.copy()

    X = df_sample[selected_features] if selected_features else df_sample.drop(columns=[target_column])
    y = df_sample[target_column]

    X = pd.get_dummies(X, drop_first=True)
    smote = SMOTE(random_state=42)
    X_balanced, y_balanced = smote.fit_resample(X, y)

    scaler = StandardScaler()
    X_balanced = scaler.fit_transform(X_balanced)

    X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=test_size, random_state=42)

    return X_train, X_test, y_train, y_test

# Load data
file_path = "C:/data/simplon_dev_ia_projects/flask_projects/nutriscore_prediction_app/data/final_csv_2.csv"
df = pd.read_csv(file_path)

# Define features and remove target
all_features = df.columns.drop("nutriscore_grade").tolist()

# Define the model and parameters for grid search
model = RandomForestClassifier(random_state=42)
params = {'n_estimators': [50], 'max_depth': [5]}

# Initialize dictionary to store results
results = {}

# Loop over every combination of one, two, or three features to remove
for num_features_to_remove in range(1, 4):  # Test removing 1, 2, or 3 features at a time
    for features_to_remove in combinations(all_features, num_features_to_remove):
        # Select features for the current iteration
        selected_features = [f for f in all_features if f not in features_to_remove]
        
        print(f"\nEvaluating with features excluding: {features_to_remove}")
        
        # Preprocess data with selected features
        X_train, X_test, y_train, y_test = preprocess_data(df, selected_features=selected_features)
        
        # Initialize GridSearchCV
        grid_search = GridSearchCV(
            estimator=model,
            param_grid=params,
            cv=2,
            scoring='accuracy',
            n_jobs=1
        )
        grid_search.fit(X_train, y_train)
        
        # Best model and performance
        best_model = grid_search.best_estimator_
        best_score = grid_search.best_score_
        
        # Test performance
        y_pred = best_model.predict(X_test)
        test_accuracy = accuracy_score(y_test, y_pred)
        
        print(f"Test Accuracy with excluded features {features_to_remove}: {test_accuracy}")
        
        # Store results
        results[features_to_remove] = {
            "excluded_features": features_to_remove,
            "cv_accuracy": best_score,
            "test_accuracy": test_accuracy
        }

# Find the best feature subset
best_subset = max(results, key=lambda x: results[x]["test_accuracy"])
print(f"\nBest feature subset (excluding {best_subset}) achieved test accuracy: {results[best_subset]['test_accuracy']}")

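# Refit and save a model on the best feature subset.
# Note: best_model is the estimator left over from the last loop iteration. Every
# iteration uses the same single-value grid, so its hyperparameters are the same
# as those of the best-scoring iteration.
best_features = [f for f in all_features if f not in best_subset]
X_train, X_test, y_train, y_test = preprocess_data(df, selected_features=best_features)
best_model.fit(X_train, y_train)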

save_path = "C:/data/simplon_dev_ia_projects/flask_projects/nutriscore_prediction_app/trained_models/best_model_with_selected_features.joblib"
joblib.dump(best_model, save_path)
print(f"Best model saved successfully at: {save_path}")



Evaluating with features excluding: ('pnns_groups_1',)
Test Accuracy with excluded features ('pnns_groups_1',): 0.6417335473515249

Evaluating with features excluding: ('pnns_groups_2',)
Test Accuracy with excluded features ('pnns_groups_2',): 0.6613162118780096

Evaluating with features excluding: ('food_groups',)
Test Accuracy with excluded features ('food_groups',): 0.6584269662921348

Evaluating with features excluding: ('energy-kcal_100g',)
Test Accuracy with excluded features ('energy-kcal_100g',): 0.6362760834670947

Evaluating with features excluding: ('fat_100g',)
Test Accuracy with excluded features ('fat_100g',): 0.6545746388443018

Evaluating with features excluding: ('saturated-fat_100g',)
Test Accuracy with excluded features ('saturated-fat_100g',): 0.6205457463884431

Evaluating with features excluding: ('carbohydrates_100g',)
Test Accuracy with excluded features ('carbohydrates_100g',): 0.6584269662921348

Evaluating with features excluding: ('sugars_100g',)
Test Accur