# Food Product and Nutrition Analysis System

Author: Jacob Yi

## Primary Goals

### 1. Nutritional Quality Assessment
- **Standardized Scoring System:** Develop a standardized scoring system for evaluating the nutritional quality of food products.
- **Informed Choices:** Empower consumers to make informed dietary choices through transparent and accessible nutritional information.
- **Product Comparisons:** Facilitate easy comparison of food products within the same category based on their nutritional content.

### 2. Product Classification & Grouping
- **Categorization:** Automatically categorize products based on their ingredients and nutritional content.
- **Cross-Brand Comparison:** Identify and compare similar products across different brands.
- **Health Impact Grouping:** Group products according to their potential health impacts, aiding in healthier product selection.

### 3. Pattern Discovery
- **Ingredient-Nutrition Relationships:** Explore and identify relationships between specific ingredients and nutritional outcomes.
- **Trend Analysis:** Analyze trends in product formulation to uncover evolving patterns in the food industry.
- **Product Improvement Opportunities:** Discover opportunities for product improvement and innovation, driving forward the industry's commitment to health.


**Import and Install required libraries.**

In [None]:
# Install additional required packages
!pip install plotly
!pip install wordcloud
!pip install pandasql

# Standard data processing libraries
import pandas as pd
import numpy as np
from pandasql import sqldf
import json
import datetime as dt
import re

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import cm
import plotly.express as px

# Machine Learning & PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
%%capture
!apt update
!pip install kagglehub

# Part 1: Data Loading and Initial Exploration

### **1.1 Loading the Dataset**
Download the dataset from Kaggle.

In [None]:
import kagglehub

# Download latest version of the dataset
path = kagglehub.dataset_download("openfoodfacts/world-food-facts")

print("Path to dataset files:", path)

Load the TSV file into a DataFrame.

In [None]:
# Path to TSV file
file_path = path + '/en.openfoodfacts.org.products.tsv'

# Load the dataset
df_food = pd.read_csv(file_path, sep='\t', encoding='utf-8', low_memory=False)

# Display initial information
print("\nDataset Info:")
print("-" * 50)
print(df_food.info())

In [None]:
df_food.head()

In [None]:
# Shape
print("Shape of the dataset:", df_food.shape)

In [None]:
# Columns
print("Columns in the dataset:")
print(df_food.columns)
print("Number of columns:", len(df_food.columns))

In [None]:
# Rows
print("Rows in the dataset:")
print(df_food.index)
print("Number of rows:", len(df_food))

In [None]:
# Peek at the first five rows.
display(df_food.head())

In [None]:
# Missing values
missing = df_food.isnull().sum()
print("\nMissing Values:")
print("-" * 50)
print(missing[missing > 0])

### **1.2 Data Quality Assessment**

In [None]:
quality_metrics = {}

# Basic statistics
quality_metrics['total_rows'] = len(df_food)
quality_metrics['total_columns'] = len(df_food.columns)

# Missing values
missing_vals = df_food.isnull().sum()
quality_metrics['columns_with_nulls'] = missing_vals[missing_vals > 0]

# Duplicate check
quality_metrics['duplicate_rows'] = df_food.duplicated().sum()

# Nutrition columns check
nutrition_cols = [col for col in df_food.columns if '_100g' in col]
quality_metrics['nutrition_columns'] = nutrition_cols

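# Invalid values in nutrition columns (outside the 0-100 range per 100 g)
# Note: this range only makes sense for gram-based columns; energy columns
# use energy units and will legitimately exceed 100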
invalid_counts = {}
for col in nutrition_cols:
    # Combine conditions to avoid misalignment
    invalid = df_food[
        df_food[col].notna() &
        ((df_food[col] < 0) | (df_food[col] > 100))
    ].shape[0]
    if invalid > 0:
        invalid_counts[col] = invalid

quality_metrics['invalid_values'] = invalid_counts

In [None]:
# Display results
print("Data Quality Assessment Results:")
print("-" * 50)
for metric, value in quality_metrics.items():
    print(f"\n{metric}:")
    print(value)

### **1.3 Data Cleaning**

In [None]:
print("Starting data cleaning...")
print(f"Initial shape: {df_food.shape}")

In [None]:
# Make a copy
df_clean = df_food.copy()

# Select essential columns - focusing on key nutritional info
essential_cols = [
    'code', 'product_name', 'brands', 'categories',
    'energy_100g', 'proteins_100g', 'carbohydrates_100g',
    'fat_100g', 'fiber_100g', 'sugars_100g', 'salt_100g',
    'nutrition-score-fr_100g'
]

df_clean = df_clean[essential_cols]
print(f"After selecting essential columns: {df_clean.shape}")

# Remove duplicates based on code
df_clean = df_clean.drop_duplicates(subset=['code'])

# Clean text columns using .loc to avoid warnings
text_cols = ['product_name', 'brands', 'categories']
for col in text_cols:
    df_clean.loc[:, col] = df_clean[col].fillna('').astype(str).str.strip().str.lower()

# Handle missing values in nutrition columns
nutrition_cols = ['energy_100g', 'proteins_100g', 'carbohydrates_100g',
                  'fat_100g', 'fiber_100g', 'salt_100g']

for col in nutrition_cols:
    # Replace invalid values (negative or extreme outliers)
    mask = (df_clean[col] < 0) | (df_clean[col] > df_clean[col].quantile(0.99))
    df_clean.loc[mask, col] = np.nan

    # Fill remaining NaN with median
    median_val = df_clean[col].median()
    df_clean[col] = df_clean[col].fillna(median_val)


print(f"After removing duplicates and imputing missing values: {df_clean.shape}")

print("Cleaning Results:")
print("-" * 50)
print(f"Original shape: {df_food.shape}")
print(f"Cleaned shape: {df_clean.shape}")

In [None]:
df_clean.head()

### **1.4 Feature Engineering**






#### 1. Calculate caloric content based on macronutrients

Following the standard Atwater conversion factors:
- Proteins: 4 kcal/g
- Carbohydrates: 4 kcal/g
- Fats: 9 kcal/g
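
As a quick sanity check on these factors, the sketch below computes macronutrient calories for a single hypothetical product and compares them with a reported energy value. It assumes the reported energy is in kJ (the unit Open Food Facts documents for `energy_100g`), so it converts to kcal first. The function name and the sample values are illustrative and are not taken from the dataset.

```python
# Atwater general factors (kcal per gram), as listed above
KCAL_PER_G = {"proteins": 4, "carbohydrates": 4, "fat": 9}
KJ_PER_KCAL = 4.184  # 1 kcal = 4.184 kJ


def macro_calories(proteins_g, carbohydrates_g, fat_g):
    """Return kcal per 100 g implied by macronutrient amounts (grams per 100 g)."""
    return (proteins_g * KCAL_PER_G["proteins"]
            + carbohydrates_g * KCAL_PER_G["carbohydrates"]
            + fat_g * KCAL_PER_G["fat"])


# Hypothetical product: 10 g protein, 50 g carbohydrates, 20 g fat per 100 g
calculated_kcal = macro_calories(10, 50, 20)  # 40 + 200 + 180
reported_kj = 1800                            # assumed to be in kJ
reported_kcal = reported_kj / KJ_PER_KCAL     # convert before comparing

print("Calculated kcal:", calculated_kcal)
print("Reported minus calculated (kcal):", round(reported_kcal - calculated_kcal, 1))
```

Comparing in a single unit matters here: subtracting kcal from kJ would make every product look as if it had a large unexplained energy surplus.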

In [None]:
df_engineered = df_clean.copy()

# Calculate calories from macronutrients
df_engineered['calories_from_proteins'] = df_engineered['proteins_100g'] * 4
df_engineered['calories_from_carbs'] = df_engineered['carbohydrates_100g'] * 4
df_engineered['calories_from_fat'] = df_engineered['fat_100g'] * 9

# Calculate total calories from macronutrients
df_engineered['calculated_calories'] = (df_engineered['calories_from_proteins'] +
                            df_engineered['calories_from_carbs'] +
                            df_engineered['calories_from_fat'])

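# Compare with reported energy
# Note: Open Food Facts reports energy_100g in kJ, so convert to kcal (1 kcal = 4.184 kJ) first
df_engineered['calorie_difference'] = df_engineered['energy_100g'] / 4.184 - df_engineered['calculated_calories']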

#### 2. Calculate macronutrient ratios from total macros

In [None]:
# Calculate macronutrient ratios
total_macros = (df_engineered['proteins_100g'] + df_engineered['carbohydrates_100g'] + df_engineered['fat_100g'])
df_engineered['protein_ratio'] = df_engineered['proteins_100g'] / total_macros
df_engineered['carb_ratio'] = df_engineered['carbohydrates_100g'] / total_macros
df_engineered['fat_ratio'] = df_engineered['fat_100g'] / total_macros

#### 3. Calculate Nutrient Density Score based on modified NRF Index
    
Beneficial nutrients (encouraged):
- Protein: Essential for body function
- Fiber: Digestive health
- Vitamins/Minerals (if available)

Nutrients to limit:
- Sodium (salt)
- Saturated fat
- Added sugars

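Score = (% DV of beneficial nutrients - % DV of nutrients to limit) / (energy per 100g / 100)

The implementation in this notebook uses protein and fiber as the beneficial nutrients and salt as the only nutrient to limit; saturated fat and added sugars are not included in the score.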

Source:
https://www.sciencedirect.com/science/article/pii/S0002916523017847
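
To make the formula concrete, here is a minimal sketch that scores one hypothetical product by hand. It reuses the Daily Values and the category cut-offs that this notebook applies (protein 50 g, fiber 25 g, salt 2.3 g; bins at -1, 0, and 1). The helper names and the sample product are assumptions for illustration, and the zero-energy case is simply mapped to a neutral score of 0.

```python
# Daily Value references in grams (the values this notebook uses)
DV = {"proteins": 50, "fiber": 25, "salt": 2.3}


def nutrient_density(proteins_g, fiber_g, salt_g, energy):
    """Score one product from per-100 g amounts; `energy` is in the dataset's energy unit."""
    beneficial = proteins_g / DV["proteins"] * 100 + fiber_g / DV["fiber"] * 100
    to_limit = salt_g / DV["salt"] * 100
    if energy == 0:
        return 0.0  # avoid division by zero; treated as a neutral score
    # Cap each side at 100% before dividing by energy per 100 units
    return (min(beneficial, 100) - min(to_limit, 100)) / (energy / 100)


def density_category(score):
    """Map a score to the bins (-inf, -1], (-1, 0], (0, 1], (1, inf)."""
    if score <= -1:
        return "Poor"
    if score <= 0:
        return "Fair"
    if score <= 1:
        return "Good"
    return "Excellent"


# Hypothetical product: 10 g protein, 5 g fiber, 0.5 g salt, energy value of 400
example_score = nutrient_density(proteins_g=10, fiber_g=5, salt_g=0.5, energy=400)
print(round(example_score, 3), density_category(example_score))
```

Because the score is divided by energy, low-energy products can reach very large positive or negative values, which is worth keeping in mind when reading the score distribution.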

In [None]:
# Daily Value references (based on 2000 calorie diet)
DV = {
    'proteins': 50,  # g
    'fiber': 25,    # g
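    'salt': 2.3,    # g (note: 2.3 g is the sodium DV; salt is roughly sodium x 2.5, so this is a strict limit for salt_100g)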
    'saturated_fat': 20  # g
}

# Calculate percentage of daily value for each nutrient
beneficial_nutrients = (
    (df_engineered['proteins_100g'] / DV['proteins'] * 100) +  # Protein DV%
    (df_engineered['fiber_100g'] / DV['fiber'] * 100)         # Fiber DV%
)

nutrients_to_limit = (
    (df_engineered['salt_100g'] / DV['salt'] * 100)   # Salt DV%
)

# Calculate nutrition density score
# Cap percentages at 100% to avoid skewing from fortified foods
df_engineered['nutrition_density_score'] = (
    np.minimum(beneficial_nutrients, 100) -
    np.minimum(nutrients_to_limit, 100)
) / (df_engineered['energy_100g'] / 100)  # Per 100 units of energy_100g

# Handle division by zero before binning: replace +/-inf with NaN, then fill NaN with 0
df_engineered['nutrition_density_score'] = df_engineered['nutrition_density_score'].replace([np.inf, -np.inf], np.nan)
df_engineered['nutrition_density_score'] = df_engineered['nutrition_density_score'].fillna(0)

# Add interpretable categories (computed after cleaning so score and label stay consistent)
df_engineered['nutrition_density_category'] = pd.cut(
    df_engineered['nutrition_density_score'],
    bins=[-float('inf'), -1, 0, 1, float('inf')],
    labels=['Poor', 'Fair', 'Good', 'Excellent']
)

#### 4. Categorize Products

Flag products that are high in protein (protein ratio above 0.3), fiber (more than 6 g per 100 g), or salt (more than 1.5 g per 100 g).

In [None]:
# Categorize products
df_engineered['is_high_protein'] = (df_engineered['protein_ratio'] > 0.3).astype(int)
df_engineered['is_high_fiber'] = (df_engineered['fiber_100g'] > 6).astype(int)
df_engineered['is_high_salt'] = (df_engineered['salt_100g'] > 1.5).astype(int)

print("\nFeature Engineering Results:")
print("-" * 50)
print("\nNutritional Features:")
print(df_engineered[['calories_from_proteins', 'calories_from_carbs',
                     'calories_from_fat', 'nutrition_density_score']].describe())

In [None]:
print("Shape of the dataset:", df_engineered.shape)

In [None]:
# Display the descriptive statistics of df_engineered
df_engineered.describe()

In [None]:
# Inspect the data types in df_engineered
df_engineered.dtypes

# Part 2: Exploratory Data Analysis

### **2.1 Nutritional Value analysis**

Use SQL queries (via pandasql) to analyze the distribution of nutritional values.

In [None]:
# Namespace passed to pandasql so the queries can reference df_engineered
locals_dict = {'df_engineered': df_engineered}

# Average nutritional values by category query
query1 = """
SELECT
    categories,
    COUNT(*) as product_count,
    ROUND(AVG(energy_100g), 2) as avg_energy,
    ROUND(AVG(proteins_100g), 2) as avg_proteins,
    ROUND(AVG(carbohydrates_100g), 2) as avg_carbs,
    ROUND(AVG(fat_100g), 2) as avg_fat
FROM df_engineered
WHERE categories != ''
GROUP BY categories
HAVING product_count > 10
ORDER BY product_count DESC
LIMIT 10
"""

# High protein foods query
query2 = """
SELECT
    product_name,
    brands,
    ROUND(proteins_100g, 2) as proteins_100g,
    ROUND(energy_100g, 2) as energy_100g
FROM df_engineered
WHERE proteins_100g > 20
    AND product_name != ''
ORDER BY proteins_100g DESC
LIMIT 10
"""

results = {}
results['category_nutrition'] = sqldf(query1, locals_dict)
results['high_protein'] = sqldf(query2, locals_dict)

In [None]:
# Display results
print("\nTop Categories by Product Count and Average Nutrition:")
print("-" * 50)
display(results['category_nutrition'])

print("\nTop High-Protein Foods:")
print("-" * 50)
display(results['high_protein'])

In [None]:
results['category_nutrition']

#### Nutritional Balance Analysis


In [None]:
balance = df_engineered.groupby('categories').agg({
        'proteins_100g': ['mean', 'std'],
        'carbohydrates_100g': ['mean', 'std'],
        'fat_100g': ['mean', 'std']
    }).round(2)

print("\nNutritional Balance for the First 10 Categories (alphabetical order):")
display(balance.head(10))

#### Nutrition Density Score Statistics

In [None]:
print("\nNutrition Density Score Statistics:")
display(df_engineered['nutrition_density_score'].describe())

#### Show top 10 most nutrient-dense foods

In [None]:
print("\nTop 10 Most Nutrient-Dense Foods:")
top_foods = df_engineered.nlargest(10, 'nutrition_density_score')[
    ['product_name', 'nutrition_density_score', 'nutrition_density_category']]
display(top_foods)

### **2.2 Plots**

#### 1. Distribution of Nutrition Density Scores

In [None]:
import plotly.express as px

fig = px.histogram(df_engineered, x='nutrition_density_score', nbins=50,
                   title='Distribution of Nutrition Density Scores',
                   range_x=[-3000, 3000],
                   log_y=True,  # Log scale for the y-axis
                   color_discrete_sequence=['orange'])


fig.update_traces(marker_line_color='black',
                  marker_line_width=1.5)

fig.update_layout(yaxis_title='Count (log scale)')
fig.show()

#### 2. Distribution of Nutrition Categories

In [None]:
import plotly.express as px

# Count products in each nutrition density category
category_counts = df_engineered['nutrition_density_category'].value_counts().reset_index()
category_counts.columns = ['Category', 'Count']

fig = px.bar(category_counts, x='Category', y='Count',
             title='Distribution of Nutrition Categories',
             labels={'Count': 'Number of Products'},
             color_discrete_sequence=['orange'])

# Show the count above each bar
fig.update_traces(text=category_counts['Count'], textposition='outside')

fig.show()

#### 3. Plot nutrition distribution for top categories

In [None]:
data = results['category_nutrition'].sort_values(by='avg_energy', ascending=False)

fig = px.bar(data, x='categories', y='avg_energy',
             title='Average Energy Content by Category',
             labels={'avg_energy': 'Energy (per 100g)', 'categories': 'Category'},
             color='categories',
             color_discrete_sequence=px.colors.qualitative.Plotly,
             height=1000)


fig.update_traces(textposition='outside')
fig.for_each_trace(lambda t: t.update(text=t.y))


fig.update_layout(xaxis_tickangle=-45)

fig.show()

#### 4. Plot distributions of main nutrients


In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# List of nutrients
nutrients = ['energy_100g', 'proteins_100g', 'carbohydrates_100g', 'fat_100g']


fig = make_subplots(rows=2, cols=2, subplot_titles=[f'Distribution of {nutrient}' for nutrient in nutrients])


positions = [(i, j) for i in range(1, 3) for j in range(1, 3)]


for nutrient, pos in zip(nutrients, positions):
    trace = go.Histogram(
        x=df_engineered[nutrient],
        nbinsx=50,
        marker=dict(
            line=dict(color='black', width=1.5)
        )
    )  # Histogram for each nutrient
    fig.add_trace(trace, row=pos[0], col=pos[1])


fig.update_layout(
    height=1000,
    width=1600,
    title_text="Nutrient Distributions in Food Products",
    showlegend=False
)


fig.show()

#### 5. Correlation Matrix of Nutritional Values

In [None]:
nutrition_cols = ['energy_100g', 'proteins_100g', 'carbohydrates_100g',
                  'fat_100g', 'fiber_100g', 'salt_100g']

plt.figure(figsize=(10, 8))
sns.heatmap(df_engineered[nutrition_cols].corr(),
            annot=True,
            cmap='RdBu',
            center=0)
plt.title('Correlation between Nutritional Values')
plt.tight_layout()
plt.show()

In [None]:
df_engineered.to_csv('processed_food_data.csv', index=False)

print("\nProcessing complete! Dataset is ready for modeling.")

# Part 3: Modeling: Logistic Regression

### **3.1 Define Features and Target**
Prepare the dataset for modeling by selecting features and defining the target variable.

Features: Nutritional attributes (e.g., energy_100g, proteins_100g)

Target: nutrition_density_category ("Poor", "Fair", "Good", or "Excellent").

Fill missing values in features with the median and in the target with the most common category.

In [None]:
# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
import plotly.express as px
import pandas as pd
import numpy as np

In [None]:
# Define Features and Target
selected_columns = [
    'energy_100g', 'proteins_100g', 'carbohydrates_100g',
    'fat_100g', 'fiber_100g', 'salt_100g', 'nutrition_density_score'
]
features = df_engineered[selected_columns]
target = df_engineered['nutrition_density_category']

# Fill missing values

features = features.fillna(features.median())
target = target.fillna(target.mode()[0])

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Work on explicit copies to avoid SettingWithCopyWarning when assigning columns
X_train = X_train.copy()
X_test = X_test.copy()

## QA: check for NaN and Inf values
print("Number of NaN values in X_train:")
print(X_train.isna().sum())

print("Number of Inf values in X_train:")
print(np.isinf(X_train).sum())

# Replace inf and -inf in 'nutrition_density_score' with NaN
X_train['nutrition_density_score'] = X_train['nutrition_density_score'].replace([np.inf, -np.inf], np.nan)

# Fill NaN with the training-set median
train_median = X_train['nutrition_density_score'].median()
X_train['nutrition_density_score'] = X_train['nutrition_density_score'].fillna(train_median)

# Repeat for X_test, reusing the training median so no test information is used
X_test['nutrition_density_score'] = X_test['nutrition_density_score'].replace([np.inf, -np.inf], np.nan)
X_test['nutrition_density_score'] = X_test['nutrition_density_score'].fillna(train_median)

Standardize the features (zero mean, unit variance)

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### **3.2 Baseline Model - Logistic Regression**
Establish a baseline model, generate predictions on the test set, and evaluate the results.

In [None]:
# Baseline Model - Logistic Regression
print("\n--- Logistic Regression ---")
log_reg = LogisticRegression(max_iter=200, random_state=42)
log_reg.fit(X_train, y_train)

# Predictions
y_pred_log = log_reg.predict(X_test)

# Evaluate result
print("\nClassification Report:")
print(classification_report(y_test, y_pred_log))

baseline_accuracy = accuracy_score(y_test, y_pred_log)
print(f"Baseline Model Accuracy: {baseline_accuracy:.2f}")

### **3.3 Validate the Classification Model**


Perform cross-validation to confirm the model generalizes well across different splits of the data.

Use confusion matrices for cross-validation folds.

QA and prep data

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
###QA##

# Inspect features for invalid values
print("Number of NaN values in features:")
print(features.isna().sum())

print("Number of Inf values in features:")
print(np.isinf(features).sum())

# Replace inf and NaN
features = features.replace([np.inf, -np.inf], np.nan)
features = features.fillna(features.median())

# Clip extreme values
features = features.clip(lower=features.quantile(0.01), upper=features.quantile(0.99), axis=1)


## Scale the features

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

### **3.4 Perform cross-validation**

Configure the Logistic Regression model with:


*   Maximum iterations (max_iter) set to 200.

*   Random seed set to 42.



Perform 5-fold cross-validation using the cross_val_score function with accuracy as the scoring metric.

Store the cross-validation accuracy scores in log_reg_cv_scores.

Calculate the mean accuracy and store it in log_reg_mean_acc.

In [None]:
## Perform cross-validation ##
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=200, random_state=42)
log_reg_cv_scores = cross_val_score(log_reg, features_scaled, target, cv=5, scoring='accuracy')

print("\nLogistic Regression Cross-Validation Accuracy Scores:")
print(log_reg_cv_scores)
print(f"Mean Accuracy: {log_reg_cv_scores.mean():.2f}")
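
The cross-validation above scales the full feature matrix before splitting it into folds, and it reports accuracy only. As a hedged alternative, the sketch below puts the scaler inside a scikit-learn pipeline so it is re-fit on each training fold, and pools the out-of-fold predictions into a single confusion matrix. It runs on synthetic data from `make_classification` as a stand-in for the notebook's features and target; the `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data (7 features, 4 classes)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=7, n_informative=5, n_redundant=0,
    n_classes=4, random_state=42,
)

# Scaling lives inside the pipeline, so it is re-fit on each training fold only
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=200, random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each sample is predicted exactly once, by a model that did not train on it
y_demo_pred = cross_val_predict(pipeline, X_demo, y_demo, cv=cv)

# One confusion matrix pooled over all five folds
cv_confusion = confusion_matrix(y_demo, y_demo_pred)
print(cv_confusion)
print(f"Out-of-fold accuracy: {np.trace(cv_confusion) / cv_confusion.sum():.2f}")
```

To apply the same idea to the notebook's data, `X_demo` and `y_demo` would be replaced by the unscaled `features` and `target`.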

# Part 4: Modeling: Random Forest

### **4.1 Train and Evaluate Random Forest Classifier**
Fit a Random Forest classifier on X_train and y_train using the following hyperparameters:



*   class_weight set to "balanced".
*   n_estimators (number of trees) set to 120.
*   max_depth set to 30.
*   Random seed set to 42.



Calculate the accuracy of the model on the test set using the score method and store it in rf_acc.

Compute the confusion matrix for the test-set predictions and save it as rf_confusion.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Fit Random Forest
rf_model = RandomForestClassifier(
    class_weight='balanced', n_estimators=120, max_depth=30, random_state=42
)
rf_model.fit(X_train, y_train)

# Evaluate accuracy
rf_acc = rf_model.score(X_test, y_test)

# Compute confusion matrix
rf_confusion = confusion_matrix(y_test, rf_model.predict(X_test))

print(f"Random Forest Test Accuracy: {rf_acc:.2f}")
print("Random Forest Confusion Matrix:")
print(rf_confusion)
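
For context on the `class_weight='balanced'` setting used above: scikit-learn documents this mode as weighting each class by `n_samples / (n_classes * class_count)`, so rare classes count for more during training. The sketch below reproduces that formula in plain Python on a small hypothetical label list; the helper name and the labels are illustrative only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical, deliberately imbalanced label list
demo_labels = ["Fair"] * 6 + ["Good"] * 3 + ["Poor"] * 1
demo_weights = balanced_class_weights(demo_labels)

for cls, weight in sorted(demo_weights.items()):
    print(f"{cls}: {weight:.3f}")
```

The rarest class receives the largest weight, which is why this setting can help when the nutrition categories are unevenly populated.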

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


class_labels = rf_model.classes_  # Get class labels

plt.figure(figsize=(10, 7))
sns.heatmap(rf_confusion, annot=True, fmt="d", cmap="Blues",
            xticklabels=class_labels, yticklabels=class_labels)

plt.title('Random Forest Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

### **4.2 Plot Feature Importances**

Understand the key drivers behind the nutrition_density_category predictions made by the Random Forest model.

Use feature importance to visualize the most important features.



In [None]:
import pandas as pd
import plotly.express as px
from sklearn.ensemble import RandomForestClassifier

# Extract feature names from the DataFrame
feature_names = features.columns

# Extracting feature importances from the RF model
feature_importances = rf_model.feature_importances_

# Create a DataFrame with the features and their importances
features_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
})

features_df = features_df.sort_values(by='Importance', ascending=True)

# Barchart for Importances
fig = px.bar(features_df, x='Importance', y='Feature', orientation='h',
             title='Feature Importances - Random Forest',
             labels={'Importance': 'Importance Score', 'Feature': 'Feature'},
             color='Importance',  # Color the bars by importance
             color_continuous_scale=px.colors.sequential.Viridis)

fig.update_layout(
    xaxis_title='Importance Score',
    yaxis_title='Feature',
    coloraxis_showscale=True  # Show the color scale
)

# Show the plot
fig.show()

**Insights from Chart:** Based on the calculated feature importances, nutrition_density_score is by far the strongest feature for determining whether a food is healthy or not, followed by salt per 100g and energy per 100g. This ranking is expected, because nutrition_density_category is derived by binning nutrition_density_score.

### **4.3 Insights from Random Forest Model**
The RandomForest model's feature importance analysis revealed critical insights into the factors that most significantly influence the nutritional quality of food products. The analysis focused on two primary features: nutrition_density_score and salt_100g, which are instrumental in determining the overall nutritional categorization of food products.

#### 1. Nutrition Density Score (Importance: 0.599683)

**Explanation:**
The nutrition_density_score serves as a comprehensive metric that integrates essential nutrients such as proteins and fiber with limiting factors such as salt. This integration makes it a robust indicator of overall nutritional value.

**Implications:**
Given its comprehensive nature, the nutrition_density_score emerges as the predominant predictor in our model. It directly correlates with the categorization into nutritional quality bands, effectively distinguishing between higher and lower nutritional values. This feature's dominance underscores the importance of balanced nutrition in the evaluation of food quality.

#### 2. Salt Content (Importance: 0.171232)

**Explanation:**
Salt content, measured as salt_100g, plays a critical role in nutritional assessments due to its impact on health. High salt levels are known to detract from nutritional value, significantly affecting health ratings.

**Implications:**
The substantial influence of salt content on the nutrition_density_score confirms it as a key predictor for identifying products likely categorized as Fair or Poor in nutritional quality. This finding aligns with current health guidelines which advocate for lower sodium intake to prevent various health issues.

# Part 5: Applying the Feed Forward Neural Network

In [None]:
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

In [None]:
# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit label encoder on training labels and transform
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# Convert integer labels to one-hot encoded vectors
y_train_onehot = to_categorical(y_train_encoded)
y_test_onehot = to_categorical(y_test_encoded)

In [None]:
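# Define the model
from tensorflow.keras.layers import Input

model = Sequential()

# Input layer: declaring the input shape builds the model so model.summary() can report shapes
model.add(Input(shape=(X_train.shape[1],)))

# First hidden layer
model.add(Dense(64, activation='relu'))

# Second hidden layer
model.add(Dense(32, activation='relu'))

# Output layer (one unit per nutrition category)
model.add(Dense(4, activation='softmax'))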

model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

model.summary()

In [None]:
# Train the model
history = model.fit(
    X_train,
    y_train_onehot,
    epochs=20,
    batch_size=128,
    validation_data=(X_test, y_test_onehot)
)

In [None]:
# Plot training & validation accuracy values
plt.figure(figsize=(12, 4))

# Accuracy
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Train')
plt.plot(history.history['val_accuracy'], label='Test')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()

# Loss
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Test')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()

plt.show()

In [None]:
# Evaluate the Model
loss, accuracy = model.evaluate(X_test, y_test_onehot, verbose=0)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

In [None]:
# Predict class probabilities
y_pred_prob = model.predict(X_test)

# Convert probabilities to class labels
y_pred_classes = np.argmax(y_pred_prob, axis=1)

# Decode class labels back to original categories
y_pred_labels = label_encoder.inverse_transform(y_pred_classes)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# True labels
y_true = y_test

# Predicted labels
y_pred = y_pred_labels

# Classification report
print(classification_report(y_true, y_pred))

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print('Confusion Matrix')
print(cm)

# Part 6: K-Means Clustering

## Objective:
Our next step is to segment food products into distinct groups based on their nutritional profiles using K-Means clustering, informed by the most significant features identified through RandomForest analysis.

## Methodology:

**Feature Selection:** Utilize RandomForest to determine the key nutritional attributes that significantly impact food quality.

**Clustering Execution:** Implement K-Means clustering using these selected features to categorize food products into meaningful clusters.

**Rationale:**
This approach ensures that our clustering is focused on the most impactful nutritional factors, providing clear, actionable insights into different food categories. This targeted analysis helps in identifying nutritional patterns and supports informed decision-making in food product development and marketing strategies.

**Implications:**
Employing K-Means with features identified by RandomForest allows for precise segmentation of the food market, facilitating enhanced product positioning and health-focused consumer outreach.

### **6.1 Select Features**

Based on the feature importances from the Random Forest model, we will include the following features:

*   nutrition_density_score
*   salt_100g
*   energy_100g
*   proteins_100g
*   fiber_100g
*   carbohydrates_100g
*   fat_100g


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Select features based on Random Forest importance
selected_features = ['nutrition_density_score', 'salt_100g', 'energy_100g',
                     'proteins_100g', 'fiber_100g', 'carbohydrates_100g', 'fat_100g']

# Extract features
features_for_clustering = df_engineered[selected_features]

QA and clean the data (replace Inf and fill NaN)

In [None]:
###QA###
# Check for Inf or NaN values in features
print("Number of NaN values in features:")
print(features_for_clustering.isna().sum())

print("\nNumber of Inf values in features:")
print(np.isinf(features_for_clustering).sum())

# Create a cleaned version of the Df
## avoid warning
features_for_clustering_cleaned = features_for_clustering.copy()

# Replace Inf and -Inf with NaN
features_for_clustering_cleaned.replace([np.inf, -np.inf], np.nan, inplace=True)

# Replace NaN values with the median of each column
features_for_clustering_cleaned.fillna(features_for_clustering_cleaned.median(), inplace=True)

# Validate the cleaned Df
print("Remaining NaN values:", features_for_clustering_cleaned.isna().sum())
print("Remaining Inf values:", np.isinf(features_for_clustering_cleaned).sum())

Scale features

In [None]:
# Scale the features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features_for_clustering_cleaned)

### **6.2 Determine Optimal Number of Clusters (Elbow Method)**

In [None]:
# Determine optimal k using Elbow Method
inertia = []
k_values = range(1, 11)

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(features_scaled)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Curve -- to find the optimal number of clusters
plt.figure(figsize=(8, 6))
plt.plot(k_values, inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.grid()
plt.show()

After k=4, the rate of decrease in inertia becomes more gradual, which indicates diminishing returns from adding further clusters.

Optimal: k = 4
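
Reading the elbow by eye can be subjective. One simple way to back it up, sketched below, is to compute the fractional drop in inertia for each additional cluster and look for the point where the drops become small. The inertia values here are hypothetical placeholders, not results from this notebook, and the helper name is illustrative.

```python
def relative_inertia_drops(inertia):
    """Fractional decrease in inertia when going from k to k + 1 clusters."""
    return [
        (previous - current) / previous
        for previous, current in zip(inertia, inertia[1:])
    ]


# Hypothetical inertia values for k = 1..6 (not results from this notebook)
demo_inertia = [1000.0, 600.0, 400.0, 280.0, 260.0, 245.0]
demo_drops = relative_inertia_drops(demo_inertia)

for k, drop in enumerate(demo_drops, start=2):
    print(f"k={k}: inertia fell by {drop:.1%} relative to k={k - 1}")
```

With the notebook's own `inertia` list passed in, a sharp fall-off in these percentages after a given k would support choosing that k.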

### **6.3 Run K-Means Clustering with k=4:**




In [None]:
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Fit K-Means with k=4
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
clusters = kmeans.fit_predict(features_scaled)

#   Add cluster labels to the original DataFrame
df_engineered['Cluster'] = clusters

# Analyze Cluster Centers
cluster_centers = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=selected_features
)
print("Cluster Centers:")
print(cluster_centers)

# Part 7: Prepare Pairwise Cluster Data Using pandasql

1. Use pandasql to efficiently prepare the data for the pairwise cluster visual.

2. Prepare the data and fix the inf and NaN values before querying.

In [None]:
# Namespace passed to pandasql
locals_dict = {'df_engineered': df_engineered}

# Fix inf and NaN values before querying

# Replace inf and -inf with NaN (assign back rather than using chained inplace, which pandas deprecates)
df_engineered['nutrition_density_score'] = df_engineered['nutrition_density_score'].replace([np.inf, -np.inf], np.nan)

# Fill NaN values in 'nutrition_density_score' with the median
median_nutrition_score = df_engineered['nutrition_density_score'].median()
df_engineered['nutrition_density_score'] = df_engineered['nutrition_density_score'].fillna(median_nutrition_score)

# Verify the issue is resolved
print("Number of NaN values remaining in 'nutrition_density_score':")
print(df_engineered['nutrition_density_score'].isnull().sum())
print("Number of inf values remaining in 'nutrition_density_score':")
print(np.isinf(df_engineered['nutrition_density_score']).sum())


Check cluster representation counts and decide on a per-cluster row limit to balance the sample

In [None]:



# SQL2: Select representative products for each cluster
query_representative_points_test = """
      SELECT
        Cluster,
        nutrition_density_score,
        salt_100g,
        energy_100g,
        proteins_100g,
        fiber_100g,
        carbohydrates_100g,
        fat_100g,
        ROW_NUMBER() OVER (PARTITION BY Cluster ORDER BY nutrition_density_score DESC) as row_num
    FROM df_engineered
    WHERE
        Cluster IS NOT NULL
        AND nutrition_density_score IS NOT NULL
        AND nutrition_density_score NOT IN ('inf', '-inf')

"""

representative_points_test = sqldf(query_representative_points_test, locals_dict)


# Check cluster representation counts
print("Cluster Representation :")
print(representative_points_test['Cluster'].value_counts())
print(representative_points_test)


In [None]:

# Define the global variable for pandasql
locals_dict = {'df_engineered': df_engineered}

# Not used: kept for reference only
'''
#SQL 1: Compute average nutritional values by cluster

query_clusters = """
SELECT
    Cluster,
    ROUND(AVG(nutrition_density_score), 2) as avg_nutrition_density,
    ROUND(AVG(salt_100g), 2) as avg_salt,
    ROUND(AVG(energy_100g), 2) as avg_energy,
    ROUND(AVG(proteins_100g), 2) as avg_proteins,
    ROUND(AVG(fiber_100g), 2) as avg_fiber,
    ROUND(AVG(carbohydrates_100g), 2) as avg_carbs,
    ROUND(AVG(fat_100g), 2) as avg_fat
FROM df_engineered
GROUP BY Cluster
ORDER BY Cluster
"""

cluster_averages = sqldf(query_clusters, locals_dict)

print("Cluster Averages:")
print(cluster_averages)
'''




# SQL2: Select representative products for each cluster
query_representative_points_filtered = """
WITH ranked_points AS (
       SELECT
        Cluster,
        nutrition_density_score,
        salt_100g,
        energy_100g,
        proteins_100g,
        fiber_100g,
        carbohydrates_100g,
        fat_100g,
        ROW_NUMBER() OVER (PARTITION BY Cluster ORDER BY nutrition_density_score DESC) as row_num
    FROM df_engineered
    WHERE
        Cluster IS NOT NULL
        AND nutrition_density_score IS NOT NULL
        AND nutrition_density_score NOT IN ('inf', '-inf')
)
SELECT *
FROM ranked_points
WHERE row_num <= 4000
ORDER BY Cluster, row_num
"""
# Cap each cluster at 4,000 rows to balance the plot: the smallest cluster (Cluster 1)
# has 4,055 rows according to the cluster representation counts.

representative_points = sqldf(query_representative_points_filtered, locals_dict)


# Check cluster representation counts (after the 4,000-row cap)
print("Cluster Representation :")
print(representative_points['Cluster'].value_counts())

# For reference, the counts before applying the cap were:
# Cluster
# 3    209202
# 2     99561
# 0     42221
# 1      4055
# Name: count, dtype: int64

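
The `ROW_NUMBER() OVER (PARTITION BY Cluster ...)` cap used in the query can also be expressed directly in pandas. The sketch below is an illustrative equivalent on toy data, not the notebook's method; the cluster sizes, `toy`, `cap`, and `balanced` are all made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data with deliberately imbalanced clusters (sizes are made up)
toy = pd.DataFrame({
    "Cluster": np.repeat([0, 1, 2, 3], [50, 5, 30, 100]),
    "nutrition_density_score": rng.normal(size=185),
})

cap = 5  # size of the smallest cluster in this toy example

# Equivalent of: ROW_NUMBER() OVER (PARTITION BY Cluster ORDER BY score DESC) <= cap
balanced = (
    toy.sort_values("nutrition_density_score", ascending=False)
       .groupby("Cluster")
       .head(cap)
       .sort_values(["Cluster", "nutrition_density_score"], ascending=[True, False])
)

print(balanced["Cluster"].value_counts().sort_index())
```

Both forms keep the highest-scoring rows in each cluster rather than a random sample, which is worth keeping in mind when reading the pairplot.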
## Visualization for Pairwise Clusters

In [None]:
# Visualize


# Ensure 'Cluster' column is treated as string
representative_points['Cluster'] = representative_points['Cluster'].astype(str)

# Visualization: Pairplot of representative points
selected_features = [
    'nutrition_density_score',
    'salt_100g',
    'energy_100g',
    'proteins_100g',
    'fiber_100g',
    'carbohydrates_100g',
    'fat_100g'
]

sns.pairplot(
    representative_points,
    vars=selected_features,
    hue='Cluster',
    palette='viridis',
    diag_kind='kde',
    markers=["o", "s", "D", "P"]
)
plt.suptitle('Pairplot of Representative Points by Cluster', y=1.02)
plt.show()


# QA: display a summary of the representative points
print("\nRepresentative Points :")
print(representative_points)


# Part 8: Analysis of Clusters:

## Cluster Profiles:

* Cluster 0: High-protein, high-fat foods (e.g., peanuts or protein bars).
* Cluster 1: High-salt, mid-carb foods (e.g., SPAM or sausage).
* Cluster 2: High-carb, high-fiber foods (e.g., oats or black bread).
* Cluster 3: Low-calorie, low-fat foods (e.g., apples, tomatoes, or other fruits and vegetables).

## Insights:
* Cluster 0 (high-protein, high-fat, moderate-energy foods) could be marketed as high-energy snacks, for example to hikers; low-fat versions could be considered for calorie-conscious consumers.
* Cluster 1 (high-salt foods) may need reformulation to reduce salt for health-conscious consumers.
* Cluster 2 (high-carb, high-fiber foods) could be marketed as energy-dense and fiber-rich products for athletes.
* Cluster 3 (low-calorie foods) can be targeted at consumers who are dieting.

### Analysis of Pairwise Trends and Cluster Interpretations

**Overview of Key Trends:**  
A notable positive correlation exists between the energy content (`energy_100g`) and fat content (`fat_100g`) across our food product dataset. This correlation underscores a fundamental nutritional principle: fat is more energy-dense than proteins or carbohydrates. Consequently, food products with higher fat content tend to have higher caloric values.
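
The energy-density point can be checked with a short calculation using the general Atwater factors (roughly 9 kcal per gram of fat and 4 kcal per gram of protein or carbohydrate) and the conversion 1 kcal = 4.184 kJ. The function name and the two example products are hypothetical, and treating `energy_100g` as kilojoules is an assumption about the dataset.

```python
# General Atwater factors (kcal per gram)
KCAL_PER_G = {"fat": 9.0, "protein": 4.0, "carbohydrate": 4.0}
KJ_PER_KCAL = 4.184


def estimate_energy_kj(fat_g, protein_g, carbohydrate_g):
    """Approximate energy in kJ for the given grams of macronutrients."""
    kcal = (
        fat_g * KCAL_PER_G["fat"]
        + protein_g * KCAL_PER_G["protein"]
        + carbohydrate_g * KCAL_PER_G["carbohydrate"]
    )
    return kcal * KJ_PER_KCAL


# Two hypothetical products, each with 60 g of macronutrients per 100 g
high_fat = estimate_energy_kj(fat_g=50, protein_g=5, carbohydrate_g=5)  # 490 kcal
low_fat = estimate_energy_kj(fat_g=5, protein_g=5, carbohydrate_g=50)   # 265 kcal

print(round(high_fat), round(low_fat))
# -> 2050 1109
```

With the same total mass of macronutrients, the fat-heavy product carries almost twice the energy, which is the pattern the pairplot shows.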

**Cluster-Specific Analysis:**

1. **Cluster 0 (High-Protein, High-Fat Foods):**
   - **Examples:** Peanuts, Protein Bars
   - **Interpretation:** Foods in this cluster are characterized by their high energy density, primarily driven by their significant fat content. This aligns perfectly with the observed positive correlation, as these foods utilize fat as a primary source of calories, supporting their use in energy-demanding contexts.

2. **Cluster 1 (High-Salt, Mid-Carb Foods):**
   - **Examples:** SPAM, Sausage
   - **Interpretation:** While also following the upward trend between fat and energy, this cluster exhibits a less pronounced slope. This variation suggests that, besides fat, other macronutrients such as carbohydrates and proteins also contribute to the energy content, albeit to a lesser extent. The moderate carbohydrate content dampens the direct effect of fat on energy levels.

3. **Cluster 2 (High-Carb, High-Fiber Foods):**
   - **Examples:** Oats, Black Bread
   - **Interpretation:** This cluster displays a milder trend between fat and energy, indicating that carbohydrates play a significant role in energy provision. The high fiber content also suggests health benefits beyond energy provision, aligning these products with dietary recommendations for fiber intake.

4. **Cluster 3 (Low-Calorie, Low-Fat Foods):**
   - **Examples:** Vegetables, Fruits
   - **Interpretation:** Representing the low end of the trend, these products are both low in fat and low in energy. With so little fat present, fat contributes only minimally to their energy content, which is consistent with their positioning as diet-friendly options that minimize calorie intake without sacrificing volume.
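
The "slope" language above can be made concrete by fitting a straight line of energy against fat separately for each cluster. The sketch below runs on synthetic data: `points` stands in for `representative_points`, and the slopes in `true_slopes` are made-up values chosen only to illustrate the comparison, not results from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Made-up slopes (kJ per gram of fat), for illustration only
true_slopes = {"0": 37.0, "1": 25.0, "2": 15.0, "3": 8.0}

frames = []
for cluster, slope in true_slopes.items():
    fat = rng.uniform(0, 50, size=200)
    energy = 400 + slope * fat + rng.normal(0, 20, size=200)
    frames.append(
        pd.DataFrame({"Cluster": cluster, "fat_100g": fat, "energy_100g": energy})
    )
points = pd.concat(frames, ignore_index=True)

# Least-squares slope of energy on fat, fitted separately for each cluster
slopes = {
    cluster: np.polyfit(group["fat_100g"], group["energy_100g"], deg=1)[0]
    for cluster, group in points.groupby("Cluster")
}

for cluster, slope in slopes.items():
    print(f"Cluster {cluster}: {slope:.1f} kJ per g of fat")
```

Running the same fit on the real `representative_points` would show whether the clusters actually differ in slope as the interpretations suggest.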

**Detailed Insights and Potential Market Uses:**

Fat content is clearly linked to energy content across the food groups, although the strength of that link varies by cluster because other nutrients also contribute to the energy value of food.

These insights can help food manufacturers and health experts develop or recommend products that meet specific dietary needs or health goals. Each food cluster offers specific opportunities for improving nutrition and marketing, so that consumer needs are addressed accurately and with well-supported evidence.






# Conclusion and Recommendations

### Tailoring Nutrition to Consumer Needs

Our analysis effectively segments food products into clusters with distinct nutritional characteristics. This refined understanding allows stakeholders to cater their products more precisely to the dietary preferences and health needs of diverse groups.

#### Cluster Benefits, Target Audiences, and Cautions

- **Cluster 0: High-Protein, High-Fat Foods**
  - **Target Audience:** Bodybuilders and Fitness Enthusiasts
  - **Benefits:** Ideal for those requiring sustained energy and muscle repair. Examples include nuts, seeds, and protein bars.
  - **Cautions:** Individuals with cardiovascular concerns or those on a low-fat diet might want to limit intake from this cluster due to the high fat content.

- **Cluster 1: High-Salt, Mid-Carb Foods**
  - **Target Audience:** Endurance Athletes
  - **Benefits:** Useful for endurance athletes who lose significant salt through sweating and need quick energy from carbohydrates.
  - **Cautions:** Individuals with hypertension or heart disease should avoid high-salt foods to manage blood pressure and heart health effectively.

- **Cluster 2: High-Carb, High-Fiber Foods**
  - **Target Audience:** Students, Office Workers, and Diabetics
  - **Benefits:** Provides slow-releasing energy ideal for long periods of mental exertion faced by students and office workers; high fiber content is also beneficial for diabetics managing blood sugar levels.
  - **Cautions:** Those on ketogenic or low-carb diets should avoid this cluster due to its high carbohydrate content.

- **Cluster 3: Low-Calorie, Low-Fat Foods**
  - **Target Audience:** Weight Loss Participants, Health-Conscious Consumers, and the Elderly
  - **Benefits:** Supports weight management with low-calorie content and is suitable for the elderly seeking nutrient-dense, easy-to-digest foods. Examples include leafy greens, fresh fruits, and vegetables.
  - **Cautions:** Athletes or individuals with high-caloric needs might find these foods insufficient for their energy requirements.

### Implementing Insights for Market Impact

Understanding these clusters helps manufacturers and health professionals tailor products precisely to meet the nutritional preferences and requirements of different consumer segments. This targeted approach not only enhances consumer satisfaction but also promotes healthier dietary choices, contributing to a better overall food environment.

We recommend stakeholders utilize these insights to develop specialized product lines catering to the nuanced demands of these diverse consumer segments, thereby enhancing both market presence and consumer health outcomes.

