# 🥗 **Team Work 01: Food and Nutrition**

## 👥 **Team Name:** Perraseb

### **Participants:**
- **perraseb**
- **emperora**

## 🛠 **Step 1: Importing libraries**

We import the libraries needed for:
- Data preprocessing
- Modeling (regression and classification)
- Cross-validation and hyperparameter tuning
- Visualizing results
- Saving the model

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Models - Regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor


# Metrics
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

# Saving model
import joblib


## 🪄 **Step 2: Data preparation**

### 📌 **Instructions:**

- Use the **Epicurious dataset collected by Hugo Darwood**.
- **Filter the columns:**
   - Drop all **columns not related to ingredients**.
   - The **fewer extraneous columns**, the **cleaner the dataset**.
- You will predict the **rating or rating category** using **only the ingredients and nothing else**.

---

🎯 **Goal:**  
Prepare a **clean, easy-to-understand dataset for model training**, keeping **only the data that actually affects the quality and taste of a dish**.


In [111]:
df = pd.read_csv('data/epi_r.csv')

In [112]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20052 entries, 0 to 20051
Columns: 680 entries, title to turkey
dtypes: float64(679), object(1)
memory usage: 105.5 MB


In [113]:
df.head()

Unnamed: 0,title,rating,calories,protein,fat,sodium,#cakeweek,#wasteless,22-minute meals,3-ingredient recipes,...,yellow squash,yogurt,yonkers,yuca,zucchini,cookbooks,leftovers,snack,snack week,turkey
0,"Lentil, Apple, and Turkey Wrap",2.5,426.0,30.0,7.0,559.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,Boudin Blanc Terrine with Red Onion Confit,4.375,403.0,18.0,23.0,1439.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Potato and Fennel Soup Hodge,3.75,165.0,6.0,7.0,165.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Mahi-Mahi in Tomato Olive Sauce,5.0,,,,,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Spinach Noodle Casserole,3.125,547.0,20.0,32.0,452.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [114]:
print(df.columns.tolist())

['title', 'rating', 'calories', 'protein', 'fat', 'sodium', '#cakeweek', '#wasteless', '22-minute meals', '3-ingredient recipes', '30 days of groceries', 'advance prep required', 'alabama', 'alaska', 'alcoholic', 'almond', 'amaretto', 'anchovy', 'anise', 'anniversary', 'anthony bourdain', 'aperitif', 'appetizer', 'apple', 'apple juice', 'apricot', 'arizona', 'artichoke', 'arugula', 'asian pear', 'asparagus', 'aspen', 'atlanta', 'australia', 'avocado', 'back to school', 'backyard bbq', 'bacon', 'bake', 'banana', 'barley', 'basil', 'bass', 'bastille day', 'bean', 'beef', 'beef rib', 'beef shank', 'beef tenderloin', 'beer', 'beet', 'bell pepper', 'berry', 'beverly hills', 'birthday', 'biscuit', 'bitters', 'blackberry', 'blender', 'blue cheese', 'blueberry', 'boil', 'bok choy', 'bon appétit', 'bon app��tit', 'boston', 'bourbon', 'braise', 'bran', 'brandy', 'bread', 'breadcrumbs', 'breakfast', 'brie', 'brine', 'brisket', 'broccoli', 'broccoli rabe', 'broil', 'brooklyn', 'brown rice', 'brown

In [None]:
ingredients = [
    "rating", "almond", "apple", "apricot", "artichoke", "arugula", "asian pear", "asparagus", "avocado",
    "bacon", "banana", "barley", "basil", "beef", "beet", "bell pepper",
    "berry", "blackberry", "blue cheese", "blueberry", "bok choy", "broccoli", "broccoli rabe", "brussel sprout",
    "butter", "buttermilk", "butternut squash", "cabbage", "capers", "carrot", "cashew",
    "cauliflower", "celery", "cheese", "cherry", "chestnut", "chickpea", "chile", "chile pepper", "chili",
    "chive", "chocolate", "cilantro", "cinnamon", "clove", "coconut", "cod", "collard greens",
    "corn", "cornmeal", "cottage cheese", "crab", "cranberry", "cream cheese", "cucumber", "cumin", "currant", "curry",
    "date", "dill", "egg", "eggplant", "endive", "feta", "fig", "fish", "fontina",
    "garlic", "ginger", "goat cheese", "gouda", "grape", "grapefruit", "green bean",
    "green onion/scallion", "ground beef", "ground lamb", "guava", "hazelnut", "honey", "honeydew", "horseradish",
    "hot pepper", "jalapeño", "jam or jelly", "kale", "lamb",
    "lentil", "lettuce", "lima bean", "lime", "lobster",
    "macadamia nut", "mango", "maple syrup", "melon", "mint", "molasses", "monterey jack", "mozzarella", "mushroom",
    "mussel", "mustard", "mustard greens", "nutmeg", "oat", "okra", "olive", "onion", "orange",
    "oregano", "orzo", "oyster", "parmesan", "parsley", "parsnip", "pea", "peach", "peanut", "peanut butter",
    "pear", "pecan", "pepper", "persimmon", "pine nut", "pineapple", "pistachio", "plantain", "plum", "pomegranate",
    "pork", "potato", "prune", "pumpkin", "quince", "quinoa", "radicchio", "radish", "raisin", "raspberry",
    "rhubarb", "rice", "ricotta", "root vegetable", "rosemary", "rye",
    "sage", "salmon", "sardine", "sausage", "scallop", "sesame",
    "sesame oil", "shallot", "shellfish", "shrimp", "snapper", "soy", "spinach", "squash", "squid", "strawberry",
    "sugar snap pea", "sweet potato/yam", "tilapia", "tofu", "tomatillo", "tomato", "tree nut", "turnip",
    "vanilla", "veal", "vinegar", "wasabi", "watercress", "watermelon", "wild rice", "yogurt", "yuca",
    "zucchini", "marshmallow", "milk", "jam"
]


In [116]:
df = df[ingredients]
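
Selecting with `df[ingredients]` raises a `KeyError` if any name in the list is absent from the CSV. A minimal sketch of a more defensive selection follows. It uses a tiny made-up frame in place of `epi_r.csv`; the helper name `select_existing` and the toy columns are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def select_existing(frame, wanted):
    """Return the frame restricted to wanted columns that exist, plus the missing names."""
    present = [c for c in wanted if c in frame.columns]
    missing = [c for c in wanted if c not in frame.columns]
    return frame[present], missing


# Toy stand-in for the real dataset (assumption: the real frame has many more columns).
toy = pd.DataFrame({
    "title": ["a", "b"],
    "rating": [4.375, 2.5],
    "onion": [1.0, 0.0],
    "calories": [100.0, 200.0],
})

subset, missing = select_existing(toy, ["rating", "onion", "no_such_ingredient"])
print(subset.columns.tolist())  # columns that were found, in the requested order
print(missing)                  # names that were requested but absent
```

Reporting the missing names instead of failing makes it easier to spot a misspelled ingredient before the frequency filter runs.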

In [117]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20052 entries, 0 to 20051
Columns: 181 entries, rating to marshmallow
dtypes: float64(181)
memory usage: 27.7 MB


In [118]:
df.head()

Unnamed: 0,rating,almond,apple,apricot,artichoke,arugula,asian pear,asparagus,avocado,bacon,...,veal,vinegar,wasabi,watercress,watermelon,wild rice,yogurt,yuca,zucchini,marshmallow
0,2.5,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,4.375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [119]:
ingredient_counts = df.sum().sort_values(ascending=False)
ingredient_counts

rating         74482.5
onion           2238.0
tomato          2140.0
egg             1768.0
garlic          1643.0
                ...   
marshmallow        9.0
yuca               6.0
sardine            4.0
chili              3.0
orzo               2.0
Length: 181, dtype: float64

In [120]:
min_occurrences = 10

frequent_cols = ingredient_counts[ingredient_counts >= min_occurrences].index.tolist()

In [121]:
df = df[frequent_cols]

In [122]:
df.sum().sort_values(ascending=False)

rating       74482.5
onion         2238.0
tomato        2140.0
egg           1768.0
garlic        1643.0
              ...   
wild rice       18.0
chile           13.0
tilapia         12.0
persimmon       11.0
gouda           10.0
Length: 176, dtype: float64

In [123]:
X = df.drop(columns=['rating'])
y = df['rating']

X.head()

Unnamed: 0,onion,tomato,egg,garlic,cheese,ginger,potato,fish,pork,chocolate,...,guava,plantain,asian pear,rye,wild rice,radicchio,chile,tilapia,persimmon,gouda
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 📊 **Step 3: Regression models**

### 📌 **Instructions:**

1️⃣ **Regression**  
Try **different algorithms and their hyperparameters** to predict the rating:
- Linear Regression
- SVR
- RandomForestRegressor
- XGBRegressor

Choose the best option using **cross-validation** and evaluate its **RMSE (root mean squared error)** on the test subset.

---

2️⃣ **Ensembles**
Try **different ensembles (Bagging and Boosting) and their hyperparameters**:
- RandomForest (Bagging)
- XGBoost / GradientBoosting (Boosting)

Choose the best ensemble using **cross-validation** and evaluate it on the test subset.

---

3️⃣ **Naive model**
Compute the **root mean squared error (RMSE)** of a **naive regression model** that predicts the **mean rating** of the training set for every observation.
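
A minimal sketch of such a naive baseline, assuming the ratings are available as plain numeric sequences. The function name `naive_rmse` and the toy values are illustrative and not taken from the dataset.

```python
import numpy as np


def naive_rmse(y_train, y_test):
    """RMSE of a model that predicts the training mean for every test row."""
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    baseline = y_train.mean()  # single constant prediction learned from train only
    return float(np.sqrt(np.mean((y_test - baseline) ** 2)))


# Toy ratings, not taken from the dataset.
toy_train = [4.375, 3.75, 5.0, 0.0]
toy_test = [4.375, 2.5]
print(f"Naive RMSE: {naive_rmse(toy_train, toy_test):.4f}")
```

With the split used later in this notebook, the call would presumably be `naive_rmse(y_train, y_test)`.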

---

🎯 **Goal:**
- Find the **best regression model** for predicting a dish's rating from its ingredients.
- Compare it with the naive prediction.
- Use this model in the Nutritionist project.

In [124]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=21)

In [125]:
print(f"Train X: {X_train.shape}, Train y: {y_train.shape}")
print(f"Test X: {X_test.shape}, Test y: {y_test.shape}")

Train X: (16041, 175), Train y: (16041,)
Test X: (4011, 175), Test y: (4011,)


In [126]:
y_test.value_counts()

rating
4.375    1604
3.750    1034
5.000     544
0.000     367
3.125     298
2.500     106
1.250      33
1.875      25
Name: count, dtype: int64

In [127]:
y_train.value_counts()


rating
4.375    6415
3.750    4135
5.000    2175
0.000    1469
3.125    1191
2.500     426
1.250     131
1.875      99
Name: count, dtype: int64

In [128]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [129]:
cv = KFold(n_splits=5, shuffle=True, random_state=21)

## LinearRegression

In [130]:
lr = LinearRegression()
param_grid_lr = {}  # no hyperparameters to tune

grid_lr = GridSearchCV(
    lr,
    param_grid_lr,
    cv=cv,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1
)

grid_lr.fit(X_train, y_train)
print(f"Linear Regression CV RMSE: {abs(grid_lr.best_score_):.4f}")

Linear Regression CV RMSE: 1.3192


## RandomForestRegressor

In [131]:
rf = RandomForestRegressor(random_state=21, n_jobs=-1)

param_grid_rf = {
    'n_estimators': [100, 200],          # number of trees
    'max_depth': [None, 10, 20],         # maximum tree depth
    'min_samples_split': [2, 5]          # minimum number of samples required to split a node
}

grid_rf = GridSearchCV(
    rf,
    param_grid_rf,
    cv=cv,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=2
)

grid_rf.fit(X_train, y_train)

print(f"RandomForest Best Params: {grid_rf.best_params_}")
print(f"RandomForest CV RMSE: {abs(grid_rf.best_score_):.4f}")

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END max_depth=None, min_samples_split=2, n_estimators=100; total time= 1.2min
[CV] END max_depth=None, min_samples_split=5, n_estimators=100; total time= 1.3min
[CV] END max_depth=None, min_samples_split=5, n_estimators=100; total time= 1.3min
[CV] END max_depth=None, min_samples_split=2, n_estimators=100; total time= 1.4min
[CV] END max_depth=None, min_samples_split=5, n_estimators=100; total time= 1.5min
[CV] END max_depth=None, min_samples_split=2, n_estimators=100; total time= 1.6min
[CV] END max_depth=None, min_samples_split=5, n_estimators=100; total time= 1.6min
[CV] END max_depth=10, min_samples_split=2, n_estimators=100; total time=  13.9s
[CV] END max_depth=None, min_samples_split=5, n_estimators=100; total time= 1.8min
[CV] END max_depth=None, min_samples_split=2, n_estimators=100; total time= 1.9min
[CV] END max_depth=10, min_samples_split=2, n_estimators=100; total time=  18.0s
[CV] END max_depth=10, min_sam

## XGBRegressor

In [132]:
xgb = XGBRegressor(random_state=21, verbosity=0, tree_method='hist')

param_grid_xgb = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.1, 0.05]
}

grid_xgb = GridSearchCV(
    xgb,
    param_grid_xgb,
    cv=cv,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=2
)

grid_xgb.fit(X_train, y_train)


print(f"XGBRegressor Best Params: {grid_xgb.best_params_}")
print(f"XGBRegressor CV RMSE: {abs(grid_xgb.best_score_):.4f}")

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=100; total time=   0.7s
[CV] END ...learning_rate=0.1, max_depth=5, n_estimators=100; total time=   0.7s
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=100; total time=   1.0s
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=100; total time=   1.0s
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=100; total time=   1.1s
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=200; total time=   1.2s
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=200; total time=   1.4s
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=200; total time=   1.4s
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=200; total time=   1.5s
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=200; total time=   1.5s
[CV] END ...learning_rate=0.1, max_depth=3, n_estimators=300; total time=   1.5s
[CV] END ...learning_rate=0.1, max_depth=3, n_es

## Ridge

In [133]:
ridge = Ridge(random_state=21)

param_grid_ridge = {
    'alpha': np.logspace(0, 5, 25)
}


grid_ridge = GridSearchCV(
    ridge,
    param_grid_ridge,
    cv=cv,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1,
    verbose=2
)

grid_ridge.fit(X_train_scaled, y_train)


print(f"Ridge Best Params: {grid_ridge.best_params_}")
print(f"Ridge CV RMSE: {abs(grid_ridge.best_score_):.4f}")

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV] END ............................alpha=2.610157215682537; total time=   2.3s
[CV] END ............................alpha=1.615598098439874; total time=   2.6s
[CV] END ............................alpha=2.610157215682537; total time=   2.6s
[CV] END ............................alpha=1.615598098439874; total time=   2.6s
[CV] END ............................alpha=1.615598098439874; total time=   2.7s
[CV] END ............................alpha=1.615598098439874; total time=   2.7s
[CV] END ............................alpha=2.610157215682537; total time=   2.7s
[CV] END ............................alpha=1.615598098439874; total time=   2.8s
[CV] END ............................alpha=4.216965034285822; total time=   2.8s
[CV] END ............................alpha=2.610157215682537; total time=   3.0s
[CV] END ..........................................alpha=1.0; total time=   3.1s
[CV] END ......................................

## Ensemble: StackingRegressor

In [134]:
base_models = [
    ('ridge', Ridge(alpha=3500)),  
    ('rf', RandomForestRegressor(
        n_estimators=200, max_depth=20, min_samples_split=5,
        random_state=21, n_jobs=-1
    )),
    ('xgb', XGBRegressor(
        n_estimators=300, max_depth=3, learning_rate=0.1,
        random_state=21, verbosity=0
    ))
]

stack = StackingRegressor(
    estimators=base_models,
    final_estimator=lr,
    passthrough=True,  
    n_jobs=-1
)


cv_scores = cross_val_score(
    stack,
    X_train_scaled,
    y_train,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1
)


print(f"StackingRegressor CV RMSE: {abs(np.mean(cv_scores)):.4f}")

StackingRegressor CV RMSE: 1.3126
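
The cross-validation above runs on `X_train_scaled`, so the scaler has already seen every fold. One possible way to keep scaling inside each fold is to wrap the scaler and the model in a `Pipeline`. The sketch below uses synthetic data and a plain `Ridge` as a stand-in for the stack; the data, the `alpha` value, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(21)
X_demo = rng.integers(0, 2, size=(200, 10)).astype(float)  # binary "ingredient" flags
y_demo = X_demo[:, 0] + rng.normal(scale=0.5, size=200)     # synthetic target

pipe = Pipeline([
    ("scale", StandardScaler()),  # refit on the training part of every fold
    ("model", Ridge(alpha=1.0)),  # any regressor, e.g. a StackingRegressor, fits here
])

cv_demo = KFold(n_splits=5, shuffle=True, random_state=21)
scores = cross_val_score(
    pipe, X_demo, y_demo,
    cv=cv_demo,
    scoring="neg_root_mean_squared_error",
)
print(f"Pipeline CV RMSE: {-scores.mean():.4f}")
```

For 0/1 ingredient flags the effect of this change is likely small, but it keeps the cross-validated score free of information from the held-out fold.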


## Evaluating the best model on the test data

In [135]:
stack.fit(X_train_scaled, y_train)

y_pred = stack.predict(X_test_scaled)

rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
mae_test = mean_absolute_error(y_test, y_pred)
r2_test = r2_score(y_test, y_pred)


print(f"StackingRegressor Test RMSE: {rmse_test:.4f}")
print(f"StackingRegressor Test MAE: {mae_test:.4f}")
print(f"StackingRegressor Test R^2: {r2_test:.4f}")

StackingRegressor Test RMSE: 1.3156
StackingRegressor Test MAE: 0.9084
StackingRegressor Test R^2: 0.0370
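
The test R² reported above can be turned into a rough comparison with a constant-mean predictor. By definition R² = 1 − MSE_model / MSE_mean, where MSE_mean is the error of predicting the mean of `y_test` for every row. This reference point uses the test-set mean, so it only approximates the train-mean naive model described in the instructions. A short worked calculation from the two printed numbers:

```python
import math

rmse_model = 1.3156  # test RMSE printed above
r2_model = 0.0370    # test R^2 printed above

# Solve R^2 = 1 - MSE_model / MSE_mean for the RMSE of the mean predictor.
rmse_mean = rmse_model / math.sqrt(1.0 - r2_model)
improvement = 1.0 - rmse_model / rmse_mean

print(f"Implied mean-baseline RMSE: {rmse_mean:.4f}")
print(f"Relative RMSE improvement: {improvement:.2%}")
```

The implied baseline RMSE is about 1.34, so the stacked model improves on a constant prediction by roughly 2% in RMSE.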


### 📌 Conclusion:
> - **Logistic Regression is not suitable for classifying dish ratings with the current data.**

> - **The 66% accuracy comes from guessing the dominant class, not from prediction quality.**

> - **More complex models are needed.**

## GridSearchCV