**Feature engineering**

**Previously engineered features**

- **`average_rating`**: This represents the average rating for each author's recipes. It helps evaluate whether an author's average rating is a reliable predictor for the ratings of their future recipes.

- **`rating_count`**: This captures the total number of recipes posted by an author. It allows us to analyze the relationship between the quantity of recipes and their ratings.

- **`TotalTimeMinutes`**: This is the total preparation and cooking time for a recipe, measured in minutes. It serves as a numerical feature for prediction.

- **`description_length`**: This measures the length of the recipe description. It helps assess whether the description's length influences the recipe's rating.

### Feature Engineering and Dataset Preparation  

1. **One-Hot Encoding Seasons**: Extracted seasonal information from the timestamp and encoded it as one-hot features.  

2. **Recipe Complexity Metrics**: Added features for instruction length and the number of ingredients to represent the complexity of recipes.  

3. **One-Hot Encoded Recipe Categories**: Transformed recipe categories into binary one-hot encoded features.  

4. **One-Hot Encoded Keywords**: Encoded recipe keywords into a one-hot format to capture keyword-related patterns.  

5. **Ingredient Vectorization**: Converted the list of ingredients into a numerical count vector using scikit-learn's `CountVectorizer`.

6. **Training and Testing Dataset Creation**: Split the processed dataset into training and testing subsets for model development and evaluation.

7. **Numerical Feature Scaling**: Standardized all numerical features to ensure consistency and improve model performance. 

In [60]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from collections import Counter
from sklearn.preprocessing import OneHotEncoder


In [82]:
# import the dataframe
df = pd.read_pickle('/Users/shendong/Desktop/Springboard_local/Springboard_old/data capstone 2/merged_df')

In [83]:
#summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 864122 entries, 0 to 864121
Data columns (total 33 columns):
 #   Column                      Non-Null Count   Dtype              
---  ------                      --------------   -----              
 0   RecipeId                    864122 non-null  float64            
 1   Name                        864122 non-null  object             
 2   AuthorId_recipe             864122 non-null  int64              
 3   AuthorName_recipe           864122 non-null  object             
 4   TotalTime                   864122 non-null  object             
 5   DatePublished               864122 non-null  datetime64[ns, UTC]
 6   Description                 864122 non-null  object             
 7   RecipeCategory              864122 non-null  object             
 8   Keywords                    864122 non-null  object             
 9   RecipeIngredientQuantities  864122 non-null  object             
 10  RecipeIngredientParts       864122 non-null 

In [84]:
# delete columns that are not relevant for our predictive model
df = df.drop(columns = ['Name', 'AuthorId_recipe', 'AuthorName_recipe', 'TotalTime', 'ReviewId', 'AuthorId_review',  'AuthorName_review','DateSubmitted', 'DateModified', 'Description', 'Review', 'RecipeServings'])   

In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 864122 entries, 0 to 864121
Data columns (total 21 columns):
 #   Column                      Non-Null Count   Dtype              
---  ------                      --------------   -----              
 0   RecipeId                    864122 non-null  float64            
 1   DatePublished               864122 non-null  datetime64[ns, UTC]
 2   RecipeCategory              864122 non-null  object             
 3   Keywords                    864122 non-null  object             
 4   RecipeIngredientQuantities  864122 non-null  object             
 5   RecipeIngredientParts       864122 non-null  object             
 6   Calories                    864122 non-null  float64            
 7   FatContent                  864122 non-null  float64            
 8   SaturatedFatContent         864122 non-null  float64            
 9   CholesterolContent          864122 non-null  float64            
 10  SodiumContent               864122 non-null 

In [139]:
# remove duplicated recipes (each recipe was listed once per reviewer, with that reviewer's individual rating)
df = df.drop_duplicates(subset=['RecipeId', 'DatePublished', 'RecipeCategory',
       'Keywords', 'RecipeIngredientQuantities', 'RecipeIngredientParts',
       'Calories', 'FatContent', 'SaturatedFatContent', 'CholesterolContent',
       'SodiumContent', 'CarbohydrateContent', 'FiberContent', 'SugarContent',
       'ProteinContent', 'RecipeInstructions',
       'TotalTimeMinutes', 'description_length'], keep='first')
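
As a minimal sketch of what the deduplication above does, the toy frame below (hypothetical values, only a few of the real columns) has one row per review. Dropping duplicates on the recipe-level columns keeps one row per recipe. With `keep='first'`, the surviving `Rating` is simply the first reviewer's rating.

```python
import pandas as pd

# Toy review-level data: recipe 38 was rated by two reviewers.
reviews = pd.DataFrame({
    "RecipeId": [38.0, 38.0, 39.0],
    "Calories": [170.9, 170.9, 1110.7],
    "Rating": [5, 4, 3],
    "average_rating": [4.5, 4.5, 3.0],
})

# Keep one row per recipe, comparing only recipe-level columns.
recipes = (
    reviews
    .drop_duplicates(subset=["RecipeId", "Calories"], keep="first")
    .reset_index(drop=True)
)
print(recipes)
```
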

In [177]:
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,RecipeId,DatePublished,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,Calories,FatContent,SaturatedFatContent,CholesterolContent,...,SugarContent,ProteinContent,RecipeInstructions,TotalTimeMinutes,description_length,Rating,average_rating,rating_count,month,season
0,38.0,1999-08-09 21:46:00+00:00,Frozen Desserts,"(Dessert, Low Protein, Low Cholesterol, Health...","(4, 1⁄4, 1, 1)","(blueberries, granulated sugar, vanilla yogurt...",170.9,2.5,1.3,8.0,...,30.2,3.2,"(Toss 2 cups berries with sugar., Let stand fo...",1485,75,5,4.25,4,8,Summer
1,39.0,1999-08-29 13:12:00+00:00,Chicken Breast,"(Chicken Thigh & Leg, Chicken, Poultry, Meat, ...","(1, 4, 2, 2, 8, 1⁄4, 8, 1⁄2, 1, 1, 1⁄4, 1⁄4, 1...","(saffron, milk, hot green chili peppers, onion...",1110.7,58.8,16.6,372.8,...,20.4,63.4,(Soak saffron in warm milk for 5 minutes and p...,265,49,3,3.0,1,8,Summer
2,40.0,1999-09-05 19:52:00+00:00,Beverages,"(Low Protein, Low Cholesterol, Healthy, Summer...","(1 1⁄2, 1, None, 1 1⁄2, None, 3⁄4)","(sugar, lemons, rind of, lemon, zest of, fresh...",311.1,0.2,0.0,0.0,...,77.2,0.3,"(Into a 1 quart Jar with tight fitting lid, pu...",35,350,5,4.333333,9,9,Fall
3,41.0,1999-09-03 14:54:00+00:00,Soy/Tofu,"(Beans, Vegetable, Low Cholesterol, Weeknight,...","(12, 1, 2, 1, 10, 1, 3, 2, 2, 2, 1, 2, 1⁄2, 1⁄...","(extra firm tofu, eggplant, zucchini, mushroom...",536.1,24.0,3.8,0.0,...,32.1,29.3,"(Drain the tofu, carefully squeezing out exces...",1460,104,5,4.5,2,9,Fall
4,42.0,1999-09-19 06:19:00+00:00,Vegetable,"(Low Protein, Vegan, Low Cholesterol, Healthy,...","(46, 4, 1, 2, 1)","(plain tomato juice, cabbage, onion, carrots, ...",103.6,0.4,0.1,0.0,...,17.7,4.3,"(Mix everything together and bring to a boil.,...",50,54,5,2.666667,9,9,Fall


**New engineered features**

1. **One-Hot Encoding Seasons**: Extracted seasonal information from the timestamp and encoded it as one-hot features.

In [178]:
df['DatePublished'].head()

0   1999-08-09 21:46:00+00:00
1   1999-08-29 13:12:00+00:00
2   1999-09-05 19:52:00+00:00
3   1999-09-03 14:54:00+00:00
4   1999-09-19 06:19:00+00:00
Name: DatePublished, dtype: datetime64[ns, UTC]

In [179]:
#Extracting the date 

df['DatePublished'] = pd.to_datetime(df['DatePublished'])

# Extract the month and create a new column
df['month'] = df['DatePublished'].dt.month

# Define seasons
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

df['season'] = df['month'].apply(get_season)
df.head()

Unnamed: 0,RecipeId,DatePublished,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,Calories,FatContent,SaturatedFatContent,CholesterolContent,...,SugarContent,ProteinContent,RecipeInstructions,TotalTimeMinutes,description_length,Rating,average_rating,rating_count,month,season
0,38.0,1999-08-09 21:46:00+00:00,Frozen Desserts,"(Dessert, Low Protein, Low Cholesterol, Health...","(4, 1⁄4, 1, 1)","(blueberries, granulated sugar, vanilla yogurt...",170.9,2.5,1.3,8.0,...,30.2,3.2,"(Toss 2 cups berries with sugar., Let stand fo...",1485,75,5,4.25,4,8,Summer
1,39.0,1999-08-29 13:12:00+00:00,Chicken Breast,"(Chicken Thigh & Leg, Chicken, Poultry, Meat, ...","(1, 4, 2, 2, 8, 1⁄4, 8, 1⁄2, 1, 1, 1⁄4, 1⁄4, 1...","(saffron, milk, hot green chili peppers, onion...",1110.7,58.8,16.6,372.8,...,20.4,63.4,(Soak saffron in warm milk for 5 minutes and p...,265,49,3,3.0,1,8,Summer
2,40.0,1999-09-05 19:52:00+00:00,Beverages,"(Low Protein, Low Cholesterol, Healthy, Summer...","(1 1⁄2, 1, None, 1 1⁄2, None, 3⁄4)","(sugar, lemons, rind of, lemon, zest of, fresh...",311.1,0.2,0.0,0.0,...,77.2,0.3,"(Into a 1 quart Jar with tight fitting lid, pu...",35,350,5,4.333333,9,9,Fall
3,41.0,1999-09-03 14:54:00+00:00,Soy/Tofu,"(Beans, Vegetable, Low Cholesterol, Weeknight,...","(12, 1, 2, 1, 10, 1, 3, 2, 2, 2, 1, 2, 1⁄2, 1⁄...","(extra firm tofu, eggplant, zucchini, mushroom...",536.1,24.0,3.8,0.0,...,32.1,29.3,"(Drain the tofu, carefully squeezing out exces...",1460,104,5,4.5,2,9,Fall
4,42.0,1999-09-19 06:19:00+00:00,Vegetable,"(Low Protein, Vegan, Low Cholesterol, Healthy,...","(46, 4, 1, 2, 1)","(plain tomato juice, cabbage, onion, carrots, ...",103.6,0.4,0.1,0.0,...,17.7,4.3,"(Mix everything together and bring to a boil.,...",50,54,5,2.666667,9,9,Fall


In [180]:
season_dummies = pd.get_dummies(df['season'], prefix = 'season')
season_dummies = season_dummies.astype(int)

In [181]:
df_encode_season = pd.concat([df, season_dummies], axis=1)
df_encode_season.head()

Unnamed: 0,RecipeId,DatePublished,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,Calories,FatContent,SaturatedFatContent,CholesterolContent,...,description_length,Rating,average_rating,rating_count,month,season,season_Fall,season_Spring,season_Summer,season_Winter
0,38.0,1999-08-09 21:46:00+00:00,Frozen Desserts,"(Dessert, Low Protein, Low Cholesterol, Health...","(4, 1⁄4, 1, 1)","(blueberries, granulated sugar, vanilla yogurt...",170.9,2.5,1.3,8.0,...,75,5,4.25,4,8,Summer,0,0,1,0
1,39.0,1999-08-29 13:12:00+00:00,Chicken Breast,"(Chicken Thigh & Leg, Chicken, Poultry, Meat, ...","(1, 4, 2, 2, 8, 1⁄4, 8, 1⁄2, 1, 1, 1⁄4, 1⁄4, 1...","(saffron, milk, hot green chili peppers, onion...",1110.7,58.8,16.6,372.8,...,49,3,3.0,1,8,Summer,0,0,1,0
2,40.0,1999-09-05 19:52:00+00:00,Beverages,"(Low Protein, Low Cholesterol, Healthy, Summer...","(1 1⁄2, 1, None, 1 1⁄2, None, 3⁄4)","(sugar, lemons, rind of, lemon, zest of, fresh...",311.1,0.2,0.0,0.0,...,350,5,4.333333,9,9,Fall,1,0,0,0
3,41.0,1999-09-03 14:54:00+00:00,Soy/Tofu,"(Beans, Vegetable, Low Cholesterol, Weeknight,...","(12, 1, 2, 1, 10, 1, 3, 2, 2, 2, 1, 2, 1⁄2, 1⁄...","(extra firm tofu, eggplant, zucchini, mushroom...",536.1,24.0,3.8,0.0,...,104,5,4.5,2,9,Fall,1,0,0,0
4,42.0,1999-09-19 06:19:00+00:00,Vegetable,"(Low Protein, Vegan, Low Cholesterol, Healthy,...","(46, 4, 1, 2, 1)","(plain tomato juice, cabbage, onion, carrots, ...",103.6,0.4,0.1,0.0,...,54,5,2.666667,9,9,Fall,1,0,0,0


We extracted the month in order to create a season feature that we one-hot encoded.
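
The same month-to-season step can also be written without a row-wise `apply`. The sketch below uses toy timestamps in place of `DatePublished`. The dictionary `month_to_season` is a name introduced here for illustration and uses the same grouping as `get_season`.

```python
import pandas as pd

# Toy timestamps standing in for the DatePublished column.
dates = pd.Series(pd.to_datetime(
    ["1999-01-15", "1999-04-02", "1999-08-09", "1999-11-30"], utc=True))

# Month number -> season name, same grouping as get_season.
month_to_season = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

season = dates.dt.month.map(month_to_season)

# One indicator column per season, stored as 0/1 integers.
season_dummies = pd.get_dummies(season, prefix="season").astype(int)
print(season_dummies)
```
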

In [182]:
# drop the DatePublished and season columns
df_encode_season = df_encode_season.drop(columns = ['DatePublished', 'season'])

2. **Recipe Complexity Metrics**: Added features for instruction length and the number of ingredients to represent the complexity of recipes.

**Instruction length**

In [183]:
df_encode_season['Instruction_length']= df_encode_season['RecipeInstructions'].str.len()
df_instructionlen = df_encode_season.drop(columns = 'RecipeInstructions')
df_instructionlen.columns

Index(['RecipeId', 'RecipeCategory', 'Keywords', 'RecipeIngredientQuantities',
       'RecipeIngredientParts', 'Calories', 'FatContent',
       'SaturatedFatContent', 'CholesterolContent', 'SodiumContent',
       'CarbohydrateContent', 'FiberContent', 'SugarContent', 'ProteinContent',
       'TotalTimeMinutes', 'description_length', 'Rating', 'average_rating',
       'rating_count', 'month', 'season_Fall', 'season_Spring',
       'season_Summer', 'season_Winter', 'Instruction_length'],
      dtype='object')

**Number of ingredients** 

In [184]:
df_instructionlen['ingredient_count'] = df_instructionlen['RecipeIngredientParts'].apply(len)
df_instructionlen.head()

Unnamed: 0,RecipeId,RecipeCategory,Keywords,RecipeIngredientQuantities,RecipeIngredientParts,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,...,Rating,average_rating,rating_count,month,season_Fall,season_Spring,season_Summer,season_Winter,Instruction_length,ingredient_count
0,38.0,Frozen Desserts,"(Dessert, Low Protein, Low Cholesterol, Health...","(4, 1⁄4, 1, 1)","(blueberries, granulated sugar, vanilla yogurt...",170.9,2.5,1.3,8.0,29.8,...,5,4.25,4,8,0,0,1,0,9,4
1,39.0,Chicken Breast,"(Chicken Thigh & Leg, Chicken, Poultry, Meat, ...","(1, 4, 2, 2, 8, 1⁄4, 8, 1⁄2, 1, 1, 1⁄4, 1⁄4, 1...","(saffron, milk, hot green chili peppers, onion...",1110.7,58.8,16.6,372.8,368.4,...,3,3.0,1,8,0,0,1,0,11,25
2,40.0,Beverages,"(Low Protein, Low Cholesterol, Healthy, Summer...","(1 1⁄2, 1, None, 1 1⁄2, None, 3⁄4)","(sugar, lemons, rind of, lemon, zest of, fresh...",311.1,0.2,0.0,0.0,1.8,...,5,4.333333,9,9,1,0,0,0,5,5
3,41.0,Soy/Tofu,"(Beans, Vegetable, Low Cholesterol, Weeknight,...","(12, 1, 2, 1, 10, 1, 3, 2, 2, 2, 1, 2, 1⁄2, 1⁄...","(extra firm tofu, eggplant, zucchini, mushroom...",536.1,24.0,3.8,0.0,1558.6,...,5,4.5,2,9,1,0,0,0,15,14
4,42.0,Vegetable,"(Low Protein, Vegan, Low Cholesterol, Healthy,...","(46, 4, 1, 2, 1)","(plain tomato juice, cabbage, onion, carrots, ...",103.6,0.4,0.1,0.0,959.3,...,5,2.666667,9,9,1,0,0,0,4,5


In [185]:
df_instructionlen.shape

(172299, 26)
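
Because `RecipeInstructions` holds tuples of steps, `.str.len()` returns the number of steps per recipe rather than a character count, which matches the small `Instruction_length` values shown above. A minimal sketch on toy data follows. The column `instruction_chars` is a hypothetical extra feature, not one used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "RecipeInstructions": [
        ("Mix.", "Bake for 20 minutes."),
        ("Stir everything together.",),
    ],
})

# On tuples, .str.len() counts the elements, i.e. the number of steps.
toy["Instruction_length"] = toy["RecipeInstructions"].str.len()

# Hypothetical alternative: total number of characters across all steps.
toy["instruction_chars"] = toy["RecipeInstructions"].apply(
    lambda steps: sum(len(step) for step in steps)
)
print(toy[["Instruction_length", "instruction_chars"]])
```
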

3. **One-Hot Encoded Recipe Categories**: Transformed recipe categories into binary one-hot encoded features.  

In [186]:
cat = df_instructionlen['RecipeCategory'].value_counts()

In [187]:
# how many categories have fewer than 100 recipes
cat_100 = len(cat[cat <100])
cat_100

164

In [188]:
# To avoid unseen categories in the test dataset, group every category
# that contains fewer than 100 recipes under 'Others'
df_instructionlen['RecipeCategory'] = df_instructionlen['RecipeCategory'].apply(lambda x: 'Others' if cat[x] < 100 else x)

In [189]:
df_instructionlen['RecipeCategory'].value_counts()

RecipeCategory
Dessert          16973
Lunch/Snacks     12738
One Dish Meal    11910
Vegetable        10145
Breakfast         7810
                 ...  
Lactose Free       106
Coconut            104
Deer               102
Free Of...         102
Spicy              100
Name: count, Length: 115, dtype: int64
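
The grouping of rare categories can be sketched on toy data as follows. The threshold `min_count = 3` is chosen only so the example stays small (the notebook uses 100), and `Series.where` is an alternative to the row-wise lambda.

```python
import pandas as pd

categories = pd.Series(
    ["Dessert"] * 5 + ["Vegetable"] * 3 + ["Deer"] + ["Spicy"]
)

min_count = 3  # illustrative threshold; the notebook uses 100
counts = categories.value_counts()
frequent = counts[counts >= min_count].index

# Keep frequent categories, replace everything else with 'Others'.
grouped = categories.where(categories.isin(frequent), "Others")
print(grouped.value_counts())
```
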

In [190]:
# check the category names so we can pool similar ones and avoid duplication
df_instructionlen['RecipeCategory'].unique()

array(['Frozen Desserts', 'Chicken Breast', 'Beverages', 'Soy/Tofu',
       'Vegetable', 'Pie', 'Chicken', 'Dessert', 'Others', 'Stew',
       'Black Beans', '< 60 Mins', 'Whole Chicken', 'Sauces', 'Breakfast',
       'Bar Cookie', 'Brown Rice', 'Oranges', 'Free Of...', 'Cheese',
       'Lamb/Sheep', 'Very Low Carbs', 'Breads', 'Spaghetti', 'Scones',
       'Drop Cookies', 'Lunch/Snacks', 'Cheesecake', 'Punch Beverage',
       'Yeast Breads', 'Low Cholesterol', 'Weeknight', 'Low Protein',
       'Curries', '< 30 Mins', 'Savory Pies', 'Coconut', 'Quick Breads',
       'Steak', 'Lobster', 'Pork', 'Halibut', 'Crab', 'Potato', 'Meat',
       'Poultry', 'Chowders', 'European', 'Pineapple', 'Smoothies',
       'Beans', 'Onions', 'Greek', 'Corn', 'Lentil', 'Healthy',
       'High Protein', 'Summer', 'Long Grain Rice', 'Cauliflower', 'Tuna',
       'Fruit', 'Apple', 'Salad Dressings', 'Asian', 'Mexican',
       'Clear Soup', 'Shakes', 'Candy', 'One Dish Meal',
       'Short Grain Rice', '< 15 

All the categories seem specific enough, so we leave them unchanged.

In [191]:
df_instructionlen.shape

(172299, 26)

In [192]:
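# The encoder returns a sparse matrix by default. The explicit `sparse` argument is omitted
# because recent scikit-learn releases renamed it to `sparse_output`.
encoder = OneHotEncoder(handle_unknown='ignore')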
encoded_array = encoder.fit_transform(df_instructionlen[['RecipeCategory']])
encoded_dense = encoded_array.toarray()
cat_encoded = pd.DataFrame(encoded_dense, columns=encoder.get_feature_names_out(['RecipeCategory']))
cat_encoded.head()



Unnamed: 0,RecipeCategory_< 15 Mins,RecipeCategory_< 30 Mins,RecipeCategory_< 4 Hours,RecipeCategory_< 60 Mins,RecipeCategory_Apple,RecipeCategory_Asian,RecipeCategory_Australian,RecipeCategory_Bar Cookie,RecipeCategory_Beans,RecipeCategory_Berries,...,RecipeCategory_Turkey Breasts,RecipeCategory_Veal,RecipeCategory_Vegan,RecipeCategory_Vegetable,RecipeCategory_Very Low Carbs,RecipeCategory_Weeknight,RecipeCategory_White Rice,RecipeCategory_Whole Chicken,RecipeCategory_Yam/Sweet Potato,RecipeCategory_Yeast Breads
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [193]:
cat_encoded.shape

(172299, 115)

In [194]:
df_encoded_cat = pd.concat([df_instructionlen, cat_encoded], axis =1)
df_encoded_cat = df_encoded_cat.drop(columns = 'RecipeCategory')
df_encoded_cat.shape


(172299, 140)

4. **One-Hot Encoded Keywords**: Encoded recipe keywords into a one-hot format to capture keyword-related patterns.  

In [196]:
df_encoded_cat['Keywords']

0         (Dessert, Low Protein, Low Cholesterol, Health...
1         (Chicken Thigh & Leg, Chicken, Poultry, Meat, ...
2         (Low Protein, Low Cholesterol, Healthy, Summer...
3         (Beans, Vegetable, Low Cholesterol, Weeknight,...
4         (Low Protein, Vegan, Low Cholesterol, Healthy,...
                                ...                        
172294                                         (< 30 Mins,)
172295      (Low Protein, Low Cholesterol, < 30 Mins, Easy)
172296    (Brunch, < 30 Mins, Easy, Inexpensive, From Sc...
172297    (Low Cholesterol, High Fiber, Healthy, High In...
172298                                              (Easy,)
Name: Keywords, Length: 172299, dtype: object

In [197]:
df_encoded_cat['Keywords'] = df_encoded_cat['Keywords'].apply(list)
df_encoded_cat['Keywords']

0         [Dessert, Low Protein, Low Cholesterol, Health...
1         [Chicken Thigh & Leg, Chicken, Poultry, Meat, ...
2         [Low Protein, Low Cholesterol, Healthy, Summer...
3         [Beans, Vegetable, Low Cholesterol, Weeknight,...
4         [Low Protein, Vegan, Low Cholesterol, Healthy,...
                                ...                        
172294                                          [< 30 Mins]
172295      [Low Protein, Low Cholesterol, < 30 Mins, Easy]
172296    [Brunch, < 30 Mins, Easy, Inexpensive, From Sc...
172297    [Low Cholesterol, High Fiber, Healthy, High In...
172298                                               [Easy]
Name: Keywords, Length: 172299, dtype: object

In [198]:
#Extract all unique keywords
unique_keywords = set(keyword for keywords in df_encoded_cat['Keywords'] for keyword in keywords)

# Create a dictionary of multi-hot encoded columns
encoded_columns = {
    keyword: df_encoded_cat['Keywords'].apply(lambda x: 1 if keyword in x else 0)
    for keyword in unique_keywords
}

# Create a new DataFrame with the multi-hot encoded columns
encoded_keyword_df = pd.DataFrame(encoded_columns)

# Concatenate the encoded columns with the original DataFrame
df_encoded_keywords = pd.concat([df_encoded_cat.drop(columns=['Keywords']), encoded_keyword_df], axis=1)

df_encoded_keywords.head()

Unnamed: 0,RecipeId,RecipeIngredientQuantities,RecipeIngredientParts,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,...,Eggs Breakfast,Bath/Beauty,African,Lunch/Snacks,Homeopathy/Remedies,Collard Greens,Nepalese,Free Of...,Dessert,Pheasant
0,38.0,"(4, 1⁄4, 1, 1)","(blueberries, granulated sugar, vanilla yogurt...",170.9,2.5,1.3,8.0,29.8,37.1,3.6,...,0,0,0,0,0,0,0,1,1,0
1,39.0,"(1, 4, 2, 2, 8, 1⁄4, 8, 1⁄2, 1, 1, 1⁄4, 1⁄4, 1...","(saffron, milk, hot green chili peppers, onion...",1110.7,58.8,16.6,372.8,368.4,84.4,9.0,...,0,0,0,0,0,0,0,0,0,0
2,40.0,"(1 1⁄2, 1, None, 1 1⁄2, None, 3⁄4)","(sugar, lemons, rind of, lemon, zest of, fresh...",311.1,0.2,0.0,0.0,1.8,81.5,0.4,...,0,0,0,0,0,0,0,0,0,0
3,41.0,"(12, 1, 2, 1, 10, 1, 3, 2, 2, 2, 1, 2, 1⁄2, 1⁄...","(extra firm tofu, eggplant, zucchini, mushroom...",536.1,24.0,3.8,0.0,1558.6,64.2,17.3,...,0,0,0,0,0,0,0,0,0,0
4,42.0,"(46, 4, 1, 2, 1)","(plain tomato juice, cabbage, onion, carrots, ...",103.6,0.4,0.1,0.0,959.3,25.1,4.8,...,0,0,0,0,0,0,0,0,0,0


In [234]:
df_encoded_keywords.shape

(172299, 430)
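
The keyword loop above applies a Python function once per keyword over the whole column. As a possible alternative (not the method used in this notebook), scikit-learn's `MultiLabelBinarizer` builds the same kind of multi-hot matrix in one pass. The sketch assumes every keyword is a string. Note that if a keyword tuple contained `None`, the loop above would create a column whose name is `None`.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

keywords = pd.Series([
    ["Dessert", "Easy"],
    ["Easy", "< 30 Mins"],
    ["Vegan"],
])

mlb = MultiLabelBinarizer()
keyword_matrix = mlb.fit_transform(keywords)

# One 0/1 column per distinct keyword, aligned with the original index.
keyword_df = pd.DataFrame(
    keyword_matrix, columns=mlb.classes_, index=keywords.index
)
print(keyword_df)
```
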

5. **Ingredient Vectorization**: Converted the list of ingredients into a numerical count vector using scikit-learn's `CountVectorizer`.

- Step 1: homogenize ingredient names using the pyfood library (not a perfect approach, but the simplest one)

In [203]:
# Convert the tuples to lists to process through pyfood
df_encoded_keywords['List_Ingredients'] = df_encoded_keywords['RecipeIngredientParts'].apply(list)
df_encoded_keywords = df_encoded_keywords.drop(columns = 'RecipeIngredientParts')


In [204]:
from pyfood.utils import Shelf
shelf = Shelf(month_id=0)

In [205]:
# Standardize the ingredient names using the pyfood library.
# The ingredient column is passed through shelf.process_ingredients in chunks of 10,000 rows.
chunksize = 10000
num_rows = len(df_encoded_keywords)
final = []
for start in range(0, num_rows, chunksize):
    end = min(start + chunksize, num_rows)
    chunk = df_encoded_keywords.iloc[start:end]

    # Process each recipe of the chunk and collect its standardized ingredient names
    result_chunk = []
    for value in chunk['List_Ingredients']:
        test = shelf.process_ingredients(value)
        if test is not None:
            # Standard ingredient names are stored under the 'ingredients_by_taxon' key
            test2 = test.get('ingredients_by_taxon', [])

            # The 'HS' key ("Hors Saison") lists the ingredients that are out of season
            # for the month given to Shelf (month_id). Seasonality does not matter here,
            # so these names are added to the ingredient list as well.
            test2 = [item[0] for item in test2] + test.get('HS', [])
            result_chunk.append(test2)
        else:
            # No result for this recipe: keep an empty ingredient list
            result_chunk.append([])

    # Append the processed chunk to the final list
    final.append(result_chunk)





In [206]:
# Flatten the outermost level (from list of lists of lists to list of lists)
final2 = [sublist for inner_list in final for sublist in inner_list]

# Print the length of the result. There should be 172299 ingredient lists
len(final2)


172299

In [207]:
# add the list of homogenized ingredients to the dataframe
df_encoded_keywords['homogen_ingredient'] = final2
df_encoded_keywords = df_encoded_keywords.drop(columns = 'List_Ingredients')

- Step 2: vectorizing ingredient column and concat with the original dataframe

In [212]:
from sklearn.feature_extraction.text import CountVectorizer

In [208]:
# Join each ingredient list into a single string, with ingredients separated by ', ', to prepare the data for vectorization.
df_encoded_keywords['ingredient_vect'] = df_encoded_keywords['homogen_ingredient'].apply(lambda x: ', '.join(x))
df_encoded_keywords = df_encoded_keywords.drop(columns=['homogen_ingredient'])

In [210]:
df_final = df_encoded_keywords.drop(columns = ['RecipeId', 'RecipeIngredientQuantities'])
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 172299 entries, 0 to 172298
Columns: 428 entries, Calories to ingredient_vect
dtypes: float64(125), int32(2), int64(300), object(1)
memory usage: 561.3+ MB


In [213]:
#vectorization
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(', '))
vectorized_matrix = vectorizer.fit_transform(df_final['ingredient_vect'])
vectorized_df = pd.DataFrame(vectorized_matrix.toarray(), columns=vectorizer.get_feature_names_out())



In [214]:
Final_df = pd.concat([df_final, vectorized_df], axis=1)

In [215]:
Final_df.shape

(172299, 1238)

In [219]:
Final_df =Final_df.drop(columns = ['ingredient_vect'])

In [217]:
Final_df =Final_df.drop(columns = ['Rating'])


In [218]:
Final_df =Final_df.drop(columns = ['month'])

In [226]:
Final_df.to_pickle('/Users/shendong/Desktop/Springboard_local/Springboard_old/data capstone 2/Final_df')

6. **Training and Testing Dataset Creation**: Split the processed dataset into training and testing subsets for model development and evaluation.

In [227]:
from sklearn.model_selection import train_test_split

# define features and target variable
X = Final_df.drop(columns = 'average_rating')  
y = Final_df['average_rating']                 

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [228]:
X_train

Unnamed: 0,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,TotalTimeMinutes,...,yan mai,yeast,yellow bell pepper,ying su,yogurt,zeder,zhi ma,zucca,zucchini,zucker
103551,177.4,3.7,0.5,0.0,12.4,38.2,6.4,9.2,3.0,90,...,0,0,0,0,0,0,0,0,0,0
23517,16.7,0.4,0.2,1.1,44.5,1.9,0.3,0.3,1.6,10,...,0,0,0,0,0,0,0,0,0,0
12399,30.8,0.3,0.0,0.0,63.4,6.4,0.8,1.4,1.8,200,...,0,0,0,0,0,0,0,0,0,0
2799,60.3,0.0,0.0,0.0,13.2,16.4,0.0,5.7,0.0,375,...,0,0,0,0,0,0,0,0,0,0
23821,307.2,12.3,5.3,124.7,706.5,8.6,2.6,1.7,42.2,255,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119879,507.0,30.7,16.4,299.9,1371.0,26.9,2.0,2.5,30.2,15,...,0,0,0,0,0,0,0,0,0,0
103694,624.4,28.1,9.6,146.8,2469.1,29.9,11.1,6.4,60.8,240,...,0,0,0,0,0,0,0,0,0,0
131932,491.9,17.3,6.1,179.1,154.1,72.0,8.5,4.3,14.6,21,...,0,0,0,0,0,0,0,0,0,0
146867,394.1,19.1,7.2,112.4,1314.3,28.8,2.4,12.6,25.9,60,...,0,0,0,0,0,0,0,0,0,0


7. **Numerical Feature Scaling**: Standardized all numerical features to ensure consistency and improve model performance.

In [229]:
from sklearn.preprocessing import StandardScaler
# Fit the scaler on X_train only, then apply the same fitted scaler to X_test

#X_train
# Step 1: Identify column types
numerical_columns = ['Calories', 'FatContent', 'SaturatedFatContent', 'CholesterolContent', 'SodiumContent', 'CarbohydrateContent', 'FiberContent', 'SugarContent', 'ProteinContent',
       'TotalTimeMinutes', 'description_length', 'rating_count', 'Instruction_length', 'ingredient_count']

df_nonscale = X_train.drop(columns = numerical_columns)

# Step 2: Initialize StandardScaler
scaler = StandardScaler()

# Step 3: Scale numerical columns
X_train_num_scale = scaler.fit_transform(X_train[numerical_columns])

# Step 4: Convert scaled data back to DataFrame
X_train_num_scale_df = pd.DataFrame(X_train_num_scale, columns=numerical_columns, index=X_train.index)

# Step 5: Concatenate scaled numerical and one-hot encoded columns
X_train_scaled = pd.concat([X_train_num_scale_df, df_nonscale], axis=1)

X_train_scaled.head()


Unnamed: 0,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,TotalTimeMinutes,...,yan mai,yeast,yellow bell pepper,ying su,yogurt,zeder,zhi ma,zucca,zucchini,zucker
103551,-0.144893,-0.48395,-0.578234,-0.64869,-0.246106,0.015766,0.532097,-0.014623,-0.562344,0.119954,...,0,0,0,0,0,0,0,0,0,0
23517,-0.275026,-0.597614,-0.605776,-0.638418,-0.231687,-0.107443,-0.435823,-0.050862,-0.624262,-0.387235,...,0,0,0,0,0,0,0,0,0,0
12399,-0.263608,-0.601058,-0.624137,-0.64869,-0.223197,-0.092169,-0.356485,-0.046383,-0.615416,0.817338,...,0,0,0,0,0,0,0,0,0,0
2799,-0.239719,-0.611391,-0.624137,-0.64869,-0.245747,-0.058227,-0.483426,-0.028874,-0.695024,1.926812,...,0,0,0,0,0,0,0,0,0,0
23821,-0.039782,-0.187737,-0.137565,0.515745,0.065687,-0.084702,-0.07087,-0.045161,1.171342,1.16603,...,0,0,0,0,0,0,0,0,0,0


In [230]:
# Transform the test data using the scaler fitted on the training data
df_nonscale_test = X_test.drop(columns = numerical_columns)
X_test_num_scale = scaler.transform(X_test[numerical_columns])
X_test_num_scale_df = pd.DataFrame(X_test_num_scale, columns=numerical_columns, index=X_test.index)
X_test_scaled = pd.concat([X_test_num_scale_df, df_nonscale_test], axis=1)

X_test_scaled.head()


Unnamed: 0,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,TotalTimeMinutes,...,yan mai,yeast,yellow bell pepper,ying su,yogurt,zeder,zhi ma,zucca,zucchini,zucker
9915,-0.057273,0.139476,0.468356,0.263622,-0.224724,-0.031753,0.071938,0.019172,-0.416396,1.831714,...,0,0,0,0,0,0,0,0,0,0
82111,-0.026582,-0.27729,-0.229371,0.027373,-0.140363,0.041901,-0.181942,-0.047604,-0.146613,-0.133641,...,0,0,0,0,0,0,0,0,0,0
152686,-0.134527,-0.363398,-0.247732,-0.268638,-0.138522,-0.019534,-0.13434,-0.005258,-0.500427,-0.260438,...,0,0,0,0,1,0,0,0,0,0
43738,-0.207733,-0.607947,-0.624137,-0.64869,-0.249565,-0.02734,-0.467558,0.048489,-0.686179,-0.355535,...,0,0,0,0,0,0,0,0,0,0
27996,-0.021399,0.356469,-0.266093,-0.64869,-0.186901,-0.047705,0.056071,-0.00648,-0.549076,-0.260438,...,0,0,0,0,0,0,0,0,1,0


In [232]:
X_test_scaled.shape

(34460, 1234)
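
The split, scale and concatenate steps above can also be expressed with scikit-learn's `ColumnTransformer`. This is an alternative sketch on toy data, not the approach used in this notebook. The scaler is still fitted on the training rows only, and `remainder="passthrough"` leaves the one-hot columns untouched. Note that the result is a NumPy array with the scaled columns first.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

X_train_toy = pd.DataFrame({"Calories": [100.0, 200.0, 300.0],
                            "season_Fall": [1, 0, 0]})
X_test_toy = pd.DataFrame({"Calories": [200.0], "season_Fall": [1]})
numeric = ["Calories"]

preprocess = ColumnTransformer(
    [("scale", StandardScaler(), numeric)],
    remainder="passthrough",  # keep the binary columns as they are
)

train_scaled = preprocess.fit_transform(X_train_toy)  # fit on train only
test_scaled = preprocess.transform(X_test_toy)        # reuse train statistics
print(test_scaled)
```
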

In [235]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)  # Retain 95% of the variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

TypeError: Feature names are only supported if all input features have string names, but your input has ['NoneType', 'str'] as feature name / column name types. If you want feature names to be stored and validated, you must convert them all to strings, by using X.columns = X.columns.astype(str) for example. Otherwise you can remove feature / column names from your input data, or convert them all to a non-string data type.
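
The `TypeError` above reports that the column names mix `None` and `str`. One plausible source, stated here as an assumption, is a `None` entry among the keywords, which the multi-hot loop would turn into a column named `None`. The sketch below reproduces the situation on toy data, lists the offending names, and converts all names to strings before running PCA. The list comprehension is used here in place of the `X.columns.astype(str)` call suggested by the error message.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy frame with one column literally named None.
X_toy = pd.DataFrame(
    rng.normal(size=(20, 3)),
    columns=pd.Index(["Calories", None, "zucchini"], dtype=object),
)

# Find every column name that is not a string.
non_string = [c for c in X_toy.columns if not isinstance(c, str)]
print(non_string)

# Make all column names strings, then fit PCA as before.
X_toy.columns = [str(c) for c in X_toy.columns]
pca = PCA(n_components=0.95)  # retain 95% of the variance
X_toy_pca = pca.fit_transform(X_toy)
print(X_toy_pca.shape)
```

In the notebook, the same renaming would need to be applied to both `X_train_scaled` and `X_test_scaled` so that their column names still match. Dropping the `None` column instead is another option if it carries no information.
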