# McDonald's Menu Analysis

## Introduction
McDonald's is one of the most famous fast-food chains in the world. Before the COVID outbreak, a lot of people in my city loved to spend their time at McDonald's. They came not only to eat: some chose McDonald's because it's a comfortable place to meet friends, or to study while enjoying some french fries and the Wi-Fi connection.

In this notebook, I'm going to answer all the inspiration questions from this dataset:
1. How many calories does the average McDonald's value meal contain?
2. How much do beverages, like soda or coffee, contribute to the overall caloric intake?
3. Does ordering grilled chicken instead of crispy increase a sandwich's nutritional value?
4. What about ordering egg whites instead of whole eggs?
5. What is the least number of items you could order from the menu to meet one day's nutritional requirements?

## Working on Data
### Install and Import Some Useful Packages

In [None]:
!pip install pulp

In [None]:
import pandas as pd
import numpy as np
from pulp import *

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In this notebook, most of the packages we use are common ones such as pandas and NumPy. We will also use the PuLP package later; PuLP is used to solve linear programming problems.

### Data Acquisition and Data Cleaning
Let's begin by loading the dataset.


In [None]:
df = pd.read_csv('../input/nutrition-facts/menu.csv')
df.head()

#### Check Quality of data
Before we dive deeper, I'm going to check whether there are any null values or incorrect data types in this dataset. Then, since the `Serving Size` column mixes several units, we need to normalize it to a standard unit.

In [None]:
df.isnull().sum() # Check whether there are any null values

In [None]:
df.dtypes # Check the data type of each column

From the result, we can conclude that there are no null values in the dataset and that all column data types are correct except `Serving Size`, which should be numeric rather than the object data type. Most of the nutrients have a percentage-of-daily-value column. We're going to add such columns for Sugars, Protein, and Calories later.

If we look at how the data is written in `Serving Size`, several weight/volume units are used to describe the menu items: ounce (oz), fluid ounce (fl oz), gram (g), and millilitre (ml). Most rows use ounces and grams for food and fluid ounces for drinks, but a few items, such as `1% Low Fat Milk Jug` and `Fat-Free Chocolate Milk Jug`, use millilitres. We're going to convert everything to grams. For rows that don't provide a mass unit (fluid ounces or millilitres), we'll use the density of milk for millilitres (only milk items lack a mass unit there) and the density of pure water for fluid ounces.

In [None]:
serving_size_conv = []
for i in df['Serving Size']:
    if '(' in i and 'g)' in i:
        # Take the gram value inside the parentheses
        serving_size_conv.append(float(i[i.find('(')+1:-3]))
    elif 'fl oz' in i:
        serving_size_conv.append(float(i[0:i.find(' ')])*29.5735) ## assume 1 fl oz = 29.5735 g (density of water)
    elif '(' in i and 'ml)' in i:
        serving_size_conv.append(float(i[i.find('(')+1:-4])*1.04) ## assume 1 ml = 1.04 g (density of milk)
    else:
        # Fail loudly instead of silently misreading an unexpected format
        raise ValueError(f"Unrecognized serving size: {i!r}")

df['Serving Size'] = serving_size_conv
df.head(2)
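
The unit rules above can also be sketched as a small standalone function. This is only an illustration: the name `serving_size_to_grams`, the regular expressions, and the sample strings are assumptions made for the example (the strings follow the format that the slicing above implies), while the two density factors are the ones used in the cell above.

```python
import re

FL_OZ_TO_G = 29.5735  # assumes the drink has the density of water
ML_TO_G = 1.04        # assumes the drink has the density of milk


def serving_size_to_grams(text):
    """Convert one serving-size string to grams (hypothetical helper)."""
    grams = re.search(r"\(([\d.]+)\s*g\)", text)
    if grams:
        return float(grams.group(1))
    fl_oz = re.match(r"([\d.]+)\s*fl oz", text)
    if fl_oz:
        return float(fl_oz.group(1)) * FL_OZ_TO_G
    ml = re.search(r"\(([\d.]+)\s*ml\)", text)
    if ml:
        return float(ml.group(1)) * ML_TO_G
    raise ValueError(f"Unrecognized serving size: {text!r}")


# Illustrative strings, not rows copied from the dataset
for sample in ["4.8 oz (136 g)", "16 fl oz cup", "1 carton (236 ml)"]:
    print(sample, "->", round(serving_size_to_grams(sample), 2))
```

Raising an error for an unknown format is a design choice: a silent fallback would hide rows that were converted with the wrong rule.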

After normalizing the `Serving Size` column, let's add percentage-of-daily-value columns for Sugars, Protein, and Calories.

In [None]:
df['Calories (% Daily Value)'] = df['Calories']*100/2500 # assumes 2500 calories per day
df['Sugars (% Daily Value)'] = df['Sugars']*100/30 # assumes 30 g of sugar per day
df['Protein (% Daily Value)'] = df['Protein']*100/50 # assumes 50 g of protein per day
df.head(2)
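
As a quick check of the arithmetic, the same conversion can be written for a single item. The reference amounts (2500 calories, 30 g of sugar, 50 g of protein) are the ones assumed in the cell above; the function name and the sample item values are made up for illustration.

```python
# Daily reference amounts assumed in this notebook
DAILY_REFERENCE = {"Calories": 2500, "Sugars": 30, "Protein": 50}


def percent_daily_value(amount, nutrient):
    """Return `amount` as a percentage of the daily reference (hypothetical helper)."""
    return amount * 100 / DAILY_REFERENCE[nutrient]


# Made-up item: 300 calories, 3 g of sugar, 17 g of protein
item = {"Calories": 300, "Sugars": 3, "Protein": 17}
for nutrient, amount in item.items():
    print(nutrient, percent_daily_value(amount, nutrient))
```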

Now that the dataset has been checked and cleaned, it is ready to use and we can start diving into it.

### Data Exploration
Let's describe our data first.

In [None]:
df.describe()

The average calorie content of an item on the McDonald's menu is **368.27 calories**. Let's see how many foods and drinks are on the menu and where the calories come from.

In [None]:
# Aggregate per category with groupby so each label stays aligned with its own totals
by_category = (df.groupby('Category')['Calories']
                 .agg(['count', 'sum', 'mean'])
                 .rename(columns={'count': 'Count',
                                  'sum': 'Calories (Sum)',
                                  'mean': 'Calories (Mean)'})
                 .sort_values('Count', ascending=False)
                 .reset_index())
by_category

In [None]:
fig,(ax1,ax2) = plt.subplots(1,2)
fig.set_size_inches(15.5, 7.5)
ax1.set_title("Total of Items in Each Category")
ax1.pie(by_category['Count'],labels =by_category['Category'],autopct = '%1.1f%%')

ax2.set_title("Total of Calories in Each Category")
ax2.pie(by_category['Calories (Sum)'],labels =by_category['Category'],autopct = '%1.1f%%')

fig.suptitle('Calories by Category',fontsize = 20)
fig.legend(by_category['Category'],ncol=4,loc=8)

From the figure, **Coffee & Tea** has the most items on the menu (**36.5%**) and also contributes the largest share of total calories (**28.2%**). Desserts is the second-smallest category, with only 7 items (2.7% of the menu), while the fourth-largest share of calories (15.5%) belongs to Smoothies & Shakes.
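
To make the two pie charts concrete, here is a minimal pure-Python sketch of a per-category aggregation. The rows are made-up numbers rather than values from the dataset, and `category_shares` is a hypothetical helper; the point is that each category's item share and calorie share are both computed from that category's own rows.

```python
from collections import defaultdict

# Made-up (category, calories) rows for illustration only
rows = [("Drinks", 100), ("Drinks", 200), ("Drinks", 100), ("Burgers", 600)]


def category_shares(rows):
    """Return each category's share of items and of calories, in percent."""
    count = defaultdict(int)
    calories = defaultdict(int)
    for category, kcal in rows:
        count[category] += 1
        calories[category] += kcal
    n_items = sum(count.values())
    total_calories = sum(calories.values())
    return {
        category: {
            "item_share": 100 * count[category] / n_items,
            "calorie_share": 100 * calories[category] / total_calories,
        }
        for category in count
    }


for category, shares in category_shares(rows).items():
    print(category, shares)
```

In this toy menu the category with most of the items holds less than half of the calories, which is the kind of gap the two charts are meant to reveal.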

Let's look at the correlation between nutrients with a heatmap:

In [None]:
nutrition = ['Serving Size',
 'Calories',
 'Calories from Fat',
 'Total Fat',
 'Saturated Fat',
 'Trans Fat',
 'Cholesterol',
 'Sodium',
 'Carbohydrates',
 'Dietary Fiber',
 'Sugars',
 'Protein',
 'Vitamin A (% Daily Value)',
 'Vitamin C (% Daily Value)',
 'Calcium (% Daily Value)',
 'Iron (% Daily Value)']

corr = df[nutrition].corr()

mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
fig = plt.figure(figsize = (13,13))

ax = sns.heatmap(corr,
                 mask = mask,
                 square =True,
                 cmap="vlag",
                 annot=True,
                 fmt="1.2f"
                )
ax.set_title("Nutrition Correlation",fontsize = 20);

From the figure, we can conclude that **fat has the highest correlation with calories**, which means that items with more fat tend to have more calories too. Protein follows in second place (0.79) and Carbohydrates in third (0.78). Besides the correlations with calories, some other pairs of features are highly correlated, for example Sodium and Iron (0.87) and Sodium and Protein (0.87).
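
The numbers in the heatmap are Pearson correlation coefficients, which is the default method of `DataFrame.corr`. A minimal pure-Python sketch of that formula on made-up values looks like this; `pearson` is a hypothetical helper and the fat/calorie numbers are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values where calories are an exact linear function of fat
fat = [5, 10, 20, 30]
calories = [150, 250, 450, 650]
print(round(pearson(fat, calories), 2))
```

A coefficient close to 1 means the two columns rise together almost linearly; it does not by itself show that one causes the other.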

Now that we know some interesting facts about the data, let's find out: **does ordering grilled chicken instead of crispy increase a sandwich's nutritional value?**

In [None]:
crispy_vs_no = df[df['Item'].str.contains('Sandwich')].reset_index(drop=True)

crispy = crispy_vs_no[crispy_vs_no['Item'].str.contains('Crispy')].reset_index(drop=True)
grilled = crispy_vs_no[crispy_vs_no['Item'].str.contains('Grilled')].reset_index(drop=True)

label = grilled[grilled['Item'].str.contains('Grilled')]['Item'].replace('Grilled','',regex = True).tolist()

crispy.drop(['Category','Item','Serving Size'], axis=1,inplace = True)
grilled.drop(['Category','Item','Serving Size'], axis=1,inplace = True)

crispy_vs_no.head(2)

In [None]:
def comparison_plot(dataframe,plot_title,row,column,last_blank = False):
    # Each nutrient appears once, so no bar is drawn twice
    label = ['Calories (% Daily Value)','Total Fat (% Daily Value)','Saturated Fat (% Daily Value)'
             ,'Cholesterol (% Daily Value)','Sodium (% Daily Value)','Carbohydrates (% Daily Value)'
             ,'Dietary Fiber (% Daily Value)','Sugars (% Daily Value)','Protein (% Daily Value)'
             ,'Vitamin A (% Daily Value)','Vitamin C (% Daily Value)'
             ,'Calcium (% Daily Value)']

    label_tick = ['Calories','Total Fat','Saturated Fat','Cholesterol','Sodium','Carbohydrates'
             ,'Dietary Fiber','Sugar','Protein','Vitamin A','Vitamin C','Calcium']
    
    n = len(label)
    ind = np.arange(n)

    width = 0.3
    fig,axs = plt.subplots(row,column)
    
    if last_blank:
        axs[-1, -1].axis('off')
    
    fig.set_figheight(10)
    fig.set_figwidth(20)

    index = 0
    for ax in axs:
        for x in ax:
            if last_blank and index == (row*column*2)-2:
                continue
            x.bar(ind - width/2,dataframe.iloc[index][label].tolist(),width,label = dataframe.iloc[index]['Item'])
            x.bar(ind + width/2,dataframe.iloc[index+1][label].tolist(),width,label =dataframe.iloc[index+1]['Item'])
            if index == 4 or index == 6:
                x.set_xticks(ind)
                x.set_xticklabels(label_tick)
                x.tick_params(labelrotation=90)
            else:
                x.set_xticks(ind)
                x.set_xticklabels([""]*len(label_tick))
            x.legend()
            index +=2
    fig.suptitle(plot_title,fontsize = 20)
comparison_plot(crispy_vs_no,'Crispy vs Grilled Sandwich Comparison',2,2)

From the figure, **crispy sandwiches mostly have higher nutrient values than grilled ones**. However, grilled sandwiches have more cholesterol, protein, and vitamin C than crispy sandwiches.

Let's see whether **ordering egg whites instead of whole eggs affects the nutrients**.

In [None]:
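import re

egg = df[df['Item'].str.contains('with') & df['Item'].str.contains('Egg') & ~df['Item'].str.contains('White')].reset_index(drop=True)
matches = []
for i in egg['Item']:
    # One lookahead per word: the pattern matches every item whose name
    # contains all the words of the whole-egg item
    regex = "^"
    for j in i.split(" "):
        regex += "(?=.*" + re.escape(j) + ")"  # escape characters such as parentheses
    regex += ".*$"
    matches.append(df[df['Item'].str.contains(regex)])
# DataFrame.append was removed in pandas 2.0, so combine the matches with pd.concat
egg_vs_white = pd.concat(matches, ignore_index=True)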
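
The pattern built in the loop uses one lookahead per word, so a name matches only if it contains every word of the whole-egg item. The sketch below shows the idea with the standard `re` module. The helper name `all_words_pattern` and the item names are illustrative assumptions, and `re.escape` is used so that characters such as parentheses are matched literally.

```python
import re


def all_words_pattern(name):
    """Compile a regex that matches strings containing every word of `name`."""
    lookaheads = "".join("(?=.*" + re.escape(word) + ")" for word in name.split(" "))
    return re.compile("^" + lookaheads + ".*$")


# Illustrative item names
items = [
    "Sausage McMuffin with Egg",
    "Sausage McMuffin with Egg Whites",
    "Sausage McMuffin",
    "Egg McMuffin",
]
pattern = all_words_pattern("Sausage McMuffin with Egg")
print([item for item in items if pattern.search(item)])
```

Because the whole-egg name is a subset of the egg-white name, the pattern picks up both variants, which is what lets them be plotted side by side.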

In [None]:
comparison_plot(egg_vs_white,"Whole Egg vs White Egg Comparison",2,2,last_blank = True)

From the figure, items with **whole eggs have higher nutrient values than items without the egg yolk**. However, **egg yolk contains a lot of cholesterol**, as shown by the large difference in cholesterol between the whole-egg items and the egg-white items: the difference can be up to sevenfold.

The last question we're going to answer is: **what is the least number of items you could order from the menu to meet one day's nutritional requirements?**

To answer this, note that we want to order several items so that together they meet our daily needs. We can treat this as an **optimization problem**, because we want to **minimize the number of items** while still **meeting every nutrient requirement**. We can use **linear programming** to solve it. To do that, we need to state a few functions.

We want the **minimum number of items ordered**; this is our **objective function**. It can be written as the following mathematical expression:

$Z=menu_{1}+menu_{2}+menu_{3}+...+menu_{259}+menu_{260}$

$Z$ is the total number of items ordered, and $menu_{1},...,menu_{260}$ are the quantities of each menu item that we order.

After defining the objective function, we need to **define our constraints**. We know that we want to meet one day's nutritional requirements, so the constraints are the nutrients themselves. For example, consider the calories of each item:

In [None]:
df[['Item','Calories','Calories (% Daily Value)']]

If we assume that we need 2500 calories in one day, we can write this constraint as the following inequality:

$300\times menu_{1}+250\times menu_{2}+370 \times menu_{3}+...+340 \times menu_{258}+810 \times menu_{259}+410 \times menu_{260}\geq 2500$

Because the dataset provides each nutrient as a percentage of the daily requirement, we will use percentages rather than grams. In the end we will have 13 constraints, one for each nutrient used in the model.

After stating the nutrient constraints, don't forget to set the minimum quantity of each item that we can buy: $menu_{n} \geq 0$.
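
To see what the objective and the constraints mean in practice, here is a brute-force sketch on a made-up three-item menu with two nutrients, both given in % of daily value. It enumerates every possible order and keeps the smallest one that reaches the threshold for every nutrient. The menu, the helper `smallest_order`, and the `max_per_item` cap are assumptions for illustration only.

```python
from itertools import product

# Made-up menu: % of daily value per serving for two nutrients
menu = {
    "item_a": {"calories": 50, "protein": 20},
    "item_b": {"calories": 30, "protein": 60},
    "item_c": {"calories": 40, "protein": 40},
}


def smallest_order(menu, threshold=100, max_per_item=4):
    """Return the order with the fewest items whose totals all reach `threshold`."""
    names = list(menu)
    nutrients = list(menu[names[0]])
    best = None
    for counts in product(range(max_per_item + 1), repeat=len(names)):
        totals = [
            sum(count * menu[name][nutrient] for name, count in zip(names, counts))
            for nutrient in nutrients
        ]
        # Constraint: every nutrient total must be at least the threshold
        if all(total >= threshold for total in totals):
            # Objective: minimize the total number of items
            if best is None or sum(counts) < sum(best):
                best = counts
    return dict(zip(names, best)) if best is not None else None


order = smallest_order(menu)
print(order, "total items:", sum(order.values()))
```

Enumeration is only practical for a tiny menu like this one. With 260 items the number of possible orders explodes, which is why an integer programming solver is used for the real menu.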

Let's turn all of these equations into a function with the `pulp` library:

In [None]:
def LPModel(df,threshold,tolerant=0,morethan = True, lessthan = False,Print = True):
    prob = LpProblem("McDonalds_Problem",LpMinimize) # PuLP problem names should not contain spaces

    food_item = df['Item'].to_list()

    '''
    Define Constraints
    '''
    nutrient_cols = ['Calories (% Daily Value)','Total Fat (% Daily Value)','Saturated Fat (% Daily Value)',
                     'Cholesterol (% Daily Value)','Sodium (% Daily Value)','Carbohydrates (% Daily Value)',
                     'Dietary Fiber (% Daily Value)','Sugars (% Daily Value)','Protein (% Daily Value)',
                     'Vitamin A (% Daily Value)','Vitamin C (% Daily Value)','Calcium (% Daily Value)',
                     'Iron (% Daily Value)']
    # For each nutrient, map item name -> % of daily value per serving
    nutrients = {col: dict(zip(food_item,df[col])) for col in nutrient_cols}

    food_vars = LpVariable.dicts('Menu',food_item,lowBound=0,cat='Integer') #We cannot order half a Big Mac, so quantities are integers >= 0

    '''
    Objective function
    '''
    prob += lpSum([food_vars[f] for f in food_item]) ##All items have equal weight. Z = 1*menu1 + 1*menu2 +...+ 1*menu260

    ##Set Constraints: every nutrient total must stay within threshold -/+ tolerant
    for values in nutrients.values():
        total = lpSum([values[f] * food_vars[f] for f in food_item])
        if lessthan:
            prob += total <= threshold + tolerant
        if morethan:
            prob += total >= threshold - tolerant

    prob.solve() #Solve the problem
    if Print:
        print("Status:", LpStatus[prob.status])

    #Get the items that we must order; read them from food_vars so the original item names are kept
    ideal_item_name = []
    ideal_item_count = []
    for f in food_item:
        value = food_vars[f].varValue
        if value is not None and value > 0:
            ideal_item_name.append(f)
            ideal_item_count.append(value)
    df_opt = pd.DataFrame({'Items':ideal_item_name,
                           'Count':ideal_item_count})

    return df_opt,LpStatus[prob.status]

def df_multiplier(df_opt):
    # Look up each ordered item (pd.concat replaces DataFrame.append, removed in pandas 2.0)
    df_ideal = pd.concat([df[df['Item'] == item] for item in df_opt['Items']],ignore_index = True)
    # Scale every numeric column by the ordered quantity
    cols = df_ideal.loc[:,'Serving Size':'Protein (% Daily Value)'].columns
    df_ideal[cols] = df_ideal[cols].mul(df_opt['Count'].to_numpy(),axis = 0)
    return df_ideal

In [None]:
df_opt,opt_stats = LPModel(df,100,morethan = True, lessthan = False)
df_opt

So the **minimum number of items is 5**: 3 Big Breakfast with Hotcakes (Large Biscuit), 1 Fruit & Maple Oatmeal without Brown Sugar, and 1 Premium Southwest Salad with Grilled Chicken. Let's see how many calories we get from that order:

In [None]:
df_ideal = df_multiplier(df_opt)
df_ideal

In [None]:
df_ideal[['Serving Size','Calories (% Daily Value)','Total Fat (% Daily Value)','Saturated Fat (% Daily Value)','Cholesterol (% Daily Value)','Sodium (% Daily Value)',
         'Carbohydrates (% Daily Value)','Dietary Fiber (% Daily Value)','Sugars (% Daily Value)','Protein (% Daily Value)','Vitamin A (% Daily Value)','Vitamin C (% Daily Value)','Calcium (% Daily Value)',
        'Iron (% Daily Value)']].sum()

In [None]:
df_ideal[['Calories (% Daily Value)','Total Fat (% Daily Value)','Saturated Fat (% Daily Value)','Cholesterol (% Daily Value)','Sodium (% Daily Value)',
         'Carbohydrates (% Daily Value)','Dietary Fiber (% Daily Value)','Sugars (% Daily Value)','Protein (% Daily Value)','Vitamin A (% Daily Value)','Vitamin C (% Daily Value)','Calcium (% Daily Value)',
        'Iron (% Daily Value)']].sum().plot(kind='barh',figsize = (15,10))

plt.axvline(x=100,color = 'Red',linestyle =  '--')

With a total serving size of almost 2 kg, all the nutrients meet the daily needs. On the other hand, we also get a lot of cholesterol: almost 6 times the daily requirement, which is of course a big health problem. Besides cholesterol, several nutrients come in at more than twice the daily requirement: vitamin A (2.2 times), protein (2.8 times), sugar (2.6 times), sodium (3.2 times), saturated fat (2.9 times), and total fat (3.2 times).

To get a "healthier" order, and of course to avoid a heart attack, it's better to add more constraints to this problem.

Previously we only set a minimum daily requirement; now we can also set a **maximum daily requirement** to limit the intake of every nutrient. Because there are many nutrients and menu items to optimize, it's practically impossible to hit exactly 100% of the daily requirement for every nutrient. We therefore allow a tolerance, so it's acceptable for a nutrient to be slightly below or above the daily requirement.
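
A brute-force sketch on a made-up three-item menu illustrates the two-sided constraints: every nutrient total must land inside `[100 - tolerance, 100 + tolerance]`, and we look for the smallest tolerance for which some order is feasible. The menu values and the helpers `is_feasible` and `smallest_tolerance` are assumptions for illustration; the whole-number tolerance step is also a simplification.

```python
from itertools import product

# Made-up menu: % of daily value per serving for two nutrients
menu = {
    "item_a": {"calories": 50, "protein": 20},
    "item_b": {"calories": 30, "protein": 60},
    "item_c": {"calories": 40, "protein": 40},
}


def is_feasible(menu, tolerance, target=100, max_per_item=4):
    """True if some order keeps every nutrient total within target +/- tolerance."""
    names = list(menu)
    nutrients = list(menu[names[0]])
    for counts in product(range(max_per_item + 1), repeat=len(names)):
        totals = [
            sum(count * menu[name][nutrient] for name, count in zip(names, counts))
            for nutrient in nutrients
        ]
        if all(target - tolerance <= total <= target + tolerance for total in totals):
            return True
    return False


def smallest_tolerance(menu, max_tolerance=100):
    """Try tolerances 0, 1, 2, ... and return the first feasible one."""
    for tolerance in range(max_tolerance + 1):
        if is_feasible(menu, tolerance):
            return tolerance
    return None


print("Smallest feasible tolerance:", smallest_tolerance(menu))
```

Starting from zero and stopping at the first feasible value guarantees that the returned tolerance is the smallest one on the grid that was searched.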

In [None]:
step = np.arange(0,100,0.1)

for s in step:
    df_opt,opt_stats = LPModel(df,100,s,morethan = True, lessthan = True, Print = False)
    if opt_stats == "Optimal":
        print("Optimal Tolerance:",s)
        break

In [None]:
df_opt

From the iteration above, the smallest feasible tolerance is 26. With that tolerance, we get the set of items that we must order to meet the daily nutrient requirements: 1 Egg McMuffin, 2 Hamburger, 1 Hash Brown, 1 Hotcakes, 1 Iced Coffee with Sugar-Free French Vanilla Syrup, 1 Large Fries, and 2 Side Salad (9 items in total). Let's see whether adding a maximum nutrient constraint can save our life.

In [None]:
df_ideal = df_multiplier(df_opt)
df_ideal

In [None]:
df_ideal[['Serving Size','Calories (% Daily Value)','Total Fat (% Daily Value)','Saturated Fat (% Daily Value)','Cholesterol (% Daily Value)','Sodium (% Daily Value)',
         'Carbohydrates (% Daily Value)','Dietary Fiber (% Daily Value)','Sugars (% Daily Value)','Protein (% Daily Value)','Vitamin A (% Daily Value)','Vitamin C (% Daily Value)','Calcium (% Daily Value)',
        'Iron (% Daily Value)']].sum()

In [None]:
df_ideal[['Calories (% Daily Value)','Total Fat (% Daily Value)','Saturated Fat (% Daily Value)','Cholesterol (% Daily Value)','Sodium (% Daily Value)',
         'Carbohydrates (% Daily Value)','Dietary Fiber (% Daily Value)','Sugars (% Daily Value)','Protein (% Daily Value)','Vitamin A (% Daily Value)','Vitamin C (% Daily Value)','Calcium (% Daily Value)',
        'Iron (% Daily Value)']].sum().plot(kind='barh',figsize = (15,10))

plt.axvline(x=100,color = 'Blue',linestyle =  '--')
plt.axvline(x=100+s,color = 'Red',linestyle =  '--')
plt.axvline(x=100-s,color = 'Red',linestyle =  '--')

The answer is **yes**. With a total serving size not too different from before (1.8 kg), **only 5 nutrients fall short of the daily requirement** (Calories, Carbohydrates, Dietary Fiber, Calcium, and Iron). We also no longer get a huge amount of cholesterol in this order. With a tolerance of 26, the lowest value across all nutrients is Calcium at 74% of the daily requirement, and the highest is Cholesterol and Sodium at 126%.

## Closing
After spending a lot of time working with this dataset, we have learned a lot from it and can answer the questions that came with it. Thanks for reading this notebook. If you have any feedback, please let me know :)

**Alvaro Basily**