## Introduction

In this lab, we will continue learning about the data science methodology, and focus on the **Data Understanding** and the **Data Preparation** stages.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
1. [Recap](#0)<br>
2. [Data Understanding](#2)<br>
3. [Data Preparation](#4)<br>
</div>
<hr>

# Recap <a id="0"></a>

In the **From Requirements to Collection** lab, we learned that the data we need to answer the question developed in the business understanding stage, namely *can we automate the process of determining the cuisine of a given recipe?*, is readily available. A researcher named Yong-Yeol Ahn scraped tens of thousands of food recipes (cuisines and ingredients) from three different websites, namely:

<img src = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab3_fig1_allrecipes.png" width=500>

www.allrecipes.com

<img src = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab3_fig2_epicurious.png" width=500>

www.epicurious.com

<img src = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab3_fig3_menupan.png" width=500>

www.menupan.com

For more information on Yong-Yeol Ahn and his research, you can read his paper on [Flavor Network and the Principles of Food Pairing](http://yongyeol.com/papers/ahn-flavornet-2011.pdf).

We also collected the data and placed it on an IBM server for your convenience.

------------

# Data Understanding <a id="2"></a>

<img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab3_fig4_flowchart_data_understanding.png" width=500>

<strong>Important note:</strong> You are not expected to know how to program in Python. The following code is meant to illustrate the stages of data understanding and data preparation, so it is totally fine if you do not understand the individual lines of code. We have a full course on programming in Python, <a href="http://cocl.us/PY0101EN_DS0103EN_LAB3_PYTHON_Coursera"><strong>Python for Data Science</strong></a>, which is also offered on Coursera, so make sure to complete it if you are interested in learning how to program in Python.

### Using this notebook:

To run any of the following cells of code, press **Shift + Enter** to execute the code in a cell.

Import the libraries that we will need to run this lab.

In [1]:
import re  # import library for regular expression
import pandas as pd  # import library to read data into dataframe
import numpy as np  # import numpy library

Download the data from the IBM server and read it into a *pandas* dataframe.

In [2]:
recipes = pd.read_csv(
    "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/data/recipes.csv"
)
print("Data read into dataframe!")  # takes about 30 seconds

Data read into dataframe!


Show the first few rows.

In [3]:
print(recipes.head())

      country almond angelica anise anise_seed apple apple_brandy apricot  \
0  Vietnamese     No       No    No         No    No           No      No   
1  Vietnamese     No       No    No         No    No           No      No   
2  Vietnamese     No       No    No         No    No           No      No   
3  Vietnamese     No       No    No         No    No           No      No   
4  Vietnamese     No       No    No         No    No           No      No   

  armagnac artemisia  ... whiskey white_bread white_wine  \
0       No        No  ...      No          No         No   
1       No        No  ...      No          No         No   
2       No        No  ...      No          No         No   
3       No        No  ...      No          No         No   
4       No        No  ...      No          No         No   

  whole_grain_wheat_flour wine wood yam yeast yogurt zucchini  
0                      No   No   No  No    No     No       No  
1                      No   No   No  No    No   

Get the dimensions of the dataframe.

In [4]:
print(recipes.shape)

(57691, 384)


So our dataset consists of 57,691 recipes. Each row represents a recipe. The first column documents the corresponding cuisine, and the remaining 383 columns record whether each ingredient, beginning with almond and ending with zucchini, exists in the recipe or not.

We know that a basic sushi recipe includes the ingredients:
* rice
* soy sauce
* wasabi
* some fish/vegetables

Let's check that these ingredients exist in our dataframe:

In [5]:
ingredients = list(recipes.columns.values)
print(
    [
        match.group(0)
        for ingredient in ingredients
        for match in [(re.compile(".*(rice).*")).search(ingredient)]
        if match
    ]
)
print(
    [
        match.group(0)
        for ingredient in ingredients
        for match in [(re.compile(".*(wasabi).*")).search(ingredient)]
        if match
    ]
)
print(
    [
        match.group(0)
        for ingredient in ingredients
        for match in [(re.compile(".*(soy).*")).search(ingredient)]
        if match
    ]
)

['brown_rice', 'licorice', 'rice']
['wasabi']
['soy_sauce', 'soybean', 'soybean_oil']


Yes, they do!

* rice exists as rice.
* wasabi exists as wasabi.
* soy exists as soy_sauce.
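The three nearly identical searches above can be folded into one small helper. The following is a minimal sketch, not the lab's own code: `find_ingredients` is a hypothetical name, and the short `columns` list is a hand-picked sample of column names taken from the search output above rather than the full dataframe.

```python
import re


def find_ingredients(names, pattern):
    """Return every name that contains `pattern` (regular expression search)."""
    regex = re.compile(pattern)  # compile once, reuse for every name
    return [name for name in names if regex.search(name)]


# hypothetical sample of column names copied from the search output above
columns = ["almond", "brown_rice", "licorice", "rice", "soy_sauce",
           "soybean", "soybean_oil", "wasabi"]

for term in ["rice", "wasabi", "soy"]:
    print(term, "->", find_ingredients(columns, term))

# a substring search also hits unrelated names such as "licorice",
# so confirm the exact column name before relying on it
exact_rice = "rice" in columns
print("exact 'rice' column present:", exact_rice)
```

Note that the search is deliberately loose: it is useful for discovering candidate column names, while the exact membership test is what confirms that a column such as `rice` really exists.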

So maybe if a recipe contains all three ingredients (rice, wasabi, and soy_sauce), then we can confidently say that the recipe belongs to the **Japanese** cuisine! Let's keep this in mind!

----------------

# Data Preparation <a id="4"></a>

<img src="https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab3_fig5_flowchart_data_preparation.png" width=500>

In this section, we will prepare data for the next stage in the data science methodology, which is modeling. This stage involves exploring the data further and making sure that it is in the right format for the machine learning algorithm that we selected in the analytic approach stage, which is decision trees.

First, look at the data to see if it needs cleaning.

In [6]:
print(recipes["country"].value_counts()) # frequency table

American        40150
Mexico           1754
Italian          1715
Italy            1461
Asian            1176
                ...  
Indonesia          12
Belgium            11
East-African       11
Israel              9
Bangladesh          4
Name: country, Length: 69, dtype: int64


By looking at the above table, we can make the following observations:

1. The cuisine column is labeled as country, which is inaccurate.
2. Cuisine names are not consistent, as not all of them start with an uppercase first letter.
3. Some cuisines are duplicated as variations of the country name, such as Vietnam and Vietnamese.
4. Some cuisines have very few recipes.

#### Let's fix these problems.

Fix the name of the column showing the cuisine.

In [7]:
# rename the first column from "country" to "cuisine"
recipes = recipes.rename(columns={"country": "cuisine"})
print(recipes)

          cuisine almond angelica anise anise_seed apple apple_brandy apricot  \
0      Vietnamese     No       No    No         No    No           No      No   
1      Vietnamese     No       No    No         No    No           No      No   
2      Vietnamese     No       No    No         No    No           No      No   
3      Vietnamese     No       No    No         No    No           No      No   
4      Vietnamese     No       No    No         No    No           No      No   
...           ...    ...      ...   ...        ...   ...          ...     ...   
57686       Japan     No       No    No         No    No           No      No   
57687       Japan     No       No    No         No    No           No      No   
57688       Japan     No       No    No         No    No           No      No   
57689       Japan     No       No    No         No    No           No      No   
57690       Japan     No       No    No         No    No           No      No   

      armagnac artemisia  .

Make all the cuisine names lowercase.

In [8]:
recipes["cuisine"] = recipes["cuisine"].str.lower()

Make the cuisine names consistent.

In [9]:
recipes.loc[recipes["cuisine"] == "austria", "cuisine"] = "austrian"
recipes.loc[recipes["cuisine"] == "belgium", "cuisine"] = "belgian"
recipes.loc[recipes["cuisine"] == "china", "cuisine"] = "chinese"
recipes.loc[recipes["cuisine"] == "canada", "cuisine"] = "canadian"
recipes.loc[recipes["cuisine"] == "netherlands", "cuisine"] = "dutch"
recipes.loc[recipes["cuisine"] == "france", "cuisine"] = "french"
recipes.loc[recipes["cuisine"] == "germany", "cuisine"] = "german"
recipes.loc[recipes["cuisine"] == "india", "cuisine"] = "indian"
recipes.loc[recipes["cuisine"] == "indonesia", "cuisine"] = "indonesian"
recipes.loc[recipes["cuisine"] == "iran", "cuisine"] = "iranian"
recipes.loc[recipes["cuisine"] == "italy", "cuisine"] = "italian"
recipes.loc[recipes["cuisine"] == "japan", "cuisine"] = "japanese"
recipes.loc[recipes["cuisine"] == "israel", "cuisine"] = "jewish"
recipes.loc[recipes["cuisine"] == "korea", "cuisine"] = "korean"
recipes.loc[recipes["cuisine"] == "lebanon", "cuisine"] = "lebanese"
recipes.loc[recipes["cuisine"] == "malaysia", "cuisine"] = "malaysian"
recipes.loc[recipes["cuisine"] == "mexico", "cuisine"] = "mexican"
recipes.loc[recipes["cuisine"] == "pakistan", "cuisine"] = "pakistani"
recipes.loc[recipes["cuisine"] == "philippines", "cuisine"] = "philippine"
recipes.loc[recipes["cuisine"] == "scandinavia", "cuisine"] = "scandinavian"
recipes.loc[recipes["cuisine"] == "spain", "cuisine"] = "spanish_portuguese"
recipes.loc[recipes["cuisine"] == "portugal", "cuisine"] = "spanish_portuguese"
recipes.loc[recipes["cuisine"] == "switzerland", "cuisine"] = "swiss"
recipes.loc[recipes["cuisine"] == "thailand", "cuisine"] = "thai"
recipes.loc[recipes["cuisine"] == "turkey", "cuisine"] = "turkish"
recipes.loc[recipes["cuisine"] == "vietnam", "cuisine"] = "vietnamese"
recipes.loc[recipes["cuisine"] == "uk-and-ireland", "cuisine"] = "uk-and-irish"
recipes.loc[recipes["cuisine"] == "irish", "cuisine"] = "uk-and-irish"

print(recipes)

          cuisine almond angelica anise anise_seed apple apple_brandy apricot  \
0      vietnamese     No       No    No         No    No           No      No   
1      vietnamese     No       No    No         No    No           No      No   
2      vietnamese     No       No    No         No    No           No      No   
3      vietnamese     No       No    No         No    No           No      No   
4      vietnamese     No       No    No         No    No           No      No   
...           ...    ...      ...   ...        ...   ...          ...     ...   
57686    japanese     No       No    No         No    No           No      No   
57687    japanese     No       No    No         No    No           No      No   
57688    japanese     No       No    No         No    No           No      No   
57689    japanese     No       No    No         No    No           No      No   
57690    japanese     No       No    No         No    No           No      No   

      armagnac artemisia  .
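The long list of `.loc` assignments can also be expressed as a lookup table. The following is a minimal sketch on a toy series, not the lab's dataframe: the `cuisine` values are an assumed sample, and `name_map` is an illustrative subset of the replacements above. Passing a dictionary to `Series.replace` leaves values that have no entry unchanged.

```python
import pandas as pd

# toy cuisine labels standing in for recipes["cuisine"] (assumed sample)
cuisine = pd.Series(
    ["Vietnam", "Vietnamese", "Japan", "japanese", "Italy", "American"]
)

# illustrative subset of the country -> cuisine replacements used above
name_map = {"vietnam": "vietnamese", "japan": "japanese", "italy": "italian"}

# lowercase first, then map country names onto cuisine names
cleaned = cuisine.str.lower().replace(name_map)
print(cleaned.tolist())
```

Keeping the mapping in one dictionary makes it easier to review for typos and to extend when a new label variant shows up.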

Remove cuisines with 50 or fewer recipes.

In [10]:
recipes_counts = recipes["cuisine"].value_counts()  # number of recipes per cuisine
# keep only the cuisines that have more than 50 recipes
cuisines_to_keep = recipes_counts[recipes_counts > 50].index.tolist()

In [11]:
rows_before = recipes.shape[0]  # number of rows of original dataframe
print(f"Number of rows of original dataframe is {rows_before}.")
recipes = recipes.loc[recipes["cuisine"].isin(cuisines_to_keep)]
rows_after = recipes.shape[0]  # number of rows of processed dataframe
print(f"Number of rows of processed dataframe is {rows_after}.")
print(f"{rows_before - rows_after} rows removed!")

Number of rows of original dataframe is 57691.
Number of rows of processed dataframe is 57403.
288 rows removed!
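The filter keeps a cuisine only when its count is strictly greater than the threshold, so a cuisine with exactly 50 recipes is dropped as well. A minimal sketch of the same pattern on a toy dataframe, with a hypothetical threshold `min_recipes` of 2 standing in for 50:

```python
import pandas as pd

# toy labels: "a" appears 3 times, "b" twice, "c" once (assumed sample)
toy = pd.DataFrame({"cuisine": ["a", "a", "a", "b", "b", "c"]})
min_recipes = 2  # hypothetical threshold; the lab uses 50

counts = toy["cuisine"].value_counts()
keep = counts[counts > min_recipes].index  # strictly more than the threshold
filtered = toy[toy["cuisine"].isin(keep)]

print(counts.to_dict())
print(f"{len(toy) - len(filtered)} rows removed")
```

Here cuisine "b" sits exactly on the threshold and is removed together with "c", which mirrors what `> 50` does in the lab.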


Convert all the Yes's to 1's and all the No's to 0's.

In [12]:
recipes = recipes.replace(to_replace="Yes", value=1)
recipes = recipes.replace(to_replace="No", value=0)
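An alternative way to obtain the same 0/1 encoding is to compare the ingredient columns against "Yes" and cast the result to integers. This is a sketch on an assumed toy dataframe, not the lab's method, and it differs in one respect: any value other than "Yes" becomes 0, not only "No".

```python
import pandas as pd

# toy slice with the same layout as the lab's dataframe (assumed sample)
toy = pd.DataFrame({
    "cuisine": ["japanese", "asian"],
    "rice": ["Yes", "No"],
    "wasabi": ["Yes", "Yes"],
})

ingredient_cols = toy.columns.drop("cuisine")  # every column except the label
flags = toy[ingredient_cols].eq("Yes").astype(int)  # True/False -> 1/0
converted = pd.concat([toy[["cuisine"]], flags], axis=1)

print(converted)
print(converted.dtypes)
```

Restricting the conversion to the ingredient columns leaves the cuisine labels untouched and guarantees integer columns for the later sums and means.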

#### Let's analyze the data a little more in order to learn the data better and note any interesting preliminary observations.

Run the following cells to confirm the conversion and then get the recipes that contain **rice** *and* **soy_sauce** *and* **wasabi** *and* **seaweed**.

In [13]:
print(recipes.head())

      cuisine  almond  angelica  anise  anise_seed  apple  apple_brandy  \
0  vietnamese       0         0      0           0      0             0   
1  vietnamese       0         0      0           0      0             0   
2  vietnamese       0         0      0           0      0             0   
3  vietnamese       0         0      0           0      0             0   
4  vietnamese       0         0      0           0      0             0   

   apricot  armagnac  artemisia  ...  whiskey  white_bread  white_wine  \
0        0         0          0  ...        0            0           0   
1        0         0          0  ...        0            0           0   
2        0         0          0  ...        0            0           0   
3        0         0          0  ...        0            0           0   
4        0         0          0  ...        0            0           0   

   whole_grain_wheat_flour  wine  wood  yam  yeast  yogurt  zucchini  
0                        0     0 

In [14]:
check_recipes = recipes.loc[
    (recipes["rice"] == 1)
    & (recipes["soy_sauce"] == 1)
    & (recipes["wasabi"] == 1)
    & (recipes["seaweed"] == 1)
]
print(check_recipes)

          cuisine  almond  angelica  anise  anise_seed  apple  apple_brandy  \
11306    japanese       0         0      0           0      0             0   
11321    japanese       0         0      0           0      0             0   
11361    japanese       0         0      0           0      0             0   
12171       asian       0         0      0           0      0             0   
12385       asian       0         0      0           0      0             0   
13010       asian       0         0      0           0      0             0   
13159       asian       0         0      0           0      0             0   
13513    japanese       0         0      0           0      0             0   
13586    japanese       0         0      0           0      0             0   
13625  east_asian       0         0      0           0      0             0   
14495  east_asian       0         0      0           0      0             0   

       apricot  armagnac  artemisia  ...  whiskey  

Based on the results of the above code, can we classify all recipes that contain **rice** *and* **soy_sauce** *and* **wasabi** *and* **seaweed** as **Japanese** recipes? Why?

Your Answer: no

Double-click __here__ for the solution.
<!-- The correct answer is:
No, because other recipes such as Asian and East_Asian recipes also contain these ingredients.
-->
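One way to make this check explicit is to count the cuisines among the matching recipes instead of reading them off the printed rows. The sketch below uses an assumed toy dataframe in the 0/1 format (the rows are made up for illustration, not real recipes), and `wanted` is a hypothetical helper list.

```python
import pandas as pd

# toy rows in the 0/1 format produced above (assumed sample, not real recipes)
toy = pd.DataFrame({
    "cuisine": ["japanese", "asian", "east_asian", "japanese", "american"],
    "rice": [1, 1, 1, 1, 1],
    "soy_sauce": [1, 1, 1, 0, 0],
    "wasabi": [1, 1, 1, 1, 0],
    "seaweed": [1, 1, 1, 1, 0],
})

wanted = ["rice", "soy_sauce", "wasabi", "seaweed"]
has_all = toy[wanted].eq(1).all(axis=1)  # True where every listed ingredient is present
matches = toy.loc[has_all, "cuisine"].value_counts()
print(matches)
```

If more than one cuisine appears in `matches`, the ingredient rule alone cannot identify a recipe as Japanese.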

Let's count the ingredients across all recipes.

In [15]:
ing = recipes.iloc[:, 1:].sum(axis=0)

In [16]:
# define each column as a pandas series
ingredient = pd.Series(ing.index.values, index=np.arange(len(ing)))
count = pd.Series(list(ing), index=np.arange(len(ing)))
# create the dataframe
ing_df = pd.DataFrame(dict(ingredient=ingredient, count=count))
ing_df = ing_df[["ingredient", "count"]]
print(ing_df.to_string())

                  ingredient  count
0                     almond   2306
1                   angelica      1
2                      anise    223
3                 anise_seed     87
4                      apple   2422
5               apple_brandy     37
6                    apricot    620
7                   armagnac     11
8                  artemisia     13
9                  artichoke    391
10                 asparagus    460
11                   avocado    660
12                     bacon   2169
13              baked_potato      9
14                      balm      3
15                    banana    989
16                    barley    266
17             bartlett_pear     23
18                     basil   3842
19                       bay   1463
20                      bean   1992
21                     beech      1
22                      beef   4902
23                beef_broth    845
24                beef_liver     10
25                      beer    307
26                      beet

Now we have a dataframe of ingredients and their total counts across all recipes. Let's sort this dataframe in descending order.

In [17]:
ing_df.sort_values(["count"], ascending=False, inplace=True)
ing_df.reset_index(inplace=True, drop=True)
print(ing_df)

          ingredient  count
0                egg  21025
1              wheat  20781
2             butter  20719
3              onion  18080
4             garlic  17353
..               ...    ...
378   strawberry_jam      1
379  sturgeon_caviar      1
380      kaffir_lime      1
381            beech      1
382           durian      0

[383 rows x 2 columns]
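The counting and sorting steps can also be chained into a single expression. The following is a minimal sketch on an assumed toy dataframe; it is meant to show the shape of the computation, not to replace the cells above.

```python
import pandas as pd

# toy recipes in the 0/1 format (assumed sample)
toy = pd.DataFrame({
    "cuisine": ["american", "american", "japanese"],
    "egg": [1, 1, 1],
    "rice": [0, 1, 1],
    "wasabi": [0, 0, 1],
})

ing_counts = (
    toy.drop(columns="cuisine")      # keep only the 0/1 ingredient columns
    .sum()                           # number of recipes containing each ingredient
    .sort_values(ascending=False)    # most common ingredient first
    .rename_axis("ingredient")
    .reset_index(name="count")       # back to a two-column dataframe
)
print(ing_counts)
```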


However, note that there is a problem with the above table. About 70% of the recipes in our dataset (~40,000 of 57,403) are American, which means that the totals are biased towards ingredients that are common in American recipes.

**Therefore**, let's compute a more objective summary of the ingredients by looking at the ingredients per cuisine.

#### Let's create a *profile* for each cuisine.

In other words, let's try to find out which ingredients are typical of each cuisine, for example what Chinese recipes usually contain or what characterizes **Canadian** food.

In [18]:
cuisines = recipes.groupby("cuisine").mean()
print(cuisines.head())

                almond  angelica     anise  anise_seed     apple  \
cuisine                                                            
african       0.156522  0.000000  0.000000    0.000000  0.034783   
american      0.040598  0.000025  0.003014    0.000573  0.052055   
asian         0.007544  0.000000  0.000838    0.002515  0.012573   
cajun_creole  0.000000  0.000000  0.000000    0.000000  0.006849   
canadian      0.036176  0.000000  0.000000    0.000000  0.036176   

              apple_brandy   apricot  armagnac  artemisia  artichoke  ...  \
cuisine                                                               ...   
african           0.000000  0.069565    0.0000        0.0   0.000000  ...   
american          0.000623  0.011308    0.0001        0.0   0.006351  ...   
asian             0.000000  0.005029    0.0000        0.0   0.000000  ...   
cajun_creole      0.000000  0.000000    0.0000        0.0   0.000000  ...   
canadian          0.000000  0.002584    0.0000        0.0   0

As shown above, we have just created a dataframe where each row is a cuisine (stored in the index) and each column is an ingredient. The values represent the fraction of recipes in the corresponding cuisine that contain each ingredient, so multiplying by 100 gives percentages.

**For example**:

* *almond* is present across 15.65% of all of the **African** recipes.
* *butter* is present across 38.11% of all of the **Canadian** recipes.
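Because every ingredient column holds only 0s and 1s, the mean of a column within a cuisine equals the fraction of that cuisine's recipes containing the ingredient. The sketch below illustrates this on an assumed toy dataframe and contrasts it with the raw totals, which are dominated by the larger cuisine.

```python
import pandas as pd

# toy data: four "american" recipes and one "japanese" recipe (assumed sample)
toy = pd.DataFrame({
    "cuisine": ["american"] * 4 + ["japanese"],
    "butter": [1, 1, 1, 0, 0],
    "rice": [0, 0, 0, 1, 1],
})

totals = toy.drop(columns="cuisine").sum()   # dominated by the larger cuisine
profile = toy.groupby("cuisine").mean()      # fraction of recipes per cuisine
percent = (profile * 100).round(2)           # same values as percentages

print(totals)
print(percent)
```

In the totals, butter looks more common than rice, yet within the Japanese group rice appears in every recipe. This is the distortion that the per-cuisine profile removes.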

Let's print out the profile for each cuisine by displaying the top four ingredients in each cuisine.

In [19]:
num_ingredients = 4  # define number of top ingredients to print
# define a function that prints the top ingredients for each cuisine
def print_top_ingredients(row):
    print(row.name.upper())
    row_sorted = row.sort_values(ascending=False) * 100
    top_ingredients = list(row_sorted.index.values)[0:num_ingredients]
    row_sorted = list(row_sorted)[0:num_ingredients]
    for ind, ingredient_ in enumerate(top_ingredients):
        print(f"{ingredient_} ({row_sorted[ind]})", end=" ")
    print("\n")


# apply function to cuisines dataframe
create_cuisines_profiles = cuisines.apply(print_top_ingredients, axis=1)

AFRICAN
onion (53.04347826086957) olive_oil (52.17391304347826) garlic (49.56521739130435) cumin (42.608695652173914) 

AMERICAN
butter (41.158156911581564) egg (40.51307596513076) wheat (39.84059775840598) onion (29.332503113325032) 

ASIAN
soy_sauce (49.62279966471081) ginger (48.61693210393965) garlic (47.946353730092206) rice (41.3243922883487) 

CAJUN_CREOLE
onion (69.86301369863014) cayenne (56.16438356164384) garlic (48.63013698630137) butter (36.3013698630137) 

CANADIAN
wheat (39.53488372093023) butter (38.11369509043928) egg (35.400516795865634) onion (34.366925064599485) 

CARIBBEAN
onion (51.36612021857923) garlic (50.81967213114754) vegetable_oil (31.147540983606557) black_pepper (31.147540983606557) 

CENTRAL_SOUTHAMERICAN
garlic (56.84647302904564) onion (54.356846473029044) cayenne (51.867219917012456) tomato (41.49377593360996) 

CHINESE
soy_sauce (68.55203619909503) ginger (53.39366515837104) garlic (52.94117647058824) scallion (48.19004524886878) 

EAST_ASIAN
garlic 
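The raw floats in the output above are hard to scan. One possible variation, sketched here on an assumed toy profile table with made-up fractions, uses `Series.nlargest` to pick the top entries and formats them to one decimal place; `top_ingredients` is a hypothetical helper, not part of the lab.

```python
import pandas as pd

# toy profile table: rows are cuisines, values are fractions (assumed numbers)
profiles = pd.DataFrame(
    {
        "garlic": [0.50, 0.10],
        "soy_sauce": [0.70, 0.05],
        "butter": [0.05, 0.40],
        "egg": [0.20, 0.45],
    },
    index=["chinese", "canadian"],
)


def top_ingredients(row, n=2):
    """Return the n largest values of a profile row as formatted strings."""
    top = row.nlargest(n) * 100  # convert fractions to percentages
    return [f"{name} ({value:.1f}%)" for name, value in top.items()]


for cuisine_name, row in profiles.iterrows():
    print(cuisine_name.upper(), " ".join(top_ingredients(row)))
```

Returning the formatted strings instead of printing inside the function makes the helper easier to reuse and to check.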

At this point, we feel that we understand the data well and that it is ready and in the right format for modeling!

-----------

### Thank you for completing this lab!

This notebook was created by [Alex Aklson](https://www.linkedin.com/in/aklson/). We hope you found this lab session interesting. Feel free to contact us if you have any questions!

This notebook is part of a course on **Coursera** called *Data Science Methodology*. If you accessed this notebook outside the course, you can take this course online by clicking [here](https://cocl.us/DS0103EN_Coursera_LAB3).

<hr>

Copyright &copy; 2019 [Cognitive Class](https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).