## Data Exploration

### 1. Import:

In [5]:
import pandas as pd
import numpy as np 
import os

### 2. Data exploration:

#### 2.1. Read data from file:

In [6]:
path = os.path.join('..', 'Assert', 'ingredients.csv')
raw_df = pd.read_csv(path)
raw_df.head()

Unnamed: 0,Name of dish,active yeast,agave nectar,all-purpose flour,almond,almond extract,almond flour,almond milk,aloe vera,amaretto,...,yellow bell peppers,yellow lemon peel,yellow mustard,yellow pepper,yellow sweet potatoes,yogurt,yogurt drink,yuzu juice,yuzu sauce,zucchini
0,Change the taste with strange and delicious mi...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Spaghetti with Meatballs in Tomato Sauce, Quee...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,European Style Baked Spiced Potatoes,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Spicy and Flavorful Braised Pork with Pepper, ...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Simple, Delicious, and Mesmerizing Swedish Bak...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### 2.2. Explore data:

##### - How many rows and how many columns?

In [7]:
shape = raw_df.shape
shape

(682, 803)

The raw dataset has *...* rows and *...* columns, which satisfies the dataset size requirements.

##### - What is the meaning of each row?

Each row describes one dish: its name and which ingredients it uses.

##### - Are there duplicated rows?

In [8]:
def count_duplicated_rows(df: pd.DataFrame) -> int:
    # rows that repeat an earlier row; the first occurrence is not counted
    duplicated_df = df[df.duplicated(keep='first')]
    num_duplicated = duplicated_df.shape[0]
    return num_duplicated

In [9]:
print(f"Number of duplicated rows: {count_duplicated_rows(raw_df)}")

Number of duplicated rows: 69


The raw dataset contains ... duplicated rows, so let's remove them.

In [10]:
raw_df = raw_df.drop_duplicates()

In [11]:
print(f"Number of duplicated rows after update raw data: {count_duplicated_rows(raw_df)}")

Number of duplicated rows after update raw data: 0
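
To make the behaviour of `duplicated(keep='first')` and `drop_duplicates()` concrete, here is a minimal sketch on a made-up three-row frame. The name `toy_df` and its values are illustrative assumptions and are not taken from `ingredients.csv`.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical in every column.
toy_df = pd.DataFrame({
    "Name of dish": ["pho", "pasta", "pho"],
    "basil": [1, 0, 1],
    "tomato": [0, 1, 0],
})

# keep='first' flags only the later copies, never the first occurrence.
dup_mask = toy_df.duplicated(keep="first")
print(dup_mask.tolist())

# drop_duplicates keeps the first occurrence and preserves the original index labels.
deduped = toy_df.drop_duplicates()
print(deduped.index.tolist())
```

Because the original index labels are preserved, the index may have gaps after deduplication; `reset_index(drop=True)` can be applied if a contiguous index is preferred.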


##### - What is the meaning of each column?

- The first column, `Name of dish`, contains the name of each dish.
- Each remaining column represents one ingredient: a cell is 1 if the dish uses that ingredient and 0 if it does not.
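
As a sketch of how this 0/1 layout can be read back, the hypothetical helper `ingredients_of` below lists the ingredients of one dish. The helper name and the toy data are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data in the same layout: a name column followed by 0/1 ingredient columns.
toy_df = pd.DataFrame({
    "Name of dish": ["garlic bread", "tomato soup"],
    "butter": [1, 0],
    "garlic": [1, 1],
    "tomato": [0, 1],
})

def ingredients_of(df: pd.DataFrame, dish: str) -> list:
    """Return the ingredient columns flagged with 1 for the first row matching `dish`."""
    flags = df.loc[df["Name of dish"] == dish].iloc[0, 1:]
    return flags[flags == 1].index.tolist()

print(ingredients_of(toy_df, "garlic bread"))
```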

##### - What is the current data type of each column? Are there columns having inappropriate data types?

In [12]:
# convert to a dictionary so that the data type of every column is visible
raw_df.dtypes.to_dict()

{'Name of dish': dtype('O'),
 'active yeast': dtype('int64'),
 'agave nectar': dtype('int64'),
 'all-purpose flour': dtype('int64'),
 'almond': dtype('int64'),
 'almond extract': dtype('int64'),
 'almond flour': dtype('int64'),
 'almond milk': dtype('int64'),
 'aloe vera': dtype('int64'),
 'amaretto': dtype('int64'),
 'american beef belly': dtype('int64'),
 'anchovy': dtype('int64'),
 'annatto powder': dtype('int64'),
 'annatto seeds': dtype('int64'),
 'apple': dtype('int64'),
 'apple cider vinegar': dtype('int64'),
 'apple juice': dtype('int64'),
 'apricot': dtype('int64'),
 'apricot jam': dtype('int64'),
 'arrowroot powder': dtype('int64'),
 'artichoke': dtype('int64'),
 'asparagus': dtype('int64'),
 'avocado': dtype('int64'),
 'back ribs tips': dtype('int64'),
 'back-fat': dtype('int64'),
 'bacon': dtype('int64'),
 'bagel': dtype('int64'),
 'baguette': dtype('int64'),
 'baking powder': dtype('int64'),
 'baking soda': dtype('int64'),
 'balsamic vinegar': dtype('int64'),
 'balut eggs'

All columns of the raw data appear to have data types appropriate to their meaning: the dish name is stored as `object` (string) and the ingredient flags shown above are `int64`.
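
With several hundred columns, reading the full dictionary is tedious. One possible shortcut, sketched here on made-up data (`toy_df` is an assumption), is to split the columns into numeric and non-numeric groups with `select_dtypes` and inspect only the short non-numeric list.

```python
import pandas as pd

# Hypothetical toy data: one text column and two integer flag columns.
toy_df = pd.DataFrame({
    "Name of dish": ["a", "b"],
    "salt": [1, 0],
    "sugar": [0, 1],
})

# Columns pandas treats as numeric versus everything else.
numeric_names = toy_df.select_dtypes(include="number").columns.tolist()
non_numeric_names = toy_df.select_dtypes(exclude="number").columns.tolist()

print(numeric_names)
print(non_numeric_names)
```

If only the dish-name column shows up in the non-numeric list, every ingredient column is stored numerically.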

##### - With each numerical column, how are values distributed?

- What is the percentage of missing values?

In [13]:
# collect numerical columns' data
numerical_cols = raw_df[raw_df.keys()[1:]]
numerical_cols.head()

Unnamed: 0,active yeast,agave nectar,all-purpose flour,almond,almond extract,almond flour,almond milk,aloe vera,amaretto,american beef belly,...,yellow bell peppers,yellow lemon peel,yellow mustard,yellow pepper,yellow sweet potatoes,yogurt,yogurt drink,yuzu juice,yuzu sauce,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
# check if there are any missing values
missing_values = numerical_cols.isnull().sum()
missing_values

active yeast         0
agave nectar         0
all-purpose flour    0
almond               0
almond extract       0
                    ..
yogurt               0
yogurt drink         0
yuzu juice           0
yuzu sauce           0
zucchini             0
Length: 802, dtype: int64

In [15]:
if all(missing_values == 0):
    print("No missing values in the numerical columns")

No missing values in the numerical columns
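
The question above asks for a percentage, while `isnull().sum()` returns counts. A minimal sketch of the percentage form, on made-up data with one missing cell (the names `toy_cols` and `missing_pct` are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'salt' has one missing value out of four rows.
toy_cols = pd.DataFrame({
    "salt": [1, 0, np.nan, 1],
    "sugar": [0, 1, 1, 0],
})

# The mean of a boolean mask is the fraction of True values; scale to percent.
missing_pct = toy_cols.isnull().mean() * 100
print(missing_pct)
```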


- Min? Max? Are they abnormal?

In [25]:

# a column is abnormal if any of its values falls outside the range [0, 1];
# masks are combined element-wise because a chained comparison on a Series raises ValueError
abnormal_mask = (numerical_cols.min() < 0) | (numerical_cols.max() > 1)
if not abnormal_mask.any():
    print("All the values of numerical columns are normal (between 0 and 1).")
else:
    abnormal_columns = numerical_cols.columns[abnormal_mask.to_numpy()]
    print(f"There are some values that are not normal in columns: {list(abnormal_columns)}.")

All the values of numerical columns are normal (between 0 and 1).
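
Since the ingredient columns are meant to be binary flags, a stricter variant is to test that every cell is exactly 0 or 1 rather than only checking the minimum and maximum. This is a sketch on made-up data, not the check used above; `toy_cols` and its values are assumptions.

```python
import pandas as pd

# Hypothetical toy data: 'sugar' contains a 2, which is not a valid flag.
toy_cols = pd.DataFrame({
    "salt": [1, 0, 1],
    "sugar": [0, 2, 1],
    "milk": [0, 0, 0],
})

# True for a column only if every cell is exactly 0 or 1.
is_binary = toy_cols.isin([0, 1]).all()
abnormal_columns = is_binary[~is_binary].index.tolist()
print(abnormal_columns)
```

Unlike a min/max check, this would also catch fractional values such as 0.5 that lie inside the range but are not valid flags.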


##### - With each categorical column, how are values distributed?

- What is the percentage of missing values?

In [16]:
# collect categorical columns' data
categorical_col = raw_df[raw_df.keys()[0]]
categorical_col.head()

0    Change the taste with strange and delicious mi...
1    Spaghetti with Meatballs in Tomato Sauce, Quee...
2                 European Style Baked Spiced Potatoes
3    Spicy and Flavorful Braised Pork with Pepper, ...
4    Simple, Delicious, and Mesmerizing Swedish Bak...
Name: Name of dish, dtype: object

In [17]:
# check if there are any missing values
missing_values = categorical_col.isnull().sum()
missing_values

0

In [18]:
if missing_values == 0:
    print("No missing values in the categorical columns")

No missing values in the categorical columns


- How many different values? Show a few


In [19]:
# find unique values
unique_values = categorical_col.unique()
print(f"There are {len(unique_values)} different values in this column.")

There are 601 different values in this column.


In [20]:
print('Some examples of name of dish:\n', unique_values[:10])

Some examples of name of dish:
 ['Change the taste with strange and delicious mixed chicken pho that you can eat forever without getting bored'
 'Spaghetti with Meatballs in Tomato Sauce, Queen of Italian Cuisine'
 'European Style Baked Spiced Potatoes'
 "Spicy and Flavorful Braised Pork with Pepper, a Dish You Can't Put Down Your Chopsticks for on a Winter Day"
 'Simple, Delicious, and Mesmerizing Swedish Baked Hasselback Potatoes'
 "Braised Duck with dracontomelon – A Dish That Defies All Weather, Making Many People Fall in Love Whether It's Winter or Summer"
 'Crispy Outside, Melty Inside Fried Cheese Bread Rolls'
 'Egg-Topped Tortillas - A Quick, Nutritious and Delicious Breakfast'
 'Shrimp Spaghetti with Tomato Cream Sauce - Mildly Sour, Rich, Rich and Extremely Delicious'
 'How to make spring rolls with river bone leaves - a fragrant dish in early winter']
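
When the number of distinct names is lower than the number of rows, some names appear more than once. A possible way to list them is `value_counts`, sketched here on a made-up Series (`toy_names` is an assumption, not data from the file):

```python
import pandas as pd

# Hypothetical toy data: 'pho' appears twice.
toy_names = pd.Series(["pho", "pasta", "pho", "salad"], name="Name of dish")

# Count occurrences per name and keep only names seen more than once.
name_counts = toy_names.value_counts()
repeated = name_counts[name_counts > 1]
print(repeated)
```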


#### Check if the data is reasonable?
 - Is there any dish whose ingredient values are all 0?
 - Is there any column whose values are all 0?

In [21]:
# Check rows: find dishes with no ingredient flagged
all_zero_ingredients = raw_df[(raw_df.iloc[:, 1:] == 0).all(axis=1)]
num_all_zero_ingredients = all_zero_ingredients.shape[0]
if num_all_zero_ingredients > 0:
    print(f"Number of rows that all ingredients are zero: {num_all_zero_ingredients}")
    raw_df = raw_df.drop(all_zero_ingredients.index)
    print("Remove all zero ingredients rows.")
# Check columns: drop ingredients that no remaining dish uses
raw_df = raw_df.loc[:, ~(raw_df == 0).all()]

Number of rows that all ingredients are zero: 1
Remove all zero ingredients rows.
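
The two checks can be traced on a small made-up frame, where row `b` has no ingredients and column `saffron` is never used. The names `toy_df`, `empty_rows`, and `unused` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data: dish 'b' uses nothing, and 'saffron' is used by no dish.
toy_df = pd.DataFrame({
    "Name of dish": ["a", "b", "c"],
    "salt": [1, 0, 0],
    "sugar": [0, 0, 1],
    "saffron": [0, 0, 0],
})

# Rows: True where every ingredient flag (all columns after the name) is 0.
empty_rows = (toy_df.iloc[:, 1:] == 0).all(axis=1)
cleaned = toy_df.loc[~empty_rows]

# Columns: True where an ingredient is 0 for every remaining dish.
unused = (cleaned.iloc[:, 1:] == 0).all(axis=0)
cleaned = cleaned.drop(columns=unused[unused].index)

print(cleaned)
```

Rows are dropped before columns so that an ingredient used only by a removed dish is also removed.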


### 3. Data analysis: