### Libraries

 - pandas - main library used for loading, cleaning and filtering the datasets
 - numpy - for math calculations, handling NaN values, and conditional filtering
 - matplotlib - for visualizations during exploration
 - scipy - for hypothesis testing
 - basemap - for plotting a map of the locations where the products were packaged
 - scikit-learn - preprocessing and logistic regression

### Datasets 

 - __Open Food Facts__ - https://www.kaggle.com/openfoodfacts/world-food-facts - provides information on food products such as ingredients, allergens, and, most importantly, various nutrition facts, which will be very useful in my case<br><br>
 
 - __Nutrition Facts for McDonalds Menu__ - https://www.kaggle.com/mcdonalds/nutrition-facts - provides detailed information on the amount of nutrients contained in each McDonalds product<br><br>
 
 - __Nutrition Facts for Starbucks Menu__ - https://www.kaggle.com/starbucks/starbucks-menu - provides detailed information on the amount of nutrients contained in each Starbucks product

## Step 1 - The Open Food Facts dataset

### Obtaining the dataset

In [1]:
import os
import pandas as pd
import numpy as np
from dotenv import load_dotenv
from minio import Minio
# Test the MinIO connection
import traceback

load_dotenv()  # make sure .env is loaded if it has not been already
endpoint = os.getenv("MINIO_HOST_ENDPOINT") or os.getenv("MINIO_ENDPOINT")
access = os.getenv("MINIO_ACCESS_KEY")
secret = os.getenv("MINIO_SECRET_KEY")
bucket = os.getenv("MINIO_BUCKET")
# Default to "false" so an unset MINIO_SECURE does not raise AttributeError
secure = os.getenv("MINIO_SECURE", "false").lower() == "true"

print("Endpoint:", endpoint)
print("Access:", access)
print("Bucket:", bucket)
print("Secure (bool):", secure)

try:
    client = Minio(
        endpoint,
        access_key=access,
        secret_key=secret,
        secure=secure,
    )
    print("Minio client created, attempting list_buckets()...")
    buckets = client.list_buckets()
    print("Buckets:", [b.name for b in buckets])
    # optional: check a small file if the object key is known
    # info = client.stat_object(bucket, "path/to/object.csv")
    # print("Object size:", info.size)
    print("MinIO connection succeeded.")
except Exception:
    print("MinIO connection failed:")
    traceback.print_exc()

Endpoint: localhost:9000
Access: minio
Bucket: off
Secure (bool): False
Minio client created, attempting list_buckets()...
Buckets: ['off']
MinIO connection succeeded.
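
The `MINIO_SECURE` lookup in the cell above calls `.lower()` on the value returned by `os.getenv`, which is `None` when the variable is unset. A minimal sketch of a more forgiving parser follows. `env_flag` is a hypothetical helper name, not part of the notebook or of any library, and the set of strings treated as true is an assumption.

```python
import os


def env_flag(name, default=False):
    """Read a boolean flag from the environment (hypothetical helper)."""
    raw = os.getenv(name)
    if raw is None:
        # Variable is unset: fall back to the caller's default
        return default
    # Assumption: these strings count as "true"; everything else is false
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# DEMO_SECURE_FLAG is a made-up variable used only for this illustration
os.environ["DEMO_SECURE_FLAG"] = "True"
print(env_flag("DEMO_SECURE_FLAG"))   # True
del os.environ["DEMO_SECURE_FLAG"]
print(env_flag("DEMO_SECURE_FLAG"))   # False
```

With such a helper, the connection cell could use `secure = env_flag("MINIO_SECURE")` and would not fail when the variable is missing from `.env`.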


In [2]:
# Quick preview helper: read the first N rows of a CSV/TSV stored on MinIO (no iterator/chunks)
from IPython.display import display

minio_client = Minio(
    endpoint,
    access_key=access,
    secret_key=secret,
    secure=secure,
)

def load_minio_csv_preview(object_key, nrows=5, bucket=None, **read_csv_kwargs):
    """Return a small DataFrame with first `nrows` from a MinIO object.
    Uses contextlib.closing to ensure the object/connection is released."""
    bucket = bucket or os.getenv("MINIO_BUCKET", "off")
    from contextlib import closing
    # Use closing() to ensure the response is closed after pandas reads from it
    with closing(minio_client.get_object(bucket, object_key)) as obj:
        # pandas supports nrows for quick preview
        return pd.read_csv(obj, nrows=nrows, **read_csv_kwargs)

# Example usage: preview first 5 rows of the large TSV file
preview_df = load_minio_csv_preview(
    f"{os.getenv('RAW_PATH')}/openfoodfacts/en.openfoodfacts.org.products.tsv",
    nrows=5, sep="\t", low_memory=False
)
# Use display() so Jupyter renders a nice table instead of raw text
display(preview_df.head())

Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,...,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g
0,3087,http://world-en.openfoodfacts.org/product/0000...,openfoodfacts-contributors,1474103866,2016-09-17T09:17:46Z,1474103893,2016-09-17T09:18:13Z,Farine de blé noir,,1kg,...,,,,,,,,,,
1,4530,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Banana Chips Sweetened (Whole),,,...,,,,,,,14.0,14.0,,
2,4559,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Peanuts,,,...,,,,,,,0.0,0.0,,
3,16087,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055731,2017-03-09T10:35:31Z,1489055731,2017-03-09T10:35:31Z,Organic Salted Nut Mix,,,...,,,,,,,12.0,12.0,,
4,16094,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055653,2017-03-09T10:34:13Z,1489055653,2017-03-09T10:34:13Z,Organic Polenta,,,...,,,,,,,,,,
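
`load_minio_csv_preview` relies on `pd.read_csv` accepting any file-like object and stopping after `nrows` rows. The sketch below shows the same pattern without a MinIO server, using an in-memory `io.BytesIO` buffer as a stand-in for the object returned by `get_object`. The sample rows are made up for illustration.

```python
import io
from contextlib import closing

import pandas as pd

# Made-up TSV content standing in for the remote object
fake_object = io.BytesIO(
    b"code\tproduct_name\tfat_100g\n"
    b"1\tPeanuts\t17.86\n"
    b"2\tPolenta\t1.43\n"
    b"3\tRoot Beer\t0.0\n"
)

# closing() calls .close() on the buffer when the block exits,
# mirroring how the helper releases the MinIO response
with closing(fake_object) as obj:
    preview = pd.read_csv(obj, sep="\t", nrows=2)

print(preview.shape)  # (2, 3)
```

Only the first two data rows are parsed, and the buffer is closed once the `with` block ends.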


In [3]:
minio_client = Minio(
    endpoint,
    access_key=access,
    secret_key=secret,
    secure=secure,
)


dtype_dict = {
    "product_name": "category",
    "main_category": "category",
    "packaging": "category",
    "nutrition-score-fr_100g": "float32",
    "fat_100g": "float32",
    "nutrition_grade_fr": "category",
    "carbohydrates_100g": "float32",
    "proteins_100g": "float32",
    "ingredients_from_palm_oil_n": "float32",  
    "additives_n": "float32",
    "first_packaging_code_geo": "category",
}

object_key = f"{os.getenv('RAW_PATH')}/openfoodfacts/en.openfoodfacts.org.products.tsv"

obj = minio_client.get_object(bucket_name=bucket, object_name=object_key)

# Load the full TSV, applying the explicit dtypes defined above
world_food_data = pd.read_csv(
    obj,
    sep="\t",
    dtype=dtype_dict,
)

world_food_data.head()



Unnamed: 0,code,url,creator,created_t,created_datetime,last_modified_t,last_modified_datetime,product_name,generic_name,quantity,...,fruits-vegetables-nuts_100g,fruits-vegetables-nuts-estimate_100g,collagen-meat-protein-ratio_100g,cocoa_100g,chlorophyl_100g,carbon-footprint_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,glycemic-index_100g,water-hardness_100g
0,3087,http://world-en.openfoodfacts.org/product/0000...,openfoodfacts-contributors,1474103866,2016-09-17T09:17:46Z,1474103893,2016-09-17T09:18:13Z,Farine de blé noir,,1kg,...,,,,,,,,,,
1,4530,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Banana Chips Sweetened (Whole),,,...,,,,,,,14.0,14.0,,
2,4559,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489069957,2017-03-09T14:32:37Z,1489069957,2017-03-09T14:32:37Z,Peanuts,,,...,,,,,,,0.0,0.0,,
3,16087,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055731,2017-03-09T10:35:31Z,1489055731,2017-03-09T10:35:31Z,Organic Salted Nut Mix,,,...,,,,,,,12.0,12.0,,
4,16094,http://world-en.openfoodfacts.org/product/0000...,usda-ndb-import,1489055653,2017-03-09T10:34:13Z,1489055653,2017-03-09T10:34:13Z,Organic Polenta,,,...,,,,,,,,,,


In [4]:

print("Total {} observations on {} features".format(world_food_data.shape[0],world_food_data.shape[1]))

Total 356027 observations on 163 features


### Cleaning the dataset

With 163 columns, the dataframe is too wide for comfortable exploration, so I'm going to keep only the features I'll be using later on in part 2:
`["product_name","packaging","main_category","nutrition_grade_fr",`<br>`"nutrition-score-fr_100g","fat_100g","carbohydrates_100g","proteins_100g","additives_n",`<br>`"ingredients_from_palm_oil_n","first_packaging_code_geo"]`<br>
These columns are the main factors in the questions I set out to explore. The last column holds the coordinates of the place where each product was packaged; I will keep it to confirm the product locations. After that, I'm going to rename some of the columns so that their names are shorter and can be used as Python attributes.<br>
The dataframe should then have only 11 columns left.

In [5]:
cols_to_keep=["product_name","packaging","main_category","nutrition_grade_fr",
              "nutrition-score-fr_100g","fat_100g","carbohydrates_100g","proteins_100g",
               "additives_n","ingredients_from_palm_oil_n","first_packaging_code_geo"]
world_food_data = world_food_data[cols_to_keep]
world_food_data=world_food_data.rename(columns={"nutrition-score-fr_100g":"nutrition_score",
                                                "fat_100g":"fat_g",
                                               "carbohydrates_100g":"carbohydrates_g",
                                               "proteins_100g":"proteins_g"})

In [6]:
world_food_data.head()

Unnamed: 0,product_name,packaging,main_category,nutrition_grade_fr,nutrition_score,fat_g,carbohydrates_g,proteins_g,additives_n,ingredients_from_palm_oil_n,first_packaging_code_geo
0,Farine de blé noir,,,,,,,,,,
1,Banana Chips Sweetened (Whole),,,d,14.0,28.57,64.290001,3.57,0.0,0.0,
2,Peanuts,,,b,0.0,17.860001,60.709999,17.860001,0.0,0.0,
3,Organic Salted Nut Mix,,,d,12.0,57.139999,17.860001,17.860001,0.0,0.0,
4,Organic Polenta,,,,,1.43,77.139999,8.57,0.0,0.0,
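
Since only 11 of the 163 columns survive this step, another option is to select them while reading, which avoids materializing the full frame. The sketch below demonstrates the `usecols` parameter of `pd.read_csv` on a small made-up TSV. It is an alternative shown under that assumption, not what the notebook does.

```python
import io

import pandas as pd

# Made-up TSV with one column we want to skip ("url")
raw = io.StringIO(
    "code\turl\tproduct_name\tfat_100g\n"
    "1\thttp://example.org/1\tPeanuts\t17.86\n"
    "2\thttp://example.org/2\tPolenta\t1.43\n"
)

wanted = ["product_name", "fat_100g"]
subset = pd.read_csv(
    raw,
    sep="\t",
    usecols=wanted,                    # parse only these columns
    dtype={"fat_100g": "float32"},
).rename(columns={"fat_100g": "fat_g"})

print(list(subset.columns))  # ['product_name', 'fat_g']
```

The resulting columns follow the order in the file, not the order of the `usecols` list.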


In [7]:
world_food_data.isnull().sum()

product_name                    17512
packaging                      266068
main_category                  252778
nutrition_grade_fr             101171
nutrition_score                101171
fat_g                           76530
carbohydrates_g                 76807
proteins_g                      61866
additives_n                     72160
ingredients_from_palm_oil_n     72160
first_packaging_code_geo       335155
dtype: int64

In [8]:
world_food_data.additives_n.unique()

array([nan,  0.,  1.,  2.,  3.,  6.,  5.,  8.,  4., 10., 11.,  9.,  7.,
       14., 12., 18., 22., 13., 20., 23., 17., 26., 21., 25., 15., 24.,
       16., 19., 27., 29., 30.], dtype=float32)

In [9]:
len(world_food_data[world_food_data.ingredients_from_palm_oil_n==2])

79

### Data Imputation

In [10]:
most_common_first_packaging_code_geo = world_food_data['first_packaging_code_geo'].value_counts().index[0]
most_common_packaging = world_food_data['packaging'].value_counts().index[0]
mean_additives=world_food_data['additives_n'].mean()

world_food_data['additives_n'] = world_food_data['additives_n'].fillna(mean_additives)
world_food_data['ingredients_from_palm_oil_n'] = world_food_data['ingredients_from_palm_oil_n'].fillna(0)
world_food_data['first_packaging_code_geo'] = world_food_data['first_packaging_code_geo'].fillna(most_common_first_packaging_code_geo)
world_food_data['packaging'] = world_food_data['packaging'].fillna(most_common_packaging)


world_food_data= world_food_data.dropna()



In [11]:
print("Total {} observations on {} features".format(world_food_data.shape[0],world_food_data.shape[1]))

Total 71091 observations on 11 features


In [12]:
assert most_common_first_packaging_code_geo is not None
assert most_common_packaging is not None
assert not np.isnan(mean_additives)
assert world_food_data.isnull().sum().sum() == 0, "There are still missing values in the dataset."

In [13]:
world_food_data.dtypes

product_name                   category
packaging                      category
main_category                  category
nutrition_grade_fr             category
nutrition_score                 float32
fat_g                           float32
carbohydrates_g                 float32
proteins_g                      float32
additives_n                     float32
ingredients_from_palm_oil_n     float32
first_packaging_code_geo       category
dtype: object

In [14]:
world_food_data.nutrition_score.unique()

array([  6.,   9.,   1.,  18.,   2.,  14.,  26.,  10.,  13.,  12.,  22.,
         8.,  24.,  21.,  17.,  20.,  19.,  11.,   4.,  -2.,  -3.,  -5.,
        -1.,  15.,  -4.,  23.,  25.,  27.,   5.,   0.,   7.,  16.,  -9.,
         3.,  -6.,  29.,  -7., -13.,  28.,  30., -10., -11.,  -8.,  33.,
        36.,  31., -12.,  34.,  32.,  40., -14., -15.,  35.,  37.],
      dtype=float32)

In [15]:
world_food_data.additives_n.unique()

array([ 0.       ,  5.       ,  3.       ,  1.8768507,  4.       ,
        2.       ,  1.       , 11.       ,  8.       , 10.       ,
        7.       ,  9.       ,  6.       , 13.       , 12.       ,
       14.       , 15.       , 16.       , 17.       , 18.       ,
       21.       , 30.       , 20.       , 22.       , 19.       ,
       26.       ], dtype=float32)

In [16]:
world_food_data.ingredients_from_palm_oil_n.unique()


array([0., 1., 2.], dtype=float32)

In [17]:
world_food_data.head()

Unnamed: 0,product_name,packaging,main_category,nutrition_grade_fr,nutrition_score,fat_g,carbohydrates_g,proteins_g,additives_n,ingredients_from_palm_oil_n,first_packaging_code_geo
176,Salade Cesar,Frais,en:plant-based-foods-and-beverages,c,6.0,12.0,23.0,22.0,0.0,0.0,"47.633333,-2.666667"
182,Chaussons tressés aux pommes,Frais,en:sugary-snacks,c,9.0,10.7,38.700001,3.33,5.0,0.0,"47.633333,-2.666667"
183,Pain Burger Artisan,"Frais,plastique",fr:boulange,b,1.0,1.11,53.299999,10.0,0.0,0.0,"47.633333,-2.666667"
185,Root Beer,"Canette,Métal",en:beverages,e,18.0,0.0,14.2,0.0,3.0,0.0,"47.633333,-2.666667"
187,Quiche Lorraine,Frai,en:meals,b,2.0,6.79,7.86,5.36,3.0,0.0,"47.633333,-2.666667"


There are a few changes I would like to make here.<br><br>
First of all, since I'll be extracting the French products from this dataframe, I'm going to remove the language prefixes (such as `en:` and `fr:`) from the `main_category` column.<br><br> Secondly, there's no need for the columns `additives_n`, `ingredients_from_palm_oil_n` and `nutrition_score` to be floating point, so I'm going to convert them into integers.<br><br>Then, I'd like to split the first packaging coordinates column `first_packaging_code_geo` into two separate columns, one for the latitude and one for the longitude, for easy plotting later on. I will also round these coordinates to two decimal places and drop the old column.<br><br>Finally, I'm going to reset the index, since I dropped a lot of rows in the previous steps.<br><br> The dataframe should now have 12 features.

In [18]:
world_food_data[["additives_n","ingredients_from_palm_oil_n", "nutrition_score"]]=world_food_data[["additives_n","ingredients_from_palm_oil_n", "nutrition_score"]].astype("int32")
world_food_data["main_category"] = world_food_data["main_category"].map(lambda x: str(x)[3:]).astype('category')


world_food_data[["fp_lat","fp_lon"]]=world_food_data["first_packaging_code_geo"].str.split(",", n=1, expand=True)
world_food_data.fp_lat=round(world_food_data.fp_lat.astype("float32"),2)
world_food_data.fp_lon=round(world_food_data.fp_lon.astype("float32"),2)
world_food_data=world_food_data.drop(columns="first_packaging_code_geo")

world_food_data=world_food_data.reset_index(drop=True)
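
One side effect of the integer conversion is worth noting: the rows whose `additives_n` was filled with the mean (about 1.88, visible in the earlier `unique()` output) are truncated toward zero by `astype("int32")`, so they become 1 rather than 2. The sketch below reproduces this on a toy series and shows rounding first as one possible alternative. Which behaviour is preferable is a modelling choice the notebook does not state.

```python
import numpy as np
import pandas as pd

# Toy counts with one missing value, mimicking additives_n
additives = pd.Series([0.0, 5.0, np.nan, 3.0], dtype="float32")
filled = additives.fillna(additives.mean())   # mean of 0, 5, 3 is about 2.67

truncated = filled.astype("int32")            # drops the fractional part
rounded = filled.round().astype("int32")      # rounds to nearest first

print(truncated.tolist())  # [0, 5, 2, 3]
print(rounded.tolist())    # [0, 5, 3, 3]
```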

In [19]:
world_food_data.dtypes

product_name                   category
packaging                      category
main_category                  category
nutrition_grade_fr             category
nutrition_score                   int32
fat_g                           float32
carbohydrates_g                 float32
proteins_g                      float32
additives_n                       int32
ingredients_from_palm_oil_n       int32
fp_lat                          float32
fp_lon                          float32
dtype: object

In [20]:
world_food_data.head()

Unnamed: 0,product_name,packaging,main_category,nutrition_grade_fr,nutrition_score,fat_g,carbohydrates_g,proteins_g,additives_n,ingredients_from_palm_oil_n,fp_lat,fp_lon
0,Salade Cesar,Frais,plant-based-foods-and-beverages,c,6,12.0,23.0,22.0,0,0,47.630001,-2.67
1,Chaussons tressés aux pommes,Frais,sugary-snacks,c,9,10.7,38.700001,3.33,5,0,47.630001,-2.67
2,Pain Burger Artisan,"Frais,plastique",boulange,b,1,1.11,53.299999,10.0,0,0,47.630001,-2.67
3,Root Beer,"Canette,Métal",beverages,e,18,0.0,14.2,0.0,3,0,47.630001,-2.67
4,Quiche Lorraine,Frai,meals,b,2,6.79,7.86,5.36,3,0,47.630001,-2.67


The dataframe seems much cleaner now and the data types are correct. There are just a few more things I would like to add.<br><br>I'm going to add a column called `contains_additives`, which will be:
 - 1 - if the additive count is > 0
 - 0 - if the additive count is = 0

This will be used later on for modelling. I also noticed that the `packaging` column contains values that start with either an uppercase or a lowercase letter, so I'm going to convert all of them to lowercase for correct filtering later on in the exploration.

In [21]:
# 1 where the product has at least one additive, 0 otherwise
world_food_data["contains_additives"] = (world_food_data.additives_n > 0).astype(int)
world_food_data.packaging = world_food_data.packaging.str.lower()

In [22]:
world_food_data.head()

Unnamed: 0,product_name,packaging,main_category,nutrition_grade_fr,nutrition_score,fat_g,carbohydrates_g,proteins_g,additives_n,ingredients_from_palm_oil_n,fp_lat,fp_lon,contains_additives
0,Salade Cesar,frais,plant-based-foods-and-beverages,c,6,12.0,23.0,22.0,0,0,47.630001,-2.67,0
1,Chaussons tressés aux pommes,frais,sugary-snacks,c,9,10.7,38.700001,3.33,5,0,47.630001,-2.67,1
2,Pain Burger Artisan,"frais,plastique",boulange,b,1,1.11,53.299999,10.0,0,0,47.630001,-2.67,0
3,Root Beer,"canette,métal",beverages,e,18,0.0,14.2,0.0,3,0,47.630001,-2.67,1
4,Quiche Lorraine,frai,meals,b,2,6.79,7.86,5.36,3,0,47.630001,-2.67,1


Let's take a look at the `ingredients_from_palm_oil_n` column. There are still 59 products that contain 2 ingredients from palm oil. For simplicity, I'm going to replace these values with 1, so that the column becomes a binary flag: 1 if the product contains any ingredient from palm oil and 0 if it doesn't.

In [23]:
world_food_data["ingredients_from_palm_oil_n"].unique()

array([0, 1, 2], dtype=int32)

In [24]:
len(world_food_data[world_food_data.ingredients_from_palm_oil_n==2])

59

In [25]:
world_food_data.loc[
    world_food_data["ingredients_from_palm_oil_n"] == 2,
    "ingredients_from_palm_oil_n"
] = 1

In [26]:
world_food_data["ingredients_from_palm_oil_n"].unique()

array([0, 1], dtype=int32)
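
The same capping can be written with `Series.clip`, which would also cover any count above 2 should one appear in another version of the data. A minimal sketch on a toy series (the values are made up):

```python
import pandas as pd

palm_oil = pd.Series([0, 1, 2, 0, 2], dtype="int32")

# Cap every count at 1 so the column becomes a 0/1 flag
palm_oil_flag = palm_oil.clip(upper=1)

print(palm_oil_flag.tolist())  # [0, 1, 1, 0, 1]
```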

## Step 2 - The Starbucks dataset

### Obtaining the dataset

In [27]:
dtypes = {
    "Beverage": "category",
    "Beverage_category": "category",
    "Beverage_prep": "category",
    "Size (oz)": "int32",
    "Calories": "int32",
    "Total Fat (g)": "float32",
    "Total Carbohydrates (g)": "float32",
    "Sugars (g)": "int32",
    "Protein (g)": "float32",
}

object_key = f"{os.getenv('RAW_PATH')}/starbucks/starbucks_drinkMenu_expanded.csv"

# Named `obj` rather than `object` to avoid shadowing the Python builtin
obj = minio_client.get_object(
    bucket_name=bucket,
    object_name=object_key
)

starbucks_data = pd.read_csv(obj, dtype=dtypes, sep=',')
starbucks_data.head()

Unnamed: 0,Beverage_category,Beverage,Beverage_prep,Calories,Total Fat (g),Trans Fat (g),Saturated Fat (g),Sodium (mg),Total Carbohydrates (g),Cholesterol (mg),Dietary Fibre (g),Sugars (g),Protein (g),Vitamin A (% DV),Vitamin C (% DV),Calcium (% DV),Iron (% DV),Caffeine (mg)
0,Coffee,Brewed Coffee,Short,3,0.1,0.0,0.0,0,5,0,0,0,0.3,0%,0%,0%,0%,175
1,Coffee,Brewed Coffee,Tall,4,0.1,0.0,0.0,0,10,0,0,0,0.5,0%,0%,0%,0%,260
2,Coffee,Brewed Coffee,Grande,5,0.1,0.0,0.0,0,10,0,0,0,1.0,0%,0%,0%,0%,330
3,Coffee,Brewed Coffee,Venti,5,0.1,0.0,0.0,0,10,0,0,0,1.0,0%,0%,2%,0%,410
4,Classic Espresso Drinks,Caffè Latte,Short Nonfat Milk,70,0.1,0.1,0.0,5,75,10,0,9,6.0,10%,0%,20%,0%,75


### Cleaning the dataset

In [28]:
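# regex=False treats "(" and ")" as literal characters on every pandas version
starbucks_data.columns = (
    starbucks_data.columns.str.strip()
    .str.lower()
    .str.replace(" ", "", regex=False)
    .str.replace("-", "_", regex=False)
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
)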
starbucks_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242 entries, 0 to 241
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   beverage_category    242 non-null    category
 1   beverage             242 non-null    category
 2   beverage_prep        242 non-null    category
 3   calories             242 non-null    int32   
 4   totalfatg            242 non-null    object  
 5   transfatg            242 non-null    float64 
 6   saturatedfatg        242 non-null    float64 
 7   sodiummg             242 non-null    int64   
 8   totalcarbohydratesg  242 non-null    int64   
 9   cholesterolmg        242 non-null    int64   
 10  dietaryfibreg        242 non-null    int64   
 11  sugarsg              242 non-null    int64   
 12  proteing             242 non-null    float64 
 13  vitamina%dv          242 non-null    object  
 14  vitaminc%dv          242 non-null    object  
 15  calcium%dv           24
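
The chained `str.replace` calls strip spaces and parentheses, which yields names such as `totalfatg`. As a sketch of an alternative, the hypothetical helper `to_snake_case` below keeps word boundaries so that units stay readable. The naming scheme and the sample labels are assumptions, and the later cells would need their column names adjusted to match.

```python
import re


def to_snake_case(name):
    """Normalize a column label to snake_case (hypothetical helper)."""
    name = name.strip().lower().replace("%", " pct ")
    # Collapse every run of non-alphanumeric characters into one underscore
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")


# Illustrative labels in the style of the Starbucks header
columns = ["Beverage_category", " Total Fat (g)", "Vitamin A (% DV) ", "Caffeine (mg)"]
print([to_snake_case(c) for c in columns])
# ['beverage_category', 'total_fat_g', 'vitamin_a_pct_dv', 'caffeine_mg']
```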

In [29]:
cols_to_keep=["beverage_category","beverage","beverage_prep","calories","sugarsg","proteing","totalfatg","totalcarbohydratesg"]
starbucks_data = starbucks_data[cols_to_keep]
starbucks_data = starbucks_data.rename(columns={
    "totalfatg": "fat_g",
    "totalcarbohydratesg": "carbohydrates_g",
    "proteing": "proteins_g",
    "beverage": "product_name"
})
starbucks_data.head()

Unnamed: 0,beverage_category,product_name,beverage_prep,calories,sugarsg,proteins_g,fat_g,carbohydrates_g
0,Coffee,Brewed Coffee,Short,3,0,0.3,0.1,5
1,Coffee,Brewed Coffee,Tall,4,0,0.5,0.1,10
2,Coffee,Brewed Coffee,Grande,5,0,1.0,0.1,10
3,Coffee,Brewed Coffee,Venti,5,0,1.0,0.1,10
4,Classic Espresso Drinks,Caffè Latte,Short Nonfat Milk,70,9,6.0,0.1,75


In [30]:
starbucks_data.shape

(242, 8)

In [31]:
starbucks_data.dtypes

beverage_category    category
product_name         category
beverage_prep        category
calories                int32
sugarsg                 int64
proteins_g            float64
fat_g                  object
carbohydrates_g         int64
dtype: object

The carbohydrates were read as integers here, so I'm going to convert them to `float32` to match the carbohydrates column in `world_food_data`. I'll also cast the proteins to `float32` and the sugars to `int32` for consistency.

In [32]:
starbucks_data.carbohydrates_g=starbucks_data.carbohydrates_g.astype("float32")
starbucks_data.proteins_g=starbucks_data.proteins_g.astype("float32")
starbucks_data.sugarsg=starbucks_data.sugarsg.astype("int32")

In [33]:
starbucks_data.fat_g.unique()

array(['0.1', '3.5', '2.5', '0.2', '6', '4.5', '0.3', '7', '5', '0.4',
       '9', '1.5', '4', '2', '8', '3', '11', '0', '1', '10', '15', '13',
       '0.5', '3 2'], dtype=object)

This looks like a data entry mistake. As I'm not sure what the value `3 2` is supposed to be, I'm going to replace it with NaN and convert the column to float.

In [34]:
starbucks_data.fat_g = pd.to_numeric(starbucks_data.fat_g, errors='coerce').astype("float32")
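
Before coercing, it can help to list exactly which raw strings will become NaN. A minimal sketch on a toy series follows. The values are taken from the `unique()` output above and the variable names are illustrative.

```python
import pandas as pd

fat_raw = pd.Series(["0.1", "3.5", "3 2", "11"])

# errors="coerce" turns unparseable strings into NaN instead of raising
fat_num = pd.to_numeric(fat_raw, errors="coerce")

# Entries that were non-null text but failed to parse
bad_values = fat_raw[fat_num.isna() & fat_raw.notna()]
print(bad_values.tolist())        # ['3 2']
print(int(fat_num.isna().sum()))  # 1
```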

In [35]:
starbucks_data.head()

Unnamed: 0,beverage_category,product_name,beverage_prep,calories,sugarsg,proteins_g,fat_g,carbohydrates_g
0,Coffee,Brewed Coffee,Short,3,0,0.3,0.1,5.0
1,Coffee,Brewed Coffee,Tall,4,0,0.5,0.1,10.0
2,Coffee,Brewed Coffee,Grande,5,0,1.0,0.1,10.0
3,Coffee,Brewed Coffee,Venti,5,0,1.0,0.1,10.0
4,Classic Espresso Drinks,Caffè Latte,Short Nonfat Milk,70,9,6.0,0.1,75.0


In [36]:
starbucks_data.dtypes

beverage_category    category
product_name         category
beverage_prep        category
calories                int32
sugarsg                 int32
proteins_g            float32
fat_g                 float32
carbohydrates_g       float32
dtype: object

The Starbucks dataset is now ready for exploration, so I'll move on to the last dataset: the McDonalds dataset.

## Step 3 - The McDonalds dataset