# **Week 1 Practice Activity: Preliminary Dataset Review**

Data mining has many applications in health and wellness. It turns raw nutrition and food-tracking data into actionable insights that help users make informed decisions in support of their health and wellness goals.

In this practice activity, you will carry out a preliminary review of a nutrition and food-tracking dataset. The main objective is to understand the structure of the dataset and become familiar with the meaning of each feature in the nutritional data.

---

*Instructions:*
*   Download the nutrition dataset here and load it into a Jupyter notebook.
*   Inspect the first few rows and columns using `head()` to get a sense of the data.
*   Review the dataset’s features and identify potential target variables.
*   Based on the initial findings, answer the following questions:
  *   How many entries does the dataset have?
  *   What are the column names, and what kind of data do they contain?
  *   Do you notice any missing or unusual values?
*   Create a table that lists each feature, describes its meaning, and identifies its type (e.g., numerical, categorical).
*   Based on the results, answer the following questions:
  *   Which columns are numerical, and which are categorical?  
  *   Are there any features that look redundant or unnecessary for analysis?


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/3rd year 2024-25/Term 2/Data Mining Principles/nutrition_data.csv')

# Inspect the first few rows and columns using head() to get a sense of the data.
print(df.head())


        food_item  calories  protein  carbs  fats  meal_time
0           Apple      95.0      0.5   25.0   0.3  Breakfast
1          Banana     105.0      1.3   27.0   0.4      Snack
2  Chicken Breast     165.0     31.0    0.0   3.6      Lunch
3           Steak     679.0     62.0    0.0  48.0     Dinner
4           Salad     150.0      2.0   15.0   7.0      Lunch


In [5]:
# How many entries does the dataset have?

num_rows = df.shape[0]
print(f"The DataFrame has {num_rows} rows.")


The DataFrame has 105 rows.


In [6]:
# What are the column names, and what kind of data do they contain?
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   food_item  105 non-null    object 
 1   calories   103 non-null    float64
 2   protein    104 non-null    float64
 3   carbs      104 non-null    float64
 4   fats       105 non-null    float64
 5   meal_time  105 non-null    object 
dtypes: float64(4), object(2)
memory usage: 5.0+ KB


In [7]:
df.dtypes


food_item     object
calories     float64
protein      float64
carbs        float64
fats         float64
meal_time     object
dtype: object


In [8]:
# Do you notice any missing or unusual values?
df.isnull().sum()


food_item    0
calories     2
protein      1
carbs        1
fats         0
meal_time    0
dtype: int64
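
The question above also asks about unusual values, which `isnull()` alone does not reveal. The following is a minimal sketch of one way to look for negative numbers in the numerical columns. It uses a small hypothetical `sample` DataFrame rather than the real file: the Apple, Banana, and first Pizza rows are taken from the outputs in this notebook, while the last row with negative calories is invented for illustration only.

```python
import pandas as pd

# Hypothetical sample: the last row (negative calories) is an assumption
# made up for this sketch, not a row copied from nutrition_data.csv.
sample = pd.DataFrame({
    "food_item": ["Apple", "Banana", "Pizza", "Pizza"],
    "calories": [95.0, 105.0, float("nan"), -285.0],
    "protein": [0.5, 1.3, 12.0, 12.0],
    "carbs": [25.0, 27.0, 36.0, 36.0],
    "fats": [0.3, 0.4, 10.0, 10.0],
    "meal_time": ["Breakfast", "Snack", "Dinner", "Dinner"],
})

numeric_cols = sample.select_dtypes(include="number").columns

# Count negative entries per numerical column.
# NaN < 0 evaluates to False, so missing values are not counted here.
negative_counts = (sample[numeric_cols] < 0).sum()
print(negative_counts)

# Show the rows that contain at least one negative value.
negative_rows = sample[(sample[numeric_cols] < 0).any(axis=1)]
print(negative_rows)
```

Running the same two checks on `df` instead of `sample` would apply them to the full dataset. `df.describe()` is another quick option, since its `min` and `max` rows make out-of-range values easy to spot.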


In [9]:
# Create a table that lists each feature, describes its meaning, and identifies its type (e.g., numerical, categorical).
import pandas as pd

data = {
    'Feature': ['food_item', 'calories', 'protein', 'carbs', 'fats', 'meal_time'],
    'Description': ['Name of the food', 'Total calories per serving', 'Protein content per serving',
                    'Carbohydrate content per serving', 'Fat content per serving',
                    'Time of day the food was consumed'],
    'Type': ['Categorical', 'Numerical', 'Numerical', 'Numerical', 'Numerical', 'Categorical']
}

df1 = pd.DataFrame(data)

# table styles
df1.style.set_table_styles([
    {'selector': 'th', 'props': [('text-align', 'left')]},
    {'selector': 'td', 'props': [('text-align', 'left')]}
])

|   | Feature | Description | Type |
|---|---------|-------------|------|
| 0 | food_item | Name of the food | Categorical |
| 1 | calories | Total calories per serving | Numerical |
| 2 | protein | Protein content per serving | Numerical |
| 3 | carbs | Carbohydrate content per serving | Numerical |
| 4 | fats | Fat content per serving | Numerical |
| 5 | meal_time | Time of day the food was consumed | Categorical |


Table styling source: https://pandas.pydata.org/docs/user_guide/style.html

In [10]:
# 1. Which columns are numerical, and which are categorical?
numerical_columns = df.select_dtypes(include=['number']).columns
categorical_columns = df.select_dtypes(include=['object']).columns

print("Numerical Columns:")
print(numerical_columns)

print("\nCategorical Columns:")
print(categorical_columns)

Numerical Columns:
Index(['calories', 'protein', 'carbs', 'fats'], dtype='object')

Categorical Columns:
Index(['food_item', 'meal_time'], dtype='object')
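
A text dtype only says that a column holds text, not how many categories it has. As a rough sketch, counting the distinct values in each non-numerical column can help tell a low-cardinality category such as `meal_time` apart from a label-like column such as `food_item`. The `sample` DataFrame below is a hypothetical stand-in built from the five rows printed by `head()` earlier, and `exclude="number"` is used here simply as one way to pick out the text columns.

```python
import pandas as pd

# Hypothetical sample built from the five rows shown by df.head() above.
sample = pd.DataFrame({
    "food_item": ["Apple", "Banana", "Chicken Breast", "Steak", "Salad"],
    "calories": [95.0, 105.0, 165.0, 679.0, 150.0],
    "meal_time": ["Breakfast", "Snack", "Lunch", "Dinner", "Lunch"],
})

# Select every column that is not numerical.
text_cols = sample.select_dtypes(exclude="number").columns

# Number of distinct values in each text column.
unique_counts = sample[text_cols].nunique()
print(unique_counts)

# Frequency of each category in meal_time.
meal_counts = sample["meal_time"].value_counts()
print(meal_counts)
```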


In [11]:
# 2. Are there any features that look redundant or unnecessary for analysis?
# The redundancy found is in the rows rather than the columns: there are multiple duplicate rows.
# Pizza also appears several times, with different missing values and negative calories.

In [12]:
# Check for duplicate rows
duplicate_rows = df[df.duplicated()]

# Print the duplicate rows
print("Duplicate Rows:")
print(duplicate_rows)

Duplicate Rows:
          food_item  calories  protein  carbs  fats  meal_time
10            Apple      95.0      0.5   25.0   0.3  Breakfast
11           Banana     105.0      1.3   27.0   0.4      Snack
12   Chicken Breast     165.0     31.0    0.0   3.6      Lunch
13            Steak     679.0     62.0    0.0  48.0     Dinner
14            Salad     150.0      2.0   15.0   7.0      Lunch
..              ...       ...      ...    ...   ...        ...
100           Pizza       NaN     12.0   36.0  10.0     Dinner
101            Rice     206.0      4.3   45.0   0.5      Lunch
102            Fish     232.0     26.0    0.0   5.0     Dinner
103          Yogurt      59.0     10.0   12.0   0.4      Snack
104           Pasta     131.0      5.0   25.0   1.1      Lunch

[90 rows x 6 columns]
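
Printing the duplicate rows shows what they look like, but it can also help to count them and to see what the data would look like without them. The sketch below assumes a small hypothetical `sample` DataFrame (repeating the Apple and Banana rows from the output above) and uses `duplicated()` and `drop_duplicates()` with their default `keep='first'` behavior, so the first occurrence of each row is kept.

```python
import pandas as pd

# Hypothetical sample with repeated rows, mimicking the pattern seen above.
sample = pd.DataFrame({
    "food_item": ["Apple", "Banana", "Apple", "Banana", "Apple"],
    "calories": [95.0, 105.0, 95.0, 105.0, 95.0],
    "meal_time": ["Breakfast", "Snack", "Breakfast", "Snack", "Breakfast"],
})

# duplicated() flags every repeat after the first occurrence of a row.
n_duplicates = int(sample.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# drop_duplicates() keeps the first occurrence and removes the repeats.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```

Whether the repeats should actually be removed depends on what a row represents. If each row is a separate logged meal, identical rows may be legitimate entries rather than errors.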


# **Week 2 Practice Activity No. 1: Handling Missing Data**

1.   How many missing values are there for each column?
2.   Did you drop any rows or fill them with specific values? Why?




In [13]:
df.isnull().sum()

food_item    0
calories     2
protein      1
carbs        1
fats         0
meal_time    0
dtype: int64


In [15]:
# Show rows with missing values
print(df[df.isnull().any(axis=1)])

    food_item  calories  protein  carbs  fats meal_time
5       Pizza       NaN     12.0   36.0  10.0    Dinner
15      Pizza     285.0      NaN   36.0  10.0    Dinner
25      Pizza     285.0     12.0    NaN  10.0    Dinner
100     Pizza       NaN     12.0   36.0  10.0    Dinner


Source: https://stackoverflow.com/questions/30447083/python-pandas-return-only-those-rows-which-have-missing-values

In [16]:
print(df.isnull())

     food_item  calories  protein  carbs   fats  meal_time
0        False     False    False  False  False      False
1        False     False    False  False  False      False
2        False     False    False  False  False      False
3        False     False    False  False  False      False
4        False     False    False  False  False      False
..         ...       ...      ...    ...    ...        ...
100      False      True    False  False  False      False
101      False     False    False  False  False      False
102      False     False    False  False  False      False
103      False     False    False  False  False      False
104      False     False    False  False  False      False

[105 rows x 6 columns]
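
For the second question, two common options are dropping the incomplete rows or filling the gaps. The following is a sketch of both, not the method used in this notebook. It assumes a hypothetical `sample` DataFrame whose Pizza rows mirror rows 5, 15, and 25 shown above, and it fills each missing value with the median of the same `food_item`, on the assumption that the same food should have the same nutritional values.

```python
import pandas as pd

# Hypothetical sample: the Pizza rows mirror rows 5, 15, and 25 above.
nan = float("nan")
sample = pd.DataFrame({
    "food_item": ["Pizza", "Pizza", "Pizza", "Apple"],
    "calories": [nan, 285.0, 285.0, 95.0],
    "protein": [12.0, nan, 12.0, 0.5],
    "carbs": [36.0, 36.0, nan, 25.0],
    "fats": [10.0, 10.0, 10.0, 0.3],
})

# Option 1: drop every row that has at least one missing value.
dropped = sample.dropna()
print(f"Rows left after dropna(): {len(dropped)}")

# Option 2: fill each gap with the median of the same food item.
filled = sample.copy()
for col in ["calories", "protein", "carbs", "fats"]:
    filled[col] = filled.groupby("food_item")[col].transform(
        lambda s: s.fillna(s.median())
    )
print(filled)
print(filled.isnull().sum())
```

In this sample, dropping removes every Pizza row, while the per-item fill keeps them and recovers the values from the other Pizza entries. The fill only works when at least one row of the same food has the value present.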


# **Week 2 Practice Activity No. 2: Data Transformation, Normalization and Feature Engineering**