# Python Practice Notebook 1

## Business Task

A health tech startup called WellBeing360 has developed a holistic health tracking app that collects daily data on physical activity, nutrition, stress, mindfulness, sleep, hydration, BMI, alcohol consumption, and smoking.

The app computes an Overall Health Score (0–100) as a composite of all factors.

Analyze this dataset to understand user health patterns, identify high-risk behaviors, and provide insights that can guide personalized interventions.



## Import Libraries & Data Collection

In [1]:
import pandas as pd
import numpy as np

In [2]:
df=pd.read_csv("data/holistic_health_lifestyle_dataset.csv")
df.head()

Unnamed: 0,Physical_Activity,Nutrition_Score,Stress_Level,Mindfulness,Sleep_Hours,Hydration,BMI,Alcohol,Smoking,Overall_Health_Score,Health_Status
0,54.934283,5.643011,5.696572,0.0,6.292214,2.578565,24.275932,4.28061,8.984006,36.950187,Poor
1,42.234714,6.389001,5.566647,4.450144,8.519054,2.448713,25.970141,7.461846,3.223304,55.167774,Average
2,57.953771,5.805238,3.12696,9.129716,6.70272,3.261433,25.193857,0.0,4.600482,78.304426,Good
3,75.460597,7.220836,6.159168,16.496689,7.135854,3.726265,19.5273,9.958423,3.947706,94.018274,Good
4,40.316933,9.394357,2.019835,25.241623,8.076086,3.049478,23.348229,4.320347,8.084322,100.0,Good


## Questions and Solutions

### Q1. Display the first 10 rows

In [3]:
# First 10 rows
df.head(10)

Unnamed: 0,Physical_Activity,Nutrition_Score,Stress_Level,Mindfulness,Sleep_Hours,Hydration,BMI,Alcohol,Smoking,Overall_Health_Score,Health_Status
0,54.934283,5.643011,5.696572,0.0,6.292214,2.578565,24.275932,4.28061,8.984006,36.950187,Poor
1,42.234714,6.389001,5.566647,4.450144,8.519054,2.448713,25.970141,7.461846,3.223304,55.167774,Average
2,57.953771,5.805238,3.12696,9.129716,6.70272,3.261433,25.193857,0.0,4.600482,78.304426,Good
3,75.460597,7.220836,6.159168,16.496689,7.135854,3.726265,19.5273,9.958423,3.947706,94.018274,Good
4,40.316933,9.394357,2.019835,25.241623,8.076086,3.049478,23.348229,4.320347,8.084322,100.0,Good
5,40.317261,5.457916,3.691631,21.941359,6.911555,3.309235,26.284728,3.438906,6.716783,88.75812,Good
6,76.584256,9.001641,1.001824,3.206752,4.273228,2.326077,25.463141,0.0,16.43792,60.37233,Average
7,60.348695,5.436656,8.117253,11.282877,8.560881,2.600324,21.263454,0.452641,2.695139,80.430716,Good
8,35.610512,5.304746,4.536901,15.763756,8.882393,3.688586,25.455662,1.489827,15.757114,71.172442,Good
9,55.851201,8.637189,9.333768,12.377929,4.253326,2.921369,27.202112,0.0,0.0,81.568796,Good


### Q2. Check the data types of all columns.

In [4]:
# Basic info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Physical_Activity     10000 non-null  float64
 1   Nutrition_Score       10000 non-null  float64
 2   Stress_Level          10000 non-null  float64
 3   Mindfulness           10000 non-null  float64
 4   Sleep_Hours           10000 non-null  float64
 5   Hydration             10000 non-null  float64
 6   BMI                   10000 non-null  float64
 7   Alcohol               10000 non-null  float64
 8   Smoking               10000 non-null  float64
 9   Overall_Health_Score  10000 non-null  float64
 10  Health_Status         10000 non-null  object 
dtypes: float64(10), object(1)
memory usage: 859.5+ KB


In [5]:
# Check the data types
df.dtypes

Physical_Activity       float64
Nutrition_Score         float64
Stress_Level            float64
Mindfulness             float64
Sleep_Hours             float64
Hydration               float64
BMI                     float64
Alcohol                 float64
Smoking                 float64
Overall_Health_Score    float64
Health_Status            object
dtype: object

**All columns except Health_Status are numeric (float64); Health_Status is an object (string) column.**

### Q3. Find the number of missing values in each column.

In [6]:
df.isnull().sum()

Physical_Activity       0
Nutrition_Score         0
Stress_Level            0
Mindfulness             0
Sleep_Hours             0
Hydration               0
BMI                     0
Alcohol                 0
Smoking                 0
Overall_Health_Score    0
Health_Status           0
dtype: int64

**There are no missing values.**

### Q4. Compute mean, median, min, max, and standard deviation for all numeric columns.

In [7]:
df.describe()

Unnamed: 0,Physical_Activity,Nutrition_Score,Stress_Level,Mindfulness,Sleep_Hours,Hydration,BMI,Alcohol,Smoking,Overall_Health_Score
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,45.047069,6.966599,4.987202,15.224636,7.000194,2.503302,24.095086,3.523663,5.706911,78.227945
std,19.832871,1.883295,1.938195,9.454891,1.46858,0.80166,3.356663,3.270784,5.00026,19.697853
min,0.0,0.0,1.0,0.0,3.0,0.5,18.0,0.0,0.0,2.217088
25%,31.548189,5.675978,3.599696,8.053871,6.003898,1.958461,21.653393,0.298894,1.065818,64.62706
50%,44.9481,7.031693,4.988464,14.896178,7.014341,2.506579,24.072122,2.980658,4.954994,81.118118
75%,58.421618,8.38773,6.327795,21.790305,8.025752,3.052666,26.380536,5.706382,8.991626,97.972163
max,120.0,10.0,10.0,52.278333,10.0,5.0,36.376168,18.040621,27.978693,100.0


**The mean Overall_Health_Score is about 78.2 and the median about 81.1. The 50% row of `describe()` is the median.**
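
Since Q4 asks for five specific statistics, a minimal sketch of requesting exactly those with `agg` may be useful. The `toy` frame below is made-up stand-in data, not the real dataset; on the notebook's `df` the same call would apply after selecting numeric columns.

```python
import pandas as pd

# Toy stand-in for the health data (values are made up for illustration).
toy = pd.DataFrame({
    "Sleep_Hours": [6.0, 7.0, 8.0, 9.0],
    "BMI": [20.0, 22.0, 24.0, 30.0],
    "Health_Status": ["Poor", "Average", "Good", "Good"],
})

# Keep numeric columns only, then request exactly the five statistics.
summary = toy.select_dtypes(include="number").agg(
    ["mean", "median", "min", "max", "std"]
)
print(summary)
```

Unlike `describe()`, this labels the median explicitly and omits the quartiles and count.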

### Q5. Identify the top 5 users with the highest Overall_Health_Score and the 5 users with the lowest.

In [8]:
# Top 5 users with the highest Overall_Health_Score
top5 = df.sort_values('Overall_Health_Score', ascending=False).head(5)
top5

Unnamed: 0,Physical_Activity,Nutrition_Score,Stress_Level,Mindfulness,Sleep_Hours,Hydration,BMI,Alcohol,Smoking,Overall_Health_Score,Health_Status
9970,26.748236,9.170378,6.324544,29.748278,7.937627,4.061004,18.101793,3.577642,0.0,100.0,Good
9968,46.267668,9.607748,6.34529,21.168729,9.127439,2.256672,18.9154,1.924566,9.912102,100.0,Good
26,21.980128,7.490997,8.747153,25.446972,9.808487,1.835566,22.046242,6.657714,0.0,100.0,Good
24,34.112346,8.530875,3.131029,25.972328,5.600645,1.808175,25.086684,0.0,2.435469,100.0,Good
20,74.312975,10.0,9.426556,25.005483,9.648561,2.039538,29.788886,2.25817,6.45372,100.0,Good


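**All five top scorers hit the 100 cap with good nutrition (7.5 to 10) and high mindfulness (21 to 30). Physical activity (22 to 74) and stress (3.1 to 9.4) vary widely, so they do not distinguish this group. More than five users score 100, so which five appear depends on how the sort breaks ties.**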

In [9]:
# the 5 users with the lowest Overall_Health_Score
bottom5= df.sort_values('Overall_Health_Score', ascending=True).head(5)
bottom5

Unnamed: 0,Physical_Activity,Nutrition_Score,Stress_Level,Mindfulness,Sleep_Hours,Hydration,BMI,Alcohol,Smoking,Overall_Health_Score,Health_Status
8440,29.086897,1.948337,3.363219,0.0,5.46819,1.333555,26.095117,9.473138,7.042689,2.217088,Poor
6585,47.070763,1.913814,4.686803,0.94802,6.26084,1.772245,24.219959,6.879957,14.710177,4.100249,Poor
5162,43.160864,0.0,7.654257,2.830695,4.156696,1.57762,25.130162,3.399922,2.307202,7.026355,Poor
2001,42.109627,3.242038,4.746421,0.0,8.302063,1.619533,30.903076,3.686391,16.126818,7.73962,Poor
3430,54.330906,5.684203,7.379839,0.0,5.525995,2.713062,25.535648,6.496403,21.854822,7.901196,Poor


**Bottom scorers have near-zero mindfulness and mostly poor nutrition, and four of the five are above average on alcohol and smoking. Sleep is below average for four of the five, while physical activity and stress are at moderate levels.**
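
Because the score is capped at 100, a "top 5" can be an arbitrary slice of a larger tied group. The sketch below, on made-up toy data, shows one way to surface ties with `nlargest(..., keep="all")` and to count how many users sit at the cap.

```python
import pandas as pd

# Toy scores: three users are tied at the cap of 100 (made-up values).
toy = pd.DataFrame(
    {"Overall_Health_Score": [100.0, 100.0, 100.0, 55.0, 7.9, 2.2]}
)

# keep="all" returns every row tied with the n-th largest value,
# so the result can contain more than n rows.
top2_all = toy.nlargest(2, "Overall_Health_Score", keep="all")

# nsmallest returns rows ordered from the lowest score upward.
bottom2 = toy.nsmallest(2, "Overall_Health_Score")

# How many users reach the cap?
n_capped = int((toy["Overall_Health_Score"] == 100).sum())
print(len(top2_all), list(bottom2.index), n_capped)
```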

### Q6. Find all users who exercise more than 60 minutes per day and have a BMI over 30. How many are there?

In [10]:
# Users exercising >60 min/day and BMI > 30
high_activity_high_bmi = df[(df['Physical_Activity']>60) & (df['BMI']>30)]
high_activity_high_bmi 

Unnamed: 0,Physical_Activity,Nutrition_Score,Stress_Level,Mindfulness,Sleep_Hours,Hydration,BMI,Alcohol,Smoking,Overall_Health_Score,Health_Status
31,82.045564,8.891267,7.371817,22.138202,4.531798,2.419896,31.671607,1.721077,8.656170,88.921295,Good
202,66.661025,7.978261,4.653027,12.587963,5.689905,1.182631,31.845540,2.045897,1.149634,75.468306,Good
251,63.357239,8.988757,7.334781,19.376334,9.253974,1.394604,31.917537,4.639878,7.038856,87.014137,Good
323,86.847746,6.449035,2.710183,22.529675,7.327045,2.623339,30.729099,2.840479,4.068939,100.000000,Good
373,64.185417,5.869300,5.169595,7.240956,5.976173,1.932886,30.700853,8.465391,11.815983,33.056172,Poor
...,...,...,...,...,...,...,...,...,...,...,...
9512,64.585195,5.730409,3.877919,1.829663,5.961853,0.500000,30.757632,3.069795,1.321703,40.761727,Average
9534,61.402872,4.889797,4.069545,26.024326,7.096505,3.587521,30.540693,0.000000,7.935333,97.714410,Good
9577,60.063799,8.411788,6.539364,10.598098,6.854988,1.790123,33.116292,3.282573,4.579422,66.572205,Average
9650,67.468546,8.286471,5.617573,29.124482,7.771175,1.191236,31.003700,0.000000,8.668310,100.000000,Good


In [11]:
print("Total number of users:", high_activity_high_bmi.shape[0])

Total number of users: 87


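**87 of 10,000 users (about 0.9%) exercise more than 60 minutes a day yet have a BMI above 30. This small group may include muscular users, for whom BMI is a poor proxy for body fat, or data outliers.**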
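
A count is easier to interpret next to its share of the population (here, 87 out of 10,000 rows). As a small sketch on made-up toy data, the mean of a boolean mask gives that share directly:

```python
import pandas as pd

# Toy data (made-up values): only the first row meets both conditions.
toy = pd.DataFrame({
    "Physical_Activity": [82.0, 45.0, 66.7, 20.0],
    "BMI": [31.7, 33.0, 24.1, 28.0],
})

mask = (toy["Physical_Activity"] > 60) & (toy["BMI"] > 30)

count = int(mask.sum())  # True counts as 1
share = mask.mean()      # fraction of rows where the mask is True
print(f"{count} users ({share:.2%} of all users)")
```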

### Q7. List users who report high stress (Stress_Level ≥ 8) but have high mindfulness (Mindfulness ≥ 30)

In [12]:
high_stress_high_mindful = df[(df['Stress_Level'] >= 8) & (df['Mindfulness'] >= 30)]
high_stress_high_mindful.head()

Unnamed: 0,Physical_Activity,Nutrition_Score,Stress_Level,Mindfulness,Sleep_Hours,Hydration,BMI,Alcohol,Smoking,Overall_Health_Score,Health_Status
733,10.948328,5.923526,9.05818,30.055071,7.188504,2.479221,23.028091,1.239865,0.0,100.0,Good
874,44.337461,9.838129,8.488623,30.950363,9.41616,3.077472,32.084458,0.0,2.239655,100.0,Good
1055,31.567533,5.510042,8.584902,30.8339,9.69952,2.183404,20.411242,2.302321,12.336948,98.3249,Good
1093,46.038918,7.326856,8.019959,30.993921,8.05854,2.879051,23.231655,0.0,17.869393,100.0,Good
1148,76.890101,4.445258,8.065066,36.190167,5.687887,2.058206,22.224242,0.0,11.448267,100.0,Good


In [13]:
print("Total number of users:", high_stress_high_mindful.shape[0])

Total number of users: 36


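**Only 36 of 10,000 users (0.36%) combine high stress with high mindfulness. The five shown all score at or near 100, but this filter alone cannot show that mindfulness compensates for stress.**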


### Q8. Create a new column called Sleep_Quality where:
* If Sleep_Hours < 6 → "Poor"
* If Sleep_Hours 6–8 → "Average"
* If Sleep_Hours > 8 → "Good"

In [14]:
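## Sleep Quality
def sleep_quality(hours):
    """Label sleep duration as Poor (< 6 h), Average (6 to 8 h), or Good (> 8 h)."""
    if hours < 6:
        return "Poor"
    elif hours <= 8:  # exactly 8 hours falls in the 6–8 band
        return "Average"
    else:
        return "Good"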

In [15]:
df["Sleep_Quality"]= df["Sleep_Hours"].apply(sleep_quality)
df.head()

Unnamed: 0,Physical_Activity,Nutrition_Score,Stress_Level,Mindfulness,Sleep_Hours,Hydration,BMI,Alcohol,Smoking,Overall_Health_Score,Health_Status,Sleep_Quality
0,54.934283,5.643011,5.696572,0.0,6.292214,2.578565,24.275932,4.28061,8.984006,36.950187,Poor,Average
1,42.234714,6.389001,5.566647,4.450144,8.519054,2.448713,25.970141,7.461846,3.223304,55.167774,Average,Good
2,57.953771,5.805238,3.12696,9.129716,6.70272,3.261433,25.193857,0.0,4.600482,78.304426,Good,Average
3,75.460597,7.220836,6.159168,16.496689,7.135854,3.726265,19.5273,9.958423,3.947706,94.018274,Good,Average
4,40.316933,9.394357,2.019835,25.241623,8.076086,3.049478,23.348229,4.320347,8.084322,100.0,Good,Good
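
Row-wise `apply` works, but the same banding can be done on the whole column at once. The following is a sketch using `np.select`, assuming the 6–8 band is inclusive at both ends; the `hours` values are made up to exercise the boundaries.

```python
import numpy as np
import pandas as pd

# Made-up values chosen to hit both boundaries (6.0 and 8.0).
hours = pd.Series([5.9, 6.0, 7.2, 8.0, 8.1], name="Sleep_Hours")

# Conditions are checked in order; the first match wins.
conditions = [hours < 6, hours <= 8]
labels = ["Poor", "Average"]

sleep_quality_col = pd.Series(
    np.select(conditions, labels, default="Good"),  # > 8 hours falls through
    index=hours.index,
    name="Sleep_Quality",
)
print(sleep_quality_col.tolist())
```

`pd.cut` is another option, but its bins are closed on one side only, so it cannot express "below 6, 6 to 8 inclusive, above 8" without adjusting an edge.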


### Q9. Create a new column called Lifestyle_Risk:
Start with 0:
* +1 if BMI > 30
* +1 if Alcohol > 14 units/week
* +1 if Smoking > 5 cigarettes/day
* +1 if Physical_Activity < 30 minutes/day


This will simulate a “risk index” based on unhealthy habits.

In [16]:
# Start at 0
df['Lifestyle_Risk'] = 0

# Add 1 if BMI > 30
df['Lifestyle_Risk'] += (df['BMI'] > 30).astype(int)

# Add 1 if alcohol > 14 units
df['Lifestyle_Risk'] += (df['Alcohol'] > 14).astype(int)

# Add 1 if smoking > 5 cigarettes
df['Lifestyle_Risk'] += (df['Smoking'] > 5).astype(int)

# Add 1 if physical activity < 30 minutes
df['Lifestyle_Risk'] += (df['Physical_Activity'] < 30).astype(int)

# Show the input columns alongside the resulting score
df[["BMI","Alcohol",'Smoking','Physical_Activity','Lifestyle_Risk']]

Unnamed: 0,BMI,Alcohol,Smoking,Physical_Activity,Lifestyle_Risk
0,24.275932,4.280610,8.984006,54.934283,1
1,25.970141,7.461846,3.223304,42.234714,0
2,25.193857,0.000000,4.600482,57.953771,0
3,19.527300,9.958423,3.947706,75.460597,0
4,23.348229,4.320347,8.084322,40.316933,1
...,...,...,...,...,...
9995,22.419178,6.763459,15.932184,71.022041,1
9996,25.706761,1.282725,0.982908,5.033101,1
9997,25.287737,1.124172,3.221665,30.893666,0
9998,25.847730,2.986629,6.557577,54.915311,1
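
An alternative sketch for the same index keeps each rule as a named boolean column and sums across them, which makes it easy to see which rule fired for a given user. The flag names (`high_bmi`, `high_alcohol`, `smoker`, `low_activity`) and the toy values are illustrative assumptions.

```python
import pandas as pd

# Toy data (made-up values).
toy = pd.DataFrame({
    "BMI": [24.3, 31.0, 33.0],
    "Alcohol": [4.3, 15.0, 2.0],
    "Smoking": [9.0, 6.0, 0.0],
    "Physical_Activity": [54.9, 10.0, 45.0],
})

# One boolean column per rule; the names are hypothetical labels.
risk_flags = pd.DataFrame({
    "high_bmi": toy["BMI"] > 30,
    "high_alcohol": toy["Alcohol"] > 14,
    "smoker": toy["Smoking"] > 5,
    "low_activity": toy["Physical_Activity"] < 30,
})

# Summing booleans across columns counts how many rules each user triggers.
toy["Lifestyle_Risk"] = risk_flags.sum(axis=1)

# Number of users at each risk level.
distribution = toy["Lifestyle_Risk"].value_counts().sort_index()
print(distribution)
```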


### Q10. Compute the average Overall_Health_Score for each Sleep_Quality category.

In [17]:
#average health score for each sleep quality category 
health_score_avg = df.groupby('Sleep_Quality')['Overall_Health_Score'].mean()

# display the average scores from highest to lowest
health_score_avg.sort_values(ascending=False)

Sleep_Quality
Good       82.611517
Average    78.168517
Poor       73.871340
Name: Overall_Health_Score, dtype: float64
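
Sorting by value happens to give Good, Average, Poor here, but the labels themselves have a natural order. As a sketch on made-up toy data, an ordered categorical makes `groupby` return the categories in that order regardless of the values:

```python
import pandas as pd

# Toy data (made-up values).
toy = pd.DataFrame({
    "Sleep_Quality": ["Good", "Poor", "Average", "Good", "Poor"],
    "Overall_Health_Score": [90.0, 70.0, 80.0, 80.0, 74.0],
})

# Declare the logical order of the labels.
order = ["Poor", "Average", "Good"]
toy["Sleep_Quality"] = pd.Categorical(
    toy["Sleep_Quality"], categories=order, ordered=True
)

# Groups come back in category order, not alphabetical order.
avg_by_sleep = toy.groupby("Sleep_Quality", observed=True)[
    "Overall_Health_Score"
].mean()
print(avg_by_sleep)
```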

### Q11. Group users by Lifestyle_Risk and calculate:
* Average Overall_Health_Score
* Average Stress_Level

In [18]:
# Average Overall_Health_Score
health_score_avg = df.groupby('Lifestyle_Risk')['Overall_Health_Score'].mean()
health_score_avg

Lifestyle_Risk
0    83.932394
1    76.465697
2    69.354088
3    63.944348
Name: Overall_Health_Score, dtype: float64

In [19]:
# Average Stress_Level
stress_level_avg = df.groupby('Lifestyle_Risk')['Stress_Level'].mean()
stress_level_avg

Lifestyle_Risk
0    5.000153
1    5.007765
2    4.871696
3    5.024801
Name: Stress_Level, dtype: float64

In [20]:
# Or
# Avg Overall_Health_Score & Stress_Level together
df.groupby('Lifestyle_Risk')[['Overall_Health_Score','Stress_Level']].mean()

Unnamed: 0_level_0,Overall_Health_Score,Stress_Level
Lifestyle_Risk,Unnamed: 1_level_1,Unnamed: 2_level_1
0,83.932394,5.000153
1,76.465697,5.007765
2,69.354088,4.871696
3,63.944348,5.024801
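
Group means are easier to trust when the group sizes are visible, since a small group can produce a noisy average. The sketch below uses named aggregation on made-up toy data; the output column names (`users`, `avg_health_score`, `avg_stress`) are illustrative.

```python
import pandas as pd

# Toy data (made-up values).
toy = pd.DataFrame({
    "Lifestyle_Risk": [0, 0, 1, 1, 2],
    "Overall_Health_Score": [90.0, 80.0, 75.0, 65.0, 60.0],
    "Stress_Level": [5.0, 4.0, 6.0, 4.0, 5.0],
})

# Named aggregation: output_name=(input_column, function).
by_risk = toy.groupby("Lifestyle_Risk").agg(
    users=("Overall_Health_Score", "count"),
    avg_health_score=("Overall_Health_Score", "mean"),
    avg_stress=("Stress_Level", "mean"),
)
print(by_risk)
```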


### Q12. Sort users by Overall_Health_Score descending. Which 3 factors are lowest among the bottom 5 users?

In [21]:
df.sort_values('Overall_Health_Score', ascending=False).tail(5)

Unnamed: 0,Physical_Activity,Nutrition_Score,Stress_Level,Mindfulness,Sleep_Hours,Hydration,BMI,Alcohol,Smoking,Overall_Health_Score,Health_Status,Sleep_Quality,Lifestyle_Risk
3430,54.330906,5.684203,7.379839,0.0,5.525995,2.713062,25.535648,6.496403,21.854822,7.901196,Poor,Poor,1
2001,42.109627,3.242038,4.746421,0.0,8.302063,1.619533,30.903076,3.686391,16.126818,7.73962,Poor,Good,2
5162,43.160864,0.0,7.654257,2.830695,4.156696,1.57762,25.130162,3.399922,2.307202,7.026355,Poor,Poor,0
6585,47.070763,1.913814,4.686803,0.94802,6.26084,1.772245,24.219959,6.879957,14.710177,4.100249,Poor,Average,1
8440,29.086897,1.948337,3.363219,0.0,5.46819,1.333555,26.095117,9.473138,7.042689,2.217088,Poor,Poor,2


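**Compared with the dataset means from Q4, the bottom 5 users are furthest below average on Nutrition_Score (about 2.6 vs 7.0) and Mindfulness (about 0.8 vs 15.2). Hydration (about 1.8 vs 2.5) and Sleep_Hours (about 5.9 vs 7.0) are also below average, but by a smaller margin.**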
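
The factors are on different scales, so "lowest" is clearer when each one is expressed in standard deviations from the overall mean. The following sketch does this on made-up toy data with two factors; on the real data the `factors` list would hold all the input columns.

```python
import pandas as pd

# Toy data (made-up values): the two lowest scorers have poor nutrition.
toy = pd.DataFrame({
    "Nutrition_Score": [1.0, 2.0, 7.0, 8.0, 9.0, 9.0],
    "Sleep_Hours": [6.5, 7.0, 7.0, 7.5, 6.5, 7.5],
    "Overall_Health_Score": [5.0, 8.0, 70.0, 85.0, 95.0, 100.0],
})
factors = ["Nutrition_Score", "Sleep_Hours"]

# Lowest scorers (2 here; 5 in the notebook's question).
bottom = toy.nsmallest(2, "Overall_Health_Score")

# Distance of the bottom group's mean from the overall mean,
# in units of each factor's overall standard deviation.
z = (bottom[factors].mean() - toy[factors].mean()) / toy[factors].std()

# Most negative first: these factors are furthest below average.
ranked = z.sort_values()
print(ranked)
```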

### Q13. Compute the correlation matrix for all numeric columns. Identify the strongest positive and strongest negative correlations with Overall_Health_Score.

In [22]:
# Select only the numeric columns (any int or float width)
num_cols = df.select_dtypes(include="number")

# finding the correlation matrix
corr_matrix = num_cols.corr()

In [23]:
corr_matrix['Overall_Health_Score'].sort_values(ascending=False)

Overall_Health_Score    1.000000
Mindfulness             0.715856
Nutrition_Score         0.385678
Sleep_Hours             0.171856
Physical_Activity       0.146014
Hydration               0.142044
Stress_Level           -0.129315
BMI                    -0.160278
Alcohol                -0.174725
Lifestyle_Risk         -0.253431
Smoking                -0.294431
Name: Overall_Health_Score, dtype: float64

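**Mindfulness (0.72) and Nutrition_Score (0.39) have the strongest positive correlations with Overall_Health_Score. Among the original factors, Smoking (-0.29) and Alcohol (-0.17) have the strongest negative correlations; the derived Lifestyle_Risk column (-0.25) falls between them.**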
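
When ranking correlations with the target, it can help to drop the target's self-correlation and any derived columns first. The sketch below does that on made-up toy data; treating Lifestyle_Risk as the derived column to exclude is an assumption based on how it was built in Q9.

```python
import pandas as pd

# Toy data (made-up values).
toy = pd.DataFrame({
    "Mindfulness": [0.0, 5.0, 10.0, 20.0, 30.0],
    "Smoking": [20.0, 12.0, 8.0, 3.0, 0.0],
    "Lifestyle_Risk": [2, 1, 1, 0, 0],
    "Overall_Health_Score": [10.0, 40.0, 60.0, 90.0, 100.0],
    "Health_Status": ["Poor", "Average", "Average", "Good", "Good"],
})
target = "Overall_Health_Score"

corr = (
    toy.select_dtypes(include="number")  # drop text columns
       .drop(columns=["Lifestyle_Risk"])  # drop the derived index
       .corr()[target]
       .drop(target)                      # drop the trivial 1.0 entry
       .sort_values(ascending=False)
)

strongest_positive = corr.idxmax()
strongest_negative = corr.idxmin()
print(strongest_positive, strongest_negative)
```

Correlation only measures linear association, so these rankings describe the data and do not by themselves establish which habits cause a higher score.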

### Q14. Identify users who are outliers: BMI > 35 or Stress_Level ≥ 9. Save this subset to a new DataFrame.

In [24]:
outliers = df[(df['BMI']>35) | (df['Stress_Level']>=9)]
outliers

Unnamed: 0,Physical_Activity,Nutrition_Score,Stress_Level,Mindfulness,Sleep_Hours,Hydration,BMI,Alcohol,Smoking,Overall_Health_Score,Health_Status,Sleep_Quality,Lifestyle_Risk
9,55.851201,8.637189,9.333768,12.377929,4.253326,2.921369,27.202112,0.000000,0.000000,81.568796,Good,Poor,0
20,74.312975,10.000000,9.426556,25.005483,9.648561,2.039538,29.788886,2.258170,6.453720,100.000000,Good,Good,1
126,25.189273,7.945630,10.000000,16.701065,6.969940,3.226328,27.000497,2.421524,8.248503,73.203956,Good,Average,2
214,38.694615,3.173319,9.023443,27.983700,6.734340,2.862605,28.349877,2.937137,4.851734,76.738921,Good,Average,0
284,87.660667,7.242957,9.623219,13.225192,6.848536,3.511267,24.449142,4.783122,2.754459,84.807466,Good,Average,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9891,63.316476,2.588130,9.105115,9.590910,6.024946,3.985643,20.422287,4.902665,12.098090,40.970707,Average,Average,1
9915,50.798223,6.930079,9.511034,10.453558,8.363162,2.247165,29.761195,3.957465,0.000000,67.770764,Average,Good,0
9925,34.582142,8.493370,10.000000,17.051444,5.929060,3.615746,20.240816,3.005586,2.511867,93.186915,Good,Poor,0
9927,40.308419,3.822994,9.002045,6.758480,9.617611,2.813723,20.994581,9.022736,0.000000,53.283368,Average,Good,0


**In the rows shown, every user is flagged by Stress_Level ≥ 9 rather than BMI > 35, yet Health_Status is Good or Average. The display is truncated, so this may not hold for the whole subset.**
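
Since the filter combines two rules with "or", it may be worth recording which rule flagged each row and summarizing Health_Status over the whole subset rather than the visible rows. A sketch on made-up toy data, with hypothetical flag column names `High_BMI` and `High_Stress`:

```python
import pandas as pd

# Toy data (made-up values).
toy = pd.DataFrame({
    "BMI": [27.2, 36.1, 29.8, 35.5],
    "Stress_Level": [9.3, 4.0, 5.0, 9.1],
    "Health_Status": ["Good", "Poor", "Good", "Average"],
})

bmi_flag = toy["BMI"] > 35
stress_flag = toy["Stress_Level"] >= 9

# Keep rows matching either rule and record which rule(s) matched.
outliers = toy[bmi_flag | stress_flag].assign(
    High_BMI=bmi_flag, High_Stress=stress_flag
)

# Rows flagged by each rule, and the status mix across all outliers.
flag_counts = outliers[["High_BMI", "High_Stress"]].sum()
status_counts = outliers["Health_Status"].value_counts()
print(flag_counts, status_counts, sep="\n")
```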

### Q15. Create another dataset with Weekly_Steps. Merge it with your health data.

In [26]:
# create a dataframe with Weekly_Steps
weekly_steps = pd.DataFrame({
    'Weekly_Steps': np.random.randint(10000, 120000, size=len(df))
})
weekly_steps

Unnamed: 0,Weekly_Steps
0,109997
1,14259
2,49052
3,68156
4,69053
...,...
9995,74182
9996,27015
9997,15729
9998,16882


In [None]:
# Merge with the health df on the row index
new_df = df.merge(weekly_steps, left_index=True, right_index=True)
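
The random steps change on every run, and an index merge silently duplicates rows if either index has repeats. The sketch below, on made-up toy data, seeds a NumPy generator (the seed 42 is an arbitrary choice) and asks `merge` to verify a one-to-one match.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the health data (made-up values).
health = pd.DataFrame({"Overall_Health_Score": [36.9, 55.2, 78.3]})

# Seeded generator so the synthetic steps are reproducible.
rng = np.random.default_rng(42)
steps = pd.DataFrame(
    {"Weekly_Steps": rng.integers(10_000, 120_000, size=len(health))},
    index=health.index,  # share the index so rows line up
)

# validate="one_to_one" raises MergeError if either index has duplicates.
merged = health.merge(
    steps, left_index=True, right_index=True, validate="one_to_one"
)
print(merged.shape)
```

Because the two frames share an index, `health.join(steps)` would give the same result here.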