<a href="https://colab.research.google.com/github/theshriramgupta/py/blob/master/ml_wellness_project_shriram_gupta.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment : Personalized Wellness AI

## Phase 1 : Technical Proof of Concept

### Synthetic Data Design & Insights           
Describe your synthetic data generation strategy.      
• Features: What specific features (e.g., daily_steps,
sleep_duration_hours, mood_score, dietary_category_intake,
stress_level) did you include, and why are they crucial for wellness
recommendations?          
• Realism & Assumptions: How did you create realistic relationships and variability?       
What key assumptions did you make about user behavior or wellness factors?      
• Visual Insights: Include 1-2 key visualizations of your synthetic data (e.g., scatter plot
of activity vs. sleep, histogram of mood scores). Explain the patterns or "story" these
visuals reveal.

In [None]:
# Synthetic Data generation strategy

To build a proof of concept for a Personalized Wellness AI system, we need to generate synthetic data that closely mimics real-world health and lifestyle patterns. The goal is to simulate the data a wellness app might collect from users through devices, wearables, or manual input.
The features chosen are:
> daily_steps : The number of steps a user takes per day. Physical activity is a key component of personal wellness and correlates with mood, energy levels, and sleep quality.
> sleep_duration_hours : How many hours the user sleeps per day. Sleep is fundamental to emotional stability, cognitive function, and overall well-being.
> mood_score : The target variable, reflecting how a person feels. Predicting and improving mood is a central objective of personalized wellness.
> diet_category : The type of diet the user follows (for example balanced, high-protein, or carb-heavy). Nutrition is a foundational pillar of health and can affect both mental and physical performance.
> stress_level : The user's stress level. Higher stress is associated with lower mood scores, so it is an important input for predicting mood.

In [None]:
# Realism & Assumptions in Data :
To keep the relationships and variability realistic, the synthetic data must be plausible and meaningful. I made the following assumptions:
> Users who walk more and sleep better tend to report higher mood scores.
> Higher stress levels are associated with lower mood scores.
> A balanced or high-protein diet slightly improves mood compared to a carb-heavy one.
> The features are normally distributed around typical values (e.g., most people sleep around 7 hours or take about 6000 steps a day), with a standard deviation adding natural variation for realism.
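
The assumptions above could be turned into a generator along the following lines. This is a minimal sketch, not the notebook's actual code: the function name `make_wellness_data`, the sample size, the clipping ranges, and every coefficient are hypothetical values chosen only to illustrate the stated relationships.

```python
import numpy as np
import pandas as pd


def make_wellness_data(n_users=500, seed=42):
    """Generate synthetic wellness records; every constant here is illustrative."""
    rng = np.random.default_rng(seed)

    # Behaviours drawn from normal distributions, clipped to plausible ranges.
    daily_steps = rng.normal(6000, 2000, n_users).clip(500, 20000)
    sleep_duration_hours = rng.normal(7, 1.2, n_users).clip(3, 11)
    stress_level = rng.normal(5, 2, n_users).clip(1, 10)
    diet_category = rng.choice(
        ["balanced", "high_protein", "carb_heavy"], size=n_users
    )

    # Small diet effect: balanced and high-protein sit slightly above carb-heavy.
    diet_effect = pd.Series(diet_category).map(
        {"balanced": 0.3, "high_protein": 0.2, "carb_heavy": -0.2}
    ).to_numpy()

    # Mood rises with steps and sleep, falls with stress, plus random noise.
    mood_score = (
        6.0
        + 0.5 * (daily_steps - 6000) / 2000
        + 0.6 * (sleep_duration_hours - 7)
        - 0.4 * (stress_level - 5)
        + diet_effect
        + rng.normal(0, 0.7, n_users)
    ).clip(1, 10)

    return pd.DataFrame(
        {
            "daily_steps": daily_steps.round().astype(int),
            "sleep_duration_hours": sleep_duration_hours.round(1),
            "stress_level": stress_level.round(1),
            "diet_category": diet_category,
            "mood_score": mood_score.round(1),
        }
    )


df = make_wellness_data()
print(df.head())
# Correlation of each numeric feature with the target, as a quick sanity check.
print(df.drop(columns="diet_category").corr()["mood_score"].round(2))
```

Centering each feature on its assumed mean keeps the average mood near the baseline of 6, and the final noise term prevents mood from being a deterministic function of the inputs.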

In [None]:
# Visual Insights
To understand the trends and validate our assumptions, I created simple visualizations:

> Scatter Plot: Daily Steps vs. Sleep Duration
- Reveals a positive pattern where users with more sleep and more steps often report better mood.
- Highlights the interplay between physical activity, rest, and emotional state.

> Histogram: Mood Score Distribution
- Shows that most mood scores fall between 5 and 8, which matches the expectation that most people report moderate moods on a typical day.
- This supports the realism of the synthetic data, as extreme moods (very high or low) are rare.
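
The two plots described above might be produced roughly as follows. This is a sketch under stated assumptions: the data is regenerated inline with illustrative coefficients, mood is assumed to be encoded as point colour in the scatter plot, and the output filename is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line to display inline in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Illustrative synthetic data (coefficients are assumptions, not the notebook's values).
rng = np.random.default_rng(0)
n = 500
steps = rng.normal(6000, 2000, n).clip(500, 20000)
sleep = rng.normal(7, 1.2, n).clip(3, 11)
mood = (
    6 + 0.5 * (steps - 6000) / 2000 + 0.6 * (sleep - 7) + rng.normal(0, 0.8, n)
).clip(1, 10)

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(11, 4))

# Scatter plot: steps vs. sleep, coloured by mood score.
points = ax_scatter.scatter(steps, sleep, c=mood, cmap="viridis", s=12)
fig.colorbar(points, ax=ax_scatter, label="mood_score")
ax_scatter.set_xlabel("daily_steps")
ax_scatter.set_ylabel("sleep_duration_hours")
ax_scatter.set_title("Daily Steps vs. Sleep Duration")

# Histogram: distribution of mood scores.
ax_hist.hist(mood, bins=20, edgecolor="black")
ax_hist.set_xlabel("mood_score")
ax_hist.set_ylabel("number of users")
ax_hist.set_title("Mood Score Distribution")

fig.tight_layout()
fig.savefig("wellness_overview.png", dpi=120)
```

Colouring the scatter points by mood is what lets a two-axis plot show a three-way relationship: brighter points should cluster toward the upper right if more steps and more sleep go together with better mood.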

### Model Selection & Justification
Choose one core ML problem within your Wellness AI (e.g., predicting mood, recommending
activities, identifying wellness trends).    
• ML Approach: Which machine learning algorithm(s) did you choose (e.g., Regression,
Classification, Clustering, Recommendation System, Time-Series Forecasting)?    
• Justification: Explain why you selected this model. Discuss its strengths for wellness
data and the trade-offs considered (e.g., interpretability, non-linear relationships).

In [None]:
For the core machine learning problem, I selected:
> Predicting Mood Score based on lifestyle factors such as physical activity, sleep, stress, and diet.
This is a regression problem, as the mood score is a continuous numeric value (typically on a scale of 1 to 10). By predicting this score, the system can offer personalized suggestions to improve a user’s emotional well-being.

- To solve this regression problem, I chose the Random Forest Regressor as the primary machine learning model.

Reasons for selecting Random Forest:
- It’s an ensemble model that builds multiple decision trees and averages their predictions, which reduces overfitting compared to a single tree and improves generalization.
- It captures non-linear relationships between inputs (e.g., how stress and sleep might interact) that simpler models such as linear regression might miss.
- It’s robust to noise and can handle both numerical and categorical features (like diet_category after encoding).
- It requires minimal preprocessing (for example, no feature scaling), which makes it well suited to a synthetic, fast-prototype system. The trade-off is that it is less directly interpretable than a linear model.
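
As an illustration of the points above, a Random Forest can be fitted to mixed numerical and categorical features with only one-hot encoding as preprocessing. The sketch below is an assumption-laden example: the data, the stress-sleep interaction term, and the hyperparameters are all hypothetical and not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 600
df = pd.DataFrame(
    {
        "daily_steps": rng.normal(6000, 2000, n).clip(500, 20000),
        "sleep_duration_hours": rng.normal(7, 1.2, n).clip(3, 11),
        "stress_level": rng.normal(5, 2, n).clip(1, 10),
        "diet_category": rng.choice(
            ["balanced", "high_protein", "carb_heavy"], size=n
        ),
    }
)

# Illustrative target with an interaction: stress hurts more when sleep is short.
sleep_deficit = np.maximum(7 - df["sleep_duration_hours"], 0)
df["mood_score"] = (
    6
    + 0.5 * (df["daily_steps"] - 6000) / 2000
    - 0.3 * (df["stress_level"] - 5)
    - 0.2 * df["stress_level"] * sleep_deficit
    + rng.normal(0, 0.7, n)
).clip(1, 10)

# One-hot encode the categorical feature; tree models need no feature scaling.
X = pd.get_dummies(df.drop(columns="mood_score"), columns=["diet_category"])
y = df["mood_score"]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# Impurity-based importances give a rough view of which inputs drive predictions.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).round(3))
```

The feature importances partly offset the interpretability trade-off: they do not explain individual predictions, but they indicate which lifestyle factors the model relies on most.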

### Evaluation Strategy
How would you evaluate your chosen model on your synthetic data?       
• Metrics: What specific evaluation metrics (e.g., RMSE, Accuracy/F1, Silhouette Score)
would you use, and why are they appropriate?      
• Validation: How would you validate your model's performance (e.g., train-test split,
cross-validation)?     
• Future Refinements: With more time, what specific steps would you take to refine your
model's performance and robustness?

In [None]:
# Evaluation Metrics Used
Since the task is to predict a numeric value (mood score), this is a regression problem, and the following metrics are most appropriate:

> RMSE (Root Mean Squared Error)
The square root of the average squared difference between predicted and actual mood scores.
Penalizes larger errors more than smaller ones (due to squaring).
Easy to interpret since it is in the same unit as the target (mood score scale: 1–10).

> MAE (Mean Absolute Error)
The average magnitude of the errors.
Unlike RMSE, weights every error in proportion to its size, so a single large miss does not dominate.
Less sensitive to outliers than RMSE, so it can give a more representative view when the data contains them.

# Why RMSE is preferred:
In this case, we want to be more sensitive to larger mistakes (e.g., predicting 9 when the true mood is 2), so RMSE is prioritized.
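
A small worked example makes the difference concrete. The numbers below are made up for illustration: both prediction sets have the same MAE, but the one containing a single large miss has twice the RMSE.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


y_true = [6, 7, 5, 2]
spread_out = [7, 8, 6, 3]  # four errors of 1 point each
one_big = [6, 7, 5, 6]     # three exact hits, one 4-point miss

print(mae(y_true, spread_out), rmse(y_true, spread_out))  # 1.0 1.0
print(mae(y_true, one_big), rmse(y_true, one_big))        # 1.0 2.0
```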

# Validation Strategy
To ensure the model is not just memorizing the data (overfitting), I would apply the following validation methods:

> Train-Test Split
Split the synthetic dataset (e.g., 80% training, 20% testing)
Train on one part, evaluate on the other
Quick and effective for proof-of-concept stage

> Cross-Validation (Optional for later stage)
For more robust validation, use K-Fold Cross-Validation
Especially useful when moving from proof-of-concept to real data
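
Both validation methods could be sketched with scikit-learn as shown below. The synthetic data, the coefficients, the 5-fold setting, and the number of trees are illustrative assumptions rather than the notebook's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Illustrative synthetic data: steps, sleep, and stress predicting mood.
rng = np.random.default_rng(7)
n = 500
steps = rng.normal(6000, 2000, n)
sleep = rng.normal(7, 1.2, n)
stress = rng.normal(5, 2, n)
X = np.column_stack([steps, sleep, stress])
y = np.clip(
    6 + 0.5 * (steps - 6000) / 2000 + 0.6 * (sleep - 7) - 0.4 * (stress - 5)
    + rng.normal(0, 0.7, n),
    1,
    10,
)

# Train-test split: 80% for training, 20% held out for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
mae = mean_absolute_error(y_test, pred)
print(f"hold-out RMSE={rmse:.2f}  MAE={mae:.2f}")

# K-fold cross-validation: every row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=0),
    X,
    y,
    cv=cv,
    scoring="neg_root_mean_squared_error",
)
cv_rmse = -scores  # scikit-learn negates error metrics so that higher is better
print(f"5-fold RMSE: mean={cv_rmse.mean():.2f}  std={cv_rmse.std():.2f}")
```

The spread of the per-fold scores is the extra information cross-validation adds: a single hold-out score cannot show how much the result depends on which rows happened to land in the test set.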

# Future Refinements
If I had more time or real-world data, I would improve the model by:

> Hyperparameter Tuning
- Use GridSearchCV or RandomizedSearchCV to fine-tune the Random Forest parameters (like depth, number of trees)

> Model Comparison
- Benchmark Random Forest against models like XGBoost or Neural Networks

> Outlier Detection
- Remove unrealistic values from synthetic data to improve model stability

> Bias & Fairness Checks
- Ensure no diet category or stress level is unfairly skewing predictions
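
The hyperparameter tuning step listed above might look like the following with GridSearchCV. This is only a sketch: the data is synthetic, and the parameter grid is a deliberately small, hypothetical example kept tiny so it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative synthetic data: steps, sleep, and stress predicting mood.
rng = np.random.default_rng(3)
n = 300
steps = rng.normal(6000, 2000, n)
sleep = rng.normal(7, 1.2, n)
stress = rng.normal(5, 2, n)
X = np.column_stack([steps, sleep, stress])
y = np.clip(
    6 + 0.5 * (steps - 6000) / 2000 + 0.6 * (sleep - 7) - 0.4 * (stress - 5)
    + rng.normal(0, 0.7, n),
    1,
    10,
)

# Small example grid over tree depth and number of trees.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 6, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best cross-validated RMSE: {-search.best_score_:.2f}")
```

For larger grids, RandomizedSearchCV samples a fixed number of combinations instead of trying all of them, which usually finds a comparable setting at a fraction of the cost.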

## Phase 2 : Impact and Reflection

### Real-World Impact & Considerations
Based on your design, what actionable insights or potential value could "Personalized
Wellness AI" provide in a real-world scenario? What are the primary risks, ethical
considerations (e.g., data privacy, recommendation bias), or significant limitations if
deployed?

In [None]:
# Potential Value of Personalized Wellness AI
If deployed in the real world, a Personalized Wellness AI system can have meaningful and transformative impacts on individuals and public health:
> Actionable Insights
Users could receive personalized recommendations such as:
- Increasing physical activity if mood is consistently low
- Sleeping earlier or reducing screen time to improve sleep quality
- Managing stress through mindfulness or breaks
Over time, the system could detect patterns in user behavior and mood and offer early warnings (e.g., burnout, sleep disorders)

> Risks, Ethical Concerns & Limitations
While the benefits are exciting, there are also serious challenges to consider:
- Data Privacy & Consent
Wellness data is highly personal. The system must ensure
user consent before collecting data, encryption and secure storage, and clear control over who can access or delete data.
- Bias in Recommendations
If the model is trained on data skewed toward a certain age group, gender, or lifestyle, it might give
unfair or inaccurate suggestions, or one-size-fits-all advice that doesn’t suit diverse users. Personalization must be fair, inclusive, and adaptive.
- Over-reliance on AI
Users might start depending on the system instead of consulting professionals.
The AI should clearly state that it is supportive, not diagnostic.

In the real world, data is often noisy, missing, or biased, and human behavior is less predictable than synthetic data assumes, so results from this proof of concept may not transfer directly.

### Challenges & Growth
Describe a specific moment during this assignment where you faced a challenge (e.g., data
generation, model choice). How did you overcome it? How did this project deepen your
understanding of ML or your own interests?

In [None]:
> Challenge Faced
One of the most significant challenges I encountered during this assignment was designing realistic synthetic data that reflects complex wellness behavior.
Generating data with random values is easy. It was much more difficult to ensure that the data:
- follows realistic health patterns (e.g., more sleep improves mood),
- maintains internal consistency, and
- reflects real-world variability (not just noise).
I had to carefully decide which features would influence mood, how strongly they should relate, and what assumptions were both realistic and generalizable. Finding the right balance between simplicity and realism took a lot of critical thinking.

> How I Overcame It
To tackle this:
I researched common wellness factors, like how physical activity, stress, and diet influence mood.
I designed controlled relationships using math (e.g., giving positive weight to sleep, negative weight to stress).
I used visualizations to validate if the synthetic data made sense — like scatter plots to observe trends.
This iterative process helped me build more credible and meaningful data, even in the absence of real-world input.

> How This Project Helped Me Grow
This project significantly deepened my understanding in several ways:
- Machine Learning as a System
I now see ML not just as algorithms, but as a full pipeline — from data design to modeling, evaluation, and ethics.
Every part matters. Even the best model will fail if the input data is flawed or unrepresentative.
- Personal Insight
This project aligned with my interest in both health and technology.
It inspired me to explore more "AI for good" projects that use technology to solve human problems in a thoughtful way.
- This assignment challenged me technically and conceptually, and helped me grow as both a machine learning practitioner and a responsible technologist.