## Objective of this assignment
In this assignment, you will choose two classical machine learning algorithms implemented by
scikit-learn, spend some time researching them, then describe each algorithm in your own words,
compare and contrast their strengths and weaknesses, and apply both algorithms to the same
dataset of your choice.

## Part 1: Algorithm Selection
#### Decision: Decision Tree and Random Forest

## Part 2: Description

### Main Concepts
**Decision Tree**
is a hierarchical model made up of a root node, internal nodes, and leaf nodes that together encode a sequence of decisions. During training it recursively partitions the data, choosing at each node the feature threshold that produces the purest child nodes, as measured by Gini impurity or entropy (equivalently, the split with the largest information gain). To make a prediction, it follows a single path from the root to a leaf and returns that leaf's class. Because every decision rule can be read directly from the tree, it is a highly interpretable "white box" model.
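
To make the purity measures concrete, here is a minimal pure-Python sketch of Gini impurity, entropy, and information gain. The function names and the toy labels are illustrative assumptions, not part of scikit-learn or of the dataset used later.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Hypothetical toy labels: a split that separates the two classes perfectly.
parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]

print(gini(parent))                           # 0.5
print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # 1.0
```

A perfectly mixed two-class node has the maximum impurity (0.5 for Gini, 1 bit for entropy), and a split into two pure children recovers all of it as information gain.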

**Random Forest**
is an ensemble model that combines many decision trees into a stronger predictor. Each tree is trained on a bootstrap sample (rows drawn at random with replacement), which makes the trees diverse. At each split, only a random subset of the features is considered, which decorrelates the trees. For prediction, the forest combines the votes of all trees and returns the majority class. This reduces overfitting because the errors of individual trees tend to cancel out.
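
The three ingredients described above (bootstrap sampling, random feature subsets, and majority voting) can be sketched in plain Python. The helper names, the vote labels, and the subset size of 3 are assumptions chosen for illustration. Note also that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this is a sketch of the idea and not of that implementation.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at one split."""
    return rng.sample(range(n_features), k)


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = bootstrap_indices(n_samples=10, rng=rng)
features = random_feature_subset(n_features=14, k=3, rng=rng)

# Hypothetical votes from five trees for a single sample.
votes = ["trouble", "no trouble", "trouble", "trouble", "no trouble"]
print(majority_vote(votes))  # trouble
```

Because rows are drawn with replacement, some appear several times in `rows` and others not at all, which is what makes each tree see a slightly different dataset.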

#### How the Algorithms Work
**Decision Tree** builds a single tree by repeatedly finding the "best" feature splits that separate classes most effectively, creating a flowchart-like structure where each path from root to leaf represents a classification rule.

**Random Forest** builds hundreds of decision trees, each trained on random data samples and using random feature subsets, then combines their predictions through majority voting to produce a more accurate and stable final result.

**Key Difference**: A Decision Tree is a single, greedily grown model, while a Random Forest builds many diverse trees and aggregates their predictions to reduce errors.
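
As a rough illustration of this difference, the sketch below fits both scikit-learn estimators on the same synthetic dataset and compares their cross-validated accuracy. The data comes from `make_classification` and is purely a stand-in; the sample size, feature count, and seed are arbitrary assumptions, and results on real data may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(
    n_samples=500, n_features=14, n_informative=6, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy for each model.
tree_scores = cross_val_score(tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

print(f"Decision tree accuracy: {tree_scores.mean():.3f}")
print(f"Random forest accuracy: {forest_scores.mean():.3f}")
```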

## Part 3: Comparison

### Decision Tree Strengths
**High Interpretability**: Easy to understand and explain - shows clear decision rules

**Little Data Preprocessing**: Needs no feature scaling or normalization; in scikit-learn, however, categorical features must still be encoded as numbers

**Fast Prediction**: Makes quick decisions by following a simple path

**Feature Importance**: Naturally ranks feature importance through split selection

**Handles Non-linearity**: Captures complex relationships without transformation
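
The interpretability and feature-importance points can be illustrated with a small sketch. It uses the Iris dataset bundled with scikit-learn only because it is readily available; the depth limit of 2 is an arbitrary choice to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Print the learned decision rules as indented text.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

# Impurity-based importance of each feature (the values sum to 1).
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

Each line of the printed rules is a threshold test on one feature, so a prediction can be traced by hand from the root to a leaf.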

### Random Forest Strengths
**High Accuracy**: Typically more accurate than a single decision tree and competitive with many other algorithms on tabular data

**Reduced Overfitting**: Averaging multiple trees reduces overfitting to noise

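**Robust to Noisy Data**: Averaging over trees trained on different bootstrap samples dampens the effect of noise and outliers. Bootstrap sampling does not by itself handle missing values, so these may still need to be imputed, depending on the scikit-learn version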

**Feature Importance**: Provides more reliable importance scores than single trees

**Works "Out-of-the-Box"**: Requires less parameter tuning than many algorithms

**Parallelizable**: Trees can be built simultaneously for faster training
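
A short sketch of how some of these strengths surface in scikit-learn: `n_jobs=-1` builds trees in parallel, and `oob_score=True` reuses the rows left out of each bootstrap sample as a built-in validation set. The synthetic data and the choice of 200 trees are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(
    n_samples=500, n_features=14, n_informative=6, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200,
    n_jobs=-1,       # build trees on all available CPU cores
    oob_score=True,  # score each sample using only trees that did not train on it
    random_state=0,
)
forest.fit(X, y)

print(f"Out-of-bag accuracy: {forest.oob_score_:.3f}")
print(forest.feature_importances_.round(3))
```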

### Decision Tree Weaknesses

**Prone to Overfitting**: Can create overly complex trees that memorize noise rather than learning patterns

**High Variance**: Small data changes can lead to completely different tree structures

**Unstable**: Sensitive to variations in the training data, which can hurt generalization

**Greedy Nature**: Makes locally optimal decisions that may not be globally optimal

**Limited Performance**: Often outperformed by ensemble methods on complex tasks

**Biased with Imbalanced Data**: Can create skewed trees if some classes dominate
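
The overfitting weakness can be seen in a small experiment, sketched here on synthetic data with deliberately noisy labels. The noise level, sample size, and the `max_depth=4` limit are arbitrary assumptions; the point is only the gap between training and test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise: flip_y=0.2 reassigns the class of
# roughly 20% of the samples at random.
X, y = make_classification(
    n_samples=600, n_features=14, n_informative=6, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree keeps splitting until every leaf is pure.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A depth limit acts as a simple form of regularization.
shallow = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

for name, model in [("unrestricted", deep), ("max_depth=4", shallow)]:
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
```

The unrestricted tree fits the training set perfectly, noise included, so its training accuracy says little about how it will perform on unseen data.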

### Random Forest Weaknesses

**Black Box Model**: Difficult to interpret - loses Decision Tree's transparency

**Computationally Expensive**: Requires more memory and processing power

**Slower Prediction**: Must run through multiple trees instead of one

**Overfitting Risk**: Can still overfit on very noisy datasets

**Memory Intensive**: Stores multiple large trees in memory

**Parameter Sensitivity**: Performance depends on proper tuning of tree count and depth
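
One common way to handle the tuning of tree count and depth is a cross-validated grid search. The sketch below is a minimal example on synthetic data; the grid values are placeholder assumptions and would need to be adapted to a real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only for illustration.
X, y = make_classification(
    n_samples=300, n_features=14, n_informative=6, random_state=0
)

# Small, hypothetical grid over the two parameters named above.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```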

### Core Strengths of the Algorithms
**Decision Tree**: Best when you need model transparency and explanation

**Random Forest**: Best when you need maximum predictive accuracy


## Part 4: Application to a Dataset

### Predicting Sleep Trouble Among Older Adults
#### A Machine Learning Analysis Using the National Poll on Healthy Aging (NPHA)

[Malani, P. N., Kullgren, J., & Solway, E. (2019). National Poll on Healthy Aging (NPHA), [United States], April 2017 (ICPSR 37305) [Data set]. Inter-university Consortium for Political and Social Research (ICPSR).](https://doi.org/10.3886/ICPSR37305.v1)

##### Objective

The purpose of this project is to use machine learning models (via scikit-learn) to predict whether a respondent reports trouble sleeping, based on demographic, health, and lifestyle factors collected in the National Poll on Healthy Aging (NPHA).

The study aims to:
1. Identify which health and lifestyle factors are most predictive of sleep trouble.
2. Evaluate how well supervised learning models (**Logistic Regression, Random Forest**) can classify individuals into “has trouble sleeping” vs. “does not have trouble sleeping”.
3. Provide interpretable insights that may inform healthcare or behavioral interventions for older adults.

#### ❓ Questions I Seek to Answer

1. Which variables (e.g., pain, medication, stress, physical or mental health) most strongly predict sleep trouble?
2. Can machine learning models **reliably** predict sleep trouble using survey-based data?
3. How do demographic factors (age, race, gender, employment) interact with health indicators in shaping sleep health?

#### 📋 Dataset Description

The NPHA dataset contains survey responses from U.S. adults aged 50–80 about:

1. Physical, mental, and dental health

2. Sleep behaviors and challenges

3. Use of medications

4. Doctor visits and healthcare engagement

5. Demographics (age, gender, race, employment)

For this analysis, the following subset of variables will be used:

| Variable                           | Description                          | Type        | Role                |
| ---------------------------------- | ------------------------------------ | ----------- | ------------------- |
| Number of Doctors Visited          | Number of different doctors seen     | Ordinal     | Feature             |
| Age                                | Respondent’s age group (50–64, 65–80) | Categorical | Feature             |
| Physical Health                    | Self-reported physical health rating | Ordinal     | Feature             |
| Mental Health                      | Self-reported mental health rating   | Ordinal     | Feature             |
| Dental Health                      | Self-reported dental health rating   | Ordinal     | Feature             |
| Employment                         | Employment or work status            | Categorical | Feature             |
| Stress Keeps from Sleeping         | Stress prevents sleep                | Binary      | Feature             |
| Medication Keeps from Sleeping     | Medication prevents sleep            | Binary      | Feature             |
| Pain Keeps from Sleeping           | Pain prevents sleep                  | Binary      | Feature             |
| Bathroom Needs Keeps from Sleeping | Bathroom needs prevent sleep         | Binary      | Feature             |
| Unknown Keeps from Sleeping        | Other unknown causes prevent sleep   | Binary      | Feature             |
| Prescription Sleep Medication      | Use of prescription sleep medication | Categorical | Feature             |
| Race                               | Racial/ethnic background             | Categorical | Feature             |
| Gender                             | Gender identity                      | Categorical | Feature             |
| **Trouble Sleeping**               | Reports having trouble sleeping      | Binary      | **Target Variable** |
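
One way the variables in this table might be prepared for a scikit-learn model is sketched below. It is not the code from the accompanying notebook. The data is randomly generated placeholder integer codes, not real NPHA responses, and the column positions, code ranges, and the choice of which columns to one-hot encode are all assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200

# Placeholder data: random integer codes, NOT real NPHA responses.
# Columns 0-3 stand in for nominal features (e.g. Employment, Race, Gender,
# Prescription Sleep Medication); columns 4-7 for ordinal or binary features.
X = rng.integers(1, 4, size=(n, 8))
y = rng.integers(0, 2, size=n)  # stand-in for the Trouble Sleeping target

nominal_columns = [0, 1, 2, 3]
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), nominal_columns)],
    remainder="passthrough",  # ordinal and binary columns are kept as numbers
)
model = make_pipeline(
    preprocess, RandomForestClassifier(n_estimators=100, random_state=0)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model.fit(X_train, y_train)
print(f"Test accuracy on placeholder data: {model.score(X_test, y_test):.3f}")
```

Since the placeholder labels are random, the accuracy printed here carries no meaning; the sketch only shows how nominal features can be one-hot encoded while ordinal and binary features pass through unchanged.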


[Applying the Logistic Regression model to this dataset](logistic.ipynb)