HYPOTHESIS
(Single feature vs Target)
* Sugar Intake vs Health
* Work Hours vs Health
* Daily Steps vs Health
* Smoking vs Health

(Multi-feature vs Target)
* Sleep, Screen Time, and Stress vs Target
* Diet Quality and Physical Activity vs Target
* Substance Use Impact vs Target
* All Lifestyle Factors vs Target

1. Sugar Intake vs Health

    H₁ (Alternative Hypothesis): Sugar intake significantly affects a person’s health status.

    H₀ (Null Hypothesis): Sugar intake has no significant effect on a person’s health status.

2. Work Hours vs Health

    H₁: Number of work hours per day significantly affects health.

    H₀: Number of work hours per day does not significantly affect health.


3. Daily Steps vs Health

    H₁: Number of daily steps taken has a significant impact on health status.

    H₀: Number of daily steps has no significant impact on health status.


4. Smoking vs Health

    H₁: Smoking frequency is significantly associated with poor health.

    H₀: Smoking frequency has no significant association with health.

HYPOTHESIS TESTING (SINGLE FEATURE)

In [6]:
import pandas as pd
from scipy.stats import ttest_ind, mannwhitneyu, chi2_contingency
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv(r"C:\Users\vinays\refined.csv")  

def test_numeric_feature(feature, target='target'):
    healthy = df[df[target] == 1][feature]
    unhealthy = df[df[target] == 0][feature]

    # Try t-test
    t_stat, p_value = ttest_ind(healthy, unhealthy, nan_policy='omit')
    print(f"\nT-Test for {feature} vs {target}")
    print(f"T-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")
    
    if p_value < 0.05:
        print("Result: Reject Null Hypothesis — Feature is significantly related to Health")
    else:
        print("Result: Fail to Reject Null Hypothesis — No significant relation")


def test_categorical_feature(feature, target='target'):
    contingency_table = pd.crosstab(df[feature], df[target])
    chi2, p, dof, expected = chi2_contingency(contingency_table)

    print(f"\nChi-Square Test for {feature} vs {target}")
    print(f"Chi2 Statistic: {chi2:.4f}, p-value: {p:.4f}")
    
    if p < 0.05:
        print("Result: Reject Null Hypothesis — Feature is significantly related to Health")
    else:
        print("Result: Fail to Reject Null Hypothesis — No significant relation")


test_numeric_feature('sugar_intake')
test_numeric_feature('work_hours')
test_numeric_feature('daily_steps')
test_categorical_feature('smoking_level')



T-Test for sugar_intake vs target
T-statistic: 0.1260, p-value: 0.8997
Result: Fail to Reject Null Hypothesis — No significant relation

T-Test for work_hours vs target
T-statistic: 3.4491, p-value: 0.0006
Result: Reject Null Hypothesis — Feature is significantly related to Health

T-Test for daily_steps vs target
T-statistic: 1.5290, p-value: 0.1263
Result: Fail to Reject Null Hypothesis — No significant relation

Chi-Square Test for smoking_level vs target
Chi2 Statistic: 0.8907, p-value: 0.6406
Result: Fail to Reject Null Hypothesis — No significant relation
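
Four single-feature tests are run at the 0.05 level, so the chance of at least one false positive across the family is higher than 5%. The following is a minimal sketch, assuming a Holm step-down correction is an acceptable way to account for this (the notebook itself applies no correction). It uses the four p-values printed above; the helper name `holm_adjust` is hypothetical.

```python
import numpy as np


def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                      # indices from smallest to largest p
    scaled = (m - np.arange(m)) * p[order]     # multiply by m, m-1, ..., 1
    # Enforce monotonicity and cap at 1.0
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted


# p-values copied from the printed test results above
raw_p = {
    "sugar_intake": 0.8997,
    "work_hours": 0.0006,
    "daily_steps": 0.1263,
    "smoking_level": 0.6406,
}

adjusted_p = dict(zip(raw_p, holm_adjust(list(raw_p.values()))))

for name, p_adj in adjusted_p.items():
    decision = "reject H0" if p_adj < 0.05 else "fail to reject H0"
    print(f"{name}: raw p = {raw_p[name]:.4f}, Holm p = {p_adj:.4f} -> {decision}")
```

Under this correction work_hours remains the only feature whose null hypothesis is rejected, so the single-feature conclusions do not change.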


In [8]:
def test_numeric_feature(feature, target='target'):
    healthy = df[df[target] == 1][feature].dropna()
    unhealthy = df[df[target] == 0][feature].dropna()

    print(f"\nSample size for {feature}:")
    print(f"  Healthy: {len(healthy)}")
    print(f"  Unhealthy: {len(unhealthy)}")

    t_stat, p_value = ttest_ind(healthy, unhealthy, nan_policy='omit')
    print(f"T-Test for {feature} vs {target}")
    print(f"T-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")
    
    if p_value < 0.05:
        print("Result: Reject Null Hypothesis — Feature is significantly related to Health")
    else:
        print("Result: Fail to Reject Null Hypothesis — No significant relation")


test_numeric_feature('sugar_intake')
test_numeric_feature('work_hours')
test_numeric_feature('daily_steps')


Sample size for sugar_intake:
  Healthy: 70097
  Unhealthy: 29903
T-Test for sugar_intake vs target
T-statistic: 0.1260, p-value: 0.8997
Result: Fail to Reject Null Hypothesis — No significant relation

Sample size for work_hours:
  Healthy: 70097
  Unhealthy: 29903
T-Test for work_hours vs target
T-statistic: 3.4491, p-value: 0.0006
Result: Reject Null Hypothesis — Feature is significantly related to Health

Sample size for daily_steps:
  Healthy: 70097
  Unhealthy: 29903
T-Test for daily_steps vs target
T-statistic: 1.5290, p-value: 0.1263
Result: Fail to Reject Null Hypothesis — No significant relation
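
The two groups differ in size (70,097 vs 29,903), and `ttest_ind` with its default settings assumes equal variances. The first cell also imports `mannwhitneyu` without calling it. The sketch below shows how Welch's t-test and the Mann-Whitney U test could be run side by side as a robustness check. The data is synthetic and made up for illustration; it is not taken from refined.csv.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic stand-ins for one numeric feature in each group (illustrative only)
healthy = rng.normal(loc=8.0, scale=2.0, size=700)
unhealthy = rng.normal(loc=9.0, scale=3.0, size=300)

# Welch's t-test: does not assume equal variances in the two groups
welch_t, welch_p = ttest_ind(healthy, unhealthy, equal_var=False)

# Mann-Whitney U: rank-based, does not assume normality
u_stat, u_p = mannwhitneyu(healthy, unhealthy, alternative="two-sided")

print(f"Welch t = {welch_t:.4f}, p = {welch_p:.4g}")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {u_p:.4g}")
```

If both tests agree with the pooled-variance t-test, the conclusion is less likely to depend on the equal-variance or normality assumptions.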


After testing:
* Supported hypothesis: Number of work hours per day significantly affects health (p = 0.0006).
* The other features (sugar intake, daily steps, smoking level) do not show significant differences.

Possible reasons:
* sample size
* noise
* measurement error
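
With 100,000 rows, even a very small difference between groups can reach statistical significance, so it helps to look at effect size alongside the p-value. For a pooled-variance t-test (the `ttest_ind` default used above), Cohen's d can be recovered from the t-statistic and the group sizes as d = t * sqrt(1/n1 + 1/n2). The sketch below applies this to the reported work_hours result; the function name `cohens_d_from_t` is hypothetical.

```python
import math


def cohens_d_from_t(t_stat, n_a, n_b):
    """Cohen's d implied by a pooled-variance two-sample t-statistic."""
    return t_stat * math.sqrt(1.0 / n_a + 1.0 / n_b)


# Values copied from the printed output above
t_work_hours = 3.4491
n_healthy = 70097
n_unhealthy = 29903

d_work_hours = cohens_d_from_t(t_work_hours, n_healthy, n_unhealthy)
print(f"Cohen's d for work_hours: {d_work_hours:.4f}")
```

The result is roughly 0.02, well below the conventional 0.2 threshold for a "small" effect, which suggests the work_hours difference is statistically detectable but small in practical terms.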

HYPOTHESIS TESTING (MULTI-FEATURE)

1. Sleep, Screen Time, and Stress vs Health
   
Hypothesis (H₁):
Sleep quality, screen time, and stress levels collectively affect an individual's health.

Null Hypothesis (H₀):
Sleep quality, screen time, and stress levels do not collectively affect an individual's health.

2. Diet Quality and Physical Activity vs Health
   
Hypothesis (H₁):
Diet quality and physical activity levels significantly influence health outcomes.

Null Hypothesis (H₀):
Diet quality and physical activity levels do not significantly influence health outcomes.



3. Substance Use Impact (alcohol, smoking, caffeine) vs Health
   
Hypothesis (H₁):
Substance use patterns have a significant impact on health.

Null Hypothesis (H₀):
Substance use patterns have no significant impact on health.

4. All Lifestyle Factors (Full Model) vs Health
   
Hypothesis (H₁):
The combination of all considered lifestyle factors (sleep, diet, exercise, screen time, stress, substance use, etc.) collectively influences health status.

Null Hypothesis (H₀):
All lifestyle factors collectively have no significant effect on health.

In [10]:
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

df = pd.read_csv(r"C:\Users\vinays\refined.csv")

def logistic_regression_test(features, target='target', label=''):
    X = df[features]
    y = df[target]

    # Standardize numeric features (optional but improves interpretability)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Add constant for intercept
    X_scaled = sm.add_constant(X_scaled)
    
    # Fit logistic regression
    model = sm.Logit(y, X_scaled)
    result = model.fit(disp=False)

    print(f"\n--- Logistic Regression: {label} ---")
    print(result.summary())

    # Decision rule: reject if any single coefficient has p < 0.05 (this is not a joint test)
    pvals = result.pvalues[1:]  # Exclude constant
    if any(p < 0.05 for p in pvals):
        print("Result: Reject Null Hypothesis — At least one feature significantly affects health")
    else:
        print("Result: Fail to Reject Null Hypothesis — No significant relation")

# Hypothesis 1
features_1 = ['sleep_hours', 'sleep_quality', 'screen_time', 'stress_level']
logistic_regression_test(features_1, label="Sleep, Screen Time, and Stress")

# Hypothesis 2
features_2 = ['diet_type', 'nutrition_score', 'calorie_intake', 'sugar_intake', 'physical_activity', 'daily_steps']
logistic_regression_test(features_2, label="Diet Quality and Physical Activity")

# Hypothesis 3
features_3 = ['alcohol_consumption', 'smoking_level', 'caffeine_intake']
logistic_regression_test(features_3, label="Substance Use")

# Hypothesis 4 (Combined)
all_features = features_1 + features_2 + features_3
logistic_regression_test(all_features, label="All Lifestyle Factors Combined")



--- Logistic Regression: Sleep, Screen Time, and Stress ---
                           Logit Regression Results                           
Dep. Variable:                 target   No. Observations:               100000
Model:                          Logit   Df Residuals:                    99995
Method:                           MLE   Df Model:                            4
Date:                Sun, 27 Jul 2025   Pseudo R-squ.:               6.996e-06
Time:                        17:13:34   Log-Likelihood:                -61004.
converged:                       True   LL-Null:                       -61004.
Covariance Type:            nonrobust   LLR p-value:                    0.9311
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.8519      0.007    123.341      0.000       0.838       0.865
x1            -0.0036      0.007     -0.527      0.598      -0.017    

HYPOTHESES THAT HOLD GOOD
3. Substance Use Impact vs Target
Features: alcohol_consumption, smoking_level, caffeine_intake

H₀: Substance use is not related to health.
H₁: Substance use significantly affects health.

4. All Lifestyle Factors Combined vs Target
Combine all 13 features above into one logistic regression model.

FINAL CONCLUSION

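The points below give the expected direction of each factor. Several of them are not supported by the tests shown above, as noted.

sleep_quality and sleep_hours: better sleep is expected to correlate with good health.

screen_time: higher screen time is expected to negatively impact health.

stress_level: higher stress levels are expected to be associated with poorer health. The combined Sleep, Screen Time, and Stress model was not significant overall (LLR p-value 0.9311), so these three statements are not supported by this data.

sugar_intake and calorie_intake: excessive consumption is expected to increase health risks, but the t-test for sugar_intake was not significant (p = 0.8997).

physical_activity and daily_steps: expected to be positive indicators of health, but the t-test for daily_steps was not significant (p = 0.1263).

alcohol_consumption and smoking_level: part of the Substance Use model, which is reported to hold. On its own, smoking_level was not significant in the chi-square test (p = 0.6406).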

Hypotheses that held: Substance Use, and All Lifestyle Factors combined.

The analysis suggests that a core lifestyle cluster (stress, sleep, diet, and physical activity) is the primary influence on health status, although several of the individual tests above were not significant.
Public health interventions and individual behavior change programs targeting these factors may yield the largest improvement in population health.
