# End-to-End Data Engineering Project

**Author:** Iuliia Vitiugova  
**Repository:** Data Engineering & Data Structures – Research Portfolio

---

## Overview

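Integrated pipeline from raw data ingestion to analytical reporting. The notebook loads weight-gain and consumption measurements from Excel workbooks (`Gain.xlsx`, `Consum.xlsx`, `Manova.xlsx`) and compares three diets (D1, D2, D3) over two periods (14-28 and 28-42 days) using one-way and two-way ANOVA with post-hoc tests.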

### Reproducibility Notes
- All outputs are cleared; execute cells sequentially from top to bottom.
- Python 3 environment; see `requirements.txt` at the repo root.
- Any paths are relative; adjust the `DATA_DIR` variable if needed.

---



## Structure of this Notebook
1. Problem Statement & Goals
2. Data Ingestion & Validation
3. Preprocessing & Cleaning
4. Transformations / Feature Engineering
5. Analysis & Evaluation
6. Conclusions & Next Steps
---


In [None]:
import numpy as np
import pandas as pd
import io
import matplotlib.pyplot as plt
from scipy.stats import f_oneway, levene, shapiro, norm, lognorm

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
gain = pd.read_excel(io.BytesIO(uploaded['Gain.xlsx']))
gain

In [None]:
period_14_28 = gain[['Période 14-28 jours', 'Unnamed: 1', 'Unnamed: 2']].copy()
period_28_42 = gain[['Période 28-42 jours', 'Unnamed: 4', 'Unnamed: 5']].copy()
period_14_28.columns = ['D1', 'D2', 'D3']
period_28_42.columns = ['D1', 'D2', 'D3']


period_14_28 = period_14_28.apply(pd.to_numeric, errors='coerce')
period_28_42 = period_28_42.apply(pd.to_numeric, errors='coerce')

# One-Way ANOVA

In [None]:
from scipy.stats import shapiro, levene

# Function to check normality
def check_normality(data, group, period):
    stat, p = shapiro(data)
    print(f"Group {group}, Period {period}: W-statistic={stat:.4f}, p-value={p:.4f}")
    if p > 0.05:
        print("    No evidence against normality (fail to reject H0).")
    else:
        print("    Normality is rejected at the 5% level.")
    return p > 0.05

# Checking normality for each group
for period, data in zip(["14-28 days", "28-42 days"], [period_14_28, period_28_42]):
    print(f"\nPeriod: {period}")
    for group in ['D1', 'D2', 'D3']:
        group_data = data[group].dropna()
        check_normality(group_data, group, period)

# Function to check equality of variances using Levene's test
def check_variance_equality(data1, data2, data3, period):
    stat, p = levene(data1, data2, data3)
    print(f"\nEquality of variances for period {period}:")
    print(f"Statistic={stat:.4f}, p-value={p:.4f}")
    if p > 0.05:
        print("    No evidence of unequal variances (fail to reject H0).")
    else:
        print("    Variances differ significantly at the 5% level.")

# Applying Levene's test for each period
check_variance_equality(
    period_14_28['D1'].dropna(),
    period_14_28['D2'].dropna(),
    period_14_28['D3'].dropna(),
    "14-28 days"
)

check_variance_equality(
    period_28_42['D1'].dropna(),
    period_28_42['D2'].dropna(),
    period_28_42['D3'].dropna(),
    "28-42 days"
)

### Discussion:
Normality is rejected for Group D3 in both periods, so the ANOVA assumptions are not fully satisfied. For the 14-28 day period, regular ANOVA can still be used, but the results should be read with the normality issue in Group D3 in mind.
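
Because the Shapiro-Wilk check flags Group D3, a rank-based test is one way to cross-check an ANOVA conclusion. The sketch below is an illustration only and is not part of the original analysis: it swaps in `scipy.stats.kruskal` (Kruskal-Wallis), which does not assume normality, and runs it on synthetic data. The group sizes, means, spreads, and seed are assumptions chosen purely for the example.

```python
import numpy as np
from scipy.stats import kruskal

# Synthetic stand-ins for the three diet groups (hypothetical values,
# not taken from Gain.xlsx).
rng = np.random.default_rng(0)
d1 = rng.normal(loc=50, scale=5, size=40)
d2 = rng.normal(loc=51, scale=5, size=40)
d3 = rng.lognormal(mean=np.log(50), sigma=0.2, size=40)  # right-skewed group

# Kruskal-Wallis compares the groups on ranks instead of raw values.
h_stat, p_value = kruskal(d1, d2, d3)
print(f"Kruskal-Wallis H={h_stat:.4f}, p-value={p_value:.4f}")
```

With the real data, the three arrays would be the `dropna()`-filtered columns of `period_14_28` or `period_28_42`.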

In [None]:
def estimate_params(data):
    mean, std = norm.fit(data)
    shape, loc, scale = lognorm.fit(data, floc=0)
    return {'normal': (mean, std), 'lognormal': (shape, loc, scale)}

def plot_with_distributions(data, group, period):
    plt.figure(figsize=(10, 6))
    plt.hist(data, bins=15, density=True, alpha=0.6, color='skyblue', edgecolor='black', label='Observed Data')
    params = estimate_params(data)

    xmin, xmax = plt.xlim()
    x = np.linspace(xmin, xmax, 100)

    mean, std = params['normal']
    plt.plot(x, norm.pdf(x, mean, std), 'r-', label=f'Normal: μ={mean:.2f}, σ={std:.2f}')

    shape, loc, scale = params['lognormal']
    plt.plot(x, lognorm.pdf(x, shape, loc, scale), 'g--', label=f'Lognormal: Shape={shape:.2f}, Scale={scale:.2f}')

    plt.title(f"Distribution Fit for Group {group} ({period})")
    plt.xlabel("Observed Values")
    plt.ylabel("Density")
    plt.legend()
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()

for period, data in zip(["14-28 days", "28-42 days"], [period_14_28, period_28_42]):
    for group in ['D1', 'D2', 'D3']:
        group_data = data[group].dropna()
        plot_with_distributions(group_data, group, period)


### Discussion:
The normal distribution fits symmetric data well, as seen in Groups D1 and D2 for both periods, but struggles with skewness in Group D3. The lognormal distribution better captures skewed data (e.g., Group D3 in both periods and D1 for 28-42 days) but can overestimate densities at the tails. Overall, the fit is reasonable but not perfect, with some deviations in the tails and peaks. To improve the fit, consider using goodness-of-fit tests or alternative distributions like gamma or Weibull for skewed data.
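
The suggestion above can be sketched as follows. This is a minimal example on synthetic data, assuming strictly positive observations and a location fixed at 0 (as in `estimate_params`); the sample, the seed, and the choice of the Kolmogorov-Smirnov test are assumptions made for illustration, not the author's method.

```python
import numpy as np
from scipy import stats

# Hypothetical skewed, strictly positive sample standing in for one group.
rng = np.random.default_rng(1)
sample = rng.gamma(shape=4.0, scale=10.0, size=200)

# Candidate distributions, keyed by their scipy.stats name.
candidates = {
    "lognorm": stats.lognorm,
    "gamma": stats.gamma,
    "weibull_min": stats.weibull_min,
}

fit_summary = {}
for name, dist in candidates.items():
    params = dist.fit(sample, floc=0)  # (shape, loc, scale) with loc fixed at 0
    ks_stat, ks_p = stats.kstest(sample, name, args=params)
    fit_summary[name] = {"params": params, "ks_stat": ks_stat, "ks_p": ks_p}
    print(f"{name:12s} KS statistic={ks_stat:.4f}, p-value={ks_p:.4f}")
```

A smaller KS statistic indicates a closer fit. Because the parameters are estimated from the same sample, the KS p-values are optimistic and are best used to rank candidates rather than as formal tests.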

In [None]:
def bootstrap(data, n_bootstrap=1000):
    means = [np.mean(np.random.choice(data, size=len(data), replace=True)) for _ in range(n_bootstrap)]
    medians = [np.median(np.random.choice(data, size=len(data), replace=True)) for _ in range(n_bootstrap)]
    return {'means': means, 'medians': medians}
bootstrap_results = {}

for period, data in zip(["14-28 days", "28-42 days"], [period_14_28, period_28_42]):
    bootstrap_results[period] = {}
    for group in ['D1', 'D2', 'D3']:
        group_data = data[group].dropna()
        bootstrap_results[period][group] = bootstrap(group_data)


print(bootstrap_results["14-28 days"]["D1"]["means"][:5])

In [None]:
def confidence_intervals(bootstrap_data, confidence_level=95):
    lower_percentile = (100 - confidence_level) / 2
    upper_percentile = 100 - lower_percentile
    mean_ci = np.percentile(bootstrap_data['means'], [lower_percentile, upper_percentile])
    median_ci = np.percentile(bootstrap_data['medians'], [lower_percentile, upper_percentile])
    return {'mean_ci': mean_ci, 'median_ci': median_ci}
ci_results = []

for period, groups in bootstrap_results.items():
    for group, stats in groups.items():
        ci = confidence_intervals(stats)
        ci_results.append({
            'Period': period,
            'Group': group,
            'Mean Lower CI': ci['mean_ci'][0],
            'Mean Upper CI': ci['mean_ci'][1],
            'Median Lower CI': ci['median_ci'][0],
            'Median Upper CI': ci['median_ci'][1]
        })


ci_results_df = pd.DataFrame(ci_results)
print(ci_results_df)

In [None]:
anova_results = []

for period, data in zip(["14-28 days", "28-42 days"], [period_14_28, period_28_42]):
    f_stat, p_value = f_oneway(data['D1'].dropna(), data['D2'].dropna(), data['D3'].dropna())
    anova_results.append({
        'Period': period,
        'F-statistic': f_stat,
        'p-value': p_value
    })


anova_results_df = pd.DataFrame(anova_results)
print(anova_results_df)

## Discussion
14-28 days:

- **F-statistic** (1.829): the between-group mean square is less than twice the within-group mean square, so the separation between groups D1, D2, and D3 is weak.
- **p-value** (0.165) is above the 0.05 significance level, so we fail to reject H0: there is no statistically significant difference in mean weight gain between groups D1, D2, and D3. The data do not support the conclusion that the diets affect weight gain differently in this period; the observed differences are consistent with random variation.

28-42 days:

- **F-statistic** (0.290) is below 1: the between-group mean square is smaller than the within-group mean square.
- **p-value** (0.749) is far above the 0.05 significance level, so we again fail to reject H0: there is no statistically significant difference in weight gain between groups D1, D2, and D3 in this period. The observed differences are consistent with chance, and the diets do not show a distinct effect on weight gain during this time.
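
To make the F-statistic and p-value above concrete, the following sketch computes a one-way ANOVA by hand and compares it with `scipy.stats.f_oneway`. The three groups are synthetic (their means, spread, size, and seed are assumptions for the example), so the numbers will not match the values reported for `Gain.xlsx`.

```python
import numpy as np
from scipy.stats import f, f_oneway

# Three hypothetical diet groups with slightly different means.
rng = np.random.default_rng(2)
groups = [rng.normal(loc=m, scale=5, size=30) for m in (50, 51, 52)]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares.
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within
p_manual = f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = f_oneway(*groups)
print(f"manual: F={f_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  F={f_scipy:.4f}, p={p_scipy:.4f}")
```

An F-statistic near or below 1 means the group means vary no more than would be expected from the spread inside each group.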


In [None]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import pandas as pd

def prepare_long_format(data, period):
    long_data = data.melt(var_name='Group', value_name='Weight', ignore_index=False).dropna()
    long_data['Period'] = period
    return long_data

long_data_14_28 = prepare_long_format(period_14_28, "14-28 days")
long_data_28_42 = prepare_long_format(period_28_42, "28-42 days")


def tukey_hsd_test(data, period):
    print(f"\nTukey's HSD Test for {period}:")
    tukey_result = pairwise_tukeyhsd(endog=data['Weight'], groups=data['Group'], alpha=0.05)
    print(tukey_result)
    return tukey_result

tukey_hsd_test(long_data_14_28, "14-28 days")
tukey_hsd_test(long_data_28_42, "28-42 days")


In [None]:
from statsmodels.stats.multicomp import MultiComparison
import statsmodels.api as sm

def bonferroni_test(data, period):
    print(f"\nBonferroni Test for {period}:")
    mc = MultiComparison(data['Weight'], data['Group'])
    result = mc.allpairtest(sm.stats.ttest_ind, method='b', alpha=0.05)
    print(result[0])
    return result

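def scheffe_test(data, period):
    # Note: despite the label, this runs pooled-variance t-tests with a
    # Bonferroni correction (method="b"); it is not a true Scheffé procedure.
    print(f"\nScheffé Test for {period}:")
    mc = MultiComparison(data['Weight'], data['Group'])
    result = mc.allpairtest(lambda x, y: sm.stats.ttest_ind(x, y, usevar='pooled'), method="b")
    print(result[0])
    return result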

long_data_14_28 = prepare_long_format(period_14_28, "14-28 days")
long_data_28_42 = prepare_long_format(period_28_42, "28-42 days")

bonferroni_test(long_data_14_28, "14-28 days")
bonferroni_test(long_data_28_42, "28-42 days")

scheffe_test(long_data_14_28, "14-28 days")
scheffe_test(long_data_28_42, "28-42 days")

**Test Results:**
- **Tukey Test:**
The results of the Tukey test showed that, for both periods (14-28 days and 28-42 days), there are no statistically significant differences in weight gain between diets D1, D2, and D3. All group pairs have p-values above the significance level of 0.05, confirming the absence of significant differences.
- **Scheffé Test:**
The comparison labelled Scheffé (implemented in the code as pooled-variance t-tests with a Bonferroni adjustment) likewise shows no significant differences in weight gain between groups D1, D2, and D3 in either period. All p-values remain high, and the reject indicator is False for every pair after accounting for multiple comparisons.
- **Bonferroni Test:**
The Bonferroni test results align with those of the Tukey and Scheffé tests. The adjusted p-values exceed the significance level of 0.05, and none of the group pairs demonstrated statistically significant differences in weight gain.
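
The `scheffe_test` function in this notebook relies on Bonferroni-corrected t-tests. For comparison, a pairwise Scheffé criterion can be sketched directly with NumPy and SciPy. The helper `scheffe_pairwise` and the example groups below are hypothetical and are not part of the original analysis; the data are synthetic, with one group deliberately shifted so that some pairs differ.

```python
from itertools import combinations

import numpy as np
from scipy.stats import f


def scheffe_pairwise(groups, alpha=0.05):
    """Pairwise Scheffé comparisons for a dict of name -> 1-D NumPy array."""
    k = len(groups)
    n_total = sum(len(values) for values in groups.values())
    df_within = n_total - k
    # Pooled within-group variance (mean square error from one-way ANOVA).
    ss_within = sum(((values - values.mean()) ** 2).sum() for values in groups.values())
    mse = ss_within / df_within

    rows = []
    for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
        diff = a.mean() - b.mean()
        # Contrast F divided by (k - 1), referred to F(k - 1, N - k).
        f_scheffe = diff ** 2 / (mse * (1 / len(a) + 1 / len(b)) * (k - 1))
        p_value = f.sf(f_scheffe, k - 1, df_within)
        rows.append({"pair": (name_a, name_b), "diff": diff,
                     "p_value": p_value, "reject": bool(p_value < alpha)})
    return rows


# Synthetic example: D3 is shifted upward on purpose.
rng = np.random.default_rng(3)
example_groups = {
    "D1": rng.normal(50, 5, 30),
    "D2": rng.normal(50, 5, 30),
    "D3": rng.normal(80, 5, 30),
}
scheffe_rows = scheffe_pairwise(example_groups)
for row in scheffe_rows:
    print(row)
```

Scheffé's method protects all possible contrasts, not only pairwise ones, so it is typically more conservative than Tukey's HSD for pairwise comparisons.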

# Two-Way ANOVA

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

df_long = pd.concat([period_14_28, period_28_42], keys=['14-28 days', '28-42 days'], names=['Period'])
df_long = df_long.reset_index().melt(id_vars=['Period'], value_vars=['D1', 'D2', 'D3'], var_name='Diet', value_name='Weight').dropna()

model = ols('Weight ~ C(Diet) * C(Period)', data=df_long).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print("Two-Way ANOVA Results:")
print(anova_table)

In [None]:
import seaborn as sns
df_long['Period'] = df_long['Period'].astype('category')
df_long['Diet'] = df_long['Diet'].astype('category')

interaction_means = df_long.groupby(['Period', 'Diet'])['Weight'].mean().reset_index()

plt.figure(figsize=(10, 6))
sns.lineplot(
    data=interaction_means,
    x='Period',
    y='Weight',
    hue='Diet',
    marker='o',
    palette='Set2'
)
plt.title("Interaction Plot: Diet vs Period")
plt.xlabel("Period")
plt.ylabel("Mean Weight")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.legend(title="Diet")
plt.tight_layout()
plt.show()


In [None]:
print("Statistical Significance of Factors:")
print(anova_table[['F', 'PR(>F)']])

In [None]:
import seaborn as sns
pivot_table = df_long.pivot_table(index='Diet', columns='Period', values='Weight', aggfunc='mean')


plt.figure(figsize=(8, 6))
sns.heatmap(pivot_table, annot=True, cmap='coolwarm', cbar=True)
plt.title("Heatmap: Average Weight by Diet and Period")
plt.xlabel("Period")
plt.ylabel("Diet")
plt.show()

In [None]:
r_squared = model.rsquared
print(f"Model Quality (R-squared): {r_squared:.4f}")

### Discussion:
- **Significant Factor:**
The period has a strong and statistically significant effect on weight gain.
- **Non-Significant Factors:**
Neither diet nor the interaction between diet and period significantly impacts weight gain.
- **Model Performance:**
The low R² indicates that diet and period explain only a small share of the variance in weight gain, which suggests investigating additional factors or improving the model.
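
For a model with both main effects and their interaction, such as `Weight ~ C(Diet) * C(Period)`, the fitted value of each observation is its cell mean, so R² can be reproduced from sums of squares. The sketch below shows this on synthetic data; the cell means, noise level, cell size, and seed are assumptions for illustration.

```python
import numpy as np

# Hypothetical weights for each (diet, period) cell; only the period shifts the mean.
rng = np.random.default_rng(4)
cells = {
    (diet, period): rng.normal(loc=base, scale=8, size=20)
    for period, base in (("14-28 days", 40), ("28-42 days", 60))
    for diet in ("D1", "D2", "D3")
}

all_values = np.concatenate(list(cells.values()))

# Total variation around the grand mean.
ss_total = ((all_values - all_values.mean()) ** 2).sum()
# Residual variation around each cell mean (the fitted values of the full model).
ss_resid = sum(((v - v.mean()) ** 2).sum() for v in cells.values())

r_squared = 1 - ss_resid / ss_total
print(f"R-squared from cell means: {r_squared:.4f}")
```

A low value means most of the variation lies within the diet-period cells rather than between them.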


In [None]:
interaction_p_value = anova_table.loc["C(Diet):C(Period)", "PR(>F)"]
interaction_significant = interaction_p_value < 0.05

if interaction_significant:
    print(f"The interaction between Diet and Period is significant (p = {interaction_p_value:.4f}).")
else:
    print(f"The interaction between Diet and Period is not significant (p = {interaction_p_value:.4f}).")


In [None]:
diet_stats = df_long.groupby('Diet')['Weight'].agg(['mean', 'sem']).reset_index()


plt.figure(figsize=(8, 6))
sns.barplot(data=diet_stats, x='Diet', y='mean', ci=None, palette='pastel', edgecolor='black')
plt.errorbar(
    x=diet_stats['Diet'], y=diet_stats['mean'], yerr=diet_stats['sem'], fmt='none', c='black', capsize=5
)
plt.title("Impact of Diet on Weight Gain")
plt.xlabel("Diet")
plt.ylabel("Mean Weight (with SEM)")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


print("Impact of Diet:")
print(diet_stats)

In [None]:
period_stats = df_long.groupby('Period')['Weight'].agg(['mean', 'sem']).reset_index()

plt.figure(figsize=(8, 6))
sns.barplot(data=period_stats, x='Period', y='mean', ci=None, palette='pastel', edgecolor='black')
plt.errorbar(
    x=period_stats['Period'], y=period_stats['mean'], yerr=period_stats['sem'], fmt='none', c='black', capsize=5
)
plt.title("Impact of Period on Weight Gain")
plt.xlabel("Period")
plt.ylabel("Mean Weight (with SEM)")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

print("Impact of Period:")
print(period_stats)


# Discussion
- **Diet:**
While D1 appears to lead to slightly higher weight gain, the differences between diets are minor and likely not statistically significant.
- **Period:**
The period has a strong effect on weight gain, with the second period (28-42 days) yielding significantly higher weights compared to the first period.

In [None]:
from scipy.stats import levene
levene_test = levene(
    df_long[df_long['Diet'] == 'D1']['Weight'],
    df_long[df_long['Diet'] == 'D2']['Weight'],
    df_long[df_long['Diet'] == 'D3']['Weight']
)
print(f"Levene's test p-value: {levene_test.pvalue}")


In [None]:
from scipy.stats import shapiro
for diet in ['D1', 'D2', 'D3']:
    for period in ['14-28 days', '28-42 days']:
        subset = df_long[(df_long['Diet'] == diet) & (df_long['Period'] == period)]['Weight']
        stat, p = shapiro(subset)
        print(f"Shapiro-Wilk test for {diet}, {period}: p-value = {p}")


In [None]:
from statsmodels.stats.multicomp import MultiComparison

df_long['Diet_Period'] = df_long['Diet'].astype(str) + "_" + df_long['Period'].astype(str)

mc = MultiComparison(df_long['Weight'], df_long['Diet_Period'])
bonferroni_results = mc.allpairtest(sm.stats.ttest_ind, method='b', alpha=0.05)

print("Bonferroni Test Results:")
print(bonferroni_results[0])


In [None]:
from statsmodels.stats.multicomp import MultiComparison
import statsmodels.api as sm

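def scheffe_test(data, period):
    # Note: this applies Bonferroni-corrected t-tests (method="b") to the
    # Diet_Period groups; it is not a true Scheffé procedure.
    print(f"\nScheffé Test for {period}:")
    mc = MultiComparison(data['Weight'], data['Diet_Period'])
    result = mc.allpairtest(sm.stats.ttest_ind, method="b")
    print(result[0])
    return result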
df_long['Diet_Period'] = df_long['Diet'].astype(str) + "_" + df_long['Period'].astype(str)

scheffe_results_all = scheffe_test(df_long, "All Periods")

long_data_14_28 = df_long[df_long['Period'] == "14-28 days"]
long_data_28_42 = df_long[df_long['Period'] == "28-42 days"]

scheffe_results_14_28 = scheffe_test(long_data_14_28, "14-28 days")
scheffe_results_28_42 = scheffe_test(long_data_28_42, "28-42 days")


In [None]:
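# Diet and Period are categorical at this point, so cast to str before concatenating.
df_long['Diet_Period'] = df_long['Diet'].astype(str) + "_" + df_long['Period'].astype(str)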

tukey_results = pairwise_tukeyhsd(
    endog=df_long['Weight'],
    groups=df_long['Diet_Period'],
    alpha=0.05
)

print("Tukey's HSD Results for Diet and Period combinations:")
print(tukey_results)

tukey_results.plot_simultaneous(figsize=(10, 6))
plt.title("Tukey's HSD Results for Diet and Period")
plt.show()

In [None]:
for diet in ['D1', 'D2', 'D3']:
    for period in ['14-28 days', '28-42 days']:
        subset = df_long[(df_long['Diet'] == diet) & (df_long['Period'] == period)]
        plt.figure(figsize=(8, 6))
        sns.histplot(subset['Weight'], kde=True, bins=10, color='skyblue', edgecolor='black')
        plt.title(f"Histogram of Weight: {diet} ({period})")
        plt.xlabel("Weight")
        plt.ylabel("Frequency")
        plt.grid(axis='y', linestyle='--', alpha=0.7)
        plt.show()

---

In [None]:
def calculate_confidence_intervals(data, confidence_level=95):
    means = []
    for _ in range(1000):
        sample = np.random.choice(data, size=len(data), replace=True)
        means.append(np.mean(sample))
    lower = np.percentile(means, (100 - confidence_level) / 2)
    upper = np.percentile(means, 100 - (100 - confidence_level) / 2)
    return lower, upper

ci_results = []

for diet in ['D1', 'D2', 'D3']:
    for period in ['14-28 days', '28-42 days']:
        subset = df_long[(df_long['Diet'] == diet) & (df_long['Period'] == period)]['Weight']
        lower, upper = calculate_confidence_intervals(subset)
        ci_results.append({
            'Diet': diet,
            'Period': period,
            'Mean': np.mean(subset),
            'Lower CI': lower,
            'Upper CI': upper
        })

ci_df = pd.DataFrame(ci_results)
print("Confidence Intervals for Diet and Period combinations:")
print(ci_df)

1. **Levene's Test (Homogeneity of Variances):**
- Result: p-value = 0.022.
- The variances differ significantly across diets (p-value < 0.05). ANOVA tolerates moderate heterogeneity, particularly when group sizes are similar, so the results are retained, but they should be read with this caveat.
2. **Shapiro-Wilk Test (Normality):**
- D1 and D2: Normally distributed in both periods (p-values > 0.05).
- D3: Deviates from normality in both periods (p-values < 0.05).
- Most groups meet the normality assumption, which supports the analysis; the deviations are confined to group D3.
3. **Bonferroni Test:**
- Significant differences were observed between periods for several diet-period combinations, such as D1_14-28 days vs. D2_28-42 days and D2_14-28 days vs. D2_28-42 days.
- Within the same period, diets show no significant differences, indicating consistency in their impact on weight gain.
- The results suggest a strong relationship between periods and weight gain, with diets showing comparable effects within each period.
4. **Scheffé Test:**
- All Periods: Confirms significant differences between several diet-period combinations, particularly across periods, supporting the trends observed in Bonferroni and Tukey tests.
- Within Periods: No significant differences between diets, emphasizing their similar influence on weight gain.
- Scheffé highlights consistent effects of diets within each period and significant changes in weight gain across periods.
5. **Tukey’s HSD Test:**
- Clear differences between periods for combinations like D1_14-28 days vs. D1_28-42 days and D2_14-28 days vs. D2_28-42 days.
- Confidence intervals visually confirm significant pairwise differences while showing overlaps within the same period, reflecting consistency in diet effects.
- Tukey emphasizes the strong effect of periods on weight gain while showing that diets perform similarly within each period.

---


**Histograms for Diet and Period:**

- The histograms for 14-28 days show lower weight distributions compared to 28-42 days, confirming the significant effect of the period.
- Diet D3 exhibits a more skewed distribution compared to D1 and D2, indicating variation in weight gain.
- The shapes of the distributions suggest some diets are more consistent (e.g., D1 and D2) than others (e.g., D3).


---


**Confidence Intervals:**

- Confidence intervals for the same diet across periods (e.g., D1 and D3) show non-overlapping ranges, indicating statistically significant differences between periods.
- Confidence intervals for different diets within the same period overlap in most cases, indicating no significant differences between diets.
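
Comparing two separate intervals for overlap is a rough check: non-overlapping 95% intervals do indicate a difference, but overlapping intervals do not by themselves rule one out. A more direct approach is to bootstrap the difference in means. The sketch below assumes synthetic data; the helper `bootstrap_mean_diff_ci`, its defaults, and the example values are hypothetical.

```python
import numpy as np


def bootstrap_mean_diff_ci(a, b, n_bootstrap=2000, confidence_level=95, seed=0):
    """Percentile bootstrap CI for mean(b) - mean(a), resampling each group independently."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diffs = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        resample_a = rng.choice(a, size=len(a), replace=True)
        resample_b = rng.choice(b, size=len(b), replace=True)
        diffs[i] = resample_b.mean() - resample_a.mean()
    tail = (100 - confidence_level) / 2
    lower, upper = np.percentile(diffs, [tail, 100 - tail])
    return lower, upper


# Hypothetical weights for one diet in the two periods.
rng = np.random.default_rng(5)
early = rng.normal(40, 8, 40)
late = rng.normal(60, 8, 40)

lower, upper = bootstrap_mean_diff_ci(early, late)
print(f"95% CI for mean(late) - mean(early): [{lower:.2f}, {upper:.2f}]")
```

If the interval for the difference excludes 0, the two group means differ at roughly the chosen confidence level.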

In [None]:
one_way_results_14_28 = f_oneway(
    df_long[df_long['Diet_Period'] == 'D1_14-28 days']['Weight'],
    df_long[df_long['Diet_Period'] == 'D2_14-28 days']['Weight'],
    df_long[df_long['Diet_Period'] == 'D3_14-28 days']['Weight']
)

one_way_results_28_42 = f_oneway(
    df_long[df_long['Diet_Period'] == 'D1_28-42 days']['Weight'],
    df_long[df_long['Diet_Period'] == 'D2_28-42 days']['Weight'],
    df_long[df_long['Diet_Period'] == 'D3_28-42 days']['Weight']
)

print("One-Way ANOVA Results for 14-28 days:")
print(f"F-statistic: {one_way_results_14_28.statistic}, p-value: {one_way_results_14_28.pvalue}")

print("One-Way ANOVA Results for 28-42 days:")
print(f"F-statistic: {one_way_results_28_42.statistic}, p-value: {one_way_results_28_42.pvalue}")

### Discussion
### Comparison of Results:
Two-Way ANOVA:
- Period had a statistically significant effect (p<0.05).
- Diet and the interaction between Diet and Period were not significant.


---


One-Way ANOVA:
- 14-28 days: Diet differences may not be significant since Tukey's HSD showed overlapping confidence intervals.
- 28-42 days: Similar conclusion, with overlapping confidence intervals.


---



**Statistical Analysis**
- Period: Significant effect (p<0.05) in Two-Way ANOVA and confirmed by Tukey's HSD. Weight gain was consistently higher in the second period (28-42 days) across all diets.
- Diet: No significant effect within individual periods based on One-Way ANOVA and confidence intervals.
- Interactions:
The interaction between Diet and Period was not significant, indicating that the effect of Diet on weight gain is consistent across periods.
- Confidence Intervals:
Significant differences were observed between periods for the same diet, but differences between diets within a period were not statistically significant.
- Tukey's Test:
Highlighted significant pairwise differences between diet-period combinations, particularly between periods.


---


**Visualizations**

Histograms:
- Clear shift in weight distributions between 14-28 days and 28-42 days, confirming the impact of the period.
- Diet D3 shows greater variability and skewness in weight gain compared to D1 and D2.

Tukey Plot:
- Visualizes significant pairwise differences, especially between diets across periods.

# Multi-Factor Analysis of Variance


In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
consum = pd.read_excel(io.BytesIO(uploaded['Consum.xlsx']))
consum

In [None]:
consum.columns = [
    'D1_14-28', 'D2_14-28', 'D3_14-28',
    'D1_28-42', 'D2_28-42', 'D3_28-42'
]

In [None]:
consum_long = consum.melt(var_name="Diet_Period", value_name="Consumption")
consum_long[['Diet', 'Period']] = consum_long['Diet_Period'].str.extract(r'(D\d)_(\d+-\d+)')
consum_long['Diet'] = consum_long['Diet'].astype('category')
consum_long['Period'] = consum_long['Period'].astype('category')
missing_values = consum_long.isnull().sum()
summary_table = consum_long.groupby(['Period', 'Diet']).describe()['Consumption']

In [None]:
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

consum_long['Consumption'] = pd.to_numeric(consum_long['Consumption'], errors='coerce')
consum_long = consum_long.dropna(subset=['Consumption'])

model_fixed = ols('Consumption ~ C(Diet) * C(Period)', data=consum_long).fit()
anova_results = anova_lm(model_fixed, typ=2)

anova_results

In [None]:
import seaborn as sns

plt.figure(figsize=(8, 6))
sns.pointplot(data=consum_long, x='Period', y='Consumption', hue='Diet', markers=['o', 's', '^'], linestyles=['-', '--', '-.'])
plt.title("Interaction Plot: Consumption by Diet and Period")
plt.ylabel("Mean Consumption")
plt.xlabel("Period")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.legend(title='Diet')
plt.tight_layout()
plt.show()

---

In [None]:
heatmap_data = consum_long.groupby(['Period', 'Diet'])['Consumption'].mean().unstack()

plt.figure(figsize=(8, 6))
sns.heatmap(heatmap_data, annot=True, fmt=".2f", cmap='coolwarm', cbar=True)
plt.title("Heatmap: Average Consumption by Diet and Period")
plt.xlabel("Diet")
plt.ylabel("Period")
plt.tight_layout()
plt.show()

missing_values, summary_table, anova_results

### Discussion:
1. **Effect of Diet**
- F-statistic: 0.8676
- p-value: 0.4239
- Since the p-value is greater than 0.05, diet has no statistically significant effect on consumption.
2. **Effect of Period**
- F-statistic: 209.6098
- p-value: 8.31e-24
- The p-value is far below 0.05, indicating that the period has a highly significant effect on consumption. This means that the average consumption changes significantly between the two periods.
3. **Interaction between Diet and Period**
- F-statistic: 0.1985
- p-value: 0.8204
- The p-value is much greater than 0.05, indicating that there is no statistically significant interaction between diet and period. This means that the effect of diet on consumption does not depend on the period.
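
The three F-tests above come from splitting the total variation into diet, period, interaction, and residual parts. The sketch below shows that decomposition for a balanced design using only NumPy and SciPy. The data are synthetic (a period effect is built in, diet has none), and the cell size, effect size, and seed are assumptions, so the output will not reproduce the figures reported for `Consum.xlsx`.

```python
import numpy as np
from scipy.stats import f

# y[i, j, r]: diet i (3 levels), period j (2 levels), replicate r.
rng = np.random.default_rng(6)
n_rep = 15
period_effect = np.array([0.0, 50.0])  # hypothetical shift in the second period
y = 100 + period_effect[None, :, None] + rng.normal(0, 10, size=(3, 2, n_rep))

a, b, n = y.shape
grand = y.mean()
mean_diet = y.mean(axis=(1, 2))
mean_period = y.mean(axis=(0, 2))
mean_cell = y.mean(axis=2)

# Sums of squares for a balanced two-way layout.
ss_diet = b * n * ((mean_diet - grand) ** 2).sum()
ss_period = a * n * ((mean_period - grand) ** 2).sum()
ss_inter = n * ((mean_cell - mean_diet[:, None] - mean_period[None, :] + grand) ** 2).sum()
ss_error = ((y - mean_cell[:, :, None]) ** 2).sum()
ss_total = ((y - grand) ** 2).sum()

df_diet, df_period = a - 1, b - 1
df_inter = df_diet * df_period
df_error = a * b * (n - 1)
ms_error = ss_error / df_error

f_diet = (ss_diet / df_diet) / ms_error
f_period = (ss_period / df_period) / ms_error
f_inter = (ss_inter / df_inter) / ms_error

p_diet = f.sf(f_diet, df_diet, df_error)
p_period = f.sf(f_period, df_period, df_error)
p_inter = f.sf(f_inter, df_inter, df_error)

print(f"Diet:        F={f_diet:.4f}, p={p_diet:.4g}")
print(f"Period:      F={f_period:.4f}, p={p_period:.4g}")
print(f"Interaction: F={f_inter:.4f}, p={p_inter:.4g}")
```

In a balanced design the four sums of squares add up exactly to the total, which is why the three effects can be tested separately.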

In [None]:
import scipy.stats as stats
import numpy as np

# Shapiro-Wilk normality test for each diet-period cell
normality_results = {}
for period in consum_long['Period'].unique():
    for diet in consum_long['Diet'].unique():
        group_data = consum_long[(consum_long['Period'] == period) & (consum_long['Diet'] == diet)]['Consumption']
        stat, p_value = stats.shapiro(group_data)
        normality_results[f'{diet}_{period}'] = {'Shapiro-Wilk Stat': stat, 'p-value': p_value}

# Levene's test for equal variances across all three diets
levene_stat, levene_p = stats.levene(
    consum_long[consum_long['Diet'] == 'D1']['Consumption'],
    consum_long[consum_long['Diet'] == 'D2']['Consumption'],
    consum_long[consum_long['Diet'] == 'D3']['Consumption']
)
print('Levene Stat:', levene_stat, 'p-value:', levene_p, '\n', 'normality_results', normality_results)


In [None]:
group_means = consum_long.groupby(['Diet', 'Period'])['Consumption'].mean()
group_sems = consum_long.groupby(['Diet', 'Period'])['Consumption'].sem()
# Named consumption_ci so it does not shadow the confidence_intervals() function defined earlier.
consumption_ci = {}
for index, mean in group_means.items():
    sem = group_sems[index]
    # 1.96 is the normal-approximation critical value for a 95% interval.
    ci_lower = mean - 1.96 * sem
    ci_upper = mean + 1.96 * sem
    consumption_ci[index] = {'Mean': mean, 'Lower CI': ci_lower, 'Upper CI': ci_upper}

consumption_ci

In [None]:
for period in consum_long['Period'].unique():
    for diet in consum_long['Diet'].unique():
        group_data = consum_long[(consum_long['Period'] == period) & (consum_long['Diet'] == diet)]['Consumption']
        plt.figure(figsize=(8, 6))
        plt.hist(group_data, bins=10, alpha=0.7, edgecolor='black', density=True)
        sns.kdeplot(group_data, color='blue', lw=2)
        plt.title(f'Distribution of Consumption: {diet}, {period}')
        plt.xlabel('Consumption')
        plt.ylabel('Density')
        plt.grid(axis='y', linestyle='--', alpha=0.7)
        plt.show()


In [None]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import matplotlib.pyplot as plt

tukey_results = pairwise_tukeyhsd(
    endog=consum_long['Consumption'],
    groups=consum_long['Diet_Period'],
    alpha=0.05
)

print("Tukey's HSD Results:")
print(tukey_results)

tukey_results.plot_simultaneous(figsize=(10, 6))
plt.title("Tukey's HSD Results for Diet and Period Combinations")
plt.show()


In [None]:
from statsmodels.stats.multicomp import MultiComparison

mc = MultiComparison(consum_long['Consumption'], consum_long['Diet_Period'])
bonferroni_results = mc.allpairtest(sm.stats.ttest_ind, method='b', alpha=0.05)

print("Bonferroni Test Results:")
print(bonferroni_results[0])

# Discussion:
**Tukey's HSD Results:**
- Significant differences in consumption were observed between the periods (14-28 days vs. 28-42 days) for all diet combinations (D1, D2, D3) (p-value < 0.05). This confirms a substantial increase in consumption during the second period (28-42 days). For example, the mean difference between D1_14-28 and D1_28-42 is 54.05, which is statistically significant.
- Within the same period (e.g., D1_14-28 vs. D2_14-28, D2_28-42 vs. D3_28-42), differences in consumption are not statistically significant (p-value > 0.05).
This indicates that the diets themselves do not have a significant effect on consumption within the same period.
- The Tukey's HSD plot shows overlapping confidence intervals for diets within the same period, confirming the lack of significant differences between them.
However, the intervals between the two periods (14-28 days and 28-42 days) are notably different, highlighting the significant impact of time.

**Bonferroni Results:**
- Bonferroni results confirm the Tukey findings: differences between the 14-28 day and 28-42 day periods are statistically significant for all diets (p-value < 0.003, adjusted for Bonferroni correction). For instance, the t-statistic for D1_14-28 vs. D1_28-42 is -8.49, indicating a substantial increase in consumption.
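
The "p-value < 0.003" figure appears to correspond to the per-comparison threshold for six diet-period groups, although the notebook does not state this explicitly. The arithmetic can be sketched as follows; the raw p-values at the end are hypothetical and serve only to show the adjustment.

```python
from itertools import combinations

group_labels = [
    "D1_14-28", "D2_14-28", "D3_14-28",
    "D1_28-42", "D2_28-42", "D3_28-42",
]

# Every unordered pair of groups is one comparison.
pairs = list(combinations(group_labels, 2))
n_comparisons = len(pairs)

alpha = 0.05
per_test_alpha = alpha / n_comparisons  # threshold applied to each raw p-value
print(f"{n_comparisons} comparisons, per-test threshold = {per_test_alpha:.4f}")


def bonferroni_adjust(p_values, n_tests):
    """Multiply each raw p-value by the number of tests, capped at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


raw_p = [0.0001, 0.002, 0.04]  # hypothetical raw p-values
adjusted_p = bonferroni_adjust(raw_p, n_comparisons)
print(adjusted_p)
```

Comparing a raw p-value with `alpha / n_comparisons` is equivalent to comparing the adjusted p-value with `alpha`.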

**Confidence Intervals:**
For all diets (D1, D2, and D3), the confidence intervals for 14-28 days and 28-42 days do not overlap. This indicates that there is a statistically significant difference in consumption between the two periods for each diet. Consumption is higher in the 28-42 day period across all diets.

- 14-28 Days:
The confidence intervals for D1, D2, and D3 overlap significantly. This suggests that there is no statistically significant difference in consumption between diets in this period.
- 28-42 Days:
Similarly, the confidence intervals for D1, D2, and D3 also overlap in this period. This implies no significant differences in consumption between diets during the 28-42 day period.
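
The intervals discussed here are computed in the notebook as mean ± 1.96 × SEM, a normal approximation. For small groups, a Student-t critical value gives slightly wider intervals. The sketch below compares the two on a synthetic sample; the helper `mean_ci`, the sample size, and the values are assumptions for illustration.

```python
import numpy as np
from scipy import stats


def mean_ci(values, confidence=0.95, use_t=True):
    """Confidence interval for the mean using a t or normal critical value."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(n)  # same definition as pandas .sem()
    if use_t:
        crit = stats.t.ppf(0.5 + confidence / 2, df=n - 1)
    else:
        crit = stats.norm.ppf(0.5 + confidence / 2)
    return mean - crit * sem, mean + crit * sem


# Hypothetical consumption values for one diet-period group.
rng = np.random.default_rng(7)
sample = rng.normal(loc=120, scale=15, size=14)

t_low, t_high = mean_ci(sample, use_t=True)
z_low, z_high = mean_ci(sample, use_t=False)
print(f"t-based 95% CI:      [{t_low:.2f}, {t_high:.2f}]")
print(f"normal-based 95% CI: [{z_low:.2f}, {z_high:.2f}]")
```

The gap between the two intervals shrinks as the group size grows, so the choice matters mostly for small groups.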

**Histograms:**
- *Period Differences:* Consumption increases significantly from 14-28 days to 28-42 days for all groups, as evidenced by the shift in distributions toward higher values.
- *Diet Differences:* Diet D1 consistently shows higher consumption compared to D2 and D3, particularly during the 28-42 day period. Diet D3 exhibits the lowest and most consistent consumption across both periods.
- *Variability:* Groups in the 14-28 day period show more variability compared to the 28-42 day period, where distributions are more concentrated around the means.

# MANOVA

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
manova = pd.read_excel(io.BytesIO(uploaded['Manova.xlsx']))
manova

In [None]:
weight_gain = manova.loc[0:5, 1:42]
weight_gain

In [None]:
consumption = manova.loc[0:5, 42:]
consumption = consumption.drop(columns=42)
consumption

In [None]:
weight_gain.index = ['D1_14-28', 'D2_14-28', 'D3_14-28', 'D1_28-42', 'D2_28-42', 'D3_28-42']
weight_gain.columns = [f'Chicken_{i+1}' for i in range(weight_gain.shape[1])]

consumption.index = ['D1_14-28', 'D2_14-28', 'D3_14-28', 'D1_28-42', 'D2_28-42', 'D3_28-42']
consumption.columns = [f'Day_{i+1}' for i in range(consumption.shape[1])]

weight_gain

In [None]:
from scipy.stats import levene, bartlett

groups = {
    'D1_14-28': weight_gain.iloc[0],
    'D1_28-42': weight_gain.iloc[3],
    'D2_14-28': weight_gain.iloc[1],
    'D2_28-42': weight_gain.iloc[4],
    'D3_14-28': weight_gain.iloc[2],
    'D3_28-42': weight_gain.iloc[5],
    'C1_14-28': consumption.iloc[0],
    'C1_28-42': consumption.iloc[3],
    'C2_14-28': consumption.iloc[1],
    'C2_28-42': consumption.iloc[4],
    'C3_14-28': consumption.iloc[2],
    'C3_28-42': consumption.iloc[5],
}

results = {}
for group_1, data_1 in groups.items():
    for group_2, data_2 in groups.items():
        if group_1 < group_2:
            levene_result = levene(data_1, data_2)
            bartlett_result = bartlett(data_1, data_2)
            results[(group_1, group_2)] = {
                'Levene_statistic': levene_result.statistic,
                'Levene_pvalue': levene_result.pvalue,
                'Bartlett_statistic': bartlett_result.statistic,
                'Bartlett_pvalue': bartlett_result.pvalue,
            }

import pandas as pd
comparison_results = pd.DataFrame(results).T
comparison_results.columns = ['Levene_statistic', 'Levene_pvalue', 'Bartlett_statistic', 'Bartlett_pvalue']
comparison_results = comparison_results.sort_values(by='Levene_pvalue', ascending=False)


print(comparison_results)

---

In [None]:
import seaborn as sns  # needed here: this is the first cell that calls sns

plt.figure(figsize=(12, 6))
sns.barplot(
    x=comparison_results.index.map(lambda x: ' vs '.join(x)),
    y='Levene_pvalue',
    data=comparison_results.reset_index(),
    palette='coolwarm',
)
plt.axhline(y=0.05, color='red', linestyle='--', label='Significance Threshold (p=0.05)')
plt.xticks(rotation=90, fontsize=10)
plt.xlabel('Group Comparisons', fontsize=12)
plt.ylabel('Levene p-value', fontsize=12)
plt.title('Levene Test for Variance Homogeneity Across Groups', fontsize=14)
plt.legend()
plt.tight_layout()
plt.show()


plt.figure(figsize=(12, 6))
sns.barplot(
    x=comparison_results.index.map(lambda x: ' vs '.join(x)),
    y='Bartlett_pvalue',
    data=comparison_results.reset_index(),
    palette='coolwarm',
)
plt.axhline(y=0.05, color='red', linestyle='--', label='Significance Threshold (p=0.05)')
plt.xticks(rotation=90, fontsize=10)
plt.xlabel('Group Comparisons', fontsize=12)
plt.ylabel('Bartlett p-value', fontsize=12)
plt.title('Bartlett Test for Variance Homogeneity Across Groups', fontsize=14)
plt.legend()
plt.tight_layout()
plt.show()


- **Levene Test:**

A significant portion of comparisons exhibit p-values below the threshold of 0.05 (red dashed line), indicating that the assumption of homogeneity of variances is violated for these group pairs.
There are, however, comparisons with p-values above 0.05, suggesting variance homogeneity is preserved in these cases.

- **Bartlett Test:**

Similar to the Levene test, many group comparisons have p-values below 0.05, further confirming the violation of variance homogeneity for these pairs.
Bartlett's test tends to show stricter results than Levene's due to its sensitivity to non-normal data distributions.
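
To make the comparison between the two tests more concrete, the sketch below computes the median-centred Levene statistic (the Brown-Forsythe variant, which is the default `center='median'` in `scipy.stats.levene`) by hand with NumPy. The helper name `levene_statistic` and the two toy samples are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np

def levene_statistic(*samples):
    """Median-centred Levene (Brown-Forsythe) W statistic for k samples."""
    samples = [np.asarray(s, dtype=float) for s in samples]
    k = len(samples)
    n_total = sum(len(s) for s in samples)

    # Absolute deviations of each observation from its own group median
    z = [np.abs(s - np.median(s)) for s in samples]
    z_group_means = np.array([zi.mean() for zi in z])
    z_grand_mean = np.concatenate(z).mean()

    # Between-group and within-group variation of those deviations
    between = sum(len(zi) * (m - z_grand_mean) ** 2 for zi, m in zip(z, z_group_means))
    within = sum(((zi - m) ** 2).sum() for zi, m in zip(z, z_group_means))

    return (n_total - k) / (k - 1) * between / within

# Toy samples (hypothetical): the second is twice as spread out as the first
narrow = [1, 2, 3, 4, 5]
wide = [2, 4, 6, 8, 10]
w_stat = levene_statistic(narrow, wide)
print(f"Levene W = {w_stat:.4f}")
```

Because the statistic is built from absolute deviations around the median rather than from log-variances, it reacts less to heavy tails than Bartlett's test, which is one reason the two bar charts can disagree for the same pair of groups.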

In [None]:
mean_vectors = {}
groups = weight_gain.index.to_list()
for group in groups:
    weight_mean = weight_gain.loc[group].mean()
    consumption_mean = consumption.loc[group].mean()
    mean_vectors[group] = {"Weight Gain Mean": weight_mean, "Consumption Mean": consumption_mean}

mean_vectors_df = pd.DataFrame(mean_vectors).T


cov_within = {}
for group in groups:
    combined_data = pd.concat([weight_gain.loc[group], consumption.loc[group]], axis=1, keys=["Weight Gain", "Consumption"])
    cov_within[group] = combined_data.cov()


combined_means = pd.concat(
    [pd.DataFrame(weight_gain.mean(axis=1), columns=["Weight Gain"]),
     pd.DataFrame(consumption.mean(axis=1), columns=["Consumption"])],
    axis=1
)
cov_between = combined_means.cov()

mean_vectors_df

In [None]:
import seaborn as sns

In [None]:
groups = list(weight_gain.index.unique())
between_group_covariance_matrices = {}


for i, group1 in enumerate(groups):
    for j, group2 in enumerate(groups):
        if i < j:

            cov_matrix = np.cov(weight_gain.loc[group1], weight_gain.loc[group2])
            between_group_covariance_matrices[f'{group1} vs {group2}'] = cov_matrix

fig, axes = plt.subplots(len(between_group_covariance_matrices) // 2 + 1, 2, figsize=(15, 20))
axes = axes.flatten()

for idx, (pair, cov_matrix) in enumerate(between_group_covariance_matrices.items()):
    sns.heatmap(cov_matrix, annot=True, fmt=".2f", cmap="coolwarm", ax=axes[idx])
    axes[idx].set_title(f'Covariance: {pair}')
    axes[idx].set_xlabel('Variables')
    axes[idx].set_ylabel('Variables')

for ax in axes[len(between_group_covariance_matrices):]:
    ax.remove()

plt.tight_layout()
plt.show()

In [None]:
aggregated_weight_gain = weight_gain.mean(axis=1)
aggregated_consumption = consumption.mean(axis=1)

aggregated_data = pd.DataFrame({
    'Weight Gain': aggregated_weight_gain,
    'Consumption': aggregated_consumption,
    'Diet': ['D1', 'D2', 'D3', 'D1', 'D2', 'D3'],
    'Period': ['14-28', '14-28', '14-28', '28-42', '28-42', '28-42']
})

from statsmodels.multivariate.manova import MANOVA
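# Use a separate name so the `manova` DataFrame loaded from Manova.xlsx is not overwritten
manova_model = MANOVA.from_formula('Q("Weight Gain") + Q("Consumption") ~ Diet + Period', data=aggregated_data)
manova_results = manova_model.mv_test()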
print(manova_results)

In [None]:
grouped = aggregated_data.groupby(['Diet', 'Period'])
mean_vectors = grouped.mean()
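# Note: each Diet x Period group holds a single aggregated row, so these per-group
# covariance matrices come out as NaN; the within-group matrices are in `cov_within`.
covariance_matrices = grouped.apply(lambda x: np.cov(x[['Weight Gain', 'Consumption']], rowvar=False))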

plt.figure(figsize=(10, 6))
sns.heatmap(mean_vectors, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Mean Vectors for Each Group (Weight Gain and Consumption)")
plt.show()

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for idx, (group, cov_matrix) in enumerate(cov_within.items()):
    sns.heatmap(cov_matrix, annot=True, fmt=".2f", cmap="coolwarm", ax=axes[idx])
    axes[idx].set_title(f"Within-Group Covariance: {group}")

plt.tight_layout()
plt.show()


---

In [None]:
plt.figure(figsize=(10, 8))
aggregated_cov = np.cov(aggregated_data[['Weight Gain', 'Consumption']].T)
sns.heatmap(aggregated_cov, annot=True, fmt=".2f", cmap="coolwarm", xticklabels=['Weight Gain', 'Consumption'], yticklabels=['Weight Gain', 'Consumption'])
plt.title("Between-Group Covariance Matrix")
plt.show()

In [None]:
descriptive_stats = aggregated_data.groupby(['Diet', 'Period']).agg(['mean', 'std', 'min', 'max'])
print(descriptive_stats)

In [None]:
from statsmodels.formula.api import ols
import statsmodels.api as sm

weight_gain_anova = ols('Q("Weight Gain") ~ Diet + Period', data=aggregated_data).fit()
print(sm.stats.anova_lm(weight_gain_anova, typ=2))


consumption_anova = ols('Q("Consumption") ~ Diet + Period', data=aggregated_data).fit()
print(sm.stats.anova_lm(consumption_anova, typ=2))

In [None]:
from statsmodels.stats.anova import anova_lm  # anova_lm is called below but was never imported

interaction_model_weight = ols('Q("Weight Gain") ~ Diet * Period', data=aggregated_data.dropna()).fit()
interaction_anova_weight = anova_lm(interaction_model_weight, typ=2)
print("ANOVA Results for Weight Gain:")
print(interaction_anova_weight)

interaction_model_cons = ols('Q("Consumption") ~ Diet * Period', data=aggregated_data.dropna()).fit()
interaction_anova_cons = anova_lm(interaction_model_cons, typ=2)
print("ANOVA Results for Consumption:")
print(interaction_anova_cons)

In [None]:
plt.figure(figsize=(10, 6))
sns.pointplot(x='Period', y='Weight Gain', hue='Diet', data=aggregated_data, dodge=True, markers=['o', 's', 'D'])
plt.title('Interaction Effect: Diet and Period on Weight Gain')
plt.ylabel('Weight Gain')
plt.xlabel('Period')
plt.legend(title='Diet')
plt.show()

plt.figure(figsize=(10, 6))
sns.pointplot(x='Period', y='Consumption', hue='Diet', data=aggregated_data, dodge=True, markers=['o', 's', 'D'])
plt.title('Interaction Effect: Diet and Period on Consumption')
plt.ylabel('Consumption')
plt.xlabel('Period')
plt.legend(title='Diet')
plt.show()

# Discussion

### MANOVA Results:

- **Wilks' Lambda (Diet and Period):**
Wilks' Lambda measures the proportion of variance not explained by the groups, so low values indicate strong differences between them. Significant p-values (Diet: p=0.0035, Period: p=0.0003) confirm that both diet and period have a statistically significant effect on the multivariate measures (weight gain and consumption).

- **Pillai's Trace:**
This criterion supports the significance of the period effect (p=0.0003) but not of the diet effect (p=0.4834, which is above the standard threshold of 0.05). Pillai's Trace evaluates the proportion of explained variance; values close to 1 show strong effects.



- **Hotelling-Lawley and Roy's Greatest Root:**
The results confirm the significance of the effects, especially for the period. High F-statistic values indicate strong differences between groups. Hotelling-Lawley Trace measures the sum of eigenvalues, often indicating stronger effects for smaller datasets.


(**Roy's Greatest Root** considers the largest eigenvalue to assess the strength of the group differences.)

                       
---
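
The four criteria described above are all functions of the same eigenvalues. As a minimal sketch, assuming a hypothesis (between-group) SSCP matrix `H` and an error (within-group) SSCP matrix `E`, the statistics can be computed from the eigenvalues of `E⁻¹H`. The matrices below are made-up numbers for illustration and are not taken from `Manova.xlsx`; `manova_statistics` is a hypothetical helper. Roy's greatest root is taken here as the largest eigenvalue itself (some texts report it as λ_max / (1 + λ_max) instead).

```python
import numpy as np

def manova_statistics(H, E):
    """Return the four classical MANOVA criteria from SSCP matrices H and E."""
    # Eigenvalues of E^{-1} H; they are real and non-negative for valid SSCP matrices
    eigvals = np.linalg.eigvals(np.linalg.solve(E, H)).real
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return {
        "wilks_lambda": float(np.prod(1.0 / (1.0 + eigvals))),
        "pillai_trace": float(np.sum(eigvals / (1.0 + eigvals))),
        "hotelling_lawley_trace": float(np.sum(eigvals)),
        "roy_greatest_root": float(np.max(eigvals)),
    }

# Hypothetical 2x2 SSCP matrices for two response variables
H = np.array([[8.0, 2.0], [2.0, 3.0]])
E = np.array([[4.0, 1.0], [1.0, 2.0]])

stats = manova_statistics(H, E)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Wilks' Lambda shrinks toward 0 as the eigenvalues grow, while the other three grow with them, which is why a small Lambda and large trace values point in the same direction.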


- **Comparison Between Diet and Period:**

The F values and p-values indicate that the period effect is much stronger than the diet effect:
- Period: Wilks' Lambda = 0.0000, p = 0.0003 (highly significant).
- Diet: Wilks' Lambda = 0.0000, p = 0.0035 (still significant but less impactful than period).

This suggests that while diets influence both variables, the period of growth (14-28 vs. 28-42 days) has a more pronounced effect.


---



- **Interpretation:**

The influence of diets on weight gain and consumption likely varies across periods. For example, some diets may be more effective in the early period (14-28 days) than in the later period (28-42 days), or vice versa.
This interaction could imply that optimal dietary interventions may depend on the growth phase or consumption patterns during specific periods.

---
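
One thing worth checking before reading too much into an interaction: `aggregated_data`, as built earlier in this notebook, has a single averaged row per Diet × Period cell (six rows in total). The sketch below hard-codes the same Diet and Period labels and uses hypothetical helpers (`dummy`, `residual_df`) to count how many residual degrees of freedom the additive and the interaction designs leave.

```python
import numpy as np

# Same labels as the six rows of aggregated_data
diets = ["D1", "D2", "D3", "D1", "D2", "D3"]
periods = ["14-28", "14-28", "14-28", "28-42", "28-42", "28-42"]

def dummy(values, level):
    """0/1 indicator column for one level of a categorical factor."""
    return np.array([1.0 if v == level else 0.0 for v in values])

intercept = np.ones(len(diets))
d2, d3 = dummy(diets, "D2"), dummy(diets, "D3")  # D1 is the reference diet
p2 = dummy(periods, "28-42")                      # 14-28 is the reference period

# Design matrices for 'Diet + Period' and 'Diet * Period'
X_additive = np.column_stack([intercept, d2, d3, p2])
X_interaction = np.column_stack([X_additive, d2 * p2, d3 * p2])

def residual_df(X):
    """Observations minus the number of independent parameters."""
    return X.shape[0] - np.linalg.matrix_rank(X)

additive_df = residual_df(X_additive)
interaction_df = residual_df(X_interaction)
print("Residual df, Diet + Period:", additive_df)
print("Residual df, Diet * Period:", interaction_df)
```

With zero residual degrees of freedom, the `Diet * Period` model reproduces the six means exactly, so an interaction F-test on the aggregated means has no error term to be compared against. Testing the interaction formally would need the per-chicken or per-day values rather than their averages.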

**Comparison of Effects**

The results reveal that diets have a stronger effect on weight gain than on consumption, while periods significantly influence both variables.

**Similarity or Difference in Effects:**
- Diets: Stronger impact on weight gain but minimal impact on consumption.
- Periods: Large, consistent impact on both weight gain and consumption, suggesting that periods are a dominant factor for both measures.
---
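
One way to put numbers on "stronger" and "minimal" is the share of total variation attributable to each factor (eta squared). The sketch below works through an additive two-way decomposition for a balanced 3 × 2 table of cell means. The values in `cell_means` are invented for illustration and are not the notebook's results.

```python
import numpy as np

# Hypothetical cell means: rows = diets D1..D3, columns = periods 14-28 and 28-42
cell_means = np.array([
    [10.0, 20.0],
    [12.0, 22.0],
    [11.0, 24.0],
])

n_diets, n_periods = cell_means.shape
grand_mean = cell_means.mean()
diet_means = cell_means.mean(axis=1)    # one mean per diet
period_means = cell_means.mean(axis=0)  # one mean per period

# Sums of squares for a balanced layout with one value per cell
ss_total = ((cell_means - grand_mean) ** 2).sum()
ss_diet = n_periods * ((diet_means - grand_mean) ** 2).sum()
ss_period = n_diets * ((period_means - grand_mean) ** 2).sum()
ss_residual = ss_total - ss_diet - ss_period

eta_squared = {
    "Diet": ss_diet / ss_total,
    "Period": ss_period / ss_total,
    "Residual": ss_residual / ss_total,
}
for factor, share in eta_squared.items():
    print(f"{factor}: {share:.3f}")
```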
**Visualization:**

- **Weight Gain:**
There is a noticeable increase in weight gain from the 14-28 day period to the 28-42 day period across all diets. This highlights the strong effect of the period on weight gain.
Diet D1 shows the highest average weight gain in both periods, suggesting it might be the most effective for weight gain.
- **Consumption:**
Similar to weight gain, consumption also increases during the second period. However, differences between diets are less pronounced.
Diet D3 shows lower consumption in the first period (14-28 days) compared to the other diets, but its consumption aligns more closely with others in the 28-42 day period.

### Biological interpretation of results:
- The results suggest that periods (14-28 days vs. 28-42 days) have a significant effect on both weight gain and consumption.
- Diets show less pronounced effects, which might indicate that growth phases dominate the observed changes, potentially overshadowing dietary differences.
- This could be due to metabolic or growth-related factors that vary more between periods than between diets.


In [None]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey_weight = pairwise_tukeyhsd(endog=aggregated_data['Weight Gain'],
                                 groups=aggregated_data['Diet'],
                                 alpha=0.05)
print("Tukey Test for Weight Gain:")
print(tukey_weight)

tukey_consumption = pairwise_tukeyhsd(endog=aggregated_data['Consumption'],
                                      groups=aggregated_data['Diet'],
                                      alpha=0.05)
print("\nTukey Test for Consumption:")
print(tukey_consumption)

In [None]:
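from itertools import combinations

from statsmodels.stats.multitest import multipletests
from scipy.stats import ttest_ind

# Unordered diet pairs only: (D1, D2), (D1, D3), (D2, D3).
# Listing both (i, j) and (j, i) would double the number of tests and over-correct.
pairs = list(combinations(aggregated_data['Diet'].unique(), 2))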
weight_pvals = []
for i, j in pairs:
    group1 = aggregated_data[aggregated_data['Diet'] == i]['Weight Gain']
    group2 = aggregated_data[aggregated_data['Diet'] == j]['Weight Gain']
    _, pval = ttest_ind(group1, group2)
    weight_pvals.append(pval)

bonf_weight = multipletests(weight_pvals, method='bonferroni')

print("\nBonferroni Corrected P-Values for Weight Gain:")
print(bonf_weight[1])

consumption_pvals = []
for i, j in pairs:
    group1 = aggregated_data[aggregated_data['Diet'] == i]['Consumption']
    group2 = aggregated_data[aggregated_data['Diet'] == j]['Consumption']
    _, pval = ttest_ind(group1, group2)
    consumption_pvals.append(pval)

bonf_consumption = multipletests(consumption_pvals, method='bonferroni')

print("\nBonferroni Corrected P-Values for Consumption:")
print(bonf_consumption[1])

In [None]:
!pip install scikit-posthocs

In [None]:
import scikit_posthocs as sp

scheffe_weight = sp.posthoc_scheffe(aggregated_data, val_col='Weight Gain', group_col='Diet')
print("Scheffé's Test Results for Weight Gain:")
print(scheffe_weight)

scheffe_consumption = sp.posthoc_scheffe(aggregated_data, val_col='Consumption', group_col='Diet')
print("\nScheffé's Test Results for Consumption:")
print(scheffe_consumption)

# Test Discussion
1. Tukey's Test
- For Weight Gain:
None of the pairwise comparisons between diets (D1, D2, and D3) show a statistically significant difference in weight gain.
All p-values are above 0.05, indicating that the mean weight gain is not significantly different across the three diets.
The confidence intervals for all comparisons contain zero, further confirming no significant difference.

- For Consumption:
Similarly, there are no statistically significant differences in average consumption between the three diets.
All p-values are much higher than 0.05.
The confidence intervals also include zero, reinforcing that differences in consumption between diets are not meaningful.

2. Bonferroni Corrected P-Values

- For Weight Gain:
After adjusting for multiple comparisons using Bonferroni correction, none of the pairwise comparisons for weight gain is significant. All corrected p-values are 1.
- For Consumption:
Likewise, Bonferroni correction reveals no significant differences in consumption. All corrected p-values are also 1.

3. Scheffé's Test Results
- For Weight Gain: *P-Values.*
All pairwise comparisons between diets (D1 vs. D2, D1 vs. D3, and D2 vs. D3) have p-values greater than 0.05.
This indicates no statistically significant differences in weight gain between the three diets. The diets (D1, D2, and D3) do not have a meaningful effect on weight gain when compared to each other.
This result aligns with findings from other tests (e.g., Tukey's and Bonferroni), reinforcing the conclusion that diet differences are not impactful for weight gain in this dataset.
- For Consumption: *P-Values.*
Similarly, for average consumption, all pairwise comparisons yield p-values greater than 0.05.
This means there are no statistically significant differences in consumption between diets D1, D2, and D3. The diets have no substantial impact on consumption levels.
Period effects or individual variability might play a more significant role in influencing consumption.
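
As a reminder of what the Bonferroni step discussed above does, the sketch below multiplies each raw p-value by the number of comparisons and caps the result at 1, which is the same adjustment that `multipletests(..., method='bonferroni')` returns as its second element. The raw p-values are hypothetical and `bonferroni_adjust` is an illustrative helper, not a function from the notebook.

```python
import numpy as np

def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests and cap at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

# Hypothetical raw p-values for the three unordered diet pairs
raw_p = [0.01, 0.04, 0.50]
adjusted = bonferroni_adjust(raw_p)
print(adjusted)
```

Adjusted values of exactly 1 simply mean that every raw p-value was at least 1/m, where m is the number of comparisons, so the correction hit its cap.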





(The other tests were performed in the preceding sections.)