# Storage Strategies & Optimized Manipulation

**Author:** Iuliia Vitiugova  
**Repository:** Data Engineering & Data Structures – Research Portfolio

---

## Overview

Columnar vs row storage considerations, efficient joins, and memory profiling.

### Reproducibility Notes
- All outputs are cleared; execute cells sequentially from top to bottom.
- Python 3 environment; see `requirements.txt` at the repo root.
- Any paths are relative; adjust the `DATA_DIR` variable if needed.

---



## Structure of this Notebook
1. Problem Statement & Goals
2. Data Ingestion & Validation
3. Preprocessing & Cleaning
4. Transformations / Feature Engineering
5. Analysis & Evaluation
6. Conclusions & Next Steps
---


## One-Factor Analysis of Variance (One-Way ANOVA)

In [None]:
import numpy as np
import pandas as pd
import io
import matplotlib.pyplot as plt

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
gain = pd.read_excel(io.BytesIO(uploaded['Gain.xlsx']))
gain

In [None]:
period_14_28 = gain[['Période 14-28 jours', 'Unnamed: 1', 'Unnamed: 2']].copy()
period_14_28.columns = ['D1', 'D2', 'D3']
period_14_28

In [None]:
period_28_42 = gain[['Période 28-42 jours', 'Unnamed: 4', 'Unnamed: 5']].copy()
period_28_42.columns = ['D1', 'D2', 'D3']
period_28_42

In [None]:
period_14_28 = period_14_28.apply(pd.to_numeric, errors='coerce')
period_28_42 = period_28_42.apply(pd.to_numeric, errors='coerce')

print(period_14_28.dtypes)
print()
print(period_28_42.dtypes)

---

In [None]:
import scipy.stats as stats

def normal(data, group, period):
    plt.figure(figsize=(6, 4))

    plt.hist(data, bins=10, density=True, alpha=0.6, color='blue', edgecolor='black', label='Discrete')

    mu, std, median = np.mean(data), np.std(data), np.median(data)

    xmin, xmax = plt.xlim()
    x = np.linspace(xmin, xmax, 100)
    p = stats.norm.pdf(x, mu, std)

    plt.plot(x, p, 'r-', linewidth=2, label=f'Normal \nMean: {mu:.2f}, Std: {std:.2f}, Median: {median:.2f}')

    plt.title(f'{group} during {period}')
    plt.xlabel('Weights')
    plt.ylabel('Density')
    plt.legend()
    plt.grid(True)
    plt.show()

for group in ['D1', 'D2', 'D3']:
    for period, data in zip(['14-28 days', '28-42 days'], [period_14_28, period_28_42]):
        normal(data[group].dropna(), group, period)


## Plots Discussion: Normal Distribution

- **D1 and D2** (both periods): approximately normal, although a few extreme values in the tails suggest minor deviations.
- **D3** (both periods): more skewed, with a prominent right tail, especially in the 14-28 day period.

### Overall
A normal distribution is acceptable for D1 and D2. For groups with noticeable asymmetry (such as D3), an asymmetric distribution may fit better, so we next try a log-normal distribution, which may accommodate the right skew in D3.
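
The visual impression of skewness can be backed by a number. The sketch below is not part of the original analysis: it computes the sample skewness and a Shapiro-Wilk test on synthetic stand-in samples (the helper name `normality_summary` and both samples are assumptions made for illustration, not the `Gain.xlsx` data).

```python
import numpy as np
from scipy import stats

def normality_summary(sample):
    """Return size, skewness, and Shapiro-Wilk results for a 1-D sample."""
    sample = np.asarray(sample, dtype=float)
    sample = sample[~np.isnan(sample)]  # mirrors the notebook's dropna()
    w_stat, p_value = stats.shapiro(sample)
    return {
        "n": int(sample.size),
        "skewness": float(stats.skew(sample)),
        "shapiro_W": float(w_stat),
        "shapiro_p": float(p_value),
    }

# Synthetic stand-ins (assumption): one symmetric and one right-skewed sample
rng = np.random.default_rng(0)
symmetric = rng.normal(loc=500, scale=50, size=200)
right_skewed = rng.lognormal(mean=6.0, sigma=0.8, size=200)

summary_symmetric = normality_summary(symmetric)
summary_skewed = normality_summary(right_skewed)
print(summary_symmetric)
print(summary_skewed)
```

On the real data, the same helper could be called as `normality_summary(period_14_28['D3'])`; a clearly positive skewness together with a small Shapiro-Wilk p-value would support the reading of D3 as right-skewed.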

In [None]:
def lognormal(data, group, period):
    plt.figure(figsize=(6, 4))

    plt.hist(data, bins=10, density=True, alpha=0.6, color='blue', edgecolor='black', label='Discrete')

    shape, loc, scale = stats.lognorm.fit(data, floc=0)
    median = np.median(data)

    xmin, xmax = plt.xlim()
    x = np.linspace(xmin, xmax, 100)
    p = stats.lognorm.pdf(x, shape, loc, scale)

    plt.plot(x, p, 'r-', linewidth=2, label=f'Lognormal\nShape: {shape:.2f}, Scale: {scale:.2f}, Median: {median:.2f}')

    plt.title(f'{group} during {period}')
    plt.xlabel('Weights')
    plt.ylabel('Density')
    plt.legend()
    plt.grid(True)
    plt.show()

for group in ['D1', 'D2', 'D3']:
    for period, data in zip(['14-28 days', '28-42 days'], [period_14_28, period_28_42]):
        lognormal(data[group].dropna(), group, period)


## Plots Discussion: Log-Normal Distribution
The log-normal distribution works well for groups with right-skewed data (such as D3). For the more symmetric data in groups D1 and D2, however, a normal distribution may describe the data better.
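
One way to make the "which distribution fits better" comparison less subjective is to compare the maximised log-likelihoods of the two fits. The following is a minimal sketch under stated assumptions: the helper `compare_fits` is hypothetical, the sample is synthetic, and AIC is a technique added here for illustration rather than something the notebook itself uses. The log-normal fit fixes `loc` at 0, as in the plotting cell above, so both models have two free parameters.

```python
import numpy as np
from scipy import stats

def compare_fits(sample):
    """Fit normal and log-normal (loc fixed at 0) and return their AIC values."""
    sample = np.asarray(sample, dtype=float)

    mu, sigma = stats.norm.fit(sample)                     # maximum-likelihood estimates
    shape, loc, scale = stats.lognorm.fit(sample, floc=0)  # requires positive data

    ll_normal = np.sum(stats.norm.logpdf(sample, mu, sigma))
    ll_lognormal = np.sum(stats.lognorm.logpdf(sample, shape, loc, scale))

    n_params = 2  # (mu, sigma) for normal; (shape, scale) for log-normal
    return {
        "aic_normal": 2 * n_params - 2 * ll_normal,
        "aic_lognormal": 2 * n_params - 2 * ll_lognormal,
    }

# Synthetic right-skewed sample (assumption, not the notebook's data)
rng = np.random.default_rng(0)
skewed_sample = rng.lognormal(mean=6.0, sigma=0.8, size=200)

fit_comparison = compare_fits(skewed_sample)
print(fit_comparison)  # the lower AIC indicates the better-fitting model
```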


In [None]:
def bootstrap(data, n_bootstrap=1000):
    means = [np.mean(np.random.choice(data, size=len(data), replace=True)) for i in range(n_bootstrap)]
    medians = [np.median(np.random.choice(data, size=len(data), replace=True)) for i in range(n_bootstrap)]
    return means, medians

bootstrap_results = {}

for period, period_data in zip(['14-28 days', '28-42 days'], [period_14_28, period_28_42]):
    bootstrap_results[period] = {}
    for group in ['D1', 'D2', 'D3']:
        data = period_data[group].dropna()
        means, medians = bootstrap(data)
        bootstrap_results[period][group] = {'means': means, 'medians': medians}

bootstrap_results['14-28 days']['D1']['means']

In [None]:
fig, axs = plt.subplots(4, 3, figsize=(15, 20))

for period_idx, (period, group_results) in enumerate(bootstrap_results.items()):
    for group_idx, (group, results) in enumerate(group_results.items()):
        axs[2 * period_idx, group_idx].hist(results['means'], bins=20, color='blue', alpha=0.7)
        axs[2 * period_idx, group_idx].set_title(f'Bootstrap Means ({group}, {period})')
        axs[2 * period_idx, group_idx].set_xlabel('Mean Value')
        axs[2 * period_idx, group_idx].set_ylabel('Frequency')

        axs[2 * period_idx + 1, group_idx].hist(results['medians'], bins=20, color='green', alpha=0.7)
        axs[2 * period_idx + 1, group_idx].set_title(f'Bootstrap Medians ({group}, {period})')
        axs[2 * period_idx + 1, group_idx].set_xlabel('Median Value')
        axs[2 * period_idx + 1, group_idx].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

## Discussion

For all groups (D1, D2, D3) and both periods (14-28 days and 28-42 days):

**Bootstrap Means:**
The distributions of means are relatively symmetric and follow a pattern close to a normal distribution.
This suggests that the mean is a stable statistic for these groups, meaning that even when the data is resampled, the means are consistent across different samples.

**Bootstrap Medians:**
The distributions of medians are more varied and have more irregularities compared to the means. The median appears to be more sensitive to the underlying data's characteristics, such as skewness and outliers.

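The median histograms show uneven spikes: a bootstrap median can only take values that occur in the original sample (or midpoints between two of them), so its distribution is coarser than that of the mean and fluctuates more from one resample to the next.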
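
The difference between the two bootstrap distributions can be illustrated by counting how many distinct values each statistic takes. This is a sketch on synthetic data, not the notebook's implementation: it swaps in `np.random.default_rng` and vectorised index resampling in place of `np.random.choice`, and the helper name `bootstrap_statistic` is an assumption. The sample size is odd so that each resampled median is exactly one of the original observations.

```python
import numpy as np

def bootstrap_statistic(sample, statistic, n_bootstrap=1000, seed=0):
    """Apply `statistic` to n_bootstrap resamples drawn with replacement."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    # One row of resampled indices per bootstrap replicate
    idx = rng.integers(0, sample.size, size=(n_bootstrap, sample.size))
    return statistic(sample[idx], axis=1)

# Synthetic right-skewed sample of odd size (assumption)
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=6.0, sigma=0.5, size=31)

boot_means = bootstrap_statistic(sample, np.mean)
boot_medians = bootstrap_statistic(sample, np.median)

print("bootstrap SE of mean:  ", boot_means.std(ddof=1))
print("bootstrap SE of median:", boot_medians.std(ddof=1))
print("distinct mean values:  ", np.unique(boot_means).size)
print("distinct median values:", np.unique(boot_medians).size)
```

Because the medians are confined to at most 31 distinct values here, their histogram necessarily looks spiky, whereas the means vary almost continuously.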

In [None]:
def CI(bootstrap_data):
    lower_percentile = (100 - 95) / 2
    upper_percentile = 100 - lower_percentile
    mean_ci = np.percentile(bootstrap_data['means'], [lower_percentile, upper_percentile])
    median_ci = np.percentile(bootstrap_data['medians'], [lower_percentile, upper_percentile])
    return mean_ci, median_ci

ci = []
for period, group_results in bootstrap_results.items():
    for group, results in group_results.items():
        mean_ci, median_ci = CI(results)
        ci.append({
            'Period': period,
            'Group': group,
            'Mean Lower CI': mean_ci[0],
            'Mean Upper CI': mean_ci[1],
            'Median Lower CI': median_ci[0],
            'Median Upper CI': median_ci[1]
        })


results_ci = pd.DataFrame(ci)
results_ci

In [None]:
period_14_28

## Analysis of Variance (One-way ANOVA)

In [None]:
from scipy.stats import f_oneway

anova_14_28 = f_oneway(period_14_28['D1'].dropna(), period_14_28['D2'].dropna(), period_14_28['D3'].dropna())
anova_28_42 = f_oneway(period_28_42['D1'].dropna(), period_28_42['D2'].dropna(), period_28_42['D3'].dropna())

results_anova = pd.DataFrame({
    'Period': ['14-28 days', '28-42 days'],
    'F-statistic': [anova_14_28.statistic, anova_28_42.statistic],
    'p-value': [anova_14_28.pvalue, anova_28_42.pvalue]
})

results_anova

## Discussion
14-28 days:

- **F-statistic** (1.829): the between-group mean square is about 1.8 times the within-group mean square, a modest ratio.
- **p-value** (0.165) is above the 0.05 significance level, so we fail to reject H0: there is no statistically significant difference in mean weight gain between groups D1, D2, and D3. We therefore cannot conclude that the diets have different effects on weight gain in this period; the observed differences are consistent with random variation.

28-42 days:

- **F-statistic** (0.290): the between-group mean square is smaller than the within-group mean square, suggesting minimal differences between the groups.
- **p-value** (0.749) is far above the 0.05 significance level, so we again fail to reject H0: there is no statistically significant difference in weight gain between groups D1, D2, and D3 in this period. The observed differences are consistent with chance, and the diets do not appear to have a distinct effect on weight gain during this time.
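
To make the F-statistic less of a black box, the sketch below computes it by hand as the ratio of the between-group to the within-group mean square and checks the result against `scipy.stats.f_oneway`. The helper `one_way_f` and the three synthetic groups are assumptions made for illustration; they are not the notebook's data.

```python
import numpy as np
from scipy import stats

def one_way_f(*groups):
    """Compute the one-way ANOVA F-statistic and p-value from first principles."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n_total = sum(g.size for g in groups)
    grand_mean = np.concatenate(groups).mean()

    # Between-group sum of squares: group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    p_value = stats.f.sf(f_stat, df_between, df_within)  # upper-tail probability
    return f_stat, p_value

# Synthetic groups standing in for D1, D2, D3 (assumption)
rng = np.random.default_rng(0)
g1 = rng.normal(500, 50, size=20)
g2 = rng.normal(510, 50, size=20)
g3 = rng.normal(480, 50, size=20)

f_manual, p_manual = one_way_f(g1, g2, g3)
reference = stats.f_oneway(g1, g2, g3)
print(f_manual, p_manual)
print(reference.statistic, reference.pvalue)
```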


## Two-way ANOVA

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

df_merged = pd.concat([period_14_28, period_28_42], keys=['14-28 days', '28-42 days'], names=['Period'])
df_merged = df_merged.reset_index()

df_long = pd.melt(df_merged, id_vars=['Period'], value_vars=['D1', 'D2', 'D3'], var_name='Diet', value_name='Weight')


model = ols('Weight ~ C(Diet) * C(Period)', data=df_long).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table


In [None]:
from statsmodels.graphics.factorplots import interaction_plot

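fig, ax = plt.subplots(figsize=(8, 6))
# Pass the axes explicitly: without ax=, interaction_plot draws on a new figure and this one stays empty
interaction_plot(df_long['Period'], df_long['Diet'], df_long['Weight'], markers=['D', 'o', '^'], colors=['r', 'g', 'b'], ax=ax)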
plt.title('Diet and Period')
plt.xlabel('Period')
plt.ylabel('Weight')
plt.show()

In [None]:
import seaborn as sns  # required by sns.heatmap below

pivot_table = df_long.pivot_table(index='Diet', columns='Period', values='Weight', aggfunc='mean')

plt.figure(figsize=(8, 6))
sns.heatmap(pivot_table, annot=True, cmap='coolwarm', cbar=True)
plt.title('Heatmap of Average Weight by Diet and Period')
plt.show()


## Discussion Statistics
1. Effect of Diets (C(Diet)):
- F-statistic: 1.386
- p-value: 0.252

This indicates that the differences in weight gain between diets D1, D2, and D3 are not statistically significant. The p-value is greater than 0.05, meaning that the **diets do not have a significant impact on the chickens' weight gain**.

2. Effect of Period (C(Period)):
- F-statistic: 38.228
- p-value: 2.60e-09

The effect of the time period (14-28 days and 28-42 days) is statistically significant, as the p-value is much less than 0.05. This indicates that the **weight gain of the chickens significantly changes depending on the period.**

3. Interaction between Diet and Period (C(Diet):C(Period)):
- F-statistic: 0.072
- p-value: 0.931

The interaction between diets and periods is not statistically significant, as the p-value is much greater than 0.05. This means there is no evidence that the effect of diet on weight gain varies with the period. In other words, **the data are consistent with diets having the same effect on weight gain during both the 14-28 day period and the 28-42 day period.**
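
What "no interaction" means can be shown on a table of cell means. In a balanced design, the interaction effect of a cell is what remains of its mean after removing the grand mean and the two main effects. The sketch below uses hypothetical cell means (rows as diets, columns as periods); neither the numbers nor the helper `interaction_effects` come from the notebook.

```python
import numpy as np

def interaction_effects(cell_means):
    """Return cell_mean - row_mean - column_mean + grand_mean for each cell."""
    cell_means = np.asarray(cell_means, dtype=float)
    grand = cell_means.mean()
    row = cell_means.mean(axis=1, keepdims=True)  # diet means
    col = cell_means.mean(axis=0, keepdims=True)  # period means
    return cell_means - row - col + grand

# Hypothetical cell means: every diet gains the same 300 units between periods
additive = np.array([[400.0, 700.0],
                     [410.0, 710.0],
                     [380.0, 680.0]])

# Hypothetical cell means: the third diet gains more than the others
non_additive = np.array([[400.0, 700.0],
                         [410.0, 710.0],
                         [380.0, 740.0]])

effects_additive = interaction_effects(additive)
effects_non_additive = interaction_effects(non_additive)
print(effects_additive)      # all zeros: parallel lines in an interaction plot
print(effects_non_additive)  # non-zero entries: lines are not parallel
```

A non-significant `C(Diet):C(Period)` term corresponds to interaction effects that are small relative to the residual noise, which is the numerical counterpart of roughly parallel lines in an interaction plot.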

## Discussion Plots:
Diet and Period plot:
- The lines are roughly parallel and do not cross, which is consistent with the non-significant interaction between diets and periods.
- All diets show an increase in weight gain from the 14-28 day period to the 28-42 day period, but diet D3 shows a smaller gain compared to D1 and D2.
- Diets D1 and D2 are very similar in their results and are close together on the graph.

Heatmap:

- The heatmap highlights that D1 and D2 diets yield similar results across both periods, while diet D3 leads to lower weight gain, especially in the first period (14-28 days).