# Prototype Data Pipeline

**Author:** Iuliia Vitiugova  
**Repository:** Data Engineering & Data Structures – Research Portfolio

---

## Overview

This notebook covers data loading, cleaning, and validation, and establishes a reproducible baseline pipeline.

### Reproducibility Notes
- All outputs are cleared; execute cells sequentially from top to bottom.
- Python 3 environment; see `requirements.txt` at the repo root.
- Any paths are relative; adjust the `DATA_DIR` variable if needed.
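
The notes above can be made concrete with a short setup cell. This is a minimal sketch, not part of the original notebook: the seed value `42`, the default `DATA_DIR` location, and the helper `draw_sample` are all hypothetical choices made for illustration. It fixes NumPy's global random seed so that cells calling `np.random.*` produce the same draws on every run.

```python
from pathlib import Path

import numpy as np

SEED = 42                  # hypothetical seed value; any fixed integer works
DATA_DIR = Path("data")    # hypothetical default location, relative to the notebook


def draw_sample(seed, size=5):
    """Reseed NumPy's global generator, then draw `size` uniform values in [0, 10)."""
    np.random.seed(seed)
    return np.random.uniform(0, 10, size)


first = draw_sample(SEED)
second = draw_sample(SEED)

# Same seed, same draws: this is what makes a rerun reproducible
print(np.array_equal(first, second))
```

Without a fixed seed, every execution produces different numbers, so any values quoted in the result sections would not match a fresh run.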

---



## Structure of this Notebook
1. Problem Statement & Goals
2. Data Ingestion & Validation
3. Preprocessing & Cleaning
4. Transformations / Feature Engineering
5. Analysis & Evaluation
6. Conclusions & Next Steps
---


## A. Discrete Series

1. Using the corresponding Python function, generate a vector containing 1000 random values between 0 and 10.
2. Display this data in the form of a histogram using the corresponding Python function.
3. Determine the mode, mean, and median of this data without using predefined Python functions.
4. Verify the results for the mean and median using predefined Python functions. The results should be identical.
5. Explain why the value of the median can be very different from the value of the mean for this data.
6. Calculate the domain of definition, standard deviation, and variance of this data:
   - Without using predefined Python functions.
   - Using predefined functions.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Upper bound is 10 to match the task: uniform draws from [0, 10)
data = np.random.uniform(0, 10, 1000)
data

In [None]:
def mean(data):
    # Arithmetic mean: total divided by the number of values
    return sum(data) / len(data)

def median(data):
    # Middle value of the sorted data (average of the two middle values when n is even)
    sorted_data = sorted(data)
    n = len(sorted_data)
    mid = n // 2
    if n % 2 == 0:
        return (sorted_data[mid - 1] + sorted_data[mid]) / 2
    return sorted_data[mid]

def mode(data):
    # Most frequent value; ties resolve to the value seen first
    frequency = {}
    for item in data:
        frequency[item] = frequency.get(item, 0) + 1
    return max(frequency, key=frequency.get)

# Store the manual results now so the plot and the comparison table can reuse them
mean_manual = mean(data)
median_manual = median(data)
mode_manual = mode(data)

print('Mean:', mean_manual, 'Median:', median_manual, 'Mode:', mode_manual)

In [None]:
# Histogram with the manual statistics overlaid (they are defined in the cell above)
plt.hist(data, bins=10, color='lightblue', edgecolor='black', alpha=0.7)

plt.axvline(mean_manual, color='blue', linestyle='dashed', linewidth=2, label=f'Mean: {mean_manual:.2f}')
plt.axvline(median_manual, color='green', linestyle='dashed', linewidth=2, label=f'Median: {median_manual:.2f}')
plt.axvline(mode_manual, color='red', linestyle='dashed', linewidth=2, label=f'Mode: {mode_manual:.2f}')

plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()
plt.show()

In [None]:
import pandas as pd

# Predefined functions for verification. statistics.mean and statistics.median are not
# imported by name, because that would shadow the manual mean() and median() defined above.
mean_np = np.mean(data)
median_np = np.median(data)

def func_mode(data):
    # Most frequent value according to pandas
    series = pd.Series(data)
    frequency = series.value_counts()
    return frequency.idxmax()

mode_pd = func_mode(data)

results = pd.DataFrame({
    'Statistic': ['Mean', 'Median', 'Mode'],
    'Manual': [mean_manual, median_manual, mode_manual],
    'NumPy/Pandas': [mean_np, median_np, mode_pd]
})
results

### Results
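1. The **mean** values from both methods agree. For values drawn uniformly between 0 and 10, the expected mean is 5, and the calculated value is close to it.
2. The **median** values from both methods agree. For a symmetric distribution such as the uniform one, the median is expected to lie near the middle of the range, close to the mean.
3. The **mode** values from both methods agree when a single value is the most frequent. Because the data are continuous floats, almost every value occurs exactly once, so the mode is only informative after rounding or binning, and the two methods may break ties differently.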
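
Task 6 of this section asks for the domain of definition, the variance, and the standard deviation, with and without predefined functions. The following is a minimal sketch of one way to do that. It assumes that "domain of definition" means the observed range `[min, max]`, uses the population formulas (division by `n`, which matches NumPy's defaults), and regenerates its own data with an assumed seed so that it runs on its own. The helper names `value_range`, `variance`, and `std_dev` are introduced here for illustration.

```python
import numpy as np

np.random.seed(0)  # assumed seed, only so this sketch is repeatable
data = np.random.uniform(0, 10, 1000)


def value_range(values):
    """Smallest and largest value, found with a single pass and no built-in min/max."""
    lowest = highest = values[0]
    for v in values:
        if v < lowest:
            lowest = v
        if v > highest:
            highest = v
    return lowest, highest


def variance(values):
    """Population variance: mean squared deviation from the mean."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / n


def std_dev(values):
    """Population standard deviation: square root of the variance."""
    return variance(values) ** 0.5


low, high = value_range(data)
print("Manual:", (low, high), variance(data), std_dev(data))

# Predefined functions for comparison (np.var and np.std divide by n by default)
print("NumPy: ", (np.min(data), np.max(data)), np.var(data), np.std(data))
```

The manual and NumPy results should agree up to floating-point rounding. Passing `ddof=1` to `np.var` or `np.std` would give the sample (unbiased) versions instead.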

## B. Discrete Series by Frequency
1. Create the dataset and display it as a histogram.
2. Calculate measures of central tendency and dispersion.
3. Explain why the distribution is bimodal.


In [None]:
xi = [5, 8, 9, 10, 11, 12, 13, 14, 16]
ni = [10, 12, 48, 23, 24, 48, 9, 7, 13]

plt.bar(xi, ni, color='coral', edgecolor='black')
plt.xlabel('Value (xi)')
plt.ylabel('Frequency (ni)')
plt.show()

In [None]:
def weighted_mean(values, frequencies):
    return sum(x * n for x, n in zip(values, frequencies)) / sum(frequencies)

def weighted_median(values, frequencies):
    cumulative_frequencies = np.cumsum(frequencies)
    total_count = cumulative_frequencies[-1]
    for i, cum_freq in enumerate(cumulative_frequencies):
        if cum_freq >= total_count / 2:
            return values[i]

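def weighted_mode(values, frequencies):
    # Return every value that reaches the maximum frequency, so ties (multimodal data) are kept
    max_frequency = max(frequencies)
    return [x for x, n in zip(values, frequencies) if n == max_frequency]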

def weighted_variance(values, frequencies, mean_value):
    mean_squared = sum((x ** 2) * n for x, n in zip(values, frequencies)) / sum(frequencies)
    return mean_squared - mean_value ** 2

def weighted_std(variance_value):
    return variance_value ** 0.5

mean_weighted = weighted_mean(xi, ni)
median_weighted = weighted_median(xi, ni)
mode_weighted = weighted_mode(xi, ni)
variance_weighted = weighted_variance(xi, ni, mean_weighted)
std_weighted = weighted_std(variance_weighted)

print(f"Mean: {mean_weighted}, Median: {median_weighted}, Mode: {mode_weighted}, Variance: {variance_weighted}, Std: {std_weighted}")

### Explanation
The dataset is **bimodal** because the frequency distribution has two separate peaks of equal height: the values 9 and 12 each occur 48 times, more often than any other value.
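
The frequency-table statistics can be cross-checked with predefined functions. The sketch below is an illustration rather than part of the original notebook: it expands the table into the full list of observations with `np.repeat` and then applies the usual NumPy functions. The variable names (`expanded`, `modes`, and the `*_np` results) are introduced here.

```python
import numpy as np

xi = [5, 8, 9, 10, 11, 12, 13, 14, 16]       # distinct values
ni = [10, 12, 48, 23, 24, 48, 9, 7, 13]      # how often each value occurs

# Expand the table into one long array: each xi repeated ni times
expanded = np.repeat(xi, ni)

mean_np = np.average(xi, weights=ni)   # weighted mean straight from the table
median_np = np.median(expanded)        # median of the expanded observations
variance_np = np.var(expanded)         # population variance (divides by N)
std_np = np.std(expanded)

# All values that reach the highest frequency; two entries means bimodal
max_freq = max(ni)
modes = [x for x, n in zip(xi, ni) if n == max_freq]

print(f"Mean: {mean_np}, Median: {median_np}, Modes: {modes}")
print(f"Variance: {variance_np}, Std: {std_np}")
```

Expanding the table is simple but uses memory proportional to the total count; for very large frequencies the weighted formulas are the better choice.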

## C. Gaussian Distributions
1. Generate a random sample from a normal distribution (mean = 100, standard deviation = 225) and display the probability density of the distribution.
2. Generate a histogram for a sample of size 100,000.
3. Calculate the mean and variance of the sample.
4. Calculate the percentage of individuals for X ≤ 60.
5. Calculate the percentage of individuals for X ≥ 130.
6. Find the interval containing 95% of the values around the mean.





In [None]:
import scipy.stats as stats

mean_given = 100
std_given = 225
n = 100000

rv = np.random.normal(mean_given, std_given, n)

# Estimate the density from the sample with a Gaussian kernel density estimate
density = stats.gaussian_kde(rv)
x = np.linspace(rv.min(), rv.max(), 1000)
plt.plot(x, density(x))
plt.title('Probability Density Function')
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()

In [None]:
plt.hist(rv, bins=50, density=True, alpha=0.7, color='pink', edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Density')  # density=True normalizes the bars so they integrate to 1
plt.show()

In [None]:
mean_rv = np.mean(rv)
variance_rv = np.var(rv)

x_under = np.mean(rv <= 60) * 100
x_over = np.mean(rv >= 130) * 100
alpha = np.percentile(rv, [2.5, 97.5])

print(f"Mean: {mean_rv}, Variance: {variance_rv}")
print(f"Percentage of X <= 60: {x_under}, Percentage of X >= 130: {x_over}")
print(f"95% of values lie between: {alpha}")
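
The empirical percentages and the 95% interval can be compared with the exact values of the theoretical distribution. The following is a hedged sketch using `scipy.stats.norm`; it takes the mean and standard deviation from the cell above as given, and the names `dist`, `p_under`, `p_over`, `low`, and `high` are introduced here for illustration.

```python
from scipy import stats

mean_given = 100
std_given = 225   # used as the standard deviation, as in the cell above

# Frozen normal distribution with the stated parameters
dist = stats.norm(loc=mean_given, scale=std_given)

p_under = dist.cdf(60) * 100     # P(X <= 60) in percent
p_over = dist.sf(130) * 100      # P(X >= 130) in percent (sf = 1 - cdf)
low, high = dist.interval(0.95)  # central interval holding 95% of the probability

print(f"Theoretical percentage of X <= 60: {p_under:.2f}")
print(f"Theoretical percentage of X >= 130: {p_over:.2f}")
print(f"Theoretical 95% interval: ({low:.2f}, {high:.2f})")
```

With 100,000 draws the empirical values should land close to these theoretical ones. If 225 is meant as the variance rather than the standard deviation, `scale=15` would be the matching parameter.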

## D. Importance of Sample Size
1. Generate samples with different sizes and evaluate statistics.
2. Compare the results with theoretical values and compute 95% confidence intervals.



In [None]:
mean_d = 100
std_d = 15

rv_n1 = np.random.normal(mean_d, std_d, 10)
rv_n2 = np.random.normal(mean_d, std_d, 1000)
rv_n3 = np.random.normal(mean_d, std_d, 100000)

mean_n1, std_n1 = np.mean(rv_n1), np.std(rv_n1)
mean_n2, std_n2 = np.mean(rv_n2), np.std(rv_n2)
mean_n3, std_n3 = np.mean(rv_n3), np.std(rv_n3)

results_n123 = pd.DataFrame({
    'Sample Size': [10, 1000, 100000],
    'Mean Theory': [100, 100, 100],
    'Mean Calculated': [mean_n1, mean_n2, mean_n3],
    'Std Theory': [15, 15, 15],
    'Std Calculated': [std_n1, std_n2, std_n3]
})
results_n123

### Results
*   **n = 10:**
In the run reported here, the calculated mean (103.53) and standard deviation (17.41) deviate noticeably from the theoretical values (100 and 15), because estimates from such a small sample fluctuate strongly from run to run.
*   **n = 1000:**
The calculated mean (99.67) and standard deviation (14.54) are much closer to the theoretical values, showing that accuracy improves as the sample size increases.
*   **n = 100000:**
The calculated mean (100.03) and standard deviation (14.98) are nearly equal to the theoretical values. This illustrates the Law of Large Numbers: as the sample size grows, the sample mean converges to the population mean.
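
Task 2 of this section also asks for 95% confidence intervals. The sketch below shows one way to compute them for the three sample sizes. It is an illustration under stated assumptions: the seed is arbitrary, the helper `mean_ci` is introduced here, the interval uses the normal approximation with z = 1.96, and the standard deviation is computed with `ddof=1` (the sample estimate), which differs slightly from the `np.std` default used in the cell above.

```python
import numpy as np

np.random.seed(1)  # assumed seed so the sketch is repeatable
mean_d, std_d = 100, 15


def mean_ci(sample, z=1.96):
    """Normal-approximation 95% interval for the mean: mean ± z * s / sqrt(n)."""
    sample_mean = np.mean(sample)
    std_err = np.std(sample, ddof=1) / np.sqrt(len(sample))
    return sample_mean - z * std_err, sample_mean + z * std_err


intervals = {}
for n in (10, 1000, 100000):
    sample = np.random.normal(mean_d, std_d, n)
    intervals[n] = mean_ci(sample)
    low, high = intervals[n]
    print(f"n = {n}: 95% CI for the mean = ({low:.2f}, {high:.2f})")

# Interval widths, in order of increasing sample size
widths = [high - low for low, high in intervals.values()]
```

Each hundredfold increase in sample size narrows the interval by roughly a factor of ten, since the standard error scales with 1/√n. For n = 10 a Student t quantile would be more appropriate than 1.96, so that interval is only approximate.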

## E. Comparing Two Populations
1. Generate two samples and evaluate the statistics.
2. Calculate 95% confidence intervals for each population and for the difference in means.

In [None]:
sample_X = np.random.normal(13, np.sqrt(225), 250)
sample_Y = np.random.normal(12, 15, 25000)

mean_X, std_X = np.mean(sample_X), np.std(sample_X)
mean_Y, std_Y = np.mean(sample_Y), np.std(sample_Y)

print(f"X: Mean={mean_X}, Std Dev={std_X}")
print(f"Y: Mean={mean_Y}, Std Dev={std_Y}")

In [None]:
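def confidence_interval(data, confidence=0.95):
    # Normal-approximation interval for the mean: mean ± z * standard error
    sample_mean = np.mean(data)
    std_err = np.std(data) / np.sqrt(len(data))
    z = stats.norm.ppf(1 - (1 - confidence) / 2)  # about 1.96 for 95% confidence
    margin_of_error = z * std_err
    return sample_mean - margin_of_error, sample_mean + margin_of_error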

alpha_X = confidence_interval(sample_X)
alpha_Y = confidence_interval(sample_Y)

mean_diff = mean_X - mean_Y
std_diff = np.sqrt((std_X**2 / len(sample_X)) + (std_Y**2 / len(sample_Y)))
alpha_diff = (mean_diff - 1.96 * std_diff, mean_diff + 1.96 * std_diff)

results_XY = pd.DataFrame({
    'Statistic': ['Mean X', 'Mean Y', 'Mean Difference', 'Std X', 'Std Y', 'Std Difference', '95% Confidence Interval X', '95% Confidence Interval Y', '95% Confidence Interval Difference'],
    'Value': [mean_X, mean_Y, mean_diff, std_X, std_Y, std_diff, alpha_X, alpha_Y, alpha_diff]
})
results_XY

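### Results
*   The two samples show no statistically significant difference in mean at the 5% level, and their standard deviations are similar.
*   The 95% confidence interval for X is wider (10.06, 13.84) than the one for Y (11.76, 12.13) because the standard error shrinks with the square root of the sample size, and Y has 100 times more observations (25000 versus 250).
*   The 95% confidence interval for the difference in means includes zero (-1.89, 1.90), so the difference is not significant. The samples were generated with means 13 and 12, so a sample of 250 is too small to resolve a true difference of 1 at this level of variability.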
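
The conclusion about the difference in means can be cross-checked with a two-sample test. The sketch below uses Welch's t-test (`scipy.stats.ttest_ind` with `equal_var=False`), which is a different technique from the confidence-interval calculation in the notebook and is shown only as a complementary check. The seed and the names `t_stat`, `p_value`, and `significant` are assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

np.random.seed(2)  # assumed seed so the sketch is repeatable

# Same generating parameters as in the notebook: means 13 and 12, standard deviation 15
sample_X = np.random.normal(13, 15, 250)
sample_Y = np.random.normal(12, 15, 25000)

# Welch's t-test: compares the two means without assuming equal variances
t_stat, p_value = stats.ttest_ind(sample_X, sample_Y, equal_var=False)

significant = p_value < 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, significant at 5%: {significant}")
```

A p-value above 0.05 corresponds to a 95% confidence interval for the difference that includes zero, so both approaches should usually lead to the same conclusion for a given pair of samples.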