# Advanced Transformations & Performance

**Author:** Iuliia Vitiugova  
**Date:** 2025-09-17  
**Repository:** Data Engineering & Data Structures – Research Portfolio

---

## Overview

Feature transformations at scale, vectorization, and performance profiling.

### Reproducibility Notes
- All outputs are cleared; execute cells sequentially from top to bottom.
- Python 3 environment; see `requirements.txt` at the repo root.
- Any paths are relative; adjust the `DATA_DIR` variable if needed.

---



## Structure of this Notebook
1. Problem Statement & Goals
2. Data Ingestion & Validation
3. Preprocessing & Cleaning
4. Transformations / Feature Engineering
5. Analysis & Evaluation
6. Conclusions & Next Steps

---



## A. Multivariate Analysis: Fisher’s Iris Dataset
1. Calculate correlation coefficients manually, without library correlation routines, and compare them with the pandas values.
2. Compute confidence intervals for the correlations.
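
For reference, the two steps listed above correspond to the following formulas, which the `R` and `CI` functions in this section implement. The symbols $\bar{x}$, $\bar{y}$ (sample means), $n$ (number of rows), and $z_{\alpha/2}$ (normal critical value, 1.96 or 2.57 in the code) are notation chosen here for illustration rather than taken from the notebook:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

$$
Z = \tfrac{1}{2} \ln\frac{1 + r}{1 - r}, \qquad
s_Z = \frac{1}{\sqrt{n - 3}}, \qquad
Z_{\inf} = Z - z_{\alpha/2}\, s_Z, \qquad
Z_{\sup} = Z + z_{\alpha/2}\, s_Z
$$

$$
R_{\inf} = \frac{e^{2 Z_{\inf}} - 1}{e^{2 Z_{\inf}} + 1}, \qquad
R_{\sup} = \frac{e^{2 Z_{\sup}} - 1}{e^{2 Z_{\sup}} + 1}
$$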

In [None]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
import io

iris_data = pd.read_csv(io.BytesIO(uploaded['iris.csv']))
iris_data

In [None]:
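# Plot only the numeric columns: the 2x2 grid has room for the four measurements,
# and the categorical 'variety' column would otherwise land in a non-existent row 3.
numeric_cols = iris_data.select_dtypes(include='number').columns
fig = make_subplots(rows=2, cols=2, subplot_titles=list(numeric_cols))

for i, col in enumerate(numeric_cols):
    row = (i // 2) + 1
    col_num = (i % 2) + 1
    fig.add_trace(go.Histogram(x=iris_data[col], marker_color='coral', name=col), row=row, col=col_num)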

fig.update_layout(height=600, width=800, showlegend=False, title_text="Iris Data Distributions")
fig.update_xaxes(title_text="Value")
fig.update_yaxes(title_text="Count")

fig.show()

In [None]:
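def R(df, col1, col2):
    """Pearson correlation coefficient computed directly from its definition."""
    # Plain lists give positional access even if the DataFrame index is not 0..n-1
    x = list(df[col1])
    y = list(df[col2])
    n = len(x)

    x_mean = sum(x) / n
    y_mean = sum(y) / n

    # Sum of cross-deviations over the square root of the product of squared deviations
    num = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    den = (sum((xi - x_mean) ** 2 for xi in x) * sum((yi - y_mean) ** 2 for yi in y)) ** 0.5
    return num / den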

iris_data = pd.read_csv('iris.csv').drop(columns=['variety'])
pairs = [('sepal.length', 'sepal.width'), ('sepal.length', 'petal.length'),
         ('sepal.length', 'petal.width'), ('sepal.width', 'petal.length'),
         ('sepal.width', 'petal.width'), ('petal.length', 'petal.width')]


results_iris = pd.DataFrame([(col1, col2, R(iris_data, col1, col2)) for col1, col2 in pairs], columns=['col1', 'col2', 'R Manual'])
results_iris['R Theory'] = [iris_data.corr().loc[col1, col2] for col1, col2 in zip(results_iris['col1'], results_iris['col2'])]
results_iris
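
For comparison with the loop-based `R` above, the same coefficient can be written with NumPy array operations. This is a minimal sketch rather than part of the original analysis: `pearson_r_vectorized` is a hypothetical helper name, and the toy data are made up for illustration.

```python
import numpy as np

def pearson_r_vectorized(x, y):
    """Pearson correlation via NumPy array operations (no Python-level loop)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical toy data, not taken from iris.csv
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r_vec = pearson_r_vectorized(x, y)
r_ref = float(np.corrcoef(x, y)[0, 1])  # NumPy's built-in estimate for cross-checking
print(r_vec, r_ref)
```

On a DataFrame the call would look like `pearson_r_vectorized(iris_data['petal.length'], iris_data['petal.width'])`.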

In [None]:
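def CI(r, df, confidence_level=95):
    """Fisher z confidence interval for a correlation r estimated from len(df) rows."""
    # Two-sided critical values of the standard normal distribution
    if confidence_level == 95:
        z_critical = 1.96
    elif confidence_level == 99:
        z_critical = 2.57  # truncated; 2.576 is the more precise value
    else:
        raise ValueError("Only 95% and 99% confidence levels are supported")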

    n = len(df)
    Z = 0.5 * np.log((1 + r) / (1 - r))
    sZ = 1 / np.sqrt(n - 3)
    Zinf = Z - z_critical * sZ
    Zsup = Z + z_critical * sZ
    Rinf = (np.exp(2 * Zinf) - 1) / (np.exp(2 * Zinf) + 1)
    Rsup = (np.exp(2 * Zsup) - 1) / (np.exp(2 * Zsup) + 1)
    return Rinf, Rsup

# CI only uses len(df) as the sample size, so pass the whole DataFrame
results_iris['Inf CI (95%)'], results_iris['Sup CI (95%)'] = zip(*[CI(r, iris_data) for r in results_iris['R Manual']])
results_iris['Inf CI (99%)'], results_iris['Sup CI (99%)'] = zip(*[CI(r, iris_data, confidence_level=99) for r in results_iris['R Manual']])
results_iris

---

### Results
1. sepal.length and sepal.width:
- Correlation: -0.117570 (weak negative)
- 95% CI: [-0.272696, 0.043515] — Includes zero, suggesting no significant linear relationship.
- 99% CI: [-0.318598, 0.093579] — Wider interval, still includes zero, confirming non-significance.
2. sepal.length and petal.length:
- Correlation: 0.871754 (strong positive)
- 95% CI: [0.827035, 0.905509] — Strong significant relationship, narrow interval.
- 99% CI: [0.810460, 0.914166] — Slightly wider, still highly significant.
3. sepal.length and petal.width:
- Correlation: 0.817941 (strong positive)
- 95% CI: [0.756896, 0.864837] — Significant, narrow interval.
- 99% CI: [0.734576, 0.876980] — Slightly wider but remains significant.
4. sepal.width and petal.length:
- Correlation: -0.428440 (moderate negative)
- 95% CI: [-0.550879, -0.287947] — Significant, does not include zero.
- 99% CI: [-0.584950, -0.241169] — Slightly wider, still significant.
5. sepal.width and petal.width:
- Correlation: -0.366126 (moderate negative)
- 95% CI: [-0.497215, -0.218694] — Significant, does not include zero.
- 99% CI: [-0.534134, -0.170296] — Slightly wider, still significant.
6. petal.length and petal.width:
- Correlation: 0.962865 (very strong positive)
- 95% CI: [0.949052, 0.972985] — Very narrow, highly significant.
- 99% CI: [0.943810, 0.975540] — Slightly wider, still highly significant.

### Discussion
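All six intervals come from the same sample, so their width on the Fisher $Z$ scale is identical; the differences appear only after transforming back to the correlation scale. The back-transformation compresses intervals near $|r| = 1$, which is why **strong correlations**, such as petal.length and petal.width, have **narrow intervals**, while **weak correlations**, such as sepal.length and sepal.width, have **wide intervals** that can include zero and therefore fail to reach significance.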
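
The dependence of interval width on the size of the correlation can be checked directly. The sketch below reuses three correlations from the results above and assumes `n = 150` rows (the usual size of the Iris dataset; the notebook does not state it). `fisher_ci` is a hypothetical helper that mirrors `CI` but takes `n` as a number:

```python
import math

def fisher_ci(r, n, z_critical=1.96):
    """Fisher z confidence interval for a correlation r from n observations."""
    z = math.atanh(r)  # identical to 0.5 * ln((1 + r) / (1 - r))
    half_width = z_critical / math.sqrt(n - 3)  # depends on n only, not on r
    return math.tanh(z - half_width), math.tanh(z + half_width)

n = 150  # assumed sample size
for r in (-0.117570, 0.871754, 0.962865):  # correlations reported in the results
    lower, upper = fisher_ci(r, n)
    print(f"r = {r:+.6f}  CI = [{lower:+.4f}, {upper:+.4f}]  width = {upper - lower:.4f}")
```

The half-width on the $Z$ scale is the same in every iteration; only the `tanh` back-transformation makes the printed widths differ.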

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Annotated heatmap of the pairwise Pearson correlations
sns.heatmap(iris_data.corr(), annot=True)
plt.show()

## B. Multivariate Data: Anthropometry Dataset
1. Calculate correlation and determination coefficients without library correlation routines.
2. Compute confidence intervals for the correlations.

---
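
The determination coefficient used in this section is the square of the correlation coefficient. For a straight-line least-squares fit this equals the share of variance explained, $1 - SS_{res}/SS_{tot}$. The sketch below checks that identity on made-up numbers; the data and variable names are illustrative assumptions, not values from `mansize.csv`.

```python
import numpy as np

# Hypothetical toy data, not taken from mansize.csv
x = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
y = np.array([41.0, 43.5, 44.0, 46.5, 47.0, 49.5])

r = np.corrcoef(x, y)[0, 1]

# Least-squares line y ≈ slope * x + intercept (polyfit returns the highest degree first)
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

ss_res = float((residuals ** 2).sum())       # unexplained variation
ss_tot = float(((y - y.mean()) ** 2).sum())  # total variation
r2_from_fit = 1.0 - ss_res / ss_tot

print(r ** 2, r2_from_fit)
```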



In [None]:
uploaded = files.upload()

In [None]:
mansize = pd.read_csv(io.BytesIO(uploaded['mansize.csv']), sep=';')
mansize

In [None]:
fig = make_subplots(rows=3, cols=3, subplot_titles=mansize.columns)

for i, col in enumerate(mansize.columns):
    row = (i // 3) + 1
    col_num = (i % 3) + 1
    fig.add_trace(go.Histogram(x=mansize[col], marker_color='lightgreen', name=col), row=row, col=col_num)

fig.update_layout(height=900, width=900, showlegend=False, title_text="Mansize Data Distributions")
fig.update_xaxes(title_text="Value")
fig.update_yaxes(title_text="Count")

fig.show()

### Distributions
- **Age**: multimodal, with peaks around 20 and 22 years.
- **Height**: approximately normal with a slight right skew, centered near 170 cm.
- **Weight**: approximately normal with a slight right skew, centered near 75 kg.
- **Femur Length**: approximately normal with a slight left skew, centered near 46 cm.
- **Feet Size**: approximately normal, centered near 25 cm.
- **Arm Span**: approximately normal and symmetric, centered near 180 cm.
- **Hand Length**: approximately normal with a slight left skew, centered near 19 cm.
- **Cranial Volume**: approximately normal and symmetric, centered near 1400 cm³.
- **Penis Size**: approximately normal and symmetric, centered near 13 cm.
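
The skew descriptions above are visual readings of the histograms. Sample skewness offers a numeric cross-check: values near zero suggest symmetry, positive values a right skew, and negative values a left skew. The sketch below uses synthetic columns (an assumption for illustration, not the anthropometry data); on the real table the equivalent call would be `mansize.skew()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns for illustration only
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=170, scale=7, size=5000),
    "right_skewed": rng.gamma(shape=2.0, scale=5.0, size=5000),
})
demo["left_skewed"] = -demo["right_skewed"]  # mirror image of the right-skewed column

skewness = demo.skew()  # one sample-skewness value per column
print(skewness.round(2))
```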


In [None]:
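def R(df, col1, col2):
    """Pearson correlation coefficient computed directly from its definition."""
    # Plain lists give positional access even if the DataFrame index is not 0..n-1
    x = list(df[col1])
    y = list(df[col2])
    n = len(x)

    x_mean = sum(x) / n
    y_mean = sum(y) / n

    # Sum of cross-deviations over the square root of the product of squared deviations
    numerator = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    denominator = (sum((xi - x_mean) ** 2 for xi in x) * sum((yi - y_mean) ** 2 for yi in y)) ** 0.5
    return numerator / denominator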

def R2(df, col1, col2):
    r = R(df, col1, col2)
    return r**2


results = []
for i, col1 in enumerate(mansize.columns):
    for col2 in mansize.columns[i + 1:]:
        r_manual = R(mansize, col1, col2)
        r2_manual = R2(mansize, col1, col2)
        r_theory = mansize[col1].corr(mansize[col2])
        r2_theory = r_theory ** 2
        results.append((col1, col2, r_manual, r2_manual, r_theory, r2_theory))


results_mansize = pd.DataFrame(results, columns=['col1', 'col2',
                                                 'R Manual', 'R2 Manual',
                                                 'R Theory', 'R2 Theory'])
results_mansize

In [None]:
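def CI(r, df, confidence_level=95):
    """Fisher z confidence interval for a correlation r estimated from len(df) rows."""
    # Two-sided critical values of the standard normal distribution
    if confidence_level == 95:
        z_critical = 1.96
    elif confidence_level == 99:
        z_critical = 2.57  # truncated; 2.576 is the more precise value
    else:
        raise ValueError("Only 95% and 99% confidence levels are supported")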

    n = len(df)
    Z = 0.5 * np.log((1 + r) / (1 - r))
    sZ = 1 / np.sqrt(n - 3)
    Zinf = Z - z_critical * sZ
    Zsup = Z + z_critical * sZ
    Rinf = (np.exp(2 * Zinf) - 1) / (np.exp(2 * Zinf) + 1)
    Rsup = (np.exp(2 * Zsup) - 1) / (np.exp(2 * Zsup) + 1)
    return Rinf, Rsup


results_mansize['Inf CI (95%)'], results_mansize['Sup CI (95%)'] = zip(*[CI(r, mansize, confidence_level=95) for r in results_mansize['R Manual']])
results_mansize['Inf CI (99%)'], results_mansize['Sup CI (99%)'] = zip(*[CI(r, mansize, confidence_level=99) for r in results_mansize['R Manual']])
results_mansize

### Results Discussion

The correlations for variables such as **Height** and **Femur Length**, **Height** and **Arm Span**, and **Feet Size** and **Hand Length** show strong positive relationships with narrow confidence intervals, indicating high precision and significance.

Relationships like **Age** and **Penis Size** or **Weight** and **Penis Size** show very weak correlations, with wider confidence intervals that include zero, suggesting greater uncertainty and potential non-significance.
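
Whether an interval includes zero can be flagged programmatically instead of being read off the table. The sketch below uses two rows copied from the Iris results reported earlier in this notebook; the column `Significant (95%)` is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Two rows copied from the Iris results reported earlier in this notebook
demo = pd.DataFrame({
    "col1": ["sepal.length", "petal.length"],
    "col2": ["sepal.width", "petal.width"],
    "Inf CI (95%)": [-0.272696, 0.949052],
    "Sup CI (95%)": [0.043515, 0.972985],
})

# An interval excludes zero exactly when both bounds have the same sign
demo["Significant (95%)"] = (demo["Inf CI (95%)"] * demo["Sup CI (95%)"]) > 0
print(demo)
```

The same expression should work on `results_mansize`, since it carries the same `Inf CI (95%)` and `Sup CI (95%)` columns.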

In [None]:
sns.heatmap(mansize.corr(), annot=True)
plt.show()

## C. Independence test and categorical variables: Weather Dataset

In [None]:
uploaded = files.upload()

In [None]:
weather = pd.read_csv(io.BytesIO(uploaded['weather.csv']), sep=';')
weather

In [None]:
fig = make_subplots(rows=1, cols=3, subplot_titles=("Outlook Distribution", "Humidity Distribution", "Temperature Distribution"))

outlook_counts = weather['Outlook'].value_counts()
fig.add_trace(go.Bar(x=outlook_counts.index, y=outlook_counts.values, marker_color='skyblue', name='Outlook'), row=1, col=1)

humidity_counts = weather['Humidity'].value_counts()
fig.add_trace(go.Bar(x=humidity_counts.index, y=humidity_counts.values, marker_color='salmon', name='Humidity'), row=1, col=2)

temperature_counts = weather['Temperature'].value_counts()
fig.add_trace(go.Bar(x=temperature_counts.index, y=temperature_counts.values, marker_color='lightgreen', name='Temperature'), row=1, col=3)

fig.update_layout(height=400, width=900, showlegend=False, title_text="Distributions of Weather Data", title_x=0.5)

fig.show()

In [None]:
from scipy.stats import chi2_contingency

contingency_temperature_outlook = pd.crosstab(weather['Outlook'], weather['Temperature'])
chi2_temperature_outlook, p_val_temperature_outlook, dof_temperature_outlook, expected_temperature_outlook = chi2_contingency(contingency_temperature_outlook)

contingency_outlook_humidity = pd.crosstab(weather['Outlook'], weather['Humidity'])
chi2_outlook_humidity, p_val_outlook_humidity, dof_outlook_humidity, expected_outlook_humidity = chi2_contingency(contingency_outlook_humidity)

contingency_temperature_humidity = pd.crosstab(weather['Temperature'], weather['Humidity'])
chi2_temperature_humidity, p_val_temperature_humidity, dof_temperature_humidity, expected_temperature_humidity = chi2_contingency(contingency_temperature_humidity)

results_data = {
    'Test': ['Temperature vs Outlook', 'Outlook vs Humidity', 'Temperature vs Humidity'],
    'Chi2 Statistic': [chi2_temperature_outlook, chi2_outlook_humidity, chi2_temperature_humidity],
    'p-value': [p_val_temperature_outlook, p_val_outlook_humidity, p_val_temperature_humidity],
    'Degrees of Freedom': [dof_temperature_outlook, dof_outlook_humidity, dof_temperature_humidity],
    'Expected Frequencies': [expected_temperature_outlook, expected_outlook_humidity, expected_temperature_humidity]
}

results_weather = pd.DataFrame(results_data)
print(f'\nContingency Table: Temperature vs Outlook\n {contingency_temperature_outlook}')
print(f'\nContingency Table: Outlook vs Humidity\n {contingency_outlook_humidity}')
print(f'\nContingency Table: Temperature vs Humidity\n {contingency_temperature_humidity}')
results_weather

### Discussion

1. **Temperature vs. Outlook:** The p-value (0.2041) exceeds the 0.05 significance level, so the test finds **no statistically significant dependence** between Temperature and Outlook.

2. **Outlook vs. Humidity:** The p-value is close to zero, far below 0.05, which is **strong evidence of dependence** between Outlook and Humidity. The hypothesis that the two variables are independent is rejected.

3. **Temperature vs. Humidity:** The p-value (0.0352) is below 0.05, indicating a **statistically significant dependence** between Temperature and Humidity. The evidence is weaker than for Outlook vs. Humidity, but independence is still rejected at the 5% level.
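
The statistic behind these p-values compares observed counts with the counts expected under independence (row total times column total, divided by the grand total). A minimal sketch on a made-up 3x2 table, cross-checked against `chi2_contingency`; the counts are hypothetical and unrelated to the weather data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3x2 contingency table (counts are illustrative only)
observed = np.array([[20, 10],
                     [15, 15],
                     [5, 25]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts under independence: row total * column total / grand total
expected = row_totals * col_totals / total
chi2_manual = float(((observed - expected) ** 2 / expected).sum())

# SciPy applies the Yates continuity correction only when dof == 1 (2x2 tables),
# so for this 3x2 table the two statistics should agree
chi2_scipy, p_value, dof, expected_scipy = chi2_contingency(observed)
print(chi2_manual, chi2_scipy, p_value, dof)
```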

In [None]:
fig = go.Figure()

fig.add_trace(go.Bar(x=contingency_temperature_outlook.index, y=contingency_temperature_outlook['Cold'], name='Cold', marker_color='skyblue'))
fig.add_trace(go.Bar(x=contingency_temperature_outlook.index, y=contingency_temperature_outlook['Hot'], name='Hot', marker_color='orange'))
fig.add_trace(go.Bar(x=contingency_temperature_outlook.index, y=contingency_temperature_outlook['Mild'], name='Mild', marker_color='lightgreen'))

fig.update_layout(barmode='stack',
                  title="Distribution of Temperature by Outlook",
                  xaxis_title="Outlook",
                  yaxis_title="Number of Observations",
                  legend_title_text="Temperature",
                  height=600, width=900)

fig.show()

### Plots and Values Discussion

- **Foggy:** Few occurrences, with the majority of foggy days being cold.
- **Overcast:** Dominated by mild temperatures (36 cases), but there are also many cold days (19 cases).
- **Rainy:** Rainy conditions are more evenly spread between mild and hot temperatures, with some cold days as well.
- **Sunny:** Sunny conditions are mostly associated with hot (30 cases) and mild (34 cases) temperatures, but cold days are also observed (17 cases).

### Degrees of Freedom
For the Temperature vs. Outlook test, the contingency table has:
- 4 rows (Outlook categories: Foggy, Overcast, Rainy, Sunny);
- 3 columns (Temperature categories: Cold, Hot, Mild).

$\mathrm{dof} = (4 - 1) \times (3 - 1) = 3 \times 2 = 6$
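
The same count can be derived from the shape of any contingency table, and the p-value then follows from the upper tail of the chi-square distribution with that many degrees of freedom. A small sketch on a hypothetical 4x3 table (the counts are made up and are not the weather data):

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 4x3 table of counts (illustrative only)
table = np.array([[5, 1, 2],
                  [10, 12, 20],
                  [6, 9, 11],
                  [9, 15, 18]])

n_rows, n_cols = table.shape
dof_manual = (n_rows - 1) * (n_cols - 1)

stat, p_value, dof_scipy, _ = chi2_contingency(table)

# Upper-tail probability of the chi-square distribution with dof_manual degrees of freedom
p_manual = float(chi2.sf(stat, dof_manual))
print(dof_manual, dof_scipy, p_manual, p_value)
```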
