# **PHASE 3: ADVANCED ANALYTICS (Days 9-11)**

## **DAY 11 (19/01/26) - Statistical Analysis & ML Prep**

### **Section 1 - Learn**:

### **_1. Descriptive statistics_**

Descriptive statistics are the first step in **Exploratory Data Analysis (EDA)**. In Databricks, you can generate these summaries with built-in DataFrame methods that run as distributed Spark jobs, so they scale to very large tables.

##### **1. The Quick Look: `describe()`**

The `describe()` method is the fastest way to get a high-level overview of your numerical and string columns. It provides a fixed set of statistics: **count**, **mean**, **stddev**, **min**, and **max**.

```python
# Generate basic stats for all columns
df.describe().show()

# Focus on specific columns
df.describe("age", "salary").show()

```

##### **2. The Deep Dive: `summary()`**

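If you need more detail, specifically **percentiles**, use the `summary()` method. By default, it includes everything from `describe()` plus the **25%**, **50% (median)**, and **75%** quartiles. Spark computes these percentiles approximately, so treat them as close estimates rather than exact values.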

* **Custom Stats:** You can pass specific statistics as strings to `summary()` to limit the output.
* **Percentiles:** You can request any arbitrary percentile (e.g., `"95%"`, `"99%"`).

```python
# Default summary (includes quartiles)
df.summary().show()

# Custom summary for specific metrics
df.select("amount").summary("count", "min", "50%", "95%", "max").show()

```

##### **3. Manual Aggregations with `agg()`**

For production pipelines or custom metrics (like **Skewness** or **Kurtosis**), you should use the `agg()` function. This is more efficient than `summary()` because it doesn't calculate metrics you don't need.

```python
from pyspark.sql import functions as F

stats_df = df.agg(
    F.count("user_id").alias("total_users"),
    F.mean("purchase_value").alias("avg_spend"),
    F.stddev("purchase_value").alias("spend_volatility"),
    F.skewness("purchase_value").alias("spend_skew"),
    F.kurtosis("purchase_value").alias("spend_tail_risk")
)

```
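To make the two less familiar metrics concrete, here is a minimal plain-Python sketch of skewness and excess kurtosis. It assumes the population (biased) moment formulas; Spark's own estimators may differ in detail, so treat the numbers as illustrative. The `purchases` list and the `population_moments` helper are hypothetical.

```python
def population_moments(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, excess_kurt


# Hypothetical purchase values: mostly small orders plus one large outlier
purchases = [10.0, 12.0, 11.0, 13.0, 12.0, 95.0]
skew, kurt = population_moments(purchases)
print(f"skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

A positive skewness here reflects the long right tail created by the single large order.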

##### **4. Advanced Statistics with `stat`**

The `df.stat` attribute provides specialized statistical functions that aren't available in standard aggregations:

* **`crosstab(col1, col2)`**: Generates a frequency table (contingency table) for two columns.
* **`corr(col1, col2)`**: Calculates the Pearson correlation coefficient between two variables.
* **`cov(col1, col2)`**: Calculates the sample covariance.
* **`freqItems(cols)`**: Finds frequent items (approximate) in a large dataset using the Frequent Items algorithm.
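As a reference for what these functions measure, the following plain-Python sketch computes a contingency table, the sample covariance, and the Pearson correlation on small hypothetical lists. It is not the Spark implementation; the helper names and data are made up for illustration.

```python
import math
from collections import Counter


def sample_cov(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_corr(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))


def crosstab(col1, col2):
    """Frequency of each (col1, col2) value pair."""
    return Counter(zip(col1, col2))


# Hypothetical toy columns
price = [1.0, 2.0, 3.0]
revenue = [2.0, 4.0, 6.0]
region = ["EU", "EU", "US"]
device = ["web", "app", "web"]

print(sample_cov(price, revenue))
print(pearson_corr(price, revenue))
print(crosstab(region, device))
```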

##### **Summary of Descriptive Tools**

| Tool | Best For... | Output Format |
| --- | --- | --- |
| **`describe()`** | Quick sanity check | Row-based table (Standard stats) |
| **`summary()`** | Distribution analysis | Row-based table (includes Percentiles) |
| **`agg()`** | Custom/Production logic | Single-row DataFrame |
| **`df.stat`** | Relationships & Probabilities | Varies (Tables/Numbers) |



---

### **_2. Hypothesis testing_**

Hypothesis testing in Databricks allows you to move beyond simple descriptive stats to determine if your findings are **statistically significant** or just due to random chance.

In a distributed environment, you have two primary ways to perform these tests: using the native **Spark MLlib** (for massive datasets) or **SciPy/Statsmodels** (for smaller samples or aggregated results).

##### **1. Chi-Square Test (Independence)**

The Chi-Square test is the primary hypothesis test built directly into **PySpark MLlib**. It is used to determine if two categorical variables are independent (e.g., "Does a customer's region affect their preferred product category?").

* **Null Hypothesis ($H_0$):** The variables are independent (no relationship).
* **Alternative Hypothesis ($H_1$):** The variables are dependent (significant relationship).

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest

# Sample data: (label, features)
# label: categorical (e.g., Purchased=1, Browsed=0)
# features: Vector of categorical values (e.g., [Region_ID, Gender_ID])
data = [
    (0.0, Vectors.dense(0, 10)),
    (1.0, Vectors.dense(1, 20)),
    (1.0, Vectors.dense(1, 10)),
    (0.0, Vectors.dense(0, 20))
]
df = spark.createDataFrame(data, ["label", "features"])

# Run the test
result = ChiSquareTest.test(df, "features", "label").head()

print(f"p-values: {result.pValues}")
print(f"Degrees of Freedom: {result.degreesOfFreedom}")
print(f"Statistics: {result.statistics}")

```
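To see where the statistic comes from, here is a plain-Python sketch of the Pearson chi-square calculation for the first feature of the four sample rows above. It applies no continuity correction and uses a shortcut for the p-value that is valid only for 1 degree of freedom. The helper name is hypothetical and this is not Spark's implementation.

```python
import math
from collections import Counter


def chi_square_independence(pairs):
    """Pearson chi-square statistic and degrees of freedom for (x, y) pairs."""
    observed = Counter(pairs)
    row_totals = Counter(x for x, _ in pairs)
    col_totals = Counter(y for _, y in pairs)
    n = len(pairs)
    stat = 0.0
    for x in row_totals:
        for y in col_totals:
            # Expected count if x and y were independent
            expected = row_totals[x] * col_totals[y] / n
            stat += (observed[(x, y)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# (first feature value, label) for each of the four sample rows
pairs = [(0, 0.0), (1, 1.0), (1, 1.0), (0, 0.0)]
stat, dof = chi_square_independence(pairs)

# With 1 degree of freedom, P(chi2 > stat) equals erfc(sqrt(stat / 2))
p_value = math.erfc(math.sqrt(stat / 2))
print(f"statistic={stat:.2f}, dof={dof}, p-value={p_value:.4f}")
```

With only four rows the expected counts are far too small for the chi-square approximation to be reliable, so the example illustrates the arithmetic rather than a trustworthy test.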

##### **2. T-Tests (Comparing Means)**

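Spark MLlib does not have a native "distributed" T-test. For this, you typically filter (and, if needed, sample) your data in Spark, collect the values to the driver, and then use the **SciPy** library. Because `toPandas()` loads every selected row into driver memory, keep the collected data small enough to fit there.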

* **Independent T-Test:** Compares the means of two different groups (e.g., "Does Group A spend more than Group B?").
* **Paired T-Test:** Compares the same group at different times (e.g., "Before vs. After a marketing campaign").

```python
import scipy.stats as stats

# 1. Filter in Spark, then collect each group's values to the driver
#    (`group` is backtick-quoted because GROUP is a SQL keyword)
sales = spark.table("sales")
group_a = sales.filter("`group` = 'A'").select("amount").toPandas()["amount"]
group_b = sales.filter("`group` = 'B'").select("amount").toPandas()["amount"]

# 2. Run the test in Python
t_stat, p_val = stats.ttest_ind(group_a, group_b)

print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_val:.4f}")

```

##### **3. Bootstrapping (The "Spark Way")**

When you have billions of rows and traditional parametric tests (which assume approximately normal data or sampling distributions) are a poor fit, you can use **Bootstrapping**. This involves resampling your data with replacement thousands of times and recomputing the metric on each resample to build an empirical distribution of that metric.

* **Why use it?** It’s mathematically simpler for distributed systems and doesn't require your data to be "normally distributed."
* **Decision:** If your 95% confidence interval from bootstrapping does not contain the "null" value (e.g., 0), you reject the null hypothesis.
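The procedure can be sketched in plain Python as a percentile bootstrap for the difference in means between two groups. This is a minimal single-machine illustration with hypothetical data and a hypothetical helper name, not a distributed implementation.

```python
import random
from statistics import mean


def bootstrap_diff_ci(group_a, group_b, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for mean(group_b) - mean(group_a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original size
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(mean(resample_b) - mean(resample_a))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical spend values for two groups
group_a = [20, 22, 19, 24, 21, 23, 20, 22]
group_b = [27, 29, 26, 30, 28, 31, 27, 29]

lower, upper = bootstrap_diff_ci(group_a, group_b)
print(f"95% CI for the difference in means: [{lower:.2f}, {upper:.2f}]")
print("Reject the null" if lower > 0 or upper < 0 else "Fail to reject the null")
```

In Spark, one resample could be drawn with `df.sample(withReplacement=True, fraction=1.0)` and aggregated with `agg()`. Running thousands of such jobs is expensive, so bootstrapping a pre-aggregated or sampled dataset is a common compromise.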

##### **4. Summary of Test Selection**

| If you want to test... | Use this test... | Recommended Library |
| --- | --- | --- |
| **Independence of categories** | Chi-Square Test | `pyspark.ml.stat` |
| **Difference in means (2 groups)** | T-Test | `scipy.stats` (on sampled data) |
| **Difference in means (>2 groups)** | ANOVA | `scipy.stats` or `statsmodels` |
| **Relationship between numbers** | Pearson/Spearman | `pyspark.ml.stat.Correlation` |
| **Massive, non-normal data** | Bootstrapping | Custom Spark logic (sampling) |

##### **Interpreting the P-Value**

* **$p < 0.05$:** Reject the Null Hypothesis. There is a **statistically significant** difference/relationship.
* **$p \geq 0.05$:** Fail to reject the Null Hypothesis. The data do not provide enough evidence of an effect; any observed difference is consistent with **random noise**.

---

### **_3. A/B test design_**

Designing an **A/B Test** (or randomized controlled trial) in Databricks requires a combination of statistical rigor and engineering setup. Because Databricks is a unified platform, you can handle the entire lifecycle: from power analysis in a notebook to model serving and real-time tracking.

##### **1. The Design Phase (Notebooks)**

Before writing any code, you must establish the mathematical boundaries of your experiment.

* **Formulate a Hypothesis:** Use a specific format: *"If we change [Independent Variable], then [Dependent Variable/KPI] will increase because [Reasoning]."*
* **Power Analysis:** Determine your **Minimum Detectable Effect (MDE)** and required **Sample Size**.
  * **$\alpha$ (Significance):** Usually 0.05. The risk of a "False Positive."
  * **$1 - \beta$ (Power):** Usually 0.80. The probability of detecting a real effect if it exists.
  * **MDE:** The smallest change that is actually meaningful to your business (e.g., a 2% lift in conversion).
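As a rough illustration of how these three inputs determine sample size, the sketch below uses the standard normal-approximation formula for comparing two conversion rates with a two-sided test. The baseline rate, the absolute MDE, and the function name are assumptions chosen for the example; a dedicated power-analysis library would be the usual choice in practice.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per variant to detect an absolute lift."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


# Hypothetical scenario: 10% baseline conversion, detect a 2-point absolute lift
n_per_group = sample_size_per_group(baseline_rate=0.10, mde_abs=0.02)
print(f"Users needed per variant: {n_per_group}")
```

Halving the MDE roughly quadruples the required sample, which is why the MDE should be fixed before the test starts.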


##### **2. Implementation: User Allocation**

You must ensure that a user is **consistently** assigned to either Group A or Group B to avoid "pollution."

* **Hash-Based Allocation:** Instead of random numbers (which can change on refresh), hash the `user_id` and use the modulo operator. This ensures that `user_123` always sees the same variant.
* **Experiment Metadata:** Store the assignment in a Delta table. A common schema is `(user_id, experiment_name, variant, timestamp)`.

```python
import hashlib

def get_variant(user_id, experiment_name):
    # Salt the hash with the experiment name to avoid correlation between tests
    combined_id = f"{user_id}_{experiment_name}"
    hash_val = int(hashlib.md5(combined_id.encode()).hexdigest(), 16)
    return "B" if hash_val % 2 == 0 else "A"

```
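The same hashing idea extends to uneven splits by mapping each user to one of 100 buckets. The sketch below is one possible variation, not part of the original function: `get_bucket`, `get_variant_weighted`, `treatment_pct`, and the experiment name `checkout_test` are all hypothetical.

```python
import hashlib
from collections import Counter


def get_bucket(user_id, experiment_name, n_buckets=100):
    """Deterministically map a user to a bucket in [0, n_buckets)."""
    key = f"{user_id}_{experiment_name}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % n_buckets


def get_variant_weighted(user_id, experiment_name, treatment_pct=50):
    """Assign 'B' to roughly treatment_pct percent of users, 'A' to the rest."""
    return "B" if get_bucket(user_id, experiment_name) < treatment_pct else "A"


# The same user always lands in the same variant
assert get_variant_weighted("user_123", "checkout_test") == \
    get_variant_weighted("user_123", "checkout_test")

# Rough balance check for a 10% treatment rollout over simulated user IDs
counts = Counter(
    get_variant_weighted(f"user_{i}", "checkout_test", treatment_pct=10)
    for i in range(10_000)
)
print(counts)
```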

##### **3. Engineering: The "Silver" Experiment Table**

In your **Silver layer**, create a table that joins user events with their experiment assignments. This is where the actual analysis happens.

* **Key Columns:** `user_id`, `variant`, `metric_value` (e.g., total spend), `is_converted`.
* **Unity Catalog Tip:** Use **Tags** on your experiment tables to track ownership and "Start/End" dates.


##### **4. Model-Based A/B Testing (MLflow)**

If you are testing two Machine Learning models (e.g., two different recommendation algorithms), Databricks provides built-in **Traffic Splitting** in Model Serving.

* **MLflow Deployment:** You can deploy two models to a single "Served Endpoint" and tell Databricks to send 50% of traffic to `Model_A` and 50% to `Model_B`.
* **Inference Logging:** Enable **Inference Tables** to automatically record every prediction and the model ID used. You can later join this with "Ground Truth" data (actual purchases) to see which model performed better.


##### **5. Post-Test Analysis (SQL & Stats)**

Once the test duration is complete (at least 1-2 full business cycles):

* **Check for Sample Ratio Mismatch (SRM):** Ensure that the number of users in A and B matches the intended split (roughly equal for a 50/50 test). If it's 70/30, your allocation logic might be broken.
* **Calculate P-Value:** Use **SciPy** or **Statsmodels** in a notebook to check for significance.
* **Visualization:** Build a **Databricks SQL Dashboard** to track the lift in real-time. Use a "Counter" for the P-Value and a "Bar Chart" for the metric comparison.
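The balance check and the significance check can be sketched with the standard library alone. The counts below are hypothetical, and the helper names are made up for this example. The first function is a chi-square goodness-of-fit test on group sizes, and the second is a pooled two-proportion z-test on conversion rates.

```python
import math
from statistics import NormalDist


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value for the observed group sizes (1 dof)."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    return math.erfc(math.sqrt(chi2 / 2))


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment totals
n_a, conv_a = 5000, 500
n_b, conv_b = 5050, 580

srm_p = srm_p_value(n_a, n_b)
z, p_value = two_proportion_z_test(conv_a, n_a, conv_b, n_b)
print(f"SRM p-value: {srm_p:.3f}")
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

A very small SRM p-value suggests the allocation itself is broken, in which case the conversion comparison should not be trusted until the cause is found.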

##### **Best Practices Checklist**

* **Don't "Peek":** Avoid stopping a test early just because it looks significant at day 3. Wait until you reach your pre-determined sample size.
* **Run Simultaneously:** Never run Group A one week and Group B the next. External factors (holidays, sales) will ruin your data.
* **Standardize Metrics:** Use a single "Source of Truth" for metrics like `Revenue` across all experiments to ensure results are comparable.

---

### **_4. Feature engineering_**

**Feature Engineering** is the process of transforming raw data into "features" that better represent the underlying problem to machine learning models, thereby improving their accuracy.

In Databricks, feature engineering often mirrors the **Silver to Gold** transition in the Medallion architecture, where data is moved from a cleaned state to a highly optimized, domain-specific state.

##### **1. Common Feature Engineering Techniques**

Feature engineering can be categorized into four main processes:

| Process | Technique | Description |
| --- | --- | --- |
| **Transformation** | **Scaling / Normalization** | Adjusting numeric values to a common range (e.g., $0$ to $1$) so large numbers don't dominate the model. |
| **Encoding** | **One-Hot Encoding** | Converting categorical strings (like "Red", "Blue") into binary columns ($0$s and $1$s). |
| **Creation** | **Aggregation** | Creating features like `average_spend_30d` or `total_logins_last_week` using window functions. |
| **Extraction** | **PCA / Dimensionality** | Reducing the number of features by combining variables into principal components. |


##### **2. Practical PySpark Examples**

###### **A. Handling Categorical Data (StringIndexer & OneHotEncoder)**

Machine learning models require numerical input. You first convert strings to indices, then into binary vectors.

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# 1. Convert Category Name to Index (0, 1, 2...)
indexer = StringIndexer(inputCol="category", outputCol="category_index")
df_indexed = indexer.fit(df).transform(df)

# 2. Convert Index to Binary Vectors
encoder = OneHotEncoder(inputCol="category_index", outputCol="category_vec")
df_encoded = encoder.fit(df_indexed).transform(df_indexed)

```

###### **B. Time-Series Window Aggregations**

Creating features that capture trends over time is crucial for behavioral prediction (like churn).

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Define a window for each customer looking back 7 days
window_spec = Window.partitionBy("customer_id").orderBy(F.col("timestamp").cast("long")).rangeBetween(-7*86400, 0)

# Create a "Rolling Average Spend" feature
df_features = df.withColumn("avg_spend_7d", F.avg("amount").over(window_spec))

```

##### **3. The Databricks Feature Store**

In 2026, Databricks uses **Unity Catalog** as a built-in **Feature Store**. This allows you to:

* **Discover & Reuse:** Search for existing features created by other teams to avoid "reinventing the wheel."
* **Point-in-Time Joins:** Automatically join features based on the correct timestamp to prevent **data leakage** (accidentally using "future" information to train a model).
* **Lineage Tracking:** See exactly which models are using a specific feature.
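The idea behind a point-in-time join can be illustrated in plain Python: for each training event, use the most recent feature value recorded at or before the event time, and never a later one. This is a conceptual sketch with hypothetical data and a hypothetical helper, not the Databricks API.

```python
from bisect import bisect_right

# Hypothetical feature history for one customer, sorted by timestamp:
# (feature_timestamp, avg_spend_7d)
feature_history = [(100, 20.0), (200, 35.0), (300, 50.0)]


def feature_as_of(history, event_ts):
    """Return the latest feature value with timestamp <= event_ts, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_ts)
    return history[idx - 1][1] if idx > 0 else None


# An event at t=250 may only see the value computed at t=200
print(feature_as_of(feature_history, 250))
# An event before any feature exists gets no value instead of a "future" one
print(feature_as_of(feature_history, 50))
```

A naive join on `customer_id` alone would attach the value from `t=300` to the event at `t=250`, which is exactly the leakage the bullet above describes.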

```python
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

# Create a Feature Table in Unity Catalog
fe.create_table(
    name="catalog.schema.customer_features",
    primary_keys=["customer_id"],
    df=df_features,
    description="Customer behavioral and trend features"
)

```

##### **Best Practices**

* **Avoid Data Leakage:** Ensure your features only use data available *before* the time of the event you are trying to predict.
* **VectorAssembler:** Most Spark ML algorithms expect a single column called `features` containing a vector. Use `VectorAssembler` as the final step to combine all your engineered columns.
* **Automate with DLT:** Use **Delta Live Tables** to keep your feature tables updated incrementally as new raw data arrives.

---

### **Practice**

```python
from pyspark.sql import functions as F
```

```python
def load_ecommerce_dataset(Month_name):
    df = spark.read.csv(f"/Volumes/workspace/ecommerce/ecommerce_data/2019-{Month_name}.csv", header=True, inferSchema=True)
    return df
```

```python
# df_n = load_ecommerce_dataset("Nov")
df_o = load_ecommerce_dataset("Oct")
```

#### **1. Calculate statistical summaries** 

```python
# Basic summary: Count, Mean, Stddev, Min, Max
display(df_o.select("price").describe())
```

----

#### **2. Test hypotheses (weekday vs weekend)** 

##### **_Hypothesis: "The average price of items purchased on weekends is significantly higher than on weekdays."_**

```python
# 1 is Sunday, 7 is Saturday in PySpark
df_hypo = df_o.filter("event_type = 'purchase'") \
    .withColumn("day_of_week", F.dayofweek("event_time")) \
    .withColumn("is_weekend", F.when(F.col("day_of_week").isin(1, 7), "Weekend")
                               .otherwise("Weekday"))
```

```python
# View basic distribution
display(df_hypo.groupBy("is_weekend").agg(
    F.count("*").alias("total_purchases"),
    F.avg("price").alias("avg_price"),
    F.stddev("price").alias("std_dev")
))
```

```python
from scipy import stats
```

```python
# Extract the price arrays for both groups (collects all rows to the driver)
weekday_prices = df_hypo.filter("is_weekend = 'Weekday'").select("price").toPandas()["price"]
weekend_prices = df_hypo.filter("is_weekend = 'Weekend'").select("price").toPandas()["price"]
```

```python
# One-sided Welch's t-test (unequal variances), matching the hypothesis:
# H1 is "mean weekend price > mean weekday price", so weekend goes first.
# The `alternative` argument requires SciPy 1.6 or newer.
t_stat, p_value = stats.ttest_ind(
    weekend_prices, weekday_prices, equal_var=False, alternative="greater"
)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

if p_value < 0.05:
    print("Result: Statistically Significant. We REJECT the null hypothesis.")
else:
    print("Result: Not Significant. We FAIL TO REJECT the null hypothesis.")
```

---

#### **3. Identify Correlations**

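```python
# Pearson correlation between price and category_id.
# Caveat: category_id appears to be an identifier rather than a measurement,
# so this coefficient is hard to interpret; two numeric columns would be more meaningful.
df_hypo.stat.corr("price", "category_id")
```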

---

#### **4. Engineering Features for ML**

```python
from pyspark.sql.window import Window
```

```python
# Feature engineering
features = df_o.withColumn("hour", F.hour("event_time")) \
    .withColumn("day_of_week", F.dayofweek("event_time")) \
    .withColumn("price_log", F.log(F.col("price")+1)) \
    .withColumn("time_since_first_view",
        F.unix_timestamp("event_time") -
        F.unix_timestamp(F.first("event_time").over(Window.partitionBy("user_id").orderBy("event_time"))))
```

```python
# limit() keeps the result a DataFrame, which display() renders as a table;
# head(10) would return a plain Python list of Row objects instead
display(features.limit(10))
```

---

### **Resources**
- [ML](https://spark.apache.org/docs/latest/ml-guide.html)

----