<h1><center>Laboratory Work 10.</center></h1>
<h2><center>
    Flight Delays Prediction using Gradient Boosting
    <div>(Predicting flight departure delays with gradient boosting)</div>
</center></h2>

**Completed by:** Surname, Initials

**Variant:** No. __

Your task is to beat at least 2 benchmarks in this [Kaggle Inclass competition](https://www.kaggle.com/c/flight-delays-spring-2018/leaderboard). Below is a brief description of how the second benchmark was achieved using XGBoost. At this stage of the 'Intelligent Data Analysis' course, a quick look at the data should be enough for you to see that this is the type of task where gradient boosting performs well. Most likely it will be XGBoost; however, take into consideration that the dataset contains plenty of categorical features.

<figure>
  <img src="https://raw.githubusercontent.com/radiukpavlo/intelligent-data-analysis/refs/heads/main/03_img/10_5_xgboost-2.jpeg" align="center" width="40%" alt="XGBoost model visualization">
  <figcaption> 
    <a href="https://community.ultralytics.com/t/a-new-meme-2024-paris-olympics/160">Source</a>
  </figcaption>
</figure>

<a class="anchor" id="lab-10"></a>

## Outline

- [10.1. Data Exploration $\&$ Preprocessing with Gradient Boosting](#lab-10.1)
- [10.2. Feature Engineering $\&$ Dimensionality Reduction](#lab-10.2)
- [10.3. Hyperparameter Tuning $\&$ Model Evaluation](#lab-10.3)
- [10.4. Handling Missing Data, Outliers $\&$ Special Cases](#lab-10.4)
- [10.5. Advanced Ensemble Strategies $\&$ Interpretability](#lab-10.5)

In [1]:
import warnings

warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

In [2]:
train = pd.read_csv('https://raw.githubusercontent.com/radiukpavlo/intelligent-data-analysis/refs/heads/main/02_assignments/ida_lab-10_flight-delays-kaggle/flight_delays_train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/radiukpavlo/intelligent-data-analysis/refs/heads/main/02_assignments/ida_lab-10_flight-delays-kaggle/flight_delays_test.csv')

In [3]:
train.head()

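| | Month | DayofMonth | DayOfWeek | DepTime | UniqueCarrier | Origin | Dest | Distance | dep_delayed_15min |
|---|---|---|---|---|---|---|---|---|---|
| 0 | c-8 | c-21 | c-7 | 1934 | AA | ATL | DFW | 732 | N |
| 1 | c-4 | c-20 | c-3 | 1548 | US | PIT | MCO | 834 | N |
| 2 | c-9 | c-2 | c-5 | 1422 | XE | RDU | CLE | 416 | N |
| 3 | c-11 | c-25 | c-6 | 1015 | OO | DEN | MEM | 872 | N |
| 4 | c-10 | c-7 | c-6 | 1828 | WN | MDW | OMA | 423 | Y |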


In [4]:
test.head()

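| | Month | DayofMonth | DayOfWeek | DepTime | UniqueCarrier | Origin | Dest | Distance |
|---|---|---|---|---|---|---|---|---|
| 0 | c-7 | c-25 | c-3 | 615 | YV | MRY | PHX | 598 |
| 1 | c-4 | c-17 | c-2 | 739 | WN | LAS | HOU | 1235 |
| 2 | c-12 | c-2 | c-7 | 651 | MQ | GSP | ORD | 577 |
| 3 | c-3 | c-25 | c-7 | 1614 | WN | BWI | MHT | 377 |
| 4 | c-6 | c-6 | c-3 | 1505 | UA | ORD | STL | 258 |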


Given the flight departure time, carrier code, departure airport, destination airport, and flight distance, you have to predict whether the departure will be delayed by more than 15 minutes. As the simplest benchmark, let's take an XGBoost classifier and the two features that are easiest to use: `DepTime` and `Distance`. Such a model scores 0.68202 on the leaderboard (LB).

In [5]:
X_train = train[['Distance', 'DepTime']].values
y_train = train['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values
X_test = test[['Distance', 'DepTime']].values

X_train_part, X_valid, y_train_part, y_valid = train_test_split(
    X_train, y_train, test_size=0.3, random_state=17
)

We'll train XGBoost with default parameters on part of the data and estimate the holdout ROC AUC.

In [6]:
xgb_model = XGBClassifier(seed=17)

xgb_model.fit(X_train_part, y_train_part)
xgb_valid_pred = xgb_model.predict_proba(X_valid)[:, 1]

roc_auc_score(y_valid, xgb_valid_pred)

0.7002330590349366

Now we do the same with the whole training set, make predictions for the test set, and form a submission file. This is how you beat the first benchmark.

In [7]:
xgb_model.fit(X_train, y_train)
xgb_test_pred = xgb_model.predict_proba(X_test)[:, 1]

pd.Series(xgb_test_pred, name='dep_delayed_15min').to_csv(
    'xgb_2feat.csv', index_label='id', header=True
)

The second benchmark in the leaderboard was achieved as follows:

- Features `Distance` and `DepTime` were taken unchanged
- A feature `Flight` was created from features `Origin` and `Dest`
- Features `Month`, `DayofMonth`, `DayOfWeek`, `UniqueCarrier` and `Flight` were transformed with OHE (`LabelBinarizer`)
- Logistic regression and gradient boosting (XGBoost) were trained. XGBoost hyperparameters were tuned via cross-validation: first the hyperparameters responsible for model complexity were optimized, then the number of trees was fixed at 500 and the learning rate was tuned.
- Out-of-fold predicted probabilities were obtained with `cross_val_predict`. The logistic regression and gradient boosting predictions were combined linearly as $w_1 \cdot p_{logit} + (1 - w_1) \cdot p_{xgb}$, where $p_{logit}$ is the probability of class 1 predicted by logistic regression and $p_{xgb}$ is the same for XGBoost. The weight $w_1$ was selected manually.
- A similar combination of predictions was made for the test set.
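
The blending step described in the list above can be sketched as follows. This is a minimal illustration on synthetic data, not the author's actual pipeline: scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost so the block runs without extra dependencies, and the synthetic features, the weight grid, and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the encoded flight features (assumption).
X, y = make_classification(n_samples=600, n_features=8, random_state=17)

logit = LogisticRegression(max_iter=1000)
gbm = GradientBoostingClassifier(n_estimators=50, random_state=17)  # stand-in for XGBoost

# Out-of-fold probabilities of class 1 for every training row.
p_logit = cross_val_predict(logit, X, y, cv=5, method="predict_proba")[:, 1]
p_gbm = cross_val_predict(gbm, X, y, cv=5, method="predict_proba")[:, 1]

# Scan w1 on a coarse grid instead of picking it by hand (assumption).
weights = np.linspace(0.0, 1.0, 11)
scores = [roc_auc_score(y, w * p_logit + (1 - w) * p_gbm) for w in weights]
best_w1 = float(weights[int(np.argmax(scores))])
best_auc = max(scores)
```

Because the grid includes `w1 = 0` and `w1 = 1`, the best blend can never score below either model alone on the out-of-fold predictions. Whether that gain carries over to the test set still has to be checked on the leaderboard.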

Following the same steps is not mandatory. That’s just a description of how the result was achieved by the author of this assignment. Perhaps you might not want to follow the same steps, and instead, let’s say, add a couple of good features and train a random forest of a thousand trees. In any case, follow the tasks below to achieve the highest result.

<a class="anchor" id="lab-10.1"></a>

## <span style="color:blue; font-size:1.5em;">10.1. Data Exploration $\&$ Preprocessing with Gradient Boosting</span>

[Back to the outline](#lab-10)

### <span style='color:red; font-size:1.4em;'>Task 1</span>

---
**Variant 1:**  
Conduct an initial data cleaning and exploration pipeline on **flight_delays_train.csv** and **flight_delays_test.csv**, focusing on removing or correcting corrupted entries (e.g., invalid departure times). Then, split the dataset into training and validation subsets. Fit a simple XGBoost model using only the numerical features (`DepTime` and `Distance`) to get a baseline classification of whether a flight is delayed by 15+ minutes. Present overall accuracy and confusion matrices to measure how well the baseline model performs. Finally, discuss how data cleaning influenced model performance versus training on the raw dataset.

*Technical note:*  
Use `pandas` for data exploration (e.g., checking for anomalies in `Month`, `DayOfWeek`, or `DepTime` columns). Apply `train_test_split` from `sklearn.model_selection`. In XGBoost, keep parameters at defaults (e.g., `max_depth=6`, `learning_rate=0.3`) for simplicity. Evaluate performance via `accuracy_score` and confusion matrix from `sklearn.metrics`. Document how the confusion matrix changes if data is not cleaned thoroughly.

---

**Variant 2:**  
Parse the date-like columns (`Month`, `DayofMonth`, `DayOfWeek`) and reassign them into numeric or categorical encodings, ensuring consistency between **flight_delays_train.csv** and **flight_delays_test.csv**. Then, systematically remove any rows with out-of-range or missing `DepTime` values. Train a basic gradient boosting classifier (XGBoost) on the cleaned dataset, using `DepTime`, `Distance`, and the newly encoded date features. Investigate if the time-of-day aspect (morning, afternoon, or evening flights) correlates with higher or lower delay rates. Compare the new model’s performance to a baseline ignoring date features, reporting differences in AUC or F1-score.

*Technical note:*  
Convert `Month`, `DayofMonth`, `DayOfWeek` to either one-hot vectors or cyclical encodings (e.g., a sine/cosine transformation if desired). For time-of-day categorization, create bins such as `0–600` = Early Morning, `601–1200` = Morning, `1201–1800` = Afternoon, `>1800` = Night. Use `xgboost.XGBClassifier` with default parameters or minimal tuning (`max_depth=4`, `n_estimators=100`). Evaluate with `roc_auc_score` or `f1_score`.
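
A minimal sketch of the parsing and binning described in this note, using the first five training rows shown earlier. The column name `DayPart` is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

# First rows of the training set, as shown by train.head() above.
df = pd.DataFrame({
    "Month": ["c-8", "c-4", "c-9", "c-11", "c-10"],
    "DayofMonth": ["c-21", "c-20", "c-2", "c-25", "c-7"],
    "DayOfWeek": ["c-7", "c-3", "c-5", "c-6", "c-6"],
    "DepTime": [1934, 1548, 1422, 1015, 1828],
})

# Strip the "c-" prefix and convert the codes to integers.
for col in ["Month", "DayofMonth", "DayOfWeek"]:
    df[col] = df[col].str.slice(2).astype(int)

# Bin DepTime (HHMM) into the four dayparts from the note.
# The open upper bound keeps any value above 1800 in "Night".
df["DayPart"] = pd.cut(
    df["DepTime"],
    bins=[0, 600, 1200, 1800, np.inf],
    labels=["Early Morning", "Morning", "Afternoon", "Night"],
    include_lowest=True,
)
```

Applying exactly the same steps to the test frame keeps the encodings consistent between the two files.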

---

**Variant 3:**  
Analyze the distribution of flight distances in **flight_delays_train.csv**. If certain entries exceed a plausible maximum for domestic flights, either remove or re-check those outliers to see if they are input errors. Next, standardize numeric features (e.g., `Distance` scaled by `StandardScaler`) but leave time-based features in their raw form. Train a gradient boosting model to classify delayed flights, and see if standardizing `Distance` improves accuracy or causes any negative effect on interpretability. Highlight how scaling might or might not help tree-based models.

*Technical note:*  
Perform outlier detection by capping `Distance` at, for instance, the 99th percentile or a domain-based threshold. Use `sklearn.preprocessing.StandardScaler` for numeric scaling. In XGBoost or another gradient boosting tool (LightGBM/CatBoost are also possible, but XGBoost suffices), measure performance with `accuracy_score` and `precision_score`. Summarize whether standardization changes variable importance or yields minimal effect (since boosting trees often handle unscaled numeric data adequately).
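
The percentile cap mentioned in this note might look like the sketch below. The distances are randomly generated toy values with one implausible entry appended, not real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(17)
# Toy distances in miles plus one implausible entry (assumption, not real data).
distance = pd.Series(
    np.append(rng.integers(100, 2500, size=199), 50_000), name="Distance"
)

cap = distance.quantile(0.99)               # 99th-percentile threshold
distance_capped = distance.clip(upper=cap)  # values above the cap are set to it
n_capped = int((distance > cap).sum())      # how many rows were affected
```

Reporting `n_capped` alongside the model scores makes it easier to judge whether the cap removed input errors or legitimate long flights.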

---

**Variant 4:**  
Check for duplicate rows in **flight_delays_train.csv**—for instance, repeated flights with the same `Month`, `DayofMonth`, `DayOfWeek`, `DepTime`, and `Origin-Dest` pair. Remove duplicates and compare how a gradient boosting classifier trained on the deduplicated dataset performs against the original dataset. Additionally, implement a random 5-fold cross-validation approach to gauge if deduplication influences model stability. Conclude if duplicate records were harmful or beneficial for predictive accuracy.

*Technical note:*  
Leverage `df.drop_duplicates()` on relevant feature subsets (excluding the label) to remove repeated instances. Use `sklearn.model_selection.cross_val_score` with 5-fold splits on the training set. Evaluate via `accuracy_score`, `f1_score`, or `roc_auc_score`. Present both the cross-validation mean score and standard deviation to measure stability. Keep default XGBoost parameters or minimal tuning (like `n_estimators=200`, `learning_rate=0.1`).
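
A small sketch of the deduplication step on a toy frame. The rows are illustrative, and `key_cols` is an assumed choice of identifying columns.

```python
import pandas as pd

# Toy frame with one repeated flight (rows 0 and 2); values are illustrative.
df = pd.DataFrame({
    "Month": ["c-8", "c-4", "c-8", "c-9"],
    "DayofMonth": ["c-21", "c-20", "c-21", "c-2"],
    "DayOfWeek": ["c-7", "c-3", "c-7", "c-5"],
    "DepTime": [1934, 1548, 1934, 1422],
    "Origin": ["ATL", "PIT", "ATL", "RDU"],
    "Dest": ["DFW", "MCO", "DFW", "CLE"],
    "dep_delayed_15min": ["N", "N", "Y", "N"],
})

key_cols = ["Month", "DayofMonth", "DayOfWeek", "DepTime", "Origin", "Dest"]

# Rows that repeat an earlier row on the key columns (label excluded).
n_dupes = int(df.duplicated(subset=key_cols).sum())
# Duplicate groups whose labels disagree are worth inspecting before dropping.
conflicting = int(df.groupby(key_cols)["dep_delayed_15min"].nunique().gt(1).sum())
deduped = df.drop_duplicates(subset=key_cols, keep="first").reset_index(drop=True)
```

Note that `keep="first"` silently resolves conflicting labels in favour of the first occurrence, which is itself a modelling decision to document.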

---

**Variant 5:**  
Investigate missing values or uninitialized placeholders in the `Month`, `DayofMonth`, `DayOfWeek`, or `DepTime` columns. For each column, compare discarding rows with missing data to an imputation strategy (such as replacing missing times with the median departure time). Then fit a gradient boosting model to see which approach yields better predictive performance. Reflect on why a certain approach (imputation vs. dropping) might work best, given flight data patterns.

*Technical note:*  
Identify null or placeholder values (like “c-0” or obviously invalid numeric codes). Test both removing those rows and performing an imputation (e.g., median `DepTime`). Evaluate each approach with `roc_auc_score` or `f1_score` on a hold-out validation set. For the gradient boosting model, specify `n_estimators=150` and a moderate `max_depth=5`. Summarize how missing data strategies affect model performance.

---

**Variant 6:**  
Focus on data partitioning strategies: randomly split **flight_delays_train.csv** or perform a time-based split (e.g., flights from earlier months as the training set, later months as the validation set). Then train an XGBoost classifier with identical parameters on both splits. Compare performance metrics such as AUC or accuracy to see if a time-based split yields more realistic predictions for future flights. Discuss the concept of data leakage in random splits vs. chronological splits in flight delay scenarios.

*Technical note:*  
Implement two splits: (1) `train_test_split` with `shuffle=True`, (2) a chronological approach (e.g., first 70% of data by date for training, last 30% for validation). Fit the same `XGBClassifier`. Evaluate `roc_auc_score` on the validation portion. Emphasize differences in how well the model generalizes to “future” flights in the second approach.  
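
One way to approximate the chronological split is sketched below. The head of the data shows no year column, so the sketch assumes all rows can be ordered within a single calendar year, and that `Month` and `DayofMonth` have already been converted to integers.

```python
import pandas as pd

# Toy rows with numeric Month / DayofMonth (assumption: "c-" prefix already stripped).
df = pd.DataFrame({
    "Month": [8, 4, 9, 11, 10, 1, 12, 3, 6, 2],
    "DayofMonth": [21, 20, 2, 25, 7, 5, 30, 14, 9, 28],
    "DepTime": [1934, 1548, 1422, 1015, 1828, 600, 2300, 930, 1200, 1745],
})

# Order flights by calendar position; treating all rows as one year is an assumption.
ordered = df.sort_values(["Month", "DayofMonth", "DepTime"]).reset_index(drop=True)

cut = int(len(ordered) * 0.7)
train_part = ordered.iloc[:cut]   # earliest 70% of flights
valid_part = ordered.iloc[cut:]   # latest 30% of flights
```

With this split the validation months never appear in training, so month-level features cannot help the model on the validation set the way they can under a random split.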

---

**Variant 7:**  
Combine exploratory data analysis with visualizations. Plot the distribution of departure times and see if the proportion of delayed flights changes throughout the day. Then remove extreme values (e.g., flights departing after 2359 or coded incorrectly) and train a gradient boosting model. Provide a side-by-side comparison of the AUC or F1-score before and after removing suspicious time entries. Present any shifts in departure-time-based delay trends.

*Technical note:*  
Use `matplotlib` or `seaborn` to plot a histogram of `DepTime` vs. the percentage of delays (`dep_delayed_15min`). Filter out improbable times (e.g., “2424” or negative times). In XGBoost, tune parameters slightly (e.g., `max_depth=3`, `learning_rate=0.05`, `n_estimators=300`). Evaluate with `f1_score`, and highlight how discarding erroneous times can clarify daily delay patterns.

---

**Variant 8:**  
Engineer a new column indicating whether the flight occurred on a public holiday or major event day by cross-referencing an external holiday calendar. (Since we can’t import new data in practice, you can illustrate the concept by randomly tagging certain days as “holiday.”) Then see if adding this holiday indicator to the training features improves gradient boosting classification performance. Discuss how real holiday data might be integrated in a production environment and whether such external data typically helps.

*Technical note:*  
Create a synthetic `IsHoliday` feature with a small random subset of days flagged as holiday. Convert that to a binary (0/1) or boolean column. Use `xgboost.XGBClassifier` with standard or lightly tuned parameters. Compare `accuracy_score` or `f1_score` with and without `IsHoliday`. Summarize the potential benefit of real external features in flight delay prediction.

---

**Variant 9:**  
Divide **flight_delays_train.csv** into training subsets based on origin airport—e.g., one subset for big hubs (like ATL, ORD, DFW) and another for smaller airports. Inspect whether the distribution of departure times, distances, or delay rates differs significantly between these subsets. Then train a gradient boosting model on each subset separately and compare results to a single “global” model trained on the combined data. Address the possibility of building specialized models per airport category vs. a universal approach.

*Technical note:*  
Use `groupby('Origin')` or custom classification of airports into “large” vs. “small.” For the separate models, filter the dataset and train an XGBoost instance per group. Evaluate validation performance or cross-validation AUC. Compare to a single model trained on the entire dataset. Document findings on whether specialized models for large hubs might outperform a single one-size-fits-all approach.

---

**Variant 10:**  
Perform label-encoding or one-hot encoding on categorical features such as `UniqueCarrier`, `Origin`, and `Dest`. Compare the effect of each encoding on gradient boosting results. For instance, one-hot encoding might create many sparse columns, while label-encoding assigns numeric IDs. Evaluate memory usage, training time, and classification performance for both approaches. Conclude which encoding is the best compromise for flight delay prediction.

*Technical note:*  
For one-hot, use `pandas.get_dummies()` or `sklearn.preprocessing.OneHotEncoder`; for label-encoding, use `sklearn.preprocessing.LabelEncoder`. Train an XGBoost classifier with moderate settings (`n_estimators=200`, `learning_rate=0.1`). Track memory usage with `df.memory_usage()` or external tools. Assess performance via `f1_score`. Summarize the trade-offs in high-cardinality columns (like `Origin`/`Dest`).
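
The two encodings can be compared on a toy frame as in the sketch below. Pandas category codes are used here as a stand-in for `LabelEncoder`, since both assign integer IDs in sorted category order; the rows come from the training head shown earlier.

```python
import pandas as pd

df = pd.DataFrame({
    "UniqueCarrier": ["AA", "US", "XE", "OO", "WN"],
    "Origin": ["ATL", "PIT", "RDU", "DEN", "MDW"],
    "Dest": ["DFW", "MCO", "CLE", "MEM", "OMA"],
})
cat_cols = ["UniqueCarrier", "Origin", "Dest"]

# One-hot: one indicator column per distinct category.
one_hot = pd.get_dummies(df[cat_cols])

# Integer codes: one column per feature (stand-in for LabelEncoder).
label_encoded = df[cat_cols].apply(lambda s: s.astype("category").cat.codes)

# Memory footprint of each representation, in bytes.
one_hot_bytes = int(one_hot.memory_usage(deep=True).sum())
label_bytes = int(label_encoded.memory_usage(deep=True).sum())
```

On five rows the byte counts say little; the comparison becomes meaningful on the full training set, where `Origin` and `Dest` produce far more one-hot columns.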

---

**Variant 11:**  
Explore how combining rare categories in `UniqueCarrier` or `Origin` can reduce data fragmentation. For instance, group carriers with fewer than a threshold number of flights into a single “Other” category. Train a gradient boosting model on both the original dataset (with all carriers separately) and the grouped dataset. Compare whether grouping improves or worsens model performance, and whether interpretability becomes easier with fewer categories.

*Technical note:*  
Create a function that counts occurrences of each category, then merges categories below a chosen threshold (e.g., 500 flights) into “Other.” Re-run XGBoost with identical hyperparameters. Evaluate AUC or F1. Consider whether removing very rare categories might help the model generalize better, especially for small carriers with sporadic flights.
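
A possible shape for the grouping function described in this note. The name `group_rare` and the toy carrier counts are assumptions.

```python
import pandas as pd

def group_rare(series: pd.Series, min_count: int, other: str = "Other") -> pd.Series:
    """Replace categories seen fewer than `min_count` times with `other`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), other)

# Toy carrier column: two frequent carriers and two that appear once.
carriers = pd.Series(["WN"] * 5 + ["AA"] * 4 + ["XE"] * 1 + ["YV"] * 1)
grouped = group_rare(carriers, min_count=2)  # the note suggests e.g. 500 on real data
```

To keep train and test consistent, the set of rare categories would be derived from the training data only and then reused when transforming the test set.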

---

**Variant 12:**  
Generate pairwise interaction features from existing columns, such as multiplying `DayOfWeek` by `UniqueCarrier` or combining `DepTime` bins with `Month`. Then check if these additional interaction features significantly boost gradient boosting performance. You might need to limit the total number of new features if dimensionality expands excessively. Demonstrate how to do a simple correlation or feature importance check to see if these interactions are meaningful.

*Technical note:*  
Select a few plausible interactions, e.g., `(DayOfWeek, UniqueCarrier)` or `(DepTimeCategory, Month)`. Encode them properly (like `str(DayOfWeek) + '_' + UniqueCarrier` for a new categorical). Use XGBoost’s `feature_importances_` or `xgb.plot_importance` to measure the newly created features’ impact. Evaluate changes in `roc_auc_score` or `f1_score`.

---

**Variant 13:**  
Extend numeric features by using polynomial transformations, for instance squaring `Distance` or cross-multiplying `Distance` with `DepTime`. Since gradient boosting sometimes picks up complex patterns naturally, compare the model’s performance with vs. without these polynomial expansions. Check if feature engineering with polynomials is redundant or beneficial for flight delay classification.

*Technical note:*  
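Use `sklearn.preprocessing.PolynomialFeatures` with `degree=2` on numeric fields. Note that `interaction_only=True` keeps only cross terms such as `Distance * DepTime` and drops squared terms such as `Distance ** 2`, so leave it at the default `False` if you want the squares as well. Then feed these into XGBoost. Evaluate training time, memory usage, and `precision_score` or `recall_score`. Summarize whether the model’s complexity soared and if performance gains are worth the added dimensionality.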

---

**Variant 14:**  
Create a binary feature reflecting whether the flight is departing from a “major airport” (for example, select the top five busiest `Origin` airports in the dataset). Compare classification performance in a gradient boosting model that uses this new binary feature against a baseline model. Evaluate if the busiest airports face more systematic delays that the model can learn.

*Technical note:*  
Identify the top five or top ten `Origin` airports by flight count. Create a new column like `isMajorOrigin = 1 if Origin in top_airports else 0`. Train XGBoost with `n_estimators=300` and `max_depth=6`. Compare `f1_score` or `roc_auc_score` for the baseline (no `isMajorOrigin`) vs. the new feature. Discuss the interpretability gains from such a feature.

---

**Variant 15:**  
Handle potential data imbalance (if “Y” is less frequent than “N” for `dep_delayed_15min`) by comparing the effect of class-weighting in XGBoost. Experiment with the `scale_pos_weight` parameter to see whether it improves recall or F1-score. Summarize potential pitfalls of adjusting the class balance, such as over-penalizing negative classes or inflating false positives.

*Technical note:*  
Calculate the ratio of non-delayed to delayed flights in **flight_delays_train.csv** and use it to set `scale_pos_weight`. For instance, if there is one delayed flight for every five non-delayed flights (1:5), set `scale_pos_weight=5`. Compare performance with a baseline (no weighting). Evaluate `precision`, `recall`, and the confusion matrix to see trade-offs between capturing more delayed flights vs. false alarms.
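
The weight can be computed directly from the label counts, as in this sketch with toy labels. The heuristic of dividing the negative count by the positive count follows the XGBoost parameter documentation.

```python
import numpy as np

# Toy labels: one delayed flight (1) per five on-time flights (0); illustrative only.
y = np.array([1] + [0] * 5)

n_pos = int((y == 1).sum())   # delayed flights
n_neg = int((y == 0).sum())   # on-time flights

# Common heuristic: sum(negative instances) / sum(positive instances).
scale_pos_weight = n_neg / n_pos
```

The resulting value would then be passed as `XGBClassifier(scale_pos_weight=scale_pos_weight)` and compared against the unweighted baseline.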

---

**Variant 16:**  
Perform target encoding on `UniqueCarrier`, `Origin`, and `Dest`. For each of these columns, replace the category with the mean delay rate observed in training data. Then feed these target-encoded features into a gradient boosting model. Use a careful cross-validation approach to avoid overfitting on the target encoding. Show if target encoding outperforms one-hot encoding in terms of AUC or F1-score, especially if categories are numerous.

*Technical note:*  
Implement a K-fold target encoding scheme: for each fold, compute mean delay rate for categories based on other folds, then apply. This prevents data leakage. Evaluate `roc_auc_score` or `f1_score`. In XGBoost, keep typical parameters like `max_depth=4`, `learning_rate=0.1`, `n_estimators=200`. Summarize pros/cons of target encoding for high-cardinality categorical features.
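
A minimal sketch of the out-of-fold scheme described in this note. The function name, the fallback to the fold's global mean for unseen categories, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(cat: pd.Series, y: pd.Series,
                        n_splits: int = 5, seed: int = 17) -> pd.Series:
    """Out-of-fold mean of y per category; unseen categories get the global mean."""
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        fit_cat, fit_y = cat.iloc[fit_idx], y.iloc[fit_idx]
        means = fit_y.groupby(fit_cat).mean()   # delay rate per category, other folds only
        fallback = fit_y.mean()                 # used when a category is unseen in those folds
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(means).fillna(fallback).to_numpy()
    return encoded

# Toy carrier codes and binary delay labels (illustrative).
carrier = pd.Series(["AA", "WN", "AA", "US", "WN", "AA", "US", "WN", "AA", "WN"])
delayed = pd.Series([0, 1, 0, 0, 1, 1, 0, 1, 0, 0])
carrier_te = kfold_target_encode(carrier, delayed, n_splits=5)
```

Each row's encoding is computed without its own label, which is what prevents the leakage. For the test set, the category means would be taken from the whole training set.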

---

**Variant 17:**  
Assess whether engineering a “late-arrival risk” feature helps. Create a new feature by referencing how many minutes a flight is *already delayed* based on the same `Origin` in the preceding hour (e.g., approximate concurrency). Since we cannot truly introduce new data, you can simulate this by a group-shift approach: for each flight, consider the average delay status of flights departing from the same airport within the prior hour block in the training set. Then see if this “concurrent delay” feature improves XGBoost classification, reflecting a chain reaction of delays.

*Technical note:*  
Sort flights by departure time within each `Origin`. For each row, compute the average delay label of the preceding flights in the last hour (within the training set). Store that as `recentDelayRate`. Mind potential data leakage by ensuring this is only computed from prior flights in chronological order. Evaluate `f1_score` or `recall_score` to see if it captures patterns of knock-on delays.

---

**Variant 18:**  
Create a simplified weather indicator column, labeling flights that depart from an airport known (synthetically) to have a higher chance of weather delays. For instance, if `Origin` is one of [“ORD”, “JFK”, “EWR”, “BOS”], mark it as a “weather-critical” zone. Investigate if gradient boosting learns that certain airports suffer more from adverse weather. Evaluate the difference in feature importance for the “weather-critical” flag vs. raw `Origin`.

*Technical note:*  
Make a new binary column, e.g. `weatherCritical = 1` for designated airports, else 0. Fit XGBoost with default or minimal hyperparameters. Inspect `model.feature_importances_` or use `xgb.plot_importance(model)`. Track any changes in `precision`, `recall`, or `auc`. Summarize whether the new feature displaces or complements the original `Origin` in importance.

---

**Variant 19:**  
Subset the dataset to only flights between specific city pairs (e.g., `ORD` to `LGA`, `ATL` to `DFW`, etc.). Compare how a gradient boosting classifier fits each route in isolation versus the entire training set. Then examine whether combining route-specific models in an ensemble improves overall performance. Discuss potential practical challenges of managing multiple route-based models.

*Technical note:*  
Filter the training data for each pair (e.g., `Origin='ORD', Dest='LGA'`). Train a separate XGBoost on each route. For the test set, if a flight matches that route, use the route’s specialized model predictions. Otherwise, fall back to a global model. Evaluate the combined approach’s accuracy or F1. Summarize how data volume might hamper route-specific modeling if certain routes are infrequent.

---

**Variant 20:**  
Compare simple data cleaning (dropping rows with invalid `DepTime` or `Distance`) to a more aggressive approach that also filters out flights departing at improbable hours or having extremely long distances. Fit gradient boosting models on each cleaned version and measure differences in performance. Conclude if strict cleaning yields a more robust model or leads to data under-representation, especially for less common flights.

*Technical note:*  
Create multiple cleaning pipelines: (1) remove only obviously erroneous rows, (2) remove also borderline valid flights. Keep the same XGBoost hyperparameters for both. Evaluate with `f1_score` or `accuracy_score`. Reflect on potential bias introduced by discarding valid but unusual flights. Show numeric results of how many flights remain after each cleaning step.

---

<a class="anchor" id="lab-10.2"></a>

## <span style="color:blue; font-size:1.5em;">10.2. Feature Engineering $\&$ Dimensionality Reduction</span>

[Back to the outline](#lab-10)

### <span style='color:red; font-size:1.4em;'>Task 2</span>

---
**Variant 1:**  
Focus on extracting more informative time-based features from `DepTime`. Convert each `DepTime` into separate columns: `DepHour` (integer hour), `DepMinute` (minutes in the hour), and a binary flag for early-morning flights (e.g., `DepHour < 6`). Then feed these new columns into an XGBoost classifier while dropping the original `DepTime`. Observe whether splitting `DepTime` into smaller pieces boosts performance. Provide partial dependence plots or feature importances to see how departure hour influences predictions.

*Technical note:*  
Using `pandas`, transform `DepTime` into `DepHour = DepTime // 100` and `DepMinute = DepTime % 100`. Also define an indicator column: `isEarlyMorning = (DepHour < 6)`. Train with `XGBClassifier(n_estimators=300, max_depth=5)`. Evaluate `roc_auc_score`. Use `xgboost.plot_importance(model)` or partial dependence from `sklearn.inspection` to interpret how the hour of departure affects delay likelihood.

---

**Variant 2:**  
Utilize a dimensionality reduction technique (e.g., PCA or TruncatedSVD) after one-hot encoding the categorical variables `UniqueCarrier`, `Origin`, and `Dest`. Compare classification performance for the full set of one-hot vectors vs. the top K principal components. Present metrics of how many PCA components are needed to retain most variance and how that translates into improved or reduced gradient boosting performance.

*Technical note:*  
After `pd.get_dummies()` on the train set’s categorical features, you might end up with dozens or hundreds of columns. Apply `sklearn.decomposition.PCA` or `TruncatedSVD(n_components=K)` for dimension reduction (especially if data is sparse). Train XGBoost on the transformed dataset. Evaluate both speed and accuracy. Summarize the trade-off between compression and potential signal loss.

---

**Variant 3:**  
Engineer a `weekend_indicator` feature that is 1 if `DayOfWeek` is Saturday or Sunday, else 0. Additionally, create a `peak_travel_indicator` for the busiest travel days in the dataset (e.g., day-of-month 15–17). Insert these features into an XGBoost classifier and measure how feature importance changes. Present how the model performance shifts if you drop these new indicators, highlighting whether weekend or mid-month patterns are relevant to flight delays.

*Technical note:*  
Use `DayOfWeek.isin([6, 7])`, or `DayOfWeek.isin([5, 6])` if your dataset codes Sunday differently (a plain Python `in` does not work element-wise on a pandas column). Also define `peak_travel_indicator` based on domain knowledge or by counting flights per day-of-month. Fit `xgb_model = XGBClassifier(n_estimators=250, learning_rate=0.05)`. Check feature importances. Evaluate with `f1_score`. Conclude how these straightforward flags might capture recurring patterns.

---

**Variant 4:**  
Perform binning on the `Distance` feature to create categories (e.g., short-haul, medium-haul, long-haul). Compare performance using binned categories vs. the continuous numeric `Distance`. Then combine `Distance` categories with `DepHour` in a cross-feature (e.g., short-haul-night, short-haul-day, etc.). Fit a gradient boosting model and see if these cross-features add predictive power or lead to overfitting.

*Technical note:*  
Define distance bins—e.g., short < 400 miles, 400–800 miles = medium, >800 miles = long. Encode them as one-hot. Optionally create an interaction with daypart (morning/afternoon/evening). Use XGBoost with moderate hyperparameters (`max_depth=4`, `subsample=0.8`). Compare `accuracy_score` or `roc_auc_score`. Summarize the effect on feature importances and whether the bins are more interpretable than raw distance.

---

**Variant 5:**  
Leverage the date features (`Month`, `DayofMonth`, `DayOfWeek`) to create a cyclical representation, e.g., mapping them onto sine/cosine transformations to reflect cyclical nature (days of week repeating, months repeating). Evaluate if cyclical encoding helps the gradient boosting model identify repeating patterns. Compare it to standard integer or one-hot encoding of these time features.

*Technical note:*  
For day-of-week, define `day_sin = sin(2π × DayOfWeek/7)` and `day_cos = cos(2π × DayOfWeek/7)`. Similarly for month if you want to treat it as cyclical. Use XGBoost with fixed hyperparameters (`n_estimators=250`, `learning_rate=0.1`). Compare `f1_score` or `precision_score` for cyclical vs. standard encoding. Analyze if cyclical features rank higher in importance or yield improved performance.
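
The sine/cosine mapping from this note can be sketched as follows. The helper name `add_cyclical` is an assumption, and the columns are assumed to be numeric already.

```python
import numpy as np
import pandas as pd

# Toy numeric codes (assumption: "c-" prefix already stripped).
df = pd.DataFrame({
    "DayOfWeek": [1, 2, 3, 4, 5, 6, 7],
    "Month": [1, 3, 6, 7, 9, 10, 12],
})

def add_cyclical(frame: pd.DataFrame, col: str, period: int) -> pd.DataFrame:
    """Append sine/cosine columns for an integer feature with the given period."""
    angle = 2 * np.pi * frame[col] / period
    out = frame.copy()
    out[f"{col}_sin"] = np.sin(angle)
    out[f"{col}_cos"] = np.cos(angle)
    return out

df = add_cyclical(df, "DayOfWeek", period=7)
df = add_cyclical(df, "Month", period=12)
```

Every value lands on the unit circle, so day 7 and day 1 end up as neighbours, which an integer encoding does not express.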

---

**Variant 6:**  
Generate a combined location feature that merges `Origin` and `Dest` into a single route code (e.g., “JFK->ORD”). Then convert this route code into numeric or one-hot form. Train a gradient boosting model with the route feature plus `Distance` to see if it captures route-specific delay tendencies. Compare results to a model that uses `Origin` and `Dest` as separate columns.

*Technical note:*  
Create `df['Route'] = df['Origin'] + '->' + df['Dest']`. One-hot encode or label-encode. Evaluate an XGBoost classifier with limited complexity (like `max_depth=3`) to see if route-level patterns significantly impact performance. Summarize any improvements in `roc_auc_score` or `f1_score`. Check if route is among top features in `model.feature_importances_`.

---

**Variant 7:**  
Apply a rank-based transformation on the numeric columns (e.g., `Distance`), converting them to their percentile ranks in the training set. Then feed these rank-transformed variables into the gradient boosting model to see if it helps handle outliers or heavy-tailed distributions. Evaluate if the rank approach outperforms raw numeric or standard scaling for flight delay classification.

*Technical note:*  
Use `pandas.Series.rank(pct=True)` to transform, for instance, `Distance` into a [0,1] percentile measure. For gradient boosting, keep standard hyperparameters. Compare `f1_score` and `precision_recall_fscore_support` for each transformation method: raw, scaled, rank-based. Summarize how rank-based transformations can mitigate the impact of extreme values.
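
A sketch of the rank transformation, using the `Distance` values from the train and test heads shown earlier. Ranking the test rows against the training distribution via `np.searchsorted` is an assumed design choice, not something the note prescribes.

```python
import numpy as np
import pandas as pd

train_dist = pd.Series([732, 834, 416, 872, 423], name="Distance")   # from train.head()
test_dist = pd.Series([598, 1235, 577, 377, 258], name="Distance")   # from test.head()

# Percentile rank within the training set, in (0, 1].
train_rank = train_dist.rank(pct=True)

# For unseen rows, use the share of training distances <= the value, so the
# test set is ranked against the training distribution rather than itself.
sorted_train = np.sort(train_dist.to_numpy())
test_rank = np.searchsorted(sorted_train, test_dist.to_numpy(), side="right") / len(sorted_train)
```

Test values below the training minimum map to 0 and values above the training maximum map to 1, so extreme distances can no longer stretch the feature's range.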

---

**Variant 8:**  
Create an additional numeric feature based on the ratio of `DepTime` to `Distance`, interpreting whether short flights have very late departure times (which might cause more delays) or if longer flights departing at certain times are prone to delays. Evaluate whether your ratio-based feature helps gradient boosting discover new patterns. Inspect partial dependence or shap values to interpret the new ratio’s effect.

*Technical note:*  
Compute `DepTimeDistanceRatio = DepTime / (Distance + 1)` or something similar to avoid division by zero. Use XGBoost with standard or lightly tuned parameters. Evaluate metrics and generate partial dependence plots with `sklearn.inspection.PartialDependenceDisplay.from_estimator` (the older `plot_partial_dependence` function was removed in scikit-learn 1.2) or the SHAP library. Summarize whether the ratio stands out as an important predictor.

---

**Variant 9:**  
Encode the `Month` and `DayofMonth` as one continuous variable representing the day of the year (e.g., 1 through 365) to capture seasonality. Then see if gradient boosting can better identify patterns in certain times of year (like winter storms). Compare performance of the “day of year” approach to separate columns for `Month` and `DayofMonth`.  

*Technical note:*  
Compute `day_of_year = (Month - 1)*30 + DayofMonth` as a rough approximation or reference an actual calendar mapping. Train XGBoost, evaluate with `roc_auc_score` and/or `f1_score`. Summarize whether the single “day of year” feature helps or if separate columns were better.  
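
The "actual calendar mapping" alternative can be sketched with cumulative month lengths. A non-leap year is assumed, since the head of the data shows no year column, and `Month` and `DayofMonth` are assumed to be numeric already.

```python
import numpy as np
import pandas as pd

# Days in each month of a non-leap year (assumption).
DAYS_IN_MONTH = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
# Days elapsed before the first day of each month: [0, 31, 59, ...].
DAYS_BEFORE_MONTH = np.concatenate(([0], np.cumsum(DAYS_IN_MONTH)[:-1]))

# Toy numeric codes (assumption: "c-" prefix already stripped).
df = pd.DataFrame({"Month": [1, 3, 8, 12], "DayofMonth": [1, 1, 21, 31]})

df["day_of_year"] = DAYS_BEFORE_MONTH[df["Month"].to_numpy() - 1] + df["DayofMonth"]
```

For comparison, the rough `(Month - 1)*30 + DayofMonth` formula gives 61 for 1 March and 361 for 31 December, while the calendar mapping gives 60 and 365.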

---

**Variant 10:**  
Introduce polynomial expansions for time-based features, e.g., `DepHour^2` or cross terms between `DepHour` and `Distance`. Then apply a correlation check to detect potential multi-collinearity before feeding them into gradient boosting. Summarize how the final model results differ from a simpler set of features. Reflect on whether tree-based methods benefit from polynomial expansions of time features.

*Technical note:*  
Use `PolynomialFeatures(degree=2, include_bias=False)` on selected numeric columns. If multi-collinearity arises, consider dropping highly correlated expansions. Fit XGBoost with `n_estimators=200, subsample=0.8`. Evaluate `precision`, `recall`, or `accuracy`. Examine feature importances to see if the polynomial expansions rank significantly higher than original features.

---

**Variant 11:**  
Apply an unsupervised approach such as K-means clustering on `(DepTime, Distance)` or other numeric columns to group flights into “types” (e.g., short-late flights, long-early flights, etc.). Then add the cluster labels as a new categorical feature. Evaluate if the cluster-based feature helps XGBoost find distinct flight patterns correlated with delays. Analyze confusion matrices or AUC for the new model vs. the baseline.

*Technical note:*  
Use `sklearn.cluster.KMeans(n_clusters=5)` on standardized numeric data. Append the cluster labels to your training set. Then train an XGBoost classifier with moderate hyperparameters. Track `roc_auc_score` or `f1_score`. Summarize the new cluster-based feature’s influence via `model.feature_importances_`.  

---

**Variant 12:**  
Implement a feature-selection step based on mutual information or the chi-square test for categorical columns. Identify the top N features from among `Month`, `DayOfWeek`, `UniqueCarrier`, `Route` (combined from `Origin`/`Dest`), `Distance`, etc. Train XGBoost only on these top features and compare performance to using the entire set. Reflect on whether focusing on fewer but stronger predictors leads to a simpler, equally accurate model.

*Technical note:*  
Use `sklearn.feature_selection.mutual_info_classif` or `chi2` for categorical data. Pick the top features with something like `SelectKBest(k=10)`. Re-train XGBoost with default parameters. Evaluate `accuracy` or `f1_score`. Summarize if a narrower feature set helps speed or interpretability without sacrificing performance.

---

**Variant 13:**  
Construct a derived measure of flight congestion, such as “number of flights departing from the same airport within ±2 hours.” In practice, you can approximate this by counting how many flights in **flight_delays_train.csv** have the same `Origin` and a `DepTime` within a 2-hour window. Add it as a numeric feature (`originCongestionScore`). Then see if gradient boosting uses it to predict higher delays for congested intervals.

*Technical note:*  
For each row, filter the training set by `Origin` and check all flights with `|DepTime_i - DepTime_j| <= 200` (approx. 2 hours). Take the count as `originCongestionScore`. Mind the potential for large computations if done naively; a simplified approach is acceptable. Train XGBoost with `max_depth=6`, `min_child_weight=3`. Compare `f1_score` or `roc_auc_score` to a baseline with no congestion feature.
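
One way to avoid the naive row-by-row filter is to sort the departure times once per airport and count neighbours with a binary search. This is a sketch under the simplified reading of the note: flights are pooled across all days, and a difference of 200 in HHMM units is treated as roughly two hours. The function name `origin_congestion_score` is hypothetical.

```python
import numpy as np
import pandas as pd

def origin_congestion_score(df: pd.DataFrame, window: int = 200) -> pd.Series:
    """Count other flights from the same Origin with |DepTime difference| <= window."""
    score = pd.Series(0, index=df.index, dtype="int64")
    for _, group in df.groupby("Origin"):
        times = group["DepTime"].to_numpy()
        sorted_times = np.sort(times)
        # Positions of the window bounds in the sorted array.
        lo = np.searchsorted(sorted_times, times - window, side="left")
        hi = np.searchsorted(sorted_times, times + window, side="right")
        score.loc[group.index] = hi - lo - 1  # subtract the flight itself
    return score

toy = pd.DataFrame({
    "Origin": ["ATL", "ATL", "ATL", "ORD", "ORD"],
    "DepTime": [800, 930, 1500, 700, 2100],
})
toy["originCongestionScore"] = origin_congestion_score(toy)
print(toy)
```

Sorting makes the cost per airport O(n log n) instead of O(n²). To count per calendar day rather than across the whole year, group by `Origin`, `Month`, and `DayofMonth` instead.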

---

**Variant 14:**  
Leverage the flight’s `DayOfWeek` to create a simplified “week segment” feature: e.g., Monday–Wednesday are segment 1, Thursday–Friday are segment 2, and weekend is segment 3. Evaluate how segment-based grouping interacts with `DepTime` or `Distance`. Fit a gradient boosting model using these new segments. Discuss whether such chunking of days can highlight typical business travel vs. weekend leisure flight differences in the delay patterns.

*Technical note:*  
Define a function mapping day-of-week: `[1,2,3]` for `[Mon–Wed, Thu–Fri, Sat–Sun]`. Use standard or minimal XGBoost hyperparameters. Evaluate `f1_score` or `recall_score`. Compare model performance to ignoring day grouping. Summarize how day segmentation might reflect business vs. leisure patterns.

---

**Variant 15:**  
Introduce a “connecting flight likelihood” feature: for each flight, approximate whether it’s part of a multi-leg journey. For example, if another flight arrived at this flight’s `Origin` airport shortly before its `DepTime`, some passengers may be connecting from it. You can simulate or approximate this by searching for flights that arrive at `Origin` within 1–2 hours before departure. Then see if a higher likelihood of connecting flights correlates with more frequent delays. Evaluate the effect in gradient boosting.

*Technical note:*  
This is conceptual, but you can artificially tag 10% of flights as “connected” if they share the same `Origin` with another flight that lands within a short time window. Add this binary column as `isLikelyConnection`. Train an XGBoost classifier, compare confusion matrices. Summarize the potential real-world effect of connection-induced delays.

---

**Variant 16:**  
Try building a meta-feature that aggregates historical delay rates by `DepHour`. For each hour (0–23), compute the fraction of delayed flights in the training set. Then map each flight’s departure hour to that fraction. Compare a gradient boosting model that uses this aggregated meta-feature to a baseline that only sees raw `DepTime`. Present how well the meta-feature captures the daily delay patterns.

*Technical note:*  
Group data by `DepHour = DepTime // 100`. Compute `delay_rate_hour = sum(Y_delayed)/count(total_flights)` for that hour. For each flight, `metaDepHourDelay = delay_rate_hour[DepHour]`. Use K-fold or separate time splits to avoid target leakage. Train XGBoost with typical settings. Evaluate `roc_auc_score`. Summarize whether the meta-feature is more direct than learning from raw times.
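
A sketch of the out-of-fold version of this meta-feature, so that no row is encoded with its own label. It assumes the target has been converted to a binary column named `delayed` (a hypothetical name) and uses a simple shuffled fold assignment in place of a library K-fold splitter.

```python
import numpy as np
import pandas as pd

def oof_hour_delay_rate(df, n_splits=5, seed=0):
    """Encode each row with the DepHour delay rate computed on the other folds."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(df)) % n_splits  # shuffled, near-equal fold ids
    dep_hour = df["DepTime"] // 100
    encoded = np.full(len(df), np.nan)
    for k in range(n_splits):
        in_fold = fold == k
        # Statistics are estimated without the rows of fold k.
        rate_by_hour = df.loc[~in_fold, "delayed"].groupby(dep_hour[~in_fold]).mean()
        fallback = df.loc[~in_fold, "delayed"].mean()  # for hours unseen outside fold k
        encoded[in_fold] = dep_hour[in_fold].map(rate_by_hour).fillna(fallback).to_numpy()
    return pd.Series(encoded, index=df.index)

toy = pd.DataFrame({
    "DepTime": [530, 815, 830, 845, 850, 1710, 1720, 1730, 1745, 1750],
    "dep_delayed_15min": ["N", "N", "N", "Y", "N", "Y", "Y", "N", "Y", "N"],
})
toy["delayed"] = (toy["dep_delayed_15min"] == "Y").astype(int)
toy["metaDepHourDelay"] = oof_hour_delay_rate(toy)
print(toy)
```

For the test set, the rates would be computed once on the full training set and mapped onto the test rows.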

---

**Variant 17:**  
For each airport in `Origin`, compute the average `Distance` of flights departing there. Then define a feature capturing how far the current flight’s distance deviates from that airport’s average. Hypothesize that unusually long flights from certain airports might face more potential for delay. Train a gradient boosting model with this deviation-based feature and check if feature importance suggests it’s helpful.

*Technical note:*  
Group the training data by `Origin`. Calculate the mean distance: `meanDistOrigin[o] = average distance of flights from airport o`. Then for each flight, `distanceDeviation = Distance - meanDistOrigin[Origin]`. Fit XGBoost with standard hyperparameters. Evaluate `f1_score` or `accuracy_score`. Inspect `model.feature_importances_` or SHAP values. Summarize whether distance deviation is predictive.
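
The grouping step might look as follows. The sketch assumes the per-airport means are estimated on the training data only and then mapped onto any other split; airports unseen in training fall back to the overall mean, which is a design choice of this example rather than part of the task.

```python
import pandas as pd

# Hypothetical toy splits.
train_toy = pd.DataFrame({"Origin": ["ATL", "ATL", "ORD", "ORD"],
                          "Distance": [200, 600, 300, 1700]})
test_toy = pd.DataFrame({"Origin": ["ATL", "JFK"], "Distance": [500, 900]})

mean_dist_origin = train_toy.groupby("Origin")["Distance"].mean()
overall_mean = train_toy["Distance"].mean()

def add_distance_deviation(df: pd.DataFrame) -> pd.DataFrame:
    """Add distanceDeviation = Distance - mean training distance of the Origin."""
    out = df.copy()
    origin_mean = out["Origin"].map(mean_dist_origin).fillna(overall_mean)
    out["distanceDeviation"] = out["Distance"] - origin_mean
    return out

train_toy = add_distance_deviation(train_toy)
test_toy = add_distance_deviation(test_toy)
print(test_toy)
```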

---

**Variant 18:**  
Construct a new numeric feature measuring the ratio of weekend flights to weekday flights for each airport. For each `Origin`, determine how many flights depart on weekends vs. weekdays, forming `weekendRatio = weekendFlights / (weekdayFlights + 1)`. Merge that ratio back into the training data. Then see if airports that primarily serve weekend traffic have different delay behaviors. Evaluate the resulting gradient boosting classifier.

*Technical note:*  
Group data by `Origin`. Count flights for day-of-week in `[6,7]` vs. `[1..5]`. Compute ratio. Merge the ratio into each row. Use XGBoost with default or lightly tuned parameters. Evaluate `precision` and `recall`. Summarize if the ratio-based feature ranks high in importance or if the model mostly relies on day-of-week directly.
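
A short sketch of the count-and-merge step on hypothetical toy data, assuming `DayOfWeek` holds integers 1 to 7 with 6 and 7 as the weekend.

```python
import pandas as pd

toy = pd.DataFrame({
    "Origin": ["ATL", "ATL", "ATL", "ORD", "ORD", "ORD"],
    "DayOfWeek": [1, 6, 7, 2, 3, 6],
})

is_weekend = toy["DayOfWeek"].isin([6, 7])
# Per-airport counts of weekend and weekday departures.
weekend_flights = is_weekend.groupby(toy["Origin"]).sum()
weekday_flights = (~is_weekend).groupby(toy["Origin"]).sum()

# The +1 in the denominator guards against airports with no weekday flights.
weekend_ratio = weekend_flights / (weekday_flights + 1)
toy["weekendRatio"] = toy["Origin"].map(weekend_ratio)
print(toy)
```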

---

**Variant 19:**  
Study the effect of adding a “flight index” representing an approximate position of a flight in the daily departure schedule at each `Origin`. For instance, sort all flights at airport `A` by `DepTime` and assign an index. Then hypothesize that later flights (higher index in daily sequence) accumulate delays. Add this index as a numeric feature to an XGBoost classifier. Check if it outperforms the raw `DepTime`.

*Technical note:*  
For each `Origin` and calendar day (`Month`, `DayofMonth`), sort flights by `DepTime` and assign an integer rank from 1 up to the number of flights from that airport on that day. `flightIndex` might capture scheduling patterns. Train XGBoost, measure `f1_score`. Summarize whether `flightIndex` is more or less predictive than the direct departure hour.  
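
The daily ranking can be done without an explicit sort by using a grouped rank. This sketch assumes integer `Month` and `DayofMonth` columns; ties in `DepTime` are broken by row order.

```python
import pandas as pd

toy = pd.DataFrame({
    "Origin":     ["ATL", "ATL", "ATL", "ATL", "ORD", "ORD"],
    "Month":      [1, 1, 1, 1, 1, 1],
    "DayofMonth": [1, 1, 1, 2, 1, 1],
    "DepTime":    [900, 600, 1500, 700, 1200, 800],
})

# Position of each flight within its airport's departures on that day (1 = earliest).
toy["flightIndex"] = (
    toy.groupby(["Origin", "Month", "DayofMonth"])["DepTime"]
       .rank(method="first")
       .astype(int)
)
print(toy)
```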

---

**Variant 20:**  
Create composite time blocks based on both `DayOfWeek` and `DepHour`: for example, a categorical feature “weekday-morning,” “weekday-afternoon,” “weekday-night,” “weekend-morning,” etc. This merges day-of-week with broad time-of-day bins. Encode it as one-hot or label-encode. Compare gradient boosting results to using separate features. Evaluate whether this composite feature helps capture synergy between day and hour.

*Technical note:*  
Define day categories `[Weekday, Weekend]` and hour categories `[Morning(0–5), Day(6–17), Night(18–23)]`. Combine them. Then feed these categories into XGBoost. Evaluate `accuracy_score` or `f1_score`. Summarize if the combined time block is more important than separate day/hour columns.  
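
A sketch of the combination step using the bin edges from the note. It assumes `DayOfWeek` holds integers 1 to 7 and `DepTime` is in HHMM form with hours 0 to 23.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "DayOfWeek": [1, 5, 6, 7, 2],
    "DepTime": [530, 600, 1759, 1800, 2359],
})

dep_hour = toy["DepTime"] // 100
# Right-closed bins: (-1, 5] -> Morning, (5, 17] -> Day, (17, 23] -> Night.
day_part = pd.cut(dep_hour, bins=[-1, 5, 17, 23], labels=["Morning", "Day", "Night"])
day_type = pd.Series(np.where(toy["DayOfWeek"] >= 6, "Weekend", "Weekday"), index=toy.index)

toy["timeBlock"] = day_type + "-" + day_part.astype(str)
# One-hot encode the composite category for the model.
time_block_dummies = pd.get_dummies(toy["timeBlock"], prefix="block")
print(toy)
```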

---

<a class="anchor" id="lab-10.3"></a>

## <span style="color:blue; font-size:1.5em;">10.3. Hyperparameter Tuning $\&$ Model Evaluation</span>

[Back to the outline](#lab-10)

### <span style="color:red; font-size:1.2em;">Task 3</span>

---
**Variant 1:**  
Perform a basic grid search over `max_depth` and `n_estimators` in an XGBoost model to predict delayed flights. Try small, medium, and large values for `max_depth` (e.g., [3, 6, 9]) and [100, 300, 500] for `n_estimators`. Compare results using a hold-out validation set or cross-validation. Summarize which combination yields the best ROC AUC, acknowledging trade-offs in training time.

*Technical note:*  
Use `GridSearchCV` from `sklearn.model_selection` with param grid `{ 'max_depth': [3,6,9], 'n_estimators': [100,300,500] }`. Evaluate with `roc_auc` or `accuracy`. Keep other parameters (like `learning_rate=0.1`) fixed. Provide the best parameter combination and the corresponding metric.

---

**Variant 2:**  
Investigate how varying `learning_rate` from 0.01 to 0.3 impacts model performance for flight delay prediction. Keep `n_estimators=300` fixed and `max_depth=6`, then compare final accuracy or AUC across those different learning rates. Show if a higher learning rate quickly overfits or if a lower rate requires more iterations for the same performance. Provide a learning curve plot if possible.

*Technical note:*  
Iterate `learning_rate` in `[0.01, 0.05, 0.1, 0.2, 0.3]`. Fit XGBoost each time, measuring validation AUC or accuracy. Possibly store partial results (train vs. validation) to plot a learning curve. Summarize the sweet spot that balances training time, overfitting risk, and final performance.

---

**Variant 3:**  
Use early stopping in XGBoost to find an optimal number of boosting rounds automatically. Set `n_estimators` high (e.g., 1000) but specify `early_stopping_rounds=20` to halt training when validation performance no longer improves. Display how many rounds are actually used and compare final metrics to a model without early stopping. Emphasize potential training time savings.

*Technical note:*  
Split off a validation set and pass it as `eval_set=[(X_valid, y_valid)]` in `xgb_model.fit(...)`. Provide `early_stopping_rounds=20`; in recent XGBoost releases (2.0 and later) this argument is set in the `XGBClassifier` constructor rather than in `fit`. Document the best iteration found (`xgb_model.best_iteration`). Evaluate `f1_score` or `accuracy_score`. Show that if early stopping halts at ~200 rounds, it might match or exceed the performance of running all 1000 rounds.

---

**Variant 4:**  
Experiment with the `subsample` and `colsample_bytree` parameters. For instance, fix `max_depth=6, learning_rate=0.1, n_estimators=300`, then vary `subsample` (e.g., [0.5, 0.7, 1.0]) and `colsample_bytree` ([0.5, 0.8, 1.0]). Evaluate how these sampling strategies reduce overfitting. Summarize the combination that yields the best cross-validation AUC or F1.

*Technical note:*  
Use `GridSearchCV` or nested loops over `subsample` and `colsample_bytree`. Each step, record `roc_auc_score`. Analyze whether partial row sampling (`subsample<1.0`) or feature sampling (`colsample_bytree<1.0`) helps generalization. Summarize if any combination stands out in performance or overfitting reduction.

---

**Variant 5:**  
Evaluate the effect of `scale_pos_weight` in an imbalanced dataset scenario. Estimate the ratio of “Y” vs. “N” classes and set `scale_pos_weight` accordingly. For instance, if the ratio is 1:4, try `scale_pos_weight=4`. Then run a cross-validation to measure changes in recall or precision for the minority (delayed) class. Decide if adjusting `scale_pos_weight` is beneficial or leads to too many false positives.

*Technical note:*  
Check class distribution with `train['dep_delayed_15min'].value_counts()`. Suppose delayed flights are 20%. Then set `scale_pos_weight=4`. Evaluate with `sklearn.metrics.classification_report` or `precision_recall_curve`. Summarize if the recall improvement is worth the drop in precision.
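
The weight can be derived from the class counts rather than guessed. The sketch below uses a hypothetical toy target with 20% delayed flights and the common negatives-over-positives heuristic.

```python
import pandas as pd

# Hypothetical toy target: 8 on-time flights, 2 delayed ones.
toy = pd.DataFrame({"dep_delayed_15min": ["N"] * 8 + ["Y"] * 2})

counts = toy["dep_delayed_15min"].value_counts()
# Heuristic: number of negative examples divided by number of positive examples.
scale_pos_weight = counts["N"] / counts["Y"]
print(counts.to_dict(), scale_pos_weight)
```

The resulting value would then be passed as `scale_pos_weight=...` when constructing the classifier.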

---

**Variant 6:**  
Use random search with a broad parameter space for XGBoost, including `max_depth`, `learning_rate`, `colsample_bytree`, `subsample`, and `min_child_weight`. Limit the number of trials (e.g., 20–30) to keep computation feasible. Compare the best random search result to a manual or grid search approach. Reflect on the pros/cons of random search for flight delay classification.

*Technical note:*  
Apply `RandomizedSearchCV` from `sklearn.model_selection` with a parameter distribution, e.g. `max_depth` in [3..10], `learning_rate` in [0.01..0.3], etc. Specify `n_iter=20` or so. Evaluate with 3-fold cross-validation and `roc_auc`. Summarize the best found hyperparameters and final performance.  

---

**Variant 7:**  
Implement Bayesian optimization for hyperparameter tuning using a library like `scikit-optimize` or `hyperopt` (if allowed). Optimize for up to 50 iterations. Focus on `max_depth`, `learning_rate`, and `gamma` (regularization term). Track how the tuner selects parameter sets over iterations, concluding whether it finds a better solution than a random or grid search under the same budget.

*Technical note:*  
Use an external library such as `skopt` (if permissible) or describe the approach conceptually. Evaluate each configuration’s AUC or F1 via cross-validation. Display the final best parameters. Summarize if Bayesian optimization improved speed or final metrics for gradient boosting.  

---

**Variant 8:**  
Compare performance across various metrics: accuracy, precision, recall, F1, and ROC AUC in a single experiment. Train an XGBoost model with moderate parameters (e.g., `max_depth=5, n_estimators=300, learning_rate=0.1`), then measure each metric on a validation set. Highlight potential conflicts (e.g., high accuracy but low recall) and suggest which metric is more appropriate if missed delays are costly.

*Technical note:*  
Fit the model, then use `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, and `roc_auc_score`. Provide a small table or summary. Discuss business context: if it’s critical not to miss delayed flights, recall or F1 might be more important than raw accuracy.  

---

**Variant 9:**  
Incorporate cross-validation with a time-based split (e.g., split data by month) to mimic real flight scheduling. For each fold, train on earlier months and validate on later months. Then systematically tune `max_depth` and `learning_rate`. Compare these time-based CV results to standard random folds. Summarize how the chosen hyperparameters differ under time-based constraints.

*Technical note:*  
Implement custom cross-validation: e.g. first fold = train on months 1–3, validate on month 4; second fold = train on months 1–4, validate on month 5, etc. Evaluate AUC or F1 for each fold. Summarize the final chosen parameters. Compare them to parameters found with conventional K-fold.  
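
The expanding-window folds described in the note can be produced by a small generator. `expanding_month_splits` is a hypothetical helper; it assumes `Month` is available as integers.

```python
import numpy as np

def expanding_month_splits(months, first_valid_month=4):
    """Yield (train_idx, valid_idx): train on months < m, validate on month m."""
    months = np.asarray(months)
    for m in range(first_valid_month, int(months.max()) + 1):
        train_idx = np.flatnonzero(months < m)
        valid_idx = np.flatnonzero(months == m)
        if len(train_idx) and len(valid_idx):
            yield train_idx, valid_idx

# Toy month column for eight flights.
toy_months = [1, 1, 2, 3, 4, 4, 5, 6]
splits = list(expanding_month_splits(toy_months))
for train_idx, valid_idx in splits:
    print("train:", train_idx, "valid:", valid_idx)
```

Because scikit-learn search utilities accept an iterable of `(train, test)` index arrays as `cv`, a list such as `splits` can be passed to `GridSearchCV(..., cv=splits)`.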

---

**Variant 10:**  
Use a validation curve approach to see how `max_depth` alone affects training and validation scores. Keep other parameters fixed. Plot training and validation accuracy (or AUC) as `max_depth` goes from 1 to 10. Identify the region where overfitting begins. Illustrate how deeper trees might capture more complex patterns but degrade validation performance if they overshoot.

*Technical note:*  
Use `validation_curve` from `sklearn.model_selection` with `param_name='max_depth'` and `param_range=range(1, 11)`. It returns `train_scores, valid_scores`; score with `accuracy` or `roc_auc`. Plot them. Summarize the sweet spot for `max_depth`.  

---

**Variant 11:**  
Measure training time vs. performance for different model complexities. For example, vary `max_depth` and `n_estimators` systematically, and record both training runtime and validation AUC. Present a scatter plot or table showing the trade-off: minimal training time vs. best performance. Indicate where diminishing returns set in.

*Technical note:*  
Loop through `max_depth` in [3,6,9] and `n_estimators` in [100,300,500]. Use Python’s `time` module to measure fit time. Evaluate with `roc_auc_score` on a validation split. Summarize or plot the results, highlighting “optimal” points balancing performance and time.  

---

**Variant 12:**  
Create an ensemble of multiple XGBoost models, each with different hyperparameters found via cross-validation. Then average their predicted probabilities on the validation set. Compare this ensemble approach to a single best model. Evaluate changes in recall, precision, and overall AUC. Conclude if an ensemble of “good but diverse” gradient boosting models yields an improvement.

*Technical note:*  
Train several XGBoost classifiers with different configurations (e.g., different `max_depth`, `learning_rate`, etc.). For each instance, get predicted probabilities. Average them. Evaluate `f1_score` or `roc_auc_score`. Summarize if ensembling multiple XGBoost variants outperforms the single best model from hyperparameter tuning.  

---

**Variant 13:**  
Set up nested cross-validation for both hyperparameter tuning and model evaluation. In the outer loop, partition the data into 5 folds, and for each fold, do an inner loop grid search for the best XGBoost parameters. This ensures the final performance estimate is unbiased. Summarize the computational cost and final average AUC or F1.

*Technical note:*  
Implement a 5-fold outer loop. In each iteration, run a grid or random search on the training portion (inner loop), pick the best model, evaluate on the outer loop’s validation portion. Collect the scores and average them. Summarize the final estimate, which is robust to overfitting during hyperparameter search.  

---

**Variant 14:**  
Focus on the effect of `gamma` (the minimum loss reduction required to make a further partition on a leaf node). Vary `gamma` from 0 to 10 in steps (0, 1, 2, 5, 10) and observe how it affects model complexity (number of leaves) and validation performance. Document changes in training time, feature importances, or overfitting behavior.

*Technical note:*  
Use a loop: `for gamma in [0,1,2,5,10]: ...`. Keep other hyperparameters constant (e.g., `max_depth=5, learning_rate=0.1, n_estimators=300`). Evaluate `roc_auc_score` or `accuracy`. Summarize how higher `gamma` might reduce overfitting but also might hamper model flexibility.  

---

**Variant 15:**  
Incorporate `reg_alpha` (L1) or `reg_lambda` (L2) regularization in XGBoost to combat overfitting. Compare three scenarios: no regularization, moderate `reg_alpha=1`, and moderate `reg_lambda=1`. Use the same training/validation split to measure any shift in performance. Conclude if regularization helps or hinders flight delay prediction.

*Technical note:*  
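Set `xgb.XGBClassifier(reg_alpha=1.0)` or `xgb.XGBClassifier(reg_lambda=1.0)`. Note that XGBoost’s defaults are `reg_alpha=0` and `reg_lambda=1`, so the no-regularization baseline needs `reg_lambda=0` set explicitly, and the `reg_alpha=1` scenario should state which `reg_lambda` it uses. Keep `max_depth=6, n_estimators=300, learning_rate=0.1`. Evaluate `f1_score`, `roc_auc_score`, or confusion matrix. Summarize if the regularized models behave better on validation sets or if performance declines.  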

---

**Variant 16:**  
Apply a repeated cross-validation strategy, e.g., repeating K-fold 3 times with different random splits, to get a more stable estimate of model performance. Use moderate XGBoost hyperparameters. Summarize the mean and standard deviation of the model’s AUC across all folds and repeats. Discuss how repeating CV might produce more reliable performance metrics.

*Technical note:*  
Use `RepeatedKFold(n_splits=5, n_repeats=3)` in `sklearn.model_selection`. For each fold, train XGBoost with fixed parameters (like `max_depth=5, n_estimators=200`). Collect `roc_auc_score`. Summarize average and standard deviation. Compare to single 5-fold CV results, highlighting potentially improved reliability.  

---

**Variant 17:**  
Investigate how partial dependence plots (PDP) or SHAP values can be used to interpret the tuned XGBoost model. After selecting best hyperparameters, pick key features (e.g., `Distance`, `DepTime`, `UniqueCarrier`) and produce PDPs or SHAP plots. Summarize the main insights: e.g., longer `Distance` might reduce or increase delay risk, depending on the interplay with `DepTime`.

*Technical note:*  
Fit XGBoost with chosen parameters. Use `shap.TreeExplainer(model).shap_values(X)` or `sklearn.inspection.PartialDependenceDisplay.from_estimator(model, X, [feature])` (the older `plot_partial_dependence` function was removed in scikit-learn 1.2). Evaluate the shape of dependencies. Summarize the interpretive advantage for stakeholders wanting to see how each feature influences delay probability.  

---

**Variant 18:**  
Implement a custom callback function to log the validation metrics at each iteration of XGBoost. For instance, in Python, pass a callback to the XGBoost training routine that captures `eval_metric='auc'` every 10 iterations. Present the learning curve, showing how the model improves with successive trees. Summarize whether the curve plateaus early or continues improving.

*Technical note:*  
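Use `xgb.train()` with the `evals` parameter and a callback: either the built-in `xgboost.callback.EvaluationMonitor` or a custom subclass of `xgboost.callback.TrainingCallback`. Alternatively, rely on the built-in logging with `verbose_eval=10` and collect the metric history through `evals_result`. Plot iteration vs. AUC from the logs. Summarize your findings on how quickly the model converges.  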

---

**Variant 19:**  
Study the effect of different random seeds (e.g., [0, 17, 42, 99]) on the final performance of an XGBoost model with a moderately tuned hyperparameter set. Track any changes in final AUC or F1. Discuss random seed sensitivity and whether it meaningfully impacts model performance in this flight delay scenario.

*Technical note:*  
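Set `random_state` in the `XGBClassifier` constructor (or `seed` in the parameters of the native API). Run multiple times, collecting `roc_auc_score`. Summarize the mean and standard deviation. Reflect on the role of randomness in boosting: the seed only matters where sampling is involved, such as `subsample < 1`, column sampling, or a random train/validation split.  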

---

**Variant 20:**  
Conduct a parameter-sensitivity analysis by picking a well-tuned baseline model and varying one parameter at a time slightly (±10%). Monitor changes in validation AUC or F1. This reveals which parameters (e.g., `max_depth`, `learning_rate`, `colsample_bytree`) are more sensitive for final performance. Provide a short rank ordering of parameter importance from a tuning perspective.

*Technical note:*  
Say your baseline is `max_depth=6, learning_rate=0.1, n_estimators=300, colsample_bytree=0.8, subsample=0.8`. Create small perturbations around each parameter. Evaluate each perturbation’s performance on a hold-out set. Summarize which parameters are “highly sensitive” (small changes lead to big performance shifts) vs. “less sensitive.”  
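
The perturbed configurations can be generated programmatically. The sketch below is one possible convention, not a prescribed one: integer parameters are rounded (and moved by at least one step), and sampling fractions are capped at 1.0. `perturbations` is a hypothetical helper name.

```python
baseline = {"max_depth": 6, "learning_rate": 0.1, "n_estimators": 300,
            "colsample_bytree": 0.8, "subsample": 0.8}
INTEGER_PARAMS = {"max_depth", "n_estimators"}
UNIT_INTERVAL_PARAMS = {"colsample_bytree", "subsample"}

def perturbations(base, rel=0.10):
    """Return (name, config) pairs that change one parameter by -rel and +rel."""
    configs = []
    for name, value in base.items():
        for factor in (1 - rel, 1 + rel):
            new_value = value * factor
            if name in INTEGER_PARAMS:
                new_value = int(round(new_value))
                if new_value == value:  # the change rounded away: step by one instead
                    new_value = value + (1 if factor > 1 else -1)
            if name in UNIT_INTERVAL_PARAMS:
                new_value = min(new_value, 1.0)
            configs.append((name, {**base, name: new_value}))
    return configs

configs = perturbations(baseline)
for name, cfg in configs:
    print(name, cfg[name])
```

Each configuration would then be trained and scored on the same hold-out set, and the parameters ranked by the absolute change in the metric relative to the baseline.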

---

<a class="anchor" id="lab-10.4"></a>

## <span style="color:blue; font-size:1.5em;">10.4. Handling Missing Data, Outliers $\&$ Special Cases</span>

[Back to the outline](#lab-10)

### <span style='color:red; font-size:1.4em;'>Task 4</span>

---
**Variant 1:**  
Deliberately introduce synthetic missing values in the `DepTime` column for 20% of rows, simulating incomplete flight data. Compare two approaches: dropping those rows vs. imputing with median or mean `DepTime`. Train an XGBoost model each time and evaluate which approach yields the highest F1 score or AUC. Reflect on whether a single numeric imputation is sufficient for flight scheduling data.

*Technical note:*  
Randomly pick 20% of rows, set `DepTime` to NaN. Then do (1) drop the rows, (2) fill with median. Evaluate `roc_auc_score` or `f1_score`. Summarize whether discarding data might lose valuable patterns or if mean/median imputation might distort the feature distribution.
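
A sketch of the corruption step and the two treatments on hypothetical toy data. On the real data, the median should be estimated on the training split only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical toy column with 100 departure times.
toy = pd.DataFrame({"DepTime": np.arange(600, 1600, 10, dtype=float)})

# Pick exactly 20% of the rows without replacement and blank out DepTime.
n_missing = int(0.2 * len(toy))
missing_idx = rng.choice(toy.index.to_numpy(), size=n_missing, replace=False)
toy.loc[missing_idx, "DepTime"] = np.nan

# Approach 1: drop incomplete rows. Approach 2: fill with the median.
dropped = toy.dropna(subset=["DepTime"])
imputed = toy.assign(DepTime=toy["DepTime"].fillna(toy["DepTime"].median()))
print(len(dropped), imputed["DepTime"].isna().sum())
```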

---

**Variant 2:**  
Examine extreme outliers in `Distance`: for instance, flights recorded with `Distance` below 50 miles or above 3000 miles. Decide whether to remove them or cap them (winsorizing) at certain percentiles. Retrain an XGBoost classifier after outlier handling, comparing it to a baseline with no outlier treatment. Show how outlier removal affects accuracy or recall.

*Technical note:*  
Identify outliers using percentile thresholds (e.g., 1st and 99th percentile). Optionally clip distances outside this range. Fit XGBoost with moderate hyperparameters. Evaluate `accuracy_score`, `recall_score`. Summarize if removing improbable distances helps reduce model noise or if you lose legitimate (rare) flights.
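
Both treatments, clipping and removal, can be expressed in a few lines. The data below is synthetic and only illustrates the mechanics.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical distances with two implausible extremes appended.
distance = pd.Series(np.append(rng.integers(100, 2500, size=200), [5, 9000]),
                     name="Distance")

# Percentile thresholds; on real data, compute them on the training split only.
lower, upper = distance.quantile([0.01, 0.99])

clipped = distance.clip(lower=lower, upper=upper)   # winsorize: keep every row
trimmed = distance[distance.between(lower, upper)]  # alternative: drop outlier rows
print(lower, upper, len(clipped), len(trimmed))
```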

---

**Variant 3:**  
Implement a systematic approach to handle missing or inconsistent data in `Origin` and `Dest`. Suppose some flights have mis-typed airport codes or blank strings. Either fix them via domain knowledge or label them as “Unknown.” Compare flight delay classification performance to ignoring these flights. Emphasize the importance of correctly categorizing airports to preserve route-based patterns.

*Technical note:*  
Check for anomalies in the `Origin` and `Dest` columns (e.g., 3-letter codes that aren’t recognized). Reassign them to “UNK” category or drop them. Use `LabelEncoder` or one-hot encoding. Train XGBoost with standard parameters. Summarize any performance differences.  

---

**Variant 4:**  
Use interpolation-based methods for numeric columns if the data is sorted by time. For instance, if a flight’s `DepTime` is missing, approximate it from the average of neighboring flights in chronological order at the same airport. Then train a gradient boosting model and see if time-series-informed imputation yields better performance than a global median.

*Technical note:*  
Sort by `Origin, Month, DayofMonth, DepTime` (or a more complete date-time if available). For missing `DepTime`, take the mean of previous and next valid `DepTime` from the same `Origin`. Evaluate the XGBoost model with `f1_score` or `roc_auc_score`. Summarize if localized, time-based imputation improves results.
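
A sketch of the neighbour-based fill, assuming the rows are already in chronological order within each airport. `impute_from_neighbors` is a hypothetical helper; averaging HHMM values is only approximate, since the result is not guaranteed to be a valid clock time.

```python
import numpy as np
import pandas as pd

def impute_from_neighbors(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing DepTime with the mean of the previous/next valid value per Origin."""
    out = df.copy()
    grouped = out.groupby("Origin")["DepTime"]
    prev_valid = grouped.ffill()
    next_valid = grouped.bfill()
    # Row-wise mean skips NaN, so edge rows use the single available neighbour.
    neighbor_mean = pd.concat([prev_valid, next_valid], axis=1).mean(axis=1)
    # Global median as a last resort for airports with no valid DepTime at all.
    out["DepTime"] = out["DepTime"].fillna(neighbor_mean).fillna(out["DepTime"].median())
    return out

toy = pd.DataFrame({
    "Origin": ["ATL", "ATL", "ATL", "ORD", "ORD"],
    "DepTime": [600, np.nan, 1000, np.nan, 900],
})
filled = impute_from_neighbors(toy)
print(filled)
```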

---

**Variant 5:**  
Treat day-of-month outliers similarly. For instance, if you see day-of-month coded as 32 or 0, set them to a default (like 1) or remove those rows. Then train a gradient boosting model. Compare how the presence of invalid day-of-month entries might degrade predictions, and whether correcting them significantly improves metrics.

*Technical note:*  
Identify rows where `DayofMonth` is outside [1..31]. Either fix or drop them. Use the same model configuration to compare `f1_score` or `accuracy_score`. Summarize the improvement or any changes in distribution once erroneous days are removed.

---

**Variant 6:**  
Apply multiple imputation for numeric columns (like `Distance` or `DepTime`) using an iterative approach (e.g., MICE — Multiple Imputation by Chained Equations). Then feed the imputed data to XGBoost. Evaluate whether a more sophisticated imputation approach outperforms simpler strategies, like median fill. Also, note if the increased complexity is justifiable in practice.

*Technical note:*  
If libraries allow, implement MICE with scikit-learn’s `IterativeImputer`, which is imported from `sklearn.impute` after first running `from sklearn.experimental import enable_iterative_imputer`. Compare XGBoost performance for MICE-imputed data vs. median-imputed data. Summarize the added complexity and any improvement in `roc_auc_score`.  

---

**Variant 7:**  
Use a “missingness indicator” technique: whenever `DepTime`, `Distance`, or other columns are missing, set an additional binary flag. Then fill the missing numeric values with 0 or median. Let gradient boosting learn if the missingness pattern itself is predictive of delays. Compare results to ignoring missingness patterns.

*Technical note:*  
For each column with missing values, create `columnName_isNA = (columnName.isnull())`. Fill the numeric column with 0 or the median. Evaluate the XGBoost model with `f1_score`. Summarize if the missingness indicator feature is among the top importances.  
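
The flag-then-fill step could be wrapped in a small helper such as the hypothetical `add_missing_flags` below. XGBoost can also consume `NaN` values directly, so leaving the gaps unfilled is a possible third variant to compare.

```python
import numpy as np
import pandas as pd

def add_missing_flags(df: pd.DataFrame, columns, fill_value=0) -> pd.DataFrame:
    """Add a <column>_isNA flag per column, then fill the gaps with fill_value."""
    out = df.copy()
    for col in columns:
        out[f"{col}_isNA"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(fill_value)
    return out

toy = pd.DataFrame({
    "DepTime": [600, np.nan, 1500],
    "Distance": [np.nan, 300, 700],
})
flagged = add_missing_flags(toy, ["DepTime", "Distance"])
print(flagged)
```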

---

**Variant 8:**  
Simulate a scenario in which `Distance` is missing for an entire subset of flights (e.g., certain carriers didn’t report distance). Compare a standard approach (dropping those carriers) to a model that uses the partial data with missing distances. Possibly create a specialized model for flights with known distance and a fallback for flights with unknown distance. Evaluate the combined approach.

*Technical note:*  
Identify flights from a synthetic “mask” of carriers that have no `Distance` data. For those flights, either drop them or train a second XGBoost ignoring `Distance`. Then blend the predictions if `Distance` is missing. Summarize if this layered approach outperforms dropping all missing rows in terms of final AUC or accuracy.

---

**Variant 9:**  
Construct a robust “winsorized” version of `DepTime` if it’s outside plausible bounds (e.g., below 0 or above 2400). Cap the times at 0 and 2359. Then train a gradient boosting model. Compare F1 or recall with a baseline that retains the raw `DepTime`. Reflect on whether capping times prevents bizarre outliers from skewing the model.

*Technical note:*  
For each `DepTime`, do something like:  
```python
# Cap DepTime to the plausible HHMM range [0, 2359].
train["DepTime"] = train["DepTime"].clip(lower=0, upper=2359)
```
Train XGBoost. Evaluate `precision`, `recall`. Summarize the difference.  

---

**Variant 10:**  
Create an “uncertain day-of-week” category if `DayOfWeek` is missing or inconsistent. For example, if the dataset has placeholders “c-0” or negative codes, treat those flights as “unclassified day.” Then see if that uncertain category is strongly predictive of delays. Train a gradient boosting model and check the partial dependence or feature importance for this special category.

*Technical note:*  
Replace invalid day-of-week codes with “UNK” or similar. One-hot encode or label-encode. Evaluate XGBoost with moderate parameters. Summarize if “UNK” day-of-week is highly correlated with delays, potentially indicating data-quality issues that correlate with flight disruptions.

---

**Variant 11:**  
Implement a consistency check: ensure that `DayofMonth` corresponds properly with `Month` (e.g., day 31 in February is invalid). For rows that fail this check, attempt a correction (like shifting the day to 28) or label them as invalid. Compare the XGBoost model’s performance with and without these corrections. Emphasize real-world data cleaning challenges.

*Technical note:*  
Programmatically check if `(Month, DayofMonth)` forms a valid date (assuming no leap years). If invalid, fix or remove them. Train XGBoost with standard parameters. Evaluate `f1_score`. Summarize if data correction meaningfully reduces noise or is negligible.
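
A vectorized version of the validity check, assuming integer `Month` and `DayofMonth` columns and a non-leap year. `is_valid_date` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Days per month, assuming a non-leap year as in the note above.
DAYS_IN_MONTH = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])

def is_valid_date(month: pd.Series, day: pd.Series) -> pd.Series:
    """True where (Month, DayofMonth) is a real calendar date."""
    m = month.to_numpy()
    d = day.to_numpy()
    month_ok = (m >= 1) & (m <= 12)
    # Clip only to keep the lookup in range; invalid months are rejected by month_ok.
    max_day = DAYS_IN_MONTH[np.clip(m, 1, 12) - 1]
    return pd.Series(month_ok & (d >= 1) & (d <= max_day), index=month.index)

toy = pd.DataFrame({"Month": [2, 2, 4, 12, 13, 1],
                    "DayofMonth": [28, 31, 31, 31, 1, 0]})
toy["isValidDate"] = is_valid_date(toy["Month"], toy["DayofMonth"])
print(toy)
```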

---

**Variant 12:**  
Check for potential label noise: flights that are marked “Y” for delayed, but appear to have suspicious `DepTime` or partial data. Hypothesize some fraction of these might be mislabels. Evaluate how artificially flipping a small percentage (e.g., 5%) of “Y” labels to “N” or vice versa changes your model training. Summarize if gradient boosting is robust to moderate label noise.

*Technical note:*  
Randomly sample some delayed flights and relabel them “N,” or vice versa. Retrain XGBoost, compare differences in `roc_auc_score`. Document how the presence of label noise might degrade performance.  
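
A sketch of symmetric label flipping, assuming the target has been encoded as 0/1. `flip_labels` is a hypothetical helper; only the training labels should be corrupted, while validation labels stay clean so the scores remain comparable.

```python
import numpy as np
import pandas as pd

def flip_labels(y: pd.Series, fraction: float = 0.05, seed: int = 0) -> pd.Series:
    """Return a copy of a 0/1 label series with `fraction` of the entries flipped."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    n_flip = int(round(fraction * len(y)))
    flip_idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy.iloc[flip_idx] = 1 - y_noisy.iloc[flip_idx].to_numpy()
    return y_noisy

# Hypothetical toy labels: 80 on-time flights, 20 delayed.
y = pd.Series([0] * 80 + [1] * 20)
y_noisy = flip_labels(y, fraction=0.05)
print((y != y_noisy).sum(), "labels flipped")
```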

---

**Variant 13:**  
Apply a “soft labeling” approach: if you suspect certain flights might only be borderline delayed or incorrectly labeled, assign them a probability label (e.g., 0.8 if it’s “Y” but uncertain). Then train a gradient boosting model capable of handling probability or regression outputs (transforming them back for classification). Compare if this soft approach changes predictions significantly.

*Technical note:*  
Select 10–20% of flights with ambiguous data patterns (like borderline `DepTime` anomalies) and set their label to 0.5 or 0.8. In XGBoost, treat the problem as a regression to predict these probabilities, then threshold at 0.5. Evaluate if the final classification improves in recall or precision.  

---

**Variant 14:**  
For any flights that are missing both `DepTime` and `Distance`, test a more advanced approach: cluster-based or iterative imputation. If the row is still unsalvageable, label it as “imputed with high uncertainty.” Compare a gradient boosting model that includes this uncertainty flag to one ignoring it. Evaluate if the uncertainty flag helps highlight flights with questionable data.

*Technical note:*  
If a row has both `DepTime` and `Distance` missing, apply a multi-step approach or cluster-based guess. Then mark `isHighUncertainty=1`. Fit XGBoost. Evaluate `roc_auc_score`. Summarize if the uncertainty feature is relevant or overshadowed by other signals.

---

**Variant 15:**  
Use robust scaling methods (e.g., `RobustScaler` from scikit-learn) on numeric columns to handle outliers gracefully, comparing them to standard min-max or z-score scaling. Fit XGBoost on each scaled variant. Summarize if the robust scaled dataset leads to more stable or better predictive performance, or if standard scaling is sufficient for tree-based models.

*Technical note:*  
Apply `RobustScaler` to `Distance` and `DepTime`. Also try `StandardScaler` or `MinMaxScaler`. Evaluate the XGBoost `f1_score` on each version. Summarize performance differences. Keep in mind that tree splits depend only on the ordering of feature values, so monotonic rescaling typically has little or no effect on tree-based models.
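
A small sketch of the comparison on synthetic stand-ins for `Distance` and `DepTime` (the data and the target rule are made up for illustration). To stay dependency-light it uses a single scikit-learn `DecisionTreeClassifier` instead of XGBoost; the same experiment can be repeated with `XGBClassifier`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy stand-ins for Distance (miles) and DepTime (hhmm); not the real data.
X = np.column_stack([
    rng.integers(50, 3000, size=500),
    rng.integers(0, 2400, size=500),
]).astype(float)
X[:5, 0] = 20000.0  # inject a few extreme Distance outliers
y = (X[:, 1] > 1700).astype(int)  # made-up target: late departures are "delayed"

X_robust = RobustScaler().fit_transform(X)      # centers on median, scales by IQR
X_standard = StandardScaler().fit_transform(X)  # centers on mean, scales by std


def tree_predictions(features):
    """Fit a shallow tree and return its predictions on the training rows."""
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    return tree.fit(features, y).predict(features)


same = np.array_equal(tree_predictions(X), tree_predictions(X_robust))
print("identical tree predictions after robust scaling:", same)
```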

---

**Variant 16:**  
Simulate partial data corruption: some fraction of flights in `Month` or `DayOfWeek` are overwritten with random valid values. Then train a gradient boosting model to see if it can still find robust patterns. Next, add an “isCorrupted” flag for these flights and compare if the model can learn to down-weight them. Evaluate changes in recall or precision.

*Technical note:*  
Randomly corrupt 10% of rows in `Month` or `DayOfWeek`. Then create `isCorrupted=1` for those. Compare `xgboost` classification with/without `isCorrupted`. Summarize changes in `precision`, `recall`.  
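
One way to simulate the corruption, assuming `Month` and `DayOfWeek` are integer columns. The helper `corrupt_columns` and the toy frame are hypothetical. Note that a random replacement can coincide with the original value, so slightly fewer than 10% of the values actually change.

```python
import numpy as np
import pandas as pd


def corrupt_columns(df, valid_values, fraction=0.10, seed=0):
    """Overwrite the given columns with random valid values in a random subset of rows."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_corrupt = int(round(fraction * len(out)))
    rows = rng.choice(len(out), size=n_corrupt, replace=False)
    out["isCorrupted"] = 0
    out.iloc[rows, out.columns.get_loc("isCorrupted")] = 1
    for col, values in valid_values.items():
        out.iloc[rows, out.columns.get_loc(col)] = rng.choice(values, size=n_corrupt)
    return out


toy = pd.DataFrame({
    "Month": np.tile(np.arange(1, 13), 20)[:200],
    "DayOfWeek": np.tile(np.arange(1, 8), 30)[:200],
})
valid = {"Month": np.arange(1, 13), "DayOfWeek": np.arange(1, 8)}
corrupted = corrupt_columns(toy, valid)
print(corrupted["isCorrupted"].value_counts())
```

Training once with and once without the `isCorrupted` column on the same corrupted frame isolates the effect of the flag.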

---

**Variant 17:**  
Perform feature-wise outlier detection using the interquartile range method on `DepTime` or `Distance`. Specifically, for each feature, remove rows that lie outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]`. Train an XGBoost model with the reduced dataset. Compare performance to using the full dataset. Summarize if ignoring outliers might remove important edge cases or is beneficial.

*Technical note:*  
Compute Q1, Q3 for each numeric feature, define `IQR=Q3-Q1`. Filter out rows that exceed the typical range. Evaluate `f1_score`. Summarize how many rows were removed and the effect on model performance or generalization.
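
A possible implementation of the IQR rule with pandas. The helper `iqr_filter` and the toy values are hypothetical. A row is kept only if it lies inside the fence for every listed column.

```python
import pandas as pd


def iqr_filter(df, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


toy = pd.DataFrame({
    "Distance": [200, 350, 400, 500, 650, 800, 900, 4900],
    "DepTime": [600, 730, 900, 1200, 1500, 1700, 1900, 2100],
})
filtered = iqr_filter(toy, ["Distance", "DepTime"])
print(f"removed {len(toy) - len(filtered)} of {len(toy)} rows")
```

The quartiles should be computed on the training split only and then reused, so that validation rows do not influence the fences.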

---

**Variant 18:**  
Evaluate the interplay of outlier removal with hyperparameter tuning. For instance, remove outliers in `Distance`, then tune `max_depth` and `n_estimators`. Compare the best found model’s performance to an equivalent tuning done on the full dataset. Conclude if outlier removal helps or if the tuned model can handle outliers anyway.

*Technical note:*  
Perform outlier detection (like IQR or percentile-based). Then run `GridSearchCV` with `param_grid={'max_depth': [3, 6], 'n_estimators': [100, 300]}`. Evaluate `roc_auc_score` (for example via `scoring='roc_auc'`). Repeat the same grid search on the unfiltered dataset. Summarize differences in best hyperparameters and final metrics.

---

**Variant 19:**  
Model flights with potential data entry mistakes, such as `DepTime > 2400`, by forcibly correcting them modulo 2400 (e.g., 2500 becomes 100). Evaluate if this leads to a more consistent dataset that helps XGBoost or simply introduces new confusion. Compare a baseline ignoring these flights to a model using the modulo-corrected times.

*Technical note:*  
Apply `DepTimeCorrected = DepTime % 2400`. Train XGBoost, measure `f1_score`. Also test dropping those flights entirely. Summarize which approach yields better performance or data coherence.
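
The two strategies can be sketched on a toy column (the values below are illustrative only). One edge case to decide on explicitly: `2400 % 2400` is `0`, so a value of exactly 2400 is rewritten to midnight even though it does not satisfy `DepTime > 2400`.

```python
import pandas as pd

toy = pd.DataFrame({"DepTime": [930, 2359, 2400, 2500, 2534]})

# Strategy A: wrap times past midnight back into the 0..2399 range.
corrected = toy.assign(DepTimeCorrected=toy["DepTime"] % 2400)

# Strategy B: drop the suspicious rows instead of correcting them.
dropped = toy[toy["DepTime"] <= 2400]

print(corrected)
print(f"rows kept after dropping: {len(dropped)} of {len(toy)}")
```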

---

**Variant 20:**  
Consolidate your entire missing-data and outlier-handling pipeline into a reproducible “data cleaning” function. Then demonstrate how the pipeline can be applied to both **flight_delays_train.csv** and **flight_delays_test.csv**. Evaluate a final gradient boosting model post-cleaning, discussing how robust data cleaning ensures consistent performance in real usage.

*Technical note:*  
Build a function `clean_flight_data(df)` that addresses outliers, missing values, and invalid codes. Apply it consistently to train and test sets. Fit XGBoost with a small parameter tune. Evaluate test set `accuracy_score` or `f1_score`. Summarize best practices for handing over such a pipeline to production teams.

---

<a class="anchor" id="lab-10.5"></a>

## <span style="color:blue; font-size:1.5em;">10.5. Advanced Ensemble Strategies $\&$ Interpretability</span>

[Back to the outline](#lab-10)

### <span style='color:red; font-size:1.4em;'>Task 5</span>

---
**Variant 1:**  
Train a LightGBM model and an XGBoost model on **flight_delays_train.csv**, then blend their predictions via simple average or weighted average. Compare the ensemble’s performance on **flight_delays_test.csv** to each individual model’s performance. Summarize the synergy between different boosting implementations.

*Technical note:*  
Fit `LGBMClassifier` and `XGBClassifier` with moderate hyperparameters. Get predicted probabilities, e.g. `p_lgbm` and `p_xgb`, then `p_ensemble = 0.5*p_lgbm + 0.5*p_xgb`. Evaluate with `accuracy_score` or `f1_score`. Summarize if the ensemble beats both models alone.

---

**Variant 2:**  
Build a two-level stacking ensemble: first-level models might be (1) XGBoost with default parameters, (2) a random forest, (3) a logistic regression. Then feed their predicted probabilities into a second-level XGBoost classifier that tries to best combine them. Evaluate the stacked model’s AUC or accuracy on a hold-out set. Reflect on the complexity vs. potential gains.

*Technical note:*  
Split training data into folds, train each base learner, store out-of-fold predictions. Then train a meta-learner XGBoost on these predictions. Evaluate final performance on a separate validation or test set. Summarize improvements in `roc_auc_score`.  
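
The out-of-fold procedure might look like the sketch below. It runs on synthetic data from `make_classification`, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost at both levels so the example needs only one library; swapping in `XGBClassifier` should not change the structure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in for the flight data.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

base_models = [
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Level 1: out-of-fold probabilities, so the meta-learner never sees a
# prediction made on a row the base model was trained on.
oof = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on all training rows to score the hold-out set.
hold_feats = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_hold)[:, 1] for m in base_models
])

# Level 2: meta-learner trained on the out-of-fold predictions.
meta = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)
meta.fit(oof, y_train)
stacked_auc = roc_auc_score(y_hold, meta.predict_proba(hold_feats)[:, 1])
print(f"stacked hold-out AUC: {stacked_auc:.3f}")
```

Scoring each base model alone on `X_hold` gives the baseline needed to judge whether the extra complexity pays off.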

---

**Variant 3:**  
Implement a gradient boosting approach that uses a custom loss function, such as a cost-sensitive loss prioritizing the detection of delayed flights. For instance, penalize false negatives more heavily. In XGBoost, define a custom objective if feasible. Compare the performance to the standard logistic loss, emphasizing recall or cost-weighted metrics.

*Technical note:*  
If custom objectives are allowed, define `grad` and `hess` in Python for an asymmetric cost function. Or approximate by adjusting `scale_pos_weight`. Evaluate recall. Summarize if the custom approach yields significantly better detection of delayed flights (Y) at the expense of some false positives.
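
As a starting point for the asymmetric cost function, the sketch below derives `grad` and `hess` for a log loss in which errors on delayed flights cost `fn_weight` times more, and verifies the gradient numerically. The function names and the weight value are hypothetical. To use it with XGBoost's native API, it would be wrapped in a function taking `(preds, dtrain)` that reads the labels from `dtrain.get_label()` and is passed as the `obj` argument of `xgb.train`.

```python
import numpy as np


def weighted_logloss(raw_score, y, fn_weight):
    """Log loss where errors on positives (y=1) cost `fn_weight` times more."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return -(fn_weight * y * np.log(p) + (1 - y) * np.log(1 - p))


def weighted_logloss_grad_hess(raw_score, y, fn_weight):
    """First and second derivatives of the loss w.r.t. the raw (pre-sigmoid) score."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    grad = p * (1 - y) - fn_weight * y * (1 - p)
    hess = p * (1 - p) * (1 - y + fn_weight * y)
    return grad, hess


# Finite-difference check of the analytic gradient on a few toy scores.
z = np.array([-2.0, -0.5, 0.0, 1.5])
y = np.array([1.0, 0.0, 1.0, 0.0])
eps = 1e-5
grad, hess = weighted_logloss_grad_hess(z, y, fn_weight=3.0)
num_grad = (weighted_logloss(z + eps, y, 3.0) - weighted_logloss(z - eps, y, 3.0)) / (2 * eps)
print("gradient matches finite differences:", np.allclose(grad, num_grad, atol=1e-6))
```

With `fn_weight=1` this reduces to the standard logistic loss, which makes a convenient sanity check against the built-in objective.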

---

**Variant 4:**  
Adopt CatBoost for the flight delay task, making use of its built-in categorical handling for `UniqueCarrier`, `Origin`, and `Dest`. Compare training time and final F1 score to an XGBoost pipeline with manual label encoding. Document any advantage CatBoost might have in handling many categorical features automatically.

*Technical note:*  
Use `CatBoostClassifier(cat_features=[list of categorical columns], iterations=300, depth=6)`. Evaluate `f1_score` or `roc_auc_score`. Summarize training speed, ease of setup, and final performance.  

---

**Variant 5:**  
Try a combination of bagging and boosting by training multiple XGBoost models on different bootstrap samples of **flight_delays_train.csv**, then averaging their predictions. Compare performance to a single XGBoost model with the same hyperparameters but full data. Conclude if bagging adds resilience to data anomalies or helps reduce variance.

*Technical note:*  
Generate e.g. 5 bootstrap samples (each the same size as the original training set, sampled with replacement). Train a new XGBoost on each. Average predicted probabilities. Evaluate `f1_score`. Summarize if the ensemble outperforms a single model or if it’s marginally different.

---

**Variant 6:**  
Experiment with a multi-output approach: treat `Month`, `DayOfMonth`, `DayOfWeek` as potential targets to see if a multi-task boosting method helps or if it complicates the classification of `dep_delayed_15min`. The idea is to see if predicting day/time patterns simultaneously can indirectly improve delay predictions. Compare single-task vs. multi-task frameworks if your library supports it.

*Technical note:*  
Conceptually, multi-task might not be natively supported in XGBoost, so this might be simulated or described theoretically. Summarize potential synergy vs. confusion. Evaluate with a standard metric for the main classification.  

---

**Variant 7:**  
Deploy multiple gradient boosting models specialized by month. For instance, build 12 separate XGBoost classifiers (one per month in the training set). In the test set, for a flight in month M, use the model specialized for M. Compare this approach to a single global model. Summarize if monthly specialization captures seasonal patterns or leads to data fragmentation.

*Technical note:*  
Split training data by `Month=1..12`. Train a distinct model for each. For the test set, route each flight to the model for that flight’s month. Evaluate final performance. Summarize if the improvement is worth the overhead of 12 separate models.  

---

**Variant 8:**  
Apply monotonicity constraints in XGBoost or LightGBM. A constraint forces the predicted probability to be non-decreasing or non-increasing in a chosen numeric feature. For example, it is not obvious that the probability of delay grows monotonically as `Distance` becomes very large, so test a constrained model against an unconstrained one. Show how to specify monotonic constraints for numeric features. Evaluate changes in performance or interpretability.

*Technical note:*  
For XGBoost, use the `monotone_constraints` parameter (`1` for non-decreasing, `-1` for non-increasing, `0` for unconstrained). For instance, set `-1` for `Distance` so that the predicted chance of delay cannot increase with distance. Verify that the model respects the constraint and summarize whether performance changes.  

---

**Variant 9:**  
Combine external textual data (hypothetical) with the flight delays to see if flight number or aircraft type is correlated with delays. Even though we can’t add new datasets, you can show a conceptual pipeline: transform textual codes into embeddings or counts, then feed them into a gradient boosting model. Evaluate if such textual features might help or are overshadowed by time and distance data.

*Technical note:*  
Conceptually describe reading an “aircraft_type” column, using TF-IDF or label encoding. Summarize how you’d append the embeddings to the numeric features. Evaluate classification metrics to see if new text-based or code-based features are important.  

---

**Variant 10:**  
Train an XGBoost model on the entire dataset. Then apply per-feature and per-instance Shapley values to interpret each feature’s contribution to predicted delay. Summarize typical patterns: e.g., high `Distance` lowers the chance of “Y,” or certain carriers raise it. Display a few example flights from the test set with their SHAP waterfalls to illustrate interpretability.

*Technical note:*  
Use the `shap` library: `explainer = shap.TreeExplainer(xgb_model)`. Then `shap_values = explainer.shap_values(X_test)`. Plot summary or individual force plots. Summarize interesting patterns.  

---

**Variant 11:**  
Compare LIME (Local Interpretable Model-Agnostic Explanations) to SHAP for a single trained gradient boosting model. Pick a few delayed flights and a few non-delayed flights, then generate local explanations with LIME and SHAP. Evaluate whether they align or contradict each other about which features matter for each instance.

*Technical note:*  
Use `lime.lime_tabular.LimeTabularExplainer` to produce local feature attributions. Compare with SHAP values for the same flights. Summarize how model-agnostic vs. model-specific approaches differ in explanation.  

---

**Variant 12:**  
Perform a fairness check: see if predictions are biased toward certain origins or carriers. For each `Origin`, measure the model’s false positive and false negative rates. Summarize if certain airports consistently get predicted as “delayed” or “not delayed” incorrectly. If so, propose how one might incorporate fairness constraints in gradient boosting.

*Technical note:*  
After training XGBoost, group test predictions by `Origin`. Compare real vs. predicted outcomes. Summarize false positives and false negatives per group. Identify any large discrepancies. Discuss potential fairness metrics or constraints (though XGBoost may not natively incorporate them).  
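
The per-group error rates can be computed with a pandas `groupby`, as in the sketch below. The helper `error_rates_by_group` and the toy labels are hypothetical, and labels are assumed to be encoded as 1 for delayed and 0 for not delayed.

```python
import pandas as pd


def error_rates_by_group(y_true, y_pred, groups):
    """False positive and false negative rates per group (1 = delayed)."""
    df = pd.DataFrame({"y": y_true, "pred": y_pred, "group": groups})
    df["fp"] = (df["y"] == 0) & (df["pred"] == 1)
    df["fn"] = (df["y"] == 1) & (df["pred"] == 0)
    df["neg"] = df["y"] == 0
    df["pos"] = df["y"] == 1
    sums = df.groupby("group")[["fp", "fn", "neg", "pos"]].sum()
    return pd.DataFrame({
        "fpr": sums["fp"] / sums["neg"],  # share of on-time flights flagged as delayed
        "fnr": sums["fn"] / sums["pos"],  # share of delayed flights that were missed
        "n": df.groupby("group").size(),
    })


y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]
origin = ["ATL"] * 4 + ["ORD"] * 4
rates = error_rates_by_group(y_true, y_pred, origin)
print(rates)
```

Groups with very few flights produce unstable (or undefined) rates, so the `n` column should be checked before drawing conclusions about a particular airport.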

---

**Variant 13:**  
Combine an XGBoost model with a calibrator (e.g., Platt scaling or isotonic regression) to refine predicted delay probabilities. Show if calibration improves reliability, measuring a calibration curve or Brier score. Summarize if post-calibration probabilities align better with actual delay frequencies.

*Technical note:*  
Train XGBoost, then wrap the classifier in `CalibratedClassifierCV` with `method='isotonic'` or `method='sigmoid'` (Platt scaling). Evaluate a calibration plot or the Brier score. Summarize whether calibration makes the predicted probabilities more reliable.  
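
A sketch of the comparison on synthetic, imbalanced data. scikit-learn's `GradientBoostingClassifier` is used here as a stand-in for XGBoost; `XGBClassifier` follows the scikit-learn estimator interface, so it should be usable in the same place.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 20% positives.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

raw = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# CalibratedClassifierCV wraps the estimator: it refits it on each CV fold and
# learns the calibration map from that fold's held-out predictions.
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(n_estimators=100, random_state=0),
    method="isotonic",
    cv=5,
).fit(X_train, y_train)

brier_raw = brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1])
brier_cal = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])
print(f"Brier raw: {brier_raw:.4f}  calibrated: {brier_cal:.4f}")
```

A lower Brier score is better; `sklearn.calibration.calibration_curve` provides the points for the reliability plot.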

---

**Variant 14:**  
Conduct error analysis: gather a subset of flights where the model is most uncertain (e.g., predicted probability near 0.5). Inspect their features manually (like borderline distances or departure times). Summarize patterns in these ambiguous flights. Then try adding specialized features or additional data transformations to see if you can reduce uncertainty.

*Technical note:*  
Predict probabilities for each flight. Filter where `p` is in `[0.4..0.6]` or near 0.5. Analyze features, possibly create new domain-inspired features. Retrain, then measure if overall uncertain predictions decrease in volume. Summarize your conclusions.  

---

**Variant 15:**  
Implement a multi-class extension artificially: classify flights as “on-time”, “slightly-delayed” (under 30 min), or “severely-delayed” (over 30 min). You can create “slightly-delayed” from the borderline cases if the dataset allows. Then train a gradient boosting classifier with multi-class capabilities. Evaluate confusion matrix across the three classes. Compare to the simpler binary classification.

*Technical note:*  
If data is available, define 3 categories or create them artificially. Use `XGBClassifier(objective='multi:softmax', num_class=3)`. Evaluate multi-class confusion matrix from `sklearn.metrics.confusion_matrix`. Summarize if this finer-grained approach helps.  

---

**Variant 16:**  
Apply a threshold-moving strategy. Train a standard XGBoost to get predicted probabilities for “Y.” Then systematically shift the classification threshold from 0.3 to 0.7. Plot precision vs. recall. Summarize the best threshold if your goal is high recall or high precision. Conclude if the standard 0.5 cutoff is optimal in flight delay detection.

*Technical note:*  
Generate predicted probabilities on a validation set. For thresholds in `[0.3..0.7, step=0.05]`, compute `precision_score` and `recall_score`. Plot or tabulate. Summarize the threshold that gives your desired balance.  
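
The sweep could be sketched as below. The helper `sweep_thresholds` is hypothetical, and the labels and probabilities are randomly generated placeholders for real validation output.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score


def sweep_thresholds(y_true, proba, thresholds):
    """Return (threshold, precision, recall) tuples, one per cutoff."""
    rows = []
    for t in thresholds:
        pred = (proba >= t).astype(int)
        rows.append((
            round(float(t), 2),
            precision_score(y_true, pred, zero_division=0),
            recall_score(y_true, pred, zero_division=0),
        ))
    return rows


# Toy validation labels and probabilities; replace with real model output.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=500)
proba_val = np.clip(0.5 + 0.25 * (y_val - 0.5) + rng.normal(0, 0.15, size=500), 0, 1)

results = sweep_thresholds(y_val, proba_val, np.arange(0.30, 0.701, 0.05))
for t, prec, rec in results:
    print(f"threshold={t:.2f}  precision={prec:.3f}  recall={rec:.3f}")
```

Recall can only fall as the threshold rises, so the choice comes down to how much precision is gained per point of recall lost.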

---

**Variant 17:**  
Implement a partial or incremental training approach: train XGBoost on the first half of the training data, then continue training on the second half by passing the first model through the `xgb_model` argument of `fit` (XGBoost has no `warm_start` parameter). Compare whether performance matches training on the full dataset at once. Summarize whether incremental fitting helps in an environment where flights arrive in real time.

*Technical note:*  
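XGBoost’s support for incremental learning is limited: by default, continued training appends new trees fitted on the new batch and leaves the existing trees unchanged. You can still attempt this naive approach (train once, then continue training). Evaluate performance differences. Summarize feasibility for real-time updates in flight delay predictions.  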

---

**Variant 18:**  
Create a model-evaluation pipeline that logs misclassified flights. Each time you train a gradient boosting model, store flights that were predicted incorrectly (false positives and false negatives). Inspect these logs for recurring patterns or shared features. Summarize potential data pipeline changes or additional features that might fix repeated misclassifications.

*Technical note:*  
Programmatically gather misclassified indexes and their features. Summarize repeated carriers, time windows, or airports. Possibly propose new features or data cleaning steps. Evaluate if re-training with additional features reduces these recurring mistakes.  

---

**Variant 19:**  
Combine your final gradient boosting classifier with a simple dynamic threshold that depends on the flight’s departure hour. For instance, use a lower threshold for flights departing at peak times, so that they are flagged as delayed more readily. Evaluate if this custom approach improves overall on-time detection while maintaining decent recall. Summarize how to tailor thresholds to different conditions.

*Technical note:*  
Compute predicted probabilities. If `DepHour` (the hour part of `DepTime`, e.g. `DepTime // 100`) falls in 6..9 or 16..19, use threshold=0.4; otherwise use threshold=0.5. Evaluate precision, recall, and overall accuracy. Summarize whether hour-based dynamic thresholds are beneficial or merely add complexity.  
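
The hour-dependent cutoff can be applied with a vectorized `np.where`, as sketched below. The helper `hour_based_predictions` is hypothetical, and the departure hour is assumed to be derived from an `hhmm` integer as `DepTime // 100`.

```python
import numpy as np


def hour_based_predictions(proba, dep_time, peak_threshold=0.4, base_threshold=0.5):
    """Classify as delayed (1) using a lower cutoff for peak-hour departures."""
    dep_hour = np.asarray(dep_time) // 100  # hhmm -> hour
    is_peak = ((dep_hour >= 6) & (dep_hour <= 9)) | ((dep_hour >= 16) & (dep_hour <= 19))
    thresholds = np.where(is_peak, peak_threshold, base_threshold)
    return (np.asarray(proba) >= thresholds).astype(int)


# Toy probabilities and departure times for illustration.
proba = np.array([0.45, 0.45, 0.55, 0.35])
dep_time = np.array([730, 1230, 1230, 1815])
preds = hour_based_predictions(proba, dep_time)
print(preds)
```

The first two flights share the same probability but receive different labels because only the 07:30 departure falls in a peak window.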

---

**Variant 20:**  
Outline a final “production-ready” pipeline that includes data cleaning, feature engineering, hyperparameter tuning, calibration, and interpretability steps. Provide a step-by-step narrative: loading the CSVs, cleaning outliers, encoding features, splitting data, tuning XGBoost, calibrating, and generating SHAP or feature-importance plots. Conclude how each stage ensures stable and interpretable flight delay predictions.

*Technical note:*  
Enumerate each stage, referencing relevant Python libraries: `pandas` for data prep, `sklearn` for splitting/tuning, `xgboost` for the model, `shap` for interpretation, etc. Summarize best practices like storing the pipeline, ensuring consistent transformations on test data, and monitoring real-time performance if deployed.

---