<h1 style="text-align:center;">Kaggle Masters</h1>

### Introduction

In this notebook, you will learn tips and techniques from Kaggle Masters who have used XGBoost to win Kaggle competitions. While we won't be entering a Kaggle competition ourselves, the skills you develop here apply broadly to improving your machine learning models. In particular, you will see why an extra hold-out set matters, engineer new data columns through mean encoding, use `VotingClassifier` and `VotingRegressor` to build non-correlated machine learning ensembles, and explore the benefits of stacking a final model.

This notebook will focus on the following key areas:

- Delving into Kaggle competitions
- Crafting new data columns through feature engineering
- Developing non-correlated ensembles
- Stacking final models

### Kaggle Competitions

This section looks at Kaggle competitions: their history, their structure, and the crucial role of hold-out sets, which are distinct from the validation sets used during model development.

Kaggle competitions serve as a battleground for machine learning practitioners, challenging them to outperform peers in predictive accuracy to win monetary rewards. A standout in these competitions has been XGBoost, a machine learning algorithm that gained fame for its consistent victories, often in conjunction with or against deep learning models like neural networks. Its winning streak is well-documented, with a list of victorious Kaggle competitions available on the Distributed (Deep) Machine Learning Community's GitHub page (https://github.com/dmlc/xgboost/tree/master/demo#machine-learning-challenge-winning-solutions) and a broader compilation on Kaggle (https://www.kaggle.com/sudalairajkumar/winning-solutions-of-kaggle-competitions).

XGBoost's entry into the spotlight was marked by its impressive performance in the 2014 Higgs Boson Machine Learning Challenge, where it rapidly climbed the leaderboard. Between 2014 and 2018, XGBoost was particularly dominant in competitions involving tabular data, which is structured in rows and columns, unlike unstructured data such as images or text that typically favor neural networks. However, with the 2017 advent of LightGBM by Microsoft, a fast and efficient gradient boosting alternative, XGBoost found a worthy competitor, especially in handling tabular data. For those interested in learning more about LightGBM, the introductory paper "LightGBM: A Highly Efficient Gradient Boosting Decision Tree" (https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf) is an excellent resource.

However, merely implementing powerful algorithms like XGBoost or LightGBM is not sufficient in the competitive landscape of Kaggle. Nor is fine-tuning model hyperparameters the sole answer. A holistic approach that includes individual model predictions, innovative data engineering, and the strategic combination of optimal models is key to achieving higher scores and securing wins in these competitions.

Understanding the structure of Kaggle competitions is crucial for grasping why certain strategies, like building non-correlated ensembles and stacking models, are prevalent and effective. This knowledge is not only academically enriching but also practical, especially for those considering participating in Kaggle competitions in the future.

**Key Insights:**

1. **Kaggle's Structure and Recommendations:** Kaggle competitions are organized on their website, providing a platform for machine learning enthusiasts to test and improve their skills. For those transitioning from basic to more advanced levels in machine learning, Kaggle suggests starting with specific competitions like "House Prices: Advanced Regression Techniques" (https://www.kaggle.com/c/house-prices-advanced-regression-techniques). Such competitions, often without cash prizes, focus on deepening knowledge and honing skills.

2. **Historical Context of XGBoost's Success:** A notable aspect of Kaggle competitions is the success stories of various algorithms. For instance, XGBoost, a highly efficient algorithm, has been a frequent winner in these competitions. The 2015 Avito Context Ad Clicks competition (https://www.kaggle.com/c/avito-context-ad-clicks/overview), won by XGBoost user Owen Zhang, is one such example. The widespread use of XGBoost in Kaggle competitions, especially before the publication of Tianqi Chen's influential paper "XGBoost: A Scalable Tree Boosting System" in 2016 (https://arxiv.org/pdf/1603.02754.pdf), showcases its early adoption and effectiveness in the field.

Understanding these aspects of Kaggle competitions can boost your confidence and readiness to engage in these challenges, whether for learning or competing at a higher level.

The following overview looks more closely at the structure of Kaggle competitions: the different datasets involved, how models are tested, and how these practices relate to real-world machine learning applications.

**Key Elements of Kaggle Competitions:**

1. **Competition Structure:** The Kaggle competition page typically includes several sections:
   - **Data:** Access to competition datasets.
   - **Notebooks:** A repository for shared solutions and starter notebooks.
   - **Discussion:** A forum for questions and answers.
   - **Leaderboard:** Displays top scores.
   - **Rules:** Details the competition's guidelines.
   - **Late Submission:** Indicates ongoing submissions post-competition, a common Kaggle policy.

2. **Joining and Downloading Data:** To participate, you must sign up for a free Kaggle account. The data is generally divided into `training.csv` for model building, and `test.csv` for model scoring. Initial scores are based on a public leaderboard, with a final model evaluated against a private test set at the competition's conclusion.

3. **Hold-Out Sets in Kaggle vs. General Practice:**
   - In both Kaggle and general machine learning practice, datasets are split into training and test sets. However, in Kaggle, the target values for the test set are withheld from participants to ensure fair competition.
   - **Training Set (`training.csv`):** Used for training and internal scoring, often split further into training and validation sets.
   - **Test Set (`test.csv`):** A separate hold-out set, used only for final model evaluation. Its purpose is to preserve competition integrity.

4. **Overfitting Risks and Real-World Relevance:** Overfitting to the test set, especially in pursuit of marginal leaderboard gains, is a common pitfall. A model that excels on training data but fails on unseen data is of little practical value. The real test of a model's efficacy is its performance on new, unknown data.

5. **Validating and Testing Models:**
   - **Initial Split:** Divide data into a training set and a hold-out set, avoiding any peek at the hold-out set.
   - **Further Split or Cross-Validation:** Use the training set for model fitting and validation, iteratively improving performance.
   - **Final Test:** Evaluate the final model on the hold-out set. If results are unsatisfactory, revisit model adjustments but refrain from using the hold-out set for further tuning.

6. **Kaggle's Unique Testing Approach:** Kaggle competitions often have a dual test set structure – a public set for ongoing scoring and a private set revealed at competition's end. Success in Kaggle requires excelling on the private test set.

7. **Impact on Machine Learning Practices:** The precision and rigor demanded in Kaggle competitions have spurred innovative techniques in machine learning. Understanding and applying these techniques can lead to the development of more robust models and a deeper comprehension of the field.

In summary, Kaggle competitions offer a structured and competitive environment for machine learning practitioners. They emphasize the importance of proper data handling, the perils of overfitting, and the significance of model validation and testing, paralleling real-world machine learning challenges and solutions.
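
The split, validate, and final-test workflow described above can be sketched as follows. This is a minimal illustration rather than a competition recipe: it uses scikit-learn's built-in breast cancer dataset, and the scaled `LogisticRegression` pipeline, the 80/20 split, and the `random_state` value are all assumptions chosen to keep the example small. Any estimator, including `XGBClassifier`, could take its place.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 1. Initial split: set the hold-out set aside and do not look at it again.
#    (The 20% size and the random_state are arbitrary choices.)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# 2. Iterate on the model using cross-validation on the training portion only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"Cross-validation accuracy: {cv_scores.mean():.3f}")

# 3. Final test: fit once on all training data, score once on the hold-out set.
model.fit(X_train, y_train)
holdout_score = model.score(X_holdout, y_holdout)
print(f"Hold-out accuracy: {holdout_score:.3f}")
```

If the hold-out score is disappointing, the honest next step is to go back to step 2, not to tune against `X_holdout`.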

### Feature Engineering

Feature engineering is a pivotal aspect of data science and machine learning, often consuming significant time and effort. In this section, we'll delve into using pandas to create new data columns through feature engineering.

**Understanding Feature Engineering:**
Feature engineering is the practice of generating new data columns from existing ones. The effectiveness of machine learning models heavily depends on the quality and robustness of the data they are trained on. When the available data is lacking, feature engineering becomes essential to enhance the dataset.

The aim is not just to decide whether to engage in feature engineering, but to determine the extent to which it should be applied. This process often involves creatively extracting and combining information from existing columns to form new, informative features.

**Application in Uber and Lyft Data:**
For instance, consider a dataset for predicting cab fares for services like Uber and Lyft. Feature engineering might involve creating new columns such as:
- **Distance:** Calculating the distance between the pickup and drop-off points.
- **Time of Day:** Extracting the part of the day (morning, afternoon, evening) from the timestamp, as fares might vary with time.
- **Day of the Week:** Identifying the day of the week, since weekends or specific weekdays might have different pricing dynamics.
- **Weather Conditions:** Integrating weather data, as adverse weather might affect fare prices.

This kind of dataset, along with many others, is available on Kaggle, such as the Uber and Lyft cab prices dataset (https://www.kaggle.com/ravi72munde/uber-lyft-cab-prices). By creatively manipulating and enriching this data, we can build more accurate and reliable predictive models.
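
As a rough sketch of two of the ideas listed above (time of day and day of the week), the snippet below derives both from a datetime column. The timestamps are made up for illustration, and the column name `pickup_time` and the bin edges for morning, afternoon, and evening are assumptions, not part of the cab dataset.

```python
import pandas as pd

# Made-up timestamps for illustration only; not taken from the cab dataset.
rides = pd.DataFrame({
    "pickup_time": pd.to_datetime([
        "2018-11-27 08:15:00",
        "2018-11-27 13:40:00",
        "2018-12-01 19:05:00",
    ])
})

# Bucket the hour into coarse parts of the day (bin edges are an assumption).
# right=False makes each bin include its left edge: [0, 12), [12, 17), [17, 24).
rides["time_of_day"] = pd.cut(
    rides["pickup_time"].dt.hour,
    bins=[0, 12, 17, 24],
    right=False,
    labels=["morning", "afternoon", "evening"],
)

# Day of the week as a readable name.
rides["day_name"] = rides["pickup_time"].dt.day_name()
print(rides)
```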

```python
import pandas as pd
import os
from pathlib import Path

# URL of the CSV file
url = 'https://media.githubusercontent.com/media/theAfricanQuant/XGBoost4machinelearning/main/data/cab_rides.csv'

# Reading the CSV file directly from the URL
df = pd.read_csv(url)

# Path for the data directory
data_dir = Path('data')

# Create a data folder if it doesn't exist
if not data_dir.exists():
    data_dir.mkdir()

# File path for saving, using the Path object
file_path = data_dir / 'cab_rides.csv'

# Save the DataFrame to a CSV file
df.to_csv(file_path, index=False)
```

In this code:
- `Path('data')` creates a `Path` object for the 'data' directory.
- `data_dir.mkdir()` is used instead of `os.makedirs('data')`.
- `file_path = data_dir / 'cab_rides.csv'` leverages `Path` object's `/` operator to concatenate paths in a platform-independent manner.

In [1]:
import numpy as np
import pandas as pd
import os
from pathlib import Path

from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier, XGBRFClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.ensemble import VotingClassifier
from sklearn.datasets import load_breast_cancer

from category_encoders.target_encoder import TargetEncoder

import datetime as dt

from helper_file import *

import warnings

warnings.filterwarnings('ignore')

In [2]:
# URL of the CSV file
url = 'https://media.githubusercontent.com/media/theAfricanQuant/XGBoost4machinelearning/main/data/cab_rides.csv'

# Reading the CSV file directly from the URL
df = pd.read_csv(url)

# Path for the data directory
data_dir = Path('data')

# Create a data folder if it doesn't exist
if not data_dir.exists():
    data_dir.mkdir()

# File path for saving, using the Path object
file_path = data_dir / 'cab_rides.csv'

# Save the DataFrame to a CSV file
df.to_csv(file_path, index=False)

In [3]:
df = pd.read_csv(file_path, nrows=10000)
df.sample(n=5, random_state=43)

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name
9415,2.32,Uber,1544727314997,Haymarket Square,Back Bay,15.0,1.0,12d1d3df-a56a-4eb3-8701-934b8a124a2b,6f72dfc5-27f1-42e8-84db-ccc7a75f6969,UberXL
6377,1.45,Lyft,1543887779314,Back Bay,Fenway,13.5,1.0,6935e9e9-1441-43c2-b12b-bde33ffeed6d,lyft_plus,Lyft XL
8019,2.42,Lyft,1544408018848,Beacon Hill,Fenway,9.0,1.0,dc4370ad-2fcd-4b1c-ac8e-3852ede9e2ce,lyft,Lyft
7754,1.28,Lyft,1543428682509,Haymarket Square,Financial District,16.5,1.0,f8d74ea0-9b61-41ae-87e8-f98cad871d5d,lyft_lux,Lux Black
4961,1.5,Uber,1544896513984,Back Bay,Fenway,7.0,1.0,21a5aa7e-1fa4-45a9-a6b9-69febaee879c,9a0e7b09-b92b-4c41-9779-2ad22b4d779d,WAV


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   distance          10000 non-null  float64
 1   cab_type          10000 non-null  object 
 2   time_stamp        10000 non-null  int64  
 3   destination       10000 non-null  object 
 4   source            10000 non-null  object 
 5   price             9227 non-null   float64
 6   surge_multiplier  10000 non-null  float64
 7   id                10000 non-null  object 
 8   product_id        10000 non-null  object 
 9   name              10000 non-null  object 
dtypes: float64(3), int64(1), object(6)
memory usage: 781.4+ KB


From the output above, we can see that the `price` column has null values: only 9,227 of the 10,000 entries are non-null, so 773 prices are missing.

In [5]:
df[df.isna().any(axis=1)]

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name
18,1.11,Uber,1543673584211,West End,North End,,1.0,fa5fb705-03a0-4eb9-82d9-7fe80872f754,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi
31,2.48,Uber,1543794776318,South Station,Beacon Hill,,1.0,eee70d94-6706-4b95-a8ce-0e34f0fa8f37,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi
40,2.94,Uber,1543523885298,Fenway,North Station,,1.0,7f47ff53-7cf2-4a6a-8049-83c90e042593,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi
60,1.16,Uber,1544731816318,West End,North End,,1.0,43abdbe4-ab9e-4f39-afdc-31cfa375dc25,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi
69,2.67,Uber,1543583283653,Beacon Hill,North End,,1.0,80db1c49-9d51-4575-a4f4-1ec23b4d3e31,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi
...,...,...,...,...,...,...,...,...,...,...
9949,1.08,Uber,1543272429665,North End,North Station,,1.0,74fffcba-da67-42d1-b585-13d546a125be,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi
9953,2.46,Uber,1545045010035,Beacon Hill,Fenway,,1.0,18c2e91d-d594-4a22-9be7-0a5829efa4bf,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi
9965,2.58,Uber,1544815809335,Beacon Hill,South Station,,1.0,77adadfb-4ac7-4cdf-aeab-6c4cfe8f7b26,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi
9985,1.89,Uber,1544695512211,Beacon Hill,North End,,1.0,f2dfa974-f9d1-4e90-a0e6-77f7eea16956,8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a,Taxi


All of the rows with a missing price displayed above are Uber `Taxi` rides, which suggests that prices for this product were simply never recorded. We will drop those rows.

In [6]:
(df
 .dropna()
)

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name
0,0.44,Lyft,1544952607890,North Station,Haymarket Square,5.0,1.0,424553bb-7174-41ea-aeb4-fe06d4f4b9d7,lyft_line,Shared
1,0.44,Lyft,1543284023677,North Station,Haymarket Square,11.0,1.0,4bd23055-6827-41c6-b23b-3c491f24e74d,lyft_premier,Lux
2,0.44,Lyft,1543366822198,North Station,Haymarket Square,7.0,1.0,981a3613-77af-4620-a42a-0c0866077d1e,lyft,Lyft
3,0.44,Lyft,1543553582749,North Station,Haymarket Square,26.0,1.0,c2d88af2-d278-4bfd-a8d0-29ca77cc5512,lyft_luxsuv,Lux Black XL
4,0.44,Lyft,1543463360223,North Station,Haymarket Square,9.0,1.0,e0126e1f-8ca9-4f2e-82b3-50505a09db9a,lyft_plus,Lyft XL
...,...,...,...,...,...,...,...,...,...,...
9995,3.05,Uber,1543504379037,Fenway,North Station,11.5,1.0,934d2fbe-f978-4495-9786-da7b4dd21107,997acbb5-e102-41e1-b155-9df7de0a73f2,UberPool
9996,3.05,Uber,1543800477997,Fenway,North Station,26.0,1.0,af8fd57c-fe7c-4584-bd1f-beef1a53ad42,6c84fd89-3f11-4782-9b50-97c468b19529,Black
9997,3.05,Uber,1543407083241,Fenway,North Station,19.5,1.0,b3c5db97-554b-47bf-908b-3ac880e86103,6f72dfc5-27f1-42e8-84db-ccc7a75f6969,UberXL
9998,3.05,Uber,1544896813623,Fenway,North Station,36.5,1.0,fcb35184-9047-43f7-8909-f62a7b17b6cf,6d318bcc-22a3-4af6-bddd-b409bfce1546,Black SUV


### Feature engineering numerical columns

Timestamp columns often store Unix time, the time elapsed since January 1st, 1970; in this dataset it is recorded in milliseconds. Specific time data can be extracted from the timestamp column that may help predict cab fares, such as the month, hour of the day, whether it is rush hour, and so on.

I will copy the last code and convert it into a function for the next stage of the feature engineering.

In [7]:
def feat_eng(df):
    return (df
            .dropna()
            .assign(date = pd.to_datetime(df['time_stamp']))
           )

df_cabs = feat_eng(df)
df_cabs.sample(n=5, random_state=43)

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name,date
3913,3.36,Lyft,1543317862189,Northeastern University,North Station,16.5,1.0,1bbad703-ab15-4ed2-ad0d-3f3da5240b8c,lyft_plus,Lyft XL,1970-01-01 00:25:43.317862189
1261,3.08,Uber,1543416685669,Northeastern University,West End,11.0,1.0,b38b4ca5-317a-4b4a-b15f-bb78e01a3a28,9a0e7b09-b92b-4c41-9779-2ad22b4d779d,WAV,1970-01-01 00:25:43.416685669
9354,0.6,Lyft,1544893805700,South Station,Theatre District,3.0,1.0,3099d03b-db48-4913-8871-948116a094d5,lyft_line,Shared,1970-01-01 00:25:44.893805700
3473,2.84,Uber,1543425428309,Fenway,West End,27.5,1.0,2e76f105-448d-4f91-adac-8881902118c6,6d318bcc-22a3-4af6-bddd-b409bfce1546,Black SUV,1970-01-01 00:25:43.425428309
3722,2.73,Uber,1543544581286,Back Bay,North End,15.0,1.0,9ca032f4-6e37-4c22-9688-d0d9536e82fb,55c66225-fbe7-4fd5-9072-eab1ece5e23e,UberX,1970-01-01 00:25:43.544581286


Taken at face value, the conversion says these rides took place in 1970, decades before Uber and Lyft existed. There must be something wrong with our conversion. The extra decimal places are another clue that the conversion is incorrect.

After trying several multipliers, we found that 10**6 gives the appropriate result. This makes sense: the column stores milliseconds, while `pd.to_datetime` reads plain integers as nanoseconds by default, and 10**6 converts one to the other. So we multiply before the conversion.
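
As a cross-check on the multiplier, the sketch below converts two timestamps from the sample output both ways. Passing `unit="ms"` is a standard `pd.to_datetime` option that states the unit explicitly; the notebook itself keeps the multiplier approach, so treat this as an alternative rather than the method used here.

```python
import pandas as pd

# Two time_stamp values copied from the sample rows shown above.
ts = pd.Series([1543317862189, 1544893805700])

# Multiplying by 10**6 turns milliseconds into nanoseconds (pandas' default unit).
via_multiplier = pd.to_datetime(ts * 10**6)

# Stating the unit directly gives the same datetimes.
via_unit = pd.to_datetime(ts, unit="ms")

print(via_unit)
print((via_multiplier == via_unit).all())
```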

In [8]:
def feat_eng(df):
    return (df
            .dropna()
            .assign(date = pd.to_datetime(df['time_stamp']*(10**6)))
           )

df_cabs = feat_eng(df)
df_cabs.sample(n=5, random_state=43)

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name,date
3913,3.36,Lyft,1543317862189,Northeastern University,North Station,16.5,1.0,1bbad703-ab15-4ed2-ad0d-3f3da5240b8c,lyft_plus,Lyft XL,2018-11-27 11:24:22.189
1261,3.08,Uber,1543416685669,Northeastern University,West End,11.0,1.0,b38b4ca5-317a-4b4a-b15f-bb78e01a3a28,9a0e7b09-b92b-4c41-9779-2ad22b4d779d,WAV,2018-11-28 14:51:25.669
9354,0.6,Lyft,1544893805700,South Station,Theatre District,3.0,1.0,3099d03b-db48-4913-8871-948116a094d5,lyft_line,Shared,2018-12-15 17:10:05.700
3473,2.84,Uber,1543425428309,Fenway,West End,27.5,1.0,2e76f105-448d-4f91-adac-8881902118c6,6d318bcc-22a3-4af6-bddd-b409bfce1546,Black SUV,2018-11-28 17:17:08.309
3722,2.73,Uber,1543544581286,Back Bay,North End,15.0,1.0,9ca032f4-6e37-4c22-9688-d0d9536e82fb,55c66225-fbe7-4fd5-9072-eab1ece5e23e,UberX,2018-11-30 02:23:01.286


With a datetime column, you can extract new columns, such as `month`, `hour`, and `dayofweek`, through the pandas `.dt` accessor.

In [9]:
def feat_eng(df):
    return (df
            .dropna()
            .assign(date = pd.to_datetime(df['time_stamp']*(10**6)),
                    month = lambda x: x['date'].dt.month,
                    hour = lambda x: x['date'].dt.hour,
                    dayofweek = lambda x: x['date'].dt.dayofweek)
           )

df_cabs = feat_eng(df)
df_cabs.sample(n=5, random_state=43)

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name,date,month,hour,dayofweek
3913,3.36,Lyft,1543317862189,Northeastern University,North Station,16.5,1.0,1bbad703-ab15-4ed2-ad0d-3f3da5240b8c,lyft_plus,Lyft XL,2018-11-27 11:24:22.189,11,11,1
1261,3.08,Uber,1543416685669,Northeastern University,West End,11.0,1.0,b38b4ca5-317a-4b4a-b15f-bb78e01a3a28,9a0e7b09-b92b-4c41-9779-2ad22b4d779d,WAV,2018-11-28 14:51:25.669,11,14,2
9354,0.6,Lyft,1544893805700,South Station,Theatre District,3.0,1.0,3099d03b-db48-4913-8871-948116a094d5,lyft_line,Shared,2018-12-15 17:10:05.700,12,17,5
3473,2.84,Uber,1543425428309,Fenway,West End,27.5,1.0,2e76f105-448d-4f91-adac-8881902118c6,6d318bcc-22a3-4af6-bddd-b409bfce1546,Black SUV,2018-11-28 17:17:08.309,11,17,2
3722,2.73,Uber,1543544581286,Back Bay,North End,15.0,1.0,9ca032f4-6e37-4c22-9688-d0d9536e82fb,55c66225-fbe7-4fd5-9072-eab1ece5e23e,UberX,2018-11-30 02:23:01.286,11,2,4


Next, we determine whether a day falls on a weekend by checking whether the `dayofweek` column equals 5 or 6, which represent Saturday and Sunday (pandas counts Monday as 0).

In [10]:
def feat_eng(df):
    return (df
            .dropna()
            .assign(date = pd.to_datetime(df['time_stamp']*(10**6)),
                    month = lambda x: x['date'].dt.month,
                    hour = lambda x: x['date'].dt.hour,
                    dayofweek = lambda x: x['date'].dt.dayofweek,
                    weekend = lambda x: np.where((x['dayofweek'] == 5) | (x['dayofweek'] == 6), 1, 0)
                   )
           )

df_cabs = feat_eng(df)
df_cabs.sample(n=5, random_state=43)

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name,date,month,hour,dayofweek,weekend
3913,3.36,Lyft,1543317862189,Northeastern University,North Station,16.5,1.0,1bbad703-ab15-4ed2-ad0d-3f3da5240b8c,lyft_plus,Lyft XL,2018-11-27 11:24:22.189,11,11,1,0
1261,3.08,Uber,1543416685669,Northeastern University,West End,11.0,1.0,b38b4ca5-317a-4b4a-b15f-bb78e01a3a28,9a0e7b09-b92b-4c41-9779-2ad22b4d779d,WAV,2018-11-28 14:51:25.669,11,14,2,0
9354,0.6,Lyft,1544893805700,South Station,Theatre District,3.0,1.0,3099d03b-db48-4913-8871-948116a094d5,lyft_line,Shared,2018-12-15 17:10:05.700,12,17,5,1
3473,2.84,Uber,1543425428309,Fenway,West End,27.5,1.0,2e76f105-448d-4f91-adac-8881902118c6,6d318bcc-22a3-4af6-bddd-b409bfce1546,Black SUV,2018-11-28 17:17:08.309,11,17,2,0
3722,2.73,Uber,1543544581286,Back Bay,North End,15.0,1.0,9ca032f4-6e37-4c22-9688-d0d9536e82fb,55c66225-fbe7-4fd5-9072-eab1ece5e23e,UberX,2018-11-30 02:23:01.286,11,2,4,0


The `weekend` column is built with NumPy's `np.where`, which evaluates the condition over the entire column at once. On large datasets this is typically much faster than applying a Python function to each row.

- `np.where` is a vectorized conditional function from NumPy. It works like an efficient, element-wise version of an if-else statement.
- `(df['dayofweek'] == 5) | (df['dayofweek'] == 6)` creates a boolean condition where `True` is assigned for weekend days (assuming 5 and 6 represent Saturday and Sunday, respectively).
- The first argument after the condition in `np.where` is the value to assign when the condition is `True` (in this case, `1`), and the second argument is the value for `False` (`0`).
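
To make the comparison concrete, here is a small sketch on a handful of made-up `dayofweek` values. It builds the weekend flag three ways: with `np.where` as in the notebook, with a row-by-row `apply` (shown only as the slower alternative), and with `Series.isin`, which is simply another vectorized spelling of the same condition.

```python
import numpy as np
import pandas as pd

# Made-up day-of-week values; pandas uses Monday=0 ... Sunday=6.
days = pd.Series([1, 2, 5, 6, 4], name="dayofweek")

# Vectorized: the comparison and np.where run over the whole column at once.
weekend_where = np.where((days == 5) | (days == 6), 1, 0)

# Row-by-row alternative, shown only for comparison; slower on large data.
weekend_apply = days.apply(lambda d: 1 if d in (5, 6) else 0)

# A shorter vectorized spelling of the same condition.
weekend_isin = days.isin([5, 6]).astype(int)

print(weekend_where)
```

All three produce the same flags, so the choice comes down to speed and readability.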

The same strategy can be used to create a `rush_hour` column by checking whether the ride starts between 6 and 10 AM (hours 6–9) or between 3 and 7 PM (hours 15–18) and does not fall on a weekend.

In [11]:
def feat_eng(df):
    # Define rush hour conditions
    rush_hours = [6, 7, 8, 9, 15, 16, 17, 18]
    
    return (df
            .dropna()
            .assign(date = pd.to_datetime(df['time_stamp']*(10**6)),
                    month = lambda x: x['date'].dt.month,
                    hour = lambda x: x['date'].dt.hour,
                    dayofweek = lambda x: x['date'].dt.dayofweek,
                    weekend = lambda x: np.where((x['dayofweek'] == 5) | (x['dayofweek'] == 6), 1, 0),
                    rush_hour = lambda x: np.where((np.isin(x['hour'], 
                                                            rush_hours) & (x['weekend'] == 0)), 1, 0)
                   )
           )

df_cabs = feat_eng(df)
df_cabs.sample(n=5, random_state=43)

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name,date,month,hour,dayofweek,weekend,rush_hour
3913,3.36,Lyft,1543317862189,Northeastern University,North Station,16.5,1.0,1bbad703-ab15-4ed2-ad0d-3f3da5240b8c,lyft_plus,Lyft XL,2018-11-27 11:24:22.189,11,11,1,0,0
1261,3.08,Uber,1543416685669,Northeastern University,West End,11.0,1.0,b38b4ca5-317a-4b4a-b15f-bb78e01a3a28,9a0e7b09-b92b-4c41-9779-2ad22b4d779d,WAV,2018-11-28 14:51:25.669,11,14,2,0,0
9354,0.6,Lyft,1544893805700,South Station,Theatre District,3.0,1.0,3099d03b-db48-4913-8871-948116a094d5,lyft_line,Shared,2018-12-15 17:10:05.700,12,17,5,1,0
3473,2.84,Uber,1543425428309,Fenway,West End,27.5,1.0,2e76f105-448d-4f91-adac-8881902118c6,6d318bcc-22a3-4af6-bddd-b409bfce1546,Black SUV,2018-11-28 17:17:08.309,11,17,2,0,1
3722,2.73,Uber,1543544581286,Back Bay,North End,15.0,1.0,9ca032f4-6e37-4c22-9688-d0d9536e82fb,55c66225-fbe7-4fd5-9072-eab1ece5e23e,UberX,2018-11-30 02:23:01.286,11,2,4,0,0


In this code:

- `np.isin(df['hour'], rush_hours)` checks if each element in `df['hour']` is in the list of rush hour times.
- `& (df['weekend'] == 0)` further filters the condition to non-weekend days.
- Finally, `np.where` is used to assign `1` where the condition is met, and `0` otherwise.
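
The combined condition can be checked on a tiny frame. The hour and weekend pairs below mirror three rows of the sample output above (a weekday at 11:00, a Saturday at 17:00, and a weekday at 17:00); the frame name `toy` and the intermediate variable names are only for illustration.

```python
import numpy as np
import pandas as pd

rush_hours = [6, 7, 8, 9, 15, 16, 17, 18]

# Hour/weekend pairs mirroring three rows of the sample output.
toy = pd.DataFrame({"hour": [11, 17, 17], "weekend": [0, 1, 0]})

in_rush_window = np.isin(toy["hour"], rush_hours)  # True where the hour is a rush hour
is_weekday = toy["weekend"] == 0                   # True where the day is not a weekend

# Both conditions must hold for a ride to count as rush hour.
toy["rush_hour"] = np.where(in_rush_window & is_weekday, 1, 0)
print(toy)
```

Only the weekday ride at 17:00 is flagged, which matches the `rush_hour` values in the output above.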

### Feature engineering categorical columns

Feature engineering often involves converting categorical data into numerical forms for machine learning models. Here's a concise overview of the process and alternatives:

1. **Standard Methods**: 
   - **`pd.get_dummies`**: This pandas function is commonly used to convert categorical columns into a series of binary (0s and 1s) columns, indicating the presence or absence of each category.
   - **`OneHotEncoder` from Scikit-learn**: Similar to `pd.get_dummies`, but it produces sparse matrices by default, which are memory-efficient and especially useful for large datasets with many categories. It is often a preferred option in scenarios such as XGBoost model deployment.

2. **Alternatives to Binary Encoding**: 
   - While binary encoding (0s and 1s) is standard, other methods might yield better results in certain cases.
   - **Frequency Encoding**: This method replaces each category with its frequency within the column, that is, the proportion of rows in which the category occurs. It provides a different numerical representation that might be more informative, especially in datasets where the frequency of categories is important.

In summary, while `pd.get_dummies` and `OneHotEncoder` are common techniques for handling categorical data, exploring alternative methods like frequency encoding can be beneficial, particularly when the relative frequency of categories carries significant information for the model.
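
The difference between the two representations is easiest to see side by side. The sketch below uses a four-row toy column (made up for illustration) and builds the frequency encoding with `value_counts(normalize=True)` and `map`, which is one of several ways to do it.

```python
import pandas as pd

# Toy column for illustration; not the real cab data.
cabs = pd.DataFrame({"cab_type": ["Uber", "Lyft", "Uber", "Uber"]})

# One-hot encoding: one 0/1 indicator column per category.
one_hot = pd.get_dummies(cabs["cab_type"], dtype=int)

# Frequency encoding: each category is replaced by its share of the rows.
shares = cabs["cab_type"].value_counts(normalize=True)
cabs["cab_freq"] = cabs["cab_type"].map(shares)

print(one_hot)
print(cabs)
```

One-hot encoding adds a column per category, while frequency encoding always adds a single numeric column, which matters when a feature has many distinct values.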

### Engineering frequency columns

To engineer a categorical column, such as `cab_type`, we will first view the number of values for each category.

In [12]:
df_cabs['cab_type'].value_counts()

cab_type
Uber    4654
Lyft    4573
Name: count, dtype: int64

Next, we use `groupby` to compute each category's count, divide it by the number of rows, and store the resulting frequency in a new column, `cab_freq`. We will do that inside the function we have been building.

In [13]:
def feat_eng(df):
    # Define rush hour conditions
    rush_hours = [6, 7, 8, 9, 15, 16, 17, 18]
    
    return (df
            .dropna()
            .assign(date = pd.to_datetime(df['time_stamp']*(10**6)),
                    month = lambda x: x['date'].dt.month,
                    hour = lambda x: x['date'].dt.hour,
                    dayofweek = lambda x: x['date'].dt.dayofweek,
                    weekend = lambda x: np.where((x['dayofweek'] == 5) | (x['dayofweek'] == 6), 1, 0),
                    rush_hour = lambda x: np.where((np.isin(x['hour'], 
                                                            rush_hours) & (x['weekend'] == 0)), 1, 0),
                    cab_freq = lambda x: x.groupby('cab_type')['cab_type'].transform('count')/len(x)
                   )
           )

df_cabs = feat_eng(df)
df_cabs.tail()

Unnamed: 0,distance,cab_type,time_stamp,destination,source,price,surge_multiplier,id,product_id,name,date,month,hour,dayofweek,weekend,rush_hour,cab_freq
9995,3.05,Uber,1543504379037,Fenway,North Station,11.5,1.0,934d2fbe-f978-4495-9786-da7b4dd21107,997acbb5-e102-41e1-b155-9df7de0a73f2,UberPool,2018-11-29 15:12:59.037,11,15,3,0,1,0.504389
9996,3.05,Uber,1543800477997,Fenway,North Station,26.0,1.0,af8fd57c-fe7c-4584-bd1f-beef1a53ad42,6c84fd89-3f11-4782-9b50-97c468b19529,Black,2018-12-03 01:27:57.997,12,1,0,0,0,0.504389
9997,3.05,Uber,1543407083241,Fenway,North Station,19.5,1.0,b3c5db97-554b-47bf-908b-3ac880e86103,6f72dfc5-27f1-42e8-84db-ccc7a75f6969,UberXL,2018-11-28 12:11:23.241,11,12,2,0,0,0.504389
9998,3.05,Uber,1544896813623,Fenway,North Station,36.5,1.0,fcb35184-9047-43f7-8909-f62a7b17b6cf,6d318bcc-22a3-4af6-bddd-b409bfce1546,Black SUV,2018-12-15 18:00:13.623,12,18,5,1,0,0.504389
9999,2.03,Lyft,1543812781166,Theatre District,Northeastern University,7.0,1.0,7f0e8caf-e057-41eb-bdef-27eb14c88122,lyft_line,Shared,2018-12-03 04:53:01.166,12,4,0,0,0,0.495611


Let's break down the code `df_cabs.groupby('cab_type')['cab_type'].transform('count') / len(df_cabs)`:

1. **Grouping by 'cab_type'**:
   - `df_cabs.groupby('cab_type')`: This part of the code groups the DataFrame `df_cabs` by the values in the 'cab_type' column. It essentially organizes the data such that rows with the same 'cab_type' value are grouped together.

2. **Selecting the 'cab_type' Column**:
   - `['cab_type']`: After grouping, this code selects the 'cab_type' column within each group. It's a bit redundant in this case since we're grouping by 'cab_type' and then selecting the same column, but it's necessary for the next step.

3. **Applying the Transform Function**:
   - `.transform('count')`: The `transform` function is applied to the 'cab_type' column of each group. The argument `'count'` tells transform to count the number of occurrences of each unique 'cab_type' value in the DataFrame. Unlike aggregation functions like `groupby().count()`, which reduce the data to the number of unique groups, `transform` maintains the original DataFrame's shape. It will replicate the count for each occurrence of 'cab_type'.

4. **Dividing by the Length of the DataFrame**:
   - `/ len(df_cabs)`: The total count for each 'cab_type' group obtained from `transform('count')` is then divided by the total number of rows in `df_cabs` (given by `len(df_cabs)`). This step converts the counts into frequencies, representing the proportion of each 'cab_type' category within the entire DataFrame.

In summary, this line of code calculates the frequency (as a proportion of the total number of rows) of each unique value in the 'cab_type' column. The result is a Series where each value corresponds to the frequency of the 'cab_type' for that row. This technique is often used in feature engineering to replace categorical variables with a numeric representation that reflects the prevalence of each category within the dataset.
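
The distinction between aggregating and transforming is worth seeing on a small example. The five-row frame below is made up for illustration; it shows that `count()` returns one value per category, whereas `transform('count')` returns one value per original row, which is what allows the result to be assigned as a new column.

```python
import pandas as pd

# Toy column for illustration; not the real cab data.
cabs = pd.DataFrame({"cab_type": ["Uber", "Lyft", "Uber", "Uber", "Lyft"]})
grouped = cabs.groupby("cab_type")["cab_type"]

per_group = grouped.count()           # aggregation: one row per category
per_row = grouped.transform("count")  # transform: one value per original row

# Dividing by the number of rows turns counts into frequencies.
cab_freq = per_row / len(cabs)

print(per_group)
print(cab_freq)
```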

### Mean Encoding in Feature Engineering
In the domain of machine learning, particularly in Kaggle competitions, **Mean Encoding** has emerged as a noteworthy technique. While it's not necessarily superior to one-hot encoding in every scenario, its effectiveness in certain competitions makes it a valuable method to consider.


**Mean Encoding**, also known as **Target Encoding**, is a technique in machine learning where categorical variables are replaced with the mean of the target variable. For example, if you have a categorical feature 'color' with the value 'orange', and this color corresponds to seven instances of the target variable being 1 and three instances of it being 0, the mean encoding for 'orange' would be 0.7 (calculated as 7/10).
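
The 'orange' example can be reproduced in a few lines of pandas. The sketch below is a plain, unregularized mean encoding computed by hand; the 'blue' rows are invented so that there is more than one category to encode, and the column names are assumptions.

```python
import pandas as pd

# Ten 'orange' rows (seven with target 1, three with target 0), plus a few
# invented 'blue' rows so there is a second category.
data = pd.DataFrame({
    "color": ["orange"] * 10 + ["blue"] * 4,
    "target": [1] * 7 + [0] * 3 + [1, 0, 0, 0],
})

# Mean of the target within each category.
category_means = data.groupby("color")["target"].mean()

# Replace each category with its target mean.
data["color_mean_enc"] = data["color"].map(category_means)

print(category_means)
```

Here 'orange' is encoded as 0.7, matching the calculation above, and 'blue' as 0.25.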

#### Key Points:

- **Purpose**: It transforms categorical columns into numerical values based on the mean of the target variable, providing a potentially more meaningful representation of the category in relation to the target.

- **Data Leakage Concerns**: One major issue with mean encoding is the risk of data leakage. Data leakage happens when information from the target variable influences the predictor variables, leading to overly optimistic performance estimates. This can occur if the mean encoding uses the target data in a way that is not representative of how the model will encounter data in the real world.

- **Regularization**: To mitigate data leakage and overfitting, regularization techniques are employed. Regularization adjusts the encoding to be more conservative, thereby reducing the model's complexity and its tendency to overfit on the training data.

- **Applicability**: Mean encoding is particularly effective for large datasets with a deep distribution of categorical values. It's most beneficial when the distribution of mean values in the training data is similar to that in the new, incoming data.

- **Scikit-learn's `TargetEncoder`**: To facilitate mean encoding while addressing data leakage and overfitting, scikit-learn offers the `TargetEncoder`. This tool automatically applies mean encoding with built-in options for regularization.

- **Comparative Analysis**: Mean encoding isn't presented as a blanket replacement for one-hot encoding. Instead, it's an alternative that has shown promise in specific contexts, particularly in Kaggle competitions.

- **Practical Application**: The value of mean encoding lies in its potential to improve model performance in certain datasets. By replacing categorical values with the mean of the target variable, it offers a different perspective on the data, which can sometimes lead to better predictive accuracy.

- **Resource for In-Depth Understanding**: For a detailed exploration of mean encoding, a comprehensive study is available on Kaggle, titled "Mean Likelihood Encodings: A Comprehensive Study" (available [here](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)). This resource provides extensive insights into the technique.
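
One common form of the regularization mentioned above is to shrink each category mean toward the global mean, so that rare categories do not receive extreme encodings. The sketch below is an assumption-laden illustration: the helper `smoothed_mean_encode`, the smoothing weight `m`, and the toy data are all hypothetical, and this is not necessarily the exact scheme any particular library uses.

```python
import pandas as pd

def smoothed_mean_encode(categories, target, m=10):
    """Blend each category mean with the global mean.

    m is a hypothetical smoothing weight: the larger it is, the more
    small categories are pulled toward the global mean.
    """
    global_mean = target.mean()
    stats = target.groupby(categories).agg(['mean', 'count'])
    smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
    return categories.map(smoothed)

# Hypothetical data: 'teal' appears only once, so its raw mean (1.0) is unreliable
toy = pd.DataFrame({
    'color': ['orange'] * 10 + ['teal'],
    'target': [1] * 7 + [0] * 3 + [1],
})
toy['color_smoothed'] = smoothed_mean_encode(toy['color'], toy['target'], m=10)

print(toy.drop_duplicates('color'))
```

With smoothing, the single 'teal' row is encoded well below its raw mean of 1.0, while the better-supported 'orange' category moves only slightly.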

#### Example:
Suppose a dataset for predicting house prices includes a categorical feature 'Neighborhood'. If houses in 'Neighborhood A' typically sell for higher prices than those in 'Neighborhood B', mean encoding will assign a higher numerical value to 'Neighborhood A' based on the average sale price (target variable) associated with it.

#### Conclusion:
Mean encoding is a powerful feature engineering technique tested in competitions like Kaggle. It converts categorical data into a numerical format that reflects the mean of the target variable, offering a nuanced representation of categories in relation to the target. However, its implementation must be handled carefully to avoid data leakage and overfitting, with regularization playing a crucial role in ensuring model robustness.
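
One way to limit the leakage described above is to encode each row using target means computed only from other rows. The following is a minimal out-of-fold sketch. The helper `out_of_fold_mean_encode`, its parameters, and the `rides` data are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_mean_encode(df, column, target, n_splits=5, seed=0):
    """Encode each row with target means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(df):
        fit_part = df.iloc[fit_idx]
        means = fit_part.groupby(column)[target].mean()
        # Categories unseen in the fitting folds fall back to the fitting-fold mean
        encoded.iloc[enc_idx] = (df.iloc[enc_idx][column]
                                 .map(means)
                                 .fillna(fit_part[target].mean())
                                 .to_numpy())
    return encoded

# Hypothetical data standing in for the cab rides
rides = pd.DataFrame({
    'cab_type': ['Uber', 'Lyft'] * 10,
    'price': np.arange(20, dtype=float),
})
rides['cab_type_oof'] = out_of_fold_mean_encode(rides, 'cab_type', 'price')

print(rides.head())
```

Because a row's own price never contributes to its encoding, the new column cannot simply memorize the target for that row.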

We continue to build our `feat_eng` function with the following steps.
```python
encoder = TargetEncoder()
```
After initializing encoder as above, we will next create a new column to apply the mean encoding using the `fit_transform` method on the encoder.

In [14]:
def feat_eng(df):
    # Define rush hour conditions
    rush_hours = [6, 7, 8, 9, 15, 16, 17, 18]

    #  initialize encoder
    encoder = TargetEncoder()
    return (df
            .dropna()
            .assign(date = pd.to_datetime(df['time_stamp']*(10**6)),
                    month = lambda x: x['date'].dt.month,
                    hour = lambda x: x['date'].dt.hour,
                    dayofweek = lambda x: x['date'].dt.dayofweek,
                    weekend = lambda x: np.where((x['dayofweek'] == 5) | (x['dayofweek'] == 6), 1, 0),
                    rush_hour = lambda x: np.where((np.isin(x['hour'], 
                                                            rush_hours) & (x['weekend'] == 0)), 1, 0),
                    cab_freq = lambda x: x.groupby('cab_type')['cab_type'].transform('count')/len(x),
                    cab_type_mean = encoder.fit_transform(df['cab_type'], df['price'])
                   )
           )

df_cabs = feat_eng(df)
df_cabs.tail()

| | distance | cab_type | time_stamp | destination | source | price | surge_multiplier | id | product_id | name | date | month | hour | dayofweek | weekend | rush_hour | cab_freq | cab_type_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9995 | 3.05 | Uber | 1543504379037 | Fenway | North Station | 11.5 | 1.0 | 934d2fbe-f978-4495-9786-da7b4dd21107 | 997acbb5-e102-41e1-b155-9df7de0a73f2 | UberPool | 2018-11-29 15:12:59.037 | 11 | 15 | 3 | 0 | 1 | 0.504389 | 15.743446 |
| 9996 | 3.05 | Uber | 1543800477997 | Fenway | North Station | 26.0 | 1.0 | af8fd57c-fe7c-4584-bd1f-beef1a53ad42 | 6c84fd89-3f11-4782-9b50-97c468b19529 | Black | 2018-12-03 01:27:57.997 | 12 | 1 | 0 | 0 | 0 | 0.504389 | 15.743446 |
| 9997 | 3.05 | Uber | 1543407083241 | Fenway | North Station | 19.5 | 1.0 | b3c5db97-554b-47bf-908b-3ac880e86103 | 6f72dfc5-27f1-42e8-84db-ccc7a75f6969 | UberXL | 2018-11-28 12:11:23.241 | 11 | 12 | 2 | 0 | 0 | 0.504389 | 15.743446 |
| 9998 | 3.05 | Uber | 1544896813623 | Fenway | North Station | 36.5 | 1.0 | fcb35184-9047-43f7-8909-f62a7b17b6cf | 6d318bcc-22a3-4af6-bddd-b409bfce1546 | Black SUV | 2018-12-15 18:00:13.623 | 12 | 18 | 5 | 1 | 0 | 0.504389 | 15.743446 |
| 9999 | 2.03 | Lyft | 1543812781166 | Theatre District | Northeastern University | 7.0 | 1.0 | 7f0e8caf-e057-41eb-bdef-27eb14c88122 | lyft_line | Shared | 2018-12-03 04:53:01.166 | 12 | 4 | 0 | 0 | 0 | 0.495611 | 16.916357 |


While one-hot encoding remains a standard and effective approach, exploring mean encoding can be beneficial, especially in complex datasets where this method might capture nuances in the data more effectively. Its use in Kaggle competitions underlines its potential for enhancing model performance.

### Advancing Feature Engineering Techniques

Feature engineering is a critical and ongoing process in data science, with potential for continuous improvement and innovation.

#### Extended Techniques:
- **Statistical Grouping**: Apply `groupby` to derive statistical measures from other columns. This can reveal insightful patterns based on categories.
- **Encoding Geographical Data**: For categorical data like destinations or arrival points, consider converting to geographical coordinates (latitude and longitude). From these, calculate distances using methods like taxicab (Manhattan) distance or Vincenty distance, which accounts for the Earth's ellipsoidal shape.
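
As a rough sketch of distance features, the code below computes a great-circle distance and a taxicab-style distance from latitude and longitude. Note the substitution: it uses the haversine formula, a simpler spherical approximation, rather than Vincenty's method. The function names and the coordinates are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def taxicab_km(lat1, lon1, lat2, lon2):
    """Taxicab (Manhattan) distance: a north-south leg plus an east-west leg."""
    north_south = haversine_km(lat1, lon1, lat2, lon1)
    east_west = haversine_km(lat2, lon1, lat2, lon2)
    return north_south + east_west

# Illustrative coordinates in the Boston area (assumed values)
point_a = (42.3467, -71.0972)
point_b = (42.3661, -71.0631)

print(haversine_km(*point_a, *point_b), taxicab_km(*point_a, *point_b))
```

Both functions accept NumPy arrays as well as scalars, so they can be applied to whole DataFrame columns at once.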

#### Kaggle Practices:
- **Massive Feature Creation**: In Kaggle competitions, creating thousands of new features is common to enhance model accuracy, even if the improvements are marginal.
- **Feature Selection**: Use techniques like `.feature_importances_` (from decision trees) to identify the most impactful features.
- **Avoiding Redundancy**: Eliminate highly correlated features to ensure diversity in the data, enhancing the robustness of models.
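
The last two practices can be sketched together on scikit-learn's breast cancer data. This is an illustration under assumptions: the 0.95 correlation cutoff is arbitrary, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer(as_frame=True)
X_df, y_s = data.data, data.target

# Rank features by impurity-based importance from a tree ensemble
forest = RandomForestClassifier(n_estimators=100, random_state=2).fit(X_df, y_s)
importances = pd.Series(forest.feature_importances_, index=X_df.columns)
print(importances.sort_values(ascending=False).head())

# Drop one feature from each highly correlated pair (0.95 is an arbitrary cutoff)
corr = X_df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X_df.drop(columns=to_drop)

print(X_reduced.shape)
```

Keeping only the upper triangle of the correlation matrix means each pair is inspected once, and the later column of a correlated pair is the one that gets dropped.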

#### Dealing with Missing Data:
- **Incorporating External Data**: In cases like a missing weather dataset for cab rides, you can independently research and integrate relevant external data (e.g., historical weather information).

#### The Art of Feature Engineering:
- **A Multifaceted Approach**: Effective feature engineering requires a blend of research, experimentation, and domain knowledge. It involves standardizing columns, assessing new features' impact on model performance, and ultimately selecting the most effective features.
- **Continuous Exploration**: The techniques mentioned are just a fraction of the possibilities. Ongoing exploration and learning are key to mastering this aspect of data science.

#### Moving Forward:
Having understood these diverse strategies for feature engineering, the next step in the journey is building non-correlated ensembles, which focuses on combining diverse models to improve predictive performance.

In summary, feature engineering is an expansive field requiring creativity, analytical skills, and a deep understanding of the data and the problem domain. It's a process of trial and improvement, where each new feature could potentially lead to a more robust and accurate model.

### Building Non-Correlated Ensembles

> "In our final model, we had XGBoost as an ensemble model, which included 20 XGBoost models, 5 random forests, 6 randomized decision tree models, 3 regularized greedy forests, 3 logistic regression models, 5 ANN models, 3 elastic net models and 1 SVM model."
> 
> – [Song, Kaggle Winner](https://hunch243.rssing.com/chan-68612493/all_p1.html)

Kaggle competitions' winning strategies often involve more than just a single machine learning model. The champions usually leverage an array of diverse models, forming what is known as an ensemble. However, these ensembles are not limited to standard boosting or bagging techniques like XGBoost or Random Forests. They are composite ensembles that integrate a variety of distinct models, which can include XGBoost, Random Forests, and other algorithms.

In this discussion, our focus will be on creating such non-correlated ensembles. The goal is to blend multiple machine learning models in a way that maximizes accuracy and minimizes the risk of overfitting. This approach is a critical factor in achieving top performance in competitive machine learning environments.

### Range of models

The Wisconsin Breast Cancer dataset, used to predict whether a patient has breast cancer, has 569 rows and 30 columns and is available from scikit-learn's `datasets` module.

Assign the predictor columns to `X` and the target column to `y` by setting the `return_X_y=True` parameter:

```python
X, y = load_breast_cancer(return_X_y=True)

```

In [15]:
X, y = load_breast_cancer(return_X_y=True)

# Prepare 5-fold cross-validation using StratifiedKFold for consistency

kfold = StratifiedKFold(n_splits=5)

In [16]:
def classification_model(model):

    scores = cross_val_score(model, X, y, cv=kfold)

    return scores.mean()

In [17]:
classification_model(XGBClassifier())

0.9771619313771154

In [18]:
classification_model(XGBClassifier(booster='gblinear'))

0.4730321378667909

In [19]:
classification_model(XGBClassifier(booster='dart', one_drop=True))

0.9683744760130415

For the dart booster, we set `one_drop=True` to ensure that trees are actually dropped.

In [20]:
classification_model(RandomForestClassifier(random_state=2))

0.9666356155876418

In [21]:
classification_model(LogisticRegression(max_iter=10000))

0.9508150908244062

Among these models, performance is good overall except for XGBoost with `gblinear` as the booster, which scores only about 0.47. We will not carry it into the next phase.

In [22]:
classification_model(XGBClassifier(max_depth=2, n_estimators=500, learning_rate=0.1))

0.9701133364384411

With a small set of models in hand, we turn our attention to the issue of correlations.

### Understanding and Applying Correlation in Machine Learning Ensembles

#### **1. Introduction to Correlation**
- **Definition**: Correlation is a statistical metric ranging from -1 to 1 that represents the strength and direction of a linear relationship between two variables.
  - **Correlation of 1**: Indicates a perfect positive linear relationship (a straight line).
  - **Correlation of 0**: No linear relationship is evident.
  - **Correlation of -1**: Indicates a perfect negative linear relationship.
- **Visual Examples**:
  - *Listed Correlations*: Visuals show that higher correlations align points closer to a straight line.
  - *Anscombe's Quartet*: Demonstrates how datasets with identical correlations (0.816 in this case) can display vastly different distributions. This illustrates that correlation provides useful information but doesn't fully describe the data relationship.
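
These points can be checked numerically. The sketch below computes the Pearson correlation with NumPy for an exact straight line and for the first two datasets of Anscombe's quartet. The variable names are hypothetical.

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)

# First two datasets of Anscombe's quartet (they share the same x values)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

r_line = np.corrcoef(x, 2 * x + 1)[0, 1]  # exact straight line
r1 = np.corrcoef(x, y1)[0, 1]             # roughly linear with noise
r2 = np.corrcoef(x, y2)[0, 1]             # smooth curve, not a line

print(round(r_line, 3), round(r1, 3), round(r2, 3))
```

The two quartet datasets produce nearly the same coefficient even though one is a noisy line and the other a curve, which is why correlation alone does not fully describe a relationship.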

#### **2. Correlation in Machine Learning Ensembles**
- **Goal**: To select non-correlated models for the ensemble.
- **Why Avoid High Correlation?**
  - If two models in an ensemble yield identical predictions, the second model doesn't add value. 
  - Ensembles benefit from diversity in model predictions. High correlation suggests similar predictions, reducing the ensemble's effectiveness.
- **Majority Rules Context**: In an ensemble using majority rules, a prediction is only incorrect if most models err. Therefore, having models that perform well individually but differ in their predictions (low correlation) is advantageous.
- **How to Compute Model Correlations**:
  - Obtain predictions from different models.
  - Merge these predictions into a DataFrame.
  - Utilize the `.corr` method in the DataFrame to calculate correlations between model predictions, enabling the identification of non-correlated, diverse models for the ensemble.

The focus here is not on selecting all possible models but on choosing those that offer diverse, non-correlated insights, essential for a robust machine learning ensemble.
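
A small numeric sketch shows why diversity matters under majority rules. The three prediction vectors below are made up: each hypothetical model is wrong on two rows, but on different rows.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])

# Three hypothetical models, each wrong on two different rows
model_a = np.array([0, 1, 1, 1, 0, 1])
model_b = np.array([1, 0, 0, 0, 0, 1])
model_c = np.array([1, 0, 1, 1, 1, 0])

def majority_vote(*predictions):
    """Return the class predicted by more than half of the binary classifiers."""
    votes = np.sum(predictions, axis=0)
    return (votes > len(predictions) / 2).astype(int)

diverse = majority_vote(model_a, model_b, model_c)
identical = majority_vote(model_a, model_a, model_a)

print((diverse == y_true).mean(), (identical == y_true).mean())
```

The diverse trio corrects every individual mistake, whereas three copies of the same model simply repeat that model's errors.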

In [23]:
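def y_pred(model):
    # Fit on the training split and predict on the test split
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    # accuracy_score expects the true labels first
    score = accuracy_score(y_test, predictions)
    print(score)

    return predictions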

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)

In [25]:
y_pred_gbtree = y_pred(XGBClassifier())

0.9790209790209791


In [26]:
y_pred_dart = y_pred(XGBClassifier(booster='dart', one_drop=True))

0.9790209790209791


In [27]:
y_pred_forest = y_pred(RandomForestClassifier())

0.993006993006993


In [28]:
y_pred_logistic = y_pred(LogisticRegression(max_iter=10000))

0.9790209790209791


In [29]:
y_pred_xgb = y_pred(XGBClassifier(max_depth=2, n_estimators=500, learning_rate=0.1))

0.986013986013986


Next, we concatenate the predictions into a new DataFrame using `np.c_` and compute correlations with the `.corr()` method:

In [30]:
df_pred = pd.DataFrame(data= np.c_[y_pred_gbtree, y_pred_dart, 
                       y_pred_forest, y_pred_logistic, y_pred_xgb], 
                       columns=['gbtree', 'dart','forest', 'logistic', 'xgb'])

df_pred.corr()

| | gbtree | dart | forest | logistic | xgb |
|---|---|---|---|---|---|
| gbtree | 1.0 | 1.0 | 0.967574 | 0.936398 | 0.984011 |
| dart | 1.0 | 1.0 | 0.967574 | 0.936398 | 0.984011 |
| forest | 0.967574 | 0.967574 | 1.0 | 0.968456 | 0.984011 |
| logistic | 0.936398 | 0.936398 | 0.968456 | 1.0 | 0.952321 |
| xgb | 0.984011 | 0.984011 | 0.984011 | 0.952321 | 1.0 |


1. **Correlation Among Models**: In ensemble learning, it's common to check the correlation among different models' predictions. A lower correlation between models generally indicates that they are capturing different aspects of the data, which can be beneficial in an ensemble. The diagonal showing a correlation of 1.0 simply reflects that any model is perfectly correlated with itself.

2. **Selecting Non-Correlated Models**: There's no absolute threshold for what constitutes "non-correlated" in this context. The idea is to select models that are less correlated with each other to maximize the diversity of the ensemble. In our case, starting from the tuned XGBoost model (`xgb`), logistic regression is its least correlated partner (0.952). The random forest ties with `gbtree` and `dart` at 0.984, but it comes from a different algorithm family, so we pair `xgb` with the random forest and logistic regression.

3. **Using VotingClassifier**: This is a technique in ensemble learning where multiple models are combined to make predictions. Each model in the ensemble votes for a class, and the class receiving the majority of the votes is chosen as the final prediction. The `VotingClassifier` in Python's `sklearn` library is a straightforward way to implement this.

#### Understanding the VotingClassifier Ensemble in Scikit-learn
The `VotingClassifier` ensemble in scikit-learn combines several classification models and, by default, uses majority voting to determine the final prediction. Scikit-learn provides a similar tool for regression tasks, `VotingRegressor`, which averages the outputs of several regression models.


In [31]:
# Initialize an empty list:
estimators = []

# Initialize the first model:
logistic_model = LogisticRegression(max_iter=10000)

# Append the model to the list as a tuple in the form (model_name, model):
estimators.append(('logistic', logistic_model))

# Repeat steps 2 and 3 as many times as desired:
xgb_model = XGBClassifier(max_depth=2, n_estimators=500, learning_rate=0.1)

estimators.append(('xgb', xgb_model))

rf_model = RandomForestClassifier(random_state=43)

estimators.append(('rf', rf_model))

# Initialize VotingClassifier (or VotingRegressor) using the list of models as input:
ensemble = VotingClassifier(estimators)

# Score the classifier using cross_val_score:
scores = cross_val_score(ensemble, X, y, cv=kfold)

print(scores.mean())

0.9666200900481291


We built the ensemble step by step above. Next, we wrap those steps in a single function so the process can be reused.

In [32]:
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

def evaluate_ensemble(models, X, y, cv):
    """
    Create and evaluate an ensemble model using VotingClassifier.

    Parameters:
    models (list of tuples): A list of tuples where each tuple contains the model name as a string 
                             and the model instance.
    X: Feature set.
    y: Target variable.
    cv: Cross-validation splitting strategy.

    Returns:
    float: The mean score of the ensemble model.
    """
    # Initialize an empty list for estimators
    estimators = []

    # Iterate over the provided models, initializing and adding them to the estimators list
    for model_name, model in models:
        estimators.append((model_name, model))

    # Initialize VotingClassifier using the list of models
    ensemble = VotingClassifier(estimators)

    # Score the classifier using cross_val_score
    scores = cross_val_score(ensemble, X, y, cv=cv)

    # Return the mean of the scores
    return scores.mean()


In [33]:
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

# Your models configurations
models = [
    ('logistic', LogisticRegression(max_iter=10000)),
    ('xgb', XGBClassifier(max_depth=2, n_estimators=500, learning_rate=0.1)),
    ('rf', RandomForestClassifier(random_state=43))
]

# Example usage
mean_score = evaluate_ensemble(models, X, y, kfold)
print(mean_score)


0.9666200900481291
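
For regression tasks, `VotingRegressor` follows the same pattern. The sketch below is an illustration under assumptions: the diabetes dataset, the two base regressors, and the variable names are choices made here and are not part of the notebook's workflow.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X_reg, y_reg = load_diabetes(return_X_y=True)

reg_ensemble = VotingRegressor([
    ('linear', LinearRegression()),
    ('rf', RandomForestRegressor(n_estimators=50, random_state=2)),
])

# For regressors, cross_val_score reports R^2 by default
reg_scores = cross_val_score(reg_ensemble, X_reg, y_reg,
                             cv=KFold(n_splits=5, shuffle=True, random_state=2))
print(reg_scores.mean())

# The ensemble prediction is the average of the fitted members' predictions
reg_ensemble.fit(X_reg, y_reg)
member_preds = np.column_stack([m.predict(X_reg) for m in reg_ensemble.estimators_])
print(np.allclose(reg_ensemble.predict(X_reg), member_preds.mean(axis=1)))
```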


### Stacking

Stacking is a powerful technique in machine learning, often utilized by Kaggle competition winners for its effectiveness. David Austin, a Kaggle winner, notably mentioned his preference for xgboost for stacking and boosting in an interview with PyImageSearch:

> "For stacking and boosting I use xgboost, again primarily due to familiarity and its proven results." (Source: [PyImageSearch Interview with David Austin](https://www.pyimagesearch.com/2018/03/26/interview-david-austin-1st-place-25000-kaggles-popular-competition/)).

**Understanding Stacking:**
Stacking is a method that involves two levels of machine learning models. The first level consists of base models that make predictions on the dataset. The second, or meta level, uses these predictions as its input to generate the final output. Unlike traditional models, the final model in a stacking approach does not use the original data directly but rather the outputs of the base models.
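
To make the two levels concrete, here is a hand-rolled sketch of stacking. It is an illustration under assumptions, not the notebook's implementation: the base models, the train/test split, and the variable names are chosen arbitrarily.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=2)

base = [RandomForestClassifier(n_estimators=50, random_state=2),
        DecisionTreeClassifier(max_depth=3, random_state=2)]

# Level 1: out-of-fold probabilities, so the meta-model never sees
# predictions made on rows a base model was trained on
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method='predict_proba')[:, 1]
    for m in base])

# Level 2: the meta-model is fit on the base predictions, not on X_tr
meta = LogisticRegression().fit(meta_train, y_tr)

# At prediction time, refit the base models on all training rows
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base])

print(meta.score(meta_test, y_te))
```

The meta-model here has only two input features, one per base model, which illustrates that it learns how to weigh the base predictions rather than the original 30 columns.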

**Success in Competitions:**
Stacked models are a common strategy in Kaggle competitions. Competitions often have deadlines for mergers, allowing individuals and teams to combine their efforts. This collaboration is beneficial in stacking as it enables the creation of more robust ensembles and the combination of diverse models, enhancing performance.

**Distinctive Feature of Stacking:**
The key differentiator of stacking from other ensemble methods is the meta-model. This model combines predictions from base models. For regression tasks, linear regression is commonly used as the meta-model, while logistic regression is a typical choice for classification tasks.

**Implementing Stacking in scikit-learn:**
Scikit-learn simplifies the implementation of stacking with `StackingRegressor` and `StackingClassifier`. The process mirrors the ensemble approach: select various base models, then use a linear or logistic regression model as the meta-model. This structured approach streamlines stacking and makes it accessible for various machine learning tasks.

In [34]:
# Create an empty list of base models:
base_models = []

# Append all base models to the base model list as tuples using the syntax (name, model):
base_models.append(('lr', LogisticRegression()))
base_models.append(('xgb', XGBClassifier()))
base_models.append(('rf', RandomForestClassifier(random_state=2)))

# Choose a meta model, preferably linear regression for regression and logistic regression for classification:
meta_model = LogisticRegression()

# Initialize StackingClassifier (or StackingRegressor) using base_models for estimators and meta_model for final_estimator:
clf = StackingClassifier(estimators=base_models, final_estimator=meta_model)

# Validate the stacked model using cross_val_score or any other scoring method:
scores = cross_val_score(clf, X, y, cv=kfold)
print(scores.mean())

0.9789318428815401


This is the strongest result yet.

As you can see, stacking outperformed the non-correlated ensemble from the previous section (about 0.979 versus 0.967). We now create a function that handles both classification and regression, then import models and datasets from scikit-learn to test it.

In [35]:
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import StackingClassifier, StackingRegressor
from sklearn.model_selection import cross_val_score

def create_stacked_model(base_models, task_type, X, y, cv):
    """
    Create and validate a stacked model for classification or regression.

    :param base_models: List of tuples (name, model instance).
    :param task_type: 'classification' or 'regression'.
    :param X: Feature data.
    :param y: Target data.
    :param cv: Cross-validation strategy.
    :return: Mean cross-validation score of the stacked model.
    """
    # Choose meta model based on task type
    if task_type == 'classification':
        meta_model = LogisticRegression()
        StackingModel = StackingClassifier
    elif task_type == 'regression':
        meta_model = LinearRegression()
        StackingModel = StackingRegressor
    else:
        raise ValueError("task_type must be 'classification' or 'regression'")

    # Initialize Stacking Model
    stacked_model = StackingModel(estimators=base_models, final_estimator=meta_model)

    # Validate the model
    scores = cross_val_score(stacked_model, X, y, cv=cv)
    return scores.mean()


In [36]:
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import numpy as np

# Example data (Iris dataset)
iris = load_iris()
X, y = iris.data, iris.target

# Define base models
base_models = [
    ('lr', LogisticRegression()),
    ('rf', RandomForestClassifier(random_state=2)),
    ('xgb', XGBClassifier())
]

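# Cross-validation strategy
# Note: the iris rows are ordered by class, so unshuffled KFold produces folds with skewed class balance
kfold = KFold(n_splits=5)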

# Call the function for a classification task
mean_score = create_stacked_model(base_models, 'classification', X, y, kfold)
print(f"Mean CV Score for Classification: {mean_score}")


Mean CV Score for Classification: 0.9066666666666666


In [37]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor
from sklearn.model_selection import KFold

# Fetch the California housing dataset
california_housing = fetch_california_housing()
X, y = california_housing.data, california_housing.target

# Define base models for regression
base_models = [
    ('ridge', Ridge()),
    ('rf', RandomForestRegressor(random_state=2)),
    ('xgb', XGBRegressor())
]

# Cross-validation strategy
kfold = KFold(n_splits=5)

# Call the function for a regression task
mean_score = create_stacked_model(base_models, 'regression', X, y, kfold)
print(f"Mean CV Score for Regression: {mean_score}")


Mean CV Score for Regression: 0.6795318996485826
