## XGBoost

Why use XGBoost?
- Execution Speed
- Model Performance
- Strong single model performance

**Note:**
- Internally, XGBoost treats every problem as a regression predictive modeling problem and accepts only numerical values as input.
- If the data is in a different form, it must be converted to the expected format:
  - Encode string output variables for classification as integers.
  - Prepare categorical input variables using one-hot encoding.
- XGBoost handles missing data automatically.
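
The two preparation steps in the note can be sketched on a toy example before applying them to a real dataset. The arrays `colors` and `labels` below are hypothetical and exist only for illustration; the sketch assumes scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data: one categorical input column and a string target.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
labels = np.array(["yes", "no", "no", "yes"])

# Label encode the string target. Classes are sorted, so "no" -> 0 and "yes" -> 1.
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(labels)

# One-hot encode the categorical input. The encoder returns a sparse matrix
# by default, so .toarray() is used to get a dense NumPy array.
input_encoder = OneHotEncoder()
X = input_encoder.fit_transform(colors).toarray()

print(y)        # integer class labels
print(X.shape)  # one column per distinct category
```

Each distinct category becomes its own binary column, so a feature with three categories expands into three columns.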

### Worked Example: One-Hot Encoding for Categorical Inputs and LabelEncoder for a String Target

In [1]:
# binary classification, breast cancer dataset, label and one hot encoded
from numpy import column_stack
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [2]:
# load data
data = read_csv('breast-cancer.csv', header=None)
dataset = data.values
# split data into X and y
X = dataset[:,0:9]
X = X.astype(str)
Y = dataset[:,9]

In [None]:
# encode string input values as integers
columns = []

# We one hot encode each feature after we have label encoded it.
# OneHotEncoder expects a 2-dimensional array, so the label-encoded feature is first
# reshaped into a column in which each integer value is a feature vector of length 1.

for i in range(0, X.shape[1]):
    label_encoder = LabelEncoder()
    # map the string categories of column i to integers
    feature = label_encoder.fit_transform(X[:,i])
    feature = feature.reshape(X.shape[0], 1)
    # scikit-learn 1.2 renamed this argument; older versions use sparse=False
    onehot_encoder = OneHotEncoder(sparse_output=False, categories='auto')
    feature = onehot_encoder.fit_transform(feature)
    columns.append(feature)
    
# Collapse columns into array
encoded_x = column_stack(columns)
print(f"X shape: {encoded_x.shape}")
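
As a possible alternative to the per-column loop, `OneHotEncoder` can be fitted on the whole string matrix at once, since it accepts string categories directly (assuming scikit-learn 0.20 or later). The sketch below uses a small hypothetical array `X_toy` in place of the real feature matrix; it is an illustration, not the approach the notebook itself takes.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the string feature matrix X:
# 3 rows, 2 categorical columns.
X_toy = np.array([
    ["30-39", "premeno"],
    ["40-49", "ge40"],
    ["30-39", "ge40"],
])

# Fit a single encoder on all columns. Each column gets its own
# sorted category set, and the encoded columns are laid out in feature order.
encoder = OneHotEncoder(categories="auto")
encoded_toy = encoder.fit_transform(X_toy).toarray()

print(f"X shape: {encoded_toy.shape}")
```

Because both approaches sort the categories within each column, the resulting column order should match the one produced by label encoding each feature and stacking the results.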

In [None]:
# Encode string class values as integers
label_encoder