# Intermediate Training Summary
The topics covered in the intermediate machine learning course were:
- [Handling Missing Values](#Handling-Missing-Values)
- [Handling Categorical Values](#Handling-Categorical-Values)
- [Pipelines](#Pipelines)
- [Cross-Validation](#Cross-Validation)
- [XGBoost (Gradient Boosting)](#XGBoost-(Gradient-Boosting))
- [Data Leakage](#Data-Leakage)

<a id='Handling-Missing-Values'></a>
## Handling Missing Values
There are generally three ways to handle missing data (NaN values):

### 1. Drop columns with missing values
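The simplest of the three options is to drop every column that contains a missing value. This is a rather extreme approach and rarely ideal, because the model loses all the information in the dropped columns, even when only a few entries are missing.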

In [1]:
# Dropping columns example
# Get names of columns with missing values
cols_with_missing = [
    col for col in X_train.columns
    if X_train[col].isnull().any()
]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
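
The cell above assumes `X_train` and `X_valid` already exist. As a minimal, self-contained sketch of the same idea, the toy data below (hypothetical column names and values, not from the course dataset) shows how much can be lost when every column with a missing value is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are made up for illustration
X_train = pd.DataFrame({
    "rooms": [3, 2, 4, 3],
    "area": [120.0, np.nan, 200.0, 150.0],
    "year_built": [1990, 2005, np.nan, 2010],
})
X_valid = pd.DataFrame({
    "rooms": [2, 5],
    "area": [80.0, 250.0],
    "year_built": [np.nan, 1975],
})

# Pick the columns from the training data, then drop them from both sets
# so that training and validation keep the same features
cols_with_missing = [
    col for col in X_train.columns
    if X_train[col].isnull().any()
]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print(cols_with_missing)                  # ['area', 'year_built']
print(reduced_X_train.columns.tolist())   # ['rooms']
```

In this toy case two of the three features disappear because of a single missing entry each, which is why dropping columns is usually a last resort.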

### 2. Imputation
Imputation fills in each missing value with a number, typically the mean of the column. The imputed value is rarely exactly right, but this usually leads to more accurate models than dropping the column entirely.

In [0]:
# Imputation Example
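import pandas as pd
from sklearn.impute import SimpleImputer

# Fit the imputer on the training data only, then reuse it on the validation data
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))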

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
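
To make the effect of mean imputation concrete, here is a small sketch on hypothetical toy data (the column names and values are assumptions for illustration). It shows that the fill values are learned from the training set and then reused unchanged for the validation set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, made up for illustration
X_train = pd.DataFrame({
    "rooms": [3.0, 2.0, 4.0, 3.0],
    "area": [120.0, np.nan, 200.0, 160.0],
})
X_valid = pd.DataFrame({
    "rooms": [2.0, np.nan],
    "area": [np.nan, 250.0],
})

# "mean" is the default strategy; it is spelled out here for clarity
imputer = SimpleImputer(strategy="mean")

# Passing columns and index keeps the original labels on the result
imputed_X_train = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
imputed_X_valid = pd.DataFrame(
    imputer.transform(X_valid), columns=X_valid.columns, index=X_valid.index
)

# Per-column means learned from the training data: rooms 3.0, area 160.0
print(imputer.statistics_)
print(imputed_X_valid)
```

The missing validation entries are filled with the training means (3.0 for `rooms`, 160.0 for `area`), not with statistics computed from the validation data itself.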

### 3. Extended Imputation
Sometimes imputing produces values that are far from the true ones, or the fact that a value was missing is itself informative. In these cases, adding an extra column that marks which entries were originally missing can help accuracy.

In [0]:
# Extended Imputation Example
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
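
As an alternative sketch, not the approach used in the cell above, `SimpleImputer` can build the indicator columns itself through its `add_indicator` parameter. The toy data and the `_was_missing` naming below are assumptions chosen to mirror the manual version.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, made up for illustration
X_train = pd.DataFrame({
    "rooms": [3.0, 2.0, 4.0, 3.0],
    "area": [120.0, np.nan, 200.0, 160.0],
})
X_valid = pd.DataFrame({
    "rooms": [2.0, 5.0],
    "area": [np.nan, 250.0],
})

# add_indicator=True appends one 0/1 column for each feature that had
# missing values when the imputer was fitted
imputer = SimpleImputer(strategy="mean", add_indicator=True)
train_values = imputer.fit_transform(X_train)
valid_values = imputer.transform(X_valid)

# Name the indicator columns after the features they describe
flagged = X_train.columns[imputer.indicator_.features_]
columns = list(X_train.columns) + [f"{col}_was_missing" for col in flagged]

imputed_X_train_plus = pd.DataFrame(train_values, columns=columns, index=X_train.index)
imputed_X_valid_plus = pd.DataFrame(valid_values, columns=columns, index=X_valid.index)

print(imputed_X_valid_plus)
```

Only `area` gets an indicator column here, because it is the only feature with missing values in the training data. The indicator values come back as floats (1.0 and 0.0) rather than booleans.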

<a id='Handling-Categorical-Values'></a>
## Handling Categorical Values




<a id='Pipelines'></a>
## Pipelines

<a id='Cross-Validation'></a>
## Cross-Validation

<a id='XGBoost-(Gradient-Boosting)'></a>
## XGBoost (Gradient Boosting)

<a id='Data-Leakage'></a>
## Data Leakage