# Step 1 Understand the problem
## data type:
- tabular data
- time series
- image/text data
    
## problem type
- classification
- regression 
- ranking

## evaluation metric
- ROC AUC (classification)
- F1 (classification)
- MSE (regression)
- MAE (regression)
- log loss (classification)
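
A minimal sketch of how the metrics above can be computed with `sklearn.metrics`. The labels and predictions are made-up toy values, and the 0.5 threshold for hard labels is an assumption for illustration.

```python
from sklearn.metrics import (
    f1_score,
    log_loss,
    mean_absolute_error,
    mean_squared_error,
    roc_auc_score,
)

# Toy binary classification labels and predicted probabilities (made up)
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(p >= 0.5) for p in y_prob]  # hard labels at an assumed 0.5 threshold

auc = roc_auc_score(y_true, y_prob)  # ranks probabilities, no threshold needed
f1 = f1_score(y_true, y_pred)        # needs hard labels
ll = log_loss(y_true, y_prob)        # penalizes confident wrong probabilities

# Toy regression targets and predictions (made up)
r_true = [3.0, 5.0, 2.0]
r_pred = [2.5, 5.0, 4.0]
mse = mean_squared_error(r_true, r_pred)
mae = mean_absolute_error(r_true, r_pred)

print(auc, f1, ll, mse, mae)
```

Note that ROC AUC and log loss take probabilities, while F1 takes thresholded labels.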


# Step 2 EDA
Goals
- size of the data
- properties of the target variable: unbalanced, skewed
- properties of the features: look for peculiarities and for dependencies between features and the target variable
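
A rough sketch of these checks in pandas. The DataFrame and its column names (`rooms`, `price`, `sold`) are hypothetical toy data, not the dataset used later in these notes.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only
df = pd.DataFrame({
    "rooms": [1, 2, 2, 3, 4, 5],
    "price": [100, 150, 160, 210, 300, 900],
    "sold": [0, 0, 0, 0, 1, 0],
})

# Size of the data
n_rows, n_cols = df.shape

# Target properties: class balance for a binary target, skew for a numeric one
class_share = df["sold"].value_counts(normalize=True)
price_skew = df["price"].skew()  # a positive value means a long right tail

# Linear dependency between each feature and a numeric target
corr_with_price = df.corr()["price"].drop("price")

print(n_rows, n_cols)
print(class_share)
print(price_skew)
print(corr_with_price)
```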

# Step 3 Feature Engineering
## Encode
### label encode:
- assign a different integer (0, 1, 2, ...) to each category, which implies an ordering between categories
- this is okay for tree-based models, but harmful for linear models
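
A small sketch of label encoding in pandas, assuming a toy `interest_level` column. The explicit `order` mapping is a hypothetical addition that shows how to control the implied ranking when the categories really are ordered.

```python
import pandas as pd

df = pd.DataFrame({"interest_level": ["low", "high", "medium", "low"]})

# Arbitrary integer codes: pandas sorts the categories alphabetically first,
# so the implied order (high < low < medium) is meaningless here
df["interest_code"] = df["interest_level"].astype("category").cat.codes

# If the categories have a real order, state it explicitly instead
order = {"low": 0, "medium": 1, "high": 2}
df["interest_ordinal"] = df["interest_level"].map(order)

print(df)
```
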
### one-hot encode:
- 1 for the row's category and 0 for all others
- pd.get_dummies(df[''], prefix=''), then drop the original feature
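
A minimal sketch of the `pd.get_dummies` step, using made-up data and an assumed `interest` prefix.

```python
import pandas as pd

df = pd.DataFrame({
    "interest_level": ["low", "high", "medium", "low"],
    "price": [3000, 2500, 2800, 3500],
})

# One indicator column per category
dummies = pd.get_dummies(df["interest_level"], prefix="interest")

# Drop the original feature and attach the indicator columns
df = pd.concat([df.drop(columns="interest_level"), dummies], axis=1)

print(df.columns.tolist())
```
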
### target encode
- mean target encoding
    1. on train, calculate the mean of the target by category -> apply to test
    2. split train into K folds, calculate the mean on (K-1) folds, apply to the Kth fold
    3. add the mean target encoded feature to the model
- practical guide: a rare category with only one or two rows gets a mean encoding of exactly 0 or 1 (for a binary target)
    1. smoothing
        - smoothed_mean_enc of category i = (target_sum of category i + alpha * global mean) / (size of category i + alpha)
        - alpha is usually between 5 and 10
    2. new category
        - fill new categories with the global mean
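
To see what smoothing does, here is a tiny worked sketch in plain Python. The helper name `smoothed_mean` and the numbers (global mean 0.2, alpha 5) are assumptions chosen for illustration.

```python
def smoothed_mean(target_sum, size, global_mean, alpha=5):
    """Smoothed mean target encoding for a single category."""
    return (target_sum + alpha * global_mean) / (size + alpha)


global_mean = 0.2  # assumed share of positives in the train data

# Rare category: one row, target = 1. The raw mean would be exactly 1.0.
rare = smoothed_mean(target_sum=1, size=1, global_mean=global_mean)

# Common category: 1000 rows, 500 positives. The raw mean is 0.5.
common = smoothed_mean(target_sum=500, size=1000, global_mean=global_mean)

# Category never seen in train: fall back to the global mean
unseen = global_mean

print(rare, common, unseen)
```

The rare category is pulled strongly toward the global mean, while the common category barely moves.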
 
#### target encoding type:
- For _binary classification_ usually __mean target encoding__ is used
- For _regression_ __mean could be changed to median, quartiles, etc__.
- For _multi-class classification_ with N classes we create __N features with target mean for each category in one vs. all fashion__
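
A sketch of the multi-class case in pandas, under assumed toy data (`city` as the category, `label` as a 3-class target). For brevity it computes the means on the full train set with no smoothing; in practice the out-of-fold scheme described above would still apply.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "label": ["x", "y", "x", "x", "z", "z"],
})

# One 0/1 indicator column per class (one vs. all)
indicators = pd.get_dummies(train["label"], prefix="label").astype(float)

# Mean of each indicator per category: N target-mean values per category
class_means = indicators.groupby(train["city"]).mean()

# Map the per-class means back onto the rows as N new features
encoded = class_means.loc[train["city"]].reset_index(drop=True)
encoded.columns = [f"city_te_{c}" for c in class_means.columns]

print(encoded)
```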

## Missing Data Imputation
### Numerical data
- mean/median imputation
- constant value imputation
    - e.g. fill missing values with -999; this works for tree-based models, but is not good for linear models
### Categorical data
- most frequent category imputation
- new category imputation
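
A short sketch of both categorical strategies with pandas. The `color` column and the `"MISSING"` label are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", None, "blue", "red", None]})

# Most frequent category imputation (mode ignores missing values)
most_frequent = df["color"].mode()[0]
df["color_mode"] = df["color"].fillna(most_frequent)

# New category imputation: missingness becomes its own level
df["color_new"] = df["color"].fillna("MISSING")

print(df)
```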

# Step 4 Modeling
### Hyperparameter optimization

| model type | feature engineering | hyperparameter tuning |
| :------------- | :----------: | -----------: |
| classic ML | +++ | + |
| deep learning | no need to do | +++ |

### Model ensembling
#### 1. Model blending: average of multiple models' predictions
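
A minimal blending sketch with NumPy. The two prediction arrays are made up and stand in for the test-set output of two already trained models.

```python
import numpy as np

# Made-up test-set predictions from two already trained models
gb_pred = np.array([10.0, 12.0, 8.0])
rf_pred = np.array([11.0, 11.0, 9.0])

# Blending: element-wise average of the models' predictions
blend = np.mean([gb_pred, rf_pred], axis=0)

print(blend)
```
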
#### 2. Model stacking: train multiple single models, take their predictions, and use these predictions as features in a 2nd level model.
So, we need to perform the following steps:
1. Split train data into two parts
2. Train multiple models on Part 1
3. Make predictions on Part 2
4. Make predictions on the test data
5. Train a new model on Part 2 using predictions as features. This model is called the __2nd level model or meta-model__.
6. Make predictions on the test data using the 2nd level model. These predictions are the stacking output.


# EXAMPLE OF MEAN TARGET SPLIT

In [98]:
def test_mean_target_encoding(train, test, target, categorical, alpha=5):
    # Calculate global mean on the train data
    global_mean = train[target].mean()
    
    # Group by the categorical feature and calculate its properties
    train_groups = train.groupby(categorical)
    category_sum = train_groups[target].sum()
    category_size = train_groups.size()
    
    # Calculate smoothed mean target statistics
    train_statistics = (category_sum + global_mean * alpha) / (category_size + alpha)
    
    # Apply statistics to the test data and fill new categories
    test_feature = test[categorical].map(train_statistics).fillna(global_mean)
    print(train_statistics)
    return test_feature.values

In [99]:
import pandas as pd

def train_mean_target_encoding(train, target, categorical, kf_object, alpha=5):
    # Empty out-of-fold feature; the explicit dtype avoids the empty-Series dtype warning
    train_feature = pd.Series(index=train.index, dtype='float64')

    # For each fold split
    for train_index, valid_index in kf_object:
        cv_train, cv_valid = train.iloc[train_index], train.iloc[valid_index]

        # Calculate out-of-fold statistics on cv_train and apply them to cv_valid
        cv_valid_feature = test_mean_target_encoding(cv_train, cv_valid, target, categorical, alpha)

        # Save new feature for this particular fold
        train_feature.iloc[valid_index] = cv_valid_feature
    return train_feature.values

In [100]:
def mean_target_encoding(train, test, target, categorical, kf_object,  alpha=5):
  
    # Get the cv train feature
    train_feature = train_mean_target_encoding(train, target, categorical, kf_object, alpha)
    
    # Get the test feature
    test_feature = test_mean_target_encoding(train, test, target, categorical, alpha)
    
    # Return new features to add to the model
    return train_feature, test_feature

In [102]:
from sklearn.model_selection import StratifiedKFold

# Create a StratifiedKFold object
str_kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)
str_kf_obj = str_kf.split(train_data, train_data['interest_level'])
rtn_1 = train_mean_target_encoding(train=train_data, target='price', categorical='interest_level', kf_object=str_kf_obj, alpha=10)

interest_level
high      2700.972007
low       4002.263539
medium    3152.018955
dtype: float64
interest_level
high      2687.557962
low       4302.890160
medium    3161.397332
dtype: float64
interest_level
high      2725.533743
low       4224.189221
medium    3165.572935
dtype: float64




# EXAMPLE OF MISSING VALUE IMPUTATION

In [34]:
from sklearn.impute import SimpleImputer

# Mean imputation
mean_imputer = SimpleImputer(strategy='mean')
# Constant value imputation
constant_imputer = SimpleImputer(strategy='constant', fill_value=-999)
# fit_transform expects 2-D input, so select the columns with double brackets:
# mean_imputer.fit_transform(df[[]])

In [38]:
train_data.isnull().sum()

bathrooms          0
bedrooms           0
description        0
display_address    0
features           0
latitude           0
longitude          0
price              0
street_address     0
interest_level     0
dtype: int64

# EXAMPLE OF STACKING

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# 1.  Split train data into two parts
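part_1, part_2 = train_test_split(train, test_size=0.5, random_state=123)
# Work on an explicit copy: adding columns to a slice can raise SettingWithCopyWarning
part_2 = part_2.copy()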

# 2. Train a Gradient Boosting model on Part 1
gb = GradientBoostingRegressor().fit(part_1[features], part_1.fare_amount)

# Train a Random Forest model on Part 1
rf = RandomForestRegressor().fit(part_1[features], part_1.fare_amount)

# 3. Make predictions on the Part 2 data
part_2['gb_pred'] = gb.predict(part_2[features])
part_2['rf_pred'] = rf.predict(part_2[features])

# 4. Make predictions on the test data
test['gb_pred'] = gb.predict(test[features])
test['rf_pred'] = rf.predict(test[features])

from sklearn.linear_model import LinearRegression

# 5. Train the 2nd level model on Part 2: a linear regression without the intercept
lr = LinearRegression(fit_intercept=False)
lr.fit(part_2[['gb_pred', 'rf_pred']], part_2.fare_amount)

# Look at the model coefficients: a higher coefficient means more weight on that model
print(lr.coef_)

# 6. Make stacking predictions on the test data
test['stacking'] = lr.predict(test[['gb_pred', 'rf_pred']])
