# Titanic: model refinement ideas (blog post summary)

Mainly a summary of the following blog post on Kaggle: Titanic - Advanced Feature Engineering Tutorial ([link](https://www.kaggle.com/code/gunesevitan/titanic-advanced-feature-engineering-tutorial)). The summaries after each section in the post are very helpful.

# Section 1: Exploratory data analysis

1. `Age`, `Cabin` and `Embarked` are the variables with missing values in the training set. Gunes has a function that displays how many missing values there are in each variable.
1. `Age` varies a lot by `Sex`, `Pclass` etc., so imputation should take that into account.
1. `Embarked` and `Fare` (test set) are only missing for a couple of people. Use imputation based on the characteristics of those people.
1. `Cabin` is a very interesting feature.
    1. The first letter of the cabin tells you which deck it was on. Decks **A**, **B** and **C** were only for first class, **D** and **E** were for all classes, and **F** and **G** were for 2nd and 3rd class passengers. You can plot these distributions if you would like. Going from **A** to **G**, the distance to the staircase increases, which might be a factor in survival.
    1. You can plot survival probability by deck.
    1. You can group decks together to reduce dimensionality
1. Gunes created a new feature `Deck` and dropped the `Cabin` feature.
1. You can plot correlations between variables. There are a lot of correlated variables.
1. You can plot distributions of numeric and categorical variables by survival. Gunes' plots are counts, my plots are "prob survival". I like my version more. 
1. Split points / spikes are visible in continuous features. They can be captured easily with a decision tree model, but linear models may not be able to spot them.
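
The missing-value count, group-aware `Age` imputation and `Deck` extraction above can be sketched roughly as follows. This is a minimal sketch on a made-up six-row frame, not Gunes' exact code: the helper name `missing_counts`, the choice of the group median, and the `"M"` placeholder for a missing cabin are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample using the Titanic column names (not real passenger data).
df = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male", "female"],
    "Pclass": [1, 1, 3, 3, 1, 3],
    "Age": [40.0, np.nan, 20.0, np.nan, 50.0, 30.0],
    "Cabin": ["C85", np.nan, "G6", np.nan, "B28", "F G73"],
})


def missing_counts(frame):
    """Count missing values per column, keeping only columns that have gaps."""
    counts = frame.isnull().sum()
    return counts[counts > 0]


missing = missing_counts(df)

# Fill Age with the median of passengers sharing the same Sex and Pclass
# (assumed grouping; any correlated features could be used instead).
df["Age"] = df.groupby(["Sex", "Pclass"])["Age"].transform(
    lambda s: s.fillna(s.median())
)

# Deck is the first letter of Cabin; "M" (assumed placeholder) marks a missing cabin.
df["Deck"] = df["Cabin"].fillna("M").str[0]

print(missing)
print(df[["Age", "Deck"]])
```

Treating a missing cabin as its own deck keeps the "cabin unknown" signal instead of throwing those rows away.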


# Section 2: Feature Engineering

1. Bin `Fare` using `pd.qcut(df['Fare'], 13)`, plot survival counts with new fare variable.
1. Bin `Age` the same way, use 10 quantile based bins, plot survival counts with this.
1. Encode `Family_Size` by combining `SibSp` and `Parch`. Values are **Alone**, **Small**, **Medium** and **Large**. Plot survival by this variable. 
1. See how many people are on each `Ticket` and create a `Ticket_Frequency` feature from the count: `df_all['Ticket_Frequency'] = df_all.groupby('Ticket')['Ticket'].transform('count')`. This is similar to `Family_Size`.
1. Create a `Title` feature from the honorific in `Name` (the word between the comma and the period, e.g. `Mr` in `Braund, Mr. Owen Harris`). Create an `Is_Married` feature based on the title `Mrs`.
    1. Nice way to do ifelses: `df_all['Title'] = df_all['Title'].replace(['Miss', 'Mrs','Ms', 'Mlle', 'Lady', 'Mme', 'the Countess', 'Dona'], 'Miss/Mrs/Ms')`
1. Extract surnames and create a `Family` feature based on it. Create a family survival rate feature. 
1. Add ticket and family survival rates in both the train and test data based on the average survival rate in the ticket or family in the training data. This seems pretty sketchy because it looks like we are leaking the target into the features, so I don't think that I will do it.
1. Label encode non-numerical features:
    1. Non-numeric features are converted to a numerical type with `LabelEncoder`, which maps the n classes of a feature to the integers 0 to n-1 in a single column. Separate columns like `Embarked_0`, `Embarked_1` only appear after the one-hot step.
    1. `non_numeric_features = ['Embarked', 'Sex', 'Deck', 'Title', 'Family_Size_Grouped', 'Age', 'Fare']`
    1. `LabelEncoder().fit_transform(df[feature])`
1. One-Hot encode categorical features:
    1. `cat_features = ['Pclass', 'Sex', 'Deck', 'Embarked', 'Title', 'Family_Size_Grouped']`
    1. `OneHotEncoder().fit_transform(df[feature].values.reshape(-1, 1)).toarray()`
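
The feature engineering steps above can be sketched on a toy frame as follows. The rows are made up, 2 quantile bins stand in for the post's 13 so that the tiny sample works, and the family-size cut points (1, 2-4, 5-6, 7+) are my assumption about how **Alone** / **Small** / **Medium** / **Large** are defined.

```python
import numpy as np
import pandas as pd

# Made-up passengers in the Titanic "Surname, Title. Given names" format.
df = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Smith, Mrs. John (Anna)",
             "Jones, Miss. Mary", "Brown, Master. Tom"],
    "SibSp": [1, 1, 0, 3],
    "Parch": [0, 0, 0, 1],
    "Ticket": ["T1", "T1", "T2", "T3"],
    "Fare": [26.0, 26.0, 8.0, 21.0],
})

# Quantile-based binning; labels=False returns the bin index instead of an interval.
df["Fare_Bin"] = pd.qcut(df["Fare"], 2, labels=False)

# Family size counts the passenger plus siblings/spouses and parents/children.
df["Family_Size"] = df["SibSp"] + df["Parch"] + 1
df["Family_Size_Grouped"] = pd.cut(
    df["Family_Size"],
    bins=[0, 1, 4, 6, np.inf],  # assumed cut points
    labels=["Alone", "Small", "Medium", "Large"],
).astype(str)

# Number of passengers sharing each ticket.
df["Ticket_Frequency"] = df.groupby("Ticket")["Ticket"].transform("count")

# Title is the text between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
df["Is_Married"] = (df["Title"] == "Mrs").astype(int)

# Collapse related titles into one group, as in the replace() trick above.
df["Title_Grouped"] = df["Title"].replace(["Miss", "Mrs", "Ms"], "Miss/Mrs/Ms")

print(df.drop(columns=["Name"]))
```

`Is_Married` has to be computed before the titles are grouped, otherwise `Mrs` is no longer distinguishable from `Miss`.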



Final features for classifier in blog post:
1. `Age`: Binned into 10 quantile based bins
1. `Fare`: Binned into 13 quantile based bins
1. `Deck`: One hot encoded into four decks
1. `Title`: One hot encoded into categories (*Miss/Mrs/Ms*, *Dr/Military/Noble/Clergy*, *Master* (a male under age of 26), *Mr*)
1. `Family_Size_Grouped`: One hot encoded into four "family sizes"
1. `Embarked`: One hot encoded into the three cities
1. `Is_Married`: One hot encoded 
1. `Pclass`: One hot encoded into three classes
1. `Sex`: One hot encoded
1. `Survival_Rate`: Average of family and ticket survival rates (don't do this...seems like data leakage).
1. `Survival_Rate_NA`: Whether the survival rate is NA or not.
1. `Ticket_Frequency`: How many people are on the person's ticket (this clearly seems like train/test leakage, since it is computed on the combined train and test data).

**Label Encoding** vs **One-Hot Encoding**: Label encoding maps each class to a single integer, which implies an order (1 is bigger than 0, 2 is bigger than 1) even when the classes have none. One-hot encoding gives each class its own 0/1 column, so no order is implied.
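
A small sketch of the difference, using four made-up `Embarked` values. `LabelEncoder` sorts the classes, so here `C`, `Q`, `S` become 0, 1, 2.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up Embarked values for four passengers.
embarked = np.array(["S", "C", "Q", "S"])

# Label encoding: one integer column, classes sorted alphabetically.
label_codes = LabelEncoder().fit_transform(embarked)

# One-hot encoding: one 0/1 column per class (columns ordered C, Q, S).
one_hot = OneHotEncoder().fit_transform(embarked.reshape(-1, 1)).toarray()

print(label_codes)
print(one_hot)
```

A linear model would read the label-encoded column as "S is twice Q", which is why the one-hot version is the safer input for nominal features.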

My roadmap for improving my classifier:
1. *[done]* Bin training set `Age` and `Fare`, and make survival plots based on the quantiles. 
1. *[done]* Figure out how to impute `NaN`s taking feature correlations into account.
1. *[done]* Get a full pipeline to work with an advanced imputation methodology
1. *[done]* Modify code to use cross-validation for AUC metrics (not just one train/test split).
    1. Figure out how to summarize AUC and PR metrics across folds (use `np.interp`?)

1. *[done]* NLP `Cabin` and `Name` features to extract `Deck` and `Title`. Plot survival vs these new features.
    1. Make an `Is_Married` feature.
1. *[done]* Create a `Family_Size_Grouped` feature by exploring and analyzing training set family size (`Parch` and `SibSp`). Plot survival rates by these features.
1. Feature importance plot for model evaluation
1. Hyperparameter optimization
1. Combine notebooks to be one clearly defined narrative. Have one top cell that does most of the feature engineering.

Ideas I've already done
1. One hot encode `Embarked`, `Pclass`, `Sex`.
1. Train models with continuous and binned `Age` / `Fare` variables to see which is better.

My history of improvements to the model
1. Baseline AUC ~0.84
1. Using better imputing (eg IterativeImputer) improves AUC to ~0.85 (~13 columns)
1. Discretizing Age and Fare improves AUC to ~0.87. The age/fare buckets are one-hot encoded (~41 columns)

**Jan 21 2023** -- actually the AUC for discretizing vs standardizing age and fare is about the same when you do 10-fold cross validation -- it's between 0.850 and 0.853

**March 22 2023** Let's see whether sklearn pipelines are the right thing to do, or whether we should just do the pre-processing in a function. Pre-processing steps:
1. `Age` and `Fare` should be binned or standardized
1. Make a family size column based on `SibSp` and `Parch`. 
1. *[done]* One hot encode `Pclass`, `Sex`, `Embarked`
1. *[done]* Create a `Deck` feature, one hot encode it, potentially use a `CabinNo` feature
1. *[done]* Create a `Married` feature, extract `Title` feature from names

`numeric_cols = ['SibSp', 'Parch', 'CabinNo', 'Fare', 'Age']`

`categorical_cols = ['Pclass', 'Sex', 'Embarked', 'Deck', 'Title']`
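
One way the pipeline option could look with these two column lists is sketched below. It assumes the engineered columns (`CabinNo`, `Deck`, `Title`) already exist, uses made-up rows, and uses `SimpleImputer` with the median as a simple stand-in for the more advanced imputation; the classifier settings are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ['SibSp', 'Parch', 'CabinNo', 'Fare', 'Age']
categorical_cols = ['Pclass', 'Sex', 'Embarked', 'Deck', 'Title']

# Made-up rows with the engineered columns already present.
X = pd.DataFrame({
    'SibSp': [1, 0, 0, 3, 1, 0],
    'Parch': [0, 0, 0, 1, 0, 2],
    'CabinNo': [85.0, np.nan, np.nan, 6.0, 28.0, np.nan],
    'Fare': [71.0, 8.0, 7.5, 21.0, 53.0, 11.0],
    'Age': [38.0, 26.0, np.nan, 4.0, 35.0, np.nan],
    'Pclass': [1, 3, 3, 2, 1, 3],
    'Sex': ['female', 'female', 'male', 'male', 'female', 'male'],
    'Embarked': ['C', 'S', 'Q', 'S', 'S', 'S'],
    'Deck': ['C', 'M', 'M', 'F', 'B', 'M'],
    'Title': ['Mrs', 'Miss', 'Mr', 'Master', 'Mrs', 'Mr'],
})
y = np.array([1, 1, 0, 0, 1, 0])

# Numeric columns: fill gaps with the median, then standardize.
numeric_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Route each column list to its own preprocessing.
preprocess = ColumnTransformer([
    ('num', numeric_pipe, numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('clf', RandomForestClassifier(n_estimators=20, random_state=0)),
])
model.fit(X, y)

X_t = model.named_steps['preprocess'].transform(X)
proba = model.predict_proba(X)[:, 1]
print(X_t.shape, proba.shape)
```

The main argument for the pipeline over a plain preprocessing function is that imputation and scaling statistics are re-fit inside each cross-validation fold, so nothing from the validation rows leaks into training.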

General notebook structure
1. Start with data exploration and feature engineering
1. Then go into model training?

**April 3 2023** Prep for meeting Maggie

1. Check if updating sklearn leads to you being able to get feature names out
1. Use full data (not just Kaggle training set)
1. Put together a pipeline implementation, maybe try rotating imputers / disc vs scale for hyperparameter optimization
1. Do hyperparameter optimization
1. Make a feature importances plot
1. Plot learning curve / check for overfitting?

Open questions / observations about the process:
1. Imputation / preprocessing is really important. 
1. How important is being fastidious about train test leakage (eg. should we make dummy variables, do mean scaling only using training set?)
1. How should we plot PR curves / ROC curves after k-fold cross validation?
1. How do we mix hyperparameter optimization / gridsearch with model structure selection? Maybe we can add scale vs discretize to the hyperparameter grid? [Example](https://towardsdatascience.com/getting-the-most-out-of-scikit-learn-pipelines-c2afc4410f1a)
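
On the last question, a pipeline step can itself be a grid parameter, so "scale vs discretize" can be searched like any other hyperparameter. Below is a minimal sketch on synthetic data; the step names `prep` and `clf`, the logistic regression classifier, and the bin counts are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# Synthetic stand-in for continuous features such as Age and Fare.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

pipe = Pipeline([
    ('prep', StandardScaler()),  # placeholder step, swapped out by the grid
    ('clf', LogisticRegression(max_iter=1000)),
])

# Each dict is one branch of the search: scaling, or discretizing with n_bins tuned.
param_grid = [
    {'prep': [StandardScaler()]},
    {
        'prep': [KBinsDiscretizer(encode='onehot-dense', strategy='quantile')],
        'prep__n_bins': [3, 5],
    },
]

search = GridSearchCV(
    pipe,
    param_grid,
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the preprocessing lives inside the pipeline, it is fit on the training part of each fold only, which also answers the leakage question for scaling and binning.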

Useful blogposts:

[Hyperparameter optimization with pipelines](https://towardsdatascience.com/getting-the-most-out-of-scikit-learn-pipelines-c2afc4410f1a)

[How to keep feature names with pipelines](https://medium.com/@anderson.riciamorim/how-to-keep-feature-names-in-sklearn-pipeline-e00295359e31)

# Section 3: Model

1. Standard scale all columns
1. Random forest models (leaderboard model overfits to the test data, single best model is more resilient):

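    ```
    from sklearn.ensemble import RandomForestClassifier

    SEED = 42  # assumed value; any fixed integer makes runs reproducible

    # max_features='auto' meant 'sqrt' for classifiers and was removed in
    # scikit-learn 1.3, so 'sqrt' is used here to keep the same behavior.
    single_best_model = RandomForestClassifier(criterion='gini',
                                               n_estimators=1100,
                                               max_depth=5,
                                               min_samples_split=4,
                                               min_samples_leaf=5,
                                               max_features='sqrt',
                                               oob_score=True,
                                               random_state=SEED,
                                               n_jobs=-1,
                                               verbose=1)

    leaderboard_model = RandomForestClassifier(criterion='gini',
                                               n_estimators=1750,
                                               max_depth=7,
                                               min_samples_split=6,
                                               min_samples_leaf=6,
                                               max_features='sqrt',
                                               oob_score=True,
                                               random_state=SEED,
                                               n_jobs=-1,
                                               verbose=1)
    ```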

1. Use a `StratifiedKFold` with 5 splits to train the models, and get AUC scores.
1. Get predictions and feature importances:
    1. `leaderboard_model.predict_proba(X_test)[:, 1]`
    1. `leaderboard_model.feature_importances_`
1. Plot feature importances, ROC curves (averaging over the 5 folds)
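
The fold loop and the ROC averaging can be sketched as follows. Each fold's ROC curve is interpolated onto a shared false-positive-rate grid with `np.interp` so the curves can be averaged point by point. The data is synthetic and the forest is much smaller than the models above, so treat this as an illustration of the mechanics rather than the post's exact code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the engineered Titanic feature matrix.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

mean_fpr = np.linspace(0, 1, 101)  # shared grid for averaging the curves
tprs, aucs = [], []
importances = np.zeros(X.shape[1])
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, val_idx in skf.split(X, y):
    model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
    model.fit(X[train_idx], y[train_idx])

    # Probability of the positive class on the held-out fold.
    proba = model.predict_proba(X[val_idx])[:, 1]
    fpr, tpr, _ = roc_curve(y[val_idx], proba)
    aucs.append(auc(fpr, tpr))

    # Put this fold's curve on the shared grid.
    interp_tpr = np.interp(mean_fpr, fpr, tpr)
    interp_tpr[0] = 0.0
    tprs.append(interp_tpr)

    # Average the feature importances over the folds.
    importances += model.feature_importances_ / skf.get_n_splits()

mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)

print(f"per-fold AUC: {np.round(aucs, 3)}")
print(f"AUC of mean ROC curve: {mean_auc:.3f}")
```

`mean_fpr` against `mean_tpr` is the averaged curve to plot, and the spread of `tprs` across folds can be drawn as a band around it.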