# Notes on SKlearn and Pandas

Copied over from `december_logistic_regression_people_analytics.ipynb` on Jan 16 2023

# SKLearn notes

The sklearn preprocessing classes have `fit` and `transform` methods. `fit` learns the parameters (eg. column means, category levels) from the data, while `transform` applies them and returns you back data that's centered or one hot encoded. `fit_transform` does both in one call.
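
A minimal sketch of the `fit` / `transform` split, using `StandardScaler` on a made-up three-row array (the data is hypothetical, just for illustration). The point is that the test data gets transformed with the mean and standard deviation learned from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, three training rows, one test row
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                          # learns mean_ and scale_ from the training data
X_train_scaled = scaler.transform(X_train)   # applies the learned centering / scaling
X_test_scaled = scaler.transform(X_test)     # reuses the training mean and std, no refit

# fit_transform is shorthand for fit followed by transform on the same data
X_train_scaled_again = StandardScaler().fit_transform(X_train)
```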

The general sklearn pipeline is as follows:
1. The `sklearn.model_selection` module will help you split data / randomly sample your data.
    1. Use `train_test_split` to split data into train and test
    1. The `StratifiedKFold` class is a "splitter class / cross validator" that you can use to do cross validation. 
    1. The `GridSearchCV` is a hyper-parameter optimizer that will do an exhaustive search over specified parameter values for an estimator. You can also call `cross_validate` if you don't need to do full hyperparameter optimization, and you just want average model performance across your K-fold cross validator.
1. The `sklearn.preprocessing` module will help you scale / transform columns so that they're ready for a model. 
    1. Use a `ColumnTransformer` (from `sklearn.compose`) to apply `StandardScaler` or `OneHotEncoder` to specific columns. `LabelEncoder` is meant for the target `y`, not for feature columns. Use the `fit` and `transform` of these preprocessors to make life easier. `fit` fits the scaler/encoder to your data, and `transform` transforms your data with the appropriate rule. `fit_transform` does both at once. You may also need an imputer, such as a `SimpleImputer` (from `sklearn.impute`).
    1. A `ColumnTransformer` can only select columns by name when its input is a Pandas DataFrame, which in practice means putting it at the first step of a pipeline (upstream steps turn things into numpy arrays, which aren't named, though you can fall back on integer column indices if really needed). A `sklearn.compose.make_column_selector` can pick columns by dtype or name pattern instead of listing them. Marking columns as `'passthrough'` is pretty useful as well.
    1. A `FunctionTransformer` can help you construct a transformer from an arbitrary callable. It is stateless: nothing learned on the train set is carried over to the test set (eg. it can't impute test data with training medians).
1. Use the `Pipeline` class to chain your feature engineering and classifier, `LogisticRegression` in this case.
    1. A `sklearn.pipeline.FeatureUnion` can concatenate results of multiple transformer objects. Eg. you do PCA and SVD on the same input dataset. This can be useful for performing different operations for different types of columns. This [blog](https://adamnovotny.com/blog/custom-scikit-learn-pipeline.html) has some good examples.
1. Call the `fit` method of your pipeline to fit the data. At this point you can do more advanced model selection:
    1. You can do a `GridSearchCV` over a `StratifiedKFold` to do hyperparameter optimization
    1. You can also `model_selection.cross_validate` over a `StratifiedKFold` to get the average model performance, without doing hyperparameter optimization. In case you want to do anything more advanced using the `StratifiedKFold`, you will likely need to get the train and test indices from the skf. You can do this by looping through `for train_index, test_index in skf.split(X, y)`.
1. The `sklearn.metrics` module will get you precision-recall and ROC curves. 
    1. `metrics.precision_recall_curve` and `metrics.roc_curve` will take predicted probabilities and true outcomes to generate PR and ROC curves with many thresholds. 
    1. `metrics.accuracy_score` and `metrics.f1_score` will take predicted and true outcomes and give you an accuracy or f1 score. Be careful about whether the function takes `y_pred` (0 or 1) or predicted probabilities as an input.
1. To get scores rather than hard labels from a classifier, use the `predict_proba` or `decision_function` method (depending on the classifier). `predict` returns the predicted class.
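
The steps above can be sketched end to end roughly as follows. This is a sketch under assumptions, not the notebook's actual code: the DataFrame, its column names (`age`, `tenure`, `dept`) and the target are synthetic and hypothetical, and the parameter grid is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_validate, train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns, one categorical, binary target
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "tenure": rng.normal(5, 2, n),
    "dept": rng.choice(["eng", "sales", "ops"], n),
})
df.loc[df.index % 25 == 0, "age"] = np.nan   # sprinkle in a few missing values
y = (df["tenure"] + rng.normal(0, 1, n) > 5).astype(int)

# 1. split
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0)

# 2. preprocessing: impute + scale numeric columns, one hot encode the categorical
numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "tenure"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["dept"]),
])

# 3. pipeline = feature engineering + classifier
model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

# 4a. average performance across folds, no hyperparameter search
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_results = cross_validate(model, X_train, y_train, cv=skf, scoring="roc_auc")
mean_cv_auc = cv_results["test_score"].mean()

# 4b. exhaustive search over a small, arbitrary grid of C values
search = GridSearchCV(model, {"clf__C": [0.1, 1.0, 10.0]}, cv=skf, scoring="roc_auc")
search.fit(X_train, y_train)

# 5. metrics on the held-out test set, using predicted probabilities
test_proba = search.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_proba)
```

Putting the imputer and scaler inside the pipeline means they are refit on each training fold, which helps keep test-fold information out of preprocessing.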

SKlearn tutorial video notes [link](https://www.youtube.com/watch?v=0Lt9w-BxKFQ&t=217s). 

1. Wine quality dataset, predict quality based on things like acidity, density, alcohol content, pH etc.
1. How to count nulls per column: `wine.isnull().sum()`
1. Make a LabelEncoder fit/transform on labels.
1. Define X and y. You can drop a single column as follows: `wine.drop('quality', axis = 1)`
1. `train_test_split`
1. Scale input columns with a `StandardScaler`, `X_train = sc.fit_transform(X_train)` (maybe using a pipeline is better).
1. Random forest initiate: `rfc = RandomForestClassifier(n_estimators=200)`. Works well in medium sized data
1. Fit the random forest: `rfc.fit(X_train, y_train)`
1. Predict: `pred_rfc = rfc.predict(X_test)`, might need to use `predict_proba` sometimes
1. See how well the model performed: `print(classification_report(y_test, pred_rfc))`
1. SVM initiate `clf = svm.SVC()`. Then fit / predict based on them. SVM does better with small datasets generally.
1. Neural network (does well with huge data, images, text): `mlpc = MLPClassifier(hidden_layer_sizes = (11, 11, 11), max_iter = 500)`
1. Accuracy score: `cm = metrics.accuracy_score(y_test, pred_rfc)` 
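
A rough sketch of the same workflow. The video uses a wine quality CSV that isn't available here, so this swaps in `sklearn.datasets.load_wine` as a stand-in (an assumption: its target is the cultivar, not a quality score, so no `LabelEncoder` step is needed).

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in dataset: feature columns plus a 'target' column
wine = load_wine(as_frame=True).frame
n_missing = wine.isnull().sum()          # nulls per column

# Define X and y by dropping the label column
X = wine.drop("target", axis=1)
y = wine["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Fit the scaler on train only, then apply the same scaling to test
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

rfc = RandomForestClassifier(n_estimators=200, random_state=42)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)

print(classification_report(y_test, pred_rfc))
acc = accuracy_score(y_test, pred_rfc)
```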


# ML notes

I wanted to ask you some questions about topics in applied machine learning that don't really come up in textbooks. I'd love to get your thoughts on these, as well as any other important topics you see in practice that don't come up in most classes.

How to impute `NA` values. 
1. Right now I see a couple possibilities here. One approach is to use `sklearn`'s `SimpleImputer` or `IterativeImputer` to do the work for you. The other is to do some exploratory data analysis to look at major patterns in the data (eg. Titanic missing age and survival are very correlated with passenger class and gender, so imputing age with class/gender medians might make sense). I'd love to know what strategies you use to impute missing data. 
1. I also have noticed that how you impute `NA`'s has a big impact on model performance. How do you keep track of what the best imputation strategy is? Do you treat it as a hyperparameter to be optimized with other variables (eg. tree depth), or do you just deal with it at the training stage (eg. do a bit of k-fold cross validation to understand the best imputation strategy, then fix the imputation strategy and move on?)
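
One way the "imputation as a hyperparameter" option could look, as a sketch only: put the imputer in a `Pipeline` and search over its `strategy` alongside tree depth. The data is synthetic (`make_classification` with values randomly blanked out) and the grid values are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data with roughly 10% of the values set to missing
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# The imputation strategy is searched together with the model hyperparameters
param_grid = {
    "impute__strategy": ["mean", "median", "most_frequent"],
    "clf__max_depth": [3, None],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
best = search.best_params_
```

`search.cv_results_` then holds one row per (strategy, depth) combination, which is one way to keep track of which imputation choice was tried with which settings.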

Model validation / k-fold cross validation.
1. How to choose the value of k. Eg the Titanic dataset has ~800 observations. My current workflow when testing whether doing something (eg. bucketing features) helps the model or not is to run 10-fold cross validation over a `StratifiedKFold` in sklearn and plot the ROC curves for each test fold. I'll average the ROC curves to get an AUC metric that I use as the overall judge of model performance. Does this seem reasonable? Would you add other metrics? Eg. I also look at AUPRC and test set accuracy in the test folds.
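
A sketch of that workflow on synthetic data of about the same size (an assumption, not the Titanic data). Note one simplification: it averages the per-fold AUC values rather than averaging the ROC curves themselves, and uses average precision as the AUPRC summary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in with ~800 observations
X, y = make_classification(n_samples=800, n_features=8, random_state=0)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

aucs, auprcs = [], []
for train_index, test_index in skf.split(X, y):
    # Refit from scratch on each training fold
    clf = LogisticRegression(max_iter=1000).fit(X[train_index], y[train_index])
    proba = clf.predict_proba(X[test_index])[:, 1]
    aucs.append(roc_auc_score(y[test_index], proba))
    auprcs.append(average_precision_score(y[test_index], proba))

# The spread across folds gives a sense of how noisy the comparison is
mean_auc, sd_auc = np.mean(aucs), np.std(aucs)
mean_auprc = np.mean(auprcs)
```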

How to deal with imbalanced classes

Minimizing leakage between train and test sets

How to keep track of all the different hyperparameters (eg. how to impute, tree depth etc.)

# Pandas notes

1. `pd.merge` to merge datasets
1. `df.loc` vs `df.iloc`: `loc` indexes by label, `iloc` by integer position
1. `value_counts` also `data.groupby(groups).size().reset_index().rename(columns = {0:'size'})`
1. `df.isna().sum()`
1. `df['column'].astype('int')`
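1. `A.query('Age == @median')` to get rows where Age is equal to the Python variable `median` (the `@` prefix refers to a local variable; without it, `median` is read as a column name)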
1. Dummy column `df.assign(is_ib_test = (df.channel == 'instant_book').astype('int'))`
1. Conditionally overwrite values in a column using a boolean mask in `.loc`: `df_n_reviews_race.loc[df_n_reviews_race.n_reviews == '5', 'n_reviews'] = '5+'`
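
A few of the items above on a small made-up DataFrame (the data and the non-default index are hypothetical, chosen so that `loc` and `iloc` visibly differ).

```python
import pandas as pd

# Hypothetical data with a non-default index so labels != positions
df = pd.DataFrame(
    {"Age": [22, 35, 35, 58],
     "channel": ["instant_book", "request", "instant_book", "request"]},
    index=[10, 20, 30, 40],
)

by_label = df.loc[10, "Age"]     # label-based: the row labelled 10
by_position = df.iloc[0, 0]      # position-based: first row, first column

# query: @ refers to a local Python variable
median = df["Age"].median()
at_median = df.query("Age == @median")

# two ways to count rows per group
counts = df["channel"].value_counts()
sizes = df.groupby("channel").size().reset_index(name="size")

# dummy column from a boolean condition
df = df.assign(is_ib_test=(df.channel == "instant_book").astype("int"))
```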

Pandas tips from sklearn video

1. `data.isnull().sum()`
1. `pd.cut(wine['quality'], bins = bins, labels = group_names)`
1. `wine['quality'].value_counts()`
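
A small sketch of `pd.cut` for bucketing a numeric score into labels. The bin edges and the `bad` / `good` names here are illustrative assumptions, not necessarily the ones used in the video.

```python
import pandas as pd

# Hypothetical quality scores
wine = pd.DataFrame({"quality": [3, 5, 6, 7, 8]})

# Two right-closed intervals: (2, 6.5] and (6.5, 8]
bins = (2, 6.5, 8)
group_names = ["bad", "good"]
wine["quality_label"] = pd.cut(wine["quality"], bins=bins, labels=group_names)

label_counts = wine["quality_label"].value_counts()
```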

Matt Harrison Effective Pandas video notes [link](https://www.youtube.com/watch?v=UURvPeczxJI&t=505s).
1. Pandas is built on top of NumPy, a package to make numerical computation faster in Python.
1. `autos[cols].dtypes` you see a lot of `int64` / `float64` (fast) or `object` (mixed type or string).
1. `autos[cols].memory_usage(deep = True)` 
1. You can save memory by downcasting integers to a smaller type where the values fit, eg. `int64` (8 bytes) to `int16` or `int8`.
1. `autos.cylinders.describe()` - the count is of not null values, so it can get you how many nulls
1. `autos.cylinders.value_counts(dropna = False)` how many nulls
1. `autos['drive']` is a low cardinality categorical originally coded as an `object`. `autos.drive.value_counts(dropna = False)` will tell you what the cardinality is. You can assign things to be `category` to save memory: `autos.astype({'make': 'category'})`
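
To make the memory point concrete, a sketch on a made-up frame that borrows the column names from the video (the values are hypothetical). It compares `memory_usage(deep=True)` before and after converting types.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the autos data: 1000 rows, some nulls
autos = pd.DataFrame({
    "cylinders": [4.0, 6.0, np.nan, 8.0] * 250,
    "drive": ["Front-Wheel Drive", "Rear-Wheel Drive", None, "4-Wheel Drive"] * 250,
    "highway08": [30, 25, 100, 18] * 250,
})

# Nulls show up as their own row when dropna=False
cylinder_counts = autos.cylinders.value_counts(dropna=False)
n_null_cylinders = autos.cylinders.isna().sum()

before = autos.memory_usage(deep=True)

# Low cardinality strings -> category, small integers -> int8
slim = autos.astype({"drive": "category", "highway08": "int8"})
after = slim.memory_usage(deep=True)
```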

Summary up to 31:20
1. Dot chaining is good. Use `query`, `assign`, `astype`.

```python
# `df` (the autos DataFrame) and `cols` (a list of its columns) are
# assumed to be defined earlier in the notebook

# dot chaining example
(df
 [cols]
 .select_dtypes(int)
 .describe()
)

# assign types
(df
 [cols]
 .astype({'highway08': 'int8', 'city08': 'int16'})
 .describe()
)

# quickly see rows with missing values
(df
 [cols]
 .query('cylinders.isna()')
)

# impute NAs by filling them with zeros
(df
 [cols]
 .assign(cylinders=df.cylinders.fillna(0).astype('int8'))
)

# number of unique values for column drive by year
(df
 [cols]
 .groupby('year')
 .drive
 .nunique()   # call the method; without () this returns the bound method
)

# split speeds from tran variable
(df
 [cols]
 .assign(automatic=df.tran.str.contains('Auto'),
         # (\d+) captures the whole number; (\d)+ would keep only the last digit
         speeds=df.tran.str.extract(r'(\d+)', expand=False).fillna('20').astype('int8'),
 )
)
```


Summary pivot table in pandas:
1. `df.assign` is a great way to rename columns, make new columns. Can use lambda functions to perform custom transformations on the current version of the dot chain.
1. `pivot_table`, will pivot data from long to wide. You will need to set an index, and then either `reset_index()` or `reset_index(0)` to get the data back into a decent shape.

```python
# `df` and `quarter` are assumed to be defined earlier in the notebook
channel_usage = (df
     .groupby(['perceived_race'])['channel']
     .value_counts()
     .rename('n')   # name the counts 'n' (the default name differs across pandas versions)
     .reset_index()
     .assign(n_total = lambda x: x.groupby('perceived_race')['n'].transform('sum'))
     .assign(pct = lambda x: round(100 * x.n / x.n_total, 2))
     .pivot_table(index = ['perceived_race', 'n_total'], columns = ['channel'], values = ['n', 'pct'])
     .reset_index()
     .assign(quarter = quarter)
)

# flatten the two-level column index that pivot_table produces
channel_usage.columns = ['perceived_race', 'n_total', 'n_ib', 'n_rtb', 'pct_ib', 'pct_rtb', 'quarter']

channel_usage
```