In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab4.ipynb")

# Lab 4: Putting it all together in a mini project

For this lab, **you can choose to work alone or in a group of up to three students**. You are in charge of how you want to work and who you want to work with. Maybe you really want to go through all the steps of the ML process yourself, or maybe you want to practice your collaboration skills; it is up to you! Just remember to indicate who your group members are (if any) when you submit on Gradescope. If you choose to work in a group, you only need to use one GitHub repo (you can create one on github.ubc.ca and set the visibility to "public"). If it takes a prohibitively long time to run any of the steps on your laptop, it is OK to sample the data to reduce the runtime; just make sure you write a note about this.

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## Submission instructions
rubric={mechanics}

<p>You receive marks for submitting your lab correctly. Please follow these instructions:</p>

<ul>
  <li><a href="https://ubc-mds.github.io/resources_pages/general_lab_instructions/">
      Follow the general lab instructions.</a></li>
  <li><a href="https://github.com/UBC-MDS/public/tree/master/rubric">
      Click here to view a description of the rubrics used to grade the questions</a></li>
  <li>Make at least three commits.</li>
  <li>Push your <code>.ipynb</code> file to your GitHub repository for this lab and upload it to Gradescope.
    <ul>
      <li>Before submitting, make sure you restart the kernel and rerun all cells.</li>
    </ul>
  </li>
  <li>Make sure to make only one Gradescope submission per group, and to assign all group members on Gradescope at submission time.</li>
  <li>Also upload a <code>.pdf</code> export of the notebook to facilitate grading of manual questions (preferably WebPDF; you can select two files when uploading to Gradescope).</li>
  <li>Don't change any variable names that are given to you, don't move cells around, and don't include any code to install packages in the notebook.</li>
  <li>The data you download for this lab <b>SHOULD NOT BE PUSHED TO YOUR REPOSITORY</b> (there is also a <code>.gitignore</code> in the repo to prevent this).</li>
  <li>Include a clickable link to your GitHub repo for the lab just below this cell
    <ul>
      <li>It should look something like this https://github.ubc.ca/MDS-2020-21/DSCI_531_labX_yourcwl.</li>
    </ul>
  </li>
</ul>
</div>


_Points:_ 2

https://github.com/will-chh/573-lab4-creditcard-default-predictor

<!-- END QUESTION -->

## Introduction <a name="in"></a>

In this lab you will be working on an open-ended mini-project, where you will put all the different things you have learned so far in 571 and 573 together to solve an interesting problem.

A few notes and tips when you work on this mini-project: 

#### Tips
1. Since this mini-project is open-ended there might be some situations where you'll have to use your own judgment and make your own decisions (as you would be doing when you work as a data scientist). Make sure you explain your decisions whenever necessary. 
2. **Do not include everything you ever tried in your submission** -- it's fine just to have your final code. That said, your code should be reproducible and well-documented. For example, if you chose your hyperparameters based on some hyperparameter optimization experiment, you should leave in the code for that experiment so that someone else could re-run it and obtain the same hyperparameters, rather than mysteriously just setting the hyperparameters to some (carefully chosen) values in your code. 
3. If you realize that you are repeating a lot of code try to organize it in functions. Clear presentation of your code, experiments, and results is the key to be successful in this lab. You may use code from lecture notes or previous lab solutions with appropriate attributions. 

#### Assessment
We don't have some secret target score that you need to achieve to get a good grade. **You'll be assessed on demonstration of mastery of course topics, clear presentation, and the quality of your analysis and results.** For example, if you just have a bunch of code and no text or figures, that's not good. If you instead try several reasonable approaches and you have clearly motivated your choices, but still get lower model performance than your friend, don't sweat it.


#### A final note
Finally, the style of this "project" is different from other assignments. It'll be up to you to decide when you're "done" -- in fact, this is one of the hardest parts of real projects. But please don't spend WAY too much time on this... perhaps "several hours" but not "many hours" is a good guideline for a high quality submission. Of course if you're having fun you're welcome to spend as much time as you want! But, if so, try not to do it out of perfectionism or to get the best possible grade. Do it because you're learning and enjoying it. Students from past cohorts have found labs like this useful and fun, and we hope you enjoy it as well.

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 1. Pick your problem and explain the prediction problem <a name="1"></a>
rubric={reasoning}

In this mini project, you will pick one of the following problems: 

1. A classification problem of predicting whether a credit card client will default or not. For this problem, you will use [Default of Credit Card Clients Dataset](https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). In this data set, there are 30,000 examples and 24 features, and the goal is to estimate whether a person will default (fail to pay) their credit card bills; this column is labeled "default.payment.next.month" in the data. The rest of the columns can be used as features. You may take some ideas and compare your results with [the associated research paper](https://www.sciencedirect.com/science/article/pii/S0957417407006719), which is available through [the UBC library](https://www.library.ubc.ca/). 

OR 

2. A regression problem of predicting `reviews_per_month`, as a proxy for the popularity of the listing with [New York City Airbnb listings from 2019 dataset](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data). Airbnb could use this sort of model to predict how popular future listings might be before they are posted, perhaps to help guide hosts in creating more appealing listings. In reality they might instead use something like vacancy rate or average rating as their target, but we do not have that available here.

**Your tasks:**

1. Spend some time understanding the problem and what each feature means. Write a few sentences on your initial thoughts on the problem and the dataset. 
2. Download the dataset and read it as a pandas dataframe. 
3. Carry out any preliminary preprocessing, if needed (e.g., changing feature names, handling of NaN values etc.)
    
</div>


_Points:_ 3


Our problem of interest is predicting whether a credit card client will default on their payment next month. Given a dataset of 30,000 examples and 24 features, we aim to build a binary classification model that can accurately predict default payment behavior. This can be used by financial institutions in areas such as risk assessment and decision-making regarding credit issuance. 

- ID: row identifier 
- LIMIT_BAL: amount of given credit in NT dollars (includes individual and family/supplementary credit)
- SEX: gender (1=male, 2=female)
- EDUCATION: education level (1=graduate school; 2=university; 3=high school; 4=others)
- MARRIAGE: marital status (1=married; 2=single; 3=others)
- AGE: age in years         
- PAY_0 to PAY_6: repayment status for each of the past six months, from September 2005 (PAY_0) back to April 2005 (PAY_6); note that the dataset has no PAY_1 column. Values are -1=pay duly, 1=payment delay for one month, 2=payment delay for two months, and so on.
- PAY_AMT1 to PAY_AMT6: amount of previous payment in NT dollars, from September 2005 (PAY_AMT1) back to April 2005 (PAY_AMT6)
- BILL_AMT1 to BILL_AMT6: amount of bill statement (from September 2005 to April 2005)
- default.payment.next.month: default payment (1=yes, 0=no). This is our target variable

The dataset contains a mix of categorical and numerical features. Demographic features include SEX, EDUCATION, MARRIAGE, and AGE. Financial features include LIMIT_BAL, PAY_0 to PAY_6, PAY_AMT1 to PAY_AMT6, and BILL_AMT1 to BILL_AMT6; of these, the payment history (PAY_0 to PAY_6) and previous payment amounts (PAY_AMT1 to PAY_AMT6) can also be viewed as behavioural features. Our initial thought is that PAY_0 to PAY_6 may be the most predictive of default next month, as they reflect the client's payment behaviour over the past six months.


#### Package Importation

In [None]:
# Data & plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import altair as alt
from ydata_profiling import ProfileReport

# Scikit-learn utilities
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    cross_validate,
    GridSearchCV,
    RandomizedSearchCV,
)

# Preprocessing / transformations
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, KBinsDiscretizer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline

# Models — classification & regression
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.dummy import DummyClassifier, DummyRegressor 

#### Data Loading

In [None]:
cr_card_df = pd.read_csv('data/UCI_Credit_Card.csv', header=0)

In [None]:
#viewing the first 5 rows of data
cr_card_df.head()

In [None]:
#checking for null values
cr_card_df.info() #checking datatypes and non-null counts
cr_card_df.isna().sum()

In [None]:
#cleaning column names so that they are more consistent and readable (in snake_case)
#regex=False makes '.' a literal dot (as a regex it would match every character)
cr_card_df.columns = cr_card_df.columns.str.lower().str.replace('.', '_', regex=False)
print(cr_card_df.columns)

In [None]:
#dropping irrelevant columns
cr_card_df = cr_card_df.drop(columns=['id'])

In [None]:
#according to the research paper, some category codes in the data are not documented
print(cr_card_df['education'].value_counts()) #0,5,6 are undefined/unknown, can be treated as 'others' (4)
print(cr_card_df['marriage'].value_counts()) #0 is undefined, can be treated as 'others' (3)

cr_card_df['education'] = cr_card_df['education'].replace({5:4, 6:4, 0:4})
cr_card_df['marriage'] = cr_card_df['marriage'].replace({0:3})


In [None]:
#checking class imbalance 
cr_card_df['default_payment_next_month'].value_counts(normalize=True)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 2. Data splitting <a name="2"></a>
rubric={reasoning}

**Your tasks:**

1. Split the data into train and test portions.

> Make the decision on the `test_size` based on the capacity of your laptop. 
    
</div>


_Points:_ 1

In [None]:
# splitting into 70% train and 30% test portions
cr_card_train, cr_card_test = train_test_split(cr_card_df, test_size=0.3, random_state=123)
X_train = cr_card_train.drop(columns=['default_payment_next_month'])
y_train = cr_card_train['default_payment_next_month']
X_test = cr_card_test.drop(columns=['default_payment_next_month'])
y_test = cr_card_test['default_payment_next_month']

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 3. EDA <a name="3"></a>
rubric={viz,reasoning}
    
Perform exploratory data analysis on the train set.

**Your tasks:**

1. Include at least two summary statistics and two visualizations that you find useful, and accompany each one with a sentence explaining it.
2. Summarize your initial observations about the data. 
3. Pick appropriate metric/metrics for assessment. 
    
</div>


_Points:_ 6


2. The profile report shows that there are no missing values in the dataset. The target variable, default_payment_next_month, is imbalanced: 22.3% of instances are 1 (default) and 77.7% are 0 (no default). The models may therefore need to account for this imbalance during training. \
In the correlations section, we can see that pay_0 has the highest positive correlation with the target variable (0.324), suggesting that recent payment history is a strong predictor of default next month. bill_amt1 shows a weaker positive correlation (0.208), indicating that higher bill amounts may be associated with a higher likelihood of default. \
Many financial features, such as bill_amt* and pay_amt*, are highly right-skewed: most clients have low amounts while a few have very high amounts. This suggests that a transformation such as a log transformation might be beneficial for these features.

3. Because of the class imbalance in the target variable, accuracy alone is not sufficient to evaluate model performance. Our primary metric is therefore recall, which measures the fraction of actual defaults that the model correctly identifies, since missed defaults are the most expensive errors. We can also consider lowering the decision threshold to increase recall. In addition, we will report precision and F1-score (together with a confusion matrix) for a more complete picture of performance, as well as the ROC-AUC score to assess the model's ability to discriminate between classes across different thresholds.
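
The effect of moving the decision threshold can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not on the credit card data; the threshold of 0.3 and all variable names are assumptions chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 78% / 22%); NOT the credit card data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.78, 0.22], random_state=123
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=123
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_default = model.predict_proba(X_val)[:, 1]  # estimated P(class 1)


def scores_at(threshold):
    """Precision and recall when predicting class 1 whenever P(class 1) >= threshold."""
    preds = (proba_default >= threshold).astype(int)
    return (
        precision_score(y_val, preds, zero_division=0),
        recall_score(y_val, preds),
    )


precision_default, recall_default = scores_at(0.5)  # the usual default cut-off
precision_low, recall_low = scores_at(0.3)          # hypothetical lower threshold

print(f"threshold 0.5: precision={precision_default:.3f}, recall={recall_default:.3f}")
print(f"threshold 0.3: precision={precision_low:.3f}, recall={recall_low:.3f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases; the cost usually shows up as lower precision.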

In [None]:
profile = ProfileReport(cr_card_train, title='Credit card default training set EDA', explorative=True)
profile

In [None]:
#given that pay_* are our strongest features, we can make a histogram of 
alt.data_transformers.disable_max_rows()



In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-warning">

## 4. Feature engineering (Challenging)
rubric={reasoning}

**Your tasks:**

1. Carry out feature engineering. In other words, extract new features relevant for the problem and work with your new feature set in the following exercises. You may have to go back and forth between feature engineering and preprocessing.
    
</div>


_Points:_ 0.5

Reasoning: 
- payment difference: pay_diff = pay_0 - pay_6 captures the change in payment behaviour from six months ago to the most recent month. A negative value indicates improvement (pay_6 was worse than pay_0), while a positive value indicates worsening behaviour, which may be predictive of default risk (the behavioural trend over the six months)
- average payment status: avg_pay captures the average repayment status (months of delay) over the past six months, which may indicate the client's ability to make the next payment on time
- standard deviation of pay: pay_std captures payment volatility over the past six months, which may indicate inconsistent payment behaviour. For example, a high standard deviation (0,1,3,5,0,6) reflects erratic payment behaviour and possibly higher default risk, while a low standard deviation (1,1,1,1,1,1) reflects consistent payment behaviour and possibly lower default risk
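
As a small worked example of the three features described above, the sketch below computes them for two made-up clients. The rows reuse the example sequences from the list; which value belongs to which month is an assumption made only for illustration.

```python
import pandas as pd

pay_cols = ['pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6']

# Two made-up clients: one erratic, one perfectly consistent
toy = pd.DataFrame(
    [[0, 1, 3, 5, 0, 6],
     [1, 1, 1, 1, 1, 1]],
    columns=pay_cols,
    index=['erratic', 'consistent'],
)

toy_feats = pd.DataFrame({
    'avg_pay': toy[pay_cols].mean(axis=1),    # mean repayment status
    'pay_diff': toy['pay_0'] - toy['pay_6'],  # most recent month minus oldest month
    'pay_std': toy[pay_cols].std(axis=1),     # sample standard deviation (ddof=1)
})
print(toy_feats.round(3))
```

Note that the erratic client has a negative `pay_diff` (an apparent improvement) together with a high `pay_std`, which shows that the two features capture different aspects of the payment history.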

In [None]:
#aggregate features for pay*

#may potentially make avg_max instead of avg since it correlates more w default
pay_cols = ['pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6']

def add_pay_features(df):
    """Return a copy of df with row-wise aggregates of the pay_* status columns."""
    df = df.copy()  # work on a copy to avoid SettingWithCopyWarning on the split frames
    df['avg_pay'] = df[pay_cols].mean(axis=1)
    df['pay_diff'] = df['pay_0'] - df['pay_6']
    df['pay_std'] = df[pay_cols].std(axis=1)
    return df

#same row-wise features for train and test; nothing is learned from the data here
cr_card_train = add_pay_features(cr_card_train)
cr_card_test = add_pay_features(cr_card_test)

#not worth aggregating values for pay_amt*, bill_amt* since there is a high risk of multicollinearity without much gain

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 5. Preprocessing and transformations <a name="5"></a>
rubric={accuracy,reasoning}

**Your tasks:**

1. Identify different feature types and the transformations you would apply on each feature type. 
2. Define a column transformer, if necessary. 
    
</div>


_Points:_ 4

In [None]:
#Seeing the skew in the 'age' column, binning age into groups may help interpretability, also default probability doesn't change with age
#pipe = Pipeline([('age_binned', KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile'))])

# Dropping target column after adding new features
X_train = cr_card_train.drop(columns=['default_payment_next_month'])
X_test = cr_card_test.drop(columns=['default_payment_next_month'])

y_train = cr_card_train['default_payment_next_month']
y_test = cr_card_test['default_payment_next_month']

# Identifying feature types 

numeric_feats = ['limit_bal', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4', 'bill_amt5', 
                 'bill_amt6','pay_amt1', 'pay_amt2', 'pay_amt3', 'pay_amt4', 'pay_amt5', 'pay_amt6',
                 'avg_pay', 'pay_diff', 'pay_std']
ordinal_feats = ['education', 'pay_0','pay_2','pay_3','pay_4','pay_5','pay_6'] 
categorical_feats = ['sex', 'marriage']
bin_feats = ['age']

In [None]:
# Applying transformations 

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_feats),
    (OrdinalEncoder(handle_unknown = 'use_encoded_value', unknown_value=-1), ordinal_feats),
    (OneHotEncoder(handle_unknown='ignore'), categorical_feats),
    (KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile', quantile_method='linear'), bin_feats),
)

preprocessor.fit(X_train)
X_train_enc = preprocessor.transform(X_train)

In [None]:
onehot_col = preprocessor.named_transformers_["onehotencoder"].get_feature_names_out(categorical_feats)
# column order must match the order of the transformers in make_column_transformer
all_columns = numeric_feats + ordinal_feats + list(onehot_col) + bin_feats

X_train_df = pd.DataFrame(X_train_enc, columns=all_columns, index=X_train.index)
X_train_df

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 6. Baseline model <a name="6"></a>
rubric={accuracy}

**Your tasks:**
1. Train a baseline model for your task and report its performance.
    
</div>


_Points:_ 2

In [None]:
cross_val_results = {}
dummy = DummyClassifier(strategy = "most_frequent")
scoring_metrics = ['accuracy', 'f1', 'precision', 'recall', 'roc_auc']

cross_val_results['dummy'] = pd.DataFrame(cross_validate(
    dummy, X_train, y_train, scoring = scoring_metrics, cv=5, return_train_score=True)).agg(['mean', 'std']).round(3).T

cross_val_results['dummy']

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 7. Linear models <a name="7"></a>
rubric={accuracy,reasoning}

**Your tasks:**

1. Try a linear model as a first real attempt. 
2. Carry out hyperparameter tuning to explore different values for the regularization hyperparameter. 
3. Report cross-validation scores along with standard deviation. 
4. Summarize your results.
    
</div>


_Points:_ 8

The logistic regression model was tuned with randomized search over the regularization hyperparameter `C`; the best value found was 8.756. The tuned model achieved a mean cross-validation accuracy of 0.733 and appears stable across folds, with a standard deviation of 0.006. The training and validation scores are very similar (0.734 vs. 0.733), indicating that the model is not overfitting. However, the moderate F1 score of 0.516 shows that the balance between precision and recall for the default class is not very good. As mentioned previously, recall, which indicates how many actual defaults the model correctly identifies, is particularly important since missed defaults are the most expensive errors. The recall score is 0.637, meaning the model identifies only 63.7% of all actual defaults; setting `class_weight='balanced'` improved this, but ~36% of actual defaults are still missed. Lastly, the ROC AUC score of 0.745 suggests the model is reasonably good at separating positive and negative cases.
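
To make the effect of `class_weight='balanced'` more concrete, here is a minimal sketch of the weights scikit-learn derives from the class counts. The label vector is made up to roughly match the class balance reported in the EDA; it is not the actual training target.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels with roughly the reported class balance (77.7% / 22.3%)
y_toy = np.array([0] * 777 + [1] * 223)
classes = np.array([0, 1])

# The "balanced" heuristic: n_samples / (n_classes * count_per_class)
manual_weights = len(y_toy) / (len(classes) * np.bincount(y_toy))
sklearn_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

print(dict(zip(classes.tolist(), manual_weights.round(3).tolist())))
```

With these counts the default class receives a weight of about 2.24 and the non-default class about 0.64, so errors on defaults are penalized roughly 3.5 times more heavily during training.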

The confusion matrix is consistent with the low precision score of 0.433: the model predicts a large number of false positives (1766). This is expected, since the classes are highly imbalanced and balancing them trades precision for recall.
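
For reference, the sketch below shows how precision and recall are read off a confusion matrix. The labels and predictions are made up purely for illustration and do not correspond to the results reported above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions, purely for illustration (1 = default)
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 0])

# For binary labels [0, 1], ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_toy = tp / (tp + fp)  # of the predicted defaults, how many were real
recall_toy = tp / (tp + fn)     # of the real defaults, how many were caught

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision_toy:.2f}, recall={recall_toy:.2f}")
```

In this toy case three of the six predicted defaults are false positives, so precision is 0.50 even though recall is 0.75, the same kind of trade-off described above.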

In [None]:
from scipy.stats import loguniform, randint

pipe_lr = make_pipeline(preprocessor, LogisticRegression(class_weight='balanced', max_iter=1000))

param_dist = {"logisticregression__C": loguniform(0.01, 10)}

random_search = RandomizedSearchCV(pipe_lr, param_dist, n_iter=20, cv=5, n_jobs=-1,random_state=123, return_train_score=True)

random_search.fit(X_train, y_train)

pipe_lr_tuned = random_search.best_estimator_

logreg_cv_scores = pd.DataFrame(cross_validate(
    pipe_lr_tuned, X_train, y_train, cv=5, scoring = scoring_metrics, return_train_score=True
)).agg(['mean', 'std']).round(3).T

cross_val_results['logreg'] = logreg_cv_scores

cross_val_results['logreg']

In [None]:
print("Best Parameters:", random_search.best_params_)

In [None]:
print("CV Score:", random_search.best_score_)

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# NOTE: this refits the untuned pipeline (default C), not pipe_lr_tuned from the search above
pipe_lr.fit(X_train, y_train)

predictions = pipe_lr.predict(X_test)

confusion_matrix(y_test, predictions)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 8. Different models <a name="8"></a>
rubric={accuracy,reasoning}

**Your tasks:**
1. Try out three other models aside from the linear model. 
2. Summarize your results in terms of overfitting/underfitting and fit and score times. Can you beat the performance of the linear model? 
    
</div>


_Points:_ 10

The first model we tried was `SVC` with `class_weight='balanced'`, which achieved a validation accuracy of 0.772 and a train accuracy of 0.779. The two scores are very close, suggesting that the model generalizes well with only minor overfitting. Overall, the SVC model scored better than the linear model on most metrics (accuracy, F1, ROC AUC), but it had a lower recall (0.595 vs. 0.637 for the linear model), and recall is the metric we identified as most important for this problem. The fit and score times for SVC were also much longer than those of the linear model.

The second model was `GradientBoostingClassifier`, which achieved the highest validation accuracy of all models. The training accuracy was 0.829 and the validation accuracy was 0.820, suggesting the model generalizes well with only slight overfitting. Compared to the linear model, however, it has a much lower recall (0.372 vs. 0.637). In terms of fit and score times, it was slower than the linear model but faster than SVC.

The third model we tried was `RandomForestClassifier` with `class_weight='balanced'`, which achieved a validation accuracy of 0.812 and a train accuracy of 0.998, indicating that the model is severely overfitting. Although it performs better than the linear model in accuracy and ROC AUC, its recall of 0.352 is much lower than the linear model's (0.637). In terms of timing, the model fit quickly (faster than SVC and GradientBoostingClassifier) but was slower to score than GradientBoostingClassifier.

Overall, both SVC and GradientBoostingClassifier outperformed the linear model on most scoring metrics. However, the logistic regression model achieved the highest recall, which is the most important metric for predicting defaults. Therefore, despite its lower accuracy, the linear model is the best option for identifying defaults unless further tuning improves the other models.
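
A comparison like the one above is easier to read when the per-model summaries are placed side by side. The sketch below shows one way to do this with `pd.concat`, following the same `cross_validate` pattern as the cells in this notebook; it runs on synthetic data with two stand-in models, and the names `demo_models`, `demo_results`, and `summary` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data; NOT the credit card data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.78, 0.22], random_state=123
)

demo_models = {
    'dummy': DummyClassifier(strategy='most_frequent'),
    'logreg': LogisticRegression(class_weight='balanced', max_iter=1000),
}

demo_results = {}
for name, model in demo_models.items():
    scores = cross_validate(
        model, X_demo, y_demo, cv=5,
        scoring=['accuracy', 'recall'], return_train_score=True,
    )
    demo_results[name] = pd.DataFrame(scores).agg(['mean', 'std']).round(3).T

# One column of mean scores per model, one row per metric or timing
summary = pd.concat(
    {name: res['mean'] for name, res in demo_results.items()}, axis=1
)
print(summary)
```

The same pattern should apply to the `cross_val_results` dictionary built in this notebook, since each of its entries has the same `mean`/`std` layout.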

In [None]:
pipe_svc = make_pipeline(preprocessor, SVC(class_weight='balanced'))

cross_val_results['svc'] = pd.DataFrame(cross_validate(
    pipe_svc, X_train, y_train, return_train_score=True, scoring = scoring_metrics, cv = 5
)).agg(['mean', 'std']).round(3).T

cross_val_results['svc'] 

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

pipe_GBC = make_pipeline(preprocessor, GradientBoostingClassifier(random_state=123))

cross_val_results['GBC'] = pd.DataFrame(cross_validate(
    pipe_GBC, X_train, y_train, return_train_score=True, scoring = scoring_metrics, cv = 5
)).agg(['mean', 'std']).round(3).T

cross_val_results['GBC'] 

In [None]:
pipe_tree = make_pipeline(preprocessor, RandomForestClassifier(class_weight="balanced", random_state=123))

cross_val_results['tree'] = pd.DataFrame(cross_validate(
    pipe_tree, X_train, y_train, return_train_score=True, scoring = scoring_metrics, cv = 5
)).agg(['mean', 'std']).round(3).T

cross_val_results['tree']

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-warning">

## 9. Feature selection (Challenging)
rubric={reasoning}

**Your tasks:**

Make some attempts to select relevant features. You may try `RFECV`, forward/backward selection or L1 regularization for this. Do the results improve with feature selection? Summarize your results. If you see improvements in the results, keep feature selection in your pipeline. If not, you may abandon it in the next exercises unless you think there are other benefits with using fewer features.
    
</div>


_Points:_ 0.5

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 10. Hyperparameter optimization
rubric={accuracy,reasoning}

**Your tasks:**

Make some attempts to optimize hyperparameters for the models you've tried and summarize your results. In at least one case you should be optimizing multiple hyperparameters for a single model. You may use `sklearn`'s methods for hyperparameter optimization or fancier Bayesian optimization methods.  Briefly summarize your results.
  - [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)   
  - [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
  - [scikit-optimize](https://github.com/scikit-optimize/scikit-optimize) 
    
</div>


_Points:_ 6

_Type your answer here, replacing this text._

In [None]:
# SVC:

param_dist_svc = {"svc__C": loguniform(1e-2, 1e2), 
                  "svc__gamma": loguniform(1e-3, 10)}

random_search = RandomizedSearchCV(pipe_svc, param_dist_svc, n_iter=20, cv=5, n_jobs=-1,random_state=123, return_train_score=True)

random_search.fit(X_train, y_train)

pipe_svc_tuned = random_search.best_estimator_

In [None]:
# GBC 

param_dist_GBC = {"gradientboostingclassifier__n_estimators": randint(50, 300),
    "gradientboostingclassifier__learning_rate": loguniform(0.01, 0.5),
    "gradientboostingclassifier__max_depth": randint(2, 5)}

random_search = RandomizedSearchCV(pipe_GBC, param_dist_GBC, n_iter=20, cv=5, n_jobs=-1,random_state=123, return_train_score=True)

random_search.fit(X_train, y_train)

pipe_GBC_tuned = random_search.best_estimator_

In [None]:
# Random forest

param_dist_tree = {"randomforestclassifier__n_estimators": randint(100, 500),
    "randomforestclassifier__max_depth": randint(3, 15),
    "randomforestclassifier__min_samples_leaf": randint(1, 5)}

random_search = RandomizedSearchCV(pipe_tree, param_dist_tree, n_iter=20, cv=5, n_jobs=-1,random_state=123, return_train_score=True)

random_search.fit(X_train, y_train)

pipe_tree_tuned = random_search.best_estimator_

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 11. Interpretation and feature importances <a name="11"></a>
rubric={accuracy,reasoning}

**Your tasks:**

1. Use the methods we saw in class (e.g., `permutation_importance` or `shap`) (or any other methods of your choice) to examine the most important features of one of the non-linear models. 
2. Summarize your observations. 
    
</div>


_Points:_ 8

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 12. Results on the test set <a name="12"></a>
rubric={accuracy,reasoning}

**Your tasks:**

1. Try your best performing model on the test data and report test scores. 
2. Do the test scores agree with the validation scores from before? To what extent do you trust your results? Do you think you've had issues with optimization bias? 
3. Take one or two test predictions and explain them with SHAP force plots.  
    
</div>


_Points:_ 6

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-info">

## 13. Summary of results <a name="13"></a>
rubric={reasoning}

Imagine that you want to present the summary of these results to your boss and co-workers. 

**Your tasks:**

1. Create a table summarizing important results. 
2. Write concluding remarks.
3. Discuss other ideas that you did not try but could potentially improve the performance/interpretability.
4. Report your final test score along with the metric you used at the top of this notebook.
    
</div>


_Points:_ 8

_Type your answer here, replacing this text._

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-warning">

## 14. Creating a data analysis pipeline (Challenging)
rubric={reasoning}

**Your tasks:**

- Convert this notebook into scripts to create a reproducible data analysis pipeline with appropriate documentation. Submit your project folder in addition to this notebook on GitHub and briefly comment on your organization in the text box below.
    
</div>


_Points:_ 0.5

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<div class="alert alert-warning">

## 15. Your takeaway from the course (Challenging)
rubric={reasoning}

**Your tasks:**

What is your biggest takeaway from this course? 
    
</div>


_Points:_ 0.5

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<div class="alert alert-danger" style="color:black">
    
**Restart, run all and export a PDF before submitting**
    
Before submitting,
don't forget to run all cells in your notebook
to make sure there are no errors
and so that the TAs can see your plots on Gradescope.
You can do this by clicking the ▶▶ button
or going to `Kernel -> Restart Kernel and Run All Cells...` in the menu.
This is not only important for MDS,
but a good habit you should get into before ever committing a notebook to GitHub,
so that your collaborators can run it from top to bottom
without issues.
    
After running all the cells,
export a PDF of the notebook (preferably the WebPDF export)
and upload this PDF together with the ipynb file to Gradescope
(you can select two files when uploading to Gradescope)
</div>

---

## Help us improve the labs

The MDS program is continually looking to improve our courses, including lab questions and content. The following optional questions will not affect your grade in any way nor will they be used for anything other than program improvement:

1. Approximately how many hours did you spend working or thinking about this assignment (including lab time)?

#Ans:

2. Do you have any feedback on the lab you would be willing to share? For example, any part or question that you particularly liked or disliked?

#Ans:

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)