# Module 3: Machine Learning

## Sprint 4: Capstone Project

## About this Sprint

It is time for your third Capstone Project.
You will work on this project for the whole Sprint.
The outcome of this Sprint should potentially serve as your portfolio item, so try to show your best work!

This time your objective is even more challenging - you will be required to iteratively build and implement a plan for a large dataset based on business objectives.
You'll have to translate the business requirements, making assumptions where necessary, into a plan for your project.

Even though you have learned a lot about machine learning in this Module, working on a larger project will be the real challenge.

Good luck!

## Context

You and your friend came up with a brilliant startup idea: providing risk evaluation as a service for retail banks.
As with most successful startup teams, both of you have your specialty.
Your friend is responsible for sales and operations, while you are responsible for everything product-related, from planning to data analysis to building the solution.
You have quickly identified that machine learning will be an essential part of your offering because you believe that the models can capture statistical patterns in the defaults on bank loans.
You decide to start your investigation by downloading [this](https://storage.googleapis.com/341-home-credit-default/home-credit-default-risk.zip) dataset from Home Credit Group.
You are not yet sure what the most crucial problem for your potential clients is, so you had a meeting with your friend to discuss what your proof-of-concept (POC) product should look like.
After a lot of arguing, you both agreed to create a number of different models so that you have a robust and diversified offering when you get your first meeting with the potential clients.
You are eager to investigate the dataset and see what you can predict, so you propose that you come up with interesting features to analyze and predict - this way, you'll focus on building a solid offering, and she can work on getting meetings with the banks.

## Objectives for this Part

- Practice translating business requirements into data science tasks.
- Practice performing EDA.
- Practice applying statistical inference procedures.
- Practice using machine learning to solve business problems.
- Practice deploying multiple machine learning models.

## Requirements

- Download the data from [here](https://storage.googleapis.com/341-home-credit-default/home-credit-default-risk.zip) and the data description from [here](https://storage.googleapis.com/341-home-credit-default/Home%20Credit%20Default%20Risk.pdf).
- Create a plan for your investigation, analysis, and POC building. This should include your assumptions, overall objectives, and objectives for each step in your plan. You are not expected to have a plan for the whole project but instead have a clear understanding of what you'll try to achieve in the next step and build the plan one step at a time.
- Perform exploratory data analysis. This should include creating statistical summaries and charts, testing for anomalies, checking for correlations and other relations between variables, and other EDA elements.
- Perform statistical inference. This should include defining the target population, forming multiple statistical hypotheses, constructing confidence intervals, setting the significance levels, and conducting z- or t-tests for these hypotheses.
- Use machine learning models to predict the target variables based on your proposed plan. You should use hyperparameter tuning, model ensembling, the analysis of model selection, and other methods. The decision of where to use and not to use these techniques is up to you; however, they should be aligned with your team's objectives.
- Deploy these machine learning models to Google Cloud Platform. You are free to choose any deployment option you wish as long as it can be called via an HTTP request.
- Provide clear explanations in your notebook. Your explanations should inform the reader what you are trying to achieve, what results you got, and what these results mean.
- Provide suggestions about how your analysis and models can be improved.
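
As an illustration of the statistical inference requirement above, here is a minimal sketch of a two-sided z-test for the difference between two default rates, together with a confidence interval for that difference. The function name `two_proportion_z_test` and the group counts are hypothetical and are not taken from the Home Credit data. The sketch uses only the Python standard library and is meant as one possible starting point, not a prescribed method.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(defaults_a, n_a, defaults_b, n_b, alpha=0.05):
    """Two-sided z-test for H0: both groups share the same default rate."""
    p_a = defaults_a / n_a
    p_b = defaults_b / n_b
    diff = p_a - p_b

    # Under H0 the rates are equal, so the test statistic uses the pooled rate.
    p_pool = (defaults_a + defaults_b) / (n_a + n_b)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # The confidence interval does not assume H0, so it uses the unpooled
    # standard error of the difference.
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    return {
        "diff": diff,
        "z": z,
        "p_value": p_value,
        "ci": ci,
        "reject_h0": p_value < alpha,
    }


# Hypothetical counts for illustration only (NOT from the dataset):
# group A has 450 defaults among 5,000 loans, group B has 380 among 5,000.
result = two_proportion_z_test(450, 5000, 380, 5000, alpha=0.05)
print(result)
```

In your own analysis, the counts would come from groups you define during EDA, and the significance level `alpha` should be fixed before you look at the test result. The normal approximation used here is reasonable only when each group has a large number of both defaults and non-defaults.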

## Bonus Challenges

As a data scientist, you will spend a significant amount of your time learning new things.
Sometimes you will do that for fun, but most of the time, you will have an urgent problem, and you will need to quickly learn some new skills to be able to solve it.
It is essential to build this skill gradually - it is extremely valuable for all data scientists.
The bonus challenges are designed to simulate these types of situations.
These challenges require you to do something that we haven't covered in the course yet.
Instead of trying to do all of the bonus challenges, concentrate on just one or two and do them well.
All of the bonus challenges are optional - no points will be deducted if you skip them.

- Provide a write-up on your project, using Appendix B, "Machine Learning Project Checklist", from [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/app02.html) as a guide.
- Try using a multi-output model.
- Try using a deep neural network-based model.

## Evaluation Criteria

- Adherence to the requirements. How well did you meet the requirements?
- Depth of your analysis. Did you skim the surface, or did you explore the dataset in-depth?
- Models' performance. How well did your models perform the predictions?
- Models' deployment. How performant, robust, and scalable is your models' deployment?
- Visualization quality. Did you use charts effectively to visualize patterns in the data? Are your visualizations properly labeled? Did you use colors effectively? Did you adhere to the principle of proportional ink?
- Code quality. Was your code well-structured? Did you use the appropriate levels of abstraction? Did you remove commented-out and unused code? Did you adhere to PEP 8?
- Code performance. Did you use suitable algorithms and data structures to solve the problems?
- Delivery of the presentation. How clear are your reasoning and explanations?
- Bonus challenges. Have you put extra effort into doing one or more of the bonus challenges (no points will be deducted if you skip them)?

## Correction

During your project correction, you should present it as if talking to a technical team lead and a senior co-worker working in your team.
You can assume that they have strong data science and software engineering skills: they will understand technical jargon, notice things that could have been done better, and ask about the choices you've made (especially any questionable ones).
In addition, be careful not to spend your time explaining trivial concepts or code snippets that are simple - your best bet is to focus your presentation on the more difficult portions of your code.

## Resources

This task is based on the Home Credit Default Risk Kaggle Competition.
