<a href="https://colab.research.google.com/github/victorviro/Machine-Learning-Python/blob/master/CRISP_DM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CRISP-DM

# Table of contents

1. [ℹ️ Introduction](#1)
2. [👨‍💼 Business understanding](#2)
  1. [🎯 Determine the business objectives](#2.1)
  2. [🔎 Assess the situation](#2.2)
  3. [🤔 Determine the data mining goals](#2.3)
  4. [📋 Produce a project plan](#2.4)
3. [🧐 Data understanding](#3)
  1. [⚙️ Collect the initial data](#3.1)
  2. [👀 Describe the data](#3.2)
  3. [📉 Explore the data](#3.3)
  4. [🐛 Verify data quality](#3.4)
4. [🧑‍🔧 Data preparation](#4)
  1. [👉 Select data](#4.1)
  2. [🧹️ Clean data](#4.2)
  3. [👷 Construct data](#4.3)
  4. [⛓ Integrate data](#4.4)
  5. [🪛 Format data](#4.5)
5. [🧙 Modeling](#5)
  1. [👉 Select the modeling technique](#5.1)
  2. [✔️ Generate test design](#5.2)
  3. [🏗️ Build and train the model](#5.3)
  4. [🔎 Assess the model](#5.4)
6. [🕵️ Evaluation](#6)
  1. [🧐 Evaluate results](#6.1)
  2. [⭐ Review process](#6.2)
  3. [⏩ Determine next steps](#6.3)
7. [🚀 Deployment](#7)
  1. [📋 Plan deployment](#7.1)
  2. [🖥️🔧 Plan monitoring and maintenance](#7.2)
  3. [📄 Produce final report](#7.3)
  4. [✨ Review project](#7.4)
8. [📘 References](#8)


# ℹ️ Introduction <a name="1"></a>

CRISP-DM (*CRoss-Industry Standard Process for Data Mining*) is a **data mining methodology and process model** that provides a complete **📋 blueprint for conducting a data mining project**. CRISP-DM is **based on practical, real-world** data mining 👨‍💼 **experience**. It highlights the fact that a data mining project does not finish once a well-suited model is built: the model has to be 🚀 deployed, 🔧 maintained, and 🖥 monitored, and the project has to be exhaustively 📝 documented so that other data mining projects (or 🔄 iterations of the same project) can benefit from the knowledge gained and build on it.

- As a methodology, it includes **descriptions of the phases of a project**, the **tasks needed in each phase**, and an explanation of the 🔗 relationships between tasks. 
- As a process model, CRISP-DM offers a summary of the **life cycle of a data mining project**, breaking it down into **6️⃣ phases** shown in the following figure:

![](https://www.ibm.com/docs/es/SS3RA7_sub/modeler_crispdm_ddita/clementine/images/crisp_process.jpg)

The **arrows** indicate frequent and important **dependencies between the phases**. The sequence of the phases is **not strict** ↕️ (a project can move forward and backward between phases if necessary). The 🔄 **outer circle** symbolizes the **cyclical nature of data mining** itself and illustrates that the lessons learned during the data mining process can trigger new model iterations that will benefit from previous experiences.

CRISP-DM is **flexible**. For example, if a company wants to better understand the behavior of its customers, the work can focus on exploring and 📊 visualizing the data to obtain insights into that behavior. In this situation, the modeling, evaluation, and deployment phases can be less relevant than the data understanding and data preparation phases. However, it's still important to consider the questions that arise in the later phases when planning long-term and future data mining objectives.

# 👨‍💼 Business understanding <a name="2"></a>

An important phase of any data mining project is **understanding the project objectives and expectations from a 👨‍💼 business perspective**. We can then use that knowledge to **define a data mining problem and a 📝 preliminary plan** designed to achieve the objectives.

The business understanding phase involves several **key steps or tasks**.

## 🎯 Determine the business objectives <a name="2.1"></a>

The first step of the project is to understand, from a 👨‍💼 business perspective, **what the client wants to achieve** and describe the **criteria for deciding whether the project's result is ✅ valid** or useful.

- Identify the **area of the problem** (e.g., marketing, customer service, business development, etc).
- **Describe the problem** to be solved in a general and informal manner. E.g., "retain current customers" or "encourage customer loyalty with a more personalized service".
- Define the requirements of the project: motivation, objectives, success criteria.
- Identify the **key people and roles**. Who will be the **project owner?** Who will evaluate the criteria of success? **What business units will be affected by the data mining project?**
- Specify all **related business ❓ questions** and other nuances. E.g., "Will lower fees reduce the number of customers who leave?" or "Are some customers more important than others?"
- Identify the needs and expectations of customers.
- Specify **advantages expected in business terms** (e.g., "reduce lost customers").
- If the project is an evolution of another project, **analyze the current solution** (pros and cons, acceptance level,...). 
- **Specify in detail ☑️ success criteria** (e.g., "reduce lost customers by 10 percent" or "customers spend more time and view more pages on the site per visit"). Each **success criterion** must 🔗 **relate** to at least one of the specified **business objectives**. The success criteria can also be subjective (e.g., discover groups of customers with similar behavior).

## 🔎 Assess the situation <a name="2.2"></a>

This step involves a 🧐 detailed **investigation of the resources available** to accomplish the data mining project, from personnel and available data to 💻 computational resources and software.

- Identify basic 💾 **hardware and its availability** for the data mining project.
- Identify **data sources, their type** (database, stream sources, external sources...), and their **correct access**.
- Check 🛠️ **tools** and techniques **available**.
- Identify **system admin, database admin, technical support personnel, business domain experts**, for future questions.
- Identify 👩‍💻 **market analysts, data mining experts** and check **their availability**.


List the **requirements** of the project. While the main requirements are the business objectives mentioned in the previous task, it's necessary to consider other elements.

- Are there 🧑‍⚖️ **legal or 🔒 security constraints about the data or project results**? Is the use of the data allowed?
- Are there **requirements about the 🚀 deployment of the results**? E.g., will the results be shown in a 🌐 web app, or will they be stored in a database for later consumption? If the results are predictions, what latency is required (batch vs. online inference)?

List **assumptions about the data or the business**.

- List the assumptions made about **data quality** (accuracy, availability). Is there a **minimum 🎚 level of quality** required for the results?
- Identify **external factors** (like economic factors) that can affect the project (e.g., competing products)
- How does the project team expect to view the results? Do they want to **understand the model or simply view the results?** How important is the level of interpretability of the results?


List **constraints** on the availability of resources and on technology (e.g., the amount of data).

- **Legal** constraints about the **use of the data**
- 💰 **Budget** constraints (fixed costs, implementation costs)
- Check if the **data is 🚫 accessible**


Identify the **project risks** or events that can delay the project and **list potential solutions** to those risks. 

- 🗓 Scheduling: What happens if the project takes longer than scheduled? 
- 💸 Financial: What happens if the project sponsor detects budgetary problems?
- Data: What happens if the data is of poor quality or coverage?
- Results: What happens if the initial results are not as good as expected?

Finally, construct a **cost-benefit analysis** for the project

- Estimate 💲 **cost of data collection, operationalization and deployment**.
- Identify the **benefits of the business objective** (e.g. improve customer satisfaction, ROI, increase profits), and **additional benefits** such as the knowledge acquired from 📊 data exploration or possible benefits from a better comprehension of the data
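A cost-benefit analysis can start as plain arithmetic. The sketch below is only an illustration: every cost item, benefit item, and figure in it is hypothetical and would have to be replaced by the project's own estimates.

```python
# Hypothetical cost and benefit estimates, all in the same currency unit.
costs = {
    "data_collection": 20_000,
    "operationalization": 15_000,
    "deployment": 25_000,
}
benefits = {
    "retained_customer_revenue": 90_000,
    "reduced_support_load": 12_000,
}

total_cost = sum(costs.values())
total_benefit = sum(benefits.values())
net_benefit = total_benefit - total_cost
roi = net_benefit / total_cost  # return on investment as a fraction of the cost

print(f"Total cost: {total_cost}, total benefit: {total_benefit}")
print(f"Net benefit: {net_benefit}, ROI: {roi:.0%}")
```

Benefits that are hard to quantify, such as the knowledge acquired from data exploration, can be listed next to these figures instead of being forced into the calculation.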

To make sure the business and data mining teams "speak the same language", we can generate a 📝 **glossary of business and data mining terms** (especially terms that can generate confusion) with illustrative examples related to the business problem.

## 🤔 Determine the data mining goals <a name="2.3"></a>

The data mining goal **translates the project objectives from business terms into technical terms**. If the business goal cannot be translated into a data mining goal, it may be wise to 🤚⏪ consider redefining the problem at this point. For **example**, the business objective "retain current customers" can be translated into a data mining goal like "predict the churn likelihood of each customer given the data about their purchases, behavior...". If this data mining goal is accomplished, it can be used by the company to retain current customers (e.g., by sending discount offers to customers who are likely to churn).

It's also necessary to **define ☑️ success criteria in technical terms** (e.g., a certain 🎚 level of predictive accuracy).

- Describe expected outputs of the project (usually techniques) which allow the achievement of the business objectives
- Translate business questions to data mining objectives 🎯 (e.g. customer segmentation by using a clustering algorithm)
- Specify the **type of the data mining problem** (🏷️ classification, 📈 regression, clustering, etc).
- Specify the **success criteria to evaluate the model** (accuracy, complexity, interpretability). Define tests for these criteria and threshold values.
- Specify rules to apply **subjective evaluation criteria** (e.g. descriptive ability of the model).

## 📋 Produce a project plan <a name="2.4"></a>

Make a **plan for achieving the data mining 🎯 goals** (and therefore business objectives). The **questions** that have been posed so far and the **business and data mining objectives** formulated will **form the basis of this plan**. It must include the **steps** to perform during the rest of the project (for each step, its ⏲ duration, **resources** required, dependencies), an assessment of **potential risks**, and an initial assessment of the 🔧 **tools and techniques needed** to support the project.

This plan is **dynamic**. At the end of each phase, we can **review** the progress and achievements, and ✍️ **update it if it is necessary**. Review points for these updates are part of the plan.

- Estimate the **effort and resources required** to develop the solution.
- Estimate the ⏳︎ **duration of each phase** of the project
- Identify 🔑 critical steps of the project
- Define **review and decision 📆 points**
- Identify the phases with more iterations (modeling)
- Create a list with **criteria for selecting ⚒️ tools** and techniques. Evaluate if they are adequate for the problem.
- Define the project plan and 🤝 **discuss** its viability **with all personnel involved**.
- Combine all objectives identified and techniques selected into a coherent procedure that tackles the business objectives and defines the success criterion.

An example of a simple project plan is available [here](https://www.ibm.com/docs/en/spss-modeler/SaaS?topic=plan-sample-project).

# 🧐 Data understanding <a name="3"></a>

It starts with an **initial data collection** and continues with activities to **become familiar with the data**, identify **data quality 🐛 problems**, discover **initial 📊 insights** from the data, or detect interesting subsets to formulate ❔ hypotheses about hidden information.

## ⚙️ Collect the initial data <a name="3.1"></a>

In this task we **acquire** the necessary **data** listed in the project resources (including data access, and loading and integrating this data if necessary).

We can create a 📝 **report with the data sets**, the **data sources** (existing data, additional data), the **methods used to acquire them, and problems encountered and the solutions** adopted to aid with future replications of the project (e.g., when collecting data from different sources, some of these sources may have a long lag time; it is helpful to know this in advance to avoid potential delays).

- Specify the **data selection criteria** (e.g., **which attributes are necessary** for the data mining 🎯 objectives and which are not? Why? How many attributes can we manage with the techniques selected?)
- **Is there enough data** to draw conclusions or make accurate predictions?
- **Be careful with 🐛 data quality** issues **when collecting data from different data sources**. E.g., are missing values handled in the same way in all data sources?
- **If data is 🖼️ unstructured, how** will we **encode** it to be modeled?
- Can we automate the data extraction process?

An example of a data collection report is available [here](https://www.ibm.com/docs/en/spss-modeler/SaaS?topic=data-e-retail-example-initial-collection).

## 👀 Describe the data <a name="3.2"></a>

In this task, we examine the **surface properties of the acquired data and ✍️ report on the results**, examining 🐛 **data issues** such as its **format, its quantity**, and other surface features (**basic statistics**). The 🗝 key question to ask is: **Does the data acquired satisfy the relevant requirements?** E.g., if age is an important field and the data does not reflect the entire age range, it may be wise to collect a different set of data. This step also provides a basic understanding of the data on which subsequent steps will build.

- Describe data tables and their ⛓ relationship
- Analyze **data volume** and data complexity
- Check accessibility and availability of attributes
- Check **attribute types** (🔢 numerical, 🏷 categorical, etc)
- Check the **range of values** of the attributes
- Analyze **correlation** between attributes and key 🔗 relationships
- Understand the **meaning of the attributes** and classify (describe) their values in business terms. Are they **coherent?**
- Compute **basic statistics of the attributes** (📊 distribution, mean, standard deviation, etc) and relate the results with their meaning in business terms
- Decide if the attribute is useful ✅ for the data mining 🎯 objectives
- Interview 👨‍💼 **business domain experts** and get **their opinion about the importance of the attributes**
- Decide if it's necessary to ⚖️ **balance the data** (in unbalanced datasets)
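Several of the checks above can be run in a few lines. The following is a minimal sketch using pandas (an assumption; the text does not prescribe a tool), and the `customers` table, its column names, and its values are all hypothetical.

```python
import pandas as pd

# Hypothetical customer table; column names and values are illustrative only.
customers = pd.DataFrame(
    {
        "age": [23, 35, 41, 29, 52, 38],
        "plan": ["basic", "premium", "basic", "basic", "premium", "basic"],
        "monthly_spend": [20.0, 55.5, 22.0, 19.5, 60.0, 21.0],
        "churned": [0, 0, 1, 0, 0, 1],
    }
)

n_rows, n_cols = customers.shape        # data volume
attribute_types = customers.dtypes      # numerical vs. categorical attributes
numeric_summary = customers.describe()  # count, mean, std, min, quartiles, max
age_range = (customers["age"].min(), customers["age"].max())

# Share of each class in the target, to judge whether the data is unbalanced.
class_balance = customers["churned"].value_counts(normalize=True)

# Pairwise Pearson correlation between the numerical attributes.
correlations = customers.select_dtypes(include="number").corr()

print(n_rows, n_cols)
print(attribute_types)
print(numeric_summary)
print(age_range)
print(class_balance)
print(correlations)
```

The numbers only become useful once they are related to their business meaning, for example by checking whether the observed age range covers the customers the project cares about.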

## 📉 Explore the data <a name="3.3"></a>

This task tackles the data mining questions that can be addressed using **querying** (e.g., what types of products do the customers in a particular group usually buy?), 📊 **visualization** (e.g., uncover potential fraud patterns), and **reporting** (distribution of related attributes, **aggregations**, significant **properties of subpopulations**, basic statistics).

Finally, we can create a **data exploration 📄 report** that **outlines first findings, initial hypotheses**, and the potential impact on the remainder of the project.

- Analyze in detail the **properties of interesting attributes** (basic statistics, interesting **subpopulations**...). What attributes look 🥇 promising for future analysis? 
- Has this exploration revealed new characteristics of the data? 
- **Formulate ❔ hypotheses** and convert them into data mining objectives if possible
- **Clarify data mining 🎯 objectives**. Has this exploration modified the objectives?
- Perform **basic analysis to ✔️ verify the hypotheses**. Has this exploration modified the initial hypotheses?
- Consider and evaluate information and conclusions of the previous task of describing data
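As an illustration of the querying and reporting described in this task, here is a small pandas sketch. The `purchases` table, its columns, and the customer groups are hypothetical.

```python
import pandas as pd

# Hypothetical purchase records; names and values are illustrative only.
purchases = pd.DataFrame(
    {
        "customer_group": ["young", "young", "senior", "senior", "young", "senior"],
        "product_type": ["shoes", "boots", "shoes", "shoes", "shoes", "boots"],
        "amount": [50.0, 120.0, 45.0, 55.0, 60.0, 110.0],
    }
)

# Querying: which product types do customers in a particular group usually buy?
young = purchases[purchases["customer_group"] == "young"]
young_product_counts = young["product_type"].value_counts()

# Reporting: aggregations that describe each subpopulation.
group_report = purchases.groupby("customer_group")["amount"].agg(["count", "mean", "sum"])

print(young_product_counts)
print(group_report)
```

Findings like these can go into the data exploration report as first findings or as hypotheses to verify.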

## 🐛 Verify data quality <a name="3.4"></a>

Real-world data is often messy 🙈. In this task, we examine the quality of the data. Is the **data complete?** Does the data **cover all cases?** Are the values correct, or do they contain 🐛 **errors?** If there are errors, **how often do they occur?** Are there **missing values?** If there are, how are they represented, when, and how often? Other common items to check include: the plausibility of values; the spelling of values; **attribute values that conflict with common sense** (e.g., teenagers with high income❗).

- Check the **coverage** of data (are all possible values represented?)
- Check that the **meaning of the attributes 🔗 matches their values**
- Identify **missing and blank values**
- Establish the **cause** and meaning **of missing or erroneous data**, and check attributes with similar meaning that have very different values
- 🧐 Check the spelling and format of values (e.g., the same value that sometimes starts with a lowercase letter and sometimes with a capital letter)
- Check **deviations (outliers)** and establish whether they **are noise or** can indicate **interesting phenomena** 🤔
- Check the plausibility of values (e.g., attributes that almost always have the same value)
- Check noise and **inconsistencies between sources** ⚠️
- Sometimes we can exclude some data if its quality is insufficient or if it does not contain valid information for the business or the data mining problem.

Based on the exploration and verification of data quality, we can prepare a **data quality 📝 report** that will guide the next phase.
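Some of these quality checks can be automated. The sketch below uses pandas on a hypothetical `raw` table; the 1.5 × IQR outlier rule and the age and income thresholds are assumptions chosen for the example, not values prescribed by CRISP-DM.

```python
import pandas as pd

# Hypothetical raw data with typical quality problems (illustrative values only).
raw = pd.DataFrame(
    {
        "age": [16, 34, None, 45, 17, 38],
        "city": ["Madrid", "madrid", "Bilbao", None, "Bilbao", "MADRID"],
        "income": [95000.0, 32000.0, 28000.0, 41000.0, 1500.0, 36000.0],
    }
)

# Missing values per attribute.
missing_per_column = raw.isna().sum()

# Spelling/format: the same value written with different capitalization.
distinct_raw = raw["city"].nunique()
distinct_normalized = raw["city"].str.lower().nunique()

# Outliers: values further than 1.5 * IQR from the quartiles (a common rule of thumb).
q1, q3 = raw["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = raw[(raw["income"] < q1 - 1.5 * iqr) | (raw["income"] > q3 + 1.5 * iqr)]

# Values that conflict with common sense: teenagers with a high income.
# Both thresholds are assumptions for this example.
suspicious = raw[(raw["age"] < 18) & (raw["income"] > 50000)]

print(missing_per_column)
print(distinct_raw, distinct_normalized)
print(outliers)
print(suspicious)
```

Whether a flagged row is noise or an interesting phenomenon remains a judgment call that the code cannot make.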

# 🧑‍🔧 Data preparation <a name="4"></a>

The data preparation phase covers all activities to 👷 **construct the final data set** (data that will be **fed into the modeling** tools) from the initial raw data. This phase usually takes 50-70% of a project's ⌛️ time and effort. Devoting adequate effort to the earlier 👨‍💼 business understanding and 🧐 data understanding phases can minimize this overhead, but we still need to spend a good amount of effort preparing and packaging the data for mining.

Tasks include 👉 attribute selection, as well as transformation, and 🧹️ cleaning of data for modeling tools.

## 👉 Select data <a name="4.1"></a>

**Deciding on the data** that will be used for the analysis is **based on several criteria**: its **relevance** to the data mining 🎯 goals, **quality, and technical constraints** (such as limits on data volume or data types). Some tasks of this phase can be:

- Define the **data ✔️ included and ❌ excluded**, and the **reasons** for that decision. We can perform **correlation** tests (or other methods for selecting important features) **to decide which attributes to include**. It is also a good idea to decide if one or more attributes are more important than others 🥇🥈🥉.
- Collect additional appropriate data (from internal or external sources)
- **Reconsider the data selection criteria** (from the "Collect the initial data" task) **in light of the real 🎚 quality of the data** and the results from data exploration and modeling.
- Are there any **constraints** on using particular fields such as gender or race or sensitive information?
- Select different **data subsets** (e.g. different attributes, only data that verify certain conditions, etc)
- Consider the use of **sampling techniques** 

Based on the exploration and verification of data quality, we need to prepare a **📝 report** with all this information.
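A correlation-based selection and a simple sampling step might look like the sketch below. The `data` table, the target column `spend`, and the 0.5 cut-off are hypothetical. Pearson correlation only captures linear relationships, so this is one possible selection method rather than a general recipe.

```python
import pandas as pd

# Hypothetical data set; column names and values are illustrative only.
data = pd.DataFrame(
    {
        "visits": [1, 2, 3, 4, 5, 6, 7, 8],
        "random_flag": [1, -1, 1, -1, -1, 1, -1, 1],
        "spend": [10.0, 19.0, 31.0, 42.0, 48.0, 61.0, 69.0, 82.0],
    }
)

target = "spend"
threshold = 0.5  # assumed cut-off on the absolute correlation with the target

# Absolute Pearson correlation of every attribute with the target.
correlation_with_target = data.corr()[target].drop(target).abs()
selected_attributes = correlation_with_target[correlation_with_target >= threshold].index.tolist()
excluded_attributes = correlation_with_target[correlation_with_target < threshold].index.tolist()

# Sampling: work with a reproducible random subset of the records.
sample = data.sample(frac=0.5, random_state=42)

print(correlation_with_target)
print("Included:", selected_attributes, "Excluded:", excluded_attributes)
print(sample)
```

Recording the threshold and the excluded attributes gives the report the reasons behind each inclusion decision.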

## 🧹️ Clean data <a name="4.2"></a>

If necessary, we can ⬆️ **increase the quality of the data** to the level required for the analysis techniques selected. This can involve **selecting 👌 clean subsets** of data **or** incorporating techniques for **estimating missing data** through modeling analyses (or by using adequate default values). ✍️ **Report how we address each quality problem** reported in the earlier "Verify data quality" step (see data quality 📄 report).

- Consider how to treat any **noise** observed. We can correct it, 🔥 remove it, or 🤷 ignore it.
- Decide how to treat **missing values**. We can exclude rows or attributes or we can fill blanks with an estimated value (mean, mode, regression, KNN...).
- **Reconsider the data selection criteria** (from the "Collect the initial data" task) **in light of the 🎚 quality of the data after the cleaning** process.
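The two basic options for missing values, excluding attributes and filling blanks with an estimate, can be sketched with pandas as follows. The `raw` table is hypothetical, and the 50% cut-off for dropping an attribute is an assumption.

```python
import pandas as pd

# Hypothetical data with missing values (illustrative only).
raw = pd.DataFrame(
    {
        "age": [25.0, None, 40.0, 31.0],
        "plan": ["basic", "premium", None, "basic"],
        "last_login_days": [3.0, None, None, None],
    }
)

cleaned = raw.copy()

# Exclude attributes where most values are missing (the 50% cut-off is an assumption).
missing_share = cleaned.isna().mean()
cleaned = cleaned.drop(columns=missing_share[missing_share > 0.5].index)

# Fill numerical blanks with the mean and categorical blanks with the mode.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].mean())
cleaned["plan"] = cleaned["plan"].fillna(cleaned["plan"].mode()[0])

print(cleaned)
```

Model-based estimates (regression, KNN) follow the same pattern but replace the mean or mode with a prediction from the other attributes.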

## 👷 Construct data <a name="4.3"></a>

This task involves data preparation operations such as the **creation of new records, 🤔 derived attributes** (these should only be added if they ease the modeling process or facilitate the modeling algorithm) or single-attribute **transformations**, usually performed **to fit the needs of the 🛠 modeling tools** (such as one-hot encoding or normalization). This task is also called "feature engineering".

- Build derived attributes, create new records, or transform existing attributes
- Decide if the **data** needs to be **normalized** before modeling
- Consider adding information about the 🥇🥈🥉 **importance of attributes** (attributes with **weight**, weighted normalization)
- Specify transformation steps necessary to execute the operations (e.g. binning a 🔢 numeric attribute)
- Execute the transformation steps
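The transformations mentioned in this task (a derived attribute, normalization, binning, and one-hot encoding) could be sketched with pandas like this. The table, the column names, and the bin edges are hypothetical.

```python
import pandas as pd

# Hypothetical customer data (illustrative only).
data = pd.DataFrame(
    {
        "age": [18, 30, 45, 70],
        "plan": ["basic", "premium", "basic", "gold"],
        "total_spend": [100.0, 300.0, 200.0, 500.0],
        "n_orders": [2, 3, 4, 5],
    }
)

# Derived attribute: average spend per order.
data["spend_per_order"] = data["total_spend"] / data["n_orders"]

# Min-max normalization of a numerical attribute to the [0, 1] range.
spend = data["total_spend"]
data["total_spend_scaled"] = (spend - spend.min()) / (spend.max() - spend.min())

# Binning a numerical attribute into categories (the bin edges are assumptions).
data["age_group"] = pd.cut(
    data["age"], bins=[0, 25, 60, 120], labels=["young", "adult", "senior"]
)

# One-hot encoding of a categorical attribute.
data = pd.get_dummies(data, columns=["plan"])

print(data)
```

If the same transformations must be applied to new data later, statistics such as the minimum and maximum should be computed on the training data only and then reused.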

## ⛓ Integrate data <a name="4.4"></a>

Integrating data involves **combining information from multiple tables or records to create new records or values.** With table-based data, an analyst can 🔗 **join** tables that have different information (attributes) about the same objects or ↪️ **concatenate** tables that have the same attributes but different records.

- ✅ Check if **integration applications can integrate the input sources** as required
- Integrate sources and consider 💾 **saving the resulting output** before proceeding to modeling. 
- **Reconsider the 👉 data selection criteria** (from the "Collect the initial data" task) **in light of the results of ⛓ data integration**
- Perform **aggregations**, operations where **new values are computed by summarizing information from multiple records/tables**. E.g., if we have two tables, 👟👠👢 products and product sales, we can compute an aggregation over the sales table to create a new attribute in the products table: "Number of sales of a product in the last 6 months"
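The products and sales example above can be sketched with pandas as follows. The table contents and the reference date are hypothetical; the join, the concatenation, and the aggregation are the operations described in this task.

```python
import pandas as pd

# Hypothetical tables (illustrative only).
products = pd.DataFrame({"product_id": [1, 2, 3], "name": ["sneakers", "heels", "boots"]})
sales = pd.DataFrame(
    {
        "product_id": [1, 1, 2, 1, 2],
        "sale_date": pd.to_datetime(
            ["2023-01-15", "2023-05-02", "2023-05-20", "2023-06-11", "2022-09-30"]
        ),
    }
)

# Aggregation: number of sales per product in the 6 months before a reference date.
reference_date = pd.Timestamp("2023-06-30")  # assumed "today" for the example
recent = sales[sales["sale_date"] > reference_date - pd.DateOffset(months=6)]
sales_last_6_months = (
    recent.groupby("product_id").size().rename("sales_last_6_months").reset_index()
)

# Join: add the aggregated attribute to the products table.
products = products.merge(sales_last_6_months, on="product_id", how="left")
products["sales_last_6_months"] = products["sales_last_6_months"].fillna(0).astype(int)

# Concatenate: stack tables that share attributes but hold different records.
more_products = pd.DataFrame(
    {"product_id": [4], "name": ["sandals"], "sales_last_6_months": [0]}
)
all_products = pd.concat([products, more_products], ignore_index=True)

print(all_products)
```

The left join keeps products without recent sales, which then get a count of 0 instead of disappearing from the table.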

## 🪛 Format data <a name="4.5"></a>

We **may need to change the format or design of the data**. Sometimes these changes are needed to make the data suitable for a specific modeling 🛠 tool. These changes might be simple (like trimming strings to a maximum length) or more complex (like ordering sequential data such as a 📈 time series). Some questions to consider in this task:

- Which models do we plan to use?
- Do these **models require a particular data format or order?**
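Both kinds of formatting change mentioned above, ordering sequential data and trimming strings, are short operations in pandas. In this sketch the `readings` table and the 10-character limit are hypothetical.

```python
import pandas as pd

# Hypothetical records that arrive out of order (illustrative only).
readings = pd.DataFrame(
    {
        "timestamp": ["2023-03-01", "2023-01-01", "2023-02-01"],
        "comment": ["a very long free-text comment", "ok", "short note"],
        "value": [3.0, 1.0, 2.0],
    }
)

# Parse the dates and order the records chronologically.
readings["timestamp"] = pd.to_datetime(readings["timestamp"])
readings = readings.sort_values("timestamp").reset_index(drop=True)

# Trim strings to a maximum length (the limit of 10 characters is an assumption).
max_length = 10
readings["comment"] = readings["comment"].str.slice(0, max_length)

print(readings)
```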

# 🧙 Modeling <a name="5"></a>

Modeling is usually conducted in 🔄 **multiple iterations**. It's rare for a data mining question to be answered satisfactorily with a single model and a single execution. In this phase, **various modeling techniques** are 👉 **selected** and applied and their **parameters are tuned** to optimal values. Typically, several techniques exist for the same data mining problem type. Some techniques have specific requirements on the form of the data. Therefore, **stepping ↩️ back to the data preparation phase may be necessary**.

## 👉 Select the modeling technique <a name="5.1"></a>

This task refers to **choosing** one or more specific **modeling techniques**, such as a decision 🌿 tree or a 🧠 neural network. If assumptions are attached to the modeling technique, these should be recorded.

- Decide **appropriate techniques for the problem**, taking into account the tool selected previously
- **Register** all **modeling techniques** that will be used
- Analyze every **assumption and requirement attached to the modeling technique** (comparing it with the 📄 report of data description) and make sure they are still valid
 - Are there any assumptions about the **data size**, or about the **data distribution**? 
 - Does the model require a certain 🎚 **level of data quality**? Can we meet this level with the current data? 
 - Some data mining techniques require specific data types. Does the data have the **proper type** for a particular model? If not, can we make the necessary conversions using data manipulation operations? 🏷 Categorical variables can be transformed to 🔢 numerical (e.g., one-hot encoding) and vice versa (e.g., binning)
 - If our **data is unbalanced**, does the model provide a way to give more importance to rare events (e.g., with weights)? If not, we can step ↩️ back to the data preparation phase and ⚖️ balance our data
 - 📝 **Document** any **data assumptions** as well as any **data manipulations made** to meet the model's requirements. 
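For unbalanced data, one common way to give more importance to rare events is to weight each class inversely to its frequency. The sketch below computes such weights for a hypothetical list of labels. The formula is the widely used "balanced" heuristic; how the weights are passed to a model depends on the modeling tool.

```python
from collections import Counter

# Hypothetical, unbalanced target: 8 negative and 2 positive examples.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count),
# so rare classes receive proportionally larger weights.
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(class_weights)
```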

## ✔️ Generate test design <a name="5.2"></a>

After building a model, we must ✅ **test the model’s quality and validity**. It's appropriate to **design the test procedure before building the model**; this also has implications for data preparation. In supervised data mining tasks, we typically ✂️ **separate the data set into train and test sets**, build the model on the train set, and estimate its quality on the separate test set. We can choose more sophisticated evaluation strategies like **cross-validation**. Some tasks of this phase can be:

- **Describe** the criteria for "goodness" or **metrics** of a model (↩️ check the success criterion defined in the phase "Determine the data mining goals"). Check there are ✔️ **tests for each data mining objective**
- **Define and prepare the data** on which these criteria will be tested. How do we separate the data set?
- Define the necessary steps (number of iterations, objective metrics, etc)
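To make the train/test split and cross-validation concrete, here is a minimal sketch that only produces row indices, using the Python standard library. The function names, the 20% test fraction, and the seed are hypothetical choices. Libraries such as scikit-learn provide equivalent utilities, and those would normally be preferred in a real project.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the row indices and split them into train and test indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists; each row is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


train_idx, test_idx = train_test_split_indices(10, test_fraction=0.2)
folds = list(k_fold_indices(10, k=5))

print(len(train_idx), len(test_idx))
print([len(validation) for _, validation in folds])
```

Fixing the seed makes the split reproducible, which matters when several models are compared on the same test design.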

## 🏗️ Build and train the model <a name="5.3"></a>


Once the training data has been selected, the data analyst **runs the modeling 🛠 tool to create one or more models**. The **parameters** are 👌 **adjusted and reported with their values** and the reasons for the values chosen. To be able to reproduce our results, we must keep 📝 **notes on the settings and data used** for each model. Here, an experiment tracking tool (like [MLflow](https://www.mlflow.org/docs/latest/tracking.html) or [W&B](https://wandb.ai/site)) can help. At the end of this phase, the **models** are obtained. It's necessary to **describe and interpret** them and ✍️ document each difficulty encountered and what it means. Some tasks of this phase can be:

- Adjust the parameters of the model. Once the **parameters that produce the 🥇 most accurate results** are found, 💾 **save the model** and ✍️ take **notes on the optimal settings**.
- **Describe** in detail the **model and the results**:
 - Are there **new insights** or unusual patterns **revealed by the model**? 
 - Were there **execution problems** for the model? How reasonable was the **processing ⌛️ time**?
 - Did the model have **difficulties with data quality 🐛 issues**, such as a high number of missing values? 
 - Express conclusions obtained by the model
 - For 📏 **rule-based models**, list out the **rules produced**, the evaluation for each rule, etc
 - For ⬛ **black-box models**, list out **technical information** about the model and **describe the behavior** produced by the modeling process
 - Describe the **interpretability** of the model
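Keeping notes on the settings and data used for each model can be as simple as writing one small record per run. The sketch below is a standard-library stand-in for a real experiment tracking tool, not the API of MLflow or W&B. The function `log_run`, the record fields, and the example values are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def log_run(log_dir, run_name, params, metrics, data_bytes):
    """Write the settings, metrics, and a fingerprint of the data to a JSON file."""
    record = {
        "run_name": run_name,
        "params": params,
        "metrics": metrics,
        # The hash identifies the exact data used without storing the data itself.
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    }
    path = Path(log_dir) / f"{run_name}.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path


log_dir = tempfile.mkdtemp()
run_path = log_run(
    log_dir,
    "tree_depth_3",
    params={"max_depth": 3},
    metrics={"accuracy": 0.875},
    data_bytes=b"age,plan\n25,basic\n",
)
restored = json.loads(run_path.read_text())

print(restored)
```

A dedicated tracking tool adds comparison views and artifact storage on top of the same idea.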

## 🔎 Assess the model <a name="5.4"></a>

The **models** have to be **interpreted based on domain knowledge, data mining success criteria, and the desired ✔️ test design**, and evaluated according to **evaluation criteria** (in technical terms). Business analysts and 👨‍💼 **domain experts** can then interpret the results in the **business context**. The **output** of this phase should be a **summary of the evaluation results**, and a 🥇🥈🥉 **ranking of the models** based on the quality obtained (in terms of evaluation criteria, e.g., accuracy). Some tasks of this phase can be:

- **Evaluate the models** against the evaluation criteria
- Perform a **ranking** taking into account the success and evaluation criteria and select the 🥇 best models
- **Interpret the results in business terms** (as much as possible in this phase). Get comments on the model from data experts and 👨‍💼👩‍💼 business domain experts
- Check the credibility and reliability of the model.
- Check the **effects on the data mining 🎯 objectives**
- Compare the models against a particular knowledge base to check if the information revealed is new and useful
- Analyze the **potential for the development of each result**
- If there is a verbal description of the model generated (e.g., in the form of rules), evaluate the rules
- Analyze specific aspects of each modeling technique and determine if it's **possible to perform adjustments in the parameters to obtain better results**  
- Based on the model evaluation, the parameters of the model can be reviewed and adjusted for the following 🔁 iterations to find a better model.
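The ranking step can be sketched as follows: compute one evaluation metric per model on the same test labels, sort, and compare against the technical success criterion. The model names, the predictions, and the 0.8 accuracy threshold are all hypothetical, and accuracy is used only as an example metric.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical test labels and predictions from three candidate models.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "decision_tree": [1, 0, 1, 0, 0, 0, 1, 0],
    "neural_network": [1, 0, 0, 1, 0, 1, 1, 0],
    "majority_baseline": [0, 0, 0, 0, 0, 0, 0, 0],
}
threshold = 0.8  # assumed technical success criterion

scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
accepted = [name for name, score in ranking if score >= threshold]

for name, score in ranking:
    print(f"{name}: {score:.3f}")
print("Meets the success criterion:", accepted)
```

Including a trivial baseline in the ranking helps judge whether a model's score is actually meaningful.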

# 🕵️ Evaluation <a name="6"></a>

**Before the deployment** of the model built, it is important to more 🧐 **thoroughly evaluate the model** and review the model’s construction to check if it **properly achieves the business 🎯 objectives**. Here it is critical to **determine if some important business issue has not been sufficiently considered**. At the end of this phase, there should be **a decision about how to use the data mining results**.

## 🧐 Evaluate results <a name="6.1"></a>

The previous evaluation step dealt with factors such as the accuracy and generality of the model (data mining criteria). This step 📏 **assesses** the degree to which the **model meets the business 🎯 objectives in terms of business success criteria and determines if there is some business reason why this model is ⚠️ deficient**. 
The evaluation also seeks to unveil additional challenges and information for future directions. Once the **models** generated **meet the criteria** selected, they are ✅ **approved for inclusion in the final 📄 report**. The list of approved models should include only models that satisfy both the data mining and business goals.

This step requires a clear understanding of the stated business goals, so we have to include key decision makers in the project assessment. Some tasks in this phase can be:

- **State clearly the results** in a form that can be easily presented
- **Check the effect, interpret, evaluate, estimate and 🥇🥈🥉 rank the results in terms of their applicability to the business goals** (business success criterion). In general, how well do the results answer the business goals? 
- 🖊️ **Highlight novel** or unique **findings** and ❔ questions the results have raised in business terms
- Test the model(s) on real-world applications if ⏳time and 💲budget constraints permit
- Compare the models against a particular knowledge base to check if the information revealed is new and useful
- Determine if there are **new business 🎯 objectives** for possible further iterations of the project or for new projects
- Annotate recommendations for future data mining projects 

## ⭐ Review process <a name="6.2"></a>


It is now appropriate to do a **🔎 thorough review of the data mining engagement** to determine if there are any important factors or tasks that have somehow been overlooked. This review also covers **quality assurance** 🐛 issues (e.g., did we correctly build the model? Did we only use allowable attributes that are available for future deployment?). At the end of this phase, there should be a 📃 **summary of the review process**, with the **activities** done **and decisions** made **for each phase**.

- Describe the data mining process. Consider the following ❔ **questions for each phase** of the project. 
 - Did this stage contribute to the value of the final results?  
 - Did it execute optimally? 
 - How can it be improved? 
 - Identify 👎 failures, mistakes, and deviations. How can they be avoided next time? 
 - Identify alternative decisions or strategies that might have been used in a given phase, and ✍️ note them for future data mining projects.
- Review the data mining results with regard to the business success ✅ criteria


## ⏩ Determine next steps <a name="6.3"></a>

At this stage, the project leader must **decide whether** to:

- **Continue to the 🚀 deployment** phase. If the results address our data mining and business goals we can incorporate the model results into our business process and produce a final report
- Initiate **further 🔄 iterations**. If we find that our results are not optimal, we can go back and refine or replace our models. We can take what we've learned and use it to refine the models and produce better results

At the end of this phase, there should be a 📋 **list of potential future actions**, with the pros and cons of each option, as well as a decision about how to proceed, with a justification. Some tasks in this phase can be:

- Analyze the potential for the development and the improvement of each result
- Check the remaining 💻 resources and 💰 budget to determine if they allow additional iterations of the process
- Refine the process plan
- 🏷 Classify the possible actions, 👉 select one, and 📝 document the reasons for the choice

# 🚀 Deployment <a name="7"></a>

Model creation is not the 🔚 end of the project. Deployment is the process of using the insights gained to make improvements within the organization. This often involves **applying "live" models** within an organization’s **decision-making processes** such as real-time personalization. Alternatively, deployment can mean that we **use the insights** gained from data mining **to elicit change** in our organization.

The 🗝 key steps here are 📋 planning and 🖥 monitoring the deployment of results, the production of a final 📄 report, and a review of the project.

## 📋 Plan deployment <a name="7.1"></a>

This phase takes the evaluation results and develops a **strategy for 🚀 deployment**. Some tasks in this phase can be: 

- 📋 **Summarize the results** (both models and findings). This helps to determine:
 - Which models can be integrated into our systems 
 - Which findings should be presented
- For each deployable model, create a 👣 **step-by-step plan for deployment and integration** with our systems. ✍️ **Note** any **technical details** such as database requirements for model output, latency requirements, etc.
- For each conclusive finding, create a plan to disseminate this information to strategy makers.
- Are there alternative deployment plans for both types of results that are worth mentioning?
- Consider **how the deployment** will be 🖥 **monitored**.
 - How will a deployed model be updated? Will we follow a 🟢🔵 blue-green deployment strategy or a 🐦 canary deployment?
 - How will we decide when the model is no longer applicable? 
- Identify any deployment 🐛 problems and plan for contingencies. For example, decision-makers may want more information on modeling results and may require that we provide further technical details.
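
To make the canary option mentioned above more concrete, here is a minimal sketch of how traffic could be split between the current model and a new one. It assumes each request carries a stable key (for example a user id) and that hashing that key into 100 buckets is an acceptable way to choose the share of traffic. The names `assign_variant` and `canary_percent` are hypothetical, and a real serving platform may offer its own routing mechanism.

```python
import hashlib


def assign_variant(request_key: str, canary_percent: int) -> str:
    """Route a request to the 'canary' or 'stable' model.

    The key is hashed into one of 100 buckets, so the same key always
    gets the same variant while canary_percent stays unchanged.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "canary" if bucket < canary_percent else "stable"


# Send roughly 5% of (hypothetical) users to the new model.
keys = [f"user-{i}" for i in range(10_000)]
canary_share = sum(assign_variant(k, 5) == "canary" for k in keys) / len(keys)
print(f"canary share: {canary_share:.3f}")
```

Hashing the key, rather than drawing a random number per request, keeps each user on the same model version, which makes the monitored metrics of both versions easier to compare. In a blue-green deployment the switch would instead move all traffic at once.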

## 🖥️🔧 Plan monitoring and maintenance <a name="7.2"></a>

A carefully prepared 🖥 **monitoring and 🔧 maintenance strategy avoids ❌ incorrect usage of data mining results**. In a full-fledged deployment and integration of modeling results, **models may degrade over time**, so they need to be 🧐 **evaluated periodically to ensure their effectiveness** and to make continuous ✨ improvements. Some tasks in this phase may include:

- Determine **which metrics should be 📏 measured**. For example, 💻 IT metrics (CPU, memory, network usage), latency, model performance (accuracy), etc
- Determine **what ℹ information should be logged** for each prediction: inputs, outputs, system action, model metadata, model explanation, etc
- Determine the **strategy to 📐 measure and 🖥️ monitor the model's performance**. How can we **detect degradation in the model's performance**? Will we perform *ground truth evaluation*, *input drift detection*, or both? Give specifics such as accuracy **thresholds** or expected changes in the data.
- What will occur when a model's performance 📉 degrades? 
 - Can we simply retrain the model with fresher data or make slight adjustments? How can we know the new model is better than the one currently running in production? Will we perform **online evaluation** through 🅰🅱 A/B testing or 🥇🏅 champion-challenger (shadow) testing?
 - Or will changes be pervasive enough to require a new data mining project?
- For each model or finding, which **factors** or influences (such as market value or seasonal variation) need **to be tracked**?
- Can this model be used for similar business issues? This is where good ✍️ **documentation** becomes critical for assessing the business purpose 🎯 for each data mining project.
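
The two detection strategies named above can be sketched in a few lines. This is only an illustration under stated assumptions: input drift is measured with the Population Stability Index (PSI) on a single numeric feature, ground truth evaluation is plain accuracy on predictions whose true labels have arrived, and the thresholds (PSI above 0.2, accuracy below 0.9) are placeholder values to be replaced by the ones agreed for the project. All function names are hypothetical.

```python
import math


def population_stability_index(reference, current, n_bins=10):
    """PSI between a reference sample and a current sample of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant reference feature: everything falls in one bin

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clip out-of-range values
        # Floor at a small epsilon so the logarithm below is always defined.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


def accuracy(y_true, y_pred):
    """Share of predictions that match the ground truth labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def needs_review(psi_value, acc, psi_threshold=0.2, min_accuracy=0.9):
    """Flag the model when inputs drifted or accuracy dropped (assumed thresholds)."""
    return psi_value > psi_threshold or acc < min_accuracy


# Synthetic data: training-time feature values vs. shifted production values.
reference = [i / 100 for i in range(1000)]
drifted = [x + 5 for x in reference]

psi_value = population_stability_index(reference, drifted)
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(needs_review(psi_value, acc))
```

Input drift detection can run as soon as new inputs arrive, while ground truth evaluation has to wait until the real outcomes are known, which is one reason to consider using both.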


## 📄 Produce final report <a name="7.3"></a>

📝 Writing a final report ties up loose ends in earlier documentation, and it can also be used **to 📢 communicate our results** to the various 👨‍💼 people with a stake in the results. This report includes all of the previous deliverables and summarizes and **organizes the results**. Depending on the deployment plan, this report may be only a 📋 **summary of the project** and its experiences **or** it may be a final and **comprehensive presentation** of the data mining result(s).

It's necessary to **consider the audience of our report**. We may need to create separate reports for each audience if their needs are disparate. In either case, our report should include most of the following ⚫ points:

- A thorough **description of** the original **business problem**
- The process used to conduct data mining
- 💰 **Costs** of the project
- Notes on any **deviations** from the original project plan
- A 📋 **summary of** data mining **results**, both models and findings
- An overview of the proposed plan for deployment
- **Recommendations** for further data mining work, including interesting leads discovered during exploration and modeling


In addition to the project report, we may also need to present the project findings to a team of sponsors or related departments. If this is the case, we can reuse much of the ℹ information in our report, presented from a broader perspective.

## ✨ Review project <a name="7.4"></a>


This is the final step of the CRISP-DM methodology, and it offers us a chance to formulate our **final impressions and lessons learned** during the data mining process (👎 failures, 👍 successes, potential areas of improvement for use in future projects, etc). This step can include ❓ **interviews with** the significant **project participants**. Questions to consider include the following: 

- What are your overall **impressions** of the project?
- What did you **learn during the process**, both about data mining in general and the data available?
- Which parts of the project went well 😀? Where did **difficulties** arise 😖? Was there 📖 information that might have helped ease the confusion?

After the data mining results have been 🚀 deployed, we might also ❓ **interview those affected by the results** (such as customers or business partners) to **determine whether the project was worthwhile** and offered the benefits it set out to create.

# 📘 References <a name="8"></a>

- "The CRISP-DM Model: The New Blueprint for Data Mining" by Colin Shearer

- [CRISP-DM by IBM](https://www.ibm.com/docs/en/spss-modeler/SaaS?topic=guide-introduction-crisp-dm)

- [CRISP-DM series by Sngular](https://www.sngular.com/es/data-science-crisp-dm-metodologia/): [Business understanding](https://www.sngular.com/es/crisp-dm-fase-i-comprension-del-negocio/), [Data understanding](https://www.sngular.com/es/crispdm-data-understanding/), [Data preparation](https://www.sngular.com/es/crispdm-data-preparation/), [Modeling](https://www.sngular.com/es/crisp-dm-fase-iv-modeling-modelado/), [Evaluation](https://www.sngular.com/es/crispdm-evaluacion/).