# CRISP-DM (CRoss Industry Standard Process for Data Mining)
![image.png](attachment:image.png)

# Business Understanding
## Define Business Problem
* Get the business context of the problem to be solved and assess the problem with the help of domain and subject matter experts (SMEs).
* Describe the main pain points or target areas for the business objective to be solved.
* Understand the solutions that are currently in place, what is lacking, and what needs to be improved.
* Define the business objective along with proper deliverables and success criteria based on input from business, data scientists, analysts, and SMEs.

## Assess and Analyze Scenarios
* Assess and analyze what is currently available to solve the problem from various perspectives including data, personnel, resource time, and risks.
* Build out a brief report of key resources needed (both hardware and software) and personnel involved. In case of any shortcomings, make sure to call them out as necessary.
* Discuss business objective requirements one by one and then identify and record possible assumptions and constraints for each requirement with the help of SMEs.
* Verify assumptions and constraints based on data available (a lot of this might be answered only after detailed analysis, hence it depends on the problem to be solved and the data available).
* Document and report possible risks involved in the project, including concerns about timelines, resources, personnel, data, and finances. Build contingency plans for each possible scenario.
* Discuss success criteria and try to document a comparative return on investment or cost versus valuation analysis if needed. This just needs to be a rough benchmark to make sure the project aligns with the company or business vision.

## Define Data Mining Problem
* Discuss and document possible Machine Learning and data mining methods suitable for the solution by assessing possible tools, algorithms, and techniques.
* Develop high-level designs for end-to-end solution architecture.
* Record notes on what the end output from the solution will be and how it will integrate with existing business components.
* Record success evaluation criteria from a Data Science standpoint. A simple example could be making sure that predictions are at least 80% accurate.

## Project Plan
* Definition of business objectives for the problem
* Success criteria for business and data mining efforts
* Budget allocation and resource planning
* Clear, well-defined Machine Learning and data mining methodologies to be followed, including high-level workflows from exploration to deployment
* Detailed project plan with all six phases of the CRISP-DM model defined with estimated timelines and risks

# Data Understanding
## Data Collection
* Extract, curate, and collect all data needed for your business objective

## Data Description
* Data sources (SQL, NoSQL, Big Data), record of origin (ROO), record of reference (ROR)
* Data volume (size, number of records, total databases, tables)
* Data attributes and their description (variables, data types)
* Relationship and mapping schemes (understand attribute representations)
* Basic descriptive statistics (mean, median, variance)
* Focus on which attributes are important for the business
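
The basic descriptive statistics listed above can be computed with Python's standard `statistics` module. This is a minimal sketch; the attribute `order_values` and its numbers are hypothetical and only illustrate the calls.

```python
import statistics

# Hypothetical sample of one numeric attribute (illustrative values only).
order_values = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "count": len(order_values),
    "mean": statistics.mean(order_values),
    "median": statistics.median(order_values),
    # Sample variance (divides by n - 1); use pvariance for the population form.
    "variance": statistics.variance(order_values),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

In a real project the same summary would typically be produced per attribute, so that unusual ranges or scales stand out early.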

## Exploratory Data Analysis
* Explore, describe, and visualize data attributes
* Select data and attributes subsets that seem most important for the problem
* Extensive analysis to find correlations and associations and test hypotheses
* Note missing data points if any
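
One common way to look for linear associations between two numeric attributes is the Pearson correlation coefficient. The sketch below computes it by hand so the formula is visible; the function name `pearson` and the sample attributes are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical attributes: the second is an exact multiple of the first.
ad_spend = [1.0, 2.0, 3.0, 4.0]
revenue = [2.0, 4.0, 6.0, 8.0]
print(pearson(ad_spend, revenue))
```

A coefficient near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests none; it says nothing about non-linear associations or causation.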

## Data Quality Analysis
* Missing values
* Inconsistent values
* Wrong information due to data errors (manual/automated)
* Wrong metadata information
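
A first pass at the checks above can be as simple as counting missing values per attribute and flagging values outside a plausible range. The records, the helper names, and the assumed valid age range of 0 to 120 are all hypothetical.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "country": "DE"},
    {"id": 2, "age": None, "country": "DE"},
    {"id": 3, "age": -5, "country": "FR"},
    {"id": 4, "age": 51, "country": None},
]

def missing_counts(rows):
    """Count missing (None) values per attribute."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + (1 if value is None else 0)
    return counts

def inconsistent_ages(rows, low=0, high=120):
    """Return ids whose age is present but outside the assumed plausible range."""
    return [r["id"] for r in rows
            if r["age"] is not None and not low <= r["age"] <= high]

print(missing_counts(records))
print(inconsistent_ages(records))
```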

# Data Preparation
## Data Integration
* Integrate or merge multiple datasets using shared attributes or common keys
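
Merging on a common key can be sketched as an inner join over lists of dictionaries. The datasets, the `inner_join` helper, and the key `customer_id` are assumptions for illustration; in practice a library such as pandas is usually used for this step.

```python
# Hypothetical datasets that share the key "customer_id".
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 1, "amount": 40.0},
    {"order_id": 12, "customer_id": 3, "amount": 15.0},  # no matching customer
]

def inner_join(left, right, key):
    """Merge two lists of dicts on a common key, keeping only matching rows."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

merged = inner_join(orders, customers, "customer_id")
for row in merged:
    print(row)
```

Rows without a match on either side are dropped by an inner join, so it is worth comparing row counts before and after merging.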

## Data Wrangling
* Handling missing values (remove rows, impute missing values)
* Handling data inconsistencies (delete rows, attributes, fix inconsistencies)
* Fixing incorrect metadata and annotations
* Handling ambiguous attribute values
* Curating and formatting data into necessary formats (CSV, JSON, relational)
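
The two options named above for missing values, removing rows or imputing, can be sketched as follows. Median imputation is only one possible strategy, chosen here for illustration; the rows and the `income` attribute are hypothetical.

```python
import statistics

# Hypothetical rows; None marks a missing income.
rows = [
    {"id": 1, "income": 42000.0},
    {"id": 2, "income": None},
    {"id": 3, "income": 58000.0},
    {"id": 4, "income": 50000.0},
]

def drop_missing(rows, column):
    """Option 1: remove rows where the column is missing."""
    return [r for r in rows if r[column] is not None]

def impute_median(rows, column):
    """Option 2: fill missing values with the median of the observed ones."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = statistics.median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

dropped = drop_missing(rows, "income")
imputed = impute_median(rows, "income")
print(len(dropped), imputed[1])
```

Both helpers return new lists and leave the input untouched, which keeps the raw data available for comparison.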

## Attribute Generation and Selection (aka feature extraction and engineering)
* Derive new attributes from existing ones, e.g. age = current_date - birth_date
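
The age example can be sketched with the standard `datetime` module. Subtracting two dates yields a `timedelta` in days, so a derived age in whole years needs a small adjustment for whether the birthday has already occurred; the function name `age_in_years` is an assumption for illustration.

```python
from datetime import date

def age_in_years(birth_date, current_date):
    """Whole years between birth_date and current_date."""
    had_birthday = (current_date.month, current_date.day) >= (
        birth_date.month, birth_date.day)
    return current_date.year - birth_date.year - (0 if had_birthday else 1)

birth = date(1990, 6, 15)
print(age_in_years(birth, date(2024, 6, 14)))  # the day before the birthday
print(age_in_years(birth, date(2024, 6, 15)))  # on the birthday
print((date(2024, 6, 15) - birth).days)        # raw difference in days
```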

# Modeling
## Selecting Modeling Techniques
Mainly decided by:
* Current data available
* Business goals
* Data mining goals
* Algorithm requirements
* Constraints

## Model Building
* Build models by combining the prepared features with Machine Learning algorithms

Keep track of:
* Models created, the model parameters used, and their results
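
Tracking models, parameters, and results can start with a plain in-memory log. The sketch below is one possible shape for such a record; `ModelRun`, `log_run`, the model names, and the metric values are placeholders and not real results.

```python
from dataclasses import dataclass, field

@dataclass
class ModelRun:
    """One trained model: its name, parameters, and evaluation results."""
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)

runs = []

def log_run(name, params, metrics):
    """Append a run to the log and return it."""
    run = ModelRun(name, params, metrics)
    runs.append(run)
    return run

# Placeholder entries; the numbers are made up for illustration.
log_run("logistic_regression", {"C": 1.0}, {"f1": 0.71})
log_run("random_forest", {"n_estimators": 200}, {"f1": 0.78})

best = max(runs, key=lambda r: r.metrics["f1"])
print(best.name, best.params, best.metrics)
```

Persisting such a log to a file or database makes results reproducible across sessions, which supports the later assessment step.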

## Model Evaluation and Tuning
* Evaluation metrics: accuracy, precision, recall, F1 score, ...
* Hyperparameter tuning: grid search and cross-validation
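
To make the metrics and the tuning loop concrete, here is a small pure-Python sketch: precision, recall, and F1 computed from binary labels, followed by a grid search over one hyperparameter scored by k-fold cross-validation. The toy 1-D k-nearest-neighbours classifier, the data, and all function names are assumptions for illustration; libraries such as scikit-learn provide ready-made versions of these steps.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest 1-D training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0

def cross_val_f1(xs, ys, k, n_folds=3):
    """Mean F1 across folds; a point belongs to fold (index % n_folds)."""
    scores = []
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        test = [i for i in range(len(xs)) if i % n_folds == fold]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        preds = [knn_predict(train_x, train_y, xs[i], k) for i in test]
        scores.append(precision_recall_f1([ys[i] for i in test], preds)[2])
    return sum(scores) / n_folds

# Hypothetical 1-D data with one noisy label at x = 5.0.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

grid = [1, 3, 5]  # candidate values for the hyperparameter k
results = {k: cross_val_f1(xs, ys, k) for k in grid}
best_k = max(results, key=results.get)
print(results, best_k)
```

Each candidate is scored only on held-out folds, so the chosen value is less likely to reflect overfitting to a single train/test split.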

## Model Assessment
* Model performance is in line with defined success criteria
* Reproducible and consistent results from models
* Scalability, robustness, and ease of deployment
* Future extensibility of the model
* Model evaluation gives satisfactory results

# Evaluation
* Ranking final models based on the quality of their results and their relevance to the business objectives
* Any assumptions or constraints that were invalidated by the models
* Cost of deployment of the entire Machine Learning pipeline from data extraction and processing to modeling and predictions
* Any pain points in the whole process? What should be recommended? What should be avoided?
* Data sufficiency report based on results
* Final suggestions, feedback, and recommendations from solutions team and SMEs

# Deployment
* Regular monitoring and maintenance of models to continuously evaluate their performance, check their results for validity, and retire, replace, or update models as and when needed.
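
A monitoring check can be sketched as comparing recent accuracy against the agreed success criterion. The 0.80 threshold mirrors the 80% accuracy example given under the data mining success criteria; the function name `needs_retraining` and the minimum sample size of 50 are assumptions for illustration.

```python
def needs_retraining(recent_outcomes, threshold=0.80, min_samples=50):
    """Flag a model whose recent accuracy fell below the success criterion.

    recent_outcomes: list of booleans, True where a prediction was correct.
    Returns False when there are too few samples to judge.
    """
    if len(recent_outcomes) < min_samples:
        return False
    accuracy = sum(recent_outcomes) / len(recent_outcomes)
    return accuracy < threshold

# Hypothetical monitoring windows of 100 labelled predictions each.
healthy_window = [True] * 90 + [False] * 10
degraded_window = [True] * 70 + [False] * 30

print(needs_retraining(healthy_window))
print(needs_retraining(degraded_window))
```

A flagged model would then be reviewed and updated, replaced, or retired as described above.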