# Data Science Methodology Overview

## Definition of Methodology:
- A methodology is a system of methods used in a particular area of study.
- It provides guidelines for decisions researchers must make during the scientific process.

## Importance of Methodology in Data Science:
- Data science methodology is a structured approach that guides data scientists in solving complex problems and making data-driven decisions.
- It includes data **collection forms**, **measurement strategies**, and **comparisons of data analysis methods** relative to different research goals and situations.
- Using a methodology helps conduct scientific research efficiently and avoids jumping directly to solutions without proper planning.

## John Rollins' Data Science Methodology:
- The data science methodology discussed in the course is outlined by John Rollins, an experienced IBM Senior Data Scientist.
- It shares several stages with **CRISP-DM** (Cross-Industry Standard Process for Data Mining), which has six phases, but Rollins' methodology consists of 10 stages:
    1. Business Understanding
    2. Analytic Approach
    3. Data Requirements
    4. Data Collection
    5. Data Understanding
    6. Data Preparation
    7. Modeling 
    8. Evaluation
    9. Deployment
    10. Feedback
    
### Importance of Asking Questions:
- Asking questions is the cornerstone of success in data science.
- Questions drive every stage of the data science methodology.
- The methodology aims to answer 10 basic questions aligned with defining the business issue, determining an approach, organizing data, and validating the final data design.

### 10 Basic Questions:
- What is the problem you're trying to solve?
- How can you use data to answer the question?
- What data do you need to answer the question?
- Where is the data coming from, and how will you receive it?
- Does the data you collect represent the problem to be solved?
- What additional work is required to manipulate and work with the data?
- When you apply data visualizations, do you see answers that address the business problem?
- Does the data model answer the initial business question, or must you adjust the data?
- Can you put the model into practice?
- Can you get constructive feedback from the data and stakeholders to answer the business question?

# 1. Business Understanding


## Importance of Business Understanding:
- Business understanding is crucial for defining the problem to be solved.
- It helps determine which data will be used to answer the core question.
- A clearly defined question directs the analytic approach needed to address the problem.

### Clarifying Goals and Objectives:
- Understanding the goal of the person asking the question is vital.
    - For example, if the goal is to reduce costs, it is important to know whether the aim is to improve efficiency or increase profitability.
- Breaking down objectives allows for structured discussions and prioritization.

### Engaging Stakeholders:
- Different stakeholders need to be involved in discussions to determine requirements and clarify questions.
    - Case Study: Healthcare Budget Allocation:
        - The case study involves an American healthcare insurance provider seeking to allocate a limited healthcare budget effectively.
        - The goal was to address patient readmissions, particularly for those with congestive heart failure.
        - IBM data scientists proposed a decision-tree model to understand the factors leading to readmissions.
        - An on-site workshop was conducted to gain business understanding and guide the analytics team.
        - Business requirements identified:
            - Predicting readmission outcomes for patients with congestive heart failure.
            - Predicting readmission risk.
            - Understanding the combination of events leading to the predicted outcome.
            - Applying an easy-to-understand process for new patients regarding their readmission risk.

![Business Understanding](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-DS0103EN-Coursera/images/200512.14%20Lesson%20Summary%20Part%201%20Infographic%20-%20Business%20Understanding.png)

# 2. Analytic Approach

## Importance of Selecting the Right Approach:
- The right approach depends on the question being asked.
- Selecting it involves seeking clarification from the person asking the question to pick the most appropriate path.

## Application in Data Science Methodology:
- Second stage of the methodology.
- Once the problem is defined, select the appropriate analytic approach based on business requirements.

## Understanding the Question
- Establishing a strong understanding:
    - Identify the type of patterns needed to address the question.
- Examples:
    - Predictive Model: For determining probabilities of an action.
    - Descriptive Approach: For showing relationships, such as clusters of similar activities.
    - Statistical Analysis: For problems requiring counts.
    - Classification Approach: For yes/no answers.

## Machine Learning
- Definition: Field of study that enables computers to learn without being explicitly programmed.
- Usage: Identifies relationships and trends in data that might not be otherwise accessible.
- Clustering association approaches: Used to learn about human behavior.
- Case Study: A **Decision Tree Classification Model** was selected.
    - Used to identify conditions leading to each patient's outcome.
    - Examines variables in each node to determine threshold values.
    - Provides predicted outcomes and likelihood based on dominant outcomes (yes/no).
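
The threshold search and the "dominant outcome" idea above can be sketched in plain Python. This is a minimal illustration, not the model from the case study: the helpers `best_threshold` and `leaf_summary`, the variable `prior_admissions`, and the toy records are all hypothetical. A real decision tree would repeat this search recursively over many variables.

```python
from collections import Counter

def best_threshold(values, labels):
    """Find the split value on one variable that misclassifies the fewest records.

    Each side of the split predicts its dominant (most common) label.
    Returns None when the variable has fewer than two distinct values.
    """
    best = None
    # Skip the largest value: splitting there would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        # Errors are the records that disagree with the dominant label on their side.
        errors = sum(
            len(side) - Counter(side).most_common(1)[0][1] for side in (left, right)
        )
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best[0] if best is not None else None

def leaf_summary(labels):
    """Return the dominant outcome in a leaf and the share of records that have it."""
    outcome, count = Counter(labels).most_common(1)[0]
    return outcome, count / len(labels)

# Hypothetical toy data: prior admissions per patient and whether they were readmitted.
prior_admissions = [0, 1, 1, 2, 3, 4, 5, 6]
readmitted = ["no", "no", "no", "no", "yes", "yes", "no", "yes"]

threshold = best_threshold(prior_admissions, readmitted)
high_risk = [lab for v, lab in zip(prior_admissions, readmitted) if v > threshold]
outcome, likelihood = leaf_summary(high_risk)
print(f"split at > {threshold}: predict {outcome!r} with likelihood {likelihood:.2f}")
```

The likelihood here is simply the fraction of records in the leaf that share the dominant outcome, which is one common way a tree reports how confident a yes/no prediction is.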

![Analytic Approach](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-DS0103EN-Coursera/images/200512.14%20Lesson%20Summary%20Part%202%20Infographic%20-%20Analytical%20Approach.png)

# 3. Data Requirements

## Analogy:
- Making a spaghetti dinner without the right ingredients compromises success.
- Think of this section as cooking with data; each step is critical.

## Key Steps:
- Identify required ingredients (data).
- Source or collect the ingredients.
- Understand and work with the ingredients.
- Prepare the data to meet the desired outcome.

## Building on Problem Understanding
- Foundation:
    - Build on understanding the problem. 
    - Use the selected analytical approach.
- Getting Started:
    - The data scientist is ready to start with **data requirements**.
    
## Defining Data Requirements
- Importance:
    - Vital before data collection and preparation stages.
- Example:
    - Decision-tree classification.
    - Identify necessary data content, formats, and sources.

## Case Study: Data Requirements
- First Task:
    - Define data requirements for decision tree classification.
    - Patient Cohort Selection:
        - Criteria for inclusion:
            - Admitted as in-patient within the provider service area.
            - Primary diagnosis of congestive heart failure during one full year.
            - Continuous enrollment for at least six months prior to primary admission.
        - Exclusion: Patients with other significant medical conditions to avoid skewing results.
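
The inclusion and exclusion criteria above amount to a filter over patient records. The sketch below shows one way to express them. The field names, the study year, and the 182-day approximation of "six months" are assumptions made for illustration and do not come from the case study.

```python
from datetime import date

# Hypothetical patient records; all field names and values are illustrative.
patients = [
    {"id": "P1", "inpatient": True, "in_service_area": True, "primary_dx": "CHF",
     "admitted": date(2023, 5, 10), "enrolled_since": date(2022, 1, 1),
     "other_major_condition": False},
    {"id": "P2", "inpatient": True, "in_service_area": True, "primary_dx": "CHF",
     "admitted": date(2023, 3, 15), "enrolled_since": date(2023, 1, 20),
     "other_major_condition": False},  # enrolled less than six months
    {"id": "P3", "inpatient": True, "in_service_area": True, "primary_dx": "CHF",
     "admitted": date(2023, 8, 1), "enrolled_since": date(2021, 6, 1),
     "other_major_condition": True},   # excluded: other significant condition
    {"id": "P4", "inpatient": True, "in_service_area": True, "primary_dx": "CHF",
     "admitted": date(2023, 11, 2), "enrolled_since": date(2023, 3, 1),
     "other_major_condition": False},
]

STUDY_START, STUDY_END = date(2023, 1, 1), date(2023, 12, 31)  # assumed study year
MIN_ENROLLMENT_DAYS = 182  # "at least six months", approximated in days

def in_cohort(p):
    """Apply the inclusion criteria, then the exclusion criterion."""
    enrolled_days = (p["admitted"] - p["enrolled_since"]).days
    return (
        p["inpatient"]
        and p["in_service_area"]
        and p["primary_dx"] == "CHF"
        and STUDY_START <= p["admitted"] <= STUDY_END
        and enrolled_days >= MIN_ENROLLMENT_DAYS
        and not p["other_major_condition"]
    )

cohort_ids = [p["id"] for p in patients if in_cohort(p)]
print(cohort_ids)
```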
        
## Data Content and Format
- Modeling Technique:
    - Requires one record per patient.
    - Columns represent variables in the model.
- Data Coverage:
    - Includes admissions, diagnoses, procedures, prescriptions, and other services.
    - Thousands of records per patient rolled up to one record format.
    
## Data Preparation
- Anticipation:
    - Think ahead to subsequent stages.
    - Roll up transactional records to patient level.
    - Create new variables to represent information.
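
Rolling transactional records up to one record per patient can be sketched as a simple aggregation. The record layout and the derived variables (`n_admissions`, `total_cost`, and so on) are hypothetical. They stand in for whatever new variables the modeling technique actually needs.

```python
from collections import defaultdict

# Hypothetical transactional records: one row per event, many rows per patient.
transactions = [
    {"patient_id": "P1", "type": "admission", "cost": 1200.0},
    {"patient_id": "P1", "type": "prescription", "cost": 40.0},
    {"patient_id": "P1", "type": "admission", "cost": 900.0},
    {"patient_id": "P2", "type": "procedure", "cost": 300.0},
    {"patient_id": "P2", "type": "prescription", "cost": 25.0},
]

# Maps each event type to the patient-level count column it feeds.
COUNT_COLUMNS = {
    "admission": "n_admissions",
    "prescription": "n_prescriptions",
    "procedure": "n_procedures",
}

def roll_up(rows):
    """Aggregate event rows into one record per patient."""
    table = defaultdict(
        lambda: {"n_admissions": 0, "n_prescriptions": 0, "n_procedures": 0, "total_cost": 0.0}
    )
    for row in rows:
        record = table[row["patient_id"]]
        column = COUNT_COLUMNS.get(row["type"])
        if column is not None:
            record[column] += 1
        record["total_cost"] += row["cost"]
    return dict(table)

patient_table = roll_up(transactions)
for patient_id, record in patient_table.items():
    print(patient_id, record)
```

Each key of `patient_table` is one patient and each field is one column, which matches the one-record-per-patient format the decision tree requires.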

# 4. Data Collection

## Initial Data Collection:
- Performed by the data scientist.
- Assessment to determine if the necessary data is available.
- Similar to shopping for ingredients; some may be out of season or cost more.

## Revising Data Requirements
- Revisions:
    - Data requirements may be revised based on availability and cost.
    - Decisions are made on whether more or less data is needed.

## Understanding Collected Data
- Techniques:
    - Descriptive statistics and visualization are applied to assess content, quality, and initial insights.
    - Gaps in the data are identified, along with a plan to fill or substitute them.
- Analogy: Ingredients are now on the cutting board.
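
A first pass of descriptive statistics and gap detection needs only the standard library. In this sketch the `ages` values are made up, and filling gaps with the median is just one possible substitution strategy, chosen here as an assumption. The right choice depends on the data and the problem.

```python
import statistics

# Hypothetical collected values; None marks a missing entry (a gap).
ages = [67, 72, None, 58, 81, 72, None, 64]

present = [a for a in ages if a is not None]
summary = {
    "count": len(present),
    "missing": len(ages) - len(present),
    "mean": statistics.mean(present),
    "median": statistics.median(present),
    "stdev": statistics.stdev(present),  # sample standard deviation
}

# Assumed strategy: substitute the median for each missing value.
filled = [a if a is not None else summary["median"] for a in ages]

print(summary)
print(filled)
```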

## Examples of Data Collection Stage
- Follow-up to the data requirements stage:
    - Collecting data involves knowing the source or where to find needed data elements.

### Case Study: Data Collection
- Data Sources:
    - Demographic, clinical, and coverage information of patients.
    - Provider information, claims records, pharmaceutical data, and other diagnoses-related information.
- Challenges:
    - Certain drug information was needed but not integrated with other data sources.
    - Deferred decisions about unavailable data; attempt to acquire later if needed.
- Building the Model:
    - Intermediate results from predictive modeling help decide whether additional data (e.g., drug information) is necessary.
    - Example: A reasonably good model was built without the drug information.
- Collaboration:
    - DBAs and programmers work together to extract and merge data from various sources.
    - Redundant data is removed for the next stage (data understanding).
- Data Management:
    - Discuss ways to better manage data.
    - Automate processes to make data collection easier and faster.
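
Merging extracts from several sources and dropping redundant rows might look like the sketch below. The two sources, their fields, and the `merge_sources` helper are hypothetical. In practice this work is usually done in the database or with a data-frame library.

```python
# Hypothetical extracts from two separate sources, keyed by patient ID.
demographics = [
    {"patient_id": "P1", "age": 67},
    {"patient_id": "P2", "age": 72},
    {"patient_id": "P2", "age": 72},  # redundant duplicate row
]
claims = [
    {"patient_id": "P1", "n_claims": 14},
    {"patient_id": "P2", "n_claims": 9},
    {"patient_id": "P3", "n_claims": 2},  # no matching demographic record
]

def merge_sources(*sources):
    """Combine rows from all sources into one record per patient ID.

    Rows sharing a patient ID are folded into a single record, so exact
    duplicates disappear; on conflicting values the later row wins.
    """
    merged = {}
    for source in sources:
        for row in source:
            merged.setdefault(row["patient_id"], {}).update(row)
    return list(merged.values())

patients = merge_sources(demographics, claims)
for record in patients:
    print(record)
```

Patient `P3` ends up without an `age` field, which is the kind of gap the data understanding stage would then need to address.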

# Glossary: From Problem to Approach

| Term | Definition |
|:-:|:---|
| Analytic Approach | The process of selecting the appropriate method or path to address a specific data science question or problem. |
| Analytics | The systematic analysis of data using statistical, mathematical, and computational techniques to uncover insights, patterns, and trends. |
| Business Understanding | The initial phase of data science methodology, which involves seeking clarification and understanding the goals, objectives, and requirements of a given task or problem. |
| Clustering Association | An approach used to learn about human behavior and identify patterns and associations in data. |
| Cohort | A group of individuals who share a common characteristic or experience and are studied or analyzed as a unit. |
| Cohort study | An observational study where a group of individuals with a specific characteristic or exposure is followed over time to determine the incidence of outcomes or the relationship between exposures and outcomes. |
| Congestive Heart Failure (CHF) | A chronic condition in which the heart cannot pump enough blood to meet the body's needs, resulting in fluid buildup and symptoms such as shortness of breath and fatigue. |
| CRISP-DM | Cross-Industry Standard Process for Data Mining, a widely used methodology for data mining and analytics projects encompassing six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. |
| Data analysis | The process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. |
| Data cleansing | The process of identifying and correcting or removing errors, inconsistencies, or inaccuracies in a dataset to improve its quality and reliability. |
| Data science | An interdisciplinary field that combines scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. |
| Data science methodology | A structured approach to solving business problems using data analysis and data-driven insights. |
| Data scientist | A professional using scientific methods, algorithms, and tools to analyze data, extract insights, and develop models or solutions to complex business problems. |
| Data scientists | Professionals with data science and analytics expertise who apply their skills to solve business problems. |
| Data-Driven Insights | Insights derived from analyzing and interpreting data to inform decision-making. |
| Decision tree | A supervised machine learning algorithm that uses a tree-like structure of decisions and their possible consequences to make predictions or classify instances. |
| Decision Tree Classification Model | A model that uses a tree-like structure to classify data based on conditions and thresholds and provides predicted outcomes and associated probabilities. |
| Decision Tree Classifier | A classification model that uses a decision tree to determine outcomes based on specific conditions and thresholds. |
| Decision-Tree Model | A model used to review scenarios and identify relationships in data, such as the reasons for patient readmissions. |
| Descriptive approach | An approach used to show relationships and identify clusters of similar activities based on events and preferences. |
| Descriptive modeling | Modeling technique that focuses on describing and summarizing data, often through statistical analysis and visualization, without making predictions or inferences. |
| Domain knowledge | Expertise and understanding of a specific subject area or field, including its concepts, principles, and relevant data. |
| Goals and objectives | The sought-after outcomes and specific objectives that support the overall goal of the task or problem. |
| Iteration | A single cycle or repetition of a process that often involves refining or modifying a solution based on feedback or new information. |
| Iterative process | A process that involves repeating a series of steps or actions to refine and improve a solution or analysis. Each iteration builds upon the previous one. |
| Leaf | The final nodes of a decision tree where data is categorized into specific outcomes. |
| Machine Learning | A field of study that enables computers to learn from data without being explicitly programmed, identifying hidden relationships and trends. |
| Mean | The average value of a set of numbers, calculated by summing all the values and dividing by the total number of values. |
| Median | The middle value in a set of numbers arranged in ascending or descending order, dividing the data into two equal halves. |
| Model (Conceptual model) | A simplified representation or abstraction of a real-world system or phenomenon used to understand, analyze, or predict its behavior. |
| Model building | The process of developing predictive models to gain insights and make informed decisions based on data analysis. |
| Pairwise comparison (correlation) | A statistical technique that measures the strength and direction of the linear relationship between two variables by calculating a correlation coefficient. |
| Pattern | A recurring or noticeable arrangement or sequence in data that can provide insights or be used for prediction or classification. |
| Predictive model | A model used to determine probabilities of an action or outcome based on historical data. |
| Predictors | Variables or features in a model that are used to predict or explain the outcome variable or target variable. |
| Prioritization | The process of organizing objectives and tasks based on their importance and impact on the overall goal. |
| Problem solving | The process of addressing challenges and finding solutions to achieve desired outcomes. |
| Stakeholders | Individuals or groups with a vested interest in the data science model's outcome and its practical application, such as solution owners, marketing, application developers, and IT administration. |
| Standard deviation | A measure of the dispersion or variability of a set of values from their mean; it provides information about the spread or distribution of the data. |
| Statistical analysis | An approach applied to problems that require counts, such as yes/no answers or classification tasks. |
| Statistics | The collection, analysis, interpretation, presentation, and organization of data to understand patterns, relationships, and variability in the data. |
| Structured data (data model) | Data that is organized and formatted according to a predefined schema or model and is typically stored in databases or spreadsheets. |
| Text analysis data mining | The process of extracting useful information or knowledge from unstructured textual data through techniques such as natural language processing, text mining, and sentiment analysis. |
| Threshold value | The specific value used to split data into groups or categories in a decision tree. |
| Reinforcement Learning | A type of learning loosely based on the way human beings and other organisms learn. |
| REST | Representational State Transfer: RE stands for Representational, S for State, and T for Transfer. |
| RStudio | Unifies programming, execution, debugging, remote data access, data exploration, and visualization into one tool. |
| SaaS | Software as a service. |
| Scala | A combination of "scalable" and "language". A general-purpose programming language that provides support for functional programming and has a strong static type system. |
| Spyder | Integrates code, documentation, and visualizations, among others, into a single canvas. |
| SQL | Structured Query Language, a non-procedural language used for querying and managing data. |
| Supervised Learning | A type of learning in which a human provides input data and correct outputs. |
| TensorFlow | A deep learning library for dataflow that was built with C++. |
| Unsupervised Learning | A type of learning in which the data is not labeled by a human. Examples are clustering models, which assign each record of a dataset to one of several groups of similar records. |
| Watson Studio | A fully integrated development environment for data scientists. |
| Weka | A suite of machine learning software for data mining, written in Java. |