# Data Understanding

## Purpose:
- To determine if the collected data is representative of the problem to be solved.
- Involves constructing the data set and analyzing its characteristics.

- Case Study Application
    - Descriptive Statistics:
        - Run against the data columns that will become variables in the model.
        - Includes measures like mean, median, minimum, maximum, and standard deviation.
    - Pairwise Correlations:
        - Identify relationships between variables.
        - Determine if any variables are highly correlated and redundant.
    - Histograms:
        - Examine distributions of variables.
        - Help decide on data preparation steps, such as consolidating categorical values.
    - Data Quality Assessment
        - Re-coding or Dropping Values:
            - Address missing or invalid values.
            - Example: Numeric variable "age" with values 0 to 100 and 999, where 999 means "missing".
    - Iterative Process
        - Refining Definitions:
            - Initial definition of congestive heart failure admission based on primary diagnosis.
            - Data understanding revealed the need to include secondary and tertiary diagnoses.
            - Loop back to data collection to refine the definition and improve the model.
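
The checks above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the case study's actual tooling: the sample columns (`age_raw`, `prior_visits`) and all function names are hypothetical, and only the `999` "missing" code comes from the example above.

```python
import math
import statistics

MISSING_CODE = 999  # sentinel meaning "missing", as in the age example

# Hypothetical sample columns, invented for illustration only.
age_raw = [34, 51, 999, 67, 72, 45, 999, 80]
prior_visits = [1, 2, 0, 4, 5, 2, 1, 6]


def recode_missing(values, sentinel=MISSING_CODE):
    """Replace the sentinel with None so it cannot distort statistics."""
    return [None if v == sentinel else v for v in values]


def describe(values):
    """Descriptive statistics computed over the non-missing values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
        "stdev": statistics.stdev(present),
    }


def pearson(xs, ys):
    """Pearson correlation using only rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mean_x = statistics.mean([x for x, _ in pairs])
    mean_y = statistics.mean([y for _, y in pairs])
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


age = recode_missing(age_raw)
age_summary = describe(age)
age_vs_visits = pearson(age, prior_visits)

print(age_summary)
print(round(age_vs_visits, 3))
```

Leaving the `999` codes in place would pull the mean of the raw column far above any real age, which is why re-coding happens before the statistics are trusted. A correlation close to 1 or -1 between two candidate variables suggests one of them may be redundant.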


# Data Preparation - Concepts

- Data Preparation
    - Analogy:
        - Similar to washing vegetables to remove unwanted elements.
        - Essential for transforming data into a usable state.

- Time Consumption
    - Significance:
        - Data collection, data understanding, and data preparation are together the most time-consuming phases, taking 70% to 90% of overall project time.
        - Automating parts of these phases can cut that share to as little as 50%, leaving more time for model creation.

- Data Transformation
    - Cooking Metaphor:
        - Like chopping onions finely to enhance flavor distribution in a sauce.
        - Transforming data makes it easier to work with.

- Key Questions
    - Objective:
        - Address missing or invalid values, remove duplicates, and ensure proper formatting.
        - Feature engineering to create useful characteristics for machine learning algorithms.

- Feature Engineering
    - Importance:
        - Uses domain knowledge to create features that improve model performance.
        - Critical for predictive models and influences results.

- Text Analysis
    - Steps:
        - Coding data for manipulation.
        - Ensuring proper groupings and not overlooking hidden information.

- Importance of Data Preparation
    - Foundation:
        - Sets the stage for subsequent steps.
        - If done correctly, supports the project; if skipped, may lead to subpar outcomes and rework.
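
As a rough sketch of the preparation tasks listed above (fixing formatting, handling missing values, removing duplicates, and engineering a feature), the following uses only the Python standard library. The rows, field names, and the `length_of_stay_days` feature are hypothetical examples and are not taken from the case study.

```python
from datetime import date

# Hypothetical raw rows: stray whitespace, mixed case, one duplicate, one gap.
raw_rows = [
    {"patient_id": " p001", "gender": "F ", "admitted": "2023-01-03", "discharged": "2023-01-08"},
    {"patient_id": "P001", "gender": "f", "admitted": "2023-01-03", "discharged": "2023-01-08"},
    {"patient_id": "P002", "gender": "M", "admitted": "2023-02-10", "discharged": ""},
]


def clean(row):
    """Standardize formatting; empty strings become None (missing)."""
    cleaned = {key: (value.strip() or None) for key, value in row.items()}
    for key in ("patient_id", "gender"):
        if cleaned[key] is not None:
            cleaned[key] = cleaned[key].upper()
    return cleaned


def deduplicate(rows):
    """Keep the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def add_length_of_stay(row):
    """Engineered feature: days between admission and discharge."""
    if row["admitted"] is None or row["discharged"] is None:
        days = None  # cannot be computed when a date is missing
    else:
        admitted = date.fromisoformat(row["admitted"])
        discharged = date.fromisoformat(row["discharged"])
        days = (discharged - admitted).days
    return {**row, "length_of_stay_days": days}


prepared = [add_length_of_stay(r) for r in deduplicate([clean(r) for r in raw_rows])]
for row in prepared:
    print(row)
```

The order matters here: the two `P001` rows only become recognizable as duplicates after their formatting has been standardized.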

# Data Preparation - Case Study

## Case Study: Data Preparation

### Defining Congestive Heart Failure:
- **Initial Step**: Define congestive heart failure precisely.
- **Diagnosis Codes**: Identify diagnosis-related group codes for fluid buildup and heart failure types.
- **Clinical Guidance**: Needed to get the correct codes.

### Defining Readmission Criteria:
- **Timing of Events**: Evaluate to define initial (index) admission vs. readmission.
- **30-Day Window**: Set as the readmission period following discharge.
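
A minimal sketch of the 30-day rule, assuming each stay is an (admission date, discharge date) pair for a single patient. The function name, the sample dates, and the choice to compare each admission only with the immediately preceding discharge are illustrative assumptions; the real definition in the case study depended on clinical guidance.

```python
from datetime import date

READMISSION_WINDOW_DAYS = 30  # window stated in the notes above


def flag_readmissions(stays):
    """Label each stay of one patient as a readmission or not.

    A stay counts as a readmission when it starts within the window
    after the previous stay's discharge (a simplifying assumption).
    """
    flagged = []
    previous_discharge = None
    for admitted, discharged in sorted(stays):
        is_readmission = (
            previous_discharge is not None
            and 0 <= (admitted - previous_discharge).days <= READMISSION_WINDOW_DAYS
        )
        flagged.append((admitted, discharged, is_readmission))
        previous_discharge = discharged
    return flagged


# Hypothetical stays for one patient.
stays = [
    (date(2023, 1, 5), date(2023, 1, 10)),   # index admission
    (date(2023, 1, 25), date(2023, 1, 28)),  # 15 days after discharge
    (date(2023, 4, 1), date(2023, 4, 4)),    # 63 days after discharge
]

for admitted, discharged, is_readmission in flag_readmissions(stays):
    print(admitted, discharged, is_readmission)
```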

### Aggregating Transactional Records:
- **Data Format**: Multiple records per patient, including claims and clinical services.
- **Aggregation**: Combine records to create a single record per patient for modeling.
- **New Columns**: Created to represent information such as visit frequency and co-morbidities.
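
The aggregation step can be sketched as follows. The records, field names, and `has_...` column names are hypothetical, and counting every record as one visit is a simplification made for this example only.

```python
# Hypothetical transactional records: several rows per patient.
transactions = [
    {"patient_id": "P001", "record_type": "claim", "diagnosis": "heart_failure"},
    {"patient_id": "P001", "record_type": "clinical", "diagnosis": "diabetes"},
    {"patient_id": "P001", "record_type": "claim", "diagnosis": "heart_failure"},
    {"patient_id": "P002", "record_type": "claim", "diagnosis": "hypertension"},
]

# Example co-morbidities to turn into indicator columns.
COMORBIDITIES = ("diabetes", "hypertension")


def aggregate_by_patient(rows):
    """Collapse many records per patient into one record per patient."""
    patients = {}
    for row in rows:
        pid = row["patient_id"]
        if pid not in patients:
            record = {"patient_id": pid, "visit_count": 0}
            record.update({"has_" + name: 0 for name in COMORBIDITIES})
            patients[pid] = record
        patients[pid]["visit_count"] += 1  # assumption: one record = one visit
        if row["diagnosis"] in COMORBIDITIES:
            patients[pid]["has_" + row["diagnosis"]] = 1
    return patients


patient_table = aggregate_by_patient(transactions)
for record in patient_table.values():
    print(record)
```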

### Considering Co-morbidities:
- Examples: Diabetes, hypertension, and other chronic conditions impacting readmission risk.

### Literature Review:
- **Purpose**: Ensure no important data elements were overlooked.
- **Outcome**: Added more indicators for conditions and procedures.

### Final Data Aggregation:
- **Merging Data**: Combine transactional data with demographic information.
- **Patient Table**: One record per patient with many columns representing attributes.
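
A short sketch of the merge, assuming the aggregated records and the demographic data share a `patient_id` key. All names and values are hypothetical. Patients without demographic data keep their row and receive `None` for the missing attributes (a left join), which is one reasonable choice rather than the case study's documented behavior.

```python
# Hypothetical inputs: one aggregated record per patient, plus demographics.
aggregated = [
    {"patient_id": "P001", "visit_count": 3, "has_diabetes": 1},
    {"patient_id": "P002", "visit_count": 1, "has_diabetes": 0},
]
demographics = {
    "P001": {"age": 71, "gender": "F"},
    "P003": {"age": 64, "gender": "M"},  # no transactions, so not in the output
}

DEMOGRAPHIC_COLUMNS = ("age", "gender")


def merge_patient_table(records, demo_by_id):
    """Left-join demographic columns onto the per-patient records."""
    table = []
    for record in records:
        demo = demo_by_id.get(record["patient_id"], {})
        merged = dict(record)
        for column in DEMOGRAPHIC_COLUMNS:
            merged[column] = demo.get(column)  # None when unavailable
        table.append(merged)
    return table


final_table = merge_patient_table(aggregated, demographics)
for row in final_table:
    print(row)
```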

### Variables for Modeling:
- Dependent Variable: Congestive heart failure readmission within 30 days (yes/no).

### Cohort Creation:
- Size: 2,343 patients meeting criteria.
- Split: Divided into training and testing sets for model building and validation.
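
One simple way to perform such a split is a seeded random shuffle, sketched below. The 70/30 proportion, the seed, and the use of integer patient identifiers are assumptions for illustration; the notes give only the cohort size.

```python
import random

COHORT_SIZE = 2343  # cohort size stated above


def split_cohort(patient_ids, test_fraction=0.3, seed=42):
    """Shuffle reproducibly, then cut into training and testing sets."""
    shuffled = list(patient_ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


cohort_ids = list(range(COHORT_SIZE))  # placeholder identifiers
train_ids, test_ids = split_cohort(cohort_ids)
print(len(train_ids), len(test_ids))
```

Fixing the seed makes the split repeatable, so later model comparisons are evaluated on the same held-out patients.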

# Summary
![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/eSgbfmUivhLEYkKzQI0izQ.jpg)

# Glossary: From Understanding to Preparation

| Term | Definition |
|:-:|:---|
| Analytics team | A group of professionals, including data scientists and analysts, responsible for performing data analysis and modeling. |
| Data collection | The process of gathering data from various sources, including demographic, clinical, coverage, and pharmaceutical information. |
| Data integration | The merging of data from multiple sources to remove redundancy and prepare it for further analysis. |
| Data Preparation | The process of organizing and formatting data to meet the requirements of the modeling technique. |
| Data Requirements | The identification and definition of the necessary data elements, formats, and sources required for analysis. |
| Data Understanding | The stage in which the collected data is explored and analyzed, using descriptive statistics and visualization, to check that it is representative of the problem to be solved. |
| DBAs (Database Administrators) | The professionals who are responsible for managing and extracting data from databases. |
| Decision tree classification | A modeling technique that uses a tree-like structure to classify data based on specific conditions and variables. |
| Demographic information | Information about patient characteristics, such as age, gender, and location. |
| Descriptive statistics | Techniques used to analyze and summarize data, providing initial insights and identifying gaps in data. |
| Intermediate results | Partial results obtained from predictive modeling that can influence decisions on acquiring additional data. |
| Patient cohort | A group of patients with specific criteria selected for analysis in a study or model. |
| Predictive modeling | The building of models to predict future outcomes based on historical data. |
| Training set | A subset of data used to train or fit a machine learning model; consists of input data and corresponding known or labeled output values. |
| Unavailable data | Data elements that are not currently accessible or integrated into the data sources. |
| Univariate | Analysis focused on a single variable or feature at a time, considering its characteristics independently of other variables. |
| Unstructured data | Data that does not have a predefined structure or format, typically text, images, audio, or video, and that requires special techniques to extract meaning or insights. |
| Visualization | The process of representing data visually to gain insights into its content and quality. |

# From Modeling to Evaluation

### Purpose of Data Modeling:
- Develop models that are either descriptive or predictive.

- Characteristics of the Process:
    - Descriptive models: Examine patterns (e.g., if a person did this, they’re likely to prefer that).
    - Predictive models: Yield yes/no or stop/go outcomes.

- Analytical Approach
    - Types:
        - Statistically driven or machine learning driven.
    - Training Set:
        - Historical data with known outcomes.
        - Used to calibrate the model.

- Model Development
    - Algorithm Testing:
        - Experiment with different algorithms to ensure necessary variables are included.
    - Success Factors:
        - Understanding the problem.
        - Appropriate analytical approach.
        - Quality of data (like ingredients in cooking).
        
- Continuous Improvement
    - Refinement:
        - Constant adjustments and tweaking are necessary to ensure a solid outcome.
        
### John Rollins' Descriptive Data Science Methodology
- Framework Goals:
    - Understand the question at hand.
    - Select an analytic approach or method to solve the problem.
    - Obtain, understand, prepare, and model the data.
- End Goal: Build a data model to answer the question.

- Evaluation and Deployment
    - Key Question:
        - Have I made enough to eat? (Metaphor for ensuring the model is sufficient).
    - Relevance:
        - Model evaluation, deployment, and feedback loops ensure the answer is relevant.
        - Critical for the development of the data science field.


## Modeling Stage - Case Study

### Initial Model
- First decision tree classification model built for congestive heart failure readmission
    - Looking for patients with high-risk readmission (outcome = "yes")

- Initial model accuracy:
    - Overall accuracy = 85%
    - Accuracy for "yes" outcomes = 45% (low)
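
The gap between the two figures is easier to see from a confusion matrix. The counts below are hypothetical, chosen only so that they reproduce the 85% overall accuracy and 45% "yes" accuracy quoted above; the resulting specificity is a by-product of those invented counts, not a figure from the case study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Overall accuracy, sensitivity ("yes" accuracy), specificity ("no" accuracy)."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts: 200 true readmissions, 800 true non-readmissions.
metrics = classification_metrics(tp=90, fn=110, tn=760, fp=40)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because non-readmissions dominate the data, a model can score well overall while still missing most of the patients it is meant to find.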

### Improving Model Accuracy
- Question: How to improve accuracy in predicting "yes" outcomes?
- For decision trees, adjust the relative cost of misclassified "yes" and "no" outcomes

### Cost of Misclassification
- False Positive (Type I error): True non-readmission misclassified as readmission
    - Cost = Wasted intervention
- False Negative (Type II error): True readmission misclassified as non-readmission
    - Cost = Readmission costs + patient trauma (higher cost)
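
To illustrate why unequal costs matter, the sketch below totals the cost of each model's errors under two weightings. The error counts for the two models are hypothetical, and the costs are unit-free relative weights rather than real monetary values.

```python
def misclassification_cost(fp, fn, cost_fp=1.0, cost_fn=1.0):
    """Total cost of errors given a relative cost per error type."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical error counts for two candidate models.
model_a = {"fp": 40, "fn": 110}   # few false alarms, many missed readmissions
model_b = {"fp": 120, "fn": 64}   # more false alarms, fewer missed readmissions

for label, cost_fn in (("1:1", 1.0), ("4:1", 4.0)):
    cost_a = misclassification_cost(cost_fn=cost_fn, **model_a)
    cost_b = misclassification_cost(cost_fn=cost_fn, **model_b)
    print(label, "model A:", cost_a, "model B:", cost_b)
```

With equal weights model A looks cheaper, but once a missed readmission is weighted four times as heavily as a wasted intervention, model B becomes the better choice.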

### Parameter Tuning
- Default relative cost weight = 1:1 for "yes" and "no"
- Model 2: Relative cost set to 9:1 (favoring "yes" outcomes)
    - "Yes" accuracy (sensitivity) = 97%, but "no" accuracy was low and overall accuracy fell to 49%
    - Too many false positives, not a good model
- Model 3: Relative cost set to 4:1
    - Sensitivity ("yes" accuracy) = 68%, specificity ("no" accuracy) = 85%, overall accuracy = 81%
    - Best balance achievable with a small training set
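
The notes do not say how the decision tree tool applied the cost weights internally. One common way to see their effect, shown here purely as an illustration, is through the decision threshold on a predicted probability: predicting "yes" minimizes expected cost whenever the probability exceeds `cost_fp / (cost_fp + cost_fn)`. The probabilities below are hypothetical.

```python
def decision_threshold(cost_fn, cost_fp=1.0):
    """Probability above which predicting "yes" has the lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)


def predict(probabilities, cost_fn, cost_fp=1.0):
    """Label each predicted readmission probability as "yes" or "no"."""
    threshold = decision_threshold(cost_fn, cost_fp)
    return ["yes" if p > threshold else "no" for p in probabilities]


# Hypothetical predicted readmission probabilities for four patients.
probabilities = [0.05, 0.15, 0.30, 0.60]

for label, cost_fn in (("1:1", 1.0), ("4:1", 4.0), ("9:1", 9.0)):
    print(label, decision_threshold(cost_fn), predict(probabilities, cost_fn))
```

A higher cost ratio lowers the threshold, so more patients are labeled "yes". That raises sensitivity and also produces more false positives, which matches the pattern seen across the three models.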

### Further Refinement
- More iterations back to data preparation stage
- Redefine variables to better represent underlying information
- Improve model performance

# Evaluation

### Importance of Model Evaluation
- Purpose:
    - Assess the quality of the model.
    - Ensure it meets the initial request.
- Key Question:
    - Does the model answer the initial question or need adjustments?

### Phases of Model Evaluation
1. Diagnostic Measures Phase:
    - Ensures the model works as intended.
    - Predictive Models: Use decision trees to evaluate alignment with initial design.
    - Descriptive Models: Apply a testing set with known outcomes for refinement.
2. Statistical Significance Testing:
    - Ensures proper handling and interpretation of data.
    - Avoids unnecessary second-guessing when revealing answers.

## Case Study Application
- Parameter Tuning:
    - Example: Tuning the relative cost of misclassifying "yes" and "no" outcomes.
    - Four models built with different relative misclassification costs.
    - Observation: The true-positive rate (sensitivity) increases at the expense of a higher false-positive rate.

- Optimal Model Selection
    - ROC Curve:
        - Definition: Receiver Operating Characteristic curve.
        - Purpose: Quantifies binary classification model performance.
        - Plot: True Positive Rate (TPR) vs. False Positive Rate (FPR) for different misclassification costs.
        - Optimal Model: The one with maximum separation between the ROC curve and the baseline (the diagonal that represents random guessing).
        - Example: Model 3 with a 4-to-1 misclassification cost was optimal.
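
The selection rule can be sketched by measuring each model's vertical distance above the diagonal, TPR minus FPR. The true-positive rates and Model 3's false-positive rate (1 minus its 85% specificity) follow the figures in these notes. The false-positive rates for Models 1 and 2 are not stated in the notes and are assumed values for illustration, and the fourth model is omitted because no figures are given for it.

```python
# (name, true-positive rate, false-positive rate)
roc_points = [
    ("Model 1 (1:1)", 0.45, 0.03),  # FPR assumed for illustration
    ("Model 2 (9:1)", 0.97, 0.65),  # FPR assumed for illustration
    ("Model 3 (4:1)", 0.68, 0.15),  # FPR = 1 - specificity of 85%
]


def separation_from_baseline(tpr, fpr):
    """Vertical distance above the diagonal TPR = FPR (random guessing)."""
    return tpr - fpr


best_model = max(roc_points, key=lambda p: separation_from_baseline(p[1], p[2]))

for name, tpr, fpr in roc_points:
    print(name, round(separation_from_baseline(tpr, fpr), 2))
print("Best:", best_model[0])
```

Under these assumed values, Model 3 sits farthest from the baseline, which is consistent with the conclusion above.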


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-DS0103EN-Coursera/images/Model%20to%20Evaluation.png)

# Glossary

| Term | Definition |
|:-:|:---|
| Automation | Using tools and techniques to streamline data collection and preparation processes. |
| Data Collection | The phase of gathering and assembling data from various sources. |
| Data Compilation | The process of organizing and structuring data to create a comprehensive data set. |
| Data Formatting | The process of standardizing the data to ensure uniformity and ease of analysis. |
| Data Manipulation | The process of transforming data into a usable format. |
| Data Preparation | The stage where data is cleaned, transformed, organized, and formatted to facilitate effective analysis and modeling, including feature engineering and text analysis. |
| Data Quality | Assessment of data integrity and completeness, addressing missing, invalid, or misleading values. |
| Data Quality Assessment | The evaluation of data integrity, accuracy, and completeness. |
| Data Set | A collection of data used for analysis and modeling. |
| Data Understanding | The stage in the data science methodology focused on exploring and analyzing the collected data to ensure that the data is representative of the problem to be solved. |
| Descriptive Statistics | Summary statistics that data scientists use to describe and understand the distribution of variables, such as mean, median, minimum, maximum, and standard deviation. |
| Feature | A characteristic or attribute within the data that helps in solving the problem. |
| Feature Engineering | The process of creating new features or variables based on domain knowledge to improve machine learning algorithms' performance. |
| Feature Extraction | Identifying and selecting relevant features or attributes from the data set. |
| Iterative Processes | Iterative and continuous refinement of the methodology based on insights and feedback from data analysis. |
| Missing Values | Values that are absent or unknown in the dataset, requiring careful handling during data preparation. |
| Model Calibration | Adjusting model parameters to improve accuracy and alignment with the initial design. |
| Pairwise Correlations | An analysis to determine the relationships and correlations between different variables. |
| Text Analysis | Steps to analyze and manipulate textual data, extracting meaningful information and patterns. |
| Text Analysis Groupings | Creating meaningful groupings and categories from textual data for analysis. |
| Visualization techniques | Methods and tools that data scientists use to create visual representations or graphics that enhance the accessibility and understanding of data patterns, relationships, and insights. |
