# <a id='toc1_'></a>[Cleaning Notebook](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Cleaning Notebook](#toc1_)    
- [**High Level Data Cleaning Process**](#toc2_)    
  - [***Gather***](#toc2_1_)    
  - [***Assess***](#toc2_2_)    
  - [***Clean***](#toc2_3_)    
    - [**Clean Data has Two Dimensions:**](#toc2_3_1_)    
      - [**Principles of High Quality Data**:](#toc2_3_1_1_)    
      - [**Principles of Tidy Data**:](#toc2_3_1_2_)    
      - [**Ideal Order for Addressing Issues**:](#toc2_3_1_3_)    
- [**Further Pre-Processing Concept Discussion**](#toc3_)    
  - [Data Cleaning](#toc3_1_)    
  - [Feature Scaling](#toc3_2_)    
  - [Feature Encoding](#toc3_3_)    
  - [Feature Transformation](#toc3_4_)    
  - [Stratification](#toc3_5_)    
  - [Dimensionality Reduction](#toc3_6_)    
  - [Train-Test Split](#toc3_7_)    
  - [Handling Imbalanced Data](#toc3_8_)    
  - [Handling Time-Series Data](#toc3_9_)    
  - [Handling Text Data](#toc3_10_)    
  - [Handling Missing Data](#toc3_11_)    
    - [High-level principles for imputing missing data points:](#toc3_11_1_)    
    - [Strategies for Missing Data:](#toc3_11_2_)    
  - [Handling Duplicate Data](#toc3_12_)    
    - [High-level principles for handling duplicate data:](#toc3_12_1_)    
    - [Strategies for Duplicate Data:](#toc3_12_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc2_'></a>[**High Level Data Cleaning Process**](#toc0_)



## <a id='toc2_1_'></a>[***Gather***](#toc0_)
1. **Setup Libraries**: Import all the necessary libraries that you will need for your data analysis. This typically includes libraries like pandas, numpy, matplotlib, seaborn, etc. Setting up libraries at the beginning of your script ensures you have all the tools you need for analysis, visualization, and modeling.


```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```



2. **Load in Data**: Read in the data from your source file (like a CSV, Excel, SQL database, etc.) into a DataFrame, which is a type of data structure provided by the pandas library. This is your starting point for the data analysis.


```python
# 'data.csv' is a placeholder; point this at your own source file.
df = pd.read_csv('data.csv')
```



## <a id='toc2_2_'></a>[***Assess***](#toc0_)
1. **Programmatic Assessment**: Review the data using code, with methods such as `df.info()` and `df.describe()`. These methods help you understand the structure of the data, the types of variables you have, and basic statistics for each variable.

2. **Visual Assessment**: Review the data by scrolling through it in a spreadsheet or by displaying rows with `df.head()` and `df.tail()`. This can help you spot anomalies or patterns in the data that may not be immediately apparent through programmatic assessment.
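
The two assessment styles above can be sketched as follows. This is a minimal illustration that assumes a small, made-up DataFrame in place of the notebook's `df`; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the notebook's `df`.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 32],
    "city": ["Austin", "austin", "Boston", None, "austin"],
})

# Programmatic assessment: structure, dtypes, and summary statistics.
df.info()
summary = df.describe()

# Count missing values per column and fully duplicated rows.
missing_per_column = df.isna().sum()
n_duplicates = int(df.duplicated().sum())

# Visual assessment: look at the first and last rows.
print(df.head())
print(df.tail(2))
```

Counting missing values and duplicated rows at this stage gives a first list of issues to carry into the cleaning step.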



## <a id='toc2_3_'></a>[***Clean***](#toc0_)
1. **Define**: Define how you will clean the issue in words. This is your plan of action for dealing with the identified data quality and tidiness issues. It's important to define this plan before you start coding to ensure you have a clear understanding of the steps you need to take.

2. **Code**: Convert your definitions into executable code. This is where you implement your plan. This could involve writing functions to clean the data, using built-in pandas functions, or using other data cleaning libraries.

3. **Test**: Test your data to ensure your code was implemented correctly. This involves checking your cleaned data to confirm that it's in the expected format and that the data quality and tidiness issues have been addressed. This can be done using a combination of programmatic and visual assessments.
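
The Define, Code, Test cycle above might look like this for a single consistency issue. The data and the chosen fix are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data: the same city is written in three different ways.
df = pd.DataFrame({"city": [" Austin", "austin", "AUSTIN ", "Boston"]})

# Define: strip surrounding whitespace and use title case for every city name.

# Code: work on a copy so the raw data stays available for comparison.
df_clean = df.copy()
df_clean["city"] = df_clean["city"].str.strip().str.title()

# Test: confirm that only the expected spellings remain.
assert set(df_clean["city"]) == {"Austin", "Boston"}
assert df_clean["city"].str.strip().eq(df_clean["city"]).all()
```

Cleaning a copy rather than the original makes it easy to re-run the assessment on both versions and compare them.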

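### <a id='toc2_3_1_'></a>[**Clean Data has Two Dimensions:**](#toc0_)

Clean data is judged on two dimensions: **quality**, which concerns the content of the data, and **tidiness**, which concerns its structure.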


#### <a id='toc2_3_1_1_'></a>[**Principles of High Quality Data**:](#toc0_)
1. **Completeness**: Do we have all of the records that we should?
2. **Validity**: Do the records conform to a defined schema, that is, a defined set of rules for the data? Invalid data is present but breaks those rules.
3. **Accuracy**: Inaccurate data is valid but wrong. It adheres to the defined schema, but it is still incorrect.
4. **Consistency**: Inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing.

#### <a id='toc2_3_1_2_'></a>[**Principles of Tidy Data**:](#toc0_)
- Tidy data is organized with three qualities in mind:
    - **Columns:** Each variable forms a column.
    - **Rows:** Each observation forms a row.
    - **Tables:** Each type of observational unit forms a table.
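
As a small sketch of the first two tidy-data principles, the example below reshapes a wide table whose column headers hold values of a variable. The table, its column names, and the name `cases` are hypothetical.

```python
import pandas as pd

# Hypothetical untidy table: the variable "year" is spread across column headers.
untidy = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [10, 20],
    "2020": [11, 22],
})

# Reshape so that each variable is a column and each observation is a row.
tidy = untidy.melt(id_vars="country", var_name="year", value_name="cases")
tidy["year"] = tidy["year"].astype(int)
```

After reshaping, every row is one country-year observation, which makes filtering, grouping, and plotting by year straightforward.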


#### <a id='toc2_3_1_3_'></a>[**Ideal Order for Addressing Issues**:](#toc0_)

 1. **Completeness Issues** (fix missing data): Address these first so that subsequent cleaning steps do not have to be repeated.
 2. **Tidiness Issues**: Tidy datasets with data quality issues are almost always easier to clean than untidy datasets with the same issues.
 3. **Quality Control**: Address the remaining validity, accuracy, and consistency issues in that order.

# <a id='toc3_'></a>[**Further Pre-Processing Concept Discussion**](#toc0_)

## <a id='toc3_1_'></a>[Data Cleaning](#toc0_)

- **Definition:** Data cleaning, also known as data cleansing or scrubbing, is the process of identifying and correcting or removing errors, inaccuracies, and inconsistencies in datasets. This could include dealing with missing or null values, duplicate data, irrelevant data, and outliers.

- **Use Case and Intuition:** Data cleaning is crucial in any data analysis process as it ensures the quality and reliability of the data. The intuition behind it is that cleaner data leads to more accurate and reliable results from any subsequent data analysis or machine learning model.

- **Example:** Removing duplicate rows in a dataset, replacing missing values with the mean or median of the rest of the data, or dropping irrelevant columns.

- **High-Level Principles:** Data cleaning should be done carefully to avoid introducing bias into the data. It's also important to document the cleaning process for reproducibility and review.

- **Assumptions and Cautions:** The process assumes that the data errors can be found and corrected without introducing significant bias. Care should be taken not to distort the data during cleaning, as it can lead to misleading results.

- **Impact on ML Models:** Cleaner data can lead to more accurate and reliable machine learning models. On the other hand, poorly cleaned data can lead to inaccurate models and misleading results.



## <a id='toc3_2_'></a>[Feature Scaling](#toc0_)

- **Definition:** Feature scaling is a method used to normalize the range of independent variables or features of data. Common methods include min-max normalization and standardization (z-score normalization).

- **Use Case and Intuition:** Feature scaling is used when the features have different ranges. The intuition is that many machine learning algorithms perform better when numerical input variables are scaled to a standard range.

- **Example:** Scaling age and income variables to the same range for a machine learning model predicting credit risk.

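- **High-Level Principles:** All features should be scaled in the same way. Fit the scaling parameters (for example, the mean and standard deviation) on the training data only, then reuse them to transform the test data.

- **Assumptions and Cautions:** Standardization works best when features are roughly normally distributed, while min-max normalization makes no distributional assumption but is sensitive to extreme values. Outliers can distort the result of scaling, so consider handling outliers before scaling.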

- **Impact on ML Models:** Feature scaling can speed up the training process and can lead to better performance for many machine learning algorithms.
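
The two scaling methods named above can be sketched with NumPy alone. The age and income values are made up, and the key point illustrated is that the parameters come from the training set and are reused on the test set. Libraries such as scikit-learn offer `StandardScaler` and `MinMaxScaler` for the same purpose.

```python
import numpy as np

# Hypothetical feature matrix with columns [age, income].
X_train = np.array([[20.0, 30_000.0], [40.0, 50_000.0], [60.0, 100_000.0]])
X_test = np.array([[30.0, 65_000.0]])

# Standardization: learn the mean and standard deviation from the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std  # reuse the training parameters

# Min-max normalization: learn the minimum and range from the training set only.
x_min = X_train.min(axis=0)
x_range = X_train.max(axis=0) - x_min
X_train_mm = (X_train - x_min) / x_range
X_test_mm = (X_test - x_min) / x_range
```

Because the test set is transformed with training parameters, its scaled values are not guaranteed to fall inside the training range.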



## <a id='toc3_3_'></a>[Feature Encoding](#toc0_)

- **Definition:** Feature encoding is the process of converting categorical data into a form that could be provided to machine learning algorithms to improve their performance.

- **Use Case and Intuition:** Used when dealing with categorical data. The intuition is that most machine learning algorithms require numerical input, so categorical data is encoded as numerical values before modeling.

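- **Example:** Encoding a feature like "color" with values "red", "green", "blue" to numerical values like 1, 2, 3. Note that integer codes imply an order, so they suit ordinal features better than nominal ones such as color.
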
- **5 Common Usages**:
    1. One-Hot Encoding: Converts a categorical variable into one binary indicator column per category, so that algorithms requiring numerical input can use it without assuming an order among the categories.
    2. Binning: The process of transforming continuous numerical variables into discrete categories for grouped analysis.
    3. Polynomial Features: Creates powers of existing features and interaction terms between them.
    4. Custom Transformations: Logarithmic, square roots, or reciprocals to reduce the skewness of data.
    5. Date/Time Features: Extracting information like 'month of the year', 'day of the week', 'hour of the day', etc.

- **High-Level Principles:** Choose an appropriate encoding method based on the nature of the data. For example, ordinal encoding for ordinal data and one-hot encoding for nominal data.

- **Assumptions and Cautions:** Assumes that the categorical variable can be adequately represented as numerical values. Be aware of the "curse of dimensionality" when one-hot encoding a feature with many distinct categories, since each category adds a column.

- **Impact on ML Models:** Proper feature encoding can lead to better performance of machine learning models.
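
The distinction between ordinal and one-hot encoding can be sketched with pandas. The columns, categories, and the assumed order of `size` are hypothetical.

```python
import pandas as pd

# Hypothetical data with one nominal and one ordinal feature.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["small", "large", "medium", "small"],
})

# Ordinal encoding: state the category order explicitly, then take integer codes.
size_order = ["small", "medium", "large"]
df["size_code"] = pd.Categorical(df["size"], categories=size_order, ordered=True).codes

# One-hot encoding: one indicator column per color, with no implied order.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
```

Stating the category order explicitly avoids relying on alphabetical order, which would place "large" before "small".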



## <a id='toc3_4_'></a>[Feature Transformation](#toc0_)

- **Definition:** Feature transformation is the process of modifying existing features to better represent the underlying data patterns, or to meet the assumptions of the applied machine learning algorithms.

- **Use Case and Intuition:** Used when the relationship between the features and the target variable is not linear, or when the data does not meet the assumptions of the machine learning algorithm. The intuition is that transformed features may better expose the structure of the data to the model.

- **Example:** Applying log transformation to a feature to reduce skewness.

- **5 Common Usages**:
    1. Log Transformation: Reduces skewness in highly right-skewed data. It requires strictly positive values, or a shifted variant such as log(1 + x) when zeros are present.
    2. Square Root Transformation: A moderate transformation, weaker than the logarithm, that reduces right skew and can be applied to zero values.
    3. Box-Cox Transformation: This is a family of power transformations indexed by a parameter lambda. When lambda is zero, the Box-Cox transformation equals the log transformation. It requires strictly positive data.
    4. Yeo-Johnson Transformation: This is similar to the Box-Cox transformation but can be used on datasets containing zero and negative values.
    5. Quantile Transformation: This transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values.


- **High-Level Principles:** Choose an appropriate transformation based on the nature of the data and the requirements of the machine learning algorithm.

- **Assumptions and Cautions:** Assumes that a transformation can better expose the data structure. Be aware that some transformations may make the data harder to interpret.

- **Impact on ML Models:** Feature transformation can lead to better performance of machine learning models by meeting their assumptions or better exposing the structure of the data.
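
As a rough illustration of the log and square root transformations, the sketch below compares sample skewness before and after transforming a synthetic feature. The data is constructed (an assumption for the demo) so that its `log1p` is evenly spaced; real data will rarely be this clean.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed feature: exponentially spaced, non-negative values.
x = pd.Series(np.expm1(np.linspace(0, 5, 50)))

# log1p computes log(1 + x), so the zero in the data is handled without error.
x_log = np.log1p(x)

# The square root is a milder alternative for non-negative data.
x_sqrt = np.sqrt(x)

# Compare the sample skewness before and after each transformation.
skewness = pd.Series({
    "raw": x.skew(),
    "sqrt": x_sqrt.skew(),
    "log1p": x_log.skew(),
})
print(skewness)
```

On this data the square root reduces the skew partially and the log removes it, which matches the relative strength of the two transformations.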


## <a id='toc3_5_'></a>[Stratification](#toc0_)

- **Definition**: Stratification is the process of dividing members of the population into homogeneous subgroups before sampling. The strata should define a partition of the population. That is, it should be collectively exhaustive and mutually exclusive: every element in the population must be assigned to one and only one stratum.

- **Use case and Intuition**: Stratification is used when an entity wants to ensure that the sample represents certain characteristics in the population. The strata are formed based on members' shared attributes or characteristics such as income level, education level, etc.

- **5 Common Usages**:
    1. Stratified Random Sampling: In statistical surveys, when populations are divided into strata, a random sample is taken from each stratum in a number that is proportional to the stratum's size when compared to the population. These subsets of the strata are then pooled to form a random sample.
    2. Stratified Shuffle Split: It is a merge of Stratified K-Fold and Shuffle Split, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.
    3. Stratified Cross-Validation: In stratified cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two types of class labels.
    4. Stratified Train/Test Split: It is used in the splitting of data in a way that preserves the same proportions of examples in each class as observed in the original dataset.
    5. Stratified Sampling for Handling Imbalanced Datasets: In imbalanced datasets, stratified sampling can help in ensuring that the train, validation, and test sets have the same proportion of samples for each class as found in the original dataset.

- **Assumptions and Cautions**: Stratification assumes that the population can be divided into discrete, non-overlapping subgroups. If the strata do not accurately reflect the population, stratification can introduce selection bias and reduce the reliability of the results.

- **Interpretation**: Stratification ensures that each subset of the dataset has the same proportions of the different target classes as the original dataset. This is particularly useful in classification problems where the target class is imbalanced.

- **Feature Engineering, Assumptions and Cautions**: Feature engineering is more of an art than a science, and it heavily depends on the dataset and the problem at hand. It's always important to understand the underlying data and the business problem before deciding on the most appropriate feature engineering techniques.

- **Feature Engineering, Interpretation**: Feature engineering can significantly improve the performance of machine learning models by creating meaningful features from the data.
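
A stratified train/test split, as described in the stratification usages above, can be sketched with pandas by sampling within each class. The dataset and the 25% test fraction are assumptions; scikit-learn's `train_test_split` with its `stratify` argument is a common alternative.

```python
import pandas as pd

# Hypothetical imbalanced dataset: 80 rows of class 0 and 20 rows of class 1.
df = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 80 + [1] * 20,
})

# Sample 25% of the rows within each class to form the test set.
test = df.groupby("label").sample(frac=0.25, random_state=0)
train = df.drop(test.index)

# Both subsets keep the original 80/20 class ratio.
print(train["label"].value_counts(normalize=True))
print(test["label"].value_counts(normalize=True))
```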


## <a id='toc3_6_'></a>[Dimensionality Reduction](#toc0_)

- **Definition:** Dimensionality reduction is the process of reducing the number of random variables under consideration, by obtaining a set of principal variables.

- **Use Case and Intuition:** Used when dealing with high-dimensional data. The intuition is that reducing the dimensionality can help to remove noise and redundancy in the data, and can make the data easier to visualize and understand.

- **Example:** Using Principal Component Analysis (PCA) to reduce the dimensionality of a dataset.

- **High-Level Principles:** Choose an appropriate method based on the nature of the data. Be aware that dimensionality reduction can lead to loss of information.

- **Assumptions and Cautions:** Assumes that the data has redundancy or noise that can be removed. Be aware that the reduced dimensions may be harder to interpret.

- **Impact on ML Models:** Dimensionality reduction can lead to faster training times and better performance by removing noise and redundancy, but it can also lead to loss of information.
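
The PCA example can be sketched with a singular value decomposition in NumPy. The data here is synthetic and deliberately built to lie near a two-dimensional plane, so the result is cleaner than it would be on real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 points in 3 dimensions that lie close to a 2-D plane.
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, 0.2], [0.0, 1.0, -0.3]])
X = latent @ mixing + 0.01 * rng.normal(size=(200, 3))

# PCA via singular value decomposition of the centered data.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance captured by each principal component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first two components to get the reduced representation.
X_reduced = X_centered @ Vt[:2].T
```

Inspecting `explained_variance_ratio` is one way to judge how much information is lost when the later components are dropped.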



## <a id='toc3_7_'></a>[Train-Test Split](#toc0_)

- **Definition:** Train-test split is a technique for evaluating the performance of a machine learning model. It involves splitting the dataset into two subsets: a training set used to train the model, and a test set used to evaluate the model.

- **Use Case and Intuition:** Used in virtually all machine learning projects. The intuition is that evaluating the model on unseen data gives a better indication of the model's performance on new data.

- **Example:** Splitting a dataset into 70% training data and 30% test data.

- **High-Level Principles:** The split should be random and representative of the overall distribution of the data. The same split should be used for all models that are being compared.

- **Assumptions and Cautions:** Assumes that the test set is representative of new data. A model that performs well on the training data but poorly on the test data is likely overfitting.

- **Impact on ML Models:** Proper train-test split can give a better indication of the model's performance on new data, helping to choose the best model.
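
A 70/30 random split like the one in the example can be sketched as follows. The dataset and the seed are arbitrary assumptions; fixing the seed is what lets every model being compared see the same split.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with 10 rows.
df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) % 2})

# Shuffle the row positions with a fixed seed so the split is reproducible.
rng = np.random.default_rng(42)
shuffled = rng.permutation(len(df))

# Hold out 30% of the rows for testing and train on the remaining 70%.
n_test = int(round(0.3 * len(df)))
test = df.iloc[shuffled[:n_test]]
train = df.iloc[shuffled[n_test:]]
```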



## <a id='toc3_8_'></a>[Handling Imbalanced Data](#toc0_)

- **Definition:** Imbalanced data refers to a classification problem where the classes are not represented equally. Handling imbalanced data involves techniques to balance the classes, such as oversampling the minority class, undersampling the majority class, or using a combination of both.

- **Use Case and Intuition:** Used when dealing with imbalanced classification problems. The intuition is that machine learning algorithms can be biased towards the majority class, leading to poor performance on the minority class.

- **Example:** Using SMOTE (Synthetic Minority Over-sampling Technique) to oversample the minority class in a fraud detection problem.

- **High-Level Principles:** Choose an appropriate method based on the nature of the data and the problem. Resample only the training data, so that the test set keeps the original class distribution. Be aware that balancing the classes can lead to overfitting on the minority class.

- **Assumptions and Cautions:** Assumes that the imbalance in the classes is causing poor performance on the minority class. Be aware of the trade-off between improving performance on the minority class and potentially worsening performance on the majority class.

- **Impact on ML Models:** Handling imbalanced data can improve performance on the minority class, but it can also lead to overfitting on the minority class and potentially worse performance on the majority class.
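
The sketch below shows plain random oversampling of the minority class with pandas. This is a simpler technique than the SMOTE example mentioned above (SMOTE synthesizes new points rather than repeating existing ones), and the fraud data is made up.

```python
import pandas as pd

# Hypothetical training set: 95 legitimate transactions and 5 fraudulent ones.
train = pd.DataFrame({
    "amount": range(100),
    "is_fraud": [0] * 95 + [1] * 5,
})

majority = train[train["is_fraud"] == 0]
minority = train[train["is_fraud"] == 1]

# Draw minority rows with replacement until both classes are the same size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)

# Combine and shuffle; only the training data is resampled, never the test set.
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=0)
```

Because the minority rows are exact repeats, a model can memorize them, which is the overfitting risk noted above.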


## <a id='toc3_9_'></a>[Handling Time-Series Data](#toc0_)

- **Definition:** Time-series data is a sequence of data points indexed in time order. Handling time-series data involves techniques specific to this type of data, such as dealing with seasonality, trend, autocorrelation, and time-dependent variance (heteroscedasticity).

- **Use Case and Intuition:** Used when dealing with data collected over time, such as stock prices or weather data. The intuition is that time-series data often has temporal dependencies that need to be accounted for in the analysis or modeling process.

- **Example:** Using differencing to remove trend and seasonality in a time-series forecasting model.

- **High-Level Principles:** Time-series data should be analyzed and modeled with techniques that account for its temporal dependencies. The data should also be checked for stationarity, as many time-series models assume this.

- **Assumptions and Cautions:** Assumes that the temporal dependencies in the data can be modeled. Be aware that time-series models can be sensitive to the choice of time period and can be affected by missing values or changes in trend or seasonality.

- **Impact on ML Models:** Proper handling of time-series data can lead to more accurate forecasts. On the other hand, ignoring the temporal dependencies can lead to poor model performance.
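
Differencing and a time-ordered split can be sketched with pandas. The series is synthetic with a purely linear trend, which is an assumption chosen so the effect of differencing is easy to see.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a linear upward trend (2 units per day).
dates = pd.date_range("2024-01-01", periods=10, freq="D")
series = pd.Series(5.0 + 2.0 * np.arange(10), index=dates)

# First-order differencing removes the linear trend; the first value becomes NaN.
diffed = series.diff().dropna()

# Split chronologically instead of randomly so the model never sees the future.
train, test = series.iloc[:8], series.iloc[8:]
```

For seasonal patterns, differencing at the seasonal lag (for example `series.diff(periods=7)` for weekly seasonality in daily data) follows the same idea.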



## <a id='toc3_10_'></a>[Handling Text Data](#toc0_)

- **Definition:** Text data is a form of unstructured data: it is not organized in a pre-defined manner and does not have a pre-defined data model. Handling text data involves techniques such as tokenization, stemming, lemmatization, and vectorization.

- **Use Case and Intuition:** Used when dealing with text data, such as customer reviews or tweets. The intuition is that text data can contain valuable information, but it needs to be transformed into a numerical format that can be used by machine learning algorithms.

- **Example:** Using TF-IDF (Term Frequency-Inverse Document Frequency) to vectorize customer reviews for sentiment analysis.

- **High-Level Principles:** Text data should be preprocessed to remove noise (like punctuation and stop words), and transformed into a numerical format. The choice of transformation can depend on the problem and the nature of the text.

- **Assumptions and Cautions:** Assumes that the text data contains relevant information that can be extracted and used. Be aware that text data can be noisy and can require significant preprocessing.

- **Impact on ML Models:** Proper handling of text data can lead to more accurate models. On the other hand, poor handling can lead to noisy and uninformative features.
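
Tokenization and TF-IDF vectorization can be sketched in plain Python. The reviews are made up, and the weights use the textbook form tf × log(N / df); library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math
import re
from collections import Counter

# Hypothetical customer reviews.
reviews = [
    "Great product, works great!",
    "Terrible product. Stopped working.",
    "Great value",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

docs = [tokenize(r) for r in reviews]
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: the number of reviews that contain each term.
doc_freq = Counter(tok for doc in docs for tok in set(doc))

def tfidf(doc):
    """Return one TF-IDF weight per vocabulary term for a tokenized review."""
    counts = Counter(doc)
    return [
        (counts[t] / len(doc)) * math.log(len(docs) / doc_freq[t])
        for t in vocab
    ]

matrix = [tfidf(doc) for doc in docs]
```

Each row of `matrix` is a numerical vector for one review, with higher weights for terms that are frequent in that review but rare across the others.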



## <a id='toc3_11_'></a>[Handling Missing Data](#toc0_)

- **Definition:** Missing data occurs when no data value is stored for a variable in an observation. Handling missing data involves techniques such as imputation or deletion.

- **Use Case and Intuition:** Used when dealing with datasets with missing values. The intuition is that missing data can lead to biased or incorrect results, so it's important to handle it appropriately.

- **Example:** Using mean imputation to fill in missing values in a dataset.

- **High-Level Principles:** The method for handling missing data should be chosen based on the nature of the data and the reason for the missingness. It's also important to consider the impact of the chosen method on the subsequent analysis or modeling.

- **Assumptions and Cautions:** Assumes that the missing data can be accurately imputed or that it's safe to delete the missing values. Be aware that inappropriate handling of missing data can lead to biased or incorrect results.

- **Impact on ML Models:** Proper handling of missing data can lead to more accurate and reliable models. On the other hand, poor handling of missing data can lead to biased or incorrect models.


### <a id='toc3_11_1_'></a>[High-level principles for imputing missing data points:](#toc0_)

- Understand the mechanism of missingness: Data can be missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). The appropriate imputation method depends on which of these mechanisms is at work.

- Preserve the relationships in the data: The imputation method should preserve the relationships between variables as much as possible.

- Account for the uncertainty: Imputed values are estimates rather than observations, so the analysis should reflect the uncertainty they carry. Methods such as multiple imputation are designed for this purpose.

- Check the results: After imputation, check the results to ensure that the imputed data makes sense and that the statistical properties of the data have not been unduly distorted.

- Document the process: Keep a record of what imputation methods were used, and why. This is important for the reproducibility of the analysis.

### <a id='toc3_11_2_'></a>[Strategies for Missing Data:](#toc0_)

Handling missing data is a critical step in the data preprocessing pipeline. Here are five common methods for dealing with missing data, along with best practices for each:

1. **Listwise Deletion (Complete Case Analysis):** This method involves removing all data for an observation that has one or more missing values. 

   - **Best Practice:** Use this method when the data is missing completely at random, and the proportion of missing data is small. Be aware that this method can lead to a loss of information and reduced statistical power.

2. **Pairwise Deletion:** This method excludes a case only from the analyses that involve a variable for which that case has a missing value, so each analysis uses all the cases available for its variables.

   - **Best Practice:** Use this method when the data is missing completely at random. Be aware that this method can lead to different results for different analyses, depending on which cases are deleted.

3. **Mean/Median/Mode Imputation:** This method involves replacing the missing values for a particular variable with the mean, median, or mode of the available cases.

   - **Best Practice:** Use this method when the data is missing completely at random, and the variable is numerical (for mean or median imputation) or categorical (for mode imputation). Be aware that this method can lead to an underestimate of the variance and potentially biased estimates of the correlations between variables.

4. **Last Observation Carried Forward (LOCF) or Next Observation Carried Backward (NOCB):** This method involves replacing the missing value with the last observed value (LOCF) or the next observed value (NOCB).

   - **Best Practice:** Use this method for time-series data where the observations have a logical order. Be aware that this method can lead to biased estimates if the data is not missing at random.

5. **Multiple Imputation:** This method involves creating multiple imputed datasets, analyzing each one separately, and then pooling the results.

   - **Best Practice:** Use this method when the data is missing at random (MAR), that is, when the missingness can be modeled from the observed data. Be aware that this method is more complex and computationally intensive than the others.
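
Three of the strategies above (listwise deletion, median/mode imputation, and LOCF/NOCB) can be sketched with pandas. The data is hypothetical, and multiple imputation is left out because it typically relies on a dedicated library.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing numeric and categorical values.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 45.0],
    "city": ["Austin", "Boston", None, "Austin"],
    "temp": [20.0, np.nan, np.nan, 23.0],
})

# 1. Listwise deletion: drop every row that has at least one missing value.
complete_cases = df.dropna()

# 3. Median imputation for a numeric column, mode imputation for a categorical one.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

# 4. LOCF and NOCB for ordered data: forward fill and backward fill.
locf = df["temp"].ffill()
nocb = df["temp"].bfill()
```

Note how listwise deletion keeps only half of the rows here, which illustrates the information loss mentioned in its best practice.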


## <a id='toc3_12_'></a>[Handling Duplicate Data](#toc0_)

- **Definition:** Duplicate data refers to repetitions of the same data entry/record in the dataset. Handling duplicate data involves identifying and dealing with these repetitions to ensure the quality and reliability of the dataset.

- **Use Case and Intuition:** Duplicate data is often encountered in real-world datasets due to various reasons such as data entry errors, merging of datasets, etc. The intuition behind handling duplicate data is that duplicates can skew the data distribution and lead to biased analysis or machine learning models.

- **Example:** In a customer database, the same customer might be recorded multiple times due to data entry errors. These duplicates can be identified by matching across several fields like name, address, and contact information, and then removed to avoid over-representing this customer in subsequent analyses.

- **High-Level Principles:** The method for handling duplicates should be chosen based on the nature of the data and the reason for duplication. It's also important to consider the impact of the chosen method on the subsequent analysis or modeling.

- **Assumptions and Cautions:** The process assumes that duplicates do not carry unique information. Care should be taken to ensure that the records are true duplicates across all features and not just a subset. Also, in some cases, duplicates might be meaningful and should not be removed (e.g., in transactional data, the same customer can make the same purchase multiple times).

- **Impact on ML Models:** Proper handling of duplicate data can lead to more accurate and reliable models by ensuring a fair representation of all data points. On the other hand, not handling duplicate data can lead to overfitting towards the over-represented data points.

### <a id='toc3_12_1_'></a>[High-level principles for handling duplicate data:](#toc0_)

- Understand the Cause: Before handling duplicate data, it's important to understand why the duplicates occurred. This can help in choosing the best method for handling them.

- Preserve Information: The method for handling duplicates should preserve as much information as possible, unless the duplicates are likely due to errors.

- Check the Results: After handling duplicates, check the results to ensure that the process has not introduced any errors or biases.

- Document the Process: Keep a record of what methods were used to handle duplicates, and why. This is important for the reproducibility of the analysis.

### <a id='toc3_12_2_'></a>[Strategies for Duplicate Data:](#toc0_)
Handling duplicate data is another critical step in the data preprocessing pipeline. Here are five common methods for dealing with duplicate data, along with best practices for each:

1. **Removal of Duplicates:** This method involves identifying and removing duplicate records in the dataset.

   - **Best Practice:** Use this method when the duplicates do not provide any additional information and are likely to have occurred due to data entry errors. Be careful to ensure that the records are true duplicates across all features and not just a subset.

2. **Averaging:** If the duplicates have slight variations in a continuous feature, you may choose to average the feature values across the duplicate records and keep a single record.

   - **Best Practice:** Use this method when the duplicates are not exact duplicates and the variations in the continuous features are minor and likely due to measurement errors.

3. **Majority Voting:** If the duplicates have variations in a categorical feature, you may choose to apply a majority voting scheme and keep the mode of the feature values across the duplicate records.

   - **Best Practice:** Use this method when the duplicates are not exact duplicates and the variations in the categorical features are minor and likely due to data entry errors.

4. **Keeping the Most Recent:** In time-series data, if duplicates are found, you may choose to keep the most recent record and discard the older ones.

   - **Best Practice:** Use this method when the data is time-series and the duplicates are likely due to data being updated over time.

5. **Combining Information:** If the duplicates have different information in different features, you may choose to combine the information into a single record.

   - **Best Practice:** Use this method when the duplicates are not exact duplicates and the different information in the duplicates is valuable and should be preserved.
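
Several of these strategies can be sketched with pandas. The customer table, the `customer_id` key, and the `updated` timestamp column are assumptions made for the example.

```python
import pandas as pd

# Hypothetical customer records; customer 1 appears three times.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "name": ["Ann", "Ann", "Ann", "Bob"],
    "weight_kg": [60.0, 60.0, 60.4, 80.0],
    "updated": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-03-01", "2024-02-01"]),
})

# 1. Removal: drop rows that are identical across all columns.
exact_removed = df.drop_duplicates()

# 2 and 3. Averaging and majority voting: one row per customer.
aggregated = df.groupby("customer_id").agg(
    name=("name", lambda s: s.mode().iloc[0]),
    weight_kg=("weight_kg", "mean"),
)

# 4. Keeping the most recent: sort by timestamp, keep the last row per customer.
most_recent = df.sort_values("updated").drop_duplicates(subset="customer_id", keep="last")
```

Which of these is appropriate depends on why the duplicates exist, so it helps to compare row counts before and after each step.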
