# Fundamentals of Machine Learning

Data preprocessing and cleaning are crucial steps in the machine learning pipeline. They involve transforming raw data into a suitable format that can be effectively used by machine learning algorithms. The quality and integrity of the data significantly impact the performance and reliability of the trained models.


Importance of data preprocessing and cleaning in ML:
- Ensures data consistency and reliability
- Reduces noise and irrelevant information
- Handles missing values and outliers
- Normalizes and scales features
- Encodes categorical variables
- Improves model performance and generalization


Common data quality issues:
1. Missing values
   - Occur when certain features or observations have no recorded value
   - Can be due to data collection issues, system failures, or human errors

2. Outliers
   - Data points that significantly deviate from the majority of the data
   - Can be caused by measurement errors, data entry mistakes, or genuine extreme values

3. Inconsistent formatting
   - Variations in data formats, such as date formats or units of measurement
   - Requires standardization for consistent processing

4. Duplicate entries
   - Repeated instances of the same data point
   - Need to be identified and handled appropriately

5. Incorrect or invalid data
   - Data points that violate domain-specific rules or constraints
   - For example, negative age values or invalid zip codes

6. Imbalanced data
   - Unequal representation of different classes or categories in the dataset
   - Can lead to biased models that perform poorly on underrepresented classes


Addressing these data quality issues through proper preprocessing and cleaning techniques is essential to ensure the reliability and effectiveness of the machine learning models built on top of the data.


In the following sections, we will explore various techniques and approaches to handle missing data, outliers, data transformation, imbalanced datasets, and more, using Python and popular data manipulation libraries such as NumPy and Pandas.

**Table of contents**<a id='toc0_'></a>    
- [Handling Missing Data](#toc1_)    
- [Handling Outliers](#toc2_)    
- [Data Transformation](#toc3_)    
- [Handling Duplicate Entries](#toc4_)    
- [Handling Incorrect or Invalid Data](#toc5_)    
- [Handling Imbalanced Data](#toc6_)    
- [Summary](#toc7_)    


## <a id='toc1_'></a>[Handling Missing Data](#toc0_)

Missing data is a common issue in real-world datasets. It occurs when certain features or observations have no recorded value. Identifying and handling missing data is crucial to ensure the quality and reliability of the machine learning models.


Identifying missing values:
- In Python, missing values are typically represented as `None`, `NaN` (Not a Number), or `NA` (Not Available, `pd.NA` in Pandas), depending on the data type and library used.
- Pandas provides the `isna()` method and its alias `isnull()` to identify missing values in a DataFrame.


In [1]:
import pandas as pd

In [2]:
# Create a sample DataFrame with missing values
df = pd.DataFrame(
    {
        'A': [1, 2, None, 4],
        'B': [5, None, 7, 8],
        'C': [9, 10, 11, None]
    }
)


In [3]:
# Identify missing values
df.isnull()

       A      B      C
0  False  False  False
1  False   True  False
2   True  False  False
3  False  False   True


Techniques for dealing with missing data:

1. Deletion
   - Listwise deletion (complete case analysis): Remove entire rows containing missing values.
   - Pairwise deletion (available case analysis): For each calculation, use every row that has values for the variables involved, so a row is excluded only from the analyses that need its missing value.


Example (Listwise deletion):

In [4]:
# Remove rows with missing values
df_listwise = df.dropna()
df_listwise


     A    B    C
0  1.0  5.0  9.0
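
Pairwise deletion has no single dedicated call in this notebook. The sketch below is one way to illustrate the idea, assuming an analysis that only needs columns `A` and `B`; the variable names are illustrative.

```python
import pandas as pd

# Same sample values as the DataFrame used above
df = pd.DataFrame(
    {
        'A': [1, 2, None, 4],
        'B': [5, None, 7, 8],
        'C': [9, 10, 11, None]
    }
)

# Listwise deletion keeps only rows that are complete in every column
n_listwise = len(df.dropna())

# Pairwise deletion for an analysis of A and B:
# drop a row only if A or B is missing, regardless of C
pair_ab = df[['A', 'B']].dropna()
n_pairwise_ab = len(pair_ab)

print("Rows available after listwise deletion:", n_listwise)
print("Rows available for the (A, B) pair:", n_pairwise_ab)
```

The pair-specific subset retains more rows than listwise deletion. Pandas' `DataFrame.corr()` follows the same idea, since it excludes missing values pair by pair.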


2. Imputation
   - Mean/Median imputation: Replace missing values with the mean or median of the available data for that feature.
   - Mode imputation: Replace missing values with the most frequent value (mode) of the available data for that feature.
   - KNN imputation: Use the K-Nearest Neighbors algorithm to estimate missing values based on similar instances.


Example (Mean imputation):

In [5]:
# Replace missing values with the mean of each column
df_mean_imputed = df.fillna(df.mean())
df_mean_imputed

          A         B     C
0       1.0       5.0   9.0
1       2.0  6.666667  10.0
2  2.333333       7.0  11.0
3       4.0       8.0  10.0
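
The other imputation strategies listed above can be sketched in a similar way. The example below assumes scikit-learn is installed and uses its `KNNImputer`; `n_neighbors=2` is an arbitrary choice for this tiny dataset, not a recommended default.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Same sample values as the DataFrame used above
df = pd.DataFrame(
    {
        'A': [1, 2, None, 4],
        'B': [5, None, 7, 8],
        'C': [9, 10, 11, None]
    }
)

# Median imputation: less sensitive to extreme values than the mean
df_median_imputed = df.fillna(df.median())

# KNN imputation: estimate each missing value from the most similar rows
knn_imputer = KNNImputer(n_neighbors=2)
df_knn_imputed = pd.DataFrame(
    knn_imputer.fit_transform(df), columns=df.columns
)

print(df_median_imputed)
print(df_knn_imputed)
```

For categorical columns, mode imputation is usually the more natural choice, since means and medians are not defined for unordered categories.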


When choosing a technique for handling missing data, consider the following:
- The amount and pattern of missing data (random or systematic)
- The nature of the data (numerical, categorical)
- The potential impact on the analysis and model performance
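
To gauge the amount of missing data before picking a technique, one option is to count missing values per column. This is a minimal sketch on the sample DataFrame from this section.

```python
import pandas as pd

# Same sample values as the DataFrame used above
df = pd.DataFrame(
    {
        'A': [1, 2, None, 4],
        'B': [5, None, 7, 8],
        'C': [9, 10, 11, None]
    }
)

# Number of missing values in each column
missing_count = df.isna().sum()

# Share of missing values in each column (mean of a boolean mask)
missing_share = df.isna().mean()

print(missing_count)
print(missing_share)
```

A column with a small share of missing values is often a reasonable candidate for imputation, while a column that is mostly empty may be better dropped. The cut-off is a judgment call that depends on the dataset.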


It's important to carefully evaluate the implications of each technique and choose the most appropriate approach based on the specific characteristics of your dataset and the requirements of your machine learning task.


In the next section, we will explore techniques for handling outliers in the data.

## <a id='toc2_'></a>[Handling Outliers](#toc0_)

Outliers are data points that significantly deviate from the majority of the data. They can be caused by measurement errors, data entry mistakes, or genuine extreme values. Outliers can have a significant impact on statistical analyses and machine learning models, leading to biased or skewed results.


Defining and detecting outliers:
- Outliers are typically defined based on a certain threshold or statistical measure, such as the interquartile range (IQR) or standard deviation.
- Common methods for detecting outliers include:
  - Box plots: Visualize the distribution of data and identify points outside the whiskers.
  - Z-score: Measure how many standard deviations a data point is from the mean.
  - Isolation Forest: An unsupervised learning algorithm that isolates data points with random splits; anomalies tend to be separated from the rest of the data in fewer splits than normal points.
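
The IQR rule mentioned above can be sketched as follows, using the same sample values as the examples in this section. The 1.5 multiplier is the conventional box-plot choice, not a fixed requirement.

```python
import numpy as np

# Sample dataset with one extreme value
data = np.array([92, 19, 101, 58, 1053, 91, 26, 78, 10, 13,
                 -40, 101, 86, 85, 15, 89, 89, 28, -5, 41])

# First and third quartiles, and the interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
iqr_outliers = data[(data < lower_fence) | (data > upper_fence)]
print("IQR outliers:", iqr_outliers)
```

Because quartiles are barely affected by a single extreme value, the IQR rule is more robust than thresholds based on the mean and standard deviation.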


Example (Detecting outliers using Z-score):

In [33]:
import numpy as np

# Create a sample dataset with outliers
data = np.array([92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85, 15, 89, 89, 28, -5, 41])

# Calculate the Z-score for each data point
z_scores = (data - np.mean(data)) / np.std(data)

# Define a threshold for outliers (e.g., |Z-score| > 3)
threshold = 3

# Identify outliers based on the threshold
outliers = data[np.abs(z_scores) > threshold]
print("Outliers:", outliers)

Outliers: [1053]
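
For comparison, the same data can be passed to scikit-learn's `IsolationForest`. This is a sketch under assumptions: `contamination=0.05` is a guess at the share of outliers in this sample, and `random_state=42` is only there to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Same sample dataset as in the Z-score example
data = np.array([92, 19, 101, 58, 1053, 91, 26, 78, 10, 13,
                 -40, 101, 86, 85, 15, 89, 89, 28, -5, 41])

# scikit-learn expects a 2D array of shape (n_samples, n_features)
X = data.reshape(-1, 1)

# contamination is the assumed proportion of outliers in the data
forest = IsolationForest(contamination=0.05, random_state=42)

# fit_predict returns -1 for anomalies and 1 for inliers
labels = forest.fit_predict(X)

flagged = data[labels == -1]
print("Isolation Forest outliers:", flagged)
```

Unlike the Z-score, this approach extends naturally to many features, at the cost of having to choose `contamination`.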


Techniques for handling outliers:

1. Deletion
   - Remove the outliers from the dataset if they are considered erroneous or irrelevant.
   - Be cautious when deleting outliers, as they may contain valuable information in some cases.

2. Transformation
   - Apply mathematical transformations to reduce the impact of outliers.
   - Common transformations include logarithmic, square root, or reciprocal transformations.


Example (Logarithmic transformation):

In [31]:
import numpy as np

# Create a sample dataset with outliers
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Apply logarithmic transformation (defined only for strictly positive values)
transformed_data = np.log(data)
print("Transformed data:", transformed_data)

Transformed data: [0.         0.69314718 1.09861229 1.38629436 1.60943791 1.79175947
 1.94591015 2.07944154 2.19722458 4.60517019]


3. Winsorization
   - Replace the outliers with the nearest "normal" value, such as the 5th or 95th percentile.
   - Unlike deletion, winsorization keeps every observation in the dataset while limiting the influence of extreme values on the overall analysis.


Example (Winsorization):

In [29]:
import numpy as np
import scipy.stats as stats

# Create a sample dataset with outliers
data = np.array([92, 19, 101, 58, 1053, 91, 26, 78, 10, 13, -40, 101, 86, 85, 15, 89, 89, 28, -5, 41])

# Perform winsorization (cap the lowest 5% and the highest 5% of the values)
winsorized_data = stats.mstats.winsorize(data, limits=[0.05, 0.05])
print("Winsorized data:", winsorized_data)

Winsorized data: [ 92  19 101  58 101  91  26  78  10  13  -5 101  86  85  15  89  89  28
  -5  41]
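
In the output above, `winsorize` replaced the single lowest and single highest value with their nearest remaining neighbors (-40 became -5, and 1053 became 101). A related approach, sketched below, clips values to interpolated percentiles with NumPy. It is an alternative capping technique, and its bounds differ from the `winsorize` result.

```python
import numpy as np

# Same sample dataset as in the winsorization example
data = np.array([92, 19, 101, 58, 1053, 91, 26, 78, 10, 13,
                 -40, 101, 86, 85, 15, 89, 89, 28, -5, 41])

# 5th and 95th percentiles (linearly interpolated by default)
low, high = np.percentile(data, [5, 95])

# Values below `low` or above `high` are pulled in to those bounds
clipped_data = np.clip(data, low, high)

print("Bounds:", low, high)
print("Clipped data:", clipped_data)
```

With only 20 observations the interpolated 95th percentile still sits well above the bulk of the data, so the choice between the two approaches matters most on small samples.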


When dealing with outliers, consider the following:
- Understand the nature and source of the outliers (errors, anomalies, or genuine extreme values).
- Assess the impact of outliers on your analysis and model performance.
- Choose an appropriate technique based on the characteristics of your data and the requirements of your machine learning task.


Remember that outliers can sometimes provide valuable insights and should not be blindly removed without careful consideration.


In the next section, we will explore data transformation techniques, including normalization, standardization, and encoding categorical variables.

## <a id='toc3_'></a>[Data Transformation](#toc0_)

Data transformation is the process of converting data from one format or structure into another. It is a crucial step in data preprocessing to ensure that the data is in a suitable format for analysis and machine learning algorithms. Common data transformation techniques include normalization, standardization, and encoding categorical variables.


1. Normalization and Standardization
   - Normalization scales the data to a specific range, typically between 0 and 1.
   - Standardization transforms the data to have zero mean and unit variance.
   - These techniques help to ensure that all features have similar scales and prevent certain features from dominating others.


Example (Min-Max Normalization):

In [36]:
%pip install scikit-learn


In [37]:
from sklearn.preprocessing import MinMaxScaler

# Create a sample dataset
data = [[1, 2], [3, 4], [5, 6]]

# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Normalize the data
normalized_data = scaler.fit_transform(data)
print("Normalized data:")
print(normalized_data)

Normalized data:
[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]
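
Standardization can be sketched the same way with scikit-learn's `StandardScaler`, which applies $z = (x - \mu) / \sigma$ to each column, where $\mu$ and $\sigma$ are the column mean and (population) standard deviation. The sample values are the same as in the normalization example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same sample dataset as in the Min-Max example
data = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)

# Create a StandardScaler object
scaler = StandardScaler()

# Standardize each column to zero mean and unit variance
standardized_data = scaler.fit_transform(data)

print("Standardized data:")
print(standardized_data)
print("Column means:", standardized_data.mean(axis=0))
print("Column standard deviations:", standardized_data.std(axis=0))
```

In a real workflow the scaler would typically be fitted on the training split only and then applied to the validation and test splits with `transform()`.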


2. Encoding Categorical Variables
   - Machine learning algorithms typically require numerical inputs.
   - Categorical variables need to be encoded into numerical representations.
   - Common encoding techniques include:
     - One-Hot Encoding: Creates binary dummy variables for each category.
     - Label Encoding: Assigns a unique numerical label to each category.
     - Ordinal Encoding: Assigns numerical labels based on the order or hierarchy of categories.


Example (One-Hot Encoding):

In [38]:
from sklearn.preprocessing import OneHotEncoder

# Create a sample dataset with categorical variables
data = [['red'], ['green'], ['blue'], ['red'], ['green']]

# Create a OneHotEncoder object
encoder = OneHotEncoder()

# Perform one-hot encoding
encoded_data = encoder.fit_transform(data).toarray()
print("Encoded data:")
print(encoded_data)

Encoded data:
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
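
When categories have a natural order, ordinal encoding preserves it. The sketch below uses scikit-learn's `OrdinalEncoder` on a hypothetical `low` / `medium` / `high` feature; the category names and their order are assumptions made for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature
sizes = [['low'], ['high'], ['medium'], ['low']]

# Pass the categories explicitly so the codes follow the intended order
# (without this, categories are sorted alphabetically)
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])

# Perform ordinal encoding: low -> 0, medium -> 1, high -> 2
encoded_sizes = encoder.fit_transform(sizes)
print("Encoded data:")
print(encoded_sizes.ravel())
```

Ordinal codes imply that `high` is greater than `medium`, which is appropriate here but would be misleading for unordered categories such as colors, where one-hot encoding is the safer choice.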


When applying data transformation techniques, consider the following:
- Choose the appropriate technique based on the nature of your data and the requirements of your machine learning algorithm.
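- Be cautious when applying normalization or standardization: fit the scaler on the training data only and reuse its parameters on validation and test data, since fitting on the full dataset leaks information from the held-out sets.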
- Consider the interpretability and domain-specific meaning of the transformed data.


Consistent and appropriate data transformation ensures that the data is in a suitable format for machine learning algorithms to process effectively.


In the next section, we will discuss techniques for identifying and handling duplicate entries in a dataset.

## <a id='toc4_'></a>[Handling Duplicate Entries](#toc0_)

Duplicate entries refer to repeated instances of the same data point in a dataset. Duplicate data can occur due to various reasons, such as data entry errors, data collection issues, or merging datasets from multiple sources. Identifying and handling duplicate entries is important to ensure data integrity and avoid biased or misleading results in data analysis and machine learning.


Identifying duplicate entries:
- In Python, you can use the `duplicated()` method of a Pandas DataFrame to identify duplicate rows.
- The `duplicated()` method returns a boolean Series indicating which rows are duplicates. By default, the first occurrence of each row is not marked as a duplicate.


Example:

In [39]:
import pandas as pd

# Create a sample DataFrame with duplicate entries
df = pd.DataFrame({'A': [1, 2, 3, 2, 4],
                   'B': [5, 6, 7, 6, 8],
                   'C': [9, 10, 11, 10, 12]})

# Identify duplicate rows
duplicate_rows = df.duplicated()
print("Duplicate rows:")
print(duplicate_rows)

Duplicate rows:
0    False
1    False
2    False
3     True
4    False
dtype: bool
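
Two other parameters of `duplicated()` are often useful when inspecting duplicates: `keep=False` flags every copy in a duplicate group, and `subset` restricts the comparison to selected columns. A short sketch on the same sample DataFrame:

```python
import pandas as pd

# Same sample DataFrame as above
df = pd.DataFrame({'A': [1, 2, 3, 2, 4],
                   'B': [5, 6, 7, 6, 8],
                   'C': [9, 10, 11, 10, 12]})

# keep=False marks all members of a duplicate group, including the first
all_copies = df.duplicated(keep=False)
print("All rows involved in duplication:")
print(df[all_copies])

# subset compares only the listed columns
dup_on_a = df.duplicated(subset=['A'])
print("Rows that repeat an earlier value of A:", dup_on_a.sum())
```

Viewing all copies side by side helps decide whether the rows are true duplicates or distinct records that happen to share values.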


Handling duplicate entries:
1. Removing duplicates
   - If duplicate entries are considered redundant or unnecessary, you can remove them from the dataset using the `drop_duplicates()` function in Pandas.
   - By default, `drop_duplicates()` keeps the first occurrence of each unique row and removes the subsequent duplicates.


Example:

In [40]:
# Remove duplicate rows
df_unique = df.drop_duplicates()
print("DataFrame after removing duplicates:")
print(df_unique)

DataFrame after removing duplicates:
   A  B   C
0  1  5   9
1  2  6  10
2  3  7  11
4  4  8  12


2. Keeping specific occurrences
   - In some cases, you may want to keep a specific occurrence of a duplicate entry, such as the last occurrence instead of the first.
   - You can specify the `keep` parameter in the `drop_duplicates()` function to control which occurrence to keep.


Example:

In [42]:
# Keep the last occurrence of duplicate rows
df_last_occurrence = df.drop_duplicates(keep='last')
print("DataFrame keeping the last occurrence of duplicates:")
df_last_occurrence

DataFrame keeping the last occurrence of duplicates:


   A  B   C
0  1  5   9
2  3  7  11
3  2  6  10
4  4  8  12


When handling duplicate entries, consider the following:
- Understand the source and nature of the duplicates in your dataset.
- Determine whether duplicates are truly redundant or if they carry meaningful information.
- Choose the appropriate approach (removing duplicates or keeping specific occurrences) based on your data and analysis requirements.


Handling duplicate entries ensures data cleanliness and integrity, leading to more accurate and reliable results in data analysis and machine learning tasks.


In the next section, we will discuss techniques for handling incorrect or invalid data points in a dataset.

## <a id='toc5_'></a>[Handling Incorrect or Invalid Data](#toc0_)

Incorrect or invalid data refers to data points that violate domain-specific rules, constraints, or logical boundaries. These data points can arise due to various reasons, such as data entry errors, measurement inaccuracies, or system glitches. Examples of incorrect or invalid data include negative age values, invalid zip codes, or temperatures outside the possible range. Identifying and handling such data points is crucial to maintain data quality and ensure the reliability of data analysis and machine learning models.


Identifying incorrect or invalid data:
- Domain knowledge plays a key role in identifying incorrect or invalid data points.
- Define specific rules or constraints based on the characteristics and requirements of your data.
- Use conditional statements or data validation libraries to check for violations of these rules.


Example:

In [58]:
import pandas as pd

# Create a sample DataFrame with incorrect or invalid data
df = pd.DataFrame({'Age': [25, -10, 30, 45],
                   'Zip_Code': ['12345', '98765', '12345', 'ABCDE'],
                   'Temperature': [25.5, 30.2, -5.0, 42.8]})

# Define rules for identifying incorrect or invalid data
def is_valid_age(age):
    return age >= 0

def is_valid_zip_code(zip_code):
    return zip_code.isdigit() and len(zip_code) == 5

def is_valid_temperature(temperature):
    # Example domain rule: accept readings between 0 and 50 (inclusive)
    return 0 <= temperature <= 50.0

# Identify incorrect or invalid data points
invalid_age = df['Age'].apply(lambda x: not is_valid_age(x))
invalid_zip_code = df['Zip_Code'].apply(lambda x: not is_valid_zip_code(x))
invalid_temperature = df['Temperature'].apply(lambda x: not is_valid_temperature(x))

In [59]:
print("Invalid age values:")
print(df[invalid_age])

Invalid age values:
   Age Zip_Code  Temperature
1  -10    98765         30.2


In [60]:
print("Invalid zip codes:")
print(df[invalid_zip_code])


Invalid zip codes:
   Age Zip_Code  Temperature
3   45    ABCDE         42.8


In [61]:
print("Invalid temperature values:")
print(df[invalid_temperature])

Invalid temperature values:
   Age Zip_Code  Temperature
2   30    12345         -5.0
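
The same rules can also be written as vectorized Pandas expressions, which avoids calling a Python function per row. This is a sketch using the same sample data and the same example thresholds as above; the `valid` DataFrame is an illustrative helper.

```python
import pandas as pd

# Same sample DataFrame as above
df = pd.DataFrame({'Age': [25, -10, 30, 45],
                   'Zip_Code': ['12345', '98765', '12345', 'ABCDE'],
                   'Temperature': [25.5, 30.2, -5.0, 42.8]})

# One boolean column per rule; True means the value passes the check
valid = pd.DataFrame({
    'Age': df['Age'] >= 0,
    'Zip_Code': df['Zip_Code'].str.fullmatch(r'\d{5}'),
    'Temperature': df['Temperature'].between(0, 50.0),
})

# A row is invalid if it breaks at least one rule
invalid_rows = df[~valid.all(axis=1)]
print("Rows with at least one invalid value:")
print(invalid_rows)
```

Keeping one column per rule makes it easy to see which check a row failed, not only that it failed.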


Handling incorrect or invalid data:
1. Data correction
   - If the incorrect or invalid data points are identifiable and can be corrected based on domain knowledge or additional information, you can update the values accordingly.
   - This approach is suitable when the correct values are known or can be inferred from other reliable sources.

2. Data removal
   - If the incorrect or invalid data points are rare and their removal does not significantly impact the analysis, you can choose to remove those data points from the dataset.
   - Be cautious when removing data points, as it may introduce bias or loss of information.

3. Data imputation
   - In some cases, you can treat incorrect or invalid data points as missing values and apply appropriate imputation techniques, such as mean, median, or mode imputation.
   - Imputation allows you to retain the data points while replacing the incorrect or invalid values with estimated values.
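
Removal and imputation can be sketched for the `Age` column as follows. The choice of the median as the replacement value is an assumption for illustration; correction is not shown because it depends on having a trusted source for the true values.

```python
import pandas as pd

# Same sample DataFrame as in the previous example
df = pd.DataFrame({'Age': [25, -10, 30, 45],
                   'Zip_Code': ['12345', '98765', '12345', 'ABCDE'],
                   'Temperature': [25.5, 30.2, -5.0, 42.8]})

# Data removal: keep only rows with a valid age
df_removed = df[df['Age'] >= 0]

# Data imputation: mask invalid ages as missing,
# then fill them with the median of the valid ages
age_masked = df['Age'].where(df['Age'] >= 0)
df_imputed = df.assign(Age=age_masked.fillna(age_masked.median()))

print(df_removed)
print(df_imputed)
```

Computing the median after masking matters: including the invalid value -10 in the calculation would distort the replacement.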


When handling incorrect or invalid data, consider the following:
- Establish clear rules and constraints to identify incorrect or invalid data points based on domain knowledge.
- Determine the appropriate handling technique (correction, removal, or imputation) based on the nature of the data and the impact on the analysis.
- Document the steps taken to handle incorrect or invalid data to ensure transparency and reproducibility.


Handling incorrect or invalid data is an important aspect of data preprocessing to ensure the quality and reliability of the data used for analysis and machine learning tasks.


In the next section, we will discuss techniques for handling imbalanced datasets, where the distribution of classes or categories is uneven.

## <a id='toc6_'></a>[Handling Imbalanced Data](#toc0_)

Imbalanced data refers to datasets where the distribution of classes or categories is uneven, with some classes having significantly more instances than others. Imbalanced datasets are common in various domains, such as fraud detection, medical diagnosis, or customer churn prediction, where the minority class is often the class of interest. When training machine learning models on imbalanced data, the models may be biased towards the majority class and perform poorly on the underrepresented minority class.


Understanding imbalanced datasets:
- Imbalanced datasets can be characterized by the class distribution ratio, which is the ratio of the number of instances in the majority class to the number of instances in the minority class.
- Imbalanced datasets pose challenges for machine learning algorithms, as they tend to optimize for overall accuracy, which may not be suitable when the minority class is of primary importance.
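
The class distribution ratio described above can be computed directly from the labels. This is a minimal sketch using the small label list that also appears in the resampling examples of this section.

```python
from collections import Counter

# Labels of a small imbalanced dataset
y = [0, 0, 0, 1, 1]

# Number of instances per class
class_counts = Counter(y)

# Ratio of the largest class to the smallest class
majority = max(class_counts.values())
minority = min(class_counts.values())
imbalance_ratio = majority / minority

print("Class counts:", dict(class_counts))
print("Imbalance ratio:", imbalance_ratio)
```

A ratio close to 1 indicates a balanced dataset; real-world problems such as fraud detection can have ratios in the hundreds or thousands.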


Techniques for handling imbalanced data:

1. Oversampling
   - Oversampling involves increasing the number of instances in the minority class to balance the class distribution.
   - Random oversampling duplicates existing minority class instances randomly.
   - Synthetic Minority Over-sampling Technique (SMOTE) generates synthetic examples by interpolating between existing minority class instances.


Example (Random Oversampling):

In [64]:
%pip install imbalanced-learn

Note: you may need to restart the kernel to use updated packages.


In [65]:
from imblearn.over_sampling import RandomOverSampler

# Create a sample imbalanced dataset
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [0, 0, 0, 1, 1]  # Minority class: 1, Majority class: 0

# Create a RandomOverSampler object
oversampler = RandomOverSampler(random_state=42)

# Perform random oversampling
X_resampled, y_resampled = oversampler.fit_resample(X, y)

print("Resampled dataset:")
print(X_resampled)
print(y_resampled)

Resampled dataset:
[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [7, 8]]
[0, 0, 0, 1, 1, 1]


2. Undersampling
   - Undersampling involves reducing the number of instances in the majority class to balance the class distribution.
   - Random undersampling randomly removes instances from the majority class.
   - Cluster Centroids undersampling replaces majority class instances with centroids of clusters formed by majority class instances.


Example (Random Undersampling):

In [66]:
from imblearn.under_sampling import RandomUnderSampler

# Create a sample imbalanced dataset
X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
y = [0, 0, 0, 1, 1]  # Minority class: 1, Majority class: 0

# Create a RandomUnderSampler object
undersampler = RandomUnderSampler(random_state=42)

# Perform random undersampling
X_resampled, y_resampled = undersampler.fit_resample(X, y)

print("Resampled dataset:")
print(X_resampled)
print(y_resampled)

Resampled dataset:
[[1, 2], [3, 4], [7, 8], [9, 10]]
[0, 0, 1, 1]


3. Hybrid methods
   - Hybrid methods combine oversampling and undersampling techniques to balance the class distribution.
   - SMOTE + Tomek Links applies SMOTE oversampling followed by Tomek Links undersampling to remove overlapping instances.
   - SMOTE + ENN (Edited Nearest Neighbors) applies SMOTE oversampling followed by ENN undersampling to remove instances that are misclassified by their nearest neighbors.


When handling imbalanced data, consider the following:
- Evaluate the class distribution and determine the extent of imbalance in your dataset.
- Choose an appropriate technique (oversampling, undersampling, or hybrid) based on the characteristics of your data and the specific requirements of your problem.
- Use appropriate evaluation metrics, such as precision, recall, F1-score, or area under the precision-recall curve (AUPRC), which are more suitable for imbalanced datasets than accuracy.
- Consider using ensemble methods or cost-sensitive learning approaches that can handle imbalanced data effectively.
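
The following sketch shows why accuracy alone can mislead on imbalanced data. It uses hypothetical labels and a hypothetical model that always predicts the majority class.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical ground truth: 9 majority samples and 1 minority sample
y_true = [0] * 9 + [1]

# Hypothetical model output: always predicts the majority class
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)

# zero_division=0 returns 0 instead of warning when a metric is undefined
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
```

The accuracy looks high even though the model never detects the minority class, which the recall and F1-score make visible. For cost-sensitive learning, many scikit-learn classifiers accept `class_weight='balanced'` as a starting point.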


Handling imbalanced data is crucial to ensure that machine learning models are not biased towards the majority class and can effectively identify and classify instances of the minority class, which is often the class of interest.


The next section summarizes the key points of this lecture.

## <a id='toc7_'></a>[Summary](#toc0_)

In this lecture, we explored the importance of data preprocessing and cleaning in machine learning. We discussed various techniques to handle common data quality issues and prepare the data for effective analysis and modeling.


Key takeaways from the lecture:
- Data preprocessing and cleaning are crucial steps in the machine learning pipeline to ensure data quality, consistency, and reliability.
- Identifying and handling missing data is essential, and techniques such as deletion and imputation can be used to address missing values.
- Outliers are data points that significantly deviate from the majority of the data and can be handled using techniques like deletion, transformation, or winsorization.
- Data transformation techniques, including normalization, standardization, and encoding categorical variables, help prepare the data for machine learning algorithms.
- Handling imbalanced datasets is important to prevent biased models and ensure effective performance on minority classes. Techniques such as oversampling, undersampling, and hybrid methods can be used to balance the class distribution.
- Duplicate entries and incorrect or invalid values should be identified, for example with `duplicated()` and rule-based validation, and then removed, corrected, or imputed depending on their source.


Throughout the lecture, we provided hands-on examples using Python and popular libraries such as NumPy, Pandas, SciPy, Scikit-learn, and imbalanced-learn to demonstrate the practical application of data preprocessing techniques.