# Preprocessing Techniques for Numerical Variables

Preprocessing numerical variables is a critical step in the data preparation phase of a Machine Learning project. Here are various techniques used for this purpose:

## 1. Normalization
- **What It Is**: Scaling all numerical variables to a fixed range, typically between 0 and 1.
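- **How It's Done**: In its simplest form, divide each value by the maximum value of that feature. This maps the data into [0, 1] only when all values are non-negative; Min-Max normalization (section 1.2) handles the general case.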
- **When to Use**: When you have variables with different scales and need to bring them to a common scale without distorting differences in the ranges.
- **Example**: Rescaling annual incomes in a dataset ranging from $30,000 to $100,000 to a scale of 0 to 1.
- **Applied Technique**: Normalize by dividing each income by $100,000.

### 1.1. Decimal Scaling Normalization 
- **What It Is**: Decimal scaling is a data normalization technique where we move the decimal point of values of a feature. This process changes the scale of the values but not the relationship between the values in the feature.
- **How It's Done**: The number of decimal places moved depends on the maximum absolute value of the feature. If the integer part of the maximum absolute value has d digits, every value is divided by 10^d, the smallest power of 10 that brings all absolute values below 1.
- **When to Use**: To scale the feature values so that they fall between -1 and 1.
- **Example**:
  - **Data**: Consider a set of values [1234, 5678, 910].
  - **Calculation**: The maximum absolute value is 5678 (a 4-digit number), so we divide each value by 10^4.
  - **Result**: The scaled values are [0.1234, 0.5678, 0.0910].
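
The calculation above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library routine: the function name `decimal_scale` is hypothetical, and the handling of all-zero input and of data that is already below 1 is an assumption made for the sketch.

```python
import numpy as np

def decimal_scale(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    values = np.asarray(values, dtype=float)
    max_abs = np.max(np.abs(values))
    if max_abs == 0:
        return values  # all zeros: nothing to scale (assumed behaviour)
    # Number of digits in the integer part of the largest magnitude
    d = int(np.floor(np.log10(max_abs))) + 1
    d = max(d, 0)  # data already below 1 is left unchanged (assumed behaviour)
    return values / 10 ** d

scaled = decimal_scale([1234, 5678, 910])
print(scaled)  # values 0.1234, 0.5678, 0.091
```

Because every value is divided by the same power of 10, the ratios between values are preserved.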

### 1.2. Min-Max Normalization 

- **What It Is**: Min-Max Normalization rescales a numerical feature linearly so that its minimum maps to 0 and its maximum maps to 1.
- **How It's Done**: This technique transforms features to a common scale by subtracting the minimum value of the feature and then dividing by the range of the feature (maximum value - minimum value).
- **Formula**: \[X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}\]
- **When to Use**: To scale the feature values so that they fall within a specified range, typically 0 to 1.
- **Example**:
  - **Data**: Consider a set of values [100, 200, 300].
  - **Calculation**:
    - Minimum value, X_{min} = 100
    - Maximum value, X_{max} = 300
    - Normalized values = [(100-100)/(300-100), (200-100)/(300-100), (300-100)/(300-100)]
  - **Result**: The normalized values are [0, 0.5, 1].

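This technique puts all features on a comparable scale, so that no feature dominates purely because of its units, and it often speeds up convergence in algorithms that use gradient descent. Because the minimum and maximum define the scale, it is sensitive to outliers.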
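
As a rough sketch, the formula can be extended to any target range by stretching the 0-to-1 result. The function name `min_max_scale` and its `new_min`/`new_max` parameters are hypothetical, and mapping a constant feature to the lower bound is an assumption chosen to avoid division by zero.

```python
import numpy as np

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values so that min -> new_min and max -> new_max."""
    values = np.asarray(values, dtype=float)
    x_min, x_max = values.min(), values.max()
    if x_max == x_min:
        # A constant feature has zero range; map it to the lower bound (assumption)
        return np.full_like(values, new_min)
    unit = (values - x_min) / (x_max - x_min)    # the formula above, range 0 to 1
    return unit * (new_max - new_min) + new_min  # stretch to the requested range

unit_range = min_max_scale([100, 200, 300])         # values 0, 0.5, 1
symmetric = min_max_scale([100, 200, 300], -1, 1)   # values -1, 0, 1
print(unit_range, symmetric)
```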


## 2. Standardization
- **What It Is**: Rescaling data to have a mean (μ) of 0 and a standard deviation (σ) of 1.
- **How It's Done**: Subtract the mean and divide by the standard deviation.
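- **When to Use**: Useful for algorithms that are sensitive to feature scale, such as Support Vector Machines, k-Nearest Neighbors, and models trained with gradient descent. It does not require the data to be normally distributed, and it does not make the data normal.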
- **Example**: Adjusting test scores from different classes to a common scale where the mean is 0 and the standard deviation is 1.
- **Applied Technique**: Subtract the class mean from each score and divide by the class standard deviation.

## 3. Log Transformation
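- **What It Is**: Applying the natural logarithm (or a logarithm of another base) to the data.
- **How It's Done**: Replace each value x with log(x). Values must be strictly positive; log(1 + x) is a common variant when zeros are present.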
- **When to Use**: When dealing with skewed data or when you want to stabilize the variance.
- **Example**: Applying log transformation to highly skewed real estate prices to reduce the impact of very high values.
- **Applied Technique**: Apply the natural logarithm (log) to each price value.

## 4. Clipping
- **What It Is**: Capping the values in a dataset to a defined minimum or maximum.
- **How It's Done**: Set thresholds and cap values exceeding these thresholds.
- **When to Use**: To limit the effect of outliers that might skew the analysis.
- **Example**: Capping credit card transactions at a certain threshold to minimize the impact of outliers.
- **Applied Technique**: Replace all transactions above $10,000 with $10,000.

## 5. Binning
- **What It Is**: Converting continuous data into categorical data by grouping values into intervals.
- **How It's Done**: Divide the range of the data into bins and assign each value to a bin.
- **When to Use**: To simplify the model or when the exact value is not as important as the range.
- **Example**: Converting continuous age data into categories like '18-30', '31-45', '46-60', etc.
- **Applied Technique**: Define bins (18-30, 31-45, 46-60) and categorize each age into these bins.

## 6. Feature Scaling
- **What It Is**: The general term for adjusting features so that they share a comparable scale or range.
- **How It's Done**: With any of the methods above, such as Min-Max scaling (normalization) or standardization.
- **When to Use**: When the algorithm is sensitive to the scale of the data, like Gradient Descent based algorithms.
- **Example**: Scaling distance traveled and calories burned in a fitness app to a common scale for analysis.
- **Applied Technique**: Use Min-Max scaling to bring both variables into a range of 0 to 1.

## 7. Imputation
- **What It Is**: The process of replacing missing values in a dataset.
- **How It's Done**: Common methods include using the mean, median, or mode of the column, or more complex methods like using predictions from other parts of the data.
- **When to Use**: Essential when you have missing data in your dataset. The choice of imputation method depends on the nature of the data and the extent of missingness.
- **Example**: Filling in missing values in a survey dataset using the median or mean of the respective variable.
- **Applied Technique**: Calculate the median of the variable and replace missing values with this median.





```python
# Worked examples for each preprocessing technique, with printed results

import numpy as np


def min_max_scaler(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    return (values - values.min()) / (values.max() - values.min())


# 1. Normalization Example: divide by the maximum income (here, 100000)
incomes = np.array([30000, 50000, 100000])
normalized_incomes = incomes / incomes.max()
print(f"Original Incomes: {incomes}\nNormalized Incomes: {normalized_incomes}")

# 2. Standardization Example (np.std defaults to the population standard deviation)
test_scores = np.array([70, 80, 90])
standardized_scores = (test_scores - np.mean(test_scores)) / np.std(test_scores)
print(f"Original Test Scores: {test_scores}\nStandardized Scores: {standardized_scores}")

# 3. Log Transformation Example (natural logarithm; values must be positive)
real_estate_prices = np.array([100000, 200000, 1000000])
log_transformed_prices = np.log(real_estate_prices)
print(f"Original Real Estate Prices: {real_estate_prices}\nLog Transformed Prices: {log_transformed_prices}")

# 4. Clipping Example: cap at 10000, no lower bound
transactions = np.array([5000, 15000, 8000])
clipped_transactions = np.clip(transactions, None, 10000)
print(f"Original Transactions: {transactions}\nClipped Transactions: {clipped_transactions}")

# 5. Binning Example: right=True makes each bin include its upper edge
ages = np.array([25, 35, 55])
age_bins = [0, 30, 45, 60]
binned_ages = np.digitize(ages, age_bins, right=True)
print(f"Original Ages: {ages}\nBinned Ages (Categories 1: <=30, 2: 31-45, 3: 46-60): {binned_ages}")

# 6. Feature Scaling Example: Min-Max scaling of two features with different units
distance = np.array([5, 10, 15])  # in km
calories = np.array([100, 200, 300])
scaled_distance = min_max_scaler(distance)
scaled_calories = min_max_scaler(calories)
print(f"Original Distance (km): {distance}\nScaled Distance: {scaled_distance}")
print(f"Original Calories: {calories}\nScaled Calories: {scaled_calories}")

# 7. Imputation Example: replace NaN with the median of the observed values
survey_data = np.array([1, np.nan, 3, 4, 5])
median_value = np.nanmedian(survey_data)
imputed_data = np.where(np.isnan(survey_data), median_value, survey_data)
print(f"Original Survey Data: {survey_data}\nImputed Survey Data (using median): {imputed_data}")
```

```text
Original Incomes: [ 30000  50000 100000]
Normalized Incomes: [0.3 0.5 1. ]
Original Test Scores: [70 80 90]
Standardized Scores: [-1.22474487  0.          1.22474487]
Original Real Estate Prices: [ 100000  200000 1000000]
Log Transformed Prices: [11.51292546 12.20607265 13.81551056]
Original Transactions: [ 5000 15000  8000]
Clipped Transactions: [ 5000 10000  8000]
Original Ages: [25 35 55]
Binned Ages (Categories 1: <=30, 2: 31-45, 3: 46-60): [1 2 3]
Original Distance (km): [ 5 10 15]
Scaled Distance: [0.  0.5 1. ]
Original Calories: [100 200 300]
Scaled Calories: [0.  0.5 1. ]
Original Survey Data: [ 1. nan  3.  4.  5.]
Imputed Survey Data (using median): [1.  3.5 3.  4.  5. ]
```


# Types of Missing Data

Understanding the nature of missing data is crucial for effective data analysis and preprocessing. Here are the common types of missing data:

## 1. Missing Completely at Random (MCAR)
- **Characteristics**: The missingness is independent of any other data, both observed and unobserved. The reasons for the missing data are completely random.
- **Implication**: Analysis is less likely to be biased because the missing data points do not represent a specific trend.
- **Example**: A respondent accidentally skipping a question in a survey.

## 2. Missing at Random (MAR)
- **Characteristics**: The propensity for a data point to be missing is related to some observed data in the dataset, but not to the missing data itself.
- **Implication**: The missing data can be handled by considering the observed data, as long as the relationship is correctly modeled.
- **Example**: Younger respondents are less likely to disclose their income in a survey. Age is observed for everyone, and income is the variable that is missing.

## 3. Missing Not at Random (MNAR)
- **Characteristics**: The missingness is related to the unobserved data, meaning the reason for the missing data is related to the missing data itself.
- **Implication**: This type is challenging to handle as it can lead to biases if not addressed properly.
- **Example**: Higher-income individuals less likely to disclose their income.

## 4. Structurally Missing Data
- **Characteristics**: The data is missing because it is not applicable or does not exist due to the nature of the data itself.
- **Implication**: This type of missing data does not necessarily indicate a problem with the data collection process but reflects the nature of the data.
- **Example**: The absence of a 'spouse's name' field for a single individual in a dataset.

Each type of missing data has its implications on the analysis and requires specific strategies for handling.
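
To make the first three mechanisms concrete, the following sketch simulates them on synthetic data. Everything here is an illustrative assumption: the `age` and `income` columns, the distributions, the thresholds, and the missingness probabilities are invented for the example and do not come from a real survey.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.integers(18, 80, size=n)            # always observed
income = rng.normal(50_000, 15_000, size=n)   # the variable that may go missing

# MCAR: every income has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends only on the observed age column
mar_mask = rng.random(n) < np.where(age < 30, 0.4, 0.1)

# MNAR: missingness depends on the income value itself
mnar_mask = rng.random(n) < np.where(income > 60_000, 0.4, 0.1)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = income[~mask].mean()
    print(f"{name}: {mask.mean():.0%} missing, observed mean income = {observed_mean:,.0f}")
```

In this simulation age and income are generated independently, so only the MNAR case pulls the observed mean income systematically below the true mean, because high incomes are the ones removed.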


# Preprocessing Techniques for Categorical Variables

Categorical data requires specific preprocessing techniques to be effectively used in machine learning models. Here are some common methods:

## 1. Label Encoding
- **How It's Done**: Assign each unique category a numerical value.
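- **When to Use**: For target labels, or for ordinal features when the integer codes follow the categories' natural order. On nominal features the codes suggest an order that does not exist.
- **Example**: 'Low', 'Medium', 'High' might be encoded as 1, 2, 3.
- **Applied Technique**: Use a manual mapping when the order matters. `LabelEncoder` from `sklearn.preprocessing` is intended for target labels and assigns codes in sorted order, not in the categories' natural order.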
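
One caveat is worth seeing in code. scikit-learn's `LabelEncoder` numbers the classes in sorted order, so the codes do not necessarily follow the natural order of the categories. The plain-Python sketch below mimics that behaviour on a small hypothetical list; it is an illustration, not the library's implementation.

```python
levels = ["Low", "High", "Medium", "Low", "High"]

# Sort the unique categories and number them from 0, which mirrors the
# sorted ordering that LabelEncoder applies to its classes.
classes = sorted(set(levels))
code_of = {label: code for code, label in enumerate(classes)}
encoded = [code_of[label] for label in levels]

print(code_of)   # {'High': 0, 'Low': 1, 'Medium': 2}
print(encoded)   # [1, 0, 2, 1, 0]
```

Here 'High' receives the smallest code and 'Medium' the largest, so a model that reads the codes as magnitudes would see the wrong ranking.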

## 2. One-Hot Encoding
- **How It's Done**: Create new columns indicating the presence of each category with binary values.
- **When to Use**: For nominal data where no ordinal relationship exists.
- **Example**: 'Red', 'Green', 'Blue' becomes three columns with 1s and 0s.
- **Applied Technique**: Use `get_dummies` from Pandas or `OneHotEncoder` from `sklearn.preprocessing`.
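
A short sketch with `pd.get_dummies`, assuming pandas is installed. The `color` column and its values are made up for the example, and `dtype=int` is passed only so the result prints as 0/1 instead of booleans.

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Green", "Blue", "Green"]})

# One new column per category; exactly one of them is 1 in each row
one_hot = pd.get_dummies(df, columns=["color"], dtype=int)
print(one_hot)
```

The resulting columns are `color_Blue`, `color_Green`, and `color_Red`.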

## 3. Binary Encoding
- **How It's Done**: Convert categories to integers and then convert those integers into binary numbers.
- **When to Use**: With high cardinality features to avoid dimensionality issues.
- **Example**: Categories 1, 2, 3 might be encoded into 01, 10, 11.
- **Applied Technique**: Use `BinaryEncoder` from the `category_encoders` library.
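
The two steps can be illustrated by hand in plain Python. This is a simplified sketch of the idea and not the `category_encoders` implementation; the city names, the helper `to_bits`, and the choice to number categories in sorted order starting at 1 are assumptions.

```python
categories = ["Paris", "Tokyo", "Lima", "Cairo", "Oslo"]

# Step 1: assign each distinct category an integer, starting at 1
code_of = {cat: i for i, cat in enumerate(sorted(set(categories)), start=1)}

# Step 2: work out how many bits the largest integer needs
n_bits = max(code_of.values()).bit_length()

def to_bits(code, width):
    """Write an integer as a fixed-width list of binary digits."""
    return [int(b) for b in format(code, f"0{width}b")]

encoded = {cat: to_bits(code, n_bits) for cat, code in code_of.items()}
for cat, bits in encoded.items():
    print(cat, bits)
```

Five categories fit into three binary columns here, whereas one-hot encoding would need five.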

## 4. Frequency or Count Encoding
- **How It's Done**: Replace categories with their frequencies or count in the dataset.
- **When to Use**: When the frequency of occurrence is informative.
- **Example**: If 'Apple' appears 50 times, it's replaced with 50.
- **Applied Technique**: Manually calculate frequencies or use `value_counts` in Pandas.
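
A minimal sketch using `value_counts` and `map`, assuming pandas is installed. The `fruit` series is a made-up example.

```python
import pandas as pd

fruit = pd.Series(["Apple", "Banana", "Apple", "Cherry", "Apple", "Banana"])

# Count encoding: replace each category with how often it occurs
count_encoded = fruit.map(fruit.value_counts())

# Frequency encoding: the same idea, expressed as a share of all rows
freq_encoded = fruit.map(fruit.value_counts(normalize=True))

print(count_encoded.tolist())  # [3, 2, 3, 1, 3, 2]
print(freq_encoded.round(3).tolist())
```

Note that two categories with the same count become indistinguishable after this encoding.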

## 5. Ordinal Encoding
- **How It's Done**: Assign numbers to categories in a way that respects their order.
- **When to Use**: For ordinal data with a known ranking.
- **Example**: 'Bad', 'Good', 'Excellent' might be encoded as 1, 2, 3.
- **Applied Technique**: Use `OrdinalEncoder` from `sklearn.preprocessing`, passing the ranking through its `categories` parameter; by default it sorts the categories instead of using their ranking.
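
As a sketch of encoding with an explicit ranking, the example below uses an ordered `pd.Categorical` as a stand-in for `OrdinalEncoder`, assuming pandas is installed. The `ratings` data is hypothetical, and shifting the codes by 1 is done only to match the 1, 2, 3 numbering in the example above.

```python
import pandas as pd

ratings = pd.Series(["Good", "Bad", "Excellent", "Good"])
order = ["Bad", "Good", "Excellent"]   # the known ranking, lowest first

cat = pd.Categorical(ratings, categories=order, ordered=True)
encoded = cat.codes + 1                # codes start at 0; shift to 1, 2, 3

print(encoded.tolist())  # [2, 1, 3, 2]
```

A value that is not listed in `order` receives the code -1 before the shift, so unexpected categories are easy to spot.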

## 6. Mean Encoding (or Target Encoding)
- **How It's Done**: Replace categories with the mean of the target variable for that category.
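- **When to Use**: When the category is strongly related to the target variable. Compute the category means on the training data only; otherwise information from the target leaks into the features.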
- **Example**: Replace 'Car Brand' with the average sale price of each brand.
- **Applied Technique**: Use `TargetEncoder` from the `category_encoders` library.
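
The idea can be sketched with a pandas `groupby`, assuming pandas is installed. The brands and prices are invented, and falling back to the global mean for unseen categories is one common choice, shown here as an assumption. Library encoders typically add smoothing, which this sketch omits.

```python
import pandas as pd

train = pd.DataFrame({
    "brand": ["A", "A", "B", "B", "B", "C"],
    "price": [20_000, 22_000, 30_000, 34_000, 32_000, 15_000],
})

# Learn the mapping from the training data only
brand_means = train.groupby("brand")["price"].mean()
global_mean = train["price"].mean()
train["brand_encoded"] = train["brand"].map(brand_means)

# Apply the learned mapping to new rows; unseen brands get the global mean
new = pd.DataFrame({"brand": ["B", "D"]})
new["brand_encoded"] = new["brand"].map(brand_means).fillna(global_mean)

print(train)
print(new)
```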

## 7. Hashing
- **How It's Done**: Use a hashing function to convert categories into numerical values.
- **When to Use**: For large datasets and high-cardinality features.
- **Example**: Hash 'City Name' into a fixed number of numerical columns, regardless of how many distinct cities exist.
- **Applied Technique**: Use `HashingEncoder` from the `category_encoders` library.
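
The core of the hashing trick fits in a few lines of standard-library Python. This sketch maps each category to a single bucket index, which is a simplification of what `HashingEncoder` produces; the helper `hash_bucket` and the bucket count of 8 are assumptions. `hashlib` is used because Python's built-in `hash()` for strings changes between interpreter runs.

```python
import hashlib

def hash_bucket(category, n_buckets=8):
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

cities = ["Paris", "Tokyo", "Lima", "Paris"]
buckets = [hash_bucket(city) for city in cities]
print(buckets)
```

The same city always lands in the same bucket, and no lookup table has to be stored. The trade-off is that different cities can collide in one bucket.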

## 8. Dummy Encoding
- **How It's Done**: Similar to one-hot encoding but creates N-1 binary columns for N categories.
- **When to Use**: To avoid multicollinearity in linear models.
- **Example**: For 'Red', 'Green', 'Blue', create two columns instead of three.
- **Applied Technique**: Use `pd.get_dummies(df, drop_first=True)` in Pandas.
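
A brief sketch of the N-1 encoding, assuming pandas is installed. The `color` column is a made-up example and `dtype=int` is used only for readable 0/1 output.

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Green", "Blue", "Green"]})

# drop_first=True removes the first category (here 'Blue'), which becomes
# the baseline represented by zeros in all remaining columns
dummies = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)
print(dummies)
```

A row with 0 in both `color_Green` and `color_Red` is therefore a 'Blue' row.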

Each technique offers unique advantages and can be selected based on the model's requirements and the data's characteristics.


# Train-Test Split in Machine Learning

The Train-Test Split is a fundamental step in the machine learning workflow, essential for evaluating model performance.

## Overview
- **Purpose**: To divide the dataset into two parts: one for training the model (training set) and one for testing the model's performance (test set).

## Training Set
- **Usage**: This subset is used to train or fit the machine learning model. It includes both the input data and the expected output (labels).
- **Proportion**: Typically constitutes 70-80% of the dataset.

## Test Set
- **Usage**: Used for evaluating the performance of the model, ensuring that it can generalize well to new, unseen data.
- **Proportion**: Commonly makes up the remaining 20-30% of the dataset.
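
A random 80/20 split can be sketched with NumPy alone. The toy arrays `X` and `y`, the seed, and the 20% test share are assumptions for illustration; in practice a helper such as scikit-learn's `train_test_split` is commonly used instead.

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(20).reshape(10, 2)               # 10 samples, 2 features
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])   # labels

# Shuffle the row indices once, then cut them into test and train parts
indices = rng.permutation(len(X))
n_test = int(round(0.2 * len(X)))
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Indexing `X` and `y` with the same index arrays keeps every sample paired with its label.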

## Importance
- **Model Generalization**: The separation into training and test sets allows for an unbiased evaluation of how well the model performs on new data.
- **Detection of Overfitting**: A large gap between training and test performance reveals overfitting, where the model performs well on the training data but poorly on unseen data.

## Splitting Methods
- **Random Splitting**: The data is divided randomly into training and test sets.
- **Stratified Splitting**: Used especially for imbalanced datasets to maintain the same proportion of classes in both sets as in the original dataset.
- **Cross-Validation**: Involves multiple train-test splits, providing a more robust model evaluation, especially for smaller datasets.
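
To show what "multiple train-test splits" means in cross-validation, here is a small k-fold index generator written with NumPy. The function name `k_fold_indices`, the seed, and the fold count are hypothetical choices for this sketch; it does not stratify by class.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(10, k=5))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(fold, sorted(test_idx.tolist()))
```

Averaging the model's score over the k test folds gives a more stable estimate than a single split.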

Understanding and implementing the Train-Test Split correctly is crucial for the development of effective and reliable machine learning models.
