# Data Preprocessing
**Data preprocessing** is an essential step in machine learning: it prepares and transforms raw data into a clean, structured, and consistent format suitable for training models.

The typical steps of data preprocessing in machine learning include the following:

- Data Collection: Acquiring the relevant dataset for the problem at hand.

- Data Cleaning: Handling missing values, removing or correcting errors, and addressing outliers to ensure data quality.

- Data Integration: Combining data from multiple sources if applicable.

- Data Transformation: Normalizing, scaling, encoding categorical variables, and other transformations to convert data into a suitable format.

- Data Reduction: Reducing dimensionality or selecting important features to simplify the dataset.

- Handling Imbalanced Data: Addressing class imbalance if present for classification tasks.

- Data Splitting: Dividing the dataset into training, validation, and test subsets for model evaluation.

These steps ensure the dataset is clean, consistent, and well-prepared for modeling, improving the performance and reliability of machine learning models.
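The data splitting step from the list above can be sketched as follows. This is a minimal illustration on hypothetical toy data (the arrays `X` and `y` are made up, not the dataset used in this notebook), assuming a 60/20/20 split and stratification on an imbalanced binary target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features, imbalanced binary target (80/20)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

# First hold out 20% as the test set, keeping the class ratio via stratify
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Then split the remaining 80% into 75% train / 25% validation,
# which gives 60% / 20% / 20% of the original data overall
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Splitting before fitting any imputer, encoder, or scaler is one common way to keep information from the validation and test sets out of the training process.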

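Here I create a synthetic dataset of 300 customers, with missing values and outliers injected on purpose so that the preprocessing steps have something to fix.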

In [14]:
import pandas as pd
import numpy as np

# Setting seed for reproducibility
np.random.seed(42)

# Number of rows
n = 300

# Generate CustomerID
customer_id = np.arange(1, n+1)

# Age with some missing values
age = np.random.randint(18, 70, size=n).astype(float)
age[np.random.choice(n, 20, replace=False)] = np.nan  # 20 missing values

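# Gender with some missing values
# Note: np.random.choice returns a fixed-width string array here, so assigning
# np.nan stores the literal string 'nan' rather than a true missing value
genders = ['Male', 'Female', 'Other']
gender = np.random.choice(genders, size=n, p=[0.45, 0.45, 0.1])
gender[np.random.choice(n, 15, replace=False)] = np.nan  # 15 entries become 'nan'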

# Annual Income (in k$) with some missing values
income = np.random.normal(loc=60, scale=20, size=n).round(2)
income[income < 10] = 10  # minimum income
income[np.random.choice(n, 25, replace=False)] = np.nan  # 25 missing

# Purchased (binary categorical)
purchased = np.random.choice(['Yes', 'No'], size=n, p=[0.4, 0.6])

# City categories
cities = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
city = np.random.choice(cities, size=n, p=[0.3, 0.25, 0.2, 0.15, 0.1])

# Membership Years with noise
membership_years = np.abs(np.random.normal(loc=5, scale=3, size=n)).round(1)

# Credit Score with some missing values
credit_score = np.random.normal(loc=650, scale=50, size=n).round()
credit_score[credit_score < 300] = 300
credit_score[credit_score > 850] = 850
credit_score[np.random.choice(n, 30, replace=False)] = np.nan  # 30 missing

# Account Balance with some outliers
account_balance = np.random.normal(loc=5000, scale=2000, size=n).round(2)
# Add outliers
outliers_idx = np.random.choice(n, 5, replace=False)
account_balance[outliers_idx] = account_balance[outliers_idx] * 5

# Create DataFrame
df = pd.DataFrame({
    'CustomerID': customer_id,
    'Age': age,
    'Gender': gender,
    'Annual Income (k$)': income,
    'Purchased': purchased,
    'City': city,
    'Membership Years': membership_years,
    'Credit Score': credit_score,
    'Account Balance': account_balance
})

df.head()


| | CustomerID | Age | Gender | Annual Income (k$) | Purchased | City | Membership Years | Credit Score | Account Balance |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56.0 | Male | 65.95 | Yes | Phoenix | 5.2 | 699.0 | 5307.66 |
| 1 | 2 | 69.0 | Other | 72.46 | Yes | Chicago | 8.1 | 712.0 | 6532.93 |
| 2 | 3 | 46.0 | Female | 79.0 | No | Los Angeles | 3.6 | 671.0 | 1564.41 |
| 3 | 4 | 32.0 | Female | 70.79 | No | Houston | 4.1 | 633.0 | 5569.74 |
| 4 | 5 | 60.0 | Female | 86.93 | No | New York | 5.7 | 570.0 | 33103.45 |


In [15]:
df.sample(10)  # Display 10 random samples from the dataset

| | CustomerID | Age | Gender | Annual Income (k$) | Purchased | City | Membership Years | Credit Score | Account Balance |
|---|---|---|---|---|---|---|---|---|---|
| 126 | 127 | 51.0 | Female | 51.68 | No | New York | 8.2 | NaN | 5992.33 |
| 94 | 95 | 43.0 | Female | 65.12 | Yes | New York | 4.4 | 604.0 | 4328.54 |
| 255 | 256 | 33.0 | Male | 60.95 | No | New York | 2.5 | NaN | 7342.24 |
| 63 | 64 | 31.0 | Male | 52.35 | Yes | New York | 11.8 | 610.0 | 6487.63 |
| 129 | 130 | 40.0 | Male | 83.88 | Yes | Los Angeles | 6.6 | 659.0 | 3880.52 |
| 52 | 53 | 26.0 | Female | 73.27 | Yes | Houston | 4.6 | 613.0 | 5835.35 |
| 6 | 7 | 38.0 | Male | NaN | No | Chicago | 7.6 | 627.0 | 7213.93 |
| 156 | 157 | 68.0 | Female | 60.87 | No | Los Angeles | 4.4 | 642.0 | 3558.72 |
| 231 | 232 | 40.0 | Female | 60.47 | Yes | Chicago | 6.4 | 629.0 | 6259.5 |
| 136 | 137 | 44.0 | Male | 63.39 | Yes | New York | 7.8 | NaN | 2491.4 |


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   CustomerID          300 non-null    int32  
 1   Age                 280 non-null    float64
 2   Gender              300 non-null    object 
 3   Annual Income (k$)  275 non-null    float64
 4   Purchased           300 non-null    object 
 5   City                300 non-null    object 
 6   Membership Years    300 non-null    float64
 7   Credit Score        270 non-null    float64
 8   Account Balance     300 non-null    float64
dtypes: float64(5), int32(1), object(3)
memory usage: 20.1+ KB


In [17]:
df.isnull().sum()  # Count missing values per column before imputation

CustomerID             0
Age                   20
Gender                 0
Annual Income (k$)    25
Purchased              0
City                   0
Membership Years       0
Credit Score          30
Account Balance        0
dtype: int64
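The counts above report zero missing values for `Gender`, even though the generation code assigns `np.nan` to 15 of its entries. A likely explanation, sketched here on a hypothetical three-element array rather than the real column, is that a NumPy string array stores `np.nan` as the text `'nan'`, which pandas does not treat as missing. Using an object-dtype array is one possible way to keep a true missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy array with a fixed-width string dtype,
# the same kind of array np.random.choice returns for a list of strings
g_str = np.array(['Male', 'Female', 'Other'])
g_str[0] = np.nan                       # silently cast to the string 'nan'
print(g_str[0] == 'nan')                # True
print(pd.Series(g_str).isna().sum())    # 0, nothing is counted as missing

# With dtype=object the float NaN is stored as-is
g_obj = np.array(['Male', 'Female', 'Other'], dtype=object)
g_obj[0] = np.nan
print(pd.Series(g_obj).isna().sum())    # 1
```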

### `scikit-learn` Library
Scikit-learn (also known as `sklearn`) is an open-source machine learning library for Python. It provides simple, efficient, and versatile tools for data analysis, modeling, and machine learning tasks, including classification, regression, clustering, and dimensionality reduction. Built on top of foundational libraries such as NumPy, SciPy, and matplotlib, scikit-learn offers a consistent, user-friendly API for implementing, training, and evaluating machine learning algorithms. It simplifies tasks like data preprocessing, feature selection, model evaluation, and hyperparameter tuning, making machine learning more accessible for both beginners and experts.

In [30]:
import pandas as pd
from sklearn.impute import SimpleImputer # For handling missing values
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler # For encoding and scaling
from sklearn.compose import ColumnTransformer # For applying different transformations to different columns
from sklearn.pipeline import Pipeline # For creating a pipeline of transformations


### 1. Handling Missing Values

Many real-world datasets have missing or null values in some features. These missing values can cause errors or degrade the performance of machine learning models if not handled properly.

In [31]:
# 1. Handling Missing Values

# Separate numerical and categorical columns
num_cols = ['Age', 'Annual Income (k$)', 'Membership Years', 'Credit Score', 'Account Balance']
cat_cols = ['Gender', 'Purchased', 'City']

- **Imputation:** The process of replacing missing values with substitutes derived from the available data, so that the model can learn from a complete dataset.

- `SimpleImputer`: The scikit-learn utility used here for imputation. It offers several strategies for filling missing values:

   - **Mean Imputation:** Used here for numerical columns (`num_imputer = SimpleImputer(strategy='mean')`). It replaces missing numerical values with the mean of the observed values in that feature.

   - **Most Frequent Imputation:** Used here for categorical columns (`cat_imputer = SimpleImputer(strategy='most_frequent')`). It replaces missing categorical values with the most common value (the mode) of the feature.

- **Applying Imputers:** The `fit_transform` method computes the statistic (mean or most frequent value) from the data and replaces missing values accordingly, so the selected columns contain no `NaN` entries afterwards. Note that `SimpleImputer` only fills true missing values (`np.nan` by default). The "missing" `Gender` entries in this dataset are the literal string `'nan'` (which is why `Gender` shows 300 non-null values above), so they are left untouched and later appear as a separate `Gender_nan` category.
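To make the two strategies concrete, here is a small worked example on a hypothetical four-row frame (the `score` and `color` columns are invented for illustration). The mean of the observed scores 1, 3, and 5 is 3, and the most frequent color is `'red'`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    'score': [1.0, np.nan, 3.0, 5.0],
    'color': pd.Series(['red', 'blue', np.nan, 'red'], dtype=object),
})

num_imp = SimpleImputer(strategy='mean')
cat_imp = SimpleImputer(strategy='most_frequent')

# fit_transform learns the fill value and applies it in one call
score_filled = num_imp.fit_transform(toy[['score']])
color_filled = cat_imp.fit_transform(toy[['color']])

print(num_imp.statistics_)               # learned fill value: the mean, 3.0
print(score_filled.ravel().tolist())     # [1.0, 3.0, 3.0, 5.0]
print(color_filled.ravel().tolist())     # ['red', 'blue', 'red', 'red']
```

When the data is split into training and test sets, the usual practice is to call `fit` (or `fit_transform`) on the training split only and `transform` on the others, so the fill values are not influenced by held-out data.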

In [None]:
# Imputer for numerical columns (mean imputation)
num_imputer = SimpleImputer(strategy='mean')

# Imputer for categorical columns (most frequent)
cat_imputer = SimpleImputer(strategy='most_frequent')

# Apply imputers
df[num_cols] = num_imputer.fit_transform(df[num_cols])
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

### 2. Encoding Categorical Variables

#### Label Encoding

  - Converts a binary categorical variable (here, the `Purchased` column with values "Yes" or "No") into numeric labels (0 and 1).

  - Appropriate for binary categories and target labels. `LabelEncoder` assigns codes in sorted order of the category values, so for ordinal features with a meaningful order, an encoder that takes an explicit category order (such as `OrdinalEncoder`) is the safer choice.

  - Uses scikit-learn's `LabelEncoder`, which assigns an integer code to each category.

#### One-Hot Encoding

  - Converts nominal categorical variables (here, `Gender` and `City`) into multiple binary columns, one per category.

  - Each category gets its own column, with 1 indicating presence and 0 otherwise.

  - Avoids introducing a spurious order that integer codes might imply.

  - Uses scikit-learn's `OneHotEncoder`. The `drop='first'` argument drops the first category of each feature as a reference level to avoid multicollinearity (the dummy variable trap).

  - `sparse_output=False` returns a dense array, which is convenient for building a DataFrame. This parameter name requires scikit-learn 1.2 or later.

In [33]:
# 2. Encoding Categorical Variables

# For 'Purchased' (binary), use Label Encoding
le_purchased = LabelEncoder()
df['Purchased'] = le_purchased.fit_transform(df['Purchased'])

# For nominal categorical features ('Gender', 'City'), use OneHotEncoding

ohe = OneHotEncoder(drop='first', sparse_output=False)
ohe_features = ohe.fit_transform(df[['Gender', 'City']])

# Get column names for one-hot encoded features
ohe_feature_names = ohe.get_feature_names_out(['Gender', 'City'])

# Create DataFrame for encoded features
df_ohe = pd.DataFrame(ohe_features, columns=ohe_feature_names, index=df.index)

# Concatenate encoded columns and drop original ones
df = pd.concat([df.drop(['Gender', 'City'], axis=1), df_ohe], axis=1)

### 3. Scaling Numerical Features (Standardization)

- **Concept:** Feature scaling standardizes numerical data by centering it around the mean and scaling it to unit variance.

- **StandardScaler:**  
This scikit-learn class computes the mean and standard deviation of each numerical feature and transforms each value $x$ using the formula:

$$
z = \frac{x - \mu}{\sigma}
$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.

- **Purpose:**  
Standardized features have zero mean and unit variance, which helps many machine learning algorithms perform better and converge faster, especially algorithms sensitive to feature scale like SVM, logistic regression, and neural networks.

- **Application:**  
The numerical columns (`num_cols`) in the dataset are scaled using `fit_transform` which fits the scaler on the data and transforms it in a single step.
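As a quick check of the formula, the sketch below standardizes a hypothetical single-feature array by hand and compares the result with `StandardScaler`. It relies on the fact that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also the NumPy default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature with four values
x = np.array([[2.0], [4.0], [6.0], [8.0]])

# Manual standardization: z = (x - mu) / sigma
mu = x.mean()        # 5.0
sigma = x.std()      # population standard deviation (ddof=0), sqrt(5)
z_manual = (x - mu) / sigma

# Same transformation with scikit-learn
z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))   # True
print(z_sklearn.ravel())
```

Note that `pandas.Series.std()` defaults to the sample standard deviation (`ddof=1`), so a manual check done with pandas would differ slightly unless `ddof=0` is passed.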

### 4. Outlier Detection and Handling (Capping)
- **Concept:** Outliers are extreme values in data that can distort model training and affect results.

- **Capping by Percentiles (Winsorizing):** This technique limits the extreme values by capping them at a specified lower and upper percentile value (e.g., 1st and 99th percentile).

- In the code, `lower` is set to the 1st percentile and `upper` to the 99th percentile of the `Account Balance` feature. `clip()` then replaces any value below `lower` with `lower` and any value above `upper` with `upper`. This reduces the influence of extreme outliers while preserving all data points. Because the code applies capping after standardization, the percentiles are computed on the standardized values.

- **Purpose:** Helps improve model stability and reduces variance caused by extreme values without completely removing data points.
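Percentile capping can be illustrated on a hypothetical series of 99 ordinary values plus one extreme value (the numbers are made up and are not taken from the dataset). The row count stays the same; only the tails are pulled in.

```python
import pandas as pd

# Hypothetical balances: the values 1..99 plus a single extreme outlier
balance = pd.Series(list(range(1, 100)) + [10_000], dtype=float)

# Percentile bounds computed from the data itself
lower = balance.quantile(0.01)
upper = balance.quantile(0.99)

# Values outside [lower, upper] are replaced by the nearest bound
capped = balance.clip(lower, upper)

print(len(balance), len(capped))        # same length: no rows are dropped
print(balance.max(), capped.max())      # the outlier is pulled down to `upper`
```

One design choice worth weighing is the order of operations: capping before standardization keeps extreme values from inflating the mean and standard deviation that the scaler learns.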


In [34]:
# 3. Scaling Numerical Features

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

# 4. Outlier Detection and Handling (example: capping by percentiles)

for col in ['Account Balance']:
    lower = df[col].quantile(0.01)
    upper = df[col].quantile(0.99)
    df[col] = df[col].clip(lower, upper)

In [38]:
df.head() # Final dataset preview

| | CustomerID | Age | Annual Income (k$) | Purchased | Membership Years | Credit Score | Account Balance | Gender_Male | Gender_Other | Gender_nan | City_Houston | City_Los Angeles | City_New York | City_Phoenix |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.826604 | 0.230346 | 1 | -0.048566 | 0.965262 | -0.00945 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 2 | 1.706382 | 0.582667 | 1 | 0.938634 | 1.245129 | 0.4925 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 3 | 0.149852 | 0.936612 | 0 | -0.593227 | 0.362472 | -1.542926 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 4 | -0.797601 | 0.492287 | 0 | -0.423021 | -0.455601 | 0.097915 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 5 | 1.097305 | 1.365783 | 0 | 0.121641 | -1.811879 | 4.952199 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |


In [39]:
df.isnull().sum() # Check for any remaining missing values

CustomerID            0
Age                   0
Annual Income (k$)    0
Purchased             0
Membership Years      0
Credit Score          0
Account Balance       0
Gender_Male           0
Gender_Other          0
Gender_nan            0
City_Houston          0
City_Los Angeles      0
City_New York         0
City_Phoenix          0
dtype: int64