## Example code for Pre-processing and Exploratory Data Analysis (EDA) with pandas


In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Creating a sample DataFrame
data = {
    'City': ['Sevilla', 'NY', 'LA', 'New York', np.nan, 'Seville', 'Los Angeles'],
    'Date': ['2023-06-28', '28-06-2023', np.nan, '28/06/2023', '2023-06-28', '2023-06-28', '28/06/2023'],
    'Feature1': [5, np.nan, 7, 8, 9, 10, np.nan],
    'Feature2': [15, np.nan, 17, 18, np.nan, 20, 21],
    'Feature3': ['A', 'B', np.nan, 'A', 'B', 'A', np.nan],
    'Feature4': ['X', 'Y', np.nan, 'X', 'Y', 'X', np.nan],
    'Surname': ['Smith', 'Johnson', 'Williams', 'Jones', 'Brown', 'Davis', 'Miller']
}

df = pd.DataFrame(data)
print("Original DataFrame")
print(df)

# Data Cleaning

## 1. Incorrect Data
name_correction = {"Sevilla": "Seville", "NY": "New York", "LA": "Los Angeles"}
df['City'] = df['City'].replace(name_correction)
display(df)

In [None]:
## 2. Improperly Formatted Data
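# The column mixes ISO (YYYY-MM-DD) and day-first (DD-MM-YYYY, DD/MM/YYYY) strings.
# format='mixed' (pandas >= 2.0) parses each value individually; without it,
# values that do not match the first inferred format are coerced to NaT.
df['Date'] = pd.to_datetime(df['Date'], format='mixed', dayfirst=True, errors='coerce')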
display(df)

In [None]:
## 3. Duplicated Data
df = df.drop_duplicates()

## 4. Irrelevant Features or Data
df.drop('Surname', axis=1, inplace=True)

# Missing Data

# Finding the number of missing values in each column
missing_values = df.isnull().sum()
print("\nMissing values in each column:")
print(missing_values)


In [None]:
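# Dropping a column if more than 90 percent of its instances are missing.
# `thresh` is the minimum number of NON-missing values a column needs to be kept,
# so a column survives only if at least 10 percent of its values are present.
df_dropped = df.dropna(thresh=int(np.ceil(0.1 * len(df))), axis=1)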
display(df_dropped)


In [None]:

df_imputed = df.copy()

# Filling missing values in quantitative features with mean value
imputer = SimpleImputer(strategy='mean')
df_imputed[['Feature1', 'Feature2']] = imputer.fit_transform(df_imputed[['Feature1', 'Feature2']])
display(df_imputed)


In [None]:

# Replacing missing values in categorical features with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')
df_imputed[['City', 'Feature3', 'Feature4']] = imputer.fit_transform(df_imputed[['City', 'Feature3', 'Feature4']])

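# Alternative: model-based imputation of the quantitative features.
# IterativeImputer estimates each feature from the others, so it is fitted on the
# original columns of df, which still contain the missing values (the columns in
# df_imputed were already filled with the mean above).
imputer = IterativeImputer(random_state=0)
df_imputed[['Feature1', 'Feature2']] = imputer.fit_transform(df[['Feature1', 'Feature2']])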

print("\nDataFrame after imputation:")
print(df_imputed)


# Creating a new binary feature to indicate the imputation
df_imputed['Feature1_imputed'] = 0
df_imputed.loc[df['Feature1'].isnull(), 'Feature1_imputed'] = 1
print("\nDataFrame after adding binary feature:")
print(df_imputed)


### Outliers

The Python library `pandas` provides functions for calculating mean, median, and quantiles. You can also use `matplotlib` to create boxplots for visualizing outliers.

Here is the Python code for each of the strategies mentioned:

**Mean and Median:**

You can use the `mean()` and `median()` functions of a pandas DataFrame to calculate the mean and median of each column.


In [None]:
import pandas as pd
import numpy as np

# Create a numeric DataFrame
np.random.seed(0)  # for reproducibility
data = {
    'Feature1': np.random.normal(0, 1, 1000),
    'Feature2': np.random.normal(10, 2, 1000),
    'Feature3': np.random.normal(-5, 5, 1000),
}

df = pd.DataFrame(data)

# Add some extreme values (outliers)
df.loc[1000] = [20, 20, -10]
df.loc[1001] = [-10, 20, 20]

print(df)


# Assuming df is your DataFrame
print("Mean of each column:")
print(df.mean())
print("\nMedian of each column:")
print(df.median())



**Interquartile Range (IQR):**

You can use the `quantile()` function of a pandas DataFrame to calculate the 1st and 3rd quartiles (the 25th and 75th percentiles, Q1 and Q3) and then compute the IQR as Q3 - Q1. Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are commonly treated as outliers.


In [None]:
# Assuming df is your DataFrame and 'Feature1' is one of its columns
Q1 = df['Feature1'].quantile(0.25)
Q3 = df['Feature1'].quantile(0.75)
IQR = Q3 - Q1

# Identifying outliers
outliers = df[(df['Feature1'] < (Q1 - 1.5 * IQR)) | (df['Feature1'] > (Q3 + 1.5 * IQR))]
print("Outliers in 'Feature1':")
print(outliers)


**Boxplot**:

You can use the `boxplot()` function of `matplotlib.pyplot` to create boxplots. Boxplots are useful for visually identifying outliers.


In [None]:
import matplotlib.pyplot as plt

# Assuming df is your DataFrame and 'Feature1' is one of its columns
plt.boxplot(df['Feature1'].dropna())
plt.title("Boxplot of 'Feature1'")
plt.show()


This boxplot visually represents the distribution of 'Feature1'. The line in the middle of the box is the median, the box represents the IQR, the whiskers represent the range within 1.5 times the IQR from the box, and points outside the whiskers are potential outliers.

Please remember to replace `'Feature1'` with the actual column name you're interested in. Note that these examples assume your data is numeric. If it's not, you'll have to convert or handle the data appropriately.
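To make the boxplot description above concrete, here is a minimal sketch that computes the same quantities by hand with NumPy. The `values` array is a hypothetical sample, not data from this notebook, and the sketch assumes the default whisker rule of 1.5 × IQR, under which each whisker ends at the most extreme data point that still lies inside the fence.

```python
import numpy as np

# Hypothetical 1-D sample with two obvious extremes
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0, -40.0])

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: 1.5 * IQR beyond the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
lower_whisker = inside.min()
upper_whisker = inside.max()

# Points beyond the fences are drawn individually as potential outliers
fliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Box: Q1={q1}, median={median}, Q3={q3}")
print(f"Whiskers: {lower_whisker} to {upper_whisker}")
print(f"Potential outliers: {fliers}")
```

Note that the whiskers are data points, not the fences themselves, which is why they are often shorter than 1.5 × IQR.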

**Removing outliers:**

In [None]:
from scipy.stats import mstats
from scipy.special import boxcox, inv_boxcox

# Calculate the IQR for each column
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

# Define upper and lower bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify the outliers
outliers = (df < lower_bound) | (df > upper_bound)

# Print the identified outliers
print("\nOutliers detected:")
print(outliers)

# Removing outliers
df_no_outliers = df[~outliers.any(axis=1)]
print("\nDataFrame after removing outliers:")
print(df_no_outliers)

# Transforming outliers: Logarithmic transformation
# The logarithm is only defined for positive values and these features contain
# negatives, so shift each column so that its minimum becomes 1 first.
df_log_transformed = np.log(df - df.min() + 1)
print("\nDataFrame after logarithmic transformation:")
print(df_log_transformed)

# Power transformation (square root)
# The square root needs non-negative input, so shift each column to start at 0.
df_power_transformed = np.sqrt(df - df.min())
print("\nDataFrame after power transformation:")
print(df_power_transformed)

# Capping: clip values to the IQR-based lower and upper bounds of each column
df_capped = df.clip(lower=lower_bound, upper=upper_bound, axis=1)
print("\nDataFrame after capping outliers:")
print(df_capped)

# Truncation: replace outliers with NaN (this discards them rather than capping them)
df_truncated = df.mask(outliers)
print("\nDataFrame after truncating outliers:")
print(df_truncated)

# Winsorizing outliers using scipy's mstats module
df_winsorized = df.copy()
for column in df_winsorized.columns:
    df_winsorized[column] = mstats.winsorize(df_winsorized[column], limits=[0.05, 0.05])
print("\nDataFrame after Winsorizing:")
print(df_winsorized)


### Feature Encoding

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Create a dataframe
data = {'City': ['Amsterdam', 'Rotterdam', 'The Hague', 'Utrecht'],
        'Province': ['North Holland', 'South Holland', 'South Holland', 'Utrecht'],
        'Population': [821752, 623652, 514861, 334176]}
df = pd.DataFrame(data)

print("\nOriginal DataFrame:")
print(df)

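# (a) Substituting categorical variables with numeric values using LabelEncoder.
# Note: LabelEncoder is designed for target labels, and the integer codes imply an
# order between categories. For input features, OrdinalEncoder or one-hot encoding
# (shown in (b)) is usually the safer choice.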

# Instantiate LabelEncoder
le = LabelEncoder()

# Apply LabelEncoder to the 'City' column
df['City_encoded'] = le.fit_transform(df['City'])

print("\nDataFrame after Label Encoding 'City':")
print(df)

# (b) Using one-hot encoding for categorical variables:

# Instantiate OneHotEncoder
ohe = OneHotEncoder()

# Apply OneHotEncoder to the 'Province' column
ohe_results = ohe.fit_transform(df[['Province']])

# Manually create feature names
feature_names = ohe.categories_[0]

# Convert the results to a DataFrame and append to original DataFrame
ohe_df = pd.DataFrame(ohe_results.toarray(), columns=feature_names)
df = pd.concat([df, ohe_df], axis=1)

print("\nDataFrame after One-Hot Encoding 'Province':")
print(df)


### Feature Binning

Feature binning or discretization is the process of converting continuous features into discrete bins. It can be useful for algorithms that can't handle continuous values and to reduce the effect of small observation errors.


In [None]:

import pandas as pd

# Suppose we have a DataFrame with age and salary columns
data = {'age': [22, 25, 47, 52, 46, 56, 55, 60, 62, 61, 18, 28, 27, 29, 49],
        'salary': [20691, 223500, 68730, 158731, 124150, 65123, 75400, 61398, 74127, 81293, 59419, 59639, 99524, 74816, 43429]}
df = pd.DataFrame(data)

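# We can bin the continuous age data into categories (bins).
# right=False makes each bin include its left edge and exclude its right edge,
# e.g. [18, 30), so that age 18 is binned and the labels match the intervals.
bins = [18, 30, 40, 50, 60, 70]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69']
df['age_bin'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)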

print(df)



### Feature Scaling (Standardization and Normalization)

Feature scaling is a step in pre-processing input data for a machine learning model. Standardization (zero mean and unit variance) and normalization (scaling features to lie between a given minimum and maximum value) are common methods.

In [None]:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardization
scaler_standard = StandardScaler()
df['salary_standardized'] = scaler_standard.fit_transform(df[['salary']])

# Normalization
scaler_minmax = MinMaxScaler()
df['salary_normalized'] = scaler_minmax.fit_transform(df[['salary']])

print(df)
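The two scalers above correspond to simple formulas, which the following sketch applies by hand with NumPy. It uses the first five salary values from the example as an illustrative subset. Standardization computes (x - mean) / std using the population standard deviation (`ddof=0`, which is what `StandardScaler` uses), and normalization computes (x - min) / (max - min).

```python
import numpy as np

# Illustrative subset: the first five salary values from the example above
salary = np.array([20691, 223500, 68730, 158731, 124150], dtype=float)

# Standardization: zero mean and unit variance (population std, ddof=0)
standardized = (salary - salary.mean()) / salary.std()

# Min-max normalization: rescale to the range [0, 1]
normalized = (salary - salary.min()) / (salary.max() - salary.min())

print("Standardized:", np.round(standardized, 3))
print("Normalized:  ", np.round(normalized, 3))
```

Both transformations are linear, so they change the location and spread of a feature but not the shape of its distribution.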



### Class Label Imbalance (SMOTE)

In imbalanced classification problems, one class can heavily outnumber the other class(es). Synthetic Minority Over-sampling Technique (SMOTE) addresses this by generating synthetic minority-class samples, each interpolated between an existing minority sample and one of its nearest minority-class neighbours.

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from collections import Counter

# Create an imbalanced dataset (about 99% of samples in the majority class)
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)

print(f'Original Dataset Shape: {Counter(y)}')

# Apply SMOTE
sm = SMOTE(random_state=2)
X_res, y_res = sm.fit_resample(X, y)

print(f'Resampled Dataset Shape: {Counter(y_res)}')
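
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imblearn` implementation: the `minority` points and the `synthesize` helper are hypothetical, and the sketch always uses the single nearest neighbour, whereas SMOTE picks at random among the k nearest minority neighbours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in 2-D
minority = np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 1.4], [1.2, 0.9]])

def synthesize(points, rng):
    """Create one synthetic point between a random sample and its nearest neighbour."""
    i = rng.integers(len(points))
    base = points[i]
    # Distances from the chosen sample to every point
    dists = np.linalg.norm(points - base, axis=1)
    dists[i] = np.inf  # exclude the sample itself
    neighbour = points[np.argmin(dists)]
    lam = rng.random()  # interpolation factor in [0, 1)
    return base + lam * (neighbour - base)

synthetic = np.array([synthesize(minority, rng) for _ in range(5)])
print(synthetic)
```

Because every synthetic point lies on a line segment between two real minority samples, the new samples stay inside the region the minority class already occupies.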

### Effects of Scaling on Supervised Learning Algorithms

Preprocessing methods such as scalers are usually applied before fitting a supervised machine learning algorithm. As an example, we fit a logistic regression model to the breast cancer dataset four times: once on the unscaled data and once each after applying StandardScaler, MinMaxScaler, and RobustScaler, and then compare the test accuracies. We start by loading the dataset and splitting it into a training set and a test set:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_breast_cancer()

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=1)

# Define the scaling methods
scalers = [StandardScaler(), MinMaxScaler(), RobustScaler()]

# Initialize lists to store scaler names and accuracies
scaler_names = ['Unscaled']
accuracies = []

# Calculate and store the accuracy of the unscaled dataset
lr = LogisticRegression(max_iter=5000)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
unscaled_accuracy = accuracy_score(y_test, y_pred)
accuracies.append(unscaled_accuracy)

for scaler in scalers:
    # Scale the training and test sets
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Instantiate the model
    lr = LogisticRegression(max_iter=5000)

    # Fit the model to the scaled training set
    lr.fit(X_train_scaled, y_train)

    # Predict on the scaled test set
    y_pred = lr.predict(X_test_scaled)

    # Calculate the accuracy score
    accuracy = accuracy_score(y_test, y_pred)

    # Store scaler name and accuracy
    scaler_name = type(scaler).__name__
    scaler_names.append(scaler_name)
    accuracies.append(accuracy)

# Plot the accuracies
plt.figure(figsize=(10, 6))
x = np.arange(len(scaler_names))
colors = ['darkblue', 'mediumseagreen', 'darkorange', 'red']
bar_width = 0.4

plt.bar(x, accuracies, width=bar_width, color=colors, edgecolor='black')

# Add text labels for accuracy values
for i, acc in enumerate(accuracies):
    plt.text(i, acc + 0.01, f'{acc:.2f}', ha='center', color='black', fontsize=10)

# Customize the plot
plt.xlabel('Scaling Method')
plt.ylabel('Accuracy')
plt.title('Accuracy Comparison - Breast Cancer Classification')
plt.xticks(x, scaler_names, rotation=45)
plt.ylim(0, 1)
plt.grid(axis='y', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()


To scale a dataset and use it in a machine learning model, follow these steps:

1. Load the dataset: Start by loading the dataset you want to work with. This could be a dataset you have collected or a pre-existing dataset from a library like scikit-learn.
2. Split the dataset: Divide the dataset into training and test sets using the `train_test_split` function. This step is essential to evaluate the model's performance on unseen data.
3. Choose a scaling method: Decide on a scaling method based on your data and the requirements of your machine learning algorithm. Some commonly used scaling methods are StandardScaler, MinMaxScaler, and RobustScaler.
4. Initialize the scaler: Create an instance of the chosen scaler. For example, if you want to use StandardScaler, initialize it using `scaler = StandardScaler()`.
5. Fit and transform the training data: Fit the scaler to the training data using the `fit` method: `scaler.fit(X_train)`. This step calculates the necessary statistics from the training data (e.g., mean and standard deviation for StandardScaler). Then, transform the training data using the `transform` method: `X_train_scaled = scaler.transform(X_train)`. This step scales the training data based on the calculated statistics.
6. Transform the test data: Use the same scaler instance to transform the test data. This ensures that the test data is scaled in the same way as the training data: `X_test_scaled = scaler.transform(X_test)`.
7. Build and train the model: Create an instance of the machine learning model you want to use (e.g., Logistic Regression, Random Forest, etc.). Fit the model to the scaled training data: `model.fit(X_train_scaled, y_train)`.
8. Evaluate the model: Use the trained model to make predictions on the scaled test data: `y_pred = model.predict(X_test_scaled)`. Evaluate the model's performance using appropriate evaluation metrics (e.g., accuracy, precision, recall, etc.).
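
The steps above can also be bundled into a single estimator. The following is a minimal sketch using scikit-learn's `make_pipeline`; the choice of `StandardScaler` and `LogisticRegression` is only an example, and any scaler and model could be substituted. The pipeline fits the scaler on the training data only and reuses those statistics when transforming the test data, which covers steps 4 to 8.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: load the data and split it into training and test sets
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Steps 3-7: the scaler is fitted on X_train only, then the model is trained
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
model.fit(X_train, y_train)

# Step 8: X_test is scaled with the training statistics before prediction
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline also prevents test-set statistics from leaking into training when the model is later used with cross-validation.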