## Handling Missing Values in Large-scale ML Pipelines:

**Task 1**: Impute with Mean or Median
- Step 1: Load a dataset with missing values (e.g., Boston Housing dataset).
- Step 2: Identify columns with missing values.
- Step 3: Impute missing values using the mean or median of the respective columns.

In [1]:
# write your code from here

**Task 2**: Impute with the Most Frequent Value
- Step 1: Use the Titanic dataset and identify columns with missing values.
- Step 2: Impute categorical columns using the most frequent value.

In [2]:
# write your code from here

**Task 3**: Advanced Imputation - k-Nearest Neighbors
- Step 1: Implement KNN imputation using the KNNImputer from sklearn.
- Step 2: Explore how KNN imputation improves data completion over simpler methods.

In [3]:
# write your code from here

## Feature Scaling & Normalization Best Practices:

**Task 1**: Standardization
- Step 1: Standardize features using StandardScaler.
- Step 2: Observe how standardization affects data distribution.

In [4]:
# write your code from here

**Task 2**: Min-Max Scaling

- Step 1: Scale features to lie between 0 and 1 using MinMaxScaler.
- Step 2: Compare with standardization.

In [5]:
# write your code from here

**Task 3**: Robust Scaling
- Step 1: Scale features using RobustScaler, which is useful for data with outliers.
- Step 2: Assess changes in data scaling compared to other scaling methods.

In [6]:
# write your code from here

## Feature Selection Techniques:
### Removing Highly Correlated Features:

**Task 1**: Correlation Matrix
- Step 1: Compute correlation matrix.
- Step 2: Remove highly correlated features (correlation > 0.9).

In [7]:
# write your code from here

### Using Mutual Information & Variance Thresholds:

**Task 2**: Mutual Information
- Step 1: Compute mutual information between features and target.
- Step 2: Retain features with high mutual information scores.

In [8]:
# write your code from here

**Task 3**: Variance Threshold
- Step 1: Implement VarianceThreshold to remove features with low variance.
- Step 2: Analyze impact on feature space.

In [9]:
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif, SelectKBest
import numpy as np
import seaborn as sns

# --- Task 1: Impute with Mean or Median ---
print("--- Task 1: Impute with Mean or Median ---")
# Load the Boston Housing dataset (note: it's deprecated, using alternative loading)
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
boston_data = pd.DataFrame(np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]]))
boston_feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
boston_target = pd.DataFrame(raw_df.values[1::2, 2], columns=['MEDV'])
boston_df = pd.concat([boston_data, boston_target], axis=1)
boston_df.columns = boston_feature_names + ['MEDV']

# Introduce missing values for demonstration
boston_df.loc[[10, 50, 100], 'RM'] = np.nan
boston_df.loc[[25, 75, 150], 'LSTAT'] = np.nan

# Identify columns with missing values
missing_cols_boston = boston_df.columns[boston_df.isnull().any()].tolist()
print("Missing columns in Boston Housing dataset:", missing_cols_boston)

# Impute missing values using mean and median
imputer_mean = SimpleImputer(strategy='mean')
boston_df['RM_mean_imputed'] = imputer_mean.fit_transform(boston_df[['RM']])

imputer_median = SimpleImputer(strategy='median')
boston_df['LSTAT_median_imputed'] = imputer_median.fit_transform(boston_df[['LSTAT']])

print("\nBoston Housing dataset with mean and median imputation:")
print(boston_df[['RM', 'RM_mean_imputed', 'LSTAT', 'LSTAT_median_imputed']].head())

print("\n" + "="*50 + "\n")

# --- Task 2: Impute with the Most Frequent Value ---
print("--- Task 2: Impute with the Most Frequent Value ---")
# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')

# Identify categorical columns with missing values
categorical_cols_titanic = titanic_df.select_dtypes(include=['object']).columns
missing_categorical_cols_titanic = [col for col in categorical_cols_titanic if titanic_df[col].isnull().any()]
print("Missing categorical columns in Titanic dataset:", missing_categorical_cols_titanic)

# Impute missing categorical values using the most frequent value (mode)
for col in missing_categorical_cols_titanic:
    imputer_mode = SimpleImputer(strategy='most_frequent')
    titanic_df[col + '_mode_imputed'] = imputer_mode.fit_transform(titanic_df[[col]])

print("\nTitanic dataset with mode imputation:")
print(titanic_df[['embarked', 'embarked_mode_imputed', 'deck', 'deck_mode_imputed']].head())

print("\n" + "="*50 + "\n")

# --- Task 3: Advanced Imputation - k-Nearest Neighbors ---
print("--- Task 3: Advanced Imputation - k-Nearest Neighbors ---")
# Create a subset of numerical features from Boston Housing for KNN imputation
boston_numerical = boston_df[['RM', 'LSTAT', 'PTRATIO']].copy()
boston_numerical.loc[[15, 65, 120], 'PTRATIO'] = np.nan

print("\nSubset of Boston Housing data with missing values for KNN Imputation:")
print(boston_numerical.head())

# Implement KNN imputation
knn_imputer = KNNImputer(n_neighbors=5)
boston_numerical_imputed = pd.DataFrame(knn_imputer.fit_transform(boston_numerical), columns=boston_numerical.columns)

print("\nSubset of Boston Housing data after KNN Imputation:")
print(boston_numerical_imputed.head())

print("\nKNN imputation can often provide more accurate imputations compared to simple methods like mean or mode, especially when the missing values have dependencies on other features. It leverages the information from similar data points to estimate the missing values.")

print("\n" + "="*50 + "\n")

# --- Task 1: Standardization ---
print("--- Feature Scaling & Normalization Best Practices: ---")
print("--- Task 1: Standardization ---")
# Create a sample numerical Series
numerical_data_standardization = pd.Series([10, 20, 15, 25, 30, 12, 18, 22])
print("Original numerical data:")
print(numerical_data_standardization)

# Standardize the data using StandardScaler
scaler_standard = StandardScaler()
standardized_data = scaler_standard.fit_transform(numerical_data_standardization.values.reshape(-1, 1))
standardized_series = pd.Series(standardized_data.flatten())

print("\nStandardized data:")
print(standardized_series)

print("\nStandardization scales the data to have a mean of 0 and a standard deviation of 1. This can be beneficial for algorithms that are sensitive to the scale of the input features, such as gradient-based methods. It centers the data around zero.")

print("\n" + "="*50 + "\n")

# --- Task 2: Min-Max Scaling ---
print("--- Task 2: Min-Max Scaling ---")
# Scale the original numerical data using MinMaxScaler
scaler_minmax = MinMaxScaler(feature_range=(0, 1))
minmax_scaled_data = scaler_minmax.fit_transform(numerical_data_standardization.values.reshape(-1, 1))
minmax_scaled_series = pd.Series(minmax_scaled_data.flatten())

print("\nMin-Max scaled data (range [0, 1]):")
print(minmax_scaled_series)

print("\nMin-Max scaling scales the data to a fixed range, typically between 0 and 1. This can be useful when you need values within a specific range, for example, for some neural network activation functions. Compared to standardization, it preserves the original shape of the data distribution.")

print("\n" + "="*50 + "\n")

# --- Task 3: Robust Scaling ---
print("--- Task 3: Robust Scaling ---")
# Introduce outliers to the sample data
numerical_data_outliers = pd.Series([10, 20, 15, 25, 30, 12, 18, 22, 100, -5])
print("Numerical data with outliers:")
print(numerical_data_outliers)

# Scale the data using RobustScaler
robust_scaler = RobustScaler()
robust_scaled_data = robust_scaler.fit_transform(numerical_data_outliers.values.reshape(-1, 1))
robust_scaled_series = pd.Series(robust_scaled_data.flatten())

print("\nRobust scaled data:")
print(robust_scaled_series)

print("\nRobust scaling uses the median and interquartile range (IQR) to scale the data. It is less affected by outliers compared to Min-Max scaling and standardization because it does not rely on the mean and standard deviation, which can be significantly influenced by extreme values. This makes it a good choice when your data contains outliers.")

print("\n" + "="*50 + "\n")

# --- Feature Selection Techniques: ---
print("--- Feature Selection Techniques: ---")
# --- Task 1: Correlation Matrix ---
print("--- Task 1: Correlation Matrix ---")
# Compute the correlation matrix for the Boston Housing dataset
correlation_matrix = boston_df.corr()
print("\nCorrelation Matrix of Boston Housing Dataset:")
print(correlation_matrix)

# Identify highly correlated features (correlation > 0.9)
upper_triangle = np.triu(correlation_matrix, k=1)
highly_correlated_pairs = np.where(np.abs(upper_triangle) > 0.9)
highly_correlated_features = [(correlation_matrix.columns[i], correlation_matrix.columns[j])
                             for i, j in zip(*highly_correlated_pairs)]

print("\nHighly correlated features (correlation > 0.9):", highly_correlated_features)

# To remove one of the highly correlated features, you would typically choose based on domain knowledge or which feature is less important. For example, let's arbitrarily decide to remove the second feature in each pair.
features_to_drop = [pair[1] for pair in highly_correlated_features]
boston_df_reduced_corr = boston_df.drop(columns=features_to_drop, errors='ignore')

print("\nBoston Housing DataFrame after removing highly correlated features:", boston_df_reduced_corr.shape)

print("\nRemoving highly correlated features can help to reduce multicollinearity in your data, which can improve the performance and interpretability of some models, especially linear models.")

print("\n" + "="*50 + "\n")

# --- Using Mutual Information & Variance Thresholds: ---
print("--- Using Mutual Information & Variance Thresholds: ---")
# --- Task 2: Mutual Information ---
print("--- Task 2: Mutual Information ---")
# Load a classification dataset (using a simple one for demonstration)
data_mi = {'feature_1': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
           'feature_2': [5, 4, 3, 2, 1, 5, 4, 3, 2, 1],
           'feature_3': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B', 'C', 'A'],
           'target': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]}
df_mi = pd.DataFrame(data_mi)
df_mi = pd.get_dummies(df_mi, columns=['feature_3'], drop_first=True) # One-hot encode categorical features

X_mi = df_mi.drop('target', axis=1)
y_mi = df_mi['target']

# Compute mutual information between features and target
mutual_info = mutual_info_classif(X_mi, y_mi)
mutual_info_series = pd.Series(mutual_info, index=X_mi.columns)
mutual_info_sorted = mutual_info_series.sort_values(ascending=False)

print("\nMutual Information between features and target:")
print(mutual_info_sorted)

# Retain features with high mutual information scores (e.g., top 2)
k_best = 2
selector_mi = SelectKBest(mutual_info_classif, k=k_best)
X_mi_selected = selector_mi.fit_transform(X_mi, y_mi)
selected_features_mi = X_mi.columns[selector_mi.get_support(indices=True)]

print(f"\nTop {k_best} features with highest mutual information:", selected_features_mi)

print("\nMutual information measures the statistical dependency between two random variables. In feature selection, it helps to identify features that have a strong relationship with the target variable, which can be useful for prediction.")

print("\n" + "="*50 + "\n")

# --- Task 3: Variance Threshold ---
print("--- Task 3: Variance Threshold ---")
# Create a DataFrame with some low variance features
data_variance = {'feature_1': [1, 1, 1, 1, 1],
                 'feature_2': [2, 2, 3, 2, 2],
                 'feature_3': [10, 20, 15, 25, 30],
                 'feature_4': [0, 0, 0, 0, 0]}
df_variance = pd.DataFrame(data_variance)

print("\nOriginal DataFrame for Variance Threshold:")
print(df_variance)

# Implement VarianceThreshold to remove features with low variance
threshold_vt = 0.1
selector_vt = VarianceThreshold(threshold=threshold_vt)
X_variance_transformed = selector_vt.fit_transform(df_variance)

# Get the names of the features that were kept
kept_features_vt_indices = selector_vt.get_support(indices=True)
kept_features_vt = df_variance.columns[kept_features_vt_indices]

print(f"\nDataFrame after applying Variance Threshold (threshold={threshold_vt}):")
print(pd.DataFrame(X_variance_transformed, columns=kept_features_vt))

print("\nVariance Threshold removes features whose variance does not exceed a certain threshold. Features with very low variance contain little information and might not be helpful for modeling. Analyzing the impact on the feature space involves seeing which features are removed and understanding if this aligns with your expectations about the importance of those features.")

--- Task 1: Impute with Mean or Median ---
Missing columns in Boston Housing dataset: ['RM', 'LSTAT']

Boston Housing dataset with mean and median imputation:
      RM  RM_mean_imputed  LSTAT  LSTAT_median_imputed
0  6.575            6.575   4.98                  4.98
1  6.421            6.421   9.14                  9.14
2  7.185            7.185   4.03                  4.03
3  6.998            6.998   2.94                  2.94
4  7.147            7.147   5.33                  5.33


--- Task 2: Impute with the Most Frequent Value ---
Missing categorical columns in Titanic dataset: ['embarked', 'embark_town']


ValueError: 2