In [None]:
# Question: Introduction to Missing Data in a DataFrame
# Description: Load a simple CSV file into a DataFrame and identify missing values.

# Steps to follow:
# 1. Load the data: Use the pandas library to read a CSV file.
# 2. Check for missing values: Use the isnull() method to find missing values.
# 3. Summarize missing data: Use the sum() function to count the number of missing values in each column.



In [None]:
# Question: Dropping Rows with Missing Values
# Description: Practice the deletion method by removing rows with any missing values from a dataset.

# Steps to follow:
# 1. Use dropna() method: Use the dropna() method to remove rows with missing values.

In [None]:
# Question: Dropping Columns with Missing Values
# Description: Practice deleting entire columns that contain missing values.

# Steps to follow:
# 1. Use dropna() with axis parameter: Set axis=1 in dropna() to remove columns with missing values.



In [None]:
# Question: Mean Imputation for Numerical Data
# Description: Fill missing values in a numerical column with the mean of that column.

# Steps to follow:
# 1. Calculate mean and fill NA: Use mean() to calculate and fillna() to fill the missing values.



In [None]:
# Question: Mode Imputation for Categorical Data
# Description: Fill missing values in a categorical column with the mode of that column.

# Steps to follow:
# 1. Calculate mode and fill NA: Use mode() to find the most frequent value and fillna() to fill the missing values.



In [None]:
# Question: Median Imputation for Skewed Data
# Description: Handle missing values in columns with a skewed distribution using the median.

# Steps to follow:
# 1. Calculate median and fill NA: Use median() for skewed data and fillna() to handle missing values.



In [None]:
# Question: KNN Imputation
# Description: Use K-Nearest Neighbors to impute missing values in a dataset.

# Steps to follow:
# 1. Install and import required libraries: Use pip install sklearn if not already installed.
# 2. KNN Imputer: Use KNNImputer to fill in missing values.



In [None]:
# Question: Detecting and Handling Missing Categorical Data
# Description: Detect missing categorical data and handle it by filling with the next frequent category.

# Steps to follow:
# 1. Identify missing values in categorical data: Use the isnull() method on categorical columns.
# 2. Impute with next frequent category: Use the mode() method to choose the next frequent category.



In [None]:
# Question: Predictive Modeling for Imputation
# Description: Use a predictive model to impute missing values for a particular feature using other features.

# Steps to follow:
# 1. Partition the data: Split the dataset into train and test based on the presence of missing values.
# 2. Train a model: Use a regression model to predict missing values.
# 3. Impute missing values with predictions.




In [None]:
# Question: Handling Time Series Data with Forward and Backward Fill
# Description: Impute missing values in a time series dataset using forward and backward fill methods.

# Steps to follow:
# 1. Sort the data: Ensure the dataset is sorted by dates.
# 2. Use fillna() with method parameter: Apply ffill() and bfill() for forward and backward fill.



In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# -----------------------------
# Load Sample Data
# -----------------------------
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
    'Age': [25, np.nan, 30, 22, np.nan, 29],
    'Gender': ['F', 'M', np.nan, 'M', 'F', np.nan],
    'Income': [50000, 54000, 58000, np.nan, 62000, 61000],
    'Purchase': [1, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)
print("Original Data:\n", df)

# -----------------------------
# 1. Identify Missing Values
# -----------------------------
print("\nMissing Values in Each Column:\n", df.isnull().sum())

# -----------------------------
# 2. Drop Rows with Missing Values
# -----------------------------
df_drop_rows = df.dropna()
print("\nAfter Dropping Rows with Missing Values:\n", df_drop_rows)

# -----------------------------
# 3. Drop Columns with Missing Values
# -----------------------------
df_drop_cols = df.dropna(axis=1)
print("\nAfter Dropping Columns with Missing Values:\n", df_drop_cols)

# -----------------------------
# 4. Mean Imputation (for 'Age')
# -----------------------------
df_mean = df.copy()
df_mean['Age'].fillna(df_mean['Age'].mean(), inplace=True)
print("\nMean Imputation (Age):\n", df_mean)

# -----------------------------
# 5. Mode Imputation (for 'Gender')
# -----------------------------
df_mode = df.copy()
df_mode['Gender'].fillna(df_mode['Gender'].mode()[0], inplace=True)
print("\nMode Imputation (Gender):\n", df_mode)

# -----------------------------
# 6. Median Imputation (for 'Income')
# -----------------------------
df_median = df.copy()
df_median['Income'].fillna(df_median['Income'].median(), inplace=True)
print("\nMedian Imputation (Income):\n", df_median)

# -----------------------------
# 7. KNN Imputation
# -----------------------------
knn_df = df.copy()

# Convert categorical to numeric for KNN
knn_df['Gender'] = knn_df['Gender'].map({'F': 0, 'M': 1})
imputer = KNNImputer(n_neighbors=2)
knn_imputed = imputer.fit_transform(knn_df[['Age', 'Gender', 'Income']])
knn_df[['Age', 'Gender', 'Income']] = knn_imputed
print("\nKNN Imputation:\n", knn_df)

# -----------------------------
# 8. Detect and Handle Missing Categorical with Next Frequent
# -----------------------------
df_categorical = df.copy()
gender_counts = df_categorical['Gender'].value_counts()
next_frequent = gender_counts.index[1] if len(gender_counts) > 1 else gender_counts.index[0]
df_categorical['Gender'].fillna(next_frequent, inplace=True)
print("\nImpute Categorical with Next Frequent:\n", df_categorical)

# -----------------------------
# 9. Predictive Modeling Imputation (for 'Income')
# -----------------------------
df_pred = df.copy()

# Split dataset into known and missing
known = df_pred[df_pred['Income'].notnull()]
unknown = df_pred[df_pred['Income'].isnull()]

# Encode Gender
known['Gender'] = known['Gender'].map({'F': 0, 'M': 1})
unknown['Gender'] = unknown['Gender'].map({'F': 0, 'M': 1})

# Train a regression model
X_train = known[['Age', 'Gender']]
y_train = known['Income']
model = LinearRegression()
model.fit(X_train, y_train)

# Predict missing values
X_test = unknown[['Age', 'Gender']]
df_pred.loc[df_pred['Income'].isnull(), 'Income'] = model.predict(X_test)

print("\nPredictive Imputation:\n", df_pred)

# -----------------------------
# 10. Forward & Backward Fill (Time Series)
# -----------------------------
time_data = {
    'Date': pd.date_range(start='2025-01-01', periods=6, freq='D'),
    'Temperature': [30, np.nan, np.nan, 28, np.nan, 31]
}
df_time = pd.DataFrame(time_data).set_index('Date')

# Forward fill
df_ffill = df_time.fillna(method='ffill')
# Backward fill
df_bfill = df_time.fillna(method='bfill')

print("\nTime Series Forward Fill:\n", df_ffill)
print("\nTime Series Backward Fill:\n", df_bfill)


Original Data:
       Name   Age Gender   Income  Purchase
0    Alice  25.0      F  50000.0         1
1      Bob   NaN      M  54000.0         0
2  Charlie  30.0    NaN  58000.0         1
3    David  22.0      M      NaN         0
4      Eva   NaN      F  62000.0         1
5    Frank  29.0    NaN  61000.0         0

Missing Values in Each Column:
 Name        0
Age         2
Gender      2
Income      1
Purchase    0
dtype: int64

After Dropping Rows with Missing Values:
     Name   Age Gender   Income  Purchase
0  Alice  25.0      F  50000.0         1

After Dropping Columns with Missing Values:
       Name  Purchase
0    Alice         1
1      Bob         0
2  Charlie         1
3    David         0
4      Eva         1
5    Frank         0

Mean Imputation (Age):
       Name   Age Gender   Income  Purchase
0    Alice  25.0      F  50000.0         1
1      Bob  26.5      M  54000.0         0
2  Charlie  30.0    NaN  58000.0         1
3    David  22.0      M      NaN         0
4      Ev

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  known['Gender'] = known['Gender'].map({'F': 0, 'M': 1})
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  unknown['Gender'] = unknown['Gender'].map({'F': 0, 'M': 1})


ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values