# Handling Missing Values in Data Science
This notebook demonstrates practical handling of missing data using pandas.
We'll work with a dummy CSV of 200 rows and explore:
1. dropna() and its variations
2. fillna() and its variations

## Step 1. Read the CSV Dataset
We'll read a dataset of 200 customer records with some missing values.

In [2]:
import numpy as np
import pandas as pd
# Load the customer data from the CSV file
df = pd.read_csv('customer_data.csv')

## Step 2. Understanding Missing Values
Missing values can occur due to incomplete surveys, sensor failures, or data merging issues. Let's check how many we have.

In [3]:
# Check for missing values
df.isnull().sum()

Customer_ID         0
Age                 6
Gender             25
Income             17
Purchase_Amount     7
Region              7
dtype: int64
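
Raw counts are easier to judge as a share of the rows. The following is a minimal sketch on a small made-up frame (the `toy` name and its values are illustrative assumptions, not the customer dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the customer dataset
toy = pd.DataFrame({
    "Age": [25, np.nan, 40, 31],
    "Gender": ["F", None, None, "M"],
})

# isna() gives a boolean mask; the column mean of that mask is the
# fraction of missing values, scaled here to a percentage
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```

The same expression applied to `df` would show which columns carry the most missing data.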

## Step 3. Dropping Missing Values
Various ways to drop missing data:
- `dropna()` removes any row with at least one NaN.
- `dropna(axis=1)` removes any column with at least one NaN.
- `dropna(thresh=n)` keeps only rows with at least n non-NaN values.
- `dropna(subset=[cols])` drops a row only if it has a NaN in one of the listed columns.
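
The four variations above can be compared side by side on a tiny frame. This is a sketch under assumed data: the `toy` frame and its column names are hypothetical and chosen so that each call keeps a different set of rows or columns.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row frame for illustration only
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 8.0],
})

rows_any = toy.dropna()                  # only the fully populated row survives
cols_any = toy.dropna(axis=1)            # only the 'id' column has no NaN
rows_thresh = toy.dropna(thresh=2)       # rows with at least 2 non-NaN values
rows_subset = toy.dropna(subset=["a"])   # NaN in 'b' is ignored

for name, frame in [("any", rows_any), ("axis=1", cols_any),
                    ("thresh=2", rows_thresh), ("subset=['a']", rows_subset)]:
    print(name, frame.shape)
```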

In [4]:
# Check the dimensions of the dataset
print("Original Shape:", df.shape)

Original Shape: (200, 6)


In [5]:
# Drop any row with a missing value
df_drop_any = df.dropna()
# Check the dimensions of the dataset
print("After dropna():", df_drop_any.shape)

After dropna(): (142, 6)


In [6]:
# Drop columns with missing values
df_drop_col = df.dropna(axis=1)
# Check the dimensions of the dataset
print("After axis=1:", df_drop_col.shape)

After axis=1: (200, 1)


In [8]:
# Drop rows with fewer than 4 non-null values
df_thresh = df.dropna(thresh=4)
# Check the dimensions of the dataset
print("After thresh=4:", df_thresh.shape)

After thresh=4: (200, 6)


In [9]:
# Drop rows only if 'Income' or 'Purchase_Amount' are missing
df_subset = df.dropna(subset=['Income', 'Purchase_Amount'])

# Check the dimensions of the dataset
print("After subset=['Income','Purchase_Amount']:", df_subset.shape)

After subset=['Income','Purchase_Amount']: (177, 6)


## Step 4. Filling Missing Values
Instead of dropping, we can impute missing values.
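- `fillna(value)` replaces NaNs with a constant (a scalar, or a dict mapping column names to values).
- `ffill()` forward fills: each NaN takes the last valid value before it. The older spelling `fillna(method='ffill')` is deprecated as of pandas 2.1.
- `bfill()` backward fills: each NaN takes the next valid value after it.
- Fill a column with a summary statistic such as its mean, median, or mode.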
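
Mode imputation is listed above but not used in the cells that follow, so here is a minimal sketch of it. The `toy` frame and the `Region_filled` column are hypothetical names introduced only for this example.

```python
import pandas as pd

# Hypothetical categorical column with two missing entries
toy = pd.DataFrame({"Region": ["East", None, "West", "East", None]})

# mode() returns a Series because ties can yield several values;
# take the first entry as the fill value
most_common = toy["Region"].mode().iloc[0]
toy["Region_filled"] = toy["Region"].fillna(most_common)
print(toy)
```

The mode tends to suit categorical columns, where a mean or median is not defined.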

In [None]:
# Fill with constant
df_const = df.fillna({'Gender': 'Unknown', 'Region': 'Unknown'})

In [None]:
# Forward fill: propagate the last valid value down each column
df_ffill = df.ffill()

In [None]:
# Backward fill: pull the next valid value up each column
df_bfill = df.bfill()
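
Forward and backward fill differ in which gaps they can close. A small sketch on an assumed Series (the values are made up) shows that a trailing NaN survives a backward fill because no later value exists to copy:

```python
import numpy as np
import pandas as pd

# Hypothetical series with an interior gap and a trailing gap
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()    # each NaN copies the value above it
backward = s.bfill()   # each NaN copies the value below it, if any

print(pd.DataFrame({"original": s, "ffill": forward, "bfill": backward}))
```

By the same logic, a leading NaN would survive a forward fill.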

In [None]:
# Fill numeric columns with mean/median
df_numfill = df.copy()


In [None]:
# Fill missing Age values with the column mean
df_numfill['Age'] = df['Age'].fillna(df['Age'].mean())


In [None]:
# Fill missing Income values with the column median
df_numfill['Income'] = df['Income'].fillna(df['Income'].median())


In [None]:
# Display first few rows of data
df_numfill.head()

## Step 5. Context in Data Science
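Choosing between `dropna` and `fillna` depends on context:
- **dropna()**: When few rows are affected and the values appear to be missing at random, so dropping them loses little information.
- **fillna()**: When the rows are too valuable to discard and a plausible replacement exists (e.g., mean, median, forward fill).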
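
The trade-off can be made concrete on a small example. The sketch below uses an assumed toy frame (values are made up) and compares dropping against median imputation: dropping loses rows, while imputing keeps every row but narrows the spread of the filled column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of six incomes are missing
toy = pd.DataFrame({
    "Income": [30000.0, np.nan, 50000.0, 70000.0, np.nan, 65000.0],
    "Age": [25, 31, 40, 22, 58, 45],
})

dropped = toy.dropna(subset=["Income"])
imputed = toy.assign(Income=toy["Income"].fillna(toy["Income"].median()))

print("rows kept:", len(dropped), "vs", len(imputed))
# std() skips NaN, so the first value describes the observed incomes only
print("Income std:", toy["Income"].std(), "vs", imputed["Income"].std())
```

Filling every gap with one central value pulls the standard deviation down, which is worth keeping in mind when the column feeds a model that is sensitive to variance.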

## Step 6. Summary Table
| Method | Description | Best Use Case |
|---------|-------------|----------------|
| dropna() | Remove rows with missing data | Sparse missingness |
| dropna(axis=1) | Remove columns with missing data | Unusable features |
| dropna(thresh=n) | Keep rows with at least n non-NaN entries | Flexible cleaning |
| dropna(subset=[cols]) | Drop rows with NaN in the listed columns | Key columns must be present |
| fillna(value) | Replace NaN with constant | Categorical/known values |
| ffill() | Forward fill | Time series data |
| bfill() | Backward fill | Sequential data |
| fillna(mean/median) | Statistical imputation | Numeric features |

## Dealing with outliers

In [11]:
# First and third quartiles of Purchase_Amount
Q1 = df['Purchase_Amount'].quantile(0.25)
Q3 = df['Purchase_Amount'].quantile(0.75)
# Interquartile range (IQR)
IQR = Q3 - Q1

# Lower and upper bounds: 1.5 * IQR beyond the quartiles
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")


Lower Bound: -1500.0, Upper Bound: 6500.0
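
To see the rule flag a point, here is a minimal sketch on an assumed Series with one obviously extreme value (the data is made up and unrelated to `Purchase_Amount`):

```python
import pandas as pd

# Hypothetical values; 95 sits far from the rest
s = pd.Series([10, 12, 11, 13, 12, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of points outside [lower, upper]
is_outlier = (s < lower) | (s > upper)
outliers = s[is_outlier]
print(outliers)
```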


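## np.where syntax
`np.where(condition, value_if_true, value_if_false)` evaluates the condition element by element and picks `value_if_true` where it holds and `value_if_false` elsewhere.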
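
A minimal sketch of the call on an assumed array (the `values` and `labels` names are illustrative):

```python
import numpy as np

# Hypothetical input array
values = np.array([5, 20, 35])

# Element-wise choice: "high" where the condition is True, "low" elsewhere
labels = np.where(values > 10, "high", "low")
print(labels)
```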

In [12]:

# (a) Cap values outside the IQR bounds to the nearest bound
df['Capped_IQR'] = np.where(df['Purchase_Amount'] > upper_bound, upper_bound,
                            # Otherwise raise values below the lower bound; keep the rest unchanged
                            np.where(df['Purchase_Amount'] < lower_bound, lower_bound, df['Purchase_Amount']))

df
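
As an alternative to nesting `np.where`, pandas offers `Series.clip`, which caps both ends in one call. This is a different technique from the one used in the cell above, sketched on made-up values with the bounds printed earlier; it is meant as a comparison, not a replacement for the notebook's approach.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts, including a missing value
s = pd.Series([-2000.0, 1500.0, np.nan, 9000.0])
lower, upper = -1500.0, 6500.0  # bounds reused from the printed output

# Nested np.where, as in the notebook cell
nested = np.where(s > upper, upper, np.where(s < lower, lower, s))

# Equivalent capping with Series.clip; NaN stays NaN in both versions
clipped = s.clip(lower=lower, upper=upper)

print(nested)
print(clipped.to_numpy())
```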

In [14]:
# (b) Cap high outliers to the largest observed value within the upper bound
highest_valid = df.loc[df['Purchase_Amount'] <= upper_bound, 'Purchase_Amount'].max()
# Replace values above the upper bound; leave everything else unchanged
df['Capped_Max'] = np.where(df['Purchase_Amount'] > upper_bound, highest_valid, df['Purchase_Amount'])

df



| | Customer_ID | Age | Gender | Income | Purchase_Amount | Region | Capped_IQR | Capped_Max |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 55.0 | Male | 70000.0 | 2400.0 | West | 2400.0 | 2400.0 |
| 1 | 2 | 45.0 | Male | 65000.0 | 1900.0 | North | 1900.0 | 1900.0 |
| 2 | 3 | 31.0 | Male | 30000.0 | 2400.0 | East | 2400.0 | 2400.0 |
| 3 | 4 | 59.0 | Female | 50000.0 | 1400.0 | NaN | 1400.0 | 1400.0 |
| 4 | 5 | 24.0 | Female | 70000.0 | 4000.0 | East | 4000.0 | 4000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | 196 | 42.0 | Female | 60000.0 | 2500.0 | West | 2500.0 | 2500.0 |
| 196 | 197 | 48.0 | Female | 70000.0 | 3300.0 | South | 3300.0 | 3300.0 |
| 197 | 198 | 22.0 | Female | 75000.0 | 4600.0 | South | 4600.0 | 4600.0 |
| 198 | 199 | 48.0 | NaN | 75000.0 | 2000.0 | South | 2000.0 | 2000.0 |
