# Chapter 22: Data Cleaning & Preparation

Handling missing data, transformations, string manipulation, and preparing data for analysis



### Why Data Cleaning? (Slide 61)


<p>A commonly cited rule of thumb is that roughly <strong>80% of data analysis</strong> time is spent on loading, cleaning, and preparing data.</p>
<p><strong>Common Issues You'll Face:</strong></p>
<ul>
<li><strong>Missing values</strong>: NaN, None, empty strings, sentinel values</li>
<li><strong>Duplicates</strong>: repeated rows or entries</li>
<li><strong>Inconsistent formatting</strong>: 'NY' vs 'New York' vs 'new york'</li>
<li><strong>Outliers</strong>: data points far from the norm</li>
<li><strong>Wrong data types</strong>: numbers stored as strings</li>
<li><strong>Messy strings</strong>: extra whitespace, mixed case, typos</li>
</ul>
<p><strong>pandas provides tools for all of these!</strong></p>


> **Note:** Clean data → reliable analysis → correct decisions
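
Empty strings and sentinel values are not counted as missing until they are converted to NaN. The sketch below assumes a hypothetical frame in which `-999` and `''` stand in for missing entries; the column names and sentinel choices are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: -999 and '' are stand-ins for "missing"
raw = pd.DataFrame({
    "temp": [21.5, -999.0, 19.0, np.nan],
    "station": ["A", "", "B", None],
})

# Convert the sentinels to NaN so pandas treats them as missing
clean = raw.assign(
    temp=raw["temp"].replace(-999.0, np.nan),
    station=raw["station"].replace("", np.nan),
)

missing_before = raw.isna().sum()   # sentinels are not counted yet
missing_after = clean.isna().sum()  # sentinels now show up as missing
print(missing_before.to_dict())
print(missing_after.to_dict())
```

Comparing the two counts is a quick way to see how much missing data was hiding behind placeholder values.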


### Handling Missing Data: Filtering (Slide 62)


In [1]:
import pandas as pd
import numpy as np

# NaN = 'Not a Number' - pandas' missing data marker
# .isnull()          - True where value is NaN
# .notnull()         - True where value is NOT NaN
# .dropna()          - remove rows with ANY NaN
# .dropna(how='all') - remove rows only if ALL values are NaN
# .dropna(thresh=n)  - keep rows with at least n non-NaN values

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Drop rows with any NaN
print(data.dropna())       # Only row 0 survives

# Drop only if ALL values are NaN
print(data.dropna(how='all'))  # Row 2 dropped

# Keep rows with at least 2 non-NaN values
print(data.dropna(thresh=2))

# Drop columns instead of rows (no column here is entirely NaN, so nothing is dropped)
print(data.dropna(axis=1, how='all'))


     0    1    2
0  1.0  6.5  3.0
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
     0    1    2
0  1.0  6.5  3.0
3  NaN  6.5  3.0
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


> **Note:** dropna returns a NEW object; the original is unchanged
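
Often only some columns are required, and dropping every row with any NaN is too aggressive. A minimal sketch using the `subset` argument of `dropna`, on a hypothetical `orders` frame (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical orders table: 'customer' is required, 'coupon' is optional
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer": ["ann", None, "cy", "dee"],
    "coupon": [np.nan, "X1", np.nan, np.nan],
})

# Drop rows only when the required column is missing
required = orders.dropna(subset=["customer"])
print(required)

# The original frame still has all four rows
print(len(orders))
```

Here a plain `orders.dropna()` would discard every row, because each one is missing either a customer or a coupon.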


### Handling Missing Data: Filling (Slide 63)


In [2]:
# .fillna(value)              - replace NaN with a constant
# .fillna({'col': val, ...})  - different fill per column
# .ffill()                    - forward fill (propagate last valid value)
# .bfill()                    - backward fill
# .fillna(df.mean())          - fill with column means
# .interpolate()              - linear interpolation

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan

# Fill with constant
print(df.fillna(0))

# Different fill value per column
print(df.fillna({1: 0.5, 2: 0}))

# Forward fill (carry last valid value)
print(df.ffill())

# Limit how far to forward fill
print(df.ffill(limit=2))

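# Limit how far to backward fill (the NaNs here are trailing, with no later
# valid value to pull back, so bfill leaves them as NaN)
print(df.bfill(limit=2))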

# Fill with column means (very common!)
print(df.fillna(df.mean()))


          0         1         2
0 -1.259546  0.573200  0.157439
1 -0.756249 -0.169779  1.720001
2  0.188389  0.000000 -0.342271
3 -1.326782  0.000000 -1.001515
4 -1.687231  0.000000  0.000000
5  0.063503  0.000000  0.000000
          0         1         2
0 -1.259546  0.573200  0.157439
1 -0.756249 -0.169779  1.720001
2  0.188389  0.500000 -0.342271
3 -1.326782  0.500000 -1.001515
4 -1.687231  0.500000  0.000000
5  0.063503  0.500000  0.000000
          0         1         2
0 -1.259546  0.573200  0.157439
1 -0.756249 -0.169779  1.720001
2  0.188389 -0.169779 -0.342271
3 -1.326782 -0.169779 -1.001515
4 -1.687231 -0.169779 -1.001515
5  0.063503 -0.169779 -1.001515
          0         1         2
0 -1.259546  0.573200  0.157439
1 -0.756249 -0.169779  1.720001
2  0.188389 -0.169779 -0.342271
3 -1.326782 -0.169779 -1.001515
4 -1.687231       NaN -1.001515
5  0.063503       NaN -1.001515
          0         1         2
0 -1.259546  0.573200  0.157439
1 -0.756249 -0.169779  1.720001
2  0.188  [output truncated]

> **Note:** Filling with mean/median is a common imputation strategy
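
The cell lists `.interpolate()` but does not run it. The sketch below compares forward fill, linear interpolation, and median fill on a small made-up Series, so the difference between the strategies is visible side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a two-value gap in the middle
readings = pd.Series([1.0, np.nan, np.nan, 4.0])

carried = readings.ffill()                           # repeat the last valid value
interpolated = readings.interpolate()                # straight line across the gap
median_filled = readings.fillna(readings.median())   # one constant for every gap

print(pd.DataFrame({
    "ffill": carried,
    "interpolate": interpolated,
    "median": median_filled,
}))
```

Which strategy is appropriate depends on the data: interpolation assumes the values change smoothly between observations, while a median fill ignores ordering entirely.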


### Removing Duplicates (Slide 64)


In [3]:
# .duplicated()               - boolean: True for duplicate rows
# .drop_duplicates()          - remove duplicate rows
# .duplicated(subset=['col']) - check duplicates on specific columns
# keep='first'                - keep first occurrence (default)
# keep='last'                 - keep last occurrence
# keep=False                  - drop ALL duplicates

import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

print(data.duplicated())
# 0    False
# ...
# 6     True  ← duplicate of row 5

print(data.drop_duplicates())

# Check duplicates on specific column only
print(data.drop_duplicates(subset=['k1']))

# Keep last occurrence instead of first
print(data.drop_duplicates(subset=['k1', 'k2'], keep='last'))

# Drop ALL occurrences of duplicated rows
print(data.drop_duplicates(keep=False))


0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
    k1  k2
0  one   1
1  two   1
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
6  two   4
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3


> **Note:** Always check df.duplicated().sum() to see how many duplicates exist
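
Duplicates are often hidden by formatting differences, so counting them before and after normalizing can be informative. A small sketch, assuming a hypothetical `people` table keyed by email address:

```python
import pandas as pd

# Hypothetical table: rows 0 and 1 are the same person with messy formatting
people = pd.DataFrame({
    "email": ["a@x.com", "A@X.com ", "b@x.com"],
    "visits": [1, 2, 3],
})

# Exact comparison finds no duplicates
exact_dupes = int(people.duplicated(subset=["email"]).sum())

# Normalize whitespace and case, then count again
normalized = people.assign(email=people["email"].str.strip().str.lower())
dupes_after = int(normalized.duplicated(subset=["email"]).sum())

# Keep the most recent row for each email
deduped = normalized.drop_duplicates(subset=["email"], keep="last")
print(exact_dupes, dupes_after)
print(deduped)
```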


### Transforming Data: map & replace (Slide 65)


In [4]:
# series.map(dict_or_func) - transform values using a mapping
# .replace(old, new)       - replace specific values
# .replace([list], [list]) - replace multiple values at once
# .replace({old: new})     - replace using a dict

import pandas as pd
import numpy as np

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'pastrami', 'corned beef', 'bacon'],
                     'ounces': [4, 3, 12, 6, 7.5, 8]})

# Map food to animal using a dict
meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig',
                  'pastrami': 'cow', 'corned beef': 'cow'}
data['animal'] = data['food'].map(meat_to_animal)
print(data)

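# Replace specific values (returns a new Series; assign the result to keep it)
data['food'].replace('bacon', 'turkey bacon')

# Replace multiple values at once with a dict
data.replace({'bacon': 'turkey', 'pastrami': 'tofu'})

# Replace with regex (this last expression is the one displayed below)
data.replace(r'\bpork\b', 'chicken', regex=True)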


          food  ounces animal
0        bacon     4.0    pig
1  pulled pork     3.0    pig
2        bacon    12.0    pig
3     pastrami     6.0    cow
4  corned beef     7.5    cow
5        bacon     8.0    pig


             food  ounces animal
0           bacon     4.0    pig
1  pulled chicken     3.0    pig
2           bacon    12.0    pig
3        pastrami     6.0    cow
4     corned beef     7.5    cow
5           bacon     8.0    pig


> **Note:** Series.map takes a dict or function and returns NaN for unmapped values; replace works on both Series and DataFrame and leaves unmatched values unchanged
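
One practical difference between the two methods is what happens to values that are not in the mapping. The sketch below uses a made-up Series with an extra `'tofu'` entry to show it; the `'unknown'` fallback label is an illustrative choice.

```python
import pandas as pd

foods = pd.Series(["bacon", "pastrami", "tofu"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}

# map: values missing from the dict become NaN
mapped = foods.map(meat_to_animal)

# replace: values missing from the dict are left as they were
replaced = foods.replace(meat_to_animal)

# A common pattern: map, then fill the gaps with a fallback label
labeled = foods.map(meat_to_animal).fillna("unknown")

print(mapped.tolist())
print(replaced.tolist())
print(labeled.tolist())
```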


### Renaming Axes & Indexes (Slide 66)


In [5]:
# .rename(index={old: new})    - rename row labels
# .rename(columns={old: new})  - rename column labels
# .rename(str.upper)           - apply function to all labels
# .index.map(func)             - transform index labels
# All return NEW objects (unless inplace=True)

import pandas as pd
import numpy as np  # needed for np.arange below

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Transform index with a function
data.index = data.index.map(str.upper)
print(data.index)  # ['OHIO', 'COLORADO', 'NEW YORK']

# Rename specific labels (returns new DataFrame)
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})

# Rename with a function
data.rename(index=str.title, columns=str.upper)

# Rename in-place
# data.rename(columns={'one': 'first'}, inplace=True)


Index(['OHIO', 'COLORADO', 'NEW YORK'], dtype='str')


          ONE  TWO  THREE  FOUR
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11


> **Note:** rename returns a new object; assign the result (or pass inplace=True) to keep the change
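
A short check of that behaviour, using a hypothetical two-row frame: the renamed copy carries the new labels while the original keeps its own.

```python
import pandas as pd

# Hypothetical frame with lowercase row labels
df = pd.DataFrame({"one": [1, 2], "two": [3, 4]}, index=["ohio", "utah"])

# A function for the index and a dict for the columns, in one call
renamed = df.rename(index=str.title, columns={"one": "first"})

print(list(renamed.index), list(renamed.columns))
print(list(df.index), list(df.columns))  # original labels are untouched
```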


### Discretization & Binning (Slide 67)


In [6]:
# pd.cut(data, bins)      - bin continuous data into intervals
# pd.cut(data, n)         - cut into n equal-width bins
# pd.qcut(data, n)        - cut into n equal-SIZE bins (quantiles)
# labels=['a', 'b', ...]  - custom bin labels

import pandas as pd
import numpy as np

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

# Custom bin edges
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
print(cats)         # [(18, 25], (18, 25], (18, 25], (25, 35], ...]
print(cats.codes)   # [0 0 0 1 0 0 2 1 3 2 2 1]

# With custom labels
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

# Value counts of bins
print(cats.value_counts())

# Equal-width bins (4 bins from min to max)
pd.cut(ages, 4, precision=2)

# Quantile-based bins (equal NUMBER of points per bin)
data = np.random.randn(1000)
quartiles = pd.qcut(data, 4)  # 250 in each bin
print(quartiles.value_counts())


[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
[0 0 0 1 0 0 2 1 3 2 2 1]
(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
Name: count, dtype: int64
(-2.743, -0.715]     250
(-0.715, 0.00103]    250
(0.00103, 0.63]      250
(0.63, 3.04]         250
Name: count, dtype: int64


> **Note:** cut bins by value (explicit edges, or n equal-width bins); qcut bins by quantile (roughly equal counts per bin)
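
The difference shows up most clearly on skewed data. The sketch below uses a made-up Series with one large value, and also shows how `right=False` changes which bin an edge value falls into.

```python
import pandas as pd

# Hypothetical skewed data: one value far from the rest
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Two equal-width bins: almost everything lands in the first one
width_counts = pd.cut(values, 2).value_counts().sort_index().tolist()

# Two quantile bins: the points are split evenly
quantile_counts = pd.qcut(values, 2).value_counts().sort_index().tolist()

print(width_counts, quantile_counts)

# Bins are closed on the right by default, so 25 falls in (18, 25]
default_code = pd.cut([25], [18, 25, 35]).codes[0]

# With right=False the bins are [18, 25) and [25, 35), so 25 moves up
left_closed_code = pd.cut([25], [18, 25, 35], right=False).codes[0]

print(default_code, left_closed_code)
```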


### Detecting & Filtering Outliers (Slide 68)


In [7]:
# Common approach: values beyond ±3 standard deviations
# np.sign(data) - returns -1, 0, or 1 (useful for capping)
# .any(axis=1)  - True if ANY column exceeds threshold in that row
# .clip(lower, upper) - cap values at boundaries

import pandas as pd
import numpy as np

np.random.seed(42)
data = pd.DataFrame(np.random.randn(1000, 4))

# Find values beyond ±3 (the data is standard normal, so this is about 3 standard deviations)
print(data.describe())
col = data[2]
print(col[col.abs() > 3])  # Outliers in column 2

# Find rows with ANY column exceeding ±3
print(data[(data.abs() > 3).any(axis=1)])

# Cap (clip) values at ±3
data = data.clip(-3, 3)  # Values clamped to [-3, 3]
print(data.describe())   # min/max now within [-3, 3]

# Alternative: use np.sign to preserve direction
# (a no-op at this point, because data has already been clipped above)
data[data.abs() > 3] = np.sign(data) * 3


                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.030624     0.024828    -0.008255     0.030086
std       0.963919     1.011884     1.006075     1.006964
min      -3.019512    -2.896255    -3.241267    -2.991136
25%      -0.612942    -0.677037    -0.675299    -0.670871
50%       0.056187     0.020210    -0.007509     0.021158
75%       0.664881     0.693881     0.642282     0.695878
max       3.243093     3.852731     3.152057     3.926238
65    -3.241267
119    3.078881
995    3.152057
Name: 2, dtype: float64
            0         1         2         3
52   0.515048  3.852731  0.570891  1.135566
65  -0.926930 -0.059525 -3.241267 -1.024388
119  0.576557  0.311250  3.078881  1.119575
403  0.883110 -0.077837 -0.180480  3.193108
489 -2.135674  3.137749  1.056057  0.223239
506 -3.019512  0.183850  1.800511  1.238946
576  1.995667  3.109919  0.606723 -0.183197
723  0.768207  0.215397  0.508269  3.926238
929  3.243  [output truncated]

> **Note:** clip() is the cleanest way to cap outliers
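
The cell above can compare against a fixed ±3 because its data is standard normal. For data on another scale, one option is to standardize first. A minimal sketch on made-up data centred near 50, with one injected outlier; the seed, scale, and outlier value are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical measurements around 50 with a spread of about 5
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=5, size=1000)
values[10] = 120.0  # injected outlier for the demonstration
s = pd.Series(values)

# z-score: distance from the mean in units of standard deviation
z = (s - s.mean()) / s.std()
outliers = s[z.abs() > 3]
print(outliers)

# Cap at mean ± 3 standard deviations using clip
lower = s.mean() - 3 * s.std()
upper = s.mean() + 3 * s.std()
capped = s.clip(lower, upper)
print(capped.min(), capped.max())
```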


### Permutation & Random Sampling (Slide 69)


In [8]:
# np.random.permutation(n)   - shuffled array of [0, 1, ..., n-1]
# df.take(indices)           - select rows by integer position
# df.sample(n)               - random sample of n rows
# df.sample(frac=0.5)        - random 50% of rows
# df.sample(n, replace=True) - sample WITH replacement (bootstrap)

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(20).reshape((5, 4)))

# Shuffle rows using permutation
sampler = np.random.permutation(5)
print(sampler)        # e.g. [1 0 2 3 4]
print(df.take(sampler))

# Random sample (without replacement)
print(df.sample(n=3))

# Random fraction
print(df.sample(frac=0.6))  # 60% of rows

# Bootstrap sampling (with replacement)
choices = pd.Series([5, 7, -1, 6, 4])
print(choices.sample(n=10, replace=True))

# Reproducible sampling
print(df.sample(n=3, random_state=42))


[4 3 0 1 2]
    0   1   2   3
4  16  17  18  19
3  12  13  14  15
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
    0   1   2   3
2   8   9  10  11
3  12  13  14  15
1   4   5   6   7
    0   1   2   3
4  16  17  18  19
3  12  13  14  15
1   4   5   6   7
3    6
0    5
3    6
3    6
1    7
2   -1
1    7
0    5
3    6
1    7
dtype: int64
    0   1   2   3
1   4   5   6   7
4  16  17  18  19
2   8   9  10  11


> **Note:** Use random_state=N for reproducible random samples
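
One common use of reproducible sampling is a simple train/test split. The sketch below is one way to do it with `sample` and `drop`, assuming a toy frame with a unique index; the 80/20 ratio and the seed are arbitrary.

```python
import pandas as pd

# Toy frame with a unique default index
df = pd.DataFrame({"x": range(10)})

# Take 80% of the rows for training; the rest becomes the test set
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

# The same seed gives the same rows again
train_again = df.sample(frac=0.8, random_state=0)

print(len(train), len(test))
print(train.index.equals(train_again.index))
```

Dropping by index only works as intended when index labels are unique; otherwise reset the index first.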


### Dummy Variables (One-Hot Encoding) (Slide 70)


In [9]:
# pd.get_dummies(df['col'])   - one-hot encode a column
# prefix='X'                  - custom prefix for column names
# drop_first=True             - drop first category (avoid multicollinearity)
# Used to convert categorical data into numeric for ML models

import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# One-hot encode 'key' column
print(pd.get_dummies(df['key']))
# Recent pandas versions return boolean columns (see the output below);
# pass dtype=int to get 0/1 indicators instead

# With prefix and join back
dummies = pd.get_dummies(df['key'], prefix='key')
result = df[['data1']].join(dummies)
print(result)

# Drop first column to avoid multicollinearity
print(pd.get_dummies(df['key'], drop_first=True))

# Multiple columns at once
df2 = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['x', 'y', 'x']})
print(pd.get_dummies(df2))


       a      b      c
0  False   True  False
1  False   True  False
2   True  False  False
3  False  False   True
4   True  False  False
5  False   True  False
   data1  key_a  key_b  key_c
0      0  False   True  False
1      1  False   True  False
2      2   True  False  False
3      3  False  False   True
4      4   True  False  False
5      5  False   True  False
       b      c
0   True  False
1   True  False
2  False  False
3  False   True
4  False  False
5   True  False
     A_a    A_b    B_x    B_y
0   True  False   True  False
1  False   True  False   True
2   True  False   True  False


> **Note:** drop_first=True prevents the 'dummy variable trap' in regression
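
When a model needs 0/1 integers rather than booleans, `get_dummies` accepts a `dtype` argument, and its `columns` argument encodes selected columns while keeping the rest of the frame. A small sketch on a frame shaped like the one above:

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "a", "c"], "data1": [0, 1, 2]})

# Encode only 'key', keep 'data1', and ask for integer indicators
encoded = pd.get_dummies(df, columns=["key"], dtype=int)
print(encoded)
```

This avoids the separate `join` step when the goal is simply to replace a categorical column with its indicator columns.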


### String Methods: Basics (Slide 71)


In [10]:
# Series.str - access vectorized string methods
# .str.lower()       - lowercase
# .str.upper()       - uppercase
# .str.title()       - title case
# .str.strip()       - remove leading/trailing whitespace
# .str.split(sep)    - split into list
# .str.replace(a, b) - replace substring
# .str.len()         - length of each string
# .str.contains(pat) - boolean: contains pattern?
# Missing values pass through as missing, so no manual null checks are needed

import pandas as pd
import numpy as np

data = pd.Series(['  dave ', 'steve', 'rob', 'wes', np.nan])

print(data.str.strip())          # Remove whitespace
print(data.str.upper())          # UPPERCASE
print(data.str.contains('e'))    # True, True, False, True; missing entry is False below (NaN with object dtype)
print(data.str.len())            # [7 5 3 3 NaN]; '  dave ' counts its 3 spaces

# Split and access parts
data2 = pd.Series(['a_b_c', 'd_e_f', np.nan, 'g_h_i'])
print(data2.str.split('_'))      # Lists of parts
print(data2.str.split('_').str[1])  # Second element: b, e, NaN, h

# Replace
print(data.str.replace(' ', '_'))


0     dave
1    steve
2      rob
3      wes
4      NaN
dtype: str
0      DAVE 
1      STEVE
2        ROB
3        WES
4        NaN
dtype: str
0     True
1     True
2    False
3     True
4    False
dtype: bool
0    7.0
1    5.0
2    3.0
3    3.0
4    NaN
dtype: float64
0    [a, b, c]
1    [d, e, f]
2          NaN
3    [g, h, i]
dtype: object
0      b
1      e
2    NaN
3      h
dtype: object
0    __dave_
1      steve
2        rob
3        wes
4        NaN
dtype: str


> **Note:** str methods pass NaN through untouched, so no manual null checks are needed
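
How the missing entry appears in a `str.contains` result can vary with the pandas version and string dtype, which matters when the result is used as a filter. Passing `na=False` makes the choice explicit. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

names = pd.Series(["  dave ", "steve", np.nan])

# na=False treats missing entries as "no match", giving a clean boolean mask
has_e = names.str.contains("e", na=False)
selected = names[has_e].str.strip().tolist()

# Lengths after stripping; the missing entry stays missing
lengths = names.str.strip().str.len()

print(has_e.tolist())
print(selected)
print(lengths.tolist())
```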


### String Methods: Advanced (Slide 72)


In [11]:
# .str.startswith(pat)    - starts with pattern?
# .str.endswith(pat)      - ends with pattern?
# .str.findall(regex)     - find all regex matches
# .str.match(regex)       - match regex at start of string
# .str.extract(regex)     - extract groups into DataFrame columns
# .str.get_dummies(sep)   - one-hot encode delimited strings
# .str.cat(sep=',')       - concatenate all strings
# .str.pad(width)         - pad strings to fixed width
# .str.slice(start, stop) - slice each string

import pandas as pd

# Extract structured data with regex
data = pd.Series(['Dave dave@google.com', 'Steve steve@gmail.com',
                  'Rob rob@outlook.com'])

# Extract email addresses
emails = data.str.findall(r'[\w.]+@[\w.]+')
print(emails)

# Extract into columns with named groups
pattern = r'(?P<name>\w+)\s+(?P<email>[\w.]+@[\w.]+)'
print(data.str.extract(pattern))

# One-hot from pipe-separated values
s = pd.Series(['a|b', 'b|c', 'a|c|b'])
print(s.str.get_dummies(sep='|'))


0    [dave@google.com]
1    [steve@gmail.com]
2    [rob@outlook.com]
dtype: object
    name            email
0   Dave  dave@google.com
1  Steve  steve@gmail.com
2    Rob  rob@outlook.com
   a  b  c
0  1  1  0
1  0  1  1
2  1  1  1


> **Note:** str.extract with named groups is great for parsing structured text
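
Real text rarely matches on every row. The sketch below shows what `str.extract` returns when a row has no match, using a made-up Series and illustrative group names `user` and `domain`.

```python
import pandas as pd

contacts = pd.Series(["Dave dave@google.com", "no email here"])

# Named groups become column names; rows without a match are filled with NaN
parts = contacts.str.extract(r"(?P<user>[\w.]+)@(?P<domain>[\w.]+)")
print(parts)

# Rows where parsing failed can be listed for manual review
unparsed = contacts[parts["user"].isna()]
print(unparsed.tolist())
```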


### Regular Expressions in pandas (Slide 73)


In [12]:
# pandas str methods accept the regex syntax of Python's re module
# re.findall(pattern, string)   - all matches
# re.search(pattern, string)    - first match
# re.sub(pattern, repl, string) - substitute
# re.split(pattern, string)     - split on pattern
# re.compile(pattern)           - precompile for speed

import re
import pandas as pd

# Common regex patterns:
# \d+    - one or more digits
# \w+    - one or more word characters
# \s+    - one or more whitespace
# [A-Z]  - uppercase letter
# .      - any character
# ^...$  - start to end of string
# (...)  - capture group

text = 'foo    bar\t baz  \tqux'
print(re.split(r'\s+', text))  # ['foo', 'bar', 'baz', 'qux']

# Compile pattern for reuse (faster)
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
phones = pd.Series(['555-123-4567', 'no phone', '555-987-6543'])
print(phones.str.contains(pattern))
print(phones.str.findall(pattern))


['foo', 'bar', 'baz', 'qux']
0     True
1    False
2     True
dtype: bool
0    [555-123-4567]
1                []
2    [555-987-6543]
dtype: object


> **Note:** Compile patterns with re.compile() when used repeatedly
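
The vectorized counterpart of `re.sub` is `str.replace` with `regex=True`. The sketch below normalizes made-up phone numbers to the `ddd-ddd-dddd` layout that the pattern above expects; the input formats are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical phone numbers in inconsistent formats
phones = pd.Series(["(555) 123-4567", "555.987.6543"])

# Step 1: strip everything that is not a digit
digits = phones.str.replace(r"\D", "", regex=True)

# Step 2: regroup the ten digits using capture groups
formatted = digits.str.replace(
    r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3", regex=True
)
print(formatted.tolist())
```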


### Categoricals (Slide 74)


In [13]:
# pd.Categorical(values)    - create categorical type
# .astype('category')       - convert column to categorical
# .cat.codes                - integer codes for each category
# .cat.categories           - the unique categories
# .cat.set_categories(new)  - change the set of categories
# .cat.rename_categories()  - rename categories
# Saves HUGE memory on repeated string values!

import pandas as pd
import numpy as np

values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# Convert to categorical (saves memory!)
cat_values = values.astype('category')
print(cat_values.dtype)       # category
print(cat_values.cat.codes)   # [0 1 0 0 0 1 0 0]
print(cat_values.cat.categories)  # ['apple', 'orange']

# Memory savings on large datasets
N = 10_000_000
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))
cat_labels = labels.astype('category')
print(f'String: {labels.memory_usage():,} bytes')
print(f'Category: {cat_labels.memory_usage():,} bytes')
# Category uses ~87% less memory in this run (memory_usage(deep=True) would also count the string contents)


category
0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int8
Index(['apple', 'orange'], dtype='str')


String: 80,000,132 bytes
Category: 10,000,164 bytes


> **Note:** Use category dtype for columns with few unique values; the memory savings can be large
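
By default `memory_usage()` may not count the contents of Python string objects, so a fairer comparison passes `deep=True`. A scaled-down sketch of the measurement above (4,000 rows instead of 10 million); exact byte counts depend on the pandas version and string dtype, so none are shown.

```python
import pandas as pd

# Scaled-down version of the labels Series: 4 unique values, 4,000 rows
labels = pd.Series(["foo", "bar", "baz", "qux"] * 1000)
as_category = labels.astype("category")

# deep=True includes the memory held by the string values themselves
string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"String:   {string_bytes:,} bytes")
print(f"Category: {category_bytes:,} bytes")
print(f"Unique categories: {len(as_category.cat.categories)}")
```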


### Practical Example: Cleaning Messy Data (Slide 75)


In [14]:
# Real-world cleaning pipeline combining multiple techniques

import pandas as pd
import numpy as np

# Messy input data
df = pd.DataFrame({
    'name':  ['  Alice ', 'BOB', 'charlie', 'Alice ', 'bob', None],
    'age':   ['25', '30', 'unknown', '25', '30', '28'],
    'city':  ['NY', 'new york', 'NYC', 'ny', 'New York', 'LA'],
    'score': [85, 92, np.nan, 85, 92, 78]
})

# Step 1: Clean strings
df['name'] = df['name'].str.strip().str.title()

# Step 2: Standardize city names
city_map = {'ny': 'New York', 'nyc': 'New York', 'new york': 'New York',
            'la': 'Los Angeles'}
df['city'] = df['city'].str.lower().map(city_map)

# Step 3: Convert age to numeric (errors → NaN)
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# Step 4: Drop duplicates, fill missing
df = df.drop_duplicates(subset=['name', 'age'])
df['score'] = df['score'].fillna(df['score'].median())
df['age'] = df['age'].fillna(df['age'].median())
print(df)


      name   age         city  score
0    Alice  25.0     New York   85.0
1      Bob  30.0     New York   92.0
2  Charlie  28.0     New York   85.0
5      NaN  28.0  Los Angeles   78.0


> **Note:** pd.to_numeric(errors='coerce') converts bad values to NaN safely
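
Coercion is silent, so it can help to list which original values failed to parse before filling them in. A minimal sketch on a made-up `age` Series, which separates values that were already missing from values that coercion turned into NaN:

```python
import pandas as pd

# Hypothetical raw ages: one unparseable value and one genuinely missing value
age = pd.Series(["25", "30", "unknown", None, "28"])

numeric = pd.to_numeric(age, errors="coerce")

# Present in the raw data but NaN after conversion: these failed to parse
failed = age[numeric.isna() & age.notna()]

print(numeric.tolist())
print(failed.tolist())
```

Reviewing `failed` before imputing makes it easier to spot values such as `'unknown'` that may deserve their own handling.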


### Data Cleaning Checklist (Slide 76)


<p><strong>🔍 Inspect:</strong></p>
<ul>
<li><code>df.shape</code>, <code>df.dtypes</code>, <code>df.describe()</code></li>
<li><code>df.isnull().sum()</code>: count missing per column</li>
<li><code>df.duplicated().sum()</code>: count duplicate rows</li>
</ul>
<p><strong>🧹 Clean:</strong></p>
<ul>
<li><code>dropna()</code> / <code>fillna()</code>: handle missing data</li>
<li><code>drop_duplicates()</code>: remove duplicates</li>
<li><code>str.strip().str.lower()</code>: normalize strings</li>
<li><code>replace()</code> / <code>map()</code>: standardize values</li>
<li><code>pd.to_numeric(errors='coerce')</code>: fix wrong types</li>
</ul>
<p><strong>🔧 Transform:</strong></p>
<ul>
<li><code>pd.cut()</code> / <code>pd.qcut()</code>: bin continuous values</li>
<li><code>get_dummies()</code>: one-hot encode categories</li>
<li><code>clip()</code>: cap outliers</li>
<li><code>.astype('category')</code>: save memory on repeated strings</li>
</ul>


> **Note:** Run this checklist on every new dataset before analysis
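
The inspection step of the checklist can be bundled into a small helper. The function name `quick_report` and the sample frame below are hypothetical; this is a sketch of one possible summary, not a standard pandas utility.

```python
import numpy as np
import pandas as pd

def quick_report(df):
    """Summarize shape, missing values per column, and duplicate rows."""
    return {
        "shape": df.shape,
        "missing": df.isna().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
    }

# Hypothetical sample: rows 0 and 1 are identical, and 'a' has one NaN
sample = pd.DataFrame({"a": [1, 1, np.nan], "b": ["x", "x", "y"]})
report = quick_report(sample)
print(report)
```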
