# Preprocessing

**Preprocessing Techniques**

### 🔹 Missing Value Handling

### 1. **Drop Rows/Columns**

-   **Drop rows** with missing values:

    ``` python
    df.dropna(axis=0, inplace=True)
    ```

-   **Drop columns** with missing values:

    ``` python
    df.dropna(axis=1, inplace=True)
    ```
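
Dropping every row or column that contains a gap can throw away a lot of data. As a minimal sketch on a made-up toy frame (the column names and values are illustrative assumptions), `dropna` also accepts `thresh`, `subset`, and `how` to drop more selectively:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in different places
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": [50_000, 62_000, np.nan, 58_000],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})

# Keep only columns that have at least 2 non-missing values
df_cols = df.dropna(axis=1, thresh=2)

# Drop a row only when a key column is missing
df_rows = df.dropna(subset=["age"])

# Drop a row only when every value in it is missing
df_all = df.dropna(how="all")
```

Here `notes` is removed by the `thresh` rule because it has a single non-missing value, while `how="all"` keeps every row since none is completely empty.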

### 2. **Fill with Mean/Median/Mode**

-   **Mean**:

    ``` python
    df['age'] = df['age'].fillna(df['age'].mean())  # assign back; chained inplace is deprecated
    ```

-   **Median**:

    ``` python
    df['income'] = df['income'].fillna(df['income'].median())
    ```

-   **Mode**:

    ``` python
    df['gender'] = df['gender'].fillna(df['gender'].mode()[0])  # mode() can return ties; take the first
    ```
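
To see the three strategies side by side, here is a small worked example on invented data (all values are assumptions for illustration). The `income` column contains one very large value, which is the usual reason to prefer the median over the mean for skewed columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one gap per column
df = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],
    "income": [30_000.0, 35_000.0, np.nan, 1_000_000.0],
    "gender": ["F", "M", "F", None],
})

df["age"] = df["age"].fillna(df["age"].mean())              # mean of 20, 30, 40
df["income"] = df["income"].fillna(df["income"].median())   # median ignores the extreme value
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])  # most frequent category
```

The missing income is filled with 35,000 rather than the mean of roughly 355,000, which would be pulled up by the single large value.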

### 3. **Forward/Backward Fill**

-   **Forward Fill**: Fill missing values with the **last available
    value** (i.e., carry forward the previous observation).

    ``` python
    df.ffill(inplace=True)  # fillna(method='ffill') is deprecated in recent pandas
    ```

-   **Backward Fill**: Fill missing values with the **next available
    value** (i.e., fill using the value coming after the missing one).

    ``` python
    df.bfill(inplace=True)  # fillna(method='bfill') is deprecated in recent pandas
    ```
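
A short sketch on a made-up series shows what each direction does at the edges, and how the `limit` argument caps the number of consecutive gaps that get filled:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap, a two-step gap, and a trailing gap
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()         # leading NaN stays: nothing precedes it
backward = s.bfill()        # trailing NaN stays: nothing follows it
limited = s.ffill(limit=1)  # fill at most one consecutive gap
```

Forward fill cannot fill a gap at the start and backward fill cannot fill one at the end, so a combination of both (or a fallback value) is sometimes needed.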

### 4. **KNN or Model-Based Imputation**

-   **KNN Imputation**: This method uses **K-Nearest Neighbors (KNN)**
    to impute missing values by looking at the **similarity** of other
    data points. KNN identifies the closest rows (neighbors) and
    predicts the missing value based on them.

    ``` python
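    import pandas as pd
    from sklearn.impute import KNNImputer

    imputer = KNNImputer(n_neighbors=5)
    # fit_transform needs numeric columns and returns a NumPy array, so rebuild the DataFrame
    df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)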
    ```

-   **Model-Based Imputation**: You can use models (like **Random
    Forests** or **Linear Regression**) to predict missing values by
    training on the non-missing values and predicting for the missing
    ones.

    -   Example (using RandomForestRegressor for imputation):

        ``` python
        from sklearn.ensemble import RandomForestRegressor

        model = RandomForestRegressor()
        missing = df['target'].isnull()       # rows whose 'target' must be imputed
        features = df.drop('target', axis=1)  # assumes numeric features without NaNs
        model.fit(features[~missing], df.loc[~missing, 'target'])
        # write through .loc so the original frame is updated, not a copy
        df.loc[missing, 'target'] = model.predict(features[missing])
        ```
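
To make the neighbor idea concrete, here is a tiny sketch with invented numbers (the columns `x` and `y` are assumptions). With `n_neighbors=2`, the missing `y` is filled with the average `y` of the two rows whose `x` is closest:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: row 1 is missing 'y'
df = pd.DataFrame({
    "x": [1.0, 1.0, 10.0, 1.1],
    "y": [2.0, np.nan, 20.0, 2.2],
})

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)

# Rows 0 and 3 are nearest in 'x', so the gap becomes the mean of 2.0 and 2.2
filled_value = imputed.loc[1, "y"]
```

Because the method relies on distances, features on very different scales are usually scaled first so that one column does not dominate the neighbor search.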

### 🔹 Categorical Encoding

### 1. **Label Encoding**

``` python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['size_encoded'] = le.fit_transform(df['size'])
```

### 2. **One-Hot Encoding**

``` python
# assumes pandas is imported as pd; join the result so the dummy columns are kept
df = df.join(pd.get_dummies(df['color'], prefix='color'))
```

### 3. **Target Encoding**

-   **How** (with `category_encoders`):

``` python
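import category_encoders as ce
encoder = ce.TargetEncoder(cols=['gender'])
# fit_transform returns a DataFrame; fit on training data only to avoid target leakage
encoded = encoder.fit_transform(df[['gender']], df['target'])
df['gender_encoded'] = encoded['gender']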
```

### 4. **Frequency Encoding**

``` python
freq_map = df['city'].value_counts().to_dict()
df['city_encoded'] = df['city'].map(freq_map)
```
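
As a quick illustration of what two of these encodings produce, the sketch below runs frequency and one-hot encoding on a made-up frame (city and color values are assumptions). It uses relative frequencies via `normalize=True`, a small variation on the raw counts above:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "color": ["red", "blue", "red", "green"],
})

# Frequency encoding as a share of rows instead of a raw count
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))

# One-hot encoding: one 0/1 column per category, named with the prefix
dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)
df = pd.concat([df, dummies], axis=1)
```

Note that frequency encoding gives "Delhi" and "Mumbai" the same value, so it cannot distinguish categories that occur equally often.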

### 🔹 Feature Scaling / Normalization

### 1. **StandardScaler (Z-score Normalization)**

``` python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
```

### 2. **MinMaxScaler (Normalization to \[0, 1\])**

``` python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df)
```

### 3. **RobustScaler (Outlier-resistant)**

``` python
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df_scaled = scaler.fit_transform(df)
```

### 4. **PowerTransformer (Box-Cox / Yeo-Johnson)**

``` python
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')  # or 'box-cox' if all values > 0
df_transformed = pt.fit_transform(df)
```
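
The snippets above fit each scaler on the whole frame. When the data is later split for modeling, a common practice is to fit the scaler on the training split only and reuse its statistics on the test split. A minimal sketch with synthetic data (the array shape and split size are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features: 100 rows, 3 columns
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows
```

The same `fit` / `transform` split applies to `MinMaxScaler`, `RobustScaler`, and `PowerTransformer`.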

### 🔹 Outlier Treatment

### 1. **Z-score Method**

``` python
from scipy.stats import zscore
z = zscore(df['feature'])
df[abs(z) > 3]  # outliers on either tail (|z| > 3), not just large positive values
```

### 2. **IQR Method (Interquartile Range)**

``` python
Q1 = df['feature'].quantile(0.25)
Q3 = df['feature'].quantile(0.75)
IQR = Q3 - Q1
mask = (df['feature'] < Q1 - 1.5*IQR) | (df['feature'] > Q3 + 1.5*IQR)
df[mask]  # outliers
```
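
The two detection rules can disagree on small samples, because a single extreme value inflates the standard deviation that the z-score depends on. The sketch below compares them on invented data; the helper name `iqr_outlier_mask` is hypothetical, and the z-score is computed directly with pandas:

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Hypothetical sample with one obvious extreme value
s = pd.Series([10, 12, 11, 13, 12, 11, 100])

iqr_mask = iqr_outlier_mask(s)

# Population z-score (ddof=0), the same default scipy.stats.zscore uses
z = (s - s.mean()) / s.std(ddof=0)
z_mask = z.abs() > 3
```

On this sample the IQR rule flags the value 100, while its z-score stays below 3, so the z-score rule flags nothing.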

### 3. **Winsorization / Clipping / Capping**

-   **Winsorization** (clip at percentile):

    ``` python
    from scipy.stats.mstats import winsorize
    df['feature_wins'] = winsorize(df['feature'], limits=[0.05, 0.05])
    ```

-   **Clipping** (hard limit):

    ``` python
    # lower_bound / upper_bound are placeholders: set them from domain limits or percentiles
    df['feature'] = df['feature'].clip(lower=lower_bound, upper=upper_bound)
    ```
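
One way to choose the clipping bounds is to take them from percentiles of the column itself. A minimal sketch on made-up values, assuming the 5th and 95th percentiles are acceptable caps:

```python
import pandas as pd

# Hypothetical feature with one extreme value
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

lower = s.quantile(0.05)  # 5th percentile (linear interpolation)
upper = s.quantile(0.95)  # 95th percentile

clipped = s.clip(lower=lower, upper=upper)  # values outside the bounds are capped
```

Values inside the bounds are left unchanged; only the tails are pulled in to the chosen limits.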

### 🔹 Text Cleaning

### 1. **Lowercase + Remove Punctuation**

``` python
import string

text = text.lower()  # lowercase
text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
```

### 2. **Remove Stopwords**

``` python
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

words = [word for word in text.split() if word not in stop_words]
```

Run this once beforehand to download the stopword list:

``` python
import nltk
nltk.download('stopwords')
```

### 3. **Stemming / Lemmatization**

### **Stemming**

``` python
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
```

### **Lemmatization**

``` python
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
```

The lemmatizer needs the WordNet data, downloaded once:

``` python
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
```
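
The cleaning steps can be chained into one function. The sketch below uses only the standard library so it runs without any downloads; the tiny `STOP_WORDS` set and the function name `clean_text` are illustrative assumptions, and in practice NLTK's stopword list could be passed in instead:

```python
import string

# Tiny illustrative stopword set (an assumption, not NLTK's list)
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in stop_words]

tokens = clean_text("The cat, and the dog: friends in a house!")
```

Punctuation is removed before the stopword filter on purpose: otherwise a token such as `and,` would not match the stopword `and`.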

### 🔹 Datetime Conversion

### 1. **Convert to `datetime` Format**

``` python
import pandas as pd

df['date'] = pd.to_datetime(df['date'])  # auto-parses many formats
```

-   Got a custom format? Use `format`:

    ``` python
    df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')
    ```

### 2. **Handle Inconsistent Timestamps**

Real-world timestamps can be messy: mixed formats, missing values, and
timezone weirdness.

### Fixing mixed or garbage formats:

``` python
df['date'] = pd.to_datetime(df['date'], errors='coerce')  # invalid dates become NaT
```
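
After coercing, it helps to check how many values failed to parse before moving on. A small sketch with invented strings (one impossible date and one non-date):

```python
import pandas as pd

# Hypothetical raw column: a valid date, an impossible date, and junk text
raw = pd.Series(["2024-01-15", "2024-02-30", "not a date"])

parsed = pd.to_datetime(raw, errors="coerce")  # unparseable entries become NaT

n_bad = parsed.isna().sum()      # how many values could not be parsed
bad_values = raw[parsed.isna()]  # inspect the original strings that failed
years = parsed.dt.year           # .dt accessors work; NaT rows yield NaN
```

Looking at `bad_values` makes it easier to decide whether the failures are typos worth fixing or rows to drop.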

### Timezone handling:

``` python
df['date'] = df['date'].dt.tz_localize('UTC')            # localize naive timestamp
df['date'] = df['date'].dt.tz_convert('Asia/Kolkata')    # convert to the target timezone
```
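
The difference between the two calls is easy to miss: `tz_localize` attaches a timezone to naive timestamps without shifting the clock time, while `tz_convert` shifts the clock time to another zone for the same instant. A minimal sketch with made-up timestamps:

```python
import pandas as pd

# Hypothetical naive timestamps (no timezone attached)
naive = pd.Series(pd.to_datetime(["2024-01-01 00:00", "2024-06-01 12:30"]))

utc = naive.dt.tz_localize("UTC")       # same clock time, now labeled as UTC
ist = utc.dt.tz_convert("Asia/Kolkata")  # same instants, shown in UTC+05:30

same_instant = (ist == utc).all()       # aware timestamps compare by instant
```

Calling `tz_localize` on a column that is already timezone-aware raises a `TypeError`, so use `tz_convert` in that case.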