## Import Libs

In [1]:
import pandas as pd
import numpy as np

# fillna()

### doc

**`fillna()`**

`fillna()` is a pandas method, available on both Series and DataFrame, that fills NA/NaN values with a given value or by a given fill method.

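**Usage:** `df.fillna(value)` on a DataFrame or `s.fillna(value)` on a Series; the remaining parameters below are optional.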

**Parameters:**

- **`value : scalar, dict, Series, or DataFrame`**
  * Value to use to fill holes (e.g., 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.

- **`method: {'backfill', 'bfill', 'pad', 'ffill', None}, default None`**
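  * Method to use for filling holes in reindexed Series. `pad` / `ffill` propagates the last valid observation forward; `backfill` / `bfill` uses the next valid observation to fill the gap. Note: pandas 2.1 and later deprecate this parameter in favor of the dedicated `ffill()` and `bfill()` methods.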

- **`axis: {0 or 'index', 1 or 'columns'}`**
  * Axis along which to fill missing values.

- **`inplace: bool, default False`**
  * If True, fill in-place. Note: this will modify any other views on this object.

- **`limit: int, default None`**
  * If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.

- **`downcast: dict, default is None`**
  * A dictionary of item->dtype of what to downcast if possible, or the string 'infer' which will try to downcast to an appropriate equal type.

**Returns:**
DataFrame or Series with the missing values filled, or `None` if `inplace=True`.
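The parameters above can be illustrated with a minimal sketch. The frame `df` and its columns `a` and `b` are made up for this example and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only to illustrate the parameters
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, 2.0, np.nan, np.nan],
})

# Scalar value: every NaN in every column is replaced by 0
filled_scalar = df.fillna(0)

# Dict value: one fill value per column; columns not in the dict stay untouched
filled_dict = df.fillna({"a": -1.0})

# limit without a method: fill at most one NaN in the Series
filled_limit = df["a"].fillna(0, limit=1)

# Forward fill via the dedicated method (the replacement for method='ffill')
filled_ffill = df.ffill()

print(filled_scalar)
print(filled_dict)
print(filled_limit)
print(filled_ffill)
```

None of these calls uses `inplace=True`, so each returns a new object and `df` itself keeps its NaN values.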


## categorical data

### mode()


#### doc
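- part of the pandas library
- finds the mode (most frequently occurring value)
    - of a pandas Series
    - or of each column of a DataFrame
- returns a Series (or DataFrame) rather than a single value, because several values can be tied for most frequent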
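A small sketch of how `mode()` behaves with ties may help here. The Series `s` and the frame `df` below are hypothetical examples, not the notebook's data.

```python
import numpy as np
import pandas as pd

# 'a' and 'b' both occur twice, so there are two modes; NaN is ignored by default
s = pd.Series(["b", "a", "b", "a", "c", np.nan])
modes = s.mode()       # Series holding all tied modes, in sorted order
first_mode = modes[0]  # the first of the tied values

# On a DataFrame, mode() works column by column; row 0 holds the first mode of each column
df = pd.DataFrame({"x": [1, 1, 2], "y": ["u", "v", "v"]})
col_modes = df.mode().iloc[0]

print(modes.tolist())
print(first_mode)
print(col_modes)
```

Because the tied modes come back sorted, taking index 0 is deterministic, but it is still an arbitrary choice among equally frequent values.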

#### Cookbook recipe

1. define columns
2. find mode
3. fill NA/NaN with mode
    - If several values are tied for most frequent, `mode()` returns all of them in sorted order; we simply pick the first (index = 0)

In [2]:
# Create simple dictionary with some missing values
data = {
    'Name': ['John', 'Anna', np.nan, 'Linda', 'John'],
    'Type': ['Type1', 'Type2', 'Type2', np.nan, 'Type1'],
    'Country': ['Country1', np.nan, 'Country2', 'Country1', 'Country2'],
}

# Convert dictionary to pandas DataFrame
data = pd.DataFrame(data)

print("Original DataFrame:")
print(data)

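# Fill each column's missing values with that column's mode
for column in ['Name', 'Type', 'Country']:
    mode = data[column].mode()[0]  # mode() may return several tied values; take the first
    # Assign back instead of calling fillna(..., inplace=True) on data[column]:
    # the chained in-place form may act on a temporary copy and leave data unchanged
    data[column] = data[column].fillna(mode)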

print("\nDataFrame after filling NA values with mode:")
print(data)


Original DataFrame:
    Name   Type   Country
0   John  Type1  Country1
1   Anna  Type2       NaN
2    NaN  Type2  Country2
3  Linda    NaN  Country1
4   John  Type1  Country2

DataFrame after filling NA values with mode:
    Name   Type   Country
0   John  Type1  Country1
1   Anna  Type2  Country1
2   John  Type2  Country2
3  Linda  Type1  Country1
4   John  Type1  Country2


## numerical data

### median()

#### doc

**`np.median()`**

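`np.median()` is a NumPy function that computes the **median**, or middle value, of a set of numbers. It does not skip missing values: if the input contains NaN, the result is NaN. The recipe below therefore uses the pandas method `Series.median()`, which ignores NaN by default.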

*Parameters:*

- **`a`** (_array_like_): The input array or object that can be converted into an array. This collection of numbers is what the median is computed from.

- **`axis`** ({_int, sequence of int, None_}, optional): The axis or axes along which the medians are computed. The default is `None`, meaning that the median is computed from the entire array.

- **`out`** (_ndarray_, optional): An alternative output array in which to place the result. It must have the same shape and buffer length as the expected output, but the type will be cast if necessary.

- **`overwrite_input`** (_bool_, optional): `False` by default, which leaves the input unchanged. If set to `True`, the input array `a` may be modified by intermediate calculations to save memory, so its contents should not be relied on afterwards.

- **`keepdims`** (_bool_, optional): If set to `True`, the reduced axes are kept in the result as dimensions of size one. With this option, the result broadcasts correctly against the input array.

*Returns:*

  Returns a new array holding the median result. If the input array has an even number of elements, it returns the average of the two middle elements. If the input array has an odd number of elements, it returns the exact middle element.

*Raises:*

- **`TypeError`**: If the input array is not numerical and cannot be converted into a numerical array.
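The following sketch contrasts `np.median()` with two NaN-aware alternatives and shows the `axis` and `keepdims` parameters. The arrays `values` and `grid` are made-up examples.

```python
import numpy as np
import pandas as pd

values = np.array([100.0, 200.0, np.nan, 400.0, 500.0])

plain = np.median(values)                   # NaN in the input propagates to the result
nan_aware = np.nanmedian(values)            # ignores NaN: median of 100, 200, 400, 500
pandas_median = pd.Series(values).median()  # pandas skips NaN by default

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
per_column = np.median(grid, axis=0)                # one median per column
per_row = np.median(grid, axis=1, keepdims=True)    # reduced axis kept with size one

print(plain, nan_aware, pandas_median)
print(per_column)
print(per_row.shape)
```

This is why the recipe in this section calls `data[column].median()` on the pandas column directly: applying `np.median()` to a column that still contains NaN would yield NaN as the fill value.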

#### cookbook recipe

1. define the columns manually, as a list
2. find median of each column
3. fill NA/NaN with median

In [3]:
# Sample data with some missing values
data = {
    'Turnover': [100, 200, np.nan, 400, 500, 600, np.nan, 800],
    'Transactions': [1, 2, 3, np.nan, np.nan, 6, 7, 8]
}

# Convert dictionary to a pandas DataFrame
data = pd.DataFrame(data)

print("Original DataFrame:")
print(data)

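# Fill missing values with column median
for column in ['Turnover', 'Transactions']:
    median = data[column].median()  # Series.median() skips NaN by default
    # Assign back instead of calling fillna(..., inplace=True) on data[column]:
    # the chained in-place form may act on a temporary copy and leave data unchanged
    data[column] = data[column].fillna(median)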

print("\nDataFrame after filling NA values with median:")
print(data)


Original DataFrame:
   Turnover  Transactions
0     100.0           1.0
1     200.0           2.0
2       NaN           3.0
3     400.0           NaN
4     500.0           NaN
5     600.0           6.0
6       NaN           7.0
7     800.0           8.0

DataFrame after filling NA values with median:
   Turnover  Transactions
0     100.0           1.0
1     200.0           2.0
2     450.0           3.0
3     400.0           4.5
4     500.0           4.5
5     600.0           6.0
6     450.0           7.0
7     800.0           8.0


## mixed data - select_dtypes

### select_dtypes()

#### doc

**`select_dtypes()`**

`select_dtypes()` is a pandas DataFrame method that returns the subset of columns whose data types match the given criteria. At least one of `include` and `exclude` must be supplied.

*Parameters:*

- **`include`** (_string or list of strings, optional_): Column data types to include.

- **`exclude`** (_string or list of strings, optional_): Column data types to exclude.

*Returns:*

A DataFrame with columns that match the specified data types.
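As a quick illustration of `include` and `exclude`, here is a minimal sketch on a hypothetical frame. The column names are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with integer, float, string and boolean columns
df = pd.DataFrame({
    "age": [25, 30],
    "income": [50000.0, 70000.0],
    "city": ["Austin", "Seattle"],
    "active": [True, False],
})

# np.number matches integer and float columns, but not booleans or strings
numeric_cols = df.select_dtypes(include=np.number).columns.tolist()

# exclude inverts the selection: everything that is not numeric
non_numeric_cols = df.select_dtypes(exclude=np.number).columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

The selected columns keep their original order, so the resulting lists can be passed straight to column indexing such as `df[numeric_cols]`.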


#### cookbook recipe

- **numeric**
    - **Identifying** numeric columns using `select_dtypes(include=np.number).columns`
    - **Applying** `fill_na_with_mean()`, which fills each numeric column with its own mean
- **categorical**
    - **Identifying** categorical columns of string/object type using `select_dtypes(include='object').columns`
    - **Applying** `fill_na_with_mode()`, which fills each categorical column with its first mode
- **Returning** the cleaned DataFrame, with no NA/NaN values left in either the numerical or the categorical columns.

In [4]:
import pandas as pd
import numpy as np

# Simple DataFrame with both numerical and categorical data
data = {
    'Age': [25, 30, 35, np.nan, 45],
    'City': ['New York', 'Seattle', 'San Francisco', 'Austin', np.nan],
    'Income': [50000, 70000, np.nan, 90000, 100000]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

def fill_na_with_mean(df, num_cols):
    mean_values = df[num_cols].mean()
    df[num_cols] = df[num_cols].fillna(mean_values)
    return df


def fill_na_with_mode(df, cat_cols):
    mode_values = df[cat_cols].mode().iloc[0]
    df[cat_cols] = df[cat_cols].fillna(mode_values)
    return df


def clean_data(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Identify numerical columns and fill NA/NaN with mean
    num_cols = df.select_dtypes(include=np.number).columns
    df = fill_na_with_mean(df, num_cols)

    # Identify categorical columns and fill NA/NaN with mode
    cat_cols = df.select_dtypes(include='object').columns
    df = fill_na_with_mode(df, cat_cols)

    return df



# Apply the function on our DataFrame
cleaned_df = clean_data(df)
print("\nCleaned DataFrame:")
print(cleaned_df)


Original DataFrame:
    Age           City    Income
0  25.0       New York   50000.0
1  30.0        Seattle   70000.0
2  35.0  San Francisco       NaN
3   NaN         Austin   90000.0
4  45.0            NaN  100000.0

Cleaned DataFrame:
     Age           City    Income
0  25.00       New York   50000.0
1  30.00        Seattle   70000.0
2  35.00  San Francisco   77500.0
3  33.75         Austin   90000.0
4  45.00         Austin  100000.0
