In [1]:
import numpy as np
import pandas as pd

In [2]:
d = {'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [6, np.nan, np.nan], 'D': [7, np.nan, np.nan]}
d

{'A': [1, 2, nan], 'B': [5, nan, nan], 'C': [6, nan, nan], 'D': [7, nan, nan]}

In [4]:
df = pd.DataFrame(d)
df

     A    B    C    D
0  1.0  5.0  6.0  7.0
1  2.0  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN


# Dropping NaN values

In [10]:
df.dropna(axis=0, inplace=False)

     A    B    C    D
0  1.0  5.0  6.0  7.0


In [11]:
df.dropna(axis=1, inplace=False)

Empty DataFrame
Columns: []
Index: [0, 1, 2]


In [15]:
df.dropna(axis=0, thresh=1, inplace=False)  # keep rows with at least 1 non-NaN value

     A    B    C    D
0  1.0  5.0  6.0  7.0
1  2.0  NaN  NaN  NaN
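
The `thresh` argument counts non-missing values per row, which is easy to misread as a count of NaNs. The sketch below rebuilds the same frame and compares two thresholds; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same frame as in the cells above.
df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan],
    'C': [6, np.nan, np.nan],
    'D': [7, np.nan, np.nan],
})

# Non-missing values per row: row 0 has 4, row 1 has 1, row 2 has 0.
non_missing = df.notna().sum(axis=1)

# thresh=N keeps rows that have at least N non-missing values.
kept_thresh_1 = df.dropna(thresh=1)  # rows 0 and 1 survive
kept_thresh_2 = df.dropna(thresh=2)  # only row 0 survives

print(non_missing.tolist())
print(kept_thresh_1.index.tolist())
print(kept_thresh_2.index.tolist())
```

With `thresh=1`, only the fully empty row 2 is dropped, which matches the output shown above.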


# Filling instead of dropping

In [16]:
df.fillna(value='Failed')

        A       B       C       D
0     1.0     5.0     6.0     7.0
1     2.0  Failed  Failed  Failed
2  Failed  Failed  Failed  Failed


## Filling based on mean

In [17]:
df['A']

0    1.0
1    2.0
2    NaN
Name: A, dtype: float64

In [19]:
df['A'].fillna(value=df['A'].mean())

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
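
The same idea extends from one column to the whole frame. A minimal sketch, assuming every column is numeric: `df.mean()` returns one mean per column (skipping NaN by default), and passing that Series to `fillna` fills each column with its own mean.

```python
import numpy as np
import pandas as pd

# Same frame as in the cells above.
df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan],
    'C': [6, np.nan, np.nan],
    'D': [7, np.nan, np.nan],
})

# One mean per column; NaN values are ignored when averaging.
col_means = df.mean()

# fillna aligns the Series on column labels, so each column gets its own mean.
filled = df.fillna(col_means)
print(filled)
```

Columns `B`, `C`, and `D` each have a single observed value, so their "mean" is just that value repeated, which is worth keeping in mind before relying on mean imputation for sparse columns.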

## fillna and dropna in Pandas

Pandas provides two useful methods for handling missing data: `fillna` and `dropna`. Here's an explanation of how to use each method:

### fillna

The `fillna` method is used to fill missing values (NaN) in a DataFrame or Series with a specified value or method. Here are some key points about `fillna`:

- You can pass a scalar value to fill all missing values with that value[1][2]:

```python
df.fillna(0, inplace=True)
```

- You can pass a dictionary to fill specific columns with different values[1][2]:

```python
df.fillna({'A': 0, 'B': 'missing'})
```

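- The `method` parameter fills missing values by forward-fill (`ffill`) or back-fill (`bfill`); interpolation is handled by a separate method, `interpolate`[1][2]. Since pandas 2.1 the `method` parameter is deprecated in favor of the dedicated `ffill` and `bfill` methods:

```python
df.ffill()
```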

- Setting `inplace=True` modifies the original DataFrame; otherwise the original is left untouched and a new DataFrame is returned[1][2].
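
The points above can be tried side by side on a small frame. This is a sketch with made-up data and illustrative variable names, not part of the original notebook; it uses the `ffill` and `bfill` methods rather than the `method` parameter.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with gaps in different places.
df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, 5.0, np.nan]})

scalar_filled = df.fillna(0)                # every NaN becomes 0
per_column = df.fillna({'A': 0, 'B': -1})   # a different fill value per column
forward = df.ffill()                        # copy the last valid value downward
backward = df.bfill()                       # copy the next valid value upward

# None of the calls above touched df itself.
untouched_nan_count = int(df.isna().sum().sum())

# inplace=True changes the object it is called on instead.
in_place = df.copy()
in_place.fillna(0, inplace=True)

print(forward)
print(backward)
```

Note that `forward` still has a NaN at the top of `B` and `backward` still has one at the bottom, because there is no neighbor to copy from in that direction.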

### dropna

The `dropna` method is used to drop rows or columns containing missing values from a DataFrame. Key points about `dropna`:

- By default it drops rows where any column has a missing value[3][4]:

```python
df.dropna()
```

- Setting `how='all'` drops only the rows where all values are missing[3]
- Use `subset` to specify which columns to consider for dropping rows[3][4]:

```python
df.dropna(subset=['A', 'B'])
```

- Set `axis=1` to drop columns instead of rows[4]
- Setting `inplace=True` modifies the original DataFrame[3][4]
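
To see how these options differ, the sketch below applies each one to the frame from the notebook cells at the top and records which rows or columns survive. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same frame as in the notebook cells at the top.
df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan],
    'C': [6, np.nan, np.nan],
    'D': [7, np.nan, np.nan],
})

rows_any = df.dropna().index.tolist()                        # drop rows with any NaN
rows_all = df.dropna(how='all').index.tolist()               # drop only fully empty rows
rows_subset_a = df.dropna(subset=['A']).index.tolist()       # look at column A only
rows_subset_ab = df.dropna(subset=['A', 'B']).index.tolist() # look at A and B

cols_any = df.dropna(axis=1).columns.tolist()                # drop columns with any NaN
cols_all = df.dropna(axis=1, how='all').columns.tolist()     # drop only fully empty columns

print(rows_any, rows_all, rows_subset_a, rows_subset_ab)
print(cols_any, cols_all)
```

Every column contains at least one NaN, so `axis=1` with the default `how='any'` removes all of them, while `how='all'` keeps all four because none is completely empty.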

In summary, `fillna` is used to replace missing values, while `dropna` removes rows or columns containing NaNs. Both methods provide flexibility to handle missing data in Pandas DataFrames and Series.

Citations:

[1] https://slidescope.com/handling-missing-data-with-fillna-dropna-and-interpolate-in-pandas-lesson-6/

[2] https://www.w3schools.com/python/pandas/ref_df_fillna.asp

[3] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

[4] https://stackoverflow.com/questions/60832099/what-should-i-do-to-solve-my-problem-with-dropna-and-fillna-in-python

[5] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html

------------
The `fillna` method in Pandas fills missing values (NaN) in DataFrames and Series, either with values you specify or by propagating neighboring valid observations.

### Syntax

The basic syntax of the `fillna` method is as follows:

```python
DataFrame.fillna(value=None, *, method=None, axis=None, inplace=False, limit=None, downcast=None)
```

### Parameters

- **value**: This can be a scalar, dictionary, Series, or DataFrame. It specifies the value(s) to use for filling NaN values. For example, you can fill all NaNs with a specific number or use a dictionary to fill different columns with different values.

- **method**: This parameter fills NaNs by propagating neighboring valid values. It is deprecated since pandas 2.1 in favor of `DataFrame.ffill()` and `DataFrame.bfill()`:
  - `'ffill'` or `'pad'`: Propagates the last valid observation forward until the next valid one.
  - `'bfill'` or `'backfill'`: Uses the next valid observation to fill the gap.

- **axis**: This specifies the axis along which to fill and matters mainly for propagation fills. `0` or `'index'` fills down each column (along the index), while `1` or `'columns'` fills across each row (along the columns).

- **inplace**: If set to `True`, the operation modifies the original DataFrame. If `False`, it returns a new DataFrame.

- **limit**: With a propagation fill, this is the maximum number of consecutive NaNs to fill in each gap. With a fill value, it is the maximum number of NaNs to fill along the entire axis.

- **downcast**: This allows for downcasting to a narrower data type if possible (for example, `float64` to `int64`). It is deprecated since pandas 2.2.
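
The `axis` and `limit` parameters are the easiest to misread, so here is a small sketch of both. The frame is made up for illustration, and the propagation fills use `ffill` rather than the `method` parameter.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: A has a run of three NaNs, B has isolated gaps.
df = pd.DataFrame({
    'A': [1.0, np.nan, np.nan, np.nan],
    'B': [np.nan, 2.0, np.nan, 4.0],
})

down = df.ffill()               # axis=0 (default): fill down each column
across = df.ffill(axis=1)       # axis=1: fill across each row, left to right
limited = df.ffill(limit=1)     # fill at most 1 consecutive NaN per gap
capped = df.fillna(0, limit=1)  # with a value: fill at most 1 NaN per column

print(down)
print(across)
print(limited)
print(capped)
```

In `across`, only row 0 changes: `B` takes the value of `A` to its left, while the other rows have no valid value on the left to copy.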

### Examples

1. **Filling with a Specific Value**:
   ```python
   import pandas as pd

   df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [None, 2, 3, None]})
   filled = df.fillna(0)  # returns a new DataFrame, so df keeps its NaNs
   print(filled)
   ```
   This replaces all NaN values with `0`.

2. **Filling with a Dictionary**:
   ```python
   df.fillna({'A': 0, 'B': 1})
   ```
   This fills NaNs in column 'A' with `0` and those in 'B' with `1`.

3. **Forward Fill**:
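   ```python
   df.ffill()  # replaces the deprecated df.fillna(method='ffill')
   ```
   This fills each NaN with the last valid observation above it in the same column; a NaN at the top of a column has nothing to copy and stays NaN.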

4. **Limit Filling**:
   ```python
   df.ffill(limit=1)  # replaces the deprecated df.fillna(method='ffill', limit=1)
   ```
   This forward fills at most one NaN after each valid value; any further NaNs in the same run stay missing.

### Conclusion

The `fillna` method is a core data preprocessing tool for managing missing values. It offers several ways to fill gaps, from constants and per-column values to forward or backward propagation, which suits time series data where previous or subsequent observations are relevant[1][2][3][4][5].

Citations:

[1] https://www.geeksforgeeks.org/python-pandas-series-fillna/

[2] https://www.kdnuggets.com/2023/02/optimal-way-input-missing-data-pandas-fillna.html

[3] https://www.w3schools.com/python/pandas/ref_df_fillna.asp

[4] https://slidescope.com/handling-missing-data-with-fillna-dropna-and-interpolate-in-pandas-lesson-6/

[5] https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.fillna.html