# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#2: DataFrames in Depth`**

4. **Creating DataFrames**
    
    - From lists, dictionaries, and arrays
    - Reading data from CSV, Excel, and other formats
5. **Basic DataFrame Operations**
    
    - Inspecting the DataFrame
    - Indexing and selecting data
    - Descriptive statistics
6. **Data Cleaning and Handling Missing Data**
    
    - Handling missing values
    - Dropping or filling missing values
    - Removing duplicates

#### **`Dropping or Filling Missing Values`**

#### 1. Decision-Making Process:

1. **Dropping Missing Values:**
   - **Context:** Dropping missing values is suitable when values are missing at random and only a small share of rows is affected, so that removing the incomplete records neither biases the analysis nor discards much data.

   - **Example:**
     ```python
     # Drop rows with any missing values
     df_dropped = df.dropna()
     ```

2. **Filling Missing Values:**
   - **Context:** Filling missing values is appropriate when retaining the incomplete records is crucial, and a reasonable estimation can be made for the missing values. This is common when dealing with time-series data, where continuity matters.

   - **Example:**
     ```python
     # Fill missing values in 'column_name' with a constant value
     df_filled_constant = df.fillna({'column_name': 0})
     ```

   - **Example:**
     ```python
     # Fill missing values in numeric columns with each column's mean
     # (numeric_only=True skips text columns, which have no mean)
     df_filled_mean = df.fillna(df.mean(numeric_only=True))
     ```

   - **Example:**
     ```python
     # Forward fill missing values in a DataFrame
     df_forward_filled = df.ffill()
     ```

   - **Example:**
     ```python
     # Backward fill missing values in a DataFrame
     df_backward_filled = df.bfill()
     ```

   - **Example:**
     ```python
     # Interpolate missing values using linear interpolation
     df_interpolated_linear = df.interpolate(method='linear')
     ```
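
Before choosing between the two approaches, it can help to measure how much data is actually missing and to drop rows more selectively than a bare `dropna()` does. The following is a minimal sketch on a hypothetical DataFrame (the column names `id`, `score`, and `label` are invented for illustration); it uses the `subset` and `thresh` parameters of `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "score": [0.5, np.nan, 0.7, np.nan, 0.9],
    "label": ["a", "b", None, None, "e"],
})

# Share of missing values per column (0.0 means nothing is missing)
missing_share = df.isna().mean()
print(missing_share)

# Drop a row only when 'score' is missing; gaps in 'label' are tolerated
dropped_subset = df.dropna(subset=["score"])

# Keep rows that have at least 2 non-missing values
dropped_thresh = df.dropna(thresh=2)

print(len(df), len(dropped_subset), len(dropped_thresh))
```

If the share of missing values in a column is small, dropping is usually cheap; if it is large, filling or a targeted `subset` is likely the safer choice.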

#### Considerations:

- **Data Nature:**
  - Consider the nature of the data. For ordered data such as time series, forward fill, backward fill, or interpolation might be suitable, while for unordered numeric data, the mean or median might be more appropriate.

- **Impact on Analysis:**
  - Evaluate how the chosen method for handling missing values might impact subsequent analyses. Ensure that the imputation method aligns with the overall analysis goals.

- **Domain Knowledge:**
  - Leverage domain knowledge to make informed decisions. Some missing values may be inherently unfillable due to the nature of the data.
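
One way to respect the point that some gaps should stay unfilled is to cap how far a fill may propagate. The sketch below assumes a hypothetical series of daily readings and uses the `limit` parameter of `ffill()` so that only short gaps are filled, while longer gaps remain visible as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with one short gap and one long gap
readings = pd.Series(
    [10.0, np.nan, 12.0, np.nan, np.nan, np.nan, 16.0],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# Carry the last observation forward by at most one step;
# the remainder of the long gap is left as NaN
filled_limited = readings.ffill(limit=1)
print(filled_limited)
```

Whether a limit of one step is sensible depends on the domain; the value here is only an illustration.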

#### Conclusion:

The decision between dropping or filling missing values depends on the specific characteristics of the data and the analysis goals. Dropping values is a straightforward approach but may lead to data loss. Filling values is a more nuanced process, requiring careful consideration of the data's nature and the impact on downstream analyses. Experiment with different strategies and choose the one that best fits the context of your dataset.

#### 2. Example:

Let's consider a scenario where we have a DataFrame representing monthly sales data for a product. The dataset has missing values in the 'Sales' column, and we need to decide whether to drop or fill those missing values based on the context.

```python
import pandas as pd
import numpy as np

# Sample monthly sales data with missing values
sales_data = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'Sales': [100, 120, np.nan, 150, np.nan, 180],
}

# Creating a DataFrame from the sales data
df_sales = pd.DataFrame(sales_data)

# Displaying the original DataFrame
print("Original Sales DataFrame:")
print(df_sales)

# Decision 1: Dropping Missing Values
df_dropped = df_sales.dropna()

# Displaying the DataFrame after dropping missing values
print("\nDataFrame after Dropping Missing Values:")
print(df_dropped)

# Decision 2: Filling Missing Values with Forward Fill
df_filled_forward = df_sales.ffill()

# Displaying the DataFrame after forward filling missing values
print("\nDataFrame after Forward Filling Missing Values:")
print(df_filled_forward)

# Decision 3: Filling Missing Values with Backward Fill
df_filled_backward = df_sales.bfill()

# Displaying the DataFrame after backward filling missing values
print("\nDataFrame after Backward Filling Missing Values:")
print(df_filled_backward)

# Decision 4: Filling Missing Values with Mean
# Restrict the fill to 'Sales' so other columns are never filled with a sales figure
df_filled_mean = df_sales.fillna({'Sales': df_sales['Sales'].mean()})

# Displaying the DataFrame after filling missing values with mean
print("\nDataFrame after Filling Missing Values with Mean:")
print(df_filled_mean)
```

```
Original Sales DataFrame:
  Month  Sales
0   Jan  100.0
1   Feb  120.0
2   Mar    NaN
3   Apr  150.0
4   May    NaN
5   Jun  180.0

DataFrame after Dropping Missing Values:
  Month  Sales
0   Jan  100.0
1   Feb  120.0
3   Apr  150.0
5   Jun  180.0

DataFrame after Forward Filling Missing Values:
  Month  Sales
0   Jan  100.0
1   Feb  120.0
2   Mar  120.0
3   Apr  150.0
4   May  150.0
5   Jun  180.0

DataFrame after Backward Filling Missing Values:
  Month  Sales
0   Jan  100.0
1   Feb  120.0
2   Mar  150.0
3   Apr  150.0
4   May  180.0
5   Jun  180.0

DataFrame after Filling Missing Values with Mean:
  Month  Sales
0   Jan  100.0
1   Feb  120.0
2   Mar  137.5
3   Apr  150.0
4   May  137.5
5   Jun  180.0
```


In this example:

1. **Dropping Missing Values:**
   - We use `dropna()` to remove rows with any missing values. This might be suitable if missing values are limited and their removal doesn't significantly affect the analysis.

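2. **Filling Missing Values with Forward Fill:**
   - We use `ffill()` to fill each missing value with the previous month's sales. This approach is reasonable when consecutive observations are expected to be similar, so the last known value is a plausible estimate.

3. **Filling Missing Values with Backward Fill:**
   - We use `bfill()` to fill each missing value with the next month's sales. Because this borrows a later observation, it fits only when using future values is acceptable for the analysis.

4. **Filling Missing Values with Mean:**
   - We use `fillna()` with the mean of the 'Sales' column to impute missing values. This approach is suitable when we want to retain all rows and fill missing values with a representative value.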

Adjust the code based on the specific characteristics of your dataset and the analysis goals. Choosing between dropping or filling missing values should be driven by the dataset's context and the impact on subsequent analyses.
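
Linear interpolation, listed earlier among the filling options, can be applied to the same monthly sales data. The sketch below rebuilds the sample DataFrame so that it runs on its own and interpolates only the numeric 'Sales' column; the name `df_interp` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as in the example above
df_sales = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "Sales": [100, 120, np.nan, 150, np.nan, 180],
})

# Interpolate only the numeric column; each gap becomes the
# midpoint of its two neighbours because the gaps are one step wide
df_interp = df_sales.copy()
df_interp["Sales"] = df_interp["Sales"].interpolate(method="linear")
print(df_interp)
```

Here March becomes 135.0 (between 120 and 150) and May becomes 165.0 (between 150 and 180), which follows the upward trend more closely than the column mean of 137.5.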

#### 3. Real-world Scenario:

Imagine you are managing a dataset that tracks monthly sales data for a retail business. The dataset includes information such as the month, product category, sales quantity, and revenue. However, due to occasional reporting errors or data collection issues, there are missing values in the dataset. Let's explore how to handle these missing values using Pandas.

```python
import pandas as pd
import numpy as np

# Sample monthly sales data with missing values
sales_data = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'Category': ['Electronics', 'Clothing', np.nan, 'Electronics', np.nan, 'Clothing'],
    'Sales_Quantity': [120, 150, np.nan, 200, np.nan, 180],
    'Revenue': [12000, np.nan, 18000, np.nan, 25000, 22000],
}

# Creating a DataFrame from the sales data
df_sales = pd.DataFrame(sales_data)

# Displaying the original DataFrame
print("Original Monthly Sales DataFrame:")
print(df_sales)

# Handling Missing Values:
# Decision 1: Dropping rows with any missing values
df_dropped = df_sales.dropna()

# Decision 2: Filling missing values with mean for numerical columns
# ('Category' is not numeric, so its missing values are left as NaN)
df_filled_mean = df_sales.fillna({
    'Sales_Quantity': df_sales['Sales_Quantity'].mean(),
    'Revenue': df_sales['Revenue'].mean(),
})

# Displaying the DataFrames after handling missing values
print("\nDataFrame after Dropping Missing Values:")
print(df_dropped)

print("\nDataFrame after Filling Missing Values with Mean:")
print(df_filled_mean)
```


```
Original Monthly Sales DataFrame:
  Month     Category  Sales_Quantity  Revenue
0   Jan  Electronics           120.0  12000.0
1   Feb     Clothing           150.0      NaN
2   Mar          NaN             NaN  18000.0
3   Apr  Electronics           200.0      NaN
4   May          NaN             NaN  25000.0
5   Jun     Clothing           180.0  22000.0

DataFrame after Dropping Missing Values:
  Month     Category  Sales_Quantity  Revenue
0   Jan  Electronics           120.0  12000.0
5   Jun     Clothing           180.0  22000.0

DataFrame after Filling Missing Values with Mean:
  Month     Category  Sales_Quantity  Revenue
0   Jan  Electronics           120.0  12000.0
1   Feb     Clothing           150.0  19250.0
2   Mar          NaN           162.5  18000.0
3   Apr  Electronics           200.0  19250.0
4   May          NaN           162.5  25000.0
5   Jun     Clothing           180.0  22000.0
```


#### 4. Considerations or Peculiarities:

- **Imputation Strategy:**
  - Choosing between dropping and filling depends on the impact on analysis. Dropping may lead to loss of important information, while filling may introduce bias if not done carefully.

- **Context of Data:**
  - Understand the context of your data. For example, filling missing revenue values with the mean might be reasonable, but for product categories, it may not make sense.

- **Column-specific Strategies:**
  - Different columns may require different strategies. For numeric columns, mean or median filling could be appropriate, while for categorical columns, forward fill or mode filling might be more suitable.
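
The column-specific idea above can be sketched with a single `fillna()` call that takes one fill value per column. The data below is hypothetical and was chosen so that 'Category' has a single most frequent value; with a tie, `mode()` returns several values and taking the first one is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "Category": ["Electronics", "Clothing", None, "Electronics", "Electronics", None],
    "Revenue": [12000.0, np.nan, 18000.0, np.nan, 25000.0, 22000.0],
})

# One strategy per column: mode for the categorical column,
# median for the numeric column
fill_values = {
    "Category": df["Category"].mode().iloc[0],
    "Revenue": df["Revenue"].median(),
}

df_filled = df.fillna(fill_values)
print(df_filled)
```

The median is used here instead of the mean because it is less sensitive to extreme values; either may be reasonable depending on the distribution.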

#### 5. Common Mistakes:

- **Unintended Data Loss:**
  - Developers might drop rows without considering the impact on the dataset's integrity. This can lead to unintended data loss, especially if the missing values are not randomly distributed.

- **Inconsistent Imputation:**
  - Filling missing values inconsistently across columns or datasets can introduce inconsistencies in the dataset.

- **Overlooking Context:**
  - Filling missing values without understanding the context of the data and the reasons for missingness may lead to inaccurate imputations.
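
One way to avoid inconsistent imputation across datasets is to compute the fill values once and reuse them everywhere. The sketch below assumes two hypothetical DataFrames, `train` and `new`, with the same columns; the fill values are derived from `train` only and then applied to both.

```python
import numpy as np
import pandas as pd

# Hypothetical datasets sharing the same columns
train = pd.DataFrame({"Revenue": [100.0, np.nan, 300.0], "Units": [1.0, 2.0, np.nan]})
new = pd.DataFrame({"Revenue": [np.nan, 250.0], "Units": [np.nan, 4.0]})

# Compute the fill values once, from a single reference dataset
fill_values = train.mean(numeric_only=True).to_dict()

# Apply the same values to every dataset so the imputation is consistent
train_filled = train.fillna(fill_values)
new_filled = new.fillna(fill_values)

print(fill_values)
print(new_filled)
```

Recomputing the mean separately on `new` would fill its gaps with different numbers, which is the kind of inconsistency described above.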

Handling missing values is a critical aspect of data preprocessing. It requires thoughtful consideration of the dataset's context, the nature of missingness, and the impact on downstream analyses. Developers should choose strategies that align with the goals of their analysis and avoid common pitfalls that can compromise data quality.