# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#2: DataFrames in Depth`**
4. **Creating DataFrames**
   - From lists, dictionaries, and arrays
   - Reading data from CSV, Excel, and other formats

5. **Basic DataFrame Operations**
   - Inspecting the DataFrame
   - Indexing and selecting data
   - Descriptive statistics

6. **Data Cleaning and Handling Missing Data**
   - Handling missing values
   - Dropping or filling missing values
   - Removing duplicates

#### **`Handling Missing Values`**

#### 1. Importance of Identifying and Handling Missing Values in a DataFrame:

Missing values are a common occurrence in real-world datasets. In Pandas they usually appear as `NaN` (Not a Number); depending on the column type they can also show up as `None`, `NaT` (for dates), or `<NA>`. Properly identifying and handling them is crucial for meaningful and accurate data analysis, because ignoring them can lead to biased results and incorrect interpretations. Here's why handling missing values is important:

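1. **Data Accuracy:** Pandas skips missing values by default when computing summary statistics such as the mean and standard deviation, so those statistics describe only the observed rows. If the missing rows differ systematically from the observed ones, the resulting insights about the dataset are inaccurate.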
    
2. **Model Performance:** Many machine learning algorithms cannot work with missing values directly, and leaving them unaddressed can cause errors, biased predictions, and reduced model performance.
    
3. **Data Visualization:** Visualizations may not accurately represent the distribution of data when missing values are present, affecting the interpretation of results.
    
4. **Statistical Analyses:** Many statistical analyses and tests assume complete data. Missing values can compromise the validity of statistical results and significance testing.
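
To make the first point concrete, here is a minimal sketch showing how missing values interact with a summary statistic. The `scores` Series is made up purely for illustration; it is not part of the datasets used later in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the five values are missing
scores = pd.Series([10.0, 20.0, np.nan, 40.0, np.nan])

print(len(scores))                # total number of entries, including NaN
print(scores.count())             # number of non-missing entries only
print(scores.mean())              # NaN is skipped by default (skipna=True)
print(scores.mean(skipna=False))  # NaN propagates when skipping is disabled
```

Comparing `len()` with `count()` is a quick way to see how many observations a statistic is actually based on.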

#### 2. Methods for Handling Missing Values:


1. **Identifying Missing Values:**
   - **`isna()` and `notna()`:**
     ```python
     # Check for missing values
     df.isna()  # Returns a DataFrame of the same shape with True for missing values
     df.notna()  # Returns the opposite of isna()
     ```

In [1]:
import pandas as pd
import numpy as np

# Creating a DataFrame with missing values
data = {
    'A': [1, 2, np.nan, 4, 5, 6, 7],
    'B': [5, np.nan, 7, 8, np.nan, 10, 11],
    'C': [9, 10, 11, np.nan, 13, 14, 15],
    'D': [14, np.nan, 16, 17, 18, np.nan, 20],
}

# Append five extra values to each column's list before building the DataFrame
# (no random seed is set, so these values change on every run)
for i in range(5):
    data['A'].append(np.nan)
    data['B'].append(np.random.randint(1, 100))  # Random integers as additional values
    data['C'].append(np.random.choice(['apple', 'banana', 'orange']))  # Random strings; mixing them with numbers makes column C an object column
    data['D'].append(np.random.uniform(0, 1))  # Random floats as additional values

df = pd.DataFrame(data)

# Display the DataFrame
print("Original DataFrame:")
print(df)

# Check for missing values
missing_values = df.isna()
not_missing_values = df.notna()

# Display the DataFrames with True for missing values
print("\nDataFrame with True for Missing Values:")
print(missing_values)

# Display the opposite DataFrame (True for non-missing values)
print("\nDataFrame with True for Non-Missing Values:")
print(not_missing_values)


Original DataFrame:
      A     B       C          D
0   1.0   5.0       9  14.000000
1   2.0   NaN      10        NaN
2   NaN   7.0      11  16.000000
3   4.0   8.0     NaN  17.000000
4   5.0   NaN      13  18.000000
5   6.0  10.0      14        NaN
6   7.0  11.0      15  20.000000
7   NaN  44.0  orange   0.137091
8   NaN  89.0   apple   0.391259
9   NaN  53.0   apple   0.164750
10  NaN   4.0   apple   0.223547
11  NaN  74.0  banana   0.464314

DataFrame with True for Missing Values:
        A      B      C      D
0   False  False  False  False
1   False   True  False   True
2    True  False  False  False
3   False  False   True  False
4   False   True  False  False
5   False  False  False   True
6   False  False  False  False
7    True  False  False  False
8    True  False  False  False
9    True  False  False  False
10   True  False  False  False
11   True  False  False  False

DataFrame with True for Non-Missing Values:
        A      B      C      D
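0    True   True   True   True
1    True  False   True  False
2   False   True   True   True
3    True   True  False   True
4    True  False   True   True
5    True   True   True  False
6    True   True   True   True
7   False   True   True   True
8   False   True   True   True
9   False   True   True   True
10  False   True   True   True
11  False   True   True   True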


#### Explanation:

- `df.isna()`: Returns a DataFrame of the same shape as df with True for missing values and False for non-missing values.
- `df.notna()`: Returns the opposite DataFrame, with True for non-missing values and False for missing values.
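
A full True/False table is hard to read on larger datasets, so the Boolean result is usually summarized. The sketch below shows a few common summaries; `df_demo` is a small made-up DataFrame used only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame used only for this illustration
df_demo = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan],
    'B': [np.nan, 2, 3, 4],
    'C': [1, 2, 3, 4],
})

# Number of missing values per column (True counts as 1)
missing_per_column = df_demo.isna().sum()

# Share of missing values per column, between 0 and 1
missing_share = df_demo.isna().mean()

# Only the rows that contain at least one missing value
rows_with_missing = df_demo[df_demo.isna().any(axis=1)]

print(missing_per_column)
print(missing_share)
print(rows_with_missing)
```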

2. **Handling Missing Values:**

**a. `fillna()`:**

```python
# Fill missing values with a specified value or a calculated value
df.fillna(value)  # Fill with a constant value
df.fillna(df.mean())  # Fill with the mean of each column
```

In [4]:
import pandas as pd
import numpy as np

# Create a DataFrame with missing values
data = {'A': [1, 2, np.nan, 4, 5],
        'B': [5, np.nan, 7, 8, np.nan],
        'C': [9, 10, np.nan, 12, 13]}

df = pd.DataFrame(data)

# Example 1: Fill missing values with a specified constant value
filled_with_constant = df.fillna(0)
print("Filled with Constant:")
print(filled_with_constant)

# Example 2: Fill missing values with the mean of each column
filled_with_mean = df.fillna(df.mean())
print("\nFilled with Mean:")
print(filled_with_mean)


Filled with Constant:
     A    B     C
0  1.0  5.0   9.0
1  2.0  0.0  10.0
2  0.0  7.0   0.0
3  4.0  8.0  12.0
4  5.0  0.0  13.0

Filled with Mean:
     A         B     C
0  1.0  5.000000   9.0
1  2.0  6.666667  10.0
2  3.0  7.000000  11.0
3  4.0  8.000000  12.0
4  5.0  6.666667  13.0


#### Explanation:

- `df.fillna(0)`: This fills missing values in the DataFrame with the constant value 0.
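- `df.fillna(df.mean())`: This fills missing values in the DataFrame with the mean of each column. The `mean()` method calculates the mean of each column from its non-missing values, and each missing value is replaced with the mean of its own column. If the DataFrame also contains non-numeric columns, recent pandas versions require `df.mean(numeric_only=True)` here.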
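
Beyond a single constant or the column means, `fillna()` also accepts a dictionary of per-column fill values, and `ffill()` / `bfill()` propagate neighbouring values. The following is a minimal sketch on a made-up `df_demo`; which strategy is appropriate depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame; column A contains an outlier (100.0)
df_demo = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, 100.0],
    'B': [np.nan, 2.0, np.nan, 4.0],
})

# A different fill value per column: the median for A, a constant for B
filled_per_column = df_demo.fillna({'A': df_demo['A'].median(), 'B': 0})

# Forward fill: carry the last valid value downward (a leading NaN stays NaN)
filled_forward = df_demo.ffill()

# Backward fill: pull the next valid value upward
filled_backward = df_demo.bfill()

print(filled_per_column)
print(filled_forward)
print(filled_backward)
```

The median is used for column A because, unlike the mean, it is not pulled toward the outlier.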

**b. Dropping Missing Values (`dropna()`):**

```python
# Drop rows or columns containing missing values
df.dropna()  # Drop rows with any missing values
df.dropna(axis=1)  # Drop columns with any missing values
```

In [5]:
import pandas as pd
import numpy as np

# Create a DataFrame with missing values
data = {
    'A': [1, 2, np.nan, 4, 5, 6, 7],
    'B': [5, np.nan, 7, 8, np.nan, 10, 11],
    'C': [9, 10, 11, np.nan, 13, 14, 15],
    'D': [14, np.nan, 16, 17, 18, np.nan, 20],
}

df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Example 1: Drop rows with any missing values
df_dropped_rows = df.dropna()
print("\nDataFrame after Dropping Rows with Any Missing Values:")
print(df_dropped_rows)

# Example 2: Drop columns with any missing values
df_dropped_columns = df.dropna(axis=1)
print("\nDataFrame after Dropping Columns with Any Missing Values:")
print(df_dropped_columns)


Original DataFrame:
     A     B     C     D
0  1.0   5.0   9.0  14.0
1  2.0   NaN  10.0   NaN
2  NaN   7.0  11.0  16.0
3  4.0   8.0   NaN  17.0
4  5.0   NaN  13.0  18.0
5  6.0  10.0  14.0   NaN
6  7.0  11.0  15.0  20.0

DataFrame after Dropping Rows with Any Missing Values:
     A     B     C     D
0  1.0   5.0   9.0  14.0
6  7.0  11.0  15.0  20.0

DataFrame after Dropping Columns with Any Missing Values:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6]


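#### Explanation:

- `df.dropna()`: This method drops rows containing any missing values. Only rows 0 and 6 are complete, so only they remain.
- `df.dropna(axis=1)`: This method drops columns containing any missing values. In this example every column has at least one missing value, so all columns are dropped and the result is an empty DataFrame that keeps only the index.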
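
Dropping on "any missing value" can be too aggressive, as the empty result above shows. `dropna()` has parameters that make it more selective; the sketch below illustrates `how`, `thresh`, and `subset` on a made-up `df_demo`.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame: row 2 is entirely missing
df_demo = pd.DataFrame({
    'A': [1, np.nan, np.nan, 4],
    'B': [1, np.nan, np.nan, 4],
    'C': [1, 5, np.nan, np.nan],
})

# Drop a row only if ALL of its values are missing
dropped_all = df_demo.dropna(how='all')

# Keep rows that have at least 2 non-missing values
dropped_thresh = df_demo.dropna(thresh=2)

# Only look at column C when deciding which rows to drop
dropped_subset = df_demo.dropna(subset=['C'])

print(dropped_all)
print(dropped_thresh)
print(dropped_subset)
```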

**c. Interpolation (`interpolate()`):**

```python
# Interpolate missing values using various methods (linear, polynomial, etc.)
df.interpolate()
```

In [8]:
import pandas as pd
import numpy as np

# Create a DataFrame with missing values
data = {
    'A': [1, 2, np.nan, 4, 5, 6, 7],
    'B': [5, np.nan, 7, 8, np.nan, 10, 11],
    'C': [9, 10, 11, np.nan, 13, 14, 15],
    'D': [14, np.nan, 16, 17, 18, np.nan, 20],
}

df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Example: Interpolate missing values using default linear interpolation
df_interpolated_linear = df.interpolate()
print("\nDataFrame after Linear Interpolation:")
print(df_interpolated_linear)


Original DataFrame:
     A     B     C     D
0  1.0   5.0   9.0  14.0
1  2.0   NaN  10.0   NaN
2  NaN   7.0  11.0  16.0
3  4.0   8.0   NaN  17.0
4  5.0   NaN  13.0  18.0
5  6.0  10.0  14.0   NaN
6  7.0  11.0  15.0  20.0

DataFrame after Linear Interpolation:
     A     B     C     D
0  1.0   5.0   9.0  14.0
1  2.0   6.0  10.0  15.0
2  3.0   7.0  11.0  16.0
3  4.0   8.0  12.0  17.0
4  5.0   9.0  13.0  18.0
5  6.0  10.0  14.0  19.0
6  7.0  11.0  15.0  20.0


#### Explanation:

- `df.interpolate()`: This method performs linear interpolation by default. Each gap is filled with evenly spaced values between the nearest valid values above and below it in the same column; the index is ignored.
- The `interpolate()` method also supports other interpolation strategies, selected with the `method` parameter. Some of them need extra arguments or SciPy; for example, `method='polynomial'` also requires an `order` argument. Choose the method and its parameters based on how your data is expected to behave between observations.
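
As a small sketch of two options that do not require SciPy, the example below uses `limit` to cap how many consecutive values are filled and `method='time'` to take a datetime index into account. The series `s` and `ts` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a run of three missing values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

full = s.interpolate()            # fills the whole gap
limited = s.interpolate(limit=1)  # fills at most one consecutive NaN

# Hypothetical unevenly spaced time series: 1 day, then 9 days between points
ts = pd.Series(
    [0.0, np.nan, 10.0],
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-11']),
)

by_position = ts.interpolate()           # default: ignores the dates
by_time = ts.interpolate(method='time')  # weights by elapsed time

print(full)
print(limited)
print(by_position)
print(by_time)
```

With the default method the missing point is treated as the midpoint; with `method='time'` it sits much closer to the first value because only one of the ten days has elapsed.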

#### 3. Considerations and Best Practices:

- **Context Matters:** The method chosen to handle missing values depends on the nature of the data and the reason for missingness. Consider the context before applying a specific strategy.

- **Impact on Analysis:** Understand how the chosen method might impact your analysis. For example, filling missing values with the mean could introduce bias if missingness is not random.

- **Visualization:** Visualize the distribution of missing values using tools like heatmaps to better understand patterns of missingness.

- **Documentation:** Clearly document the chosen strategy for handling missing values in your analysis to ensure transparency and reproducibility.
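
The point about impact on analysis can be illustrated with a rough sketch: filling with the mean keeps the mean unchanged but shrinks the spread of the data. The `values` Series is made up, and keeping a `was_missing` flag is just one possible way to document which entries were imputed.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with two missing entries
values = pd.Series([10.0, 20.0, np.nan, 40.0, np.nan, 50.0])

# Remember which entries were originally missing before filling them
was_missing = values.isna()

filled = values.fillna(values.mean())

std_before = values.std()  # computed from the 4 observed values
std_after = filled.std()   # computed from 6 values, 2 of them equal to the mean

print(f"Mean before: {values.mean():.2f}, after: {filled.mean():.2f}")
print(f"Std before:  {std_before:.2f}, after: {std_after:.2f}")
print(f"Imputed entries: {was_missing.sum()}")
```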

#### Conclusion:

Properly handling missing values is a critical step in the data cleaning process. It protects the integrity of analyses and models, leading to more reliable and accurate results. Familiarizing yourself with Pandas methods like `isna()`, `notna()`, `fillna()`, `dropna()`, and `interpolate()` empowers you to make informed decisions when dealing with missing data in your DataFrame.

#### 4. Example:
Consider a scenario where we have a DataFrame containing information about students' exam scores in different subjects. The dataset has missing values that need to be handled, and we'll demonstrate the use of `isna()` and `fillna()` to address these missing values.

In [9]:
import pandas as pd
import numpy as np

# Sample student exam data with missing values
exam_data = {
    'StudentID': [1, 2, 3, 4, 5],
    'Math_Score': [85, np.nan, 78, 92, 88],
    'English_Score': [75, 85, np.nan, 88, 92],
    'Physics_Score': [90, 78, 85, np.nan, 94],
    'Chemistry_Score': [82, 88, 90, 76, np.nan],
}

# Creating a DataFrame from the exam data
df_exams = pd.DataFrame(exam_data)

# Displaying the original DataFrame
print("Original DataFrame:")
print(df_exams)

# Identifying missing values
missing_values = df_exams.isna()
print("\nMissing Values:")
print(missing_values)

# Filling missing values with the mean of each column
mean_filled_df = df_exams.fillna(df_exams.mean())

# Displaying the DataFrame after handling missing values
print("\nDataFrame after Filling Missing Values with Mean:")
print(mean_filled_df)


Original DataFrame:
   StudentID  Math_Score  English_Score  Physics_Score  Chemistry_Score
0          1        85.0           75.0           90.0             82.0
1          2         NaN           85.0           78.0             88.0
2          3        78.0            NaN           85.0             90.0
3          4        92.0           88.0            NaN             76.0
4          5        88.0           92.0           94.0              NaN

Missing Values:
   StudentID  Math_Score  English_Score  Physics_Score  Chemistry_Score
0      False       False          False          False            False
1      False        True          False          False            False
2      False       False           True          False            False
3      False       False          False           True            False
4      False       False          False          False             True

DataFrame after Filling Missing Values with Mean:
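   StudentID  Math_Score  English_Score  Physics_Score  Chemistry_Score
0          1       85.00           75.0          90.00             82.0
1          2       85.75           85.0          78.00             88.0
2          3       78.00           85.0          85.00             90.0
3          4       92.00           88.0          86.75             76.0
4          5       88.00           92.0          94.00             84.0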

#### Explanation:

1. **Identifying Missing Values:**
   - We use `isna()` to create a DataFrame of the same shape as the original, with `True` values where missing values are present.

2. **Handling Missing Values:**
   - We use `fillna()` to fill missing values with the mean of each column.

3. **Result:**
   - The final DataFrame (`mean_filled_df`) has missing values filled with the mean of each respective column.

This example showcases the importance of identifying and handling missing values and demonstrates a practical approach using Pandas methods. Adjust the code based on your specific dataset and requirements.
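
One possible refinement, sketched here under the assumption that identifier columns such as `StudentID` should never be imputed, is to restrict the mean fill to the score columns. The `exam_demo` data and the `score_columns` list are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical, smaller version of the exam data
exam_demo = pd.DataFrame({
    'StudentID': [1, 2, 3],
    'Math_Score': [80.0, np.nan, 90.0],
    'English_Score': [70.0, 75.0, np.nan],
})

# Columns that are safe to fill with a mean (assumption: everything except the ID)
score_columns = ['Math_Score', 'English_Score']

exam_filled = exam_demo.copy()
exam_filled[score_columns] = exam_demo[score_columns].fillna(
    exam_demo[score_columns].mean()
)

print(exam_filled)
```

Working on a copy keeps the original data available for comparison.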

#### 5. Real-world Scenario:
Consider a scenario where you have a dataset containing information about customer orders in an e-commerce platform. The dataset includes order IDs, product names, quantities, prices, and shipping dates. Due to various reasons such as system glitches or customer actions, some data is missing. Let's explore how to identify and handle missing values in this context.

In [10]:
import pandas as pd
import numpy as np

# Sample e-commerce order data with missing values
order_data = {
    'OrderID': [101, 102, np.nan, 104, 105],
    'Product': ['Laptop', 'Smartphone', 'Tablet', np.nan, 'Camera'],
    'Quantity': [2, 1, np.nan, 2, 1],
    'Price': [1200, 800, 300, np.nan, 700],
    'Shipping_Date': ['2022-01-01', '2022-01-02', np.nan, '2022-01-03', '2022-01-03'],
}

# Creating a DataFrame from the order data
df_orders = pd.DataFrame(order_data)

# Displaying the original DataFrame
print("Original Order DataFrame:")
print(df_orders)

# Identifying missing values
missing_values = df_orders.isna()
print("\nMissing Values:")
print(missing_values)

# Handling missing values by dropping rows with missing OrderID and filling Price and Shipping_Date
df_orders_cleaned = df_orders.dropna(subset=['OrderID']).fillna({'Price': df_orders['Price'].mean(), 'Shipping_Date': '2022-01-01'})

# Displaying the DataFrame after handling missing values
print("\nDataFrame after Handling Missing Values:")
print(df_orders_cleaned)


Original Order DataFrame:
   OrderID     Product  Quantity   Price Shipping_Date
0    101.0      Laptop       2.0  1200.0    2022-01-01
1    102.0  Smartphone       1.0   800.0    2022-01-02
2      NaN      Tablet       NaN   300.0           NaN
3    104.0         NaN       2.0     NaN    2022-01-03
4    105.0      Camera       1.0   700.0    2022-01-03

Missing Values:
   OrderID  Product  Quantity  Price  Shipping_Date
0    False    False     False  False          False
1    False    False     False  False          False
2     True    False      True  False           True
3    False     True     False   True          False
4    False    False     False  False          False

DataFrame after Handling Missing Values:
   OrderID     Product  Quantity   Price Shipping_Date
0    101.0      Laptop       2.0  1200.0    2022-01-01
1    102.0  Smartphone       1.0   800.0    2022-01-02
3    104.0         NaN       2.0   750.0    2022-01-03
4    105.0      Camera       1.0   700.0    2022-01-03

#### Explanation:

This code handles missing values in `df_orders` by dropping rows with a missing 'OrderID' and then filling missing values in the 'Price' and 'Shipping_Date' columns.

Let's break down the code:

```python
# Handling missing values by dropping rows with missing OrderID
# and filling Price and Shipping_Date
df_orders_cleaned = df_orders.dropna(subset=['OrderID']).fillna({'Price': df_orders['Price'].mean(), 'Shipping_Date': '2022-01-01'})
```

1. **`df_orders.dropna(subset=['OrderID'])`**:
   - This part drops rows where the 'OrderID' column has missing values (`NaN`). Here that removes row 2.

2. **`.fillna({'Price': df_orders['Price'].mean(), 'Shipping_Date': '2022-01-01'})`**:
   - After dropping rows with missing 'OrderID', this part fills missing values only in the columns named in the dictionary. Other columns are left untouched.
   - For the 'Price' column, it fills missing values with `df_orders['Price'].mean()`. This mean is computed on the original `df_orders`, so it still includes the price of the dropped row: (1200 + 800 + 300 + 700) / 4 = 750.
   - For the 'Shipping_Date' column, it fills missing values with the constant value '2022-01-01'. In this dataset no 'Shipping_Date' is missing once row 2 has been dropped, so this fill has no visible effect.

So, the resulting DataFrame (`df_orders_cleaned`) has the row with the missing 'OrderID' removed and the missing 'Price' in row 3 filled with 750.0. The missing 'Product' in row 3 stays `NaN` because no fill value was given for that column.
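
The one-line version is compact, but a step-by-step variant can make each decision explicit. The sketch below is one possible alternative, not the only correct approach: it computes the mean price after dropping invalid rows, fills the missing product with an assumed placeholder label (`'Unknown'`), and converts the dates to a datetime type.

```python
import numpy as np
import pandas as pd

# Same sample order data as in the cell above
order_data = {
    'OrderID': [101, 102, np.nan, 104, 105],
    'Product': ['Laptop', 'Smartphone', 'Tablet', np.nan, 'Camera'],
    'Quantity': [2, 1, np.nan, 2, 1],
    'Price': [1200, 800, 300, np.nan, 700],
    'Shipping_Date': ['2022-01-01', '2022-01-02', np.nan, '2022-01-03', '2022-01-03'],
}
df_orders = pd.DataFrame(order_data)

# Step 1: an order without an ID cannot be identified, so drop it
valid_orders = df_orders.dropna(subset=['OrderID']).copy()

# Step 2: compute the mean price from the remaining (valid) orders only
price_mean = valid_orders['Price'].mean()
valid_orders['Price'] = valid_orders['Price'].fillna(price_mean)

# Step 3: mark unknown products explicitly ('Unknown' is an assumed placeholder)
valid_orders['Product'] = valid_orders['Product'].fillna('Unknown')

# Step 4: restore sensible types now that no values are missing
valid_orders['OrderID'] = valid_orders['OrderID'].astype(int)
valid_orders['Shipping_Date'] = pd.to_datetime(valid_orders['Shipping_Date'])

print(valid_orders)
```

Note that the filled price differs from the one-line version because the dropped row no longer contributes to the mean.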

#### 6. Considerations or Peculiarities:

- **Reasons for Missingness:**
  - Understand the reasons for missing values. In this example, missing OrderID might be due to a system error, missing Product might be due to a new product without details, and missing Price and Shipping_Date might be due to incomplete data.

- **Impact on Analysis:**
  - Consider how missing values might impact your analysis. Dropping rows or filling missing values should align with the analysis goals.

- **Domain Knowledge:**
  - Domain knowledge is crucial for deciding how to handle missing values appropriately. For example, filling a missing Price with the mean might not be suitable if prices vary significantly.
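
When prices vary a lot between kinds of products, one option is to fill each missing price with the mean of its own group instead of the overall mean. This is a minimal sketch; the `products` DataFrame and its `Category` column are hypothetical and do not exist in the order data above.

```python
import numpy as np
import pandas as pd

# Hypothetical product data with a (made-up) Category column
products = pd.DataFrame({
    'Category': ['Laptop', 'Laptop', 'Laptop', 'Cable', 'Cable', 'Cable'],
    'Price': [1000.0, np.nan, 1200.0, 10.0, 14.0, np.nan],
})

# Overall mean mixes laptops and cables, so it fits neither group well
overall_mean = products['Price'].mean()

# Mean price of each row's own category, aligned with the original index
group_mean = products.groupby('Category')['Price'].transform('mean')

# Fill each missing price with the mean of its category
products['Price_filled'] = products['Price'].fillna(group_mean)

print(f"Overall mean: {overall_mean}")
print(products)
```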

#### 7. Common Mistakes:

- **Ignoring Missing Values:**
  - Ignoring missing values without assessing their impact on analyses can lead to biased results.

- **Unintended Dropping:**
  - Unintentionally dropping rows or columns without considering the reasons for missingness may result in data loss and incomplete analyses.

- **Inconsistent Handling:**
  - Inconsistently handling missing values across different columns or datasets can introduce inconsistencies in your analysis.
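
One way to guard against unintended dropping is to report how many rows a `dropna()` call removes. The helper below, `drop_missing_with_report`, is a hypothetical convenience function written for this illustration; it is not part of Pandas.

```python
import numpy as np
import pandas as pd


def drop_missing_with_report(df, subset=None):
    """Drop rows with missing values and print how many rows were removed.

    Hypothetical helper for illustration; `subset` is passed to dropna().
    """
    cleaned = df.dropna(subset=subset)
    n_dropped = len(df) - len(cleaned)
    share = n_dropped / len(df) if len(df) > 0 else 0.0
    print(f"Dropped {n_dropped} of {len(df)} rows ({share:.0%})")
    return cleaned


# Hypothetical data: one missing id, one missing measurement
demo = pd.DataFrame({'id': [1, np.nan, 3, 4], 'x': [np.nan, 2, 3, 4]})

cleaned_by_id = drop_missing_with_report(demo, subset=['id'])
cleaned_any = drop_missing_with_report(demo)
```

Reusing the same helper across columns and datasets also helps keep the handling consistent.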

Handling missing values requires careful consideration and should be aligned with the overall data analysis goals. It's essential to understand the dataset's context and choose appropriate strategies based on the nature of the missing data.
