# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#2: DataFrames in Depth`**
4. **Creating DataFrames**
   - From lists, dictionaries, and arrays
   - Reading data from CSV, Excel, and other formats

5. **Basic DataFrame Operations**
   - Inspecting the DataFrame
   - Indexing and selecting data
   - Descriptive statistics

6. **Data Cleaning and Handling Missing Data**
   - Handling missing values
   - Dropping or filling missing values
   - Removing duplicates

### **`6. Data Cleaning and Handling Missing Data`**

#### **`Handling Missing Values`**

#### Importance of Identifying and Handling Missing Values in a DataFrame:

Missing values, typically represented as `NaN` (Not a Number) in pandas, are common in real-world datasets. Identifying and handling them properly is essential for accurate analysis; ignoring them can bias results and lead to incorrect interpretations. Here is why handling missing values matters:

1. **Data Accuracy:** Missing values can distort summary statistics, such as mean and standard deviation, leading to inaccurate insights about the dataset.

2. **Model Performance:** If missing values are not addressed, they can adversely impact machine learning models, causing biased predictions and reduced model performance.

3. **Data Visualization:** Visualizations may not accurately represent the distribution of data when missing values are present, affecting the interpretation of results.

4. **Statistical Analyses:** Many statistical analyses and tests assume complete data. Missing values can compromise the validity of statistical results and significance testing.
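
As a small illustration of the first point, the sketch below uses a made-up Series (`scores` is a hypothetical name, not part of any dataset in this module) to show how the treatment of `NaN` changes a mean. pandas skips missing values by default, so the usual risk is a statistic computed on fewer or altered values than you assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four observations, one of them missing
scores = pd.Series([1.0, 2.0, np.nan, 5.0])

mean_skipping_nan = scores.mean()            # NaN ignored: (1 + 2 + 5) / 3
mean_with_nan = scores.mean(skipna=False)    # NaN propagates into the result
mean_zero_filled = scores.fillna(0).mean()   # NaN treated as 0: (1 + 2 + 0 + 5) / 4

print("Mean, NaN skipped:   ", mean_skipping_nan)
print("Mean, NaN propagated:", mean_with_nan)
print("Mean, NaN set to 0:  ", mean_zero_filled)
```

The three results differ, which is why it helps to decide explicitly how missing values should be treated before reporting a statistic.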

#### Methods for Handling Missing Values:

1. **Identifying Missing Values:**
   - **`isna()` and `notna()`:**

In [3]:
import pandas as pd
import numpy as np

# Creating a DataFrame with missing values
data = {
    'A': [1, 2, np.nan, 4, 5, 6, 7],
    'B': [5, np.nan, 7, 8, np.nan, 10, 11],
    'C': [9, 10, 11, np.nan, 13, 14, 15],
    'D': [14, np.nan, 16, 17, 18, np.nan, 20],
}

# Appending five more values to each column's list before building the DataFrame
for _ in range(5):
    data['A'].append(np.nan)
    data['B'].append(np.random.randint(1, 100))  # Random integers as additional values
    data['C'].append(np.random.choice(['apple', 'banana', 'orange']))  # Random strings; column 'C' becomes mixed-type (object)
    data['D'].append(np.random.uniform(0, 1))  # Random floats as additional values

df = pd.DataFrame(data)

# Display the DataFrame
print("Original DataFrame:")
print(df)

# Check for missing values
missing_values = df.isna()
not_missing_values = df.notna()

# Display the DataFrames with True for missing values
print("\nDataFrame with True for Missing Values:")
print(missing_values)

# Display the opposite DataFrame (True for non-missing values)
print("\nDataFrame with True for Non-Missing Values:")
print(not_missing_values)


Original DataFrame:
      A     B       C          D
0   1.0   5.0       9  14.000000
1   2.0   NaN      10        NaN
2   NaN   7.0      11  16.000000
3   4.0   8.0     NaN  17.000000
4   5.0   NaN      13  18.000000
5   6.0  10.0      14        NaN
6   7.0  11.0      15  20.000000
7   NaN  79.0  banana   0.891187
8   NaN  47.0  banana   0.708668
9   NaN  73.0  orange   0.544267
10  NaN  30.0   apple   0.052271
11  NaN  13.0  orange   0.222462

DataFrame with True for Missing Values:
        A      B      C      D
0   False  False  False  False
1   False   True  False   True
2    True  False  False  False
3   False  False   True  False
4   False   True  False  False
5   False  False  False   True
6   False  False  False  False
7    True  False  False  False
8    True  False  False  False
9    True  False  False  False
10   True  False  False  False
11   True  False  False  False

DataFrame with True for Non-Missing Values:
        A      B      C      D
0    True   True   True   True


A compact reference for the two calls used above:

```python
# Check for missing values
df.isna()   # Returns a DataFrame of the same shape with True for missing values
df.notna()  # Returns the opposite of isna()
```

2. **Handling Missing Values:**
   - **`fillna()`:**
     ```python
     # Fill missing values with a specified value or a calculated value
     df.fillna(0)  # Fill with a constant value (0 here)
     df.fillna(df.mean(numeric_only=True))  # Fill numeric columns with their column mean
     ```

   - **Dropping Missing Values:**
     ```python
     # Drop rows or columns containing missing values
     df.dropna()  # Drop rows with any missing values
     df.dropna(axis=1)  # Drop columns with any missing values
     ```

   - **Interpolation:**
     ```python
     # Interpolate missing values using various methods (linear, polynomial, etc.)
     df.interpolate()
     ```
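
Before choosing among these methods, it usually helps to know how much is missing and where. The following is a minimal sketch on a hypothetical frame (`df_demo` and its values are made up for illustration); it only combines `isna()` with `sum()` and `any()`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration
df_demo = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, np.nan],
    'B': [np.nan, 2.0, 3.0, 4.0],
    'C': [1.0, 2.0, 3.0, 4.0],
})

# Number of missing values in each column
missing_per_column = df_demo.isna().sum()

# Rows that contain at least one missing value
rows_with_missing = df_demo[df_demo.isna().any(axis=1)]

# Total number of missing cells in the whole frame
total_missing = int(df_demo.isna().sum().sum())

print(missing_per_column)
print(rows_with_missing)
print("Total missing cells:", total_missing)
```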

#### Considerations and Best Practices:

- **Context Matters:** The method chosen to handle missing values depends on the nature of the data and the reason for missingness. Consider the context before applying a specific strategy.

- **Impact on Analysis:** Understand how the chosen method might impact your analysis. For example, filling missing values with the mean could introduce bias if missingness is not random.

- **Visualization:** Visualize the distribution of missing values using tools like heatmaps to better understand patterns of missingness.

- **Documentation:** Clearly document the chosen strategy for handling missing values in your analysis to ensure transparency and reproducibility.
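
One concrete side effect of mean filling is worth seeing in numbers: the column mean stays the same, but the spread shrinks because every filled value sits exactly at the centre. The sketch below uses a hypothetical Series (`values` is an illustrative name) and is not a statement about any particular dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with two missing entries
values = pd.Series([10.0, 20.0, np.nan, 40.0, np.nan, 60.0])

std_observed = values.std()             # spread of the four observed values
filled = values.fillna(values.mean())   # mean imputation
std_after_fill = filled.std()           # spread after imputation

print("Mean before / after:", values.mean(), filled.mean())
print("Std before / after: ", std_observed, std_after_fill)
```

If downstream steps depend on variance (for example, standardisation or significance tests), this shrinkage is one reason to document the imputation choice.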

#### Conclusion:

Properly handling missing values is a critical step in the data cleaning process. It ensures the integrity of analyses and models, leading to more reliable and accurate results. Familiarizing yourself with Pandas methods like `isna()`, `notna()`, and `fillna()` empowers you to make informed decisions when dealing with missing data in your DataFrame.

#### Example:
Consider a scenario where we have a DataFrame containing information about students' exam scores in different subjects. The dataset has missing values that need to be handled, and we'll demonstrate the use of `isna()`, `notna()`, and `fillna()` to address these missing values.

In [1]:
import pandas as pd
import numpy as np

# Sample student exam data with missing values
exam_data = {
    'StudentID': [1, 2, 3, 4, 5],
    'Math_Score': [85, np.nan, 78, 92, 88],
    'English_Score': [75, 85, np.nan, 88, 92],
    'Physics_Score': [90, 78, 85, np.nan, 94],
    'Chemistry_Score': [82, 88, 90, 76, np.nan],
}

# Creating a DataFrame from the exam data
df_exams = pd.DataFrame(exam_data)

# Displaying the original DataFrame
print("Original DataFrame:")
print(df_exams)

# Identifying missing values
missing_values = df_exams.isna()
print("\nMissing Values:")
print(missing_values)

# Filling missing values with the mean of each column
mean_filled_df = df_exams.fillna(df_exams.mean())

# Displaying the DataFrame after handling missing values
print("\nDataFrame after Filling Missing Values with Mean:")
print(mean_filled_df)


Original DataFrame:
   StudentID  Math_Score  English_Score  Physics_Score  Chemistry_Score
0          1        85.0           75.0           90.0             82.0
1          2         NaN           85.0           78.0             88.0
2          3        78.0            NaN           85.0             90.0
3          4        92.0           88.0            NaN             76.0
4          5        88.0           92.0           94.0              NaN

Missing Values:
   StudentID  Math_Score  English_Score  Physics_Score  Chemistry_Score
0      False       False          False          False            False
1      False        True          False          False            False
2      False       False           True          False            False
3      False       False          False           True            False
4      False       False          False          False             True

DataFrame after Filling Missing Values with Mean:
   StudentID  Math_Score  English_Score  Physics

In the above example:

1. **Identifying Missing Values:**
   - We use `isna()` to create a DataFrame of the same shape as the original, with `True` values where missing values are present.

2. **Handling Missing Values:**
   - We use `fillna()` to fill missing values with the mean of each column.

3. **Result:**
   - The final DataFrame (`mean_filled_df`) has missing values filled with the mean of each respective column.

This example showcases the importance of identifying and handling missing values and demonstrates a practical approach using Pandas methods. Adjust the code based on your specific dataset and requirements.
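
To check the fill by hand, the sketch below recomputes the per-column means that `fillna()` uses. It rebuilds only the four score columns from the example (`StudentID` is left out here, since it has no missing values); `fill_values` is an illustrative variable name.

```python
import numpy as np
import pandas as pd

# Score columns copied from the exam example above
df_exams = pd.DataFrame({
    'Math_Score': [85, np.nan, 78, 92, 88],
    'English_Score': [75, 85, np.nan, 88, 92],
    'Physics_Score': [90, 78, 85, np.nan, 94],
    'Chemistry_Score': [82, 88, 90, 76, np.nan],
})

# Column means, computed over the non-missing values only
fill_values = df_exams.mean()

# Each NaN is replaced by the mean of its own column
mean_filled_df = df_exams.fillna(fill_values)

print(fill_values)
print(mean_filled_df)
```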

#### Real-world Scenario:
Consider a scenario where you have a dataset containing information about customer orders in an e-commerce platform. The dataset includes order IDs, product names, quantities, prices, and shipping dates. Due to various reasons such as system glitches or customer actions, some data is missing. Let's explore how to identify and handle missing values in this context.

In [2]:
import pandas as pd
import numpy as np

# Sample e-commerce order data with missing values
order_data = {
    'OrderID': [101, 102, np.nan, 104, 105],
    'Product': ['Laptop', 'Smartphone', 'Tablet', np.nan, 'Camera'],
    'Quantity': [2, 1, np.nan, 2, 1],
    'Price': [1200, 800, 300, np.nan, 700],
    'Shipping_Date': ['2022-01-01', '2022-01-02', np.nan, '2022-01-03', '2022-01-03'],
}

# Creating a DataFrame from the order data
df_orders = pd.DataFrame(order_data)

# Displaying the original DataFrame
print("Original Order DataFrame:")
print(df_orders)

# Identifying missing values
missing_values = df_orders.isna()
print("\nMissing Values:")
print(missing_values)

# Handling missing values: drop rows with a missing OrderID, then fill Price and Shipping_Date
df_orders_cleaned = (
    df_orders
    .dropna(subset=['OrderID'])
    .fillna({'Price': df_orders['Price'].mean(), 'Shipping_Date': '2022-01-01'})
)

# Displaying the DataFrame after handling missing values
print("\nDataFrame after Handling Missing Values:")
print(df_orders_cleaned)


Original Order DataFrame:
   OrderID     Product  Quantity   Price Shipping_Date
0    101.0      Laptop       2.0  1200.0    2022-01-01
1    102.0  Smartphone       1.0   800.0    2022-01-02
2      NaN      Tablet       NaN   300.0           NaN
3    104.0         NaN       2.0     NaN    2022-01-03
4    105.0      Camera       1.0   700.0    2022-01-03

Missing Values:
   OrderID  Product  Quantity  Price  Shipping_Date
0    False    False     False  False          False
1    False    False     False  False          False
2     True    False      True  False           True
3    False     True     False   True          False
4    False    False     False  False          False

DataFrame after Handling Missing Values:
   OrderID     Product  Quantity   Price Shipping_Date
0    101.0      Laptop       2.0  1200.0    2022-01-01
1    102.0  Smartphone       1.0   800.0    2022-01-02
3    104.0         NaN       2.0   750.0    2022-01-03
4    105.0      Camera       1.0   700.0    2022-01-03

#### Considerations or Peculiarities:

- **Reasons for Missingness:**
  - Understand the reasons for missing values. In this example, missing OrderID might be due to a system error, missing Product might be due to a new product without details, and missing Price and Shipping_Date might be due to incomplete data.

- **Impact on Analysis:**
  - Consider how missing values might impact your analysis. Dropping rows or filling missing values should align with the analysis goals.

- **Domain Knowledge:**
  - Domain knowledge is crucial for deciding how to handle missing values appropriately. For example, filling a missing Price with the mean might not be suitable if prices vary significantly.
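
One way to act on the domain-knowledge point is to impute within groups rather than across the whole column. The sketch below is an assumption-laden illustration: the `Category` column and all values are hypothetical (the order data above has no such column), and the group median is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical prices in two very different price ranges
df_prices = pd.DataFrame({
    'Category': ['Laptop', 'Laptop', 'Laptop', 'Accessory', 'Accessory', 'Accessory'],
    'Price': [1200.0, np.nan, 1000.0, 20.0, 30.0, np.nan],
})

# Naive approach: one overall mean for every missing price
overall_mean_fill = df_prices['Price'].fillna(df_prices['Price'].mean())

# Group-aware approach: median price of the row's own category
group_median = df_prices.groupby('Category')['Price'].transform('median')
group_fill = df_prices['Price'].fillna(group_median)

print(pd.DataFrame({'overall_mean': overall_mean_fill, 'group_median': group_fill}))
```

Here the overall mean assigns the same price to a laptop and an accessory, while the group median keeps each filled value within its category's range.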

#### Common Mistakes:

- **Ignoring Missing Values:**
  - Ignoring missing values without assessing their impact on analyses can lead to biased results.

- **Unintended Dropping:**
  - Unintentionally dropping rows or columns without considering the reasons for missingness may result in data loss and incomplete analyses.

- **Inconsistent Handling:**
  - Inconsistently handling missing values across different columns or datasets can introduce inconsistencies in your analysis.

Handling missing values requires careful consideration and should be aligned with the overall data analysis goals. It's essential to understand the dataset's context and choose appropriate strategies based on the nature of the missing data.


#### **`Dropping or Filling Missing Values`**

#### Decision-Making Process:

1. **Dropping Missing Values:**
   - **Context:** Dropping missing values is suitable when the missingness is random, and removing incomplete records doesn't introduce bias or impact the analysis significantly. It's a pragmatic approach when the missing data is negligible compared to the dataset size.

   - **Example:**
     ```python
     # Drop rows with any missing values
     df_dropped = df.dropna()
     ```

2. **Filling Missing Values:**
   - **Context:** Filling missing values is appropriate when retaining the incomplete records is crucial, and a reasonable estimation can be made for the missing values. This is common when dealing with time-series data, where continuity matters.

   - **Example:**
     ```python
     # Fill all missing values in the DataFrame with a constant value
     df_filled_constant = df.fillna(value=0)
     ```

   - **Example:**
     ```python
     # Fill missing values in numeric columns with the mean of each column
     df_filled_mean = df.fillna(df.mean(numeric_only=True))
     ```

   - **Example:**
     ```python
     # Forward fill missing values in a DataFrame
     df_forward_filled = df.ffill()
     ```

   - **Example:**
     ```python
     # Backward fill missing values in a DataFrame
     df_backward_filled = df.bfill()
     ```

   - **Example:**
     ```python
     # Interpolate missing values using linear interpolation
     df_interpolated_linear = df.interpolate(method='linear')
     ```
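
To compare the fill strategies listed above side by side, here is a minimal sketch on a hypothetical four-point Series (`readings` is an illustrative name). It shows what each method puts into the same two-value gap.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with a gap of two values
readings = pd.Series([10.0, np.nan, np.nan, 40.0])

comparison = pd.DataFrame({
    'original': readings,
    'ffill': readings.ffill(),                        # repeat the last known value
    'bfill': readings.bfill(),                        # use the next known value
    'linear': readings.interpolate(method='linear'),  # evenly spaced between neighbours
})

print(comparison)
```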

#### Considerations:

- **Data Nature:**
  - Consider the nature of the data. For ordered data such as time series, forward filling, backward filling, or interpolation is often suitable; for unordered numeric data, the mean or median is a more common choice.

- **Impact on Analysis:**
  - Evaluate how the chosen method for handling missing values might impact subsequent analyses. Ensure that the imputation method aligns with the overall analysis goals.

- **Domain Knowledge:**
  - Leverage domain knowledge to make informed decisions. Some missing values may be inherently unfillable due to the nature of the data.

#### Conclusion:

The decision between dropping or filling missing values depends on the specific characteristics of the data and the analysis goals. Dropping values is a straightforward approach but may lead to data loss. Filling values is a more nuanced process, requiring careful consideration of the data's nature and the impact on downstream analyses. Experiment with different strategies and choose the one that best fits the context of your dataset.

#### Example:

Let's consider a scenario where we have a DataFrame representing monthly sales data for a product. The dataset has missing values in the 'Sales' column, and we need to decide whether to drop or fill those missing values based on the context.

In [3]:
import pandas as pd
import numpy as np

# Sample monthly sales data with missing values
sales_data = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'Sales': [100, 120, np.nan, 150, np.nan, 180],
}

# Creating a DataFrame from the sales data
df_sales = pd.DataFrame(sales_data)

# Displaying the original DataFrame
print("Original Sales DataFrame:")
print(df_sales)

# Decision 1: Dropping Missing Values
df_dropped = df_sales.dropna()

# Displaying the DataFrame after dropping missing values
print("\nDataFrame after Dropping Missing Values:")
print(df_dropped)

# Decision 2: Filling Missing Values with Forward Fill
df_filled_forward = df_sales.ffill()

# Displaying the DataFrame after forward filling missing values
print("\nDataFrame after Forward Filling Missing Values:")
print(df_filled_forward)

# Decision 3: Filling Missing Values with Mean
df_filled_mean = df_sales.fillna(df_sales['Sales'].mean())

# Displaying the DataFrame after filling missing values with mean
print("\nDataFrame after Filling Missing Values with Mean:")
print(df_filled_mean)


Original Sales DataFrame:
  Month  Sales
0   Jan  100.0
1   Feb  120.0
2   Mar    NaN
3   Apr  150.0
4   May    NaN
5   Jun  180.0

DataFrame after Dropping Missing Values:
  Month  Sales
0   Jan  100.0
1   Feb  120.0
3   Apr  150.0
5   Jun  180.0

DataFrame after Forward Filling Missing Values:
  Month  Sales
0   Jan  100.0
1   Feb  120.0
2   Mar  120.0
3   Apr  150.0
4   May  150.0
5   Jun  180.0

DataFrame after Filling Missing Values with Mean:
  Month  Sales
0   Jan  100.0
1   Feb  120.0
2   Mar  137.5
3   Apr  150.0
4   May  137.5
5   Jun  180.0


In this example:

1. **Dropping Missing Values:**
   - We use `dropna()` to remove rows with any missing values. This might be suitable if missing values are limited and their removal doesn't significantly affect the analysis.

2. **Filling Missing Values with Forward Fill:**
   - We use `ffill()` to carry the previous month's sales forward into each gap. This approach is reasonable when values change slowly from one period to the next, so the last observed value is a fair estimate.

3. **Filling Missing Values with Mean:**
   - We use `fillna()` with the mean of the 'Sales' column to impute missing values. This approach is suitable when we want to retain all rows and fill missing values with a representative value.

Adjust the code based on the specific characteristics of your dataset and the analysis goals. Choosing between dropping or filling missing values should be driven by the dataset's context and the impact on subsequent analyses.

#### Real-world Scenario:

Imagine you are managing a dataset that tracks monthly sales data for a retail business. The dataset includes information such as the month, product category, sales quantity, and revenue. However, due to occasional reporting errors or data collection issues, there are missing values in the dataset. Let's explore how to handle these missing values using Pandas.

In [4]:
import pandas as pd
import numpy as np

# Sample monthly sales data with missing values
sales_data = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'Category': ['Electronics', 'Clothing', np.nan, 'Electronics', np.nan, 'Clothing'],
    'Sales_Quantity': [120, 150, np.nan, 200, np.nan, 180],
    'Revenue': [12000, np.nan, 18000, np.nan, 25000, 22000],
}

# Creating a DataFrame from the sales data
df_sales = pd.DataFrame(sales_data)

# Displaying the original DataFrame
print("Original Monthly Sales DataFrame:")
print(df_sales)

# Handling Missing Values:
# Decision 1: Dropping rows with any missing values
df_dropped = df_sales.dropna()

# Decision 2: Filling missing values with mean for numerical columns
df_filled_mean = df_sales.fillna({'Sales_Quantity': df_sales['Sales_Quantity'].mean(), 'Revenue': df_sales['Revenue'].mean()})

# Displaying the DataFrames after handling missing values
print("\nDataFrame after Dropping Missing Values:")
print(df_dropped)

print("\nDataFrame after Filling Missing Values with Mean:")
print(df_filled_mean)


Original Monthly Sales DataFrame:
  Month     Category  Sales_Quantity  Revenue
0   Jan  Electronics           120.0  12000.0
1   Feb     Clothing           150.0      NaN
2   Mar          NaN             NaN  18000.0
3   Apr  Electronics           200.0      NaN
4   May          NaN             NaN  25000.0
5   Jun     Clothing           180.0  22000.0

DataFrame after Dropping Missing Values:
  Month     Category  Sales_Quantity  Revenue
0   Jan  Electronics           120.0  12000.0
5   Jun     Clothing           180.0  22000.0

DataFrame after Filling Missing Values with Mean:
  Month     Category  Sales_Quantity  Revenue
0   Jan  Electronics           120.0  12000.0
1   Feb     Clothing           150.0  19250.0
2   Mar          NaN           162.5  18000.0
3   Apr  Electronics           200.0  19250.0
4   May          NaN           162.5  25000.0
5   Jun     Clothing           180.0  22000.0


#### Considerations or Peculiarities:

- **Imputation Strategy:**
  - Choosing between dropping and filling depends on the impact on analysis. Dropping may lead to loss of important information, while filling may introduce bias if not done carefully.

- **Context of Data:**
  - Understand the context of your data. For example, filling missing revenue values with the mean might be reasonable, but for product categories, it may not make sense.

- **Column-specific Strategies:**
  - Different columns may require different strategies. For numeric columns, mean or median filling could be appropriate, while for categorical columns, forward fill or mode filling might be more suitable.
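
A column-specific strategy can be sketched as follows. The frame `df_mixed` and its values are hypothetical, chosen so that the categorical column has a single clear mode and the numeric column has an outlier; with tied modes, `mode()` returns several values and `[0]` simply picks the first.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one categorical and one numeric column, each with a gap
df_mixed = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', np.nan, 'Electronics', 'Electronics'],
    'Revenue': [12000.0, np.nan, 18000.0, 100000.0, 22000.0],
})

# Most frequent category (first one if several are tied)
category_mode = df_mixed['Category'].mode()[0]

# Median is less sensitive to the 100000 outlier than the mean
revenue_median = df_mixed['Revenue'].median()

# A dict passed to fillna() applies a different fill value per column
df_mixed_filled = df_mixed.fillna({'Category': category_mode, 'Revenue': revenue_median})

print(df_mixed_filled)
```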

#### Common Mistakes:

- **Unintended Data Loss:**
  - Developers might drop rows without considering the impact on the dataset's integrity. This can lead to unintended data loss, especially if the missing values are not randomly distributed.

- **Inconsistent Imputation:**
  - Filling missing values inconsistently across columns or datasets can introduce inconsistencies in the dataset.

- **Overlooking Context:**
  - Filling missing values without understanding the context of the data and the reasons for missingness may lead to inaccurate imputations.

Handling missing values is a critical aspect of data preprocessing. It requires thoughtful consideration of the dataset's context, the nature of missingness, and the impact on downstream analyses. Developers should choose strategies that align with the goals of their analysis and avoid common pitfalls that can compromise data quality.

#### **`Removing Duplicates in a DataFrame`**

#### Significance of Identifying and Removing Duplicate Rows:

**1. Data Accuracy:**
   - Duplicate rows can distort analyses by inflating counts, averages, or other summary statistics. Removing duplicates ensures the accuracy of calculated metrics.

**2. Consistent Results:**
   - Duplicates can lead to inconsistencies in results, especially in scenarios where aggregated data or distinct counts are essential.

**3. Efficient Memory Usage:**
   - Datasets with duplicate rows consume more memory. Eliminating duplicates optimizes memory usage and enhances computational efficiency.

**4. Meaningful Insights:**
   - Duplicate rows may not contribute meaningful insights but can skew results. Removing them ensures a cleaner dataset for analysis.

#### Examples of Removing Duplicates:

**1. Identifying Duplicate Rows:**
```python
# Check for duplicate rows based on all columns
duplicates = df.duplicated()

# Check for duplicate rows based on specific columns
duplicates_specific_columns = df.duplicated(subset=['Column1', 'Column2'])
```

**2. Removing Duplicate Rows:**
```python
# Remove all duplicate rows, keeping the first occurrence
df_no_duplicates = df.drop_duplicates()

# Remove duplicate rows based on specific columns, keeping the first occurrence
df_no_duplicates_specific_columns = df.drop_duplicates(subset=['Column1', 'Column2'])
```

#### Considerations:

- **Column Selection:**
  - Consider the columns relevant to duplicate identification. In some cases, duplicates may only be duplicates when considering specific columns.

- **Order Matters:**
  - By default (`keep='first'`), `drop_duplicates()` retains the first occurrence and removes subsequent duplicates. Ensure the row order aligns with your analysis goals.

- **In-Place vs. New DataFrame:**
  - Decide whether to modify the existing DataFrame in-place or create a new one. Choose based on the need to retain the original data.
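
The effect of the `keep` parameter can be seen in a small sketch. The frame `df_events` is hypothetical; it assumes rows are already in the order you care about (for example, oldest first).

```python
import pandas as pd

# Hypothetical order events; OrderID 101 appears twice
df_events = pd.DataFrame({
    'OrderID': [101, 102, 101, 103],
    'Status': ['placed', 'placed', 'shipped', 'placed'],
})

# Default keep='first': the earliest row per OrderID survives
first_kept = df_events.drop_duplicates(subset=['OrderID'])

# keep='last': the latest row per OrderID survives
last_kept = df_events.drop_duplicates(subset=['OrderID'], keep='last')

# keep=False: every row that has a duplicate is removed
none_kept = df_events.drop_duplicates(subset=['OrderID'], keep=False)

print(first_kept)
print(last_kept)
print(none_kept)
```

Each call returns a new DataFrame, so `df_events` itself still holds all four rows afterwards.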

#### Common Mistakes:

- **Ignoring Specific Columns:**
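  - Without `subset`, duplicate checks compare every column, so rows that repeat only in the key columns (for example, the same OrderID with a different Quantity) are not flagged. Conversely, a `subset` that is too narrow can remove rows that differ in columns you care about.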

- **Overlooking Order:**
  - If retaining the first occurrence is essential, ensure that the DataFrame is sorted appropriately before using `drop_duplicates()`.

- **Inconsistent Usage:**
  - Inconsistently applying duplicate removal across different datasets or analyses can lead to inconsistent results.

#### Conclusion:

Identifying and removing duplicate rows is a crucial step in data cleaning and preprocessing. It enhances the accuracy of analyses, ensures meaningful insights, and optimizes memory usage. Developers should carefully consider the columns involved, the order of removal, and whether to modify the DataFrame in-place when handling duplicates.

#### Example:

Let's consider a scenario where you have a DataFrame containing data on customer orders, and due to data entry errors or system glitches, there are duplicate entries. We'll explore how to identify and remove these duplicate rows using Pandas.

In [5]:
import pandas as pd

# Sample order data with duplicate entries
order_data = {
    'OrderID': [101, 102, 101, 103, 104, 102],
    'Product': ['Laptop', 'Smartphone', 'Laptop', 'Tablet', 'Camera', 'Smartphone'],
    'Quantity': [2, 1, 1, 3, 1, 1],
    'Total_Price': [1200, 800, 1200, 450, 700, 800],
}

# Creating a DataFrame from the order data
df_orders = pd.DataFrame(order_data)

# Displaying the original DataFrame
print("Original Order DataFrame:")
print(df_orders)

# Identifying Duplicate Rows
duplicates = df_orders.duplicated()

# Displaying duplicate rows
print("\nDuplicate Rows:")
print(df_orders[duplicates])

# Removing Duplicate Rows
df_no_duplicates = df_orders.drop_duplicates()

# Displaying the DataFrame after removing duplicates
print("\nDataFrame after Removing Duplicates:")
print(df_no_duplicates)


Original Order DataFrame:
   OrderID     Product  Quantity  Total_Price
0      101      Laptop         2         1200
1      102  Smartphone         1          800
2      101      Laptop         1         1200
3      103      Tablet         3          450
4      104      Camera         1          700
5      102  Smartphone         1          800

Duplicate Rows:
   OrderID     Product  Quantity  Total_Price
5      102  Smartphone         1          800

DataFrame after Removing Duplicates:
   OrderID     Product  Quantity  Total_Price
0      101      Laptop         2         1200
1      102  Smartphone         1          800
2      101      Laptop         1         1200
3      103      Tablet         3          450
4      104      Camera         1          700


In this example:

1. **Identifying Duplicate Rows:**
   - We use `duplicated()` to identify duplicate rows based on all columns. The result is a boolean series indicating which rows are duplicates.

2. **Displaying Duplicate Rows:**
   - We use boolean indexing to display the rows that are identified as duplicates.

3. **Removing Duplicate Rows:**
   - We use `drop_duplicates()` to remove duplicate rows, keeping the first occurrence of each unique row.

4. **Displaying Result:**
   - We display the DataFrame after removing duplicates to see the cleaned dataset.

Adjust the code based on your specific dataset and analysis goals. Understanding the significance of removing duplicates and applying these methods ensures a cleaner and more reliable dataset for further analysis.

#### Considerations or Peculiarities:

- **Column Selection:**
  - Consider which columns should be considered for identifying duplicates. In some cases, duplicates may only be duplicates when considering specific columns.

- **Impact on Analysis:**
  - Consider how duplicate rows might impact subsequent analyses. Retaining duplicates might skew results, while removing them ensures a cleaner dataset.
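
The column-selection point can be checked on the order data from the example above. The sketch below rebuilds that DataFrame and compares a full-row check with a check on `OrderID` and `Product` only; treating those two columns as the key is an assumption made for illustration.

```python
import pandas as pd

# Order data copied from the example above
df_orders = pd.DataFrame({
    'OrderID': [101, 102, 101, 103, 104, 102],
    'Product': ['Laptop', 'Smartphone', 'Laptop', 'Tablet', 'Camera', 'Smartphone'],
    'Quantity': [2, 1, 1, 3, 1, 1],
    'Total_Price': [1200, 800, 1200, 450, 700, 800],
})

# All columns must match: only row 5 repeats an earlier row exactly
full_row_dupes = df_orders.duplicated()

# Only OrderID and Product must match: row 2 is flagged as well
key_dupes = df_orders.duplicated(subset=['OrderID', 'Product'])

print(df_orders[full_row_dupes])
print(df_orders[key_dupes])
```

Row 2 shares its `OrderID` and `Product` with row 0 but has a different `Quantity`, so whether it counts as a duplicate depends on which columns define a unique order.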

#### Common Mistakes:

- **Incomplete Duplicate Identification:**
  - Not considering all relevant columns during duplicate identification might result in incomplete removal of duplicates.

- **Ignoring Context:**
  - Failing to understand the context of the data might lead to unintended removal of rows that are legitimate repeats, such as a customer placing the same order twice.

- **Overlooking Order:**
  - Forgetting to sort the DataFrame appropriately before using `drop_duplicates()` may lead to unexpected results if order matters.

Handling duplicate rows is essential for maintaining data accuracy and ensuring meaningful analyses. Developers should carefully choose columns for duplicate identification, understand the impact of duplicates on analysis, and avoid common mistakes that could compromise data integrity.

### **`Hands On Experience:`**


### Question 1: Creating a DataFrame from Lists and Basic Operations

#### Scenario:
You have information about monthly sales for a retail store. Each list holds one field (month name, sales, or expenses) across three months.

```python
# Data for three months
months = ['Jan', 'Feb', 'Mar']
sales = [1200, 1500, 1800]
expenses = [800, 900, 1000]

# Question:
# Create a DataFrame named 'df_sales' from these lists, and display the DataFrame.
# Calculate the profit for each month (Profit = Sales - Expenses).
# Display the DataFrame after adding the 'Profit' column.
```

In [6]:
# Data for three months
months = ['Jan', 'Feb', 'Mar']
sales = [1200, 1500, 1800]
expenses = [800, 900, 1000]

import pandas as pd

# Creating a DataFrame from Lists
df_sales = pd.DataFrame({'Month': months, 'Sales': sales, 'Expenses': expenses})

# Calculating Profit
df_sales['Profit'] = df_sales['Sales'] - df_sales['Expenses']

# Displaying the DataFrame
print("DataFrame after Creating and Calculating Profit:")
print(df_sales)

DataFrame after Creating and Calculating Profit:
  Month  Sales  Expenses  Profit
0   Jan   1200       800     400
1   Feb   1500       900     600
2   Mar   1800      1000     800


### Question 2: Reading Data from CSV and Descriptive Statistics

#### Scenario:

Let's assume you have a CSV file named 'sales_data.csv' with the following structure:

```csv
Product,Quantity,Revenue
Laptop,10,12000
Smartphone,5,8000
Tablet,,4500
Camera,3,
```

The file contains information about product sales. Read the data into a DataFrame and perform descriptive statistics.

```python
# Question:
# Read 'sales_data.csv' into a DataFrame named 'df_sales'.
# Display the first 5 rows of the DataFrame.
# Calculate basic descriptive statistics for the 'Quantity' column.
```



In [7]:
import pandas as pd

# Reading Data from CSV
df_sales = pd.read_csv('sales_data.csv')

# Displaying the first 5 rows
print("First 5 Rows of df_sales:")
print(df_sales.head())

# Descriptive Statistics for 'Quantity'
quantity_stats = df_sales['Quantity'].describe()
print("\nDescriptive Statistics for 'Quantity':")
print(quantity_stats)

First 5 Rows of df_sales:
      Product  Quantity  Revenue
0      Laptop      10.0  12000.0
1  Smartphone       5.0   8000.0
2      Tablet       NaN   4500.0
3      Camera       3.0      NaN

Descriptive Statistics for 'Quantity':
count     3.000000
mean      6.000000
std       3.605551
min       3.000000
25%       4.000000
50%       5.000000
75%       7.500000
max      10.000000
Name: Quantity, dtype: float64
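
If you do not have 'sales_data.csv' on disk, the same exercise can be tried with the CSV text held in memory. This is a sketch that assumes the file content is exactly the four rows shown in the scenario; `io.StringIO` stands in for the file.

```python
import io

import pandas as pd

# CSV content copied from the scenario above
csv_text = """Product,Quantity,Revenue
Laptop,10,12000
Smartphone,5,8000
Tablet,,4500
Camera,3,
"""

# read_csv accepts a file-like object as well as a path;
# empty fields are parsed as NaN
df_sales = pd.read_csv(io.StringIO(csv_text))

# describe() skips NaN, so count reflects non-missing values only
quantity_stats = df_sales['Quantity'].describe()

print(df_sales.head())
print(quantity_stats)
```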


### Question 3: Handling Missing Values and Filling with Mean

#### Scenario:
Your DataFrame has missing values in the 'Revenue' column. Handle the missing values by filling them with the mean.

```python
# Question:
# Handle missing values in the 'Revenue' column by filling them with the mean.
# Display the DataFrame after handling missing values.
```

In [8]:
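# Handling Missing Values in 'Revenue'
# Assign the result back; calling fillna(..., inplace=True) on a selected column may not update df_sales
df_sales['Revenue'] = df_sales['Revenue'].fillna(df_sales['Revenue'].mean())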

# Displaying the DataFrame after Handling Missing Values
print("DataFrame after Handling Missing Values in 'Revenue':")
print(df_sales)

DataFrame after Handling Missing Values in 'Revenue':
      Product  Quantity       Revenue
0      Laptop      10.0  12000.000000
1  Smartphone       5.0   8000.000000
2      Tablet       NaN   4500.000000
3      Camera       3.0   8166.666667


### Question 4: Removing Duplicates

#### Scenario:
Your DataFrame 'df_orders' contains duplicate entries for customer orders. Remove the duplicates based on all columns.

```python
# Question:
# Identify and remove duplicate rows from 'df_orders'.
# Display the DataFrame after removing duplicates.
```

In [9]:
# Identifying Duplicate Rows
duplicates = df_orders.duplicated()

# Removing Duplicate Rows
df_orders_no_duplicates = df_orders.drop_duplicates()

# Displaying the DataFrame after Removing Duplicates
print("DataFrame after Removing Duplicates:")
print(df_orders_no_duplicates)

DataFrame after Removing Duplicates:
   OrderID     Product  Quantity  Total_Price
0      101      Laptop         2         1200
1      102  Smartphone         1          800
2      101      Laptop         1         1200
3      103      Tablet         3          450
4      104      Camera         1          700


### Question 5: Conditional Indexing and Filtering

#### Scenario:
You want to analyze only the orders with a quantity greater than 2.

```python
# Question:
# Create a new DataFrame 'df_large_orders' containing only the orders with Quantity greater than 2.
# Display the new DataFrame.
```

In [10]:
# Conditional Indexing and Filtering
df_large_orders = df_orders[df_orders['Quantity'] > 2]

# Displaying the DataFrame with Large Orders
print("DataFrame with Orders Quantity > 2:")
print(df_large_orders)

DataFrame with Orders Quantity > 2:
   OrderID Product  Quantity  Total_Price
3      103  Tablet         3          450
