# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#2: DataFrames in Depth`**

4. **Creating DataFrames**
    
    - From lists, dictionaries, and arrays
    - Reading data from CSV, Excel, and other formats
5. **Basic DataFrame Operations**
    
    - Inspecting the DataFrame
    - Indexing and selecting data
    - Descriptive statistics
6. **Data Cleaning and Handling Missing Data**
    
    - Handling missing values
    - Dropping or filling missing values
    - Removing duplicates

#### **`Removing Duplicates in a DataFrame`**

#### **1. Significance of Identifying and Removing Duplicate Rows:**

**a. Data Accuracy:**
   - Duplicate rows can distort analyses by inflating counts, averages, or other summary statistics. Removing duplicates ensures the accuracy of calculated metrics.

**b. Consistent Results:**
   - Duplicates can lead to inconsistencies in results, especially in scenarios where aggregated data or distinct counts are essential.

**c. Efficient Memory Usage:**
   - Datasets with duplicate rows consume more memory than necessary. Eliminating duplicates reduces memory usage and speeds up later computations.

**d. Meaningful Insights:**
   - Duplicate rows may not contribute meaningful insights but can skew results. Removing them ensures a cleaner dataset for analysis.
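
To make the effect on summary statistics concrete, here is a minimal sketch. The `sales` DataFrame and its column names are hypothetical and only serve to illustrate the point; the second order is assumed to have been recorded twice by mistake.

```python
import pandas as pd

# Hypothetical sales data: order 2 was accidentally recorded twice
sales = pd.DataFrame({
    "OrderID": [1, 2, 2, 3],
    "Amount": [100, 400, 400, 100],
})

# Statistics computed on the raw data count order 2 twice
total_with_duplicates = sales["Amount"].sum()
mean_with_duplicates = sales["Amount"].mean()

# Statistics computed after dropping the repeated row
deduplicated = sales.drop_duplicates()
total_without_duplicates = deduplicated["Amount"].sum()
mean_without_duplicates = deduplicated["Amount"].mean()

print("With duplicates:   ", total_with_duplicates, mean_with_duplicates)
print("Without duplicates:", total_without_duplicates, mean_without_duplicates)
```

In this sketch the single repeated row raises the total from 600 to 1000 and the mean from 200 to 250, which is the kind of distortion described above.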

#### Examples of Removing Duplicates:

**1. Identifying Duplicate Rows:**
```python
# Check for duplicate rows based on all columns
duplicates = df.duplicated()

# Check for duplicate rows based on specific columns
duplicates_specific_columns = df.duplicated(subset=['Column1', 'Column2'])
```
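
Before removing anything, it often helps to count the duplicates and look at every copy. The sketch below assumes a small hypothetical DataFrame that reuses the placeholder names `Column1` and `Column2` from the snippet above; `Column3` is an extra illustrative column. It relies on the `keep` parameter of `duplicated()`, where `keep=False` marks every member of a duplicate group instead of skipping the first one.

```python
import pandas as pd

# Hypothetical data; column names mirror the placeholders used above
df = pd.DataFrame({
    "Column1": ["a", "b", "a", "c"],
    "Column2": [1, 2, 1, 3],
    "Column3": [10, 20, 30, 40],
})

# Full-row comparison: no two rows match on every column
full_row_count = int(df.duplicated().sum())

# Key-column comparison: row 2 repeats row 0 on Column1 and Column2
key_count = int(df.duplicated(subset=["Column1", "Column2"]).sum())

# keep=False flags every row in a duplicate group, including the first
all_copies = df[df.duplicated(subset=["Column1", "Column2"], keep=False)]

print("Full-row duplicates:", full_row_count)
print("Key-column duplicates:", key_count)
print(all_copies)
```

Viewing all copies side by side makes it easier to decide whether the rows are true repeats or distinct records that happen to share key values.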

**2. Removing Duplicate Rows:**
```python
# Remove all duplicate rows, keeping the first occurrence
df_no_duplicates = df.drop_duplicates()

# Remove duplicate rows based on specific columns, keeping the first occurrence
df_no_duplicates_specific_columns = df.drop_duplicates(subset=['Column1', 'Column2'])
```
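
`drop_duplicates()` also accepts a `keep` argument that controls which occurrence survives. The following sketch uses a hypothetical `readings` DataFrame to compare the three options: `"first"` (the default), `"last"`, and `False`, which drops every row that has a duplicate.

```python
import pandas as pd

# Hypothetical readings: sensor "s1" appears twice
readings = pd.DataFrame({
    "sensor": ["s1", "s2", "s1"],
    "value": [10, 20, 15],
})

# Default behaviour: keep the first row seen for each sensor
keep_first = readings.drop_duplicates(subset=["sensor"])

# Keep the last row seen for each sensor
keep_last = readings.drop_duplicates(subset=["sensor"], keep="last")

# Drop every sensor that appears more than once
keep_none = readings.drop_duplicates(subset=["sensor"], keep=False)

print(keep_first)
print(keep_last)
print(keep_none)
```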

#### Considerations:

- **Column Selection:**
  - Decide which columns define a duplicate. Two rows can differ in some columns and still represent the same record when compared on key columns only.

- **Order Matters:**
  - By default, `drop_duplicates()` retains the first occurrence and removes subsequent duplicates. Use the `keep` parameter (`'first'`, `'last'`, or `False`) to change this, and make sure the row order aligns with your analysis goals.

- **In-Place vs. New DataFrame:**
  - Decide whether to modify the existing DataFrame in-place (`inplace=True`) or create a new one (the default). Choose based on whether you need to retain the original data.
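
The in-place choice can be sketched as follows. The `original` DataFrame is hypothetical, and the `ignore_index` argument is assumed to be available (it was added in pandas 1.0). Note that calling `drop_duplicates()` with `inplace=True` returns `None`, so its result should not be assigned back to the variable.

```python
import pandas as pd

# Hypothetical data with one repeated row
original = pd.DataFrame({"id": [1, 1, 2], "score": [5, 5, 7]})

# Option 1: return a new DataFrame and leave the original untouched.
# ignore_index=True renumbers the result 0..n-1 (assumes pandas >= 1.0).
cleaned = original.drop_duplicates(ignore_index=True)

# Option 2: modify a DataFrame in place (done on a copy here so that
# both options can be compared). The method returns None in this mode.
working = original.copy()
result = working.drop_duplicates(inplace=True)

print(len(original), len(cleaned), len(working))
print(result)
```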

#### Common Mistakes:

- **Ignoring Specific Columns:**
  - When no `subset` is given, rows are compared across all columns. Rows that repeat the same key values but differ in any other column are then not treated as duplicates and remain in the dataset.

- **Overlooking Order:**
  - If it matters which occurrence is retained (for example, the most recent record), sort the DataFrame appropriately before using `drop_duplicates()`.

- **Inconsistent Usage:**
  - Inconsistently applying duplicate removal across different datasets or analyses can lead to inconsistent results.
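
The ordering pitfall can be illustrated with a short sketch. The `customers` DataFrame and its columns (`customer_id`, `email`, `updated`) are hypothetical. The goal assumed here is to keep the most recently updated record for each customer.

```python
import pandas as pd

# Hypothetical customer records; customer 1 has an old and a new entry
customers = pd.DataFrame({
    "customer_id": [1, 2, 1],
    "email": ["old@example.com", "b@example.com", "new@example.com"],
    "updated": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-03-01"]),
})

# Without sorting, the first row in file order wins (the outdated record)
naive = customers.drop_duplicates(subset=["customer_id"])

# Sorting newest-first makes "first occurrence" mean "most recent"
latest = (
    customers.sort_values("updated", ascending=False)
    .drop_duplicates(subset=["customer_id"])
    .sort_values("customer_id")
    .reset_index(drop=True)
)

print(naive)
print(latest)
```

An alternative under the same assumptions is to leave the rows in ascending order and pass `keep="last"`.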

#### Conclusion:

Identifying and removing duplicate rows is a crucial step in data cleaning and preprocessing. It enhances the accuracy of analyses, ensures meaningful insights, and optimizes memory usage. Developers should carefully consider the columns involved, the order of removal, and whether to modify the DataFrame in-place when handling duplicates.

#### 2. Example:

Let's consider a scenario where you have a DataFrame containing data on customer orders, and due to data entry errors or system glitches, there are duplicate entries. We'll explore how to identify and remove these duplicate rows using Pandas.

```python
import pandas as pd

# Sample order data with duplicate entries
order_data = {
    'OrderID': [101, 102, 101, 103, 104, 102],
    'Product': ['Laptop', 'Smartphone', 'Laptop', 'Tablet', 'Camera', 'Smartphone'],
    'Quantity': [2, 1, 1, 3, 1, 1],
    'Total_Price': [1200, 800, 1200, 450, 700, 800],
}

# Creating a DataFrame from the order data
df_orders = pd.DataFrame(order_data)

# Displaying the original DataFrame
print("Original Order DataFrame:")
print(df_orders)

# Identifying Duplicate Rows
duplicates = df_orders.duplicated()

# Displaying duplicate rows
print("\nDuplicate Rows:")
print(df_orders[duplicates])

# Removing Duplicate Rows
df_no_duplicates = df_orders.drop_duplicates()

# Displaying the DataFrame after removing duplicates
print("\nDataFrame after Removing Duplicates:")
print(df_no_duplicates)
```


```
Original Order DataFrame:
   OrderID     Product  Quantity  Total_Price
0      101      Laptop         2         1200
1      102  Smartphone         1          800
2      101      Laptop         1         1200
3      103      Tablet         3          450
4      104      Camera         1          700
5      102  Smartphone         1          800

Duplicate Rows:
   OrderID     Product  Quantity  Total_Price
5      102  Smartphone         1          800

DataFrame after Removing Duplicates:
   OrderID     Product  Quantity  Total_Price
0      101      Laptop         2         1200
1      102  Smartphone         1          800
2      101      Laptop         1         1200
3      103      Tablet         3          450
4      104      Camera         1          700
```


In this example:

1. **Identifying Duplicate Rows:**
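   - We use `duplicated()` to identify duplicate rows based on all columns. The result is a Boolean Series indicating which rows repeat an earlier row. Only row 5 is flagged, because it matches row 1 in every column. Rows 0 and 2 share an `OrderID` but differ in `Quantity`, so they are not treated as duplicates.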

2. **Displaying Duplicate Rows:**
   - We use boolean indexing to display the rows that are identified as duplicates.

3. **Removing Duplicate Rows:**
   - We use `drop_duplicates()` to remove duplicate rows, keeping the first occurrence of each unique row.

4. **Displaying Result:**
   - We display the DataFrame after removing duplicates to see the cleaned dataset.

Adjust the code based on your specific dataset and analysis goals. Understanding the significance of removing duplicates and applying these methods ensures a cleaner and more reliable dataset for further analysis.
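
As one possible adjustment, the sketch below rebuilds the same order data and compares rows on `OrderID` alone. It assumes that `OrderID` is meant to identify an order uniquely, which the example does not state, so treat that as an illustration rather than a rule for this dataset.

```python
import pandas as pd

# Same sample order data as in the example above
df_orders = pd.DataFrame({
    'OrderID': [101, 102, 101, 103, 104, 102],
    'Product': ['Laptop', 'Smartphone', 'Laptop', 'Tablet', 'Camera', 'Smartphone'],
    'Quantity': [2, 1, 1, 3, 1, 1],
    'Total_Price': [1200, 800, 1200, 450, 700, 800],
})

# Every row whose OrderID appears more than once, for manual review
same_order_id = df_orders[df_orders.duplicated(subset=['OrderID'], keep=False)]

# One row per OrderID, keeping the first occurrence (assumed to be correct)
df_unique_orders = df_orders.drop_duplicates(subset=['OrderID'])

print(same_order_id)
print(df_unique_orders)
```

With this stricter definition, row 2 (the second entry for order 101) is also removed, even though its `Quantity` differs from row 0. Whether that is desirable depends on which of the two entries is correct.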

#### 3. Considerations or Peculiarities:

- **Column Selection:**
  - Decide which columns should be used to identify duplicates. In some cases, rows count as duplicates only when compared on specific columns.

- **Impact on Analysis:**
  - Consider how duplicate rows might impact subsequent analyses. Retaining duplicates might skew results, while removing them ensures a cleaner dataset.

#### 4. Common Mistakes:

- **Incomplete Duplicate Identification:**
  - Comparing on the wrong set of columns can leave duplicates in place. For example, a full-row comparison misses rows that repeat the same key but differ in an incidental column such as a timestamp.

- **Ignoring Context:**
  - Failing to understand the context of the data might lead to unintended removal of rows that look identical but are legitimate repeated records, such as two separate purchases of the same item.

- **Overlooking Order:**
  - Forgetting to sort the DataFrame appropriately before using `drop_duplicates()` may lead to unexpected results if order matters.
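
One way to guard against these mistakes is to check whether repeated keys carry conflicting values before removing anything. The sketch below is a hedged illustration using a hypothetical `orders` DataFrame; it counts the distinct `Quantity` values recorded for each `OrderID`.

```python
import pandas as pd

# Hypothetical orders: 101 has conflicting quantities, 102 is an exact repeat
orders = pd.DataFrame({
    "OrderID": [101, 102, 101, 102],
    "Quantity": [2, 1, 1, 1],
})

# Number of distinct Quantity values recorded for each OrderID
distinct_quantities = orders.groupby("OrderID")["Quantity"].nunique()

# OrderIDs whose repeated rows disagree and need a manual decision
conflicting_ids = distinct_quantities[distinct_quantities > 1].index.tolist()

print(conflicting_ids)
```

Keys that show up in `conflicting_ids` are not simple repeats, so dropping one of their rows discards information and deserves a closer look.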

Handling duplicate rows is essential for maintaining data accuracy and ensuring meaningful analyses. Developers should carefully choose columns for duplicate identification, understand the impact of duplicates on analysis, and avoid common mistakes that could compromise data integrity.