#### Dealing with duplicate data / feature engineering
In this lab, we explore and analyze calls-for-service data from New Orleans. The data is available at https://data.nola.gov/, and we focus on the dataset for the year 2015.

This dataset provides valuable insights into the nature and frequency of service calls made within the city of New Orleans. By examining this data, we can uncover patterns, trends, and important information that can contribute to a better understanding of public safety and resource allocation.

During this lab, our main focus will be on dealing with duplicated data.

Duplicates in a dataset can introduce errors and inaccuracies in our analysis, leading to biased results and flawed conclusions. Therefore, it is crucial to identify and handle duplicate entries effectively.

Removing duplicates keeps repeated data points from skewing the analysis, for example by inflating counts or pulling averages toward the repeated values. The small example that follows shows this basic data-cleaning and quality-assurance step, which is needed before results can be trusted for decision-making.

Simple example

Here's an example of Python code that illustrates how to identify and handle duplicate entries in a dataset using the popular pandas library.

In [1]:
import pandas as pd

# Sample dataset with duplicate entries
data = {
    'Customer_ID': [1, 2, 3, 4, 1, 5, 2, 6],
    'Product_ID': ['A', 'B', 'C', 'D', 'A', 'E', 'B', 'F'],
    'Review_Text': ['Great product', 'Good product', 'Average', 'Not satisfied', 'Great product', 'Excellent', 'Good product', 'Very satisfied'],
    'Rating': [5, 4, 3, 2, 5, 5, 4, 5]
}

df = pd.DataFrame(data)

# Identifying and handling duplicates
duplicate_rows = df[df.duplicated(['Customer_ID', 'Product_ID', 'Review_Text'], keep='first')]
df_cleaned = df.drop_duplicates(['Customer_ID', 'Product_ID', 'Review_Text'], keep='first')

# Displaying the original and cleaned datasets
print("Original Dataset:")
print(df)

print("\nDuplicate Rows:")
print(duplicate_rows)

print("\nCleaned Dataset:")
print(df_cleaned)

Original Dataset:
   Customer_ID Product_ID     Review_Text  Rating
0            1          A   Great product       5
1            2          B    Good product       4
2            3          C         Average       3
3            4          D   Not satisfied       2
4            1          A   Great product       5
5            5          E       Excellent       5
6            2          B    Good product       4
7            6          F  Very satisfied       5

Duplicate Rows:
   Customer_ID Product_ID    Review_Text  Rating
4            1          A  Great product       5
6            2          B   Good product       4

Cleaned Dataset:
   Customer_ID Product_ID     Review_Text  Rating
0            1          A   Great product       5
1            2          B    Good product       4
2            3          C         Average       3
3            4          D   Not satisfied       2
5            5          E       Excellent       5
7            6          F  Very satisfied       5
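
To see why the two repeated rows matter, here is a minimal sketch that compares the mean rating before and after dropping them. It rebuilds the same toy values in a separate frame (the name `ratings` is only for illustration, and `Review_Text` is left out for brevity), so it can be run independently of the cells above.

```python
import pandas as pd

# Same toy values as above; rows 4 and 6 repeat rows 0 and 1
ratings = pd.DataFrame({
    'Customer_ID': [1, 2, 3, 4, 1, 5, 2, 6],
    'Product_ID': ['A', 'B', 'C', 'D', 'A', 'E', 'B', 'F'],
    'Rating': [5, 4, 3, 2, 5, 5, 4, 5],
})

# Mean over all eight rows, duplicates included
mean_with_dupes = ratings['Rating'].mean()

# Mean over the six unique rows only
mean_without_dupes = ratings.drop_duplicates()['Rating'].mean()

print(f"Mean rating with duplicates:    {mean_with_dupes:.3f}")
print(f"Mean rating without duplicates: {mean_without_dupes:.3f}")
```

The two repeated high ratings pull the average up from 4.000 to 4.125, a small shift here that grows with the share of duplicated rows.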


In [6]:
df.duplicated()

0    False
1    False
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool
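
By default `duplicated()` uses `keep='first'`, so only the later repeats (rows 4 and 6) are flagged and the first occurrences are not. When inspecting the data it can help to see every member of a duplicate group; the sketch below does that with `keep=False`. The frame name `df_demo` is only for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Customer_ID': [1, 2, 3, 4, 1, 5, 2, 6],
    'Product_ID': ['A', 'B', 'C', 'D', 'A', 'E', 'B', 'F'],
})

# keep='first' (the default) flags only the later repeats
later_repeats = df_demo.duplicated(keep='first')

# keep=False flags every row that belongs to a duplicate group
all_members = df_demo.duplicated(keep=False)

# Show each duplicate group side by side for manual inspection
print(df_demo[all_members].sort_values(['Customer_ID', 'Product_ID']))
```

Viewing the whole group is useful when the "duplicates" might differ in columns that were not part of the comparison.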

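In [8]:
# Drop exact duplicate rows in place, keeping the first occurrence of each
df.drop_duplicates(keep='first', inplace=True)

In [9]:
# Confirm that no duplicate rows remain
df.duplicated().sum()

0

In [10]:
# Run after the in-place drop above, so only the six remaining rows are checked
df.duplicated('Product_ID')

0    False
1    False
2    False
3    False
5    False
7    False
dtype: bool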
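
The same checks carry over to a larger table such as the calls-for-service data. Below is a minimal sketch of a small reporting helper; the function name `summarize_duplicates`, the toy frame `calls`, and its column names are all hypothetical and are not taken from the actual dataset.

```python
import pandas as pd

def summarize_duplicates(frame, subset=None):
    """Count total rows, later repeats, and rows left after de-duplication."""
    n_repeats = int(frame.duplicated(subset=subset).sum())
    return {
        'rows': len(frame),
        'duplicate_rows': n_repeats,
        'rows_after_cleaning': len(frame) - n_repeats,
    }

# Toy stand-in for the real data; the column names are hypothetical
calls = pd.DataFrame({
    'incident_id': ['A001', 'A002', 'A002', 'A003'],
    'call_type': ['NOISE', 'THEFT', 'THEFT', 'NOISE'],
})

# Exact duplicates across all columns
print(summarize_duplicates(calls))

# Duplicates judged on a single column only
print(summarize_duplicates(calls, subset=['call_type']))
```

For the real data, the toy frame would be replaced by the 2015 file loaded with `pd.read_csv`, and `subset` would be set to whichever columns identify a unique call in that file.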