### Find and Delete Duplicate Data

Duplicate rows can skew data analysis and machine learning models, for example by over-weighting repeated records. Here are the steps to find, handle, and remove duplicate data:

1. **Find Duplicate Data**:
    - Use the `duplicated()` method to identify duplicate rows.
    - Example: `df.duplicated()`

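2. **Handle Duplicate Data**:
    - Use the `keep` parameter to decide which occurrence counts as the original: `'first'` (the default) marks every later repeat as a duplicate, `'last'` marks every earlier repeat, and `False` marks all occurrences.
    - Example: `df.duplicated(keep='first')`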

3. **Remove Duplicate Data**:
    - Use the `drop_duplicates()` method to remove duplicate rows.
    - Example: `df.drop_duplicates(inplace=True)`

By following these steps, you can ensure that your dataset is free from duplicate entries, leading to more accurate analysis and modeling.
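
The effect of the `keep` parameter is easiest to see side by side. The following is a minimal sketch on a small made-up frame (`demo` and its values are illustrative assumptions, not part of the notebook's data):

```python
import pandas as pd

# Small illustrative frame (hypothetical data, not from the notebook)
demo = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Alice"],
    "eng": [85, 90, 85, 85],
})

# keep='first' (default): every repeat after the first occurrence is flagged
first_mask = demo.duplicated(keep="first")

# keep='last': every repeat before the last occurrence is flagged
last_mask = demo.duplicated(keep="last")

# keep=False: every member of a duplicate group is flagged
all_mask = demo.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, True, True]
print(last_mask.tolist())   # [True, False, True, False]
print(all_mask.tolist())    # [True, False, True, True]
```

`keep=False` is useful when you want to inspect whole duplicate groups before deciding which copy to retain.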

#### Find Duplicate Data

In [95]:
import pandas as pd


data = {
    "name": ["Alice", "Bob", "Charlie", "David", "Alice","Charlie"],
    "eng": [85, 90, 78, 85, 85,78],
    "bangla": [80, 85, 75, 90, 80,99]
}

In [96]:
df = pd.DataFrame(data)
df
# The two Charlie rows are not duplicates because their bangla marks differ

Unnamed: 0,name,eng,bangla
0,Alice,85,80
1,Bob,90,85
2,Charlie,78,75
3,David,85,90
4,Alice,85,80
5,Charlie,78,99


In [97]:
#df["duplicate"] = df.duplicated()
df

Unnamed: 0,name,eng,bangla
0,Alice,85,80
1,Bob,90,85
2,Charlie,78,75
3,David,85,90
4,Alice,85,80
5,Charlie,78,99


In [98]:
#df.drop_duplicates(inplace=True) # change original dataset
df

Unnamed: 0,name,eng,bangla
0,Alice,85,80
1,Bob,90,85
2,Charlie,78,75
3,David,85,90
4,Alice,85,80
5,Charlie,78,99


In [99]:
df.duplicated(keep='first')  # keep the first occurrence and mark later repeats as duplicates


0    False
1    False
2    False
3    False
4     True
5    False
dtype: bool
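
The boolean Series above can also be used to count duplicates or to pull out the matching rows. The sketch below rebuilds the same sample data under the assumed name `df_demo` so that it runs on its own:

```python
import pandas as pd

# Same sample data as the notebook, rebuilt so this cell is self-contained
df_demo = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Alice", "Charlie"],
    "eng": [85, 90, 78, 85, 85, 78],
    "bangla": [80, 85, 75, 90, 80, 99],
})

# Number of rows flagged as repeats of an earlier row
n_dupes = int(df_demo.duplicated().sum())

# Every row that belongs to a duplicate group, including the first occurrence
dupe_rows = df_demo[df_demo.duplicated(keep=False)]

print(n_dupes)                   # 1
print(dupe_rows.index.tolist())  # [0, 4]
```

Only the two Alice rows appear, since they match in every column; the Charlie rows differ in `bangla`.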

In [100]:
df.drop_duplicates(inplace=True)  # modifies the original DataFrame in place
df

Unnamed: 0,name,eng,bangla
0,Alice,85,80
1,Bob,90,85
2,Charlie,78,75
3,David,85,90
5,Charlie,78,99
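
Note that the output keeps the original index labels, so label 4 is missing. As a hedged alternative to `inplace=True`, the sketch below returns a new DataFrame, renumbers the index with `ignore_index=True`, and uses `subset` to compare only selected columns. The names `df_demo`, `deduped`, and `by_name_eng` are assumptions made for this illustration:

```python
import pandas as pd

# Same sample data as the notebook, rebuilt so this cell is self-contained
df_demo = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Alice", "Charlie"],
    "eng": [85, 90, 78, 85, 85, 78],
    "bangla": [80, 85, 75, 90, 80, 99],
})

# Returns a new DataFrame with a fresh 0..n-1 index; df_demo is left unchanged
deduped = df_demo.drop_duplicates(ignore_index=True)

# Treat rows as duplicates when only 'name' and 'eng' match
by_name_eng = df_demo.drop_duplicates(
    subset=["name", "eng"], keep="first", ignore_index=True
)

print(len(df_demo))                  # 6
print(deduped.index.tolist())        # [0, 1, 2, 3, 4]
print(by_name_eng["name"].tolist())  # ['Alice', 'Bob', 'Charlie', 'David']
```

With `subset=["name", "eng"]` the second Charlie row is dropped as well, because its differing `bangla` mark is no longer part of the comparison.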


## Working with the Original Dataset

In [101]:
ds = pd.read_csv("loan.csv")

In [102]:
ds.head(3)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y



### Deleting Duplicate Values and Checking for Duplicates

To handle duplicate values in a DataFrame, you can follow these steps:

1. **Check for Duplicate Values**:
    - Use the `duplicated()` method to identify duplicate rows.
    - Example: `df.duplicated()`
    - This returns a boolean Series that is `True` for each row that repeats an earlier row and `False` otherwise.

2. **Delete Duplicate Values**:
    - Use the `drop_duplicates()` method to remove duplicate rows.
    - Example: `df.drop_duplicates(inplace=True)`
    - This will remove the duplicate rows from the DataFrame and update it in place.

Here is an example using the `df` DataFrame:

```python
# Check for duplicate values
duplicates = df.duplicated()
print("Duplicate rows:\n", duplicates)

# Delete duplicate values
df.drop_duplicates(inplace=True)
print("DataFrame after removing duplicates:\n", df)
```



#### Important

1. The dataset initially has 614 rows and 13 columns.
2. After dropping duplicate rows, it still has 614 rows and 13 columns.
3. The dataset therefore contains no fully duplicated rows.

In [103]:
ds.shape  # initially 614 rows and 13 columns

(614, 13)

In [104]:
ds.drop_duplicates(inplace=True)
ds.shape
# after dropping duplicates,
# the dataset still has 614 rows and 13 columns

(614, 13)
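
Comparing shapes before and after works, but the same conclusion can be reached without modifying the data. The following is a minimal sketch on a three-row stand-in for the loan data; the frame `loans` and its reduced column set are assumptions for illustration, since `loan.csv` itself is not reproduced here. It also checks whether an identifier column such as `Loan_ID` is unique, which is a stricter test than full-row equality:

```python
import pandas as pd

# Tiny stand-in for the loan data (the notebook reads the full loan.csv)
loans = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005"],
    "Gender": ["Male", "Male", "Male"],
    "ApplicantIncome": [5849, 4583, 3000],
})

rows_before = loans.shape[0]

# Count fully identical rows without changing the DataFrame
n_full_dupes = int(loans.duplicated().sum())

# True when no Loan_ID value repeats
ids_unique = loans["Loan_ID"].is_unique

# Dropping duplicates should leave the row count unchanged when none exist
rows_after = loans.drop_duplicates().shape[0]

print(n_full_dupes, ids_unique, rows_before == rows_after)  # 0 True True
```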