
# Week 2 Assignment — Data Collection & Cleaning

**Name:** Sajal Farhan  
**Roll No:** 24  
**Assigned Project:** Credit Card Fraud Detection  

---

## Notebook Overview
In this notebook, I:
- Created a sample **Credit Card Fraud dataset** with issues (duplicates, missing values, outliers).
- Applied cleaning steps: removed duplicates, handled missing values, treated outliers.
- Compared the dataset **before vs after cleaning**.
- Saved the cleaned dataset as `creditcard_cleaned.csv` for upload to GitHub.


In [1]:
# Week 2: Data Collection & Cleaning
# Project: Credit Card Fraud Detection

import pandas as pd
import numpy as np

# -----------------------------
# 1. Create Sample Dataset
# -----------------------------
data = {
    'TransactionID': [1,2,3,4,5,6,7,8,9,10,11,11,12,13,14],
    'Amount': [100.50, 250.75, np.nan, 5000.00, 60.00, 9999.99, 80.00, 120.00, 250.75, 45.00,
               100.50, 100.50, 300.00, 75.00, np.nan],
    'Age': [25, 35, 40, 28, np.nan, 30, 45, 38, 35, 29,
            25, 25, 50, 32, 27],
    'Fraudulent': [0,0,0,1,0,1,0,0,0,0,
                   0,0,1,0,0]
}

df = pd.DataFrame(data)
print("📌 Raw Dataset (Before Cleaning):")
print(df)


📌 Raw Dataset (Before Cleaning):
    TransactionID   Amount   Age  Fraudulent
0               1   100.50  25.0           0
1               2   250.75  35.0           0
2               3      NaN  40.0           0
3               4  5000.00  28.0           1
4               5    60.00   NaN           0
5               6  9999.99  30.0           1
6               7    80.00  45.0           0
7               8   120.00  38.0           0
8               9   250.75  35.0           0
9              10    45.00  29.0           0
10             11   100.50  25.0           0
11             11   100.50  25.0           0
12             12   300.00  50.0           1
13             13    75.00  32.0           0
14             14      NaN  27.0           0
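
The issues visible in the raw table above can also be counted directly before any cleaning. The following is a minimal sketch that rebuilds the same sample data so it runs on its own; the names `raw`, `n_duplicates` and `n_missing` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same sample data as in the cell above, rebuilt so this snippet is self-contained.
raw = pd.DataFrame({
    'TransactionID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 11, 12, 13, 14],
    'Amount': [100.50, 250.75, np.nan, 5000.00, 60.00, 9999.99, 80.00, 120.00,
               250.75, 45.00, 100.50, 100.50, 300.00, 75.00, np.nan],
    'Age': [25, 35, 40, 28, np.nan, 30, 45, 38, 35, 29, 25, 25, 50, 32, 27],
    'Fraudulent': [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
})

# Rows that exactly repeat an earlier row (all columns equal).
n_duplicates = int(raw.duplicated().sum())

# Missing values per column.
n_missing = raw.isna().sum()

print("Duplicate rows:", n_duplicates)
print("Missing values per column:")
print(n_missing)
```

Counting first gives a baseline to compare against once each cleaning step has run.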


In [2]:
# -----------------------------
# 2. Remove Duplicates
# -----------------------------
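# .copy() makes df_no_duplicates an independent DataFrame, so later column
# assignments do not raise SettingWithCopyWarning.
df_no_duplicates = df.drop_duplicates().copy()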
print("\n✅ After Removing Duplicates:")
print(df_no_duplicates)



✅ After Removing Duplicates:
    TransactionID   Amount   Age  Fraudulent
0               1   100.50  25.0           0
1               2   250.75  35.0           0
2               3      NaN  40.0           0
3               4  5000.00  28.0           1
4               5    60.00   NaN           0
5               6  9999.99  30.0           1
6               7    80.00  45.0           0
7               8   120.00  38.0           0
8               9   250.75  35.0           0
9              10    45.00  29.0           0
10             11   100.50  25.0           0
12             12   300.00  50.0           1
13             13    75.00  32.0           0
14             14      NaN  27.0           0
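
Note that the output above jumps from index 10 to 12: `drop_duplicates()` keeps the first occurrence of each repeated row and preserves the original index labels. A small sketch on a hypothetical three-row frame (`sample` is not from the notebook) shows how to inspect a duplicate pair before dropping it:

```python
import pandas as pd

# Hypothetical three-row frame: rows 1 and 2 are exact copies of each other.
sample = pd.DataFrame({
    'TransactionID': [10, 11, 11],
    'Amount': [45.00, 100.50, 100.50],
})

# keep=False marks every member of a duplicate group, which helps inspection.
both_copies = sample[sample.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence and leaves index labels unchanged.
deduped = sample.drop_duplicates()

print(both_copies)
print(deduped.index.tolist())
```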


In [3]:
# -----------------------------
# 3. Handle Missing Values
# -----------------------------
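# Fill missing 'Amount' with median (robust against outliers).
# Assign the result back: chained fillna(..., inplace=True) on a selected
# column will stop working in pandas 3.0.
df_no_duplicates['Amount'] = df_no_duplicates['Amount'].fillna(df_no_duplicates['Amount'].median())

# Fill missing 'Age' with mean
df_no_duplicates['Age'] = df_no_duplicates['Age'].fillna(df_no_duplicates['Age'].mean())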

print("\n✅ After Handling Missing Values:")
print(df_no_duplicates)



✅ After Handling Missing Values:
    TransactionID   Amount        Age  Fraudulent
0               1   100.50  25.000000           0
1               2   250.75  35.000000           0
2               3   110.25  40.000000           0
3               4  5000.00  28.000000           1
4               5    60.00  33.769231           0
5               6  9999.99  30.000000           1
6               7    80.00  45.000000           0
7               8   120.00  38.000000           0
8               9   250.75  35.000000           0
9              10    45.00  29.000000           0
10             11   100.50  25.000000           0
12             12   300.00  50.000000           1
13             13    75.00  32.000000           0
14             14   110.25  27.000000           0
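
The choice of median for `Amount` can be checked numerically. The sketch below uses the `Amount` column as it stands after duplicate removal (the standalone Series `amount` is an illustrative stand-in for `df_no_duplicates['Amount']`) and compares the two candidate fill values:

```python
import numpy as np
import pandas as pd

# 'Amount' after duplicate removal: 12 known values and 2 gaps.
amount = pd.Series([100.50, 250.75, np.nan, 5000.00, 60.00, 9999.99, 80.00,
                    120.00, 250.75, 45.00, 100.50, 300.00, 75.00, np.nan])

median_amount = amount.median()  # NaN values are skipped by default
mean_amount = amount.mean()

# Assign the filled result to a new object rather than filling in place.
filled = amount.fillna(median_amount)

print("median:", median_amount)
print("mean:", round(mean_amount, 2))
print("remaining NaN:", int(filled.isna().sum()))
```

The two large transactions pull the mean well above 1,000, while the median stays at 110.25, which is why the median is the more representative fill value for this column.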



In [4]:
# -----------------------------
# 4. Treat Outliers (IQR method)
# -----------------------------
Q1 = df_no_duplicates['Amount'].quantile(0.25)
Q3 = df_no_duplicates['Amount'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print("\nOutlier Bounds:", lower_bound, "to", upper_bound)

# Cap extreme values at the IQR bounds instead of dropping them
df_no_duplicates['Amount'] = df_no_duplicates['Amount'].clip(lower=lower_bound, upper=upper_bound)

print("\n✅ After Outlier Treatment:")
print(df_no_duplicates)



Outlier Bounds: -163.3125 to 499.1875

✅ After Outlier Treatment:
    TransactionID    Amount        Age  Fraudulent
0               1  100.5000  25.000000           0
1               2  250.7500  35.000000           0
2               3  110.2500  40.000000           0
3               4  499.1875  28.000000           1
4               5   60.0000  33.769231           0
5               6  499.1875  30.000000           1
6               7   80.0000  45.000000           0
7               8  120.0000  38.000000           0
8               9  250.7500  35.000000           0
9              10   45.0000  29.000000           0
10             11  100.5000  25.000000           0
12             12  300.0000  50.000000           1
13             13   75.0000  32.000000           0
14             14  110.2500  27.000000           0
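
The IQR rule used above can be wrapped in a small helper. This is a sketch under stated assumptions: `iqr_bounds` is a hypothetical function name, `k=1.5` is the multiplier used in the cell above, and the Series repeats the imputed `Amount` values printed in the previous step.

```python
from typing import Tuple

import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> Tuple[float, float]:
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# 'Amount' after imputation, as printed in the previous step.
amount = pd.Series([100.50, 250.75, 110.25, 5000.00, 60.00, 9999.99, 80.00,
                    120.00, 250.75, 45.00, 100.50, 300.00, 75.00, 110.25])

lower, upper = iqr_bounds(amount)

# clip() replaces values outside [lower, upper] with the nearest bound.
capped = amount.clip(lower=lower, upper=upper)
n_capped = int((amount != capped).sum())

print("bounds:", lower, upper)
print("values capped:", n_capped)
```

Here Q1 is 85.125 and Q3 is 250.75, so the IQR is 165.625 and the fences fall at -163.3125 and 499.1875. Both `5000.00` and `9999.99` exceed the upper fence and are capped.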




In [5]:
# -----------------------------
# 5. Before vs After Cleaning
# -----------------------------
print("\n📊 BEFORE CLEANING SHAPE:", df.shape)
print("📊 AFTER CLEANING SHAPE:", df_no_duplicates.shape)

print("\n--- Sample Before Cleaning (first 5 rows) ---")
print(df.head())

print("\n--- Sample After Cleaning (first 5 rows) ---")
print(df_no_duplicates.head())



📊 BEFORE CLEANING SHAPE: (15, 4)
📊 AFTER CLEANING SHAPE: (14, 4)

--- Sample Before Cleaning (first 5 rows) ---
   TransactionID   Amount   Age  Fraudulent
0              1   100.50  25.0           0
1              2   250.75  35.0           0
2              3      NaN  40.0           0
3              4  5000.00  28.0           1
4              5    60.00   NaN           0

--- Sample After Cleaning (first 5 rows) ---
   TransactionID    Amount        Age  Fraudulent
0              1  100.5000  25.000000           0
1              2  250.7500  35.000000           0
2              3  110.2500  40.000000           0
3              4  499.1875  28.000000           1
4              5   60.0000  33.769231           0


In [6]:
# -----------------------------
# 6. Save Cleaned Dataset
# -----------------------------
df_no_duplicates.to_csv("creditcard_cleaned.csv", index=False)
print("\n💾 Cleaned dataset saved as 'creditcard_cleaned.csv'")



💾 Cleaned dataset saved as 'creditcard_cleaned.csv'
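
A quick round-trip check can confirm that the saved file reloads cleanly. This is a minimal sketch: it uses only the first five cleaned rows shown above, and an in-memory buffer stands in for `creditcard_cleaned.csv` so that nothing is written to disk.

```python
import io

import pandas as pd

# First five rows of the cleaned table shown above.
cleaned = pd.DataFrame({
    'TransactionID': [1, 2, 3, 4, 5],
    'Amount': [100.50, 250.75, 110.25, 499.1875, 60.00],
    'Age': [25.0, 35.0, 40.0, 28.0, 33.769231],
    'Fraudulent': [0, 0, 0, 1, 0],
})

# An in-memory buffer stands in for the CSV file in this sketch.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

has_missing = bool(reloaded.isna().any().any())
has_duplicates = bool(reloaded.duplicated().any())

print(reloaded.shape, has_missing, has_duplicates)
```

Writing with `index=False` means the reloaded frame has exactly the four data columns and no extra unnamed index column.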


## 📌 Conclusion

**Before Cleaning:**  
- 15 rows, including 1 exact duplicate row (`TransactionID` 11).  
- 2 missing values in `Amount`, 1 missing in `Age`.  
- 2 extreme `Amount` outliers (`5000.00` and `9999.99`), both above the IQR upper bound of `499.1875`.  

**After Cleaning:**  
- 14 rows (1 duplicate row removed).  
- Missing values filled (median for `Amount`, mean for `Age`).  
- Outliers capped at the IQR bounds (`-163.3125` to `499.1875`).  
- Cleaned dataset saved as `creditcard_cleaned.csv`.  

✅ Week 2 tasks and assignment completed successfully.
