## Preparing Data

In [1]:
import pandas as pd
sales = pd.read_csv("datasets/sales_subset.csv")

## Exercise: Dropping duplicates

Duplicate rows can skew your analysis, especially when you're counting or summarizing data. In this exercise, you'll clean the `sales` DataFrame by keeping only one row for each unique combination of values in selected columns.

### Instructions:

1. **Remove duplicate combinations** of `store` and `type`.
2. **Remove duplicate combinations** of `store` and `department`.
3. **Identify unique holiday dates** by keeping only the rows where `is_holiday` is `True` and then dropping duplicates based on `date`.
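
Before running these steps on `sales`, it can help to see how `drop_duplicates(subset=...)` behaves on a tiny frame. The sketch below uses a made-up DataFrame called `toy`; its values are illustrative assumptions and do not come from `sales_subset.csv`.

```python
import pandas as pd

# Hypothetical toy data, not taken from sales_subset.csv
toy = pd.DataFrame({
    "store": [1, 1, 1, 2, 2],
    "type": ["A", "A", "A", "B", "B"],
    "department": [1, 1, 2, 1, 1],
})

# Keep one row per (store, type) pair
unique_store_type = toy.drop_duplicates(subset=["store", "type"])

# Keep one row per (store, department) pair
unique_store_dept = toy.drop_duplicates(subset=["store", "department"])

print(unique_store_type.index.tolist())  # [0, 3]
print(unique_store_dept.index.tolist())  # [0, 2, 3]
```

Only the columns listed in `subset` are compared; the other columns are carried along from whichever row is kept.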

In [3]:
# 1. Remove duplicate combinations of store and type
store_types = sales.drop_duplicates(subset=["store", "type"])
print("Unique store-type combinations:\n", store_types.head(), "\n")

Unique store-type combinations:
       Unnamed: 0  store type  department        date  weekly_sales  \
0              0      1    A           1  2010-02-05      24924.50   
901          901      2    A           1  2010-02-05      35034.06   
1798        1798      4    A           1  2010-02-05      38724.42   
2699        2699      6    A           1  2010-02-05      25619.00   
3593        3593     10    B           1  2010-02-05      40212.84   

      is_holiday  temperature_c  fuel_price_usd_per_l  unemployment  
0          False       5.727778              0.679451         8.106  
901        False       4.550000              0.679451         8.324  
1798       False       6.533333              0.686319         8.623  
2699       False       4.683333              0.679451         7.259  
3593       False      12.411111              0.782478         9.765   
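
By default, `drop_duplicates` keeps the first occurrence of each combination (`keep="first"`), which is why the surviving rows retain their original index labels. A minimal sketch of the difference between `keep="first"` and `keep="last"`, again on a made-up `toy` frame whose values are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "store": [1, 1, 2, 2],
    "type": ["A", "A", "B", "B"],
    "weekly_sales": [100.0, 150.0, 200.0, 250.0],
})

# Default behaviour: keep the first row of each (store, type) pair
first_rows = toy.drop_duplicates(subset=["store", "type"])

# Alternative: keep the last row of each pair instead
last_rows = toy.drop_duplicates(subset=["store", "type"], keep="last")

print(first_rows["weekly_sales"].tolist())  # [100.0, 200.0]
print(last_rows["weekly_sales"].tolist())   # [150.0, 250.0]
```

Because the non-subset columns come from whichever row is kept, values such as `weekly_sales` in the deduplicated frame describe a single row, not the whole store.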



In [4]:
# 2. Remove duplicate combinations of store and department
store_depts = sales.drop_duplicates(subset=["store", "department"])
print("Unique store-department combinations:\n", store_depts.head(), "\n")

Unique store-department combinations:
     Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0            0      1    A           1  2010-02-05      24924.50       False   
12          12      1    A           2  2010-02-05      50605.27       False   
24          24      1    A           3  2010-02-05      13740.12       False   
36          36      1    A           4  2010-02-05      39954.04       False   
48          48      1    A           5  2010-02-05      32229.38       False   

    temperature_c  fuel_price_usd_per_l  unemployment  
0        5.727778              0.679451         8.106  
12       5.727778              0.679451         8.106  
24       5.727778              0.679451         8.106  
36       5.727778              0.679451         8.106  
48       5.727778              0.679451         8.106   



In [6]:
# 3. Filter holiday weeks and drop duplicate dates
holiday_dates = sales[sales["is_holiday"]].drop_duplicates(subset="date")

# Display the unique holiday dates
print("Holiday Dates:\n", holiday_dates["date"])

Holiday Dates:
 498     2010-09-10
691     2011-11-25
2315    2010-02-12
6735    2012-09-07
6810    2010-12-31
6815    2012-02-10
6820    2011-09-09
Name: date, dtype: object
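
The dates above appear in the order their rows occur in the data, and the `object` dtype indicates they are stored as strings. If a chronological list is more convenient, one option is to parse and sort them. The sketch below shows the idea on a made-up `toy` frame; the values are assumptions for illustration, not the course's expected solution.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "date": ["2010-09-10", "2010-02-12", "2010-09-10", "2010-02-05"],
    "is_holiday": [True, True, True, False],
})

# Filter to holiday rows, then keep one row per date
holiday_rows = toy[toy["is_holiday"]].drop_duplicates(subset="date")

# Parse the strings as datetimes and sort chronologically
sorted_dates = (
    pd.to_datetime(holiday_rows["date"])
    .sort_values()
    .reset_index(drop=True)
)

print(sorted_dates.dt.strftime("%Y-%m-%d").tolist())  # ['2010-02-12', '2010-09-10']
```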
