## Preparing Data

In [3]:
import pandas as pd
sales = pd.read_csv("datasets/sales_subset.csv")

## Exercise: Dropping duplicates

Duplicate entries can skew your analysis, especially when you're counting or summarizing data. In this exercise, you'll remove rows that repeat the same combination of values in selected columns of the `sales` DataFrame, keeping the first occurrence of each combination.

### Instructions:

1. **Remove duplicate combinations** of `store` and `type`.
2. **Remove duplicate combinations** of `store` and `department`.
3. **Identify unique holiday dates** by filtering on `is_holiday` and dropping duplicates based on `date`.
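
Before applying these steps to `sales`, it may help to see how `drop_duplicates(subset=...)` behaves on a small table. The sketch below uses a hypothetical `toy` DataFrame (its values are invented for illustration and do not come from `sales_subset.csv`). With the default `keep="first"`, only the first row for each combination of the `subset` columns is retained, and the original index labels are preserved.

```python
import pandas as pd

# Hypothetical toy data, not taken from sales_subset.csv
toy = pd.DataFrame({
    "store": [1, 1, 1, 2, 2],
    "type": ["A", "A", "A", "B", "B"],
    "department": [1, 1, 2, 1, 1],
    "weekly_sales": [100.0, 110.0, 50.0, 70.0, 75.0],
})

# Keep the first row seen for each (store, type) pair
unique_store_type = toy.drop_duplicates(subset=["store", "type"])

# Keep the first row seen for each (store, department) pair
unique_store_dept = toy.drop_duplicates(subset=["store", "department"])

print(unique_store_type)
print(unique_store_dept)
```

The other columns (here `weekly_sales`) simply carry the values from whichever row was kept, so they describe one sample row rather than a summary of the group.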

In [4]:
# 1. Remove duplicate combinations of store and type
store_types = sales.drop_duplicates(subset=["store", "type"])
print("Unique store-type combinations:\n", store_types.head(), "\n")

Unique store-type combinations:
       Unnamed: 0  store type  department        date  weekly_sales  \
0              0      1    A           1  2010-02-05      24924.50   
901          901      2    A           1  2010-02-05      35034.06   
1798        1798      4    A           1  2010-02-05      38724.42   
2699        2699      6    A           1  2010-02-05      25619.00   
3593        3593     10    B           1  2010-02-05      40212.84   

      is_holiday  temperature_c  fuel_price_usd_per_l  unemployment  
0          False       5.727778              0.679451         8.106  
901        False       4.550000              0.679451         8.324  
1798       False       6.533333              0.686319         8.623  
2699       False       4.683333              0.679451         7.259  
3593       False      12.411111              0.782478         9.765   



In [5]:
# 2. Remove duplicate combinations of store and department
store_depts = sales.drop_duplicates(subset=["store", "department"])
print("Unique store-department combinations:\n", store_depts.head(), "\n")

Unique store-department combinations:
     Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0            0      1    A           1  2010-02-05      24924.50       False   
12          12      1    A           2  2010-02-05      50605.27       False   
24          24      1    A           3  2010-02-05      13740.12       False   
36          36      1    A           4  2010-02-05      39954.04       False   
48          48      1    A           5  2010-02-05      32229.38       False   

    temperature_c  fuel_price_usd_per_l  unemployment  
0        5.727778              0.679451         8.106  
12       5.727778              0.679451         8.106  
24       5.727778              0.679451         8.106  
36       5.727778              0.679451         8.106  
48       5.727778              0.679451         8.106   



In [6]:
# 3. Filter holiday weeks and drop duplicate dates
holiday_dates = sales[sales["is_holiday"]].drop_duplicates(subset="date")

# Display the unique holiday dates
print("Holiday Dates:\n", holiday_dates["date"])

Holiday Dates:
 498     2010-09-10
691     2011-11-25
2315    2010-02-12
6735    2012-09-07
6810    2010-12-31
6815    2012-02-10
6820    2011-09-09
Name: date, dtype: object
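
The `dtype: object` in this output suggests that `date` is stored as strings, and the dates are listed in order of first appearance rather than chronologically. One possible way to get a chronological list is sketched below, assuming the strings are ISO-formatted dates. The `dates` Series is a hypothetical stand-in built from three of the values printed above, not the full column.

```python
import pandas as pd

# Hypothetical stand-in for holiday_dates["date"]: unsorted date strings
dates = pd.Series(["2010-09-10", "2011-11-25", "2010-02-12"], name="date")

# Convert the strings to datetime values, then sort chronologically
parsed = pd.to_datetime(dates).sort_values().reset_index(drop=True)

print(parsed)
```

On the real data, the same idea would be applied to `holiday_dates["date"]`.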


## Counting categorical variables

Counting is a simple way to summarize categorical data: it shows which categories are common and which are rare. In this exercise, you’ll explore the distribution of store types and department numbers using the de-duplicated DataFrames you created previously.

You’ll work with:

* `store_types`: Contains unique combinations of `store` and `type`
* `store_depts`: Contains unique combinations of `store` and `department`

### Instructions:

1. **Calculate the count** of each store `type` in `store_types`.
2. **Compute the proportion** (relative frequency) of each store `type`.
3. **Find the count** of each `department` in `store_depts`, sorted in **descending** order.
4. **Determine the proportion** of each `department`, also sorted in **descending** order.

In [8]:
# Count how many stores belong to each store type
store_type_counts = store_types['type'].value_counts()
print(store_type_counts)

type
A    11
B     1
Name: count, dtype: int64


In [9]:
# Calculate the proportion of each store type
store_type_proportions = store_types['type'].value_counts(normalize=True)
print(store_type_proportions)

type
A    0.916667
B    0.083333
Name: proportion, dtype: float64


In [10]:
# Count how many stores carry each department
# By default, value_counts() sorts results in descending order
department_counts = store_depts['department'].value_counts()
print(department_counts)

department
1     12
2     12
3     12
4     12
5     12
      ..
37    10
48     8
50     6
39     4
43     2
Name: count, Length: 80, dtype: int64


In [11]:
# Calculate each department's share of all unique store-department combinations
department_proportions = store_depts['department'].value_counts(normalize=True)
print(department_proportions)

department
1     0.012917
2     0.012917
3     0.012917
4     0.012917
5     0.012917
        ...   
37    0.010764
48    0.008611
50    0.006459
39    0.004306
43    0.002153
Name: proportion, Length: 80, dtype: float64


#### Key Note:

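* `value_counts()` sorts its result in **descending order** of count by default (`sort=True`, `ascending=False`), so no extra sorting step is needed.
* Passing `normalize=True` returns **proportions** (each count divided by the total number of non-null values in the Series) instead of counts.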
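
One consequence of the second point: on `store_depts`, `normalize=True` divides by the number of store-department rows, not by the number of stores. In the output above, department 1 has a proportion of about 0.0129 (12 of 929 rows) even though its count of 12 matches the 12 stores found in `store_types`. The sketch below shows both calculations on a hypothetical `toy_depts` DataFrame (invented values); `store_coverage` is an illustrative name, not something defined in the exercise.

```python
import pandas as pd

# Hypothetical toy version of store_depts: one row per (store, department) pair
toy_depts = pd.DataFrame({
    "store": [1, 1, 1, 2, 2, 3],
    "department": [1, 2, 3, 1, 2, 1],
})

# Counts per department, sorted in descending order by default
counts = toy_depts["department"].value_counts()

# Share of rows per department: these values sum to 1
shares = toy_depts["department"].value_counts(normalize=True)

# Share of stores carrying each department: divide by the number of distinct stores
n_stores = toy_depts["store"].nunique()
store_coverage = counts / n_stores

# To order by department number instead of by count, sort on the index
by_department = counts.sort_index()

print(shares)
print(store_coverage)
```

Which denominator is appropriate depends on the question being asked: share of rows describes the composition of the table, while share of stores describes how widely a department is carried.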
