# Clustering Task Instruction

In this task, we will work on a clustering problem. The [dataset](https://www.kaggle.com/datasets/crawford/weekly-sales-transactions) we will be working with contains 52 weeks of sales transaction records. The **ultimate goal** of this task is to find similar time series in the sales transaction data.

## 1. Dataset Information
Only one CSV file is provided. It contains 811 entries (rows) and 53 attributes.
The first column is `Product_Code`; the remaining 52 columns correspond to the 52 weeks of sales transactions. Normalised values are also provided.

## 2. Importing the Dataset
The dataset can be imported directly from a `.zip` file.
To import it, you need to specify the path to the file where the dataset is located.
The path below is relative to the location of this instruction file.
````python
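import zipfile

import pandas as pd

# Open the archive and read the CSV inside it without extracting it to disk
with zipfile.ZipFile('Data/sales_transactions_dataset_weekly.zip') as zf:
    with zf.open('sales_transactions_dataset_weekly.csv') as csv_file:
        df = pd.read_csv(csv_file)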
````
This path is specific to our repository.

## 3. Task Instruction
The steps below serve as guidance for solving this clustering problem. They are by no means mandatory, nor the only way to approach this particular dataset. Feel free to use what you have learned in the previous program and to be creative. Try to find your own approach to this problem.

**Step 1: Data loading & preprocessing**
- load the data into a Python notebook and convert it to an appropriate format (dataframe, numpy.array, list, etc.)
- observe & explore the dataset
- check for null values

**Step 2: Data modelling**
- standardize the data (for example, to zero mean and unit variance)
- pick one modelling approach and the corresponding Python package that you would like to use
- fit the model to the prepared data
- output the model
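
One possible way to carry out these modelling steps is sketched below, assuming k-means from `sklearn.cluster` as the chosen approach and `StandardScaler` for standardization. The data is synthetic (ten rising and ten falling series), and the choice of two clusters is an assumption made only for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 20 series of 12 weeks, half rising and half falling
rng = np.random.default_rng(0)
weeks = np.arange(12)
rising = weeks + rng.normal(0, 0.5, size=(10, 12))
falling = weeks[::-1] + rng.normal(0, 0.5, size=(10, 12))
X = np.vstack([rising, falling])

# Standardize each column (week) to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means; n_clusters=2 is an assumption for this toy data
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)

# Output the model: cluster label per series and the cluster centres
print(labels)
print(model.cluster_centers_.shape)
```

Note that `StandardScaler` scales per column (per week). For time series it can also be reasonable to scale each row (each product) instead, so that series are compared by shape rather than by sales volume; which variant fits better is a design choice to justify in your notebook.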

**Step 3: Result extraction & interpretation**
- draw conclusions and interpret the model and the final results
- evaluate the performance of your model and algorithm using different KPIs
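
As a hedged example of such an evaluation, the sketch below compares several values of k using two common indicators: the inertia (within-cluster sum of squares, used in the elbow method) and the silhouette score. These two metrics, the synthetic blobs, and the range of k are assumptions chosen for illustration, not a prescribed set of KPIs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in: three well-separated groups of points
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.6,
    random_state=0,
)

# Fit k-means for several k and record both indicators
results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = (model.inertia_, silhouette_score(X, model.labels_))

for k, (inertia, sil) in results.items():
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")

# Pick the k with the highest silhouette score
best_k = max(results, key=lambda k: results[k][1])
print("best k by silhouette:", best_k)
```

Plotting the inertia against k with matplotlib gives the usual elbow curve; the silhouette score offers a second view that does not rely on spotting the bend by eye.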

**Note!** Important criteria for evaluating your use case are well-documented cells, a well-structured notebook with headers that mark its various parts, and short comments on each part with the reflections and insights you gained.

## 4. Additional Resources
   
**Packages that might be useful for you:**
- pandas
- numpy
- sklearn
- sklearn.cluster
- matplotlib 

**Useful links:**
- k-means clustering: https://en.wikipedia.org/wiki/K-means_clustering
- hierarchical clustering: https://en.wikipedia.org/wiki/Hierarchical_clustering & https://towardsdatascience.com/understanding-the-concept-of-hierarchical-clustering-technique-c6e8243758ec
- the elbow method: https://en.wikipedia.org/wiki/Elbow_method_(clustering)

   

**Dataset citation:**  
[[Tan et al., 2014] Tan, S.C., Lau, J.P.S. (2014). Time Series Clustering: A Superior Alternative for Market Basket Analysis. In: Herawan, T., Deris, M., Abawajy, J. (eds) Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013). Lecture Notes in Electrical Engineering, vol 285. Springer, Singapore.](https://link.springer.com/chapter/10.1007/978-981-4585-18-7_28)