# Clustering Task Instruction

In this task, we will work on a clustering problem. The dataset summarizes the credit card usage behavior of about 9000 customers over 6 months. The **ultimate goal** of this task is to find groups of customers with similar usage behavior.

## 1. Dataset Information
The variable description is taken directly from the [dataset source](https://www.kaggle.com/datasets/arjunbhasin2013/ccdata).
The dataset summarizes the usage behavior of about 9000 active credit card holders during a timeframe of 6 months.
The file is at a customer level with 18 behavioral variables:

- `CUST_ID`: Identification of Credit Card holder (Categorical)
- `BALANCE`: Balance amount left in their account to make purchases
- `BALANCE_FREQUENCY`: How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- `PURCHASES`: Amount of purchases made from account
- `ONEOFF_PURCHASES`: Maximum purchase amount done in one-go
- `INSTALLMENTS_PURCHASES`: Amount of purchase done in installment
- `CASH_ADVANCE`: Cash in advance given by the user
- `PURCHASES_FREQUENCY`: How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- `ONEOFF_PURCHASES_FREQUENCY`: How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
- `PURCHASES_INSTALLMENTS_FREQUENCY`: How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
- `CASH_ADVANCE_FREQUENCY`: How frequently the cash in advance is being paid
- `CASH_ADVANCE_TRX`: Number of transactions made with "Cash in Advance"
- `PURCHASES_TRX`: Number of purchase transactions made
- `CREDIT_LIMIT`: Limit of Credit Card for user
- `PAYMENTS`: Amount of Payment done by user
- `MINIMUM_PAYMENTS`: Minimum amount of payments made by user
- `PRC_FULL_PAYMENT`: Percent of full payment paid by user
- `TENURE`: Tenure of credit card service for user

## 2. Importing the Dataset
The dataset can be read directly from the `.zip` archive without extracting it first.
To import it, specify the path to the archive and the name of the CSV file inside it.
The relative path below assumes that your notebook is in the same folder as this instruction file.
````python
import zipfile

import pandas as pd

# Open the archive and read the CSV it contains straight into a DataFrame.
with zipfile.ZipFile('Data/CC_GENERAL.zip') as zf:
    with zf.open('CC_GENERAL.csv') as f:
        df = pd.read_csv(f)
````
This folder layout is specific to our repository; adjust the path if you store the data elsewhere.

## 3. Task Instruction
The steps below serve as guidance for solving this clustering problem. They are by no means mandatory, nor the only way to approach this particular dataset. Feel free to use what you have learned in the previous program and to be creative. Try to find your own approach to this problem.

**Step 1: Data loading & preprocessing**
- load the data into a Python notebook and convert it to an appropriate format (dataframe, numpy.array, list, etc.)
- observe & explore the dataset
- check for null values
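
The exploration steps above could look roughly like the sketch below. The small DataFrame is made up for illustration and only borrows a few column names from the dataset; with the real data you would work on the `df` loaded in Section 2. Filling missing values with the column median is one possible choice and an assumption here, not a requirement of the task.

````python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: a few rows with some missing values.
df = pd.DataFrame({
    "CUST_ID": ["C1", "C2", "C3", "C4"],
    "BALANCE": [100.0, 3000.0, 2500.0, 1500.0],
    "CREDIT_LIMIT": [1000.0, 7000.0, np.nan, 7500.0],
    "MINIMUM_PAYMENTS": [150.0, np.nan, 600.0, np.nan],
})

# Observe and explore: shape, column types, and summary statistics.
print(df.shape)
print(df.dtypes)
print(df.describe())

# Check for null values: number of missing entries per column.
null_counts = df.isnull().sum()
print(null_counts)

# One possible treatment (an assumption, not prescribed by the task):
# drop the identifier column and fill numeric gaps with the column median.
features = df.drop(columns=["CUST_ID"])
features = features.fillna(features.median())
````

Dropping rows with missing values is an equally valid alternative; whichever you choose, note the reason in your notebook.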

**Step 2: Data modelling**
- standardize the features (for example to zero mean and unit variance) so that variables on different scales contribute comparably
- pick one clustering approach and the Python package that you would like to use for it
- fit the model to the prepared data
- output the fitted model (for example the cluster assignments)
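
As a minimal sketch of the modelling steps above, the example below standardizes a feature matrix with scikit-learn's `StandardScaler` and fits `KMeans` to it. The data is synthetic and the choice of k-means with `n_clusters=2` is an assumption made for illustration; any other clustering approach is just as acceptable for the task.

````python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric features: two well-separated groups
# whose columns live on very different scales.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[10.0, 100.0], scale=0.5, size=(50, 2)),
])

# Standardize: each column ends up with mean 0 and standard deviation 1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit k-means; n_clusters=2 is used only because the toy data has two groups.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)

# Output the model: cluster centers (in standardized units) and cluster sizes.
print(model.cluster_centers_)
print(np.bincount(labels))
````

Note that standardization rescales each column but does not change the shape of its distribution, so skewed variables stay skewed.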

**Step 3: Result extraction & interpretation**
- draw your conclusions from the model and the final results and interpret them
- evaluate the performance of your model and algorithm using different KPIs
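
Two KPIs that are commonly used for clustering are the inertia (within-cluster sum of squared distances) and the silhouette score. The sketch below computes both for a k-means model on synthetic data; the data, the value `n_clusters=3`, and the choice of these two metrics are assumptions for illustration, not the only valid way to evaluate a model.

````python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features: three separated groups.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(40, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(40, 2)),
    rng.normal(loc=[0.0, 5.0], scale=0.5, size=(40, 2)),
])
X_scaled = StandardScaler().fit_transform(X)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Inertia: within-cluster sum of squared distances (lower means tighter clusters).
inertia = model.inertia_
# Silhouette score: lies in [-1, 1]; higher means better separated clusters.
silhouette = silhouette_score(X_scaled, model.labels_)
print(f"inertia={inertia:.2f}, silhouette={silhouette:.2f}")

# Interpretation aid: the mean of each feature per cluster, in original units.
for k in range(3):
    print(k, X[model.labels_ == k].mean(axis=0))
````

Per-cluster feature means in the original units are often easier to interpret than the standardized cluster centers.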

**Note!** Important criteria for evaluating your use case are well-documented cells, a clearly structured notebook with headers that mark its parts, and short comments on each part with the reflections and insights you gained.

## 4. Additional Resources
   
**Packages that might be useful for you:**
- pandas
- numpy
- sklearn
- sklearn.cluster
- matplotlib 

**Useful links:**
- k-means clustering: https://en.wikipedia.org/wiki/K-means_clustering
- hierarchical clustering: https://en.wikipedia.org/wiki/Hierarchical_clustering & https://towardsdatascience.com/understanding-the-concept-of-hierarchical-clustering-technique-c6e8243758ec
- the elbow method: https://en.wikipedia.org/wiki/Elbow_method_(clustering)
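
The elbow method from the last link can be sketched as follows: fit k-means for a range of `k` and look for the point where the inertia stops dropping sharply. The data below is synthetic with three groups, and the range of `k` is an arbitrary assumption for illustration.

````python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three groups, used only to illustrate the method.
rng = np.random.default_rng(2)
centers = ([0.0, 0.0], [6.0, 0.0], [3.0, 5.0])
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(30, 2)) for c in centers])

# Fit k-means for several values of k and record the inertia of each fit.
ks = list(range(1, 7))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# The "elbow" is where adding another cluster no longer reduces inertia much.
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
````

Plotting `inertias` against `ks` with matplotlib usually makes the elbow easier to spot than reading the numbers.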

   

**Dataset citation:**
https://www.kaggle.com/datasets/arjunbhasin2013/ccdata