Problem type: [Online Retail II Dataset](https://archive.ics.uci.edu/ml/datasets/Online+Retail+II)

Deadline: 17th Jan
Soft Deadline: 14th Jan

Task 1: Retrieving and Preparing the Data
- The goal of the project: order cancellation
- Pre-process data: data cleaning
    - Missing values
    - Correct data type
- Create "OrderCancelled" column from "InvoiceNo"

(Trung)
- InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
- OrderCancelled

(Thao)
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.

(Hoang)
- UnitPrice: Unit price. Numeric. Product price per unit in sterling (£).
- CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal. The name of the country where a customer resides.


(Trung) select based on ALL data

Task 2: Feature Engineering 
- Select relevant columns: use corr() => select all columns with high correlation
Note: These steps must be performed consistently for train/val/test sets.
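
One way to keep the selection consistent across splits is to split first, measure correlation on the training split only, and reuse the resulting column list for validation and test. The sketch below assumes numeric features and a binary `OrderCancelled` target; the helper name `select_by_correlation`, the 0.1 threshold, the 60/20/20 split, and the toy data are illustrative assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd

def select_by_correlation(train, target, threshold=0.1):
    """Return feature names whose absolute Pearson correlation with
    `target`, measured on the training split only, is >= threshold."""
    corr = train.corr()[target].drop(target)
    return corr[corr.abs() >= threshold].index.tolist()

# Toy data standing in for the engineered features (hypothetical values).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Quantity": rng.integers(-5, 20, size=n),
    "Price": rng.uniform(0.5, 10.0, size=n),
})
toy["OrderCancelled"] = (toy["Quantity"] < 0).astype(int)

# Split first, then select on the training part only.
shuffled = toy.sample(frac=1.0, random_state=0)
train, val, test = shuffled.iloc[:120], shuffled.iloc[120:160], shuffled.iloc[160:]

selected = select_by_correlation(train, "OrderCancelled")

# Apply the same column list to every split.
train_X, val_X, test_X = train[selected], val[selected], test[selected]
```

Note that `corr()` measures linear association only, so a column with a low score may still be useful to a non-linear model.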

Task 3: Data Modelling
- Model the data by treating it as Clustering AND Classification Task

Qs:
- "These steps must be performed consistently for train/val/test sets" => when do we split the data?
- "You must use at least two different models for each approach" => confirm that we select 2 approaches?

Task 4: Report:

Create PDF: https://docs.google.com/document/d/1Xp_-m0HaH1vw3eBdGQDE0IwwjcMSG8yXpYFJeAdOhV8/edit

Task 5: Presentation

Prepare 10-12 slides for in-class presentation and demonstration.


Phase 1: Task 1 + Task 2 (for all data set) (Deadline: 5th Jan)

Phase 2: Data modelling (5th -> 12th Jan)

Phase 3: Report + Presentation slides (12th -> 14th Jan)

## Task 1: Retrieving and Understanding Data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# TODO: change file path to the correct one on your computer
file_path = "/kaggle/input/online-retail-iixlsx/online_retail_II.xlsx"

df = pd.read_excel(file_path)

**Attribute Information**

* **`Invoice`**: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
* **`StockCode`**: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
* **`Description`** : Product (item) name. Nominal.
* **`Quantity`**: The quantities of each product (item) per transaction. Numeric.
* **`InvoiceDate`**: Invoice date and time. Numeric. The day and time when a transaction was generated.
* **`Price`**: Unit price. Numeric. Product price per unit in sterling (£).
* **`Customer ID`**: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
* **`Country`**: Country name. Nominal. The name of the country where a customer resides.
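
The attribute descriptions above suggest a target dtype for each column. A minimal sketch of that conversion follows, run on a small hand-made frame that reuses the dataset's column names; the sample rows are illustrative and the choice of the nullable `Int64` dtype for `Customer ID` is an assumption, not something the notebook prescribes.

```python
import pandas as pd

# Hand-made rows for illustration only.
sample = pd.DataFrame({
    "Invoice": [489434, "C489449"],
    "StockCode": [85048, "79323P"],
    "InvoiceDate": ["2009-12-01 07:45:00", "2009-12-01 10:33:00"],
    "Customer ID": [13085.0, None],
})

# Nominal identifiers: store as strings so numeric and prefixed codes
# (e.g. 489434 and "C489449") are handled uniformly.
sample["Invoice"] = sample["Invoice"].astype(str)
sample["StockCode"] = sample["StockCode"].astype(str)

# Timestamps: parse to datetime64 so the .dt accessor and date arithmetic work.
sample["InvoiceDate"] = pd.to_datetime(sample["InvoiceDate"])

# Customer ID is read as float because of missing values; the nullable
# Int64 dtype keeps the missing entries while dropping the trailing ".0".
sample["Customer ID"] = sample["Customer ID"].astype("Int64")
```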

**Data Shape**

In [None]:
len(df)

In [None]:
df.info()

## Data Cleaning
**Missing values**

In [None]:
df.isnull().any()

In [None]:
# Cancellation invoices (Invoice contains "C") that also have a negative price
print('Number of cancellation invoices that also have a negative price',
      df.loc[(df['Invoice'].str.contains('C', na = False)) & (df['Price'] < 0)].shape[0])

The dataset has 525461 rows and 8 columns, with missing values in the **Customer ID** and **Description** columns.

In [None]:
df.isnull().sum()

Specifically, the **Description** column has 2928 missing values, while the **Customer ID** column has 107927.
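
Raw counts are easier to judge next to their share of all rows. The sketch below uses a hypothetical helper, `missing_summary`, on a toy frame with made-up values; it is one possible way to tabulate the result, not the notebook's own method.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Per-column missing count and percentage, for columns with any nulls."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame with made-up values, reusing the dataset's column names.
toy = pd.DataFrame({
    "Description": ["MUG", None, "BAG", "BOX"],
    "Customer ID": [1.0, np.nan, np.nan, 4.0],
    "Price": [1.0, 0.0, 2.5, 3.0],
})
report = missing_summary(toy)
print(report)
```

On the notebook's data the same helper would be called as `missing_summary(df)`.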

In [None]:
df.duplicated().sum()

In [None]:
df[df["Invoice"].str.contains("C", na=False)]

In [None]:
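# Add new column: OrderCancelled (1 = cancellation invoice, 0 = regular invoice)
def map_order_cancelled(invoice):
    # Keep missing invoice numbers as missing
    if pd.isna(invoice):
        return None

    # Regular invoices are read as integers and contain no "C", so they map to 0
    return 1 if "C" in str(invoice).upper() else 0

df["OrderCancelled"] = df["Invoice"].map(map_order_cancelled)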


**Description:**


In [None]:
df[df["Description"].isnull()]

Rows where **Description** and **Customer ID** are both null and **Price** equals 0 are dropped because they are considered failed transactions.

In [None]:
# Drop failed transactions
idx = df[(df['Description'].isnull()) & (df['Customer ID'].isnull()) & (df['Price']==0)].index.values
df.drop(idx, inplace=True)

In [None]:
# Check the missing values that remain after dropping failed transactions
print("Remaining missing values per column:")
print(df.isnull().sum())

**Quantity:**
The **Quantity** column records the number of items purchased and also encodes cancelled or returned items as negative values. Cancelled transactions are identified by an invoice number that starts with C. There is only one row whose invoice starts with C but whose Quantity is positive (1).
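
The relationship between the cancellation prefix and the sign of the quantity can also be summarised in a single cross-tabulation. The sketch below runs on a toy frame with made-up rows; the labels `cancelled`/`regular` and `negative`/`non-negative` are illustrative names, and the prefix test uses `startswith("C")` as an assumption based on the attribute description.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Invoice": [489434, "C489449", "C489450", 489435],
    "Quantity": [12, -6, 1, -3],
})

# Flag cancellation invoices by their "C" prefix (case-insensitive).
is_cancelled = toy["Invoice"].astype(str).str.upper().str.startswith("C")
invoice_kind = is_cancelled.map({True: "cancelled", False: "regular"}).rename("invoice")

# Label each row by the sign of its quantity.
qty_sign = (toy["Quantity"] < 0).map({True: "negative", False: "non-negative"}).rename("quantity")

# Rows: invoice kind; columns: quantity sign; cells: row counts.
table = pd.crosstab(invoice_kind, qty_sign)
print(table)
```

Off-diagonal cells (a cancelled invoice with a non-negative quantity, or a regular invoice with a negative one) point to the rows worth inspecting by hand.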

In [None]:
print('The number of entries with negative quantity', df[(df['Quantity'] < 0)].shape[0])

In [None]:
print('Number of cancellation invoices that also have negative quantity',
      df.loc[(df['Invoice'].str.contains('C', na = False)) & (df['Quantity'] < 0)].shape[0])

In [None]:
df.loc[(df['Invoice'].str.contains('C', na = False)) & (df['Quantity'] < 0)]

In [None]:
df.loc[(df['Invoice'].str.contains('C', na = False)) & (df['Quantity'] >= 0)]

In [None]:
import matplotlib.pyplot as plt

df[["Price","Quantity"]].plot(kind="box")

**InvoiceDate:**

In [None]:
max_invoice_date = df['InvoiceDate'].max()
min_invoice_date = df['InvoiceDate'].min()
date_fmt = '%Y-%m-%d'
print(f"The data ranges from {min_invoice_date.strftime(date_fmt)} to {max_invoice_date.strftime(date_fmt)}")

In [None]:
duplicated = df[df.duplicated(keep = False)].sort_values(by = ['InvoiceDate','Invoice','StockCode'])
duplicated