# Task No1 OLTP Systems
## Task

1. Read Lecture #1: ACID requirements, OLTP systems
2. Find an OLTP dataset (CSV file) containing more than 1,000,000 records in open resources such as https://www.kaggle.com/datasets or https://www.stats.govt.nz/large-datasets/csv-files-for-download/
3. Select a list of columns (not all columns need to be chosen)
4. Describe each selected column.
5. Describe the functionality of the OLTP system.
6. Dataset should not be repeated.

For example:
A dataset containing statistics of calls to the 101 service over a period. For this dataset, the functionality can be the following:
- registering a call to the 101 service;
- recording the outcome of the 101 service call;
- finding the nearest ambulance, etc.

## Analysis

I found this dataset on Kaggle, and I expect it to meet the requirements of Task 1:

- Dataset Name: eCommerce Events History in Cosmetics Shop
- Dataset source: https://www.kaggle.com/datasets/mkechinov/ecommerce-events-history-in-cosmetics-shop/
- Description: The dataset contains behavior data for 5 months (October 2019 to February 2020) from a medium-sized online cosmetics store.

I selected the 2020-Jan.csv file for analysis:

In [1]:
import pandas as pd
from IPython.display import display

# read the csv
orders_data = pd.read_csv('dataset/2020-Jan.csv')

Display basic information about the DataFrame:

In [16]:
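import html
import io
from IPython.display import HTML
# capture the info() text, which would otherwise be printed to stdout
buffer = io.StringIO()
orders_data.info(buf=buffer)
info = buffer.getvalue()
# escape the text so "<class ...>" is shown literally instead of being parsed as an HTML tag
display(HTML(f'<pre>{html.escape(info)}</pre>'))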
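
The task requires more than 1,000,000 records. As a minimal sketch, the record count could also be checked without loading the whole file into memory, using only the standard library. The helper name `count_records` and the in-memory sample are illustrative assumptions, not part of the original notebook.

```python
import csv
import io

def count_records(lines):
    """Count data rows in a CSV stream, excluding the header row."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row, if there is one
    return sum(1 for _ in reader)

# Tiny in-memory sample standing in for the real file (illustrative only).
sample = io.StringIO(
    "event_time,event_type,product_id\n"
    "2020-01-01 00:00:00 UTC,view,5809910\n"
    "2020-01-01 00:00:09 UTC,view,5812943\n"
)
n_records = count_records(sample)
print(n_records)
```

For the real file, the same function could be called as `count_records(open('dataset/2020-Jan.csv', newline=''))` and the result compared with 1,000,000.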

The first 10 rows of data are shown:

In [17]:
display(orders_data.head(10))

| | event_time | event_type | product_id | category_id | category_code | brand | price | user_id | user_session |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 00:00:00 UTC | view | 5809910 | 1602943681873052386 | | grattol | 5.24 | 595414620 | 4adb70bb-edbd-4981-b60f-a05bfd32683a |
| 1 | 2020-01-01 00:00:09 UTC | view | 5812943 | 1487580012121948301 | | kinetics | 3.97 | 595414640 | c8c5205d-be43-4f1d-aa56-4828b8151c8a |
| 2 | 2020-01-01 00:00:19 UTC | view | 5798924 | 1783999068867920626 | | zinger | 3.97 | 595412617 | 46a5010f-bd69-4fbe-a00d-bb17aa7b46f3 |
| 3 | 2020-01-01 00:00:24 UTC | view | 5793052 | 1487580005754995573 | | | 4.92 | 420652863 | 546f6af3-a517-4752-a98b-80c4c5860711 |
| 4 | 2020-01-01 00:00:25 UTC | view | 5899926 | 2115334439910245200 | | | 3.92 | 484071203 | cff70ddf-529e-4b0c-a4fc-f43a749c0acb |
| 5 | 2020-01-01 00:00:30 UTC | view | 5837111 | 1783999068867920626 | | staleks | 6.35 | 595412617 | 46a5010f-bd69-4fbe-a00d-bb17aa7b46f3 |
| 6 | 2020-01-01 00:00:37 UTC | cart | 5850281 | 1487580006300255120 | | marathon | 137.78 | 593016733 | 848f607c-1d14-474a-8869-c40e60783c9d |
| 7 | 2020-01-01 00:00:46 UTC | view | 5802440 | 2151191070908613477 | | | 2.16 | 595411904 | 74ca1cd5-5381-4ffe-b00b-a258b390db77 |
| 8 | 2020-01-01 00:00:57 UTC | view | 5726464 | 1487580005268456287 | | | 5.56 | 420652863 | 546f6af3-a517-4752-a98b-80c4c5860711 |
| 9 | 2020-01-01 00:01:02 UTC | remove_from_cart | 5850281 | 1487580006300255120 | | marathon | 137.78 | 593016733 | 848f607c-1d14-474a-8869-c40e60783c9d |


Check for missing values in each column:

In [21]:
missing_data = orders_data.isnull().sum()

print(f"Missing Data in Key Columns: {missing_data}")

Missing Data in Key Columns: event_time             0
event_type             0
product_id             0
category_id            0
category_code    4190033
brand            1775630
price                  0
user_id                0
user_session        1314
dtype: int64


Three columns have missing values. category_code (4,190,033 missing) and brand (1,775,630 missing) are affected most and would need to be populated properly to be compliant with Task 1, so I exclude these two columns for now. user_session is missing in only 1,314 rows.
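
The decision about which columns to keep can be made explicit by computing the share of missing values per column. The following is a minimal sketch: the 30% cut-off, the helper name `missing_share`, and the small sample frame are assumptions for illustration. In the notebook, `orders_data` would be passed instead of `sample`.

```python
import pandas as pd

def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)

# Small illustrative frame with placeholder values (not real dataset rows).
sample = pd.DataFrame({
    "event_type": ["view", "cart", "view", "purchase"],
    "category_code": [None, None, None, "placeholder"],
    "brand": ["grattol", None, "zinger", None],
    "user_session": ["a", "b", None, "d"],
})

share = missing_share(sample)
# Assumed cut-off: keep only columns with at most 30% missing values.
threshold = 0.3
selected = [col for col in sample.columns if share[col] <= threshold]
print(share)
print(selected)
```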

Check for transaction types:

In [24]:
transaction_types = orders_data['event_type'].unique()
print(f"Transaction Types: {transaction_types}")

Transaction Types: ['view' 'cart' 'remove_from_cart' 'purchase']


The event_type column has 4 user actions:
- view: The user views a product;
- cart: The user adds a product to the shopping cart;
- remove_from_cart: The user removes a product from the shopping cart;
- purchase: The user purchases a product.
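
An OLTP system would normally reject events whose type is not one of these four values. One possible way to express that in application code is an enumeration. This is a sketch only; the names `EventType` and `parse_event_type` are hypothetical and do not come from the dataset.

```python
from enum import Enum

class EventType(str, Enum):
    """The four values observed in the dataset's event_type column."""
    VIEW = "view"
    CART = "cart"
    REMOVE_FROM_CART = "remove_from_cart"
    PURCHASE = "purchase"

def parse_event_type(raw: str) -> EventType:
    """Convert a raw string to an EventType, rejecting unknown values."""
    try:
        return EventType(raw)
    except ValueError:
        raise ValueError(f"unknown event type: {raw!r}") from None

print(parse_event_type("cart").value)
```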

Check for concurrent events (events that share the same timestamp):

In [26]:
concurrent_events = orders_data['event_time'].value_counts().head(10)
print(f"Concurrent Events: ")
print(f"{concurrent_events}")

Concurrent Events: 
2020-01-09 09:06:06 UTC    257
2020-01-14 20:24:38 UTC    235
2020-01-22 11:21:29 UTC    181
2020-01-19 12:10:19 UTC    180
2020-01-29 12:16:49 UTC    173
2020-01-27 13:46:56 UTC    171
2020-01-23 12:57:31 UTC    170
2020-01-29 07:53:06 UTC    138
2020-01-16 13:21:26 UTC    137
2020-01-17 15:26:56 UTC    135
Name: event_time, dtype: int64


I found that at certain moments the system records a large number of events with the same timestamp. For example, 257 events share the timestamp 2020-01-09 09:06:06 UTC.

With this level of concurrency, I believe the system must satisfy OLTP requirements (ACID transactions) to ensure the integrity of business data and the robustness of the system.
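
The count per timestamp shows how many events share a second, but not how many different users produced them. A possible follow-up check, sketched below, reports both numbers for the busiest timestamps. The function name `busiest_seconds` and the sample user IDs are illustrative assumptions; in the notebook, `orders_data` would be passed instead of `sample`.

```python
import pandas as pd

def busiest_seconds(df: pd.DataFrame, top: int = 3) -> pd.DataFrame:
    """Per timestamp: number of events and number of distinct users."""
    per_second = df.groupby("event_time").agg(
        events=("user_id", "size"),
        distinct_users=("user_id", "nunique"),
    )
    return per_second.sort_values("events", ascending=False).head(top)

# Illustrative sample: three events in one second from two users, one event in the next.
sample = pd.DataFrame({
    "event_time": ["2020-01-09 09:06:06 UTC"] * 3 + ["2020-01-09 09:06:07 UTC"],
    "user_id": [1, 1, 2, 3],
})
peaks = busiest_seconds(sample, top=1)
print(peaks)
```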

## My results

This is a dataset that records user behavior on an e-commerce website over a period of time.

### Description of each selected column:

- **event_time**: Timestamp of when the event occurred, vital for analyzing user behavior patterns and system performance.
- **event_type**: Type of user interaction with a product, such as "view", "cart", "remove_from_cart", or "purchase".
- **product_id**: A unique identifier for the product that the event refers to.
- **user_id**: A unique identifier for the user, allowing the system to track and analyze individual user behavior patterns.
- **user_session**: A unique identifier for the user session. I believe a session records a user's activity over a certain period of time, grouping that user's events into short-term segments.
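
The selected columns could map to a single event table in a relational database. The sketch below uses SQLite from the Python standard library. The table name `events`, the surrogate key `event_id`, and the helper `record_event` are assumptions made for illustration and are not part of the dataset.

```python
import sqlite3

SCHEMA = """
CREATE TABLE events (
    event_id     INTEGER PRIMARY KEY,   -- assumed surrogate key
    event_time   TEXT    NOT NULL,
    event_type   TEXT    NOT NULL
                 CHECK (event_type IN ('view', 'cart', 'remove_from_cart', 'purchase')),
    product_id   INTEGER NOT NULL,
    user_id      INTEGER NOT NULL,
    user_session TEXT                   -- nullable: some rows in the dataset lack a session
);
"""

def record_event(conn, event_time, event_type, product_id, user_id, user_session=None):
    """Insert one event; the transaction commits on success and rolls back on error."""
    with conn:  # a sqlite3 connection used as a context manager commits or rolls back
        cur = conn.execute(
            "INSERT INTO events (event_time, event_type, product_id, user_id, user_session) "
            "VALUES (?, ?, ?, ?, ?)",
            (event_time, event_type, product_id, user_id, user_session),
        )
    return cur.lastrowid

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first_id = record_event(
    conn, "2020-01-01 00:00:37 UTC", "cart", 5850281, 593016733,
    "848f607c-1d14-474a-8869-c40e60783c9d",
)
print(first_id)
```

The `CHECK` constraint makes the database itself reject an unknown event type, which is one way to support the consistency part of ACID.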

### For this dataset, the functionality can be the following:

**1. Event Tracking**:
- Capturing and logging user interactions as they occur in real time.
- Identifying the type of interaction (such as 'view', 'cart', 'remove_from_cart', 'purchase') to understand user engagement.


**2. Inventory Management**:
- Tracking the flow of products through user interactions by product identifiers.
- Managing inventory levels based on user activities such as adding to cart and purchasing, which may affect stock levels.
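
The dataset itself has no stock column, so the following is only a sketch of how such a system might prevent overselling under concurrent purchases. The `stock` table, its `quantity` column, and the helper `reserve_stock` are hypothetical. The key idea is a conditional `UPDATE` that only succeeds while enough units remain.

```python
import sqlite3

def reserve_stock(conn, product_id, quantity=1):
    """Decrement stock only if enough units remain; return True on success."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE stock SET quantity = quantity - ? "
            "WHERE product_id = ? AND quantity >= ?",
            (quantity, product_id, quantity),
        )
    # rowcount is 1 when the row matched the condition and was updated, else 0
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock ("
    "product_id INTEGER PRIMARY KEY, "
    "quantity INTEGER NOT NULL CHECK (quantity >= 0))"
)
conn.execute("INSERT INTO stock VALUES (5850281, 1)")  # one unit in stock (made-up value)
conn.commit()

first = reserve_stock(conn, 5850281)   # takes the last unit
second = reserve_stock(conn, 5850281)  # nothing left to reserve
print(first, second)
```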

**3. Customer Identification**:
- Identifying and tracking users through unique identifiers to provide a personalized experience and support customer service operations.

**4. Sales Processing**:
- Processing purchase transactions and ensuring that each one either completes fully or is rolled back, using robust transaction processing mechanisms.
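
Atomicity for a purchase could look like the following sketch, in which the purchase event and an order row are written in one transaction. The `orders` table, the simplified `events` table, the helper `process_purchase`, and the timestamps are assumptions for illustration only.

```python
import sqlite3

def process_purchase(conn, user_id, product_id, price, event_time):
    """Write the purchase event and the order row as one atomic unit."""
    with conn:  # both inserts commit together, or neither does
        conn.execute(
            "INSERT INTO events (event_time, event_type, product_id, user_id) "
            "VALUES (?, 'purchase', ?, ?)",
            (event_time, product_id, user_id),
        )
        conn.execute(
            "INSERT INTO orders (user_id, product_id, price) VALUES (?, ?, ?)",
            (user_id, product_id, price),
        )

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_time TEXT NOT NULL, event_type TEXT NOT NULL,
                     product_id INTEGER NOT NULL, user_id INTEGER NOT NULL);
CREATE TABLE orders (user_id INTEGER NOT NULL, product_id INTEGER NOT NULL,
                     price REAL NOT NULL CHECK (price >= 0));
""")

process_purchase(conn, 593016733, 5850281, 137.78, "2020-01-01 00:05:00 UTC")
try:
    # A negative price violates the CHECK constraint on orders ...
    process_purchase(conn, 593016733, 5850281, -1.0, "2020-01-01 00:06:00 UTC")
except sqlite3.IntegrityError:
    pass  # ... so the event inserted just before it is rolled back as well

event_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(event_count, order_count)
```

After the failed second call, both tables still hold exactly one row each, so no purchase event exists without a matching order.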

**5. Analytics and Reporting**:
- Analyzing user behavior for business intelligence to inform marketing strategies, product placement, and inventory planning.
- Generating reports on sales trends, product popularity, and user engagement.
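
A report on product popularity could be derived from the purchase events. The sketch below assumes that each 'purchase' row represents one unit sold at the listed `price`. The function name `purchase_report` and the sample values are illustrative; in the notebook, `orders_data` would be passed instead of `sample`.

```python
import pandas as pd

def purchase_report(df: pd.DataFrame) -> pd.DataFrame:
    """Units sold and revenue per product, best sellers first."""
    purchases = df[df["event_type"] == "purchase"]
    report = purchases.groupby("product_id")["price"].agg(units="count", revenue="sum")
    return report.sort_values(["units", "revenue"], ascending=False)

# Illustrative sample with made-up product IDs and prices.
sample = pd.DataFrame({
    "event_type": ["purchase", "view", "purchase", "purchase"],
    "product_id": [1, 1, 1, 2],
    "price": [2.0, 2.0, 2.0, 5.0],
})
report = purchase_report(sample)
print(report)
```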