In [1]:
import pandas as pd
import numpy as np

In [23]:
df = pd.read_csv('../data/raw/sales_2023_2025_11.csv')

# Data Overview & Initial Assessment
The data was exported directly from Google Sheets and reflects manual data entry, which resulted in inconsistencies and missing values.

In [43]:
# Preview the first five rows of the dataframe
df.head()

Unnamed: 0,sale_id,sale_date,purchase_date,product_description,manager,purchase_price_uah,sale_price_uah,margin_uah
0,6070.0,1/5/2023,12/21/2022,HP i7 8750 16 1256 1070,manager_2,28150.0,33500.0,5350
1,6086.0,1/5/2023,12/24/2022,dell 13 i3 5005u 8 ssd128,manager_2,4460.0,6500.0,2040
2,,,,док станция + hdmi,,,700.0,700
3,5725.0,1/5/2023,9/9/2022,Монітор Samsung S24R350F,manager_1,4000.0,6000.0,2000
4,6085.0,1/5/2023,12/24/2022,dell 13 i3 5005u 8 ssd128,manager_2,4460.0,6500.0,2040


In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3429 entries, 0 to 3428
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   sale_id              2274 non-null   float64
 1   sale_date            2273 non-null   object 
 2   purchase_date        2273 non-null   object 
 3   product_description  2543 non-null   object 
 4   manager              1960 non-null   object 
 5   purchase_price_uah   2396 non-null   float64
 6   sale_price_uah       2502 non-null   float64
 7   margin_uah           3429 non-null   int64  
dtypes: float64(3), int64(1), object(4)
memory usage: 214.4+ KB


At this stage, the dataset does not contain proper column headers.  
When the CSV file was loaded, pandas incorrectly interpreted the first row as column names.
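
If the raw file really has no header row, one option is to tell pandas so at load time, so that the first record is kept as data. The following is a minimal sketch under that assumption; the inline two-row sample is hypothetical and stands in for the raw export.

```python
import io

import pandas as pd

COLUMN_NAMES = [
    "sale_id", "sale_date", "purchase_date", "product_description",
    "manager", "purchase_price_uah", "sale_price_uah", "margin_uah",
]

# Hypothetical two-row sample standing in for a headerless raw export.
raw = io.StringIO(
    "6070,1/5/2023,12/21/2022,HP i7 8750,manager_2,28150,33500,5350\n"
    "6086,1/5/2023,12/24/2022,dell 13 i3,manager_2,4460,6500,2040\n"
)

# header=None stops pandas from treating the first data row as column names;
# names= supplies the headers explicitly.
sample = pd.read_csv(raw, header=None, names=COLUMN_NAMES)
print(sample.shape)
```

Assigning `df.columns` after loading only renames the columns; it does not recover a first row that was already consumed as a header.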

In [45]:
# Assign explicit column names
df.columns = [
    "sale_id",
    "sale_date",
    "purchase_date",
    "product_description",
    "manager",
    "purchase_price_uah",
    "sale_price_uah",
    "margin_uah",
]

In [57]:
# Map manager names to standardized identifiers
manager_map = {
    "Игорь": "manager_1",
    "Паша": "manager_2",
    "Коля": "manager_3",
    "коля": "manager_3",
}

df["manager"] = df["manager"].replace(manager_map)
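
The mapping above lists two spellings of the same name. A possible alternative, sketched here on a hypothetical sample rather than the real column, is to normalize whitespace and case first so that one key covers every variant.

```python
import pandas as pd

# Hypothetical sample with inconsistent casing and stray whitespace.
managers = pd.Series(["Коля", "коля ", " Паша", None, "Игорь"])

# Keys are lowercase because the values are lowercased before mapping.
manager_map_normalized = {
    "игорь": "manager_1",
    "паша": "manager_2",
    "коля": "manager_3",
}

# Strip surrounding whitespace and lowercase; missing values stay missing.
normalized = managers.str.strip().str.lower()

# replace() leaves unmapped and missing values untouched.
standardized = normalized.replace(manager_map_normalized)
print(standardized.tolist())
```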

In [47]:
# Check for missing values
df.isna().sum()

sale_id                1155
sale_date              1156
purchase_date          1156
product_description     886
manager                1469
purchase_price_uah     1033
sale_price_uah          927
margin_uah                0
dtype: int64

In [48]:
# Check for rows where all key fields are missing
empty_rows = df[
    df["product_description"].isna() &
    df["sale_date"].isna() &
    df["purchase_price_uah"].isna() &
    df["sale_price_uah"].isna()
]

empty_rows.shape


(884, 8)

In [49]:
empty_rows.head()

Unnamed: 0,sale_id,sale_date,purchase_date,product_description,manager,purchase_price_uah,sale_price_uah,margin_uah
87,,,,,,,,0
88,,,,,,,,0
89,,,,,,,,0
90,,,,,,,,0
91,,,,,,,,0


### Structural Empty Rows Identification

Using key business fields (`product_description`, `sale_date`, `purchase_price_uah`, `sale_price_uah`),  
884 rows were identified as invalid sales records.

These rows contain values only in the `margin_uah` column (set to 0) and have missing values across all essential sales attributes.
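
One way to remove such rows is `DataFrame.dropna` restricted to the key fields. This is a minimal sketch on a hypothetical three-row sample, not the cleaning step actually applied to the dataset.

```python
import numpy as np
import pandas as pd

KEY_COLS = ["product_description", "sale_date", "purchase_price_uah", "sale_price_uah"]

# Hypothetical sample: one valid sale, one structural empty row,
# and one partially filled row.
sample = pd.DataFrame({
    "product_description": ["dell 13 i3", np.nan, "док станция + hdmi"],
    "sale_date": ["1/5/2023", np.nan, np.nan],
    "purchase_price_uah": [4460.0, np.nan, np.nan],
    "sale_price_uah": [6500.0, np.nan, 700.0],
    "margin_uah": [2040, 0, 700],
})

# how="all" drops a row only when every key column is missing,
# so partially filled rows survive for later inspection.
cleaned = sample.dropna(subset=KEY_COLS, how="all").reset_index(drop=True)
print(len(sample) - len(cleaned), "structural empty rows removed")
```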

In [61]:
# Check for partially filled rows: 2 to 6 non-null values out of 8 columns
# (between() is inclusive on both ends)
partial_rows = df[df.notna().sum(axis=1).between(2, 6)]

partial_rows.shape


(276, 8)

In [62]:
partial_rows.head(10)

Unnamed: 0,sale_id,sale_date,purchase_date,product_description,manager,purchase_price_uah,sale_price_uah,margin_uah
2,,,,док станция + hdmi,,,700.0,700
14,,,,клава + мышка,,,550.0,550
30,,,,возврат за обогреватель Розетка,,,3495.0,3495
34,,,,lenovo thinkpad x280 i5 8350u 16 ssd256,manager_3,8000.0,10500.0,2500
50,,,,lenovo x280 i5 8пок. 8gb / 16 gb 13шт,manager_3,8000.0,10100.0,2100
53,,,,thp S1 13'',,,5900.0,5900
55,,,,HP a10 9600 6 ssd128 r7,,,8300.0,8300
63,,,,lenovo x280 i5 8пок. 8gb / 16 gb 12 шт,manager_3,42500.0,50500.0,8000
80,,,,сервис + ssd128,,,800.0,800
84,,,,lenovo x280 i5 8пок. 8gb / 16 gb 7шт,,42500.0,50500.0,8000


### Partially Filled Rows Analysis

In addition to fully empty structural rows, 276 partially filled rows were identified.

These rows fall into two distinct categories:

**1. Non-sales operational records**  
Rows representing accessories, refunds, or miscellaneous transactions (e.g. keyboards, docks, refunds).  
Such records contain a product description, a sale price, and a margin value (equal to the sale price in the rows shown above) but lack a sale ID, dates, a purchase price, and manager information.  
These entries do not represent laptop sales and should be treated separately or excluded from the sales dataset.

**2. Incomplete sales records**  
Rows that represent actual laptop sales but have missing key attributes such as the sale ID and the sale or purchase dates.  
These records contain product descriptions, purchase and sale prices, and, where recorded, manager information, but require additional handling during data cleaning.
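
The two categories could be separated with a simple rule. The sketch below assumes a heuristic that is not part of the original analysis: a row with no purchase price whose margin equals its sale price is treated as an operational record, and everything else as an incomplete sale. The sample data and the `record_type` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample modeled on the partial rows shown above.
partial = pd.DataFrame({
    "product_description": ["клава + мышка", "lenovo thinkpad x280", "сервис + ssd128"],
    "manager": [np.nan, "manager_3", np.nan],
    "purchase_price_uah": [np.nan, 8000.0, np.nan],
    "sale_price_uah": [550.0, 10500.0, 800.0],
    "margin_uah": [550, 2500, 800],
})

# Assumed heuristic: no purchase price and margin equal to the sale price
# marks a non-sales operational record.
is_operational = (
    partial["purchase_price_uah"].isna()
    & (partial["margin_uah"] == partial["sale_price_uah"])
)

partial["record_type"] = np.where(is_operational, "operational", "incomplete_sale")
print(partial["record_type"].value_counts())
```

A rule like this would need checking against the full set of 276 rows before use, since some rows flagged as operational may still describe a laptop.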
