## Feature Engineering Plan

The goal of this notebook is to transform unstructured product descriptions into structured, analysis-ready features suitable for analytical and modeling tasks.

The feature engineering process follows an iterative, data-driven approach:
1. Explore raw product descriptions to identify recurring patterns and sources of noise.
2. Define and refine product categories based on textual and hardware-related signals.
3. Extract core attributes such as brand, CPU vendor, RAM size, storage type, and GPU category.
4. Produce a feature-enriched dataset that supports downstream analysis, segmentation, and modeling.


In [176]:
import pandas as pd
import numpy as np
import re


In [177]:
df = pd.read_csv("../data/processed/sales_clean.csv")
df.shape

(2397, 16)

### Step 1 — Explore raw product descriptions

This step focuses on understanding the structure, language, and variability
of the product description field before applying any parsing logic.

In [178]:
df["product_description"].sample(20, random_state=42)


443                       acer i5 10210u 8 ssd 512 mx350
196                           asus 14 pen 5405 4 128 int
1508                        dell i7 8650u 8 ssd256 mx130
1292                        hp ryz 3 4300u 8 ssd480 euro
1263                        asus i3 8130u 8 ssd256 mx110
321        msi bravo ryzen 5 5600h 16 ssd512 rx5500 burd
2380                            hp ryz 5 5500u 16 ssd512
1666                         ps5 slim edition new tehn k
1385                 lenovo v15 i3 1215u 8 ssd256 1шт 9к
1418               microsoft surface go i5 1035 4 ssd256
630     hp i5 7300hq 16 ssd512 gtx 1050 tehn отказ 03.10
1608                             acer a9 9420 8 500 tehn
1165                lenovo ryz 5 5600 16 ssd512 gtx 1650
736                            acer e1 1200 8 ssd240 int
613                                      ps 5 2 joy tehn
433                    lenovo ryzen 5 4500u 8 ssd256 int
2034                               dell i5 8250 8 ssd256
471                    lenovo a

In [179]:
df["product_description"].value_counts().head(10)

product_description
service                                       12
lenovo 14 i3 1115g4 8 ssd128                   7
acer ryz 5 7520u 16 ssd512                     5
сист. i7 4790 16 240                           5
hp 830 g5 14 i5 8 8 ssd128                     4
ps 5 blurey                                    4
acer i3 1115g4 8 ssd256                        3
lenovo i5 1035g1 8 ssd256                      3
lenovo 14 i3 1115g4 8 ssd256                   3
lenovo ryz 5 5600h 16 ssd256 1tb rtx3050ti     3
Name: count, dtype: int64

### Step 2 — Brand-based product classification (baseline)

Brands are extracted from product descriptions and used as the primary signal
to identify laptops 

In [180]:
# List of known brands observed in the data
brands = ["lenovo", "dell", "hp", "asus", "acer", "apple", "macbook", "msi", "samsung", "toshiba", \
    "lg", "huawei", "xiaomi", "microsoft", "gateway", "redmi", "realme"]

# Function to extract brand from product description
def extract_brand(text):
    if not isinstance(text, str):
        return "unknown"
    for brand in brands:
        if text.startswith(brand):
            return brand
    return "unknown"

df["brand"] = df["product_description"].apply(extract_brand)


In [181]:
df["brand"].value_counts()


brand
lenovo       593
unknown      404
hp           352
asus         342
acer         270
dell         235
xiaomi        46
macbook       34
apple         30
huawei        30
msi           26
samsung       11
microsoft      9
redmi          8
toshiba        3
gateway        2
lg             1
realme         1
Name: count, dtype: int64

In [182]:
df[["product_description", "brand"]].sample(20, random_state=42)

Unnamed: 0,product_description,brand
443,acer i5 10210u 8 ssd 512 mx350,acer
196,asus 14 pen 5405 4 128 int,asus
1508,dell i7 8650u 8 ssd256 mx130,dell
1292,hp ryz 3 4300u 8 ssd480 euro,hp
1263,asus i3 8130u 8 ssd256 mx110,asus
321,msi bravo ryzen 5 5600h 16 ssd512 rx5500 burd,msi
2380,hp ryz 5 5500u 16 ssd512,hp
1666,ps5 slim edition new tehn k,unknown
1385,lenovo v15 i3 1215u 8 ssd256 1шт 9к,lenovo
1418,microsoft surface go i5 1035 4 ssd256,microsoft


In [183]:
# assign category "laptop" based on brand names
df["product_category"] = np.where(
    df["brand"] != "unknown",
    "laptop",
    "other"
)

df["product_category"].value_counts()


product_category
laptop    1993
other      404
Name: count, dtype: int64

In [184]:
df[df["brand"] == "unknown"]["product_description"].sample(20, random_state=42)

546                                   ps 4 slim 1tb tehn
1498                                       ps 4 pro euro
2324                      монітор samsung s27r350fhi 1шт
227                                        часы carnival
305                         geo flex 14 i3 1005 4 ssd120
581       моноблок hp 23-q055no i5 4460 8 1tb rad r7 4gb
944                                       ps4 slim 2 joy
2061                     комп i5 10400 16 ssd512 1шт int
2243                оперативка ноутбук ddr4 85шт 408 грн
1696    джойстики playstation 5 dualsense rozetka 8 1900
2242                                              rx6600
582              моноблок 22 hp ryz 5 4500u 8 ssd256 int
1783                             видеокарта nv rtx3070ti
2330                                         nikon d3300
1141            honor i5 10210 8 ssd512 tehn отказ 04.04
1675                               монитор xiaomi p27fba
1260                     surface go 12 i5 1035g1 4 ssd64
2145                           

In [185]:
# Create separate DataFrames for other products
df_other = df[df["brand"] == "unknown"].copy()


In [186]:
# Function to classify other products
def classify_other(text):
    if not isinstance(text, str):
        return "unknown"
    if "ps" in text or "sony" in text or "xbox" in text:
        return "console"
    if "monitor" in text or "монитор" in text or "монітор" in text:
        return "monitor"
    if "canon" in text or "nikon" in text:
        return "camera"
    if "пк" in text or "комп'ютер" in text or "компьютер" in text or "сист" in text:
        return "desktop"
    if "моноблок" in text:
        return "all-in-one"
    if "принтер" in text or "printer" in text:
        return "printer"
    return "other"

df.loc[df["brand"] == "unknown", "product_category"] = (
    df_other["product_description"].apply(classify_other)
)


In [187]:
df["product_category"].value_counts()

product_category
laptop        1993
other          188
console        121
desktop         33
monitor         33
all-in-one      20
camera           7
unknown          2
Name: count, dtype: int64

### Step 3: Hardware Feature Extraction (CPU, RAM, Storage)

### CPU


In [188]:
# CPU patterns to search for in product descriptions
cpu_patterns = [
        r'i[3579][ -]?\d{4,5}',        
        r'ryz\s?[3579]\s?\d{3,4}',  
        r'ryzen\s?[3579]\s?\d{3,4}',
        r'\ba(4|6|8|9|10|12)\b',
        r'\ba(4|6|8|9|10|12)[ -]?\d{3,4}'
        r'\be[12]\b',
        r'\be[12][ -]?\d{3,4}'
        r'celeron',
        r'\bcel\b',
        r'pentium',
        r'\bpen\b',
        r'athlon',
        r'\bath\b',
        r'\bm[123]\b',                 
    ]
# Function to extract CPU from product description

def extract_cpu(text):
    if not isinstance(text, str):
        return "unknown"
    for pattern in cpu_patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(0).replace(" ", "").replace("-", "")
    return None

In [189]:
df["cpu"] = df["product_description"].apply(extract_cpu)


In [190]:
df["cpu"].value_counts().head(10)


cpu
pen        123
cel         88
i31115      77
i59300      62
i510300     62
i58250      57
i51035      51
i36006      48
i57300      46
i58300      43
Name: count, dtype: int64

In [191]:
df["cpu"] = df["cpu"].fillna("unknown")

In [192]:
# Function to classify CPU vendor

def classify_cpu_vendor(cpu):
    if not isinstance(cpu, str):
        return None

    cpu = cpu.lower()

    # Intel
    if (
        cpu.startswith("i")
        or cpu in {"cel", "celeron", "pentium", "pen", "pent"}
    ):
        return "intel"

    # AMD
    if (
        "ryzen" in cpu
        or "ryz" in cpu
        or "athlon" in cpu
        or cpu.startswith("a")   
        or cpu.startswith("e")  
    ):
        return "amd"

    # Apple
    if cpu.startswith("m"):
        return "apple"

    return "other"


In [193]:
df["cpu_vendor"] = df["cpu"].apply(classify_cpu_vendor)


In [194]:
df["cpu_vendor"].value_counts(dropna=False)


cpu_vendor
intel    1369
other     550
amd       466
apple      12
Name: count, dtype: int64

In [195]:
df["cpu_vendor"] = df["cpu_vendor"].fillna("unknown")


CPU vendor was derived from parsed CPU strings and grouped into high-level categories 
(Intel, AMD, Apple, Unknown).  
This abstraction enables vendor-level analysis while remaining robust to noisy 
and abbreviated CPU names commonly found in real-world listings.


### RAM / SSD 

In [196]:
def extract_ram(text):
    if not isinstance(text, str):
        return None

    valid_ram = {2, 4, 8, 12, 16, 20, 32, 40, 64}

  
    match = re.search(r'\b(\d{1,2})\s?gb\b', text)
    if match:
        ram = int(match.group(1))
        if ram in valid_ram:
            return ram

    match = re.search(r'\b(\d{1,2})\s?(ram|ddr)\b', text)
    if match:
        ram = int(match.group(1))
        if ram in valid_ram:
            return ram

 
    if (
        re.search(r'i[3579]|ryzen|ryz|athlon|\ba(4|6|8|9|10|12)\b|\be[12]\b', text)
        or "ssd" in text
        or "hdd" in text
    ):
        match = re.search(r'\b(2|4|8|12|16|20|32|40|64)\b', text)
        if match:
            return int(match.group(1))

    return None


In [197]:
df["ram_gb"] = df["product_description"].apply(extract_ram)


In [198]:
df["ram_gb"].value_counts(dropna=False)


ram_gb
8.0     965
NaN     505
16.0    489
4.0     283
12.0     81
32.0     34
2.0      25
20.0     10
64.0      4
40.0      1
Name: count, dtype: int64

In [199]:
df[df["ram_gb"].notna()][["product_description", "ram_gb"]].sample(15)


Unnamed: 0,product_description,ram_gb
1779,asus cel 3350 4 ssd128,4.0
760,пк ryz 5 5600х 16 512,16.0
2176,acer 14 swift ryz 3 5300u 8 ssd256,8.0
837,asus pen 4200 4 ssd256 gef tehn,4.0
1384,lenovo 14 thinkbook i5 10210u 8 ssd512,8.0
949,lenovo 13 i3 5005u 8 ssd128 int 1шт 3490,8.0
1990,acer ryz 5 7520u 16 ssd512,16.0
1759,dell xps i5 8300h 32 ssd1tb gtx1050 4gb,4.0
2199,asus vivo 13 oled n6000 4 ssd128,4.0
1861,dell 13 i3 6006u 4 ssd128 5 шт 3600,4.0


In [200]:

def extract_storage(text):
    if not isinstance(text, str):
        return None


    has_ssd = "ssd" in text
    has_hdd = "hdd" in text

    if has_ssd and has_hdd:
        return "ssd+hdd"
    if has_ssd:
        return "ssd"
    if has_hdd:
        return "hdd"

    return "hdd"

In [201]:
df["storage_type"] = df["product_description"].apply(extract_storage)


In [202]:
df["storage_type"].value_counts(dropna=False)


storage_type
ssd        1676
hdd         711
ssd+hdd       8
None          2
Name: count, dtype: int64

In [203]:
df["storage_type"] = df["storage_type"].fillna("unknown")

Before extracting device-specific attributes, products labeled as `other` were re-evaluated using hardware signals (CPU and memory) to identify additional laptop entries and reduce category ambiguity.


In [204]:
# Extract more laptops based on CPU vendor and RAM size
laptop_mask = (
    (df["product_category"] == "other") &
    (
        df["cpu_vendor"].isin(["intel", "amd", "apple"])
        | (df["ram_gb"] >= 0)
    )
)

df.loc[laptop_mask, "product_category"] = "laptop"



In [205]:
df["product_category"].value_counts()

product_category
laptop        2058
other          123
console        121
desktop         33
monitor         33
all-in-one      20
camera           7
unknown          2
Name: count, dtype: int64

### GPU 


In [206]:
def extract_gpu(text):
    if not isinstance(text, str):
        return None


    # NVIDIA
    if "rtx" in text:
        return "nvidia_rtx"
    if "gtx" in text:
        return "nvidia_gtx"
    if "mx" in text:
        return "nvidia_mx"
    if "geforce" in text:
        return "nvidia"

    # AMD
    if "rx" in text or "radeon" in text:
        return "amd_radeon"
    if " r5 " in f" {text} " or " r7 " in f" {text} ":
        return "amd_radeon"

    return "integrated"


In [207]:
# Extract GPU type only for laptops
df.loc[df["product_category"] == "laptop", "gpu_type"] = (
    df.loc[df["product_category"] == "laptop", "product_description"]
      .apply(extract_gpu)
)



In [208]:
df["gpu_type"].value_counts()


gpu_type
integrated    1337
nvidia_gtx     324
nvidia_mx      220
nvidia_rtx     126
amd_radeon      50
nvidia           1
Name: count, dtype: int64

In [209]:
df[df["gpu_type"] != "integrated"][["product_description", "gpu_type"]].sample(10)


Unnamed: 0,product_description,gpu_type
174,тел redmi m200 тарас,
1923,ps4 pro,
2340,ps4 slim 500 + 1 new joy,
2288,lenovo 17 i5 8300h 20 ssd256 gtx1050 4gb,nvidia_gtx
1352,asus tuf ryz 5 3550h 16 1tb ssd500 rx560,amd_radeon
2231,hp omen ryz 5 4600h 16 ssd1tb ssd512 gtx1650ti,nvidia_gtx
1381,asus n5000 4 256 mx110,nvidia_mx
1695,джойстики playstation 5 dualsense rozetka 9 1900,
1688,nintendo switch,
824,телевизор lg 42ln541v,


In [210]:
# Apply function to determine if laptop is gaming 

def is_gaming(gpu_type):
    if not isinstance(gpu_type, str):
        return 0

    if gpu_type in {"nvidia_gtx", "nvidia_rtx", "amd_radeon"}:
        return 1

    return 0


In [211]:
# Create is_gaming_laptop feature

df["is_gaming_laptop"] = 0
mask_laptop = df["product_category"] == "laptop"

df.loc[mask_laptop, "is_gaming_laptop"] = (
    df.loc[mask_laptop, "gpu_type"]
      .apply(lambda x: 1 if x in {"nvidia_gtx", "nvidia_rtx", "amd_radeon"} else 0)
)


In [212]:
df[mask_laptop]["is_gaming_laptop"].value_counts()


is_gaming_laptop
0    1558
1     500
Name: count, dtype: int64

Hardware features including CPU vendor, RAM size, storage type, and GPU category were extracted from raw product descriptions using lightweight, rule-based parsing.  
These features transformed unstructured text into structured, analytically meaningful attributes and enabled robust downstream analysis, including refined product categorization and gaming laptop identification.


### Step 4 (Dates)

To capture temporal dynamics of sales, date-based features were derived from purchase and sale timestamps, enabling analysis of sales velocity and seasonal patterns.


In [213]:
df["purchase_date"] = pd.to_datetime(df["purchase_date"], errors="coerce")
df["sale_date"] = pd.to_datetime(df["sale_date"], errors="coerce")


In [214]:
# Calculate days on market
df["days_on_market"] = (
    df["sale_date"] - df["purchase_date"]
).dt.days


In [215]:
# Extract sale month
df["sale_month"] = df["sale_date"].dt.month.astype("Int64")


In [216]:
df["days_on_market"].describe()


count    2270.000000
mean       72.585022
std       119.975400
min         0.000000
25%        12.000000
50%        28.000000
75%        75.750000
max      1150.000000
Name: days_on_market, dtype: float64

In [217]:
df["sale_month"].value_counts().sort_index()


sale_month
1     205
2     187
3     206
4     173
5     130
6     150
7     180
8     261
9     220
10    192
11    201
12    168
Name: count, dtype: Int64

The time between purchase and sale was calculated as a measure of product liquidity, while the month of sale was extracted to support seasonality analysis.  
These temporal features complement hardware attributes and provide additional business context for downstream analysis.


In [218]:
# Final check 
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2397 entries, 0 to 2396
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   sku                    2274 non-null   float64       
 1   sale_date              2273 non-null   datetime64[ns]
 2   purchase_date          2273 non-null   datetime64[ns]
 3   product_description    2395 non-null   object        
 4   manager                2397 non-null   object        
 5   purchase_price_uah     2397 non-null   float64       
 6   sale_price_uah         2393 non-null   float64       
 7   margin_uah             2397 non-null   int64         
 8   missing_sale_date      2397 non-null   bool          
 9   missing_purchase_date  2397 non-null   bool          
 10  missing_sku            2397 non-null   bool          
 11  sale_year              2273 non-null   float64       
 12  usd_rate               2273 non-null   float64       
 13  pur

In [222]:
df.head()




Unnamed: 0,sku,sale_date,purchase_date,product_description,manager,purchase_price_uah,sale_price_uah,margin_uah,missing_sale_date,missing_purchase_date,...,brand,product_category,cpu,cpu_vendor,ram_gb,storage_type,gpu_type,is_gaming_laptop,days_on_market,sale_month
0,5076,2023-01-05,2022-01-21,систем hp omen i5 7300hq 16 500ssd gtx 1080 8,manager_1,18550.0,23000.0,4450,False,False,...,unknown,desktop,i57300,intel,16.0,ssd,,0,349.0,1
1,6070,2023-01-05,2022-12-21,hp i7 8750 16 1256 1070,manager_2,28150.0,33500.0,5350,False,False,...,hp,laptop,i78750,intel,16.0,hdd,integrated,0,15.0,1
2,6086,2023-01-05,2022-12-24,dell 13 i3 5005u 8 ssd128,manager_2,4460.0,6500.0,2040,False,False,...,dell,laptop,i35005,intel,8.0,ssd,integrated,0,12.0,1
3,5725,2023-01-05,2022-09-09,монітор samsung s24r350f,manager_1,4000.0,6000.0,2000,False,False,...,unknown,monitor,unknown,other,,hdd,,0,118.0,1
4,6085,2023-01-05,2022-12-24,dell 13 i3 5005u 8 ssd128,manager_2,4460.0,6500.0,2040,False,False,...,dell,laptop,i35005,intel,8.0,ssd,integrated,0,12.0,1


In [224]:
# Save the feature engineered dataset
df.to_csv("../data/processed/feature_engineered_products.csv", index=False)


## SUMMARY

  
Unstructured textual descriptions were systematically converted into structured features by combining rule-based parsing, hardware signal extraction, and iterative category refinement.  
The resulting dataset captures both technical attributes (CPU, RAM, storage, GPU) and temporal dynamics of sales, providing a robust foundation for exploratory analysis and downstream modeling.
