## Feature Engineering Plan

The goal of this notebook is to transform unstructured product descriptions into
structured, analysis-ready features.

The process follows an iterative approach:
1. Explore raw product descriptions to understand common patterns.
2. Define product categories based on textual signals.
3. Extract core attributes such as brand, CPU, RAM, and storage.
4. Produce a feature-enriched dataset for downstream analysis or modeling.

In [1]:
import pandas as pd
import numpy as np
import re


In [2]:
df = pd.read_csv("../data/processed/sales_clean.csv")
df.shape

(2397, 16)

### Step 1 — Explore raw product descriptions

This step focuses on understanding the structure, language, and variability
of the product description field before applying any parsing logic.

In [3]:
df["product_description"].sample(20, random_state=42)


443                       acer i5 10210u 8 ssd 512 mx350
196                           asus 14 pen 5405 4 128 int
1508                        dell i7 8650u 8 ssd256 mx130
1292                        hp ryz 3 4300u 8 ssd480 euro
1263                        asus i3 8130u 8 ssd256 mx110
321        msi bravo ryzen 5 5600h 16 ssd512 rx5500 burd
2380                            hp ryz 5 5500u 16 ssd512
1666                         ps5 slim edition new tehn k
1385                 lenovo v15 i3 1215u 8 ssd256 1шт 9к
1418               microsoft surface go i5 1035 4 ssd256
630     hp i5 7300hq 16 ssd512 gtx 1050 tehn отказ 03.10
1608                             acer a9 9420 8 500 tehn
1165                lenovo ryz 5 5600 16 ssd512 gtx 1650
736                            acer e1 1200 8 ssd240 int
613                                      ps 5 2 joy tehn
433                    lenovo ryzen 5 4500u 8 ssd256 int
2034                               dell i5 8250 8 ssd256
471                    lenovo a

In [4]:
df["product_description"].value_counts().head(10)

product_description
service                                       12
lenovo 14 i3 1115g4 8 ssd128                   7
acer ryz 5 7520u 16 ssd512                     5
сист. i7 4790 16 240                           5
hp 830 g5 14 i5 8 8 ssd128                     4
ps 5 blurey                                    4
acer i3 1115g4 8 ssd256                        3
lenovo i5 1035g1 8 ssd256                      3
lenovo 14 i3 1115g4 8 ssd256                   3
lenovo ryz 5 5600h 16 ssd256 1tb rtx3050ti     3
Name: count, dtype: int64

### Step 2 — Brand-based product classification (baseline)

Brands are extracted from product descriptions and used as the primary signal
to identify laptops 

In [5]:
# List of known brands observed in the data
brands = ["lenovo", "dell", "hp", "asus", "acer", "apple", "macbook", "msi", "samsung", "toshiba", \
    "lg", "huawei", "xiaomi", "microsoft", "gateway", "redmi", "realme"]

# Function to extract brand from product description
def extract_brand(text):
    if not isinstance(text, str):
        return "unknown"
    for brand in brands:
        if text.startswith(brand):
            return brand
    return "unknown"

df["brand"] = df["product_description"].apply(extract_brand)


In [6]:
df["brand"].value_counts()


brand
lenovo       593
unknown      404
hp           352
asus         342
acer         270
dell         235
xiaomi        46
macbook       34
apple         30
huawei        30
msi           26
samsung       11
microsoft      9
redmi          8
toshiba        3
gateway        2
lg             1
realme         1
Name: count, dtype: int64

In [7]:
df[["product_description", "brand"]].sample(20, random_state=42)

Unnamed: 0,product_description,brand
443,acer i5 10210u 8 ssd 512 mx350,acer
196,asus 14 pen 5405 4 128 int,asus
1508,dell i7 8650u 8 ssd256 mx130,dell
1292,hp ryz 3 4300u 8 ssd480 euro,hp
1263,asus i3 8130u 8 ssd256 mx110,asus
321,msi bravo ryzen 5 5600h 16 ssd512 rx5500 burd,msi
2380,hp ryz 5 5500u 16 ssd512,hp
1666,ps5 slim edition new tehn k,unknown
1385,lenovo v15 i3 1215u 8 ssd256 1шт 9к,lenovo
1418,microsoft surface go i5 1035 4 ssd256,microsoft


In [8]:
# assign category "laptop" based on brand names
df["product_category"] = np.where(
    df["brand"] != "unknown",
    "laptop",
    "other"
)

df["product_category"].value_counts()


product_category
laptop    1993
other      404
Name: count, dtype: int64

In [9]:
df[df["brand"] == "unknown"]["product_description"].sample(20, random_state=42)

546                                   ps 4 slim 1tb tehn
1498                                       ps 4 pro euro
2324                      монітор samsung s27r350fhi 1шт
227                                        часы carnival
305                         geo flex 14 i3 1005 4 ssd120
581       моноблок hp 23-q055no i5 4460 8 1tb rad r7 4gb
944                                       ps4 slim 2 joy
2061                     комп i5 10400 16 ssd512 1шт int
2243                оперативка ноутбук ddr4 85шт 408 грн
1696    джойстики playstation 5 dualsense rozetka 8 1900
2242                                              rx6600
582              моноблок 22 hp ryz 5 4500u 8 ssd256 int
1783                             видеокарта nv rtx3070ti
2330                                         nikon d3300
1141            honor i5 10210 8 ssd512 tehn отказ 04.04
1675                               монитор xiaomi p27fba
1260                     surface go 12 i5 1035g1 4 ssd64
2145                           

In [10]:
# Create separate DataFrames for other products
df_other = df[df["brand"] == "unknown"].copy()


In [11]:
# Function to classify other products
def classify_other(text):
    if not isinstance(text, str):
        return "unknown"
    if "ps" in text or "sony" in text or "xbox" in text:
        return "console"
    if "monitor" in text or "монитор" in text or "монітор" in text:
        return "monitor"
    if "canon" in text or "nikon" in text:
        return "camera"
    if "пк" in text or "комп'ютер" in text or "компьютер" in text or "сист" in text:
        return "desktop"
    if "моноблок" in text:
        return "all-in-one"
    if "принтер" in text or "printer" in text:
        return "printer"
    return "other"

df.loc[df["brand"] == "unknown", "product_category"] = (
    df_other["product_description"].apply(classify_other)
)


In [12]:
df["product_category"].value_counts()

product_category
laptop        1993
other          188
console        121
desktop         33
monitor         33
all-in-one      20
camera           7
unknown          2
Name: count, dtype: int64

### Step 3: Hardware Feature Extraction (CPU, RAM, Storage)

### CPU


In [13]:
# CPU patterns to search for in product descriptions
cpu_patterns = [
        r'i[3579][ -]?\d{4,5}',        
        r'ryz\s?[3579]\s?\d{3,4}',  
        r'ryzen\s?[3579]\s?\d{3,4}',
        r'\ba(4|6|8|9|10|12)\b',
        r'\ba(4|6|8|9|10|12)[ -]?\d{3,4}'
        r'\be[12]\b',
        r'\be[12][ -]?\d{3,4}'
        r'celeron',
        r'\bcel\b',
        r'pentium',
        r'\bpen\b',
        r'athlon',
        r'\bath\b',
        r'\bm[123]\b',                 
    ]
# Function to extract CPU from product description

def extract_cpu(text):
    if not isinstance(text, str):
        return "unknown"
    for pattern in cpu_patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(0).replace(" ", "").replace("-", "")
    return None

In [14]:
df["cpu"] = df["product_description"].apply(extract_cpu)


In [15]:
df["cpu"].value_counts().head(10)


cpu
pen        123
cel         88
i31115      77
i59300      62
i510300     62
i58250      57
i51035      51
i36006      48
i57300      46
i58300      43
Name: count, dtype: int64

In [16]:
df["cpu"] = df["cpu"].fillna("unknown")

In [17]:
# Function to classify CPU vendor

def classify_cpu_vendor(cpu):
    if not isinstance(cpu, str):
        return None

    cpu = cpu.lower()

    # Intel
    if (
        cpu.startswith("i")
        or cpu in {"cel", "celeron", "pentium", "pen", "pent"}
    ):
        return "intel"

    # AMD
    if (
        "ryzen" in cpu
        or "ryz" in cpu
        or "athlon" in cpu
        or cpu.startswith("a")   
        or cpu.startswith("e")  
    ):
        return "amd"

    # Apple
    if cpu.startswith("m"):
        return "apple"

    return "other"


In [18]:
df["cpu_vendor"] = df["cpu"].apply(classify_cpu_vendor)


In [19]:
df["cpu_vendor"].value_counts(dropna=False)


cpu_vendor
intel    1369
other     550
amd       466
apple      12
Name: count, dtype: int64

In [20]:
df["cpu_vendor"] = df["cpu_vendor"].fillna("unknown")


CPU vendor was derived from parsed CPU strings and grouped into high-level categories 
(Intel, AMD, Apple, Unknown).  
This abstraction enables vendor-level analysis while remaining robust to noisy 
and abbreviated CPU names commonly found in real-world listings.


### RAM / SSD 

In [21]:
def extract_ram(text):
    if not isinstance(text, str):
        return None

    valid_ram = {2, 4, 8, 12, 16, 20, 32, 40, 64}

  
    match = re.search(r'\b(\d{1,2})\s?gb\b', text)
    if match:
        ram = int(match.group(1))
        if ram in valid_ram:
            return ram

    match = re.search(r'\b(\d{1,2})\s?(ram|ddr)\b', text)
    if match:
        ram = int(match.group(1))
        if ram in valid_ram:
            return ram

 
    if (
        re.search(r'i[3579]|ryzen|ryz|athlon|\ba(4|6|8|9|10|12)\b|\be[12]\b', text)
        or "ssd" in text
        or "hdd" in text
    ):
        match = re.search(r'\b(2|4|8|12|16|20|32|40|64)\b', text)
        if match:
            return int(match.group(1))

    return None


In [22]:
df["ram_gb"] = df["product_description"].apply(extract_ram)


In [23]:
df["ram_gb"].value_counts(dropna=False)


ram_gb
8.0     965
NaN     505
16.0    489
4.0     283
12.0     81
32.0     34
2.0      25
20.0     10
64.0      4
40.0      1
Name: count, dtype: int64

In [24]:
df[df["ram_gb"].notna()][["product_description", "ram_gb"]].sample(15)


Unnamed: 0,product_description,ram_gb
1075,lenovo leg y520 i5 7300 16 ssd256 1tb gtx 1060...,16.0
918,mac mini 2012 i7 8 1000 256,8.0
240,acer i5 8265u 8 ssd256 mx130 tehn,8.0
2152,hp ryz 7 3700u 8 ssd256,8.0
359,asus zen i7 11 16 512 mx450 tehn карта игорь,16.0
2356,asus oled ryz 5 7520u 16 ssd512,16.0
303,lenovo i3 7020 4 ssd120 int tehn,4.0
1053,lenovo n5000 8 ssd256 tehn,8.0
164,asus i5 7200 8 ssd240 920mx,8.0
226,lenovo i3 1115g4 8 120 burd,8.0


In [25]:

def extract_storage(text):
    if not isinstance(text, str):
        return None


    has_ssd = "ssd" in text
    has_hdd = "hdd" in text

    if has_ssd and has_hdd:
        return "ssd+hdd"
    if has_ssd:
        return "ssd"
    if has_hdd:
        return "hdd"

    return "hdd"

In [26]:
df["storage_type"] = df["product_description"].apply(extract_storage)


In [27]:
df["storage_type"].value_counts(dropna=False)


storage_type
ssd        1676
hdd         711
ssd+hdd       8
None          2
Name: count, dtype: int64

In [28]:
df["storage_type"] = df["storage_type"].fillna("unknown")