# Task 3: Predicting a product's category from its title

This notebook walks through training a classification model that predicts
a product's category from its title, using the `IMLP4_TASK_03-products.csv` dataset.


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import pickle


In [2]:
# Read the CSV file (adjust the path if needed)
data = pd.read_csv("IMLP4_TASK_03-products.csv")

# Show the first rows
data.head()


Unnamed: 0,product ID,Product Title,Merchant ID,Category Label,_Product Code,Number_of_Views,Merchant Rating,Listing Date
0,1,apple iphone 8 plus 64gb silver,1,Mobile Phones,QA-2276-XC,860.0,2.5,5/10/2024
1,2,apple iphone 8 plus 64 gb spacegrau,2,Mobile Phones,KA-2501-QO,3772.0,4.8,12/31/2024
2,3,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,3,Mobile Phones,FP-8086-IE,3092.0,3.9,11/10/2024
3,4,apple iphone 8 plus 64gb space grey,4,Mobile Phones,YI-0086-US,466.0,3.4,5/2/2022
4,5,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,5,Mobile Phones,NZ-3586-WP,4426.0,1.6,4/12/2023


In [4]:
# Strip whitespace from the column names
data.columns = data.columns.str.strip()

# Check that the names are correct
print(data.columns)

# Column lookups by name now work without error
print(data.info())
print(data["Category Label"].value_counts().head(10))


Index(['product ID', 'Product Title', 'Merchant ID', 'Category Label',
       '_Product Code', 'Number_of_Views', 'Merchant Rating', 'Listing Date'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   product ID       35311 non-null  int64  
 1   Product Title    35139 non-null  object 
 2   Merchant ID      35311 non-null  int64  
 3   Category Label   35267 non-null  object 
 4   _Product Code    35216 non-null  object 
 5   Number_of_Views  35297 non-null  float64
 6   Merchant Rating  35141 non-null  float64
 7   Listing Date     35252 non-null  object 
dtypes: float64(2), int64(2), object(4)
memory usage: 2.2+ MB
None
Category Label
Fridge Freezers     5495
Washing Machines    4036
Mobile Phones       4020
CPUs                3771
TVs                 3564
Fridges             3457
Dishwashers         3418
Dig

In [5]:
# Keep only the relevant columns and drop rows with a missing title or label
df = data[["Product Title", "Category Label"]].dropna()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df["Product Title"], df["Category Label"], test_size=0.2, random_state=42
)

# TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
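
To make the vectorization step more concrete, here is a minimal sketch that applies the same `TfidfVectorizer` settings to two titles from the data preview. The names `sample_titles`, `toy_vectorizer`, and `toy_matrix` are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two titles copied from the data preview; illustration only.
sample_titles = [
    "apple iphone 8 plus 64gb silver",
    "apple iphone 8 plus 64gb space grey",
]

# Same settings as the notebook's vectorizer.
toy_vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
toy_matrix = toy_vectorizer.fit_transform(sample_titles)

# One row per title, one column per distinct token kept by the vectorizer.
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))
```

With the default token pattern, only tokens of two or more characters are kept, so a standalone `8` does not become a feature while `64gb` does.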


In [6]:
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# Save the model and the vectorizer; `with` closes each file after writing
with open("product_model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
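
Because the model is only usable together with the vectorizer it was trained with, one possible alternative is to bundle both in a scikit-learn `Pipeline` and serialize a single object. The sketch below is not what the notebook does; it assumes a tiny hand-written training set (`toy_titles`, `toy_labels`) so that it runs on its own.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny hand-written training set, for illustration only.
toy_titles = [
    "apple iphone 8 plus 64gb silver",
    "apple iphone 8 plus 64gb space grey",
    "samsung galaxy s8 64gb black",
    "samsung 49 inch 4k smart tv",
    "lg 55 inch oled tv",
    "sony bravia 43 inch led tv",
]
toy_labels = ["Mobile Phones"] * 3 + ["TVs"] * 3

# The pipeline vectorizes raw titles and then classifies them.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", lowercase=True)),
    ("nb", MultinomialNB()),
])
pipeline.fit(toy_titles, toy_labels)

# Round-trip through pickle in memory; writing to a file works the same way.
restored = pickle.loads(pickle.dumps(pipeline))
print(restored.predict(["iphone 7 32gb gold"])[0])
```

A restored pipeline accepts raw title strings directly, so the caller does not need to remember to load and apply a separate vectorizer.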


In [7]:
y_pred = model.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification report:")
print(classification_report(y_test, y_pred))


Accuracy: 0.9303418803418804

Classification report:
                  precision    recall  f1-score   support

             CPU       0.00      0.00      0.00        14
            CPUs       0.98      1.00      0.99       726
 Digital Cameras       0.99      1.00      0.99       535
     Dishwashers       0.99      0.96      0.97       684
        Freezers       1.00      0.52      0.68       422
 Fridge Freezers       0.76      0.98      0.86      1087
         Fridges       0.89      0.82      0.85       702
      Microwaves       0.99      0.97      0.98       464
    Mobile Phone       0.00      0.00      0.00        17
   Mobile Phones       0.98      0.99      0.98       795
             TVs       0.98      0.99      0.99       724
Washing Machines       0.98      0.98      0.98       821
          fridge       0.00      0.00      0.00        29

        accuracy                           0.93      7020
       macro avg       0.73      0.71      0.71      7020
    weighted avg 

  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
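
The classification report lists `CPU` next to `CPUs`, `Mobile Phone` next to `Mobile Phones`, and `fridge` next to `Fridges`, and the three rare variants all score 0.00. If these are spelling variants of the same categories (an assumption that should be checked against the data), merging them before the train/test split might remove those rows from the report. A minimal sketch, with the hypothetical names `label_map` and `normalize_labels`:

```python
import pandas as pd

# Assumed mapping from rare label variants to their canonical category.
label_map = {
    "CPU": "CPUs",
    "Mobile Phone": "Mobile Phones",
    "fridge": "Fridges",
}


def normalize_labels(labels: pd.Series) -> pd.Series:
    """Strip surrounding whitespace and merge known label variants."""
    cleaned = labels.str.strip()
    # Series.replace with a dict matches whole values, not substrings.
    return cleaned.replace(label_map)


example = pd.Series(["CPU", " CPUs ", "fridge", "TVs"])
print(normalize_labels(example).tolist())
```

In the notebook this could be applied as `df["Category Label"] = normalize_labels(df["Category Label"])` before calling `train_test_split`.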


In [8]:
# Manual test
titlu = "iphone 7 32gb gold"
titlu_tfidf = vectorizer.transform([titlu])
categorie_pred = model.predict(titlu_tfidf)
print("Model prediction:", categorie_pred[0])


Model prediction: Mobile Phones
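
When the model is used to suggest categories, it can help to show several candidates with their estimated probabilities instead of a single label. The sketch below uses `predict_proba`, which `MultinomialNB` provides; the helper `top_categories` and the toy training data are hypothetical and only serve to keep the example self-contained.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def top_categories(title, vectorizer, model, k=3):
    """Return up to k (category, probability) pairs, most likely first."""
    probs = model.predict_proba(vectorizer.transform([title]))[0]
    order = np.argsort(probs)[::-1][:k]
    return [(str(model.classes_[i]), float(probs[i])) for i in order]


# Tiny hand-written training set, for illustration only.
toy_titles = [
    "apple iphone 8 plus 64gb silver",
    "apple iphone 8 plus 64gb space grey",
    "samsung 49 inch 4k smart tv",
    "lg 55 inch oled tv",
]
toy_labels = ["Mobile Phones", "Mobile Phones", "TVs", "TVs"]

toy_vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_titles), toy_labels)

for category, prob in top_categories("iphone 7 32gb gold", toy_vectorizer, toy_model):
    print(f"{category}: {prob:.2f}")
```

Naive Bayes probabilities tend to be poorly calibrated, so they are better read as a ranking than as exact confidence values.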


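## Conclusion

The Naive Bayes model trained on product titles reached an accuracy of about 93% on the held-out test set (0.930 over 7,020 titles).
The macro-averaged F1 score is lower (0.71): the rare labels `CPU`, `Mobile Phone`, and `fridge` score 0.00, and `Freezers` has a recall of only 0.52.
The model can be used to automatically suggest a category for a new product based on its title.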
