# Task 3 – Predicting the product category from its title

This notebook walks through training a classification model that predicts
a product's category from its title, using the `IMLP4_TASK_03-products.csv` dataset.


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import pickle


In [5]:
!python -m pip install pandas numpy scikit-learn jupyter ipykernel matplotlib



In [2]:
# Read the CSV file (adjust the path if needed)
data = pd.read_csv("IMLP4_TASK_03-products.csv")

# Show the first rows
data.head()


Unnamed: 0,product ID,Product Title,Merchant ID,Category Label,_Product Code,Number_of_Views,Merchant Rating,Listing Date
0,1,apple iphone 8 plus 64gb silver,1,Mobile Phones,QA-2276-XC,860.0,2.5,5/10/2024
1,2,apple iphone 8 plus 64 gb spacegrau,2,Mobile Phones,KA-2501-QO,3772.0,4.8,12/31/2024
2,3,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,3,Mobile Phones,FP-8086-IE,3092.0,3.9,11/10/2024
3,4,apple iphone 8 plus 64gb space grey,4,Mobile Phones,YI-0086-US,466.0,3.4,5/2/2022
4,5,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,5,Mobile Phones,NZ-3586-WP,4426.0,1.6,4/12/2023


In [None]:
# Strip stray whitespace from the column names; the raw header contains
# names such as " Category Label", which raise a KeyError on lookup
data.columns = data.columns.str.strip()

# Check the cleaned column names
print(data.columns)

# The lookups below rely on the cleaned names
print(data.info())
print(data["Category Label"].value_counts().head(10))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35311 entries, 0 to 35310
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   product ID       35311 non-null  int64  
 1   Product Title    35139 non-null  object 
 2   Merchant ID      35311 non-null  int64  
 3    Category Label  35267 non-null  object 
 4   _Product Code    35216 non-null  object 
 5   Number_of_Views  35297 non-null  float64
 6   Merchant Rating  35141 non-null  float64
 7    Listing Date    35252 non-null  object 
dtypes: float64(2), int64(2), object(4)
memory usage: 2.2+ MB
None


KeyError: 'Category Label'
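
The `KeyError` above is what a lookup by the clean name raises while the header still carries the leading space visible in the `info()` listing. The sketch below reproduces the situation on a tiny in-memory CSV; `raw_csv` and `toy` are made-up names for illustration and are not part of the dataset.

```python
import io

import pandas as pd

# Hypothetical two-line CSV whose third header has a leading space,
# mimicking " Category Label" from the info() output.
raw_csv = (
    "product ID,Product Title, Category Label\n"
    "1,apple iphone 8 plus 64gb silver,Mobile Phones\n"
)
toy = pd.read_csv(io.StringIO(raw_csv))

# Before cleaning, the clean name is not among the columns.
found_before = "Category Label" in toy.columns

# Strip whitespace around every column name.
toy.columns = toy.columns.str.strip()

# After cleaning, the lookup by the clean name succeeds.
found_after = "Category Label" in toy.columns
print(found_before, found_after)
```

An alternative, under the same assumption about the header, is to pass `skipinitialspace=True` to `pd.read_csv`, which drops spaces that follow a delimiter while parsing.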

In [None]:
# Keep only the relevant columns and drop rows with missing values
df = data[["Product Title", "Category Label"]].dropna()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df["Product Title"], df["Category Label"], test_size=0.2, random_state=42
)

# TF-IDF vectorization: fit on the training titles only, then transform both sets
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
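
To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which are scikit-learn's documented defaults. It deliberately simplifies tokenisation to a whitespace split, so its numbers will not match `TfidfVectorizer` exactly (no stop-word removal, no token pattern). The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (sorted vocabulary, list of L2-normalised TF-IDF dicts)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: the number of titles each term appears in.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {t: freq * idf[t] for t, freq in tf.items()}
        # Scale each title's vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return sorted(df), vectors


titles = [
    "apple iphone 8 plus 64gb silver",
    "apple iphone 8 plus 64gb gold",
    "samsung galaxy s9 64gb black",
]
vocab, vectors = tfidf_vectors(titles)

# "silver" occurs in one title and "64gb" in all three,
# so "silver" carries more weight within the first title.
print(vectors[0]["silver"] > vectors[0]["64gb"])
```

The point of the weighting is that terms shared by nearly every title contribute less than terms that distinguish one group of titles from another.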


In [None]:
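# Train a multinomial Naive Bayes classifier on the TF-IDF features
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# Save the model and the vectorizer; `with` ensures the files are closed
with open("product_model.pkl", "wb") as model_file:
    pickle.dump(model, model_file)
with open("vectorizer.pkl", "wb") as vectorizer_file:
    pickle.dump(vectorizer, vectorizer_file)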


In [None]:
y_pred = model.predict(X_test_tfidf)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification report:")
print(classification_report(y_test, y_pred))


In [None]:
# Manual test
titlu = "iphone 7 32gb gold"
titlu_tfidf = vectorizer.transform([titlu])
categorie_pred = model.predict(titlu_tfidf)
print("Model prediction:", categorie_pred[0])


## Conclusion

The Naive Bayes model trained on the product titles reached an accuracy of approximately X%.
It can be used to automatically suggest the category of a new product based on its title.
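
As one possible way to package the model for suggesting categories, the vectorizer and classifier can be wrapped in a single scikit-learn `Pipeline`, so that only one object has to be saved and loaded. This is a sketch on toy data, not the notebook's trained model: the two washing-machine titles and the "Washing Machines" label are invented for illustration, and the pickle round trip is done in memory rather than through the `.pkl` files.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (assumption: the second category is made up).
toy_titles = [
    "apple iphone 8 plus 64gb silver",
    "apple iphone 8 plus 64gb space grey",
    "bosch washing machine 8kg white",
    "lg washing machine 7kg silver",
]
toy_labels = [
    "Mobile Phones",
    "Mobile Phones",
    "Washing Machines",
    "Washing Machines",
]

# Same settings as the notebook, chained into one estimator.
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    MultinomialNB(),
)
pipeline.fit(toy_titles, toy_labels)

# Serialise and restore in memory to mimic saving and loading.
restored = pickle.loads(pickle.dumps(pipeline))

# The restored pipeline accepts raw titles directly.
suggestion = restored.predict(["iphone 7 32gb gold"])[0]
print(suggestion)
```

With a pipeline, the caller passes raw titles and cannot accidentally pair the model with a vectorizer fitted on different data.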
