In [6]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/product-classification-model/keras/default/1/LICENSE
/kaggle/input/product-classification-model/keras/default/1/README.md


Your task is to develop a prototype machine learning model that classifies dummy products into predefined categories. To reduce implementation time, you are encouraged to use TensorFlow. The objective is to demonstrate your ability to quickly prototype a product classification solution and to apply basic machine learning principles.



Please meet the following prerequisites:

- Use Python 3.9 or above.

- Use TensorFlow.

- Use Pandas.

- Use NumPy.



The task requirements:

1. Data Generation

- Create a small dataset of dummy products with attributes such as name, description, price, and category. Aim for a manageable number of products and categories to facilitate quicker processing.

2. Data Preprocessing

- Perform basic data preprocessing steps, such as tokenization of text attributes and encoding of categorical attributes. 

3. Model Development

- Develop a simple text classification model using TensorFlow and Python. Consider using a basic neural network or a pre-trained model for faster development.

- Train the model using the generated dataset and evaluate its performance using basic evaluation metrics.

4. Documentation

- Provide brief documentation outlining the approach taken, including any assumptions made and limitations of the prototype.

Brief documentation of the approach follows, including the assumptions and limitations of the prototype:

Documentation:

Approach:

1. Data Generation:

We created a dummy dataset of 1000 products, each with a name, description, price, and category. The categories are Electronics, Clothing, Books, and Home & Kitchen.
    
2. Data Preprocessing:

i. Text preprocessing: we used TensorFlow's Keras Tokenizer to convert product descriptions into sequences of integers, which were then padded or truncated to a uniform length of 100 tokens.

ii. Category encoding: we used scikit-learn's LabelEncoder to convert category labels into integer values.
        
3. Model Development:

i. We created a simple neural network using TensorFlow's Keras API.

ii. The model architecture consists of an Embedding layer, a GlobalAveragePooling1D layer, and two Dense layers.

iii. We used sparse categorical cross-entropy as the loss function and Adam as the optimizer.

iv. The model was trained for 10 epochs with a batch size of 32 and a validation split of 0.2.
       
4. Evaluation:

We evaluated the model on a held-out test set (20% of the data) and reported the test accuracy.

5. Prediction:

We implemented a function that predicts the category for a new product description.
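
The tokenization and padding step in the approach above can be illustrated with a small pure-Python sketch. This is a simplified stand-in written for illustration, not the Keras implementation: the helper names `build_vocab` and `encode_and_pad` are hypothetical, and it assumes lowercase whitespace splitting, index 0 reserved for padding, and index 1 reserved for out-of-vocabulary words.

```python
from collections import Counter


def build_vocab(texts, oov_token="<OOV>"):
    """Map each word to an integer, most frequent words first.

    Index 0 is left free for padding and index 1 is the OOV token
    (an assumption made to mirror the prototype's tokenizer settings).
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {oov_token: 1}
    for index, (word, _) in enumerate(counts.most_common(), start=2):
        vocab[word] = index
    return vocab


def encode_and_pad(text, vocab, maxlen, oov_token="<OOV>"):
    """Convert text to word indices, then truncate or zero-pad at the end."""
    ids = [vocab.get(word, vocab[oov_token]) for word in text.lower().split()]
    ids = ids[:maxlen]
    return ids + [0] * (maxlen - len(ids))


texts = ["wireless bluetooth headphones", "cotton t-shirt", "wireless mouse"]
vocab = build_vocab(texts)
print(encode_and_pad("wireless keyboard", vocab, maxlen=5))  # [2, 1, 0, 0, 0]
```

Here "wireless" is the most frequent word and gets index 2, while the unseen word "keyboard" falls back to the OOV index 1. The remaining positions are filled with zeros so that every sequence has the same length.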
    
    
    
Assumptions:
1. The generated dummy data is representative of real-world product data.
2. The product descriptions contain sufficient information for category classification.
3. The predefined categories are mutually exclusive and cover all possible product types.
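
Assumption 2 can be sanity-checked by comparing the model's metrics against chance-level baselines. The sketch below is one way to compute them. The helper name `chance_baselines` is hypothetical, and the class counts are the ones printed in the category distribution of this notebook run.

```python
import math


def chance_baselines(class_counts):
    """Reference metrics for a classifier that ignores its input."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        # Accuracy when guessing uniformly at random
        "uniform_accuracy": 1 / n_classes,
        # Cross-entropy when predicting equal probability for every class
        "uniform_cross_entropy": math.log(n_classes),
        # Accuracy when always predicting the most common class
        "majority_accuracy": max(class_counts.values()) / total,
    }


counts = {"Clothing": 267, "Books": 259, "Home & Kitchen": 238, "Electronics": 236}
baselines = chance_baselines(counts)
for name, value in baselines.items():
    print(f"{name}: {value:.4f}")
```

For four classes this gives a uniform accuracy of 0.2500, a uniform cross-entropy of about 1.3863 (the natural log of 4), and a majority-class accuracy of 0.2670. If the model's accuracy and loss stay close to these values, the descriptions are probably not giving it usable signal.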
    
    
Limitations:
1. **Small dataset**: the model is trained on a small, randomly generated dataset in which descriptions follow a fixed template and category labels are assigned at random, so it cannot capture the complexity of real-world product data.
2. **Simple model architecture**: the current model is basic and may not capture complex relationships in the data.
3. **Limited text preprocessing**: we used only basic word-level tokenization; further steps such as stemming or stop-word removal were not applied.
4. **Fixed number of categories**: the model is designed for a fixed set of categories and would need to be retrained to accommodate new categories.
5. **Single-label assumption**: the model assumes each product belongs to exactly one category.
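
One way to reduce the first limitation is to generate descriptions that depend on the category, so the text carries a learnable signal. The following is a minimal sketch under stated assumptions: the keyword lists, the description template, and the function name `generate_labeled_products` are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical per-category vocabulary, invented for this example
CATEGORY_KEYWORDS = {
    "Electronics": ["smartphone", "laptop", "headphones", "charger"],
    "Clothing": ["t-shirt", "jacket", "jeans", "sneakers"],
    "Books": ["novel", "paperback", "textbook", "biography"],
    "Home & Kitchen": ["blender", "saucepan", "cushion", "kettle"],
}


def generate_labeled_products(n_samples=1000, seed=42):
    """Return a list of product dicts whose description reflects the category."""
    rng = random.Random(seed)  # local generator keeps the data reproducible
    rows = []
    for i in range(n_samples):
        category = rng.choice(list(CATEGORY_KEYWORDS))
        keyword = rng.choice(CATEGORY_KEYWORDS[category])
        rows.append({
            "name": f"Product {i}",
            "description": f"A {keyword} for everyday use",
            "price": round(rng.uniform(10, 1000), 2),
            "category": category,
        })
    return rows


rows = generate_labeled_products(n_samples=5)
for row in rows:
    print(row["category"], "|", row["description"])
```

The returned list can be passed to `pd.DataFrame(rows)` to get the same four columns the prototype uses. Data this clean would likely make the task trivially easy, so it is a starting point for experiments rather than a stand-in for real product text.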
    
    
The prototype below demonstrates basic product classification with machine learning.

In [7]:
import tensorflow as tf
import pandas as pd
import numpy as np

print("TensorFlow version:", tf.__version__)
print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)


TensorFlow version: 2.16.1
Pandas version: 2.2.2
NumPy version: 1.26.4


In [8]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Step 1 : Data Generation 

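def generate_dummy_data(n_samples=1000):
    """Return a DataFrame of n_samples dummy products.

    Descriptions follow a fixed template and categories are sampled
    uniformly at random, independently of the description text.
    """
    categories = ['Electronics', 'Clothing', 'Books', 'Home & Kitchen']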
    
    data = {
        'name': [f'Product {i}' for i in range(n_samples)],
        'description': [f'This is a description for product {i}' for i in range(n_samples)],
        'price': np.random.uniform(10, 1000, n_samples).round(2),
        'category': np.random.choice(categories, n_samples)
    }
    
    return pd.DataFrame(data)


# Generating the dummy data

df = generate_dummy_data()

# Printing a snippet of the generated data

print("Snippet of generated dummy data:")
print(df.head())
print("\nDataset shape:", df.shape)
print("\nCategory distribution:")
print(df['category'].value_counts())

# Step 2 : Data Preprocessing 

# Tokenize text

tokenizer = Tokenizer(num_words=5000, oov_token="<OOV>")
tokenizer.fit_on_texts(df['description'])
sequences = tokenizer.texts_to_sequences(df['description'])
padded_sequences = pad_sequences(sequences, maxlen=100, padding='post', truncating='post')

# Encode categories

label_encoder = LabelEncoder()
encoded_categories = label_encoder.fit_transform(df['category'])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, encoded_categories, test_size=0.2, random_state=42)

# Step 3 : Model Development 

model = tf.keras.Sequential([
    # input_length is deprecated in Keras 3 (bundled with TF 2.16); the sequence length is inferred from the input
    tf.keras.layers.Embedding(input_dim=5000, output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(len(label_encoder.classes_), activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Training the model

history = model.fit(X_train, y_train, epochs=10, validation_split=0.2, batch_size=32)

# Step 4 : Evaluation of the model

test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")

# Step 5 : Prediction

def predict_category(description):
    sequence = tokenizer.texts_to_sequences([description])
    padded = pad_sequences(sequence, maxlen=100, padding='post', truncating='post')
    prediction = model.predict(padded)
    predicted_category = label_encoder.inverse_transform([np.argmax(prediction)])
    return predicted_category[0]



# example

new_product_description = "A new smartphone with advanced features"
predicted_category = predict_category(new_product_description)
print(f"Predicted category for '{new_product_description}': {predicted_category}")

Snippet of generated dummy data:
        name                          description   price        category
0  Product 0  This is a description for product 0   91.40        Clothing
1  Product 1  This is a description for product 1  456.61     Electronics
2  Product 2  This is a description for product 2  527.28  Home & Kitchen
3  Product 3  This is a description for product 3   27.71     Electronics
4  Product 4  This is a description for product 4  993.08     Electronics

Dataset shape: (1000, 4)

Category distribution:
category
Clothing          267
Books             259
Home & Kitchen    238
Electronics       236
Name: count, dtype: int64
Epoch 1/10




[1m20/20[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 13ms/step - accuracy: 0.2540 - loss: 1.3870 - val_accuracy: 0.1813 - val_loss: 1.3943
Epoch 2/10
[1m20/20[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step - accuracy: 0.2469 - loss: 1.3832 - val_accuracy: 0.2812 - val_loss: 1.3906
Epoch 3/10
[1m20/20[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step - accuracy: 0.2649 - loss: 1.3850 - val_accuracy: 0.1813 - val_loss: 1.3939
Epoch 4/10
[1m20/20[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step - accuracy: 0.2870 - loss: 1.3833 - val_accuracy: 0.1813 - val_loss: 1.3948
Epoch 5/10
[1m20/20[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step - accuracy: 0.2794 - loss: 1.3841 - val_accuracy: 0.1813 - val_loss: 1.3940
Epoch 6/10
[1m20/20[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 4ms/step - accuracy: 0.2892 - loss: 1.3842 - val_accuracy: 0.1813 - val_loss: 1.3953
Epoch 7/10
[1m20/20[0m [32m━━━━━━━━━━━━━━━━━━━━