# Product Category Prediction

##### Hi,
###### Welcome to this repository! Its objective is to build a solid understanding of Naive Bayes. The data used in this repo has been taken from Kaggle (link below).
https://www.kaggle.com/PromptCloudHQ/flipkart-products


### About the Mechanic of Machine Learning project:
I am a mechanical engineer by education. Now I want to dive deep into the world of Machine Learning, hence the name, Mechanic of ML :D. I have taken up this project to understand in depth the mathematics behind commonly used ML algorithms. As part of this project, I will share useful material and links as I explore the domain. The objective is to learn and to share what I learn. Stay tuned to my GitHub for updates!

### Business Case: 
An online retailer wants to classify products into categories based on the description provided by the seller. Build a model to support this.
### Notebook objectives:
* To understand and implement Naive Bayes


### Assumptions:

* Only the five most frequent top-level categories are considered, not the full dataset

### References:
* GDA and NB: https://www.youtube.com/watch?v=nt63k3bfXS0
* NB: https://www.youtube.com/watch?v=O2L2Uv9pdDA
* Gaussian NB: https://www.youtube.com/watch?v=H3EjCKtlVog
* Building NB from scratch: https://towardsdatascience.com/na%C3%AFve-bayes-from-scratch-using-python-only-no-fancy-frameworks-a1904b37222d
* Notes and source code: https://github.com/ArindamRoy23/Product-Prediction_GDA-NB_Mechanic-of-ML


In [None]:
'''
Importing packages

'''


import numpy as np
import pandas as pd
import re
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [None]:

def preprocess_string(str_arg):
    '''
    input: str_arg --> String to clean
    output: cleaned_str --> Cleaned string
    This function cleans the text in the ways described by the comment on each line. It has been copied from another kernel.

    '''
    cleaned_str=re.sub(r'[^a-z\s]+',' ',str_arg,flags=re.IGNORECASE) # every run of characters other than letters and whitespace is replaced by a space
    cleaned_str=re.sub(r'(\s+)',' ',cleaned_str) # runs of whitespace are replaced by a single space
    cleaned_str=cleaned_str.lower() # converting the cleaned string to lower case
    
    return cleaned_str # Returning the cleaned string (one string, not a list of tokens)
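
To see what the cleaning step does, here is a small self-contained sketch. It repeats the same two regular expressions so that it runs on its own; the sample description is made up for illustration and is not taken from the dataset.

```python
import re

def preprocess_string(str_arg):
    # Same cleaning steps as the function above, repeated so this example runs standalone
    cleaned_str = re.sub(r'[^a-z\s]+', ' ', str_arg, flags=re.IGNORECASE)
    cleaned_str = re.sub(r'(\s+)', ' ', cleaned_str)
    return cleaned_str.lower()

# Hypothetical product description (not from the dataset)
sample = "Nike Men's Shoes,  Size-10!!"
cleaned = preprocess_string(sample)
print(repr(cleaned))  # repr() makes leading/trailing spaces visible
```

Note that the result can keep a trailing space and single-letter leftovers such as the `s` from `Men's`. Calling `.strip()` on the result would be an optional extra; it is not part of the original function.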

In [None]:
'''
This code block is for reading and cleaning data.

'''
import_df = pd.read_csv('../input/flipkart-products/flipkart_com-ecommerce_sample.csv')
# Reading the raw data
import_df['product_category_tree'] = import_df['product_category_tree'].apply(lambda x : x.split('>>')[0][2:].strip())
# Keeping only the top-level category: the text before the first '>>', minus the first two characters (check the raw data to see the format)
top_fiv_gen = list(import_df.groupby('product_category_tree').count().sort_values(by='uniq_id',ascending=False).head(5).index)
# Taking only the 5 most frequent categories to keep the example small
processed_df = import_df[import_df['product_category_tree'].isin(top_fiv_gen)][['product_category_tree','description']].copy()
# Selecting only the relevant rows and columns (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
processed_df['description'] = processed_df['description'].astype('str').apply(preprocess_string)
# Cleaning the description strings
cat_list = list(processed_df['product_category_tree'].unique())
# Creating a list of categories for later use
print(cat_list)
# Printing the list of top 5 categories
le = preprocessing.LabelEncoder()
category_encoded=le.fit_transform(processed_df['product_category_tree'])
processed_df['product_category_tree'] = category_encoded
# Encoding the product category as integers (le.inverse_transform maps them back to names)
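
The category handling above can be illustrated without the dataset. The sketch below assumes the raw `product_category_tree` field looks like `["A >> B >> C"]`; the three sample strings and the helper name `top_level_category` are hypothetical. It uses `collections.Counter` instead of the pandas `groupby` only to keep the example dependency-free.

```python
from collections import Counter

# Hypothetical raw values, assumed to mimic the format of product_category_tree
raw_trees = [
    '["Clothing >> Women\'s Clothing >> Shorts"]',
    '["Clothing >> Men\'s Clothing >> T-Shirts"]',
    '["Footwear >> Women\'s Footwear >> Sandals"]',
]

def top_level_category(tree):
    # Text before the first '>>', minus the leading '["', with whitespace trimmed
    return tree.split('>>')[0][2:].strip()

categories = [top_level_category(t) for t in raw_trees]
print(categories)

# Most frequent categories first, analogous to the top-5 selection (top 1 here)
top_categories = [name for name, _ in Counter(categories).most_common(1)]
print(top_categories)
```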

In [None]:
'''
This code block is for splitting the data into train and test sets

'''
X_train, X_test, y_train, y_test = train_test_split(processed_df['description'],processed_df['product_category_tree'],test_size=0.2)
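
The split above is different on every run and does not guarantee that each category keeps the same proportion in both sets. A minimal sketch of a reproducible, stratified alternative follows; `random_state=42` and the toy data are arbitrary choices for illustration and are not part of the original notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-ins for the description and encoded-category columns
texts = [f"item {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.2,
    random_state=42,   # fixed seed so the split is repeatable
    stratify=labels,   # keep class proportions equal in train and test
)
print(Counter(y_tr), Counter(y_te))
```

With the notebook's data, the equivalent change would be to pass `stratify=processed_df['product_category_tree']` and a fixed `random_state` to the existing call.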

In [None]:
'''
This code block is for converting the training data to vectorized form

'''
vect = CountVectorizer(stop_words = 'english')
# Removing stop words
X_train_matrix = vect.fit_transform(X_train) 
# Converting the train data
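
To make the vectorized form concrete, here is a small sketch on two made-up sentences. `vect_demo` is an illustrative name chosen so that it does not clash with the notebook's `vect`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical descriptions; "the" is an English stop word and gets dropped
docs = ["red cotton shirt", "the blue cotton shirt"]
vect_demo = CountVectorizer(stop_words='english')
counts = vect_demo.fit_transform(docs)

# vocabulary_ maps each kept word to its column index
vocab = sorted(vect_demo.vocabulary_, key=vect_demo.vocabulary_.get)
print(vocab)
print(counts.toarray())  # one row per document, one column per word

# Words not seen during fit are ignored at transform time
unseen = vect_demo.transform(["green shirt"]).toarray()
print(unseen)
```

This is why the test data must go through `transform` rather than `fit_transform`: the columns have to match the vocabulary learned from the training data.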

In [None]:
'''
This code block is for training vectorized data and predicting & scoring test data

'''
clf=MultinomialNB()
# Defining model
clf.fit(X_train_matrix, y_train)
# Fitting to multinomial NB model 
print(clf.score(X_train_matrix, y_train))
# Accuracy on the training data (expected to be above 95 percent)
X_test_matrix = vect.transform(X_test) 
# Converting the test data with the vocabulary learned from the training data
print(clf.score(X_test_matrix, y_test))
# Accuracy on the test data
predicted_result=clf.predict(X_test_matrix)
print(classification_report(y_test,predicted_result))
# Printing per-category precision, recall and F1-score
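
Since the goal of this notebook is to understand Naive Bayes, here is a minimal from-scratch sketch of the multinomial variant: pick the class that maximizes log P(c) plus the sum of log P(w|c) over the words of the document, with add-one (Laplace) smoothing. The function names `train_nb` and `predict_nb` and the toy data are hypothetical; this illustrates the idea and is not a reimplementation of scikit-learn's code.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from whitespace-tokenized docs."""
    vocab = sorted({w for d in docs for w in d.split()})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, c in zip(docs, labels):
        word_counts[c].update(d.split())
    # log P(c): fraction of training documents in each class
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        # Denominator: total word count in class c plus alpha per vocabulary word
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(doc, log_prior, log_lik):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in doc.split() if w in log_lik[c]
        )
    return max(scores, key=scores.get)

# Hypothetical training data
train_docs = ["cotton shirt blue", "leather shoes black",
              "cotton shirt red", "running shoes white"]
train_labels = ["Clothing", "Footwear", "Clothing", "Footwear"]

log_prior, log_lik = train_nb(train_docs, train_labels)
pred_a = predict_nb("red cotton shirt", log_prior, log_lik)
pred_b = predict_nb("black running shoes", log_prior, log_lik)
print(pred_a, pred_b)
```

Working in log space avoids multiplying many small probabilities, and the smoothing term keeps a word that never appeared in a class from driving that class's probability to zero.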

In [None]:
'''
This code block is for converting the training data to Tf-Idf form

'''
vectorizer = TfidfVectorizer(stop_words = 'english')
# Removing stop words
X_train_tfidf = vectorizer.fit_transform(X_train)
# Converting the train data
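
`TfidfVectorizer` re-weights the raw counts so that words appearing in many documents count for less. With its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn documents the weight as idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The sketch below computes this by hand on a hypothetical two-document corpus; `tfidf_row` is an illustrative helper name.

```python
import math
from collections import Counter

# Hypothetical, already tokenized corpus
docs = [["cotton", "shirt"], ["cotton", "shoes"]]
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

# Smoothed idf, as documented for scikit-learn's default settings
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}

def tfidf_row(doc):
    # Raw term count times idf, then L2-normalize the row
    tf = Counter(doc)
    weights = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row = tfidf_row(docs[0])
print(idf)
print(row)
```

Here `cotton` appears in both documents, so it ends up with a lower weight than `shirt`, which appears in only one.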

In [None]:
'''
This code block is for training, predicting & scoring test data

'''
clf2=MultinomialNB()
# Defining model
clf2.fit(X_train_tfidf, y_train)
# Fitting to multinomial NB model 
print(clf2.score(X_train_tfidf, y_train))
# Accuracy on the training data (expected to be above 95 percent)
X_test_tfidf = vectorizer.transform(X_test) 
# Converting the test data with the vocabulary and weights learned from the training data
print(clf2.score(X_test_tfidf, y_test))
# Accuracy on the test data

In [None]:
'''
Testing Block: Test your own string. Replace the 'car' string to test
'''
le.inverse_transform(clf.predict(vect.transform(['car'])))
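
For trying several strings at once, the vectorizer and the classifier can be wrapped in a scikit-learn pipeline, so that raw text goes in and labels come out. This is a sketch on hypothetical toy data, not the notebook's trained model; `demo_model` is an illustrative name.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data standing in for the cleaned descriptions
train_texts = ["cotton shirt blue", "leather shoes black",
               "cotton shirt red", "running shoes white"]
train_labels = ["Clothing", "Footwear", "Clothing", "Footwear"]

# The pipeline applies the vectorizer first, then the classifier
demo_model = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
demo_model.fit(train_texts, train_labels)

predictions = demo_model.predict(["red cotton shirt", "black running shoes"])
print(list(predictions))
```

Because the labels here are plain strings, no `LabelEncoder` round trip is needed in this sketch.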

### Conclusion
Naive Bayes works very well on this dataset, with an accuracy above 99%. For the five categories considered, this is a business-ready model to deploy.