<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Task-1---Correcting-categories" data-toc-modified-id="Task-1---Correcting-categories-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Task 1 - Correcting categories</a></span></li><li><span><a href="#Task-2---updating-/-correcting-brand-names" data-toc-modified-id="Task-2---updating-/-correcting-brand-names-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Task 2 - updating / correcting brand names</a></span></li></ul></div>

Since the data contains mistakes, there is no "ground truth" to train against. I have therefore used an unsupervised approach: it learns the empirical distribution of name tokens within each category and uses those distributions to correct likely mistakes.

### Task 1 - Correcting categories

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from collections import OrderedDict
import sys
import operator
import spacy
import math

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [2]:
df = pd.read_excel("file7_andrea.xlsx")

cat_mix = pd.read_excel("Latest category mix 03-05-2019.xlsx")

unique_cats = list(df['cat0fk'].unique())

mean_prices = df.groupby("cat0fk")["price"].mean()

Counting the empirical probability for each unigram and bigram per category

In [3]:
from collections import Counter

count_name_tokens = {}
stop_word_set = set(stop_words)

# For each category...
for cat in tqdm(unique_cats):

    # Get the list of unigrams, skipping stop words.
    # All product names of the category are flattened into one token stream.
    names = df[df['cat0fk'] == cat]['clean_name'].to_list()
    unigrams = [item for sublist in names for item in sublist.split() if item not in stop_word_set]
    total_unigrams = len(unigrams)

    # Count the occurrences of each unigram and divide by the total to get a probability.
    # This is the a priori (default) probability of each token in each class.
    count_cat = {unigram: n / total_unigrams for unigram, n in Counter(unigrams).items()}

    # Do the same for bigrams
    bigrams = [(unigrams[i], unigrams[i+1]) for i in range(len(unigrams)-1)]
    total_bigrams = len(bigrams)
    count_cat.update({bigram: n / total_bigrams for bigram, n in Counter(bigrams).items()})

    count_name_tokens[cat] = count_cat

100%|██████████| 7/7 [01:29<00:00, 12.79s/it]
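
As a minimal worked example of these per-category probabilities, here is the same computation on two made-up product names for a single category. The names in `toy_names` are hypothetical and not taken from the dataset.

```python
from collections import Counter

# Hypothetical cleaned product names for one category (illustration only).
toy_names = ["red cotton shirt", "blue cotton shirt"]

# Flatten the names into one token stream, as the cell above does.
tokens = [tok for name in toy_names for tok in name.split()]
bigrams = list(zip(tokens, tokens[1:]))

# Relative frequency of each unigram and bigram within the category.
unigram_probs = {tok: n / len(tokens) for tok, n in Counter(tokens).items()}
bigram_probs = {bg: n / len(bigrams) for bg, n in Counter(bigrams).items()}

print(unigram_probs["cotton"])            # 2 of the 6 tokens
print(bigram_probs[("cotton", "shirt")])  # 2 of the 5 bigrams
```

Because the names are flattened first, one of the five bigrams here is ("shirt", "blue"), which straddles two different products. Building bigrams per product name would avoid this, at the cost of changing the probabilities.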


Defining the updating function

In [4]:
threshold = 1

def reassign(product):
    
    text = product['clean_name']
    price = product['price']
    
    original_category = product['cat0fk']
        
    likelihood = dict.fromkeys(unique_cats, 0)
    
    min_price_diff = sys.maxsize
    
    unigrams = text.split()
    
    bigrams = [(unigrams[i], unigrams[i+1]) for i in range(len(unigrams) - 1)]
    
    
    # Calculate likelihood for each unigram
    for unigram in unigrams:
        cat_unigram_appearance = 0
        for cat, word_scores in count_name_tokens.items():
            if unigram in word_scores:
                cat_unigram_appearance += 1
                
        # Down-weight tokens that appear in many categories
        if cat_unigram_appearance:
            cat_unigram_weight = 1 / cat_unigram_appearance
        else:
            cat_unigram_weight = 1
            
        for cat, word_scores in count_name_tokens.items():
            if unigram in word_scores:
                likelihood[cat] += cat_unigram_weight * word_scores[unigram]
                
                
    # Calculate likelihood for each bigram
    for bigram in bigrams:
        # cat_bigram_weight = 1
        cat_bigram_appearance = 0
       
        for cat, word_scores in count_name_tokens.items():
            
            if bigram in word_scores:
                
                cat_bigram_appearance += 1
        # Down-weight bigrams that appear in many categories
        if cat_bigram_appearance:
            cat_bigram_weight = 1 / cat_bigram_appearance
        else:
            cat_bigram_weight = 1
        
        for cat, word_scores in count_name_tokens.items():
            
            if bigram in word_scores:
                
                # Bigram likelihoods are weighted by an additional factor of 10
                likelihood[cat] += 10 * cat_bigram_weight * word_scores[bigram]
               
            
    original_likelihood = likelihood[original_category]

    # Pick the category with the highest likelihood
    most_prob_tag = max(likelihood.items(), key = operator.itemgetter(1))[0]

    most_prob_likelihood = likelihood[most_prob_tag]


    # Keep the original tag if no category matched at all, or if the relative
    # gain in likelihood over the original category is below the threshold
    # (with threshold = 1, the best category must score at least twice as high).
    # The check on original_likelihood avoids a division by zero.
    if most_prob_likelihood == 0:
        most_prob_tag = original_category
    elif original_likelihood > 0 and \
            (most_prob_likelihood - original_likelihood)/original_likelihood < threshold:
        most_prob_tag = original_category

    return most_prob_tag
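
To make the scoring and threshold rule concrete, here is a small sketch on made-up numbers. It covers unigrams only (the notebook additionally weights bigrams by 10), and `toy_probs`, `score` and `decide` are hypothetical names introduced for illustration.

```python
# Hypothetical per-category token probabilities (not from the real data).
toy_probs = {
    "Home": {"cotton": 0.10, "mattress": 0.20},
    "LifeStyle": {"cotton": 0.30, "shirt": 0.20},
}

def score(tokens, probs):
    """Sum P(token | category), divided by the number of categories containing the token."""
    scores = dict.fromkeys(probs, 0.0)
    for tok in tokens:
        n_cats = sum(tok in p for p in probs.values())
        if n_cats == 0:
            continue  # token unseen in every category
        for cat, p in probs.items():
            if tok in p:
                scores[cat] += p[tok] / n_cats
    return scores

def decide(tokens, original, probs, threshold=1):
    """Return the new category, or the original one if the relative gain is too small."""
    scores = score(tokens, probs)
    best = max(scores, key=scores.get)
    base = scores[original]
    if scores[best] == 0:
        return original  # nothing matched, keep the original tag
    if base > 0 and (scores[best] - base) / base < threshold:
        return original  # gain below threshold, keep the original tag
    return best

print(decide(["cotton", "shirt"], "Home", toy_probs))          # reassigned
print(decide(["cotton", "mattress"], "LifeStyle", toy_probs))  # kept
```

In the first call the best category scores several times higher than the original, so the product is moved. In the second call the gain is under 100%, so the original tag is kept.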

Updating categories for each row based on the above function

In [5]:
cat0fk_corrected = []

for index, row in df.iterrows():
    cat0fk_corrected.append(reassign(row))

In [6]:
df['cat0fk_corrected'] = cat0fk_corrected

In [7]:
df.head()

Unnamed: 0,product_name,clean_name,price,mapped_brands,cat0fk,cat0fk_corrected
0,"Royal Canin Persian Adult 30 Cat Food, 4 kg",royal canin persian adult 30 cat food 4 kg,439.0,royal canin,Home,Home
1,ROYAL CARE Reusable Latex Rubber Household Han...,royal care reusable latex rubber household han...,345.0,royal,Home,Home
2,Royal Carpet High Density Artificial Grass Car...,royal carpet high density artificial grass car...,1170.0,royal,Home,Home
3,"Royal Comfort Zone Cotton Mattress (Orange, 72...",royal comfort zone cotton mattress orange 72x3...,649.0,royal comfort,Home,Home
4,Royal Crown Austrian Crystal Silver Designer R...,royal crown austrian crystal silver designer r...,215.0,royal crown,LifeStyle,LifeStyle


I tried to fetch the correct categories from Flipkart's API, but I did not have the credentials to access it. An external reference of this kind would help a great deal in improving this task.

### Task 2 - updating / correcting brand names

I use spaCy's pretrained model for named entity extraction and POS tagging.

See - https://github.com/explosion/spacy-models/releases/tag/en_core_web_lg-2.2.5

In [8]:
nlp = spacy.load("en_core_web_lg")

# All possible brand names
possible_brands = df['mapped_brands'].unique()

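Here, NER refers to the entities extracted by spaCy.

When updating each brand name, I have thought of four possible cases:

1. Tokens exist in both NER and mapped_brands: keep the mapped brand.
2. Tokens exist in mapped_brands but not NER: keep the brand tokens that occur in the product name.
3. Tokens exist in NER but not mapped_brands: use the entity text.
4. Tokens exist in neither mapped_brands nor NER: fall back to the leading token(s) of the product name.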

In [9]:

def map_brands(product):
    
    # Extracting fields from row
    product_name = product['clean_name']
    brand_name = product['mapped_brands']
    
    # Feeding the name into SpaCy's model
    doc = nlp(product['clean_name'])

    
    product_name_tokens = product_name.split()
    
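    # mapped_brands is NaN (a float) when no brand is recorded
    if isinstance(brand_name, str):
        brand_name_tokens = brand_name.split()
        len_brand = len(brand_name_tokens)
    else:
        brand_name_tokens = []
        len_brand = 1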
    
    # This will contain the corrected brand name
    mapped_corrected = ""

    # If no entities are found in the product name,
    # set the first two tokens in the product name as the brand name (my assumption)
    if not doc.ents:
        mapped_corrected = " ".join(product_name_tokens[:2])
    
    # Loop through every detected entity...
    for entity in doc.ents:
        
        ent_start = entity.start
        ent_end = entity.end
        
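        # Use the token text: spaCy Token objects never compare equal to plain strings,
        # so the set intersections below would otherwise always be empty
        entity_tokens = [token.text for token in doc[ent_start: ent_end]]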
    
    
        if type(brand_name) == str:
            
            # Case 1 - tokens exist in both NER and mapped_brands
            
            # Checking for overlap in brand_name_tokens and entity_tokens
            if bool(set(brand_name_tokens) & set(entity_tokens)):
                mapped_corrected = brand_name
                break
                
                
                
            # Case 2 - tokens exist in mapped_brands but not NER
            else:
                set_mb = set(brand_name_tokens)
                set_pn = set(product_name_tokens)
                
                pn_mb_overlap = set_mb & set_pn
                
                pn_mb_overlap = [x for x in product_name_tokens if x in pn_mb_overlap]                
                
                if pn_mb_overlap:
                    # Join with single spaces (no trailing space)
                    mapped_corrected = " ".join(pn_mb_overlap)
                    break
                    
                
            
        # Case 3 - tokens exist in NER but not mapped_brands
        elif math.isnan(brand_name):
            if bool(set(product_name_tokens) & set(entity_tokens)):
                mapped_corrected = entity.text
                break
            
            
            
            # Case 4 - tokens exist in neither mapped_brands nor NER
            else:
                mapped_corrected = " ".join(product_name_tokens[:len_brand])
                break
    
    # If no criterion was fulfilled in the previous steps, simply return the brand name as is
    if not mapped_corrected:
        mapped_corrected = brand_name
                
    return mapped_corrected
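
Case 2 can be looked at in isolation. The sketch below keeps only the brand tokens that also occur in the product name, in product-name order; `ordered_overlap` is a hypothetical helper name, and the inputs are taken from the rows shown earlier.

```python
def ordered_overlap(product_name, brand_name):
    """Return the brand tokens that occur in the product name, in product-name order."""
    brand_tokens = set(brand_name.split())
    return " ".join(tok for tok in product_name.split() if tok in brand_tokens)

print(ordered_overlap("royal canin persian adult 30 cat food 4 kg", "royal canin"))
# royal canin
print(ordered_overlap("the great gatsby", "the"))
# the
```

Iterating over the product name rather than over the set keeps the token order stable, since Python sets have no defined order.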

In [10]:
# Execute the above function for all rows

mapped_brands_corrected = []

for index, row in tqdm(df.iterrows()):
    mapped_brands_corrected.append(map_brands(row))

6056it [02:13, 45.35it/s]


In [11]:
df['mapped_brands_corrected'] = mapped_brands_corrected

In [12]:
df.tail()

Unnamed: 0,product_name,clean_name,price,mapped_brands,cat0fk,cat0fk_corrected,mapped_brands_corrected
6051,The Great Ages of World Architecture (With Int...,the great ages of world architecture with intr...,365.0,the,Home,Home,the
6052,The Great Gatsby,the great gatsby,79.0,the,BGM,BGM,the great
6053,The Greatness Guide 2,the greatness guide 2,244.0,the,BGM,BGM,the
6054,The Gruffalo's Child Magnet Book,the gruffalo s child magnet book,520.0,the,BGM,BGM,the gruffalo
6055,The Heartfulness Way (Kannada),the heartfulness way kannada,180.0,the,Electronics,Electronics,the


The remaining problem is with products that have no real brand name. "The" is captured as the brand name even though it is not a brand. This can be mitigated to an extent by a rule that omits tokens whose spaCy POS tag is 'DET'.
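
A minimal sketch of such a rule follows. It works on (text, POS) pairs so that it runs without loading a model; with spaCy the pairs would come from `[(tok.text, tok.pos_) for tok in nlp(name)]`. The tags below are written by hand for illustration and the model's actual tags may differ; `drop_determiners` is a hypothetical helper name.

```python
def drop_determiners(tagged_tokens):
    """Return the token texts whose POS tag is not 'DET'."""
    return [text for text, pos in tagged_tokens if pos != "DET"]

# Hand-written tags (an assumption, not real model output).
tagged = [("the", "DET"), ("great", "ADJ"), ("gatsby", "PROPN")]

# Fall back to the first two remaining tokens, as in the no-entity branch.
print(" ".join(drop_determiners(tagged)[:2]))  # great gatsby
```

This removes "the" but still returns part of a book title as the brand, so the rule only helps to an extent, as noted above.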

