In [88]:
from __future__ import print_function
import pandas as pd 
from scipy import stats

# CDiscount Category Analysis
This notebook analyzes the distribution of the categories in order to design a scalable strategy for this large-scale classification problem.

In [5]:
# Parameters
DATASET_ROOT = '../Data/raw/'

In [7]:
# Load Dataset
data = pd.read_csv(DATASET_ROOT+'category_names.csv')

In [15]:
# data.size counts cells (rows x columns); len(data) counts records (rows)
print('TOTAL RECORDS: ' + str(len(data)))
data.head()

TOTAL RECORDS: 5270


| | category_id | category_level1 | category_level2 | category_level3 |
|---|---|---|---|---|
| 0 | 1000021794 | ABONNEMENT / SERVICES | CARTE PREPAYEE | CARTE PREPAYEE MULTIMEDIA |
| 1 | 1000012764 | AMENAGEMENT URBAIN - VOIRIE | AMENAGEMENT URBAIN | ABRI FUMEUR |
| 2 | 1000012776 | AMENAGEMENT URBAIN - VOIRIE | AMENAGEMENT URBAIN | ABRI VELO - ABRI MOTO |
| 3 | 1000012768 | AMENAGEMENT URBAIN - VOIRIE | AMENAGEMENT URBAIN | FONTAINE A EAU |
| 4 | 1000012755 | AMENAGEMENT URBAIN - VOIRIE | SIGNALETIQUE | PANNEAU D'INFORMATION EXTERIEUR |


### Category 1 Analysis

In [25]:
cat1_list = data.category_level1.unique().tolist()
print('CAT 1 - Unique Total: ' + str(len(cat1_list)) + '\n')
print(cat1_list)

CAT 1 - Unique Total: 49

['ABONNEMENT / SERVICES', 'AMENAGEMENT URBAIN - VOIRIE', 'ANIMALERIE', 'APICULTURE', 'ART DE LA TABLE - ARTICLES CULINAIRES', 'ARTICLES POUR FUMEUR', 'AUTO - MOTO', 'BAGAGERIE', 'BATEAU MOTEUR - VOILIER', 'BIJOUX -  LUNETTES - MONTRES', 'BRICOLAGE - OUTILLAGE - QUINCAILLERIE', 'CHAUSSURES - ACCESSOIRES', 'COFFRET CADEAU BOX', 'CONDITIONNEMENT', 'DECO - LINGE - LUMINAIRE', 'DROGUERIE', 'DVD - BLU-RAY', 'ELECTROMENAGER', 'ELECTRONIQUE', 'EPICERIE', 'FUNERAIRE', 'HYGIENE - BEAUTE - PARFUM', 'INFORMATIQUE', 'INSTRUMENTS DE MUSIQUE', 'JARDIN - PISCINE', 'JEUX - JOUETS', 'JEUX VIDEO', 'LIBRAIRIE', 'LITERIE', 'LOISIRS CREATIFS - BEAUX ARTS - PAPETERIE', 'MANUTENTION', 'MATERIEL DE BUREAU', 'MATERIEL MEDICAL', 'MERCERIE', 'MEUBLE', 'MUSIQUE', 'PARAPHARMACIE', 'PHOTO - OPTIQUE', 'POINT DE VENTE - COMMERCE - ADMINISTRATION', 'PRODUITS FRAIS', 'PRODUITS SURGELES', 'PUERICULTURE', 'SONO - DJ', 'SPORT', 'TATOUAGE - PIERCING', 'TELEPHONIE - GPS', 'TENUE PROFESSIONNELLE', 'T

### Category 2 Analysis
Look at the distribution of CAT2 per CAT1, in particular the number of distinct CAT2 labels under each CAT1 category.

In [29]:
# Build Category Dictionary
cat_dict = {i:dict() for i in cat1_list}

In [68]:
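# Aggregate Category 2 Per Category 1
data_dict = data.to_dict()
# .items() works on both Python 2 and 3 (iteritems() is Python 2 only)
for k, cat1 in data_dict['category_level1'].items():
    cat2 = data_dict['category_level2'][k]
    # Register each CAT2 label once under its parent CAT1
    if cat2 not in cat_dict[cat1]:
        cat_dict[cat1][cat2] = dict()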

In [79]:
# Generate Distribution Counts of Category 2 per Category 1
# One (CAT1 name, number of CAT2 labels) tuple per top-level category
cat2_dist = [(k, len(v)) for k, v in cat_dict.items()]
print(cat2_dist)

[('ELECTRONIQUE', 10), ('PRODUITS SURGELES', 1), ('BIJOUX -  LUNETTES - MONTRES', 3), ('AUTO - MOTO', 11), ('PHOTO - OPTIQUE', 15), ('ABONNEMENT / SERVICES', 1), ('CONDITIONNEMENT', 7), ('TENUE PROFESSIONNELLE', 4), ('DVD - BLU-RAY', 3), ('COFFRET CADEAU BOX', 1), ('DECO - LINGE - LUMINAIRE', 20), ('PUERICULTURE', 13), ('MATERIEL MEDICAL', 25), ('SONO - DJ', 8), ('JARDIN - PISCINE', 13), ('CHAUSSURES - ACCESSOIRES', 5), ('ARTICLES POUR FUMEUR', 6), ('MATERIEL DE BUREAU', 8), ('HYGIENE - BEAUTE - PARFUM', 13), ('PARAPHARMACIE', 5), ('ART DE LA TABLE - ARTICLES CULINAIRES', 12), ('EPICERIE', 17), ('SPORT', 40), ('LIBRAIRIE', 29), ('INFORMATIQUE', 13), ('DROGUERIE', 10), ('TATOUAGE - PIERCING', 8), ('VIN - ALCOOL - LIQUIDES', 8), ('APICULTURE', 1), ('INSTRUMENTS DE MUSIQUE', 10), ('TV - VIDEO - SON', 11), ('ELECTROMENAGER', 15), ('MEUBLE', 8), ('MUSIQUE', 2), ('MERCERIE', 5), ('BATEAU MOTEUR - VOILIER', 13), ('LOISIRS CREATIFS - BEAUX ARTS - PAPETERIE', 21), ('BAGAGERIE', 5), ('ANIMALERIE
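
The aggregation above only needs the number of distinct CAT2 labels under each CAT1, so it can also be expressed with sets. The following is a minimal sketch, not the notebook's own method: the helper name `count_children` is hypothetical, and the input is limited to the five (CAT1, CAT2) pairs visible in the `data.head()` output rather than the full file.

```python
from collections import defaultdict

# The five (CAT1, CAT2) pairs shown in the data.head() output above
rows = [
    ('ABONNEMENT / SERVICES', 'CARTE PREPAYEE'),
    ('AMENAGEMENT URBAIN - VOIRIE', 'AMENAGEMENT URBAIN'),
    ('AMENAGEMENT URBAIN - VOIRIE', 'AMENAGEMENT URBAIN'),
    ('AMENAGEMENT URBAIN - VOIRIE', 'AMENAGEMENT URBAIN'),
    ('AMENAGEMENT URBAIN - VOIRIE', 'SIGNALETIQUE'),
]


def count_children(pairs):
    """Return {parent: number of distinct children} for (parent, child) pairs.

    Hypothetical helper, introduced here for illustration only.
    """
    children = defaultdict(set)
    for parent, child in pairs:
        children[parent].add(child)  # a set ignores repeated children
    return {parent: len(kids) for parent, kids in children.items()}


sample_counts = count_children(rows)
print(sample_counts)
```

On the full DataFrame, `data.groupby('category_level1')['category_level2'].nunique()` should yield the same per-CAT1 counts in a single call.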

In [90]:
# Obtain General Stats for CAT2 Distribution
stats.describe([i[1] for i in cat2_dist])

DescribeResult(nobs=49, minmax=(1, 40), mean=9.9183673469387763, variance=61.701530612244902, skewness=1.503873612531015, kurtosis=3.103971878687968)

## Potential Strategies
Based on the analysis above, here are potential strategies given the kind of data we are working with.
* Exploit the provided hierarchy in some form, for example a Bayesian approach (such as topic modeling) that uses posterior knowledge of the high-level category to reduce the search space for classification.

* Build a hierarchical model: one independent model first classifies the product into its first-level category, then a second model inherits the weights of the parent model and prunes them according to the data. A large-scale ResNet architecture could be used to handle this number of categories.

* Open question: how deep does the model need to be to achieve substantially higher accuracy/precision?
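
The idea of using the top-level prediction to shrink the search space can be sketched as a two-stage decision. This is only an illustration under stated assumptions: the function `predict_hierarchical` is hypothetical, the scores are made-up numbers standing in for the outputs of two models that do not exist yet, and only the category names come from the `data.head()` output.

```python
def predict_hierarchical(cat1_scores, leaf_scores, parent_of):
    """Pick the best leaf among the children of the best top-level category.

    cat1_scores: {CAT1 name: score} from a hypothetical first-stage model
    leaf_scores: {leaf name: score} from a hypothetical second-stage model
    parent_of:   {leaf name: CAT1 name}, taken from the category hierarchy
    """
    best_cat1 = max(cat1_scores, key=cat1_scores.get)
    # Restrict the search space to the leaves under the predicted parent
    candidates = [leaf for leaf, parent in parent_of.items() if parent == best_cat1]
    best_leaf = max(candidates, key=lambda leaf: leaf_scores[leaf])
    return best_cat1, best_leaf


# Hierarchy fragment taken from the data.head() output (CAT3 -> CAT1)
parent_of = {
    'CARTE PREPAYEE MULTIMEDIA': 'ABONNEMENT / SERVICES',
    'ABRI FUMEUR': 'AMENAGEMENT URBAIN - VOIRIE',
    'ABRI VELO - ABRI MOTO': 'AMENAGEMENT URBAIN - VOIRIE',
    'FONTAINE A EAU': 'AMENAGEMENT URBAIN - VOIRIE',
    "PANNEAU D'INFORMATION EXTERIEUR": 'AMENAGEMENT URBAIN - VOIRIE',
}

# Illustrative scores only (assumed, not produced by any real model)
cat1_scores = {'ABONNEMENT / SERVICES': 0.2, 'AMENAGEMENT URBAIN - VOIRIE': 0.8}
leaf_scores = {
    'CARTE PREPAYEE MULTIMEDIA': 0.9,
    'ABRI FUMEUR': 0.1,
    'ABRI VELO - ABRI MOTO': 0.6,
    'FONTAINE A EAU': 0.3,
    "PANNEAU D'INFORMATION EXTERIEUR": 0.2,
}

result = predict_hierarchical(cat1_scores, leaf_scores, parent_of)
print(result)
```

In this toy case a flat argmax over all leaves would pick `CARTE PREPAYEE MULTIMEDIA`, while the hierarchical rule only considers leaves under the predicted CAT1. The trade-off of such a hard first-stage decision is that a wrong top-level prediction cannot be recovered in the second stage.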