# Stock Codes

The original data set contains problems in its stock code / description pairings. This notebook documents my attempt to find those problems and to develop a script that fixes them automatically.

In [1]:
import os
import numpy as np
import pandas as pd
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
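file_path = f"{os.getenv('PROJ_REPOS')}\\data\\Online Retail.xlsx"
online_retail = pd.read_excel(file_path)
# Cast both columns to str so that purely numeric codes (e.g. 71053) and
# alphanumeric codes (e.g. 85123A) are handled uniformly as text.
online_retail['StockCode'] = online_retail['StockCode'].astype(str)
online_retail['Description'] = online_retail['Description'].astype(str)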

In [3]:
stock_items_raw = online_retail.query('Quantity > 0')[['StockCode', 'Description']].drop_duplicates()

In [4]:
stock_items_raw

Unnamed: 0,StockCode,Description
0,85123A,WHITE HANGING HEART T-LIGHT HOLDER
1,71053,WHITE METAL LANTERN
2,84406B,CREAM CUPID HEARTS COAT HANGER
3,84029G,KNITTED UNION FLAG HOT WATER BOTTLE
4,84029E,RED WOOLLY HOTTIE WHITE HEART.
...,...,...
535334,22142,check
537224,47591b,SCOTTIES CHILDRENS APRON
537621,85123A,CREAM HANGING HEART T-LIGHT HOLDER
538554,85175,


In [5]:
# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning.
stock_items = stock_items_raw.dropna().copy()
# Normalize codes such as '47591b' to '47591B'.
stock_items['StockCode'] = stock_items['StockCode'].str.upper()
# Keep only all-upper-case descriptions; this drops entries such as 'check'.
stock_items = stock_items[stock_items['Description'].str.isupper()][['StockCode', 'Description']]

In [6]:
# Sorted list of stock codes that appear on more than one row.
duplicated_mask = stock_items['StockCode'].duplicated()
duplicated_stock_codes = sorted(stock_items.loc[duplicated_mask, 'StockCode'].astype(str))

In [7]:
duplicates_table = stock_items[stock_items['StockCode'].isin(duplicated_stock_codes)]


In [8]:
duplicates_table = duplicates_table.sort_values('StockCode').drop_duplicates()

In [9]:
duplicates_table

Unnamed: 0,StockCode,Description
5370,15056BL,EDWARDIAN PARASOL BLACK
15801,15056N,EDWARDIAN PARASOL NATURAL
2307,15056P,EDWARDIAN PARASOL PINK
4310,15060B,FAIRY CAKE DESIGN UMBRELLA
47284,16156L,"WRAP, CAROUSEL"
...,...,...
10792,90014B,GOLD M PEARL ORBIT NECKLACE
501725,90014C,SILVER AND BLACK ORBIT NECKLACE
142063,90014C,SILVER/BLACK ORBIT NECKLACE
75049,90183C,BLACK DROP EARRINGS W LONG BEADS


In [10]:
# Optional export for manual review (disabled).
# file_path = f"{os.getenv('PROJ_REPOS')}\\data\\duplicates_table.csv"
# duplicates_table.to_csv(file_path)

A manual review of the CSV file, which has only 457 rows, showed three descriptions that were clearly out of place: "AMAZON", "FOUND", and "FBA".
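As a rough sketch of how this kind of review could be made less manual, the snippet below lists every stock code that has more than one distinct description, with the descriptions side by side. The function name `summarize_descriptions`, the stock codes, and the toy rows are made up for illustration; they are not taken from the data set.

```python
import pandas as pd

# Toy rows with hypothetical stock codes (not real data).
toy_items = pd.DataFrame({
    "StockCode": ["11111", "11111", "22222", "22222", "33333"],
    "Description": [
        "PINK BREAD BIN",
        "AMAZON",
        "SILVER AND BLACK ORBIT NECKLACE",
        "SILVER/BLACK ORBIT NECKLACE",
        "WHITE METAL LANTERN",
    ],
})


def summarize_descriptions(items: pd.DataFrame) -> pd.DataFrame:
    """Return one row per stock code that has more than one distinct description."""
    summary = (
        items.groupby("StockCode")["Description"]
        .agg(
            n_descriptions="nunique",
            # Join the distinct descriptions into one string for easy reading.
            descriptions=lambda s: " | ".join(sorted(s.unique())),
        )
        .reset_index()
    )
    return summary[summary["n_descriptions"] > 1].reset_index(drop=True)


summary = summarize_descriptions(toy_items)
print(summary)
```

Out-of-place descriptions such as "AMAZON" then show up next to the real product name for the same code, which may be easier to scan than the raw table.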

In [11]:
stock_items.query('Description == "AMAZON"')

Unnamed: 0,StockCode,Description
499887,72807A,AMAZON
503634,22848,AMAZON
519355,22925,AMAZON


In [12]:
stock_items.query('Description == "FOUND"')

Unnamed: 0,StockCode,Description
241840,22734,FOUND


In [13]:
stock_items.query('Description == "FBA"')

Unnamed: 0,StockCode,Description
270339,82583,FBA


In [14]:
# A description counts as multi-word when it contains at least one space.
stock_items['multi_word'] = stock_items['Description'].str.contains(' ', regex=False)
stock_items.query("multi_word == False")

Unnamed: 0,StockCode,Description,multi_word
45,POST,POSTAGE,False
1423,C2,CARRIAGE,False
152709,S,SAMPLES,False
241840,22734,FOUND,False
270339,82583,FBA,False
499887,72807A,AMAZON,False
503634,22848,AMAZON,False
519355,22925,AMAZON,False


Whenever a stock code represents a product, its description contains multiple words. The single-word descriptions do not describe the product; they describe a transaction involving it. Therefore, every row with a single-word description will be dropped, including those attached to regular stock codes (codes that represent a product).
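The rule can be illustrated on a few toy rows. This is only a sketch: the stock code "11111" is hypothetical, and "multi-word" is taken to mean "contains at least one space", as in the cell above.

```python
import pandas as pd

# Toy rows; the code "11111" is made up for illustration.
toy_items = pd.DataFrame({
    "StockCode": ["POST", "11111", "11111", "10120"],
    "Description": ["POSTAGE", "FOUND", "WHITE METAL LANTERN", "DOGGY RUBBER"],
})

# True when the description contains at least one space.
is_multi_word = toy_items["Description"].str.contains(" ", regex=False)

# Keep only the rows whose description has more than one word.
product_rows = toy_items[is_multi_word]
print(product_rows)
```

Note that a description with only a leading or trailing space would also pass this check, so stripping whitespace first might be worth considering.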

In [15]:
stock_codes = sorted(stock_items['StockCode'].unique())

In [16]:
stock_codes[-15:]

['90214Y',
 '90214Z',
 'AMAZONFEE',
 'C2',
 'DCGS0003',
 'DCGS0004',
 'DCGS0069',
 'DCGS0070',
 'DCGS0076',
 'DCGSSBOY',
 'DCGSSGIRL',
 'DOT',
 'PADS',
 'POST',
 'S']

In addition, the stock codes that do not represent products must be dropped. These are "AMAZONFEE", "C2", "DOT", "PADS", "POST", and "S".
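One possible way to apply such an exclusion list is a boolean mask built with `isin`, sketched here on a few toy rows. The constant name `NON_PRODUCT_CODES` is an assumption made for this example.

```python
import pandas as pd

# Codes that describe fees, postage, samples, etc. rather than products.
NON_PRODUCT_CODES = {"AMAZONFEE", "C2", "DOT", "PADS", "POST", "S"}

toy_items = pd.DataFrame({
    "StockCode": ["10002", "POST", "DCGS0069", "S"],
    "Description": [
        "INFLATABLE POLITICAL GLOBE",
        "POSTAGE",
        "OOH LA LA DOGS COLLAR",
        "SAMPLES",
    ],
})

# True for rows whose code is NOT in the exclusion list.
is_product = ~toy_items["StockCode"].isin(NON_PRODUCT_CODES)
products = toy_items[is_product]
print(products)
```

Unlike removing entries from a list one by one, this mask does not fail when one of the excluded codes happens to be absent from the data.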

In [17]:
to_drop = [
    "AMAZONFEE",
    "C2",
    "DOT",
    "PADS",
    "POST",
    "S"
]

# Every stock code except the non-product codes listed above.
to_keep = [code for code in stock_codes if code not in to_drop]

In [18]:
stock_items = stock_items[stock_items["StockCode"].isin(to_keep)].query("multi_word == True")[['StockCode', 'Description']]

In [19]:
stock_items = stock_items.sort_values('StockCode').reset_index(drop=True)
# Keep a reference to the full table (all description variations) before de-duplicating.
stock_items_all = stock_items

In [20]:
stock_items_all

Unnamed: 0,StockCode,Description
0,10002,INFLATABLE POLITICAL GLOBE
1,10080,GROOVY CACTUS INFLATABLE
2,10120,DOGGY RUBBER
3,10123C,HEARTS WRAPPING TAPE
4,10124A,SPOTS ON RED BOOKCOVER TAPE
...,...,...
4118,DCGS0069,OOH LA LA DOGS COLLAR
4119,DCGS0070,CAMOUFLAGE DOG COLLAR
4120,DCGS0076,SUNJAR LED NIGHT NIGHT LIGHT
4121,DCGSSBOY,BOYS PARTY BAG


In [21]:
stock_items_all.to_csv(f"{os.getenv('PROJ_REPOS')}\\data\\Stock_Items_All.csv")

In [22]:
stock_items = stock_items.drop_duplicates('StockCode')

In [23]:
stock_items.to_csv(f"{os.getenv('PROJ_REPOS')}\\data\\Stock_Items.csv")

Stock_Items_All.csv contains every valid description variation for each stock code. Stock_Items.csv keeps only the first description listed for each code, so every stock code maps to exactly one description. This assumes that each stock code represents one particular product and that the items sharing a code differ only slightly (size, color, etc.).
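The difference between the two files can be sketched on a handful of rows taken from the tables above. The variable names `all_variations` and `one_per_code` are hypothetical, and the stable sort is an assumption added here so that the "first" description per code is predictable.

```python
import pandas as pd

toy_items = pd.DataFrame({
    "StockCode": ["85123A", "85123A", "90014C", "90014C", "71053"],
    "Description": [
        "WHITE HANGING HEART T-LIGHT HOLDER",
        "CREAM HANGING HEART T-LIGHT HOLDER",
        "SILVER AND BLACK ORBIT NECKLACE",
        "SILVER/BLACK ORBIT NECKLACE",
        "WHITE METAL LANTERN",
    ],
})

# Analogue of Stock_Items_All.csv: every (code, description) pair, sorted by code.
# kind="stable" preserves the original row order within each code.
all_variations = (
    toy_items.sort_values("StockCode", kind="stable").reset_index(drop=True)
)

# Analogue of Stock_Items.csv: the first description per code only.
one_per_code = all_variations.drop_duplicates("StockCode").reset_index(drop=True)

print(all_variations)
print(one_per_code)
```

Here the cream variant of 85123A and the second spelling of 90014C appear only in `all_variations`; `one_per_code` has exactly one row per stock code.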