# Searching for the category

For this code along we are only going to use the products DataFrame. However, if you believe there is information in other tables that can help to create categories, please feel free to explore.

In [1]:
import pandas as pd

In [2]:
# products_cl.csv
url = "https://drive.google.com/file/d/1s7Lai4NSlsYjGEPg1QSOUJobNYVsZBOJ/view?usp=sharing" 
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
products_cl = pd.read_csv(path)

In [3]:
product_category_df = products_cl.copy()

In [4]:
product_category_df.head()

Unnamed: 0,sku,name,desc,price,in_stock,type
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,1,8696
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.0,0,13855401
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.0,0,1387
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.0,0,1230
4,KIN0007,Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM,2GB RAM Mac mini and iMac (2006/07) MacBook Pr...,34.99,1,1364


## 1.&nbsp; Category creation by search term
Let's start by creating a column `category`. For now we'll fill this column with a blank string `""`.

In [5]:
product_category_df["category"] = ""
product_category_df.head()

Unnamed: 0,sku,name,desc,price,in_stock,type,category
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,1,8696,
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.0,0,13855401,
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.0,0,1387,
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.0,0,1230,
4,KIN0007,Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM,2GB RAM Mac mini and iMac (2006/07) MacBook Pr...,34.99,1,1364,


We can find all the products with certain words in their `description` using `.loc[]` and `.str.contains()`. Here we'll look at all the items that have the word `keyboard` in their description.

In [6]:
product_category_df.loc[product_category_df["desc"].str.lower().str.contains("keyboard"), :]

Unnamed: 0,sku,name,desc,price,in_stock,type,category
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.00,0,13855401,
15,MOS0021,Clearguard Moshi MacBook Pro and Air,Keyboard Protector MacBook Pro 13-inch Retina ...,24.95,0,13835403,
24,APP0277,Apple Wireless Keyboard Keyboard (OEM) Mac,Ultrathin keyboard Apple Bluetooth Spanish (un...,79.00,0,13855401,
64,HGD0012,Henge Docks Click keyboard support iMac,Base to hold the Apple Magic TrackPad and Wire...,29.00,0,8696,
365,LOG0084,Logitech Ultrathin Keyboard Cover Keyboard Cov...,Ultrathin cover and cover with Bluetooth keybo...,89.99,0,12575403,
...,...,...,...,...,...,...,...
9720,PAC2508,Replacement Magic Wireless Keyboard by Matias ...,Keyboard replacement service at the time of pu...,119.99,1,13855401,
9751,MTF0008,Mistify Clean Screens Natural 500ml.,Spray cleaning screens and keyboards.,14.99,1,12085400,
9796,ZAG0026-A,Open - Zagg Rugged Keyboard Folio iPad Messeng...,Case reconditioned keyboard and adjustable pos...,99.99,0,12575403,
9932,APP1472,Apple Magic Keyboard English International,English keyboard Mac and Apple iPad Ultrathin ...,119.00,1,13855401,


Next, we change the value in the category column to `keyboard` for all of these keyboard products. 

In [7]:
product_category_df.loc[product_category_df["desc"].str.lower().str.contains("keyboard"), "category"] = "keyboard"

Let's take a look at the effect that had on the `category` column.

In [8]:
product_category_df["category"].value_counts()

            9903
keyboard      89
Name: category, dtype: int64

## 2.&nbsp; Category creation using regex
We can also use a product's `name` to select products for our categories.

In [9]:
product_category_df.loc[product_category_df["name"].str.lower().str.contains("apple iphone"), :]

Unnamed: 0,sku,name,desc,price,in_stock,type,category
35,APP0308,AV Cable Adapter Apple iPhone iPad and iPod white,IPhone iPad iPod adapter and AV cable.,45.00,0,1230,
214,REP0100,Color change to White Apple iPhone 4,It is including parts and labor..,94.21,0,"1,44E+11",
215,REP0052,Color change to White Apple iPhone 4,It is including parts and labor..,94.21,0,"1,44E+11",
579,APP0675,Apple iPhone 5S 32GB Space Gray,New Free iPhone 5S 32GB (ME435Y / A).,559.00,0,,
956,APP0823,Apple iPhone 6 16GB Silver,New iPhone 6 16GB Free (MG482QL / A).,639.00,0,,
...,...,...,...,...,...,...,...
9790,AP20455,Like new - Apple iPhone 8 256GB Gold,Apple iPhone 8 reconditioned 256GB in Gold rea...,979.00,0,113291716,
9794,APP2482-A,Open - Apple iPhone 8 Plus 256GB Gold,Refurbished Apple iPhone 8 Plus 256GB Free Gold,1089.00,0,113281716,
9929,APP2477-A,Open - Apple iPhone 8 Plus 64GB Space Gray,Apple iPhone 8 Plus 64GB Space Gray,919.00,0,113281716,
9958,AP20467,Like new - Apple iPhone Silicone Case Cover 7 ...,Reconditioned silicone sleeve microfiber Apple...,45.00,0,11865403,


Looks like we get a lot of accessories included in this search. We can refine this using a little regex. Here, we will add `.{0,7}` at the beginning of the search: this means we will find all `apple iphone`s that have 7 or less characters preceding the term "apple iphone" - if there's 8 characters preceding the search term, it won't be found. This should help refine our search by using the nomenclature of the DataFrame to our advantage.

If you feel unsure about regex, please use [regex101](https://regex101.com/). It's really useful for checking your code, and parts of other people's code that you're unsure about.

In [10]:
product_category_df.loc[product_category_df["name"].str.lower().str.contains("^.{0,7}apple iphone"), :]

Unnamed: 0,sku,name,desc,price,in_stock,type,category
579,APP0675,Apple iPhone 5S 32GB Space Gray,New Free iPhone 5S 32GB (ME435Y / A).,559.0,0,,
956,APP0823,Apple iPhone 6 16GB Silver,New iPhone 6 16GB Free (MG482QL / A).,639.0,0,,
961,APP0829,Apple iPhone 6 Plus 16GB Silver,New iPhone 6 Plus 16G Free (MGA92QL / A).,749.0,0,,
962,APP0822,Apple iPhone 6 16GB Space Gray,New iPhone 6 16GB Free (MG472QL / A).,639.0,0,,
963,APP0825,Apple iPhone 6 64GB Space Gray,New iPhone 6 64GB Free (MG4F2QL / A).,749.0,0,,
...,...,...,...,...,...,...,...
9585,APP1634-A,Open - Apple iPhone 7 Plus 32GB Black,New 32GB Apple iPhone 7 Plus Free Black,779.0,0,85651716,
9587,APP2540-A,Open - Apple iPhone Leather Folio X Baya,Leather case with box and official cover Apple,109.0,0,11865403,
9714,APP2562-A,Open - Apple iPhone Leather Case Cover Red,Reconditioned skin sheath official Apple desig...,45.0,0,11865403,
9794,APP2482-A,Open - Apple iPhone 8 Plus 256GB Gold,Refurbished Apple iPhone 8 Plus 256GB Free Gold,1089.0,0,113281716,


Now we can use the same trick as before to set the category - selecting the `category` column and setting it to the string of our choice.

In [11]:
product_category_df.loc[product_category_df["name"].str.lower().str.contains("^.{0,7}apple iphone"), "category"] = "smartphone"

In [12]:
product_category_df["category"].value_counts()

              9634
smartphone     269
keyboard        89
Name: category, dtype: int64

## 3.&nbsp; One product with multiple categories
A product may fit into multiple categories. To help us create multiple categories for one product, we will use the python addition assignment `+=`. The addition assignment is a shorthand way to add something (number, string, etc...) to a variable without changing the variable name. 

Let's have a look at a couple of examples.

In [13]:
a = 10
a = a + 5
a

15

In [14]:
a = 10
a += 5
a

15

In [15]:
b = "Tyrannosaurus"
b = b + " rex"
b

'Tyrannosaurus rex'

In [16]:
b = "Tyrannosaurus"
b += " rex"
b

'Tyrannosaurus rex'

Now let's look at how this can help us in our category creation.

First, we'll reset all the values in the category column to an empty string `""`.

In [17]:
product_category_df["category"] = ""

Now, let's create some categories and utilise the addition assignment.

In [18]:
product_category_df.loc[product_category_df["desc"].str.lower().str.contains("keyboard"), "category"] += ", keyboard"
product_category_df.loc[product_category_df["name"].str.lower().str.contains("^.{0,7}apple iphone"), "category"] += ", smartphone"
product_category_df.loc[product_category_df["name"].str.lower().str.contains("^.{0,7}apple ipod"), "category"] += ", ipod"
product_category_df.loc[product_category_df["name"].str.lower().str.contains("^.{0,7}apple ipad|tablet"), "category"] += ", tablet"
product_category_df.loc[product_category_df["name"].str.lower().str.contains("imac|mac mini|mac pro"), "category"] += ", desktop"

In [19]:
product_category_df["category"].value_counts()

                       8362
, desktop               923
, tablet                307
, smartphone            269
, keyboard               83
, ipod                   42
, keyboard, tablet        4
, keyboard, desktop       2
Name: category, dtype: int64

As you can see, some products now have 2 categories instead of just one. At the end, you can use your skills with string to tidy up the opening comma and space in the `category` column.

# Challenge. Your categories
Now it's your turn. We'll reset the Dataframe so that no categories exist, and it's up to you to create the categories based on keywords in the name and description. Feel free to go wild and make as many categories as you like.
* Remember you can also use regex to refine your searches.
* Remember you can use the or operator `|` to search for multiple terms at once.
* Remember to tidy up any untidy strings at the end.

In [21]:
# your code here
product_category_df["category"] = ""

In [22]:
product_category_df.loc[product_category_df.desc.str.lower().str.contains("keyboard"), "category"] += ", keyboard"
product_category_df.loc[product_category_df.name.str.lower().str.contains("^.{0,7}apple iphone"), "category"] += ", smartphone"
product_category_df.loc[product_category_df.name.str.lower().str.contains("^.{0,7}apple ipod"), "category"] += ", ipod"
product_category_df.loc[product_category_df.name.str.lower().str.contains("^.{0,7}apple ipad|tablet"), "category"] += ", tablet"
product_category_df.loc[product_category_df.name.str.lower().str.contains("imac|mac mini|mac pro"), "category"] += ", desktop"
product_category_df.loc[product_category_df.name.str.lower().str.contains("macbook"), "category"] += ", laptop"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("backpack"), "category"] += ", backpack"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("case|funda|housing|casing|folder"), "category"] += ", case"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("dock|hub|connection|expansion box"), "category"] += ", dock"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("cable|connector|lightning to usb|wall socket|power strip"), "category"] += ", cable"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("flash drive|hard drive|pendrive|hard disk|memory|storage|^ssd|^hardssd|modules|ssd expansion"), "category"] = ", storage"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("battery"), "category"] += ", battery"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("headset|headphones"), "category"] += ", headset"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("charger"), "category"] += ", charger"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("mouse|trackpad"), "category"] += ", mouse"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("stand|support"), "category"] += ", stand"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("strap|armband|belt|bracelet"), "category"] += ", strap"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("^.{0,6}apple watch|smartwatch|smart watch"), "category"] += ", smartwatch"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("adapter"), "category"] += ", adapter"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("^.{0,7}ram"), "category"] += ", ram"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("protect|cover|sleeve|screensaver|shell"), "category"] += ", protection"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("nas|server|raid|synology"), "category"] += ", server"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("scale"), "category"] += ", scale"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("thermometer"), "category"] += ", thermometer"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("monitor"), "category"] += ", monitor"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("speaker|music system"), "category"] += ", speaker"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("camera"), "category"] += ", camera"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("pointer"), "category"] += ", pointer"
product_category_df.loc[product_category_df.desc.str.lower().str.contains("refurbished|reconditioned|like new"), "category"] += ", refurbished"

In [23]:
product_category_df.category.value_counts()

                                         1150
, storage                                1000
, desktop                                 714
, server                                  589
, laptop                                  546
                                         ... 
, desktop, dock, stand                      1
, case, battery, headset                    1
, keyboard, cable, protection               1
, battery, monitor                          1
, keyboard, mouse, stand, refurbished       1
Name: category, Length: 250, dtype: int64

**Problem:** We can see that we still have a lot of products that don't fall into our categories.

**Solution:** Label all of these products category as `other`

In [24]:
product_category_df.loc[product_category_df["category"] == "", "category"] = ", other"

**Problem:** all of the categories start with a comma `, `

**Solution:** let's exclude the 1st 2 characters of all categories

In [25]:
product_category_df["category"] = product_category_df["category"].str[2:]

Let's take a look at the 1st few rows to see how the categories have worked.

In [26]:
product_category_df.head(10)

Unnamed: 0,sku,name,desc,price,in_stock,type,category
0,RAI0007,Silver Rain Design mStand Support,Aluminum support compatible with all MacBook,59.99,1,8696,stand
1,APP0023,Apple Mac Keyboard Keypad Spanish,USB ultrathin keyboard Apple Mac Spanish.,59.0,0,13855401,keyboard
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.0,0,1387,"cable, mouse"
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.0,0,1230,"dock, cable"
4,KIN0007,Mac Memory Kingston 2GB 667MHz DDR2 SO-DIMM,2GB RAM Mac mini and iMac (2006/07) MacBook Pr...,34.99,1,1364,ram
5,APP0073,Apple Composite AV Cable iPhone and iPod white,IPhone and iPod AV Cable Dock to Composite Video.,45.0,0,1230,"dock, cable"
6,KIN0008,Mac Memory Kingston 1GB 667MHz DDR2 SO-DIMM,1GB RAM Mac mini and iMac (2006/07) MacBook Pr...,18.99,0,1364,ram
7,KIN0009,Mac Memory Kingston 2GB 800MHz DDR2 SO-DIMM,2GB RAM iMac with Intel Core 2 Duo (Penryn).,36.99,0,1364,ram
8,KIN0001-2,Mac memory Kingston 4GB (2x2GB) 667MHz DDR2 SO...,RAM 4GB (2x2GB) Mac mini and iMac (2006/07) Ma...,74.0,0,1364,ram
9,APP0100,Apple Adapter Mini Display Port to VGA,Adapter Mini Display Port to VGA MacBook and M...,35.0,0,1325,adapter


In [28]:
product_category_df.loc[product_category_df.category.str.contains("cable"), :]

Unnamed: 0,sku,name,desc,price,in_stock,type,category
2,APP0025,Mighty Mouse Apple Mouse for Mac,mouse Apple USB cable.,59.00,0,1387,"cable, mouse"
3,APP0072,Apple Dock to USB Cable iPhone and iPod white,IPhone dock and USB Cable Apple iPod.,25.00,0,1230,"dock, cable"
5,APP0073,Apple Composite AV Cable iPhone and iPod white,IPhone and iPod AV Cable Dock to Composite Video.,45.00,0,1230,"dock, cable"
14,MOS0013,Adapter Moshi FireWire 400 to FireWire 800,FireWire 400 adapter cable FireWire 800.,20.00,0,1325,"cable, adapter"
19,APP0234,Apple Dock Connector to VGA,Dock Connector to VGA IOS.,35.00,0,13955395,"dock, cable"
...,...,...,...,...,...,...,...
9836,BEL0377,Belkin Thunderbolt Cable 3 40Gb / s 100W 2m,2m Thunderbolt cable 3 with transmission data ...,79.99,1,1325,cable
9928,OWC0227-A,Open - OWC USB Dock-C 10-port power 80W Plata,Aluminum Hub with 10 different ports include 2...,217.99,0,12585395,"dock, cable"
9936,MOS0247,Moshi USB-C to DisplayPort Cable,15m adapter cable with USB-C connection Displa...,54.95,0,1325,"dock, cable, adapter"
9969,AP20468,Like new - Apple iPhone Black Lightning Dock,Support base and refitted with dock connector ...,59.00,0,13615399,"dock, cable, stand"


## 4.&nbsp; [BONUS] Using `type` to create categories
There could be another way to create categories, but this one you'll have to explore this one alone.

We have the mysterious column `type` in the `products` table. This could potentially be ready-made categories labelled with numbers instead of words. Let's investigate.

In [29]:
category_type_df = products_cl.copy()

Here are the `type`s that have the most products.

In [30]:
category_type_df.groupby("type").count().nlargest(10, "sku")

Unnamed: 0_level_0,sku,name,desc,price,in_stock
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
11865403,1057,1057,1057,1057,1057
12175397,939,939,939,939,939
1298,783,783,783,783,783
11935397,562,562,562,562,562
11905404,454,454,454,454,454
1282,373,373,373,373,373
12635403,362,362,362,362,362
13835403,269,269,269,269,269
"5,74E+15",247,247,247,247,247
1364,216,216,216,216,216


Let's have a look at the first `type` to see if we can make categories from this column.

In [31]:
category_type_df.loc[category_type_df["type"] == "11865403", :].sample(10)

Unnamed: 0,sku,name,desc,price,in_stock,type
8199,PUR0161,Pure Hologram iPhone 7/8 Case Rosa,Ultrathin hologram effect protection for your ...,19.95,1,11865403
2410,KUA0028,Support Kukaclip car + Funda iPhone 6 / 6S Tra...,Magnetic car holder with 360 degrees rotating ...,24.99,0,11865403
5249,TUC0292,Tucano DUEINUNO iPhone Case 2 in 1 7/8 Rosa,Case 2 in 1 with removable rear cover and anti...,26.9,0,11865403
1808,SPE0156,Speck CandyShell Grip Case for iPhone 6 Plus 5...,ultra tough for iPhone 6 Plus. housing,26.99,0,11865403
2538,APP1149,Case Apple iPhone 6 / 6S Silicone Case Antique...,Ultrathin silicone case and microfiber premium...,39.0,0,11865403
8909,BEL0323,Belkin iPhone Case SheerForce 8 Plus / 7 Plus ...,Case against impact and wear resistant materia...,24.99,1,11865403
1404,JMO0070,Just Mobile Phone Case Skin AluFrame 6 Pink,Aluminum and leather iPhone 6.,39.99,1,11865403
8254,TUC0349,Tucano iPhone Case Filo Booklet X,Eco leather cover anti-radiation card slots .....,26.9,1,11865403
5785,MOP0099,Hold Force Base Case Mophie iPhone Case 7 Wrap,Case Ultra Thin magnetic plate on the back to ...,39.95,0,11865403
1408,KUA0019,Support Kukaclip car + Funda iPhone 6 / 6S Black,Magnetic car holder with 360 degrees rotating ...,24.99,0,11865403


Looks like this is a category of phone cases.

Let's have a look at the 2nd largest type to see if that's also a clear category.

In [32]:
category_type_df.loc[category_type_df["type"] == "12175397", :].sample(10)

Unnamed: 0,sku,name,desc,price,in_stock,type
8500,PAC2409,DS418play Synology NAS Server | 10GB RAM | 48T...,4-bay NAS server to accommodate 4K Ultra HD files,2601.31,0,12175397
8609,PAC2374,Synology DS218 + NAS Server | 6GB RAM | 20TB (...,NAS storage server integrated with special foc...,1206.96,1,12175397
3235,PAC1392,Pack QNAP TS-451 + | 2GB RAM + WD Red 32TB,Pack QNAP TS-451 + with 2GB of RAM memory + 32...,1914.99,0,12175397
7767,PAC1696,Pack QNAP TS-251A NAS Server | 16GB RAM | Seag...,NAS with 16GB of RAM and 8 TB (2x4TB) Seagate ...,851.65,0,12175397
3227,PAC1393,Pack QNAP TS-451 + | 8GB RAM | Seagate 32TB Ir...,Pack QNAP TS-451 + with 8GB of RAM memory + 32...,2039.95,0,12175397
4375,PAC1435,Synology DS916 + Pack | 8GB RAM | WD 32TB Network,Synology DS916 + with 8GB of RAM memory + 32TB...,2065.2,0,12175397
8548,PAC2220,Synology DS718 + NAS Server | 16GB RAM | 16TB ...,Scalable NAS server with transcoding 4K: 4-cor...,1360.71,0,12175397
8544,PAC2216,Synology DS718 + NAS Server | 16GB RAM | 4TB (...,Scalable NAS server with transcoding 4K: 4-cor...,840.71,0,12175397
3148,PAC1629,QNAP HS-251 + | Seagate 12TB Iron Wolf,HS-251 + NAS with 12TB (2x6TB) IronWolf Seagat...,958.97,0,12175397
3884,PAC1375,Synology Pack DS216J | WD 16TB Network,NAS + 16TB (2x8TB) WD Network for Mac and PC.,893.99,0,12175397


Looks like this category is full of servers.

I wonder how many `type`s account for most of our products?

In [33]:
n = 30
print(f"With the {n} largest types, we account for {((category_type_df.groupby('type').count().nlargest(n, 'sku')['sku'].sum()) / (category_type_df.shape[0]) * 100).round(2)}% of all products.")

With the 30 largest types, we account for 78.4% of all products.


Looks like we can simply investigate 30 types and set the categories, then the remaining 20% of products can have the category `other`.

Use the skills you learnt above to change the category for each type.