In [4]:
import pandas as pd
products = pd.read_csv('../data/eniac/clean/products_cl.csv')

In [5]:
import re

# Introduction to Regex

## GOAL

Introduce the `re` (regular expression) library, show its main functions, and explain how to filter text based on regular expressions.

## DESCRIPTION

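This workshop reviews the following functions of the `re` module:

* `findall()`
* `search()`
* `split()`
* `sub()`

And the following members of the Match object returned by `search()`:

* `.span()`
* `.string` (an attribute, not a method)
* `.group()`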

Metacharacters: ` . ^ $ * + ? { } [ ] \ | ( )`

Special Sequences: `\A` `\b` `\d` `\s`

Finally, we will see how to compile regular expressions so they can be reused.

More information is available at this [link](https://www.w3schools.com/python/python_regex.asp).

In [6]:
products.sample()

Unnamed: 0,sku,name,desc,price
6058,SPE0169-A,"Open - Speck SeeThru Case Macbook Pro 13 ""Blue...",Protective polycarbonate shell for MacBook Pro...,49.9


In [7]:
# extract a specific description
prod_descr = products.query('sku == "DLK0139"')['desc'].values[0]
prod_descr

'Full HD video surveillance camera with 180 degrees and night vision compatible HomeKit'

### `findall`

Returns a list containing all matches

In [8]:
# return all occurrences of the pattern in a string
re.findall('a', prod_descr)

['a', 'a', 'a', 'a', 'a']

### `search`

Returns a Match object if there is a match anywhere in the string. If there is more than one match, only the first occurrence of the match will be returned.

Match objects have the following methods and attributes:
- `.span()` returns a tuple containing the start and end positions of the match.
- `.string` returns the string passed into the function.
- `.group()` returns the part of the string where there was a match.

In [9]:
prod_descr

'Full HD video surveillance camera with 180 degrees and night vision compatible HomeKit'

In [11]:
match_obj = re.search('video', prod_descr)

In [13]:
match_obj.string

'Full HD video surveillance camera with 180 degrees and night vision compatible HomeKit'

In [16]:
match_obj.group()

'video'

In [17]:
match_obj.span()

(8, 13)
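
When nothing matches, `re.search` returns `None` instead of a Match object, so it is usually worth checking the result before calling `.group()` or `.span()`. A minimal sketch, using a shortened copy of the description (the variable names here are illustrative only):

```python
import re

text = "Full HD video surveillance camera"

match = re.search("video", text)
if match is not None:
    start, end = match.span()
    # Slicing the original string with the span gives back the matched text.
    print(text[start:end] == match.group())  # True

# No match: re.search returns None, so calling .group() on it
# would raise an AttributeError.
missing = re.search("radio", text)
print(missing)  # None
```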

### `split`

Returns a list where the string has been split at each match

In [22]:
re.split(' and ', prod_descr) # ' and ' is removed from the list

['Full HD video surveillance camera with 180 degrees',
 'night vision compatible HomeKit']
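
`re.split` takes a pattern rather than a fixed separator, so it can split on several alternatives at once. The sketch below assumes a shortened copy of the description and also shows the optional `maxsplit` argument:

```python
import re

text = "Full HD video surveillance camera with 180 degrees and night vision"

# Split on either " and " or " with " in a single pass.
parts = re.split(r" and | with ", text)
print(parts)
# ['Full HD video surveillance camera', '180 degrees', 'night vision']

# maxsplit limits how many splits are made; the rest stays in one piece.
first_two = re.split(r"\s", text, maxsplit=2)
print(first_two)
# ['Full', 'HD', 'video surveillance camera with 180 degrees and night vision']
```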

### `sub`

Replaces one or many matches with a string

In [24]:
dark_descr = re.sub("camera", "pool", prod_descr)
print(dark_descr)

Full HD video surveillance pool with 180 degrees and night vision compatible HomeKit
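
Two options of `re.sub` are often useful: `count` limits the number of replacements, and a backreference such as `\1` reuses the text captured by a group. A small sketch on a made-up sentence:

```python
import re

text = "camera with 180 degrees and camera mount"

# Replace only the first occurrence.
once = re.sub("camera", "pool", text, count=1)
print(once)    # pool with 180 degrees and camera mount

# \1 in the replacement refers to whatever the first group (\d+) matched.
tagged = re.sub(r"(\d+)", r"<\1>", text)
print(tagged)  # camera with <180> degrees and camera mount
```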


### METACHARACTERS


Some characters are special metacharacters, and don’t match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning.

` . ^ $ * + ? { } [ ] \ | ( )`

#### `[]` means set of characters:
 
 - `[abc]` will match any of the characters a, b, or c
 - `[a-c]` will do the same
 - `[a-z]` will match any lowercase letter

In [25]:
alphanumeric = "4298fsfsv012rvv21v9"

In [26]:
re.findall(r"[a-z]", alphanumeric)

['f', 's', 'f', 's', 'v', 'r', 'v', 'v', 'v']
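
Sets can also hold digit ranges, be negated with a leading `^`, and be combined with `+` to capture whole runs. A short sketch on the same string:

```python
import re

alphanumeric = "4298fsfsv012rvv21v9"

# A range of digits instead of letters.
digits = re.findall(r"[0-9]", alphanumeric)

# A leading ^ inside the brackets negates the set: anything that is NOT a-z.
non_letters = re.findall(r"[^a-z]", alphanumeric)

# Adding + returns runs of consecutive letters instead of single characters.
runs = re.findall(r"[a-z]+", alphanumeric)

print("".join(digits))  # 4298012219
print(runs)             # ['fsfsv', 'rvv', 'v']
```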

`\` can help us escape special characters

In [28]:
alphanumeric_with_special = alphanumeric + "[a-z]"
print(alphanumeric_with_special)
# CHALLENGE: use \ to escape the square brackets
re.findall(r"\[a-z]", alphanumeric_with_special)

4298fsfsv012rvv21v9[a-z]


['[a-z]']

#### Some special sequences:

- `\A` - Matches only at the beginning of the string
- `\b` - Matches at a word boundary, i.e. at the beginning or at the end of a word
- `\d` - Matches any digit (0-9); `\D` matches any character that is NOT a digit
- `\s` - Matches any whitespace character; `\S` matches any character that is NOT whitespace

In [29]:
prod_descr

'Full HD video surveillance camera with 180 degrees and night vision compatible HomeKit'

In [43]:
# find all possible numbers
re.findall(r"\d", prod_descr)

['1', '8', '0']
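
The other special sequences can be tried on the same description. This is a minimal sketch that repeats the string so it runs on its own:

```python
import re

text = ("Full HD video surveillance camera with 180 degrees "
        "and night vision compatible HomeKit")

# \A anchors the pattern to the very start of the string.
print(re.findall(r"\AFull", text))    # ['Full']
print(re.findall(r"\Avideo", text))   # []

# \b is a word boundary: only words that START with "vi" are returned.
print(re.findall(r"\bvi\w*", text))   # ['video', 'vision']

# \d+ groups consecutive digits into a single match.
print(re.findall(r"\d+", text))       # ['180']

# \s matches each whitespace character.
n_spaces = len(re.findall(r"\s", text))
print(n_spaces)                       # 12
```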

### `.`	Any character (except newline character)

In [54]:
re.findall(r'..c', prod_descr)

['anc', 'e c', 'n c']

### `+` One or more occurrences

In [57]:
re.findall(r'e+', prod_descr)

['e', 'e', 'e', 'e', 'e', 'ee', 'e', 'e']

In [58]:
re.sub("e+", "__", prod_descr)

'Full HD vid__o surv__illanc__ cam__ra with 180 d__gr__s and night vision compatibl__ Hom__Kit'

### `{}`- Exactly the specified number of occurrences

In [60]:
re.findall(r"e{2}", prod_descr)

['ee']

In [62]:
re.sub(r"e{2}", "__", prod_descr)

'Full HD video surveillance camera with 180 degr__s and night vision compatible HomeKit'
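
Curly braces also accept a range: `{m,n}` means between m and n occurrences, and `{m,}` means at least m. The sketch below uses a made-up string and `\b` so that only whole words are compared:

```python
import re

words = "a aa aaa aaaa"

exactly_two = re.findall(r"\ba{2}\b", words)
two_to_three = re.findall(r"\ba{2,3}\b", words)
three_or_more = re.findall(r"\ba{3,}\b", words)

print(exactly_two)    # ['aa']
print(two_to_three)   # ['aa', 'aaa']
print(three_or_more)  # ['aaa', 'aaaa']
```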

### `^` Starts with

In [70]:
re.findall(r"^F", prod_descr)

['F']
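
The counterpart of `^` is `$`, which anchors the pattern to the end of the string. A short sketch on the same description (repeated here so the cell is self-contained):

```python
import re

text = ("Full HD video surveillance camera with 180 degrees "
        "and night vision compatible HomeKit")

starts = re.findall(r"^Full", text)     # the string starts with "Full"
ends = re.findall(r"HomeKit$", text)    # the string ends with "HomeKit"
not_start = re.findall(r"^HD", text)    # "HD" is present, but not at the start

print(starts, ends, not_start)  # ['Full'] ['HomeKit'] []
```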

#### How to apply it to the whole DataFrame?

In [129]:
products.loc[products['name'].str.contains(r'^Fit')]

Unnamed: 0,sku,name,desc,price
123,FIT0009,Fitbit Aria scale smart white,smart scale with WiFi connection.,119.99
124,FIT0010,Fitbit Aria scale smart black,smart scale with WiFi connection.,119.99
270,FIT0013,Fitbit ZIP monitor green activity,Activity Monitor compact and lightweight.,59.99
699,FIT0023,Fitbit Flex Bracelet navy activity monitor,Control activity bracelet with two interchange...,99.99
1454,FIT0024,Fitbit Charge Bracelet Black Size L,Bracelet size L activity and sleep monitor wor...,129.95
1510,FIT0026,Fitbit Charge HR Bracelet Black Size L,Bracelet sport and activity monitors sleep.,149.95
2178,FIT0028,Fitbit Surge Figured Black Clock,Smartwatch with monitoring activity and sleep ...,249.95
2181,FIT0029,Fitbit Surge Black Clock Small size,Smartwatch with monitoring activity and sleep ...,249.95
9533,FIT0062,Fitbit Smartwatch Ionic Gray,Fitbit is the sports Smartwatch Ionic waterpro...,349.95
9534,FIT0064,Fitbit Orange Blue Ionic Smartwatch,Fitbit is the sports Smartwatch Ionic waterpro...,349.95


Learn more about how to use regular expressions with pandas:

* https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/

### `*`	Zero or more occurrences

In [80]:
similar_words = ["hey", "hay", "how", "h i j k", "h", "ha", "oops"]

In [81]:
# "h.*" matches an "h" followed by zero or more of any character
for word in similar_words:
    print(re.findall("h.*", word))

['hey']
['hay']
['how']
['h i j k']
['h']
['ha']
[]


In [104]:
print(prod_descr)
re.findall(r"vi*\S", prod_descr)  # raw string, so \S is not treated as a string escape

Full HD video surveillance camera with 180 degrees and night vision compatible HomeKit


['vid', 've', 'vis']

In [107]:
# Another way to show
re.findall(r"vi*\w+", prod_descr)
# \w: Returns a match where the string contains any word characters 
#    (characters from a to Z, digits from 0-9, and the underscore _ character)
# +: One or more occurrences

['video', 'veillance', 'vision']
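
The difference between `*` (zero or more) and `+` (one or more) is easiest to see when the whole word has to match. A minimal sketch with a made-up word list, using `re.fullmatch`:

```python
import re

words = ["h", "ha", "haa", "oops"]

# "ha*": an "h" followed by zero or more "a", so a bare "h" also qualifies.
star = [w for w in words if re.fullmatch(r"ha*", w)]

# "ha+": at least one "a" is required after the "h".
plus = [w for w in words if re.fullmatch(r"ha+", w)]

print(star)  # ['h', 'ha', 'haa']
print(plus)  # ['ha', 'haa']
```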

### Examples with DataFrames

In [128]:
# Filter all the names that contain "body" or "Body"
# (no parentheses: a capture group makes str.contains emit a UserWarning)
(
products
    .loc[products['name'].str.contains(r'body|Body')]
    .sort_values('name').head(5))

Unnamed: 0,sku,name,desc,price
9630,BOD0009,BodyGuardz TrainR Pro 8/7/6 iPhone Case with A...,Advanced holster included sports armband for i...,39.99
10293,BOD0007,BodyGuardz TrainR Pro X iPhone Case with Armba...,Advanced holster included sports armband for i...,39.99
5225,GTE0075,G-Technology G-DOCK ev Body only USB3.0,Housing with connection USB3.0 compatible with...,107.99
6017,LMP0023,"LMP battery MacBook Pro 17 ""Unibody Early / Mi...",replacement battery compatible with MacBook Pr...,129.99
4621,NTE0104,NewerTech NuPower 95 W Battery for MacBook Pro...,internal battery MacBook Pro 17-inch Unibody 2011,131.99


In [127]:
# CHALLENGE: how can you shorten the previous regular expression?
# (a character set avoids the capture group and its UserWarning)
(
products
    .loc[products['name'].str.contains(r'[bB]ody')]
    .sort_values('name').head(5))

Unnamed: 0,sku,name,desc,price
9630,BOD0009,BodyGuardz TrainR Pro 8/7/6 iPhone Case with A...,Advanced holster included sports armband for i...,39.99
10293,BOD0007,BodyGuardz TrainR Pro X iPhone Case with Armba...,Advanced holster included sports armband for i...,39.99
5225,GTE0075,G-Technology G-DOCK ev Body only USB3.0,Housing with connection USB3.0 compatible with...,107.99
6017,LMP0023,"LMP battery MacBook Pro 17 ""Unibody Early / Mi...",replacement battery compatible with MacBook Pr...,129.99
4621,NTE0104,NewerTech NuPower 95 W Battery for MacBook Pro...,internal battery MacBook Pro 17-inch Unibody 2011,131.99
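
Two further options of `Series.str.contains` may help here: `case=False` ignores letter case for the whole pattern, and `na=False` treats missing values as non-matches so the mask stays boolean. The sketch below uses a small made-up DataFrame (`toy`) as a stand-in for `products`:

```python
import pandas as pd

# Hypothetical toy data standing in for the products table.
toy = pd.DataFrame(
    {"name": ["BodyGuardz case", "Unibody battery", "Fitbit Flex", None]}
)

# A character set matches "body" or "Body" without a capture group.
mask_set = toy["name"].str.contains(r"[bB]ody", na=False)

# case=False makes the match case-insensitive; na=False maps missing names to False.
mask_nocase = toy["name"].str.contains("body", case=False, na=False)

print(toy.loc[mask_set, "name"].tolist())
# ['BodyGuardz case', 'Unibody battery']
```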


### Compile regular expressions

In [273]:
import numpy as np

# Order matters: a row keeps the first label whose pattern matches its description.
# Alternations are written without parentheses to avoid the capture-group warning.
regexp_dict = {
    'ipod': r'^.{0,7}apple ipod',
    'case': r'case|funda|housing|casing|folder',
    'cable': r'cable|connector|Lightning to USB|Wall socket|power strip',
    'battery': r'battery',
    'headset': r'headset|headphones',
    'mouse': r'mouse|trackpad',
    'stand': r'stand|support',
    'protect': r'protect|cover|sleeve|Screensaver|shell',
    'watch': r'^.{0,6}apple watch|smartwatch|smart watch',
    'camera': r'camera',
    'refurbished': r'refurbished|reconditioned|like new',
    'strap': r'strap|armband|belt|bracelet'
}

temp = products.copy().assign(category = 'unknown')

for label, pattern in regexp_dict.items():
    # compile once per label; IGNORECASE makes the match case-insensitive
    regexp = re.compile(pattern, flags=re.IGNORECASE)
    temp = (
    temp
        .assign(
            # only rows that are still 'unknown' can receive a new label
            category = lambda x: np.where(
                (x['desc'].str.contains(regexp, regex=True)) &
                (x['category'] == 'unknown'), label, x['category'])))

temp['category'].value_counts()


unknown        6439
case           1377
protect         808
cable           512
refurbished     398
battery         223
stand           221
strap           191
headset         140
watch           126
camera          104
mouse            40
Name: category, dtype: int64
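
The same first-match-wins idea can be sketched without pandas, which makes the role of `re.compile` easier to see: each pattern is compiled once and then reused for every description. The rule set, the `categorize` helper and the sample descriptions below are hypothetical and much shorter than the real ones:

```python
import re

# Hypothetical, shortened rule set; earlier entries take precedence.
rules = {
    "case": r"case|housing",
    "battery": r"battery",
    "camera": r"camera",
}

# Compile every pattern once so it can be reused for each description.
compiled = {
    label: re.compile(pattern, flags=re.IGNORECASE)
    for label, pattern in rules.items()
}

def categorize(description):
    """Return the label of the first rule that matches, else 'unknown'."""
    for label, pattern in compiled.items():
        if pattern.search(description):
            return label
    return "unknown"

descriptions = [
    "Battery case for iPhone",
    "Full HD video surveillance camera",
    "USB hub",
]
labels = [categorize(d) for d in descriptions]
print(labels)  # ['case', 'camera', 'unknown']
```

Note that "Battery case for iPhone" is labelled `case` rather than `battery`, because the `case` rule comes first in the dictionary.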