<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Python-tricks" data-toc-modified-id="Python-tricks-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Python tricks</a></span></li><li><span><a href="#Numpy" data-toc-modified-id="Numpy-0.2"><span class="toc-item-num">0.2&nbsp;&nbsp;</span>Numpy</a></span></li><li><span><a href="#Regular-expressions" data-toc-modified-id="Regular-expressions-0.3"><span class="toc-item-num">0.3&nbsp;&nbsp;</span>Regular expressions</a></span></li></ul></li></ul></div>

This notebook covers:
* Python tricks: some lesser-known Python tricks and tips
* NumPy
* Regex basics

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import dask.dataframe as dd
import itertools

## Regular expressions

The basic regex functionality covered here is drawn from this excellent post:

https://www.machinelearningplus.com/python/python-regex-tutorial-examples/

In [3]:
import re

A regex pattern is a small special-purpose language for describing text, numbers or symbols, so that strings conforming to the pattern can be found and extracted.

In the pattern below, `\s` matches any single whitespace character. Adding `+` makes it match one or more of them, so the pattern also matches tabs (`\t`) and newlines, and treats a whole run of whitespace as a single match.

In [6]:
regex = re.compile(r'\s+')  # raw string, so the backslash reaches the regex engine unchanged

**Splitting a string using regex**

In [7]:
text = "Hello World.   Regex is awesome"

In [8]:
regex.split(text)

['Hello', 'World.', 'Regex', 'is', 'awesome']

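Splitting on a single `\s` instead treats every whitespace character as its own delimiter, so runs of spaces produce empty strings. The `\s+` pattern is generally the better choice.

In [9]:
re.split(r'\s', text)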

['Hello', 'World.', '', '', 'Regex', 'is', 'awesome']
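
To make the effect of the quantifier easier to see, here is a small sketch that compares both patterns with Python's built-in `str.split()`, which also splits on runs of whitespace. It reuses the sample string from above; the variable names are only illustrative.

```python
import re

text = "Hello World.   Regex is awesome"

# One delimiter per whitespace character: three spaces in a row yield two empty strings.
single = re.split(r'\s', text)

# One delimiter per run of whitespace.
runs = re.split(r'\s+', text)

# str.split() with no arguments also treats a run of whitespace as one delimiter.
builtin = text.split()

print(single)
print(runs)
print(runs == builtin)
```

For plain whitespace splitting, `str.split()` is usually enough; `re.split` becomes useful once the delimiter is a more general pattern.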

**re.findall**

The `findall` method returns all non-overlapping occurrences of the pattern as a list.

`\d` is a regular expression that matches any digit.

In [10]:
text = "101 howard street, 246 mcallister street"

In [11]:
regex_num = re.compile(r'\d+')  # one or more digits

In [12]:
regex_num.findall(text)

['101', '246']

In [13]:
regex_num.split(text)

['', ' howard street, ', ' mcallister street']
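
When the position of each match matters as well as its text, `finditer` can be used instead of `findall`. The following is a minimal sketch on the same address string; `hits` is just an illustrative name.

```python
import re

text = "101 howard street, 246 mcallister street"
regex_num = re.compile(r'\d+')

# finditer yields match objects, so each hit carries its span as well as its text.
hits = [(m.group(), m.start(), m.end()) for m in regex_num.finditer(text)]
print(hits)
```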

**re.search() vs re.match()**

`regex.search()` returns a match object that contains the starting and ending positions of the **first occurrence of the pattern** anywhere in the text, or `None` if the pattern does not occur.

`regex.match()` also returns a match object, but it requires the pattern to match at the **beginning of the text**; otherwise it returns `None`.

In [35]:
text2 = "189 MAT Mathematics 205"

In [36]:
m = regex_num.match(text2)

In [37]:
m.group()

'189'

In [45]:
m.start()  # index at which the match starts

0

In [46]:
s = regex_num.search(text2)

In [47]:
s.group()

'189'
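
Because `text2` starts with digits, `match` and `search` return the same result here. The difference only shows up when the pattern first occurs later in the string. The sketch below uses a hypothetical variant, `text3`, in which the number is no longer first, and also shows `fullmatch`, which requires the whole string to match.

```python
import re

regex_num = re.compile(r'\d+')
text3 = "MAT 189 Mathematics 205"  # hypothetical variant: digits are not at the start

m = regex_num.match(text3)       # anchored at position 0, so this is None
s = regex_num.search(text3)      # scans forward to the first run of digits
f = regex_num.fullmatch("189")   # succeeds only if the entire string matches

print(m)
print(s.group(), s.span())
print(f is not None)
```

Since `match` can return `None`, it is safer to check the result before calling `.group()` on it.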

**Substituting one text by another using `regex.sub()`**

In [55]:
text = """101   COM \t  Computers
205   MAT \t  Mathematics
189   ENG  \t  English"""

In [56]:
regex = re.compile(r'\s+')

In [61]:
regex.sub(' ', text)  # replace every run of whitespace with a single space

'101 COM Computers 205 MAT Mathematics 189 ENG English'

In [62]:
text

'101   COM \t  Computers\n205   MAT \t  Mathematics\n189   ENG  \t  English'

In [63]:
# collapse runs of whitespace that do not start with a newline
regex = re.compile(r'((?!\n)\s+)')
print(regex.sub(' ', text))

101 COM Computers
205 MAT Mathematics
189 ENG English
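
One caveat: the `(?!\n)` lookahead only inspects the first character of each whitespace run, so a run that starts with a space and then reaches a newline is still replaced as a whole. A possible alternative, sketched below on a hypothetical input with trailing spaces, is the character class `[^\S\n]`, which matches any whitespace character except a newline.

```python
import re

# Hypothetical input: note the trailing spaces before the newline.
messy = "101   COM \t  Computers   \n205   MAT \t  Mathematics"

lookahead = re.compile(r'(?!\n)\s+')   # guards only the first character of each run
no_newline = re.compile(r'[^\S\n]+')   # whitespace characters other than '\n'

with_lookahead = lookahead.sub(' ', messy)
with_class = no_newline.sub(' ', messy)

print(repr(with_lookahead))  # the line break is lost
print(repr(with_class))      # the line break is kept
```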


**Combining regex patterns**

In [101]:
text = """101   COM \t  Computers
205   MAT \t  Mathematics
189   ENG  \t  Englishhhh
213da   a3d \t  CekDoaANGZZ
ADA33   ADa \t  Scienc3
120s   M4T \t  d4ta
"""

In [102]:
# {3} means exactly three repetitions; {4,} means at least four
# define the course pattern groups (number, code, name) and extract them
course_pattern = r'([0-9]+)\s*([A-Z]{3})\s*([A-Za-z]{4,})'
re.findall(course_pattern, text)

[('101', 'COM', 'Computers'),
 ('205', 'MAT', 'Mathematics'),
 ('189', 'ENG', 'Englishhhh')]
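
Longer patterns like this one can be easier to read with named groups and the `re.VERBOSE` flag, which allows whitespace and comments inside the pattern. The sketch below is one way to write it; the group names `number`, `code` and `name` are illustrative labels, not part of the original pattern.

```python
import re

text = """101   COM \t  Computers
205   MAT \t  Mathematics
189   ENG  \t  Englishhhh
213da   a3d \t  CekDoaANGZZ
"""

# Same pattern as above, split over several lines with named groups.
course_pattern = re.compile(r'''
    (?P<number>[0-9]+)     \s*   # course number: one or more digits
    (?P<code>[A-Z]{3})     \s*   # course code: exactly three capital letters
    (?P<name>[A-Za-z]{4,})       # course name: at least four letters
''', re.VERBOSE)

courses = [m.groupdict() for m in course_pattern.finditer(text)]
for course in courses:
    print(course)
```

Each match is returned as a dictionary keyed by group name, which avoids having to remember the position of each group in the tuple.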

**Greedy regex**

By default, regex quantifiers are greedy: they match as much text as possible while still allowing the overall pattern to match, even when a shorter match would have satisfied the pattern.

In [103]:
text = "< body>Regex Greedy Matching Example < /body>"
re.findall('<.*>', text)

['< body>Regex Greedy Matching Example < /body>']

The match did not stop at the first `>`: `.*` consumed everything up to the last one. To extract the smaller portions instead, use lazy matching.

Lazy matching, on the other hand, ‘takes as little as possible’. It is enabled by adding a `?` immediately after the quantifier, as in `.*?`.

In [104]:
re.findall('<.*?>', text)

['< body>', '< /body>']

In [105]:
s = re.search(r'<.*?>', text)  # search returns only the first match

In [106]:
s.group()

'< body>'
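
A common alternative to a lazy quantifier is a negated character class, which cannot cross the closing delimiter at all. The sketch below compares the three variants on the same string; the variable names are illustrative.

```python
import re

text = "< body>Regex Greedy Matching Example < /body>"

greedy = re.findall(r'<.*>', text)       # runs to the last '>'
lazy = re.findall(r'<.*?>', text)        # stops at the first '>' after each '<'
negated = re.findall(r'<[^>]*>', text)   # [^>] never matches '>', so it cannot overshoot

print(greedy)
print(lazy)
print(negated)
```

On this input the lazy and negated versions give the same result; which one reads better is largely a matter of taste.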

In [108]:
text = '01, Jan 2015'

In [110]:
print(re.findall(r'\d{3}', text))  # exactly three consecutive digits: '201' from '2015'

['201']
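
The `{m}` and `{m,n}` quantifiers control how many repetitions are allowed. As a rough sketch on the same date string, the variants below show how the results change; the variable names are illustrative.

```python
import re

text = '01, Jan 2015'

exactly_two = re.findall(r'\d{2}', text)       # non-overlapping pairs of digits
exactly_four = re.findall(r'\d{4}', text)      # only the year is long enough
two_to_four = re.findall(r'\d{2,4}', text)     # greedy: takes up to four digits
whole_three = re.findall(r'\b\d{3}\b', text)   # a standalone three-digit number

print(exactly_two)
print(exactly_four)
print(two_to_four)
print(whole_three)
```

The last pattern returns an empty list because the text contains no number that is exactly three digits long.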


**Matching word boundaries**

Word boundaries `\b` are commonly used to match the beginning or end of a word. A boundary is a position with a word character (`\w`) on one side and a non-word character, or the start or end of the string, on the other.

For example, the regex `\btoy` will match the ‘toy’ in ‘toy cat’ but not in ‘tolstoy’. To match the ‘toy’ in ‘tolstoy’, use `toy\b`.

Can you come up with a regex that will match only the first ‘toy’ in ‘play toy broke toys’? (Hint: `\b` on both sides.)

Likewise, `\B` matches any position that is not a word boundary.

For example, `\Btoy\B` will match a ‘toy’ with word characters on both sides, as in ‘antoynet’.

In [111]:
re.findall(r'\btoy\b', 'play toy broke toys')

['toy']

In [112]:
re.findall(r'\btoy', 'play toy broke toys')

['toy', 'toy']

In [113]:
re.findall(r'toy\b', 'play toy broke toys')

['toy']

In [114]:
re.findall(r'\Btoy\b', 'playtoy broke toys')

['toy']

In [115]:
re.findall(r'\Btoy\B', 'playtoybroke toys')

['toy']

In [116]:
re.findall(r'\btoy', 'playtoybroke toys')

['toy']
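
Word boundaries are defined by word characters (`\w`: letters, digits and underscore) rather than by whitespace, so punctuation also counts as a boundary while an underscore does not. The sketch below illustrates this on a hypothetical test string.

```python
import re

# Hypothetical test string mixing punctuation, an underscore and a plural.
sample = "toy, (toy) toy_car toys toy"

words = re.findall(r'\btoy\b', sample)
starts = [m.start() for m in re.finditer(r'\btoy\b', sample)]

print(words)
print(starts)
```

The ‘toy’ in ‘toy_car’ is skipped because `_` is a word character, and the one in ‘toys’ is skipped because it is followed by ‘s’.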

**Practice regex examples**

1. Extract the user name, domain name and suffix from the following email addresses.

In [117]:
emails = """zuck26@facebook.com
page33@google.com
jeff42@amazon.com"""

desired_output = [('zuck26', 'facebook', 'com'), ('page33', 'google', 'com'),
                  ('jeff42', 'amazon', 'com')]

In [118]:
regex = re.compile(r'(\w+)@(\w+)\.(\w+)')  # '\.' matches a literal dot; a bare '.' matches any character

In [119]:
regex.findall(emails)

[('zuck26', 'facebook', 'com'),
 ('page33', 'google', 'com'),
 ('jeff42', 'amazon', 'com')]
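
An unescaped `.` matches any character, so whether the dot is escaped matters for input that is not well formed. The sketch below compares a loose and a strict version of the pattern on a hypothetical malformed address.

```python
import re

loose = re.compile(r'(\w+)@(\w+).(\w+)')    # '.' matches any character
strict = re.compile(r'(\w+)@(\w+)\.(\w+)')  # '\.' matches only a literal dot

bad = "someone@example-com"  # hypothetical malformed address: '-' instead of '.'

loose_result = loose.findall(bad)
strict_result = strict.findall(bad)

print(loose_result)   # the '-' is accepted in place of the dot
print(strict_result)  # no match
```

Neither pattern is a full email validator; both are only meant for extracting the parts of addresses that are already well formed.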

2. Retrieve all the words starting with ‘b’ or ‘B’ from the following text.

In [120]:
text = """Betty bought a bit of butter, 
But the butter was so bitter, So she bought
some better butter, To make the bitter butter better."""

In [121]:
regex = re.compile(r'\b[bB]\w+')  # \b anchors the match to the start of a word

In [122]:
regex.findall(text)

['Betty',
 'bought',
 'bit',
 'butter',
 'But',
 'butter',
 'bitter',
 'bought',
 'better',
 'butter',
 'bitter',
 'butter',
 'better']

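3. Clean up the following sentence by replacing every run of commas, semicolons, underscores and whitespace with a single space.

In [123]:
sentence = """A, very   very; irregular_sentence"""
desired_output = "A very very irregular sentence"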

In [124]:
regex = re.compile(r'[,\s;_]+')

In [125]:
' '.join(regex.split(sentence))

'A very very irregular sentence'

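4. Clean up the following tweet so that only the user's message remains: remove the URL, hashtag, mentions, punctuation and the `RT` and `cc` markers.

In [126]:
tweet = '''Good advice! RT @TheNextWeb: What I would do differently if I was learning to code today http://t.co/lbwej0pxOd cc: @garybernhardt #rstats'''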

In [127]:
desired_output = 'Good advice What I would do differently if I was learning to code today'

In [128]:
import string


def clean_tweet(tweet):
    tweet = re.sub(r'http\S+\s*', '', tweet)  # remove URLs
    tweet = re.sub(r'\b(?:RT|cc)\b', '', tweet)  # remove standalone RT and cc markers
    tweet = re.sub(r'#\S+', '', tweet)  # remove hashtags
    tweet = re.sub(r'@\S+', '', tweet)  # remove mentions
    tweet = re.sub('[%s]' % re.escape(string.punctuation), '', tweet)  # remove punctuation
    tweet = re.sub(r'\s+', ' ', tweet)  # collapse runs of whitespace
    return tweet.strip()  # drop leading and trailing whitespace


print(clean_tweet(tweet))

Good advice What I would do differently if I was learning to code today
