- A. [Collection Modules](#Collection-module)
- B. [Opening and Reading of files and folders](#Opening-and-Reading-Files)
- C. [Datetime module](#datetime-module)
- D. [Math and Random module](#Math-and-Random-modules)
- E. [Python Debugger](#Python-Debugger)
- F. [Overview of Regexes](#Regular-Expressions)
- G. [How to time your code - timeit module](#Timing-your-code)

__________________________________________________________________________________________________________________

# Collection module

The collections module is a built-in module that implements specialized container data types providing alternatives to Python’s general purpose built-in containers. The general built-ins are - dict, list, set, and tuple.

Now we'll learn about the alternatives that the **collections** module provides.

## I. Counter

Counter is a dict subclass for counting hashable objects. Elements are stored as dictionary keys and their counts are stored as the corresponding values.

#### Example :

In [2]:
from collections import Counter


#### Counter( ) with lists

In [2]:
lst = [1,2,2,2,2,3,3,3,1,2,1,12,3,2,32,1,21,1,223,1]

Counter(lst)

Counter({1: 6, 2: 6, 3: 4, 12: 1, 32: 1, 21: 1, 223: 1})

In [23]:
items = ['a',10,3,10,'a','a']

# the Counter of a list can be assigned to a variable also, like :
itemsCount = Counter(items)

itemsCount

Counter({'a': 3, 10: 2, 3: 1})

In [11]:
# the datatype is collections.Counter

type(itemsCount)

collections.Counter

In [16]:
# access the value stored under the key 'a', i.e. the line below
# returns the count of the item 'a' in this collections.Counter

itemsCount['a']

3

If you ask for the count of an item that is not present in the counter, it returns 0 (zero) instead of raising a `KeyError`.

In [17]:
# the item 'w' is not present, so this returns 0

itemsCount['w']

0
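
As a small sketch of this behaviour (the names `missing` and `still_absent` are only illustrative), a lookup on a `Counter` returns 0 for an unseen item and does not add that item as a key:

```python
from collections import Counter

items_count = Counter(['a', 10, 3, 10, 'a', 'a'])

missing = items_count['w']                # a missing key reads as 0 instead of raising KeyError
still_absent = 'w' not in items_count     # the lookup did not insert the key

print(missing, still_absent)
```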


#### Counter( ) with strings

In [10]:
Counter('hello, I will come soon')

Counter({'h': 1,
         'e': 2,
         'l': 4,
         'o': 4,
         ',': 1,
         ' ': 4,
         'I': 1,
         'w': 1,
         'i': 1,
         'c': 1,
         'm': 1,
         's': 1,
         'n': 1})

In [17]:
Counter('aabsbsbAs bhMhhMsbs xpMpxA')

Counter({'a': 2,
         'b': 5,
         's': 5,
         'A': 2,
         ' ': 2,
         'h': 3,
         'M': 3,
         'x': 2,
         'p': 2})

#### Counter with words in a sentence

In [29]:
s = 'johnny johnny yes papa eating sugar no papa'

words = s.split()

Counter(words)

Counter({'johnny': 2, 'yes': 1, 'papa': 2, 'eating': 1, 'sugar': 1, 'no': 1})

In [30]:
# Methods offered by Counter() - for eg, .most_common()
c = Counter(words)

c.most_common(3)       # returns 3 most common words along with the count of each.

[('johnny', 2), ('papa', 2), ('yes', 1)]

### Common patterns when using the Counter( ) object

    sum(c.values())                 # total of all counts
    c.clear()                       # reset all counts
    list(c)                         # list unique elements
    set(c)                          # convert to a set
    dict(c)                         # convert to a regular dictionary
    c.items()                       # convert to a list of (elem, cnt) pairs
    Counter(dict(list_of_pairs))    # convert from a list of (elem, cnt) pairs
    c.most_common()[:-n-1:-1]       # n least common elements
    c += Counter()                  # remove zero and negative counts

*****

In [31]:
list(c)

['johnny', 'yes', 'papa', 'eating', 'sugar', 'no']

In [32]:
set(c)

{'eating', 'johnny', 'no', 'papa', 'sugar', 'yes'}
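
The remaining patterns from the list above can be tried the same way. The sketch below rebuilds the word counter from the earlier cells; `total`, `n` and `least_common` are illustrative names, not part of the `Counter` API.

```python
from collections import Counter

c = Counter('johnny johnny yes papa eating sugar no papa'.split())

total = sum(c.values())                      # total of all counts
n = 2
least_common = c.most_common()[:-n-1:-1]     # the n least common elements

c['papa'] -= 2                               # a count may drop to zero (or below)
c += Counter()                               # in-place add discards zero and negative counts

print(total, least_common, list(c))
```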

## II. defaultdict

defaultdict is a dictionary-like object that provides every method a regular dictionary has, and takes a callable (`default_factory`) as its first argument. That callable produces the default value for any missing key. Using defaultdict is faster than doing the same with the `dict.setdefault` method.

**A defaultdict created with a default factory will never raise a KeyError on a `d[key]` lookup. Any key that does not exist gets the value returned by the default factory.**

In [39]:
from collections import defaultdict

In [47]:
# Notice that 'd' here is an ordinary dictionary. It raises a KeyError when you access a key that doesn't exist.

d = {}

d['one']

KeyError: 'one'

In [75]:
d = defaultdict(object)

In [76]:
# the datatype of a defaultdict is collections.defaultdict

type(d)

collections.defaultdict

In [77]:
d['one']

<object at 0x10f3c7380>

In [78]:
d

defaultdict(object, {'one': <object at 0x10f3c7380>})

In [82]:
for item in d:
    print(item)

one


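A defaultdict works by supplying a default value for any key that is, or will become, part of the dictionary.

The factory argument is optional. Without one, `default_factory` is `None` and a missing key raises a `KeyError` just like in a regular dict. With one (for example a lambda), you control the default value :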

In [83]:
t = defaultdict()

In [84]:
t['hello']

KeyError: 'hello'

In [92]:
t

defaultdict(None, {})

In [89]:
defDic = defaultdict(lambda: 2)

In [90]:
defDic['some-key']

2

In [91]:
for item in defDic:
    print(item)

some-key
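
A common use of a default factory is grouping. The sketch below is only an illustration (the word list and the names `by_letter` and `plain` are made up for it); it groups words by their first letter with `defaultdict(list)` and then does the same with `dict.setdefault` on a plain dict for comparison.

```python
from collections import defaultdict

words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']

by_letter = defaultdict(list)            # a missing key starts out as an empty list
for word in words:
    by_letter[word[0]].append(word)

plain = {}
for word in words:
    plain.setdefault(word[0], []).append(word)   # same grouping with a regular dict

print(dict(by_letter))
```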


## III. namedtuple

The standard tuple uses numerical indexes to access its members, for example :

In [95]:
t = (13,17,14)

t[1]

17

For simple use cases, numerical indexes are usually enough. On the other hand, remembering which index should be used for each value can lead to errors, especially if the tuple has a lot of fields and is constructed far from where it is used.

The `namedtuple` expands on a normal tuple by assigning a name, as well as the numerical index, to each member.

Each kind of namedtuple is represented by its own class, created by using the `namedtuple()` factory function. The arguments are the name of the new class and the names of the elements, given either as a list of strings or as a single string.

You can basically think of **namedtuples** as a very quick way of creating a new object/class type with some attribute fields. For example :

In [108]:
from collections import namedtuple

Dog = namedtuple('Dog',['age','breed','name'])

sunny = Dog(age=2,breed='Lab',name='Sunny')

frank = Dog(age=2,breed='Shepard',name="Frankie")

In [109]:
# it kind-of looks like a class

Dog

__main__.Dog

We construct the namedtuple by first passing the object type name (Dog) and then passing the field names, either as a list of strings (as above) or as a single string with spaces between the field names.

We can then call on the various attributes :

In [110]:
sunny

Dog(age=2, breed='Lab', name='Sunny')

In [111]:
# it shows the type of 'sunny'

type(sunny)

__main__.Dog

In [112]:
sunny.age

2

In [113]:
sunny.breed

'Lab'

In [115]:
# access the breed by index 1, like :

sunny[1]

'Lab'
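
A few more things a namedtuple can do, sketched here with the same `Dog` fields. This version passes the field names as one space-separated string; `_fields`, `_asdict()` and `_replace()` are standard namedtuple helpers, while the variable names are only illustrative.

```python
from collections import namedtuple

Dog = namedtuple('Dog', 'age breed name')    # field names given as one space-separated string

sunny = Dog(age=2, breed='Lab', name='Sunny')

fields = Dog._fields                  # the field names, in order
as_dict = sunny._asdict()             # mapping of field name -> value
older = sunny._replace(age=3)         # returns a new tuple; sunny itself is unchanged
age, breed, name = sunny              # unpacks like any ordinary tuple

print(fields, older, sunny)
```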

> **Note :** `namedtuple` is very useful when you have large tuples and can't quite remember which value sits at which index. It lets you access a value either by its index position, such as 1, or by its attribute name, such as `breed`.

<br>
__________________________________________________________________________________________________________________

# Opening and Reading Files
So far we've discussed how to open files manually, one by one. But we don't yet know how to tackle cases like :
- what if we have to open every file in a directory, especially one that has sub-directories too?
- what if we want to move files around on our own computer?

Python's `os` module and `shutil` utility module allow us to easily navigate files and directories on the computer and then perform actions on them, such as move, delete, etc.

So let's explore how we can open files programmatically.

#### Review : Understanding File Paths

In [116]:
pwd

'/Users/barmanr/Documents/myself/Python3-material/12-Advanced-Python-Module'

#### Create a practice file

Let's create a practice text file that we'll be using for further demonstration.

In [117]:
f = open('practice.text', 'w+')
f.write('This is a practice file and this is first line!')
f.close()

## `os` module 

In [118]:
import os

In [122]:
# get current working directory

os.getcwd()

'/Users/barmanr/Documents/myself/Python3-material/12-Advanced-Python-Module'

In [124]:
# list all the current files and folders in the current working directory level.
# the return type is a list

os.listdir()

['Advanced-Python-modules.ipynb', 'practice.text', '.ipynb_checkpoints']

In [126]:
# pass a path as the argument to list the files and folders in that directory.
# the return type is still a list

type(os.listdir('/Users/barmanr/Documents/myself/Python3-material'))

list
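
`os.listdir()` returns bare names, so to tell files from folders you join each name with its parent path first. Here is a minimal sketch that works in a throwaway temporary directory (the names `base`, `files_only` and `dirs_only` are made up for this example):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:          # throwaway directory for the demo
    os.mkdir(os.path.join(base, 'sub'))
    with open(os.path.join(base, 'practice.txt'), 'w') as f:
        f.write('hello')

    entries = sorted(os.listdir(base))
    # join each bare name with its parent directory before testing it
    files_only = [e for e in entries if os.path.isfile(os.path.join(base, e))]
    dirs_only = [e for e in entries if os.path.isdir(os.path.join(base, e))]

print(entries, files_only, dirs_only)
```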

## `shutil` utility module
#### Moving Files
You can use the built-in `shutil` module to move files to different locations.

Keep in mind, there are permission restrictions. For example if you are logged in as User A, you won't be able to make changes to the top level Users folder without the proper permissions, more [info](https://stackoverflow.com/questions/23253439/shutil-movescr-dst-gets-me-ioerror-errno-13-permission-denied-and-3-more-e)

In [127]:
import shutil

Move a file from one folder to another.

The command below returns the destination path, including the filename.

In [128]:
shutil.move('/Users/barmanr/Documents/myself/folder-1/practiceText.txt', '/Users/barmanr/Documents/myself/folder-2')

'/Users/barmanr/Documents/myself/folder-2/practiceText.txt'
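
If you would like to try the move without touching your own folders, the sketch below recreates a similar layout inside a temporary directory. The folder and file names mirror the example above, but everything here is created by the snippet itself.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as base:
    src_dir = os.path.join(base, 'folder-1')
    dst_dir = os.path.join(base, 'folder-2')
    os.mkdir(src_dir)
    os.mkdir(dst_dir)

    src_file = os.path.join(src_dir, 'practiceText.txt')
    with open(src_file, 'w') as f:
        f.write('This is a practice file')

    new_path = shutil.move(src_file, dst_dir)     # the destination is an existing directory
    moved = os.path.exists(new_path) and not os.path.exists(src_file)
    new_name = os.path.basename(new_path)

print(moved, new_name)
```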

### Deleting Files

**NOTE -> The `os` and `shutil` modules provide 3 methods for deleting files and folders :**
- `os.unlink(path)` : which deletes a file at the path that you provide.
- `os.rmdir(path)` : which deletes a folder (folder must be empty) at the path that you provide.
- `shutil.rmtree(path)` : this is the most dangerous, as it will remove all files and folders contained in the path.

**None of these methods can be reversed! If you make a mistake, you won't be able to recover the file. Instead we will use the `send2trash` module, a safer alternative that sends deleted files to the trash bin instead of removing them permanently.**

> At your command line, install the send2trash module with :<br>
`pip install send2trash`

In [130]:
import send2trash

send2trash.send2trash('/Users/barmanr/Documents/myself/folder-2')
# it'll move the 'folder-2' directory (along with all its files and sub-folders) to the trash/bin
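
For comparison, here is a sketch of the three permanent deletion calls listed earlier. It only ever touches a scratch directory created with `tempfile.mkdtemp()`, so nothing of yours is at risk; the file and folder names are invented for the demo.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()                        # scratch directory, safe to destroy

file_path = os.path.join(base, 'a.txt')
open(file_path, 'w').close()
os.unlink(file_path)                             # deletes a single file

empty_dir = os.path.join(base, 'empty')
os.mkdir(empty_dir)
os.rmdir(empty_dir)                              # deletes a folder only if it is empty

nested = os.path.join(base, 'tree', 'nested')
os.makedirs(nested)
open(os.path.join(nested, 'b.txt'), 'w').close()
shutil.rmtree(os.path.join(base, 'tree'))        # deletes the folder and everything inside it

remaining = os.listdir(base)
os.rmdir(base)                                   # clean up the now-empty scratch directory
print(remaining)
```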

### Walking through a directory

Often you will just need to "walk" through a directory, i.e. visit every file or folder and check to see if a file is in the directory, and then perhaps do something with that file. Recursively walking through every file and folder in a directory would be quite tricky to program by hand, but luckily the `os` module has a direct method call for this called `os.walk()`.

Let's see an example. Here, I've used a _datasets_other_ folder to demonstrate. You can try it out with your own folder.

In [141]:
os.listdir('/Users/barmanr/Documents/myself/datasets_other/')

['.DS_Store', 'Cricket-Australia-Datasets', 'Health-Tweets', 'pokemon.csv']

In [148]:
filePath = "/Users/barmanr/Documents/myself/datasets_other"
i = 1

for folder, sub_folders, files in os.walk(filePath):
    print("{}. Currently looking at folder : ".format(i) + folder)
    print('\n')
    
    print('THE SUB-FOLDERS are : ')
    for subFol in sub_folders:
        print('\t Subfolder -> '+ subFol)
        
    print('\n')

    print('THE FILES ARE : ')
    for f in files:
        print('\t File -> '+ f)
    i += 1
    print('\n********* xxx *********\n')
    
    

1. Currently looking at folder : /Users/barmanr/Documents/myself/datasets_other


THE SUB-FOLDERS are : 
	 Subfolder -> Cricket-Australia-Datasets
	 Subfolder -> Health-Tweets


THE FILES ARE : 
	 File -> .DS_Store
	 File -> pokemon.csv

********* xxx *********

2. Currently looking at folder : /Users/barmanr/Documents/myself/datasets_other/Cricket-Australia-Datasets


THE SUB-FOLDERS are : 
	 Subfolder -> BBB International Australia Data


THE FILES ARE : 
	 File -> BallByBall_BBL&WBBL copy.zip
	 File -> .DS_Store
	 File -> OvalsAndFacilityAudit.zip
	 File -> TicketSales_2016-17SeasonOnwards copy.zip
	 File -> OvalsAndFacilityAudit copy.zip
	 File -> BBB International Australia Data.zip
	 File -> TV Broadcast copy.zip
	 File -> OvalsAndFacility.zip

********* xxx *********

3. Currently looking at folder : /Users/barmanr/Documents/myself/datasets_other/Cricket-Australia-Datasets/BBB International Australia Data


THE SUB-FOLDERS are : 


THE FILES ARE : 
	 File -> Players.txt
	 File -

> _Remember that the `os` module works on any operating system that supports Python, which means these commands will work across Linux, macOS, or Windows without need for adjustment._
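
A typical use of `os.walk()` is collecting every file of a certain type under a folder. The helper below, `find_files`, is a hypothetical name written for this sketch (it is not part of the `os` module), and the folder tree is created in a temporary directory so the example is self-contained.

```python
import os
import tempfile

def find_files(root, extension):
    """Return the paths under root (searched recursively) whose names end with extension."""
    matches = []
    for folder, sub_folders, files in os.walk(root):
        for name in files:
            if name.endswith(extension):
                matches.append(os.path.join(folder, name))
    return sorted(matches)

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'a', 'b'))
    for rel in ('top.csv', os.path.join('a', 'mid.txt'), os.path.join('a', 'b', 'deep.csv')):
        open(os.path.join(root, rel), 'w').close()

    # report paths relative to the root so the output does not depend on the temp location
    found = [os.path.relpath(p, root) for p in find_files(root, '.csv')]

print(found)
```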

<br>
________________________________________________________________________________________________________________

# `datetime` module

Python's **datetime** module lets you create objects that hold not just a date or a time, but also time zone information. It also lets you compute differences between points in time and deal with timestamps in your code.

Time values are represented with the time class. Times have attributes for hour, minute, second, and microsecond. They can also include time zone information. The arguments to initialize a time instance are optional, but the default of 0 is unlikely to be what you want.

### `.time( )`

Let's take a look at how we can extract time information from the datetime module.

We can create a timestamp by specifying `datetime.time(hour,minute,second,microsecond)`

The _hour_ parameter is in 24-hour format.

In [182]:
import datetime

tm = datetime.time(6, 20, 5)

# Let's show different components
print(tm)
print('hour :', tm.hour)
print('minute :', tm.minute)
print('second :', tm.second)
print('microsecond :', tm.microsecond)
print('tzinfo :', tm.tzinfo)            # 'tzinfo' is the timezone information

06:20:05
hour : 6
minute : 20
second : 5
microsecond : 0
tzinfo : None


In [167]:
type(tm)

datetime.time

**Note :** A time instance only holds values of time, and not a date associated with the time.

We can also check the `min` and `max` values that a time of day can have :

In [156]:
print('Earliest   :', datetime.time.min)
print('Latest     :', datetime.time.max)
print('Resolution :', datetime.time.resolution)

Earliest   : 00:00:00
Latest     : 23:59:59.999999
Resolution : 0:00:00.000001


**The `min` and `max` class attributes reflect the valid range of times in a single day.**

### Dates (`datetime`)

A `time` object holds only a time of day, not a **date**. To hold a calendar date, use a `date` object; to hold both a date and a time, use a `datetime` object.

The `datetime` module, as you might suspect, also allows us to work with date timestamps. Calendar date values are represented with the `date` class. Instances have attributes for year, month, and day. It is easy to create a date representing today’s date using the `today()` class method.

Let's see some examples :

In [168]:
today = datetime.date.today()     # YYYY-MM-DD -> follows the ISO 8601 standard for representing dates

print(today)
print('ctime:', today.ctime())
print('tuple:', today.timetuple())
print('ordinal:', today.toordinal())
print('Year :', today.year)
print('Month:', today.month)
print('Day  :', today.day)

2021-09-04
ctime: Sat Sep  4 00:00:00 2021
tuple: time.struct_time(tm_year=2021, tm_mon=9, tm_mday=4, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=5, tm_yday=247, tm_isdst=-1)
ordinal: 738037
Year : 2021
Month: 9
Day  : 4


In [188]:
from datetime import datetime

In [176]:
mydatetime = datetime(2018, 9, 29, 7, 35, 25)

print(mydatetime)
print(mydatetime.ctime())

2018-09-29 07:35:25
Sat Sep 29 07:35:25 2018


You can replace any attribute of a datetime object using `.replace()`, like :

**Note: The replacement does not happen in-place.**

**Example-1 :**

In [179]:
mydatetime.replace(year=2021)

print(mydatetime.ctime())     # even after replace, mydatetime.year is still 2018

Sat Sep 29 07:35:25 2018


In [180]:
mydatetime = mydatetime.replace(year=2021)

print(mydatetime.ctime())

Wed Sep 29 07:35:25 2021


**Example-2 :**

In [183]:
import datetime                   # rebind the name to the module; `from datetime import datetime` above shadowed it

d1 = datetime.date(2018, 6, 3)
print('d1:', d1)

d2 = d1.replace(year=1990)
print('d2:', d2)

d1: 2018-06-03
d2: 1990-06-03


<br>

As with time, the range of date values supported can be determined using the `min` and `max` attributes.

In [158]:
print('Earliest  :', datetime.date.min)
print('Latest    :', datetime.date.max)
print('Resolution:', datetime.date.resolution)

Earliest  : 0001-01-01
Latest    : 9999-12-31
Resolution: 1 day, 0:00:00


<br>

## Arithmetic
We can perform arithmetic on date objects to check for time differences. For example :

**I. Differences in `date` objects :**

In [184]:
d1

datetime.date(2018, 6, 3)

In [185]:
d2

datetime.date(1990, 6, 3)

In [186]:
d1 - d2

datetime.timedelta(days=10227)

This gives us the difference in days between the two dates. You can use the `timedelta` class to specify durations in various units of time, i.e. days, minutes, hours, etc.
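
As a small sketch of building a `timedelta` yourself (the names `one_week_later` and `gap` are only illustrative), you can add a duration to a date, and mixed units are normalised into days, seconds and microseconds:

```python
from datetime import date, timedelta

d1 = date(2018, 6, 3)

one_week_later = d1 + timedelta(weeks=1)      # date + timedelta gives a new date
gap = timedelta(days=1, hours=12)             # stored internally as days and seconds

print(one_week_later, gap.days, gap.seconds, gap.total_seconds())
```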

**II. Differences in `datetime` objects :**

In [203]:
from datetime import datetime 
datetm1 = datetime(2021, 12, 3, 22, 0)
datetm2 = datetime(2020, 12, 3, 12, 0)

In [206]:
diff = datetm1 - datetm2
diff

# here, the difference in seconds is actually the difference of 10 hours => 10*60*60 = 36000 seconds

datetime.timedelta(days=365, seconds=36000)

**Notice that, besides the difference in days, the result also carries the remaining time difference, in seconds.**

In [209]:
# Now, this is the difference with its total no. of seconds
# => 365*24*60*60 + 36000 = 31572000

diff.total_seconds()

31572000.0
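
If you need the difference in hours and minutes rather than seconds, `divmod` does the unit conversion. This is just one way to do it, using the same two datetimes as above:

```python
from datetime import datetime

diff = datetime(2021, 12, 3, 22, 0) - datetime(2020, 12, 3, 12, 0)

hours, remainder = divmod(diff.seconds, 3600)    # diff.seconds covers only the part beyond whole days
minutes, seconds = divmod(remainder, 60)
total_hours = diff.total_seconds() / 3600        # whole difference expressed in hours

print(diff.days, hours, minutes, seconds, total_hours)
```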

<br>

<br>
__________________________________________________________________________________________________________________

# Math and Random modules

Python comes with a built in math module and random module. In this lecture we will give a brief tour of their capabilities. Usually you can simply look up the function call you are looking for in the online documentation.

- [Math module](https://docs.python.org/3/library/math.html)
- [Random module](https://docs.python.org/3/library/random.html)

## Math functions
Some useful math functions are :

In [428]:
import math

# help(math)    # math functions provided by math module

#### 1. Rounding Numbers

In [215]:
value = 4.36

In [216]:
math.floor(value)

4

In [217]:
math.ceil(value)

5

The `round(number, ndigits)` function rounds **number** to the nearest integer (if **ndigits** is not specified), or to **ndigits** decimal places otherwise. It differs from the rounding usually taught in school only in how exact ties (values ending in .5) are handled.

The `round()` function is not part of the `math` module. It is a built-in, so you can use it without importing anything.

**Examples :**

In [218]:
round(value)

4

<br>

You might expect 4.5 to round to 5, but it doesn't.
When a number lies exactly halfway between two integers, `round()` picks the **nearest even integer**.

This behaviour only shows up for values ending in exactly .5

In [258]:
round(4.5)    # 4 and 5 are equally near; the even one, 4, is chosen.

4

In [254]:
round(9.5)    # 9 and 10 are equally near; the even one, 10, is chosen.

10

In [269]:
round(-1.5)

-2

In [270]:
round(-2.5)

-2

In [265]:
print(round(5.5126432, 3))

print(round(6.5125432, 3))

print(round(7.5124432, 3))

5.513
6.513
7.512
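
To see the tie rule side by side with "school" rounding, the sketch below compares the built-in `round()` with the `decimal` module's `ROUND_HALF_UP` mode. Using `decimal` here is just one possible way to get half-up rounding; the list `ties` is made up for the demo.

```python
from decimal import Decimal, ROUND_HALF_UP

ties = [0.5, 1.5, 2.5, 3.5, 4.5]

half_even = [round(x) for x in ties]          # built-in round: ties go to the even neighbour

half_up = [int(Decimal(str(x)).quantize(Decimal('1'), rounding=ROUND_HALF_UP))
           for x in ties]                     # school-style rounding: ties go away from zero

print(half_even)
print(half_up)
```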


#### 2. Mathematical Constants

In [219]:
math.pi

3.141592653589793

Or you can also do,

In [266]:
from math import pi

pi

3.141592653589793

In [276]:
math.e     # exponential constant

2.718281828459045

In [278]:
math.inf   # positive infinity

inf

In [280]:
math.nan

nan

In [282]:
-math.inf  # negative infinity

-inf

In [283]:
-math.nan

nan

#### 3. Logarithmic Values

In [284]:
math.e

2.718281828459045

In [285]:
# Log to base 'e'
math.log(math.e)

1.0

In [286]:
from math import log

In [287]:
log(math.e)

1.0

In [290]:
log(0)

ValueError: math domain error

In [291]:
log(10)

2.302585092994046

In [294]:
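# only `pi` and `log` were imported by name above, so the bare name `e` is undefined; write `math.e` instead
e ** log(10)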

NameError: name 'e' is not defined

#### 4. Custom Logarithmic-base

`math.log(x, base)`

In [296]:
math.log(100, 10)

2.0

In [321]:
log(625, 5)

4.0
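
The two-argument form is the change-of-base rule, log_b(x) = ln(x) / ln(b). The sketch below checks that, and also shows the dedicated `math.log10` and `math.log2` functions. Results are compared with `math.isclose` because floating-point logs can be off in the last digit.

```python
import math

x = 625
by_base_arg = math.log(x, 5)
by_change_of_base = math.log(x) / math.log(5)      # log_b(x) = ln(x) / ln(b)
same = math.isclose(by_base_arg, by_change_of_base)

log10_value = math.log10(1000)                     # base-10 logarithm
log2_value = math.log2(1024)                       # base-2 logarithm

print(same, log10_value, log2_value)
```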

#### 5.  Trigonometric Functions

In [297]:
# radians

math.sin(10)

-0.5440211108893699

In [298]:
math.degrees(pi/2)

90.0

In [301]:
# math.radians(degrees)
math.radians(180)

3.141592653589793

In [302]:
math.radians(900)

# 900˚ = 5 * 180˚, hence
# radians(900) = 5 * radians(180) = 5 * pi

15.707963267948966
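
Since the trigonometric functions expect radians, an angle given in degrees has to be converted first. A minimal sketch, with illustrative variable names:

```python
import math

angle_deg = 30
angle_rad = math.radians(angle_deg)       # convert degrees to radians before calling sin()
sine = math.sin(angle_rad)                # close to 0.5, up to floating-point rounding
round_trip = math.degrees(angle_rad)      # convert back to degrees

print(sine, round_trip)
```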

## Random module

The random module allows us to create random numbers. We can even set a seed to produce the same random set every time.

The explanation of how a computer attempts to generate random numbers involves higher-level mathematics. You may check out these interesting topics :

- https://en.wikipedia.org/wiki/Pseudorandom_number_generator
- https://en.wikipedia.org/wiki/Random_seed
- [Pseudorandom number generators | Computer Science | Khan Academy](https://www.youtube.com/watch?v=GtOt7EBNEwQ)


### Understanding a Seed

Setting a seed allows us to start from a seeded pseudorandom number generator, which means the same random numbers will show up in a series.

Note, you need the seed to be in the same cell if you're using Jupyter to guarantee the same results each time.<br>Getting the same set of random numbers can be important in situations where you will be trying different variations of functions and want to compare their performance on random values, but want to do it fairly (so you need the same set of random numbers each time).

In [303]:
import random

In [312]:
random.randint(0,100)

74

In [313]:
random.randint(0,100)

48

In [337]:
# the value 101 is completely arbitrary, you can pass in any number you want.
random.seed(101)
# however many times this cell runs, it always returns the same number.
random.randint(0,50)

37

In [343]:
random.randint(0,100)

84

The sequence of numbers produced by the code below is the same every time it runs. It should also be the same on another computer running the same Python version. Just give it a try!

In [344]:
# The value 101 is completely arbitrary, you can pass in any number you want
random.seed(101)
print(random.randint(0,100))
print(random.randint(0,100))
print(random.randint(0,100))
print(random.randint(0,100))
print(random.randint(0,100))

74
24
69
45
59
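
You can check the reproducibility yourself without comparing printed numbers by eye. The sketch below seeds twice and compares the two sequences, and also shows `random.Random`, which gives you a separate generator with its own state (the names `first_run`, `second_run` and `rng` are illustrative).

```python
import random

random.seed(101)
first_run = [random.randint(0, 100) for _ in range(5)]

random.seed(101)                          # re-seeding restarts the sequence
second_run = [random.randint(0, 100) for _ in range(5)]

rng = random.Random(101)                  # independent generator seeded the same way
third_run = [rng.randint(0, 100) for _ in range(5)]

print(first_run == second_run == third_run)
```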


<br>

### Random with Sequences

##### Grab a random item from a list

In [354]:
myList = list(range(0,15))
myList

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

In [357]:
random.choice(myList)

7

<br>

### Sample with Replacement

Take a sample size, allowing picking elements more than once. Imagine a bag of numbered lottery balls, you reach in to grab a random lotto ball, then after marking down the number, **you place it back in the bag**, then continue picking another one.

In [359]:
random.choices(population=myList, k=10)

[6, 8, 2, 11, 3, 11, 6, 0, 7, 3]

Notice that the numbers 6, 3, and 11 are repeated.

### Sample without Replacement

Once an item has been randomly picked, it can't be picked again. Imagine a bag of numbered lottery balls, you reach in to grab a random lotto ball, then after marking down the number, you **leave it out of the bag**, then continue picking another one.

In [360]:
random.sample(population=myList, k=10)

[10, 13, 9, 14, 5, 4, 6, 8, 1, 3]
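
One practical consequence of the difference: `choices()` can draw more items than the population holds, while `sample()` cannot. A short sketch, with made-up variable names:

```python
import random

population = list(range(15))

without_replacement = random.sample(population, k=10)
all_unique = len(set(without_replacement)) == 10      # always true for sample()

with_replacement = random.choices(population, k=20)   # k may exceed the population size

try:
    random.sample(population, k=20)                   # cannot draw 20 distinct items from 15
    too_many_ok = True
except ValueError:
    too_many_ok = False

print(all_unique, len(with_replacement), too_many_ok)
```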

<br>

### Shuffle a list

**Note: This shuffles the list in place, i.e. the original order is lost! Since `shuffle()` returns `None`, you don't need to assign the result to anything.**

In [363]:
# Don't assign this to anything!
random.shuffle(myList)

myList

[12, 2, 7, 1, 14, 13, 4, 9, 11, 0, 8, 5, 10, 6, 3]
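
If you want a shuffled copy and need to keep the original order, one option is to sample the whole list without replacement. The sketch also shows why assigning the result of `shuffle()` is a mistake:

```python
import random

my_list = list(range(15))

shuffled_copy = random.sample(my_list, k=len(my_list))   # new list; my_list keeps its order
unchanged = my_list == list(range(15))

result = random.shuffle(my_list)                         # shuffles in place and returns None

print(unchanged, result)
```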

<br>

### Random Distributions - some functions that deal with probability distribution

[Uniform Distribution](https://en.wikipedia.org/wiki/Uniform_distribution)

Randomly picks a value between a and b, from a continuous distribution between these. Each value has an equal chance (likelihood) of being chosen.

In [365]:
random.uniform(a=0, b=100)

11.887206640710046

[Normal or Gaussian Distribution](https://en.wikipedia.org/wiki/Normal_distribution)

In [371]:
random.gauss(mu=0, sigma=2)     # sigma : Standard Deviation
                                # mu : Centered Mean

0.968948355232866
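
A single draw says little about a distribution, so here is a sketch that takes many draws and summarises them with the `statistics` module. The sample size of 10,000 and the seed are arbitrary choices for the demo; the sample mean and standard deviation should land near `mu` and `sigma`, but not exactly on them.

```python
import random
import statistics

random.seed(101)                                        # arbitrary seed, for repeatable output
draws = [random.gauss(mu=0, sigma=2) for _ in range(10_000)]
sample_mean = statistics.mean(draws)                    # should be near mu = 0
sample_std = statistics.stdev(draws)                    # should be near sigma = 2

uniform_draws = [random.uniform(0, 100) for _ in range(10_000)]
in_range = all(0 <= u <= 100 for u in uniform_draws)    # every value lies between a and b

print(sample_mean, sample_std, in_range)
```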

<br>

<br>
__________________________________________________________________________________________________________________

# Python Debugger

You've probably used a variety of print statements, trying to find errors in your code. A better way of doing this is by using Python's built-in debugger module (pdb). The `pdb` module implements an interactive debugging environment for Python programs. It includes features to let you pause your program, look at the values of variables, and watch program execution step-by-step, so you can understand what your program actually does and find bugs in the logic.

This is a bit difficult to show since it requires creating an error on purpose, but hopefully this simple example illustrates the power of the pdb module.

Note: Keep in mind it would be pretty unusual to use pdb in a Jupyter Notebook setting.

Here we will create an error on purpose, trying to add a list to an integer :

In [372]:
x = [1, 3, 6]
y = 3
z = 2

result = y + z
print(result)
result2 = y + x
print(result2)

5


TypeError: unsupported operand type(s) for +: 'int' and 'list'

Hmmm, looks like we get an error! Let's implement a `set_trace()` using the **pdb** module. This will allow us to basically pause the code at the point of the trace and check if anything is wrong.

In [376]:
import pdb

x = [1,3,4]
y = 2
z = 3

result = y + z
print(result)

# Set a trace using Python Debugger
pdb.set_trace()

result2 = y + x
print(result2)

5
--Return--
None
> <ipython-input-376-88b1a2293928>(11)<module>()
      9 
     10 # Set a trace using Python Debugger
---> 11 pdb.set_trace()
     12 
     13 result2 = y + x
ipdb> x
[1, 3, 4]
ipdb> y
2
ipdb> z
3
ipdb> x+y
*** TypeError: can only concatenate list (not "int") to list
ipdb> y+z
5
ipdb> x+z
*** TypeError: can only concatenate list (not "int") to list
ipdb> result
5
ipdb> result2
*** NameError: name 'result2' is not defined
ipdb> q


BdbQuit: 

Great! Now we could check what the various variables were and check for errors. You can type `q` to quit the debugger. For more information on general debugging techniques and more methods, check out the official documentation for **pdb** : https://docs.python.org/3/library/pdb.html

<br>
________________________________________________________________________________________________________________

# Regular Expressions

Regular Expressions (sometimes called **regex**, for short) allow a user to search for general patterns in textual data, i.e. search for strings using almost any sort of rule/structure they can come up with.<br>For example, finding all capital letters in a string, or finding a phone number in a document, or finding a simple email format.

Regular expressions are notorious for their seemingly strange syntax. This strange syntax is a byproduct of their flexibility. Regular expressions have to be able to filter out any string pattern you can imagine, which is why they have a complex string pattern format.

The `re` library allows us to create specialized pattern strings and then search for matches within text.

The primary skill for regex is understanding the special syntax for these pattern strings.

Let's begin by explaining how to search for basic patterns in a string!

## Search for basic strings/text

In [377]:
text = "The person's phone number is 908-555-1234. Call soon!"

We'll start off by trying to find out if the string "phone" is inside the text string. Now we could quickly do this with :

In [381]:
'phone' in text

True

Let's try this with regexes as they provide even more leverage and functionalities for search,

In [382]:
import re

In [383]:
pattern = 'phone'

In [384]:
re.search(pattern, text)

<re.Match object; span=(13, 18), match='phone'>

`re.search()` takes the pattern, scans the text, and returns a Match object for the first match. If the pattern is not found, `None` is returned (in Jupyter Notebook this just means that nothing is output below the cell).<br>Let's see an example for `None` in this case.

In [392]:
re.search('address', text)

# the text 'address' is not present in the text variable.
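
Because a failed search gives back `None`, calling `.group()` or `.span()` on the result directly would raise an `AttributeError`. A small sketch of the usual guard (the variable `found` is only illustrative):

```python
import re

text = "The person's phone number is 908-555-1234. Call soon!"

match = re.search('address', text)
if match is None:
    found = 'no match'            # nothing to call .group() on
else:
    found = match.group()

print(found)
```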

Let's take a closer look at this Match object.

In [393]:
match = re.search('phone', text)

match

<re.Match object; span=(13, 18), match='phone'>

**Notice the span, there is a start and end index information.**

In [396]:
match.span()

(13, 18)

In [397]:
match.start()

13

In [398]:
match.end()

18

But what if the pattern occurs more than once ?

In [402]:
text = 'the lake is down the road. lake water is warmer.'

match = re.search('lake', text)

match.span()

(4, 8)

**Notice it only matches the first instance.**

If we wanted a list of all matches, we can use the `.findall()` method. It returns the list of all matched occurrences.

In [403]:
matches = re.findall('lake', text)

matches

['lake', 'lake']

To get actual match objects, use the iterator i.e. `.finditer()` method :

In [406]:
for match in re.finditer('lake', text):
    print(match)
    print(match.span())
    print("***x***")

<re.Match object; span=(4, 8), match='lake'>
(4, 8)
***x***
<re.Match object; span=(27, 31), match='lake'>
(27, 31)
***x***


If you wanted the actual text that matched, you can use the `.group()` method.

In [405]:
match.group()

'lake'

## Patterns

So far we've learned how to search for a basic text/string. What about more complex examples? Such as trying to find a telephone number in a large string of text? Or an email address?

We could just use the `search` method if we know the exact phone or email, but what if we don't know it? We may know the general format, and we can use that along with regular expressions to search the document for strings that match a particular pattern.

This is where the syntax may appear strange at first, but take your time with this; often it's just a matter of looking up the pattern code.

### Identifiers for Characters in Patterns

Character types such as digits or whitespace have different codes that represent them. You can use these to build up a pattern string. Notice how these make heavy use of the backslash `\` . Because of this, when defining a pattern string for a regular expression we use the format :

    r'mypattern'

Placing the `r` in front of the string tells Python to treat the `\` characters in the pattern string literally, rather than as the start of escape sequences.
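
To make the difference concrete, here is a sketch comparing a normal string with a raw one. It uses `\b`, which is a backspace character in a normal Python string but a word boundary in a regex; `\b` is not covered elsewhere in this section and is used here only as an illustration.

```python
import re

plain = '\n'       # one character: a newline
raw = r'\n'        # two characters: a backslash followed by 'n'

# the raw pattern reaches the regex engine as a word boundary,
# the plain one as literal backspace characters that never occur in the text
raw_hits = re.findall(r'\bcat\b', 'cat concat cat')
plain_hits = re.findall('\bcat\b', 'cat concat cat')

print(len(plain), len(raw), raw_hits, plain_hits)
```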

The table of character identifiers further below lists the most common ones. As a first example, `\d` matches a single digit :

In [407]:
text = 'my telephone number is 886-724-1234'
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d', text)

In [408]:
phone.group()

'886-724-1234'

Notice the repetition of `\d`. That is a bit of an annoyance, especially if we are looking for very long strings of numbers. Let's explore the possible quantifiers.

### Quantifiers

Now that we know the special character designations, we can use them along with quantifiers to define how many we expect.

**<u>Character identifiers</u>**

| Character|Description|Example Pattern Code|Example Match |
| :----: | :----: | :----: | :----: |
| \d | A digit | file_\d\d | file_25 |
| \w | Alphanumeric | \w-\w\w\w | A-b_1 |
| \s | White space | a\sb\sc | a b c |
| \D | A non-digit | \D\D\D | ABC |
| \W | Non-alphanumeric | \W\W\W\W\W | \*-+=) |
| \S | Non-whitespace | \S\S\S\S | Yoyo |

**<u>Quantifiers</u>** - indicate repetition of the same character.

| Character|Description|Example Pattern Code|Example Match |
| :----: | :----: | :----: | :----: |
| + | Occurs 1 or more times | Version \w-\w+ | Version A-b1_1 |
| {3} | Occurs exactly 3 times | \D{3} | abc |
| {2,4} | Occurs 2 to 4 times | \d{2,4} | • 123 • 58 • 1829  |
| {3,} | Occurs 3 or more times | \w{3,} | anycharacters |
| \* | Occurs 0 or more times | A\*B\*C\* | • AAACC • ABBC • BC • AB • BBB • C|
| ? | Occurs 0 or 1 times | plurals? | • plural • plurals |

Let's rewrite our pattern using these quantifiers :

<br>

In [412]:
re.search(r'\d{3}-\d{3}-\d{4}', text)

<re.Match object; span=(23, 35), match='886-724-1234'>
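
A few of the other quantifiers from the table, sketched with `re.findall()`. The sample strings are made up for the demo.

```python
import re

optional_s = re.findall(r'plurals?', 'plural plurals')          # '?': the 's' appears 0 or 1 times
two_to_four = re.findall(r'\d{2,4}', '7 58 123 1829 55555')     # runs of 2 to 4 digits
one_or_more = re.findall(r'\w+', 'Version A-b1_1')              # '+': 1 or more word characters

print(optional_s)
print(two_to_four)
print(one_or_more)
```

Note how the lone `7` is skipped (too short) and `55555` yields only its first four digits, since `{2,4}` takes at most four and the leftover single digit is too short to match again.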

### Groups

What if we wanted to do two tasks: find phone numbers, and also quickly extract their area code (the first three digits)? We can use groups for any general task that involves grouping together parts of a regular expression (so that we can later break them down).

Using the phone number example, we can separate groups of regular expressions using parentheses :

In [413]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [414]:
results = re.search(phone_pattern, text)

In [415]:
results.group()

'886-724-1234'

- We can then also retrieve each group by its position.
- Remember, groups are delimited by parentheses `()`.
- An important note: group numbering starts at 1. Passing in 0 returns the entire match.

In [453]:
results.group(0)

'886-724-1234'

In [416]:
results.group(1)

'886'

In [417]:
results.group(2)

'724'

In [418]:
results.group(3)

'1234'

In [419]:
# IndexError, because we only had 3 groups of parentheses

results.group(4)

IndexError: no such group
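
Beyond numbered access, a match object can also return all groups at once, and groups can be given names. The sketch below assumes the same phone-number text; the group names `area`, `exchange`, and `line` are hypothetical labels chosen for this example.

```python
import re

text = 'my telephone number is 886-724-1234'

# Named groups use the (?P<name>...) syntax; the names here are hypothetical
named_pattern = re.compile(r'(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})')
match = named_pattern.search(text)

# search() returns None when nothing matches, so check before using the result
if match is not None:
    all_groups = match.groups()        # every group as a tuple
    area_code = match.group('area')    # look a group up by its name
    as_dict = match.groupdict()        # mapping of name -> matched text
    print(all_groups)                  # ('886', '724', '1234')
    print(area_code)                   # 886
    print(as_dict)
```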

## Additional Regex syntax

#### 1. OR operator `|`

Use the pipe operator to have an `or` statement, i.e. to perform search for multiple terms in a single regex. For example :

In [454]:
# search for 'man' or 'woman'

re.search(r"man|woman", "This man was going to the beach.")

<re.Match object; span=(5, 8), match='man'>

In [421]:
re.search(r"man|woman", "Your Highness Kelly is a woman of pride and respect.")

<re.Match object; span=(25, 30), match='woman'>

In [422]:
# re.search() matches and returns the first occurrence only

re.search(r"man|woman", "Every man and woman in the town shall obey the orders.")

<re.Match object; span=(6, 9), match='man'>
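
If every occurrence is needed rather than just the first, `re.findall()` or `re.finditer()` can be used with the same pattern. A minimal sketch on the sentence above:

```python
import re

sentence = "Every man and woman in the town shall obey the orders."

# findall() returns every non-overlapping match as a list of strings
all_matches = re.findall(r"man|woman", sentence)
print(all_matches)      # ['man', 'woman']

# finditer() yields match objects, so the position of each match is available too
spans = [(m.group(), m.span()) for m in re.finditer(r"man|woman", sentence)]
print(spans)            # [('man', (6, 9)), ('woman', (14, 19))]
```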

#### 2. Wildcard character

Use a "wildcard" as a placeholder that will match any single character placed there (except a newline, by default). You can use a simple period `.` for this. For example :

In [456]:
re.findall(r"at", "The cat in the hat sat there.")

['at', 'at', 'at']

In [455]:
re.findall(r".at", "The cat in the hat sat there.")

['cat', 'hat', 'sat']

In [425]:
re.findall(r".at", "The Batman splats a sound which alerted all bats.")

['Bat', 'lat', 'bat']

Notice how we only matched 3 characters each time. That is because each `.` matches exactly one character, so we need one `.` per wildcard letter. Alternatively, use the quantifiers described above to match a variable number of characters.

In [426]:
re.findall(r"...at","The Batman splats a sound which alerted all bats.")

['e Bat', 'splat', 'l bat']

However, this still has the problem of grabbing extra characters (including spaces) before the "at". What we really want is just the run of non-whitespace characters that leads up to "at".

In [427]:
# one or more non-whitespace characters followed by 'at'

re.findall(r'\S+at', "The Batman splats a sound which alerted all bats.")

['Bat', 'splat', 'bat']
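
Note that `\S+at` still returns pieces of longer words such as "Batman". If the goal is whole words that end in "at", one possible approach is to add word boundaries with `\b`. This is a sketch that goes slightly beyond the identifiers listed above; `\b` matches the empty position at the edge of a word.

```python
import re

first = "The cat in the hat sat there."
second = "The Batman splats a sound which alerted all bats."

# \b marks a word edge, so the match must be a complete word ending in 'at'
word_ending_at = r'\b\w*at\b'

first_words = re.findall(word_ending_at, first)
second_words = re.findall(word_ending_at, second)

print(first_words)     # ['cat', 'hat', 'sat']
print(second_words)    # []  (no word in this sentence actually ends with 'at')
```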

#### 3. Starts with and Ends with

We can use the `^` to signal starts with, and the `$` to signal ends with :

In [430]:
# ends with a number

re.findall(r'\d$', "My account number is XCVB-5279")

['9']

In [431]:
# starts with a number

re.findall(r'^\d', "2 is the first prime number")

['2']

Note that this applies to the entire string, not to individual words!<br>See the line below as an example :

In [457]:
re.findall(r'^\d', 'the 3.14 is pi')

[]

In [439]:
re.findall(r'^\d', "698798")

['6']
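
When the string contains several lines, the `re.MULTILINE` flag makes `^` and `$` apply to each line instead of only to the whole string. A short sketch, using a made-up three-line string:

```python
import re

lines = "1 apple\n2 pears\nno digit here 3"   # hypothetical multi-line text

# Default behaviour: ^ anchors only at the very start of the string
first_only = re.findall(r'^\d', lines)

# With re.MULTILINE, ^ and $ also anchor at the start and end of every line
per_line_start = re.findall(r'^\d', lines, flags=re.MULTILINE)
per_line_end = re.findall(r'\d$', lines, flags=re.MULTILINE)

print(first_only)       # ['1']
print(per_line_start)   # ['1', '2']
print(per_line_end)     # ['3']
```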

#### 4. Exclusion

To exclude characters, we can use the `^` symbol as the first character inside a set of brackets `[]`. The pattern then matches any single character that is not listed inside the brackets. For example :

In [440]:
phrase = 'there are 3 numbeers 34 inside 5 this sentence.'

In [458]:
re.findall(r'[^\d]', phrase)      # exclude any digits in 'phrase'

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

To get the words back together, use a `+` sign.

In [442]:
re.findall(r'[^\d]+', phrase)

['there are ', ' numbeers ', ' inside ', ' this sentence.']

> **Tip:** [Exclusions](#4.-Exclusion) can be useful to remove punctuation from a sentence.

In [443]:
phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [444]:
re.findall('[^!.? ]+', phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [445]:
clean = ' '.join(re.findall('[^!.? ]+', phrase))

clean

'This is a string But it has punctuation How can we remove it'
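
The same clean-up can also be sketched with `re.sub()`, which replaces every match with a given string (here, an empty one), so no `join` step is needed. The second variant, which builds the set from `string.punctuation`, is an assumption about what "punctuation" should cover rather than something taken from the notebook.

```python
import re
import string

phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace each '!', '.' or '?' with nothing; spaces are left untouched
clean = re.sub(r'[!.?]', '', phrase)
print(clean)        # This is a string But it has punctuation How can we remove it

# A broader variant: strip every ASCII punctuation character
# re.escape() makes characters such as ']' and '\' safe to place inside the brackets
all_punctuation = '[' + re.escape(string.punctuation) + ']'
clean_all = re.sub(all_punctuation, '', phrase)
print(clean_all == clean)   # True
```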

#### 5. Brackets for Grouping

As we showed above, brackets `[]` define a set of characters, any one of which may match at that position. For example, if we wanted to find hyphenated words :

In [446]:
text = '''Only find the hyphen-words in this sentence.
But you do not know how long-ish they are'''

In [447]:
re.findall(r'[\w]+-[\w]+', text)

['hyphen-words', 'long-ish']
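
Brackets can also list individual characters or ranges such as `a-z`. The following sketch, on a made-up sentence, shows how the choice of range changes what is matched:

```python
import re

sample = 'Only find the long-ish or Well-known ones'   # hypothetical sample text

# A set of individual characters: any one vowel
vowels = re.findall(r'[aeiou]', 'long-ish')

# A lowercase-only range misses the capital 'W' in 'Well-known'
lower_only = re.findall(r'[a-z]+-[a-z]+', sample)

# Combining two ranges in one set accepts both cases
any_case = re.findall(r'[A-Za-z]+-[A-Za-z]+', sample)

print(vowels)       # ['o', 'i']
print(lower_only)   # ['long-ish', 'ell-known']
print(any_case)     # ['long-ish', 'Well-known']
```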

#### 6. Parentheses for Multiple Options

If we have multiple options for matching, we can use parentheses to list out these options.<br>For example :

In [448]:
# Find words that start with 'cat' and end with one of these options: 'fish','nap', or 'claw'
textOne = "Hello, would you like some catfish?"
textTwo = "Hello, would you like to take a catnap?"
textThree = "Hello, have you seen this caterpillar?"

In [450]:
re.search(r'cat(fish|nap|claw)', textOne)

<re.Match object; span=(27, 34), match='catfish'>

In [451]:
re.search(r'cat(fish|nap|claw)', textTwo)

<re.Match object; span=(32, 38), match='catnap'>

In [452]:
re.search(r'cat(fish|nap|claw)', textThree)        # None is returned

In [459]:
re.search(r'cat(fish|nap|erpillar)', textThree)

<re.Match object; span=(26, 37), match='caterpillar'>
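
One thing to watch for: parentheses also create a group, and `re.findall()` returns only the group contents when a pattern contains one. A non-capturing group `(?:...)` keeps the "multiple options" behaviour without that side effect. A small sketch with a made-up sentence:

```python
import re

sentence = "A catfish took a catnap."   # hypothetical sample text

# With a capturing group, findall() returns only what the group matched
captured = re.findall(r'cat(fish|nap|claw)', sentence)

# With a non-capturing group (?:...), findall() returns the whole match
whole = re.findall(r'cat(?:fish|nap|claw)', sentence)

print(captured)   # ['fish', 'nap']
print(whole)      # ['catfish', 'catnap']
```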

<br>
_____________________________________________________________________

# Timing your code
Sometimes it's important to know how long your code is taking to run, or at least know if a particular line of code is slowing down your entire project. Python has a built-in timing module to do this.

### Example Function or Script

Here we have two functions that do the same thing, but in different ways. How can we tell which one is more efficient? Let's time them!

In [460]:
def func_one(n):
    '''
    Given a number n, returns a list of string integers
    ['0', '1', '2', ..., 'n-1']
    '''
    return [str(num) for num in range(n)]

In [461]:
func_one(10)

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [462]:
def func_two(n):
    '''
    Given a number n, returns a list of string integers
    ['0', '1', '2', ..., 'n-1']
    '''
    return list(map(str,range(n)))

In [463]:
func_two(10)

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

### Timing Start and Stop
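We can try using the `time` module to simply calculate the elapsed time for the code. Keep in mind that a single `time.time()` measurement has limited resolution on some platforms and is easily distorted by whatever else the machine is doing, so this approach is only reliable for code that takes a noticeable amount of time to run (roughly 0.1 seconds or more).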

In [464]:
import time

In [465]:
# STEP 1: get the start time
start_time = time.time()

# STEP 2: run the code you want to time
result = func_one(1000000)

# STEP 3: calculate the total time elapsed
elapsed_time = time.time() - start_time

elapsed_time

0.22466802597045898
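
For this start/stop style of timing, the standard library also offers `time.perf_counter()`, a clock intended for measuring short durations. The sketch below repeats the measurement with it; the printed value will differ from machine to machine, so no expected output is shown.

```python
import time

def func_one(n):
    return [str(num) for num in range(n)]

# perf_counter() is a monotonic, high-resolution clock meant for timing intervals
start = time.perf_counter()
result = func_one(1000000)
elapsed = time.perf_counter() - start

print(f'{elapsed:.4f} seconds')
```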

### Timeit module

What if we have two blocks of code that are both quite fast? The difference measured with the `time.time()` method may not be enough to tell which is faster. In this case, we can use the `timeit` module.

The `timeit.timeit()` function takes in two strings, a statement (stmt) and a setup. It runs the setup code once, then runs the stmt code `number` times and reports back the total time (in seconds) taken for all of those runs.

In [466]:
import timeit

The setup is anything that needs to be defined beforehand, such as function definitions.

In [468]:
setup = '''
def func_one(n):
    return [str(num) for num in range(n)]
'''

In [469]:
stmt = 'func_one(100)'

In [470]:
timeit.timeit(stmt, setup, number=100000)

1.7208647910156287

Now let's try running func_two 100,000 times and compare how long it took.

In [471]:
setup2 = '''
def func_two(n):
    return list(map(str,range(n)))
'''

In [472]:
stmt2 = 'func_two(100)'

In [473]:
timeit.timeit(stmt2, setup2, number=100000)

1.3566677809867542

It looks like func_two is more efficient. You can specify a larger number of runs if you want to make the difference clearer for fast functions.

In [474]:
timeit.timeit(stmt,setup,number=1000000)

16.38055953499861

In [475]:
timeit.timeit(stmt2, setup2, number=1000000)

12.934618312981911
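
The figures above are totals for all runs. To get a per-call time, divide by the number of runs. `timeit` also accepts a callable instead of a string, and `timeit.repeat()` repeats the whole measurement several times. The sketch below combines these; the run count is an arbitrary small value chosen so the example finishes quickly, and the timings will vary by machine.

```python
import timeit

def func_one(n):
    return [str(num) for num in range(n)]

def func_two(n):
    return list(map(str, range(n)))

runs = 10000   # hypothetical run count, kept small for a quick example

# timeit.timeit() returns the TOTAL seconds for all runs; divide for a per-call figure
total_one = timeit.timeit(lambda: func_one(100), number=runs)
per_call_one = total_one / runs

# timeit.repeat() returns one total per repetition; the minimum is usually the least noisy
best_two = min(timeit.repeat(lambda: func_two(100), repeat=5, number=runs))
per_call_two = best_two / runs

print(f'func_one: {per_call_one * 1e6:.2f} µs per call')
print(f'func_two: {per_call_two * 1e6:.2f} µs per call')
```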

<br>

### Timing your code with Jupyter 'magic' method

**NOTE:** This method is ONLY available in Jupyter and the magic command needs to be at the top of the cell with nothing above it (not even commented code)

In [476]:
%%timeit
func_one(100)

17.6 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [477]:
%%timeit
func_two(100)

12.6 µs ± 267 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


Hence it shows that func_two is indeed faster than func_one.

<br>
___________________________________________________________________________________________________________

# Unzipping and Zipping files

As you are probably aware, files can be compressed to a zip format. Often people use special programs on their computer to unzip these files. Luckily for us, Python can do the same task with just a few simple lines of code.

### Create files to Compress

In [478]:
# create two small text files in the current working directory
# (the `with` block closes each file automatically)
with open("new_file.txt", 'w') as f:
    f.write("Here is some text.")

with open("new_file2.txt", 'w') as f:
    f.write("Here is some text in file-2.")

### Zipping Files

The [zipfile library](https://docs.python.org/3/library/zipfile.html) is built in to Python, and we can use it to compress folders or files. To compress all files in a folder, use `os.walk()` to iterate over every file in the directory and write each one to the archive.

In [479]:
import zipfile

Create the zip file first, then write to it (the write step compresses the files).

In [481]:
exampleZip = zipfile.ZipFile('exampleZip.zip', 'w')

In [482]:
exampleZip.write('new_file.txt', compress_type=zipfile.ZIP_DEFLATED)
exampleZip.write('new_file2.txt', compress_type=zipfile.ZIP_DEFLATED)

Remember to close the zip file after using it.

In [483]:
exampleZip.close()
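
The `os.walk()` approach mentioned above could look roughly like the sketch below. The helper name `zip_folder` and the temporary folder layout are hypothetical and exist only to make the example self-contained. Opening the archive in a `with` block closes it automatically.

```python
import os
import tempfile
import zipfile

def zip_folder(folder, zip_path):
    '''Write every file under `folder` into a new zip archive at `zip_path`.'''
    with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_DEFLATED) as archive:
        for root, dirs, files in os.walk(folder):
            for name in files:
                full_path = os.path.join(root, name)
                # arcname stores the path relative to `folder` inside the archive
                archive.write(full_path, arcname=os.path.relpath(full_path, folder))

# Build a throwaway folder with one nested file, zip it, then list the archive
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, 'source')
    os.makedirs(os.path.join(source, 'nested'))
    for relative in ('new_file.txt', os.path.join('nested', 'new_file2.txt')):
        with open(os.path.join(source, relative), 'w') as f:
            f.write('Here is some text.')

    target = os.path.join(workdir, 'folder.zip')
    zip_folder(source, target)

    with zipfile.ZipFile(target) as archive:
        names = sorted(archive.namelist())

print(names)   # ['nested/new_file2.txt', 'new_file.txt']
```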

### Extracting from Zip files

We can easily extract files with either the `extractall()` method to get all the files, or just using the `extract()` method to only grab individual files.

In [488]:
zipObj = zipfile.ZipFile('exampleZip.zip', 'r')

In [489]:
zipObj.extractall('extract-exampleZip')       # extract all contents

In [491]:
zipObj.extract('new_file.txt')                  # extract only the new_file.txt file

'/Users/barmanr/Documents/myself/Python3-material/12-Advanced-Python-Module/new_file.txt'
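
It is also possible to look inside an archive without extracting anything to disk. The sketch below builds a small archive in memory with `io.BytesIO` (an assumption made purely so the example needs no files), then lists and reads its members:

```python
import io
import zipfile

# Build a small archive in memory so the example does not touch the disk
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w', compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('new_file.txt', 'Here is some text.')
    archive.writestr('new_file2.txt', 'Here is some text in file-2.')

# Re-open for reading; the `with` block closes the archive automatically
with zipfile.ZipFile(buffer, 'r') as archive:
    names = archive.namelist()                               # member names
    first_text = archive.read('new_file.txt').decode('utf-8')   # contents as text

print(names)        # ['new_file.txt', 'new_file2.txt']
print(first_text)   # Here is some text.
```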

## Using `shutil` library

Often you don't want to archive or extract individual files, but instead want to archive or extract an entire folder at once. The `shutil` library, which is built in to Python, has easy-to-use commands for this :

In [513]:
import shutil

The `shutil.make_archive()` function accepts a `format` parameter, which is the archive format: one of "zip", "tar", "gztar", "bztar", or "xztar".

In [516]:
directory_to_zip = '''/Users/barmanr/Documents/myself/Python3-material/12-Advanced-Python-Module/extract-exampleZip'''

In [517]:
# creating a zip archive
output_filename = 'exampleShutil'

# Just fill in the output_filename (without the .zip extension) and the directory to zip
# Note that directory_to_zip must point to a folder that exists on your machine
shutil.make_archive(output_filename,'zip', directory_to_zip)

'/Users/barmanr/Documents/myself/Python3-material/12-Advanced-Python-Module/exampleShutil.zip'

In [518]:
# Extracting a zip archive

# Notice how the argument order is slightly different here
shutil.unpack_archive('exampleShutil.zip', 'extract-exampleShutil', 'zip')
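
Since the paths above are specific to one machine, here is a self-contained sketch of the same round trip inside a temporary directory. The folder and file names are hypothetical.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # A throwaway folder with one file in it
    source = os.path.join(workdir, 'to_archive')
    os.makedirs(source)
    with open(os.path.join(source, 'new_file.txt'), 'w') as f:
        f.write('Here is some text.')

    # make_archive(base_name, format, root_dir): base_name has NO extension
    archive_path = shutil.make_archive(os.path.join(workdir, 'exampleShutil'), 'zip', source)

    # unpack_archive(filename, extract_dir, format): here the filename includes '.zip'
    destination = os.path.join(workdir, 'unpacked')
    shutil.unpack_archive(archive_path, destination, 'zip')

    extracted = os.listdir(destination)
    archive_name = os.path.basename(archive_path)

print(archive_name)   # exampleShutil.zip
print(extracted)      # ['new_file.txt']
```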

______________________