Book Reference: pages 51-84 of **Python for Data Analysis** by Wes McKinney

Data Structures and Sequences

I. Data Structures  
II. Create your own reusable Python functions  
III. Mechanics of Python file objects and interacting with your local hard drive 

In [1]:
# Tuple
# fixed-length
# immutable sequence of Python objects

tup = 4,5,6
nested_tup = ((1,2,3),(4,5,6))
print(tup)
print(nested_tup)

tup = tuple('string')
print(tup)


print("Concatenation: ")
print("method 1")
concatenation = (4, None, 'foo') + (6, 0) + ('bar',)
print(concatenation)

print("method 2")
concatenation = ('happy', 'girl') * 5
print(concatenation)

In [2]:
print("Unpacking tuples")
tup = (4, 5, 6)
a, b, c = tup
print(b)

tupNest = (3,2,43, (2,3,4,5))
a,b,c,(d,e,f,g) = tupNest
print(d)

In [3]:
print("In other languages swapping is like this:")

a = 99
b = 1

tmp = a 
a = b
b = tmp
print("a: {0} b: {1}".format(a,b))

print("In python:")
a, b = b,a
print("a: {0} b: {1}".format(a,b))

In other languages swapping is like this:
a: 1 b: 99
In python:
a: 99 b: 1


#### Pluck a few elements from the beginning of a tuple using `*rest`

In [4]:
values = 1,2,3,4,5
a,b, *rest = values
a,b

(1, 2)

In [5]:
rest

[3, 4, 5]

In [6]:
# or
a,b, *_ = values
_

[3, 4, 5]

In [7]:
# by convention, the name _ (as in *_) marks values you want to discard

In [8]:
# List
# variable-length and their contents can be modified in-place
# frequently used in data processing as a way to materialize an 
# iterator or generator expression

gen = range(10)
gen

range(0, 10)

In [9]:
list(gen)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [10]:
# adding and removing elements

b_list = ['food', 'heart', 'cloud']
b_list.append('dwarf')
b_list

['food', 'heart', 'cloud', 'dwarf']

In [11]:
b_list.insert(1, 'red')
b_list

['food', 'red', 'heart', 'cloud', 'dwarf']

### <span style="color:red">Warnings on Lists</span> 

- `insert` is computationally expensive compared with `append` because all subsequent elements must be shifted to make room
- If you need to insert at both the start and the end of a sequence, explore `collections.deque`, a double-ended queue
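
As a minimal sketch of the `collections.deque` suggestion above (the variable names `queue` and `first` are illustrative, not from the book), a deque can grow and shrink at either end without shifting the remaining elements:

```python
from collections import deque

# A deque supports cheap appends and pops at both ends.
queue = deque(['heart', 'cloud'])
queue.appendleft('food')   # add at the start without shifting other elements
queue.append('dwarf')      # add at the end, like list.append
first = queue.popleft()    # remove from the start

print(first)        # food
print(list(queue))  # ['heart', 'cloud', 'dwarf']
```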


In [12]:
# pop item by index
b_list.pop(2)

'heart'

In [13]:
b_list

['food', 'red', 'cloud', 'dwarf']

In [14]:
b_list.append('food')
b_list

['food', 'red', 'cloud', 'dwarf', 'food']

In [15]:
# remove by value
b_list.remove('food') ## REMOVES ONLY FIRST VALUE FOUND FROM THE LIST
b_list

['red', 'cloud', 'dwarf', 'food']

### <span style="color:red">Warnings on Lists</span> 

- If performance is not a concern, by using append and remove, you can use a Python list as a perfectly suitable “multiset” data structure.
- Searching a list (for example with `in`) is a linear scan, while
- searching a dict or set takes constant time on average (both are based on hash tables)
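
The two points above can be sketched as follows. The names `names_list`, `names_set`, and `bag` are assumptions made for illustration; the example only shows behaviour and does not measure speed:

```python
names_list = ['food', 'red', 'cloud', 'dwarf']
names_set = set(names_list)

# Same answer either way; the list is scanned item by item,
# while the set uses a hash lookup.
in_list = 'dwarf' in names_list
in_set = 'dwarf' in names_set
print(in_list, in_set)        # True True

# Multiset-style use of a list: duplicates are kept and can be counted.
bag = ['food', 'red', 'food']
bag.remove('food')            # removes only the first 'food'
print(bag)                    # ['red', 'food']
print(bag.count('food'))      # 1
```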

In [63]:
everything = ['Taylor', 22, None]
to_add = ['Swift', 1989, "December", 1, 2, 3]
long_list = [to_add] * 1000

In [64]:
# concatenating and combining lists
# method 1 (the PLUS method)


def method1(everything, long_list):
    for chunk in long_list: 
        everything = everything + chunk

%timeit method1(everything, long_list)

6.04 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [66]:
# method 2 (the extend method)
def method2(everything, long_list):
    for chunk in long_list:
        everything.extend(chunk)

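# Caution: `%timeit method2` only times looking up the name method2; it never
# calls the function, so the result below is not comparable with method 1.
# A real measurement needs the call: %timeit method2(everything, long_list)
%timeit method2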

23.7 ns ± 0.683 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


### <span style="color:red"> Note that list concatenation by addition is a comparatively expensive operation since a new list must be created and the objects copied over. Using extend to append elements to an existing list, especially if you are building up a **large list**, is usually preferable. </span>
    
If you are only appending a small list, the addition method is fine too.
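
The difference between the two approaches can be seen without timing anything. This is a small sketch with made-up names (`base`, `alias`, `combined`): `+` builds a new list object, while `extend` grows the existing one in place.

```python
base = ['Taylor', 22, None]
alias = base                       # second name for the same list object

combined = base + ['Swift', 1989]  # builds a brand-new list
print(combined is base)            # False
print(len(base))                   # 3, base is unchanged

base.extend(['Swift', 1989])       # grows the existing list in place
print(alias is base)               # True
print(len(alias))                  # 5
```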

In [69]:
# Sorting a list
a = [10, 13, 2, 4,3,2, 88, 23]
a.sort()

In [70]:
a

[2, 2, 3, 4, 10, 13, 23, 88]

In [72]:
b = ['saw', 'small', 'He', 'foxes', 'six']
b.sort(key=len)
# note the extra sort **key** argument, used here to sort words according to their length

b

['He', 'saw', 'six', 'small', 'foxes']

## Binary search 

The built-in bisect module implements binary search and insertion into a sorted list. `bisect.bisect` finds the location where an element should be inserted to keep the list sorted, while `bisect.insort` actually inserts the element into that location:

In [81]:
import bisect
c = [1, 2, 2, 2, 3, 4, 7]
bisect.bisect(c,2) # returns index of the insert location

4

In [83]:
bisect.bisect(c, 5)

6

In [85]:
bisect.insort(c, 6) # insort actually inserts the element to the insert location
c

[1, 2, 2, 2, 3, 4, 6, 7]

## Slicing

In [87]:
seq = [7, 2, 3, 7, 5, 6, 0, 1]
seq[1:5] # outputs seq with index 1 to 4 (5 is excluded)

[2, 3, 7, 5]

In [88]:
seq[3:4] = [6, 3]

In [90]:
seq # the number 7 in seq[3] is replaced by [6, 3]

[7, 2, 3, 6, 3, 5, 6, 0, 1]

In [91]:
seq[:5] # outputs seq with index 0 to 4 
# Outputs first 5 elements

[7, 2, 3, 6, 3]

In [92]:
seq[5:] # outputs the elements from index 5 to the end

[5, 6, 0, 1]

In [94]:
seq[-4:] # Outputs last 4 elements

[5, 6, 0, 1]

In [95]:
seq[-6:-2] # Outputs the last 6 elements except the last 2 elements

[6, 3, 5, 6]

In [97]:
seq[::2] # number after the 2nd colon is the increment

[7, 3, 3, 6, 1]

In [98]:
seq[::-1] # reversing the list

[1, 0, 6, 5, 3, 6, 3, 2, 7]

### Built-in Sequence Functions
- enumerate
    - returns a sequence of (index, value) tuples

In [100]:
some_list = ['foo', 'bar', 'baz']
mapping = {}
for i, v in enumerate(some_list):
    mapping[v] = i
    
mapping

{'bar': 1, 'baz': 2, 'foo': 0}

- Sorted

In [2]:
sorted([10,2,3,5,4,67,32])

[2, 3, 4, 5, 10, 32, 67]

- zip

In [36]:
seq1 = ['star', 'iron', 'batman']
seq2 = ['wars', 'man', 'joker']

zipped = zip(seq1, seq2)
zipped

<zip at 0x1bb6ec25b08>

In [37]:
list(zipped)

[('star', 'wars'), ('iron', 'man'), ('batman', 'joker')]

In [40]:
for i, (a, b) in enumerate(zip(seq1, seq2)):
    print('{0}: {1}, {2}'.format(i, a, b))

0: star, wars
1: iron, man
2: batman, joker


In [41]:
# unzip sequence

pitchers = [('Nolan', 'Ryan'), ('Roger', 'Clemens'), ('Schilling', 'Curt')]
first_names, last_names = zip(*pitchers)

In [42]:
first_names

('Nolan', 'Roger', 'Schilling')

In [43]:
last_names

('Ryan', 'Clemens', 'Curt')

- reversed

In [44]:
list(reversed(range(10)))

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

#### dict (hash map / associative array)
- likely the most important built-in Python data structure
- stores key-value pairs

In [64]:
movieRatings = {"Joker": 5, "Endgame" : 4, "Split": 3}
movieRatings["Weathering with you"] = 4
movieRatings

{'Endgame': 4, 'Joker': 5, 'Split': 3, 'Weathering with you': 4}

In [65]:
"Weathering with you" in movieRatings

True

In [66]:
# Delete using del or pop

del movieRatings["Joker"]
movieRatings

{'Endgame': 4, 'Split': 3, 'Weathering with you': 4}

In [67]:
end = movieRatings.pop("Endgame")
end

4

In [68]:
list(movieRatings.keys())

['Split', 'Weathering with you']

In [69]:
list(movieRatings.values())

[3, 4]

In [71]:
# one liner - dicts from sequences

mapping = dict(zip(range(5), reversed(range(5))))
mapping

{0: 4, 1: 3, 2: 2, 3: 1, 4: 0}

In [76]:
# dict.get(key, default value)
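
As a small sketch of the `dict.get(key, default)` pattern noted above (the `ratings` dict is a made-up stand-in for `movieRatings`), `get` returns the default instead of raising `KeyError` when the key is missing:

```python
ratings = {'Split': 3, 'Weathering with you': 4}

found = ratings.get('Split', 0)      # key exists, so its value is returned
missing = ratings.get('Joker', 0)    # key missing, so the default is returned
no_default = ratings.get('Joker')    # default is None when none is given

print(found, missing, no_default)    # 3 0 None
```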

## Default values

In [79]:
words = ['apple', 'bat', 'bar', 'atom', 'book']
by_letter = {}
for word in words:
    letter = word[0]
    if letter not in by_letter:
        by_letter[letter] = [word]
    else:
        by_letter[letter].append(word)
        
by_letter

{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}

The previous code block can be rewritten with `setdefault`:

In [80]:
by_letter = {}
for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)
    
by_letter

{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}

or use `defaultdict` from the `collections` module:

In [83]:
from collections import defaultdict
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)

by_letter

defaultdict(list, {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']})

In [84]:
# Check if an object is hashable (hashability)
# hashable objects are valid dict key types

hash('string')

5644831557549205984

In [85]:
hash([1,2,3])

TypeError: unhashable type: 'list'
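
A minimal sketch of the hashability check above, wrapped in a helper. The name `is_hashable` is hypothetical (it is not a built-in); it simply tries `hash` and catches the `TypeError`. A tuple is hashable only if everything inside it is, and converting a list to a tuple is one way to use it as a key:

```python
def is_hashable(obj):
    """Hypothetical helper: True if obj can be a dict key or set element."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable('string'))      # True
print(is_hashable((1, 2, 3)))     # True
print(is_hashable([1, 2, 3]))     # False
print(is_hashable((1, [2, 3])))   # False, the tuple contains a list

lookup = {}
lookup[tuple([1, 2, 3])] = 5      # convert the list to a tuple to use it as a key
print(lookup)                     # {(1, 2, 3): 5}
```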

#### set 
- unordered collection of unique elements

In [90]:
a = set([2, 2, 2, 1, 3, 3])
a

{1, 2, 3}

In [91]:
# or
{2, 2, 2, 1, 3, 3}

{1, 2, 3}

In [92]:
b = set([3,54,5,4,21,1])
a.union(b)

{1, 2, 3, 4, 5, 21, 54}

In [94]:
a.intersection(b)

{1, 3}
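
The same set operations are also available as binary operators. The sketch below recreates `a` and `b` with the same contents as above and adds difference and symmetric difference, which the cells above do not show:

```python
a = {1, 2, 3}
b = {3, 54, 5, 4, 21, 1}

print((a | b) == a.union(b))          # True
print((a & b) == a.intersection(b))   # True

only_in_a = a - b                     # elements of a that are not in b
in_exactly_one = a ^ b                # elements in a or b, but not both
print(only_in_a)                      # {2}
print(sorted(in_exactly_one))         # [2, 4, 5, 21, 54]
```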

### List, Set, and Dict Comprehensions
```
dict_comp = {key-expr : value-expr for value in collection
if condition}
```
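
The three comprehension forms share the same shape and differ only in their brackets. The sketch below uses an illustrative `strings` list (an assumption, not data from this notebook):

```python
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

# list comprehension: [expr for value in collection if condition]
upper_long = [s.upper() for s in strings if len(s) > 2]
print(upper_long)                 # ['BAT', 'CAR', 'DOVE', 'PYTHON']

# set comprehension: {expr for value in collection if condition}
unique_lengths = {len(s) for s in strings}
print(sorted(unique_lengths))     # [1, 2, 3, 4, 6]

# dict comprehension: {key-expr: value-expr for value in collection if condition}
index_of = {s: i for i, s in enumerate(strings)}
print(index_of['bat'])            # 2
```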

In [97]:
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
flattened = [x for tup in some_tuples for x in tup]
flattened

[1, 2, 3, 4, 5, 6, 7, 8, 9]

In [98]:
[[x for x in tup] for tup in some_tuples]

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

## Functions
### As a rule of thumb, if you anticipate needing to repeat the same or very similar code **more than once**, it may be worth writing a reusable function.

Function arguments
    - positional arguments
    - keyword arguments, commonly used to specify default values or optional arguments
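
A short sketch of the two kinds of arguments. The function `scale` and its parameters are hypothetical, chosen only to show how defaults and keywords behave:

```python
def scale(value, factor=2, offset=0):
    """value is positional; factor and offset are keyword arguments with defaults."""
    return value * factor + offset

print(scale(5))                      # 10, both defaults used
print(scale(5, 3))                   # 15, factor passed by position
print(scale(5, offset=1))            # 11, factor keeps its default
print(scale(5, offset=1, factor=3))  # 16, keywords can come in any order
```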

### Namespaces, Scope, and Local Functions

A scope is also called a namespace. There are two kinds:
    - Global
    - Local

In [100]:
a = None
def bind_a_variable():
    global a
    a = []

bind_a_variable()  # call the function at module level so the global is rebound
print(a)

[]


Warning written on page 71:  
<span style="color:red"> I generally discourage use of the global keyword. Typically global
variables are used to store some kind of state in a system. If you
find yourself using a lot of them, it may indicate a need for object-oriented
programming (using classes). </span>
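
One way to read the warning above: state that would otherwise live in a global variable can be kept on an object. This is only a sketch, and the `Counter` class is a hypothetical example, not code from the book:

```python
class Counter:
    """Keeps its own state instead of relying on a module-level global."""

    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

first = Counter()
second = Counter()
first.increment()
first.increment()
second.increment()

# Each instance tracks its own count, so nothing is shared by accident.
print(first.count, second.count)  # 2 1
```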

### Data cleaning

In [101]:
states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda',
    'south carolina##', 'West virginia?']

import re

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub('[!#?]', '', value)
        value = value.title()
        result.append(value)
    return result

In [102]:
clean_strings(states)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South Carolina',
 'West Virginia']

In [104]:
# or pass functions as arguments to clean_strings function

states = [' Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda',
    'south carolina##', 'West virginia?']

def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result

clean_strings(states, clean_ops)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South Carolina',
 'West Virginia']

### Anonymous (Lambda) Functions


In [108]:
def short_function(x):
    return x * 2
equiv_anon = lambda x: x * 2
equiv_anon(2)

4

In [109]:
def apply_to_list(some_list, f):
    return [f(x) for x in some_list]

ints = [4, 0, 1, 5, 6]
apply_to_list(ints, lambda x: x * 2)

[8, 0, 2, 10, 12]

As another example, suppose you wanted to sort a collection of strings by the number of distinct letters in each string:

In [110]:
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']

In [111]:
strings.sort(key=lambda x: len(set(x)))

In [112]:
strings

['aaaa', 'foo', 'abab', 'bar', 'card']

### Currying: Partial Argument Application

Currying is computer science jargon (named after the mathematician Haskell Curry)
that means deriving new functions from existing ones by partial argument application.

In [113]:
def add_numbers(x, y):
    return x + y

add_five = lambda y: add_numbers(5, y)

In [114]:
add_five(2)

7

In [116]:
# instead of lambda you can use functools

from functools import partial
add_seven = partial(add_numbers, 7)
add_seven(3)

10

### Generators
- Python has a consistent way to iterate over sequences: the iterator protocol
- the iterator protocol is a generic way to make objects iterable

In [117]:
some_dict = {'a': 1, 'b': 2, 'c': 3}
dict_iterator = iter(some_dict)
dict_iterator

<dict_keyiterator at 0x1bb6ee1e228>

In [118]:
list(dict_iterator)

['a', 'b', 'c']
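
The iterator protocol can also be stepped through by hand with `next`, which is roughly what a `for` loop does behind the scenes. A minimal sketch, with `it` and `keys_seen` as illustrative names:

```python
some_dict = {'a': 1, 'b': 2, 'c': 3}
it = iter(some_dict)

keys_seen = [next(it), next(it), next(it)]
print(keys_seen)          # ['a', 'b', 'c']

exhausted = False
try:
    next(it)              # nothing left to yield
except StopIteration:
    exhausted = True      # a for loop stops silently at this point
print(exhausted)          # True

print(list(it))           # [] because the iterator is already consumed
```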

Generators return a sequence of multiple results **lazily**, pausing after each one until the next one is requested.

In [120]:
def squares(n=10):
    print('Generating squares from 1 to {0}'.format(n ** 2))
    for i in range(1, n + 1):
        yield i ** 2
        
gen = squares()
gen
# It is not until you request elements from the generator 
# that it begins executing its code:

<generator object squares at 0x000001BB6EC04A40>

In [121]:
for x in gen:
    print(x, end=' ')

Generating squares from 1 to 100
1 4 9 16 25 36 49 64 81 100 

In [122]:
gen = (x ** 2 for x in range(100)) # a generator expression # still lazy
gen

<generator object <genexpr> at 0x000001BB6ECFE728>

### itertools module

In [123]:
import itertools
first_letter = lambda x: x[0]
names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']
for letter, names in itertools.groupby(names, first_letter):
    print(letter, list(names))

A ['Alan', 'Adam']
W ['Wes', 'Will']
A ['Albert']
S ['Steven']
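
Note that `A` appears twice in the output above: `itertools.groupby` only groups consecutive elements that share a key. If one group per key is wanted, a common approach is to sort by the same key first. A sketch, using a named function and a `grouped` dict that are assumptions for illustration:

```python
import itertools

names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']

def first_letter(name):
    return name[0]

# Sorting by the grouping key puts equal keys next to each other.
sorted_names = sorted(names, key=first_letter)

grouped = {
    letter: list(group)
    for letter, group in itertools.groupby(sorted_names, first_letter)
}
print(grouped)
# {'A': ['Alan', 'Adam', 'Albert'], 'S': ['Steven'], 'W': ['Wes', 'Will']}
```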


## Files and the Operating System

In [125]:
path = 'segismundo.txt'
# By default, the file is opened in read-only mode 'r'.
# Pass the encoding explicitly: the platform default may not be UTF-8,
# which garbles accented characters such as ñ.
f = open(path, encoding='utf-8')

for line in f:
    pass

# The lines come out of the file with the end-of-line (EOL) markers intact
lines = [x.rstrip() for x in open(path, encoding='utf-8')] 

lines

['Sueña el rico en su riqueza,',
 'que más cuidados le ofrece;',
 '',
 'sueña el pobre que padece',
 'su miseria y su pobreza;',
 '',
 'sueña el que a medrar empieza,',
 'sueña el que afana y pretende,',
 'sueña el que agravia y ofende,',
 '',
 'y en el mundo, en conclusión,',
 'todos sueñan lo que son,',
 'aunque ninguno lo entiende.',
 '']

When you use open to create file objects, it is important to explicitly close the file
when you are finished with it. Closing the file releases its resources back to the operating
system:

In [None]:
f.close()
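
An alternative to calling `close` by hand is the `with` statement, which closes the file automatically when the block exits. The sketch below writes its own scratch file (a hypothetical `example.txt` in a temporary directory) so that it does not depend on `segismundo.txt`:

```python
import os
import tempfile

# Hypothetical scratch file so the example is self-contained.
scratch_path = os.path.join(tempfile.mkdtemp(), 'example.txt')

with open(scratch_path, 'w', encoding='utf-8') as handle:
    handle.write('first line\n\nsecond line\n')

with open(scratch_path, encoding='utf-8') as f:
    lines = [line.rstrip() for line in f]

print(lines)     # ['first line', '', 'second line']
print(f.closed)  # True, the with block closed the file on exit
```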

Commonly used file methods
- read
- seek 
- tell
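
A sketch of these three methods on a file opened in binary mode, where positions are plain byte offsets. The scratch file and its text are assumptions, chosen so the file contains one non-ASCII character (`ñ` takes two bytes in UTF-8):

```python
import os
import tempfile

# Hypothetical scratch file containing a non-ASCII character.
scratch_path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(scratch_path, 'w', encoding='utf-8') as handle:
    handle.write('Sueña el rico')

with open(scratch_path, 'rb') as f:   # binary mode: positions are byte offsets
    head = f.read(5)                  # read 5 bytes
    position = f.tell()               # current offset, in bytes
    f.seek(3)                         # jump to byte 3, the start of 'ñ'
    two_bytes = f.read(2)

print(head)                       # b'Sue\xc3\xb1'
print(position)                   # 5
print(two_bytes.decode('utf-8'))  # ñ
```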

In [126]:
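f = open(path, encoding='utf-8')
f.read(10)  # in text mode, read counts characters, not bytes
f.tell()    # the 10 characters read occupy 11 bytes because ñ takes 2 bytes in UTF-8

11

In [127]:
import sys
sys.getdefaultencoding()  # note: open() defaults to the locale's preferred encoding, not this value

'utf-8'

In [128]:
# seek changes the file position to the indicated byte in the file:
f.seek(3)
f.read(1)

'ñ'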

In [129]:
# open both files in the with statement so that both are closed afterwards
with open(path) as source, open('tmp.txt', 'w') as handle:
    handle.writelines(x for x in source if len(x) > 1)