# INFO 212: Data Science Programming 1
___

### Week 2, Lecture 2
___

### Wed., April 11, 2018
---

**Question:**
- What built-in capabilities does Python provide for data analysis?

**Objectives:**
- Create and change sets
- Use comprehensions
- Define and call functions
- Use anonymous functions
- Pass functions as objects
- I/O and coding in Python files

### dict
dict is likely the most important built-in Python data structure. A more common
name for it is hash map or associative array. It is a flexibly sized collection of key-value
pairs, where key and value are Python objects. One approach for creating one is to use
curly braces {} and colons to separate keys and values:
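
A minimal sketch of this syntax, with a hypothetical dict `d1` whose keys and values are chosen purely for illustration:

```python
# Create a dict with curly braces; a colon separates each key from its value.
d1 = {'a': 'some value', 'b': [1, 2, 3, 4]}

# Insert a new key-value pair and look up an existing one.
d1[7] = 'an integer'
value_b = d1['b']

# Check whether a key is present with the in keyword.
has_b = 'b' in d1
```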

The dict methods get and pop can take a default value to be returned, so that
an if-else block that first checks whether the key exists can be written simply as:
```
value = some_dict.get(key, default_value)
```
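
The if-else logic that get replaces might look like the sketch below. The values of `some_dict`, `key`, and `default_value` are placeholders assumed here for illustration.

```python
some_dict = {'a': 1, 'b': 2}
key = 'c'
default_value = 0

# Explicit membership check before reading the key.
if key in some_dict:
    value_if_else = some_dict[key]
else:
    value_if_else = default_value

# The same result in one line with get.
value_get = some_dict.get(key, default_value)

# pop also accepts a default; it removes the key when the key exists.
popped = some_dict.pop('a', default_value)
missing = some_dict.pop('zzz', default_value)
```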

With setting values, a common case is for the values in a dict to be other collections,
like lists. For example, you could imagine categorizing a list of words by their
first letters as a dict of lists:

In [None]:
words = ['apple', 'bat', 'bar', 'atom', 'book']
by_letter = {}
for word in words:
    letter = word[0]
    if letter not in by_letter:
        by_letter[letter] = [word]
    else:
        by_letter[letter].append(word)
by_letter

The setdefault dict method is for precisely this purpose. The preceding for loop
can be rewritten as:
```
for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)
```

The built-in collections module has a useful class, defaultdict, which makes this
even easier. To create one, you pass a type or function for generating the default value
for each slot in the dict:
```
from collections import defaultdict
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)
```

#### Valid dict key types

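Keys of dictionaries must be hashable, which in practice generally means immutable objects such as scalars, strings, or tuples whose elements are all hashable too.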
```
hash('string')
hash((1, 2, (2, 3)))
hash((1, 2, [2, 3])) # fails because lists are mutable
```

```
d = {}
d[tuple([1, 2, 3])] = 5
d
```
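
One way to check hashability without stopping on the error is to wrap hash in a try block. The helper name `is_hashable` below is hypothetical, introduced only for this sketch.

```python
def is_hashable(obj):
    """Return True if obj can be used as a dict key or set element."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

checks = {
    'string': is_hashable('string'),
    'nested tuple': is_hashable((1, 2, (2, 3))),
    'tuple holding a list': is_hashable((1, 2, [2, 3])),
}

# A list can serve as a key once it is converted to a tuple.
d = {}
d[tuple([1, 2, 3])] = 5
```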

### set

A set is an unordered collection of unique elements.
```
set([2, 2, 2, 1, 3, 3])
{2, 2, 2, 1, 3, 3}
```

Set union, intersection, difference, and symmetric difference

In [None]:
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

```
a.union(b)
a | b
```

```
a.intersection(b)
a & b
```
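
Difference and symmetric difference follow the same pattern as union and intersection. A short sketch using the same two sets:

```python
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

# Elements in a that are not in b; the operator form is a - b.
diff = a.difference(b)

# Elements in exactly one of the two sets; the operator form is a ^ b.
sym_diff = a.symmetric_difference(b)
```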

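Sets support the standard mathematical set operations, each with an in-place counterpart such as |= and &=. Like dict keys, set elements must generally be immutable (hashable).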

In [None]:
c = a.copy()
c |= b
c
d = a.copy()
d &= b
d

In [None]:
my_data = [1, 2, 3, 4]
my_set = {tuple(my_data)}
my_set

In [None]:
a_set = {1, 2, 3, 4, 5}
{1, 2, 3}.issubset(a_set)
a_set.issuperset({1, 2, 3})

In [None]:
{1, 2, 3} == {3, 2, 1}

### List, Set, and Dict Comprehensions

List comprehensions are one of the most-loved Python language features. They allow
you to form a new list by filtering the elements of a collection and transforming
the elements that pass the filter, all in one concise expression. They take the basic form:
```
[expr for val in collection if condition]
```

This is equivalent to the following for loop:
```
result = []
for val in collection:
    if condition:
        result.append(expr)
```
The filter condition can be omitted, leaving only the expression. For example, given a
list of strings, we could filter out strings with length 2 or less and convert the remaining ones to uppercase like this:

```
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']
[x.upper() for x in strings if len(x) > 2]
```

Set and dict comprehensions are a natural extension, producing sets and dicts in an
idiomatically similar way instead of lists. A dict comprehension looks like this:
```
dict_comp = {key-expr : value-expr for value in collection if condition}
```

A set comprehension looks like the equivalent list comprehension except with curly
braces instead of square brackets:
```
set_comp = {expr for value in collection if condition}
```

```
unique_lengths = {len(x) for x in strings}
unique_lengths
```

```
set(map(len, strings))
```

```
loc_mapping = {val : index for index, val in enumerate(strings)}
loc_mapping
```

#### Nested list comprehensions

In [None]:
all_data = [['John', 'Emily', 'Michael', 'Mary', 'Steven'],
            ['Maria', 'Juan', 'Javier', 'Natalia', 'Pilar']]

You might have gotten these names from a couple of files and decided to organize
them by language. Now, suppose we wanted to get a single list containing all names
with two or more e’s in them. We could certainly do this with a simple for loop:
```
names_of_interest = []
for names in all_data:
    enough_es = [name for name in names if name.count('e') >= 2]
    names_of_interest.extend(enough_es)
```

A single nested list comprehension does this nicely:
```
result = [name for names in all_data for name in names
          if name.count('e') >= 2]
```

Flatten a list of tuples:
```
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
flattened = [x for tup in some_tuples for x in tup]
```

The equivalent nested for loops:
```
flattened = []

for tup in some_tuples:
    for x in tup:
        flattened.append(x)
```

List comprehension inside a list comprehension:
```
[[x for x in tup] for tup in some_tuples]
```

## Functions

Functions are defined using the def keyword:
```
def my_function(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)
```

```
my_function(5, 6, z=0.7)
my_function(3.14, 7, 3.5)
my_function(10, 20)
```

### Namespaces, Scope, and Local Functions

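Variables assigned inside a function have local scope by default. In the first example below, a exists only while func runs; in the second, func appends to a list defined outside it: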
```
def func():
    a = []
    for i in range(5):
        a.append(i)
```

```
a = []
def func():
    for i in range(5):
        a.append(i)
func()
```

Assigning variables outside of the function’s scope is possible, but those variables
must be declared as global via the global keyword:
```
a = None
def bind_a_variable():
    # global a  # uncomment this line so that the assignment below rebinds the module-level a
    a = []
bind_a_variable()
print(a)
```
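
The effect of the global keyword can be seen by comparing two versions of the function side by side. The function names in this sketch are hypothetical.

```python
a = None

def bind_without_global():
    a = []          # creates a new local a; the module-level a is untouched
    return a

def bind_with_global():
    global a        # assignments below now target the module-level a
    a = []

bind_without_global()
before = a          # still None

bind_with_global()
after = a           # now an empty list
```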

### Returning Multiple Values

Python functions can return multiple values:
```
def f():
    a = 5
    b = 6
    c = 7
    return a, b, c

a, b, c = f()
```

In data analysis and other scientific applications, you may find yourself doing this
often. What’s happening here is that the function is actually just returning one object,
namely a tuple, which is then being unpacked into the result variables. In the preceding
example, we could have done this instead:
```
return_value = f()
```

A potentially attractive alternative is to return a dictionary, so that each value has a name:

```
def f():
    a = 5
    b = 6
    c = 7
    return {'a' : a, 'b' : b, 'c' : c}
```

### Functions Are Objects
Suppose we were doing some data cleaning and
needed to apply a bunch of transformations to the list of strings in the cell below.

Anyone who has worked with user-submitted real-world data has seen messy results
like these. Lots of things need to happen to make this list of strings uniform and
ready for analysis: stripping whitespace, removing punctuation symbols, and standardizing on proper capitalization. One way to do this is to use built-in string methods along with the re standard library module for regular expressions.

In [None]:
states = ['   Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda',
          'south   carolina##', 'West virginia?']

```
import re

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub('[!#?]', '', value)
        value = value.title()
        result.append(value)
    return result
```

In [None]:
clean_strings(states)

An alternative approach that you may find useful is to make a list of the operations
you want to apply to a particular set of strings:
```
def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result
```

In [None]:
clean_strings(states, clean_ops)

You can use functions as arguments to other functions like the built-in map function,
which applies a function to a sequence of some kind:
```
for x in map(remove_punctuation, states):
    print(x)
```

### Anonymous (Lambda) Functions
Python supports defining anonymous functions using the lambda keyword; these are referred to as lambda functions. A lambda function consists of a single expression, whose result is the return value.

```
def short_function(x):
    return x * 2

equiv_anon = lambda x: x * 2
```

```
def apply_to_list(some_list, f):
    return [f(x) for x in some_list]

ints = [4, 0, 1, 5, 6]
apply_to_list(ints, lambda x: x * 2)
```

Lambda functions are especially convenient in data analysis:
- Many data transformation functions take other functions as arguments.
- It’s often less typing (and clearer) to pass a lambda function than to write a full function declaration or even to assign the lambda function to a local variable.

As another example, suppose you wanted to sort a collection of strings by the number
of distinct letters in each string:

In [1]:
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']

```
strings.sort(key=lambda x: len(set(list(x))))
strings
```

In [2]:
strings.sort(key=lambda x: len(set(list(x))))
strings

['aaaa', 'foo', 'abab', 'bar', 'card']

In [3]:
[x for x in strings if len(x) >3]

['aaaa', 'abab', 'card']

### Currying: Partial Argument Application

Currying means deriving new functions from existing ones by partial argument application.
For example, suppose we had a trivial function that adds two numbers together:
```
def add_numbers(x, y):
    return x + y
```

```
add_five = lambda y: add_numbers(5, y)
```

```
from functools import partial
add_five = partial(add_numbers, 5)
```

In [4]:
def add_numbers(x, y):
    return x + y

In [8]:
add_five = lambda y: add_numbers(5, y)

In [11]:
from functools import partial
add_five = partial(add_numbers, 5)

print(add_five)

functools.partial(<function add_numbers at 0x111ff80d0>, 5)
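
To confirm that the lambda and partial versions behave the same way, a short sketch follows. The names `add_five_lambda`, `add_five_partial`, and `add_to_ten` are introduced here only to keep the versions apart.

```python
from functools import partial

def add_numbers(x, y):
    return x + y

# Two ways to fix the first argument at 5.
add_five_lambda = lambda y: add_numbers(5, y)
add_five_partial = partial(add_numbers, 5)

# partial can also fix an argument by keyword.
add_to_ten = partial(add_numbers, y=10)

results = (add_five_lambda(10), add_five_partial(10), add_to_ten(1))
```

Unlike the lambda, the partial object keeps the wrapped function and the fixed arguments available as its func and args attributes, which is what the printed representation above reflects.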


### Generators

Python iterates over sequences and other iterable objects through the iterator protocol:
```
some_dict = {'a': 1, 'b': 2, 'c': 3}
for key in some_dict:
    print(key)
```

In [12]:
some_dict = {'a': 1, 'b': 2, 'c': 3}
for key in some_dict:
    print(key)

a
b
c


An iterator is any object that will yield objects to the Python interpreter when used in
a context like a for loop. Most methods expecting a list or list-like object will also
accept any iterable object. This includes built-in methods such as min, max, and sum,
and type constructors like list and tuple.
```
dict_iterator = iter(some_dict)
dict_iterator
```

In [13]:
dict_iterator = iter(some_dict)

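An iterator can be consumed only once. Both calls below ran after the iterator had already been exhausted, so each returns an empty list; on a fresh iterator, the first call would return ['a', 'b', 'c'].

In [20]: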
list(dict_iterator)

[]

In [21]:
list(dict_iterator)

[]
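
The single-use behavior of an iterator can be traced step by step with next. This is a small sketch that reuses the same dict:

```python
some_dict = {'a': 1, 'b': 2, 'c': 3}
dict_iterator = iter(some_dict)

first_key = next(dict_iterator)      # advances the iterator by one element
remaining = list(dict_iterator)      # consumes whatever is left
second_pass = list(dict_iterator)    # nothing left: the iterator is exhausted

# Calling iter on the dict again gives a fresh iterator.
fresh = list(iter(some_dict))
```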

A generator is a concise way to construct a new iterable object. Generators return a sequence of multiple results lazily, pausing after each one until the next one is requested. To create a generator, use the yield keyword instead of return in a function:
```
def squares(n=10):
    print('Generating squares from 1 to {0}'.format(n ** 2))
    for i in range(1, n + 1):
        yield i ** 2
```

In [22]:
def squares(n=10):
    print('Generating squares from 1 to {0}'.format(n ** 2))
    for i in range(1, n + 1):
        yield i ** 2

When you actually call the generator, no code is immediately executed:
```
gen = squares()
gen
```

In [23]:
gen = squares()
gen

<generator object squares at 0x111fc3938>

The elements will be generated when requested in a loop:
```
for x in gen:
    print(x, end=' ')
```

In [24]:
for x in gen:
    print(x, end=' ')

Generating squares from 1 to 100
1 4 9 16 25 36 49 64 81 100 

#### Generator expressions
A generator expression is written like a list comprehension, but enclosed in parentheses instead of square brackets:

```
gen = (x ** 2 for x in range(100))
gen
```

In [25]:
gen = (x ** 2 for x in range(100))
gen

<generator object <genexpr> at 0x112064360>

It is equivalent to the following more verbose generator function:
```
def _make_gen():
    for x in range(100):
        yield x ** 2
gen = _make_gen()
```

Generator expressions can be passed to other functions:
```   
sum(x ** 2 for x in range(100))
dict((i, i **2) for i in range(5))
```

In [28]:
sum(x ** 2 for x in range(100))
dict((i, i **2) for i in range(5))

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

#### itertools module
The standard library itertools module has a collection of generators for many common
data algorithms. For example, groupby takes any sequence and a function,
grouping consecutive elements in the sequence by return value of the function. Here’s
an example:

```
import itertools
first_letter = lambda x: x[0]
names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']
for letter, names in itertools.groupby(names, first_letter):
    print(letter, list(names)) # names is a generator
```

In [36]:
import itertools
first_letter = lambda x: x[0]
names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']
for letter, names in itertools.groupby(names, first_letter):
    print(letter, list(names)) # names is a generator

A ['Alan', 'Adam']
W ['Wes', 'Will']
A ['Albert']
S ['Steven']
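
Because groupby only groups consecutive elements, 'A' appears twice in the output above. If one group per letter is wanted, one option (an assumption about the goal, not part of the original example) is to sort by the same key first:

```python
import itertools

names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']
first_letter = lambda x: x[0]

# sorted is stable, so names keep their original relative order within a letter.
sorted_names = sorted(names, key=first_letter)

grouped = {
    letter: list(group)
    for letter, group in itertools.groupby(sorted_names, first_letter)
}
```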


### Errors and Exception Handling
In data analysis applications, many functions only work on certain
kinds of input. As an example, Python’s float function is capable of casting a string
to a floating-point number, but fails with ValueError on improper inputs:

```
float('1.2345')
float('something')
```

In [38]:
float('1.2345')
float('something')

ValueError: could not convert string to float: 'something'

Handle errors and exceptions gracefully:
```
def attempt_float(x):
    try:
        return float(x)
    except:
        return x
```

In [39]:
def attempt_float(x):
    try:
        return float(x)
    except:
        return x

In [43]:
attempt_float("1.234")

1.234

In [42]:
attempt_float('1.2345')
attempt_float('something')

'something'

float can raise exceptions other than ValueError:
```
float((1, 2))
```

In [44]:
float((1,2))

TypeError: float() argument must be a string or a number, not 'tuple'

You might want to suppress only ValueError, since a TypeError (the input was not a string or numeric value) might indicate a legitimate bug in your program. To do that, write the exception type after except:

```
def attempt_float(x):
    try:
        return float(x)
    except ValueError:
        return x
```

In [48]:
def attempt_float(x):
    try:
        return float(x)
    except:  # bare except (not the ValueError-only version above), so the TypeError is caught too
        return x

In [49]:
attempt_float((1, 2))

(1, 2)

You can catch multiple exception types by writing a tuple of exception types instead (the parentheses are required):

```
def attempt_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return x
```

In [51]:
def attempt_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return x

In [52]:
attempt_float((1, 2))

(1, 2)

In some cases, you may not want to suppress an exception, but you do want some code to run regardless of whether the code in the try block succeeds. Use finally for this:
```
f = open(path, 'w')

try:
    write_to_file(f)
finally:
    f.close()
```

In [55]:
f = open(path, 'w')

try:
    write_to_file(f)
finally:
    f.close()

NameError: name 'www' is not defined

Similarly, an else block runs only if the try block succeeds:
```
f = open(path, 'w')

try:
    write_to_file(f)
except:
    print('Failed')
else:
    print('Succeeded')
finally:
    f.close()
```
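
The snippets above leave path and write_to_file undefined. A runnable sketch of the same try/except/else/finally flow is shown below. Here `write_to_file` and `safe_write` are hypothetical helpers, a temporary directory stands in for a real path, and a specific exception type is caught instead of using a bare except.

```python
import os
import tempfile

def write_to_file(f, text):
    # Hypothetical helper: writes text to an already-open file handle.
    f.write(text)

def safe_write(path, text):
    """Return a status string and whether the file ended up closed."""
    f = open(path, 'w')
    try:
        write_to_file(f, text)
    except TypeError:
        status = 'Failed'
    else:
        status = 'Succeeded'
    finally:
        f.close()   # runs on both the success and the failure path
    return status, f.closed

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.txt')
    ok = safe_write(path, 'hello')
    bad = safe_write(path, 12345)   # text-mode write rejects non-str input with TypeError
```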

## Files and the Operating System
Most data analysis work uses high-level tools like pandas.read_csv to read data files from
disk into Python data structures. However, it’s important to understand the basics of
how to work with files in Python. Fortunately, it’s very simple, which is one reason
why Python is so popular for text and file munging.

In [59]:
path = '../datasets/segismundo.txt'
f = open(path)

By default, the file is opened in read-only mode 'r'. We can then treat the file handle
f like a list and iterate over the lines like so:
```
for line in f:
    pass  # do something with line
```

In [63]:
for line in f:
    print(line)

Sueña el rico en su riqueza,

que más cuidados le ofrece;



sueña el pobre que padece

su miseria y su pobreza;



sueña el que a medrar empieza,

sueña el que afana y pretende,

sueña el que agravia y ofende,



y en el mundo, en conclusión,

todos sueñan lo que son,

aunque ninguno lo entiende.





The lines come out of the file with the end-of-line (EOL) markers intact, so you’ll
often see code to get an EOL-free list of lines in a file like:
```
lines = [x.rstrip() for x in open(path)]
```

In [65]:
lines = [x.rstrip() for x in open(path)]
lines

['Sueña el rico en su riqueza,',
 'que más cuidados le ofrece;',
 '',
 'sueña el pobre que padece',
 'su miseria y su pobreza;',
 '',
 'sueña el que a medrar empieza,',
 'sueña el que afana y pretende,',
 'sueña el que agravia y ofende,',
 '',
 'y en el mundo, en conclusión,',
 'todos sueñan lo que son,',
 'aunque ninguno lo entiende.',
 '']

Close an opened file to release resources:
```
f.close()
```

In [66]:
f.close()

The with statement automatically closes the file when execution leaves the with block:
```
with open(path) as f:
    lines = [x.rstrip() for x in f]
```

In [68]:
with open(path) as f:
    lines = [x.rstrip() for x in f]
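
The closing behavior can be checked through the file object's closed attribute. The sketch below writes a small throwaway file in a temporary directory, since the lecture's dataset path may not exist on your machine; the file name and contents are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'lines.txt')
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write('first line\nsecond line\n')

    with open(path, encoding='utf-8') as f:
        lines = [x.rstrip() for x in f]
        open_inside = not f.closed   # still open inside the with block

    closed_after = f.closed          # closed automatically on exit
```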

**Caution**: If we had typed f = open(path, 'w'), a new file at ../datasets/segismundo.txt would have been created (be careful!), overwriting any existing file at that path. There is also the 'x' file mode, which creates a writable file but fails if the file path already exists.

For readable files, some of the most commonly used methods are read, seek, and
tell. read returns a certain number of characters from the file. What constitutes a
“character” is determined by the file’s encoding (e.g., UTF-8) or simply raw bytes if
the file is opened in binary mode:
```
f = open(path)
f.read(10)
f2 = open(path, 'rb')  # Binary mode
f2.read(10)
```

In [72]:
f = open(path)
f.read(10)
f2 = open(path, "rb")
f2.read(10)

b'Sue\xc3\xb1a el '

The read method advances the file handle’s position by the number of bytes read.
tell gives you the current position:

```
f.tell()
f2.tell()
```

In [74]:
f.tell()
#f2.tell()

11

Check default encoding:
```
import sys
sys.getdefaultencoding()
```

In [75]:
import sys
sys.getdefaultencoding()

'utf-8'

seek changes the file position to the indicated byte in the file:
```
f.seek(3)
f.read(1)
```

In [76]:
f.seek(3)
f.read(1)

'ñ'

In [77]:
f.close()
f2.close()
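
To try read, tell, and seek without the course dataset, the following sketch writes the first line of the poem to a temporary file (the file name is made up). The position arithmetic is done in binary mode, where offsets are plain byte counts.

```python
import os
import tempfile

text = 'Sueña el rico en su riqueza,\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)

    with open(path, encoding='utf-8') as f:
        chars = f.read(10)          # 10 characters in text mode

    with open(path, 'rb') as f2:
        raw = f2.read(10)           # 10 bytes in binary mode
        pos = f2.tell()             # byte offset after the read
        f2.seek(3)                  # jump to the fourth byte
        two_bytes = f2.read(2)      # the two bytes that encode 'ñ'
```

The text-mode read returns one more visible character than the binary read because 'ñ' occupies two bytes in UTF-8.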

To write text to a file, you can use the file’s write or writelines methods. For example, we could create a file with no blank lines like so:
```
with open('tmp.txt', 'w') as handle:
    handle.writelines(x for x in open(path) if len(x) > 1)
with open('tmp.txt') as f:
    lines = f.readlines()
```

In [79]:
with open('tmp.txt', 'w') as handle:
    handle.writelines(x for x in open(path) if len(x) > 1)
with open('tmp.txt') as f:
    lines = f.readlines()

In [80]:
import os
os.remove('tmp.txt')

### Bytes and Unicode with Files
The default behavior for Python files (whether readable or writable) is text mode,
which means that you intend to work with Python strings (i.e., Unicode). This contrasts
with binary mode, which you can obtain by appending b onto the file mode.
Let’s look at the file (which contains non-ASCII characters with UTF-8 encoding)
from the previous section:

```
with open(path) as f:
    chars = f.read(10)
chars
```

In [81]:
with open(path) as f:
    chars = f.read(10)
chars

'Sueña el r'

UTF-8 is a variable-length Unicode encoding, so when I requested some number of
characters from the file, Python reads enough bytes (which could be as few as 10 or as
many as 40 bytes) from the file to decode that many characters. If I open the file in
'rb' mode instead, read requests exact numbers of bytes:
```
with open(path, 'rb') as f:
    data = f.read(10)
data
```

In [82]:
with open(path, 'rb') as f:
    data = f.read(10)
data

b'Sue\xc3\xb1a el '

Depending on the text encoding, you may be able to decode the bytes to a str object
yourself, but only if each of the encoded Unicode characters is fully formed:
```
data.decode('utf8')
data[:4].decode('utf8')
```

In [83]:
data.decode('utf8')
data[:4].decode('utf8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3: unexpected end of data
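
The same error can be reproduced without a file by encoding a string directly. The last line of this sketch uses errors='ignore', which silently drops the incomplete sequence; whether that is acceptable depends on your data, so treat it as an illustration, not a recommendation.

```python
data = 'Sueña el '.encode('utf-8')   # the same 10 bytes read from the file above

full = data.decode('utf-8')

# Slicing after 4 bytes cuts the two-byte encoding of 'ñ' in half.
try:
    data[:4].decode('utf-8')
    partial_ok = True
except UnicodeDecodeError:
    partial_ok = False

# Lenient decoding that discards the dangling byte.
lenient = data[:4].decode('utf-8', errors='ignore')
```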

Text mode, combined with the encoding option of open, provides a convenient way
to convert from one Unicode encoding to another:
```
sink_path = 'sink.txt'
with open(path) as source:
    with open(sink_path, 'xt', encoding='iso-8859-1') as sink:
        sink.write(source.read())
with open(sink_path, encoding='iso-8859-1') as f:
    print(f.read(10))
```

In [86]:
sink_path = 'sink.txt'
with open(path) as source:
    with open(sink_path, 'xt', encoding='iso-8859-1') as sink:
        sink.write(source.read())
with open(sink_path, encoding='iso-8859-1') as f:
    print(f.read(10))

FileExistsError: [Errno 17] File exists: 'sink.txt'

The FileExistsError here comes from the 'x' in the 'xt' file mode: sink.txt was left over from an earlier run, and 'x' refuses to overwrite an existing file. Removing the file, as the next cell does, lets the code run to completion.

In [87]:
os.remove(sink_path)

Beware using seek when opening files in any mode other than binary. If the file position
falls in the middle of the bytes defining a Unicode character, then subsequent
reads will result in an error:
```
f = open(path)
f.read(5)
f.seek(4)
f.read(1)
f.close()
```

In [89]:
f = open(path)
f.read(5)
f.seek(4)


4

If you find yourself regularly doing data analysis on non-ASCII text data, mastering
Python’s Unicode functionality will prove valuable. See Python’s online documentation
for much more.

## Conclusion
With some of the basics of the Python environment and language now under our
belt, we are ready to move on and learn about NumPy and array-oriented computing in
Python.