# Python Data Types and Structures
Python is a *dynamically typed* language: types belong to objects and are checked at runtime, so a variable name does not need a declared type and can refer to objects of different types over its lifetime. In comparison, compiled languages like `C` are generally *statically typed*: the type of every variable has to be declared before compile time.

*New in Python 3.5*: type hints - when writing functions, developers can annotate which data types are expected for the inputs and the return value, similar to a statically typed language. Hints are not required and are not enforced by the interpreter, but they make the code easier to read and allow external tools to check it.
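
As a minimal sketch of what type hints look like, the hypothetical function `scale` below (not part of the original notebook) annotates its arguments and return value. The second call illustrates that the interpreter does not enforce the hints.

```python
from typing import List


def scale(values: List[float], factor: float = 2.0) -> List[float]:
    """Multiply every value by `factor` (hypothetical example function)."""
    return [v * factor for v in values]


doubled = scale([1.0, 2.5])
print(doubled)  # [2.0, 5.0]

# Hints are not checked at runtime: a str is accepted for `factor`,
# and int * str simply repeats the string.
repeated = scale([1, 2], factor="ab")
print(repeated)  # ['ab', 'abab']
```

A static checker such as mypy would flag the second call, but plain Python runs it without complaint.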

## Basic Data Structures
* Tuple
* List
* Dictionaries (dicts)
* Set

*Python uses zero-based numbering (unlike R, which uses 1-based numbering), so the first element of a tuple or list is `[0]`. Dictionaries are accessed by key rather than by position, and sets do not support indexing at all.*
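
The short sketch below illustrates how each of the four structures is accessed. All variable names are made up for illustration.

```python
seq_list = ['a', 'b', 'c']
seq_tuple = ('a', 'b', 'c')

first_of_list = seq_list[0]    # zero-based: index 0 is the first element
last_of_tuple = seq_tuple[-1]  # negative indices count from the end

lookup = {'x': 10, 'y': 20}
value_for_x = lookup['x']      # dictionaries are accessed by key, not position

unique = {'a', 'b'}
try:
    unique[0]                  # sets have no positions to index
    set_supports_indexing = True
except TypeError:
    set_supports_indexing = False

print(first_of_list, last_of_tuple, value_for_x, set_supports_indexing)
# a c 10 False
```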

### Tuples

In [12]:
# Tuple
t = (1,2.5, 'data')
type(t)

tuple

In [13]:
t[0]

1

In [14]:
type(t[2])

str

In [15]:
t.count('data')

1

In [16]:
t.index(1)

0

### Lists

In [17]:
l = list(t)
l

[1, 2.5, 'data']

In [18]:
type(l)

list

In [19]:
l.append([4,3]) # appends this list as a new item in the list
l

[1, 2.5, 'data', [4, 3]]

In [20]:
l.extend([1,1.4,3]) # adds elements to the list
l

[1, 2.5, 'data', [4, 3], 1, 1.4, 3]

In [21]:
l.insert(1,'insert') # inserts the string 'insert' before the index position = 1
l

[1, 'insert', 2.5, 'data', [4, 3], 1, 1.4, 3]

In [22]:
l.remove('data')
l

[1, 'insert', 2.5, [4, 3], 1, 1.4, 3]

In [23]:
p = l.pop(3)
print(l,p)

[1, 'insert', 2.5, 1, 1.4, 3] [4, 3]


In [24]:
# Slicing
l[2:5]

[2.5, 1, 1.4]

#### List Comprehensions
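List comprehensions are a very compact way to build a list from a loop in a single expression. They read like vectorized calculations, although they still run element by element.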

In [25]:
m = [i**2 for i in range(5)]
m

[0, 1, 4, 9, 16]
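
To make the "compact loop" idea concrete, here is a sketch comparing the comprehension above with the explicit loop it replaces, plus a variant with a filter condition. The variable names are illustrative only.

```python
# Explicit loop
squares_loop = []
for i in range(5):
    squares_loop.append(i**2)

# Equivalent list comprehension
squares_comp = [i**2 for i in range(5)]

# A comprehension can also filter with a trailing `if`
even_squares = [i**2 for i in range(10) if i % 2 == 0]

print(squares_loop == squares_comp)  # True
print(even_squares)                  # [0, 4, 16, 36, 64]
```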

### Dictionaries
Key-value stores. Since Python 3.7, dictionaries preserve insertion order, but items are looked up by key rather than by position.

In [26]:
d = {
    'Name' : 'Angela Merkel',
    'Country' : 'Germany',
    'Profession' : 'Chancellor',
    'Age' : 60
}

type(d)

dict

In [27]:
print(d['Name'], d['Age'])

Angela Merkel 60


In [28]:
d.keys()

dict_keys(['Name', 'Country', 'Profession', 'Age'])

In [29]:
d.values()

dict_values(['Angela Merkel', 'Germany', 'Chancellor', 60])

In [30]:
d.items()

dict_items([('Name', 'Angela Merkel'), ('Country', 'Germany'), ('Profession', 'Chancellor'), ('Age', 60)])

In [31]:
birthday = True
if birthday:
    d['Age'] += 1

print(d['Age'])

61


In [32]:
for item in d.items():
    print(item)

('Name', 'Angela Merkel')
('Country', 'Germany')
('Profession', 'Chancellor')
('Age', 61)


In [33]:
for key in d:  # iterating over a dict yields its keys, not its values
    print(type(key))

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
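
Iterating directly over a dictionary yields its keys, which is why every type printed above is `str`. The sketch below, using a small made-up dictionary, shows the three common ways to iterate.

```python
person = {'Name': 'Ada', 'Age': 36}  # hypothetical sample data

keys = [k for k in person]                    # same as person.keys()
values = list(person.values())                # the values only
pairs = [(k, v) for k, v in person.items()]   # key-value tuples

value_types = [type(v).__name__ for v in person.values()]

print(keys)         # ['Name', 'Age']
print(values)       # ['Ada', 36]
print(value_types)  # ['str', 'int']
```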


## Sets
Sets are unordered collections of hashable objects in which every element appears only once:


In [34]:
s = set(['u','d','ud','du','d','du'])
s

{'d', 'du', 'u', 'ud'}

In [35]:
t = set(['d','dd','uu','u'])

In [36]:
s.union(t) #all of s and t

{'d', 'dd', 'du', 'u', 'ud', 'uu'}

In [37]:
s.intersection(t) # in both s and t

{'d', 'u'}

In [38]:
s.difference(t) # in s but not t

{'du', 'ud'}

In [39]:
t.difference(s) # in t but not s

{'dd', 'uu'}

In [40]:
s.symmetric_difference(t) # in either one but not both

{'dd', 'du', 'ud', 'uu'}
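
The set methods above also have operator equivalents. A brief sketch, using new names `a` and `b` that hold the same elements as the two sets above:

```python
a = {'u', 'd', 'ud', 'du'}
b = {'d', 'dd', 'uu', 'u'}

same_union = (a | b) == a.union(b)
same_intersection = (a & b) == a.intersection(b)
same_difference = (a - b) == a.difference(b)
same_symmetric = (a ^ b) == a.symmetric_difference(b)

# sorted() gives a reproducible display order, since sets are unordered
common = sorted(a & b)
print(same_union, same_intersection, same_difference, same_symmetric)
# True True True True
print(common)  # ['d', 'u']
```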

### Good for getting rid of duplicates in lists!

In [41]:
from random import randint
l = [randint(0,10) for i in range(1000)] # 1000 random integers between 0 and 10 (both inclusive)
len(l)

1000

In [42]:
l[:20]

[9, 8, 6, 0, 2, 0, 7, 2, 7, 0, 10, 8, 6, 9, 5, 4, 5, 10, 5, 4]

In [43]:
s = set(l)
s

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
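
Converting to a set discards the original order of the list. When the order of first appearance matters, one common alternative is `dict.fromkeys`, sketched below under the assumption of Python 3.7 or later (where dictionaries keep insertion order). The sample list is made up.

```python
values = [3, 1, 3, 2, 1]

unique_as_set = set(values)                    # duplicates removed, order not guaranteed
unique_in_order = list(dict.fromkeys(values))  # duplicates removed, first-seen order kept

print(unique_in_order)  # [3, 1, 2]
```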

## Control Structures

### Loops

In [44]:
for element in l[2:5]:
    print(element**2)

36
0
4


## Functions

In [46]:
def f(x):
    return x**2

def even(x):
    return x % 2 == 0

print("f(2) = %3.2f \neven(3) = %r" % (f(2), even(3)))

f(2) = 4.00 
even(3) = False


To apply a function to every element of a list, use `map`. It returns a lazy iterator, so wrap it in `list` to see the results.

In [47]:
x = map(even, range(10))
list(x) #convert the iterable to a list

# OR
[even(val) for val in range(10)]

[True, False, True, False, True, False, True, False, True, False]

In [48]:
x = map(lambda x: x**2, range(10))
list(x)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

### Using functions to filter a list

In [49]:
x = filter(even, range(15))
list(x)

[0, 2, 4, 6, 8, 10, 12, 14]

### Reduce
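Applies a two-argument function cumulatively to the elements of a list and "reduces" it down to one value.

In Python 3, `reduce` is no longer a built-in function; it was moved to the `functools` module. For common cases, a built-in such as `sum` or an explicit loop is usually clearer.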

In [50]:
sum(range(10))

45
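
In Python 3, `reduce` can be imported from the standard-library `functools` module. The sketch below shows a few illustrative reductions; the variable names are made up.

```python
from functools import reduce
import operator

# Same result as sum(range(10))
total = reduce(operator.add, range(10))

# Product of 1..5, for which there is no long-standing built-in like sum
product = reduce(operator.mul, range(1, 6))

# Any two-argument function works, including a lambda
largest = reduce(lambda a, b: a if a > b else b, [3, 9, 4])

print(total, product, largest)  # 45 120 9
```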

## NumPy Data Structures

### Arrays with Python Lists
While base-Python lists are useful and convenient across many applications, NumPy arrays are usually faster for numeric work and are sometimes the expected data type for other libraries, like scikit-learn, or for other calculation-intensive processes.

These arrays can be generalized into vectors and matrices.

In [51]:
# An "array" using a regular Python list:
v = [0.5,0.75,1.0,1.5,2.0] # vector of numbers

In [52]:
# Create a matrix as a list of lists
m = [v, v, v] # matrix of numbers 
m

[[0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0]]

In [53]:
m[1]

[0.5, 0.75, 1.0, 1.5, 2.0]

In [54]:
m[1][0]

0.5

In [55]:
v1 = [0.5,1.5]
v2 = [1,2]
m = [v1,v2]
c = [m,m] # cube of numbers
c

[[[0.5, 1.5], [1, 2]], [[0.5, 1.5], [1, 2]]]

In [56]:
c[1][1][0]

1

In [57]:
v = [0.5, 0.75, 1.0, 1.5, 2.0]
m = [v,v,v]
m

[[0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0],
 [0.5, 0.75, 1.0, 1.5, 2.0]]

In [58]:
v[0] = 'Python'
m

[['Python', 0.75, 1.0, 1.5, 2.0],
 ['Python', 0.75, 1.0, 1.5, 2.0],
 ['Python', 0.75, 1.0, 1.5, 2.0]]
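
The result above shows that `m` holds three references to the same list `v`, not three copies, so changing `v` changes every row. A minimal sketch of the difference, with illustrative names:

```python
row = [0.5, 0.75, 1.0]

shared = [row, row, row]                 # three references to one list object
copied = [list(row) for _ in range(3)]   # three independent copies

row[0] = 'Python'

print(shared[1][0])  # Python  (every "row" is the same object)
print(copied[1][0])  # 0.5     (copies are unaffected)
```

For nested lists deeper than one level, `copy.deepcopy` from the standard library is the usual way to get fully independent copies.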

## Regular NumPy arrays
Unlike base-Python lists, all elements of a NumPy array share the same data type, usually numeric. Since they were built to represent vectors and matrices, there are a good number of built-in functions that are not part of the regular list class.

* Indexing: this works the same as a regular list
* Functions:
    * sum
    * mean
    * std
    * cumsum - running cumulative sum
    
You can also call functions on the array that will be applied across all elements.

In [59]:
import numpy as np
a = np.array([0, 0.5, 1.0, 1.5, 2.0])
a[:2]

array([0. , 0.5])

In [60]:
a.sum()

5.0

In [61]:
a.std()

0.7071067811865476

In [62]:
a.cumsum()

array([0. , 0.5, 1.5, 3. , 5. ])
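
The single-data-type rule can be observed directly. In this sketch (names are illustrative), mixing integers and floats produces a float array, and assigning a string into it fails, unlike the list example earlier.

```python
import numpy as np

ints_and_floats = np.array([1, 2, 3.5])
print(ints_and_floats.dtype)  # float64: the integers were upcast to floats

try:
    ints_and_floats[0] = 'Python'  # cannot be converted to a float
    assignment_ok = True
except ValueError:
    assignment_ok = False

print(assignment_ok)  # False
```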

### Vectorized operations

In [63]:
a*2

array([0., 1., 2., 3., 4.])

In [64]:
np.sqrt(a)

array([0.        , 0.70710678, 1.        , 1.22474487, 1.41421356])

In [65]:
b = np.array([a,a*2])
b

array([[0. , 0.5, 1. , 1.5, 2. ],
       [0. , 1. , 2. , 3. , 4. ]])

In [66]:
b[0]

array([0. , 0.5, 1. , 1.5, 2. ])

In [67]:
b.sum()

15.0

#### Axis calculations
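* Axis = 0: column-wise; the operation runs down the rows and returns one value per column
* Axis = 1: row-wise; the operation runs across the columns and returns one value per row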

In [68]:
b.sum(axis=0)

array([0. , 1.5, 3. , 4.5, 6. ])
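
For comparison, here is a sketch that rebuilds the same two-row array and sums along both axes. The names `col_sums` and `row_sums` are illustrative.

```python
import numpy as np

a = np.array([0, 0.5, 1.0, 1.5, 2.0])
b = np.array([a, a * 2])   # shape (2, 5): 2 rows, 5 columns

col_sums = b.sum(axis=0)   # one value per column, shape (5,)
row_sums = b.sum(axis=1)   # one value per row, shape (2,)

print(col_sums)  # [0.  1.5 3.  4.5 6. ]
print(row_sums)  # [ 5. 10.]
```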

### Initializing NumPy Arrays:
* np.zeros
* np.ones
* np.ones_like
* np.zeros_like

In [69]:
c = np.zeros((2, 3, 4), dtype='i', order='C') # also: np.ones()
c

array([[[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]],

       [[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]], dtype=int32)

In [70]:
d = np.ones_like(c,dtype=np.dtype(float),order='C')
d

array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]])

In [71]:
c = np.zeros((2,3,4),dtype=np.dtype(int),order = 'C')
c

array([[[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]],

       [[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]])

In [72]:
d = np.ones_like(c,dtype = np.dtype(float), order = 'C')
d

array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]])

In [73]:
I = 5000 # side length of the square matrix (5000 x 5000 values)

In [74]:
%time mat = np.random.standard_normal((I,I))

Wall time: 888 ms


In [75]:
%time mat.sum()

Wall time: 48.9 ms


374.8811853619307

## Structured Arrays
Structured arrays allow each named field (column) to have its own NumPy data type.

In [76]:
dt = np.dtype([('Name', 'S10'), ('Age', 'i4'), ('Height', 'f'), ('Children/ Pets', 'i4', 2)]) 
s = np.array([('Smith', 45, 1.83, (0, 1)), ('Jones', 53, 1.72, (2, 2))], dtype = dt) 
s

array([(b'Smith', 45, 1.83, [0, 1]), (b'Jones', 53, 1.72, [2, 2])],
      dtype=[('Name', 'S10'), ('Age', '<i4'), ('Height', '<f4'), ('Children/ Pets', '<i4', (2,))])

In [77]:
s['Name']

array([b'Smith', b'Jones'], dtype='|S10')

In [78]:
s['Height']

array([1.83, 1.72], dtype=float32)

In [79]:
s['Height'].mean()

1.7750001

## Vectorization of Code
Vectorization of code is a strategy for writing more compact code that often also executes faster, because the loops run inside NumPy's compiled routines instead of the Python interpreter.

In [80]:
r = np.random.standard_normal((4, 3))
s = np.random.standard_normal((4, 3))
r + s

array([[-0.12049523, -2.54376717,  0.3687021 ],
       [ 1.56631688, -1.14717175, -0.86685662],
       [-0.49977622, -0.48870262, -0.58263886],
       [ 0.59388697, -0.12134607, -1.42827709]])

In [81]:
2*r+3

array([[0.85164194, 1.92616713, 2.11813878],
       [3.65358392, 0.61508712, 3.34464467],
       [3.68201046, 1.8312175 , 3.04818971],
       [4.22321941, 5.01930732, 2.46636394]])

In [82]:
s = np.random.standard_normal(3)

In [83]:
r + s

array([[-1.49458009, -0.48634902, -0.0451837 ],
       [-0.0936091 , -1.14188903,  0.56806924],
       [-0.07939583, -0.53382384,  0.41984176],
       [ 0.19120865,  1.06022107,  0.12892887]])
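
The last cell adds an array of shape `(3,)` to one of shape `(4, 3)`. NumPy handles this through broadcasting: the shorter array is applied to every row. The sketch below uses small integer arrays (made up for illustration) so the result is easy to follow.

```python
import numpy as np

r = np.arange(12).reshape(4, 3)   # rows: [0 1 2], [3 4 5], [6 7 8], [9 10 11]
s = np.array([10, 20, 30])        # shape (3,) matches the last axis of r

result = r + s                    # s is added to every row
print(result.shape)  # (4, 3)
print(result[0])     # [10 21 32]

# A length-4 vector does not match the last axis, so it cannot be broadcast
per_row = np.array([1, 2, 3, 4])
try:
    r + per_row
    broadcast_ok = True
except ValueError:
    broadcast_ok = False
print(broadcast_ok)  # False

# Reshaping it to a column of shape (4, 1) adds one value per row instead
by_row = r + per_row[:, np.newaxis]
print(by_row[3])     # [13 14 15]
```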

## [Pandas](https://pandas.pydata.org/)
* Series
* Data Frames
* GroupBy Objects
* Pivots

Like data frames in R, the main structure in Pandas is a data frame. We'll start a new Notebook to experiment and explore Pandas further.