# Python Fundamentals - Minimal Viable Python


### Steve Goodman, July 2019

The objective of these sessions is to advance your Python knowledge beyond the very basics of "hello world", with a particular focus on data analysis.
This is part 1 of 2. In this part we will walk through some key, fundamental Python constructs that lead into Part 2, where we will tour the main data analysis libraries.

It is not a complete introduction to the language - there are hundreds of resources out there that will teach you the very basics at your own pace - but a curated list of concepts I think you need to learn in order to perform data analysis effectively in Python. 


We are going to cover 1) Built-in Data Structures 2) Iteration 3) Functions 4) Numpy 5) Classes (optional)

# 0. Preliminaries - Jupyter notebooks

Before that, a few words about Jupyter notebooks. The commands below, prefixed with %, are Jupyter-specific convenience functions ("magics") that, among other things, let you interact with the operating system. Some of these commands will be slightly different on MS Windows.

In [None]:
%lsmagic

In [None]:
%ls # %dir on Windows

In [None]:
%pwd

In [None]:
#or  you could use this os library instead - works anywhere not just on Jupyter
import os
os.getcwd()

In [None]:
help(os.getcwd)

In [None]:
os.getcwd?

%run will execute the code cells of another notebook. Note that this currently doesn't point to a real notebook, so it will error out if you execute the cell right now.

In [None]:
%run myothernotebook.ipynb

In [21]:
# Not strictly necessary for today, but I can use it to demo some use-cases
import pandas as pd
import numpy as np
import seaborn as sns 
titanic = sns.load_dataset('titanic')


# 1. Built-in Data Structures
## Lists, tuples, sets and dictionaries

The basic data types are ints, floats and strings. But remember, in Python everything is an object, so even ints have properties and methods.

In [None]:
i = 1
type(i)

In [None]:
dir(i)

In [None]:
i

Python is "dynamically typed", so the type of a variable is determined at run time, not at compile time as in many other languages.
That means no type declarations are necessary (e.g. int i = 1 in C/C++/Java), and a variable name can be rebound to an object of a different type by simple reassignment, like this...
Be careful though, as this can be a source of many bugs.

In [None]:
i = 'Steve'
type(i)

Strings have far more useful methods compared to integers

In [None]:
dir(i)

So we can do this with a method call

In [None]:
i.upper()

## Lists

Moving on to more complex data structures. Lists are used A LOT in Python, so if you absorb just one thing today, make it this section

In [None]:
#lists
weekdays = ['mon', 'tues', 'wed', 'thurs', 'fri']
integers_list = [1,2,3,4,5] 

As per our earlier discussion on dynamic typing, mixing types in a list is perfectly legal syntax

In [None]:
hetrogeneous_list = ['mon', 1, 'tues', 2]

However, there are some drawbacks. If we do some operations on those lists, the operation may be expecting a particular data type (I will explain more about these operations later).

In [None]:
sum(integers_list)

In [None]:
sum(hetrogeneous_list)

In [None]:
[i.upper() for i in weekdays ]

In [None]:
[i.upper() for i in hetrogeneous_list ]  # fails: the ints in the list have no upper() method

These cases are easily diagnosed. It is harder when you're importing a large data source and expecting numerics, but somehow a string value sneaks in (or vice versa). We will return to this topic when we look at Pandas.
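
As a rough illustration of how you might track down a stray value like that, here is a minimal sketch. The list `readings` and the helper name `find_non_numeric` are made up for this example, and it uses `enumerate` and a list comprehension, both of which are covered later in this lesson.

```python
# Hypothetical data: mostly numbers, with one string that has sneaked in
readings = [10, 20.5, '30', 40]

def find_non_numeric(values):
    """Return (position, value) pairs for items that are not int or float."""
    return [(pos, val) for pos, val in enumerate(values)
            if not isinstance(val, (int, float))]

bad_items = find_non_numeric(readings)
print(bad_items)   # [(2, '30')]
```

Knowing the position of the offending item is usually enough to decide whether to convert it, drop it, or go back and fix the source data.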

Moving on we can select and retrieve a particular element of a list through subscripting

In [None]:
weekdays[0]


We can also take a slice of a list by using a colon. Slicing behaviour is powerful and very commonplace, so study it well. The general form is list[start:end], where the element at start is included and the element at end is not. Indices are zero based in Python, so the index of the first value is 0.

In [None]:
weekdays[0:3]

In [None]:
weekdays[3:]

Negative indices count back from the end of the list. Suppose I want to retrieve the last element:

In [None]:
weekdays[-1]

This example has a double colon :: and the number after the second colon is called the stride. It represents the step size, so for example, suppose I want to take only every other element:

In [None]:
weekdays[::2]

What happens if the stride is a negative number?

In [None]:
weekdays[::-1]

### Use case 
I use slicing a lot in pandas manipulation, where I have a dataset with lots of columns. 
Consider this very simple example of preprocessing the data ready to build a classification model. 
For scikit-learn you typically need a matrix X for your input variables, and a vector y for your class label. 
In this case, the first column 'survived' is the class label.

In [22]:
titanic.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [38]:
#Take all columns, except the first
X = titanic.iloc[:,1:]
X.head()

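|   | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |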


In [39]:
# Take the first column only (position 0), which is the class label 'survived'
y = titanic.iloc[:,0]
y.head()

# of course, I could have done this as well:
# y = titanic['survived']


0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64

### VIEWS and COPY gotchas


You can assign an existing list to another variable like below. It now appears we have 2 lists, but be careful: the 2 variables point to the same list, so if you alter the content of the list it will be reflected in BOTH variables. E.g.

In [None]:
weekdays

In [None]:
weekdays_v2 = weekdays
weekdays_v2

In [None]:
weekdays_v2.append('sat')

As you would expect there is now an extra day on weekdays_v2

In [None]:
weekdays_v2

But the original weekdays has been modified as well! The assignment statement above does not return an independent COPY of the data structure. It just binds a second name to the same underlying list, which therefore behaves like a VIEW of it.

In [None]:
weekdays

The location of the data in memory is the same for both variables 

In [None]:
id(weekdays)

In [None]:
id(weekdays_v2)

So if you want an independent copy of the list to work on without modifying the original, you need to do so explicitly with a copy

In [None]:
weekdays_v3 = weekdays.copy()


Now the memory address will be different

In [None]:
id(weekdays_v3)

So if I now modify this new data structure, the changes don't get passed back to the original list

In [None]:
weekdays_v3.append('sun')
weekdays_v3

In [None]:
weekdays

This default VIEW behaviour is generally desirable when working in Pandas and Numpy, especially with larger datasets, because repeated copying will quickly eat up memory.
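
One caveat on the copy() call used above: it makes a "shallow" copy, so only the outer list is duplicated. The sketch below uses a made-up nested list called `weeks` to show the difference between that and `copy.deepcopy` from the standard library. It is an illustration of the general behaviour, not something you need for the rest of this lesson.

```python
import copy

# Hypothetical nested list: each inner list holds some days
weeks = [['mon', 'tues'], ['wed', 'thurs']]

shallow = weeks.copy()          # new outer list, but the SAME inner lists
deep = copy.deepcopy(weeks)     # new outer list AND new inner lists

weeks[0].append('sat')          # modify one of the inner lists in place

print(shallow[0])   # ['mon', 'tues', 'sat'] - the change is visible here
print(deep[0])      # ['mon', 'tues'] - fully independent
```

For a flat list of strings or numbers, like weekdays, the shallow copy is all you need.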

Lists also have lots of other methods besides append() that you can apply. I'll demo some (and how we can use a list to make our own custom data structure) later in the lesson

## Tuples

Tuples are effectively read-only lists. You may not want to create them yourself very often, but we'll see a few examples later where they are useful, and you should at least be able to recognise them.

In [None]:
#either form is acceptable
tup = (1,2,3)
tup2 = 1, 2, 3

In [None]:
print(tup)

In [None]:
type(tup)

In [None]:
#Since they are read-only, there are far fewer methods available compared to lists
dir(tup)
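
To make "read-only" concrete, here is a small sketch of what happens when you try to change a tuple, and one common workaround (converting to a list first). The variable names `as_list` and `error_message` are just for illustration.

```python
tup = (1, 2, 3)

try:
    tup[0] = 99          # item assignment is not allowed on a tuple
except TypeError as err:
    error_message = str(err)
    print(error_message)

as_list = list(tup)      # convert to a list if you need to modify it
as_list[0] = 99
print(as_list)           # [99, 2, 3]
print(tup)               # (1, 2, 3) - the original tuple is unchanged
```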

Here's a common use case - functions by default always return one and only one value - but you can return multiple results from the function by creating a tuple, and then get the results either as a single object OR by "tuple unpacking"

In [None]:
def calc_stats(numbers):
    lo = min(numbers)
    hi = max(numbers)
    tot = sum(numbers)
    avg = tot/len(numbers)
    return tot, avg, hi, lo



#access like this:
results = calc_stats([1,2,3,4])

#or like this:
total, avg, high, low = calc_stats([1,2,3,4])


In [None]:
results[0]

In [None]:
total

## Dictionaries

Dictionaries (called associative arrays or maps in other languages) work kind of like a phone book (remember those?) or the Index at the back of a normal book (remember those?). 

Data is stored as "key-value pairs". The "key" is how you access/lookup individual items by name, and therefore needs to be unique. The "key" exists purely for lookup purposes and must be a hashable type (in practice, a read-only/immutable one), so an int, a string or a tuple is fine but a list is not. The "value" usually contains the meat of the thing you want to store, the "data", such as a record containing multiple fields perhaps, and could be almost any object (potentially a very complex one). The "value" is not read only, and can be changed at will.

Unlike lists, items are looked up by key rather than by position. Since Python 3.7 a dict does remember the order in which its keys were inserted, but you cannot index it with an integer position.

In this example the key is the surname, and the value (or data) is first name

In [None]:
adict = {'Idle': 'Eric' , 'Cleese': 'John', 'Jones':'Terry', 'Palin':'Michael'}

In [None]:
adict['Idle']

In [None]:
#another way to create a dict -  same result as above, the creation is just different - you may see either form

In [None]:
bdict = dict(Idle= 'Eric' , Cleese= 'John', Jones= 'Terry', Palin='Michael')
bdict

Wait up - we're missing one of the Pythons - let's add him with an assignment statement

In [None]:
adict['Gilliam'] = 'Terry'
adict

Note, since there are 2 Terrys in Monty Python, using the first name would be a bad choice for the key because it needs to be unique. If we make a second assignment to the key Terry, then the second value would just overwrite the first one
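
Here is a quick sketch of that overwriting behaviour, using a separate made-up dict called `by_first_name` so that adict is left alone:

```python
# Hypothetical dict keyed on first name instead of surname
by_first_name = {}
by_first_name['Terry'] = 'Jones'
by_first_name['Terry'] = 'Gilliam'   # same key: replaces the earlier value

print(by_first_name)        # {'Terry': 'Gilliam'}
print(len(by_first_name))   # 1
```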

As I said earlier, the Key could be a tuple as well

In [None]:
tup = ('Ministry','of','walks')
adict[tup] = 'somevalue'
adict

So that works, but it's a bit silly, so let's delete that element

In [None]:
del adict[tup]
adict

We can't however use a list as the key 

In [None]:
adict[weekdays]= 'somevalue'

Although the Value part of the dictionary *could*  be a list (and when we look at Pandas, we'll show this as one way of creating a DataFrame)

In [None]:
adict['mylist'] = weekdays
adict

Again, weekdays doesn't belong here so let's get rid of it

In [None]:
del adict['mylist']
adict

What happens when we try to access with a key that doesn't exist?



In [None]:
adict['Monty']
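
A missing key raises a KeyError. The sketch below shows three common ways of guarding against that, using a small stand-in dict called `pythons` rather than adict itself. Which one you pick is mostly a matter of taste and context.

```python
pythons = {'Idle': 'Eric', 'Cleese': 'John'}   # small stand-in for adict

# 1) Test for the key before using it
if 'Monty' in pythons:
    print(pythons['Monty'])

# 2) .get() returns None (or a default you supply) instead of raising
missing = pythons.get('Monty')
with_default = pythons.get('Monty', 'unknown')

# 3) Catch the exception explicitly
try:
    pythons['Monty']
    caught = False
except KeyError:
    caught = True

print(missing, with_default, caught)   # None unknown True
```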

We're about to head over to Iteration, so here are a couple of dict methods whose results can be iterated over. We could just retrieve the keys

In [None]:
adict.keys()

Or just the values

In [None]:
adict.values()

Or we could retrieve both keys and values as tuples

In [None]:
adict.items()

## Sets

Sets are less frequently used. They work like mathematical sets, so:

- there is no ordering
- there are no duplicate items
- you can perform standard set operations on 2 sets, like union, intersection etc.

In [None]:
seta = set([1,2,3,4])
setb = set([4,5,6,7])

In [None]:
seta.intersection(setb)

In [None]:
seta.union(setb)

Passing it a list containing dupes removes the duplicates

In [None]:
set([1,2,2,2,4,4,4])

In [None]:
#we can also convert the set back to a list like so -  a common way of removing dupes from a list

In [None]:
list(set([1,2,2,2,4,4,4]))
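
Because sets have no ordering, the list you get back is not guaranteed to keep the original order. If the order matters, one commonly used alternative (an addition here, not part of the original lesson) is `dict.fromkeys`, which relies on dicts keeping insertion order in Python 3.7+. The list `days` is made up for the example.

```python
days = ['wed', 'mon', 'wed', 'tues', 'mon']   # hypothetical list with dupes

unordered = list(set(days))           # unique, but order is not guaranteed
ordered = list(dict.fromkeys(days))   # unique, first-seen order kept (Python 3.7+)

print(sorted(unordered))   # ['mon', 'tues', 'wed']
print(ordered)             # ['wed', 'mon', 'tues']
```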

# 2. Iteration

Now we have our data structures, we will want to iterate over them and process the individual items somehow. The most common way is a for loop. 

In [None]:
mylist = [1,2,3,4,5]
for i in mylist:
    print(i)

Creating a list of integers as above is tedious, so there is a more convenient way to create an ascending sequence of integers (note that, as with slicing, the second argument is not included in the sequence)

In [None]:
for i in range(0,5):
    print(i)

And just like slicing, we can include a third parameter which is the stride or step size

In [None]:
for i in range(0,5,2):
    print(i)

and we can reverse the ordering of the range and iterate over it backwards

In [None]:
for i in range(5,0,-1):
    print(i)

Turns out range is a data structure in its own right of type 'range'

In [None]:
type(range(0,5))


you can't normally see the individual elements...

In [None]:
range(0,10)

unless you pass it to a list or a loop

In [None]:
list(range(0,10))

But we can also iterate over non numeric data structures too (sometimes called "foreach" in other languages)

In [None]:
for day in weekdays:
    print(day)

Sometimes you want to iterate over items like this but at the same time you want to know the integer index/position in the list. You can do it like this

In [None]:
position = 0
for day in weekdays:
    print(position, day)
    position = position + 1 #explicit so you can see whats happening "position += 1" is more commonly used

That's a little ugly. Here's a more Pythonic version with enumerate that calculates position automatically

In [None]:
for something in enumerate(weekdays):
    print(something)

Enumerate returns a tuple on each iteration. We can make use of tuple unpacking to store the elements of each tuple in separate variables

In [None]:
for position, day in enumerate(weekdays):
    print(f"position={position}, day={day}")

In [None]:
#we can iterate over Dictionaries as well - since Python 3.7 the items come back in insertion order (older versions made no ordering guarantee)

In [None]:
for item in adict:
    print(item)

That's just the keys however. We typically want the values as well. Remember how we did that earlier?

In [None]:
for item in adict.items():
    print(item)

Again we can use tuple unpacking to store the keys and values in separate variables

In [None]:
for key, value in adict.items():
    print(f"key={key}, value={value}")

Sometimes we want to iterate over a data structure and generate a list as a result. Let's calculate the square of a sequence of numbers and return them as a list

In [None]:
squares = []
for x in range(1,5):
    squares.append(x*x)
squares

That's a common operation but a bit clunky, here's a more pythonic approach - called a List comprehension

In [None]:
squares = [x*x for x in range(1,5)]
squares

We can also add a filter operation to this list with IF 

In [None]:
[x*x for x in range(1,5) if x*x % 2 ==0]

We can iterate over the dict we've just seen as well. Let's join the surname (the key) and the first name (the value) together to create a full name.
First, here's the way with a standard for loop

In [None]:
full_name = []
for key, value in adict.items():
    full_name.append(key + ' ' + value)
full_name

And this is the more Pythonic list comprehension version 

In [None]:
full_name = [key + ' ' + value for key, value in adict.items() ]

In [None]:
full_name

BTW: we can create dictionary comprehensions if we want to as well. They look similar, but note they use {} rather than [] and also the use of the colon to determine the key/value pair. 

Let's take the full_name list we've just created and turn it back into a dict. 

First you need to know that if we take a single element of the list like "Idle Eric" we can use str.split(' ') to
split it into 2 parts based on the space. The result will be a list...

In [None]:
'Idle Eric'.split(' ')

and we can use indexing [0] [1] to get the 0th and 1st elements of that list



In [None]:
'Idle Eric'.split(' ')[0]

Back to the task at hand - Note the use of colon - the string part before the colon becomes the key, and the string part after the colon is the value

In [None]:
new_dict = {names.split(' ')[0] : names.split(' ')[1]  for names in full_name }
new_dict

That's the end of our look at for loops and comprehensions. You can also use while loops, but they are far less commonly used in Python.
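
For completeness, here is a minimal sketch of a while loop. It suits cases where you don't know in advance how many iterations you need. The starting value of 100 is arbitrary.

```python
# Halve a value until it drops below 1, counting the steps taken
value = 100
steps = 0
while value >= 1:
    value = value / 2
    steps += 1

print(steps, value)   # 7 0.78125
```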

When we get to Numpy and Pandas, it turns out we can iterate over those data structures without even using a for loop, which is even more Pythonic and usually faster. But before that, we will need to understand a bit about functions...

Another use case: suppose I have hundreds of variables in my dataset. It's good practice to prefix or suffix them with some identifier that you can group them by. Suppose I have a long time series and my date variables are prefixed with t_. Here I can just extract the date variables into a list, for further processing that pertains to those variables only and not the remaining fields

In [46]:

model_variables = ['cust_id', 't_period1', 't_period3', 't_period4', 't_period5', 'another_var']

In [47]:
time_variables = [i for i in model_variables if i.startswith('t_') ]
time_variables             

['t_period1', 't_period3', 't_period4', 't_period5']

# 3. Functions

In [None]:
#functions with keyword args
def adder(a, b, c=30, d=40):
    return a + b + c + d

adder(1, 2)
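
The same adder function can be called in several ways. This sketch repeats the definition so that it runs on its own, and shows arguments passed by position, by default and by name:

```python
def adder(a, b, c=30, d=40):
    return a + b + c + d

print(adder(1, 2))                  # 73 - c and d take their defaults
print(adder(1, 2, 3))               # 46 - c is passed by position
print(adder(1, 2, d=0))             # 33 - override d only, by name
print(adder(b=2, a=1, c=0, d=0))    # 3  - keywords can be in any order
```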

In [None]:
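# the only caveat with this scenario is that, as keyword args are optional, you have to check they exist before you access them
# kwargs is a dict, so dict.get() is handy here: it returns None rather than raising an error if the key is missing
# another typical pattern would be to use kwargs.items() to iterate over the keyword args
def func(*args, **kwargs):
    if args:   # positional args are optional too, so check before indexing
        print(f"positional args = {args[0]}")
    if kwargs.get('name'):
        print(f"kw arg = {kwargs['name']}")
    return 
func('hello')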
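
The items() pattern mentioned in the comments above might look something like this. The function name `describe` is hypothetical, and it returns a list of strings rather than printing so that the result is easy to inspect.

```python
def describe(*args, **kwargs):
    """Hypothetical helper: return one line per argument received."""
    lines = [f"positional: {arg}" for arg in args]
    for key, value in kwargs.items():
        lines.append(f"keyword: {key}={value}")
    return lines

print(describe('hello', name='Steve'))
# ['positional: hello', 'keyword: name=Steve']
print(describe())   # [] - no error when nothing is passed
```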

In [None]:
def sum_and_avg(a, b, c):
    tot = a + b + c
    avg = tot/3
    return tot, avg
results = sum_and_avg(1, 2 , 3)

Remember the list comprehension we created earlier that calculated the list of squared values? We can do the same thing using an approach more common in functional languages, with an operation called map.
For map to work we need to pass it 2 things: a function and a data structure.

In [None]:
def square(x):
    return x * x

map(square, [1,2,3,4,5])

map is lazy, so to force it to evaluate we're going to have to wrap it in a list

In [None]:
list(map(square, [1,2,3,4,5]))

As an aside, it turns out we can use arbitrarily complex functions and arbitrarily large data structures with this pattern. MapReduce (the programming model popularised by Hadoop) uses this Map function together with a Reduce function to process huge datasets, because this kind of approach is parallelizable and therefore highly scalable. 
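
For a feel of what the Reduce half does, here is a toy sketch using `functools.reduce` from the standard library. It runs on a single machine and is only meant to show the shape of the pattern, not how Hadoop itself works. The `add` helper is made up for the example.

```python
from functools import reduce

def square(x):
    return x * x

def add(total, x):
    return total + x

squares = map(square, [1, 2, 3, 4, 5])   # "map" step: transform each item
sum_of_squares = reduce(add, squares)    # "reduce" step: combine into one value
print(sum_of_squares)   # 55
```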

Slightly more relevantly, we will use this basic pattern in Pandas to create our own custom functions that iterate over tables of data, using method calls like map() and  apply()
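
As a small preview, and assuming pandas is installed, the same square function can be applied to a column with map(), while apply() with axis=1 works row by row. The toy DataFrame and its column names are made up for illustration.

```python
import pandas as pd

def square(x):
    return x * x

# Hypothetical toy table
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# map() on a single column: the function is called once per element
df['a_squared'] = df['a'].map(square)

# apply() with axis=1: the function is called once per row
df['row_total'] = df.apply(lambda row: row['a'] + row['b'], axis=1)

print(df)
```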

In [None]:
from IPython.display import Image
Image(filename='topgun.jpg')

# 5. Numpy basics 

The data structures we have looked at so far can contain a mixture of types, so Python has to check the type of each element at runtime to determine how to process it. This is slow and takes up extra RAM.
But much of the time we expect our data to contain just a sequence of the same type, be it floats, or ints or whatever. Fixed-type data structures, like Numpy arrays, impose this limitation and can be much faster.
For a more comprehensive overview of Numpy, see:
https://jakevdp.github.io/PythonDataScienceHandbook/02.02-the-basics-of-numpy-arrays.html
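
To see the "one type per array" rule in action, here is a short sketch (assuming Numpy is installed). When you give Numpy mixed input it converts everything to a single common type, rather than storing a mixture as a list would.

```python
import numpy as np

ints = np.array([1, 2, 3])
print(ints.dtype)             # an integer dtype (exact size is platform dependent)

mixed = np.array([1, 2.5, 3])
print(mixed.dtype)            # float64 - the ints were converted to floats

with_text = np.array([1, 'two', 3])
print(with_text.dtype.kind)   # U - everything became a (unicode) string

print(ints * 2)               # [2 4 6] - arithmetic applies to every element
```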



In [None]:
#Arrays - are built into the standard library for efficient, typed data structures.
# Hint - you probably don't need them, use numpy arrays instead - far more useful for numeric computation and are the foundation of Pandas
import array
x = array.array('i',[1,2,3])

In [None]:
import numpy as np

Methods of creating new Numpy arrays

In [None]:
np.ones(10,'int16')

In [None]:
np.zeros(5,'float')

In [None]:
np.linspace(0,1, 5)

This time let's create an ascending sequence of integers. Numpy has its own version of range() called arange that returns a numpy array.

But also, we're going to reshape it so that it's a 2d array/matrix rather than a vector. 

In [None]:
mat = np.arange(1,10).reshape(3,3)
mat

In [None]:
mat.shape

Numpy marks missing values with np.nan, and provides NaN-aware equivalents of the common aggregation functions, such as np.nansum and np.nanmean.
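
A minimal sketch of missing values in Numpy, assuming the usual convention of marking a gap in float data with np.nan. The array `values` is made up for the example.

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])   # hypothetical data with one gap

print(np.isnan(values))      # [False  True False]
print(values.sum())          # nan - a single NaN "poisons" the result
print(np.nansum(values))     # 4.0 - NaN-aware equivalent of sum
print(np.nanmean(values))    # 2.0 - NaN-aware equivalent of mean
```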

## Interlude - some Scipy Fun

We're not covering SciPy in these lessons, as the libraries are vast and specialised, but here's a quick sample

In [None]:
import scipy.special as sp
sp.factorial(5)

In [None]:
from scipy import special, optimize
import matplotlib.pyplot as plt

# Find a maximum point on this Bessel curve (nb: there is no maximize function, so minimize the negated function)
sol = optimize.minimize(lambda x: -special.jv(3, x), 1.0)

x = np.linspace(0, 10, 5000)

# Plot the curve as a line, and mark the maximum that was found with a dot
plt.plot(x, special.jv(3, x), '-', sol.x, -sol.fun, 'o')



In [None]:
sol.x

# 6. Classes (optional)

Python also makes use of object oriented programming. Remember, in Python everything is an object, even simple data types and functions.
You probably don't need to write your own classes, but you will need a general awareness of what they are
to use libraries like Pandas effectively

In [None]:
#Here's a concrete implementation of a simple class that creates a Stack (Last In First Out) data structure
#By convention:
# class name is capitalized
# methods (a fancy name for functions that reside within classes) are lower_case (and normally words are separated by underscores)
# "hidden" variables/properties  are prefixed with a _
# special or "magic" methods are pre/postfixed with double underscores __
# ignore the references to "self" for now. They are important but outside the scope of the lesson

# classes typically contain a combo of data (variables/properties) and operations (methods/functions)
class Stack:
    def __init__(self):
        self._s=[]
        self.name = 'Stack'
    def push(self, element):
        self._s.append(element)
    def pop(self):
        return self._s.pop()
    def __len__(self):
        return len(self._s)  # must return the length, otherwise len(mystack) raises a TypeError


In [None]:
#creating an instance of the class 
mystack = Stack()

#now use the instance (not the class itself )
mystack.push(1)


In [None]:
#here's one way of using magic methods with the built-in len function
len(mystack)


In [None]:
# you can access the underlying properties if you wish
mystack.name
mystack._s


In [None]:
#Here's another example, this time it's a First In First Out data structure, again implemented using a List

class Queue:
    def __init__(self):
        self._q=[]
    def enqueue(self, element):
        self._q.insert( 0, element)  # new items go in at the front of the list...
    def dequeue(self):
        return self._q.pop()         # ...and the oldest item comes off the end
    def __len__(self):
        return len(self._q)          # note: the list is called _q here, not _s
myq = Queue()
myq.enqueue(1)
myq.enqueue(2)
len(myq)
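
One design note: inserting at position 0 of a list gets slower as the list grows, because every existing item has to shift along. For larger queues, a common alternative is `collections.deque` from the standard library. The sketch below is one possible variant, not a replacement for the teaching example above, and the class name `DequeQueue` is made up.

```python
from collections import deque

class DequeQueue:
    """FIFO queue backed by collections.deque (hypothetical alternative)."""
    def __init__(self):
        self._q = deque()
    def enqueue(self, element):
        self._q.append(element)       # add at the right-hand end
    def dequeue(self):
        return self._q.popleft()      # remove from the left-hand end
    def __len__(self):
        return len(self._q)

q = DequeQueue()
q.enqueue('first')
q.enqueue('second')
print(len(q))        # 2
print(q.dequeue())   # first
```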

In [None]:
import pandas as pd
mydf = pd.DataFrame([]) # DataFrame is a Class (note how it's capitalised), mydf is an instance of that class

In [None]:
mydf.aggregate('sum') # this is a method call - note the results will only be applied to the instance "mydf" not all dataframes

concat, however, is just a function from the pandas library (note: it's all lowercase). It's not bound to a particular class instance, BUT in this particular example it takes a list as a parameter, and the list contains 2 dataframe objects! 

In [None]:
pd.concat([mydf, mydf])  # the list holds 2 dataframe objects (here, the same one twice)

# 7. Summary


We have covered most of the main data structures built in to standard Python and a couple of the common programming constructs to iterate over them. We have taken a brief look at 2 third party libraries, Numpy and SciPy that contain their own data structures and algorithms that are more efficient for numerical and data analysis. We will look at how Numpy is used in concert with Pandas next time.
Finally, we've briefly covered classes and functions. You may not need to create your own classes or functions when doing data analysis in Python, but you will need to recognise them when you see them in the third party libraries. We will also see alternative ways of iterating over Python and Numpy data structures without using for loops