# Jupyter notebooks. 

In Jupyter notebooks we use markdown cells to add explanatory text, just like we do with any well **organised** notebook. The special thing about this kind of notebook is that we can bring in our programs, explanations, visualisations and data--keeping it all in the same place.

In Data Science we often work with **pipelines** made up of many small "scripts", each built from different functions. If we are not organised, we will very likely get lost in a forest of code we do not understand, which makes it hard to assemble useful pipelines quickly and effectively.

Very often, we need to keep the foundational theory that supports the implementation of our programs nearby. Markdown cells allow you to write your text in sections; add font variants for bold, italic and code-text; add enumerations; and introduce formulae using the $\LaTeX$ math environment format. For example:

$$
Z = \sqrt{\frac{X}{Y^2}}
$$

which is also very useful.

For more details on Markdown, please check [this cheat sheet](https://www.markdownguide.org/cheat-sheet/).


We will work with Google Colab in this course. However, in my day-to-day work I use [pyenv](https://realpython.com/intro-to-pyenv/) to manage different Python environments and [Jupyter Lab](https://jupyter.org). For example, I use different virtual environments for my Natural Language Processing work and for heavy Network Science work. Virtual environments like those created with _pyenv_ allow me to organise my Python versions and libraries, thus making the deployment of my programs easier, since my versions of Python are not "married" directly to the Python version that is installed in my operating system.

Press SHIFT-ENTER to finish your cell and move on -- it "executes" the cell.

# Python Basics

## Lazy programming

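Python works using an approach we will call *lazy programming* (formally, *dynamic typing*) -- that is, we do not declare a variable's type; a variable comes into existence when we assign it a value. This allows for very fast code prototyping. However, one thing you should keep in mind for the future is that, at some point, it will be a good idea to learn how to add explicit type annotations to your **functions**. In standard Python, annotations do not make code run faster by themselves, but they make functions easier to read and check, and some tools (Cython, for example) can use type declarations to speed code up. We will touch on functions later, and there will be a little more detail in our next session.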
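As a small illustration of creating variables simply by assigning to them, the sketch below binds one name to values of two different types. The variable names are arbitrary and chosen only for this example.

```python
# A variable is created the moment we assign a value to it.
x = 3
first_type = type(x).__name__    # the name currently refers to an int

# The same name can later be bound to a value of a different type.
x = 'three'
second_type = type(x).__name__   # now it refers to a str

print(first_type, second_type)
```

No declaration was needed at any point; Python keeps track of the type of each value for us.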

## Whitespace Matters

Notice that, in the code below, there are no curly braces to identify a block of code. Python knows this is a block of code because we have used **indentation** as information. If you don't indent your code correctly, Python will complain, and often give you errors. Sometimes badly indented code will execute but give you wrong results, especially when you have nested `for` loops. We will speak about `for` loops later today.
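To make the point about nested loops concrete, here is a minimal sketch in which the same statement, `count += 1`, is placed at two different indentation levels. The function names `count_inner` and `count_outer` are made up for this illustration.

```python
def count_inner(rows):
    count = 0
    for row in rows:
        for value in row:
            count += 1   # indented under the inner loop: runs once per value
    return count


def count_outer(rows):
    count = 0
    for row in rows:
        for value in row:
            pass         # the inner loop does nothing here
        count += 1       # indented under the outer loop: runs once per row
    return count


data = [[1, 2], [3, 4]]
print(count_inner(data), count_outer(data))
```

Both functions run without errors, but they answer different questions (how many values versus how many rows), and the only difference is indentation.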

## Every Variable in Python is an Object 

This means that every variable has _values_, but also _methods_ available to it. We will see what this means, as we practise doing things with actual code.
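As a quick sketch of what "methods available to a variable" looks like in practice, the example below calls one method on a string, one on a float and one on a list. The variable names are chosen only for this illustration.

```python
name = 'lucy'
number = 4.0
values = [3, 1, 2]

shouted = name.upper()           # strings have an upper() method
is_whole = number.is_integer()   # floats have an is_integer() method
values.sort()                    # lists have a sort() method that reorders in place

print(shouted, is_whole, values)
```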

In [None]:
# a very simple for loop in Python

for i in [1,2,3,4,5]:
  print(i+2)

3
4
5
6
7


In [None]:
# a very simple if statement in Python

x = 5
if x < 5:
  print('less than five')
else:
  print('five or more')

five or more




Python gives you more whitespace freedom when you are writing code inside parentheses and square brackets.

For example, if we want to initialise a matrix:

In [None]:
# white space matters in code blocks, but not inside brackets and parentheses.

m1 = [[1,2,3], [2,1,2], [3,4,1]]

# is the same as 

m2 = [
       [1,2,3],
       [2,1,2],
       [3,4,1]
]

# but the second one is more readable.
print(m1)
print(m2)

[[1, 2, 3], [2, 1, 2], [3, 4, 1]]
[[1, 2, 3], [2, 1, 2], [3, 4, 1]]


# Python Modules

A lot of what we do in Python is not built into the core language. Very often we need to **load modules** (by importing them into our programs or notebooks). Modules contain functionalities we want to use; some ship with Python in its standard library, while others must be installed separately. For example, let's take regular expressions:

In [None]:
# let's import the library for regular expressions, re
import re

In [None]:
word = input('Enter a string: ')
found = re.search(r'e.*e', word)
if found:
    print(word, 'contains the pattern "e" any number of chars and "e".')
else:
    print(word, 'does not contain the pattern.')

Enter a string: hello willy
hello willy does not contain the pattern.


Notice that here we have used some basic constructs in Python: the control structure `if`, and the `input` function to get data from the user.

## Modules, aliases and sub-modules

Python allows you to refer to an imported module by any name you wish (an alias). Once you define an alias, you must use it when calling the aliased module's functionalities.

In [None]:
# using a dummy alias example to refer to the regular expressions
# re module as maria
import re as maria

word = input('Enter a string: ')
found = maria.search(r'e', word)
if found:
    print(word, 'contains the letter "e".')
else:
    print(word, 'does not contain the letter "e".')

Enter a string: Using Maria instead of re
Using Maria instead of re contains the letter "e".


### Modules and submodules 

Some Python modules are huge (that is they have many sub-modules and functions). It is not a very good idea to import entire modules when all you want to use is a few of their functionalities. For example, the module `collections` is relatively large, and when we import it into our code we are usually only after some specific functions. For this reason, it is highly advisable that you read the documentation of the modules you are using and import only what you really need when possible. 

Let's see an example from `collections`---it contains a useful class called `Counter`, which is not a built-in. `Counter` can tally the elements in a list, so that we know how many times each unique element appears in it.

Let's look at how that works!

In [None]:
# here we import only a part of the collections module.
from collections import Counter

# and we will use what we imported to do a tally count
my_list = ['hello', 'i', 'am', 'hello', 'lucy', 'i', 'mean', 'lucy', 'lucy', 'bell', 'hello', 'michael']
tally_count = Counter(my_list)
print(tally_count)

# now we can ask tally_count how many instances of a given word are in my_list
tally_count['hello']

Counter({'hello': 3, 'lucy': 3, 'i': 2, 'am': 1, 'mean': 1, 'bell': 1, 'michael': 1})


3

In [None]:
# Counter returns its own data structure (a specialised kind of dictionary),
# but we can convert it to the native Python dictionary, a data structure
# of key-value pairs which we will discuss later.
tally_dict = dict(tally_count)
tally_dict

{'am': 1, 'bell': 1, 'hello': 3, 'i': 2, 'lucy': 3, 'mean': 1, 'michael': 1}

In [None]:
tally_dict['lucy']

# Some Native Variables

Consider the variable instantiations below.

```python
i = 3 # this is an integer
j = 0.6 # this is a float
b = True

s = 'this is a string'
l = [1,2,2,3] # this is a native python list
t = ("test 01", 15)
st = {4,5,3} # this is a native python set
d = {'name' : 'Manuel', # this is a dictionary
        'last name': 'Pita',
        'occupation': 'Scientist'}
```

While `i`, `j` and `b` are _single values_, `s`, `l`, `t`, `st` and `d` are _iterables_ (a string, a list, a tuple, a set and a dictionary). There are different ways to refer to the elements of an iterable. For ordered iterables (strings, lists and tuples) we often use the index (integer position), and in Python the first element is at position zero. Sets have no positions, and dictionary values are retrieved by key.
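The sketch below shows the different ways of reaching elements, reusing variables like the ones instantiated above (the dictionary is shortened for brevity).

```python
s = 'this is a string'
l = [1, 2, 2, 3]
t = ('test 01', 15)
st = {4, 5, 3}
d = {'name': 'Manuel', 'occupation': 'Scientist'}

first_char = s[0]              # positions start at zero
first_item = l[0]
last_item = l[-1]              # negative indices count from the end
second_of_tuple = t[1]
occupation = d['occupation']   # dictionaries are accessed by key, not by position

# Sets have no positions, so st[0] would raise a TypeError;
# we can still ask whether an element is present.
has_five = 5 in st

print(first_char, first_item, last_item, second_of_tuple, occupation, has_five)
```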

The information provided below is meant to give you access to the main methods available to the native data types in Python. Remember that everything in Python is an object, and because of this, you have methods and sometimes extra information about your variables.

### integers and floats

With `int` variables what we do, for the most part, is arithmetic.
The basic operators are `+`, `-`, `*`, `/`, `%` (remainder), and `**` (exponent). The division operator `/` always returns a `float`; if you want only the integer part of the quotient, use `//` (floor division).

When using any number representation, we are often interested in basic arithmetic, but also in more complex operations, like trigonometric functions, logarithms and so on (which of course apply to `int` as well).

A lot of the numeric results we get from data analytics are floats. The native representation of floats has a ton of digits representing the fractional part of the number (after the decimal point). We often want to reduce the precision (the number of digits in the fractional part). This can be done simply by:

`round(float_number, n)`

where `float_number` is a variable of type `float` and `n` is the number of decimal digits you want in the rounded result.
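A short sketch of `round` in action, using an arbitrary ratio as the example value:

```python
ratio = 2 / 3                    # a float with many digits after the decimal point

two_decimals = round(ratio, 2)   # keep two decimal digits
no_decimals = round(ratio)       # with n omitted, round returns an int

print(ratio, two_decimals, no_decimals)
```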

Some specific and more complex unary and binary functions are available in the `math` [module](https://docs.python.org/3/library/math.html).

### Strings

In general, strings are represented by any text enclosed in single or double quotes. The following are all valid instantiations of strings in Python:

` s = 'hello'`  
` s = "hello"`  
` s = "hello, 'said' Martin"`  
` s = 'hello, "said" Martin'`   
` s = "hello \"said\" Martin"`

There are many things we do with strings in Python. The basic operations are (1) getting the length of a string; (2) selecting and slicing the string; (3) asking if a string `a` is contained in another string `b`; and (4) splitting strings into lists of words.

1. **Length**. To get the number of characters in a given string `s`, simply call the built-in function `len(s)`.
2. **Selecting**. Like in lists, string elements can be retrieved by indexing using square brackets, and string elements are counted from zero. For example, `s[1]` returns the second character in a string `s`.
3. **Slicing**. Getting a substring between a starting and ending index is done by writing `s[start:end]`, where `start` and `end` are integers, typically between 0 and `len(s)`. This returns all the characters from `start` (inclusive) to `end` (exclusive).
4. **Splitting**. We often get a string that represents a phrase, or even an entire document, and want to split it into a list of words. Calling `s.split()` will split your string using whitespace as the separator. Other ways of splitting are possible too.
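The operations above can be sketched as follows, including the containment test with `in` and a split on a separator other than whitespace. The string `csv_line` is a made-up example.

```python
s = 'this is a string'

length = len(s)            # number of characters, spaces included
contains = 'is a' in s     # is the substring present?
missing = 'list' in s

# split() also accepts an explicit separator
csv_line = 'helen,lucy,matt'
names = csv_line.split(',')

print(length, contains, missing, names)
```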

### Booleans

Boolean variables are used for many tasks in Data Science that require the use of logical operators. The keywords `True` and `False` (case sensitive) are reserved in Python, but in Boolean operations, if you use 1 and 0 instead, they will work the same.

The most common context in which we use Booleans, often without noticing, is when we do comparisons, almost always using the operator `a == b`, which returns True if `a` and `b` have equal values, or `a != b`, which returns True if they do not. To test whether `a` and `b` are literally the same object, use `a is b`.

The basic high-level logical operators `and`, `or` (binary) and `not` (unary) are often used to combine comparisons. The main low-level bitwise operators are `&` (and), `|` (or) and `~` (not). Be careful with `~`: it acts on the bits of an integer, so `~True` evaluates to `-2` rather than `False`; for plain Booleans use `not`.
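The sketch below separates three ideas that are easy to confuse: equality of values (`==`), identity of objects (`is`), and bitwise operators acting on integers. All variable names are chosen only for this illustration.

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a different list with the same contents
c = a           # a second name for the very same list

equal_values = (a == b)   # compares contents
same_object = (a is b)    # compares identity: two separate lists
alias_same = (a is c)     # both names point at one object

# logical operators combine the results of comparisons
in_range = (2 < 3) and not (3 > 5)

# bitwise operators work on the binary digits of integers
bitwise_and = 6 & 3   # 110 & 011 = 010
bitwise_or = 6 | 3    # 110 | 011 = 111

print(equal_values, same_object, alias_same, in_range, bitwise_and, bitwise_or)
```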

### Lists and Tuples

Lists and tuples are collections of elements of any type, and in Python a single list can hold elements of different types. However, it is uncommon to mix data types in a list, because any operation applied to the list is likely to expect inputs of a single type. Lists are mutable. This means you can add, delete, insert and replace elements. You can join two lists using the `+` operator, which when applied to a pair of lists means _concatenation_. Tuples are immutable: use them when you do not want the structure to change. For example, if you are working in a geographical system you may want to keep (x,y) coordinates always as (x,y).

There is a lot to learn about lists, but in this course we will focus more on vectors and matrices using the Numpy module later.
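As a minimal sketch of what mutability means, the example below inserts, replaces and deletes list elements, and then shows that trying to change a tuple raises a `TypeError`. The names `letters` and `point` are made up for this illustration.

```python
letters = ['x', 'y', 'z']

letters.insert(1, 'w')   # insert at position 1 -> ['x', 'w', 'y', 'z']
letters[0] = 'a'         # replace by index     -> ['a', 'w', 'y', 'z']
letters.remove('z')      # delete by value      -> ['a', 'w', 'y']
del letters[1]           # delete by index      -> ['a', 'y']

point = (2.5, 4.0)       # an (x, y) coordinate stored as a tuple
try:
    point[0] = 9.9       # tuples do not support item assignment
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(letters, point, tuple_changed)
```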

### Sets

Sets are also iterable datatypes in Python, but there is no representation of sequence order amongst the elements. In that sense, a set is just a bag of elements, which is consistent with the mathematical meaning of a set. There are a number of set operations that allow us to compute union, intersection, complement, and so on. However, we will not study them in this introduction. One common use of sets in Data Science is finding the unique elements in a list by doing:

`unique_elements = set(my_list)`

where `my_list` is a python list.

### Dictionaries

Python dictionaries are becoming one of the most widely used data structures in Data Science. These structures are based on the idea of a "phone-book". You give them a _key_ (like the name of a person) and you get a _value_ back (in this case the phone number). So, a dictionary is a collection of key-value pairs, and in Python we define it as follows:

`my_dict = {key1 : value1, key2 : value2, ..., keyn : valuen}`

Keys in a Python dictionary have to be _unique_ and _immutable_ (more precisely, hashable).
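The two requirements on keys can be sketched as follows. Assigning to an existing key overwrites its value instead of creating a duplicate, and a mutable object such as a list is rejected as a key. The dictionaries `scores` and `labels` are made up for this illustration.

```python
scores = {'helen': 16, 'lucy': 15}

scores['helen'] = 18   # the key already exists, so its value is overwritten
scores['matt'] = 16    # a new key adds a new pair

# an immutable tuple is a valid key
labels = {(1, 2): 'point A'}

# a mutable list is not a valid key
try:
    bad = {[1, 2]: 'point A'}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

print(scores, labels, list_key_allowed)
```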

In [None]:
# Let us look at some basic arithmetic operations on numbers

i = 3 # this is an integer
j = 0.6 # this is a float


k1 =  j + i #adding numbers is simple
print('adding i plus j yields', k1) 


k2 = i**2 # this is the value of i, squared
print('i squared is ', k2)


k3 =  18 // 5  # integer part of the division
print('the integer part of the quotient between 18 and 5 is ', k3)


# For more complex math functions we have to import math
import math


k4 =  math.sqrt(i) # here we are asking for the square root of i which is i = 3
print('the square root of i is ', k4)


k5 =  math.log2(18) # here the log base 2 of 18
print('the logarithm base two of eighteen is ', k5)





adding i plus j yields 3.6
i squared is  9
the integer part of the quotient between 18 and 5 is  3
the square root of i is  1.7320508075688772
the logarithm base two of eighteen is  4.169925001442312


In [None]:
# Let's look at string basics in this cell

s = 'this is a string'

# string element by index
print('the first element of s is', s[0])

# string slice by indexing [start:end]
print('the first four chars in s are:', s[0:4])

# splitting a string
print('if I split s by white char I get', s.split())

# concatenating two strings
phrase = "hello " + "how are you"
phrase




the first element of s is t
the first four chars in s are: this
if I split s by white char I get ['this', 'is', 'a', 'string']


'hello how are you'

In [None]:
# Let's look at some examples of using Booleans

# Comparison test: True for equality
print('testing 3 == 3 yields', 3 == 3)

# Comparison test: True for inequality
print('testing 3 != 4 yields', 3 != 4)

# logical operators

p = True
q = False

print('the and test for p and q yields: ' , p and q)

print('the or test for p and q yields: ' , p or q)

print('p and not q yields: ' , p and not q)


testing 3 == 3 yields True
testing 3 != 4 yields True
the and test for p and q yields:  False
the or test for p and q yields:  True
p and not q yields:  True


In [None]:
# Let's look at some examples of lists, tuples and sets

# this is a native Python list
l = ['a','b','c','a','d']

# we can append an element to the list
l.append('e')

print('after appending element "e" our list becomes: ', l)

# we can find the first occurrence of an element
pos = l.index('a')

print('the element "a" is first located in position ', pos, 'counting from the start')

# but also from a given index 
pos = l.index('a', 2)

print('the element "a" is first located in position ', pos, 'counting from the 3rd position')


# the plus operator concatenates lists
l = l + ['f', 'g', 'h']

print('concatenating l and [f,g,h] using "+" yields:', l)

# this is a native tuple, remember tuples are immutable
t = ("test", 8)
print('the variable t is the tuple', t)


# this is a native set, remember order and indexing is not relevant in sets
st = {4,5,3} # this is a native python set

# let us use set to get the unique elements in our list "l"
print('the original list l has', l, 'the unique elements are', set(l))
print(set(l))


after appending element "e" our list becomes:  ['a', 'b', 'c', 'a', 'd', 'e']
the element "a" is first located in position  0 counting from the start
the element "a" is first located in position  3 counting from the 3rd position
concatenating l and [f,g,h] using "+" yields: ['a', 'b', 'c', 'a', 'd', 'e', 'f', 'g', 'h']
the variable t is the tuple ('test', 8)
the original list l has ['a', 'b', 'c', 'a', 'd', 'e', 'f', 'g', 'h'] the unique elements are {'a', 'e', 'b', 'h', 'g', 'c', 'd', 'f'}
{'a', 'e', 'b', 'h', 'g', 'c', 'd', 'f'}


In [None]:
# Let's look at some examples of dictionaries

my_dict = {
    'helen' : 16,
    'lucy' : 15,
    'matt' : 16,
    'lucas' : 14,
    'mary': 15

}

# what is the score obtained by Lucas?
print("Lucas' score is:", my_dict['lucas'])

# What are the keys in my dictionary?
print('my dictionary has the following keys', my_dict.keys())

# What are the possible values present in my dictionary?
print('my dictionary has the following values', my_dict.values())

# dictionaries can be nested
my_dict2 ={
    'helen' : {'math': 16, 'biology':13},
    'lucy' : {'math': 15, 'biology':17},
    'mary' : {'math': 14, 'biology':18}
  
}

print("Mary's score in math is given by: ", my_dict2['mary']['math'])

Lucas' score is: 14
my dictionary has the following keys dict_keys(['helen', 'lucy', 'matt', 'lucas', 'mary'])
my dictionary has the following values dict_values([16, 15, 16, 14, 15])
Mary's score in math is given by:  14


# Basic Control Structures 



## The `for` Loop

Below we use a `for` loop to iterate over the members of a dummy variable I created, `my_list` and inside the `for` loop (notice I have indented) I add another control structure, in this case an `if` statement.

In [None]:
my_list = ['this', 'is', 'a', 'list', 'of', 'strings']

# A pythonic for loop would be like this

res = []
for item in my_list:
  if len(item) > 2:
    res.append(item)


In [None]:
res

['this', 'list', 'strings']

I can write a for loop in a more traditional way, like this:

In [None]:
my_list = ['this', 'is', 'a', 'list', 'of', 'strings']
res = []

# a more traditional, less pythonic way would be

for i in range(0, len(my_list)):
  if len(my_list[i]) > 2:
    res.append(my_list[i])



In [None]:
res

['this', 'list', 'strings']

In [None]:
# and this would be *super* pythonic
my_list = ['this', 'is', 'a', 'list', 'of', 'strings']
res = [item for item in my_list if len(item) > 2] # list comprehension
res

['this', 'list', 'strings']

### Ranges

The `range` construct in Python is very widely used. It represents a sequence of integers from one number up to (but not including) another, both of which are often passed as parameters:

In [None]:
my_range = range(5)
my_range

range(0, 5)

The range object behaves much like a list of integers, but it is immutable and produces its values on demand instead of storing them, which makes it efficient for iteration. We can put it in list form:

In [None]:
list(my_range)

[0, 1, 2, 3, 4]

Notice something interesting: the zero is included, but the five is excluded from the list. This keeps ranges consistent with how we count things in Python, starting with zero. In general, any range interval represented in Python is inclusive of the first element, but exclusive of the last.
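The same inclusive-start, exclusive-end convention applies when we give `range` an explicit start, a step, or when we slice a string. A short sketch, with variable names chosen only for this example:

```python
from_two = list(range(2, 6))        # start at 2, stop before 6
by_threes = list(range(0, 10, 3))   # a third argument sets the step
countdown = list(range(5, 0, -1))   # a negative step counts down, stopping before 0

word = 'pyramidal'
first_three = word[0:3]             # slices follow the same convention

print(from_two, by_threes, countdown, first_three)
```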

## The `if` Statement 

The `if` statement executes a block of instructions only when a given test evaluates to `True`. When it does not, the default is to do nothing. We can add an `else` block, and as many intermediate conditions as we need using `elif`. Remember that indentation matters.

In [None]:
# the code below stops at the first match: it tests whether the word has an x, y or z, in that order.
# there is an escape route for the case in which none of these letters is in the word
word = 'pyramidal'

if 'x' in word:
  print('word has an x')
elif 'y' in word:
  print('word has a y')
elif 'z' in word:
  print('word has a z')
else:
  print('word does not have any of the last three letters')

word has a y


How would we test for the presence of the three letters?

In [None]:
word = 'pyramidal'

if 'x' in word:
  print('word has an x')
else:
  print('no x in the word')
if 'y' in word:
  print('word has a y')
else:
  print('no y in the word')
if 'z' in word:
  print('word has a z')
else:
  print('no z in the word')

no x in the word
word has a y
no z in the word


The code above is ugly code. As you learn more Python, you should be able to write this sort of code in a much more elegant and readable way!
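One possible tidier version, offered only as a sketch, loops over the letters instead of repeating the `if`/`else` block three times. The helper name `report_letters` and its default argument are assumptions made for this illustration.

```python
def report_letters(word, letters='xyz'):
    # build a dictionary that maps each letter to True/False
    found = {}
    for letter in letters:
        found[letter] = letter in word
    return found


result = report_letters('pyramidal')

for letter, present in result.items():
    if present:
        print('word has the letter', letter)
    else:
        print('no', letter, 'in the word')
```

Adding a fourth letter to test now means changing one string rather than copying another block of code.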

## The `while` statement 

We use while statements to repeat computations while some condition is true. While loops stop when their condition is no longer met.

In [None]:
i = 10
while i > 0:
  print("repeating", i)
  i -= 1

repeating 10
repeating 9
repeating 8
repeating 7
repeating 6
repeating 5
repeating 4
repeating 3
repeating 2
repeating 1


### Unit increments and decrements

One of the things we do a lot as programmers is to increment (or decrement) a variable by a single unit. Usually all we do is 

`i = i+1`

In Python we simplify this by using a shortcut:

`i += 1` to increment and `i -= 1` to decrement.

In [None]:
while True:
  a = input("\nEnter an integer between 0 and 5: ")

  if not a.isdigit():
    print("not right")
    continue
  elif (int(a) < 0) or (int(a) > 5):
    print('not right') 
  else:
    break
print("Thank you")


Enter an integer between 0 and 5: 8
not right

Enter an integer between 0 and 5: 5
Thank you


# Functions

Functions are essential in the Python philosophy. Most people who do data science with Python believe in the importance of keeping functions as _minimal conceptual units_, by devising them as unbreakable blocks---like atoms. In contrast, many software engineers write procedures that may implement long sequences of heterogeneous steps and computations to achieve a given goal, and this is not necessarily frowned upon. However, bringing that mindset to thinking about functions in Python is not a good idea. The main reason for thinking in conceptually atomic functions in Python is that we want to be able to plug (reuse) these functions into many different analytics **pipelines**. Data science in Python is all about pipelines. With regard to functions in Python, we:

1. **Define them** with zero or more parameters
2. **Call them** to get zero or more results

The emptiest function you can define with a body is the following.

```python
def do_nothing():  
  pass
```  
Notice the keyword `def`, the name of the function (which must begin with a letter or an underscore) and that we must use parentheses. Inside the parentheses we may define the parameters needed by the function. Notice that the function definition line ends with a colon (:) and that it has an indented body.

## Arguments

```python
def echo(word):  
  print(word + ' ' + word)
```  

Use plain argument names when prototyping, and add type annotations when you want your functions to be easier to read and check. Annotating your parameters is easy, for example:

```python
def echo(word: str):
  print(word + ' ' + word)
```  
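**A word of caution:** annotating your parameters does not make Python enforce those types. If you pass a value of another type, Python will not complain at the call; your program will only throw an error if the function body performs an operation that the value does not support (for example, `word + ' '` when `word` is an integer). External type checkers such as mypy can catch these mismatches before the code runs.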

**Another note:** the names listed in a function definition are called its _parameters_, while the actual values passed when the function is called are its _arguments_. In practice the two words are often used interchangeably.
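The following sketch shows how Python treats annotations at run time. Both functions annotate their parameter as `str`; `loose_echo` is a hypothetical helper added only for contrast.

```python
def string_echo(word: str):
    # concatenation only works when word really is a string
    return word + ' ' + word


def loose_echo(word: str):
    # nothing in this body depends on word being a string
    return [word, word]


ok = string_echo('hey')
unchecked = loose_echo(2)   # an int is accepted despite the str annotation

try:
    string_echo(2)          # fails inside the body, at the + operation
    raised = False
except TypeError:
    raised = True

print(ok, unchecked, raised)
```

The error in the last call comes from trying to add an integer and a string, not from the annotation itself.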

### Positional and keyword argument passing

Usually, arguments in Python functions are passed in a positional manner. This means that Python expects the arguments to arrive in the order in which the parameters were defined. So, if you write a function that takes the arguments `day`, `month` and `year`, defined in that order, it would be easy to make a wrong function call, for instance by swapping day and month. To avoid these problems, you can call your function passing explicit argument names and assigning them values. In the following example:

```python
def parse_date(day, month, year):
  return 'the date is ' + str(month) + ', ' + str(day) + ' ' + str(year)
```

you can call:

```python
parse_date(year=1985, month='August', day=28)
```

## Default parameter values

You can specify default values for parameters.
The default value is used if the caller does not provide that argument. In the function definition below, we assume that if the caller does not pass a year, it is implicitly the current year.

```python
def parse_date(day, month, year=2021):
  return 'the date is ' + str(month) + ', ' + str(day) + ' ' + str(year)
```
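A short sketch of calling a function with and without its default argument. The function is redefined here so that the example is self-contained, and the dates are arbitrary.

```python
def parse_date(day, month, year=2021):
    return 'the date is ' + str(month) + ', ' + str(day) + ' ' + str(year)


with_default = parse_date(28, 'August')          # year falls back to 2021
with_year = parse_date(28, 'August', 1985)       # an explicit year replaces the default
with_keyword = parse_date(day=10, month='May', year=1993)

print(with_default)
print(with_year)
print(with_keyword)
```

Parameters with default values must come after the parameters without them in the definition.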


## Return

Functions such as `echo` above are very limited. All they do is print something and come back empty-handed. Very often functions are conceived to take some arguments and return a value. Once we have computed that value, we use the keyword `return` to bring it back to whatever called the function. For example:

```python
def fancy_math(a:int,b:int,c:int):
  result = (a**2 + b) / (b**2 - c)
  return result
```
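To see the difference between returning and printing, the sketch below stores the result of each kind of function in a variable. `shout` is a hypothetical function that only prints; the input values are arbitrary.

```python
def fancy_math(a: int, b: int, c: int):
    result = (a**2 + b) / (b**2 - c)
    return result


def shout(word):
    print(word.upper())   # prints, but has no return statement


value = fancy_math(2, 3, 1)   # (4 + 3) / (9 - 1)
nothing = shout('hello')      # a function without return gives back None

print(value, nothing)
```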

## A short exercise.

Write a function `guess_the_veg` that receives an argument `colour`. If we pass red, the function will return _tomato!_; if we pass green, the function will return _spinach!_; otherwise it will print _i don't know_ and return -1.

In [None]:
# Add your function to this cell. The cells below may help you.




In [None]:
# let's define a simple function called echo, that prints its argument twice


def echo(word):  
  print(word, ' ', word)

echo(2)

# try out what happens when we define the function's only argument to be 
# of type string (str)

def string_echo(word:str):
  print(word + ' ' + word)

string_echo('hey')

2   2
hey hey


In [None]:
# let's define a simple function called extract_letter,
# that removes a letter from a word


def extract_letter(word, ltr):
  aux = []
  for c in word:
    if c != ltr:
      aux.append(c)
  return "".join(aux)



In [None]:
# Functions expect that either you pass the arguments in order
# or that you use the argument names as keywords

def parse_date(day, month, year):
  return 'the date is ' + str(month) + ', ' + str(day) + ' ' + str(year)

parse_date(month='september', day=10, year=1993)

'the date is september, 10 1993'

In [None]:
# an alternative definition of extract_letter 

def extract_letter2(word, ltr):
  # strings are like lists
  s = ""
  for c in word:
    if c != ltr:
      s += c
  return s

In [None]:
extract_letter2('hello', 'o')

'hell'

In [None]:
# and yet another alternative definition of extract_letter 

def extract_letter3(word, ltr):
  # this is a list comprehension
  return ''.join([ c for c in word if c != ltr])

In [None]:
extract_letter3('hello', 'o')

'hell'

In [None]:
# just a little more info on the relationship between strings and lists

print("-".join(['a','b', 'c']))
print("".join(['a','b', 'c']))

a-b-c
abc
