# Python Demo

This notebook is provided to get you familiar with the basics of the Python programming language and the Jupyter Notebook environment.

## Jupyter Basics
A Jupyter notebook consists of a series of 'cells'. Each cell can contain either Python code or markdown text, depending on its cell type. You can change the type of a cell by selecting the cell then going to the menu at the top of the page and selecting Cell -> Cell Type -> (select the type you want).

This cell contains markdown

In [4]:
print('And this one contains code')

And this one contains code


You can create a new cell by selecting Insert -> Cell Above/Below. Try making your own cell.

You can run a cell by selecting it and then clicking Cell -> Run Cells. Running a markdown cell will render the markdown text, while running a code cell will execute the code.

All of the cells in this notebook share the same Python interpreter. This means that if you define a variable in one cell, it will be available in all other cells. It also means that once a cell has defined a variable, that variable continues to exist even if you later edit or delete the code that created it. This can cause confusing results, so it is recommended to use Kernel -> Restart & Run All to reset the interpreter and check that your notebook works correctly with the latest code.

## Python Basics: Variables and Functions

Variables can be defined by simply assigning a value to a variable name.

In [5]:
my_var1 = 6
my_var2 = 'Hello'
my_var3 = 'World!'

In [6]:
# Single-line comments are preceded by '#'
print(my_var1) # The built-in 'print' function outputs a text representation of an object
print(my_var2)
print(my_var3)

6
Hello
World!


Python is a dynamically typed language, which means you do not need to declare the type of your variables. Each value still has a type, and a variable takes the type of whatever value is assigned to it. This type can be seen by using the built-in 'type' function.

In [7]:
type(my_var1)

int

In [8]:
type(my_var2)

str
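
Types are attached to values rather than to names, so a name can be rebound to a value of a different type. Mixing incompatible types, however, raises an error instead of converting silently. The following is a minimal sketch; the variable names are illustrative and not part of the notebook.

```python
# A name takes the type of whatever value it currently holds.
value = 6
first_type = type(value).__name__   # name of the type while value holds 6
value = 'six'
second_type = type(value).__name__  # name of the type after rebinding
print(first_type, second_type)

# Python does not silently convert between unrelated types.
try:
    result = 'Hello' + 6
except TypeError as err:
    result = None
    print('TypeError:', err)

# An explicit conversion makes the intent clear.
print('Hello' + str(6))
```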

Functions are defined using the 'def' keyword.
In Python, code segments are denoted by indents. This means that all of the code which is indented after the function definition is part of the function body.

In [9]:
def my_func1():
    # This is indented and so is part of the function body
    return 1

# This is not indented so the function body has ended.
print(my_func1())

1


Functions can take parameters as input. Parameters can be given a default value by using '='. When calling functions, parameters with default values do not need to be provided as arguments, in which case their default value will be used.

In [10]:
def my_func2(my_param1, my_param2=6, my_param3=1):
    x = my_param1 + my_param2 - my_param3
    return x

print(my_func2(1, 3))
print(my_func2(2))
print(my_func2(2, my_param3=2))

3
7
6
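
As a further sketch of the same function, arguments passed by name can be given in any order, which is handy when only some of the defaults need overriding. The function is redefined here so the example is self-contained.

```python
def my_func2(my_param1, my_param2=6, my_param3=1):
    return my_param1 + my_param2 - my_param3

# Keyword arguments may appear in any order.
a = my_func2(my_param3=2, my_param1=2)       # 2 + 6 - 2
b = my_func2(2, my_param3=2, my_param2=10)   # 2 + 10 - 2
print(a, b)
```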


## Python Basics: Control

Python offers the three basic control structures: while loops, for loops, and if-else statements.

In [11]:
i = 0
while i<10:
    i += 1
print(i)

for i in range(10): # range(n) creates an iterable of all the integers starting from 0 and going up to n-1
    # The for loop statement sets i to be each of these values in turn.
    # Note that this overwrites the previously existing i
    i += 1

# The i variable from the for loop still exists
print(i)

if i > 10:
    print(True)
else:
    print(False)

10
10
False
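
Two related pieces of syntax are worth a quick sketch: 'elif' chains several conditions, and 'break' leaves a loop early. The helper name `describe` is hypothetical and used only for illustration.

```python
def describe(n):
    # Conditions are checked in order; the first one that holds wins.
    if n < 0:
        return 'negative'
    elif n == 0:
        return 'zero'
    else:
        return 'positive'

labels = []
for n in (-2, 0, 5):
    labels.append(describe(n))
print(labels)

# 'break' exits the nearest enclosing loop immediately.
first_multiple = None
for n in range(1, 100):
    if n % 7 == 0:
        first_multiple = n
        break
print(first_multiple)
```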


While Python does have while and for loops, performance-sensitive Python code often avoids explicit loops over large collections. This is because each iteration is executed by the interpreter, which is slow compared to compiled code. Instead, vectorized functions (such as those in NumPy) and comprehensions, which we will cover later, are preferred, as these are usually faster.
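
To make the contrast concrete, here is a small sketch of the same computation written as an explicit loop, with a built-in function, and with a comprehension. It only shows that the results agree; it makes no timing claims, and the variable names are illustrative.

```python
values = list(range(1000))

# Explicit loop: one interpreted step per element.
total_loop = 0
for v in values:
    total_loop += v

# The built-in sum does the same work in a single call.
total_builtin = sum(values)

# A comprehension builds a new list in one expression.
squares = [v * v for v in values]

print(total_loop == total_builtin, squares[:5])
```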

## Python Data Structures
The main data structures we will be using are lists, sets, and dictionaries (dicts). Python dicts store (key, value) pairs, and allow the corresponding value of a key to be retrieved very quickly. They are analogous to hash maps in other programming languages.

In [12]:
my_list = [1, 2, 3]
my_set = {'a', 'b', 'c'}
my_dict = {'d': 4, 'e': 5, 'f': 6}

print(my_list[0]) # list elements can be retrieved by specifying an index (0-based).
print(my_dict['d']) # dict elements can be retrieved by specifying a key.
print('a' in my_set) # individual elements cannot be retrieved from a set by index or key, but the presence of an element can be checked.

1
4
True


Lists and dicts can also be modified after they have been created.

In [13]:
my_list.append(4) # Add another element, 4, to the end of the list.
print(my_list)

my_dict['g'] = 7 # Add another key-value pair, ('g', 7), to the dict.
print(my_dict)

[1, 2, 3, 4]
{'d': 4, 'e': 5, 'f': 6, 'g': 7}
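
Sets can be modified as well, and elements can be removed from all three structures. The sketch below redefines the variables so it runs on its own; the key 'q' and element 'z' are made up for illustration.

```python
my_list = [1, 2, 3, 4]
my_set = {'a', 'b', 'c'}
my_dict = {'d': 4, 'e': 5, 'f': 6, 'g': 7}

my_set.add('z')                 # Add a new element to the set.
my_set.add('a')                 # Adding an existing element has no effect.
last = my_list.pop()            # Remove and return the last list element.
del my_dict['g']                # Remove a key-value pair.
missing = my_dict.get('q', 0)   # get() returns a default instead of raising KeyError.

print(sorted(my_set), my_list, last, my_dict, missing)
```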


Lists and dicts can also be created using comprehensions. Comprehensions allow you to specify a formula for creating a data structure.

In [14]:
my_list2 = [i * i for i in range(10)] # Creates a list of the first 10 square integers.
my_set2 = {i for i in range(10)}
my_dict2 = {i: i*3 + 1 for i in range(10)}

print(my_list2)
print(my_set2)
print(my_dict2)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
{0: 1, 1: 4, 2: 7, 3: 10, 4: 13, 5: 16, 6: 19, 7: 22, 8: 25, 9: 28}


In [15]:
# Comprehensions can also range over the elements in a data structure.
my_list3 = [my_list2[i] * i for i in my_set2]
print(my_list3)

[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
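
Comprehensions can also include a condition. A trailing 'if' filters which elements are kept, while a conditional expression at the front chooses the value to store. This is a small sketch with illustrative variable names.

```python
my_list2 = [i * i for i in range(10)]

# A trailing 'if' keeps only the elements that satisfy the condition.
even_squares = [x for x in my_list2 if x % 2 == 0]

# A conditional expression picks a value for every element instead of filtering.
parity = ['even' if x % 2 == 0 else 'odd' for x in my_list2[:4]]

print(even_squares)
print(parity)
```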


### Exercise
Try using a dict comprehension to create a dict that maps each of the elements in my_list2 to a boolean value representing whether or not the value is in my_set2

In [17]:
# A dict comprehension maps each element of my_list2 to whether it is in my_set2.
dictionary = {a: a in my_set2 for a in my_list2}
print(dictionary)

{0: True, 1: True, 4: True, 9: True, 16: False, 25: False, 36: False, 49: False, 64: False, 81: False}


## Numpy
Numpy contains functions for performing common numerical calculations, on numbers as well as vectors and matrices.

In [18]:
import numpy as np # This statement imports the package numpy and gives it the name np

Numpy provides an array data type, which can be used to store matrix values.

In [19]:
my_array1 = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]) # This is a 2x3 matrix.
my_array2 = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]) # This is a 3x2 matrix.

print(my_array1)
print(my_array1.shape)

print(my_array2)
print(my_array2.shape)

[[1. 2. 3.]
 [4. 5. 6.]]
(2, 3)
[[1. 2.]
 [3. 4.]
 [5. 6.]]
(3, 2)


In [20]:
np.sum(my_array1)

21.0
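
By default np.sum adds up every element. As a brief sketch, the optional `axis` argument restricts the sum to one dimension; the result names below are illustrative.

```python
import numpy as np

my_array1 = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

col_sums = np.sum(my_array1, axis=0)  # Sum down each column, giving 3 values.
row_sums = np.sum(my_array1, axis=1)  # Sum across each row, giving 2 values.
print(col_sums, row_sums)
```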

In [21]:
np.matmul(my_array1, my_array2) # Matrix multiplication

array([[22., 28.],
       [49., 64.]])

In [17]:
# Matrix transpose
print(np.transpose(my_array2))
print(np.transpose(my_array2).shape)

[[1. 3. 5.]
 [2. 4. 6.]]
(2, 3)


Operations such as '+' and '*' can be applied directly to arrays, where they act element-wise. Note that this means the operands must have the same shape, or shapes that NumPy can broadcast to a common shape (for example, an array and a scalar).

In [18]:
my_array1 + np.transpose(my_array2)

array([[ 2.,  5.,  8.],
       [ 6.,  9., 12.]])
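
The sketch below contrasts element-wise '*' with two simple cases of broadcasting, where NumPy stretches a scalar or a single row to match the other operand. The array names `a` and `row` are illustrative.

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # shape (2, 3)
row = np.array([10.0, 20.0, 30.0])                # shape (3,)

elementwise = a * a   # Same shape: multiplies matching entries (not a matrix product).
scaled = a * 2        # A scalar is applied to every entry.
shifted = a + row     # Broadcasting: row is added to each row of a.

print(elementwise)
print(scaled)
print(shifted)
```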

Arrays can also be indexed, just like lists. The first index describes the rows to be selected, and the second index describes the columns to be selected.

In [19]:
print(my_array1[0, 1]) # The element in the first row and second column.

2.0


In [20]:
print(my_array1[1:]) # Everything from the second row onwards.

[[4. 5. 6.]]


In [21]:
print(my_array1[:, 1:]) # Everything from the second column onwards.

[[2. 3.]
 [5. 6.]]


In [22]:
print(my_array1[:-1]) # Everything up to (but not including) the last row

[[1. 2. 3.]]


### Exercise
Try computing the sum of all of the elements in my_array1 except for the last column.

In [22]:
print(np.sum(my_array1[:, :-1]))

12.0


## Pandas

Pandas provides a DataFrame class, which is used to store tabular data. The DataFrame has a list of column names, and each column name is associated with a column of data.

In [23]:
import pandas as pd

DataFrames can be created from a dict that maps each column name to a list of values. Alternatively, if you already have your data in a csv file then you can use pd.read_csv to create a DataFrame object from it, where the first row is used as the column names.

In [24]:
my_df = pd.DataFrame({'c1': [1.0, 2.0, 3.0],
             'c2': ['a', 'b', 'c'],
             'c3': [True, False, True]})

print(my_df)
print(my_df.shape)

    c1 c2     c3
0  1.0  a   True
1  2.0  b  False
2  3.0  c   True
(3, 3)
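
To illustrate the pd.read_csv route without needing a file on disk, the sketch below reads CSV text from an in-memory buffer. The CSV contents are made up to mirror my_df, and `df_from_csv` is an illustrative name; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file; the contents are invented for illustration.
csv_text = "c1,c2,c3\n1.0,a,True\n2.0,b,False\n3.0,c,True\n"

# read_csv accepts a path or a file-like object; the first row becomes the column names.
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv)
print(list(df_from_csv.columns), df_from_csv.shape)
```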


You can index elements in a DataFrame just like an array by using df.iloc[].

In [25]:
my_df.iloc[1:, :2] # From the second row on, the first two columns

    c1 c2
1  2.0  b
2  3.0  c


You can also index a DataFrame by using column names.

In [26]:
my_df['c2']

0    a
1    b
2    c
Name: c2, dtype: object
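
Column-name indexing extends in two common ways, sketched here under the assumption that my_df is defined as above: a list of names selects several columns, and a boolean Series selects rows. The names `subset` and `filtered` are illustrative.

```python
import pandas as pd

my_df = pd.DataFrame({'c1': [1.0, 2.0, 3.0],
                      'c2': ['a', 'b', 'c'],
                      'c3': [True, False, True]})

subset = my_df[['c1', 'c3']]          # A list of names selects several columns.
filtered = my_df[my_df['c1'] > 1.5]   # A boolean Series keeps the rows where it is True.
print(subset)
print(filtered)
```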

## File I/O

In Python, reading from and writing to files can be done using the built-in 'open' function, which creates a file handle for the given path. It is good practice to always use 'open' inside a 'with' statement. This ensures that the file handle is closed properly once the with block finishes.

In [27]:
with open('my_file.txt', 'w') as f: # Note that the 'w' means we want to write strings to this path.
    # *IMPORTANT* If the file already exists, it will be overwritten.
    f.write('Hello\nWorld!')
# After the with clause, the file will be closed.

You should now see that a my_file.txt file has been created in the same directory as this notebook.

In [28]:
with open('my_file.txt', 'r') as f: # Note that the 'r' means we want to read a string from this path.
    text = f.read() # Create a string containing all of the file contents.

text

'Hello\nWorld!'
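
Two other common patterns are appending with mode 'a' and reading a file one line at a time. The sketch below works in a temporary directory so that no existing file is touched; the file name `example.txt` is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.txt')  # Hypothetical file name.

    with open(path, 'w') as f:
        f.write('Hello\nWorld!')

    with open(path, 'a') as f:   # 'a' appends to the end instead of overwriting.
        f.write('\nAgain!')

    with open(path, 'r') as f:
        # Iterating over a file handle yields one line at a time.
        lines = [line.rstrip('\n') for line in f]

print(lines)
```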

## Strings

In this course we will be working extensively with strings (documents), so you will need to be comfortable manipulating them.

Strings can be indexed just like lists.

In [29]:
text[0]

'H'

In [30]:
text[-6:]

'World!'

Python provides various functions which check properties of strings.

In [31]:
print(text[0].isupper()) # True iff every cased character in the string is uppercase (and there is at least one)
print(text.isupper())

True
False


In [32]:
text.isdigit() # True iff the string is entirely made up of digits

False

In addition, Python strings have a wide range of built-in functionality.

In [33]:
text.split('o') # Returns a list containing the chunks between each occurrence of the substring 'o'.

['Hell', '\nW', 'rld!']

In [34]:
text.split('o', 1) # Can also specify the maximum number of splits; once that many splits have been made, the remainder is returned as the last element.

['Hell', '\nWorld!']

In [35]:
# join() is the opposite of split: it takes a list of strings and combines them into one.
','.join(text.split('\n'))

'Hello,World!'

In [36]:
text.replace('!', '!!!') # Returns a new string with all occurrences of substring '!' replaced with '!!!'.

'Hello\nWorld!!!'

In [37]:
text.replace('!', '') # Can also be used to delete.

'Hello\nWorld'

In [38]:
text.index('o') # Returns the index of the first occurrence of 'o'.

4
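
One caveat is worth a short sketch: index() raises a ValueError when the substring is absent, whereas the related find() method returns -1. The variable names here are illustrative.

```python
text = 'Hello\nWorld!'

found = text.find('o')       # Same position as index() when the substring exists.
not_found = text.find('z')   # find() returns -1 when the substring is absent.

try:
    text.index('z')          # index() raises ValueError instead.
    raised = False
except ValueError:
    raised = True

print(found, not_found, raised)
```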

Strings can be concatenated using the '+' operator.

In [39]:
print(text + ' Again!')

Hello
World! Again!


Python also provides string formatting functionality in the form of 'f-strings'. When a string is prefixed with f, expressions (including variables and function calls) can be placed inside { } and their values will be inserted into the string. This makes it easy to create strings from variable values.

In [40]:
f'my_list contains {len(my_list)} elements, and its smallest value is {min(my_list)}'

'my_list contains 4 elements, and its smallest value is 1'
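
An f-string placeholder can also carry a format specification after a colon, which controls how the value is rendered. The following is a small sketch; `ratio` and `name` are illustrative values.

```python
ratio = 5 / 51
name = 'doc 1'  # Illustrative values only.

as_decimal = f'{name}: {ratio:.3f}'  # Three digits after the decimal point.
as_percent = f'{name}: {ratio:.1%}'  # Shown as a percentage with one decimal place.
padded = f'[{42:>6}]'                # Right-aligned in a field of width 6.

print(as_decimal)
print(as_percent)
print(padded)
```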

### Exercise
Try to use the text variable to create the string 'Hello All Worlds' by using replace.

## Further String Processing

In this example we will perform some more advanced operations on some simple documents.

In [41]:
docs_string = "Barack Hussein Obama II (born August 4, 1961) is the 44th and current President of the United States, the first African American to hold the office. He served as the junior United States Senator from Illinois from January 2005 until he resigned after his election to the presidency in November 2008.\nObama is a graduate of Columbia University and Harvard Law School, where he was the president of the Harvard Law Review. He was a community organizer in Chicago before earning his law degree. He worked as a civil rights attorney in Chicago and also taught constitutional law at the University of Chicago Law School from 1992 to 2004.\nObama served three terms in the Illinois Senate from 1997 to 2004. Following an unsuccessful bid for a seat in the U.S. House of Representatives in 2000, Obama ran for United States Senate in 2004. His victory, from a crowded field, in the March 2004 Democratic primary raised his visibility. His prime-time televised keynote address at the Democratic National Convention in July 2004 made him a rising star nationally in the Democratic Party. He was elected to the U.S. Senate in November 2004 by the largest margin in the history of Illinois.\nHe began his run for the presidency in February 2007. After a close campaign in the 2008 Democratic Party presidential primaries against Hillary Rodham Clinton, he won his party's nomination, becoming the first major party African American candidate for president. In the 2008 general election, he defeated Republican nominee John McCain and was inaugurated as president on January 20, 2009."
docs = docs_string.split('\n') # Create a list of each document's text.
print(f'There are {len(docs)} docs')
docs

There are 4 docs


['Barack Hussein Obama II (born August 4, 1961) is the 44th and current President of the United States, the first African American to hold the office. He served as the junior United States Senator from Illinois from January 2005 until he resigned after his election to the presidency in November 2008.',
 'Obama is a graduate of Columbia University and Harvard Law School, where he was the president of the Harvard Law Review. He was a community organizer in Chicago before earning his law degree. He worked as a civil rights attorney in Chicago and also taught constitutional law at the University of Chicago Law School from 1992 to 2004.',
 'Obama served three terms in the Illinois Senate from 1997 to 2004. Following an unsuccessful bid for a seat in the U.S. House of Representatives in 2000, Obama ran for United States Senate in 2004. His victory, from a crowded field, in the March 2004 Democratic primary raised his visibility. His prime-time televised keynote address at the Democratic Nati

Our goal is to answer the following question: "Which of the 4 docs has the highest proportion of numeric tokens?"

So, for each document, we will need to calculate how many of its tokens are numbers and how many tokens it has in total.

First, we will compute the number of tokens in each document.

In [42]:
def count(doc):
    return len(doc.split()) # split() with no arguments will split the string on all whitespace.

doc_lens = [count(doc) for doc in docs]
doc_lens

[51, 58, 92, 60]

Next, we will find all of the tokens which are numbers.

To do this we will make use of regular expressions, which are used to define patterns of strings.

In [43]:
import re # The regular expression package


def find_numbers(doc):
    # Here, the regular expression [0-9]+ matches every run of one or more consecutive digits (0-9),
    # including digits inside a longer token, such as the '44' in '44th'.
    return re.findall('[0-9]+', doc)


doc_numbers = [find_numbers(doc) for doc in docs]
doc_numbers

[['4', '1961', '44', '2005', '2008'],
 ['1992', '2004'],
 ['1997', '2004', '2000', '2004', '2004', '2004', '2004'],
 ['2007', '2008', '2008', '20', '2009']]
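
The pattern [0-9]+ counts digit runs even when they are part of a longer token. If only standalone numbers should count, one possible variant (an alternative sketch, not the approach used in this notebook) adds word boundaries with \b. The sample sentence is invented for illustration.

```python
import re

sample = 'born August 4, 1961) is the 44th President; inaugurated January 20, 2009.'

# Any run of digits, even inside a token such as '44th'.
digit_runs = re.findall(r'[0-9]+', sample)

# Digit runs that are not attached to letters or other word characters.
whole_numbers = re.findall(r'\b[0-9]+\b', sample)

print(digit_runs)
print(whole_numbers)
```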

We are now ready to answer our question.

In [44]:
number_portions = [len(numbers) / doc_len for numbers, doc_len in zip(doc_numbers, doc_lens)]
number_portions

[0.09803921568627451,
 0.034482758620689655,
 0.07608695652173914,
 0.08333333333333333]

In [45]:
np.argmax(number_portions)

0
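
For a plain Python list, the same index can be found without NumPy. The sketch below uses rounded copies of the values above and is only one of several ways to write it.

```python
number_portions = [0.098, 0.034, 0.076, 0.083]  # Rounded copies of the values above.

# Pick the index whose corresponding value is largest.
best_index = max(range(len(number_portions)), key=lambda i: number_portions[i])
print(best_index, number_portions[best_index])
```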

So the first document has the largest proportion of numeric tokens.

Regular expressions can be used to match more complicated patterns; their full documentation is available at https://docs.python.org/3/library/re.html