Colab Notebook is here: https://colab.research.google.com/drive/1EasdlhLKK12gdrYiqXCVxKy92uva604Z#scrollTo=D75z19r49bhK

# 1. Introduction to Python

<p>Welcome to the first lesson of the <i>Legal Data Analytics</i> course. Let's start by understanding some key terms:</p>

<ul>
<li><b>The Code</b>: This is the text you'll write, either in the Console when coding on your computer or within the cells of this Colab Notebook.</li>
    
<li><b>The Console</b>: The interactive environment where you input code to produce results. Variables and data persist as long as you don't restart the console or leave a Colab page. Remember to regularly save your data and code, as it won't stay in memory indefinitely.</li>
    
  <li><b>Comments</b>: These are non-executable notes within your code, used to describe or explain what's happening. Comments in Python start with the hashtag <code>#</code>.</li>
  
  <li><b>Errors</b>: Occur when trying to run invalid code. Common errors include:
      <ol>
          <li><i>Syntax Error</i>: Caused by code that breaks Python's grammar rules (a missing colon, bracket or quotation mark, for instance). Code with a syntax error will not run at all.</li>
          <li><i>Type Error</i>: Results from trying to combine incompatible data types.</li>
      </ol>
  </li>
  
  <li><b>PyCharm</b>: An Integrated Development Environment (IDE) that enhances the coding experience. It color-codes different parts of your code and provides an integrated console. Useful shortcuts include:
      <ol>
          <li><i>Up Arrow</i>: When in the console, recalls previous inputs in reverse chronological order.</li>
          <li><i>Tab Key</i>: Autofills your code or displays a menu of options.</li>
          <li><i>Alt+Shift+E</i> (Windows - may differ on Mac): Executes highlighted code from a .py script.</li>
      </ol>
  </li>
  
  <li><b>Colab</b>: A cloud-based coding platform powered by Google. Unlike local coding, which uses your computer's resources, Colab runs on external servers. It's ideal for tasks requiring high computational power, but it has certain restrictions on data processing and package compatibility. For example, Colab might not be the best choice for web scraping tasks.</li>
</ul>


<b>1. </b>Computer code, at its most basic, calculates things. You can think of this course and everything that
follows as expanding the uses of a calculator. For instance, if you input 2+2 in the Console and press "Enter", the output will be 4.

In [None]:
var = 1

# This is a very simple computation

Typically, in Python you'd use the command `print` to have things appear on your screen. It is also more precise, as
it gives your computer the exact command to process: if you type two lines of computation before running them,
only the result of the last one will be displayed; however, both will be displayed if you specify that both need to be printed.

In [None]:
2+2
3+3

6

`print` is a <em>command</em>. Like most commands (or functions), it requires some <em>arguments</em>, which are
indicated within brackets: here, `2+2`.

In [None]:
print(2+2)
print(2*3)

4
6


<b>2. </b>We will come back to functions a bit later. Before that, we need to discuss <em>variables</em>, which you can think of as containers in which you store information.

# Variable Types

Variable names are typically written in lowercase; you create a variable (or assign data to it) with the `=` sign, according
to the syntax `variable = value`.
You can then use variables directly in functions (such as `print`), or perform operations between them.

You can assign and re-assign variables at will: you can even assign a variable to another variable.
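
As a minimal sketch of re-assignment (the names `first` and `second` are purely illustrative): with numbers and strings, assigning one variable to another copies the current value, so re-assigning the original afterwards leaves the copy untouched.

```python
first = 10
second = first   # second now holds the value 10
first = 20       # re-assigning first leaves second untouched
print(first, second)  # prints: 20 10
```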


In [None]:
alpha = 1
beta = 2
gamma = 2 * 3
print(alpha + beta + gamma)

9


Variables that contain numbers can also be incremented or decremented with a specific syntax: `var += 2` means that 2
will be added to the variable, every time you run this particular command. Think of it as an update of the
original variable.

In [None]:
gamma = gamma + alpha
print(gamma)
gamma += 1
print(gamma)
gamma -= 3
print(gamma)

7
8
5


In [None]:
gamma = 6

## Strings

Variables need not be numbers. They can also be text, which in Python is known as a `string`. Likewise, you can perform operations on them, such as concatenating two strings.

Do note that the print command does exactly what you ask it to do: in the example below, it does not insert a space
between the two strings; it's for you to think of this kind of detail. Programming is deterministic: output follows
input with, most of the time, no role for randomness. On the plus side, this means you can be assured that you'll get
the expected output if you type proper input; on the minus side, this requires utmost precision on your part.

In [None]:
alpha = "Hello World"
beta = "Hello Cake"
print(alpha + beta)

Hello WorldHello Cake
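
If you do want a space in the output, you have to provide it yourself. The sketch below shows two possible ways, reusing the same two strings (the variable `spaced` is just an illustrative name): concatenating a space, or passing both strings to `print`, which separates its arguments with a space by default.

```python
alpha = "Hello World"
beta = "Hello Cake"
spaced = alpha + " " + beta   # the space is a string like any other
print(spaced)                 # Hello World Hello Cake
print(alpha, beta)            # print separates its arguments with a space
```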


Also important to keep in mind is that strings are different from numbers. You cannot, for instance, add
a string to a number: this would throw a `TypeError`.

In [None]:
print(alpha + gamma)  # gamma has been defined above and is still known to the console's environment
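
The cell above is expected to fail with a `TypeError`. One possible way around it, assuming what you want is text output, is to convert the number into a string first with the built-in `str` function (discussed further in the Functions section). The variable `combined` is only an illustrative name.

```python
alpha = "Hello World"
gamma = 6
combined = alpha + " " + str(gamma)  # str() turns the number 6 into the text "6"
print(combined)                      # Hello World 6
```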

In what follows, we'll use text and strings taken from Mervyn Peake's poem <a href ="https://gormenghasts.tumblr.com/post/80656474535/the-frivolous-cake-a-freckled-and-frivolous-cake"><i>The Frivolous Cake</i></a>. I have numbered every verse; we'll store it in a variable for now and come back to it later.

In [None]:
poem = """The Frivolous Cake
1.1  A freckled and frivolous cake there was
1.1  That sailed upon a pointless sea,
1.2  Or any lugubrious lake there was
1.3  In a manner emphatic and free.
1.4  How jointlessly, and how jointlessly
1.5  The frivolous cake sailed by
1.6  On the waves of the ocean that pointlessly
1.7  Threw fish to the lilac sky.

2.1  Oh, plenty and plenty of hake there was
2.1  Of a glory beyond compare,
2.2  And every conceivable make there was
2.3  Was tossed through the lilac air.

3.1  Up the smooth billows and over the crests
3.1  Of the cumbersome combers flew
3.2  The frivolous cake with a knife in the wake
3.3  Of herself and her curranty crew.
3.4  Like a swordfish grim it would bounce and skim
3.5  (This dinner knife fierce and blue) ,
3.6  And the frivolous cake was filled to the brim
3.7  With the fun of her curranty crew.

4.1  Oh, plenty and plenty of hake there was
4.1  Of a glory beyond compare -
4.2  And every conceivable make there was
4.3  Was tossed through the lilac air.

5.1  Around the shores of the Elegant Isles
5.1  Where the cat-fish bask and purr
5.2  And lick their paws with adhesive smiles
5.3  And wriggle their fins of fur,
5.4  They fly and fly neath the lilac sky -
5.5  The frivolous cake, and the knife
5.6  Who winketh his glamorous indigo eye
5.7  In the wake of his future wife.

6.1  The crumbs blow free down the pointless sea
6.1  To the beat of a cakey heart
6.2  And the sensitive steel of the knife can feel
6.3  That love is a race apart
6.4  In the speed of the lingering light are blown
6.5  The crumbs to the hake above,
6.6  And the tropical air vibrates to the drone
6.7  Of a cake in the throes of love."""

In [None]:
poem

'The Frivolous Cake\n1.1  A freckled and frivolous cake there was\n1.1  That sailed upon a pointless sea,\n1.2  Or any lugubrious lake there was\n1.3  In a manner emphatic and free.\n1.4  How jointlessly, and how jointlessly\n1.5  The frivolous cake sailed by\n1.6  On the waves of the ocean that pointlessly\n1.7  Threw fish to the lilac sky.\n\n2.1  Oh, plenty and plenty of hake there was\n2.1  Of a glory beyond compare,\n2.2  And every conceivable make there was\n2.3  Was tossed through the lilac air.\n\n3.1  Up the smooth billows and over the crests\n3.1  Of the cumbersome combers flew\n3.2  The frivolous cake with a knife in the wake\n3.3  Of herself and her curranty crew.\n3.4  Like a swordfish grim it would bounce and skim\n3.5  (This dinner knife fierce and blue) ,\n3.6  And the frivolous cake was filled to the brim\n3.7  With the fun of her curranty crew.\n\n4.1  Oh, plenty and plenty of hake there was\n4.1  Of a glory beyond compare -\n4.2  And every conceivable make there was\

In [None]:
var = poem[1199:1200]

var.encode("utf8").decode()

'\x91'

## Lists

Another type of variable is a list, which is exactly what you think it is: it lists things, such as data, or even other variables, or even other lists! Lists are denoted by square brackets, with items separated by commas. You update a list by calling the method `.append()` directly on the list, as follows; note that the item you added with `append` now appears at the end of the list.

In [None]:
beta = "Cake"
my_list = ["Frivolous", 42, beta, ["This is a second list, with two items", 142], "Peake"]
print(my_list)
my_list.append("Swordfish")
print(my_list)
my_list.append("Swordfish")
print(my_list)

['Frivolous', 42, 'Cake', ['This is a second list, with two items', 142], 'Peake']
['Frivolous', 42, 'Cake', ['This is a second list, with two items', 142], 'Peake', 'Swordfish']
['Frivolous', 42, 'Cake', ['This is a second list, with two items', 142], 'Peake', 'Swordfish', 'Swordfish']


A very important feature of lists is that they are ordered. This means if you know the (numerical) index of an item in a list, you can access it immediately. This is called "indexing".
Learn it once and for all: <b>in Python, indexes start at 0</b>; the first element in a list can be found at index 0. This is not intuitive, but you need to get used to it: 0, not 1, marks the beginning of a list.

In [None]:
print(my_list[0])
print(my_list[1])
print(my_list[3][0])

Frivolous
42
This is a second list, with two items


Note that `my_list[3]` returns the second list that was in `my_list`. As such, it can itself be indexed, which is what `my_list[3][0]` does. Strings can be indexed in the same way:

In [None]:
print(my_list[0][1])

r


Indexing also works using the relative position of an item in a list: [-1] gives you the last item, [-2] the penultimate, etc.

In [None]:
print(my_list)

['Frivolous', 42, 'Cake', ['This is a second list, with two items', 142], 'Peake', 'Swordfish', 'Swordfish']


In [None]:
print(my_list[-1])   # Will return 'Swordfish', the last item, since we appended it at the end of the list

Swordfish


More importantly, you can select what's called a range by using the <code>:</code> operator. The operator is not inclusive of the outer limit, meaning that the item on the right-hand-side of the  <code>:</code> operator won't be included in the list that is rendered. For instance, if you look for indexes  <code>[0:2]</code>, you'll get items at index 0 and index 1, but not 2 (because it's excluded).

In [None]:
print(my_list[0:2])

['Frivolous', 42]


You can leave the selection open-ended, according to the same principles: the right-hand-side index won't be included, but the left-hand-side one is. So <code>[:5]</code> means "any element until the 6th (not included)", while <code>[2:]</code> means "every element after the third element (included)".

In [None]:
print(my_list[:2])
print(my_list[2:])
print(my_list[-2:])
print(my_list[0::3]) # This last type of range gives you every 3 items starting from 0

['Frivolous', 42]
['Cake', ['This is a second list, with two items', 142], 'Peake', 'Swordfish', 'Swordfish']
['Swordfish', 'Swordfish']
['Frivolous', ['This is a second list, with two items', 142], 'Swordfish']
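
Ranges work on strings too, character by character. As a small sketch (the variables `verse`, `number` and `text` are illustrative), this is one way to separate a line number from the text of a verse, assuming the number takes three characters and is followed by two spaces, as in the poem above.

```python
verse = "1.5  The frivolous cake sailed by"
number = verse[:3]   # the first three characters
text = verse[5:]     # everything from the sixth character onwards
print(number)        # 1.5
print(text)          # The frivolous cake sailed by
```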


## Booleans

Another data type is what's called a boolean. It is simply a value that is either True or False, but it is often very useful when you have to check conditions. It's based on the logic invented by George Boole in the mid-1800s, which is basically what powers computers now (see <a href="https://computer.howstuffworks.com/boolean.htm">here</a>): the bunch of 0s and 1s that signify computer data.

In [None]:
var_bol = True
var_bol2 = False
print(bool(var_bol))
print(bool(var_bol2))

True
False


Boolean logic works by manipulating <code>True</code> and <code>False</code> statements. In Python, you often need to check if something is <code>True</code> or not, for instance in the context of conditions (next module). The most basic way to do this is with <code>==</code> (double equal, not to be confused with the single equal, which is used to assign a variable). Its opposite is <code>!=</code>.

In [None]:
gamma = 5
print(gamma == 5) # Since gamma is indeed 5, this prints True
print(gamma != 5) # Since gamma is not different from 5, this prints False

In [None]:
print(True + 3) # What does it print ?

4
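
The result is 4 because, in arithmetic, Python treats `True` as 1 and `False` as 0. A short sketch of why this can be handy (the list `answers` and the name `yes_count` are made up for the illustration): summing a list of booleans counts how many of them are `True`.

```python
print(True == 1)    # True
print(False == 0)   # True

answers = [True, False, True, True]
yes_count = sum(answers)   # each True counts as 1, each False as 0
print(yes_count)           # 3
```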


This sounds basic, but Booleans are really at the basis of everything. The whole idea of smart contracts, for instance, is premised on the concept of assigning True or False to various contract terms and performances. Booleans will be particularly helpful when we get to conditions, in the Syntax module.

## Sets and Dictionaries

Finally, there are two other types of data worth knowing at this stage: `sets` and `dictionaries`.

Sets are like lists (they can hold numbers, strings and other values, but not lists), except that they are unordered and cannot contain duplicates. They are very useful to check if two sets of data overlap, or what they have or don't have in common. Since they are not ordered, you cannot select an element from a set by its index. If you create a set with a duplicate element, Python ignores the duplicate and returns a set without it.

In [None]:
my_set = {1, 2, 2, 3, 3, 4, "Cake", "Cake", "Knife"}
print(my_set)

{1, 2, 3, 4, 'Cake', 'Knife'}
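
Although you cannot index a set, you can still measure it with `len` and test whether it contains a given item with `in`. A minimal sketch, reusing the set above (the names `size`, `has_cake` and `has_hake` are illustrative):

```python
my_set = {1, 2, 2, 3, 3, 4, "Cake", "Cake", "Knife"}
size = len(my_set)             # duplicates are counted only once
has_cake = "Cake" in my_set    # membership test
has_hake = "Hake" in my_set
print(size, has_cake, has_hake)  # 6 True False
```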


A dictionary is a type of data that links a `key` to a `value`. The key acts as the index of your dictionary: if you give a key to the dictionary, it returns the corresponding value. It is useful to keep track of relations between different data points. Here as well, values can be any type of data you want (keys are more restricted: a list, for instance, cannot be a key). You use curly brackets, and indicate the relationship with a ":" operator.

In [None]:
my_dict = {42: "Mervyn", "Peake": 2, "My List" : my_list}
print(my_dict[42])
print(my_dict["My List"])

Mervyn
['Frivolous', 42, 'Cake', ['This is a second list, with two items', 142], 'Peake', 'Swordfish', 'Swordfish']
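
Dictionaries can also be updated after they are created. In this sketch (with a shortened, illustrative version of `my_dict`), assigning to a key that does not exist yet adds a new pair, while assigning to an existing key replaces its value.

```python
my_dict = {42: "Mervyn", "Peake": 2}
my_dict["Cake"] = "Frivolous"   # new key: the pair is added
my_dict["Peake"] = 3            # existing key: the value is replaced
print(my_dict)                  # {42: 'Mervyn', 'Peake': 3, 'Cake': 'Frivolous'}
print("Cake" in my_dict)        # True: `in` checks the keys
```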


Before switching to the next section, find a way to print "Mervyn Peake" indexing both the list
`my_list` and the dictionary `my_dict`.

In [None]:
var = my_dict[42] + " " + my_list[4]
print (var)

Mervyn Peake


# Functions

Now, coming back to functions: they are what allows you to perform operations on data and variables in Python
(more info <a href='https://www.w3schools.com/python/python_functions.asp'>here</a>). To use a function, you need to pass it
the expected arguments.

Many functions are native to Python ("built-in"), meaning you need neither create them yourself nor import them from an existing library. Among these native functions are those that allow you to play with the types of variables. For instance, `str` transforms your variable into a string, `int` into an integer, `list` into a list, etc. The function `set` takes a list and returns a set, while the function `len` tells you how long a variable is (how many items or characters it holds).

In [None]:
sent = "How long is this sentence ? :"
print(sent, len(sent))
print(list(sent))
print(round(2.3))
print(set([1, 1, 1, 3]))
print(str(3) + "2")
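
The cell above does not use `int`, so here is a small complementary sketch (variable names are illustrative). It shows that `int` and `str` convert in opposite directions, and that `len` gives a different answer for a list and for the set built from it, since the set drops duplicates.

```python
number_as_text = "42"
total = int(number_as_text) + 1   # int() reads the text "42" as the number 42
glued = str(42) + "1"             # str() does the reverse, so + concatenates
print(total)   # 43
print(glued)   # 421
print(len([1, 1, 1, 3]), len(set([1, 1, 1, 3])))  # 4 2
```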

If you are not sure what a function does, you can always type  `help(function)`, and the console shall return an answer.

In [None]:
help(print)

Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.



You can create your own functions with the keyword `def`: you then give the function a name, and specify the expected arguments within brackets.

Then, and <u>this is crucial</u>, you input a new line, and a tab - the inside of your function should <b>not</b> be on the same line as the declaration, everything should be shifted by one tab. (Indentation is a key syntax method we'll see again and again.)

The example given is a very simple function returning a sum, and while you'd get
the same result merely by doing  the sum immediately (without passing it to a function), it's sometimes useful to
write down things formally in this fashion.

In [None]:
def my_function(alpha, beta):  # An example of a function that just returns the sum of the two arguments you pass to it
    # (which will be known as beta and alpha in the sole context of the function)
    return beta + alpha

print(my_function(1, 2))  # We call the function with brackets to include the expected arguments
print(my_function("E", "H")) # Notice that the order of the arguments is important

3
HE


However, Python already has plenty of built-in
functions, so that you don't have to reinvent the wheel every time you need to do something. As you already saw, you
don't need a function to calculate the sum of two variables, so `my_function` is redundant.

Python is a shared resource, and for most uses someone has already created a function for you, such that
you just need to import it. This is the system of packages and libraries that powers Python.

When you find a package or module that interests you, you should first install it (from the internet) in your local
Python environment. The way to do this is with the command `pip install X`, with X being the name of your package;
you type this command in the Terminal (not the Console), unless you have installed the IPython package (which
boosts the console and allows direct installations). Once a package is installed, you don't have to do it again. (On your computer, that is. On Colab, you need to redo it every time, since you are only renting temporary computing capacity on the cloud.)



In [None]:
pip install llm

Note that `pip install X` is not Python code; it's a command-line instruction, which is different.

Now, in order to use the packages you installed, you need to import them every time you restart the console. Every script file starts with "import statements", which indicate which packages you will be using in the context of this
script. You need the keyword <code>import</code>, and you can give aliases to the modules with the keyword <code>as</code>.

If you get an error when trying to import a package, usually it's because you have not installed it (with "pip install" as described above).

You can either import a full package (such as `numpy` here), or dedicated functions or "modules" in a package (using
the "from X import Y" syntax). The difference is mostly in what you type afterwards: with "from X import Y", you call `Y` directly, without the package prefix. You then call the functions in accordance with
their name as imported, a name that you can set yourself (some are conventional: `pandas` is nearly always `pd`,
`numpy` is `np`, etc.).

In [None]:
import numpy as np  # We import numpy, but it's typically aliased 'np'
from collections import Counter  # We can also import only selected functions or classes from a package
import pandas as pd
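
To see the three import styles side by side, here is a sketch that relies only on modules shipped with Python (`math`, `statistics` and `collections`), so that it runs without installing anything. The alias `st` and the variable names are arbitrary choices made for this illustration.

```python
import math                      # whole module, called under its own name
import statistics as st          # whole module, called under an alias
from collections import Counter  # a single name taken from a module

root = math.sqrt(16)             # module name, period, function
average = st.mean([1, 5, 10, 15])
cake_count = Counter(["cake", "knife", "cake"])["cake"]  # no prefix needed

print(root)        # 4.0
print(average)     # 7.75
print(cake_count)  # 2
```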

The first package here, `numpy`  is specialised in numbers and mathematical operations; it typically goes
further than the basic Python functions. For instance, if you want to compute a mean: you could just use `sum` and
divide by `len`; but it's easier to just use `np.mean()` on a list of numbers.

The syntax is always the same: you go from a module to a function by adding a period (`.`), and then add the required
parameters to your function. If you don't know what the required parameters are, you can usually use CTRL+P (in PyCharm) to learn
more about what a function expects.

In [None]:
np.mean([1, 5, 10, 15])  # Instead of creating a function to calculate a mean, we can just leverage the existing package numpy

7.75
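
For comparison, the "manual" route mentioned above (using `sum` and dividing by `len`) could look like the following sketch; `numbers` and `manual_mean` are illustrative names.

```python
numbers = [1, 5, 10, 15]
manual_mean = sum(numbers) / len(numbers)  # 31 divided by 4
print(manual_mean)                         # 7.75
```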

# Methods

All of the data and variables you will manipulate in this course have built-in functions (called
"methods") already attached to them. They also have what are called attributes, which are data points. Methods are
called with brackets (which contain the arguments, if any),
while attributes have no brackets (and only output the data point).
(We won't learn it, but it's related to `classes` in Python, about which you can learn more <a href='https://docs.python.org/3/tutorial/classes.html'>here</a>.)

We saw it earlier when we used `append` to add an item to a list. The equivalent method for sets is `add`.
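
A minimal sketch of `add`, using a made-up set named `fish`: adding an item that is already in the set changes nothing, since sets do not keep duplicates.

```python
fish = {"hake", "swordfish"}
fish.add("cat-fish")        # a new item is added
fish.add("hake")            # already present: the set is unchanged
print(len(fish))            # 3
print("cat-fish" in fish)   # True
```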

We will be working on legal data, which means, for a large part, text data. Fortunately, strings are quite easy to
work with, as they have built-in functions in Python. For instance, the function `split` can be used to obtain a
list of items in that string depending on a splitting criterion. The opposite of that function would be `join`,
whereby you join items in a list with a common character.

These two functions are also a good example of how Python works between different data types: lists become strings, and the reverse.

In [None]:
splitted_words = "A freckled and frivolous cake there was".split(" ")  # The variable splitted_words will take the result of the right-hand side expression, which splits a string according to a criterion
splitted_cake = "A freckled and frivolous cake there was".split("cake")

print(splitted_words)
print(splitted_cake)

print(" X ".join(splitted_words))

['A', 'freckled', 'and', 'frivolous', 'cake', 'there', 'was']
['A freckled and frivolous ', ' there was']
A X freckled X and X frivolous X cake X there X was


In [None]:
words = "Science Po"
words.split(" ")

['Science', 'Po']
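
One detail worth knowing when working with multi-line text such as the poem: `split(" ")` cuts on single spaces only, whereas `split()` with no argument cuts on any run of whitespace, line breaks included. The sketch below uses a made-up fragment (`text`) containing a line break and a double space to show the difference.

```python
text = "there was\nThat sailed  upon"
on_spaces = text.split(" ")   # the line break stays glued to the words around it
on_whitespace = text.split()  # spaces and line breaks are all treated as separators
print(on_spaces)       # ['there', 'was\nThat', 'sailed', '', 'upon']
print(on_whitespace)   # ['there', 'was', 'That', 'sailed', 'upon']
```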

Sets have a number of methods attached to them as well, allowing for comparisons between sets:
<ul><li><code>difference</code> returns the items of set1 that are not in set2;</li>
    <li><code>intersection</code>, for items that are in both sets;</li>
    <li><code>union</code> returns a set with both sets' content; and</li>
    <li><code>symmetric_difference</code> returns the items that are in exactly one of the two sets.</li>
    </ul>

In [None]:
set1 = {1,2,3,4}
set2 = {3,4,5,6}
print(set1.difference(set2))
print(set1.intersection(set2))
print(set1.union(set2))
print(set1.symmetric_difference(set2))

{1, 2}
{3, 4}
{1, 2, 3, 4, 5, 6}
{1, 2, 5, 6}
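
Python also offers operator shortcuts for these four methods (`-`, `&`, `|` and `^`). The sketch below checks that each operator gives the same result as the corresponding method, and that `difference` depends on which set comes first.

```python
set1 = {1, 2, 3, 4}
set2 = {3, 4, 5, 6}

print(set1 - set2 == set1.difference(set2))            # True
print(set1 & set2 == set1.intersection(set2))          # True
print(set1 | set2 == set1.union(set2))                 # True
print(set1 ^ set2 == set1.symmetric_difference(set2))  # True

other_way = set2 - set1   # items of set2 that are not in set1
print(other_way == {5, 6})  # True
```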



## Exercises

Find the two sentences from the poem that have the most letters in common (each letter counted only once).

In [None]:
sentences = ["1.1  A freckled and frivolous cake there was", "1.1  That sailed upon a pointless sea, ", "1.2  Or any lugubrious lake there was", "1.3  In a manner emphatic and free."]

l0 = list(sentences[0])
set0 = set(l0)
l1 = list(sentences[1])
set1 = set(l1)
var = set0.intersection(set1)
print(len(var))

14


Find the number of words that are common to two paragraphs of the poem. There are (at least) three steps to do so.

In [None]:
a = 'freckled and frivolous cake there was\nThat sailed upon a pointless sea, \nOr any lugubrious lake there was\nIn a manner emphatic and free.\nHow jointlessly, and how jointlessly\nThe frivolous cake sailed by\nOn the waves of the ocean that pointlessly\nThrew fish to the lilac sky.'
b = 'Around the shores of the Elegant Isles\nWhere the cat-fish bask and purr\nAnd lick their paws with adhesive smiles\nAnd wriggle their fins of fur, \nThey fly and fly neath the lilac sky -\nThe frivolous cake, and the knife\nWho winketh his glamorous indigo eye\nIn the wake of his future wife.'

la = a.split(" ")
lb = b.split(" ")
# Your answer here
print(la)
print(lb)

seta = set(la)
setb = set(lb)

var = seta.intersection(setb)
print(var, len(var))

['freckled', 'and', 'frivolous', 'cake', 'there', 'was\nThat', 'sailed', 'upon', 'a', 'pointless', 'sea,', '\nOr', 'any', 'lugubrious', 'lake', 'there', 'was\nIn', 'a', 'manner', 'emphatic', 'and', 'free.\nHow', 'jointlessly,', 'and', 'how', 'jointlessly\nThe', 'frivolous', 'cake', 'sailed', 'by\nOn', 'the', 'waves', 'of', 'the', 'ocean', 'that', 'pointlessly\nThrew', 'fish', 'to', 'the', 'lilac', 'sky.']
['Around', 'the', 'shores', 'of', 'the', 'Elegant', 'Isles\nWhere', 'the', 'cat-fish', 'bask', 'and', 'purr\nAnd', 'lick', 'their', 'paws', 'with', 'adhesive', 'smiles\nAnd', 'wriggle', 'their', 'fins', 'of', 'fur,', '\nThey', 'fly', 'and', 'fly', 'neath', 'the', 'lilac', 'sky', '-\nThe', 'frivolous', 'cake,', 'and', 'the', 'knife\nWho', 'winketh', 'his', 'glamorous', 'indigo', 'eye\nIn', 'the', 'wake', 'of', 'his', 'future', 'wife.']
{'and', 'of', 'frivolous', 'the', 'lilac'} 5


# Syntax

Finally, some notions of syntax. You write code as you would write anything: sequentially. This means
you must define your variables and functions before using them, or Python won't know what you mean.
This being said, there are two basic syntactic ideas that are crucial to any script, or indeed to any software you are currently using: loops and conditions.

## Loops

A `loop` tells Python to go over (the term is "iterate") a number of elements, most often from a list.
The syntax is always the same: `for x in list: y`, where "x" is the temporary name given to each element of the "list" in
turn, and "y" is what happens to that "x". In other words, start with the first element (called "x" in the context of
the loop), do stuff ("y") with that element, then go over to the
next element (which will also be called "x"), and so on.

In [None]:
words = ["A", "Freckled", "and", "Frivolous", "Cake", "There", "Was"]

for x in words:  # This loop will print each word from the list one by one
    print(x)

A
Freckled
and
Frivolous
Cake
There
Was


After the loop has been completed, the variable `x` is still available: it represents whatever was the last item iterated over.

In [None]:
print(x)

Was


You will note that for your loop to work, the second level of instructions needs to be shifted to the right (and you
need a colon at the end of your `for` statement). That's
called indentation, and it is crucial in Python. It's also one of the main reasons why some people don't like this
language. Other languages are more explicit as to when a section of your code is actually contained in another
section: for example, in C++ you would put stuff within curly brackets, and indicate the end of a statement with a semi-colon.
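
To make the role of indentation concrete, here is a small sketch (the list `seen` is only there for illustration): the indented lines belong to the loop and run once per item, while the first non-indented line marks the end of the loop and runs only once.

```python
seen = []
for word in ["cake", "knife"]:
    seen.append(word)                 # indented: runs once per item
    print("inside the loop:", word)
print("after the loop")               # not indented: runs once, when the loop is over
```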

You can loop over lists, strings, and other objects we will discover later.

In [None]:
recreated_text = ""  # We start by creating an empty text variable
for letter in "Swordfish":   # We loop over the string Swordfish (strings can be used as lists of letters)
    print(letter)  # We first print the letters, one by one
    recreated_text += letter  # Then we add the letter to the existing recreated text; remember that x += 1 increments x by 1
    print(recreated_text)

S
S
w
Sw
o
Swo
r
Swor
d
Sword
f
Swordf
i
Swordfi
s
Swordfis
h
Swordfish


In [None]:
newl = [] # We will recreate the list, but with a twist

for x in words[2:]:
  newl = [x] + newl  # This puts x at the front of the list, unlike newl.append(x), which would add it at the end
  print(newl)

print(newl)

['and']
['Frivolous', 'and']
['Cake', 'Frivolous', 'and']
['There', 'Cake', 'Frivolous', 'and']
['Was', 'There', 'Cake', 'Frivolous', 'and']
['Was', 'There', 'Cake', 'Frivolous', 'and']


As everywhere else in Python, the order of things is very important, including in the context of a loop.

In [None]:
for x in words: # We loop over the words
    y = x  # We assign a new variable y that's the same as every x, one by one
    print(x + " - " + y)

for x in words: # In this second loop, y is printed before being updated, so it still holds the value from the previous iteration
    print(x + " - " + y)
    y = x

## Exercise

Without using any dedicated function, reverse the order of the `words` list. Print the new list.

## Conditions

The second important syntax element, and really the basic building block of so much of the code that runs your
daily life, is the `if/else` statement. It simply asks if a condition is met (with booleans!), and if so runs the corresponding code.
The syntax is of the form `if x:`, where "x" needs to be <code>True</code> (in the boolean sense) for the (indented) code coming after the colon to run. 2 + 2 = 4, so a statement `if 2+2 == 4: print("Correct")` would print "Correct". (Note that
we use "==" to check equality, since the single "=" sign is used to assign variables.)

In the example below, we will check whether the letter "e" (i.e., a string corresponding to the lower case "e") is present in each word of a list. This simply requires checking if that letter is `in` the target variable.

Note that the `else` will compute only if the `if` condition has not been met.

In [None]:
for word in words:
    if "e" in word:
        print(word)
    else:
        print(word, " : No 'e' in that word")

A  : No 'e' in that word
Freckled
and  : No 'e' in that word
Frivolous  : No 'e' in that word
Cake
There
Was  : No 'e' in that word


There are several ways to write `if` statements:
<ul><li>With <code>in</code> if you need to check that an item is part of a list or a set (or the inverse,
<code>not in</code>);</li>
    <li>By itself if you are checking a boolean (<code>if my_bol:</code> runs the indented code if <code>my_bol</code> is
<code>True</code>, and skips it if it is <code>False</code>);</li>
    <li>With the double equal sign <code>==</code> for equality between two variables, or <code>!=</code> for
inequality; and</li>
    <li>With a combination of the signs <code>&gt;</code>, <code>&lt;</code> and <code>=</code> when comparing two
quantities.</li></ul>
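
The kinds of checks listed above can be sketched as follows, one `if` per kind. The list `results` is only there to record which conditions were met; `my_bol` is set to `True` for the purpose of the illustration.

```python
words = ["A", "Freckled", "and", "Frivolous", "Cake", "There", "Was"]
my_bol = True
results = []

if "Cake" in words:        # membership
    results.append("in")
if "Hake" not in words:    # absence
    results.append("not in")
if my_bol:                 # a boolean on its own
    results.append("boolean")
if len(words) == 7:        # equality
    results.append("==")
if len(words) >= 5:        # comparison of two quantities
    results.append(">=")

print(results)  # ['in', 'not in', 'boolean', '==', '>=']
```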

Finally, you can combine conditions with the keywords `and` and `or`.
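
When `and` and `or` appear in the same condition, Python evaluates `and` first, unless brackets say otherwise. The sketch below uses three made-up booleans to show that the grouping can change the result.

```python
long_enough = False
has_e = True
no_cake = True

# Without brackets, `and` is evaluated before `or`
without_brackets = long_enough and has_e or no_cake    # (False and True) or True
# Brackets force the `or` to be evaluated first
with_brackets = long_enough and (has_e or no_cake)     # False and (True or True)

print(without_brackets)  # True
print(with_brackets)     # False
```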

In case you want to try a second condition after a first one is not met, you can use the keyword `elif` ("else if"), which works exactly like if.

In [None]:
sentence = " ".join(words)  # We recreate the sentence from the list of words with the method join
print(sentence)
len(sentence)

A Freckled and Frivolous Cake There Was


39

In [None]:
sentence = " ".join(words)  # We recreate the sentence from the list of words with the method join
print(sentence)
my_bol = False  # We set a boolean that's False

if "frivolous" in sentence:
    print("First Condition Met")
elif not my_bol:
    print("Second Condition Met")
elif len(sentence) == 50:   # len() is a built-in function rendering the length of a list or string
    print("Third Condition Met")
elif len(sentence) >= 30 and "e" in sentence or "cake" not in sentence:
    print("Fourth Condition Met")
else:
    pass

A Freckled and Frivolous Cake There Was
Second Condition Met


This is all, or nearly. Most of the Python scripts you can see out there run on the basis of these very basic concepts.

## Exercise

Find the (i) longest sentence in the poem that has (ii) the sound "ake" but (iii) not the word "knife", but (iv) has fewer than 8 words (not counting line numbers).

In [9]:
poem = 'The Frivolous Cake\n1.1  A freckled and frivolous cake there was\n1.1  That sailed upon a pointless sea, \n1.2  Or any lugubrious lake there was\n1.3  In a manner emphatic and free.\n1.4  How jointlessly, and how jointlessly\n1.5  The frivolous cake sailed by\n1.6  On the waves of the ocean that pointlessly\n1.7  Threw fish to the lilac sky.\n\n2.1  Oh, plenty and plenty of hake there was\n2.1  Of a glory beyond compare, \n2.2  And every conceivable make there was\n2.3  Was tossed through the lilac air.\n\n3.1  Up the smooth billows and over the crests\n3.1  Of the cumbersome combers flew\n3.2  The frivolous cake with a knife in the wake\n3.3  Of herself and her curranty crew.\n3.4  Like a swordfish grim it would bounce and skim\n3.5  (This dinner knife fierce and blue) , \n3.6  And the frivolous cake was filled to the brim\n3.7  With the fun of her curranty crew.\n\n4.1  Oh, plenty and plenty of hake there was\n4.1  Of a glory beyond compare -\n4.2  And every conceivable make there was\n4.3  Was tossed through the lilac air.\n\n5.1  Around the shores of the Elegant Isles\n5.1  Where the cat-fish bask and purr\n5.2  And lick their paws with adhesive smiles\n5.3  And wriggle their fins of fur, \n5.4  They fly and fly \x91neath the lilac sky -\n5.5  The frivolous cake, and the knife\n5.6  Who winketh his glamorous indigo eye\n5.7  In the wake of his future wife.\n\n6.1  The crumbs blow free down the pointless sea\n6.1  To the beat of a cakey heart\n6.2  And the sensitive steel of the knife can feel\n6.3  That love is a race apart\n6.4  In the speed of the lingering light are blown\n6.5  The crumbs to the hake above, \n6.6  And the tropical air vibrates to the drone\n6.7  Of a cake in the throes of love.'
poem

# Step 1: Get a list of lines

poem_lines = poem.split("\n")
print(poem_lines)
longest_line = ""

# Step 2: Loop over the list

for line in poem_lines:
  # Step 3: check that the string "ake" is in the line
  if "ake" in line:
    # Step 4: check that the word "knife" is not in the line
    if "knife" not in line.lower():
      # Step 5: count the number of words in the line (cut it with indexing to remove line numbers). If that number < 8, that line qualifies
      line_words = line.split()[1:]
      if len(line_words) < 8:
        # Step 6: keep track of what line is the longest with a dedicated variable that is updated to be the longest line
        if len(line) > len(longest_line):
          longest_line = line

# Step 7: Success !

print(longest_line)
print(len(longest_line))

['The Frivolous Cake', '1.1  A freckled and frivolous cake there was', '1.1  That sailed upon a pointless sea, ', '1.2  Or any lugubrious lake there was', '1.3  In a manner emphatic and free.', '1.4  How jointlessly, and how jointlessly', '1.5  The frivolous cake sailed by', '1.6  On the waves of the ocean that pointlessly', '1.7  Threw fish to the lilac sky.', '', '2.1  Oh, plenty and plenty of hake there was', '2.1  Of a glory beyond compare, ', '2.2  And every conceivable make there was', '2.3  Was tossed through the lilac air.', '', '3.1  Up the smooth billows and over the crests', '3.1  Of the cumbersome combers flew', '3.2  The frivolous cake with a knife in the wake', '3.3  Of herself and her curranty crew.', '3.4  Like a swordfish grim it would bounce and skim', '3.5  (This dinner knife fierce and blue) , ', '3.6  And the frivolous cake was filled to the brim', '3.7  With the fun of her curranty crew.', '', '4.1  Oh, plenty and plenty of hake there was', '4.1  Of a glory beyond

# Regexes

Earlier, we devised a basic algorithm to count the number of words in a text. However, there is a
much better and simpler way to do this: it's time to introduce regular expressions, or "regex" for short (more info <a href='https://docs.python.org/3/library/re.html'>here</a>).

We'll spend some time on it because it is extremely important for text-heavy applications; in a course about
finance or statistics we would not need it too much, but since we'll be analysing judgments and legal texts, regexes
are essential. And they are great. At the end of this task, you'll be annoyed every time search engines (like
Google) don't do regex. It's just so much better.

Regexes are <i>patterns</i> that allow you to identify text. These patterns rely on special symbols to
cover a range of characters in natural, written language. Because they rely on patterns, they are much more powerful
than a search that focuses on a specific word: the word itself might be conjugated, or put in lower case; a sentence
could have extra words. You might be interested in a range of numbers and not a specific one, etc.
    
For instance, the symbol "\d" means "any digit", and if you try to match this pattern with a sentence that includes a number, there will be a positive result.

In [None]:
import regex as re # You need to import the regex module (it may need to be first installed with pip install regex)

target_sentence = "Count: 30 frivolous cakes and 40 knifes !"
pattern = r"\d"
result = re.search(pattern, target_sentence)
print(result)

<regex.Match object; span=(7, 8), match='3'>


`re.search()` returns a match object (here, the variable `result`), which comes with a number of characteristics. For instance, that object stores the index where the match starts in the target sentence (method `.start()`), the index where it ends (`.end()`), and the exact text that was matched (`.group()`).


In [None]:
print("Pattern was found at index ", result.start(), " of target string !")
print("String continued after pattern at index ", result.end())
print("Regex search found ", result.group(), " that matched this pattern")

Pattern was found at index  7  of target string !
String continued after pattern at index  8
Regex search found  3  that matched this pattern



You will have noted that there were several digits in the target sentence, but the `search` function only found one - the first
one. To get all matches, you need another function, `findall`, which returns a list of results.

In [None]:
re.findall(r"\d", target_sentence)  # The r prefix (raw string) keeps Python from interpreting the backslash itself

['3', '0', '4', '0']

In addition, you have `re.sub(pattern, newpattern, target_sentence)`, which replaces every match of a pattern with a new
string.

There is also `re.split(pattern, target_sentence)` which returns a list of strings from the original text, as
split by the pattern. Notice that the result does not display the splitting pattern.

In [None]:
print(re.sub("Cake", "Hake (?!)", poem[:19]))
print(re.split(" ", poem[:20]))

The Frivolous Hake (?!)

['The', 'Frivolous', 'Cake\n1']


All very good. Now, here are the basic patterns:
<ul><li>Any particular word or exact spelling will match itself: <code>cake</code> will match <code>cake</code> (but not
<code>Cake</code>, unless you tell regex to be case-insensitive - see below);</li>
    <li><code>.</code> matches any single character (except, by default, a line break), so <code>c.ke</code> would get "cake" or "coke", or even "cOke"; if you need to look specifically for a period, you need to escape it with a backslash: <code>\.</code>;</li>
    <li><code>\s</code> matches white space, including line breaks, tabs, etc.; note that the upper-case version,
<code>\S</code>, matches anything <i>but</i> white space; and</li>
    <li><code>\w</code> matches a "word character" (a letter, a digit, or an underscore), while <code>\W</code> matches anything else.</li>
    </ul>


In [None]:
print(re.findall(r".ake", poem)) # Plenty of "ake" sounds in that poem
print(re.findall(r"\d\.\d", poem)) # To look for a period, you need to escape it with a backslash
print(re.search(r"\W", poem))  # It will find the first space in the poem

['Cake', 'cake', 'lake', 'cake', 'hake', 'make', 'cake', 'wake', 'cake', 'hake', 'make', 'cake', 'wake', 'cake', 'hake', 'cake']
['1.1', '1.1', '1.2', '1.3', '1.4', '1.5', '1.6', '1.7', '2.1', '2.1', '2.2', '2.3', '3.1', '3.1', '3.2', '3.3', '3.4', '3.5', '3.6', '3.7', '4.1', '4.1', '4.2', '4.3', '5.1', '5.1', '5.2', '5.3', '5.4', '5.5', '5.6', '5.7', '6.1', '6.1', '6.2', '6.3', '6.4', '6.5', '6.6', '6.7']
<regex.Match object; span=(3, 4), match=' '>


In addition, the following rules apply:
<ul><li>Square brackets indicate a set of characters: <code>[0-8a-q,]</code> will match a single character that is
a digit between 0 and 8, OR a letter between a and q, OR a comma (if you need a literal hyphen in your set, put it at
the end of the set);
</li>
    <li>The symbol <code>|</code> (that's Alt + 6 on some keyboards) means "or";</li>
    <li>You indicate the expected number of hits with braces: <code>[A-Q]{3}</code> means you are looking for three
(consecutive) upper-case letters between A and Q, while <code>[A-Q]{3,6}</code> means you expect between 3 and 6,
and <code>[A-Q]{3,}</code> means "at least 3" (but potentially more), on the same logic as indexing (except that you use
        commas instead of colons). A concrete example would be <code>\d{4}</code>: four digits in a row, such as a year;</li>
    <li>Two special characters do the same job, but open-ended: <code>+</code> means that you expect
at least one hit, while <code>*</code> means you expect any number of hits (including none); add a <code>?</code> after
either of them for non-greediness, that is, to match as few characters as possible;</li>
    <li>Any pattern becomes optional if you add a <code>?</code> behind it: <code>cakey?</code> will find <code>cakey</code> or
<code>cake</code>;</li>
    <li>You can group patterns by bracketing them with parentheses, and then build around it: for instance, <code>([0-8a-q])|([9r-z])</code>. You can even name the groups to retrieve them precisely from the match object when there is a match;</li>
    <li>Characters that are usually used for patterns (such as <code>?</code> or <code>|</code>) can be searched for
themselves by "escaping" them with a backslash <code>\</code> (and the backslash can be escaped with another
backslash: <code>\\</code> will look for <code>\</code>). Note that <code>regex</code> provides you with an
<code>escape</code> function that takes a string and returns it with the special characters escaped.</li>
    </ul>
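Two items in the list above, named groups and the <code>escape</code> function, are easier to grasp with an example. What follows is only a minimal sketch: it uses Python's built-in `re` module (the same calls should work with `import regex as re`), and the group names `day`, `month` and `year` are made up for the illustration.

```python
import re  # built-in module, used here so that the example runs anywhere

# A named group is written (?P<name>pattern); the names are our own choice
date_pattern = r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})"
result = re.search(date_pattern, "This is a date: 11-02-1992")

print(result.group("year"))  # Retrieve a group by name: 1992
print(result.group(1))       # Groups can still be retrieved by position: 11
print(result.groupdict())    # All named groups at once, as a dictionary

# re.escape() adds the backslashes needed to search for a text literally
literal = "Hake (?!)"
escaped = re.escape(literal)
found = re.search(escaped, "The Frivolous Hake (?!)")
print(found.group())         # Hake (?!)
```

Without the call to `re.escape`, the parentheses and the question mark in `literal` would be read as pattern symbols and not as the characters themselves.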

In [None]:
print(re.findall(r"[chlwm]ake", poem))
print(re.search(r"cake|knife", poem))
print(re.search(r"\d\d-\d{2}-\d+", "This is a date: 11-02-1992"))  # Note that \d\d and \d{2} are strictly equivalent
print(re.search(r"cakey?", "cake or cakey?")) # Here as well, if you ever need to look for a literal "?", you need to escape it: "\?"

['cake', 'lake', 'cake', 'hake', 'make', 'cake', 'wake', 'cake', 'hake', 'make', 'cake', 'wake', 'cake', 'hake', 'cake']
<regex.Match object; span=(49, 53), match='cake'>
<regex.Match object; span=(16, 26), match='11-02-1992'>
<regex.Match object; span=(0, 4), match='cake'>
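With these building blocks, the word count mentioned at the start of this section becomes a one-liner. The sketch below is one possible way to do it, not the only one; it uses the built-in `re` module and a single line of the poem as sample text.

```python
import re  # built-in module; the same calls work with `import regex as re`

sample = "1.5  The frivolous cake sailed by"  # one line of the poem

# \w+ means "one or more word characters in a row"; since \w also accepts
# digits, the line number contributes the tokens "1" and "5"
tokens = re.findall(r"\w+", sample)
print(tokens)  # ['1', '5', 'The', 'frivolous', 'cake', 'sailed', 'by']

# Restricting the pattern to letters leaves only the actual words
words_only = re.findall(r"[A-Za-z]+", sample)
print(len(words_only))  # 5
```

Note that `[A-Za-z]` ignores accented letters, so a different set would be needed for texts in other languages.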


Finally, there are so-called <i>flags</i>, which are typically passed outside of the pattern, as a third argument (but can also be used inside the pattern for a single sub-pattern), to give further instructions, such as:
<ul><li>Ignore case, <code>re.I</code>;</li>
    <li>Dot matches all (<code>.</code> will also match line breaks), <code>re.S</code>;</li>
    <li>Verbose (allows you to add white spaces and comments that don't count as pattern), <code>re.X</code>; and</li>
    <li>Multiline (<code>$</code> and <code>^</code> will work for every single line, and not simply for the start
and end of the full text), <code>re.M</code>.</li>
    </ul>

In [None]:
print(re.search("cake", "The Frivolous Cake", re.I))  # This works despite the capital C since we specified re.I

<regex.Match object; span=(14, 18), match='Cake'>
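The other flags can be tried in the same way. Here is a short sketch, again with the built-in `re` module and with the first two lines of the poem as sample text; the variable names are only there for the illustration.

```python
import re

text = "The Frivolous Cake\n1.1  A freckled and frivolous cake there was"

# Without re.M, ^ only matches at the very start of the text
first_only = re.findall(r"^\S+", text)
print(first_only)      # ['The']

# With re.M, ^ also matches right after every line break
each_line = re.findall(r"^\S+", text, re.M)
print(each_line)       # ['The', '1.1']

# Without re.S, the dot stops at a line break; with re.S it matches it too
no_dotall = re.search(r"Cake.1", text)
with_dotall = re.search(r"Cake.1", text, re.S)
print(no_dotall)       # None
print(with_dotall)     # A match covering 'Cake', the line break, and '1'

# re.X lets you spread a pattern over several lines and comment it
verbose_pattern = r"""
    \d    # a first digit
    \.    # a literal period
    \d    # a second digit
"""
line_numbers = re.findall(verbose_pattern, text, re.X)
print(line_numbers)    # ['1.1']

# Several flags can be combined with |
combined = re.findall(r"^the", text, re.I | re.M)
print(combined)        # ['The']
```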


Regex becomes really powerful once you add conditions to your regex pattern.

<ul><li>A pattern preceded by a <code>^</code> will be looked for only at the beginning of the text (or of each line, with the <code>re.M</code> flag); a pattern
followed by a <code>$</code> will match only if it finishes the text (or the line, with <code>re.M</code>);</li>
    <li>Adding a <code>(?=2ndpattern)</code> <i>after</i> your first pattern will indicate that your first pattern
will match <i>only if</i> it is immediately followed by text that matches your second pattern, but the second pattern won't be caught in the match object (this is very useful, e.g., for substitution).</li>
    <li>In the same vein, <code>(?!2ndpattern)</code>, <code>(?&lt;=2ndpattern)</code>, and <code>(?&lt;!2ndpattern)</code> are conditions for "if it does not match after", "if it matches before", and "if it doesn't match before",
respectively. This can be hungry in terms of computing power, so don't overdo it.</li>    
    </ul>

In [None]:
print(re.search(r"^A Freckled|throes of love.$", poem)) # Only the second alternative will be found: without re.M, ^ anchors the first
# alternative to the very start of the poem, which begins with the title (and the poem spells "freckled" in lower case)
print(re.search(r"plenty of (?=cake)", poem)) # This returns None since there is no "plenty of cake" in the poem
print(re.search(r"plenty of (?=.ake)", poem)) # But this returns a match, since there is "plenty of hake"

<regex.Match object; span=(1667, 1682), match='throes of love.'>
None
<regex.Match object; span=(357, 367), match='plenty of '>
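The three other conditions work along the same lines. The sketch below tries them on a single line of the poem, using the built-in `re` module; the variable names are only illustrative.

```python
import re

line = "3.2  The frivolous cake with a knife in the wake"

# (?!...) "if it does not match after": "ake" words NOT followed by " with"
not_followed = re.findall(r"\wake(?! with)", line)
print(not_followed)   # ['wake']

# (?<=...) "if it matches before": the word that comes after "frivolous "
preceded = re.search(r"(?<=frivolous )\w+", line)
print(preceded.group())   # cake

# (?<!...) "if it doesn't match before": "ake" words NOT after "frivolous "
not_preceded = re.findall(r"(?<!frivolous )\wake", line)
print(not_preceded)   # ['wake']

# In a substitution, the condition itself is left untouched
replaced = re.sub(r"cake(?= with)", "hake", line)
print(replaced)       # 3.2  The frivolous hake with a knife in the wake
```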


The `regex` module (unlike Python's built-in `re`) also provides for fuzzy searches - that is, with a bit of leeway to catch things despite errors in the pattern (this can be very greedy in resources, though, so be careful when you use it). For instance, in `re.search("(coke){e<=1}", poem)`, the braced statement means "one error (e) or fewer", so the search will find "cake", as there is only one difference (the letter o/a) between the pattern and the word.

Finally, match objects can be used as booleans: in <code>if result</code>, the condition counts as <code>True</code> if there was a match, while you can check for a null result by asking `if result is None`. (`None` is a special Python object that stands for the absence of a value.)

Note that there are tools to help you check if your regexes work well on the given dataset, such as <a
href="https://www.debuggex.com/">this one</a> online.

In [None]:
for line in poem.split("\n"): # We split the poem by lines and we loop over these lines
    if re.search("cake|knife", line, re.I):  # We check whether the term "cake" or "knife" is in the line, ignoring case
        print(line)  # If it is, we print the line
    else:
        print("No Cake or knife in that line...")

The Frivolous Cake
1.1  A freckled and frivolous cake there was
No Cake or knife in that line...
No Cake or knife in that line...
No Cake or knife in that line...
No Cake or knife in that line...
1.5  The frivolous cake sailed by
No Cake or knife in that line...
No Cake or knife in that line...
No Cake or knife in that line...
No Cake or knife in that line...
No Cake or knife in that line...
No Cake or knife in that line...
No Cake or knife in that line...
No Cake or knife in that line...
No Cake or knife in that line...
No Cake or knife in that line...
3.2  The frivolous cake with a knife in the wake
No Cake or knife in that line...
No Cake or knife in that line...
3.5  (This dinner knife fierce and blue) , 
3.6  And the frivolous cake was filled to the brim
No Cake or knife in that line...
No Cake or knife in that line...
No Cake or knife in that line...
No Cake or knife in that line...
No Cake or knife in that line...
No Cake or knife in that line...
No Cake or knife in that line...

# Wordle

This is a <a href="https://www.nytimes.com/games/wordle/index.html">popular online game</a>! Let's try to reproduce it.

We first need to make the required imports:
<ul>
    <li>a module to simulate randomness (so as to have a new word every time we play);</li>
    <li>regex to check if texts match what we want; and</li>
    <li>a corpus of words to pick from.</li>
    </ul>

In [None]:
import random
import re
import nltk
nltk.download('brown')
from nltk.corpus import brown  # nltk may need to be first installed with pip install nltk

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


Then we need to create a list of words to choose from, since the Brown corpus has about a million words - and we only need words with five letters. We also want to avoid proper names.

In [None]:
words = []
for x in brown.words():
    if len(x) == 5 and re.search(r"^[A-Z]|[\.,]", x) is None:  # We skip capitalised words, and words with a period or a comma
        words.append(x.upper())  # We harmonise all words with caps

word = random.choice(words)  # We pick a random word

Then we create a function that embodies the algorithm needed to play the game. That function takes as its input/argument the word guessed by the player.

The first few steps check whether that word even qualifies for the game: whether it is five letters long, and whether it is part of the existing corpus.

Then, if this is the case, we iterate over the letters of the guessed word, one by one, and we check three cases:
<ul>
    <li>if the letter is in the target word and at the right place (so, same index), we color it green;</li>
    <li>if the letter is in the target word, but not at the right place, we color it yellow; and</li>
    <li>if the letter is not in the target word, we leave it uncolored.</li>
    </ul>
(I found the code to color output on the internet: these are ANSI escape sequences, which set the color of the printed text.)

In [None]:
def play(answer):  # We create a function that prints each letter of the guess, formatted depending on how close we are to the right answer
    answer = answer.upper()  # Get the all caps version of the word to compare with dataset of words
    if len(answer) > 5:  # We first check that the input word in answer fits the requirement: be 5 in len, and in the dataset
        print("Too long")
    elif len(answer) < 5:
        print("Too Short")
    elif answer not in words:
        print("Word does not exist")
    else:  # If this is a proper guess, we proceed to the main part of the function
        for e, letter in enumerate(answer):  # The function enumerate allows you to iterate over a sequence together with the index
            if letter in word and answer[e] == word[e]:  # If the letter is in the word and at the exact same place, we print it on a green background
                print('\x1b[1;30;42m' + letter + '\x1b[0m', end=" ")
            elif letter in word:  # If it is in the word, but at a different place, we print it on a yellow background
                print('\x1b[1;30;43m' + letter + '\x1b[0m', end=" ")
            else:  # Otherwise we just print the letter
                print(letter, end=" ")

In [None]:
play("Angel")

A [1;30;43mN[0m [1;30;43mG[0m [1;30;42mE[0m L 

# List Comprehension

Python code is a good middle ground between very verbose code (VBA for instance), and languages that are
perfectly opaque to the neophyte. When you look at the syntax, given a few basics, you can have a rough idea of
what's happening.

The issue with verbosity, however, is that it takes space and time. If you need to populate a list from another list
given a condition, you have now learned that you can use a loop and a conditional statement to perform the operation.
But again, it can be cumbersome to write down all of this.

Enter list comprehensions, which are a way to create a list in a single line. The syntax is of the kind:
`[x for x in list]`

So, what you are trying to do is to invoke every element in the list, and operate over it to create a new list
(hence the brackets around the statement).

Take for instance these three lines, which add numbers to a list after tripling them.

In [None]:
my_list = []
for x in range(1,25):
    my_list.append(x * 3)
print(my_list)

[3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72]


This can be rewritten as a list comprehension, in line with the syntax above:

In [None]:
new_list = [x * 3 for x in range(1,25)]
print(new_list)

[3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72]


Note that the power of this method comes from the fact that you can go much further than the bare statement I gave
you here. In particular, you can add conditions. For instance, let's say we are looking for every even number in a list of numbers.

In [None]:
even_list = []
for x in my_list:
    if x % 2 == 0:  # The modulo operator, using the percent symbol, returns the remainder of a division. When an even number
        # is divided by 2, the remainder is always 0
        even_list.append(x)

print(even_list)

[6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72]


And this is the same list created with a list comprehension.

In [None]:
new_even_list = [x for x in new_list if x % 2 == 0]
print(new_even_list)

[6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72]


Note that you can combine several conditions, and the usual `and`, `or`, and `not` operators, as well as booleans, work in this context as
well.
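As a quick illustration, here is a sketch of comprehensions with combined conditions, using `and`, `or` and `not`. The list `numbers` holds the same values as `my_list` above; the thresholds are arbitrary.

```python
numbers = [x * 3 for x in range(1, 25)]  # Same values as my_list above

# Two conditions joined with `and`: even numbers greater than 40
even_and_large = [x for x in numbers if x % 2 == 0 and x > 40]
print(even_and_large)  # [42, 48, 54, 60, 66, 72]

# `or`: numbers below 10 or above 65
extremes = [x for x in numbers if x < 10 or x > 65]
print(extremes)        # [3, 6, 9, 66, 69, 72]

# `not`: every number that is not a multiple of 9
not_multiples_of_nine = [x for x in numbers if not x % 9 == 0]
print(not_multiples_of_nine)
```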

Finally, the first item in the comprehension (the expression before `for`) can also be operated upon. Let's say we now want the even numbers from `new_list`,
except times three and in a string that starts with "Number : ".

In [None]:
even_more_new_list = ["Number : " + str(x * 3) for x in new_list if x % 2 == 0]
print(even_more_new_list)

['Number : 18', 'Number : 36', 'Number : 54', 'Number : 72', 'Number : 90', 'Number : 108', 'Number : 126', 'Number : 144', 'Number : 162', 'Number : 180', 'Number : 198', 'Number : 216']
