## Code style in Python

    Readability counts.
    Special cases aren't special enough to break the rules.

Some good style advice can be found in PEP 8 (Python Enhancement Proposal): 
https://www.python.org/dev/peps/pep-0008/#id4

Pylint is one of the available programs for checking the style of Python source files.

Installation:
    
```bash
pip install --user pylint
```

Usage:

```bash
pylint file.py
```

# Text processing

### Szymon Talaga 11.11.2019 | Adapted from: Julian Zubek and Michał Denkiewicz 27.11.2017

<hr />

## Some of the typical string operations

**Reverse a string:**

In [None]:
L = [1, 2, 3, 4, 5]

L[:] == L
L[:] is L

In [None]:
"abcdefgh"[::-1]  # this makes a copy

# Use `reversed` to iterate over the same string
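
A minimal sketch of the difference hinted at in the comment above: slicing with `[::-1]` builds a new reversed string, while `reversed` returns a lazy iterator over the characters, which can be joined back when a string is actually needed. The variable names are illustrative only.

```python
text = "abcdefgh"

# Slicing builds a new, reversed string object
reversed_copy = text[::-1]

# `reversed` returns an iterator; characters are produced one at a time
reversed_iter = reversed(text)
first_two = [next(reversed_iter), next(reversed_iter)]

# Joining the characters of a fresh iterator gives the same result as slicing
rejoined = "".join(reversed(text))

print(reversed_copy)              # hgfedcba
print(first_two)                  # ['h', 'g']
print(rejoined == reversed_copy)  # True
```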

**Splitting text:**

In [None]:
"Mouse in the house".split()

In [None]:
"Going up, or maybe down.".split(",")

**Join iterable of strings:**

In [None]:
" ".join(["PhD", "Jane", "Doe"])

**Strip whitespace from sides** 

In [None]:
s = '  something  '
# Strip from the left
s.lstrip()
# Strip from the right
s.rstrip()
# Strip from both sides
s.strip()

Whitespace stripping is a standard processing step that should be done almost always.

**String interpolation**

String interpolation is the process of substituting placeholders within a string with values. For various reasons Python breaks its own rules and has multiple string interpolation syntaxes/methods instead of just one. We will review only the two most modern ones.

In [None]:
from math import pi

"The number π is: {}".format(pi)
"The number π is: {:.3f}".format(pi)

x = 7
y = 8

"{0} + {1} = {2}".format(x, y, x + y)
"{x} + {y} = {result}".format(
    x=x,
    y=y,
    result=x+y
)

In [None]:
from math import pi

f"The number π is: {pi}"
f"The number π is: {pi:.3f}"

x = 7
y = 8

f"{x} + {y} = {x + y}"
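
As a small sketch, the two syntaxes can be compared side by side. The part after the colon inside the braces is a format specification, and it works the same way in both. The `name` variable and the chosen width are assumptions made only for this illustration.

```python
from math import pi

name = "pi"  # hypothetical label used only for this illustration

with_format = "{} is roughly {:.2f}".format(name, pi)
with_fstring = f"{name} is roughly {pi:.2f}"

# Format specification: right-aligned (>), width 8, 3 decimal places
padded = f"[{pi:>8.3f}]"

print(with_format)   # pi is roughly 3.14
print(with_fstring)  # pi is roughly 3.14
print(padded)        # [   3.142]
```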

**in, find -- searching for a substring in a string:**

In [None]:
"Mouse" in "Mouse in the house"

In [None]:
"Dog" in "Mouse in the house"

In [None]:
"Mouse in the house".find("house")

**replace -- substituting a substring:**

In [None]:
"bread: 2 PLN, butter: 10 PLN, eggs: 7 PLN".replace("PLN", "EUR")

<hr />

## Working with files

In Python we interact with external files using special objects representing file handles. We create them with the `open()` built-in function, and after use we should close them with the `.close()` method defined on the file object.

```python
f = open("file.txt")
... # do something with the file
f.close()
```

However, a more elegant solution is usually to use a context manager with the `with` keyword and work with the file within that context. The advantage of this approach is that we do not have to remember to close the file.

```python
with open("file.txt") as f:
    ... # do something with the file

# We did not have to call `f.close()`
```
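
The closing behaviour can be checked with the `closed` attribute of the file object. The sketch below writes to a throwaway file in a temporary directory (an assumption made so that the example does not depend on an existing `file.txt`).

```python
import os
import tempfile

# Throwaway file so the example does not touch any real data
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as f:
    f.write("hello\n")
    closed_inside = f.closed   # False: the file is open inside the block

closed_outside = f.closed      # True: the context manager closed it

print(closed_inside, closed_outside)  # False True
```

The file is closed on leaving the `with` block even if an exception is raised inside it.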

If a file is not too big we can read all of its content at once with the `read` method.

```python
with open("file.txt") as f:
    s = f.read()
```

However, in general the more natural way to process a file is to iterate over its lines. This allows us to work with large files: the file object yields lines lazily, one at a time, so at any given point only the current line needs to be held in memory.

```python
with open("file.txt") as f:
    for line in f:
        print(line)
```

**NOTE:** Lines read this way keep their trailing newline character `\n` (only the last line of a file may lack it). This is one of the reasons why we almost always want to strip whitespace from lines when reading them in. So a better line reading loop would be:

```python
with open("file.txt") as f:
    for line in f:
        line = line.strip()
        ... # Do something
```
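
To see the effect of stripping, the sketch below writes a small throwaway file (the file name and its content are assumptions for illustration) and reads it twice, once keeping the raw lines and once stripping them.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as f:
    f.write("first\nsecond\nthird\n")

# Raw lines keep the trailing newline character
with open(path) as f:
    raw_lines = [line for line in f]

# Stripped lines contain only the text itself
with open(path) as f:
    clean_lines = [line.strip() for line in f]

print(raw_lines)    # ['first\n', 'second\n', 'third\n']
print(clean_lines)  # ['first', 'second', 'third']
```

Note that `print(line)` on a raw line produces an empty line after each printed line, because `print` adds its own newline.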

A file can be opened for writing by passing the additional parameter `w` to the `open` function. We then write to the file using the `write` method.
More generally, a file can be opened in read, write or append mode (in fact there are more options; you can read about them in the documentation of the `open` function).

```python
# Read from file (default)
with open("file.txt", "r") as f:
    ... # do something
```

```python
# Write to file
with open("file.txt", "w") as f:
    f.write("some text")

# NOTE. Write mode deletes any preexisting content of the file before writing

# If you want to create a new file each time
# and make sure you do not overwrite anything, use the 'x' mode
with open("file.txt", "x") as f:
    f.write("some text")

# It will create a new file or raise `FileExistsError`.
```

```python
# Append to file (add new lines at the end)
with open("file.txt", "a") as f:
    f.write("some text")
```
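
The differences between the modes can be observed on a single throwaway file. This is only a sketch: the temporary path and the written strings are assumptions for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")

with open(path, "w") as f:      # creates the file
    f.write("one\n")

with open(path, "a") as f:      # keeps "one" and adds to the end
    f.write("two\n")

with open(path) as f:
    after_append = f.read()

with open(path, "w") as f:      # truncates: "one" and "two" are gone
    f.write("three\n")

with open(path) as f:
    after_write = f.read()

try:
    with open(path, "x") as f:  # the file already exists, so this fails
        f.write("four\n")
    exclusive_failed = False
except FileExistsError:
    exclusive_failed = True

print(repr(after_append))  # 'one\ntwo\n'
print(repr(after_write))   # 'three\n'
print(exclusive_failed)    # True
```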

## Exercise 1

The file `KIAA0319.seq` contains an amino acid sequence represented as a list of symbols, each on a separate line. Count the number of occurrences of each unique amino acid. The output should be a mapping.

In [None]:
from collections import Counter

## Exercise 2

1. Group words in `makbet.txt` by length. For now we assume that words are divided with a space and that punctuation marks are included in words' lengths (but not whitespaces).
2. Compute distribution of word lengths (output should be a mapping).
3. Find the longest word or words.

In [None]:
# Maybe defaultdict will be useful?
from collections import defaultdict

## Exercise 3

**A.** The file 'samples.txt' contains data about some samples together with text descriptions. Each sample begins with a line like this:

    >>> Sample X
    
where `X` is the sample number. After the initial line there are a few lines of text description (their number is not fixed) followed by a matrix of numbers (multiple rows of numbers separated with commas; each row has the same number of elements). Data matrices for different samples can be of different sizes.

Compute average values for all samples. Output should be a list of 2-tuples with the sample id at the first position and the sample average at the second position.

HINT 1. While looping through text lines you can keep a contextual variable telling you whether you are currently in a header or in a data matrix.

HINT 2. You can use the `sum` function to sum an iterable of numbers.

HINT 3. You can use the interactive debugger to see what exactly is happening in your code. For instance, you can inspect every step in a loop:

```python
for x in collection:
    import pdb; pdb.set_trace()
    # Do something
```

Within a debugger session you can press `c` to continue to the next breakpoint, `n` to evaluate the current line and move to the next one, `l` to show nearby lines of code and `ll` to show the whole source of the current function.

**B.** Modify your code so it creates a new file called `samples_processed.txt` in which data matrices are substituted with average values.

In [None]:
result = []

# Write your code here

expected = [
    (0, 6), (1, 81), (2, 44), (3, 29), (4, 111), (5, 60), (6, 52), (7, 206), (8, 17), (9, 39),
    (10, 315), (11, 172), (12, 92), (13, 35), (14, 62), (15, 251), (16, 146), (17, 176), (18, 7), (19, 1),
    (20, 128), (21, 36), (22, 19), (23, 32), (24, 259), (25, 275), (26, 15), (27, 315), (28, 98), (29, 306),
    (30, 11), (31, 90), (32, 68), (33, 255), (34, 75), (35, 319), (36, 14), (37, 248), (38, 28), (39, 100),
    (40, 78), (41, 117), (42, 113), (43, 198), (44, 139), (45, 114), (46, 34), (47, 183), (48, 60), (49, 93),
    (50, 88), (51, 112), (52, 64), (53, 60), (54, 7), (55, 252), (56, 82), (57, 27), (58, 159), (59, 162),
    (60, 20), (61, 95), (62, 57), (63, 250), (64, 8), (65, 66), (66, 44), (67, 93), (68, 26), (69, 76),
    (70, 213), (71, 40), (72, 255), (73, 108), (74, 80), (75, 19), (76, 228), (77, 222), (78, 150), (79, 76),
    (80, 60), (81, 11), (82, 208), (83, 23), (84, 227), (85, 16), (86, 183), (87, 13), (88, 250), (89, 10),
    (90, 61), (91, 99), (92, 81), (93, 159), (94, 22), (95, 229), (96, 15), (97, 122), (98, 246)
]

assert result == expected, "Sorry, your answer is not correct." 

<hr />

## Regular expressions

<img src="//imgs.xkcd.com/comics/regular_expressions.png" title="Wait, forgot to escape a space.  Wheeeeee[taptaptap]eeeeee." alt="Regular Expressions">

Regular expressions (RegEx) provide a flexible framework for searching for patterns in textual data. The syntax of regular expressions is designed to express patterns in a concise way. However, this also means that regular expressions are often quite hard to read. Modern programming languages such as Python tend to use an extended version of regular expressions that was popularized by the Perl language.

Applications:

   1. Extracting phone numbers or e-mail addresses from text
   2. Replacing text.
   3. Validating correctness of formal strings such as e-mail addresses.

In Python regular expressions are implemented in the `re` module available in the standard library.

In [None]:
import re

RegEx patterns are defined with strings. However, since RegEx patterns use a lot of special characters, it is usually preferred to pass them as **raw strings**.

In [None]:
print("\n") # standard string
print(r"\n") # raw string
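
The sketch below illustrates why this matters for patterns. In a standard string `"\b"` is the backspace character, while in a regular expression `\b` means a word boundary, so only the raw form reaches the `re` module intact. The sample text is an assumption for illustration.

```python
import re

standard = "\n"    # one character: a newline
raw = r"\n"        # two characters: a backslash and the letter n
lengths = (len(standard), len(raw))

# Raw string: \b is passed to `re` as a word boundary
word_raw = re.findall(r"\bcat\b", "cat concat cat.")

# Standard string: "\b" is already a backspace character, so nothing matches
word_std = re.findall("\bcat\b", "cat concat cat.")

print(lengths)   # (1, 2)
print(word_raw)  # ['cat', 'cat']
print(word_std)  # []
```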

**Example:** fitting a pattern from the beginning of a string (check if string is a valid e-mail address)

In [None]:
email_addrs = [
    "abrakadabra@yahoo.com",
    "A.B.C@D.E",
    "John Kovalsky",
    "fredbloggs@fredbloggs.plus.com",
    "Abc.example.com",
    "A@b@c@example.com"
]

[re.match(r"(^|(?<=\s))[^@\s]+@[^@\s]+\.[^.@\s]+((?=\s)|$)", addr) for addr in email_addrs]

In [None]:
s1 = "Something"
s2 = "Here is Something"

re.match(r"Something", s1)
re.match(r"Something", s2)
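
A short sketch of what these calls return: `re.match` gives a match object only when the pattern fits at position 0 and `None` otherwise, whereas `re.search` scans the whole string. The variable names below are illustrative only.

```python
import re

s1 = "Something"
s2 = "Here is Something"

at_start = re.match(r"Something", s1)      # a match object
not_at_start = re.match(r"Something", s2)  # None: pattern is not at position 0
anywhere = re.search(r"Something", s2)     # search scans the whole string

matched_text = anywhere.group()  # the matched substring
position = anywhere.span()       # (start, end) indices of the match

print(at_start is not None)  # True
print(not_at_start)          # None
print(matched_text)          # Something
print(position)              # (8, 17)
```

Since a failed match is `None`, the result can be used directly in a condition such as `if re.match(...):`.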

**Example:** find the first occurrence of a pattern

In [None]:
email_addrs = """
    abrakadabra@yahoo.com A.B.C@D.E
    John Kovalsky
    fredbloggs@fredbloggs.plus.com
    Abc.example.com
    A@b@c@example.com
"""

re.search(r"(^|(?<=\s))[^@\s]+@[^@\s]+\.[^.@\s]+((?=\s)|$)", email_addrs)

**Example:** find all occurrences of a pattern

In [None]:
list(re.finditer(r"(^|(?<=\s))[^@\s]+@[^@\s]+\.[^.@\s]+((?=\s)|$)", email_addrs))
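
With a simpler pattern it is easier to see what `re.finditer` yields. In the sketch below (the sample text and the `\d+` pattern are assumptions for illustration) each element is a match object, while `re.findall` returns the matched strings directly when the pattern contains no groups.

```python
import re

text = "Order 12 shipped, order 345 pending, order 6 cancelled."

# finditer yields match objects, one per non-overlapping match
matches = list(re.finditer(r"\d+", text))
numbers = [m.group() for m in matches]
first_span = matches[0].span()

# findall returns the matched strings (for patterns without groups)
numbers_again = re.findall(r"\d+", text)

print(numbers)        # ['12', '345', '6']
print(first_span)     # (6, 8)
print(numbers_again)  # ['12', '345', '6']
```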

### Overview of the theory of regular expressions

Each regular expression defines a formal _language_ (the set of strings that fit the defined pattern). For instance, the expression `"A?B"` defines the language `{'B', 'AB'}`.
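
Membership in the language of `A?B` can be tested with `re.fullmatch`, which succeeds only if the whole string fits the pattern. The list of candidate strings is an assumption for illustration.

```python
import re

candidates = ["B", "AB", "A", "AAB", "BA", ""]

# Keep only the strings that belong to the language defined by A?B
language = [s for s in candidates if re.fullmatch(r"A?B", s)]

print(language)  # ['B', 'AB']
```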

Expressions are formed through:

In [None]:
r"A|B"  #  {'A', 'B'}

**1) Gluing** (concatenation) of simpler expressions (the simplest one is a single character which defines the simplest possible language):

In [None]:
print(re.match(r'abc', 'abc'))
print(re.match(r'abc', 'dabc'))

**2) Alternative (disjunction)** of two expressions:

In [None]:
r"ABC|BCA"  # {'ABC', 'BCA'}

r"AB(C|B)CA"   # {'ABCCA', 'ABBCA'}

In [None]:
print(re.match(r"ABC|BCA", "ABC"))
print(re.match(r"AB(C|B)CA", "ABBCA"))

**3) Class** of characters:

"`[`" and "`]`" define a _character class_. Any single character from the class will be matched.

Characters in a class can be explicitly listed (order does not matter):

In [None]:
print(re.match(r'ab[cdX][cdX]', 'abcdX'))

Or defined as a range:

In [None]:
print(re.match(r'[a-eA-E][0-9]', 'c8'))
print(re.match(r'[a-e]', 'z'))

A class can also be defined negatively (as the complement of a class).
We denote a negated class with an additional `^` sign in the first position.

In [None]:
print(re.match(r'[^XY]', 'A'))
print(re.match(r'[^XY]', 'B'))
print(re.match(r'[^XY]', '&'))
print(re.match(r'[^XY]', 'X'))
print(re.match(r'[^XY]', 'Y'))

There are some useful built-in classes:

* **`\w` (word)** - letters, digits and '\_' (underscore).
* **`\d` (digit)** - digits
* **`\s` (space)** - whitespace (space, tabs, end of line).

Complements of the above built-in classes are denoted with capital letters,
e.g. the class of non-digits is `\D`.
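
A brief sketch of the built-in classes in action. The sample string is an assumption chosen so that it contains letters, digits, an underscore and different kinds of whitespace.

```python
import re

sample = "Room_42 opens at 9:30\tdaily"

digits = re.findall(r"\d+", sample)    # runs of digits
words = re.findall(r"\w+", sample)     # runs of letters, digits and '_'
pieces = re.split(r"\s+", sample)      # split on any run of whitespace

# \D is the complement of \d: everything up to the first digit
before_digit = re.match(r"\D+", sample).group()

print(digits)        # ['42', '9', '30']
print(words)         # ['Room_42', 'opens', 'at', '9', '30', 'daily']
print(pieces)        # ['Room_42', 'opens', 'at', '9:30', 'daily']
print(before_digit)  # Room_
```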

There is also a special sign `.` which denotes any character except the end of line (`\n`).

`\A` and `\Z` match the start and the end of a string.

`^` and `$` match the start and the end of a string. With the `re.MULTILINE` flag they also match at the start and the end of each line within the string (lines are delimited with the newline character `\n`).
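
In Python's `re` module the line-by-line behaviour of `^` and `$` is switched on with the `re.MULTILINE` flag, while `\A` always refers to the start of the whole string. A minimal sketch on an assumed three-line text:

```python
import re

text = "first line\nsecond line\nthird line"

# Without flags, ^ matches only at the very start of the string
default_starts = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches right after each newline
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)

# \A is not affected by the flag
absolute_starts = re.findall(r"\A\w+", text, re.MULTILINE)

print(default_starts)    # ['first']
print(multiline_starts)  # ['first', 'second', 'third']
print(absolute_starts)   # ['first']
```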

In [None]:
re.match(r"a+?", "aaaaaaaaaaaaaa")

**4) Quantifiers** allow us to specify the multiplicity of a pattern

* `*` - 0 or more
* `+` - 1 or more
* `?` - 0 or 1
* `{n}` - exactly `n`
* `{n,m}` - from `n` to `m`
* `{n,}` - minimum `n`
* `{,m}` - up to `m`

By default regular expressions behave in a greedy manner. This means that they try to match as long a string as possible. If this is not the behavior you want, you can switch to the lazy mode, in which the shortest possible string is matched. Lazy mode is turned on by appending an additional `?` mark to the quantifier.

* `*?` - 0 or more (lazy)
* `+?` - 1 or more (lazy)
* `??` - 0 or 1 (lazy)
* `{n}?` - exactly `n` (lazy)
* `{n,m}?` - from `n` to `m` (lazy)
* `{n,}?` - minimum `n` (lazy)
* `{,m}?` - up to `m` (lazy)
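
The difference between the greedy and the lazy mode is easiest to see on text with repeated delimiters. The HTML-like sample string below is an assumption for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so one match spans
# from the first '<' to the last '>'
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first '>' it can, giving one match per tag
lazy = re.findall(r"<.+?>", html)

greedy_a = re.match(r"a+", "aaaa").group()
lazy_a = re.match(r"a+?", "aaaa").group()

print(greedy)    # ['<b>bold</b> and <i>italic</i>']
print(lazy)      # ['<b>', '</b>', '<i>', '</i>']
print(greedy_a)  # aaaa
print(lazy_a)    # a
```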

In [None]:
re.match(r"Something", "SoMeThIng", re.IGNORECASE) 

## Warmup Exercise

Extract all e-mails from the following string.

In [None]:
s = "Some emails: Martin <martin@google.com>, Jessica <jess@uw.edu.pl>, Anna <anna@uw.edu.pl>."

## Exercise 4

Write a regular expression that will validate phone numbers formed according to the following templates:

    XXXXXXXXX
    XXX XXX XXX
    XXX-XXX-XXX
    +XXXXXXXXXXX
    +XX XXX XXX XXX
    +XX-XXX-XXX-XXX
    (+XX)XXXXXXXXX
    (+XX)XXX XXX XXX
    (+XX)XXX-XXX-XXX
    (+XX) XXXXXXXXX
    (+XX) XXX XXX XXX
    (+XX) XXX-XXX-XXX
   

In [None]:
X = [
    "123456789",
    "123 456 789",
    "123-456-789",
    "+00123456789",
    "+00 123 456 789",
    "+00-124-456-789",
    "(+00)123456789",
    "(+00)123 456 789",
    "(+00)123-456-789",
    "(+00) 123456789",
    "(+00) 123 456 789",
    "(+00) 123-456-789",
    "(+00) 123 456-789",
    "423 432"
]

pattern = ""

rx = re.compile(pattern)

for x in X:
    print(rx.match(x))

## Exercise 5

Modify your solution to Exercise 2 so that punctuation marks ("',.:;?!-) are not included in word lengths.

HINT. Check out the `re.split` function.

#### Exercise 2

1. Group words in `makbet.txt` by length. For now we assume that words are divided with a space and that punctuation marks are included in words' lengths (but not whitespaces).
2. Compute distribution of word lengths (output should be a mapping).
3. Find the longest word or words.

In [None]:
from collections import defaultdict

dct = defaultdict(list)

with open('makbet.txt', 'r') as f:
    pass

## Exercise 6

Write a regular expression for splitting text into sentences (you can assume that it is reasonably well-formatted). Store it in a compiled form as an object that can be reused anytime you want without having to rewrite the regex again (check `?re.compile`).

Then try to improve on the expression by wrapping it in a function that tries to deal with special cases. You can test your solution on the text of the Wikipedia article about Leonardo da Vinci.

In [None]:
import requests
import re
# Get text from the Wikipedia API
response = requests.get('https://en.wikipedia.org/w/api.php?action=query&prop=cirrusdoc&titles=Leonardo_da_Vinci&format=json')
# Parse response body as JSON
data = response.json()
text = data['query']['pages']['18079']['cirrusdoc'][0]['source']['text']