Data Science Fundamentals: Python |
[Table of Contents](../index.ipynb)
- - - 
<!--NAVIGATION-->
Module 6. | [Generators](./01_generators.ipynb) | **[Regular Expressions](./02_regex.ipynb)** | [String Manipulation & Regular Expressions](03_str_manipulate.ipynb) | [Closures & Generators](./04_closures_generators.ipynb) | [Exercises](./05_gen_regex_exercises.ipynb)

## Regular Expression Generators for Python

### [Pyregex](http://www.pyregex.com/)

### [Pythex](https://pythex.org/) 

- - -

[Regular Expression Cheatsheet](https://learnbyexample.github.io/cheatsheet/python/python-regex-cheatsheet/)
<br>
[Regular Expression Operations Documentation](https://docs.python.org/3/library/re.html)

## Regular Expressions

### Examples

We have already seen that we can ask whether a string `str`
begins with some substring as follows:
`str.startswith('Apple')`.
If we would like to know whether it starts with `"Apple"` or
`"apple"`, we would have to call the `startswith` method twice.
Regular expressions offer a simpler solution:
`re.match(r"[Aa]pple", str)`.
The bracket notation is one example of the special syntax of
*regular expressions*. In this case it says that any of the
characters inside brackets will do: either `"A"` or `"a"`. The other
letters in `"pple"` will act normally. The string `r"[Aa]pple"` is
called a *pattern*.

A more complicated example asks whether the string `str`
starts with either `apple` or `banana` (no matter if the first letter
is capital or not):
`re.match(r"[Aa]pple|[Bb]anana", str)`.
In this example we saw a new special character `|` that denotes
an alternative. On either side of the bar character we have a
*subpattern*.
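As a minimal sketch of the two patterns above, the following cell tries them on a few sample strings. The strings in `samples` are assumptions made up for illustration. Note that `re.match` returns a match object on success and `None` otherwise.

```python
import re

# Hypothetical sample inputs, chosen only for illustration
samples = ["Apple pie", "apple", "banana split", "Banana", "pineapple", "cherry"]

pattern = r"[Aa]pple|[Bb]anana"

# Keep the strings that start with one of the two alternatives
matched = [s for s in samples if re.match(pattern, s)]
print(matched)
```

The string `"pineapple"` is rejected because `re.match` only looks at the beginning of the string.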

A legal variable name in Python starts with a letter or an
underscore character, and the following characters can also be
digits.
So legal names are, for instance: `_hidden`, `L_value`, `A123_`.
But the name `2abc` is not a valid variable name.
Let’s see what would be the regular expression pattern to
recognise valid variable names:
`r"[A-Za-z_][A-Za-z_0-9]*\Z"`.
Here we have used a shorthand for character ranges: `A-Z`.
This means all the characters from `A` to `Z`.

The first character of the variable name is defined in the first
brackets. The subsequent characters are defined in the second
brackets.
The special character `*` means that we allow any number
(0,1,2, . . . ) of the previous subpattern. For example the
pattern `r"ba*"` allows strings `"b"`, `"ba"`, `"baa"`, `"baaa"`, and
so on.
The special syntax `\Z` denotes the end of the string.
Without it we would also accept `abc-` as a valid name since
the `match` function normally checks only that a string starts with a pattern.
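A small sketch of how this pattern could be wrapped into a check. The helper name `is_valid_name` is hypothetical, and the pattern deliberately ignores Python keywords and non-ASCII letters, exactly like the pattern in the text.

```python
import re

NAME_PATTERN = r"[A-Za-z_][A-Za-z_0-9]*\Z"

def is_valid_name(name):
    """Return True if name matches the variable-name pattern above."""
    return re.match(NAME_PATTERN, name) is not None

for candidate in ["_hidden", "L_value", "A123_", "2abc", "abc-"]:
    print(candidate, is_valid_name(candidate))

# Without \Z only the beginning of the string has to fit the pattern
accepted_without_end = re.match(r"[A-Za-z_][A-Za-z_0-9]*", "abc-") is not None
print(accepted_without_end)
```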

The special notations, like `\Z`, also cause problems with string
handling.
Remember that normally in string literals we have some
special notation: `\n` stands for newline, `\t` stands for tab, and
so on.
So, both string literals and regular expressions use similar
looking notations, which can create serious confusion.
This can be solved by using the so-called *raw strings*. We
denote a raw string by having an `r` letter before the first
quotation mark, for example `r"ab*\Z"`.
When using raw strings, the newline (`\n`), tab (`\t`), and other
special string literal notations aren’t interpreted. One should
always use raw strings when defining regular expression
patterns!
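The difference can be made visible with a short experiment. This is only an illustrative sketch: the test string `"cat concat"` is an assumption chosen for the demo.

```python
import re

plain = "\n"    # one character: a newline
raw = r"\n"     # two characters: a backslash followed by the letter n
print(len(plain), len(raw))

# In a normal string literal "\b" is the backspace character, so the
# first pattern looks for backspaces around "cat" and finds nothing.
without_raw = re.findall("\bcat\b", "cat concat")

# In a raw string the backslash reaches the re module untouched,
# where \b means a word boundary.
with_raw = re.findall(r"\bcat\b", "cat concat")
print(without_raw, with_raw)
```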

### Patterns

A pattern represents a set of strings. This set can even be
potentially infinite.
They can be used to describe a set of strings that have some
commonality; some regular structure.
Regular expressions (RE) are a classical computer science topic.
They are very common in programming tasks. Scripting
languages, like Python, are very fluent in regular expressions.
Very complex text processing can be achieved using regular
expressions.

In patterns, normal characters (letters, numbers) just represent
themselves, unless preceded by a backslash, which may trigger
some special meaning.
Punctuation characters have special meaning, unless preceded
by backslash (`\`), which deprives their special meaning.
Use `\\` to represent a backslash character without any special
meaning.
Below we will go through some of the more
common RE notations.

```
.      Matches any character (except a newline, by default)
[...]  Matches any character contained within the brackets
[^...] Matches any character not appearing after the hat (^)
^      Matches the start of the string
$      Matches the end of the string
*      Matches zero or more repetitions of the previous RE
+      Matches one or more repetitions of the previous RE
{m,n}  Matches m to n occurrences of the previous RE
?      Matches zero or one occurrence of the previous RE
```

We have already seen that a `|` character denotes alternatives.
For example, the pattern `r"Get (on|off|ready)"` matches
the following strings: `"Get on"`, `"Get off"`, `"Get ready"`.
We can use parentheses to create groupings inside a pattern:
`r"(ab)+"` will match the strings `"ab"`, `"abab"`, `"ababab"`,
and so on.
These groups are also given a reference number starting from 1. 
We can refer to groups using backreferences: `\number`.
For example, we can find separated patterns that get
repeated: `r"([a-z]{3,}) \1 \1"`.
This will recognise, for example, the following strings:
`"aca aca aca"`, `"turn turn turn"`. But not the strings
`"aca aba aca"` or `"ac ac ac"`.
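The grouping and backreference examples can be checked with a short sketch. The helper name `repeated_word` is hypothetical; it returns the text captured by group 1, or `None` when the pattern does not match.

```python
import re

def repeated_word(text):
    """Return the word that is repeated three times at the start of text, if any."""
    mo = re.match(r"([a-z]{3,}) \1 \1", text)
    return mo.group(1) if mo else None

for text in ["aca aca aca", "turn turn turn", "aca aba aca", "ac ac ac"]:
    print(repr(text), "->", repeated_word(text))

# Parentheses also let a repetition apply to a whole subpattern
whole_string_is_abs = re.fullmatch(r"(ab)+", "ababab") is not None
print(whole_string_is_abs)
```

The string `"ac ac ac"` fails because `{3,}` requires at least three letters in the group.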


In the following, note that a hat (`^`) as the first character
inside brackets will create a complement set of characters:

```
\d  same as [0-9], matches a digit
\D  same as [^0-9], matches anything but a digit
\s  matches a whitespace character (space, newline, tab, ...)
\S  matches a non-whitespace character
\w  same as [a-zA-Z0-9_], matches one alphanumeric character or underscore
\W  matches one character that is not alphanumeric or an underscore
```

Using the above notation we can now shorten our previous
variable name example to `r'[a-zA-Z_]\w*\Z'`.

The patterns `\A`, `\b`, `\B`, and `\Z` will all match an empty
string, but in specific places.
The patterns `\A` and `\Z` will recognise the beginning and end
of the string, respectively.
Note that the patterns `^` and `$` can in some cases match also
after a newline and before a newline, correspondingly.
So, `\A` is distinct from `^`, and `\Z` is distinct from `$`.
The pattern `\b` matches at the start or end of a word. The
pattern `\B` does the reverse.
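The following sketch illustrates where these empty-string patterns match. It assumes the `re.MULTILINE` flag, which is one of the cases where `^` also matches after a newline; the sample strings are made up for the demo.

```python
import re

text = "first line\nsecond line"

# \A matches only at the very start of the string, while ^ with the
# MULTILINE flag also matches right after each newline.
starts_A = re.findall(r"\A\w+", text, re.MULTILINE)
starts_hat = re.findall(r"^\w+", text, re.MULTILINE)
print(starts_A, starts_hat)

# \b matches at the edge of a word, \B only where there is no edge.
words = "line outline lines"
at_edge = [mo.start() for mo in re.finditer(r"\bline\b", words)]
inside = [mo.start() for mo in re.finditer(r"\Bline\b", words)]
print(at_edge, inside)
```

Here `at_edge` holds the position of the standalone word `line`, and `inside` holds the position of the `line` that ends the word `outline`.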

### Match and search functions

We have so far only used the `re.match` function, which tries
to find a match at the beginning of a string.
The function `re.search` allows matching any substring of a
string.
Example: `re.search(r'\bback\b', s)` will match
strings `"back"`, `"a back, is a body part"`, `"get back"`. But it
will not match the strings `"backspace"` or `"comeback"`.

The function `re.search` finds only the first occurrence.
We can use the `re.findall` function to find all occurrences.
Let’s say we want to find all present participle words in a
string `s`. The present participle words have ending `'ing'`.
The function call would look like this:
`re.findall(r'\w+ing\b', s)`.
Let’s try running this:

In [2]:
import re
s = "Doing things, going home, staying awake, sleeping later"
re.findall(r'\w+ing\b', s)

['Doing', 'going', 'staying', 'sleeping']

Let’s say we want to pick up all the integers from a string.
We can try that with the following function call:
`re.findall(r'[+-]?\d+', s)`.
An example run:

In [3]:
re.findall(r'[+-]?\d+', "23 + -24 = -1")

['23', '-24', '-1']

Suppose we are given a string of if/then sentences, and we
would like to extract the conditions from these sentences.
Let’s try the following function call:
`re.findall(r'[Ii]f (.*), then', s)`.
An example run:

In [5]:
s = ("If I’m not in a hurry, then I should stay. " +
    "On the other hand, if I leave, then I can sleep.")
re.findall(r'[Ii]f (.*), then', s)

['I’m not in a hurry, then I should stay. On the other hand, if I leave']

But I wanted a result: `["I'm not in a hurry", 'I leave']`. That
is, the condition from both sentences. How can this be fixed?

The problem is that the pattern `.*` tries to match as many
characters as possible.
This is called *greedy matching*.
One way of solving this problem is to notice that the two
sentences are separated by a full-stop (.).
So, instead of matching all the characters, we need to match
everything but the dot character.
This can be achieved by using the complement character
class: `[^.]`. The hat character (`^`) in the beginning of a
character class means the complement character class.

After the modification the function call looks like this:
`re.findall(r'[Ii]f ([^.]*), then', s)`.
Another way of solving this problem is to use a non-greedy
matching.
The repetition specifiers `+`, `*`, `?`, and `{m,n}` have
corresponding non-greedy versions: `+?`, `*?`, `??`, and `{m,n}?`.
These expressions use as few characters as possible to make
the whole pattern match some substring.
By using non-greedy version, the function call looks like this:
`re.findall(r'[Ii]f (.*?), then', s)`.
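To compare the three variants side by side, here is a minimal sketch that runs them on the same sentences as above (written with a plain apostrophe). The variable names `greedy`, `no_dots`, and `lazy` are chosen only for this illustration.

```python
import re

s = ("If I'm not in a hurry, then I should stay. "
     "On the other hand, if I leave, then I can sleep.")

greedy = re.findall(r"[Ii]f (.*), then", s)      # .* grabs as much as possible
no_dots = re.findall(r"[Ii]f ([^.]*), then", s)  # cannot cross a full stop
lazy = re.findall(r"[Ii]f (.*?), then", s)       # .*? grabs as little as possible

print(len(greedy))
print(no_dots)
print(lazy)
```

Both fixes give the two conditions here. The complement class depends on the sentences being separated by full stops, whereas the non-greedy version does not.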



### Functions in the `re` module

Below is a list of the most common functions in the `re` module:

* `re.match(pattern, str)`
* `re.search(pattern, str)`
* `re.findall(pattern, str)`
* `re.finditer(pattern, str)`
* `re.sub(pattern, replacement, str, count=0)`

Functions `match` and `search` return a *match object*, or `None` if no match is found.
A match object describes the found occurrence.
The function `findall` returns a list of all the occurrences of
the pattern. The elements in the list are strings (or tuples of strings, if the pattern contains more than one group).
The function `finditer` works like the `findall` function except
that instead of returning a list, it returns an iterator whose
items are match objects.
The function `sub` replaces all the occurrences of the pattern in
`str` with the string replacement and returns the new string.
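A short sketch contrasting `findall` and `finditer` on the integer example used earlier. The second input string, `"a=1 b=22"`, is an assumption added to show what `findall` returns when the pattern has two groups.

```python
import re

text = "23 + -24 = -1"

# finditer yields match objects, so positions are available as well
spans = [mo.span() for mo in re.finditer(r"[+-]?\d+", text)]
total = sum(int(mo.group(0)) for mo in re.finditer(r"[+-]?\d+", text))
print(spans, total)

# With two groups in the pattern, findall returns a list of tuples
pairs = re.findall(r"(\w+)=(\d+)", "a=1 b=22")
print(pairs)
```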

An example: The following program will replace all "she"
words with "he"

```
import re
str = "She goes where she wants to, she's a sheriff."
newstr = re.sub(r'\b[Ss]he\b', 'he', str)
print(newstr)
```

This will print `he goes where he wants to, he's a sheriff.`

The `sub` function can also use backreferences to refer to the
matched string. The backreferences `\1`, `\2`, and so on, refer
to the groups of the pattern, in order.
An example:
```
import re
str = """He is the president of Russia.
He’s a powerful man."""
newstr = re.sub(r'(\b[Hh]e\b)', r'\1 (Putin)', str, count=1)
print(newstr)
```

This will print
```
He (Putin) is the president of Russia.
He’s a powerful man.
```
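Backreferences in the replacement string can also reorder the groups. The sketch below uses made-up input strings to show this, along with the effect of the `count` parameter.

```python
import re

# \2 \1 puts the second captured word before the first one
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

# count=1 limits the replacement to the first occurrence only
first_only = re.sub(r"\d+", "N", "a1 b22 c333", count=1)

print(swapped)
print(first_only)
```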

### Match object

Functions `match`, `search`, and `finditer` use match objects
to describe the found occurrence.
The method `groups()` of the match object returns the tuple
of all the substrings matched by the groups of the pattern.
Each pair of parentheses in the pattern creates a new group.
These groups are referred to by indices 1, 2, ...
The group 0 is a special one: it refers to the match created by
the whole pattern.

Let’s look at the match object returned by the call

```
mo = re.search(r'\d+ (\d+) \d+ (\d+)',
'first 123 45 67 890 last')
```

The call `mo.groups()` returns a tuple `('45', '890')`.
We can access just some individual groups by using the
method `group(gid, ...)`.
For example, the call `mo.group(1)` will return `'45'`.
The zeroth group will represent the whole match:
`'123 45 67 890'`.

In addition to accessing the strings matched by the pattern
and its groups, the corresponding indices of the original string
can be accessed:

* The `start(gid=0)` and `end(gid=0)` methods return the start
and end indices of the matched group gid, correspondingly
* The method `span(gid)` just returns the pair of these start
and end indices
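As a sketch, the index methods can be tried on the same match object as above. The variable `text` is introduced here only so that the indices can be used for slicing.

```python
import re

text = 'first 123 45 67 890 last'
mo = re.search(r'\d+ (\d+) \d+ (\d+)', text)

whole = mo.span()                        # span of group 0, the whole match
first = (mo.start(1), mo.end(1))         # same as mo.span(1)
second = text[mo.start(2):mo.end(2)]     # slicing with the indices gives the group text

print(whole, first, second)
```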

The match object mo can also be used like a boolean value:

```python
import re

mo = re.search(r'\d+', 'room 101')   # illustrative pattern and string
if mo:
    print(mo.group(0))               # do something with the match
```

will do something if a match was found.
Alternatively, the match object can be converted to a boolean
value by the call `found = bool(mo)`.

### Miscellaneous stuff

If the same pattern is used in many function calls, it may be
wise to precompile the pattern, mainly for efficiency reasons.
This can be done using the `compile(pattern, flags=0)` function
in the `re` module. The function returns a so-called RE object.
The RE object has method versions of the functions found in
module `re`.
The only difference is that the first parameter is not the
pattern since the precompiled pattern is stored in the RE
object.
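A minimal sketch of a precompiled pattern, reusing the integer pattern from earlier. The name `integer_re` and the sample lines are assumptions made for illustration.

```python
import re

integer_re = re.compile(r"[+-]?\d+")   # compile once ...

lines = ["23 + -24 = -1", "no numbers here", "x = +7"]

# ... and call the method versions without passing the pattern again
found = [integer_re.findall(line) for line in lines]
print(found)

first = integer_re.search("x = +7")
print(first.group(0))
```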

The details of matching operation can be specified using
optional flags.
These flags can be given either inside the pattern or as a
parameter to the compile function.
Some of the more common flags are given in the following
table

| Inline flag | Flag attribute |
|-----|--------------|
|`(?i)` | re.IGNORECASE|
|`(?m)` | re.MULTILINE|
|`(?s)` | re.DOTALL|

The elements on the left are written inside the pattern; they should be
placed at the very beginning, since recent Python versions reject them elsewhere.
On the right there are attributes of the `re` module that can be
given to the compile function as the second parameter.

The `IGNORECASE` flag makes lower- and uppercase
characters appear as equal.
The `MULTILINE` flag makes the special characters `^` and `$`
match the beginning and end of each line in addition to the
beginning and end of the whole string. With this flag, `\A`
differs from `^`, and `\Z` differs from `$`.
The `DOTALL` flag makes the special character `.` (dot) also
accept the newline character, in addition to all the other
characters.

When giving multiple flags to the compile function, the flags
can be separated with the `|` sign.
For example, `re.compile(pattern, re.MULTILINE | re.DOTALL)`.
This is equal to `re.compile('(?m)(?s)' + pattern)`.
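The effect of the three flags can be seen in a small experiment. This is only a sketch; the three-line string `text` is an assumption chosen so that each flag changes the result.

```python
import re

text = "Apple\nbanana\napple"

plain = re.findall(r"^apple$", text)                                # no flags
multi = re.findall(r"^apple$", text, re.MULTILINE)                  # ^ and $ work per line
both = re.findall(r"^apple$", text, re.IGNORECASE | re.MULTILINE)   # and case is ignored
inline = re.findall(r"(?im)^apple$", text)                          # same flags inside the pattern
print(plain, multi, both, inline)

# By default the dot does not match a newline; with DOTALL it does
dot_default = re.search(r"Apple.banana", text)
dot_all = re.search(r"Apple.banana", text, re.DOTALL)
print(dot_default is None, dot_all is not None)
```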

#### <div class="alert alert-info">Group Exercise 1 (integers in brackets)</div>

Write a function `integers_in_brackets` that finds all the integers in a given string that are enclosed in brackets.

Example run:
`integers_in_brackets("  afd [asd] [12 ] [a34]  [ -43 ]tt [+12]xxx")`
returns
`[12, -43, 12]`.
So there can be whitespace between the number and the brackets, but no other character besides those that make up the integer.

Test your function from the `main` function.
<hr/>


## Basic file processing

A file can be opened with the `open` function. The call `open(filename, mode="r")` will return a *file object* (in Python 3, a file opened in text mode has the type `io.TextIOWrapper`). This file object can be used to refer to a file on disk. For example, when we want to read from or write to a file, we can use the methods `read` and `write` of the file object. After the file object is no longer needed, a call to the `close` method should be made.

We can control what kind of operations we can perform on a file with the *mode* parameter of the `open` function. Different options include opening a file for reading or writing,
whether the file should exist already or be created with the
call to open, etc. Here's a list of the most common opening modes:

| Mode | Description |
| ---- | ----------- |
| `r`  | read-only mode, file must exist |
| `w`  | write-only mode, creates, or overwrites an existing file |
| `a`  | write-only mode, write always appends to the end |
| `r+` | read/write mode, file must already exist |
| `w+` | read/write mode, creates, or overwrites an existing file |
| `a+` | read/write mode, write will append to end |
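The difference between the `w` and `a` modes can be tried out with a short sketch. The scratch file `modes.txt` in a temporary folder is a hypothetical name used only for this demonstration.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary folder, used only for this demo
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

with open(path, "w") as f:      # "w" creates the file
    f.write("one\n")
with open(path, "a") as f:      # "a" appends to the end
    f.write("two\n")
with open(path, "r") as f:
    after_append = f.read()

with open(path, "w") as f:      # "w" again discards the old contents
    f.write("three\n")
with open(path, "r") as f:
    after_overwrite = f.read()

print(repr(after_append), repr(after_overwrite))
```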

At the end of the mode string either the letter `t` or `b` can be appended. These stand for text mode and binary mode. If this letter is not given, the file is opened in text mode by default.

For binary mode the contents of the file are not interpreted in any way, and the read and write methods handle bytes. (A byte consists of 8 bits and can be used to represent a number in the range 0 to 255.)

In text mode two interpretations happen:

* On the Windows operating system the end of a line in files is encoded by two characters (`'\r\n'`). When the file is read these two characters are converted to a single `'\n'` character. During writes to a file this conversion happens in the opposite direction.
* One character is encoded in the file as one or more bytes. This conversion happens automatically during read and write operations. One common encoding between bytes and characters is utf-8. In this encoding, the Finnish character `'ä'`, for example, is encoded as the following sequence of bytes:

In [0]:
"ä".encode("utf-8")

b'\xc3\xa4'

Above the two bytes were expressed as hexadecimals. In decimal notation they would be 195 and 164. (Both in the range from 0 to 255.)

In [0]:
list("ä".encode("utf-8"))              # Show as a list of integers

[195, 164]

What is the utf-8 encoding of the letter `'a'`?

During this course we will only consider files containing text, so the default text mode is fine for us. But we might sometimes have to specify the encoding of a file, if it is not the usual utf-8.
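As a sketch of the two modes, the cell below writes one line in text mode and reads it back both as bytes and as text. The scratch file `demo.txt` is a hypothetical name, and the encoding is passed explicitly so that the result does not depend on the platform default.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w", encoding="utf-8") as f:    # text mode: characters are encoded to bytes
    f.write("ä\n")

with open(path, "rb") as f:                     # binary mode: the bytes as stored on disk
    raw = f.read()

with open(path, "r", encoding="utf-8") as f:    # text mode: bytes are decoded to characters
    text = f.read()

print(raw)
print(repr(text))
```

On Windows the bytes end with `\r\n` instead of `\n`, while the text read back is the same on every platform.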

### Some common file object methods
* `read(size)` will read size characters/bytes as a string
* `write(string)` will write string/bytes to a file
* `readline()` will read a string until and including the next newline character is met
* `readlines()` will return a list of all lines of a file
* `writelines()` will write a list of lines to a file
* `flush()` will try to make sure that the changes made to a file are written to disk immediately
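The methods listed above can be tried on a small scratch file. This is a minimal sketch; the file name `lines.txt` and its contents are assumptions made for the demo.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "lines.txt")

with open(path, "w") as f:
    f.write("first\n")
    f.writelines(["second\n", "third\n"])   # writelines does not add newlines itself

with open(path, "r") as f:
    head = f.read(3)            # the first three characters
    rest_of_line = f.readline() # continues up to and including the newline
    remaining = f.readlines()   # all the lines that are left

print(repr(head), repr(rest_of_line), remaining)
```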

In [16]:
f = open("02_regex.ipynb", "r")   # Let's open this notebook file,
                                  # which is essentially a text file.
                                  # So you can open it in a text editor as well.

for i in range(5):            # And read the first five lines
    line = f.readline()
    print(f"Line {i}: {line}", end="")
f.close()

Line 0: {
Line 1:  "cells": [
Line 2:   {
Line 3:    "cell_type": "markdown",
Line 4:    "metadata": {


It is easy to forget to close the file. One can use a *context manager* to solve this problem. A context manager is created with the `with` statement. After the indented block of the `with` statement exits, the file will be automatically closed.

In [17]:
with open("../realworld/csv/04_rw_csv.ipynb", "r") as f:          # the file will be automatically closed,
                                              # when the with block exits
    for i in range(5):
        line = f.readline()
        print(f"Line {i}: {line}", end="")

Line 0: {
Line 1:  "cells": [
Line 2:   {
Line 3:    "cell_type": "markdown",
Line 4:    "metadata": {},


A file object is iterable. This means that we can iterate through the lines in the file using a for loop, as in the example below:

In [18]:
max_len = 0
with open("../realworld/csv/04_rw_csv.ipynb", "r") as f:
    for line in f:    # iterates through all the lines in the file
        if len(line) > max_len:
            max_len = len(line)
print(f"The longest line in this file has length {max_len}")

The longest line in this file has length 220


### Standard file objects
Python automatically opens three file objects:

* `sys.stdin` for *standard input*
* `sys.stdout` for *standard output*
* `sys.stderr` for *standard error*

To read a line from the user (keyboard), you can call `sys.stdin.readline()`. To write a line to the user (screen), call `sys.stdout.write(line)`. The standard error is meant for error messages only, even though its output often goes to the same destination as standard output.

The print function uses the file `sys.stdout` and input function uses the file `sys.stdin`. An example of usage:

In [19]:
import sys
import random
i = random.randint(-10, 10)
if i >= 0:
    sys.stdout.write("Got a non-negative integer.\n")
else:
    sys.stderr.write("Got a negative integer.\n")

Got a non-negative integer.


These standard file objects are meant to be a basic input/output mechanism in textual form. The destinations of the file objects can be changed to point somewhere other than the usual keyboard and screen. Very often these are redirected to files. For example, it is usual to point stderr to a file where all error messages are logged.
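Redirection is usually done from the terminal, but it can also be sketched inside Python. The example below is an illustration that uses `contextlib.redirect_stdout` from the standard library and an in-memory `io.StringIO` buffer in place of a real log file.

```python
import contextlib
import io
import sys

buffer = io.StringIO()   # an in-memory text file standing in for a real file

with contextlib.redirect_stdout(buffer):
    print("this goes to the buffer")
    sys.stdout.write("so does this\n")

captured = buffer.getvalue()
print(repr(captured))    # printed normally, since the redirection has ended
```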

## sys module

We saw above that the `sys` module contains the three file objects `sys.stdin`, `sys.stdout`, and `sys.stderr`. It also has a few other useful attributes. The attribute `sys.path` is the list of folders that Python uses to look for imported modules.

The list `sys.argv` contains the so-called *command line parameters*. For example, if you are using the terminal in Linux, you can run your program with the command `python3 programname.py param1 param2 ...`. (The terminal window is a textual interface to your computer instead of the usual graphical interface.) After Python has started your program, the command line parameters are visible as follows. The name of the program is in `sys.argv[0]`. The rest of the command line parameters come after the program name in this list: `sys.argv[1]=="param1"`, `sys.argv[2]=="param2"`, and so on. The command line parameters can be useful in adjusting the behaviour of your program. A few examples of these will be in the following exercises.

The function `sys.exit` can be used to exit your program immediately. The integer parameter given to this function is the return value of the program. Usually the return value 0 means that the program ran successfully, and a non-zero integer means that an error occurred. This return value is accessible from the terminal window from which you started the program.
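A minimal sketch of how a program might use these pieces together. The function name `main`, the usage message, and the simulated argument list are assumptions made for illustration. The list is passed in by hand here because `sys.argv` in a notebook holds the arguments of the notebook kernel rather than your own.

```python
import sys

def main(argv):
    """Print the command line parameters and return an exit status."""
    if len(argv) == 1:                       # only the program name was given
        sys.stderr.write(f"usage: {argv[0]} param1 [param2 ...]\n")
        return 1                             # non-zero: an error occurred
    for i, param in enumerate(argv[1:], start=1):
        print(f"parameter {i}: {param}")
    return 0                                 # zero: success

# Simulated command line, as if started with
#   python3 programname.py param1 param2
status = main(["programname.py", "param1", "param2"])
print("exit status would be", status)

# In a real script one would end with: sys.exit(main(sys.argv))
```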

#### <div class="alert alert-info">Group Exercise 2 (file listing)</div>

The file `files/listing.txt` contains a list of files with one line per file. Each line contains seven fields: access rights, number of references, owner's name, name of owning group, file size, date, filename. These fields are separated with one or more spaces. Note that there may be spaces also within these seven fields.

Write a function `file_listing` that loads the file `files/listing.txt`. It should return a list of tuples (size, month, day, hour, minute, filename). Use regular expressions to do this (either the `match`, `search`, `findall`, or `finditer` function).

An example: for line
```
-rw-r--r-- 1 jttoivon hyad-all   25399 Nov  2 21:25 exception_hierarchy.pdf
```
the function should create the tuple `(25399, "Nov", 2, 21, 25, "exception_hierarchy.pdf")`.
<hr/>

In [27]:
import os
os.listdir(path='.')

['.DS_Store',
 '02_regex.ipynb',
 'supplemental',
 '04_closures_generators.ipynb',
 'README.md',
 '05_gen_regex_exercises.ipynb',
 '01_generators.ipynb',
 'files',
 '.ipynb_checkpoints',
 '03_str_manipulate.ipynb']

In [26]:
def getListOfFiles(dirName):
    # create a list of file and sub directories 
    # names in the given directory 
    listOfFile = os.listdir(dirName)
    allFiles = list()
    # Iterate over all the entries
    for entry in listOfFile:
        # Create full path
        fullPath = os.path.join(dirName, entry)
        # If entry is a directory then get the list of files in this directory 
        if os.path.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath)
        else:
            allFiles.append(fullPath)
                
    return allFiles

In [29]:
dirName = '.'
 
# Get the list of all files in directory tree at given path
listOfFiles = getListOfFiles(dirName)
print(listOfFiles)

['./.DS_Store', './02_regex.ipynb', './supplemental/README.md', './04_closures_generators.ipynb', './README.md', './05_gen_regex_exercises.ipynb', './01_generators.ipynb', './files/listing.txt', './files/alice.txt', './files/rgb.txt', './files/table.png', './.ipynb_checkpoints/03_str_manipulate-checkpoint.ipynb', './.ipynb_checkpoints/01_generators-checkpoint.ipynb', './.ipynb_checkpoints/04_closures_generators-checkpoint.ipynb', './.ipynb_checkpoints/02_regex-checkpoint.ipynb', './03_str_manipulate.ipynb']


#### <div class="alert alert-info">Group Exercise 3 (red green blue)</div>

The file `files/rgb.txt` contains names of colors and their numerical representations in RGB format. The RGB format allows a color to be represented as a mixture of red, green, and blue components. Each component can have an integer value in the range [0,255]. Each line in the file contains four fields: red, green, blue, and colorname.
Each field is separated by some amount of whitespace (tab or space in this case).
The text file is formatted to make it print nicely, but that makes it harder to process by a computer. Note that some color names can also contain a space character.
 
Write a function `red_green_blue` that reads the file `rgb.txt` from the folder `files`. Remove the irrelevant first line of the file. The function should return a list of strings. Clean up the file so that the strings in the returned list have four fields separated by a single tab character (`\t`). Use regular expressions to do this.

The first string in the returned list should be:
```
'255\t250\t250\tsnow'
```

<hr/>

In [1]:
with open("files/rgb.txt", "r") as f:
    for line in f:             # each loop iteration consumes one line ...
        line = f.readline()    # ... and readline() consumes the next one,
        print(line)            # so only every second line is printed
# To print every line, drop the readline() call and use print(line, end="").
# No f.close() is needed: the with statement closes the file.

255 250 250		snow

248 248 255		GhostWhite

245 245 245		WhiteSmoke

255 250 240		floral white

253 245 230		old lace

250 240 230		linen

250 235 215		AntiqueWhite

255 239 213		PapayaWhip

255 235 205		BlanchedAlmond

255 218 185		peach puff

255 222 173		navajo white

255 228 181		moccasin

255 255 240		ivory

255 250 205		LemonChiffon

240 255 240		honeydew

245 255 250		MintCream

240 248 255		alice blue

230 230 250		lavender

255 240 245		LavenderBlush

255 228 225		MistyRose

  0   0   0		black

 47  79  79		DarkSlateGray

 47  79  79		DarkSlateGrey

105 105 105		DimGray

105 105 105		DimGrey

112 128 144		SlateGray

112 128 144		SlateGrey

119 136 153		LightSlateGray

119 136 153		LightSlateGrey

190 190 190		grey

211 211 211		LightGrey

211 211 211		LightGray

 25  25 112		MidnightBlue

  0   0 128		navy blue

100 149 237		cornflower blue

 72  61 139		dark slate blue

106  90 205		slate blue

123 104 238		medium slate blue

132 112 255		light slate blue

  0   0 205		medium

#### <div class="alert alert-info">Group Exercise 4 (word frequencies)</div>

Create a function `word_frequencies` that gets a filename as a parameter and returns a dict with the word frequencies. In the dictionary the keys are the words and the corresponding values are the number of times that word occurred in the file specified by the function parameter. Read all the lines from the file and split the lines into words using the `split()` method. Further, remove punctuation from the ends of words using the `strip("""!"#$%&'()*,-./:;?@[]_""")` method call.

Test this function in the main function using the file `alice.txt`. In the output, there should be a word and its count per line separated by a tab:

```
The     64
Project 83
Gutenberg	26
EBook   3
of      303
```

<hr/>

In [25]:
from collections import Counter

def word_count(fname):
    # Count how many times each whitespace-separated word occurs in the file
    with open(fname) as f:
        return Counter(f.read().split())

print("Word frequencies in the file:", word_count("files/alice.txt"))



#### <div class="alert alert-info">Group Exercise 5 (file count)</div>

This exercise can give two points at maximum!

Part 1.

Create a function `file_count` that gets a filename as a parameter and returns a triple of numbers. The function should read the file, count the number of lines, words, and characters in the file, and return a triple with these counts in this order. You get the division into words by splitting at whitespace. You don't have to remove punctuation.

Part 2.

Create a main function that in a loop calls `file_count` using each filename in the list of command line parameters `sys.argv[1:]` as a parameter, in turn.
For the call `python3 src/file_count file1 file2 ...`
the output should be
```
?      ?       ?       file1
?      ?       ?       file2
...
```
The fields are separated by tabs (`\t`). The fields are in order: linecount, wordcount, charactercount, filename.
<hr/>

In [34]:
import os

def count_files(in_directory):
    # Count the entries of in_directory that are regular files (not recursive)
    return sum(
        os.path.isfile(os.path.join(in_directory, name))
        for name in os.listdir(in_directory)
    )

In [37]:
count_files("/usr/lib")

293

#### <div class="alert alert-info">Group Exercise 6 (file extensions)</div>

This exercise can give two points at maximum!

Part 1.

Write a function `file_extensions` that gets a filename as a parameter.
It should read through the lines from this file. Each line contains a filename.
Find the extension for each filename. The function should return a pair, where the
first element is a list containing all filenames with no extension.
The second element of the pair is a dictionary with extensions (with the preceding period (`.`) removed) as keys; the corresponding values are lists of the filenames having that extension.

Sounds a bit complicated, but hopefully the next example will clarify this.
If the file contains the following lines
```
file1.txt
mydocument.pdf
file2.txt
archive.tar.gz
test
```
then the return value should be the pair:
`(["test"], { "txt" : ["file1.txt", "file2.txt"], "pdf" : ["mydocument.pdf"], "gz" : ["archive.tar.gz"] } )`

Part 2.

Write a `main` method that calls the `file_extensions` function with "src/filenames.txt" as the argument. Then print the results so that for each extension there is a line consisting of the extension and the number of files with that extension. The first line of the output should give the number of files without extensions.

With the example in part 1, the output should be
```
1 files with no extension
gz 1
pdf 1
txt 2
```
Had there been no filenames without extension then the first line would have been `0 files with no extension`. In the printout list the extensions in alphabetical order.
<hr/>

In [2]:
import os

# unpacking the tuple
file_name, file_extension = os.path.splitext("files/rgb.txt")

print(file_name)
print(file_extension)

files/rgb
.txt




- - -

Copyright © 2020 Qualex Consulting Services Incorporated.