# Chapter 2 - Functions and files

-- *A Python Course for the Humanities by Folgert Karsdorp and Maarten van Gompel*

---

The previous chapter has hopefully whetted your appetite. In this chapter, we will focus on one of the most important tasks in humanities research: text processing. One goal of text processing is to clean up your data so that you can analyse it afterwards. Another common goal is to convert a collection of texts into a different format, for example from plain-text files to TEI XML files. In this chapter, we will give you the tools you need to work with collections of texts, clean them up and run some rudimentary data analyses on them.

## Reading files

Suppose you have a text stored on your computer. How can we read this text using Python? Python provides a very simple function called `open` with which we can read texts. In the folder `data` you will find a few short text excerpts that we will use in this chapter. Have a look at them if you feel like it. We can open these files with the following command:

In [7]:
fichier_ouvert = open('data/cid.v1071.1682.txt')

Now let's print the variable `fichier_ouvert`. What do you think will happen?

In [8]:
print(fichier_ouvert)

<_io.TextIOWrapper name='data/cid.v1071.1682.txt' mode='r' encoding='UTF-8'>

"I didn't expect that" is probably what is going through your mind. Python does not print the contents of the file but only a mysterious mention of some `TextIOWrapper`. This `TextIOWrapper` thing is Python's way of saying that it has opened a connection to the file `data/cid.v1071.1682.txt`.

But it also gives us some information we should pay attention to. Look at the part that starts with `encoding=`. `UTF-8` is the character encoding with which the file is read (you can learn a bit more about it on the Computerphile channel: https://www.youtube.com/watch?v=MijmeoH9LT4 ). In Python 3 (unlike Python 2), strings are Unicode; the encoding that `open` uses by default depends on your platform, and on most Linux and Mac systems it is UTF-8. We could, however, have done the following:

In [9]:
encodage_latin = open('data/cid.v1071.1682.txt', encoding='latin')
print(encodage_latin)

<_io.TextIOWrapper name='data/cid.v1071.1682.txt' mode='r' encoding='latin'>


For `encodage_latin`, you see `encoding='latin'` in the description of this `TextIOWrapper`. Make sure you always specify the encoding as UTF-8 if you work on Windows with texts that contain accented or Greek characters. Linux and Mac systems, however, should not need it.
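To see why the encoding matters, here is a minimal sketch that writes a short accented word to a temporary file as UTF-8 and reads it back with two different encodings. The temporary file and the names `sample`, `as_utf8` and `as_latin` are illustrative assumptions and not part of the course data.

```python
import os
import tempfile

sample = "trépas"  # contains one non-ASCII character: é

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")

    # Write the word to disk encoded as UTF-8.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)

    # Reading with the encoding used for writing gives the text back.
    with open(path, encoding="utf-8") as handle:
        as_utf8 = handle.read()

    # Reading the same bytes as Latin-1 turns é into two wrong characters.
    with open(path, encoding="latin-1") as handle:
        as_latin = handle.read()

print(len(as_utf8), len(as_latin))  # 6 7
```

Note that reading with the wrong encoding does not raise an error here. It silently yields one character too many (`Ã©` instead of `é`), which is why such mistakes are easy to miss.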

Now, if we want to *read* the contents of the file, we need to add the function `read`, as follows:

In [10]:
print(fichier_ouvert.read())

Il n'est pas temps encor de chercher le trépas :
Ton prince et ton pays ont besoin de ton bras.
La flotte qu'on craignait, dans ce grand fleuve entrée,
Croit surprendre la ville et piller la contrée.
Les Mores vont descendre, et le flux et la nuit
Dans une heure à nos murs les amène sans bruit.
La Cour est en désordre, et le peuple en alarmes :
On n'entend que des cris, on ne voit que des larmes.
Dans ce malheur public mon bonheur a permis
Que j'ai trouvé chez moi cinq cents de mes amis,
Qui sachant mon affront, poussés d'un même zèle,
Se venaient tous offrir à venger ma querelle.
Tu les as prévenus ; mais leurs vaillantes mains
Se tremperont bien mieux au sang des Africains.
Va marcher à leur tête où l'honneur te demande :
C'est toi que veut pour chef leur généreuse bande.
De ces vieux ennemis va soutenir l'abord :
Là, si tu veux mourir, trouve une belle mort ;
Prends-en l'occasion, puisqu'elle t'est offerte ;
Fais devoir à ton roi son salut à ta perte ;
Mais reviens-en plutôt les pal

`read` is a function that operates on `TextIOWrapper` objects and lets us read the contents of a file into Python. Let's assign the contents of the file to the variable `texte`:

In [13]:
# Add `encoding='UTF-8'` if necessary
fichier_ouvert = open('data/cid.v1071.1682.txt') 
texte = fichier_ouvert.read()

The variable `texte` holds the contents of the file `data/cid.v1071.1682.txt`, and we can now manipulate it like any other string. After reading the contents of a file, the `TextIOWrapper` no longer needs to be open. In fact, it is good practice to close it as soon as you no longer need it. To do so, simply use the method `close()`:

In [12]:
fichier_ouvert.close()
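File objects also carry a `closed` attribute that tells you whether the connection is still open. The sketch below illustrates this on a throwaway temporary file (an assumption made so that the example runs anywhere), not on the course data; the names `closed_before` and `closed_after` are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")

    # Create a small file to read from.
    handle = open(path, "w", encoding="utf-8")
    handle.write("Ton prince et ton pays ont besoin de ton bras.")
    handle.close()

    # Open it again for reading and inspect the `closed` flag.
    handle = open(path, encoding="utf-8")
    contents = handle.read()
    closed_before = handle.closed  # the file is still open here
    handle.close()
    closed_after = handle.closed   # the file has now been closed

print(closed_before, closed_after)  # False True
```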

---

#### Exercise

Just to recap some of the things we learned in the previous chapter: can you write a block of code that defines the variable `nombre_de_e` and counts how many times the letter *e* occurs in `texte`? (Hint: use a `for` loop and an `if` statement.)

Also take the time to understand `assert`, which lets us check your work.

In [21]:
nombre_de_e = 0
# Your code here

# This code will check what you have written
assert nombre_de_e == 180, "We should find 180 'e'"

AssertionError: We should find 180 'e'
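As a small illustration of how `assert` behaves, the sketch below uses made-up values rather than the exercise itself: a true condition passes silently, while a false one raises an `AssertionError` carrying the message written after the comma.

```python
total = 2 + 2

# A true condition: nothing happens and the program continues.
assert total == 4, "2 + 2 should be 4"

# A false condition raises AssertionError; we catch it here only to inspect it.
try:
    assert total == 5, "total should be 5"
except AssertionError as error:
    message = str(error)

print(message)  # total should be 5
```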

Finally, there is another syntax for opening and reading a file: the `with` statement:

In [24]:
with open("data/cid.v1071.1682.txt") as fichier_cid:
    texte = fichier_cid.read()

What is special about this approach is that it closes the opened file by itself. Just like an `if`, the `with` applies to the whole indented block beneath it, within which we can perform operations on the file. Note the use of `as`: in plain English, this line reads as `with the opened file cid.v1071.1682.txt as the variable fichier_cid`.

Moreover, the variables assigned inside this block are still available afterwards, but the file will be closed. Can you guess what will happen with the following lines?

In [27]:
print(texte)
fichier_cid.read()

Il n'est pas temps encor de chercher le trépas :
Ton prince et ton pays ont besoin de ton bras.
La flotte qu'on craignait, dans ce grand fleuve entrée,
Croit surprendre la ville et piller la contrée.
Les Mores vont descendre, et le flux et la nuit
Dans une heure à nos murs les amène sans bruit.
La Cour est en désordre, et le peuple en alarmes :
On n'entend que des cris, on ne voit que des larmes.
Dans ce malheur public mon bonheur a permis
Que j'ai trouvé chez moi cinq cents de mes amis,
Qui sachant mon affront, poussés d'un même zèle,
Se venaient tous offrir à venger ma querelle.
Tu les as prévenus ; mais leurs vaillantes mains
Se tremperont bien mieux au sang des Africains.
Va marcher à leur tête où l'honneur te demande :
C'est toi que veut pour chef leur généreuse bande.
De ces vieux ennemis va soutenir l'abord :
Là, si tu veux mourir, trouve une belle mort ;
Prends-en l'occasion, puisqu'elle t'est offerte ;
Fais devoir à ton roi son salut à ta perte ;
Mais reviens-en plutôt les pal

ValueError: I/O operation on closed file.

`I/O operation` stands for `Input/Output` and refers to the methods for reading and writing files. Since our file was closed at the end of the `with` block, it can no longer be read.
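The same behaviour can be reproduced without stopping the program by catching the error. This is a minimal sketch on a temporary file (an assumption, so that it runs without the course data); the variable names are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("La Cour est en désordre")

    with open(path, encoding="utf-8") as handle:
        contents = handle.read()

    # Outside the block the variable still exists, but the file is closed,
    # so a further read raises ValueError.
    try:
        handle.read()
    except ValueError as error:
        message = str(error)

print(handle.closed)  # True
print(message)        # I/O operation on closed file.
```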

#### What we have learned

To finish this section, here is a recap of the concepts we covered. Read through the list and ask questions if anything is unclear.

- `open()`
- `UTF-8`
- `.close()`
- `.read()`
- how `TextIOWrapper` works
- `with ___ as ___ :`
- `assert ___ , ___`

---

## Writing our first function

In the previous exercise, you probably wrote a loop that iterates over all characters in the text and adds 1 to a counter each time the program finds the letter you are looking for. The examples from here on work on a short Greek passage instead: they assume that `text` holds the contents of `data/romans_1:14-23_gk.txt`, read with `open` and `read` as shown above, and they count the letter *ε* in a variable called `number_of_epsilons`. Counting objects in a text is a very common thing to do. Therefore, Python provides the convenient method `count`. It operates on strings (`somestring.count(argument)`) and takes as argument the object you want to count. Using it, the letter-counting loop can be rewritten as follows:

In [None]:
number_of_epsilons = text.count("ε")
print(number_of_epsilons)

In fact, `count` takes as argument any string you would like to find. We could just as well count how often the conjunction `τε` occurs:

In [None]:
print(text.count("τε"))

The string `τε` is found 14 times in our text. Does that mean that the word *τε* occurs 14 times in our text? Go ahead. Count it yourself. In fact, *τε* occurs only four times... Think about this. Why does Python print 14?

If we want to count how often the word *τε* occurs in the text and not the string `τε`, we could surround *τε* with spaces, like the following:

In [None]:
print(text.count(" τε "))

Although it gets the job done in this particular case, it is generally not a very solid way of counting words in a text. What if there are instances of *τε* followed by a comma or some end-of-sentence marker? Then we would need to query the text multiple times for each possible context of *τε*. For that reason, we're going to approach the problem using a different, more sophisticated strategy. 
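Both pitfalls can be seen on a short English sentence. This is only an illustrative sketch with a made-up sentence, not the Greek text used in this chapter.

```python
sentence = "The cat sat on the mat, then the cat left."

# Counting the bare string also finds it inside longer words ("then").
substring_hits = sentence.count("the")

# Padding with spaces misses a word that is followed by punctuation ("mat,").
padded_hits = sentence.count(" mat ")

print(substring_hits, padded_hits)  # 3 0
```

The lowercase word *the* occurs only twice and *mat* occurs once, so neither query gives the number of words we are after.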

Recall from the previous chapter the function `split`. What does this function do? The function `split` operates on a string, splits it on whitespace and returns a list of smaller strings (or words):

In [None]:
print(text.split())
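As a quick reminder of what `split` does when called without arguments, here is a small sketch on a made-up string: runs of spaces and newlines are treated as a single separator, and punctuation stays attached to the words.

```python
line = "Hello,   world!\nGoodbye."
tokens = line.split()
print(tokens)  # ['Hello,', 'world!', 'Goodbye.']
```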

---

#### Quiz!

All the things you have learnt so far should enable you to write code that counts how often a certain item occurs in a list. Write some code that defines the variable `number_of_hits` and counts how often the word *ἐν* (assigned to `item_to_count`) occurs in the list of words called `words`.

In [None]:
words = text.split()
item_to_count = "ἐν"
# insert your code here

# The following test should print True if your code is correct 
print(number_of_hits == 6)

---

We will go through the previous quiz step by step. We would like to know how often the preposition *ἐν* occurs in our text. As a first step we will split the string `text` into a list of words:

In [None]:
words = text.split()

Then we call the `count` method on `words` to find the number of `ἐν`s in `words`:

In [None]:
item_to_count = "ἐν"
number_of_hits = words.count(item_to_count)
print(number_of_hits)

Now, say we would like to know how often the word *καὶ* occurs in our text. We could adapt the previous lines of code to search for the word *καὶ*, but what if we also would like to count the number of times *εἰς* occurs, and *δύναμις* and *θεός* and... It would be really cumbersome to repeat all these lines of code for each particular search term we have. Programming is supposed to reduce our workload, not increase it. Just like the function `count` for strings, we would like to have a function that operates on lists, takes as argument the object we would like to count and returns the number of times this object occurs in our list.

In this and the previous chapter you have already seen lots of functions. A function does something, often based on some argument you pass to it, and generally returns a result. You are not just limited to using functions in the standard library but you can write your own functions.

In fact, you *must* write your own functions. Separating your problem into sub-problems and writing a function for each of those is an immensely important part of well-structured programming. Functions are defined using the `def` keyword; they take a name and, optionally, a number of parameters.

    def some_name(optional_parameters):

The `return` statement returns a value back to the caller and always ends the execution of the function. 
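To make the point about `return` ending the function concrete, here is a minimal sketch. The function `first_vowel` is a hypothetical example invented for this illustration: as soon as a `return` is reached, the loop and the function stop.

```python
def first_vowel(word):
    """Return the first vowel found in `word`, or None if there is none."""
    for letter in word:
        if letter in "aeiou":
            # Execution of the function ends here; later letters are not examined.
            return letter
    # Only reached when the loop finished without returning.
    return None

print(first_vowel("python"))  # o
print(first_vowel("rhythm"))  # None
```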

Going back to our problem, we want to write a function called `count_in_list`. It takes two arguments: (1) the object we would like to count and (2) the list in which we want to count that object. Let's write down the function definition in Python:

    def count_in_list(item_to_count, list_to_search):
    
Do you understand all the syntax and keywords in the definition above? Now all we need to do is to add the lines of code we wrote before to the body of this function:

In [None]:
def count_in_list(item_to_count, list_to_search): 
    number_of_hits = list_to_search.count(item_to_count)               
    return number_of_hits                         

All code should be familiar to you, except the `return` keyword. The `return` keyword tells Python to hand back the value of `number_of_hits` as the result of calling the function. OK, let's go through our function one more time, just to make sure you really understand all of it.

1. First we define a function using `def` and give it the name `count_in_list` (line 1);
2. This function takes two arguments: `item_to_count` and `list_to_search` (line 1);
3. Within the function, we define a variable `number_of_hits` and assign to it the result of calling `count` on `list_to_search` with `item_to_count` as the argument (line 2);
4. We return the value of `number_of_hits` (line 3).

Let's test our little function! We will first count how often the word *καὶ* occurs in our list of words `words`.

In [None]:
print(count_in_list("καὶ", words))
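If you do not have the Greek text loaded, the function can be tried on any list. The sketch below repeats the definition so that it runs on its own and uses a made-up English word list (`sample_words` is an illustrative name).

```python
def count_in_list(item_to_count, list_to_search):
    number_of_hits = list_to_search.count(item_to_count)
    return number_of_hits

sample_words = "to be or not to be".split()

print(count_in_list("to", sample_words))     # 2
print(count_in_list("not", sample_words))    # 1
print(count_in_list("maybe", sample_words))  # 0
```

An item that is absent from the list simply yields 0; no error is raised.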

---

#### Quiz!

Using the function we defined, print how often the word *θεοῦ* occurs in our text.

In [None]:
# insert your code here

---

## A more general count function

Our function `count_in_list` is a concise and convenient piece of code allowing us to rapidly and without too much repetition count how often certain items occur in a given list. Now what if we would like to find out for all words in our text how often they occur? Then it would still be quite cumbersome to call our function for each unique word. We would like to have a function that takes as argument a particular list and counts for each unique item in that list how often it occurs. There are multiple ways of writing such a function. We will show you two ways of doing it.

### A count function (take 1)

In the previous chapter you acquainted yourself with the `dictionary` structure. Recall that a dictionary consists of keys and values and allows you to quickly look up a value. We will use a dictionary to write the function `counter` that takes as argument a list and returns a `dictionary` with `keys` for each unique item and `values` showing the number of times it occurs in the list. We will first write some code without the function declaration. If that works, we will add it, just as before, to the body of a function.

We start with defining a variable `counts` which is an empty dictionary:

In [None]:
counts = {}

Next we will loop over all words in our list `words`. For each word, we check whether the dictionary already contains it. If not, we call the `count` method on our `words` list to discover how often the word occurs.

In [None]:
for word in words:
    if word not in counts:
        counts[word] = words.count(word)
print(counts)
print(counts['σοφοῖς'])

If you don't remember anymore how dictionaries work, go back to the previous chapter and read the part about dictionaries once more.

Notice that we didn't do anything if we found the words already in our dictionary.  We don't have to since we already have the counts for that word.

Now that our code is working, we can add it to a function. We define the function `counter` using the `def` keyword. It takes one argument (`list_to_search`).

In [None]:
def counter(list_to_search):                 
    counts = {}                              
    for word in list_to_search:              
        if word not in counts:                   
            counts[word] = list_to_search.count(word)                 
    return counts                            

Hopefully we are not boring you, but let's go through this function step by step.

1. We define a function using `def` and give it the name `counter` (line 1);
2. This function takes a single argument `list_to_search` which is the list we want to search through (line 1);
3. Next we define a variable `counts` which is an empty dictionary (line 2);
4. We loop over all words in `list_to_search` (line 3);
5. If the word is not already in `counts`, we look up the number of times it occurs in our list (lines 4-5);
6. We return `counts` (line 6).

Let's try out our new function!

In [None]:
# Store the result so that we can look up individual words afterwards
counts = counter(words)
print(counts)
print('γὰρ⸃ occurs ' + str(counts['γὰρ⸃']) + ' times.')
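The same function can be checked on a small English list. As a cross-check, this sketch compares the result with `collections.Counter` from Python's standard library, which computes the same kind of frequency table; the list `sample_words` is a made-up example.

```python
from collections import Counter

def counter(list_to_search):
    counts = {}
    for word in list_to_search:
        if word not in counts:
            counts[word] = list_to_search.count(word)
    return counts

sample_words = "to be or not to be".split()
counts = counter(sample_words)

print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
# The standard library gives the same frequencies.
print(counts == dict(Counter(sample_words)))  # True
```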

---

#### Quiz!

Let's put some of the stuff we learnt so far together. What we want you to do is to read into Python the file `data/romans_gk.txt`, convert it to a list of words and assign to the variable `theou_count` how often the word *θεοῦ* occurs in the text.

In [None]:
# insert your code here

# The following test should print True if your code is correct 
print(theou_count == 44)

---

### A count function (take 2)

Let's train our function writing skills a little more. We are going to write another counting function, this time using a slightly different strategy. Recall our function `count_in_list`. It takes as argument a list and the item we want to count in that list. It returns the number of times this item occurs in the list. If we call this function for each unique word in `words`, we obtain a list of frequencies, quite similar to the one we get from the function `counter`. What would happen if we just call the function `count_in_list` on each word in `words`? 

In [None]:
infile = open('data/romans_1:14-23_gk.txt')
text = infile.read()
infile.close()
words = text.split()

for word in sorted(words):
    print(word, count_in_list(word, words))

As you can see, we obtain the frequency of each word token in `words`, where we would like to have it only for unique word forms. The challenge is thus to come up with a way to convert our list of words into a structure with solely unique words. For this Python provides a convenient data structure called `set`. It takes as argument some iterable (e.g. a list) and returns a new object containing only unique items:

In [None]:
x = ['a', 'a', 'b', 'b', 'c', 'c', 'c']
unique_x = set(x)
print(unique_x)
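Two properties of sets are worth keeping in mind, sketched here on the same small list: a set has no guaranteed order, so the printed order may differ from the list, and `sorted` turns it back into an ordered list.

```python
x = ['a', 'a', 'b', 'b', 'c', 'c', 'c']
unique_x = set(x)

print(len(x), len(unique_x))  # 7 3
print(sorted(unique_x))       # ['a', 'b', 'c']
print('b' in unique_x)        # True
```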

Using `set` we can iterate over all unique words in our word list and print the corresponding frequency:

In [None]:
unique_words = set(words)
for word in sorted(unique_words):
    print(word, count_in_list(word, words))

We wrap the lines of code above into the function `counter2`:

In [None]:
def counter2(list_to_search):
    unique_words = set(list_to_search)
    for word in unique_words:
        print(word, count_in_list(word, list_to_search))

A final check to see whether our function behaves correctly:

In [None]:
counter2(words)

---

#### Quiz!

We have written two functions `counter` and `counter2`, both used to count for each unique item in a particular list how often it occurs in that list. Can you come up with some pros and cons for each function? Why is `counter2` better than `counter` or why is `counter` better than `counter2`?

*Double click this cell and write down your answer.*

---

## Text clean up

In the previous section we wrote code to compute a frequency distribution of the words in a text stored on our computer. The function `split` is a quick and dirty way of splitting a string into a list of words. However, if we look through the frequency distributions, we notice quite a lot of noise. For instance, the conjunction *γὰρ* occurs 5 times, but we also find `γὰρ⸃` occurring once. And `ὁ` occurs once, while the capitalized `Ὁ` also occurs once. Of course we would like to add these counts together. The tokenization of our text using `split` is fast and simple, but it leaves us with noisy and incorrect frequency distributions.

There are essentially two strategies to follow to correct our frequency distributions. The first is to come up with a better procedure for splitting our text into words. The second is to clean up our text and pass this clean result to the convenient `split` function. For now we will follow the second path.

Some words in our text are capitalized. To lowercase these words, Python provides the function `lower`. It operates on strings:

In [None]:
x = 'Ὁ'
x_lower = x.lower()
print(x_lower)

We can apply this function to our complete text to obtain a completely lowercased text, using:

In [None]:
text_lower = text.lower()
print(text_lower)

This solves our problem with miscounting capitalized words, leaving us with the problem of punctuation. The function `replace` is just the function we're looking for. It takes two arguments: (1) the string we would like to replace and (2) the string we want to replace the first argument with:

In [None]:
x = 'Please. remove. all. dots. from. this. sentence.'
x = x.replace(".", "")
print(x)

Notice that we replace all dots with an empty string written as `""`. 
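One detail that is easy to overlook: `replace` does not change the original string but returns a new one, which is why the result was assigned back to `x` above. The sketch below shows this on a made-up string, together with the optional third argument that limits the number of replacements.

```python
original = "one. two. three."

without_dots = original.replace(".", "")   # replace every dot
first_only = original.replace(".", "", 1)  # replace only the first dot

print(original)      # one. two. three.
print(without_dots)  # one two three
print(first_only)    # one two. three.
```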

---

#### Quiz!

Write code that lowercases the following short text and removes all commas from it:

In [None]:
short_text = "Commas, as it turns out, are so much overestimated."
# insert your code here

# The following test should print True if your code is correct 
print(short_text == "commas as it turns out are so much overestimated.")

---

We would like to remove all punctuation from a text, not just dots and commas. We will write a function called `remove_punc` that removes all (simple) punctuation from a text. Again, there are many ways in which we can write this function. We will show you two of them. The first strategy is to repeatedly call `replace` on the same string each time replacing a different punctuation character with an empty string. 

In [None]:
def remove_punc(text):
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
    for marker in punctuation:
        text = text.replace(marker, "")
    return text

short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
print(remove_punc(short_text))

The second strategy shows that we can achieve the same result without using the string method `replace`. Remember that a string consists of characters. We can loop over a string, accessing each character in turn. Each time we find a punctuation marker, we skip to the next character.

In [None]:
def remove_punc2(text):
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
    clean_text = ""
    for character in text:
        if character not in punctuation:
            clean_text += character
    return clean_text

short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
print(remove_punc2(short_text))
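For comparison, Python's standard library offers a third route that is not part of this course: `str.maketrans` builds a table of characters to delete and `str.translate` applies it in one pass. The sketch below is only an illustration; the name `remove_punc3` is hypothetical, and `string.punctuation` covers ASCII punctuation only, so it would not remove the Greek markers discussed later.

```python
import string

def remove_punc3(text):
    # Third argument of maketrans: characters that should be deleted.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
print(remove_punc3(short_text))
# Commas as it turns out are overestimated Dots however even more so
```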

---

#### Quiz!

1) Can you come up with any pros or cons for each of the two functions above?

*Write your answer here* (double click me)

2) Now it is time to put everything together. We want to write a function `clean_text` that takes as argument a text represented by a string. The function should return this string with all punctuation removed and all characters lowercased.

In [None]:
def clean_text(text):
    pass  # insert your code here, replacing `pass`
    
# The following test should print True if your code is correct 
short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
print(clean_text(short_text) == 
      "commas as it turns out are overestimated dots however even more so")

3) This last exercise puts everything together. We want you to open and read the file `data/romans_gk.txt` once more, clean up the text and recompute the frequency distribution. Assign to `christou_counts` the number of times the genitive title *χριστοῦ* occurs in the text.

But before we do that, we need a slightly altered version of our `remove_punc` function that uses the Greek instead of the English punctuation (i.e., ’(.;)—⸂⸀⸁,·⸃) and a `clean_text` function that uses this new function.

In [None]:
def remove_punc_greek(text):
    punctuation = '’(.;)—⸂⸀⸁,·⸃'
    for marker in punctuation:
        text = text.replace(marker, "")
    return text

def clean_greek(text):
    return remove_punc_greek(text.lower())

In [None]:
# insert your code here

# The following test should print True if your code is correct 
print(christou_counts == 27)

---

## Writing results to a file

We have accomplished a lot! You have learnt how to read files from your computer using Python, how to manipulate them, clean them up and compute a frequency distribution of the words in a text file. We will finish this chapter by explaining how to write your results to a file. We have already seen how to read a text from our disk. Writing to our disk is only slightly different. The following lines of code write a single sentence to the file `first-output.txt`.

In [None]:
outfile = open("first-output.txt", mode="w")
outfile.write("My first output.")
outfile.close()

Go ahead and open the file `first-output.txt` located in the folder where this course resides. As you can see it contains the line `My first output.`. To write something to a file we open, just as in the case of reading a file, a `TextIOWrapper` which can be seen as a connection to the file `first-output.txt`. The difference with opening a file for reading is the *mode* with which we open the connection. Here the mode says `w`, meaning "open the file for writing". To open a file for reading, we set the mode to `r`. However, since this is Python's default setting, we may omit it.
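One consequence of mode `w` is that an existing file is emptied each time it is opened; mode `a` appends instead. The sketch below demonstrates both in a temporary folder (an assumption, so that no file in the course folder is touched) and uses the `with` statement introduced earlier.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "first-output.txt")

    with open(path, mode="w", encoding="utf-8") as outfile:
        outfile.write("My first output.\n")

    # "w" starts from an empty file every time, so this replaces the line above.
    with open(path, mode="w", encoding="utf-8") as outfile:
        outfile.write("My second output.\n")

    # "a" appends to whatever is already in the file.
    with open(path, mode="a", encoding="utf-8") as outfile:
        outfile.write("One more line.\n")

    with open(path, encoding="utf-8") as infile:
        contents = infile.read()

print(contents)
```

Only the second and third sentences end up in the file. Note also that `write` does not add a line break by itself, hence the explicit `\n`.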

---

#### Quiz!

In the final quiz of this chapter we will ask you to write the frequency distribution over the words in `data/romans_gk.txt` to the file `data/romans-frequency-distribution.txt`. We will give you some code to get you started:

In [None]:
# first open and read data/romans_gk.txt. Don't forget to close the infile
infile = open("data/romans_gk.txt")
text = infile.read() # read the contents of the infile
# close the file handler

# clean the text


# next compute the frequency distribution using the function counter

# now open the file data/romans-frequency-distribution.txt for writing

# The following lines will write the frequency distribution to a text file
for word, frequency in frequency_distribution.items():
    outfile.write(word + ";" + str(frequency) + '\n')
    
# close the outfile


---


<p><small><a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Python Programming for the Humanities</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://fbkarsdorp.github.io/python-course" property="cc:attributionName" rel="cc:attributionURL">http://fbkarsdorp.github.io/python-course</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. Based on a work at <a xmlns:dct="http://purl.org/dc/terms/" href="https://github.com/fbkarsdorp/python-course" rel="dct:source">https://github.com/fbkarsdorp/python-course</a>.</small></p>