# Digraphs and trigraphs in the French language

We will use "À l'ombre des jeunes filles en fleurs" by Marcel Proust.

In [1]:
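# Read the whole text into a list of lines (each keeps its trailing "\n").
# Assumption: the file is UTF-8 encoded; adjust the encoding if it is not.
with open("proust", 'r', encoding="utf-8") as f:
    lines = f.readlines()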

## Digraphs

We can study the digraphs by looping through every pair of adjacent characters in each line. When a new digraph is discovered, we add it to a dictionary. Each entry in this dictionary holds a counter that is incremented each time the digraph is seen again.

In [2]:
digraphs = {}
total_digraphs = 0
for line in lines:
    # Slide a two-character window over the line. readlines() keeps the
    # trailing newline, so it is counted like any other character.
    for i in range(1, len(line)):
        digraph = line[i - 1:i + 1]
        total_digraphs += 1
        digraphs[digraph] = digraphs.get(digraph, 0) + 1

# Print the 21 most frequent digraphs with their count and share of the total.
for digraph in sorted(digraphs, key=digraphs.get, reverse=True)[:21]:
    print("{0}: {1} - {2:.3f}%".format(digraph, digraphs[digraph], digraphs[digraph] * 100 / total_digraphs))

e : 56757 - 4.614%
s : 34662 - 2.818%
t : 26915 - 2.188%
 d: 25632 - 2.084%
ai: 20090 - 1.633%
es: 18935 - 1.539%
 l: 18428 - 1.498%
 p: 17769 - 1.444%
en: 17676 - 1.437%
le: 17410 - 1.415%
, : 16184 - 1.316%
nt: 15922 - 1.294%
re: 15690 - 1.275%
de: 15603 - 1.268%
qu: 14973 - 1.217%
 c: 14148 - 1.150%
ou: 13344 - 1.085%
on: 13264 - 1.078%
 s: 13133 - 1.068%
n : 12648 - 1.028%
it: 12642 - 1.028%


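As seen above, the most frequent digraphs in this Proust text are "e ", "s ", "t ", " d" and "ai". Because spaces are counted as characters, four of the top five are word boundaries (a final letter followed by a space, or a space followed by an initial letter) rather than pairs of letters inside a word.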
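The same counting can be written more compactly with `collections.Counter` from the standard library. The following is only a minimal sketch, not the method used in the cell above: `sample_lines` is a hypothetical stand-in for the lines read from the file, and `digraph_counts` is a name introduced here for illustration. As in the loop above, the newline at the end of each line is counted as a character.

```python
from collections import Counter

# Hypothetical sample standing in for the lines read from the "proust" file.
sample_lines = ["le chat\n", "le chien\n"]

digraph_counts = Counter()
for line in sample_lines:
    # zip(line, line[1:]) pairs each character with the one that follows it.
    digraph_counts.update(a + b for a, b in zip(line, line[1:]))

total = sum(digraph_counts.values())

# most_common(k) returns the k most frequent entries, highest count first.
for digraph, count in digraph_counts.most_common(3):
    print("{0!r}: {1} - {2:.3f}%".format(digraph, count, count * 100 / total))
```

Printing with `{0!r}` shows the quotes around each digraph, which makes leading and trailing spaces easier to tell apart.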

## Trigraphs

We can do the same with the trigraphs, this time looking at every run of three adjacent characters.

In [3]:
trigraphs = {}
total_trigraphs = 0
for line in lines:
    # Slide a three-character window over the line (newline included).
    for i in range(2, len(line)):
        trigraph = line[i - 2:i + 1]
        total_trigraphs += 1
        trigraphs[trigraph] = trigraphs.get(trigraph, 0) + 1

# Print the 21 most frequent trigraphs with their count and share of the total.
for trigraph in sorted(trigraphs, key=trigraphs.get, reverse=True)[:21]:
    print("{0}: {1} - {2:.3f}%".format(trigraph, trigraphs[trigraph], trigraphs[trigraph] * 100 / total_trigraphs))

 de: 13291 - 1.081%
es : 12413 - 1.010%
 qu: 11664 - 0.949%
de : 10431 - 0.848%
nt : 8981 - 0.731%
it : 8452 - 0.687%
ent: 8263 - 0.672%
ait: 8014 - 0.652%
le : 7904 - 0.643%
que: 7678 - 0.625%
 le: 7266 - 0.591%
e d: 7016 - 0.571%
re : 6410 - 0.521%
ne : 6170 - 0.502%
ais: 5953 - 0.484%
ue : 5862 - 0.477%
is : 5826 - 0.474%
 pa: 5741 - 0.467%
lle: 5717 - 0.465%
e l: 5498 - 0.447%
 la: 5376 - 0.437%
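
Since the digraph and trigraph cells differ only in the window width, both could be expressed with one helper. The sketch below assumes a hypothetical function `count_ngrams` and a hypothetical `sample_lines` list. It is an illustration of the idea, not code from the notebook.

```python
from collections import Counter

def count_ngrams(lines, n):
    """Count every run of n consecutive characters within each line."""
    counts = Counter()
    for line in lines:
        # A line of length L holds L - n + 1 windows of width n;
        # range() is empty when the line is shorter than n.
        counts.update(line[i:i + n] for i in range(len(line) - n + 1))
    return counts

# Hypothetical sample standing in for the lines read from the "proust" file.
sample_lines = ["que de\n", "que le\n"]

digraph_counts = count_ngrams(sample_lines, 2)
trigraph_counts = count_ngrams(sample_lines, 3)

for trigraph, count in trigraph_counts.most_common(3):
    print("{0!r}: {1}".format(trigraph, count))
```

With `n=2` and `n=3` the windows match those visited by the two loops above, so the totals are the line length minus one and minus two respectively, summed over all lines.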
