# 02_02 Loading text files

In [1]:
import math
import collections
import dataclasses
import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as pp

We begin by loading a list of words from a file. Your exercise files contain a list that we can use as an example. The file is `words.txt`, and it sits in the same folder as this Jupyter notebook.

It is in fact the 1934 dictionary that is distributed with many Unix systems. If you wish, you can find a better dictionary and use that instead.

In Python we talk of _idioms_ when we think of code constructs that have become the preferred way to achieve a certain goal. One example is looping through all the lines of a text file.

To do so, we open the file for reading (which is the default mode, so no option is needed) and use the file as an iterable in a `for` loop, which gives us the lines one by one.

For the moment, we will just collect all lines in a list.

In [37]:
words = []
for line in open('words.txt'):
    words.append(line)
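The loop above leaves it to Python to close the file at some later point. A common variant wraps `open` in a `with` block, which closes the file as soon as the block ends. The sketch below assumes a small throwaway file (the name `sample_words.txt` and its contents are made up for illustration) so that it runs without the exercise files.

```python
import os
import tempfile

# Hypothetical stand-in for words.txt, so the sketch runs anywhere.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_words.txt')
with open(sample_path, 'w') as f:
    f.write('A\na\naa\nAaron\n')

# Same idiom as above: the file object yields its lines one by one.
# The with block closes the file when the loop is done.
sample_words = []
with open(sample_path) as f:
    for line in f:
        sample_words.append(line)

print(sample_words)   # ['A\n', 'a\n', 'aa\n', 'Aaron\n']
print(f.closed)       # True
```

For a short notebook session either form works; the `with` form matters more in long-running programs that open many files.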

What did we get? More than 200,000 words. Let's look at the first few, using slicing (as we learned in chapter 1).

In [6]:
len(words)

235886

In [7]:
words[:10]

['A\n',
 'a\n',
 'aa\n',
 'aal\n',
 'aalii\n',
 'aam\n',
 'Aani\n',
 'aardvark\n',
 'aardwolf\n',
 'Aaron\n']

Very good. I do see two problems though: every word ends in the newline character, denoted as in the C language by the `\n` combination. Also, some words are capitalized, which will interfere with our signature scheme.

We can fix both issues using Python string methods. To strip leading and trailing whitespace (which includes newlines), we can apply `strip`. Let's take "Aaron" for example. 

In [8]:
'Aaron\n'.strip()

'Aaron'

To switch the entire string to lowercase, we use the method `lower`:

In [43]:
'Aaron\n'.strip().lower()

'aaron'
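As a small aside, here is a sketch of how `strip` compares with the narrower `rstrip('\n')`, and a reminder that string methods return new strings rather than changing the original. The string `raw` is a made-up example with extra whitespace.

```python
raw = '  Aaron\t\n'   # hypothetical line with leading spaces and a trailing tab

both_ends = raw.strip()          # removes whitespace from both ends
newline_only = raw.rstrip('\n')  # removes only the trailing newline
cleaned = raw.strip().lower()    # the chained form used in this section

print(repr(both_ends))      # 'Aaron'
print(repr(newline_only))   # '  Aaron\t'
print(repr(cleaned))        # 'aaron'
print(repr(raw))            # '  Aaron\t\n' (the original is unchanged)
```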

We now have something more interesting to do in the body of our file-reading loop:

In [47]:
words = []
for line in open('words.txt'):
    words.append(line.strip().lower())

In [48]:
words[:10]

['a',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'aani',
 'aardvark',
 'aardwolf',
 'aaron']

I do see a duplicate, which comes from "A" appearing both in uppercase and lowercase. One way to get rid of duplicates is to build not a list, but a set. So once again we iterate through the file, but replace the initial empty list with an empty set, and replace `append` with `add`.

In [49]:
words = set()
for line in open('words.txt'):
    words.add(line.strip().lower())

Given that the body of the loop is just one line, there is an even more idiomatic way of writing it. You probably guessed it already: as a comprehension. The comprehension is delimited by braces (because it's a set, not a list); the expression at the front transforms each line; and the loop goes through the file.

In [60]:
words = {line.strip().lower() for line in open('words.txt')}
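To check that the explicit loop and the comprehension build the same set, we can run both on a few lines held in memory. The list `lines` below is a made-up sample standing in for the file.

```python
# Hypothetical lines, standing in for what the file loop yields.
lines = ['A\n', 'a\n', 'aa\n', 'Aaron\n', 'aaron\n']

# Explicit loop with set.add
from_loop = set()
for line in lines:
    from_loop.add(line.strip().lower())

# Equivalent set comprehension
from_comprehension = {line.strip().lower() for line in lines}

print(from_loop == from_comprehension)       # True
print(len(lines), len(from_comprehension))   # 5 3
```

Five lines collapse to three entries because "A"/"a" and "Aaron"/"aaron" become identical after lowercasing.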

In [61]:
words

{'pseudotachylite',
 'queencake',
 'degreaser',
 'intercommonable',
 'pinatype',
 'homeomorphic',
 'thrummy',
 'anaryan',
 'insobriety',
 'papillosity',
 'prionus',
 'swerver',
 'udaler',
 'crevice',
 'discommend',
 'clavecinist',
 'jockteleg',
 'lupiform',
 'bromus',
 'theoryless',
 'pectinite',
 'basaree',
 'directrices',
 'paranete',
 'mesotonic',
 'stethophonometer',
 'fioretti',
 'unnamable',
 'boroglycerine',
 'cupflower',
 'nonadult',
 'doorweed',
 'ocean',
 'tragasol',
 'deadhearted',
 'sowans',
 'naipkin',
 'soreheadedness',
 'multiareolate',
 'ichthyoidal',
 'silverish',
 'earlike',
 'frocklike',
 'alphabetization',
 'jumpiness',
 'camphol',
 'ornamentality',
 'upeat',
 'glomerulonephritis',
 'noncallable',
 'voluptary',
 'truttaceous',
 'symmedian',
 'vaucheriaceae',
 'beheadlined',
 'tonally',
 'pentathlos',
 'dumpiness',
 'bunty',
 'nonfelony',
 'gonidial',
 'propylacetic',
 'costuming',
 'platydactylous',
 'upgo',
 'supportful',
 'increately',
 'prognostication',
 'werele

To get a list in alphabetical order, we can just wrap the set in the Python builtin `sorted`:

In [9]:
words = sorted({line.strip().lower() for line in open('words.txt')})
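A brief sketch of what `sorted` returns, using a made-up four-word set: the result is a new list, so unlike the set it has a fixed order and supports indexing and slicing.

```python
unordered = {'aardvark', 'a', 'aal', 'aa'}   # hypothetical sample set

ordered = sorted(unordered)   # new list; the set itself is left as it was

print(type(ordered).__name__)   # list
print(ordered)                  # ['a', 'aa', 'aal', 'aardvark']
print(ordered[:2])              # ['a', 'aa']
```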

In [10]:
words

['a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'aani',
 'aardvark',
 'aardwolf',
 'aaron',
 'aaronic',
 'aaronical',
 'aaronite',
 'aaronitic',
 'aaru',
 'ab',
 'aba',
 'ababdeh',
 'ababua',
 'abac',
 'abaca',
 'abacate',
 'abacay',
 'abacinate',
 'abacination',
 'abaciscus',
 'abacist',
 'aback',
 'abactinal',
 'abactinally',
 'abaction',
 'abactor',
 'abaculus',
 'abacus',
 'abadite',
 'abaff',
 'abaft',
 'abaisance',
 'abaiser',
 'abaissed',
 'abalienate',
 'abalienation',
 'abalone',
 'abama',
 'abampere',
 'abandon',
 'abandonable',
 'abandoned',
 'abandonedly',
 'abandonee',
 'abandoner',
 'abandonment',
 'abanic',
 'abantes',
 'abaptiston',
 'abarambo',
 'abaris',
 'abarthrosis',
 'abarticular',
 'abarticulation',
 'abas',
 'abase',
 'abased',
 'abasedly',
 'abasedness',
 'abasement',
 'abaser',
 'abasgi',
 'abash',
 'abashed',
 'abashedly',
 'abashedness',
 'abashless',
 'abashlessly',
 'abashment',
 'abasia',
 'abasic',
 'abask',
 'abassin',
 'abastardize',
 'abatable',
 'abate',
 'a

We are now ready to make anagrams!

By the way, if you want to try a different language, such as French, you just need the right file. Python strings are natively _Unicode_, meaning that they can handle international character sets transparently. The characters are stored internally using 1, 2, or 4 bytes each, as needed. The only care we need to take is to tell Python which _encoding_ to use for the files we read and write. (Several encodings exist that map characters to bytes, so we need to know the right one.)

Your exercise files include a French dictionary, written using the ISO-8859-1 encoding, also known as 'latin1'. So we can do:
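To make the idea of an encoding concrete, the sketch below converts one accented word to bytes in two different encodings. The word is just an illustrative choice; `str.encode` and `bytes.decode` are the standard methods for moving between text and bytes.

```python
word = 'abaiss\u00e9'   # 'abaissé', written with an escape for the accented letter

as_latin1 = word.encode('latin-1')   # one byte per character
as_utf8 = word.encode('utf-8')       # the accented letter takes two bytes

print(len(word))        # 7
print(as_latin1)        # b'abaiss\xe9'
print(as_utf8)          # b'abaiss\xc3\xa9'

# Decoding with the matching encoding recovers the original string.
print(as_latin1.decode('latin-1') == word)   # True
print(as_utf8.decode('utf-8') == word)       # True
```

The same seven characters become seven bytes in one encoding and eight in the other, which is why the reader of a file has to know which encoding the writer used.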

In [18]:
paroles = sorted({line.strip().lower()
                 for line in open('francais.txt', encoding='latin-1')})
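To see why the `encoding` argument matters, here is a sketch that writes a tiny Latin-1 file and reads it back twice, once with the right encoding and once with UTF-8. The file name `sample_francais.txt` and its three words are made up for illustration.

```python
import os
import tempfile

# Hypothetical stand-in for francais.txt: three words stored as Latin-1 bytes.
path = os.path.join(tempfile.mkdtemp(), 'sample_francais.txt')
with open(path, 'wb') as f:
    f.write('Abaiss\u00e9\nabandon\nabaisse\n'.encode('latin-1'))

# Reading with the matching encoding gives the expected words.
with open(path, encoding='latin-1') as f:
    sample = sorted({line.strip().lower() for line in f})
print(sample)   # ['abaisse', 'abaissé', 'abandon']

# Reading the same bytes as UTF-8 fails: a lone 0xE9 byte is not valid UTF-8.
try:
    with open(path, encoding='utf-8') as f:
        f.read()
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False
print(decoded_as_utf8)   # False
```

Note also that `sorted` compares strings character by character using their Unicode code points, so accented letters sort after all unaccented lowercase letters rather than in French dictionary order.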

In [19]:
paroles

['',
 'a',
 'ab',
 'abaissa',
 'abaissai',
 'abaissaient',
 'abaissais',
 'abaissait',
 'abaissant',
 'abaissas',
 'abaissasse',
 'abaissassent',
 'abaissasses',
 'abaissassiez',
 'abaissassions',
 'abaisse',
 'abaissement',
 'abaissements',
 'abaissent',
 'abaisser',
 'abaissera',
 'abaisserai',
 'abaisseraient',
 'abaisserais',
 'abaisserait',
 'abaisseras',
 'abaisserez',
 'abaisseriez',
 'abaisserions',
 'abaisserons',
 'abaisseront',
 'abaisses',
 'abaisseur',
 'abaisseurs',
 'abaissez',
 'abaissiez',
 'abaissions',
 'abaissons',
 'abaissâmes',
 'abaissât',
 'abaissâtes',
 'abaissèrent',
 'abaissé',
 'abaissée',
 'abaissées',
 'abaissés',
 'abandon',
 'abandonna',
 'abandonnai',
 'abandonnaient',
 'abandonnais',
 'abandonnait',
 'abandonnant',
 'abandonnas',
 'abandonnasse',
 'abandonnassent',
 'abandonnasses',
 'abandonnassiez',
 'abandonnassions',
 'abandonne',
 'abandonnent',
 'abandonner',
 'abandonnera',
 'abandonnerai',
 'abandonneraient',
 'abandonnerais',
 'abandonnerait',