# Process Files / NICE-TO-KNOW

Consider the files in the folder `files`.

To get all the text files (files ending with `.txt`) in `files`, the standard library module `glob` is a good fit.

How to process the files:
1. Get all filenames to process.
2. Read the content of each file (but what if there are too many files, or they are too big to fit in memory?)
3. Convert or prepare the data for processing
4. Process data
5. Output the result (possibly writing the data)
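The question raised in step 2 (too many or too big files) can be addressed by never holding more than one line in memory at a time. The following is a minimal sketch, not the approach used in the cells below: the function name `count_words_streaming` is hypothetical, and it swaps the plain dictionary for `collections.Counter` from the standard library.

```python
import glob
from collections import Counter


def count_words_streaming(pattern):
    """Count words across all files matching pattern, one line at a time."""
    freq = Counter()
    for filename in glob.glob(pattern):
        with open(filename, encoding='utf-8') as f:
            # Iterating over a file object yields one line at a time,
            # so the whole file is never loaded into memory at once.
            for line in f:
                freq.update(line.split())
    return freq


freq = count_words_streaming('files/*.txt')
print(len(freq), 'distinct words')
```

This merges steps 2 to 4 into a single pass. The trade-off is that the raw text is not kept around, so any later step that needs the full content would have to read the files again.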

Let's work through a simple example: counting how often each word occurs.

```Python
import glob

filenames = glob.glob('files/*.txt')
```

In [1]:
import glob

In [2]:
# 1: Get all filenames
filenames = glob.glob('files/*.txt')

In [3]:
filenames

['files/speckled.txt',
 'files/face.txt',
 'files/twisted.txt',
 'files/squires.txt',
 'files/coronet.txt',
 'files/carbuncle.txt',
 'files/treaty.txt',
 'files/bachelor.txt',
 'files/patient.txt',
 'files/bohemia.txt',
 'files/problem.txt',
 'files/crooked.txt',
 'files/engineer.txt',
 'files/interpreter.txt',
 'files/gloria_scott.txt',
 'files/clerk.txt',
 'files/copper.txt',
 'files/ritual.txt',
 'files/blaze.txt',
 'files/league.txt',
 'files/boscombe.txt']
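The list above is not in alphabetical order: `glob.glob` returns names in whatever order the operating system provides them. If a reproducible order matters, the result can be sorted. A small sketch, where the helper name `list_text_files` is an assumption made for illustration:

```python
import glob
import os


def list_text_files(folder):
    """Return the .txt files in folder, sorted by name."""
    return sorted(glob.glob(os.path.join(folder, '*.txt')))


filenames = list_text_files('files')
```

Sorting does not change the final word counts, but it does make intermediate results such as `words[:10]` the same on every machine.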

In [4]:
# 2: Read content
corpus = []

for filename in filenames:
    with open(filename, encoding='utf-8') as f:
        content = f.read()
        corpus.append(content)

In [5]:
len(corpus), len(filenames)

(21, 21)

In [7]:
# 3: Convert the data for processing

words = []

for file_content in corpus:
    file_words = file_content.split()
    words += file_words


In [8]:
len(words)

177047

In [9]:
words[:10]

['THE',
 'ADVENTURE',
 'OF',
 'THE',
 'SPECKLED',
 'BAND',
 'On',
 'glancing',
 'over',
 'my']
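Because `split()` only splits on whitespace, words that differ in case (`'THE'` and `'the'`) or that carry attached punctuation are treated as different words. One possible way to prepare the data further is sketched here. The helper `normalize` and the sample list are hypothetical and not part of the notebook.

```python
import string


def normalize(word):
    """Lowercase a word and strip leading/trailing ASCII punctuation."""
    return word.strip(string.punctuation).lower()


# Illustrative sample, not taken from the corpus
sample = ['THE', 'Holmes,', 'the', 'well-known', '--']

# Drop tokens that are empty after stripping (e.g. '--')
cleaned = [w for w in map(normalize, sample) if w]
print(cleaned)  # ['the', 'holmes', 'the', 'well-known']
```

Note that `string.punctuation` covers ASCII characters only, so typographic quotes in the text would need extra handling.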

In [10]:
# 4: Process data

freq = {}

for word in words:
    freq[word] = freq.get(word, 0) + 1
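The same counting can also be done with `collections.Counter` from the standard library, which additionally offers `most_common` for retrieving the most frequent entries. A short sketch with a stand-in word list (the name `sample_words` is hypothetical):

```python
from collections import Counter

# Stand-in for the real `words` list
sample_words = ['the', 'cat', 'the', 'dog', 'the', 'cat']

sample_freq = Counter(sample_words)

# most_common(n) returns the n most frequent (word, count) pairs
for word, count in sample_freq.most_common(2):
    print(word, count)
# the 3
# cat 2
```

A `Counter` is a subclass of `dict`, so code that iterates with `.items()` works on it unchanged.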

In [11]:
# 5: Output result

for word, count in freq.items():
    print(word, count)

THE 26
ADVENTURE 6
OF 6
SPECKLED 1
BAND 1
On 65
glancing 24
over 292
my 1532
notes 11
of 4433
the 9055
seventy 3
odd 3
cases 32
in 2806
which 1251
I 4152
have 1488
during 65
last 144
eight 15
years 70
studied 1
methods 16
friend 97
Sherlock 121
Holmes, 194
find 139
many 59
tragic, 1
some 413
comic, 1
a 4172
large 63
number 15
merely 14
strange, 5
but 702
none 23
commonplace; 1
for, 10
working 14
as 1277
he 1722
did 188
rather 104
for 1099
love 18
his 1919
art 6
than 267
acquirement 1
wealth, 1
refused 8
to 4451
associate 3
himself 103
with 1354
any 250
investigation 11
not 868
tend 2
towards 52
unusual, 1
and 4575
even 85
fantastic. 1
Of 37
all 511
these 120
varied 3
cases, 8
however, 148
cannot 68
recall 6
presented 12
more 270
singular 55
features 14
that 2693
was 2532
associated 11
well-known 7
Surrey 5
family 28
Roylotts 2
Stoke 10
Moran. 3
The 487
events 16
question 45
occurred 23
early 20
days 43
association 1
when 433
we 839
were 614
sharing 3
rooms 29
bachelors 1
Baker 43
Stree