### Searching and reading local files

In this chapter, we will introduce the basic operations to read information from files, starting with searching and opening files stored in different directories and subdirectories. Then, we'll describe some of the most common file types and how to read them, including formats such as raw text files, PDFs, and Word documents.

The last recipe searches for a word in different kinds of files, recursively through a directory tree.

We'll cover the following recipes:

- Crawling and searching directories
- Reading text files
- Dealing with encodings
- Reading CSV files
- Reading log files
- Reading file metadata
- Reading images
- Reading PDF files
- Reading Word documents
- Scanning documents for a keyword

We will start by accessing all the files in a directory tree.

### Crawling and searching directories

In this recipe, we'll learn how to scan a directory recursively to get all the files it contains, including those in subdirectories. We can match files of a particular kind, such as text files, or every file. This is usually the first step when working with files: finding out which ones exist.

### Getting ready
Let's start by creating a test directory with some files:

```
$ mkdir dir
$ touch dir/file1.txt
$ touch dir/file2.txt
$ mkdir dir/subdir
$ touch dir/subdir/file3.txt
$ touch dir/subdir/file4.txt
$ touch dir/subdir/file5.pdf
$ touch dir/file6.pdf
```

All the files will be empty; we will use them in this recipe only to discover them. Notice there are four files that have a *.txt* extension, and two that have a *.pdf* extension.
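
The *touch* command is not available in every shell (the Windows command prompt, for example, lacks it). As a minimal sketch, the same tree of four *.txt* and two *.pdf* files can also be created from Python with *pathlib*. The names *TEST_FILES* and *create_test_tree* are hypothetical helpers introduced here for illustration; they are not part of the recipe.

```python
from pathlib import Path

# Relative paths of the empty test files: four .txt and two .pdf files.
# (Hypothetical constant, introduced only for this sketch.)
TEST_FILES = [
    "dir/file1.txt",
    "dir/file2.txt",
    "dir/subdir/file3.txt",
    "dir/subdir/file4.txt",
    "dir/subdir/file5.pdf",
    "dir/file6.pdf",
]


def create_test_tree(base="."):
    """Create the empty test files under base and return their paths."""
    created = []
    for relative in TEST_FILES:
        path = Path(base) / relative
        # Create dir and dir/subdir when they do not exist yet.
        path.parent.mkdir(parents=True, exist_ok=True)
        # touch() creates an empty file, like the shell command.
        path.touch()
        created.append(path)
    return created
```

Calling *create_test_tree()* with no argument creates *dir* under the current working directory.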

Enter the newly created *dir* directory:
```
$ cd dir
```

### How to do it...

In [1]:
# Print all the filenames in the dir directory and subdirectories:
import os

print("All files:")
for root, dirs, files in os.walk("."):
    for file in files:
        print(file)

print("\nFull path of the files:")

# print the full path of the files, joining with the root:
for root, dirs, files in os.walk("."):
    for file in files:
        full_file_path = os.path.join(root, file)
        print(full_file_path)

print("\nOnly .pdf files:")

# Print only the .pdf files:
for root, dirs, files in os.walk("."):
    for file in files:
        if file.endswith(".pdf"):
            full_file_path = os.path.join(root, file)
            print(full_file_path)

print("\nOnly files that contain an odd digit:")

# Print only files whose name contains an odd digit:
import re

for root, dirs, files in os.walk("."):
    for file in files:
        if re.search(r"[13579]", file):
            full_file_path = os.path.join(root, file)
            print(full_file_path)

All files:
file2.txt
first_file.ipynb
zen_of_python.txt
file1.txt
file3.txt
file4.txt
file5.pdf

Full path of the files:
.\file2.txt
.\first_file.ipynb
.\zen_of_python.txt
.\dir\file1.txt
.\dir\subdir\file3.txt
.\dir\subdir\file4.txt
.\dir\subdir\file5.pdf

Only .pdf files:
.\dir\subdir\file5.pdf

Only files that contain an odd digit:
.\dir\file1.txt
.\dir\subdir\file3.txt
.\dir\subdir\file5.pdf


### How it works...

*os.walk()* goes through a whole directory and all the subdirectories under it, returning all the files. For each directory, it yields a tuple with the directory path, the names of the subdirectories directly under it, and the names of its files:

In [5]:
for root, dirs, files in os.walk("."):
    print("\nRoot:", root, "\nDirectories:", dirs, "\nFiles:", files)


Root: . 
Directories: ['dir'] 
Files: ['file2.txt', 'first_file.ipynb', 'zen_of_python.txt']

Root: .\dir 
Directories: ['subdir'] 
Files: ['file1.txt']

Root: .\dir\subdir 
Directories: [] 
Files: ['file3.txt', 'file4.txt', 'file5.pdf']


The *os.path.join()* function allows us to join two paths, such as the base path and the file.
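
As a small illustration of the point above, the sketch below joins a few path components. The file and directory names are simply the test names used in this recipe.

```python
import os

# Join path components with the separator of the current platform
# ("/" on Linux and macOS, "\" on Windows).
subdir = os.path.join("dir", "subdir")
full_path = os.path.join(subdir, "file3.txt")
print(full_path)

# Joining with "." as the root keeps the leading "." component,
# as in the full paths printed in the recipe.
print(os.path.join(".", "file2.txt"))
```

Because the separator depends on the platform, the same code prints backslashes on Windows and forward slashes elsewhere.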

As paths are returned as plain strings, any kind of filtering can be done, as in the third example, which keeps only the *.pdf* files. In the fourth example, the full power of regular expressions is used to filter.
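
Since every filter is just a test on a filename string, the walking loop and the filter can be separated. The following is a minimal sketch of that idea, not the recipe's own code: *find_files*, *is_pdf*, and *has_odd_digit* are hypothetical names chosen for illustration.

```python
import os
import re


def find_files(top, predicate):
    """Yield the full path of each file under top whose name passes predicate."""
    for root, dirs, files in os.walk(top):
        for name in files:
            # predicate receives the bare filename, not the full path.
            if predicate(name):
                yield os.path.join(root, name)


def is_pdf(name):
    """Plain string test: the name ends with the .pdf extension."""
    return name.endswith(".pdf")


def has_odd_digit(name):
    """Regular expression test: the name contains at least one odd digit."""
    return re.search(r"[13579]", name) is not None
```

For example, *list(find_files(".", is_pdf))* collects the paths of all the *.pdf* files below the current directory, and any other function that takes a filename and returns a boolean can be passed in the same way.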

In the next recipe, we'll deal with the content of the files, and not just the filename.

### Reading text files

### How to do it...

In [24]:
# Open and print the whole file line by line (the result is not displayed):
# with open('zen_of_python.txt') as file:
#     for line in file:
#         print(line)
        
# Open the file and print any line containing the string "should":
with open('zen_of_python.txt', 'r') as file:
    for line in file:
        if "should" in line.lower():
            print(line)
            
# Open the file and print the first line containing the word "better":
with open('zen_of_python.txt', 'rt') as file:
    for line in file:
        if 'better' in line.lower():
            print(line)
            break

Errors should never pass silently.

There should be one-- and preferably only one --obvious way to do it.

Beautiful is better than ugly.
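
Both loops above follow the same pattern: iterate over the file line by line and test each line in lowercase. A minimal sketch of that pattern as a reusable function follows. The name *lines_containing* and its parameters are hypothetical, and the *utf-8* default encoding is an assumption, since the recipe opens the file with the platform default.

```python
def lines_containing(path, word, first_only=False, encoding="utf-8"):
    """Return the lines of the text file at path that contain word, ignoring case."""
    matches = []
    target = word.lower()
    with open(path, "r", encoding=encoding) as file:
        for line in file:
            if target in line.lower():
                # Each line keeps its trailing newline; strip it so that
                # printing a match does not produce an extra blank line.
                matches.append(line.rstrip("\n"))
                if first_only:
                    # Stop at the first match, like the break in the recipe.
                    break
    return matches
```

The blank lines in the output above come from the same detail handled in the comment: each line read from the file ends with a newline, and *print()* adds another one.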

