# Chapter 4. Searching and Reading Local Files
### Crawling and searching directories (p. 114)
Download [dir](https://github.com/PacktPublishing/Python-Automation-Cookbook-Second-Edition/tree/master/Chapter04/documents/dir) using DownGit. See the [os.path docs](https://docs.python.org/3/library/os.path.html) and the object-oriented, higher-level [pathlib docs](https://docs.python.org/3/library/pathlib.html).

In [12]:
import os
import re

# Walk the tree rooted at 'dir'; each step yields a directory and its contents
for root, dirs, files in os.walk('dir'):
    print(f'     root={root}')
    for file in files:
        if file.endswith('.pdf'):
            print('PDF===>', end=' ')
        # Flag names that contain any odd digit
        if re.search(r'[13579]', file):
            print('ODD===>', end=' ')
        full_path = os.path.join(root, file)
        print(file, full_path)

     root=dir
file2.txt dir/file2.txt
ODD===> file1.txt dir/file1.txt
PDF===> file6.pdf dir/file6.pdf
     root=dir/subdir
ODD===> file3.txt dir/subdir/file3.txt
file4.txt dir/subdir/file4.txt
PDF===> ODD===> file5.pdf dir/subdir/file5.pdf
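
The same crawl can be written with `pathlib`, the object-oriented interface linked above. This is a minimal sketch, assuming a throwaway tree built in a temporary folder; the file names and the `classify` helper are placeholders, not part of the book's `dir` download.

```python
import re
import tempfile
from pathlib import Path


def classify(path):
    """Return the labels the os.walk loop would print for a file."""
    labels = []
    if path.suffix == '.pdf':
        labels.append('PDF')
    if re.search(r'[13579]', path.name):
        labels.append('ODD')
    return labels


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / 'subdir').mkdir()
    # Placeholder files standing in for the downloaded tree
    for name in ('file1.txt', 'file2.txt', 'subdir/file5.pdf'):
        (base / name).touch()

    # rglob('*') descends into subdirectories, much like os.walk
    found = {
        p.relative_to(base).as_posix(): classify(p)
        for p in sorted(base.rglob('*'))
        if p.is_file()
    }

print(found)
```

`Path.rglob` yields full `Path` objects, so there is no need for `os.path.join`.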


### Reading text files (p. 117)
Downloaded [zen_of_python.txt](https://github.com/PacktPublishing/Python-Automation-Cookbook-Second-Edition/blob/master/Chapter04/documents/zen_of_python.txt).

In [20]:
with open('zen_of_python.txt') as file:
    for line in file:
        if 'should' in line.lower():
            print(line) 
        if 'dutch' in line.lower():
            print(line)
            break

Errors should never pass silently.

There should be one-- and preferably only one --obvious way to do it.

Although that way may not be obvious at first unless you're Dutch.



### Dealing with encodings (p. 120)
Downloaded the `documents` directory. See the list of [encoding aliases](https://docs.python.org/3/library/codecs.html#standard-encodings).

In [None]:
with open('documents/example_utf8.txt') as file:
    print(file.read())
with open('documents/example_iso.txt', encoding='iso-8859-1') as file:
    print(file.read())

In [None]:
with open('documents/example_utf8.txt') as file:
    content = file.read()
with open('documents/junk_output_iso.txt', 'w', encoding='iso-8859-1') as file:
    file.write(content)
with open('documents/junk_output_iso.txt', encoding='iso-8859-1') as file:
    print(file.read())

In [32]:
from bs4 import UnicodeDammit
with open('documents/example_iso.txt', 'rb') as file:
    content = file.read()
suggestion = UnicodeDammit(content)
suggestion.original_encoding
suggestion.unicode_markup

'20£'
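
To see why the `encoding` argument matters, the sketch below encodes the string from the output above in both encodings and then decodes each with the wrong one. It works on in-memory bytes only and does not read the files in `documents`.

```python
text = '20£'

utf8_bytes = text.encode('utf-8')       # '£' takes two bytes in UTF-8
iso_bytes = text.encode('iso-8859-1')   # '£' takes one byte in ISO-8859-1

# Decoding UTF-8 bytes as ISO-8859-1 succeeds but produces mojibake
wrong = utf8_bytes.decode('iso-8859-1')

# Decoding ISO-8859-1 bytes as UTF-8 raises an error
try:
    iso_bytes.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(utf8_bytes, iso_bytes, repr(wrong), strict_failed)
```

The silent mojibake case is the more dangerous one, which is why a detector such as `UnicodeDammit` is useful when the encoding is unknown.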

### Reading CSV files
[csv docs](https://docs.python.org/3/library/csv.html)

In [None]:
import csv
with open('documents/top_films.csv', newline='') as file:
    # Pass the file object (not the string 'file') and iterate over the reader
    data = csv.reader(file)
    for row in data:
        print(row)

In [66]:
with open('documents/top_films.csv', newline='') as file:
    # DictReader treats the first line as headers and uses them as keys
    data = csv.DictReader(file)
    structured_data = [row for row in data]
structured_data[0]
structured_data[0].keys()
structured_data[0]['Rank']

['excel', 'excel-tab', 'unix']

In [None]:
csv.list_dialects()
with open('documents/top_films.csv', newline='') as file:
    dialect = csv.Sniffer().sniff(file.read())
    
with open('documents/top_films.csv', newline='') as file:
    reader = csv.reader(file, dialect)
    for row in reader:
        print(row)
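
Dialect sniffing can be tried without any file on disk. A minimal sketch, assuming a made-up semicolon-delimited sample (not the contents of `top_films.csv`):

```python
import csv
import io

# Hypothetical sample data, for illustration only
sample = 'Rank;Film;Year\n1;Film A;1999\n2;Film B;2001\n'

# Restricting the candidate delimiters makes the guess more reliable
dialect = csv.Sniffer().sniff(sample, delimiters=';,')

# The sniffed dialect can be passed to any reader, including DictReader
rows = list(csv.DictReader(io.StringIO(sample), dialect=dialect))
print(dialect.delimiter, rows[0]['Film'])
```

`Sniffer` works from heuristics, so on short or irregular samples it may raise `csv.Error` or guess wrong; passing `delimiters` narrows the search.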

### Reading log files
From Chapter 1:

In [80]:
import parse, delorean
from decimal import Decimal

class PriceLog(object):
    def __init__(self,timestamp,product_id,price):
        self.timestamp = timestamp
        self.product_id = product_id
        self.price = price
    def __repr__(self):
        return '<PriceLog({}, {}, {})>'.format(self.timestamp,self.product_id,self.price)
        
    @classmethod
    def parse(cls,log):
        '''
        Parse from a text log with the format
            [<Timestamp>] - SALE - PRODUCT: <product id> - PRICE: $<price>
        to a PriceLog object
        '''
        def price(string):
            return Decimal(string)
        def isodate(string):
            return delorean.parse(string)
        FORMAT = ('[{timestamp:isodate}] - SALE - PRODUCT: {product:d} - PRICE: ${price:price}')
        formats_extra = {'price' : price, 'isodate' : isodate}
        result = parse.parse(FORMAT,log,formats_extra)
        print('result>>>{}<<<<',result)  # yuri's debug output
        return cls(timestamp=result['timestamp'], product_id=result['product'], price=result['price'])

In [85]:
with open('documents/example_logs.log') as file:
    logs = [PriceLog.parse(log) for log in file]
len(logs)
logs[0]
total = sum(log.price for log in logs) #tot sales
total

result>>>{}<<<< <Result () {'timestamp': Delorean(datetime=datetime.datetime(2018, 6, 17, 22, 11, 50, 268396), timezone='UTC'), 'product': 1489, 'price': Decimal('9.99')}>
result>>>{}<<<< <Result () {'timestamp': Delorean(datetime=datetime.datetime(2018, 6, 17, 22, 11, 50, 268442), timezone='UTC'), 'product': 4508, 'price': Decimal('5.30')}>
result>>>{}<<<< <Result () {'timestamp': Delorean(datetime=datetime.datetime(2018, 6, 17, 22, 11, 50, 268454), timezone='UTC'), 'product': 8597, 'price': Decimal('15.49')}>
result>>>{}<<<< <Result () {'timestamp': Delorean(datetime=datetime.datetime(2018, 6, 17, 22, 11, 50, 268461), timezone='UTC'), 'product': 3086, 'price': Decimal('7.05')}>
result>>>{}<<<< <Result () {'timestamp': Delorean(datetime=datetime.datetime(2018, 6, 17, 22, 11, 50, 268468), timezone='UTC'), 'product': 1489, 'price': Decimal('9.99')}>


Decimal('47.82')
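
If `parse` and `delorean` are not installed, the same log format can be handled with the standard library. This is a sketch, not the book's approach: `LOG_RE` and `parse_line` are hypothetical names, and the sample line assumes the timestamp is written in ISO 8601 form, which may not match `example_logs.log` exactly.

```python
import re
from datetime import datetime
from decimal import Decimal

# Mirrors: [<Timestamp>] - SALE - PRODUCT: <product id> - PRICE: $<price>
LOG_RE = re.compile(
    r'\[(?P<timestamp>[^\]]+)\] - SALE - '
    r'PRODUCT: (?P<product>\d+) - PRICE: \$(?P<price>\d+\.\d+)'
)


def parse_line(line):
    """Return (timestamp, product_id, price), or None if the line does not match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    return (
        datetime.fromisoformat(match['timestamp']),
        int(match['product']),
        Decimal(match['price']),
    )


# Assumed line layout; the real file's timestamp text may differ
line = '[2018-06-17T22:11:50.268396] - SALE - PRODUCT: 1489 - PRICE: $9.99\n'
print(parse_line(line))
```

Unlike the `PriceLog.parse` version, this returns a naive `datetime` with no timezone attached.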

Count how many units of each product_id have been sold. See the [Counter docs](https://docs.python.org/3/library/collections.html#counter-objects).

In [86]:
from collections import Counter
counter = Counter(log.product_id for log in logs)
counter

Counter({1489: 2, 4508: 1, 8597: 1, 3086: 1})
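
`Counter` only counts occurrences. To total the revenue per product instead, a `defaultdict` of `Decimal` is one option. The sketch below uses the product IDs and prices copied from the parsed output above rather than the `logs` list, so it runs on its own.

```python
from collections import defaultdict
from decimal import Decimal

# (product_id, price) pairs copied from the parsed log output
sales = [
    (1489, Decimal('9.99')),
    (4508, Decimal('5.30')),
    (8597, Decimal('15.49')),
    (3086, Decimal('7.05')),
    (1489, Decimal('9.99')),
]

# Decimal() is zero, so unseen product IDs start from 0
revenue = defaultdict(Decimal)
for product_id, price in sales:
    revenue[product_id] += price

print(dict(revenue))
```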

### Reading file metadata
Metadata describes the file itself, not its content. The example file is `zen_of_python.txt`. See the [os docs](https://docs.python.org/3/library/os.html).

In [90]:
import os 
from datetime import datetime
stats = os.stat('zen_of_python.txt')
stats
stats.st_size
datetime.fromtimestamp(stats.st_mtime) #modified
datetime.fromtimestamp(stats.st_atime) #accessed

datetime.datetime(2022, 7, 27, 20, 30, 15, 776877)

In [92]:
#also
os.path.getsize('zen_of_python.txt')
os.path.getmtime('zen_of_python.txt') #unix fmt
os.path.getatime('zen_of_python.txt')

1658979014.5832143
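
`pathlib` exposes the same metadata through `Path.stat()`. A minimal sketch, assuming a placeholder file written to a temporary directory so that it does not depend on `zen_of_python.txt`:

```python
import tempfile
from datetime import datetime
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'sample.txt'   # placeholder file
    path.write_bytes(b'hello')

    stats = path.stat()               # same kind of result as os.stat(path)
    size = stats.st_size              # size in bytes
    modified = datetime.fromtimestamp(stats.st_mtime)

print(size, modified.isoformat())
```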

### Reading images
Downloaded the `images` directory. Requirements: Pillow (the Python Imaging Library fork; changed to 9.0.0 for Python 3.10) and xmltodict.

The metadata in photo files is defined in the EXIF (Exchangeable Image File) format. __[EXIF](https://www.slrphotographyguide.com/what-is-exif-metadata/)__ is a standard for storing information about pictures, including the camera that took the picture, when it was taken, the GPS coordinates of the location, exposure, focal length, and color information.

JPG files store the EXIF info directly. PNG files store __[XMP](https://www.adobe.com/devnet/xmp.html)__ info, a more generic standard that can contain EXIF data.

For GPS conversion, see also `ch04_gps_conversion.py`.

In [119]:
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS
import xmltodict
from ch04_gps_conversion import exif_to_decimal, rdf_to_decimal

image1 = Image.open('images/photo-dublin-a1.jpg')
image1.height
image1.width
image1.format

# _getexif(), not getexif()
exif_info_1 = {TAGS.get(tag, tag): value for tag, value in image1._getexif().items()}
exif_info_1


{'TileWidth': 512,
 'TileLength': 512,
 'GPSInfo': {1: 'N',
  2: (53.0, 20.0, 48.86),
  3: 'W',
  4: (6.0, 14.0, 52.07),
  5: b'\x00',
  6: 5.0,
  7: (11.0, 7.0, 55.0),
  12: 'K',
  13: 0.20771699692994697,
  16: 'T',
  17: 90.6789667896679,
  23: 'T',
  24: 90.6789667896679,
  29: '2018:04:21',
  31: 4.0},
 'ResolutionUnit': 2,
 'ExifOffset': 222,
 'Make': 'Apple',
 'Model': 'iPhone X',
 'Software': 'Photos 3.0',
 'Orientation': 1,
 'DateTime': '2018:04:21 12:07:55',
 'XResolution': 72.0,
 'YResolution': 72.0,
 'ExifVersion': b'0221',
 'ComponentsConfiguration': b'\x01\x02\x03\x00',
 'ShutterSpeedValue': 11.092364532019705,
 'DateTimeOriginal': '2018:04:21 12:07:55',
 'DateTimeDigitized': '2018:04:21 12:07:55',
 'ApertureValue': 1.6959937156323646,
 'BrightnessValue': 10.59212376933896,
 'ExposureBiasValue': 0.0,
 'MeteringMode': 5,
 'Flash': 16,
 'FocalLength': 4.0,
 'ExifImageWidth': 4032,
 'ExifImageHeight': 3024,
 'FocalLengthIn35mmFilm': 28,
 'SceneCaptureType': 0,
 'SubsecTimeOr

In [113]:
image2 = Image.open('images/photo-dublin-a2.png')
image2.format
xmp_info = xmltodict.parse(image2.info['XML:com.adobe.xmp'])

rdf_info_2 = xmp_info['x:xmpmeta']['rdf:RDF']['rdf:Description']
rdf_info_2['tiff:Model']
rdf_info_2['exifEX:LensModel']
rdf_info_2['xmp:CreateDate']

1710

In [127]:
exif_info_1['GPSInfo']
gps_info_1 = { GPSTAGS.get(tag,tag) : value   for tag, value in exif_info_1['GPSInfo'].items() }
# exif_to_decimal(gps_info_1) -- TypeError?
rdf_to_decimal(rdf_info_2)

('N53.346905', 'W6.247796666666667')
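
EXIF stores each coordinate as a (degrees, minutes, seconds) tuple plus a hemisphere reference. The conversion to decimal degrees is simple arithmetic, sketched below. `dms_to_decimal` is a hypothetical helper written for illustration and is not the `exif_to_decimal` function from `ch04_gps_conversion`; the inputs are copied from the `GPSInfo` entry printed earlier.

```python
def dms_to_decimal(dms, ref):
    """Convert (degrees, minutes, seconds) and an N/S/E/W reference to signed decimal degrees."""
    degrees, minutes, seconds = (float(part) for part in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention
    return -value if ref in ('S', 'W') else value


# GPSInfo tags 1-4: latitude ref, latitude, longitude ref, longitude
latitude = dms_to_decimal((53.0, 20.0, 48.86), 'N')
longitude = dms_to_decimal((6.0, 14.0, 52.07), 'W')
print(round(latitude, 6), round(longitude, 6))
```

This returns signed floats rather than the `'N…'`/`'W…'` strings shown above; calling `float()` on each part also accepts rational values, which some Pillow versions return for these tags.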

In [126]:
image3 = Image.open('images/photo-dublin-b.png')
xmp_info = xmltodict.parse(image3.info['XML:com.adobe.xmp'])
rdf_info_3 = xmp_info['x:xmpmeta']['rdf:RDF']['rdf:Description']
rdf_info_3['xmp:CreateDate']
rdf_to_decimal(rdf_info_3)

('N53.34984166666667', 'W6.260388333333333')

OCR requires installing __[Tesseract](https://github.com/tesseract-ocr/tessdoc)__ and pytesseract.