# FileManager - Workings

This notebook is a scratchpad for testing and working through processes related to the FileManager project.

The assumed directory/file structure for this notebook is as follows:

- FileWorkings.ipynb  
- assets/
  - images/
    - IMG_20180703_205212.jpg
  - videos/
    - Snapchat-2084940612.mp4
  - mixed/
    - received_10204964983279784.jpeg
    - results_screenshot.jpg
    - Screenshot_20171225-113409.png
    - Snapchat-1225537317.mp4
    - WIN_20220906_11_22_52_Pro.mp4

## Table Of Contents

* **0.** [Dependencies and Settings](#0-Dependencies-and-Settings)  
* **1.** [Looking at Metadata](#1-Looking-at-Metadata)  
* **2.** [Operations on files](#2-Operations-on-files)  

## 0 Dependencies and Settings

Using `pathlib` as it is cross-platform:

In [1]:
from pathlib import Path

Using `platform` to check the user's current system:

In [2]:
import platform

Using `time` and `datetime` for waiting and conversion:

In [3]:
import time
from datetime import datetime

`math` for simple operations:

In [4]:
import math

Using `Pillow` and `hachoir` for metadata extraction:

In [34]:
from PIL import Image, ExifTags
from hachoir.parser import createParser
from hachoir.metadata import extractMetadata

Current working directory, for ease of use:

In [6]:
cwd = Path.cwd()
cwd

WindowsPath('C:/Users/seani/Documents/Projects/FileManager')

Function used to get names of files in a directory:

In [107]:
def get_filenames(path: str, mode='relative'):
    '''
    Returns name of file or all files in directory that path is pointing to. \
    If mode is 'relative', then only the filenames will be listed, if 'absolute' \
    then the full absolute paths will be listed.
    
    inputs: 
      path: string of path to the target directory/file
      mode: either 'relative' or 'absolute'
    outputs: list of string(s) of name(s) of file(s) in target directory/file
    '''
    
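    # reject unknown modes early, otherwise `temp` below would never be assigned
    if mode not in ('relative', 'absolute'):
        raise ValueError("mode must be 'relative' or 'absolute'")
    
    # get the file path from string (a Path object is also accepted)
    path = Path(path)
    
    # convert to an absolute path
    path = path.absolute()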
    
    # store all filename(s) used in a list
    file_list = []
    
    # check if we are referencing a file or directory (folder)
    # if directory, pass through each file and add path to list
    if path.is_dir():
        # get path of each file and append
        for file in path.glob("*"):
            # if relative only use name
            if mode == 'relative':
                temp = file.name
            # if absolute use full path
            elif mode == 'absolute':
                temp = str(file)
            
            # append to list
            file_list.append(temp)
    # if file, just add that path to the list
    else:
        # if relative only use name
        # filename is the last part of this path
        if mode == 'relative':
            temp = path.name
        # if absolute use full path
        elif mode == 'absolute':
            temp = str(path)
            
        # add filename to list
        file_list.append(temp)

    # sort file_list
    file_list = sorted(file_list)
    
    return file_list

## 1 Looking at Metadata

In this section we will investigate methods to look at file metadata.

Main images directory:

In [8]:
img = cwd / "assets" / "images"
img

WindowsPath('C:/Users/seani/Documents/Projects/FileManager/assets/images')

Main videos directory:

In [9]:
vid = cwd / "assets" / "videos"
vid

WindowsPath('C:/Users/seani/Documents/Projects/FileManager/assets/videos')

Mixed files directory:

In [10]:
mixed = cwd / "assets" / "mixed"
mixed

WindowsPath('C:/Users/seani/Documents/Projects/FileManager/assets/mixed')

### 1.1 Single image file

First, let's obtain a single image file: `IMG_20180703_205212.jpg`

In [11]:
path = img / r"IMG_20180703_205212.jpg"
path

WindowsPath('C:/Users/seani/Documents/Projects/FileManager/assets/images/IMG_20180703_205212.jpg')

We can look at its stat data:

In [12]:
statdata = path.stat()
statdata

os.stat_result(st_mode=33206, st_ino=5629499534378696, st_dev=2457206766, st_nlink=1, st_uid=0, st_gid=0, st_size=3855920, st_atime=1688050022, st_mtime=1687474906, st_ctime=1687474906)

There are numerous timestamps to look at:

- Time of last access (`st_atime`)
- Time of last change (`st_ctime`)
- Time of last modification (`st_mtime`)


These timestamps mean different things depending on the OS. On Windows, `st_ctime` holds the creation time (from Python 3.12 it is deprecated for this purpose in favour of `st_birthtime`), so for files that are rarely edited, such as images and videos, `ctime` and `mtime` are often the same, while for files that are modified regularly (e.g. `.txt` files) they diverge. On Unix systems, `st_ctime` is instead the time of the last metadata change, not the creation time. On Mac, the `st_birthtime` attribute gives the creation timestamp. On Linux, creation dates are more difficult to obtain, so the best estimate may be `mtime`. *An explanation can be found in [this Stack Overflow answer](https://stackoverflow.com/questions/237079/how-do-i-get-file-creation-and-modification-date-times/39501288#39501288).*

*NOTE: the `st_ino` parameter gives the inode of the file. An explanation of inodes is not important here, but note that every file on a Unix system has an inode, which holds the file's metadata.*

As a quick aside to test this, let's create a text file, wait 10 seconds, modify it, wait another 10 seconds, then access it. We can then check these timestamps to see what's different. First, let's define a function to convert Unix timestamps to `YYYY-MM-DD HH:MM:SS` format:

In [13]:
from datetime import timezone

def unix_to_readable_timestamp(timestamp):
    '''
    Converts a Unix timestamp (seconds since 00:00:00 UTC on 1 Jan 1970) to a readable UTC string.
    '''
    
    # convert from Unix to a timezone-aware UTC datetime
    # (datetime.utcfromtimestamp is deprecated since Python 3.12)
    converted = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    # format in readable time
    formatted = converted.strftime('%Y-%m-%d %H:%M:%S')
    
    return formatted

In [14]:
# new file path string
test_text_path = 'test_text_file.txt'

# create file and write data to it
with open(test_text_path, 'w') as f:
    f.write('some data to be written to the file')

print('File created.\n')

# get pathlib reference to file
f = Path(test_text_path)
# get stats
f_stat = f.stat()

# print timestamp
print(f'timestamp: {time.time():10.7f} --> {unix_to_readable_timestamp(time.time())}')
print(f'atime:     {f_stat.st_atime:10.7f} --> {unix_to_readable_timestamp(f_stat.st_atime)}')
print(f'ctime:     {f_stat.st_ctime:10.7f} --> {unix_to_readable_timestamp(f_stat.st_ctime)}')
print(f'mtime:     {f_stat.st_mtime:10.7f} --> {unix_to_readable_timestamp(f_stat.st_mtime)}')

# wait 10 seconds
print('\nModifying file...\n')
time.sleep(10)

# modify file by opening again in append mode ('w' would overwrite the existing contents)
with open(test_text_path, 'a') as f:
    f.write('\nsome more data to be written to the file')

# get pathlib reference to file
f = Path(test_text_path)
# get stats again
f_stat = f.stat()

# print timestamp
print(f'timestamp: {time.time():10.7f} --> {unix_to_readable_timestamp(time.time())}')
print(f'atime:     {f_stat.st_atime:10.7f} --> {unix_to_readable_timestamp(f_stat.st_atime)}')
print(f'ctime:     {f_stat.st_ctime:10.7f} --> {unix_to_readable_timestamp(f_stat.st_ctime)}')
print(f'mtime:     {f_stat.st_mtime:10.7f} --> {unix_to_readable_timestamp(f_stat.st_mtime)}')

# wait 10 seconds
print('\nAccessing file...\n')
time.sleep(10)

# access file by opening again
with open(test_text_path, 'r') as f:
    f.readlines()

# get pathlib reference to file
f = Path(test_text_path)
# get stats again
f_stat = f.stat()

# print timestamp
print(f'timestamp: {time.time():10.7f} --> {unix_to_readable_timestamp(time.time())}')
print(f'atime:     {f_stat.st_atime:10.7f} --> {unix_to_readable_timestamp(f_stat.st_atime)}')
print(f'ctime:     {f_stat.st_ctime:10.7f} --> {unix_to_readable_timestamp(f_stat.st_ctime)}')
print(f'mtime:     {f_stat.st_mtime:10.7f} --> {unix_to_readable_timestamp(f_stat.st_mtime)}')

# delete file
f.unlink()
print('\nFile deleted.')

File created.

timestamp: 1688055416.8685851 --> 2023-06-29 16:16:56
atime:     1688055416.8675902 --> 2023-06-29 16:16:56
ctime:     1688055416.8675902 --> 2023-06-29 16:16:56
mtime:     1688055416.8675902 --> 2023-06-29 16:16:56

Modifying file...

timestamp: 1688055426.8740654 --> 2023-06-29 16:17:06
atime:     1688055426.8731878 --> 2023-06-29 16:17:06
ctime:     1688055416.8675902 --> 2023-06-29 16:16:56
mtime:     1688055426.8731878 --> 2023-06-29 16:17:06

Accessing file...

timestamp: 1688055436.8876927 --> 2023-06-29 16:17:16
atime:     1688055436.8876927 --> 2023-06-29 16:17:16
ctime:     1688055416.8675902 --> 2023-06-29 16:16:56
mtime:     1688055426.8731878 --> 2023-06-29 16:17:06

File deleted.


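As we can see, modifying the file changed `mtime`, and both modifying and accessing the file changed `atime`. Importantly, `ctime` remained constant throughout. This is the behaviour expected on Windows, where this was run; on Unix systems `ctime` would also change when the file is modified.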

A cross-platform implementation of the creation timestamp checking is as follows:

In [15]:
def get_creation_timestamp(path_to_file):
    """
    Try to get the Unix timestamp that a file was created, falling back to when it was
    last modified if that isn't possible.
    See http://stackoverflow.com/a/39501288/1709587 for explanation.
    
    path_to_file: string of the path to the file
    """
    
    # get path variable
    path = Path(path_to_file)
    # get its stats
    statdata = path.stat()
    
    # if windows, simply the ctime
    if platform.system() == 'Windows':
        return statdata.st_ctime
    # if not windows, try Mac method
    else:
        try:
            return statdata.st_birthtime
        except AttributeError:
            # We're probably on Linux. No easy way to get creation timestamps here,
            # so we'll settle for when its content was last modified.
            return statdata.st_mtime

Testing this:

In [16]:
unix_to_readable_timestamp(get_creation_timestamp(str(path)))

'2023-06-22 23:01:46'
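
On newer Python versions the explicit platform check may not be needed. The sketch below is a variant, not the method used in this notebook: it assumes `st_birthtime` is exposed wherever the OS records a creation time (macOS, and Windows from Python 3.12) and otherwise falls back in the same way as above. The name `get_creation_timestamp_v2` is hypothetical.

```python
import platform
from pathlib import Path


def get_creation_timestamp_v2(path_to_file):
    """Hypothetical variant: prefer st_birthtime, then fall back per platform."""
    statdata = Path(path_to_file).stat()

    # st_birthtime is not available everywhere (assumption about the running
    # interpreter and OS), hence getattr with a default instead of direct access
    birthtime = getattr(statdata, 'st_birthtime', None)
    if birthtime is not None:
        return birthtime

    # older Python on Windows: st_ctime holds the creation time
    if platform.system() == 'Windows':
        return statdata.st_ctime

    # probably Linux: settle for the last content modification
    return statdata.st_mtime
```

The return value is a Unix timestamp, so it can be passed to `unix_to_readable_timestamp` in the same way as before.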

**NOTE: Google Drive changes a file's ctime on downloading and uploading. Thus, we need to look at the metadata embedded in the image/video, not the file's filesystem metadata.** This is the *Origins* section in the file properties window on Windows. For images, this is also referred to as the EXIF data. It is more difficult to obtain, so for images we use the `Pillow` package and for videos we use the `hachoir` package. This is all done in the following:

In [92]:
def get_date_image_taken(path_string):
    '''
    Returns the image metadata corresponding to creation date.
    
    path_string: string of the path to the file
    '''
    
    # open image (closing the file afterwards) and read the date from the exif data
    with Image.open(path_string) as image:
        # 306 is the EXIF 'DateTime' tag (this key can be looked up in ExifTags.TAGS);
        # a KeyError is raised if the image does not carry it
        return image.getexif()[306]

def get_date_video_taken(path_string):
    '''
    Returns the video metadata corresponding to media taken date.
    
    path_string: string of the path to the file
    '''
    
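    # create the parser (None if hachoir does not recognise the file)
    parser = createParser(path_string)
    if parser is None:
        return None
    # use it to extract metadata
    with parser:
        metadata = extractMetadata(parser)
    # no metadata could be extracted
    if metadata is None:
        return None
    # pass through each line in the metadata text
    for line in metadata.exportPlaintext():
        # find the creation date line
        if line.split(':')[0] == '- Creation date':
            # format while removing the '- Creation date: ' part
            return line[17:]
    # no creation date line was found
    return None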

# testing both
print(f'Image: {get_date_image_taken(str(path))}')
print(f'Video: {get_date_video_taken(str(vid / r"Snapchat-2084940612.mp4"))}')

Image: 2018:07:03 20:52:13
Video: 2018-02-27 17:56:08
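
Note that the two helpers return dates in different string formats: colons in the date part for EXIF, dashes for `hachoir`. A minimal sketch of normalising both into `datetime` objects follows; the name `parse_taken_string` and the `KNOWN_FORMATS` tuple are assumptions made for illustration.

```python
from datetime import datetime

# formats seen in the outputs above: EXIF uses colons in the date, hachoir uses dashes
KNOWN_FORMATS = ('%Y:%m:%d %H:%M:%S', '%Y-%m-%d %H:%M:%S')


def parse_taken_string(date_string):
    """Hypothetical helper: try each known format and return a datetime."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(date_string, fmt)
        except ValueError:
            # wrong format, try the next one
            continue
    raise ValueError(f'unrecognised date string: {date_string!r}')


print(parse_taken_string('2018:07:03 20:52:13'))
print(parse_taken_string('2018-02-27 17:56:08'))
```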


We can get the filetype by looking at the suffix:

In [18]:
path.suffix

'.jpg'

Check if it is a file:

In [19]:
path.is_file()

True

We can define a function to return whether a path is a directory, image, video or audio by looking at the extension:

In [20]:
def get_path_type(path_string):
    '''
    Returns a string of either 'directory', 'image', 'audio', 'video', or 'other' depending on the file's extension.
    '''
    
    # define list of extensions
    audio_extensions = ['.mp3', '.ogg']
    video_extensions = ['.mp4', '.mkv']
    image_extensions = ['.jpg', '.jpeg', '.png']
    
    # get as a path reference
    path = Path(path_string)
    
    # check if a file
    if path.is_file():
        # get suffix and lower it
        extension = path.suffix.lower()
        # check if audio
        if extension in audio_extensions:
            return 'audio'
        # check if video
        elif extension in video_extensions:
            return 'video'
        # check if image
        elif extension in image_extensions:
            return 'image'
        # otherwise
        else:
            return 'other'
    # is a directory
    else:
        return 'directory'

Trying on the image file:

In [21]:
get_path_type(str(path))

'image'
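
The hard-coded extension lists only cover a handful of formats. As an alternative sketch (not what this notebook uses), the standard-library `mimetypes` module can guess the broad media type from the extension. The name `guess_media_kind` is hypothetical, the function does not check whether the path exists or is a directory, and coverage of less common extensions depends on the platform's MIME tables.

```python
import mimetypes


def guess_media_kind(filename):
    """Hypothetical helper: return 'image', 'audio', 'video' or 'other' from the extension."""
    # guess_type returns a (type, encoding) pair, e.g. ('image/jpeg', None)
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        return 'other'
    # the part before the slash is the broad media category
    kind = mime_type.split('/')[0]
    return kind if kind in ('image', 'audio', 'video') else 'other'


print(guess_media_kind('IMG_20180703_205212.jpg'))
print(guess_media_kind('Snapchat-2084940612.mp4'))
```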

Now, let's check the size of the image file, also from the stat data:

In [22]:
statdata.st_size

3855920

This is the size in bytes. For ease of use, let's define a function that returns the size in kilobytes, megabytes, or gigabytes depending on the file size:

In [23]:
def get_readable_filesize(filesize):
    '''
    Returns the filesize of an object in a readable format as a string depending on the size.
    '''
    
    # get magnitude of size (log10 is undefined for 0, so treat empty files as magnitude 0)
    magnitude = math.log10(filesize) if filesize > 0 else 0
    # floor it
    magnitude = math.floor(magnitude)
    
    # check if fits GB
    if magnitude >= 9:
        # format so that GB magnitude is removed and ceiling to 3 digits
        filesize_format = math.ceil(filesize / 1e6) / 1e3
        # return as string
        filesize_string = f'{filesize_format:0.3f} GB'
    # check if fits MB
    elif magnitude >= 6:
        # format so that MB magnitude is removed and ceiling to 3 digits
        filesize_format = math.ceil(filesize / 1e3) / 1e3
        # return as string
        filesize_string = f'{filesize_format:0.3f} MB'
    # check if fits KB
    elif magnitude >= 3:
        # format so that KB magnitude is removed and ceiling to 3 digits
        filesize_format = math.ceil(filesize) / 1e3
        # return as string
        filesize_string = f'{filesize_format:0.3f} KB'
    else:
        # return as bytes string
        filesize_string = f'{filesize} B'
    
    return filesize_string

Testing this:

In [24]:
get_readable_filesize(statdata.st_size)

'3.856 MB'

### 1.2 Multiple files of various types

In this section we will generalise the functions and operations created/investigated already to work on multiple files. Let's start by importing the filenames we'll use:

In [81]:
filenames = get_filenames(mixed)
filenames

['IMG_20180201_161822 (1).jpg',
 'Screenshot_20171225-113409.png',
 'Snapchat-1225537317.mp4',
 'WIN_20220906_11_22_52_Pro.mp4',
 'received_10204964983279784.jpeg',
 'results_screenshot.jpg']

So we have images and videos of various types; some were created locally on this machine while others were downloaded from Google Drive. First, let's create a function that identifies the file type and then returns the creation date (from EXIF or other metadata). This makes use of the previously defined functions:

- `get_date_image_taken(path_string)`
- `get_date_video_taken(path_string)`
- `get_path_type(path_string)`

In [93]:
def get_date_taken(path_string):
    '''
    Returns the date (YYYYMMDD) the file was taken, as a string, given its file path string.
    '''
        
    # get the filetype
    filetype = get_path_type(path_string)
    
    # decide whether image or other
    if filetype == 'image':
        # try image date method
        try:
            date_taken = get_date_image_taken(path_string)
            # fix format into datetime object
            date_taken = datetime.strptime(date_taken, '%Y:%m:%d %H:%M:%S')
        # if it does not have that info (or the stored date is malformed) we set to ctime
        except (KeyError, ValueError):
            date_taken = unix_to_readable_timestamp(get_creation_timestamp(path_string))
            # fix format into datetime object
            date_taken = datetime.strptime(date_taken, '%Y-%m-%d %H:%M:%S')
    else:
        # try video date method (should work for audio too)
        try:
            date_taken = get_date_video_taken(path_string)
            # fix format into datetime object
            date_taken = datetime.strptime(date_taken, '%Y-%m-%d %H:%M:%S')
        # if it does not have that info we set to ctime
        except Exception:
            date_taken = unix_to_readable_timestamp(get_creation_timestamp(path_string))
            # fix format into datetime object
            date_taken = datetime.strptime(date_taken, '%Y-%m-%d %H:%M:%S')
    
    return date_taken.strftime('%Y%m%d')

Now testing on each file:

In [94]:
# pass through each filename
for filename in filenames:
    # get date of this file (from mixed path)
    date_taken = get_date_taken(str(mixed / filename))
    # print
    print(f'{filename:31} | {date_taken}')

IMG_20180201_161822 (1).jpg     | 20191023
Screenshot_20171225-113409.png  | 20230629
Snapchat-1225537317.mp4         | 20180526
WIN_20220906_11_22_52_Pro.mp4   | 20220906
received_10204964983279784.jpeg | 20230629
results_screenshot.jpg          | 20230629


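This has worked as intended for each file, in that a date is always returned. Note, however, that three of the files report `20230629`, the date this notebook was run, which suggests the fallback to the file's creation timestamp was used. The pitfalls appear to be files which are:

- copied
- taken from WhatsApp
- taken from the old (pre-Huawei P30 Lite) phone

These files do not contain the metadata needed to pinpoint an accurate date.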
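
Several of these filenames embed a date themselves (for example `Screenshot_20171225-113409.png`), so one possible extra fallback is to read the date from the name. The sketch below is an assumption-laden illustration, not part of the notebook's method: `guess_date_from_filename` and `DATE_IN_NAME` are hypothetical names, and a date in a filename is only as trustworthy as whatever app wrote it.

```python
import re
from datetime import datetime

# an 8-digit run that is not part of a longer number,
# e.g. the 20171225 in Screenshot_20171225-113409.png
DATE_IN_NAME = re.compile(r'(?<!\d)(\d{8})(?!\d)')


def guess_date_from_filename(filename):
    """Hypothetical fallback: return a YYYYMMDD string found in the filename, or None."""
    for candidate in DATE_IN_NAME.findall(filename):
        try:
            # strptime rejects impossible dates such as month 13
            datetime.strptime(candidate, '%Y%m%d')
        except ValueError:
            continue
        return candidate
    return None


print(guess_date_from_filename('Screenshot_20171225-113409.png'))   # 20171225
print(guess_date_from_filename('received_10204964983279784.jpeg'))  # None
```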

## 2 Operations on files

In this section we will look at methods to rename files to the desired new filename and how to sort files using the `.txt` method proposed.

### 2.1 Renaming files

The file format we want to introduce is as follows: `YYYYMMDD_XXXXX_filetype.suffix`, where:

- `YYYYMMDD` is as obtained using the `get_date_taken` function
- `XXXXX` counts from `00000` to `99999` depending on the number of files in the current `YYYYMMDD`
- `filetype` is as obtained using the `get_path_type` function
- `.suffix` is obtained from the `path.suffix` attribute
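
The pattern above can be assembled with an f-string. This is a minimal sketch: the helper name `build_new_filename` is hypothetical, and how the counter is tracked per date is left open.

```python
def build_new_filename(date_taken, counter, filetype, suffix):
    """Hypothetical helper: assemble YYYYMMDD_XXXXX_filetype.suffix."""
    if not 0 <= counter <= 99999:
        raise ValueError('counter must be between 0 and 99999')
    # :05d zero-pads the counter to five digits
    return f'{date_taken}_{counter:05d}_{filetype}{suffix}'


print(build_new_filename('20180703', 0, 'image', '.jpg'))  # 20180703_00000_image.jpg
```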

To test this, let's create a `.txt` file and rename it:

In [98]:
# new file path string
test_text_path = 'test_text_file.txt'

# create file and write data to it
with open(test_text_path, 'w') as f:
    f.write('some data to be written to the file')

print('File created.\n')

# print all files in directory
print(f'Files: {get_filenames(cwd)}')

# get pathlib reference to file
f = Path(test_text_path)
# rename the file
f.rename(f'renamed_test_text_file{f.suffix}')

print('\nFile renamed.\n')

# print all files in directory
print(f'Files: {get_filenames(cwd)}')

# get pathlib reference to the "new" file
f = Path(f'renamed_test_text_file{f.suffix}')
# delete file
f.unlink()
print('\nFile deleted.')

File created.

['.git', '.gitignore', '.ipynb_checkpoints', 'FileWorkings.ipynb', 'README.md', 'assets', 'test_text_file.txt', 'venv']

File renamed.

['.git', '.gitignore', '.ipynb_checkpoints', 'FileWorkings.ipynb', 'README.md', 'assets', 'renamed_test_text_file.txt', 'venv']


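Note that the original `Path` instance still points to the old name after renaming, so a `Path` for the new name is needed to keep working with the file (`rename` returns one from Python 3.8). Further, a relative target is interpreted relative to the current working directory rather than the file's own directory, so you must include the path to where the file is to be stored, even if you are keeping it in the same place. From testing, it was found that renaming did not affect the metadata of files.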
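
One way to keep a renamed file in its original directory is `Path.with_name`, which swaps only the final path component. The helper name `rename_in_place` is hypothetical, and the demonstration runs in a throwaway temporary directory rather than the project folder.

```python
import tempfile
from pathlib import Path


def rename_in_place(path, new_name):
    """Hypothetical helper: rename a file without moving it out of its directory."""
    path = Path(path)
    # with_name swaps only the final component, so the parent directory is preserved
    target = path.with_name(new_name)
    # rename returns the new Path (Python 3.8+), so no second Path(...) call is needed
    return path.rename(target)


# demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    original = Path(tmp_dir) / 'test_text_file.txt'
    original.write_text('some data to be written to the file')
    renamed = rename_in_place(original, f'renamed_test_text_file{original.suffix}')
    print(renamed.name, renamed.exists(), original.exists())
```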

### 2.2 Sorting files

The `.txt` method proposed involves:

- Create a text file of the same name as the desired (sorted) "folder"
- Each line in the text file contains the name of the file that is included in this "folder"

With the accompanying app (still to be made), sorting then amounts to loading only the files named in the text file, while each file itself normally resides in the master folder. This allows us to mimic a "folder".

This should be implemented using a class structure, but here we will experiment with the functionality that will be required. First, let's create a function that takes a list of strings of the paths to the files we want to include, and returns a string that will be passed into the `.txt` file:

In [103]:
def filenames_list_to_string(filenames):
    '''
    Returns a single string with one filename per line, given a list of filename strings.
    '''
    
    # join them and separate by a line separator
    filenames_string = '\n'.join(filenames)
    
    return filenames_string

In [105]:
filenames_list_to_string(filenames)

'IMG_20180201_161822 (1).jpg\nScreenshot_20171225-113409.png\nSnapchat-1225537317.mp4\nWIN_20220906_11_22_52_Pro.mp4\nreceived_10204964983279784.jpeg\nresults_screenshot.jpg'

Now, let's create a `.txt` file of the name `jpegs`, and write in it the name of the files we want in this "folder". Then we can read all these lines back into a list:

In [114]:
# create file and write data to it
with open('jpegs.txt', 'w') as f:
    f.write(filenames_list_to_string(get_filenames(mixed, mode='absolute')))

print('File created.')

# read data from file
with open('jpegs.txt', 'r') as f:
    # splitlines removes the trailing \n from every line, including a last line without one
    lines_list = f.read().splitlines()

lines_list

File created.


['C:\\Users\\seani\\Documents\\Projects\\FileManager\\assets\\mixed\\IMG_20180201_161822 (1).jpg',
 'C:\\Users\\seani\\Documents\\Projects\\FileManager\\assets\\mixed\\Screenshot_20171225-113409.png',
 'C:\\Users\\seani\\Documents\\Projects\\FileManager\\assets\\mixed\\Snapchat-1225537317.mp4',
 'C:\\Users\\seani\\Documents\\Projects\\FileManager\\assets\\mixed\\WIN_20220906_11_22_52_Pro.mp4',
 'C:\\Users\\seani\\Documents\\Projects\\FileManager\\assets\\mixed\\received_10204964983279784.jpeg',
 'C:\\Users\\seani\\Documents\\Projects\\FileManager\\assets\\mixed\\results_screenshot.jpg']
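
As mentioned above, this would eventually live in a class structure. The following is only a rough sketch of what such a class might look like; the class name `TextFolder` and its methods are assumptions for illustration, not a settled design.

```python
from pathlib import Path


class TextFolder:
    """Hypothetical sketch of a "folder" backed by a .txt file, one path per line."""

    def __init__(self, txt_path):
        self.txt_path = Path(txt_path)

    def save(self, filenames):
        # one entry per line, the same layout filenames_list_to_string produces
        self.txt_path.write_text('\n'.join(filenames), encoding='utf-8')

    def load(self):
        # splitlines drops the line endings and copes with an empty file
        return self.txt_path.read_text(encoding='utf-8').splitlines()

    def add(self, filename):
        # avoid duplicates so a file appears in the "folder" only once
        entries = self.load() if self.txt_path.exists() else []
        if filename not in entries:
            entries.append(filename)
            self.save(entries)
```

With this sketch, `TextFolder('jpegs.txt').load()` would return the same kind of list as `lines_list` above.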