# Secret Archive
Just read the README.

This Notebook is basically my thought process while making this.

**Day 1**

Now, first off. Compress first and then encrypt or vice versa?

Compression exploits statistical redundancy: if a chunk of data keeps popping up here and there, it compresses better. Encryption, on the other hand, avoids repeated data, since repetition can weaken the encryption.

For example, we know that the letter 'e' is the most used letter in English. So if a chunk keeps reappearing in data encrypted with a weak cipher (say, a simple substitution cipher), it is likely to be the letter 'e'. With this kind of frequency analysis, we can recover most of the original text.

Essentially, encryption makes the data look random, and compression likes predictability. So, for best results, we compress first, and then encrypt.
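
Here's a rough sketch of why the order matters. Random bytes from `os.urandom` stand in for encrypted data (that's an assumption for illustration, I'm not running real PGP output through it), and gzip is tried on both inputs.

```python
import gzip
import os

# Highly repetitive input: the kind of data compression likes.
repetitive = b"happy birthday " * 1000

# Random bytes as a stand-in for well-encrypted data (assumption for illustration).
random_like = os.urandom(len(repetitive))

compressed_repetitive = gzip.compress(repetitive)
compressed_random = gzip.compress(random_like)

print(len(repetitive), "->", len(compressed_repetitive))  # shrinks a lot
print(len(random_like), "->", len(compressed_random))     # does not shrink
```

If we encrypted first, the compressor would be handed something much closer to the second input.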

All right. So we compress, then encrypt, and then add the file name and new encrypted file name to a registry. When we want to access this file, we grab the encrypted file name, decrypt, decompress, and then we have our original file back.

I just hope the compression is worth it. If we end up with a much larger encrypted and so-called compressed file at the end, there is no use of the compression. I'll have to figure out a way to ensure that the compression actually makes the file smaller and not bigger.
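
One possible way to guarantee the compressed file is never bigger, as a minimal sketch: compress, compare sizes, and keep the original bytes when gzip doesn't help. The helper names `compress_if_smaller` and `restore` are hypothetical, and the flag would have to be remembered somewhere (for example in the registry), which isn't designed yet.

```python
import gzip
from typing import Tuple


def compress_if_smaller(data: bytes) -> Tuple[bool, bytes]:
    """Return (was_compressed, payload), keeping the original bytes when gzip does not help."""
    compressed = gzip.compress(data)
    if len(compressed) < len(data):
        return True, compressed
    return False, data


def restore(was_compressed: bool, payload: bytes) -> bytes:
    """Undo compress_if_smaller using the stored flag."""
    return gzip.decompress(payload) if was_compressed else payload


flag, payload = compress_if_smaller(b"party plan " * 500)
print(flag, len(payload))

flag_tiny, payload_tiny = compress_if_smaller(b"tiny")
print(flag_tiny, len(payload_tiny))
```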

But, that is for later. I first need to build a working prototype.

So, which compression and encryption algorithms should I use?

Obviously, we need a lossless compression algorithm. If the result is a garbled mess, there is no point in this "archive". We also want the archive to work on any data. So, it must be a general purpose algorithm that runs fast, and is lossless. Also, I want the algorithm to be patent free. I want anyone to be able to use it.

So, we need to find a compression algorithm that need not give extreme compression rates, but needs to be fast. We don't want to wait hours for a super secret birthday party plan word document to compress.

For now, putting speed on top, gzip is probably the way to go. It's fast, it's got a good compression ratio in general cases, and needs a low amount of memory. Maybe later, I can add lzma or bzip2 to this?
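
To get a rough feel for the trade-off on my own machine, something like this sketch could compare the three candidates using only the standard library (`gzip`, `bz2`, `lzma`). The sample data is made up and very repetitive, so the numbers won't say much about real files.

```python
import bz2
import gzip
import lzma
import time

# Made-up, very repetitive sample data (assumption for illustration).
sample = b"super secret birthday party plan\n" * 20000

sizes = {}
for name, module in (("gzip", gzip), ("bz2", bz2), ("lzma", lzma)):
    start = time.perf_counter()
    packed = module.compress(sample)
    elapsed = time.perf_counter() - start
    # Every candidate must be lossless.
    assert module.decompress(packed) == sample
    sizes[name] = len(packed)
    print(f"{name}: {len(sample)} -> {len(packed)} bytes in {elapsed:.4f}s")
```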

The source I'm using for the compression algorithm choice is [this benchmark.](https://tukaani.org/lzma/benchmarks.html)

Now, encryption standard.

I'll just use PGP. It's a widely accepted standard.

Now, we need to worry about keys.

The keys will be generated using the email as user@computer, and the password will be the master password for the archive. I may move to per-file passwords later, but that'll be a lot of keys.

So, now we have a private and a public key. The private key will be given to the user, and the user will store it somewhere safe (like a thumb drive). The user can of course store the private key on the same drive, but that's just security by obscurity. I'm assuming the person who wants to see your files will also have access to your drive. If the private key is on the same drive and the attacker is tech savvy enough, the files are as good as unencrypted.

The public key, of course will stay where the vault is. It'll be accessible by the program directly.

So, we now have a workflow.

The program starts and asks the user to set a master password, which is used to generate the PGP public and private keys along with the username of the user running the program and the machine name. We then ask the user to store the private key somewhere safe (like a thumb drive), and the public key is stored wherever the vault of encrypted and compressed files is. Now, the user has created a secure vault.

The user selects a file to store, which is first compressed using gzip, giving a compressed gz file, which is then encrypted using the public key. The encrypted and compressed file is stored in the vault with no file extension, and the original name of the file along with the original extension is added to a registry.

The encrypted file name will just be the same as the original file name without the extension. If an encrypted file with that name already exists, we ask the user if they want to replace the old encrypted file or name the new one file1 or something like that.

For example, if file.png is added to the vault, the vault will contain a processed file named file, and the registry will record that file.png maps to file. If we then add file.jpg, there is already a file named file in the vault, so we ask the user to either replace or rename. If they replace, the old file in the vault is removed and the new processed file is added; the registry updates to show that file.jpg now maps to file, and that file.png is no longer stored. If the new file is instead renamed to file1, the vault will contain both file and file1, and the registry shows that file.png maps to file and file.jpg maps to file1.
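
The replace-or-rename rule can be sketched with a plain dictionary. This is only an illustration: `add_to_vault_registry` is a hypothetical helper, the `replace` argument stands in for the user's answer, and no files are touched.

```python
def add_to_vault_registry(registry, filename, replace=False):
    """Map an original file name to its stored name (hypothetical helper)."""
    stem = filename.rsplit(".", 1)[0]
    taken = set(registry.values())
    if stem in taken and replace:
        # Drop whichever original currently owns this stored name.
        for original in [k for k, v in registry.items() if v == stem]:
            del registry[original]
    elif stem in taken:
        # Rename: file -> file1, file2, ... until the name is free.
        counter = 1
        while f"{stem}{counter}" in taken:
            counter += 1
        stem = f"{stem}{counter}"
    registry[filename] = stem
    return stem


renamed = {}
add_to_vault_registry(renamed, "file.png")
add_to_vault_registry(renamed, "file.jpg")
print(renamed)   # {'file.png': 'file', 'file.jpg': 'file1'}

replaced = {}
add_to_vault_registry(replaced, "file.png")
add_to_vault_registry(replaced, "file.jpg", replace=True)
print(replaced)  # {'file.jpg': 'file'}
```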

Now, the user can continue to add more files to the vault or quit or open a file from the vault.

If the user wants to open a file from the vault, we list the files in vault, and their original file mappings using the registry, and the user chooses one file. We then ask the user to provide the private key. The private key is then used to decrypt the specified file, which is then decompressed, and the correct file extension is applied, and then the final data is ready.

That is quite the project. Let's get to work!

First, I want to get the MVP (Minimum Viable Product) ready. I want it to be able to compress, and encrypt files. I'll implement that first. I'll polish the UI later.

So, we need:
- [gzip](https://docs.python.org/3/library/gzip.html)
- [gnupg](https://pypi.org/project/python-gnupg/)

gzip is already in the Python Standard Library, so no need to add that to requirements.
We will need to install gnupg though.

I'm not gonna have any input in the initial prototype. The file path will be a variable. The registry will be like this:
```
"myfile.txt" -> "myfile"
"myfile.png" -> "myfile1"
"myfileabc.mp3" -> "myfileabc"

```

For now, that's going to be the format. I might move to JSON later. For today, I probably won't get all that far, and I'll just be able to get the final output file. I won't be able to implement the registry today.

In [65]:
import gzip
import gnupg
import shutil

fileloc = "/home/soumitradev/Desktop/Code/secret archive/"
filename = "img.png"
vaultloc = fileloc + "vault/"

filepath = fileloc + filename
fileid = filename.split(".")[0]
ext = filename.split(".")[1]

with open(filepath, 'rb') as f_in:
    with gzip.open(vaultloc + fileid, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
        
gpg = gnupg.GPG()
gpg.encoding = "utf-8"


key = gpg.gen_key(gpg.gen_key_input())
fp = key.fingerprint

with open(vaultloc + fileid, 'rb') as comp_in:
    encrypted_ascii_data = gpg.encrypt_file(comp_in, fp, output = vaultloc + fileid + "_enc")

with open(vaultloc + fileid + "_enc", 'rb') as dec:
    decrypted_data = gpg.decrypt_file(dec, output = vaultloc + fileid + "_decrypt." + ext + ".gz")
    
with gzip.open(vaultloc + fileid + "_decrypt." + ext + ".gz", "rb") as f:
    # Use f_out here so the key fingerprint stored in fp is not overwritten
    with open(vaultloc + fileid + "_fin." + ext, "wb") as f_out:
        shutil.copyfileobj(f, f_out)

Yay! MVP is done! The compression and encryption are working! However, the file sizes are not the best. A 19-byte file became 577 bytes after processing. A 31MB pdf became around 40MB after processing.

The compression is *really* bad. The 31MB pdf became 29MB, and the 19-byte file became 44 bytes after so-called compression.
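
The tiny-file result is at least partly explained by the gzip container itself: the header and trailer take at least 18 bytes before any data is stored. A quick sketch with a made-up 19-byte input (not the file I actually used):

```python
import gzip

tiny = b"hello secret world!"  # 19 bytes, made-up sample
packed = gzip.compress(tiny)

# The gzip header (10 bytes minimum) and trailer (8 bytes) are fixed overhead.
print(len(tiny), "->", len(packed))
```

For inputs this small, the overhead is bigger than anything the compressor can save.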

A 20.04 Kubuntu iso (2.4GB) I had lying around took around 2 minutes 51 seconds to compress to a size of 2.3GB. Then, it took another 2 minutes 39 seconds to encrypt, becoming a final size of 3.1GB. Then, it took 1 minute 12 seconds to decrypt, and then less than 30 seconds to decompress.
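
One likely contributor to the growth after encryption (an assumption on my part, not something I measured): python-gnupg's `encrypt_file` writes ASCII-armored output by default, and armor is base64 text, which turns every 3 bytes into 4 characters plus line breaks. The sketch below only uses the standard `base64` module to show that ratio. Passing `armor=False` to `encrypt_file` should give binary output instead, but that is untested here.

```python
import base64
import os

# Random bytes as a stand-in for a compressed, encrypted payload (assumption).
binary = os.urandom(3_000_000)

# base64 with line breaks, similar in spirit to ASCII armor.
armored = base64.encodebytes(binary)

ratio = len(armored) / len(binary)
print(f"{len(binary)} -> {len(armored)} bytes (x{ratio:.2f})")
```

That ratio is close to the 2.3GB to 3.1GB jump above.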

I tested if not compressing it made any difference. It does, but it's a very marginal one. One that can be noticed only after passing GBs of data. I guess it's only helpful when the data is very redundant.

So I made a completely white image in Photoshop, and the .psd file was 1MB. The png I saved without any compression came out to be 33.2MB. It was a white image of 4K resolution. The psd compressed to 5.7KB, and the final encrypted file was 4.5KB! This was very fast too. The png compressed to 43.9KB, and the final encrypted file came out to be 15.7KB!

So this only helps greatly if the data is extremely redundant.

I monitored my CPU and RAM during all these, and looks like this only runs on 1 core. We can speed it up by using multithreading. That is for later though. RAM usage is low as expected.
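
A minimal sketch of what that could look like when several files are processed at once, using `concurrent.futures.ThreadPoolExecutor`. As far as I know, CPython's zlib releases the GIL while compressing, so threads may help here, but I haven't measured it. `compress_many` is a hypothetical helper working on in-memory bytes, not on the vault files.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def compress_many(blobs, workers=4):
    """Compress several independent byte strings concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gzip.compress, blobs))


# Eight made-up payloads standing in for eight files.
blobs = [bytes([i]) * 200_000 for i in range(8)]
packed = compress_many(blobs)
print([len(p) for p in packed])
```

This only parallelises across files; speeding up a single large file would need chunking, which is a different problem.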

Also, I performed all these tasks on a 512GB SSD (on a 141GB Ubuntu 20.04 partition with no swap if that matters). The source and output files were on the same partition. Since this program uses a lot of disk resources, this probably matters.

Overall, the MVP is kinda dope! I just need to package this code in some functions, and add some UI elements along with error handling.

For now, I'm kinda struggling with the encryption part. I do understand that the public key is being used for encryption, but I kinda find it hard to understand why the keys are stored like this on my system: It is in `/home/soumitradev/.gnupg/`, but the file structure is one thing I don't get. The whole keyring keychain thing is one thing I don't get. Where are the .pub files?

Whether this will work for multiple files or not is a question. Will I mess up the key management? Plus, the whole registry thing is kinda daunting.

But for now, I'll just have to start moving this to the console and packaging the code into functions and classes. I'll go do something else now. Day 1 was productive!

**Day 2**

I have an exam coming up, so I don't think I can spend too much time on this today, but I'll try my best to package this code in a bunch of functions, and get it running today.

I want this to be cross platform, so I decided I'll be using the `pathlib` module to reference locations from now on. 

I really want to simplify this, so I decided we'll use relative paths to reference file locations from here on. So, I will create another folder called `import` which will be a place to put the files you want to store in the vault.
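
A small sketch of the `pathlib` pieces this relies on: joining with `/`, and `stem` / `suffix` for splitting names. The folder names match the ones above; nothing is read from disk.

```python
from pathlib import Path

import_dir = Path("import")
vault_dir = Path("vault")

source = import_dir / "img.png"
print(source.stem)    # img
print(source.suffix)  # .png

# The stored file keeps the name but drops the extension.
stored = vault_dir / source.stem
print(stored.as_posix())  # vault/img

# stem only strips the last suffix, which matters for names like file.jpg.gz
print(Path("file.jpg.gz").stem)  # file.jpg
```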

In [16]:
import gzip
import gnupg
import shutil
from pathlib import Path


def compress(filename):
    datapath = Path("./import/"  + filename)
    vaultpath = Path("./vault/")
    if datapath.exists():
        with open(datapath, 'rb') as f_in:
            with gzip.open(vaultpath / datapath.stem, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
    else:
        raise IOError("No such file! Please enter a valid filename, or ensure the file is in the 'import' folder")

def encrypt(filename):
    vaultpath = Path("./vault/")
    datapath = vaultpath / Path(filename)
    if datapath.exists():
        gpg = gnupg.GPG()
        gpg.encoding = "utf-8"


        key = gpg.gen_key(gpg.gen_key_input())
        fp = key.fingerprint

        with open(vaultpath / datapath.stem, 'rb') as comp_in:
            encrypted_ascii_data = gpg.encrypt_file(comp_in, fp, output = "./vault/" + datapath.stem + "_enc")
    
    else:
        raise IOError("No such file! Please enter a valid filename, or ensure the file is in the 'vault' folder")
        
def decrypt(filename, ext):
    gpg = gnupg.GPG()
    gpg.encoding = "utf-8"
    vaultpath = Path("./vault/")
    datapath = vaultpath / Path(filename)
    if datapath.exists():
        with open(datapath, 'rb') as dec:
            decrypted_data = gpg.decrypt_file(dec, output = "./vault/" + datapath.stem[:-4] + "." + ext + ".gz")
    else:
        raise IOError("No such file! Please enter a valid filename, or ensure the file is in the 'vault' folder")
        
def decompress(filename):
    vaultpath = Path("./vault/")
    datapath = vaultpath / Path(filename)
    original_extension = datapath.stem.split(".")[-1]
    name = datapath.stem.split(".")[0]
    if datapath.exists():
        with gzip.open(datapath, "rb") as f:
            with open(vaultpath / (name + "_fin." + original_extension), "wb") as fp:
                shutil.copyfileobj(f, fp)
    else:
        raise IOError("No such file! Please enter a valid filename, or ensure the file is in the 'vault' folder")
    
compress("file.jpg")
encrypt("file")
decrypt("file_enc", "jpg")
decompress("file.jpg.gz")

All right! So I got the code separated neatly, now I just need to change some naming conventions, and add the registry thing. I'll probably do that tomorrow. Right now, the decrypt function needs the extension of the files, which of course will be added in the registry. I'll have to design a master flow that uses the registry in conjunction with the 4 functions to complete the code.

I'll start working on this tomorrow. I have an exam in like 1 week, so I don't have a lot of time to code. I will try my best to do quality work on this though.

I really want to make the final files much smaller, since right now for standard files, the archive output is larger than the original files, and that's not how compression is supposed to work. I could switch to a better compression algorithm, but that'll need me to redo the compress and decompress functions, and might significantly slow down compression. I guess I'll have to use multithreading for that.

Basically, I *could* make the files smaller, but I don't know how much of a difference that will make, and if that's worth the time. I might switch to a slower algorithm, just so the files are smaller, but that will probably need higher end workstations. Originally, I wanted this to be fast, and use low memory, but that's not really doing a good job.

Maybe I can add different algorithms and have the user choose between them? This seems to be the best way out.
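
A minimal sketch of how that choice could be wired up: a lookup table from algorithm name to its compress/decompress pair, all from the standard library. `ALGORITHMS`, `compress_with` and `decompress_with` are hypothetical names, and the chosen algorithm would have to be remembered per file (probably in the registry).

```python
import bz2
import gzip
import lzma

# Algorithm name -> (compress, decompress); all three are in the standard library.
ALGORITHMS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def compress_with(algorithm, data):
    """Compress data with the algorithm the user picked."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"Unknown algorithm: {algorithm!r}")
    return ALGORITHMS[algorithm][0](data)


def decompress_with(algorithm, data):
    """Decompress data that was compressed with the same algorithm."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"Unknown algorithm: {algorithm!r}")
    return ALGORITHMS[algorithm][1](data)


original = b"birthday plan " * 100
for choice in ALGORITHMS:
    packed = compress_with(choice, original)
    print(choice, len(original), "->", len(packed))
```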

Anyway, before I add more algorithms, I will have to finish the first finished version that uses CLI, and works well.

I'll work on the registry tomorrow.

Update: Found some more time, so now I'm back to coding.

I don't want to write my own format for the registry, and JSON is just easier, so I'll use that.

Our registry needs to be small and simple: It'll contain the following:
- File extension (only extension of original file)
- Final name (name of encrypted and compressed file without extension)

We will also show when the files were last modified. This will be stored in the metadata of the files themselves, so we won't store this in the registry.
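
Here's a rough sketch of how the listing could combine the two: the extension comes from the registry, and the last-modified time comes from the file itself via `Path.stat()`. `list_vault` is a hypothetical helper, and the demo runs in a temporary directory with made-up entries.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def list_vault(registry, vault_dir):
    """Return (stored name, original extension, last-modified time) for files present in the vault."""
    rows = []
    for stored_name, extension in registry.items():
        path = Path(vault_dir) / stored_name
        if path.exists():
            modified = datetime.fromtimestamp(path.stat().st_mtime)
            rows.append((stored_name, extension, modified))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    # One stored file exists, the other registry entry has no file behind it.
    (Path(tmp) / "myfile").write_bytes(b"encrypted bytes")
    registry = json.loads('{"myfile": "txt", "gone": "png"}')
    rows = list_vault(registry, tmp)

for stored_name, extension, modified in rows:
    print(f"{stored_name} (.{extension}) last modified {modified:%Y-%m-%d %H:%M}")
```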

In [None]:
import json
import re
from pathlib import Path

# Our registry will be in the form of a dictionary. final name is key, and original extension is value.
def addtoreg(originalext, final):
    registry = {}
    regpath = Path("./registry.json")
    
#     If registry file exists,
    if regpath.exists():
#         Load the registry into the variable (json works on text, so open in text mode)
        with open(regpath, "r") as regfile:
            registry = json.load(regfile)
#         If such an encrypted file doesn't already exist, add it to registry and write registry to disk
        if not (final in registry):
            with open(regpath, "w") as regfile:
                registry[final] = originalext
                regfile.write(json.dumps(registry))
                
#         If such an encrypted file exists, ask if user wants to rename it.
        else:
#         Nice little code block that finds the new name for the renamed file
#         If the file name ends in a number enclosed in brackets, start counting from that number
            match = re.fullmatch(r"(.*)\((\d+)\)", final)
            if match:
                name, number = match.group(1), int(match.group(2))
#         Else, start from ' (1)' at the end of the file name
            else:
                name, number = final + " ", 0
#         Increment the number until no such file exists in the registry
            newname = final
            while (newname in registry):
                number += 1
                newname = name + "(" + str(number) + ")"
            
#             Print error and input prompt
            print("Another encrypted file with the same name already exists. It has the extension ." + registry[final] + " Do you want to:\n    [1] - Replace the existing encrypted file\n    [2] - Save the new encrypted file as " + newname)
            k = input()
        
#             Validate input
            while not (k in ["1", "2"]):
                print("\nPlease enter valid input.\nAnother encrypted file with the same name already exists. It has the extension ." + registry[final] + " Do you want to:\n    [1] - Replace the existing encrypted file\n    [2] - Save the new encrypted file as " + newname)
                k = input()
            
#             Perform action as per input
            if k == "1":
                registry[final] = originalext
            else:
                registry[newname] = originalext
                
#             Write registry to disk
            with open(regpath, "w") as regfile:
                regfile.write(json.dumps(registry))
#     If registry file doesn't exist, add the file to registry and write it to a new registry file.
    else:
        with open(regpath, "w") as regfile:
            registry[final] = originalext
            regfile.write(json.dumps(registry))

I gotta say... That was kinda challenging and fun. I had to map out the possibilities for naming errors, and I even learnt a bit of regex! Now I just need to add some more registry utilities, like updating the registry, and removing entries if the encrypted files are no longer in the vault. I also need to add error handling in case the extensions are not correct or something goes wrong during compression / encryption / decryption / decompression.
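
The "remove entries whose files are gone" utility could look something like this sketch. `prune_registry` is a hypothetical name, it assumes the registry format above (stored name as key, original extension as value), and the demo uses a temporary directory instead of the real vault.

```python
import json
import tempfile
from pathlib import Path


def prune_registry(regpath, vault_dir):
    """Drop registry entries whose stored file is missing; return the removed names."""
    regpath = Path(regpath)
    registry = json.loads(regpath.read_text())
    missing = [name for name in registry if not (Path(vault_dir) / name).exists()]
    for name in missing:
        del registry[name]
    regpath.write_text(json.dumps(registry))
    return missing


with tempfile.TemporaryDirectory() as tmp:
    vault = Path(tmp) / "vault"
    vault.mkdir()
    (vault / "myfile").write_bytes(b"encrypted bytes")

    regpath = Path(tmp) / "registry.json"
    regpath.write_text(json.dumps({"myfile": "txt", "myfile1": "png"}))

    removed = prune_registry(regpath, vault)
    remaining = json.loads(regpath.read_text())

print(removed)    # ['myfile1']
print(remaining)  # {'myfile': 'txt'}
```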