# Secret Archive
Just read the README.

This Notebook is basically my thought process while making this.

**Day 1**

Now, first off. Compress first and then encrypt or vice versa?

Compression relies on statistical redundancy. If a chunk of data keeps popping up here and there, the data compresses better. In encryption, we want to avoid repeating patterns in the output, since they can weaken the encryption.

For example, we know that the letter 'e' is the most used letter in English. So, if one chunk keeps reappearing in the encrypted data, it is likely to stand for the letter 'e'. With enough guesses like this, we can recover most of the original text.
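To make the 'e' argument concrete, here is a small sketch. It uses a Caesar shift as a toy stand-in for a weak cipher (PGP does not work like this), and the sample sentence and the `caesar` helper are made up purely for illustration.

```python
from collections import Counter
import string


def caesar(text, shift):
    """Shift every lowercase letter by `shift` places; leave the rest alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical message; 'e' is its most frequent letter.
plaintext = "meet me here when the evening bell rings near the seventh street entrance"
ciphertext = caesar(plaintext, 7)

# The attacker only sees the ciphertext and counts letters.
most_common = Counter(c for c in ciphertext if c.isalpha()).most_common(1)[0][0]

# Guess that the most frequent ciphertext letter stands for 'e'.
guessed_shift = (ord(most_common) - ord("e")) % 26
recovered = caesar(ciphertext, -guessed_shift)

print(guessed_shift)
print(recovered)
```

One repeated pattern is enough to undo this toy cipher, which is exactly why real encryption tries to leave no such patterns behind.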

Essentially, good encryption makes the data look random, and compression likes predictability. If we encrypted first, the compressor would have nothing left to work with. So, for best results, we compress first, and then encrypt.
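A quick way to see this is to gzip some repetitive text and some random bytes of the same length. This is only a sketch: `os.urandom` is used as a stand-in for encrypted output (an assumption, since well-encrypted data should look similar to random bytes), and the sample text is made up.

```python
import gzip
import os

# Repetitive data: lots of predictable structure.
repetitive = b"super secret birthday party plan " * 1000

# Random bytes of the same length, standing in for encrypted data.
random_like = os.urandom(len(repetitive))

packed_text = gzip.compress(repetitive)
packed_random = gzip.compress(random_like)

print(len(repetitive), "->", len(packed_text))
print(len(random_like), "->", len(packed_random))
```

The repetitive input shrinks a lot, while the random input does not shrink at all and picks up a few bytes of gzip overhead.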

All right. So we compress, then encrypt, and then add the file name and new encrypted file name to a registry. When we want to access this file, we grab the encrypted file name, decrypt, decompress, and then we have our original file back.

I just hope the compression is worth it. If we end up with a so-called compressed and encrypted file that is larger than the original, the compression is pointless. I'll have to figure out a way to ensure that the compression actually makes the file smaller and not bigger.
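One possible way to handle this, sketched below, is to compress, compare sizes, and fall back to the raw data when gzip does not help. The `compress_if_smaller` helper and its returned flag are hypothetical; the registry would need to remember the flag so that the file is only decompressed when it was actually compressed.

```python
import gzip
import os
from typing import Tuple


def compress_if_smaller(data: bytes) -> Tuple[bytes, bool]:
    """Return (payload, was_compressed), keeping the smaller of the two forms."""
    packed = gzip.compress(data)
    if len(packed) < len(data):
        return packed, True
    # gzip made it bigger (or no smaller), so store the data as-is.
    return data, False


payload, was_compressed = compress_if_smaller(b"a" * 1000)
print(len(payload), was_compressed)

payload, was_compressed = compress_if_smaller(os.urandom(1000))
print(len(payload), was_compressed)
```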

But, that is for later. I first need to build a working prototype.

So, which compression and encryption algorithms should I use?

Obviously, we need a lossless compression algorithm. If the result is a garbled mess, there is no point in this "archive". We also want the archive to work on any data. So, it must be a general-purpose algorithm that is fast and lossless. Also, I want the algorithm to be patent-free, so that anyone can use it.

So, we need to find a compression algorithm that need not give extreme compression ratios, but needs to be fast. We don't want to wait hours for a super secret birthday party plan word document to compress.

For now, putting speed on top, gzip is probably the way to go. It's fast, it's got a good compression ratio in general cases, and needs a low amount of memory. Maybe later, I can add lzma or bzip2 to this?

The source I'm using for the compression algorithm choice is [this benchmark](https://tukaani.org/lzma/benchmarks.html).
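Since gzip, bzip2 and lzma all ship with the Python Standard Library, a rough local comparison is easy to sketch. The `compare_codecs` helper and the sample data are made up for illustration, and the numbers depend heavily on the input and the machine, so this is no substitute for a proper benchmark.

```python
import bz2
import gzip
import lzma
import time


def compare_codecs(data: bytes) -> dict:
    """Compress `data` with each codec and record output size and time taken."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Lossless check: we must get the exact input back.
        assert decompress(packed) == data
        results[name] = {"bytes": len(packed), "seconds": elapsed}
    return results


sample = b"super secret birthday party plan\n" * 5000
for name, stats in compare_codecs(sample).items():
    print(name, stats["bytes"], "bytes in", round(stats["seconds"], 4), "s")
```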

Now, encryption standard.

I'll just use PGP. It's a widely accepted standard.

Now, we need to worry about keys.

The keys will be generated using the email as user@computer, and the password will be the master password for the archive. I may move to per-file passwords later, but that'll be a lot of keys.
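As a rough sketch, the user@computer identity could be put together like this. The `key_params` helper is hypothetical; `name_email` and `passphrase` are keyword arguments that python-gnupg's `gen_key_input` accepts, so the resulting dict could later be passed on as `gpg.gen_key_input(**params)`.

```python
import getpass
import socket


def key_params(master_password, user=None, host=None):
    """Build key-generation parameters from the user name and machine name."""
    user = user or getpass.getuser()      # user running the program
    host = host or socket.gethostname()   # machine name
    return {
        "name_email": f"{user}@{host}",
        "passphrase": master_password,
    }


# Explicit example values; leave user/host out to detect them automatically.
params = key_params("example master password", user="alice", host="laptop")
print(params["name_email"])
```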

So, now we have a private and a public key. The private key will be given to the user, and the user will store it somewhere safe (like a thumb drive). The user can of course store the private key on the same drive, but that's just security by obscurity. I'm assuming the person who wants to see your files will also have access to your drive. If the private key is on the same drive and the attacker is tech savvy enough, the files are as good as unencrypted.

The public key, of course, will stay where the vault is. It'll be accessible by the program directly.

So, we now have a workflow.

The program starts and asks the user to set a master password, which is used to generate the PGP public and private keys along with the username of the user running the program and the machine name. We will then ask the user to store the private key somewhere safe (like a thumb drive), and the public key is stored wherever the vault of encrypted and compressed files is. Now, the user has created a secure vault.

The user selects a file to store, which is first compressed using gzip. This gives us a compressed gz file, which is then encrypted using the public key. The encrypted and compressed file is stored in the vault, with no file extension, and the original name of the file along with the original extension is added to a registry.

The encrypted file name will just be the same as the original file name without the extension. If an encrypted file with that name already exists, we ask the user if they want to replace the old encrypted file or name the new one file1 or something like that.

For example, if file.png is added to the vault, after processing the vault will contain a file named file. The registry will record that file.png maps to file. If we then add file.jpg, there is already a file in the vault, so we ask the user to either replace or rename. If they replace, the old file in the vault is removed, and the new processed file is added. The registry will update to show that file.jpg now maps to file, and that file.png is no longer stored. If the new file is instead renamed to file1, the vault will then contain file1 and file. The registry shows that file.png maps to file, and file.jpg maps to file1.
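The replace-or-rename rule can be sketched with a plain dict that maps original file names to vault names. The `add_to_registry` function and its `replace` flag are hypothetical stand-ins for the user's answer, and the sketch does not touch any files on disk.

```python
import os


def add_to_registry(registry, filename, replace=False):
    """Record `filename` in the registry and return the vault name it maps to."""
    base = os.path.splitext(filename)[0]
    taken = set(registry.values())
    if base in taken and replace:
        # Drop whichever original file currently owns this vault name.
        for original in [k for k, v in registry.items() if v == base]:
            del registry[original]
    elif base in taken:
        # Rename: try file1, file2, ... until a free vault name is found.
        n = 1
        while f"{base}{n}" in taken:
            n += 1
        base = f"{base}{n}"
    registry[filename] = base
    return base


registry = {}
add_to_registry(registry, "file.png")
add_to_registry(registry, "file.jpg")   # user chose "rename"
print(registry)
```

Adding the same original name twice is not handled here; a real version would have to decide what that should mean.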

Now, the user can continue to add more files to the vault or quit or open a file from the vault.

If the user wants to open a file from the vault, we list the files in the vault along with their original file names from the registry, and the user chooses one. We then ask the user to provide the private key. The private key is used to decrypt the chosen file, which is then decompressed and given back its original file extension. Then the final data is ready.

That is quite the project. Let's get to work!

First, I want to get the MVP (Minimum Viable Product) ready. I want it to be able to compress, and encrypt files. I'll implement that first. I'll polish the UI later.

So, we need:
- [gzip](https://docs.python.org/3/library/gzip.html)
- [gnupg](https://pypi.org/project/python-gnupg/)

gzip is already in the Python Standard Library, so no need to add that to requirements.
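We will need to install python-gnupg though. It is a wrapper around the gpg program, so gpg itself also has to be installed on the system.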

I'm not gonna have any input in the initial prototype. The file path will be a variable. The registry will be like this:
```
"myfile.txt" -> "myfile"
"myfile.png" -> "myfile1"
"myfileabc.mp3" -> "myfileabc"

```

For now, that's going to be the format. I might move to JSON later. For today, I probably won't get all that far, and I'll just be able to get the final output file. I won't be able to implement the registry today.
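The registry is not part of today's prototype, but as a rough sketch, reading and writing that text format could look something like this. The `parse_registry` and `format_registry` helpers are hypothetical, and the last line shows what a later move to JSON might look like.

```python
import json
import re

# One registry line looks like: "myfile.txt" -> "myfile"
LINE = re.compile(r'^"(?P<original>.+)" -> "(?P<stored>.+)"$')


def parse_registry(text):
    """Turn the text format into a dict of original name -> vault name."""
    registry = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:  # skip blank or malformed lines
            registry[match["original"]] = match["stored"]
    return registry


def format_registry(registry):
    """Turn the dict back into the text format, one mapping per line."""
    return "\n".join(f'"{o}" -> "{s}"' for o, s in registry.items()) + "\n"


text = (
    '"myfile.txt" -> "myfile"\n'
    '"myfile.png" -> "myfile1"\n'
    '"myfileabc.mp3" -> "myfileabc"\n'
)
registry = parse_registry(text)
as_json = json.dumps(registry, indent=2)
print(as_json)
```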

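```python
import gzip
import os
import shutil

import gnupg

fileloc = "/home/soumitradev/Desktop/Code/secret archive/"
filename = "file.txt"
vaultloc = os.path.join(fileloc, "vault")

filepath = os.path.join(fileloc, filename)
# splitext splits on the last dot only, so names like "a.b.txt" also work.
# ext keeps its leading dot (".txt").
fileid, ext = os.path.splitext(filename)

# Make sure the vault directory exists before writing into it.
os.makedirs(vaultloc, exist_ok=True)

comp_path = os.path.join(vaultloc, fileid)
enc_path = comp_path + "_enc"
dec_path = comp_path + "_decrypt" + ext + ".gz"
fin_path = comp_path + "_fin" + ext

# Step 1: compress the original file with gzip.
with open(filepath, 'rb') as f_in:
    with gzip.open(comp_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

gpg = gnupg.GPG()
gpg.encoding = "utf-8"

password = "oogabooga"

# Step 2: generate a key pair protected by the master password.
key = gpg.gen_key(gpg.gen_key_input(passphrase=password))
fingerprint = key.fingerprint

# ascii_armored_public_keys = gpg.export_keys(fingerprint)
# ascii_armored_private_keys = gpg.export_keys(fingerprint, True, passphrase=password)

# Step 3: encrypt the compressed file with our own public key.
with open(comp_path, 'rb') as comp_in:
    encrypted = gpg.encrypt_file(comp_in, fingerprint, output=enc_path)
if not encrypted.ok:
    raise RuntimeError("Encryption failed: " + encrypted.status)

# Step 4: round-trip check. Decrypt (this needs the passphrase), then decompress.
with open(enc_path, 'rb') as enc_in:
    decrypted = gpg.decrypt_file(enc_in, passphrase=password, output=dec_path)
if not decrypted.ok:
    raise RuntimeError("Decryption failed: " + decrypted.status)

with gzip.open(dec_path, "rb") as f_in:
    with open(fin_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
```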