## Hashing

Hashing is a process that transforms input data of any size into a fixed-size value, typically called a hash code or hash value. 

This technique is widely used in computer science for efficient data lookup, storage, and retrieval, such as in hash tables or dictionaries.
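To make the hash table use case concrete, here is a minimal sketch of a table that picks a bucket from a key's hash. The class name `ToyHashTable`, its methods, and the bucket count are hypothetical and chosen for illustration. SHA-256 is used only because it gives a stable value across runs; real hash tables normally use much faster non-cryptographic functions.

```python
import hashlib


class ToyHashTable:
    """Minimal hash table with separate chaining (illustrative only)."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # Hash the key to a fixed-size digest, then reduce it to a bucket index.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        # Only one bucket is scanned, not the whole table.
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default


table = ToyHashTable()
table.put("apple", 3)
table.put("banana", 7)
table.put("apple", 4)  # replaces the earlier value for "apple"
print(table.get("apple"), table.get("banana"), table.get("cherry"))
```

Keys that land in the same bucket are kept in a short list, which is one common way to cope with collisions in a table.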


 ### Properties of Hashing
 
 1. **Deterministic**: The same input will always produce the same hash value.
 2. **Uniformity**: A good hash function spreads inputs (keys) evenly across its output range, for example across the buckets of a hash table, to minimize collisions.
 3. **Efficiency**: Hashing should be computationally efficient to perform.
 4. **Fixed Output Size**: Regardless of the input size, the output hash value is always of fixed length.
 5. **Pre-image Resistance**: Given a hash value, it should be computationally infeasible to find any input that produces it (required of cryptographic hashes, not of general-purpose ones).
 6. **Avalanche Effect**: A small change in input should result in a significantly different hash value.
 7. **Minimized Collisions**: A well-designed hash function minimizes the chances that two different inputs produce the same hash value.
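
Two of the properties above, determinism and the avalanche effect, can be checked directly. The following is a small sketch using SHA-256 from Python's `hashlib`; the helper name `sha256_int` and the sample strings are illustrative assumptions.

```python
import hashlib


def sha256_int(text):
    """Return the SHA-256 digest of text as an integer."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest(), "big")


a = sha256_int("hello world")
b = sha256_int("hello world")  # same input again
c = sha256_int("hello worle")  # last character changed

# Deterministic: identical inputs give identical digests.
print("same input, same hash:", a == b)

# Avalanche: XOR the two digests and count the bits that differ (out of 256).
differing_bits = bin(a ^ c).count("1")
print("bits changed:", differing_bits, "of 256")
```

For a well-behaved hash function, roughly half of the output bits are expected to flip when the input changes slightly.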


 ### Common Hashing Algorithms and Their Output Size

 Different hashing algorithms produce hash values (digests) of different fixed sizes. Here are some of the popular ones:

 | Algorithm     | Output Size (bits) | Output Size (hex digits) |
 |---------------|--------------------|--------------------------|
 | MD5           | 128                | 32                       |
 | SHA-1         | 160                | 40                       |
 | SHA-256       | 256                | 64                       |
 | SHA-512       | 512                | 128                      |
 | SHA3-256      | 256                | 64                       |

 The output size determines how many distinct hash values the algorithm can produce: an n-bit digest has at most 2<sup>n</sup> possible values. Larger digests make collisions less likely, but they take more space to store and transmit.
 

In [1]:
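 import hashlib
 
 text = "hello world"
 data = text.encode("utf-8")  # hash functions operate on bytes, not str
 
 # hexdigest() returns the digest as a hexadecimal string
 md5_hash = hashlib.md5(data).hexdigest()
 sha1_hash = hashlib.sha1(data).hexdigest()
 sha256_hash = hashlib.sha256(data).hexdigest()
 
 # Each hex digit encodes 4 bits, so the string length is the bit size divided by 4
 print("MD5:", md5_hash, len(md5_hash), "hex digits")
 print("SHA-1:", sha1_hash, len(sha1_hash), "hex digits")
 print("SHA-256:", sha256_hash, len(sha256_hash), "hex digits")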



MD5: 5eb63bbbe01eeed093cb22bb8f5acdc3 32 hex digits
SHA-1: 2aae6c35c94fcfb415dbe95f408b9ce91ee846ed 40 hex digits
SHA-256: b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9 64 hex digits


 This code generates hash values with three different algorithms and prints the length of each digest in hexadecimal digits. Each hex digit encodes 4 bits, so the 128-bit MD5 digest is 32 digits long and the 256-bit SHA-256 digest is 64.
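
The link between output size and collision risk can be estimated with the standard birthday approximation. The sketch below assumes the hash behaves like a uniform random function, which real algorithms only approximate; the function name `collision_probability` and the item count are illustrative.

```python
import math


def collision_probability(num_items, bits):
    """Approximate chance that at least two of num_items random hashes collide."""
    space = 2.0 ** bits  # number of possible hash values
    # Birthday approximation: p ~= 1 - exp(-k(k-1) / (2N)).
    # -expm1(-x) evaluates 1 - exp(-x) accurately even when x is tiny.
    return -math.expm1(-num_items * (num_items - 1) / (2.0 * space))


for bits in (32, 128, 256):
    p = collision_probability(1_000_000, bits)
    print(f"{bits:>3}-bit hash, 1,000,000 items: p = {p:.3e}")
```

Under this model a collision becomes likely once the number of items approaches 2<sup>n/2</sup> for an n-bit hash, which is why small digests fill up far sooner than their raw value count suggests.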

 ### Hash Collisions
 
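 A **hash collision** occurs when two different inputs produce the same hash value. Because the output size of a hash function is finite while the set of possible inputs is unbounded, collisions must exist (the pigeonhole principle). A good hash function cannot eliminate them; it can only make them rare or hard to find.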
 
 For example, a hash function that outputs 3 bits has only 2<sup>3</sup> = 8 possible hash values, but an unlimited number of possible inputs. Among any 9 distinct inputs, at least two must therefore share a hash value, and that shared value is a collision.
 
 ![Screenshot 2025-10-17 121956.png](<attachment:Screenshot 2025-10-17 121956.png>)
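
The 3-bit case is small enough to run. This sketch builds a hypothetical 3-bit hash, `toy_hash_3bit`, by keeping only the low three bits of a SHA-256 digest's first byte, then hashes nine made-up inputs. With only eight possible outputs, a collision is guaranteed within nine inputs.

```python
import hashlib


def toy_hash_3bit(text):
    """Hypothetical 3-bit hash: keep the low 3 bits of SHA-256's first byte."""
    return hashlib.sha256(text.encode("utf-8")).digest()[0] & 0b111


seen = {}  # hash value -> first input that produced it
collision = None
for i in range(9):  # 9 inputs, but only 8 possible outputs
    text = f"input-{i}"
    h = toy_hash_3bit(text)
    if h in seen:
        collision = (seen[h], text, h)  # two distinct inputs, same hash
        break
    seen[h] = text

print("collision:", collision)
```

Which pair collides depends on the inputs chosen, but the loop cannot finish without finding one.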
 
 Collisions can be problematic, especially in security-critical applications. For example, if two different files hash to the same value, a system using the hash for verification can't distinguish between them. That's why cryptographic hash functions are designed to make collisions extremely hard to find.
 
 It is important to select hashing algorithms with a low risk of collisions, and to avoid algorithms known to be vulnerable to collision attacks (such as MD5 and SHA-1).
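
As an illustration of hash-based verification, the sketch below compares the SHA-256 digest of received data against a digest published by the sender. The byte strings are placeholders, and the helper names `sha256_hex` and `verify` are assumptions made for this example.

```python
import hashlib
import hmac


def sha256_hex(data):
    """Return the SHA-256 digest of a bytes object as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify(data, expected_hex):
    # compare_digest compares in constant time, so timing does not
    # reveal where the two strings first differ.
    return hmac.compare_digest(sha256_hex(data), expected_hex)


original = b"example file contents"      # placeholder data
published_digest = sha256_hex(original)  # digest shared alongside the file

received_ok = b"example file contents"
received_bad = b"example file c0ntents"  # one byte altered

print(verify(received_ok, published_digest))
print(verify(received_bad, published_digest))
```

This check is only as trustworthy as the hash function behind it: with a collision-prone algorithm, an attacker could craft a different file that passes.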


 ### Pros and Cons of Hashing
 
 **Pros:**
 - Fast lookup and insertion: hash tables offer constant-time operations on average.
 - Output size is fixed, making comparison and storage efficient.
 - Cryptographic hash functions let a system check sensitive data, such as passwords, without storing the original value.
 - Enables efficient data verification and integrity checking.
 - Widely applicable, from password storage to data deduplication.

 **Cons:**
 - Susceptible to collisions, where two inputs may produce the same hash.
 - Because the output size is fixed, collisions become more likely as the number of hashed items grows.
 - Some hash functions (e.g., MD5, SHA-1) are vulnerable to attacks.
 - Irreversible: the original input cannot be recovered from its hash value.
 - Quality and security rely on the choice of hash function.
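
The last point matters most for password storage, mentioned in the pros above. A single pass of a fast hash such as SHA-256 is generally considered too cheap to brute-force, so a salted, deliberately slow key-derivation function is the usual choice. Here is a minimal sketch with `hashlib.pbkdf2_hmac`; the function names, the 16-byte salt, and the iteration count are assumptions for illustration, not recommendations.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor; tune for your own hardware


def hash_password(password, salt=None):
    """Derive a slow, salted hash of the password. Returns (salt, digest)."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def check_password(password, salt, expected):
    # Recompute with the stored salt and compare in constant time.
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("correct horse")
print(check_password("correct horse", salt, stored))
print(check_password("wrong guess", salt, stored))
```

Only the salt and the derived digest are stored, so the original password never needs to be kept.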
