# Strings are Strange
We will explore
* what it means for strings to be the "same",
* some optimisations CPython applies to strings (constant folding and interning),
* immutability,

and conclude that none of it really matters, mostly.

The idea is to motivate why languages might choose to handle strings differently.

## Sameness

In [None]:
# Warm-up
print(f"Does 1 == 2? {1 == 2}")

In [None]:
x = "Hello"
y = "Hello"

print(f"(x == y) == {x == y}")
print(f"(x is y) == {x is y}")

In [None]:
print(f"id(x) == {id(x) % 10000}")
print(f"id(y) == {id(y) % 10000}")

Does anything here strike you as a bit odd?

In [None]:
# Compile-time optimisation (peephole/AST constant folding)
x = "Hello"
y = "Hell" + "o"

print(f"(x == y) == {x == y}")
print(f"(x is y) == {x is y}")

In [None]:
# Runtime equivalence?
x = "Hello"
y = "Hell"
y = y + "o"

print(f"(x == y) == {x == y}")
print(f"(x is y) == {x is y}")

**Source code** --> (Compiler) --> **bytecode** --> (Interpreter) --> **runtime**

In [None]:
# Look at the bytecode
import dis

def some_code():
    x = "Hello"
    y = "Hell"
    y = y + "o"

dis.dis(some_code)
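
As an alternative to reading the bytecode, the constants the compiler stored can be inspected directly. This is a minimal sketch, assuming CPython (constant folding is an implementation detail); the `<demo>` filename is only a placeholder.

```python
# Compile two snippets and look at the constants stored in each code object.
folded = compile('y = "Hell" + "o"', "<demo>", "exec")
runtime = compile('y = "Hell"\ny = y + "o"', "<demo>", "exec")

folded_consts = folded.co_consts
runtime_consts = runtime.co_consts

# On CPython the first snippet is expected to hold the already-joined "Hello".
# The second only holds the pieces, which are joined when the code runs.
print("folded :", folded_consts)
print("runtime:", runtime_consts)
```

If the compiler folded the expression, `"Hello"` appears among the first snippet's constants, which is why the compile-time case can share an object with the literal while the runtime case cannot.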

## String Interning

In [None]:
# Explicit interning
import sys

x = "Hello"
y = "Hell"
y = y + "o"

y = sys.intern(y)   # <---- the new bit

print(f"(x == y) == {x == y}")
print(f"(x is y) == {x is y}")

### String Pools
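A string pool is a lookup table that keeps one copy of each distinct string, so equal strings can be compared by reference number instead of character by character.
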
|string  |reference|
|--------|---------|
|"Hello" |39       |
|"World" |74       |
|"Cheese"|15       |
|"Cake"  |52       |
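
The table above can be sketched as a toy class. This is an illustration of the idea only, not how `sys.intern()` is implemented; the `StringPool` name and its methods are hypothetical.

```python
class StringPool:
    """Toy string pool: maps each distinct string to a small integer."""

    def __init__(self):
        self._refs = {}      # string -> reference number
        self._strings = []   # reference number -> string

    def intern(self, s):
        # Return the existing reference, or allocate the next free one.
        if s not in self._refs:
            self._refs[s] = len(self._strings)
            self._strings.append(s)
        return self._refs[s]

    def lookup(self, ref):
        # Recover the original string from its reference number.
        return self._strings[ref]


pool = StringPool()
a = pool.intern("Hello")
b = pool.intern("World")
c = pool.intern("Hell" + "o")

print(a == c)   # same text, so the same reference
print(a == b)   # different text, so different references
```

Note that the dictionary lookup inside `intern` still has to hash and compare the string, so the cost is paid once when interning rather than on every later comparison.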

### Conditions for implicit interning

In [None]:
# Back to basics
x = "Hello"
y = "Hello"

print(f"(x == y) == {x == y}")
print(f"(x is y) == {x is y}")

In [None]:
# Added some special chars (',', ' ', & '!')
x = "Hello, World!"
y = "Hello, World!"

print(f"(x == y) == {x == y}")
print(f"(x is y) == {x is y}")

In [None]:
# Stop implicit interning with long strings
x = "0123456789"*409
y = "0123456789"*409

print(f"(x == y) == {x == y}")
print(f"(x is y) == {x is y}")

*This is fabulous, but, so what?*

### Performance implications

In [None]:
import sys
import time

x = "The quick brown fox jumps over the lazy programmer. "*400
y = "The quick brown fox jumps over the lazy programmer. "*400

x = sys.intern(x)
y = sys.intern(y)

start = time.perf_counter()

for _ in range(1000000):
    x == y

time.perf_counter() - start
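
To put the timing above in context, it may help to compare interned and non-interned strings side by side. This is a rough sketch using the standard `timeit` module; the variable names are illustrative, and the numbers will vary by machine and Python version.

```python
import sys
import timeit

text = "The quick brown fox jumps over the lazy programmer. "

# Build the strings at runtime so they are distinct objects with equal content.
plain_x = "".join([text] * 400)
plain_y = "".join([text] * 400)

# Interning maps equal strings to one shared object.
interned_x = sys.intern(plain_x)
interned_y = sys.intern(plain_y)

print("plain    is:", plain_x is plain_y)
print("interned is:", interned_x is interned_y)

# Time the same equality test on both pairs.
plain_time = timeit.timeit(lambda: plain_x == plain_y, number=100_000)
interned_time = timeit.timeit(lambda: interned_x == interned_y, number=100_000)

print(f"plain    == : {plain_time:.4f} s")
print(f"interned == : {interned_time:.4f} s")
```

When both names refer to the same object, `==` can return as soon as it sees the identities match; otherwise the contents have to be compared.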

## String Immutability
Immutable == unalterable: once a string object is created, its contents cannot change.

In [None]:
x = "Hello, world!"
y = x

x = "Cake"

y

In [None]:
# Mutate at index?
x = "Hello, world!"
x[0] = "h"   # raises TypeError: 'str' object does not support item assignment
x

In [None]:
# Concatenate?
x = "Hello, World!"
print(f"Before: id == {id(x) % 10000}")

x += " ...?"
print(f" After: id == {id(x) % 10000}")

In [None]:
# CPython may extend the string in place if nothing else references it (implementation detail)
print(f"Before: id == {id(x) % 10000}")

x += " ...!"
print(f" After: id == {id(x) % 10000}")

What if we run the above block repeatedly?
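
One way to try this without re-running the cell by hand is a small loop. This is a sketch only: the `count_moves` helper is hypothetical, the result depends on the CPython version and memory allocator, and an unchanged `id` only suggests the string was extended in place.

```python
def count_moves(n=1000):
    """Append to a local string n times and count how often its id changes."""
    s = "Hello, World!"
    moves = 0
    for _ in range(n):
        before = id(s)
        s += "!"
        if id(s) != before:
            moves += 1
    return s, moves


result, moves = count_moves()
print(f"{moves} of 1000 appends produced a different id")
```

Whatever the count turns out to be, the value of the string is the same; only where it lives in memory differs.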

## Conclusions

* Strings are immutable, ***mostly***.
  - e.g. CPython may concatenate in place when the buffer is ample and nothing else references the string

* Constant string expressions ***might*** be optimised at compile-time.

* String literals are interned at compile-time, ***sometimes***.
  - e.g. < 4096 characters and no special characters, though the exact rules vary between CPython versions

* You can explicitly **intern** strings with `sys.intern()` to make comparisons super fast,
  - **but** interning itself costs a table lookup, and a large pool can become slow and/or memory hungry.

* None of this really matters, 
  - **except** for when it does.

### When does all this matter?

When you need:
* The best speed possible.
* Consistency of behaviour.
* To run on hardware with constrained resources.