# Strings are Strange
We will explore
* what it means for strings to be the "same",
* some compile-time and runtime optimisations,
* immutability,

and conclude that none of it really matters, mostly.

The idea is to motivate why languages might choose to handle strings differently.

## Sameness

In [2]:
x = "Hello"
y = "Hello"

print(f"(x == y) == {x == y}")
print(f"(x is y) == {x is y}")

print(f"   id(x) == {id(x) % 10000}")
print(f"   id(y) == {id(y) % 10000}")

(x == y) == True
(x is y) == True
   id(x) == 5408
   id(y) == 5408


Does anything here strike you as a bit odd?

In [3]:
# Compile time optimisation (peep-hole/AST)
x = "Hello"
y = "Hell" + "o"

print(f"(x == y) == {x == y}")
print(f"(x is y) == {x is y}")

(x == y) == True
(x is y) == True


In [4]:
# Runtime equivalence?
x = "Hello"
y = "Hell"
y = y + "o"

print(f"(x == y) == {x == y}")
print(f"(x is y) == {x is y}")

(x == y) == True
(x is y) == False


In [7]:
# Look at the bytecode
import dis

def some_code():
    x = "Hello"
    y = "Hell" + "o"

dis.dis(some_code)

  5           0 LOAD_CONST               1 ('Hello')
              2 STORE_FAST               0 (x)

  6           4 LOAD_CONST               1 ('Hello')
              6 STORE_FAST               1 (y)
              8 LOAD_CONST               0 (None)
             10 RETURN_VALUE
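The folded constant can also be checked without reading the disassembly. This is a minimal sketch, assuming CPython, where constant folding is an implementation detail rather than a language guarantee. The name `folded_example` is hypothetical and introduced only for this illustration.

```python
def folded_example():
    x = "Hello"
    y = "Hell" + "o"   # a constant expression the compiler can evaluate
    return x, y

# co_consts holds the constants the compiler stored for the function.
consts = folded_example.__code__.co_consts
print(consts)
print("Hello" in consts)
```

If the compiler folded `"Hell" + "o"`, the tuple contains the finished `'Hello'` rather than only the two fragments.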


## String Interning

**Source code** --> (Compiler) --> **bytecode** --> (Interpreter) --> **runtime**

In [10]:
# Explicit interning
import sys

x = "Hello"
y = "Hell"
y = y + "o"

#y = sys.intern(y)   # <---- the new bit

print(f"(x == y) == {x == y}")
print(f"(x is y) == {x is y}")

(x == y) == True
(x is y) == False
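For comparison, here is a sketch of the same cell with the interning step switched on. `sys.intern()` returns the pooled copy of a string, so two equal strings passed through it come back as the same object. The `"".join(...)` call is used here only to make sure `y` is built at runtime.

```python
import sys

x = "Hello"
y = "".join(["Hell", "o"])   # built at runtime, not by the compiler

print(f"(x == y) == {x == y}")

# Replace both names with the pooled copy of the string.
x = sys.intern(x)
y = sys.intern(y)

print(f"(x is y) == {x is y}")
```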


### String Pools
A lookup table, so strings can be compared simply by reference number.

|string  |reference|
|--------|---------|
|"Hello" |39       |
|"World" |74       |
|"Cheese"|15       |
|"Cake"  |52       |
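The table above can be sketched as a toy pool. This illustrates the idea only and is not how CPython implements interning. The class `StringPool` and its methods are hypothetical names invented for this example.

```python
class StringPool:
    """Toy pool: maps each distinct string to a small integer reference."""

    def __init__(self):
        self._refs = {}      # string -> reference number
        self._strings = []   # reference number -> string

    def intern(self, text):
        """Return the reference for text, adding it to the pool if new."""
        ref = self._refs.get(text)
        if ref is None:
            ref = len(self._strings)
            self._refs[text] = ref
            self._strings.append(text)
        return ref

    def lookup(self, ref):
        """Return the string stored under a reference number."""
        return self._strings[ref]


pool = StringPool()
a = pool.intern("Hello")
b = pool.intern("World")
c = pool.intern("".join(["Hel", "lo"]))   # equal text, built separately

print(a == c, a == b)   # comparing references, not characters
```

Once two strings are in the pool, checking whether they are the same is a single integer comparison, however long the strings are. The cost is paid up front, when each string is looked up in the pool.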

### Conditions for implicit interning

In [11]:
# Back to basics
x = "Hello"
y = "Hello"

print(f"(x == y) == {x == y}")
print(f"(x is y) == {x is y}")

(x == y) == True
(x is y) == True


In [12]:
# Added some special chars (',', ' ', & '!')
x = "Hello, World!"
y = "Hello, World!"

print(f"(x == y) == {x == y}")
print(f"(x is y) == {x is y}")

(x == y) == True
(x is y) == False


In [16]:
# Long strings: 409 repeats is 4090 characters and still optimised; try 410 repeats
x = "0123456789"*409
y = "0123456789"*409

print(f"(x == y) == {x == y}")
print(f"(x is y) == {x is y}")

(x == y) == True
(x is y) == True
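The three cells above are consistent with one rule of thumb, stated here as an assumption about a CPython implementation detail that has changed between versions: string constants made up only of ASCII letters, digits, and underscores are the ones interned automatically. The helper `looks_internable` is a hypothetical name and only approximates that rule. CPython does not expose such a function.

```python
import string

# Characters that can appear in an ASCII identifier.
NAME_CHARS = set(string.ascii_letters + string.digits + "_")

def looks_internable(text):
    """Rough approximation of the rule of thumb; not a CPython API."""
    return all(ch in NAME_CHARS for ch in text)

for sample in ["Hello", "Hello, World!", "0123456789" * 409]:
    print(f"{sample[:13]!r:17} {looks_internable(sample)}")
```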


*This is fabulous, but, so what?*

### Performance implications

In [21]:
import sys
import time

x = "The quick brown fox jumps over the lazy programmer. "*400
y = "The quick brown fox jumps over the lazy programmer. "*400

# x = sys.intern(x)
# y = sys.intern(y)

start = time.perf_counter()

for _ in range(1000000):
    x == y

time.perf_counter() - start

0.9701649939524941
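The two variants of this cell can be timed side by side. A sketch, assuming CPython, where `==` on strings returns early when both operands are the same object. The strings are built with `"".join(...)` so that they start out as distinct objects, and the timings depend on the machine.

```python
import sys
import timeit

base = "The quick brown fox jumps over the lazy programmer. "
x = "".join([base] * 400)   # two equal strings, built separately
y = "".join([base] * 400)

plain = timeit.timeit(lambda: x == y, number=100_000)

# After interning, both names refer to the same pooled object.
xi = sys.intern(x)
yi = sys.intern(y)
interned = timeit.timeit(lambda: xi == yi, number=100_000)

print(f"distinct objects: {plain:.4f}s")
print(f"interned:         {interned:.4f}s")
```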

## String Immutability
Immutable == unalterable

In [22]:
x = "Hello, world!"
y = x

x = "Cake"

y

'Hello, world!'

In [23]:
# Mutate at index?
x = "Hello, world!"
x[0] = "h"
x

TypeError: 'str' object does not support item assignment
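Since item assignment is refused, the usual way to "change" a character is to build a new string from pieces of the old one. A small sketch:

```python
x = "Hello, world!"

try:
    x[0] = "h"                  # strings do not support item assignment
except TypeError as err:
    print(f"TypeError: {err}")

lowered = "h" + x[1:]           # a new string; x itself is untouched
print(lowered)
print(x)
```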

In [24]:
# Concatenate?
x = "Hello, World!"
print(f"Before: id == {id(x) % 10000}")

x += " ...?"
print(f" After: id == {id(x) % 10000}")

Before: id == 512
 After: id == 8880


In [30]:
# CPython optimises if buffer big enough and no other references
print(f"Before: id == {id(x) % 10000}")

x += " ...!"
print(f" After: id == {id(x) % 10000}")

Before: id == 6688
 After: id == 6192


What if we run the above block repeatedly?
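One way to explore that question is to repeat the append in a loop and count how often `id()` changes. This is a sketch only: the count depends on the CPython version and the memory allocator, so no particular number should be expected. The function `count_id_changes` is a hypothetical helper.

```python
def count_id_changes(rounds):
    """Append to a string `rounds` times and count how often id() changes."""
    text = "Hello, World!"
    changes = 0
    for _ in range(rounds):
        before = id(text)        # an integer, so it holds no reference to text
        text += " ...!"
        if id(text) != before:
            changes += 1
    return text, changes

text, changes = count_id_changes(1000)
print(f"{changes} of 1000 appends produced a new id")
```

A count below 1000 suggests that some appends reused the existing object instead of creating a new one.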

## Conclusions

* Strings are immutable, ***mostly***.
  - e.g. CPython may concatenate in place if the buffer is big enough and nothing else references the string

* Constant string expressions ***might*** be optimised at compile-time.

* String literals are interned at compile-time, ***sometimes***.
  - e.g. fewer than 4096 characters, and only letters, digits, and underscores

* You can explicitly **intern** strings with `sys.intern()` to make comparisons super fast,
  - **but** interning itself takes time, and the pool of interned strings takes memory.

* None of this really matters, 
  - **except** for when it does.

### When does all this matter?

When you need:
* The best speed possible.
* Consistency of behaviour.
* To run on hardware with constrained resources.