In [None]:
'''
Some strings are also automatically interned, but not all of them.

As the Python code is compiled, identifiers are interned: 

variable names, 
function names, 
class names, 
etc.

Identifiers:
must start with _ or a letter
can only contain _, letters and digits


Some string literals may also be automatically interned:
a string literal that looks like an identifier may get interned even if it is never used as one, but don't count on it.


Why does Python intern strings?

It's all about optimization (speed and, possibly, memory).
Python, both internally and in the code you write, performs lots and lots of dictionary lookups
on string keys, which means a lot of string equality testing.
Let's say we want to see if two strings are equal: 
a = 'some_long_string'
b = 'some_long_string' 

Using a == b, we need to compare the two strings character by character.

But if we know that 'some_long_string' has been interned, then a and b are the same string if they both point to the same memory address.

In which case, we can use a is b instead, which compares two integers (memory addresses).

This is MUCH faster than the character-by-character comparison.

Not ALL strings are automatically interned by Python.

But you can force strings to be interned by using the sys.intern() function.

import sys
a = sys.intern('the quick brown fox')
b = sys.intern('the quick brown fox')

a is b ------> True
much faster than a == b 

When should you do this?

1. dealing with a large number of strings that could have high repetition, e.g. tokenizing a large corpus of text (NLP)
2. lots of string comparisons, BUT even then do not usually start interning strings right away.

In general though, you do not need to intern strings yourself. Only do this if you really need to.

'''
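
The tokenizing use case from the notes above can be sketched as follows. This is a minimal illustration, not a recommended pipeline: the helper name `intern_tokens` and the sample sentence are made up for the example, and splitting on whitespace stands in for a real tokenizer.

```python
import sys

def intern_tokens(text):
    """Split text on whitespace and intern every token (hypothetical helper)."""
    return [sys.intern(token) for token in text.split()]

tokens = intern_tokens('the cat saw the other cat')

# Equal tokens now refer to one shared object, so `is` can stand in for `==`.
print(tokens[0] is tokens[3])   # both are 'the'
print(tokens[1] is tokens[5])   # both are 'cat'

# Six tokens, but only four distinct string objects are kept alive.
unique_objects = len({id(token) for token in tokens})
print(len(tokens), unique_objects)
```

With a large corpus and a small vocabulary, the same idea keeps one object per distinct word instead of one per occurrence.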

In [2]:
a = 'hello'
b = 'hello'
print(id(a), id(b))

4510364464 4510364464


In [3]:
a = 'hello world'
b = 'hello world'

In [4]:
print(id(a), id(b))   # Doesn't look like an identifier, so it was not interned here. BUT don't count on this.

4510367216 4510367344


In [5]:
a == b 

True

In [6]:
a is b

False

In [7]:
a = 'hello'
b = 'hello'

In [8]:
a == b

True

In [9]:
a is b

True

In [12]:
a = '_this_is_a_long_string_that_could_be_used_as_an_identifier'

In [13]:
b = '_this_is_a_long_string_that_could_be_used_as_an_identifier'

In [14]:
a is b   # Long, but it looks like an identifier, so both names refer to the same interned string.

True
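
To check whether a string "looks like an identifier" in the sense used above, the built-in `str.isidentifier()` method can be used. This is only a sketch of the rule. Whether CPython actually interns such a literal is an implementation detail and should not be relied on. Note also that `isidentifier()` accepts non-ASCII letters, which the simplified rule in the notes does not mention.

```python
candidates = [
    'hello',
    'hello world',
    '_this_is_a_long_string_that_could_be_used_as_an_identifier',
    '1abc',
]

# Map each candidate to whether it is a syntactically valid identifier.
looks_like_identifier = {s: s.isidentifier() for s in candidates}

for s, ok in looks_like_identifier.items():
    print(repr(s), ok)
```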

In [15]:
import sys

In [19]:
a = sys.intern('hello world')  # We can force interning

In [20]:
b = sys.intern('hello world')

In [21]:
c = 'hello world'

In [22]:
print(id(a), id(b), id(c))   # And therefore a and b have the same id. 

4510168304 4510168304 4510592880


In [23]:
a == b # equality test: in general this compares the contents character by character

True

In [24]:
a is b # identity test: only compares memory addresses, so it is faster

True

In [25]:
a == c

True

In [26]:
a is c

False
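
The cells above use string literals. Strings built at runtime are the more common case for `sys.intern()`, and the sketch below shows it. The variable names are illustrative only. The result of the plain `is` check is a CPython implementation detail, so it is printed without an expected value.

```python
import sys

parts = ['hello', ' ', 'world']

# Two equal strings built at runtime.
a = ''.join(parts)
b = ''.join(parts)

print(a == b)   # equal contents
print(a is b)   # implementation detail; do not rely on the result

# After interning, equal strings are guaranteed to be the same object.
a_interned = sys.intern(a)
b_interned = sys.intern(b)
print(a_interned is b_interned)
```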

In [28]:
def compare_using_equals(n): 
    a = 'a long string that is not interned'*200
    b = 'a long string that is not interned'*200
    for i in range(n): 
        if a == b: 
            pass


In [32]:
def compare_using_interning(n): 
    a = sys.intern('a long string that is not interned'*200)
    b = sys.intern('a long string that is not interned'*200)
    for i in range(n): 
        if a is b:   # identity check; valid because both strings are interned
            pass

In [29]:
import time 

In [30]:
start = time.perf_counter()
compare_using_equals(10000000)
end = time.perf_counter()
print('equality', end-start)

equality 3.194199092999952


In [33]:
start = time.perf_counter()
compare_using_interning(10000000)    # Interning pays off when the number of comparisons is huge
end = time.perf_counter()
print('interned', end-start)

interned 0.46663639300004434
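
The same comparison can also be timed with the standard `timeit` module, which handles the loop and the timer. This is a rough sketch: the iteration count is an arbitrary choice, and the numbers will vary by machine and Python version. The strings are built from a variable so that `a` and `b` start out as separate objects.

```python
import sys
import timeit

base = 'a long string that is not interned'

# Two equal strings created at runtime as separate objects.
a = base * 200
b = base * 200

# Interned versions: equal strings collapse to a single object.
a_interned = sys.intern(a)
b_interned = sys.intern(b)

n = 100_000  # arbitrary iteration count for the sketch
t_equals = timeit.timeit('a == b', globals={'a': a, 'b': b}, number=n)
t_identity = timeit.timeit('a is b', globals={'a': a_interned, 'b': b_interned}, number=n)

print(f'== on separate objects: {t_equals:.4f}s')
print(f'is on interned objects: {t_identity:.4f}s')
```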
