# Chapter-7 Strings
This notebook contains the sample source code explained in the book *Hands-On Julia Programming, Sambit Kumar Dash, 2021, bpb Publications. All Rights Reserved*.

In [86]:
using Pkg
pkg"activate ."
pkg"instantiate"

[32m[1m  Activating[22m[39m environment at `C:\Users\vishn\Hands-on-Julia-Programming\Chapter 07\Project.toml`
[32m[1m  Activating[22m[39m environment at `C:\Users\vishn\Hands-on-Julia-Programming\Chapter 07\Project.toml`


## 7.1 Introduction

A string can be considered a collection of characters. For a detailed treatment, please refer to the book chapter.

## 7.2 String

Simple examples of strings created with the various literal forms: plain, triple-quoted, interpolated, and escaped.

In [87]:
str = "This is a string"

"This is a string"

"This is a string"

In [88]:
str = """ 
        This is a preformatted 
        "string" """

" \nThis is a preformatted \n\"string\" "

" \nThis is a preformatted \n\"string\" "

In [89]:
a = "Jack"
b = "Jill"
c = "100"

str = "$a owes $b $c dollars"

"Jack owes Jill 100 dollars"

"Jack owes Jill 100 dollars"

In [90]:
str = "This is a \"quoted\\  ' string"

"This is a \"quoted\\  ' string"

"This is a \"quoted\\  ' string"

## 7.3 String Methods

Strings are immutable: they cannot be modified in place. String methods combine or operate on strings and return either an attribute of a string or a new string derived from the original.
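
As a small illustration of immutability, the sketch below attempts an indexed assignment on a `String` and then derives a new string instead. The variable names `threw` and `t` are chosen only for this example.

```julia
s = "Julia"

# Indexed assignment is not defined for String, so the attempt raises a MethodError
threw = try
    s[1] = 'j'
    false
catch err
    err isa MethodError
end
println(threw)

# Functions such as lowercasefirst return a new string and leave s untouched
t = lowercasefirst(s)
println(s, " ", t)
```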

### Comparisons

In [91]:
s1 = "abc"
s2 = "def"
s1 < s2

true

In [92]:
s2 > s1

true

In [93]:
s1 = "abc"
s2 = "abc"
s1 == s2

true

In [94]:
s1 === s2

true

### Iteration

Strings can be iterated as collections of characters. However, string indices are byte offsets, so only indices that fall on character boundaries are valid.

In [95]:
s = "Julia"
for c in s
    println(c)
end

J
u
l
i
a


In [96]:
s[1], s[2], s[3], s[4], s[5] 

('J', 'u', 'l', 'i', 'a')

In [97]:
s[begin], s[begin+2], s[end-1], s[end]

('J', 'l', 'i', 'a')

In [98]:
s = "\u2200 x \u2203 y"

"∀ x ∃ y"

"∀ x ∃ y"

In [99]:
length(s)

7

In [100]:
sizeof(s)

11

In [101]:
s[1]

'∀': Unicode U+2200 (category Sm: Symbol, math)

In [102]:
s[2]

LoadError: StringIndexError: invalid index [2], valid nearby indices [1]=>'∀', [4]=>' '

In [103]:
s[4]

' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

In [104]:
for c in s
    println(c)
end

∀
 
x
 
∃
 
y


In [105]:
i, l = firstindex(s), lastindex(s)
while i <= l
    println(s[i])
    i = nextind(s, i)
end

∀
 
x
 
∃
 
y
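
The valid indices of a string can also be inspected directly. The following sketch, which reuses the same example string, lists the indices at which characters start and shows a few helper functions from `Base` for working with character boundaries.

```julia
s = "\u2200 x \u2203 y"

# Indices at which a character starts
valid = collect(eachindex(s))
println(valid)

println(isvalid(s, 2))    # false: index 2 lies inside the 3-byte character '∀'
println(thisind(s, 2))    # start index of the character that contains byte 2
println(nextind(s, 1))    # start index of the character following index 1
println(length(s), " characters in ", ncodeunits(s), " code units")
```

Using `eachindex`, `nextind`, and `prevind` rather than integer arithmetic keeps index handling correct for non-ASCII text.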



### Split and Concatenate

Both kinds of operations return new values; the original string is never modified.

In [106]:
str = "This is a String"
str[1:4]

"This"

"This"

In [107]:
str[1:4]*str[end-6:end]

"This String"

"This String"

In [108]:
repeat("A:-", 5)

"A:-A:-A:-A:-A:-"

"A:-A:-A:-A:-A:-"

In [109]:
"A:="^4

"A:=A:=A:=A:="

"A:=A:=A:=A:="

In [110]:
join(["1", "2", "3", "4", "5"])

"12345"

"12345"

In [111]:
join(["Jack", "Jill", "Cathy", "Trevor"], ", ", " and ")

"Jack, Jill, Cathy and Trevor"

"Jack, Jill, Cathy and Trevor"

In [112]:
str = "This is a\nString\n"
chomp(str)

"This is a\nString"

"This is a\nString"

In [113]:
chop("October")

"Octobe"

"Octobe"

In [114]:
chop("October", head=2, tail=3)

"to"

"to"

In [115]:
s = "\u2200 x \u2203 y"
ss = split(s)

4-element Vector{SubString{String}}:
 "∀"
 "x"
 "∃"
 "y"

In [116]:
s = "\u2200,x,\u2203,y"
ss = split(s, ',', limit=2)

2-element Vector{SubString{String}}:
 "∀"
 "x,∃,y"

In [117]:
s = "\u2200,x,\u2203,y"
ss = rsplit(s, ',', limit=2)

2-element Vector{SubString{String}}:
 "∀,x,∃"
 "y"

In [118]:
lpad("string", 10, "p")

"ppppstring"

"ppppstring"

In [119]:
rpad("string", 10, "s")

"stringssss"

"stringssss"

In [120]:
strip("     string 123  ")

"string 123"

"string 123"

In [121]:
strip(" {a}     string 123  ", ['{', 'a', '}', ' '])

"string 123"

"string 123"

In [122]:
strip("     string 123  aaa") do x
    return x == ' ' || x == 'a'
end

"string 123"

"string 123"

### Case Conversion

In [123]:
uppercase("Julia")

"JULIA"

"JULIA"

In [124]:
lowercase("JUliA")

"julia"

"julia"

In [125]:
titlecase("hands on programming in julia")

"Hands On Programming In Julia"

"Hands On Programming In Julia"

In [126]:
uppercasefirst("julia")

"Julia"

"Julia"

In [127]:
lowercasefirst("Julia")

"julia"

"julia"

### Match and Replace

In [128]:
str = "Introduction to Julia"
startswith(str, "Intro")

true

In [129]:
endswith(str, "Julia")

true

In [130]:
contains(str, "to")

true

In [131]:
occursin("to", str)

true

In [132]:
r = findfirst("o", "Introduction to Julia")
while r !== nothing 
    println(r)
    r = findnext("o", "Introduction to Julia", r.stop+1)
end

5:5
11:11
15:15
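
The same ranges can be collected in a single call. This sketch assumes Julia 1.3 or later, where `findall` and `count` accept a string pattern.

```julia
str = "Introduction to Julia"

ranges = findall("o", str)    # every matching range, in order of appearance
starts = first.(ranges)       # start index of each match
println(ranges)
println(starts)
println(count("o", str))      # number of matches
```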


In [133]:
findlast("o", "Introduction to Julia")

15:15

In [134]:
replace("Introduction to Julia", "o"=>"a")

"Intraductian ta Julia"

"Intraductian ta Julia"

#### Regular Expressions

Regular expressions are a language for describing text patterns. Readers are advised to consult a dedicated text on the topic for a detailed understanding.

In [135]:
rx = Regex("a.a")

r"a.a"

r"a.a"

In [136]:
m = match(rx, "abracadabra")

RegexMatch("aca")

RegexMatch("aca")

In [137]:
m.match

"aca"

"aca"

In [138]:
m = match(rx, "abracadabra", 5)

RegexMatch("ada")

RegexMatch("ada")

In [139]:
rx = Regex("a(.)a")
m = match(rx, "abracadabra")
m.captures

1-element Vector{Union{Nothing, SubString{String}}}:
 "c"

In [140]:
rx = Regex("a(?<key>.)a")
m = match(rx, "abracadabra")
m.captures

1-element Vector{Union{Nothing, SubString{String}}}:
 "c"

In [141]:
m["key"]

"c"

"c"

In [142]:
rx = r"a.a"
m = eachmatch(rx, "abracadabra", overlap=true)

Base.RegexMatchIterator(r"a.a", "abracadabra", true)

Base.RegexMatchIterator(r"a.a", "abracadabra", true)

In [143]:
collect(m)

2-element Vector{RegexMatch}:
 RegexMatch("aca")
 RegexMatch("ada")

In [144]:
m = eachmatch(rx, "abracadabra", overlap=false)

Base.RegexMatchIterator(r"a.a", "abracadabra", false)

Base.RegexMatchIterator(r"a.a", "abracadabra", false)

In [145]:
collect(m)

1-element Vector{RegexMatch}:
 RegexMatch("aca")

## 7.4 Encodings

`String` objects are stored internally in the UTF-8 encoding. However, they can be converted to or from other Unicode transformation formats such as UTF-16 or UTF-32.

In [146]:
s = "\u2200 x \u2203 y"

"∀ x ∃ y"

"∀ x ∃ y"

In [147]:
transcode(UInt16, s)

7-element Vector{UInt16}:
 0x2200
 0x0020
 0x0078
 0x0020
 0x2203
 0x0020
 0x0079

In [148]:
transcode(UInt8, s)

11-element Base.CodeUnits{UInt8, String}:
 0xe2
 0x88
 0x80
 0x20
 0x78
 0x20
 0xe2
 0x88
 0x83
 0x20
 0x79

In [149]:
transcode(UInt32, s)

7-element Vector{UInt32}:
 0x00002200
 0x00000020
 0x00000078
 0x00000020
 0x00002203
 0x00000020
 0x00000079

In [150]:
transcode(String, transcode(UInt16, s))

"∀ x ∃ y"

"∀ x ∃ y"

### Some Useful Functions

In [151]:
isascii("∀ x ∃ y"), isascii("abcd ef")

(false, true)

In [152]:
iscntrl('a'), iscntrl('\x1')

(false, true)

In [153]:
isdigit('a'), isdigit('9')

(false, true)

In [154]:
isxdigit('a'), isxdigit('x')

(true, false)

In [155]:
isletter('1'), isletter('a')

(false, true)

In [156]:
isnumeric('1'), isnumeric('௰') # the number 10 in the Tamil (Indian) language

(true, true)

In [157]:
isuppercase('A'), islowercase('a')

(true, true)

In [158]:
isspace('\n'), isspace('\r'), isspace(' '), isspace('\x20')

(true, true, true, true)
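
These character predicates combine naturally with higher-order functions that iterate over a string. The following is a small sketch with made-up inputs.

```julia
digits_only = filter(isdigit, "a1b2c3")          # keep only the digit characters
is_hex      = all(isxdigit, "ff00")              # true if every character is a hex digit
n_upper     = count(isuppercase, "Hands-On Julia")
has_space   = any(isspace, "Hands-On Julia")

println(digits_only)
println(is_hex, " ", n_upper, " ", has_space)
```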

## 7.5 Character Arrays

If you need to manipulate a string character by character, it may be best to convert the `String` into a `Vector{Char}`.

In [159]:
collect("∀ x ∃ y")

7-element Vector{Char}:
 '∀': Unicode U+2200 (category Sm: Symbol, math)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 '∃': Unicode U+2203 (category Sm: Symbol, math)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 'y': ASCII/Unicode U+0079 (category Ll: Letter, lowercase)
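
Once the characters are in a vector they can be modified in place and converted back. The sketch below shows one possible round trip; the particular edits are arbitrary.

```julia
chars = collect("∀ x ∃ y")

chars[3] = 'z'                # vector indices count characters, not bytes
reverse!(chars)               # in-place modification is allowed on a Vector{Char}

result = String(chars)        # build a new String from the characters
println(result)
```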

## 7.6 Custom Strings

If the Unicode-based `String` type does not meet all your needs, you may have to implement your own string type as a subtype of `AbstractString`. If the character code you plan to use does not map to a `Char`, you can also create your own character type as a subtype of `AbstractChar`. The `LegacyStrings.jl` package has some sample implementations of such string types for reference.

In [160]:
eltype("abcd")

Char

The next command may take several minutes to complete if your environment has never been updated.

In [161]:
]add LegacyStrings

[32m[1m   Resolving[22m[39m [32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `C:\Users\vishn\Hands-on-Julia-Programming\Chapter 07\Project.toml`
[32m[1m  No Changes[22m[39m to `C:\Users\vishn\Hands-on-Julia-Programming\Chapter 07\Manifest.toml`
package versions...
[32m[1m  No Changes[22m[39m to `C:\Users\vishn\Hands-on-Julia-Programming\Chapter 07\Project.toml`
[32m[1m  No Changes[22m[39m to `C:\Users\vishn\Hands-on-Julia-Programming\Chapter 07\Manifest.toml`


In [162]:
using LegacyStrings

In [163]:
s = ASCIIString("abcd")

"abcd"

"abcd"

In [164]:
ncodeunits(s)

4

In [165]:
codeunit(s)

UInt8

In [166]:
s16 = UTF16String(transcode(UInt16, "abcd\0"))

"abcd"

"abcd"

In [167]:
codeunit(s16)

UInt16

In [168]:
typeof(s16)

UTF16String

In [169]:
ncodeunits(s16)

4

Both `UTF16String` and `ASCIIString` behave like collections of `Char`, while internally they store their data in 16-bit and 8-bit code units respectively. Hence, a string type derived from `AbstractString` does not necessarily need its own `AbstractChar` subtype.
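
To make the idea concrete, here is a minimal sketch of a custom string type. `Latin1String` is a hypothetical name invented for this illustration and is not part of `LegacyStrings.jl`. It stores one byte per character and relies on the fact that Latin-1 byte values coincide with the first 256 Unicode code points, so it can reuse `Char` instead of defining an `AbstractChar` subtype. It implements only the basic methods that the generic `AbstractString` fallbacks build on.

```julia
# Hypothetical example type: one byte per character (ISO-8859-1 / Latin-1)
struct Latin1String <: AbstractString
    data::Vector{UInt8}
end

# Core methods on which the generic AbstractString fallbacks rely
Base.ncodeunits(s::Latin1String) = length(s.data)
Base.codeunit(s::Latin1String) = UInt8
Base.codeunit(s::Latin1String, i::Integer) = s.data[i]
Base.isvalid(s::Latin1String, i::Integer) = 1 <= i <= length(s.data)

function Base.iterate(s::Latin1String, i::Integer=1)
    i > length(s.data) && return nothing
    # Each byte maps directly to the Unicode code point of the same value
    return (Char(s.data[i]), i + 1)
end

s = Latin1String(UInt8[0x63, 0x61, 0x66, 0xe9])   # "café" encoded in Latin-1
println(s)
println(ncodeunits(s), " code units versus ", sizeof("café"), " bytes in UTF-8")
println(eltype(s))
```

With only these methods defined, generic functions such as `length`, indexing, printing, and conversion with `String(s)` are expected to work through the fallbacks in `Base`.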

In [170]:
eltype(s), eltype(s16)

(Char, Char)
