# Strings and File Processing

## Strings

Strings are sequences of characters, and they are used in a wide range of programming applications. Julia provides extensive functionality for working with strings and characters, including support for Unicode characters. However, Julia also works efficiently with standard ASCII characters and strings, which we will focus on here to keep the presentation shorter.

### Characters

ASCII is a 7-bit character set containing 128 characters. It contains the digits 0 to 9, the upper- and lower-case English letters A to Z, and some special characters.

The numbers 0 - 31 are used for so-called *control characters*, including for example the carriage return (number 13). The numbers 32 - 126 define the standard characters (the first character below, corresponding to number 32, is the space character):

```
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
```


Julia defines the type `Char` for representing characters, and you can create one using single quotes:

In [1]:
c = 'Q'

'Q': ASCII/Unicode U+0051 (category Lu: Letter, uppercase)

You can also define the character directly using its number:

In [2]:
c_and = Char(38)

'&': ASCII/Unicode U+0026 (category Po: Punctuation, other)

and you can find the number of a `Char` by converting to `Int`:

In [3]:
c_at = Int('@')

64

The control characters can be created using a backslash notation, for example the newline (line feed) character:

In [4]:
c_LF = '\n'

'\n': ASCII/Unicode U+000a (category Cc: Other, control)

You can also do comparisons and a limited amount of arithmetic with `Char` values (from Julia documentation):

```julia
julia> 'A' < 'a'
true

julia> 'A' <= 'a' <= 'Z'
false

julia> 'A' <= 'X' <= 'Z'
true

julia> 'x' - 'a'
23

julia> 'A' + 1
'B': ASCII/Unicode U+0042 (category Lu: Letter, uppercase)
```
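
As a small illustration of this arithmetic, the sketch below converts a digit character to its integer value and rotates an uppercase letter within the alphabet. The helper names `digit_value` and `shift_letter` are made up for this example, and both assume ASCII input:

```julia
# Hypothetical helpers, assuming ASCII input only

# Subtracting '0' from a digit character gives its numeric value
digit_value(c::Char) = c - '0'

function shift_letter(c::Char, k::Integer)
    # Rotate an uppercase letter k steps forward, wrapping from 'Z' back to 'A'
    return 'A' + mod(c - 'A' + k, 26)
end

println(digit_value('7'))        # 7
println(shift_letter('Y', 3))    # B
```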

### String creation

A string can be created from a sequence of characters using double quotes:

In [5]:
str1 = "Hello world!\nJulia is fun\n"

"Hello world!\nJulia is fun\n"

Note that the string is shown using the backslash syntax for the control characters, but when you print it they will be interpreted the correct way (in this case, as a line break):

In [6]:
print(str1)

Hello world!
Julia is fun


Since the double quote and the backslash characters have special meanings, as well as the dollar character, you need to put an extra backslash in front of them if used in a string:

In [7]:
str2 = "I \"have\" \$50, A\\b\n"
print(str2)

I "have" $50, A\b


Strings can also be created using triple quotes, which is convenient for multiple lines. In this case, the double quote character does not need the extra backslash:

In [8]:
str3 = """
    This is some multi-line text. 
    A newline character is inserted at each line break.
    Double quotes can be used as-is: "
    But backslash and dollar still need extra backslashes: \\, \$
    The indentation of the lines is determined by the position of the final triple-quote.
"""
print(str3)

    This is some multi-line text. 
    A newline character is inserted at each line break.
    Double quotes can be used as-is: "
    But backslash and dollar still need extra backslashes: \, $
    The indentation of the lines is determined by the position of the final triple-quote.


### String concatenation

You can concatenate multiple strings by passing them to the `string` function. An alternative syntax is to use the multiplication `*` operator between the strings:

In [9]:
str4a = "Hello"
str4b = "World"
str4ab = string(str4a, " ", str4b, "\n")  # Concatenation
str5ab = str4a * " " * str4b * "\n"       # Same thing

print(str4ab)
print(str5ab)

Hello World
Hello World


Alternatively, Julia allows *interpolation* into string literals using the `$` symbol. This is often a more natural syntax for string concatenation:

In [10]:
str6ab = "$str4a $str4b\n"
print(str6ab)

Hello World


Interpolation also accepts general expressions after the `$` sign. Anything other than a single variable name must be enclosed in parentheses. The expressions are evaluated and converted to strings:

In [11]:
vec = rand(3)
println("Random vector: $vec")
println("sin of 45 degrees = $(sind(45))")

Random vector: [0.301531, 0.0656358, 0.958036]
sin of 45 degrees = 0.7071067811865476
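
When the pieces to combine are already stored in an array, the `join` function is often handier than repeated `*` or interpolation. A minimal sketch, with made-up sample data:

```julia
words = ["alpha", "beta", "gamma"]

# Join the elements with a separator between each pair
csv_line = join(words, ", ")
println(csv_line)                            # alpha, beta, gamma

# `string` also converts non-string arguments before concatenating
println(string("n = ", 42, ", x = ", 1.5))   # n = 42, x = 1.5
```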


Another convenient notation is the power operator `^` with string and integer arguments, which concatenates multiple copies of the string:

In [12]:
"12345 "^9

"12345 12345 12345 12345 12345 12345 12345 12345 12345 "

### String comparison

You can lexicographically compare strings using the standard comparison operators (from Julia documentation):

```julia
julia> "abracadabra" < "xylophone"
true

julia> "abracadabra" == "xylophone"
false

julia> "Hello, world." != "Goodbye, world."
true

julia> "1 + 2 = 3" == "1 + 2 = $(1 + 2)"
true
```

### String indexing

Julia strings behave in many ways like a 1D array of characters. Beware, however, that a Unicode character may occupy more than one byte, in which case the valid indices are not consecutive. For example, you can extract a single character from a string by indexing with an integer:

In [13]:
str = "abcdefghij"
str[7]

'g': ASCII/Unicode U+0067 (category Ll: Letter, lowercase)

You can extract a *substring* by indexing with a range:

In [14]:
str[7:end]

"ghij"

Note that integer indexing returns a character, but range indexing returns a string. This means e.g. that indexing with a range of length 1 returns a single character as a string:

In [15]:
str[7:7]

"g"

Alternatively, it is possible to create a *view* into a string using the type `SubString`:

In [16]:
sub = SubString(str, 7, 10)

"ghij"

The length of the string can be found with the `length` function, which can be used to loop over the characters:

In [17]:
for i = 1:length(str)
    println("Character #$i = '$(str[i])'")
end

Character #1 = 'a'
Character #2 = 'b'
Character #3 = 'c'
Character #4 = 'd'
Character #5 = 'e'
Character #6 = 'f'
Character #7 = 'g'
Character #8 = 'h'
Character #9 = 'i'
Character #10 = 'j'
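
The loop above relies on every index from 1 to `length(str)` being valid, which holds for ASCII strings. As a hedged sketch of a more general pattern, iterating over the string directly, or over `eachindex`, also works when the string contains multi-byte Unicode characters. The sample string is an arbitrary choice:

```julia
s = "na\u00efve"    # the character ï (U+00EF) takes two bytes in UTF-8

# Iterate over the characters directly, no indices needed
for c in s
    print(c, ' ')
end
println()

# eachindex yields only the valid character indices: 1, 2, 3, 5, 6
for i in eachindex(s)
    println("Index $i = '$(s[i])'")
end

println(length(s), " characters, last index is ", lastindex(s))   # 5 characters, last index is 6
```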


Strings in Julia are *immutable*, which means you cannot change the content of a string after it is created:

In [18]:
str[4] = 'A'    # Error - cannot change strings

MethodError: no method matching setindex!(::String, ::Char, ::Int64)

### Example: Check if string is a palindrome

A palindrome is a sequence of characters which reads the same backward as forward. Using array operations and string comparisons, a one-line check suffices. Note that the comparison is case-sensitive and includes spaces and punctuation:

In [19]:
function is_palindrome(str)
    return str[end:-1:1] == str
end

strings = ["racecar", "Sit on a Potato Pan, Otis", "sitonapotatopanotis"]
for str in strings
    println("\"$str\": ", is_palindrome(str))
end

"racecar": true
"Sit on a Potato Pan, Otis": false
"sitonapotatopanotis": true


To practice recursion, string indexing, and substrings, we can also write the following recursive version of the function:

In [20]:
function is_palindrome_recursive(str)
    if length(str) ≤ 1
        return true
    elseif str[1] == str[end]
        return is_palindrome_recursive(str[2:end-1])
    else
        return false
    end
end

for str in strings
    println("\"$str\": ", is_palindrome_recursive(str))
end

"racecar": true
"Sit on a Potato Pan, Otis": false
"sitonapotatopanotis": true

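The second test string fails both checks only because of its capital letters, spaces, and comma. A possible variant, sketched here under the assumption that only letters and digits should count, normalizes the string before comparing. The name `is_palindrome_normalized` is made up for this example:

```julia
function is_palindrome_normalized(str)
    # Keep only letters and digits, and ignore case
    cleaned = lowercase(filter(c -> isletter(c) || isdigit(c), str))
    return cleaned == reverse(cleaned)
end

println(is_palindrome_normalized("Sit on a Potato Pan, Otis"))   # true
println(is_palindrome_normalized("Hello, World!"))               # false
```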

### String functions

Julia defines a number of functions that operate on strings. For example,
`lowercase` and `uppercase` convert all letters to lower- or upper-case:

In [21]:
println(uppercase("julia123 ") * lowercase("LOWER"))

JULIA123 lower


The `titlecase` function capitalizes the first character of each word in a string:

In [22]:
str = "SEARCHING FOR CHARACTERS AND SUBSTRINGS"
titlecase(str)

"Searching For Characters And Substrings"

### Searching for characters and substrings

The ∈ operator (typed as `\in` followed by Tab) returns `true` if a character appears in a string:

In [23]:
'!' ∈ str

false

The function `findfirst(pattern, str)` returns the indices of the first occurrence of `pattern` in `str`:

In [24]:
str = "Hello, World! These are my words."
pattern = "wor"
idx1 = findfirst(pattern, lowercase(str))

8:10

The function `findnext(pattern, str, start)` returns the indices of the next occurrence of `pattern` in `str`, searching from index `start` onward:

In [25]:
idx2 = findnext(pattern, lowercase(str), idx1[end])

28:30

Similarly, `findlast` and `findprev` search backward, starting from the end of the string and from a given index, respectively.
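
Combining `findfirst` and `findnext` in a loop gives all occurrences of a pattern. The helper below is a minimal sketch, and the name `find_all` is made up for this example:

```julia
function find_all(pattern, str)
    ranges = UnitRange{Int}[]
    r = findfirst(pattern, str)
    while r !== nothing
        push!(ranges, r)
        # Resume one character past the start of the previous match,
        # so that overlapping matches are found too
        r = findnext(pattern, str, nextind(str, first(r)))
    end
    return ranges
end

matches = find_all("wor", lowercase("Hello, World! These are my words."))
println(matches)
```

For the example string this returns the two ranges found earlier, `8:10` and `28:30`.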

### Replacing substrings

The function `replace(str, pattern=>repl)` searches the string `str` for all occurrences of the substring `pattern`, and replaces them with the string `repl`:

In [26]:
println(str)
println(replace(str, ", World"=>" there"))

Hello, World! These are my words.
Hello there! These are my words.
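
If only some of the occurrences should be replaced, `replace` accepts a `count` keyword argument that limits the number of replacements. A short sketch with a made-up sample string:

```julia
s = "one fish, two fish, red fish"

all_replaced = replace(s, "fish" => "cat")               # every occurrence
first_replaced = replace(s, "fish" => "cat"; count = 1)  # only the first one

println(all_replaced)      # one cat, two cat, red cat
println(first_replaced)    # one cat, two fish, red fish
```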


## File processing

Files are commonly used by programs, for example to save computed data or to read tables and external data files. Julia has extensive support for working with files, but here we will focus on only the basic functionality.

### Reading files

First we consider the most basic way to read a text file. We create an example text file named `test_file.txt` (e.g. using the Jupyter notebook or an editor), containing some lines of text.

The code below shows how to read each line of this file into a string, which can then be further processed in Julia (here it simply displays each line as a Julia string).

- The function `f = open(filename)` returns a so-called stream `f` for accessing the data in the file `filename`. It throws an error if the operation cannot be completed, for example if the file does not exist.

- The function `eof(f)` (end-of-file) returns `true` if the stream `f` has reached the end of the file.

- The function `readline(f)` returns a string containing the next line in the stream `f`.

- The function `close(f)` closes the stream `f`.

In [45]:
f = open("test_file.txt")
while !eof(f)
    str = readline(f)
    display(str)
end
close(f)

"This is a test file"

""

""

"This is line #4"

"Here are some comma-separated numbers:"

""

"1,2,3,4,5"

"5,-4,3e3,2.0,1"

The function `eachline` lets you do this in an easier way, and it also accepts a filename instead of a stream:

In [46]:
for line in eachline("test_file.txt")
    display(line)
end

"This is a test file"

""

""

"This is line #4"

"Here are some comma-separated numbers:"

""

"1,2,3,4,5"

"5,-4,3e3,2.0,1"

You can also read the entire file into a single Julia string with the `read` function:

In [47]:
str = read("test_file.txt", String)



Alternatively, you can read the entire file into an array, with each line an element:

In [48]:
lines = readlines("test_file.txt")

8-element Array{String,1}:
 "This is a test file"                   
 ""                                      
 ""                                      
 "This is line #4"                       
 "Here are some comma-separated numbers:"
 ""                                      
 "1,2,3,4,5"                             
 "5,-4,3e3,2.0,1"                        

You can then access these strings using the usual array syntax, or loop over all of them:

In [51]:
println("Line #2 says: ", lines[2])
println()
println("Here are all the lines which have between 1 and 18 characters:\n")
for line in lines
    if 1 ≤ length(line) ≤ 18
        println(line)
    end
end


Line #2 says: 

Here are all the lines which have between 1 and 18 characters:

This is line #4
1,2,3,4,5
5,-4,3e3,2.0,1
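
The last two lines of the file hold comma-separated numbers. One way to turn such a line into numeric data is to `split` it at the commas and `parse` each piece. In this sketch the line is copied in as a literal so that the example does not depend on the file:

```julia
line = "5,-4,3e3,2.0,1"

# Split at the commas, then parse each piece as a floating-point number
numbers = parse.(Float64, split(line, ','))

println(numbers)         # [5.0, -4.0, 3000.0, 2.0, 1.0]
println(sum(numbers))    # 3004.0
```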


### Writing files

The syntax for writing files is similar. The basic usage is demonstrated below:

In [1]:
f = open("created_data.txt", "w")
for i = 1:5
    # Create random strings of letters
    str = String(rand('a':'z', 50))
    write(f, str * "\n")  # Write string to stream f
end
println(f) # println can be used with streams too

# Print Fibonacci numbers to file
x = y = 1
print(f, "$x $y")
for i = 1:50
    z = x + y
    x = y
    y = z
    print(f, " $z")
end
println(f)
close(f)

In [2]:
# Read file and print each line
for line in eachline("created_data.txt")
    println(line)
end

exnnprbcqjrfrnbmvvrmhvriaszrgsjhelfjhaalbrzisflwjo
ftlwoyaqcjjfyapvnwfmvgoeqbnbsksurvdiowgevyfbkyogre
ibynlzozlucbwxsrsqlpiuyphmtofwjocapokgtjsqqttdpdho
equeokbwcskkrpxkqlcqvmxpnsombcdzvhyapdyvmuvwyakjte
bjavkkoajdpufdxfsdxoikoivvkstcpcqtrkddzwuivlxrajsp

1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269 2178309 3524578 5702887 9227465 14930352 24157817 39088169 63245986 102334155 165580141 267914296 433494437 701408733 1134903170 1836311903 2971215073 4807526976 7778742049 12586269025 20365011074 32951280099
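
A common alternative to calling `close` explicitly is the `do`-block form of `open`, which closes the stream automatically when the block ends, even if an error occurs inside it. The sketch below also uses the mode `"a"` to append to an existing file instead of overwriting it. The file name `log.txt` and the temporary directory are placeholders for this example:

```julia
# Hypothetical file name inside a fresh temporary directory
path = joinpath(mktempdir(), "log.txt")

# "w" creates the file (or overwrites an existing one)
open(path, "w") do f
    println(f, "first run")
end

# "a" appends to the end of the file
open(path, "a") do f
    println(f, "second run")
end

println(readlines(path))   # ["first run", "second run"]
```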


### Delimited files

The `DelimitedFiles` package contains two convenient functions for reading and writing arrays of data:

- `writedlm(filename, A, delim)` writes the array `A` to file `filename`, using the character or string `delim` between each element in a row.

- `readdlm(filename, delim, T)` reads an array from a file in a similar way, with the (optional) element type `T`.

The code below demonstrates these functions.

In [4]:
using DelimitedFiles

# Write file
A = randn(8,5)    # Sample data
writedlm("created_data.txt", A, ',')

# Print file
for line in eachline("created_data.txt")
    println(line)
end

# Read into array
B = readdlm("created_data.txt", ',')

isequal(A,B) # Check identical

-0.5628407393427325,0.7903773061352202,-0.7251732784171993,-0.2596272939905203,0.22524686648984477
0.7350481059306219,-0.8993655849676568,1.188392275040622,-0.5694960066387179,-0.721587315105174
0.2877730925262572,1.302636355561968,-0.6459662343177389,1.2246343747135995,-0.4372129343284092
0.18456034955491143,0.058343571746209355,-1.6224875075195648,0.7692646181334429,0.3229791013057399
-0.24957916183810938,-0.3384041890378802,1.179092009314686,0.11492069277896787,-0.7523759589123526
-1.306671617283084,0.4635291577496365,1.1010846621499475,-0.10919336170875511,0.235584424447275
3.0252863020753638,0.12603203960996734,1.3354966257194938,1.689220797466521,-2.763479637679656
1.6995436218532023,0.6867494126029525,0.5092786188084852,2.211848999929766,1.370143707284061


true
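
Without the element type, purely numeric data is read back as floating-point numbers. The sketch below, using a made-up integer matrix and a temporary file, shows how passing `Int` as the third argument requests integer elements instead:

```julia
using DelimitedFiles

# Hypothetical file name inside a fresh temporary directory
path = joinpath(mktempdir(), "ints.csv")

M = [1 2 3; 4 5 6]
writedlm(path, M, ',')

M_float = readdlm(path, ',')       # element type defaults to Float64 for numeric data
M_int = readdlm(path, ',', Int)    # request integer elements

println(eltype(M_float))           # Float64
println(M_int == M)                # true
```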

### Example: Coded triangle numbers

Project Euler, problem 42:

> The n<sup>th</sup> term of the sequence of triangle numbers is given by, $t_n = n(n+1)/2$; so the first
> ten triangle numbers are:
> 
>     1, 3, 6, 10, 15, 21, 28, 36, 45, 55, ...
>
> By converting each letter in a word to a number corresponding to its alphabetical position and adding
> these values we form a word value. For example, the word value for SKY is $19 + 11 + 25 = 55 = t_{10}$. If
> the word value is a triangle number then we shall call the word a triangle word.
>
> Using `p042_words.txt` (right click and 'Save Link/Target As...'), a 16K text file containing nearly
> two-thousand common English words, how many are triangle words?
    
    

In [44]:
function word_value(word)
    return sum(collect(word) .- 'A' .+ 1)
end
word_value("SKY")

55

In [45]:
trinums = [n*(n+1)÷2 for n = 1:50]
words = readdlm("p042_words.txt", ',', String)
nbrtriwords = count([word_value(word) ∈ trinums for word in words])
println("There are $nbrtriwords triangle words in the list")

There are 162 triangle words in the list
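
The precomputed list `trinums` assumes that no word value exceeds $t_{50} = 1275$. A hedged alternative that avoids this assumption tests a value directly: since $8 t_n + 1 = (2n+1)^2$, a positive integer $v$ is a triangle number exactly when $8v + 1$ is a perfect square. The function name `is_triangle` is made up for this sketch:

```julia
function is_triangle(v::Integer)
    v > 0 || return false          # triangle numbers are positive
    s = isqrt(8v + 1)              # integer square root
    return s * s == 8v + 1
end

println(is_triangle(55))                        # true (word value of SKY)
println(is_triangle(56))                        # false
println([v for v in 1:60 if is_triangle(v)])    # [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
```

With this function, the membership test `word_value(word) ∈ trinums` could be replaced by `is_triangle(word_value(word))`.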
