<font size="6"><b>REGULAR EXPRESSIONS (REGEX)</b></font>

In [None]:
library(data.table)
library(tidyverse)
library(htm2txt)
library(pdftools)
library(textreadr)
library(magrittr)

In [None]:
options(repr.matrix.max.rows=100, repr.matrix.max.cols=40) # limit the number of rows and columns shown when tables are printed

![xkcd](../imagesbb/regular_expressions.png)

(https://xkcd.com/208/)

According to [Regular-Expressions.info](https://www.regular-expressions.info/):

>A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids.

While the mathematician Stephen Cole Kleene was the first to introduce the concept ([Regular expression](https://en.wikipedia.org/wiki/Regular_expression)), it was again the great Ken Thompson who designed the first implementation and popularized regex, first through the g/re/p command of the text editor ed and later as the grep command in the UNIX shell.

Let's hear the story from Thompson himself, as interviewed by the great Brian Kernighan:

[![Ken Thompson interviewed by Brian Kernighan at VCF East 2019](https://img.youtube.com/vi/EY6q5dv_B-o/0.jpg)](https://youtu.be/EY6q5dv_B-o?t=2115)

> So I had grep squirreled away and I'd use it for everything. Again Doug McIlroy, my department head, came in and said
> - You know, it would be really good if we could look for things in files and do this nice
> - Oh, let me think, I'll think about it overnight
>
> So the overnight think was basically getting rid of bugs and things I'd meant to do that I hadn't done and you know an hour work maybe at most and next day I presented him with grep and he was "that's exactly what I wanted".

# Basic functions from stringr

What do you think is the greatest rock song ever?

My answer is:


[![Shine On You Crazy Diamond](https://img.youtube.com/vi/54W8kktFE_o/0.jpg)](https://www.youtube.com/watch?v=54W8kktFE_o)

Pink Floyd wrote it for their former bandmate Syd Barrett, who had mental health problems, hence S(hine) on Y(ou) crazy D(iamond).

After years of seclusion with no contact, Syd paid a surprise visit to the recording session of the song, a real story:

> Roger was there, and he was sitting at the desk, and I came in and I saw this guy sitting behind him – huge, bald, fat guy. I thought, "He looks a bit... strange..." Anyway, so I sat down with Roger at the desk and we worked for about ten minutes, and this guy kept on getting up and brushing his teeth and then sitting – doing really weird things, but keeping quiet. And I said to Roger, "Who is he?" and Roger said "I don't know." And I said "Well, I assumed he was a friend of yours," and he said "No, I don't know who he is." Anyway, it took me a long time, and then suddenly I realised it was Syd, after maybe 45 minutes. He came in as we were doing the vocals for "Shine On You Crazy Diamond", which was basically about Syd. He just, for some incredible reason picked the very day that we were doing a song which was about him. And we hadn't seen him, I don't think, for two years before. That's what's so incredibly... weird about this guy. And a bit disturbing, as well, I mean, particularly when you see a guy, that you don't, you couldn't recognise him. And then, for him to pick the very day we want to start putting vocals on, which is a song about him. Very strange.

Here he is captured during that real visit:

![syd](https://upload.wikimedia.org/wikipedia/en/b/b6/Syd_Barrett_Abbey_Road_1975.jpg)

In [None]:
shine <- readLines("~/databb/text/shine")

In [None]:
shine %>% paste(collapse = "\n") %>% cat

Select the whole lines matching the pattern:

In [None]:
shine %>% str_subset("Shine")

Extract the parts that match the pattern

Only the first match as a vector:

In [None]:
shine %>% str_extract("Shine on you crazy diamond")

Or all matches as a list:

In [None]:
shine %>% str_extract_all("you")

Replace the first match with something else:

In [None]:
shine %>% str_replace("you", "YO")

Replace all matches with something else:

In [None]:
shine %>% str_replace_all("you", "YO")

Check whether each line matches the pattern:

In [None]:
shine %>% str_detect("Shine")

Get the indices of matching lines:

In [None]:
shine %>% str_which("Shine")

# Fixed "irregular" patterns

Let's first create some gibberish like the one spoken by Evan, played by Steve Carell in Bruce Almighty:


[![Bruce Almighty - Evan's Gibberish](https://img.youtube.com/vi/FiEw1jcLztA/0.jpg)](https://www.youtube.com/watch?v=FiEw1jcLztA)

We first draw a length for each word, then sample the characters of each word and a repetition count for each character, which gives us a richer set of patterns to query with regex.

Then we combine these characters and repetitions into words, and the words into sentences:

In [None]:
nwords <- 100

In [None]:
set.seed(20240221)

We use the double Poisson distribution for the word lengths and for the number of repetitions of each character inside a word:

In [None]:
wlength <- (1 + gamlss.dist::rDPO(nwords, 4, 0.5))

In [None]:
wlength

In [None]:
wlength %>% hist

In [None]:
reps <- lapply(wlength, function(x) (1 + gamlss.dist::rDPO(x, 1, 1.5)))

In [None]:
reps %>% head

In [None]:
reps %>% unlist %>% hist

Sample letters:

In [None]:
lets <- lapply(wlength, function(x) sample(c(LETTERS, letters, rep(0:9, 3), "."), x, replace = T))

Repeat those letters by first creating a function for that:

In [None]:
repv <- function(x, y) mapply(rep, x, y, SIMPLIFY = F) %>% unlist %>% unname
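
To see what `repv` does, here is a minimal sketch on a tiny made-up input. The vectors `chars` and `times` are illustrative only and are not part of the gibberish generated above:

```r
library(magrittr)

# Same definition as above, repeated so that the sketch runs on its own
repv <- function(x, y) mapply(rep, x, y, SIMPLIFY = FALSE) %>% unlist %>% unname

chars <- c("a", "7", ".")  # illustrative characters
times <- c(2, 1, 3)        # illustrative repetition counts, one per character

# Each character is repeated by its own count, then everything is flattened
repeated <- repv(chars, times)
print(repeated)

# Pasting the repeated characters together gives one "word"
word <- paste(repeated, collapse = "")
print(word)
```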

Paste into words:

In [None]:
words <- mapply(repv, lets, reps, SIMPLIFY = F) %>% lapply(paste, collapse = "")

Paste into sentences:

In [None]:
nsentence <- 5

In [None]:
sentences <- words %>% split(rep(1:nsentence, each = nwords/nsentence)) %>% lapply(paste, collapse = " ") %>% unlist

Now we have our gibberish:

In [None]:
sentences

And we can start matching fixed patterns:

In [None]:
sentences %>% str_subset("a") %>% str_extract_all("a")

In [None]:
sentences %>% str_subset("aa") %>% str_extract_all("aa")

In [None]:
sentences %>% str_subset("bbb") %>% str_extract_all("bbb")

In [None]:
sentences %>% str_subset("01") %>% str_extract_all("01")

In [None]:
sentences %>% str_subset("34") %>% str_extract_all("34")

In [None]:
sentences %>% str_subset("42") %>% str_extract_all("42")

Fixed patterns are not so useful on their own.

# Building blocks of regex

## Quantifiers

### Exact repetitions

Two times:

In [None]:
sentences %>% unlist %>% str_extract_all("a{2}")

Three times:

In [None]:
sentences %>% unlist %>% str_extract_all("b{3}")

Two to three times:

In [None]:
sentences %>% unlist %>% str_extract_all("c{2,3}")

Two or more times:

In [None]:
sentences %>% unlist %>% str_extract_all("7{2,}")

One to four times:

In [None]:
sentences %>% unlist %>% str_extract_all("d{1,4}")

### \+: One or more

In [None]:
sentences %>% unlist %>% str_extract_all("e+")

In [None]:
sentences %>% unlist %>% str_extract_all("f+d")

In [None]:
sentences %>% unlist %>% str_extract_all("2+3")

### \*: Zero or more

In [None]:
sentences %>% unlist %>% str_extract_all("2*3")

### ?: Zero or one

In [None]:
sentences %>% unlist %>% str_extract_all("2?3")
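
Since the gibberish is random, it may help to see the quantifiers side by side on short strings where the matches are easy to check by eye. This is a small sketch; the strings `x` and `y` are made up for the purpose:

```r
library(stringr)

x <- "a aa aaa aaaa"
exact_two    <- str_extract_all(x, "a{2}")[[1]]    # exactly two
two_to_three <- str_extract_all(x, "a{2,3}")[[1]]  # two to three, as many as possible
two_or_more  <- str_extract_all(x, "a{2,}")[[1]]   # two or more

y <- "3 23 223"
one_or_more  <- str_extract_all(y, "2+3")[[1]]  # at least one 2 before the 3
zero_or_more <- str_extract_all(y, "2*3")[[1]]  # any number of 2s, including none
zero_or_one  <- str_extract_all(y, "2?3")[[1]]  # at most one 2 before the 3

print(list(exact_two, two_to_three, two_or_more))
print(list(one_or_more, zero_or_more, zero_or_one))
```

Note that `a{2}` finds two separate matches inside `aaaa`, because matching continues right after the previous match.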

## Capture group

Parentheses group a sub-pattern so that a quantifier applies to the whole group, just like in mathematics:

In [None]:
"aaabb abbb aab bbbaa aa bbbb ababab bab bb" %>% str_extract_all("(a*b+)+")

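The effect of the parentheses is easiest to see by comparing the same pattern with and without them. A minimal sketch on a made-up string:

```r
library(stringr)

x <- "ababab abbb"

# Without parentheses, + applies only to the b right before it
without_group <- str_extract_all(x, "ab+")[[1]]

# With parentheses, + applies to the whole group ab
with_group <- str_extract_all(x, "(ab)+")[[1]]

print(without_group)
print(with_group)
```
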
## Qualifiers

### \.: Any character

In [None]:
sentences %>% unlist %>% str_extract_all("a.*b")

### \*?, \+?: lazy

This is greedy, matches the longest pattern:

In [None]:
"abeaxbbcccb" %>% str_extract_all("a.*b")

This is lazy, matches the shortest pattern:

In [None]:
"abeaxbbcccb" %>% str_extract_all("a.*?b")

Greedy:

In [None]:
"abeaxbbcccb" %>% str_extract_all("a.+b")

Lazy:

In [None]:
"axbbcccb" %>% str_extract_all("a.+?b")

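A typical case where the difference matters is pulling out delimited pieces such as tags. The sketch below uses a made-up HTML-like string:

```r
library(stringr)

x <- "<b>bold</b> and <i>italic</i>"

# Greedy: runs from the first < to the very last >
greedy <- str_extract_all(x, "<.+>")[[1]]

# Lazy: stops at the first > that completes a match
lazy <- str_extract_all(x, "<.+?>")[[1]]

print(greedy)
print(lazy)
```
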
### \\w: alphanumeric (and underscore)

In [None]:
"abc123.,?_-[{vvvv" %>% str_extract_all("\\w+")

### \\d: digits

In [None]:
"abc123.,?_-[{vvvv3434aa-bb241" %>% str_extract_all("\\d+")

### \\s: whitespace

In [None]:
"a  b c     d  e f    g" %>% str_replace_all("\\s+", "")

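The three shorthand classes can be compared on one string. This is a small sketch with a made-up input that contains an underscore and a tab, to show that `\\w` also matches the underscore and that `\\s` covers more than plain spaces:

```r
library(stringr)

x <- "id_42:  total = 3.5\tkg"

word_runs  <- str_extract_all(x, "\\w+")[[1]]  # letters, digits and underscore
digit_runs <- str_extract_all(x, "\\d+")[[1]]  # digits only

# Collapse every run of whitespace (spaces, tabs, newlines) into one space
squeezed <- str_replace_all(x, "\\s+", " ")

print(word_runs)
print(digit_runs)
print(squeezed)
```
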
## Character set

Match only the characters inside the brackets:

In [None]:
sentences %>% unlist %>% str_extract_all("[abcdef12345]+")

### [a-z]: letters

In [None]:
sentences %>% unlist %>% str_extract_all("[a-z]+")

### [A-Z]: LETTERS

In [None]:
sentences %>% unlist %>% str_extract_all("[A-Z]+")

### [0-9]: digits again

In [None]:
sentences %>% unlist %>% str_extract_all("[0-9]+")

### [^]: Exclude

Do not match characters inside the brackets:

In [None]:
sentences %>% unlist %>% str_extract_all("[^abcdef12345]+")

## Others

### ^: Start anchor

Beginning of line

In [None]:
words %>% str_subset("^1")

### $: End anchor

End of line

In [None]:
words %>% str_subset("a$")

### \\b: Word boundary

Beginning or end of words:

In [None]:
"asdasda ccccc dasdasda
ffgggg" %>% str_extract_all("\\b\\w{2}")

In [None]:
"asdasda ccccc dasdasda
ffgggg" %>% str_extract_all("\\w{2}\\b")
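
Anchors and word boundaries match positions, not characters. A minimal sketch on made-up inputs shows what they filter out:

```r
library(stringr)

w <- c("1abc", "abc1", "a1a")
starts_with_1 <- str_subset(w, "^1")  # 1 must be the first character
ends_with_1   <- str_subset(w, "1$")  # 1 must be the last character

x <- "cat concat cats"
anywhere   <- str_extract_all(x, "cat")[[1]]         # matches inside longer words too
whole_word <- str_extract_all(x, "\\bcat\\b")[[1]]   # only the standalone word

print(starts_with_1)
print(ends_with_1)
print(length(anywhere))
print(whole_word)
```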

### \\: Escape literals

If we want to treat a special character in regex as a literal:

In [None]:
"aa[bbd..ccc{}}" %>% str_extract_all("[\\.\\[\\{\\}]+")

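Forgetting to escape is a common source of silent false positives. The sketch below, on two made-up strings, compares an unescaped dot, an escaped dot, and stringr's `fixed()` helper, which treats the whole pattern as a literal:

```r
library(stringr)

x <- c("3.14", "3x14")

# Unescaped: the dot matches any character, so both strings match
unescaped <- str_detect(x, "3.14")

# Escaped: the dot must be a literal dot
escaped <- str_detect(x, "3\\.14")

# fixed(): no regex interpretation at all
literal <- str_detect(x, fixed("3.14"))

print(unescaped)
print(escaped)
print(literal)
```
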
### |: Alternation/Or

In [None]:
sentences %>% unlist %>% str_extract_all("a{3,}|b{1,4}")

In [None]:
sentences %>% unlist %>% str_extract_all("(a|b){2,4}")

## backreference

A capture group can be referred to like a variable, with \\1, \\2 and so on:

In [None]:
"asda sdfsdfs gs fb asdasda bjk gs" %>% str_replace_all("(gs)", "\\1 HURRAH")

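Backreferences work both in the replacement and inside the pattern itself. A small sketch with made-up strings: the first call swaps two captured words, the second finds doubled characters:

```r
library(stringr)

# In the replacement: swap the two captured words
swapped <- str_replace("Kernighan, Brian", "(\\w+), (\\w+)", "\\2 \\1")

# In the pattern: \\1 must repeat exactly what the first group matched
doubled <- str_extract_all("aa bb ab cc", "(\\w)\\1")[[1]]

print(swapped)
print(doubled)
```
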
## case insensitive

In [None]:
sentences %>% unlist %>% str_extract_all("a+")

In [None]:
sentences %>% unlist %>% str_extract_all("A+")

In [None]:
sentences %>% unlist %>% str_extract_all("(?i)[a-z]{1,6}")
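
Besides the inline `(?i)` flag, stringr offers the `regex()` wrapper with an `ignore_case` argument. A minimal sketch on a made-up string, showing that both give the same matches:

```r
library(stringr)

x <- "Shine shine SHINE"

inline_flag <- str_extract_all(x, "(?i)shine")[[1]]
with_regex  <- str_extract_all(x, regex("shine", ignore_case = TRUE))[[1]]

print(inline_flag)
print(with_regex)
```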

## non-capturing group

In [None]:
"aaaaaa bcgsd gs ee" %>% str_extract_all("(?:bc)(gs)")

In [None]:
"aaaaaa bcgsd gs ee" %>% str_replace_all("(bc)(gs)", "\\1 HURRAH")

Here \\1 no longer refers to (bc): because of ?: that group is not captured, so \\1 refers to (gs) instead:

In [None]:
"aaaaaa bcgsd gs ee" %>% str_replace_all("(?:bc)(gs)", "\\1 HURRAH")

In [None]:
"aaaaaa bcgsd gs ee" %>% str_replace_all("(bc)(gs)", "\\1\\2 HURRAH")

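One way to check which groups are captured is `str_match()`, which returns the full match followed by one column per capture group. A small sketch on a made-up string:

```r
library(stringr)

# Two capturing groups: full match plus two group columns
both_captured <- str_match("bcgs", "(bc)(gs)")

# (?:bc) is only grouped, not captured: full match plus one group column
one_captured <- str_match("bcgs", "(?:bc)(gs)")

print(both_captured)
print(one_captured)
```
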
## lookahead / lookbehind

### lookahead

#### positive lookahead

Get the pattern followed by something:

In [None]:
"form field 1: blah blah blah end
form field 2: bluh bluh bluh
bloh bloh bloh
bleh bleh bleh end" %>% strsplit("\\n") %>% unlist %>% str_extract_all("(bl\\wh ?)+(?=end)")

#### negative lookahead

Get the pattern NOT followed by something:

In [None]:
"form field 1: blah blah blah end
form field 2: bluh bluh bluh
bloh bloh bloh
bleh bleh bleh end" %>% strsplit("\\n") %>% unlist %>% str_extract_all("(bl\\wh ?)+(?!end)$")

### lookbehind

#### positive lookbehind

Get the pattern following something:

In [None]:
"form field 1: blah blah blah end
form field 2: bluh bluh bluh
bloh bloh bloh
bleh bleh bleh end" %>% strsplit("\\n") %>% unlist %>% str_extract_all("(?<=^form field \\d: )(bl\\wh ?)+")

#### negative lookbehind

Get the pattern NOT following something:

In [None]:
"form field 1: blah blah blah end
form field 2: bluh bluh bluh
bloh bloh bloh
bleh bleh bleh end" %>% strsplit("\\n") %>% unlist %>% str_extract_all("^(?<!form field \\d: )(bl\\wh ?)+")
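
The four lookarounds can also be compared on shorter made-up strings. In this sketch the `\\b` in the negative cases is an addition of ours: without it the engine may backtrack and return a partial number (for example `10` out of `100`), which is a common pitfall with negative lookarounds:

```r
library(stringr)

prices <- "100 USD, 250 EUR, 75 USD"

# Positive lookahead: numbers followed by " USD"
usd <- str_extract_all(prices, "\\d+(?= USD)")[[1]]

# Negative lookahead: whole numbers NOT followed by " USD"
not_usd <- str_extract_all(prices, "\\d+\\b(?! USD)")[[1]]

pairs <- "x=100 y=250"

# Positive lookbehind: numbers preceded by "x="
after_x <- str_extract_all(pairs, "(?<=x=)\\d+")[[1]]

# Negative lookbehind: whole numbers NOT preceded by "x="
not_after_x <- str_extract_all(pairs, "(?<!x=)\\b\\d+")[[1]]

print(usd)
print(not_usd)
print(after_x)
print(not_after_x)
```

In all four cases the text inside the lookaround is checked but not included in the match.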

## Parse numbers the easy way

We can use regex as we know:

In [None]:
"TRY 100,000.15" %>% str_replace_all("[^0-9\\.]+", "") %>% as.numeric

In [None]:
"100.000,15 TL" %>% str_replace_all("[^0-9,]+", "") %>% str_replace(",", "\\.") %>% as.numeric

Or we can use some helper/wrapper functions from readr package for simpler cases:

In [None]:
"TRY 100,000.15" %>% parse_number()

In [None]:
"TRY 100,000.15" %>% parse_number(locale = locale(decimal_mark = ".", grouping_mark = ","))

In [None]:
"100.000,15 TL" %>% parse_number(locale = locale(decimal_mark = ",", grouping_mark = "."))

Are we ready for some applications?

# Applications

## Universities and Provinces

Let's get text from the YKS report by OSYM:

https://dokuman.osym.gov.tr/pdfdokuman/2021/GENEL/yksdegrapor24122021.pdf

In [None]:
yksreport <- read_pdf("~/databb/pdf/yksdegrapor24122021.pdf")

In [None]:
setDT(yksreport)

Get the pages for the table of universities and split from newlines:

In [None]:
pages <- yksreport[page_id %between% c(103, 110), text]  %>% strsplit("\n") %>% unlist

In [None]:
pages %>% head(20)

Exclude the lines that contain only digits, dots, dashes and whitespace.

From the remaining lines, delete runs of five or more digits, dots, dashes and whitespace characters to clear up the mess:

In [None]:
list1 <- pages %>% tail(-12) %>% head(-2) %>% str_subset("^[\\-\\. \\d]+$", negate = T) %>% str_replace_all("[\\d\\-\\. ]{5,}", "")

In [None]:
listd1 <- data.table(list1)

In [None]:
listd1

Some sequences of lines correspond to the table headers and footers on each page; we want to exclude them.

Figure out what we do here:

In [None]:
listd1[, delx1 := str_detect(list1, "^Devam ediyor\\.$")]

In [None]:
listd1[, delx2 := -str_detect(list1, "^Üstü$")]

In [None]:
listd1[, delx2 := lag(delx2, 1)]

In [None]:
listd1[, delx2 := replace_na(delx2, 0)]

In [None]:
listd1[, delx3 := cumsum(delx1 + delx2)]

In [None]:
listd1

In [None]:
listd1 <- listd1[delx3 != 1]
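
The idea behind these steps can be sketched on a toy vector: mark where a block to drop starts with +1 and where it ends with -1, then take the cumulative sum so that every line inside the block gets the value 1. The marker texts `PAGE END` and `HEADER END` are hypothetical stand-ins for the real markers, and the shift is done in base R here rather than with `lag`:

```r
# Made-up lines: two marker lines enclose a footer/header block to drop
lines <- c("row A", "PAGE END", "footer", "header", "HEADER END", "row B", "row C")

start <- as.integer(lines == "PAGE END")     # +1 where the block starts
end   <- -as.integer(lines == "HEADER END")  # -1 where the block ends
end   <- c(0L, head(end, -1))                # shift by one so the end marker is dropped too

inside <- cumsum(start + end)                # 1 for every line inside the block
kept   <- lines[inside != 1]

print(inside)
print(kept)
```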

And delete empty lines that have nothing in between the start and end:

In [None]:
listd1 <- listd1[!str_detect(list1, "^$")]

In [None]:
#listd1[, str_extract(list1, "\\b[^\\s\\)]+$") %>% unique]

In pdfs, the text in cells can be split into multiple lines in an awkward way.

We have to paste them together. Figure out what we do here:

In [None]:
listd1[, cont := !str_detect(list1, "^\\(|^Üni|^ve|^Yüksekokulu") & !str_detect(lag(list1, 1), "( ve *|-|Uluslararası|Sosyal|Teknoloji|bosna|Toplum)$")]

In [None]:
#listd1[, cont := !str_detect(list1, "^\\(|^Üni|^ve|^Yüksekokulu") & str_detect(lag(list1, 1), "(Üniversitesi|Yüksekokulu|Enstitüsü|Cerrahpaşa)$")]

In [None]:
listd1[, cont := replace_na(cont, T)]

In [None]:
listd1[, cont2 := cumsum(cont)]

In [None]:
listd1

Combine rows that correspond to the same entity:

In [None]:
listd2 <- listd1[, .(fulln = paste(list1, collapse = " ")), by = cont2]
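
The same flag-and-cumsum idea merges split cells. Below is a minimal sketch on made-up names, with a much simpler rule than the one above: a line starts a new entity unless it begins with an opening parenthesis or with the word "University". The column names `piece`, `starts_new` and `entity` are illustrative only:

```r
library(data.table)
library(stringr)

# Made-up lines, where some names are split over two lines
dt <- data.table(piece = c("Alpha University", "(Ankara)",
                           "Beta Technical", "University",
                           "Gamma University"))

# TRUE where a line starts a new entity, FALSE where it continues the previous one
dt[, starts_new := !str_detect(piece, "^\\(|^University")]

# The cumulative sum gives every line of the same entity the same id
dt[, entity := cumsum(starts_new)]

merged <- dt[, .(full = paste(piece, collapse = " ")), by = entity]
print(merged)
```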

Better to standardize text, by converting to lower case and ascii:

In [None]:
listd2[, fulln2 := iconv(tolower(fulln), from = "utf-8", to = "ascii//TRANSLIT")]

Strip the place names in parentheses from the university names:

In [None]:
listd2[, fulln3 := str_replace(fulln2, " \\(.*?\\)$", "")]

And keep those places in another field:

In [None]:
listd2[, loc1 := str_extract(fulln2, "\\(.*?\\)$")]

In [None]:
listd2[, loc1 := str_replace_all(loc1, "[\\(\\)]", "")]

Convert empty strings to NA in locations:

In [None]:
listd2[nchar(loc1) == 0, loc1 := na_if(loc1, "")]

When the location is not stated in parentheses, it is the first word of the university name:

In [None]:
listd2[is.na(loc1), loc1 := str_extract(fulln3, "^\\w+\\b")]

Some locations are abroad with a dash, exclude them and get the unique values for provinces:

In [None]:
provs1 <- listd2[, unique(loc1) %>% sort] %>% str_subset("-", negate = T)

In [None]:
provs1

In [None]:
#listd2[, str_extract(fulln3, "\\w+$")] %>% unique

In [None]:
listd2 %>% DT::datatable(filter = "top")

Now let's compare with province names in Turkey from the wiki page at:

https://en.wikipedia.org/wiki/Provinces_of_Turkey

In [None]:
prov <- htm2txt::gettxt("~/databb/html/Provinces of Turkey - Wikipedia.html")

Delete the non-printable byte sequences that cause problems and split into lines at newlines:

In [None]:
prov2 <- prov %>% iconv("UTF-8", "UTF-8", sub='') %>% strsplit(split = "\\n") %>% unlist

Get the lines between two identified delimiters and get the beginning of the lines that start with two digits and alphanumeric characters until the word boundary:

In [None]:
provnames <- prov2[str_which(prov2, "Provinces of the Republic of Turkey"):max(str_which(prov2, "Codes"))] %>%
str_extract("(?<=^\\d{2} )\\w+\\b") %>% na.omit

Standardize again:

In [None]:
provnames <- iconv(tolower(provnames), from = "utf-8", to = "ascii//TRANSLIT")

Now compare the provinces of the universities with the full province list.

Every province has at least one university:

In [None]:
setdiff(provnames, provs1)

Gebze is not a province:

In [None]:
setdiff(provs1, provnames)

Let's change it to Kocaeli:

In [None]:
listd2[loc1 %in% setdiff(provs1, provnames), loc1 := "kocaeli"]

And let's get the contingency table for the number of universities per province:

In [None]:
listd2[str_detect(loc1, "-", negate = T), .N, by = loc1][order(-N)]

## HSS course codes

Let's extract the codes of HSS courses from:

https://bogazici.edu.tr/Assets/Documents/hss_course_list___2023_fall_term.pdf 

In [None]:
hsslist <- read_pdf("~/databb/pdf/hss_course_list___2023_fall_term.pdf")

In [None]:
hsslistv <- hsslist$text %>% strsplit("\n") %>% unlist

In [None]:
hsslistv %>% head

Let's subset those rows where a code pattern of 2-4 letters, an optional space, 2-3 digits and an optional last letter is within some margin of letters from the beginning of the line:

In [None]:
hsslistv2 <- hsslistv %>% str_subset("^.{0,30}\\b[A-Z]{2,4} ?\\d{2,3}[A-Z]?\\b") %>% na.omit

In [None]:
hsslistv2 %>% head

Now extract only the first course code patterns (there may be subsequent ones that stand for prerequisites) and delete whitespaces for uniformity:

In [None]:
hsslistv2 %>%
str_extract("\\b[A-Z]{2,4} ?\\d{2,3}[A-Z]?\\b") %>% str_replace_all("\\s+", "")
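
Since the PDF is read from a local file, here is a small sketch of the same two steps on made-up lines. The course codes and titles are hypothetical and only mimic the layout described above:

```r
library(stringr)

# Made-up lines imitating rows of a course list
lines <- c("HSS 101 Introduction to Something (prereq: HUM 102)",
           "   PHIL131A Another Course",
           "Page 3 of 10")

code_pattern <- "\\b[A-Z]{2,4} ?\\d{2,3}[A-Z]?\\b"

# Keep only the rows where a code starts within the first 30 characters
rows <- str_subset(lines, paste0("^.{0,30}", code_pattern))

# Take the first code of each row and remove the inner whitespace
codes <- str_replace_all(str_extract(rows, code_pattern), "\\s+", "")

print(rows)
print(codes)
```

Because `str_extract` returns only the first match, the prerequisite code in the first row is ignored.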

## Get the explicitly used packages in this repo

Get the names of the ipynb files in this repo:

In [None]:
notebooks <- list.files("..", pattern = "\\.ipynb$", recursive = T, full.names = T)

In [None]:
notebooks

Read them into an object:

In [None]:
all_nb <- lapply(notebooks, readLines)

In [None]:
all_nb %<>% unlist

Packages are either attached by "library" command or used as a namespace with "::".

Let's first get the package names inside library(...) call:

In [None]:
libs <- all_nb %>% str_extract_all("(?<=library\\().*?(?=\\))") %>% unlist %>% unique

In [None]:
#nss <- all_nb %>% str_extract_all("\\b[^, \\(]*?(?=::)") %>% unlist %>% unique

And now the trickier part: get the patterns of the form xxxxxx::yyyyyy( and extract the xxxxxx part:

In [None]:
nss <- all_nb %>% str_extract_all("\\b[\\w.]+::[\\w.]+\\(") %>% unlist %>%
str_extract_all("[\\w.]+(?=::)") %>% unlist %>% unique
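
Both extractions can be tried on a few made-up lines of code. This is a sketch with one assumption of ours: since R package names may contain dots (for example `data.table` or `gamlss.dist`), the character class `[\\w.]` is used instead of `\\w` alone:

```r
library(stringr)

# Made-up source lines standing in for the notebook contents
src <- c("library(data.table)",
         "wlength <- gamlss.dist::rDPO(10, 4, 0.5)",
         "listd2 %>% DT::datatable(filter = \"top\")")

# Whatever sits between "library(" and the next ")"
libs <- unique(unlist(str_extract_all(src, "(?<=library\\().*?(?=\\))")))

# First the whole pkg::fun( call, then only the part before "::"
calls <- unlist(str_extract_all(src, "\\b[\\w.]+::[\\w.]+\\("))
nss   <- unique(str_extract(calls, "^[\\w.]+(?=::)"))

print(libs)
print(calls)
print(union(nss, libs))
```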

Some pattern matches may not actually be packages (false alarms):

In [None]:
nss %>% head

Let's intersect with installed packages:

In [None]:
packs <- installed.packages()[,"Package"] %>% unname

In [None]:
intersect(union(nss, libs), packs)

## Get ips and geo locations: How internet is routed

Let's see how the visit to a webpage hosted in a landlocked country is routed throughout the internet.

Zambia is a landlocked country in Africa.

The `host` command in Linux gets the IP address of a domain name, and the `traceroute` command gets the IP addresses of the routers in between.

The output of these commands has already been saved to a text file. We will start with that:

In [None]:
# host -W 10 parliament.gov.zm | grep -Po "(\d{1,3}\.){3}\d{1,3}" | head -1
# traceroute 41.77.145.34

In [None]:
route <- readLines("~/databb/text/route")

In [None]:
route

Extract all matches for an IP pattern: four groups of 1-3 digits separated by dots:

In [None]:
ipsl <- route %>% str_extract_all("(\\d{1,3}\\.){3}\\d{1,3}")
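
The pattern can be checked on a few made-up lines in the style of `traceroute` output. The addresses below come from the ranges reserved for documentation and are not the ones in the real file:

```r
library(stringr)

# Made-up lines imitating traceroute output
route_demo <- c("traceroute to 203.0.113.34 (203.0.113.34), 30 hops max, 60 byte packets",
                " 1  gateway (192.0.2.1)  1.123 ms  1.045 ms  0.998 ms",
                " 2  198.51.100.7 (198.51.100.7)  9.876 ms  9.801 ms  9.755 ms",
                " 3  * * *")

# Three "digits plus dot" groups followed by a final group of digits
ip_pattern <- "(\\d{1,3}\\.){3}\\d{1,3}"

ips_demo    <- unlist(str_extract_all(route_demo, ip_pattern))
unique_demo <- unique(ips_demo)

print(ips_demo)
print(unique_demo)
```

The timings such as `1.123` are not matched because they contain only one dot. Note that the pattern does not check that each group is at most 255, so it is a loose filter rather than a validator.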

In [None]:
#ips <- ipsl %>% sapply(rev) %>% sapply("[", 1) %>% na.omit
ips <- ipsl %>% unlist %>% na.omit

In [None]:
ips

Only get the first instance of each ip:

In [None]:
ips2 <- data.table(ips)[, ips[1], by = ips] %>% pull(ips)

In [None]:
ips2

Now we will use the geoiplookup command in Linux to get geo location information:

In [None]:
geop <- lapply(sprintf("geoiplookup %s", ips2), system, intern = T)

In [None]:
geop[[1]]

We need only the countries:

In [None]:
countries <- sapply(geop, str_subset, "Country Edition")

In [None]:
countries

Extract the end of the line after the comma and space:

In [None]:
countries2 <- countries %>% str_extract("(?<=, ).*?$")

Get only the first instances of each country:

In [None]:
data.table(countries2)[, countries2[1], by = countries2] %>% pull(countries2)

The routing does not follow a direct terrestrial line.

We are first routed to Europe, then follow a transatlantic route to the US, then another transatlantic route to Africa and the island of Mauritius, and finally from the Indian Ocean to the African mainland, ending in Zambia.

Maybe submarine lines are more seamless, with fewer routers, and more secure than terrestrial lines?

# Resources

This tutorial is a good starting point:

http://www.regular-expressions.info/tutorial.html

And this site is for testing regex patterns on some text:

https://regex101.com/

And this competition website is good for progressing in regex and having fun at the same time:

https://regexcrossword.com/