# Strings

A "string" is a programming term for a value containing text. In R, strings are values of the class `character`.

In [17]:
my_text <- "The ancient Greeks had several different theories with regard to the origin of the world, but the generally accepted notion was that before this world came into existence, there was in its place a confused mass of shapeless elements called Chaos. "
class(my_text)

## Texts as vectors
Texts can be thought of as vectors in a number of ways:
1. A collection of texts
2. A collection of words in a text
3. A collection of sentences in a text

Depending on what we are trying to figure out with text, R can be used in a number of ways.

Let's start by thinking of a text as a series of single words:

In [24]:
text_words <- unlist(strsplit(my_text, split = "\\s"))  # split at every whitespace character; "\\s" is the regular expression for whitespace
text_words

The text is now a vector, each word with its own index (subset using `[]`):

In [25]:
text_words[5]
text_words[22]

This allows us to perform counts and other summaries.
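
As a minimal sketch of such counts and summaries, the example below uses a short made-up sentence (`example_text` and the other object names are hypothetical, not part of the workshop data) and only base R functions:

```r
# Hypothetical example sentence; any character string works the same way
example_text <- "the cat saw the dog and the dog saw the cat"

# Split at whitespace and flatten the resulting list to a character vector
example_words <- unlist(strsplit(example_text, split = "\\s"))

# Number of words in the text
n_words <- length(example_words)

# Frequency of each distinct word, most frequent first
word_freq <- sort(table(example_words), decreasing = TRUE)

# Number of characters in each word
word_lengths <- nchar(example_words)

n_words
word_freq
```

`length()` counts the elements of the vector, while `table()` tallies how often each distinct word occurs.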

## Working with strings: The `stringr` package

The package `stringr` is a tidyverse package for working with strings.

In [18]:
library(stringr)

`stringr` can be used in a variety of different ways.

In [27]:
# Changing case (here to lowercase)
str_to_lower(my_text)

In [28]:
# Looking up words
str_detect(my_text, "world")

In [29]:
# Counting matches
str_count(my_text, "world")

The functions of `stringr` also work on a collection of elements (like a vector). We can see this when we split the text into smaller pieces (here at each comma, as a rough stand-in for sentences) and then use the same functions:

In [30]:
# Splitting text into elements; here separating at commas
# unlist is used to coerce to a vector; otherwise it is returned as a list
text_sent <- str_split(my_text, pattern = ",") %>%
    unlist()

In [31]:
# Looking up word in each sentence
str_detect(text_sent, "world")

In [32]:
# Counting word in each sentence
str_count(text_sent, "world")

### A note on indexing and booleans
When a logical (boolean) vector is used as an index, R returns only the elements at the positions where the index is `TRUE`.

This means that we can use commands like `str_detect` to only return text elements containing specific words.

In [33]:
text_sent[str_detect(text_sent, "world")]

`str_subset()` combines detection and subsetting in a single function:

In [34]:
str_subset(text_sent, "world")
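
The equivalence of the two approaches can be checked on a small example. This is only a sketch; the vector `fruits` is made up for illustration:

```r
library(stringr)

# Hypothetical vector of short text elements
fruits <- c("red apple", "yellow banana", "green apple", "orange")

# A logical vector: TRUE where the element contains "apple"
has_apple <- str_detect(fruits, "apple")

# Indexing with the logical vector keeps only the TRUE positions
by_index <- fruits[has_apple]

# str_subset() does detection and subsetting in one call
by_subset <- str_subset(fruits, "apple")

# Both approaches give the same result
identical(by_index, by_subset)
```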

## EXERCISE: WORKING WITH A TEXT AS A VECTOR

In the following, you will create a vector containing the sentences of a text and then look up certain words.

Make sure you have the package `stringr` installed and loaded.

1. Assign the following text snippet to an object:

    "Themis, who has already been alluded to as the wife of Zeus, was the daughter of Cronus and Rhea, and personified those divine laws of justice and order by means of which the well-being and morality of communities are regulated. She presided over the assemblies of the people and the laws of hospitality."
    

2. Convert the text snippet to a vector of sentences.
    
    a. Split the text into sentences using `str_split(texts, pattern = ",")`, where `texts` is the object from step 1. Assign to an object.
    
    b. Unlist the object to convert to a vector using `unlist()`. Assign to the same or a new object.
    
    
3. Use `str_subset()` to extract the sentences that contain the name "Zeus". 

## Regular expressions
Often when working with text, we want to search for "fuzzier" patterns than a single fixed word.

Regular expressions are a common language for describing such text patterns.

`stringr` functions accept regular expressions as their pattern arguments.

Possible uses:
- Finding words of a certain length
- Finding sentences containing a certain word or pattern
- Finding words following a certain pattern
- and so on.
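
A few of the uses listed above can be sketched as follows. The vector `word_vec` is a made-up example, and the patterns are just one way of expressing each search:

```r
library(stringr)

# Hypothetical vector of words
word_vec <- c("chaos", "world", "origin", "of", "the", "elements")

# Words of exactly five characters:
# ^ and $ anchor the pattern to the start and end of the string
five_letters <- str_subset(word_vec, "^.{5}$")

# Words starting with a vowel
starts_vowel <- str_subset(word_vec, "^[aeiou]")

# Words ending in "s"
ends_s <- str_subset(word_vec, "s$")

five_letters
starts_vowel
ends_s
```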

Regular expressions (or "regex") can be used in most functions in `stringr`.

The function `str_subset()` returns only the elements of a vector that contain the pattern.

In [36]:
text_sent

In [79]:
# Return sentences containing either "origin" or "before"
str_subset(text_sent, "origin|before")

In [38]:
# Return sentences containing an uppercase letter.
str_subset(text_sent, "[A-Z]")

## EXERCISE 4: SIMPLE REGEX

Use your vector of sentences from the previous exercise.

1. Use `str_subset()` to extract sentences containing either the word "justice" or "people".

# Categorical variables (factors)

Categorical variables in R are typically stored as "factors".

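Unlike some other statistical software, R does not have you refer to categories by an underlying numerical value. A factor does store integer codes internally, but these only point to positions among the factor's levels; in practice, values in a factor are referred to by their category name.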
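
Internally, R does keep integer codes for a factor, but they are only positions in the vector of levels, not values you assign yourself. A minimal sketch, using a made-up vector `answers`:

```r
# Hypothetical character vector of categories
answers <- c("Yes", "No", "Yes", "Don't know", "No")

answers_fct <- factor(answers)

# The category labels (levels), sorted alphabetically by default
levels(answers_fct)

# The integer codes stored internally; each code is a position in levels()
as.integer(answers_fct)

# Values are selected by label rather than by code
answers_fct == "Yes"
```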

Factors can sometimes cause issues: in R versions before 4.0.0, base import functions such as `read.csv()` convert text variables to factors by default (`stringsAsFactors = TRUE`). This gives you little control over how the categories are defined and ordered.
It often makes more sense to import text as character (as `read_csv()` from `readr` does) and recode the variables as factors yourself.
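
A minimal sketch of recoding after import, assuming a small inline CSV (`csv_text` is made up and stands in for a real file) and base R's `read.csv()`:

```r
# Hypothetical inline CSV standing in for an imported file
csv_text <- "id,health\n1,Good\n2,Fair\n3,Very good\n4,Good"

# Keep text columns as character on import
# (stringsAsFactors = FALSE is the default from R 4.0.0 onwards)
survey <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Convert to a factor yourself, stating the level order explicitly
survey$health <- factor(survey$health, levels = c("Fair", "Good", "Very good"))

levels(survey$health)
table(survey$health)
```

Stating `levels` explicitly makes the category order a deliberate choice instead of a side effect of the import.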

Factors are necessary in a lot of functions for creating graphs or statistical models.

In [1]:
library(readr)
library(dplyr)

ess_data <- read_csv("https://github.com/CALDISS-AAU/workshop_r-table-data/raw/master/data/ess2014_mainsub_p1.csv")

Parsed with column specification:
cols(
  idno = col_double(),
  ppltrst = col_character(),
  polintr = col_character(),
  vote = col_character(),
  lrscale = col_character(),
  happy = col_character(),
  health = col_character(),
  cgtsday = col_double(),
  cgtsmke = col_character(),
  alcfreq = col_character(),
  brncntr = col_character(),
  height = col_double(),
  weight = col_double(),
  gndr = col_character(),
  yrbrn = col_double(),
  edlvddk = col_character(),
  marsts = col_character(),
  polpartvt = col_character()
)


In [2]:
# Coerce as factor
ess_data %>%
    mutate(gndr = as.factor(gndr)) 

idno,ppltrst,polintr,vote,lrscale,happy,health,cgtsday,cgtsmke,alcfreq,brncntr,height,weight,gndr,yrbrn,edlvddk,marsts,polpartvt
921018,6,Hardly interested,Not eligible to vote,4,9,Very good,10,I smoke but not every day,2-3 times a month,Yes,178,64,Male,1990,Folkeskole 6.-8. klasse,None of these (NEVER married or in legally registered civil union),[NA] Not applicable
921026,8,Quite interested,Yes,4,8,Very good,,I have never smoked,Several times a week,Yes,172,64,Female,1948,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",Widowed/civil partner died,[1] Socialdemokraterne - the Danish social democrats
921034,8,Quite interested,Yes,7,8,Good,,I don't smoke now but I used to,Every day,Yes,176,87,Male,1957,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",Not applicable,"[7] Venstre, Danmarks Liberale Parti - Venstre"
921181,9,Quite interested,Yes,5,9,Fair,,I don't smoke now but I used to,Once a week,Yes,194,102,Male,1956,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",Not applicable,[2] Det Radikale Venstre - Danish Social-Liberal Party
921204,9,Hardly interested,Yes,7,8,Good,,I don't smoke now but I used to,Once a week,No,157,48,Female,1941,"Kort videregående uddannelse af op til 2-3 års varighed, F.eks. Erhvervsakadem",Not applicable,[NA] Don't know
921262,8,Hardly interested,Yes,7,8,Very good,,I have never smoked,2-3 times a month,Yes,180,93,Male,1987,"Mellemlang videregående uddannelse af 3-4 års varighed. Professionsbachelorer,",None of these (NEVER married or in legally registered civil union),"[7] Venstre, Danmarks Liberale Parti - Venstre"
921319,8,Very interested,Not eligible to vote,8,Extremely happy,Fair,10,I smoke daily,Once a week,Yes,185,82,Male,1997,Folkeskole 9.-10. klasse,None of these (NEVER married or in legally registered civil union),[NA] Not applicable
921327,9,Quite interested,Yes,2,9,Good,,I don't smoke now but I used to,Once a week,Yes,182,97,Male,1940,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",Not applicable,[9] Enhedslisten - Unity List - The Red-Green Alliance
921335,8,Hardly interested,Yes,9,9,Good,,I have never smoked,Several times a week,Yes,178,84,Male,1945,"Faglig uddannelse (håndværk, handel, landbrug mv.), F.eks. Faglærte, Social-",Not applicable,[10] Andet - other
921408,7,Quite interested,Yes,8,9,Fair,1,I smoke but not every day,Several times a week,Yes,203,104,Male,1977,"Lang videregående uddannelse. Kandidatuddannelser af 5.-6. års varighed, F.eks",Not applicable,"[7] Venstre, Danmarks Liberale Parti - Venstre"


In [3]:
# Isolating a factor
gend_cat <- as.factor(ess_data$gndr)

In [4]:
# Inspecting values and levels
unique(gend_cat)

In [8]:
# Create factor as ordered/ordinal (but what order?)
ess_data <- ess_data %>%
    mutate(gend_factor = factor(gndr, ordered = TRUE))

In [9]:
# Inspecting values and levels
unique(ess_data$gend_factor)

In [10]:
# Creating ordered factor but setting custom order
ess_data <- ess_data %>%
    mutate(polintr_fact = factor(polintr, levels = c('Not at all interested', 'Hardly interested',
                                                    'Quite interested', 'Very interested'), ordered = TRUE))

unique(ess_data$polintr_fact)
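
To see why the level order matters, the sketch below compares the default order with a custom one. The vector `interest` is a made-up stand-in for the `polintr` column:

```r
# Hypothetical answers on an interest scale
interest <- c("Quite interested", "Hardly interested", "Very interested",
              "Not at all interested")

# Without `levels`, the order is alphabetical, not substantive
default_order <- factor(interest, ordered = TRUE)
levels(default_order)

# With `levels`, the order follows the scale
custom_order <- factor(interest,
                       levels = c("Not at all interested", "Hardly interested",
                                  "Quite interested", "Very interested"),
                       ordered = TRUE)
levels(custom_order)

# Is "Hardly interested" (element 2) above "Not at all interested" (element 4)?
default_cmp <- default_order[2] > default_order[4]
custom_cmp <- custom_order[2] > custom_order[4]

default_cmp
custom_cmp
```

With the alphabetical default, "Hardly interested" sorts below "Not at all interested", so the two comparisons give different answers.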

In [12]:
# Recode using ordered factor (ordinal)
ess_data %>%
    mutate(polintr_dum = ifelse(polintr_fact > "Hardly interested", "Interested", "Not interested")) %>%
    select(idno, polintr, polintr_dum) %>%
    head(4)

idno,polintr,polintr_dum
921018,Hardly interested,Not interested
921026,Quite interested,Interested
921034,Quite interested,Interested
921181,Quite interested,Interested


## `forcats` 

`forcats` is a package specifically for working with factors in R. It provides a range of functions for modifying the level labels and the order of levels of a factor.

See [the cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/factors.pdf).

A lot of the functionality may seem unnecessary, but it is very useful in combination with `ggplot2`, as these functions allow you to easily change how data is plotted without changing the underlying data.
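
A minimal sketch of three `forcats` functions for reordering levels, using a made-up factor `health`:

```r
library(forcats)

# Hypothetical factor of survey answers
health <- factor(c("Good", "Fair", "Very good", "Good", "Good", "Fair"))

# Alphabetical default order
levels(health)

# Move one level to the front
by_hand <- fct_relevel(health, "Very good")

# Order levels by how often they occur (most frequent first)
by_freq <- fct_infreq(health)

# Reverse the current level order
reversed <- fct_rev(health)

levels(by_hand)
levels(by_freq)
levels(reversed)
```

In each case only the level order changes; the values themselves stay the same, which is what makes these functions convenient for controlling the order of categories in a plot.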