Regular Expressions #268
I recently attended an instructor training, and for my instructor checkout I thought of a way to potentially improve a small part of 'R for Reproducible Scientific Analysis: Dataframe Manipulation with dplyr' (http://swcarpentry.github.io/r-novice-gapminder/13-dplyr/).
I found the following a little unintuitive and a bit inefficient.
Get the start letter of each country
I think a better way to do this would be in one line using grepl. Below are three examples of how grepl could be used to do this task.
Three options for using regular expressions
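As a minimal sketch of the one-line grepl idea, assuming a small made-up data frame standing in for gapminder (the `toy` data frame and its values are hypothetical, and the letters A and Z are picked only for illustration):

```r
# Toy stand-in for gapminder; the values are invented for illustration only
toy <- data.frame(
  country = c("Afghanistan", "Albania", "Brazil", "Canada", "Zambia", "Zimbabwe"),
  id = 1:6,
  stringsAsFactors = FALSE
)

# Keep rows whose country name starts with "A" or "Z".
# "^" anchors the match to the start of the string;
# "[AZ]" matches a single character that is either A or Z.
az_countries <- toy[grepl("^[AZ]", toy$country), ]

print(az_countries$country)
```

The same logical vector could also be passed to a dplyr `filter()` call inside a pipeline, since `grepl()` returns one TRUE/FALSE per row.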
Though regular expressions can be a little tough to grasp at first, they are really useful and a great thing to learn early on. I understand that this is a new topic, however, I don't think that it is more difficult to understand than substring (so it would not take much longer than a few extra minutes to teach) and in the long run knowing regular expressions could be more useful.
I wrote the following brief intro on regular expressions:
A regular expression is a sequence of characters that defines a search pattern. In programming we use regular expressions to check whether a certain pattern occurs in a set of strings. For example, if you have a list of students and you want to search for students with the last name Smith, you could use the regular expression “Smith”. In R you specifically use the syntax:
grepl("REGULAR EXPRESSION", VARIABLE)
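A small sketch of that syntax, assuming a made-up vector of student names (the `students` vector is hypothetical):

```r
# Hypothetical list of students
students <- c("Anna Smith", "Ben Jones", "Carla Smithson", "Dev Patel")

# grepl() returns one TRUE/FALSE per element of the vector
has_smith <- grepl("Smith", students)

# The unanchored pattern matches "Smith" anywhere in the string,
# so "Carla Smithson" is matched as well
smith_anywhere <- students[has_smith]

# Adding "$" requires the string to end with "Smith"
smith_at_end <- students[grepl("Smith$", students)]

print(smith_anywhere)
print(smith_at_end)
```

Note that the plain pattern also picks up names that merely contain "Smith", which is why anchoring is worth showing early.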
Because a regular expression is a sequence of characters, it is matched against strings. However, you can still use regular expressions to search for a pattern of digits if the variable you are searching through is a string. For example, you could have an employee ID that is a string of numbers and letters. Suppose you want to subset on all employee IDs that start with “185” because “185” identifies employees who work in a particular field. You can use the regular expression “^185” to search through the employee IDs.
There are a lot of tricks you can use with regular expressions. Here I use “^” before “185” to indicate that I want to find strings that start with 185. If we wanted to find strings that end with “185”, we would use the syntax “185$”.
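To make the two anchors concrete, here is a short sketch assuming a made-up vector of employee IDs (the `employee_id` values are hypothetical):

```r
# Hypothetical employee IDs stored as strings
employee_id <- c("185A01", "200B185", "185C77", "X18599")

# "^185" matches only IDs that start with 185
starts_185 <- grepl("^185", employee_id)

# "185$" matches only IDs that end with 185
ends_185 <- grepl("185$", employee_id)

print(employee_id[starts_185])
print(employee_id[ends_185])
```

The last ID contains "185" in the middle, so neither anchored pattern matches it.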
If you want to search for a pattern of digits in a variable that is stored in a numeric format (such as double or integer), it is best to convert the variable to a string first, for example with as.character. grepl will coerce non-character input itself, but an explicit conversion makes it clear which text the pattern is being matched against.
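A minimal sketch of the conversion step, assuming a made-up numeric vector of badge numbers (`badge` is hypothetical):

```r
# Hypothetical numeric IDs (stored as doubles, not strings)
badge <- c(18501, 20185, 18622)

# Convert explicitly so it is clear what text the pattern sees
badge_chr <- as.character(badge)

# Match IDs whose text form starts with "185"
starts_185 <- grepl("^185", badge_chr)

print(badge[starts_185])
```

Converting explicitly also makes it easy to inspect `badge_chr` and confirm that the text form of each number is what you expect before matching.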
Let me know what you think! I'm excited to start contributing.
Hi @hkronenb thanks for the suggestion, and we're glad to have you excited to contribute too!
I agree with you that the current example is not very elegant or R-like (
I am a little hesitant to dive into them in this lesson, though, because they could easily become a whole module unto themselves. But they are certainly a powerful tool that would be useful to introduce, even if only briefly. Let me think a little bit more about how/where to fit in this content and I'll get back to you.
[In which Noam clears out like a month of GitHub notifications]
I would hesitate to "partially" introduce regular expressions at all; regex and string handling would easily be an additional module or two. I think it might make sense to introduce them as a concept, though not here. The tidyr lesson might be a good place, because things like
Given that, I would see if one could re-write this portion of the lesson without string handling at all, so as to reduce the length and cognitive load of the lesson. None of the concepts being conveyed here require string handling. One could re-write these pipelines to simply select 4 countries by name, or use a different logical filter to reduce to some other small set of countries (say, the highest or lowest GDP countries).
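A rough sketch of both alternatives without any string handling, assuming a made-up data frame in place of gapminder (the `toy` values and the choice of countries are hypothetical; only the `country` and `gdpPercap` column names mirror the lesson's data):

```r
# Toy stand-in for gapminder; the numbers are invented for illustration only
toy <- data.frame(
  country = c("Afghanistan", "Brazil", "Canada", "Denmark", "Egypt", "France"),
  gdpPercap = c(700, 9000, 36000, 35000, 5500, 30000),
  stringsAsFactors = FALSE
)

# Alternative 1: select four countries by name with %in%
keep <- c("Brazil", "Canada", "Egypt", "France")
by_name <- toy[toy$country %in% keep, ]

# Alternative 2: a different logical filter, here the two highest-GDP rows
richest <- head(toy[order(-toy$gdpPercap), ], 2)

print(by_name$country)
print(richest$country)
```

In the dplyr pipeline the first alternative would presumably become `filter(country %in% keep)`, which keeps the focus on the verbs being taught.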