Regular Expressions #268

Open
hkronenb opened this Issue May 7, 2017 · 4 comments

5 participants
@hkronenb
Contributor

hkronenb commented May 7, 2017

I recently attended an instructor training, and for my instructor checkout I thought of a way to improve a small part of 'R for Reproducible Scientific Analysis Training: Dataframe Manipulation with dplyr' (http://swcarpentry.github.io/r-novice-gapminder/13-dplyr/).

I found the following a little unintuitive and a bit inefficient.


# Get the start letter of each country
starts.with <- substr(gapminder$country, start = 1, stop = 1)
# Filter countries that start with "A" or "Z"
az.countries <- gapminder[starts.with %in% c("A", "Z"), ]


I think a better way to do this would be in one line using grepl. Below are three examples of how grepl could be used to do this task.


# Three options for using regular expressions
# Alternation needs parentheses; inside [...] a | would be a literal character
az.countries <- subset(gapminder, grepl("^(A|Z)", country))
az.countries <- subset(gapminder, grepl("^[AZ]", country))
az.countries <- subset(gapminder, grepl("^A", country) | grepl("^Z", country))
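A minimal sketch comparing these patterns on a toy vector. The `country` values here are hypothetical stand-ins rather than the real gapminder data, and the `"|example"` entry exists only to show how a `|` inside square brackets behaves.

```r
# Toy stand-in for gapminder$country (hypothetical values, not the real data)
country <- c("Albania", "Zambia", "Brazil", "Austria", "|example")

starts_alt   <- grepl("^(A|Z)", country)                     # alternation: A or Z
starts_class <- grepl("^[AZ]", country)                      # character class: A or Z
starts_two   <- grepl("^A", country) | grepl("^Z", country)  # two separate tests

# The three approaches select the same elements
identical(starts_alt, starts_class)
identical(starts_alt, starts_two)

# Inside [...] the | is a literal character, so "^[A|Z]" also
# matches strings that begin with "|"
starts_pipe <- grepl("^[A|Z]", country)
country[starts_pipe & !starts_class]
```

On real country names the bracketed form with `|` would give the same rows, since no country starts with `|`, but the parenthesized or plain character-class forms say what is meant.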


Though regular expressions can be a little tough to grasp at first, they are really useful and a great thing to learn early on. I understand that this is a new topic. However, I don't think it is more difficult to understand than substr(), so it should take only a few extra minutes to teach, and in the long run knowing regular expressions could be more useful.

I wrote the following brief intro on regular expressions:

A regular expression is a sequence of characters that defines a search pattern. In programming we use regular expressions to check whether a certain pattern occurs in a set of strings. For example, if you have a list of students and want to find those with the last name Smith, you could use the regular expression “Smith”. In R you specifically use the syntax:

grepl("REGULAR EXPRESSION", VARIABLE)
or in this example
grepl("Smith", Name)

Because a regular expression describes a sequence of characters, it is matched against string (character) data. You can still use regular expressions to search for a pattern of digits if the variable you are searching through is a string. For example, you could have an employee ID that is a string of numbers and letters. Suppose you want to subset all employee IDs that start with “185” because “185” identifies employees who work in a particular field. You can use the regular expression “^185” to search through the employee IDs.

There are a lot of tricks you can use with regular expressions. Here I use “^” before “185” to indicate that I want to find strings that start with 185. If we want to find strings that end with “185”, we would use the syntax “185$”.
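The two anchors can be sketched as follows. The `employee_id` vector is a hypothetical example made up for illustration.

```r
# Hypothetical employee IDs stored as character strings
employee_id <- c("185A20", "B71185", "185185", "A1850Z")

starts_185   <- grepl("^185", employee_id)  # TRUE where the ID starts with 185
ends_185     <- grepl("185$", employee_id)  # TRUE where the ID ends with 185
contains_185 <- grepl("185", employee_id)   # TRUE where 185 appears anywhere

# Subset to the IDs that start with 185
employee_id[starts_185]
```

Without an anchor the pattern matches anywhere in the string, which is why all four IDs match the last pattern.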

If you want to search for a pattern of digits in a variable that is stored in a numerical format (such as double or integer), the pattern is matched against the variable's string representation. grepl() coerces its input with as.character(), but converting the variable to a string yourself first makes that step explicit.
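A small sketch of the conversion step, using a hypothetical numeric vector `id_num`. Converting with `as.character()` before matching keeps the intent visible to a reader.

```r
# Hypothetical numeric codes (stored as doubles)
id_num <- c(18501, 90185, 21850)

# Convert to character first, then match on the string form
id_chr <- as.character(id_num)
starts_185 <- grepl("^185", id_chr)

id_num[starts_185]
```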


Let me know what you think! I'm excited to start contributing.

@naupaka


Member

naupaka commented May 8, 2017

Hi @hkronenb thanks for the suggestion, and we're glad to have you excited to contribute too!

I agree with you that the current example is not very elegant or R-like (substr() is probably a little too into the weeds for this particular spot in this lesson, IMO), and that regular expressions would be the proper and in many cases easier and more elegant way to do this.

I am a little hesitant to dive into them in this lesson, though, because they could easily become a whole module unto themselves. But they are certainly a powerful tool that would be useful to introduce, even if very cursorily. Let me think a little bit more about how/where to fit in this content and I'll get back to you.

thoughts @tomwright01 @noamross?

@noamross


noamross commented Jun 5, 2017

[In which Noam clears out like a month of GitHub notifications]

I would hesitate to "partially" introduce regular expressions at all; regex and string handling would easily be an additional module or two. I think it might make sense to introduce them as a concept, though not here. The tidyr lesson might be a good place, because things like sep arguments can use but do not require regexes. So you could show one regex example and make an advanced option for an exercise something that would require students to look up some different regex syntax.

Given that, I would see if one could re-write this portion of the lesson without string handling at all, so as to reduce the length and cognitive load of the lesson. None of the concepts being conveyed here require string handling. One could re-write these pipelines to simply select 4 countries by name, or use a different logical filter to reduce to some other small set of countries (say, the highest or lowest GDP countries).
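A rough sketch of what those two alternatives might look like, with no string handling. The `gap` data frame is a hypothetical toy stand-in for gapminder with made-up values, and base R is used only to keep the example self-contained; in the lesson itself the same logic would presumably go through dplyr's filter().

```r
# Toy stand-in for the gapminder data (hypothetical values)
gap <- data.frame(
  country   = c("Albania", "Brazil", "Canada", "Zambia", "Norway"),
  gdpPercap = c(5900, 9000, 36000, 1300, 49000),
  stringsAsFactors = FALSE
)

# Option 1: select a few countries by name
by_name <- gap[gap$country %in% c("Brazil", "Canada", "Norway"), ]

# Option 2: keep the two countries with the highest GDP per capita
by_gdp <- head(gap[order(gap$gdpPercap, decreasing = TRUE), ], 2)

by_name$country
by_gdp$country
```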

@naupaka


Member

naupaka commented Jun 5, 2017

@hkronenb I am inclined to agree with @noamross. What if instead of swapping it out for regexps, you just pulled the substr() bit out of this lesson as your checkout PR?

@jcoliver


Collaborator

jcoliver commented Jan 18, 2018

Also, this plotting example (and its use of substr()) is echoed from episode 8. So revisions might also need to occur there, or be sure to separate the dplyr material from that example.

fmichonneau pushed a commit that referenced this issue Jun 19, 2018

Merge pull request #268 from maxim-belkin/fix-returns
util.py: make functions return NotImplemented