REGEX expression to pull POTUS names from filenames #2

Closed
tymonaghan opened this issue Oct 23, 2018 · 5 comments

@tymonaghan
Owner

commented Oct 23, 2018

I need a regular expression that will match and extract president names from filenames.

Assume you have two filenames:
"1888-Cleveland-12-3.txt"
and
"1889-BenHarrison-12-3.txt"

How do I write a regular expression to match "Cleveland" and "BenHarrison" while ignoring "txt"?

Right now I'm just using

([[:alpha:]]{4,})

which just captures runs of letters that are at least 4 characters long. So it ignores "txt", which has only 3 characters, but it would be more useful to learn to use regex correctly.
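
For reference, here is a minimal Python sketch of that length-based approach. Python's `re` module does not support the POSIX class `[[:alpha:]]`, so `[A-Za-z]` stands in for it here, which assumes ASCII-only names. The helper name `long_words` and the third filename are hypothetical, added only to show where the length trick can break down.

```python
import re

# [A-Za-z] stands in for the POSIX class [[:alpha:]], which Python's re lacks.
LONG_WORD = re.compile(r"[A-Za-z]{4,}")

def long_words(filename):
    """Return every run of 4 or more ASCII letters in the filename."""
    return LONG_WORD.findall(filename)

print(long_words("1888-Cleveland-12-3.txt"))
print(long_words("1889-BenHarrison-12-3.txt"))
# Hypothetical filename: a 4-letter extension is captured along with the name.
print(long_words("1888-Cleveland-12-3.text"))
```

The pattern relies on the extension being shorter than 4 letters, so it says nothing about where in the filename the name actually sits.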

@RJP43

Contributor

commented Oct 23, 2018

Do the filenames always have the hyphens between the year and the name and between the name and the "12"? Also, you just want the name extracted right?

@tymonaghan

Owner Author

commented Oct 23, 2018

@RJP43

Do the filenames always have the hyphens between the year and the name and between the name and the "12"?

Yes, the name I want extracted is always enclosed in -hyphens- (but really I can just change the filenames; I just figure it would be good to learn to understand regex at least at a very basic level).

Also, you just want the name extracted right?

Yup! So if you run it over the two filenames above, would just want "Cleveland" and "BenHarrison" to match.

@RJP43

Contributor

commented Oct 23, 2018

@tymonaghan, actually the hyphens are great "handlebars" to use in your pattern matching. I really like the regex feature in oXygen because of the drop-down menu that gives explanations for each regular expression. So, assuming you are using oXygen, the following regex should get you the results you are looking for:
(?<=-)[A-z]+(?=-)

Here is a tutorial for writing Regex in oXygen that I have used with past students - http://dh.newtfire.org/explainRegex.html
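
Outside oXygen, the same pattern can be tried in any engine that supports lookarounds. Below is a minimal sketch in Python's `re` module; the function name `extract_name` is hypothetical, and the pattern is copied unchanged from the comment above.

```python
import re

# Lookbehind (?<=-) and lookahead (?=-) require a hyphen on each side
# of the match without including the hyphens in the matched text.
NAME = re.compile(r"(?<=-)[A-z]+(?=-)")

def extract_name(filename):
    """Return the first letters-only run enclosed in hyphens, or None."""
    match = NAME.search(filename)
    return match.group(0) if match else None

for filename in ["1888-Cleveland-12-3.txt", "1889-BenHarrison-12-3.txt"]:
    print(filename, "->", extract_name(filename))
```

Because "txt" is preceded by a period and followed by the end of the string, it never satisfies the two hyphen checks, regardless of its length.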

@RJP43

Contributor

commented Oct 23, 2018

The [A-z]+ is a character class followed by a quantifier (strictly speaking it is not a capturing group, which would need parentheses). At first I wanted to suggest the expression (?<=-).+?(?=-), because .+? means "grab any text", but since the "12" is also situated between hyphens, the expression given in my previous comment requires the match to contain only letters, upper or lower case. One caveat: [A-z] also lets through the few punctuation characters that sit between Z and a in ASCII ([ \ ] ^ _ `), so [A-Za-z]+ is the stricter choice. Fun thing about regex is there are probably at least 5 other ways to write this and still get what you are looking for, so let me know if this doesn't exactly get you what you want.
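
To make the difference concrete, here is a small Python sketch comparing the two expressions on one of the filenames from the issue. The string "Ben_Harrison" is a hypothetical input, used only to show that the range A-z covers more than letters.

```python
import re

filename = "1888-Cleveland-12-3.txt"

# "Any text" between hyphens also picks up the "12".
any_text = re.findall(r"(?<=-).+?(?=-)", filename)
# Restricting the contents to the A-z range leaves only the name.
letters_only = re.findall(r"(?<=-)[A-z]+(?=-)", filename)

print(any_text)
print(letters_only)

# [A-z] spans ASCII codes 65 to 122, which includes [ \ ] ^ _ and the backtick.
accepts_underscore = re.fullmatch(r"[A-z]+", "Ben_Harrison") is not None
# [A-Za-z] lists the two letter ranges separately and rejects the underscore.
strict_rejects = re.fullmatch(r"[A-Za-z]+", "Ben_Harrison") is None

print(accepts_underscore, strict_rejects)
```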

@tymonaghan

Owner Author

commented Oct 23, 2018

Thank you both! @RJP43 so I see the + sign after the [A-z] character class telling it to match A-z one or more times, as many times as it can. I can see how there would be other ways to do this too. I don't know what happened to that other guy, but his suggestion was helpful too if I were doing analysis on the file names themselves. For now I have updated my code in tm.R, and it's now automatically extracting the president's name from the filename when I run the GetSentiment() function. Thanks all.
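
As an illustration of those "other ways", here are two hedged alternatives sketched in Python rather than R. Both function names are hypothetical, and both assume the YEAR-Name-... layout described in this thread. The first uses a capturing group in the strict sense (parentheses), the second skips regex entirely.

```python
import re

def name_by_group(filename):
    """Match the surrounding hyphens, but capture only the letters between them."""
    match = re.search(r"-([A-Za-z]+)-", filename)
    return match.group(1) if match else None

def name_by_split(filename):
    """Split on hyphens and take the second field (assumes YEAR-Name-... layout)."""
    parts = filename.split("-")
    return parts[1] if len(parts) > 2 else None

for filename in ["1888-Cleveland-12-3.txt", "1889-BenHarrison-12-3.txt"]:
    print(name_by_group(filename), name_by_split(filename))
```

The split version is the simplest when every filename follows the same layout, while the regex versions also check that the field contains only letters.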

@tymonaghan tymonaghan closed this Oct 23, 2018
