CSVParser: check column headers when parsing config CSV files #1459
vig42
commented
Mar 2, 2021
As we are adding a restriction on column names, we'll also need good error reporting, e.g., RepoSense should inform the user which column is missing, which column name is incorrect, etc.
Right now, if a mandatory column is missing, RepoSense throws an InvalidCsvException with the name of the missing column. For optional columns, no exception is thrown if the column is missing (because it is optional). This allows us to add new columns without breaking compatibility. However, it also means that if an optional column header has a typo in it, the user gets no indication. If a column header does not match any of the specified mandatory or optional columns, we could do some form of typo detection using string distance and warn the user if it is a close match. Otherwise, we could provide a list of all the columns that were successfully parsed. This would allow the user to manually check whether one of the columns was not recognized, and would probably help with debugging as well.
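To make the string-distance idea above concrete, here is a minimal sketch of how a close-match warning could be computed. It is not code from this PR: the class HeaderSuggester, its methods, and the threshold of 2 edits are all hypothetical, and Levenshtein distance is just one reasonable choice of metric.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper, not part of RepoSense: suggests a known header for an unrecognised one. */
class HeaderSuggester {
    /** Levenshtein edit distance, computed with two rolling rows of the DP table. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(substitution, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the closest known header if it is within maxDistance edits (case and outer whitespace ignored). */
    static Optional<String> suggest(String unknown, List<String> known, int maxDistance) {
        String needle = unknown.trim().toLowerCase(Locale.ROOT);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : known) {
            int d = editDistance(needle, candidate.toLowerCase(Locale.ROOT));
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return bestDistance <= maxDistance ? Optional.ofNullable(best) : Optional.empty();
    }

    public static void main(String[] args) {
        // Header names taken from the discussion; the real list would come from the parser.
        List<String> known = List.of("File formats", "Ignore Glob List");
        suggest("File fromats", known, 2)
                .ifPresent(h -> System.out.println("Unknown column; did you mean \"" + h + "\"?"));
    }
}
```

A small threshold keeps false positives down: a header that is far from every known column would fall through to the "list the columns that were parsed" path instead of producing a misleading suggestion.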
Yes, we should do this.
Probably no need to cater for typos (at least not urgent) but should be able to deal with case-differences and whitespace differences (at least leading/trailing spaces).
I have added this. The log output looks like this:
Right now we're taking care of case differences by using the
@vig42 Is this ready for review?
The idea is quite good, will proceed to review later
for (String parsedHeader : mandatoryHeaders()) {
    if (possible.equalsIgnoreCase(parsedHeader)) {
        headerMap.put(parsedHeader, i);
        break;
Why has the 'break' been added here? Isn't it supposed to check all the mandatory headers?
The break is there because I am assuming that the parsed column header would match with at most 1 of the mandatory headers, i.e. that we would not have a case where there are 2 mandatory headers which are identical.
So as soon as there is a match with 1 of the mandatory headers, there is no point in checking the remaining ones.
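For readers without the diff open, the reasoning about the break can be sketched as a self-contained example. This is an illustration only: the variable names follow the snippet under review, but the class HeaderMatcher, the method signatures, and the trim() call are assumptions rather than the PR's actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not the code under review. */
class HeaderMatcher {
    /** Maps each recognised header to the index of the CSV column it was found in. */
    static Map<String, Integer> matchHeaders(String[] csvHeaders, List<String> knownHeaders) {
        Map<String, Integer> headerMap = new LinkedHashMap<>();
        for (int i = 0; i < csvHeaders.length; i++) {
            String possible = csvHeaders[i].trim(); // tolerate leading/trailing spaces
            for (String knownHeader : knownHeaders) {
                if (possible.equalsIgnoreCase(knownHeader)) {
                    headerMap.put(knownHeader, i);
                    // Known headers are assumed distinct, so at most one can match this column.
                    break;
                }
            }
        }
        return headerMap;
    }

    /** Lists the mandatory headers that no CSV column matched. */
    static List<String> missingMandatory(Map<String, Integer> headerMap, List<String> mandatoryHeaders) {
        List<String> missing = new ArrayList<>();
        for (String header : mandatoryHeaders) {
            if (!headerMap.containsKey(header)) {
                missing.add(header);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String[] csvHeaders = {" file FORMATS ", "Unknown", "Ignore Glob List"};
        Map<String, Integer> headerMap = matchHeaders(csvHeaders, List.of("File formats", "Ignore Glob List"));
        System.out.println("Parsed columns: " + headerMap.keySet());
    }
}
```

Note that the break only leaves the inner loop. If the CSV itself repeats a column, the later column silently overwrites the earlier index in headerMap, so in this sketch duplicates in the file would need a separate check.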
Then it seems that, in certain scenarios, duplicated column names would go undetected?
I have added a warning if there are duplicate headers detected in the CSV.
@Tejas2805 @dcshzj @fzdy1914 FYI, I found out that the existing code would already throw a runtime exception if there were any duplicate headers passed. The exception is thrown by line 75 here: RepoSense/src/main/java/reposense/parser/CsvParser.java Lines 69 to 76 in 8c05294
This issue is also present in the master branch. If you modify the repo-config.csv and edit 1 of the headers to be a duplicate of another header (e.g. change "Ignore Glob List" to "File formats"), then you would see this:
By default, this only picks up exact duplicates, i.e. it is case sensitive. I have added This means that RepoSense won't allow duplicate headers in the config files, so there is no need to worry about whether we take the leftmost or rightmost duplicate.
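As an illustration of what a case-insensitive duplicate check involves, here is a minimal standalone sketch. It is not the mechanism used in the PR (that detail is elided above): the class DuplicateHeaderCheck and its method are hypothetical, and trimming outer whitespace is an added assumption.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of RepoSense: finds headers that repeat once case is ignored. */
class DuplicateHeaderCheck {
    /** Returns the normalised (trimmed, lower-cased) form of every header that occurs more than once. */
    static Set<String> findDuplicates(String[] csvHeaders) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>(); // keeps first-seen order for stable messages
        for (String header : csvHeaders) {
            String key = header.trim().toLowerCase(Locale.ROOT);
            if (!seen.add(key)) { // add() returns false when the key was already present
                duplicates.add(key);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String[] csvHeaders = {"File formats", "Ignore Glob List", "file Formats "};
        Set<String> duplicates = findDuplicates(csvHeaders);
        if (!duplicates.isEmpty()) {
            System.out.println("Duplicate column headers (case-insensitive): " + duplicates);
        }
    }
}
```

Rejecting the file outright when this set is non-empty matches the conclusion above: with duplicates disallowed, the question of leftmost versus rightmost never arises.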
LGTM!
LGTM!
Guys, shouldn't this PR have been merged before the shallow cloning one? Anyway, let's merge this soon.
I don't think the two conflict, though.
That one introduced a new column into the CSV file that broke all the existing dashboards. This PR is supposed to ignore optional columns. In fact, this PR was created to prevent the former from happening. Anyway, it's not a big problem. I'll wait till this is merged to use the
I am sorry that I was not informed of this issue. Anyway, I have merged it just now.
The User Guide states that "RepoSense ignores the first row (i.e., column headings) of CSV config files". However, as of #1459, this is no longer true. Let's update the User Guide to reflect the new method of CSV parsing.