Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSVParser: check column headers when parsing config CSV files #1459

Merged
merged 12 commits into from
Mar 28, 2021

Conversation

vig42
Copy link
Contributor

@vig42 vig42 commented Mar 2, 2021

Currently, RepoSense parses the config CSV files purely based on
column order. This means that every column needs to be present in the
correct order (even if the column is optional) in order for the CSV
file to be parsed. This additionally means that every time we add a
new column to a config CSV, we would break compatibility with existing
config CSV files for all users.

Let’s change the CSVParser to verify and parse the config CSV files
based on the column headers. This would allow us to parse config CSVs
with columns in a different order or with some (optional) columns
missing, and would allow us to add new columns without breaking
compatibility.

@vig42 vig42 mentioned this pull request Mar 2, 2021
@damithc
Copy link
Collaborator

damithc commented Mar 2, 2021

As we are adding a restriction on column names, we'll also need good error reporting e.g., RepoSense should inform the user which column is missing, which column name is incorrect, etc.

@vig42
Copy link
Contributor Author

vig42 commented Mar 2, 2021

As we are adding a restriction on column names, we'll also need good error reporting e.g., RepoSense should inform the user which column is missing, which column name is incorrect, etc.

Right now if a mandatory column is missing, RepoSense would throw an InvalidCsvException with the name of the missing column. For optional columns, if the column is missing, there is no exception thrown (because it is optional). This allows us to add in new columns without breaking compatibility.

However, this also means that if an optional column header has a typo in it, the user would not get any indication. If a column header does not match any of the specified mandatory or optional columns, we could do some form of typo detection using string distance and warn the user if it is a close match.

Otherwise, we could provide a list of all the columns that were successfully parsed. This would allow the user to manually check if 1 of the columns was not recognized, and would probably help with debugging as well.

@damithc
Copy link
Collaborator

damithc commented Mar 2, 2021

Otherwise, we could provide a list of all the columns that were successfully parsed. This would allow the user to manually check if 1 of the columns was not recognized, and would probably help with debugging as well.

Yes, we should do this.

However, this also means that if an optional column header has a typo in it, the user would not get any indication. If a column header does not match any of the specified mandatory or optional columns, we could do some form of typo detection using string distance and warn the user if it is a close match.

Probably no need to cater for typos (at least not urgent) but should be able to deal with case-differences and whitespace differences (at least leading/trailing spaces).

@vig42
Copy link
Contributor Author

vig42 commented Mar 2, 2021

Otherwise, we could provide a list of all the columns that were successfully parsed. This would allow the user to manually check if 1 of the columns was not recognized, and would probably help with debugging as well.

Yes, we should do this.

I have added this. The log output looks like this:
12:01:31 - Parsed header of CSV file, group-config.csv, and recognized columns: Globs, Repository's Location, Group Name

Probably no need to cater for typos (at least not urgent) but should be able to deal with case-differences and whitespace differences (at least leading/trailing spaces).

Right now we're taking care of case differences by using the equalsIgnoreCase method to compare, and taking care of leading/trailing spaces by trimming the header before comparing.

@Tejas2805
Copy link
Contributor

@vig42 Is this ready for review?

@vig42
Copy link
Contributor Author

vig42 commented Mar 7, 2021

@vig42 Is this ready for review?

Yes, should be ready.

I have added some checks to account for leading/trailing spaces and upper/lower case, and I have added an info log to show the user which columns have been parsed from the CSV, so I think Prof @damithc's concerns should be taken care of.

Copy link
Member

@fzdy1914 fzdy1914 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is quite good, will proceed to review later

@fzdy1914 fzdy1914 requested review from a team and removed request for a team March 15, 2021 08:49
for (String parsedHeader : mandatoryHeaders()) {
if (possible.equalsIgnoreCase(parsedHeader)) {
headerMap.put(parsedHeader, i);
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the 'break' been added here? Is it not supposed to be checking for all the mandatory headers?

Copy link
Contributor Author

@vig42 vig42 Mar 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The break is there because I am assuming that the parsed column header would match with at most 1 of the mandatory headers, i.e. that we would not have a case where there are 2 mandatory headers which are identical.

So as soon as there is a match with 1 of the mandatory headers, there is no point in checking the remaining ones.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, it seems that in a certain scenario, duplicated column names cannot be found out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then, it seems that in a certain scenario, duplicated column names cannot be found out?

I have added a warning if there are duplicate headers detected in the CSV.

src/main/java/reposense/parser/CsvParser.java Outdated Show resolved Hide resolved
src/main/java/reposense/parser/CsvParser.java Show resolved Hide resolved
src/main/java/reposense/parser/CsvParser.java Outdated Show resolved Hide resolved
@vig42
Copy link
Contributor Author

vig42 commented Mar 24, 2021

@Tejas2805 @dcshzj @fzdy1914 FYI, I found out that the existing code would already throw a runtime exception if there were any duplicate headers passed. The exception is thrown by line 75 here:

public List<T> parse() throws IOException, InvalidCsvException {
List<T> results = new ArrayList<>();
Iterable<CSVRecord> records;
try (BufferedReader csvReader = new BufferedReader(new FileReader(csvFilePath.toFile()))) {
String[] header = getHeader(csvReader);
records = CSVFormat.DEFAULT.withIgnoreEmptyLines(false).withHeader(header).parse(csvReader);

This issue is also present in the master branch. If you modify the repo-config.csv and edit 1 of the headers to be a duplicate of another header (e.g. change "Ignore Glob List" to "File formats"), then you would see this:

Exception in thread "main" java.lang.IllegalArgumentException: The header contains a duplicate entry: 'File formats' in [Repository's Location, Branch, File formats, File formats, Ignore standalone config, Ignore Commit List, Ignore Authors List]
        at org.apache.commons.csv.CSVFormat.validate(CSVFormat.java:1466)
        at org.apache.commons.csv.CSVFormat.<init>(CSVFormat.java:700)
        at org.apache.commons.csv.CSVFormat.withHeader(CSVFormat.java:1746)
        at reposense.parser.CsvParser.parse(CsvParser.java:75)
        at reposense.RepoSense.getRepoConfigurations(RepoSense.java:109)
        at reposense.RepoSense.main(RepoSense.java:64)

By default, this only picks up exact duplicates, i.e. it is case sensitive. I have added .withTrim() and .withIgnoreHeaderCase() so that it will catch duplicates even if they are not in the same case and if they have leading/trailing spaces. I have also added a try/catch block to catch these runtime exceptions and instead throw an InvalidCsvException with the respective filename.

This means that RepoSense won't allow duplicate headers in the config files, so there is no need to worry about whether we take the leftmost or rightmost duplicate.

@vig42 vig42 requested a review from dcshzj March 24, 2021 07:54
Copy link
Member

@dcshzj dcshzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@fzdy1914 fzdy1914 requested a review from Tejas2805 March 25, 2021 05:31
Copy link
Contributor

@Tejas2805 Tejas2805 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@damithc
Copy link
Collaborator

damithc commented Mar 28, 2021

Guys, shouldn't this PR have been merged before the shallow cloning one? Anyway, let's merge this soon.

@fzdy1914
Copy link
Member

Guys, shouldn't this PR have been merged before the shallow cloning one? Anyway, let's merge this soon.

Guys, shouldn't this PR have been merged before the shallow cloning one? Anyway, let's merge this soon.

I don't think that two conflicts though.

@fzdy1914 fzdy1914 changed the title Check column headers when parsing config CSV files CSVParser: check column headers when parsing config CSV files Mar 28, 2021
@damithc
Copy link
Collaborator

damithc commented Mar 28, 2021

I don't think that two conflicts though.

That one introduced a new column into the csv file that broke all the existing dashboards. This PR is supposed to ignore optional columns. In fact, this PR was created to prevent the former from happening. Anyway, it's not a big problem. I'll wait till this is merged to use the master again. Just don't do another release until this is merged. @HCY123902

@fzdy1914 fzdy1914 merged commit 35d41b2 into reposense:master Mar 28, 2021
@fzdy1914
Copy link
Member

I don't think that two conflicts though.

That one introduced a new column into the csv file that broke all the existing dashboards. This PR is supposed to ignore optional columns. In fact, this PR was created to prevent the former from happening. Anyway, it's not a big problem. I'll wait till this is merged to use the master again. Just don't do another release until this is merged. @HCY123902

I am sorry that I was not informed of this issue. Anyway, I have merged it right now.

fzdy1914 pushed a commit that referenced this pull request Apr 23, 2021
User Guide states that "RepoSense ignores the first row (i.e., 
column headings) of CSV config files". However, as of #1459, this is 
no longer true.

Let's change the User Guide to reflect the new method of CSV 
parsing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants