CSVParser: check column headers when parsing config CSV files #1459

vig42 · 2021-03-02T12:41:18Z

Currently, RepoSense parses the config CSV files purely based on
column order. This means that every column needs to be present in the
correct order (even if the column is optional) in order for the CSV
file to be parsed. This additionally means that every time we add a
new column to a config CSV, we would break compatibility with existing
config CSV files for all users.

Let’s change the CSVParser to verify and parse the config CSV files
based on the column headers. This would allow us to parse config CSVs
with columns in a different order or with some (optional) columns
missing, and would allow us to add new columns without breaking
compatibility.

damithc · 2021-03-02T12:48:09Z

As we are adding a restriction on column names, we'll also need good error reporting e.g., RepoSense should inform the user which column is missing, which column name is incorrect, etc.

vig42 · 2021-03-02T13:05:56Z

As we are adding a restriction on column names, we'll also need good error reporting e.g., RepoSense should inform the user which column is missing, which column name is incorrect, etc.

Right now if a mandatory column is missing, RepoSense would throw an InvalidCsvException with the name of the missing column. For optional columns, if the column is missing, there is no exception thrown (because it is optional). This allows us to add in new columns without breaking compatibility.

However, this also means that if an optional column header has a typo in it, the user would not get any indication. If a column header does not match any of the specified mandatory or optional columns, we could do some form of typo detection using string distance and warn the user if it is a close match.

Otherwise, we could provide a list of all the columns that were successfully parsed. This would allow the user to manually check if 1 of the columns was not recognized, and would probably help with debugging as well.

damithc · 2021-03-02T14:35:56Z

Otherwise, we could provide a list of all the columns that were successfully parsed. This would allow the user to manually check if 1 of the columns was not recognized, and would probably help with debugging as well.

Yes, we should do this.

However, this also means that if an optional column header has a typo in it, the user would not get any indication. If a column header does not match any of the specified mandatory or optional columns, we could do some form of typo detection using string distance and warn the user if it is a close match.

Probably no need to cater for typos (at least not urgent) but should be able to deal with case-differences and whitespace differences (at least leading/trailing spaces).

vig42 · 2021-03-02T16:14:36Z

Otherwise, we could provide a list of all the columns that were successfully parsed. This would allow the user to manually check if 1 of the columns was not recognized, and would probably help with debugging as well.

Yes, we should do this.

I have added this. The log output looks like this:
12:01:31 - Parsed header of CSV file, group-config.csv, and recognized columns: Globs, Repository's Location, Group Name

Probably no need to cater for typos (at least not urgent) but should be able to deal with case-differences and whitespace differences (at least leading/trailing spaces).

Right now we're taking care of case differences by using the equalsIgnoreCase method to compare, and taking care of leading/trailing spaces by trimming the header before comparing.

Tejas2805 · 2021-03-07T08:47:41Z

@vig42 Is this ready for review?

vig42 · 2021-03-07T13:37:38Z

@vig42 Is this ready for review?

Yes, should be ready.

I have added some checks to account for leading/trailing spaces and upper/lower case, and I have added an info log to show the user which columns have been parsed from the CSV, so I think Prof @damithc's concerns should be taken care of.

fzdy1914

The idea is quite good, will proceed to review later

Tejas2805 · 2021-03-15T15:38:26Z

src/main/java/reposense/parser/CsvParser.java

+            for (String parsedHeader : mandatoryHeaders()) {
+                if (possible.equalsIgnoreCase(parsedHeader)) {
+                    headerMap.put(parsedHeader, i);
+                    break;


Why is the 'break' been added here? Is it not supposed to be checking for all the mandatory headers?

The break is there because I am assuming that the parsed column header would match with at most 1 of the mandatory headers, i.e. that we would not have a case where there are 2 mandatory headers which are identical.

So as soon as there is a match with 1 of the mandatory headers, there is no point in checking the remaining ones.

Then, it seems that in a certain scenario, duplicated column names cannot be found out?

Then, it seems that in a certain scenario, duplicated column names cannot be found out?

I have added a warning if there are duplicate headers detected in the CSV.

src/main/java/reposense/parser/CsvParser.java

vig42 · 2021-03-24T07:41:41Z

@Tejas2805 @dcshzj @fzdy1914 FYI, I found out that the existing code would already throw a runtime exception if there were any duplicate headers passed. The exception is thrown by line 75 here:

RepoSense/src/main/java/reposense/parser/CsvParser.java

Lines 69 to 76 in 8c05294

    
           public List<T> parse() throws IOException, InvalidCsvException { 
        
               List<T> results = new ArrayList<>(); 
        
               Iterable<CSVRecord> records; 
        
               try (BufferedReader csvReader = new BufferedReader(new FileReader(csvFilePath.toFile()))) { 
        
                   String[] header = getHeader(csvReader); 
        
                   records = CSVFormat.DEFAULT.withIgnoreEmptyLines(false).withHeader(header).parse(csvReader);

This issue is also present in the master branch. If you modify the repo-config.csv and edit 1 of the headers to be a duplicate of another header (e.g. change "Ignore Glob List" to "File formats"), then you would see this:

Exception in thread "main" java.lang.IllegalArgumentException: The header contains a duplicate entry: 'File formats' in [Repository's Location, Branch, File formats, File formats, Ignore standalone config, Ignore Commit List, Ignore Authors List]
        at org.apache.commons.csv.CSVFormat.validate(CSVFormat.java:1466)
        at org.apache.commons.csv.CSVFormat.<init>(CSVFormat.java:700)
        at org.apache.commons.csv.CSVFormat.withHeader(CSVFormat.java:1746)
        at reposense.parser.CsvParser.parse(CsvParser.java:75)
        at reposense.RepoSense.getRepoConfigurations(RepoSense.java:109)
        at reposense.RepoSense.main(RepoSense.java:64)

By default, this only picks up exact duplicates, i.e. it is case sensitive. I have added .withTrim() and .withIgnoreHeaderCase() so that it will catch duplicates even if they are not in the same case and if they have leading/trailing spaces. I have also added a try/catch block to catch these runtime exceptions and instead throw an InvalidCsvException with the respective filename.

This means that RepoSense won't allow duplicate headers in the config files, so there is no need to worry about whether we take the leftmost or rightmost duplicate.

dcshzj

LGTM!

Tejas2805

LGTM!

damithc · 2021-03-28T08:52:48Z

Guys, shouldn't this PR have been merged before the shallow cloning one? Anyway, let's merge this soon.

fzdy1914 · 2021-03-28T14:53:35Z

Guys, shouldn't this PR have been merged before the shallow cloning one? Anyway, let's merge this soon.

I don't think that two conflicts though.

damithc · 2021-03-28T14:58:56Z

I don't think that two conflicts though.

That one introduced a new column into the csv file that broke all the existing dashboards. This PR is supposed to ignore optional columns. In fact, this PR was created to prevent the former from happening. Anyway, it's not a big problem. I'll wait till this is merged to use the master again. Just don't do another release until this is merged. @HCY123902

fzdy1914 · 2021-03-28T15:12:33Z

I don't think that two conflicts though.

That one introduced a new column into the csv file that broke all the existing dashboards. This PR is supposed to ignore optional columns. In fact, this PR was created to prevent the former from happening. Anyway, it's not a big problem. I'll wait till this is merged to use the master again. Just don't do another release until this is merged. @HCY123902

I am sorry that I was not informed of this issue. Anyway, I have merged it right now.

User Guide states that "RepoSense ignores the first row (i.e., column headings) of CSV config files". However, as of #1459, this is no longer true. Let's change the User Guide to reflect the new method of CSV parsing.

vig42 added 4 commits March 2, 2021 16:28

Implement header parsing

b386be6

Add tests

245b107

Fix config files

aa86748

Fix checkstyle

79036af

vig42 mentioned this pull request Mar 2, 2021

Implement shallow cloning #1449

Merged

Add log of parsed columns

17a1fca

Merge branch 'master' into csv-parse-headers

1c74d56

fzdy1914 reviewed Mar 15, 2021

View reviewed changes

fzdy1914 requested review from a team and removed request for a team March 15, 2021 08:49

Tejas2805 reviewed Mar 15, 2021

View reviewed changes

Add warning for duplicate headers in csv

92010d0

fzdy1914 added the s.ToReview label Mar 22, 2021

dcshzj requested changes Mar 24, 2021

View reviewed changes

src/main/java/reposense/parser/CsvParser.java Outdated Show resolved Hide resolved

src/main/java/reposense/parser/CsvParser.java Show resolved Hide resolved

src/main/java/reposense/parser/CsvParser.java Outdated Show resolved Hide resolved

vig42 added 2 commits March 24, 2021 14:30

Make headerSize a local variable

3c90b3a

Add test for duplicate headers

dbedf9f

vig42 requested a review from dcshzj March 24, 2021 07:54

dcshzj approved these changes Mar 25, 2021

View reviewed changes

fzdy1914 self-requested a review March 25, 2021 05:31

fzdy1914 added s.FinalReview and removed s.ToReview labels Mar 25, 2021

fzdy1914 requested a review from Tejas2805 March 25, 2021 05:31

Tejas2805 approved these changes Mar 25, 2021

View reviewed changes

vig42 added 3 commits March 28, 2021 15:33

Merge branch 'master' into csv-parse-headers

7218f00

Fix csv column headers

4ef4695

Fix csv

4cab6e8

fzdy1914 changed the title ~~Check column headers when parsing config CSV files~~ CSVParser: check column headers when parsing config CSV files Mar 28, 2021

fzdy1914 merged commit 35d41b2 into reposense:master Mar 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CSVParser: check column headers when parsing config CSV files #1459

CSVParser: check column headers when parsing config CSV files #1459

vig42 commented Mar 2, 2021

damithc commented Mar 2, 2021

vig42 commented Mar 2, 2021 •

edited

damithc commented Mar 2, 2021

vig42 commented Mar 2, 2021

Tejas2805 commented Mar 7, 2021

vig42 commented Mar 7, 2021

fzdy1914 left a comment

Tejas2805 Mar 15, 2021

vig42 Mar 15, 2021 •

edited

fzdy1914 Mar 16, 2021

vig42 Mar 18, 2021

vig42 commented Mar 24, 2021 •

edited

dcshzj left a comment

Tejas2805 left a comment

damithc commented Mar 28, 2021

fzdy1914 commented Mar 28, 2021

damithc commented Mar 28, 2021

fzdy1914 commented Mar 28, 2021

CSVParser: check column headers when parsing config CSV files #1459

CSVParser: check column headers when parsing config CSV files #1459

Conversation

vig42 commented Mar 2, 2021

damithc commented Mar 2, 2021

vig42 commented Mar 2, 2021 • edited

damithc commented Mar 2, 2021

vig42 commented Mar 2, 2021

Tejas2805 commented Mar 7, 2021

vig42 commented Mar 7, 2021

fzdy1914 left a comment

Choose a reason for hiding this comment

Tejas2805 Mar 15, 2021

Choose a reason for hiding this comment

vig42 Mar 15, 2021 • edited

Choose a reason for hiding this comment

fzdy1914 Mar 16, 2021

Choose a reason for hiding this comment

vig42 Mar 18, 2021

Choose a reason for hiding this comment

vig42 commented Mar 24, 2021 • edited

dcshzj left a comment

Choose a reason for hiding this comment

Tejas2805 left a comment

Choose a reason for hiding this comment

damithc commented Mar 28, 2021

fzdy1914 commented Mar 28, 2021

damithc commented Mar 28, 2021

fzdy1914 commented Mar 28, 2021

vig42 commented Mar 2, 2021 •

edited

vig42 Mar 15, 2021 •

edited

vig42 commented Mar 24, 2021 •

edited