problems with data? #10
Comments
I have the same problem and am anxiously waiting for a reply.
The file is a TSV file, so the fields within each row are separated by tabs. I have to check the users table.
I have the same problem with point number 3: multiple duplicated users in users.csv.
I have to check the script there and will get back to you soon. For now, just use them as separate users.
Any answer for question 2 above, posted by aiolli? "The number of values in the users.csv rows varies. Sometimes 12 values are present in a row and sometimes only 11 (counting lists as a single value, clearly). How should this be interpreted? Which field value is actually missing (perhaps the last one)?" Either the last value or the region value is missing. In both cases this contradicts what the documentation says. Please clarify this issue. Thanks.
The duplicated users are true replicates. Sorry about that. We believe this is not a big issue; just remove the replicates. Daniel
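Removing the replicates could look roughly like the sketch below. It assumes that "true replicates" means entire rows are identical, which the thread suggests but does not confirm; `drop_duplicate_rows` and the sample rows are hypothetical and not part of the dataset tooling.

```python
def drop_duplicate_rows(lines):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for line in lines:
        # Compare rows without the line ending so "\n" vs "\r\n" does not matter.
        key = line.rstrip("\r\n")
        if key in seen:
            continue
        seen.add(key)
        unique.append(line)
    return unique


# Hypothetical two-column rows; the real users.csv has 12 tab-separated fields.
rows = ["1\ta\n", "2\tb\n", "1\ta\n"]
print(len(drop_duplicate_rows(rows)))  # prints 2
```

If the replicated rows turn out to share only the user id and differ in other fields, the key would have to be the id column instead of the whole row.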
@danct again, if you split by \t, all lines in the user file have 12 values. Hope that helps.
Thanks Daniel. Indeed there is a tab at the end of each line with 11 values. This means it is always the value for the last field (edu_fieldofstudies) that is missing for about half of the lines. According to the provided documentation, there should have been a 0 instead of no value. "edu_fieldofstudies comma-separated fields of studies that the user studied. 0 means "unknown" [...]" Anyhow, it's an easy fix. Thanks again.
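For anyone hitting the same thing, the fix could be sketched as below: split each row on tabs only and treat an empty last field as "0". The function name and the placeholder rows are hypothetical; the only facts taken from this thread are the 12 tab-separated values per row and that the last field is edu_fieldofstudies.

```python
EXPECTED_FIELDS = 12  # per this thread: every row has 12 tab-separated values


def parse_user_line(line):
    """Split one users.csv row on tabs and fill an empty last field with "0"."""
    # Strip only the line ending; stripping all whitespace would eat the trailing tab.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    # An empty last field (edu_fieldofstudies) is treated as "0", i.e. unknown.
    if fields[-1] == "":
        fields[-1] = "0"
    return fields


# Hypothetical rows: 11 placeholder values followed by the last field.
full_row = "\t".join(["x"] * 11 + ["3,5"]) + "\n"
short_row = "\t".join(["x"] * 11) + "\t\n"  # trailing tab, last value missing

print(parse_user_line(full_row)[-1])   # prints 3,5
print(parse_user_line(short_row)[-1])  # prints 0
```

Calling `line.strip()` or `line.split()` without arguments would drop the trailing tab, which is why such rows appear to have only 11 values.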
Alrighty :)
Hi all,
We found some problems with the training data:
1. The files are not in a real CSV format, are they? We are not aware of CSV formats that support fields which are lists.
2. The number of values in the users.csv rows varies. Sometimes 12 values are present in a row and sometimes only 11 (counting lists as a single value, clearly). How should this be interpreted? Which field value is actually missing (perhaps the last one)?
3. There are many replicated users in the file users.csv. In particular, there are only 1367057 unique users out of 1500000!
Thanks for your answer!