problems with data? #10

Closed
aiolli opened this issue Mar 7, 2016 · 9 comments

Comments

@aiolli

aiolli commented Mar 7, 2016

Hi all,

We found some problems with the training data:

  1. The files are not in a real CSV format, are they? We are not aware of CSV formats that support fields which are lists.

  2. The number of values in the users.csv rows varies. Sometimes 12 values are present in a row and sometimes only 11 (counting lists as a single value, clearly). How should this be interpreted? Which field value is actually missing (perhaps the last one)?

  3. There are many replicated users in the file users.csv. In particular, there are 1367057 unique users out of 1500000!

Thanks for your answer!

@mogami95

mogami95 commented Mar 8, 2016

I have the same problem. Anxiously waiting for a reply.

@dkohlsdorf
Contributor

The file is a TSV file, so the fields within each row are separated by tabs. I have to check the users table.
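A minimal sketch of reading such a file, assuming list-valued fields are comma-separated inside a single tab-delimited field. The sample rows and their three-field layout are made up for illustration and are not the real users.csv schema.

```python
import csv
import io

# Made-up, tab-separated sample; the second field holds a comma-separated list.
sample = "1\t10,20,30\tde\n2\t40\tat\n"

# QUOTE_NONE treats quote characters as ordinary data, which suits plain TSV.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [row for row in reader]

# Split the list-valued field on commas after the tab split.
parsed = [(row[0], row[1].split(","), row[2]) for row in rows]
print(parsed)
```

Because the tab split happens first, the commas inside a list field never interfere with the column boundaries.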
Hope that helps,
Daniel

@creat89

creat89 commented Mar 10, 2016

I have the same problem with point number 3: multiple duplicated users in users.csv.

@dkohlsdorf
Contributor

I have to check the script there and will get back to you soon. For now, just treat them as separate users.

@danct

danct commented Mar 16, 2016

Any answer for question 2 above posted by aiolli?

"the number of values in the users.csv rows varies. sometimes 12 values are present in a row and sometimes only 11 (counting lists as a single value, clearly). how should be interpreted this? which field value is actually missing (perhaps the last one?)"

Either the last value or the region value is missing. In both cases, this contradicts what the documentation says.

Please clarify this issue.

Thanks.

@dkohlsdorf
Contributor

The users are true replicates. Sorry about that. We believe that is not a big issue; just remove the replicates.
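A minimal sketch of removing them, assuming "true replicates" means whole rows that are byte-for-byte identical. The helper name `drop_duplicate_rows` and the sample rows are hypothetical.

```python
def drop_duplicate_rows(lines):
    """Return the lines with exact repeats removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


# Made-up rows for illustration; real input would be the lines of users.csv.
rows = ["1\ta\tx", "2\tb\ty", "1\ta\tx", "3\tc\tz", "2\tb\ty"]
unique_rows = drop_duplicate_rows(rows)
print(len(rows), len(unique_rows))
```

Comparing whole lines avoids any assumption about which column holds the user identifier.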

Daniel

@dkohlsdorf
Contributor

@danct again, if you split by \t, all lines in the user file have 12 values.
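A minimal sketch of how a count of 11 can arise, assuming the short lines end with a tab followed by an empty last field. The rows are made up, and both helper names are hypothetical.

```python
# Hypothetical 12-field rows; the values are made up for illustration.
full_row = "\t".join(["1", "10,20", "3", "5", "7", "de", "2", "1", "4", "2", "0", "8,9"]) + "\n"
# Same layout, but the last field is empty, so the line ends with a tab.
short_row = "\t".join(["2", "30", "3", "5", "7", "de", "2", "1", "4", "2", "0", ""]) + "\n"


def split_fields(line):
    """Split a raw line on tabs, removing only the line terminator."""
    return line.rstrip("\r\n").split("\t")


def split_fields_lossy(line):
    """Strip all surrounding whitespace first, which also eats a trailing tab."""
    return line.strip().split("\t")


for row in (full_row, short_row):
    print(len(split_fields(row)), len(split_fields_lossy(row)))
```

Stripping only the line terminator keeps the trailing empty field, so every row yields 12 values; stripping all whitespace drops it and yields 11 for the short rows.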

Hope that helps
Daniel

@danct

danct commented Mar 18, 2016

Thanks, Daniel. Indeed, every line with 11 values ends with a tab. This means the missing value is always the last field (edu_fieldofstudies), which is empty in about half of the lines.

According to the provided documentation, there should have been a 0 instead of no value.

"edu_fieldofstudies comma-separated fields of studies that the user studied. 0 means "unknown" [...]"

Anyhow, it's an easy fix.
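A minimal sketch of that fix, assuming each row has 12 tab-separated fields and that an empty last field should become the documented "unknown" marker 0. The function name `fill_missing_last_field` is hypothetical.

```python
UNKNOWN = "0"  # the documentation's marker for "unknown" in edu_fieldofstudies


def fill_missing_last_field(line, expected_fields=12):
    """Split a row on tabs and replace an empty last field with the unknown marker."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != expected_fields:
        raise ValueError(f"expected {expected_fields} fields, got {len(fields)}")
    if fields[-1] == "":
        fields[-1] = UNKNOWN
    return fields


# Made-up row whose last field is empty (the line ends with a tab).
short_row = "\t".join(["2", "30", "3", "5", "7", "de", "2", "1", "4", "2", "0", ""]) + "\n"
fixed = fill_missing_last_field(short_row)
print(fixed[-1])
```

Rows that already carry a value in the last field pass through unchanged, and rows with an unexpected field count raise an error instead of being silently altered.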

Thanks again.

@dkohlsdorf
Contributor

Alrighty :)
