Explanation on the ai_service_2_confidence column in keywords.tsv000 (range seems weird) #39

Closed
jeanmidevacc opened this issue Jul 28, 2021 · 2 comments · Fixed by #40
jeanmidevacc commented Jul 28, 2021

Describe the bug
Hello,

I was looking at the data from the lite dataset this morning and noticed something odd in the 'ai_service_2_confidence' column of the keywords.tsv000 file.

When I computed summary statistics on the ai_service columns, 'ai_service_2_confidence' turned out to contain extreme values above 100, which I would expect to be the maximum (taking 'ai_service_1_confidence' as the reference, for example).

To Reproduce

Here is the code to compute the stats:

import pandas as pd
dfp_keywords_raw = pd.read_csv('keywords.tsv000', sep='\t', header=0)
dfp_keywords_raw[['ai_service_1_confidence', 'ai_service_2_confidence']].describe()

Steps to reproduce the behavior:
Run the code above in a Python 3.6.13 environment with pandas 1.1.5 installed.

Expected behavior
I expect the values in the 'ai_service_2_confidence' column of keywords.tsv000 to be between 0 and 100. If that is not the intended range, the documentation should describe the column more precisely (for example, by stating its range).
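The expectation above can be turned into a quick check. This is only a sketch: the helper name `out_of_range_counts` and the toy rows are made up for illustration, and the 0 to 100 bounds are the range assumed in this report, not a documented guarantee.

```python
import pandas as pd

# Column names taken from keywords.tsv000 as described in this issue.
CONFIDENCE_COLUMNS = ["ai_service_1_confidence", "ai_service_2_confidence"]


def out_of_range_counts(df, columns=CONFIDENCE_COLUMNS, low=0.0, high=100.0):
    """Count, per column, the values outside [low, high]. NaN is not counted."""
    counts = {}
    for col in columns:
        values = df[col]
        # Comparisons against NaN evaluate to False, so missing values are skipped.
        counts[col] = int(((values < low) | (values > high)).sum())
    return counts


# Toy stand-in for the real file (hypothetical rows, not actual dataset values).
sample = pd.DataFrame({
    "ai_service_1_confidence": [99.1, 45.0, None],
    "ai_service_2_confidence": [9657.65, 88.2, 12.5],
})
print(out_of_range_counts(sample))
```

On the real data, the same function could be called on the DataFrame returned by the `pd.read_csv` call shown earlier.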

Additional context
I have a list of the keywords that seem to be affected by these extreme values:
unsplash_extreme_value.zip

Hope this helps with your investigation 🕵️‍♀️ (and I hope it's not just me missing something).

PS: your dataset is great, by the way (I really hope to get access to the full version soon) 👍

@jeanmidevacc jeanmidevacc added the bug Something isn't working label Jul 28, 2021
Member

TimmyCarbone commented Jul 29, 2021

@jeanmidevacc I've looked into it, and it looks like you can divide the values that are > 100 by 100.
For example, if you see confidence = 9657.65, the actual confidence on the 0-100 scale is 96.5765.
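Until a fixed release is available, the workaround described above could be applied locally along these lines. This is a minimal sketch, not the fix that went into the dataset: `rescale_confidence` is a hypothetical helper, and it assumes every value above 100 is off by exactly a factor of 100.

```python
import pandas as pd


def rescale_confidence(series, threshold=100.0, factor=100.0):
    """Divide values above `threshold` by `factor`; leave all others unchanged.

    Assumption: every value above the threshold was inflated by exactly `factor`.
    NaN values are left untouched because `NaN > threshold` is False.
    """
    return series.mask(series > threshold, series / factor)


# Hypothetical values; 9657.65 is the example quoted in this thread.
raw = pd.Series([9657.65, 88.2, 100.0])
fixed = rescale_confidence(raw)
print(fixed.tolist())
```

With the DataFrame from the original report, this would be applied as `dfp_keywords_raw['ai_service_2_confidence'] = rescale_confidence(dfp_keywords_raw['ai_service_2_confidence'])`. Values at or below 100 are never modified, so any mis-scaled value that happens to fall inside the valid range would go undetected.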

This is obviously an issue in the dataset and I'm adding this fix to the next release that's coming up this week.

Thank you for catching it and for describing the issue the way you did!

@TimmyCarbone TimmyCarbone added this to To do in 1.2.0 Jul 29, 2021
@TimmyCarbone TimmyCarbone moved this from To do to Done in 1.2.0 Jul 30, 2021
jeanmidevacc (Author) commented

Great, thanks for the update (and for handling the issue so quickly).
