New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed a wrong data entrry at boston_house_prices.csv, Issue #10801 #10795

Merged
merged 1 commit into from Mar 13, 2018

Conversation

Projects
None yet
4 participants
@tarcusx
Contributor

tarcusx commented Mar 12, 2018

Reference Issues/PRs

Fixes #10801.

What does this implement/fix? Explain your changes.

There is a wrong data entry at boston_house_prices.csv compared with the original data: http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

Any other comments?

To create a original data csv file for checking:

wget http://lib.stat.cmu.edu/datasets/boston
tail -n+23 boston | awk '{if(NR%2){ORS=" "}else{ORS="\n"};print;}' | awk 'BEGIN { OFS="," } {$1=$1; print}' > boston.csv

To compare the sklearn load_boston data with the original data:

In [1]: import numpy as np
In [2]: orig_data = np.loadtxt("./boston.csv", delimiter=",")
In [3]: from sklearn.datasets import load_boston
In [4]: boston = load_boston()
In [5]: sk_data = np.column_stack((boston.data, boston.target))
In [6]: check = (sk_data == orig_data)
In [7]: np.where(check==False)
Out[7]: (array([445]), array([0]))
In [8]: orig_data[445,0]
Out[8]: 10.671799999999999
In [9]: sk_data[445,0]
Out[9]: 0.67179999999999995

@tarcusx tarcusx changed the title from fixed a wrong data entrry at boston_house_prices.csv to Fixed a wrong data entrry at boston_house_prices.csv Mar 12, 2018

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Mar 12, 2018

Member

Thanks. Please mention this with .. versionchanged:: 0.20 in the loader docstring.

Please also add a Bug Fix entry to the change log at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

Member

jnothman commented Mar 12, 2018

Thanks. Please mention this with .. versionchanged:: 0.20 in the loader docstring.

Please also add a Bug Fix entry to the change log at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Mar 12, 2018

Member

Just for completeness, we indeed seem to have a data point wrong. I checked with the data from UCI:

import pandas as pd
from sklearn.datasets import load_boston

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data', sep='\s+', header=None)
X, y = load_boston(return_X_y=True)

print(df.iloc[445])
print(X[445])

Output:

0      10.6718
1       0.0000
2      18.1000
3       0.0000
4       0.7400
5       6.4590
6      94.8000
7       1.9879
8      24.0000
9     666.0000
10     20.2000
11     43.0600
12     23.9800
13     11.8000
Name: 445, dtype: float64
array([  0.6718,   0.    ,  18.1   ,   0.    ,   0.74  ,   6.459 ,
        94.8   ,   1.9879,  24.    , 666.    ,  20.2   ,  43.06  ,
        23.98  ])
Member

lesteve commented Mar 12, 2018

Just for completeness, we indeed seem to have a data point wrong. I checked with the data from UCI:

import pandas as pd
from sklearn.datasets import load_boston

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data', sep='\s+', header=None)
X, y = load_boston(return_X_y=True)

print(df.iloc[445])
print(X[445])

Output:

0      10.6718
1       0.0000
2      18.1000
3       0.0000
4       0.7400
5       6.4590
6      94.8000
7       1.9879
8      24.0000
9     666.0000
10     20.2000
11     43.0600
12     23.9800
13     11.8000
Name: 445, dtype: float64
array([  0.6718,   0.    ,  18.1   ,   0.    ,   0.74  ,   6.459 ,
        94.8   ,   1.9879,  24.    , 666.    ,  20.2   ,  43.06  ,
        23.98  ])

@tarcusx tarcusx changed the title from Fixed a wrong data entrry at boston_house_prices.csv to Fixed issue#10801, a wrong data entrry at boston_house_prices.csv Mar 12, 2018

@tarcusx tarcusx changed the title from Fixed issue#10801, a wrong data entrry at boston_house_prices.csv to Fixed a wrong data entrry at boston_house_prices.csv, Issue #10801 Mar 12, 2018

@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Mar 12, 2018

Member

@tarcusx just a FYI there was no need to open an associated issue in this case, since your PR explains the problem quite well already.

When you have a PR that fixes an issue, the best practice is to use "Fix #issueNumber" in your PR description, this way the associated issue gets closed automatically when the PR is merged. For more details, look at this. I have edited your PR description for you, but try to remember for next time.

Member

lesteve commented Mar 12, 2018

@tarcusx just a FYI there was no need to open an associated issue in this case, since your PR explains the problem quite well already.

When you have a PR that fixes an issue, the best practice is to use "Fix #issueNumber" in your PR description, this way the associated issue gets closed automatically when the PR is merged. For more details, look at this. I have edited your PR description for you, but try to remember for next time.

@tarcusx

This comment has been minimized.

Show comment
Hide comment
@tarcusx

tarcusx Mar 12, 2018

Contributor

@lesteve Thank you.
So this time I can use the Issue number for the doc string and change log comments.
When there is no issue related with the PR, which number should I use for :issue: in the docs?
The PR number?

Contributor

tarcusx commented Mar 12, 2018

@lesteve Thank you.
So this time I can use the Issue number for the doc string and change log comments.
When there is no issue related with the PR, which number should I use for :issue: in the docs?
The PR number?

@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Mar 12, 2018

Member

When there is no issue related with the PR, which number should I use for :issue: in the docs?
The PR number?

The PR number is completely fine when there is no issue.

Member

lesteve commented Mar 12, 2018

When there is no issue related with the PR, which number should I use for :issue: in the docs?
The PR number?

The PR number is completely fine when there is no issue.

@tarcusx

This comment has been minimized.

Show comment
Hide comment
@tarcusx

tarcusx Mar 12, 2018

Contributor

@jnothman Updated.

Contributor

tarcusx commented Mar 12, 2018

@jnothman Updated.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Mar 12, 2018

Member

I prefer PR numbers in the change log, actually

Member

jnothman commented Mar 12, 2018

I prefer PR numbers in the change log, actually

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Mar 12, 2018

Member

I would also prefer if you added, rather than amended, commits

Member

jnothman commented Mar 12, 2018

I would also prefer if you added, rather than amended, commits

@tarcusx

This comment has been minimized.

Show comment
Hide comment
@tarcusx

tarcusx Mar 12, 2018

Contributor

I prefer PR numbers in the change log, actually
I would also prefer if you added, rather than amended, commits

I see. Next time, I'll do that.
Thanks.

Contributor

tarcusx commented Mar 12, 2018

I prefer PR numbers in the change log, actually
I would also prefer if you added, rather than amended, commits

I see. Next time, I'll do that.
Thanks.

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Mar 12, 2018

Member
Member

jnothman commented Mar 12, 2018

@qinhanmin2014

LGTM. Thanks @tarcusx :)

@qinhanmin2014 qinhanmin2014 merged commit 8585275 into scikit-learn:master Mar 13, 2018

8 checks passed

ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: python2 Your tests passed on CircleCI!
Details
ci/circleci: python3 Your tests passed on CircleCI!
Details
codecov/patch Coverage not affected when comparing 47ce5e1...cb68d6b
Details
codecov/project 95.02% (+0.01%) compared to 47ce5e1
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
lgtm analysis: Python No alert changes
Details
@lesteve

This comment has been minimized.

Show comment
Hide comment
@lesteve

lesteve Mar 13, 2018

Member

Thanks @tarcusx, just curious, how did you notice that a data point was wrong?

Member

lesteve commented Mar 13, 2018

Thanks @tarcusx, just curious, how did you notice that a data point was wrong?

@tarcusx

This comment has been minimized.

Show comment
Hide comment
@tarcusx

tarcusx Mar 13, 2018

Contributor

@lesteve
I did a machine learning tutorial which downloads the boston_house_prices.csv and loads it.
I used load_boston() because I'm lazy...
And noticed the result had a slight difference.
Even I'm lazy, I was very curious about the reason so investigated it ;)

Contributor

tarcusx commented Mar 13, 2018

@lesteve
I did a machine learning tutorial which downloads the boston_house_prices.csv and loads it.
I used load_boston() because I'm lazy...
And noticed the result had a slight difference.
Even I'm lazy, I was very curious about the reason so investigated it ;)

@tarcusx tarcusx deleted the tarcusx:boston_house_prices_fix branch Mar 13, 2018

qinhanmin2014 referenced this pull request May 11, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment