[MRG+1] Take over PR #7647 - Add a "filename" attribute to datasets that have a CSV file #9101

Merged on Dec 4, 2017 (23 commits)
+15 −2
add examples of numpy.loadtxt usage

alex-33 authored and maskani-moh committed Oct 18, 2016
commit 6532a9499e710aa8ca93be0113c8a08fe0dd5a11
@@ -144,8 +144,8 @@ learn::
.. topic:: Loading from the data files

jnothman (Member), Sep 26, 2017:

I don't get why this belongs in the tutorial, unless it's framed as "You can also load your own data. For example, load_boston(...) just pulls in data using numpy.loadtxt::". This currently appears to be too much detail on the internals of scikit-learn.

maskani-moh (Contributor), Sep 26, 2017:

@jnothman, I agree with you, no need to mention the filename attribute in the tutorial. Too much detail for a tutorial.
Should I remove this section then?

All standard datasets which you can import with ``load_`` have underlying source files that
-you can read manually (consider numpy.loadtxt and pandas for analysis).
-The data and target can be stored in one file (e.g. iris, boston, breast_cancer) or
+you can read manually (consider :func:`numpy.loadtxt` and `pandas <http://pandas.pydata.org/>`_
+for analysis). The data and target can be stored in one file (e.g. iris, boston, breast_cancer) or
in several (e.g. diabetes, linnerud).
>>> from sklearn.datasets import load_boston
@@ -160,6 +160,19 @@ learn::
>>> print(diabetes.target_filename) # doctest: +SKIP
(some-path)/sklearn/datasets/data/diabetes_target.csv.gz
+Example of reading data file with numpy. Boston dataset contains
+2 header lines, that is why we are going to skip them:
+>>> import numpy as np
+>>> boston_data = np.loadtxt(boston.filename, delimiter=",", skiprows=2)
+>>> boston.data.shape # sklearn dataset
+(506, 13)
+>>> boston_data.shape # also contains target columns
+(506, 14)
+See also:
+:func:`pandas.read_csv`
Learning and predicting
------------------------
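Note on the added ``numpy.loadtxt`` lines: the same pattern can be tried without scikit-learn's bundled files. The sketch below is only an illustration under stated assumptions. The CSV text is a hypothetical stand-in (two header lines, three feature columns, one trailing target column) and is not the layout of any real scikit-learn data file.

```python
import io

import numpy as np

# Hypothetical stand-in for a bundled CSV file: two header lines, then rows
# holding three feature columns followed by one target column.
csv_text = """3,3,toy
f1,f2,f3,target
1.0,2.0,3.0,10.0
4.0,5.0,6.0,20.0
7.0,8.0,9.0,30.0
"""

# skiprows=2 drops both header lines, mirroring the Boston example in the diff.
raw = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=2)

# The raw array still contains the target column; split it off explicitly.
data, target = raw[:, :-1], raw[:, -1]

print(raw.shape)   # (3, 4)
print(data.shape)  # (3, 3)
print(target)      # [10. 20. 30.]
```

As in the diff, the raw array has one more column than the feature matrix because the target is stored in the same file.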
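Note on the "See also" pointer to ``pandas.read_csv``: a roughly equivalent read with pandas might look like the sketch below. It assumes a hypothetical CSV whose first line is metadata and whose second line holds column names. The column names (``f1``, ``f2``, ``f3``, ``target``) are invented for illustration and do not come from the pull request.

```python
import io

import pandas as pd

# Hypothetical CSV: one metadata line, one line of column names, then rows.
csv_text = """2,3,toy
f1,f2,f3,target
1.0,2.0,3.0,10.0
4.0,5.0,6.0,20.0
"""

# skiprows=1 drops the metadata line; the next line becomes the column names.
frame = pd.read_csv(io.StringIO(csv_text), skiprows=1)

# Separate the feature columns from the target column by name.
features = frame.drop(columns="target")
target = frame["target"]

print(list(frame.columns))  # ['f1', 'f2', 'f3', 'target']
print(features.shape)       # (2, 3)
```

Compared with ``numpy.loadtxt``, this keeps the column names, so the target can be selected by name rather than by position.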