Data.frame to data frame #145

Open
wants to merge 2 commits into
base: gh-pages
from

7 participants
@jrnold
Contributor

jrnold commented Jun 13, 2016

  • Replace instances of data.frame with "data frame" which is how the R documentation refers to them; see ?data.frame.
  • Note that data frames are lists in which each vector has the same length
  • Add tip that csv files can be created/edited within RStudio as a new text file.

jrnold added some commits Jun 13, 2016

Replace data.frame with data frame
Replace instances of data.frame that were not code with "data frame", which is how the R documentation refers to them; see ?data.frame.
@aammd

Contributor

aammd commented Jun 13, 2016

I had originally commented on this over in #144, but this is a better place for the whole discussion! My original comment is below (supporting the current way of doing things, i.e. writing data.frame):

I think the reasoning behind this is to match the output of class(). It may also help to get learners to focus on a single new term ("data.frame") for thinking about their data. (though perhaps others want to chime in here?)
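For anyone following along, here is a minimal sketch of the `class()` output being referred to. The object name `df` and its columns are made up for illustration and are not taken from the lesson.

```r
# Hypothetical example data; the column names are illustrative only.
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

class(df)          # reports the class name, spelled with a dot
is.data.frame(df)  # TRUE for objects of that class
```

The dotted spelling is the class name R reports, which is separate from how prose documentation chooses to name the concept.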

@jrnold

Contributor

jrnold commented Jun 13, 2016

And my comment from #144

I'll see what others say about the data.frame v. "data frame", but it seems if data.frame is used to formally reference the object class, then it should be data.frame. Though my preference would be to follow the R conventions.

@tomwright01

Contributor

tomwright01 commented Jun 14, 2016

There are 69 lines with "data.frame" and 16 with "data frame". I guess we should be consistent.
I'm +1 for "data frame".

The 69 lines will include places where data.frame() is used as a function. Guess we shouldn't replace those.

@gvdr

Contributor

gvdr commented Jun 14, 2016

On my side, as a user I'm +1 for "data.frame". Almost everywhere when we use the term, we are referring to objects of the specific class data.frame. See for example the use in dplyr, where it is particularly sensitive as there is a data_frame() function: https://cran.r-project.org/web/packages/dplyr/vignettes/data_frames.html

@tomwright01

Contributor

tomwright01 commented Jun 14, 2016

The example actually uses data.frame() to refer to the function. When referring to the object type it uses 'data frame'.

@gvdr

Contributor

gvdr commented Jun 14, 2016

Gosh, you are right. The dot was just in my brain. Much cognitive bias. Pardon me. Then, I'll second you.

@aammd

Contributor

aammd commented Jun 14, 2016

It looks like data frame might be winning, both on here and in my informal polls (real life and on Twitter)

@gavinsimpson

gavinsimpson commented Jun 14, 2016

The concept of a data frame extends beyond R's "data.frame" class. See the "tbl_df" class in R package tibble and the "DataFrame" class in Python library Pandas. Julia also has a "DataFrames" package. It will be important going forward to talk about data frames in the abstract sense that cuts across implementations and even languages.

Hence I'm for "data frame" when discussing the concept, or talking about the spreadsheet-like tabular data objects. When referencing specific class names (which one rarely needs to do) only then should we resort to data.frame, or DataFrame.

@aammd

Contributor

aammd commented Jun 14, 2016

hi @gavinsimpson , thanks for your input! Out of curiosity, is there a language-agnostic description of what is meant by a "data frame" (in the broad sense) that we could link to?

@gvdr

Contributor

gvdr commented Jun 14, 2016

Hadley's paper on Tidy Data is quite language-agnostic, at least in Section 2, "Defining tidy data": http://vita.had.co.nz/papers/tidy-data.pdf

@jrnold

Contributor

jrnold commented Jun 14, 2016

Feather, by Wes McKinney and Hadley Wickham, uses the term "data frames" (https://blog.rstudio.org/2016/03/29/feather/) and is explicitly language agnostic. I can't find an explicit definition of data frame though.

@jrnold

Contributor

jrnold commented Jun 14, 2016

Sorry, here's a definition

data frames are lists of named, equal-length columns, which can be numeric, boolean, and date-and-time, categorical (factors), or string. Every column can have missing values.
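As a rough illustration of that definition in R, the sketch below builds one column per listed type and checks the "list of named, equal-length columns" part. The column names and values are hypothetical, and base R's `Date` class stands in for "date-and-time".

```r
# Hypothetical columns chosen to cover the types named in the definition.
df <- data.frame(
  count = c(1.5, NA, 3),                                # numeric
  flag  = c(TRUE, FALSE, NA),                           # boolean (logical in R)
  when  = as.Date(c("2016-06-13", NA, "2016-06-15")),   # date
  group = factor(c("a", "b", NA)),                      # categorical (factor)
  label = c("x", NA, "z"),                              # string
  stringsAsFactors = FALSE
)

is.list(df)         # a data frame is a list of columns
names(df)           # every column is named
lengths(df)         # every column has the same length
colSums(is.na(df))  # number of missing values in each column
```

Each column here holds one `NA`, which is the "every column can have missing values" part of the definition.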

@aammd

Contributor

aammd commented Jun 14, 2016

@jrnold I like that! What if we add that definition to the lesson material as part of this PR?

@tomwright01

Contributor

tomwright01 commented Jun 15, 2016

Please add the definition to the glossary in reference.MD

@gavinsimpson

gavinsimpson commented Jun 15, 2016

That definition is not broad enough for the usage that Hadley and co are now espousing. Effectively, because data frames are lists, anything can be a component (as long as each component has the correct length), and so they are using nested data frames. From an EDA viewpoint, the examples Hadley gave in his recent Edinburgh talk of nested data frames were quite persuasive.

That perhaps goes more towards the R implementation, and perhaps tbl_dfs and the tidy data paradigm in particular.
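To make the nested case concrete, here is a minimal base R sketch of a data frame whose column is itself a list of data frames. The names `nested`, `group`, and `data` are hypothetical, and this is not the tibble or tidyr workflow shown in the talk, only the underlying list-column idea.

```r
# Hypothetical nested example: one row per group, with that group's
# observations stored as a data frame inside a list column.
nested <- data.frame(group = c("a", "b"))
nested$data <- list(
  data.frame(x = 1:2, y = c(0.1, 0.2)),
  data.frame(x = 3:5, y = c(0.3, 0.4, 0.5))
)

nrow(nested)               # the list column has one element per row
class(nested$data)         # the column itself is a plain list
sapply(nested$data, nrow)  # the inner data frames can differ in size
```

The only constraint the outer data frame imposes is that the list column has one element per row; what each element holds is unconstrained.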

@jrnold

Contributor

jrnold commented Jun 15, 2016

Would replacing "numeric, boolean, and date-and-time, categorical (factors), or string" with "heterogeneous data types" handle that generalization?

@gavinsimpson

gavinsimpson commented Jun 15, 2016

@jrnold yes, I guess so - even without the nested data frame distraction, R has more data types than the description from the RStudio blog (e.g. raw, complex, if we're being extra picky), which may not be the same in other implementations.

@jrnold

Contributor

jrnold commented Jun 16, 2016

This is the response I got from Twitter: https://twitter.com/thosjleeper/status/743178779212787713

A data frame is a tabular data structure. Usually, it contains data where rows are observations and columns are variables of various types. While "data frame" or "dataframe" is the term used for this concept in several languages (R, Apache Spark, deedle, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

The source is the Stack Overflow tag info for dataframe.

@gvwilson gvwilson self-assigned this Jul 24, 2016

@gvwilson gvwilson removed their assignment Apr 26, 2017

rgaiacs pushed a commit to rgaiacs/swc-r-novice-gapminder that referenced this pull request May 6, 2017

Merge pull request swcarpentry#145 from ChristinaLK/cheat-sheet
adding beginning of outline for trainees to use during sessions

rgaiacs added a commit to rgaiacs/swc-r-novice-gapminder that referenced this pull request May 6, 2017

@naupaka naupaka added the help-wanted label Jan 28, 2018
