What would everyone like to see in the 2nd edition? #37

Closed
wesm opened this Issue Apr 21, 2016 · 42 comments

@wesm
Owner

wesm commented Apr 21, 2016

I've started working on the revised 2nd Edition of Python for Data Analysis. The agenda / table of contents is not set in stone, though!

Any comments on the existing content or requests for new content would be welcome here. I can't make any promises, but since I know how useful the book has been for many people the last 3.5 years, I would like to make sure the 2nd edition is just as useful (if not more so!) in the following 3.5 years (which will put us all the way to 2020, if you can believe it).

Thank you all in advance for the support.

@vchollati

vchollati commented Apr 21, 2016

Adding content about dask, distributed, ibis would be awesome. :P

@wesm

Owner

wesm commented Apr 21, 2016

@vchollati with 4 years of history in the mirror, I'm pretty hesitant to write about projects that are still under active development -- we did a major edit of the 1st edition to fix pandas API breakage, so my rule of thumb will be having code examples that I feel confident will still work 2 years from now.

Since the 1st edition got translated into at least 5 other languages (one or two more may be in the works), this stability is extra important as fixes in the primary English edition may take a lot longer to percolate to the translations.

@little7dragon

little7dragon commented Apr 21, 2016

Adding practical analysis theory and ibis is better.

@atacca

atacca commented Apr 21, 2016

A request! How about a bit on creating a simple file-handler for web forms, and outputting into live graphing?

@m-sostero

m-sostero commented Apr 21, 2016

Minor point, but I assume the section on IPython will be complemented with a section on Jupyter.

@kokes

kokes commented Apr 21, 2016

Speaking as someone who only skimmed the first edition (eagerly awaiting the second edition to purchase in multiple copies as reference material at work).

Some high performance tips, both within pandas (at/iat vs loc/iloc, working with .values, generally how to squeeze the most out of pure pandas) and using other tools (dask, blaze, numba, cython, pypy, ...). I know these tips are scattered around and that there's a whole chunk in the advanced numpy chapter, but a whole chapter dedicated to speedy processing might be worthwhile.
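A minimal sketch of the accessors mentioned above, on a small made-up frame (the column names and values are invented for illustration). It only shows that `at`/`iat` return the same scalar as `loc`/`iloc` and that `.values` exposes the underlying NumPy array; any speed difference would need to be measured on real data.

```python
import pandas as pd

# Toy frame; labels and numbers are invented for illustration only.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]},
    index=["x", "y", "z"],
)

# Scalar lookup by label: at is the scalar-only counterpart of loc.
by_label = df.at["y", "b"]
same_by_loc = df.loc["y", "b"]

# Scalar lookup by integer position: iat is the scalar-only counterpart of iloc.
by_position = df.iat[1, 1]

# Working on the underlying NumPy array instead of the Series.
total = df["a"].values.sum()
```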

Regarding the comments above - the whole feather/ibis/arrow matter, while still under active development, would probably deserve a mention (not code, just to mention it), so that readers know what to anticipate. And if reading years after the book was published, they can look up relevant code.

Thanks for your work on all this.

@briandk

briandk commented Apr 21, 2016

I teach a data science/computational modeling course for undergraduates. We assume no prior programming experience, and I'm thinking of using Python/pandas. So, this may not be the book for us and my requests may not work 😇

  1. pandas has some built in plot methods. Will the book talk about/teach those at all, or will it be up to the reader to read the pandas docs on visualization?
  2. Will there be anything on interactive visualizations for the web (like the Bokeh package)? I could totally imagine, though, that that's either outside the scope of the book and/or a package that's not mature enough.

Lastly, thank you so much for making pandas!

@chris1610

chris1610 commented Apr 22, 2016

  • Definitely some more info about the newer functions like assign and pipe, and how to use them effectively.
  • Indexes: .ix, .iloc, .loc etc. I still get tripped up from time to time here.
  • query and when it should be used or not used
  • Maybe a "cookbook" section that walks through some basic problems and how to solve them.
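For readers unfamiliar with the methods named in the list above, here is a small sketch of `assign`, `pipe`, and `query` used in one chain. The `add_total` helper and the `orders` data are hypothetical, made up purely for illustration.

```python
import pandas as pd


def add_total(frame, price_col, qty_col):
    """Hypothetical helper: return a copy of frame with a 'total' column."""
    return frame.assign(total=frame[price_col] * frame[qty_col])


# Invented toy data.
orders = pd.DataFrame({"price": [2.0, 5.0, 1.5], "qty": [3, 1, 10]})

result = (
    orders
    .pipe(add_total, "price", "qty")                 # call a plain function mid-chain
    .assign(is_large=lambda d: d["total"] > 5.5)     # derive a column from the new one
    .query("total >= 6")                             # filter rows with a string expression
)
```

Each step returns a new DataFrame, so `orders` itself is left unchanged.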

Anyway, great stuff and I really look forward to the new book!

@josepmv

josepmv commented Apr 22, 2016

I bought the first edition and I really enjoyed it: I read it in a linear way to learn Pandas and now I'm using it as a reference book. I like it because it's very well organized.
In the next edition, I would add:

  • All the new stuff related to Indexes: indexes have been the most painful thing for me.
  • Best practices: in Pandas, you can do things in several ways and some are better than others.
  • How Pandas can work with other tools (dask, numba, bcolz, ...): not deep, just a brief introduction.
  • A full example using most things that Pandas has, to see how they work together.
@leondutoit

leondutoit commented Apr 22, 2016

Some topics that come to mind:

  • best practices and examples of working with databases (e.g. creating, updating tables with the DataFrame as the interface)
  • some use cases for programming with pandas (e.g. how to use lower-level sections of the API to build custom functionality; this could include a more in-depth discussion of the library's design)
  • discussion of compression techniques and circumstances in which they are appropriate
@fdelaunay

fdelaunay commented Apr 22, 2016

Hi Wes,
I'd love to read a chapter about Reproducible Research. How to make the analysis reproducible, how to use IPython notebooks efficiently, how to make a good team work.
Thanks for asking ^^

@cjburgoyne

cjburgoyne commented Apr 23, 2016

I've been unpacking the recent Tom Augspurger posts and finding those very insightful. Would love to see strategies for unit testing scripts, especially with pipe/assign chains.

@TomAugspurger

TomAugspurger commented Apr 23, 2016

Excited that the 2nd Edition is happening :)

The layout of the first edition works well. Starting with the high-level overview of "Why Python" is great (and after these 4 years, you have even more evidence for why Python is a good choice). I really enjoyed the introductory examples to show the capabilities, before delving into sections of the APIs.

I hope you keep the section on NumPy, at least the parts on ufuncs and broadcasting.
I wonder if xarray is worth a section as well.

Plotting and visualization is a bit tricky. I think the pandas .plot API is pretty much settled aside from GroupBy.plot, which is discussed here. Seaborn deserves a mention, Bokeh as well probably. Perhaps even javascript based tools like d3, inside the notebook.

I'm curious to see what you do (if anything) with the Timeseries and Financial Applications sections. Pandas is great at timeseries, and the additions of TimeGroupers and the groupby-like resample, rolling, and expanding APIs only add to that. I'm guessing part of the reason they got prominence in the first edition was because of your background. I'm looking forward to what your experiences in big-data land have taught you about interacting with it from python / pandas.
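As a rough illustration of the groupby-like time series methods mentioned above, the sketch below applies `resample`, `rolling`, and `expanding` to a tiny invented daily series. The dates and values are made up.

```python
import pandas as pd

# Invented daily series: six consecutive days with values 1..6.
idx = pd.date_range("2016-01-01", periods=6, freq="D")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Downsample into two-day bins and aggregate each bin.
two_day_sum = s.resample("2D").sum()

# Moving average over a fixed window of three observations.
rolling_mean = s.rolling(window=3).mean()

# Running maximum over everything seen so far.
running_max = s.expanding().max()
```

All three return an intermediate object that is aggregated afterwards, much like `groupby`.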

The biggest omission these days is probably a section on interfacing with scikit-learn. They've done some great work over the last 4 years. Unfortunately, the dust hasn't settled on exactly what they do with DataFrames (here and linked issues in that thread), so I don't know what can be set in stone at this point. At the very least you can cover a bit about converting from pandas' extension types (mostly just Categoricals in this context) to NumPy arrays with get_dummies.
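A minimal sketch of the conversion described above, assuming a made-up frame with one Categorical column: `pd.get_dummies` expands the categories into indicator columns, and the result is turned into a plain float NumPy array of the kind an estimator would typically accept. No scikit-learn call is shown, and the column names are invented.

```python
import pandas as pd

# Invented example data with one Categorical column.
df = pd.DataFrame({
    "size": pd.Categorical(
        ["small", "large", "medium", "small"],
        categories=["small", "medium", "large"],
    ),
    "weight": [1.0, 3.5, 2.0, 1.2],
})

# One indicator column per category, in category order.
dummies = pd.get_dummies(df["size"], prefix="size")

# Combine with the numeric column and drop down to a 2-D float array.
features = pd.concat([df[["weight"]], dummies.astype(float)], axis=1)
X = features.to_numpy()
```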

References to Tidy Data are always popular :) so that might be worth mentioning. I've been meaning to make pd.melt more MultiIndex friendly, but haven't gotten around to it yet.

The hardest concept I see when teaching pandas is effective use of Indexes. It's a hard concept to explain well. I don't have much to offer here, other than a hope that you attempt a better explanation than I can (not to say that the first edition didn't: it does emphasize the role of Indexes in slicing and reindexing/alignment).

Sorry about the wall of text, I hope some of it is useful :)

@NewMountain

NewMountain commented Apr 24, 2016

Hi Wes and thank you.

I would like to second the suggestion about Tom Augspurger's work. I read this post NB Viewer and it was like discovering Pandas all over again. What I got out of the first Python for Data Analysis book was "here are a million different ways to solve some problems (which you may or may not have)". What I really wanted (and still want) is a single source laying out simple, user-friendly general strategies, idioms and best practices. I've read several pandas books and Tom's notebooks and blog posts are the first thing I have seen that feels like it's answering that need.

Perhaps a closing chapter for efficient, ergonomic data work?

@andportnoy

andportnoy commented Apr 24, 2016

I second @kokes on this:

Some high performance tips, both within pandas (at/iat vs loc/iloc, working with .values, generally how to squeeze the most out of pure pandas) and using other tools (dask, blaze, numba, cython, pypy, ...). I know these tips are scattered around and that there's a whole chunk in the advanced numpy chapter, but a whole chapter dedicated to speedy processing might be worthwhile.

Also, please write about effective use of pivoting, stacking, unstacking, index-setting and other indexing-related operations.
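To make the operations named above concrete, here is a small sketch on invented long-format data showing that `pivot` and `set_index` followed by `unstack` reach the same wide table, and that `stack` goes back to long form.

```python
import pandas as pd

# Invented long-format data: one row per (city, year) pair.
long_df = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "year": [2015, 2016, 2015, 2016],
    "sales": [10, 12, 7, 9],
})

# Route 1: pivot directly to one row per city and one column per year.
wide = long_df.pivot(index="city", columns="year", values="sales")

# Route 2: build a two-level index, then move the 'year' level into the columns.
indexed = long_df.set_index(["city", "year"])["sales"]
unstacked = indexed.unstack("year")

# stack moves the column level back into the row index.
restacked = unstacked.stack()
```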

@Tagar

Tagar commented Apr 27, 2016

Examples of data wrangling / data munging / feature engineering using PySpark + pandas + etc.
(I saw your reply to @vchollati on ibis, but I think Spark is more stable than it used to be).

@naoyak

naoyak commented Apr 29, 2016

Sorry if this has been addressed, but Python 3?

@justmarkham

justmarkham commented May 11, 2016

The appendix of the 1st edition ("Python Language Essentials") is excellent, and I would propose keeping it in the 2nd edition. In fact, I tell my students who own the book to read the appendix first :)

@KrOstir

KrOstir commented May 19, 2016

This one is rather small, but probably many people are using Anaconda. I would therefore suggest adding Continuum Analytics Anaconda in addition to Enthought Canopy.

@DavidWright123

DavidWright123 commented May 29, 2016

Anaconda Distribution information would be helpful, as I prefer it to Enthought Canopy.

As I come from a statistics background, and I'm a newcomer to Pandas and Python (and to CS in general), it would be great to have an extra section on how Pandas works with SQLite or MySQL, and how people integrate Pandas into their workflow with databases and ETL processes.

While this may be (a bit) outside the scope of your book, I'm sure a lot of newcomers to Pandas would love some basic information or recommendations on how Pandas fits into a data analyst's typical workflow. If you can't fit it into the book, please point us in the right direction with an informative link or two.

-P.S. I'm currently on chapter 4 of your first edition, and I love it so far. So, thank you!
And, please forgive me if you already include 'SQL-to-Pandas' workflow guidance later on in your 1st edition.

@DavidWright123

DavidWright123 commented May 29, 2016

Wes, do you have a rough estimate of when the 2nd edition will be available for purchase?

@wesm

Owner

wesm commented Jun 1, 2016

@DavidWright123 I'll try to provide status updates over the rest of the year, but I believe it's going to be in the 1st quarter of 2017.

And it will definitely be Python 3.5-based (since Python 2.x will retire within the working lifetime of the book: http://pythonclock.org/) =)

@Madfile

Madfile commented Aug 11, 2016

Programming with python3 !

@louridas

louridas commented Oct 15, 2016

The 2012 Federal Election Commission Database will probably need updating.

Apart from the fact that there are now the 2016 data, I have come across something strange while going through the existing example.

If I simply sum up the contributions to the two candidates at the end:

fec_mrbo.groupby('cand_nm').sum()

it appears that Mitt Romney raised more money than Barack Obama (679,994,900 vs. 558,359,100). These figures run contrary to expectations and to what has been reported in the media.

Trying to investigate, I found that a lot of Mitt Romney transactions are transfers from one Mitt Romney committee to another, so they are not net contributions.

Indeed, running:

transfers = fec[fec['memo_text'].str.contains("TRANSFER").fillna(False)]

mr_transfers = transfers[transfers['cand_nm'] == 'Romney, Mitt']
bo_transfers = transfers[transfers['cand_nm'] == 'Obama, Barack']
print("Romney -> Romney:", mr_transfers['contb_receipt_amt'].count(), mr_transfers['contb_receipt_amt'].sum())
print("Obama -> Obama:", bo_transfers['contb_receipt_amt'].count(), bo_transfers['contb_receipt_amt'].sum())

gets me:

Romney -> Romney: 644022 295380725.37
Obama -> Obama: 0 0

So if I do some cleaning:

fec = fec[~fec['memo_text'].str.contains("TRANSFER").fillna(False)]
fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]
fec_mrbo.groupby('cand_nm').sum()

I get:

Obama, Barack 558359100
Romney, Mitt 384614200

which looks closer to what has been reported.
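The cleaning step above can be tried on a few invented rows. This is only a sketch: the amounts and memo strings below are made up and are not real FEC figures, and only the column names (`cand_nm`, `contb_receipt_amt`, `memo_text`) are taken from the snippets above. It uses the `na=False` argument of `str.contains` in place of `.fillna(False)`.

```python
import numpy as np
import pandas as pd

# Invented toy rows, NOT real FEC data; only the column names follow the example above.
fec = pd.DataFrame({
    "cand_nm": ["Romney, Mitt", "Romney, Mitt", "Obama, Barack", "Obama, Barack"],
    "contb_receipt_amt": [100.0, 250.0, 75.0, 50.0],
    "memo_text": [np.nan, "TRANSFER FROM ANOTHER COMMITTEE", np.nan, "some other memo"],
})

# Flag rows whose memo mentions a transfer; missing memos count as "not a transfer".
is_transfer = fec["memo_text"].str.contains("TRANSFER", na=False)

# Keep only non-transfer rows and total the amounts per candidate.
net = fec[~is_transfer]
totals = net.groupby("cand_nm")["contb_receipt_amt"].sum()
```

Selecting the amount column before calling `sum` keeps the aggregation restricted to the one numeric column of interest.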

NB: I am not an American so I don't really know the campaign reporting rules. Just pointing out something that looks strange to me.

@wesm

Owner

wesm commented Oct 16, 2016

Thanks. I'll see if it's straightforward to update to the 2016 disclosure dataset

@vineethbabu4424

vineethbabu4424 commented Oct 18, 2016

Hi Wes,

Thanks for making the effort to write a second edition.

Do you think it's possible for the code and the output to be provided as Jupyter notebooks?

Vineeth

@dacoex

dacoex commented Nov 2, 2016

please expand the timeseries section:

  • tidy messy timeseries data & index sanity
  • best practice
  • merging
  • good timeseries plots and aggregation
@guidorice

guidorice commented Nov 9, 2016

While I am very excited by the book content, working with IPython/Jupyter can be very frustrating, especially with respect to visualizations. Please make the notebooks and examples work with anaconda.

btw, does anyone have any quickstart suggestions, or notes on how to use the notebooks with conda? The visualizations won't display.

I realize I may have to punt and try Enthought Python.

edit:
nvm, got it working with anaconda, I think I was confused because I thought this was supposed to display a graph. But it does display it later with plot()

plt.figure(figsize=(10, 4))
Out[10]:
<matplotlib.figure.Figure at 0x10d58add0>
<matplotlib.figure.Figure at 0x10d58add0>

Here was my conda setup which worked:

conda create --name pfda python=2.7 numpy pandas scipy matplotlib chaco jupyter
source activate pfda
jupyter notebook ch02.ipynb 
@briandk

briandk commented Nov 10, 2016

@guidorice: a few things.

  1. As far as I know, anaconda comes standard with Jupyter Notebook.
  2. You should probably include code like this in your preamble, which will set up your notebook to display resolution-independent graphics in line. Feel free to omit the third section of import statements if you don't use those modules.
  3. In my experience, if you're using plt directly, you should conclude your plotting commands with plt.show() to display the plot. I believe pandas plot methods do so for you.
@wesm

Owner

wesm commented Nov 10, 2016

Try running %matplotlib inline in a notebook cell

@pglezen

pglezen commented Dec 4, 2016

The hardest concept I see when teaching pandas is effective use of Indexes. It's a hard concept to explain well.

I also believe it is an underappreciated concept. It seems like an esoteric topic but it's truly fundamental to getting anything non-trivial done. I like how the first edition treats index objects as first class citizens of the pandas community by having its own section at the same level of Series and DataFrame. I hope this won't diminish with all the new topics being considered.

@ssantic

ssantic commented Feb 6, 2017

Hey @wesm what I'd really, really recommend is that the code and plots in the book (at least the electronic formats) be in color, as opposed to just black and white. As much as I love the first edition (and the early release of the second so far), I've always found the fact that everything is just black and white really tiring to read - especially as opposed to most O'Reilly books.

@TheGhostHuCodes

TheGhostHuCodes commented Feb 7, 2017

Hi @wesm, I'm reading along with the 2nd edition on SafariBooksOnline and have found some deprecation warnings, small wording issues, and typos. Should I open issues here or leave notes on the O'Reilly errata page?

@wesm

Owner

wesm commented Feb 9, 2017

The errata page is fine. Thanks!

@ghost

ghost commented Feb 9, 2017

@krother

krother commented Mar 11, 2017

Hi Wes,
thanks much for your great and useful book! I've done the German translation for O'Reilly and am using it all the time. The data examples are great, please keep it that way!

Yesterday I noticed that the Jupyter notebook on Ch10 contains a few deprecated function calls. Would you suggest that I create a PR, or are they being overhauled anyway?

@wesm

Owner

wesm commented Mar 12, 2017

@krother yes, I'll be generating updated notebooks with refreshed code examples using up-to-date API calls. Vielen Dank für die Übersetzung!

@valmunos

valmunos commented Apr 4, 2017

Hi Wes, one thing I always think is helpful when learning from a book is problem sets. I realize your page count is probably going to be pretty high, but I do think it would make it a better overall resource.

@DavidWright123

DavidWright123 commented Apr 4, 2017

@ssantic

ssantic commented Apr 24, 2017

@wesm The 2nd Edition is looking great so far. :) Do you have any idea when the new Early Release chapters (so, Chapter 10 onwards) will be available? Thanks!

@wesm

Owner

wesm commented May 2, 2017

There will hopefully be at least 3-4 more early release chapters coming out this month, with the rest of the first manuscript draft appearing not long thereafter.

@wesm

Owner

wesm commented Sep 9, 2017

Thanks all for the input! I hope you enjoy the 2nd edition when it ships in a few weeks.

@wesm wesm closed this Sep 9, 2017
