Parse the book errata for easier sorting

In [321]:
import bisect
from collections import OrderedDict
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [5]:
r = requests.get('https://www.oreilly.com/catalog/errata.csp?isbn=0636920142874')
r.raise_for_status()  # fail early on an HTTP error
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly

In [9]:
soup.title.text

"Errata | O'Reilly Media Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow"

In [21]:
t = soup.find('table')

In [25]:
t.attrs.get('class', '')

['mainLayoutTable']

In [395]:
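# The first table on the page is a layout table with class 'mainLayoutTable';
# the errata table is taken to be the first table without a class attribute.
errata_table = None
for table in soup.find_all('table'):
    if 'class' not in table.attrs:
        errata_table = table
        break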

In [114]:
data = []
for tr in errata_table.find_all('tr'):
    cols = tr.find_all('td')
    if not cols:
        continue  # skip rows without <td> cells
    col_texts = [c.text.strip().replace('\r', '') for c in cols]
    if len(col_texts) == 6:
        data.append(col_texts)
    else:
        # report rows that do not fit the expected 6-column layout
        print('found != 6 cols')
        print(col_texts)
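As a rough illustration of the two cells above, the sketch below runs the same selection and row-parsing steps on a tiny inline page. The HTML is hypothetical and only mimics the shape assumed here (a layout table with a class, followed by a class-less data table); it is not a copy of the real errata page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: one layout table with a class, one data table without
html = """
<table class="mainLayoutTable"><tr><td>navigation</td></tr></table>
<table>
  <tr><th>Version</th><th>Location</th></tr>
  <tr><td> PDF </td><td>Page 14\r\nFirst paragraph</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Pick the first table that has no class attribute
table = next(t for t in soup.find_all('table') if 'class' not in t.attrs)

rows = []
for tr in table.find_all('tr'):
    cells = tr.find_all('td')
    if not cells:
        continue  # the header row in this sample only has <th> cells
    # Trim surrounding whitespace and drop carriage returns inside the cell text
    rows.append([c.text.strip().replace('\r', '') for c in cells])

print(rows)
```

Only the data row survives, with the embedded newline kept so it can be rendered later.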

In [375]:
df = pd.DataFrame(data, columns=['Version', 'Location', 'Description', 'Submitted By', 'Date Submitted', 'Date Corrected'])

In [376]:
df.dtypes

Version           object
Location          object
Description       object
Submitted By      object
Date Submitted    object
Date Corrected    object
dtype: object

In [377]:
df['Version'] = df['Version'].astype('category')

In [378]:
df['Version'].cat.categories

Index(['Mobi', 'Other Digital Version', 'PDF', 'Printed', 'Printed, PDF',
       'Printed, Safari Books Online', 'Safari Books Online', 'ePub',
       'ePub, Mobi, Safari Books Online'],
      dtype='object')

In [379]:
df['Date Submitted'] = pd.to_datetime(df['Date Submitted'])
df['Date Corrected'] = pd.to_datetime(df['Date Corrected'])
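A small sketch of what the conversion does with blank cells, assuming uncorrected errata leave 'Date Corrected' empty. The sample values and their date format are made up for illustration and may not match the real page.

```python
import pandas as pd

# Hypothetical sample values; the date format on the real page is an assumption
dates = pd.Series(['Oct 07, 2019', ''])

# errors='coerce' turns anything unparseable into NaT instead of raising
parsed = pd.to_datetime(dates, errors='coerce')
print(parsed)
```

The empty string becomes NaT, so rows without a correction date can be selected with `isna()`.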

In [404]:
# Extract the chapter number if one exists; otherwise set it to -1.
# The (?:...) prefix is a non-capturing group, so only the digits are captured.
# Use float dtype while NaNs are still present.
chapter_re = r'(?:Ch|ch|chapter|Chapter)\.? ?([0-9]+)'
df['Chapter'] = df['Location'].str.extract(chapter_re, expand=False).astype('float')
# Fall back to the description; the (?<!Mar) lookbehind avoids matching "March".
descr_re = r'(?<!Mar)(?:Ch|ch|chapter|Chapter)\.? ?([0-9]{1,3})'
descr_ch = df['Description'].str.extract(descr_re, expand=False).astype('float')
df['Chapter'] = df['Chapter'].fillna(descr_ch).fillna(-1).astype('int')
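To see what the description pattern does on its own, here is a minimal sketch using the standard `re` module instead of pandas. The helper name `chapter_of` and the sample strings are made up for illustration.

```python
import re

# Same pattern as the one applied to the Description column
CHAPTER_RE = re.compile(r'(?<!Mar)(?:Ch|ch|chapter|Chapter)\.? ?([0-9]{1,3})')

def chapter_of(text):
    """Return the chapter number found in text, or -1 if there is none."""
    m = CHAPTER_RE.search(text)
    return int(m.group(1)) if m else -1

samples = ['Chapter 4, Eq 4-13', 'see ch. 12', 'Fixed in March 2020', 'Page 14 2nd line']
print([chapter_of(s) for s in samples])
```

Without the lookbehind, the "ch 202" inside "March 2020" would be read as chapter 202.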

In [412]:
# Extract the page number if one exists; otherwise set it to -1.
# The (?:...) prefix is a non-capturing group, so only the digits are captured.
# Use float dtype while NaNs are still present.
df['Page'] = df['Location'].str.extract(r'(?:p|page|Page)\.? ?([0-9]+)', expand=False).astype('float')

# Sometimes the location starts with a bare page number. The negative lookahead
# skips it when "Ch" follows, since the number may then be a chapter number.
# FIXME: the lookahead for "Chapter" does not seem to work reliably
leading_page = df['Location'].str.extract(r'^([0-9]+)(?!.*Ch)', expand=False).astype('float')
df['Page'] = df['Page'].fillna(leading_page)

# Remove rows that refer to roman-numeral pages, to make life easier
roman_page = df['Location'].str.extract(r'(?:p|page|Page)\.? ?([xiv]+)(?![a-z])', expand=False)
df = df[roman_page.isna()].copy()

df['Page'] = df['Page'].fillna(-1).astype('int')
# df.loc[df['Page'] > 820, 'Page'] = -1  # book only has 820 pages
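The roman-numeral pattern can be checked in isolation with the standard `re` module. This is only a sketch; the helper name `roman_page_of` and the sample strings are hypothetical.

```python
import re

# Same pattern as the one used to detect roman-numeral page references
ROMAN_PAGE_RE = re.compile(r'(?:p|page|Page)\.? ?([xiv]+)(?![a-z])')

def roman_page_of(text):
    """Return the roman page numeral referenced in text, or None."""
    m = ROMAN_PAGE_RE.search(text)
    return m.group(1) if m else None

samples = ['Page xiv, Preface', 'p. vii', 'Page 14 2nd line', 'pixel values']
print([roman_page_of(s) for s in samples])
```

The trailing `(?![a-z])` lookahead is what keeps an ordinary word such as "pixel" from being read as "p" followed by the numeral "ix".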

In [413]:
chapter_page_map = OrderedDict({
    1: 1,
    2: 35,
    3: 85,
    4: 111,
    5: 153,
    6: 175,
    7: 189,
    8: 213,
    9: 235,
    10: 279,
    11: 331,
    12: 375,
    13: 413,
    14: 445,
    15: 497,
    16: 525,
    17: 567,
    18: 609,
    19: 667
})

In [414]:
chapters = list(chapter_page_map.keys())
pages = list(chapter_page_map.values())

In [415]:
# Find the chapter numbers for rows with page numbers but no chapter numbers
def lookup_chapter(page):
    # index of the last chapter whose first page is <= page
    ind = bisect.bisect_right(pages, page) - 1
    return chapters[ind]

missing_ch_mask = (df['Chapter'] < 0) & (df['Page'] > 0)
df.loc[missing_ch_mask, 'Chapter'] = df.loc[missing_ch_mask, 'Page'].map(lookup_chapter)
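The page-to-chapter lookup is a binary search over the chapters' first pages. A self-contained sketch, using only the first four entries of `chapter_page_map` and a few arbitrary test pages:

```python
import bisect

# First pages of chapters 1-4, taken from chapter_page_map
chapters = [1, 2, 3, 4]
pages = [1, 35, 85, 111]

def lookup_chapter(page):
    # bisect_right counts the chapters that start at or before this page,
    # so subtracting 1 gives the index of the chapter containing it
    return chapters[bisect.bisect_right(pages, page) - 1]

print([lookup_chapter(p) for p in (1, 14, 35, 84, 143)])
```

Using `bisect_right` rather than `bisect_left` means a page equal to a chapter's first page (35 here) is assigned to that chapter rather than the previous one. With this truncated list, any page past 111 falls into chapter 4.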


In [416]:
# Reorder columns, drop other info
df = df[['Version', 'Chapter', 'Page', 'Location', 'Description']]

In [417]:
df.query('Chapter < 0 & Page < 0')

Unnamed: 0,Version,Chapter,Page,Location,Description
4,Safari Books Online,-1,-1,?\nSection: Computing Gradients Using Autodiff,Super minor typo: just replace\n\nyou must cal...
5,Safari Books Online,-1,-1,"""Changes in the Second Edition,"" Numbered List...",'covolutional' should be 'convolutional' (miss...
15,Safari Books Online,-1,-1,"??\nRight under ""Training and Evaluating the M...",When I fit the model (including on Google Cola...


In [418]:
from IPython.display import display, HTML

In [420]:
df_sorted = df.sort_values(by=['Chapter', 'Page'])

In [448]:
df_styler = df_sorted.style.set_properties(**{
    'text-align': 'left', # left align text
    'white-space': 'pre-wrap', # make newlines work!
    'font-family': 'Arial',
    'font-size': '11pt',
})
df_styler.set_table_attributes('border="1"');
df_styler.hide(axis='index');  # Styler.hide_index() was removed in pandas 2.0

In [449]:
# save to HTML
with open("my_errata.html", "w") as f:
    f.write("""<!DOCTYPE html>
<html>
<body>
""")
    f.write(df_styler.to_html())  # Styler.render() was removed in pandas 2.0; the index is already hidden above
    f.write("""
</body>
</html>""")

In [421]:
# from https://stackoverflow.com/questions/49661018/displaying-embedded-newlines-in-a-text-column-of-a-pandas-dataframe
display(df_sorted.style.set_properties(**{
    'text-align': 'left', # left align text
    'white-space': 'pre-wrap', # make newlines work!
}))

Unnamed: 0,Version,Chapter,Page,Location,Description
4,Safari Books Online,-1,-1,? Section: Computing Gradients Using Autodiff,Super minor typo: just replace you must call the tape’s jabobian() method with you must call the tape’s jacobian() method
5,Safari Books Online,-1,-1,"""Changes in the Second Edition,"" Numbered List Point 1","'covolutional' should be 'convolutional' (missing an 'n'). (I couldn't find page numbers in the Safari Books Online iPad app.) Note from the Author or Editor:Good catch, thanks. Fixed."
15,Safari Books Online,-1,-1,"?? Right under ""Training and Evaluating the Model""","When I fit the model (including on Google Colab), it shows progress out of 1719 rather than out of 55000 (as shown in the book), even though X_train has 55000 rows. What's going on? Note from the Author or Editor:Thanks for your question! Keras changed the way it displays progress during training since I wrote the book (after a bit of investigation, it looks like it happened in TensorFlow 2.2). Keras used to display the number of samples processed so far during the epoch (something like 38816/55000), but it now shows the number of *batches* processed so far. So if the batch size is 32 (which is the default) then there are math.ceil(55000/32)=1719 batches per epoch, so you would see 1213/1719 (instead of 38816/55000). I'll update the book to show the new format. Thanks a lot! Cheers, Aurelien"
26,Safari Books Online,1,1,1 First line.,"First sentence reads... ""When most people hear 'Machine Learning,' they picture a robot: a dependable butler or a deadly Terminator, depending on who you ask."" It's not ""...who you ask,"" it's ""... whom you ask."" Should use proper English, at least in the very first sentence of the book. You would not say ""You ask he,"" you'd say ""You ask him."" Note from the Author or Editor:Thanks for your feedback. As you might know, I am French, so please forgive my English mistakes. The he/him rule is very helpful. It's interesting that no one pointed out this error to me before, even though it's in the very first sentence! :) I think it goes to show that people are getting used to this mistake, to the point that many people on the Web seem to argue that ""whom"" now sounds too formal. Perhaps in a few decades it will no longer be considered a mistake. That said, of course, I've fixed the book now, thanks again!"
29,PDF,1,14,Page 14 First paragraph - First line,"an additional ""ag"" next to ""is"" : ""Reinforcement Learning isag a very"" -> ""Reinforcement Learning is a very"" Note from the Author or Editor:Good catch, thanks. I fixed this typo, it should be fine now in the electronic versions, and it will be correct in the 2nd release of the book (printed in October)."
30,Printed,1,14,Page 14 2nd line,"Reinforcement Learning isag a very different beast. Note from the Author or Editor:Good catch, thanks!"
31,PDF,1,30,"Page 30 Bullet pt listing in ""Underfitting the Training Data"" section","The list of methods to counter underfitting is in plain text, while the analogous list with regards to overfitting in the previous section was highlighted in a warning/caution frame; might want to adjust. Note from the Author or Editor:Thanks, good point. I'll change the underfitting section to use a warning frame."
38,Printed,1,143,Page 143 Eq 4-13,"(3rd release) In Eq 4-13, bottom line of p143 and Eq 4-19, x^T \theta^{(k)} is used But for matching the order of theta and x in other places, I suggest (\theta^{(k)})^T x or \theta^T x Thanks Note from the Author or Editor:Thanks for your suggestion, I fixed the 3 instances you pointed out. FYI, I hesitated between ""x^T theta"" and ""theta^T x"" because the first linear equation in chapter 1 is written y = theta0 x0 + theta1 x1 + ..., which naturally translates to y = theta^T x. It would be weird to write y = x0 theta0 + x1 theta1 + ... However, when dealing with matrices, one typically writes y = X W: here, X has to appear first (and there's no transpose), because each row of X already corresponds to a transposed feature vector. I remember being confused the first time I saw this, so I wanted to quickly transition from theta-first to X-first. However, I was not careful enough, so I ended up having a confusing mixture of both! Oops... I think you're right that consistently using theta-first before we really tackle matrices is probably better."
32,PDF,2,47,Page 47 End of virtualenv box,"This is an error of omission. If we are going to be using jupyter in a virtual environment. Then we must also setup jupyter to use the libraries associated with said environment. The requires the following two steps $ python3 -m pip install -U ipykernel $ python3 -m ipykernel install --user --name=my_env After that, when starting jupyter you can select ""my_env"" and start working in that environment. Note from the Author or Editor:Thanks Mohammed, great catch! Since the ipykernel package is installed automatically along with jupyter, the first command is not required, but the second is important (at least if you plan to have more than one virtualenv, which is the whole point). I updated the book like this: --------------------------------------------  $ python3 -m pip install -U jupyter matplotlib numpy pandas scipy scikit-learn  Collecting jupyter  Downloading https://[...]/jupyter-1.0.0-py2.py3-none-any.whl  Collecting matplotlib  [...]  If you created a virtualenv, you need to register it to Jupyter and give it a name:  $ python3 -m ipykernel install --user --name=python3  Now you can fire up Jupyter by typing the following command:  $ jupyter notebook  [...] Serving notebooks from local directory: [...]/ml  [...] The Jupyter Notebook is running at:  [...] http://localhost:8888/?token=60995e108e44ac8d8865a[...]  [...] or http://127.0.0.1:8889/?token=60995e108e44ac8d8865a[...]  [...] Use Control-C to stop this server and shut down all kernels [...] -------------------------------------------- Notice that I removed this section: --------------------------------------------  To check your installation, try to import every module like this:  $ python3 -c ""import jupyter, matplotlib, numpy, pandas, scipy, sklearn""  There should be no output and no error. 
-------------------------------------------- This is because I didn't want the layout of the book to be affected too much, and this paragraph is not necessary since users will notice if there are errors in the previous steps. Again, thanks a lot for your great feedback!"
33,Printed,2,67,Page 67 Second paragraph,"""After one- hot encoding we get a matrix with thousands of columns, and the matrix is full of zeros except for one 1 per row."" The resulting matrix has thousands of ROWS, but only 5 columns. The code output directly after this text gives an example. Note from the Author or Editor:Thanks for your feedback. I see how this paragraph can be confusing. Please let me clarify. The paragraph starts with: """""" Notice that the output is a SciPy _sparse matrix_, instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories. After one-hot encoding, we get a matrix with thousands of columns, and the matrix is full of 0s except for a single 1 per row. [...] """""" My goal here was to explain that one-hot encoding categorical attributes with thousands of categories will result in a matrix with thousands of columns, in which case it's useful to have a sparse matrix, and that's the reason why the `OneHotEncoder` produces a sparse matrix. The sentence ""After one-hot encoding, we get..."" is in the context of the previous sentence ""This is very useful when you have categorical attributes with thousands of categories."" But I see how it's possible to interpret the sentence ""This is very useful..."" as a side comment, independent from the following sentence. In this case, ""After one-hot encoding..."" would seem to refer to the actual output of the previous code example. I've rephrased the paragraph to make it clearer: """""" Notice that the output is a SciPy _sparse matrix_, instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories, since in this case one-hot encoding will produce a matrix with thousands of columns, and this matrix would be full of 0s, except for a single 1 per row. Using tons of memory mostly to store zeros would be very wasteful, so instead a sparse matrix only stores the location of the nonzero elements. 
You can use it mostly like a normal 2D array, but if you really want to convert it to a (dense) NumPy array, just call the `toarray()` method: """""" Thanks again for your feedback! Cheers, Aurelien"
