fix: re-order JSON and CSV in the lesson about saving data (Python course) #1658
Making this change because in Python it doesn't matter and in JavaScript it's easier to start with JSON, which is built-in, and only then move to CSV, which requires an additional library.
Bug: Module Import Timing and Export Instructions Mismatch
The lesson contains two main instructional inconsistencies:
- The `csv` module is prematurely imported in the JSON section's code example, appearing before the CSV format is introduced, which can confuse students.
- The instructions for adding data exports are contradictory: the JSON section tells users to "replace" the `print(data)` line, while the CSV section later says to "add one more data export", creating ambiguity about whether exports should coexist or replace each other.
sources/academy/webscraping/scraping_basics_python/08_saving_data.md#L86-L186
apify-docs/sources/academy/webscraping/scraping_basics_python/08_saving_data.md
Lines 86 to 186 in a8e57c7
```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import csv
# highlight-next-line
import json
```
Next, instead of printing the data, we'll finish the program by exporting it to JSON. Let's replace the line `print(data)` with the following:
```py
with open("products.json", "w") as file:
    json.dump(data, file)
```
That's it! If we run the program now, it should also create a `products.json` file in the current working directory:
```text
$ python main.py
Traceback (most recent call last):
  ...
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Decimal is not JSON serializable
```
Ouch! JSON supports integers and floating-point numbers, but there's no guidance on how to handle `Decimal`. To maintain precision, it's common to store monetary values as strings in JSON files. But this is a convention, not a standard, so we need to handle it manually. We'll pass a custom function to `json.dump()` to serialize objects that it can't handle directly:
```py
def serialize(obj):
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError("Object not JSON serializable")

with open("products.json", "w") as file:
    json.dump(data, file, default=serialize)
```
If we run our scraper now, it won't display any output, but it will create a `products.json` file in the current working directory, which contains all the data about the listed products:
<!-- eslint-skip -->
```json title=products.json
[{"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95"}, {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00", "price": null}, ...]
```
If you skim through the data, you'll notice that the `json.dump()` function handled some potential issues, such as escaping double quotes found in one of the titles by adding a backslash:
```json
{"title": "Sony SACS9 10\" Active Subwoofer", "min_price": "158.00", "price": "158.00"}
```
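As a quick check of the escaping behavior described above, the following minimal sketch encodes a title containing a double quote and decodes it again. It uses `json.dumps()` and `json.loads()` on an in-memory string instead of a file, which is an assumption made only to keep the example short.

```python
import json

# Title taken from the sample output above; it contains a double quote.
title = 'Sony SACS9 10" Active Subwoofer'

# json.dumps() escapes the quote with a backslash when encoding.
encoded = json.dumps({"title": title})
print(encoded)

# Decoding restores the original string, so no data is lost.
decoded = json.loads(encoded)
print(decoded["title"] == title)
```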
:::tip Pretty JSON
While a compact JSON file without any whitespace is efficient for computers, it can be difficult for humans to read. You can pass `indent=2` to `json.dump()` for prettier output.
Also, if your data contains non-English characters, set `ensure_ascii=False`. By default, Python encodes everything except [ASCII](https://en.wikipedia.org/wiki/ASCII), which means it would save [Bún bò Nam Bô](https://vi.wikipedia.org/wiki/B%C3%BAn_b%C3%B2_Nam_B%E1%BB%99) as `B\\u00fan b\\u00f2 Nam B\\u00f4`.
:::
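To illustrate the two options from the tip, here is a minimal sketch comparing the default output with `indent=2` and `ensure_ascii=False`. The sample record is hypothetical, and `json.dumps()` is used instead of `json.dump()` so the result can be printed directly. When writing such output to a file, opening it with `encoding="utf-8"` is probably a good idea.

```python
import json
from decimal import Decimal


def serialize(obj):
    # Same approach as in the lesson: store Decimal values as strings.
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError("Object not JSON serializable")


# Hypothetical sample record, not taken from the scraped data.
data = [{"title": "Bún bò Nam Bô", "price": Decimal("74.95")}]

# Default settings: compact output, non-ASCII characters escaped.
compact = json.dumps(data, default=serialize)

# Indented output that keeps non-ASCII characters as they are.
pretty = json.dumps(data, default=serialize, indent=2, ensure_ascii=False)

print(compact)
print(pretty)
```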
## Saving data as CSV
The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheets apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.
In Python, we can read and write CSV using the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First let's try something small in the Python's interactive REPL to familiarize ourselves with the basic usage:
```py
>>> import csv
>>> with open("data.csv", "w") as file:
...     writer = csv.DictWriter(file, fieldnames=["name", "age", "hobbies"])
...     writer.writeheader()
...     writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"})
...     writer.writerow({"name": "Bob", "age": 42, "hobbies": "reading, TypeScript"})
...
```
We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents:
```csv title=data.csv
name,age,hobbies
Alice,24,"kickbox, Python"
Bob,42,"reading, TypeScript"
```
In the CSV format, if a value contains commas, we should enclose it in quotes. When we open the file in a text editor of our choice, we can see that the writer automatically handled this.
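The automatic quoting can also be verified programmatically. The sketch below writes one row to an in-memory buffer (an assumption made to avoid touching the file system) and reads it back with `csv.DictReader`. Note that CSV has no number type, so the age comes back as a string.

```python
import csv
import io

# In-memory stand-in for a file; newline="" leaves line endings to the csv module.
buffer = io.StringIO(newline="")

writer = csv.DictWriter(buffer, fieldnames=["name", "age", "hobbies"])
writer.writeheader()
writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"})

# The raw text shows the value with a comma wrapped in quotes.
raw = buffer.getvalue()
print(raw)

# Reading it back yields the original value as a single field.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows)
```

The `csv` module documentation recommends opening real files with `newline=""` as well, for example `open("data.csv", "w", newline="")`.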
When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it. If you're using a different operating system, try opening the file with any spreadsheet program you have.
![CSV preview](images/csv.png)
Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this! First, let's add `csv` to our imports:
```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
# highlight-next-line
import csv
```
Next, let's add one more data export to end of the source code of our scraper:
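The quoted excerpt ends here. One way to read the "replace" versus "add one more" wording is that both exports coexist at the end of the program. The following is only a sketch of that reading, not the lesson's actual code: the `data` list is a hypothetical stand-in for the scraped products, with field names taken from the JSON sample above, and the `products.csv` file name is an assumption.

```python
import csv
import json
from decimal import Decimal


def serialize(obj):
    # Same approach as in the lesson: store Decimal values as strings.
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError("Object not JSON serializable")


# Hypothetical stand-in for the list of products the scraper collects.
data = [
    {
        "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker",
        "min_price": Decimal("74.95"),
        "price": Decimal("74.95"),
    },
    {
        "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV",
        "min_price": Decimal("1398.00"),
        "price": None,
    },
]

# First export: JSON, as introduced in the JSON section.
with open("products.json", "w") as file:
    json.dump(data, file, default=serialize)

# Second export: CSV, written alongside the JSON file (file name assumed).
with open("products.csv", "w", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"])
    writer.writeheader()
    for row in data:
        writer.writerow(row)
```

In this sketch, `DictWriter` converts `Decimal` values to text on its own and writes `None` as an empty field, so the CSV export needs no custom serializer.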
When working on #1584 I realized it'd be better if the lesson started with JSON and continued with CSV, not the other way.
In Python it doesn't matter and in JavaScript it's easier to start with JSON, which is built-in, and only then move to CSV, which requires an additional library. So for the sake of having both lessons aligned, I want to change the order in the Python lesson, too.
So most of the diff is just the two sections reversed, and the two exercises reversed. I made only a few additional changes to the wording.