
fix: re-order JSON and CSV in the lesson about saving data (Python course) #1658


Open: wants to merge 2 commits into base: master

Conversation

honzajavorek
Collaborator

When working on #1584 I realized it'd be better if the lesson started with JSON and continued with CSV, not the other way.

In Python it doesn't matter and in JavaScript it's easier to start with JSON, which is built-in, and only then move to CSV, which requires an additional library. So for the sake of having both lessons aligned, I want to change the order in the Python lesson, too.

So most of the diff is just the two sections reversed, and the two exercises reversed. I made only a few additional changes to the wording.

Making this change because in Python it doesn't matter
and in JavaScript it's easier to start with JSON, which
is built-in, and only then move to CSV, which requires
an additional library.
@honzajavorek honzajavorek requested a review from TC-MO June 27, 2025 14:13
@honzajavorek honzajavorek added the t-academy Issues related to Web Scraping and Apify academies. label Jun 27, 2025

@cursor cursor bot left a comment


Bug: Module Import Timing and Export Instructions Mismatch

The lesson contains two main instructional inconsistencies:

  1. The csv module is prematurely imported in the JSON section's code example, appearing before the CSV format is introduced, which can confuse students.
  2. The instructions for adding data exports are contradictory: the JSON section tells users to "replace" the print(data) line, while the CSV section later says to "add one more data export", creating ambiguity about whether exports should coexist or replace each other.

sources/academy/webscraping/scraping_basics_python/08_saving_data.md#L86-L186

```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import csv
# highlight-next-line
import json
```
Next, instead of printing the data, we'll finish the program by exporting it to JSON. Let's replace the line `print(data)` with the following:
```py
with open("products.json", "w") as file:
    json.dump(data, file)
```
That's it! If we run the program now, it should also create a `products.json` file in the current working directory:
```text
$ python main.py
Traceback (most recent call last):
...
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Decimal is not JSON serializable
```
Ouch! JSON supports integers and floating-point numbers, but there's no guidance on how to handle `Decimal`. To maintain precision, it's common to store monetary values as strings in JSON files. But this is a convention, not a standard, so we need to handle it manually. We'll pass a custom function to `json.dump()` to serialize objects that it can't handle directly:
```py
def serialize(obj):
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError("Object not JSON serializable")

with open("products.json", "w") as file:
    json.dump(data, file, default=serialize)
```
If we run our scraper now, it won't display any output, but it will create a `products.json` file in the current working directory, which contains all the data about the listed products:
<!-- eslint-skip -->
```json title=products.json
[{"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95"}, {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00", "price": null}, ...]
```
If you skim through the data, you'll notice that the `json.dump()` function handled some potential issues, such as escaping double quotes found in one of the titles by adding a backslash:
```json
{"title": "Sony SACS9 10\" Active Subwoofer", "min_price": "158.00", "price": "158.00"}
```
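To see the escaping in isolation, here is a minimal sketch using `json.dumps()`, which returns a string instead of writing to a file. The `product` dictionary is a hand-written stand-in for one scraped item, not output from the scraper itself:

```py
import json

# Hand-written sample mirroring the product above (an assumption, not scraped data)
product = {
    "title": 'Sony SACS9 10" Active Subwoofer',
    "min_price": "158.00",
    "price": "158.00",
}

# The encoder escapes the double quote inside the title with a backslash
encoded = json.dumps(product)
print(encoded)

# Decoding gives back the original title, including the double quote
decoded = json.loads(encoded)
print(decoded["title"])
```

Because the escaping is reversed on load, any JSON parser should read the title back unchanged.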
:::tip Pretty JSON
While a compact JSON file without any whitespace is efficient for computers, it can be difficult for humans to read. You can pass `indent=2` to `json.dump()` for prettier output.
Also, if your data contains non-English characters, set `ensure_ascii=False`. By default, Python encodes everything except [ASCII](https://en.wikipedia.org/wiki/ASCII), which means it would save [Bún bò Nam Bô](https://vi.wikipedia.org/wiki/B%C3%BAn_b%C3%B2_Nam_B%E1%BB%99) as `B\\u00fan b\\u00f2 Nam B\\u00f4`.
:::
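As a rough illustration of both options from the tip, the sketch below compares the default output with the output of `indent=2` and `ensure_ascii=False`. The sample record is made up for the demonstration and uses `json.dumps()` so that nothing is written to disk:

```py
import json

# Made-up sample record containing non-ASCII characters
data = [{"title": "Bún bò Nam Bô", "price": "9.50"}]

# Default settings: compact output, non-ASCII characters escaped
escaped = json.dumps(data)
print(escaped)

# Indented output that keeps the original characters
readable = json.dumps(data, indent=2, ensure_ascii=False)
print(readable)
```

When writing such output to a file, it is probably safest to also pass `encoding="utf-8"` to `open()`, because the default file encoding can differ between platforms.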
## Saving data as CSV
The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheet apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.
In Python, we can read and write CSV using the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First, let's try something small in Python's interactive REPL to familiarize ourselves with the basic usage:
```py
>>> import csv
>>> with open("data.csv", "w") as file:
...     writer = csv.DictWriter(file, fieldnames=["name", "age", "hobbies"])
...     writer.writeheader()
...     writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"})
...     writer.writerow({"name": "Bob", "age": 42, "hobbies": "reading, TypeScript"})
...
```
We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents:
```csv title=data.csv
name,age,hobbies
Alice,24,"kickbox, Python"
Bob,42,"reading, TypeScript"
```
In the CSV format, if a value contains commas, we should enclose it in quotes. When we open the file in a text editor of our choice, we can see that the writer automatically handled this.
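The quoting behavior can also be checked without touching the file system. The following sketch writes to an in-memory `io.StringIO` buffer instead of a file and reads the rows back. The second row, with double quotes inside a value, is an extra case added here for illustration:

```py
import csv
import io

# In-memory text buffer standing in for a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "hobbies"])
writer.writeheader()
writer.writerow({"name": "Alice", "hobbies": "kickbox, Python"})
# Extra illustrative case: a value containing double quotes
writer.writerow({"name": "Bob", "hobbies": 'says "hi"'})

output = buffer.getvalue()
print(output)

# Reading the text back recovers the original values
rows = list(csv.DictReader(io.StringIO(output)))
print(rows)
```

With the default settings, the writer quotes only the values that need it and doubles any quote characters inside them, so the reader can reverse the process.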
When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it. If you're using a different operating system, try opening the file with any spreadsheet program you have.
![CSV example preview](images/csv-example.png)
Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this! First, let's add `csv` to our imports:
```py
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
# highlight-next-line
import csv
```
Next, let's add one more data export to the end of the source code of our scraper:


@apify-service-account

Preview for this PR was built for commit a8e57c7 and is ready at https://pr-1658.preview.docs.apify.com!
