# Working with CSV and JSON files

In the previous tutorial, we saw how to use functions
from modules, as well as how to read and write text
files. In this tutorial, we put these new skills to use
by looking at **two types of text files that are very
frequently used to store and distribute data: CSV files
and JSON files**. We will learn how to manipulate these two
file types with the Python modules dedicated to each of
them: **the `csv` module and the `json` module**.

## Manipulating CSV files

### CSV files

CSV stands for ***comma-separated values***.
CSV files are intended to reproduce the
**structure of data from spreadsheets** such as Microsoft Excel or
LibreOffice Calc, reduced to strictly textual data (no
formatting, no column types, etc.).

As an example, we will use the CSV file that contains the list of
French departments in 2021, taken from the Official Geographic Code (COG).
Let's look at the first lines of this file with a `shell`
command to get a good idea of the structure of such a file.

In [None]:
!head -n 5 departement2021.csv

To continue the spreadsheet analogy, each
line in the file represents a row in the spreadsheet, and the cells in a
row are separated by commas. The first line can contain a
`header`, i.e. the names of the columns, but this is not
always the case.

The main advantages of CSV files are:

- their **simplicity**: they contain raw textual data,
so they are very lightweight and can easily be edited with
any text editor or programming language

- their **universality**: they are very widely used as a
standard data exchange format

### The `csv` module

Since the data contained in a CSV file is text data, we may
wonder why we need a dedicated module to manipulate
it, and why the tools we saw in the previous tutorial
would not be sufficient. The main reason is that
CSV files still have some subtleties and conventions, often
invisible to the user, but very important in practice. For
example: if we want to split the data on
commas, what happens if the text data itself
contains commas?

This is why we use the **`csv` module** to interact
with this type of file: it lets us build on the work of others who
have already asked themselves all these questions, so that we do not
have to reinvent the wheel with each CSV file import.
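
A minimal sketch of the comma problem raised above. The sample line is made up for illustration, and `io.StringIO` stands in for an opened file so that the example runs on its own.

```python
import csv
import io

# Hypothetical CSV line: the address contains a comma,
# so it is surrounded by quotation marks
raw_line = 'Maurice,"12, rue de la Paix",Paris\n'

# Naive approach: split on every comma, including the one inside the address
naive_cells = raw_line.strip().split(",")

# csv.reader understands the quotation marks and keeps the address in one cell
parsed_cells = next(csv.reader(io.StringIO(raw_line)))

print(naive_cells)   # four pieces, the address is broken in two
print(parsed_cells)  # three cells, as intended
```

The naive split produces one cell too many, while the `csv` module returns the three values that were meant.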

Note that in practice, we tend to manipulate this type of
data in the form of DataFrames (as in `R`), in order to take advantage
of its **tabular structure**. In a future tutorial, we will study
the ***package* `Pandas`**, which allows you to do precisely this in
Python. However, it is always useful to know how to manipulate the
data from a CSV file as text data, and therefore to know the
`csv` module.

### Reading

In [None]:
import csv

The syntax for reading and manipulating CSV files in Python
is very close to that for simple text files. The only
difference is that we have to create a `reader` object from the file
object to be able to iterate over the lines.

In [None]:
rows = []

with open("departement2021.csv") as file_in:
    csv_reader = csv.reader(file_in)
    for row in csv_reader:
        rows.append(row)

rows[:4]

We find the same syntax as for simple text files:
Once the `reader` is created, we can iterate over the lines and perform
operations with them; for example, storing them in a list
as above.
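
One detail worth keeping in mind: the `reader` returns every cell as a character string. The sketch below uses made-up data and `io.StringIO` as a stand-in for an opened file to show this, along with one possible way of converting a column to numbers.

```python
import csv
import io

# Made-up sample data; io.StringIO stands in for an opened file
sample = io.StringIO("nom,age\nMaurice,12\nManuela,11\n")

rows = []
for row in csv.reader(sample):
    rows.append(row)

# Every cell is a string, even when it looks like a number
first_data_row = rows[1]

# Explicit conversion of the second column, skipping the first line (column names)
ages = [int(row[1]) for row in rows[1:]]

print(first_data_row)
print(ages)
```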

When we have a CSV file with column names, as in our
case, it is useful to rely on them to access the data by name
rather than by position in a simple list. For this, we use
a `DictReader` instead of the `reader`. Now, when iterating over
the `DictReader` object, each line is a dictionary in which each key
is the name of a column and each value is the data of the corresponding cell.

To illustrate why this is useful, let's display the names of the departments whose
department code starts with 2.

In [None]:
with open("departement2021.csv") as file_in:
    dict_reader = csv.DictReader(file_in)
    for row in dict_reader:
        if row["DEP"].startswith("2"):
            print(row["LIBELLE"])

The code is much more readable: we easily understand what
data is manipulated and how.

### Writing

The syntax for writing is again quite close to that for
text files. The difference is that we are dealing with 2D data
(row x column): we can no longer pass a single character
string when writing, we must **pass a list of elements**.

In [None]:
header = ["nom", "classe", "age"]
row1 = ["Maurice", "5èmeB", 12]
row2 = ["Manuela", "6èmeA", 11]

with open("test.csv", "w") as file_out:
    csv_writer = csv.writer(file_out)
    csv_writer.writerow(header)
    csv_writer.writerow(row1)
    csv_writer.writerow(row2)

Let's check that our raw CSV file looks like what we
expected.

In [None]:
# Shell command to display the contents of a file
!cat test.csv

### The *header*

As in a spreadsheet-type document, the first line of a CSV
file usually contains the **variable names** (columns). We
call this line the ***header***. This line is not mandatory
in theory, but it is very practical for quickly
understanding the nature of the data in a CSV file.
It is therefore good practice to include a *header* when generating a
CSV file.

We saw in the previous example that the *header* is
written like any other data line. It is
when reading that things get more complicated, since the
*header* must be retrieved separately from the other data if the CSV file
contains one. Let's use the CSV generated in the previous step to illustrate
this.

In [None]:
data = []
with open("test.csv", "r") as file_in:
    csv_reader = csv.reader(file_in)
    header = next(csv_reader)
    for row in csv_reader:
        data.append(row)

In [None]:
print(header)

In [None]:
print(data)

To retrieve the *header*, we use the `next` function. This is a
*built-in* function that calls the `__next__` method of the `reader`
object, which advances the `reader` by one step. The first call
to the `next` function therefore returns the first line of the document. If a
*header* is present in the file (which must be checked beforehand),
the returned element is the *header*. We then retrieve the
rest of the data in the usual way, through a loop on the `reader` object, and store it
in a list of lists (one list per line).
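
As a possible alternative to handling the *header* by hand, the `csv` module also provides `DictWriter` and `DictReader`, which take care of it. The sketch below reuses the data from the writing example; `io.StringIO` is only an in-memory stand-in for a real file.

```python
import csv
import io

rows = [
    {"nom": "Maurice", "classe": "5èmeB", "age": 12},
    {"nom": "Manuela", "classe": "6èmeA", "age": 11},
]

buffer = io.StringIO()  # in-memory stand-in for a file

# Writing: the header is generated from the column names
dict_writer = csv.DictWriter(buffer, fieldnames=["nom", "classe", "age"])
dict_writer.writeheader()
dict_writer.writerows(rows)

# Reading: the first line is consumed as the header, the rest as data
buffer.seek(0)
dict_reader = csv.DictReader(buffer)
header = dict_reader.fieldnames
data = list(dict_reader)

print(header)
print(data)
```

Note that the values read back are strings: the age written as `12` comes back as `"12"`.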

### Importance of the delimiter

The **delimiter** is the character that is used to separate the
successive values of a row in a CSV file.

The CSV standard uses, as its name suggests, the comma as
delimiter, but this can be changed, and **it is not uncommon to come
across CSV files that use another delimiter**. In this case, you must
look directly at the raw text to see which delimiter is
used. For example, we often find files delimited by tabs
(the character `\t`), in which case the
file may have the extension `.tsv` for *tab-separated values*. You
must then specify the delimiter with the `delimiter` parameter
when creating the `reader`.
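
A short sketch of the `delimiter` parameter, using made-up tab-separated content held in an `io.StringIO` object rather than a real `.tsv` file.

```python
import csv
import io

tsv_text = "nom\tclasse\tage\nMaurice\t5èmeB\t12\n"

# With the default delimiter (comma), each line ends up in a single cell
rows_default = list(csv.reader(io.StringIO(tsv_text)))

# With the right delimiter, the cells are separated as expected
rows_tab = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

print(rows_default)
print(rows_tab)
```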

In practice, as with the encoding of a text file, **there is rarely a
valid reason to change the delimiter**. Even if commas
appear in values in the file (for example, in an address),
these values are then surrounded by quotation marks, which allows
the values to be separated correctly in the vast
majority of cases.
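
The quoting described above is applied by the `writer` itself when needed. A minimal sketch with a made-up row, again using `io.StringIO` in place of a file:

```python
import csv
import io

row = ["Manuela", "3, place de la Comédie", "Montpellier"]

buffer = io.StringIO()  # in-memory stand-in for a file opened for writing
csv.writer(buffer).writerow(row)
written_line = buffer.getvalue()

# Reading the line back gives the three original values
buffer.seek(0)
read_back = next(csv.reader(buffer))

print(written_line)
print(read_back)
```

Only the value containing a comma is surrounded by quotation marks in the written line.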

## Manipulating JSON files

### JSON files

JSON (*JavaScript Object Notation*) is a very popular file format
for writing and exchanging data in the form of a
single character string that is human-readable,
at least in theory.

As the name suggests, JSON is related to the *JavaScript* language
insofar as it is derived from the object notation of that
language. The format is now independent of any programming
language, however, and is very frequently used in many different
languages.

The JSON format is particularly important for statisticians and
data scientists because it is the **quasi-standard response format of
[APIs](https://fr.wikipedia.org/wiki/Interface_de_programmation)**.
Interacting with APIs goes beyond the scope of this introductory
course. However, since APIs are becoming increasingly widespread as the
standard mode of communication for data exchange, it is important to
master the basics of the JSON format in order to manipulate API
responses when we need to interact with them.

Since JSON stores objects as **key-value pairs**, and since its
values can be ***arrays*** (a fairly broad concept in
computer science that includes in particular the lists we already know), it
closely resembles Python dictionaries. It is therefore
quite a natural file format to ***serialize*** them,
that is to say, to go from a data structure in memory (here, a
dictionary) to a sequence of bytes that can be read
by any computer. As an example, let's look at the JSON representation
of a Python dictionary.

In [None]:
cv = {
    "marc": {"poste": "manager", "experience": 7, "hobbies": ["couture", "frisbee"]},
    "miranda": {"poste": "ingénieure", "experience": 5, "hobbies": ["trekking"]}
}

print(cv)

In [None]:
import json

print(json.dumps(cv))

We can see that the JSON representation is quite close to that of the
Python dictionary, with **some peculiarities**. Here, for
example, special characters such as accented letters are automatically
encoded as *Unicode* escape sequences.
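
If the escaped accents get in the way of readability, `json.dumps` accepts an `ensure_ascii` parameter. A minimal sketch, on a small dictionary introduced here only for illustration:

```python
import json

poste = {"poste": "ingénieure"}

# Default behaviour: non-ASCII characters are written as escape sequences
escaped = json.dumps(poste)

# ensure_ascii=False keeps accented characters as they are
readable = json.dumps(poste, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings describe the same data: converting either one back to Python gives the original dictionary.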

### The `json` module

The `json` module handles importing JSON files and exporting Python
objects to JSON format. In particular, it takes care of the
conversion constraints mentioned previously, such as that of accents.

In particular, **JSON can store most of the *built-in* Python
object types** that we have seen so far (*strings*,
numeric values, Booleans, lists, dictionaries, `NoneType`), among
others, but it cannot represent Python objects created
manually via classes, for example.
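
A small sketch of how some *built-in* values are represented in JSON, and of what happens with a value that is not serializable. The `record` dictionary is made up, and a `set` is used here as one example of a non-serializable type.

```python
import json

record = {"nom": "Marc", "actif": True, "manager": None, "notes": (12, 15.5)}

# True becomes true, None becomes null, and the tuple becomes an array
as_json = json.dumps(record)
print(as_json)

# A set has no JSON equivalent: the conversion raises a TypeError
set_is_serializable = True
try:
    json.dumps({"hobbies": {"couture", "frisbee"}})
except TypeError as error:
    set_is_serializable = False
    print("Not serializable:", error)
```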

### Writing

Let's start this time with writing. As we saw in
the previous example, the `dumps` function (for *dump string*) converts
a **serializable** Python value into its JSON representation as a
character string.

In [None]:
x = "test"
json.dumps(x)

In [None]:
x = [1, 2, 3]
json.dumps(x)

Writing a JSON file from Python simply amounts to writing
this representation to a text file, to which we give
the `.json` extension to clearly indicate that it is a particular
kind of text file. As this operation is very frequent, there is a
very similar function, `dump`, which performs both the conversion and
the writing.

In [None]:
with open("cv.json", "w") as file_out:
    json.dump(cv, file_out)

In [None]:
!cat cv.json

In a single operation, we serialized a Python dictionary (the object
`cv`) to a JSON file.

### Reading

The `json` module provides the `load` and `loads` functions, which perform
the inverse operations of the `dump` and `dumps` functions respectively:

- the `load` function imports JSON content from a
text file and converts it to a Python object (typically a dictionary)

- the `loads` function converts JSON content held in
a string into a Python object (typically a dictionary)
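
A quick sketch of `loads` on a few hand-written JSON strings. It shows that the type of the result depends on the JSON content: a dictionary is the most common case, but not the only one.

```python
import json

as_dict = json.loads('{"poste": "manager", "experience": 7}')
as_list = json.loads('[1, 2, 3]')
as_str = json.loads('"test"')

print(type(as_dict), type(as_list), type(as_str))

# dumps followed by loads gives back an equal object
round_trip = json.loads(json.dumps(as_dict))
print(round_trip == as_dict)
```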

Let's take the CV that we serialized previously in JSON format
to illustrate reading from a file.

In [None]:
with open("cv.json", "r") as file_in:
    data = json.load(file_in)
    
data

We will illustrate reading JSON content from a character
string with a realistic example: querying an
API. For this example, we will query the Base Adresse Nationale
(BAN), which allows any French address to be geolocated.

Querying an API in Python is very simple thanks to the
`requests` library. Let's see, for example, how we can retrieve
in just two lines of code geographic information about
all the streets in France whose name contains “comedie”.

In [None]:
import requests

In [None]:
response = requests.get("https://api-adresse.data.gouv.fr/search/?q=comedie&type=street")
r_text = response.text
print(r_text[:150])

The API returns a response, from which we extract the text content.
As with the vast majority of APIs, this content is JSON. We
can then import it into a Python dictionary with the
`loads` function (for *load string*) to be able to manipulate the data it
contains.

In [None]:
r_dict = json.loads(r_text)

In [None]:
r_dict.keys()

In [None]:
type(r_dict["features"])

The results that interest us are contained in the value of the
dictionary associated with the key `features`, which is a list of
dictionaries, one per result.
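
To make this structure concrete without calling the API, here is a sketch on a hand-written and heavily trimmed stand-in for the response text. Only a few fields are kept, and the real response contains many more.

```python
import json

# Hand-written stand-in for the API's text response (not the real output)
sample_text = """
{
  "features": [
    {
      "geometry": {"coordinates": [3.162808, 50.64628]},
      "properties": {
        "name": "Allee de la Comedie",
        "city": "Villeneuve-d'Ascq",
        "citycode": "59009"
      }
    }
  ]
}
"""

sample_dict = json.loads(sample_text)

# Each result is a dictionary: we navigate it key by key
first_result = sample_dict["features"][0]
name = first_result["properties"]["name"]
city = first_result["properties"]["city"]
longitude, latitude = first_result["geometry"]["coordinates"]

print(name, city, longitude, latitude)
```

With a live response, `requests` also offers `response.json()`, which performs the same parsing as `json.loads(response.text)`.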

In [None]:
r_dict["features"][0]

In [None]:
r_dict["features"][1]

## Exercises

### Comprehension questions

- 1/ What is a CSV file?

- 2/ What are the advantages of the CSV format?

- 3/ Why do we use the `csv` module to read and write
CSV files?

- 4/ Are the data in a CSV file necessarily separated by
commas?

- 5/ What is the *header* of a CSV file? Does it necessarily
exist?

- 6/ Why is the JSON format widely used in data
manipulation?

- 7/ What Python object does content in JSON format look like?

- 8/ What types of Python objects can be converted to JSON?

- 9/ What is the serialization of a Python object?

- 10/ What is the main common point between CSV files and
JSON files?

- 11/ Does a file with the extension .json necessarily contain
JSON?

<details>

<summary>

Show solution

</summary>

- 1/ A CSV is a text file that represents the raw information
of a spreadsheet-type document. Each line of the file represents a
spreadsheet row, and cells in a row are separated by
commas. The first line may contain a `header`,
that is, the name of the columns, but this is not always the case.

- 2/ Simplicity of reading and editing, universality.

- 3/ Even if the CSV format is very simple, it has certain
characteristics (delimiter, end-of-line character, etc.) that
must be taken into account when reading or editing a CSV file. The `csv` module
provides functions that take these particularities into account.

- 4/ No, in theory the data can be separated by any
character or sequence of characters. In practice, you should follow the
convention in most cases, which is to use a comma.

- 5/ It is the first line of the CSV file, which normally contains
the variable names, but this is not always the case.

- 6/ It is the most common response format of APIs, which are
widely used for data dissemination and exchange.

- 7/ To dictionaries.

- 8/ All so-called serializable objects, which include most of the
basic objects that we have seen, but not objects created
manually via classes.

- 9/ Serializing a (serializable) Python object consists of
converting the data contained in this object into a sequence
of bytes, that is, into a message that can be understood by
any computer.

- 10/ They are text files.

- 11/ No. JSON files, like CSV files, are text
files. The extension is a convention that, in the vast
majority of cases, tells us what the file contains, but it
cannot guarantee it.

</details>

### Sort keys when writing JSON

The next cell contains a dictionary. The goal of the exercise is
to write this data to a JSON file, sorting the keys of the
dictionary in alphabetical order.

Hint: The `dump` function of the `json` module has a parameter
to sort the keys. Read the [documentation of the
function](https://docs.python.org/fr/3/library/json.html#json.dump) to
find it.

In [None]:
data = {"id": 1, "nom": "Isidore", "age": 29}

In [None]:
# Test your answer in this cell

<details>

<summary>

Show solution

</summary>

``` python
import json

data = {"id": 1, "nom": "Isidore", "age": 29}

with open("data_sorted.json", "w") as file_out:
    json.dump(data, file_out, sort_keys=True)
```

</details>

### Convert a non-serializable object to JSON

We have seen that the objects that we create manually via classes
are generally not serializable. The following cell shows an
example with the `Citron` class used in the OOP tutorial.
Trying to convert the object directly to JSON raises an error.

You need to modify the following code in order to be able to serialize the object.
To do this, you must:

- convert the `mon_citron` instance to a dictionary using the *built-in*
`__dict__` attribute of the instance

- convert the resulting dictionary to JSON as a character
string

In [None]:
import json

class Citron:

    def __init__(self, couleur, qte_jus):
        self.saveur = "acide"
        self.couleur = couleur
        self.jus = qte_jus
        
mon_citron = Citron(couleur="jaune", qte_jus=45)
json.dumps(mon_citron)

In [None]:
# Test your answer in this cell

<details>

<summary>

Show solution

</summary>

``` python
import json

class Citron:

    def __init__(self, couleur, qte_jus):
        self.saveur = "acide"
        self.couleur = couleur
        self.jus = qte_jus
        
mon_citron = Citron(couleur="jaune", qte_jus=45)
mon_citron_dict = mon_citron.__dict__

json.dumps(mon_citron_dict)
```

</details>

### Change the delimiter of a CSV file

Your current directory contains the file `nat2020.csv`. This is the
first names file published by INSEE: it contains data on the
first names given to children born in France between 1900 and 2020.

Problem: contrary to the CSV standard, the delimiter used is
not the comma. So you must:

- find the separator used (via the Jupyter text editor, via
a shell command, by testing with the `csv` module in Python, etc.)
in order to read the file correctly

- generate a new CSV file `nat2020_corr.csv` containing the
same data, but this time with the comma as separator.

In [None]:
# Test your answer in this cell

<details>

<summary>

Show solution

</summary>

``` python
import csv

# Use a shell command to find the delimiter used
!head -n 3 nat2020.csv

with open('nat2020.csv', 'r') as file_in:
    # Read the existing CSV file
    reader = csv.reader(file_in, delimiter=';')
    with open('nat2020_corr.csv', 'w') as file_out:
        # Write to the new CSV file
        writer = csv.writer(file_out)  # The default delimiter is the comma
        for row in reader:
            writer.writerow(row)

# Check the result with a shell command
!head -n 3 nat2020_corr.csv
```

</details>

### Extract and save data from an API

The exercise consists of making a request to the Base Adresse
Nationale API and saving the results in a CSV file. Here are the
steps to implement:

- perform a street name query with a keyword as in the
tutorial (if you want to make a more complex query, you
can look at the [documentation of
the API](https://adresse.data.gouv.fr/api-doc/adresse)) and store the
results in a dictionary

- create a CSV file `resultats_ban.csv` in which we will store
the following information: `nom`, `ville`, `code_commune`,
`longitude`, `latitude`

- using a `writer` object and looping through the results
returned by the API, write each line to the CSV

For example, for the street query with the keyword “comedie”, here is
the CSV to obtain:

nom,ville,code_commune,longitude,latitude
Old Comedy Street,Lille,59350,3.063832,50.635192
Comedy Square,Montpellier,34172,3.879638,43.608525
Comedy Street,Cherbourg-en-Cotentin,50129,-1.629732,49.641574
Allee de la Comedie,Villeneuve-d’Ascq,59009,3.162808,50.64628
Old Comedy Street,Poitiers,86194,0.342649,46.580457

In [None]:
# Test your answer in this cell

<details>

<summary>

Show solution

</summary>

``` python
import csv
import json

import requests

response = requests.get("https://api-adresse.data.gouv.fr/search/?q=comedie&type=street")
r_text = response.text
r_dict = json.loads(r_text)

with open('resultats_ban.csv', 'w') as file_out:
    header = ['nom', 'ville', 'code_commune', 'longitude', 'latitude']
    csv_writer = csv.writer(file_out)
    csv_writer.writerow(header)
    # One CSV row per result returned by the API
    for result in r_dict['features']:
        nom = result['properties']['name']
        commune = result['properties']['city']
        code_commune = result['properties']['citycode']
        long, lat = result['geometry']['coordinates']
        row = [nom, commune, code_commune, long, lat]
        csv_writer.writerow(row)
```

</details>

### Cut the base of departments by regions

The objective of this exercise is to split the CSV file of
departments that we used in the tutorial into several small
CSV files, one per region. This type of operation can be useful, for example,
when working with a very large file that does not fit
in memory; splitting it into several files that are processed
independently, where possible, reduces the
volume of data to handle at once.

Here is the list of operations to be carried out:

- create a `dep` folder in the current directory using the
`pathlib` module (see previous tutorial)

- with a `reader` object from the `csv` module, loop through the
lines of the departments CSV file. Be careful not to include
the *header*: use the `next` function to skip the first
line. For each subsequent line:

  - retrieve the region code (variable `REG`)

  - generate the path of the CSV file `dep/{REG}.csv`, where `{REG}` is to
  be replaced with the region code of the line

  - open this CSV file in `append` mode to write the line at the
  end of the file

In [None]:
# Test your answer in this cell

<details>

<summary>

Show solution

</summary>

``` python
import csv
from pathlib import Path

path_dep = Path("dep/")
path_dep.mkdir(exist_ok=True)

with open('departement2021.csv', 'r') as file_in:
    csv_reader = csv.reader(file_in)
    next(csv_reader)  # Skip the header
    for row in csv_reader:
        reg = row[1]  # Region code (REG column)
        filename = reg + '.csv'
        path_reg_file = path_dep / filename  # Path of the region's CSV file
        # Append mode: the row is added at the end of the region file
        with open(path_reg_file, 'a') as file_reg_out:
            writer = csv.writer(file_reg_out)
            writer.writerow(row)
```

</details>

### Add missing *headers*

In the previous exercise, we split the CSV file of
French departments into several CSV files, one per region. But
we did not include the *header* in the different files,
i.e. the first line that contains the column names. So we will
manually add it to each of the CSV files created during
the previous exercise.

Here is the list of operations to be carried out:

- read the complete departments file and retrieve the `header`
in a list with the `next` function

- save in a list the paths of the different CSV files
contained in the `dep` folder with the `glob` method of `pathlib`
(see previous tutorial)

- for each path:

  - open the existing CSV file, and retrieve the data
  as a list of lists (one list per line)

  - open the CSV file for writing to reset it, write
  the header first, then write the data that was
  previously saved in the list

In [None]:
# Test your answer in this cell

<details>

<summary>

Show solution

</summary>

``` python
import csv
from pathlib import Path

with open('departement2021.csv', 'r') as file_in:
    csv_reader = csv.reader(file_in)
    header = next(csv_reader)

dep_files_paths = list(Path("dep/").glob('*.csv'))

for path in dep_files_paths:
    # Read the existing file and store its lines in a list
    with open(path, 'r') as file_dep_in:
        reader = csv.reader(file_dep_in)
        dep_rows = []
        for row in reader:
            dep_rows.append(row)
    # Rewrite the file, writing the header first
    with open(path, 'w') as file_dep_out:
        writer = csv.writer(file_dep_out)
        writer.writerow(header)
        for row in dep_rows:
            writer.writerow(row)
```

</details>