# Mini-project 1: WARMUP - A dataset with CSV and JSON

In this mini-project, we will generate a fake dataset to warm up with dictionaries, functional programming, and the following libraries: `csv`, `json`, `itertools`, `numpy` and `matplotlib`.

# 4.1. Generate composed names
## 4.1.1. Define the generation function

Write a function that returns a list of permutations to create fake composed names separated by a dash, e.g. `Alice-Maria`. 

Be careful:
* The function has 1 input parameter, a list of first names, and returns the list of permutations with a "-" in between
* The output list must also include the reverse order, e.g. both `Alice-Maria` and `Maria-Alice`
* The output list must not contain repetitions, e.g. `Bob-Bob` (this is a permutation, not a product)

Although Python has tools to do this in some modules, it is a good exercise to start from an empty list and fill it progressively with functions we know.

In [None]:
# My code here [...]

In [4]:
import sys
sys.version

'3.8.10 (default, Sep 28 2021, 16:10:42) \n[GCC 9.3.0]'

In [3]:
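from typing import List

# Built-in generics such as list[str] need Python 3.9+; on 3.8 they raise
# "TypeError: 'type' object is not subscriptable", hence typing.List.
def generate_fn(names: List[str]) -> List[str]:
    output = []
    for name in names:
        for name2 in names:
            # Skip pairs made of the same name twice, e.g. "Bob-Bob"
            if name != name2:
                output.append(name + "-" + name2)
    return output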

## 4.1.2. Function call

Here is a list of 11 first names:
```
names = ["Bob", "Alice", "Maria", "Albert", "Paul", "Alex", "Luc", "Robert", "Dylan", "Léa", "Richard"]
```
The function call with this list must return exactly 110 composed names (i.e. the exact number of permutations without repetition of 2 elements among 11, 11 × 10), stored in a variable named `composed_names`.
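Once your own function works, the standard library offers a way to cross-check its result. The sketch below uses `itertools.permutations`; the variable name `expected` is only illustrative and is not part of the exercise.

```python
from itertools import permutations

names = ["Bob", "Alice", "Maria", "Albert", "Paul", "Alex", "Luc", "Robert", "Dylan", "Léa", "Richard"]

# permutations(names, 2) yields every ordered pair taken from two distinct positions,
# so both ("Alice", "Maria") and ("Maria", "Alice") appear, but never ("Bob", "Bob").
expected = ["-".join(pair) for pair in permutations(names, 2)]

print(len(expected))  # 11 * 10 = 110
```

Comparing `sorted(expected)` with the sorted output of your own function is one way to check both the count and the content.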


In [None]:
# My code here [...]

# 4.2. Generate characters as first + last names
## 4.2.1. Define the generation function

Write a function that returns a list of combinations of composed names and last names separated by a space, e.g. `Paul-Robert Loiseau`.

Be careful:
* The function has 2 input parameters, a list of composed first names and a list of last names, and returns a list of combinations
* This means that, for each last name, the resulting list contains as many characters as there are names in the list of first names
* Your list must be in this order: **first name and then last name**; it must not contain `Tournesol Paul-Alex`, for instance.
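The pairing described above is a Cartesian product of the two lists. As one possible sketch, assuming a hypothetical function name `generate_characters` and using `itertools.product` instead of hand-written loops:

```python
from itertools import product

def generate_characters(first_names, last_names):
    """Return every 'first last' pairing, with the first name always first."""
    return [f"{first} {last}" for first, last in product(first_names, last_names)]

# Tiny illustrative inputs, not the full lists of the exercise.
sample = generate_characters(["Paul-Robert", "Alice-Maria"], ["Loiseau", "Haddock"])
print(sample)
# ['Paul-Robert Loiseau', 'Paul-Robert Haddock', 'Alice-Maria Loiseau', 'Alice-Maria Haddock']
```

The result has `len(first_names) * len(last_names)` entries, which is a quick way to validate the size of your own output.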


In [None]:
# My code here [...]

## 4.2.2. Function call

Here is a list of 10 last names (from the stories of Tintin):
```
surnames = ["Dupont", "Dupond", "Haddock", "Tournesol", "Castafiore", "Lampion", "Lopez", "Loiseau", "Müller", "Sanzot"]
```

Chaining the function call from `4.1.2.` with a call to this function must return exactly 1100 characters (110 composed names × 10 last names), stored in a variable named `characters`.

In [None]:
# My code here [...]

# 4.3. Import data from a CSV file

The CSV format (Comma-Separated Values) is a very simple file format for exchanging tabular data (big data, machine learning, and so on), very similar to basic "Excel" sheets.

We can therefore open CSV files with a spreadsheet program (Excel, LibreOffice, Google Docs, ...) or even with a simple text editor (PyCharm, Notepad, ...). The content looks like this:

```
First name, Last name, Email, Age, City
Robert, Lepingre, bobby@example.com, 41, Paris
Mark, Fothergill, jeanne@example.com, 32, London
Santiago, Permado, pierre@example.com, 23, Madrid
```
In this example CSV content, we can read 5 columns separated by commas (first name, last name, email, age and city) as well as 3 records (3 lines): Robert, Mark and Santiago.

To generate such a file in Python, we could simply use `print()` for each field, then print a comma, then another field, and so on. Reading a CSV file manually, however, would be much more complicated. Fortunately, Python has a module named `csv` to read and write CSV files!
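As a minimal sketch of reading with the `csv` module, the sample content above can be parsed from memory with `csv.DictReader`. The names `content` and `records` are illustrative; `skipinitialspace=True` is used here only because this sample has a space after each comma.

```python
import csv
import io

# The sample content from above, held in memory instead of a file on disk.
content = """First name, Last name, Email, Age, City
Robert, Lepingre, bobby@example.com, 41, Paris
Mark, Fothergill, jeanne@example.com, 32, London
Santiago, Permado, pierre@example.com, 23, Madrid
"""

# skipinitialspace=True drops the space that follows each comma in this sample.
reader = csv.DictReader(io.StringIO(content), skipinitialspace=True)
records = list(reader)

for record in records:
    # Every field is read as a string, so numbers need an explicit conversion.
    print(record["First name"], int(record["Age"]))
```

With a real file, the `csv` documentation recommends opening it with `open(filename, newline="")` and passing the file object to the reader in the same way.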

Use the documentation of the `csv` module for the next questions:

## 4.3.1. Load the file

Open the file `exams.csv`, load its content, and:

1. Transform it in order to get the marks by discipline, for instance `math_marks = [15.5, 13, 10.5, 12, ...]`
2. Import the numerical module `numpy` and use its functions `mean()` and `std()` to display the mean and the standard deviation of the marks for each discipline
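For the second step, a short sketch of the two `numpy` calls, applied to the four values shown in the example above rather than to the real content of `exams.csv`:

```python
import numpy as np

# Illustrative marks only, not read from exams.csv.
math_marks = [15.5, 13, 10.5, 12]

mean = np.mean(math_marks)
# np.std defaults to the population standard deviation (ddof=0);
# pass ddof=1 to get the sample standard deviation instead.
std = np.std(math_marks)

print(f"math: mean={mean:.2f}, std={std:.2f}")  # math: mean=12.75, std=1.82
```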

In [None]:
# My code loading the CSV and displaying the means and std per discipline [...]

## 4.3.2. Optional: Plot the density of marks

A density plot shows, for each of the 41 possible marks on the horizontal axis (from 0 to 20 in steps of 0.5), the number of occurrences of that mark on the vertical axis. It is a way to check how the data are distributed.
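The counting step behind such a plot can be sketched with `collections.Counter`. The marks below are made up for illustration, and the names `possible_marks` and `occurrences` are assumptions, not part of the exercise.

```python
from collections import Counter

# Hypothetical marks, used only to illustrate the counting step.
marks = [10.5, 12, 12, 15.5, 12, 10.5]

# All possible marks from 0 to 20 in steps of 0.5.
possible_marks = [i / 2 for i in range(41)]

counts = Counter(marks)
# A Counter returns 0 for marks that never occur.
occurrences = [counts[mark] for mark in possible_marks]

print(occurrences[24])  # number of students with a mark of 12.0, here 3
```

The two lists can then be handed to `matplotlib`, for instance with `plt.bar(possible_marks, occurrences, width=0.4)`.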

In [None]:
# Plot the density here [...]

_Note_: here we propose to reinvent the wheel, for the pleasure of manipulating lists and primitive types! However, if you do data science with Python, use the `pandas` package for this kind of job.

In particular, [`density()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.density.html) plots a density curve in just a function call.


## 4.3.3. Create an adapted data structure
Transform the data you read into a data structure made of nested dictionaries and/or lists.

The structure must represent the names of the students as well as their marks in the 3 exams.

For instance:
```
{
  "Alice-Maria Lampion" : {"math": 15, "french": 10, "philosophy": 11.5},
  "Paul-Alex Loiseau" : {"math": 8.5, "french": 17, "philosophy": 15},
  ...
}
```
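One possible way to build such a structure is a loop over rows. This is a sketch under assumptions: the rows are shaped like what `csv.DictReader` could return, and the column names (`name`, `math`, `french`, `philosophy`) are hypothetical, since the real header of `exams.csv` may differ.

```python
# Hypothetical rows; the real column names in exams.csv may differ.
rows = [
    {"name": "Alice-Maria Lampion", "math": "15", "french": "10", "philosophy": "11.5"},
    {"name": "Paul-Alex Loiseau", "math": "8.5", "french": "17", "philosophy": "15"},
]

dataset = {}
for row in rows:
    # One inner dictionary per student, with marks converted from text to float.
    dataset[row["name"]] = {
        subject: float(mark) for subject, mark in row.items() if subject != "name"
    }

print(dataset["Paul-Alex Loiseau"]["math"])  # 8.5
```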


In [None]:
# My code creating the data structure here [...]

## 4.3.4. Save your data structure in JSON

Import the `json` module and use `json.dump()` to save your database in a file: `dataset.json`

Protip: add the parameter `indent=4` to make your JSON file readable by a human with a simple text editor. Open the file outside Python to see the result.
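A minimal sketch of the call, using a one-student dictionary and a temporary directory so that it does not overwrite your own `dataset.json`. The `ensure_ascii=False` argument is an optional extra, not required by the exercise.

```python
import json
import os
import tempfile

dataset = {"Alice-Maria Lampion": {"math": 15, "french": 10, "philosophy": 11.5}}

# Written to a temporary directory so the sketch leaves your own files untouched.
path = os.path.join(tempfile.mkdtemp(), "dataset.json")

with open(path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps characters such as "ü" readable in the file.
    json.dump(dataset, f, indent=4, ensure_ascii=False)

# Read the raw text back to see the effect of indent=4.
with open(path, encoding="utf-8") as f:
    text = f.read()
print(text)
```

Without `indent`, the whole structure is written on a single line, which is valid JSON but hard to read.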

In [None]:
# My code saving the data structure in JSON here [...]

## 4.3.5. Read and check

We are now going to check that we can properly load the JSON file with `json.load()`.

We will first deliberately kill the kernel of this Jupyter Notebook in order to start from scratch. Your code will remain in your browser but all variables will be lost.

In [None]:
# We deliberately end the interpreter here to make sure all previous variables are cleared.
import os
os._exit(0)

Now reload the JSON file into a Python variable and look up the math mark of Paul-Robert Müller:

In [1]:
# My code loading the saved JSON dataset here [...]

## 4.3.6. Optional: adapt to malformed data

You created a JSON file, but another file already existed in the current directory, and this second file has many problems.

**Your goal**: Write a Python function that:
* lists all entries in the current directory
* checks that each entry is actually a file (not a directory)
* checks that it is a JSON file (e.g. its name ends with `.json`)
* loads the data and adapts to malformed data
* computes the mean in mathematics of all students in that directory

**Protip**: Consult the documentation of the `os` module, in particular the functions `os.path.isdir`, `os.path.isfile` and `os.listdir`. Use `str.split` to extract the extension and handle exceptions with `try..except`.
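The sketch below shows one way these steps could fit together. It rests on assumptions: the function name `math_mean` is hypothetical, and "malformed" is taken to mean invalid JSON, a missing `"math"` key, or a non-numeric mark. The actual problems in the provided file may be different and need other handling.

```python
import json
import os
import tempfile

def math_mean(directory):
    """Mean of all usable math marks found in the .json files of `directory`."""
    marks = []
    for filename in os.listdir(directory):
        path = os.path.join(directory, filename)
        # Keep only regular files whose extension is "json".
        if not os.path.isfile(path) or filename.split(".")[-1] != "json":
            continue
        try:
            with open(path, encoding="utf-8") as f:
                data = json.load(f)
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # not valid JSON: skip the whole file
        if not isinstance(data, dict):
            continue
        for student in data.values():
            try:
                marks.append(float(student["math"]))
            except (KeyError, TypeError, ValueError):
                pass  # missing or non-numeric math mark: skip this student
    return sum(marks) / len(marks) if marks else None

# Hypothetical demo directory: one usable file, one broken file, one sub-directory.
demo = tempfile.mkdtemp()
os.mkdir(os.path.join(demo, "folder.json"))
with open(os.path.join(demo, "good.json"), "w", encoding="utf-8") as f:
    json.dump({"A": {"math": 10}, "B": {"math": "14"}, "C": {"french": 12}}, f)
with open(os.path.join(demo, "broken.json"), "w", encoding="utf-8") as f:
    f.write("{not valid json")

print(math_mean(demo))  # 12.0
```

Silently skipping bad entries keeps the sketch short; printing a warning for each skipped file or student would make the data problems easier to diagnose.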

In [None]:
# My optional code loading malformed JSON files here [...]

# Resources

* itertools: https://docs.python.org/3/library/itertools.html
* Functional programming: https://docs.python.org/3/howto/functional.html
* csv: https://docs.python.org/3/library/csv.html
* json: https://docs.python.org/fr/3/library/json.html
* pandas: https://pandas.pydata.org/pandas-docs/stable/

