# Mini-project 1: Create a dataset with CSV and JSON

In this mini-project, we will generate a fake dataset to warm up with dictionaries, functional programming, and the following libraries: `csv`, `json`, `numpy` and `matplotlib`.

## 1. Generate composed names
### 1.1. Define the generation function

Write a custom function that returns a list of permutations to create fake composed names separated by a dash, e.g. `Alice-Maria`. 

Be careful:
* The function has 1 input parameter, a list of first names, and returns the list of permutations with a "-" in between
* The output list must also include the reverse order, e.g. `Alice-Maria` and also `Maria-Alice`
* The output list must not contain repetitions, e.g. `Bob-Bob` (this is a permutation, not a product)

Although Python has tools to do this in some modules, it is a good exercise to start from an empty list and fill it progressively with functions we know.
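As a starting point, here is a minimal sketch of one possible approach using two nested loops. The function name `compose_names` is a hypothetical choice, and the comparison `first != second` assumes that the input list contains no duplicate names.

```python
def compose_names(first_names):
    """Return every ordered pair of distinct first names joined by a dash."""
    composed = []  # start from an empty list and fill it progressively
    for first in first_names:
        for second in first_names:
            if first != second:  # skip repetitions such as "Bob-Bob"
                composed.append(first + "-" + second)
    return composed


print(compose_names(["Alice", "Maria", "Bob"]))
```

With 3 names, this sketch produces 3 × 2 = 6 composed names, each pair appearing in both orders.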

In [None]:
# My code here [...]

### 1.2. Function call and comparison to other functions

Here is a list of 11 first names:
```
names = ["Bob", "Alice", "Maria", "Albert", "Paul", "Alex", "Luc", "Robert", "Dylan", "Léa", "Richard"]
```
Now call your function: the call with this list must return exactly 110 composed names (i.e. the exact number of permutations without repetition of 2 elements among 11, 11 × 10 = 110), stored in a variable named `composed_names`.
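For the comparison, one option is to build a reference list with `itertools.permutations` from the standard library. This is only a sketch; the variable name `reference` is an assumption introduced for illustration.

```python
from itertools import permutations

names = ["Bob", "Alice", "Maria", "Albert", "Paul", "Alex", "Luc", "Robert", "Dylan", "Léa", "Richard"]

# permutations(names, 2) yields every ordered pair taken from two distinct positions
reference = ["-".join(pair) for pair in permutations(names, 2)]

print(len(reference))  # 11 * 10 = 110
```

Assuming your own result is stored in `composed_names`, a check such as `sorted(composed_names) == sorted(reference)` would confirm that both approaches agree regardless of ordering.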


In [None]:
# My code here [...]

## 2. Generate characters as first + last names
### 2.1. Define the generation function

Write a custom function that returns a list of combinations of composed names and last names separated by a space, e.g. `Paul-Robert Loiseau`.

Be careful:
* The function has 2 input parameters, a list of composed first names and a list of last names, and returns a list of combinations
* It means that for each last name, we will insert in the resulting list as many characters as there are names in the list of first names
* Your list must be in this order: **first name and then last name**, thus it must not contain `Tournesol Paul-Alex` for instance.
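The constraints above could be sketched as follows. The name `make_characters` is hypothetical, and the short input lists are made up for the demonstration.

```python
def make_characters(composed_first_names, last_names):
    """Combine every composed first name with every last name."""
    characters = []
    for last in last_names:  # for each last name...
        for first in composed_first_names:  # ...add one character per first name
            characters.append(first + " " + last)  # first name, then last name
    return characters


sample = make_characters(["Paul-Robert", "Alice-Maria"], ["Loiseau", "Lampion"])
print(sample)
```

The result has `len(composed_first_names) * len(last_names)` entries, here 2 × 2 = 4.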


In [None]:
# My code here [...]

### 2.2. Function call

Here is a list of 10 last names (from the stories of Tintin):
```
surnames = ["Dupont", "Dupond", "Haddock", "Tournesol", "Castafiore", "Lampion", "Lopez", "Loiseau", "Müller", "Sanzot"]
```

Now call your functions: chaining the call from `1.2.` with a call to this new function must return exactly 1100 characters (110 composed names × 10 last names), stored in a variable named `characters`.

In [None]:
# My code here [...]

## 3. Import data from a CSV file

We will associate these characters with exam marks generated by another program and stored in a CSV file.
Use the documentation of the [`csv`](https://docs.python.org/3/library/csv.html) module for the next questions:

### 3.1. Load the file

Use **Right Click + Save As** to download the file [`exams.csv`](https://raw.githubusercontent.com/ymollard/python-advanced-slides/main/exercises/data/exams.csv). With Python, open it, load its content, and transform it in order to get the marks by discipline, for instance `math_marks = [15, 13...]`
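A possible way to group marks by column is sketched below with `csv.DictReader`. The header `math,french,philosophy`, the comma delimiter, and the sample values are assumptions made for this illustration: check the actual content of `exams.csv` and adapt accordingly. The helper name `load_marks` is hypothetical too.

```python
import csv
import io

# Hypothetical sample standing in for exams.csv; the real header may differ
SAMPLE_CSV = "math,french,philosophy\n15,10,11.5\n8.5,17,15\n"


def load_marks(csv_file):
    """Return a dict mapping each column name to its list of marks as floats."""
    marks = {}
    for row in csv.DictReader(csv_file):  # one dict per line, keyed by header
        for discipline, value in row.items():
            marks.setdefault(discipline, []).append(float(value))
    return marks


marks = load_marks(io.StringIO(SAMPLE_CSV))
math_marks = marks["math"]
print(math_marks)
```

With the downloaded file, the same function could be called as `with open("exams.csv", newline="") as f: marks = load_marks(f)`; the `csv` documentation recommends opening files with `newline=""`.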

In [None]:
# My code loading the CSV and displaying the means and std per discipline [...]

Install the numerical module `numpy` with pip in your venv (in the PyCharm system terminal).

Use the functions `numpy.mean()` and `numpy.std()` to get the mean and the standard deviation of the marks by discipline.

### 3.2. Plot the density of marks (Optional)

A density plot shows, for each of the 41 possible marks on the horizontal axis (from 0 to 20 with a 0.5 step), the number of occurrences of this mark on the vertical axis.

Install the plot module `matplotlib` with pip in your venv (in the PyCharm system terminal), create a list of occurrences for each possible mark, and plot the possible marks on the x-axis against their occurrences on the y-axis using `pyplot.plot()` and `pyplot.show()`.
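The counting step can be sketched in pure Python before any plotting. The function name `count_occurrences` and the sample marks are made up for this illustration, and the sketch assumes that every mark is a multiple of 0.5 between 0 and 20.

```python
def count_occurrences(marks):
    """Return the possible marks (0 to 20, step 0.5) and how often each occurs."""
    possible_marks = [i / 2 for i in range(41)]  # 0.0, 0.5, ..., 20.0
    occurrences = [marks.count(m) for m in possible_marks]
    return possible_marks, occurrences


x, y = count_occurrences([15, 13, 15, 8.5, 20])
print(y[30])  # number of students who got 15
```

The two lists could then be passed to `pyplot.plot(x, y)` followed by `pyplot.show()`. Calling `list.count` once per possible mark is simple but rescans the list each time; a single pass with a dictionary would scale better on large datasets.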

**Note:** Although it is good to practice operations on lists and plots, data scientists can rely on existing libraries to do this work for them. In particular, [pandas.DataFrame.plot.density](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.density.html) plots a kernel density estimate directly.

In [None]:
# Plot the density here [...]

## 4. Create an adapted data structure
Transform the data you read into a data structure made of nested dictionaries and/or lists.

The structure must represent the names of the students as well as their marks in the 3 exams.

For instance:
```
{
  "Alice-Maria Lampion" : {"math": 15, "french": 10, "philosophy": 11.5},
  "Paul-Alex Loiseau" : {"math": 8.5, "french": 17, "philosophy": 15},
  ...
}
```
**Be careful**: there are not as many marks as characters in your dataset. Crop the longer list to the size of the shorter one.
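One way to handle the cropping is to rely on `zip()`, which stops as soon as its shortest input is exhausted. The sketch below uses a hypothetical function name `build_dataset` and made-up sample data.

```python
def build_dataset(characters, math_marks, french_marks, philosophy_marks):
    """Map each character name to a dict of its three marks."""
    dataset = {}
    # zip() stops at the shortest input, which crops the longer lists
    for name, math, french, philosophy in zip(
        characters, math_marks, french_marks, philosophy_marks
    ):
        dataset[name] = {"math": math, "french": french, "philosophy": philosophy}
    return dataset


dataset = build_dataset(
    ["Alice-Maria Lampion", "Paul-Alex Loiseau", "Bob-Luc Haddock"],  # 3 names
    [15, 8.5],  # only 2 marks per discipline
    [10, 17],
    [11.5, 15],
)
print(dataset)
```

Here the third character is dropped because no marks are left for it.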

In [None]:
# My code creating the data structure here [...]

### 4.1. Save your data structure in JSON

Import the module `json` and use `json.dump()` to save your database in a file: `dataset.json`

Protip: add the parameter `indent=4` to make your JSON file readable by a human with a simple text editor. Open the file without Python to observe the result.
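The sketch below shows a save and reload round trip. It writes to a temporary directory only to stay self-contained; in the notebook the path would simply be `dataset.json`. The marks are made up, and `ensure_ascii=False` is an optional addition that keeps accented characters such as `ü` readable in the file instead of escaping them.

```python
import json
import tempfile
from pathlib import Path

# Made-up marks for illustration
dataset = {"Paul-Robert Müller": {"math": 12.5, "french": 9, "philosophy": 14}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump(dataset, f, indent=4, ensure_ascii=False)
    text = path.read_text(encoding="utf-8")  # what a text editor would show
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(text)
print(reloaded["Paul-Robert Müller"]["math"])
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform default encoding when the names contain non-ASCII characters.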

In [None]:
# My code saving the data structure in JSON here [...]

### 4.2. Read and check

We are now going to check that we can properly load the JSON file with `json.load()`.

We will first deliberately crash this Jupyter Notebook in order to start from scratch. Your code will remain in your browser but all variables will be lost.

In [None]:
# We deliberately end the interpreter here to make sure all previous variables are cleared.
import os
os._exit(0)

Now reload the JSON file into a Python variable and look up the math mark of Paul-Robert Müller:

In [None]:
# My code loading the saved JSON dataset here [...]

### 4.3. Optional: adapt to malformed data

You created a JSON file, but another file already existed in the current directory. However, this second file has many problems.

**Your goal**: Write a Python function that:
* goes through all entries of the current directory
* checks that each entry is actually a file (not a directory)
* checks that it is a JSON file (e.g. its name ends with `.json`)
* loads the data and adapts to malformed data
* computes the mean in mathematics of all students in that directory

Use the module [`pathlib`](https://docs.python.org/fr/3/library/pathlib.html) to handle file paths, existence and type. Note that `os.path` is not deprecated, but `pathlib` provides a higher-level, object-oriented interface to most of the same operations.
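The steps above could be sketched as follows. The kinds of malformation handled here (invalid JSON syntax, a missing `math` key, non-numeric marks, entries that are not dictionaries) are assumptions, since the actual defects of the existing file are not described; the function name `mean_math` and the demo files are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def mean_math(directory):
    """Return the mean math mark over all readable JSON files, or None."""
    marks = []
    for path in Path(directory).iterdir():
        if not path.is_file() or path.suffix != ".json":
            continue  # skip directories and non-JSON files
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # not valid JSON: skip the whole file
        if not isinstance(data, dict):
            continue
        for student in data.values():
            mark = student.get("math") if isinstance(student, dict) else None
            # bool is a subclass of int, so exclude it explicitly
            if isinstance(mark, (int, float)) and not isinstance(mark, bool):
                marks.append(mark)
    return sum(marks) / len(marks) if marks else None


# Demo on made-up files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "good.json").write_text('{"A": {"math": 10}, "B": {"math": 15}}', encoding="utf-8")
    (root / "broken.json").write_text('{"C": {"math": 3}', encoding="utf-8")
    (root / "partial.json").write_text('{"D": {"math": "n/a"}, "E": {"math": 20}}', encoding="utf-8")
    (root / "notes.txt").write_text("not json", encoding="utf-8")
    (root / "folder.json").mkdir()  # a directory whose name ends with .json
    result = mean_math(root)

print(result)
```

In this demo only the marks 10, 15 and 20 are kept. Silently skipping bad entries is one design choice among others; logging or counting the rejected entries may be preferable on real data.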

In [None]:
# My optional code loading malformed JSON files here [...]

# Resources

* itertools: https://docs.python.org/3/library/itertools.html
* Functional programming: https://docs.python.org/3/howto/functional.html
* csv: https://docs.python.org/3/library/csv.html
* json: https://docs.python.org/fr/3/library/json.html
* pandas: https://pandas.pydata.org/pandas-docs/stable/



# Opening

Python has a module called [`itertools`](https://docs.python.org/3/library/itertools.html) made for efficient looping. In particular, Python generators and iterators are more advanced but also more efficient ways to loop over data. They save memory, in some situations CPU time as well, and can even produce infinite sequences of data.
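As a small illustration of lazy evaluation, the sketch below takes a few values from an infinite iterator and pulls a single pair from `itertools.permutations` without building the full list. The variable names are chosen for this example only.

```python
from itertools import count, islice, permutations

# count() is an infinite iterator: values are produced only on demand
half_steps = (n / 2 for n in count())
first_marks = list(islice(half_steps, 5))
print(first_marks)  # [0.0, 0.5, 1.0, 1.5, 2.0]

# permutations() is lazy too: only the requested pairs are built
pairs = permutations(["Bob", "Alice", "Maria"], 2)
first_pair = next(pairs)
print(first_pair)  # ('Bob', 'Alice')
```

Because nothing is stored until requested, memory use stays constant however long the underlying sequence is.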

Although we have implemented our own generation functions for educational purposes, it is generally advisable to use functions from standard library modules, such as `itertools`, instead of equivalent custom functions. Standard library modules are full of micro-optimizations.