# Data Preparation

In [2]:
# "magic commands" to enable autoreload of your imported packages
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Our goal is to load all 9 `.csv` files into 9 `pandas.DataFrame`s in a single dict named `data` where:
- each **key** is the **cleaned name** of the csv file
- each **value** is the **DataFrame** created from the csv

```python
data = { 
    'sellers': DataFrame1,
    'orders': DataFrame2,
    ...
    }
```

### 1. Create the variable `csv_path`, which stores the path to your `"csv" folder` as a string

Let's use Python's `pathlib` to handle our paths easily. Pathlib defines a very handy `Path` class.

We can instantiate a `Path` by giving it a string defining our path.

In [3]:
from pathlib import Path
csv_path = Path("~/.lewagon/olist/data/csv").expanduser()
csv_path

PosixPath('/Users/simonhingant/.lewagon/olist/data/csv')

_`PosixPath`_? That means it's a path in Posix format. POSIX (Portable Operating System Interface) is a standardized set of APIs and conventions for Unix-like operating systems to ensure compatibility and interoperability. Both Linux (e.g. Ubuntu on WSL) and macOS are POSIX-compliant.

We use `expanduser()` to expand the `~` into the absolute path of our home folder.
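As a small illustration of what `expanduser()` does, the sketch below compares a path before and after expansion. The folder name `some_folder` is hypothetical and does not need to exist, since the code only manipulates the path and never touches the file system.

```python
from pathlib import Path

# "some_folder" is a hypothetical folder name used only for illustration
raw_path = Path("~/some_folder")
expanded_path = raw_path.expanduser()

# Before expansion the path still starts with a literal "~" and is not absolute
print(raw_path.is_absolute())
# After expansion it is absolute and sits inside the home folder
print(expanded_path.is_absolute())
print(expanded_path == Path.home() / "some_folder")
```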

We can now use the `iterdir()` method to list the files in our folder:

In [4]:
file_paths = list(csv_path.iterdir())
file_paths

[PosixPath('/Users/simonhingant/.lewagon/olist/data/csv/olist_sellers_dataset.csv'),
 PosixPath('/Users/simonhingant/.lewagon/olist/data/csv/product_category_name_translation.csv'),
 PosixPath('/Users/simonhingant/.lewagon/olist/data/csv/olist_orders_dataset.csv'),
 PosixPath('/Users/simonhingant/.lewagon/olist/data/csv/olist_order_items_dataset.csv'),
 PosixPath('/Users/simonhingant/.lewagon/olist/data/csv/olist_customers_dataset.csv'),
 PosixPath('/Users/simonhingant/.lewagon/olist/data/csv/olist_geolocation_dataset.csv'),
 PosixPath('/Users/simonhingant/.lewagon/olist/data/csv/olist_order_payments_dataset.csv'),
 PosixPath('/Users/simonhingant/.lewagon/olist/data/csv/olist_order_reviews_dataset.csv'),
 PosixPath('/Users/simonhingant/.lewagon/olist/data/csv/olist_products_dataset.csv')]
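Note that `iterdir()` yields every entry in the folder, in no guaranteed order. If the folder might contain other files, one possible variant is to keep only the `.csv` files with `glob()` and sort them. The sketch below runs on a temporary folder with made-up file names, so it does not depend on the real data.

```python
import tempfile
from pathlib import Path

# Build a throwaway folder with hypothetical file names for the demo
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir)
    for file_name in ["b_dataset.csv", "a_dataset.csv", "notes.txt"]:
        (demo_path / file_name).touch()

    # glob("*.csv") skips non-csv entries; sorted() makes the order deterministic
    demo_file_paths = sorted(demo_path.glob("*.csv"))

# Only the two csv files are kept
print(len(demo_file_paths))
```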

In [5]:
# Test your code below. We try to load the first csv in the directory
import pandas as pd
pd.read_csv(file_paths[0]).head()

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista,SP


### 2. Create the list `file_names` containing all csv file names in the csv directory

- It should look like this `file_names = ['olist_sellers_dataset.csv', ....]`
- You can start from `file_paths`
- A `Path` has a `name` property

In [8]:
file_names=[path.name for path in file_paths]
file_names

['olist_sellers_dataset.csv',
 'product_category_name_translation.csv',
 'olist_orders_dataset.csv',
 'olist_order_items_dataset.csv',
 'olist_customers_dataset.csv',
 'olist_geolocation_dataset.csv',
 'olist_order_payments_dataset.csv',
 'olist_order_reviews_dataset.csv',
 'olist_products_dataset.csv']

### 3. Create the list of dict keys `key_names`
Start from `file_names` and clean each name by:
- Removing its suffix ".csv" when it exists
- Removing its suffix "_dataset.csv" when it exists
- Removing its prefix "olist_" when it exists

<details>
    <summary>- Hint - </summary>

- `.replace()`
    
- `str`ings are sequences you can slice with `[ ]`
</details>
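Besides `.replace()` and slicing, Python 3.9+ also provides `str.removeprefix()` and `str.removesuffix()`, which only strip text at the very start or end of a string and leave it unchanged otherwise. The helper name `clean_name` below is hypothetical; it is a sketch of one way to apply the three rules, not the expected solution.

```python
def clean_name(file_name: str) -> str:
    """Strip the 'olist_' prefix and the '_dataset.csv' or '.csv' suffix."""
    # Try the longer suffix first; if it is absent, the string is unchanged
    name = file_name.removesuffix("_dataset.csv")
    name = name.removesuffix(".csv")
    return name.removeprefix("olist_")


print(clean_name("olist_sellers_dataset.csv"))
print(clean_name("product_category_name_translation.csv"))
```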

In [41]:
key_names = [name.replace('olist_', '').replace('_dataset.csv', '').replace('.csv', '') for name in file_names]
key_names

['sellers',
 'product_category_name_translation',
 'orders',
 'order_items',
 'customers',
 'geolocation',
 'order_payments',
 'order_reviews',
 'products']

### 4. Construct the dictionary `data`

```python
data = { 
    'sellers': DataFrame1,
    'orders': DataFrame2,
    'order_items': DataFrame3,
    ...
    }
```
Where `DataFrame1`, `DataFrame2`, ... should be actual `pandas.DataFrame`s.

<details>
    <summary>▸ Hint</summary>

The `zip()` function is very useful to iterate over two lists in parallel:
```python
for (x, y) in zip(['a','b','c'], [1,2,3]):
    print(x,y)

# prints "a 1", then "b 2", then "c 3", one pair per line
    
```
</details>
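To connect the hint with the goal, here is a minimal sketch of how `zip()` can pair a list of keys with a list of values to build a dictionary. The toy lists are placeholders; in the challenge the values would be DataFrames.

```python
letters = ["a", "b", "c"]
numbers = [1, 2, 3]

# zip() pairs the items by position; the comprehension turns each pair into key: value
pairs = {letter: number for letter, number in zip(letters, numbers)}
print(pairs)
```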

In [45]:
from pathlib import Path
csv_path   = Path("~/.lewagon/olist/data/csv").expanduser()
file_paths = list(csv_path.iterdir())          # rebuild a clean LIST of paths

import pandas as pd
data = {}
for key, path in zip(key_names, file_paths):   # do NOT reuse the name "file_paths" as the loop variable here
    data[key] = pd.read_csv(path)

list(data.keys())

['sellers',
 'product_category_name_translation',
 'orders',
 'order_items',
 'customers',
 'geolocation',
 'order_payments',
 'order_reviews',
 'products']

### 5. Implement the method `get_data()` in `olist/data.py`

Time to move our logic from the notebook into our `.py` files. This will allow us to easily load the data in the new notebooks we'll create throughout this module.

Go and open the `olist/data.py` file in this module's folder (it's two levels up from this notebook), and start moving the code you have written in this notebook to the `get_data()` method. Skip the lines we wrote to test our code (no need to display lists and DataFrames in the `.py` file).
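As a rough guide, the method could look like the sketch below, which simply chains the steps from this notebook. The class body is an assumption about how `olist/data.py` is laid out, and the `csv_path` parameter is a hypothetical addition that makes it possible to point the sketch at another folder; the provided file may define the method differently.

```python
from pathlib import Path

import pandas as pd


class Olist:
    # The csv_path parameter is an assumption added for this sketch;
    # its default is the path used earlier in this notebook
    def get_data(self, csv_path="~/.lewagon/olist/data/csv"):
        """Return a dict mapping cleaned file names to DataFrames."""
        csv_path = Path(csv_path).expanduser()
        file_paths = list(csv_path.iterdir())
        file_names = [path.name for path in file_paths]
        key_names = [
            name.replace("olist_", "").replace("_dataset.csv", "").replace(".csv", "")
            for name in file_names
        ]
        # Pair each cleaned key with the DataFrame loaded from its csv
        return {key: pd.read_csv(path) for key, path in zip(key_names, file_paths)}
```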

The function should return the dictionary with DataFrames when calling it. Try it out with one of the csvs:

In [50]:
from olist.data import Olist
Olist().get_data()['sellers'].head()

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista,SP


### Test your code

In [51]:
from nbresult import ChallengeResult
from olist.data import Olist
data = Olist().get_data()
result = ChallengeResult('get_data',
    keys_len=len(data),
    keys=sorted(list(data.keys())),
    columns=sorted(list(data['sellers'].columns)),
    vars_used=Olist.get_data.__code__.co_names
    )
result.write()
print(result.check())


platform darwin -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /Users/simonhingant/.pyenv/versions/3.12.9/envs/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /Users/simonhingant/code/simsam56/03-Decision-Science/01-Statistical-Inference/data-data-preparation/tests
plugins: anyio-4.8.0, typeguard-4.4.2
collecting ... collected 3 items

test_get_data.py::TestGetData::test_columns PASSED                       [ 33%]
test_get_data.py::TestGetData::test_keys PASSED                          [ 66%]
test_get_data.py::TestGetData::test_len PASSED                           [100%]



💯 You can commit your code:

git add tests/get_data.pickle

git commit -m 'Completed get_data step'

git push origin master



❓ This piece of code needs to work from anywhere on your machine, not only in this notebook.
- Open a new terminal
- Go to your home folder `cd`
- Launch an `ipython` session
- Test the two lines of code above 👆
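One reason the data loading can work from any folder is that `expanduser()` produces an absolute path, which does not depend on the current working directory. The sketch below only checks that property by rebuilding the path from a temporary folder; it says nothing about whether `olist` itself can be imported from there, which depends on your setup.

```python
import os
import tempfile
from pathlib import Path

csv_path = Path("~/.lewagon/olist/data/csv").expanduser()
start_dir = os.getcwd()

# Move to a throwaway working directory, then rebuild the same path
with tempfile.TemporaryDirectory() as tmp_dir:
    os.chdir(tmp_dir)
    try:
        csv_path_elsewhere = Path("~/.lewagon/olist/data/csv").expanduser()
    finally:
        os.chdir(start_dir)  # always restore the original working directory

# The two paths are identical because neither depends on the working directory
print(csv_path == csv_path_elsewhere)
```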

🏁 Congratulations!

💾 Don't forget to commit & push:
* this `data_preparation.ipynb` notebook and the `tests/get_data.pickle` test results in the challenge folder
* as well as `data.py` in the `03-Decision-Science/olist` folder