In [None]:
import numpy as np 
import pandas as pd
from pathlib import Path

We set some default values for our project and check that the file exists.

In [None]:
datadir = Path("data/raw/")
outputdir = Path("data/processed/")
filename = datadir / "les1.csv"
filename.resolve(), filename.exists()

In [None]:
df = pd.read_csv(filename)
df.head()

Let's check some of the statistics

In [None]:
df.describe()

The count shows that x2 has some NaNs.
Let's select all columns that contain NaNs.

In [None]:
select = list(df.isna().sum() > 0)
select

Check if it works

In [None]:
df.columns[select]

Now drop the rows that have NaNs in those columns.

In [None]:
df = df.dropna(subset=df.columns[select], axis="rows")
df

We dropped 18 rows.
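
A count like this can be verified by comparing the number of rows before and after the drop. The sketch below uses a small made-up frame; the column names and values are illustrative and are not the contents of `les1.csv`.

```python
import numpy as np
import pandas as pd

# Toy data, assumed for illustration only.
toy = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0], "x2": [0.5, np.nan, 1.5, np.nan]})

# Columns that contain at least one NaN.
nan_columns = toy.columns[toy.isna().any()]

before = len(toy)
cleaned = toy.dropna(subset=nan_columns)
dropped = before - len(cleaned)

print(f"dropped {dropped} of {before} rows ({dropped / before:.0%})")
```

If the share of dropped rows is large, dropping might not be the right strategy for that column.
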
Let's check the types:

In [None]:
df.dtypes

Let's clean up the name column.
We use a regular expression to select the first word, up to the first space.
Use https://regex101.com to create and test your own regular expressions.
Or use [you.com to try with natural language](https://you.com/search?q=regular+expression+that+selects+the+first+word+up+to+a+space&fromSearchBar=true)

In [None]:
import re

# raw string, so the backslash in \w is not treated as a string escape
regex = re.compile(r"^\w+")
out = re.search(regex, "Python Regius")
out.group()

Let's put that into a function

In [None]:
def extract(regex, msg):
    # return the first match, or msg unchanged if nothing matches
    out = re.search(regex, msg)
    return out.group() if out else msg

And apply it

In [None]:
df["name"] = df["name"].apply(lambda x: extract(regex=regex, msg=x))
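
As an alternative to `apply` with a lambda, pandas has vectorized string methods. The sketch below uses `Series.str.extract` on a few made-up names (assumed for illustration, not taken from the dataset). Note that this variant needs a capture group in the pattern, and that rows without a match become NaN.

```python
import pandas as pd

# Toy names, assumed for illustration only.
names = pd.Series(["Python Regius", "Boa constrictor", "Naja"])

# The capture group marks the part to keep; expand=False returns a Series.
first_word = names.str.extract(r"^(\w+)", expand=False)
print(first_word.tolist())
```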

In [None]:
df.head()

Looks good.

Now we save the file with a timestamp.

In [None]:
from datetime import datetime
tag = datetime.now().strftime("%Y%m%d-%H%M") + ".csv"
output = outputdir / tag
# make sure the output folder exists before writing
outputdir.mkdir(parents=True, exist_ok=True)
df.to_csv(output, index=False)

A lot of data scientists will stop here.

However, while the job is done, leaving things like this is risky: a notebook is hard to rerun, test and reuse.
Notebooks are for prototyping, not for creating a solid solution.

Now, have a look at the src folder.
Start at main.py, and also look at the other files.

Now go to the terminal and cd to the `les1` directory.
From there, run:

`python src/main.py --file=les1.csv`

Note that a `logging.log` file appears; have a look at its contents.

If your terminal complains that it can't find dependencies (e.g. click), it does not know where to locate them.
Depending on your setup, you can fix this by running:
`pdm run python src/main.py --file=les1.csv`.
This tells your terminal to find the current virtual environment and use it to run the python command.

Another option is to manually activate the virtual environment with `eval $(pdm venv activate)`, but this might not
work as expected in some setups (e.g. the docker devcontainer).

# Exercise

Inside the data/raw folder in the root directory DME22 you will find a `palmerpenguins.parq` file.
This is a parquet file. If you always use csv, [read this](https://bawaji94.medium.com/feather-vs-parquet-vs-csv-vs-jay-55206a9a09b0) so you know why that's not a good idea.

## setup
- mkdir a new folder outside of the DME22 folder, named `cleanup`
- initialize a new virtual environment (see docs/dependencies)
- `add` or `install` the libraries you need (e.g. with poetry, pip or pdm)
- mkdir a `data/raw` folder and copy `palmerpenguins.parq` into it. Bonus points if you use the `cp` command :)
- mkdir a `data/processed`, `src` and `notebook` folder.
- create a `src/main.py` file

## prototype
Make a notebook where you:
- load the file with pandas
- check for nans
- figure out how to clean them out.
- double check how many rows you throw away. Find a solution if this doesn't look good.
- clean up the column with the names of the penguins. They are too long, so shorten them with a regex.
- save the cleaned file with a timestamp

## implement
After you have prototyped this, create a `__init__.py` file inside the `src` folder.
Streamline the cleanup process as a command line executable process.
Use [click](https://click.palletsprojects.com/en/8.1.x/) to create easy arguments.
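
A minimal sketch of what a click command could look like. The option names, defaults and the placeholder body are assumptions for illustration; they are not the contents of the course's `src/main.py`. `CliRunner` is click's own test helper and is used here only to show a call without leaving Python.

```python
from pathlib import Path

import click
from click.testing import CliRunner


@click.command()
@click.option("--file", "filename", default="palmerpenguins.parq", help="File name inside the data folder.")
@click.option("--datadir", default="data/raw", help="Folder that holds the raw data.")
def main(filename: str, datadir: str) -> None:
    """Report which file would be cleaned (placeholder for the real work)."""
    path = Path(datadir) / filename
    click.echo(f"would clean {path}")


# Invoke the command the way the terminal would, but from within Python.
result = CliRunner().invoke(main, ["--file", "other.parq"])
print(result.output)
```

In a real `src/main.py` you would drop the `CliRunner` lines and call `main()` under the usual `if __name__ == "__main__":` guard, so that `python src/main.py --file=other.parq` works from the terminal.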

Try to add type hints.
Format your code with [black](https://github.com/psf/black) by running `black src` from the command line, where src is the folder you want to format. (Make sure you have cd'd to the folder in which `ls` shows `src`.)

If your terminal does not find black, you may need to install black into your venv (e.g. `pip install black` or `pdm add black`). If you are sure you did that (e.g. by checking the `/.venv/lib/python3.11/site-packages/` folder for black), then you need to activate your venv; again, this means you either add `pdm run` in front of the command or do `eval $(pdm venv activate)`.

What is especially tricky is that vscode will sometimes try to be helpful and activate a venv for you, but a different one than the venv you are working in. This can be very confusing. If you are not sure, check the terminal output of the command `which python`. If it does not point to the venv you are working in, you might first need to undo vscode "helping" you by running `deactivate` in the terminal. If this confuses you, practice activating and deactivating venvs until you are familiar with it.
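
The same check can be done from inside Python, which is handy in a notebook where `which python` says nothing about the kernel. A small sketch using only the standard library:

```python
import sys

# Path of the interpreter that is running this code.
print(sys.executable)

# sys.prefix points at the active environment; sys.base_prefix at the
# Python installation it was created from. They differ inside a venv.
in_venv = sys.prefix != sys.base_prefix
print(f"running inside a virtual environment: {in_venv}")
```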

Add logging with loguru.

Don't hardcode any settings. Use pydantic with a settings.py file.
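
A minimal sketch of what a `settings.py` could contain. The field names and defaults are assumptions for illustration. It uses a plain `BaseModel` to keep dependencies small; pydantic also offers a `BaseSettings` class that can read environment variables (in pydantic 2 it lives in the separate `pydantic-settings` package).

```python
from pathlib import Path

from pydantic import BaseModel


class Settings(BaseModel):
    # Field names and defaults are assumptions for this sketch.
    datadir: Path = Path("data/raw")
    outputdir: Path = Path("data/processed")
    filename: str = "palmerpenguins.parq"


settings = Settings()
print(settings.datadir / settings.filename)

# Values are validated and converted: a string becomes a Path.
custom = Settings(datadir="other/raw")
print(type(custom.datadir).__name__)
```

Other modules can then import `settings` instead of repeating paths as string literals.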



