## Dependencies

---

In [88]:
import os
import pandas as pd

## Transformation

---

### Functions:

In [89]:
from typing import Any

def replace_datapoint(value: str, replacement: Any, datalist: list[Any]) -> None:
    """
    Replace the first occurrence of a known value in a list, in place. Useful for the kilometers list bug.
    :param value: value of the datapoint to replace.
    :param replacement: value of the datapoint to insert.
    :param datalist: list where the operation will be applied.
    """
    try:
        to_replace = datalist.index(value)
        datalist[to_replace] = replacement
    except ValueError:
        print("That value doesn't exist")

In [90]:
def fix_tuples(csv:str, index:int) -> pd.DataFrame:
    """
    Fix displaced tuples in a csv file.
    :param csv: name of the file to fix (without extension).
    :param index: index of the record in the description column to delete.
    :return: dataframe with tuples fixed.
    """
    dataframe = pd.read_csv(f"persistence/{csv}.csv")

    models = list(dataframe["model"])
    year = list(dataframe["year"])
    kilometer = list(dataframe["kilometers"])
    engine = list(dataframe["engine"])
    price = list(dataframe["price"])
    descriptions = list(dataframe["description"])

    # Drop the first entry of every field column and the unusable description,
    # so all columns keep the same length.
    for data_list in (models, year, kilometer, engine, price):
        data_list.pop(0)
    descriptions.pop(index)

    return pd.DataFrame({
        "model": models,
        "year": year,
        "kilometers": kilometer,
        "engine": engine,
        "price": price,
        "description": descriptions
    })


In [91]:
from typing import Callable

def group_ds(func: Callable[..., list[str]]) -> Callable[..., list[pd.DataFrame]]:
    """
    Decorator that turns a list of file names into a list of dataframes.
    :param func: function to decorate; it must return file names from the persistence directory.
    :return: wrapped function that returns a list of dataframes.
    """
    def wrapper(*args):
        files = func(*args)
        dsets = []
        for file in files:
            frame = pd.read_csv(f"persistence/{file}")
            dsets.append(frame)
        return dsets
    return wrapper

In [92]:
@group_ds
def list_ds(month: str) -> list[str]:
    """
    List files in the persistence directory by month.
    :param month: dataset prefix.
    :return: list of file names with the month prefix (the decorator loads them as dataframes).
    """
    path = os.getcwd()
    files = os.listdir(f"{path}/persistence")
    return [file for file in files if file.startswith(month)]


In [93]:
def concatenate_frames(dataframes: list[pd.DataFrame]) -> pd.DataFrame:
    """
    Concatenate a list of dataframes into a single one.
    :param dataframes: list of dataframes to concatenate.
    :return: single dataframe without duplicated rows and with a fresh index.
    """
    concatenation = pd.concat(dataframes)
    concatenation = (
        concatenation
        .drop_duplicates()
        .reset_index(drop=True)
    )

    return concatenation
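
To make the effect of the function concrete, the sketch below applies the same three steps (concatenate, drop duplicated rows, renumber the rows) to two toy frames. The frames and their values are invented for the example and only assume that an ad can appear in more than one scrape.

```python
import pandas as pd

# Two made-up scrapes that share one ad ("Honda City").
first = pd.DataFrame({"model": ["Kia Rio", "Honda City"], "price": [9500, 7900]})
second = pd.DataFrame({"model": ["Honda City", "BMW 320"], "price": [7900, 8200]})

# Stack the frames, keep one copy of each identical row, then renumber from 0.
combined = pd.concat([first, second]).drop_duplicates().reset_index(drop=True)
print(combined)
```

Note that `drop_duplicates` compares whole rows, so the same ad with a changed price would be kept twice.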


### Messy datapoints:

Some ads have incomplete fields. When a field is empty, the scraper skips it and reads the next one in its place: for example, if `kilometers` is empty, the `engine` value is extracted into the `kilometers` column.

In [94]:
frame = pd.read_csv("persistence/April-08.csv")
frame.tail(5)

Unnamed: 0,model,year,kilometers,engine,price,description
40,Kia Rio,2016,82500,Gasolina,9500,Precio negociable: si Color : Negro DESCRIPCIÓ...
41,Hyundai Elantra,2018,41200,Gasolina,11000,Color : NEGRO YA TIENE PLACAS WhatsApp o Teleg...
42,BMW 320,2009,49000,Gasolina,8200,Precio negociable: si Color : Gris 4 puertas. ...
43,Nissan Sentra,2013,14000,Gasolina,6200,Color : Negro DESCRIPCIÓN CALIFICACIONES SEGUR...
44,Hyundai Accent,2014,66000,Gasolina,7500,Precio negociable: si Color : CELESTE METALICO...
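
One possible way to spot these messy datapoints, sketched here on invented rows rather than on the real files: try to parse `kilometers` as a number and keep the rows where parsing fails. The toy frame and the variable names `toy`, `as_number`, and `messy` are assumptions for the example.

```python
import pandas as pd

# Made-up rows; the second ad had no kilometers, so the engine value landed there.
toy = pd.DataFrame({
    "model": ["Kia Rio", "Honda City", "BMW 320"],
    "kilometers": ["82500", "Gasolina", "49000"],
})

# errors="coerce" turns every value that is not a number into NaN.
as_number = pd.to_numeric(toy["kilometers"], errors="coerce")

# Rows whose kilometers could not be parsed are the candidates to fix by hand.
messy = toy[as_number.isna()]
print(messy)
```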


### Displaced tuples:

Because of the structure of some ads and the way the data was extracted, a displacement occurs when an ad is not extracted as it should be. I have to remove some unusable datapoints to keep `model` and `description` coherent, since the unwanted behavior affects the `description` values.

In [95]:
# frame = fix_tuples("April-08", 33)

In [96]:
# frame.tail()

In [97]:
# frame.to_csv("persistence/March-21.csv", index=False)
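
The mechanics of the fix can be sketched with plain lists. This is a simplified illustration on made-up data, not the notebook's actual records: `realign` is a hypothetical helper that mirrors the two removals done by `fix_tuples` (the first entry of the field columns and one description at a given index), and the example assumes descriptions sit one row too high until an unusable leftover entry.

```python
def realign(fields: list[str], descriptions: list[str], index: int) -> list[tuple[str, str]]:
    """Hypothetical helper: pair fields with descriptions after dropping the unusable entries."""
    fields = fields[1:]  # same effect as pop(0), without mutating the input
    descriptions = descriptions[:index] + descriptions[index + 1:]  # same effect as pop(index)
    return list(zip(fields, descriptions))

# Made-up data: ad "A" has no description, so the first descriptions are one row
# too high, and position 2 holds an entry that belongs to no ad.
models = ["A", "B", "C", "D"]
descriptions = ["desc B", "desc C", "leftover", "desc D"]

pairs = realign(models, descriptions, 2)
print(pairs)
```

The result has one row fewer than the input, which matches the idea of sacrificing a few datapoints to restore coherence.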

### Concatenation:

In [98]:
february = concatenate_frames(list_ds("February"))
march = concatenate_frames(list_ds("March"))
april = concatenate_frames(list_ds("April"))

In [99]:
months = [february, march, april]
concatenate_frames(months)

Unnamed: 0,model,year,kilometers,engine,price,description
0,Nissan Sentra,2018,60323,Gasolina,9850,Garantía: Como es visto no hay garantía\nFinan...
1,Kia Forte,2017,53000,Gasolina,9000,Garantía: Como es visto no hay garantía\nFinan...
2,Honda City,2013,209470,Gasolina,7900,Financiamiento: no\nPrecio negociable: si\nCol...
3,Kia Forte,2016,116672,Gasolina,6975,Garantía: Como es visto no hay garantía\nFinan...
4,Honda Civic,2017,66000,Gasolina,13500,Garantía: Garantía de fabrica restante.\nFinan...
...,...,...,...,...,...,...
155,Chevrolet Aveo,2017,43000,Gasolina,14300,Color : Vino metálico Mantenimiento al día en ...
156,Volkswagen Golf,1996,200000,Gasolina,2000,Garantía: Como es visto no hay garantía Financ...
157,Nissan Sentra,2017,86000,Gasolina,7900,Precio negociable: si Color : gris DESCRIPCIÓN...
158,Nissan Sentra,2002,0,Gasolina,4500,DESCRIPCIÓN CALIFICACIONES SEGURIDAD Nissan se...
