# Data Analytics Examples

## Basic operations on data files using Python

The following is a practical example of manipulating plain-text data files using lists. Although this task is much easier with specialized Python libraries, the goal here is to illustrate the data structures available in pure Python.

### Skills to develop

After completing this tutorial, you will be able to:

- Download data files from the internet and convert them to a list of fields.

- Display the data as a table.

- Reorder the columns of the table.

- Get a subset of records.

- Filter records.

- Search for records.

- Get the unique values of a field.

- Get a subset of the columns.

- Write the results to a file as a table.

### Problem statement

Several tasks will be carried out on a data file of truck driver events: `truck_event_text_partition.csv`

### Loading the driver event data

In [None]:
#
# Download the file directly from the repo to the local disk.
#
url = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv"
!wget --quiet {url} -P /tmp/

#
# List the contents of the directory
#
!ls -1 /tmp/*

/tmp/dap_multiplexer.141a5334963b.root.log.INFO.20230323-032236.83
/tmp/dap_multiplexer.INFO
/tmp/debugger_gqfb0slli
/tmp/kernel_manager_proxy.141a5334963b.root.log.ERROR.20230323-042504.34
/tmp/kernel_manager_proxy.141a5334963b.root.log.INFO.20230323-032232.34
/tmp/kernel_manager_proxy.ERROR
/tmp/kernel_manager_proxy.INFO
/tmp/truck_event_text_partition.csv

/tmp/initgoogle_syslog_dir.0:
unknown

/tmp/pyright-290-Hr2d2xjbDCyF:

/tmp/pyright-290-kcBaHBZyO1wk:

/tmp/python-languageserver-cancellation:
7183bd7ce2226ed0e106e6db0f599f70ab8d7232b2


In [None]:
with open("/tmp/truck_event_text_partition.csv", "r") as file:
    truck_events = file.readlines()

#
# Number of rows in the file, including the header.
#
print(len(truck_events))

#
# Type of the resulting object
#
display(type(truck_events))

17076


list

In [None]:
#
# Preview of the contents
#
truck_events[0:2]

['driverId,truckId,eventTime,eventType,longitude,latitude,eventKey,CorrelationId,driverName,routeId,routeName,eventDate\n',
 '14,25,59:21.4,Normal,-94.58,37.03,14|25|9223370572464814373,3.66E+18,Adis Cesir,160405074,Joplin to Kansas City Route 2,2016-05-27-22\n']

In [None]:
#
# Remove "\n" (line breaks)
#
truck_events = [line.replace("\n", "") for line in truck_events]

In [None]:
#
# Convert the strings (rows) to lists
#
truck_events = [line.split(",") for line in truck_events]

In [None]:
#
# Display the column names
#
truck_events[0]

['driverId',
 'truckId',
 'eventTime',
 'eventType',
 'longitude',
 'latitude',
 'eventKey',
 'CorrelationId',
 'driverName',
 'routeId',
 'routeName',
 'eventDate']
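
The manual `replace` and `split(",")` steps above work because no field in this file contains a comma. As a hedged aside, the standard-library `csv` module also handles quoted fields that do contain one. The sample text below is made up for illustration and is not taken from the dataset.

```python
import csv
import io

# Hypothetical sample: the last row has a quoted field containing a comma.
sample_text = 'driverId,driverName\n14,Adis Cesir\n99,"Doe, Jane"\n'

# A naive split breaks the quoted field into two pieces.
naive_rows = [line.split(",") for line in sample_text.splitlines()]

# csv.reader respects the quotes and also removes the line endings.
csv_rows = list(csv.reader(io.StringIO(sample_text)))

print(naive_rows[2])  # three fields, quotes still attached
print(csv_rows[2])    # two fields
```

For a file on disk, the same reader can be applied to the object returned by `open(path, newline="")`.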

### Displaying the data

In [None]:
#
# Displaying the data as a list of lists is hard to read
#
truck_events[:2]

[['driverId',
  'truckId',
  'eventTime',
  'eventType',
  'longitude',
  'latitude',
  'eventKey',
  'CorrelationId',
  'driverName',
  'routeId',
  'routeName',
  'eventDate'],
 ['14',
  '25',
  '59:21.4',
  'Normal',
  '-94.58',
  '37.03',
  '14|25|9223370572464814373',
  '3.66E+18',
  'Adis Cesir',
  '160405074',
  'Joplin to Kansas City Route 2',
  '2016-05-27-22']]

A `pprint(data)` function is defined, with two nested helper functions, to display the data as a properly formatted table.

1. `get_format_string(data)` computes the appropriate width for each column by taking the maximum length of the elements in that column.
2. `print_data(format_string, data)` prints the data formatted as a table.

In [None]:
def pprint(data: list):

    """
    Prints the list of lists (data) as a right-aligned table.
    """

    def get_format_string(data: list) -> list:

        """
        Computes the width of each column as the maximum length
        of the elements in that column.
        """

        format_string = []
        for i_col in range(len(data[0])):
            lengths = [len(str(row[i_col])) for row in data]
            max_length = max(lengths)
            #
            # For example: "{:>10s}"
            #
            format_string.append("{:>" + str(max_length) + "s}")
        return format_string

    def print_data(format_string: list, data: list):

        """
        Prints the list of lists formatted as a table.
        """

        for index, row in enumerate(data):

            # Each data row is prefixed with its row number (starting
            # at 0); the first row (the header) gets blank padding
            # instead.
            text = "    " if index == 0 else "{:2d}  ".format(index - 1)

            # zip pairs each column format with the corresponding
            # element of the row.
            for fmt, value in zip(format_string, row):
                # Append the value, formatted to the column width,
                # to the text of the row.
                text += fmt.format(str(value)) + " "

            # If the row text is 106 characters or longer, it is cut
            # to 100 characters and " [...]" is appended to show that
            # it was shortened.
            if len(text) >= 106:
                text = text[:100] + " [...]"

            # Print the row
            print(text)

    format_string = get_format_string(data)
    print_data(format_string, data)

In [None]:
pprint(truck_events[:5])

    driverId truckId eventTime eventType longitude latitude                   eventKey CorrelationId [...]
 0        14      25   59:21.4    Normal    -94.58    37.03  14|25|9223370572464814373      3.66E+18 [...]
 1        18      16   59:21.7    Normal    -89.66    39.78  18|16|9223370572464814089      3.66E+18 [...]
 2        27     105   59:21.7    Normal    -90.21    38.65 27|105|9223370572464814070      3.66E+18 [...]
 3        11      74   59:21.7    Normal     -90.2    38.65  11|74|9223370572464814123      3.66E+18 [...]


### Reordering the columns

The `driverName` column (index 8) is moved to index 1, and the columns in between shift one position to the right.

This is done by building each row of the new list with a list comprehension that places element 8 of the row at position 1.

In [None]:
truck_events = [
    [row[0], row[8]] + [row[index] for index in range(1, 12) if index != 8]
    for row in truck_events
]
pprint(truck_events[:10])

    driverId      driverName truckId eventTime eventType longitude latitude                   eventK [...]
 0        14      Adis Cesir      25   59:21.4    Normal    -94.58    37.03  14|25|92233705724648143 [...]
 1        18       Grant Liu      16   59:21.7    Normal    -89.66    39.78  18|16|92233705724648140 [...]
 2        27 Mark Lochbihler     105   59:21.7    Normal    -90.21    38.65 27|105|92233705724648140 [...]
 3        11  Jamie Engesser      74   59:21.7    Normal     -90.2    38.65  11|74|92233705724648141 [...]
 4        22   Nadeem Asghar      87   59:21.7    Normal    -90.04    35.19  22|87|92233705724648141 [...]
 5        22   Nadeem Asghar      87   59:22.3    Normal    -90.37    35.21  22|87|92233705724648134 [...]
 6        23       Adam Diaz      68   59:22.4    Normal    -89.91    40.86  23|68|92233705724648134 [...]
 7        11  Jamie Engesser      74   59:22.5    Normal    -89.74     39.1  11|74|92233705724648133 [...]
 8        20    Chris Harris      41 
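
The reordering above hard-codes the indexes 8 and 1. A minimal sketch of the same idea as a reusable function follows; `move_column` and the toy row are hypothetical names introduced only for illustration.

```python
def move_column(row: list, src: int, dst: int) -> list:
    """Return a copy of row with the element at index src moved to index dst."""
    reordered = row[:src] + row[src + 1:]  # copy of the row without the element at src
    reordered.insert(dst, row[src])        # put that element back at position dst
    return reordered


# Toy row with 12 fields named after their original positions.
toy_row = ["c" + str(i) for i in range(12)]
moved = move_column(toy_row, src=8, dst=1)
print(moved)
```

Applied to every row with `[move_column(row, 8, 1) for row in table]`, this gives the same column order as the list comprehension in the tutorial.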

### Getting a subset of records

The first nine records (the slice `[0:10]` also includes the header row).

In [None]:
truck_events_subset = truck_events[0:10]
pprint(truck_events_subset)

    driverId      driverName truckId eventTime eventType longitude latitude                   eventK [...]
 0        14      Adis Cesir      25   59:21.4    Normal    -94.58    37.03  14|25|92233705724648143 [...]
 1        18       Grant Liu      16   59:21.7    Normal    -89.66    39.78  18|16|92233705724648140 [...]
 2        27 Mark Lochbihler     105   59:21.7    Normal    -90.21    38.65 27|105|92233705724648140 [...]
 3        11  Jamie Engesser      74   59:21.7    Normal     -90.2    38.65  11|74|92233705724648141 [...]
 4        22   Nadeem Asghar      87   59:21.7    Normal    -90.04    35.19  22|87|92233705724648141 [...]
 5        22   Nadeem Asghar      87   59:22.3    Normal    -90.37    35.21  22|87|92233705724648134 [...]
 6        23       Adam Diaz      68   59:22.4    Normal    -89.91    40.86  23|68|92233705724648134 [...]
 7        11  Jamie Engesser      74   59:22.5    Normal    -89.74     39.1  11|74|92233705724648133 [...]
 8        20    Chris Harris      41 

### Filtering records

Get the records whose `driverId` equals `14`.

In [None]:
# truck_events[0] is the first row, which holds the header

truck_events_driverId_14 = [truck_events[0]] + [
    row for row in truck_events if row[0] == "14"
]
pprint(truck_events_driverId_14[:10])

    driverId driverName truckId eventTime eventType longitude latitude                  eventKey Cor [...]
 0        14 Adis Cesir      25   59:21.4    Normal    -94.58    37.03 14|25|9223370572464814373     [...]
 1        14 Adis Cesir      25   59:23.3    Normal    -94.31    37.31 14|25|9223370572464812526     [...]
 2        14 Adis Cesir      25   59:24.2    Normal     -94.3    37.66 14|25|9223370572464811655     [...]
 3        14 Adis Cesir      25   59:34.0    Normal     -94.3    37.66 14|25|9223370572464801796     [...]
 4        14 Adis Cesir      25   59:35.8    Normal    -94.46    37.16 14|25|9223370572464800006     [...]
 5        14 Adis Cesir      25   59:53.3    Normal    -94.58    37.03 14|25|9223370572464782555     [...]
 6        14 Adis Cesir      25   59:54.0    Normal    -94.46    37.16 14|25|9223370572464781805     [...]
 7        14 Adis Cesir      25   59:57.5    Normal    -94.35    38.33 14|25|9223370572464778335     [...]
 8        14 Adis Cesir      25   59:
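
The filter above relies on knowing that `driverId` is at index 0. As a sketch, the index can instead be looked up by name in the header row; `filter_rows` and the toy table are hypothetical and only illustrate the idea.

```python
def filter_rows(table: list, column: str, value: str) -> list:
    """Keep the header plus the rows whose `column` equals `value`."""
    header, *records = table
    i_col = header.index(column)  # raises ValueError if the column is missing
    return [header] + [row for row in records if row[i_col] == value]


toy_table = [
    ["driverId", "driverName", "truckId"],
    ["14", "Adis Cesir", "25"],
    ["18", "Grant Liu", "16"],
    ["14", "Adis Cesir", "25"],
]
filtered = filter_rows(toy_table, "driverId", "14")
print(filtered)
```

Because the values are still strings at this point, the comparison value is passed as `"14"`, not `14`.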

### Searching records using regular expressions

In [None]:
import re

In [None]:
name_finished_in_z = [truck_events[0]] + [
    row for row in truck_events if re.search("z$", row[1])
]
pprint(name_finished_in_z[:10])

    driverId driverName truckId eventTime eventType longitude latitude                  eventKey Cor [...]
 0        23  Adam Diaz      68   59:22.4    Normal    -89.91    40.86 23|68|9223370572464813450     [...]
 1        23  Adam Diaz      68   59:26.6    Normal    -91.32    41.71 23|68|9223370572464809216     [...]
 2        23  Adam Diaz      68   59:27.4    Normal    -91.47    41.74 23|68|9223370572464808375     [...]
 3        23  Adam Diaz      68   59:28.4    Normal    -91.63    41.72 23|68|9223370572464807444     [...]
 4        23  Adam Diaz      68   59:29.9    Normal    -91.78    42.23 23|68|9223370572464805905     [...]
 5        23  Adam Diaz      68   59:30.8    Normal    -91.63    41.72 23|68|9223370572464804995     [...]
 6        23  Adam Diaz      68   59:37.0    Normal    -89.91    40.86 23|68|9223370572464798855     [...]
 7        23  Adam Diaz      68   59:39.6    Normal    -89.91    40.86 23|68|9223370572464796242     [...]
 8        23  Adam Diaz      68   59:

In [None]:
name_begins_with_Ma = [truck_events[0]] + [
    row for row in truck_events if re.search("^Ma", row[1])
]
pprint(name_begins_with_Ma[:10])

    driverId      driverName truckId eventTime eventType longitude latitude                   eventK [...]
 0        27 Mark Lochbihler     105   59:21.7    Normal    -90.21    38.65 27|105|92233705724648140 [...]
 1        27 Mark Lochbihler     105   59:22.6    Normal    -90.41    38.75 27|105|92233705724648132 [...]
 2        27 Mark Lochbihler     105   59:25.9    Normal    -90.93    38.82 27|105|92233705724648098 [...]
 3        27 Mark Lochbihler     105   59:27.7    Normal    -91.19    38.83 27|105|92233705724648081 [...]
 4        27 Mark Lochbihler     105   59:29.3    Normal    -91.56    38.93 27|105|92233705724648065 [...]
 5        27 Mark Lochbihler     105   59:35.6    Normal    -92.85    38.93 27|105|92233705724648001 [...]
 6        27 Mark Lochbihler     105   59:50.9    Normal     -93.2    38.98 27|105|92233705724647849 [...]
 7        27 Mark Lochbihler     105   59:51.8    Normal    -93.01    38.97 27|105|92233705724647839 [...]
 8        27 Mark Lochbihler     105 

In [None]:
name_contains_ch = [truck_events[0]] + [
    row for row in truck_events if re.search("ch", row[1])
]
pprint(name_contains_ch[:10])

    driverId      driverName truckId eventTime eventType longitude latitude                   eventK [...]
 0        27 Mark Lochbihler     105   59:21.7    Normal    -90.21    38.65 27|105|92233705724648140 [...]
 1        27 Mark Lochbihler     105   59:22.6    Normal    -90.41    38.75 27|105|92233705724648132 [...]
 2        16      Tom McCuch      12   59:23.4    Normal    -90.29    40.96  16|12|92233705724648123 [...]
 3        26    Michael Aube      57   59:25.2    Normal    -90.86    38.46  26|57|92233705724648106 [...]
 4        16      Tom McCuch      12   59:25.3    Normal     -90.7    41.62  16|12|92233705724648105 [...]
 5        27 Mark Lochbihler     105   59:25.9    Normal    -90.93    38.82 27|105|92233705724648098 [...]
 6        26    Michael Aube      57   59:27.0    Normal    -91.18    38.22  26|57|92233705724648087 [...]
 7        27 Mark Lochbihler     105   59:27.7    Normal    -91.19    38.83 27|105|92233705724648081 [...]
 8        16      Tom McCuch      12 

In [None]:
name_not_contains_i = [truck_events[0]] + [
    row for row in truck_events if not re.search("i", row[1])
]
pprint(name_not_contains_i[:10])

    driverId     driverName truckId eventTime eventType longitude latitude                   eventKe [...]
 0        22  Nadeem Asghar      87   59:21.7    Normal    -90.04    35.19  22|87|922337057246481410 [...]
 1        22  Nadeem Asghar      87   59:22.3    Normal    -90.37    35.21  22|87|922337057246481348 [...]
 2        32 Ryan Templeton      42   59:22.5    Normal    -90.37    35.21  32|42|922337057246481329 [...]
 3        16     Tom McCuch      12   59:23.4    Normal    -90.29    40.96  16|12|922337057246481239 [...]
 4        22  Nadeem Asghar      87   59:24.2    Normal    -90.94    35.03  22|87|922337057246481165 [...]
 5        32 Ryan Templeton      42   59:24.2    Normal    -90.94    35.03  32|42|922337057246481159 [...]
 6        22  Nadeem Asghar      87   59:25.0    Normal    -91.14    34.96  22|87|922337057246481080 [...]
 7        16     Tom McCuch      12   59:25.3    Normal     -90.7    41.62  16|12|922337057246481053 [...]
 8        21   Jeff Markham     109  

In [None]:
# Names that do not begin with M or N
name_not_begins_with_MN = [truck_events[0]] + [
    row for row in truck_events[1:] if not re.search("^[MN]", row[1])
]
pprint(name_not_begins_with_MN[:10])

    driverId     driverName truckId eventTime eventType longitude latitude                  eventKey [...]
 0        14     Adis Cesir      25   59:21.4    Normal    -94.58    37.03 14|25|9223370572464814373 [...]
 1        18      Grant Liu      16   59:21.7    Normal    -89.66    39.78 18|16|9223370572464814089 [...]
 2        11 Jamie Engesser      74   59:21.7    Normal     -90.2    38.65 11|74|9223370572464814123 [...]
 3        23      Adam Diaz      68   59:22.4    Normal    -89.91    40.86 23|68|9223370572464813450 [...]
 4        11 Jamie Engesser      74   59:22.5    Normal    -89.74     39.1 11|74|9223370572464813355 [...]
 5        20   Chris Harris      41   59:22.5    Normal    -93.36    41.69 20|41|9223370572464813344 [...]
 6        32 Ryan Templeton      42   59:22.5    Normal    -90.37    35.21 32|42|9223370572464813296 [...]
 7        17    Eric Mizell      15   59:23.2    Normal    -90.55    38.81 17|15|9223370572464812585 [...]
 8        14     Adis Cesir      25  
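
The patterns used in this section can be tried on a short list of names to see what each one matches. This is a small sketch; the `re.IGNORECASE` variant is not used in the tutorial and is shown only as an optional variation.

```python
import re

names = ["Adam Diaz", "Mark Lochbihler", "Tom McCuch", "Nadeem Asghar", "Chris Harris"]

# "$" anchors the match at the end of the string.
ends_in_z = [n for n in names if re.search("z$", n)]
# "^" anchors the match at the beginning of the string.
begins_with_ma = [n for n in names if re.search("^Ma", n)]
# Without anchors, the pattern may match anywhere; it is case sensitive.
contains_ch = [n for n in names if re.search("ch", n)]
# Optional variation: ignore case, so "Ch" in "Chris" also matches.
contains_ch_any_case = [n for n in names if re.search("ch", n, flags=re.IGNORECASE)]
# "[MN]" matches a single character that is either M or N.
not_begins_with_mn = [n for n in names if not re.search("^[MN]", n)]

print(ends_in_z)
print(begins_with_ma)
print(contains_ch)
print(contains_ch_any_case)
print(not_begins_with_mn)
```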

### Removing duplicate records

The records (rows) of the table are joined back into strings, separating the fields with `","`.

They are then placed in a `set`, which guarantees that no record is repeated. Note that a `set` does not preserve the original order of the rows.

Finally, they are split back into lists, as in the original table.

In [None]:
truck_events_as_strings = [",".join(row) for row in truck_events[1:]]
unique_truck_events = list(set(truck_events_as_strings))
unique_truck_events = [row.split(",") for row in unique_truck_events]
unique_truck_events = [truck_events[0]] + unique_truck_events
pprint(unique_truck_events[:10])

    driverId           driverName truckId eventTime eventType longitude latitude                  ev [...]
 0        29           Teddy Choi      68   37:09.7    Normal    -95.99    36.17 29|68|9223370572419 [...]
 1        18            Grant Liu      49   11:05.5    Normal    -94.42    39.27 18|49|9223370571956 [...]
 2        29           Teddy Choi      66   59:58.2    Normal     -95.3    35.64 29|66|9223370572464 [...]
 3        31        Rommel Garcia      86   36:11.6    Normal    -94.38    38.99 31|86|9223370572419 [...]
 4        25 Jean-Philippe Player      21   09:35.1    Normal    -89.56     37.2 25|21|9223370571956 [...]
 5        29           Teddy Choi      16   09:02.2    Normal    -95.56    35.97 29|16|9223370571956 [...]
 6        18            Grant Liu      11   36:30.2    Normal    -93.93    39.76 18|11|9223370572419 [...]
 7        26         Michael Aube      42   59:40.6    Normal    -90.53    38.51 26|42|9223370572126 [...]
 8        29           Teddy Choi    
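
If the original order of the rows matters, one possible alternative is to use tuples as dictionary keys, since dictionaries keep insertion order (Python 3.7+). This is a sketch on toy rows and not the approach used in the tutorial.

```python
toy_rows = [
    ["14", "Adis Cesir"],
    ["18", "Grant Liu"],
    ["14", "Adis Cesir"],
    ["27", "Mark Lochbihler"],
]

# Lists are not hashable, so each row is converted to a tuple.
# dict.fromkeys keeps only the first occurrence of each key, in order.
unique_rows = [list(key) for key in dict.fromkeys(tuple(row) for row in toy_rows)]

print(unique_rows)
```

This also avoids joining and re-splitting the fields on `","`.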

### Unique values per field

Note that a set comprehension (`{}`) guarantees that the elements are not repeated.

In [None]:
driverNames = sorted({row[1] for row in truck_events[1:]})
driverNames

['Adam Diaz',
 'Adis Cesir',
 'Ajay Singh',
 'Chris Harris',
 'Dan Rice',
 'Don Hilborn',
 'Eric Mizell',
 'George Vetticaden',
 'Grant Liu',
 'Jamie Engesser',
 'Jean-Philippe Player',
 'Jeff Markham',
 'Joe Niemiec',
 'Mark Lochbihler',
 'Michael Aube',
 'Nadeem Asghar',
 'Olivier Renault',
 'Paul Codding',
 'Rohit Bakshi',
 'Rommel Garcia',
 'Ryan Templeton',
 'Teddy Choi',
 'Tom McCuch']

### Sorting records by a field

Sort the records by column 1, `driverName`.

In [None]:
from operator import itemgetter

sorted_truck_events = [truck_events[0]] + sorted(
    [row for row in truck_events[1:]], key=itemgetter(1)
)
pprint(sorted_truck_events[:10])

    driverId driverName truckId eventTime eventType longitude latitude                  eventKey Cor [...]
 0        23  Adam Diaz      68   59:22.4    Normal    -89.91    40.86 23|68|9223370572464813450     [...]
 1        23  Adam Diaz      68   59:26.6    Normal    -91.32    41.71 23|68|9223370572464809216     [...]
 2        23  Adam Diaz      68   59:27.4    Normal    -91.47    41.74 23|68|9223370572464808375     [...]
 3        23  Adam Diaz      68   59:28.4    Normal    -91.63    41.72 23|68|9223370572464807444     [...]
 4        23  Adam Diaz      68   59:29.9    Normal    -91.78    42.23 23|68|9223370572464805905     [...]
 5        23  Adam Diaz      68   59:30.8    Normal    -91.63    41.72 23|68|9223370572464804995     [...]
 6        23  Adam Diaz      68   59:37.0    Normal    -89.91    40.86 23|68|9223370572464798855     [...]
 7        23  Adam Diaz      68   59:39.6    Normal    -89.91    40.86 23|68|9223370572464796242     [...]
 8        23  Adam Diaz      68   59:
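
Sorting by `driverName` works as expected because names are text. For a numeric column that is still stored as strings, `itemgetter` alone compares the values as text. The sketch below uses made-up rows to show the difference and a key function that converts to `int`.

```python
from operator import itemgetter

# Toy rows: the first field is a number stored as a string.
toy_rows = [["9", "x"], ["10", "y"], ["2", "z"]]

# Text comparison: "10" sorts before "2" because "1" < "2".
as_text = sorted(toy_rows, key=itemgetter(0))

# Numeric comparison: the key converts the field before comparing.
as_numbers = sorted(toy_rows, key=lambda row: int(row[0]))

print(as_text)
print(as_numbers)
```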

### Getting a subset of columns

Select the columns `driverId`, `eventTime`, and `eventType` from the variable `truck_events_subset`.

In [None]:
# Get the indexes of the desired columns
column_indexes = [
    i_col
    for i_col, colname in enumerate(truck_events_subset[0])
    if colname in ["driverId", "eventTime", "eventType"]
]

# Build the rows keeping only the elements at the indexes
# obtained above
specific_columns = [
    [col for i_col, col in enumerate(row) if i_col in column_indexes]
    for row in truck_events_subset
]

pprint(specific_columns)

    driverId eventTime eventType 
 0        14   59:21.4    Normal 
 1        18   59:21.7    Normal 
 2        27   59:21.7    Normal 
 3        11   59:21.7    Normal 
 4        22   59:21.7    Normal 
 5        22   59:22.3    Normal 
 6        23   59:22.4    Normal 
 7        11   59:22.5    Normal 
 8        20   59:22.5    Normal 
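
The selection above returns the columns in the order in which they appear in the table. As a sketch, looking up each requested name with `list.index` returns them in the requested order instead; `select_columns` and the toy table are hypothetical.

```python
def select_columns(table: list, columns: list) -> list:
    """Return the table restricted to `columns`, in the order requested."""
    header = table[0]
    indexes = [header.index(name) for name in columns]  # ValueError if a name is missing
    return [[row[i] for i in indexes] for row in table]


toy_table = [
    ["driverId", "driverName", "eventTime", "eventType"],
    ["14", "Adis Cesir", "59:21.4", "Normal"],
    ["18", "Grant Liu", "59:21.7", "Normal"],
]
selected = select_columns(toy_table, ["eventType", "driverId"])
print(selected)
```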


### Writing to disk as a table

In [None]:
specific_columns = [",".join(row) for row in specific_columns]
specific_columns = "\n".join(specific_columns)
specific_columns

'driverId,eventTime,eventType\n14,59:21.4,Normal\n18,59:21.7,Normal\n27,59:21.7,Normal\n11,59:21.7,Normal\n22,59:21.7,Normal\n22,59:22.3,Normal\n23,59:22.4,Normal\n11,59:22.5,Normal\n20,59:22.5,Normal'

In [None]:
with open("/tmp/specific_columns.csv", "w") as file:
    print(specific_columns, file=file)

!cat /tmp/specific_columns.csv

driverId,eventTime,eventType
14,59:21.4,Normal
18,59:21.7,Normal
27,59:21.7,Normal
11,59:21.7,Normal
22,59:21.7,Normal
22,59:22.3,Normal
23,59:22.4,Normal
11,59:22.5,Normal
20,59:22.5,Normal
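
As a hedged alternative to joining the fields by hand, the standard-library `csv` module can write a list of lists directly and quotes fields when needed. The output path below is hypothetical; a temporary directory is used so the sketch is self-contained.

```python
import csv
import os
import tempfile

rows = [
    ["driverId", "eventTime", "eventType"],
    ["14", "59:21.4", "Normal"],
    ["18", "59:21.7", "Normal"],
]

# Hypothetical output location inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "specific_columns_sketch.csv")

# newline="" lets the csv module control the line endings itself.
with open(out_path, "w", newline="") as file:
    csv.writer(file).writerows(rows)

# Read the file back to check that the table survives the round trip.
with open(out_path, newline="") as file:
    round_trip = list(csv.reader(file))

print(round_trip)
```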


In [None]:
#
# List the contents of the directory
#
!ls -1 /tmp/*

/tmp/dap_multiplexer.141a5334963b.root.log.INFO.20230323-032236.83
/tmp/dap_multiplexer.INFO
/tmp/debugger_gqfb0slli
/tmp/kernel_manager_proxy.141a5334963b.root.log.ERROR.20230323-042504.34
/tmp/kernel_manager_proxy.141a5334963b.root.log.INFO.20230323-032232.34
/tmp/kernel_manager_proxy.ERROR
/tmp/kernel_manager_proxy.INFO
/tmp/specific_columns.csv
/tmp/truck_event_text_partition.csv

/tmp/initgoogle_syslog_dir.0:
unknown

/tmp/pyright-26879-IXfyyoKzvDDQ:

/tmp/pyright-26879-p4eTY5t85Fyj:

/tmp/pyright-290-Hr2d2xjbDCyF:

/tmp/pyright-290-kcBaHBZyO1wk:

/tmp/python-languageserver-cancellation:
7183bd7ce2226ed0e106e6db0f599f70ab8d7232b2


## Basic data processing using Python

This tutorial explains how to carry out basic data processing using Python.

### Skills to develop

After completing this tutorial, you will be able to:

- Download data files from the internet and load them as a list.

- Use the `groupby` function from the `itertools` library.

- Join two datasets on a key field.

- Sort a dataset.

- Find the records that hold the maximum or minimum value of a field.

- Write the results to disk.

### Downloading the data

The tasks use the data files `drivers.csv` and `timesheet.csv`.

In [None]:
url_drivers = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/drivers.csv"
!wget --quiet {url_drivers} -P /tmp/

url_timesheet = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/timesheet.csv"
!wget --quiet {url_timesheet} -P /tmp/

!ls -1 /tmp/*

/tmp/dap_multiplexer.2a37856b40e0.root.log.INFO.20230323-172403.87
/tmp/dap_multiplexer.INFO
/tmp/debugger_1z6u2ig8iy
/tmp/drivers.csv
/tmp/kernel_manager_proxy.2a37856b40e0.root.log.INFO.20230323-172358.34
/tmp/kernel_manager_proxy.INFO
/tmp/timesheet.csv

/tmp/initgoogle_syslog_dir.0:
unknown

/tmp/python-languageserver-cancellation:
043d45c60ef0916e45aa33229b1b6f8c36d693afef


### Printing

In [None]:
def pprint(data: list):

    """
    Prints the list of lists (data) as a right-aligned table.
    """

    def get_format_string(data: list) -> list:

        """
        Computes the width of each column as the maximum length
        of the elements in that column.
        """

        format_string = []
        for i_col in range(len(data[0])):
            lengths = [len(str(row[i_col])) for row in data]
            max_length = max(lengths)
            #
            # For example: "{:>10s}"
            #
            format_string.append("{:>" + str(max_length) + "s}")
        return format_string

    def print_data(format_string: list, data: list):

        """
        Prints the list of lists formatted as a table.
        """

        for index, row in enumerate(data):

            # Each data row is prefixed with its row number (starting
            # at 0); the first row (the header) gets blank padding
            # instead.
            text = "    " if index == 0 else "{:2d}  ".format(index - 1)

            # zip pairs each column format with the corresponding
            # element of the row.
            for fmt, value in zip(format_string, row):
                # Append the value, formatted to the column width,
                # to the text of the row.
                text += fmt.format(str(value)) + " "

            # If the row text is 106 characters or longer, it is cut
            # to 100 characters and " [...]" is appended to show that
            # it was shortened.
            if len(text) >= 106:
                text = text[:100] + " [...]"

            # Print the row
            print(text)

    format_string = get_format_string(data)
    print_data(format_string, data)

### Creating the `drivers` table

In [None]:
with open("/tmp/drivers.csv", "r") as file:
    drivers = file.readlines()

drivers = [row.replace("\n", "") for row in drivers]
drivers = [row.split(",") for row in drivers]

# Keep only the first two columns
drivers = [row[:2] for row in drivers]

pprint(drivers[0:10])

    driverId              name 
 0        10 George Vetticaden 
 1        11    Jamie Engesser 
 2        12       Paul Coddin 
 3        13       Joe Niemiec 
 4        14        Adis Cesir 
 5        15      Rohit Bakshi 
 6        16        Tom McCuch 
 7        17       Eric Mizell 
 8        18         Grant Liu 


### Creating the `timesheet` table

In [None]:
with open("/tmp/timesheet.csv", "r") as file:
    timesheet = file.readlines()

timesheet = [row.replace("\n", "") for row in timesheet]
timesheet = [row.split(",") for row in timesheet]
pprint(timesheet[:10])

    driverId week hours-logged miles-logged 
 0        10    1           70         3300 
 1        10    2           70         3300 
 2        10    3           60         2800 
 3        10    4           70         3100 
 4        10    5           70         3200 
 5        10    6           70         3300 
 6        10    7           70         3000 
 7        10    8           70         3300 
 8        10    9           70         3200 


The values in the table are converted to integers.

In [None]:
#
# Cast the data to int (the header row is left as is)
#
timesheet = [
    [int(field) if i_row > 0 else field for field in row]
    for i_row, row in enumerate(timesheet)
]
pprint(timesheet[:10])

    driverId week hours-logged miles-logged 
 0        10    1           70         3300 
 1        10    2           70         3300 
 2        10    3           60         2800 
 3        10    4           70         3100 
 4        10    5           70         3200 
 5        10    6           70         3300 
 6        10    7           70         3000 
 7        10    8           70         3300 
 8        10    9           70         3200 


### Hours and miles logged by each driver per year

The following code shows a summary of the list of lists `timesheet`, grouping the rows by their first value, `driverId`, and showing the first 5 rows of each of the first 4 groups.

In [None]:
import itertools
from operator import itemgetter

for i_key, (key, group) in enumerate(
    itertools.groupby(
        timesheet[1:],
        itemgetter(0),
    )
):
    print(key)

    for i_grp, grp in enumerate(group):
        print("   ", grp)
        if i_grp > 3:
            print("    ...")
            break

    if i_key > 2:
        print("...")
        break

10
    [10, 1, 70, 3300]
    [10, 2, 70, 3300]
    [10, 3, 60, 2800]
    [10, 4, 70, 3100]
    [10, 5, 70, 3200]
    ...
11
    [11, 1, 50, 3000]
    [11, 2, 83, 4000]
    [11, 3, 80, 4000]
    [11, 4, 85, 4000]
    [11, 5, 90, 4100]
    ...
12
    [12, 1, 49, 2783]
    [12, 2, 50, 2505]
    [12, 3, 51, 2577]
    [12, 4, 54, 2743]
    [12, 5, 47, 2791]
    ...
13
    [13, 1, 49, 2643]
    [13, 2, 56, 2553]
    [13, 3, 60, 2539]
    [13, 4, 55, 2553]
    [13, 5, 45, 2762]
    ...
...
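
`itertools.groupby` only groups consecutive rows with the same key. It works here because the rows of `timesheet` already appear ordered by `driverId`. The sketch below uses toy rows to show what happens otherwise, and how sorting by the same key first fixes it.

```python
import itertools
from operator import itemgetter

# Toy rows (driverId, week, hours); driver 10 appears in two separate runs.
unsorted_rows = [[10, 1, 70], [11, 1, 50], [10, 2, 60]]

# Without sorting, the key 10 is reported twice.
keys_unsorted = [
    key for key, _ in itertools.groupby(unsorted_rows, key=itemgetter(0))
]

# Sorting by the grouping key first yields one group per driver.
sorted_rows = sorted(unsorted_rows, key=itemgetter(0))
keys_sorted = [
    key for key, _ in itertools.groupby(sorted_rows, key=itemgetter(0))
]

print(keys_unsorted)
print(keys_sorted)
```

When the groups are stored in a dictionary, a repeated key would silently overwrite the earlier group, so sorting first is the safer habit.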


In [None]:
# Build a dictionary keyed by driverId whose values are the lists of that driver's records
timesheet_grouped_by_driverId = {
    driverId: list(group)
    for driverId, group in itertools.groupby(
        timesheet[1:],
        key=itemgetter(0),
    )
}

A table is created with the total hours and miles logged by each driver.

In [None]:
sum_timesheet = [
    [
        driverId,
        sum([row[2] for row in timesheet_grouped_by_driverId[driverId]]),
        sum([row[3] for row in timesheet_grouped_by_driverId[driverId]]),
    ]
    for driverId in timesheet_grouped_by_driverId.keys()
]

sum_timesheet = [["driverId", "hours-logged", "miles-logged"]] + sum_timesheet

pprint(sum_timesheet)

    driverId hours-logged miles-logged 
 0        10         3232       147150 
 1        11         3642       179300 
 2        12         2639       135962 
 3        13         2727       134126 
 4        14         2781       136624 
 5        15         2734       138750 
 6        16         2746       137205 
 7        17         2701       135992 
 8        18         2654       137834 
 9        19         2738       137968 
10        20         2644       134564 
11        21         2751       138719 
12        22         2733       137550 
13        23         2750       137980 
14        24         2647       134461 
15        25         2723       139180 
16        26         2730       137530 
17        27         2771       137922 
18        28         2723       137469 
19        29         2760       138255 
20        30         2773       137473 
21        31         2704       137057 
22        32         2736       137422 
23        33         2759       139285 


### Joining the tables

A `join` is performed on `driverId`.

In [None]:
summary = [
    row_drivers + row_timesheet[1:]
    for row_drivers in drivers[1:]
    for row_timesheet in sum_timesheet[1:]
    if row_drivers[0] == str(row_timesheet[0])
]

summary = [["driverId", "name", "hours-logged", "miles-logged"]] + summary
pprint(summary)

    driverId                name hours-logged miles-logged 
 0        10   George Vetticaden         3232       147150 
 1        11      Jamie Engesser         3642       179300 
 2        12         Paul Coddin         2639       135962 
 3        13         Joe Niemiec         2727       134126 
 4        14          Adis Cesir         2781       136624 
 5        15        Rohit Bakshi         2734       138750 
 6        16          Tom McCuch         2746       137205 
 7        17         Eric Mizell         2701       135992 
 8        18           Grant Liu         2654       137834 
 9        19          Ajay Singh         2738       137968 
10        20        Chris Harris         2644       134564 
11        21        Jeff Markham         2751       138719 
12        22       Nadeem Asghar         2733       137550 
13        23           Adam Diaz         2750       137980 
14        24         Don Hilborn         2647       134461 
15        25 Jean-Philippe Playe        
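
The join above compares every driver with every row of totals. For larger tables, one common alternative is to index one table by its key in a dictionary and look each driver up directly. The sketch below uses toy tables; the driver `99` is made up to show that rows without a match are dropped, as in the tutorial's join.

```python
toy_drivers = [
    ["driverId", "name"],
    ["10", "George Vetticaden"],
    ["11", "Jamie Engesser"],
    ["99", "No Timesheet"],  # hypothetical driver without totals
]
toy_totals = [
    ["driverId", "hours-logged", "miles-logged"],
    [10, 3232, 147150],
    [11, 3642, 179300],
]

# Index the totals by driverId; the key is cast to str to match toy_drivers.
totals_by_id = {str(row[0]): row[1:] for row in toy_totals[1:]}

# One dictionary lookup per driver instead of a nested loop.
joined = [
    row + totals_by_id[row[0]]
    for row in toy_drivers[1:]
    if row[0] in totals_by_id
]

print(joined)
```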

### Sorting the table

The records are sorted by `hours-logged` in ascending order.

In [None]:
from operator import itemgetter

sorted_summary = [summary[0]] + sorted([row for row in summary[1:]], key=itemgetter(2))
pprint(sorted_summary)

    driverId                name hours-logged miles-logged 
 0        12         Paul Coddin         2639       135962 
 1        20        Chris Harris         2644       134564 
 2        24         Don Hilborn         2647       134461 
 3        18           Grant Liu         2654       137834 
 4        37           Wes Floyd         2694       137223 
 5        42     Randy Gelhausen         2697       136673 
 6        40    Nicolas Maillard         2700       136931 
 7        17         Eric Mizell         2701       135992 
 8        31       Rommel Garcia         2704       137057 
 9        25 Jean-Philippe Playe         2723       139180 
10        28     Olivier Renault         2723       137469 
11        41       Greg Phillips         2723       138407 
12        13         Joe Niemiec         2727       134126 
13        35         Emil Siemes         2728       138727 
14        26        Michael Aube         2730       137530 
15        22       Nadeem Asghar        

### Finding the maximum or the minimum

Find the record with the maximum value of `hours-logged`.

In [None]:
[
    row
    for row in sorted_summary[1:]
    if row[2] == max(aux_row[2] for aux_row in sorted_summary[1:])
]

[['11', 'Jamie Engesser', 3642, 179300]]
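
The comprehension above recomputes the maximum for every row and returns all rows tied for it. As a sketch on a toy table, `max` and `min` with a `key` return a single record directly, and computing the maximum once keeps the tie-aware version cheap.

```python
from operator import itemgetter

toy_summary = [
    ["driverId", "name", "hours-logged", "miles-logged"],
    ["11", "Jamie Engesser", 3642, 179300],
    ["12", "Paul Coddin", 2639, 135962],
    ["20", "Chris Harris", 2644, 134564],
]
records = toy_summary[1:]

# A single record with the largest / smallest hours-logged (first one on ties).
top = max(records, key=itemgetter(2))
bottom = min(records, key=itemgetter(2))

# All records tied for the maximum, computing the maximum only once.
max_hours = top[2]
all_top = [row for row in records if row[2] == max_hours]

print(top)
print(bottom)
print(all_top)
```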

### Storing the results

In [None]:
summary = [[str(field) for field in row] for row in summary]
summary = [",".join(row) for row in summary]
summary = "\n".join(summary)

with open("/tmp/summary.csv", "w") as file:
    print(summary, file=file)

!cat /tmp/summary.csv

driverId,name,hours-logged,miles-logged
10,George Vetticaden,3232,147150
11,Jamie Engesser,3642,179300
12,Paul Coddin,2639,135962
13,Joe Niemiec,2727,134126
14,Adis Cesir,2781,136624
15,Rohit Bakshi,2734,138750
16,Tom McCuch,2746,137205
17,Eric Mizell,2701,135992
18,Grant Liu,2654,137834
19,Ajay Singh,2738,137968
20,Chris Harris,2644,134564
21,Jeff Markham,2751,138719
22,Nadeem Asghar,2733,137550
23,Adam Diaz,2750,137980
24,Don Hilborn,2647,134461
25,Jean-Philippe Playe,2723,139180
26,Michael Aube,2730,137530
27,Mark Lochbihler,2771,137922
28,Olivier Renault,2723,137469
29,Teddy Choi,2760,138255
30,Dan Rice,2773,137473
31,Rommel Garcia,2704,137057
32,Ryan Templeton,2736,137422
33,Sridhara Sabbella,2759,139285
34,Frank Romano,2811,137728
35,Emil Siemes,2728,138727
36,Andrew Grande,2795,138025
37,Wes Floyd,2694,137223
38,Scott Shaw,2760,137464
39,David Kaiser,2745,138788
40,Nicolas Maillard,2700,136931
41,Greg Phillips,2723,138407
42,Randy Gelhausen,2697,136673
43,Dave Patton,2750,136993