# Modelling

## Introduction

In this notebook, we tackle the complex problem of optimizing class schedules for a university setting. Efficient scheduling is crucial for enhancing learning experiences, improving resource utilization, and balancing faculty workloads. 

However, creating optimal schedules that accommodate the availability of classrooms, student and faculty preferences, and curricular requirements poses significant challenges.

## Setup

The following code snippet lists all files under the input directory. This step ensures that we are aware of all available datasets, allowing us to import necessary files for further analysis. In our case, we are particularly interested in:

- `calendario_grupos_merged.csv`: Contains data related to group schedules.
- `ubicaciones_cleaned.csv`: Contains cleaned data on locations or room assignments.

Once these files are loaded, we inspect their contents to understand the data structure, check for missing values, and identify the key fields needed for the schedule optimization model.

In [37]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

input_path = '/kaggle/input/uab-the-hack-2024/'
df_cg = pd.read_csv(input_path+'calendario_grupos_merged.csv')
df_u = pd.read_csv(input_path+'ubicaciones_cleaned.csv')

df_cg.dtypes

/kaggle/input/uab-the-hack-2024/grupos_cleaned.csv
/kaggle/input/uab-the-hack-2024/caracteristicas.csv
/kaggle/input/uab-the-hack-2024/ubicaciones_cleaned.csv
/kaggle/input/uab-the-hack-2024/calendario_grupos_merged.csv
/kaggle/input/uab-the-hack-2024/franjas_cuarto_hora.csv
/kaggle/input/uab-the-hack-2024/calendario_grupos_clean.csv
/kaggle/input/uab-the-hack-2024/franjas_media_hora.csv
/kaggle/input/uab-the-hack-2024/grupos.csv
/kaggle/input/uab-the-hack-2024/merged_caracteristicas_recursos.csv
/kaggle/input/uab-the-hack-2024/not_merged_caracteristicas_recursos.csv
/kaggle/input/uab-the-hack-2024/recursos_caracteristicas.csv
/kaggle/input/uab-the-hack-2024/calendario_grupos.csv
/kaggle/input/uab-the-hack-2024/ubicaciones.csv
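
The cell above only shows the column types. As a minimal sketch of the missing-value check mentioned earlier, the snippet below counts missing values per column on a small toy frame. `toy_cg` and its values are hypothetical stand-ins for `df_cg`, not data from the competition files.

```python
import pandas as pd

# Toy stand-in for df_cg (hypothetical values); the real frame is read from
# calendario_grupos_merged.csv.
toy_cg = pd.DataFrame({
    "ID_GRUPO": ["g1", "g2", "g3"],
    "ID_FECHA_GRUPO": ["2025-03-10", None, "2025-03-14"],
    "IND_ALUMNOS_GRUPO_REAL": [166, 114, 5],
})

print(toy_cg.shape)                       # (rows, columns)
missing_per_column = toy_cg.isna().sum()  # missing values in each column
print(missing_per_column)
```

The same two calls can be run on `df_cg` and `df_u` directly.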


## Data preparation

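Sort the `df_cg` DataFrame by `ID_FECHA_GRUPO` (date), `IND_ALUMNOS_GRUPO_REAL` (number of students), and `ID_HORA_INICIO` (start time), so that earlier dates come first, larger groups come before smaller ones within a date, and earlier start times break ties. This order matters later: rooms are assigned to groups in row order, so on each date the larger groups pick a room first.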

In [39]:
df_cg.sort_values(by=['ID_FECHA_GRUPO', 'IND_ALUMNOS_GRUPO_REAL', 'ID_HORA_INICIO'], ascending=[True, False, True], inplace=True)
df_cg

Unnamed: 0,ID_GRUPO,ID_FECHA_GRUPO,ID_HORA_INICIO,ID_HORA_FIN,ID_CURSO_ACADEMICO,ID_ASIGNATURA,ID_TIPO_DOCENCIA,ID_COD_GRUPO,ID_PERIODO_DOCENTE,IND_ALUMNOS_GRUPO_PREV,IND_ALUMNOS_GRUPO_REAL,IND_HORAS_PREVISTAS
9110,2024-0-115-102689-1-31,2024-12-09,1130,1330,2024,102689,1,31,1,108,147,26.0
9123,2024-0-115-103519-1-320,2024-12-09,1130,1330,2024,103519,1,320,1,55,74,26.0
9114,2024-0-115-104339-1-81,2024-12-09,1130,1330,2024,104339,1,81,1,61,69,39.0
9111,2024-0-115-104544-1-61,2024-12-09,1500,1700,2024,104544,1,61,1,50,48,26.0
9119,2024-0-115-102787-54-441,2024-12-09,1030,1130,2024,102787,54,441,1,43,38,12.0
...,...,...,...,...,...,...,...,...,...,...,...,...
4,2024-0-115-102708-54-311,2025-06-27,930,1030,2024,102708,54,311,1,45,50,6.0
3,2024-0-115-102764-54-472,2025-06-27,1700,1900,2024,102764,54,472,1,37,44,50.0
2,2024-0-115-104554-54-1,2025-07-03,1700,1900,2024,104554,54,1,1,23,21,12.0
0,2024-0-115-102708-54-311,2025-07-04,930,1030,2024,102708,54,311,1,45,50,6.0


Filter `df_cg` to keep only the records of a single week, from March 10, 2025, to March 16, 2025 (both dates included). Working on this subset lets us focus the analysis and optimization on one specific week.

In [40]:
start_date = '2025-03-10'
end_date = '2025-03-16'
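# .copy() makes the filtered week an independent DataFrame, so columns can be added to it safely later on
df_cg = df_cg[(df_cg['ID_FECHA_GRUPO'] >= start_date) & (df_cg['ID_FECHA_GRUPO'] <= end_date)].copy()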
df_cg

Unnamed: 0,ID_GRUPO,ID_FECHA_GRUPO,ID_HORA_INICIO,ID_HORA_FIN,ID_CURSO_ACADEMICO,ID_ASIGNATURA,ID_TIPO_DOCENCIA,ID_COD_GRUPO,ID_PERIODO_DOCENTE,IND_ALUMNOS_GRUPO_PREV,IND_ALUMNOS_GRUPO_REAL,IND_HORAS_PREVISTAS
5773,2024-0-115-102714-1-31,2025-03-10,1130,1330,2024,102714,1,31,1,158,166,60.0
5708,2024-0-115-102690-1-31,2025-03-10,830,1030,2024,102690,1,31,1,96,114,30.0
5797,2024-0-115-106050-1-21,2025-03-10,1030,1130,2024,106050,1,21,0,108,101,45.0
5794,2024-0-115-102757-1-450,2025-03-10,1130,1330,2024,102757,1,450,1,94,100,26.0
5761,2024-0-115-106048-1-21,2025-03-10,1330,1430,2024,106048,1,21,0,66,88,30.0
...,...,...,...,...,...,...,...,...,...,...,...,...
5302,2024-0-115-104368-54-81,2025-03-14,1330,1530,2024,104368,54,81,1,0,15,49.5
5303,2024-0-115-104548-1-61,2025-03-14,1500,1700,2024,104548,1,61,1,14,13,26.0
5324,2024-0-115-104548-54-611,2025-03-14,1700,1800,2024,104548,54,611,1,14,13,12.0
5381,2024-0-115-106057-54-213,2025-03-14,830,1030,2024,106057,54,213,1,21,5,60.0
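
The filter above compares date strings. That works as long as `ID_FECHA_GRUPO` is stored as `YYYY-MM-DD` text, because that format sorts the same way alphabetically and chronologically. A hedged alternative, shown on hypothetical toy data, parses the column first and filters on real timestamps:

```python
import pandas as pd

# Toy frame (hypothetical rows) with dates just outside and inside the week.
toy = pd.DataFrame({
    "ID_GRUPO": ["a", "b", "c", "d"],
    "ID_FECHA_GRUPO": ["2025-03-09", "2025-03-10", "2025-03-16", "2025-03-17"],
})

# Parse the strings once, then keep rows whose date falls in the closed range.
dates = pd.to_datetime(toy["ID_FECHA_GRUPO"], format="%Y-%m-%d")
in_week = dates.between(pd.Timestamp("2025-03-10"), pd.Timestamp("2025-03-16"))
week = toy[in_week].copy()

print(week["ID_GRUPO"].tolist())  # ['b', 'c']
```

`Series.between` includes both endpoints by default, which matches the `>=` / `<=` comparison used in the notebook.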


Sort the `df_u` DataFrame by the `CAPACIDAD` (capacity) column in descending order to see the largest rooms first. The sorted result is only displayed: it is not assigned back, so `df_u` keeps its original row order in the steps that follow.

In [41]:
df_u.sort_values(by=['CAPACIDAD'], ascending=[False])

Unnamed: 0,ID_UBICACIO,DS_UBICACIO,ID_EDIFICI,CAPACIDAD
460,B1/0042,aula 15,B,202
619,B2/-141,aula magna,B,196
517,B5/-118,aula p15,B,176
512,B1/-1046,aula 5,B,154
457,B2/015,aula 10,B,146
...,...,...,...,...
195,C5/258,laboratori (ieec),C,12
374,C2/426,laboratori de receca,C,10
284,C2/125,laboratori de receca,C,10
274,C5/339.1,laboratori,C,10


## Modelling

In this section, we initialize an empty list column `y` in the `df_cg` DataFrame. The goal is to populate this column with potential room IDs that can accommodate each group based on their student count.

1. **Initialize the 'y' Column**: We start by creating an empty list in the `y` column for each row in `df_cg`.

2. **Iterate Over Groups**: For each group in `df_cg`, we retrieve the number of students (`IND_ALUMNOS_GRUPO_REAL`) and initialize an empty list, `l`, to store suitable room IDs.

3. **Room Capacity Check**: We then iterate over each room in `df_u`. If the room's capacity (`CAPACIDAD`) is equal to or greater than the group size, we add the room's ID (`ID_UBICACIO`) to the list `l`.

4. **Assign Available Rooms**: Finally, we assign the list `l` (containing potential rooms) to the `y` column for the corresponding group in `df_cg`.

This approach enables us to track the rooms that each group could use based on capacity, setting up a foundation for further scheduling optimization.

In [43]:
# Initialize an empty list column 'y' in df_cg
df_cg['y'] = [[] for _ in range(len(df_cg))]

# Iterate over rows in df_cg
for idx_cg, cg in df_cg.iterrows():
    n_al = cg['IND_ALUMNOS_GRUPO_REAL']
    l = []
    
    # Check each row in df_u to see if it can accommodate the group
    for _, u in df_u.iterrows():
        if n_al <= u['CAPACIDAD']:
            l.append(u['ID_UBICACIO'])
    
    # Assign the list l to the 'y' column in df_cg for the current row
    df_cg.at[idx_cg, 'y'] = l

df_cg


Unnamed: 0,ID_GRUPO,ID_FECHA_GRUPO,ID_HORA_INICIO,ID_HORA_FIN,ID_CURSO_ACADEMICO,ID_ASIGNATURA,ID_TIPO_DOCENCIA,ID_COD_GRUPO,ID_PERIODO_DOCENTE,IND_ALUMNOS_GRUPO_PREV,IND_ALUMNOS_GRUPO_REAL,IND_HORAS_PREVISTAS,y
5773,2024-0-115-102714-1-31,2025-03-10,1130,1330,2024,102714,1,31,1,158,166,60.0,"[B1/0042, B5/-118, B2/-141]"
5708,2024-0-115-102690-1-31,2025-03-10,830,1030,2024,102690,1,31,1,96,114,30.0,"[C1/017, C3/022, B2/015, B1/0042, B3B/0010, B1..."
5797,2024-0-115-106050-1-21,2025-03-10,1030,1130,2024,106050,1,21,0,108,101,45.0,"[Q1/1011, C1/017, C3/022, B2/017, B5B/022, B2/..."
5794,2024-0-115-102757-1-450,2025-03-10,1130,1330,2024,102757,1,450,1,94,100,26.0,"[Q1/1011, C3/018, C1/017, C3/022, B2/017, B5B/..."
5761,2024-0-115-106048-1-21,2025-03-10,1330,1430,2024,106048,1,21,0,66,88,30.0,"[Q1/1011, C3B/010, C3/018, C3/020, C3/032, C3/..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5302,2024-0-115-104368-54-81,2025-03-14,1330,1530,2024,104368,54,81,1,0,15,49.5,"[Q1/1003, Q3/0013, Q4/1013, Q1/0007, Q6/2008, ..."
5303,2024-0-115-104548-1-61,2025-03-14,1500,1700,2024,104548,1,61,1,14,13,26.0,"[Q1/1003, Q3/0013, Q4/1013, Q1/0007, Q6/2008, ..."
5324,2024-0-115-104548-54-611,2025-03-14,1700,1800,2024,104548,54,611,1,14,13,12.0,"[Q1/1003, Q3/0013, Q4/1013, Q1/0007, Q6/2008, ..."
5381,2024-0-115-106057-54-213,2025-03-14,830,1030,2024,106057,54,213,1,21,5,60.0,"[Q1/1003, Q3/0013, Q4/1013, Q1/0007, Q6/2008, ..."
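
The nested loop above visits every room for every group. As a sketch of an equivalent but more compact formulation, the capacity test can be written as a boolean mask over the rooms table. The helper name `candidate_rooms` and the toy frames are assumptions made for this example; the three capacities are the ones shown in the rooms table earlier.

```python
import pandas as pd

toy_groups = pd.DataFrame({
    "ID_GRUPO": ["g1", "g2"],
    "IND_ALUMNOS_GRUPO_REAL": [166, 15],
})
toy_rooms = pd.DataFrame({
    "ID_UBICACIO": ["B1/0042", "C5/258", "B5/-118"],
    "CAPACIDAD": [202, 12, 176],
})

def candidate_rooms(n_students, rooms):
    """Return the IDs of rooms with capacity >= n_students, in row order."""
    return rooms.loc[rooms["CAPACIDAD"] >= n_students, "ID_UBICACIO"].tolist()

# One list of candidate rooms per group, stored in an object column.
toy_groups["y"] = [
    candidate_rooms(n, toy_rooms) for n in toy_groups["IND_ALUMNOS_GRUPO_REAL"]
]
print(toy_groups)
```

The candidates keep the row order of the rooms table, as in the loop version, so the later first-fit assignment would behave the same way.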


In this step, we create a new DataFrame, `new_cg_df`, by removing columns from `df_cg` that are not essential for our analysis or optimization process. The columns dropped are:

- `ID_CURSO_ACADEMICO`: Academic year identifier.
- `ID_ASIGNATURA`: Subject identifier.
- `ID_TIPO_DOCENCIA`: Type of teaching identifier.
- `ID_COD_GRUPO`: Group code identifier.
- `ID_PERIODO_DOCENTE`: Teaching period identifier.
- `IND_ALUMNOS_GRUPO_PREV`: Projected number of students in the group.
- `IND_ALUMNOS_GRUPO_REAL`: Actual number of students in the group.
- `IND_HORAS_PREVISTAS`: Expected hours for the group.

Dropping these columns leaves a compact view with only the attributes relevant to room assignment: group, date, start and end times, and candidate rooms. `new_cg_df` is used here for display only; the following steps keep working on `df_cg`.

In [44]:
new_cg_df = df_cg.drop(columns=['ID_CURSO_ACADEMICO', 'ID_ASIGNATURA', 'ID_TIPO_DOCENCIA', 'ID_COD_GRUPO', 'ID_PERIODO_DOCENTE', 'IND_ALUMNOS_GRUPO_PREV', 'IND_ALUMNOS_GRUPO_REAL', 'IND_HORAS_PREVISTAS'])
new_cg_df

Unnamed: 0,ID_GRUPO,ID_FECHA_GRUPO,ID_HORA_INICIO,ID_HORA_FIN,y
5773,2024-0-115-102714-1-31,2025-03-10,1130,1330,"[B1/0042, B5/-118, B2/-141]"
5708,2024-0-115-102690-1-31,2025-03-10,830,1030,"[C1/017, C3/022, B2/015, B1/0042, B3B/0010, B1..."
5797,2024-0-115-106050-1-21,2025-03-10,1030,1130,"[Q1/1011, C1/017, C3/022, B2/017, B5B/022, B2/..."
5794,2024-0-115-102757-1-450,2025-03-10,1130,1330,"[Q1/1011, C3/018, C1/017, C3/022, B2/017, B5B/..."
5761,2024-0-115-106048-1-21,2025-03-10,1330,1430,"[Q1/1011, C3B/010, C3/018, C3/020, C3/032, C3/..."
...,...,...,...,...,...
5302,2024-0-115-104368-54-81,2025-03-14,1330,1530,"[Q1/1003, Q3/0013, Q4/1013, Q1/0007, Q6/2008, ..."
5303,2024-0-115-104548-1-61,2025-03-14,1500,1700,"[Q1/1003, Q3/0013, Q4/1013, Q1/0007, Q6/2008, ..."
5324,2024-0-115-104548-54-611,2025-03-14,1700,1800,"[Q1/1003, Q3/0013, Q4/1013, Q1/0007, Q6/2008, ..."
5381,2024-0-115-106057-54-213,2025-03-14,830,1030,"[Q1/1003, Q3/0013, Q4/1013, Q1/0007, Q6/2008, ..."


This section aims to assign available classrooms to each group based on their schedule, avoiding time conflicts. We accomplish this through the following steps:

1. **Convert Start and End Times**: We format the `ID_HORA_INICIO` and `ID_HORA_FIN` columns as time objects, ensuring they are compatible with scheduling comparisons.

2. **Create an Occupancy Calendar**: A dictionary, `ocupacion_aulas`, is created to store occupied time slots. Each key is a tuple `(fecha, aula)` representing a classroom on a specific date, and the value is a list of time intervals during which the room is occupied.

3. **Initialize Classroom Assignment Column**: An empty column, `AULA_ASIGNADA`, is added to `df_cg` to store the final assigned classroom for each group.

4. **Iterate Over Groups**: For each group:
   - Extract the group's date, start and end times, and available classrooms (`y`).
   - Check each available classroom for potential assignment:
     - **Conflict Check**: If a classroom already has reservations on the group’s date, we check for overlapping times to avoid conflicts.
     - **Assign Room if Available**: If no conflict is found, assign the classroom to the group, add the time interval to the occupancy calendar, and stop searching.

5. **Save Assigned Classroom**: After finding a suitable classroom, we store it in the `AULA_ASIGNADA` column in `df_cg`.

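This process is a greedy first-fit: each group takes the first candidate room that is free during its time slot, so no two groups share a room at overlapping times. It does not guarantee that every group gets a room. If all candidates are busy, `AULA_ASIGNADA` stays `None` for that group.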

In [45]:
df_cg['ID_HORA_INICIO'] = pd.to_datetime(df_cg['ID_HORA_INICIO'].astype(str).str.zfill(4), format='%H%M').dt.time
df_cg['ID_HORA_FIN'] = pd.to_datetime(df_cg['ID_HORA_FIN'].astype(str).str.zfill(4), format='%H%M').dt.time

# Build a room occupancy calendar:
# a dictionary keyed by (date, room) tuples whose values are lists of occupied time intervals
ocupacion_aulas = {}

# Initialize the column that will hold the final room assignment
df_cg['AULA_ASIGNADA'] = None

# Iterate over each group to assign a room
for idx_grupo, grupo in df_cg.iterrows():
    fecha = grupo['ID_FECHA_GRUPO']
    hora_inicio = grupo['ID_HORA_INICIO']
    hora_fin = grupo['ID_HORA_FIN']
    aulas_disponibles = grupo['y']
    aula_asignada = None

    # Iterate over the candidate rooms for this group
    for aula in aulas_disponibles:
        clave = (fecha, aula)
        intervalo_grupo = (hora_inicio, hora_fin)
        conflicto = False

        # Check whether the room already has bookings on that date
        if clave in ocupacion_aulas:
            # Check for overlaps with the time slots already booked
            for intervalo_ocupado in ocupacion_aulas[clave]:
                inicio_ocupado, fin_ocupado = intervalo_ocupado
                # The intervals overlap unless one ends before the other starts
                if not (hora_fin <= inicio_ocupado or hora_inicio >= fin_ocupado):
                    conflicto = True
                    break
        else:
            # The room has no bookings on that date yet: initialize its list
            ocupacion_aulas[clave] = []

        if not conflicto:
            # Assign the room to the group
            aula_asignada = aula
            # Record the time interval in the room's occupancy list
            ocupacion_aulas[clave].append(intervalo_grupo)
            break  # Stop searching for more rooms

    # Store the assigned room in the DataFrame
    df_cg.at[idx_grupo, 'AULA_ASIGNADA'] = aula_asignada

df_cg.head()

Unnamed: 0,ID_GRUPO,ID_FECHA_GRUPO,ID_HORA_INICIO,ID_HORA_FIN,ID_CURSO_ACADEMICO,ID_ASIGNATURA,ID_TIPO_DOCENCIA,ID_COD_GRUPO,ID_PERIODO_DOCENTE,IND_ALUMNOS_GRUPO_PREV,IND_ALUMNOS_GRUPO_REAL,IND_HORAS_PREVISTAS,y,AULA_ASIGNADA
5773,2024-0-115-102714-1-31,2025-03-10,11:30:00,13:30:00,2024,102714,1,31,1,158,166,60.0,"[B1/0042, B5/-118, B2/-141]",B1/0042
5708,2024-0-115-102690-1-31,2025-03-10,08:30:00,10:30:00,2024,102690,1,31,1,96,114,30.0,"[C1/017, C3/022, B2/015, B1/0042, B3B/0010, B1...",C1/017
5797,2024-0-115-106050-1-21,2025-03-10,10:30:00,11:30:00,2024,106050,1,21,0,108,101,45.0,"[Q1/1011, C1/017, C3/022, B2/017, B5B/022, B2/...",Q1/1011
5794,2024-0-115-102757-1-450,2025-03-10,11:30:00,13:30:00,2024,102757,1,450,1,94,100,26.0,"[Q1/1011, C3/018, C1/017, C3/022, B2/017, B5B/...",Q1/1011
5761,2024-0-115-106048-1-21,2025-03-10,13:30:00,14:30:00,2024,106048,1,21,0,66,88,30.0,"[Q1/1011, C3B/010, C3/018, C3/020, C3/032, C3/...",Q1/1011
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5302,2024-0-115-104368-54-81,2025-03-14,13:30:00,15:30:00,2024,104368,54,81,1,0,15,49.5,"[Q1/1003, Q3/0013, Q4/1013, Q1/0007, Q6/2008, ...",QC/2090
5303,2024-0-115-104548-1-61,2025-03-14,15:00:00,17:00:00,2024,104548,1,61,1,14,13,26.0,"[Q1/1003, Q3/0013, Q4/1013, Q1/0007, Q6/2008, ...",QC/0107
5324,2024-0-115-104548-54-611,2025-03-14,17:00:00,18:00:00,2024,104548,54,611,1,14,13,12.0,"[Q1/1003, Q3/0013, Q4/1013, Q1/0007, Q6/2008, ...",QC/0107
5381,2024-0-115-106057-54-213,2025-03-14,08:30:00,10:30:00,2024,106057,54,213,1,21,5,60.0,"[Q1/1003, Q3/0013, Q4/1013, Q1/0007, Q6/2008, ...",QC/3047
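
The conflict test inside the loop, `not (hora_fin <= inicio_ocupado or hora_inicio >= fin_ocupado)`, treats each booking as a half-open interval, so a class that starts exactly when another ends is not a conflict. The sketch below isolates that test in a small function; the name `overlaps` is an assumption for illustration and does not appear in the notebook.

```python
from datetime import time

def overlaps(start_a, end_a, start_b, end_b):
    """True if [start_a, end_a) and [start_b, end_b) share any instant."""
    # Equivalent to: not (end_a <= start_b or start_a >= end_b)
    return start_a < end_b and start_b < end_a

# Back-to-back sessions in the same room are allowed.
print(overlaps(time(11, 30), time(13, 30), time(13, 30), time(14, 30)))  # False

# A session starting before the previous one ends is a conflict.
print(overlaps(time(11, 30), time(13, 30), time(13, 0), time(15, 0)))    # True
```

This matches the output above, where `Q1/1011` hosts sessions at 10:30-11:30, 11:30-13:30 and 13:30-14:30 on the same day.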


We then save the assignments, still including the candidate list `y`, to `grupos_asignados.csv`.

In [46]:
csv_file_path = '/kaggle/working/grupos_asignados.csv'
df_cg.to_csv(csv_file_path, index=False)
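
One caveat when saving a column of Python lists such as `y`: CSV stores each list as its text representation, so reading the file back yields strings, not lists. The following sketch shows the round trip on a hypothetical one-row frame, using an in-memory buffer instead of a file, and one possible way to restore the lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({"ID_GRUPO": ["g1"], "y": [["B1/0042", "B5/-118"]]})

# Write to an in-memory buffer instead of /kaggle/working/.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Plain read: the list comes back as a string.
raw = pd.read_csv(io.StringIO(csv_text))
print(type(raw.loc[0, "y"]))

# Read with a converter that parses the string back into a list.
restored = pd.read_csv(io.StringIO(csv_text), converters={"y": ast.literal_eval})
print(restored.loc[0, "y"])
```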

## Merge dataframes

In this step, we hide the columns of `df_cg` that are not needed to read the result. The columns dropped include:

- `y`: List of potential classrooms, now redundant as each group has been assigned a classroom.
- `ID_CURSO_ACADEMICO`, `ID_ASIGNATURA`, `ID_TIPO_DOCENCIA`, `ID_COD_GRUPO`, `ID_PERIODO_DOCENTE`: Identifiers related to academic course details.
- `IND_ALUMNOS_GRUPO_PREV`, `IND_ALUMNOS_GRUPO_REAL`, `IND_HORAS_PREVISTAS`: Information on expected student numbers and scheduled hours, which are no longer needed.

The cell displays the first 10 rows of this reduced view, showing the final assignments. Because the result of `drop` is not assigned, `df_cg` itself is unchanged.

In [47]:
df_cg.drop(columns=['y','ID_CURSO_ACADEMICO', 'ID_ASIGNATURA', 'ID_TIPO_DOCENCIA', 'ID_COD_GRUPO', 'ID_PERIODO_DOCENTE', 'IND_ALUMNOS_GRUPO_PREV', 'IND_ALUMNOS_GRUPO_REAL', 'IND_HORAS_PREVISTAS']).head(10)


Unnamed: 0,ID_GRUPO,ID_FECHA_GRUPO,ID_HORA_INICIO,ID_HORA_FIN,AULA_ASIGNADA
5773,2024-0-115-102714-1-31,2025-03-10,11:30:00,13:30:00,B1/0042
5708,2024-0-115-102690-1-31,2025-03-10,08:30:00,10:30:00,C1/017
5797,2024-0-115-106050-1-21,2025-03-10,10:30:00,11:30:00,Q1/1011
5794,2024-0-115-102757-1-450,2025-03-10,11:30:00,13:30:00,Q1/1011
5761,2024-0-115-106048-1-21,2025-03-10,13:30:00,14:30:00,Q1/1011
5762,2024-0-115-103804-1-415,2025-03-10,09:30:00,10:30:00,Q1/1011
5771,2024-0-115-103815-1-11,2025-03-10,17:00:00,18:00:00,Q3/1011
5702,2024-0-115-102707-1-33,2025-03-10,09:30:00,11:30:00,Q3/1011
5779,2024-0-115-106047-1-21,2025-03-10,11:30:00,13:30:00,Q3/1011
5783,2024-0-115-102714-54-311,2025-03-10,13:30:00,14:30:00,Q3/1011


For the remaining analysis, the candidate list `y` is no longer needed, so we drop it and keep the result as `df_combined`.

In [48]:
df_combined = df_cg.drop(columns=['y'])

In this step, we combine the `df_combined` DataFrame (which contains group data with assigned classrooms) and `df_u` (containing details on classrooms) to enrich our dataset with additional classroom attributes.

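1. **Merge DataFrames**: We perform a left join between `df_combined` and `df_u`, matching `AULA_ASIGNADA` in `df_combined` with `ID_UBICACIO` in `df_u`. This attaches each assigned classroom's details (description, building, and capacity) to the group data. Because it is a left join, groups without an assigned room are kept, with missing values in the classroom columns.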

2. **Drop Redundant Column**: After the merge, we drop the `ID_UBICACIO` column from the result, as it is redundant with `AULA_ASIGNADA`.

3. **Display Final Data**: The first few rows of `merged_df` are displayed to show the combined dataset with relevant classroom details.

This merged DataFrame, `merged_df`, provides a comprehensive view of each group with their assigned classrooms and relevant classroom characteristics.

In [49]:
merged_df = df_combined.merge(df_u, left_on='AULA_ASIGNADA', right_on='ID_UBICACIO', how='left').drop(columns=['ID_UBICACIO'])
merged_df.head()

Unnamed: 0,ID_GRUPO,ID_FECHA_GRUPO,ID_HORA_INICIO,ID_HORA_FIN,ID_CURSO_ACADEMICO,ID_ASIGNATURA,ID_TIPO_DOCENCIA,ID_COD_GRUPO,ID_PERIODO_DOCENTE,IND_ALUMNOS_GRUPO_PREV,IND_ALUMNOS_GRUPO_REAL,IND_HORAS_PREVISTAS,AULA_ASIGNADA,DS_UBICACIO,ID_EDIFICI,CAPACIDAD
0,2024-0-115-102714-1-31,2025-03-10,11:30:00,13:30:00,2024,102714,1,31,1,158,166,60.0,B1/0042,aula 15,B,202
1,2024-0-115-102690-1-31,2025-03-10,08:30:00,10:30:00,2024,102690,1,31,1,96,114,30.0,C1/017,aula,C,121
2,2024-0-115-106050-1-21,2025-03-10,10:30:00,11:30:00,2024,106050,1,21,0,108,101,45.0,Q1/1011,aula q1/1011 (tres portes),Q,103
3,2024-0-115-102757-1-450,2025-03-10,11:30:00,13:30:00,2024,102757,1,450,1,94,100,26.0,Q1/1011,aula q1/1011 (tres portes),Q,103
4,2024-0-115-106048-1-21,2025-03-10,13:30:00,14:30:00,2024,106048,1,21,0,66,88,30.0,Q1/1011,aula q1/1011 (tres portes),Q,103
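
A left join silently keeps rows that find no match and silently duplicates rows if a room ID appears twice in `df_u`. As a hedged sketch on hypothetical toy data, the `validate` and `indicator` parameters of `DataFrame.merge` make both cases visible:

```python
import pandas as pd

toy_assigned = pd.DataFrame({
    "ID_GRUPO": ["g1", "g2", "g3"],
    "AULA_ASIGNADA": ["B1/0042", "Q1/1011", None],  # g3 has no room
})
toy_rooms = pd.DataFrame({
    "ID_UBICACIO": ["B1/0042", "Q1/1011"],
    "CAPACIDAD": [202, 103],
})

merged = toy_assigned.merge(
    toy_rooms,
    left_on="AULA_ASIGNADA",
    right_on="ID_UBICACIO",
    how="left",
    validate="many_to_one",  # raises MergeError if a room ID repeats in toy_rooms
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Groups whose assigned room was not found in the rooms table.
unmatched = merged.loc[merged["_merge"] == "left_only", "ID_GRUPO"].tolist()
print(unmatched)  # ['g3']
```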


We save the merged DataFrame to a separate file for further analysis.

In [50]:
csv_file_path = '/kaggle/working/grupos_ubicacio_merged.csv'
merged_df.to_csv(csv_file_path, index=False)

## Validate constraints

This section checks for time overlaps in classroom assignments to ensure no scheduling conflicts between groups assigned to the same room on the same day.

1. **Sort the DataFrame**: We sort `df_combined` by `AULA_ASIGNADA`, `ID_FECHA_GRUPO`, and `ID_HORA_INICIO`, grouping records by classroom and date, then arranging by start time within each group.

2. **Create DateTime Columns**: We generate `START_DATETIME` and `END_DATETIME` columns by combining the date (`ID_FECHA_GRUPO`) and start/end times (`ID_HORA_INICIO`, `ID_HORA_FIN`) for each class session. This allows for precise time comparison.

3. **Check for Overlaps**: 
   - We iterate over each group of records with the same classroom and date.
   - For each group, we compare the end time of the previous session with the start time of the next.
   - If the next session’s start time (`START_DATETIME`) is earlier than the previous session’s end time (`END_DATETIME`), we record the conflict as a tuple of group IDs.

4. **Display Results**: 
   - If overlaps are detected, we print the conflicting group pairs to alert to scheduling conflicts.
   - If no conflicts are found, a message indicates that the schedule is conflict-free.

This conflict detection step ensures the integrity of the room assignment schedule by identifying any overlapping time slots.

In [51]:
# Sort the DataFrame by AULA_ASIGNADA, ID_FECHA_GRUPO, and ID_HORA_INICIO
df_combined = df_combined.sort_values(by=['AULA_ASIGNADA', 'ID_FECHA_GRUPO', 'ID_HORA_INICIO'])

# Convert the times to datetime to compare time slots within each room and day
df_combined['START_DATETIME'] = pd.to_datetime(df_combined['ID_FECHA_GRUPO'].astype(str) + ' ' + df_combined['ID_HORA_INICIO'].astype(str))
df_combined['END_DATETIME'] = pd.to_datetime(df_combined['ID_FECHA_GRUPO'].astype(str) + ' ' + df_combined['ID_HORA_FIN'].astype(str))

# Collect overlapping pairs of consecutive sessions in each (room, date) group
overlaps = []
for aula, group in df_combined.groupby(['AULA_ASIGNADA', 'ID_FECHA_GRUPO']):
    for i in range(1, len(group)):
        prev_row = group.iloc[i - 1]
        curr_row = group.iloc[i]
        # Check whether the next session starts before the previous one ends
        if curr_row['START_DATETIME'] < prev_row['END_DATETIME']:
            overlaps.append((prev_row['ID_GRUPO'], curr_row['ID_GRUPO']))

# Show the results
if overlaps:
    print("Conflictos encontrados en los siguientes pares de grupos:")
    for conflict in overlaps:
        print(f"Grupo {conflict[0]} y Grupo {conflict[1]}")
else:
    print("No se encontraron conflictos.")

No se encontraron conflictos.


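As the output shows ("No se encontraron conflictos", that is, "No conflicts were found"), no two groups assigned to the same room on the same day overlap in time :)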
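
The overlap check covers only one constraint. Two others are worth checking as well: that every group actually received a room, and that each assigned room is large enough. Note that `groupby` drops rows with a missing key by default, so groups with `AULA_ASIGNADA = None` are skipped by the overlap check rather than reported. The sketch below runs both checks on a hypothetical toy frame shaped like `merged_df`.

```python
import pandas as pd

# Toy frame with the merged_df columns needed here (hypothetical rows).
toy = pd.DataFrame({
    "ID_GRUPO": ["g1", "g2", "g3"],
    "IND_ALUMNOS_GRUPO_REAL": [166, 100, 50],
    "AULA_ASIGNADA": ["B1/0042", "Q1/1011", None],
    "CAPACIDAD": [202.0, 103.0, float("nan")],
})

# Groups that ended up without a room.
unassigned = toy.loc[toy["AULA_ASIGNADA"].isna(), "ID_GRUPO"].tolist()

# Groups placed in a room smaller than their student count.
# Comparisons against NaN are False, so unassigned groups are not listed here.
too_small = toy["IND_ALUMNOS_GRUPO_REAL"] > toy["CAPACIDAD"]
over_capacity = toy.loc[too_small, "ID_GRUPO"].tolist()

print("Unassigned groups:", unassigned)
print("Over-capacity groups:", over_capacity)
```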