## Project 3 - Computational Analytics for Decision Making

#### Juan Pablo Ríos Hernández 201821819
#### Samuel Felipe Ríos Parra 201821820
#### Joep Cornelis Nicolaas van der Kamp 202416832

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go

## Task 1. Business questions:

#### What score will a student obtain, given their academic and sociodemographic information?

#### What is the historical distribution of scores across sociodemographic and academic factors?

## Task 2. Data cleaning and preparation:

In [None]:
df = pd.read_csv('Datos_Santander.csv')
df

In [None]:
df.dtypes

#### Missing Values

In [None]:
df.isna().sum()

In [None]:
df.shape

In [None]:
value_counts = df['COLE_BILINGUE'].value_counts()
value_counts

In [None]:
df_cleaned = df.dropna()
df_cleaned.shape

In [None]:
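# Assign the result back: calling fillna(inplace=True) on a selected column is deprecated chained assignment
df['COLE_BILINGUE'] = df['COLE_BILINGUE'].fillna('N')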
df_cleaned2 = df.dropna()
df_cleaned2.shape


The only column with a large number of missing values is COLE_BILINGUE, which indicates whether the school is bilingual. We assume that schools with a missing value are not bilingual and replace those values with N (no). We drop the rows that still contain NaN values, since the other variables have few missing values and the dataset is very large. This leaves a dataset of 97517 rows.
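
The impute-then-drop step described above can be sketched on a toy frame. The values below are invented for illustration; only the column name COLE_BILINGUE and the 'N' fill value come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from Datos_Santander.csv)
toy = pd.DataFrame({
    "COLE_BILINGUE": ["S", np.nan, "N", np.nan],
    "PUNT_GLOBAL": [310.0, 250.0, np.nan, 275.0],
})

# Assumption from the notebook: a missing bilingual flag means "not bilingual"
toy["COLE_BILINGUE"] = toy["COLE_BILINGUE"].fillna("N")

# Drop the rows that still have a missing value in any other column
toy_clean = toy.dropna()
print(toy_clean.shape)
```

Imputing before dropping matters here: calling `dropna()` first would discard every row whose only gap is the bilingual flag.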

In [None]:
object_columns = df_cleaned2.select_dtypes(include=['object']).columns
for col in object_columns:
    unique_values = df_cleaned2[col].unique()
    print(f"Unique values of {col}: {unique_values}")

#### Outliers

In [None]:
# Separate the numerical and categorical columns, then compute the IQR of the numerical columns to detect outliers
numerical_columns = df_cleaned2.select_dtypes(include=['number'])
categorical_columns = df_cleaned2.select_dtypes(exclude=['number'])

# Calculate IQR for each numerical column
Q1 = numerical_columns.quantile(0.25)
Q3 = numerical_columns.quantile(0.75)
IQR = Q3 - Q1

# Identify outliers
outliers = ((numerical_columns < (Q1 - 1.5 * IQR)) | (numerical_columns > (Q3 + 1.5 * IQR)))

In [None]:
outliers_count = outliers.sum()

# Print the number of outliers for each numerical column
print("Number of outliers for each numerical column:")
print(outliers_count)

The code (código) columns hold identifiers, so the values flagged in them are not real outliers; the columns with actual outliers are the score (PUNT_*) columns. Because the dataset is so large, we remove these outliers.
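
As a minimal sketch of that idea, the IQR rule can be applied only to the score columns so that identifier columns never flag a row. The helper `iqr_outlier_mask`, the column `SCHOOL_CODE`, and the toy values are hypothetical; only the `PUNT_` prefix and the 1.5 x IQR rule come from the notebook.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy data: SCHOOL_CODE is a hypothetical identifier column
toy = pd.DataFrame({
    "SCHOOL_CODE": [1, 2, 3, 4, 5, 6, 9999],
    "PUNT_GLOBAL": [240, 250, 255, 260, 265, 270, 480],
})

# Only the score columns take part in the outlier check
score_cols = [c for c in toy.columns if c.startswith("PUNT_")]
mask = toy[score_cols].apply(iqr_outlier_mask).any(axis=1)

toy_no_outliers = toy[~mask]
print(toy_no_outliers.shape)
```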

In [None]:
# Specify the columns with outliers
columns_with_outliers = ['PUNT_INGLES', 'PUNT_MATEMATICAS', 'PUNT_SOCIALES_CIUDADANAS', 'PUNT_C_NATURALES', 'PUNT_LECTURA_CRITICA', 'PUNT_GLOBAL']

# Create box plots for each column
plt.figure(figsize=(10, 6))
df_cleaned2[columns_with_outliers].boxplot()
plt.title('Box plot of columns with outliers')
plt.xlabel('Columns')
plt.ylabel('Values')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Remove outliers from specified columns
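# .copy() makes the result an independent frame, so later column assignments do not raise SettingWithCopyWarning
df_cleaned_no_outliers = df_cleaned2[~outliers[columns_with_outliers].any(axis=1)].copy()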

# Print the shape of the dataframe after removing outliers
print("Shape of dataframe after removing outliers:", df_cleaned_no_outliers.shape)

#### Handling categorical data

Machine learning models cannot work directly with non-numeric data. We therefore one-hot encode the categorical variables that have no ordinal relationship and label encode those that do.
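
A minimal sketch of the two encodings on toy data follows. The rows are invented; the column names and category labels are the ones that appear in this notebook, and the reduced `estrato_rank` dictionary is only an illustration.

```python
import pandas as pd

# Toy data (hypothetical rows)
toy = pd.DataFrame({
    "COLE_BILINGUE": ["S", "N", "N"],
    "FAMI_ESTRATOVIVIENDA": ["Estrato 2", "Sin Estrato", "Estrato 1"],
})

# Nominal variable: one indicator column per category, no implied order
encoded = pd.get_dummies(toy, columns=["COLE_BILINGUE"])

# Ordinal variable: map each level to its rank so the order is preserved
estrato_rank = {"Sin Estrato": 0, "Estrato 1": 1, "Estrato 2": 2}
encoded["FAMI_ESTRATOVIVIENDA"] = encoded["FAMI_ESTRATOVIVIENDA"].map(estrato_rank)

print(encoded)
```

One-hot encoding avoids suggesting that one category is "larger" than another, while the integer mapping keeps the ranking that an ordinal variable carries.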

One-hot variables:
- ESTU_TIPODOCUMENTO
- COLE_AREA_UBICACION
- COLE_BILINGUE
- COLE_CALENDARIO
- COLE_CARACTER
- COLE_GENERO
- COLE_JORNADA
- COLE_SEDE_PRINCIPAL
- ESTU_DEPTO_PRESENTACION
- ESTU_DEPTO_RESIDE
- ESTU_ESTADOINVESTIGACION
- ESTU_GENERO
- ESTU_NACIONALIDAD
- ESTU_PAIS_RESIDE
- FAMI_TIENEAUTOMOVIL
- FAMI_TIENECOMPUTADOR
- FAMI_TIENEINTERNET
- FAMI_TIENELAVADORA
- FAMI_PERSONASHOGAR ~~
- FAMI_CUARTOSHOGAR ~~
- FAMI_EDUCACIONMADRE ~~
- FAMI_EDUCACIONPADRE  ~~

Label encoding variables:
- ESTU_FECHANACIMIENTO
- FAMI_ESTRATOVIVIENDA
- DESEMP_INGLES

Others:
- COLE_MCPIO_UBICACION
- ESTU_MCPIO_PRESENTACION
- ESTU_MCPIO_RESIDE

Remove:
- ESTU_CONSECUTIVO
- COLE_DEPTO_UBICACION
- ESTU_ESTUDIANTE
- ESTU_PRIVADO_LIBERTAD

We remove COLE_DEPTO_UBICACION, ESTU_ESTUDIANTE and ESTU_PRIVADO_LIBERTAD because each of them has a single value and therefore cannot contribute to the model. ESTU_CONSECUTIVO is removed because it is an identifier code.
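
Single-valued columns like these can also be found programmatically. The sketch below uses invented rows and is only one possible way to do it; `constant_cols` and `toy_reduced` are hypothetical names.

```python
import pandas as pd

# Toy data (hypothetical rows); ESTU_ESTUDIANTE holds a single value
toy = pd.DataFrame({
    "ESTU_ESTUDIANTE": ["ESTUDIANTE"] * 3,
    "ESTU_GENERO": ["F", "M", "F"],
    "PUNT_GLOBAL": [250, 300, 275],
})

# A column with at most one distinct value carries no information for a model
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]

toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```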

In [None]:
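# Reassign instead of dropping in place, which avoids SettingWithCopyWarning on a filtered frame
df_cleaned_no_outliers = df_cleaned_no_outliers.drop(columns=['ESTU_CONSECUTIVO', 'COLE_DEPTO_UBICACION', 'ESTU_ESTUDIANTE', 'ESTU_PRIVADO_LIBERTAD'])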

One-hot encoding

In [None]:
# # List of categorical variables to one-hot encode
# categorical_columns = ['ESTU_TIPODOCUMENTO', 'COLE_AREA_UBICACION', 'COLE_BILINGUE', 'COLE_CALENDARIO',
#                        'COLE_CARACTER', 'COLE_GENERO', 'COLE_JORNADA', 'COLE_SEDE_PRINCIPAL',
#                        'ESTU_DEPTO_PRESENTACION', 'ESTU_DEPTO_RESIDE', 'ESTU_ESTADOINVESTIGACION',
#                        'ESTU_GENERO', 'ESTU_NACIONALIDAD', 'ESTU_PAIS_RESIDE', 'FAMI_TIENEAUTOMOVIL',
#                        'FAMI_TIENECOMPUTADOR', 'FAMI_TIENEINTERNET', 'FAMI_TIENELAVADORA']

# # One-hot encode the categorical variables
# df_cleaned_no_outliers_encoded = pd.get_dummies(df_cleaned_no_outliers, columns=categorical_columns, drop_first=True)

In [None]:
# df_cleaned_no_outliers_encoded.shape

In [None]:
# df_cleaned_no_outliers_encoded.dtypes

In [None]:
df_cleaned_no_outliers

In [None]:
# List of variables
variables = ['ESTU_FECHANACIMIENTO', 'FAMI_CUARTOSHOGAR', 'FAMI_EDUCACIONMADRE',
             'FAMI_EDUCACIONPADRE', 'FAMI_ESTRATOVIVIENDA', 'FAMI_PERSONASHOGAR', 'DESEMP_INGLES']

# Extract unique values for each variable
unique_values = {}
for variable in variables:
    unique_values[variable] = df_cleaned_no_outliers[variable].unique()

# Print unique values for each variable
for variable, values in unique_values.items():
    print(f"Unique values for {variable}:")
    print(values)
    print()

Label encoding

The variables with an ordinal ranking are label encoded.

In [None]:
from datetime import datetime

# Split a 'dd/mm/YYYY' date string into its (day, month, year) components
def convert_to_numerical(date_string):
    date_object = datetime.strptime(date_string, '%d/%m/%Y')
    return date_object.day, date_object.month, date_object.year

# Apply the conversion to the whole column and store each component in its own column
df_cleaned_no_outliers[['Day', 'Month', 'Year']] = df_cleaned_no_outliers['ESTU_FECHANACIMIENTO'].apply(lambda x: pd.Series(convert_to_numerical(x)))
# Reassign instead of dropping in place, which avoids SettingWithCopyWarning on a filtered frame
df_cleaned_no_outliers = df_cleaned_no_outliers.drop(columns=['ESTU_FECHANACIMIENTO'])

In [None]:
# Ordinal mapping for the socioeconomic stratum; a plain dict is enough, so no LabelEncoder is needed
estrato_mapping = {
    'Sin Estrato': 0,
    'Estrato 1': 1,
    'Estrato 2': 2,
    'Estrato 3': 3,
    'Estrato 4': 4,
    'Estrato 5': 5,
    'Estrato 6': 6
}

# Apply the mapping and replace the original column with its encoded version
df_cleaned_no_outliers['FAMI_ESTRATOVIVIENDA_ENCODED'] = df_cleaned_no_outliers['FAMI_ESTRATOVIVIENDA'].map(estrato_mapping)
df_cleaned_no_outliers = df_cleaned_no_outliers.drop(columns=['FAMI_ESTRATOVIVIENDA'])


In [None]:
# Define the custom mapping dictionary
mapping_dict = {'A-': 1, 'A1': 2, 'A2': 3, 'B1': 4, 'B+': 5}

# Replace the values using the mapping dictionary
df_cleaned_no_outliers['DESEMP_INGLES'] = df_cleaned_no_outliers['DESEMP_INGLES'].map(mapping_dict)

In [None]:
df_cleaned_no_outliers.head()

#### Scaling

In [None]:
df_visualizations = df_cleaned_no_outliers.copy()

In [None]:
df_visualizations.head()

In [None]:
df_cleaned_no_outliers.dtypes

In [None]:
from sklearn.preprocessing import StandardScaler

numerical_columns = df_cleaned_no_outliers.select_dtypes(include=['float64', 'int64']).columns

scaler = StandardScaler()

# Fit and transform the numerical data
df_cleaned_no_outliers[numerical_columns] = scaler.fit_transform(df_cleaned_no_outliers[numerical_columns])


In [None]:
df_cleaned_no_outliers.head()

## Task 3. Data exploration:

In [None]:
df_visualizations.head()

In [None]:
df_visualizations.columns

### Student performance by year

In [None]:
df_visualizations['PERIODO'].unique()

In [None]:
# Build every mask from df_visualizations so the boolean index aligns with the frame being filtered
notas_2019 = df_visualizations[df_visualizations['PERIODO'].isin([20194, 20191])]
notas_2018 = df_visualizations[df_visualizations['PERIODO'] == 20181]
notas_2017 = df_visualizations[df_visualizations['PERIODO'].isin([20172, 20171])]
notas_2016 = df_visualizations[df_visualizations['PERIODO'].isin([20162, 20161])]

In [None]:
fig = go.Figure()
fig.add_trace(go.Box(y = notas_2016['PUNT_GLOBAL'], name = '2016', marker_color='fuchsia'))
fig.add_trace(go.Box(y = notas_2017['PUNT_GLOBAL'], name = '2017', marker_color = 'limegreen'))
fig.add_trace(go.Box(y = notas_2018['PUNT_GLOBAL'], name = '2018', marker_color = 'salmon'))
fig.add_trace(go.Box(y = notas_2019['PUNT_GLOBAL'], name = '2019', marker_color = 'teal'))
fig.update_layout(title_text="Overall performance by year")

fig.show()

### Student performance by bilingual school

In [None]:
df_visualizations['COLE_BILINGUE'].unique()

In [None]:
notas_no = df_visualizations[(df_visualizations['COLE_BILINGUE'] == 'N')]
notas_si = df_visualizations[(df_visualizations['COLE_BILINGUE'] == 'S')]

In [None]:
fig = go.Figure()
fig.add_trace(go.Box(y = notas_no['PUNT_GLOBAL'], name = 'Non-bilingual', marker_color='fuchsia'))
fig.add_trace(go.Box(y = notas_si['PUNT_GLOBAL'], name = 'Bilingual', marker_color = 'limegreen'))
fig.update_layout(title_text="Overall performance by school type (bilingual or not)")

fig.show()

### Performance by academic calendar type

In [None]:
df_visualizations['COLE_CALENDARIO'].unique()

In [None]:
notas_a = df_visualizations[(df_visualizations['COLE_CALENDARIO'] == 'A')]
notas_b = df_visualizations[(df_visualizations['COLE_CALENDARIO'] == 'B')]
notas_otro = df_visualizations[(df_visualizations['COLE_CALENDARIO'] == 'OTRO')]

In [None]:
fig = go.Figure()
fig.add_trace(go.Box(y = notas_a['PUNT_GLOBAL'], name = 'Calendar A', marker_color='fuchsia'))
fig.add_trace(go.Box(y = notas_b['PUNT_GLOBAL'], name = 'Calendar B', marker_color = 'limegreen'))
fig.add_trace(go.Box(y = notas_otro['PUNT_GLOBAL'], name = 'Other', marker_color = 'salmon'))
fig.update_layout(title_text="Overall performance by academic calendar")

fig.show()

### Performance by school character

In [None]:
df_visualizations['COLE_CARACTER'].unique()

In [None]:
notas_ambas = df_visualizations[(df_visualizations['COLE_CARACTER'] == 'TÉCNICO/ACADÉMICO')]
notas_academico = df_visualizations[(df_visualizations['COLE_CARACTER'] == 'ACADÉMICO')]
notas_tecnico = df_visualizations[(df_visualizations['COLE_CARACTER'] == 'TÉCNICO')]

In [None]:
fig = go.Figure()
fig.add_trace(go.Box(y = notas_ambas['PUNT_GLOBAL'], name = 'Technical and academic', marker_color='fuchsia'))
fig.add_trace(go.Box(y = notas_academico['PUNT_GLOBAL'], name = 'Academic', marker_color = 'limegreen'))
fig.add_trace(go.Box(y = notas_tecnico['PUNT_GLOBAL'], name = 'Technical', marker_color = 'salmon'))
fig.update_layout(title_text="Overall performance by school character")

fig.show()

### Performance by whether the school is mixed or single-gender

In [None]:
df_visualizations['COLE_GENERO'].unique()

In [None]:
notas_mixto = df_visualizations[(df_visualizations['COLE_GENERO'] == 'MIXTO')]
notas_femenino = df_visualizations[(df_visualizations['COLE_GENERO'] == 'FEMENINO')]
notas_masculino = df_visualizations[(df_visualizations['COLE_GENERO'] == 'MASCULINO')]

In [None]:
fig = go.Figure()
fig.add_trace(go.Box(y = notas_mixto['PUNT_GLOBAL'], name = 'Mixed', marker_color='fuchsia'))
fig.add_trace(go.Box(y = notas_femenino['PUNT_GLOBAL'], name = 'Female', marker_color = 'limegreen'))
fig.add_trace(go.Box(y = notas_masculino['PUNT_GLOBAL'], name = 'Male', marker_color = 'salmon'))
fig.update_layout(title_text="Overall performance by mixed or single-gender school")

fig.show()

### Performance by school shift (jornada)

In [None]:
df_visualizations['COLE_JORNADA'].unique()

In [None]:
notas_manana = df_visualizations[(df_visualizations['COLE_JORNADA'] == 'MAÑANA')]
notas_tarde = df_visualizations[(df_visualizations['COLE_JORNADA'] == 'TARDE')]
notas_noche = df_visualizations[(df_visualizations['COLE_JORNADA'] == 'NOCHE')]
notas_sabatina = df_visualizations[(df_visualizations['COLE_JORNADA'] == 'SABATINA')]
notas_unica = df_visualizations[(df_visualizations['COLE_JORNADA'] == 'UNICA')]
notas_completa = df_visualizations[(df_visualizations['COLE_JORNADA'] == 'COMPLETA')]

In [None]:
fig = go.Figure()
fig.add_trace(go.Box(y = notas_manana['PUNT_GLOBAL'], name = 'Morning', marker_color='fuchsia'))
fig.add_trace(go.Box(y = notas_tarde['PUNT_GLOBAL'], name = 'Afternoon', marker_color = 'limegreen'))
fig.add_trace(go.Box(y = notas_noche['PUNT_GLOBAL'], name = 'Night', marker_color = 'salmon'))
fig.add_trace(go.Box(y = notas_sabatina['PUNT_GLOBAL'], name = 'Saturday', marker_color = 'teal'))
fig.add_trace(go.Box(y = notas_unica['PUNT_GLOBAL'], name = 'Single shift', marker_color = 'darkslategrey'))
fig.add_trace(go.Box(y = notas_completa['PUNT_GLOBAL'], name = 'Full day', marker_color = 'goldenrod'))
fig.update_layout(title_text="Overall performance by school shift")

fig.show()

### Performance by school nature (official or non-official)

In [None]:
df_visualizations['COLE_NATURALEZA'].unique()

In [None]:
notas_oficial = df_visualizations[(df_visualizations['COLE_NATURALEZA'] == 'OFICIAL')]
notas_nooficial = df_visualizations[(df_visualizations['COLE_NATURALEZA'] == 'NO OFICIAL')]

In [None]:
fig = go.Figure()
fig.add_trace(go.Box(y = notas_oficial['PUNT_GLOBAL'], name = 'Official', marker_color='fuchsia'))
fig.add_trace(go.Box(y = notas_nooficial['PUNT_GLOBAL'], name = 'Non-official', marker_color = 'limegreen'))
fig.update_layout(title_text="Overall performance by school nature")

fig.show()

### Performance by student gender

In [None]:
df_visualizations['ESTU_GENERO'].unique()

In [None]:
notas_mujeeres = df_visualizations[(df_visualizations['ESTU_GENERO'] == 'F')]
notas_hombres = df_visualizations[(df_visualizations['ESTU_GENERO'] == 'M')]

In [None]:
fig = go.Figure()
fig.add_trace(go.Box(y = notas_mujeeres['PUNT_GLOBAL'], name = 'Women', marker_color='fuchsia'))
fig.add_trace(go.Box(y = notas_hombres['PUNT_GLOBAL'], name = 'Men', marker_color = 'limegreen'))
fig.update_layout(title_text="Overall performance by gender")

fig.show()

### Performance by socioeconomic stratum (estrato)

In [None]:
df_visualizations['FAMI_ESTRATOVIVIENDA_ENCODED'].unique()

In [None]:
notas_0 = df_visualizations[(df_visualizations['FAMI_ESTRATOVIVIENDA_ENCODED'] == 0)]
notas_1 = df_visualizations[(df_visualizations['FAMI_ESTRATOVIVIENDA_ENCODED'] == 1)]
notas_2 = df_visualizations[(df_visualizations['FAMI_ESTRATOVIVIENDA_ENCODED'] == 2)]
notas_3 = df_visualizations[(df_visualizations['FAMI_ESTRATOVIVIENDA_ENCODED'] == 3)]
notas_4 = df_visualizations[(df_visualizations['FAMI_ESTRATOVIVIENDA_ENCODED'] == 4)]
notas_5 = df_visualizations[(df_visualizations['FAMI_ESTRATOVIVIENDA_ENCODED'] == 5)]
notas_6 = df_visualizations[(df_visualizations['FAMI_ESTRATOVIVIENDA_ENCODED'] == 6)]

In [None]:
fig = go.Figure()
fig.add_trace(go.Box(y = notas_0['PUNT_GLOBAL'], name = 'Stratum 0', marker_color='fuchsia'))
fig.add_trace(go.Box(y = notas_1['PUNT_GLOBAL'], name = 'Stratum 1', marker_color = 'limegreen'))
fig.add_trace(go.Box(y = notas_2['PUNT_GLOBAL'], name = 'Stratum 2', marker_color = 'salmon'))
fig.add_trace(go.Box(y = notas_3['PUNT_GLOBAL'], name = 'Stratum 3', marker_color = 'teal'))
fig.add_trace(go.Box(y = notas_4['PUNT_GLOBAL'], name = 'Stratum 4', marker_color = 'darkslategrey'))
fig.add_trace(go.Box(y = notas_5['PUNT_GLOBAL'], name = 'Stratum 5', marker_color = 'goldenrod'))
fig.add_trace(go.Box(y = notas_6['PUNT_GLOBAL'], name = 'Stratum 6', marker_color = 'darkviolet'))
fig.update_layout(title_text="Overall performance by stratum")

fig.show()

### Performance by English level

In [None]:
df['DESEMP_INGLES'].unique()

In [None]:
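# df still holds the original string labels (they were mapped to integers in df_visualizations);
# restrict it to the cleaned rows so this plot uses the same observations as the others
df_ingles = df.loc[df_visualizations.index]
notas_a2 = df_ingles[df_ingles['DESEMP_INGLES'] == 'A2']
notas_a_ = df_ingles[df_ingles['DESEMP_INGLES'] == 'A-']
notas_a1 = df_ingles[df_ingles['DESEMP_INGLES'] == 'A1']
notas_b1 = df_ingles[df_ingles['DESEMP_INGLES'] == 'B1']
notas_b_ = df_ingles[df_ingles['DESEMP_INGLES'] == 'B+']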

In [None]:
fig = go.Figure()
fig.add_trace(go.Box(y = notas_a2['PUNT_GLOBAL'], name = 'A2', marker_color='fuchsia'))
fig.add_trace(go.Box(y = notas_a_['PUNT_GLOBAL'], name = 'A-', marker_color = 'limegreen'))
fig.add_trace(go.Box(y = notas_a1['PUNT_GLOBAL'], name = 'A1', marker_color = 'salmon'))
fig.add_trace(go.Box(y = notas_b1['PUNT_GLOBAL'], name = 'B1', marker_color = 'teal'))
fig.add_trace(go.Box(y = notas_b_['PUNT_GLOBAL'], name = 'B+', marker_color = 'darkslategrey'))

fig.update_layout(title_text="Overall performance by English level")

fig.show()

In [None]:
df_visualizations['Year']

In [None]:
# Step 1: find the minimum value of the 'Year' column
df2 = df_visualizations.copy()

min_year = df2['Year'].min()

# Step 2: find the index of the rows that contain this minimum value
index_min_year = df2[df2['Year'] == min_year].index

# Step 3: drop those rows using the index found
df2 = df2.drop(index_min_year)

In [None]:
import plotly.express as px

fig = px.scatter(x=df2['Year'], y=df2['PUNT_GLOBAL'])
fig.update_layout(
    title="Performance by year of birth",
    xaxis_title="Year of birth",
    yaxis_title="Global score"
)
fig.update_traces(marker=dict(color='salmon'))


fig.show()

### Histogram: bilingual vs. non-bilingual

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=notas_no['PUNT_GLOBAL'], histnorm='probability density', marker_color='fuchsia', name = 'Non-bilingual'))
fig.add_trace(go.Histogram(x=notas_si['PUNT_GLOBAL'], histnorm='probability density', marker_color='limegreen', name = 'Bilingual'))

# Overlay both histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.update_layout(title_text="Performance by school type (bilingual or not)")

fig.show()

### Histogram: women vs. men

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=notas_mujeeres['PUNT_GLOBAL'], histnorm='probability density', marker_color='fuchsia', name = 'Women'))
fig.add_trace(go.Histogram(x=notas_hombres['PUNT_GLOBAL'], histnorm='probability density', marker_color='limegreen', name = 'Men'))

# Overlay both histograms
fig.update_layout(barmode='overlay')
# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)
fig.update_layout(title_text="Performance by student gender")

fig.show()

### Histogram: academic calendar

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=notas_a['PUNT_GLOBAL'], histnorm='probability density', marker_color='fuchsia', name = 'Calendar A'))
fig.add_trace(go.Histogram(x=notas_b['PUNT_GLOBAL'], histnorm='probability density', marker_color='limegreen', name = 'Calendar B'))
fig.add_trace(go.Histogram(x=notas_otro['PUNT_GLOBAL'], histnorm='probability density', marker_color='yellow', name = 'Other'))

# Overlay the three histograms
fig.update_layout(barmode='overlay')
# Reduce opacity so that all histograms remain visible
fig.update_traces(opacity=0.75)
fig.update_layout(title_text="Performance by school calendar")

fig.show()

### Histogram: school character

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=notas_academico['PUNT_GLOBAL'], histnorm='probability density', marker_color='fuchsia', name = 'Academic'))
fig.add_trace(go.Histogram(x=notas_tecnico['PUNT_GLOBAL'], histnorm='probability density', marker_color='limegreen', name = 'Technical'))
fig.add_trace(go.Histogram(x=notas_ambas['PUNT_GLOBAL'], histnorm='probability density', marker_color='yellow', name = 'Both'))

# Overlay the three histograms
fig.update_layout(barmode='overlay')
# Reduce opacity so that all histograms remain visible
fig.update_traces(opacity=0.75)
fig.update_layout(title_text="Performance by school character")

fig.show()

### Histogram: mixed vs. single-gender school

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=notas_mixto['PUNT_GLOBAL'], histnorm='probability density', marker_color='fuchsia', name = 'Mixed'))
fig.add_trace(go.Histogram(x=notas_femenino['PUNT_GLOBAL'], histnorm='probability density', marker_color='limegreen', name = 'Female'))
fig.add_trace(go.Histogram(x=notas_masculino['PUNT_GLOBAL'], histnorm='probability density', marker_color='yellow', name = 'Male'))

# Overlay the three histograms
fig.update_layout(barmode='overlay')
# Reduce opacity so that all histograms remain visible
fig.update_traces(opacity=0.7)
fig.update_layout(title_text="Performance by mixed or single-gender school")

fig.show()

### Violin plot: bilingual vs. non-bilingual - SKIP, THESE DO NOT WORK

In [None]:
fig = go.Figure()

# Both halves share the same x0 so that they are drawn at one position as a split violin
fig.add_trace(go.Violin(x0='PUNT_GLOBAL',
                        y=df_visualizations['PUNT_GLOBAL'][ df_visualizations['COLE_BILINGUE'] == 'S' ],
                        legendgroup='Bilingual', scalegroup='Yes', name='Bilingual',
                        side='negative',
                        line_color='blue')
             )
fig.add_trace(go.Violin(x0='PUNT_GLOBAL',
                        y=df_visualizations['PUNT_GLOBAL'][ df_visualizations['COLE_BILINGUE'] == 'N' ],
                        legendgroup='Non-bilingual', scalegroup='Yes', name='Non-bilingual',
                        side='positive',
                        line_color='orange')
             )
fig.update_traces(meanline_visible=True)
fig.update_layout(violingap=0, violinmode='overlay')
fig.show()

In [None]:
fig = go.Figure()

# Add the violin plot for bilingual schools
fig.add_trace(go.Violin(y=notas_si['PUNT_GLOBAL'],
                        name='Bilingual',
                        box_visible=True,
                        meanline_visible=True))

# Add the violin plot for non-bilingual schools
fig.add_trace(go.Violin(y=notas_no['PUNT_GLOBAL'],
                        name='Non-bilingual',
                        box_visible=True,
                        meanline_visible=True))

# Update the figure layout
fig.update_layout(title='Global score comparison between bilingual and non-bilingual schools',
                  yaxis_title='Global score',
                  violinmode='group')

fig.show()

In [None]:
fig = go.Figure()

# Add the violin plot for bilingual schools on the left side
fig.add_trace(go.Violin(y=notas_si['PUNT_GLOBAL'],
                        name='Bilingual',
                        side='negative',
                        box_visible=True,
                        meanline_visible=True,
                        line_color='blue'))

# Add the violin plot for non-bilingual schools on the right side
fig.add_trace(go.Violin(y=notas_no['PUNT_GLOBAL'],
                        name='Non-bilingual',
                        side='positive',
                        box_visible=True,
                        meanline_visible=True,
                        line_color='red'))

# Update the figure layout
fig.update_layout(title='Global score comparison between bilingual and non-bilingual schools',
                  yaxis_title='Global score',
                  violinmode='overlay')

fig.show()


In [None]:
fig = go.Figure()

# Add the violin plot for both groups
fig.add_trace(go.Violin(y=notas_si['PUNT_GLOBAL'],
                        x=['Bilingual'] * len(notas_si),
                        name='Bilingual',
                        box_visible=True,
                        meanline_visible=True,
                        line_color='blue'))

fig.add_trace(go.Violin(y=notas_no['PUNT_GLOBAL'],
                        x=['Non-bilingual'] * len(notas_no),
                        name='Non-bilingual',
                        box_visible=True,
                        meanline_visible=True,
                        line_color='red'))

# Update the figure layout
fig.update_layout(title='Global score comparison between bilingual and non-bilingual schools',
                  yaxis_title='Global score',
                  violinmode='group')

fig.show()

In [None]:
provisional = df_visualizations[['COLE_BILINGUE', 'PUNT_GLOBAL']].copy()

In [None]:
provisional.head()