# Predicting Whether a Student Passes or Fails with PySpark

In this notebook we train a **logistic regression** model with PySpark to predict whether a student will pass, based on a few of the student's characteristics.
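As a rough illustration of what a logistic regression model computes, the sketch below turns a weighted sum of features into a probability with the logistic (sigmoid) function and then applies a 0.5 threshold. The function names, the two features, and the weights are hypothetical and chosen only for illustration; they are not fitted to the dataset used in this notebook.

```python
import math


def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    # Use the algebraically equivalent form for negative z to avoid overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_pass(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one student; label 1 means 'pass'."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p >= threshold)


# Hypothetical weights and feature values, not learned from any data.
prob, label = predict_pass(features=[2.0, 17.0], weights=[0.8, -0.1], bias=0.5)
print(round(prob, 3), label)
```

Training the model amounts to finding the weights and bias that make these probabilities agree with the observed labels as closely as possible.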

We use a sample dataset of students and then look at how to make predictions for a new student.

## Loading and preparing the data

First, we load the data from a CSV file (it must be in the same directory, or you must specify the correct path) and prepare it for the model.


In [9]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MathStudentAnalysis") \
    .getOrCreate()

25/09/18 22:34:03 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.



In [15]:
# Assuming you have a CSV file at this path
df = spark.read.csv("/user/app/source/student-mat.csv", header=True, inferSchema=True)
df.show(5)

+------+---+---+-------+-------+-------+----+----+-------+--------+------+--------+----------+---------+--------+---------+------+----+----------+-------+------+--------+--------+------+--------+-----+----+----+------+--------+---+---+---+
|school|sex|age|address|famsize|Pstatus|Medu|Fedu|   Mjob|    Fjob|reason|guardian|traveltime|studytime|failures|schoolsup|famsup|paid|activities|nursery|higher|internet|romantic|famrel|freetime|goout|Dalc|Walc|health|absences| G1| G2| G3|
+------+---+---+-------+-------+-------+----+----+-------+--------+------+--------+----------+---------+--------+---------+------+----+----------+-------+------+--------+--------+------+--------+-----+----+----+------+--------+---+---+---+
|    GP|  F| 18|      U|    GT3|      A|   4|   4|at_home| teacher|course|  mother|         2|        2|       0|      yes|    no|  no|        no|    yes|   yes|      no|      no|     4|       3|    4|   1|   1|     3|       6|  5|  6|  6|
|    GP|  F| 17|      U|    GT3|      T|

## Inspecting and cleaning the data

Before training the model, we review the data to make sure there are no null values or formatting problems.


In [16]:
df.printSchema()  # Show the data type of each column
df.describe().show()  # Summary statistics (count, mean, stddev, min, max) per column

# Drop rows that contain null values
df = df.dropna()
df.show(5)


root
 |-- school: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- address: string (nullable = true)
 |-- famsize: string (nullable = true)
 |-- Pstatus: string (nullable = true)
 |-- Medu: integer (nullable = true)
 |-- Fedu: integer (nullable = true)
 |-- Mjob: string (nullable = true)
 |-- Fjob: string (nullable = true)
 |-- reason: string (nullable = true)
 |-- guardian: string (nullable = true)
 |-- traveltime: integer (nullable = true)
 |-- studytime: integer (nullable = true)
 |-- failures: integer (nullable = true)
 |-- schoolsup: string (nullable = true)
 |-- famsup: string (nullable = true)
 |-- paid: string (nullable = true)
 |-- activities: string (nullable = true)
 |-- nursery: string (nullable = true)
 |-- higher: string (nullable = true)
 |-- internet: string (nullable = true)
 |-- romantic: string (nullable = true)
 |-- famrel: integer (nullable = true)
 |-- freetime: integer (nullable = true)
 |-- goout: integer (null

                                                                                

+-------+------+----+------------------+-------+-------+-------+------------------+------------------+-------+-------+----------+--------+------------------+------------------+------------------+---------+------+----+----------+-------+------+--------+--------+------------------+------------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+
|summary|school| sex|               age|address|famsize|Pstatus|              Medu|              Fedu|   Mjob|   Fjob|    reason|guardian|        traveltime|         studytime|          failures|schoolsup|famsup|paid|activities|nursery|higher|internet|romantic|            famrel|          freetime|             goout|              Dalc|              Walc|            health|         absences|                G1|                G2|                G3|
+-------+------+----+------------------+-------+-------+-------+------------------+---------------

## Preparing the data for the model

We now prepare the columns used to train the model and transform the features into a suitable format.

In [12]:
!pip install numpy

Collecting numpy
  Downloading numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.5/19.5 MB 26.7 MB/s eta 0:00:00
Installing collected packages: numpy
Successfully installed numpy-2.0.2

[notice] A new release of pip is available: 23.0.1 -> 25.2
[notice] To update, run: pip install --upgrade pip


In [18]:
from pyspark.sql.functions import col, when

# Create a 'label' column where 1 = passed (G3 >= 10) and 0 = failed (G3 < 10)
df = df.withColumn("label", when(col("G3") >= 10, 1).otherwise(0))

# Show the first rows with the new 'label' column
df.select("G3", "label").show(5)

+---+-----+
| G3|label|
+---+-----+
|  6|    0|
|  6|    0|
| 10|    1|
| 15|    1|
| 10|    1|
+---+-----+
only showing top 5 rows
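To make the boundary case explicit, the labelling rule can be mirrored in plain Python and checked against the five `G3` values shown in the output above. The names `PASS_MARK` and `to_label` are hypothetical helpers introduced only for this sketch; the notebook itself applies the rule with `when(...).otherwise(...)`.

```python
PASS_MARK = 10  # same threshold as the cell above: G3 >= 10 counts as a pass


def to_label(g3: int, pass_mark: int = PASS_MARK) -> int:
    """Return 1 if the final grade reaches the pass mark, otherwise 0."""
    return 1 if g3 >= pass_mark else 0


g3_sample = [6, 6, 10, 15, 10]  # the five G3 values displayed above
labels = [to_label(g) for g in g3_sample]
print(labels)  # [0, 0, 1, 1, 1]
```

Note that a grade of exactly 10 is labelled as a pass, which matches the third and fifth rows of the table.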


In [20]:
from pyspark.ml.feature import VectorAssembler

# Select the predictor columns. The names originally used here ("hours_studied",
# "grade") do not exist in this dataset and raise FIELD_NOT_FOUND; "studytime"
# and the earlier period grades "G1" and "G2" are assumed as the closest matches.
features = ["studytime", "age", "G1", "G2"]
assembler = VectorAssembler(inputCols=features, outputCol="features")

# Combine the predictor columns into a single vector column
data = assembler.transform(df)

# Show a few rows to check the result
data.select("features", "label").show(5)
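Because `VectorAssembler` fails when one of its input columns is not present in the DataFrame, it can help to compare the requested feature names against `df.columns` before assembling. The helper below is a minimal sketch of that check; `missing_columns` is a hypothetical name, and the `available` list is a shortened stand-in for `df.columns` so the example runs without a Spark session.

```python
def missing_columns(available, requested):
    """Return the requested column names that are absent from `available`."""
    available_set = set(available)
    return [name for name in requested if name not in available_set]


# In the notebook this would be called as missing_columns(df.columns, features).
available = ["age", "studytime", "failures", "absences", "G1", "G2", "G3", "label"]
requested = ["hours_studied", "age", "grade"]

missing = missing_columns(available, requested)
print(missing)  # ['hours_studied', 'grade']
```

An empty result means every requested feature exists and the assembler can be applied safely.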