# Correlation Analysis Exercise

This exercise examines the relationship between traffic and weather by analyzing, in PySpark, the data produced by the ETL process.

In [7]:
# Create the Spark session
from spark.spark_session import get_spark_session

spark = get_spark_session()


In [8]:
# Load the data from the Parquet files
data_path = "hdfs://localhost:9000/hdfs/traffic_weather/processed/year=2022/"

df = spark.read.parquet(data_path)
df.show(5)

df.count()


+-----------+--------+----------+-------------+----------------+---------+-----------+-----+-------------------+--------+-----------------+-----+---+
|id_dim_date|state_cd|station_id|sum_veh_count|functional_class| latitude|  longitude|state|           AVG_temp|AVG_rain|        AVG_hmdty|month|day|
+-----------+--------+----------+-------------+----------------+---------+-----------+-----+-------------------+--------+-----------------+-----+---+
|        241|      53|    P6AAAA|       110025|              2U| 47.43086|   -122.221|   WA|-1.6666666666666667|     0.0|73.16666666666667|    1|  3|
|        241|      12|    979933|        88816|              2U| 26.17933|  -80.30672|   FL|-1.6666666666666667|     0.0|73.16666666666667|    1|  3|
|        241|      53|    P05AAA|          870|              3R|46.438974|-117.951628|   WA|-1.6666666666666667|     0.0|73.16666666666667|    1|  3|
|        241|      53|    P09AAA|        10053|              1R|46.047791|-119.224519|   WA|-1.66666

22

In [9]:
from pyspark.ml.feature import VectorAssembler

# Select the variables to analyze
variables = ["sum_veh_count", "AVG_temp", "AVG_rain", "AVG_hmdty"]

# Assemble the selected columns into a single feature vector
assembler = VectorAssembler(inputCols=variables, outputCol="features")
traffic_vector = assembler.transform(df.select(variables))

In [15]:
# Compute the Pearson correlation matrix
from pyspark.ml.stat import Correlation

corr_matrix = Correlation.corr(traffic_vector, "features", method="pearson").head()
corr_matrix

Row(pearson(features)=DenseMatrix(4, 4, [1.0, 0.2072, 0.2877, 0.1525, 0.2072, 1.0, 0.2587, 0.7123, 0.2877, 0.2587, 1.0, 0.5345, 0.1525, 0.7123, 0.5345, 1.0], False))
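
As a reminder of what `method="pearson"` measures, the coefficient between two variables is their covariance divided by the product of their standard deviations, so it ranges from -1 to 1. The sketch below computes it in plain Python. The `pearson` helper and the sample lists are hypothetical illustrations, not data or code from this notebook.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical sample values, chosen to be perfectly linear
temps = [1.0, 2.0, 3.0, 4.0]
counts = [10.0, 20.0, 30.0, 40.0]
r = pearson(temps, counts)
print(round(r, 6))
```

Values near 0 indicate little linear relationship, and values near 1 or -1 indicate a strong one.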

In [17]:
# Extract the matrix values as a NumPy array
corr_values = corr_matrix[0].toArray()
corr_values

array([[1.        , 0.20717595, 0.28768309, 0.1524582 ],
       [0.20717595, 1.        , 0.25874589, 0.7122916 ],
       [0.28768309, 0.25874589, 1.        , 0.53446787],
       [0.1524582 , 0.7122916 , 0.53446787, 1.        ]])
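
The rows and columns of this array follow the order of the `variables` list, so the first row holds the correlations of `sum_veh_count` with each variable. A minimal sketch of labeling that row is shown below. It copies the printed values into a plain Python list so that it runs without Spark; in the notebook, `corr_values.tolist()` should give the same nested list at full precision. The names `traffic_idx` and `traffic_corr` are introduced here for illustration only.

```python
variables = ["sum_veh_count", "AVG_temp", "AVG_rain", "AVG_hmdty"]

# Values copied from the printed corr_values array above
corr_values = [
    [1.0, 0.20717595, 0.28768309, 0.1524582],
    [0.20717595, 1.0, 0.25874589, 0.7122916],
    [0.28768309, 0.25874589, 1.0, 0.53446787],
    [0.1524582, 0.7122916, 0.53446787, 1.0],
]

# Row of the traffic variable, keyed by the other variable's name
traffic_idx = variables.index("sum_veh_count")
traffic_corr = {
    name: corr_values[traffic_idx][j]
    for j, name in enumerate(variables)
    if j != traffic_idx
}

# Print the weather variables from strongest to weakest correlation
for name, r in sorted(traffic_corr.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name}: {r:.3f}")
```

Keying the coefficients by variable name avoids relying on positional indices when reading the matrix.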

The values obtained show that the daily vehicle count is only weakly correlated with the measured weather factors: the Pearson coefficients are approximately 0.21 for temperature, 0.29 for rain, and 0.15 for humidity.