# Correlation Analysis Exercise

This exercise checks for a relationship between traffic and weather by using PySpark to analyze the data produced by the ETL process.

In [4]:
# Create the Spark session
from spark.spark_session import get_spark_session

spark = get_spark_session()


In [5]:
# Load the data from the Parquet files
data_path = "hdfs://localhost:9000/hdfs/traffic_weather/processed/year=2022/"

df = spark.read.parquet(data_path)
df.show(5)
df.count()


+-----------+--------+----------+-------------+----------------+---------+-----------+-----+------------------+--------+------------------+------------+-----+---+
|id_dim_date|state_cd|station_id|sum_veh_count|functional_class| latitude|  longitude|state|          AVG_temp|AVG_rain|      AVG_daylight|AVG_snowfall|month|day|
+-----------+--------+----------+-------------+----------------+---------+-----------+-----+------------------+--------+------------------+------------+-----+---+
|        124|      12|    979933|        98718|              2U| 26.17933|  -80.30672|   FL|15.616666666666667|     9.0|55783.293333333335|         0.0|    6| 13|
|        124|      53|    P05AAA|         1815|              3R|46.438974|-117.951628|   WA|15.616666666666667|     9.0|55783.293333333335|         0.0|    6| 13|
|        124|      53|    P6AAAA|       134172|              2U| 47.43086|   -122.221|   WA|15.616666666666667|     9.0|55783.293333333335|         0.0|    6| 13|
|        124|      30|

                                                                                

1000

In [7]:
from pyspark.ml.feature import VectorAssembler

# Select the variables to analyze
variables = ["sum_veh_count", "AVG_temp", "AVG_rain", "AVG_daylight", "AVG_snowfall"]

# Assemble the selected columns into a single feature vector
assembler = VectorAssembler(inputCols=variables, outputCol="features")
traffic_vector = assembler.transform(df.select(variables))

In [8]:
# Compute the Pearson correlation matrix
from pyspark.ml.stat import Correlation

corr_matrix = Correlation.corr(traffic_vector, "features", method="pearson").head()
corr_matrix

                                                                                

Row(pearson(features)=DenseMatrix(5, 5, [1.0, -0.0046, -0.0076, -0.0324, 0.0088, -0.0046, 1.0, 0.0371, ..., 0.1328, 1.0, -0.2742, 0.0088, -0.335, 0.0197, -0.2742, 1.0], False))

In [9]:
# Extract the matrix values as a NumPy array
corr_values = corr_matrix[0].toArray()
corr_values

array([[ 1.        , -0.00459987, -0.00761476, -0.0323906 ,  0.00883185],
       [-0.00459987,  1.        ,  0.03708315,  0.77337719, -0.33502021],
       [-0.00761476,  0.03708315,  1.        ,  0.13276161,  0.0197094 ],
       [-0.0323906 ,  0.77337719,  0.13276161,  1.        , -0.27415055],
       [ 0.00883185, -0.33502021,  0.0197094 , -0.27415055,  1.        ]])

The resulting values show that the daily vehicle count is only weakly correlated with each of the measured weather factors (temperature, rain, snowfall, and daylight): every coefficient in the `sum_veh_count` row has an absolute value below 0.04.
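The matrix is easier to read when each coefficient is paired with its variable name. The following is a minimal sketch of that step. The helper `correlations_with` is a hypothetical name introduced here for illustration, and the values are copied by hand from the `corr_values` output above rather than read from Spark. In the notebook itself, the `corr_values` array could be passed in directly.

```python
# Variable order matches the `variables` list given to VectorAssembler.
variables = ["sum_veh_count", "AVG_temp", "AVG_rain", "AVG_daylight", "AVG_snowfall"]

# Values copied from the corr_values output above (illustration only).
corr_values = [
    [1.0, -0.00459987, -0.00761476, -0.0323906, 0.00883185],
    [-0.00459987, 1.0, 0.03708315, 0.77337719, -0.33502021],
    [-0.00761476, 0.03708315, 1.0, 0.13276161, 0.0197094],
    [-0.0323906, 0.77337719, 0.13276161, 1.0, -0.27415055],
    [0.00883185, -0.33502021, 0.0197094, -0.27415055, 1.0],
]


def correlations_with(target, names, matrix):
    """Return (variable, r) pairs for `target`, largest |r| first."""
    i = names.index(target)
    pairs = [(name, float(matrix[i][j])) for j, name in enumerate(names) if j != i]
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)


traffic_corr = correlations_with("sum_veh_count", variables, corr_values)
for name, r in traffic_corr:
    print(f"{name:>13}: {r:+.4f}")
```

Sorting by absolute value puts the strongest relationship first regardless of sign, which makes it clear that even the largest coefficient for traffic is small.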

Vehicle traffic shows a negative relationship with temperature (-0.0046), rain (-0.0076), and daylight (-0.0324), which suggests that traffic decreases as these factors increase, although coefficients this close to zero indicate at most a very slight effect.
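One way to gauge how much weight these signs can bear is to compare each coefficient with what chance alone would produce. The sketch below uses the standard t statistic for a Pearson correlation, t = r·sqrt((n − 2) / (1 − r²)). It rests on several assumptions that the notebook does not state: that n equals the 1000 rows reported by `df.count()`, that the observations are independent, and that a large-sample two-sided 5% threshold of about 1.96 is appropriate. The function name `pearson_t_statistic` is introduced here for illustration only.

```python
import math


def pearson_t_statistic(r, n):
    """t statistic for the null hypothesis of zero correlation."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


n = 1000           # assumption: the row count printed by df.count()
CRITICAL_T = 1.96  # assumption: approximate two-sided 5% threshold for large n

# Traffic-row coefficients copied from the corr_values output above.
traffic_corr = {
    "AVG_temp": -0.00459987,
    "AVG_rain": -0.00761476,
    "AVG_daylight": -0.0323906,
    "AVG_snowfall": 0.00883185,
}

t_stats = {name: pearson_t_statistic(r, n) for name, r in traffic_corr.items()}
for name, t in t_stats.items():
    verdict = "exceeds" if abs(t) > CRITICAL_T else "does not exceed"
    print(f"{name:>13}: t = {t:+.3f} ({verdict} the threshold)")
```

Under these assumptions, none of the four coefficients exceeds the threshold, so the negative signs are best read as tentative rather than as evidence of a real decrease.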