# III. Advanced data management

## Correlations: understanding the relationships between variables

Let $X$ and $Y$ be random variables, and let $\rho_{X,Y}$ be the (Pearson) correlation coefficient between $X$ and $Y$.

The correlation is considered **weak** when $\rho_{X,Y} \in [-0.3,0.3]$.

The correlation is considered **strong** when $\rho_{X,Y} \in [-1,-0.6] \cup [0.6,1]$.
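
The thresholds above can be written as a small helper. This is only a minimal sketch: the function name `classify_correlation` and the label "moderate" for values between 0.3 and 0.6 in absolute value are assumptions, since the text only names the weak and strong ranges.

```python
def classify_correlation(rho: float) -> str:
    """Label a correlation coefficient using the thresholds given above."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    strength = abs(rho)
    if strength <= 0.3:
        return "weak"
    if strength >= 0.6:
        return "strong"
    # Assumption: the text does not name this intermediate range.
    return "moderate"


print(classify_correlation(0.93))
print(classify_correlation(-0.18))
```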

This example analyzes two variables: the ***departure delay (`DepDelay`)*** and the ***arrival delay (`ArrDelay`)***. Evidently, a delayed departure affects the arrival. Look at the off-diagonal entry of the correlation matrix, `matriz_correlaciones[0, 1]`.

In [35]:
import pandas as pd
import numpy as np

n = 10**6  # number of rows to read from the file
df = pd.read_csv(r"C:\COVID_pruebas\base_datos_2008.csv", nrows=n)

In [36]:
matriz_correlaciones = np.corrcoef(df["ArrDelay"], df["DepDelay"])
matriz_correlaciones
# The result is all NaN (not an exception) because the original df has missing values.
# Those missing values must be cleaned first.

array([[nan, nan],
       [nan, nan]])
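
The NaN result can be reproduced on a toy example. The sketch below uses made-up values (not the flight data) to illustrate that `np.corrcoef` propagates missing values, whereas `Series.corr` in pandas uses only the complete pairs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "ArrDelay": [10.0, np.nan, 35.0, 5.0, 60.0],
    "DepDelay": [12.0, 3.0, 30.0, np.nan, 55.0],
})

# np.corrcoef does not skip NaN, so every entry of the matrix is NaN
with_nan = np.corrcoef(toy["ArrDelay"], toy["DepDelay"])

# Series.corr uses only the rows where both values are present
pairwise = toy["ArrDelay"].corr(toy["DepDelay"])

# Dropping the incomplete rows first gives np.corrcoef the same input
clean = toy.dropna(subset=["ArrDelay", "DepDelay"])
after_drop = np.corrcoef(clean["ArrDelay"], clean["DepDelay"])[0, 1]

print(with_nan)
print(pairwise, after_drop)
```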

In [37]:
df.dropna(inplace=True, subset=["ArrDelay", "DepDelay"])
# inplace=True modifies df itself instead of returning a new DataFrame,
# so there is no need to store the result in another variable.
# Now the correlation matrix can be computed again.

In [38]:
matriz_correlaciones = np.corrcoef(df["ArrDelay"], df["DepDelay"])
matriz_correlaciones
# With the missing values removed, the correlation is well defined.
# ArrDelay and DepDelay are strongly correlated (about 0.93).

array([[1.        , 0.93490112],
       [0.93490112, 1.        ]])

**Note:** More variables can be added to the analysis. Take, for instance, the ***actual departure time (`DepTime`)***.

In [39]:
matriz_correlaciones_2 = np.corrcoef([df["ArrDelay"], df["DepDelay"], df["DepTime"]])
matriz_correlaciones_2
# The correlations involving DepTime are much weaker than the previous one.

array([[1.        , 0.93490112, 0.17978801],
       [0.93490112, 1.        , 0.21229661],
       [0.17978801, 0.21229661, 1.        ]])

In [40]:
df.drop(inplace=True, columns=["Year", "Cancelled", "Diverted"])

# Drop constant or dummy (indicator) variables.
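
A column that takes a single value has zero variance, so its correlation with anything else is undefined. As a minimal sketch, constant columns can be detected instead of listed by hand; the toy table below is hypothetical and only borrows a few column names from the flight data.

```python
import pandas as pd

# Hypothetical miniature of the flight table
toy = pd.DataFrame({
    "Year": [2008, 2008, 2008, 2008],
    "Cancelled": [0, 0, 0, 0],
    "DepDelay": [5.0, -2.0, 40.0, 13.0],
    "ArrDelay": [7.0, -5.0, 38.0, 20.0],
})

# A column with a single distinct value is constant
constant_columns = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
reduced = toy.drop(columns=constant_columns)

print(constant_columns)
print(reduced.columns.tolist())
```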

In [41]:
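df.corr(numeric_only=True)
# Correlation matrix. numeric_only=True restricts it to the numeric columns,
# which pandas >= 2.0 requires when the DataFrame also has text columns.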

|  | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | ... | ArrDelay | DepDelay | Distance | TaxiIn | TaxiOut | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Month | 1.0 | -0.01505 | 0.03093 | 0.005771 | 0.001497 | -0.007122 | -0.0098 | 0.07981 | -0.074488 | -0.076087 | ... | 0.031751 | 0.032285 | -0.070562 | -0.039301 | -0.028302 | -0.007918 | 0.02372 | -0.018685 | -0.000285 | 0.035021 |
| DayofMonth | -0.01505 | 1.0 | -0.051893 | -0.004149 | -0.002338 | 0.00197 | -0.000563 | -0.005089 | 0.009748 | 0.005772 | ... | -0.016971 | -0.025893 | 0.004336 | -0.013975 | 0.02001 | -0.026833 | 0.00286 | 0.046824 | -0.018356 | -0.010162 |
| DayOfWeek | 0.03093 | -0.051893 | 1.0 | 0.006505 | 0.010266 | 0.005845 | 0.007485 | -0.003372 | 0.005525 | 0.012497 | ... | -0.020013 | -0.009137 | 0.013937 | 0.007376 | -0.023813 | 0.011215 | -0.004178 | -0.051235 | 0.008752 | -0.002767 |
| DepTime | 0.005771 | -0.004149 | 0.006505 | 1.0 | 0.967093 | 0.718832 | 0.801405 | -0.001497 | -0.033971 | -0.025449 | ... | 0.179788 | 0.212297 | -0.021899 | -0.028305 | -0.007641 | 0.013935 | 0.004591 | -0.031814 | -0.008924 | 0.219831 |
| CRSDepTime | 0.001497 | -0.002338 | 0.010266 | 0.967093 | 1.0 | 0.701446 | 0.80063 | -0.013377 | -0.031518 | -0.020645 | ... | 0.102853 | 0.132519 | -0.01422 | -0.03237 | -0.018797 | -0.046834 | -0.022649 | -0.074218 | -0.009042 | 0.187314 |
| ArrTime | -0.007122 | 0.00197 | 0.005845 | 0.718832 | 0.701446 | 1.0 | 0.85724 | -0.011925 | 0.027351 | 0.031429 | ... | 0.073489 | 0.087577 | 0.026093 | 0.009363 | 0.010162 | -0.044245 | -0.018817 | 0.01626 | -0.004983 | 0.014447 |
| CRSArrTime | -0.0098 | -0.000563 | 0.007485 | 0.801405 | 0.80063 | 0.85724 | 1.0 | -0.025392 | 0.036679 | 0.047353 | ... | 0.098478 | 0.126896 | 0.04424 | -0.002497 | 0.000617 | -0.047146 | -0.018615 | -0.048304 | -0.008037 | 0.161782 |
| FlightNum | 0.07981 | -0.005089 | -0.003372 | -0.001497 | -0.013377 | -0.011925 | -0.025392 | 1.0 | -0.289356 | -0.30225 | ... | 0.06846 | 0.053206 | -0.333416 | 0.02367 | 0.073521 | 0.094233 | 0.06619 | 0.01067 | -0.006524 | -0.021083 |
| ActualElapsedTime | -0.074488 | 0.009748 | 0.005525 | -0.033971 | -0.031518 | 0.027351 | 0.036679 | -0.289356 | 1.0 | 0.977603 | ... | 0.062966 | 0.020822 | 0.956349 | 0.158391 | 0.25636 | -0.044983 | -0.01037 | 0.188094 | -0.001425 | -0.091541 |
| CRSElapsedTime | -0.076087 | 0.005772 | 0.012497 | -0.025449 | -0.020645 | 0.031429 | 0.047353 | -0.30225 | 0.977603 | 1.0 | ... | -0.022951 | 0.009131 | 0.978077 | 0.099275 | 0.123392 | -0.010326 | -0.024111 | 0.046358 | 0.001212 | -0.04693 |


In [42]:
df.drop(inplace=True, columns=["Month"])
matriz = round(df.corr(numeric_only=True), 2)  # correlations rounded to two decimals
matriz.style.background_gradient()  # color each cell according to its value

|  | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Distance | TaxiIn | TaxiOut | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DayofMonth | 1.0 | -0.05 | -0.0 | -0.0 | 0.0 | -0.0 | -0.01 | 0.01 | 0.01 | 0.01 | -0.02 | -0.03 | 0.0 | -0.01 | 0.02 | -0.03 | 0.0 | 0.05 | -0.02 | -0.01 |
| DayOfWeek | -0.05 | 1.0 | 0.01 | 0.01 | 0.01 | 0.01 | -0.0 | 0.01 | 0.01 | 0.01 | -0.02 | -0.01 | 0.01 | 0.01 | -0.02 | 0.01 | -0.0 | -0.05 | 0.01 | -0.0 |
| DepTime | -0.0 | 0.01 | 1.0 | 0.97 | 0.72 | 0.8 | -0.0 | -0.03 | -0.03 | -0.03 | 0.18 | 0.21 | -0.02 | -0.03 | -0.01 | 0.01 | 0.0 | -0.03 | -0.01 | 0.22 |
| CRSDepTime | -0.0 | 0.01 | 0.97 | 1.0 | 0.7 | 0.8 | -0.01 | -0.03 | -0.02 | -0.03 | 0.1 | 0.13 | -0.01 | -0.03 | -0.02 | -0.05 | -0.02 | -0.07 | -0.01 | 0.19 |
| ArrTime | 0.0 | 0.01 | 0.72 | 0.7 | 1.0 | 0.86 | -0.01 | 0.03 | 0.03 | 0.03 | 0.07 | 0.09 | 0.03 | 0.01 | 0.01 | -0.04 | -0.02 | 0.02 | -0.0 | 0.01 |
| CRSArrTime | -0.0 | 0.01 | 0.8 | 0.8 | 0.86 | 1.0 | -0.03 | 0.04 | 0.05 | 0.04 | 0.1 | 0.13 | 0.04 | -0.0 | 0.0 | -0.05 | -0.02 | -0.05 | -0.01 | 0.16 |
| FlightNum | -0.01 | -0.0 | -0.0 | -0.01 | -0.01 | -0.03 | 1.0 | -0.29 | -0.3 | -0.31 | 0.07 | 0.05 | -0.33 | 0.02 | 0.07 | 0.09 | 0.07 | 0.01 | -0.01 | -0.02 |
| ActualElapsedTime | 0.01 | 0.01 | -0.03 | -0.03 | 0.03 | 0.04 | -0.29 | 1.0 | 0.98 | 0.98 | 0.06 | 0.02 | 0.96 | 0.16 | 0.26 | -0.04 | -0.01 | 0.19 | -0.0 | -0.09 |
| CRSElapsedTime | 0.01 | 0.01 | -0.03 | -0.02 | 0.03 | 0.05 | -0.3 | 0.98 | 1.0 | 0.99 | -0.02 | 0.01 | 0.98 | 0.1 | 0.12 | -0.01 | -0.02 | 0.05 | 0.0 | -0.05 |
| AirTime | 0.01 | 0.01 | -0.03 | -0.03 | 0.03 | 0.04 | -0.31 | 0.98 | 0.99 | 1.0 | 0.0 | 0.0 | 0.98 | 0.08 | 0.09 | -0.03 | -0.03 | 0.08 | 0.0 | -0.06 |
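
Reading a matrix of this size by eye is error-prone. The following sketch lists the strongly correlated pairs ($|\rho| \geq 0.6$, as defined at the start of the section). It uses only four of the columns, with values copied from the rounded matrix above; on the real data the same steps would presumably be applied to `matriz` directly.

```python
import numpy as np
import pandas as pd

names = ["DepTime", "CRSDepTime", "ArrTime", "FlightNum"]
# Values copied from the rounded matrix above (only four of the columns)
matriz = pd.DataFrame(
    [[1.00, 0.97, 0.72, 0.00],
     [0.97, 1.00, 0.70, -0.01],
     [0.72, 0.70, 1.00, -0.01],
     [0.00, -0.01, -0.01, 1.00]],
    index=names, columns=names,
)

# Keep the upper triangle only, so each pair appears once and the diagonal is skipped
upper = np.triu(np.ones(matriz.shape, dtype=bool), k=1)
pairs = matriz.where(upper).stack().dropna()

# Strong correlation as defined earlier: |rho| >= 0.6
strong_pairs = pairs[pairs.abs() >= 0.6].sort_values(ascending=False)
print(strong_pairs)
```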


## Chi-squared test

The chi-squared test is used to check hypotheses about whether the data are as expected. The key idea is to compare the values observed in the data with the values we would expect if the null hypothesis were true. It applies to **qualitative (categorical) variables**.

$$\chi^{2} = \sum_{i=1}^{k} \dfrac{(\text{observed}_{i} - \text{expected}_{i})^{2}}{\text{expected}_{i}}$$
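
The formula translates almost literally into NumPy. The sketch below is illustrative only: the helper name `chi_square_statistic` and the die-roll counts are hypothetical and are not part of the flight data.

```python
import numpy as np


def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all k categories."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))


# Hypothetical example: 60 rolls of a die assumed fair under the null hypothesis
observed = [8, 12, 9, 11, 6, 14]
expected = [10] * 6

statistic = chi_square_statistic(observed, expected)
print(statistic)
```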


A **weakness** of this test is that it does not quantify each of the relationships between categories; it does, however, allow **global statements** such as **smoking is significantly related to lung cancer**.

In [50]:
import pandas as pd
import numpy as np

# No nrows limit this time: the whole file is loaded.
df = pd.read_csv(r"C:\COVID_pruebas\base_datos_2008.csv")

In [56]:
print("The dimensions are: ", df.shape)
df.head(2)

The dimensions are:  (7009728, 29)


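|  | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008 | 1 | 3 | 4 | 2003.0 | 1955 | 2211.0 | 2225 | WN | 335 | ... | 4.0 | 8.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 2008 | 1 | 3 | 4 | 754.0 | 735 | 1002.0 | 1000 | WN | 3231 | ... | 5.0 | 10.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |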


**Observation:** Unlike in other scripts, the full dataset was imported here; it has 7,009,728 records.

In [86]:
np.random.seed(0)
# Fix the random seed so that the sample is reproducible.

df_1 = df[df["Origin"].isin(["HOU", "ATL", "IND"])]
# Keep the flights whose origin is HOU, ATL or IND.

df_1 = df_1.sample(frac=1)
# Shuffle the rows at random.

df_muestra = df_1[0:10000]
# Take a sample of 10,000 records.
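
Shuffling everything and then slicing amounts to drawing rows without replacement. As an alternative sketch on hypothetical data, `DataFrame.sample` can draw the rows directly, and its `random_state` parameter makes the draw reproducible without touching NumPy's global seed.

```python
import pandas as pd

# Hypothetical stand-in for the flight table
toy = pd.DataFrame({
    "Origin": ["HOU", "ATL", "IND", "JFK", "ATL", "HOU", "LAX", "ATL"],
    "ArrDelay": [5.0, 42.0, -3.0, 10.0, 31.0, 0.0, 7.0, 15.0],
})

# Keep only the three origins of interest
selected = toy[toy["Origin"].isin(["HOU", "ATL", "IND"])]

# Draw 4 rows without replacement; random_state fixes the draw
sample_a = selected.sample(n=4, random_state=0)
sample_b = selected.sample(n=4, random_state=0)

print(sample_a)
```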


In [87]:
df_muestra = df_muestra.assign(BigDelay=df_muestra["ArrDelay"] > 30)
# BigDelay is True when ArrDelay > 30 and False otherwise.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning that
# pandas emits when a column is set directly on a slice of another DataFrame.

observados = pd.crosstab(index=df_muestra["BigDelay"], columns=df_muestra["Origin"], margins=True)
# This builds a contingency table for two columns: "BigDelay" and "Origin".
# margins=True adds the row and column totals (labelled "All") to the table.


In [89]:
observados

| BigDelay \ Origin | ATL | HOU | IND | All |
|---|---|---|---|---|
| False | 6927 | 883 | 765 | 8575 |
| True | 1197 | 129 | 99 | 1425 |
| All | 8124 | 1012 | 864 | 10000 |


This **contingency table** is read as follows:

The number of flights that departed from ***ATL*** and arrived ***at most*** 30 minutes late was 6,927.

The number of flights that departed from ***ATL*** and arrived ***more than*** 30 minutes late was 1,197.

***The total*** number of flights in the sample that departed from ***ATL***, whatever their delay, was 8,124.
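
Raw counts are hard to compare because ATL has far more flights than the other two airports. A minimal sketch, using the counts printed in the table above (margins left out), converts them into the share of long delays per origin:

```python
import pandas as pd

# Counts copied from the contingency table above (margins left out)
observados = pd.DataFrame(
    {"ATL": [6927, 1197], "HOU": [883, 129], "IND": [765, 99]},
    index=pd.Index([False, True], name="BigDelay"),
)

# Second row holds the flights with ArrDelay > 30; divide by each column total
big_delay_pct = (observados.iloc[1] / observados.sum(axis=0) * 100).round(2)
print(big_delay_pct)
```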

In [104]:
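from scipy.stats import chi2_contingency
# This function performs the chi-squared test on a contingency table.

test = chi2_contingency(observados)
# Run the test on the table of observed counts.
# Note: observados includes the "All" margins, so they are treated as an extra row and column.
test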

(8.939538453043031,
 0.17700704816414425,
 6,
 array([[ 6966.33,   867.79,   740.88,  8575.  ],
        [ 1157.67,   144.21,   123.12,  1425.  ],
        [ 8124.  ,  1012.  ,   864.  , 10000.  ]]))

The **values** returned by the **chi-squared test** are:

* Test statistic
* p-value
* Degrees of freedom
* Table of expected frequencies, computed from the marginal sums of the table (the theoretical expected values)
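
The four values can be unpacked by name. The sketch below does so on the 2 x 3 table of observed counts copied from the output above, leaving out the `All` row and column. This is an assumption about how the test is usually applied, not the call made in the notebook; without the margins the degrees of freedom are $(2-1)(3-1)=2$.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table above, without the "All" margins
# rows: BigDelay False / True; columns: ATL, HOU, IND
observed = np.array([
    [6927, 883, 765],
    [1197, 129, 99],
])

statistic, p_value, dof, expected = chi2_contingency(observed)

print("statistic:", round(statistic, 4))
print("p-value:", round(p_value, 4))
print("degrees of freedom:", dof)
print(np.round(expected, 2))
```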

In [105]:
esperados = pd.DataFrame(test[3])
# Matrix of expected frequencies (the fourth value returned by the test).

esperados

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 6966.33 | 867.79 | 740.88 | 8575.0 |
| 1 | 1157.67 | 144.21 | 123.12 | 1425.0 |
| 2 | 8124.0 | 1012.0 | 864.0 | 10000.0 |


In [114]:
esperados_rel = round(esperados.apply(lambda r: (r / len(df_muestra)) * 100, axis=1), 2)

observados_rel = round(observados.apply(lambda r: (r / len(df_muestra)) * 100, axis=1), 2)

# A lambda is an anonymous function; here it converts each row from counts
# to percentages of the sample size.
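
Since the same scalar operation is applied to every cell, the row-wise `apply` is not strictly needed. The sketch below rebuilds the observed table from the counts printed earlier (with string row labels, an illustrative simplification) and checks that plain vectorized arithmetic gives the same percentages.

```python
import pandas as pd

# Counts copied from the contingency table printed earlier
observados = pd.DataFrame(
    {"ATL": [6927, 1197, 8124], "HOU": [883, 129, 1012],
     "IND": [765, 99, 864], "All": [8575, 1425, 10000]},
    index=["False", "True", "All"],
)
n = 10000  # sample size, len(df_muestra) in the notebook

# Row-wise apply with a lambda, as in the cell above
with_apply = round(observados.apply(lambda r: (r / n) * 100, axis=1), 2)

# Same result with plain vectorized arithmetic
vectorized = (observados / n * 100).round(2)

print(vectorized)
```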

In [115]:
observados_rel

| BigDelay \ Origin | ATL | HOU | IND | All |
|---|---|---|---|---|
| False | 69.27 | 8.83 | 7.65 | 85.75 |
| True | 11.97 | 1.29 | 0.99 | 14.25 |
| All | 81.24 | 10.12 | 8.64 | 100.0 |


In [116]:
esperados_rel

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 69.66 | 8.68 | 7.41 | 85.75 |
| 1 | 11.58 | 1.44 | 1.23 | 14.25 |
| 2 | 81.24 | 10.12 | 8.64 | 100.0 |


Although the column names are not carried over, **the structure is preserved**, which makes it easier to see the differences between the two tables.

**Comments:**
* The margins are preserved. **This must always happen**.
* The tables are compared entry by entry to judge whether the differences between them are significant.
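
The entry-by-entry comparison can be made explicit. The sketch below recomputes the expected counts from the margins (row total times column total, divided by the grand total) and then shows how much each cell contributes to the statistic. The counts are copied from the observed table; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Observed counts from the contingency table, without margins
observed = pd.DataFrame(
    {"ATL": [6927, 1197], "HOU": [883, 129], "IND": [765, 99]},
    index=["False", "True"],
)

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
expected = pd.DataFrame(
    np.outer(row_totals, col_totals) / observed.to_numpy().sum(),
    index=observed.index, columns=observed.columns,
)

# Contribution of each cell to the chi-squared statistic
contributions = (observed - expected) ** 2 / expected
statistic = contributions.to_numpy().sum()

print(expected.round(2))
print(contributions.round(3))
print(statistic)
```

Cells with a large contribution are the ones where observed and expected counts disagree the most.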

In [117]:
test[1]

0.17700704816414425

### Summary of the hypothesis test
* **p-value < 0.05:** there are significant differences: the data support a relationship between the variables
* **p-value > 0.05:** there are no significant differences: there is no evidence of a relationship between the variables

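**Conclusions:** Since the p-value ($0.177$) is greater than $0.05$, this run does not allow us to state that there is a relationship between the variables. Note, however, that `observados` was built with `margins=True`, so the test was computed on a table that includes the `All` row and column (hence 6 degrees of freedom instead of $(2-1)(3-1)=2$); the test should be repeated on the table without margins before drawing a final conclusion.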
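
The p-value depends on the degrees of freedom as well as on the statistic. The margin cells of the table coincide with their expected values, so they add nothing to the statistic, but they do raise the degrees of freedom from 2 to 6. As a hedged check, the sketch below evaluates the reported statistic with `scipy.stats.chi2.sf` under both settings; with 2 degrees of freedom the p-value falls below 0.05.

```python
from scipy.stats import chi2

statistic = 8.939538453043031  # test statistic reported above

# p-value = P(chi-squared with `dof` degrees of freedom >= statistic)
p_with_margins = chi2.sf(statistic, 6)     # table including the "All" row and column
p_without_margins = chi2.sf(statistic, 2)  # 2 x 3 table: (2 - 1) * (3 - 1) = 2

print(p_with_margins)
print(p_without_margins)
```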