Introduction:

The aim of this kernel is to perform chi-square tests to check whether two categorical variables are independent.

Data:

The data comes from the aircraft wildlife strikes data set, which contains records of wildlife strikes against aircraft from 1990 to 2015.

We will test independence for these two scenarios:

1. Visibility and flight impact.
2. Species and flight impact, i.e. whether gulls and mourning doves cause different impacts.

In [2]:
import numpy as np
import pandas as pd
import scipy.stats as stats

In [4]:
aircraft = pd.read_csv("E:\\Bagas's File\\Data Science Bagas\\Datasets\\database.csv")

In [8]:
aircraft.head()

Unnamed: 0,Record ID,Incident Year,Incident Month,Incident Day,Operator ID,Operator,Aircraft,Aircraft Type,Aircraft Make,Aircraft Model,...,Fuselage Strike,Fuselage Damage,Landing Gear Strike,Landing Gear Damage,Tail Strike,Tail Damage,Lights Strike,Lights Damage,Other Strike,Other Damage
0,127128,1990,1,1,DAL,DELTA AIR LINES,B-757-200,A,148.0,26.0,...,0,0,0,0,1,1,0,0,0,0
1,129779,1990,1,1,HAL,HAWAIIAN AIR,DC-9,A,583.0,90.0,...,0,0,0,0,0,0,0,0,1,0
2,129780,1990,1,2,UNK,UNKNOWN,UNKNOWN,,,,...,0,0,0,0,0,0,0,0,0,0
3,2258,1990,1,3,MIL,MILITARY,A-10A,A,345.0,,...,0,0,0,0,0,0,0,0,0,0
4,2257,1990,1,3,MIL,MILITARY,F-16,A,561.0,,...,0,0,0,0,0,0,0,0,0,0


In [5]:
# One-way chisquare test to see if states are evenly distributed
stats.chisquare(aircraft['State'].value_counts())

Power_divergenceResult(statistic=240989.20211926798, pvalue=0.0)

In [18]:
# One-way chisquare test to see if species names are evenly distributed
stats.chisquare(aircraft['Species Name'].value_counts())

Power_divergenceResult(statistic=11608552.48272652, pvalue=0.0)
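
As a rough sketch of what the one-way tests above compute: when no expected frequencies are passed, stats.chisquare compares the observed counts with a uniform distribution, so every category is expected to hold the mean count. The counts below are made up for illustration and are not taken from the data set; the names observed, manual_stat and manual_p are hypothetical.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical category counts (illustrative only, not from database.csv)
observed = np.array([50, 30, 20])

# Default null hypothesis: every category has the same expected count
expected = np.full(observed.shape, observed.mean())

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
manual_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for a one-way test: number of categories - 1
manual_p = stats.chi2.sf(manual_stat, df=len(observed) - 1)

# Compare the hand computation with scipy's result
result = stats.chisquare(observed)
print(manual_stat, result.statistic)
print(manual_p, result.pvalue)
```

A very large statistic with a p-value of 0.0, as in the two outputs above, therefore only says that the categories are far from equally frequent.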

In [6]:
# Combine the engine shutdown and engine shut down entries in flight impact column
aircraft.loc[aircraft["Flight Impact"] == "ENGINE SHUT DOWN", "Flight Impact"] = "ENGINE SHUTDOWN"

In [9]:
# Two-way Chisquare test for the relationship between visibility and flight impact
contingencyTable = pd.crosstab(aircraft['Visibility'],aircraft['Flight Impact'])
print(contingencyTable)
stat, p, dof, expected = stats.chi2_contingency(contingencyTable)
stat, p, dof, expected

Flight Impact  ABORTED TAKEOFF  ENGINE SHUTDOWN   NONE  OTHER  \
Visibility                                                      
DAWN                       124               29   2872     42   
DAY                       1730              254  49291   1238   
DUSK                       127               39   3787    110   
NIGHT                      239               98  25796    549   
UNKNOWN                      0                0      1      0   

Flight Impact  PRECAUTIONARY LANDING  
Visibility                            
DAWN                             190  
DAY                             3852  
DUSK                             309  
NIGHT                           1460  
UNKNOWN                            0  


(601.1722402084056,
 1.2901998940571034e-117,
 16,
 array([[7.84759651e+01, 1.48468042e+01, 2.88971834e+03, 6.85427461e+01,
         2.05416141e+02],
        [1.35808958e+03, 2.56935867e+02, 5.00088960e+04, 1.18618725e+03,
         3.55489125e+03],
        [1.05341394e+02, 1.99294529e+01, 3.87898330e+03, 9.20076408e+01,
         2.75738216e+02],
        [6.78068963e+02, 1.28283317e+02, 2.49685151e+04, 5.92241315e+02,
         1.77489132e+03],
        [2.40945548e-02, 4.55842930e-03, 8.87233142e-01, 2.10447486e-02,
         6.30691253e-02]]))

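Let's set up the test of the null hypothesis (H0: the two variables are independent) against the alternative hypothesis (the variables are dependent). The code below compares the test statistic with the chi-square critical value at the 5% significance level.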

In [11]:
# interpret test-statistic
prob = 0.95
critical = stats.chi2.ppf(prob, dof)
if abs(stat) >= critical:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')

Dependent (reject H0)
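
Comparing the statistic with the critical value is equivalent to comparing the p-value with the significance level 1 - prob. The sketch below illustrates this on a made-up 2x3 table; table, alpha and the two reject_* names are assumptions introduced for this example and do not come from the data set.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical 2x3 contingency table (illustrative only)
table = np.array([[30, 20, 10],
                  [20, 25, 35]])

stat, p, dof, expected = stats.chi2_contingency(table)

alpha = 0.05                                # significance level, i.e. 1 - prob
critical = stats.chi2.ppf(1 - alpha, dof)   # critical value at that level

# Both rules lead to the same decision
reject_by_statistic = stat >= critical
reject_by_pvalue = p <= alpha
print(reject_by_statistic, reject_by_pvalue)
```

Either rule can be used in the interpretation cells; the p-value version avoids computing the critical value.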


Since we used the chi2_contingency method, let's go through its output:

1. 601.172 = test statistic
2. 1.29e-117 = p-value
3. 16 = degrees of freedom
4. array = expected frequencies under independence
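
The four values returned by chi2_contingency can be recomputed by hand from the row and column totals of the contingency table. The sketch below does this for a small made-up table (not from the data set); all manual_* names are hypothetical and only serve the comparison.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical 2x3 contingency table (illustrative only)
table = np.array([[30, 20, 10],
                  [20, 25, 35]])

row_totals = table.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = table.sum(axis=0, keepdims=True)   # shape (1, 3)
n = table.sum()

# Expected count under independence: row total * column total / grand total
manual_expected = row_totals * col_totals / n

# Pearson statistic and degrees of freedom (rows - 1) * (columns - 1)
manual_stat = ((table - manual_expected) ** 2 / manual_expected).sum()
manual_dof = (table.shape[0] - 1) * (table.shape[1] - 1)

# p-value: upper tail of the chi-square distribution
manual_p = stats.chi2.sf(manual_stat, manual_dof)

# scipy applies Yates' correction only when dof == 1, so the values match here
stat, p, dof, expected = stats.chi2_contingency(table)
print(np.allclose(manual_expected, expected), manual_dof == dof)
```

For the 5x5 visibility table above, the same formula gives (5 - 1) * (5 - 1) = 16 degrees of freedom, matching the reported value.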

In [12]:
# Two-way chisquare test to see whether mourning doves and gulls cause different impact
subset = aircraft.loc[aircraft['Species Name'].isin(['MOURNING DOVE','GULL'])]

bird_impact = pd.crosstab(subset["Species Name"], subset["Flight Impact"])
stat, p, dof, expected = stats.chi2_contingency(bird_impact)
stat, p, dof, expected

(81.84385799476242,
 7.083914416658326e-17,
 4,
 array([[ 261.6898332 ,   38.12271644, 4117.89952343,  111.13741064,
          352.15051628],
        [ 143.3101668 ,   20.87728356, 2255.10047657,   60.86258936,
          192.84948372]]))

In [13]:
# interpret test-statistic
prob = 0.95
critical = stats.chi2.ppf(prob, dof)
if abs(stat) >= critical:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')

Dependent (reject H0)


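1. 81.843 = test statistic
2. 7.08e-17 = p-value
3. 4 = degrees of freedom
4. array = expected frequencies under independence

As the results above show, the null hypothesis of independence is rejected in both tests: flight impact depends on visibility, and the distribution of flight impact differs between gulls and mourning doves.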

In [14]:
# Two-way Chisquare test for the relationship between Aircraft Make and flight impact
contingencyTable = pd.crosstab(aircraft['Aircraft Make'],aircraft['Flight Impact'])
print(contingencyTable)
stat, p, dof, expected = stats.chi2_contingency(contingencyTable)
stat, p, dof, expected

Flight Impact  ABORTED TAKEOFF  ENGINE SHUTDOWN   NONE  OTHER  \
Aircraft Make                                                   
04A                         84               28  10781    214   
100                          0                0      1      0   
107                          0                0      1      0   
123                        192               17   2648     94   
128                          1                2    233     31   
...                        ...              ...    ...    ...   
972                          0                0      1      0   
975                          0                0      4      0   
998                          3                5     88     22   
HEL                          0                1      5      1   
Q                            0                0      1      0   

Flight Impact  PRECAUTIONARY LANDING  
Aircraft Make                         
04A                              221  
100                                1 

(13308.518457786493,
 0.0,
 352,
 array([[2.62033527e+02, 5.18740698e+01, 1.00621801e+04, 2.72107286e+02,
         6.79805054e+02],
        [4.62629814e-02, 9.15855753e-03, 1.77651484e+00, 4.80415406e-02,
         1.20022079e-01],
        [2.31314907e-02, 4.57927876e-03, 8.88257421e-01, 2.40207703e-02,
         6.00110393e-02],
        [7.57787636e+01, 1.50017172e+01, 2.90993131e+03, 7.86920435e+01,
         1.96596165e+02],
        [1.09411951e+01, 2.16599886e+00, 4.20145760e+02, 1.13618244e+01,
         2.83852216e+01],
        [8.32733666e-01, 1.64854035e-01, 3.19772672e+01, 8.64747731e-01,
         2.16039742e+00],
        [2.31314907e-02, 4.57927876e-03, 8.88257421e-01, 2.40207703e-02,
         6.00110393e-02],
        [2.31314907e-02, 4.57927876e-03, 8.88257421e-01, 2.40207703e-02,
         6.00110393e-02],
        [7.83047224e+02, 1.55017745e+02, 3.00692902e+04, 8.13151116e+02,
         2.03149370e+03],
        [2.31314907e-02, 4.57927876e-03, 8.88257421e-01, 2.40207703e-02,
   

In [15]:
# interpret test-statistic
prob = 0.95
critical = stats.chi2.ppf(prob, dof)
if abs(stat) >= critical:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')

Dependent (reject H0)


In [16]:
# Two-way Chisquare test for the relationship between Operator and flight impact
contingencyTable = pd.crosstab(aircraft['Operator'],aircraft['Flight Impact'])
print(contingencyTable)
stat, p, dof, expected = stats.chi2_contingency(contingencyTable)
stat, p, dof, expected

Flight Impact                ABORTED TAKEOFF  ENGINE SHUTDOWN  NONE  OTHER  \
Operator                                                                     
1US AIRWAYS                               50               20  3230     64   
ABELAG AVIATION                            0                0     1      0   
ABSA AEROLINHAS BRASILEIRAS                0                0     2      0   
ABX AIR                                    5                2  1257      9   
ACM AVIATION                               0                0     1      0   
...                                      ...              ...   ...    ...   
WORLDWIDE JET CHARTER                      0                0     3      0   
XL AIRWAYS UK                              0                0     2      0   
XOJET                                      1                0    33      0   
XTRA AIRWAYS                               0                1    13      1   
ZANTOP INTL AIRLINES                       0                0   

(17018.064167675017,
 0.0,
 1996,
 array([[8.25736088e+01, 1.62549238e+01, 3.09236761e+03, 8.50662645e+01,
         2.15737596e+02],
        [2.36465088e-02, 4.65490373e-03, 8.85557734e-01, 2.43603278e-02,
         6.17805258e-02],
        [4.72930176e-02, 9.30980747e-03, 1.77111547e+00, 4.87206555e-02,
         1.23561052e-01],
        ...,
        [8.74920826e-01, 1.72231438e-01, 3.27656362e+01, 9.01332127e-01,
         2.28587946e+00],
        [3.78344141e-01, 7.44784598e-02, 1.41689237e+01, 3.89765244e-01,
         9.88488413e-01],
        [7.09395265e-02, 1.39647112e-02, 2.65667320e+00, 7.30809833e-02,
         1.85341577e-01]]))

In [17]:
# interpret test-statistic
prob = 0.95
critical = stats.chi2.ppf(prob, dof)
if abs(stat) >= critical:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')

Dependent (reject H0)
