Aim: Test whether the number of travelers differs between AA and AS airlines in 2021

$H_0$: The mean number of travelers is the same for AA and AS airlines in 2021<br />
$H_a$: The mean number of travelers is not the same for AA and AS airlines in 2021

In [38]:
#importing libraries
import pandas as pd
import numpy as np

In [39]:
#importing data
total_trav = pd.read_csv("data/df_all.csv")
#obtain shape
total_trav.shape

(996, 4)

In [40]:
total_trav.head(3)

Unnamed: 0,Travelers,Day,Month,Year
0,2882915.0,28,Nov,2019
1,2648268.0,27,Nov,2019
2,1968137.0,26,Nov,2019


In [41]:
#changing Month values to numbers
total_trav = total_trav.replace({'Month': {"Jan": 1,'Feb':2,'Mar':3,'Apr':4,'May':5,'Jun':6,'Jul':7,'Aug':8,'Sep':9,
                                             'Oct':10,'Nov':11,'Dec':12}})
#converting Day, Month, and Year columns to string and combining the three columns 
total_trav['date'] = total_trav.Day.astype(str) + '-' + total_trav.Month.astype(str) + '-' + total_trav.Year.astype(str)
total_trav.tail()


Unnamed: 0,Travelers,Day,Month,Year,date
991,766594.0,5,1,2021,5-1-2021
992,1080346.0,4,1,2021,4-1-2021
993,1327289.0,3,1,2021,3-1-2021
994,1192881.0,2,1,2021,2-1-2021
995,805990.0,1,1,2021,1-1-2021


In [42]:
#converting date to a datetime variable; the strings are day-month-year,
#so the format is given explicitly (otherwise 5-1-2021 could be read as May 1)
total_trav['date'] = pd.to_datetime(total_trav['date'], format='%d-%m-%Y')
total_trav.date.dtype

dtype('<M8[ns]')
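
Strings such as `5-1-2021` are ambiguous between day-first and month-first readings, so it can help to check the parsing on a couple of values. The following is a minimal sketch, assuming the day-month-year layout built in the earlier cell; the two sample strings are taken from the outputs above and the variable names `sample` and `parsed` are illustrative only.

```python
import pandas as pd

# Two date strings in the day-month-year layout used in this notebook
sample = pd.Series(["28-11-2019", "5-1-2021"])

# An explicit format removes any guessing about which field is the day
parsed = pd.to_datetime(sample, format="%d-%m-%Y")

print(parsed.dt.month.tolist())  # [11, 1]
print(parsed.dt.day.tolist())    # [28, 5]
```

With the format stated, `5-1-2021` is read as 5 January 2021 rather than 1 May 2021.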

In [43]:
#list of airlines
#airline labels are assigned to rows at random, so set a seed to make the assignment reproducible
np.random.seed(123)
airlines = ['AS','G4','AA','XP','MX','DL','2D','F9','HA','B6','Test']
total_trav['airlines'] = np.random.choice(list(airlines), len(total_trav))

In [44]:
total_trav.head()

Unnamed: 0,Travelers,Day,Month,Year,date,airlines
0,2882915.0,28,11,2019,2019-11-28,AA
1,2648268.0,27,11,2019,2019-11-27,AA
2,1968137.0,26,11,2019,2019-11-26,2D
3,1591158.0,25,11,2019,2019-11-25,G4
4,2624250.0,24,11,2019,2019-11-24,XP


In [45]:
#subsetting data for AA airline travelers in 2021
filter1 = (total_trav['Year']==2021) & (total_trav['airlines']=='AA')
AA_travelers = total_trav[filter1]

In [46]:
#subsetting data for AS airline travelers in 2021
filter2 = (total_trav['Year']==2021) & (total_trav['airlines']=='AS')
AS_travelers = total_trav[filter2]

In [47]:
#convert AA data to array
AA_trav_array = np.array(AA_travelers['Travelers'])
AA_trav_array

array([1382230., 2152721., 1527465., 1446353., 1448369., 1820152.,
       1942337., 1826310., 1820355., 1925641., 1934918., 1979981.,
       2093066., 2196411., 1984658., 1900170., 1863697., 1707805.,
       1429657., 1626962., 1278113., 1543136., 1107534., 1096348.,
        714725.,  735009.,  628989.,  560190.,  805990.])

In [48]:
#convert AS data to array
AS_trav_array = np.array(AS_travelers['Travelers'])
AS_trav_array

array([2207949., 2213716., 2001439., 1940302., 2070878., 1455913.,
       1439804., 1465197., 1629475., 1900658., 1685462., 2045301.,
       2022858., 2168264., 2141429., 1889911., 2066964., 1560561.,
       1815931., 1618169., 1315493., 1703267., 1463672., 1468218.,
       1561495., 1549181., 1195306., 1535156., 1360290., 1413141.,
        825745., 1049692.,  914823.,  773422.,  690438.,  772471.])

The two groups consist of different, unpaired observations (29 values for AA and 36 for AS), so this is an independent-samples comparison. We therefore use the unpaired (two-sample) t-test for this analysis.

In [49]:
#loading the required library
from scipy.stats import ttest_ind

For an independent-samples t-test, we first check whether the variances of the two groups can be treated as equal. As a rule of thumb, if the larger standard deviation divided by the smaller one is greater than 2, the variances of the two groups are considered unequal.

In [50]:
#ratio of the larger to the smaller standard deviation
sd_AS = AS_trav_array.std()
sd_AA = AA_trav_array.std()
sd_ratio = max(sd_AS, sd_AA) / min(sd_AS, sd_AA)

#treat the variances as equal unless the ratio exceeds 2
Equal_var_status = bool(sd_ratio <= 2)
print(Equal_var_status)

True
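
The ratio check is only a rule of thumb. As a hedged alternative, not part of the original analysis, a formal test such as Levene's test could be run on the same values. The sketch below copies the AA and AS arrays printed above; the names `AA`, `AS`, `sd_ratio`, and `p_levene` are illustrative.

```python
import numpy as np
from scipy.stats import levene

# Traveler counts copied from the AA and AS arrays printed above
AA = np.array([
    1382230, 2152721, 1527465, 1446353, 1448369, 1820152,
    1942337, 1826310, 1820355, 1925641, 1934918, 1979981,
    2093066, 2196411, 1984658, 1900170, 1863697, 1707805,
    1429657, 1626962, 1278113, 1543136, 1107534, 1096348,
    714725, 735009, 628989, 560190, 805990,
], dtype=float)
AS = np.array([
    2207949, 2213716, 2001439, 1940302, 2070878, 1455913,
    1439804, 1465197, 1629475, 1900658, 1685462, 2045301,
    2022858, 2168264, 2141429, 1889911, 2066964, 1560561,
    1815931, 1618169, 1315493, 1703267, 1463672, 1468218,
    1561495, 1549181, 1195306, 1535156, 1360290, 1413141,
    825745, 1049692, 914823, 773422, 690438, 772471,
], dtype=float)

# Rule of thumb: larger sample standard deviation over the smaller one
sd_AA, sd_AS = AA.std(ddof=1), AS.std(ddof=1)
sd_ratio = max(sd_AA, sd_AS) / min(sd_AA, sd_AS)

# Levene's test: the null hypothesis is that both groups have equal variances
stat, p_levene = levene(AA, AS, center="median")

print(f"SD ratio: {sd_ratio:.2f}")
print(f"Levene p-value: {p_levene:.3f}")
```

A large Levene p-value would be consistent with keeping `equal_var=True`; a small one would suggest switching to Welch's t-test by passing `equal_var=False` to `ttest_ind`.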


From the result above, the standard deviation ratio does not exceed 2, so we treat the variances of the two groups as equal. Next, we perform the t-test. Since we are testing for a difference in either direction, this is a two-tailed test.
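
For reference, when the variances are treated as equal, the two-sample statistic is the pooled-variance Student's t. Written out, with $\bar{x}$, $s^2$, and $n$ denoting each group's sample mean, sample variance, and size (notation introduced here for illustration):

$$
t = \frac{\bar{x}_{AS} - \bar{x}_{AA}}{s_p \sqrt{\dfrac{1}{n_{AS}} + \dfrac{1}{n_{AA}}}},
\qquad
s_p^2 = \frac{(n_{AS} - 1)\, s_{AS}^2 + (n_{AA} - 1)\, s_{AA}^2}{n_{AS} + n_{AA} - 2}
$$

The statistic is compared against a t distribution with $n_{AS} + n_{AA} - 2 = 36 + 29 - 2 = 63$ degrees of freedom.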

In [51]:
results = ttest_ind(AS_trav_array,AA_trav_array,equal_var=Equal_var_status,alternative='two-sided')

In [52]:
results.pvalue

0.6794060124908805
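
To see where this p-value comes from, the pooled-variance t statistic can be computed by hand and compared with `ttest_ind`. This is a sketch for checking purposes only, using the AA and AS values printed above; the names `pooled_var`, `t_manual`, and `p_manual` are illustrative.

```python
import numpy as np
from scipy import stats

# Traveler counts copied from the AA and AS arrays printed above
AA = np.array([
    1382230, 2152721, 1527465, 1446353, 1448369, 1820152,
    1942337, 1826310, 1820355, 1925641, 1934918, 1979981,
    2093066, 2196411, 1984658, 1900170, 1863697, 1707805,
    1429657, 1626962, 1278113, 1543136, 1107534, 1096348,
    714725, 735009, 628989, 560190, 805990,
], dtype=float)
AS = np.array([
    2207949, 2213716, 2001439, 1940302, 2070878, 1455913,
    1439804, 1465197, 1629475, 1900658, 1685462, 2045301,
    2022858, 2168264, 2141429, 1889911, 2066964, 1560561,
    1815931, 1618169, 1315493, 1703267, 1463672, 1468218,
    1561495, 1549181, 1195306, 1535156, 1360290, 1413141,
    825745, 1049692, 914823, 773422, 690438, 772471,
], dtype=float)

n_AS, n_AA = len(AS), len(AA)
df = n_AS + n_AA - 2

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_AS - 1) * AS.var(ddof=1) + (n_AA - 1) * AA.var(ddof=1)) / df

# Standard error of the difference in means, then the t statistic
se = np.sqrt(pooled_var * (1 / n_AS + 1 / n_AA))
t_manual = (AS.mean() - AA.mean()) / se

# Two-sided p-value from the t distribution with df degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# Same test via SciPy for comparison
res = stats.ttest_ind(AS, AA, equal_var=True)
print(t_manual, res.statistic)
print(p_manual, res.pvalue)
```

The hand-computed values should agree with the SciPy results up to floating-point error.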

The p-value for the test is 0.679, which is greater than the 0.05 significance level, so we fail to reject the null hypothesis: there is no statistically significant difference in the number of travelers between the two airlines in 2021.