# Comparing Relative Frequencies: In-Flight Emergencies vs. Production

This notebook compares each plane model's share of in-flight emergencies with its share of total production, using the following statistical approaches:



---

## Recommended Workflow
1. **Start with a Chi-Square Test** to assess overall proportionality between emergencies and production.
2. **Follow up with Z-tests** or **odds ratios** for individual plane models to identify specific deviations.
3. **Visualize results** to effectively communicate findings.


In [206]:
import requests
import pandas as pd 
import json
import plotly.graph_objects as go
import numpy as np
from scipy.stats import chi2
from scipy.stats import chi2_contingency

In [207]:
production_dict = {
    'ATR 72': 1233,
    'Airbus A220': 367,
    'Airbus A300': 561,
    'Airbus A319': 1484,
    'Airbus A320': 4752,
    'Airbus A320 NEO': 3607,
    'Airbus A321': 1784,
    'Airbus A330': 1615,
    'Airbus A340': 380,
    'Airbus A350': 623,
    'Airbus A380': 251,
    'Airbus A318': 80,
    'BAe Avro RJ': 387,
    'Boeing 717': 156,
    'Boeing 727': 1832,
    'Boeing 737 Classic': 1988,
    'Boeing 737 NG': 6960,
    'Boeing 737 Original': 1150,
    'Boeing 747': 1574,
    'Boeing 757': 1050,
    'Boeing 767': 1319,
    'Boeing 777': 1738,
    'Boeing 787': 1150,
    'Bombardier DHC-8': 1258,
    'Canadair CRJ': 1950,
    'Canadair CRJ 100': 226,
    'Canadair CRJ 700': 330,
    'Canadair CRJ 900': 500,
    'Embraer ERJ135': 120,
    'Embraer ERJ140': 74,
    'Embraer ERJ145': 1100,
    'Embraer ERJ170': 190,
    'Embraer ERJ190': 560,
    'Fokker 100': 283,
    'McDonnell Douglas MD-11': 200,
    'McDonnell Douglas MD-88': 150,
    'McDonnell Douglas MD-90': 116,
    'Sukhoi Superjet 100': 200
}

In [208]:
df_squawk = pd.read_csv('../data/processed/squawk7700_processed_final_v2.csv')

# Data prep frequency of IFE
df_ife_production = df_squawk.groupby('productionLine').flight_id.count().sort_values(ascending=False).reset_index()
df_ife_production.columns = ['productionLine','ife_count']
df_ife_production.head()

# Data prep frequency of production
df_prod = pd.DataFrame({'productionLine' : production_dict.keys(), 'production_amt' : list(production_dict.values())})
df_prod.sort_values(by='productionLine', inplace= True)
df_prod.reset_index(drop=True, inplace=True)

# Merge both data preps and sort by productionLine
df_ife_production = pd.merge(df_ife_production, 
                            df_prod, 
                            on='productionLine', 
                            how='left')

df_ife_production.sort_values(by='productionLine', inplace= True)
df_ife_production.reset_index(drop=True, inplace=True)
df_ife_production.dropna(inplace=True)

# Relative frequency columns (each model's share of the column total)
df_ife_production['ife_count_rel'] = df_ife_production['ife_count'] / df_ife_production['ife_count'].sum()
df_ife_production['production_amt_rel'] = df_ife_production['production_amt'] / df_ife_production['production_amt'].sum()

# Sanity checks
print(len(df_ife_production))
print(df_ife_production['ife_count_rel'].sum())
print(df_ife_production['production_amt_rel'].sum())
df_ife_production.head()

37
0.9999999999999999
1.0


Unnamed: 0,productionLine,ife_count,production_amt,ife_count_rel,production_amt_rel
0,ATR 72,2,1233.0,0.003155,0.02853
1,Airbus A220,3,367.0,0.004732,0.008492
2,Airbus A300,10,561.0,0.015773,0.012981
3,Airbus A319,44,1484.0,0.069401,0.034338
4,Airbus A320,99,4752.0,0.156151,0.109954


---

## 1. Chi-Square Test for Independence
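This test checks whether the observed distribution of in-flight emergencies across plane models differs from the distribution expected from production frequencies. The manual computation below is a goodness-of-fit test against the production shares, while `chi2_contingency` treats the production and emergency counts as a two-column contingency table; this is why the two cells report slightly different statistics.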

- **Null Hypothesis (\(H_0\))**: The relative frequency of in-flight emergencies is proportional to the relative frequency of production.

### Steps:
1. Compute the **expected number of emergencies** for each plane model by multiplying the total number of emergencies by the relative production frequencies.
2. Use the formula for the chi-square statistic:
   \[
   \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
   \]
   where:
   - \(O_i\): Observed number of emergencies for a given model.
   - \(E_i\): Expected number of emergencies for the same model.
3. Compare the calculated statistic with the chi-square distribution to get a p-value.

*This approach is ideal if you have raw counts (not just relative frequencies).*
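
As a cross-check on the three steps above, the same goodness-of-fit statistic can be obtained from `scipy.stats.chisquare`. The sketch below is only illustrative: the three models and their counts (`observed`, `produced`) are made-up toy values, not figures from the squawk dataset.

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical toy data (assumption): emergencies and units produced for three models
observed = np.array([30.0, 50.0, 20.0])
produced = np.array([200.0, 600.0, 200.0])

# Step 1: expected emergencies = total emergencies * relative production share
expected = observed.sum() * produced / produced.sum()

# Step 2: chi-square statistic computed by hand
chi_stat_manual = ((observed - expected) ** 2 / expected).sum()

# Step 3: statistic and p-value from SciPy (degrees of freedom = categories - 1)
chi_stat_scipy, p_value = chisquare(f_obs=observed, f_exp=expected)

print(f"manual: {chi_stat_manual:.4f}, scipy: {chi_stat_scipy:.4f}, p-value: {p_value:.4f}")
```

Both routes should agree on the statistic. `chisquare` expects the observed and expected counts to have the same total, which holds here by construction.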

In [209]:
total_ifes = df_ife_production['ife_count'].sum()
# Expected emergencies if they were distributed in proportion to production
df_ife_production['expected_ife_count'] = df_ife_production['production_amt_rel'] * total_ifes
# Per-model contribution (O - E)^2 / E to the chi-square statistic
df_ife_production['chi_stat_contributor'] = (
    (df_ife_production['ife_count'] - df_ife_production['expected_ife_count']) ** 2
    / df_ife_production['expected_ife_count']
)

chi_stat = df_ife_production['chi_stat_contributor'].sum()
dof = len(df_ife_production) - 1  # Degrees of freedom (number of categories - 1)

# Note: 1 - cdf rounds to 0.0 for a statistic this large;
# chi2.sf(chi_stat, dof) would preserve the tail probability.
p_value = 1 - chi2.cdf(chi_stat, dof)

print(chi2.cdf(chi_stat, dof))
print(f"Chi-square statistic: {chi_stat}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p_value}")

# Sanity Checks
print(chi_stat)
df_ife_production.head()

1.0
Chi-square statistic: 634.8298813823012
Degrees of freedom: 36
P-value: 0.0
634.8298813823012


Unnamed: 0,productionLine,ife_count,production_amt,ife_count_rel,production_amt_rel,expected_ife_count,chi_stat_contributor
0,ATR 72,2,1233.0,0.003155,0.02853,18.08788,14.309023
1,Airbus A220,3,367.0,0.004732,0.008492,5.383822,1.055497
2,Airbus A300,10,561.0,0.015773,0.012981,8.229765,0.38078
3,Airbus A319,44,1484.0,0.069401,0.034338,21.770003,22.69971
4,Airbus A320,99,4752.0,0.156151,0.109954,69.710954,12.305788


In [210]:
# Create a contingency table
contingency_table = df_ife_production[['production_amt', 'ife_count']].to_numpy()
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

# Output results
print(f"Chi-square Statistic: {chi2_stat}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)

# Add expected frequencies back to the DataFrame
df_ife_production['ExpectedProduction'], df_ife_production['ExpectedEmergencies'] = expected.T

Chi-square Statistic: 583.0285862555544
P-value: 5.903339628159654e-100
Degrees of Freedom: 36
Expected Frequencies:
[[1.21714471e+03 1.78552860e+01]
 [3.64650643e+02 5.34935693e+00]
 [5.62744641e+02 8.25535893e+00]
 [1.50590860e+03 2.20913983e+01]
 [4.78086559e+03 7.01344066e+01]
 [3.55583654e+03 5.21634589e+01]
 [1.79270140e+03 2.62985953e+01]
 [1.62318813e+03 2.38118672e+01]
 [3.77462693e+02 5.53730731e+00]
 [6.15963924e+02 9.03607589e+00]
 [2.52298823e+02 3.70117669e+00]
 [3.83375946e+02 5.62405363e+00]
 [1.56701222e+02 2.29877771e+00]
 [1.80649900e+03 2.65010034e+01]
 [1.96418576e+03 2.88142388e+01]
 [7.01410440e+03 1.02895603e+02]
 [1.13435916e+03 1.66408374e+01]
 [1.56799777e+03 2.30022348e+01]
 [1.05945795e+03 1.55420505e+01]
 [1.33540979e+03 1.95902125e+01]
 [1.74835200e+03 2.56479978e+01]
 [1.15801218e+03 1.69878227e+01]
 [1.24769652e+03 1.83034753e+01]
 [1.92279299e+03 2.82070145e+01]
 [2.30616893e+02 3.38310681e+00]
 [3.26214494e+02 4.78550579e+00]
 [5.07554273e+02 7.445726

---

## 2. Z-Test or Proportional Comparison
If you're comparing individual proportions for a specific plane model, a **z-test for proportions** determines whether the difference between the proportion of emergencies and the proportion of production is statistically significant.

### Formula:
\[
Z = \frac{p_1 - p_2}{\sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
\]
Where:
- \(p_1\): Proportion of emergencies for a specific model.
- \(p_2\): Proportion of production for the same model.
- \(p\): Pooled proportion, calculated as \(\frac{x_1 + x_2}{n_1 + n_2}\).
- \(n_1, n_2\): Total counts in each dataset.

# Z-Test for Proportions: Explanation

In the context of the **z-test for proportions**, the terms \(x_1\) and \(x_2\) represent the **counts of successes** (or events of interest) in the two groups being compared.

## Definitions:
- **\(x_1\)**: The count of in-flight emergencies for the plane model of interest.
- **\(x_2\)**: The count of planes produced for the same plane model.

These counts are used to calculate the proportions (\(p_1\) and \(p_2\)) for the z-test:

### Proportions:
\[
p_1 = \frac{x_1}{n_1}, \quad p_2 = \frac{x_2}{n_2}
\]
Where:
- \(n_1\): Total number of in-flight emergencies across all plane models.
- \(n_2\): Total number of planes produced across all plane models.

### Pooled Proportion:
Under the null hypothesis, we assume that the emergencies are distributed proportionally to production. The pooled proportion (\(p\)) is calculated as:
\[
p = \frac{x_1 + x_2}{n_1 + n_2}
\]

---

## Example:
Suppose:
- Total in-flight emergencies (\(n_1\)) = 1,000.
- Total planes produced (\(n_2\)) = 10,000.
- Plane model A had:
  - \(x_1 = 100\) emergencies.
  - \(x_2 = 500\) produced.

### Step-by-Step:
1. Calculate the proportions:
   - \(p_1 = \frac{100}{1000} = 0.1\) (10% emergencies for model A).
   - \(p_2 = \frac{500}{10000} = 0.05\) (5% production for model A).

2. Use the z-test formula to test whether the difference between these proportions is statistically significant.
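
To finish the worked example, the short sketch below plugs the numbers for model A into the pooled-proportion formula. It uses only the standard library; the variable names are chosen here for illustration.

```python
import math

# Values from the example above
n1, n2 = 1000, 10000   # total emergencies, total planes produced
x1, x2 = 100, 500      # emergencies and production for model A

p1 = x1 / n1                       # proportion of emergencies
p2 = x2 / n2                       # proportion of production
p = (x1 + x2) / (n1 + n2)          # pooled proportion

# Standard error under the null hypothesis, then the z statistic
se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

print(f"pooled p = {p:.4f}, z = {z:.2f}")
```

This gives a z-value of about 6.64, far beyond the 1.96 threshold used later in the notebook, so model A would be flagged as over-represented.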


In [213]:
df_ife_production.sort_values(by='productionLine', inplace=True)
df_ife_production['ife_count'] = df_ife_production['ife_count'].astype(float)
df_ife_production['production_amt'] = df_ife_production['production_amt'].astype(float)

n1 = df_ife_production['ife_count'].sum()
n2  = df_ife_production['production_amt'].sum()
n1n2_divisor = (1/n1) + (1/n2)

# Pooled proportion per model: (x1 + x2) / (n1 + n2)
df_ife_production['pooled_prop'] = (df_ife_production['ife_count'] + df_ife_production['production_amt']) / (n1 + n2)
# Z statistic: (p1 - p2) / sqrt(p * (1 - p) * (1/n1 + 1/n2))
df_ife_production['z_val'] = (
    (df_ife_production['ife_count_rel'] - df_ife_production['production_amt_rel'])
    / (df_ife_production['pooled_prop'] * (1 - df_ife_production['pooled_prop']) * n1n2_divisor) ** 0.5
)

df_ife_production.head()

Unnamed: 0,productionLine,ife_count,production_amt,ife_count_rel,production_amt_rel,expected_ife_count,chi_stat_contributor,ExpectedProduction,ExpectedEmergencies,pooled_prop,z_val
0,ATR 72,2.0,1233.0,0.003155,0.02853,18.08788,14.309023,1217.144714,17.855286,0.028163,-3.834037
1,Airbus A220,3.0,367.0,0.004732,0.008492,5.383822,1.055497,364.650643,5.349357,0.008437,-1.027544
2,Airbus A300,10.0,561.0,0.015773,0.012981,8.229765,0.38078,562.744641,8.255359,0.013021,0.615668
3,Airbus A319,44.0,1484.0,0.069401,0.034338,21.770003,22.69971,1505.908602,22.091398,0.034844,4.779327
4,Airbus A320,99.0,4752.0,0.156151,0.109954,69.710954,12.305788,4780.865593,70.134407,0.110622,3.681579


In [249]:
# Models with z < -1.96 (two-sided 5% level); keep the five most negative z-values
df_ife_production_underrep = df_ife_production[df_ife_production['z_val'] < -1.96].sort_values(by='z_val').head()
df_ife_production_underrep = df_ife_production_underrep.sort_values(by='ExpectedEmergencies', ascending=False)
# Plot using Plotly
fig = go.Figure()

# Add bars for actual emergencies
fig.add_trace(go.Bar(
    x=df_ife_production_underrep['productionLine'],
    y=df_ife_production_underrep['ife_count'],
    name='Observed Emergencies',
    marker_color='green'
))

# Add bars for expected emergencies
fig.add_trace(go.Bar(
    x=df_ife_production_underrep['productionLine'],
    y=df_ife_production_underrep['ExpectedEmergencies'],
    name='Expected Emergencies',
    marker_color='orange'
))

# Customize layout
fig.update_layout(
    title='Top 5 Under-represented Outliers (Z-Test): Expected and Actual Emergency Frequencies vs. Aircraft Model',
    xaxis=dict(title='Aircraft Model'),
    yaxis=dict(title='Frequency'),
    barmode='group',  # Group bars side by side
    legend=dict(title='Legend'),
    width=1400, 
    height=800,
    paper_bgcolor='rgb(254, 246, 224)'
    # template='plotly_white'
)

# Show the plot
fig.show()

In [248]:
# Models with z > 1.96 (two-sided 5% level); keep the five largest z-values
df_ife_production_overrep = df_ife_production[df_ife_production['z_val'] > 1.96].sort_values(by='z_val', ascending=False).head()
df_ife_production_overrep = df_ife_production_overrep.sort_values(by='ExpectedEmergencies', ascending=False)
# Plot using Plotly
fig = go.Figure()

# Add bars for actual emergencies
fig.add_trace(go.Bar(
    x=df_ife_production_overrep['productionLine'],
    y=df_ife_production_overrep['ife_count'],
    name='Observed Emergencies',
    marker_color='red'
))

# Add bars for expected emergencies
fig.add_trace(go.Bar(
    x=df_ife_production_overrep['productionLine'],
    y=df_ife_production_overrep['ExpectedEmergencies'],
    name='Expected Emergencies',
    marker_color='orange'
))

# Customize layout
fig.update_layout(
    title='Top 5 Over-represented Outliers (Z-Test): Expected and Actual Emergency Frequencies vs. Aircraft Model',
    xaxis=dict(title='Aircraft Model'),
    yaxis=dict(title='Frequency'),
    barmode='group',  # Group bars side by side
    legend=dict(title='Legend'),
    width=1400, 
    height=800,
    paper_bgcolor='rgb(254, 246, 224)'
    # template='plotly_white'
)

# Show the plot
fig.show()

---

## 3. Odds Ratio (Relative Risk)
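The ratio below compares a model's share of emergencies with its share of production. Strictly speaking it is a ratio of two proportions rather than a true odds ratio or relative risk, but the notebook keeps the column name `odds_ratio`. A value above 1 means the model accounts for more emergencies than its production share would suggest; a value below 1 means fewer.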

### Formula:
\[
OR = \frac{\text{Emergency proportion for a model}}{\text{Production proportion for the same model}}
\]

This method provides an intuitive measure of **overrepresentation** or **underrepresentation** of emergencies for a given plane model.
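
For comparison, the sketch below contrasts the ratio of proportions defined above with a textbook odds ratio, using the illustrative shares from the z-test example (10% of emergencies, 5% of production). The helper names `proportion_ratio` and `true_odds_ratio` are hypothetical and not part of the notebook.

```python
def proportion_ratio(p_emergency: float, p_production: float) -> float:
    """Ratio of proportions, as in the formula above."""
    return p_emergency / p_production


def true_odds_ratio(p_emergency: float, p_production: float) -> float:
    """Textbook odds ratio: each proportion is first converted to odds p / (1 - p)."""
    odds_emergency = p_emergency / (1 - p_emergency)
    odds_production = p_production / (1 - p_production)
    return odds_emergency / odds_production


ratio = proportion_ratio(0.10, 0.05)
odds = true_odds_ratio(0.10, 0.05)
print(f"ratio of proportions: {ratio:.3f}, odds ratio: {odds:.3f}")
```

The two measures stay close when both proportions are small, which is the case for most models here, and drift apart as either share grows.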



In [216]:
df_ife_production.sort_values(by='productionLine', inplace=True)
df_ife_production['odds_ratio'] = df_ife_production.apply(lambda row : row['ife_count_rel']/row['production_amt_rel'], axis=1)

df_ife_production.head()

Unnamed: 0,productionLine,ife_count,production_amt,ife_count_rel,production_amt_rel,expected_ife_count,chi_stat_contributor,ExpectedProduction,ExpectedEmergencies,pooled_prop,z_val,odds_ratio
0,ATR 72,2.0,1233.0,0.003155,0.02853,18.08788,14.309023,1217.144714,17.855286,0.028163,-3.834037,0.110571
1,Airbus A220,3.0,367.0,0.004732,0.008492,5.383822,1.055497,364.650643,5.349357,0.008437,-1.027544,0.557225
2,Airbus A300,10.0,561.0,0.015773,0.012981,8.229765,0.38078,562.744641,8.255359,0.013021,0.615668,1.215101
3,Airbus A319,44.0,1484.0,0.069401,0.034338,21.770003,22.69971,1505.908602,22.091398,0.034844,4.779327,2.02113
4,Airbus A320,99.0,4752.0,0.156151,0.109954,69.710954,12.305788,4780.865593,70.134407,0.110622,3.681579,1.42015


In [None]:
# Five models with the lowest ratio, ordered by expected emergencies for plotting
df_ife_production_underrep_odds = df_ife_production.sort_values(by='odds_ratio').head()
df_ife_production_underrep_odds = df_ife_production_underrep_odds.sort_values(by='ExpectedEmergencies', ascending=False)

# Plot using Plotly
fig = go.Figure()

# Add bars for actual emergencies
fig.add_trace(go.Bar(
    x=df_ife_production_underrep_odds['productionLine'],
    y=df_ife_production_underrep_odds['ife_count'],
    name='Observed Emergencies',
    marker_color='green'
))

# Add bars for expected emergencies
fig.add_trace(go.Bar(
    x=df_ife_production_underrep_odds['productionLine'],
    y=df_ife_production_underrep_odds['ExpectedEmergencies'],
    name='Expected Emergencies',
    marker_color='orange'
))

# Customize layout
fig.update_layout(
    title='Top 5 Under-represented Outliers (Odds Ratio): Expected and Actual Emergency Frequencies vs. Aircraft Model',
    xaxis=dict(title='Aircraft Model'),
    yaxis=dict(title='Frequency'),
    barmode='group',  # Group bars side by side
    legend=dict(title='Legend'),
    width=1400, 
    height=800,
    paper_bgcolor='rgb(254, 246, 224)'
    # template='plotly_white'
)

# Show the plot
fig.show()

In [None]:
df_ife_production_overrep = df_ife_production.sort_values(by='odds_ratio',ascending=False).head()
df_ife_production_overrep.sort_values(by='ExpectedEmergencies',ascending=False, inplace=True)
# Plot using Plotly
fig = go.Figure()

# Add bars for actual emergencies
fig.add_trace(go.Bar(
    x=df_ife_production_overrep['productionLine'],
    y=df_ife_production_overrep['ife_count'],
    name='Observed Emergencies',
    marker_color='red'
))

# Add bars for expected emergencies
fig.add_trace(go.Bar(
    x=df_ife_production_overrep['productionLine'],
    y=df_ife_production_overrep['ExpectedEmergencies'],
    name='Expected Emergencies',
    marker_color='orange'
))

# Customize layout
fig.update_layout(
    title='Top 5 Over-represented Outliers (Odds Ratio): Expected and Actual Emergency Frequencies vs. Aircraft Model',
    xaxis=dict(title='Aircraft Model'),
    yaxis=dict(title='Frequency'),
    barmode='group',  # Group bars side by side
    legend=dict(title='Legend'),
    width=1400, 
    height=800,
    paper_bgcolor='rgb(254, 246, 224)'
    # template='plotly_white'
)

# Show the plot
fig.show()

## 4. Visualization and Deviation Analysis
If a formal hypothesis test isn’t required, a **visual comparison** can help highlight differences:

1. **Plot both relative frequencies** (emergencies and production) on the same bar chart.
2. Highlight deviations using a **residual calculation**:
   \[
   Residual = \text{Observed relative frequency (emergencies)} - \text{Expected relative frequency (production)}
   \]
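
The raw residual above is measured on the relative-frequency scale, so it tends to favour high-volume models. One common complement, offered here as an assumption rather than as part of the original analysis, is the Pearson residual \((O_i - E_i)/\sqrt{E_i}\), whose square equals the model's contribution to the chi-square statistic. The sketch reuses the observed and expected counts from the first five rows of the table printed earlier.

```python
import numpy as np

# First five rows of the table above: observed and expected emergency counts
models = ['ATR 72', 'Airbus A220', 'Airbus A300', 'Airbus A319', 'Airbus A320']
observed = np.array([2.0, 3.0, 10.0, 44.0, 99.0])
expected = np.array([18.08788, 5.383822, 8.229765, 21.770003, 69.710954])

# Pearson residual: signed deviation scaled by the square root of the expectation
pearson_residual = (observed - expected) / np.sqrt(expected)

for model, r in zip(models, pearson_residual):
    print(f"{model}: {r:+.2f}")
```

Squaring each value reproduces the `chi_stat_contributor` column, while the sign shows whether the model is under- or over-represented.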



In [219]:
df_ife_production.sort_values(by='productionLine', inplace=True)
df_ife_production['residual'] = df_ife_production.apply(lambda row : row['ife_count_rel'] - row['production_amt_rel'], axis=1)

residual_under = list(df_ife_production.sort_values(by='residual',ascending=True).head()['productionLine'])
residual_over = list(df_ife_production.sort_values(by='residual',ascending=False).head()['productionLine'])

df_ife_production.head()

Unnamed: 0,productionLine,ife_count,production_amt,ife_count_rel,production_amt_rel,expected_ife_count,chi_stat_contributor,ExpectedProduction,ExpectedEmergencies,pooled_prop,z_val,odds_ratio,residual
0,ATR 72,2.0,1233.0,0.003155,0.02853,18.08788,14.309023,1217.144714,17.855286,0.028163,-3.834037,0.110571,-0.025375
1,Airbus A220,3.0,367.0,0.004732,0.008492,5.383822,1.055497,364.650643,5.349357,0.008437,-1.027544,0.557225,-0.00376
2,Airbus A300,10.0,561.0,0.015773,0.012981,8.229765,0.38078,562.744641,8.255359,0.013021,0.615668,1.215101,0.002792
3,Airbus A319,44.0,1484.0,0.069401,0.034338,21.770003,22.69971,1505.908602,22.091398,0.034844,4.779327,2.02113,0.035063
4,Airbus A320,99.0,4752.0,0.156151,0.109954,69.710954,12.305788,4780.865593,70.134407,0.110622,3.681579,1.42015,0.046197


In [226]:
df_ife_production.sort_values(by='production_amt',ascending=False,inplace=True)

# Plot using Plotly
fig = go.Figure()

# Add bars for relative production frequency
fig.add_trace(go.Bar(
    x=df_ife_production['productionLine'],
    y=df_ife_production['production_amt_rel'],
    name='Relative Production Frequency',
    marker_color='blue'
))

# Add bars for relative emergency frequency
fig.add_trace(go.Bar(
    x=df_ife_production['productionLine'],
    y=df_ife_production['ife_count_rel'],
    name='Relative Emergency Frequency',
    marker_color='red'
))

# Customize layout
fig.update_layout(
    title='Relative Observed Emergency Frequency vs. Relative Production Amount for IFE Dataset',
    xaxis=dict(title='Plane Model'),
    yaxis=dict(title='Relative Frequency'),
    barmode='group',  # Group bars side by side
    legend=dict(title='Legend'),
    template='plotly_white'
)

# Show the plot
fig.show()

In [254]:
def colordef_ife(cat,residual_under) :
    if cat in residual_under :
        return 'green'
    else :
        return 'red'

def colordef_prod(cat,residual_under) :
    # Expected-value bars use the same highlight color for under- and over-represented models
    return 'orange'

In [257]:
# Rows for the highlighted models (under- and over-represented by residual)
df_residual_under = df_ife_production[df_ife_production['productionLine'].isin(residual_under)]
df_residual_over = df_ife_production[df_ife_production['productionLine'].isin(residual_over)]
residual_all = residual_under + residual_over

# Build per-bar color lists; models outside the highlighted set are drawn in gray
df_ife_production.sort_values(by='production_amt',ascending=False,inplace=True)

color_ife = [colordef_ife(cat,residual_under) if cat in residual_all else 'lightgray' for cat in list(df_ife_production['productionLine'])]
color_prod = [colordef_prod(cat,residual_under) if cat in residual_all else 'gray' for cat in list(df_ife_production['productionLine'])]

# Plot using Plotly
fig = go.Figure()

# Add bars for observed emergency counts
fig.add_trace(go.Bar(
    x=df_ife_production['productionLine'],
    y=df_ife_production['ife_count'],
    name='Observed Emergency Frequency',
    marker_color=color_ife
))

# Add bars for expected emergency counts
fig.add_trace(go.Bar(
    x=df_ife_production['productionLine'],
    y=df_ife_production['ExpectedEmergencies'],
    name='Expected Emergency Frequency',
    marker_color=color_prod
))

# Customize layout
fig.update_layout(
    title='Highlighted Outliers (Residual): Expected and Observed Emergency Frequencies vs. Aircraft Model',
    xaxis=dict(title='Aircraft Model'),
    yaxis=dict(title='Frequency'),
    barmode='group',  # Group bars side by side
    legend=dict(title='Legend'),
    width=1400, 
    height=800,
    paper_bgcolor='rgb(254, 246, 224)'
    # template='plotly_white'
)

fig.update_xaxes(tickangle=60)

# Show the plot
fig.show()

### Save Dataframe

In [223]:
df_ife_production.sort_values(by='productionLine', inplace=True)
df_ife_production.to_csv('../data/processed/squawk7700_categorical_statistical_comparison.csv')