https://www.kaggle.com/neel90/airline-2019/tasks?taskId=1271

**Task Details**

1. Combine different csv files into a single dataframe
1. Clean the city_name columns, which also contain the abbreviated state names.
1. Check which of the columns are redundant information (i.e. they can easily be computed from the other columns)
1. Find out the airports and the flight operators which correspond to maximum delay in general.

**Submission**

Submit your notebook containing all the tasks mentioned above. If you are interested in deriving more features from the data, you are free to do so. However, please separate each of them with appropriate headings for ease of evaluation.

--------------------------------------------------------------------------------------------------

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

---

## 1. Combine different csv files into a single dataframe

In [None]:
dfs = []

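# Walk the input directory and read every CSV file, in a stable order
for dirname, _, filenames in os.walk('/kaggle/input/airline-2019'):
    for filename in sorted(filenames):
        if not filename.lower().endswith('.csv'):
            continue  # skip any non-CSV files
        path = os.path.join(dirname, filename)
        dfs.append(pd.read_csv(path))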

df = pd.concat(dfs, ignore_index=True)
print('Shape of data: ', df.shape)
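
To see why `ignore_index=True` matters when stacking several files, here is a minimal sketch on two in-memory CSVs. The file contents and variable names are made up for illustration and are not taken from the dataset. Without the flag, each file's rows keep their own 0-based labels, so index values repeat.

```python
import io
import pandas as pd

# Two tiny in-memory "files"; the contents are hypothetical stand-ins for the real CSVs
csv_a = io.StringIO("ORIGIN,DEST\nORD,JFK\nLAX,SFO\n")
csv_b = io.StringIO("ORIGIN,DEST\nATL,ORD\n")

frames = [pd.read_csv(csv_a), pd.read_csv(csv_b)]

kept = pd.concat(frames)                         # each frame keeps its own labels
combined = pd.concat(frames, ignore_index=True)  # labels are renumbered from 0

print(kept.index.tolist())
print(combined.index.tolist())
```

A unique index makes later label-based operations on the combined frame unambiguous.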

In [None]:
# show_counts replaces the null_counts argument, which was deprecated in pandas 1.2
df.info(show_counts=True)

### Showing columns with non-object data types

In [None]:
df.select_dtypes(exclude=['O'])

### Showing columns with object data types

In [None]:
df.select_dtypes(include=['O'])

--------------------------------------------------------------------------------------------------
## 2. Clean the city_name columns, which also contain the abbreviated state names

In [None]:
# Pattern for a trailing ", ST" suffix: a comma followed by letters at the end of the string
re = r'\s*,\s*[a-zA-Z]*\s*$'

In [None]:
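# Clean origin cities (regex=True is required: recent pandas versions treat the pattern as a literal by default)
df['ORIGIN_CITY_NAME'] = df['ORIGIN_CITY_NAME'].str.replace(re, '', regex=True)

# Check that no city name still contains a comma
not df['ORIGIN_CITY_NAME'].str.contains(',', regex=False, na=False).any()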

In [None]:
df['ORIGIN_CITY_NAME']

In [None]:
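# Clean destination cities (regex=True is required: recent pandas versions treat the pattern as a literal by default)
df['DEST_CITY_NAME'] = df['DEST_CITY_NAME'].str.replace(re, '', regex=True)

# Check that no city name still contains a comma
not df['DEST_CITY_NAME'].str.contains(',', regex=False, na=False).any()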

In [None]:
df['DEST_CITY_NAME']
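
As a quick illustration of what the pattern does, the sketch below applies it to a few hand-written values in the "City, ST" form. The sample strings and the names `pattern`, `cities`, and `cleaned` are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Same pattern as above: a comma, then an optional run of letters, at the end of the string
pattern = r'\s*,\s*[a-zA-Z]*\s*$'

# Hypothetical sample values
cities = pd.Series([
    'Chicago, IL',
    'New York, NY',
    'Dallas/Fort Worth, TX',
    'Boston',            # no suffix: left unchanged
    'Some City, AB/CD',  # suffix is not purely alphabetic: left unchanged
])

cleaned = cities.str.replace(pattern, '', regex=True)
print(cleaned.tolist())
```

The last value shows a limit of the pattern: it strips only a purely alphabetic suffix, so any other form would survive the cleaning and be flagged by the comma check.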

--------------------------------------------------------------------------------------------------
## 3. Check which of the columns are redundant information (i.e. they can easily be computed from the other columns)

In [None]:
df['Unnamed: 25'].value_counts(dropna=False)

In [None]:
df['Unnamed: 25'].isna().all()

### So, the column Unnamed: 25 can be dropped because it consists of NaN values only

In [None]:
df.drop('Unnamed: 25', axis=1, inplace=True)
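
A column with a name like `Unnamed: N` typically appears when `pd.read_csv` meets a header field with no name, for example when every line of the file ends with a trailing comma. Whether that is the cause in this dataset is an assumption. The sketch below only reproduces the effect on made-up data and shows `dropna(axis=1, how='all')` as a generic way to remove all-NaN columns.

```python
import io
import pandas as pd

# Hypothetical CSV in which every line, including the header, ends with a comma
raw = io.StringIO("ORIGIN,DEST,\nORD,JFK,\nLAX,SFO,\n")
toy = pd.read_csv(raw)

print(toy.columns.tolist())  # the empty header field gets an 'Unnamed: <position>' name

# The extra column holds only NaN values
all_nan = toy['Unnamed: 2'].isna().all()

# Drop every column that consists of NaN values only
trimmed = toy.dropna(axis=1, how='all')
print(trimmed.columns.tolist())
```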

In [None]:
df['CANCELLED'].value_counts(dropna=False)

In [None]:
df['CANCELLATION_CODE'].value_counts(dropna=False)

In [None]:
# Split CANCELLATION_CODE into two groups by CANCELLED
df_cancelled_0 = df[df['CANCELLED'] == 0]['CANCELLATION_CODE']
df_cancelled_1 = df[df['CANCELLED'] == 1]['CANCELLATION_CODE']

In [None]:
df_cancelled_0.value_counts(dropna=False)

In [None]:
df_cancelled_1.value_counts(dropna=False)

### So, the column CANCELLED can be dropped because it can be computed from CANCELLATION_CODE: the code is NaN when CANCELLED is 0 and set when CANCELLED is 1

In [None]:
df.drop('CANCELLED', axis=1, inplace=True)
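
The redundancy argument can also be checked programmatically rather than by reading the value counts. The sketch below does this on a small made-up frame; the values and the names `toy`, `derived`, and `is_redundant` are hypothetical. On the real data, the same comparison would have to run before the column is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a cancellation code is present exactly when the flight was cancelled
toy = pd.DataFrame({
    'CANCELLED': [0, 1, 0, 1],
    'CANCELLATION_CODE': [np.nan, 'A', np.nan, 'B'],
})

# Rebuild the flag from the code column: 1 where a code exists, 0 otherwise
derived = toy['CANCELLATION_CODE'].notna().astype(int)

# The column is redundant if the rebuilt flag matches it on every row
is_redundant = bool((derived == toy['CANCELLED']).all())
print(is_redundant)
```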

--------------------------------------------------------------------------------------------------

## 4. Find out the airports and the flight operators which correspond to maximum delay in general

In [None]:
delay_cols = ['CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']

# Row-wise total. Unlike chained `+`, a single missing component counts as 0;
# min_count=1 keeps NaN for rows where every delay column is NaN
df['TOTAL_DELAY'] = df[delay_cols].sum(axis=1, min_count=1)

# Share of each delay cause in the overall delay
ax = df[delay_cols].sum().plot.pie(title='Delays', figsize=(8, 8))
ax.set_xlabel('')
ax.set_ylabel('');

In [None]:
# Total delay per flight operator, largest first
df_airlines = df.groupby('OP_CARRIER_AIRLINE_ID')['TOTAL_DELAY'].sum().reset_index().sort_values('TOTAL_DELAY', ascending=False)

# Find out the flight operators which correspond to maximum delay in general
plt.figure(figsize=(15, 8))
ax = sns.barplot(x='OP_CARRIER_AIRLINE_ID', y='TOTAL_DELAY', data=df_airlines)
ax.set_xlabel('Airline', fontsize=16)
ax.set_ylabel('Total delay', fontsize=16)
plt.show();

### The flight operator with ID *20304* has the maximum total delay

In [None]:
# Total delay per origin airport, largest first
df_airports = df.groupby('ORIGIN')['TOTAL_DELAY'].sum().reset_index().sort_values('TOTAL_DELAY', ascending=False)

# Find out the airports which correspond to maximum delay in general
plt.figure(figsize=(15, 70))
ax = sns.barplot(y='ORIGIN', x='TOTAL_DELAY', data=df_airports)
ax.set_ylabel('Airport', fontsize=16)
ax.set_xlabel('Total delay', fontsize=16)
plt.show();

### Look up ORD, the airport with the maximum total delay

In [None]:
df[df['ORIGIN'] == 'ORD'][['ORIGIN', 'ORIGIN_CITY_NAME', 'ORIGIN_STATE_NM']].head()

#### Chicago airport (ORD) has the maximum delay in general
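
The ranking above uses the summed delay, which grows with the number of flights, so busy airports tend to come out on top. As a complementary view, and only as a sketch on made-up numbers, the mean delay per recorded flight can be computed in the same `groupby`. The airport code `XYZ` and all values below are hypothetical.

```python
import pandas as pd

# Hypothetical data: a busy airport with moderate delays and a small one with one long delay
toy = pd.DataFrame({
    'ORIGIN': ['ORD', 'ORD', 'ORD', 'XYZ'],
    'TOTAL_DELAY': [20.0, 30.0, 40.0, 80.0],
})

# Sum, mean and number of recorded delays per airport
stats = toy.groupby('ORIGIN')['TOTAL_DELAY'].agg(['sum', 'mean', 'count'])

top_by_sum = stats['sum'].idxmax()    # airport with the largest total delay
top_by_mean = stats['mean'].idxmax()  # airport with the largest average delay
print(stats)
print(top_by_sum, top_by_mean)
```

The two rankings answer different questions, so it helps to state which one "maximum delay in general" refers to; the analysis above uses the total.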

## 5. Possible additional changes

In [None]:
# Rename columns
# df.rename(columns={'ORIGIN_STATE_NM': 'ORIGIN_STATE_NAME'}, inplace=True)
# df.rename(columns={'DEST_STATE_NM': 'DEST_STATE_NAME'}, inplace=True)