# THY Travel Datathon Preselection Case Study

by Duygu Can, Meriç Pakkan, Neslihan Oflaz, Ahad Khaleghi Ardabili, Akın Erdem


Passengers can check in through the website, mobile applications, kiosks and counters. For this case study, training data covering seven months of passenger check-in operations is provided. The task is to estimate the number of operations (column Operation_Count) for the records in the attached csv file.

## Dataset Exploration

Load the data by reading the provided .csv files from Google Drive (change the path if needed). There are 808696 samples in the training set and 121921 instances in the test set. Each has 23 features.

In [None]:
import pandas as pd
train_df = pd.read_csv("../input/datathon/assessment/Assessment Data/Assessment Train Data.csv")
result_df = pd.read_csv("../input/datathon/assessment/Assessment Data/Assessment Result File.csv")
print(train_df.shape)
print(result_df.shape)
train_df.head()

Check what data types we have:

In [None]:
train_df.dtypes

The *Departure_YMD_LMT* and *Operation_YMD_LMT* columns store the date of departure and the date of the check-in operation, so it is convenient to convert them to datetime objects.

In [None]:
train_df['Departure_YMD_LMT'] = pd.to_datetime(train_df['Departure_YMD_LMT'], format='%Y%m%d')
train_df['Operation_YMD_LMT'] = pd.to_datetime(train_df['Operation_YMD_LMT'], format='%Y%m%d')
result_df['Departure_YMD_LMT'] = pd.to_datetime(result_df['Departure_YMD_LMT'], format='%Y%m%d')
result_df['Operation_YMD_LMT'] = pd.to_datetime(result_df['Operation_YMD_LMT'], format='%Y%m%d')
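
As a small illustration of the conversion above, the sketch below parses integer dates in `YYYYMMDD` form on a toy frame. The toy values and the use of `errors="coerce"` (which turns unparseable entries into `NaT` instead of raising) are assumptions for illustration and are not part of the original pipeline.

```python
import pandas as pd

# Toy data (hypothetical values) in the same YYYYMMDD layout as the dataset.
toy = pd.DataFrame({"Departure_YMD_LMT": [20180101, 20180215, 20181332]})

# Cast to string first so the format applies to the digits, then parse.
# errors="coerce" maps invalid dates (month 13 here) to NaT.
toy["Departure_YMD_LMT"] = pd.to_datetime(
    toy["Departure_YMD_LMT"].astype(str), format="%Y%m%d", errors="coerce"
)
print(toy)
```

Coercing makes bad dates visible as missing values, which can then be counted with `isna()` before deciding how to handle them.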

In [None]:
train_df.dtypes

In [None]:
train_df.describe()

Convert *object* datatype to category when needed.

In [None]:
for col_name in train_df.columns:
    if train_df[col_name].dtype.name == 'object':
        train_df[col_name] = train_df[col_name].astype('category')
        result_df[col_name] = result_df[col_name].astype('category')

In [None]:
train_df.dtypes

See the unique levels of the categorical columns. *Departure Airport* has only one value called "KDT". So, it is not informative and should be dropped.

In [None]:
for col_name in train_df.columns:
    if train_df[col_name].dtype.name == 'category':
        print(col_name, ":", train_df[col_name].unique())

Generate *Operation_Channel_Group* as defined in the pdf file.

In [None]:
# Map each raw channel code to its channel group.
# Named channel_group_map to avoid shadowing the built-in dict.
channel_group_map = {"JW": 'Online',
        "TW": 'Online',
        "TS": 'Mobile',
        "JM": 'Mobile',
        "TY":"Counter",
        "QC":"Counter",
        "SC":"Kiosks",
        "IR":"Other",
        "?":"Other",
        "IA":"Other",
        "BD":"Other",
        "CC":"Other",
        "QR":"Other",
        "QP":"Other",
        "QA":"Other"
        }
train_df['Operation_Channel_Group'] = train_df['Operation_Channel'].map(channel_group_map)
train_df['Operation_Channel_Group'].unique()

Do the same for the test set.

In [None]:
result_df['Operation_Channel_Group'] = result_df['Operation_Channel'].map(channel_group_map)
result_df['Operation_Channel_Group'].unique()
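
The mapping above lists every channel code explicitly. A possible variation, sketched here on toy data, routes any code missing from the table to "Other" so that unseen codes do not become missing values. The abridged table, the hypothetical code "ZZ" and the fallback itself are assumptions for illustration.

```python
import pandas as pd

# Channel-code-to-group table copied from the mapping above (abridged).
channel_groups = {"JW": "Online", "TW": "Online", "TS": "Mobile", "JM": "Mobile",
                  "TY": "Counter", "QC": "Counter", "SC": "Kiosks"}

codes = pd.Series(["JW", "TS", "SC", "ZZ"])  # "ZZ" is a hypothetical unseen code

# map() yields NaN for codes missing from the table; fillna() routes them to "Other".
groups = codes.map(channel_groups).fillna("Other")
print(groups.tolist())
```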

Change the data type to category:

In [None]:
train_df["Operation_Channel_Group"] = train_df["Operation_Channel_Group"].astype('category')
result_df["Operation_Channel_Group"] = result_df["Operation_Channel_Group"].astype('category')

### Missing Value Handling

The percentage of null values per column in the training and test sets is printed below. At first glance, only the *Operation Initials* column contains null values; however, some unknown values are encoded as "?" in the datasets.

In [None]:
(train_df.isnull().mean()*100).round(4)

In [None]:
(result_df.isnull().mean()*100).round(4)

Column-based "?" occurrence percentages in the training set:

In [None]:
import numpy as np
def unknown_perc(df):
    """Print the percentage of "?" entries for each categorical column."""
    print("Column Name\t Percentage")
    for col_name in df.columns:
        if df[col_name].dtype.name == 'category' and (df[col_name] == "?").any():
            count = df[col_name].value_counts(dropna=False)['?']
            percentage = (count/len(df)*100).round(3)
            print(col_name,"\t", percentage)

unknown_perc(train_df)

Column-based "?" occurrence percentages in the test set:

In [None]:
unknown_perc(result_df)
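
The same percentages can also be collected into a Series instead of being printed, which makes them easier to sort or compare between the two sets. The sketch below is one way to do that on toy data; the helper name `unknown_share` and the toy values are hypothetical.

```python
import pandas as pd

def unknown_share(df, marker="?"):
    """Percentage of rows equal to `marker` for each non-numeric, non-date column."""
    text_cols = df.select_dtypes(exclude=["number", "datetime"])
    return text_cols.apply(lambda s: (s == marker).mean() * 100).round(3)

toy = pd.DataFrame({
    "Passenger_Gender": ["M", "?", "F", "?"],
    "Cabin_Class": pd.Categorical(["Y", "Y", "?", "C"]),
    "Passenger_Baggage_Count": [0, 1, 2, 0],
})
shares = unknown_share(toy)
print(shares)
```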

#### Generating Flags


Let's transform Operation_Sonic_Code into Operation_Sonic_Code_Flag to record whether the value is known (1) or unknown (0). We chose to do this because the variable has too many distinct classes to use directly.



---

Deprecated: Terminal_Name to Terminal_Name_Flag


---



In [None]:
train_df['Operation_Sonic_Code_Flag'] = np.where(train_df['Operation_Sonic_Code']=='?', '0', '1')
train_df['Operation_Sonic_Code_Flag'] = train_df['Operation_Sonic_Code_Flag'].astype(int)
#train_df['Terminal_Number_Flag'] = np.where(train_df['Terminal_Number']=='?', '0', '1')
#train_df['Terminal_Number_Flag'] = train_df['Terminal_Number_Flag'].astype(int)
result_df['Operation_Sonic_Code_Flag'] = np.where(result_df['Operation_Sonic_Code']=='?', '0', '1')
result_df['Operation_Sonic_Code_Flag'] = result_df['Operation_Sonic_Code_Flag'].astype(int)
#result_df['Terminal_Number_Flag'] = np.where(result_df['Terminal_Number']=='?', '0', '1')
#result_df['Terminal_Number_Flag'] = result_df['Terminal_Number_Flag'].astype(int)

Convert "?" to NA where needed. Apart from *Inbound_Departure_Airport* and *Outbound_Arrival_Airport*, "?" marks unknown (missing) values. For those two columns, "?" means that there is no inbound or outbound flight (a direct flight), so it is encoded as a separate class called *Unknown*.

In [None]:
import numpy as np
#train_df['Terminal_Number'] = train_df['Terminal_Number'].replace('?', np.nan)
#train_df['Operation_Channel'] = train_df['Operation_Channel'].replace('?', np.nan)
train_df['Passenger_Title'] = train_df['Passenger_Title'].replace('?', np.nan)
train_df['Passenger_Gender'] = train_df['Passenger_Gender'].replace('?', np.nan)
train_df['Inbound_Departure_Airport'] = train_df['Inbound_Departure_Airport'].replace('?', "Unknown")
train_df['Outbound_Arrival_Airport'] = train_df['Outbound_Arrival_Airport'].replace('?', "Unknown")
train_df['Cabin_Class'] = train_df['Cabin_Class'].replace('?', np.nan)
train_df["Operation_Initials"] = train_df["Operation_Initials"].replace("?",np.nan)
train_df["Operation_Sonic_Code"] = train_df["Operation_Sonic_Code"].replace("?",np.nan)

#result_df['Terminal_Number'] = result_df['Terminal_Number'].replace('?', np.nan)
#result_df['Operation_Channel'] = result_df['Operation_Channel'].replace('?', np.nan)
result_df['Passenger_Title'] = result_df['Passenger_Title'].replace('?', np.nan)
result_df['Passenger_Gender'] = result_df['Passenger_Gender'].replace('?', np.nan)
result_df['Inbound_Departure_Airport'] = result_df['Inbound_Departure_Airport'].replace('?', "Unknown")
result_df['Outbound_Arrival_Airport'] = result_df['Outbound_Arrival_Airport'].replace('?', "Unknown")
result_df['Cabin_Class'] = result_df['Cabin_Class'].replace('?', np.nan)
result_df["Operation_Initials"] = result_df["Operation_Initials"].replace("?",np.nan)
result_df["Operation_Sonic_Code"] = result_df["Operation_Sonic_Code"].replace("?",np.nan)

Now, the new missing value percentages for the training set become:

In [None]:
(train_df.isnull().mean()*100).round(4)

and for the test set the result is:

In [None]:
(result_df.isnull().mean()*100).round(4)

### Dropping Uninformative Features

Notice that *Operation Sonic Code*  has a missing value ratio of 79% and for *Terminal Number* column this ratio is even higher (>90%). With a ratio this high, we cannot impute missing values correctly. These columns should be dropped, along with the *Departure_Airport* column. 



---

Later, we decided to keep *Terminal Number* since it is highly correlated with whether the passenger flies or not (SWC_FLY). We thought that even though this column consists mostly of unknown values, it can still be informative.


---




In [None]:
#train_df2 = train_df.copy()
#result_df2 = result_df.copy()
train_df = train_df.drop(columns = [ "Departure_Airport", "Operation_Sonic_Code"]) #"Terminal_Number", 
result_df = result_df.drop(columns = ["Departure_Airport", "Operation_Sonic_Code"]) #"Terminal_Number", 

#### Imputing Passenger Gender

First, we used *Passenger Title* to impute missing values in the gender column. We replaced unknown genders with male when the title is *MISTER*, and with female when the title is *MISS* or *MISSES*. For the remaining rows, we decided on *Operation Channel Group* based imputation: we grouped the dataframe by this column and found the most frequent value of the *Passenger Gender* column in each group.

In [None]:
# Replace missing values whose titles are MISTER with M
train_df.loc[(train_df.Passenger_Gender.isna() ) & (train_df.Passenger_Title=='MISTER'),"Passenger_Gender"] = "M"
result_df.loc[(result_df.Passenger_Gender.isna() ) & (result_df.Passenger_Title=='MISTER'),"Passenger_Gender"] = "M"

# Replace missing values whose titles are MISS or MISSES with F
train_df.loc[(train_df.Passenger_Gender.isna() ) & ((train_df.Passenger_Title=='MISS') | (train_df.Passenger_Title=='MISSES')) ,"Passenger_Gender"] = "F"
result_df.loc[(result_df.Passenger_Gender.isna() ) & ((result_df.Passenger_Title=='MISS') | (result_df.Passenger_Title=='MISSES')) ,"Passenger_Gender"] = "F"

sum(train_df["Passenger_Gender"].isnull())

There are 981 missing values left in the gender column. We checked the gender distribution by channel group and found that the most frequent gender is male for every channel.

In [None]:
train_df.groupby("Operation_Channel_Group")['Passenger_Gender'].apply(lambda x: x.value_counts().index[0])#.reset_index()

In [None]:
train_df.groupby("Operation_Channel_Group")['Passenger_Gender'].apply(lambda x: x.value_counts())

In [None]:
result_df.groupby("Operation_Channel_Group")['Passenger_Gender'].apply(lambda x: x.value_counts())

Since most common gender for all channel groups is male, we imputed missing gender values with "M".

In [None]:
train_df['Passenger_Gender'] = train_df['Passenger_Gender'].replace(np.nan, "M")
train_df['Passenger_Gender'].unique()

In [None]:
result_df['Passenger_Gender'] = result_df['Passenger_Gender'].replace(np.nan, "M")
result_df['Passenger_Gender'].unique()

#### Imputing Passenger Title

Again, *Operation Channel Group* based imputation is employed. "MISTER" is the most common title, so we imputed all the missing values with it.

In [None]:
train_df.groupby("Operation_Channel_Group")['Passenger_Title'].apply(lambda x: x.value_counts())

In [None]:
result_df.groupby("Operation_Channel_Group")['Passenger_Title'].apply(lambda x: x.value_counts())

In [None]:
train_df['Passenger_Title'] = train_df['Passenger_Title'].replace(np.nan, "MISTER")
train_df['Passenger_Title'].unique()

In [None]:
result_df['Passenger_Title'] = result_df['Passenger_Title'].replace(np.nan, "MISTER")
result_df['Passenger_Title'].unique()

#### Imputing Cabin Class
Most common class is economy class for all channel groups so we imputed missing values with it.

In [None]:
train_df.groupby("Operation_Channel_Group")['Cabin_Class'].apply(lambda x: x.value_counts())

In [None]:
train_df["Cabin_Class"].unique()

In [None]:
result_df.groupby("Operation_Channel_Group")['Cabin_Class'].apply(lambda x: x.value_counts())

In [None]:
train_df['Cabin_Class'] = train_df['Cabin_Class'].replace(np.nan, "Y")
train_df['Cabin_Class'].unique()

In [None]:
result_df['Cabin_Class'] = result_df['Cabin_Class'].replace(np.nan, "Y")
result_df['Cabin_Class'].unique()

#### Imputing Operation Initials

The most frequent *Operation Initials* observed in different channel groups for the training and the test set are given below.

In [None]:
train_df.groupby("Operation_Channel_Group")['Operation_Initials'].apply(lambda x: x.value_counts().index[0])

In [None]:
result_df.groupby("Operation_Channel_Group")['Operation_Initials'].apply(lambda x: x.value_counts().index[0])

The most frequent initials in every channel group are "KS" for the test set, so we replaced NA values with it.

In [None]:
result_df['Operation_Initials'] = result_df['Operation_Initials'].replace(np.nan, "KS")
result_df['Operation_Initials'].unique()

However, we need to impute missing values differently for different channels for the training set.

In [None]:
train_df.loc[(train_df.Operation_Channel_Group == "Counter") & (train_df.Operation_Initials.isna()),"Operation_Initials"] = "KS"
train_df.loc[(train_df.Operation_Channel_Group == "Kiosks") & (train_df.Operation_Initials.isna()),"Operation_Initials"] = "SC"
train_df.loc[(train_df.Operation_Channel_Group != "Kiosks") & (train_df.Operation_Channel_Group != "Counter") & (train_df.Operation_Initials.isna()),"Operation_Initials"] = "MK"
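
The per-group imputation used throughout this section can be written as one reusable helper. The sketch below is a minimal version on toy data: the function name `impute_group_mode`, the toy values and the fallback to the overall mode for groups without any observed value are assumptions. Note that `mode()` breaks ties alphabetically, which may differ from `value_counts().index[0]` when two values are equally frequent.

```python
import numpy as np
import pandas as pd

def impute_group_mode(df, group_col, target_col):
    """Fill missing target values with the most frequent value of their group.

    Falls back to the overall mode when a group has no observed values.
    """
    overall_mode = df[target_col].mode().iloc[0]
    group_mode = df.groupby(group_col)[target_col].agg(
        lambda s: s.mode().iloc[0] if s.notna().any() else overall_mode
    )
    # Look up the fill value of each row from its group, then fill only the gaps.
    fill_values = df[group_col].map(group_mode)
    return df[target_col].fillna(fill_values)

toy = pd.DataFrame({
    "Operation_Channel_Group": ["Counter", "Counter", "Counter", "Kiosks", "Kiosks", "Mobile"],
    "Operation_Initials": ["KS", "KS", np.nan, "SC", np.nan, np.nan],
})
toy["Operation_Initials"] = impute_group_mode(
    toy, "Operation_Channel_Group", "Operation_Initials"
)
print(toy)
```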

## Feature Generation

### Early Check-in & Early Check-in Status

We generate another variable to see whether people checked in on time or not. Dates are given at day resolution, so the number of days between the check-in and the flight is added as a variable.

In [None]:
# Number of whole days between the check-in operation and the departure
train_df['Early_Check_In'] = (train_df.Departure_YMD_LMT - train_df.Operation_YMD_LMT).dt.days
train_df['Early_Check_In'].unique()

There are some peculiar cases where the check-in was made about 7000 days before departure, in 1999!

In [None]:
train_df[train_df['Early_Check_In']>100].sort_values('Operation_YMD_LMT')

We labeled the check-in timing as early, on-time or peculiar. Rows that match none of the rules below keep a missing status.

In [None]:
train_df.loc[train_df.Early_Check_In > 100, 'Early_Check_In_Status'] = 'Peculiar'
train_df.loc[(train_df.Early_Check_In == 0) | (train_df.Early_Check_In == -1), 'Early_Check_In_Status'] = 'On-time'
train_df.loc[(train_df.Early_Check_In == 1) | (train_df.Early_Check_In == 2) | (train_df.Early_Check_In == 3), 'Early_Check_In_Status'] = 'Early'

Do the same for the result set.

In [None]:
# Number of whole days between the check-in operation and the departure
result_df['Early_Check_In'] = (result_df.Departure_YMD_LMT - result_df.Operation_YMD_LMT).dt.days
result_df.loc[result_df.Early_Check_In > 100, 'Early_Check_In_Status'] = 'Peculiar'
result_df.loc[(result_df.Early_Check_In == 0) | (result_df.Early_Check_In == -1), 'Early_Check_In_Status'] = 'On-time'
result_df.loc[(result_df.Early_Check_In == 1) | (result_df.Early_Check_In == 2) | (result_df.Early_Check_In == 3), 'Early_Check_In_Status'] = 'Early'
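
The labeling rules above can also be expressed in one pass with `np.select`, which picks the first matching condition per row. The sketch below uses toy dates; the explicit "Unlabeled" default is an assumption made here for illustration, whereas the cells above leave such rows missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Departure_YMD_LMT": pd.to_datetime(["2018-03-10"] * 5),
    "Operation_YMD_LMT": pd.to_datetime(
        ["2018-03-10", "2018-03-08", "2018-03-11", "1999-01-01", "2018-02-20"]),
})

# Whole days between check-in and departure (positive = checked in before departure).
days_early = (toy["Departure_YMD_LMT"] - toy["Operation_YMD_LMT"]).dt.days

# Conditions mirror the thresholds used above, checked in this order.
conditions = [days_early > 100, days_early.isin([0, -1]), days_early.isin([1, 2, 3])]
labels = ["Peculiar", "On-time", "Early"]
toy["Early_Check_In_Status"] = np.select(conditions, labels, default="Unlabeled")
print(toy)
```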

Flights with neither an inbound departure airport nor an outbound arrival airport are direct flights, so we generated a new column to label them.

In [None]:
train_df['Direct_Flight'] = np.where((train_df.Inbound_Departure_Airport == 'Unknown') & (train_df.Outbound_Arrival_Airport == 'Unknown'), 1, 0)
result_df['Direct_Flight'] = np.where((result_df.Inbound_Departure_Airport == 'Unknown') & (result_df.Outbound_Arrival_Airport == 'Unknown'), 1, 0)

### Check-in Inbound & Check-in Outbound

The number of different values in Operation_Airport is very high. We think that whether the Operation_Airport is the same as the Inbound_Departure_Airport or the Outbound_Arrival_Airport is an important feature. So the Checkin_Inbound and Checkin_Outbound variables indicate whether the check-in operation is done at the Inbound_Departure_Airport or at the Outbound_Arrival_Airport, respectively.



In [None]:
train_df.loc[(train_df.Operation_Airport == train_df.Inbound_Departure_Airport), 'Checkin_Inbound'] = 1
train_df['Checkin_Inbound'] = train_df['Checkin_Inbound'].replace(np.nan, 0)

result_df.loc[(result_df.Operation_Airport == result_df.Inbound_Departure_Airport), 'Checkin_Inbound'] = 1
result_df['Checkin_Inbound'] = result_df['Checkin_Inbound'].replace(np.nan, 0)



train_df.loc[(train_df.Operation_Airport == train_df.Outbound_Arrival_Airport), 'Checkin_Outbound'] = 1
train_df['Checkin_Outbound'] = train_df['Checkin_Outbound'].replace(np.nan, 0)

result_df.loc[(result_df.Operation_Airport == result_df.Outbound_Arrival_Airport), 'Checkin_Outbound'] = 1
result_df['Checkin_Outbound'] = result_df['Checkin_Outbound'].replace(np.nan, 0)


### Operation Airport Reduced
We now reduce the number of different values in the Operation_Airport variable by keeping only the most frequent ones, since there is a huge drop in frequency after the *EST* airport.

In [None]:
train_df.groupby('Operation_Airport').count().sort_values('Operation_Initials', ascending=False).head(10)


The same 4 most frequent airports (KDT, IST, SKW, EST) are found in the test set.

In [None]:
result_df.groupby('Operation_Airport').count().sort_values('Operation_Initials', ascending=False).head(10)


The first 4 airports occur far more often than the rest, and they are the same airports in both the training and test data. So, any airport other than these 4 is replaced with "OTHERS".

In [None]:
train_df['Operation_Airport_Reduced'] = np.where((train_df.Operation_Airport == 'KDT') | (train_df.Operation_Airport == 'IST') | (train_df.Operation_Airport == 'SKW') | (train_df.Operation_Airport == 'EST'), train_df.Operation_Airport, 'OTHERS')
result_df['Operation_Airport_Reduced'] = np.where((result_df.Operation_Airport == 'KDT') | (result_df.Operation_Airport == 'IST') | (result_df.Operation_Airport == 'SKW') | (result_df.Operation_Airport == 'EST'), result_df.Operation_Airport, 'OTHERS')
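
Since the same "keep the frequent values, group the rest" step is repeated for several columns below, it can be wrapped in a small helper. The following is a sketch on toy values; the function name `reduce_categories` and the toy airport codes "ADB" and "ESB" are hypothetical.

```python
import pandas as pd

def reduce_categories(series, keep, other_label="OTHERS"):
    """Keep the values listed in `keep` and replace everything else with `other_label`."""
    # Work on plain objects so a new label can be introduced for categorical input.
    as_text = series.astype(object)
    return as_text.where(as_text.isin(keep), other_label)

airports = pd.Series(["KDT", "IST", "ADB", "SKW", "EST", "ESB"])  # toy values
reduced = reduce_categories(airports, keep=["KDT", "IST", "SKW", "EST"])
print(reduced.tolist())
```

With such a helper, each reduced column only needs its own list of values to keep.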

### Operation Initials Reduced

We followed the same grouping scheme for the Operation_Initials column. There is a huge drop after LK company in terms of count.

In [None]:
train_df.groupby('Operation_Initials').count().sort_values('Operation_Airport', ascending=False).head(10)

In [None]:
result_df.groupby('Operation_Initials').count().sort_values('Operation_Airport', ascending=False).head(10)


The 6 most frequent values differ significantly from the others, and the training set shares this pattern with the test set. We kept the most frequent ones and replaced the rest with "OTHERS".

In [None]:
train_df['Operation_Initials_Reduced'] = np.where((train_df.Operation_Initials == 'KS') | (train_df.Operation_Initials == 'MK') | (train_df.Operation_Initials == 'SC') | (train_df.Operation_Initials == 'Q7') | (train_df.Operation_Initials == 'EY') | (train_df.Operation_Initials == 'LK'), train_df.Operation_Initials, 'OTHERS')
result_df['Operation_Initials_Reduced'] = np.where((result_df.Operation_Initials == 'KS') | (result_df.Operation_Initials == 'MK') | (result_df.Operation_Initials == 'SC') | (result_df.Operation_Initials == 'Q7') | (result_df.Operation_Initials == 'EY') | (result_df.Operation_Initials == 'LK'), result_df.Operation_Initials, 'OTHERS')


In [None]:
train_df.groupby('Operation_Initials_Reduced').count().sort_values('Operation_Airport', ascending=False).head(10)

### Inbound Departure Airport Reduced

The most common *Inbound Departure Airports* are unknowns (those with no inbound flight), IST, SKW and EST in both the training and the test set. So we kept those and grouped the others as OTHERS.

In [None]:
train_df.groupby('Inbound_Departure_Airport').count().sort_values('Operation_Airport', ascending=False).head(10)

In [None]:
result_df.groupby('Inbound_Departure_Airport').count().sort_values('Operation_Airport', ascending=False).head(10)

In [None]:
train_df['Inbound_Departure_Airport_Reduced'] = np.where((train_df.Inbound_Departure_Airport == 'Unknown') | (train_df.Inbound_Departure_Airport == 'IST') | (train_df.Inbound_Departure_Airport == 'SKW') | (train_df.Inbound_Departure_Airport == 'EST'), train_df.Inbound_Departure_Airport, 'OTHERS')
result_df['Inbound_Departure_Airport_Reduced'] = np.where((result_df.Inbound_Departure_Airport == 'Unknown') | (result_df.Inbound_Departure_Airport == 'IST') | (result_df.Inbound_Departure_Airport == 'SKW') | (result_df.Inbound_Departure_Airport == 'EST'), result_df.Inbound_Departure_Airport, 'OTHERS')

### Outbound Arrival Airport Reduced

The most common *Outbound Arrival Airports* are unknowns (those with no outbound flight) and KDT in both the training and the test set. So we kept those and grouped the others as OTHERS.

In [None]:
train_df.groupby('Outbound_Arrival_Airport').count().sort_values('Operation_Airport', ascending=False).head(10)

In [None]:
result_df.groupby('Outbound_Arrival_Airport').count().sort_values('Operation_Airport', ascending=False).head(10)

In [None]:
train_df['Outbound_Arrival_Airport_Reduced'] = np.where((train_df.Outbound_Arrival_Airport == 'Unknown') | (train_df.Outbound_Arrival_Airport == 'KDT'), train_df.Outbound_Arrival_Airport, 'OTHERS')
result_df['Outbound_Arrival_Airport_Reduced'] = np.where((result_df.Outbound_Arrival_Airport == 'Unknown') | (result_df.Outbound_Arrival_Airport == 'KDT'), result_df.Outbound_Arrival_Airport, 'OTHERS')

### Weekend

We thought that the day of the week and weekend/weekday information might be useful. Weekend days are encoded as 1 in the *Weekend* column.

In [None]:
# Monday is 0 in pandas, so 5 and 6 are Saturday and Sunday
train_df['Weekend'] = train_df.Departure_YMD_LMT.dt.weekday.isin([5, 6]).astype(int)
result_df['Weekend'] = result_df.Departure_YMD_LMT.dt.weekday.isin([5, 6]).astype(int)

In [None]:
train_df.Weekend.head()

### Departure Day

Generate *Departure Day* column:

In [None]:
# Series.dt.weekday_name was removed in pandas 1.0; day_name() returns the same day names
train_df['Departure Day'] = train_df['Departure_YMD_LMT'].dt.day_name()

result_df['Departure Day'] = result_df['Departure_YMD_LMT'].dt.day_name()

train_df['Departure Day'].head()

### Day of Month

In [None]:
train_df.insert(1,'Day_of_Month','foo')
train_df['Day_of_Month'] = train_df['Departure_YMD_LMT'].dt.day

result_df.insert(1,'Day_of_Month','foo')
result_df['Day_of_Month'] = result_df['Departure_YMD_LMT'].dt.day

train_df['Day_of_Month'].head()

### Economy Class

Transform categorical *Cabin Class* column to binary variable.

In [None]:
# Named cabin_class_map to avoid shadowing the built-in dict
cabin_class_map = {"Y": 1,
                   "C": 0
                   }
train_df['Economy_Class'] = train_df['Cabin_Class'].map(cabin_class_map)
train_df = train_df.drop("Cabin_Class", axis = 1)
result_df['Economy_Class'] = result_df['Cabin_Class'].map(cabin_class_map)
result_df = result_df.drop("Cabin_Class", axis = 1)
train_df['Economy_Class'].unique()

### Lightly Flying Passengers

A new feature is generated for passengers flying light (with no baggage), since this seems to be a predictor of the operation count.

In [None]:
train_df['Fly_Light'] = np.where(train_df['Passenger_Baggage_Count']==0, 1, 0)
result_df['Fly_Light'] = np.where(result_df['Passenger_Baggage_Count']==0, 1, 0)

### Drop Unnecessary Columns

In [None]:
train_df = train_df.drop(columns = ["Departure_YMD_LMT", 
                                    "Operation_YMD_LMT", 
                                    "Operation_Initials", 
                                    "Operation_Airport",
                                    "Inbound_Departure_Airport",
                                    "Outbound_Arrival_Airport",
                                    "Terminal_Name",
                                    "Early_Check_In"], axis =1) 

## Encoding Categorical Features

Convert objects to category

In [None]:
for col_name in train_df.columns:
    if train_df[col_name].dtype.name == 'object':
        train_df[col_name] = train_df[col_name].astype('category')
        result_df[col_name] = result_df[col_name].astype('category')

In [None]:
train_df.dtypes

Encode categorical columns

In [None]:
train_onehot = train_df.copy()
#train_onehot.drop(columns = ["Operation_Initials", "Terminal_Name"], axis =1)
result_onehot = result_df.copy()
# Iterate over train_df.columns (not train_onehot.columns), since train_onehot changes inside the loop
for col_name in train_df.columns:
    if train_onehot[col_name].dtype.name == 'category':
        print(col_name)
        one_hot = pd.get_dummies(train_df[col_name], prefix = col_name)
        train_onehot = train_onehot.drop(col_name,axis = 1)
        train_onehot = train_onehot.join(one_hot)

In [None]:
train_onehot.columns
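
When the training and test sets are encoded separately, a category that is absent from one of them produces a different set of dummy columns. The sketch below shows one common way to align the two layouts with `reindex`; the toy frames are hypothetical and this is not necessarily how the original submission handled it.

```python
import pandas as pd

train_toy = pd.DataFrame({"Passenger_Gender": ["M", "F", "M"], "Direct_Flight": [1, 0, 1]})
test_toy = pd.DataFrame({"Passenger_Gender": ["M", "M"], "Direct_Flight": [0, 1]})

train_enc = pd.get_dummies(train_toy, columns=["Passenger_Gender"])
test_enc = pd.get_dummies(test_toy, columns=["Passenger_Gender"])

# Align the test columns to the training layout:
# dummies missing from the test set are added as 0, unseen extras are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```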

## Correlation Matrix

Upon observing the correlation matrix, we first noticed that *Passenger Baggage Count* is highly correlated with *Passenger Baggage Weight*. These columns are collinear, so we are going to include only one of them in our model. We decided to continue with *Passenger Baggage Count* since we later observed that this feature has a higher importance weight. The flag for passengers flying light with no baggage correlates negatively with the baggage count and weight columns, as expected. CIPs with high commercial value tend to belong to a loyalty program.

In [None]:
import seaborn as sns
corr = train_df.select_dtypes(include=['number', 'bool']).corr()  # correlation is computed on numeric columns only
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
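
Besides reading the heatmap, strongly correlated pairs can be listed programmatically. The sketch below does this on toy data; the helper name `correlated_pairs`, the 0.8 threshold and the toy values are assumptions chosen for illustration.

```python
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    # Visit each unordered pair once (upper triangle of the matrix).
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

toy = pd.DataFrame({
    "Passenger_Baggage_Count": [0, 1, 2, 3, 1],
    "Passenger_Baggage_Weight": [0, 12, 25, 33, 10],
    "Day_of_Month": [5, 5, 20, 1, 17],
})
pairs = correlated_pairs(toy)
print(pairs)
```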

## Visualization

Curious behavior of the operation count:
Below we see the *Operation Count* distribution. It is positively skewed. The large majority of passengers have 1 operation, and the population is concentrated below 20 operations. There is an outlier with 129 operations.

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1,figsize=(10, 8))
train_df["Operation_Count"].hist(bins=500, color="blue", ax=ax)


The mean operation count is roughly uniform over the days of the week. Fridays may be slightly busier than the other days.

In [None]:
import seaborn as sns
sns.barplot(x='Departure Day',y='Operation_Count',data=train_df)

The mean operation count is also nearly the same for both values of the weekend flag. So, there is no visible difference in the number of operations between weekdays and weekends.

In [None]:
sns.barplot(x='Weekend',y='Operation_Count',data=train_df)

Direct flight seems to be an important predictor of the operation count.

In [None]:
sns.barplot(x='Direct_Flight',y='Operation_Count',data=train_df)

Operation count by *Passenger_Baggage_Count*:

In [None]:
sns.barplot(x='Passenger_Baggage_Count',y='Operation_Count',data=train_df)

In [None]:
sns.barplot(x='Fly_Light',y='Operation_Count',data=train_df)

In [None]:
sns.histplot(train_df['Passenger_Baggage_Weight'], kde=True)  # distplot is deprecated in recent seaborn releases

## Feature Importance

Since no strong correlation with the target is found within the data, an automatic feature selection method is employed. LightGBM is a gradient boosted tree based algorithm. Unlike many other tree based algorithms, it grows trees leaf-wise (vertically), i.e. it chooses the leaf with the maximum loss reduction and grows the tree from there. Details of this algorithm can be found in the LightGBM documentation.



Important Features:
---

+ Day_of_Month
+ SWC_FQTV_Member
+ Passenger_Baggage_Count
+ Direct_Flight
+ Early_Check_In_Status_Early
+ Economy_Class
+ SWC_CIP_Passenger
+ Terminal_Number_?
+ Passenger_Gender_M
+ Operation_Initials_Reduced_MK
+ Operation_Channel_TS

In [None]:
import lightgbm as lgb

#train_onehot = train_onehot.drop(columns = ["Departure_YMD_LMT", "Operation_YMD_LMT", "Operation_Initials", "Operation_Airport"], axis =1)
target = train_onehot["Operation_Count"]
train = train_onehot.drop(["Operation_Count"], axis = 1)
# Fit a LightGBM regressor with default parameters
gbm = lgb.LGBMRegressor()
gbm.fit(train, target)

# Importance of each attribute (LightGBM's default counts the splits that use the feature)
fea_imp_ = pd.DataFrame({'cols':train.columns, 'fea_imp':gbm.feature_importances_})
fea_imp_.loc[fea_imp_.fea_imp > 0].sort_values(by=['fea_imp'], ascending = False)

## Train/Validation/Test Split

In [None]:
from sklearn.model_selection import train_test_split
# define target
y = train_onehot.Operation_Count
# define features
X = train_onehot.drop(columns = ["Operation_Count"])
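# random 80/20 split (stratifying on Operation_Channel_Group is left disabled)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42) # stratify=X_train.Operation_Channel_Group, 
# split the remaining 80% again: 64% training and 16% validation overall
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.20, random_state=42)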
print('Training Features Shape:', X_train.shape)
print('Training Labels Shape:', y_train.shape)
print('Validation Features Shape:', X_val.shape)
print('Validation Labels Shape:', y_val.shape)
print('Testing Features Shape:', X_test.shape)
print('Testing Labels Shape:', y_test.shape)
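
The split above is purely random. If stratification were wanted, one option is to stratify on whether the target equals 1, since that class dominates. The sketch below shows this on synthetic data; the synthetic target, the choice of stratification flag and all variable names are assumptions and were not used in the original run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
# Toy target dominated by 1s, loosely imitating Operation_Count.
y_toy = np.where(rng.random(1000) < 0.7, 1, rng.integers(2, 6, size=1000))

# First split: 80% train+validation, 20% test, stratified on the "is 1" flag.
X_tv, X_te, y_tv, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=(y_toy == 1))
# Second split: 20% of the remaining 80% becomes validation (16% overall).
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.20, random_state=42, stratify=(y_tv == 1))

print(len(y_tr), len(y_va), len(y_te))
```

Stratifying keeps the share of single-operation rows nearly identical across the three subsets.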

## Feature Normalization

Not needed for decision tree (DT) based algorithms.

## Model Building

### Baseline Model: Random Forest

We first used all the features available.

### MP Model:

We see that y values are dominated by '1's. Our first aim is to classify the y values as '1's and 'others'.  Then we will use linear regression to find a relation among 'others'.


#### Training and Validation of the model

**PART 1: Classification**

The values that are different than '1' are set to zero 



In [None]:
y_train_log = np.where(y_train == 1, 1, 0)
y_val_log = np.where(y_val == 1, 1, 0)
y_test_log = np.where(y_test == 1, 1, 0)

Logistic Regression is employed:

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train_log)

*Validation for Part 1:*

A new column is created to encode the first prediction for validation data. 

In [None]:
X_val['Prediction1'] = logreg.predict(X_val)
y_val_log_pred=X_val['Prediction1']

**PART 2: Linear Regression**

In this part, we will train the data set corresponding to y values different than 1.







In [None]:
X_train_multi = X_train[y_train_log == 0]
y_train_multi = y_train[y_train_log == 0]

X_val_multi = X_val[y_val_log_pred == 0]
y_val_multi = y_val[y_val_log_pred == 0]

Lasso regression is used

In [None]:
from sklearn.linear_model import LinearRegression, LassoCV

reg = LassoCV()
reg.fit(X_train_multi, y_train_multi)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
# Score (R^2) on the subset the model was trained on, i.e. rows with Operation_Count != 1
print("Best score using built-in LassoCV: %f" % reg.score(X_train_multi, y_train_multi))
coef = pd.Series(reg.coef_, index = X_train_multi.columns)


#regressor = LinearRegression()  
#regressor.fit(X_train_multi, y_train_multi) #training the algorithm

In [None]:
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +  str(sum(coef == 0)) + " variables")

Feature importance:

In [None]:
imp_coef = coef.sort_values()
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")

Columns are chosen according to above Lasso model

In [None]:
cols = ['Fly_Light', 'Operation_Initials_Reduced_SC', 'Direct_Flight', 'Operation_Initials_Reduced_EY', 'Economy_Class', 'Passenger_Gender_M', 
        'Inbound_Departure_Airport_Reduced_Unknown', 'Early_Check_In_Status_Early', 'Operation_Channel_TS', 'SWC_FLY']

In [None]:
X_train_multi = X_train_multi[cols]
X_val_multi = X_val[cols]

Chosen columns are used in linear regression:

In [None]:
regressor = LinearRegression()  
regressor.fit(X_train_multi, y_train_multi) #training the algorithm

X_val['Prediction2'] = regressor.predict(X_val_multi)

In [None]:
X_val['Prediction2'] = X_val['Prediction2'].round()
y_val_multi_pred= X_val['Prediction2']

Data type is changed to integer:

In [None]:
X_val['Prediction2'] = X_val['Prediction2'].astype('int64')

The final prediction combining the previous results is encoded in a third column called Prediction_fin.

In [None]:
X_val['Prediction_fin'] = np.where((X_val.Prediction1 == 1), 1, X_val.Prediction2)

In [None]:
X_val['Prediction_fin'] = X_val['Prediction_fin'].astype('int64')
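
The two stages can be summarized as a single prediction function. The sketch below reproduces the idea on synthetic data: a classifier decides whether the count is 1, and a linear regression trained only on the other rows supplies the remaining predictions. The synthetic data, the function name `two_stage_predict` and the clipping of predictions to at least 1 are assumptions that are not part of the original cells.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 2))
# Toy target: 1 for most rows, larger counts when the first feature is high.
y_toy = np.where(X_toy[:, 0] > 0.8, 2 + np.round(np.abs(X_toy[:, 1]) * 2), 1).astype(int)

# Stage 1: classify "exactly one operation" versus "more than one".
clf = LogisticRegression().fit(X_toy, (y_toy == 1).astype(int))
# Stage 2: regress the count using only the rows whose true count is not 1.
reg = LinearRegression().fit(X_toy[y_toy != 1], y_toy[y_toy != 1])

def two_stage_predict(X):
    """Predict 1 where stage 1 says so, else the rounded stage-2 estimate (at least 1)."""
    is_one = clf.predict(X) == 1
    counts = np.maximum(np.round(reg.predict(X)), 1).astype(int)
    return np.where(is_one, 1, counts)

pred = two_stage_predict(X_toy)
print(np.bincount(pred)[:5])
```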

Accuracy check:

In [None]:
from sklearn.metrics import accuracy_score

score = accuracy_score(y_val, X_val['Prediction_fin'])  # argument order: (y_true, y_pred)
score


Error metrics:

In [None]:
# Calculate the absolute errors
errors = abs(X_val['Prediction_fin'] - y_val)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'operations.')
# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / y_val)
# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
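
Note that the "accuracy" printed here is 100 minus the mean absolute percentage error (MAPE), which is a different quantity from the exact-match `accuracy_score` computed earlier. The worked example below, on hypothetical values, shows how the two error measures are obtained; the helper name `mae_and_mape` is an assumption.

```python
import numpy as np

def mae_and_mape(y_true, y_pred):
    """Return (mean absolute error, mean absolute percentage error in %)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = np.abs(y_pred - y_true)
    mae = errors.mean()
    mape = (errors / y_true).mean() * 100  # assumes y_true contains no zeros
    return mae, mape

# Absolute errors are [0, 1, 1, 0]; percentage errors are [0, 50, 25, 0].
mae, mape = mae_and_mape([1, 2, 4, 1], [1, 1, 5, 1])
print(round(mae, 2), round(mape, 2), round(100 - mape, 2))
```

For these four values the MAE is 0.5 operations and the MAPE is 18.75%, so the MAPE-based "accuracy" is 81.25% while only half of the predictions match exactly.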

## Test and Evaluation



Applying the same steps on the test data:

In [None]:
X_test['Prediction1'] = logreg.predict(X_test)
y_test_log_pred=X_test['Prediction1']

In [None]:
X_test_multi = X_test[cols]

In [None]:
X_test['Prediction2'] = regressor.predict(X_test_multi)

In [None]:
X_test['Prediction2'] = X_test['Prediction2'].round()
y_test_multi_pred= X_test['Prediction2']

In [None]:
X_test['Prediction2'] = X_test['Prediction2'].astype('int64')

In [None]:
X_test['Prediction_finito'] = np.where((X_test.Prediction1 == 1), 1, X_test.Prediction2)

In [None]:
X_test['Prediction_finito'] = X_test['Prediction_finito'].astype('int64')

In [None]:
# Calculate the absolute errors
errors = abs(X_test['Prediction_finito'] - y_test)
# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'operations.')
# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / y_test)
# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

## Prediction

Apply the trained models to result_df:

In [None]:
result_df = result_df.drop(columns = ["Departure_YMD_LMT", 
                                    "Operation_YMD_LMT", 
                                    "Operation_Initials", 
                                    "Operation_Airport",
                                    "Inbound_Departure_Airport",
                                    "Outbound_Arrival_Airport",
                                    "Terminal_Name",
                                    "Early_Check_In",
                                       "Operation_Count"], axis =1)

In [None]:
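result_onehot = result_df.copy()
# Use a separate loop variable so the `cols` feature list chosen above is not overwritten
for col_name in result_df.columns:
    if result_onehot[col_name].dtype.name == 'category':
        print(col_name)
        # Encode the result set itself (encoding train_df here would join training rows onto it)
        one_hot = pd.get_dummies(result_df[col_name], prefix = col_name)
        result_onehot = result_onehot.drop(col_name,axis = 1)
        result_onehot = result_onehot.join(one_hot)
# Align columns with the training features; dummies missing from the result set are filled with 0
result_onehot = result_onehot.reindex(columns = X.columns, fill_value = 0)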


In [None]:
result_onehot['Prediction1'] = logreg.predict(result_onehot)


In [None]:
result_onehot_multi = result_onehot[cols]

In [None]:
result_onehot['Prediction2'] = regressor.predict(result_onehot_multi)

## Submission

## References


1.   [Impute Missing Values](https://jamesrledoux.com/code/imputation)
2.   [Is it better to drop or impute values from data sets when applying ML, or would it be better to label them as 'missing' for categorical variables?](https://www.quora.com/Is-it-better-to-drop-or-impute-values-from-data-sets-when-applying-ML-or-would-it-be-better-to-label-them-as-missing-for-categorical-variables)

