# CS7324 - Lab 3: Extending Logistic Regression
### Jarad Angel, Zach Bohl, and Luigi Allen

Airline Passenger Satisfaction Dataset: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction

# Preparation and Overview

### Task Explanation and Business Case

The dataset was originally built to predict customer satisfaction, which has only two outcomes. To fulfill the lab requirement that the prediction task involve three or more classes, we are exploring an alternative classification problem within the Airline Passenger Satisfaction dataset: predicting the travel class of each passenger (i.e., Business, Economy, or Economy Plus). Here is how we are framing this task:

### Classification Task: Predict Passenger Travel Class
This task involves predicting whether a passenger is flying in Business Class, Economy Class, or Economy Plus based on various features available in the dataset. To predict travel class effectively, several features from the dataset are likely to be important. Here are some key features that can significantly influence the prediction:

1. **Flight Distance:** Longer flights might have a higher proportion of Business and Economy Plus passengers.
2. **In-flight Services:** Ratings for Wi-Fi, entertainment, and seat comfort can indicate the class, as Business class typically offers superior services.
3. **Baggage Handling:** Satisfaction with baggage handling might correlate with higher travel classes.
4. **Customer Satisfaction:** Overall satisfaction scores can be higher for Business and Economy Plus passengers.
5. **Flight Delays:** Passengers in higher classes might experience fewer delays or have different perceptions of delays.
6. **Age:** Different age groups might prefer different travel classes.
7. **Gender:** There might be trends in travel class preferences based on gender.
8. **Type of Traveler:** Business travelers are more likely to be in Business class, while leisure travelers might prefer Economy or Economy Plus.


### Business Use-Case:
The goal of predicting travel class can provide insights into passenger behavior and help airlines such as American Airlines (AA) to optimize their service offerings and revenue management strategies. Airlines could use this information for:

- **Revenue Optimization:** Understanding the factors that influence travel class choices allows airlines to adjust pricing, promotions, and services to encourage passengers to upgrade to higher classes (e.g., from Economy to Economy Plus or Business).

- **Passenger Segmentation:** Airlines can use this model to tailor services to specific passenger segments, such as offering personalized upgrades or add-on services to Economy Plus passengers likely to switch to Business class. Currently, American Airlines does not offer its lowest-tier (Economy) customers upgrades to the top tier (Business class).

- **Operational Efficiency:** The model can help forecast the demand for each travel class on specific routes, allowing airlines to adjust flight capacity (e.g., increasing Business class seats on popular routes).

### Interested Parties:
The parties at American Airlines that would be interested in the results are as follows:

- **Revenue Management Teams:** They can use the model to identify passengers more likely to purchase upgrades or additional services, optimizing pricing strategies for different classes.

- **Customer Experience Teams:** Understanding class preferences allows these teams to tailor in-flight experiences, loyalty rewards, and offers based on predicted class choice.

- **Airline Executives:** Senior leadership can make data-driven decisions regarding fleet configuration (e.g., more Business class seats on high-demand routes).

- **Marketing Teams:** These teams can leverage insights to create targeted promotions and campaigns for each class of traveler, improving customer acquisition and retention.

### Offline Analysis vs. Deployed Model:
- **Offline Analysis:** This model can be used to identify travel class trends for various passenger demographics, seasonality, or routes. This helps in revenue forecasting, adjusting pricing strategies, and configuring flight classes.

- **Deployed Model:** It can also be deployed in the booking system to predict and offer real-time travel class upgrades or special offers for passengers who are likely to switch from Economy to Economy Plus or Business.

### Performance Requirements:
To be valuable for real-time deployment or offline analysis, the model should ideally achieve at least 80% accuracy. However, since American Airlines makes significant revenue from upgrades, precision and recall for Business class predictions may matter more than overall accuracy. Airlines would aim for high recall (e.g., >90%) in identifying potential Business class passengers, as each missed passenger is a missed upgrade opportunity that could directly impact revenue.

The classifier's performance could also be evaluated using F1-score, particularly for underrepresented classes (e.g., Economy Plus, which might have fewer passengers than Economy).
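As a minimal sketch of how these targets could be checked, the snippet below computes per-class recall and F1 with scikit-learn. The `y_true` and `y_pred` lists are made-up illustrative labels, not results from this dataset; only the class names match the values of the `Class` column.

```python
from sklearn.metrics import f1_score, recall_score

# Class names as they appear in the dataset's 'Class' column
labels = ['Business', 'Eco', 'Eco Plus']

# Hypothetical ground truth and predictions, for illustration only
y_true = ['Business', 'Business', 'Eco', 'Eco', 'Eco Plus', 'Eco Plus', 'Business', 'Eco']
y_pred = ['Business', 'Eco', 'Eco', 'Eco', 'Eco', 'Eco Plus', 'Business', 'Eco']

# average=None returns one score per class, in the order given by `labels`
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)
per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)

for name, rec, f1 in zip(labels, per_class_recall, per_class_f1):
    print(f'{name:>9}: recall={rec:.2f}  f1={f1:.2f}')
```

Reporting the scores per class, rather than a single averaged number, makes it easy to compare Business class recall against the 90% target separately from the smaller Economy Plus class.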

In [371]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Loading the dataset and displaying feature information for analysis
df = pd.read_csv('Archive/train.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Unnamed: 0                         103904 non-null  int64  
 1   id                                 103904 non-null  int64  
 2   Gender                             103904 non-null  object 
 3   Customer Type                      103904 non-null  object 
 4   Age                                103904 non-null  int64  
 5   Type of Travel                     103904 non-null  object 
 6   Class                              103904 non-null  object 
 7   Flight Distance                    103904 non-null  int64  
 8   Inflight wifi service              103904 non-null  int64  
 9   Departure/Arrival time convenient  103904 non-null  int64  
 10  Ease of Online booking             103904 non-null  int64  
 11  Gate location                      1039

In [372]:
print(df.describe())

          Unnamed: 0             id            Age  Flight Distance  \
count  103904.000000  103904.000000  103904.000000    103904.000000   
mean    51951.500000   64924.210502      39.379706      1189.448375   
std     29994.645522   37463.812252      15.114964       997.147281   
min         0.000000       1.000000       7.000000        31.000000   
25%     25975.750000   32533.750000      27.000000       414.000000   
50%     51951.500000   64856.500000      40.000000       843.000000   
75%     77927.250000   97368.250000      51.000000      1743.000000   
max    103903.000000  129880.000000      85.000000      4983.000000   

       Inflight wifi service  Departure/Arrival time convenient  \
count          103904.000000                      103904.000000   
mean                2.729683                           3.060296   
std                 1.327829                           1.525075   
min                 0.000000                           0.000000   
25%                 2.000

In [373]:
# A look at the entire dataframe to understand all of the available features and sample values
pd.set_option('display.max_columns', None) # Display option to show all columns
df

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,3,1,5,3,5,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,3,1,3,1,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,2,5,5,5,5,4,3,4,4,4,5,0,0.0,satisfied
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,5,2,2,2,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,3,4,5,5,3,3,4,4,3,3,3,0,0.0,satisfied
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103899,103899,94171,Female,disloyal Customer,23,Business travel,Eco,192,2,1,2,3,2,2,2,2,3,1,4,2,3,2,3,0.0,neutral or dissatisfied
103900,103900,73097,Male,Loyal Customer,49,Business travel,Business,2347,4,4,4,4,2,4,5,5,5,5,5,5,5,4,0,0.0,satisfied
103901,103901,68825,Male,disloyal Customer,30,Business travel,Business,1995,1,1,1,3,4,1,5,4,3,2,4,5,5,4,7,14.0,neutral or dissatisfied
103902,103902,54173,Female,disloyal Customer,22,Business travel,Eco,1000,1,1,1,5,1,1,1,1,4,5,1,5,4,1,0,0.0,neutral or dissatisfied


In [374]:
# Let's combine some of the ages into buckets...
# Define the age ranges for each category
bins = [0, 12, 25, 65, np.inf]
labels = ['Child', 'Young Adult', 'Adult', 'Elderly']

# Create a new column with the age categories
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# Remove the 'Age' column in-place
# df.drop('Age', axis=1, inplace=True)

df

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction,Age_Group
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,3,1,5,3,5,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied,Young Adult
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,3,1,3,1,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied,Adult
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,2,5,5,5,5,4,3,4,4,4,5,0,0.0,satisfied,Adult
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,5,2,2,2,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied,Adult
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,3,4,5,5,3,3,4,4,3,3,3,0,0.0,satisfied,Adult
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103899,103899,94171,Female,disloyal Customer,23,Business travel,Eco,192,2,1,2,3,2,2,2,2,3,1,4,2,3,2,3,0.0,neutral or dissatisfied,Young Adult
103900,103900,73097,Male,Loyal Customer,49,Business travel,Business,2347,4,4,4,4,2,4,5,5,5,5,5,5,5,4,0,0.0,satisfied,Adult
103901,103901,68825,Male,disloyal Customer,30,Business travel,Business,1995,1,1,1,3,4,1,5,4,3,2,4,5,5,4,7,14.0,neutral or dissatisfied,Adult
103902,103902,54173,Female,disloyal Customer,22,Business travel,Eco,1000,1,1,1,5,1,1,1,1,4,5,1,5,4,1,0,0.0,neutral or dissatisfied,Young Adult
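Because `right=False` makes each bin closed on the left and open on the right, boundary ages land in the older group. The short check below uses the same `bins` and `labels` as the cell above on a handful of hypothetical ages chosen to sit on the bin edges.

```python
import numpy as np
import pandas as pd

bins = [0, 12, 25, 65, np.inf]
labels = ['Child', 'Young Adult', 'Adult', 'Elderly']

# Hypothetical ages placed on and just below each bin edge
ages = pd.Series([11, 12, 24, 25, 64, 65])

# right=False gives the intervals [0, 12), [12, 25), [25, 65), [65, inf)
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(pd.DataFrame({'Age': ages, 'Age_Group': groups}))
```

This matches the output above, where a 25-year-old is labeled `Adult` and a 13-year-old is labeled `Young Adult`.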


In [376]:
# Remove the 'Ease of Online booking' column in-place
df.drop('Ease of Online booking', axis=1, inplace=True)

# Remove the saved row-index column and the passenger id; neither is a useful feature
df.drop(['Unnamed: 0', 'id'], axis=1, inplace=True)

# Drop rows with missing values
df.dropna(inplace=True)

df

Unnamed: 0,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction,Age_Group
0,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,1,5,3,5,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied,Young Adult
1,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,1,3,1,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied,Adult
2,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,5,5,5,5,4,3,4,4,4,5,0,0.0,satisfied,Adult
3,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,2,2,2,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied,Adult
4,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,4,5,5,3,3,4,4,3,3,3,0,0.0,satisfied,Adult
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103899,Female,disloyal Customer,23,Business travel,Eco,192,2,1,3,2,2,2,2,3,1,4,2,3,2,3,0.0,neutral or dissatisfied,Young Adult
103900,Male,Loyal Customer,49,Business travel,Business,2347,4,4,4,2,4,5,5,5,5,5,5,5,4,0,0.0,satisfied,Adult
103901,Male,disloyal Customer,30,Business travel,Business,1995,1,1,3,4,1,5,4,3,2,4,5,5,4,7,14.0,neutral or dissatisfied,Adult
103902,Female,disloyal Customer,22,Business travel,Eco,1000,1,1,5,1,1,1,1,4,5,1,5,4,1,0,0.0,neutral or dissatisfied,Young Adult


## Transforming Categorical Data


In [377]:
# Converting Gender from string to binary

# Create a dictionary for mapping Male and Female
gender_map = {'Male': 0, 'Female': 1}

# Store the mapped values in a new column, then drop the original string column.
# Any value not in the map is kept as-is by fillna.
df['Gender_Numeric'] = df['Gender'].map(gender_map).fillna(df['Gender'])
df.drop('Gender', axis=1, inplace=True)
df

Unnamed: 0,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction,Age_Group,Gender_Numeric
0,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,1,5,3,5,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied,Young Adult,0
1,disloyal Customer,25,Business travel,Business,235,3,2,3,1,3,1,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied,Adult,0
2,Loyal Customer,26,Business travel,Business,1142,2,2,2,5,5,5,5,4,3,4,4,4,5,0,0.0,satisfied,Adult,1
3,Loyal Customer,25,Business travel,Business,562,2,5,5,2,2,2,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied,Adult,1
4,Loyal Customer,61,Business travel,Business,214,3,3,3,4,5,5,3,3,4,4,3,3,3,0,0.0,satisfied,Adult,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103899,disloyal Customer,23,Business travel,Eco,192,2,1,3,2,2,2,2,3,1,4,2,3,2,3,0.0,neutral or dissatisfied,Young Adult,1
103900,Loyal Customer,49,Business travel,Business,2347,4,4,4,2,4,5,5,5,5,5,5,5,4,0,0.0,satisfied,Adult,0
103901,disloyal Customer,30,Business travel,Business,1995,1,1,3,4,1,5,4,3,2,4,5,5,4,7,14.0,neutral or dissatisfied,Adult,0
103902,disloyal Customer,22,Business travel,Eco,1000,1,1,5,1,1,1,1,4,5,1,5,4,1,0,0.0,neutral or dissatisfied,Young Adult,1


In [378]:
# Testing one-hot encoding on the remaining string columns
from sklearn.preprocessing import OneHotEncoder

# Encode categorical variables
categorical_features = df.select_dtypes(include=['object']).columns
# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False).set_output(transform='pandas')
df_encoded = pd.DataFrame(onehot_encoder.fit_transform(df[['Customer Type', 'Type of Travel', 'Class', 'satisfaction']]))

# # Replace the original categorical columns with the encoded ones
# df_encoded.index = df.index
# df = df.drop(categorical_features, axis=1)
# df = pd.concat([df, df_encoded], axis=1)

# Ensure all column names are strings
#df.columns = df.columns.astype(str)
# # Scale numeric variables
# numeric_features = df.select_dtypes(include=['int64', 'float64']).columns
# scaler = StandardScaler()
# df[numeric_features] = scaler.fit_transform(df[numeric_features])

# # Describe the final dataset
# print(df.describe())

# # Breakdown of variables after preprocessing
# numeric_stats = df[numeric_features].describe().T
# categorical_stats = df_encoded.describe().T

# print("Numeric Features Stats:")
# print(numeric_stats)

# print("Categorical Features Stats:")
# print(categorical_stats)

df_encoded



Unnamed: 0,Customer Type_Loyal Customer,Customer Type_disloyal Customer,Type of Travel_Business travel,Type of Travel_Personal Travel,Class_Business,Class_Eco,Class_Eco Plus,satisfaction_neutral or dissatisfied,satisfaction_satisfied
0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0
1,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
4,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...
103899,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
103900,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
103901,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
103902,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


In [383]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


# Encode categorical variables
categorical_features = df.select_dtypes(include=['object']).columns
onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
df_encoded = pd.DataFrame(onehot_encoder.fit_transform(df[categorical_features]))

# Replace the original categorical columns with the encoded ones
df_encoded.index = df.index
df = df.drop(categorical_features, axis=1)
df = pd.concat([df, df_encoded], axis=1)

# Ensure all column names are strings
df.columns = df.columns.astype(str)

# # Scale numeric variables
# numeric_features = df.select_dtypes(include=['int64', 'float64']).columns
# scaler = StandardScaler()
# df[numeric_features] = scaler.fit_transform(df[numeric_features])

# Describe the final dataset
# print(df.describe())

# Breakdown of variables after preprocessing
# numeric_stats = df[numeric_features].describe().T

# Check if df_encoded is empty before describing
# if not df_encoded.empty:
#     categorical_stats = df_encoded.describe().T
#     print("Categorical Features Stats:")
#     print(categorical_stats)
# else:
#     print("No categorical features to describe.")

# print("Numeric Features Stats:")
# print(numeric_stats)

df

Unnamed: 0,Age,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,Age_Group,Gender_Numeric,0,1,2,3,4,5,6,7,8
0,13,460,3,4,1,5,3,5,5,4,3,4,4,5,5,25,18.0,Young Adult,0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0
1,25,235,3,2,3,1,3,1,1,1,5,3,1,4,1,1,6.0,Adult,0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
2,26,1142,2,2,2,5,5,5,5,4,3,4,4,4,5,0,0.0,Adult,1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
3,25,562,2,5,5,2,2,2,2,2,5,3,1,4,2,11,9.0,Adult,1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
4,61,214,3,3,3,4,5,5,3,3,4,4,3,3,3,0,0.0,Adult,0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103899,23,192,2,1,3,2,2,2,2,3,1,4,2,3,2,3,0.0,Young Adult,1,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
103900,49,2347,4,4,4,2,4,5,5,5,5,5,5,5,4,0,0.0,Adult,0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
103901,30,1995,1,1,3,4,1,5,4,3,2,4,5,5,4,7,14.0,Adult,0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
103902,22,1000,1,1,5,1,1,1,1,4,5,1,5,4,1,0,0.0,Young Adult,1,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0


In [380]:
#one hot encoding
# df = pd.get
# Displaying the first 5 rows of the dataset    


In [381]:
# Dimensionality reduction

# Modeling

In [48]:
##### Copying the instructor's Python class for Binary Logistic Regression

import numpy as np
class BinaryLogisticRegressionBase:
    # private:
    def __init__(self, eta, iterations=20):
        self.eta = eta
        self.iters = iterations
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    
    def __str__(self):
        return 'Base Binary Logistic Regression Object, Not Trainable'
    
    # convenience, private and static:
    @staticmethod
    def _sigmoid(theta):
        return 1/(1+np.exp(-theta)) 
    
    @staticmethod
    def _add_intercept(X):
        return np.hstack((np.ones((X.shape[0],1)),X)) # add bias term
    
    # public:
    def predict_proba(self, X, add_intercept=True):
        # add bias term if requested
        Xb = self._add_intercept(X) if add_intercept else X
        # A single sample (1-D) becomes shape (1, n_features); a batch (2-D) is left unchanged,
        # so predict() also works on a full test matrix
        Xb = np.atleast_2d(np.asarray(Xb))
        return self._sigmoid(Xb @ self.w_) # return the probability y=1
    
    def predict(self,X):
        return (self.predict_proba(X)>0.5) #return the actual prediction
    
    
        
blr = BinaryLogisticRegressionBase(0.1)
print(blr)

Base Binary Logistic Regression Object, Not Trainable


In [49]:

# inherit from base class
class BinaryLogisticRegression(BinaryLogisticRegressionBase):
    #private:
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'Binary Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
        else:
            return 'Untrained Binary Logistic Regression Object'


    @property
    def coef_(self):
        if(hasattr(self,'w_')):
            return self.w_[1:]
        else:
            return None

    @property
    def intercept_(self):
        if(hasattr(self,'w_')):
            return self.w_[0]
        else:
            return None

        
    def _get_gradient(self,X,y):
        # programming \sum_i (yi-g(xi))xi
        gradient = np.zeros(self.w_.shape) # set gradient to zero
        for (xi,yi) in zip(X,y):
            xi = np.array(xi)
            # the actual update inside of sum
            gradi = (yi - self.predict_proba(xi,add_intercept=False))*xi 
            # reshape to be column vector and add to gradient
            gradient += gradi.reshape(self.w_.shape) 
        
        return gradient/float(len(y))
       
    # public:
    def fit(self, X, y):
        Xb = self._add_intercept(X) # add bias term
        num_samples, num_features = Xb.shape
        
        self.w_ = np.zeros((num_features,1)) # init weight vector to zeros
        
        # for as many as the max iterations
        for _ in range(self.iters):
            gradient = self._get_gradient(Xb,y)
            self.w_ += gradient*self.eta # multiply by learning rate 
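The per-sample loop in `_get_gradient` computes the averaged sum $\frac{1}{N}\sum_i (y_i - g(x_i))\,x_i$. The same quantity can be written as a single matrix product, which is usually much faster on a dataset of this size. The sketch below compares the two forms on random data; `loop_gradient` and `vectorized_gradient` are standalone helper names introduced here for illustration and are not part of the instructor's class.

```python
import numpy as np

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))

def loop_gradient(Xb, y, w):
    # Mirrors the per-sample loop: sum_i (y_i - g(x_i)) x_i, averaged over samples
    grad = np.zeros(w.shape)
    for xi, yi in zip(Xb, y):
        p = sigmoid(xi @ w)                      # shape (1,)
        grad += ((yi - p) * xi).reshape(w.shape)
    return grad / float(len(y))

def vectorized_gradient(Xb, y, w):
    # Same quantity as one matrix product: Xb^T (y - g(Xb w)) / N
    residual = y.reshape(-1, 1) - sigmoid(Xb @ w)  # shape (N, 1)
    return (Xb.T @ residual) / float(len(y))

# Random data with an intercept column, for checking only
rng = np.random.default_rng(0)
Xb = np.hstack((np.ones((50, 1)), rng.normal(size=(50, 3))))
y = rng.integers(0, 2, size=50)
w = rng.normal(size=(4, 1))

g_loop = loop_gradient(Xb, y, w)
g_vec = vectorized_gradient(Xb, y, w)
print(np.allclose(g_loop, g_vec))
```

Both functions return a column vector with the same shape as the weights, so either could feed the update `self.w_ += gradient*self.eta`.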


In [50]:
import pandas as pd

dfTest = pd.read_csv('archive/test.csv')
dfTrain = pd.read_csv('archive/train.csv')

# This line will map the satisfaction categories to 1 and 0
dfTest['satisfaction'] = dfTest['satisfaction'].map({'satisfied': 1, 'neutral or dissatisfied': 0})

# You can use this line to check the first few rows of the DataFrame to ensure the changes are correct
dfTest.head()

# This line will map the satisfaction categories to 1 and 0
dfTrain['satisfaction'] = dfTrain['satisfaction'].map({'satisfied': 1, 'neutral or dissatisfied': 0})

# You can use this line to check the first few rows of the DataFrame to ensure the changes are correct
dfTrain.head()

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,...,5,4,3,4,4,5,5,25,18.0,0
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,...,1,1,5,3,1,4,1,1,6.0,0
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,...,5,4,3,4,4,4,5,0,0.0,1
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,...,2,2,5,3,1,4,2,11,9.0,0
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,...,3,3,4,4,3,3,3,0,0.0,1


In [51]:
# Stack the test and train DataFrames (identical columns) into a single DataFrame
df = pd.concat([dfTest, dfTrain], axis=0, ignore_index=True)


In [52]:
# Train/Test split

from sklearn.model_selection import train_test_split
import plotly

X = df
y = df.satisfaction

X = np.array(X)
y = np.array(y)

X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, test_size=0.2)


In [53]:
%%time
# Now we can train the classifier
blr = BinaryLogisticRegression(eta=0.1,iterations=12)
blr.fit(X_train,y_train)
print(blr)

TypeError: can't multiply sequence by non-int of type 'float'
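This `TypeError` is consistent with `X` being an object-dtype array: `np.array(df)` keeps the string columns (`Gender`, `Customer Type`, and so on), and multiplying a Python string by a float weight raises exactly this message. `X` also still contains the `satisfaction` column that is being predicted. A minimal sketch of building a purely numeric feature matrix follows, assuming a small made-up frame (`toy`) with a subset of the dataset's column names; the standardization step is an optional addition that tends to help gradient descent converge.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined dataframe (hypothetical rows)
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Class': ['Eco', 'Business', 'Eco Plus', 'Business'],
    'Flight Distance': [460, 235, 1142, 562],
    'Arrival Delay in Minutes': [18.0, np.nan, 0.0, 9.0],
    'satisfaction': [0, 0, 1, 0],
})

# Drop rows with missing values so X and y stay aligned
clean = toy.dropna()

# Target vector, kept separate from the features
y = clean['satisfaction'].to_numpy()

# Remove the identifier and the target, then one-hot encode the string columns
features = clean.drop(columns=['id', 'satisfaction'])
features = pd.get_dummies(features, columns=['Gender', 'Class'], dtype=float)

# Every column is numeric now, so the array has a float dtype
X = features.to_numpy(dtype=float)

# Optional: standardize each column (the small constant guards against zero variance)
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

print(X.dtype, X.shape, y)
```

With a float matrix like this, `train_test_split(X, y, ...)` and `blr.fit(X_train, y_train)` operate on numbers only.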

In [54]:
%%time
from sklearn.metrics import accuracy_score

# Can we do better? Maybe more iterations?
params = dict(eta=0.01,
              iterations=500)

blr = BinaryLogisticRegression(**params)
blr.fit(X_train,y_train)
print(blr)
yhat = blr.predict(X_test)
print('Accuracy of: ',accuracy_score(y_test,yhat))

TypeError: can't multiply sequence by non-int of type 'float'

# Deployment

# Exceptional Work 