This is the __experimental notebook__. Here we demonstrate many of the things we tried that did not make it into the final model.

In [1]:
import pandas as pd 
from sklearn.model_selection import train_test_split
import xgboost 
import numpy as np
import warnings
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
warnings.filterwarnings('ignore')
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport

Importing data

In [2]:
data= pd.read_csv('train.csv')


Preprocessing: we check the data types to see whether the data has categorical variables, and then check for missing values.

In [None]:
any(data.dtypes == 'object') #no strings- all numeric
data.info()

After seeing that the data is all numerical we're going to check if there are missing values

In [None]:
data.isna().any() #no missing values

Now we look at the first rows of the data to get a sense of what the values look like.

In [None]:
data.head()

After performing the EDA found in the deliverable notebook, we started feature engineering. 
We created the following features after researching on Mayo Clinic's website what we could calculate with the available data: 
* BMI (Body Mass Index): is a measure of body fat based on height and weight that applies to adult men and women.
* Average eyesight: used to minimize the number of columns in order to simplify the model
* Average hearing: used to minimize the number of columns in order to simplify the model
* Cholesterol ratio non-HDL: for predicting your risk of heart disease, many healthcare professionals now believe that determining your non-HDL cholesterol level may be more useful than calculating your cholesterol ratio.
* Total cholesterol: this is the total amount of cholesterol that’s circulating in your blood. Here’s the formula for calculating it: HDL + LDL + 20% triglycerides = total cholesterol.
* Triglyceride-to-HDL: the ratio triglycerides / HDL, which can help determine a person's risk of heart disease.
* Liver Enzyme Ratio: Liver function tests check the levels of certain enzymes and proteins in your blood. Levels that are higher or lower than usual can mean liver problems. 
* Creatinine clearance: A creatinine test is a measure of how well your kidneys are performing their job of filtering waste from your blood.
* Systolic to relaxation ratio: Systolic pressure is affected by a variety of factors. Factors such as anxiety, caffeine consumption, and performing resistance and cardiovascular exercises, cause immediate, temporary increases in systolic pressure.

We also thought that categorizing (binning) would help the model by reducing the number of numerical features and adding categorical ones with One Hot Encoding.
After trying this approach we found that the models performed better without categorizing, so we went back to purely numerical features.
Then we found that many of the newly created features were hurting the model's performance rather than improving it. We tried many combinations and tested feature importance, but found that the new features contributed little. That's why we decided to simplify the model, as seen in the deliverable notebook.
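
As an illustration of the binning plus One Hot Encoding approach described above, here is a minimal sketch on a small made-up DataFrame (the rows are hypothetical and not taken from `train.csv`). It assumes the same BMI thresholds as the feature engineering cell and uses `pd.cut` and `pd.get_dummies`, which is one possible way to do it and not necessarily how we encoded the categories in our tests.

```python
import pandas as pd

# Hypothetical toy values, not taken from train.csv
toy = pd.DataFrame({
    "height(cm)": [170, 160, 180, 175],
    "weight(kg)": [50, 60, 90, 100],
})

# BMI = weight in kg divided by squared height in metres
toy["BMI"] = toy["weight(kg)"] / (toy["height(cm)"] / 100) ** 2

# right=False makes every bin [lower, upper), matching the
# `x < 18.5`, `x < 25`, `x < 30` checks used for BMI_category
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["Underweight", "Normal weight", "Overweight", "Obesity"]
toy["BMI_category"] = pd.cut(toy["BMI"], bins=bins, labels=labels, right=False)

# One Hot Encoding: one indicator column per category
encoded = pd.get_dummies(toy, columns=["BMI_category"], prefix="BMI")
print(encoded.columns.tolist())
```

Each category becomes its own indicator column, so every binned feature adds several columns to the model input.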

In [None]:
# Calculate BMI
data['BMI'] = data['weight(kg)'] / ((data['height(cm)'] / 100) ** 2)

# Categorize BMI
data['BMI_category'] = data['BMI'].apply(lambda x: 'Underweight' if x < 18.5 else 
                                         ('Normal weight' if x < 25 else 
                                          ('Overweight' if x < 30 else 'Obesity')))

# Calculate average eyesight
data['avg eyesight'] = (data['eyesight(left)'] + data['eyesight(right)']) / 2

# Categorize average eyesight
data['eyesight_category'] = data['avg eyesight'].apply(lambda x: 'Good' if x > 0.8 else 
                                                       ('Moderate' if x > 0.4 else 'Poor'))
# Calculate Cholesterol ratio HDL
data['Cholesterol ratio HDL'] = (data['HDL'] + data['LDL']) / data['HDL']


# Categorize Cholesterol ratio HDL (using arbitrary thresholds)
data['Cholesterol_ratio_category'] = data['Cholesterol ratio HDL'].apply(lambda x: 'Low' if x < 3.5 else 
                                                                         ('Normal' if x <= 5 else 'High'))

# Calculate Total cholesterol: HDL + LDL + 20% of triglycerides
data['Total cholesterol'] = data['HDL'] + data['LDL'] + (data['triglyceride'] * 0.2)

# Categorize Total Cholesterol
data['Total_cholesterol_category'] = data['Total cholesterol'].apply(
    lambda x: 'Desirable' if x < 200 else ('Borderline high' if x <= 239 else 'High'))

# Calculate Triglyceride-to-HDL ratio
data['Triglyceride-to-HDL'] = data['triglyceride'] / data['HDL']

# Categorize Triglyceride-to-HDL ratio
data['Triglyceride_to_HDL_category'] = data['Triglyceride-to-HDL'].apply(
    lambda x: 'Optimal' if x < 2 else ('Moderate' if x <= 4 else 'High'))


data["Liver Enzyme Ratio"] = data["AST"] / data["ALT"]

# Categorize Liver Enzyme Ratio
data['Liver_Enzyme_Ratio_category'] = data['Liver Enzyme Ratio'].apply(
    lambda x: 'Low' if x < 0.8 else ('Normal' if x <= 1.2 else 'High'))

# Creatinine clearance (kidney function): ((140 - age) * weight) / (72 * serum creatinine)
data['creatinine clearance'] = ((140 - data['age']) * data['weight(kg)']) / (72 * data['serum creatinine'])

# Categorize Creatinine Clearance
data['Creatinine_Clearance_category'] = data['creatinine clearance'].apply(
    lambda x: 'Normal' if x > 90 else 
              ('Mildly Decreased' if x > 60 else 
               ('Moderately to Severely Decreased' if x > 30 else 
                ('Severely Decreased' if x > 15 else 'Kidney Failure'))))

# Create 'age_group' column directly
data['age_group'] = data['age'].apply(lambda x: 'Young Adult' if x <= 35 else ('Middle-Aged Adult' if x <= 55 else 'Senior Adult'))

# Create 'systolic_to_relaxation_ratio' column as the ratio of systolic to relaxation (diastolic) blood pressure
data['systolic_to_relaxation_ratio'] = data['systolic'] / data['relaxation']

# Categorize the ratio
data['ratio_category'] = data['systolic_to_relaxation_ratio'].apply(
    lambda x: 'Low' if x < 1.0 else ('Normal' if x <= 1.5 else 'High'))

Then we proceeded to test multiple combinations of categorized and uncategorized features with the following models, using a 10% sample of the data to make testing faster.
* Logistic regression: Used for binary classification problems (yes/no, true/false). It can be extended to multiclass classification via techniques like one-vs-rest (OvR).
* Random Forest: Suitable for both classification and regression tasks. Good for complex datasets with high dimensionality and feature interactions.
* XGBoost: Can be used for classification, regression, ranking, and user-defined prediction tasks. 

Compared to other models such as Random Forest and Logistic Regression, XGBoost demonstrated superior performance in our scenario. Random Forest also builds multiple decision trees, but it trains them independently and combines their predictions to improve accuracy. While Random Forest is robust and often less prone to overfitting than XGBoost, it may not capture complex relationships as effectively. 

Logistic Regression, on the other hand, is a linear model that models the probability of a binary outcome using a logistic function. Although Logistic Regression is simple and interpretable, it may struggle to capture non-linear relationships present in the data.

The reason XGBoost performed better in our case is primarily due to its ability to handle complex non-linear relationships and interactions between features. This capability allowed XGBoost to effectively learn from the dataset and make accurate predictions regarding smoking behavior. 

While Random Forest and Logistic Regression are viable alternatives, they may not capture the nuances present in the data as comprehensively as XGBoost. Ultimately, the choice of model depends on the specific characteristics of the dataset and the goals of the analysis.

Splitting data x and y 

__Dealing with outliers__

Our first attempt was to eliminate the outliers in every column, as seen below. After debating the reasons, we concluded that removing outliers did not make sense for some columns, for example age, weight, hearing and eyesight, to name a few. That first attempt also lowered the model's performance, which is why we only eliminated outliers in the columns ALT, LDL, triglyceride, AST and Gtp.
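
The outlier rule used throughout this notebook keeps a row only if its value lies within `[Q1 - 4 * IQR, Q3 + 4 * IQR]` for each checked column. A minimal sketch of that rule as a reusable function is shown here; the name `iqr_mask`, the parameter `k` and the toy DataFrame are hypothetical and only serve to illustrate the idea.

```python
import pandas as pd

def iqr_mask(df, columns, k=4.0):
    """Return a boolean Series that is True for rows lying inside
    [Q1 - k * IQR, Q3 + k * IQR] for every column in `columns`."""
    mask = pd.Series(True, index=df.index)
    for column in columns:
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        # between() is inclusive on both ends, like the >= and <= checks
        mask &= df[column].between(q1 - k * iqr, q3 + k * iqr)
    return mask

# Hypothetical toy data: one extreme ALT value, one high (but kept) age
toy = pd.DataFrame({
    "ALT": [10, 12, 14, 16, 18, 500],
    "age": [20, 30, 40, 50, 95, 60],
})

# Only ALT is checked, so the row with age 95 is not affected
kept = toy[iqr_mask(toy, ["ALT"])]
print(f"Rows removed: {len(toy) - len(kept)}")
```

Passing only the columns to check is what distinguishes the second attempt (five lab columns) from the first one (every column).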

The first attempt to remove "outliers" in every column is shown below

In [2]:
df_train= pd.read_csv('train.csv')

In [14]:
# Keep a count of the original number of rows
original_row_count = len(df_train)

# Initialize a mask to keep track of rows to keep
mask = pd.Series(True, index=df_train.index)

# Loop through each column to update the mask for outliers
for column in df_train.columns:
    Q1 = df_train[column].quantile(0.25)
    Q3 = df_train[column].quantile(0.75)
    IQR = Q3 - Q1

    # Define the bounds for outliers
    lower_bound = Q1 - 4 * IQR
    upper_bound = Q3 + 4 * IQR

    # Print the limits
    print(f"{column}: Lower Bound = {lower_bound}, Upper Bound = {upper_bound}")

    # Update the mask to false for rows that are outliers in the current column
    mask = mask & (df_train[column] >= lower_bound) & (df_train[column] <= upper_bound)

# Apply the mask to filter out outliers across all columns
df_train_filtered = df_train[mask]

# Calculate the number of rows after filtering
filtered_row_count = len(df_train_filtered)

# Calculate and print the number of rows removed
rows_removed = original_row_count - filtered_row_count
print(f"Rows removed: {rows_removed}")

id: Lower Bound = -278696.25, Upper Bound = 437951.25
age: Lower Bound = -20.0, Upper Bound = 115.0
height(cm): Lower Bound = 120.0, Upper Bound = 210.0
weight(kg): Lower Bound = 0.0, Upper Bound = 135.0
waist(cm): Lower Bound = 29.0, Upper Bound = 137.0
eyesight(left): Lower Bound = -0.7999999999999996, Upper Bound = 2.8
eyesight(right): Lower Bound = -0.7999999999999996, Upper Bound = 2.8
hearing(left): Lower Bound = 1.0, Upper Bound = 1.0
hearing(right): Lower Bound = 1.0, Upper Bound = 1.0
systolic: Lower Bound = 50.0, Upper Bound = 194.0
relaxation: Lower Bound = 22.0, Upper Bound = 130.0
fasting blood sugar: Lower Bound = 38.0, Upper Bound = 155.0
Cholesterol: Lower Bound = 7.0, Upper Bound = 385.0
triglyceride: Lower Bound = -275.0, Upper Bound = 517.0
HDL: Lower Bound = -31.0, Upper Bound = 140.0
LDL: Lower Bound = -57.0, Upper Bound = 285.0
hemoglobin: Lower Bound = 5.800000000000001, Upper Bound = 23.8
Urine protein: Lower Bound = 1.0, Upper Bound = 1.0
serum creatinine: Lowe

In [3]:
data= data.sample(frac=.1)
X_train=data.drop(columns=['id','smoking'])
y_train=data['smoking']

The following cell was used to test __Logistic Regression__, __Random Forest__ and __XGBoost__.

We chose logistic regression as a baseline because it is fast to train on data of any size and easy to interpret. It can also be regularized so that its predictions do not become overconfident.

Random Forest is great for binary predictions with numerical data. It uses many decision trees to boost accuracy and tackle overfitting, giving you a more reliable model. It's good at uncovering complex patterns logistic regression might miss and ranks variables by their importance, offering deep insights. While it's a bit heavier on computation and less straightforward to interpret than simpler models, its power in handling diverse data sizes and complexity makes it a more robust choice. 

And finally, XGBoost shines in binary prediction with its precision and speed, and works well with numerical data. It builds on gradient boosting, where each new tree corrects the mistakes of the previous ones, leading to highly accurate models. It's fast, thanks to optimized algorithms that work well on big data, and its regularized learning approach helps it avoid overfitting.

In [None]:
models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(),
    'XGBOOST': xgboost.XGBClassifier()}

from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, make_scorer
import numpy as np

# Assuming models is a dictionary of model names and their corresponding initialized objects
for model_name, model in models.items():
    print(f"Training and evaluating {model_name} model")
    
    # Create a pipeline for scaling and model training
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', model)
    ])

    # Use the built-in 'roc_auc' scorer, which scores predicted probabilities
    # (or decision values) rather than hard class labels

    # Calculate cross-validation AUC-ROC scores
    auc_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')

    # Calculate and print average performance across folds
    avg_auc = np.mean(auc_scores)
    print(f'Average {model_name} AUC-ROC = {avg_auc}\n')

We decided to try __Random Forest__ due to its scalability and because it is often less prone to overfitting than XGBoost

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import xgboost
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [5]:
train= pd.read_csv('train.csv')

In [7]:
# Keep a count of the original number of rows
original_row_count2 = len(train)

# Columns you want to check for outliers
columns_to_check = ['ALT', 'LDL', 'triglyceride', 'AST', 'Gtp']

# Initialize a mask to keep track of rows to keep
mask = pd.Series(True, index=train.index)

# Loop through each column to update the mask for outliers
for column in columns_to_check:
    Q1 = train[column].quantile(0.25)
    Q3 = train[column].quantile(0.75)
    IQR = Q3 - Q1

    # Define the bounds for outliers
    lower_bound = Q1 - 4 * IQR
    upper_bound = Q3 + 4 * IQR

    # Print the limits
    print(f"{column}: Lower Bound = {lower_bound}, Upper Bound = {upper_bound}")

    # Update the mask to false for rows that are outliers in the current column
    mask = mask & (train[column] >= lower_bound) & (train[column] <= upper_bound)

# Apply the mask to filter out outliers in the selected columns
train_filtered = train[mask]

# Calculate the number of rows after filtering
filtered_row_count = len(train_filtered)

# Calculate and print the number of rows removed
rows_removed = original_row_count2 - filtered_row_count
print(f"Rows removed: {rows_removed}")

ALT: Lower Bound = -48.0, Upper Bound = 96.0
LDL: Lower Bound = -57.0, Upper Bound = 285.0
triglyceride: Lower Bound = -275.0, Upper Bound = 517.0
AST: Lower Bound = -16.0, Upper Bound = 65.0
Gtp: Lower Bound = -86.0, Upper Bound = 148.0
Rows removed: 2807


In [12]:
original_row_count2

159256

In [None]:

train_filtered=train_filtered.drop(columns=['hearing(left)','hearing(right)','Urine protein'])

In [None]:
y = train_filtered['smoking']
# 'id' is a row identifier, not a feature, so it is dropped together with the target
X = train_filtered.drop(columns=['smoking', 'id'])

In [None]:
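# scipy.stats.randint defines a distribution to sample from;
# random.randint would only return a single integer
from scipy.stats import randint

# Define parameter distributions
param_dist = {
    "n_estimators": randint(10, 1000),
    "max_features": ['sqrt', 'log2'],  # 'auto' is no longer accepted by recent scikit-learn releases
    "max_depth": randint(2, 30),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 20),
    "bootstrap": [True, False]
}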

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, roc_auc_score
from scipy.stats import randint

# Define the hyperparameter grid
param_dist = {
    'n_estimators': randint(10, 1000),  # Using a distribution for 'n_estimators'
    'max_depth': randint(1, 20),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 20),
    'bootstrap': [True, False]
}

# Instantiate Random Forest classifier
rf = RandomForestClassifier()

# Randomized search scored with the built-in 'roc_auc' scorer, which uses
# predicted probabilities instead of hard class labels
random_search = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=25, cv=5, verbose=2, random_state=42, n_jobs=-1, scoring='roc_auc')

# Fit the random search model
random_search.fit(X, y)

# Best parameters found
print("Best Parameters:", random_search.best_params_)


In [None]:
# Hyperparameters used for the Random Forest fit below
best = {
    'n_estimators': 334,
    'max_depth': 12,
    'min_samples_split': 9,
    'min_samples_leaf': 11,
    'bootstrap': True
}
rf = RandomForestClassifier(**best)

In [None]:
rf.fit(X, y)

In [None]:
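# Note: this AUC is computed on the same rows the model was fit on,
# so it is an optimistic estimate of performance on unseen data
y_pred_proba = rf.predict_proba(X)[:, 1]
auc_score = roc_auc_score(y, y_pred_proba)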

print(f"AUC Score: {auc_score}")
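
An AUC computed on the training rows tends to look better than the model really is. The following is a minimal sketch of a held-out evaluation, assuming synthetic data from `make_classification` as a stand-in for `X` and `y`; the names `X_demo`, `y_demo` and `rf_demo` and the reduced `n_estimators` are hypothetical choices to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: binary target, numeric features)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out 20% of the rows, keeping the class balance in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

rf_demo = RandomForestClassifier(n_estimators=50, max_depth=12, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Compare AUC on the rows used for fitting with AUC on unseen rows
train_auc = roc_auc_score(y_tr, rf_demo.predict_proba(X_tr)[:, 1])
val_auc = roc_auc_score(y_val, rf_demo.predict_proba(X_val)[:, 1])
print(f"Train AUC: {train_auc:.3f}, validation AUC: {val_auc:.3f}")
```

The gap between the two numbers gives a rough idea of how much the training-set score overstates performance.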

After learning __SVM__ in class we decided to try it.

SVM is well suited to binary prediction with numerical data. It looks for the boundary that separates the classes with the largest margin, which tends to generalize well. With different kernels, SVM can handle both linearly and non-linearly separable data. Its main drawback is that training requires considerably more computation on large datasets.
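
To make the point about kernels concrete, here is a small sketch comparing a linear and an RBF kernel on scikit-learn's synthetic `make_moons` data, which is not linearly separable. The dataset and the names `X_moons`, `y_moons` and `kernel_scores` are illustrative assumptions and have nothing to do with the smoking data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half circles: a curved class boundary
X_moons, y_moons = make_moons(n_samples=400, noise=0.1, random_state=0)

kernel_scores = {}
for kernel in ["linear", "rbf"]:
    # Scaling inside the pipeline is refit on each training fold
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    kernel_scores[kernel] = cross_val_score(clf, X_moons, y_moons, cv=5).mean()
    print(f"{kernel}: mean accuracy = {kernel_scores[kernel]:.3f}")
```

On data like this the RBF kernel can follow the curved boundary while the linear kernel cannot, which is why it is worth including both kernels in a parameter search.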

We set up this attempt as follows

In [2]:
df_train= pd.read_csv('train.csv')

In [4]:
# Keep a count of the original number of rows
original_row_count = len(df_train)
# After analyzing the graphs and experimenting, we determined that the following columns have extreme outliers affecting the model performance
columns = ['ALT', 'LDL', 'triglyceride', 'AST', 'Gtp']

# Initialize a mask to keep track of rows to keep
mask = pd.Series(True, index=df_train.index)

# Loop through each column to update the mask for outliers
for column in columns:
    Q1 = df_train[column].quantile(0.25)
    Q3 = df_train[column].quantile(0.75)
    IQR = Q3 - Q1

    # Define the bounds for outliers
    lower_bound = Q1 - 4 * IQR
    upper_bound = Q3 + 4 * IQR

    print(f"{column}: Lower Bound = {lower_bound}, Upper Bound = {upper_bound}")

    # Update the mask to false for rows that are outliers in the current column
    mask = mask & (df_train[column] >= lower_bound) & (df_train[column] <= upper_bound)

# Apply the mask to filter out outliers in the selected columns
df_train_filtered = df_train[mask]

# Calculate the number of rows after filtering
filtered_row_count = len(df_train_filtered)

# Calculate and print the number of rows removed
rows_removed = original_row_count - filtered_row_count
print(f"Rows removed: {rows_removed}")

ALT: Lower Bound = -48.0, Upper Bound = 96.0
LDL: Lower Bound = -57.0, Upper Bound = 285.0
triglyceride: Lower Bound = -275.0, Upper Bound = 517.0
AST: Lower Bound = -16.0, Upper Bound = 65.0
Gtp: Lower Bound = -86.0, Upper Bound = 148.0
Rows removed: 2807


In [5]:
#Splitting the data into X and y to train and test the model
X = df_train_filtered.drop(columns=['smoking','id'])
y = df_train_filtered['smoking']

In [7]:
print(X.shape)  # For features
print(y.shape)  # For target labels


(156449, 22)
(156449,)


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Define a pipeline combining a standard scaler and the SVM model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

# Define the parameter grid to search
param_grid = {
    'svm__C': [0.1, 1, 10],
    'svm__kernel': ['linear', 'rbf']
}

# Set up the grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=5)

# Fit grid search to find the best parameters
grid_search.fit(X, y)

print("Best parameters found:", grid_search.best_params_)


This was the first attempt, using a linear kernel with otherwise default parameters for SVM

In [10]:
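from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: X_scaled (not defined in the cells shown) is X standardized with StandardScaler
X_scaled = StandardScaler().fit_transform(X)

# Initialize the model
model = SVC(kernel='linear', random_state=42)

# Perform cross-validation (the default scoring for a classifier is accuracy)
scores = cross_val_score(model, X_scaled, y, cv=5) # 5-fold cross-validation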

print("Cross-validation scores:", scores)
print("Average score:", scores.mean())


Cross-validation scores: [0.75317993 0.75375519 0.7564078  0.75880473 0.7580939 ]
Average score: 0.7560483099641981


After testing, we decided to go with XGBoost because its performance was better in every scenario. Then we proceeded to test which features work best with XGBoost

In [None]:
# Outliers: this was done with z-score; send the IQR version and ask for it to be done with z-score
# SVM
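
For comparison with the IQR filter used in this notebook, a z-score filter can be sketched as follows. This is a minimal illustration under assumptions: the function name `zscore_mask`, the threshold of 3 standard deviations and the toy data are all hypothetical and do not reproduce a cell from our experiments.

```python
import pandas as pd

def zscore_mask(df, columns, threshold=3.0):
    """Return a boolean Series that is True for rows whose absolute
    z-score is below `threshold` in every column in `columns`."""
    subset = df[columns]
    # z-score: distance from the column mean in units of standard deviation
    z = (subset - subset.mean()) / subset.std(ddof=0)
    return (z.abs() < threshold).all(axis=1)

# Hypothetical toy data: twenty ordinary ALT values and one extreme value
toy = pd.DataFrame({"ALT": list(range(10, 30)) + [500]})

kept = toy[zscore_mask(toy, ["ALT"])]
print(f"Rows removed: {len(toy) - len(kept)}")
```

Unlike the IQR rule, the z-score depends on the mean and standard deviation, which are themselves pulled by extreme values, so the two filters can remove different rows.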