This is a simple notebook based on [lr is all u need](https://www.kaggle.com/code/heyspaceturtle/logistic-regression-is-all-u-need).

Special thanks to Devastator for his [clean notebook](https://www.kaggle.com/code/thedevastator/tps-aug-simple-baseline)
and to Lucas See for his [great EDA](https://www.kaggle.com/code/pinstripezebra/eda-baseline-model).

Please give all three authors an upvote!

In [67]:
# Viz
import matplotlib.pyplot as plt
import seaborn as sns

# Data Processing
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Modeling 
from sklearn.ensemble import VotingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import cluster, accuracy_score, roc_auc_score
from sklearn.model_selection import cross_validate, GridSearchCV, cross_val_score, StratifiedKFold

# Other
import warnings
warnings.filterwarnings('ignore')

In [68]:
# Load Data
train = pd.read_csv('../input/tabular-playground-series-aug-2022/train.csv')
test = pd.read_csv('../input/tabular-playground-series-aug-2022/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-aug-2022/sample_submission.csv')

train.head()

In [69]:
# Variable lists for easy manipulation
id_var = ['id']
target= ['failure']
cat_vars = ['product_code','attribute_0','attribute_1']
num_vars = [v for v in test.columns if v not in id_var and v not in cat_vars]
predictors = cat_vars + num_vars

In [70]:
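# Correlate numeric columns only; the string columns (product_code, attribute_0, attribute_1) cannot be correlated
corr = train.select_dtypes(include='number').corr()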
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(12, 12), facecolor='#EAECEE')
cmap = sns.color_palette("rainbow", as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, vmin=-1., center=0, annot=False,
            square=True, linewidths=.5, cbar_kws={"shrink": 0.75})

ax.set_title('Correlation heatmap', fontsize=24, y= 1.05)
colorbar = ax.collections[0].colorbar

## Pre-Processing 

In [71]:
#Imputing missing values 
multi_imp = IterativeImputer(max_iter = 9, random_state = 42, verbose = 0, skip_complete = True, n_nearest_features = 10, tol = 0.001)
multi_imp.fit(train[num_vars])
train[num_vars] = multi_imp.transform(train[num_vars])
test[num_vars] = multi_imp.transform(test[num_vars])

print('Train NA:', train.isna().sum().sum())
print('Test NA:', test.isna().sum().sum())
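
As a rough illustration of what the imputation cell does, the sketch below runs `IterativeImputer` on a small synthetic matrix (not the competition data) in which one column is a noisy multiple of the other. The array names and the toy data are assumptions made for this example only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Synthetic data: the second column is roughly 2x the first
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X_full = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

# Knock out the first 20 values of the second column
X_missing = X_full.copy()
X_missing[:20, 1] = np.nan

# Each missing value is predicted from the other column
imp = IterativeImputer(max_iter=9, random_state=42, tol=0.001)
X_filled = imp.fit_transform(X_missing)

print('Remaining NA:', np.isnan(X_filled).sum())
```

Observed values pass through unchanged; only the `NaN` entries are replaced by model-based estimates.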

In [72]:
train['measurement_0'].hist()

In [73]:
train['loading'].hist()

In [74]:
float_cols = [col for col in train.columns if train[col].dtypes == 'float64']

In [75]:
# 'loading' is right-skewed in both train and test.
# Note: this applies log(x + 1) to every float column, not only 'loading'.
train[float_cols] = np.log1p(train[float_cols])
test[float_cols] = np.log1p(test[float_cols])

In [76]:
train['loading'].hist()

In [77]:
# Normalize: preprocessing.normalize rescales each row (sample) of these columns to unit L2 norm, not each column
attributes = ['attribute_2', 'attribute_3', 'measurement_4', 'measurement_5', 'measurement_6']
train[attributes] = preprocessing.normalize(train[attributes])
test[attributes] = preprocessing.normalize(test[attributes])
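
A small sketch of what `preprocessing.normalize` does by default, compared with column-wise scaling. The toy matrix and the use of `StandardScaler` for comparison are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np
from sklearn import preprocessing

# Toy matrix: 2 samples (rows) x 2 features (columns)
X = np.array([[3.0, 4.0],
              [6.0, 8.0]])

# normalize() rescales each ROW to unit L2 norm (default norm='l2', axis=1)
row_normalized = preprocessing.normalize(X)
row_norms = np.linalg.norm(row_normalized, axis=1)

# StandardScaler rescales each COLUMN to zero mean and unit variance
col_scaled = preprocessing.StandardScaler().fit_transform(X)
col_means = col_scaled.mean(axis=0)

print(row_normalized)
print(col_scaled)
```

Because `normalize` is stateless and works per sample, calling it separately on train and test is consistent; a column-wise scaler would instead need to be fit on train and reused on test.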

In [78]:
# Drop product code
test = test.drop(['product_code'], axis = 1)
train = train.drop(['product_code'], axis = 1)
cat_vars.remove('product_code')

In [79]:
# One Hot Encoding
for v in cat_vars:
    tempdf = pd.get_dummies(train[v], prefix = v)
    tempdf_test = pd.get_dummies(test[v], prefix = v)
    train = pd.merge(left = train, right = tempdf, left_index = True, right_index = True)
    test = pd.merge(left = test, right = tempdf_test, left_index = True, right_index = True)
train = train.drop(cat_vars, axis = 1)
test = test.drop(cat_vars, axis = 1)
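
The loop above encodes train and test independently, so the two frames only end up with the same dummy columns if both contain the same category values. If they might differ, one option is to reindex the test dummies to the train columns. The sketch below uses made-up category values and hypothetical frame names (`train_demo`, `test_demo`) purely for illustration.

```python
import pandas as pd

# Hypothetical toy frames whose category values do not fully overlap
train_demo = pd.DataFrame({'attribute_1': ['material_5', 'material_6', 'material_8']})
test_demo = pd.DataFrame({'attribute_1': ['material_5', 'material_7']})

train_dummies = pd.get_dummies(train_demo['attribute_1'], prefix='attribute_1', dtype=int)
test_dummies = pd.get_dummies(test_demo['attribute_1'], prefix='attribute_1', dtype=int)

# Align test to the train columns: categories unseen in train are dropped,
# categories missing from test are added as all-zero columns
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_aligned)
```

A test row whose category never appeared in train becomes all zeros after alignment, which the model treats as "none of the known categories".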

## Model Training

In [80]:
# update predictor list
predictors = [v for v in train.columns if v not in id_var and v not in target]

# Train test split
X_train, X_test, y_train, y_test = train_test_split(train[predictors], train[target], test_size=0.20, random_state=42)
X_train.head()

### Logistic Regression 

In [81]:
# Logistic Regression
lr = LogisticRegression(random_state = 0)
lr.fit(X_train, y_train)
print('Validation AUC:', round(roc_auc_score(y_test, lr.predict_proba(X_test)[:,1]), 4))

### LR Grid Search
The grid search is commented out so the notebook runs faster. The best parameters it found are hard-coded in the next cell.

In [82]:
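# Logistic Regression Grid Search
# NOTE: kfold was never defined in this notebook; the splitter below is an assumed setup.
#kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# NOTE: among these solvers only 'liblinear' supports the 'l1' penalty; the other 'l1' combinations fail to fit.
#space = {"C":np.logspace(-4, 4, 50),"penalty":["l1", "l2"], "solver":['liblinear', 'lbfgs', 'newton-cg']}
#gsLR = GridSearchCV(LogisticRegression(max_iter = 200), space ,scoring='roc_auc', cv = kfold, verbose = 1)
#gsLR.fit(X_train, y_train.values.squeeze())

#LR_best = gsLR.best_estimator_
#print(gsLR.best_params_)
#print('Validation AUC:', round(roc_auc_score(y_test, LR_best.predict_proba(X_test)[:,1]), 4))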
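
For reference, here is a minimal runnable sketch of the same kind of search on synthetic data. The data, the 5-fold splitter, and the reduced `C` grid are assumptions chosen so the example runs quickly. The search space is written as a list of dicts so that `'l1'` is only paired with `'liblinear'`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Assumed CV splitter (the notebook does not define one)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Only valid solver/penalty pairs: liblinear handles l1 and l2, the others l2 only
space = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": np.logspace(-4, 4, 5)},
    {"solver": ["lbfgs", "newton-cg"], "penalty": ["l2"], "C": np.logspace(-4, 4, 5)},
]

gs_demo = GridSearchCV(LogisticRegression(max_iter=200), space, scoring="roc_auc", cv=kfold)
gs_demo.fit(X_demo, y_demo)

print(gs_demo.best_params_)
print('CV AUC:', round(gs_demo.best_score_, 4))
```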

In [83]:
# Best Params from Grid Search
lrgs = LogisticRegression(max_iter = 200, C=0.0001, penalty='l2', solver='newton-cg')
lrgs.fit(X_train, y_train)

print('Validation AUC:', round(roc_auc_score(y_test, lrgs.predict_proba(X_test)[:,1]), 4))

In [84]:
# LDA
LDA = LinearDiscriminantAnalysis()
LDA.fit(X_train,y_train.values.squeeze())
print('Validation AUC:', round(roc_auc_score(y_test, LDA.predict_proba(X_test)[:,1]), 4))

Public Leaderboard scores: 
- Logistic Regression GS = 0.58786 
- LDA = 0.58133

LDA drops a lot on the public LB, but to be honest everything about this LB is weird. I still need to do some in-depth cross-validation to make sure the score is stable and we don't drop on the private LB.
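
One way to start on that stability check is to look at the spread of AUC across folds instead of a single hold-out split. The sketch below uses synthetic data and an assumed 5-fold stratified split. On the real data, `train[predictors]` and the `failure` column would take the place of `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the training data
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.8, 0.2], random_state=0)

# Same hyperparameters as the grid-searched model above
model = LogisticRegression(max_iter=200, C=0.0001, penalty='l2', solver='newton-cg')

# Assumed splitter: 5 stratified folds with shuffling
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_auc = cross_val_score(model, X_demo, y_demo, scoring='roc_auc', cv=cv)

print('Fold AUC:', np.round(fold_auc, 4))
print('Mean:', round(fold_auc.mean(), 4), 'Std:', round(fold_auc.std(), 4))
```

A large standard deviation relative to the gap between models would suggest that differences on the public LB are mostly noise.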

## Submission

In [85]:
# Remove id
test = test.drop('id', axis = 1)

# LR GS (public LB 0.58786): predict with the grid-searched model lrgs, not the default lr
sub1 = sample_submission.copy()
sub1.failure = lrgs.predict_proba(test)[:,1]
sub1.to_csv('submission.csv', index = False)