# [**Classifying with Logistic Regression**](https://www.kaggle.com/competitions/tabular-playground-series-aug-2022/data?select=train.csv)

### **Contents**

- **Import Data** - Download the competition files and load the train and test sets
- **Data Preparation** - Impute missing values, encode the categorical variables, and standardize the numerical features
- **Logistic Regression** - Tune the regularization strength `C` with a grid search, scored by 5-fold cross-validated ROC AUC
- **Make Predictions** - Fit the final model and write the submission file

### **References**

This work and its approach were inspired by:
- [Simple Logistic Regression for Good Score (0.5837)](https://www.kaggle.com/code/ryanluoli2/simple-logistic-regression-for-good-score-0-5837)

In [1]:
! kaggle competitions download -c tabular-playground-series-aug-2022

Downloading tabular-playground-series-aug-2022.zip to /Users/tungwu/Documents/GitHub/kaggle/Tabular Playground Series - Aug 2022
  0%|                                               | 0.00/2.27M [00:00<?, ?B/s]
100%|██████████████████████████████████████| 2.27M/2.27M [00:00<00:00, 25.3MB/s]


In [5]:
! unzip -o tabular-playground-series-aug-2022.zip -d data/

Archive:  tabular-playground-series-aug-2022.zip
  inflating: data/sample_submission.csv  
  inflating: data/test.csv           
  inflating: data/train.csv          


In [6]:
import numpy as np
import pandas as pd

import seaborn as sn
import matplotlib.pyplot as plt

# Import Data

In [7]:
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')

In [9]:
display(df_train.head())
display(df_test.head())

Unnamed: 0,id,product_code,loading,attribute_0,attribute_1,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,...,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
0,0,A,80.1,material_7,material_8,9,5,7,8,4,...,10.672,15.859,17.594,15.193,15.029,,13.034,14.684,764.1,0
1,1,A,84.89,material_7,material_8,9,5,14,3,3,...,12.448,17.947,17.915,11.755,14.732,15.425,14.395,15.631,682.057,0
2,2,A,82.43,material_7,material_8,9,5,12,1,5,...,12.715,15.607,,13.798,16.711,18.631,14.094,17.946,663.376,0
3,3,A,101.07,material_7,material_8,9,5,13,2,6,...,12.471,16.346,18.377,10.02,15.25,15.562,16.154,17.172,826.282,0
4,4,A,188.06,material_7,material_8,9,5,9,2,8,...,10.337,17.082,19.932,12.428,16.182,12.76,13.153,16.412,579.885,0


Unnamed: 0,id,product_code,loading,attribute_0,attribute_1,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,...,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17
0,26570,F,119.57,material_5,material_6,6,4,6,9,6,...,18.654,10.802,15.909,18.07,13.772,13.659,16.825,13.742,17.71,634.612
1,26571,F,113.51,material_5,material_6,6,4,11,8,0,...,19.368,12.032,13.998,,12.473,17.468,16.708,14.776,14.102,537.037
2,26572,F,112.16,material_5,material_6,6,4,8,12,4,...,17.774,11.743,17.046,18.086,10.907,13.363,15.737,17.065,16.021,658.995
3,26573,F,112.72,material_5,material_6,6,4,8,11,10,...,18.948,11.79,18.165,16.163,10.933,15.501,15.667,12.62,16.111,594.301
4,26574,F,208.0,material_5,material_6,6,4,14,16,8,...,19.141,12.37,14.578,17.849,11.941,16.07,16.183,13.324,17.15,801.044


# Data Preparation

In [19]:
## constants
ID = 'id'
FAILURE = 'failure'
LOADING = 'loading'
PRODUCT_CODE = 'product_code'

In [15]:
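# Columns 1-4 are product_code, loading, attribute_0 and attribute_1
cat_features = list(df_train.columns[1:5])
cat_features.append(FAILURE)   # the target is categorical too
cat_features.remove(LOADING)   # loading is numerical
cat_features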

['product_code', 'attribute_0', 'attribute_1', 'failure']

In [13]:
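# Everything from column 5 onward is numerical, apart from the target
num_features = list(df_train.columns[5:])
num_features.append(LOADING)   # loading sits among the first five columns
num_features.remove(FAILURE)   # the target is not a feature
num_features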

['attribute_2',
 'attribute_3',
 'measurement_0',
 'measurement_1',
 'measurement_2',
 'measurement_3',
 'measurement_4',
 'measurement_5',
 'measurement_6',
 'measurement_7',
 'measurement_8',
 'measurement_9',
 'measurement_10',
 'measurement_11',
 'measurement_12',
 'measurement_13',
 'measurement_14',
 'measurement_15',
 'measurement_16',
 'measurement_17',
 'loading']

In [16]:
#combine the train and test data for preparation
df = pd.concat([df_train, df_test], axis=0)

## Dealing with Missing Values

In [18]:
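# Fill missing values in the numerical features with the mean of the same product

# measurement_3 .. measurement_17 (column 10 onward, excluding the final `failure` column), plus loading
missing_features = list(df.columns[10:-1])
missing_features.append(LOADING)

for feature in missing_features:
    # The string 'mean' is the form recent pandas versions recommend over passing np.mean
    df[feature] = df[feature].fillna(df.groupby(PRODUCT_CODE)[feature].transform('mean'))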
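
The group-mean imputation can be easier to follow on a tiny example. The sketch below uses a made-up frame (the values are illustrative, not taken from the competition data) and shows that each missing reading is replaced by the mean of its own product group:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: two product codes, one missing reading in each
toy = pd.DataFrame({
    "product_code": ["A", "A", "A", "B", "B"],
    "measurement_3": [1.0, np.nan, 3.0, 10.0, np.nan],
})

# transform("mean") returns a Series aligned with the original rows,
# holding the mean of each row's own product group (NaNs are skipped)
group_mean = toy.groupby("product_code")["measurement_3"].transform("mean")

# Replace each NaN with the mean of its own product group
toy["measurement_3"] = toy["measurement_3"].fillna(group_mean)
print(toy)
```

Product A's gap is filled with 2.0 (the mean of 1.0 and 3.0) and product B's with 10.0, so no value leaks across products.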

## Encoding Categorical Variables

In [20]:
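# Encode the nominal categorical variables as dummy variables

for feature in cat_features[0:-1]:  # skip the last entry, the `failure` target
    df = pd.get_dummies(df, columns=[feature])
    # The new dummy columns are appended last; drop one level to avoid a redundant (collinear) column
    df = df.drop([df.columns[-1]], axis=1)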
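
As a rough illustration of the encoding step, the sketch below applies the same two calls to a made-up three-row frame. It relies on `pd.get_dummies` appending the new dummy columns after the existing ones, so the last column is always one level of the feature just encoded:

```python
import pandas as pd

# Made-up rows; only the column names mirror the competition data
toy = pd.DataFrame({
    "loading": [80.1, 84.9, 119.6],
    "attribute_0": ["material_7", "material_5", "material_7"],
})

# One-hot encode; the dummy columns are appended after the untouched columns
encoded = pd.get_dummies(toy, columns=["attribute_0"])

# Drop the last dummy column; that level becomes the implicit baseline
encoded = encoded.drop(columns=[encoded.columns[-1]])
print(list(encoded.columns))
```

With two levels, a single indicator column is enough: a row that is not `material_5` must be `material_7`.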

In [21]:

#split the data back into train and test sets

df_train = df.iloc[0:len(df_train)].copy()
df_test = df.iloc[len(df_train):].copy()

X_train = df_train.drop([ID, FAILURE], axis=1).copy()
y_train = df_train[FAILURE].copy()

## Standardize Data

In [22]:
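# Standardize all the numerical variables for better regression results

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train[num_features] = scaler.fit_transform(X_train[num_features])
# Note: fit_transform refits the scaler on the test set, so the test features are scaled
# with their own mean and standard deviation; scaler.transform(df_test[num_features])
# would reuse the training-set statistics instead
df_test[num_features] = scaler.fit_transform(df_test[num_features])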
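
A common convention is to fit the scaler on the training features only and reuse those statistics for the test features, so both sets are expressed on the same scale. A minimal sketch on made-up arrays (not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature arrays standing in for the train and test features
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns the training mean (2.0) and standard deviation
test_scaled = scaler.transform(test)        # reuses the training statistics, no refit

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

The test value 2.0 maps to 0 because it equals the training mean. Refitting on the test array would instead centre the test set on its own mean of 3.0.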

# Logistic Regression

In [23]:
#perform a grid search to find the best hyperparameter for logistic regression

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter = 2000,
                        random_state = 0)

lr_param = {'C': np.logspace(-10, 10)}  # 50 log-spaced values from 1e-10 to 1e10 (NumPy's default num=50)

lr_cv = GridSearchCV(estimator=lr, param_grid=lr_param , scoring='roc_auc', cv=5)
lr_cv.fit(X_train, y_train)
lr_cv.best_params_

{'C': 7543.120063354608}
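
Because the features are standardized before the grid search, every validation fold has already contributed to the scaling statistics. One possible way to keep the preprocessing inside each fold is to wrap the scaler and the model in a pipeline. The sketch below is an illustration on synthetic data from `make_classification`, with a smaller `C` grid chosen only to keep it fast; it does not reproduce the notebook's results:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (not the competition data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is refit on the training part of every fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000, random_state=0))

# make_pipeline names each step after its lowercased class name
param_grid = {"logisticregression__C": np.logspace(-3, 3, 7)}

search = GridSearchCV(pipe, param_grid=param_grid, scoring="roc_auc", cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 5))
```

`search.best_estimator_` is then the whole pipeline refit on all rows, so the same scaling is applied automatically at prediction time.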

In [24]:
lr_best = LogisticRegression(C = lr_cv.best_params_['C'], 
                             max_iter = 2000, 
                             random_state = 0)
lr_best.fit(X_train, y_train)

LogisticRegression(C=7543.120063354608, max_iter=2000, random_state=0)

In [25]:
from sklearn.model_selection import cross_validate

lr_cv_scores = cross_validate(lr_best, X_train, y_train, scoring='roc_auc', cv=5)
round(lr_cv_scores['test_score'].mean(),5)

0.59277
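
For context on the score, ROC AUC can be read as the fraction of (failure, non-failure) pairs in which the failing item receives the higher predicted score, so 0.5 corresponds to random ranking and 1.0 to a perfect one. A small sketch with made-up labels and scores checks this reading against `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores, for illustration only
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_score[y_true == 1]  # scores of the positive (failure) rows
neg = y_score[y_true == 0]  # scores of the negative rows

# Compare every positive score with every negative score; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print(pairwise_auc, roc_auc_score(y_true, y_score))
```

Three of the four pairs are ordered correctly here, so both computations give 0.75.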

# Make Predictions

In [26]:
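# Refit the tuned model on the full training set
lr_best = LogisticRegression(C = lr_cv.best_params_['C'],
                             max_iter = 2000,
                             random_state = 0)

lr_best.fit(X_train, y_train)

# df_test still carries the empty `failure` column from the earlier concat, so drop it with the id
y_pred = lr_best.predict_proba(df_test.drop([ID, FAILURE], axis=1))
submission = pd.read_csv('data/sample_submission.csv')
# Column 1 of predict_proba is the predicted probability of failure (class 1)
submission[FAILURE] = y_pred[:,1]
submission.to_csv("submission_baseline.csv", index=False)
submission.head()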

Unnamed: 0,id,failure
0,26570,0.208342
1,26571,0.1598
2,26572,0.183328
3,26573,0.18479
4,26574,0.337852
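
The submission takes column 1 of `predict_proba`. The columns follow the order of the fitted model's `classes_` attribute, which can be checked explicitly. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-feature data: larger values tend to belong to class 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)

# Look up the column of class 1 rather than assuming its position
failure_col = list(clf.classes_).index(1)
p_failure = proba[:, failure_col]

print(clf.classes_, p_failure)
```

Each row of `proba` sums to 1, and with labels 0 and 1 the looked-up column is index 1, matching the `y_pred[:,1]` slice used for the submission.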
