# Overview
The goal of this Kaggle competition is to predict which parts on Bosch's production lines will fail quality control, indicated by the 'Response' variable. The dataset has a very large number of features (the numeric file alone has 970 columns and the date file 1,157), and the target is highly imbalanced (about 0.58% of training parts fail), which makes the problem challenging.

# Dataset Description
The dataset consists of the following files:
* train_numeric.csv: Training set with numeric features (contains the 'Response' variable).
* test_numeric.csv: Test set with numeric features (to predict the 'Response' for these Ids).
* train_categorical.csv and test_categorical.csv: Training and test sets with categorical features.
* train_date.csv and test_date.csv: Training and test sets with date features.
* sample_submission.csv: A sample submission file in the correct format.
Features are anonymized and named based on the production line, station, and feature number.
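
The naming scheme can be decoded mechanically. The sketch below is a minimal illustration, assuming names of the form `L<line>_S<station>_<letter><number>` as seen in the column headers printed later in this notebook (`F` in the numeric file, `D` in the date file). `FeatureName` and `parse_feature_name` are hypothetical helper names, not part of the competition data or of any library.

```python
import re
from typing import NamedTuple


class FeatureName(NamedTuple):
    line: int
    station: int
    kind: str    # "F" for a measured feature, "D" for a date column
    number: int


# Assumed pattern, based on names such as 'L3_S50_F4245' and 'L0_S0_D1'
_PATTERN = re.compile(r"^L(\d+)_S(\d+)_([FD])(\d+)$")


def parse_feature_name(name: str) -> FeatureName:
    """Split a column name such as 'L3_S50_F4245' into its parts."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a line/station/feature column: {name!r}")
    line, station, kind, number = match.groups()
    return FeatureName(int(line), int(station), kind, int(number))


print(parse_feature_name("L3_S50_F4245"))
```

Columns such as `Id` and `Response` do not follow the pattern and raise `ValueError`, so callers would need to filter them out first.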

# Code section
## Setup directory

In [24]:
# Core libraries: NumPy for linear algebra, pandas for CSV I/O and data processing
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files live in the read-only "/kaggle/input" directory; list them all
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/bosch-production-line-performance/train_date.csv.zip
/kaggle/input/bosch-production-line-performance/sample_submission.csv.zip
/kaggle/input/bosch-production-line-performance/train_numeric.csv.zip
/kaggle/input/bosch-production-line-performance/test_date.csv.zip
/kaggle/input/bosch-production-line-performance/test_categorical.csv.zip
/kaggle/input/bosch-production-line-performance/test_numeric.csv.zip
/kaggle/input/bosch-production-line-performance/train_categorical.csv.zip


## Data Loading and Exploration
### Load and read train data

In [25]:
# The data files are extremely large, so load only the first 1,000 rows of each for exploration
train_date = pd.read_csv("/kaggle/input/bosch-production-line-performance/train_date.csv.zip", nrows = 1000)
train_cat = pd.read_csv("/kaggle/input/bosch-production-line-performance/train_categorical.csv.zip",nrows = 1000, low_memory =False)
train_num = pd.read_csv("/kaggle/input/bosch-production-line-performance/train_numeric.csv.zip", nrows = 1000)

print('Train date has the shape as ',train_date.shape)
display(train_date.head())
print('Train numerical has the shape as ',train_num.shape)
display(train_num.head())

Train date has the shape as  (1000, 1157)


Unnamed: 0,Id,L0_S0_D1,L0_S0_D3,L0_S0_D5,L0_S0_D7,L0_S0_D9,L0_S0_D11,L0_S0_D13,L0_S0_D15,L0_S0_D17,...,L3_S50_D4246,L3_S50_D4248,L3_S50_D4250,L3_S50_D4252,L3_S50_D4254,L3_S51_D4255,L3_S51_D4257,L3_S51_D4259,L3_S51_D4261,L3_S51_D4263
0,4,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,...,,,,,,,,,,
1,6,,,,,,,,,,...,,,,,,,,,,
2,7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,...,,,,,,,,,,
3,9,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,...,,,,,,,,,,
4,11,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,...,,,,,,,,,,


Train numerical has the shape as  (1000, 970)


Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,...,L3_S50_F4245,L3_S50_F4247,L3_S50_F4249,L3_S50_F4251,L3_S50_F4253,L3_S51_F4256,L3_S51_F4258,L3_S51_F4260,L3_S51_F4262,Response
0,4,0.03,-0.034,-0.197,-0.179,0.118,0.116,-0.015,-0.032,0.02,...,,,,,,,,,,0
1,6,,,,,,,,,,...,,,,,,,,,,0
2,7,0.088,0.086,0.003,-0.052,0.161,0.025,-0.015,-0.072,-0.225,...,,,,,,,,,,0
3,9,-0.036,-0.064,0.294,0.33,0.074,0.161,0.022,0.128,-0.026,...,,,,,,,,,,0
4,11,-0.055,-0.086,0.294,0.33,0.118,0.025,0.03,0.168,-0.169,...,,,,,,,,,,0


## Preprocessing train data
### Feature Engineering - Date Data
Identify and extract relevant date columns, optimizing for reduced reading time.

In [26]:
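# For each station, keep only the first date column (one timestamp per station)
counts = train_date.count()  # used only to obtain the column names as an index
date_cols = counts.reset_index()["index"].str.split("_", expand=True)  # 'L0_S0_D1' -> ['L0', 'S0', 'D1']
col_idx = date_cols.drop_duplicates(1).index  # first column seen for each station (column 1 holds 'S<n>')
date_cols = train_date.columns[col_idx]
display(date_cols)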

Index(['Id', 'L0_S0_D1', 'L0_S1_D26', 'L0_S2_D34', 'L0_S3_D70', 'L0_S4_D106',
       'L0_S5_D115', 'L0_S6_D120', 'L0_S7_D137', 'L0_S8_D145', 'L0_S9_D152',
       'L0_S10_D216', 'L0_S11_D280', 'L0_S12_D331', 'L0_S13_D355',
       'L0_S14_D360', 'L0_S15_D395', 'L0_S16_D423', 'L0_S17_D432',
       'L0_S18_D437', 'L0_S19_D454', 'L0_S20_D462', 'L0_S21_D469',
       'L0_S22_D543', 'L0_S23_D617', 'L1_S24_D677', 'L1_S25_D1854',
       'L2_S26_D3037', 'L2_S27_D3130', 'L2_S28_D3223', 'L3_S29_D3316',
       'L3_S30_D3496', 'L3_S31_D3836', 'L3_S32_D3852', 'L3_S33_D3856',
       'L3_S34_D3875', 'L3_S35_D3886', 'L3_S36_D3919', 'L3_S37_D3942',
       'L3_S38_D3953', 'L3_S39_D3966', 'L3_S40_D3981', 'L3_S41_D3997',
       'L3_S42_D4029', 'L3_S43_D4062', 'L3_S44_D4101', 'L3_S45_D4125',
       'L3_S46_D4135', 'L3_S47_D4140', 'L3_S48_D4194', 'L3_S49_D4208',
       'L3_S50_D4242', 'L3_S51_D4255'],
      dtype='object')

In [27]:
# From the date file, extract only columns listed in the date_cols to reduce reading time
train_date = pd.read_csv("/kaggle/input/bosch-production-line-performance/train_date.csv.zip",usecols=date_cols)
display(train_date.head())

train_date["start_station"] = -1
train_date["end_station"] = -1

for col in train_date.drop(columns=["Id", "start_station", "end_station"]).columns:
    notnulls = ~train_date[col].isnull() 
    station_name = int(col.split("_")[1][1:])
    
    train_date.loc[(notnulls) & (train_date.start_station == -1), "start_station"] = station_name
    train_date.loc[(notnulls), "end_station"] = station_name
    
# Keep only the Id and the first and last station visited by each part
train_date = train_date[['Id','start_station','end_station']]
train_date

Unnamed: 0,Id,L0_S0_D1,L0_S1_D26,L0_S2_D34,L0_S3_D70,L0_S4_D106,L0_S5_D115,L0_S6_D120,L0_S7_D137,L0_S8_D145,...,L3_S42_D4029,L3_S43_D4062,L3_S44_D4101,L3_S45_D4125,L3_S46_D4135,L3_S47_D4140,L3_S48_D4194,L3_S49_D4208,L3_S50_D4242,L3_S51_D4255
0,4,82.24,82.24,82.24,,82.26,,,82.26,82.27,...,,,,,,,,,,
1,6,,,,,,,,,,...,,,,,,,,,,
2,7,1618.7,1618.7,1618.7,,,1618.72,1618.72,,1618.73,...,,,,,,,,,,
3,9,1149.2,1149.2,1149.21,,1149.22,,,1149.22,1149.22,...,,,,,,,,,,
4,11,602.64,602.64,,602.64,602.66,,,602.67,602.67,...,,,,,,,,,,


Unnamed: 0,Id,start_station,end_station
0,4,0,37
1,6,12,37
2,7,0,37
3,9,0,37
4,11,0,37
...,...,...,...
1183742,2367490,0,37
1183743,2367491,12,37
1183744,2367492,0,37
1183745,2367493,0,37
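
The column-by-column loop above can also be written without a Python loop over columns. The following is a sketch only, assuming the date columns are ordered by station as in the file and that a part's start and end stations are the first and last non-null columns. `station_span` is a hypothetical helper name and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def station_span(dates: pd.DataFrame) -> pd.DataFrame:
    """Return Id, start_station and end_station from one-date-column-per-station data."""
    station_cols = [c for c in dates.columns if c != "Id"]
    # Station number of each column, e.g. 'L0_S12_D331' -> 12
    stations = np.array([int(c.split("_")[1][1:]) for c in station_cols])
    visited = dates[station_cols].notna().to_numpy()

    any_visit = visited.any(axis=1)
    first = visited.argmax(axis=1)                                  # first True per row
    last = visited.shape[1] - 1 - visited[:, ::-1].argmax(axis=1)   # last True per row

    return pd.DataFrame({
        "Id": dates["Id"].to_numpy(),
        # Rows with no timestamp at all keep the -1 sentinel used in the notebook
        "start_station": np.where(any_visit, stations[first], -1),
        "end_station": np.where(any_visit, stations[last], -1),
    })


# Made-up miniature example
toy = pd.DataFrame({
    "Id": [4, 6, 7],
    "L0_S0_D1": [82.24, np.nan, np.nan],
    "L0_S1_D26": [82.24, np.nan, 5.0],
    "L3_S37_D3942": [82.30, np.nan, np.nan],
})
print(station_span(toy))
```

Because it works on the whole boolean matrix at once, this form avoids repeated `.loc` assignments, at the cost of materialising one boolean array the size of the selected columns.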


### Feature Engineering - Numerical Data
Keep only the numeric columns whose missing ratio is at most 0.5, i.e. columns with relatively complete data.

In [28]:
# Accumulate per-column missing counts and sums chunk by chunk (the file is too large to load at once),
# then keep the columns whose missing ratio is at most 0.5
missing_ratio = pd.Series(index=train_num.columns,
                         data=np.zeros(len(train_num.columns))) # 970 columns; holds raw counts until divided below
column_means = pd.Series(index=train_num.columns,
                         data=np.zeros(len(train_num.columns))) # holds column sums until divided by length
length = 0

for chunk in pd.read_csv("/kaggle/input/bosch-production-line-performance/train_numeric.csv.zip",chunksize=100000):
    temp = chunk.isnull().sum()   # missing values per column in this chunk
    temp2 = chunk.sum()           # column sums in this chunk (NaN values are skipped)
    
    length = length + len(chunk)
    missing_ratio = missing_ratio + temp 
    column_means = column_means + temp2 

display(missing_ratio)            # raw missing counts per column
display(column_means / length)    # column sums divided by the total number of rows

missing_ratio = missing_ratio / length
usecols = train_num.columns[missing_ratio <= 0.5]
usecols


Id                    0.0
L0_S0_F0         509885.0
L0_S0_F2         509885.0
L0_S0_F4         509885.0
L0_S0_F6         509885.0
                  ...    
L3_S51_F4256    1123894.0
L3_S51_F4258    1123894.0
L3_S51_F4260    1123894.0
L3_S51_F4262    1123894.0
Response              0.0
Length: 970, dtype: float64

Id              1.184050e+06
L0_S0_F0        3.105900e-05
L0_S0_F2        5.196550e-05
L0_S0_F4        2.280048e-05
L0_S0_F6        7.875838e-06
                    ...     
L3_S51_F4256   -1.098208e-07
L3_S51_F4258    1.887228e-06
L3_S51_F4260    1.092463e-05
L3_S51_F4262    2.845203e-06
Response        5.811208e-03
Length: 970, dtype: float64

Index(['Id', 'L0_S0_F0', 'L0_S0_F2', 'L0_S0_F4', 'L0_S0_F6', 'L0_S0_F8',
       'L0_S0_F10', 'L0_S0_F12', 'L0_S0_F14', 'L0_S0_F16',
       ...
       'L3_S33_F3873', 'L3_S34_F3876', 'L3_S34_F3878', 'L3_S34_F3880',
       'L3_S34_F3882', 'L3_S37_F3944', 'L3_S37_F3946', 'L3_S37_F3948',
       'L3_S37_F3950', 'Response'],
      dtype='object', length=158)
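
The chunked accumulation can be checked on a small input where the answer is known by hand. The sketch below is an illustration under stated assumptions: `missing_ratio_by_chunks` is a hypothetical helper, and the in-memory CSV with columns `f_dense` and `f_sparse` is a made-up stand-in for `train_numeric.csv`.

```python
import io

import pandas as pd


def missing_ratio_by_chunks(csv_source, chunksize: int) -> pd.Series:
    """Fraction of missing values per column, computed one chunk at a time."""
    missing = None
    n_rows = 0
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        counts = chunk.isnull().sum()
        missing = counts if missing is None else missing + counts
        n_rows += len(chunk)
    return missing / n_rows


# Made-up toy file: f_sparse is missing in 3 of 4 rows, f_dense in 1 of 4
toy_csv = io.StringIO(
    "Id,f_dense,f_sparse,Response\n"
    "1,0.1,,0\n"
    "2,0.2,,0\n"
    "3,,0.5,1\n"
    "4,0.4,,0\n"
)
ratio = missing_ratio_by_chunks(toy_csv, chunksize=2)
usecols = ratio[ratio <= 0.5].index   # same threshold as in the notebook
print(ratio)
print(list(usecols))
```

Only the running counts and the row total are kept in memory, so peak memory depends on `chunksize` rather than on the file size.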

In [29]:
# Reduce the retained numeric features to 15 principal components with PCA (missing values are filled with 0 first)
from sklearn.decomposition import PCA
n_components = 15
data = pd.read_csv("/kaggle/input/bosch-production-line-performance/train_numeric.csv.zip",usecols=usecols)
data = data.fillna(0)
ids = data.Id
responses = data.Response
data = data.drop(columns=["Id", "Response"])
pca = PCA(n_components=n_components)
X = pca.fit_transform(data)
pca_df = pd.DataFrame(columns=[f"PC{i}" for i in range(1, n_components+1)],data=X)
display(pca_df)

pca_df["Id"] = ids
pca_df["Response"] = responses
X = pd.merge(train_date, pca_df, on="Id")
X


Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15
0,1.042241,0.844171,0.003416,-0.371864,-0.259459,-0.075129,-0.111638,0.113619,0.306129,0.319546,-0.294680,-0.121769,0.068970,-0.357116,-0.178211
1,-0.752831,-0.037533,0.029310,-0.396531,0.066134,0.000293,0.257200,0.431628,0.107946,0.311082,-0.020098,-0.037194,-0.144913,0.255305,0.100864
2,-0.689241,0.200562,-0.136636,0.296196,0.098771,0.133084,-0.190081,-0.097660,0.256051,-0.208719,0.379717,-0.127389,0.105332,0.381589,0.085725
3,-0.711503,0.060450,-0.058964,-0.372314,0.352493,0.156089,0.092519,0.109421,-0.540272,0.275782,-0.298533,-0.111555,0.094223,-0.177494,-0.046122
4,0.719567,-0.343650,-0.048335,0.196432,-0.026665,0.073757,-0.367326,0.101849,-0.639415,0.291604,-0.059433,-0.275651,-0.251157,-0.224226,0.251968
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1183742,-0.814037,0.029653,0.012314,-0.294017,0.007412,-0.225509,0.383899,0.382793,0.280638,-0.170617,0.208603,0.053105,-0.105787,0.126857,0.395065
1183743,0.703215,-0.344237,-0.052697,-0.267336,0.190187,-0.252047,0.061086,-0.151427,-0.162259,-0.169758,-0.028512,-0.209701,-0.060654,0.113877,-0.040088
1183744,-0.808204,-0.077716,-0.031379,-0.216292,0.052491,-0.138372,-0.034487,0.266707,0.542645,0.304795,0.094425,-0.131895,0.085689,-0.262777,0.110244
1183745,0.728341,-0.479680,-0.043168,-0.264087,0.016628,0.324876,-0.485579,-0.549643,0.168500,0.320968,0.073140,-0.237937,-0.165061,0.070355,-0.086099


Unnamed: 0,Id,start_station,end_station,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,Response
0,4,0,37,1.042241,0.844171,0.003416,-0.371864,-0.259459,-0.075129,-0.111638,0.113619,0.306129,0.319546,-0.294680,-0.121769,0.068970,-0.357116,-0.178211,0
1,6,12,37,-0.752831,-0.037533,0.029310,-0.396531,0.066134,0.000293,0.257200,0.431628,0.107946,0.311082,-0.020098,-0.037194,-0.144913,0.255305,0.100864,0
2,7,0,37,-0.689241,0.200562,-0.136636,0.296196,0.098771,0.133084,-0.190081,-0.097660,0.256051,-0.208719,0.379717,-0.127389,0.105332,0.381589,0.085725,0
3,9,0,37,-0.711503,0.060450,-0.058964,-0.372314,0.352493,0.156089,0.092519,0.109421,-0.540272,0.275782,-0.298533,-0.111555,0.094223,-0.177494,-0.046122,0
4,11,0,37,0.719567,-0.343650,-0.048335,0.196432,-0.026665,0.073757,-0.367326,0.101849,-0.639415,0.291604,-0.059433,-0.275651,-0.251157,-0.224226,0.251968,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1183742,2367490,0,37,-0.814037,0.029653,0.012314,-0.294017,0.007412,-0.225509,0.383899,0.382793,0.280638,-0.170617,0.208603,0.053105,-0.105787,0.126857,0.395065,0
1183743,2367491,12,37,0.703215,-0.344237,-0.052697,-0.267336,0.190187,-0.252047,0.061086,-0.151427,-0.162259,-0.169758,-0.028512,-0.209701,-0.060654,0.113877,-0.040088,0
1183744,2367492,0,37,-0.808204,-0.077716,-0.031379,-0.216292,0.052491,-0.138372,-0.034487,0.266707,0.542645,0.304795,0.094425,-0.131895,0.085689,-0.262777,0.110244,0
1183745,2367493,0,37,0.728341,-0.479680,-0.043168,-0.264087,0.016628,0.324876,-0.485579,-0.549643,0.168500,0.320968,0.073140,-0.237937,-0.165061,0.070355,-0.086099,0
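
The notebook fixes `n_components = 15` without reporting how much variance those components retain. One way to inspect this is scikit-learn's `explained_variance_ratio_` attribute. The sketch below uses synthetic data, not the Bosch features, and the 95% target is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 500 rows, 20 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
features = latent @ mixing + 0.05 * rng.normal(size=(500, 20))

pca = PCA(n_components=15).fit(features)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the 95% target
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3))
print("components needed for 95% variance:", n_for_95)
```

Running the same check on the fitted `pca` object from the cell above would show whether 15 components is generous or too few for the 156 numeric inputs.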


## Machine Learning Modeling

In [30]:
# Split the data into training and validation sets, and train a LightGBM classifier.
# Evaluate the model using the Matthews Correlation Coefficient (MCC).

from sklearn.model_selection import train_test_split
from sklearn.metrics import matthews_corrcoef
from lightgbm import LGBMClassifier


y = X.Response
X = X.drop(columns=["Id", "Response"])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0xC0FFEE,
                                                 stratify=y)

clf = LGBMClassifier(max_depth=8, n_estimators=200, random_state=0xC0FFEE, verbose=1)
clf.fit(X_train, y_train)

pred = clf.predict(X_train)
pred2 = clf.predict(X_val)

print("Train MCC : %.4f" % matthews_corrcoef(y_train, pred))
print("Validation MCC : %.4f" % matthews_corrcoef(y_val, pred2))



[LightGBM] [Info] Number of positive: 5503, number of negative: 941494
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3851
[LightGBM] [Info] Number of data points in the train set: 946997, number of used features: 17
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.005811 -> initscore=-5.142175
[LightGBM] [Info] Start training from score -5.142175



Train MCC : 0.0785
Validation MCC : 0.0076
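
A validation MCC this close to zero is consistent with the classifier predicting almost no positives: with roughly 0.58% positives, few predicted probabilities exceed the 0.5 cutoff that `predict` effectively applies. A common remedy, which this notebook does not apply, is to score with `predict_proba` and choose the cutoff that maximises MCC on the validation set. The sketch below illustrates the idea in plain Python on made-up scores; `mcc` and `best_threshold` are hypothetical helpers, and nothing here claims how much the actual model would improve.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when a row or column of the confusion matrix is empty
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_threshold(y_true, scores, thresholds):
    """Return the (threshold, mcc) pair with the highest MCC among the candidates."""
    results = [(th, mcc(y_true, [int(s >= th) for s in scores])) for th in thresholds]
    return max(results, key=lambda pair: pair[1])


# Made-up scores: the rare positives score well below 0.5
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.01, 0.02, 0.02, 0.03, 0.05, 0.04, 0.01, 0.20, 0.15, 0.30]

print(mcc(y_true, [int(s >= 0.5) for s in scores]))   # default cutoff: no positives predicted
print(best_threshold(y_true, scores, [0.05, 0.10, 0.25, 0.50]))
```

The threshold should be selected on held-out data only; tuning it on the training predictions would overstate the achievable score.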



### Model Persistence
Save the trained model and relevant data for future use or reproducibility.

In [31]:
import pickle

# Save the training feature matrix (X) and the trained model (clf)
with open("train_data.pk", "wb") as f:
    pickle.dump(X, f)
with open("LGBM_model.pk", "wb") as f:
    pickle.dump(clf, f)
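
Loading is the mirror image of saving. The sketch below shows the round trip with a plain dictionary as a hypothetical stand-in for the fitted classifier, and a temporary directory in place of the notebook's working directory.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted classifier; any picklable object behaves the same way
model = {"name": "LGBM_model", "params": {"max_depth": 8, "n_estimators": 200}}

path = os.path.join(tempfile.mkdtemp(), "LGBM_model.pk")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Read the object back; "rb" mirrors the "wb" used when saving
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Pickle files should only be loaded from trusted sources, and a pickled model generally needs compatible library versions at load time.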

## Preprocess test data
* Load the test data and preprocess it with the same steps used for the training data.

In [32]:
# Load a 1,000-row sample of each test file to check the shapes
test_cat = pd.read_csv("/kaggle/input/bosch-production-line-performance/test_categorical.csv.zip",nrows=1000, low_memory = False)
test_num = pd.read_csv("/kaggle/input/bosch-production-line-performance/test_numeric.csv.zip",nrows=1000)
test_date = pd.read_csv("/kaggle/input/bosch-production-line-performance/test_date.csv.zip",nrows=1000)
print(test_cat.shape, test_num.shape, test_date.shape)

(1000, 2141) (1000, 969) (1000, 1157)


### Feature engineering - Date (Test)
Apply the same processing as for the training date data: keep one date column per station, then derive each part's first and last station.

In [33]:
# Select one date column per station, as for the training data
counts = test_date.count() 
date_cols = counts.reset_index()["index"].str.split("_", expand=True)
col_idx = date_cols.drop_duplicates(1).index 
date_cols = test_date.columns[col_idx]

# Reload only those columns from the full file, then derive start_station and end_station
test_date = pd.read_csv("/kaggle/input/bosch-production-line-performance/test_date.csv.zip",usecols=date_cols)

test_date["start_station"] = -1
test_date["end_station"] = -1

for col in test_date.drop(columns=["Id", "start_station", "end_station"]).columns:
    notnulls = ~test_date[col].isnull() 
    station_name = int(col.split("_")[1][1:])
    
    test_date.loc[(notnulls) & (test_date.start_station == -1), "start_station"] = station_name
    test_date.loc[(notnulls), "end_station"] = station_name
    
display(test_date)
test_date = test_date[["Id", "start_station", "end_station"]]
test_date

Unnamed: 0,Id,L0_S0_D1,L0_S1_D26,L0_S2_D34,L0_S3_D70,L0_S4_D106,L0_S5_D115,L0_S6_D120,L0_S7_D137,L0_S8_D145,...,L3_S44_D4101,L3_S45_D4125,L3_S46_D4135,L3_S47_D4140,L3_S48_D4194,L3_S49_D4208,L3_S50_D4242,L3_S51_D4255,start_station,end_station
0,1,,,,,,,,,,...,,,,,,,,,27,37
1,2,,,,,,,,,,...,,,,,,,,,24,37
2,3,,,,,,,,,,...,,,,,,,,,26,37
3,5,255.45,255.45,255.46,,255.48,,,255.48,255.48,...,,,,,,,,,0,37
4,8,,,,,,,,,,...,,,,,,,,,26,37
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1183743,2367483,653.85,653.85,653.85,,653.86,,,653.87,653.87,...,,,,,,,,,0,37
1183744,2367485,907.34,907.34,,907.34,,907.37,,907.37,907.37,...,,,,,,,,,0,37
1183745,2367486,185.92,185.92,185.92,,185.96,,185.96,,185.96,...,,,,,,,,,0,37
1183746,2367489,570.85,570.85,570.86,,,570.88,,570.89,570.89,...,,,,,,,,,0,37


Unnamed: 0,Id,start_station,end_station
0,1,27,37
1,2,24,37
2,3,26,37
3,5,0,37
4,8,26,37
...,...,...,...
1183743,2367483,0,37
1183744,2367485,0,37
1183745,2367486,0,37
1183746,2367489,0,37


### Feature engineering - Numerical (Test)

In [34]:
# Reuse the columns selected on the training data, minus the target (absent from the test file)
usecols = usecols.drop("Response")
data = pd.read_csv("/kaggle/input/bosch-production-line-performance/test_numeric.csv.zip",usecols=usecols)
data = data.fillna(0)  # same missing-value fill as for the training data

# Project onto the principal components fitted on the training data
ids = data.Id
data = data.drop(columns=["Id"])
X = pca.transform(data)  # transform only: do not refit the PCA on the test set
pca_df = pd.DataFrame(columns=[f"PC{i}" for i in range(1, n_components+1)],data=X)
display(pca_df)
 
# Complete test data frame (pca_df (numeric) + test_date(date))
pca_df["Id"] = ids
X = pd.merge(test_date, pca_df, on="Id")
X


Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15
0,0.680208,-0.131046,-0.117420,0.308592,-0.101817,-0.381441,0.003000,0.192951,-0.113476,-0.222917,0.099926,-0.417588,0.134927,0.079427,0.164669
1,-0.752258,0.059437,-0.073471,0.271701,-0.285142,0.102560,-0.376483,-0.284975,0.005057,-0.164305,-0.035016,-0.022008,0.092543,-0.255005,0.155506
2,0.740638,-0.506789,-0.055914,-0.299160,0.023062,-0.229405,-0.328104,0.364364,-0.202059,0.346929,-0.130142,0.102417,-0.299334,-0.193364,0.077265
3,0.733001,-0.155001,-0.066236,-0.272830,0.021756,-0.657618,0.066785,-0.145396,-0.224172,0.399532,-0.201378,0.447257,-0.031912,-0.364725,0.044492
4,0.695820,0.017774,-0.018656,0.358830,0.137488,-0.178232,0.244243,0.582116,0.096867,-0.238892,0.130845,-0.367439,0.181321,-0.155821,0.521229
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1183743,0.768018,-0.274138,-0.062370,-0.045336,0.234465,-0.251646,-0.149974,-0.343337,-0.442708,-0.212002,0.367036,0.198448,0.137065,0.121144,0.082982
1183744,-0.458631,0.100999,-0.111729,0.078377,0.860590,-0.198316,-0.126598,-0.047510,-0.053087,-0.126015,-0.284186,0.270808,0.151812,0.108924,-0.291608
1183745,0.837577,0.222387,-0.056587,-0.447364,0.111645,-0.020272,-0.039589,-0.234776,0.328670,0.295863,-0.116616,-0.025252,0.170084,0.407175,0.268662
1183746,0.724216,-0.401064,-0.061765,-0.416408,-0.117791,-0.241571,0.083448,-0.154608,0.109539,-0.117190,-0.368866,-0.148868,-0.104941,-0.034329,0.158859


Unnamed: 0,Id,start_station,end_station,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15
0,1,27,37,0.680208,-0.131046,-0.117420,0.308592,-0.101817,-0.381441,0.003000,0.192951,-0.113476,-0.222917,0.099926,-0.417588,0.134927,0.079427,0.164669
1,2,24,37,-0.752258,0.059437,-0.073471,0.271701,-0.285142,0.102560,-0.376483,-0.284975,0.005057,-0.164305,-0.035016,-0.022008,0.092543,-0.255005,0.155506
2,3,26,37,0.740638,-0.506789,-0.055914,-0.299160,0.023062,-0.229405,-0.328104,0.364364,-0.202059,0.346929,-0.130142,0.102417,-0.299334,-0.193364,0.077265
3,5,0,37,0.733001,-0.155001,-0.066236,-0.272830,0.021756,-0.657618,0.066785,-0.145396,-0.224172,0.399532,-0.201378,0.447257,-0.031912,-0.364725,0.044492
4,8,26,37,0.695820,0.017774,-0.018656,0.358830,0.137488,-0.178232,0.244243,0.582116,0.096867,-0.238892,0.130845,-0.367439,0.181321,-0.155821,0.521229
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1183743,2367483,0,37,0.768018,-0.274138,-0.062370,-0.045336,0.234465,-0.251646,-0.149974,-0.343337,-0.442708,-0.212002,0.367036,0.198448,0.137065,0.121144,0.082982
1183744,2367485,0,37,-0.458631,0.100999,-0.111729,0.078377,0.860590,-0.198316,-0.126598,-0.047510,-0.053087,-0.126015,-0.284186,0.270808,0.151812,0.108924,-0.291608
1183745,2367486,0,37,0.837577,0.222387,-0.056587,-0.447364,0.111645,-0.020272,-0.039589,-0.234776,0.328670,0.295863,-0.116616,-0.025252,0.170084,0.407175,0.268662
1183746,2367489,0,37,0.724216,-0.401064,-0.061765,-0.416408,-0.117791,-0.241571,0.083448,-0.154608,0.109539,-0.117190,-0.368866,-0.148868,-0.104941,-0.034329,0.158859
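
The test features must be projected with the PCA fitted on the training data, because refitting on the test set would produce components that the trained model has never seen. The sketch below illustrates the difference on synthetic arrays. The shapes and the shifted test distribution are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
train = rng.normal(size=(200, 6))
test = rng.normal(loc=0.5, size=(50, 6))   # deliberately shifted relative to train

pca = PCA(n_components=2).fit(train)
test_shared = pca.transform(test)                       # same axes as the training features
test_refit = PCA(n_components=2).fit_transform(test)    # new axes: not comparable with train

# transform() centres with the training mean, then projects onto the training components
manual = (test - pca.mean_) @ pca.components_.T
print(np.allclose(test_shared, manual))
print(np.allclose(test_shared, test_refit))
```

The first check confirms what `transform` computes; the second shows that a refit yields different coordinates for the same rows.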


## Test Data Prediction and Submission
* Make predictions on the preprocessed test data and create a submission file.

In [35]:
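# Test data inference
X_test = X.drop(columns=["Id"])
result = clf.predict(X_test)

# Build the submission, aligning predictions to Ids explicitly instead of relying on row order
submission = pd.read_csv("/kaggle/input/bosch-production-line-performance/sample_submission.csv.zip")
pred_by_id = pd.Series(result, index=X["Id"].values)
# Assumption: any Id missing from the merged test frame is predicted as 0 (the majority class)
submission["Response"] = submission["Id"].map(pred_by_id).fillna(0).astype(int)
submission.to_csv("submission.csv", index=False)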
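
Before uploading, the file can be checked against the sample submission. The sketch below is a minimal illustration, assuming the submission has exactly the columns `Id` and `Response` with binary predictions. `check_submission` is a hypothetical helper and the frames are made-up miniatures.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> None:
    """Raise AssertionError if the submission does not match the sample's layout."""
    assert list(submission.columns) == list(sample.columns), "column names differ"
    assert len(submission) == len(sample), "row count differs"
    assert submission["Id"].equals(sample["Id"]), "Ids differ or are out of order"
    assert submission["Response"].isin([0, 1]).all(), "Response must be 0 or 1"


# Made-up miniature frames standing in for the real files
sample = pd.DataFrame({"Id": [1, 2, 3], "Response": [0, 0, 0]})
good = pd.DataFrame({"Id": [1, 2, 3], "Response": [0, 1, 0]})

check_submission(good, sample)
print("submission layout OK")
```

A check like this catches silent misalignment between predictions and Ids, which would otherwise only show up as a poor leaderboard score.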