# Bosch Production Line Performance data
## John Burt
### Portland Data Science Group<br/>Applied Data Science Meetup series

### Notebook purpose: Look at data files 

#### NOTE: This is the original data, [downloaded from Kaggle](https://www.kaggle.com/c/bosch-production-line-performance) and stored in a folder named "data".

### Questions I have from looking at these data:

- What is the time format of the date data?
- The categorical data doesn't seem to align with the date and numeric data. The categorical file has no station 0 (`L0_S0`) columns, and every value in its first rows is empty.
- I'm not sure what the categorical values mean: "T1", "T3", etc.


**From the Kaggle data description:**

The data for this competition represents measurements of parts as they move through Bosch's production lines. Each part has a unique Id. The goal is to predict which parts will fail quality control (represented by a 'Response' = 1).

The dataset contains an extremely large number of anonymized features. Features are named according to a convention that tells you the production line, the station on the line, and a feature number. E.g. L3_S36_F3939 is a feature measured on line 3, station 36, and is feature number 3939.

On account of the large size of the dataset, we have separated the files by the type of feature they contain: numerical, categorical, and finally, a file with date features. The date features provide a timestamp for when each measurement was taken. Each date column ends in a number that corresponds to the previous feature number. E.g. the value of L0_S0_D1 is the time at which L0_S0_F0 was taken.

In addition to being one of the largest datasets (in terms of number of features) ever hosted on Kaggle, the ground truth for this competition is highly imbalanced. Together, these two attributes are expected to make this a challenging problem.

**File descriptions**

    train_numeric.csv - the training set numeric features (this file contains the 'Response' variable)
    test_numeric.csv - the test set numeric features (you must predict the 'Response' for these Ids)
    train_categorical.csv - the training set categorical features
    test_categorical.csv - the test set categorical features
    train_date.csv - the training set date features
    test_date.csv - the test set date features
    sample_submission.csv - a sample submission file in the correct format
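
The naming convention in the Kaggle description above can be checked mechanically. The sketch below is not part of the original notebook: `parse_column_name` and `measured_feature` are hypothetical helper names, and the date-to-feature mapping assumes that "previous feature number" means exactly the date number minus one, as in the `L0_S0_D1` / `L0_S0_F0` example.

```python
import re
from typing import NamedTuple


class ColumnName(NamedTuple):
    line: int      # production line, e.g. 3 in L3_S36_F3939
    station: int   # station on that line, e.g. 36
    kind: str      # "F" for a feature column, "D" for a date column
    number: int    # feature (or date) number, e.g. 3939


# Assumed pattern: L<line>_S<station>_<F or D><number>
COLUMN_PATTERN = re.compile(r"^L(\d+)_S(\d+)_([FD])(\d+)$")


def parse_column_name(name):
    """Split a column name into line, station, kind and number."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a feature or date column: {name!r}")
    line, station, kind, number = match.groups()
    return ColumnName(int(line), int(station), kind, int(number))


def measured_feature(date_column):
    """Return the feature column that a date column timestamps.

    Assumes the feature number is the date number minus one.
    """
    parsed = parse_column_name(date_column)
    if parsed.kind != "D":
        raise ValueError(f"not a date column: {date_column!r}")
    return f"L{parsed.line}_S{parsed.station}_F{parsed.number - 1}"


parsed = parse_column_name("L3_S36_F3939")
print(parsed.line, parsed.station, parsed.number)  # 3 36 3939
print(measured_feature("L0_S0_D1"))                # L0_S0_F0
```

Columns such as `Id` and `Response` do not match the pattern, so they raise a `ValueError` and need to be filtered out first.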

### Generic notebook setup and imports

In [1]:
# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt

# Show up to 100 rows and columns when displaying dataframes
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

# Use the ggplot plotting style
matplotlib.style.use('ggplot')



## Read some data into pandas dataframes

Note that these files are very large (2-3 GB each), so I'm only reading the first 1000 rows of each file here.

In [16]:
location = './data/'

# Read only the first numrows rows of each file
numrows = 1000

data_date = pd.read_csv(location + 'train_date.csv', nrows=numrows)
data_cat = pd.read_csv(location + 'train_categorical.csv', nrows=numrows)
data_numeric = pd.read_csv(location + 'train_numeric.csv', nrows=numrows)
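
When the first rows are not enough, one alternative to loading a whole 2-3 GB file is to stream it in pieces with the `chunksize` argument of `pd.read_csv`. The following is a minimal sketch rather than something the original notebook does: `count_rows_in_chunks` is a hypothetical helper, and the tiny in-memory CSV stands in for a real file such as `./data/train_numeric.csv`.

```python
import io

import pandas as pd


def count_rows_in_chunks(source, chunksize=100_000):
    """Count data rows while holding only one chunk in memory at a time."""
    total = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total += len(chunk)
    return total


# Stand-in for a large file on disk (made-up rows in the same layout)
demo_csv = io.StringIO(
    "Id,L0_S0_F0,Response\n"
    "4,0.03,0\n"
    "6,,0\n"
    "7,0.088,0\n"
)

n_rows = count_rows_in_chunks(demo_csv, chunksize=2)
print(n_rows)
```

The same loop body can accumulate per-column statistics instead of a row count, which keeps memory use bounded by the chunk size.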


In [7]:
data_date.head()

Unnamed: 0,Id,L0_S0_D1,L0_S0_D3,L0_S0_D5,L0_S0_D7,L0_S0_D9,L0_S0_D11,L0_S0_D13,L0_S0_D15,L0_S0_D17,L0_S0_D19,L0_S0_D21,L0_S0_D23,L0_S1_D26,L0_S1_D30,L0_S2_D34,L0_S2_D38,L0_S2_D42,L0_S2_D46,L0_S2_D50,L0_S2_D54,L0_S2_D58,L0_S2_D62,L0_S2_D66,L0_S3_D70,L0_S3_D74,L0_S3_D78,L0_S3_D82,L0_S3_D86,L0_S3_D90,L0_S3_D94,L0_S3_D98,L0_S3_D102,L0_S4_D106,L0_S4_D111,L0_S5_D115,L0_S5_D117,L0_S6_D120,L0_S6_D124,L0_S6_D127,L0_S6_D130,L0_S6_D134,L0_S7_D137,L0_S7_D139,L0_S7_D140,L0_S7_D141,L0_S7_D143,L0_S8_D145,L0_S8_D147,L0_S8_D148,...,L3_S44_D4104,L3_S44_D4107,L3_S44_D4110,L3_S44_D4113,L3_S44_D4116,L3_S44_D4119,L3_S44_D4122,L3_S45_D4125,L3_S45_D4127,L3_S45_D4129,L3_S45_D4131,L3_S45_D4133,L3_S46_D4135,L3_S47_D4140,L3_S47_D4145,L3_S47_D4150,L3_S47_D4155,L3_S47_D4160,L3_S47_D4165,L3_S47_D4170,L3_S47_D4175,L3_S47_D4180,L3_S47_D4185,L3_S47_D4190,L3_S48_D4194,L3_S48_D4195,L3_S48_D4197,L3_S48_D4199,L3_S48_D4201,L3_S48_D4203,L3_S48_D4205,L3_S49_D4208,L3_S49_D4213,L3_S49_D4218,L3_S49_D4223,L3_S49_D4228,L3_S49_D4233,L3_S49_D4238,L3_S50_D4242,L3_S50_D4244,L3_S50_D4246,L3_S50_D4248,L3_S50_D4250,L3_S50_D4252,L3_S50_D4254,L3_S51_D4255,L3_S51_D4257,L3_S51_D4259,L3_S51_D4261,L3_S51_D4263
0,4,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,,,,,,,,,,82.26,82.26,,,,,,,,82.26,82.26,82.26,82.26,82.26,82.27,82.27,82.27,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,,,,,,,,,,,,1618.72,1618.72,1618.72,1618.72,1618.72,1618.72,1618.72,,,,,,1618.73,1618.73,1618.73,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,9,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.21,1149.21,1149.21,1149.21,1149.21,1149.21,1149.21,1149.21,1149.21,,,,,,,,,,1149.22,1149.22,,,,,,,,1149.22,1149.22,1149.22,1149.22,1149.22,1149.22,1149.22,1149.22,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,11,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,,,,,,,,,,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.66,602.66,,,,,,,,602.67,602.67,602.67,602.67,602.67,602.67,602.67,602.67,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
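
One way to start on the time format question is to look at the smallest gap between distinct timestamps. This is only a sketch: `smallest_time_step` is a hypothetical helper, and the demo frame copies a few values from the rows shown above. It can suggest the granularity of the timestamps, but it does not reveal what unit they are in.

```python
import numpy as np
import pandas as pd


def smallest_time_step(dates):
    """Return the smallest gap between distinct timestamps, ignoring NaN."""
    values = dates.drop(columns=["Id"], errors="ignore").to_numpy(dtype=float).ravel()
    unique_times = np.unique(values[~np.isnan(values)])  # sorted, no duplicates
    if unique_times.size < 2:
        return float("nan")
    return float(np.diff(unique_times).min())


# A few values taken from the data_date.head() output for Ids 4 and 9
demo = pd.DataFrame({
    "Id": [4, 9],
    "L0_S0_D1": [82.24, 1149.20],
    "L0_S4_D106": [82.26, 1149.22],
    "L0_S8_D145": [82.27, 1149.22],
})

step = smallest_time_step(demo)
print(round(step, 6))
```

With the real data, `smallest_time_step(data_date)` runs the same check on the 1000 rows loaded earlier.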


In [11]:
data_cat.head()

Unnamed: 0,Id,L0_S1_F25,L0_S1_F27,L0_S1_F29,L0_S1_F31,L0_S2_F33,L0_S2_F35,L0_S2_F37,L0_S2_F39,L0_S2_F41,L0_S2_F43,L0_S2_F45,L0_S2_F47,L0_S2_F49,L0_S2_F51,L0_S2_F53,L0_S2_F55,L0_S2_F57,L0_S2_F59,L0_S2_F61,L0_S2_F63,L0_S2_F65,L0_S2_F67,L0_S3_F69,L0_S3_F71,L0_S3_F73,L0_S3_F75,L0_S3_F77,L0_S3_F79,L0_S3_F81,L0_S3_F83,L0_S3_F85,L0_S3_F87,L0_S3_F89,L0_S3_F91,L0_S3_F93,L0_S3_F95,L0_S3_F97,L0_S3_F99,L0_S3_F101,L0_S3_F103,L0_S4_F105,L0_S4_F107,L0_S4_F108,L0_S4_F110,L0_S4_F112,L0_S4_F113,L0_S6_F119,L0_S6_F121,L0_S6_F123,...,L3_S47_F4146,L3_S47_F4147,L3_S47_F4149,L3_S47_F4151,L3_S47_F4152,L3_S47_F4154,L3_S47_F4156,L3_S47_F4157,L3_S47_F4159,L3_S47_F4161,L3_S47_F4162,L3_S47_F4164,L3_S47_F4166,L3_S47_F4167,L3_S47_F4169,L3_S47_F4171,L3_S47_F4172,L3_S47_F4174,L3_S47_F4176,L3_S47_F4177,L3_S47_F4179,L3_S47_F4181,L3_S47_F4182,L3_S47_F4184,L3_S47_F4186,L3_S47_F4187,L3_S47_F4189,L3_S47_F4191,L3_S47_F4192,L3_S49_F4207,L3_S49_F4209,L3_S49_F4210,L3_S49_F4212,L3_S49_F4214,L3_S49_F4215,L3_S49_F4217,L3_S49_F4219,L3_S49_F4220,L3_S49_F4222,L3_S49_F4224,L3_S49_F4225,L3_S49_F4227,L3_S49_F4229,L3_S49_F4230,L3_S49_F4232,L3_S49_F4234,L3_S49_F4235,L3_S49_F4237,L3_S49_F4239,L3_S49_F4240
0,4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,7,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,9,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,11,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [12]:
data_numeric.head()

Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,L0_S0_F18,L0_S0_F20,L0_S0_F22,L0_S1_F24,L0_S1_F28,L0_S2_F32,L0_S2_F36,L0_S2_F40,L0_S2_F44,L0_S2_F48,L0_S2_F52,L0_S2_F56,L0_S2_F60,L0_S2_F64,L0_S3_F68,L0_S3_F72,L0_S3_F76,L0_S3_F80,L0_S3_F84,L0_S3_F88,L0_S3_F92,L0_S3_F96,L0_S3_F100,L0_S4_F104,L0_S4_F109,L0_S5_F114,L0_S5_F116,L0_S6_F118,L0_S6_F122,L0_S6_F132,L0_S7_F136,L0_S7_F138,L0_S7_F142,L0_S8_F144,L0_S8_F146,L0_S8_F149,L0_S9_F155,L0_S9_F160,L0_S9_F165,L0_S9_F170,...,L3_S43_F4095,L3_S44_F4100,L3_S44_F4103,L3_S44_F4106,L3_S44_F4109,L3_S44_F4112,L3_S44_F4115,L3_S44_F4118,L3_S44_F4121,L3_S45_F4124,L3_S45_F4126,L3_S45_F4128,L3_S45_F4130,L3_S45_F4132,L3_S47_F4138,L3_S47_F4143,L3_S47_F4148,L3_S47_F4153,L3_S47_F4158,L3_S47_F4163,L3_S47_F4168,L3_S47_F4173,L3_S47_F4178,L3_S47_F4183,L3_S47_F4188,L3_S48_F4193,L3_S48_F4196,L3_S48_F4198,L3_S48_F4200,L3_S48_F4202,L3_S48_F4204,L3_S49_F4206,L3_S49_F4211,L3_S49_F4216,L3_S49_F4221,L3_S49_F4226,L3_S49_F4231,L3_S49_F4236,L3_S50_F4241,L3_S50_F4243,L3_S50_F4245,L3_S50_F4247,L3_S50_F4249,L3_S50_F4251,L3_S50_F4253,L3_S51_F4256,L3_S51_F4258,L3_S51_F4260,L3_S51_F4262,Response
0,4,0.03,-0.034,-0.197,-0.179,0.118,0.116,-0.015,-0.032,0.02,0.083,-0.273,-0.273,-0.271,0.167,-0.213,-0.023,-0.192,-0.088,0.001,0.0,0.01,-0.223,-0.03,,,,,,,,,,-0.001,-0.004,,,,,,-0.164,-0.077,0.06,-0.157,0.0,0.001,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0
1,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0
2,7,0.088,0.086,0.003,-0.052,0.161,0.025,-0.015,-0.072,-0.225,-0.147,0.25,0.25,0.057,-0.079,-0.013,0.011,0.008,-0.06,-0.005,0.0,0.01,-0.223,-0.077,,,,,,,,,,,,-0.073,0.138,-0.336,0.506,-0.13,,,,-0.157,0.0,0.001,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0
3,9,-0.036,-0.064,0.294,0.33,0.074,0.161,0.022,0.128,-0.026,-0.046,-0.253,-0.253,0.147,-0.007,-0.013,0.12,0.008,-0.231,0.005,0.0,0.01,0.05,0.056,,,,,,,,,,-0.038,-0.001,,,,,,-0.164,0.402,-0.015,0.343,0.0,0.001,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0
4,11,-0.055,-0.086,0.294,0.33,0.118,0.025,0.03,0.168,-0.169,-0.099,0.042,0.042,-0.012,-0.046,,,,,,,,,,0.009,-0.272,-0.051,0.037,0.004,0.0,-0.081,0.311,0.003,0.021,0.015,,,,,,-0.164,-0.083,0.014,-0.157,0.0,0.001,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0
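
To check whether the categorical columns line up with the numeric ones, the station prefixes of the two headers can be compared as sets. This is a sketch under assumptions: `stations` is a hypothetical helper, and the short column lists are a hand-picked subset of the headers shown above, not the full files.

```python
import re


def stations(columns):
    """Collect the distinct line/station prefixes (e.g. 'L0_S1') in a header."""
    found = set()
    for name in columns:
        match = re.match(r"^(L\d+_S\d+)_[FD]\d+$", name)
        if match:  # skips columns such as Id and Response
            found.add(match.group(1))
    return found


# Subsets of the headers printed above
numeric_cols = ["Id", "L0_S0_F0", "L0_S0_F2", "L0_S1_F24", "L0_S5_F114", "Response"]
categorical_cols = ["Id", "L0_S1_F25", "L0_S1_F27", "L0_S2_F33"]

missing_from_categorical = sorted(stations(numeric_cols) - stations(categorical_cols))
print(missing_from_categorical)  # ['L0_S0', 'L0_S5']
```

On the loaded frames, the equivalent call would be `stations(data_numeric.columns) - stations(data_cat.columns)`.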


## How many defect samples are in the training dataset?

In [43]:
# Read the entire numeric data file (slow and memory-hungry: the file is 2-3 GB)
data_numeric = pd.read_csv(location + 'train_numeric.csv')

In [44]:
# Count all rows and the rows flagged as defects (Response > 0)
n_total = len(data_numeric)
n_defect = (data_numeric.Response > 0).sum()
print("total training samples = %d, # defect samples = %d" % (n_total, n_defect))


total training samples = 1183747, # defect samples = 6879
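
Counting defects only needs the `Response` column, so the count does not require loading every feature. The sketch below assumes a hypothetical helper, `defect_rate`, that uses the `usecols` argument of `pd.read_csv`; the in-memory CSV with made-up rows stands in for `train_numeric.csv`. It also turns the counts printed above into a rate.

```python
import io

import pandas as pd


def defect_rate(source):
    """Read only the Response column and return (total, defects, rate)."""
    response = pd.read_csv(source, usecols=["Response"])["Response"]
    n_total = len(response)
    n_defect = int((response > 0).sum())
    return n_total, n_defect, n_defect / n_total


# Stand-in for train_numeric.csv (illustrative values only)
demo_csv = io.StringIO(
    "Id,L0_S0_F0,Response\n"
    "4,0.03,0\n"
    "6,,0\n"
    "7,0.088,1\n"
    "9,-0.036,0\n"
)
n_total, n_defect, rate = defect_rate(demo_csv)
print(n_total, n_defect, rate)

# Rate implied by the counts printed above
reported_rate = 6879 / 1183747
print("defect rate = %.2f%%" % (100 * reported_rate))  # defect rate = 0.58%
```

About 0.58% of the training parts are defects, which is the class imbalance the Kaggle description warns about.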


## What do the file contents look like?


In [29]:
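# Peek at the raw text of one file; swap the filename to inspect the others
filename = 'train_categorical.csv'
# filename = 'train_numeric.csv'
# filename = 'train_date.csv'

# readlines(50000) stops once roughly 50,000 characters have been read
with open(location + filename, 'r') as f:
    x = f.readlines(50000)

# Show the first four data rows (index 0 is the header)
x[1:5]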

['4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
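
Rows that are mostly commas are hard to read by eye, so it can help to count how many fields in each row hold a value. The following is a sketch only: `non_empty_counts` is a hypothetical helper, and the in-memory file uses made-up rows in the same layout as `train_categorical.csv`.

```python
import csv
import io
from itertools import islice


def non_empty_counts(handle, n_rows=4):
    """Return (Id, number of non-empty fields) for the first n_rows data rows."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    counts = []
    for row in islice(reader, n_rows):
        part_id, values = row[0], row[1:]
        counts.append((part_id, sum(1 for value in values if value != "")))
    return counts


# Stand-in for the categorical file (illustrative values only)
demo_file = io.StringIO(
    "Id,L0_S1_F25,L0_S1_F27\n"
    "4,,\n"
    "6,T1,\n"
)

counts = non_empty_counts(demo_file)
print(counts)  # [('4', 0), ('6', 1)]
```

Because `islice` stops after `n_rows` rows, only the start of the file is read, which matters for files of this size.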