The problem is as follows: a plant produces plastic film. The process is very common and relatively simple. Roughly speaking, propylene granules are poured in, melted, drawn, wound into a large cylinder (pictured in the attached documents), rolled out, and cut. A more detailed description of the process is easy to find online.

The issue is that the film sometimes breaks. I would like to investigate how these breaks depend on the production settings and on the recipes.

The link contains telemetry tags from the extruder equipment covering one year. The attachment describes these tags in Russian. Film breakage can be identified from the “thickness” tag (ST110_VAREx_0_SDickeIst).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
import pandas_profiling
import pandas_summary as ps

# Data processing, metrics and modeling
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD, PCA
from sklearn.impute import SimpleImputer

# Lgbm
import lightgbm as lgb

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Plots
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rcParams

# Others
import shap
import datetime
from tqdm import tqdm_notebook
import sys
import pickle
import re
import json
import gc

pd.set_option('display.max_columns', 5000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.width', 1000)
pd.set_option('use_inf_as_na', True)

warnings.simplefilter('ignore')
matplotlib.rcParams['figure.dpi'] = 100
sns.set()
%matplotlib inline

In [None]:
folder = '/kaggle/input/find-a-defect-in-the-production-extrusion-line/'
stats_df = pd.read_csv(folder + 'stat.csv', sep=',')
full_df = pd.read_csv(folder + 'extrusion.csv', sep=',')

In [None]:
stats_df.shape, full_df.shape

In [None]:
stats_df.head()

In [None]:
full_df.head()

In [None]:
full_df.tail()

In [None]:
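# Parse the timestamps and use them as the index ('Datum' also stays available as a column)
full_df['Datum'] = pd.to_datetime(full_df['Datum'])
full_df = full_df.reset_index()  # keeps the original row number in an 'index' column
full_df.index = full_df['Datum']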

In [None]:
full_df.head()

In [None]:
dfs = ps.DataFrameSummary(full_df)
dfs.summary()

## Defect statistics. Let's look at the distribution of the target.

In [None]:
stats_df[(stats_df['Tags'] == 'ST110_VAREx_0_SDickeIst')]

### Production is almost continuous. Downtime usually follows a film break or has other causes (the other causes are not relevant to this task).
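
To get a feel for how downtime is distributed, one option is to group consecutive zero-thickness readings into episodes. The sketch below is only an illustration under assumptions: it treats a reading of exactly 0 as downtime, it runs on made-up toy data rather than on `full_df`, and `downtime_episodes` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def downtime_episodes(thickness: pd.Series) -> pd.DataFrame:
    """Return one row per contiguous run of zero-thickness readings."""
    is_down = thickness.eq(0)
    # A new run starts whenever the down/up state differs from the previous row.
    run_id = is_down.ne(is_down.shift()).cumsum()
    rows = []
    for _, run in thickness[is_down].groupby(run_id[is_down]):
        rows.append({"start": run.index[0], "end": run.index[-1], "n_rows": len(run)})
    return pd.DataFrame(rows, columns=["start", "end", "n_rows"])


# Toy data standing in for full_df['ST110_VAREx_0_SDickeIst'] (values are made up).
idx = pd.date_range("2020-01-01", periods=8, freq="min")
toy = pd.Series([30.0, 31.0, 0.0, 0.0, 0.0, 29.5, 0.0, 30.2], index=idx)

episodes = downtime_episodes(toy)
print(episodes)
```

On the real data the same helper could be called as `downtime_episodes(full_df['ST110_VAREx_0_SDickeIst'])`, assuming the frame is sorted by time.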

In [None]:
full_df['ST110_VAREx_0_SDickeIst'].hist(bins=40);

In [None]:
full_df[full_df['ST110_VAREx_0_SDickeIst'] < 50]['ST110_VAREx_0_SDickeIst'].hist(bins=40);

In [None]:
full_df[(full_df['ST110_VAREx_0_SDickeIst'] < 28)]['ST110_VAREx_0_SDickeIst'].shape[0]

In [None]:
full_df[(full_df['ST110_VAREx_0_SDickeIst'] > 0)].shape[0]

### Let's make a binary target variable. Since production is continuous, the thickness metric often shows a film break for several consecutive periods. To avoid overfitting, we will use a much smaller share of the data for training than for validation; more on that later.

In [None]:
full_df['ST110_VAREx_0_SDickeIst'].apply(lambda x: 1 if x == 0 else 0).value_counts(normalize=True)
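
Because a single break can span several consecutive rows, the number of positive rows overstates the number of break events. A minimal sketch of the difference, on made-up toy values and assuming (as in the cell above) that a thickness of exactly 0 marks a break; the names `target` and `onsets` are illustrative only:

```python
import numpy as np
import pandas as pd

# Toy readings standing in for the thickness tag (values are made up).
thickness = pd.Series([30.0, 0.0, 0.0, 29.0, np.nan, 0.0, 31.0])

# Vectorized equivalent of .apply(lambda x: 1 if x == 0 else 0);
# missing values map to 0 in both versions.
target = thickness.eq(0).astype(int)

# A break onset is a row where the target is 1 and the previous row was 0.
onsets = target.eq(1) & target.shift(fill_value=0).eq(0)

print("rows flagged as break:", int(target.sum()))
print("distinct break events:", int(onsets.sum()))
```

Counting onsets rather than rows gives a more honest picture of how many independent positive examples the data contains.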

In [None]:
full_df['ST110_VAREx_0_SDickeIst'].apply(lambda x: 1 if x == 0 else 0).hist();

In [None]:
full_df[['ST110_VAREx_0_SDickeIst']].describe()
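
The idea of training on a smaller share of the data than is kept for validation could look like the time-ordered split below. This is a sketch under assumptions, not the notebook's actual split: `time_ordered_split` is a hypothetical helper, the 0.25 fraction is an arbitrary illustrative value, and the frame is toy data. Splitting by time rather than at random keeps rows from the same break episode on one side of the cut.

```python
import pandas as pd


def time_ordered_split(df: pd.DataFrame, train_fraction: float):
    """Split a time-indexed frame into an early train part and a later validation part."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    df = df.sort_index()
    cut = int(len(df) * train_fraction)
    return df.iloc[:cut], df.iloc[cut:]


# Toy frame standing in for full_df (values are made up).
idx = pd.date_range("2020-01-01", periods=20, freq="min")
toy_df = pd.DataFrame({"target": [0, 0, 1, 1, 0] * 4}, index=idx)

# 0.25 is an illustrative choice; the source does not state the proportion.
train_df, valid_df = time_ordered_split(toy_df, train_fraction=0.25)
print(len(train_df), len(valid_df))
```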