In [None]:
from supportLibrary import *
from IPython.display import display
import ipywidgets as widgets
import pandas as pd

## Identify Significant Columns

Identify the Device ID and Time columns from the dataset.

#### Device ID
This column identifies which device recorded the datapoint.

#### Time
This column records how much time remains before the device malfunctions.

In [None]:
df = pd.read_csv('current2.csv')
deviceIDColumnName, timeColumnName, ui = getDeviceCountsConfiguration(df)
display(ui)

## Calculate Statistical Significance

An adequate sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample. In order to extrapolate the findings of this exploration onto a larger population, the dataset must contain an adequate number of sampled devices.
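
The internals of `calculateSampleSize` are not shown in this notebook. As a rough illustration of how a minimum device count can be derived from a confidence level and a population size, the sketch below uses Cochran's formula with a finite population correction. The function name `minimum_sample_size` and the `margin_of_error` and `proportion` defaults are assumptions made for this example and may not match what the support library does.

```python
import math
from statistics import NormalDist


def minimum_sample_size(population_size, confidence_level=0.95,
                        margin_of_error=0.05, proportion=0.5):
    """Illustrative sample-size estimate (Cochran's formula, finite population).

    margin_of_error and proportion are assumed defaults, not values taken
    from the notebook's support library.
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Required sample size for an effectively infinite population.
    n0 = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    # Finite population correction: smaller populations need fewer samples.
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)


print(minimum_sample_size(10_000))
print(minimum_sample_size(100))
```

Using `proportion=0.5` is the conservative choice, since it maximizes the required sample size when nothing is known about the population.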

In [None]:
confidenceLevel, populationSize, ui = getConfidenceLevelAndPopulation()
display(ui)

In [None]:
numDevices, minimumDevicesNeeded, maximumPopulationSize = calculateSampleSize(df, deviceIDColumnName.value, confidenceLevel.value, populationSize.value)
print('Number of Devices in Dataset: {}'.format(numDevices))
print('Minimum Number of Devices Needed to Represent a Population of {} Devices: {}'.format(populationSize.value, minimumDevicesNeeded))
print('Maximum Population Which {} Devices Could Represent: {}'.format(numDevices, maximumPopulationSize))

## Inspect the Number of Datapoints per Device

Visualizing the number of datapoints per device helps to understand the range of datapoint counts and to identify any outlier devices with significantly more or fewer datapoints than the others.
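
The same counts can also be inspected numerically. The sketch below is one possible approach, not the implementation behind `displayDeviceCounts`: it counts rows per device with `groupby` and flags outliers with Tukey's 1.5 × IQR rule. The toy data, the column name `device`, and the choice of outlier rule are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: five devices with similar counts and one with far more.
rows_per_device = {'d1': 10, 'd2': 11, 'd3': 9, 'd4': 10, 'd5': 12, 'd6': 40}
toy = pd.DataFrame({
    'device': [d for d, n in rows_per_device.items() for _ in range(n)]
})

# Number of datapoints recorded by each device.
counts = toy.groupby('device').size()

# Tukey's rule: flag counts more than 1.5 * IQR outside the quartiles.
q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
iqr = q3 - q1
outliers = counts[(counts < q1 - 1.5 * iqr) | (counts > q3 + 1.5 * iqr)]

print(counts.describe())
print(outliers)
```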

In [None]:
displayDeviceCounts(df, deviceIDColumnName.value, timeColumnName.value)

## Understanding Data Through Histograms

Visualizing feature value distributions helps identify the range and spread of the recorded features.
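
For a quick look at a distribution without plotting, the bin counts behind a histogram can be computed directly. This is a minimal sketch using `numpy.histogram` on made-up values. The column name `sensor` and the bin count are assumptions, and `displayFeatureHistograms` may bin the data differently.

```python
import numpy as np
import pandas as pd

# Hypothetical feature values; 'sensor' is a placeholder column name.
toy = pd.DataFrame({'sensor': [0.1, 0.4, 0.5, 0.5, 0.6, 0.9, 1.0, 2.0]})

# Split the observed range into 4 equal-width bins and count values per bin.
bin_counts, bin_edges = np.histogram(toy['sensor'], bins=4)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], bin_counts):
    print('[{:.3f}, {:.3f}): {}'.format(left, right, count))
```

An empty bin between populated ones, as in this toy data, is the kind of gap that suggests an isolated extreme value worth a closer look.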

In [None]:
histCheckboxList, ui = getFeaturesToShow(df, deviceIDColumnName.value, timeColumnName.value)
display(ui)

In [None]:
displayFeatureHistograms(df, histCheckboxList)

## Understanding Data Through Feature Plots

Visualizing how feature values change over time helps identify features that potentially add no information to the system, such as features containing too much noise or features that are identical across all devices.
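
Some of these uninformative features can also be flagged programmatically. The sketch below is an illustrative check, not part of the support library: it looks for features that never vary and for features whose value at each time step is the same on every device. The toy data and all column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: two devices observed at the same three time steps.
toy = pd.DataFrame({
    'device': ['a', 'a', 'a', 'b', 'b', 'b'],
    'time': [3, 2, 1, 3, 2, 1],
    'constant': [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
    'same_everywhere': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    'informative': [0.9, 0.5, 0.1, 0.8, 0.6, 0.3],
})
feature_cols = [c for c in toy.columns if c not in ('device', 'time')]

# Features that hold a single value across the whole dataset.
constant = [c for c in feature_cols if toy[c].nunique() == 1]

# Features that take exactly one value per time step, regardless of device.
unique_per_time = toy.groupby('time')[feature_cols].nunique()
identical_across_devices = [
    c for c in feature_cols if (unique_per_time[c] == 1).all()
]

print(constant)
print(identical_across_devices)
```

A constant feature is necessarily identical across devices, so it appears in both lists. Noisy features are harder to detect automatically and are usually easier to judge from the plots.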

In [None]:
plotCheckboxList, numDevicesToShow, ui = getFeaturePlotsConfiguration(df, deviceIDColumnName.value, timeColumnName.value)
display(ui)

In [None]:
displayFeaturePlots(df, deviceIDColumnName.value, timeColumnName.value, plotCheckboxList, numDevicesToShow.value)

## Save Dataset for Next Step

In [None]:
df2 = df.rename(columns = {deviceIDColumnName.value: 'deviceID', timeColumnName.value: 'time'})
df2.to_csv("current3.csv", index=False)