# EDA
Start with this notebook before moving on to Exploration.ipynb.

This notebook provides a semi-guided exploration of a data set. The purpose is to familiarize yourself with the data, brush up on some common EDA techniques, and prepare for the more open-ended and self-directed task that follows.

In [1]:
# use the following OneDrive link to download a copy of this file to your device
# either save the downloaded file to the path indicated below or update the path accordingly
# https://perkinelmer-my.sharepoint.com/:x:/p/bryan_romas/EQGG-RAMTVxIhZi4XItMdb4BUyV40MW4JD4RnYF2KyWcaQ?e=FiyVOx

COPPER_FILE = 'data/copper_production_v1.csv'

This file contains data from a factory that manufactures copper wire, specifically counts of two types of failures (Cable Failures & Other Failures) and the related amount of downtime captured in minutes (Cable Failure Downtime & Other Failure Downtime).

Overall, the goal is to identify the primary source(s) of failures and downtimes and, if possible, to assess the certainty of those conclusions.  

## Load COPPER_FILE & describe the contents (the number of records, average values, etc.)

In [2]:
import pandas as pd

In [3]:
# load COPPER_FILE
copper_df = None
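
As a minimal sketch of the load-and-describe step, the cell below reads a small in-memory CSV instead of COPPER_FILE. The column names and values are assumptions based on the description above; the real file may differ.

```python
import io

import pandas as pd

# Hypothetical stand-in for COPPER_FILE; the column names and values are assumptions.
sample_csv = io.StringIO(
    "Machine,Shift,Operator,Date,Cable Failures,Other Failures,"
    "Cable Failure Downtime,Other Failure Downtime\n"
    "M1,A,101,2021-01-04,2,1,30,10\n"
    "M2,B,102,2021-01-04,0,3,0,45\n"
    "M1,A,101,2021-01-05,1,0,15,0\n"
)

sample_df = pd.read_csv(sample_csv)

n_records = len(sample_df)      # number of rows in the data
summary = sample_df.describe()  # count, mean, std, min, quartiles, max per numeric column

print(f"{n_records} records")
print(summary)
```

With the real file, `pd.read_csv(COPPER_FILE)` would take the place of the in-memory sample.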

## Check for & remove exact duplicates

In [4]:
# a duplicate is defined as two or more records with the same values for all columns in the data
# report the number of duplicates that were removed

In [5]:
# remove the duplicates
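
One possible approach, sketched on a toy frame with assumed column names: `duplicated()` flags rows that repeat an earlier row across all columns, and `drop_duplicates()` keeps the first occurrence of each.

```python
import pandas as pd

# Toy data (hypothetical values); rows 0 and 1 are exact duplicates.
toy_df = pd.DataFrame({
    "Machine": ["M1", "M1", "M2", "M1"],
    "Shift": ["A", "A", "B", "A"],
    "Cable Failures": [2, 2, 0, 3],
})

# Count rows that repeat an earlier row (all columns are compared by default).
n_duplicates = int(toy_df.duplicated().sum())

# Keep the first occurrence of each distinct row.
deduped_df = toy_df.drop_duplicates().reset_index(drop=True)

print(f"Removed {n_duplicates} duplicate record(s); {len(deduped_df)} remain.")
```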

## Combine multiple records

In [6]:
# there shouldn't be more than one record per Machine-Shift-Operator-Date combination
# combine any such records so the resulting record contains the sums of the remaining columns
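
A sketch of one way to combine such records, assuming the key columns are named Machine, Shift, Operator, and Date: group by the keys and sum everything else.

```python
import pandas as pd

# Toy data (hypothetical values); the first two rows share the same key combination.
toy_df = pd.DataFrame({
    "Machine": ["M1", "M1", "M2"],
    "Shift": ["A", "A", "B"],
    "Operator": [101, 101, 102],
    "Date": ["2021-01-04", "2021-01-04", "2021-01-04"],
    "Cable Failures": [2, 1, 0],
    "Cable Failure Downtime": [30, 15, 0],
})

key_cols = ["Machine", "Shift", "Operator", "Date"]

# as_index=False keeps the key columns as regular columns in the result.
combined_df = toy_df.groupby(key_cols, as_index=False).sum()

print(combined_df)
```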

## Load & process OPERATOR_FILE

In [7]:
# similar process to COPPER_FILE
# https://perkinelmer-my.sharepoint.com/:t:/p/bryan_romas/EQLgOytFSANKtmt1zKFF7tMBaiLXru6Blscn8kqlyr7t2w?e=M5E5cb

OPERATOR_FILE = 'data/operator_names.txt'

This file contains a mapping between operator numbers and names. In general, the contents are formatted in the following manner: {operator number}{first name}{last name}.

Overall, the goal is to load the file; extract the operator number, first name, and last initial; and merge or map those names into copper_df so you can reference the operators' first names and last initials instead of their numbers in the subsequent analysis.

Ideally, you'll load the file and extract the necessary content by completing the function definition below (get_operator_names). If this proves too challenging, you may take any other programmatic approach to accomplish the goal. Just don't manually copy the file contents into the notebook or edit the file to make the task easier.
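
The extraction and mapping steps might be sketched as follows. The sample lines and the regex are assumptions about the file's layout (digits, then a capitalized first name, then a capitalized last name, with optional whitespace), so the pattern will likely need adjusting once you inspect the real file. The `Operator` column name in the toy copper frame is also hypothetical.

```python
import re

import pandas as pd

# Hypothetical lines; the real file may contain other irregularities.
sample_lines = ["101JohnSmith", "102 Mary Jones", "not an operator line"]

# number, first name, and only the first letter of the last name
pattern = re.compile(r"(\d+)\s*([A-Z][a-z]+)\s*([A-Z])")

rows = []
for line in sample_lines:
    match = pattern.search(line)
    if match is None:
        continue  # skip lines that don't fit the assumed layout
    number, first_name, last_initial = match.groups()
    rows.append({
        "Operator Number": int(number),
        "First Name": first_name,
        "Last Initial": last_initial,
    })

parsed_df = pd.DataFrame(rows, columns=["Operator Number", "First Name", "Last Initial"])

# Map the names onto a toy stand-in for copper_df.
toy_copper = pd.DataFrame({"Operator": [101, 102, 101]})
merged = toy_copper.merge(
    parsed_df, left_on="Operator", right_on="Operator Number", how="left"
)
merged["Operator Name"] = merged["First Name"] + " " + merged["Last Initial"] + "."

print(merged[["Operator", "Operator Name"]])
```

A left merge keeps every production record even if an operator number has no match, which makes unmatched operators easy to spot as missing names.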

In [8]:
# load & process OPERATOR_FILE
# HINT: you may want to use regex here
import re

def get_operator_names(operator_file: str)->pd.DataFrame:
    """
    Load & process the contents of a file containing operator details.
    
    Params:
    operator_file - str; the path to the operator file.
    
    Returns:
    operator_df - pd.DataFrame; the extracted content of the operator file. The
        columns should include Operator Number, First Name, & Last Initial.
    """
    assert isinstance(operator_file, str), 'operator_file is not of type str.'
    
    # create an empty df to fill with contents of operator_file
    operator_df = pd.DataFrame(columns=['Operator Number','First Name','Last Initial'])
    
    return operator_df

In [9]:
operator_df = get_operator_names(OPERATOR_FILE)
operator_df

Unnamed: 0,Operator Number,First Name,Last Initial


## How many unique Machines, Shifts, & Operator Names are included in the data? Plot the proportion of the total for each.

Below is an example of the first plot you'll create.

![image](img/machine_proportions.PNG)

In [10]:
import matplotlib.pyplot as plt
import seaborn as sns

# a simple theme that's better than the default
sns.set_theme(
    context='notebook',
    style='darkgrid', 
    palette='husl',
    rc={"figure.figsize":(24, 8)}
)

### Machines

In [11]:
# get the total number of unique Machines in the data as well as the proportion of the total (i.e. normalized count)

In [12]:
# plot each Machine's proportion of the total
# HINT: for a simple bar chart you can call .plot() on a pd.Series or pd.DataFrame
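
A minimal sketch of the count-then-plot pattern on toy data (the values are made up): `nunique()` gives the number of distinct values, and `value_counts(normalize=True)` gives each value's share of the records.

```python
import pandas as pd

# Toy data (hypothetical values).
toy_df = pd.DataFrame({"Machine": ["M1", "M1", "M2", "M3"]})

n_machines = toy_df["Machine"].nunique()

# normalize=True returns proportions that sum to 1 instead of raw counts.
machine_props = toy_df["Machine"].value_counts(normalize=True).sort_index()

# Calling .plot() on the Series returns a matplotlib Axes.
ax = machine_props.plot(kind="bar", title="Proportion of records by Machine")
ax.set_xlabel("Machine")
ax.set_ylabel("Proportion of total")

print(n_machines)
print(machine_props)
```

The same pattern should carry over to Shifts and Operator Names by swapping the column.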

### Shifts

In [13]:
# get the total number of unique Shifts in the data as well as the proportion of the total (i.e. normalized count)

In [14]:
# plot each Shift's proportion of the total, same idea as above

### Operator Names

In [15]:
# get the total number of unique Operator Names in the data as well as the proportion of the total (i.e. normalized count)

In [16]:
# plot each Operator's proportion of the total, again the same as above

## How many dates are included in the data?

In [17]:
# get the total number of Dates in the data as well as how many records exist for each

### Are any dates within the range of the data missing? If so, which ones, and how could this be addressed?

In [18]:
# identify if any dates are missing in the data
# if so, simply explain how the missing data could be addressed; there's no need to solve it in code here
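
One way to find gaps, sketched on made-up dates: build the full daily range between the earliest and latest date, then take the set difference against the dates actually observed. Daily frequency is an assumption here; a factory that doesn't run on weekends would call for a different calendar.

```python
import pandas as pd

# Toy dates (hypothetical); 2021-01-06 is absent.
toy_dates = pd.to_datetime(
    pd.Series(["2021-01-04", "2021-01-04", "2021-01-05", "2021-01-07"])
)

n_dates = toy_dates.nunique()                             # distinct dates
records_per_date = toy_dates.value_counts().sort_index()  # records for each date

# Every calendar day between the first and last observed date.
full_range = pd.date_range(toy_dates.min(), toy_dates.max(), freq="D")

# Days in the full range that never appear in the data.
missing_dates = full_range.difference(pd.DatetimeIndex(toy_dates))

print(n_dates)
print(missing_dates)
```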

## Create 'Day of Week' column

In [19]:
# insert a new column indicating the day of the week e.g., Monday
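
A short sketch, assuming the Date column has already been converted to datetime: the `.dt.day_name()` accessor returns the weekday name for each row.

```python
import pandas as pd

# Toy data (hypothetical values); pd.to_datetime ensures a datetime64 column.
toy_df = pd.DataFrame({"Date": pd.to_datetime(["2021-01-04", "2021-01-05"])})

toy_df["Day of Week"] = toy_df["Date"].dt.day_name()

print(toy_df)
```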

## Create 'Total Failures' & 'Total Downtime' columns

In [20]:
# insert two new columns to calculate the simple sum of failures & downtime for a given record

## Plot the sum of downtimes by 'Date'

Below is an example of this plot.

![image](img/downtimes_by_date.PNG)

In [21]:
# prep the data, calculate sums
# HINT: you may want to group the data by Date & use an aggregation function like .sum()

In [22]:
# line chart with 'Date' along the x-axis (in ascending chronological order), minutes along the y-axis, & three lines: Cable Failure Downtime, Other Failure Downtime, & Total Failure Downtime
# HINT: you'll likely want to call plt.plot() once for each line you're plotting 
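
The two steps above might look like the following on toy data. The column names follow the earlier sections, but the values are made up, and the Total Downtime column is computed inline so the sketch stands on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (hypothetical values), deliberately out of chronological order.
toy_df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-05", "2021-01-04", "2021-01-04"]),
    "Cable Failure Downtime": [15, 30, 0],
    "Other Failure Downtime": [0, 10, 45],
})
toy_df["Total Downtime"] = (
    toy_df["Cable Failure Downtime"] + toy_df["Other Failure Downtime"]
)

downtime_cols = ["Cable Failure Downtime", "Other Failure Downtime", "Total Downtime"]

# Sum each downtime column per Date; sort_index() puts the dates in ascending order.
daily_sums = toy_df.groupby("Date")[downtime_cols].sum().sort_index()

# One plt.plot() call per line, as the hint suggests.
plt.figure()
for col in downtime_cols:
    plt.plot(daily_sums.index, daily_sums[col], label=col)
plt.xlabel("Date")
plt.ylabel("Downtime (minutes)")
plt.legend()
```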

## Plot the mean downtime by 'Day of Week'

In [23]:
# prep the data, calculate means

In [24]:
# line chart with 'Day of Week' along the x-axis, minutes along the y-axis, & three lines: Cable Failure Downtime, Other Failure Downtime, & Total Failure Downtime
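
Grouping by weekday name sorts alphabetically by default, which scrambles the x-axis. A sketch of one workaround, on made-up values, is to store the names as an ordered categorical so the groups come out Monday through Sunday.

```python
import pandas as pd

# Toy data (hypothetical values): two Mondays, one Wednesday, one Tuesday.
toy_df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-04", "2021-01-06", "2021-01-11", "2021-01-05"]),
    "Total Downtime": [30, 20, 10, 5],
})

day_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
]

# An ordered categorical makes groupby sort by calendar order, not alphabetically.
toy_df["Day of Week"] = pd.Categorical(
    toy_df["Date"].dt.day_name(), categories=day_order, ordered=True
)

# observed=True drops weekdays that never occur in the data.
mean_by_day = toy_df.groupby("Day of Week", observed=True)["Total Downtime"].mean()

print(mean_by_day)
```

The resulting means can then be plotted with the same line-chart approach used for the by-date plot.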

# End