# Data Science Project: Investigation into Medical Device Recalls
* Dane Lacey, Sidney Washburn, Shravan Parthasarathy

### Basic Info

According to the FDA's Establishment Registration - 21 CFR Part 807, "Manufacturers (both domestic and foreign) and initial distributors (importers) of medical devices must register their establishments with the FDA." That is to say, companies that wish to manufacture and sell medical devices in US markets must register their establishments with the Food and Drug Administration.

This means that the FDA has extensive data on medical devices and, more importantly here, on their recalls. Recalls are an "effective method for removing or correcting marketed products, their labeling, and/or promotional literature that violate the laws administered by the Food and Drug Administration"$^1$. For example, a device may be recalled if a certain wire has been observed to fray after a month's use, or if a gasket seal has been leaking consistently. We have set out to analyze this data in detail and to quantify the responsiveness of medical device manufacturers in US markets through a variety of measures.

<sub>$^1$ According to the FDA Regulatory Procedures Manual, Ch. 7, August 2018</sub>

### Background and Motivation

When purchasing a medical device -- say, a pacemaker -- the company's history of responsiveness is of immediate interest. One would not wish to purchase any kind of medical device from a company that responds slowly, or not at all, when one of its devices is recalled.

### Project Objectives

Our goal is to find statistically significant correlations among attributes of recalled medical devices, such as device class, country of origin, recall reason, and recall termination date, in order to provide a reliability measure for each medical device manufacturer. This will show which companies act quickly when a device is recalled, that is, how "reliable" they are.
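
As a rough illustration of what such a measure could look like, the sketch below computes the median number of days between the start of a recall and its termination for each manufacturer. This is only one possible definition, not a settled method: the record layout, the company names, and the function `median_days_to_termination` are all hypothetical, and the rows are made up rather than taken from FDA data.

```python
from datetime import date
from statistics import median

# Hypothetical recall records: (manufacturer, recall start, termination date or None).
# These rows are made up for illustration; they are not FDA data.
recalls = [
    ("Acme Medical", date(2011, 3, 1), date(2011, 9, 1)),
    ("Acme Medical", date(2012, 1, 10), date(2012, 4, 10)),
    ("Beta Devices", date(2010, 6, 15), date(2013, 6, 15)),
    ("Beta Devices", date(2013, 2, 1), None),  # recall never terminated
]


def median_days_to_termination(records):
    """Return {manufacturer: median days from recall start to termination}.

    Open recalls (no termination date) are skipped here; a fuller analysis
    would need to decide how to account for them.
    """
    durations = {}
    for manufacturer, started, terminated in records:
        if terminated is None:
            continue
        durations.setdefault(manufacturer, []).append((terminated - started).days)
    return {name: median(days) for name, days in durations.items()}


summary = median_days_to_termination(recalls)
print(summary)
```

Skipping open recalls flatters companies that never terminate a recall, so a real measure would likely need to treat missing termination dates explicitly.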

### Data

We will analyze .csv tables of medical devices recalled from 2010 to 2014, obtained from the FDA's website$^2$. We will also use an auxiliary .csv table from the same website to determine company attributes such as country of origin and number of factories.

<sub>$^2$ https://www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/Databases/default.htm</sub>

### Ethical Considerations

Considerations of this type will become more apparent as the project progresses. Clearly, this project will not put some companies "in the best light". This is a concern because we do not wish to defame companies, only to provide useful data to the consumer.

Another concern is the timeline. We must consider data from several years back, since companies need sufficient time to address (or fail to address) a recall. Based on a preliminary look at the data, five years appears sufficient for a company to record a "Termination Date" in its product recall data.

### Data Processing

Our project deals with a large amount of data, but not more than is manageable in Python. Because the data is obtained directly from the FDA website, it is mostly clean as is, but it does require some cleaning of erroneous and duplicate entries in the .csv tables. The numerous .csv tables (a consequence of a limitation of the FDA's website) will need to be collected into a single dataframe before such cleaning can take place.
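
A minimal sketch of that collection and cleaning step, using pandas, might look like the following. The two in-memory tables stand in for downloaded files, and the column names (`recall_number`, `manufacturer`, `device_class`) and rows are assumptions made for illustration, not the actual FDA export schema.

```python
import io

import pandas as pd

# Two in-memory stand-ins for yearly exports; the column names and rows are
# hypothetical and only illustrate the shape of the problem.
csv_2010 = io.StringIO(
    "recall_number,manufacturer,device_class\n"
    "Z-0001-2010,Acme Medical,2\n"
    "Z-0002-2010,Beta Devices,3\n"
)
csv_2011 = io.StringIO(
    "recall_number,manufacturer,device_class\n"
    "Z-0002-2010,Beta Devices,3\n"  # duplicate of a 2010 row
    "Z-0003-2011,,1\n"  # missing manufacturer
)

# Read each table, then stack them into a single dataframe.
frames = [pd.read_csv(source) for source in (csv_2010, csv_2011)]
recalls = pd.concat(frames, ignore_index=True)

# Drop exact duplicate rows and rows with no manufacturer.
clean = (
    recalls.drop_duplicates()
    .dropna(subset=["manufacturer"])
    .reset_index(drop=True)
)
print(clean)
```

With real downloads, the `io.StringIO` objects would be replaced by file paths, which `pd.read_csv` also accepts. What counts as an "erroneous" entry will depend on the actual tables, so the `dropna` rule here is only a placeholder.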

### Data Analysis

Two data analysis techniques that seem fruitful here are decision trees (or a similar classification method) and linear regression. These techniques may lead to insights that would then require other analyses to further exploit.

For visualization purposes, we intend to comb the "Recall Reasons" in the .csv tables and compute the frequencies of words or phrases to form a word cloud for a given company or brand. We believe this type of visualization will lead to further insight, and to even more questions.
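
A word cloud needs a count for each word, so the first step could be sketched as below. The recall-reason strings, the stop-word list, and the function `word_frequencies` are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical recall-reason strings; not taken from the FDA tables.
reasons = [
    "Battery may fail prematurely",
    "Software error may cause incorrect dose",
    "Battery connector may overheat",
]

# A tiny, assumed stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"may", "the", "a", "of", "to", "cause"}


def word_frequencies(texts):
    """Count lowercase words across all texts, ignoring stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


frequencies = word_frequencies(reasons)
print(frequencies.most_common(3))
```

Running the same function on the reasons for a single company or brand would give the per-company counts that a word-cloud tool can then render.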

### Analysis Methodology

It is difficult to detail our analysis methodology exactly before we have had the opportunity to begin the analysis. So far, we have decided to build a decision tree based on the locale of the manufacturers, and to use linear regression on the device classes.
For instance, we would develop a decision tree such that someone looking for a medical device could make a decision based initially on where the device is manufactured; further splits would be dictated by any trends or correlations observed in the datasets.
Additionally, we would try to develop (multi-)linear correlations between the device class (which is an indicator of the device's potential risk) and the number of recalls, which companies have the most or fewest recalls, and potentially the category of the device.

### Project Schedule
- Week 1, 3/3-3/9
    - Clean data from downloaded .csv files
    - Get/give peer feedback 3/7
- Week 2, 3/10-3/16
    - Start initial analysis
        - Branch off analysis based on fruitful paths
    - Get feedback from staff 3/10
- Week 3, 3/17-3/23
    - Continue analysis
        - Begin to classify companies
- Week 4, 3/24-3/30
    - Continue analysis
    - Begin visualization
- Week 5, 3/31-4/6
    - Submit project milestone 3/31
    - Get personal staff feedback 4/1-4/5
- Week 6, 4/7-4/13
    - Continue analysis
    - Continue visualization
- Week 7, 4/14-4/20
    - Assess project goals
    - Finish project write up
    - Make video
- Week 8, 4/21-4/27
    - Submit final project 4/21
    - Project awards 4/23