# Product Injury
<font size=4 color='blue'>Understand and Prep Data - Codes</font>   
***  

**Project Summary:**   
The Consumer Product Safety Commission operates a surveillance system (NEISS) to track injury data related to consumer products. The data is collected from a representative sample of emergency rooms in the United States. 
This project will examine the data from 2013 through 2022 to explore trends in product injuries resulting in emergency room visits.

**Notebook Scope:**  
This notebook includes code to review and prep the lookup codes used in the NEISS data files. The data is pulled from each annual injury data file, which includes a lookup code worksheet. 

**Output:**  
An Excel file containing a scrubbed lookup table that will be used for analysis of NEISS data.
***  

***
# Notebook Setup
***

In [1]:
# Import libraries
import pandas as pd
from datetime import datetime

***  
# Load Data
***

In [2]:
# Read in the NEISS_FMT worksheet from each annual injury data Excel file
path = '../_data/'
files = ['neiss' + str(x) + '.xlsx' for x in range(2013, 2023)]
all_codes_df = pd.DataFrame()

print(f'Load start time:  {datetime.now().strftime("%H:%M:%S")}')
for file in files:
    raw_df = pd.read_excel(path + file, sheet_name='NEISS_FMT')
    # Tag each row with the four-digit year taken from 'neissYYYY.xlsx'
    raw_df.insert(0, 'Year', file[-9:-5])
    all_codes_df = pd.concat([all_codes_df, raw_df])
print(f'Load finish time:  {datetime.now().strftime("%H:%M:%S")}')

Load start time:  09:05:14
Load finish time:  09:06:13
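
As a side note on the load loop, a common alternative is to collect the per-year frames in a list and concatenate once at the end. The sketch below is only illustrative: `combine_code_sheets` and the two tiny in-memory frames are hypothetical stand-ins for the real worksheets, so it runs without the Excel files.

```python
import pandas as pd

def combine_code_sheets(frames_by_year):
    """Stack per-year lookup sheets into one frame with a leading Year column."""
    tagged = []
    for year, frame in frames_by_year.items():
        tagged_frame = frame.copy()
        tagged_frame.insert(0, 'Year', str(year))
        tagged.append(tagged_frame)
    # A single concat at the end avoids re-copying a growing frame on every pass
    return pd.concat(tagged, ignore_index=True)

# Made-up rows standing in for two annual NEISS_FMT worksheets
demo_frames = {
    2013: pd.DataFrame({'Format name': ['GENDER'], 'Format value label': ['MALE']}),
    2014: pd.DataFrame({'Format name': ['GENDER'], 'Format value label': ['FEMALE']}),
}
demo_codes_df = combine_code_sheets(demo_frames)
print(demo_codes_df)
```

Passing `ignore_index=True` also produces a clean 0..n-1 index, so a separate reset step would not be needed in this variant.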


In [3]:
# Reset index
all_codes_df.reset_index(drop=True, inplace=True)

***
# Preview Data
***

In [4]:
# Preview the dataframe
all_codes_df.head()

Unnamed: 0,Year,Format name,Starting value for format,Ending value for format,Format value label
0,2013,AGELTTWO,0,0,UNK
1,2013,AGELTTWO,2,120,2 YEARS AND OLDER
2,2013,AGELTTWO,201,201,1 MONTH
3,2013,AGELTTWO,202,202,2 MONTHS
4,2013,AGELTTWO,203,203,3 MONTHS


In [5]:
# Drop rows and columns that consist only of NaN data
all_codes_df.dropna(axis = 0, how = 'all', inplace=True)
all_codes_df.dropna(axis = 1, how = 'all', inplace=True)

In [6]:
# View shape of the dataframe
all_codes_df.shape

(12483, 5)

***  
# Variables
*** 

In [7]:
# Display variables (column headings) for the dataframe
all_codes_df.columns.values

array(['Year', 'Format name', 'Starting value for format',
       'Ending value for format', 'Format value label'], dtype=object)

***
**Variable Descriptions**  
-- Year: Indicates which annual injury data file the row was read from  
-- Format name: Code label  
-- Starting value for format: Starting value for the code label  
-- Ending value for format: Ending value for the code label    
-- Format value label: Description for code/label combination
***
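
To make the starting/ending value columns concrete, here is a minimal sketch of how a raw code could be matched to its label. It assumes each range is inclusive at both ends; `label_for` is a hypothetical helper, and the rows are copied from the AGELTTWO preview above.

```python
def label_for(value, ranges):
    """Return the label whose inclusive [start, end] range contains value, else None."""
    for start, end, label in ranges:
        if start <= value <= end:
            return label
    return None

# (starting value, ending value, label) rows from the AGELTTWO preview
age_ranges = [
    (0, 0, 'UNK'),
    (2, 120, '2 YEARS AND OLDER'),
    (201, 201, '1 MONTH'),
    (202, 202, '2 MONTHS'),
]

print(label_for(45, age_ranges))
print(label_for(202, age_ranges))
```

Most rows have identical starting and ending values, so they behave like a plain code/value pair.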

In [8]:
# Review Year data
all_codes_df['Year'].unique()

array(['2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020',
       '2021', '2022'], dtype=object)

In [9]:
# Review Format name data
all_codes_df['Format name'].unique()

array(['AGELTTWO', 'ALC_DRUG', 'BDYPT', 'DIAG', 'DISP', 'FIRE', 'GENDER',
       'HISP', 'LOC', 'PROD', 'RACE'], dtype=object)

In [10]:
# Review Starting value for format data
all_codes_df['Starting value for format'].unique()

array(['               0', '               2', '             201', ...,
       '            9999', '             714', '            1552'],
      dtype=object)

In [11]:
# Review Ending value for format data
all_codes_df['Ending value for format'].unique()

array(['               0', '             120', '             201', ...,
       '            9999', '             714', '            1552'],
      dtype=object)

In [12]:
# Review Format value label data
all_codes_df['Format value label'].unique()

array(['UNK', '2 YEARS AND OLDER', '1 MONTH', ...,
       '714 - COMBINATION FIRE/SMOKE ALARM AND CARBON MONOXIDE DETECTORS',
       'NON-BINARY/OTHER', '1552 - CRIBS, NONPORTABLE OR NOT SPECIFIED'],
      dtype=object)

***
## Remove Irrelevant Data
***

In [13]:
# Review codes that encompass a range of values to see if we can simplify to a single code/value pair instead of ranges
all_codes_df[all_codes_df['Starting value for format'] != all_codes_df['Ending value for format']]

Unnamed: 0,Year,Format name,Starting value for format,Ending value for format,Format value label
1,2013,AGELTTWO,2,120,2 YEARS AND OLDER
1249,2014,AGELTTWO,2,120,2 YEARS AND OLDER
2497,2015,AGELTTWO,2,120,2 YEARS AND OLDER
3745,2016,AGELTTWO,2,120,2 YEARS AND OLDER
4993,2017,AGELTTWO,2,120,2 YEARS AND OLDER
6241,2018,AGELTTWO,2,120,2 YEARS AND OLDER
7489,2019,AGELTTWO,2,120,2 YEARS AND OLDER
8737,2020,AGELTTWO,2,120,2 YEARS AND OLDER
9986,2021,AGELTTWO,2,120,2 YEARS AND OLDER
11235,2022,AGELTTWO,2,120,2 YEARS AND OLDER


In [14]:
# Our data analysis will focus on age in years only, so we can delete the Ending value for format and just use the starting value
all_codes_df.drop(columns='Ending value for format', inplace=True)

In [15]:
# Since ages of two years and older do not need translation, delete these rows
rows_to_del = all_codes_df[all_codes_df['Format value label'] == '2 YEARS AND OLDER'].index.to_list()
all_codes_df.drop(rows_to_del, inplace=True)

In [16]:
# We are not going to consider alcohol use, drug use, or fire involvement in our analysis
rows_to_del = all_codes_df[all_codes_df['Format name'].isin(['ALC_DRUG', 'FIRE'])].index.to_list()
all_codes_df.drop(index=rows_to_del, inplace=True)
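
The row removals in this section could also be expressed as a single boolean mask instead of collecting index lists. This is a sketch on a made-up four-row frame, not the notebook's actual data; the ALC_DRUG and FIRE labels are placeholders.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'Format name': ['AGELTTWO', 'ALC_DRUG', 'FIRE', 'GENDER'],
    'Format value label': ['2 YEARS AND OLDER', 'placeholder a', 'placeholder b', 'MALE'],
})

# True for every row that should be removed
drop_mask = (
    demo_df['Format value label'].eq('2 YEARS AND OLDER')
    | demo_df['Format name'].isin(['ALC_DRUG', 'FIRE'])
)

# Keep the complement of the mask and renumber the index
kept_df = demo_df[~drop_mask].reset_index(drop=True)
print(kept_df)
```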

In [17]:
# Reset index
all_codes_df.reset_index(drop=True, inplace=True)

***
## Rename Variables for Clarity
***

In [18]:
# Rename columns
all_codes_df.columns = ['Year', 'Code', 'Value', 'Description']
all_codes_df.head()

Unnamed: 0,Year,Code,Value,Description
0,2013,AGELTTWO,0,UNK
1,2013,AGELTTWO,201,1 MONTH
2,2013,AGELTTWO,202,2 MONTHS
3,2013,AGELTTWO,203,3 MONTHS
4,2013,AGELTTWO,204,4 MONTHS


***
## Data Types and Formats
***

In [19]:
# Review column data types
all_codes_df.dtypes

Year           object
Code           object
Value          object
Description    object
dtype: object

In [20]:
# Convert Year to int
all_codes_df['Year'] = all_codes_df['Year'].astype(int)

In [21]:
# Strip spaces from Value data and check for non-numeric values
all_codes_df['Value'] = all_codes_df['Value'].str.strip()
for x in all_codes_df['Value'].unique():
    if not x.isdigit():
        print(all_codes_df[all_codes_df['Value'] == x])

       Year  Code Value     Description
96     2013  HISP     .  NA before 2019
1335   2014  HISP     .  NA before 2019
2574   2015  HISP     .  NA before 2019
3813   2016  HISP     .  NA before 2019
5052   2017  HISP     .  NA before 2019
6291   2018  HISP     .  NA before 2019
7530   2019  HISP     .  NA before 2019
8769   2020  HISP     .  NA before 2019
10010  2021  HISP     .  NA before 2019
11250  2022  HISP     .  NA before 2019
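
The same non-numeric check can be done without a Python loop by using the vectorized string methods. A minimal sketch on a made-up Series that mimics the padded values seen earlier:

```python
import pandas as pd

# Padded strings like the raw Value column, plus one '.' placeholder
demo_values = pd.Series(['               0', '             201', '               .'])

stripped = demo_values.str.strip()
# Keep only entries that are not made up entirely of digits
non_numeric = stripped[~stripped.str.isdigit()]
print(non_numeric)
```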


In [22]:
# Convert description data to title case
all_codes_df['Description'] = all_codes_df['Description'].str.title()

In [23]:
# Delete rows with "." as a Value. These rows are essentially comments
rows_to_del = all_codes_df[all_codes_df['Value'] == '.'].index.to_list()
all_codes_df.drop(index=rows_to_del, inplace=True)

In [24]:
# Convert Value to int
all_codes_df['Value'] = all_codes_df['Value'].astype(int)

In [25]:
# Review column data types
all_codes_df.dtypes

Year            int32
Code           object
Value           int32
Description    object
dtype: object

***
# Missing Data
***  

In [26]:
# Look for columns with null values
all_codes_df.columns[all_codes_df.isnull().any()]

Index(['Description'], dtype='object')

In [27]:
# Delete rows that do not contain a description
rows_to_del = all_codes_df[all_codes_df['Description'].isnull()].index
all_codes_df.drop(rows_to_del, axis='index', inplace=True)

***
# Duplicate Records
***

In [28]:
# The same lookup rows repeat in each annual file, so keep only the first occurrence of each Code/Value/Description
all_codes_df.drop_duplicates(subset=['Code', 'Value', 'Description'], inplace=True)

In [29]:
# Look for duplicate code/value combinations that have different descriptions
all_codes_df[all_codes_df.duplicated(subset=['Code', 'Value'])]

Unnamed: 0,Year,Code,Value,Description


In [30]:
# Look for duplicate descriptions for different values by code
all_codes_df[all_codes_df.duplicated(subset=['Code', 'Description'])]

Unnamed: 0,Year,Code,Value,Description
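
Both duplicate checks came back empty here. For reference, this sketch shows what a conflict would look like on made-up data; passing `keep=False` to `duplicated` flags every row in a conflicting group rather than only the later ones.

```python
import pandas as pd

# Made-up rows: PROD 714 appears with two different descriptions
demo_df = pd.DataFrame({
    'Code': ['PROD', 'PROD', 'GENDER'],
    'Value': [714, 714, 1],
    'Description': ['Old Label', 'New Label', 'Male'],
})

# keep=False marks all members of each duplicated Code/Value group
conflicts = demo_df[demo_df.duplicated(subset=['Code', 'Value'], keep=False)]
print(conflicts)
```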


In [31]:
# Reset index
all_codes_df.reset_index(drop=True, inplace=True)

***
# Data Review and Cleanup
***

## Year
***

In [32]:
# Review rows that are not from 2013
all_codes_df[all_codes_df['Year'] > 2013]

Unnamed: 0,Year,Code,Value,Description
1236,2020,PROD,714,714 - Combination Fire/Smoke Alarm And Carbon ...
1237,2021,GENDER,3,Non-Binary/Other
1238,2021,PROD,1552,"1552 - Cribs, Nonportable Or Not Specified"


In [33]:
# Check if these are new, or updated
display(all_codes_df[(all_codes_df['Code'] == 'PROD') & (all_codes_df['Value'].isin([714, 1552]))])
display(all_codes_df[(all_codes_df['Code'] == 'GENDER') & (all_codes_df['Value'] == 3)])

Unnamed: 0,Year,Code,Value,Description
1236,2020,PROD,714,714 - Combination Fire/Smoke Alarm And Carbon ...
1238,2021,PROD,1552,"1552 - Cribs, Nonportable Or Not Specified"


Unnamed: 0,Year,Code,Value,Description
1237,2021,GENDER,3,Non-Binary/Other


In [34]:
# Since there are no conflicts with the 3 newer codes, the Year column is not needed
all_codes_df.drop(columns='Year', inplace=True)

***
## Age, Gender
***

In [35]:
# Review age less than two data
all_codes_df[all_codes_df['Code'] == 'AGELTTWO']

Unnamed: 0,Code,Value,Description
0,AGELTTWO,0,Unk
1,AGELTTWO,201,1 Month
2,AGELTTWO,202,2 Months
3,AGELTTWO,203,3 Months
4,AGELTTWO,204,4 Months
5,AGELTTWO,205,5 Months
6,AGELTTWO,206,6 Months
7,AGELTTWO,207,7 Months
8,AGELTTWO,208,8 Months
9,AGELTTWO,209,9 Months


In [36]:
# Review gender data
all_codes_df[all_codes_df['Code'] == 'GENDER']

Unnamed: 0,Code,Value,Description
93,GENDER,0,Unknown
94,GENDER,1,Male
95,GENDER,2,Female
1237,GENDER,3,Non-Binary/Other


***
## Body Part
***

In [37]:
# Review body part data
all_codes_df[all_codes_df['Code'] == 'BDYPT'].head()

Unnamed: 0,Code,Value,Description
24,BDYPT,0,0 - Internal
25,BDYPT,30,30 - Shoulder
26,BDYPT,31,31 - Upper Trunk
27,BDYPT,32,32 - Elbow
28,BDYPT,33,33 - Lower Arm


In [38]:
# Cleanup body part descriptions
def cln_bdy_desc(x):
    if x['Code'] == 'BDYPT':
        return x['Description'].split('- ')[1]
    else:
        return x['Description']

all_codes_df['Description'] = all_codes_df.apply(cln_bdy_desc, axis=1)
all_codes_df[all_codes_df['Code'] == 'BDYPT'].head()

Unnamed: 0,Code,Value,Description
24,BDYPT,0,Internal
25,BDYPT,30,Shoulder
26,BDYPT,31,Upper Trunk
27,BDYPT,32,Elbow
28,BDYPT,33,Lower Arm


***
## Diagnosis
***

In [39]:
# Cleanup diagnosis descriptions
def cln_diag_desc(x):
    if x['Code'] == 'DIAG':
        return x['Description'].split('- ')[1]
    else:
        return x['Description']

all_codes_df['Description'] = all_codes_df.apply(cln_diag_desc, axis=1)
all_codes_df[all_codes_df['Code'] == 'DIAG'].head()

Unnamed: 0,Code,Value,Description
54,DIAG,41,Ingestion
55,DIAG,42,Aspiration
56,DIAG,46,"Burn, Electrical"
57,DIAG,47,"Burn, Not Spec."
58,DIAG,48,"Burn, Scald"


***
## Disposition
***

In [40]:
# Review disposition data
all_codes_df[all_codes_df['Code'] == 'DISP'].head()

Unnamed: 0,Code,Value,Description
85,DISP,0,0 - No Injury
86,DISP,1,1 - Treated/Examined And Released
87,DISP,2,2 - Treated And Transferred
88,DISP,4,4 - Treated And Admitted/Hospitalized
89,DISP,5,5 - Held For Observation


In [41]:
# Cleanup disposition descriptions
def cln_disp_desc(x):
    if x['Code'] == 'DISP':
        return x['Description'].split('- ')[1]
    else:
        return x['Description']

all_codes_df['Description'] = all_codes_df.apply(cln_disp_desc, axis=1)
all_codes_df[all_codes_df['Code'] == 'DISP'].head()

Unnamed: 0,Code,Value,Description
85,DISP,0,No Injury
86,DISP,1,Treated/Examined And Released
87,DISP,2,Treated And Transferred
88,DISP,4,Treated And Admitted/Hospitalized
89,DISP,5,Held For Observation


***
## Race, Hispanic
***

In [42]:
# Review race data
all_codes_df[all_codes_df['Code'] == 'RACE']

Unnamed: 0,Code,Value,Description
1229,RACE,0,N.S.
1230,RACE,1,White
1231,RACE,2,Black/African American
1232,RACE,3,Other
1233,RACE,4,Asian
1234,RACE,5,American Indian/Alaska Native
1235,RACE,6,Native Hawaiian/Pacific Islander


In [43]:
# Rename N.S. to Not Specified
all_codes_df.at[1229, 'Description'] = 'Not Specified'
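
The rename above relies on the row sitting at index 1229. A sketch of a variant that selects the row by its content instead, shown on a made-up frame (the DIAG row is invented only to show that other codes are left alone):

```python
import pandas as pd

demo_df = pd.DataFrame({
    'Code': ['RACE', 'RACE', 'DIAG'],
    'Value': [0, 1, 71],
    'Description': ['N.S.', 'White', 'N.S.'],
})

# Select by Code and Description so the edit does not depend on row position
is_race_ns = (demo_df['Code'] == 'RACE') & (demo_df['Description'] == 'N.S.')
demo_df.loc[is_race_ns, 'Description'] = 'Not Specified'
print(demo_df)
```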

In [44]:
# Review Hispanic data
all_codes_df[all_codes_df['Code'] == 'HISP']

Unnamed: 0,Code,Value,Description
96,HISP,0,Unk/Not Stated
97,HISP,1,Yes
98,HISP,2,No


***
## Location
***

In [45]:
# Review location data
all_codes_df[all_codes_df['Code'] == 'LOC']

Unnamed: 0,Code,Value,Description
99,LOC,0,Unk
100,LOC,1,Home
101,LOC,2,Farm
102,LOC,3,Apart.
103,LOC,4,Street
104,LOC,5,Public
105,LOC,6,Mobile
106,LOC,7,Indst.
107,LOC,8,School
108,LOC,9,Sports


***
## Products
***

In [46]:
# Review products data
all_codes_df[all_codes_df['Code'] == 'PROD']

Unnamed: 0,Code,Value,Description
109,PROD,101,101 - Washing Machines Without Wringers Or Oth...
110,PROD,102,102 - Wringer Washing Machines
111,PROD,103,103 - Washing Machines With Unheated Spin Dryers
112,PROD,106,106 - Electric Clothes Dryers Without Washers
113,PROD,107,107 - Gas Clothes Dryers Without Washers
...,...,...,...
1226,PROD,9100,9100 - Out Of Scope Product - Retailer Report
1227,PROD,9200,9200 - Drywall Control - No Complaint
1228,PROD,9999,9999 - Uncategorized Product
1236,PROD,714,714 - Combination Fire/Smoke Alarm And Carbon ...


In [47]:
# Cleanup product code descriptions
def cln_prod_desc(x):
    if x['Code'] == 'PROD':
        # Split only at the first '- ' so labels that contain a hyphen are not truncated
        return x['Description'].split('- ', 1)[1]
    else:
        return x['Description']

all_codes_df['Description'] = all_codes_df.apply(cln_prod_desc, axis=1)
all_codes_df[all_codes_df['Code'] == 'PROD'].head()

Unnamed: 0,Code,Value,Description
109,PROD,101,Washing Machines Without Wringers Or Other Dryers
110,PROD,102,Wringer Washing Machines
111,PROD,103,Washing Machines With Unheated Spin Dryers
112,PROD,106,Electric Clothes Dryers Without Washers
113,PROD,107,Gas Clothes Dryers Without Washers
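
The body part, diagnosis, disposition, and product cleanups all strip the same "value - " prefix, so they could share one vectorized helper. The sketch below is an assumption-laden alternative rather than the notebook's method: `strip_value_prefix` is a hypothetical name, and the demo rows are a small sample based on labels shown above.

```python
import pandas as pd

def strip_value_prefix(df, codes):
    """Remove a leading '<value> - ' from Description for the given codes."""
    out = df.copy()
    mask = out['Code'].isin(codes)
    # n=1 splits only at the first '- ', so hyphens inside the label survive;
    # .str[-1] leaves a description unchanged when it has no separator
    out.loc[mask, 'Description'] = (
        out.loc[mask, 'Description'].str.split('- ', n=1).str[-1]
    )
    return out

demo_df = pd.DataFrame({
    'Code': ['BDYPT', 'PROD', 'GENDER'],
    'Value': [30, 9100, 1],
    'Description': [
        '30 - Shoulder',
        '9100 - Out Of Scope Product - Retailer Report',
        'Male',
    ],
})
cleaned_df = strip_value_prefix(demo_df, ['BDYPT', 'DIAG', 'DISP', 'PROD'])
print(cleaned_df)
```

Limiting the split to the first separator matters for labels such as product 9100, which contain a second hyphen.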


***
# Export to Excel
***

In [48]:
# Write final table to Excel
all_codes_df.to_excel('../data/std_codes.xlsx', index=False)
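
As an illustration of how the exported table might be used later, this sketch decodes a coded column with `Series.map`. The `injuries_df` frame and its `Sex` column name are assumptions made for the example, and the lookup rows are a small in-memory sample rather than the exported file.

```python
import pandas as pd

# Small sample shaped like the final Code/Value/Description table
codes_df = pd.DataFrame({
    'Code': ['GENDER', 'GENDER', 'BDYPT'],
    'Value': [1, 2, 30],
    'Description': ['Male', 'Female', 'Shoulder'],
})

# Hypothetical injury records; the 'Sex' column name is an assumption
injuries_df = pd.DataFrame({'Sex': [2, 1, 2]})

# Build a Value -> Description lookup for one code, then map it onto the data
gender_map = codes_df[codes_df['Code'] == 'GENDER'].set_index('Value')['Description']
injuries_df['Sex_Desc'] = injuries_df['Sex'].map(gender_map)
print(injuries_df)
```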

***
**End**
***