# Product Injury
<font size=4 color='blue'>Data Prep and Load - Injuries</font>   
***  

**Project Summary:**   
The Consumer Product Safety Commission operates a surveillance system, the National Electronic Injury Surveillance System (NEISS), to track injury data related to consumer products. The data is collected from a representative sample of emergency rooms in the United States. 
This project will examine the data from 2013 through 2022 to explore trends in product injuries resulting in emergency room visits.

**Notebook Scope:**  
This notebook includes code to validate and merge annual injury data. This data has been downloaded from the [NEISS website](https://www.cpsc.gov/cgibin/NEISSQuery/home.aspx).

**Output:**  
<font color='red'>**Revisit...**</font>   
The resulting data will be loaded to a SQL Azure database for EDA and Trend Analysis.
***  

***
# Notebook Setup
***

In [1]:
# Import libraries
import pandas as pd
import calendar as cal
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt

In [2]:
%%html
<!-- Prevent text wrapping in dataframe displays for a cleaner print -->
<style> .dataframe td {white-space: nowrap;}</style>

In [3]:
# Set plotting defaults
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = [11, 3]
plt.rcParams['legend.loc'] = (1.01, 0)
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['grid.linestyle'] = ':'

***  
# Load Injury Data
***

In [4]:
# Read in the NEISS injury data from each annual Excel file
# NOTE: This code block takes 10-15 minutes to execute
path = '../_data/'
yrs = range(2013, 2023)
rows_read = 0
frames = []

# Double quotes inside the f-string keep this runnable on Python versions before 3.12
print(f'Load start time:  {datetime.now().strftime("%H:%M:%S")}')
for yr in yrs:
    file_name = f'neiss{yr}.xlsx'
    sheet_name = f'NEISS_{yr}'
    raw_df = pd.read_excel(path + file_name, sheet_name=sheet_name, usecols='A:E,H:I,N,Q:S,W:Y')
    rows_read += len(raw_df)
    frames.append(raw_df)
# Concatenate once at the end instead of growing the dataframe inside the loop
injuries_df = pd.concat(frames)
print(f'Load finish time: {datetime.now().strftime("%H:%M:%S")}')

Load start time:  14:59:53
Load finish time: 15:10:11
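
Because the Excel load takes 10-15 minutes, it may be worth caching the combined dataframe so later runs can skip it. The sketch below is one possible approach, not part of the original workflow: `load_or_build`, `slow_build`, and the cache file name are hypothetical, and a tiny in-memory dataframe stands in for the real load.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_build(cache_path, build_fn):
    """Return the cached dataframe if it exists; otherwise build it and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_pickle(cache_path)
    df = build_fn()
    df.to_pickle(cache_path)
    return df


calls = []  # records how many times the expensive step actually runs


def slow_build():
    # Hypothetical stand-in for the 10-15 minute Excel load
    calls.append(1)
    return pd.DataFrame({'CPSC_Case_Number': [130104962, 130104963], 'Age': [57, 207]})


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / 'injuries_raw.pkl'  # hypothetical cache location
    first = load_or_build(cache_file, slow_build)   # builds and writes the cache
    second = load_or_build(cache_file, slow_build)  # reads the cache instead
```

Pickle is used here only because it needs no extra dependencies; any format that preserves dtypes would serve the same purpose.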


***
# Preview Data
***

In [5]:
injuries_df.head()

Unnamed: 0,CPSC_Case_Number,Treatment_Date,Age,Sex,Race,Body_Part,Diagnosis,Disposition,Product_1,Product_2,Product_3,Stratum,PSU,Weight
0,130104962,2013-01-01,57,1,1,76,53,1,3299,0,0,M,100,88.4147
1,130104963,2013-01-01,207,2,4,75,62,1,1807,0,0,M,100,88.4147
2,130104966,2013-01-01,59,2,1,79,53,1,1842,0,0,M,100,88.4147
3,130104968,2013-01-01,17,2,1,37,64,1,4076,0,0,M,100,88.4147
4,130104970,2013-01-01,38,1,1,92,59,1,474,0,0,M,100,88.4147


In [6]:
# Verify that the length of the combined dataset is the same as the total number of rows read in from all Excel files
print(f'Total rows read:           {rows_read}')
print(f'Rows in combined dataset:  {len(injuries_df)}')

Total rows read:           3559186
Rows in combined dataset:  3559186


In [7]:
# Drop rows and columns that consist only of NaN data
injuries_df.dropna(axis = 0, how = 'all', inplace=True)
injuries_df.dropna(axis = 1, how = 'all', inplace=True)

In [8]:
# View shape of the dataframe
injuries_df.shape

(3559186, 14)

In [9]:
# Reset index
injuries_df.reset_index(drop=True, inplace=True)

***
# Memory Usage
***

In [10]:
# Since our dataset is on the larger side, let's capture the memory usage. This will allow us to evaluate how much our cleanup activities
# impact overall size. While this may not be critical in this case, it's a good habit.
print(f'Memory Usage by Column:\n{injuries_df.memory_usage()}')
print(f'\nTotal Memory Usage for Dataframe:   {injuries_df.memory_usage().sum()}')

Memory Usage by Column:
Index                    132
CPSC_Case_Number    28473488
Treatment_Date      28473488
Age                 28473488
Sex                 28473488
Race                28473488
Body_Part           28473488
Diagnosis           28473488
Disposition         28473488
Product_1           28473488
Product_2           28473488
Product_3           28473488
Stratum             28473488
PSU                 28473488
Weight              28473488
dtype: int64

Total Memory Usage for Dataframe:   398628964
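
One caveat on the numbers above: by default, `memory_usage()` counts only the pointers held in an `object` column such as `Stratum`, not the strings they point to. The sketch below uses a small made-up dataframe to show the difference that `deep=True` makes, and how a `category` dtype stores each distinct label once.

```python
import pandas as pd

# Small illustrative frame; the values are made up for the demo
demo = pd.DataFrame({
    'PSU': pd.Series([100, 100, 100], dtype='int64'),
    'Stratum': pd.Series(['M', 'M', 'V'], dtype='object'),
})

shallow = demo.memory_usage()        # object column: size of the pointers only
deep = demo.memory_usage(deep=True)  # object column: also counts the string objects

# A categorical column keeps each distinct label once plus small integer codes
as_category = demo['Stratum'].astype('category')

print(f'Shallow: {shallow["Stratum"]}  Deep: {deep["Stratum"]}')
```

Numeric columns report the same figure either way, so the gap only matters for string-like columns.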


***
# Load Data Codes
***

In [11]:
# Read in the standard data codes file. We'll use this later in our data validation activities
file = '../data/std_codes.xlsx'
codes_df = pd.read_excel(file)
codes_df.head()

Unnamed: 0,Code,Value,Description
0,AGELTTWO,0,Unk
1,AGELTTWO,201,1 Month
2,AGELTTWO,202,2 Months
3,AGELTTWO,203,3 Months
4,AGELTTWO,204,4 Months


In [12]:
# Review which data is coded
codes_df['Code'].unique()

array(['AGELTTWO', 'BDYPT', 'DIAG', 'DISP', 'GENDER', 'HISP', 'LOC',
       'PROD', 'RACE'], dtype=object)

In [13]:
# Preview body part codes
codes_df[codes_df['Code'] == 'BDYPT'].head()

Unnamed: 0,Code,Value,Description
24,BDYPT,0,Internal
25,BDYPT,30,Shoulder
26,BDYPT,31,Upper Trunk
27,BDYPT,32,Elbow
28,BDYPT,33,Lower Arm


In [14]:
# Preview diagnosis codes
codes_df[codes_df['Code'] == 'DIAG'].head()

Unnamed: 0,Code,Value,Description
54,DIAG,41,Ingestion
55,DIAG,42,Aspiration
56,DIAG,46,"Burn, Electrical"
57,DIAG,47,"Burn, Not Spec."
58,DIAG,48,"Burn, Scald"


In [15]:
# Preview disposition codes
codes_df[codes_df['Code'] == 'DISP'].head()

Unnamed: 0,Code,Value,Description
85,DISP,0,No Injury
86,DISP,1,Treated/Examined And Released
87,DISP,2,Treated And Transferred
88,DISP,4,Treated And Admitted/Hospitalized
89,DISP,5,Held For Observation


In [16]:
# Preview gender codes
codes_df[codes_df['Code'] == 'GENDER'].head()

Unnamed: 0,Code,Value,Description
93,GENDER,0,Unknown
94,GENDER,1,Male
95,GENDER,2,Female
1237,GENDER,3,Non-Binary/Other


In [17]:
# Preview product codes
codes_df[codes_df['Code'] == 'PROD'].head()

Unnamed: 0,Code,Value,Description
109,PROD,101,Washing Machines Without Wringers Or Other Dryers
110,PROD,102,Wringer Washing Machines
111,PROD,103,Washing Machines With Unheated Spin Dryers
112,PROD,106,Electric Clothes Dryers Without Washers
113,PROD,107,Gas Clothes Dryers Without Washers


In [18]:
# Preview race codes
codes_df[codes_df['Code'] == 'RACE'].head()

Unnamed: 0,Code,Value,Description
1229,RACE,0,Not Specified
1230,RACE,1,White
1231,RACE,2,Black/African American
1232,RACE,3,Other
1233,RACE,4,Asian


***
# Variables
***

In [19]:
# Display variables (column headings) for the dataframe
injuries_df.columns.values

array(['CPSC_Case_Number', 'Treatment_Date', 'Age', 'Sex', 'Race',
       'Body_Part', 'Diagnosis', 'Disposition', 'Product_1', 'Product_2',
       'Product_3', 'Stratum', 'PSU', 'Weight'], dtype=object)

***
**Variable Descriptions**  
-- CPSC_Case_Number: Consumer Product Safety Commission case number  
-- Treatment_Date: date of ER visit  
-- Age: age of patient  
-- Sex: gender of patient  
-- Race: race of patient  
-- Body_Part: injured body part  
-- Diagnosis: diagnosis of injury  
-- Disposition: outcome of the visit  
-- Product_1: primary product involved in the injury  
-- Product_2: additional product involved  
-- Product_3: additional product involved  
-- Stratum: type of hospital reporting the visit  
-- PSU: primary sampling unit  
-- Weight: statistical weight used for generating national estimates  
***

***
## Rename Variables for Clarity
***

In [20]:
injuries_df.rename(columns={'CPSC_Case_Number':'Case', 'Treatment_Date':'Date'}, inplace=True)
injuries_df.columns

Index(['Case', 'Date', 'Age', 'Sex', 'Race', 'Body_Part', 'Diagnosis',
       'Disposition', 'Product_1', 'Product_2', 'Product_3', 'Stratum', 'PSU',
       'Weight'],
      dtype='object')

***
## Data Types and Format
***

In [21]:
# Review column data types
injuries_df.dtypes

Case                    int64
Date           datetime64[ns]
Age                     int64
Sex                     int64
Race                    int64
Body_Part               int64
Diagnosis               int64
Disposition             int64
Product_1               int64
Product_2               int64
Product_3               int64
Stratum                object
PSU                     int64
Weight                float64
dtype: object

In [22]:
# Review the range of int64 columns to see if they could be retyped as int32
int_cols = injuries_df.select_dtypes(include=['int']).columns
for col in int_cols:
    print(f'{col:<12} {injuries_df[col].min():>10} to {injuries_df[col].max()}')

Case          130104962 to 230303682
Age                   0 to 223
Sex                   0 to 3
Race                  0 to 6
Body_Part             0 to 94
Diagnosis            41 to 74
Disposition           1 to 9
Product_1           102 to 5555
Product_2             0 to 5555
Product_3             0 to 5555
PSU                   1 to 101


In [23]:
# Convert int64 columns to int32
injuries_df[int_cols] = injuries_df[int_cols].astype('int32')
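
The range review above is what makes this conversion safe: every value fits within the int32 limits. As a sketch of how that check could be built into the conversion itself, the hypothetical helper `safe_downcast` below refuses to cast a column whose values fall outside the target type's range. The demo values are the minimum and maximum shown in the range review.

```python
import numpy as np
import pandas as pd


def safe_downcast(df, cols, dtype='int32'):
    """Cast the given columns to a smaller integer type, failing loudly on overflow."""
    limits = np.iinfo(dtype)
    out = df.copy()
    for col in cols:
        lo, hi = out[col].min(), out[col].max()
        if lo < limits.min or hi > limits.max:
            raise ValueError(f'{col} range {lo} to {hi} does not fit in {dtype}')
        out[col] = out[col].astype(dtype)
    return out


demo = pd.DataFrame({'Case': [130104962, 230303682], 'Age': [0, 223]}, dtype='int64')
small = safe_downcast(demo, ['Case', 'Age'])

print(small.dtypes)
print(f'Bytes before: {demo.memory_usage(index=False).sum()}  after: {small.memory_usage(index=False).sum()}')
```

A plain `astype('int32')` does not necessarily raise on out-of-range values, which is why an explicit guard can be useful.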

***
# Missing Data
***  

In [24]:
# Look for columns with null values
injuries_df.columns[injuries_df.isnull().any()]

Index([], dtype='object')

***
# Duplicate Records
***

In [25]:
# Check for duplicated case numbers
injuries_df[injuries_df.duplicated(subset='Case')]

Unnamed: 0,Case,Date,Age,Sex,Race,Body_Part,Diagnosis,Disposition,Product_1,Product_2,Product_3,Stratum,PSU,Weight
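
The check above returns no rows, so every case number is unique. For reference, `duplicated()` flags only the second and later copies by default. The sketch below uses made-up rows with one deliberately repeated case number to show how `keep=False` returns every row involved, which is usually what we'd want when inspecting duplicates.

```python
import pandas as pd

# Made-up rows; case 130104963 is repeated on purpose
demo = pd.DataFrame({
    'Case': [130104962, 130104963, 130104963, 130104966],
    'Age': [57, 17, 17, 59],
})

later_copies = demo[demo.duplicated(subset='Case')]            # default keep='first'
all_copies = demo[demo.duplicated(subset='Case', keep=False)]  # every row sharing a Case
is_unique = demo['Case'].is_unique                             # quick yes/no check

print(all_copies)
```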


***
# Data Review and Cleanup
***

## Date
***

In [26]:
# Verify all dates are within our date range
print(f'Start Date:   {injuries_df["Date"].min().strftime("%b %d, %Y")}')
print(f'End Date:     {injuries_df["Date"].max().strftime("%b %d, %Y")}')

Start Date:   Jan 01, 2013
End Date:     Dec 31, 2022
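
Checking the minimum and maximum confirms the overall range. If a date did fall outside it, we would also want to see the offending rows. The sketch below shows one way to do that with `Series.between`, along with a per-year row count. The dates are made up, with one value deliberately out of range.

```python
import pandas as pd

# Made-up dates; the last one is deliberately outside the study period
dates = pd.Series(pd.to_datetime(['2013-01-01', '2017-06-15', '2022-12-31', '2023-01-01']))
start, end = pd.Timestamp('2013-01-01'), pd.Timestamp('2022-12-31')

out_of_range = dates[~dates.between(start, end)]  # between() is inclusive on both ends
rows_per_year = dates.dt.year.value_counts().sort_index()

print(out_of_range)
print(rows_per_year)
```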


## Age
***

In [27]:
# Check for missing or unknown age, which is coded as 0
print(f'Unknown Age:  {len(injuries_df[injuries_df["Age"] == 0])}')

Unknown Age:  280


In [28]:
# Review age value ranges
print(f'Min Age:   {injuries_df["Age"].min()}')
print(f'Max Age:   {injuries_df["Age"].max()}')

Min Age:   0
Max Age:   223


In [29]:
# From the documentation provided by NEISS, ages less than 2 are coded by month.
codes_df[codes_df['Code'] == 'AGELTTWO'].head()

Unnamed: 0,Code,Value,Description
0,AGELTTWO,0,Unk
1,AGELTTWO,201,1 Month
2,AGELTTWO,202,2 Months
3,AGELTTWO,203,3 Months
4,AGELTTWO,204,4 Months


In [30]:
# For our purposes, we only want to consider age in years. Recode unknown age as -1, ages 1 month to 11 months as 0,
# and ages 12 months to 23 months as 1. The unknown ages must be recoded first so they are not confused with the new 0 values.
injuries_df.loc[injuries_df['Age'] == 0, 'Age'] = -1
injuries_df.loc[injuries_df['Age'].between(200, 211), 'Age'] = 0
injuries_df.loc[injuries_df['Age'].between(212, 223), 'Age'] = 1

In [31]:
# Review age value ranges
print(f'Min Age:   {injuries_df["Age"].min()}')
print(f'Max Age:   {injuries_df["Age"].max()}')

Min Age:   -1
Max Age:   113
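
The recoding above depends on the order of the three assignments. As an alternative sketch, the hypothetical function `recode_age` below evaluates every condition against the original values, so the order of the steps no longer matters. It assumes the same coding as above: 0 is unknown and 201 through 223 are ages in months.

```python
import pandas as pd


def recode_age(age):
    """Recode NEISS-style ages to whole years: unknown -> -1, under 12 months -> 0, 12-23 months -> 1."""
    out = age.mask(age == 0, -1)               # unknown age
    out = out.mask(age.between(201, 211), 0)   # 1 to 11 months
    out = out.mask(age.between(212, 223), 1)   # 12 to 23 months
    return out


sample = pd.Series([0, 201, 211, 212, 223, 57, 113])
recoded = recode_age(sample)

print(recoded.tolist())
```

Ages already recorded in years pass through unchanged.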


***
## Sex
***

In [32]:
# Review Sex data
temp_inj_df = pd.DataFrame(injuries_df['Sex'].value_counts())
temp_codes_df = codes_df[codes_df['Code'] == 'GENDER'].drop(columns='Code').set_index('Value')
temp_sex_df = pd.concat([temp_codes_df, temp_inj_df], axis=1)
temp_sex_df.columns = ['Sex', 'Count']
temp_sex_df['Sex'] = temp_sex_df['Sex'].str.title()
temp_sex_df

Unnamed: 0,Sex,Count
0,Unknown,52
1,Male,1929778
2,Female,1629312
3,Non-Binary/Other,44
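
The same lookup-and-count pattern is repeated for each coded variable in this notebook, so it could be wrapped in a helper. The sketch below is one possible version. `label_counts` is a hypothetical name, and the small codes table and sample values are made up to mirror the structure of `codes_df`.

```python
import pandas as pd


def label_counts(values, codes, code, label):
    """Join value counts for one variable to the descriptions for its code set."""
    lookup = codes.loc[codes['Code'] == code].set_index('Value')['Description'].rename(label)
    counts = values.value_counts().rename('Count')
    table = pd.concat([lookup, counts], axis=1)
    # Codes that never occur in the data get a count of 0 instead of NaN
    table['Count'] = table['Count'].fillna(0).astype('int64')
    return table.sort_index()


codes = pd.DataFrame({
    'Code': ['GENDER', 'GENDER', 'GENDER', 'RACE'],
    'Value': [0, 1, 2, 1],
    'Description': ['Unknown', 'Male', 'Female', 'White'],
})
sex = pd.Series([1, 2, 1, 1, 2], name='Sex')
sex_table = label_counts(sex, codes, 'GENDER', 'Sex')

print(sex_table)
```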


***
## Race
***

In [33]:
# Review Race data
temp_inj_df = pd.DataFrame(injuries_df['Race'].value_counts())
temp_codes_df = codes_df[codes_df['Code'] == 'RACE'].drop(columns='Code').set_index('Value')
temp_race_df = pd.concat([temp_codes_df, temp_inj_df], axis=1)
temp_race_df.columns = ['Race', 'Count']
temp_race_df['Race'] = temp_race_df['Race'].str.title()
temp_race_df.head()

Unnamed: 0,Race,Count
0,Not Specified,1214319
1,White,1571277
2,Black/African American,538303
3,Other,171173
4,Asian,49321


***
## Body Part
***

In [40]:
# Review Body Part data
temp_inj_df = pd.DataFrame(injuries_df['Body_Part'].value_counts())
temp_codes_df = codes_df[codes_df['Code'] == 'BDYPT'].drop(columns='Code').set_index('Value')
temp_bdypt_df = pd.concat([temp_codes_df, temp_inj_df], axis=1).dropna()
temp_bdypt_df.columns = ['Body Part', 'Count']
temp_bdypt_df['Count'] = temp_bdypt_df['Count'].astype('int32')
temp_bdypt_df.head()

Unnamed: 0,Body Part,Count
0,Internal,38466
30,Shoulder,139173
31,Upper Trunk,198823
32,Elbow,84555
33,Lower Arm,128424
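
Note that the `dropna()` in the cell above removes any value that has no matching code, so an invalid value in the data would disappear from this table. For validation, it can help to list those values explicitly. The sketch below shows one way to do that. `unmatched_values` is a hypothetical helper, and the value 99 is a made-up invalid code included for the demo.

```python
import pandas as pd


def unmatched_values(values, codes, code):
    """Return counts of data values that have no entry in the given code set."""
    valid = list(codes.loc[codes['Code'] == code, 'Value'])
    counts = values.value_counts()
    return counts[~counts.index.isin(valid)].sort_index()


codes = pd.DataFrame({
    'Code': ['BDYPT', 'BDYPT', 'BDYPT'],
    'Value': [0, 30, 31],
    'Description': ['Internal', 'Shoulder', 'Upper Trunk'],
})
body_part = pd.Series([30, 31, 31, 99, 99, 0], name='Body_Part')  # 99 is deliberately invalid
missing = unmatched_values(body_part, codes, 'BDYPT')

print(missing)
```

An empty result would mean every value in the column is covered by the codes file.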


***
## Diagnosis
***

In [41]:
# Review Diagnosis data
temp_inj_df = pd.DataFrame(injuries_df['Diagnosis'].value_counts())
temp_codes_df = codes_df[codes_df['Code'] == 'DIAG'].drop(columns='Code').set_index('Value')
temp_diag_df = pd.concat([temp_codes_df, temp_inj_df], axis=1).dropna()
temp_diag_df.columns = ['Diagnosis', 'Count']
temp_diag_df['Count'] = temp_diag_df['Count'].astype('int32')
temp_diag_df.dropna().head()

Unnamed: 0,Diagnosis,Count
41,Ingestion,33316
42,Aspiration,5147
46,"Burn, Electrical",1095
47,"Burn, Not Spec.",667
48,"Burn, Scald",24183


***
## Disposition
***

In [42]:
# Review disposition data
temp_inj_df = pd.DataFrame(injuries_df['Disposition'].value_counts())
temp_codes_df = codes_df[codes_df['Code'] == 'DISP'].drop(columns='Code').set_index('Value')
temp_disp_df = pd.concat([temp_codes_df, temp_inj_df], axis=1).dropna()
temp_disp_df.columns = ['Disposition', 'Count']
temp_disp_df['Count'] = temp_disp_df['Count'].astype('int32')
temp_disp_df.dropna().head()

Unnamed: 0,Disposition,Count
1,Treated/Examined And Released,3139530
2,Treated And Transferred,30558
4,Treated And Admitted/Hospitalized,307688
5,Held For Observation,26780
6,Left Without Being Seen,52222


***
## Products
***

In [76]:
# Review product data
temp_inj_df = pd.DataFrame(injuries_df[['Product_1', 'Product_2', 'Product_3']].apply(pd.Series.value_counts).sum(axis=1))
temp_codes_df = codes_df[codes_df['Code'] == 'PROD'].drop(columns='Code').set_index('Value')
temp_prod_df = pd.concat([temp_codes_df, temp_inj_df], axis=1).dropna()
temp_prod_df.columns = ['Product_1', 'Count']
temp_prod_df['Count'] = temp_prod_df['Count'].astype('int32')
temp_prod_df.dropna().head()

Unnamed: 0,Product_1,Count
102,Wringer Washing Machines,21
106,Electric Clothes Dryers Without Washers,235
107,Gas Clothes Dryers Without Washers,24
110,Electric Heating Pads,613
112,Sewing Machines Or Accessories,495


***
## Stratum, PSU, Weight
***

In [79]:
# Review stratum data
pd.DataFrame(injuries_df['Stratum'].value_counts())

Stratum,count
V,1386103
C,704759
L,542809
S,469953
M,455562


In [83]:
# Review PSU data
pd.DataFrame(injuries_df['PSU'].value_counts()).head()

PSU,count
21,227267
95,159709
20,143418
8,136438
31,108385


In [84]:
# Review weight data
pd.DataFrame(injuries_df['Weight'].value_counts()).head()

Weight,count
14.8537,150878
14.6504,82652
14.3089,80895
15.6716,78287
4.757,78262
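
As noted in the variable descriptions, `Weight` is used to generate national estimates. The sketch below assumes the usual survey-weight interpretation, where each sampled row represents `Weight` cases nationally, so an estimate is the sum of weights rather than a row count. The rows are made up for illustration, using weight values that appear in the outputs above.

```python
import pandas as pd

# Made-up rows; the weights are values seen in the outputs above
demo = pd.DataFrame({
    'Product_1': [3299, 1807, 1807, 4076],
    'Weight': [88.4147, 88.4147, 14.8537, 14.8537],
})

sample_counts = demo.groupby('Product_1').size()                # rows in the sample
national_estimates = demo.groupby('Product_1')['Weight'].sum()  # assumed estimate: sum of weights

print(pd.DataFrame({'Sample': sample_counts, 'Estimate': national_estimates}))
```

Raw counts describe only the sampled emergency rooms, so any trend analysis meant to represent the country would likely need the weighted figures.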
