# Product Injury
<font size=4 color='blue'>Data Prep and Load - Injuries</font>   
***  

**Project Summary:**   
The Consumer Product Safety Commission operates a surveillance system (NEISS) to track injury data related to consumer products. The data is collected from a representative sample of emergency rooms in the United States. 
This project will examine the data from 2013 through 2022 to explore trends in product injuries resulting in emergency room visits.

**Notebook Scope:**  
This notebook includes code to validate and merge annual injury data. This data has been downloaded from the [NEISS website](https://www.cpsc.gov/cgibin/NEISSQuery/home.aspx).

**Output:**  
<font color='red'>**Revisit...**</font>   
The resulting data will be loaded to a SQL Azure database for EDA and Trend Analysis.
***  

***
# Notebook Setup
***

In [3]:
# Import libraries
import pandas as pd
import calendar as cal
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt

In [4]:
%%html
<!-- Prevent text wrapping in dataframe displays for a cleaner print -->
<style> .dataframe td {white-space: nowrap;}</style>

In [5]:
# Set plotting defaults
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = [11, 3]
plt.rcParams['legend.loc'] = (1.01, 0)
plt.rcParams['axes.titlesize'] = 16
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['grid.linestyle'] = ':'

***  
# Load Injury Data
***

In [42]:
# Read in the NEISS injury data from each annual Excel file
# NOTE: This code block takes 10-15 minutes to execute
path = '../_data/'
yrs = list(range(2013, 2023))
rows_read = 0
injuries_df = pd.DataFrame()

print(f'Load start time:  {datetime.now().strftime("%H:%M:%S")}')
for yr in yrs:
    file_name = 'neiss' + str(yr) + '.xlsx'
    sheet_name = 'NEISS' + '_' + str(yr)
    raw_df = pd.read_excel(path + file_name, sheet_name = sheet_name, usecols='A:E,H:I,N,Q:S,W:Y')
    rows_read += len(raw_df)
    injuries_df = pd.concat([injuries_df, raw_df])
print(f'Load finish time: {datetime.now().strftime("%H:%M:%S")}')

Load start time:  09:24:33
Load finish time: 09:35:25
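
A possible variation on the load loop, shown here as a minimal sketch: collect each annual frame in a list and call `pd.concat` once at the end, rather than growing `injuries_df` inside the loop. The two small frames below are made-up stand-ins for the annual Excel files, and `combine_frames` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

def combine_frames(frames):
    """Concatenate a list of DataFrames in one call and renumber the index."""
    return pd.concat(frames, ignore_index=True)

# Made-up stand-ins for two annual files (real data would come from pd.read_excel)
df_2013 = pd.DataFrame({'CPSC_Case_Number': [130104962, 130104963], 'Age': [57, 207]})
df_2014 = pd.DataFrame({'CPSC_Case_Number': [140100001], 'Age': [34]})

frames = [df_2013, df_2014]
rows_read = sum(len(frame) for frame in frames)
combined = combine_frames(frames)

# Same sanity check the notebook performs: rows read vs. rows combined
print(f'Total rows read:           {rows_read}')
print(f'Rows in combined dataset:  {len(combined)}')
```

Concatenating once avoids re-copying the accumulated rows on every iteration, and `ignore_index=True` would make a later `reset_index` call unnecessary.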


***
# Preview Data
***

In [7]:
injuries_df.head()

Unnamed: 0,CPSC_Case_Number,Treatment_Date,Age,Sex,Race,Other_Race,Body_Part,Diagnosis,Other_Diagnosis,Body_Part_2,Diagnosis_2,Other_Diagnosis_2,Disposition,Product_1,Product_2,Product_3,Stratum,PSU,Weight
0,130104962,2013-01-01,57,1,1,,76,53,,,,,1,3299,0,0,M,100,88.4147
1,130104963,2013-01-01,207,2,4,,75,62,,,,,1,1807,0,0,M,100,88.4147
2,130104966,2013-01-01,59,2,1,,79,53,,,,,1,1842,0,0,M,100,88.4147
3,130104968,2013-01-01,17,2,1,,37,64,,,,,1,4076,0,0,M,100,88.4147
4,130104970,2013-01-01,38,1,1,,92,59,,,,,1,474,0,0,M,100,88.4147


In [8]:
# Verify that the length of the combined dataset is the same as the total number of rows read in from all Excel files
print(f'Total rows read:           {rows_read}')
print(f'Rows in combined dataset:  {len(injuries_df)}')

Total rows read:           3559186
Rows in combined dataset:  3559186


In [9]:
# Drop rows and columns that consist only of NaN data
injuries_df.dropna(axis = 0, how = 'all', inplace=True)
injuries_df.dropna(axis = 1, how = 'all', inplace=True)

In [10]:
# View shape of the dataframe
injuries_df.shape

(3559186, 19)

In [11]:
# Reset index
injuries_df.reset_index(drop=True, inplace=True)

***
# Memory Usage
***

In [12]:
# Since our dataset is on the larger side, let's capture the memory usage. This will allow us to evaluate how much our cleanup activities
# impact overall size. While this may not be critical in this case, it's a good habit.
print(f'Memory Usage by Column:\n{injuries_df.memory_usage()}')
print(f'\nTotal Memory Usage for Dataframe:   {injuries_df.memory_usage().sum()}')

Memory Usage by Column:
Index                     132
CPSC_Case_Number     28473488
Treatment_Date       28473488
Age                  28473488
Sex                  28473488
Race                 28473488
Other_Race           28473488
Body_Part            28473488
Diagnosis            28473488
Other_Diagnosis      28473488
Body_Part_2          28473488
Diagnosis_2          28473488
Other_Diagnosis_2    28473488
Disposition          28473488
Product_1            28473488
Product_2            28473488
Product_3            28473488
Stratum              28473488
PSU                  28473488
Weight               28473488
dtype: int64

Total Memory Usage for Dataframe:   540996404
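
One caveat worth illustrating: by default `memory_usage()` counts only the 8-byte pointers for `object` columns, which is why every column above reports the same size. Passing `deep=True` also measures the Python objects those pointers refer to. The sketch below uses a small made-up frame, not the NEISS data.

```python
import pandas as pd

# Made-up frame: one object (string) column and one int64 column
demo_df = pd.DataFrame({
    'Stratum': pd.Series(['M', 'S', 'L', 'V'], dtype=object),
    'PSU': pd.Series([100, 101, 1, 2], dtype='int64'),
})

shallow = demo_df.memory_usage()        # pointers only for object columns
deep = demo_df.memory_usage(deep=True)  # includes the string objects themselves

print(pd.DataFrame({'shallow': shallow, 'deep': deep}))
```

For the numeric column the two figures match; for the object column the deep figure is larger.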


***
# Load Data Codes
***

In [24]:
# Read in the standard data codes file. We'll use this later in our data validation activities
file = '../data/std_codes.xlsx'
codes_df = pd.read_excel(file)
codes_df.head()

Unnamed: 0,Code,Value,Description
0,AGELTTWO,0,Unk
1,AGELTTWO,201,1 Month
2,AGELTTWO,202,2 Months
3,AGELTTWO,203,3 Months
4,AGELTTWO,204,4 Months


In [25]:
# Review which data is coded
codes_df['Code'].unique()

array(['AGELTTWO', 'BDYPT', 'DIAG', 'DISP', 'GENDER', 'HISP', 'LOC',
       'PROD', 'RACE'], dtype=object)

In [26]:
# Preview body part codes
codes_df[codes_df['Code'] == 'BDYPT'].head()

Unnamed: 0,Code,Value,Description
24,BDYPT,0,Internal
25,BDYPT,30,Shoulder
26,BDYPT,31,Upper Trunk
27,BDYPT,32,Elbow
28,BDYPT,33,Lower Arm


In [27]:
# Preview diagnosis codes
codes_df[codes_df['Code'] == 'DIAG'].head()

Unnamed: 0,Code,Value,Description
54,DIAG,41,Ingestion
55,DIAG,42,Aspiration
56,DIAG,46,"Burn, Electrical"
57,DIAG,47,"Burn, Not Spec."
58,DIAG,48,"Burn, Scald"


In [28]:
# Preview disposition codes
codes_df[codes_df['Code'] == 'DISP'].head()

Unnamed: 0,Code,Value,Description
85,DISP,0,0 - No Injury
86,DISP,1,1 - Treated/Examined And Released
87,DISP,2,2 - Treated And Transferred
88,DISP,4,4 - Treated And Admitted/Hospitalized
89,DISP,5,5 - Held For Observation


In [29]:
# Clean up disposition descriptions by stripping the leading "N - " prefix
def cln_disp_desc(x):
    if x['Code'] == 'DISP':
        return x['Description'][4:]
    else:
        return x['Description']

codes_df['Description'] = codes_df.apply(cln_disp_desc, axis=1)
codes_df[codes_df['Code'] == 'DISP'].head()

Unnamed: 0,Code,Value,Description
85,DISP,0,No Injury
86,DISP,1,Treated/Examined And Released
87,DISP,2,Treated And Transferred
88,DISP,4,Treated And Admitted/Hospitalized
89,DISP,5,Held For Observation
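
As an alternative to the row-wise `apply`, the same cleanup could be sketched with a boolean mask and a vectorized regex replace. This version assumes the prefix is always digits, a hyphen, and optional spaces; `demo_codes` is a made-up sample shaped like `codes_df`.

```python
import pandas as pd

# Made-up sample with the same columns as codes_df
demo_codes = pd.DataFrame({
    'Code': ['DISP', 'DISP', 'GENDER'],
    'Value': [0, 1, 1],
    'Description': ['0 - No Injury', '1 - Treated/Examined And Released', 'Male'],
})

# Only touch the DISP rows; strip a leading "<digits> - " prefix
is_disp = demo_codes['Code'] == 'DISP'
demo_codes.loc[is_disp, 'Description'] = (
    demo_codes.loc[is_disp, 'Description'].str.replace(r'^\d+\s*-\s*', '', regex=True)
)

print(demo_codes)
```

Matching on the pattern rather than slicing a fixed four characters would also cope with multi-digit codes, if any exist in the file.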


In [30]:
# Preview gender codes
codes_df[codes_df['Code'] == 'GENDER'].head()

Unnamed: 0,Code,Value,Description
93,GENDER,0,Unknown
94,GENDER,1,Male
95,GENDER,2,Female
1237,GENDER,3,Non-Binary/Other


In [31]:
# Preview product codes
codes_df[codes_df['Code'] == 'PROD'].head()

Unnamed: 0,Code,Value,Description
109,PROD,101,Washing Machines Without Wringers Or Other Dryers
110,PROD,102,Wringer Washing Machines
111,PROD,103,Washing Machines With Unheated Spin Dryers
112,PROD,106,Electric Clothes Dryers Without Washers
113,PROD,107,Gas Clothes Dryers Without Washers


In [32]:
# Preview race codes
codes_df[codes_df['Code'] == 'RACE'].head()

Unnamed: 0,Code,Value,Description
1229,RACE,0,Not Specified
1230,RACE,1,White
1231,RACE,2,Black/African American
1232,RACE,3,Other
1233,RACE,4,Asian


***
# Variables
***

In [33]:
# Display variables (column headings) for the dataframe
injuries_df.columns.values

array(['CPSC_Case_Number', 'Treatment_Date', 'Age', 'Sex', 'Race',
       'Other_Race', 'Body_Part', 'Diagnosis', 'Other_Diagnosis',
       'Body_Part_2', 'Diagnosis_2', 'Other_Diagnosis_2', 'Disposition',
       'Product_1', 'Product_2', 'Product_3', 'Stratum', 'PSU', 'Weight'],
      dtype=object)

***
**Variable Descriptions**  
-- CPSC_Case_Number: Consumer Product Safety Commission case number  
-- Treatment_Date: date of ER visit  
-- Age: age of patient  
-- Sex: gender of patient  
-- Race: race of patient  
-- Hispanic: indicates if the patient is Hispanic  
-- Body_Part: injured body part  
-- Diagnosis: diagnosis of injury  
-- Other_Diagnosis: description if Diagnosis is "Other" (code 71)  
-- Body_Part_2: additional injured body part  
-- Diagnosis_2: additional diagnosis  
-- Other_Diagnosis_2: description if Diagnosis_2 is "Other" (code 71)  
-- Disposition: outcome of the visit  
-- Product_1: primary product involved in the injury  
-- Product_2: additional product involved  
-- Product_3: additional product involved  
-- Stratum: type of hospital reporting the visit  
-- PSU: primary sampling unit  
-- Weight: statistical weight used for generating national estimates  
***

***
## Rename Variables for Clarity
***

In [34]:
injuries_df.columns

Index(['CPSC_Case_Number', 'Treatment_Date', 'Age', 'Sex', 'Race',
       'Other_Race', 'Body_Part', 'Diagnosis', 'Other_Diagnosis',
       'Body_Part_2', 'Diagnosis_2', 'Other_Diagnosis_2', 'Disposition',
       'Product_1', 'Product_2', 'Product_3', 'Stratum', 'PSU', 'Weight'],
      dtype='object')

In [35]:
# Rename columns
injuries_df.columns = ['Case', 'Date', 'Age', 'Sex', 'Race', 'Hispanic', 'Body_Part',
                       'Diagnosis', 'Other_Diagnosis', 'Body_Part_2', 'Diagnosis_2', 'Other_Diagnosis_2',
                       'Disposition', 'Product_1', 'Product_2', 'Product_3', 'Stratum', 'PSU', 'Weight']
injuries_df.head()

Unnamed: 0,Case,Date,Age,Sex,Race,Hispanic,Body_Part,Diagnosis,Other_Diagnosis,Body_Part_2,Diagnosis_2,Other_Diagnosis_2,Disposition,Product_1,Product_2,Product_3,Stratum,PSU,Weight
0,130104962,2013-01-01,57,1,1,,76,53,,,,,1,3299,0,0,M,100,88.4147
1,130104963,2013-01-01,207,2,4,,75,62,,,,,1,1807,0,0,M,100,88.4147
2,130104966,2013-01-01,59,2,1,,79,53,,,,,1,1842,0,0,M,100,88.4147
3,130104968,2013-01-01,17,2,1,,37,64,,,,,1,4076,0,0,M,100,88.4147
4,130104970,2013-01-01,38,1,1,,92,59,,,,,1,474,0,0,M,100,88.4147


***
## Data Types and Format
***

In [36]:
# Review column data types
injuries_df.dtypes

Case                          int64
Date                 datetime64[ns]
Age                           int64
Sex                           int64
Race                          int64
Hispanic                     object
Body_Part                     int64
Diagnosis                     int64
Other_Diagnosis              object
Body_Part_2                 float64
Diagnosis_2                 float64
Other_Diagnosis_2            object
Disposition                   int64
Product_1                     int64
Product_2                     int64
Product_3                     int64
Stratum                      object
PSU                           int64
Weight                      float64
dtype: object

In [37]:
# Review the range of int64 columns to see if they could be retyped as int32
int_cols = injuries_df.select_dtypes(include=['int64']).columns
for col in int_cols:
    print(f'{col:<12} {injuries_df[col].min():>10} to {injuries_df[col].max()}')

Case          130104962 to 230303682
Age                   0 to 223
Sex                   0 to 3
Race                  0 to 6
Body_Part             0 to 94
Diagnosis            41 to 74
Disposition           1 to 9
Product_1           102 to 5555
Product_2             0 to 5555
Product_3             0 to 5555
PSU                   1 to 101


In [38]:
# Convert int64 columns to int32
injuries_df[int_cols] = injuries_df[int_cols].astype('int32')
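
The range review above can also be turned into an automatic guard. The following is a rough sketch that compares each column's min and max against `np.iinfo(np.int32)` before casting; `fits_int32` and the `Too_Big` column are hypothetical, added only to show a column being skipped.

```python
import numpy as np
import pandas as pd

def fits_int32(series):
    """Return True if every value in the series fits in a signed 32-bit integer."""
    info = np.iinfo(np.int32)
    return bool(series.min() >= info.min and series.max() <= info.max)

# Made-up frame; 'Too_Big' is a hypothetical column that exceeds the int32 range
demo = pd.DataFrame({
    'Case': [130104962, 230303682],
    'Age': [0, 223],
    'Too_Big': [0, 3_000_000_000],
}, dtype='int64')

safe_cols = [col for col in demo.columns if fits_int32(demo[col])]
demo = demo.astype({col: 'int32' for col in safe_cols})

print(demo.dtypes)
```

Only the columns that pass the check are downcast, so a value outside the int32 range cannot silently overflow.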

In [39]:
# Review float columns that should be ints
float_cols = injuries_df.select_dtypes(include=['float']).columns.to_list()
float_cols.remove('Weight')
for col in float_cols:
    print(f'{col}\n{injuries_df[col].unique()}\n')

Body_Part_2
[nan 85. 35. 82. 93. 76. 75. 31. 32. 88. 81. 30. 34. 92. 79. 87. 37. 38.
 89. 94. 83. 36. 80. 33. 77.  0. 84.]

Diagnosis_2
[nan 68. 64. 57. 59. 62. 71. 53. 58. 72. 48. 55. 66. 51. 56. 74. 61. 52.
 63. 60. 50. 46. 54. 42. 67. 41. 65. 49. 47. 73. 69.]



In [40]:
# Since there are no legitimate values assigned to -1, we'll replace nan values with -1 and then convert to int32
injuries_df[float_cols] = injuries_df[float_cols].fillna(-1)
injuries_df[float_cols] = injuries_df[float_cols].astype('int32')
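
A different option, not the approach taken in this notebook, is pandas' nullable integer dtype (`'Int32'`, capital I), which keeps missing values as `<NA>` instead of a -1 sentinel. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up sample shaped like Body_Part_2: floats with NaN for "no second body part"
body_part_2 = pd.Series([np.nan, 85.0, 35.0])

# Nullable integer dtype: whole-number floats become ints, NaN becomes <NA>
nullable = body_part_2.astype('Int32')

print(nullable)
print(f'Missing values: {nullable.isna().sum()}')
```

The trade-off is that a sentinel works with plain NumPy `int32`, while the nullable dtype keeps `isna()` checks meaningful and avoids confusing -1 with a real code.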

***
# Missing Data
***  

In [41]:
# Look for columns with null values
injuries_df.columns[injuries_df.isnull().any()]

Index(['Hispanic', 'Other_Diagnosis', 'Other_Diagnosis_2'], dtype='object')
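
To see how much is missing rather than just which columns are affected, the same `isnull()` mask can be summed or averaged per column. The frame below is a made-up sample; the write-in value is invented for illustration.

```python
import pandas as pd

# Made-up sample: a coded column and a mostly empty write-in column
demo = pd.DataFrame({
    'Race': [1, 3, 0],
    'Other_Race': [None, 'MULTIRACIAL', None],
})

null_counts = demo.isnull().sum()   # number of missing values per column
null_share = demo.isnull().mean()   # fraction of rows missing per column

print(pd.DataFrame({'missing': null_counts, 'share': null_share.round(3)}))
```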

***
<font color='blue'>**Note:**</font>  
We'll keep these columns for now, as we can use them for data integrity checks later on. However, we will delete them prior to EDA. For
our purposes, we will use the predefined Race, Diagnosis and Diagnosis 2 codes and not use the write-in values for analysis.  
***

In [46]:
# Look for columns with a -1 value, as we used this to indicate missing numeric values
injuries_df.columns[injuries_df.eq(-1).any()]

Index(['Hispanic', 'Body_Part_2', 'Diagnosis_2'], dtype='object')

In [47]:
# All three columns were added in 2019, so the data available is limited. For consistency, we will not consider these columns 
# in our analysis
injuries_df.drop(columns=['Hispanic', 'Body_Part_2', 'Diagnosis_2'], inplace=True)

***
# Duplicate Records
***

In [48]:
# Check for duplicated case numbers
injuries_df[injuries_df.duplicated(subset='Case')]

Unnamed: 0,Case,Date,Age,Sex,Race,Other_Race,Body_Part,Diagnosis,Other_Diagnosis,Other_Diagnosis_2,Disposition,Product_1,Product_2,Product_3,Stratum,PSU,Weight
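
For reference, `duplicated()` by default flags only the second and later occurrences of a key. If duplicates were ever found, passing `keep=False` would show every row in each duplicate group side by side. A small sketch on made-up case numbers:

```python
import pandas as pd

# Made-up sample with one repeated case number
demo = pd.DataFrame({'Case': [1, 2, 2, 3], 'Age': [10, 20, 21, 30]})

later_only = demo[demo.duplicated(subset='Case')]                # second occurrence onward
all_in_group = demo[demo.duplicated(subset='Case', keep=False)]  # every row sharing a Case

print(f'Case numbers unique: {demo["Case"].is_unique}')
print(all_in_group)
```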


***
# Data Review and Cleanup
***

## Date
***

In [49]:
# Verify all dates are within our date range
print(f'Start Date:   {injuries_df["Date"].min().strftime("%b %d, %Y")}')
print(f'End Date:     {injuries_df["Date"].max().strftime("%b %d, %Y")}')

Start Date:   Jan 01, 2013
End Date:     Dec 31, 2022
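
The min/max check could also be written as an explicit filter that lists any rows outside the expected window. This sketch assumes the dates carry no time component; the sample dates, including the out-of-range one, are made up.

```python
import pandas as pd

start, end = pd.Timestamp('2013-01-01'), pd.Timestamp('2022-12-31')

# Made-up sample dates; the last one falls outside the study window
dates = pd.Series(pd.to_datetime(['2013-01-01', '2018-06-15', '2022-12-31', '2023-01-01']))

# between() is inclusive on both ends by default
out_of_range = dates[~dates.between(start, end)]

print(f'Dates outside range: {len(out_of_range)}')
```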


## Age
***

In [50]:
# Check for missing or unknown age, which is coded as 0
print(f'Unknown Age:  {len(injuries_df[injuries_df["Age"] == 0])}')

Unknown Age:  280


In [51]:
# Given the small number of rows with unknown age, we will simply remove these rows from the dataset
rows_to_drop = injuries_df[injuries_df['Age'] == 0].index.to_list()
injuries_df.drop(index=rows_to_drop, inplace=True)

In [52]:
# Review age value ranges
print(f'Min Age:   {injuries_df["Age"].min()}')
print(f'Max Age:   {injuries_df["Age"].max()}')

Min Age:   2
Max Age:   223


In [53]:
# From the documentation provided by NEISS, ages less than 2 are coded by month.
codes_df[codes_df['Code'] == 'AGELTTWO']

Unnamed: 0,Code,Value,Description
0,AGELTTWO,0,UNK
1,AGELTTWO,201,1 MONTH
2,AGELTTWO,202,2 MONTHS
3,AGELTTWO,203,3 MONTHS
4,AGELTTWO,204,4 MONTHS
5,AGELTTWO,205,5 MONTHS
6,AGELTTWO,206,6 MONTHS
7,AGELTTWO,207,7 MONTHS
8,AGELTTWO,208,8 MONTHS
9,AGELTTWO,209,9 MONTHS


In [54]:
# For our purposes, we only want to consider age by year. Ages 1 month to 11 months will be relabeled as 0 years old
# and ages 12 months to 23 months will be relabeled as 1 year old
injuries_df.loc[injuries_df['Age'].between(200, 211), 'Age'] = 0
injuries_df.loc[injuries_df['Age'].between(212, 223), 'Age'] = 1

In [55]:
# Review age value ranges
print(f'Min Age:   {injuries_df["Age"].min()}')
print(f'Max Age:   {injuries_df["Age"].max()}')

Min Age:   0
Max Age:   113
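
The recoding rule can be expressed arithmetically: for codes 201 through 223, subtracting 200 gives the age in months, and integer division by 12 gives whole years. The sketch below applies that to a made-up series; it assumes codes outside 201 to 223 are already ages in years.

```python
import pandas as pd

# Made-up sample: 207 = 7 months, 212 = 12 months, 223 = 23 months
ages = pd.Series([57, 207, 212, 223, 2])

# Flag month-coded ages, then convert months to whole years
is_months = ages.between(201, 223)
age_years = ages.where(~is_months, (ages - 200) // 12)

print(age_years.tolist())
```

`where` keeps the original value wherever the condition is true, so only the month-coded entries are replaced.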


***
## Sex
***

In [86]:
# Review Sex data
temp_inj_df = pd.DataFrame(injuries_df['Sex'].value_counts())
temp_codes_df = codes_df[codes_df['Code'] == 'GENDER'].drop(columns='Code').set_index('Value')
temp_sex_df = pd.concat([temp_codes_df, temp_inj_df], axis=1)
temp_sex_df.columns = ['Sex', 'Count']
temp_sex_df['Sex'] = temp_sex_df['Sex'].str.title()
temp_sex_df

Unnamed: 0,Sex,Count
0,Unknown,52
1,Male,1929778
2,Female,1629312
3,Non-Binary/Other,44


***
## Race
***

In [87]:
# Review Race data
temp_inj_df = pd.DataFrame(injuries_df['Race'].value_counts())
temp_codes_df = codes_df[codes_df['Code'] == 'RACE'].drop(columns='Code').set_index('Value')
temp_race_df = pd.concat([temp_codes_df, temp_inj_df], axis=1)
temp_race_df.columns = ['Race', 'Count']
temp_race_df['Race'] = temp_race_df['Race'].str.title()
temp_race_df

Unnamed: 0,Race,Count
0,N.S.,1214319
1,White,1571277
2,Black/African American,538303
3,Other,171173
4,Asian,49321
5,American Indian/Alaska Native,10766
6,Native Hawaiian/Pacific Islander,4027
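
As a validation step, the data can be checked for values that have no entry in the codes table. This is a minimal sketch using `isin`; the codes table and the stray value 7 are made up.

```python
import pandas as pd

# Made-up codes table and data column; 7 is a deliberately invalid value
codes = pd.DataFrame({
    'Code': ['RACE', 'RACE', 'RACE', 'GENDER'],
    'Value': [0, 1, 2, 1],
    'Description': ['Not Specified', 'White', 'Black/African American', 'Male'],
})
race = pd.Series([1, 2, 0, 7, 1], name='Race')

valid_values = codes.loc[codes['Code'] == 'RACE', 'Value']
unexpected = race[~race.isin(valid_values)]

print(f'Rows with unrecognized codes: {len(unexpected)}')
```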


In [85]:
injuries_df.columns

Index(['CPSC_Case_Number', 'Treatment_Date', 'Age', 'Sex', 'Race',
       'Other_Race', 'Body_Part', 'Diagnosis', 'Other_Diagnosis',
       'Body_Part_2', 'Diagnosis_2', 'Other_Diagnosis_2', 'Disposition',
       'Product_1', 'Product_2', 'Product_3', 'Stratum', 'PSU', 'Weight'],
      dtype='object')

***
## Body Part
***

In [103]:
# Review Body Part data
temp_inj_df = pd.DataFrame(injuries_df['Body_Part'].value_counts())
temp_codes_df = codes_df[codes_df['Code'] == 'BDYPT'].drop(columns='Code').set_index('Value')
temp_body_df = pd.concat([temp_codes_df, temp_inj_df], axis=1)
temp_body_df.columns = ['Body_Part', 'Count']
temp_body_df['Body_Part'] = temp_body_df['Body_Part'].str.title()
temp_body_df
