# Exercise: NEISS, Question Set R

#### Summary

The [National Electronic Injury Surveillance System](https://www.cpsc.gov/Safety-Education/Safety-Guides/General-Information/National-Electronic-Injury-Surveillance-System-NEISS) is a data product produced by the US Consumer Product Safety Commission. It tracks emergency room injuries related to consumer products (e.g., "a door fell on me!").

#### Files

- **nss15.tsv**: injury data (one injury per row)
- **2017NEISSCodingManualCPSConlyNontrauma.pdf**: column definitions and explanations
- **2015 Neiss data highlights.pdf**: a partial summary of the data
- **2017ComparabilityTable.pdf**: product code definitions
- **categories-cleaned.txt**: product code definitions in CSV format (great for joining!)

#### Source

https://www.cpsc.gov/Safety-Education/Safety-Guides/General-Information/National-Electronic-Injury-Surveillance-System-NEISS

#### Skills

- Using codebooks
- Reading tab-separated files
- Ignoring bad lines
- Replacing LOTS of values
- Merging dataframes
- Using numpy/`np.nan`
- Padding strings
- String search using regular expressions

# Read in `nss15.tsv`

Some of the lines just **aren't formatted correctly**. Maybe we can avoid those?
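One possible way to skip malformed rows is sketched below on a tiny in-memory sample rather than the real file. It assumes pandas 1.3 or newer, where `read_csv` accepts `on_bad_lines`; the column names (`case_id`, `age`, `body_part`) are made up for the illustration.

```python
import io

import pandas as pd

# A made-up, tab-separated sample: the second data row has too many fields.
raw = (
    "case_id\tage\tbody_part\n"
    "1\t34\t75\n"
    "2\t5\t76\textra\tfield\n"
    "3\t61\t79\n"
)

# sep="\t" reads tab-separated data; on_bad_lines="skip" drops rows
# whose field count does not match the header instead of raising an error.
df = pd.read_csv(io.StringIO(raw), sep="\t", on_bad_lines="skip")

print(df.shape)
print(df.dtypes)
```

With the real file you would pass the filename instead of the `io.StringIO` object, and `df.shape` and `df.dtypes` are handy for the checks that follow.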

### Check that your dataframe has 357727 rows and 19 columns.

### List the columns and their data types

### What does each column mean?

# Cleaning up a column

Take a look at the **body part** column. How many rows of each body part are there?

## Replace the numbers with the appropriate words they stand for.

Those numbers are terrible - codes are fine for storage but not really for reading. **Replace the numbers with the words they stand for.**

Refer to pages 11-12 of the column definitions file, or... hey, I typed it in below!

- Tip: If I've already talked about how to replace values, maybe there's a really easy way to replace a lot at once? Maybe I'll tell you if you ask me?

# 0:  'internal'
# 30: 'shoulder'
# 31: 'upper trunk'
# 32: 'elbow'
# 33: 'lower arm'
# 34: 'wrist'
# 35: 'knee'
# 36: 'lower leg'
# 37: 'ankle'
# 38: 'pubic region'
# 75: 'head'
# 76: 'face'
# 77: 'eyeball'
# 79: 'lower trunk'
# 80: 'upper arm'
# 81: 'upper leg'
# 82: 'hand'
# 83: 'foot'
# 84: '25-50% of body'
# 85: 'all parts of body'
# 87: 'not recorded'
# 88: 'mouth'
# 89: 'neck'
# 92: 'finger'
# 93: 'toe'
# 94: 'ear'
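A minimal sketch of replacing many values at once: put the code-to-word pairs in a dictionary and hand it to `.replace()`. The dataframe here is a toy one, `body_part` is an assumed column name, and only a few of the codes above are included.

```python
import pandas as pd

# A handful of the codes listed above; the full dictionary would hold all of them.
body_parts = {
    75: "head",
    76: "face",
    79: "lower trunk",
    87: "not recorded",
}

# Toy data standing in for the real injuries dataframe.
df = pd.DataFrame({"body_part": [75, 76, 75, 79, 87]})

# A dictionary passed to .replace() swaps every key for its value in one go.
df["body_part"] = df["body_part"].replace(body_parts)

counts = df["body_part"].value_counts()
print(counts)
```

Codes missing from the dictionary are left untouched by `.replace()`, which makes it easy to spot anything you forgot.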

## Confirm you have 58677 head, 30992 face, and 30579 lower trunk.

Isn't this much nicer?

## Graph the number of each body part, but don’t include the “Not Recorded” records

## "Not Recorded" seems silly - change it to be `NaN` instead

Don't use `na_values` for this.
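One way this could look, sketched on toy data and assuming the column already holds the lowercase text `'not recorded'` from the replacement step:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the cleaned body part column.
df = pd.DataFrame({"body_part": ["head", "not recorded", "face", "not recorded"]})

# Swap the placeholder text for a real missing value.
df["body_part"] = df["body_part"].replace("not recorded", np.nan)

# Count how many rows are now missing.
missing = df["body_part"].isna().sum()
print(missing)
```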

## Graph the count of each body part, but don’t include the “Not Recorded” records

Yes, again! The code you use should be different this time.
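To see why the code can differ, here is a small sketch with two toy Series: one still holds the text `'not recorded'`, the other already holds `NaN`. The first needs an explicit filter, while `value_counts()` leaves out `NaN` by default.

```python
import numpy as np
import pandas as pd

# Before the NaN change: the placeholder is ordinary text.
before = pd.Series(["head", "face", "not recorded", "head"])
# After the NaN change: the placeholder is a real missing value.
after = pd.Series(["head", "face", np.nan, "head"])

# Text placeholder: filter the rows out yourself.
counts_by_filter = before[before != "not recorded"].value_counts()

# NaN placeholder: value_counts() skips missing values unless dropna=False.
counts_by_default = after.value_counts()

print(counts_by_filter)
print(counts_by_default)
```

Either result can be turned into a chart by chaining `.plot(kind="barh")`, assuming matplotlib is installed.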

## For each body part, get the average age of the person who injured that part

Sort from youngest to oldest
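A rough sketch of the group-then-average pattern, using toy data and assumed column names `body_part` and `age`:

```python
import pandas as pd

# Toy data: two head injuries, two toe injuries, one face injury.
df = pd.DataFrame({
    "body_part": ["head", "head", "toe", "toe", "face"],
    "age": [10.0, 20.0, 40.0, 60.0, 5.0],
})

# Group by body part, average the ages, then sort ascending (youngest first).
average_age = df.groupby("body_part")["age"].mean().sort_values()
print(average_age)
```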

## Wait, "not recorded" seems really really really really old!

How can the average age be like 80???? Read page 6 of the documentation. Fix the issue however you think is best, but explain what you're doing. **It would be nice to talk about this in class if you'd write it on the board!**
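Here is one possible fix, sketched on toy data. It rests on an assumption you should check against the documentation yourself: that ages above 200 mean "200 plus the age in months" for children under two, and that an age of 0 means the age was not recorded. The column name `age` is also assumed.

```python
import numpy as np
import pandas as pd

# Toy ages: 218 and 206 stand for 18 and 6 months under the assumed coding.
df = pd.DataFrame({"age": [34.0, 218.0, 0.0, 206.0, 71.0]})

# Assumption: values above 200 are months, so convert them to years.
months = df["age"] > 200
df.loc[months, "age"] = (df.loc[months, "age"] - 200) / 12

# Assumption: 0 means "not recorded", so treat it as missing.
df.loc[df["age"] == 0, "age"] = np.nan

print(df)
```

Converting months to fractional years keeps the babies in the average. Dropping those rows is another defensible choice, as long as you explain it.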

### Tech tip you might find useful

If you want to replace values in a column based on a condition, pandas will probably yell at you. You get to learn this new thing called `loc` now!

```
df.loc[df.country == 'Angola', "continent"] = "Africa"
```

This updates the `continent` column to be `Africa` for every row where `country == 'Angola'`. You CANNOT do the following, which is probably what you've wanted to do, because the first set of brackets usually hands back a copy and the assignment never reaches `df`:

```
df[df.country == 'Angola']['continent'] = 'Africa'
```

And now you know.
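If you'd like to try the `loc` version on something small first, here is a self-contained sketch with a made-up dataframe:

```python
import pandas as pd

# Made-up data, just to have something to update.
df = pd.DataFrame({
    "country": ["Angola", "Peru", "Angola"],
    "continent": ["unknown", "South America", "unknown"],
})

# Row selector first, column name second, then assign.
df.loc[df.country == "Angola", "continent"] = "Africa"

print(df)
```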

# Finding injuries

## How many people were injured by "Musical instruments, electric or battery operated"?

Try to do what you think would work, then see... it doesn't work. There are all kinds of reasons why it wouldn't work. Keep reading once it doesn't work for you.

- Tip: `prod1` and `prod2` are the product fields
- Tip: You can use the codebook or `categories-cleaned.txt`

### Did something go wrong when you read in your data?

Maybe it's one of those problems like we had with `008382` in the homework, where when we read in the file it got rid of the leading zeroes? Try to read the file in again and fix that.
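As a reminder of how that problem shows up, here is a small sketch on made-up data (the `code` and `name` columns are hypothetical). Without a `dtype`, pandas guesses that the column is numeric and the zeroes vanish. Asking for `str` keeps them.

```python
import io

import pandas as pd

# Made-up tab-separated data with zero-padded codes.
raw = "code\tname\n008382\tfirst\n001234\tsecond\n"

# Default behavior: the column is parsed as integers, so leading zeroes are lost.
as_numbers = pd.read_csv(io.StringIO(raw), sep="\t")

# dtype={"code": str} keeps the column as text, exactly as written in the file.
as_text = pd.read_csv(io.StringIO(raw), sep="\t", dtype={"code": str})

print(as_numbers["code"].tolist())
print(as_text["code"].tolist())
```

For this exercise the same idea would apply to `prod1` and `prod2`.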

Try to get all of the "Musical instruments, electric or battery operated" injuries again. It still won't work. Keep reading.

### Looks like the data export is bad!

They turned `0565` into `565` when they exported it or something (so irresponsible!), and it's up to us to fix it. Pad the `prod1` and `prod2` columns to be the proper length. If you didn't actually do the task above it's going to be more difficult.
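A minimal sketch of the padding step, assuming the product codes were read in as strings and that the proper length is four characters (as in `0565`). The data is a toy sample.

```python
import numpy as np
import pandas as pd

# Toy data: string codes that lost their leading zeroes, with some missing prod2 values.
df = pd.DataFrame({
    "prod1": ["565", "1807", "76"],
    "prod2": [np.nan, "565", np.nan],
})

# .str.zfill(4) pads with zeroes on the left up to four characters;
# missing values stay missing.
df["prod1"] = df["prod1"].str.zfill(4)
df["prod2"] = df["prod2"].str.zfill(4)

print(df)
```

If the columns came in as numbers instead, `.str` will refuse to work until they are converted to strings, which is why the earlier re-read makes this step easier.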

## How many people were injured by musical instruments, total?

Include normal musical instruments, electric musical instruments and toy musical instruments.

- Tip: You can use the codebook or `categories-cleaned.txt`
- Tip: Answer this in one line **without** using `and` or `&`.
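One approach that fits the one-line tip is `.isin()`, sketched below on toy data. Apart from `0565`, which the text above mentions, the codes in `wanted` are placeholders: look up the real ones yourself.

```python
import pandas as pd

# Toy data standing in for the padded prod1 column.
df = pd.DataFrame({"prod1": ["0565", "0001", "1807", "0002", "0565"]})

# Placeholder codes (except 0565); replace with the ones from the codebook.
wanted = ["0565", "0001", "0002"]

# .isin() is True wherever the value matches any item in the list.
matches = df[df["prod1"].isin(wanted)]
print(len(matches))
```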

## Out of those three, which is the most popular reason for admission?

## What parts of the body are injured by musical instruments most often?

# Adding categories

## Read in `categories-cleaned.txt`

## How many different categories are electric/electrical/electronic?
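A regular expression can catch all three spellings at once. The sketch below uses made-up category names and a hypothetical `name` column. Check what the real file's columns are called after you read it in.

```python
import pandas as pd

# Made-up category names, only for illustration.
categories = pd.DataFrame({
    "name": [
        "Electric blankets",
        "Electrical wire",
        "Electronic toys",
        "Ladders",
        "Hammers",
    ],
})

# (?:...) groups the alternatives without capturing, which avoids the
# match-groups warning from str.contains; case=False ignores capitalization.
is_electric = categories["name"].str.contains(r"electr(?:ic|ical|onic)", case=False)

electric_count = is_electric.sum()
print(electric_count)
```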

## Join this with your injuries dataframe to give every row a text product code
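A rough sketch of the join on toy data. The `code` and `name` columns are hypothetical names for the categories file, and the second product name is a placeholder.

```python
import pandas as pd

# Toy injuries with padded string product codes.
injuries = pd.DataFrame({"prod1": ["0565", "1807", "0565"]})

# Toy categories; column names and the second product name are made up.
categories = pd.DataFrame({
    "code": ["0565", "1807"],
    "name": ["Musical instruments, electric or battery operated", "Example product"],
})

# how="left" keeps every injury row even if its code has no match.
# Both key columns need the same type (strings here) for the codes to line up.
merged = injuries.merge(categories, left_on="prod1", right_on="code", how="left")
print(merged)
```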

## How many different injuries involved an electric/electrical/electronic product?

## Graph the most common injuries involving an electrical product

## When people get injuries using an electrical product, what part of their body is injured?

## What product is most likely to injure your mouth?

## What product is most likely to injure your ears?