# Exercise: NEISS, Question Set C

#### Summary

The [National Electronic Injury Surveillance System](https://www.cpsc.gov/Safety-Education/Safety-Guides/General-Information/National-Electronic-Injury-Surveillance-System-NEISS) is a data product produced by the US Consumer Product Safety Commission. It tracks emergency room injuries related to consumer products (e.g., "a door fell on me!").

#### Files

- **neiss2017.tsv**: injury data (one injury per row)
- **2018-NEISS-CPSC-only-CodingManual.pdf**: column definitions and explanations
- **2017 NEISS Data Highlights.pdf**: a partial summary of the data
- **2018ComparabilityTable.pdf**: product code definitions
- **categories-cleaned.csv**: product code definitions in CSV format (great for joining!)

#### Source

https://www.cpsc.gov/Safety-Education/Safety-Guides/General-Information/National-Electronic-Injury-Surveillance-System-NEISS

#### Skills

- Reading in files
    - Reading tab-separated files
    - Reading in N/A values
    - Only reading in some of the data
- Replacing values
- Using strings
    - Searching for strings
    - Comparing to a list of strings
    - Regular expressions
- Using numpy/`np.nan`
- Averages practice
- Converting `.value_counts()` and similar results into DataFrames

### Before we start, import what you need to

# Read in `neiss2017.tsv`

Something's... weird about this one.
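As a minimal sketch of reading a tab-separated file, the example below uses a tiny in-memory stand-in rather than the real `neiss2017.tsv`. It assumes (the exercise does not say so) that the weirdness is a text-encoding problem and that Latin-1 is the right encoding. If you hit a `UnicodeDecodeError`, the `encoding` argument is one thing worth trying.

```python
import io
import pandas as pd

# Toy stand-in for neiss2017.tsv: tab-separated and Latin-1 encoded (an assumption)
raw = (
    "CPSC_Case_Number\tSex\tNarrative_1\n"
    "170100001\t1\tPT FELL AT CAFÉ\n"
    "170100002\t2\tDOOR FELL ON PT\n"
).encode("latin-1")

# sep="\t" splits on tabs; encoding tells pandas how to decode the bytes
df = pd.read_csv(io.BytesIO(raw), sep="\t", encoding="latin-1")
print(df.shape)
```

With the real file you would pass the filename instead of `io.BytesIO(raw)`. Adding `nrows=1000` reads only the first thousand rows, which is handy while you experiment.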

### Check that your dataframe has 386907 rows and 19 columns.

### List the columns and their data types

### I've selected a few columns. What do they mean?

The columns you are interested in are...

- `CPSC_Case_Number`
- `Sex`
- `Age`
- `Narrative_1`
- `Narrative_2`

You'll need to use the **coding manual**, `2018-NEISS-CPSC-only-CodingManual.pdf`. Column definitions are all in the first 30 pages or so. I recommend using the table of contents.

# Cleaning up a column

Take a look at the **sex** column. How many rows of each sex are there?

## Replace the numbers with the appropriate words they stand for.

Those numbers are terrible - codes are fine for storage but not really for reading. We want to **replace the numbers with the words they stand for.**
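One possible way to do this is `.replace` with a dictionary, sketched here on toy data. The code-to-word mapping below is an assumption for illustration; check each code against the coding manual before trusting it.

```python
import pandas as pd

# Toy data standing in for the real Sex column
df = pd.DataFrame({"Sex": [1, 2, 2, 0, 1]})

# Hypothetical mapping: confirm every code in the coding manual
sex_labels = {1: "Male", 2: "Female", 0: "Not Recorded"}

# Any code missing from the dictionary is left unchanged
df["Sex"] = df["Sex"].replace(sex_labels)

counts = df["Sex"].value_counts()
print(counts)
```

Because `.replace` leaves unmapped values alone, an unexpected code stays visible in `value_counts()` instead of silently turning into `NaN`.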

## Confirm you have 208695 male, 178203 female, and 8 "Not Recorded."

Uh, wait, there is also one patient with `8` as their sex.

### Look at only the row where `8` is the sex

## Let's drop that bad row

It looks like bad data! Maybe we can drop it based on `Treatment_Date` being null? 

### How many times is `Treatment_Date` empty?

### Drop the row where `Treatment_Date` is empty

### Confirm you have no more missing treatment dates
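The count, drop, and confirm steps could look roughly like this on a toy dataframe. The values are made up; only the `Treatment_Date` and `Sex` column names come from the exercise.

```python
import numpy as np
import pandas as pd

# Toy data: the last row mimics the bad record with no treatment date
df = pd.DataFrame({
    "Sex": ["Male", "Female", 8],
    "Treatment_Date": ["01/01/2017", "01/02/2017", np.nan],
})

# Count the missing dates before dropping anything
missing_before = df["Treatment_Date"].isna().sum()

# Drop only the rows where Treatment_Date is missing
df = df.dropna(subset=["Treatment_Date"])

# Count again to confirm none are left
missing_after = df["Treatment_Date"].isna().sum()
print(missing_before, missing_after)
```

Passing `subset=` matters: a bare `dropna()` would throw away every row that has a missing value in any column.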

## Graph the number of men and women, but don’t include the “Not Recorded” records

## "Not recorded" seems silly - change it to be `NaN` instead

If we've already talked about it, don't use `na_values` for this.
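A small sketch of swapping a placeholder string for `np.nan` after the data is already loaded, again on toy values:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the cleaned-up Sex column
df = pd.DataFrame({"Sex": ["Male", "Female", "Not Recorded", "Male"]})

# Swap the placeholder text for a real missing value
df["Sex"] = df["Sex"].replace("Not Recorded", np.nan)

# value_counts() skips NaN by default, so the placeholder drops out
counts = df["Sex"].value_counts()
print(counts)
```

Once the placeholder is a real `NaN`, most pandas summaries ignore it automatically, so you no longer need to filter it out by hand.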

## Graph the count of men and women, but don’t include the “Not Recorded” records

Yes, again! The code you use should be different this time.

# Finding injuries

## Find every instance where the narrative includes punching a wall

Include phrases like "punched a wall" or "punch wall" or "punched ten thousand walls." Do not type them each individually. How do you do that?????
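One approach is a regular expression with `.str.contains`. The sketch below uses invented narratives and a deliberately loose pattern; whether it is too loose or too strict for the real data is something to check by reading the matches.

```python
import pandas as pd

# Invented narratives; the real ones live in Narrative_1 (and maybe Narrative_2)
df = pd.DataFrame({"Narrative_1": [
    "PT PUNCHED A WALL AT HOME",
    "PUNCH WALL, HAND PAIN",
    "PUNCHED TEN THOUSAND WALLS",
    "FELL OFF LADDER",
    None,
]})

# "PUNCH", then anything at all, then "WALL"
pattern = r"PUNCH.*WALL"

# case=False ignores capitalization; na=False treats missing text as no match
is_wall_punch = df["Narrative_1"].str.contains(pattern, case=False, na=False)
wall_punchers = df[is_wall_punch]
print(len(wall_punchers))
```

The `.*` will also match unrelated text between the two words (say, "PUNCHED BROTHER, THEN FELL INTO WALL"), so tightening the pattern is a reasonable next step.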

## Graph the gender distribution of wall-punching.

## Find the average age of a wall-puncher.

Graph the distribution of the ages, too.

## Finding products

### What are the most popular products for wall punchers?

## Fix the product codes

### What does the product code `1884` stand for? How about `652`?

## Uh, wait, look at those product codes.

`652` shouldn't be possible, it should be `0652`.

### Why did pandas change it from `0652` to `652`?

### Can we fix it so when reading in the data it doesn't change that column?

### Or, well, can we fill in those missing zeroes?
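Both fixes can be sketched on a tiny in-memory file. The column name `Product_1` is an assumption here; use whatever your dataframe actually calls the product code column.

```python
import io
import pandas as pd

# Toy file; Product_1 is an assumed column name
tsv = "CPSC_Case_Number\tProduct_1\n170100001\t0652\n170100002\t1884\n"

# Default behavior: pandas sees digits, guesses integer, and 0652 becomes 652
as_numbers = pd.read_csv(io.StringIO(tsv), sep="\t")

# Fix 1: tell pandas up front to keep the column as text
as_text = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"Product_1": str})

# Fix 2: after the fact, convert to text and pad on the left to 4 characters
padded = as_numbers["Product_1"].astype(str).str.zfill(4)

print(as_numbers["Product_1"].tolist())
print(as_text["Product_1"].tolist())
print(padded.tolist())
```

Fix 1 is usually the safer habit for anything that is a code rather than a quantity (ZIP codes, product codes, IDs), since you never lose the original text in the first place.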

## Get meaningful names for "product code"

Go clean `categories-exported.txt` and save it as `categories-cleaned.csv`. I made another notebook for you!

When you're done, we'll use this to turn the codes into actual words.

### Read in `categories-cleaned.csv` and make sure it looks okay

**It probably doesn't.** Go back to the other notebook and work on it until it looks right.

## Merge together the two datasets

This will use `merge`, but it's really just like an SQL join. Is that exciting? I don't know.
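Here is a rough sketch of a left join with `merge`. Every column name except `CPSC_Case_Number` is hypothetical, and the product names are placeholders, not real code definitions.

```python
import pandas as pd

# Toy injuries table; Product_1 is an assumed column name
injuries = pd.DataFrame({
    "CPSC_Case_Number": [1, 2, 3],
    "Product_1": ["0652", "1884", "9999"],
})

# Toy lookup table; column names and product names are placeholders
categories = pd.DataFrame({
    "product_code": ["0652", "1884"],
    "product_name": ["hypothetical product A", "hypothetical product B"],
})

# how="left" keeps every injury, even when its code has no match
merged = injuries.merge(
    categories, how="left", left_on="Product_1", right_on="product_code"
)
print(merged)
```

Both key columns need the same data type. If one side holds the integer `652` and the other the string `"0652"`, nothing will match, so counting missing names with `merged["product_name"].isna().sum()` is a quick sanity check.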

### Confirm that it worked by searching for every injury involving a `Christmas tree`

## Graph the top 30 most popular products for injuries 

## Graph the top 30 most popular products for injuries for men

## Graph the top 30 most popular products for injuries for women
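All three graphs follow the same pattern: optionally filter, count, keep the top N, plot. The sketch below shows that pattern on toy data, using a top 2 instead of a top 30, and also turns the counts into a DataFrame. The `product_name` column and its values are assumptions.

```python
import pandas as pd

# Toy merged data; product_name is an assumed column name
merged = pd.DataFrame({
    "Sex": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "product_name": ["stairs", "beds", "stairs", "knives", "stairs", "beds"],
})

# Count injuries per product and keep the most common ones
top = merged["product_name"].value_counts().head(2)

# Turn the counts into a regular two-column DataFrame
top_df = top.rename_axis("product_name").reset_index(name="injuries")

# Same idea for one group: filter first, then count
men = merged[merged["Sex"] == "Male"]
top_men = men["product_name"].value_counts().head(2)

print(top_df)
```

With matplotlib installed, `top.sort_values().plot(kind="barh")` draws a horizontal bar chart with the biggest bar on top. For the real exercise, swap `head(2)` for `head(30)`.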