# Wrangling and Analyzing Homicide Data with Pandas

Below we'll use some fake data -- generated by ChatGPT for learning purposes -- on city populations and murder counts to practice basic data wrangling and analysis skills using the **pandas** library.

The skills we'll cover include:

- Creating a DataFrame
- Slicing and dicing DataFrames (aka [indexing](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing))
- Adding new columns based on a calculation
- Merging DataFrames
- Aggregating data
- Sorting and Filtering

In [None]:
cities = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose']
population = [3926219, 6411233, 3272371, 3003751, 6621623, 1609186, 2705762, 2235335, 5075959, 4356045]
murders_by_city = [219, 476, 127, 239, 220, 497, 498, 413, 311, 398]

In [None]:
import pandas as pd

Let's assume that our population data comes from one source (e.g. the Census) and our data on homicides comes from another (e.g. the FBI UCR crime stats).

Below, we'll create two DataFrames using our fabricated data: one containing the population of each city, and the other the count of murders. 

You'll notice that we're using the `cities` list in both DataFrames. The city name will serve as a unique identifier common to both data sets, which is important when it comes time to join the data sets.

In [None]:
pops = pd.DataFrame({'cities': cities, 'pop': population})
murders = pd.DataFrame({'cities': cities, 'murders': murders_by_city})

Now let's merge those DataFrames.

In [None]:
df = pops.merge(murders, on='cities', how='inner')
df

**Task**: Verify that the join didn't cause records to be dropped by comparing the size of the original DataFrames with the merged DataFrame.

## Slicing and dicing a DataFrame

There is a head-spinning variety of ways to use row position, column names, or a combination of both to select a subset of data in a DataFrame.

While you may not use all of these techniques on a regular basis, you should at least be aware of them since you'll see them in other people's code. 

Below is a brief overview of some of the more common techniques.

### `.loc`

Select a single value by index (aka row) and column. 

> Note that we start counting at 0 and we use square brackets (NOT parentheses).

In [None]:
df.loc[0, 'cities']

Select a range of rows and a single column.

> Note that, unlike Python lists, the end position is "inclusive". In other words, the row at the end position will be included in the result.

In [None]:
df.loc[0:1, 'cities']

Select a range of rows and multiple columns.

In [None]:
df.loc[2:5, ['cities', 'pop']]

Select a range of rows and **all** columns.

> Note the `:` tells `loc` to include all columns.

In [None]:
df.loc[2:3, :]

### `.iloc`

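The `.iloc` indexer on a DataFrame can be used to grab entire rows, including all columns, simply by using a range of row positions. This is similar to plain old `.loc`, but it selects by integer position rather than by label. As with `.loc`, the second argument for columns is optional; leaving it out returns all columns.

The `i` in `iloc` stands for integer: rows are selected by their integer position, starting from 0. Unlike `.loc`, the end position of an `.iloc` slice is **not** inclusive, so `0:2` returns rows 0 and 1.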

In [None]:
df.iloc[0:2]

### Square brackets selection

Yet another way to select rows, with arguably the simplest syntax, is to simply use the square brackets `[]` on a DataFrame, similar to how we filter the DataFrame.

> Note that, like `.iloc` but unlike `.loc`, the second number (`2`) is **not** inclusive, so we only get the results in rows 0 and 1.

In [None]:
df[0:2]
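To make the inclusive versus exclusive behavior concrete, here is a minimal sketch that slices the same rows three ways. The `demo` DataFrame and its values are made up for illustration and are not part of our murder data.

```python
import pandas as pd

# A tiny, made-up DataFrame with the default 0..3 row index
demo = pd.DataFrame({'cities': ['A', 'B', 'C', 'D'], 'pop': [10, 20, 30, 40]})

by_label = demo.loc[0:2]      # label-based: the end label (2) IS included
by_position = demo.iloc[0:2]  # position-based: the end position (2) is NOT included
by_brackets = demo[0:2]       # plain brackets: the end position (2) is NOT included

print(len(by_label), len(by_position), len(by_brackets))  # 3 2 2
```

Only `.loc` returns three rows here, because it treats `0:2` as labels rather than positions.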

## Adding new columns

It's often handy to apply functions to a column, for example to generate a new column. To do so, we use the `apply` method on a particular column.

Below, we apply Python's built-in `str.upper` method to all values in the `cities` column.


In [None]:
df.cities.apply(str.upper)

Note that the original values haven't changed.

In [None]:
df

To retain those upper-cased city names, we can assign the results of the `apply` function to a new column using the square brackets syntax: `df['name_of_new_field']`.

Take special note that we're **not** actually calling the function or method that is passed to `apply`. You can tell because we're not ending the name with parentheses containing an argument. Normally, we'd "call" the function or method like so:

```python
# Use parens to call "upper" and pass in a string, 
# since that's what this method expects

str.upper("this is all lower case")
```

We're simply passing `str.upper` to `apply`, which in turn...well...*applies* it to each value in the column. The `apply` method is actually calling the `str.upper` method for us behind the scenes.

In [None]:
df['cities_upper'] = df.cities.apply(str.upper)
df

### Sidenote

We might apply other clean-ups or standardizations to such a column in the real world, to ensure that the values are identical in both data sets. This is important when we want to join two datasets, which requires each dataset to contain one or more columns that can be used to match rows between them. 

*Ensuring consistency in the values we're using to join datasets is critical to avoid losing records during the merge process.*

In this simple exercise, that shouldn't be a problem because we'll use the same list of `cities` (created at the top of this notebook) in both DataFrames.
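As a rough illustration of how inconsistent values can silently drop records, the sketch below merges two made-up DataFrames in which one city name differs only by letter case. The `left` and `right` names and all of the numbers are hypothetical. It uses the `indicator=True` option of `merge`, which adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Made-up data: 'DALLAS' is upper-cased in one source but not the other
left = pd.DataFrame({'cities': ['Chicago', 'Houston', 'DALLAS'], 'pop': [100, 200, 300]})
right = pd.DataFrame({'cities': ['Chicago', 'Houston', 'Dallas'], 'murders': [1, 2, 3]})

# An inner join keeps only exact matches, so Dallas is lost
inner = left.merge(right, on='cities', how='inner')
print(len(inner))  # 2

# An outer join with indicator=True reveals the rows that failed to match
check = left.merge(right, on='cities', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']
print(unmatched)

# Standardizing the key column before merging recovers the record
left['cities'] = left['cities'].apply(str.title)
fixed = left.merge(right, on='cities', how='inner')
print(len(fixed))  # 3
```

Title-casing happens to work for this toy example; real city names may need more careful clean-up.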

### Dropping a column

So let's just remove the new `cities_upper` column to keep our dataframe tidy.

> Note that we have to specify the `axis=1` argument. This tells the DataFrame to search for the label by column, rather than by index (ie row), which is the default behavior. We also use the `inplace=True` argument to modify the original dataframe, rather than returning a copy without the `cities_upper` column.

In [None]:
df.drop('cities_upper', axis=1, inplace=True)

In [None]:
df

## Calculating a per capita rate

Now that we're comfortable creating new columns, let's try calculating the murder rate for this data. We'll pretend we're working with data from 2023, so we'll call the new field `rate23`. This will help us differentiate this rate from another identical calculation later on, when we merge data from a prior year.

You can once again use `apply` to calculate a rate for each row. 

Due to the low number of murders relative to the overall population, we'll use a larger "multiplier" in our calculation (100,000 instead of 1,000). That should produce a more human-friendly number (typically above 1) as a result.

Below is one way to perform the calculation. 

A few important items to note:

- The `axis=1` argument, which tells pandas to apply the calculation to each row
- The use of a [lambda](https://docs.python.org/3/reference/expressions.html), which is a way of writing an anonymous, inline function -- i.e. one defined without a name, rather than with the traditional `def my_function_name():` syntax. Lambdas are handy for short snippets of code. If the logic or math were significantly more complex, we might want to create a custom function and then pass that into `apply` instead.
- It might be hard to tell from the lambda definition, but once again, we're not actually calling the lambda function. `apply` is doing that for us.

In [None]:
df.apply(lambda row: (row['murders'] / row['pop']) * 100_000, axis=1)
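For comparison, here is a sketch of the same calculation using a named function instead of a lambda. The function name `per_100k` and the `toy` DataFrame are made up for illustration; the point is only that `apply` accepts any function that takes a row.

```python
import pandas as pd

def per_100k(row):
    """Return the number of murders per 100,000 residents for a single row."""
    return row['murders'] / row['pop'] * 100_000

# Made-up numbers, just to exercise the function
toy = pd.DataFrame({
    'cities': ['A', 'B'],
    'pop': [200_000, 1_000_000],
    'murders': [10, 20],
})

# Pass the function itself (no parentheses); apply calls it once per row
rates = toy.apply(per_100k, axis=1)
print(rates)
```

A named function like this is easier to test and reuse than a lambda once the logic grows beyond a single expression.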

The above works, but there's a simpler and faster way to generate this new column that is especially well-suited to short bits of math/logic. You can simply select the columns from the DataFrame and use standard math operators such as division and multiplication. This can have several benefits:

- The code is more *readable*.
- The operation is more *efficient* (i.e. faster) because, under the hood, the DataFrame uses "vectorized" operations that work on whole columns at once, rather than applying a custom function to each row.

In [None]:
df['murders'] / df['pop'] * 100000

The above appears to generate the same answers as our `apply` approach earlier, so let's go ahead and re-run the calculation, this time storing it in a new DataFrame column called `rate23`.

In [None]:
df['rate23'] = df['murders'] / df['pop'] * 100000
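Rather than eyeballing the two outputs, one possible way to confirm that the `apply` approach and the column-math approach agree is to compare them programmatically. The sketch below uses made-up numbers in a `toy` DataFrame and NumPy's `allclose`, which tolerates tiny floating-point differences.

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration only
toy = pd.DataFrame({
    'pop': [200_000, 1_000_000, 3_500_000],
    'murders': [10, 20, 70],
})

# Row-by-row calculation with apply
via_apply = toy.apply(lambda row: row['murders'] / row['pop'] * 100_000, axis=1)

# Whole-column ("vectorized") calculation
via_columns = toy['murders'] / toy['pop'] * 100_000

# True if every pair of values matches within floating-point tolerance
same = np.allclose(via_apply, via_columns)
print(same)
```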

## Sorting by rate

At this point, we can use some simple sorting to find the city with the highest murder rate in 2023: Philadelphia.

In [None]:
df.sort_values('rate23', ascending=False)

## Filtering and sorting

We can also add some filtering to find the cities with murder rates greater than or equal to 10 per 100,000 people.

> Note, we'll filter first, then apply sorting. You could do this in reverse (ie sort first), but the sorting operation will be quicker on a smaller data set than on the entire data set. The difference in speed becomes much more noticeable as datasets increase in size.

In [None]:
# Let's create a filter for our DataFrame first
rate_filter = df.rate23 >= 10.0

In [None]:
# Then apply the filter, followed by the sorting
df[rate_filter].sort_values('rate23', ascending=False)
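If it's unclear what the filter variable actually holds, the sketch below shows the idea on a made-up `toy` DataFrame: the comparison produces one `True` or `False` per row, and passing that to the square brackets keeps only the `True` rows.

```python
import pandas as pd

# Made-up rates for illustration only
toy = pd.DataFrame({'cities': ['A', 'B', 'C'], 'rate23': [12.5, 4.0, 30.0]})

# The comparison returns a Series of booleans, one per row
mask = toy['rate23'] >= 10.0
print(mask.tolist())  # [True, False, True]

# Rows where the mask is True are kept, then sorted from highest to lowest
result = toy[mask].sort_values('rate23', ascending=False)
print(result['cities'].tolist())  # ['C', 'A']
```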

## Adding another year to the mix

Let's say we have additional data for the year 2022. *Again, this is totally fake population and homicide data generated for the purposes of the exercise.*

We can rinse and repeat the above steps to calculate the homicide rate for 2022.

We'll reuse the original `cities` variable defined at the beginning of our notebook, and simply add new population and homicide numbers.

In [None]:
pop2 = [3732063, 6334621, 3759166, 2825857, 6936845, 1690307, 3111121, 1956014, 5423644, 4561616]
murders2 = [284, 536, 36, 240, 201, 553, 539, 503, 232, 363]

df2 = pd.DataFrame({'cities': cities, 'pop': pop2, 'murders': murders2})       

Now we're ready to calculate the murder rate for 2022 using the new DataFrame `df2`.

In [None]:
df2['rate22'] = df2['murders'] / df2['pop'] * 100000

In [None]:
df2

## Exercises

Repeat the basic analyses we performed on the original 2023 data. Specifically
 
- Sort the 2022 DataFrame to find the city with the highest murder rate
- Filter *and* sort the DataFrame to find cities with a murder rate greater than or equal to 10 per 100,000 people

## Calculating the rate of change over years

We now have two dataframes showing the murder rates for these cities in 2022 and 2023. 

To find which cities had the largest increase or decrease in homicide rates over these years, we can merge these datasets and perform a rate change calculation.


### Joining data

Step 1 is joining these DataFrames on a common "key" or column -- in this case `cities`.

Note that because we have columns with identical names across these data sets (here, `pop` and `murders`), pandas automatically appends the suffixes `_x` and `_y` to allow us to tell them apart. 

The `_x` suffix is assigned to columns from the DataFrame on the left side of the `.merge` method (ie the DataFrame that we're merging into); the `_y` suffix is applied to columns from the DataFrame passed into the `.merge` method (ie the one that is being merged in).

In [None]:
df_merged = df.merge(df2, on='cities')
df_merged
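The default suffixes can be hard to interpret later on. As a sketch of one alternative, `merge` accepts a `suffixes` argument that lets you choose more descriptive labels. The `y23` and `y22` DataFrames and their values below are made up for illustration.

```python
import pandas as pd

# Made-up data: both DataFrames share a 'pop' column
y23 = pd.DataFrame({'cities': ['A', 'B'], 'pop': [100, 200], 'rate23': [5.0, 7.0]})
y22 = pd.DataFrame({'cities': ['A', 'B'], 'pop': [90, 210], 'rate22': [6.0, 6.5]})

# Default behavior: overlapping column names get _x (left) and _y (right)
default = y23.merge(y22, on='cities')
print(list(default.columns))  # ['cities', 'pop_x', 'rate23', 'pop_y', 'rate22']

# Custom suffixes make the source year explicit
labeled = y23.merge(y22, on='cities', suffixes=('_23', '_22'))
print(list(labeled.columns))  # ['cities', 'pop_23', 'rate23', 'pop_22', 'rate22']
```

Columns that appear in only one DataFrame, such as `rate23` and `rate22`, keep their original names.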

### Rate of Change

After joining our data, we can add a new column containing the rate of change between 2022 and 2023.

As a refresher, that calculation involves:

- Subtracting the original rate from the new rate
- Dividing the result by the original rate
- Multiplying the result by 100 to produce a percent change

In simple terms: `((New - Old) / Old) * 100`.

In [None]:
df_merged['rate_change'] = (df_merged['rate23'] - df_merged['rate22'] ) / df_merged['rate22'] * 100
df_merged
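As a quick sanity check on the percent change arithmetic, here is a small worked example in plain Python. The function name `pct_change` and the rates (8 and 10) are made up for illustration. It also shows why the parentheses around the subtraction matter: without them, Python performs the division first.

```python
def pct_change(old, new):
    """Return the percent change from old to new."""
    return (new - old) / old * 100

old_rate = 8.0   # hypothetical rate in the earlier year
new_rate = 10.0  # hypothetical rate in the later year

correct = pct_change(old_rate, new_rate)
print(correct)  # 25.0

# Dropping the inner parentheses divides old by old first (8 / 8 = 1),
# then subtracts that from new, which gives a meaningless result
wrong = (new_rate - old_rate / old_rate) * 100
print(wrong)  # 900.0
```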