Welcome to your DataCamp project audition! This notebook must be filled out and vetted before a contract can be signed and you can start creating your project.

The first step is forking the repository in which this notebook lives. After that, there are two parts to be completed in this notebook:

- **Project information**:  The title of the project, a project description, etc.

- **Project introduction**: The first three text and code cells that will form the introduction of your project.

When complete, please email the link to your forked repo to projects@datacamp.com with the email subject line _DataCamp project audition_. If you have any questions, please reach out to projects@datacamp.com.

# Project information

**Project title**: Trends in Maryland Crime

**Name:** Richard Erickson

**Email address associated with your DataCamp account:** raerickson@gmail.com

**Project description**: 

Is the violent crime rate in Maryland increasing, decreasing, or staying the same? During this project, you will find out. First, wrangle the raw data supplied by the State of Maryland. Second, analyze the data using a hierarchical regression to examine both the statewide crime rate and the crime rate for each Maryland county. Third, plot the changes in crime rates for each county.

Like much data science work, this project requires both R and statistical skills. For R, this includes proficiency with the Tidyverse, including `ggplot2`. These skills are taught in DataCamp courses such as [Introduction to the Tidyverse](https://www.datacamp.com/courses/introduction-to-the-tidyverse), [Data Manipulation in R with dplyr](https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial), or [Data Visualization with ggplot2 (Part 1)](https://www.datacamp.com/courses/data-visualization-with-ggplot2-1). For statistics, the relevant courses are [Hierarchical and Mixed Effects Models](https://www.datacamp.com/courses/hierarchical-and-mixed-effects-models) and [Multiple and logistic regression](https://www.datacamp.com/courses/multiple-and-logistic-regression).

You will use the [crime statistics](http://goccp.maryland.gov/crime-statistics/) from the State of Maryland. The Maryland Statistical Analysis Center provides and updates these data. For the project, you will analyze data from 1975 to 2016.


# Project introduction

***Note: nothing needs to be filled out in this cell. It is simply setting up the template cells below.***

The final output of a DataCamp project looks like a blog post: pairs of text and code cells that tell a story about data. The text is written from the perspective of the data analyst and *not* from the perspective of an instructor on DataCamp. So, for this blog post intro, all you need to do is pretend like you're writing a blog post -- forget the part about instructors and students.

Below you'll see the structure of a DataCamp project: a series of "tasks" where each task consists of a title, a **single** text cell, and a **single** code cell. There are 8-12 tasks in a project and each task can have up to 10 lines of code. What you need to do:
1. Read through the template structure.
2. As best you can, divide your project as it is currently visualized in your mind into tasks.
3. Fill out the template structure for the first three tasks of your project.

As you are completing each task, you may wish to consult the project notebook format in our [documentation](https://instructor-support.datacamp.com/projects/datacamp-projects-jupyter-notebook). Only the `@context` and `@solution` cells are relevant to this audition.

## 1. Importing and inspecting Maryland crime data

Crime rates change through time. The reasons for these changes, and even the trends themselves, can be difficult to detect. Data journalists at sites like [Freakonomics](http://freakonomics.com/2005/05/15/abortion-and-crime-who-should-you-believe/) and [FiveThirtyEight](https://fivethirtyeight.com/features/why-we-cant-be-sure-if-violent-crime-is-on-the-rise/) discuss possible reasons and trends. Regression analysis allows us to estimate linear trends. However, some datasets are hierarchical or nested, which presents a regression challenge.

This includes many government statistics, such as crime rates. For example, counties exist within most US states (Alaska has "boroughs" and Louisiana has "parishes"). Counties and county-level governments vary even within the same state: one county might be urban with a high population density, whereas another might be rural with a low population density.

Hierarchical modeling allows us to capture and model this hierarchy. This figure shows the counties and their populations in Maryland. You will create a similar figure for their crime trends at the end of this project.

![Map of Maryland counties and population](./img/countyPop.jpg)

However, before running a regression analysis, data often need to be cleaned so that they are easy to work with in R. This can include renaming and reformatting columns. [Working with Dates and Times in R](https://www.datacamp.com/courses/working-with-dates-and-times-in-r) provides more details on handling dates and times.
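
As a minimal sketch of the kind of date reformatting described here, the snippet below parses a single timestamp stored as text and keeps only its year. The value in `raw_year` is a hypothetical stand-in, not a value copied from the Maryland file, and it assumes the column stores month/day/year timestamps.

```r
library(lubridate)

# Hypothetical raw value: a month/day/year timestamp stored as text
raw_year <- "01/01/1975 00:00:00"

# mdy_hms() parses the text into a date-time object
parsed <- mdy_hms(raw_year)

# year() extracts the calendar year as a number
year_only <- year(parsed)
year_only
```

Keeping only the year is enough for this analysis because the data are reported once per year for each county.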


In [None]:
# Load the tidyverse
#install.packages('tidyverse', repos='http://cran.us.r-project.org')
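library(tidyverse)
# lubridate supplies mdy_hms() and year(); tidyverse versions before 2.0.0 do not attach it
library(lubridate)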

# Read in the crime data
crime_raw <- read_csv("Violent_Crime___Property_Crime_by_County__1975_to_Present.csv")

# select columns JURISDICTION, YEAR, POPULATION, and VIOLENT CRIME RATE PER 100,000 PEOPLE. 
# Rename the last to be crime_rate 
crime_use <- crime_raw %>% 
    select(JURISDICTION, YEAR, POPULATION, crime_rate = `VIOLENT CRIME RATE PER 100,000 PEOPLE`)

# mutate YEAR to be a date (with mdy_hms()) and then extract the year (with year())
crime_use <- crime_use %>% 
    mutate(YEAR = year(mdy_hms(YEAR)))

# Summarize the data for each county: number of rows and mean crime rate
crime_use %>%
    group_by(JURISDICTION) %>%
    summarize(n_years = n(), mean_crime_rate = mean(crime_rate, na.rm = TRUE))


## 2. Plot the raw data with trend lines

We have now loaded our data into R. Before running a regression or building a model, I like to visualize the data. First, we plot the raw data and change the theme. Next, we add a trend line for each county with `stat_smooth()` using a linear regression (`method = 'lm'`). We also hide the uncertainty bands around the regression lines by setting `se = FALSE`, which makes the plot easier to read.

In [6]:
# Plot the data using ggplot2, drawing one line per county (JURISDICTION)
raw_plot <- 
    ggplot(crime_use, aes(x = YEAR, y = crime_rate,
                          group = JURISDICTION)) + 
    geom_line() +
    theme_minimal()

# Add a linear regression line for each county, without uncertainty bands
raw_plot + stat_smooth(method = 'lm', se = FALSE)

## 3. Building a lmer()

Now, we can build a hierarchical model, also known as a linear mixed-effects regression (`lmer()`). `lmer()` uses syntax similar to `lm()`, but also requires a random effect. For example, `y` predicted by fixed-effect slope `x` and random-effect intercept `group` would be `y ~ x + (1|group)`. `x` can also be included as a random-effect slope: `y ~ x + (x|group)`. [Hierarchical and Mixed Effects Models](https://www.datacamp.com/courses/hierarchical-and-mixed-effects-models) covers these models in greater detail.
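
To make the two formulas concrete, here is a small sketch that fits both forms to `sleepstudy`, an example dataset that ships with `lme4`. It is not the crime data; `Reaction`, `Days`, and `Subject` play the roles of `y`, `x`, and `group`.

```r
library(lme4)

# Random intercept only: every subject gets its own baseline,
# but all subjects share one slope for Days
fit_intercept <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Random intercept and random slope: each subject also gets its own Days slope
fit_slope <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

# Fixed effects: the population-level intercept and slope
fixef(fit_slope)

# Subject-level coefficients: fixed effects plus each subject's deviation
head(coef(fit_slope)$Subject)
```

The second form is the one that matters for this project, because it lets each group have its own trend while still estimating an overall trend.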

In our case, we are interested in whether `YEAR` predicts `crime_rate` as both a fixed-effect and a random-effect slope across counties (`JURISDICTION`). However, if we try to run the model with the raw data, we get a warning message. We will need to rescale the `YEAR` variable and then run the model again.

In [None]:
# Load lmerTest, which also loads lme4 and provides lmer()
library(lmerTest)

# Run lmer() on the raw data; notice the warning message
lmer(crime_rate ~ YEAR + (YEAR|JURISDICTION) - 1, crime_use)

# Create a second year column, YEAR2, that starts at 0 in the first year of data
crime_use <-
  crime_use %>%
  mutate(YEAR2 = YEAR - min(YEAR))

# Run lmer() on the mutated data
lmer_out <- lmer(crime_rate ~ YEAR2 + (YEAR2|JURISDICTION) - 1, crime_use)

*Stop here! Only the first three tasks. :)*