# Section 2
You should have received a US_Arrest_Data.zip file containing 8 datasets in .csv format and 3 data dictionaries in .docx format. You are not required to use every dataset in this assessment.

In [None]:
## Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm  # the `api` module exposes OLS, add_constant, etc.
import os

In [None]:
## DEFINE DATA PATH OF INTEREST
US_ARREST_DATA_DIR = os.path.join(os.getcwd(), "data", "US_Arrest_Data")

## Question 2.1
Categorise / group the states based on their characteristics.

Datasets to consider:
1. **US Population data** (Time series of 19 values)
2. **US State related data**:
    - *USstatedivision*: Factor giving state divisions (New England, Middle Atlantic, South Atlantic, East South Central, West South Central, East North Central, West North Central, Mountain, and Pacific).
    - *USstateregion*: Factor giving the region (Northeast, South, North Central, West) that each state belongs to.
    - *USstatex77*: Matrix with 50 rows and 8 columns giving the following statistics:
    <br/>
    <br/>
    
    |Column|Description|
    |---|---|
    |Population|Population estimate as of July 1, 1975|
    |Income|Per capita income (1974)|
    |Illiteracy|Illiteracy (1970, percent of population)|
    |Life Exp|Life expectancy in years (1969–71)|
    |Murder|Murder and non-negligent manslaughter rate per 100,000 population (1976)|
    |HS Grad|Percent high-school graduates (1970)|
    |Frost|Mean number of days with minimum temperature below freezing (1931–1960) in capital or large city|
    |Area|Land area in square miles|

    According to page 24 of Chapter 6 of the United States Census Bureau document available at https://www2.census.gov/geo/pdfs/reference/GARM/Ch6GARM.pdf, and noting that the data date from the 1970s, each US region can be further segmented into divisions as follows:

    |Region|Division|
    |---|---|
    |Northeast|New England, Middle Atlantic|
    |North Central|East North Central, West North Central|
    |South|South Atlantic, East South Central, West South Central|
    |West|Mountain, Pacific|

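The region-to-division table above can be expressed as a lookup and used to sanity-check the two factor datasets against each other. The following is a minimal sketch: the helper name `find_inconsistent_states` and the `state`, `division` and `region` column names are assumptions for illustration, and the three example rows stand in for the real files, whose layout should be confirmed after loading.

```python
import pandas as pd

# Region -> divisions, transcribed from the table above
# (region labels as listed for USstateregion).
REGION_TO_DIVISIONS = {
    "Northeast": ["New England", "Middle Atlantic"],
    "North Central": ["East North Central", "West North Central"],
    "South": ["South Atlantic", "East South Central", "West South Central"],
    "West": ["Mountain", "Pacific"],
}

# Invert the lookup so that each division points to its parent region.
DIVISION_TO_REGION = {
    division: region
    for region, divisions in REGION_TO_DIVISIONS.items()
    for division in divisions
}


def find_inconsistent_states(states: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose region disagrees with the region implied by the division.

    Assumes hypothetical columns named 'division' and 'region'.
    Rows with an unknown division are also returned.
    """
    implied = states["division"].map(DIVISION_TO_REGION)
    return states[implied != states["region"]]


# Illustrative rows only; the real values come from the two CSV files.
example = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Connecticut"],
        "division": ["East South Central", "Pacific", "New England"],
        "region": ["South", "West", "Northeast"],
    }
)
mismatches = find_inconsistent_states(example)
print(len(mismatches))
```

An empty result suggests the two factors agree; any returned rows point to a labelling or join problem worth checking before grouping the states.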

Datasets not considered:
- *USstateabb*: Character vector of 2-letter abbreviations for the state names. Reason: This is already included in the *USstatex77* dataset.
- *USstatearea*: Numeric vector of state areas (in square miles). Reason: The same data is included in the *USstatex77* dataset, and only an identifier is provided instead of the actual state name.
- *USstatecenter*: List with components named x and y giving the approximate geographic center of each state in negative longitude and latitude. Alaska and Hawaii are placed just off the West Coast. Reason: Geographic coordinates are not a useful feature for categorisation, since such data cannot be meaningfully compared between states.


In [None]:
# Required data files
US_POP_DATA_PATH = os.path.join(US_ARREST_DATA_DIR, "USPop.csv")
US_STATE_DIVISION_PATH = os.path.join(US_ARREST_DATA_DIR, "USstatedivision.csv")
US_STATE_REGION_PATH = os.path.join(US_ARREST_DATA_DIR, "USstateregion.csv")
US_STATE_MATRIX_PATH = os.path.join(US_ARREST_DATA_DIR, "USstatex77.csv")
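
One possible way to group the states is to standardise the numeric columns of *USstatex77* and apply a clustering method. The sketch below uses Ward hierarchical clustering from SciPy, which is an assumption and not something the assessment prescribes. The helper name `cluster_states`, the choice of four clusters (mirroring the four regions) and the randomly generated placeholder values are also assumptions; the placeholder frame should be replaced by the data loaded from `US_STATE_MATRIX_PATH`.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_states(features: pd.DataFrame, n_clusters: int = 4) -> pd.Series:
    """Assign each row (state) to a cluster using Ward linkage on z-scores."""
    # Standardise so large-scale columns (e.g. Area) do not dominate distances.
    z_scores = (features - features.mean()) / features.std(ddof=0)
    tree = linkage(z_scores.to_numpy(), method="ward")
    # Cut the tree so that at most n_clusters groups remain.
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return pd.Series(labels, index=features.index, name="cluster")


# Placeholder values only: random numbers under the USstatex77 column names.
rng = np.random.default_rng(0)
columns = ["Population", "Income", "Illiteracy", "Life Exp",
           "Murder", "HS Grad", "Frost", "Area"]
placeholder = pd.DataFrame(rng.normal(size=(50, 8)), columns=columns)

clusters = cluster_states(placeholder)
print(clusters.value_counts().sort_index())
```

The resulting labels can then be cross-tabulated against *USstateregion* or *USstatedivision* (for example with `pd.crosstab`) to see how far the data-driven groups agree with the geographic ones.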

## Question 2.2
What factors are statistically significant in predicting Assault rates?

Side note:
- Statistical significance indicates that an effect observed in a sample is unlikely to be the product of chance alone. A statistically significant result provides evidence, at the chosen significance level, that the effect also exists in the population.

- We can fit a LinearRegression model on the regression dataset and retrieve its coef_ attribute, which contains the coefficient found for each input variable.

- These coefficients can provide the basis for a crude feature importance score. This assumes that the input variables have the same scale or have been scaled prior to fitting the model.

Dataset: USArrest.csv (main). You may wish to use the other datasets to enhance your study.

Dictionary for US Arrest data: a data frame with 50 observations on 4 variables.

|Column|Type|Description|
|---|---|---|
|Murder|numeric|Murder arrests (per 100,000)|
|Assault|numeric|Assault arrests (per 100,000)|
|UrbanPop|numeric|Percent urban population|
|Rape|numeric|Rape arrests (per 100,000)|



In [None]:
# Required data file
arrest_data_file = os.path.join(US_ARREST_DATA_DIR, "USArrest.csv")
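
The crude importance score described in the side note can be sketched with scikit-learn: scale the inputs, fit `LinearRegression`, and read its `coef_` attribute. The helper name `coefficient_importance` and the simulated placeholder data are assumptions for illustration. Coefficient size alone is not a significance test, so this complements the p-values and does not replace them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def coefficient_importance(data: pd.DataFrame, target: str) -> pd.Series:
    """Rank predictors by absolute coefficient after standardising the inputs."""
    features = data.drop(columns=[target])
    # Put every predictor on the same scale so coefficients are comparable.
    scaled = StandardScaler().fit_transform(features)
    model = LinearRegression().fit(scaled, data[target])
    coefficients = pd.Series(model.coef_, index=features.columns)
    # Sort by magnitude; the sign only gives the direction of the effect.
    order = coefficients.abs().sort_values(ascending=False).index
    return coefficients.reindex(order)


# Simulated placeholder data using the USArrest column names (not real values).
rng = np.random.default_rng(0)
n = 50
murder = rng.normal(8, 4, n)
placeholder = pd.DataFrame({
    "Murder": murder,
    "UrbanPop": rng.normal(65, 14, n),
    "Rape": rng.normal(21, 9, n),
    # Assault is constructed to depend strongly on Murder only.
    "Assault": 50 + 15 * murder + rng.normal(0, 10, n),
})

importance = coefficient_importance(placeholder, "Assault")
print(importance)
```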