# Assignment 1
### Understanding Uncertainty
### Due 9/5

Sophie Kim

1. Create a new public repo on GitHub under your account. Include a README file.
2. Clone it to your machine. Put this file into that repo.
3. Use the following function to download the example data for the course:

In [1]:
def download_data(force=False):
    """Download and extract course data from Zenodo."""
    import urllib.request, zipfile, os
    
    zip_path = 'data.zip'
    data_dir = 'data'
    
    if not os.path.exists(zip_path) or force:
        print("Downloading course data")
        urllib.request.urlretrieve(
            'https://zenodo.org/records/16954427/files/data.zip?download=1',
            zip_path
        )
        print("Download complete")
    else:
        print("Download file already exists")
        
    if not os.path.exists(data_dir) or force:
        print("Extracting data files...")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(data_dir)
        print("Data extracted")
    else:
        print("Data directory already exists")

download_data()

Download file already exists
Data directory already exists


4. Open one of the datasets using Pandas:
    1. `ames_prices.csv`: Housing characteristics and prices
    2. `college_completion.csv`: Public, nonprofit, and for-profit educational institutions, graduation rates, and financial aid
    3. `ForeignGifts_edu.csv`: Monetary and in-kind transfers from foreign entities to U.S. educational institutions
    4. `iowa.csv`: Liquor sales in Iowa, at the transaction level
    5. `metabric.csv`: Cancer patient and outcome data
    6. `mn_police_use_of_force.csv`: Records of physical altercations between Minnesota police and private citizens
    7. `nhanes_data_17_18.csv`: National Health and Nutrition Examination Survey
    8. `tuna.csv`: Yellowfin Tuna Genome (I don't recommend this one; it's just a sequence of G, C, A, T)
    9. `va_procurement.csv`: Public spending by the state of Virginia

In [2]:
import pandas as pd
import numpy as np

In [3]:
mn_police_data = pd.read_csv("data/mn_police_use_of_force.csv")

5. Pick two or three variables and briefly analyze them
    - Is it a categorical or numeric variable?
    - How many missing values are there? (`df['var'].isna()` and `np.sum()`)
    - If categorical, tabulate the values (`df['var'].value_counts()`) and if numeric, get a summary (`df['var'].describe()`)
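
The three checks above can be bundled into one small helper. This is only a sketch: `summarize_variable` and the `toy` data frame are hypothetical names made up for illustration, and the toy values do not come from the course files. It assumes that any column pandas reports as a numeric dtype should be summarized with `describe()` and everything else with `value_counts()`.

```python
import numpy as np
import pandas as pd


def summarize_variable(df, var):
    """Return (kind, n_missing, summary) for one column of df."""
    col = df[var]
    # Count missing values, as in the hint: isna() followed by np.sum()
    n_missing = int(np.sum(col.isna()))
    if pd.api.types.is_numeric_dtype(col):
        # Numeric: count, mean, std, min, quartiles, max
        return "numeric", n_missing, col.describe()
    # Otherwise treat as categorical: frequency of each non-missing value
    return "categorical", n_missing, col.value_counts()


# Hypothetical toy data, not taken from the course files
toy = pd.DataFrame({
    "age": [20.0, 27.0, np.nan, 31.0],
    "sex": ["Male", "Female", "Female", None],
})

for name in toy.columns:
    kind, n_missing, summary = summarize_variable(toy, name)
    print(f"{name}: {kind}, {n_missing} missing")
    print(summary)
```

Note that a numeric dtype does not always mean a numeric variable: a column such as `precinct` is stored as integers but is arguably a category, so the dtype check is a starting point rather than a final answer.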

In [4]:
# looking at data
mn_police_data

Unnamed: 0,response_datetime,problem,is_911_call,primary_offense,subject_injury,force_type,force_type_action,race,sex,age,type_resistance,precinct,neighborhood
0,2016/01/01 00:47:36,Assault in Progress,Yes,DASLT1,,Bodily Force,Body Weight to Pin,Black,Male,20.0,Tensed,1,Downtown East
1,2016/01/01 02:19:34,Fight,No,DISCON,,Chemical Irritant,Personal Mace,Black,Female,27.0,Verbal Non-Compliance,1,Downtown West
2,2016/01/01 02:19:34,Fight,No,DISCON,,Chemical Irritant,Personal Mace,White,Female,23.0,Verbal Non-Compliance,1,Downtown West
3,2016/01/01 02:28:48,Fight,No,PRIORI,,Chemical Irritant,Crowd Control Mace,Black,Male,20.0,Commission of Crime,1,Downtown West
4,2016/01/01 02:28:48,Fight,No,PRIORI,,Chemical Irritant,Crowd Control Mace,Black,Male,20.0,Commission of Crime,1,Downtown West
...,...,...,...,...,...,...,...,...,...,...,...,...,...
12920,2021/08/30 21:38:46,Assault in Progress,Yes,ASLT5,,Bodily Force,Joint Lock,White,Female,69.0,,1,Loring Park
12921,2021/08/30 22:32:22,Unwanted Person,Yes,CIC,,Bodily Force,Joint Lock,,,,,1,Cedar Riverside
12922,2021/08/31 12:03:08,Overdose w/All,Yes,FORCE,,Bodily Force,Body Weight Pin,Black,Male,,,3,Seward
12923,2021/08/31 12:52:52,Attempt Pick-Up,No,WT,,Bodily Force,Body Weight Pin,Black,Male,31.0,,4,Camden Industrial


In [6]:
mn_police_data["age"].describe()

count    11859.000000
mean        29.484527
std         10.987780
min          0.000000
25%         22.000000
50%         28.000000
75%         35.000000
max         82.000000
Name: age, dtype: float64

In [7]:
na_police = mn_police_data['age'].isna() 
np.sum(na_police)

np.int64(1066)

**Age variable**

The age variable is numeric. It has 1,066 missing values, leaving 11,859 recorded ages. The mean age is about 29.5 years, with a standard deviation of about 11 years. The median age is 28; 25% of recorded ages are at or below 22 and 75% are at or below 35. Recorded ages range from 0 to 82.
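
The `describe()` output above reports a minimum age of 0, which may be worth a second look before relying on the mean. The sketch below uses made-up ages (not the course data) to show one way to measure the share of missing values and to see how much the mean moves if zeros are treated as missing. Treating 0 as a placeholder is an assumption here, not something the dataset documents.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, not taken from the course files
ages = pd.Series([20.0, 0.0, 27.0, np.nan, 82.0, 31.0], name="age")

# Fraction of rows with no recorded age
missing_share = ages.isna().mean()

# Number of rows recorded as exactly 0
n_zero = int((ages == 0).sum())

# Assumption: treat 0 as missing, then compare the two means
cleaned = ages.where(ages != 0)

print(f"missing share: {missing_share:.3f}")
print(f"zeros: {n_zero}")
print(f"mean with zeros: {ages.mean():.1f}")
print(f"mean without zeros: {cleaned.mean():.1f}")
```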

In [8]:
mn_police_data['force_type_action'].value_counts()

force_type_action
Body Weight Pin                                      2651
Body Weight to Pin                                   1711
Joint Lock                                           1578
Takedown                                             1330
Crowd Control Mace                                    899
Push Away                                             828
Personal Mace                                         669
Firing Darts                                          426
Red Dot                                               353
Display                                               349
Punches                                               288
Punch                                                 231
Knees                                                 208
Knee                                                  203
Side Recovery Position                                149
Touch                                                 141
Kicks                                                 

In [9]:
force_action_police = mn_police_data['force_type_action'].isna()
np.sum(force_action_police)

np.int64(0)

**Force type action variable**

The force type action variable is categorical, and it has 0 missing values. The most common action was "Body Weight Pin" (2,651 records), followed by "Body Weight to Pin" (1,711 records). The three least common actions were jabs, vehicle, and crowd control techniques.
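
Several labels in the `value_counts()` output look like variants of the same action ("Body Weight Pin" and "Body Weight to Pin", "Punches" and "Punch", "Knees" and "Knee"). If they do mean the same thing, the counts could be combined before ranking. The sketch below shows one way to do that with `Series.replace`. The `canonical` mapping is an assumption made for illustration, and the sample labels are a toy series rather than the real column; the dataset's documentation would need to confirm that these labels are truly equivalent.

```python
import pandas as pd

# Hypothetical sample of labels, not the real column
actions = pd.Series(
    ["Body Weight Pin", "Body Weight to Pin", "Punches", "Punch",
     "Knee", "Knees", "Joint Lock"],
    name="force_type_action",
)

# Assumed mapping from a variant label to one canonical label
canonical = {
    "Body Weight to Pin": "Body Weight Pin",
    "Punches": "Punch",
    "Knees": "Knee",
}

# Exact-match replacement; labels not in the mapping are left unchanged
merged = actions.replace(canonical)
print(merged.value_counts())
```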

6. What are some questions and prediction tools you could create using these data? Who would the stakeholder be for that prediction tool? What practical or ethical questions would it create? What other data would you want, that are not available in your data?

***Questions and prediction tools that I could create with this data are:***
1. *What is the likelihood of a certain crime being committed in a specific neighborhood?*
2. *Which race is most likely to encounter police force?*
3. *Which type of problem is most commonly reported through a 911 call?*


*Since these questions all involve predicting categorical variables, I think a contingency table would be a helpful tool for making the predictions.*
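
As a rough illustration of the contingency-table idea for question 3, `pd.crosstab` can tabulate problem type against call source. The rows below are invented for the example and are not taken from the course files; only the column names `is_911_call` and `problem` match the dataset. Picking the most frequent problem in each row is one very simple prediction rule, offered here as a sketch rather than a recommended model.

```python
import pandas as pd

# Hypothetical rows, not taken from the course files
toy = pd.DataFrame({
    "is_911_call": ["Yes", "No", "No", "Yes", "Yes", "No"],
    "problem": ["Assault in Progress", "Fight", "Fight",
                "Unwanted Person", "Assault in Progress", "Attempt Pick-Up"],
})

# Raw counts of each problem type by call source
counts = pd.crosstab(toy["is_911_call"], toy["problem"])

# Row-normalized: share of each problem within 911 and non-911 calls
shares = pd.crosstab(toy["is_911_call"], toy["problem"], normalize="index")

# Simple prediction rule: the most frequent problem for each call source
most_likely = shares.idxmax(axis=1)

print(counts)
print(shares.round(2))
print(most_likely)
```

The same pattern would apply to the other two questions by swapping in `neighborhood` and `primary_offense`, or `race` and `force_type`, though a table of counts alone cannot say anything about rates without population data for comparison.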

***The stakeholders for these questions:***
1. *Residents of each neighborhood, who could use it to stay aware of which crimes are most common, for public safety reasons.*
2. *Police departments and the general public, since this information could help prevent racial bias in policing and raise public awareness.*
3. *911 dispatchers, who could use it to develop a more streamlined question-and-answer process for gathering information from callers efficiently.*

***Practical or ethical questions:***

1. *This question could help people who are moving to an area see which crimes are most likely to occur in that neighborhood.*
2. *This question could raise many further ethical questions about racial bias in policing. Practically, findings from it could be incorporated into a training module for police.*
3. *As mentioned above, this question could raise practical questions for 911 dispatchers about how to streamline calls and get information from callers as quickly as possible.*

7. Commit your work to the repo (`git commit -am 'Finish assignment'` at the command line, or use the Git panel in VS Code). Push your work back to GitHub and submit the link on Canvas in the assignment tab.