## Final Assignment


Before working on this assignment, please read these instructions fully. Use Blackboard to submit a link to your repository.

On Blackboard you will find the assessment criteria. Please familiarize yourself with the criteria before beginning the assignment.

This assignment requires you to find at least two related datasets on the web and to build an application that visualizes these datasets to answer a research question within the broad topic of **health** or **agriculture** in the **region where you were born**. The region can be a city, a town, or a province.

The research question should be causal in nature. For instance: how does independent variable X influence dependent variable Y?

The code should be efficient. Identify the most critical part of your code and write a software test for it. Take the performance of the data processing into account.

### About the data

You can merge these datasets with data from different regions if you like. For instance, you might want to compare the health effects of earthquakes in Groningen with those in Los Angeles, USA.

You are welcome to choose datasets at your discretion, but keep in mind that they will be shared with others, so choose appropriate datasets. You are also welcome to use datasets of your own, but at least two datasets must come from the web and/or APIs.

You are also welcome to keep data in its original language, but for grading purposes you should provide English translations in your visualization.

### Instructions:

Define a research question, select data, and code your data acquisition, data processing, data analysis, and visualization. Write code to test the most critical parts. Use a repository with a commit strategy and write a README file.

Write a small document with the following:
- State the region and the domain category that your data sets are about 
- State the research question 
- Justify the chosen data storage and processing approach
- Justify the chosen analysis approach
- Justify the chosen data visualization approach

Upload your document and the link to your repository to Blackboard.

In [125]:
import pandas as pd
import numpy as np
import yaml


In [126]:
# Make pandas show all columns (the row limit is left at its default).
pd.set_option('display.max_columns', None)

In [127]:
def get_config():
    """ 
    Read in config file and return it as a dictionary. 
    """
    with open("C:/Users/rie12/Desktop/config.yaml", 'r') as stream:
        config = yaml.safe_load(stream)
        return config
    
path = get_config()['programming1']

In [128]:
crime_stat = pd.read_csv(path + '/safety_actual.csv', sep=';')
safe_per = pd.read_csv(path + '/safety_perceiv.csv', sep=';')

# Metadata for the perceived-safety file; it can also be used for the crime file.
meta_data_safe = pd.read_csv(path + '/81881NED_metadata.csv', sep=';', skiprows=1)
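The paths in the cell above are built by string concatenation. As a minimal sketch of one alternative, the same loading step could go through `pathlib`, which also makes it easy to fail early when a file is missing. The helper name `load_cbs_csv` and the temporary demo file are illustrative assumptions and not part of the original notebook.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_cbs_csv(data_dir, filename, **read_kwargs):
    """Read a semicolon-separated CSV from data_dir (hypothetical helper)."""
    file_path = Path(data_dir) / filename
    if not file_path.is_file():
        raise FileNotFoundError(f"Expected data file not found: {file_path}")
    return pd.read_csv(file_path, sep=';', **read_kwargs)


# Demo with a throwaway file so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / 'demo.csv'
    demo_file.write_text('ID;Title\n1;Totaal personen\n', encoding='utf-8')
    demo = load_cbs_csv(tmp_dir, 'demo.csv')
```

With the real data, a call such as `load_cbs_csv(path, '81881NED_metadata.csv', skiprows=1)` would pass `skiprows` through to `pd.read_csv`.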

In [129]:
# Select the ID and title of the personal characteristics needed for both datasets.
# Rows 51 to 92 are taken by position.
meta_data_safe_persoon = meta_data_safe[['ID', 'Title']].iloc[51:93]

In [130]:
meta_data_safe_persoon

Unnamed: 0,ID,Title
51,1012600,Migratieachtergrond: Nederland
52,2012655,Migratieachtergrond: westers
53,2012657,Migratieachtergrond: niet-westers
54,2018700,Onderwijsniveau: laag onderwijs
55,2018710,Onderwijsniveau: basisonderwijs
56,2018720,"Onderwijsniveau: vmbo,mbo1,avo onderbouw"
57,2018740,Onderwijsniveau: middelbaar onderwijs
58,2018750,"Onderwijsniveau: havo, vwo, mbo"
59,2018790,Onderwijsniveau: hoog onderwijs
60,2018800,"Onderwijsniveau: hbo, wo bachelor"


In [131]:
# merging the safety and the meta data
safe_per = safe_per.merge(meta_data_safe_persoon, how='inner', right_on='ID',left_on='Persoonskenmerken')
# merging the crime and meta data
crime_stat = crime_stat.merge(meta_data_safe_persoon, how='inner', right_on='ID',left_on='Persoonskenmerken')
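An inner merge silently drops rows whose key has no match in the metadata. The sketch below shows one way to make that visible, using the `validate` and `indicator` parameters of `DataFrame.merge`. The two small frames are made-up stand-ins for the real tables, and whether the real data contains unmatched codes is not known from the notebook.

```python
import pandas as pd

# Toy stand-ins for the real tables (made-up values, for illustration only).
survey = pd.DataFrame({
    'Persoonskenmerken': [10, 20, 30],
    'value': [1.0, 2.0, 3.0],
})
labels = pd.DataFrame({
    'ID': [10, 20],
    'Title': ['Totaal personen', 'Leeftijd: 15 tot 25 jaar'],
})

checked = survey.merge(
    labels,
    how='left',
    left_on='Persoonskenmerken',
    right_on='ID',
    validate='many_to_one',  # raises MergeError if an ID occurs twice in labels
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Rows whose code has no label; an inner merge would have dropped these silently.
unmatched = checked[checked['_merge'] == 'left_only']
# Equivalent to the inner-merge result once the unmatched rows are set aside.
merged = checked[checked['_merge'] == 'both'].drop(columns='_merge')
```

Inspecting `unmatched` before discarding it helps confirm that only the intended characteristics are filtered out.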

In [132]:
# Keep only the age-group rows and the overall total, since age is the characteristic of interest.
safe_per_age = safe_per[safe_per['Title'].str.contains('Leeftijd|Totaal', na=False)]
safe_per_age = safe_per_age[safe_per_age['Marges'].str.contains('MW00000', na=False)]
# Keep only the data from 2019.
safe_per_age_19 = safe_per_age[safe_per_age['Perioden'].str.contains('2019', na=False)]


# Apply the same filters to the crime statistics file.
crime_stat_age = crime_stat[crime_stat['Title'].str.contains('Leeftijd|Totaal', na=False)]
crime_stat_age = crime_stat_age[crime_stat_age['Marges'].str.contains('MW00000', na=False)]

crime_stat_age_19 = crime_stat_age[crime_stat_age['Perioden'].str.contains('2019', na=False)]
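The same three filters are applied to both tables, which makes this step a natural candidate for the "most critical part" that the assignment asks to test. The following is a minimal sketch under that assumption: the function name `filter_age_rows`, its parameters, and the toy frame (including the `'2019JJ00'` period strings and the `'OTHER'` margin code) are hypothetical and chosen only to exercise the logic.

```python
import pandas as pd


def filter_age_rows(df, year='2019', margin_code='MW00000'):
    """Keep age-group and total rows for one margin code and one year."""
    is_age_or_total = df['Title'].str.contains('Leeftijd|Totaal', na=False)
    is_margin = df['Marges'].str.contains(margin_code, na=False, regex=False)
    is_year = df['Perioden'].str.contains(year, na=False, regex=False)
    return df[is_age_or_total & is_margin & is_year]


# Tiny made-up frame: only the first row satisfies all three conditions.
toy = pd.DataFrame({
    'Title': ['Totaal personen', 'Leeftijd: 15 tot 25 jaar',
              'Onderwijsniveau: hoog onderwijs', 'Leeftijd: 25 tot 45 jaar'],
    'Marges': ['MW00000', 'MW00000', 'MW00000', 'OTHER'],
    'Perioden': ['2019JJ00', '2018JJ00', '2019JJ00', '2019JJ00'],
})

result = filter_age_rows(toy)
assert result['Title'].tolist() == ['Totaal personen']
```

With such a helper, `filter_age_rows(safe_per)` and `filter_age_rows(crime_stat)` would replace the duplicated filter lines, and the assertion could move into a test file run by a test runner such as `pytest`.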

In [146]:
# Crime table: keep the label plus the offence columns of interest.
crime_filter = crime_stat_age_19[['Title','AantalDelicten_1', 'PogingTotInbraak_38', 'Bedreiging_30', 'Mishandeling_31', 'GeweldMetSeksueleBedoelingen_32','PogingTotZakkenrollerij_66', 'Zakkenrollerij_67', 'PogingTotBeroving_68']]
# Perceived-safety table: keep the label plus the feeling-unsafe and fear-of-crime columns.
safe_filter = safe_per_age_19[['Title','VoeltZichWelEensOnveilig_1','VoeltZichVaakOnveilig_2','VanZakkenrollerij_3','VanBerovingOpStraat_4','VanInbraakInWoning_5','VanMishandeling_6']]

In [147]:
safe_filter

Unnamed: 0,Title,VoeltZichWelEensOnveilig_1,VoeltZichVaakOnveilig_2,VanZakkenrollerij_3,VanBerovingOpStraat_4,VanInbraakInWoning_5,VanMishandeling_6
6,Totaal personen,31.8,1.4,2.6,1.7,7.8,1.9
48,Leeftijd: 15 tot 25 jaar,40.1,1.8,3.8,1.8,7.2,2.6
62,Leeftijd: 25 tot 45 jaar,36.0,1.5,2.9,1.8,9.5,2.4
76,Leeftijd: 45 tot 65 jaar,30.3,1.3,2.2,1.6,8.0,1.7
90,Leeftijd: 65 jaar of ouder,23.3,1.0,2.0,1.4,5.8,1.1
104,Leeftijd: 15 tot 18 jaar,36.9,1.4,3.4,1.4,5.0,2.1
118,Leeftijd: 18 tot 25 jaar,41.8,2.1,4.0,2.0,8.2,2.9
132,Leeftijd: 25 tot 35 jaar,37.7,1.7,3.1,2.0,8.9,2.7
146,Leeftijd: 35 tot 45 jaar,34.1,1.3,2.6,1.6,10.1,2.1
160,Leeftijd: 45 tot 55 jaar,31.5,1.3,2.2,1.6,8.4,1.8
