# <img style="float: left; padding-right: 10px; width: 200px" src="https://raw.githubusercontent.com/trivikverma/researchgroup/master/assets/media/logo.png"> EPA-122A Introduction to *Spatial* Data Science 


## Assignment 1: Data Collection and Wrangling


---



# ``Instructions``

This assignment puts together what you learned in **Weeks 1-2**. You will be working with a dataset in the form of a spreadsheet. Its columns may contain many different data types. All data frames contain column names, which are strings, and row indices, which are integers. In this assignment you will demonstrate your knowledge of bundling various kinds of data together so that you can carry out higher-level tasks.

_Note:_ Go through **labs and homeworks 00-02** before starting this assignment. 

#### 1.1 Submission

Please submit the results via Brightspace under **Assignment 01** as a single file, named as in the example below:

```text
firstname_secondname_thirdname_lastname_01.html

```

**If your file is not named in lowercase letters as mentioned above, your assignment will not be read by the script that works to compile > 200 assignments and you will miss out on the grades. I don't want that, so be exceptionally careful that you name it properly. Don't worry if you spelled your name incorrectly. I want to avoid a situation where I have 200 assignments all called assignment_01.html**

Please **do not** submit any data or files other than the ``html file``.

#### 1.2 How do you convert to HTML? 

There are 2 ways, 

1. from a running notebook, you can convert it into html by clicking on the file tab on the main menu of Jupyter Lab 
    * File &rightarrow; Export Notebooks as... &rightarrow; Export Notebook to HTML
2. go to terminal or command line and type
    * ``jupyter nbconvert --to html <notebook_name>.ipynb  ``


#### 1.3 Learning Objectives

This assignment is designed to support three different learning objectives. After completing the following exercises you will be able to:

* Explore variables in a dataset
* Manage missing data 
* Reshape data to get it in a form useful for statistical analysis 

#### 1.4 Tasks

This assignment requires you to go through five tasks in cleaning your data. 

1. Reading and summarizing the data.
2. Subsetting the Data. This extracts just the part of the data you want to analyse. 
3. Manage Missing Data. Some data is not available for all objects of interest (rows) or all variables for every object (columns). 
4. Shape the Data. We need to convert the data into a suitable format for analysis. 
5. Saving the Results. The results are saved for future use.

<br/>

***

# ``Critical Data Science``

Throughout the assignment, we encourage you to critically reflect on your choices during the Data Science process. To help you set up your Critical Data Science process, we have provided you with a 'Guide on Critical Data Science'. Section 3.2 contains a step-by-step approach with key considerations for each part of the data science process. The guide can be found here: https://trivikverma.github.io/spatial-data-science/resources_index.html#a-guide-to-critical-data-science. Throughout these exercises, you will also find several questions which you can use to reflect on your data science choices. 

<br/>

***

# ``Task 1: Downloading the Data``

For this assignment we are going to use the Den Haag cijfers database as a source of data. The database allows us to investigate a range of socio-economic variables for residents in The Hague. The data consists of time series, which in some cases date back to the 1970s. 

You can download the data here as a csv file (You will have to choose the variables and file format yourself):
https://denhaag.incijfers.nl/jive 

Put the data in a convenient location on your computer or laptop, ideally in a folder called **data** which is next to this **jupyter notebook**. I recommend taking a look at the file in a text editor like _zed_ (any system) or Notepad++ (Windows). These will also make your life easier for everything else on your computer. It’s a big file and it may take a while to load onto your laptop and into Python (running in the Jupyter Lab environment). 

## ``Exercise: Downloading the Data``

**IMPORTANT** make sure your code can run independent of the machine. i.e. 
- Use relative path links instead of absolute paths. If your data folder is named C:/HelloKitty/MyGummyBears/IlovePython/DenHaagData.csv, then your program will not be reproducible on any other machine. Check out this very easy to follow and handy guide on [relative paths](https://www.delftstack.com/howto/python/relative-path-in-python/).
- Organise the data in a folder called `data` and run your notebook next to it organised as follows

```text
├── trivik_verma_01.ipynb
├── data
│   ├── DenHaagData.csv
```

- Load the csv file into Python
- Explore it by looking at first and last 5 rows
- Programmatically find and print information on the data,
    - number of columns in the data
    - names of the columns in the data 
    - number of rows in the data (excluding the header names)
    - how many unique neighborhoods in the data
    - how many unique variables are in the data
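
The steps above can be sketched as follows. This is only an illustration under stated assumptions: the column names `neighbourhood_code`, `neighbourhood`, `variable`, `year` and `value`, and the toy rows in `SAMPLE_CSV`, are hypothetical stand-ins and will differ from your actual export. Your file may also use a different delimiter, in which case `pd.read_csv` needs a matching `sep` argument.

```python
import io
from pathlib import Path

import pandas as pd

# Relative path: resolved against the folder the notebook runs in,
# so it works on any machine that has the same folder layout.
DATA_PATH = Path("data") / "DenHaagData.csv"

# Hypothetical toy data in long form; NOT real Den Haag figures.
SAMPLE_CSV = """neighbourhood_code,neighbourhood,variable,year,value
N01,Area A,population,2020,1000
N01,Area A,income,2020,25.5
N02,Area B,population,2020,800
N02,Area B,income,2020,
N03,Area C,population,2020,1200
N03,Area C,income,2020,21.0
"""


def load_data(path=DATA_PATH):
    """Read the CSV at `path` if it exists, otherwise fall back to the toy sample."""
    if Path(path).exists():
        print(f"Reading {path}")
        return pd.read_csv(path)
    print("File not found, using the toy sample instead")
    return pd.read_csv(io.StringIO(SAMPLE_CSV))


def summarise(df, id_col="neighbourhood_code", var_col="variable"):
    """Collect basic facts about a long-form data frame."""
    n_rows, n_columns = df.shape  # the header row is not counted
    return {
        "n_columns": n_columns,
        "column_names": list(df.columns),
        "n_rows": n_rows,
        "n_neighbourhoods": df[id_col].nunique(),
        "n_variables": df[var_col].nunique(),
    }


toy = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(toy.head())  # first 5 rows
print(toy.tail())  # last 5 rows
for key, value in summarise(toy).items():
    print(f"{key}: {value}")
```

With your own file, `load_data()` followed by `summarise(...)` with your actual column names would give the same kind of overview.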

In [None]:
# your code here
# use many cells if you like to structure your code well


<br/>

***

# ``Task 2: Hypothesis formulation``

Think about your focus of analysis. What do you want to find out? A first step is to think about what variables might be related to each other and what these relationships could look like. Think also about a causal theory. Why do you expect this relationship to exist and to take on a specific form?

To give you an example from another dataset (this is a dataset of the World Bank: http://data.worldbank.org/data-catalog/world-development-indicators), have a look at the hypothesis below. I chose an analysis focus and then made some assumptions about the relationships between the focus variables. These assumptions are what I want to test as a formal hypothesis later on.

```text
My hypothesis
I’d like to examine world broadband access. For that reason I chose a broadband account variable. The data is organized by country. I want to control for the wealth, population, and land area of the country. I also have a hypothesis that more urban countries are more likely to have good broadband services. There are economies of scale when providing services to a large city. 

I hypothesize that larger countries have less access, since it is expensive to provide access over larger areas. On the contrary, countries with a lot of urban land area can take advantage of economies of scale resulting in relatively more broadband users concentrated in smaller zones within the country. I also hypothesize that wealthier countries have better broadband access, since there is a larger market to provide the newest services. A final variable which I add is rail lines. I hypothesize that broadband lines can take advantage of existing infrastructure right-of-ways, of which rail is a surrogate measure. Furthermore, the presence of rail lines may indicate other factors including a geography which is conducive to physical development, and favourable institutional factors which promote high technology development. 

My choice of variables were, 

| Variable Name                 | Variable Code     |
| ----------------------------- | ----------------- |
| Fixed broadband subscriptions | IT.NET.BBND       |
| GDP (current US$)             | NY.GDP.MKTP.CD    |
| Population, total             | SP.POP.TOTL       |
| Land area (sq. km)            | AG.LND.TOTL.K2    |
| Urban land area (sq. km)      | AG.LND.TOTL.UR.K2 |
| Rail lines (total route-km)   | IS.RRS.TOTL.KM    |

```

Formulate a working hypothesis. Provide a list of variables you need to include in your hypothesis. Explain what your dependent variable is. Explain your choice of independent and control variables. Would you wish to include any variables that are not available? Are there any variables (proxies) present with which you could replace these missing variables? Would that have an effect on your outcomes? Think about bias in the data. Why do you expect to find the relationship that you are hypothesising? Can you think of any cases in which this hypothesis will be rejected? Why? Can you think of any cases in which the hypothesis will be falsely rejected or falsely accepted? Think about bias in the data and in your own reasoning. 

## ``Exercise: Hypothesising``

- State your hypothesis in a markdown cell as I showed in the example above (there is no single right hypothesis, you are free to make a **reasonable** choice for this task)
- Find the variables of interest for your hypothesis and mention them in the markdown cell (4-7 variables)
- Explain what your dependent variable is. Explain your choice of independent and control variables.
- Would you wish to include any variables that are not available? Are there any variables (proxies) present with which you could replace these missing variables? Would that have an effect on your outcomes? Think about bias in the data. Explain your reasoning.
- Can you think of any cases in which this hypothesis will be rejected? Why?
- Can you think of any cases in which the hypothesis will be falsely rejected or falsely accepted? Think about bias in the data and in your own reasoning.
- Can you briefly reflect on whether there are any potentially important perspectives that you are not taking into account by choosing this hypothesis?

In [None]:
# your code here
# use many cells if you like to structure your code well



<br/>

***

# ``Task 2: Subsetting the Data``




From now on we want to work with a subset of the data that is relevant to your study focus. Choose **4 to 7 variables** from your dataset for further exploratory statistics (more information will follow later in the exercises). Think carefully about how you want to address the neighbourhoods. Does it make sense to build a new ID or should you use the provided IDs? Remember that it is important to work with codes as opposed to names to make our analyses more reproducible across time. For now let’s set aside the added complexity of time series and dynamics. Our task is to select just one year. 


## ``Exercise: Subsetting the Data``

- Subset your dataframe based on your focus of analysis and your choice of variables. 
- Your dataframe should now be much smaller and look neater. Show us what it looks like using head() or something similar
    - show some statistics like number of rows, columns, names of variables and unique neighbourhoods, etc.
    
You can count the columns manually, but in a large data set like this it is accurate and convenient to let python calculate this for us. Get the index of relevant columns and store them in a variable. 


- You will also want to experiment to see which year you want to use. How likely do you think it is that your results will change if you choose a different year for your analysis? 
- Do you have enough data/ variables to sufficiently explore your hypothesis? Is something missing? Why? Should you drop variables from your data? Why?
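
A minimal sketch of this kind of subsetting is shown below, assuming the data is in long form with one row per neighbourhood, variable and year. The column names, variable names and values are hypothetical toy data, so adapt them to your own export.

```python
import pandas as pd

# Hypothetical long-form toy data; NOT real Den Haag figures.
long_df = pd.DataFrame({
    "neighbourhood_code": ["N01", "N01", "N01", "N02", "N02", "N02", "N01", "N02"],
    "variable": ["population", "income", "households",
                 "population", "income", "households",
                 "population", "population"],
    "year": [2020, 2020, 2020, 2020, 2020, 2020, 2019, 2019],
    "value": [1000.0, 25.5, 480.0, 800.0, 30.1, 350.0, 990.0, 790.0],
})

# How many non-missing values exist per year? Useful when picking a year.
coverage_per_year = long_df.groupby("year")["value"].count()
print(coverage_per_year)

# Keep only the chosen variables and the chosen year.
chosen_variables = ["population", "income"]
chosen_year = 2020
mask = long_df["variable"].isin(chosen_variables) & (long_df["year"] == chosen_year)
subset = long_df.loc[mask].reset_index(drop=True)

# Let pandas look up column positions instead of counting by hand.
column_positions = long_df.columns.get_indexer(["neighbourhood_code", "value"])

print(subset.head())
print("rows, columns:", subset.shape)
print("variables:", sorted(subset["variable"].unique()))
print("unique neighbourhoods:", subset["neighbourhood_code"].nunique())
print("column positions:", list(column_positions))
```

Filtering on the neighbourhood code rather than the name follows the advice above about working with codes for reproducibility.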


In [None]:
# your code here
# use many cells if you like to structure your code well


<br/>

***

# ``Task 3: Manage Missing Data``

Datasets often have missing data. Not all variables are as well documented for every year as others. There are a number of ways in general of dealing with missing data. These involve

1. Dropping cases (rows) in the data with any missing values
2. Excluding variables in the data with any missing data 
3. Selectively choosing indicators with only a limited amount of missing data
4. Replacing missing values with averages, or other representative values
5. Creating a separate model to predict missing data

In this assignment we are going to use a number of these strategies. We can certainly drop cases (strategy one). I am loath to drop whole indicators. But we can, for example, choose a year for the indicator where most of the data is available (strategy three).

Building a separate model to impute missing data is often a good idea. But that requires a first working model before we even consider building a missing data model (and we haven't gotten there yet in this course); the working model and the missing data model are often constructed together. Note also that there are packages in Python which will construct a model of your data, and then impute missing values for you. You may or may not find these functions and packages fully appropriate for modelling your data. Therefore treat these missing data models very seriously, and not as a black box. Models of missing data are just as important, and deserve just as much care and caution, as any other statistical model.  
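
The first four strategies listed above can be sketched with pandas as follows. This is an illustration only: the wide-form toy table, its column names and the 50% threshold in strategy three are assumptions made for the example, not recommendations for your analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-form toy data with gaps; NOT real Den Haag figures.
wide = pd.DataFrame(
    {
        "population": [1000.0, 800.0, np.nan, 1200.0],
        "income": [25.5, np.nan, 28.0, 21.0],
        "households": [np.nan, np.nan, np.nan, 500.0],
    },
    index=pd.Index(["N01", "N02", "N03", "N04"], name="neighbourhood_code"),
)

# Explore first: count and share of missing values per variable.
missing_per_column = wide.isna().sum()
missing_share = wide.isna().mean()
print(missing_per_column)
print(missing_share)

# Strategy 1: drop rows (cases) that have any missing value.
complete_cases = wide.dropna(axis=0, how="any")

# Strategy 2: drop columns (variables) that have any missing value.
complete_columns = wide.dropna(axis=1, how="any")

# Strategy 3: keep only variables with a limited share of missing data
# (the 0.5 threshold is an arbitrary choice for this example).
well_covered = wide.loc[:, missing_share <= 0.5]

# Strategy 4: replace the remaining gaps with the column mean.
mean_filled = well_covered.fillna(well_covered.mean())

print("complete cases:", complete_cases.shape)
print("complete columns:", complete_columns.shape)
print(mean_filled)
```

Comparing the shapes before and after each step shows how much data each strategy costs, which is part of the reflection asked for in the exercise below.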

In the next section I discuss some specifics about how the data is currently formatted, and how we would like to have it formatted for analysis purposes. 

## ``Exercise: Manage Missing Data``
- Explore and handle the missing data in your data set using one of the methods mentioned above. Visualize your findings if necessary. Think about potential explanations for gaps in the data. Think about what stage of the process led to the data gaps? Are the gaps going to be a problem for your analysis? Why? 
- Explain your choice of method. Can you think of ways in which the chosen method skews your results? Did your data quality increase? Why? 
- At this stage, reflect on your choices along the way. Should you reiterate (e.g. choose a different focus/ hypothesis/ year for your analysis)? Why? 
- Also briefly reflect on what the data gaps might mean. Are there collections of variables for which data is structurally missing? Are there neighbourhoods for which data is structurally missing? What can cause these structural gaps in the data? 

In [None]:
# your code here
# use many cells if you like to structure your code well


# ``Task 4: Reshape the Data``
Reshaping data is a two-step process of melting and pivoting the data. Melting the data involves describing which data are identifiers ("id") and which are variables for retrieval ("measure"). In this case your data may already be in melted form (long form). Pivoting then involves actually reshaping the data into the needed format. In this step, you have to reshape the data from long to wide format.
 
Pivoting the data involves specifying what data is on the rows and on the columns. Hint: use the functions melt and pivot offered by the ``pandas`` library in Python. For our analyses we want the neighbourhood IDs to be on the rows, and to have all 4-7 variables as columns, where the value of each cell is the value for the year that you chose at the subsetting step.
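
As a minimal sketch of both directions, the example below uses the pandas methods `DataFrame.pivot` and `DataFrame.melt` on a small long-form table. The column names, the new names chosen in `rename`, and the values are hypothetical toy data for one already selected year.

```python
import pandas as pd

# Hypothetical long-form toy data for a single year; NOT real Den Haag figures.
long_df = pd.DataFrame({
    "neighbourhood_code": ["N01", "N01", "N02", "N02"],
    "variable": ["population", "income", "population", "income"],
    "value": [1000.0, 25.5, 800.0, 30.1],
})

# Long to wide: one row per neighbourhood, one column per variable.
wide_df = long_df.pivot(index="neighbourhood_code", columns="variable", values="value")
wide_df.columns.name = None  # drop the leftover "variable" label on the columns

# Rename columns to short, easily addressable names.
wide_df = wide_df.rename(columns={"population": "pop_total", "income": "avg_income"})

# Wide back to long, in case the data arrives in wide form instead.
melted = wide_df.reset_index().melt(
    id_vars="neighbourhood_code", var_name="variable", value_name="value"
)

print(wide_df.head())
print("wide shape:", wide_df.shape)
print(melted.head())
```

Note that `pivot` raises an error when an index and column combination occurs more than once, which is one reason to select a single year before reshaping.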

## ``Exercise: Reshape the Data``
- Do you need to melt and pivot the data in your specific case? Explain. 
- Examine the dimensions of the new dataframe if you had to pivot. Show it to us using head or print commands.
- Then rename all columns to short, meaningful names that are easy to address in code.  

In [None]:
# your code here
# use many cells if you like to structure your code well


<br/>

***

# ``Task 5: Saving the Results``

_Note:_ We do not need this file but we expect that if you learn how to save your data, it will be very useful in the future, as you do not need to run the script to clean your data again. 

- Save the cleaned dataframe as 'assignment-01-cleaned.csv' in your data folder
- Consider your final dataset. Reflect on why you think that this newly created dataset is appropriate/ not appropriate to answer your research question.
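
One way to save the result is sketched below. The helper `save_cleaned` and the toy data frame are hypothetical. The example writes into a temporary folder so that it can run anywhere, whereas in your notebook you would rely on the default relative `data` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_cleaned(df, data_dir=Path("data"), filename="assignment-01-cleaned.csv"):
    """Write `df` to `data_dir/filename` and return the path that was written."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    out_path = data_dir / filename
    df.to_csv(out_path, index=True)  # keep the neighbourhood codes in the file
    return out_path


# Hypothetical cleaned toy data; NOT real Den Haag figures.
cleaned = pd.DataFrame(
    {"pop_total": [1000.0, 800.0], "avg_income": [25.5, 30.1]},
    index=pd.Index(["N01", "N02"], name="neighbourhood_code"),
)

# Demonstration in a temporary folder, reading the file back as a check.
with tempfile.TemporaryDirectory() as tmp:
    path = save_cleaned(cleaned, data_dir=Path(tmp) / "data")
    reloaded = pd.read_csv(path, index_col="neighbourhood_code")

print(path.name)
print(reloaded)
```

In the notebook itself, `save_cleaned(cleaned)` would write to `data/assignment-01-cleaned.csv` relative to the notebook.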


In [None]:
# your code here
# use many cells if you like to structure your code well
