# a1 - Python

This assignment will cover some questions related to topics of data types, attribute types, and exploratory data analysis. The assignment will also serve as a further introduction to using a high-level language for analysis (e.g., R, Python).

Keep this notebook named "a1.ipynb".

Submit the zip-file created after running your notebook on the Linux lab machines. 

Your answers must be computer generated (including text and diagrams). Your final submission should include text responses to the questions, descriptions of your efforts, tables, the R/Python code used to calculate answers, and figures.

Any other packages or tools, outside those listed in the assignments or Canvas, should be cleared
by Dr. Brown before use in your submission.




## Q0 - Setup

The following code checks whether your notebook is running on Gradescope (GS), Colab (COLAB), or the Linux Python environment you were asked to set up.

In [None]:
import re 
import os
import platform 
import sys 

# flag if notebook is running on Gradescope 
if re.search(r'amzn', platform.uname().release): 
    GS = True
else: 
    GS = False

# flag if notebook is running on Colaboratory 
# (the google.colab module can only be imported inside Colab)
try:
    import google.colab
    COLAB = True
except ImportError:
    COLAB = False

# flag if running on Linux lab machines. 
cname = platform.uname().node
if re.search(r'(guardian|colossus|c28)', cname):
    LLM = True 
else: 
    LLM = False

print("System: GS - %s, COLAB - %s, LLM - %s" % (GS, COLAB, LLM))

### Notebook Setup 

It is good practice to list all imports at the top of the notebook. You can import modules in later cells as needed, but listing them at the top clearly shows which packages must be available / installed.

If you are doing development on Colab, the otter-grader package is not available, so you will need to install it with pip (uncomment the cell directly below).

In [None]:
# Only uncomment if you are developing on Colab 
# if COLAB: 
#     print("Installing otter:")
#     !pip install otter-grader==4.2.0 

In [None]:
# Import standard DS packages 
import pandas as pd 
import numpy as np
import matplotlib as mpl 
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline 


# Package for Autograder 
import otter 
grader = otter.Notebook()

In [None]:
grader.check("q0")

## Q1 - Census Data 

The ACSIncome dataset is one of five datasets created by Ding et al. [1] as an improved alternative to the popular UCI Adult dataset [2].

You can explore the files a bit in a text editor to understand the format. Note, there are about 1.6M rows of data. 

Use the column labels as described on the [reference website](https://fairlearn.org/main/user_guide/datasets/acs_income.html), e.g., 'AGEP', 'COW', etc.  

*References*   
[1]: Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. Retiring adult: new datasets for fair machine learning. Advances in neural information processing systems, 34:6478–6490, 2021.  
[2]: Ronny Kohavi and Barry Becker. Adult data set. UCI Machine Learning Repository, 1996.  
DOI: https://doi.org/10.24432/C5XW20. URL: https://archive.ics.uci.edu/ml/datasets/adult.

### Q1a - Load Data 

Load the data from the CSV files into a pandas DataFrame. 
Make sure to recognize any missing values when the data is read in. 

Use the column labels as described on the [reference website](https://fairlearn.org/main/user_guide/datasets/acs_income.html), e.g., 'AGEP', 'COW', etc.
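As a rough illustration of recognizing missing values at read time, the sketch below reads a small in-memory CSV with `pd.read_csv` and its `na_values` parameter. The toy values and the `?` marker are assumptions made for this example only; inspect the actual files to see how missing entries are encoded there.

```python
import io
import pandas as pd

# Toy CSV text standing in for the real files; the values and the "?"
# missing-value marker are assumptions made for illustration only.
toy_csv = io.StringIO(
    "AGEP,COW,WKHP\n"
    "34,1,40\n"
    "51,?,NA\n"
    "27,3,\n"
)

# na_values adds extra strings that pandas should treat as missing,
# on top of its defaults (empty fields, "NA", "NaN", ...).
toy = pd.read_csv(toy_csv, na_values=["?"])

# Count missing values per column to confirm they were recognized.
print(toy.isna().sum())
```

Checking `isna().sum()` right after loading is a quick way to confirm that the markers were picked up before moving on.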

In [None]:

income = pd.read_csv(...)  

income.head()

In [None]:
grader.check("q1a")

<!-- BEGIN QUESTION -->

### Q1b - Variable Definitions

To answer this question, you may have to do a bit of reading and research into this data set. If you cannot find a clear explanation of what a variable is and how it is defined, say so.

Describe what each row represents in the data set. 

For each variable (column of the data set), write a brief, clear 1-sentence description (less than one line long) of what the variable is, i.e., what information does it describe and how is it defined or collected.  

For example, the variable `AGEP` could be described as:

*  **AGEP** is the age of an individual; the value is reported in integer units of years, 0-99.  

Refer to each variable by its column name.

**YOUR ANSWERS**

A row represents ...

* **AGEP** is the age of an individual; the value is reported in integer units of years.  
* **COW** 
* ...

<!-- END QUESTION -->

### Q1c - Attribute Types

To answer this question, you may again have to do a bit of reading and research into this data set. If you cannot find a clear explanation of what a variable is and how it is defined, say so.

For each variable, state the attribute type: 1- *nominal*, 2- *ordinal*, 3- *interval*, or 4- *ratio*. 


For example, the variable `AGEP` could be described as:

*  **AGEP** is the age of an individual; the value is reported in integer units of years, 0-99.  
    *Ratio*

Therefore, in the code cell below you would have: 

`type_agep = 4`

In [None]:
type_agep = ...
type_cow = ...
type_schl = ...
type_mar = ...
type_occp = ...
type_pobp = ...
type_relp = ...
type_wkhp = ...
type_sex = ...
type_rac1p = ...
type_st = ...
type_pincp = ...

In [None]:
grader.check("q1c")

## Q2 - Exploratory Data Analysis 

We will explore aspects of the income data from above. 

**REMINDER!** For each part of the question below, make sure the figure is of reasonable size.  You may want to consider using the `figsize` parameter in matplotlib.  

Use good visualization practices as discussed in class and in the [Visualization book reference](https://clauswilke.com/dataviz/). 
Make sure to label everything. 
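A minimal sketch of setting the figure size and labeling a plot is shown here. The categories and counts are hypothetical toy values, not taken from the income data, and the specific size is only one reasonable choice.

```python
import matplotlib.pyplot as plt

# Hypothetical category counts, used only to demonstrate sizing and labeling.
categories = ["A", "B", "C"]
counts = [12, 30, 7]

# figsize is (width, height) in inches.
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(categories, counts)

# Label both axes and give the figure a title.
ax.set_xlabel("Category")
ax.set_ylabel("Count")
ax.set_title("Counts per category (toy data)")
fig.tight_layout()
```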


<!-- BEGIN QUESTION -->

### Q2a - Visualize: Amounts: Single Variable

Create a bar plot of `SCHL`. 



In [None]:

# Create bar plot of education




<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Q2b - Visualize: Amounts: Multiple Variables

Create a grouped bar plot of `MAR` with `SEX`.
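One possible way to build a grouped bar plot is to cross-tabulate the two categorical columns and plot the resulting table. The sketch below uses a toy DataFrame with hypothetical column names `category` and `group`; it is not the only approach (seaborn's `countplot` with `hue` is another).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy records; the column names "category" and "group" are hypothetical.
toy = pd.DataFrame({
    "category": ["x", "x", "y", "y", "y", "z"],
    "group":    ["a", "b", "a", "a", "b", "b"],
})

# Cross-tabulate: one row per category, one column per group.
table = pd.crosstab(toy["category"], toy["group"])

# Plotting a wide table with kind="bar" draws one bar per column,
# side by side at each row's position on the x-axis.
fig, ax = plt.subplots(figsize=(8, 4))
table.plot(kind="bar", ax=ax, rot=0)
ax.set_xlabel("Category")
ax.set_ylabel("Count")
ax.legend(title="Group")
```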

In [None]:
# Create plot




<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Q2c - Visualize: Distribution: Single Attribute

Generate a histogram with an appropriate number of bins to visualize the distribution of `AGEP`.
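Choosing a bin count is a judgment call. As one hedged starting point, the sketch below lets NumPy compute bin edges with the Freedman-Diaconis rule on synthetic values (not the real data) and passes them to `ax.hist`. For integer-valued data you may prefer bin edges aligned to whole numbers instead.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic values standing in for a numeric column (not the real data).
rng = np.random.default_rng(0)
values = rng.normal(loc=40, scale=12, size=1000)

# The Freedman-Diaconis rule ("fd") derives a bin width from the
# interquartile range and the sample size.
edges = np.histogram_bin_edges(values, bins="fd")

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(values, bins=edges, edgecolor="white")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
ax.set_title(f"Histogram with {len(edges) - 1} bins (toy data)")
```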

In [None]:
# Create plot




<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Q2d - Visualize: Distribution: Multiple Attributes

Create overlapping density plots for `AGEP` by `MAR`.

In [None]:
# Create plot




<!-- END QUESTION -->



## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

**NOTE** The submission must be run on the campus Linux machines. See the instructions in the Canvas assignment.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)