# Unit Project:  Analysis of a Data File

## Learning Objectives
This project meets four of the learning objectives for this unit:

1. Learn to navigate tools that support computer programming for Python.
2. Program linear scripts in Python.
3. Program scripts that include functions, logic, and control statements.
4. Manage data within the program using lists and dictionaries.


You will meet these objectives by applying everything you have learned throughout the course in the implementation of a script for data analysis.

## Instructions
Read through all of the text on this page.  Section headers with the <i class="fas fa-puzzle-piece"></i> puzzle icon indicate that the section provides a task that you must perform to complete the project.

The **Expected Outcomes** section lists the expertise that you should have developed through the course and employed in this project.  If you feel uncomfortable in any of these areas, please reach out to the instructor.

Follow the instructions in the **What to turn in** section to turn in the exercises of the assignment for course credit.

## Background
One of the most important uses of the Python programming language in science is **Data Analytics**.  With data sets becoming increasingly large (**volume**), more complex (**variety**), and arriving faster (**velocity**), the 3 V's of **big data**, scientists are in greater need of programming skills.  Excel is insufficient, and why wait around for an informatician when there are tasks you can do on your own!

While many open-source, freely available software programs and websites are also written in Python, this module focuses on training you in the basics of Python for use in Data Analytics. Therefore, for this project, we will write a Python script that uses those basic skills to read in a data file and perform some basic statistical analysis.

### Data Introduction
For this project we will be using the [Cover Type Dataset](https://archive.ics.uci.edu/ml/datasets/Covertype) available at the UC Irvine Machine Learning Repository. The site provides this explanation as to the purpose of this dataset:

> Natural resource managers responsible for developing ecosystem management strategies require basic descriptive information including inventory data for forested lands to support their decision-making processes.  However, managers generally do not have this type of data for inholdings or neighboring lands that are outside their immediate jurisdiction.  **One method of obtaining this information is through the use of predictive models.**  

> [The purpose of this dataset is for] predicting forest cover type from cartographic variables... The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data... Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types). 

We will not be writing any code to do predictive analysis. We will just write code to perform some basic statistical analysis.

## Project Tasks
### <i class="fas fa-puzzle-piece"></i> Step 1: Download the Data
Go to the [Cover Type Dataset](https://archive.ics.uci.edu/ml/datasets/Covertype) web page and download the data. Click the link at the top that reads **Data Folder**.  On the resulting page you will find three files for download.  Please download the files named `covtype.data.gz` and `covtype.info`. For the `covtype.data.gz` file, use your preferred decompression utility to uncompress the file.  The [7zip](https://www.7-zip.org/) program is a good option if you do not have a preferred tool.

Move these files into the `project` folder of this Git repository. Do not rename the data file; it should retain its original name, `covtype.data.gz`.
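
If you would rather not install a separate tool, Python's standard library can also uncompress the file. The sketch below is one possible approach, not a required step. The helper name `decompress_gzip` is illustrative and not part of the project requirements.

```python
import gzip
import shutil


def decompress_gzip(source_path, target_path):
    """Write an uncompressed copy of a .gz file, leaving the original in place."""
    # gzip.open reads the compressed bytes; copyfileobj streams them to the
    # target in chunks so the whole file never has to fit in memory at once.
    with gzip.open(source_path, "rb") as compressed:
        with open(target_path, "wb") as plain:
            shutil.copyfileobj(compressed, plain)
```

For example, calling `decompress_gzip("covtype.data.gz", "covtype.data")` from inside the `project` folder should leave both files side by side.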



### <i class="fas fa-puzzle-piece"></i> Step 2: Explore the Data
The `covtype.data.gz` file is a compressed comma-separated (CSV) data file. Before you can begin this project, you must familiarize yourself with the data.  Please read the `covtype.info` file to learn the purpose of each column of data.  Use any text editor to view the file.

Here are some hints to help explain some of the data columns:

- Slope:  The angle in degrees of the slope on which the forest cover is growing.  
- Aspect:  The direction the slope is facing in degrees azimuth:  North = 0, East = 90, South = 180, West = 270.
- The columns representing shade contain values from 0 to 255 with 0 meaning no sun and 255 meaning full sun.
- There are 40 columns representing different soil types.  See the `covtype.info` file for a listing of these types.  The observations in these 40 columns indicate if cover was: absent = 0, present = 1
- There are 4 columns representing 4 different wilderness areas. The observations in these 4 columns indicate if cover was:  absent = 0, present = 1

If the meaning of the data is not clear, reach out on the Slack workspace to ask for help and clarification.
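
Besides a text editor, a few lines of Python can show how each row splits into columns. The following is a minimal sketch, assuming every value in a row is an integer (confirm this against the info file). The function name `preview_rows` and the three-column sample rows are made up for illustration; the real file has many more columns per row.

```python
import io


def preview_rows(handle, count):
    """Return the first `count` non-blank lines of an open text file as lists of integers."""
    rows = []
    for line in handle:
        if len(rows) == count:
            break
        line = line.strip()
        if line:  # skip blank lines
            rows.append([int(value) for value in line.split(",")])
    return rows


def main():
    # Made-up stand-in for the data file so the sketch runs on its own.
    sample = io.StringIO("10,20,30\n40,50,60\n70,80,90\n")
    print(preview_rows(sample, 2))  # prints [[10, 20, 30], [40, 50, 60]]


if __name__ == "__main__":
    main()
```

To try it on the real data, replace the `io.StringIO` object with a file opened using `open("covtype.data")`.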

### <i class="fas fa-puzzle-piece"></i> Step 3: Write a Script
In the `project` folder of this repository, write a Python script. Name the script `project1.py`. The program will perform the following:

1. Read the `covtype.data` file.
2. Calculate the following statistics for the first 10 columns of the data file:
   1. Maximum value
   2. Minimum value
   3. Average (Mean)
3. Calculate the same set of statistics for each of the four wilderness areas. The wilderness areas are identified in columns 11-14 of the data file.
4. Write a new file that contains a report of the statistics calculated.
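
The statistics in step 2 can be sketched on a tiny, made-up table before you apply the idea to the real file. This is only one possible arrangement, not the required design; the helper names `column` and `summarize` are hypothetical.

```python
def column(rows, index):
    """Return the values found at position `index` in every row."""
    return [row[index] for row in rows]


def summarize(values):
    """Return the maximum, minimum, and mean of a non-empty list of numbers."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": sum(values) / len(values),
    }


def main():
    # Made-up rows; each inner list stands for one line of a data file.
    rows = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
    print(summarize(column(rows, 0)))  # prints {'max': 70, 'min': 10, 'mean': 40.0}


if __name__ == "__main__":
    main()
```

Python column positions start at 0, so the first 10 columns of the data file correspond to indexes 0 through 9.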

### Rules
Your program **must** make use of the following:

1. arguments passed via the command-line
2. use of lists or dictionaries
3. use of functions
4. use of loops
5. proper commenting

Your program **must not**:

1. Use global variables (except the exceptions listed below).
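
As a rough illustration of the command-line rule, the sketch below reads a single file path from `sys.argv` inside a `main()` function, so no global variables are needed. The helper name `parse_args` and the wording of the usage message are assumptions, not requirements.

```python
import sys


def parse_args(argv):
    """Return the data file path from argv, or None if the argument count is wrong."""
    # argv[0] is the script name, so exactly one user argument means length 2.
    if len(argv) != 2:
        return None
    return argv[1]


def main(argv):
    """Report which data file was requested on the command line."""
    path = parse_args(argv)
    if path is None:
        print("Usage: python project1.py <data file>")
        return
    print("Data file:", path)


if __name__ == "__main__":
    main(sys.argv)
```

Running `python project1.py covtype.data` with this sketch would print the path it received; running it with no argument prints the usage message instead.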

### Exceptions
You are not allowed to use global variables. However, there is always an exception to the rule. You can use the following two dictionaries as constants to help simplify your program, but these dictionaries should never change.

```python

AREAS = {
    0: 'All',
    1: 'Rawah Wilderness Area',
    2: 'Neota Wilderness Area',
    3: 'Comanche Peak Wilderness Area',
    4: 'Cache la Poudre Wilderness Area'
}

COLUMNS = {
    0: 'Elevation',
    1: 'Aspect',
    2: 'Slope',
    3: 'Horizontal_Distance_To_Hydrology',
    4: 'Vertical_Distance_To_Hydrology',
    5: 'Horizontal_Distance_To_Roadways',
    6: 'Hillshade_9am',
    7: 'Hillshade_Noon',
    8: 'Hillshade_3pm',
    9: 'Horizontal_Distance_To_Fire_Points'
}
```
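
One way these constants might be used is to map the four wilderness columns of a row back to a key of `AREAS`. The sketch below assumes that columns 11-14 of the file sit at list indexes 10-13, that exactly one of them is 1 in each row, and that key `0` (`'All'`) stands for statistics over every row. The function name `area_key` and the sample row are made up.

```python
def area_key(row):
    """Return the AREAS key (1-4) matching the wilderness column set to 1 in `row`."""
    # Columns 11-14 of the file are indexes 10-13 of the list.
    flags = row[10:14]
    # list.index raises ValueError if no flag is set, which would signal a bad row.
    return flags.index(1) + 1


def main():
    # Made-up row: ten measurement columns followed by the four area flags.
    row = [0] * 10 + [0, 0, 1, 0]
    print(area_key(row))  # prints 3


if __name__ == "__main__":
    main()
```

With the dictionaries above, `AREAS[3]` is `'Comanche Peak Wilderness Area'`, so this made-up row would count toward that area's statistics.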

## Collaboration
You may work together in self-formed groups if you like. Group work is not required, but it may help save you time. As a group you can:

**Do**

- Work together on an overall design for the program.
- Help one another debug your scripts.
- Code together simultaneously.
- You may work together in person or by Zoom.

**Do not**

- Do not share code with each other such that group members cut-and-paste code. Each person should write the entire script themselves. You can look at each other's code but the author must type it all themselves. 
- Do not post project code on Slack. If you help each other with programming, only show code on your computer screen.


## Student Peer-Review
Every student must review at least one other student's script before it is turned in.  You cannot get an **A** grade if you do not have someone else provide a peer review.  The purpose of the review is to check that your script works and to help you spot mistakes you may have missed.

## Best Practices
Try the following when working on the project.
- Do not wait until the last week of the unit to start this project. 
  - You can start thinking about ideas after Assignment 6.
  - You can start reading the file after Assignment 7
  - Build the program as you go!
- Do not be shy!  Ask questions, and take advantage of help from other students, the instructor, and Slack!
- Check the evaluation checklist before turning in your script.
- Be sure to leave your peer reviewer a few days to complete the review.

## Evaluation Checklist  (100 points possible)

Credit is earned for your script if it meets the following criteria. 

1. The script runs properly (**45 points**):
   1. Executes to completion without presenting an error (20 points). 
   2. Writes a correct output file (25 points).
2. The script uses the following (**30 points**):
   1. Receives the data file as a single command-line argument (5 points). 
   2. Uses at least one `for` or `while` loop (5 points). 
   3. Uses an `if` statement (5 points). 
   4. Uses a list or dictionary (5 points).
   5. The program does not use global variables (5 points).
   6. Uses a `main()` function and at least one other function (5 points).
3. The script follows Sphinx docstring style documentation (**15 points**):
   1. The program has a header and the header indicates if the author worked in a group (5 points). 
   2. Each function has documentation for each parameter and its type is described (5 points). 
   3. Each declared variable has a comment (5 points).
4. The program was reviewed by another student (**10 points**).
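
For the documentation criteria, a function documented with Sphinx-style fields might look like the sketch below. The header layout, the placeholder text in angle brackets, and the function itself are illustrative only; check with the instructor if a specific header format is expected.

```python
"""
Project 1: analysis of a data file.

Author: <your name>
Group work: <yes or no; list group members if yes>
"""


def mean(values):
    """
    Calculate the arithmetic mean of a list of numbers.

    :param values: The numbers to average. Must not be empty.
    :type values: list
    :return: The arithmetic mean of the numbers.
    :rtype: float
    """
    # total: running sum of the values seen so far
    total = 0
    for value in values:
        total += value
    return total / len(values)
```

The `:param:` and `:type:` fields cover the "each parameter and its type" criterion, and the comment above `total` shows one way to annotate a declared variable.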

## Expected Outcomes

You should know how to do the following in Python:

1. use variables including lists and dictionaries.
2. get command-line arguments from the user.
3. perform mathematics in code.
4. use control statements with logic.
5. use functions.
6. read and write files.
7. make code readable through comments.


## What to Turn in?
Because your program is in this repository you can **commit** and **push** your changes to GitHub.  Once completed, send a **Slack message** to the instructor indicating you have completed the project. The instructor will verify all work is completed. 