We're interested in enabling reproducible statistical analyses; we're therefore going to focus on both accessing our data as well as executing statistical analyses programmatically. In this assignment, we'll work through accessing data in a standardized fashion using the Python programming language.
Although there are a wide variety of programming languages available for data analysis, we've chosen to focus on Python. We've made this choice since Python has several important characteristics:
- It's free and open source software (FOSS), meaning that it's accessible without a subscription plan
- It's a great general purpose scripting language, used in a wide variety of industries today
- It has a strong ecosystem for scientific analyses
For statistics, the other very popular language to know is R—I highly recommend you check it out, even though we won't be using it in this course.
To get a working python environment, please install Miniconda into the computational environment you set up in A1. Specifically, please make sure to install the Python 3.7 build as there are significant differences between the two Python versions; if you're interested, you can learn more about some of these differences here.
Once you have installed Miniconda, you will have access to a working Python environment. First, we'll be using pip
—a python package management system— to install several packages we'll need to execute the assignment. From an open terminal, you can execute the following commands
pip ipython
pip numpy
pip pandas
Notice that these commands are executed outside of the Python environment! Now we can start Python using the following command:
ipython
For this assignment, we'll be using the classic iris dataset. We'll ask you to load the dataset and provide some descriptive statistics.
I've started the code for you in the associated descriptive_statistics.py
file.
In your copy of the repository, please extend the descriptive_statistics.py
code to include calculating descriptive statistics in either numpy
or pandas
.
When you are ready to submit, please open a pull request on your copy of the repostiory with the modified descriptive_statistics.py
code and the table below completed:
Iris type | Mean Sepal Length | Standard Deviation of Sepal Length |
---|---|---|
Setosa | XX | XX |
Versicolor | XX | XX |
Virginica | XX | XX |