This repo contains extra practice sets based on Advancing into Analytics: From Excel to Python and R.
Skip to sections:
Part I: Foundations of Analytics in Excel
Part II: From Excel to R
Part III: From Excel to Python
Solutions are available for students of Advancing into Analytics courses.
I can also provide to high school or college teachers using the book in their course. Learn more about using Advancing into Analytics in the classroom here.
All datasets referred to can be found in the datasets
subfolder. There is also a file with links to their sources where you can get column definitions and so forth.
Using the titanic
dataset and Excel:
- Classify each variable’s type.
- Build and visualize a two-way frequency of
Pclass
bySurvived
. - Return descriptive statistics for
Fare
. How does the distribution differ byPclass
? - Visualize the distribution of
Age
.
- Simulate 100 coin tosses and visualize the result as a histogram in Excel. Do some online research -- what probability distribution does a series of coin tosses follow?
- Now take the average of twenty coin tosses and visualize the distribution of 100 such averages as a histogram. What probability distribution do the sample means follow?
- Plot two probability distributions in Excel on the same area graph: one with a mean of 25 and standard deviation of 8, the other with 40 and 2, respectively.
- Using the
geyser
dataset and Excel, test the relationship between the kind of eruption (long or short) and the waiting time until the next eruption:
- What are your statistical hypotheses?
- Are your results statistically significant? What evidence does this lend to your hypotheses?
- What is the estimated effect size?
- Answer the same questions, but for the relationship between having a CD-ROM and the price of the computer in the
computers
datset.
- Using the
car-prices
dataset and Excel:
- What is the correlation between mileage and price?
- Which is the independent and dependent variable?
- Find the coefficents and their statistical significance along with the R-square of the regression model.
- Using the
birthweights
dataset and Excel:
- Derive a correlation for all relevant variables.
- Between
length
andbirthweight
, which is the independent versus dependent variable? - Find the coefficents and their statistical significance along with the R-square of the regression model.
- This chapter is a conceptual overview of how Excel, R and Python fit into the broader data analytics stack. If you'd like to get started learning other elements of analytics, I suggest the following resources:
- Codeacademy's free Introduction to SQL course
- Power BI on Microsoft Learn
- If you're on Mac and can't run Power BI, start with Tableau Public instead
- Run the following code. What happens?
temps <- c(66, 71, 68, 74, 70)
days <- c('Mon', 'Tue', 'Wed', 'Thr', 'Fri')
names(temps) <- days
temps['Wed']
- Knowing what you do now, return the values for Monday and Friday first by index position, then by name.
- Explore the R Help page in RStudio: from the menu click Help > R Help.
- Create an RStudio Community account. This is an excellent forum for getting help in R and RStudio moderated by the pros.
- If you would like additional reading on getting started in R, check out the following:
- A Tour of RStudio
- Five Things Excel Users Should Know about R
- What is an open source software distribution?
- Easter egg time! In this chapter we used the
write_xlsx()
function from packagewritexl
. Pull its help documentation... an anthropomorphized office supply awaits! - Many packages include vignettes, similar to white papers showing important features and concepts of the package. These can be viewed via the
browseVignettes()
function.
- Open the vignettes for
dpyr
andtidyverse
. The vignettes will open in a separate internet tab. Use the function's help documentation to confirm what to do here.
- How many datasets does the
babynames
package come with? Choose one, run its descriptive statistics and write the dataset to an Excel worksheet.
Using the car-crashes
dataset:
- The
total
variable represents the number of drivers involved in fatal collisions per billion miles. Derive a column that represents the number of such drivers per million miles. - Find the average insurance premium for each Census region.
- Visualize the distribution of
no_previous
. Which distribution does it appear to follow? - Visualize the relationship between
ins_premium
andins_losses
. Re-label your axes and add a chart title. Does the relationship appear to be linear? - Create a data frame in R that would look like this in Excel:
Using the birthweights
dataset:
- Is there a significant difference in the gestation period of babies whose mother is and is not a smoker?
- Produce a correlation matrix of the relevant variables in this dataset.
- Visualize the relationship of weight by length.
- Regress weight on height. Find the equation of the fit regression line. Is there a significant relationship? What percentage of the variance in weight is explained by height?
- Split your regression model into training and testing subsets. What is the R-squared and RMSE on your test model?
Extra credit: Conduct your work in an R Notebook (included with RStudio) or Jupyter Notebook (see instructions here for using R with Jupyter).
- Continue getting to know the Jupyter interface. In particular, under the Help menu you will find an interactive "User Interface Tour" to walk you through the interface, as well as a guide to Markdown, which you may need for the next exercise...
- Practice your Markdown skills by including the following elements in a notebook:
- Two different levels of headers
- Italics
- A hyperlink
- An image
- Google Colab is a free cloud-based environment based on the Jupyter Notebook. This is a great tool for sharing your data projects on the web. Check out this introductory Colab notebook to get started.
Further reading:
- Five Things Excel Users Should Know About Python
- Sourcing Python Packages: A decision tree (Note: The downloadable file is available to you in this repo
- A tour of Python + Jupyter notebooks
- Easter egg time! Run the following code:
import this
. What happens? - Install the
vega_datasets
frompip
, then run the following code to import amovies
dataset:
import pandas as pd
from vega_datasets import data
movies = data.movies()
movies.head()
- Generate descriptive statistics for the dataset.
- Write the first four columns of the dataset to an Excel file.
Further reading:
Using the car-crashes
dataset:
- The
total
variable represents the number of drivers involved in fatal collisions per billion miles. Derive a column that represents the number of such drivers per million miles. - Find the average insurance premium per Census region.
- Visualize the distribution of
no_previous
. Which distribution does it appear to follow? - Visualize the relationship between
ins_premium
andins_losses
. Re-label your axes and add a chart title. - Create a data frame in Python that would look like this in Excel:
Using the birthweights
dataset:
- Is there a significant difference in the gestation period of babies whose mother is and is not a smoker?
- Produce a correlation matrix of the relevant variables in this dataset.
- Visualize the relationship of weight by length.
- Regress weight on height. Find the equation of the fit regression line. Is there a significant relationship? What percentage of the variance in weight is explained by height?
- Split your regression model into training and testing subsets. What is the R-squared and RMSE on your test model?
The conclusion offers a few next steps for continuing on in your analytics journey. Here are a couple of quick wins on that front:
- If you've not already downloaded GitHub Desktop, do it now. If you have, continue to practice. Either way, the official Desktop documentation is a great place to learn more.
- Read "How to Design Smart Business Experiments" by Thomas H. Davenport (Harvard Business Review)
Thanks for "advancing into analytics" with me .... roll credits!