quin97/STA130

 
 

STA130 (Fall 2022): An Introduction to Statistical Reasoning and Data Science

This course transitions the Winter 2022 Online version of this course back to the Fall 2019 In-Person format. This is done by translating previous PowerPoint slides and accompanying pre-recorded lectures into an Rmd Beamer PDF presentation format (which is the format required for student project submissions). At the highest level, the course objectives are to develop and practice the two steps of the statistical and data science workflow:

  1. Extract meaning from data through coding and analysis
  2. Communicate learned knowledge in writing and speaking

Weekly Course Routine

  • 2 hours of in-person lecture and interactive class practice quizzes
    • 2 hours of review, office hours, and Piazza discussion board time
    • 3 hours of "completion credit" R homework (5%) designed to develop and practice the skills evaluated on the exams (45%) and the course project (20%)
  • 2 hours of in-person tutorial activities (15%) focussing on written and verbal communication
    • 1 hour of written and verbal communication homework (10%) designed to develop and practice the skills evaluated on the exams and the course project

Course Grading

  • 5% R Homework
  • 10% Written/Verbal Communication Homework
  • 15% In-Person Tutorial Activities
  • 20% Course project
  • 20% Midterm
  • 25% Final exam
  • 5% Participation Activities
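
As a rough illustration of how the weights above combine into a final grade, here is a minimal sketch. The course itself works in R; Python is used here only for illustration. The function name, dictionary keys, and example scores are hypothetical, and the sketch assumes every category is scored on a 0-100 scale.

```python
# Category weights as listed in the grading breakdown (percent of final grade).
WEIGHTS = {
    "r_homework": 5,
    "communication_homework": 10,
    "tutorial_activities": 15,
    "course_project": 20,
    "midterm": 20,
    "final_exam": 25,
    "participation": 5,
}


def final_grade(scores):
    """Weighted average of category scores, each assumed to be on a 0-100 scale."""
    assert sum(WEIGHTS.values()) == 100  # the listed weights account for the whole grade
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example_scores = {
    "r_homework": 100,
    "communication_homework": 90,
    "tutorial_activities": 80,
    "course_project": 85,
    "midterm": 70,
    "final_exam": 75,
    "participation": 100,
}

print(final_grade(example_scores))
```

The two exams together carry 45% of the grade, so in this sketch a change of 10 points on both exams moves the final grade by 4.5 points.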

Tutorial attendance is mandatory in the sense that it involves graded in-person activities. Lecture attendance is mandatory in the sense that lectures will not be recorded, but there are no attendance grades for lectures. Practice quizzes and other study materials to help prepare for the exams are available, but these are optional in the sense that they are not graded for course points. The participation activities involve surveys and mentorship activities.

Missed Assessments

  • Missed Exams may be rescheduled for valid excused absences
  • Late Project submission will not generally be accommodated
  • Late Homework and Tutorial work will not be accepted; however...
    • the highest grade will replace the lowest for each of these categories
    • For additional support contact your College Registrar
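
One possible reading of the replacement policy above, sketched in Python for illustration only (the course itself works in R). It assumes that, within a single category such as R homework, exactly one lowest score is replaced by the highest score in that category before averaging. The function names and scores are hypothetical, and the actual grading procedure may differ.

```python
def replace_lowest_with_highest(scores):
    """Return a copy of scores with one lowest score replaced by the highest score."""
    adjusted = list(scores)
    if len(adjusted) < 2:
        return adjusted  # nothing to replace with fewer than two scores
    lowest_position = adjusted.index(min(adjusted))
    adjusted[lowest_position] = max(adjusted)
    return adjusted


def category_average(scores):
    """Average of a category's scores after applying the replacement."""
    if not scores:
        raise ValueError("at least one score is required")
    adjusted = replace_lowest_with_highest(scores)
    return sum(adjusted) / len(adjusted)


# Hypothetical homework scores; the 0 stands for a missed (unaccepted) submission.
homework = [0, 80, 90, 100]

print(replace_lowest_with_highest(homework))  # [100, 80, 90, 100]
print(category_average(homework))             # 92.5
```

Under this reading, a single missed homework or tutorial does not lower the category average, but a second missed item still counts as zero.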

Communication

  • In-person Office Hours: TBA
  • Online Zoom Office Hours: TBA
  • Online: Piazza Discussion Board
  • Special inquiries: sta130@utorono.ca

Course project

The course project will be done in consultation with our "project collaborator" Dr. Heman Shakeri of the University of Virginia's (UVA) Data Science Institute (DSI). The DSI is a recently created School (the 12th) of this premier US university, originally founded by Thomas Jefferson, the principal author of the US Declaration of Independence.

Dr. Shakeri's research, in conjunction with the Department of Biomedical Engineering and Systems Biology and Biomedical Data Sciences, is motivated by the experience with cancer of a close family friend. So, Heman (Dr. Shakeri) wants to use "data-driven identification and control of high-dimensional dynamical systems" to detect deviations away from normal cellular function and intervene to interrupt the progression of cancer before it can establish a deleterious cellular homeostasis, in order to give families more time with their loved ones and close friends.

The data we will work with is based on advances in the fields of Flow Cytometry for single cell analysis and Mass Spectrometry for measurement of cellular proteomic processes (the phenotypical process endpoint of cellular function and behavior). Based on these technologies, the multivariate landscape of proteomic activity can be measured for a single cell in any experimental condition for any cell type (e.g., cancerous and benign cell lines) at scale. By understanding the typical cellular homeostasis of healthy and deleterious cells, and observing the phenotypical transformation of cellular proteomic homeostasis over time in response to different treatments, it is hoped that we will eventually understand how to direct deleterious cellular states to transition into non-deleterious states. I.e., "data-driven identification and control of high-dimensional dynamical systems".

The data observations simultaneously measure 17 so-called AP-1 transcription factors over thousands of cells in a given experimental condition. By repeatedly observing this protein complex (as is done in this data), the correlation between the different proteins can be observed. Further, the evolution of these dependencies can be observed over time as a response to different interventions. And even further, the emergence of downstream cellular phenotypes in response to changes of the state of the AP-1 system can also be observed and can be characterized through 4 other "phenotype" proteins (whose measurements are also available in this data set). Thus, through our course project, we seek to understand the inter-dependence between the AP-1 proteins, and their driving relationship with downstream cellular phenotypes, which might eventually suggest how we can intervene along this pathway to induce transformation away from deleterious cellular states.
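
To make "the correlation between the different proteins" concrete, here is a minimal sketch of a pairwise correlation matrix. It is written in Python for illustration only (the course itself works in R) and assumes Pearson correlation, which is one possible choice and not necessarily the measure used in the project. The protein names and values are made-up toy numbers, not project data. Each list holds one protein's measurements across the same four hypothetical cells.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of measurements."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations between every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy values only: hypothetical protein names, one entry per cell.
toy_cells = {
    "protein_a": [1.0, 2.0, 3.0, 4.0],
    "protein_b": [2.0, 4.0, 6.0, 8.0],
    "protein_c": [4.0, 3.0, 2.0, 1.0],
}

matrix = correlation_matrix(toy_cells)
for name, row in matrix.items():
    print(name, row)
```

In this toy example, protein_a and protein_b rise together (correlation near 1), while protein_c falls as protein_a rises (correlation near -1). Computing such a matrix separately for each time point or treatment would be one way to look at how the dependencies evolve.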

Outline

Week 1: Jupyterhub and Rstudio and R Basics

Week 1 is concerned with introducing students to R and Rstudio using UofT's Jupyterhub. Our primary reference resources in this task are the R for Data Science textbook and the DoSS Toolkit.

The UofT Jupyterhub is a phenomenal resource; however, it is subject to service outages from time to time (which have in the past coincided with assignment due dates), and it can take a long time to load when there is a lot of simultaneous user demand (for example, when many students in our class or another class log in at once). When you cannot use UofT Jupyterhub, you must use your own local Rstudio instance.

An extremely valuable skill in the context of coding for statistics and Data Science is troubleshooting and figuring things out. Resources like the R for Data Science textbook and the DoSS Toolkit are excellent resources for learning things in a systematic, structured, and organized manner; however, Google, Stack Exchange/Overflow, and coding blog posts can be invaluable resources for finding quick solutions to coding bugs and suggestions for how to complete a desired analysis. Hopefully through this class you will take the opportunity to build your self-sufficiency and coding resilience.

Week 1 Course Material

Slides [Jupyterhub], Demo 1 [Jupyterhub], Demo 2 [Jupyterhub] Questions [Round 1, Round 2, Round 3, Round 4, Round 5]
Homework Assignment and Practice Quiz

Module 2: Distributions and Statistics

Module 3: Data Wrangling with Tidy

Module 4: Statistical Inference for a Single Sample Proportion

Module 5: Permutation Tests for Two Groups

Midterm Review

Module 6: Sampling Distributions and Bootstrap Confidence Intervals

Module 7: Linear Regression I

Module 8: Linear Regression II

Module 9: Classification Trees

Module 10: Study Design, Confounding, and Ethics

Final Review
