# Data Science Achievements

In [1]:

import yaml as yml
import pandas as pd
import os
from IPython.display import display, Markdown
pd.set_option('display.max_colwidth', None)


def yml_df(file):
    """Load a YAML file and return its contents as a DataFrame."""
    with open(file, 'r') as f:
        file_dict = yml.safe_load(f)

    return pd.DataFrame(file_dict)

outcomes_df = yml_df('../_data/learning_outcomes.yml')
# outcomes_df.set_index('keyword',inplace=True)
schedule_df = yml_df('../_data/schedule.yml')
schedule_df.set_index('week', inplace=True)
# schedule_df = pd.merge(schedule_df,outcomes_df,right_on='keyword',  left_on= 'clo')
rubric_df = yml_df('../_data/rubric.yml')
rubric_df.set_index('keyword', inplace=True)

In this course there are 5 learning outcomes that I expect you to achieve by
the end of the semester.  To get there, you'll focus on 15 smaller achievements
that will be the basis of your grade.  This section describes how the topics,
the learning outcomes, and the achievements are spread over the semester. In
the next section, you'll see how these achievements turn into grades.


## Learning Outcomes

By the end of the semester, you will be able to:

In [2]:
outcome_list = [ str(i+1) + '. ' + ' (' + k + ') '  + o  for i,(o,k) in enumerate(zip(outcomes_df['outcome'], outcomes_df['keyword']))]

display(Markdown('  \n'.join(outcome_list)))
#outcomes_df[['keyword','outcome']]

1.  (process) Describe the process of data science, define each phase, and identify standard tools  
2.  (data) Access and combine data in multiple formats for analysis  
3.  (exploratory) Perform exploratory data analyses including descriptive statistics and visualization  
4.  (modeling) Select models for data by applying and evaluating multiple models to a single dataset  
5.  (communicate) Communicate solutions to problems with data in common industry formats

We will build your skill in the `process` and `communicate` outcomes over the whole semester. The middle three skills will correspond roughly to the content taught for each of the first three portfolio checks.  

(schedule)=
## Schedule

````{margin}
```{note}
On the {{ bscalendar }} page you can get a feed link to add to the calendar of your choice by clicking the subscribe (star) button at the top right of the page. Class appears there as a 1-hour event because of Brightspace/Zoom integration limitations, but that calendar includes the Zoom link.
```
````

The course will meet {{ time }} in {{ location }}. Every class will include participatory live coding instruction (the instructor types code while explaining and students follow along) and small exercises that let you progress toward Level 1 achievements in the new skills introduced that day.

Each assignment will have a deadline posted on the page.  Portfolio deadlines will be announced at least 2 weeks in advance.

In [3]:

# replace() returns a copy by default, so modify the DataFrame in place
schedule_df.replace({None:'TBD'}, inplace=True)
schedule_df[['topics','skills']]

| week | topics | skills |
|---|---|---|
| 1 | [admin, python review] | process |
| 2 | Loading data, Python review | [access, prepare, summarize] |
| 3 | Exploratory Data Analysis | [summarize, visualize] |
| 4 | Data Cleaning | [prepare, summarize, visualize] |
| 5 | Databases, Merging DataFrames | [access, construct, summarize] |
| 6 | Modeling, classification performance metrics, cross validation | [evaluate] |
| 7 | Naive Bayes, decision trees | [classification, evaluate] |
| 8 | Regression | [regression, evaluate] |
| 9 | Clustering | [clustering, evaluate] |
| 10 | SVM, parameter tuning | [optimize, tools] |


(achievement-definitions)=
## Achievement Definitions


The table below describes how your participation, assignments, and portfolios will be assessed to earn each achievement. The keyword for each skill is a short name that will be used to refer to skills throughout the course materials; the full description of the skill is in this table.

In [4]:

rubric_df.replace({None:'TBD'},inplace=True)
rubric_df.rename(columns={'mastery':'Level 3',
              'compentent':'Level 2',
              'aware':'Level 1'}, inplace=True)

rubric_df[['skill','Level 1','Level 2','Level 3']]

| keyword | skill | Level 1 | Level 2 | Level 3 |
|---|---|---|---|---|
| python | pythonic code writing | python code that mostly runs, occasional pep8 adherence | python code that reliably runs, frequent pep8 adherence | reliable, efficient, pythonic code that consistently adheres to pep8 |
| process | describe data science as a process | Identify basic components of data science | Describe and define each stage of the data science process | Compare different ways that data science can facilitate decision making |
| access | access data in multiple formats | load data from at least one format; identify the most common data formats | Load data for processing from the most common formats; Compare and contrast most common formats | access data from both common and uncommon formats and identify best practices for formats in different contexts |
| construct | construct datasets from multiple sources | identify what should happen to merge datasets or when they can be merged | apply basic merges | merge data that is not automatically aligned |
| summarize | Summarize and describe data | Describe the shape and structure of a dataset in basic terms | compute standard summary statistics of a whole dataset and grouped data | Compute and interpret various summary statistics of subsets of data |
| visualize | Visualize data | identify plot types, generate basic plots from pandas | generate multiple plot types with complete labeling with pandas and seaborn | generate complex plots with pandas and plotting libraries and customize with matplotlib or additional parameters |
| prepare | prepare data for analysis | identify if data is or is not ready for analysis, potential problems with data | apply data reshaping, cleaning, and filtering as directed | apply data reshaping, cleaning, and filtering manipulations reliably and correctly by assessing data as received |
| evaluate | Evaluate model performance | Explain basic performance metrics for different data science tasks | Apply and interpret basic model evaluation metrics to a held out test set | Evaluate a model with multiple metrics and cross validation |
| classification | Apply classification | identify and describe what classification is, apply pre-fit classification models | fit, apply, and interpret preselected classification model to a dataset | fit and apply classification models and select appropriate classification models for different contexts |
| regression | Apply Regression | identify what data that can be used for regression looks like | fit and interpret linear regression models | fit and explain regularized or nonlinear regression |


In [5]:

assignment_dummies  = pd.get_dummies(rubric_df['assignments'].apply(pd.Series).stack()).groupby(level=0).sum()
assignment_dummies['# Assignments'] = assignment_dummies.sum(axis=1)
col_rename = {float(i):'A' + str(i) for i in range(1,14)}
assignment_dummies.rename(columns =col_rename,inplace=True)

portfolio_dummies  = pd.get_dummies(rubric_df['portfolios'].apply(pd.Series).stack()).groupby(level=0).sum()
col_rename = {float(i):'P' + str(i) for i in range(1,5)}
portfolio_dummies.rename(columns =col_rename,inplace=True)


rubric_df = pd.concat([rubric_df,assignment_dummies, portfolio_dummies],axis=1)

assignment_cols =  ['A'+ str(i) for i in range(1,14)] + ['# Assignments']

portfolio_cols = [ 'Level 3'] + ['P' + str(i) for i in range(1,5)]

(assignment-skills)=
### Assignments and Skills

Using the keywords from the table above, this table shows which assignments let you demonstrate which skills, along with the total number of assignments that assess each skill. That total is the number of opportunities you have to earn Level 2 while still preserving 2 chances to earn Level 3 for each skill.

In [6]:
rubric_df[assignment_cols]

| keyword | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | # Assignments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| python | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| process | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 7 |
| access | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| construct | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 3 |
| summarize | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 11 |
| visualize | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 10 |
| prepare | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| evaluate | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 5 |
| classification | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 2 |
| regression | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 |


```{warning}
**process** achievements accumulate a little more slowly. Prior to portfolio check 1, only Level 1 can be earned.  Portfolio check 1 is the first chance to earn Level 2 for process; Level 3 can then be earned on portfolio check 2 or later.
```

(portfolioskills)=
### Portfolios and Skills

The objective of your portfolio submissions is to earn Level 3 achievements. The following table shows what Level 3 looks like for each skill and identifies the portfolio submissions in which you can earn Level 3 for that skill.

In [7]:
rubric_df[portfolio_cols]

| keyword | Level 3 | P1 | P2 | P3 | P4 |
|---|---|---|---|---|---|
| python | reliable, efficient, pythonic code that consistently adheres to pep8 | 1 | 1 | 0 | 1 |
| process | Compare different ways that data science can facilitate decision making | 0 | 1 | 1 | 1 |
| access | access data from both common and uncommon formats and identify best practices for formats in different contexts | 1 | 1 | 0 | 1 |
| construct | merge data that is not automatically aligned | 1 | 1 | 0 | 1 |
| summarize | Compute and interpret various summary statistics of subsets of data | 1 | 1 | 0 | 1 |
| visualize | generate complex plots with pandas and plotting libraries and customize with matplotlib or additional parameters | 1 | 1 | 0 | 1 |
| prepare | apply data reshaping, cleaning, and filtering manipulations reliably and correctly by assessing data as received | 1 | 1 | 0 | 1 |
| evaluate | Evaluate a model with multiple metrics and cross validation | 0 | 1 | 1 | 1 |
| classification | fit and apply classification models and select appropriate classification models for different contexts | 0 | 1 | 1 | 1 |
| regression | fit and explain regularized or nonlinear regression | 0 | 1 | 1 | 1 |


### Detailed Checklists

In [8]:
ach = yml_df('../_data/achievments.yml')

entry = '#### **{name}** \n_{description}_ \n\n{components_checklist}'

def issue_action_str(row):
    rep_dict = {}
    if isinstance(row['components'], list):
        rep_dict['components_checklist'] = '\n- '.join(['']+row['components'])
    else:
        rep_dict['components_checklist'] = ''

    rep_dict['name'] = row['name']
    rep_dict['description'] = row['description']
    return entry.format_map(rep_dict)

checklist_list = ach.apply(issue_action_str,axis=1).values
all_checklists = '\n\n'.join(list(checklist_list))

display(Markdown(all_checklists))

#### **python-level1** 
_python code that mostly runs, occasional pep8 adherence_


- logical use of control structures
- callable functions
- correct calls to functions
- correct use of variables
- use of logical operators

#### **python-level2** 
_python code that reliably runs, frequent pep8 adherence_


- descriptive variable names
- pythonic loops
- efficient use of return vs side effects in functions
- correct, effective use of builtin python iterable types (lists & dictionaries)

#### **python-level3** 
_reliable, efficient, pythonic code that consistently adheres to pep8_ 


- pep8 adherent variable, file, class, and function names
- effective use of multi-paradigm abilities for efficiency gains
- easy to read code that adheres to readability over other rules

#### **process-level1** 
_Identify basic components of data science_ 


- identify component disciplines OR
- identify phases

#### **process-level2** 
_Describe and define each stage of the data science process_ 


- correctly defines stages
- identifies stages in use
- describes general goals as well as specific processes

#### **process-level3** 
_Compare different ways that data science can facilitate decision making_ 


- describes exceptions to process and iteration in process
- connects choices at one phase to impacts in other phases
- connects data science steps to real world decisions

#### **access-level1** 
_load data from at least one format; identify the most common data formats_ 


- use at least one pandas `read_` function correctly
- name common types
- describe the structure of common types

#### **access-level2** 
_Load data for processing from the most common formats; Compare and contrast most common formats_


- load data from at least two of (.csv, .tsv, .dat, database, .json)
- describe advantages and disadvantages of most common types
- describe how most common types are different

#### **access-level3** 
_access data from both common and uncommon formats and identify best practices for formats in different contexts_ 


- load data from at least 1 uncommon format
- describe when one format is better than another

#### **construct-level1** 
_identify what should happen to merge datasets or when they can be merged_ 


- identify what the structure of a merged dataset should be (size, shape, columns)
- identify when datasets can or cannot be merged

#### **construct-level2** 
_apply basic merges_ 


- use 3 different types of merges
- choose the right type of merge for realistic scenarios
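
As a rough illustration of what "different types of merges" can mean in pandas, here is a minimal sketch using two toy tables. The `students` and `grades` DataFrames and their columns are hypothetical, invented only for this example; they are not course data.

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration only
students = pd.DataFrame({'student_id': [1, 2, 3],
                         'name': ['Ana', 'Ben', 'Cy']})
grades = pd.DataFrame({'student_id': [2, 3, 4],
                       'grade': [90, 85, 70]})

# inner: keep only the ids that appear in both tables
inner = pd.merge(students, grades, on='student_id', how='inner')

# left: keep every row of students; grade is missing where there is no match
left = pd.merge(students, grades, on='student_id', how='left')

# outer: keep the ids that appear in either table
outer = pd.merge(students, grades, on='student_id', how='outer')

print(len(inner), len(left), len(outer))
```

Comparing the row counts of the three results is one quick way to check that a merge type behaves the way you expected for a given scenario.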

#### **construct-level3** 
_merge data that is not automatically aligned_ 


- manipulate data to make it mergeable
- identify how to combine data from many sources to answer a question
- implement steps to combine data from multiple sources

#### **summarize-level1** 
_Describe the shape and structure of a dataset in basic terms_ 


- use attributes to produce a description of a dataset
- display parts of a dataset

#### **summarize-level2** 
_compute and interpret standard summary statistics of a whole dataset and grouped data_


- compute descriptive statistics on whole datasets
- apply individual statistics to datasets
- group data by a categorical variable for analysis
- apply split-apply-combine paradigm to analyze data
- interpret statistics on whole datasets
- interpret statistics on subsets of data
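
The split-apply-combine item above can be sketched with a pandas `groupby`. This is only one possible illustration; the `species` and `mass` columns are hypothetical and the values are made up.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration only
df = pd.DataFrame({'species': ['a', 'a', 'b', 'b', 'b'],
                   'mass': [1.0, 3.0, 2.0, 4.0, 6.0]})

# A statistic on the whole dataset
overall_mean = df['mass'].mean()

# Split by the categorical variable, apply the mean to each group,
# and combine the per-group results into one Series indexed by species
group_means = df.groupby('species')['mass'].mean()

print(overall_mean)
print(group_means)
```

Interpreting the result means comparing the per-group values with each other and with the whole-dataset value, not only computing them.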

#### **summarize-level3** 
_Compute and interpret various summary statistics of subsets of data_ 


- produce custom aggregation tables to summarize datasets
- compute multivariate summary statistics by grouping
- compute custom calculations on datasets

#### **visualize-level1** 
_identify plot types, generate basic plots from pandas_ 


- generate at least two types of plots with pandas
- identify plot types by name
- interpret basic information from plots

#### **visualize-level2** 
_generate multiple plot types with complete labeling with pandas and seaborn_ 


- generate at least 3 types of plots
- use correct, complete, legible labeling on plots
- plot using both pandas and seaborn
- interpret multiple types of plots to draw conclusions

#### **visualize-level3** 
_generate complex plots with pandas and plotting libraries and customize with matplotlib or additional parameters_ 


- use at least two libraries to plot
- generate figures with subplots
- customize the display of a plot to be publication ready
- interpret plot types and explain them for novices
- choose appropriate plot types to convey information
- explain why common plotting best practices are effective

#### **prepare-level1** 
_identify if data is or is not ready for analysis, potential problems with data_ 


- identify problems in a dataset
- anticipate how potential data setups will interfere with analysis
- describe the structure of tidy data
- label data as tidy or not

#### **prepare-level2** 
_apply data reshaping, cleaning, and filtering as directed_ 


- reshape data to be analyzable as directed
- filter data as directed
- rename columns as directed
- rename values to make data more analyzable
- handle missing values in at least two ways
- transform data to tidy format
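
For the missing-values item, two common options in pandas are dropping incomplete rows and filling gaps with a computed value. The sketch below assumes a hypothetical `temp` column with one missing entry; which option is appropriate depends on the dataset and the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing value, invented for illustration
df = pd.DataFrame({'city': ['X', 'Y', 'Z'],
                   'temp': [20.0, np.nan, 24.0]})

# Option 1: drop the rows where temp is missing
dropped = df.dropna(subset=['temp'])

# Option 2: fill the missing temp with the mean of the observed values
filled = df.fillna({'temp': df['temp'].mean()})

print(dropped)
print(filled)
```

Both calls return new DataFrames, so the original `df` still contains the missing value afterwards.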

#### **prepare-level3** 
_apply data reshaping, cleaning, and filtering manipulations reliably and correctly by assessing data as received_ 


- identify issues in a dataset and correctly implement solutions
- convert variable representation by changing types
- change variable representation using one hot encoding
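
The two representation changes listed above might look like the following in pandas. This is a minimal sketch on a hypothetical table: `size` is stored as text and converted to integers, and `color` is one hot encoded with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
df = pd.DataFrame({'size': ['1', '2', '3'],
                   'color': ['red', 'blue', 'red']})

# Change the type: the sizes were read as strings, so cast them to integers
df['size'] = df['size'].astype(int)

# One hot encode the categorical column: one indicator column per category
encoded = pd.get_dummies(df, columns=['color'])

print(encoded)
```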

#### **evaluate-level1** 
_Explain basic performance metrics for different data science tasks_ 


- define at least two performance metrics
- describe how those metrics compare or compete

#### **evaluate-level2** 
_Apply and interpret basic model evaluation metrics to a held out test set_ 


- create test train splits
- describe why test train splits are important
- apply at least three performance metrics to models
- choose at least one appropriate metric for each modeling task
- interpret at least three metrics
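
To make the ideas of a held out test set and of multiple metrics concrete, here is a plain-Python sketch. The helper names `split_indices` and `classification_metrics`, the labels, and the predictions are all hypothetical; in practice you would likely rely on a library such as scikit-learn rather than writing these by hand.

```python
import random


def split_indices(n_rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = max(1, round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        'accuracy': correct / len(pairs),
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }


# Rows in test_idx are never used for fitting, only for evaluation
train_idx, test_idx = split_indices(8)

# Hypothetical true labels and model predictions on held out rows
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here the three metrics disagree (accuracy 0.625, precision 0.6, recall 0.75), which is why interpreting more than one of them gives a fuller picture of a model.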

#### **evaluate-level3** 
_Evaluate a model with multiple metrics and cross validation_ 


- explain cross validation
- describe why cross validation is important
- identify appropriate metrics for different types of modeling tasks
- use multiple metrics together to create a more complete description of a model's performance

#### **classification-level1** 
_identify and describe what classification is, apply pre-fit classification models_ 


- describe what classification is
- describe what a dataset must look like for classification
- identify applications of classification in the real world

#### **classification-level2** 
_fit, apply, and interpret preselected classification model to a dataset_ 


- fit a classification model
- apply a classification model to obtain predictions
- interpret the predictions of a classification model
- examine parameters of at least one fit classifier to explain how the prediction is made
- differentiate between model fitting and generating predictions
- evaluate how model parameters impact model performance

#### **classification-level3** 
_fit and apply classification models and select appropriate classification models for different contexts_ 


- choose appropriate classifiers based on application context
- explain how at least 3 different classifiers make predictions
- evaluate how model parameters impact model performance and justify choices when tradeoffs are necessary

#### **regression-level1** 
_identify what data that can be used for regression looks like_ 


- identify data that is/not appropriate for regression
- describe univariate linear regression
- identify applications of regression in the real world

#### **regression-level2** 
_fit and interpret linear regression models_ 


- fit univariate linear regression models
- interpret linear regression models
- fit multivariate linear regression models

#### **regression-level3** 
_fit and explain regularized or nonlinear regression_


- fit nonlinear or regularized regression models
- interpret and explain nonlinear or regularized regression models

#### **clustering-level1** 
_describe what clustering is_ 


- differentiate clustering from classification and regression
- identify applications of clustering in the real world

#### **clustering-level2** 
_apply basic clustering_ 


- fit k-means
- interpret k-means

#### **clustering-level3** 
_apply multiple clustering techniques, and interpret results_ 


- apply at least two clustering techniques
- explain the differences between two clustering models

#### **optimize-level1** 
_Identify when model parameters need to be optimized_ 


- identify when parameters might impact model performance

#### **optimize-level2** 
_Optimize basic model parameters such as model order_ 


- automatically optimize multiple parameters
- evaluate potential tradeoffs
- interpret optimization results in context

#### **optimize-level3** 
_Select optimal parameters based on multiple quantitative criteria and automate parameter tuning_


- optimize models based on multiple metrics
- describe when one model vs another is most appropriate

#### **compare-level1** 
_Qualitatively compare model classes_ 


- compare models within the same task on complexity

#### **compare-level2** 
_Compare model classes in specific terms and fit models in terms of traditional model performance metrics_ 


- compare models in multiple terms
- interpret cross model comparisons in context

#### **compare-level3** 
_Evaluate tradeoffs between different model comparison types_ 


- compare models on multiple criteria
- compare optimized models
- jointly interpret optimization result and compare models
- compare models on quantitative and qualitative measures

#### **representation-level1** 
_Identify options for representing text and categorical data in many contexts_ 


- describe the basic goals for changing the representation of data

#### **representation-level2** 
_Apply at least one representation to transform unstructured or inappropriately structured data for model fitting or summarizing_


- transform text or image data for use with ML

#### **representation-level3** 
_apply transformations in different contexts OR compare and contrast multiple representations of a single type of data in terms of model performance_


- transform both text and image data for use in ML
- evaluate the impact of representation on model performance

#### **workflow-level1** 
_Solve well-structured, fully specified problems with a single tool pipeline_


- pseudocode out the steps to answer basic data science questions

#### **workflow-level2** 
_Solve well-structured, open-ended problems, apply common structure to learn new features of standard tools_


- plan and execute an answer to a real, open-ended question
- describe the necessary steps and tools

#### **workflow-level3** 
_Independently scope and solve realistic data science problems OR independently learn related tools and describe strengths and weaknesses of common tools_


- scope and solve realistic data science problems
- compare different data science tool stacks