# 2612 Advanced Programming for Data Science

---

## Quick guide to Course Structure

1. [Object oriented programming](01-ObjectOrientedProgramming.ipynb)
1. [Documentation and Testing](02-DocumentationAndTesting.ipynb)
1. [Version Control with git](03-VersionControl.ipynb)
1. [Software Deployment](04-SoftwareDeployment.ipynb)
1. [Time-series Analysis](05-TimeseriesAnalysis.ipynb)
1. [Scalable computing: Big data, Dask, ML in big data](06-Scale.ipynb)

---

<div class="alert alert-danger">
    <b>This course assumes basic knowledge of Python.</b>
</div>

---

## Introduction

Advanced Programming encompasses concepts and procedures beyond what you may learn in introductory tutorials. The goal of the course is to show you how a software project is handled in a corporate environment by creating a robust workflow. A robust workflow is a collection of trustworthy methods by which you build, piece by piece, a reusable, reproducible, and reliable set of software tools.

## Why Python?

The Python language has proven to be a powerful tool. It is user-friendly and widespread. You have probably already faced the question:

[__Is Python the language for this project?__](https://www.ishir.com/blog/36749/top-75-programming-languages-in-2021-comparison-and-by-type.htm)

However, as we will see, its popularity does not make Python a silver bullet for every coding challenge. Python is also in constant evolution: changes to the language are discussed through [__Python Enhancement Proposals__](https://www.python.org/dev/peps/), or __PEPs__.

The major weakness of Python is its *scalability*. We will address that later on.

### Why Jupyter?

You are free to use whatever IDE you feel most comfortable with. The class demos use JupyterLab, which behaves mostly the same on every OS. [__JupyterLab__](https://jupyter.org/) also offers only minimal text-editing support, so we can develop coding skills without "training wheels".

### As per the Syllabus 

In this unit, students get acquainted with advanced concepts in programming. More complex
concepts allow for higher level abstractions in code. Code abstractions like objects and classes
allow increased functionality, faster deployment, and enable collaborations in software projects.
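
As a minimal sketch of what such an abstraction can look like, the class below bundles some data with the operations that act on it. The class name `Measurements` and its methods are hypothetical, chosen only for illustration; they are not part of the course material.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Measurements:
    """Hypothetical container for a named series of numeric readings."""

    name: str
    values: List[float]

    def average(self) -> float:
        """Return the arithmetic mean of the stored values."""
        return mean(self.values)

    def centred(self) -> "Measurements":
        """Return a new object whose values are shifted to have zero mean."""
        centre = self.average()
        return Measurements(self.name, [v - centre for v in self.values])


# Callers work with the object and never touch the underlying list directly.
temperatures = Measurements("temperature", [20.0, 22.0, 24.0])
shifted = temperatures.centred()
```

Code that uses `Measurements` only depends on `average()` and `centred()`, so the internal storage could change later without breaking collaborators' code.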

The __life cycle management__ of a data science product in a corporate environment must follow
specific guidelines. The guidelines, while built on top of common programming frameworks,
differ in the specificities of the mathematical models used. Present-day approaches to
programming and project management are introduced and explored.

### Modules

1. Introduction to advanced programming concepts. Concepts for code reusability, style guides, and linting. Learn by doing: example of time series analysis with pandas.
2. Documentation and Testing. Unit tests, functional tests, and integration tests with the pytest framework.
3. Version control with git. Automating linting and testing with Continuous Integration/Continuous Delivery.
4. Software deployment: virtual environments with conda and Docker.
5. Time-Series: analysis, decomposition, and forecasting.
6. Working at scale: machine learning in Big Data with Dask.

Classes may be delayed by technical difficulties. If that happens, the content of one module will continue into the next, at a brisker pace. Modules 3 and 5 are also designed to serve as buffer classes in case of major setbacks.

---
### The skillset you will acquire

* Reliability
    * Write legible and structured code.
    * Document code and projects.
* Resilience
    * Learn coordination, collaboration, and code versioning best practices.
    * Develop tests for code quality assurance.
    * Create and manage virtual environments in python.
* Durability
    * Learn the fundamentals of code versioning control.
    * Automate linting and testing.
* Scale
    * Understand concepts of Big Data.
    * Learn the fundamentals of ML at large scales.
    
But most importantly: be critical of your own way of thinking via software development.

__*And understand the huge amount of pain that is developing software.*__

---

## Evaluation

Following university standards, the overall evaluation consists of 3 parts:
    
* Class participation through 3 quizzes (20%)
    * At the end of weeks 3, 4, and 6.
    * Only the best 2 of the 3 grades count.
* Group project (30%)
    * First part presented at the end of week 2.
    * Second part presented at the end of week 4.
* Final exam (50%)
    * After all classes.

Students need to take at least 2 of the 3 class quizzes.

You will develop a coding project using a public dataset and create a data analysis
pipeline that must pass a series of tests. The tests will be executed in the pytest framework. The code of the project will be stored in a repository together with instructions on how to generate an appropriate virtual environment. Anyone with the instructions should be able to clone the repository, install the virtual environment, and run the analysis developed by the students.
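
To give a rough idea of what "a pipeline that must pass a series of tests" can mean in practice, here is a minimal sketch. The functions `clean_records` and `total_value` and the record layout are hypothetical stand-ins for a real analysis step; the actual project tests may look quite different.

```python
def clean_records(records):
    """Drop records whose 'value' field is missing (None)."""
    return [r for r in records if r.get("value") is not None]


def total_value(records):
    """Sum the 'value' field over the cleaned records."""
    return sum(r["value"] for r in clean_records(records))


# pytest collects functions whose names start with `test_`
# and reports a failure whenever a plain `assert` does not hold.
def test_clean_records_drops_missing_values():
    raw = [{"value": 1.5}, {"value": None}, {"value": 2.5}]
    assert clean_records(raw) == [{"value": 1.5}, {"value": 2.5}]


def test_total_value_ignores_missing_values():
    raw = [{"value": 1.5}, {"value": None}, {"value": 2.5}]
    assert total_value(raw) == 4.0
```

Assuming the code is saved in a file such as `test_pipeline.py` (a hypothetical name), running `pytest` in that directory should discover and run both tests.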

---

## Bibliography

* [The replication crisis](https://en.wikipedia.org/wiki/Replication_crisis)
* [Data Lifecycle Management](https://medium.com/jagoanhosting/what-is-data-lifecycle-management-and-what-phases-would-it-pass-through-94dbd207ff54)
* The contents of these notebooks
* [PEP 8](https://www.python.org/dev/peps/pep-0008/)
* [Flake8](https://flake8.pycqa.org/en/latest/)
* [Sphinx](https://www.sphinx-doc.org/en/master/)
* [git](https://git-scm.com/)
* [pytest](https://docs.pytest.org/en/stable/)
* [Pip](https://pip.pypa.io/en/stable/)
* [Conda](https://docs.conda.io/en/latest/)
* [Virtualenv](https://virtualenv.pypa.io/en/latest/)
* Time series analysis
* [Dask](https://dask.org/)
* Last year we tackled [Apache Spark](https://spark.apache.org/). The notes are still in the repository, if you are curious.