Program Description and Certificates for the Professional Certificate in Data Science

rahulthairani/Data-Science-Professional-Certificate

Data-Science-Professional-Certificate

Data Science Professional Certificate - HarvardX

What you will learn

  • Fundamental R programming skills

  • Statistical concepts such as probability, inference, and modeling, and how to apply them in practice

  • Experience with the tidyverse, including data visualization with ggplot2 and data wrangling with dplyr

  • Familiarity with essential tools for practicing data scientists, such as Unix/Linux, git and GitHub, and RStudio

  • How to implement machine learning algorithms

  • In-depth knowledge of fundamental data science concepts through motivating real-world case studies

Program Overview

The demand for skilled data science practitioners in industry, academia, and government is rapidly growing. The HarvardX Data Science program prepares you with the necessary knowledge base and useful skills to tackle real-world data analysis challenges. The program covers concepts such as probability, inference, regression, and machine learning and helps you develop an essential skill set that includes R programming, data wrangling with dplyr, data visualization with ggplot2, file organization with Unix/Linux, version control with git and GitHub, and reproducible document preparation with RStudio.

In each course, we use motivating case studies, ask specific questions, and learn by answering these through data analysis. Case studies include: Trends in World Health and Economics, US Crime Rates, The Financial Crisis of 2007-2008, Election Forecasting, Building a Baseball Team (inspired by Moneyball), and Movie Recommendation Systems.

Throughout the program, we will be using the R software environment. You will learn R, statistical concepts, and data analysis techniques simultaneously. We believe that you can better retain R knowledge when you learn how to solve a specific problem.

Data Science: R Basics

Build a foundation in R and learn how to wrangle, analyze, and visualize data.

About this course

The first in our Professional Certificate Program in Data Science, this course will introduce you to the basics of R programming. You can better retain R when you learn it to solve a specific problem, so you'll use a real-world dataset about crime in the United States. You will learn the R skills needed to answer essential questions about differences in crime across the different states.

We'll cover R's functions and data types, then tackle how to operate on vectors and when to use advanced functions like sorting. You'll learn how to apply general programming features like "if-else" and "for loop" commands, and how to wrangle, analyze, and visualize data.

Rather than covering every R skill you might need, you'll build a strong foundation to prepare you for the more in-depth courses later in the series, where we cover concepts like probability, inference, regression, and machine learning. We help you develop a skill set that includes R programming, data wrangling with dplyr, data visualization with ggplot2, file organization with Unix/Linux, version control with git and GitHub, and reproducible document preparation with RStudio.

The demand for skilled data science practitioners is rapidly growing, and this series prepares you to tackle real-world data analysis challenges.

What you'll learn

  • Basic R syntax

  • Foundational R programming concepts such as data types, vector arithmetic, and indexing

  • How to perform operations in R including sorting, data wrangling using dplyr, and making plots

1 - 2 hours per week for 8 weeks

Data Science: Visualization

Learn basic data visualization principles and how to apply them using ggplot2.

About this course

As part of our Professional Certificate Program in Data Science, this course covers the basics of data visualization and exploratory data analysis. We will use three motivating examples and ggplot2, a data visualization package for the statistical programming language R. We will start with simple datasets and then graduate to case studies about world health, economics, and infectious disease trends in the United States.

We'll also be looking at how mistakes, biases, systematic errors, and other unexpected problems often lead to data that should be handled with care. The fact that it can be difficult or impossible to notice a mistake within a dataset makes data visualization particularly important.

The growing availability of informative datasets and software tools has led to increased reliance on data visualizations across many areas. Data visualization provides a powerful way to communicate data-driven findings, motivate analyses, and detect flaws. This course will give you the skills you need to leverage data to reveal valuable insights and advance your career.

What you'll learn

  • Data visualization principles

  • How to communicate data-driven findings

  • How to use ggplot2 to create custom plots

  • The weaknesses of several widely-used plots and why you should avoid them

1 - 2 hours per week for 8 weeks

Data Science: Probability

Learn probability theory -- essential for a data scientist -- using a case study on the financial crisis of 2007-2008.

About this course

In this course, part of our Professional Certificate Program in Data Science, you will learn valuable concepts in probability theory. The motivation for this course is the circumstances surrounding the financial crisis of 2007-2008. Part of what caused this financial crisis was that the risk of some securities sold by financial institutions was underestimated. To begin to understand this very complicated event, we need to understand the basics of probability.

We will introduce important concepts such as random variables, independence, Monte Carlo simulations, expected values, standard errors, and the Central Limit Theorem. These statistical concepts are fundamental to conducting statistical tests on data and to understanding whether the patterns in the data you are analyzing are likely due to an experimental method or to chance.
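As a rough illustration of how these ideas fit together, here is a minimal Monte Carlo sketch. The course itself works in R; this version uses Python's standard library, and the function names, the win probability of 0.3, and the number of draws are made-up assumptions rather than course material. It simulates the sum of many independent 0/1 draws and compares the simulated average and spread with the expected value and standard error that probability theory predicts.

```python
import random
import statistics

def simulate_sum_of_draws(n_draws, p_win, rng):
    """Sum of n_draws independent draws, each 1 with probability p_win, else 0."""
    return sum(1 if rng.random() < p_win else 0 for _ in range(n_draws))

def monte_carlo(n_reps=2000, n_draws=100, p_win=0.3, seed=1):
    """Repeat the experiment n_reps times; return the mean and SD of the sums."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sums = [simulate_sum_of_draws(n_draws, p_win, rng) for _ in range(n_reps)]
    return statistics.mean(sums), statistics.pstdev(sums)

# Hypothetical parameters chosen only for illustration.
n_draws, p_win = 100, 0.3
mc_mean, mc_se = monte_carlo(n_draws=n_draws, p_win=p_win)

# Theoretical expected value and standard error of the sum.
theory_mean = n_draws * p_win
theory_se = (n_draws * p_win * (1 - p_win)) ** 0.5

print(f"simulated mean {mc_mean:.2f} vs expected value {theory_mean:.2f}")
print(f"simulated SD   {mc_se:.2f} vs standard error {theory_se:.2f}")
```

With enough repetitions the simulated values should land close to the theoretical ones, and a histogram of the simulated sums would look approximately normal, which is what the Central Limit Theorem suggests.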

Probability theory is the mathematical foundation of statistical inference, which is indispensable for analyzing data affected by chance, and thus essential for data scientists.

What you'll learn

  • Important concepts in probability theory including random variables and independence

  • How to perform a Monte Carlo simulation

  • The meaning of expected values and standard errors and how to compute them in R

  • The importance of the Central Limit Theorem

1 - 2 hours per week for 8 weeks

Data Science: Inference and Modeling

Learn inference and modeling, two of the most widely used statistical tools in data analysis.

About this course

Statistical inference and modeling are indispensable for analyzing data affected by chance, and thus essential for data scientists. In this course, you will learn these key concepts through a motivating case study on election forecasting.

This course will show you how inference and modeling can be applied to develop the statistical approaches that make polls an effective tool, and we'll show you how to do this using R. You will learn the concepts necessary to define estimates and margins of error, and how to use these to make reasonably accurate predictions and to estimate the precision of your forecast.

Once you learn this, you will be able to understand two concepts that are ubiquitous in data science: confidence intervals and p-values. Then, to understand statements about the probability of a candidate winning, you will learn about Bayesian modeling. Finally, at the end of the course, we will put it all together to recreate a simplified version of an election forecast model and apply it to the 2016 election.
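To make the estimate, margin of error, and confidence interval concrete, here is a small sketch for a single poll. It is written in Python rather than the course's R. The poll counts are invented, and the function name and the use of the normal approximation with z = 1.96 for a 95% interval are assumptions made for illustration.

```python
import math

def poll_confidence_interval(n_yes, n_total, z=1.96):
    """Estimate a proportion from a poll and return (estimate, margin, interval).

    Uses the normal approximation: the standard error of the sample
    proportion is sqrt(p_hat * (1 - p_hat) / n_total).
    """
    p_hat = n_yes / n_total
    se = math.sqrt(p_hat * (1 - p_hat) / n_total)
    margin = z * se
    return p_hat, margin, (p_hat - margin, p_hat + margin)

# Hypothetical poll: 520 of 1000 respondents favor a candidate.
p_hat, margin, (lower, upper) = poll_confidence_interval(520, 1000)
print(f"estimate {p_hat:.3f}, margin of error {margin:.3f}")
print(f"95% confidence interval: ({lower:.3f}, {upper:.3f})")
```

In this made-up poll the interval includes 0.5, so the poll alone would not be enough to call the race.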

What you'll learn

  • The concepts of populations, parameters, estimates, standard errors, and margins of error, and how to use them to make predictions about data

  • How to use models to aggregate data from different sources

  • The very basics of Bayesian statistics and predictive modeling

1 - 2 hours per week for 8 weeks

Data Science: Productivity Tools

Keep your projects organized and produce reproducible reports using GitHub, git, Unix/Linux, and RStudio.

About this course

A typical data analysis project may involve several parts, each including several data files and different scripts with code. Keeping all this organized can be challenging.

Part of our Professional Certificate Program in Data Science, this course explains how to use Unix/Linux as a tool for managing files and directories on your computer and how to keep the file system organized. You will be introduced to the version control system git, a powerful tool for keeping track of changes in your scripts and reports. We also introduce you to GitHub and demonstrate how you can use this service to keep your work in a repository that facilitates collaboration.

Finally, you will learn to write reports in R Markdown, which permits you to incorporate text and code into a document. We'll put it all together using the powerful integrated development environment RStudio.

What you'll learn

  • How to use Unix/Linux to manage your file system

  • How to perform version control with git

  • How to start a repository on GitHub

  • How to leverage the many useful features provided by RStudio

1 - 2 hours per week for 8 weeks

Data Science: Wrangling

Learn to process and convert raw data into formats needed for analysis.

About this course

In this course, part of our Professional Certificate Program in Data Science, we cover several standard steps of the data wrangling process like importing data into R, tidying data, string processing, HTML parsing, working with dates and times, and text mining. Rarely are all these wrangling steps necessary in a single analysis, but a data scientist will likely face them all at some point.

Very rarely is data easily accessible in a data science project. It's more likely for the data to be in a file, a database, or extracted from documents such as web pages, tweets, or PDFs. In these cases, the first step is to import the data into R and tidy the data, using the tidyverse package. The process of converting data from its raw form to tidy form is called data wrangling.
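As a rough picture of what "tidying" means, the sketch below reshapes a small wide table (one column per year) into a tidy one (one row per observation). The course does this in R with the tidyverse; this is a plain Python illustration, and the helper name `pivot_longer`, the column names, and the values are all hypothetical.

```python
def pivot_longer(rows, id_col, value_cols, names_to, values_to):
    """Turn wide rows into tidy rows: one output row per (id, column) pair."""
    tidy = []
    for row in rows:
        for col in value_cols:
            tidy.append({id_col: row[id_col], names_to: col, values_to: row[col]})
    return tidy

# Made-up wide data: one row per country, one column per year.
wide = [
    {"country": "A", "2010": 1.5, "2011": 1.6},
    {"country": "B", "2010": 2.0, "2011": 1.9},
]

tidy = pivot_longer(wide, "country", ["2010", "2011"], "year", "value")
for record in tidy:
    print(record)
```

In the tidy form, each variable (country, year, value) has its own column, which tends to make filtering, grouping, and plotting more straightforward.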

This process is a critical step for any data scientist. Knowing how to wrangle and clean data will enable you to make critical insights that would otherwise be hidden.

What you'll learn

  • Importing data into R from different file formats

  • Web scraping

  • How to tidy data using the tidyverse to better facilitate analysis

  • String processing with regular expressions (regex)

  • Wrangling data using dplyr

  • How to work with date and time formats

  • Text mining

1 - 2 hours per week for 8 weeks

Data Science: Linear Regression

Learn how to use R to implement linear regression, one of the most common statistical modeling approaches in data science.

About this course

Linear regression is commonly used to quantify the relationship between two or more variables. It is also used to adjust for confounding. This course, part of our Professional Certificate Program in Data Science, covers how to implement linear regression and adjust for confounding in practice using R.

In data science applications, it is very common to be interested in the relationship between two or more variables. The motivating case study we examine in this course relates to the data-driven approach used to construct baseball teams described in Moneyball. We will try to determine which measured outcomes best predict baseball runs by using linear regression.
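For a sense of what fitting a regression line involves, here is a minimal least-squares sketch for one predictor. The course implements this in R; the Python version below is only an illustration. The function name is hypothetical, and the toy numbers are invented and are not baseball data.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up toy data, roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
print(f"slope {slope:.2f}, intercept {intercept:.2f}")
```

The slope estimates how much the outcome changes, on average, for a one-unit increase in the predictor. On its own it says nothing about whether a third variable is driving both.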

We will also examine confounding, where extraneous variables affect the relationship between two or more other variables, leading to spurious associations. Linear regression is a powerful technique for removing confounders, but it is not a magical process. It is essential to understand when it is appropriate to use, and this course will teach you when to apply this technique.

What you'll learn

  • How linear regression was originally developed by Galton

  • What confounding is and how to detect it

  • How to examine the relationships between variables by implementing linear regression in R

1 - 2 hours per week for 8 weeks

Data Science: Machine Learning

Build a movie recommendation system and learn the science behind one of the most popular and successful data science techniques.

About this course

Perhaps the most popular data science methodologies come from machine learning. What distinguishes machine learning from other computer-guided decision processes is that it builds prediction algorithms using data. Some of the most popular products that use machine learning include the handwriting readers implemented by the postal service, speech recognition, movie recommendation systems, and spam detectors.

In this course, part of our Professional Certificate Program in Data Science, you will learn popular machine learning algorithms, principal component analysis, and regularization by building a movie recommendation system.

You will learn about training data, and how to use a set of data to discover potentially predictive relationships. As you build the movie recommendation system, you will learn how to train algorithms using training data so you can predict the outcome for future datasets. You will also learn about overtraining and techniques to avoid it such as cross-validation. All of these skills are fundamental to machine learning.
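The idea behind cross-validation can be sketched in a few lines. The example below is in Python rather than the course's R. The ratings are invented, the function names are hypothetical, and the "model" is deliberately trivial: it predicts the average of the training ratings. The sketch splits the data into k folds, trains on k-1 of them, and measures the error on the fold that was held out.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_rmse(y, k=5, seed=0):
    """RMSE of a mean-only predictor, always evaluated on held-out data."""
    folds = k_fold_indices(len(y), k, seed)
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [y[i] for i in range(len(y)) if i not in held]
        prediction = sum(train) / len(train)  # "train" the model
        squared_errors.extend((y[i] - prediction) ** 2 for i in held_out)
    return (sum(squared_errors) / len(squared_errors)) ** 0.5

# Made-up movie ratings on a 1-5 scale.
ratings = [3, 4, 5, 2, 4, 3, 5, 1, 4, 3]

folds = k_fold_indices(len(ratings), 5)
cv_rmse = cross_validated_rmse(ratings, k=5)
print(f"5-fold cross-validated RMSE: {cv_rmse:.3f}")
```

Because every prediction is scored on data the model did not see during training, the resulting error is a more honest estimate of performance on future data than the error measured on the training set itself.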

What you'll learn

  • The basics of machine learning

  • How to perform cross-validation to avoid overtraining

  • Several popular machine learning algorithms

  • How to build a recommendation system

  • What regularization is and why it is useful

2 - 4 hours per week for 8 weeks
