
Repo of student materials for the General Assembly Data Science Course


# shwedosh/GA-SEA-DAT2

 
 

## SEA-DAT2 course repository

### General Assembly Data Science course

Location: Seattle, WA
Class times: 6:30pm - 9:30pm
Instructor: Jim Byers

Note: Prior to the first day of class, complete the 10-15 hours of pre-work so that you are properly prepared for class (prework)

| Tuesday | Thursday |
|:---|:---|
| **Research Design and Exploratory Data Analysis** | |
| 3/15: L01 Introduction to Data Science | 3/17: L02 Research Design and Pandas |
| 3/22: L03 Statistics Fundamentals | 3/24: L04 Command Line and Version Control |
| 3/29: L05 Fetching Data, Project Discussion Deadline | |
| **Foundations of Data Modeling** | |
| | 3/31: L06 Intro to Regression, Project Question and Dataset Due |
| 4/5: L07 Intro to Classification - K-Nearest Neighbors | 4/7: L08 Evaluating Model Fit |
| 4/12: L09 Classifying with Logistic Regression | 4/14: L10 Advanced Model Evaluation |
| 4/19: L11 Standardization and Clustering | 4/21: L12 First Project Presentations + bonus topics |
| **Data Science in the Real World** | |
| 4/26: L13 Natural Language Processing | 4/28: L14 Dimensionality Reduction, Draft Paper Due |
| 5/3: L15 Decision Trees | 5/5: L16 Ensembling, Bagging and Random Forests |
| 5/10: L17 Modeling with Time Series Data I, Peer Review Due | 5/12: L18 Modeling with Time Series Data II |
| 5/17: L19 Where to go next + bonus topics | 5/19: Final Project Presentations |
| Bonus content: SVC - Support Vector Classifier | Bonus content: Naive Bayes Classifier |
| Bonus content: Intro to Neural Networks | |

### Submission Forms


### Other resources

 


### Class 1: Let's get rolling! - Intro to Data Science

**Student prework:** Before this lesson you should already be able to:

  • Define basic data types used in object-oriented programming
  • Recall the Python syntax for lists, dictionaries, and functions
  • Create files and navigate directories using the command line interface (for your specific environment)
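As a quick self-check on that prework syntax, here is a minimal sketch of lists, dictionaries, and functions; the menu data is made up for illustration:

```python
# Quick refresher on the prework basics: lists, dictionaries, and functions.
# The menu data below is invented for illustration.
prices = [1.50, 2.25, 3.00]              # a list of floats
menu = {"coffee": 2.25, "tea": 1.50}     # a dictionary mapping name -> price

def total(items):
    """Return the sum of a list of prices."""
    return sum(items)

print(total(prices))        # 6.75
print(menu.get("coffee"))   # 2.25
```

If any of this looks unfamiliar, work through the Python workshop files listed in the homework below.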

After this lesson, you will be able to:

  • Describe the roles and components of a successful learning environment
  • Define data science and the data science workflow
  • Apply the data science workflow to meet your classmates
  • Set up your development environment and review Python basics

Topics/Highlights

Homework:

  • Due Mar 17
  • Due Tuesday March 22
    • Review each concept and each line of code in these files of Python code: 00_python_beginner_workshop.py and 00_python_intermediate_workshop.py, and complete the coding exercises in the files. If you don't feel comfortable with any of the content (excluding the "requests" and "APIs" sections), spend some time before Mar 22nd practicing Python. Use resources such as documentation, searches, and the class Slack to get help if you get stuck. Here are some additional resources:
      • Introduction to Python does a great job explaining Python essentials and includes tons of example code.
      • If you like learning from a book, Python for Informatics has useful chapters on strings, lists, and dictionaries.
      • If you prefer interactive exercises, try these lessons from Codecademy: "Python Lists and Dictionaries" and "A Day at the Supermarket".
      • If you have more time, try missions 2 and 3 from DataQuest's Learning Python course.
      • If you've already mastered these topics and want more of a challenge, try solving Python Challenge number 1 (decoding a message)

Resources:


### Class 2: Research Design and Pandas

**Student pre-work:** Before this lesson, you should already be able to:

  • Have completed the Python pre-work described here

After this lesson, you will be able to:

  • Define a problem and types of data
  • Identify data set types
  • Define the data science workflow
  • Apply the data science workflow in the pandas context
  • Write an IPython Notebook to import, format, and clean data using the Pandas library

Topics/Highlights

  • Discuss the course project: requirements and example projects
  • The why's and how's of a good question (slides)
  • Types of datasets (slides)
  • Write a research question with raw data (exercise)
  • Data science workflow steps 2 (Acquire) and 3 (Understand the data)
  • Acquire and Understand data with Pandas
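A minimal pandas sketch of that acquire-and-clean workflow, using made-up data in place of a class dataset:

```python
import pandas as pd

# Hypothetical raw data standing in for a class dataset
raw = pd.DataFrame({
    "city": ["Seattle", "Portland", None, "Seattle"],
    "rainfall_in": [37.5, 43.0, 12.1, None],
})

# Step 2: Acquire -- inspect what we loaded
print(raw.dtypes)
print(raw.head())

# Step 3: Understand -- clean obvious problems
clean = raw.dropna(subset=["city"]).copy()           # drop rows missing a city
clean["rainfall_in"] = clean["rainfall_in"].fillna(  # impute missing rainfall
    clean["rainfall_in"].mean()
)
print(clean)
```

In class the data will come from a file via `pd.read_csv`; the cleaning steps are the same.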

Homework:

  • Due Tuesday March 22
    • To turn in homework, attach files to a personal message in Slack to Jim Byers and Kevin Mcalear
    • Review each concept and each line of code in these files of Python code: 00_python_beginner_workshop.py and 00_python_intermediate_workshop.py, and complete the coding exercises in the files. If you don't feel comfortable with any of the content (excluding the "requests" and "APIs" sections), spend some time before Mar 22nd practicing Python. Use resources such as documentation, searches, and the class Slack to get help if you get stuck. Here are some additional resources:
      • Introduction to Python does a great job explaining Python essentials and includes tons of example code.
      • If you like learning from a book, Python for Informatics has useful chapters on strings, lists, and dictionaries.
      • If you prefer interactive exercises, try these lessons from Codecademy: "Python Lists and Dictionaries" and "A Day at the Supermarket".
      • If you have more time, try missions 2 and 3 from DataQuest's Learning Python course.
      • If you've already mastered these topics and want more of a challenge, try solving Python Challenge number 1 (decoding a message)

Resources:

Python resources

Pandas resources

| Name | Description |
|:---|:---|
| Official Pandas Tutorials | Wes & Company's selection of tutorials and lectures |
| Julia Evans' Pandas Cookbook | Great resource with examples from weather, bikes and 311 calls |
| Learn Pandas Tutorials | A great series of Pandas tutorials from Dave Rojas |
| Research Computing Python Data PYNBs | A super awesome set of Python notebooks from a meetup-based course exclusively devoted to pandas |

### Class 3: Statistics Fundamentals

By the end of this lesson you will be able to:

  • Use NumPy and Pandas libraries to analyze datasets using basic summary statistics: mean, median, mode, max, min, quartile, inter-quartile range, variance, standard deviation, and correlation
  • Create data visualizations - including line graphs, box plots, and histograms - to discern characteristics and trends in a dataset
  • Identify a normal distribution within a dataset using summary statistics and visualization
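These summary statistics map directly onto Pandas methods; a minimal sketch on a made-up sample of exam scores:

```python
import pandas as pd

# A hypothetical sample of exam scores
scores = pd.Series([64, 72, 78, 85, 85, 88, 90, 95])

print(scores.mean())      # 82.125
print(scores.median())    # 85.0
print(scores.mode()[0])   # 85

# Spread: inter-quartile range and sample standard deviation
iqr = scores.quantile(0.75) - scores.quantile(0.25)
print(iqr)
print(scores.std())

# scores.hist() would draw the histogram used to eyeball the distribution
```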

Topics/Highlights

Homework:

  • Due Tuesday March 24
    • Windows users: install Git Bash prior to starting the command line pre-class exercise, as you will need a bash-style command window on your Windows laptop for the exercise and later for git
      • We recommend Git Bash instead of Git Shell (which uses Powershell).
      • Mac users will probably use Terminal, or another command line application of your choice. It is already a bash-style command line interpreter, so there is no need to install anything. Git is part of macOS, so it is already installed and ready to use.
    • Complete GA's friendly command line tutorial using Terminal (Linux/Mac) or Git Bash (Windows)
    • Complete the command line pre-class exercise (code). You do not need to turn in this homework
    • Find one link to a resource about statistics that you find especially useful and send it in a slack message to Jim and Kevin. Note this will not be graded against the homework evaluation criteria. Jim will share these links back out on our repo so all can benefit.

Statistics Resources:

Pandas Resources:

  • This is a nice, short tutorial on pivot tables in Pandas.
  • For working with geospatial data in Python, GeoPandas looks promising. This tutorial uses GeoPandas (and scikit-learn) to build a "linguistic street map" of Singapore.

Visualization Resources:


### Class 4: Command Line and Version Control

By the end of this lesson you will be able to:

  • Clone a GitHub repository to your laptop
  • Sync your local files with your GitHub repository using git add, commit, push and pull
  • Use more advanced command line commands such as grep and the pipe (|)

Topics/Highlights

  • Review the command line pre-class exercise (code)
  • Git and GitHub (slides)
  • Intermediate command line (commands)

Homework:

Git and Markdown Resources:

  • Pro Git is an excellent book for learning Git. Read the first two chapters to gain a deeper understanding of version control and basic commands.
  • If you want to practice a lot of Git (and learn many more commands), Git Immersion looks promising.
  • If you want to understand how to contribute on GitHub, you first have to understand forks and pull requests.
  • GitRef is my favorite reference guide for Git commands, and Git quick reference for beginners is a shorter guide with commands grouped by workflow.
  • Cracking the Code to GitHub's Growth explains why GitHub is so popular among developers.
  • Markdown Cheatsheet provides a thorough set of Markdown examples with concise explanations. GitHub's Mastering Markdown is a simpler and more attractive guide, but is less comprehensive.

Command Line Resources:

  • If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.
  • If you want to do more at the command line with CSV files, try out csvkit, which can be installed via pip.

### Class 5: Fetching Data

After this lesson you will be able to:

  • Articulate what JSON, APIs and Web scraping are and how they help us fetch data
  • Retrieve data from a website using the site’s APIs
  • Scrape a web page to extract data
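A minimal sketch of handling JSON the way an API returns it. The payload below is canned so the example runs offline; the commented lines show where a live fetch with the requests library would go (the URL and field names are invented for illustration):

```python
import json

# A canned JSON payload standing in for an API response.
# With the requests library, the live fetch would be:
#   resp = requests.get(url)   # url = the endpoint you are querying
#   data = resp.json()
payload = '{"results": [{"name": "Pike Place Market", "visits": 10000}]}'
data = json.loads(payload)

# Once parsed, JSON is just nested Python dicts and lists
names = [item["name"] for item in data["results"]]
print(names)
```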

Topics/Highlights:

Homework:

  • If you're using Anaconda, install Seaborn by running conda install seaborn at the command line. (Note that some students in past courses have had problems with Anaconda after installing Seaborn.) If you're not using Anaconda, install Seaborn using pip.
  • Optional: Complete the homework exercise listed in the web scraping code. It will take the place of any one homework you miss, past or future! This is due on Tuesday (April 5th).

API Resources:

  • This Python script to query the U.S. Census API was created by a former DAT student. It's a bit more complicated than the example we used in class, it's very well commented, and it may provide a useful framework for writing your own code to query APIs.
  • Mashape and Apigee allow you to explore tons of different APIs. Alternatively, a Python API wrapper is available for many popular APIs.
  • The Data Science Toolkit is a collection of location-based and text-related APIs.
  • API Integration in Python provides a very readable introduction to REST APIs.
  • Microsoft's Face Detection API, which powers How-Old.net, is a great example of how a machine learning API can be leveraged to produce a compelling web application.

Web Scraping Resources:


### Class 6: Intro to Regression

After this lesson you will be able to:

  • Identify the kinds of problems that linear regression can solve
  • Create a linear regression predictive model
  • Evaluate the error of the model's fit to the training data
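A minimal scikit-learn sketch of that fit-and-evaluate loop; the spend-versus-sales numbers are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical data: advertising spend (in $1000s) vs. units sold
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # fitted slope and intercept

# Evaluate the error of the fit on the training data with RMSE
preds = model.predict(X)
rmse = float(np.sqrt(mean_squared_error(y, preds)))
print(rmse)
```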

Topics/Highlights:

Homework:

Linear Regression Resources:


### Class 7: K-Nearest Neighbors

After this lesson you will be able to:

  • Identify the steps to build a predictive model in scikit-learn
  • Create a k-nearest neighbors (KNN) predictive model
  • Describe the difference between a supervised and unsupervised model
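The scikit-learn steps named in the first objective follow one pattern (instantiate, fit, predict); a minimal sketch using the library's built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Iris is a supervised problem: we know the species label for each flower
iris = load_iris()
X, y = iris.data, iris.target

# The standard scikit-learn pattern: instantiate, fit, predict
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print(knn.predict([[5.1, 3.5, 1.4, 0.2]]))   # predicts class 0 (setosa)
```

An unsupervised model (covered later, e.g. clustering) would fit on `X` alone, with no `y`.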

Topics/Highlights:

Homework:

KNN Resources:

Seaborn Resources:


### Class 8: Basic Model Evaluation

Model Evaluation Resources:

Reproducibility Resources:


### Class 9: Logistic Regression

After this lesson you will be able to:

  • Describe the kinds of problems logistic regression can solve
  • Create a logistic regression model
  • Describe the elements of a Confusion Matrix
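A minimal scikit-learn sketch covering both the model and the confusion matrix; the hours-studied data is invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Hypothetical binary data: hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[4.5]]))   # predicted probability of each class

# Confusion matrix: rows are actual classes, columns are predicted classes,
# so the diagonal holds the correct predictions
cm = confusion_matrix(y, clf.predict(X))
print(cm)
```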

Topics/Highlights:

Homework:

Logistic Regression Resources:

  • To go deeper into logistic regression, read the first three sections of Chapter 4 of An Introduction to Statistical Learning, or watch the first three videos (30 minutes) from that chapter.
  • For a math-ier explanation of logistic regression, watch the first seven videos (71 minutes) from week 3 of Andrew Ng's machine learning course, or read the related lecture notes compiled by a student.
  • For more on interpreting logistic regression coefficients, read this excellent guide by UCLA's IDRE and these lecture notes from the University of New Mexico.
  • The scikit-learn documentation has a nice explanation of what it means for a predicted probability to be calibrated.
  • Supervised learning superstitions cheat sheet is a very nice comparison of four classifiers we cover in the course (logistic regression, decision trees, KNN, Naive Bayes) and one classifier we do not cover (Support Vector Machines).

Confusion Matrix Resources:


### Class 10: Advanced Model Evaluation

After this lesson you will be able to:

  • Prepare your data by handling issues such as null values
  • Measure the accuracy of logistic regression with ROC curves and AUC
  • Use cross-validation to measure model accuracy more effectively than with a train/test split
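Cross-validation and AUC combine in one scikit-learn call; a minimal sketch on synthetic data generated for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# 5-fold cross-validated AUC: every observation is used for testing exactly
# once, which gives a steadier estimate than a single train/test split
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="roc_auc")
print(scores)          # one AUC per fold
print(scores.mean())   # the cross-validated estimate
```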

Topics/Highlights:

Homework:

ROC Resources:

Cross-Validation Resources:

Other Resources:


### Class 11: Standardization (z-score scaling) and Clustering

By the end of this lesson you will be able to:

  • Standardize feature values
  • Cluster using K-means
  • Compare "how good" the clustering models are
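A minimal scikit-learn sketch of all three steps, on made-up two-feature data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical features on very different scales; without standardization
# the second column would dominate the distance calculations
X = np.array([[1.0, 1000.0], [1.2, 980.0], [8.0, 20.0], [8.3, 40.0]])
X_scaled = StandardScaler().fit_transform(X)   # z-score each column

# K-means with two clusters on the standardized features
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(km.labels_)

# Silhouette score (closer to 1 is better) lets us compare clusterings
print(silhouette_score(X_scaled, km.labels_))
```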

Topics/Highlights

Homework:

  • Prepare your initial project presentation for Thursday!!
    • By Tuesday April 26, run the "DBSCAN Clustering" part of the 11_clustering.ipynb notebook to understand how to use the DBSCAN estimator to build a clustering model.

scikit-learn Resources:

Clustering Resources:


### Class 13: Natural Language Processing

By the end of this lesson you will be able to:

  • Apply the NLP techniques of Vectorization and Tokenization to text to create features
  • Use stop word removal and other techniques to increase the accuracy of your models using these features
  • Create features using Stemming and Lemmatization

Topics/Highlights

  • Natural language processing (notebook)
  • Vectorization/Tokenization
  • Stopword Removal
  • Other CountVectorizer options
  • Intro to TextBlob
  • Stemming and Lemmatization
  • NLP Exercise (notebook)

Homework:

  • Your draft paper is due on Thursday (4/28)! Please submit a link to your project repository (with paper, code, data, and visualizations) before class

NLP Resources:


### Class 14: Dimensionality Reduction

By the end of this lesson you will be able to:

  • Apply TF-IDF to text (natural language processing)
  • Reduce dimensions using Principal Component Analysis (dimensionality reduction)
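A minimal scikit-learn sketch chaining the two techniques on a toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

# Toy corpus for illustration
docs = ["data science is fun",
        "the science of data",
        "a completely different sentence"]

# TF-IDF downweights words that appear in many documents
tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print(tfidf.shape)     # (documents, vocabulary size)

# PCA projects the high-dimensional TF-IDF vectors down to 2 components
reduced = PCA(n_components=2).fit_transform(tfidf)
print(reduced.shape)   # (3, 2)
```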

Topics/Highlights

NLP Resources:


### Class 15: Decision Trees

By the end of this lesson you will be able to:

  • Create a regression tree
  • Graph and interpret the decision tree
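A minimal scikit-learn sketch of a regression tree; the house-price numbers are invented for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical data: house size (sq ft) vs. price (in $1000s)
X = np.array([[800], [1200], [1600], [2000]])
y = np.array([150.0, 200.0, 260.0, 310.0])

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(tree.predict([[1500]]))   # price predicted by the leaf 1500 falls into

# export_text prints the learned splits in a readable, interpretable form
print(export_text(tree, feature_names=["sqft"]))
```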

Topics/Highlights

Homework:

Resources:

  • scikit-learn's documentation on decision trees includes a nice overview of trees as well as tips for proper usage.
  • For a more thorough introduction to decision trees, read section 4.3 (23 pages) of Introduction to Data Mining. (Chapter 4 is available as a free download.)
  • If you want to go deep into the different decision tree algorithms, this slide deck contains A Brief History of Classification and Regression Trees.
  • The Science of Singing Along contains a neat regression tree (page 136) for predicting the percentage of an audience at a music venue that will sing along to a pop song.
  • Decision trees are common in the medical field for differential diagnosis, such as this classification tree for identifying psychosis.

### Class 16: Ensembling, Bagging and Random Forests

Resources:


### Class 17: Modeling with Time Series Data I


### Class 18: Modeling with Time Series Data II


### Class 19: Bonus Topics

Topics/Highlights

  • Using a multi-user git repository - Exercise
  • Trends
    • Data storage, Hadoop and MapReduce
    • AWS, Azure and Google, data and data science engine services
      • SQL Databases in the cloud
        • AWS Redshift, Oracle, SQL Server
        • Azure SQL Server
        • Google BigQuery
      • NoSQL Databases
        • AWS Mongo, DynamoDB
        • Azure
      • Data collection for IoT
      • Data science software as a service (SaaS) engines - ex. Azure Machine Learning
      • Machine learning with computer clusters: Spark with MLlib
  • Additional models and variants - Exercise
  • Top takeaways and top surprises - Exercise (form)
  • Where to go from here, data scientists!
  • Next steps in your journey

Additional models and variants (in blue)


### Bonus Content: Support Vector Classifier - SVC

SVC resources

  • For a more in-depth understanding of Support Vector Machines and SVC, read Chapter 9 of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
  • SVC videos by the authors of An Introduction to Statistical Learning can be found here.
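A minimal SVC sketch with scikit-learn on synthetic, well-separated data generated for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic blobs for illustration
X, y = make_blobs(n_samples=100, centers=[[0, 0], [4, 4]],
                  cluster_std=0.8, random_state=0)

# A linear-kernel support vector classifier; C controls margin softness
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.score(X, y))
print(clf.support_vectors_.shape)   # the points that define the margin
```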

### Bonus Content: Naive Bayes and Text Data

Resources:

  • Sebastian Raschka's article on Naive Bayes and Text Classification covers the conceptual material from today's class in much more detail.
  • For more on conditional probability, read these slides, or read section 2.2 of the OpenIntro Statistics textbook (15 pages).
  • For an intuitive explanation of Naive Bayes classification, read this post on airport security.
  • For more details on Naive Bayes classification, Wikipedia has two excellent articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has a good Q&A.
  • When applying Naive Bayes classification to a dataset with continuous features, it is better to use GaussianNB rather than MultinomialNB. This notebook compares their performances on such a dataset. Wikipedia has a short description of Gaussian Naive Bayes, as well as an excellent example of its usage.
  • These slides from the University of Maryland provide more mathematical details on both logistic regression and Naive Bayes, and also explain how Naive Bayes is actually a "special case" of logistic regression.
  • Andrew Ng has a paper comparing the performance of logistic regression and Naive Bayes across a variety of datasets.
  • If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.
  • Yelp has found that Naive Bayes is more effective than Mechanical Turks at categorizing businesses.
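Following the GaussianNB-versus-MultinomialNB point above, a minimal sketch on the iris data, whose features are continuous:

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has continuous features, so GaussianNB is the appropriate variant;
# MultinomialNB is the usual choice for word-count text features
iris = load_iris()
gnb = GaussianNB().fit(iris.data, iris.target)
print(gnb.score(iris.data, iris.target))   # training accuracy
```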

### Bonus Content: Intro to Neural Networks

Neural Network resources
