SF DAT 26 Course Repository

Course materials for General Assembly's Data Science course in San Francisco, CA (8/10/16 - 10/19/16).

Instructor: Sinan Ozdemir

Teaching Assistants: Peter Gao / Cari Levay

Course Times

Monday/Wednesday: 6:30pm - 9:30pm

Office hours:

TBD

All classes / office hours will be held in the student center at GA, 225 Bush Street

Course Project Information

Course Project Examples

| Monday | Wednesday | Project Milestone | HW |
|---|---|---|---|
| No Class | 8/10: Introduction / Expectations / Intro to Data Science | | |
| 8/15: Pandas | 8/17: APIs / Web Scraping 101 | | HW 1 Assigned (W) |
| 8/22: Intro to Machine Learning / KNN | 8/24: Linear Regression / Model Evaluation | Three Potential Project Ideas (W) | |
| 8/29: Model Evaluation Cont'd / Logistic Regression | 8/31: Natural Language Processing | | HW 1 Due (W) |
| 9/5: Labor Day (No Class) | 9/7: Naive Bayes Classification | | |
| 9/12: Advanced Sklearn (Pipeline and Feature Unions) | 9/14: Review | | HW 2 Assigned (W) |
| 9/19: Decision Trees | 9/21: Ensembling Techniques | First Draft Due (W) | |
| 9/26: Dimension Reduction | 9/28: Clustering / Topic Modelling | Peer Review Due (M) | |
| 10/3: Stochastic Gradient Descent | 10/5: Neural Networks / Deep Learning | | HW 2 Due (W) |
| 10/10: Recommendation Engines | 10/12: Web Development with Flask | | |
| 10/17: Data Science in Practice / Projects | 10/19: Projects | Git Er Done | Git Er Done |

Installation and Setup

Resources

Class 1: Introduction / Expectations / Intro to Data Science

Agenda

  • Introduction to General Assembly slides
  • Course overview: our philosophy and expectations (slides)
  • Ice Breaker

Break -- Command Line Tutorial

  • Figure out office hours
  • Intro to Data Science: slides

Homework

  • Set up a conda virtual environment
  • Install Git and create a GitHub account.
    • Read my intro to Git and be sure to come back on Monday with your very own repository called "sfdat26-lastname"
  • Once you receive an email invitation from Slack, join our "SFDAT26 team" and add your photo!
  • Introduction on how to read and write IPython notebooks: tutorial

Class 2: Introduction to Pandas

Goals

  • Feel comfortable importing, manipulating, and graphing data using Python's Pandas
  • Be able to find missing values and begin to have a sense of how to deal with them
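
As a taste of what these goals look like in code, here is a minimal, hypothetical sketch (the CSV file and column names are placeholders, not course materials):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name -- substitute any CSV you have on hand.
df = pd.read_csv('data.csv')

print(df.head())          # peek at the first five rows
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # count missing values per column

# Two common ways to deal with missing values:
df_dropped = df.dropna()           # drop rows with any NaN
df_filled = df.fillna(df.mean())   # fill NaNs with column means

# Quick plot of a numeric column (hypothetical column name).
df['some_column'].hist()
plt.show()
```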

Agenda

  • Don't forget to git pull in the sfdat26 repo from your command line
  • Intro to Pandas walkthrough here

Homework

  • Go through the python class/lab work and finish any exercise you weren't able to in class
  • Make sure you have all of the repos cloned and ready to go
    • You should have both "sfdat26" and "sfdat26_work"
  • Read Greg Reda's Intro to Pandas
  • Take a look at Kaggle's Titanic competition
  • I will be using a module called tweepy next time.
    • To install please type into your console conda install tweepy
      • OR if that does not work, pip install tweepy

Resources:

  • Another Git tutorial here
  • In depth Git/Github tutorial series made by a GA_DC Data Science Instructor here
  • Another Intro to Pandas (Written by Wes McKinney and is adapted from his book)
    • Here is a video of Wes McKinney going through his ipython notebook!
  • Examples of joins in Pandas
  • For more on Pandas plotting, read the visualization page from the official Pandas documentation.

Next Time on SFDAT26...

  • Maria finds out that Sancho has been cheating on her with her... mother!

  • We will use python to programmatically obtain data via open sources on the internet

    • We will be scraping the National UFO reporting center
    • We will be collecting tweets regarding Donald Trump and Hillary Clinton
    • We will be examining what people are really looking for in a data scientist...
  • We will continue to use pandas to investigate missing values in data and have a sense of how to deal with them

Class 3: APIs / Web Scraping 101

Agenda

  • To install tweepy please type into your console conda install tweepy
    • OR if that does not work, pip install tweepy
  • Slides on Getting Data here
  • Intro to Regular Expressions here
  • Getting Data from the open web here
  • Getting Data from an API here
  • LAB on getting data here
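
As a compact reference for the patterns above, here is a hedged sketch: requests for pulling raw HTML off the open web, and tweepy's 3.x-era interface (current when this course ran) for the Twitter API. The credentials are placeholders you must replace with your own keys.

```python
import requests
import tweepy

# Open web: fetch the UFO reports index page and look at the raw HTML.
response = requests.get('http://www.nuforc.org/webreports.html')
print(response.text[:500])

# API: tweepy's 3.x-era OAuth flow (placeholder credentials).
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth)

# Pull a handful of recent tweets matching a search term.
for tweet in api.search(q='data science', count=10):
    print(tweet.text)
```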

Homework

  • The first homework will be assigned by tomorrow morning (in a homework folder) and it is due two Wednesdays from now (8/31)
    • It is a combo of pandas questions with a bit of API/scraping
    • Please push your completed work to your sfdat26_work repo for grading
  • Your first project milestone is due next Wednesday. It is the first three ideas you have for your project. Think about potential interesting sources of data you would like to work with. This can come from work, hobby, or elsewhere!

Resources:

Class 4: Intro to Machine Learning / KNN

Agenda

  • Iris pre-work code

    • Using numpy to investigate the iris dataset further
    • Understanding how humans learn so that we can teach the machine!
    • If needed, read intro to numpy code
      • Numerical Python, code adapted from tutorial here
      • Special attention to the idea of the np.array
  • Intro to Machine Learning and KNN slides

    • Supervised vs Unsupervised Learning
    • Regression vs. Classification
  • Lab to use KNN models to investigate accelerometer data
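
For a quick reference, here is a minimal KNN sketch with scikit-learn on the iris data from the pre-work (the "new flower" measurements are made up):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load the iris dataset we explored in the pre-work.
iris = load_iris()
X, y = iris.data, iris.target

# Fit a KNN classifier with 5 neighbors and check training accuracy.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print(knn.score(X, y))

# Predict the species of a new flower (measurements are made up).
print(knn.predict([[3, 5, 4, 2]]))
```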

Homework

  • The one-page project milestone as well as the pandas homework! See requirements here
  • Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it in class on Wednesday. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
    • In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
    • In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
    • In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
    • How does the choice of K affect model bias? How about variance?
    • As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
    • Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
    • Does a high value for K cause over-fitting or under-fitting?
  • For our talk on linear regression, read:
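
Returning to the bias-variance questions above, you can "see" the tradeoff yourself with a short scikit-learn sketch that varies K and compares training versus testing accuracy (this imports from sklearn.model_selection; older versions used sklearn.cross_validation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation on older versions
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Small K -> flexible model (low bias, high variance);
# large K -> smooth model (high bias, low variance).
for k in [1, 5, 15, 50]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))
```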

Resources:

  • For a more in-depth look at machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
  • Stackoverflow article on the difference between generative and discriminative models here

Class 5: Model Evaluation Procedures / Linear Regression

Agenda

  • Discuss your three potential project ideas with the people at your table.
    • Try to figure out which kinds of machine learning would be appropriate
      • supervised
      • unsupervised
  • Model evaluation procedures (slides, code)
  • Linear regression (notebook)
    • To run this, I use a module called "seaborn"
    • To install, go anywhere in your terminal (git bash) and type sudo pip install seaborn
    • In depth slides here
  • LAB -- Yelp dataset here with the Yelp reviews data. It is not required, but your next homework will involve this dataset, so it would be helpful to take a look now!
  • Discuss the article on the bias-variance tradeoff
  • Look at some code on the bias-variance tradeoff
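
As a self-contained stand-in for the notebook, here is a minimal sketch combining the two topics above: a train/test split (the core model evaluation procedure) around a linear regression. It uses seaborn's built-in "tips" dataset rather than the Yelp data so it runs out of the box.

```python
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Seaborn ships with a few example datasets; "tips" is one of them.
tips = sns.load_dataset('tips')
X = tips[['total_bill', 'size']]
y = tips['tip']

# Hold out a test set so we evaluate on data the model never saw.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

lr = LinearRegression().fit(X_train, y_train)
print(lr.intercept_, lr.coef_)

# Evaluate on the held-out data with RMSE.
preds = lr.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, preds)))
```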

Homework:

Resources:

Class 6: Logistic Regression

Homework:

Resources:

Class 7: Natural Language Processing

Pre-work

  • Download all of the NLTK collections.
    • In Python, use the following commands to bring up the download menu.
    • import nltk
    • nltk.download()
    • Choose "all".
    • Alternatively, just type nltk.download('all')
  • Install three new packages: yahoo_finance, textblob and lda.
    • Open a terminal or command prompt.
    • Type pip install yahoo_finance, pip install textblob, and pip install lda.

Agenda

  • Quick recap of what we've done so far
  • Natural Language Processing is the science of turning words and sentences into data and numbers. Today we will be exploring techniques in this field
  • code showing topics in NLP
  • lab analyzing tweets about the stock market
  • NO CLASS ON MONDAY
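
As a small taste of what textblob (installed in the pre-work) gives you, here is a hedged sketch on a made-up sentence; the POS tagger relies on the NLTK data you downloaded:

```python
from textblob import TextBlob

# TextBlob wraps common NLP tasks: tokenization, POS tagging, sentiment.
blob = TextBlob("The market rallied today and investors are thrilled.")

print(blob.words)      # tokenized words
print(blob.tags)       # part-of-speech tags (uses the NLTK corpora)
print(blob.sentiment)  # polarity in [-1, 1], subjectivity in [0, 1]
```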

Homework:

  • Read Paul Graham's A Plan for Spam and be prepared to discuss it in class when we get back! Here are some questions to think about while you read:
    • Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
    • Before he tried the "statistical approach" to spam filtering, what was his approach?
    • How exactly does his statistical filtering system work?
    • What did Paul say were some of the benefits of the statistical approach?
    • How good was his prediction of the "spam of the future"?
  • Below are the foundational topics upon which Wednesday's class will depend. Please review these materials before class:
    • Confusion matrix: a good guide that roughly mirrors the lecture from class 10.
    • Sensitivity and specificity: Rahul Patwari has an excellent video (9 minutes).
    • Basics of probability: These introductory slides (from the OpenIntro Statistics textbook) are quite good and include integrated quizzes. Pay specific attention to these terms: probability, sample space, mutually exclusive, independent.

Class 8: Naive Bayes Classifier

Today we are going over advanced metrics for classification models and learning a brand new classification model called Naive Bayes!

Agenda

  • Are you smart enough to work at Facebook?
  • Learn about Naive Bayes and ROC/AUC curves
    • Slides here
    • Code here
    • In the code file above we will create our own spam classifier!
  • Work on Homework / previous labs
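
The spam classifier in the code file follows the standard vectorize-then-fit pattern; here is a minimal sketch of that pattern on a made-up four-document corpus (not the course's actual data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus standing in for real spam/ham data.
texts = ["win money now", "meeting at noon",
         "free prize click here", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Turn text into token counts, then fit Multinomial Naive Bayes.
vect = CountVectorizer()
X = vect.fit_transform(texts)
nb = MultinomialNB().fit(X, labels)

print(nb.predict(vect.transform(["free money prize"])))      # likely spam
print(nb.predict_proba(vect.transform(["meeting tomorrow"])))
```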

Resources

  • Bayes Theorem as applied to Monty Hall here and here
  • Video on ROC Curves (12 minutes).
  • My good buddy's blog post about the ROC video includes the complete transcript and screenshots, in case you learn better by reading instead of watching.
  • Accuracy vs AUC discussions here and here

Class 9: Advanced Sklearn Modules

Agenda

Today we are going to talk about four major things related to advanced sklearn features and modules:

  • We will use sklearn's Pipeline feature to chain together multiple sklearn modules
  • We will look at the Feature Selection module to automatically find the most effective features in our dataset
  • We can use Feature Unions to combine several feature extraction techniques
  • More on StandardScaler as well
  • Find the notebook here!
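
For reference, here is a minimal sketch chaining all four ideas together, using iris as a toy stand-in for the notebook's data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# FeatureUnion concatenates the outputs of several transformers side by side.
features = FeatureUnion([
    ('pca', PCA(n_components=2)),
    ('kbest', SelectKBest(k=2)),  # automatic feature selection
])

# Pipeline chains scaling, feature extraction, and a classifier into one estimator.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('features', features),
    ('clf', LogisticRegression()),
])

pipe.fit(iris.data, iris.target)
print(pipe.score(iris.data, iris.target))
```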

Class 10: Review (crime doesn't pay)

Class 11: Decision Trees / Ensembling

  • Decision trees (notebook)
    • Bonus content deals with the algorithm behind building trees
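
For reference outside the notebook, a minimal decision tree sketch with scikit-learn (iris as a toy stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# max_depth limits how deep the tree can grow, one guard against over-fitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(iris.data, iris.target)

print(tree.score(iris.data, iris.target))
print(tree.feature_importances_)  # how much each feature drove the splits
```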

Homework

  • HW2 is live in the homework file
  • Project milestone 2 is due on Wednesday!

Resources:

  • scikit-learn's documentation on decision trees includes a nice overview of trees as well as tips for proper usage.
  • For a more thorough introduction to decision trees, read section 4.3 (23 pages) of Introduction to Data Mining. (Chapter 4 is available as a free download.)
  • If you want to go deep into the different decision tree algorithms, this slide deck contains A Brief History of Classification and Regression Trees.
  • The Science of Singing Along contains a neat regression tree (page 136) for predicting the percentage of an audience at a music venue that will sing along to a pop song.
  • Decision trees are common in the medical field for differential diagnosis, such as this classification tree for identifying psychosis.

Class 12: Ensembling Techniques

Agenda:

Resources:

Class 13: Dimension Reduction

Agenda

Resources

  • Facial Recognition using PCA
  • Layman's intro to PCA
  • Simple PCA using iris
  • PCA step by step in python
  • Sklearn page on dimension reduction techniques including SVD
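
In the spirit of the "Simple PCA using iris" resource above, a minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

# Project the 4-dimensional iris data onto its top 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(iris.data)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```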

Class 14: Clustering / Topic Modelling

Homework:

  • Homework 2 is due in one week!

Resources

Class 15: Stochastic Gradient Descent

Agenda

  • Understand how vector calculus can help us minimize errors in our machine learning algorithms
  • See how batch and stochastic gradient descent are effective tools
  • notebook here
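
To make the idea concrete, here is a minimal hand-rolled batch gradient descent in numpy, fitting a line to synthetic data (the data and learning rate are made up for illustration; the notebook is the authoritative version):

```python
import numpy as np

# Synthetic data: y = 3x + 4 plus a little noise.
np.random.seed(0)
X = np.random.rand(100)
y = 3 * X + 4 + np.random.randn(100) * 0.1

# Batch gradient descent on mean squared error.
m, b = 0.0, 0.0  # slope and intercept
lr = 0.1         # learning rate
for _ in range(1000):
    error = (m * X + b) - y
    # Gradients of MSE with respect to m and b.
    m -= lr * 2 * np.mean(error * X)
    b -= lr * 2 * np.mean(error)

print(m, b)  # should land close to 3 and 4
```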

Resources

  • Tutorial on gradient descent here
  • Another one here

Homework

Class 16: Deep Learning / Neural Networks

Agenda

  • Understand what is meant by deep learning
  • See how TensorFlow utilizes deep learning in its neural networks
  • Slides here
  • notebook here
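
TensorFlow's API has changed substantially since this course ran, so as a neutral stand-in here is a minimal neural-network sketch using scikit-learn's MLPClassifier (available from scikit-learn 0.18 onward); this is not the course's TensorFlow notebook:

```python
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier  # requires scikit-learn >= 0.18
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)  # neural nets like scaled inputs

# One hidden layer of 10 units; "deep" learning just stacks more of these.
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=1)
mlp.fit(X, iris.target)
print(mlp.score(X, iris.target))
```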

Resources

Class 17: Recommendation Engines

Agenda

  • See how companies like Pandora, Spotify, and Netflix make their recommendations
  • Slides here
  • notebook here
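
One of the classic techniques behind such recommendations is item-based collaborative filtering; here is a minimal sketch on a made-up ratings matrix (not course data):

```python
import numpy as np
import pandas as pd

# Made-up user x item ratings matrix (0 = not rated).
ratings = pd.DataFrame(
    [[5, 4, 0, 1],
     [4, 5, 1, 0],
     [1, 0, 5, 4],
     [0, 1, 4, 5]],
    index=['u1', 'u2', 'u3', 'u4'],
    columns=['item_a', 'item_b', 'item_c', 'item_d'])

# Item-item cosine similarity: items rated alike by the same users score high.
M = ratings.values.astype(float)
norms = np.linalg.norm(M, axis=0)
sim = pd.DataFrame(M.T.dot(M) / np.outer(norms, norms),
                   index=ratings.columns, columns=ratings.columns)

# Items most similar to item_a (after the trivial self-match).
print(sim['item_a'].sort_values(ascending=False)[1:])
```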

Resources

Class 18: Web Development with Flask and Heroku

Objectives

  • To launch your own machine learning powered website

Agenda

  • Sign up for a spot next week! signups

  • Present our speaker on Monday

  • Tips on presentation

    • Start with why
  • Understand Web Development and how we can deploy machine learning models to the web using Flask and Heroku

  • Slides here
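
As a hedged sketch of the Flask side (the model file, route, and parameter names are all hypothetical, not the course's app):

```python
# Serving predictions from a pickled scikit-learn model with Flask.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
model = pickle.load(open('model.pkl', 'rb'))  # hypothetical pickled sklearn model

@app.route('/predict')
def predict():
    # e.g. /predict?f1=5.1&f2=3.5&f3=1.4&f4=0.2
    features = [float(request.args.get(f)) for f in ('f1', 'f2', 'f3', 'f4')]
    prediction = model.predict([features])[0]
    return jsonify(prediction=int(prediction))

if __name__ == '__main__':
    app.run(debug=True)
```

Deploying to Heroku is then mostly a matter of adding a Procfile and requirements.txt, as covered in the slides.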

Homework

  • Work on your project!!
  • Work on your project!!
  • Work on your project!!
  • Work on your project!!
  • please
  • Work on your project!!

Next Steps

The hardest thing to do now is to stay sharp! I have a few recommendations on next steps in order to make sure that you don't forget what we learned here!

Thank you all for such a wonderful time and I truly hope to stay in touch.
