SF DAT 26 Course Repository

Course materials for General Assembly's Data Science course in San Francisco, CA (8/10/16 - 10/19/16).

Instructor: Sinan Ozdemir

Teaching Assistants: Peter Gao / Cari Levay

Course Times

Monday/Wednesday: 6:30pm - 9:30pm

Office hours:

TBD

All classes / office hours will be held in the student center at GA, 225 Bush Street

Course Project Information

Course Project Examples

| Monday | Wednesday | Project Milestone | HW |
|---|---|---|---|
| No Class | 8/10: Introduction / Expectations / Intro to Data Science | | |
| 8/15: Pandas | 8/17: APIs / Web Scraping 101 | | HW 1 Assigned (W) |
| 8/22: Intro to Machine Learning / KNN | 8/24: Linear Regression / Model Evaluation | Three Potential Project Ideas (W) | |
| 8/29: Model Evaluation Cont'd / Logistic Regression | 8/31: Natural Language Processing | | HW 1 Due (W) |
| 9/5: Labor Day (No Class) | 9/7: Naive Bayes Classification | | |
| 9/12: Advanced Sklearn (Pipeline and Feature Unions) | 9/14: Review | | HW 2 Assigned (W) |
| 9/19: Decision Trees | 9/21: Ensembling Techniques | First Draft Due (W) | |
| 9/26: Dimension Reduction | 9/28: Clustering / Topic Modelling | Peer Review Due (M) | |
| 10/3: Stochastic Gradient Descent | 10/5: Neural Networks / Deep Learning | | HW 2 Due (W) |
| 10/10: Recommendation Engines | 10/12: Web Development with Flask | | |
| 10/17: Data Science in Practice / Projects | 10/19: Projects | Git Er Done | Git Er Done |

Installation and Setup

Resources

Class 1: Introduction / Expectations / Intro to Data Science

Agenda

  • Introduction to General Assembly slides
  • Course overview: our philosophy and expectations (slides)
  • Ice Breaker

Break -- Command Line Tutorial

  • Figure out office hours
  • Intro to Data Science: slides

Homework

  • Set up a conda virtual environment
  • Install Git and create a GitHub account.
    • Read my intro to Git and be sure to come back on Monday with your very own repository called "sfdat26-lastname"
  • Once you receive an email invitation from Slack, join our "SFDAT26 team" and add your photo!
  • Introduction on how to read and write IPython notebooks: tutorial

Class 2: Introduction to Pandas

Goals

  • Feel comfortable importing, manipulating, and graphing data using Python's Pandas
  • Be able to find missing values and begin to have a sense of how to deal with them
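
As a taste of what these goals look like in code, here is a minimal, hypothetical sketch (the CSV file and column names are placeholders, not course materials):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name -- substitute any CSV you have on hand.
df = pd.read_csv('data.csv')

print(df.head())          # peek at the first five rows
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # count missing values per column

# Two common ways to deal with missing values:
df_dropped = df.dropna()           # drop rows with any NaN
df_filled = df.fillna(df.mean())   # fill NaNs with column means

# Quick plot of a numeric column (hypothetical column name).
df['some_column'].hist()
plt.show()
```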

Agenda

  • Don't forget to git pull in the sfdat26 repo from your command line
  • Intro to Pandas walkthrough here

Homework

  • Go through the python class/lab work and finish any exercise you weren't able to in class
  • Make sure you have all of the repos cloned and ready to go
    • You should have both "sfdat26" and "sfdat26_work"
  • Read Greg Reda's Intro to Pandas
  • Take a look at Kaggle's Titanic competition
  • I will be using a module called tweepy next time.
    • To install please type into your console conda install tweepy
      • OR if that does not work, pip install tweepy

Resources:

  • Another Git tutorial here
  • In depth Git/Github tutorial series made by a GA_DC Data Science Instructor here
  • Another Intro to Pandas (Written by Wes McKinney and is adapted from his book)
    • Here is a video of Wes McKinney going through his ipython notebook!
  • Examples of joins in Pandas
  • For more on Pandas plotting, read the visualization page from the official Pandas documentation.

Next Time on SFDAT26...

  • Maria finds out that Sancho has been cheating on her with her... mother!

  • We will use python to programmatically obtain data via open sources on the internet

    • We will be scraping the National UFO reporting center
    • We will be collecting tweets regarding Donald Trump and Hillary Clinton
    • We will be examining what people are really looking for in a data scientist...
  • We will continue to use pandas to investigate missing values in data and have a sense of how to deal with them

Class 3: APIs / Web Scraping 101

Agenda

  • To install tweepy please type into your console conda install tweepy
    • OR if that does not work, pip install tweepy
  • Slides on Getting Data here
  • Intro to Regular Expressions here
  • Getting Data from the open web here
  • Getting Data from an API here
  • LAB on getting data here
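
As a compact reference for the patterns above, here is a hedged sketch: requests for pulling raw HTML off the open web, and tweepy's 3.x-era interface (current when this course ran) for the Twitter API. The credentials are placeholders you must replace with your own keys.

```python
import requests
import tweepy

# Open web: fetch the UFO reports index page and look at the raw HTML.
response = requests.get('http://www.nuforc.org/webreports.html')
print(response.text[:500])

# API: tweepy's 3.x-era OAuth flow (placeholder credentials).
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth)

# Pull a handful of recent tweets matching a search term.
for tweet in api.search(q='data science', count=10):
    print(tweet.text)
```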

Homework

  • The first homework will be assigned by tomorrow morning (in a homework folder) and it is due two Wednesdays from now (8/31)
    • It is a combo of pandas questions with a bit of API/scraping
    • Please push your completed work to your sfdat26_work repo for grading
  • Your first project milestone is due next Wednesday. It is the first three ideas you have for your project. Think about potential interesting sources of data you would like to work with. This can come from work, hobby, or elsewhere!

Resources:

Class 4: Intro to Machine Learning / KNN

Agenda

  • Iris pre-work code

    • Using numpy to investigate the iris dataset further
    • Understanding how humans learn so that we can teach the machine!
    • If needed, read intro to numpy code
      • Numerical Python, code adapted from tutorial here
      • Special attention to the idea of the np.array
  • Intro to Machine Learning and KNN slides

    • Supervised vs Unsupervised Learning
    • Regression vs. Classification
  • Lab to use KNN models to investigate accelerometer data
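
For a quick reference, here is a minimal KNN sketch with scikit-learn on the iris data from the pre-work (the "new flower" measurements are made up):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load the iris dataset we explored in the pre-work.
iris = load_iris()
X, y = iris.data, iris.target

# Fit a KNN classifier with 5 neighbors and check training accuracy.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print(knn.score(X, y))

# Predict the species of a new flower (measurements are made up).
print(knn.predict([[3, 5, 4, 2]]))
```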

Homework

  • The one-page project milestone as well as the pandas homework! See requirements here
  • Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it in class on Wednesday. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
    • In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
    • In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
    • In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
    • How does the choice of K affect model bias? How about variance?
    • As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
    • Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
    • Does a high value for K cause over-fitting or under-fitting?
  • For our talk on linear regression, read:
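
Returning to the bias-variance questions above, you can "see" the tradeoff yourself with a short scikit-learn sketch that varies K and compares training versus testing accuracy (this imports from sklearn.model_selection; older versions used sklearn.cross_validation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation on older versions
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Small K -> flexible model (low bias, high variance);
# large K -> smooth model (high bias, low variance).
for k in [1, 5, 15, 50]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))
```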

Resources:

  • For a more in-depth look at machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
  • Stackoverflow article on the difference between generative and discriminative models here

Class 5: Model Evaluation Procedures / Linear Regression

Agenda

  • Discuss your three potential project ideas with the people at your table.
    • Try to figure out which kinds of machine learning would be appropriate
      • supervised
      • unsupervised
  • Model evaluation procedures (slides, code)
  • Linear regression (notebook)
    • To run this, I use a module called "seaborn"
    • To install, go anywhere in your terminal (git bash) and type sudo pip install seaborn
    • In depth slides here
  • LAB -- Yelp dataset here with the Yelp reviews data. It is not required, but your next homework will involve this dataset, so it would be helpful to take a look now!
  • Discuss the article on the bias-variance tradeoff
  • Look at some code on the bias-variance tradeoff
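
As a self-contained stand-in for the notebook, here is a minimal sketch combining the two topics above: a train/test split (the core model evaluation procedure) around a linear regression. It uses seaborn's built-in "tips" dataset rather than the Yelp data so it runs out of the box.

```python
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Seaborn ships with a few example datasets; "tips" is one of them.
tips = sns.load_dataset('tips')
X = tips[['total_bill', 'size']]
y = tips['tip']

# Hold out a test set so we evaluate on data the model never saw.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

lr = LinearRegression().fit(X_train, y_train)
print(lr.intercept_, lr.coef_)

# Evaluate on the held-out data with RMSE.
preds = lr.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, preds)))
```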

Homework:

Resources:

Class 6: Logistic Regression

Homework:

Resources:

Class 7: Natural Language Processing

Pre-work

  • Download all of the NLTK collections.
    • In Python, use the following commands to bring up the download menu.
    • import nltk
    • nltk.download()
    • Choose "all".
    • Alternatively, just type nltk.download('all')
  • Install three new packages: yahoo_finance, textblob and lda.
    • Open a terminal or command prompt.
    • Type pip install yahoo_finance, pip install textblob, and pip install lda.

Agenda

  • Quick recap of what we've done so far
  • Natural Language Processing is the science of turning words and sentences into data and numbers. Today we will be exploring techniques in this field
  • code showing topics in NLP
  • lab analyzing tweets about the stock market
  • NO CLASS ON MONDAY
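
As a small taste of what textblob (installed in the pre-work) gives you, here is a hedged sketch on a made-up sentence; the POS tagger relies on the NLTK data you downloaded:

```python
from textblob import TextBlob

# TextBlob wraps common NLP tasks: tokenization, POS tagging, sentiment.
blob = TextBlob("The market rallied today and investors are thrilled.")

print(blob.words)      # tokenized words
print(blob.tags)       # part-of-speech tags (uses the NLTK corpora)
print(blob.sentiment)  # polarity in [-1, 1], subjectivity in [0, 1]
```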

Homework:

  • Read Paul Graham's A Plan for Spam and be prepared to discuss it in class when we get back! Here are some questions to think about while you read:
    • Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
    • Before he tried the "statistical approach" to spam filtering, what was his approach?
    • How exactly does his statistical filtering system work?
    • What did Paul say were some of the benefits of the statistical approach?
    • How good was his prediction of the "spam of the future"?
  • Below are the foundational topics upon which Wednesday's class will depend. Please review these materials before class:
    • Confusion matrix: a good guide that roughly mirrors the lecture from class 10.
    • Sensitivity and specificity: Rahul Patwari has an excellent video (9 minutes).
    • Basics of probability: These introductory slides (from the OpenIntro Statistics textbook) are quite good and include integrated quizzes. Pay specific attention to these terms: probability, sample space, mutually exclusive, independent.

Class 8: Naive Bayes Classifier

Today we are going over advanced metrics for classification models and learning a brand new classification model called Naive Bayes!

Agenda

  • Are you smart enough to work at Facebook?
  • Learn about Naive Bayes and ROC/AUC curves
    • Slides here
    • Code here
    • In the code file above we will create our own spam classifier!
  • Work on Homework / previous labs
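
The spam classifier in the code file follows the standard vectorize-then-fit pattern; here is a minimal sketch of that pattern on a made-up four-document corpus (not the course's actual data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus standing in for real spam/ham data.
texts = ["win money now", "meeting at noon",
         "free prize click here", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Turn text into token counts, then fit Multinomial Naive Bayes.
vect = CountVectorizer()
X = vect.fit_transform(texts)
nb = MultinomialNB().fit(X, labels)

print(nb.predict(vect.transform(["free money prize"])))      # likely spam
print(nb.predict_proba(vect.transform(["meeting tomorrow"])))
```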

Resources

  • Bayes Theorem as applied to Monty Hall here and here
  • Video on ROC Curves (12 minutes).
  • My good buddy's blog post about the ROC video includes the complete transcript and screenshots, in case you learn better by reading instead of watching.
  • Accuracy vs AUC discussions here and here

Class 9: Advanced Sklearn Modules

Agenda

Today we are going to talk about four major things related to advanced sklearn features and modules:

  • We will use sklearn's Pipeline feature to chain together multiple sklearn modules
  • We will look at the Feature Selection module to automatically find the most effective features in our dataset
  • We can use Feature Unions to combine several feature extraction techniques
  • More on StandardScaler as well
  • Find the notebook here!
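
For reference, here is a minimal sketch chaining all four ideas together, using iris as a toy stand-in for the notebook's data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# FeatureUnion concatenates the outputs of several transformers side by side.
features = FeatureUnion([
    ('pca', PCA(n_components=2)),
    ('kbest', SelectKBest(k=2)),  # automatic feature selection
])

# Pipeline chains scaling, feature extraction, and a classifier into one estimator.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('features', features),
    ('clf', LogisticRegression()),
])

pipe.fit(iris.data, iris.target)
print(pipe.score(iris.data, iris.target))
```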

Class 10: Review (crime doesn't pay)

Class 11: Decision Trees / Ensembling

  • Decision trees (notebook)
    • Bonus content deals with the algorithm behind building trees
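
For reference outside the notebook, a minimal decision tree sketch with scikit-learn (iris as a toy stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# max_depth limits how deep the tree can grow, one guard against over-fitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(iris.data, iris.target)

print(tree.score(iris.data, iris.target))
print(tree.feature_importances_)  # how much each feature drove the splits
```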

Homework

  • HW2 is live in the homework file
  • Project milestone 2 is due on Wednesday!

Resources:

  • scikit-learn's documentation on decision trees includes a nice overview of trees as well as tips for proper usage.
  • For a more thorough introduction to decision trees, read section 4.3 (23 pages) of Introduction to Data Mining. (Chapter 4 is available as a free download.)
  • If you want to go deep into the different decision tree algorithms, this slide deck contains A Brief History of Classification and Regression Trees.
  • The Science of Singing Along contains a neat regression tree (page 136) for predicting the percentage of an audience at a music venue that will sing along to a pop song.
  • Decision trees are common in the medical field for differential diagnosis, such as this classification tree for identifying psychosis.

Class 12: Ensembling Techniques

Agenda:

Resources:

Class 13: Dimension Reduction

Agenda

Resources

  • Facial Recognition using PCA
  • Layman's intro to PCA
  • Simple PCA using iris
  • PCA step by step in python
  • Sklearn page on dimension reduction techniques including SVD
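
In the spirit of the "Simple PCA using iris" resource above, a minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

# Project the 4-dimensional iris data onto its top 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(iris.data)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```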

Class 14: Clustering / Topic Modelling

Homework:

  • Homework 2 is due in one week!

Resources

Class 15: Stochastic Gradient Descent

Agenda

  • Understand how vector calculus can help us minimize errors in our machine learning algorithms
  • See how batch and stochastic gradient descent are effective tools
  • notebook here
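
To make the idea concrete, here is a minimal hand-rolled batch gradient descent in numpy, fitting a line to synthetic data (the data and learning rate are made up for illustration; the notebook is the authoritative version):

```python
import numpy as np

# Synthetic data: y = 3x + 4 plus a little noise.
np.random.seed(0)
X = np.random.rand(100)
y = 3 * X + 4 + np.random.randn(100) * 0.1

# Batch gradient descent on mean squared error.
m, b = 0.0, 0.0  # slope and intercept
lr = 0.1         # learning rate
for _ in range(1000):
    error = (m * X + b) - y
    # Gradients of MSE with respect to m and b.
    m -= lr * 2 * np.mean(error * X)
    b -= lr * 2 * np.mean(error)

print(m, b)  # should land close to 3 and 4
```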

Resources

  • Tutorial on gradient descent here
  • Another one here

Homework

Class 16: Deep Learning / Neural Networks

Agenda

  • Understand what is meant by deep learning
  • See how TensorFlow utilizes deep learning in its neural networks
  • Slides here
  • notebook here
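
TensorFlow's API has changed substantially since this course ran, so as a neutral stand-in here is a minimal neural-network sketch using scikit-learn's MLPClassifier (available from scikit-learn 0.18 onward); this is not the course's TensorFlow notebook:

```python
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier  # requires scikit-learn >= 0.18
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)  # neural nets like scaled inputs

# One hidden layer of 10 units; "deep" learning just stacks more of these.
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=1)
mlp.fit(X, iris.target)
print(mlp.score(X, iris.target))
```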

Resources

Class 17: Recommendation Engines

Agenda

  • See how companies like Pandora, Spotify, and Netflix make their recommendations
  • Slides here
  • notebook here
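
One of the classic techniques behind such recommendations is item-based collaborative filtering; here is a minimal sketch on a made-up ratings matrix (not course data):

```python
import numpy as np
import pandas as pd

# Made-up user x item ratings matrix (0 = not rated).
ratings = pd.DataFrame(
    [[5, 4, 0, 1],
     [4, 5, 1, 0],
     [1, 0, 5, 4],
     [0, 1, 4, 5]],
    index=['u1', 'u2', 'u3', 'u4'],
    columns=['item_a', 'item_b', 'item_c', 'item_d'])

# Item-item cosine similarity: items rated alike by the same users score high.
M = ratings.values.astype(float)
norms = np.linalg.norm(M, axis=0)
sim = pd.DataFrame(M.T.dot(M) / np.outer(norms, norms),
                   index=ratings.columns, columns=ratings.columns)

# Items most similar to item_a (after the trivial self-match).
print(sim['item_a'].sort_values(ascending=False)[1:])
```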

Resources

Class 18: Web Development with Flask and Heroku

Objectives

  • To launch your own machine learning powered website

Agenda

  • Sign up for a spot next week! signups

  • Present our speaker on Monday

  • Tips on presentation

    • Start with why
  • Understand Web Development and how we can deploy machine learning models to the web using Flask and Heroku

  • Slides here
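
As a hedged sketch of the Flask side (the model file, route, and parameter names are all hypothetical, not the course's app):

```python
# Serving predictions from a pickled scikit-learn model with Flask.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
model = pickle.load(open('model.pkl', 'rb'))  # hypothetical pickled sklearn model

@app.route('/predict')
def predict():
    # e.g. /predict?f1=5.1&f2=3.5&f3=1.4&f4=0.2
    features = [float(request.args.get(f)) for f in ('f1', 'f2', 'f3', 'f4')]
    prediction = model.predict([features])[0]
    return jsonify(prediction=int(prediction))

if __name__ == '__main__':
    app.run(debug=True)
```

Deploying to Heroku is then mostly a matter of adding a Procfile and requirements.txt, as covered in the slides.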

Homework

  • Work on your project!!
  • Work on your project!!
  • Work on your project!!
  • Work on your project!!
  • please
  • Work on your project!!

Next Steps

The hardest thing to do now is to stay sharp! I have a few recommendations on next steps in order to make sure that you don't forget what we learned here!

Thank you all for such a wonderful time and I truly hope to stay in touch.
