Course materials for General Assembly's Data Science course in San Francisco, CA (3/29/16 - 6/9/16).
Instructor: Sinan Ozdemir
Teaching Assistants: Mars Williams / Imeh Williams
Office hours:
W: 5:30pm - 7:30pm
Sa: 12pm-2pm
Su: 12pm-2pm
All will be held in the student center at GA, 225 Bush Street
Tuesday | Thursday | Project Milestone | HW |
---|---|---|---|
3/29: Introduction / Expectations / Intro to Data Science | 3/31: Introduction to Git / Pandas | ||
4/5: Pandas | 4/7: APIs / Web Scraping 101 | HW 1 Assigned (Th) | |
4/12: Intro to Machine Learning / KNN | 4/14: Scikit-learn / Model Evaluation | Question and Data Set (Th) | HW 1 Due (Th) |
4/19: Linear Regression | 4/21: Logistic Regression | ||
4/26: Time Series Data | 4/28: Review (SF Crime Data Lab) | ||
5/3: Clustering | 5/5: Natural Language Processing | HW 2 Assigned (Th) | |
5/10: Naive Bayes | 5/12: Decision Trees / Ensembling Techniques | HW 2 Due (Th) | |
5/17: Dimension Reduction | 5/19: Web Development with Flask | One Pager Due (Th) | |
5/24: Recommendation Engines (peer reviews due!!!!!!) | 5/26: Support Vector Machines / Neural Networks | Peer Review Due (Tues) | |
5/31: Projects | 6/2: Projects | Git Er Done | Git Er Done |
#### Installation and Setup
- Install the Anaconda distribution of Python 2.7x.
- Install Git and create a GitHub account.
- Once you receive an email invitation from Slack, join our "SF_DAT_17 team" and add your photo!
#### Agenda
- Introduction to General Assembly slides
- Course overview: our philosophy and expectations (slides)
- Intro to Data Science: slides
Break -- Command Line Tutorial
- Introduction to reading and writing iPython notebooks (tutorial)
- Python pre-work here
- Next class we will go over proper use of git and ipython notebooks in more depth
#### Homework
- Make sure you have everything installed as specified above in "Installation and Setup" by Thursday
- Read this awesome intro to Git here
- Read this intro to the iPython notebook here
---
#### Agenda
- Introduction to Git
- Intro to Pandas walkthrough here
- Pandas is an excellent tool for exploratory data analysis
- It allows us to easily manipulate, graph, and visualize basic statistics and elements of our data
- Pandas Lab!
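As a small taste of the workflow above, here is a minimal Pandas sketch; the data and column names are made up purely for illustration:

```python
import pandas as pd

# Tiny made-up dataset for illustration
df = pd.DataFrame({
    'city': ['SF', 'SF', 'LA', 'LA'],
    'rent': [3500, 4200, 2500, 2700],
})

# Summary statistics and group-wise aggregation are one-liners
summary = df['rent'].describe()
mean_by_city = df.groupby('city')['rent'].mean()
print(mean_by_city)
```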
#### Homework
- Go through the python file and finish any exercise you weren't able to in class
- Make sure you have all of the repos cloned and ready to go
- You should have both "sfdat22" and "sfdat22_work"
- Read Greg Reda's Intro to Pandas
- Take a look at Kaggle's Titanic competition
- Another Git tutorial here
- In depth Git/Github tutorial series made by a GA_DC Data Science Instructor here
- Another Intro to Pandas (Written by Wes McKinney and is adapted from his book)
- Here is a video of Wes McKinney going through his ipython notebook!
---
#### Agenda
- Don't forget to `git pull` in the sfdat22 repo in your command line
- Intro to Pandas walkthrough here (same as last Thursday's)
- Pandas Lab 1 (same as last Thursday)
- Extended Intro to Pandas walkthrough here (new)
- Pandas Lab 2 (new)
#### Homework
- Finish any lab questions that you did not finish in class
- Make sure everything is pushed to sfdat22_work if you'd like us to take a look
- Make sure both `requests` and `beautifulsoup` are installed
  - To check, try `import requests` and `import bs4`; both should work without error while running Python!
- Read this intro to APIs
- Check out the National UFO Reporting Center here; it will be one of the topics of the lab on Thursday
- Examples of joins in Pandas
- For more on Pandas plotting, read the visualization page from the official Pandas documentation.
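To sanity-check the installs above, here is a minimal sketch that imports both packages and parses a small HTML snippet with Beautiful Soup; the snippet is made up, so no network access is needed:

```python
import requests  # imported only to confirm the install worked
import bs4

# A made-up HTML snippet standing in for a real page
html = '<html><body><h1>UFO Reports</h1><p class="count">42 sightings</p></body></html>'
soup = bs4.BeautifulSoup(html, 'html.parser')

title = soup.find('h1').text
count = soup.find('p', class_='count').text
print(title, '-', count)
```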
---
#### Agenda
- I will also be using a module called `tweepy` today
  - To install, please type into your console: `conda install tweepy`
  - OR, if that does not work: `pip install tweepy`
- Slides on Getting Data here
- Intro to Regular Expressions here
- Getting Data from the open web here
- Getting Data from an API here
- LAB on getting data here
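Since regular expressions are on the agenda, here is a minimal sketch of pulling structured pieces out of free text; the text and patterns are made up for illustration:

```python
import re

# Made-up text containing dates and hashtags
text = 'Launch set for 4/14 #DataScience #SF, update again on 4/21'

dates = re.findall(r'\d{1,2}/\d{1,2}', text)  # matches e.g. 4/14
hashtags = re.findall(r'#(\w+)', text)        # capture group drops the '#'
print(dates, hashtags)
```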
#### Homework
- The first homework will be assigned by tomorrow morning (in a homework folder) and it is due NEXT Thursday (4/14)
- It is a combo of pandas questions with a bit of API/scraping
- Please push your completed work to your sfdat22_work repo for grading
#### Resources:
- Mashape allows you to explore tons of different APIs. Alternatively, a Python API wrapper is available for many popular APIs.
- The Data Science Toolkit is a collection of location-based and text-related APIs.
- API Integration in Python provides a very readable introduction to REST APIs.
- Microsoft's Face Detection API, which powers How-Old.net, is a great example of how a machine learning API can be leveraged to produce a compelling web application.
Web Scraping Resources:
- For a much longer web scraping tutorial covering Beautiful Soup, lxml, XPath, and Selenium, watch Web Scraping with Python (3 hours 23 minutes) from PyCon 2014. The slides and code are also available.
- import.io and Kimono claim to allow you to scrape websites without writing any code. It's alrighhhtttttt
- How a Math Genius Hacked OkCupid to Find True Love and How Netflix Reverse Engineered Hollywood are two fun examples of how web scraping has been used to build interesting datasets.
---
#### Agenda
- Iris pre-work code
  - Using numpy to investigate the iris dataset further
  - Understanding how humans learn so that we can teach the machine!
- Intro to numpy code
  - Numerical Python, code adapted from tutorial here
  - Special attention to the idea of the np.array
- Intro to Machine Learning and KNN slides
  - Supervised vs Unsupervised Learning
  - Regression vs. Classification
- Lab to create our own KNN model
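The scikit-learn version of the KNN workflow can be sketched as follows; the new flower's measurements are made up:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load the iris data and fit a KNN classifier
iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Predict the species of a new flower (measurements made up)
pred = knn.predict([[5.1, 3.5, 1.4, 0.2]])
species = iris.target_names[pred[0]]
print(species)
```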
#### Homework
- The one page project milestone as well as the pandas homework! See requirements here
- Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it in class on Wednesday. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
- In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
- In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
- In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
- How does the choice of K affect model bias? How about variance?
- As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
- Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
- Does a high value for K cause over-fitting or under-fitting?
Resources:
- For a more in-depth look at machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
---
- Introduction to scikit-learn with iris data (code)
- Exploring the scikit-learn documentation: user guide, module reference, class documentation
- Discuss the article on the bias-variance tradeoff
- Look at some code on the bias-variance tradeoff
  - To run this, I use a module called "seaborn"
  - To install, go anywhere in your terminal (Git Bash) and type `sudo pip install seaborn`
- Model evaluation procedures (slides, code)
- Glass Identification Lab here
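The evaluation procedures above can be sketched like this; note the import path shown is the modern `sklearn.model_selection`, while 2016-era scikit-learn kept these helpers in `sklearn.cross_validation`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set so we evaluate on data the model never saw
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
test_acc = knn.score(X_test, y_test)

# Cross-validation averages over several train/test splits
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(test_acc, cv_scores.mean())
```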
Homework:
- Keep working on your project. Your data exploration and analysis plan is due in three weeks!
Optional:
- Practice what we learned in class today! Finish up the Glass data lab
Resources:
- Here's a great 30-second explanation of overfitting.
- For more on today's topics, these videos from Hastie and Tibshirani are useful: overfitting and train/test split (14 minutes), cross-validation (14 minutes). (Note that they use the terminology "validation set" instead of "test set".)
- Alternatively, read section 5.1 (12 pages) of An Introduction to Statistical Learning, which covers the same content as the videos.
- This video from Caltech's machine learning course presents an excellent, simple example of the bias-variance tradeoff (15 minutes) that may help you to visualize bias and variance.
---
- Linear regression (notebook)
  - In-depth slides here
- LAB -- Yelp dataset here with the Yelp reviews data. It is not required, but your next homework will involve this dataset, so it would be helpful to take a look now!
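A minimal linear regression sketch; the Yelp data is not bundled with scikit-learn, so this uses a made-up dataset with a known relationship (y = 2x + 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data following y = 2x + 1 exactly
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([3, 5, 7, 9, 11])

model = LinearRegression()
model.fit(X, y)

# The fitted coefficients recover the true slope and intercept
print(model.coef_[0], model.intercept_)
```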
Homework:
- Watch these videos on probability and odds (8 minutes) if you're not familiar with either of those terms.
- Read these excellent articles from BetterExplained: An Intuitive Guide To Exponential Functions & e and Demystifying the Natural Logarithm (ln).
Resources:
- Correlation does not imply Causation
- P-values can't always be trusted
- Setosa has an excellent interactive visualization of linear regression.
- To go much more in-depth on linear regression, read Chapter 3 of An Introduction to Statistical Learning, from which this lesson was adapted. Alternatively, watch the related videos or read my quick reference guide to the key points in that chapter.
- To learn more about Statsmodels and how to interpret the output, DataRobot has some decent posts on simple linear regression and multiple linear regression.
- This introduction to linear regression is much more detailed and mathematically thorough, and includes lots of good advice.
- This is a relatively quick post on the assumptions of linear regression.
- John Rauser's talk on Statistics Without the Agonizing Pain (12 minutes) gives a great explanation of how the null hypothesis is rejected.
- A major scientific journal recently banned the use of p-values:
- Scientific American has a nice summary of the ban.
- This response to the ban in Nature argues that "decisions that are made earlier in data analysis have a much greater impact on results".
- Andrew Gelman has a readable paper in which he argues that "it's easy to find a p < .05 comparison even if nothing is going on, if you look hard enough".
- An article on "p-hacking": the idea that you can alter data in order to achieve good p-values
---
- Logistic regression (notebook)
- BONUS slides here (These slides go a bit deeper into the math)
- Confusion matrix (slides)
- LAB -- Exercise with Titanic data instructions
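A sketch of the logistic regression plus confusion matrix workflow; scikit-learn's bundled breast cancer data stands in for the Titanic data here:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Binary classification data bundled with scikit-learn
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)
```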
Homework:
- If you aren't yet comfortable with all of the confusion matrix terminology, watch Rahul Patwari's videos on Intuitive sensitivity and specificity (9 minutes) and The tradeoff between sensitivity and specificity (13 minutes).
Resources:
- To go deeper into logistic regression, read the first three sections of Chapter 4 of An Introduction to Statistical Learning, or watch the first three videos (30 minutes) from that chapter.
- For a math-ier explanation of logistic regression, watch the first seven videos (71 minutes) from week 3 of Andrew Ng's machine learning course, or read the related lecture notes compiled by a student.
- For more on interpreting logistic regression coefficients, read this excellent guide by UCLA's IDRE and these lecture notes from the University of New Mexico.
- The scikit-learn documentation has a nice explanation of what it means for a predicted probability to be calibrated.
- Supervised learning superstitions cheat sheet is a very nice comparison of four classifiers we cover in the course (logistic regression, decision trees, KNN, Naive Bayes) and one classifier we do not cover (Support Vector Machines).
- This simple guide to confusion matrix terminology may be useful to you as a reference.
---
- Time Series (notebook)
- slides here
- LAB -- Walmart Sales Forecasting instructions
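Two of the basic time series operations can be sketched with a made-up daily series (a stand-in for the Walmart data):

```python
import numpy as np
import pandas as pd

# Made-up daily series indexed by date
dates = pd.date_range('2016-01-01', periods=60, freq='D')
sales = pd.Series(np.arange(60, dtype=float), index=dates)

# A rolling mean smooths short-term fluctuation
weekly_avg = sales.rolling(window=7).mean()

# Autocorrelation: correlation of the series with a lagged copy of itself
lag1 = sales.autocorr(lag=1)
print(weekly_avg.iloc[6], lag1)
```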
Resources:
- Some more practice with time series here
- And here
- More practice with Autocorrelation
- Extensive blog on time series and ARIMA
---
- Review on the board
- Review part deux (notebook)
- LAB -- Kaggle competition instructions
- Clustering (slides, notebook)
- K-means: documentation, visualization 1, visualization 2
- DBSCAN: documentation, visualization
- LAB -- Pandora notebook
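A minimal K-means sketch on the iris measurements; note K-means never sees the species labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data  # 150 flowers, 4 measurements; labels are never used

# Ask K-means for 3 clusters (we happen to know iris has 3 species)
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.cluster_centers_.shape, set(labels))
```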
Homework:
- Homework 2 is due in one week!
- Project Milestone 2 is due in two weeks!
- Download all of the NLTK collections.
  - In Python, use the following commands to bring up the download menu: `import nltk`, then `nltk.download()`
  - Choose "all".
  - Alternatively, just type `nltk.download('all')`
- Install two new packages: `textblob` and `lda`
  - Open a terminal or command prompt.
  - Type `pip install textblob` and `pip install lda`.
Resources:
- For a very thorough introduction to clustering, read chapter 8 (69 pages) of Introduction to Data Mining (available as a free download), or browse through the chapter 8 slides.
- scikit-learn's user guide compares many different types of clustering.
- This PowerPoint presentation from Columbia's Data Mining class provides a good introduction to clustering, including hierarchical clustering and alternative distance metrics.
- An Introduction to Statistical Learning has useful videos on K-means clustering (17 minutes) and hierarchical clustering (15 minutes).
- This is an excellent interactive visualization of hierarchical clustering.
- This is a nice animated explanation of mean shift clustering.
- The K-modes algorithm can be used for clustering datasets of categorical features without converting them to numerical values. Here is a Python implementation.
- Here are some fun examples of clustering: A Statistical Analysis of the Work of Bob Ross (with data and Python code), How a Math Genius Hacked OkCupid to Find True Love, and characteristics of your zip code.
---
## Class 12: Natural Language Processing
Agenda
- Natural Language Processing is the science of turning words and sentences into data and numbers. Today we will be exploring techniques in this field
- code showing topics in NLP
- lab analyzing tweets about the stock market
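The "words into numbers" idea can be sketched with a bag-of-words vectorizer; the tweets below are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up tweets about the stock market
tweets = ['AAPL is going up', 'AAPL is going down', 'sell everything now']

# Bag-of-words: each tweet becomes a vector of word counts
vect = CountVectorizer()
X = vect.fit_transform(tweets)
print(X.shape)  # (number of tweets, size of vocabulary)
print(sorted(vect.vocabulary_))
```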
Homework:
- HW2 is assigned in the hw folder
- Read Paul Graham's A Plan for Spam and be prepared to discuss it in class on Wednesday. Here are some questions to think about while you read:
- Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
- Before he tried the "statistical approach" to spam filtering, what was his approach?
- How exactly does his statistical filtering system work?
- What did Paul say were some of the benefits of the statistical approach?
- How good was his prediction of the "spam of the future"?
- Below are the foundational topics upon which Wednesday's class will depend. Please review these materials before class:
- Confusion matrix: a good guide that roughly mirrors the lecture from class 10.
- Sensitivity and specificity: Rahul Patwari has an excellent video (9 minutes).
- Basics of probability: These introductory slides (from the OpenIntro Statistics textbook) are quite good and include integrated quizzes. Pay specific attention to these terms: probability, sample space, mutually exclusive, independent.
- You should definitely be working on your project! First draft is due Monday!!
---
## Class 13: Naive Bayes Classifier
Today we are going over advanced metrics for classification models and learning a brand new classification model called Naive Bayes!
Agenda
- Are you smart enough to work at Facebook?
- Learn about Naive Bayes and ROC/AUC curves
- Work on Homework / previous labs
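The Naive Bayes plus AUC combination above can be sketched as follows; Gaussian Naive Bayes on scikit-learn's bundled breast cancer data stands in for the class materials:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)

# AUC scores the ranking of predicted probabilities, not the hard labels
probs = nb.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, probs)
print(round(auc, 3))
```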
Resources
- Bayes Theorem as applied to Monty Hall here and here
- Video on ROC Curves (12 minutes).
- My good buddy's blog post about the ROC video includes the complete transcript and screenshots, in case you learn better by reading instead of watching.
- Accuracy vs AUC discussions here and here
---
- Decision trees (notebook)
- Ensembling (notebook)
- Major League Baseball player data from 1986-87
- Data dictionary (page 7)
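The single-tree vs. ensemble comparison can be sketched as follows; iris stands in for the baseball data, and the scores shown are training accuracy, not honest test accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One shallow tree vs. an ensemble of 100 trees
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

tree_acc = tree.score(X, y)
forest_acc = forest.score(X, y)
print(tree_acc, forest_acc)
```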
Decision Tree Resources:
- scikit-learn's documentation on decision trees includes a nice overview of trees as well as tips for proper usage.
- For a more thorough introduction to decision trees, read section 4.3 (23 pages) of Introduction to Data Mining. (Chapter 4 is available as a free download.)
- If you want to go deep into the different decision tree algorithms, this slide deck contains A Brief History of Classification and Regression Trees.
- The Science of Singing Along contains a neat regression tree (page 136) for predicting the percentage of an audience at a music venue that will sing along to a pop song.
- Decision trees are common in the medical field for differential diagnosis, such as this classification tree for identifying psychosis.
Ensembling Resources:
- scikit-learn's documentation on ensemble methods covers both "averaging methods" (such as bagging and Random Forests) as well as "boosting methods" (such as AdaBoost and Gradient Tree Boosting).
- MLWave's Kaggle Ensembling Guide is very thorough and shows the many different ways that ensembling can take place.
- Browse the excellent solution paper from the winner of Kaggle's CrowdFlower competition for an example of the work and insight required to win a Kaggle competition.
- Interpretable vs Powerful Predictive Models: Why We Need Them Both is a short post on how the tactics useful in a Kaggle competition are not always useful in the real world.
- Not Even the People Who Write Algorithms Really Know How They Work argues that the decreased interpretability of state-of-the-art machine learning models has a negative impact on society.
- For an intuitive explanation of Random Forests, read Edwin Chen's answer to How do random forests work in layman's terms?
- Large Scale Decision Forests: Lessons Learned is an excellent post from Sift Science about their custom implementation of Random Forests.
- Unboxing the Random Forest Classifier describes a way to interpret the inner workings of Random Forests beyond just feature importances.
- Understanding Random Forests: From Theory to Practice is an in-depth academic analysis of Random Forests, including details of its implementation in scikit-learn.
---
- Finish up Ensembling (notebook)
  - Major League Baseball player data from 1986-87
  - Data dictionary (page 7)
- PCA
  - Explained visually
  - Slides
  - Code: PCA
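A minimal PCA sketch on the iris data, reducing four features to two components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 4 features per flower

# Project the 4-dimensional data down to 2 components for plotting
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print(X2.shape, pca.explained_variance_ratio_.sum())
```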
Resources
- Some hardcore math in python here
- PCA using the iris data set here and with 2 components here
- PCA step by step here
- Check out Pyxley
---
Objectives
- To launch your own machine learning powered website
Agenda
- Sign up for Project times NEXT WEEK
- Slides here
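A minimal sketch of the kind of Flask app the slides cover; the route and response text are made up, and Flask's test client exercises the route without running a server:

```python
from flask import Flask

app = Flask(__name__)

# A single route; a real app might return predictions from a trained model
@app.route('/')
def home():
    return 'My machine learning app'

# Flask's built-in test client hits the route without starting a server
resp = app.test_client().get('/')
print(resp.status_code, resp.get_data(as_text=True))
```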
Homework
- Work on your project!!
- Work on your project!!
- Work on your project!!
- Work on your project!!
- please
- Work on your project!!
---
Agenda
- Sign up for a project slot here
- slides here
- SVM notebook here
- Neural Network notebook here
- We will need a new package, PyBrain!
  - To install: `sudo pip install pybrain`
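PyBrain's API isn't sketched here; instead, a minimal SVM example using scikit-learn's `SVC`, which this class also covers:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# RBF-kernel SVM; C controls margin softness, gamma the kernel width
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)
acc = svm.score(X_test, y_test)
print(acc)
```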
Resources
- An intro to Neural Networks here
- An intro to SVM
- SVM Margins Example here
- SVM digits was adapted from here
- Google Deep Dream: why does it always see dogs?!
- Deep Dream Generator
- The most used non-sklearn ANN library, PyBrain
- Step by Step back propagation here
- Code adapted from here and here
- Calculus adapted from here
- Sklearn will come out with their own supervised neural network soon! here