
SF DAT15 Course Repository

Course materials for General Assembly's Data Science course in San Francisco (6/15/15 - 8/26/15).

Instructors: Sinan Ozdemir (who is awesome)

Teaching Assistants: Liam Foley, Patrick Foley, and Ramesh Sampath (who are all way more awesome)

Office hours: All will be held in the student center at GA, 225 Bush Street

  • Monday 5:15-6:15pm
  • Tuesday 6:30-8:30pm
  • Wednesday 5:15-6:15pm
  • Friday 12:30-2:30pm
  • Saturday 10:00am-12:00pm

Course Project information

| Monday | Wednesday |
|:---|:---|
| 6/15: Introduction / Expectations / Git Intro | 6/17: Python |
| 6/22: Data Science Workflow / Pandas | 6/24: More Pandas |
| 6/29: Intro to ML / Numpy / KNN | 7/1: Scikit-learn / Model Evaluation |
| Project Milestone: Question and Data Set | Homework 1 Due |
| 7/6: Linear Regression | 7/8: Logistic Regression |
| 7/13: Working on a Data Problem | 7/15: Clustering |
| 7/20: Natural Language Processing | 7/22: Naive Bayes |
| Milestone: First Draft Due | |
| 7/27: Decision Trees | 7/29: Ensembling Techniques |
| 8/3: Recommendation Engines | 8/5: Databases / MapReduce |
| Milestone: Peer Review Due | |
| 8/10: Dimension Reduction | 8/12: Ensemble Techniques |
| 8/17: Web Development with Flask | 8/19: Neural Networks |
| 8/24: Projects | 8/26: Projects |

Installation and Setup

  • Install the Anaconda distribution of Python 2.7.x.
  • Install Git and create a GitHub account.
  • Once you receive an email invitation from Slack, join our "SF_DAT_15 team" and add your photo!

Resources

Class 1: Introduction / Expectations / Git Intro

  • Introduction to General Assembly
  • Course overview: our philosophy and expectations (slides)
  • Git overview: (slides)
  • Tools: check for proper setup of Git, Anaconda, overview of Slack

Homework:

  • Resolve any installation issues before next class.
  • Make sure you have a GitHub profile and have created a repo called "SF_DAT_15"
  • Clone the class repo (this one!)
  • Review this code for a recap of some Python basics.

Optional:

Class 2: Python

  • Brief overview of Python environments: Python interpreter, IPython interpreter, Spyder, Rodeo
  • Python quiz (code)
  • Check out some iPython Notebooks!
  • Working with data in Python in Spyder
  • Lab on files and API usage
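The lab files themselves aren't reproduced in this README; as a minimal sketch of the two patterns the lab covers, reading files and parsing JSON (the file name and the payload below are made up for illustration):

```python
import json
import os
import tempfile

# Write then read a small text file -- the basic pattern from the files lab
path = os.path.join(tempfile.gettempdir(), "example.txt")
with open(path, "w") as f:
    f.write("line one\nline two\n")

with open(path) as f:
    lines = [line.strip() for line in f]

# Parse a JSON payload, as you would from an API response body
payload = json.loads('{"symbol": "ZYX", "price": 12.5}')

print(lines)             # ['line one', 'line two']
print(payload["price"])  # 12.5
```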

Homework:

Optional:

Resources:

Class 3: Data Science Workflow / Pandas

Agenda

  • Slides on the Data Science workflow here
    • Data Science Workflow
  • Intro to Pandas walkthrough here
    • I will give you semi-cleaned data allowing us to work on step 3 of the data science workflow
    • Pandas is an excellent tool for exploratory data analysis
    • It allows us to easily manipulate, graph, and visualize basic statistics and elements of our data
    • Pandas Lab!
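As a hedged taste of the exploratory workflow above (the DataFrame below stands in for the semi-cleaned class data; the values are made up):

```python
import pandas as pd

# A tiny DataFrame standing in for the semi-cleaned class data
df = pd.DataFrame({
    "city": ["SF", "SF", "Oakland", "Oakland"],
    "rent": [3400, 3600, 2200, 2400],
})

print(df.describe())                      # basic summary statistics
print(df.groupby("city")["rent"].mean())  # mean rent per city
```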

Homework

  • Begin thinking about potential projects that you'd want to work on. Consider the problems discussed in class today (we will see more next time and next Monday as well)
    • Do you want a predictive model?
    • Do you want to cluster similar objects (like words or other)?

Resources:

Class 4 - More Pandas

Agenda

  • Class code on Pandas here
  • We will work with 3 different data sets today:
  • Pandas Lab! here

Homework

  • Please review the readme for the first homework. It is due NEXT Wednesday (7/1/2015)
  • The one-pager for your project is also due. Please see project guidelines

Class 5 - Intro to ML / Numpy / KNN

Agenda

  • Intro to numpy code
    • Numerical Python, code adapted from tutorial here
    • Special attention to the idea of the np.array
  • Intro to Machine Learning and KNN slides
    • Supervised vs Unsupervised Learning
    • Regression vs. Classification
  • Iris pre-work code and code solutions
    • Using numpy to investigate the iris dataset further
    • Understanding how humans learn so that we can teach the machine!
  • Lab to create our own KNN model
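A minimal from-scratch KNN classifier in the spirit of the lab (the training points below are illustrative, not the iris data):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distances
    nearest = y_train[np.argsort(dists)[:k]]               # labels of the k closest
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]                       # most common label wins

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.1, 1.0])))  # 0
```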

Homework

  • The one page project milestone as well as the pandas homework!
  • Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it in class on Wednesday. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
    • In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
    • In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
    • In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
    • How does the choice of K affect model bias? How about variance?
    • As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
    • Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
    • Does a high value for K cause over-fitting or under-fitting?

Resources:

  • For a more in-depth look at machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)

Class 6: scikit-learn / Model Evaluation Procedures

  • Introduction to scikit-learn with iris data (code)

  • Exploring the scikit-learn documentation: user guide, module reference, class documentation

  • Discuss the article on the bias-variance tradeoff

  • Look at some code on the bias-variance tradeoff

    • To run this, I use a module called "seaborn"
    • To install it, open your terminal (git bash) and type sudo pip install seaborn
  • Model evaluation procedures (slides, code)
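As a minimal sketch of the train/test-split evaluation procedure on the iris data (module paths below follow current scikit-learn; the 2015 course imported `train_test_split` from `sklearn.cross_validation`):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Hold out 30% of the data to estimate out-of-sample accuracy
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
acc = accuracy_score(y_test, knn.predict(X_test))
print(acc)
```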

Homework:

Optional:

  • Practice what we learned in class today!
    • If you have gathered your project data already: Try using KNN for classification, and then evaluate your model. Don't worry about using all of your features, just focus on getting the end-to-end process working in scikit-learn. (Even if your project is regression instead of classification, you can easily convert a regression problem into a classification problem by converting numerical ranges into categories.)
    • If you don't yet have your project data: Pick a suitable dataset from the UCI Machine Learning Repository, try using KNN for classification, and evaluate your model. The Glass Identification Data Set is a good one to start with.
    • Either way, you can submit your commented code to your SF_DAT_15_WORK, and we'll give you feedback.

Resources:

Class 7: Linear Regression
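The class code isn't linked here; as a hedged sketch of fitting a linear regression in scikit-learn (the toy data, y = 2x + 1 plus noise, is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y = 2x + 1 with a little Gaussian noise
rng = np.random.RandomState(0)
X = np.arange(20).reshape(-1, 1).astype(float)
y = 2 * X.ravel() + 1 + rng.normal(0, 0.1, 20)

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # close to the true slope 2 and intercept 1
```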

Homework:

Resources:

Class 8: Logistic Regression
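The class code isn't linked here; as a hedged sketch of logistic regression in scikit-learn, which predicts class probabilities rather than continuous values (the toy data is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary problem: the label is 1 whenever the feature exceeds 5
X = np.arange(10).reshape(-1, 1).astype(float)
y = (X.ravel() > 5).astype(int)

clf = LogisticRegression().fit(X, y)
preds = clf.predict([[1.0], [9.0]])       # hard class labels
proba = clf.predict_proba([[9.0]])[0, 1]  # probability of class 1
print(preds, proba)
```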

Homework:

Resources:

Class 9: Working on a Data Problem

  • Today we will work on a real-world data problem! Our data is seven months of stock data for a fictional company, ZYX, including Twitter sentiment, volume, and stock price. Our goal is to create a predictive model that predicts forward returns.

  • Project overview (slides)

    • Be sure to read documentation thoroughly and ask questions! We may not have included all of the information you need...

Class 10: Clustering and Visualization

  • The slides today will focus on our first look at unsupervised learning, K-Means Clustering!
  • The code for today focuses on two main examples:
    • We will investigate simple clustering using the iris data set.
    • We will take a look at a harder example, using Pandora songs as data. See data. See code here
    • Checking out some of the limitations of K-Means Clustering here
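The simple iris clustering example above can be sketched in a few lines with scikit-learn (`n_init` is passed explicitly here for compatibility across scikit-learn versions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data  # 150 flowers, 4 measurements each; labels are NOT used

# Ask K-Means for 3 clusters, matching the 3 iris species
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_.shape)  # (3, 4): one centroid per cluster
```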

Homework:

  • HW2 and Project Milestone 2 are due in one week!
  • Download all of the NLTK collections.
    • In Python, use the following commands to bring up the download menu.
    • import nltk
    • nltk.download()
    • Choose "all".
    • Alternatively, just type nltk.download('all')
  • Install two new packages: textblob and lda.
    • Open a terminal or command prompt.
    • Type pip install textblob and pip install lda.

Resources:

Class 11: Natural Language Processing

Agenda

  • Natural Language Processing is the science of turning words and sentences into data and numbers. Today we will explore techniques from this field
  • code showing topics in NLP
  • lab analyzing tweets about the stock market
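The core "words into numbers" step is the bag-of-words document-term matrix; a minimal sketch (the tweets below are made up, not the class data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Turn raw tweets into a document-term matrix of token counts
tweets = ["ZYX is going up", "ZYX is tanking", "buy buy buy"]
vect = CountVectorizer()
dtm = vect.fit_transform(tweets)

print(dtm.shape)                  # (3 documents, 6 distinct tokens)
print(sorted(vect.vocabulary_))   # the learned vocabulary
```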

Homework:

  • Read Paul Graham's A Plan for Spam and be prepared to discuss it in class on Wednesday. Here are some questions to think about while you read:
    • Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
    • Before he tried the "statistical approach" to spam filtering, what was his approach?
    • How exactly does his statistical filtering system work?
    • What did Paul say were some of the benefits of the statistical approach?
    • How good was his prediction of the "spam of the future"?
  • Below are the foundational topics upon which Wednesday's class will depend. Please review these materials before class:
    • Confusion matrix: a good guide that roughly mirrors the lecture from class 10.
    • Sensitivity and specificity: Rahul Patwari has an excellent video (9 minutes).
    • Basics of probability: These introductory slides (from the OpenIntro Statistics textbook) are quite good and include integrated quizzes. Pay specific attention to these terms: probability, sample space, mutually exclusive, independent.
  • You should definitely be working on your project! First draft and HW2 are both due Wednesday!!

Class 12: Naive Bayes Classifier

Today we are going over advanced metrics for classification models and learning a brand-new classification model called Naive Bayes!

Agenda

  • Learn about ROC/AUC curves
  • Learn the Naive Bayes Classifier
    • Slides here
    • Code here
    • In the code file above we will create our own spam classifier!
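As a hedged sketch of the spam-classifier idea from the code file (the tiny corpus below is made up; the class uses a real spam dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus standing in for the spam data used in class
texts = ["win cash now", "free prize win", "meeting at noon", "lunch at noon"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Bag-of-words features, then a multinomial Naive Bayes model on the counts
vect = CountVectorizer()
X = vect.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vect.transform(["free cash prize"])))  # flagged as spam
```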

Resources

Class 13: Decision Trees

We will look into a slightly more complex model today: the Decision Tree.

Agenda
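As a hedged sketch of fitting a decision tree in scikit-learn on the iris data (`max_depth=3` is an illustrative choice to keep the tree small, not a class setting):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# A shallow tree: max_depth limits complexity and keeps the diagram readable
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

print(tree.score(iris.data, iris.target))  # training accuracy
print(tree.feature_importances_)           # which features drive the splits
```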

Homework

  • Project reviews due August 3rd!

Resources

  • Chapter 8.1 of An Introduction to Statistical Learning also covers the basics of Classification and Regression Trees
  • The scikit-learn documentation has a nice summary of the strengths and weaknesses of Trees.
  • For those of you with background in javascript, d3.js has a nice tree layout that would make more presentable tree diagrams:
    • Here is a link to a static version, as well as a link to a dynamic version with collapsible nodes.
    • If this is something you are interested in, Gary Sieling wrote a nice function in Python to take the output of a scikit-learn tree and convert it into JSON format.
    • If you are interested in learning d3.js, this is a good tutorial for understanding the building blocks of a decision tree. Here is another tutorial focusing on building a tree diagram in d3.js.
  • Dr. Justin Esarey from Rice University has a nice video lecture on CART that also includes an R code walkthrough

Class 15: Recommenders

  • Recommendation Engines slides
  • Recommendation Engine Example code
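One common recommendation-engine building block is item-item cosine similarity over a user-ratings matrix; a minimal sketch with made-up ratings (not the class example data):

```python
import numpy as np

# Rows = users, columns = items; 0 means unrated (ratings are made up)
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

# Item-item cosine similarity: dot products of columns over their norms
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)

# Recommend the item most similar to item 0 (excluding item 0 itself)
ranked = np.argsort(sim[0])[::-1]
best = ranked[ranked != 0][0]
print(best)  # item 1: rated similarly to item 0 by the same users
```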

Resources:

Class 16: Databases and MapReduce

Class 17: Dimension Reduction

Resources

  • Some hardcore math in Python here
  • PCA using the iris data set here and with 2 components here
  • PCA step by step here
  • Check out Pyxley for our guest speaker's (Nick Kridler) talk on Wednesday
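The PCA-on-iris examples above boil down to a few lines in scikit-learn; a minimal sketch with 2 components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 features

# Project the 4-dimensional data onto its 2 strongest directions of variance
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)

print(X2.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)  # the first component dominates
```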

Class 18: Ensembling

Resources:

Class 19: Web Development

  • slides here
  • We will be working with the flask app found here
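A minimal Flask app in the spirit of the linked one (the route and message here are illustrative; in class you would run it with app.run(), while this sketch exercises the route with Flask's built-in test client):

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def home():
    # In a deployed ML app, this handler would call your trained model
    return "Hello from Flask!"

# Exercise the route without starting a server
client = app.test_client()
print(client.get("/").data)  # b'Hello from Flask!'
```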

Resources:

  • MVC Architecture blog post
  • More on using Flask and Heroku here (Note you can ignore the virtual environment stuff, unless you want a challenge!)

Homework

  • Try to deploy your own ML model to Heroku!
  • Read an intro to Neural Networks here
  • And this intro to SVM

Class 20: Neural Networks and SVM

Agenda

Resources

Project Info

  • Everyone will have a maximum of 15 minutes to present, including Q&A
  • Please sign up for a slot here if you haven't done so
    • If you don't want to be crunched for time, try going on Monday :)
    • If you don't sign up by the end of class today (Wednesday 8/19), we will assign you a slot
  • Final projects are mandatory if you want a certificate of completion from General Assembly
  • Remember, you must submit both a presentation and a write-up. (What? A write-up? You never mentioned that!) I did, and it's also in the project requirements :)

About

Repo for SF DAT 15
