
SF DAT 22 Course Repository

Course materials for General Assembly's Data Science course in San Francisco, CA (3/29/16 - 6/9/16).

Instructor: Sinan Ozdemir

Teaching Assistants: Mars Williams / Imeh Williams

Office hours:

W: 5:30pm - 7:30pm

Sa: 12pm-2pm

Su: 12pm-2pm

All will be held in the student center at GA, 225 Bush Street

Course Project Information

Course Project Examples

| Tuesday | Thursday | Project Milestone | HW |
|---|---|---|---|
| 3/29: Introduction / Expectations / Intro to Data Science | 3/31: Introduction to Git / Pandas | | |
| 4/5: Pandas | 4/7: APIs / Web Scraping 101 | | HW 1 Assigned (Th) |
| 4/12: Intro to Machine Learning / KNN | 4/14: Scikit-learn / Model Evaluation | Question and Data Set (Th) | HW 1 Due (Th) |
| 4/19: Linear Regression | 4/21: Logistic Regression | | |
| 4/26: Time Series Data | 4/28: Review (SF Crime Data Lab) | | |
| 5/3: Clustering | 5/5: Natural Language Processing | | HW 2 Assigned (Th) |
| 5/10: Naive Bayes | 5/12: Decision Trees / Ensembling Techniques | | HW 2 Due (Th) |
| 5/17: Dimension Reduction | 5/19: Web Development with Flask | One Pager Due (Th) | |
| 5/24: Recommendation Engines (peer reviews due!!!!!!) | 5/26: Support Vector Machines / Neural Networks | Peer Review Due (Tues) | |
| 5/31: Projects | 6/2: Projects | Git Er Done | Git Er Done |

Installation and Setup

  • Install the Anaconda distribution of Python 2.7.x.
  • Install Git and create a GitHub account.
  • Once you receive an email invitation from Slack, join our "SF_DAT_17 team" and add your photo!

Resources

Class 1: Introduction / Expectations / Intro to Data Science / Python Exercises

#### Agenda

  • Introduction to General Assembly slides
  • Course overview: our philosophy and expectations (slides)
  • Intro to Data Science: slides

Break -- Command Line Tutorial

  • An introduction on how to read and write iPython notebooks: tutorial
  • Python pre-work here
  • Next class we will go over proper use of git and ipython notebooks in more depth

#### Homework

  • Make sure you have everything installed as specified above in "Installation and Setup" by Thursday
  • Read this awesome intro to Git here
  • Read this intro to the iPython notebook here

--

Class 2: Introduction to Git / Pandas

#### Agenda

  • Introduction to Git
  • Intro to Pandas walkthrough here
    • Pandas is an excellent tool for exploratory data analysis
    • It allows us to easily manipulate, graph, and visualize basic statistics and elements of our data
    • Pandas Lab!
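
The kinds of one-liners the walkthrough covers can be sketched like this (the country data below is made up purely for illustration):

```python
import pandas as pd

# A tiny hand-built DataFrame (made-up numbers, for illustration only)
drinks = pd.DataFrame({
    'country': ['USA', 'France', 'Italy', 'Japan'],
    'continent': ['NA', 'EU', 'EU', 'AS'],
    'beer_servings': [249, 127, 85, 77],
})

# The bread and butter of exploratory analysis: filter, sort, group
eu = drinks[drinks.continent == 'EU']                          # boolean filtering
ranked = drinks.sort_values('beer_servings', ascending=False)  # sorting
means = drinks.groupby('continent').beer_servings.mean()       # grouping
print(means)
```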

#### Homework

  • Go through the python file and finish any exercise you weren't able to in class
  • Make sure you have all of the repos cloned and ready to go
    • You should have both "sfdat22" and "sfdat22_work"
  • Read Greg Reda's Intro to Pandas
  • Take a look at Kaggle's Titanic competition

Resources:

  • Another Git tutorial here
  • In depth Git/Github tutorial series made by a GA_DC Data Science Instructor here
  • Another Intro to Pandas (Written by Wes McKinney and is adapted from his book)
    • Here is a video of Wes McKinney going through his ipython notebook!

--

Class 3: Pandas

#### Agenda

  • Don't forget to git pull in the sfdat22 repo in your command line
  • Intro to Pandas walkthrough here (same as last Thursday's)
  • Extended Intro to Pandas walkthrough here (new)

#### Homework

  • Finish any lab questions that you did not finish in class
    • Make sure everything is pushed to sfdat22_work if you'd like us to take a look
  • Make sure both requests and beautifulsoup are installed
    • To check, confirm that import requests and import bs4 both run without error in Python
  • Read this intro to APIs
  • Check out the National UFO Reporting Center here; it will be one of the topics of the lab on Thursday
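
A quick way to run that check before Thursday (a minimal sketch; the message strings are mine):

```python
# Check that Thursday's scraping libraries import cleanly.
# If either fails, install them with: pip install requests beautifulsoup4
try:
    import requests
    import bs4
    ready = True
except ImportError:
    ready = False

print('Ready for Thursday!' if ready else 'Install requests and beautifulsoup4 first.')
```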

Resources:

--

Class 4: APIs / Web Scraping 101

#### Agenda

  • I will also be using a module called tweepy today.
    • To install, type `conda install tweepy` into your console
      • OR, if that does not work, `pip install tweepy`
  • Slides on Getting Data here
  • Intro to Regular Expressions here
  • Getting Data from the open web here
  • Getting Data from an API here
  • LAB on getting data here
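
As a taste of the regular-expressions material, here is a minimal sketch (the tweet text is invented):

```python
import re

# Pull hashtags and @-mentions out of a made-up tweet
tweet = 'Scraping the web with #python and #regex, thanks @GA!'
hashtags = re.findall(r'#(\w+)', tweet)   # capture group keeps just the word
mentions = re.findall(r'@(\w+)', tweet)
print(hashtags, mentions)
```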

#### Homework

  • The first homework will be assigned by tomorrow morning (in a homework folder) and it is due NEXT Thursday (4/14)
    • It is a combo of pandas questions with a bit of API/scraping
    • Please push your completed work to your sfdat22_work repo for grading

#### Resources:

--

Class 5: Intro to Machine Learning / KNN

#### Agenda

  • Iris pre-work code

    • Using numpy to investigate the iris dataset further
    • Understanding how humans learn so that we can teach the machine!
  • Intro to numpy code

    • Numerical Python, code adapted from tutorial here
    • Special attention to the idea of the np.array
  • Intro to Machine Learning and KNN slides

    • Supervised vs Unsupervised Learning
    • Regression vs. Classification
  • Lab to create our own KNN model
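
A minimal sketch of the KNN workflow on the iris data (the unknown flower's measurements below are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# 150 flowers, 4 measurements each, 3 species
iris = load_iris()
X, y = iris.data, iris.target

# K=5: classify a new flower by majority vote of its 5 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
pred = knn.predict([[3, 5, 4, 2]])
print(iris.target_names[pred[0]])
```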

#### Homework

  • The one-page project milestone as well as the pandas homework! See requirements here
  • Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it in class on Wednesday. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
    • In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
    • In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
    • In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
    • How does the choice of K affect model bias? How about variance?
    • As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
    • Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
    • Does a high value for K cause over-fitting or under-fitting?

Resources:

--

Class 6: scikit-learn, Model Evaluation Procedures

  • Introduction to scikit-learn with iris data (code)
  • Exploring the scikit-learn documentation: user guide, module reference, class documentation
  • Discuss the article on the bias-variance tradeoff
  • Look at some code on the bias-variance tradeoff
    • To run this, I use a module called "seaborn"
    • To install, go to your terminal (git bash) and type `sudo pip install seaborn`
  • Model evaluation procedures (slides, code)
  • Glass Identification Lab here
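
The core evaluation procedure from the slides -- hold out a test set and score on it -- can be sketched like this (I use the current `train_test_split` import path; older scikit-learn versions kept it in `sklearn.cross_validation`):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()

# Hold out 30% of the data so we score on observations the model never saw
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=4)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
acc = accuracy_score(y_test, knn.predict(X_test))
print('testing accuracy: %.3f' % acc)
```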

Homework:

Optional:

  • Practice what we learned in class today! Finish up the Glass data lab

Resources:

--

Class 7: Linear Regression

  • Linear regression (notebook)

    • In depth slides here
  • LAB -- Yelp dataset here with the Yelp reviews data. It is not required, but your next homework will involve this dataset, so it would be helpful to take a look now!
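
A minimal sketch of fitting a line with scikit-learn (the data is synthetic, generated from y = 3x + 2 plus noise -- not the Yelp data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data drawn from y = 3x + 2 with a little Gaussian noise
rng = np.random.RandomState(1)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X.ravel() + 2 + rng.normal(scale=0.5, size=50)

# The fitted coefficients should land close to the true slope and intercept
model = LinearRegression()
model.fit(X, y)
print('slope: %.2f, intercept: %.2f' % (model.coef_[0], model.intercept_))
```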

Homework:

Resources:

--

Class 8: Logistic Regression

  • Logistic regression (notebook)
    • BONUS slides here (These slides go a bit deeper into the math)
  • Confusion matrix (slides)
  • LAB -- Exercise with Titanic data instructions
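
The pieces fit together like this (a minimal sketch on made-up one-feature data, not the Titanic set):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Eight made-up observations: low values are class 0, high values class 1
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)
pred = clf.predict(X)

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y, pred))
```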

Homework:

Resources:

--

Class 9: Time Series Data

Resources:

--

Class 10: Review (crime doesn't pay)

Class 11: Clustering

Homework:

  • Homework 2 is due in one week!
  • Project Milestone 2 is due in two weeks!
  • Download all of the NLTK collections.
    • In Python, use the following commands to bring up the download menu:
    • `import nltk`
    • `nltk.download()`
    • Choose "all".
    • Alternatively, just type `nltk.download('all')`
  • Install two new packages: textblob and lda.
    • Open a terminal or command prompt.
    • Type `pip install textblob` and `pip install lda`.
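
For the clustering half of the class, the basic K-means call looks like this (a sketch on two made-up blobs of points):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of made-up 2-D points
pts = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

# Ask K-means for 2 clusters; each point gets a cluster label
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(pts)
print(labels)
```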

Resources:

--

## Class 12: Natural Language Processing

Agenda

  • Natural Language Processing is the science of turning words and sentences into data and numbers. Today we will be exploring techniques in this field
  • code showing topics in NLP
  • lab analyzing tweets about the stock market

Homework:

  • HW2 is assigned in the hw folder
  • Read Paul Graham's A Plan for Spam and be prepared to discuss it in class on Wednesday. Here are some questions to think about while you read:
    • Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
    • Before he tried the "statistical approach" to spam filtering, what was his approach?
    • How exactly does his statistical filtering system work?
    • What did Paul say were some of the benefits of the statistical approach?
    • How good was his prediction of the "spam of the future"?
  • Below are the foundational topics upon which Wednesday's class will depend. Please review these materials before class:
    • Confusion matrix: a good guide that roughly mirrors the lecture from class 10.
    • Sensitivity and specificity: Rahul Patwari has an excellent video (9 minutes).
    • Basics of probability: These introductory slides (from the OpenIntro Statistics textbook) are quite good and include integrated quizzes. Pay specific attention to these terms: probability, sample space, mutually exclusive, independent.
  • You should definitely be working on your project! First draft is due Monday!!

--

## Class 13: Naive Bayes Classifier

Today we are going over advanced metrics for classification models and learning a brand new classification model called Naive Bayes!

Agenda

  • Are you smart enough to work at Facebook?
  • Learn about Naive Bayes and ROC/AUC curves
    • Slides here
    • Code here
    • In the code file above we will create our own spam classifier!
  • Work on Homework / previous labs
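
The spirit of the spam classifier we build in class, shrunk to a toy sketch (the training messages are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages: 1 = spam, 0 = ham
messages = ['win cash now', 'free prize win now',
            'lunch at noon', 'see you at the meeting']
labels = [1, 1, 0, 0]

# Bag-of-words features, then a multinomial Naive Bayes model on top
vect = CountVectorizer()
X_train = vect.fit_transform(messages)

nb = MultinomialNB()
nb.fit(X_train, labels)

# A new message built from spammy words should score as spam
new = vect.transform(['win a free prize'])
print(nb.predict(new)[0])
```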

Resources

  • Bayes Theorem as applied to Monty Hall here and here
  • Video on ROC Curves (12 minutes).
  • My good buddy's blog post about the ROC video includes the complete transcript and screenshots, in case you learn better by reading instead of watching.
  • Accuracy vs AUC discussions here and here

--

Class 14: Decision Trees / Ensembling

Decision Tree Resources:

  • scikit-learn's documentation on decision trees includes a nice overview of trees as well as tips for proper usage.
  • For a more thorough introduction to decision trees, read section 4.3 (23 pages) of Introduction to Data Mining. (Chapter 4 is available as a free download.)
  • If you want to go deep into the different decision tree algorithms, this slide deck contains A Brief History of Classification and Regression Trees.
  • The Science of Singing Along contains a neat regression tree (page 136) for predicting the percentage of an audience at a music venue that will sing along to a pop song.
  • Decision trees are common in the medical field for differential diagnosis, such as this classification tree for identifying psychosis.
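
To make the resources above concrete, here is a shallow tree fit on the iris data (a minimal sketch; depth is capped at 2 so the tree stays interpretable):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Cap the depth so the tree stays small and readable
tree = DecisionTreeClassifier(max_depth=2, random_state=1)
tree.fit(iris.data, iris.target)

print('training accuracy: %.3f' % tree.score(iris.data, iris.target))
for name, imp in zip(iris.feature_names, tree.feature_importances_):
    print('%s: %.2f' % (name, imp))
```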

Ensembling Resources:

--

Class 15: Dimension Reduction

Resources

  • Some hardcore math in python here
  • PCA using the iris data set here and with 2 components here
  • PCA step by step here
  • Check out Pyxley
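
The iris PCA examples above boil down to a couple of lines (a sketch with 2 components):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

# Project the 4 measurements down to 2 principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(iris.data)

print(reduced.shape)
print('variance kept: %.3f' % pca.explained_variance_ratio_.sum())
```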

--

Class 17: Web Development with Flask and Heroku

Objectives

  • To launch your own machine learning powered website

Agenda

  • Sign up for Project times NEXT WEEK
  • Slides here
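
The smallest possible version of the objective -- a Flask app you could later bolt a model onto (a sketch; the route and message are mine):

```python
from flask import Flask

app = Flask(__name__)

@app.route('/')
def home():
    # A trained model's predictions would eventually be served from routes like this
    return 'My machine learning app lives here!'

# Run locally with:  flask --app <this file> run   then visit http://127.0.0.1:5000
```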

Homework

  • Work on your project!!
  • Work on your project!!
  • Work on your project!!
  • Work on your project!!
  • please
  • Work on your project!!

--

Class 18: Neural Networks and SVM

Agenda

  • Sign up for a project slot here
  • slides here
  • SVM notebook here
  • Neural Network notebook here
  • We will need a new package, PyBrain! Install with `sudo pip install pybrain`
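
A minimal support vector machine sketch on the iris data (RBF kernel, scikit-learn defaults):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()

# An SVM with the default RBF kernel
svm = SVC(kernel='rbf', gamma='scale')
svm.fit(iris.data, iris.target)

print('support vectors per class:', svm.n_support_)
print('training accuracy: %.3f' % svm.score(iris.data, iris.target))
```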

Office Hours this weekend are Saturday from 9am-5pm and Sunday from 12pm-2pm

Project questions ONLY

Resources
