In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
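# Double the DPI of saved figures; assumes 'savefig.dpi' is numeric (newer matplotlib defaults it to the string 'figure').
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['savefig.dpi']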

## Daily / Weekly Format:

The main parts of the course are the lecture and miniprojects.

- **Lecture:** We'll work through the course material each day at 12 PM EST.  Friday is interview practice day.

- **Miniprojects:** Each week's action items are due *by Saturday night*.  Prior Fellows agreed that miniprojects exposed them to concepts that a single large project would not have.  Fellows who completed miniprojects also had a better chance of being placed.

We encourage you to post advice, issues, and resolved problems on the wiki in the [subpages](https://sites.google.com/a/thedataincubator.com/the-data-incubator-wiki/course-information-and-logistics/course) of the Course section.

### Interview Practice
We'll do a little bit of practice with interviewing on Fridays.  Here are some notes:
- [Programming Questions](Programming_Questions.ipynb)
- [Statistics Questions](Statistics_Questions.ipynb)
  
**WARNING:** *It's tempting to just "discuss the problems in a group."  The main point of practicing for interviews is practicing and getting familiar with thinking on your feet with no guidance.  "Discussing problems in a group" is not at all useful for this.  We suggest that everyone should solve the problem on their own, timed (give yourself 5 minutes), preferably while standing.  You want to feel as comfortable with this as possible because you will be much more nervous when you interview.  You can discuss the problems afterwards.*
    
### Capstone Projects

**Ghosting** your project and loading your data:
1. Decide on the basic format of your presentation (website, ipynb, etc.).  Build a bare-bones version of your website / ipynb with fake data that stands in for the results you expect to get.  Your work should tell a story, and all the supporting points (there should be at most 3) need to reinforce this story.  Think about how you would present your results.  What kinds of graphs or visualizations would you need?  Plan your analysis accordingly.
1. Once you know the destination, it's a lot easier to lay the groundwork.  That means setting up SQL tables, writing the code that loads your data into pandas DataFrames, and joining the appropriate tables so that you can start working.
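
As a rough illustration of the groundwork step above, here is a minimal sketch that sets up two SQL tables with fake data and joins them.  The table names, columns, and values are all hypothetical placeholders, and it uses an in-memory SQLite database from the standard library only to stay self-contained; your own project may call for a different datastore.

```python
import sqlite3

# Hypothetical schema and fake rows, standing in for the results you expect.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE reviews (restaurant_id INTEGER, stars INTEGER);
""")
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?)",
    [(1, "Fake Diner", "American"), (2, "Placeholder Pho", "Vietnamese")],
)
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [(1, 4), (1, 2), (2, 5)],
)


def average_stars(connection):
    """Join the two tables and return (name, average stars) rows, best first."""
    query = """
        SELECT r.name, AVG(v.stars) AS avg_stars
        FROM restaurants AS r
        JOIN reviews AS v ON v.restaurant_id = r.id
        GROUP BY r.id, r.name
        ORDER BY avg_stars DESC
    """
    return connection.execute(query).fetchall()


rows = average_stars(conn)
for name, avg in rows:
    print(f"{name}: {avg:.1f}")
```

Once real data replaces the fake rows, the join and the plotting code built on top of it should not need to change.  If you prefer to work in pandas, passing the same SQL and connection to `pandas.read_sql_query` is one way to get the joined result as a DataFrame.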

Capstone projects are an important part of the program.  The advantages of doing a capstone project include:
1. You will be invited to pitch your project to employers at our employer events.
1. Some employers judge Fellows based on their projects, which can lead to invitations to pitch onsite.
1. In the past, completing a project has consistently led to multiple (and higher) offers for Fellows.

We'll be doing 1-2 minute screencasts each week.  The format of the video should be roughly:

1. Introduce yourself and say 1-2 sentences about your academic background.  ("Hi, I'm Michael.  I got my PhD at Princeton in Applied Math.")
2. Introduce your project for The Data Incubator.  Why is it important to businesses?  ("My Data Incubator project uses machine learning to identify the cutest cat videos on the internet.")
3. Talk through two to three "analytical" or "engineering" points that were *interesting*.  Great lines feel like this:
    - We looked at 5 million user reviews collected over 4 years on Yelp.
    - I'm using MongoDB, a NoSQL datastore, to make my webpage load faster.
    - I used cross-validation to train a Random Forest for my Citibike demand predictions.
    - I realized that restaurant category tagging was inconsistent and I used a Naive Bayes model on tip text to fill in missing categories.

Be very visual in your explanations (basically, if it can be graphed, graph it).  The screencast should show you demoing your website (if you have one) and your code.

**Sample Screencasts:**
- Here's one from a Harvard class (CS 109) that's thematically similar, on building a Reddit recommender: https://www.youtube.com/watch?v=lGXQ8mQMR0s
- Some more from the same class about [the flu](https://www.youtube.com/watch?v=_eUDtxGzOMo), [predicting salaries](https://www.youtube.com/watch?v=9odQde25oSQ), [predicting NBA games](https://www.youtube.com/watch?v=HlF6eXJ4UgQ), and the [Twitter response to the Boston Marathon bombing](https://www.youtube.com/watch?v=IbXRxmNn-Jk)

**Resources:**
1. You can use your camera to record the introduction of yourself and your project.
1. You can do a screencast (recording your audio and your desktop as you scroll through demoing things) for the demo portion.  Here are instructions for doing it on [OSX](http://thenextweb.com/apple/2011/01/15/how-to-record-quick-easy-screencast-videos-with-mac-osx/).
1. [Recommended Project Structure](Project_Structure.ipynb)

## Overview for Module 1:

**Goal:** Developing familiarity with basic Python tools and getting practice with them for data analysis

- [Numpy, Scipy, and Matplotlib](../module1/Numpy_Scipy_Matplotlib.ipynb)
- [Pandas](../module1/Pandas.ipynb)
- [Scraping](../module1/Scraping.ipynb)
- [APIs and JSON](../module1/APIs_and_JSON.ipynb)
- [What Technology to Use](../module1/What_Technology_to_Use.ipynb)
- [Good Engineering Practice](../module1/Good_Engineering_Practice.ipynb)

**Optional Topics:**

- [SQL](../module1/SQL.ipynb)
- [Handling Strings in Python](../module1/Dealing_with_Strings.ipynb)


## Overview for Module 2:

**Goal:** Developing familiarity with machine learning and the Python tooling around it

- [Learning and Metrics](../module2/Learning_and_Metrics.ipynb)
- [Overfitting](../module2/Overfitting.ipynb)
- [Linear Regression](../module2/Linear_Regression.ipynb)
- [Unsupervised Learning](../module2/Unsupervised_Learning.ipynb)
- [Scikit-learn Workflow](../module2/Scikit_Learn_Workflow.ipynb)

**Optional Topics:**
- [K Nearest Neighbors](../module2/K_Nearest_Neighbors.ipynb)

## Overview for Module 3:

**Goal:** Covering more advanced topics in machine learning and statistics, and doing more work on case studies
- [Decision Trees and Random Forests](../module3/Decision_Trees_and_Random_Forest.ipynb)
- [NLP Example Disambiguation](../module3/Natural_Language_Processing.ipynb)
- [Support Vector Machines](../module3/Support_Vector_Machines.ipynb)
- [Time Series](../module3/Time_Series.ipynb)

**Optional Topics:**
- [Comparing ML Algorithms](../module3/Comparing_ML_Algorithms.ipynb)
- [Unbalanced Classes](../module3/Unbalanced_Classes.ipynb)

## Overview for Module 4:

**Goal:** Getting familiar with the tooling around *Big Data*
- [Amazon Web Services](../module4/Amazon_Web_Services.ipynb)
- [Intro to Distributed Computing](../module4/Intro_Distributed_Computing.ipynb)
- [Python MapReduce](../module4/Python_Mapreduce.ipynb)
- [Naive Bayes](../module4/Naive_Bayes.ipynb)
- [Out of Core and Online Learning](../module4/Out_of_Core_and_Online_Learning.ipynb)

**Optional Topics:**
- [Hadoop and Java MapReduce](../module4/Hadoop_and_Java_Mapreduce.ipynb)

## Overview for Module 5:

**Goal:** Introduce Spark and the Python/Scala APIs; introduce Scalding; get a feel for a production data workload
- [Intro to PySpark](../module5/PySpark_Intro.ipynb)
- [Creating Spark Applications](../module5/Spark_Creating_Applications.ipynb)
- [PySpark MLlib](../module5/PySpark_MLlib.ipynb)
- [Spark Advanced Topics](../module5/Spark_Advanced_Topics.ipynb)

**Optional Topics:**
- [Scala Primer](../module5/Scala_Primer.ipynb)
- [Scala Spark Intro](../module5/Spark_Intro.ipynb)
- [Scala Spark MLlib](../module5/Spark_MLlib.ipynb)
- [Scalding](../module5/Scalding.ipynb)
- [Spark Streaming](../module5/Spark_Streaming.ipynb)

## Overview for Module 6:

**Goal:** Thinking about and practicing data visualization
- [Exploring Data Visually](../module6/Exploring_Data_Visually.ipynb)
- [Visualization Best Practices](../module6/Visualization_Best_Practices.ipynb)
- [Visualization with D3](../module6/Visualization_with_D3.ipynb)
- [Interactive Visualizations with D3](../module6/D3.ipynb)

**Optional Topics:**
- [Interactive Plotting in Jupyter Notebooks](../module6/Interactive_Visualizations_in_Notebooks.ipynb)

**Action Item (ungraded):** Update your 12-day project to include real-time interactivity such as sliders or drop-down menus. Feel free to use different datasets (e.g. Yelp reviews) to make something interesting and attractive.

## Overview for Module 7:

**Goal:** Thinking more broadly about data and interview prep

- [Case Studies](../module7/Case_Studies.ipynb)
- [Personal Interview Questions](../module7/Personal_Interview_Questions.ipynb)
- [Thinking Outside the Data](../module7/Thinking_Outside_the_Data.ipynb)

**Optional Topics:**
- [Algorithms and Data Structures](../module7/Algorithms_and_Data_Structures.ipynb)
- [Hypothesis Testing](../module7/Hypothesis_Testing.ipynb)
- [Statistics](../module7/Statistics.ipynb)

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*