## Lighthouse Labs
### W01D4 Miniprojects 
Instructor: Socorro Dominguez  
January 07, 2021


**Agenda:**

* Quick Review Days 1-3 (30 min)
    * Python
    * APIs
    * BASH
* The value of mini projects
* How to approach mini projects
    * explore
    * design
    * implement
    * present

* Presentation Guidelines
    * introduction - 1 min
    * tools and packages - 1 min
    * results - 1-2 min
    * challenges and next steps - 1 min

* EXTRA: Intro to Git
    * How to collaborate with Git (Simple Git Workflow)

What are Data Science Projects?

* Source of knowledge/experience.

* What you will be talking about in your job interviews

* Benchmarking a new technique
     * Example: you develop a new classification algorithm and want to compare it to existing ones
     * Use popular publicly available benchmark datasets (e.g. from Kaggle)
* New use-case for existing technique
     * Example: you apply a product-recommendations algorithm to your business’s order history

### Lighthouse Projects

* Can be question/answer-based or open-ended.
* You have to come up with some of your own problem statements and goals to frame your work.
* Not limited to just answering the questions


**Miniprojects**
- APIs - Python data structures (individual)
- Databases (SQL) - Pandas (individual)
- Feature engineering - Dimensionality reduction - Unsupervised learning (pairs)
- Supervised learning - Deployment (individual)
- Deep Learning - NLP (individual)
- Open-ended final project

## Individual and group work

* How to plan the project?
     - Set up your working space.
     - Write or download a public Code of Conduct
     - Choose a licence for your work.
     - Decide how you will collaborate. Will you share a notebook on Google Cloud? Will you use a version control system (GitHub)?
     - Decide how many Jupyter notebooks or Python scripts you will need so that you do not duplicate work.
     - Write a clear README on how your repo works.

Keep in mind:
- Two minds do not necessarily code twice as fast!
- Try version control and parallelization:
    - Different sub-tasks
    - Different files
- Instead of pair-coding, try doing code reviews of each other’s work.
- Decide whether you will work in branches or forks. 
- GitHub: push / submit pull requests only at working milestones (no errors). 

## Mini Project 1

**Part 1: Transport for London API**  
Example: Plan the journey from Heathrow Airport to Tower Bridge by public transport, taxi, or bike. Which way is the fastest?


**Part 2: The Movie Database API (stretch)**  
Example: Find the top 5 trending movies


**Challenges:**
- Working with difficult documentation (poor descriptions of what input/return values are)
- Parsing complex data structures (nested lists/dictionaries)
- 7-minute presentation (plus 1 minute of feedback)
    * Make sure to respect your 7 minutes; do not go past them.
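
To practice the nested-structure challenge above, here is a minimal sketch on a made-up response. The field names (`journeys`, `mode`, `duration`, `legs`, `summary`) and the numbers are invented for illustration; they are not the real schema of either API, so check the actual documentation before adapting it.

```python
# Hypothetical response: keys and values are invented for illustration only.
sample_response = {
    "journeys": [
        {"mode": "public transport", "duration": 55,
         "legs": [{"summary": "train"}, {"summary": "walk"}]},
        {"mode": "taxi", "duration": 48,
         "legs": [{"summary": "drive"}]},
        {"mode": "bike", "duration": 95,
         "legs": [{"summary": "cycle"}]},
    ]
}


def fastest_journey(response):
    """Return (mode, duration) of the shortest journey, or None if there are none."""
    journeys = response.get("journeys", [])
    if not journeys:
        return None
    best = min(journeys, key=lambda journey: journey["duration"])
    return best["mode"], best["duration"]


def leg_summaries(response):
    """Map each journey mode to the list of its leg summaries."""
    return {
        journey["mode"]: [leg["summary"] for leg in journey.get("legs", [])]
        for journey in response.get("journeys", [])
    }


print(fastest_journey(sample_response))
print(leg_summaries(sample_response))
```

Using `.get(..., [])` for optional keys keeps the functions from crashing when a response is missing a field, which is common with poorly documented APIs.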

## Strategies for your projects

- Good code follows the DRY (Don't Repeat Yourself) principle.
    * Name variables in a human-readable fashion.
    * Use comments and docstrings when needed.
    * Any time you find yourself writing similar code again and again, make a function.
    
    
* Other good practices include:
    - Define functions and/or classes whenever possible (e.g. get_transit(url))
    - Save GET request results and only fetch when needed (e.g. function that checks if data has been fetched, fetches data at a URL, then saves it)
    - Jupyter code blocks should have a clearly defined goal.
    - Periodically refactor code (i.e. clean up, reorganize, consolidate)
    - Write tests and assert statements for your functions.
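
The "save GET request results" and "write assert statements" practices above could look roughly like this. It is a sketch using only the standard library; the names `download_json`, `get_json_cached`, and `fake_fetch`, the cache file path, and the example URL are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from urllib.request import urlopen


def download_json(url):
    """Send a GET request to url and decode the response body as JSON."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def get_json_cached(url, cache_path, fetch=download_json):
    """Return saved data if cache_path exists; otherwise fetch, save, and return it."""
    cache_file = Path(cache_path)
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    data = fetch(url)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(data), encoding="utf-8")
    return data


# Quick check with a fake fetcher, so no network access is needed.
calls = []


def fake_fetch(url):
    """Stand-in for download_json that records how often it is called."""
    calls.append(url)
    return {"source": url}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "cache" / "example.json"
    first = get_json_cached("https://example.com/data", cache_file, fetch=fake_fetch)
    second = get_json_cached("https://example.com/data", cache_file, fetch=fake_fetch)
    assert first == second
    assert len(calls) == 1  # the second call was served from the saved file
```

Passing the fetch function as a parameter is one possible design choice: it lets you test the caching logic without hitting a real API or using up a rate limit.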

## Progress and Difficulty

* Make a minimum viable product (MVP) early
* Dataset difficulty 
* Model complexity (not yet)
* Task complexity (not yet)

- Explore the dataset, API, and other tools  
**BE CURIOUS**  
- Play with the dataset
    - What do the variables mean?
    - Which variables are categorical, ordinal, and continuous?
    - Range/variance of each variable?
    - Plot the aspects of the dataset that interest you
- Try out the API functions and explore the returned structure
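
A first pass at the exploration questions above (variable types, range, variance) can be done in plain Python. This is a sketch on invented toy records; the column names and the helper `summarize` are assumptions for illustration.

```python
from collections import Counter
from statistics import mean, pvariance

# Invented toy records for illustration only.
records = [
    {"mode": "bus", "duration_min": 40, "zone": 1},
    {"mode": "tube", "duration_min": 25, "zone": 2},
    {"mode": "bus", "duration_min": 55, "zone": 2},
    {"mode": "bike", "duration_min": 30, "zone": 1},
]


def is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def summarize(rows, column):
    """Summarize one column: counts for non-numeric values, range/mean/variance for numbers."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        return {"kind": "empty"}
    if all(is_number(value) for value in values):
        return {
            "kind": "numeric",
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "variance": pvariance(values),  # population variance
        }
    return {"kind": "categorical", "counts": dict(Counter(values))}


for column in ("mode", "duration_min", "zone"):
    print(column, summarize(records, column))
```

Note that `zone` is reported as numeric even though it is really an ordinal category. Code can only guess at variable types, so deciding what each variable means is still your job.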

What might a good repo look like?

[Template](https://github.com/sedv8808/template)

What do you notice?

* Intro to Git 
    * A small [Guide](https://github.com/sedv8808/LighthouseLabs/blob/main/W05D5/GitHub%20Presentation/01_VersionControl.pdf) to version control
    * How to collaborate with Git (Simple Git Workflow)
        * Decide whether you will collaborate through branches or by forking the repository.
        * Your master branch should always be "ready to deploy".

* Why use Git?
    - Public: Employers can see all the projects you’ve worked on
    - Versioned: You will have a history and can roll back to old commits
    - Server deployment: Just git pull to any new machine
    - Teamwork: Everyone can work on their own copy and merge working versions into the master copy


## Project presentations


**General Pointers**  

1. Present as if to a client (who has some data science knowledge)
2. Make it a story
3. Explain the problem
4. Describe the dataset
5. Unfold how you analyzed the dataset
6. Show your findings

Avoid walking through code. 


**Structure**  

- Motivation: What is the problem? Why is it important (either business, public good, or research perspective)?
- Task: Problem from a technical perspective. Description of the dataset, algorithm inputs/outputs, analyses done using model
- Modeling: usually the next section of a presentation, but not needed yet.
- Results: Visuals! Show metrics and experiments. Demo (if any)
- Conclusions: What worked? What didn’t (and why)? How are we better off? Where could the project go next?

## Need figures? 
- Use draw.io for diagrams
- Use Python plotting libraries for charts
