# Regression Mechanics

## Housecleaning

<img src="https://i.imgur.com/9LNt8ji.jpg" alt="Drawing" style="width: 400px;"/>

### Teams

#### Who still needs one?

#### GitHub for your team

GitHub set up two locations for your team:
1. A team area where you can add discussions and extra repos if you want to experiment. The URL is like this: https://github.com/orgs/LeDataSciFi/teams/corona-crushers 
1. The usual repo URL structure for the assignment: https://github.com/LeDataSciFi/semester-project-corona-crushers 

### The proposal, due Monday

In the "usual" repo, modify the README as a group. 
- The title should be: `# Research Proposal: < Title >`
- A "Research Question" section that answers questions 1.1-1.3 [here](https://www.textbook.ds100.org/ch/01/lifecycle_students_1.html)
    1. What do we want to know or what problems are we trying to solve?
    2. What are our hypotheses?
    3. What are our metrics of success? 
- A "Data" section about the data you think you'll need
    1. What does the final dataset need to look like? (This is mostly dictated by the question and the availability of data.)
        - What is an observation, e.g. a firm, or a firm-year, etc.?
        - What is the sample period?
        - What are the sample conditions? (Years, and any restrictions you anticipate, e.g. excluding or requiring some industries.)
        - What variables are absolutely necessary and what would you like to have if possible?
    2. What data do we have and what data do we need?
    3. How will we collect more data?
    4. What are the raw inputs and how will you store them (the folder structure(s) for each input type)?
    5. Speculate at a high level (not specific code!) about how you'll transform the raw data into the final form.

## Collaboration as a group

### **Q: How should you "meet"?**

A: It's up to you! Be entrepreneurial and run your group as you all see fit. 
- WhatsApp, groupme, google doc, zoom, skype...
- How can we work on code at the same time? Ideas crowdsourced from students:
    - Zoom allows for sharing screen, and others can jump in and control
    - Probably best to "give ownership" over parts to different people
    
### **Q: How do we code "at the same time"?**

The main issue is that two people might make conflicting changes. E.g., Johnny added a line to `data.py` but Cindy deleted a line from `data.py`.

A: You have, basically, three approaches, and you might use all three at different points of the project:
1. **The free-for-all approach.** Everyone works in the "master" branch of the repo all the time. This might be your default instinct. It can work, but you will have to fix merge conflicts before you can update the repo and make progress.
    - One way to avoid conflicts is a "one person at a time on one code file at a time" rule.
1. **The "branching" approach.** Basically, you create an offshoot of the folder to work on, and when you're done and happy, you create a "pull request" where you ask the main project's owner (your own team, in this case) to pull your branch's changes up to the main repo.
    - See [here for a walkthrough](https://www.attosol.com/create-and-merge-branches-using-github-desktop-client/).
1. **The fork-and-clone approach.**
    - You fork the repo (make your own version), clone it to your computer, make changes there, push them up to GitHub, then open a pull request.
    - It's somewhat like branching, but your work is outside the repo completely. I would probably recommend not using this. 
    - But if you want to try this, [here is a walkthrough](https://guides.github.com/activities/forking/).
    
### Feedback request / Discussion Participation

**I would love your feedback on how you deal with the asynchronous work problem!** 
- Please let me know what issues/problems your group runs into
- What solutions did you use (were they good or awful?)
- If your group has an easy time, or finds something that works well, please let me and your classmates know! 
- Submit your experience on this via the discussion board 

Because of the shift to remote learning, the participation grade is going to rely heavily on discussions in the issue repo.
- Questions you ask and issues you post
- Answers and discussion you contribute on classmates' issues are absolutely key
- Quantity and quality of your posts both matter!
- Review the instructions for ["how to ask for help"](https://ledatascifi.github.io/studentresourcevert/resource-landing.html#how-to-properly-ask-for-help) before you post your next issue
    
### Recommendations

1. No matter which of the three approaches you choose, **FOLLOW THESE RULES EVERY SINGLE TIME YOU WORK ON CODE OR DO ANYTHING IN THE FOLDER:**
    - BEFORE YOU START A CODING SESSION: Go to GH Desktop and "Fetch/Pull" origin
    - WHEN YOU ARE DONE WITH A SESSION: Clear your code's output, rerun all, save the file, then commit and push to GitHub
    
    **It's very important to do these every single time you code on your computer in group projects!**
    
    **Why?**
    
    If you forget to fetch/pull before you start (and someone made a change on the github repo since you last synced), or if someone is working at the same time (and pushes a change to the github repo that conflicts with a change you made), you are likely to receive a "Merge Conflict" notification from GH Desktop.
    
2. Your most experienced coder might be given "CEO" status over the repo, leading the way on pull requests and giving guidance on merge conflicts.
3. Instead of putting the entire project in one ipynb file, structure the project like the latest assignment: 
    - One code file to download each input needed, 
    - One code file to parse/transform each input, 
    - One "get_all_data" code file that, if executed, would run all files above
    - One code file to build the analysis sample, explore it, and analyze it
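As a rough sketch of the "get_all_data" idea above, a small driver can run each download and parse script in order. The file names in `PIPELINE` are hypothetical placeholders, not files from the assignment, and `run_all` is just an illustrative helper built on the standard library's `runpy`.

```python
import runpy
from pathlib import Path


def run_all(script_paths):
    """Run each script in order, as if it were executed with `python script.py`.

    Returns the names of the scripts that ran, so you can log or check them.
    """
    completed = []
    for path in map(Path, script_paths):
        # Fail early with a clear message if a pipeline step is missing
        if not path.is_file():
            raise FileNotFoundError(f"Missing pipeline step: {path}")
        print(f"Running {path} ...")
        runpy.run_path(str(path), run_name="__main__")
        completed.append(path.name)
    return completed


# Hypothetical file names: replace these with your team's own scripts.
PIPELINE = [
    "download_input_a.py",
    "download_input_b.py",
    "parse_input_a.py",
    "parse_input_b.py",
]
```

Calling `run_all(PIPELINE)` from the project folder would then rebuild every input from scratch, which makes it easy for any teammate to reproduce the data after pulling the latest changes.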



### Branching Demo

![](https://media.giphy.com/media/d7neiLOpRXaz5bueyd/giphy.gif)

## Regression

![](https://media.giphy.com/media/nwyqBwP65XCAU/giphy.gif)

### Required reading for this topic

1. [Chapters 22-24 of R 4 Data Science](https://r4ds.had.co.nz/model-intro.html) are an excellent overview of the thought process of modeling
2. Use `statsmodels.api` to make nice regression tables by [following this guide](https://python.quantecon.org/ols.html) (you can use different data though)
3. If you want to train a model using more sophisticated ML ideas (which we will talk about some later in the course!), use `sklearn.linear_model` and [follow this guide](https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html) (you can use different data though)
    - _The "Linear Regression" section [here](https://becominghuman.ai/linear-regression-in-python-with-pandas-scikit-learn-72574a2ec1a5) shows how you can run regressions on training samples and test them out of sample_
    
### Demo on Diamonds

Inspired by R4DS.

![](https://media.giphy.com/media/piXrzDejeWIM/giphy.gif)


### Homework

1. Load the titanic dataset from `seaborn` (`sns.load_dataset("titanic")`) and try some regressions.
2. Required reading above.
3. Obviously, the 5th assignment is nearly due.
4. As is your team's proposal.

### Objectives

1. You can fit a regression with `statsmodels.api` or `sklearn.linear_model`
2. You can view the results of your model with either
3. Practice estimating and interpreting linear models

![](https://media.giphy.com/media/yoJC2K6rCzwNY2EngA/giphy.gif)
