In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("midterm_project-checkpoint-autograder.ipynb")

In [1]:
# Run this cell

from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
np.set_printoptions(legacy='1.13')

## Logistics

**Checkpoint.** For full credit, **you must complete the checkpoint** (this is your first draft of the project). For the checkpoint, you must complete the questions up to the end of Part 2 (indicated by the cell that says "End of Checkpoint") by **Wednesday, February 2, 11:59pm**. Submit in Gradescope.

**Deadline.** The full project is due on **Wednesday, February 9, 11:59pm**. It's **much** better to be early than late, so start working now. Submit in Gradescope.

**Partners.** You may work with one partner. Only one of you is required to submit the project in Gradescope.

**Rules.** Don't share your code with anybody but your partner. You are welcome to discuss questions with other students, but don't share the answers. The experience of solving the problems in this project will prepare you for exams (and life). If someone asks you for the answer, resist! Instead, you can demonstrate how you would solve a similar problem.

**Support.** You are not alone! Come to office hours, post on Piazza, and talk to your classmates. If you want to ask about the details of your solution to a problem, make a private Piazza post and the staff will respond. If you're ever feeling overwhelmed or don't know how to make progress, send a Piazza message (public, private or anonymous) to your TA or ULA for help. 

**Tests.** Passing the tests for a question **does not** mean that you answered the question correctly. Tests usually only check that your table has the correct column labels. However, more tests will be applied to verify the correctness of your submission in order to assign your final score, so be careful and check your work!

**Advice.** Develop your answers incrementally. To perform a complicated table manipulation, break it up into steps, perform each step on a different line, give a new name to each result, and check that each intermediate result is what you expect. You can add any additional names or functions you want to the provided cells. 

All of the concepts necessary for this project are found in the textbook. If you are stuck on a particular problem, reading through the relevant textbook section will often help clarify the concept.

Here is a roadmap for this project:

* In Part 1, we'll investigate whether the speed limit has an effect on limiting car accidents.
* In Part 2, we'll look at the salaries of data scientists in different big companies to see if they differ significantly.
* In Part 3, you'll design your own experiment using the titanic dataset.

## Part 1: SPEED LIMIT AND ACCIDENTS

In 1961-1962, an experiment was conducted in Sweden to assess whether the implementation of a speed limit reduced the number of accidents on a highway. Researchers found that cars tended to drive significantly faster on days without a speed limit than on days with one.

Our [dataset](https://r-data.pmagunia.com/dataset/r-dataset-package-mass-traffic) contains the following columns:
1. **year** - the year
2. **day** - the day of the year, e.g., 1 = Jan 1, 2 = Jan 2, etc.
3. **limit** - whether a speed limit was enforced on that day
4. **accidents** - the number of accidents recorded on that day

In [2]:
# Run this cell
traffic = Table.read_table('data/traffic.csv')
traffic.show(5)

**Question 1** 

Before we conduct any statistical analysis, we should do some data exploration. First, let's see if there is a difference in average accidents for days with a speed limit versus days without one.

Create a table named `accidents`, with two columns and two rows. The two columns should be "limit" and "accidents mean". There should be one row for days with a speed limit ('yes') and one for days without ('no'), and each row should give the average number of accidents for that group.
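To make the idea of a per-group average concrete, here is a minimal sketch on made-up numbers. It uses plain Python rather than the `datascience` `Table` methods you are expected to use in this project, and the names `records`, `group_means`, and `toy_means` are hypothetical, introduced only for this illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records (label, value); these are NOT the traffic data.
records = [("yes", 10), ("yes", 14), ("no", 20), ("no", 24), ("no", 22)]

def group_means(rows):
    """Return a dict mapping each label to the mean of its values."""
    grouped = defaultdict(list)
    for label, value in rows:
        grouped[label].append(value)
    return {label: mean(values) for label, values in grouped.items()}

toy_means = group_means(records)
print(toy_means)
```

Each label ends up with exactly one summary number, which is the same shape as the two-row table described above.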

<!--
BEGIN QUESTION
name: q1_1
-->

In [3]:
accidents = ...
accidents

In [None]:
grader.check("q1_1")

<!-- BEGIN QUESTION -->

**Question 2.** Let's visualize our data.

Create a bar chart to visualize the difference in these two groups. For a refresher on how to create a bar chart, please reference the textbook: [Ch 7.1.1 Bar Chart](https://inferentialthinking.com/chapters/07/1/Visualizing_Categorical_Distributions.html#bar-chart)

<!--
BEGIN QUESTION
name: q1_2
points: 1
manual: true
-->

In [8]:
# Create bar chart here

...

<!-- END QUESTION -->

**Question 3:**
Through our visualization, we can see that there are on average more accidents on days without a speed limit. Thus, we define our hypotheses as follows:

**Test Statistic:** The difference in the average number of accidents between days with no speed limit and days with a speed limit.

**Null Hypothesis:** The speed limit does not affect the number of accidents on any given day. Any deviation is due to random chance.

**Alternative Hypothesis:** The speed limit does reduce the number of accidents on any given day.

In the cell below, write code to compute the difference in the average number of accidents between days with no speed limit and days with a speed limit.

<!--
BEGIN QUESTION
name: q1_3
-->

In [9]:
accidents_observed_statistic = ...
accidents_observed_statistic

In [None]:
grader.check("q1_3")

**Question 4:** Now, create a function `compute_accidents_test_statistic` which takes in a table like `traffic` and computes our test statistic, which is the difference in the average number of accidents between days with no speed limit and days with a speed limit.

This function should combine what was done in Q1 and Q3, and should have the same output as Q3.
<!--
BEGIN QUESTION
name: q1_4
-->

In [12]:
def compute_accidents_test_statistic(tbl):
    ...
    
traffic_observed_statistic = compute_accidents_test_statistic(traffic)
traffic_observed_statistic

In [None]:
grader.check("q1_4")

**Question 5:** Now that we have defined hypotheses and a test statistic, we are ready to conduct a hypothesis test. We'll start by defining a function to simulate the test statistic under the null hypothesis, and then use that function 1000 times to understand the distribution under the null hypothesis.

Write a function to simulate the test statistic under the null hypothesis. 

The `simulate_traffic_null` function should simulate the null hypothesis once (not 1000 times) and return the value of the test statistic for that simulated sample.

**HINT:** This is similar to your Death Penalty lab! We are conducting an A/B test!
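As a rough illustration of what "simulating under the null" means in an A/B test, the sketch below shuffles group labels once on made-up data and recomputes a difference of group means. It assumes plain Python instead of the `datascience` library, and every name in it (`toy_labels`, `toy_values`, `difference_of_means`, `simulate_one_shuffle`) is hypothetical. The direction of the subtraction is also an assumption of this toy example.

```python
import random
from statistics import mean

# Hypothetical toy data; these are NOT the traffic data.
toy_labels = ["no", "no", "no", "yes", "yes", "yes"]
toy_values = [25, 30, 28, 18, 22, 20]

def difference_of_means(labels, values):
    """Mean of the 'no' group minus mean of the 'yes' group (toy convention)."""
    no_values = [v for l, v in zip(labels, values) if l == "no"]
    yes_values = [v for l, v in zip(labels, values) if l == "yes"]
    return mean(no_values) - mean(yes_values)

def simulate_one_shuffle(labels, values, rng):
    """Shuffle the labels once (values stay in place) and recompute the statistic."""
    shuffled = list(labels)   # copy so the original labels are untouched
    rng.shuffle(shuffled)     # random reassignment keeps the group sizes fixed
    return difference_of_means(shuffled, values)

rng = random.Random(0)        # seeded only so the sketch is repeatable
toy_observed = difference_of_means(toy_labels, toy_values)
toy_simulated = simulate_one_shuffle(toy_labels, toy_values, rng)
print(toy_observed, toy_simulated)
```

If the labels truly had no effect, any assignment of labels to values would be equally likely, which is why one shuffle yields one draw of the statistic under the null.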

<!--
BEGIN QUESTION
name: q1_5
-->

In [15]:
def simulate_traffic_null():
    ...
    
# Run your function once to make sure that it works.
simulate_traffic_null()

In [None]:
grader.check("q1_5")

<!-- BEGIN QUESTION -->

**Question 6:** Fill in the blanks below to complete the simulation for the hypothesis test. Your simulation should compute 1000 values of the test statistic under the null hypothesis and store the result in the array `accidents_simulated_stats`.

*Hint*: You should use the function you wrote above in Question 5.

*Note*: This cell should take at most a couple of minutes to run. We encourage you to check that your `simulate_traffic_null()` code works correctly before running this cell.
<!--
BEGIN QUESTION
name: q1_6
points: 1
manual: true
-->

In [17]:
accidents_simulated_stats = make_array()

for i in np.arange(1000):
    ...

<!-- END QUESTION -->



The following cell plots a histogram of the simulated test statistics, along with a point for the observed test statistic. Make sure to run it, as it will be graded.

In [18]:
# RUN THIS CELL FOR PLOT

Table().with_column('Simulated statistics', accidents_simulated_stats).hist()
plots.scatter(accidents_observed_statistic, 0, color='red', s=100)
plots.axvline(x=accidents_observed_statistic, color = 'red');

**Question 7:** Compute the p-value for this hypothesis test, and assign it to the name `accidents_p_value`.
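As a reminder of what an empirical p-value measures, here is a small sketch on made-up numbers. It assumes that larger values of the statistic favor the alternative; which direction counts as "extreme" depends on how your own statistic is defined. The names `toy_simulated_stats`, `toy_observed_stat`, and `empirical_p_value` are hypothetical.

```python
# Hypothetical numbers for illustration only.
toy_simulated_stats = [-2.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, -0.5, 2.5]
toy_observed_stat = 2.0

def empirical_p_value(simulated, observed):
    """Fraction of simulated statistics at least as large as the observed one."""
    at_least_as_extreme = sum(1 for s in simulated if s >= observed)
    return at_least_as_extreme / len(simulated)

toy_p = empirical_p_value(toy_simulated_stats, toy_observed_stat)
print(toy_p)
```

The result is a proportion between 0 and 1: the share of null simulations that look at least as extreme as what was observed.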

<!--
BEGIN QUESTION
name: q1_7
-->

In [19]:
accidents_p_value = ...
accidents_p_value

In [None]:
grader.check("q1_7")

<!-- BEGIN QUESTION -->

**Question 8:** Using the p-value above, what can we conclude about the effect of implementing a speed limit on the number of accidents? Use a p-value cutoff of 0.05.

**What does our p-value mean in this experiment?**

<!--
BEGIN QUESTION
name: q1_8
points: 1
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Part 2: DATA SCIENCE COMPENSATION

[Kaggle](https://www.kaggle.com/) is a community for anyone interested in data science, with members ranging from complete beginners to experts in the field. It also hosts datasets on almost any topic imaginable for anyone to analyze!

One such dataset contains salary records from many top companies ([source](https://www.kaggle.com/jackogozaly/data-science-and-stem-salaries?select=Levels_Fyi_Salary_Data.csv)).

This dataset is quite large, but we have cleaned and refined it to only have data relevant for our experiment. The columns are as follows:

1. **company** - The company an individual works at, either Facebook or Microsoft
2. **title** - Their job title (all are data scientists)
3. **total_compensation** - Their base pay
4. **years_of_experience** - How many years of total work experience the individual has at their respective title.
5. **years_at_company** - How long the individual has been at their respective company.

In [21]:
# load csv

ds = Table.read_table('data/ds_data.csv')
ds.show(5)

**Question 1.** Companies like Facebook and Microsoft are titans in the tech industries, but do they pay their data scientists the same?

Create a table named `compensation`, with two columns and two rows. The two columns should be "company" and "total_compensation mean". There should be one row for Facebook and one row for Microsoft, and each row should give the average total compensation of data scientists at that company.

<!--
BEGIN QUESTION
name: q2_1
-->

In [22]:
compensation = ...
compensation

In [None]:
grader.check("q2_1")

<!-- BEGIN QUESTION -->

**Question 2.**  Then, create a bar chart to visualize the difference between these two groups.
<!--
BEGIN QUESTION
name: q2_2
points: 1
manual: true
-->

In [27]:
# Create bar chart here

...

<!-- END QUESTION -->

**Question 3:**

Through our visualization, we can see that Facebook on average pays more than Microsoft. Thus, we define our hypotheses as follows:

**Test Statistic:** The difference in average salary between data scientists at Facebook and at Microsoft.

**Null Hypothesis:** Facebook and Microsoft pay their data scientists the same. Any deviation is due to random chance.

**Alternative Hypothesis:** Facebook pays its data scientists more than Microsoft does.

In the cell below, write code to compute the difference in average salary between data scientists at Facebook and at Microsoft.

<!--
BEGIN QUESTION
name: q2_3
-->

In [28]:
comp_diff = ...
comp_diff

In [None]:
grader.check("q2_3")

<!-- BEGIN QUESTION -->

**Question 4:** In the code cell below, do any sort of exploratory analysis (like visualizing total compensation versus years of experience), then write a comment on what you observe or infer from your code. For example, your exploratory code may lead you to find that factors other than the company a data scientist works at help determine total compensation.

You will get full credit as long as there is some exploratory coding and thoughtful analysis.

<!--
BEGIN QUESTION
name: q2_4
points: 2
manual: true
-->

In [32]:
# CODING HERE

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 5:** Now, create a function `compensation_diff` which takes in a table like `ds` and computes our test statistic, which is the difference in average salary between data scientists at Facebook and at Microsoft.

<!--
BEGIN QUESTION
name: q2_5
-->

In [33]:
def compensation_diff(tbl):
    """Returns the absolute difference in average salary between the two groups (companies)"""
    ...
    
compensation_diff(ds)

In [None]:
grader.check("q2_5")

**Question 6:** Fill in the function `one_bootstrap_cd` so that it generates one bootstrap sample and computes the difference in compensation between Facebook and Microsoft. Then assign `bootstrap_cds` to an array of 1000 values of our test statistic, each computed from a new bootstrap sample.
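For intuition about a single bootstrap step, the sketch below resamples made-up rows with replacement and recomputes a difference of group means. It is written in plain Python rather than with the `datascience` `Table` methods, and the names `toy_rows`, `mean_difference`, and `one_bootstrap_statistic` are hypothetical. Because the toy sample is tiny, the sketch redraws whenever a resample happens to miss a group entirely; that guard is a convenience of this example, not part of the standard procedure.

```python
import random
from statistics import mean

# Hypothetical toy rows (group, pay); these are NOT the salary data.
toy_rows = [("A", 200), ("A", 220), ("A", 210),
            ("B", 180), ("B", 190), ("B", 170)]

def mean_difference(rows):
    """Mean pay of group A minus mean pay of group B (toy convention)."""
    a = [pay for group, pay in rows if group == "A"]
    b = [pay for group, pay in rows if group == "B"]
    return mean(a) - mean(b)

def one_bootstrap_statistic(rows, rng):
    """Draw len(rows) rows with replacement and recompute the statistic."""
    while True:
        resample = rng.choices(rows, k=len(rows))
        groups = {group for group, _ in resample}
        if groups == {"A", "B"}:   # toy-only guard: both groups must appear
            return mean_difference(resample)

rng = random.Random(42)            # seeded only so the sketch is repeatable
toy_bootstrap_stats = [one_bootstrap_statistic(toy_rows, rng) for _ in range(200)]
print(len(toy_bootstrap_stats))
```

Repeating the step many times gives an empirical distribution of the statistic, which is what the confidence interval in the next question is built from.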


<!--
BEGIN QUESTION
name: q2_6
-->

In [37]:
def one_bootstrap_cd():
    return ...

bootstrap_cds = ...
for i in np.arange(1000):
    new_bootstrap_cd = ...
    bootstrap_cds = ...


In [None]:
grader.check("q2_6")

**Question 7:** Use these bootstrapped values to compute a 99% confidence interval, storing the left endpoint as `ci_left` and the right endpoint as `ci_right`.
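A percentile-based interval keeps the middle portion of the bootstrapped values and cuts an equal share off each tail. The sketch below shows that arithmetic on made-up values using a simple nearest-rank percentile. The helper names `nearest_rank_percentile` and `middle_interval` are hypothetical, and the `datascience` library's own `percentile` function may handle ties and edge cases differently.

```python
import math

def nearest_rank_percentile(p, values):
    """Smallest value that is at least as large as p% of the values."""
    ordered = sorted(values)
    rank = max(math.ceil(p * len(ordered) / 100), 1)  # 1-based rank
    return ordered[rank - 1]

def middle_interval(level, values):
    """Endpoints of the middle `level` percent of the values."""
    tail = (100 - level) / 2          # percent cut from each side
    left = nearest_rank_percentile(tail, values)
    right = nearest_rank_percentile(100 - tail, values)
    return left, right

# Hypothetical stand-in for 1000 bootstrapped statistics.
toy_values = list(range(1000, 0, -1))
toy_left, toy_right = middle_interval(99, toy_values)
print(toy_left, toy_right)
```

For a 99% interval the tail is 0.5% on each side, so the endpoints are the 0.5th and 99.5th percentiles of the bootstrapped values.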

<!--
BEGIN QUESTION
name: q2_7
-->

In [40]:
ci_left = ...
ci_right = ...

print("Middle 99% of bootstrapped compensation difference: [{:f}, {:f}]".format(ci_left, ci_right))

In [None]:
grader.check("q2_7")

Run the cell below to generate a histogram of the compensation difference between Facebook and Microsoft alongside our confidence interval!

In [44]:
Table().with_column("Compensation Difference (Facebook - Microsoft)", bootstrap_cds).hist()
plots.plot([ci_left, ci_right], [.0000005,.0000005], color="gold");

<!-- BEGIN QUESTION -->

**Question 8:** Based on the histogram above, do we reject the null hypothesis? What led you to your decision?

<!--
BEGIN QUESTION
name: q2_8
points: 1
manual: true
-->


_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 9:** We've conducted the same experiment but for Amazon vs Microsoft and generated this histogram. Based on this histogram, should we reject the null hypothesis?
Assume the null hypothesis is that there is no difference between Amazon data scientist salaries versus Microsoft.

<!--
BEGIN QUESTION
name: q2_9
-->

1. Reject the null
2. Fail to reject the null
3. Not enough info


<img src="data/amazonmicrosoft.png" width=500>


In [45]:
result = ...

In [None]:
grader.check("q2_9")

## End of Checkpoint
#### Congratulations, you have reached the checkpoint! 

Follow the steps in the Submit your Work section (the last section) of this notebook and submit your Midterm Project Checkpoint to Gradescope.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()