# Homework 5 - Statistics and Pandas
**Due: Nov 9** 

***Total Points: 100***

For full points, your code
- must run without errors
- must be *pythonic*
- must be easily understandable, and well documented (either through inline comments or markdown).
- All plots must have clearly and meaningfully labeled axes, unless you are simply plotting arbitrary numbers. Add legends where needed.
- **Use separate markdown cells for any text answers.**
- **Show your work, i.e., print all relevant output.** Remember that having a variable in the last line of a Jupyter code cell automatically displays its value. You can use multiple code cells for a single question.
- ***Please see the solutions (and related notes) for the previous HWs, as well as the comments on your submissions, before submitting this assignment. We will be less lenient for repeated mistakes.***

Remember to export your Jupyter notebook as a PDF file and upload both to Canvas.
```
File > Save and Export Notebook As... > PDF
```

Run every code block (and make sure the answer is fully visible) before submitting your notebook/PDF.

## Question 0

Import `matplotlib`, `numpy`, `pandas`, and `seaborn` here. You can import the required `scipy` and `statsmodels` modules when you need them.

Before we get to the questions, let's look at the **`namedtuple`** data type.

A `namedtuple` is similar to a `tuple`, but specific items in the `namedtuple` can be accessed as object attributes.

See example below:

```python
# Import namedtuple
from collections import namedtuple

# Create a namedtuple container for a "Line", which contains the slope and intercept
Line = namedtuple('Line', 'slope intercept')
Line
```

```python
# Create two instances of Line
line_1 = Line(0.7, 4.2)
line_2 = Line(1.3, 3.5)
print(line_1)
print(line_2)
```

```python
# Access the slope and intercept as attributes of each instance
print(f"Line 1: y = {line_1.slope}x + {line_1.intercept}")
print(f"Line 2: y = {line_2.slope}x + {line_2.intercept}")
```

## Question 1: Pandas

*25 points*

### Question 1.1

*10 points*

- Load the `"mpg"` dataset from `seaborn`.
- Print the `model_year` and `name` for all the vehicles that have an `mpg` of at least 42.

Hint: use `pandas.DataFrame.itertuples()` which returns an iterator of `namedtuple` objects.
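To illustrate the hint, here is a minimal sketch of how `itertuples()` behaves. The DataFrame, its column names (`city`, `temp_c`), and the threshold are hypothetical toy values, not the `"mpg"` dataset.

```python
import pandas as pd

# Hypothetical toy data (not the "mpg" dataset)
df = pd.DataFrame(
    {"city": ["Oslo", "Cairo", "Lima"], "temp_c": [4.5, 31.0, 19.2]}
)

# Each row comes back as a namedtuple, so columns are read as attributes.
# index=False drops the leading Index field from each tuple.
warm_cities = []
for row in df.itertuples(index=False):
    if row.temp_c >= 19:
        warm_cities.append(row.city)
        print(f"{row.city}: {row.temp_c} deg C")
```

Attribute access only works for column names that are valid Python identifiers.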

### Question 1.2

*5 points*

Print which vehicle had the lowest and highest mpg for each year. Also print the corresponding values of mpg. Include at least one blank line between model years.

Hint: use `pandas.DataFrame.groupby()`
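As a rough sketch of the pattern, iterating over a `groupby()` object yields one `(key, sub-DataFrame)` pair per group. The toy data and column names below (`team`, `player`, `score`) are hypothetical, and `idxmin()`/`idxmax()` is only one of several ways to find the extreme rows.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "player": ["Ann", "Bob", "Cy", "Di", "Ed"],
    "score": [12, 30, 7, 22, 15],
})

extremes = {}
for team, group in df.groupby("team"):
    # idxmin/idxmax return the index label of the extreme row in this group
    lowest = group.loc[group["score"].idxmin()]
    highest = group.loc[group["score"].idxmax()]
    extremes[team] = (lowest["player"], highest["player"])
    print(f"Team {team}")
    print(f"  lowest:  {lowest['player']} ({lowest['score']})")
    print(f"  highest: {highest['player']} ({highest['score']})")
    print()  # blank line between groups
```

Note that `idxmin()` and `idxmax()` return only the first matching row when there are ties.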

### Question 1.3
*10 points*

On a single figure, plot the mean mpg by year for vehicles originating from Europe, Japan and USA (separately). *Use different colored lines for each origin.* Make sure you include a legend.

Hint: use `pandas.DataFrame.groupby()` with `pandas.DataFrame.agg()` which aggregates data using one or more operations over the specified axis.
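The sketch below shows one possible way to combine `groupby()` and `agg()`, using hypothetical toy data with made-up column names (`region`, `year`, `sales`). Reshaping with `unstack()` is an optional extra step assumed here for convenience, not something the hint requires.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "north", "south"],
    "year": [2020, 2020, 2020, 2021, 2021, 2021],
    "sales": [10.0, 14.0, 8.0, 9.0, 20.0, 11.0],
})

# Named aggregation: output column name = (input column, operation)
mean_sales = df.groupby(["region", "year"]).agg(mean_sales=("sales", "mean"))

# Move "region" from the row index into columns: one column per region
wide = mean_sales["mean_sales"].unstack("region")
print(wide)
```

With one column per group, calling `wide.plot()` should draw one line per column with an automatic legend; axis labels still need to be set explicitly.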

## Question 2: Hypothesis testing
*10 points*

For the data above, use `scipy.stats.mannwhitneyu` to test whether
- the `mpg` of vehicles with more cylinders ($>4$) is *statistically different* from that of vehicles with fewer cylinders ($\leq4$).
- the `acceleration` of vehicles with more cylinders ($>4$) is *statistically greater* than that of vehicles with fewer cylinders ($\leq4$).

Clearly state the null and alternative hypotheses, as well as the inference, for each case.
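The difference between the two kinds of test comes down to the `alternative` argument. Here is a minimal sketch on synthetic samples (the sample names, distributions, and the significance level `alpha = 0.05` are assumptions made for illustration):

```python
import numpy as np
from scipy import stats

# Synthetic samples (hypothetical): sample_b is shifted upward relative to sample_a
rng = np.random.default_rng(seed=0)
sample_a = rng.normal(loc=10.0, scale=1.0, size=200)
sample_b = rng.normal(loc=12.0, scale=1.0, size=200)

# H0: the two distributions are equal; H1: they differ
two_sided = stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided")

# H0: as above; H1: the first argument is stochastically greater than the second
one_sided = stats.mannwhitneyu(sample_b, sample_a, alternative="greater")

alpha = 0.05  # assumed significance level
print(f"two-sided p = {two_sided.pvalue:.2e}, reject H0: {two_sided.pvalue < alpha}")
print(f"greater   p = {one_sided.pvalue:.2e}, reject H0: {one_sided.pvalue < alpha}")
```

For a one-sided test the order of the two samples matters, because `alternative="greater"` refers to the first argument.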

## Question 3: Multiple linear regression
*15 points*

Using `statsmodels.formula.api.ols()`, determine how the acceleration of a vehicle in the dataset depends on its displacement and origin.

- Do you need to include interactions?
- Don't just fit a curve, plot the data and comment on the regression results. What do you learn from it?

Hint: Use `pandas.DataFrame.dropna()` to remove rows that contain missing values.
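As a hedged illustration of the formula syntax, the sketch below fits a continuous predictor together with a categorical one on synthetic data. The column names (`x`, `group`, `y`) and all numerical values are hypothetical. The data is generated with different slopes per group, so an interaction is present by construction; real data may or may not need one.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): group "b" has a steeper slope than group "a"
rng = np.random.default_rng(seed=1)
n = 150
df = pd.DataFrame({
    "x": rng.uniform(0, 10, size=n),
    "group": rng.choice(["a", "b"], size=n),
})
slope = np.where(df["group"] == "b", 3.0, 1.0)
df["y"] = 2.0 + slope * df["x"] + rng.normal(0, 0.5, size=n)
df = df.dropna()  # no-op here; the toy data has no missing values

# "+" fits separate intercepts only; "*" also adds the x:group interaction
additive = smf.ols("y ~ x + C(group)", data=df).fit()
interacting = smf.ols("y ~ x * C(group)", data=df).fit()

print(interacting.params)
print(f"AIC additive = {additive.aic:.1f}, interacting = {interacting.aic:.1f}")
```

Comparing the two fits (for example by AIC, or by the p-value of the interaction term in `interacting.summary()`) is one way to judge whether the interaction is worth keeping.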

## Question 4: Higher order polynomial regression
*10 points*

- Load the data in `hopr.csv` using `pandas` 
- Perform polynomial regression to get the best fit.
- Print the parameters/coefficients for the final best fit polynomial.

*Take care not to overfit the data.* Use the lowest polynomial degree that sufficiently represents the underlying data.
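One possible way to guard against overfitting is to compare fits of increasing degree with a score that penalizes extra parameters. The sketch below uses adjusted $R^2$ on synthetic data generated from a known cubic. The data, the helper `adjusted_r2`, and the choice of criterion are all assumptions for illustration, not the required method.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic data (hypothetical): a noisy cubic
rng = np.random.default_rng(seed=2)
x = np.linspace(-3, 3, 60)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(0, 0.3, size=x.size)

def adjusted_r2(y_true, y_pred, n_params):
    """R^2 penalized by the number of fitted parameters (including the constant)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    n = y_true.size
    return 1 - (ss_res / (n - n_params)) / (ss_tot / (n - 1))

scores = {}
for degree in range(1, 7):
    fit = Polynomial.fit(x, y, deg=degree)
    scores[degree] = adjusted_r2(y, fit(x), n_params=degree + 1)
    print(f"degree {degree}: adjusted R^2 = {scores[degree]:.4f}")

# convert() maps the fit back to the unscaled x domain;
# coefficients are ordered from the constant term upward
cubic = Polynomial.fit(x, y, deg=3).convert()
print(cubic.coef)
```

In this toy case the score jumps at degree 3 and then stays essentially flat, which suggests that higher degrees add parameters without adding explanatory power.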

## Question 5: Regression and Curve fitting
*15 points*

In HW3, we looked at data from a compression test on a brittle material. Here, we are going to look at a test on a ductile strain-hardening material. The material deforms as per Hooke's law up to its yield point, after which it follows a power-law hardening behavior.

The stress in the material is given by 

Elastic: $\sigma = E \epsilon_e \quad\quad\quad\quad\quad \sigma \leq \sigma_y$

Plastic: $\sigma = \sigma_y + \kappa \epsilon_p^n \quad\quad\quad \sigma > \sigma_y$

where:

$\sigma$ is the stress and $\sigma_y$ is the yield stress;

$\epsilon_e$ and $\epsilon_p$ are the elastic and plastic strains, respectively;

$\kappa$ is the strength coefficient, and $n$ is the strain hardening exponent.

The compression test is performed on a rectangular prismatic specimen (cross-section = 4 mm × 4 mm, length = 6 mm).

- Load the data from the provided `strain_hardening.csv` file using `pandas`. The data has already been cleaned for you. Displacement is in mm, force is in N.
- Perform a linear regression using `scipy.stats.linregress()` in the elastic regime of the data to calculate $E$ and $\sigma_y$.
- Perform a curve fit using `scipy.optimize.curve_fit()` in the plastic regime of the data to calculate $\kappa$ and $n$
- Print the values of all the variables calculated above using *scientific notation* with 2 decimal places. Use units of MPa for all parameters except $n$, which is dimensionless.
- Plot the raw stress (in MPa) vs strain curve, as well as the overall curve fit (elastic and plastic parts). The entire fit curve should be the same color. Mark the yield point. Make sure to include a legend.

**Note:** $\epsilon_p$ is the plastic strain; you must subtract the elastic strain ($\epsilon_e$) from the total strain to get the plastic strain.

Assume a 0.2% strain at the yield point; i.e., $\epsilon_y$ = 0.002. There is no plastic strain before yield and no elastic strain after yield.
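The two fitting calls can be sketched on synthetic stress-strain data as follows. All parameter values (`E_TRUE`, `KAPPA_TRUE`, `N_TRUE`), the noise level, and the helper `hardening` are hypothetical. The sketch starts from stress and strain directly rather than from force and displacement, and it assumes $\epsilon_p = \epsilon - \epsilon_y$ after yield.

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical parameters, used only to generate synthetic data
E_TRUE, KAPPA_TRUE, N_TRUE = 5.0e4, 5.0e2, 0.4  # MPa, MPa, dimensionless
EPS_Y = 0.002                                    # yield strain stated in the question
SIGMA_Y_TRUE = E_TRUE * EPS_Y

rng = np.random.default_rng(seed=3)
strain = np.linspace(0.0, 0.05, 200)
plastic_strain = np.clip(strain - EPS_Y, 0.0, None)
stress = np.where(
    strain <= EPS_Y,
    E_TRUE * strain,
    SIGMA_Y_TRUE + KAPPA_TRUE * plastic_strain**N_TRUE,
)
stress = stress + rng.normal(0, 0.5, size=strain.size)

# Elastic regime: the slope of the regression line estimates E
elastic = strain <= EPS_Y
reg = stats.linregress(strain[elastic], stress[elastic])
E_fit = reg.slope
sigma_y_fit = reg.intercept + reg.slope * EPS_Y  # fitted line evaluated at yield

# Plastic regime: fit kappa and n, with sigma_y fixed from the step above
def hardening(eps_p, kappa, n):
    return sigma_y_fit + kappa * eps_p**n

(kappa_fit, n_fit), _ = optimize.curve_fit(
    hardening, strain[~elastic] - EPS_Y, stress[~elastic], p0=(100.0, 0.5)
)

# ".2e" gives scientific notation with 2 decimal places
print(f"E = {E_fit:.2e} MPa, sigma_y = {sigma_y_fit:.2e} MPa")
print(f"kappa = {kappa_fit:.2e} MPa, n = {n_fit:.2e}")
```

When converting the real data, note that force in N divided by area in mm² already gives stress in MPa, and displacement divided by the original length gives a dimensionless strain.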

## Question 6: Optimization
*10 points*

In HW3, we obtained the launch angle for the maximum range for a simple projectile motion problem.

Solve the same problem using `scipy.optimize.minimize_scalar()`. That is, maximize the range of the projectile within the bounds of $0^{\circ} \leq \theta \leq 90^{\circ}$. Print the answer in degrees.

Use `scipy.constants` for the value of $g$.

Since we already know that the velocity ($v$) is immaterial, assume $v=1$ for simplicity.

Ignore aerodynamic drag.
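Because `minimize_scalar()` only minimizes, a maximum is usually found by minimizing the negative of the objective. The sketch below shows that pattern on a hypothetical toy function (`profit`, with a known maximum at 5), not on the projectile problem itself.

```python
from scipy import optimize

def profit(price):
    """Hypothetical objective with a single maximum at price = 5."""
    return price * (10.0 - price)

# Maximize profit by minimizing its negative within the given bounds
result = optimize.minimize_scalar(
    lambda p: -profit(p), bounds=(0.0, 10.0), method="bounded"
)
best_price = result.x
print(f"maximum at x = {best_price:.3f}, value = {profit(best_price):.3f}")
```

If the objective takes an angle in radians, `numpy.degrees()` can convert `result.x` for printing.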

## Question 7: Monte/y-Carlo-Hall Problem.
*15 points*

Before starting this question, familiarize yourself with the [Monty Hall problem](https://betterexplained.com/articles/understanding-the-monty-hall-problem/).

We are going to simulate this problem using the Monte-Carlo method.

Perform the following operations a large number of times. You can use a loop.
- Pick the locations for the car and the choices for the contestant's initial guess.
- Simulate the case where the contestant does not switch their guess after the goat is revealed.
- Simulate the case where the contestant switches their guess after the reveal.

Calculate the probability of success (contestant wins the car) for each case, and print both probabilities in %, up to 2 decimal places.

By the Law of Large Numbers, your estimates should approach the actual probabilities if you use a large enough number of simulations. I suggest at least 1,000,000.

However, you should probably start with a small number like 10 while you get the code ready.
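As a minimal sketch of the Monte-Carlo idea on a different, simpler problem, the code below estimates the probability that two dice sum to 7 and prints it as a percentage with 2 decimal places. The dice problem is a hypothetical stand-in chosen because its exact answer (1/6) is known; the vectorized NumPy approach is one option, and a plain loop works as well.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
n_trials = 1_000_000

# integers(low, high) excludes high, so this draws from 1..6
die_1 = rng.integers(1, 7, size=n_trials)
die_2 = rng.integers(1, 7, size=n_trials)

# The mean of a boolean array is the fraction of successful trials
p_seven = np.mean(die_1 + die_2 == 7)
print(f"P(sum = 7) ~ {100 * p_seven:.2f}% (exact: {100 / 6:.2f}%)")
```

Reducing `n_trials` to a small number such as 10 shows how noisy the estimate is before the Law of Large Numbers takes hold.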