# Project 4

In addition to answering the bolded questions on Coursera, also attach your notebook, both as `.ipynb` and `.html`.

This project should be answered using the `Weekly` data set (attached). This data contains 1,089
weekly stock market percentage returns for 21 years, from the beginning of 1990 to the end of 2010.

Details about the columns in the data are summarized below:

- `Year` : The year that the observation was recorded
- `Lag1` : Percentage return for previous week
- `Lag2` : Percentage return for 2 weeks previous
- `Lag3` : Percentage return for 3 weeks previous
- `Lag4` : Percentage return for 4 weeks previous
- `Lag5` : Percentage return for 5 weeks previous
- `Volume` : Volume of shares traded (average number of daily shares traded in billions)
- `Today` : Percentage return for this week
- `Direction` : A factor with levels `Down` and `Up` indicating whether the market had a negative or positive return in a given week

In this assignment, we will be using PennGrader, a Python package built by a former TA for autograding Python notebooks. PennGrader was developed to provide students with instant feedback on their answer. You can submit your answer and know whether it's right or wrong instantly. We then record your most recent answer in our backend database. You will have 100 attempts per test case, which should be more than sufficient.

<b>NOTE: Please remember to remove the </b>

```python
raise NotImplementedError
```
<b>after your implementation, otherwise the cell will not run.</b>
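
As a minimal sketch of what this note means, a graded cell typically ships with a placeholder that raises an exception until you replace it. The function names below are hypothetical and used only for illustration; they are not part of this notebook.

```python
# Hypothetical placeholder cell, as it might look before you fill it in.
def compute_answer_unfinished():
    raise NotImplementedError  # built-in exception; note the capital N

# The same cell after implementation: the raise line has been removed.
def compute_answer():
    return 1 + 1

# Calling the unfinished version stops execution with an error.
try:
    compute_answer_unfinished()
    stopped = False
except NotImplementedError:
    stopped = True

print(stopped)           # True
print(compute_answer())  # 2
```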

## Getting Set Up
Please run the cells below to get set up with the autograder. If you need to install packages, copy these lines into the Terminal.

In [1]:
# !pip install pandas==1.0.5 --user
# pip install penngrader --user

In [2]:
# pip install seaborn --user
# pip install scikit-learn --user
# pip install statsmodels --user

Let's try PennGrader out! Fill in the cell below with your PennID and then run the following cell to initialize the grader.

<font color='red'>Warning:</font> Please make sure you only have one copy of the student notebook in your directory in Codio upon submission. The autograder looks for the variable `STUDENT_ID` across all notebooks, so if there is a duplicate notebook, it will fail.

In [3]:
# PLEASE ENSURE YOUR STUDENT_ID IS ENTERED AS AN INT (NOT A STRING). OTHERWISE, THE AUTOGRADER
# WON'T KNOW WHO TO ASSIGN POINTS TO IN OUR BACKEND.

STUDENT_ID = 56803282                   # YOUR 8-DIGIT PENNID GOES HERE
STUDENT_NAME = "Jacky Choi"     # YOUR FULL NAME GOES HERE

In [4]:
import penngrader.grader

grader = penngrader.grader.PennGrader(homework_id = 'ESE542_Online_Su_2021_HW4', student_id = STUDENT_ID)

In [5]:
# Let's import the relevant Python packages here
# Feel free to import any other packages for this project

# Data Wrangling
import pandas as pd
import numpy as np

# Statistics
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Plotting
import matplotlib.pyplot as plt

%matplotlib inline

We're also going to run a quick (0-point) check that the pandas version set up here is correct. If you fail this, please open a Terminal window and run `pip install pandas==1.0.5 --user`. If the update does not take effect immediately, you can hit Kernel --> Restart so that the Codio virtual machine restarts the notebook. Keep in mind that Codio runs on external machines, not your local resources.

In [6]:
# Shell commands need a leading "!" inside a notebook cell; uncomment the next line only if the check fails.
# !pip install pandas==1.0.5 --user
grader.grade(test_case_id = 'A0_pandas_test', answer = str(pd.__version__))

In [None]:
print(pd.__version__) 

## Part A

We are first interested in trying to predict the direction of the returns.

To start, load `Weekly.csv` into your notebook.

In [None]:
weekly = pd.read_csv('Weekly.csv')
weekly.Direction.dtype

In [None]:
grader.grade(test_case_id = 'A0_weekly_test', answer = weekly)

### A1.

First, transform our `Direction` variable into a numerical feature that is equal to 1 if `Direction = Up`. Then, pass the dataframe into the test case to make sure it's working properly!

In [None]:
weekly['Direction'] = weekly['Direction'].apply(lambda x: 1 if x == 'Up' else 0)
weekly

In [None]:
grader.grade(test_case_id = 'A1_direction_test', answer = weekly)

Produce some numerical and graphical summaries of the `Weekly` data. Do there appear to be any patterns?

In [None]:
yearly = weekly[['Year', 'Volume']]
yearly = yearly.groupby('Year')
yearlydf = yearly.agg({'Volume': 'sum'}).reset_index()
yearlydf


In [None]:
x = np.array(yearlydf['Year'])
y = np.array(yearlydf['Volume'])
plt.plot(x,y)

Include a brief description of what relationships and correlations you find.

In [None]:
axes = pd.plotting.scatter_matrix(weekly)

In [None]:
weekly.corr()

In [None]:
relationships = '''
As the years move into the 2000s, trading volume increases dramatically.
'''



In [None]:
grader.grade(test_case_id = 'A1_relationships_test', answer = relationships)

### A2.

Use the full data set to perform a logistic regression with `Direction` as the response and the
five lag variables as predictors.

In [None]:
# Logistic regression on the five lag variables only, as the prompt specifies
log1 = smf.glm('Direction~Lag1+Lag2+Lag3+Lag4+Lag5', data = weekly, family = sm.families.Binomial()).fit()
print(log1.summary())

Pass in the regression equation to `logit_equation` below. Hint: You do not need the coefficients of the equation yet, just which variables you want to include in the model. Your answer should look something like `Response~Var1+Var2` which is the input for `statsmodels.formula.api`. 
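
One way to avoid typos in the formula string is to build it from a list of column names. The sketch below assumes the predictor names match the column list given earlier; `response`, `predictors`, and `formula` are illustrative variable names, not names the grader expects.

```python
# Build a statsmodels-style formula string of the form 'Response~Var1+Var2'.
response = 'Direction'
predictors = [f'Lag{k}' for k in range(1, 6)]  # Lag1 ... Lag5

# '~' separates the response from the predictors; '+' joins the predictors.
formula = response + '~' + '+'.join(predictors)
print(formula)  # Direction~Lag1+Lag2+Lag3+Lag4+Lag5
```

The key point is that the response sits to the left of a single `~`, with every predictor joined by `+` on the right.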

In [None]:
logit_equation = 'Direction~Lag1+Lag2+Lag3+Lag4+Lag5'


In [None]:
grader.grade(test_case_id = 'A2_logit_test', answer = logit_equation)

### A3.

Use the `summary()` function to print the results. Do any of the predictors appear to be
statistically significant? Which predictors appear to be statistically significant?

In [None]:
print(log1.summary())

Type the number of apparently significant variables into `num_significant` and the names of the variables into the list `var_significant` -- the test case will only give points if both variables are correct!
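
A common convention, assumed here, is to call a predictor significant at the 5% level when its two-sided p-value is below 0.05, which corresponds to roughly |z| > 1.96. The sketch below converts a z statistic into a two-sided normal p-value using only the standard library. `two_sided_p` is a hypothetical helper, and the z values are made-up inputs rather than results from this data.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    # P(|Z| > |z|) = erfc(|z| / sqrt(2)) for Z ~ N(0, 1)
    return math.erfc(abs(z) / math.sqrt(2))

# Made-up z statistics, for illustration only.
for z in (0.5, 1.96, 2.5):
    p = two_sided_p(z)
    print(f'z = {z:4.2f}  p = {p:.4f}  significant at 5%: {p < 0.05}')
```

With a fitted statsmodels result, the same quantities are available directly through its `pvalues` attribute, so you do not need to compute them by hand.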

In [None]:
num_significant = 1
var_significant = ['Lag2'] # This should be a list!
# Lag2 is the only significant predictor: its z statistic has absolute value >= 2

In [None]:
grader.grade(test_case_id = 'A3_significant_test', answer = (num_significant, var_significant))

### A4. 
Compute the overall fraction of correct predictions. Name this variable `fraction_correct_all`.
What is the overall fraction of correct predictions?
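
The fraction of correct predictions is the share of observations where the thresholded probability matches the true label. Here is a minimal sketch on made-up probabilities and labels (not the `Weekly` data), assuming a 0.5 cutoff; the variable names are illustrative.

```python
import pandas as pd

# Made-up fitted probabilities and true 0/1 labels, for illustration only.
probs = pd.Series([0.62, 0.48, 0.55, 0.30, 0.71])
actual = pd.Series([1, 0, 0, 0, 1])

# Classify as 1 (Up) when the probability exceeds 0.5, else 0 (Down).
predicted = (probs > 0.5).astype(int)

# The mean of the elementwise matches is the fraction classified correctly.
accuracy = (predicted == actual).mean()
print(accuracy)

# A confusion matrix shows which kind of error is being made.
print(pd.crosstab(actual, predicted, rownames=['actual'], colnames=['predicted']))
```

Four of the five toy predictions match, so the accuracy here is 0.8.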

In [None]:
pred = log1.predict(weekly)
# Classify as Up (1) when the predicted probability exceeds 0.5, the usual cutoff for a binary outcome
predicted = [1 if prob > 0.5 else 0 for prob in pred]
fraction_correct_all = (predicted == weekly['Direction']).mean()

In [None]:
print(f'Overall fraction of correct predictions is {fraction_correct_all}')

In [None]:
grader.grade(test_case_id = 'A4_fraction_test', answer = fraction_correct_all)

### A5.

Now fit the logistic regression model using a training data period from 1990 to 2007, with
`Lag2` as the only predictor. 

Compute the overall fraction of correct predictions for the held
out data (that is, the data from 2008, 2009 and 2010) and assign it to a variable called
`fraction_correct_test`. What is the overall fraction of correct predictions?
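
Because the observations are ordered in time, the split is made by year rather than at random. Here is a sketch of the masking on a tiny made-up frame, assuming inclusive year bounds; `toy`, `train_toy`, and `test_toy` are hypothetical stand-ins for the real data frames.

```python
import pandas as pd

# Tiny made-up frame standing in for the real data.
toy = pd.DataFrame({
    'Year': [2006, 2007, 2007, 2008, 2009, 2010],
    'Lag2': [0.1, -0.3, 0.4, -1.2, 0.8, 0.2],
})

# Series.between is inclusive on both ends by default.
in_train = toy['Year'].between(1990, 2007)
train_toy = toy[in_train]
test_toy = toy[~in_train]   # rows outside the training window (here: 2008 onward)

print(len(train_toy), len(test_toy))  # 3 3
```

The model should be fit on the training rows only and evaluated on the held-out rows only, so that the reported accuracy reflects data the model has not seen.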

In [None]:
# Train and test split
train = weekly[(weekly['Year'] >= 1990) & (weekly['Year'] <= 2007)]
test = weekly[(weekly['Year'] >= 2008) & (weekly['Year'] <= 2010)]

# Model 
model = smf.glm('Direction~Lag2', data = train, family = sm.families.Binomial()).fit()

pred = model.predict(test)
predictions = [1 if prob > 0.5 else 0 for prob in pred]


In [None]:
fraction_correct_test = (predictions == test['Direction']).mean()

In [None]:
print(f'Overall fraction of correct predictions is {fraction_correct_test}')

Pass in the train and test datasets to make sure that they're working (feel free to rename the variables in the test case), and then run the test for `fraction_correct_test`!

In [None]:
grader.grade(test_case_id = 'A5_df_test', answer = (train, test)) 

In [None]:
grader.grade(test_case_id = 'A5_fraction_correct_test', answer = fraction_correct_test) 

## Part B

Now, we want to develop an investment strategy in which we buy if the returns are greater than
$0.5\%$ and sell otherwise.

### B1. 
Create a response variable called `Response` such that

$$
\text{Response}_i = \begin{cases}
1 & \text{if } \text{Today}_i > 0.5 \\
0 & \text{otherwise}
\end{cases}
$$
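
Here is a sketch of this indicator on made-up returns, assuming the result should be stored as integers 0/1 rather than booleans so that it can serve directly as a binary response. The names `today` and `response` are illustrative.

```python
import pandas as pd

# Made-up weekly returns in percent, for illustration only.
today = pd.Series([1.2, 0.5, -0.7, 0.51])

# Strict inequality: a return of exactly 0.5 maps to 0.
response = (today > 0.5).astype(int)
print(response.tolist())  # [1, 0, 0, 1]
```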

In [None]:
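# Cast to int so that Response is 0/1, as the definition requires, rather than True/False
weekly['Response'] = (weekly['Today'] > 0.5).astype(int)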

In [None]:
grader.grade(test_case_id = 'B1_response_test', answer = weekly)

### B2.
Fit a logistic regression model to predict `Response` using a training data period from 1990 to 2008, with the five
lag variables and volume as predictors.

In [None]:
train = weekly[(weekly['Year'] >= 1990) & (weekly['Year'] <=2008)]
res_model = smf.glm('Response~Lag1+Lag2+Lag3+Lag4+Lag5+Volume', data = train, family = sm.families.Binomial()).fit()

Pass in the regression equation to `logit_equation_B` below

In [None]:
logit_equation_B = 'Response~Lag1+Lag2+Lag3+Lag4+Lag5+Volume'


In [None]:
grader.grade(test_case_id = 'B2_logit_test', answer = logit_equation_B)

### B3.

Use the `summary()` function to print the results. Do any of the predictors appear to be
statistically significant? Which predictors appear to be statistically significant?

In [None]:
res_model.summary()

Type the number of apparently significant variables into `num_significant_B` and the names of the variables into the list `var_significant_B` -- the test case will only give points if both variables are correct!

In [None]:
num_significant_B = 1
var_significant_B = ['Lag1'] # This has to be a list!


In [None]:
grader.grade(test_case_id = 'B3_significant_test', answer = (num_significant_B, var_significant_B))

### B4. 

Compute the overall fraction of correct predictions for the held out data (that is, the data
from 2009 and 2010). Assign this value to the variable `fraction_correct`. What is the
overall fraction of correct predictions?

In [None]:
test = weekly[(weekly['Year'] >= 2009) & (weekly['Year'] <=2010)]
pred = res_model.predict(test)
prediction = [1 if prob > 0.5 else 0 for prob in pred]
fraction_correct = (prediction == test['Response']).mean()

In [None]:
print(f'Overall fraction of correct predictions is {fraction_correct}')

In [None]:
grader.grade(test_case_id = 'B4_fraction_test', answer = fraction_correct)

## Submit

You're done! Please make sure you've run all the PennGrader cells and counted up your score (there are 20 points in total), then submit this on Codio.