## Assignment 3

This assignment is based on content discussed in module 3 and tests basic concepts of probability theory and normalization in statistics.

## Learning outcomes

-   Work on problems involving different distributions, e.g., binomial and Gaussian
-   Calculate a z-score
-   Make statistical inferences from given data
-   Construct a null and an alternative hypothesis
-   Find the p-value for a given hypothesis and t-test statistic


**Question 1**

The Capital Asset Pricing Model (CAPM) is a financial model that assumes returns on a portfolio are normally distributed. Suppose a portfolio has an average annual return of 14.7% (i.e., an average gain of 14.7%) with a standard deviation of 33%. A return of 0% means the value of the portfolio doesn't change, a negative return means that the portfolio loses money, and a positive return means that the portfolio gains money. Determine the following:

1. What percentage of years does this portfolio lose money, (i.e. have a return less than 0%)?
2. What is the cutoff for the highest 15% of annual returns with this portfolio?

See CAPM here https://en.wikipedia.org/wiki/Capital_asset_pricing_model 

**Question 2**

Past experience indicates that because of low morale, a company loses 20 hours a year per employee due to lateness and absenteeism. Assume that the population is normally distributed with a standard deviation of 6 hours.

The HR department implemented a new rewards system to increase employee morale. After a few months it collected a random sample of 20 employees, whose annualized absenteeism averaged 14 hours.

1. Could you confirm that the new rewards system was effective with 90% confidence?
2. An HR subject matter expert would be very happy if the program could reduce absenteeism by 20% (i.e. to 16 hours).  Given the current sampling parameters, what is the probability that the new rewards system reduced absenteeism to 16 hours and you miss it?
3. Repeat parts 1) and 2) with 95% confidence (α = 0.05).
4. Based on the answers in 3), is the sampling method good enough to identify a reduction from 20 to 16 hours if I use a confidence of 95%?
5. What should the sample size be if you want β to be 5%?

**Question 3**

Chi-Square Goodness of fit

Please access and review **section 6.3.5** in the OpenIntro Statistics textbook:

Diez, D., Barr, C. & Çetinkaya-Rundel, M. (2017). OpenIntro Statistics (3rd Ed.). https://www.openintro.org/stat/textbook.php?stat_book=os

Given the information in section 6.3.5, write Python code for the following:

 - Calculate the expected values based on the geometric distribution with a probability of 53.2%
 - Compare the expected vs. the observed values from the textbook using the Chi-Square distribution
 - Reach a conclusion
 - Explain the business impact of your conclusion

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
import math
from scipy.stats import chisquare

In [2]:
# Answer to Question 1.1 - This portfolio loses money 32.8% of the time based on annualized returns
mu = 14.7
sigma = 33
x_critical = 0
zscore = (x_critical - mu) / sigma
p = stats.norm.cdf(zscore)
p

0.3279956507031998

In [3]:
# Answer to Question 1.2 - The cutoff for the highest 15% is an annual return of 48.9%
mu = 14.7
sigma = 33
zscore = stats.norm.ppf(0.85)
cutoff = zscore * sigma + mu
cutoff

48.90230185329506
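
As a cross-check on Question 1, both answers can also be read directly off a frozen normal distribution instead of converting to z-scores by hand. This is a minimal sketch; the names `returns`, `p_loss` and `cutoff_top15` are introduced here for illustration only.

```python
from scipy import stats

# Frozen normal distribution for annual returns (in percent), per Question 1.
returns = stats.norm(loc=14.7, scale=33)

# P(return < 0%): share of years in which the portfolio loses money.
p_loss = returns.cdf(0)

# Inverse survival function: the return exceeded in only 15% of years.
cutoff_top15 = returns.isf(0.15)

print(f"P(loss) = {p_loss:.4f}")
print(f"Top-15% cutoff = {cutoff_top15:.2f}%")
```

Here `isf(0.15)` is equivalent to `ppf(0.85)`, so the two approaches should agree up to floating-point error.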

In [4]:
# Answer to Question 2.1
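# The p-value is below 0.10, so we reject the null hypothesis at the 90% confidence level and conclude that the new rewards system was effective
# Note: the sample is simulated from the summary statistics without a fixed seed, so the exact statistic and p-value vary between runs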
population_mean = 20
std_dev = 6
sample_mean = 14
sample_size = 20
sample_data = np.random.normal(sample_mean, std_dev, sample_size)
ttest = stats.ttest_1samp(sample_data, 20)
ttest

Ttest_1sampResult(statistic=-6.292732676584281, pvalue=4.849943627844383e-06)
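
Because Question 2 gives the population standard deviation (6 hours) and only a sample mean, a one-sample z-test on the summary statistics is a natural alternative to simulating data. The sketch below assumes a left-tailed test (H0: mean = 20, H1: mean < 20); the names `standard_error`, `z_stat` and `p_value` are illustrative.

```python
import math
from scipy import stats

population_mean = 20   # H0: mean absenteeism is still 20 hours
std_dev = 6            # known population standard deviation
sample_mean = 14       # observed mean of the sample
sample_size = 20

# Standard error of the sample mean when sigma is known.
standard_error = std_dev / math.sqrt(sample_size)

# z statistic and left-tailed p-value (H1: mean < 20).
z_stat = (sample_mean - population_mean) / standard_error
p_value = stats.norm.cdf(z_stat)

print(f"z = {z_stat:.3f}, p-value = {p_value:.2e}")
print("Reject H0 at alpha = 0.10:", p_value < 0.10)
```

This version is deterministic, so the conclusion does not depend on a random draw.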

In [5]:
# Answer to Question 2.2
# At 90% confidence, there is a 4.46% probability (β) that the new rewards system reduced absenteeism to 16 hours and we fail to detect it
expected_mean = 16
zscore_1 = stats.norm.ppf(0.10)
x_critical = zscore_1 * (std_dev / math.sqrt(sample_size)) + population_mean
zscore_2 = (x_critical - expected_mean) / (std_dev / math.sqrt(sample_size))
p = 1 - stats.norm.cdf(zscore_2)
p

0.044577464303378944

In [6]:
# Answer to Question 2.3
# The p-value is below 0.05, so we reject the null hypothesis at the 95% confidence level and conclude that the new rewards system was effective
# At 95% confidence, there is a 9.07% probability (β) that the new rewards system reduced absenteeism to 16 hours and we fail to detect it
expected_mean = 16
zscore_1 = stats.norm.ppf(0.05)
x_critical = zscore_1 * (std_dev / math.sqrt(sample_size)) + population_mean
zscore_2 = (x_critical - expected_mean) / (std_dev / math.sqrt(sample_size))
p = 1 - stats.norm.cdf(zscore_2)
p

0.09068146248885567
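
The Type II error calculation used for Questions 2.2 and 2.3 can be wrapped in one helper so that both confidence levels come from the same code path. This is a sketch under the same assumptions as the cells above (left-tailed z-test, known σ); the function name `type_ii_error` is hypothetical.

```python
import math
from scipy import stats

def type_ii_error(mu0, mu1, sigma, n, alpha):
    """P(fail to reject H0: mean = mu0 | true mean is mu1), left-tailed z-test."""
    se = sigma / math.sqrt(n)
    # Sample means below this value lead to rejecting H0.
    x_critical = mu0 + stats.norm.ppf(alpha) * se
    # Probability that the sample mean lands above the cutoff when the true mean is mu1.
    return stats.norm.sf((x_critical - mu1) / se)

beta_90 = type_ii_error(20, 16, 6, 20, alpha=0.10)
beta_95 = type_ii_error(20, 16, 6, 20, alpha=0.05)

print(f"90% confidence: beta = {beta_90:.4f}, power = {1 - beta_90:.4f}")
print(f"95% confidence: beta = {beta_95:.4f}, power = {1 - beta_95:.4f}")
```

Tightening α from 0.10 to 0.05 moves the rejection cutoff further from 20, which is why β roughly doubles at the same sample size.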

In [7]:
# Answer to Question 2.4
# Based on Answer to Question 2.3, 
# the sampling method is not good enough to identify a reduction from 20 to 16 hours if we use a confidence of 95%

In [8]:
# Answer to Question 2.5
# By trial and error, we find that the sample size needs to be 25 if we want β to be at most 5%
sample_size_required = 25
x_critical = stats.norm.ppf(0.05) * (std_dev / math.sqrt(sample_size_required)) + population_mean
zscore = (x_critical - 16) / (std_dev / math.sqrt(sample_size_required))
p = 1 - stats.norm.cdf(zscore)
p

0.045659590664034466
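
The trial-and-error result can be checked against the standard closed-form sample size for a one-sided z-test, n = ((z_{1-α} + z_{1-β}) · σ / (μ0 − μ1))², rounded up. The sketch below assumes α = 0.05 and β = 0.05; all variable names are illustrative.

```python
import math
from scipy import stats

sigma = 6            # known population standard deviation
mu0, mu1 = 20, 16    # mean under H0 and the reduction we want to detect
alpha, beta = 0.05, 0.05

# Standard normal quantiles for the chosen error rates.
z_alpha = stats.norm.ppf(1 - alpha)
z_beta = stats.norm.ppf(1 - beta)

# Closed-form sample size, then round up to the next whole employee.
n_exact = ((z_alpha + z_beta) * sigma / (mu0 - mu1)) ** 2
n_required = math.ceil(n_exact)

print(f"n (exact) = {n_exact:.2f}, n (required) = {n_required}")
```

Rounding up rather than to the nearest integer keeps β at or below the 5% target.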

In [9]:
# Answer to Question 3.1 - Calculate the expected values based on the geometric distribution with a probability of 53.2%
def expected_value(n):
    # Expected count for "first success on trial n" out of 1362 observations
    return round((1 - 0.532) ** (n - 1) * 0.532 * 1362)

# Expected counts for n = 1..6, plus a final bin (7+) holding the remainder
EV_list = [expected_value(n) for n in range(1, 7)]
EV_list.append(1362 - sum(EV_list))
EV_list

[725, 339, 159, 74, 35, 16, 14]

In [10]:
# Answer to Question 3.2 - Compare the expected vs. the observed values from the textbook using the Chi-Square distribution
test = chisquare([717, 369, 155, 69, 28, 14, 10], f_exp=EV_list)
print('Chi-square Statistic= %.2f' % test.statistic)
print('p-value= %.2f' % test.pvalue)

Chi-square Statistic= 5.97
p-value= 0.43
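
To make the goodness-of-fit test less of a black box, the statistic can be computed by hand as the sum of (observed − expected)² / expected and compared with a critical value. This sketch reuses the observed and expected counts from the cells above and assumes k − 1 = 6 degrees of freedom, which is also the default used by `scipy.stats.chisquare`.

```python
import numpy as np
from scipy import stats

observed = np.array([717, 369, 155, 69, 28, 14, 10])
expected = np.array([725, 339, 159, 74, 35, 16, 14])  # EV_list from the earlier cell

# Pearson chi-square statistic, term by term.
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: number of bins minus one.
dof = len(observed) - 1

# Upper-tail p-value and the 5% critical value for comparison.
p_value = stats.chi2.sf(chi2_stat, dof)
critical_value = stats.chi2.ppf(0.95, dof)

print(f"chi-square = {chi2_stat:.2f}, dof = {dof}, p-value = {p_value:.2f}")
print(f"5% critical value = {critical_value:.2f}")
print("Reject H0 at alpha = 0.05:", chi2_stat > critical_value)
```

The statistic falls well below the critical value, which matches the large p-value reported by `chisquare`.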


In [11]:
# Answer to Question 3.3 - Reach a conclusion
# With a p-value of 0.43, we do not have enough statistical evidence to say that our observations do not fit the geometric distribution where p = 53.2%

In [12]:
# Answer to Question 3.4 - Explain what is the business impact of your conclusion
# Thus we conclude that, at p = 53.2%, the data are consistent with stock activity each day being independent of the stock’s behavior on previous days