### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

In [2]:
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
import scipy.stats as stats

### Case Study 1 :

You were recently hired as a business analyst in a top sports company. The senior management team has asked you to come up with metrics with which they can gauge which team will win the upcoming La Liga cup (Football tournament). 

The given data set contains information on all the teams that have so far participated in all the past tournaments. 

It has data about how many goals each team scored, conceded; how many times they came within the first 6 positions, how many seasons they have qualified, their best position in the past, etc. You are required to do the following:

Before doing any analysis it would be a good idea to check for any hyphens or other symbols in the data set and make appropriate replacements to make sure you can perform arithmetic operations on the data. Prepare a short report to answer the following questions:

In [27]:
# importing data:
df = pd.read_csv('Laliga.csv', header = 1)
df.head()

Unnamed: 0,Pos,Team,Seasons,Points,GamesPlayed,GamesWon,GamesDrawn,GamesLost,GoalsFor,GoalsAgainst,Champion,Runner-up,Third,Fourth,Fifth,Sixth,T,Debut,Since/LastApp,BestPosition
0,1,Real Madrid,86,4385,2762,1647,552,563,5947,3140,33,23,8,8,3,4,79,1929,1929,1
1,2,Barcelona,86,4262,2762,1581,573,608,5900,3114,25,25,12,12,4,6,83,1929,1929,1
2,3,Atletico Madrid,80,3442,2614,1241,598,775,4534,3309,10,8,16,9,7,6,56,1929,2002-03,1
3,4,Valencia,82,3386,2664,1187,616,861,4398,3469,6,6,10,11,10,7,50,1931-32,1987-88,1
4,5,Athletic Bilbao,86,3368,2762,1209,633,920,4631,3700,8,7,10,5,8,10,49,1929,1929,1


In [28]:
# shape and info about the data:
df.shape

(61, 20)

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Pos            61 non-null     int64 
 1   Team           61 non-null     object
 2   Seasons        61 non-null     int64 
 3   Points         61 non-null     object
 4   GamesPlayed    61 non-null     object
 5   GamesWon       61 non-null     object
 6   GamesDrawn     61 non-null     object
 7   GamesLost      61 non-null     object
 8   GoalsFor       61 non-null     object
 9   GoalsAgainst   61 non-null     object
 10  Champion       61 non-null     object
 11  Runner-up      61 non-null     object
 12  Third          61 non-null     object
 13  Fourth         61 non-null     object
 14  Fifth          61 non-null     object
 15  Sixth          61 non-null     object
 16  T              61 non-null     object
 17  Debut          61 non-null     object
 18  Since/LastApp  61 non-null     o

In [31]:
### Replacing - in the data by 0:
df.replace('-',0, inplace=True)

In [37]:
### Converting the features to numerical:

cat = df.dtypes[df.dtypes == 'object'].index
cat

l_cat = ['Points', 'GamesPlayed', 'GamesWon', 'GamesDrawn', 'GamesLost',
       'GoalsFor', 'GoalsAgainst', 'Champion', 'Runner-up', 'Third', 'Fourth',
       'Fifth', 'Sixth', 'T']

In [38]:
for i in l_cat:
    df[i]= df[i].astype('int64')
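
The hyphen replacement and type conversion above can also be done in a single pass. The following is a minimal sketch, assuming a small made-up frame (`toy`, with hypothetical values) in place of `Laliga.csv`: `pd.to_numeric(..., errors="coerce")` turns any non-numeric token into NaN, which is then filled with 0 to match the choice made above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the La Liga data; the values are made up.
toy = pd.DataFrame({
    "Team": ["A", "B", "C"],
    "Points": ["120", "-", "45"],
    "GamesWon": ["30", "-", "10"],
})

numeric_cols = ["Points", "GamesWon"]

# Coerce non-numeric tokens such as '-' to NaN, fill them with 0, then cast to int.
toy[numeric_cols] = (
    toy[numeric_cols]
    .apply(pd.to_numeric, errors="coerce")
    .fillna(0)
    .astype("int64")
)

print(toy)
```

Coercing first also catches symbols other than `-`, which a literal `replace('-', 0)` would leave in place.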

In [39]:
# summary statistics of Numerical variable:

df.describe()

Unnamed: 0,Pos,Seasons,Points,GamesPlayed,GamesWon,GamesDrawn,GamesLost,GoalsFor,GoalsAgainst,Champion,Runner-up,Third,Fourth,Fifth,Sixth,T,BestPosition
count,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0
mean,31.0,24.0,901.42623,796.819672,303.967213,188.934426,303.754098,1140.344262,1140.229508,1.42623,1.409836,1.409836,1.409836,1.409836,1.409836,8.47541,7.081967
std,17.752934,26.827225,1134.899121,876.282765,406.99103,201.799477,294.708594,1506.740211,1163.710766,5.472535,4.540107,3.247445,2.74698,2.50584,2.23142,18.100282,5.276663
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,16.0,4.0,96.0,114.0,34.0,24.0,62.0,153.0,221.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
50%,31.0,12.0,375.0,423.0,123.0,95.0,197.0,430.0,632.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,6.0
75%,46.0,38.0,1351.0,1318.0,426.0,330.0,563.0,1642.0,1951.0,0.0,0.0,1.0,2.0,2.0,2.0,6.0,10.0
max,61.0,86.0,4385.0,2762.0,1647.0,633.0,1070.0,5947.0,3889.0,33.0,25.0,16.0,12.0,12.0,10.0,83.0,20.0


In [42]:
# summary statistics of Categorical variable:

df.describe(exclude=[np.number])

Unnamed: 0,Team,Debut,Since/LastApp
count,61,61,61
unique,61,45,37
top,Pontevedra,1929,2015-16
freq,1,10,6


In [24]:
# checking missing values:
df.isnull().sum()

Pos              0
Team             0
Seasons          0
Points           0
GamesPlayed      0
GamesWon         0
GamesDrawn       0
GamesLost        0
GoalsFor         0
GoalsAgainst     0
Champion         0
Runner-up        0
Third            0
Fourth           0
Fifth            0
Sixth            0
T                0
Debut            0
Since/LastApp    0
BestPosition     0
dtype: int64

There are no missing values in the data.

#### 1a. Which are the teams which started playing between 1930-1980?


In [66]:
df_debut_year = df[df['Debut'].astype(str).str[:4].astype(int).between(1930,1980)]

df_debut_year = df_debut_year[['Team','Debut']].sort_values(by='Debut').reset_index(drop=True)

print("The teams that started playing between 1930 - 1980 are as follows:")
df_debut_year
    

The teams that started playing between 1930 - 1980 are as follows:


Unnamed: 0,Team,Debut
0,Alaves,1930-31
1,Valencia,1931-32
2,Real Betis,1932-33
3,Oviedo,1933-34
4,Sevilla,1934-35
5,Hercules,1935-36
6,Osasuna,1935-36
7,Zaragoza,1939-40
8,Celta Vigo,1939-40
9,Murcia,1940-41


------------

#### 1b. Which are the top 5 teams in terms of points?


In [83]:
points_df = df[['Team','Points']].sort_values('Points',ascending = False).head(5)
points_df

Unnamed: 0,Team,Points
0,Real Madrid,4385
1,Barcelona,4262
2,Atletico Madrid,3442
3,Valencia,3386
4,Athletic Bilbao,3368


In [94]:
print('The top 5 teams in terms of Points are: ', points_df['Team'].to_list())

The top 5 teams in terms of Points are:  ['Real Madrid', 'Barcelona', 'Atletico Madrid', 'Valencia', 'Athletic Bilbao']


The top 5 teams in terms of points are Real Madrid, Barcelona, Atletico Madrid, Valencia and Athletic Bilbao.

-------------

#### 1c. What is the distribution of the winning percentage for all teams? Which teams are in the top 5 in terms of winning percentage? (Winning percentage= (GamesWon / GamesPlayed)*100)


In [100]:
df['Winning_percentage']= (df['GamesWon'] / df['GamesPlayed'])*100
df['Winning_percentage']

0     59.630702
1     57.241130
2     47.475134
3     44.557057
4     43.772629
        ...    
56    21.052632
57    23.333333
58    23.333333
59    16.666667
60          NaN
Name: Winning_percentage, Length: 61, dtype: float64

In [104]:
df_win = df[['Team','Winning_percentage']].sort_values('Winning_percentage', ascending=False).head(5)
df_win

Unnamed: 0,Team,Winning_percentage
0,Real Madrid,59.630702
1,Barcelona,57.24113
2,Atletico Madrid,47.475134
3,Valencia,44.557057
4,Athletic Bilbao,43.772629


In [105]:
print('The top 5 teams in terms of Winning Percentages are: ', df_win['Team'].to_list())

The top 5 teams in terms of Winning Percentages are:  ['Real Madrid', 'Barcelona', 'Atletico Madrid', 'Valencia', 'Athletic Bilbao']


The top 5 teams in terms of winning percentage are Real Madrid, Barcelona, Atletico Madrid, Valencia and Athletic Bilbao.
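
Question 1c also asks for the distribution of the winning percentage. A hedged sketch of one way to summarize it, using a made-up four-team frame (`toy`) rather than the real data: teams with zero games played give 0/0 = NaN and are dropped before binning.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last team has played no games.
toy = pd.DataFrame({
    "GamesWon": [55, 30, 12, 0],
    "GamesPlayed": [100, 100, 50, 0],
})
toy["Winning_percentage"] = toy["GamesWon"] / toy["GamesPlayed"] * 100

# 0/0 yields NaN, so drop it before describing the distribution.
valid = toy["Winning_percentage"].dropna()

# Count teams in 20-point-wide bins from 0 to 100.
counts, edges = np.histogram(valid, bins=[0, 20, 40, 60, 80, 100])
print(valid.describe())
print(dict(zip(edges[:-1].tolist(), counts.tolist())))
```

In the notebook itself, `sns.histplot(df['Winning_percentage'].dropna())` would draw the same distribution as a histogram.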

------------------

#### 1d. Is there a significant difference in the winning percentage for teams which have attained a best position between 1-3 and those teams which have had a best position between 4-7.


In [112]:
# Step 1: Filter on Winning Percentage and Best Position
table = df[['Team','Winning_percentage','BestPosition']]
table.head()

Unnamed: 0,Team,Winning_percentage,BestPosition
0,Real Madrid,59.630702,1
1,Barcelona,57.24113,1
2,Atletico Madrid,47.475134,1
3,Valencia,44.557057,1
4,Athletic Bilbao,43.772629,1


In [116]:
# Step 2: Filter over BestPositions
t1 = table[table['BestPosition'].between(1,3)]
t2 = table[table['BestPosition'].between(4,7)]


We first compare the mean winning percentage of the two groups to see how far apart they are.

In [121]:
m1 = round(t1['Winning_percentage'].mean(),2)
m2 =  round(t2['Winning_percentage'].mean(),2)

print('The average of winning percentage for Best Position between 1-3 is:' ,m1)
print('The average of winning percentage for Best Position between 4-7 is:' ,m2)

The average of winning percentage for Best Position between 1-3 is: 39.67
The average of winning percentage for Best Position between 4-7 is: 30.29


In [128]:
diff = m1-m2
diff

9.380000000000003

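The mean winning percentage of teams with a best position of 1-3 is 9.38 percentage points higher than that of teams with a best position of 4-7. Comparing the two means alone does not show whether this gap is statistically significant; that requires a formal test, such as a two-sample t-test on the two groups.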
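
A formal comparison of the two groups could look like the sketch below. It uses Welch's t-test (`equal_var=False`), which is a choice made here and not part of the original analysis, and the two lists are made-up winning percentages, so the printed result says nothing about the real data.

```python
from scipy import stats

# Hypothetical winning percentages for the two groups (illustrative only).
group_1_3 = [59.6, 57.2, 47.5, 44.6, 43.8, 38.0, 35.5]
group_4_7 = [33.0, 31.5, 29.8, 28.4, 27.0, 30.2]

# Welch's t-test: does not assume equal variances in the two groups.
result = stats.ttest_ind(group_1_3, group_4_7, equal_var=False)

alpha = 0.05
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
print("Reject H0 of equal means" if result.pvalue < alpha else "Fail to reject H0")
```

With the notebook's variables, the inputs would be `t1['Winning_percentage'].dropna()` and `t2['Winning_percentage'].dropna()`.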

---------------

---------------

### Case Study 2: 

A study was done to measure the blood pressure of 60 year old women with glaucoma. A random sample of 200 60-year old women with glaucoma was chosen. The mean of the systolic blood pressure in the sample was 140 mm Hg and the standard deviation was 25 mm Hg.



#### 2a. Calculate the estimated standard error of the sample mean? What does the standard error indicate?


In [179]:
n = 200
mean = 140
std = 25

# Standard error 
std_err = std/np.sqrt(n)
std_err

1.7677669529663687

The estimated standard error of the sample mean is 1.7678 mm Hg.

The standard error of the mean measures how much the sample mean is expected to vary from one random sample to another, and therefore how precisely it estimates the population mean. It is estimated as the sample standard deviation divided by the square root of the sample size, so it decreases as the sample size increases.
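
As a quick illustration of how the standard error depends on the sample size, the sketch below evaluates the same formula for a few sample sizes. The helper name `standard_error` is introduced here for illustration, and the sample sizes other than 200 are arbitrary.

```python
import math

def standard_error(sample_std, n):
    """Estimated standard error of the mean: s / sqrt(n)."""
    return sample_std / math.sqrt(n)

# Same standard deviation (25 mm Hg), increasing sample sizes.
for n in (50, 100, 200, 400):
    print(n, round(standard_error(25, n), 4))
```

Quadrupling the sample size halves the standard error, because the sample size enters through its square root.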

#### 2b. Estimate a 95% confidence interval for the true mean blood pressure for all 60-year old women with glaucoma.


For the standard normal distribution,  P(-1.96 < Z < 1.96) = 0.95, i.e., there is a 95% probability that a standard normal variable, Z, will fall between -1.96 and 1.96. 

We need to compute $\overline{X} \pm 1.96 \frac{\sigma}{\sqrt{n}}$, using the sample standard deviation as the estimate of $\sigma$.

In [182]:
print(stats.norm.isf(0.025))

1.9599639845400545


In [184]:
UL_bp = mean + 1.96*std_err
LL_bp = mean - 1.96*std_err

display(UL_bp)
display(LL_bp)

143.4648232278141

136.5351767721859

The 95% confidence interval for the true mean blood pressure for all 60-year old women with glaucoma is 136.54 to 143.46 mm Hg.

#### 2c. Assume that instead of 200, a random sample of only 100 60-year old women with glaucoma was chosen. The sample mean and standard deviation estimates are the same as those in the original study. What is the estimated 95% confidence interval for the true mean blood pressure?


In [186]:
mean = 140
std = 25
n2 = 100

std_err_2 = std/np.sqrt(n2)
std_err_2

2.5

In [187]:
UL_2 = mean + 1.96*std_err_2
LL_2 = mean - 1.96*std_err_2

display(UL_2)
display(LL_2)

144.9

135.1

The 95% confidence interval for the true mean blood pressure for all 60-year old women with glaucoma, for a sample size of 100, is 135.1 to 144.9 mm Hg.

#### 2d. Which of the two above intervals are wider?


The confidence interval is wider for the smaller sample size of 100 (a width of 9.8 mm Hg, compared with about 6.93 mm Hg for the sample of 200). The smaller the sample size, the larger the standard error of the mean.

A sample of 100 women contains less information than a sample of 200 women, and therefore yields a less precise (more uncertain) estimate of the population mean.

The larger the standard error of the mean, the wider the confidence interval. Thus, the confidence interval for the sample size of 100 (the second case) is the wider of the two.
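
The comparison of the two intervals can be reproduced with a small helper. This is a sketch using only the standard library; the function name `z_confidence_interval` is introduced here for illustration, and it uses the exact normal quantile (about 1.95996) instead of the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(mean, std, n, confidence=0.95):
    """Normal-approximation confidence interval for a mean from summary statistics."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * std / sqrt(n)
    return mean - margin, mean + margin

ci_200 = z_confidence_interval(140, 25, 200)
ci_100 = z_confidence_interval(140, 25, 100)

width_200 = ci_200[1] - ci_200[0]
width_100 = ci_100[1] - ci_100[0]
print("n=200:", ci_200, "width:", width_200)
print("n=100:", ci_100, "width:", width_100)
```

Halving the sample size multiplies the width by the square root of 2, which is why the interval for 100 women is about 41% wider.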

#### 2e. Explain in non-technical terms why the estimated standard error of a sample mean tends to decrease with an increase in sample size.


In simpler terms, the standard error of the mean indicates how far a sample mean is likely to be from the population mean (in other words, how precise the estimate is).

As we include more and more observations from the population, individual extreme values have less influence on the average, so the sample mean becomes a more reliable representation of the population mean. Thus, as the sample size increases, the standard error of the mean tends to decrease.
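
The same idea can be seen in a small simulation. This is a sketch under stated assumptions: the population is synthetic (normally distributed with mean 140 and standard deviation 25, chosen to echo the case study), and the helper `spread_of_sample_means` is introduced here for illustration.

```python
import random
import statistics

random.seed(0)

# Synthetic population; not real blood-pressure data.
population = [random.gauss(140, 25) for _ in range(50_000)]

def spread_of_sample_means(n, repeats=1000):
    """Standard deviation of the means of many random samples of size n."""
    means = [statistics.fmean(random.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

# Larger samples give sample means that cluster more tightly.
spreads = {n: spread_of_sample_means(n) for n in (25, 100, 400)}
for n, spread in spreads.items():
    print(n, round(spread, 2))
```

The spread of the sample means should come out close to 25 divided by the square root of the sample size, shrinking as the samples get larger.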

-----------------

### Case Study 3. 

Par Inc., is a major manufacturer of golf equipment. Management believes that Par’s market share could be increased with the introduction of a cut-resistant, longer-lasting golf ball. Therefore, the research group at Par has been investigating a new golf ball coating designed to resist cuts and provide a more durable ball. The tests with the coating have been promising.

One of the researchers voiced concern about the effect of the new coating on driving distances. Par would like the new cut-resistant ball to offer driving distances comparable to those of the current-model golf ball. To compare the driving distances for the two balls, 40 balls of both the new and current models were subjected to distance tests. The testing was performed with a mechanical hitting machine so that any difference between the mean distances for the two models could be attributed to a difference in the design. The results of the tests, with distances measured to the nearest yard, are contained in   the data set “Golf”.



In [131]:
# importing data:
golf = pd.read_csv('Golf.csv')
golf.head()

Unnamed: 0,Current,New
0,264,277
1,261,269
2,267,263
3,272,266
4,258,262


In [164]:
# shape and info of the data:
golf.shape

(40, 2)

In [165]:
golf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Current  40 non-null     int64
 1   New      40 non-null     int64
dtypes: int64(2)
memory usage: 768.0 bytes


#### 3a. Formulate and present the rationale for a hypothesis test that Par could use to compare the driving distances of the current and new golf balls.


To compare the driving distances of the current and new golf balls, we conduct a two-sample independent t-test. The 40 current balls and the 40 new balls are separate samples hit by the same mechanical machine, so the two groups are independent and any difference in mean distance can be attributed to the design.
The hypotheses are as follows:
    
    H0: Mean driving distance of the Current golf balls = Mean driving distance of the New golf balls
    (There is no difference between the mean distances of the two models that could be attributed to a difference in the design)
    
    H1: Mean driving distance of the Current golf balls ≠ Mean driving distance of the New golf balls
    (There is a difference between the mean distances of the two models that could be attributed to a difference in the design)

#### 3b. Analyze the data to provide the hypothesis testing conclusion. What is the p-value for your test? What is your recommendation for Par Inc.?


#### Performing Shapiro Test to check for Normality of the data:
    
    H0: The data is normal
    H1: The data is not normal

In [140]:
# Shapiro test for current golf balls:
sc = stats.shapiro(golf.Current)
sc

print('The test-statistic is {} and the p-value is {}'.format(sc[0], sc[1]))

The test-statistic is 0.9707046747207642 and the p-value is 0.378787100315094


In [141]:
# checking if pvalue is > 0.05 alpha
sc[1] > 0.05

True

In [142]:
# Shapiro test for new golf balls:
sn = stats.shapiro(golf.New)
sn

print('The test-statistic is {} and the p-value is {}'.format(sn[0], sn[1]))

The test-statistic is 0.9678263664245605 and the p-value is 0.3064655363559723


In [143]:
# checking if pvalue is > 0.05 alpha
sn[1] > 0.05

True

#### Inference of Shapiro Test:

The p-value is greater than the 5% significance level for both the current and new golf balls, so we fail to reject the null hypothesis in each case. At the 5% level there is no evidence that either sample departs from normality, so we proceed with the two-sample independent t-test.

In [145]:
t1 = stats.ttest_ind(golf['Current'], golf['New'])
t1

print('The T-test statistic is {} and the p-value is {}'.format(t1[0], t1[1]))

The T-test statistic is 1.3283615935245678 and the p-value is 0.18793228491854663


In [146]:
# checking if pvalue is > 0.05 alpha
t1[1] > 0.05

True

#### Inference of Two Sample Independent T-Test:

The p-value (0.188) is greater than the 5% significance level, so we fail to reject the null hypothesis. At the 5% level there is no statistically significant difference between the mean driving distances of the current and new golf balls.

Hence, the data give no evidence of a difference between the mean distances of the two models that could be attributed to a difference in the design. On this evidence, the new coating does not appear to reduce driving distance.
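
`stats.ttest_ind` assumes equal population variances by default. The sketch below shows how that assumption could be checked with Levene's test before choosing between the pooled test and Welch's variant. It uses only the five rows displayed by `golf.head()` above, so the printed numbers are illustrative and are not the results for the full 40-ball samples.

```python
from scipy import stats

# First five rows of the Golf data as displayed by golf.head(); not the full sample.
current = [264, 261, 267, 272, 258]
new = [277, 269, 263, 266, 262]

# Levene's test: H0 is that the two groups have equal variances.
levene_stat, levene_p = stats.levene(current, new)

# Use the pooled t-test only if equal variances are plausible, otherwise Welch's.
equal_var = bool(levene_p > 0.05)
t_stat, p_value = stats.ttest_ind(current, new, equal_var=equal_var)

print(f"Levene p-value: {levene_p:.3f}, equal_var={equal_var}")
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

On the full data, the same calls would take `golf['Current']` and `golf['New']` as inputs.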

#### 3c. What is the 95% confidence interval for the population mean of each model, and what is the 95% confidence interval for the difference between the means of the two populations?


For the standard normal distribution,  P(-1.96 < Z < 1.96) = 0.95, i.e., there is a 95% probability that a standard normal variable, Z, will fall between -1.96 and 1.96. 

We need to compute $\overline{X} \pm 1.96 \frac{\sigma}{\sqrt{n}}$, using each sample's standard deviation as the estimate of $\sigma$.

In [157]:
# critical value for 95% confidence interval:
z = stats.norm.isf(0.05/2)
print(z)


1.9599639845400545


In [171]:
# 95% confidence interval for the population mean for Current Golf Ball

current_mean = golf['Current'].mean()
current_std = golf['Current'].std()
n = golf['Current'].count()

std_error = current_std/np.sqrt(n)


# margin of error is 1.96 * standard error (not the square root of the standard error)
UL = current_mean + (1.96*std_error)
LL = current_mean - (1.96*std_error)

print(round(UL, 2))
print(round(LL, 2))

272.99
267.56

In [172]:
# 95% confidence interval for the population mean for New Golf Ball

new_mean = golf['New'].mean()
new_std = golf['New'].std()
n1 = golf['New'].count()

std_error_new = new_std/np.sqrt(n1)

# margin of error is 1.96 * standard error (not the square root of the standard error)
UL_new = new_mean + (1.96*std_error_new)
LL_new = new_mean - (1.96*std_error_new)

print(round(UL_new, 2))
print(round(LL_new, 2))

270.57
264.43

The 95% confidence interval for the current golf balls is 267.56 to 272.99 yards, and the 95% confidence interval for the new golf balls is 264.43 to 270.57 yards.

--------------

In [174]:
# 95% confidence interval for the difference between the means of the two populations:

# means of the two populations are:
current_mean
new_mean

# difference in the means of the two population:
diff = current_mean - new_mean
print('The difference in the means of the two populations is:', diff)

# variance of the two populations are:
current_var = golf.Current.var()
new_var = golf.New.var()

# pooled variance (equal sample sizes, so a simple average of the two variances):
MSE = (current_var + new_var)/2

# sample size is 40 for both populations:
SE = np.sqrt(2*MSE/n)
print('The standard error of the difference between the sample means of the two populations is:', SE)


The difference in the means of the two populations is: 2.7749999999999773
The standard error of the difference between the sample means of the two populations is: 2.08903962108466


In [176]:
UL_SE = diff + (1.96*SE)
LL_SE = diff - (1.96*SE)

display(UL_SE)
display(LL_SE)

6.86951765732591

-1.3195176573259557

The 95% confidence interval for the difference between the means of the two populations is -1.32 to 6.87.
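
With 40 balls per group, a t critical value is a common alternative to 1.96. A hedged sketch of that variant, plugging in the difference and standard error printed above and assuming the pooled degrees of freedom n1 + n2 - 2 = 78:

```python
from scipy import stats

# Values taken from the notebook output above.
diff = 2.775                  # current_mean - new_mean
se_diff = 2.08903962108466    # standard error of the difference
n_current = n_new = 40

# Pooled degrees of freedom for two independent samples.
dof = n_current + n_new - 2
t_crit = stats.t.ppf(0.975, dof)

lower = diff - t_crit * se_diff
upper = diff + t_crit * se_diff
print(f"t critical value (df={dof}): {t_crit:.4f}")
print(f"95% CI for the difference: {lower:.2f} to {upper:.2f}")
```

Because the t critical value is slightly larger than 1.96, this interval is slightly wider than the z-based one. Both contain 0, which is consistent with the t-test failing to reject the null hypothesis.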

---------------