## Problem Statement

The Health Company, which provides various diet plans for losing weight, conducted a market test experiment on three diet plans (A, B, C). Each diet had a different impact on weight after it was followed for 6 weeks. We are given a dataset of 78 observations, one per participant, each assigned to one of the diet plans (A, B, or C), along with the participant's weight before the diet and 6 weeks after starting it. The information is recorded in the columns 'diet', 'preweight', and 'weight6weeks'.

Find statistical evidence to check whether the mean weight losses with respect to the three diet plans are significantly different.

## Data Dictionary

diet: Type of the diet

preweight: Weight before following the diet plan

weight6weeks: Weight after 6 weeks of following the diet plan

### Importing the necessary libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt

### Reading the data into a dataframe

In [6]:
df=pd.read_csv('students.csv')
df.head(10)

Unnamed: 0,gender,race,parental_level_of_education,taken_test_preparation_course,maths_score,reading_score,writing_score,overall_score
0,male,Race C,high school,no,73,74,74,73.67
1,female,Race A,high school,no,71,62,68,67.0
2,male,Race B,bachelor's degree,yes,87,92,89,89.33
3,female,Race D,high school,no,63,62,58,61.0
4,female,Race A,high school,no,60,66,66,64.0
5,female,Race A,high school,no,75,71,75,73.67
6,female,Race A,master's degree,no,67,75,72,71.33
7,male,Race D,high school,no,53,58,53,54.67
8,male,Race C,high school,no,43,40,38,40.33
9,male,Race D,associate's degree,yes,70,82,84,78.67


In [3]:
# shape of the dataset
df.shape

(100, 8)

* The dataset contains 100 rows and 8 columns

In [4]:
# information about the columns in the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   gender                         100 non-null    object 
 1   race                           100 non-null    object 
 2   parental_level_of_education    100 non-null    object 
 3   taken_test_preparation_course  100 non-null    object 
 4   maths_score                    100 non-null    int64  
 5   reading_score                  100 non-null    int64  
 6   writing_score                  100 non-null    int64  
 7   overall_score                  100 non-null    float64
dtypes: float64(1), int64(3), object(4)
memory usage: 6.4+ KB


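* The `gender`, `race`, `parental_level_of_education`, and `taken_test_preparation_course` columns are categorical (`object`), while `maths_score`, `reading_score`, `writing_score`, and `overall_score` are numerical. No column has missing values.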

In [5]:
# summary of the dataset
df.describe()

Unnamed: 0,maths_score,reading_score,writing_score,overall_score
count,100.0,100.0,100.0,100.0
mean,68.45,70.8,69.65,69.6335
std,14.240627,13.586238,13.567173,13.099381
min,36.0,35.0,33.0,38.0
25%,57.75,61.75,60.75,61.0
50%,69.0,71.0,69.5,69.33
75%,78.25,81.0,80.0,78.7525
max,100.0,100.0,100.0,99.0


* The mean maths, reading, and writing scores are 68.45, 70.80, and 69.65, and the mean overall score is about 69.63.
* The standard deviations are similar across all four score columns, roughly 13 to 14 points.

### Q 1 In which ranges do the mean and standard deviation for maths score lie?

In [42]:
# After loading the dataset, compute the sample mean and standard deviation directly

df['maths_score'].mean(), df['maths_score'].std()
# the same values appear in the df.describe() output

(68.45, 14.240626901420269)
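One detail worth keeping in mind is that pandas `.std()` returns the sample standard deviation (`ddof=1`), while NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below illustrates the difference using only the first five maths scores shown in the `df.head(10)` output, so its numbers are not the full-dataset values.

```python
import numpy as np
import pandas as pd

# First five maths scores from the df.head(10) output (illustration only)
scores = pd.Series([73, 71, 87, 63, 60])

sample_std = scores.std()                         # pandas default: ddof=1, divides by n - 1
population_std = np.std(scores.to_numpy())        # NumPy default: ddof=0, divides by n
matched_std = np.std(scores.to_numpy(), ddof=1)   # NumPy made to agree with pandas

print(sample_std, population_std, matched_std)
```

With `ddof=1` passed explicitly, the two libraries agree; without it, the NumPy value is slightly smaller.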

### Q 2 Using the above mean and standard deviation, find the range on which the z-score for the maths score of the 5th student lies.

In [44]:
# z-score of the 5th student (positional index 4)
(df['maths_score'].iloc[4] - df['maths_score'].mean()) / df['maths_score'].std()

-0.5933727537765388

The result is about -0.59, so among the offered answer options it falls in the range from -0.7 to -0.5.
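The same calculation can be checked by hand without the dataframe. This is a minimal sketch that plugs in the 5th student's maths score (60, from the `df.head(10)` output) and the mean and standard deviation printed in Q 1; the helper name `z_score` is introduced here purely for illustration.

```python
def z_score(value, mean, std):
    """Number of standard deviations that `value` lies above (+) or below (-) `mean`."""
    return (value - mean) / std

# 5th student's maths score, with the sample mean and standard deviation from Q 1
z = z_score(60, 68.45, 14.240626901420269)
print(round(z, 4))
```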

### Q 3 Suppose that the reading scores follow a normal distribution with a mean of 71 and a standard deviation of 13, and the writing scores follow a normal distribution with a mean of 69 and a standard deviation of 10. In which subject has the 5th student performed best?

#### To solve...
* Calculate the z-scores for both the reading and writing scores.
* The higher the z-score, the better the performance relative to that subject's distribution.

In [51]:
# z-scores of the 5th student (positional index 4) for reading and writing
reading = (df['reading_score'].iloc[4] - 71) / 13
writing = (df['writing_score'].iloc[4] - 69) / 10

if reading>writing:
    print("The 5th student has performed the best in reading.")
else:
    print("The 5th student has performed the best in writing.")

The 5th student has performed the best in writing.
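To see why writing wins even though the 5th student scored 66 in both subjects, it may help to print the two z-scores. The sketch below hard-codes the scores from the `df.head(10)` output and the distribution parameters stated in the question; the dictionary names are illustrative.

```python
# 5th student's raw scores (row index 4 in the df.head(10) output)
scores = {"reading": 66, "writing": 66}

# (mean, standard deviation) assumed for each subject in the question
params = {"reading": (71, 13), "writing": (69, 10)}

z_scores = {
    subject: (scores[subject] - mean) / std
    for subject, (mean, std) in params.items()
}
best = max(z_scores, key=z_scores.get)

for subject, z in z_scores.items():
    print(f"{subject}: z = {z:.3f}")
print("best subject:", best)
```

Both z-scores are negative, so the student is below average in both subjects; the writing score is simply closer to its mean in standard-deviation units.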


### Q 4  Suppose the reading scores follow a normal distribution with a mean of 71 and a standard deviation of 13. Find the range on which the probability that a student will score more than 90 in reading lies.

In [57]:
# P(X > 90) for X ~ Normal(mean=71, sd=13): one minus the CDF evaluated at 90
1 - stats.norm.cdf(90, loc=71, scale=13)

0.07193386424080761

Expressed as a percentage (multiply by 100), the probability is about 7.2%, which falls in the 6-8% range among the given options.
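As a cross-check, SciPy also exposes the upper-tail probability directly through the survival function `norm.sf`, which is equivalent to `1 - cdf`. A short sketch under the same assumed mean and standard deviation:

```python
from scipy import stats

mean, sd = 71, 13   # distribution parameters given in the question

p_via_cdf = 1 - stats.norm.cdf(90, loc=mean, scale=sd)
p_via_sf = stats.norm.sf(90, loc=mean, scale=sd)   # survival function = 1 - CDF

print(p_via_cdf, p_via_sf)
print(f"{p_via_sf * 100:.1f}%")
```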

### Q 5 Construct a 95% confidence interval for the mean overall scores from the sample data.

In [60]:
# set the values of sample mean and sample standard deviation
x_bar, s = df['overall_score'].mean(), df['overall_score'].std()

# set the value of sample size and degrees of freedom
n = len(df)
k = n - 1 # degrees of freedom for a one-sample t-interval: n - 1

# construct the confidence interval (scale is the standard error of the mean)
np.round(stats.t.interval(0.95, df=k, loc=x_bar, scale=s / np.sqrt(n)), 2)

array([67.03, 72.23])

In [62]:
# the above can also be written as follows

np.round(stats.t.interval(0.95, df=100-1, loc=69.633500, scale=13.099381 / np.sqrt(100)), 2)
# df    = sample size minus 1 (degrees of freedom)
# loc   = sample mean of the overall scores
# scale = standard error of the mean (sample standard deviation / sqrt(n))

array([67.03, 72.23])
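The interval can also be built step by step from the formula x_bar ± t_crit · s / sqrt(n), which makes the role of each argument more visible. This is a sketch that uses the rounded summary values from the `df.describe()` output rather than the dataframe itself.

```python
import numpy as np
from scipy import stats

# Summary values for overall_score taken from the df.describe() output
x_bar, s, n = 69.6335, 13.099381, 100

standard_error = s / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% interval uses the 0.975 quantile
margin = t_crit * standard_error

lower, upper = x_bar - margin, x_bar + margin
print(round(lower, 2), round(upper, 2))
```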

### Q 6 An educational consultant claims that on average students receive an overall score of more than 70. Suppose you tested the educational consultant's claim using the sample data. What will be the conclusion based on the p-value?

In [64]:
#import the required functions
from scipy.stats import ttest_1samp

# calculate the test statistic and p-value
test_stat, p_value = ttest_1samp(df['overall_score'], popmean = 70, alternative = 'greater') # one-sided: the alternative hypothesis is mean > 70
print('The p-value is ', p_value)

The p-value is  0.6098866036964306


As the p-value is much greater than 0.05, we fail to reject the null hypothesis: there is not enough evidence to say that the average overall score of the students is more than 70.
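The large p-value is easier to understand once the test statistic is computed by hand: the sample mean (about 69.63) is below 70, so the data point away from the claim. The sketch below recomputes the one-sided test from the rounded summary values in `df.describe()`; because of that rounding, it should agree with `ttest_1samp` only to a few decimal places.

```python
import numpy as np
from scipy import stats

# Summary values for overall_score taken from the df.describe() output
x_bar, s, n = 69.6335, 13.099381, 100
mu_0 = 70   # hypothesised mean under H0

t_stat = (x_bar - mu_0) / (s / np.sqrt(n))
p_value = stats.t.sf(t_stat, df=n - 1)   # upper-tail probability for H1: mean > 70

print(t_stat, p_value)
```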

### Q 7 Suppose you want to test whether the maths score of students who have not taken the test preparation course is less than the maths score of students who have taken any test preparation course. What will be the conclusion based on the p-value?

#### Note - Assume that the population standard deviations are equal.

In [67]:
# import the required functions
from scipy.stats import ttest_ind

# prepare the data using Python
# first filter for students who have taken the test prep and find their math scores
score_taken_test=df[df.taken_test_preparation_course=='yes']['maths_score']

# then filter for students who have not taken the test prep and find their math scores
score_not_taken_test=df[df.taken_test_preparation_course=='no']['maths_score']

# find the p-value
test_stat, p_value = ttest_ind(score_not_taken_test, score_taken_test, equal_var = True, alternative = 'less')
# alternative = 'less' because the alternative hypothesis is that the first sample (no course) has the smaller mean

print('The p-value is ', p_value)

The p-value is  6.010315522311647e-07


#### As the p-value is less than 0.05, there is enough evidence to say that the mean maths score of students who have not taken the test preparation course is less than the mean maths score of students who have taken it.
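With `equal_var = True`, `ttest_ind` uses the pooled-variance t-test. The sketch below reproduces that computation by hand and compares it with SciPy. The two small arrays are hypothetical values chosen for illustration, not the notebook's data, so the printed numbers do not correspond to the result above.

```python
import numpy as np
from scipy import stats

# Hypothetical maths scores (illustration only, not taken from students.csv)
not_taken = np.array([60, 55, 70, 65, 58, 62])
taken = np.array([75, 80, 68, 90, 85])

n1, n2 = len(not_taken), len(taken)
# Pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * not_taken.var(ddof=1) + (n2 - 1) * taken.var(ddof=1)) / (n1 + n2 - 2)

t_manual = (not_taken.mean() - taken.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
p_manual = stats.t.cdf(t_manual, df=n1 + n2 - 2)   # lower tail for alternative='less'

t_scipy, p_scipy = stats.ttest_ind(not_taken, taken, equal_var=True, alternative='less')
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

If the equal-variance assumption were in doubt, passing `equal_var=False` would switch to Welch's t-test instead.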

### Q 8 Suppose you want to test whether a student's decision to take a test preparation course depends on their parents' educational level. Which of the tests below is the most appropriate?

* ANOVA test

* --> **Chi-square test of independence**

* Paired t-test

* One sample t-test

### Q 9 Suppose you want to test whether a student's decision to take a test preparation course depends on their parents' educational level. What will be the conclusion based on the p-value?

In [73]:
#import the required functions
from scipy.stats import chi2_contingency

# prepare the contingency table with pd.crosstab()
# crosstab() computes a simple cross-tabulation of two (or more) factors.
# By default it returns a frequency table of the factors, unless an array of
# values and an aggregation function are passed.

table= pd.crosstab(df['parental_level_of_education'],df['taken_test_preparation_course'])
table

taken_test_preparation_course,no,yes
parental_level_of_education,,
associate's degree,6,10
bachelor's degree,5,5
high school,40,9
master's degree,5,2
some high school,15,3


In [74]:
chi, p_value, dof, expected = chi2_contingency(table)

print('The p-value is ', p_value)

The p-value is  0.004949288534412502


#### As the p-value is less than 0.05, we have enough evidence to reject the null hypothesis of independence and say that whether a student takes the test preparation course depends on their parents' educational level.
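The test can be reproduced without the CSV file by entering the contingency table shown above as an array. The sketch below also inspects the expected frequencies that `chi2_contingency` returns, since the chi-square approximation is usually considered less reliable when many expected counts fall below 5; that rule of thumb is a general guideline, not something stated in the original exercise.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: associate's, bachelor's, high school, master's, some high school
# Columns: no, yes (counts copied from the crosstab output)
observed = np.array([
    [6, 10],
    [5, 5],
    [40, 9],
    [5, 2],
    [15, 3],
])

chi2, p_value, dof, expected = chi2_contingency(observed)
small_cells = int((expected < 5).sum())   # cells with expected count below 5

print("chi2 =", chi2, "dof =", dof, "p-value =", p_value)
print("expected counts:\n", np.round(expected, 2))
print("cells with expected count < 5:", small_cells)
```

Several cells here have small expected counts, so the p-value is best read as approximate.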

### Q 10 Suppose you want to test whether there is any difference in the means of student performance in reading, writing, and maths. What will be the conclusion based on the p-value?

In [98]:
# import the required function
from scipy.stats import f_oneway

# preparing the data
maths=df['maths_score']
reading=df['reading_score']
writing=df['writing_score']

In [99]:
# find the p-value
test_stat, p_value = f_oneway(maths,reading,writing)

print('The p-value is ', p_value)

The p-value is  0.4852226859144665


The above can also be written as...

In [97]:
# find the p-value
test_stat, p_value = f_oneway(df['maths_score'],
                              df['reading_score'],
                              df['writing_score'])

print('The p-value is ', p_value)

The p-value is  0.4852226859144665


#### As the p-value is greater than 0.05, you do not have enough evidence to say that there is any difference in means of student performance in reading, writing, and maths.
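For intuition about what `f_oneway` computes, the F statistic is the ratio of between-group to within-group mean squares. The sketch below computes it by hand and compares it with SciPy, using only the first five rows of each score column from the `df.head(10)` output, so the resulting p-value is illustrative and differs from the full-dataset result.

```python
import numpy as np
from scipy import stats

# First five rows of each score column from the df.head(10) output (illustration only)
groups = [
    np.array([73, 71, 87, 63, 60]),   # maths
    np.array([74, 62, 92, 62, 66]),   # reading
    np.array([74, 68, 89, 58, 66]),   # writing
]

k = len(groups)                        # number of groups
n_total = sum(len(g) for g in groups)  # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, p_manual)
print(f_scipy, p_scipy)
```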