<div align='center'> <h1>Hypothesis Testing</h1></div>
<div align='center'> <h3>Problem Statement</h3></div>

**Part A**
Six different machines are being considered for use in manufacturing rubber seals. The machines are being compared with respect to the tensile strength of the product. A random sample of four seals from each machine is used to determine whether the mean tensile strength varies from machine to machine. The file Data.xlsx (Sheet Part A) contains the tensile-strength measurements in kilograms per square centimeter.
Perform the analysis of variance at the 0.05 level of significance and indicate whether or not the mean tensile strengths differ significantly for the six machines.
 
**Part B**
Please refer to the file Data.xlsx (Sheet Part B) for this part.
A study measured the sorption (either absorption or adsorption) rates of three different types of organic chemical solvents. These solvents are used to clean industrial fabricated-metal parts and are potentially hazardous waste. Independent samples from each type of solvent were tested, and their sorption rates were recorded as a mole percentage.  Is there a significant difference in the mean sorption rates for the three solvents? Use a P-value for your conclusions. Which solvent would you use? 

In [11]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# for hypothesis testing
import scipy.stats as st

In [13]:
# Getting all sheet names from the Excel file
df1=pd.ExcelFile('Data.xlsx')
sheet_name=df1.sheet_names
sheet_name

['Part A', 'Part B']

# Part A

In [16]:
# read_excel loads the first sheet (Part A) by default
df=pd.read_excel('Data.xlsx')
df.tail()

Unnamed: 0,Machine,Measurement
19,5,21
20,6,18
21,6,16
22,6,18
23,6,20


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Machine      24 non-null     int64
 1   Measurement  24 non-null     int64
dtypes: int64(2)
memory usage: 516.0 bytes


## Hypothesis Testing


**H0 :** The mean tensile strength is the same for all six machines.

**H1 :** The mean tensile strength differs for at least one machine.
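To make the hypotheses concrete, here is a minimal sketch of the one-way ANOVA F statistic, the quantity that `st.f_oneway` reports. It divides the between-group mean square by the within-group mean square. The helper `one_way_f` and the toy numbers are illustrative assumptions, not the measurements from Data.xlsx. With six machines and four seals each, the degrees of freedom would be 5 and 18.

```python
# Hypothetical toy data (NOT the Data.xlsx measurements): three groups of three values.
groups = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
]


def one_way_f(groups):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)                              # number of groups
    n_total = sum(len(g) for g in groups)        # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df_between, df_within = one_way_f(groups)
print("F =", f_stat, "with df =", (df_between, df_within))
```

A large F means the group means are spread out relative to the noise inside each group. An F near or below 1 gives little reason to doubt H0.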

In [22]:
alpha = 0.05

# getting all machines
machines=df['Machine'].unique()
print("machines : ",machines)

# separating all the data based on machine
sample1=df[df['Machine']==1]
sample2=df[df['Machine']==2]
sample3=df[df['Machine']==3]
sample4=df[df['Machine']==4]
sample5=df[df['Machine']==5]
sample6=df[df['Machine']==6]
print("sample created for diff machines successfully")

# applying one-way ANOVA test
f_test,p_value=st.f_oneway(sample1['Measurement'],sample2['Measurement'],sample3['Measurement'],
                           sample4['Measurement'],sample5['Measurement'],sample6['Measurement'])
print('f_test : ',f_test,'\np_value :',p_value)

# Comparing the p-value against the significance level
print('','\n***CONCLUSION***')
if p_value < alpha:
    print('WE REJECT THE NULL HYPOTHESIS')
else:
    print(' WE FAILED TO REJECT THE NULL HYPOTHESIS')


machines :  [1 2 3 4 5 6]
sample created for diff machines successfully
f_test :  0.4363636363636363 
p_value : 0.8173294233639146
 
***CONCLUSION***
 WE FAILED TO REJECT THE NULL HYPOTHESIS
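The per-machine split can also be written more compactly with `groupby`, which avoids one variable per machine. This is only a sketch of an alternative. The column names match the notebook, but the `demo` frame and its values are hypothetical placeholders, not the Part A data.

```python
import pandas as pd
import scipy.stats as st

# Hypothetical placeholder data with the same column names as Part A.
demo = pd.DataFrame({
    "Machine":     [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "Measurement": [17, 18, 16, 19, 18, 20, 17, 19, 16, 17, 18, 17],
})

# One array of measurements per machine, in sorted machine order
samples = [grp["Measurement"].to_numpy() for _, grp in demo.groupby("Machine")]

# Unpack the list so each group is passed as a separate argument
f_stat, p_value = st.f_oneway(*samples)
print("F statistic:", f_stat)
print("p-value    :", p_value)
```

The same two lines work unchanged for any number of groups, so they would apply to the three solvents in Part B as well.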


#
**Conclusion : Since the p-value (0.817) is greater than 0.05, we fail to reject the null hypothesis. The data give no significant evidence that the mean tensile strength differs among the six machines.**

#

# Part B

In [26]:
# Loading data from same file 2nd sheet
df=pd.read_excel('Data.xlsx',sheet_name='Part B')
df.head()

Unnamed: 0,Solvent,Samples
0,1,1.06
1,1,0.79
2,1,0.82
3,1,0.89
4,1,1.05


In [28]:
df.shape

(32, 2)

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Solvent  32 non-null     int64  
 1   Samples  32 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 644.0 bytes


## Hypothesis testing
**H0 :** The mean sorption rate is the same for all three solvents.

**H1 :** The mean sorption rate differs for at least one solvent.

#

In [37]:
# applying ANOVA and pairwise t-tests to get the p-values

# Get the unique solvent categories
solvent=df['Solvent'].unique()
print('solvent list : ',solvent)

sample1=df[df['Solvent']==1]
sample2=df[df['Solvent']==2]
sample3=df[df['Solvent']==3]

# Perform pairwise independent T-tests
t_test1, p_value1 = st.ttest_ind(sample1['Samples'], sample2['Samples'])  # Group 1 vs Group 2
t_test2, p_value2 = st.ttest_ind(sample2['Samples'], sample3['Samples'])  # Group 2 vs Group 3
t_test3, p_value3 = st.ttest_ind(sample3['Samples'], sample1['Samples'])  # Group 3 vs Group 1

f_test4,p_value4 = st.f_oneway(sample1['Samples'],sample2['Samples'],sample3['Samples'])

# Collect p-values
pvalues = [p_value1, p_value2, p_value3,p_value4]
print("P-values for pairwise T-tests:", pvalues)


P-values for pairwise T-tests: [0.6669822255373288, 2.4369120574820632e-05, 1.7846387388782062e-07, 5.855201452781719e-07]


In [39]:
# comparing each p-value against the 0.05 significance level
print('','\n***CONCLUSION***')
for i in pvalues:
    print(f'\n{i}')
    if i < 0.05:
        print('WE REJECT THE NULL HYPOTHESIS')
    else:
        print(' WE FAILED TO REJECT THE NULL HYPOTHESIS')


 
***CONCLUSION***

0.6669822255373288
 WE FAILED TO REJECT THE NULL HYPOTHESIS

2.4369120574820632e-05
WE REJECT THE NULL HYPOTHESIS

1.7846387388782062e-07
WE REJECT THE NULL HYPOTHESIS

5.855201452781719e-07
WE REJECT THE NULL HYPOTHESIS
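Each of the three pairwise t-tests above is judged at 0.05 on its own, so the chance of at least one false rejection across the family is higher than 0.05. One common and conservative remedy is the Bonferroni correction, sketched below with the three pairwise p-values printed above. The notebook itself does not apply a correction, so this is an optional extra step, and the names `pairwise`, `adjusted` and `decisions` are illustrative.

```python
alpha = 0.05

# Pairwise t-test p-values copied from the notebook output above
pairwise = {
    "1 vs 2": 0.6669822255373288,
    "2 vs 3": 2.4369120574820632e-05,
    "3 vs 1": 1.7846387388782062e-07,
}

m = len(pairwise)  # number of comparisons in the family

# Bonferroni: multiply each p-value by the number of tests, capped at 1
adjusted = {pair: min(1.0, p * m) for pair, p in pairwise.items()}
decisions = {pair: p_adj < alpha for pair, p_adj in adjusted.items()}

for pair in pairwise:
    verdict = "reject H0" if decisions[pair] else "fail to reject H0"
    print(f"solvent {pair}: adjusted p = {adjusted[pair]:.3g} -> {verdict}")
```

For these values the correction does not change any decision: the two comparisons involving solvent 3 stay significant, and solvent 1 vs solvent 2 does not.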


#
**CONCLUSION :** 

- (sample 2 , sample 3) 
- (sample 3 , sample 1)  
- (sample 1 , sample 2 , sample 3)

**We reject the null hypothesis for these comparisons. -> The mean sorption rate is not the same: the ANOVA across all three solvents (p ≈ 5.86e-07) and both pairwise tests involving sample 3 show significant differences.**

##

(sample 1 and sample 2) 

**We fail to reject the null hypothesis (p ≈ 0.667). -> There is no significant evidence that the mean sorption rates of sample 1 and sample 2 differ.**
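The p-values show which solvents differ but not in which direction, so the question "Which solvent would you use?" also needs the per-solvent means. Below is a minimal sketch of that summary. The column names match the notebook, but the `demo` values are hypothetical placeholders and say nothing about the real Part B data. Whether a lower or a higher sorption rate is preferable is not stated in the problem and is left as an assumption for the reader.

```python
import pandas as pd

# Hypothetical placeholder data with the same column names as Part B.
demo = pd.DataFrame({
    "Solvent": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Samples": [1.0, 0.8, 0.9, 1.1, 0.9, 1.0, 0.3, 0.5, 0.4],
})

# Mean, spread and sample size of the sorption rate for each solvent
summary = demo.groupby("Solvent")["Samples"].agg(["mean", "std", "count"])
print(summary)

# Solvents ordered from lowest to highest mean sorption rate
print(summary["mean"].sort_values())
```

Running the same `groupby` on the real `df` from the Part B cells would give the means needed to pick a solvent.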