# ************************************************************************************************************

# Dr. Kesselly Kamara
# D207 - Exploratory Data Analysis/Descriptive Analytics
# Descriptive analytics, considered the simplest form of analytics, is applied to historical data to discover trends and relationships.
# 1. Select one of the following methods to perform the analysis (Chi-square, ANOVA, t-test).
# 2. Perform statistical testing by using a hypothesis.
# 3. Hypothesis testing: a formal process for applying statistics to examine theories about the world.
# 4. Variable: a container that holds values (categorical or numerical data).
# 5. Categorical: qualitative data
# 6. Numerical: quantitative data (continuous or discrete)
# 7. Continuous: measurable numerical data
# 8. Discrete: countable numerical data
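
# The variable types listed above can be illustrated with a small sketch. The rows are copied from the first five rows of the tips data displayed in the Collect Data section; the names `sample`, `categorical_cols`, and `numerical_cols` are illustrative only and not part of the original notebook.

```python
import pandas as pd

# First five rows of the tips data (values copied from the df.head() output)
sample = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],        # numerical, continuous
    'gender': ['Female', 'Male', 'Male', 'Male', 'Female'],   # categorical
    'size': [2, 3, 3, 2, 4],                                  # numerical, discrete
})

# Mark the qualitative column explicitly as categorical
sample['gender'] = sample['gender'].astype('category')

# Split the columns by type
categorical_cols = sample.select_dtypes(include='category').columns.tolist()
numerical_cols = sample.select_dtypes(include='number').columns.tolist()

print(sample.dtypes)
print('categorical:', categorical_cols)
print('numerical:', numerical_cols)
```

# The type of each variable guides the choice of test: two categorical variables suggest Chi-square, while a numerical variable compared across groups suggests a t-test or ANOVA.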

# ********************************************************************************************************

# Steps:
# 1. Define your practical theory (e.g., gender is related to smoking).
# 2. Determine the method (e.g., Chi-square).
# 3. State your hypotheses: null hypothesis (Ho): gender is not related to smoking; alternative hypothesis (Ha): gender is related to smoking.
# 4. Set the decision rule: the null hypothesis is rejected if the p-value < 0.05.
# 5. Collect data (gender and smoking; day and smoking).
# 6. Test the hypothesis (reject or fail to reject the null hypothesis).
# 7. Report the finding: practical conclusion.
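
# The decision rule in step 4 can be sketched as a small helper. The function name `decide` and its `alpha` parameter are hypothetical and are not used elsewhere in this notebook.

```python
def decide(p_value, alpha=0.05):
    """Hypothetical helper: apply the p-value decision rule from step 4."""
    # A test never "accepts" Ho; it either rejects it or fails to reject it
    return "reject Ho" if p_value < alpha else "fail to reject Ho"

print(decide(0.01))   # p-value below alpha
print(decide(0.56))   # p-value above alpha
```

# A small p-value is evidence against Ho; a large p-value only means the data do not contradict Ho.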

# *****************************************************************************************************************

# Python version 

In [1]:
from platform import python_version
print(python_version())

3.11.7


# ********************************************************************************************************************

# Install and import appropriate libraries.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import seaborn as sn
import researchpy as rp

# chi-square test
from scipy.stats import chi2_contingency

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# **********************************************************************************************************************

# Collect Data

In [3]:
data=sn.load_dataset('tips')

In [4]:
df = data.rename(columns={'sex':'gender', 'smoker':'smoking'}) 

In [5]:
df.head()

,total_bill,tip,gender,smoking,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


# *************************************************************************************************************

In [6]:
# Calculate the cross_tab to determine frequencies
cross_tab=pd.crosstab(index=df['day'],columns=df['smoking'])
cross_tab

day \ smoking,Yes,No
Thur,17,45
Fri,15,4
Sat,42,45
Sun,19,57


In [7]:
chi_result=chi2_contingency(cross_tab)

In [8]:
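def is_related(x, y, alpha=0.05):
    # Cross-tabulate the two categorical columns and run the chi-square test of independence
    ct = pd.crosstab(index=df[x], columns=df[y])
    chi_result = chi2_contingency(ct)
    p = chi_result[1]  # chi2_contingency returns (statistic, p-value, dof, expected)
    verdict = "related" if p < alpha else "is not related"
    return p, verdict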

In [14]:
is_related('day', 'smoking') # p-value < 0.05, so reject the null hypothesis. Practical conclusion: day is related to smoking.

(1.0567572499836523e-05, 'related')
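
# The chi-square result holds more than the p-value. The sketch below unpacks all four values, using the observed counts copied from the cross_tab output above so that it runs without loading the dataset. The names `observed` and `expected` are illustrative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts copied from the cross_tab output (day by smoking)
observed = pd.DataFrame(
    {'Yes': [17, 15, 42, 19], 'No': [45, 4, 45, 57]},
    index=['Thur', 'Fri', 'Sat', 'Sun'],
)

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the counts expected if day and smoking were independent
chi2, p, dof, expected = chi2_contingency(observed)
expected = pd.DataFrame(expected, index=observed.index, columns=observed.columns)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.3e}")
print(expected.round(1))
```

# Comparing the observed and expected tables shows where the association comes from: on Friday, far more smokers were observed (15) than independence would predict.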

# *******************************************************************************************************************

# ANOVA - Analysis of variance

In [11]:
da=sn.load_dataset('diamonds')
da.head()

,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


# ANOVA (Analysis of Variance) tests whether the means of a numerical variable differ across two or more groups.
# Ho: all group means are equal. Ha: at least one group mean differs.

# One-way ANOVA has one independent categorical variable

In [33]:
diamonds=da[['price','cut', 'clarity', 'depth', 'table' ]]  # separate name so df (the tips data) is not overwritten
diamonds.head()

,price,cut,clarity,depth,table
0,326,Ideal,SI2,61.5,55.0
1,326,Premium,SI1,59.8,61.0
2,327,Good,VS1,56.9,65.0
3,334,Premium,VS2,62.4,58.0
4,335,Good,SI2,63.3,58.0


In [29]:
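# Fit tip ~ day on the tips data (the diamonds data has no 'tip' or 'day' column)
one_way=ols('tip~day',data=data).fit()
one_ano=sm.stats.anova_lm(one_way, typ=1)  # the keyword is 'typ', not 'type'; typ=1 matches the table shown below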
one_ano

,df,sum_sq,mean_sq,F,PR(>F)
day,3.0,9.525873,3.175291,1.672355,0.173589
Residual,240.0,455.686604,1.898694,,
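
# The F statistic and p-value in the table can be reproduced by hand from the sums of squares and degrees of freedom. This is a minimal sketch using the values copied from the table above; the variable names are illustrative.

```python
from scipy.stats import f

# Values copied from the one-way ANOVA table (tip ~ day)
ss_day, df_day = 9.525873, 3          # between-group sum of squares
ss_resid, df_resid = 455.686604, 240  # within-group (residual) sum of squares

# Mean square = sum of squares / degrees of freedom
ms_day = ss_day / df_day
ms_resid = ss_resid / df_resid

# F is the ratio of between-group to within-group mean squares
f_stat = ms_day / ms_resid

# p-value: upper-tail probability of the F distribution
p_value = f.sf(f_stat, df_day, df_resid)

print(f"F = {f_stat:.6f}, p = {p_value:.6f}")
```

# The p-value is greater than 0.05, so the null hypothesis is not rejected: the mean tip does not differ significantly across days.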


# Two-way ANOVA has two independent categorical variables
#https://www.scribbr.com/statistics/one-way-anova/

In [32]:
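tips=data.rename(columns={'sex':'gender'})  # tips data with 'sex' renamed to 'gender'
anov2=ols('tip~gender+day', data=tips).fit()
t_way=sm.stats.anova_lm(anov2, typ=1)  # the keyword is 'typ', not 'type'; typ=1 matches the table shown below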
t_way

,df,sum_sq,mean_sq,F,PR(>F)
gender,1.0,3.673534,3.673534,1.933473,0.165672
day,3.0,7.4469,2.4823,1.306497,0.272937
Residual,239.0,454.092042,1.899967,,


# *******************************************************************

In [16]:
def plot_hist(col_name, num_bins, do_rotate=False):
    plt.hist(data[col_name], bins=num_bins)
    plt.xlabel(col_name)
    plt.ylabel('Frequency')
    plt.title(f'Histogram of {col_name}')
    if do_rotate:
        plt.xticks(rotation=90)
    plt.show()

# Function to describe a column
def print_desc(col_name):
    print(data[col_name].describe())

# The t-test is used to evaluate population means

# One Sample t-test 
# The one-sample t-test evaluates whether a population mean equals a hypothesized value, using a single sample.

# Problem: Five diabetes patients were randomly selected from a treatment group. The doctor wants patients to have a glucose score of 110.

# The five patients' glucose scores are 80, 90, 135, 140, 150. Can the doctor be 95% confident that the average glucose score is 110?

# Ho: the group mean is 110

# Ha: the group mean is not 110

In [17]:
from scipy import stats as st
glucose = [80, 90, 135, 140, 150]
one_sample=st.ttest_1samp(glucose, 110)
one_sample

TtestResult(statistic=0.6348110542727384, pvalue=0.5600471348994379, df=4)
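
# The statistic reported above can be reproduced from the formula t = (sample mean - hypothesized mean) / (s / sqrt(n)). The sketch below also computes a 95% confidence interval for the mean; the variable names are illustrative.

```python
import numpy as np
from scipy import stats as st

glucose = np.array([80, 90, 135, 140, 150])
mu0 = 110                      # hypothesized mean under Ho
n = glucose.size

mean = glucose.mean()
se = glucose.std(ddof=1) / np.sqrt(n)   # standard error (sample std uses n - 1)

# t statistic and two-sided p-value with n - 1 degrees of freedom
t_stat = (mean - mu0) / se
p_value = 2 * st.t.sf(abs(t_stat), df=n - 1)

# 95% confidence interval for the population mean
t_crit = st.t.ppf(0.975, df=n - 1)
ci = (mean - t_crit * se, mean + t_crit * se)

print(f"mean = {mean}, t = {t_stat:.4f}, p = {p_value:.4f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

# The p-value is greater than 0.05 and the interval contains 110, so the null hypothesis is not rejected: the sample gives no evidence that the average glucose score differs from 110.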

# ------------------------------------------------------------------------------------------------------------------

# Two Sample t-test 
# The two-sample t-test compares the means of two independent samples.

In [43]:
two_sample=rp.ttest(group1= data['tip'][data['sex'] == 'Male'], group1_name= "Male",
         group2= data['tip'][data['sex'] == 'Female'], group2_name= "Female")
two_sample

(   Variable      N      Mean        SD        SE  95% Conf.  Interval
 0      Male  157.0  3.089618  1.489102  0.118843   2.854868  3.324367
 1    Female   87.0  2.833448  1.159495  0.124311   2.586326  3.080570
 2  combined  244.0  2.998279  1.383638  0.088578   2.823799  3.172758,
               Independent t-test   results
 0  Difference (Male - Female) =     0.2562
 1          Degrees of freedom =   242.0000
 2                           t =     1.3879
 3       Two side test p value =     0.1665
 4      Difference < 0 p value =     0.9168
 5      Difference > 0 p value =     0.0832
 6                   Cohen's d =     0.1855
 7                   Hedge's g =     0.1849
 8              Glass's delta1 =     0.1720
 9            Point-Biserial r =     0.0889)
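
# As a cross-check, the t statistic, p-value, and Cohen's d above can be recomputed from the summary table alone. This is a sketch that assumes the equal-variance (pooled) test, which is consistent with the 242 degrees of freedom reported; it uses `scipy.stats.ttest_ind_from_stats` rather than researchpy, and the variable names are illustrative.

```python
from math import sqrt
from scipy.stats import ttest_ind_from_stats

# Summary statistics copied from the researchpy output (tip by sex)
mean_m, sd_m, n_m = 3.089618, 1.489102, 157   # Male
mean_f, sd_f, n_f = 2.833448, 1.159495, 87    # Female

# Pooled (equal-variance) two-sample t-test from summary statistics
res = ttest_ind_from_stats(mean_m, sd_m, n_m, mean_f, sd_f, n_f, equal_var=True)

# Cohen's d: mean difference divided by the pooled standard deviation
pooled_sd = sqrt(((n_m - 1) * sd_m**2 + (n_f - 1) * sd_f**2) / (n_m + n_f - 2))
cohens_d = (mean_m - mean_f) / pooled_sd

print(f"t = {res.statistic:.4f}, p = {res.pvalue:.4f}, d = {cohens_d:.4f}")
```

# The two-sided p-value is greater than 0.05, so the null hypothesis of equal means is not rejected: the average tip does not differ significantly between male and female customers, and the effect size is small.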