The heart disease dataset combines five well-known, previously independent datasets: Cleveland, Hungarian, Switzerland, Long Beach VA, and the Statlog (Heart) Data Set. It merges the 11 features common to these sources across 1190 instances, making it the largest dataset available for heart disease research to date. It is intended to support research on Coronary Artery Disease (CAD) through machine learning and data mining, with the aim of improving clinical diagnosis and early treatment.

Here are the detailed descriptions of the 11 features and the class target present in this dataset:

1. **Age (age)**: Numeric value indicating the age in years.
2. **Sex (sex)**: Binary value where 1 represents male and 0 represents female.
3. **Chest Pain Type (chest pain type)**: Nominal data categorized into four types: 1 for typical angina, 2 for atypical angina, 3 for non-anginal pain, and 4 for asymptomatic.
4. **Resting Blood Pressure (resting blood pressure)**: Numeric value measured in millimeters of mercury (mm Hg).
5. **Serum Cholesterol (serum cholesterol)**: Numeric value measured in milligrams per deciliter (mg/dl).
6. **Fasting Blood Sugar (fasting blood sugar)**: Binary value indicating whether fasting blood sugar is above 120 mg/dl (1) or not (0).
7. **Resting Electrocardiogram Results (resting electrocardiogram results)**: Nominal data with three categories: 0 for normal, 1 for having ST-T wave abnormalities, and 2 for showing probable or definite left ventricular hypertrophy by Estes' criteria.
8. **Maximum Heart Rate Achieved (maximum heart rate achieved)**: Numeric value indicating the maximum heart rate recorded during the test, measured in beats per minute (bpm).
9. **Exercise Induced Angina (exercise induced angina)**: Binary value indicating the presence (1) or absence (0) of angina induced by exercise.
10. **Oldpeak = ST Depression (oldpeak)**: Numeric value indicating the ST segment depression induced by exercise relative to rest.
11. **The Slope of the Peak Exercise ST Segment (ST slope)**: Nominal data classified into three slopes: 0 for upsloping, 1 for flat, and 2 for downsloping.
12. **Class Target (class target)**: Binary outcome used to indicate the presence (1) or absence (0) of heart disease.
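
The integer codes above can be easier to read once they are mapped to labels before tabulating. The following is a minimal sketch on a few made-up rows. The column names follow the renamed columns used in the analysis code, while the `label_maps` dictionary and the `toy` frame are illustrative assumptions and not part of the dataset.

```python
import pandas as pd

# Code-to-label maps taken from the feature descriptions above.
label_maps = {
    "sex": {0: "female", 1: "male"},
    "chest_pain_type": {
        1: "typical angina",
        2: "atypical angina",
        3: "non-anginal pain",
        4: "asymptomatic",
    },
    "resting_ecg": {0: "normal", 1: "ST-T abnormality", 2: "LV hypertrophy"},
}

# Hypothetical rows, for illustration only (not real records).
toy = pd.DataFrame({
    "sex": [1, 0, 1],
    "chest_pain_type": [4, 2, 3],
    "resting_ecg": [0, 1, 2],
    "target": [1, 0, 0],
})

decoded = toy.copy()
for column, mapping in label_maps.items():
    # Series.map returns NaN for any code missing from the mapping,
    # which makes unexpected codes easy to spot.
    decoded[column] = decoded[column].map(mapping)

print(decoded)
```

Any `NaN` left in a decoded column would point to a code that the descriptions above do not cover.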

This dataset's development aims to provide a robust resource for developing algorithms that improve the prediction and understanding of heart disease, ultimately contributing to better diagnostic and therapeutic outcomes.

All of this data can be found in the folder named 'data1'.

Based on the attributes described in the heart disease dataset, you can conduct various hypothesis tests to explore the relationships between different variables. Below are some suggested hypothesis tests:

**Relationship between Sex and Heart Disease:**

- **Hypothesis**: Is the proportion of people with heart disease the same among males and females?
- **Test Type**: Chi-squared test.

**Relationship between Type of Chest Pain and Heart Disease:**

- **Hypothesis**: Is there a relationship between different types of chest pain (typical angina, atypical angina, non-anginal pain, asymptomatic) and the occurrence of heart disease?
- **Test Type**: Chi-squared test.

**Relationship between Age and Heart Disease:**

- **Hypothesis**: Does the mean age differ between people with and without heart disease?
- **Test Type**: Two-sample t-test, or one-way ANOVA (equivalent for two groups, and needed if there are more than two groups).

**Relationship between Resting Electrocardiogram Results and Heart Disease:**

- **Hypothesis**: Is there a relationship between resting electrocardiogram results (normal, ST-T wave abnormalities, left ventricular hypertrophy) and the occurrence of heart disease?
- **Test Type**: Chi-squared test.

**Relationship between Maximum Heart Rate and Heart Disease:**

- **Hypothesis**: Does the mean maximum heart rate achieved differ between people with and without heart disease?
- **Test Type**: Two-sample t-test or one-way ANOVA.

**Relationship between Exercise Induced Angina and Heart Disease:**

- **Hypothesis**: Is the proportion of heart disease different between people who experience exercise-induced angina and those who do not?
- **Test Type**: Chi-squared test.

**Relationship between Fasting Blood Sugar and Heart Disease:**

- **Hypothesis**: Is the proportion of heart disease different between people with fasting blood sugar levels above 120 mg/dl and those below this threshold?
- **Test Type**: Chi-squared test.
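
For the numeric attributes above (age and maximum heart rate), the t-test and ANOVA options coincide when there are only two groups: the ANOVA F statistic equals the square of the equal-variance t statistic, and the p-values match. The sketch below illustrates this on synthetic data. The group means, spreads, and sizes are arbitrary assumptions and are not drawn from the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic ages for two hypothetical groups (not real records).
age_no_disease = rng.normal(loc=51, scale=9, size=200)
age_disease = rng.normal(loc=56, scale=9, size=220)

# Two-sample t-test assuming equal variances (scipy's default).
t_stat, p_t = stats.ttest_ind(age_disease, age_no_disease)

# One-way ANOVA on the same two groups.
f_stat, p_f = stats.f_oneway(age_disease, age_no_disease)

print(f"t^2 = {t_stat**2:.4f}, F = {f_stat:.4f}")
print(f"p (t-test) = {p_t:.3g}, p (ANOVA) = {p_f:.3g}")
```

With more than two groups (for example, several age bands), only the ANOVA form applies.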

In [4]:
import pandas as pd
import scipy.stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load your dataset
df = pd.read_csv('data1/heart_statlog_cleveland_hungary_final.csv')

# Rename columns: replace spaces with underscores and convert to lowercase
df.columns = df.columns.str.replace(' ', '_').str.lower()

print(df.columns)

# Chi-squared tests and ANOVA

# Sex and Heart Disease: Chi-squared Test
table_sex = pd.crosstab(df['sex'], df['target'])
chi2_sex, p_sex, _, _ = scipy.stats.chi2_contingency(table_sex)
print(f"Chi-squared test for Sex and Heart Disease: Chi2 = {chi2_sex}, P-value = {p_sex}")

# Type of Chest Pain and Heart Disease: Chi-squared Test
table_cp = pd.crosstab(df['chest_pain_type'], df['target'])
chi2_cp, p_cp, _, _ = scipy.stats.chi2_contingency(table_cp)
print(f"Chi-squared test for Chest Pain Type and Heart Disease: Chi2 = {chi2_cp}, P-value = {p_cp}")

# Age and Heart Disease: ANOVA
model_age = ols('age ~ C(target)', data=df).fit()
anova_age = sm.stats.anova_lm(model_age, typ=2)
print("ANOVA for Age and Heart Disease:\n", anova_age)

# Resting ECG Results and Heart Disease: Chi-squared Test
table_ecg = pd.crosstab(df['resting_ecg'], df['target'])
chi2_ecg, p_ecg, _, _ = scipy.stats.chi2_contingency(table_ecg)
print(f"Chi-squared test for Resting ECG and Heart Disease: Chi2 = {chi2_ecg}, P-value = {p_ecg}")

# Maximum Heart Rate and Heart Disease: ANOVA
model_hr = ols('max_heart_rate ~ C(target)', data=df).fit()
anova_hr = sm.stats.anova_lm(model_hr, typ=2)
print("ANOVA for Maximum Heart Rate and Heart Disease:\n", anova_hr)

# Exercise Induced Angina and Heart Disease: Chi-squared Test
table_angina = pd.crosstab(df['exercise_angina'], df['target'])
chi2_angina, p_angina, _, _ = scipy.stats.chi2_contingency(table_angina)
print(f"Chi-squared test for Exercise Induced Angina and Heart Disease: Chi2 = {chi2_angina}, P-value = {p_angina}")

# Fasting Blood Sugar and Heart Disease: Chi-squared Test
table_fbs = pd.crosstab(df['fasting_blood_sugar'], df['target'])
chi2_fbs, p_fbs, _, _ = scipy.stats.chi2_contingency(table_fbs)
print(f"Chi-squared test for Fasting Blood Sugar and Heart Disease: Chi2 = {chi2_fbs}, P-value = {p_fbs}")

Index(['age', 'sex', 'chest_pain_type', 'resting_bp_s', 'cholesterol',
       'fasting_blood_sugar', 'resting_ecg', 'max_heart_rate',
       'exercise_angina', 'oldpeak', 'st_slope', 'target'],
      dtype='object')
Chi-squared test for Sex and Heart Disease: Chi2 = 113.83203133881408, P-value = 1.418278207711176e-26
Chi-squared test for Chest Pain Type and Heart Disease: Chi2 = 334.4185814396394, P-value = 3.5261912147845324e-72
ANOVA for Age and Heart Disease:
                  sum_sq      df          F        PR(>F)
C(target)   7149.319845     1.0  87.580158  3.906913e-20
Residual   96978.496122  1188.0        NaN           NaN
Chi-squared test for Resting ECG and Heart Disease: Chi2 = 18.18230431693647, P-value = 0.0001126581936273619
ANOVA for Maximum Heart Rate and Heart Disease:
                   sum_sq      df           F        PR(>F)
C(target)  132235.387412     1.0  244.704259  2.694125e-50
Residual   641981.634437  1188.0         NaN           NaN
Chi-squared test for Exer
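
With 1190 instances, even modest associations produce very small p-values, so an effect size can help when comparing the chi-squared results. One common choice is Cramér's V. The sketch below computes it from the statistics printed above, under two assumptions: that every crosstab used all 1190 rows, and that the table shapes are 2x2, 4x2, and 3x2 as implied by the feature descriptions. Note also that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so the value for sex is based on the corrected statistic. The helper `cramers_v` is an illustrative function, not part of scipy.

```python
import math


def cramers_v(chi2: float, n: int, n_rows: int, n_cols: int) -> float:
    """Cramér's V for an n_rows x n_cols contingency table with n observations."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


n = 1190  # assumed number of rows in every crosstab (no missing values)

# Chi-squared statistics copied from the output above.
v_sex = cramers_v(113.83203133881408, n, n_rows=2, n_cols=2)
v_chest_pain = cramers_v(334.4185814396394, n, n_rows=4, n_cols=2)
v_ecg = cramers_v(18.18230431693647, n, n_rows=3, n_cols=2)

print(f"sex: {v_sex:.3f}, chest pain: {v_chest_pain:.3f}, resting ECG: {v_ecg:.3f}")
```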

The dataset provided is a de-identified summary table of vision and eye health indicators from the National Health Interview Survey (NHIS), stratified by all possible combinations of age group, race/ethnicity, gender, and risk factor. The NHIS is an annual household survey conducted by the National Center for Health Statistics at the Centers for Disease Control and Prevention (CDC). It is designed to monitor trends in illness and disability and to track progress towards achieving national health objectives.

The NHIS samples approximately 35,000 households and 87,500 individuals each year, giving a dataset that reflects a broad spectrum of the U.S. population. For vision and eye health, the NHIS data in the Vision and Eye Health Surveillance System (VEHSS) includes targeted questions related to Visual Function. This focus helps in understanding and monitoring vision-related conditions and disabilities in the U.S. population.

To maintain privacy and data integrity, the dataset suppresses any data for cell sizes smaller than 30 individuals or where the relative standard error exceeds 30% of the mean value. This suppression ensures the reliability of the data while safeguarding participant confidentiality.
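
The suppression rule can be expressed as a small check. This is a sketch only: the function name `publishable_value` and the constants are illustrative, the rule is read as "suppress if the cell has fewer than 30 respondents, or if the standard error divided by the estimate exceeds 0.30", and the handling of a zero estimate is an added assumption rather than something the dataset documentation states.

```python
from typing import Optional

MIN_CELL_SIZE = 30  # cells with fewer respondents are suppressed
MAX_RSE = 0.30      # relative standard error threshold (30% of the estimate)


def publishable_value(estimate: float, std_error: float, cell_size: int) -> Optional[float]:
    """Return the estimate, or None if the suppression rule applies."""
    if cell_size < MIN_CELL_SIZE:
        return None
    if estimate == 0:
        # RSE is undefined for a zero estimate; treated as suppressed here (assumption).
        return None
    rse = std_error / abs(estimate)
    if rse > MAX_RSE:
        return None
    return estimate


print(publishable_value(12.4, 1.1, 250))  # large cell, small RSE: kept
print(publishable_value(12.4, 1.1, 22))   # cell too small: suppressed
print(publishable_value(3.0, 1.2, 250))   # RSE of 0.40: suppressed
```

In the published table, suppressed cells would be expected to show up as missing `Data_Value` entries, so rows with missing values should not be read as zeros.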

Updates to the VEHSS NHIS dataset are made as new data becomes available, with the latest update being in November 2019. For more detailed information on the methodologies and analysis specific to the VEHSS NHIS data, one can refer to the VEHSS NHIS webpage. Additional comprehensive information about the broader NHIS can be accessed via the NHIS official website at http://www.cdc.gov/nchs/nhis/about_nhis.htm.

This dataset serves as a vital resource for researchers, policymakers, and public health officials aiming to enhance understanding of vision-related health outcomes and to design interventions that improve eye health across different demographic groups in the United States.

Here’s a comprehensive description of each column:

1. **YearStart**: The starting year of the data collection period.
2. **YearEnd**: The ending year of the data collection period, identical to YearStart if data pertains to a single year.
3. **LocationAbbr**: The abbreviation of the location (e.g., state or country abbreviation).
4. **LocationDesc**: The full name of the location.
5. **DataSource**: The source from which the data was obtained.
6. **Topic**: The main topic of the data (e.g., vision health, diabetes).
7. **Category**: The category under which the topic falls.
8. **Question**: Description of the question being analyzed (e.g., "Percentage of adults with diabetic retinopathy").
9. **Response**: Holds the response value that was evaluated, if applicable.
10. **Age**: Stratification by age groups (e.g., "All ages", "0-17 years", "18-39 years").
11. **Gender**: Stratification by gender (e.g., "Total", "Male", "Female").
12. **RaceEthnicity**: Stratification by race/ethnicity (e.g., "All races", "Asian", "Hispanic").
13. **RiskFactor**: Stratification by major health risk factors (e.g., "Diabetes", "Smoking").
14. **RiskFactorResponse**: The response for the risk factor evaluated (e.g., "Yes", "No").
15. **Data_Value_Unit**: The unit of the data value, often a percentage ("%").
16. **Data_Value_Type**: Type of data value, such as "age-adjusted prevalence" or "crude prevalence".
17. **Data_Value**: A numeric value representing the data point; it may be zero, or missing if not applicable.
18. **Data_Value_Footnote_Symbol**: Symbol used in footnotes.
19. **Data_Value_Footnote**: Text of the footnote explaining anomalies or exceptions in the data.
20. **Low_Confidence_Limit**: The lower bound of the 95% confidence interval for the data value.
21. **High_Confidence_Limit**: The upper bound of the 95% confidence interval for the data value.
22. **Numerator**: An estimate of the number of individuals affected by the condition in the location.
23. **Sample_Size**: The size of the sample used to derive the data value.
24. **LocationID**: A unique identifier for the location.
25. **TopicID**: A unique identifier for the topic.
26. **CategoryID**: A unique identifier for the category.
27. **QuestionID**: A unique identifier for the question.
28. **ResponseID**: A unique identifier for the response.
29. **DataValueTypeID**: A unique identifier for the type of data value.
30. **AgeID**: A unique identifier for the age stratification.
31. **GenderID**: A unique identifier for the gender stratification.
32. **RaceEthnicityID**: A unique identifier for the race/ethnicity stratification.
33. **RiskFactorID**: A unique identifier for the risk factor.
34. **RiskFactorResponseID**: A unique identifier for the response to the risk factor.
35. **GeoLocation**: Geographic location details, not provided for national data.
36. **Geographic Level**: The level of geography (e.g., national, state) relevant to the data.
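
Because each row is a summary estimate for one combination of strata, a comparison along one dimension usually needs the other stratification columns held at their overall level. The sketch below shows one way to do that with pandas. The rows are made up, and the labels "All ages", "Total", and "All races" are taken from the examples in the column descriptions above, so the actual labels in the file may differ.

```python
import pandas as pd

# Hypothetical rows using the lowercase column names from the analysis code;
# the values are made up for illustration.
toy = pd.DataFrame({
    "question": ["Q1"] * 6,
    "age": ["All ages", "18-39 years", "All ages", "All ages", "18-39 years", "All ages"],
    "gender": ["Total", "Total", "Male", "Female", "Male", "Total"],
    "raceethnicity": ["All races", "All races", "All races", "All races", "Asian", "Hispanic"],
    "data_value": [10.2, 6.1, 9.4, 11.0, 5.0, 12.3],
})

# Hold age and race/ethnicity at their overall levels and vary only gender,
# so each remaining row is a gender-specific estimate for the same question.
by_gender = toy[
    (toy["age"] == "All ages")
    & (toy["raceethnicity"] == "All races")
    & (toy["gender"] != "Total")
]

print(by_gender[["gender", "data_value"]])
```

Without such a filter, overall rows and subgroup rows are pooled together, and the same respondents contribute to several rows.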

This structured dataset appears to be a comprehensive tool for analyzing various health outcomes by different demographic and geographic stratifications, facilitating detailed public health research and decision-making based on diverse health indicators.

Given the structure and the depth of data provided by the NHIS on vision and eye health, several hypotheses can be formulated and tested. These hypotheses can explore the relationships between various demographic and health-related factors and vision impairments or conditions. Here are some potential hypotheses you could examine:

1. **Age and Vision Health:**
   - **Hypothesis**: Vision health deteriorates significantly with age.
   - **Test Type**: ANOVA or linear regression to compare vision health across different age groups.
2. **Race/Ethnicity and Vision Health:**
   - **Hypothesis**: There are significant differences in vision health outcomes among different racial and ethnic groups.
   - **Test Type**: Chi-squared test for categorical outcomes or logistic regression for binary outcomes (like presence or absence of a vision condition).
3. **Gender and Specific Eye Conditions:**
   - **Hypothesis**: The prevalence of specific eye conditions (like glaucoma, cataracts) differs by gender.
   - **Test Type**: Chi-squared test or logistic regression depending on the data structure.
4. **Impact of Risk Factors on Vision Health:**
   - **Hypothesis**: Individuals with certain risk factors (e.g., diabetes, hypertension) have poorer vision health compared to those without these risk factors.
   - **Test Type**: Logistic regression to assess the impact of risk factors on the likelihood of having poor vision health.
5. **Interaction Effects:**
   - **Hypothesis**: The impact of risk factors on vision health varies by age or race/ethnicity.
   - **Test Type**: Multivariate regression or ANOVA with interaction terms to explore how the combination of risk factors and demographic characteristics affects vision health.
6. **Geographic Variation in Vision Health:**
   - **Hypothesis**: Vision health varies significantly across different geographic locations.
   - **Test Type**: Chi-squared test if categorical or one-way ANOVA if continuous, to compare different locations.
7. **Temporal Trends in Vision Health:**
   - **Hypothesis**: There has been a significant change in the prevalence of vision health issues over the years covered by the dataset.
   - **Test Type**: Time series analysis or linear regression to identify trends over time.
8. **Effectiveness of Health Interventions:**
   - **Hypothesis**: Health interventions or public health policies have led to an improvement in vision health outcomes.
   - **Test Type**: Pre-post analysis using paired t-tests or ANCOVA to compare data before and after the implementation of specific health interventions.

These hypotheses can help direct the statistical analysis of the NHIS data, leading to a better understanding of the factors influencing vision and eye health across the U.S. population. Each hypothesis can be adjusted based on the specific questions of interest and data availability within your dataset.

In [5]:
import pandas as pd
import scipy.stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load your dataset
df = pd.read_csv('data2/NHIS_Vision_and_Eye_Health_Surveillance_20240501.csv')

# Rename columns: replace spaces with underscores and convert to lowercase
df.columns = df.columns.str.replace(' ', '_').str.lower()

print(df.columns)

# 1. Age and Vision Health (ANOVA or Linear Regression)
model_age = ols('data_value ~ C(age)', data=df).fit()
print("ANOVA for Age and Vision Health:\n", model_age.summary())

# 2. Race/Ethnicity and Vision Health (Chi-squared Test or Logistic Regression)
# Caveat: data_value is numeric, so this crosstab treats every distinct value
# as its own category; interpret the resulting statistic with caution.
contingency_race = pd.crosstab(df['raceethnicity'], df['data_value'])
chi2_race, p_race, _, _ = scipy.stats.chi2_contingency(contingency_race)
print(f"Chi-squared Test for Race/Ethnicity and Vision Health: Chi2 = {chi2_race}, P-value = {p_race}")

# 3. Gender and Specific Eye Conditions (Chi-squared Test or Logistic Regression)
# Caveat: the Binomial family expects a response between 0 and 1, while
# data_value appears to be stored on a percentage scale.
model_gender = sm.formula.glm('data_value ~ C(gender)', family=sm.families.Binomial(), data=df).fit()
print("Logistic Regression for Gender and Eye Conditions:\n", model_gender.summary())

# 4. Impact of Risk Factors on Vision Health (Logistic Regression)
# The same caveat about the scale of data_value applies here.
model_risk = sm.formula.glm('data_value ~ C(riskfactor)', family=sm.families.Binomial(), data=df).fit()
print("Logistic Regression for Impact of Risk Factors on Vision Health:\n", model_risk.summary())

# 5. Interaction Effects (Multivariate Regression or ANOVA)
model_interaction = ols('data_value ~ C(riskfactor) * C(age) * C(raceethnicity)', data=df).fit()
print("ANOVA for Interaction Effects:\n", model_interaction.summary())

# 6. Geographic Variation in Vision Health (Chi-squared Test or ANOVA)
if df['data_value'].dtype == 'object':  # Categorical data
    contingency_geo = pd.crosstab(df['locationdesc'], df['data_value'])
    chi2_geo, p_geo, _, _ = scipy.stats.chi2_contingency(contingency_geo)
    print(f"Chi-squared Test for Geographic Variation in Vision Health: Chi2 = {chi2_geo}, P-value = {p_geo}")
else:  # Continuous data
    model_geo = ols('data_value ~ C(locationdesc)', data=df).fit()
    print("ANOVA for Geographic Variation in Vision Health:\n", model_geo.summary())

# 7. Temporal Trends in Vision Health (Time Series Analysis or Linear Regression)
model_temporal = ols('data_value ~ yearstart', data=df).fit()
print("Linear Regression for Temporal Trends in Vision Health:\n", model_temporal.summary())


Index(['yearstart', 'yearend', 'locationabbr', 'locationdesc', 'datasource',
       'topic', 'category', 'question', 'response', 'age', 'gender',
       'raceethnicity', 'riskfactor', 'riskfactorresponse', 'data_value_unit',
       'data_value_type', 'data_value', 'data_value_footnote_symbol',
       'data_value_footnote', 'low_confidence_limit', 'high_confidence_limit',
       'numerator', 'sample_size', 'locationid', 'topicid', 'categoryid',
       'questionid', 'responseid', 'datavaluetypeid', 'ageid', 'genderid',
       'raceethnicityid', 'riskfactorid', 'riskfactorresponseid',
       'geolocation', 'geographic_level'],
      dtype='object')
ANOVA for Age and Vision Health:
                             OLS Regression Results                            
Dep. Variable:             data_value   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     45

  special.gammaln(n - y + 1) + y * np.log(mu / (1 - mu + 1e-20)) +
  n * np.log(1 - mu + 1e-20)) * var_weights


Logistic Regression for Gender and Eye Conditions:
                  Generalized Linear Model Regression Results                  
Dep. Variable:             data_value   No. Observations:                33154
Model:                            GLM   Df Residuals:                    33151
Model Family:                Binomial   Df Model:                            2
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                   -inf
Date:                Mon, 13 May 2024   Deviance:                   5.0829e+07
Time:                        01:22:22   Pearson chi2:                 1.86e+23
No. Iterations:                     2   Pseudo R-squ. (CS):                nan
Covariance Type:            nonrobust                                         
                          coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------

  special.gammaln(n - y + 1) + y * np.log(mu / (1 - mu + 1e-20)) +
  n * np.log(1 - mu + 1e-20)) * var_weights


Logistic Regression for Impact of Risk Factors on Vision Health:
                  Generalized Linear Model Regression Results                  
Dep. Variable:             data_value   No. Observations:                33154
Model:                            GLM   Df Residuals:                    33150
Model Family:                Binomial   Df Model:                            3
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                   -inf
Date:                Mon, 13 May 2024   Deviance:                   5.0829e+07
Time:                        01:22:23   Pearson chi2:                 1.86e+23
No. Iterations:                     2   Pseudo R-squ. (CS):                nan
Covariance Type:            nonrobust                                         
                                    coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------

This dataset comes from a survey conducted at an autonomous college on faculty performance and the treatment of students. A questionnaire of 20 questions was rolled out to each department at the college and given to a randomly sampled list of students for feedback.

1. **SN (Serial Number of the Survey Question)**:
   - **Description**: This is a unique identifier for each question in the survey, facilitating tracking and analysis of responses to specific questions.
2. **Total Feedback Given**:
   - **Description**: Represents the number of responses received for each course, indicating how many students from the sampled list actually provided feedback.
3. **Total Configured**:
   - **Description**: The total number of students enrolled in each course, providing a basis for calculating the response rate and assessing the representativeness of the feedback.
4. **Questions**:
   - **Description**: A list of the 20 survey questions asked. These questions are designed to assess various aspects of faculty performance and student treatment.
5. **Weightage 1 to Weightage 5**:
   - **Description**: These fields represent the response options for each survey question, with Weightage 1 being the lowest grade (least favorable) and Weightage 5 being the best rating (most favorable). Each weightage captures different levels of satisfaction or agreement with the survey questions.
6. **Average/Percentage**:
   - **Description**: This field shows the weighted average of the responses for each question, both in absolute terms and as a percentage. This helps in quickly understanding the overall trend of responses for each question or course.
7. **Course Name**:
   - **Description**: Indicates the year of the course within the graduation or post-graduation program (e.g., First Year, Second Year, etc.). This allows analysis of feedback across different stages of the academic program.
8. **Basic Course**:
   - **Description**: Specifies the stream of education (e.g., Science, Arts, Commerce). This helps in comparing faculty performance and student satisfaction across different academic disciplines.

### Analytical Uses:

- **SN** can be used to align responses to specific questions and analyze data question-wise.
- **Total Feedback Given** and **Total Configured** provide metrics for assessing participation rates and potential response biases.
- **Questions** are central to understanding the specific areas of faculty performance being evaluated.
- **Weightage 1 to Weightage 5** allow for detailed sentiment analysis and satisfaction levels.
- **Average/Percentage** offers a straightforward metric for gauging overall satisfaction or agreement with specific aspects of faculty performance.
- **Course Name** and **Basic Course** enable demographic and longitudinal studies to discern patterns or differences in feedback based on year of study or academic discipline.

This structured approach allows for comprehensive analysis and understanding of faculty performance from multiple angles, helping identify strengths and areas for improvement.
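
As a worked illustration of two of these metrics, the sketch below computes a response rate and a weighted average. It assumes that the Weightage 1 to Weightage 5 fields hold response counts and that the percentage is the average expressed relative to the top rating of 5. Both are assumptions about the file layout, and the numbers are made up.

```python
def response_rate(total_feedback_given: int, total_configured: int) -> float:
    """Share of enrolled students who submitted feedback."""
    return total_feedback_given / total_configured


def weighted_average(counts: list[int]) -> float:
    """Mean rating, where counts[i] is the number of responses with weight i + 1."""
    total = sum(counts)
    return sum(weight * count for weight, count in enumerate(counts, start=1)) / total


# Hypothetical counts for Weightage 1..5 on a single question.
counts = [2, 3, 10, 25, 20]
average = weighted_average(counts)
percentage = average / 5 * 100  # assumes the percentage is relative to the top rating

print(f"response rate = {response_rate(60, 80):.2%}")
print(f"average = {average:.2f}, percentage = {percentage:.1f}%")
```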

To conduct hypothesis testing on the student satisfaction survey dataset from your college, you can explore various aspects of faculty performance and treatment towards students. Here are some potential hypotheses that you could test using the data:

1. **Faculty Performance across Different Courses**
   - **Hypothesis**: There is no significant difference in faculty performance between different courses.
   - **Test Type**: ANOVA or Kruskal-Wallis test (if the data is not normally distributed).
2. **Student Satisfaction by Year of Study**
   - **Hypothesis**: Student satisfaction does not vary by year of study.
   - **Test Type**: Chi-squared test for categorical data or ANOVA for numeric ratings.
3. **Comparison between Graduation and Post-Graduation Streams**
   - **Hypothesis**: There is no difference in student satisfaction between graduation and post-graduation streams.
   - **Test Type**: Two-sample t-test or Mann-Whitney U test (if data is not normally distributed).
4. **Influence of Survey Response Rate on Satisfaction Scores**
   - **Hypothesis**: Courses with higher response rates do not have significantly different satisfaction scores compared to courses with lower response rates.
   - **Test Type**: Pearson or Spearman correlation analysis (to check correlation between response rate and satisfaction scores).
5. **Impact of Course Size on Feedback Scores**
   - **Hypothesis**: The total strength of the batch (course size) does not affect the feedback scores.
   - **Test Type**: Regression analysis or Pearson/Spearman correlation to check the relationship between course size and average feedback scores.
6. **Uniformity in Responses across Different Questions**
   - **Hypothesis**: Student responses are consistent across all 20 survey questions.
   - **Test Type**: Friedman test (if the responses are ranked) or repeated measures ANOVA.
7. **Comparison of Weighted Scores Across Academic Streams**
   - **Hypothesis**: The distribution of weighted scores (from Weightage 1 to Weightage 5) is the same across different academic streams.
   - **Test Type**: Chi-squared test or Multivariate ANOVA depending on the structure of your data.

These hypotheses can provide insights into various aspects of faculty performance and student satisfaction at your college. You can adjust the hypotheses and test types based on the specifics of your data and the questions you are most interested in exploring.
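
Two of the tests listed above can be sketched as follows: a chi-squared test on the distribution of ratings across streams (hypothesis 7) and a Spearman correlation between response rate and average score (hypothesis 4). All numbers are made up, and the arrangement of the data into a streams-by-weightage count table is an assumption about how the survey file would be reshaped.

```python
import numpy as np
from scipy import stats

# Hypothetical response counts: rows are streams, columns are Weightage 1..5.
streams = ["Science", "Arts", "Commerce"]
counts = np.array([
    [5, 8, 20, 40, 27],
    [4, 10, 25, 35, 26],
    [9, 12, 22, 30, 27],
])

# Hypothesis 7: is the rating distribution the same across streams?
chi2, p_value, dof, expected = stats.chi2_contingency(counts)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")

# Hypothesis 4: rank correlation between response rate and average score,
# with one (made-up) pair of values per course.
response_rate = np.array([0.55, 0.62, 0.70, 0.48, 0.81, 0.66])
average_score = np.array([3.9, 4.1, 4.0, 3.6, 4.4, 4.2])
rho, p_rho = stats.spearmanr(response_rate, average_score)
print(f"Spearman rho = {rho:.3f}, p = {p_rho:.3f}")
```

Spearman is used here instead of Pearson because it does not assume a linear relationship or normally distributed scores.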