In [None]:
# Required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

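# Load the dataset (the path is left empty here; the Naive Bayes cell further down reads 'diabetes.csv')
data = pd.read_csv('')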

# Check the first few rows
print(data.head())

# Preparing the data
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Adding a constant for the intercept
X = sm.add_constant(X)

# Create the GLM model
model = sm.GLM(y, X, family=sm.families.Binomial())

# Fit the model
result = model.fit()

# Print the summary
print(result.summary())


In [None]:
Analysis:

    Coefficients (Coef.): Represent the change in the log-odds of the outcome for a one-unit change in the predictor variable, holding other variables constant.

    P>|z|: P-values associated with the Wald test for each coefficient. Small values (typically <0.05) suggest that a predictor is significantly associated with the outcome variable.

    Dep. Variable: The dependent variable for the model is 'Outcome', which is the binary variable we are trying to predict.

    Model: The model used is a GLM.

    Model Family: The family is Binomial, which is appropriate for a binary dependent variable. Combined with the Logit link, this makes the model a logistic regression.

    Link Function: The link function is Logit, which is commonly used for logistic regression.

    No. Observations: There are 768 observations used in fitting the model.

    Df Model: The number of predictors (degrees of freedom of the model) is 8, not including the constant term.

    Df Residuals: The degrees of freedom for the residuals is 759 (No. Observations - Df Model - 1 for the constant).

    Deviance and Pearson chi2: These are goodness-of-fit measures. When comparing models fitted to the same data, lower values generally indicate a better fit.

    Covariance Type: The type of covariance used is 'nonrobust', which means that it does not make adjustments for clustering or certain types of heteroskedasticity.

    Pseudo R-squ.: The Pseudo R-squared value is 0.2964. Unlike the R-squared in linear regression, it is not the proportion of variance explained. It compares the likelihood of the fitted model with that of an intercept-only model (the statsmodels GLM summary reports the Cox-Snell version), and its values are generally lower than a linear-regression R-squared.
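
The degrees-of-freedom arithmetic and a likelihood-based pseudo R-squared can be reproduced by hand. The following is a minimal sketch, not the library's implementation: the helper names are invented for this example, the Cox-Snell formula is assumed to be the variant behind the reported value, and the two log-likelihoods are placeholder numbers rather than values taken from the summary.

```python
import math


def residual_df(n_obs: int, n_predictors: int) -> int:
    """Residual degrees of freedom: observations minus predictors minus the intercept."""
    return n_obs - n_predictors - 1


def cox_snell_r2(llf: float, llnull: float, n_obs: int) -> float:
    """Cox-Snell pseudo R-squared: 1 - exp(2 * (llnull - llf) / n_obs)."""
    return 1.0 - math.exp(2.0 * (llnull - llf) / n_obs)


# Figures quoted in the summary: 768 observations and 8 predictors
df_resid = residual_df(768, 8)
print(df_resid)  # 759

# Hypothetical log-likelihoods, used only to show the calculation
example_r2 = cox_snell_r2(llf=-360.0, llnull=-500.0, n_obs=768)
print(round(example_r2, 4))
```

A model that fits no better than the intercept-only model has equal log-likelihoods and therefore a pseudo R-squared of zero.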

Coefficients Table:

    const: The constant (intercept) has a value of -8.4047. This represents the log-odds of the outcome when all predictors are held at zero.

    Pregnancies: The coefficient of 0.1232 suggests that with each additional pregnancy, the log-odds of having diabetes increases by 0.1232, holding other variables constant.

    Glucose: A one-unit increase in glucose concentration is associated with an increase in the log-odds of diabetes by 0.0352.

    BloodPressure: Blood pressure is slightly negatively associated with the log-odds of diabetes, with a coefficient of -0.0133.

    SkinThickness, Insulin, BMI: These variables have coefficients of 0.0006, -0.0012, and 0.0897 respectively, indicating their respective associations with the log-odds of diabetes per unit increase.

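    DiabetesPedigreeFunction: This has a relatively large positive coefficient of 0.9452. The size of a coefficient depends on the scale of its predictor, so this value alone does not show a stronger association than, for example, Glucose. It means that a one-unit increase in the pedigree function raises the log-odds of diabetes by 0.9452.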

    Age: The coefficient of 0.0149 suggests that with each additional year of age, the log-odds of having diabetes increases slightly.
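
Log-odds are easier to read once exponentiated into odds ratios. The sketch below uses only the coefficients quoted in the table above. The variable names and the `sigmoid` helper are illustrative and do not come from the original notebook.

```python
import math

# Coefficients on the log-odds scale, as quoted in the coefficients table
coefficients = {
    "Pregnancies": 0.1232,
    "Glucose": 0.0352,
    "BloodPressure": -0.0133,
    "SkinThickness": 0.0006,
    "Insulin": -0.0012,
    "BMI": 0.0897,
    "DiabetesPedigreeFunction": 0.9452,
    "Age": 0.0149,
}
intercept = -8.4047

# exp(coefficient) is the multiplicative change in the odds per one-unit increase
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.4f}")


def sigmoid(log_odds: float) -> float:
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Probability implied by the intercept alone (all predictors set to zero)
baseline_probability = sigmoid(intercept)
print(f"Baseline probability: {baseline_probability:.6f}")
```

An odds ratio above 1 means the odds rise with the predictor. For Pregnancies it is roughly 1.13, so each additional pregnancy multiplies the odds of diabetes by about 1.13, holding other variables constant.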

Statistical Significance:

    P>|z| column: Shows the p-values for the hypothesis tests for each coefficient. A common threshold for significance is 0.05.
        For instance, 'Pregnancies', 'Glucose', 'BMI', and 'DiabetesPedigreeFunction' have p-values less than 0.05, indicating that these are statistically significant predictors of diabetes at the 5% significance level.
        'BloodPressure' also shows significance with a p-value of 0.011.
        Other variables, like 'SkinThickness', 'Insulin', and 'Age', have p-values greater than 0.05, suggesting they are not statistically significant at the 5% level.

    [0.025 0.975] columns: These are the lower and upper bounds of the 95% confidence interval for each coefficient. If the interval does not include zero, the coefficient is significantly different from zero at the 5% level, which agrees with the corresponding p-value.
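
The link between the Wald p-value and the confidence interval can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: `wald_summary` is a hypothetical helper, and the standard error passed to it is an assumed value chosen for illustration, since the standard errors are not quoted in this analysis.

```python
from statistics import NormalDist


def wald_summary(coef: float, std_err: float, level: float = 0.95):
    """Return the Wald z statistic, two-sided p-value, and confidence interval."""
    std_normal = NormalDist()
    z = coef / std_err
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    z_crit = std_normal.inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    return z, p_value, (coef - z_crit * std_err, coef + z_crit * std_err)


# BloodPressure coefficient from the table; the standard error is an assumption
z, p_value, (lower, upper) = wald_summary(coef=-0.0133, std_err=0.0052)

significant = p_value < 0.05
excludes_zero = lower > 0 or upper < 0
print(f"z = {z:.3f}, p = {p_value:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
print(f"significant: {significant}, interval excludes zero: {excludes_zero}")
```

Because both quantities are built from the same z statistic, a two-sided p-value below 0.05 and a 95% interval that excludes zero always occur together.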

In summary, the model suggests that the number of pregnancies, glucose level, blood pressure, BMI, and diabetes pedigree function are significant predictors of the presence of diabetes at the 5% level. Diabetes pedigree function has the largest raw coefficient, but coefficients of predictors measured on different scales are not directly comparable. Age, skin thickness, and insulin are not statistically significant predictors in this model, according to the p-values provided.


In [None]:
# Required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
import seaborn as sns

# Load the dataset
data = pd.read_csv('diabetes.csv')

# Separate the features and the target variable
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Gaussian Naive Bayes model
gnb = GaussianNB()

# Fit the model on the training data
gnb.fit(X_train, y_train)

# Make predictions on the test data
y_pred = gnb.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Print the classification report for further details
print(classification_report(y_test, y_pred))
