In this homework, you will build a multivariate logistic regression model using both Statsmodels and Scikit-learn. Use the cancer.csv dataset. The field 'diagnosis' is binary: 1 indicates a malignant cancer diagnosis and 0 indicates a benign diagnosis. The remaining columns contain characteristics of the tumor.

The data has already been cleaned, so you should not need to remove duplicates, nulls, or outliers. Your target variable is the diagnosis. The ID column can be removed. All other columns can be used in your prediction.

Your notebook should show all the code you used to run both models, along with appropriate comments. We should be able to "Run All" cells and generate the output you show when turning in the code.

You will also need to include at least one metric measuring the effectiveness of your models. Compare the metric generated between models. Does one model produce a higher/lower value?

Finally, when working with models it is often important to understand which features matter most to the overall model or contribute most to its success. Statsmodels lists the p-value for each feature in the output of the summary method. Please run this method on your model and interpret the results. For scikit-learn, you'll need to work a little harder. There are a number of ways to determine feature importance. One common way is to inspect the fitted coefficients. Please use at least one method to measure feature importance for scikit-learn.

Your final cell in the workbook should include a write-up of the results of your analysis. Do you think one model is better? Why?



In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('cancer.csv')
df = df.drop('id', axis=1)
df.head()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883


In [2]:
df.describe()

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,0.372583,14.127292,19.289649,91.969033,654.889104,0.096336,0.104341,0.088799,0.048919,0.181233,0.062792
std,0.483918,3.524049,4.301036,24.298981,351.914129,0.01398,0.052813,0.07972,0.038803,0.027331,0.007013
min,0.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996
25%,0.0,11.7,16.17,75.17,420.3,0.08641,0.06492,0.02956,0.02031,0.162,0.0578
50%,0.0,13.37,18.84,86.24,551.1,0.09592,0.09263,0.06154,0.0335,0.1794,0.06166
75%,1.0,15.78,21.8,104.1,782.7,0.1051,0.1304,0.1307,0.074,0.1956,0.06608
max,1.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

## Statsmodels Logistic Regression

In [4]:
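# Separate the features from the target
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=200)

# Note: sm.Logit does not add an intercept automatically, so this model is fit without one
log_reg = sm.Logit(y_train, X_train).fit()

print(log_reg.summary())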

Optimization terminated successfully.
         Current function value: 0.113948
         Iterations 11
                           Logit Regression Results                           
Dep. Variable:              diagnosis   No. Observations:                  455
Model:                          Logit   Df Residuals:                      445
Method:                           MLE   Df Model:                            9
Date:                Tue, 07 Nov 2023   Pseudo R-squ.:                  0.8287
Time:                        15:45:07   Log-Likelihood:                -51.847
converged:                       True   LL-Null:                       -302.68
Covariance Type:            nonrobust   LLR p-value:                2.522e-102
                             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------
radius_mean               -2.3255      3.904     -0.596      0.551      -9.976     

In [5]:
# Predicted probabilities of malignancy, rounded to 0/1 class labels
y_hat = log_reg.predict(X_test)
prediction = list(map(round, y_hat))

# Confusion matrix (rows: actual class, columns: predicted class)
cm = confusion_matrix(y_test, prediction)
print("Confusion Matrix : \n", cm)

# Accuracy score of the model on the test set
print('Test accuracy = ', accuracy_score(y_test, prediction))

Confusion Matrix : 
 [[72  4]
 [ 5 33]]
Test accuracy =  0.9210526315789473


## Scikit-learn Logistic Regression

In [6]:
# Same features, target, and split as the Statsmodels cell (identical random_state),
# so both models are trained and tested on the same rows
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=200)

# LogisticRegression applies L2 regularization and fits an intercept by default
model_sklearn = LogisticRegression(max_iter=1000)

model_sklearn.fit(X_train, y_train)

# Predicted 0/1 class labels for the test set
y_pred = model_sklearn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_report_sklearn = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{classification_report_sklearn}')



Accuracy: 0.8859649122807017
Confusion Matrix:
[[69  7]
 [ 6 32]]
Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.91      0.91        76
           1       0.82      0.84      0.83        38

    accuracy                           0.89       114
   macro avg       0.87      0.88      0.87       114
weighted avg       0.89      0.89      0.89       114



In [7]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# For 0/1 labels, MAE and MSE both equal the misclassification rate (1 - accuracy)
mae = mean_absolute_error(y_true=y_test, y_pred=y_pred)
mse = mean_squared_error(y_true=y_test, y_pred=y_pred)
# Take the square root directly; the squared=False argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(mse)

print("MAE:",mae)
print("MSE:",mse)
print("RMSE:",rmse)


MAE: 0.11403508771929824
MSE: 0.11403508771929824
RMSE: 0.33769081675298523
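
The MAE and MSE printed above are identical and equal 1 minus the accuracy, which is expected when both the labels and the predictions are 0 or 1: every error has absolute value 1 and squared value 1. A small worked example with made-up labels (the arrays below are hypothetical, chosen only to make the arithmetic easy to follow):

```python
import numpy as np

# Hypothetical labels and predictions: 8 samples, 2 of them misclassified.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 0])

errors = y_true != y_pred          # True wherever the prediction is wrong
accuracy = 1 - errors.mean()       # fraction of correct predictions

mae = np.abs(y_true - y_pred).mean()   # each error contributes |±1| = 1
mse = ((y_true - y_pred) ** 2).mean()  # each error contributes (±1)^2 = 1
rmse = np.sqrt(mse)

print(accuracy, mae, mse, rmse)
```

For this reason these regression metrics add little beyond accuracy for a classifier; the confusion matrix and the classification report carry more information about the kinds of errors made.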


In [8]:
# coef_ has shape (1, n_features) for a binary problem; the single row holds the coefficients
coefficients = model_sklearn.coef_

# Pair each feature name with its coefficient
feature_names = X_train.columns
coefficients_dict = dict(zip(feature_names, coefficients[0]))

# Display the coefficients
for feature, coef in coefficients_dict.items():
    print(f'{feature}: {coef}')

radius_mean: -2.2294245812617333
texture_mean: 0.2302112885938188
perimeter_mean: 0.6073959798312047
area_mean: -0.007896815043979055
smoothness_mean: 0.4775944329473782
compactness_mean: 0.7061101986174045
concavity_mean: 1.0226064314833725
concave points_mean: 0.6629728800815549
symmetry_mean: 0.49938818961350684
fractal_dimension_mean: 0.13095275953468202
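
The coefficients above are on the scale of the raw features, and those features span very different ranges (for example, area_mean runs into the thousands while smoothness_mean stays below 1), so their magnitudes are not directly comparable. One common option is to standardize the features before fitting and then rank the coefficients by absolute value. The sketch below illustrates the idea on synthetic data; the feature names `small_scale` and `large_scale` and the generated labels are assumptions for illustration, not the cancer.csv data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: two features with very different scales (NOT cancer.csv).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "small_scale": rng.normal(0.0, 0.01, 500),
    "large_scale": rng.normal(0.0, 100.0, 500),
})

# Assumed true model: each feature contributes equally once it is standardized.
log_odds = (X_demo["small_scale"] / 0.01 + X_demo["large_scale"] / 100.0).to_numpy()
y_demo = (rng.random(500) < 1.0 / (1.0 + np.exp(-log_odds))).astype(int)

# Standardize, then fit; coefficients are now per standard deviation of each feature.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_demo, y_demo)

coefs = pd.Series(pipe.named_steps["logisticregression"].coef_[0], index=X_demo.columns)

# Rank features by absolute coefficient, largest first.
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Putting the scaler inside the pipeline means it is fit on the training data only when the pipeline is used with a train/test split, which avoids leaking test-set statistics into the model.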


First, comparing the metrics generated by Statsmodels and scikit-learn, the Statsmodels logistic regression model scored higher: its test accuracy was about 0.921, compared with about 0.886 for the scikit-learn model.

In the Statsmodels logistic regression, the features with high p-values (e.g., 'radius_mean', 'perimeter_mean', 'compactness_mean', 'symmetry_mean', 'fractal_dimension_mean') may not be statistically significant in predicting the target variable and could be considered for removal. In the scikit-learn logistic regression model, 'radius_mean' has the largest negative coefficient, so an increase in 'radius_mean' is associated with a decrease in the log-odds of the target being malignant when the other features are held fixed. 'concavity_mean' has the largest positive coefficient, so a one-unit increase in it is associated with the largest increase in the log-odds of the target being malignant. Because the features are on different scales, these raw coefficient magnitudes are only a rough guide to importance.

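As a result, comparing accuracy on the test sample, the Statsmodels logistic regression performed better than the scikit-learn logistic regression model. The gap amounts to four test samples (105 versus 101 correct out of 114), so it should be interpreted with some caution. The two fits are also not identical in setup: scikit-learn's LogisticRegression applies L2 regularization and fits an intercept by default, whereas the sm.Logit model here is unregularized and has no intercept, which may account for part of the difference.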
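
The log-odds interpretation in the write-up can be made more concrete by exponentiating a coefficient, which gives the factor by which the odds of malignancy are multiplied for a one-unit increase in that feature, holding the others fixed. The sketch below uses three of the scikit-learn coefficients printed earlier; the variable names are chosen here for illustration.

```python
import numpy as np

# Coefficients copied from the scikit-learn output above.
coefs = {
    "radius_mean": -2.2294245812617333,
    "concavity_mean": 1.0226064314833725,
    "area_mean": -0.007896815043979055,
}

# exp(coefficient) is the multiplicative change in the odds per one-unit increase.
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.3f} per one-unit increase")
```

A ratio below 1 means the odds fall as the feature rises, and a ratio above 1 means they rise. Note that a "one-unit increase" means very different things across features: concavity_mean never exceeds about 0.43 in this dataset, while area_mean ranges over thousands of units.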