I decided to treat this as a classification problem by creating a new binary 
variable affair (did the woman have at least one affair?) and trying to 
predict the classification for each woman.


Dataset
The dataset I chose is the affairs dataset that comes with Statsmodels. It 
was derived from a survey of women in 1974 by Redbook magazine, in 
which married women were asked about their participation in extramarital 
affairs. More information about the study is available in a 1978 paper from 
the Journal of Political Economy.


Description of Variables


The dataset contains 6366 observations of 9 variables:

rate_marriage: woman's rating of her marriage (1 = very poor, 5 = very good)
age: woman's age
yrs_married: number of years married
children: number of children
religious: woman's rating of how religious she is (1 = not religious, 4 = strongly religious)
educ: level of education (9 = grade school, 12 = high school, 14 = some college, 16 = college graduate, 17 = some graduate school, 20 = advanced degree)
occupation: woman's occupation (1 = student, 2 = farming/semiskilled/unskilled, 3 = "white collar", 4 = teacher/nurse/writer/technician/skilled, 5 = managerial/business, 6 = professional with advanced degree)
occupation_husb: husband's occupation (same coding as above)
affairs: time spent in extra-marital affairs


Code to load the data and modules:


In [9]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_val_score

In [22]:
# Load the dataset
data = sm.datasets.fair.load_pandas().data

In [23]:
# Add "affair" column: 1 represents having affair, 0 represents not
data['affair'] = (data.affairs > 0).astype(int)

In [12]:
# Create dataframes with an intercept column and dummy variables for occupation and husband's occupation
y, X = dmatrices(
    'affair ~ rate_marriage + age + yrs_married + children + religious + educ'
    ' + C(occupation) + C(occupation_husb)',
    data, return_type="dataframe")
# Give the dummy columns shorter, readable names
X = X.rename(columns={
    'C(occupation)[T.2.0]': 'occ_2',
    'C(occupation)[T.3.0]': 'occ_3',
    'C(occupation)[T.4.0]': 'occ_4',
    'C(occupation)[T.5.0]': 'occ_5',
    'C(occupation)[T.6.0]': 'occ_6',
    'C(occupation_husb)[T.2.0]': 'occ_husb_2',
    'C(occupation_husb)[T.3.0]': 'occ_husb_3',
    'C(occupation_husb)[T.4.0]': 'occ_husb_4',
    'C(occupation_husb)[T.5.0]': 'occ_husb_5',
    'C(occupation_husb)[T.6.0]': 'occ_husb_6',
})
y = np.ravel(y)
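
To make the dummy coding concrete: a C(...) term in patsy uses treatment coding by default, so the first level (occupation 1) acts as the reference and each remaining level gets its own 0/1 column. The sketch below shows that idea in plain Python. The helper name treatment_dummies is hypothetical and is not part of patsy.

```python
def treatment_dummies(values, levels, prefix):
    """Encode values as 0/1 indicator columns, dropping the first level as reference."""
    non_reference = levels[1:]  # the first level is absorbed by the intercept
    return [{f"{prefix}_{level}": int(value == level) for level in non_reference}
            for value in values]

occupation_levels = [1, 2, 3, 4, 5, 6]
rows = treatment_dummies([1, 3, 6], occupation_levels, prefix="occ")
for row in rows:
    print(row)
```

A woman with occupation 1 gets zeros in every occ_* column, which is why only five columns are needed for six levels.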

In [13]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


In [35]:
# Create a logistic regression model and fit the training data
lr = LogisticRegression()
lr.fit(X_train, y_train)

LogisticRegression()

In [36]:
# Predict the class labels for the testing set
y_pred = lr.predict(X_test)

In [37]:
# Compute the accuracy score of the model
accuracy = metrics.accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)


Accuracy: 0.731413612565445
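
An accuracy figure is easier to judge next to a trivial baseline that always predicts the most common class. The sketch below shows the calculation on made-up labels. majority_baseline is a hypothetical helper, and the toy list is not the survey data.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

toy_labels = [0, 0, 0, 1, 0, 1, 0, 1, 0, 0]  # illustrative only, not from the dataset
baseline = majority_baseline(toy_labels)
print("Baseline accuracy:", baseline)
```

In the notebook, passing y_test to the same function would give the number to compare the model's accuracy against.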


In [38]:
# Compute the cross-validation score of the model
cv_scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print("Cross-Validation Accuracy:", cv_scores.mean())

Cross-Validation Accuracy: 0.7246374021306636
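
For intuition, cv=10 means the data is cut into ten parts, and each part serves once as the held-out fold while the other nine are used for training. The sketch below uses plain contiguous folds, which is a simplification: scikit-learn uses stratified folds for classifiers when cv is an integer. kfold_indices is a hypothetical helper.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop

folds = list(kfold_indices(6366, 10))
print([len(test) for _, test in folds])
```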


Hyperparameter Tuning

In [39]:
from sklearn.model_selection import GridSearchCV

In [40]:
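# Search over penalty type, regularization strength C, solver, and iteration limit.
# Note: not every penalty/solver pair is valid (e.g. 'lbfgs' does not support 'l1');
# such candidates fail to fit.
param_grid = [
    {'penalty': ['l1', 'l2', 'elasticnet', 'none'],
     'C': np.logspace(-4, 4, 20),
     'solver': ['lbfgs', 'newton-cg', 'liblinear', 'sag', 'saga'],
     'max_iter': [100, 1000, 2500, 5000]
    }
]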

In [41]:
clf = GridSearchCV(lr, param_grid = param_grid, cv = 3, verbose=True, n_jobs=-1)

In [45]:
best_clf = clf.fit(X,y)

Fitting 3 folds for each of 1600 candidates, totalling 4800 fits
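
The candidate count in the log follows directly from the grid, since every combination of the four lists is tried. A quick check using only the standard library is sketched below. The C values are replaced by a placeholder list of the same length, an assumption made only to avoid importing NumPy.

```python
from itertools import product

penalties = ['l1', 'l2', 'elasticnet', 'none']
c_values = list(range(20))  # placeholder for the 20 values of np.logspace(-4, 4, 20)
solvers = ['lbfgs', 'newton-cg', 'liblinear', 'sag', 'saga']
max_iters = [100, 1000, 2500, 5000]

# One candidate per combination: 4 * 20 * 5 * 4
candidates = list(product(penalties, c_values, solvers, max_iters))
n_folds = 3
n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)
```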


In [None]:
best_clf.best_estimator_

In [None]:
css = LogisticRegression(C=0.0018329807108324356, solver='liblinear')

In [None]:
css.fit(X_train, y_train)

In [None]:
css.score(X_test, y_test)

In [None]:
from sklearn.metrics import roc_curve, auc

In [None]:
# Predict class labels and class probabilities for the test set
y_pred_cnb = css.predict(X_test)
y_prob_pred_cnb = css.predict_proba(X_test)

In [None]:
# roc curve for classes
fpr = {}
tpr = {}
thresh ={}

n_class = 2

for i in range(n_class):    
    fpr[i], tpr[i], thresh[i] = roc_curve(y_test, y_prob_pred_cnb[:,i], pos_label=i)
    
# plotting    
plt.plot(fpr[0], tpr[0], linestyle='--', label='Class 0 (no affair) as positive')
plt.plot(fpr[1], tpr[1], linestyle='--', label='Class 1 (affair) as positive')

plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='best')
plt.show()
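
As a rough illustration of what the ROC curve and the area under it measure, here is a dependency-free sketch on four made-up points. roc_points and trapezoid_auc are hypothetical helpers, not scikit-learn functions, and the labels and scores are illustrative only.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold one score at a time."""
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume all samples tied at this score before emitting a point
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))

toy_fpr, toy_tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
toy_auc = trapezoid_auc(toy_fpr, toy_tpr)
print("AUC:", toy_auc)
```

An area of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a single number like this is a convenient companion to the plot.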

In [None]:
#Intern at Pranathi 
#Student of DataTrained - Saurav
#Date - 6- April-2023
#Time - 12:47