Problem Statement
I treat this as a binary classification problem: the affairs variable is recoded to 1 if the woman reported at least one affair and 0 otherwise, and the goal is to predict that class for each woman.
Dataset 
The dataset I chose is the affairs dataset that comes with Statsmodels. It was derived from a survey of women in 1974 by Redbook magazine, in which married women were asked about their participation in extramarital affairs. More information about the study is available in a 1978 paper from the Journal of Political Economy. 
Description of Variables 
The dataset contains 6366 observations of 9 variables: 
rate_marriage: woman's rating of her marriage (1 = very poor, 5 = very good) 
age: woman's age 
yrs_married: number of years married 
children: number of children 
religious: woman's rating of how religious she is (1 = not religious, 4 = strongly religious) 
educ: level of education (9 = grade school, 12 = high school, 14 = some college, 16 = college graduate, 17 = some graduate school, 20 = advanced degree) 
occupation: woman's occupation (1 = student, 2 = farming/semi-skilled/unskilled, 3 = "white collar", 4 = teacher/nurse/writer/technician/skilled, 5 = managerial/business, 6 = professional with advanced degree) 
occupation_husb: husband's occupation (same coding as above) 
affairs: time spent in extra-marital affairs


In [1]:
import numpy as np 
import pandas as pd 
import statsmodels.api as sm 
import statsmodels.formula.api as smf
from scipy.special import factorial
import matplotlib.pyplot as plt 
from patsy import dmatrices 
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix

# Load the Redbook affairs survey as a pandas DataFrame
dta = sm.datasets.fair.load_pandas().data


In [2]:
dta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6366 entries, 0 to 6365
Data columns (total 9 columns):
rate_marriage      6366 non-null float64
age                6366 non-null float64
yrs_married        6366 non-null float64
children           6366 non-null float64
religious          6366 non-null float64
educ               6366 non-null float64
occupation         6366 non-null float64
occupation_husb    6366 non-null float64
affairs            6366 non-null float64
dtypes: float64(9)
memory usage: 447.7 KB


In [4]:
dta.describe()

,rate_marriage,age,yrs_married,children,religious,educ,occupation,occupation_husb,affairs
count,6366.0,6366.0,6366.0,6366.0,6366.0,6366.0,6366.0,6366.0,6366.0
mean,4.109645,29.082862,9.009425,1.396874,2.42617,14.209865,3.424128,3.850141,0.705374
std,0.96143,6.847882,7.28012,1.433471,0.878369,2.178003,0.942399,1.346435,2.203374
min,1.0,17.5,0.5,0.0,1.0,9.0,1.0,1.0,0.0
25%,4.0,22.0,2.5,0.0,2.0,12.0,3.0,3.0,0.0
50%,4.0,27.0,6.0,1.0,2.0,14.0,3.0,4.0,0.0
75%,5.0,32.0,16.5,2.0,3.0,16.0,4.0,5.0,0.484848
max,5.0,42.0,23.0,5.5,4.0,20.0,6.0,6.0,57.599991


In [5]:
# Recode the continuous measure into a binary target: 1 = any affair, 0 = none
dta["affairs"] = np.where(dta["affairs"] > 0, 1, 0)
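
As a quick sanity check on this recoding, the mean of a 0/1 column equals the share of positive cases. The sketch below illustrates that on a small made-up array; `hours` and its values are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical values standing in for the continuous `affairs` measure
hours = np.array([0.0, 0.0, 3.5, 0.0, 0.48, 0.0, 12.0, 0.0])

# Same recoding as in the cell above: 1 if any time was reported, else 0
had_affair = np.where(hours > 0, 1, 0)

# For a 0/1 variable the mean is the fraction of positives
positive_share = had_affair.mean()
print(had_affair.tolist(), positive_share)
```

On the real data, the `describe()` output already hints at the class balance: the median of `affairs` is 0 while the 75th percentile is above 0, so the positive class makes up somewhere between a quarter and a half of the rows.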


In [6]:
dta.tail(100)

,rate_marriage,age,yrs_married,children,religious,educ,occupation,occupation_husb,affairs
6266,4.0,37.0,16.5,2.0,1.0,12.0,3.0,4.0,0
6267,5.0,22.0,2.5,0.0,3.0,12.0,3.0,3.0,0
6268,4.0,37.0,16.5,4.0,2.0,12.0,2.0,2.0,0
6269,3.0,27.0,9.0,1.0,3.0,17.0,4.0,1.0,0
6270,5.0,22.0,2.5,0.0,1.0,17.0,2.0,2.0,0
6271,5.0,22.0,2.5,0.0,1.0,14.0,3.0,4.0,0
6272,5.0,22.0,2.5,0.0,2.0,16.0,4.0,5.0,0
6273,5.0,27.0,6.0,0.0,2.0,16.0,3.0,5.0,0
6274,4.0,22.0,2.5,0.0,3.0,14.0,5.0,4.0,0
6275,5.0,27.0,2.5,0.0,1.0,17.0,4.0,6.0,0


In [7]:
# Features: the first eight columns (everything except the target)
X = dta.iloc[:, 0:8]

In [9]:
X.head(10)

,rate_marriage,age,yrs_married,children,religious,educ,occupation,occupation_husb
0,3.0,32.0,9.0,3.0,3.0,17.0,2.0,5.0
1,3.0,27.0,13.0,3.0,1.0,14.0,3.0,4.0
2,4.0,22.0,2.5,0.0,1.0,16.0,3.0,5.0
3,4.0,37.0,16.5,4.0,3.0,16.0,5.0,5.0
4,5.0,27.0,9.0,1.0,1.0,14.0,3.0,4.0
5,4.0,27.0,9.0,0.0,2.0,14.0,3.0,4.0
6,5.0,37.0,23.0,5.5,2.0,12.0,5.0,4.0
7,5.0,37.0,23.0,5.5,2.0,12.0,2.0,3.0
8,3.0,22.0,2.5,0.0,2.0,12.0,3.0,3.0
9,3.0,27.0,6.0,0.0,1.0,16.0,3.0,5.0


In [10]:
# Target: the recoded affairs column, kept as a one-column DataFrame
y = dta.iloc[:, 8:]

In [12]:
y.head(10)

,affairs
0,1
1,1
2,1
3,1
4,1
5,1
6,1
7,1
8,1
9,1


In [19]:
Log_Reg=LogisticRegression()

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=11)
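
To see how many rows end up in each set, the split sizes can be worked out by hand. This is a small sketch that assumes scikit-learn rounds a fractional `test_size` up to a whole number of rows, which is my understanding of its behaviour rather than something shown in this notebook.

```python
import math

n_samples = 6366   # rows in the dataset, from dta.info()
test_size = 0.33   # fraction passed to train_test_split

# Assumption: a fractional test_size is rounded up to a whole number of rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

Under that assumption the test set has 2101 rows, which agrees with the four confusion-matrix entries reported later in the notebook (1273 + 124 + 454 + 250 = 2101).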

In [21]:
# Pass the target as a 1-D array, which is the shape scikit-learn expects
model = Log_Reg.fit(X_train, y_train.values.ravel())


In [22]:
y_pred=model.predict(X_test)

In [23]:
confusion_matrix(y_test,y_pred)

array([[1273,  124],
       [ 454,  250]], dtype=int64)
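
The confusion matrix is easier to interpret once it is reduced to a few rates. The sketch below recomputes them from the four numbers printed above, using scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are my own.

```python
# Confusion matrix from the output above; scikit-learn puts true classes
# in rows and predicted classes in columns
tn, fp = 1273, 124
fn, tp = 454, 250

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)    # share of predicted affairs that were real
recall = tp / (tp + fn)       # share of real affairs that were found
baseline = (tn + fp) / total  # accuracy of always predicting "no affair"

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} baseline={baseline:.3f}")
```

This gives an accuracy of about 0.725 against a majority-class baseline of about 0.665, with precision near 0.668 and recall near 0.355. In other words, the model beats the trivial predictor by roughly six percentage points but misses almost two thirds of the women who did report an affair.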

In [24]:
# Pass the target as a 1-D array, which is the shape scikit-learn expects
cross_val_score(Log_Reg, X, y.values.ravel(), cv=10)


array([0.71630094, 0.69749216, 0.74137931, 0.71226415, 0.70125786,
       0.73113208, 0.71855346, 0.70125786, 0.74842767, 0.75314465])
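
Ten fold scores are usually summarised by their mean and spread. A minimal sketch using the values printed above, with only the standard library:

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above
scores = [0.71630094, 0.69749216, 0.74137931, 0.71226415, 0.70125786,
          0.73113208, 0.71855346, 0.70125786, 0.74842767, 0.75314465]

cv_mean = mean(scores)
cv_std = stdev(scores)  # sample standard deviation across folds
print(f"{cv_mean:.3f} +/- {cv_std:.3f}")
```

The mean is about 0.722, with individual folds ranging from roughly 0.70 to 0.75, so the cross-validated accuracy is in line with the single hold-out split.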

In [25]:
model.predict(X_test)[111]

0

In [26]:
y_test.iloc[111, 0]

0