This dataset contains records for 170 couples, described by the Divorce Predictors Scale (DPS) variables, which are based on Gottman couples therapy.
The couples come from various regions of Turkey; the records were collected in face-to-face interviews with couples who were either already divorced or happily married.
All responses were collected on a 5-point scale (0=Never, 1=Seldom, 2=Averagely, 3=Frequently, 4=Always).
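
When reading the tables and plots further down, it can help to map the numeric codes back to their labels. The following is a minimal sketch; `SCALE_LABELS` and the sample answers are hypothetical and are not part of the dataset files.

```python
import pandas as pd

# Labels of the 5-point scale as listed above.
SCALE_LABELS = {0: "Never", 1: "Seldom", 2: "Averagely", 3: "Frequently", 4: "Always"}

# Hypothetical answers to a single question, coded 0-4.
answers = pd.Series([0, 4, 2, 2, 3])
labelled = answers.map(SCALE_LABELS)
print(labelled.tolist())
```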

Attribute Information:

1. If one of us apologizes when our discussion deteriorates, the discussion ends.
2. I know we can ignore our differences, even if things get hard sometimes.
3. When we need it, we can take our discussions with my spouse from the beginning and correct it.
4. When I discuss with my spouse, to contact him will eventually work.
5. The time I spent with my wife is special for us.
6. We don't have time at home as partners.
7. We are like two strangers who share the same environment at home rather than family.
8. I enjoy our holidays with my wife.
9. I enjoy traveling with my wife.
10. Most of our goals are common to my spouse.
11. I think that one day in the future, when I look back, I see that my spouse and I have been in harmony with each other.
12. My spouse and I have similar values in terms of personal freedom.
13. My spouse and I have similar sense of entertainment.
14. Most of our goals for people (children, friends, etc.) are the same.
15. Our dreams with my spouse are similar and harmonious.
16. We're compatible with my spouse about what love should be.
17. We share the same views about being happy in our life with my spouse.
18. My spouse and I have similar ideas about how marriage should be.
19. My spouse and I have similar ideas about how roles should be in marriage.
20. My spouse and I have similar values in trust.
21. I know exactly what my wife likes.
22. I know how my spouse wants to be taken care of when she/he is sick.
23. I know my spouse's favorite food.
24. I can tell you what kind of stress my spouse is facing in her/his life.
25. I have knowledge of my spouse's inner world.
26. I know my spouse's basic anxieties.
27. I know what my spouse's current sources of stress are.
28. I know my spouse's hopes and wishes.
29. I know my spouse very well.
30. I know my spouse's friends and their social relationships.
31. I feel aggressive when I argue with my spouse.
32. When discussing with my spouse, I usually use expressions such as ‘you always’ or ‘you never’.
33. I can use negative statements about my spouse's personality during our discussions.
34. I can use offensive expressions during our discussions.
35. I can insult my spouse during our discussions.
36. I can be humiliating during our discussions.
37. My discussion with my spouse is not calm.
38. I hate my spouse's way of opening a subject.
39. Our discussions often occur suddenly.
40. We're just starting a discussion before I know what's going on.
41. When I talk to my spouse about something, my calm suddenly breaks.
42. When I argue with my spouse, I only go out and I don't say a word.
43. I mostly stay silent to calm the environment a little bit.
44. Sometimes I think it's good for me to leave home for a while.
45. I'd rather stay silent than discuss with my spouse.
46. Even if I'm right in the discussion, I stay silent to hurt my spouse.
47. When I discuss with my spouse, I stay silent because I am afraid of not being able to control my anger.
48. I feel right in our discussions.
49. I have nothing to do with what I've been accused of.
50. I'm not actually the one who's guilty about what I'm accused of.
51. I'm not the one who's wrong about problems at home.
52. I wouldn't hesitate to tell my spouse about her/his inadequacy.
53. When I discuss, I remind my spouse of her/his inadequacy.
54. I'm not afraid to tell my spouse about her/his incompetence.

Of the participants, 84 (49%) [Class=0] were divorced and 86 (51%) [Class=1] were married couples.
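
The quoted percentages can be checked with a few lines of arithmetic. This is only a sanity check on the numbers stated above, not a computation on the data file.

```python
# Class counts as quoted in the description above.
divorced, married = 84, 86
total = divorced + married

pct_divorced = round(100 * divorced / total)
pct_married = round(100 * married / total)
print(total, pct_divorced, pct_married)
```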

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd

In [None]:
data = pd.read_csv('../input/divorce-prediction/divorce_data.csv', delimiter=';')
reference = pd.read_csv('../input/divorce-prediction/reference.tsv', delimiter='|')

In [None]:
data

In [None]:
data_features = data.drop('Divorce', axis=1)
data_features.rename(columns=lambda x: x.replace('Q',""), inplace=True)
data_features.columns = [int(i) for i in data_features.columns]

In [None]:
# Row labels of each class in the 'Divorce' column
positive = list(data.index[data['Divorce'] == 0])
negative = list(data.index[data['Divorce'] != 0])

In [None]:
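# Dropping one class's rows keeps only the rows of the other class
data_features_positive = data_features.drop(positive, axis=0)  # rows with Divorce != 0
data_features_negative = data_features.drop(negative, axis=0)  # rows with Divorce == 0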
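
The same per-class split can be written with boolean masks, which avoids the intermediate index lists. A minimal sketch on a toy frame; `toy`, `rows_class_0` and `rows_class_1` are hypothetical names, and the values are made up.

```python
import pandas as pd

# Toy stand-in for the real data: two questions plus the class column.
toy = pd.DataFrame({"Q1": [0, 4, 3, 1], "Q2": [1, 3, 4, 0], "Divorce": [0, 1, 1, 0]})
toy_features = toy.drop("Divorce", axis=1)

# A boolean mask selects the rows of one class directly.
rows_class_1 = toy_features[toy["Divorce"] == 1]
rows_class_0 = toy_features[toy["Divorce"] == 0]

print(rows_class_1.index.tolist())
print(rows_class_0.index.tolist())
```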

# Graphics

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data_features_all = pd.DataFrame(columns=[], index=[])

j = 1
for i in range(1,55,1):
    data_features_all[j] = data_features[i]
    j+=1
    data_features_all[j] = data_features_positive[i]
    j+=1
    data_features_all[j] = data_features_negative[i]
    j+=1

In [None]:
fig, axes = plt.subplots(nrows = round(len(data_features_all.columns) / 3), ncols = 3, figsize=(12,160))
for i, ax in enumerate(fig.axes):
    if i < len(data_features_all.columns):
        # Every question fills one row of three plots; title the middle one.
        if i % 3 == 1:
            d = i // 3
            ax.set_title("%s"%(d+1) + ". " + reference['description'][d])

        sns.countplot(x=data_features_all.columns[i], alpha=0.5, data=data_features_all, ax=ax)
        ax.set_ylabel('')    
        ax.set_xlabel('')

fig.tight_layout()

# Learning

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_training, X_test, y_training, y_test = train_test_split(data_features, data['Divorce'], test_size = 0.3, random_state = 1)

In [None]:
print('TRAIN: ', y_training.value_counts())
print('\nTEST: ', y_test.value_counts())

In [None]:
print('train: ', X_training.shape[0])
print('test: ', X_test.shape[0])
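
With a sample this small, the class counts in the two parts can drift apart by chance. One option is to pass `stratify` to `train_test_split` so both parts keep roughly the original class ratio. The sketch below uses synthetic data of the same shape as an assumption (170 rows, 54 features coded 0-4); it is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the questionnaire data (assumption, not the real file).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(170, 54))
y_demo = np.array([0] * 84 + [1] * 86)

# stratify=y_demo keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)
print(np.bincount(y_tr), np.bincount(y_te))
```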

## Simple learning

In [None]:
!pip install xgboost
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import CategoricalNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

In [None]:
for classifier, pl in zip((LogisticRegression,
                           RandomForestClassifier,
                           SGDClassifier,
                           DecisionTreeClassifier,
                           CategoricalNB,
                           KNeighborsClassifier,
                           XGBClassifier,
                           AdaBoostClassifier,
                           SVC),
                          ('Logistic regression',
                           'Random forest', 
                           'Stochastic Gradient Descent',
                           'Decision Tree',
                           'Categorical Naive Bayes',
                           'K-nearest neighbor Classifier',
                           'XGBClassifier',
                           'AdaBoostClassifier',
                           'Support Vector Machines')):

    pipe = Pipeline([('clf', classifier())])    
    pipe.fit(X_training, y_training)
    print(pl)
    print("Training sample: ",pipe.score(X_training, y_training))
    print("Testing sample:  ",pipe.score(X_test, y_test),'\n')
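
A single train/test split gives one accuracy number per model, which can be noisy on a small dataset. As a hedged alternative, k-fold cross-validation averages the score over several splits. The sketch below runs on synthetic data (an assumption) and compares only two of the classifiers to stay short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: class 1 tends to answer higher, so there is signal to learn.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 84 + [1] * 86)
X_demo = rng.integers(0, 3, size=(170, 10)) + 2 * y_demo[:, None]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
results = {}
for name, model in [("Logistic regression", LogisticRegression(max_iter=1000)),
                    ("K-nearest neighbor Classifier", KNeighborsClassifier())]:
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    results[name] = scores
    print(name, "mean:", scores.mean().round(3), "std:", scores.std().round(3))
```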

# Important features

In [None]:
graph_importance = RandomForestClassifier()
graph_importance.fit(X_training,y_training)
important_values = pd.DataFrame(graph_importance.feature_importances_, index=X_training.columns, columns=['importance'])
important_values.sort_values('importance').plot(kind='barh', figsize=(15, 15))

## Configure Hyperparameters

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [None]:
SEED = 1

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X_training, y_training, test_size = 0.3, random_state = 40)
print('TRAIN: ', y_train.value_counts())
print('\nValid: ', y_valid.value_counts())

In [None]:
print('train: ', X_train.shape[0])
print('valid: ', X_valid.shape[0])

### Logistic Regression

In [None]:
best_score = 0
counter = np.logspace(-2,2,5)

for C in counter:
    logreg = LogisticRegression(C=C, penalty='l2', solver='liblinear')
    logreg.fit(X_train, y_train)
    score = logreg.score(X_valid, y_valid)
    if (score > best_score) and (score != 1.0):
        best_score = score
        best_parameters = {'random_state':SEED,'C':C, 'penalty':'l2','solver':'liblinear'}

best_params_logreg = best_parameters
logreg = LogisticRegression(**best_parameters)
logreg.fit(X_training, y_training)
test_score = logreg.score(X_test, y_test)

print("best parameters\n{}".format(best_parameters))
print("\nbest validation score: {:.4f}".format(best_score))
print("\ntest score: {:.4f}".format(test_score))
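
The manual loop over `C` can also be expressed with `GridSearchCV`, which scores each candidate by cross-validation instead of a single validation split. This is a sketch of that alternative on synthetic data (an assumption); it does not reproduce the notebook's rule of skipping perfect scores.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption).
rng = np.random.default_rng(0)
y_demo = np.array([0] * 84 + [1] * 86)
X_demo = rng.integers(0, 3, size=(170, 10)) + 2 * y_demo[:, None]

# Same candidate values of C as in the loop above.
grid = GridSearchCV(
    LogisticRegression(solver="liblinear", random_state=1),
    param_grid={"C": np.logspace(-2, 2, 5)},
    cv=5,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 4))
```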

### Random Forest

In [None]:
forest = RandomForestClassifier(random_state=SEED)

forest_params = {
    'criterion': ['gini','entropy'],
    'n_estimators': [20,50,100,130],
    'max_features': range(1,30,1),
    'max_depth': range(1,20,1),
}

# random_state makes the sampled parameter combinations reproducible
forest_search = RandomizedSearchCV(forest, forest_params, cv = 5, random_state = SEED)
forest_search.fit(X_training, y_training)

In [None]:
print("Best parameters:\n{:}\n".format(forest_search.best_params_)) 
print("Best estimator:\n{:}\n".format(forest_search.best_estimator_))

In [None]:
best_forest = forest_search.best_estimator_
print("Best cross-validation score: {:.4f}".format(forest_search.best_score_))
print("Best score on testing data:  {:.4f}".format(best_forest.score(X_test, y_test)))

### Decision Tree

In [None]:
dtree = DecisionTreeClassifier(random_state=SEED)

dtree_params = {
    'criterion': ['gini','entropy'],
    'max_features': range(1,30,1),
    'max_depth': range(1,20,1),
}

dtree_search = GridSearchCV(dtree, dtree_params, cv = 5)
dtree_search.fit(X_training, y_training)

In [None]:
print("Best parameters:\n{:}\n".format(dtree_search.best_params_)) 

In [None]:
print("Best value on cross-validation: {:}".format(dtree_search.best_score_)) 
print("testing sample: {:.4f}".format(dtree_search.score(X_test, y_test)))

### K-nearest neighbor Classifier

In [None]:
best_score = 0
counter_knn = range(1,10,1)

for neighbors in counter_knn:
    knn = KNeighborsClassifier(n_neighbors=neighbors)
    knn.fit(X_train, y_train)
    score = knn.score(X_valid, y_valid)
    
    if (score > best_score) and (score != 1.0):
        best_score = score
        best_parameters = {'n_neighbors':neighbors}
        # Report each new best value together with the k that produced it
        print(neighbors, best_score)
        
knn = KNeighborsClassifier(**best_parameters)
knn.fit(X_training, y_training)
test_score = knn.score(X_test, y_test)

print("best parameters\n{}".format(best_parameters))
print("\nbest validation score: {:.4f}".format(best_score))
print("\ntest score: {:.4f}".format(test_score))

### Support Vector Machines

In [None]:
best_score = 0

for C in np.logspace(-3,3,30):
    for gamma in np.logspace(-2,2,6):
        for kernel in ['rbf','linear']:
            svc = SVC(gamma=gamma, kernel=kernel, C=C)
            svc.fit(X_train, y_train)
            score = svc.score(X_valid, y_valid)
            if (score > best_score):
                best_score = score
                best_parameters = {
                    'C':C,
                    'gamma':gamma,
                    'kernel':kernel
                }

                
svc = SVC(**best_parameters)
svc.fit(X_training, y_training)
test_score = svc.score(X_test, y_test)
test_score

In [None]:
print("best parameters\n{}".format(best_parameters))
print("\nbest validation score: {:.4f}".format(best_score))
print("\ntest score: {:.4f}".format(test_score))

# Feature selection

In [None]:
from sklearn.metrics import roc_auc_score

logr = LogisticRegression(**best_params_logreg)
rfor = RandomForestClassifier(**forest_search.best_params_)

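def get_score(model, X, y, Xt, yt):
    """Fit model on (X, y) and return the ROC AUC on (Xt, yt)."""
    model.fit(X, y)
    # Probability of class 1, which roc_auc_score uses to rank the samples
    y_pred = model.predict_proba(Xt)[:, 1]
    score = roc_auc_score(yt, y_pred)
    return score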
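
`get_score` passes class probabilities rather than hard labels to `roc_auc_score`, because ROC AUC measures how well the scores rank the samples. The toy example below, with made-up values, shows the ranking information that is lost when probabilities are thresholded first.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
proba = [0.1, 0.4, 0.45, 0.8]  # hypothetical predicted probabilities of class 1
labels = [0, 0, 0, 1]          # the same predictions thresholded at 0.5

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, labels)
print(auc_from_proba, auc_from_labels)
```

Both positives are ranked above both negatives by the probabilities, while the hard labels tie one positive with the negatives, so the second value is lower.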

## Removing features with low variance

In [None]:
from sklearn.feature_selection import VarianceThreshold

In [None]:
low_div = VarianceThreshold(threshold=0.9)
X_trlow = low_div.fit_transform(X_training)
X_trlow = pd.DataFrame(X_trlow, columns = list(X_training.columns[low_div.get_support()]))

In [None]:
X_trlow.shape

In [None]:
X_training.shape

In [None]:
X_telow = low_div.transform(X_test)
X_telow = pd.DataFrame(X_telow, columns = list(X_test.columns[low_div.get_support()]))

In [None]:
get_score(logr, X_trlow, y_training, X_telow, y_test)

In [None]:
get_score(rfor,  X_trlow, y_training, X_telow, y_test)
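
To see what `VarianceThreshold(threshold=0.9)` does, the sketch below applies it to a tiny made-up answer matrix. Only columns whose variance is above the threshold are kept; the values here are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical answers of six respondents to three questions.
X_demo = np.array([
    [0, 2, 0],
    [4, 2, 1],
    [0, 2, 0],
    [4, 3, 1],
    [0, 2, 0],
    [4, 2, 1],
])
print(X_demo.var(axis=0).round(3))  # per-column variance

selector = VarianceThreshold(threshold=0.9)
X_kept = selector.fit_transform(X_demo)
print(selector.get_support())
```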

## Univariate feature selection

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

In [None]:
# k must be passed by keyword in recent scikit-learn versions
univar_div = SelectKBest(mutual_info_classif, k=30)
X_trunivar = univar_div.fit_transform(X_training, y_training)
X_trunivar = pd.DataFrame(X_trunivar, columns = list(X_training.columns[univar_div.get_support()]))

In [None]:
X_trunivar.shape

In [None]:
sns.set(font_scale = 1.5)
f, ax = plt.subplots(figsize=(15, 15))
sns.barplot(y = X_training.columns, x = univar_div.scores_, palette = 'pastel', orient = 'h');

In [None]:
X_trunivar.head()

In [None]:
X_teunivar = univar_div.transform(X_test)
X_teunivar = pd.DataFrame(X_teunivar, columns = list(X_test.columns[univar_div.get_support()]))

In [None]:
univar_logr = get_score(logr, X_trunivar, y_training, X_teunivar, y_test)
print('Logistic Regression score:', univar_logr)

In [None]:
univar_rfor = get_score(rfor, X_trunivar, y_training, X_teunivar, y_test)
print('Random Forest score:', univar_rfor)
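
Because the answers are integer codes, one option is to tell `mutual_info_classif` that the features are discrete. The sketch below does this with `functools.partial` on synthetic data; this is an assumption for illustration and differs from the cell above, which uses the default settings.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in: one column fully determined by the class, three noise columns.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 84 + [1] * 86)
informative = 4 * y_demo
noise = rng.integers(0, 5, size=(170, 3))
X_demo = np.column_stack([informative, noise])

# Treat every feature as discrete when estimating mutual information.
score_func = partial(mutual_info_classif, discrete_features=True, random_state=0)
selector = SelectKBest(score_func, k=1).fit(X_demo, y_demo)
print(selector.scores_.round(3))
print(selector.get_support())
```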

# Dimensionality reduction

## Standardization

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler() 
scaler.fit(X_training)

In [None]:
X_tr_standart = scaler.transform(X_training)
X_tr_standart = pd.DataFrame(X_tr_standart, columns = X_training.columns)
X_tr_standart.head()

In [None]:
X_te_standart = scaler.transform(X_test)
X_te_standart = pd.DataFrame(X_te_standart, columns = X_test.columns)
X_te_standart.head()
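
The scaler is fitted on the training part only and then reused for the test part. A quick check on synthetic data (an assumption) shows the effect: the transformed training columns have mean 0 and standard deviation 1, while the test columns are only close to those values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training and test features.
rng = np.random.default_rng(0)
X_tr_demo = rng.integers(0, 5, size=(100, 3)).astype(float)
X_te_demo = rng.integers(0, 5, size=(40, 3)).astype(float)

scaler_demo = StandardScaler().fit(X_tr_demo)  # statistics come from training data only
Z_tr = scaler_demo.transform(X_tr_demo)
Z_te = scaler_demo.transform(X_te_demo)

print(Z_tr.mean(axis=0).round(6), Z_tr.std(axis=0).round(6))
print(Z_te.mean(axis=0).round(3), Z_te.std(axis=0).round(3))
```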

## Principal component analysis (PCA)

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=54)
pca.fit(X_tr_standart)

In [None]:
variance = pca.explained_variance_ratio_
var = np.cumsum(np.round(variance, 3) * 100)

sns.set(font_scale = 1.5)
f, ax = plt.subplots(figsize=(15, 5))
plt.ylabel('Cumulative % Variance Explained')
plt.xlabel('# of Components')
plt.title('Analysis of PCA')
sns.lineplot(x = range(1, 55), y = var);

In [None]:
sns.set(font_scale = 1.5)
f, ax = plt.subplots(figsize=(15, 15))
plt.ylabel('Component')
plt.xlabel('Explained Variance Ratio')
plt.title('Analysis of PCA')
sns.barplot(y = list(range(1, 55)), x = pca.explained_variance_ratio_, palette = 'pastel', orient = 'h');

In [None]:
pca = PCA(n_components = 0.99, svd_solver = 'full') 
pca.fit(X_tr_standart)

In [None]:
pca.n_components_
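
Passing a fraction such as `n_components = 0.99` asks PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below reproduces that count by hand from the cumulative sum, on synthetic correlated data (an assumption), and compares it with what PCA selects.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 5 latent factors mixed into 20 correlated columns plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(120, 5))
X_demo = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(120, 20))

# Count components by hand from the cumulative explained variance.
full = PCA(svd_solver="full").fit(X_demo)
cumulative = np.cumsum(full.explained_variance_ratio_)
n_needed = int(np.searchsorted(cumulative, 0.99) + 1)

# Let PCA choose the number of components itself.
auto = PCA(n_components=0.99, svd_solver="full").fit(X_demo)
print(n_needed, auto.n_components_)
```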

In [None]:
sns.set(font_scale = 1.5)
f, ax = plt.subplots(figsize=(15, 15))
plt.ylabel('Component')
plt.xlabel('Explained Variance Ratio')
plt.title('Analysis of PCA')
# Use the fitted number of components instead of a hard-coded 33
sns.barplot(y = list(range(1, pca.n_components_ + 1)), x = pca.explained_variance_ratio_, palette = 'pastel', orient = 'h');

In [None]:
X_train_pca = pca.transform(X_tr_standart)
X_test_pca  = pca.transform(X_te_standart)

In [None]:
X_train_pca = pd.DataFrame(X_train_pca, columns = [str(i) + ' component' for i in range(1, pca.n_components_ + 1)])
X_test_pca  = pd.DataFrame(X_test_pca,  columns = [str(i) + ' component' for i in range(1, pca.n_components_ + 1)])

In [None]:
X_train_pca.head()

In [None]:
pca_logreg_score = get_score(logr, X_train_pca, y_training, X_test_pca, y_test)
print('Logistic Regression score:', pca_logreg_score)

In [None]:
pca_rforest_score = get_score(rfor, X_train_pca, y_training, X_test_pca, y_test)
print('Random Forest score:', pca_rforest_score)
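
The scaling, PCA and classifier steps can also be chained in a single scikit-learn `Pipeline`, so that every step is fitted on the training part only. This is a sketch of that alternative on synthetic data (an assumption), not the procedure used in the cells above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: class 1 tends to answer higher on every question.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 84 + [1] * 86)
X_demo = rng.integers(0, 3, size=(170, 10)) + 2 * y_demo[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# Standardize, keep 99% of the variance, then classify.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.99, svd_solver="full"),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_tr, y_tr)

auc = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
n_kept = pipe.named_steps["pca"].n_components_
print(n_kept, round(float(auc), 3))
```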