# How to Live a Happy Married Life?
Author: Md Rana Mahmud


# Table of Contents:
* [Business Understanding](#business)
* [Data Understanding](#data)
* [Data Preparation](#data-preperation)
* [Modeling](#modeling)
* [Evaluation](#evaluation)
* [Deployment](#deployment)
* [Analysis, Modeling, Visualization](#analysis)
    1. [Is there a significant difference between Happily Married and Divorced couples?](#q1)
    2. [What are the most agreed and disagreed things between the couples?](#q2)
    3. [What do couples need to focus on most to prevent divorce?](#q3)
    4. [How accurately can we predict a future divorce?](#q4)

In [1]:
# load the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re 

In [2]:
from itertools import islice
from textwrap import TextWrapper
# islice previews dictionary items; TextWrapper wraps long plot titles

In [3]:
from scipy.stats import chi2_contingency
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import precision_score,accuracy_score

In [4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV,LogisticRegression
from sklearn.model_selection import cross_val_score

In [5]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

In [6]:
from tpot import TPOTClassifier

In [7]:
from sklearn.feature_selection import RFE

In [None]:
from xgboost import XGBClassifier

In [None]:
from sklearn.svm import LinearSVC

In [None]:
%matplotlib inline

# Business Understanding <a class="anchor" id="business"></a>

Problem Overview

Yöntem and İlhan (2017, 2018) developed the Divorce Predictors Scale based on Gottman couples therapy (Gottman, 2014; Gottman and Gottman, 2012), which explains the causes of divorce on the basis of empirical research, and used the scale to predict divorce in their research paper.
They collected the data from 7 regions of Turkey. Of the 170 couples in the data, 84 (49.41%) were divorced and 86 (50.59%) were married.
In this analysis, we'll use the data to gain insight into the differences between happily married and divorced couples, guided by the questions below.

In this project I was interested in using the Divorce Predictors data set to answer the following questions:

1. Is there a significant difference between happily married and divorced couples?
2. What are the most agreed and disagreed things between the couples?
3. What do couples need to focus on most to prevent divorce?
4. How accurately can we predict a future divorce?

# Data

## Data Understanding <a class="anchor" id="data"></a>

In [None]:
# load the data
df = pd.read_excel("divorce.xlsx")
# print first few rows of the data
df.head()

In [None]:
# check data types
df.dtypes

In [None]:
# check percentages of missing values
df.isnull().mean()

There are no missing values in the data and all the data are numeric and in integer format.

### Check the shape of the data

In [None]:
# print shape of the data
print(df.shape)

In [None]:
# summary statistics of the attributes
df.describe()

In [None]:
df.columns

The data has 55 variables and 170 observations. Variables Atr1 to Atr54 are the questions asked to the couples, and Class indicates whether they are divorced or not.

## Data Preparation <a class="anchor" id="data-preperation"></a>

In [None]:
# first we'll store the original questions as one numbered string
# (a space before each mid-sentence backslash keeps the joined words apart)
questions = "1. If one of us apologizes when our discussion deteriorates, \
the discussion ends. 2. I know we can ignore our differences, even if things \
get hard sometimes. 3. When we need it, we can take our discussions with my \
spouse from the beginning and correct it. 4. When I discuss with my spouse, \
to contact him will eventually work.\
5. The time I spent with my wife is special for us.\
6. We don\'t have time at home as partners.\
7. We are like two strangers who share the same environment at home rather \
than family. 8. I enjoy our holidays with my wife.\
9. I enjoy traveling with my wife.\
10. Most of our goals are common to my spouse.\
11. I think that one day in the future, when I look back, \
I see that my spouse and I have been in harmony with each other.\
12. My spouse and I have similar values in terms of personal freedom.\
13. My spouse and I have similar sense of entertainment.\
14. Most of our goals for people (children, friends, etc.) are the same.\
15. Our dreams with my spouse are similar and harmonious.\
16. We\'re compatible with my spouse about what love should be.\
17. We share the same views about being happy in our life with my spouse\
18. My spouse and I have similar ideas about how marriage should be\
19. My spouse and I have similar ideas about how roles should be in marriage\
20. My spouse and I have similar values in trust.\
21. I know exactly what my wife likes.\
22. I know how my spouse wants to be taken care of when she/he sick.\
23. I know my spouse\'s favorite food.\
24. I can tell you what kind of stress my spouse is facing in her/his life.\
25. I have knowledge of my spouse\'s inner world.\
26. I know my spouse\'s basic anxieties.\
27. I know what my spouse\'s current sources of stress are.\
28. I know my spouse\'s hopes and wishes.\
29. I know my spouse very well.\
30. I know my spouse\'s friends and their social relationships.\
31. I feel aggressive when I argue with my spouse.\
32. When discussing with my spouse, I usually use expressions such as \
‘you always’ or ‘you never’ .\
33. I can use negative statements about my spouse\'s personality \
during our discussions.\
34. I can use offensive expressions during our discussions.\
35. I can insult my spouse during our discussions.\
36. I can be humiliating when we discussions.\
37. My discussion with my spouse is not calm.\
38. I hate my spouse\'s way of open a subject.\
39. Our discussions often occur suddenly.\
40. We\'re just starting a discussion before I know what\'s \
going on.\
41. When I talk to my spouse about something, my calm suddenly breaks.\
42. When I argue with my spouse, I only go out and I don\'t say a word.\
43. I mostly stay silent to calm the environment a little bit.\
44. Sometimes I think it\'s good for me to leave home for a while.\
45. I\'d rather stay silent than discuss with my spouse.\
46. Even if I\'m right in the discussion, I stay silent to hurt my spouse.\
47. When I discuss with my spouse, I stay silent because I am afraid of \
not being able to control my anger.\
48. I feel right in our discussions.\
49. I have nothing to do with what I\'ve been accused of.\
50. I\'m not actually the one who\'s guilty about what I'm accused of.\
51. I\'m not the one who\'s wrong about problems at home.\
52. I wouldn't hesitate to tell my spouse about her/his inadequacy.\
53. When I discuss, I remind my spouse of her/his inadequacy.\
54. I\'m not afraid to tell my spouse about her/his incompetence."

In [None]:
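# split the string on the question numbers ("1.", "2.", ..., "54.");
# the dot is escaped so that it only matches a literal period
questions = re.split(r"[0-9]{1,2}\.", questions)
questions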
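
The split relies on every question number being followed by a literal period. As a minimal sketch, using a short hypothetical string rather than the real questionnaire, the comparison below shows how an unescaped `.` (which matches any character) can also split on digits inside a sentence, while an escaped `\.` splits only on the numbering:

```python
import re

# Hypothetical sample text, not part of the divorce questionnaire
sample = "1. We have 2 children. 2. We met 10 years ago."

# Unescaped dot: "2 " and "10 " inside the sentences also match
loose = [part.strip() for part in re.split(r"[0-9]{1,2}.", sample) if part.strip()]

# Escaped dot: only "1." and "2." match
strict = [part.strip() for part in re.split(r"[0-9]{1,2}\.", sample) if part.strip()]

print(loose)
print(strict)
```

The real question string contains no digits other than the numbering, so both patterns may give the same result there; the escaped form is simply the safer choice.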

In [None]:
# trim whitespace and drop empty strings
questions = [question.strip() for question in questions if question.strip()]
print(questions[1:5])
# check the number of questions
print(len(questions))

In [None]:
# make titles text wrap
tw = TextWrapper()
# set width to break lines
tw.width = 40
questions = ["\n".join(tw.wrap(text)) for text in questions ]

In [None]:
# make column names lowercase
df.columns = df.columns.str.lower()

In [None]:
# make a dictionary of questions and variable names
quetions_list = dict(zip(df.columns, questions))

In [None]:
# print first few questions
list(islice(quetions_list.items(),3))

In [None]:
# print married and divorced couple frequency
df["class"].value_counts()

In [None]:
# assign label to class variable
df['class'] = df["class"].map({0:"Married",  1:"Divorced"})

Now we'll recode each variable from its 0 to 4 scale to a -2 to 2 scale, so that the scale is bipolar with 0 as the neutral midpoint.

In [None]:
# recode all the feature values
df_new = df.drop(['class'],axis = 1).apply(lambda x: x.map({0:-2,1:-1,2:0,3:1,4:2}))

In [None]:
# add class
df_new['class'] = df['class']

In [None]:
percent = df["class"].value_counts()/df["class"].count()*100
percent = np.round(percent,2)
print(percent)
# make barplot of married vs divorced
percent.plot(kind='bar')
plt.title("Couples' Marital Status")
plt.xticks(rotation = 0)
plt.show()

From the above we can see that the data set is balanced: 50.59% of the couples are married and 49.41% are divorced.

# Analysis, Modeling, Visualization <a class="anchor" id="analysis"></a>



## Question 1 <a class="anchor" id="q1"></a>
1. Is there a significant difference between Happily Married and Divorced couples?

In [None]:
# calculate average value of the questions for both categories of couples
df_mean = df_new.groupby(['class']).mean()


In [None]:
# print the average score
df_mean

In [None]:
# plot the average values to see the differences between questions
fig, axs = plt.subplots(6, 3, figsize=[16, 30])

# make plots of all the variables
for num in range(1,19):
    plt.subplot(6,3,num)
    plt.subplots_adjust(hspace = 0.5)
    axs = df_mean[df.columns[num-1]].plot(kind='bar')
    
    plt.title(quetions_list[df.columns[num-1]])
    plt.xticks(rotation = 0)
    plt.xlabel("")
#     plt.ylabel("Disagree\t\tNeutral\t\t\t  Agree".expandtabs())
    plt.ylim(-2,2)
    plt.axhline(y=0, color='gray', linestyle='-')
plt.show()

In [None]:
fig, axs = plt.subplots(6, 3,figsize=[16, 30])

for num in range(18,36):
    plot_location = num%18+1
    plt.subplot(6,3,plot_location)
    plt.subplots_adjust(hspace = 0.5)

    axs = df_mean[df.columns[num]].plot(kind='bar')
    plt.title(quetions_list[df.columns[num]])
    plt.xticks(rotation = 0)
    plt.xlabel("")
#     plt.ylabel("Disagree\t\tNeutral\t\t\t  Agree".expandtabs())
    plt.ylim(-2,2)
    plt.axhline(y=0, color='gray', linestyle='-')
plt.show()

In [None]:
fig, axs = plt.subplots(6, 3,figsize=[16, 30])

for num in range(36,54):
    plot_location = num%18+1
    plt.subplot(6,3,plot_location)
    plt.subplots_adjust(hspace = 0.5)

    axs = df_mean[df.columns[num]].plot(kind='bar')
    plt.xticks(rotation = 0)
    plt.title(quetions_list[df.columns[num]])
    plt.xlabel("")
#     plt.ylabel("Disagree\t\tNeutral\t\t\t  Agree".expandtabs())
    plt.ylim(-2,2)
    plt.axhline(y=0, color='gray', linestyle='-')
plt.show()

Of the 54 questions, married and divorced couples gave similar average responses on only two: not having time at home as partners, and feeling right in their discussions.
On the other 52 questions, their average responses pointed in opposite directions.

In [None]:
plt.rcParams['figure.figsize'] = 12,4
df_mean.T.plot()
plt.title("Difference in Opinion between Divorced and Happily Married Couples")
plt.xlabel("Attribute");

###  Tests between two groups

Because both the features and the class variable are categorical, we'll run a chi-square test of independence for each question to check whether the responses differ between the two groups.

For each test, the hypotheses are:

$H_{0,i}$ (null hypothesis): There is no difference between divorced and happily married couples on the $i$th question.

$H_{1,i}$ (alternative hypothesis): There is a significant difference between divorced and happily married couples on the $i$th question.
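
To make the test concrete, here is a minimal worked sketch of a chi-square test of independence on a single 5x2 table, written with the standard library only. The counts are hypothetical and are not taken from the divorce data; the helper `chi2_sf_even_dof` is an illustrative name and uses a closed form that is valid only for an even number of degrees of freedom.

```python
import math

# Hypothetical counts (NOT from the divorce data): rows are the five
# recoded answer levels (-2 .. 2), columns are (Divorced, Married).
observed = [[2, 30], [3, 25], [10, 15], [30, 10], [39, 6]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts if answer level and marital status were independent
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(observed) - 1) * (len(col_totals) - 1)


def chi2_sf_even_dof(x, dof):
    """Upper-tail probability of a chi-square variable (even dof only)."""
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(dof // 2))


p_value = chi2_sf_even_dof(chi2_stat, dof)
reject_null = p_value < 0.05
print(dof, round(chi2_stat, 2), reject_null)
```

For a table of this shape, `chi2_contingency` performs the same kind of calculation and additionally returns the expected counts, which are worth inspecting because the chi-square approximation becomes less reliable when many expected counts are small.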

In [None]:
chi2_results =[]
for column in df_new.drop('class', axis = 1).columns:
    df_cross_tab = pd.crosstab(df_new[column],df_new['class'])
    chi2, p, dof, ex = chi2_contingency(df_cross_tab)
    chi2_results.append(p<0.05)
#     print(p<0.05)
chi2_results

At the 5% significance level, every p-value is below 0.05, so we reject the null hypothesis for each question and conclude that married and divorced couples differ significantly on all the questions.

**Conclusion:** From the above analysis we can say that there exists a significant difference between happily married and divorced couples.

## Question 2 <a class="anchor" id="q2"></a>
2. What are the most agreed and disagreed things between the couples?

In [None]:
# calculate average value between two groups
df_mean_orig = df_new.groupby(['class']).mean()

In [None]:
# assign column names
df_mean_orig.columns = list(quetions_list.values())

In [None]:
# transpose the data
df_mean_orig = df_mean_orig.T

In [None]:
# calculate difference score in opinion
df_mean_orig['difference'] =  df_mean_orig['Divorced'] - df_mean_orig['Married'] 

In [None]:
df_mean_orig.sort_values(['difference'],ascending = False)

In [None]:
plt.rcParams['figure.figsize'] = 6,4
# five questions with the largest (Divorced - Married) difference
different_df = df_mean_orig.sort_values(['difference'],ascending=False)[0:5]

different_df[['Married','Divorced']].plot(kind='bar')
# the index of df_mean_orig already holds the question text
different_questions = list(different_df.index)
plt.axhline(y=0,color='red')
plt.title("Top 5 Different Opinion Topics");

In [None]:
plt.rcParams['figure.figsize'] = 6,4
# five questions with the smallest (Divorced - Married) difference
similar_df = df_mean_orig.sort_values(['difference'],ascending=False)[-5:]

similar_df[['Married','Divorced']].plot(kind='bar')
# the index of df_mean_orig already holds the question text
similar_questions = list(similar_df.index)
plt.axhline(y=0,color='red')
plt.title("Top 5 Similar Opinion Topics");

**Conclusion:**
On average, the topics on which happily married and divorced couples differed most were:

1. Starting a discussion before knowing what's going on.
2. Being humiliating during discussions.
3. Insulting their partner during discussions.
4. Discussions occurring suddenly.
5. Losing their calm when talking with their spouse.

On all five of these questions, divorced couples on average agreed and happily married couples disagreed.

On average, the topics on which the two groups were most similar were:

1. Not having time at home as partners.
2. Living like strangers who share the same home environment rather than like a family.
3. Staying silent to hurt their spouse, even when right.
4. Feeling right in their discussions.
5. Mostly staying silent to calm the environment.

Among these, divorced couples on average more often stayed silent, felt right in their discussions, and stayed silent to hurt their spouse. Happily married couples, on the other hand, most strongly disagreed that they live like strangers at home or lack time as partners.
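
One caveat about the "most similar" ranking: it takes the smallest signed difference (Divorced minus Married), which matches the smallest gap only when no difference is negative. A minimal sketch with hypothetical mean scores (the item names and values are made up for illustration) shows how ranking by absolute difference can give a different answer:

```python
# Hypothetical mean scores on the -2..2 scale (NOT from the divorce data)
means = {
    "item_a": {"Divorced": 1.5, "Married": -1.5},
    "item_b": {"Divorced": 0.2, "Married": 0.1},
    "item_c": {"Divorced": -1.0, "Married": 1.0},
}

# Signed difference, as used in the ranking above
signed = {item: m["Divorced"] - m["Married"] for item, m in means.items()}

# Smallest signed difference first: picks the most negative gap
by_signed = sorted(signed, key=signed.get)

# Smallest absolute difference first: picks the truly closest answers
by_absolute = sorted(signed, key=lambda item: abs(signed[item]))

print(by_signed[0], by_absolute[0])
```

If every difference in the real data is positive, the two rankings agree; checking the sign of the `difference` column is a quick way to confirm that.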

## Question 3 <a class="anchor" id="q3"></a>
3. What do couples need to focus on most in order to prevent divorce?

Here we'll use recursive feature elimination (RFE) to select the features that best predict divorce. The selected features point to the topics couples most need to focus on in order to prevent divorce.
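
As a rough illustration of how recursive feature elimination behaves, the sketch below runs `RFE` on synthetic data. The data set, the choice of `LogisticRegression` as the estimator, and the numbers of features are assumptions made for this example only and are not the settings used in this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): 8 features, 3 informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3,
    n_redundant=0, random_state=0,
)

# RFE fits the estimator, drops the weakest feature (step=1), and repeats
# until only n_features_to_select features remain
estimator = LogisticRegression(max_iter=1000)
selector = RFE(estimator, n_features_to_select=3, step=1).fit(X_demo, y_demo)

# support_ marks the kept features; ranking_ is 1 for kept features and
# increases for features that were eliminated earlier
selected = [i for i, keep in enumerate(selector.support_) if keep]
print(selected)
print(selector.ranking_)
```

Because features are removed one at a time, the ranking also gives an ordering of the discarded features, which can help when deciding how many to keep.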

In [None]:
# separate the data into X and y
X =  df_new.drop("class",axis = 1)
y = df["class"].map({"Married":0, "Divorced": 1})


In [None]:
# instantiate model
estimator = LinearSVC(random_state=1111)
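# instantiate RFE to keep the 10 best features
# (recent scikit-learn versions require n_features_to_select as a keyword)
selector = RFE(estimator, n_features_to_select=10, step=1)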
# fit RFE
selector = selector.fit(X, y)
# print the results
print(selector.support_)
print(selector.ranking_)

In [None]:
# select the data for top features; support_ aligns with the columns of X
# (df_new also contains 'class'), and .copy() allows adding columns safely
df_top = X.loc[:,selector.support_].copy()
# top questions
top_questions =  [value for key,value in quetions_list.items() if key in df_top.columns]

# assign questions to column names
df_top.columns = top_questions
# add the class variable
df_top['class'] = df['class']

In [None]:
df_top.head()

In [None]:
top_questions

In [None]:
# calculate mean 
top_mean = df_top.groupby(['class']).mean()
# transpose data for plotting
top_mean = top_mean.T

In [None]:
top_mean.plot(kind='bar')
plt.axhline(y=0,color="gray")
plt.axhline(y=1,color="gray")
plt.axhline(y=-1,color="gray")
plt.title("Top Questions to Focus On")
plt.ylim(-2,2)
plt.legend(loc='upper left');


The plot and the selected questions are consistent with the earlier findings.
Divorced and happily married couples answered in completely opposite directions on having a similar sense of entertainment, having similar ideas about how roles should be in marriage, having similar values in trust, knowing their spouse's hopes and wishes, knowing their spouse's friends and social relationships, being humiliating during discussions, and starting a discussion before knowing what's going on.
Divorced couples mostly stayed silent to calm the environment and felt right in their discussions, while happily married couples stayed almost neutral on both.
On time at home as partners, both groups indicated that they have time together, but divorced couples reported having less.

## Question 4 <a class="anchor" id="q4"></a>
4. How accurately can we predict a future divorce?

### Modeling <a class="anchor" id="modeling"></a>

In [None]:

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y,shuffle=True, \
                                                    test_size=0.30,\
                                                    random_state = 1111 )


In [None]:
# function to calculate accuracy
def calculate_accuracy(model, X_train, y_train, X_test, y_test):
    """ Calculates model training and test accuracy

    Keyword arguments:
    model -- fitted sklearn model
    X_train -- Training features
    y_train -- Training labels
    X_test -- Test features
    y_test -- Test labels
    Return:
    A tuple of (train_accuracy, test_accuracy)
    """
    predicted = model.predict(X_train)
    train_accuracy = accuracy_score(y_train, predicted)
#     train_accuracy = round(train_accuracy, 2)
    predicted = model.predict(X_test)
    test_accuracy = accuracy_score(y_test, predicted)
#     test_accuracy = round(test_accuracy,2)
    return (train_accuracy, test_accuracy)

In [None]:
# variable to store training,test accuracy
train_list = []
test_list = []

#### Logistic Regression

In [None]:
# logistic regression model
clf = LogisticRegressionCV(cv=5, random_state=0).fit(X_train, y_train)
train_accuracy, test_accuracy = calculate_accuracy(clf, X_train, y_train, X_test, y_test)
print(train_accuracy)
print(test_accuracy)
# append the accuracy
train_list.append(train_accuracy)
test_list.append(test_accuracy)

#### Random Forest Classifier

In [None]:

# build randomforest model
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 2, stop = 50, num = 20)]
# Number of features to consider at every split
# ('auto' was an alias for 'sqrt' in classifiers and was removed in scikit-learn 1.3)
max_features = ['sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 20, num = 10)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)

# do k fold cross validation

rf = RandomForestClassifier(random_state=42)
rf_random = RandomizedSearchCV(estimator = rf,return_train_score=True,\
                               refit=True, param_distributions = \
                               random_grid, n_iter = 100, cv =5, \
                               verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)

train_accuracy, test_accuracy = calculate_accuracy(rf_random, X_train,\
                                                   y_train, X_test, y_test)
print(train_accuracy)
print(test_accuracy)
# append the accuracy
train_list.append(train_accuracy)
test_list.append(test_accuracy)

#### K Nearest Neighbors

In [None]:

# create a new knn model
knn = KNeighborsClassifier()
# create parameter grid (np.arange(1, 20) covers k = 1, ..., 19)
param_grid = {"n_neighbors": np.arange(1, 20)}
# do grid search
knn_cv = GridSearchCV(knn, param_grid, cv=5)

# fit model
knn_cv.fit(X_train, y_train)

In [None]:
knn_cv.best_score_

In [None]:
knn_cv.best_params_

In [None]:
knn_cv.best_estimator_

In [None]:
train_accuracy, test_accuracy = calculate_accuracy(knn_cv, X_train, y_train, X_test, y_test)
print(train_accuracy)
print(test_accuracy)
# append the accuracy
train_list.append(train_accuracy)
test_list.append(test_accuracy)

#### Automated Classifier using TPOT

In [None]:
# Create the tpot classifier
tpot_clf = TPOTClassifier(generations=20, population_size=10,
                          offspring_size=10, scoring="accuracy",
                          verbosity=2, random_state=42, cv=5)

# Fit the classifier to the training data
tpot_clf.fit(X_train, y_train)

train_accuracy, test_accuracy = calculate_accuracy(tpot_clf,\
                    X_train, y_train, X_test, y_test)
print(train_accuracy)
print(test_accuracy)
# append the accuracy
train_list.append(train_accuracy)
test_list.append(test_accuracy)

# Evaluation <a class="anchor" id="evaluation"></a>

In [None]:
# build a dataframe of the model accuracies
results = pd.DataFrame({"Machine Learning Algorithm":\
    ["Logistic Regression","Random Forest Classifier",\
     "K Nearest Classifier", "Automated Classifier TPOT"],\
                        "Training":train_list,"Test":test_list})

In [None]:
print(results)
# set index
results.set_index("Machine Learning Algorithm",inplace=True)

In [None]:
results.plot(style='*-')
plt.xticks(rotation=20);

In [None]:
# plot the model results
results.plot(kind='bar')
plt.show();

Using 70% of the data for training and 30% for testing, we fitted four models to find the best one.
Logistic regression was fitted with 5-fold cross-validation.
The random forest was tuned with a randomized search and 5-fold cross-validation to find good model parameters.
The k-nearest neighbors model was tuned with 5-fold cross-validation over k from 1 to 19 to select the best value of k.
Among the four fitted models, the random forest classifier and k-nearest neighbors show overfitting, while the automated TPOT classifier shows underfitting. The logistic regression model performs well and shows neither overfitting nor underfitting. It is also simple and explainable, so I would choose logistic regression as the best classifier for predicting divorce.
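
When comparing test accuracies, it helps to keep the size of the test set in mind. The sketch below works through the arithmetic for a 30% hold-out of 170 couples and, as one hedged way to express the uncertainty, computes a 95% Wilson score interval. The count of 50 correct predictions is a hypothetical figure chosen to be close to the accuracy reported in this notebook, and `wilson_interval` is an illustrative helper, not part of scikit-learn.

```python
import math

n_total = 170        # couples in the data set
test_percent = 30    # hold-out share used in train_test_split

# Size of the test set (rounded up to a whole couple)
n_test = math.ceil(n_total * test_percent / 100)

# One misclassified couple changes the test accuracy by this much
accuracy_step = 1 / n_test


def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = correct / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# Hypothetical: 50 of the test couples classified correctly
low, high = wilson_interval(50, n_test)
print(n_test, round(accuracy_step, 4), round(low, 3), round(high, 3))
```

With a test set this small, differences of one or two couples between models translate into accuracy gaps of a few percentage points, so small gaps between models should be read with caution.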

# Deployment <a class="anchor" id="deployment"></a>

## Summary of Analysis

Our conclusions are:

1. There is a significant difference between married and divorced couples on every topic of the scale.
2. The five topics on which divorced and married couples differed most were starting a discussion before knowing what's going on, being humiliating during discussions, insulting during discussions, discussions occurring suddenly, and losing calm during discussions.

    On the other hand, the most similar topics included having time at home as partners and acting like strangers who share the same environment at home. On average, married couples indicated that they have more time at home and act less like strangers than divorced couples do.
    On two other topics, married couples were neutral while divorced couples agreed: mostly staying silent, and feeling right in their discussions.

    Based on this, we can say that married couples are more careful in their discussions, act less like strangers at home, and give more time to their partners.
3. On the top 10 deciding questions, divorced and married couples gave opposite answers.
4. Using 70% of the data for training and 30% for testing, the logistic regression model performed best and predicted divorce with about 98% accuracy on the test set.

# Reference

Yöntem, M. K. and İlhan, T. (2018). Boşanma Göstergeleri Ölçeğinin Geliştirilmesi [Development of the Divorce Predictors Scale]. Sosyal Politika Çalışmaları Dergisi, 41, 339-358.

Gottman, J. M. and Gottman, J. S. (2012). Çiftler Arasında Köprüyü İnşa Etmek: Gottman Çift Terapisi Eğitimi 1. Düzey Kitabı [Level 1 Clinical Training. Gottman Method Couples Therapy. Bridging the Couple Chasm]. İstanbul: Psikoloji İstanbul.

Gottman, J. and Silver, N. (2014). Aşk Nasıl Sürdürülür. Aşk Laboratuarından Sırlar (trans. Gül, S. S.) [What Makes Love Last? How to Build Trust and Avoid Betrayal, 2012]. İstanbul: Varlık Yayınları.