## Married at First Sight (Analysis)


The Lifetime reality television show and social experiment Married at First Sight features men and women who sign up to marry a complete stranger. Experts pair couples based on tests and interviews. After the wedding, couples have only a few short weeks together to decide whether to stay married or get a divorce. There have been 10 full seasons so far, which provides interesting data for examining which factors may or may not play a role in their decisions at the end of eight weeks, as well as in longer-term outcomes since the show aired.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Now that we have the libraries in place, let us import the data and see how it looks!

In [None]:
Data = pd.read_csv('../input/married-at-first-sight/mafs.csv')
Data.head()

This looks good. Now let us see the dimensions and our variables in the data set!


In [None]:
Data.shape #Dimensions

In [None]:
Data.info() #Information about Data

So, our data has 68 observations and 17 columns. The names of all the variables are listed in the info output. Seems good so far.


Let us make a copy of the data and work on it rather than using the original data.

In [None]:
my_data = Data.copy()

Now let us see some descriptive statistics about our dataset to get an idea of the values we're dealing with.


In [None]:
print(my_data.describe(include=['O']))
print(my_data.describe()) #Descriptive Statistics

We don't see any missing values, so that seems fine. A key point to note is that 25 of the couples were divorced after getting married, which is quite high! Other descriptive interpretations can be made from the output above.
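`describe()` only hints at missing values through its `count` row, so an explicit check can be reassuring. The sketch below uses a small made-up frame (the column values are invented for illustration, not taken from mafs.csv) to show the `isnull().sum()` idiom; on the real data the same call would presumably be applied to `my_data`.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are invented, not from mafs.csv
toy = pd.DataFrame({
    "Age": [28, 31, np.nan, 35],
    "Gender": ["F", "M", "F", None],
    "Status": ["Married", "Divorced", "Divorced", "Married"],
})

# Count missing entries per column; all zeros would mean no missing values
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# A single boolean answer for the whole frame
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

If every count is zero, no imputation or row dropping is needed before modelling.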

We also find that DrPepperSchwartz was present in every row. A constant column carries no information for our analysis, so let us drop that variable.
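A constant column can also be spotted programmatically rather than by reading the `describe()` output. This is a minimal sketch on an invented frame; only the column name `DrPepperSchwartz` comes from the notebook, and the values are assumptions.

```python
import pandas as pd

# Toy frame for illustration only; values are invented
toy = pd.DataFrame({
    "Season": [1, 1, 2, 2],
    "DrPepperSchwartz": [1, 1, 1, 1],  # same value in every row
    "Age": [28, 31, 26, 35],
})

# A column with a single unique value is constant and carries no signal
constant_columns = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_columns)

# Drop the constant columns and keep the rest
reduced = toy.drop(columns=constant_columns)
print(list(reduced.columns))
```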

In [None]:
my_data.drop(['DrPepperSchwartz'], axis=1,inplace = True)
my_data.columns

Done! Let us find the important variables that might be useful for us!

In [None]:
my_data.head()

To predict whether a couple will stay married, features like Name and Occupation don't matter much. So, for the sake of this problem, let us drop those columns!

In [None]:
my_data.drop(['Name','Occupation'], axis=1,inplace = True)
my_data.columns

Now let us convert the categorical variables into numerical ones!

In [None]:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
my_data['Gender'] = enc.fit_transform(my_data['Gender'])
my_data['Status'] = enc.fit_transform(my_data['Status'])
my_data['Decision'] = enc.fit_transform(my_data['Decision'])
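It can help to know which integer each label received. As a small sketch (the label strings below are assumed for illustration and may not match the exact values in mafs.csv), `LabelEncoder` sorts the labels and exposes them through `classes_`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration; the labels are assumed, not read from mafs.csv
status = pd.Series(["Married", "Divorced", "Divorced", "Married"])

enc = LabelEncoder()
encoded = enc.fit_transform(status)

# classes_ holds the sorted labels; each label's position is its integer code
mapping = {label: int(code) for code, label in enumerate(enc.classes_)}
print(mapping)
print(encoded.tolist())
```

Because the notebook reuses a single `enc` for all three columns, `enc.classes_` afterwards reflects only the last column fitted (`Decision`); inspecting it right after each `fit_transform` call shows the mapping for that column.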

Great! On reflection, the location where the show takes place is unlikely to affect how well a marriage goes. Let us drop that as well!

In [None]:
my_data.drop(['Location'], axis=1,inplace = True)
my_data.head()

Perfect. Let us see the correlations between variables to get a better idea!

In [None]:
plt.figure(figsize=(14,14))
corr_matrix = my_data.corr().round(2)
sns.heatmap(data=corr_matrix, annot=True)

We do see some strong correlations in this plot. Particularly strong correlations can be seen for the variables Couple and Season, including between the two, so we will keep them in mind when we train the model on our data.

Let X be the predictors from the data and y be the response variable, Status.

In [None]:
X = my_data.iloc[:, [3,4,6,7,8,9,10,11,12]].values
y = my_data.iloc[:,5].values

Now let us split the data into training and test sets. The test set will be 30% of the total data, which sounds quite reasonable.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
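To see what a 30% split means for a data set of this size, here is a small sketch with placeholder arrays that have the same number of rows (68) and features (9) as in this notebook. The array contents are dummy values, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 68 rows and 9 features; the contents are dummy values
X_demo = np.zeros((68, 9))
y_demo = np.array([0, 1] * 34)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# scikit-learn rounds the test size up, so 30% of 68 rows becomes 21 test rows
print(X_tr.shape, X_te.shape)
```

With only 21 test rows, every single misclassification moves accuracy by almost five percentage points, so the scores that follow are best read as rough indications. Passing `stratify=y` is one option for keeping the class balance similar in both parts.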


Perfect! Now let us try a few different methods. A good method for this task would be a Random Forest Classifier. Another would be a simple Artificial Neural Network. But first, let us try an SVM with a linear kernel to see whether it gives good results, as a check on how linearly separable the data is.

In [None]:
#Support Vector Classifier

from sklearn.svm import SVC 
classifier=SVC(kernel='linear',random_state=0)
#Fitting training data and making predictions on test data
classifier.fit(X_train,y_train)
y_pred=classifier.predict(X_test)

Great, now let us see how well our classifier does! 

In [None]:
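# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay (available since 1.0) replaces it
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(classifier, X_test, y_test)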

Not bad! But it can surely be better. We get all the 0's right but not the 1's. I'm sure we can build a better model. Let's try a Random Forest Classifier.

In [None]:
#Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=10,criterion='entropy',random_state=0)

#Fitting training data and making predictions on test data

classifier.fit(X_train,y_train)
y_pred=classifier.predict(X_test)

Let's see how our classifier performs!

In [None]:
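# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay (available since 1.0) replaces it
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(classifier, X_test, y_test)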

We did make a few false positives, but overall we do better than the SVM. Let us see the AUC score!
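As a quick reminder of what the AUC measures, here is a tiny hand-made example (the labels and scores are invented, not model output). The AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score.

```python
from sklearn.metrics import roc_auc_score

# Tiny hand-made example; labels and scores are invented for illustration
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Positive scores: 0.35, 0.8. Negative scores: 0.1, 0.4.
# Of the 4 (positive, negative) pairs, 3 are ranked correctly
# (0.35 vs 0.4 is the one that is not), so the AUC is 0.75.
auc = roc_auc_score(y_true, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the score is computed from predicted probabilities rather than from hard 0/1 predictions.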

In [None]:
probs = classifier.predict_proba(X_test)
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, probs[:,1])

An AUC of 0.84! Pretty decent, isn't it? Now let us try a deep learning model, which can be of use as well. Here I try an Artificial Neural Network for this task.

In [None]:
#Importing the basic libraries and components

import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

#Building the classifier and adding the layers

classifier = Sequential()
classifier.add(Dense(units=5, kernel_initializer='uniform', activation='relu', input_dim=9))
classifier.add(Dense(units=5, kernel_initializer='uniform', activation='relu'))
classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))

Done! Now let us compile the classifier and set the optimizer, loss function, and evaluation metric.

In [None]:
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Awesome! Now we're ready to train the network and make predictions on the test data. Hope we do get a good result!

In [None]:
#Fitting the model to the training set and making predictions on the test set
classifier.fit(X_train,y_train,batch_size=1, epochs=100)
y_pred=classifier.predict(X_test)

In [None]:
y_pred #Probabilities

Seems okay so far. Now that we have probabilities, let us convert them to 1's and 0's and see how well we did.
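The thresholding step and the layout of the resulting matrix can be illustrated on a toy example. The probabilities below are invented and shaped `(n, 1)`, which is how a network with a single sigmoid output unit returns them; the variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented probabilities shaped (n, 1), as a single sigmoid output unit returns them
probs = np.array([[0.9], [0.2], [0.6], [0.4]])
y_true = np.array([1, 0, 0, 1])

# Threshold at 0.5 and flatten to a 1-D array of 0/1 labels
labels = (probs > 0.5).astype(int).ravel()

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm_demo = confusion_matrix(y_true, labels)
print(cm_demo)
```

In this toy case each of the four cells holds one example, so the matrix shows one true negative, one false positive, one false negative, and one true positive.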

In [None]:
y_pred1 = (y_pred>0.5)
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,y_pred1)
cm

Great! We get a result similar to the SVM's. Let us see the AUC score!

In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_pred)

An AUC of 0.85 is awesome. So, in conclusion, we built three predictive models using different methodologies (a linear SVM, a Random Forest, and a neural network) to predict the status of marriages after 8 weeks for these contestants. The deep learning approach resulted in an AUC score of 0.85, while the RF Classifier gave us 0.84.