# Introduction

This project analyzes data from the online dating application OkCupid. In recent years, there has been a massive rise in the usage of dating apps to find love. Many of these apps use sophisticated data science techniques to recommend possible matches to users and to optimize the user experience. These apps give us access to a wealth of information that we've never had before about how different people experience romance.

The goal of this project is to scope, prepare, and analyze the data, and then build a machine learning model to answer a question.

## Import Python Modules


In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = [6,6]
%matplotlib inline

import warnings 
warnings.filterwarnings('ignore')

## Loading the Data

In [None]:
profiles = pd.read_csv("../input/profiles/profiles.csv",encoding = 'utf-8')

In [None]:
profiles.head()

In [None]:
profiles.info()

In [None]:
profiles.last_online.head()

In [None]:
list(profiles.columns)

## Explore the Data

In [None]:
print("Number of categories: ", profiles.sign.nunique())

In [None]:
print("Categories: ", profiles.sign.unique())

In [None]:
profiles.sign

In [None]:
profiles.sign.str.split()

In [None]:
profiles['signcleaned'] = profiles.sign.str.split().str.get(0)

In [None]:
profiles['signcleaned'].tail()

In [None]:
print("Number of categories: ", profiles.signcleaned.nunique())

In [None]:
print("Categories: ", list(profiles.signcleaned.unique()))

In [None]:
profiles.signcleaned.value_counts()
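
To make the cleaning step concrete, here is a minimal sketch on a few made-up sign strings. The sample values are illustrative assumptions about how the qualifiers look, not rows taken from the dataset. Splitting on whitespace and keeping the first token retains the sign and drops the qualifier, while missing values stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, for illustration only
sample = pd.Series([
    "gemini",
    "cancer but it doesn't matter",
    "leo and it's fun to think about",
    np.nan,
])

# Split on whitespace and keep the first token (the sign itself)
cleaned = sample.str.split().str.get(0)
print(cleaned.tolist())
```

Because the `.str` accessor propagates missing values, no separate null handling is needed at this stage.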

## Continuous Variables
#### age

The next plot shows the distribution of age in the group. It seems that most users are in their late 20s to early 30s.

In [None]:
sns.set()
sns.distplot(profiles['age'],norm_hist = True,hist_kws = {'color':'#8e00ce',
                       'linewidth':5, 'linestyle':'--', 'alpha':0.3},
            kde_kws = {'color':'#DC143C', 
                       'linewidth':2, 'linestyle':'--', 'alpha':1});

#### Height

The next plot shows the height variable. Most people appear to be between 5 feet and 6.5 feet tall.

In [None]:
sns.distplot(profiles["height"],norm_hist=True,hist_kws = {'color':'#DC143C',
                       'linewidth':5, 'linestyle':'--', 'alpha':0.4},
            kde_kws = {'color':'#8e00ce', 
                       'linewidth':2, 'linestyle':'--', 'alpha':0.9});

## Discrete Variables

#### Sex

There are more males than females in the data: roughly ~35,000 men to ~25,000 women.

In [None]:
sns.countplot(data=profiles, y="sex");

#### Body Type

The next chart shows the body type variable, and it seems that most users will describe themselves as average, fit, or athletic.

In [None]:
sns.countplot(data=profiles, y="body_type");

The next chart shows the breakdown of body type by gender, and it seems that some of the body type descriptions are highly gendered. For example, "curvy" and "full figured" are used mostly by females, while males use "a little extra" and "overweight" more often.

In [None]:
sns.countplot(data=profiles, y="body_type", hue = "sex");

#### Diet

Here is a chart of the dietary information for users. Most users eat "mostly anything", followed by "anything" and "strictly anything". Being open-minded seems to be a popular signal to potential partners.

In [None]:
sns.countplot(data=profiles, y="diet");

#### Drinks

The next plot shows that the majority of the users drink "socially", then "rarely" and "often". 

In [None]:
sns.countplot(data=profiles, y="drinks");

#### Drugs

The vast majority of users "never" use drugs, which is good to see.

In [None]:
sns.countplot(data=profiles, y="drugs");

#### Education

Below you can see that the majority of users graduated from college/university, followed by those who graduated from masters programs and those working on college/university. Interestingly, the space camp related options are fairly popular.

In [None]:
plt.figure(figsize=(8,7))

sns.countplot(data=profiles, y="education");

#### Jobs

Most users don't fit into the categories provided, but there is a fair share of students, artists, tech, and business folks.

In [None]:
sns.countplot(data=profiles, y="job");

#### Offspring

The data suggest that most users do not have kids. 

In [None]:
sns.countplot(data=profiles, y="offspring");

#### Orientation

The majority of users are straight.

In [None]:
sns.countplot(data=profiles, y="orientation");

Interestingly, the majority of bisexual users are female.

In [None]:
sns.countplot(data=profiles, y="orientation", hue = "sex");

#### Pets

The chart shows that most users like or have dogs.

In [None]:
sns.countplot(data=profiles, y="pets");

#### Religion

Religion was similar to sign where there are a lot of qualifiers.

In [None]:
plt.figure(figsize=(10,11))
sns.countplot(data=profiles, y="religion");

Religion was cleaned by taking the first word, which distilled it down to 9 groups. The majority of users were not very religious, identifying as agnostic, other, or atheist.

In [None]:
profiles['religionCleaned'] = profiles.religion.str.split().str.get(0)
sns.countplot(data=profiles, y="religionCleaned");

#### Signs

Here are the astrological signs of the users. They are mostly evenly distributed, with Capricorns being the rarest and Leos being the most common.

In [None]:
plt.figure(figsize=(9,8))
sns.countplot(data=profiles, y="signcleaned");

#### Smoking

As with drugs, the majority of users chose "no" for smoking.

In [None]:
sns.countplot(data=profiles, y="smokes");

#### Status

The relationship status for a dating website is fairly predictable. One would assume that most people are single and available which is reflected in the data.

In [None]:
sns.countplot(data=profiles, y="status");

### Data Preparation
#### Missing Data

Missing data is often not handled well by machine learning algorithms, so it has to be checked for, and missing values may need to be imputed or removed. It seems that many of the columns do have missing values.

In [None]:
profiles.isnull().sum()
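
Raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a made-up frame standing in for `profiles` (the column values are assumptions for illustration); taking the mean of the boolean mask gives the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `profiles` (values are made up)
toy = pd.DataFrame({
    "age": [22, 35, 29, 41],
    "diet": ["anything", np.nan, "vegetarian", np.nan],
    "pets": [np.nan, "likes dogs", "likes cats", "has dogs"],
})

# Fraction of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a high share of missing values are the ones where `dropna()` will discard the most rows.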

#### Preprocessing 

Preparing the data for modeling is important since it can speed up the process and produce better models. As the adage goes, "garbage in, garbage out", so we want to make sure the data we are feeding into the modeling step is good enough to share with others.

In [None]:
cols = ['body_type', 'diet', 'orientation', 'pets', 'religionCleaned',
       'sex', 'job', 'signcleaned']
df = profiles[cols].dropna()

#### Dummy Variables

In this next step, dummy variables are created to deal with the categorical variables. Dummy variables turn each category of a variable into its own binary indicator column. The data now has 81 columns to predict signs.

In [None]:
for col in cols[:-1]:
    df = pd.get_dummies(df, columns=[col], prefix = [col])

In [None]:
df.head()
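
As a small illustration of what the encoding produces, the sketch below runs `pd.get_dummies` on a made-up three-row frame (the values are assumptions, not dataset rows). It passes all feature columns in one call, which is an alternative to looping over them; the label column is left untouched and stays first.

```python
import pandas as pd

# Toy frame with a label column and two categorical features (made up)
toy = pd.DataFrame({
    "signcleaned": ["leo", "virgo", "leo"],
    "sex": ["m", "f", "f"],
    "pets": ["likes dogs", "has cats", "likes dogs"],
})

# Encode every column except the label in a single call
encoded = pd.get_dummies(toy, columns=["sex", "pets"])
print(list(encoded.columns))
```

Each category becomes a column named `<variable>_<category>`, so the number of columns grows with the number of distinct categories.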

#### Label Imbalance 

An imbalance in the prediction label needs to be checked. This is important since it's a multi-class problem, where there are more than two possible outcomes. An imbalance in the response variable is bad since it means that some labels only occur a few times. This is an issue for machine learning algorithms: if there is not enough data to train on for a label, the predictions for it will be poor.

In the given dataset, we observe that the counts of all the zodiac signs are more or less equal (i.e., without large deviations). Hence, we do not have to worry about imbalances and trying to address this problem.

In [None]:
df.signcleaned.value_counts()
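
One way to turn "more or less equal" into a number is to compare class shares. This is a minimal sketch on a made-up label column; the `balance_ratio` name and the choice of rarest-to-most-common ratio are illustrative assumptions, not a standard threshold.

```python
import pandas as pd

# Made-up label column standing in for df.signcleaned
labels = pd.Series(["leo"] * 5 + ["virgo"] * 4 + ["gemini"] * 3)

# Share of each class, plus the ratio of the rarest to the most common class
shares = labels.value_counts(normalize=True)
balance_ratio = shares.min() / shares.max()
print(shares)
print(f"rarest / most common: {balance_ratio:.2f}")
```

A ratio close to 1 indicates balanced labels; the closer it gets to 0, the more the rarest class is underrepresented.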

#### Splitting Data

Next the data needs to be split into train and validation sets. In this split 25% of the data is reserved for the final validation, while 75% is kept for training the model. 

In [None]:
col_length = len(df.columns)

In [None]:
X = df.iloc[:, 1:col_length]
Y = df.iloc[:, 0:1]


In [None]:
from sklearn.model_selection import train_test_split 
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.25, random_state = 0)


In [None]:
Y_train = Y_train.to_numpy().ravel()
Y_val = Y_val.to_numpy().ravel()
Y_train

In [None]:
Y_val
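
With twelve classes it can also help to keep the class proportions the same in both splits. The sketch below is an optional variation, not what the cells above do: it passes `stratify` to `train_test_split` on made-up data with three balanced classes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 3 balanced classes with 8 rows each (made up)
y = pd.Series(np.repeat(["leo", "virgo", "gemini"], 8))
X = pd.DataFrame({"feature": np.arange(len(y))})

# stratify=y keeps the class proportions identical in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(y_va.value_counts())
```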

### Prediction 

#### Model Building

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

#### K Nearest Neighbor

The next model is the `KNeighborsClassifier`, which predicts each sign from the labels of the nearest neighbors of a point. The default value for `n_neighbors` is 5, which was kept, so each prediction uses the 5 nearest neighbors. This number can be tuned later on if needed. This model had 33% accuracy on the training data, which is a good sign.


In [None]:
knn_model = KNeighborsClassifier(n_neighbors = 5).fit(X_train, Y_train)
knn_predictions = knn_model.predict(X_train)

In [None]:
print(classification_report(Y_train, knn_predictions))
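
If `n_neighbors` does need tuning, one common approach is a cross-validated grid search. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` rather than the profiles, and the candidate values 1, 5, and 20 are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with 3 classes (not the OkCupid profiles)
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Try a few neighbor counts and score each with 5-fold cross-validation
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 5, 20]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because the score comes from held-out folds, it is a fairer basis for choosing `n_neighbors` than accuracy measured on the training data.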

#### Decision Trees

The last model is the decision tree. The default `max_depth` is `None`, which the scikit-learn documentation describes as: "If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples." The results look very promising, with 78% accuracy on the training data.

In [None]:
cart_model = DecisionTreeClassifier().fit(X_train, Y_train) 
cart_predictions = cart_model.predict(X_train) 

In [None]:
print(classification_report(Y_train, cart_predictions))

Below is a confusion matrix of the results, with the true values on the y axis and the predicted values along the x axis. The diagonal cells hold the correct predictions for each sign. Since they are lighter in color and have the highest counts, the accuracy is high.

In [None]:
from sklearn.metrics import confusion_matrix 
cart_cm = confusion_matrix(Y_train, cart_predictions)
cart_labels = cart_model.classes_

In [None]:
cart_cm

In [None]:
cart_labels

In [None]:
plt.figure(figsize=(10,7))

ax= plt.subplot()
sns.heatmap(cart_cm, annot=True, ax = ax,fmt="d");


ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels'); 
ax.set_title('Confusion Matrix');
ax.yaxis.set_tick_params(rotation=60)
ax.xaxis.set_tick_params(rotation=45)

ax.xaxis.set_ticklabels(cart_labels); 
ax.yaxis.set_ticklabels(cart_labels);
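
The link between the diagonal and accuracy can be checked by hand. Here is a minimal sketch on six made-up predictions: accuracy is the sum of the diagonal divided by the total count, and dividing each diagonal cell by its row total gives the recall per sign.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, for illustration only
y_true = ["leo", "leo", "virgo", "virgo", "gemini", "gemini"]
y_pred = ["leo", "virgo", "virgo", "virgo", "gemini", "leo"]
labels = ["gemini", "leo", "virgo"]

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred, labels=labels)

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per-class share of true labels found
print(cm)
print(accuracy, dict(zip(labels, recall)))
```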

Going back to the model, a quick check shows that this tree has a depth of 49, which will probably not generalize to another dataset. In this case the model has "overfit" the training data.

In [None]:
cart_model.get_depth()
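
To see why a very deep tree is a warning sign, the sketch below fits trees to synthetic data whose labels are assigned at random, so there is nothing real to learn. The data, the 12 label values, and the depth of 5 are all assumptions chosen for illustration. An unrestricted tree can still memorize the training rows, while its accuracy on held-out rows stays near chance.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 20 binary features, 12 labels assigned at random,
# so there is no real signal to learn
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1200, 20))
y = rng.integers(0, 12, size=1200)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for depth in (None, 5):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # (training accuracy, held-out accuracy, depth actually reached)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_va, y_va), tree.get_depth())
    print(depth, scores[depth])
```

A large gap between the first two numbers for the unrestricted tree is the signature of overfitting.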

To make the point, a five-fold cross validation is run on the same data. The results are worse than those of the KNN model and about the same as logistic regression. The baseline was ~9%.

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_val_score(cart_model, X_train, Y_train, cv=kfold, scoring='accuracy')

print(results)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
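
One way to put cross-validated accuracies like these in context is a classifier that ignores the features entirely. The sketch below is an illustrative assumption rather than part of the original analysis: it scores scikit-learn's `DummyClassifier` with the same kind of `KFold` setup on made-up labels with 12 classes, one of them slightly more common.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import KFold, cross_val_score

# Made-up labels: 12 classes, class 0 slightly more common than the rest
y = np.concatenate([np.zeros(80, dtype=int), np.repeat(np.arange(1, 12), 50)])
X = np.zeros((len(y), 1))  # the features are ignored by DummyClassifier

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
dummy = DummyClassifier(strategy="most_frequent")
dummy_scores = cross_val_score(dummy, X, y, cv=kfold, scoring="accuracy")

print("Dummy baseline: %.2f%% (%.2f%%)" % (dummy_scores.mean() * 100, dummy_scores.std() * 100))
```

A model is only useful if it clearly beats this kind of score; with 12 roughly balanced classes that bar sits at around 1 in 12.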

The decision tree model is built again, but with a `max_depth` of 20 to stop the tree from growing until every leaf is pure. The new accuracy rate of ~41% is worse than the first iteration, but slightly better than the KNN model.

In [None]:
cart_model20 = DecisionTreeClassifier(max_depth = 20).fit(X_train, Y_train) 
cart_predictions20 = cart_model20.predict(X_train) 

In [None]:
cart_predictions20

In [None]:
print(classification_report(Y_train, cart_predictions20))

If we check again with cross validation, the new model is still averaging ~8% which is not very good. 

In [None]:
results20 = cross_val_score(cart_model20, X_train, Y_train, cv=kfold, scoring='accuracy')

print(results20)
print("Baseline: %.2f%% (%.2f%%)" % (results20.mean()*100, results20.std()*100))

#### Final Model

So it seems that the `knn_model` might be the best model for OkCupid to use when users don't have their signs listed on their profile. On the hold-out (validation) set, however, it reaches only ~8% accuracy, which is not very good.

In [None]:
knn_predictionsVal = knn_model.predict(X_val) 
print(classification_report(Y_val, knn_predictionsVal))

In the confusion matrix, it becomes clear that Cancer, Gemini, Leo, and Virgo were predicted most often, but not accurately: the vertical color bands show guesses spread evenly across the true signs, mostly wrong and only some correct.

In [None]:
final_cm = confusion_matrix(Y_val, knn_predictionsVal)
knn_labels = knn_model.classes_


In [None]:
final_cm

In [None]:
knn_labels

In [None]:
plt.figure(figsize=(10,7))

ax= plt.subplot()
sns.heatmap(final_cm, annot=True, ax = ax, fmt="d");

# labels, title and ticks
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels'); 
ax.set_title('Confusion Matrix');
ax.yaxis.set_tick_params(rotation=60)
ax.xaxis.set_tick_params(rotation=45)

ax.xaxis.set_ticklabels(knn_labels); 
ax.yaxis.set_ticklabels(knn_labels);

#### General Comments

In this project machine learning was used to predict the astrological signs of OkCupid users. This is an important feature since many people believe in astrology and matches between compatible star signs. If users don't input their signs, an algorithmic solution could have generated a sign to impute missing data when making matches.

Alas, the final selected algorithm did no better than basic guessing.
