In this capstone, you will create a presentation about your findings in this OkCupid dataset.

The purpose of this capstone is to practice formulating questions and implementing Machine Learning techniques to answer those questions. We will give you guidance about the kinds of questions we asked and the kinds of methods we used to answer them, but the questions you ask and how you answer them are entirely up to you. We’re excited to see what kinds of different things you explore. Compared to the other projects you have completed thus far, we place few restrictions on how you structure your code. The project is far more open-ended, and you should use your creativity. In addition, much of the code you write for later parts of this project will depend on how you decided to implement earlier parts. Therefore, we strongly encourage you to read through the entire assignment before writing any code.

The dataset provided has the following columns of multiple-choice data:

body_type
diet
drinks
drugs
education
ethnicity
height
income
job
offspring
orientation
pets
religion
sex
sign
smokes
speaks
status

And a set of open short-answer responses to:

essay0 - My self summary
essay1 - What I’m doing with my life
essay2 - I’m really good at
essay3 - The first thing people usually notice about me
essay4 - Favorite books, movies, show, music, and food
essay5 - The six things I could never do without
essay6 - I spend a lot of time thinking about
essay7 - On a typical Friday night I am
essay8 - The most private thing I am willing to admit
essay9 - You should message me if…

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.
profiles = pd.read_csv('/kaggle/input/profiles.csv')

In [None]:
pd.set_option('display.max_columns', None)
display(profiles.head(5))

In [None]:
print('Total rows:',profiles.shape[0])
print('Total columns:',profiles.shape[1])
# print('breakdown of column types:')
# print(profiles.dtypes.value_counts())
print(len(profiles.select_dtypes('int64').columns),'numerical columns(integer):',list(profiles.select_dtypes('int64').columns))
print(len(profiles.select_dtypes('float64').columns),'numerical columns(float):',list(profiles.select_dtypes('float64').columns))
print(len(profiles.select_dtypes('object').columns),'categorical columns:',list(profiles.select_dtypes('object').columns))

In [None]:
#inspect essay questions
meaning = dict({'essay0' : 'My self summary',
'essay1' : 'What I’m doing with my life',
'essay2' : 'I’m really good at',
'essay3' : 'The first thing people usually notice about me',
'essay4' : 'Favorite books, movies, show, music, and food',
'essay5' : 'The six things I could never do without',
'essay6' : 'I spend a lot of time thinking about',
'essay7' : 'On a typical Friday night I am',
'essay8' : 'The most private thing I am willing to admit',
'essay9' : 'You should message me if…'})

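# Count distinct answers per text column, then keep only the essay columns
objects = profiles.select_dtypes('object').nunique().sort_values(ascending=False).reset_index()
summary = objects[objects['index'].str.contains('essay')].copy()  # copy to avoid SettingWithCopyWarning
summary['meaning'] = summary['index'].map(meaning)
# Distinct answers as a share of all profiles (a rough proxy for the response rate)
summary[0] = summary[0]/len(profiles)
summary[['index','meaning',0]]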

* About 90% of people answered essay 0, followed by essay 1 (usually just **things you do**)
* Unsurprisingly, only about 65% of people answered essay 8 (usually **private thoughts, self-criticism**)
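
The ratio computed above divides the number of distinct answers by the number of profiles, so duplicate answers make it understate how many people responded. A minimal sketch of measuring the response rate directly, using a made-up `toy_profiles` frame as a hypothetical stand-in for `profiles`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `profiles`; the values are made up for illustration.
toy_profiles = pd.DataFrame({
    "essay0": ["hi there", "hi there", "I like hiking", np.nan],
    "essay8": [np.nan, "I sing in the shower", np.nan, np.nan],
})

essay_cols = [col for col in toy_profiles.columns if col.startswith("essay")]

# Share of profiles with a non-missing answer for each essay prompt.
response_rate = toy_profiles[essay_cols].notna().mean()

# Share of distinct answers, which is what nunique() / len() measures.
distinct_share = toy_profiles[essay_cols].nunique() / len(toy_profiles)

print(response_rate)
print(distinct_share)
```

On the toy data the two measures disagree for `essay0` because two people gave the same answer; on real essays the gap is likely small, but `notna().mean()` answers the question being asked.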

In [None]:
#inspect non-essay questions
objects = profiles.select_dtypes('object').nunique().sort_values(ascending=False).reset_index()
non_essay = [col for col in objects['index'] if 'essay' not in col]
profiles[non_essay].nunique()

In [None]:
# Numeric data and cleanup
numeric = profiles.select_dtypes('number').copy()  # public API instead of the private _get_numeric_data()
# describe() showed that income uses -1 as a placeholder; a height of 1 is treated as missing too
numeric['income'] = numeric['income'].replace({-1: np.nan})
profiles['income'] = profiles['income'].replace({-1: np.nan})
numeric['height'] = numeric['height'].replace({1: np.nan})
profiles['height'] = profiles['height'].replace({1: np.nan})
numeric.describe()  # summary statistics after cleanup

In [None]:
#plot numeric data
plt.figure(figsize=[26,5])
k=1
for i in list(numeric.columns):
    plt.subplot(1,len(list(numeric.columns)),k)
    plt.hist(numeric[i], bins=40,label=i)
    plt.axvline(x=numeric[i].mean(),color='red',label='mean')
    plt.axvline(x=numeric[i].median(),color='green',label='median')
    plt.xlabel(i)
    plt.ylabel("Frequency")
    plt.legend()
    k+=1
plt.show()

* The dating app's users are disproportionately young
* Height is mostly between 60 and 70 inches
* Income is mostly below 10k

# Task: Predict Zodiac signs, using some columns
As we started to look at this data, we started to get more and more curious about Zodiac signs. First, we looked at all of the possible values for Zodiac signs:
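
A minimal sketch of that inspection, assuming the `sign` column holds free-form strings that start with the sign name. The sample values in `toy_sign` are made up for illustration and may not match the real data:

```python
import pandas as pd

# Made-up sample standing in for profiles.sign.
toy_sign = pd.Series(
    ["leo", "leo but it matters a lot", "virgo", None, "virgo and it is fun"]
)

# Full strings, including any trailing comment, with missing values counted too.
print(toy_sign.value_counts(dropna=False))

# First word only, which is the sign itself.
sign_only = toy_sign.str.split(" ").str.get(0)
sign_counts = sign_only.value_counts()
print(sign_counts)
```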

We started to wonder if there was a way to predict a user’s Zodiac sign from the information in their profile. Thinking about the columns we had already explored, we thought that maybe we could classify Zodiac signs using drinking, smoking, drugs, and essays as our features.

In [None]:
profiles['sign_new'] = profiles.sign.str.split(' ').str.get(0)

sign_dict = dict(
{'leo':0,
'gemini':1,
'libra':2,
'cancer':3,
'virgo':4,
'taurus':5,
'scorpio':6,
'aries':7,
'pisces':8,
'sagittarius':9,
'aquarius':10,
'capricorn':11,})
profiles['sign_num'] = profiles['sign_new'].map(sign_dict)
profiles['sign_num'].value_counts()


In [None]:
x_col = ['sign_num','drinks','smokes','drugs'] + list(objects[objects['index'].str.contains('essay')]['index'])
zodiac = profiles[x_col].copy()  # copy so the column assignments below do not trigger SettingWithCopyWarning
for i in ['drinks','smokes','drugs']:
    print(i,zodiac[i].unique())
    
zodiac.drugs = zodiac.drugs.map({'never':0,'sometimes':1,'often':2})
zodiac.smokes = zodiac.smokes.map({'no':0,'when drinking':1,'sometimes':2,'yes':3,'trying to quit':4})
zodiac.drinks = zodiac.drinks.map({'not at all':0,'rarely':1,'socially':2,'often':3,'very often':4,'desperately':5})

display(zodiac[['drinks','smokes','drugs']].describe())
#plot numeric data
plt.figure(figsize=[26,5])
k=1
for i in ['drinks','smokes','drugs']:
    plt.subplot(1,3,k)
    plt.hist(zodiac[i],label=i)
    plt.axvline(x=zodiac[i].mean(),color='red',label='mean')
    plt.axvline(x=zodiac[i].median(),color='green',label='median')
    plt.xlabel(i)
    plt.ylabel("Frequency")
    plt.legend()
    k+=1
plt.show()

* The average person **drinks socially, never smokes, and never does drugs**

In [None]:
essay_cols = list(objects[objects['index'].str.contains('essay')]['index'])
for col in essay_cols:
    # Word count per essay; a missing essay counts as 0 words
    zodiac[col+'_length'] = zodiac[col].str.split(' ').str.len().fillna(0).astype(int)
length_col = [col for col in zodiac.columns if 'length' in col]
zodiac[length_col].describe()

In [None]:
x = zodiac['sign_num']
y = zodiac['essay7_length']
plt.figure(figsize=[13,7])
sns.boxplot(x=x, y=y)  # recent seaborn versions require keyword arguments here

* Essay 7 length does not seem to differ much across zodiac signs
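
The visual impression from the boxplot can be checked numerically by summarizing essay length per sign. A minimal sketch on a made-up frame (`toy` is a hypothetical stand-in for `zodiac`, reusing its `sign_num` and `essay7_length` column names):

```python
import pandas as pd

# Made-up data; the real frame has one row per profile.
toy = pd.DataFrame({
    "sign_num": [0, 0, 1, 1, 2, 2],
    "essay7_length": [10, 30, 12, 28, 5, 45],
})

# Count, median, and mean essay length for each sign.
per_sign = toy.groupby("sign_num")["essay7_length"].agg(["count", "median", "mean"])
print(per_sign)

# Gap between the largest and smallest per-sign median.
median_range = per_sign["median"].max() - per_sign["median"].min()
print(median_range)
```

A median gap that is small relative to the spread within each sign would support the reading that essay length says little about the sign.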

In [None]:
#use knn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

zodiac = zodiac.dropna()
y = zodiac.sign_num
x = zodiac.drop(columns=['sign_num']+list(objects[objects['index'].str.contains('essay')]['index']))

def missing_values_table(df):
    mis_val=df.isnull().sum()    
    mis_val_perc=100*df.isnull().sum()/len(df)
    mis_val_table=pd.concat([mis_val, mis_val_perc], axis=1) 
    mis_val_table_ren_columns = mis_val_table.rename(columns = {0 : 'Missing Values', 1 : '% of Total Values'})
    mis_val_table_ren_columns = mis_val_table_ren_columns[mis_val_table_ren_columns.iloc[:,1] != 0].sort_values('% of Total Values', ascending=False).round(1)
    print ("Your selected data frame has " + str(df.shape[1]) + " columns.\n"+"There are " + str(mis_val_table_ren_columns.shape[0]) +
 " columns that have missing values.")
    return mis_val_table_ren_columns

miss = missing_values_table(x)
miss.head(5)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.8, test_size = 0.2, random_state = 6)

#impute x_train
# imputer = SimpleImputer(missing_values=np.nan, strategy = 'median')
# imputer.fit(x_train)
# x_train = imputer.transform( x_train )
# x_test = imputer.transform (x_test )

#scale 
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

min_ = 1
max_ = 100
score = []
for i in range(min_,max_):
    classifier = KNeighborsClassifier(i)
    classifier.fit(x_train, y_train)
    score.append(classifier.score(x_test, y_test))
plt.plot(list(range(min_,max_)),score)
plt.show()

The accuracy is about 9%. I am not impressed: random guessing among 12 signs already gives 1/12 ≈ 8.3%.
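
Two simple reference points help judge a score like this: uniform random guessing over the 12 signs, and always predicting the most common sign in the training data. A minimal sketch with made-up labels (`toy_y_train` and `toy_y_test` are hypothetical stand-ins for `y_train` and `y_test`):

```python
import pandas as pd

# Made-up labels standing in for y_train and y_test.
toy_y_train = pd.Series([0, 0, 0, 1, 1, 2, 3, 4])
toy_y_test = pd.Series([0, 1, 2, 0])

# Expected accuracy of uniform random guessing over 12 classes.
n_classes = 12
uniform_baseline = 1 / n_classes

# Accuracy of always predicting the most common training label.
majority_label = toy_y_train.mode().iloc[0]
majority_baseline = (toy_y_test == majority_label).mean()

print(round(uniform_baseline, 3))
print(majority_baseline)
```

If the classes are slightly imbalanced, the majority baseline can sit above 1/12, which makes it the stricter of the two comparisons.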

# Things to explore:

## Use Classification Techniques

We have learned how to perform classification in a few different ways.

* We learned about K-Nearest Neighbors by exploring IMDB ratings of popular movies
* We learned about Support Vector Machines by exploring baseball statistics
* We learned about Naive Bayes by exploring Amazon Reviews

Some questions we used classification to tackle were:

* Can we predict sex with education level and income?
* Can we predict education level with essay text word counts?

## Use Regression Techniques

We have learned how to perform Multiple Linear Regression by playing with StreetEasy apartment data. Is there a way we can apply the techniques we learned to this dataset?

Some questions we used regression to tackle were:

* Predict income with length of essays and average word length?
* Predict age with the frequency of “I” or “me” in essays?

We also learned about K-Nearest Neighbors Regression. Which form of regression works better to answer your question?
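
The text features named in the regression questions above (essay length, average word length, and the frequency of “I” or “me”) can be derived with pandas string methods. A minimal sketch on made-up essays; `toy_essays` and the feature names are hypothetical, and the regex is one possible definition of an “I”/“me” token:

```python
import re
import pandas as pd

# Made-up essays; None stands for a profile that left the prompt blank.
toy_essays = pd.Series(["I like me and my dog", "Hiking is fun", None])

filled = toy_essays.fillna("")

# Number of whitespace-separated words per essay.
word_count = filled.str.split().str.len()

# Occurrences of the standalone words "I" or "me", case-insensitive.
i_me_count = filled.str.count(r"\b(?:i|me)\b", flags=re.IGNORECASE)

# Characters excluding whitespace, used for the average word length.
total_chars = filled.str.replace(r"\s+", "", regex=True).str.len()

# Divide only where there is at least one word; empty essays get 0.
nonzero_words = word_count.where(word_count > 0)
features = pd.DataFrame({
    "word_count": word_count,
    "i_me_freq": (i_me_count / nonzero_words).fillna(0.0),
    "avg_word_len": (total_chars / nonzero_words).fillna(0.0),
})
print(features)
```

These columns could then serve as inputs to a regression model, with a column such as income or age as the target.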