## Problem Statement:
### Predict the match percentage
In an era where technology plays a significant role in people’s lives, one cannot deny that it changes the way people interact and communicate with others. Today, technology has caused some significant changes in the dating world as well. Online dating is a new trend that is influencing many people around the world.<br>
As a data scientist, you are required to predict the match percentage between the users in a matrix format based on the attributes provided by the user on a dating website.

#### Note:
Based on the user’s sexual orientation, you are required to perform the following:
- If a user is heterosexual (prefers the opposite sex), then the match percentage must be 0 for this user with respect to other users of the same gender if the other users have the same behavior.
- If a user is a homosexual (prefers the same sex), then the match percentage must be 0 for this user with respect to other users of the opposite gender if the other users have the same behavior.
- The match percentage of a user with her/himself must be zero.

## Approach
We will create two matrices and then multiply them element-wise.<br>
The two matrices are:
1. `Matrix 1: user-user similarity matrix`
    - We build a final dataframe that holds all the variables needed for model building.
    - We set the user ids as the index of the dataframe.
    - We then apply cosine similarity to this dataframe, which results in a matrix of size 2001x2001.
    - We set all the diagonal elements to 0 and convert the matrix into a dataframe whose rows and columns are user ids and whose values are the similarity scores between each pair of users.
2. `Matrix 2: binary user-user matrix`
    - The elements of this matrix are 0 or 1, based on the conditions given in the **problem's note**.

Finally, we multiply these two matrices to get the final match percentage.
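
The approach can be sketched on toy data as follows. This is only a minimal illustration: the feature values and the excluded pair are made up, and the similarity is computed directly with NumPy rather than with the scikit-learn helper used later in the notebook.

```python
import numpy as np

# Toy feature matrix: one row per user (values are made up for illustration).
features = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
])

# Matrix 1: cosine similarity as a percentage, with the diagonal set to 0.
unit = features / np.linalg.norm(features, axis=1, keepdims=True)
similarity = unit @ unit.T * 100
np.fill_diagonal(similarity, 0.0)

# Matrix 2: 1 where a pair may match, 0 where it may not.
# Users 0 and 1 are assumed to be an excluded pair in this toy case.
allowed = np.array([
    [0, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
])

# Element-wise product: excluded pairs drop to 0, the rest keep their score.
match = similarity * allowed
print(match.round(2))
```

Users 0 and 1 have identical features, yet their match is 0 because the binary matrix excludes the pair, while users 0 and 2 keep their similarity score.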

In [None]:
# importing the libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')

import nltk
import re
from nltk.tokenize import word_tokenize
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import pairwise_distances
import string
string.punctuation

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns',100)

### I. Matrix 1

In [None]:
# reading the data
df = pd.read_csv('/kaggle/input/predict-the-match-percentage/data.csv')

# looking at the data
df.head()

In [None]:
df.info()

- There are no null values in any of the columns.
- Most of the variables are of object type.

#### user id

In [None]:
# checking for duplicates
# and we see that the ids are all unique
df.user_id.nunique()

---
#### username

In [None]:
# doesn't help in model
df.drop('username', axis=1, inplace=True)

---
#### age

In [None]:
df.age.describe(percentiles=[0.25,0.5,0.75,0.9,0.95,0.99])

In [None]:
# Looking at the distribution of the age column
plt.figure(figsize=(12,5))

skewness = round(df.age.skew(),2)
kurtosis = round(df.age.kurtosis(),2)
mean = round(np.mean(df.age),0)
median = np.median(df.age)

plt.subplot(1,2,1)
sns.boxplot(y=df.age)
plt.title('Boxplot\n Mean:{}\n Median:{}\n Skewness:{}\n Kurtosis:{}'.format(mean,median,skewness,kurtosis))

plt.subplot(1,2,2)
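# histplot replaces the deprecated distplot; stat='density' keeps the same y-axis scale
sns.histplot(df.age, kde=True, stat='density')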
plt.title('Distribution Plot\n Mean:{}\n Median:{}\n Skewness:{}\n Kurtosis:{}'.format(mean,median,skewness,kurtosis))

plt.show()

In [None]:
# creating a new column that divides the age into bins
df['age_bin'] = pd.cut(df.age, bins=[17,24,30,40,50,70],labels=['17-24','25-30','31-40','41-50','50+'])
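
How `pd.cut` assigns values to these bins can be checked on a few sample ages. The sketch below reuses the bin edges and labels from the cell above; the ages themselves are made up. With the default `right=True`, each bin includes its right edge and excludes its left edge.

```python
import pandas as pd

ages = pd.Series([18, 24, 25, 30, 31, 69])  # made-up sample ages
bins = [17, 24, 30, 40, 50, 70]
labels = ['17-24', '25-30', '31-40', '41-50', '50+']

# Each bin is (left, right]: 24 falls in the first bin, 25 in the second.
age_bin = pd.cut(ages, bins=bins, labels=labels)
print(age_bin.tolist())

# Values on the lowest edge or beyond the highest edge get no bin (NaN).
outside = pd.cut(pd.Series([17, 71]), bins=bins, labels=labels)
print(outside.isna().tolist())
```

Because the lowest edge is excluded, an age of exactly 17 would not be binned, so the bins work as intended only if every age lies between 18 and 70.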

In [None]:
# making the dummy variables for the age_bin column 
aged = pd.get_dummies(df.age_bin,prefix='age_')
df = pd.concat([df,aged], axis=1)
df.drop(['age','age_bin'],axis=1,inplace=True)

---
#### height

In [None]:
df.height.describe(percentiles=[0.25,0.5,0.75,0.9,0.95,0.99])

In [None]:
# Looking at the distribution of the height column
plt.figure(figsize=(12,5))

skewness = round(df.height.skew(),2)
kurtosis = round(df.height.kurtosis(),2)
mean = round(np.mean(df.height),0)
median = np.median(df.height)

plt.subplot(1,2,1)
sns.boxplot(y=df.height)
plt.title('Boxplot\n Mean:{}\n Median:{}\n Skewness:{}\n Kurtosis:{}'.format(mean,median,skewness,kurtosis))

plt.subplot(1,2,2)
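# histplot replaces the deprecated distplot; stat='density' keeps the same y-axis scale
sns.histplot(df.height, kde=True, stat='density')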
plt.title('Distribution Plot\n Mean:{}\n Median:{}\n Skewness:{}\n Kurtosis:{}'.format(mean,median,skewness,kurtosis))

plt.show()

In [None]:
# creating a new column which stores the height in feet
df['height_feet'] = round(df['height']*0.08333,1)

In [None]:
df.height_feet.describe(percentiles=[0.25,0.5,0.75,0.9,0.95,0.99])

In [None]:
# creating a column which has height in bins
df['height_bin'] = pd.cut(df.height_feet,bins=[4,5,6,7],labels=['4-5feets','5-6feets','6-7feets'],right=False)

In [None]:
# making a dummy variable for the height_bin column
heightd = pd.get_dummies(df.height_bin,prefix='height_')
df = pd.concat([df,heightd], axis=1)
df.drop(['height','height_feet','height_bin'],axis=1,inplace=True)

---
#### status

In [None]:
df.status.value_counts(normalize=True)

In [None]:
# 'single' and 'available' mean the same thing on a dating site, so we combine them
df['status'] = df['status'].replace('available','single')

In [None]:
# making a dummy variable for status
statusd = pd.get_dummies(df.status,prefix='status_')
df = pd.concat([df,statusd], axis=1)
df.drop('status',axis=1,inplace=True)

---
#### sex

In [None]:
df.sex.value_counts(normalize=True)

In [None]:
# converting to numeric
df['sex'] = df['sex'].replace(('m','f'),(1,0))

---
#### orientation

In [None]:
df.orientation.value_counts(normalize=True)

In [None]:
# creating a new column and filling it based on sex and orientation
# (.loc is used because chained indexing would assign to a temporary copy)
df['looking_for'] = None

df.loc[(df.orientation=='straight') & (df.sex==1), 'looking_for'] = 'female'
df.loc[(df.orientation=='straight') & (df.sex==0), 'looking_for'] = 'male'

df.loc[(df.orientation=='gay') & (df.sex==1), 'looking_for'] = 'male'
df.loc[(df.orientation=='gay') & (df.sex==0), 'looking_for'] = 'female'

df.loc[(df.orientation=='bisexual') & (df.sex==1), 'looking_for'] = 'both'
df.loc[(df.orientation=='bisexual') & (df.sex==0), 'looking_for'] = 'both'
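
Assigning values to a subset of rows is most reliable when the row mask and the column name go into a single `.loc` call. The sketch below shows the pattern on a small made-up dataframe; the name `people` is hypothetical and the encoding of `sex` (1 for male, 0 for female) follows the cell in the previous section.

```python
import pandas as pd

# Made-up rows; sex is encoded as 1 = male, 0 = female.
people = pd.DataFrame({
    'sex': [1, 0, 1, 0],
    'orientation': ['straight', 'straight', 'gay', 'bisexual'],
})
people['looking_for'] = None  # object column, filled in below

male = people['sex'] == 1
straight = people['orientation'] == 'straight'
gay = people['orientation'] == 'gay'
bisexual = people['orientation'] == 'bisexual'

# One .loc call per rule: rows selected by the mask, column selected by name.
people.loc[straight & male, 'looking_for'] = 'female'
people.loc[straight & ~male, 'looking_for'] = 'male'
people.loc[gay & male, 'looking_for'] = 'male'
people.loc[gay & ~male, 'looking_for'] = 'female'
people.loc[bisexual, 'looking_for'] = 'both'

print(people)
```

Writing `people[mask]['looking_for'] = ...` instead would assign into a temporary object, so the original dataframe would keep its empty column.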

In [None]:
# dropping the column
df.drop('orientation',axis=1,inplace=True)

In [None]:
# making the dummy variables for looking_for column
lfd = pd.get_dummies(df.looking_for,prefix='looking_')
df = pd.concat([df,lfd], axis=1)
df.drop('looking_for',axis=1,inplace=True)

---
#### drinks

In [None]:
df.drinks.value_counts(normalize=True)

In [None]:
# making the dummy variables for drinks column
drinkd = pd.get_dummies(df.drinks,prefix='drink_')
df = pd.concat([df,drinkd], axis=1)
df.drop('drinks',axis=1,inplace=True)

---
#### drugs

In [None]:
df.drugs.value_counts(normalize=True)

In [None]:
# making the dummy variables for drugs column
drugd = pd.get_dummies(df.drugs,prefix='drug_')
df = pd.concat([df,drugd], axis=1)
df.drop('drugs',axis=1,inplace=True)

---
#### job

In [None]:
df.job.value_counts(normalize=True)

In [None]:
# the last four jobs each account for less than 1% of users, so we combine them into 'other'
df['job'] = df['job'].replace(('retired','rather not say','unemployed','military'),
                             ('other','other','other','other'))

In [None]:
# making the dummy variables for job column
jd = pd.get_dummies(df.job,prefix='job_')
df = pd.concat([df,jd], axis=1)
df.drop('job',axis=1,inplace=True)

---
#### location

In [None]:
# working on a copy so that the new columns are not added to a slice of df
locn = df[['location']].copy()
locn[['city','state']] = locn.location.str.split(',',expand=True)

In [None]:
locd = pd.get_dummies(locn.city,prefix='lives_in_')
locn = pd.concat([locn,locd], axis=1)

locn.head()

In [None]:
locn.iloc[:,3:].sum().sort_values(ascending=False).index

In [None]:
locn.iloc[:,3:].sum().sort_values(ascending=False).values

Here we keep only the cities whose count is more than 20 and combine the rest of the cities into an 'others' category.
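
The same idea can be expressed without listing the columns by hand. The following is a minimal sketch on made-up city names, not the method used in the cells below: it counts each value and relabels the infrequent ones. The threshold is set to 2 only to keep the toy data small.

```python
import pandas as pd

cities = pd.Series(['sf', 'sf', 'sf', 'oakland', 'oakland', 'albany', 'rodeo'])
min_count = 2  # the notebook uses 20; 2 keeps this toy example small

counts = cities.value_counts()

# Keep a city only if it appears more than min_count times, else label it 'others'.
grouped = cities.where(cities.map(counts) > min_count, 'others')
print(grouped.value_counts())
```

The grouped column can then be passed to `pd.get_dummies` like any other categorical column.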

In [None]:
# rearranging the columns such that they appeared according to their count in descending order
locn = locn[['lives_in__san francisco', 'lives_in__oakland', 'lives_in__berkeley','lives_in__san mateo', 'lives_in__palo alto', 
             'lives_in__alameda','lives_in__san rafael', 'lives_in__san leandro','lives_in__redwood city', 
             'lives_in__emeryville', 'lives_in__daly city','lives_in__walnut creek', 'lives_in__hayward', 'lives_in__pacifica',
             'lives_in__el cerrito', 'lives_in__menlo park','lives_in__mountain view', 'lives_in__richmond', 
             'lives_in__martinez','lives_in__burlingame', 'lives_in__benicia', 'lives_in__vallejo','lives_in__mill valley', 
             'lives_in__south san francisco','lives_in__pleasant hill', 'lives_in__novato','lives_in__castro valley', 
             'lives_in__lafayette','lives_in__san carlos', 'lives_in__belmont', 'lives_in__san bruno','lives_in__el sobrante', 
             'lives_in__millbrae', 'lives_in__fremont','lives_in__half moon bay', 'lives_in__albany', 'lives_in__hercules',
             'lives_in__stanford', 'lives_in__san pablo', 'lives_in__san lorenzo','lives_in__fairfax', 'lives_in__atherton', 
             'lives_in__moraga','lives_in__sausalito', 'lives_in__san anselmo','lives_in__corte madera', 'lives_in__woodacre', 
             'lives_in__green brae','lives_in__belvedere tiburon', 'lives_in__rodeo', 'lives_in__orinda','lives_in__larkspur', 
             'lives_in__pinole', 'lives_in__canyon country','lives_in__stockton', 'lives_in__santa rosa', 'lives_in__brisbane',
             'lives_in__brooklyn', 'lives_in__point richmond', 'lives_in__lagunitas','lives_in__cincinnati', 'lives_in__phoenix',
             'lives_in__petaluma','lives_in__north hollywood', 'lives_in__nha trang','lives_in__foster city', 
             'lives_in__moss beach','lives_in__hacienda heights', 'lives_in__montara','lives_in__woodside']]

In [None]:
# creating a new column
locn['others'] = locn.iloc[:,13:].sum(axis=1).astype('int')

In [None]:
locn = locn[['lives_in__san francisco', 'lives_in__oakland', 'lives_in__berkeley','lives_in__san mateo', 'lives_in__palo alto', 
             'lives_in__alameda','lives_in__san rafael', 'lives_in__san leandro','lives_in__redwood city', 
             'lives_in__emeryville', 'lives_in__daly city','lives_in__walnut creek', 'lives_in__hayward','others', 
             'lives_in__pacifica','lives_in__el cerrito', 'lives_in__menlo park','lives_in__mountain view', 'lives_in__richmond', 
             'lives_in__martinez','lives_in__burlingame', 'lives_in__benicia', 'lives_in__vallejo','lives_in__mill valley', 
             'lives_in__south san francisco','lives_in__pleasant hill', 'lives_in__novato','lives_in__castro valley', 
             'lives_in__lafayette','lives_in__san carlos', 'lives_in__belmont', 'lives_in__san bruno','lives_in__el sobrante', 
             'lives_in__millbrae', 'lives_in__fremont','lives_in__half moon bay', 'lives_in__albany', 'lives_in__hercules',
             'lives_in__stanford', 'lives_in__san pablo', 'lives_in__san lorenzo','lives_in__fairfax', 'lives_in__atherton', 
             'lives_in__moraga','lives_in__sausalito', 'lives_in__san anselmo','lives_in__corte madera', 'lives_in__woodacre', 
             'lives_in__green brae','lives_in__belvedere tiburon', 'lives_in__rodeo', 'lives_in__orinda','lives_in__larkspur', 
             'lives_in__pinole', 'lives_in__canyon country','lives_in__stockton', 'lives_in__santa rosa', 'lives_in__brisbane',
             'lives_in__brooklyn', 'lives_in__point richmond', 'lives_in__lagunitas','lives_in__cincinnati', 'lives_in__phoenix',
             'lives_in__petaluma','lives_in__north hollywood', 'lives_in__nha trang','lives_in__foster city', 
             'lives_in__moss beach','lives_in__hacienda heights', 'lives_in__montara','lives_in__woodside']]

In [None]:
# keeping only the columns whose count is greater than 20, plus the extra 'others' column
locn = locn.iloc[:,:14]

In [None]:
# concatenating the main df with this location df
df = pd.concat([df,locn],axis=1)

# dropping the column
df.drop('location',axis=1,inplace=True)

---
#### pets

In [None]:
df.pets.value_counts(normalize=True)

In [None]:
# creating a function that returns whether the user likes cats, dogs, both or none
def pet_like(txt):
    if txt.find('likes dogs and likes cats')!= -1:
        return 'dog and cat'
    elif txt.find('likes dogs')!= -1:
        return 'dog'
    elif txt.find('likes cats')!= -1:
        return 'cat'
    else:
        return 'none'

# calling the above function
df['pet_like'] = df['pets'].apply(lambda x: pet_like(x))

In [None]:
# creating a function that returns whether a person owned cats or dogs or both or none
def pet_owned(txt):
    if txt.find('has dogs and has cats')!= -1:
        return 'dog and cat'
    elif txt.find('has dogs')!= -1:
        return 'dog'
    elif txt.find('has cats')!= -1:
        return 'cat'
    else:
        return 'none'

# calling the function
df['pet_owned'] = df['pets'].apply(lambda x: pet_owned(x))

In [None]:
# making the dummy variables for pet_like column
petld = pd.get_dummies(df.pet_like,prefix='petLike_')
df = pd.concat([df,petld], axis=1)
df.drop('pet_like',axis=1,inplace=True)

In [None]:
# making the dummy variables for pet_owned column
petod = pd.get_dummies(df.pet_owned,prefix='petOwn_')
df = pd.concat([df,petod], axis=1)
df.drop('pet_owned',axis=1,inplace=True)

In [None]:
# dropping the variable
df.drop('pets',axis=1,inplace=True)

---
#### smokes

In [None]:
df.smokes.value_counts(normalize=True)

In [None]:
# combining the categories as they have similar context
df['smokes'] = df['smokes'].replace('trying to quit','sometimes')
df['smokes'] = df['smokes'].replace('when drinking','sometimes')

In [None]:
# making the dummy variables for smokes column
smoked = pd.get_dummies(df.smokes,prefix='smoke_')
df = pd.concat([df,smoked], axis=1)
df.drop('smokes',axis=1,inplace=True)

---
#### new language

In [None]:
df.new_languages.value_counts(normalize=True)

In [None]:
# making the dummy variables for new_languages column
nld = pd.get_dummies(df.new_languages,prefix='new_lang_')
df = pd.concat([df,nld], axis=1)
df.drop('new_languages',axis=1,inplace=True)

---
#### body profile

In [None]:
df.body_profile.value_counts(normalize=True)

In [None]:
# making the dummy variables for body_profile column
bd = pd.get_dummies(df.body_profile,prefix='body_')
df = pd.concat([df,bd], axis=1)
df.drop('body_profile',axis=1,inplace=True)

---
#### education level

In [None]:
df.education_level.value_counts(normalize=True)

In [None]:
# using scaler to convert them between 0-1
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['education_level'] = scaler.fit_transform(df[['education_level']])
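
Min-max scaling maps the smallest value to 0 and the largest to 1, with everything else placed linearly in between. A plain-Python sketch of the formula, with a hypothetical helper name and made-up levels:

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


levels = [1, 2, 3, 5]  # made-up education levels
print(min_max_scale(levels))
```

Scaling keeps this column in the same 0 to 1 range as the dummy variables, so it does not dominate the cosine similarity.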

---
#### dropped out

In [None]:
df.dropped_out.value_counts(normalize=True)

In [None]:
# converting to numeric
df['dropped_out'] = df['dropped_out'].replace(('no','yes'),(0,1))

---
#### interests and other_interests

In [None]:
df.interests.value_counts(normalize=True)

In [None]:
df.other_interests.value_counts(normalize=True)

In [None]:
# building a set of all the interests from both columns; the union removes repetition
s1 = set(df.interests.value_counts(normalize=True).index)
s2 = set(df.other_interests.value_counts(normalize=True).index)
s3 = s1.union(s2)

In [None]:
# creating a new column by combining the two interests columns
df['hobbies'] = df['interests']+','+df['other_interests']

In [None]:
# creating new columns of hobbies as a dummy variable
for col in list(s3):
    df[col] = df.hobbies.str.find(col).apply(lambda x: 0 if x==-1 else 1)

In [None]:
df.drop(['interests','other_interests','hobbies'],axis=1,inplace=True)
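
The result of this step is one 0/1 column per hobby. As a point of comparison, pandas can build the same kind of encoding directly from a comma-separated column with `Series.str.get_dummies`. This is an alternative sketch on made-up values, not the loop used above; it matches whole comma-separated tokens rather than substrings.

```python
import pandas as pd

# Made-up combined hobbies, in the same 'a,b' form as the hobbies column.
hobbies = pd.Series(['music,sports', 'sports,travel', 'music,music'])

# Split on commas and build one 0/1 column per distinct token.
encoded = hobbies.str.get_dummies(sep=',')
print(encoded)
```

A repeated token, as in the last row, still produces a single 1 in its column.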

---
#### location preference

In [None]:
df.location_preference.value_counts(normalize=True)

In [None]:
# making the dummy variables for location_preference column
locd = pd.get_dummies(df.location_preference,prefix='location_pref_')
df = pd.concat([df,locd], axis=1)
df.drop('location_preference',axis=1,inplace=True)

---
#### language

In [None]:
# copying the language column into a separate dataframe
lang = df[['language']].copy()

# splitting the column
lang[['L1','L2','L3','L4','L5']] = lang.language.str.split(',',expand=True)

lang.head(10)

In [None]:
# looking for null entries
lang.isnull().sum()

In [None]:
# replacing the null values with 'None' (np.nan is used because np.NaN was removed in NumPy 2.0)
lang['L2'] = lang['L2'].replace(np.nan, 'None')
lang['L3'] = lang['L3'].replace(np.nan, 'None')
lang['L4'] = lang['L4'].replace(np.nan, 'None')
lang['L5'] = lang['L5'].replace(np.nan, 'None')

In [None]:
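# creating a function that keeps the entries a user marked as fluent and replaces the rest with 'None'
def fluent(txt):
    l1 = list(txt)
    # regex=False matches the literal text '(fluently)' instead of treating it as a regex group
    l2 = list(txt.str.contains('(fluently)', regex=False))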
    l3 = []
    for i,j in enumerate(l2):
        if j==True:
            l3.append(l1[i])
        else:
            l3.append('None')
    return l3

In [None]:
# calling the above function on all the five columns
lang['F1'] = fluent(lang['L1'])
lang['F2'] = fluent(lang['L2'])
lang['F3'] = fluent(lang['L3'])
lang['F4'] = fluent(lang['L4'])
lang['F5'] = fluent(lang['L5'])

In [None]:
lang.head()

In [None]:
lang.isnull().sum()

In [None]:
# removing any extra white spaces
lang['F1'] = lang['F1'].str.strip()
lang['F2'] = lang['F2'].str.strip()
lang['F3'] = lang['F3'].str.strip()
lang['F4'] = lang['F4'].str.strip()
lang['F5'] = lang['F5'].str.strip()

In [None]:
# creating the sets
f1 = set(lang.F1.unique())
f2 = set(lang.F2.unique())
f3 = set(lang.F3.unique())
f4 = set(lang.F4.unique())
f5 = set(lang.F5.unique())

In [None]:
# getting all the unique languages from all these columns
u1 = f1.union(f2)
u2 = u1.union(f3)
u3 = u2.union(f4)
u4 = u3.union(f5)

In [None]:
lang.head()

In [None]:
# creating the columns
for col in list(u4):
    lang[col] = lang.language.str.find(col).apply(lambda x: 0 if x==-1 else 1)

In [None]:
# creating a column which gives the number of languages a user knows
lang['lang_known'] = lang['language'].apply(lambda x:len(x.split(',')))

In [None]:
lang.sample(10)

In [None]:
# keeping only relevant columns
lang = lang.iloc[:,11:]

In [None]:
lang.sum(axis=0).sort_values(ascending=False)

In [None]:
# dropping the language
lang.drop('c++ (fluently)',axis=1,inplace=True)

In [None]:
# arranging the columns in the order of their counts
lang = lang[['lang_known','english (fluently)','spanish (fluently)','chinese (fluently)','french (fluently)',
             'german (fluently)','italian (fluently)','farsi (fluently)','hindi (fluently)','russian (fluently)',
             'hebrew (fluently)','tagalog (fluently)','japanese (fluently)', 'sign language (fluently)','portuguese (fluently)',
             'swedish (fluently)','korean (fluently)','sanskrit (fluently)', 'dutch (fluently)', 'arabic (fluently)', 
             'hungarian (fluently)', 'icelandic (fluently)','gujarati (fluently)','irish (fluently)', 'vietnamese (fluently)',
             'esperanto (fluently)', 'tamil (fluently)', 'bulgarian (fluently)', 'indonesian (fluently)','norwegian (fluently)',
             'thai (fluently)', 'urdu (fluently)', 'ukrainian (fluently)','cebuano (fluently)', 'polish (fluently)', 
             'bengali (fluently)','ancient greek (fluently)', 'slovak (fluently)', 'None','afrikaans (fluently)', 
             'maori (fluently)', 'czech (fluently)','danish (fluently)', 'latin (fluently)', 'other (fluently)',
             'ilongo (fluently)', 'greek (fluently)', 'lisp (fluently)','turkish (fluently)']]

In [None]:
# keeping those languages whose count is greater than 5 and combine rest other into one single column
lang['others'] = lang.iloc[:,17:].sum(axis=1)

In [None]:
lang.head()

In [None]:
# rearranging of columns
lang = lang[['lang_known','english (fluently)','spanish (fluently)','chinese (fluently)','french (fluently)',
             'german (fluently)','italian (fluently)','farsi (fluently)','hindi (fluently)','russian (fluently)',
             'hebrew (fluently)','tagalog (fluently)','japanese (fluently)', 'sign language (fluently)','portuguese (fluently)',
             'swedish (fluently)','korean (fluently)','others','sanskrit (fluently)', 'dutch (fluently)', 'arabic (fluently)', 
             'hungarian (fluently)', 'icelandic (fluently)','gujarati (fluently)','irish (fluently)', 'vietnamese (fluently)',
             'esperanto (fluently)', 'tamil (fluently)', 'bulgarian (fluently)', 'indonesian (fluently)','norwegian (fluently)',
             'thai (fluently)', 'urdu (fluently)', 'ukrainian (fluently)','cebuano (fluently)', 'polish (fluently)', 
             'bengali (fluently)','ancient greek (fluently)', 'slovak (fluently)', 'None','afrikaans (fluently)', 
             'maori (fluently)', 'czech (fluently)','danish (fluently)', 'latin (fluently)', 'other (fluently)',
             'ilongo (fluently)', 'greek (fluently)', 'lisp (fluently)','turkish (fluently)']]

In [None]:
# keeping only relevant languages
lang = lang.iloc[:,:18]
lang.head()

In [None]:
# concatenating the fluent-language columns and lang_known to the main dataframe
df = pd.concat([df,lang],axis=1)

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['lang_known'] = scaler.fit_transform(df[['lang_known']])
df.drop('language',axis=1,inplace=True)

---
#### bio

For this column we use **text mining** techniques to clean the text and finally build a TF-IDF matrix.
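
TF-IDF gives a word a high weight when it is frequent in one bio but rare across all bios. The sketch below computes it by hand on three made-up bios, using the smoothed IDF and L2 normalisation that `TfidfVectorizer` applies by default. It is meant to show the idea, not to replace the vectorizer.

```python
import math
from collections import Counter

docs = ["love hiking music", "love music", "cooking travel"]  # made-up bios
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many bios each word appears.
doc_freq = Counter(w for doc in tokenized for w in set(doc))

# Smoothed inverse document frequency: rarer words get a larger value.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}


def tfidf_vector(tokens):
    """Return the L2-normalised TF-IDF weights of one bio, ordered like vocab."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]


vectors = [tfidf_vector(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, [round(x, 3) for x in vec])
```

In the first bio, "hiking" appears in only one document and therefore gets a larger weight than "love", which appears in two.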

In [None]:
# working on a copy so that the cleaned columns are not added to a slice of df
bio = df[['bio']].copy()
bio.head()

In [None]:
# converting all words to lowercase

bio['bio'] = bio['bio'].str.lower()

In [None]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",
                           "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
                           "you're": "you are", "you've": "you have"}


def contraction(txt):
    l1 = list(txt)
    l2 = []
    for i in l1:
        l2.append(' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in i.split(" ")]))
    return l2


bio['bio_clean']= contraction(bio['bio'])

In [None]:
# creating a function that removes "'s" part from the words
def removing_s(txt):
    l1 = list(txt)
    l2 = []
    for i in l1:
        l2.append(re.sub(r"'s\b","",i))
    return l2

bio['bio_clean']= removing_s(bio['bio_clean'])

In [None]:
# function that keeps only letters and replaces everything else with a space
def only_letters(txt):
    l1 = list(txt)
    l2 = []
    for i in l1:
        l2.append(re.sub("[^a-zA-Z]", " ", i))
    return l2

bio['bio_clean']= only_letters(bio['bio_clean'])

In [None]:
# function that removes punctuation

import string
string.punctuation

def remove_punctuation(txt):
    txt_nopunct = "".join([c for c in txt if c not in string.punctuation])
    return txt_nopunct

bio['bio_clean_rp'] = bio['bio_clean'].apply(lambda x: remove_punctuation(x))
bio.head()

In [None]:
# function for tokenization

import re
from nltk.tokenize import word_tokenize

def tokenize(txt):
    token = re.split(r'\W+', txt)
    return token

bio['bio_clean_tokenize'] = bio['bio_clean_rp'].apply(lambda x: tokenize(x))

# another way
# df['msg_clean_tokenize'] = df['msg_clean'].apply(word_tokenize)
# df['msg_clean_tokenize'] = df['msg_clean'].apply(lambda x: x.split())

bio.head()

In [None]:
# function for removing stopwords

stopwords = nltk.corpus.stopwords.words('english')

# list of stopwords
# print(stopwords[:30])

def remove_stopwords(txt):
    txt_clean = [word for word in txt if word not in stopwords]
    return txt_clean

bio['bio_no_sw'] = bio['bio_clean_tokenize'].apply(lambda x: remove_stopwords(x))
bio.head()

In [None]:
# function for lemmatization
wn = nltk.WordNetLemmatizer()

def lemmatizing(txt):
    text = [wn.lemmatize(word) for word in txt]
    return text

bio['bio_lemmatize'] = bio['bio_no_sw'].apply(lambda x: lemmatizing(x))
bio.head()

In [None]:
#bio['length_before'] = bio['bio'].apply(lambda x:len(x.split(' ')))
#bio['length_after'] = bio['bio_lemmatize'].apply(lambda x:len(x))

In [None]:
# joining all the clean words back into sentences
bio['final'] = bio['bio_lemmatize'].apply(lambda x:" ".join(x))

In [None]:
# creating a function for scoring the sentiment of the sentence
# x<0 - Negative
# x=0 - Neutral
# x>0 & x<=1 - Positive

def sentiment(txt):
    return (TextBlob(txt).sentiment.polarity)

bio['sentiment_score'] = bio['bio'].apply(lambda x: sentiment(x))

# replacing all the negative values to 0
bio['sentiment_score'] = bio['sentiment_score'].apply(lambda x: 0 if x<0 else x)

In [None]:
bio.head()

In [None]:
# function that removes words of length 3 or less
def lessthan3(txt):
    l1 = list(txt.strip().split(' '))
    l2 = []
    for i in l1:
        if len(i)>3:
            l2.append(i)
    return(" ".join(l2))

bio['final1'] = bio['final'].apply(lambda x:lessthan3(x))

In [None]:
# graph showing top 50 words used by the users
all_words = []
for line in list(bio['final1']):
    words = line.split()
    for word in words:
        all_words.append(word)
        
plt.figure(figsize=(15,5))
plt.title('Top 50 most common words')
plt.xticks(fontsize=13)
fd = nltk.FreqDist(all_words)
fd.plot(50,cumulative=False)
plt.show()

In [None]:
# keeping only relevant columns
bio_final = bio[['final1','sentiment_score']]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# creating a TF-IDF matrix with 1000 words
tfidf = TfidfVectorizer(max_features=1000)

X = tfidf.fit_transform(bio_final['final1'])
print(X.shape)

In [None]:
# converting it into dataframe
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
bio_tfidf = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())
bio_tfidf.head()

In [None]:
# concat the TF-IDF and sentiment_score to the main dataframe
df = pd.concat([df,bio_tfidf,bio_final[['sentiment_score']]],axis=1)

df.drop('bio',axis=1,inplace=True)

In [None]:
df.head()

In [None]:
df1 = df.set_index('user_id')
df1.head()

In [None]:
from sklearn.metrics.pairwise import pairwise_distances

# User Similarity Matrix using 'cosine' measure
user_correlation = (1 - pairwise_distances(df1, metric='cosine'))*100

print(user_correlation)

In [None]:
user_correlation.shape

In [None]:
# the diagonal is not zeroed here; Matrix 2 has a zero diagonal,
# so the element-wise product sets the self-matches to 0
a = np.matrix(user_correlation)
#np.fill_diagonal(a,0.00)

# converting the matrix to dataframe
final_df = pd.concat([df[['user_id']],pd.DataFrame(a,columns=df.user_id)],axis=1)

In [None]:
final_df = final_df.round(2)
final_df.head()

### II. Matrix 2

In [None]:
# reading the data
sample = pd.read_csv('/kaggle/input/predict-the-match-percentage/data.csv')

# keeping only three columns
sample = sample[['user_id','sex','orientation']]

# joined sex and orientation column
sample["pref"] = sample["sex"] + " " + sample["orientation"]

# converting into category
sample["pref"] = sample["pref"].astype("category")

# getting the code for each category
sample["code"] = sample["pref"].cat.codes

# adding one to code so that it starts from 1
sample["code"] = sample["code"] + 1

sample.head()

In [None]:
# created a new dataframe with one column user_id
table = pd.DataFrame({'user_id':final_df.columns})
table = table.iloc[1:,:]

# create a new column 
table['uid'] = range(1,2002)

table.head()

In [None]:
# merging the two dataframe
sample = sample.merge(table,on='user_id')
sample.head()

In [None]:
# checking the code
sample.groupby('pref')['code'].mean()

In [None]:
# merging the dataframe with itself
sample2 = sample[["uid","code"]].merge(sample[["uid","code"]], on="code")

# getting the pivot table
sample2 = sample2.pivot_table(index='uid_x',columns='uid_y',values='code')

sample2

In [None]:
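# applying the logic and replacing the values:
# pairs with different codes (NaN) and pairs with code 4 become 1; pairs sharing code 2, 3, 5 or 6 become 0
sample3 = sample2.replace([4,np.nan,2,3,5,6],[1,1,0,0,0,0])
sample3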
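
For reference, the rules in the problem's note can also be written as a small pairwise function. The sketch below is a literal reading of the note, not a reproduction of the `replace` call above, so the two may not agree on every pair. The function name and the sample users are hypothetical; the values `'m'`, `'f'`, `'straight'` and `'gay'` follow the raw `sex` and `orientation` columns.

```python
def allowed(user_a, user_b):
    """Return 0 if the note's rules force a zero match, else 1.

    Each user is a (user_id, sex, orientation) tuple.
    """
    id_a, sex_a, orient_a = user_a
    id_b, sex_b, orient_b = user_b
    if id_a == id_b:
        return 0  # a user never matches themselves
    if orient_a == orient_b == 'straight' and sex_a == sex_b:
        return 0  # two heterosexual users of the same gender
    if orient_a == orient_b == 'gay' and sex_a != sex_b:
        return 0  # two homosexual users of opposite genders
    return 1


# Made-up users for illustration.
users = [
    (1, 'm', 'straight'),
    (2, 'm', 'straight'),
    (3, 'f', 'straight'),
    (4, 'f', 'gay'),
    (5, 'm', 'gay'),
]

mask = [[allowed(a, b) for b in users] for a in users]
for row in mask:
    print(row)
```

The resulting mask is symmetric with a zero diagonal, which is the shape the element-wise multiplication expects.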

In [None]:
# converting it into matrix and filling diagonal elements to 0
b = np.matrix(sample3)
np.fill_diagonal(b,0.00)

# converting it into dataframe
sample4 = pd.DataFrame(b,columns=sample3.columns)

# setting the index
sample4.index = sample4.columns

sample4

### Multiplying the two matrices

In [None]:
sample5 = pd.DataFrame(np.multiply(np.matrix(final_df.iloc[:,1:]), np.matrix(sample4)),columns=final_df.columns[1:])
sample5.index = final_df.columns[1:]
sample5.reset_index(inplace=True)
sample5.rename(columns={'index':'user_id'},inplace=True)
sample5.head()

The matrix above is the final submission file. It shows the match percentage between every pair of users and follows the instructions given in the **note** of the problem.
**My score is 97.87761**.