___

<br>

![Imgur](http://i.imgur.com/UnPy2w0.png)


<br>

<br>

___


<br>





# StudyNest - *Personality Similarity Algorithm*


<br> 

- Applies the Rocchio algorithm and cosine similarity



- Computes each person's mean cosine similarity to the rest of the dataset (the centroid step of the Rocchio algorithm), then trains a KNeighborsRegressor model to predict that centroid closeness value
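
As a rough illustration of the "mean of cosine similarity" target described above, here is a minimal sketch on a made-up four-person table. The `toy` array and its values are assumptions for demonstration only and are not taken from `personality_data.csv`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the personality table: 4 people, 3 features
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
])

# sim[i, j] is the cosine similarity between person i and person j
sim = cosine_similarity(toy)

# Centroid closeness: each person's mean similarity to everyone (self included)
closeness = sim.mean(axis=0)
print(closeness)
```

People 0 and 3 have identical feature vectors, so they receive the same closeness value, and person 2, who overlaps with everyone, receives the highest one.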


<br>


________________________________________________________________________________________________________________

## Importing Libraries

<br>

- pandas and NumPy for data handling


- scikit-learn for the similarity algorithm and machine learning

<br>

In [1]:
# Import Pandas for Dataframe/Data Cleaning
import pandas as pd

# Import numpy for arrays
import numpy as np

# Import cosine_similarity for the similarity algorithm
from sklearn.metrics.pairwise import cosine_similarity

# Import K-Neighbors Regressor model for prediction
from sklearn.neighbors import KNeighborsRegressor

# Import train_test_split for training and validation
from sklearn.model_selection import train_test_split


## Cleaning Data

<br>

- Removing NA and Missing Values


- Factorizing the Country Strings


- Removing 'Source' and 'Elapsed' from the Dataset


- Removing outlier values for Age and Accuracy


- Normalizing the Dataset

<br>

In [2]:
# Reading the personality data
personality = pd.read_csv('personality_data.csv')

# Dropping the entire row if any value is missing or NA
personality.dropna(axis = 0, how = 'any', inplace = True)

In [3]:
# Factorizing the unique countries to numbers (e.g. US is 0, NZ is 1, IT is 2, ...); `levels` holds the country names in code order
labels, levels = pd.factorize(personality['country'].unique())

In [4]:
# Replacing the country strings with their integer codes
# pd.factorize numbers the countries in order of first appearance, so this gives the same codes
# as looking up each country's position in `levels` row by row, without the Python loop
personality['country'] = pd.factorize(personality['country'])[0]
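
To make the country encoding concrete, here is a small sketch of `pd.factorize` on a made-up series. The country values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical country column
countries = pd.Series(["US", "NZ", "US", "IT", "NZ"])

# codes: one integer per row; uniques: the country names in order of first appearance
codes, uniques = pd.factorize(countries)

print(codes.tolist())   # integer label for each row
print(list(uniques))    # lookup table from code back to country name
```

Keeping `uniques` around makes it possible to map the integer codes back to country names later.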

In [5]:
# Dropping the 'source' and 'elapsed' columns (they don't help the prediction)
personality.drop(columns = ['source', 'elapsed'], inplace = True)

# Dropping rows with accuracy over 100. The max valid value of accuracy is 100 and the min is 0.
personality = personality[personality['accuracy'] <= 100]

# Dropping rows with age over 80. This is to maintain the validity of the age values.
personality = personality[personality['age'] <= 80]

In [6]:
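# Normalizing the numerical data (scaling each column by its maximum)
for col in personality.columns:
    # Finding the max value of the column (named col_max to avoid shadowing the built-in max)
    col_max = personality[col].max()
    # Dividing every value in the column by the max
    personality[col] = personality[col] / col_max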
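
The same max-scaling can also be written without a loop. This is a minimal sketch on a made-up frame; the column names mirror the dataset, but the values are assumptions chosen so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical two-column frame standing in for the cleaned personality data
toy = pd.DataFrame({"age": [20, 40, 80], "accuracy": [50, 100, 75]})

# DataFrame.max() returns one maximum per column; division aligns on column names
scaled = toy / toy.max()
print(scaled)
```

After scaling, the largest value in each column is 1.0. Note that this assumes every column has a nonzero maximum.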


## Applying Similarity Algorithm (Cosine Similarity)  + Rocchio Algorithm

<br>

- Applying Cosine Similarity


- Includes the calculation of the centroid values of closeness in comparison to all the other values in the dataset (mean of cosine similarity) 

<br>

In [7]:
# Using cosine similarity to compare the first 10000 rows of the dataframe with each other
# There are 50000 rows of data; however, due to time and computation constraints, only 10000 rows are used.
# The result stays a NumPy array (a Python list of 10000 x 10000 floats would waste memory)
cos_im = cosine_similarity(personality[:10000], personality[:10000])

In [8]:
# Finding the mean of the cosine similarity
# Each mean is a row's centroid closeness: its average similarity to all 10000 rows used above (itself included)
cos_im = np.array(cos_im).mean(axis = 0)
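
Because cosine similarity is a dot product of unit-length vectors, the mean similarity of a row to all rows equals its dot product with the mean of the unit-length rows. The sketch below checks this on synthetic data; `X` is a random stand-in for `personality.values`, and whether this route is worthwhile for the full 50000 rows is an untested assumption.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
X = rng.random((500, 8))  # synthetic stand-in for personality.values

# Direct route: build the full n x n similarity matrix, then average each column
direct = cosine_similarity(X).mean(axis=0)

# Centroid route: scale each row to unit length, average the rows,
# then take one dot product per row (no n x n matrix needed)
X_unit = normalize(X)              # L2-normalizes each row by default
centroid = X_unit.mean(axis=0)
via_centroid = X_unit @ centroid

print(np.allclose(direct, via_centroid))
```

The centroid route needs memory proportional to the number of rows rather than its square, which may ease the computation constraint mentioned in the cell above.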

In [9]:
# Splitting the data into training and validation/test sets
X_train, X_test, y_train, y_test = train_test_split(personality.values[:10000], cos_im, test_size = 0.2, random_state = 42)

## Hyperparameter Searching

<br>

- A for loop searches for the best n_neighbors hyperparameter

<br>

In [10]:
# Defining n_search to store the scores for each model for n_neighbors from 1 to 19
n_search = []

# Try all n_neighbors from 1 to 19 (range(1, 20) stops before 20)
for i in range(1, 20):
    # Define the Regression Model
    knr = KNeighborsRegressor(n_neighbors = i) # TODO: weight the features to improve accuracy
    # Training the model with the training data
    knr.fit(X_train, y_train)
    # Scoring the trained model
    score = knr.score(X_test, y_test)
    # Appending the scores to n_search
    n_search.append(score)
    
# Apply numpy function argmax (add 1 because started searching at n_neighbors = 1) to find best hyperparameter
best_n = np.array(n_search).argmax() + 1
    
print('Best n_neighbors hyperparameter = ', best_n)

Best n_neighbors hyperparameter =  2
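
The loop above picks `n_neighbors` by scoring on the test set, which can make the final test score look better than it would on unseen data. One possible alternative, sketched here on synthetic data, is to choose the hyperparameter by cross-validation on the training set only. The data, the variable names, and the choice of 5 folds are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
X = rng.random((300, 5))                          # synthetic features
y = X.mean(axis=1) + rng.normal(0, 0.01, 300)     # synthetic target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Search n_neighbors from 1 to 19 using 5-fold cross-validation on the training data
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 20))},
    cv=5,
)
search.fit(X_train, y_train)

best_n_cv = search.best_params_["n_neighbors"]
print("Best n_neighbors (cross-validated) =", best_n_cv)

# The test set is touched only once, after the hyperparameter is fixed
print("Held-out R^2 =", search.score(X_test, y_test))
```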



## Training and Testing the Model

<br>

- Use KNeighborsRegressor Model and n_neighbors = 2 hyperparameter for best results


- Train the Model using training data


- Predict the centroid closeness value for two test inputs and compare the two predictions to see whether the two personality types are similar


- Treat X_test[0] and X_test[1] as similar if their predictions differ by less than 0.01, which is 10 percent of the 0.85 to 0.95 range of closeness values

<br>

In [11]:
# Defining the model with the best n_neighbors from the hyperparameter search
knr = KNeighborsRegressor(n_neighbors = best_n) # TODO: weight the features to improve accuracy
# Training the model with the training data
knr.fit(X_train, y_train)
# Predict first input data values (Personality A)
preds_0 = knr.predict([X_test[0]])
# Predict second input data values (Personality B)
preds_1 = knr.predict([X_test[1]])

# All the centroid closeness values are in between 0.85 and 0.95, a range of about 0.1
# Two personalities count as similar if their predictions differ by less than 0.01 (10 percent of that range),
# because there aren't that many unusual/out of the ordinary personalities

# The difference in closeness as a percentage of the range; a lower value means the two personalities are more similar
print('Similarity Percentage = ', abs(preds_0 - preds_1)[0] * 1000,'%\n') # Times 1000 = divide by the 0.1 range, then times 100
 


# Determining if the two personalities are similar (difference within 10 percent of the range)
if abs(preds_0 - preds_1)[0] < 0.01:
    print('X_test[0] is similar to X_test[1]')
else:
    print('X_test[0] is not similar to X_test[1]')

Similarity Percentage =  3.092817085004196 %

X_test[0] is similar to X_test[1]
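
The comparison above can be wrapped in a small helper so that any two feature vectors can be checked the same way. This is a sketch only: `is_similar` is a hypothetical name, and the tiny training set is made up so the result can be verified by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def is_similar(model, person_a, person_b, threshold=0.01):
    """Return (difference, similar?) for two feature vectors.

    The default threshold of 0.01 mirrors the 10 percent rule used above.
    """
    pred_a, pred_b = model.predict(np.vstack([person_a, person_b]))
    diff = abs(pred_a - pred_b)
    return float(diff), bool(diff < threshold)

# Made-up one-feature training data with closeness values between 0.85 and 0.95
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.90, 0.905, 0.95, 0.85])
model = KNeighborsRegressor(n_neighbors=1).fit(X_toy, y_toy)

print(is_similar(model, [0.0], [1.0]))  # predictions 0.90 and 0.905
print(is_similar(model, [2.0], [3.0]))  # predictions 0.95 and 0.85
```

With one neighbor, each query returns the target of its nearest training point, so the first pair falls within the threshold and the second does not.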


## Scoring the Model using the Test Set

<br>

- Calculating the R² score (coefficient of determination) on the test set using the model with the best hyperparameter

<br>

In [12]:
# Scoring the test set
knr.score(X_test, y_test)

0.8529519570839639
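
For regressors, scikit-learn's `score` method returns the coefficient of determination R². The sketch below computes R² by hand on made-up numbers and compares it with `sklearn.metrics.r2_score`. The arrays are illustrative assumptions, not model output.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted closeness values
y_true = np.array([0.90, 0.88, 0.93, 0.91])
y_pred = np.array([0.89, 0.89, 0.92, 0.91])

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A score of 1.0 means perfect predictions, while 0.0 means the model does no better than always predicting the mean of the targets.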



____

<br>

Copyright &copy;. &nbsp; All Rights Reserved. &nbsp; _**StudyNest**_

<br>

____