# Objective

The OkCupid profiles dataset stores free text as essays. Each user is asked to fill in essays covering different dimensions. We found that users did not fill in every essay, but each user filled in at least one. We therefore combined all essays for each user into a single column named essay.
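
As a minimal sketch of the combining step, the toy frame below stands in for the real profiles (its column names and values are assumptions for illustration). Casting a missing essay with `astype(str)` would write the literal string `nan` into the text, so this sketch drops missing values before joining.

```python
import pandas as pd

# Toy frame standing in for the real profiles; values are made up.
toy = pd.DataFrame({
    "essay0": ["i love hiking", None],
    "essay1": [None, "coffee and books"],
    "essay2": ["and cooking", None],
})

essay_cols = [c for c in toy.columns if c.startswith("essay")]

# Join only the essays a user actually filled in, so missing values
# do not end up in the text as the string "nan".
toy["essay"] = toy[essay_cols].apply(
    lambda row: " ".join(row.dropna().astype(str)), axis=1
)
print(toy["essay"].tolist())
```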

Next, we want to study user segments and interests. To do this, we categorize the essays into topics using different approaches. These topics help us understand users' interests and behaviour, which in turn helps us provide recommendations to each user.

We apply the following steps to achieve our objective:
* Combine columns essay0 to essay9 as essay
* Preprocess the essay text
* Perform EDA
	1. Word frequency distribution
	2. Part-of-speech tagging
* Identify users' interests and behaviour using topic modeling

In [None]:
%matplotlib inline
from IPython.display import display
from bokeh.io import output_notebook
from bokeh.models import Label
from bokeh.plotting import figure, output_file, show
from collections import Counter
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import linear_kernel
from textblob import TextBlob
from tqdm import tqdm
import ast
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import re
import scipy.stats as stats
from scipy.sparse import csr_matrix, csc_matrix
from scipy import sparse
import seaborn as sb
import spacy
import string
output_notebook()
pd.options.mode.chained_assignment = None

## Load Dataset

In [None]:
full_df = pd.read_csv("../input/okcupid-profiles/okcupid_profiles.csv")
full_df['essay'] = full_df[full_df.columns[21:]].apply(lambda x: ' '.join(x.astype(str)), axis=1)
raw_data = full_df[["essay"]]
raw_data["essay"] = raw_data["essay"].astype(str)
raw_data.head()

## Computation Challenge
On the full text data, every stage from preprocessing to each topic modeling technique is computationally expensive, so we decided to run our analysis on samples.

Our sampling approach:
*  Random 10% sample of the main dataset. [Iteration-1]
*  Selective sampling across each categorical dimension. [Iteration-2]
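
The two sampling ideas above can be sketched on a toy frame. This is one possible reading of "selective sampling" (sampling within each category), not necessarily the approach used in Iteration-2. The column `sex` and all values are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical toy profiles; "sex" stands in for any categorical column.
profiles = pd.DataFrame({
    "sex": ["m"] * 60 + ["f"] * 40,
    "essay": [f"essay text {i}" for i in range(100)],
})

# Iteration 1: plain random sample of 10% of the rows.
random_sample = profiles.sample(frac=0.10, random_state=0)

# Iteration 2 (assumed reading): sample 10% within each category,
# so every group keeps its share of the data.
stratified_sample = profiles.groupby("sex", group_keys=False).sample(
    frac=0.10, random_state=0
)

print(len(random_sample), len(stratified_sample))
print(stratified_sample["sex"].value_counts().to_dict())
```

Fixing `random_state` makes the sample reproducible between runs, which helps when comparing topic models.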

### Random Sampling [Iteration-1]

In [None]:
raw_data = raw_data.sample(frac=0.10)
raw_data["essay"] = raw_data["essay"].astype(str)
raw_data.info()

# Preprocessing

For preprocessing, we clean the text using the criteria below.
1.	Apply lower casing
2.	Remove punctuation
3.	Remove numbers
4.	Remove stop words
5.	Remove rare words
6.	Remove frequent words
7.	Apply lemmatization
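
Most of the steps above can be sketched with the standard library alone. This is a simplified illustration, not the class used in this notebook: the stop word list is a tiny made-up set, the helper names are hypothetical, and lemmatization is left out because it needs NLTK data.

```python
import re
import string
from collections import Counter

# Tiny illustrative stop word list; the notebook uses NLTK's English list.
STOP_WORDS = {"i", "a", "the", "and", "to", "in", "my", "is"}

def basic_clean(text):
    """Lowercase, strip punctuation and digits, then drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)
    return [w for w in text.split() if w not in STOP_WORDS]

def drop_extremes(docs, n_frequent=1, n_rare=1):
    """Remove the n most frequent and n least frequent words across all docs."""
    counts = Counter(w for doc in docs for w in doc)
    ranked = [w for w, _ in counts.most_common()]
    banned = set(ranked[:n_frequent]) | set(ranked[-n_rare:])
    return [[w for w in doc if w not in banned] for doc in docs]

docs = [basic_clean(t) for t in [
    "I love Music, music and 2 dogs!",
    "Music is my life; I walk the dogs in 2024.",
]]
print(docs)
print(drop_extremes(docs))
```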

In [None]:
class TextPreprocessor:
    APOSTROPHE = '\u2019'
    EMOTICONS_REGEX = r'[\U0001f600-\U0001f64f]+'
    DINGBATS_REGEX = r'[\U00002702-\U000027b0]+'
    TRANSPORT_AND_MAP_REGEX = r'[\U0001f680-\U0001f6c0]+'
    ENCLOSED_CHARS_REGEX = r'[\U000024c2-\U0001f251]+'
    MISC_REGEX = r'[\U000000a9-\U0001f999]'

    def make_lowercase(self, data_frame, column_name):
        data_frame[column_name] = data_frame[column_name].str.lower()
        #data_frame[column_name] = data_frame[column_name].apply(lambda texts: 
        print('make_lowercase applied')
        #print(data_frame[column_name])
        return data_frame

    def remove_punctuation(self, data_frame, column_name):
        PUNCT_TO_REMOVE = string.punctuation
        data_frame[column_name] = data_frame[column_name].apply(
            lambda text: text.translate(str.maketrans('', '', PUNCT_TO_REMOVE)))
        print('remove_punctuation applied')
        #print(data_frame[column_name])
        return data_frame

    def remove_stop_words(self, data_frame, column_name):
        STOPWORDS = set(stopwords.words('english'))
        data_frame[column_name] = data_frame[column_name].apply(
            lambda text: " ".join([word for word in str(text).split() if word not in STOPWORDS]))
        print('remove_stop_words applied')
        #print(data_frame[column_name])
        return data_frame

    def remove_frequent_words(self, data_frame, column_name):
        cnt = Counter()
        for text in data_frame[column_name].values:
            for word in text.split():
                cnt[word] += 1
        FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])
        data_frame[column_name] = data_frame[column_name].apply(
            lambda text: " ".join([word for word in str(text).split() if word not in FREQWORDS]))
        print('remove_frequent_words applied')
        #print(data_frame[column_name])
        return data_frame

    def remove_rare_words(self, data_frame, column_name, max_rare_words_count=10):
        cnt = Counter()
        for text in data_frame[column_name].values:
            for word in text.split():
                cnt[word] += 1
        RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-max_rare_words_count - 1:-1]])
        data_frame[column_name] = data_frame[column_name].apply(
            lambda text: " ".join([word for word in str(text).split() if word not in RAREWORDS]))
        print('remove_rare_words applied')
        #data_frame.head(5)
        return data_frame

    def stem_words(self, data_frame, column_name):
        stemmer = PorterStemmer()
        data_frame[column_name] = data_frame[column_name].apply(
            lambda text: " ".join([stemmer.stem(word) for word in text.split()]))
        print('stem_words applied')
        #print(data_frame[column_name])
        return data_frame

    def lemmatize_words(self, data_frame, column_name):
        lemmatizer = WordNetLemmatizer()
        data_frame[column_name] = data_frame[column_name].apply(
            lambda text: " ".join([lemmatizer.lemmatize(word) for word in text.split()]))
        print('lemmatize_words applied')
        # print(data_frame[column_name])
        return data_frame
    
    def remove_numbers(self, data_frame, column_name):
        number_pattern = r'\d+'
        data_frame[column_name] = data_frame[column_name].apply(
            lambda text: re.sub(pattern=number_pattern, repl=" ", string=text))
        print('remove_numbers applied')
        return data_frame


    def lemmatize_words_v2(self, data_frame, column_name):
        lemmatizer = WordNetLemmatizer()
        wordnet_map = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}
        # pos_tagged_text = nltk.pos_tag(text.split())
        data_frame[column_name] = data_frame[column_name].apply(lambda text: " ".join(
            [lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in
             nltk.pos_tag(text.split())]))
        print('lemmatize_words_v2 applied')
        #print(data_frame[column_name])
        return data_frame

    def tokenize(self, data_frame, column_name):
        data_frame[column_name] = data_frame[column_name].apply(lambda text: nltk.tokenize.word_tokenize(text))
        print('tokenize applied')
        #print(data_frame[column_name])
        return data_frame


    def clean_text(self, data_frame, column_name):     
        data_frame_local = self.make_lowercase(data_frame, column_name)
        data_frame_local = self.remove_punctuation(data_frame_local, column_name)
        data_frame_local = self.remove_numbers(data_frame_local, column_name)
        data_frame_local = self.remove_stop_words(data_frame_local, column_name)
        data_frame_local = self.remove_rare_words(data_frame_local, column_name)
        data_frame_local = self.remove_frequent_words(data_frame_local, column_name)
        data_frame_local = self.lemmatize_words_v2(data_frame_local, column_name)
        #data_frame_local = self.tokenize(data_frame_local, column_name)
        return data_frame_local


    @staticmethod
    def remove_emojis(data):
        # Strip emoji and pictographic characters from every word in `data`.
        # The patterns are class attributes, so they are referenced through the class.
        patterns = [
            TextPreprocessor.EMOTICONS_REGEX,
            TextPreprocessor.ENCLOSED_CHARS_REGEX,
            TextPreprocessor.DINGBATS_REGEX,
            TextPreprocessor.TRANSPORT_AND_MAP_REGEX,
            TextPreprocessor.MISC_REGEX,
        ]
        result = []
        for word in data:
            for pattern in patterns:
                word = re.sub(pattern, '', word)
            result.append(word)
        return result


    @staticmethod
    def remove_empty_strings(data):
        # Drop words that became empty after cleaning.
        return [word for word in data if word != '']

### TextPreprocessor
We developed the TextPreprocessor class to apply all of the preprocessing logic in one go.

In [None]:
text_processor = TextPreprocessor()
reindexed_data = text_processor.clean_text(raw_data, 'essay')
reindexed_data.head(5)

#### Re-indexing the data frame
The current data frame was randomly sampled, so its index no longer runs consecutively from 0. We need to re-index the dataset.
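
A small sketch of why re-indexing matters, using a made-up five-row frame: after sampling, the index still carries the original row labels, so label-based lookups such as `series[0]` may fail or return an unexpected row.

```python
import pandas as pd

df = pd.DataFrame({"essay": ["a", "b", "c", "d", "e"]})
sampled = df.sample(frac=0.6, random_state=0)

# After sampling, the index still carries the original row labels.
print(sampled.index.tolist())

# reset_index(drop=True) renumbers rows 0..n-1 and discards the old labels.
sampled = sampled.reset_index(drop=True)
print(sampled.index.tolist())
```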

In [None]:
reindexed_data["essay"] = reindexed_data["essay"].astype(str)
reindexed_data.reset_index(drop=True, inplace=True)
reindexed_data_values = reindexed_data["essay"]
reindexed_data.head(5)

# Exploratory Data Analysis
We apply some basic exploratory analysis, covering the following:
*   Words frequency distribution
*   Part of speech tagging

In [None]:
# Define helper functions
def get_top_n_words(n_top_words, count_vectorizer, text_data):
    '''
    returns a tuple of the top n words in a sample and their 
    accompanying counts, given a CountVectorizer object and text sample
    '''
    vectorized_headlines = count_vectorizer.fit_transform(text_data.values)
    vectorized_total = np.sum(vectorized_headlines, axis=0)
    word_indices = np.flip(np.argsort(vectorized_total)[0,:], 1)
    word_values = np.flip(np.sort(vectorized_total)[0,:],1)
    
    word_vectors = np.zeros((n_top_words, vectorized_headlines.shape[1]))
    for i in range(n_top_words):
        word_vectors[i,word_indices[0,i]] = 1

    words = [word[0].encode('ascii').decode('utf-8') for 
             word in count_vectorizer.inverse_transform(word_vectors)]

    return (words, word_values[0,:n_top_words].tolist()[0])

### Words frequency distribution
Here we can evaluate the top 50 words and their frequency.

In [None]:
count_vectorizer = CountVectorizer(stop_words='english')
words, word_values = get_top_n_words(n_top_words=50, count_vectorizer=count_vectorizer, text_data=reindexed_data["essay"])

fig, ax = plt.subplots(figsize=(20,8))
ax.bar(range(len(words)), word_values);
ax.set_xticks(range(len(words)));
ax.set_xticklabels(words, rotation='vertical');
ax.set_title('Top words in OkCupid dataset (excluding stop words)');
ax.set_xlabel('Word');
ax.set_ylabel('Number of occurrences');
plt.show()

Here we can see the distribution of the top 50 words. Words such as work, movie, food, and book are among the most frequent words users wrote in their profiles.

### Words statistics

In [None]:
tagged_essays = [TextBlob(reindexed_data["essay"][i]).pos_tags for i in range(reindexed_data["essay"].shape[0])]

In [None]:
tagged_essays_df = pd.DataFrame({'tags':tagged_essays})

word_counts = [] 
pos_counts = {}

for eassy in tagged_essays_df[u'tags']:
    word_counts.append(len(eassy))
    for tag in eassy:
        if tag[1] in pos_counts:
            pos_counts[tag[1]] += 1
        else:
            pos_counts[tag[1]] = 1
            
print('Total number of words: ', np.sum(word_counts))
print('Mean number of words per essay: ', np.mean(word_counts))

Based on this analysis, the sample contains about **1.1 million** words, with an average of about **190 words** per essay.

#### Words distribution

In [None]:
y = stats.norm.pdf(np.linspace(0,14,50), np.mean(word_counts), np.std(word_counts))

fig, ax = plt.subplots(figsize=(18,8))
ax.hist(word_counts, bins=range(1,14), density=True);
ax.plot(np.linspace(0,14,50), y, 'r--', linewidth=1);
ax.set_title('Essay word lengths');
ax.set_xticks(range(1,14));
ax.set_xlabel('Number of words');
plt.show()

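The histogram shows no usable distribution for this sample. The bins cover only 1 to 13 words, while essays average about 190 words, so nearly all essays fall outside the plotted range.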
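
One way to get a readable histogram is to derive the bin edges from the data instead of hard-coding them. The sketch below uses synthetic word counts centred near 190 words (an assumption for illustration, not the real sample) and compares fixed bins with data-driven bins.

```python
import numpy as np

# Hypothetical word counts per essay, centred near 190 words as a stand-in.
rng = np.random.default_rng(0)
word_counts = rng.normal(loc=190, scale=60, size=1000).clip(min=1).astype(int)

# Fixed bins of 1..13 words capture almost none of these essays.
inside_fixed = np.sum((word_counts >= 1) & (word_counts < 14))

# Bin edges spanning the observed range cover every essay.
edges = np.linspace(word_counts.min(), word_counts.max(), 31)
hist, _ = np.histogram(word_counts, bins=edges)

print(inside_fixed, hist.sum())
```

The same `edges` array could be passed as `bins=` to `ax.hist` in the plotting cell.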

#### Part-of-speech tagging for essays

In [None]:
pos_sorted_types = sorted(pos_counts, key=pos_counts.__getitem__, reverse=True)
pos_sorted_counts = sorted(pos_counts.values(), reverse=True)

fig, ax = plt.subplots(figsize=(18,8))
ax.bar(range(len(pos_counts)), pos_sorted_counts);
ax.set_xticks(range(len(pos_counts)));
ax.set_xticklabels(pos_sorted_types);
ax.set_title('Part-of-Speech Tagging for Essay Corpus');
ax.set_xlabel('Type of Word');

According to the POS tagging, the three most frequent tags in our corpus are nouns (NN), adjectives (JJ), and adverbs (RB).
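
As a hedged alternative for gathering the tag counts, `collections.Counter` keeps each tag paired with its count while sorting. The tagged pairs below are made up and only mimic the `(word, tag)` shape of the tagged essays.

```python
from collections import Counter

# Hypothetical (word, tag) pairs in the same shape as the tagged essays.
tagged_essays = [
    [("love", "NN"), ("good", "JJ"), ("food", "NN")],
    [("really", "RB"), ("enjoy", "VB"), ("music", "NN")],
]

pos_counts = Counter(tag for essay in tagged_essays for _, tag in essay)

# most_common() returns (tag, count) pairs sorted by count,
# so labels and bar heights stay aligned.
tags, counts = zip(*pos_counts.most_common())
print(tags, counts)
```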

# Topic Modeling
For our OkCupid text corpus, we evaluate clustering algorithms for categorizing users' interests and behaviour into topics. Because of computational limits, we use a random 10% sample of the full dataset. In the first iteration we evaluate this random sample; in the second iteration we will use a selective random sample.

#### Preprocessing
We have a sampled corpus and need to extract features from it. We use scikit-learn's CountVectorizer to build the document-term matrix. This matrix has dimensions **n x K**, where **n** is the number of essays and **K** is the number of distinct words (capped at 40,000 by `max_features`).
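
A small sketch of the document-term matrix on three made-up essays shows the **n x K** shape directly. The toy corpus is an assumption for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_essays = [
    "music travel food",
    "food cooking travel travel",
    "books music",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(toy_essays)

# n = 3 essays, K = 5 distinct words
print(dtm.shape)
# Columns follow the alphabetically sorted vocabulary.
print(sorted(vectorizer.vocabulary_))
print(dtm.toarray())
```

Each row is one essay and each column counts one word, which is why the matrix becomes very sparse on a real vocabulary.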

In [None]:
small_count_vectorizer = CountVectorizer(stop_words='english', max_features=40000)
small_text_sample = reindexed_data["essay"]

print('Essay before vectorization: \n{}'.format(small_text_sample[10]))

small_document_term_matrix = small_count_vectorizer.fit_transform(small_text_sample)

print('Essay after vectorization: \n{}'.format(small_document_term_matrix[10]))

The document-term matrix above is high-rank and sparse. For this reason we choose Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Both algorithms take the document-term matrix as input and output an **n x N** topic matrix, where N is the number of topic categories and n is the number of essays in our sample. We start with N = 5.

In [None]:
# Number of topic categories
N = 5

#### Latent Semantic Analysis
LSA applies a truncated singular value decomposition to the high-rank, sparse document-term matrix, keeping only the N largest singular values.
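
The truncation idea can be sketched with plain NumPy on a tiny made-up matrix. This is an illustration of the decomposition only; the notebook itself relies on scikit-learn's TruncatedSVD, which works on sparse input and uses a different solver.

```python
import numpy as np

# Toy document-term matrix: 4 essays x 5 words (counts are made up).
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

k = 2  # number of topics to keep

# Full SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Truncation keeps only the k largest singular values.
topic_matrix = U[:, :k] * s[:k]   # n x k: essays in topic space
topic_words = Vt[:k, :]           # k x K: word loadings per topic

print(topic_matrix.shape, topic_words.shape)
```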

In [None]:
lsa_model = TruncatedSVD(n_components=N)
lsa_topic_matrix = lsa_model.fit_transform(small_document_term_matrix)

In [None]:
# Define helper functions
def get_keys(topic_matrix):
    '''
    returns an integer list of predicted topic 
    categories for a given topic matrix
    '''
    keys = topic_matrix.argmax(axis=1).tolist()
    return keys

def keys_to_counts(keys):
    '''
    returns a tuple of topic categories and their 
    accompanying magnitudes for a given list of keys
    '''
    count_pairs = Counter(keys).items()
    categories = [pair[0] for pair in count_pairs]
    counts = [pair[1] for pair in count_pairs]
    return (categories, counts)

In [None]:
lsa_keys = get_keys(lsa_topic_matrix)
lsa_categories, lsa_counts = keys_to_counts(lsa_keys)

In [None]:
# Define helper functions
def get_top_n_words(n, keys, document_term_matrix, count_vectorizer):
    '''
    returns a list of n_topic strings, where each string contains the n most common 
    words in a predicted category, in order
    '''
    top_word_indices = []
    for topic in range(N):
        temp_vector_sum = 0
        for i in range(len(keys)):
            if keys[i] == topic:
                temp_vector_sum += document_term_matrix[i]
        temp_vector_sum = sparse.csr_matrix(temp_vector_sum)
        temp_vector_sum = temp_vector_sum.toarray()
        top_n_word_indices = np.flip(np.argsort(temp_vector_sum)[0][-n:],0)
        top_word_indices.append(top_n_word_indices)   
    top_words = []
    for topic in top_word_indices:
        topic_words = []
        for index in topic:
            temp_word_vector = np.zeros((1,document_term_matrix.shape[1]))
            temp_word_vector[:,index] = 1
            the_word = count_vectorizer.inverse_transform(temp_word_vector)[0][0]
            topic_words.append(the_word.encode('ascii').decode('utf-8'))
        top_words.append(" ".join(topic_words))         
    return top_words

In [None]:
top_n_words_lsa = get_top_n_words(20, lsa_keys, small_document_term_matrix, small_count_vectorizer)

for i in range(len(top_n_words_lsa)):
    print("Topic {}: ".format(i+1), top_n_words_lsa[i])

Here we have predicted the topic category for each sampled essay. Each topic category is summarized by its top 20 words to give an intuition for the topic.
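
The assignment rule behind these predictions is an argmax over each row of the topic matrix. The sketch below uses a made-up matrix of 4 essays and 3 topics. For LSA the components can be negative, so taking the largest value is a heuristic rather than a probability-based choice.

```python
from collections import Counter

# Hypothetical topic matrix: 4 essays x 3 topics (weights are made up).
topic_matrix = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]

# Assign each essay to the topic with the largest weight.
keys = [max(range(len(row)), key=row.__getitem__) for row in topic_matrix]
counts = Counter(keys)

print(keys)
print(sorted(counts.items()))
```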

#### Topic essay frequency distribution
Here we visualize the frequency distribution of topic categories across the sampled essays.

In [None]:
top_3_words = get_top_n_words(3, lsa_keys, small_document_term_matrix, small_count_vectorizer)
labels = ['Topic {}: \n'.format(i) + top_3_words[i] for i in lsa_categories]

fig, ax = plt.subplots(figsize=(16,8))
ax.bar(lsa_categories, lsa_counts);
ax.set_xticks(lsa_categories);
ax.set_xticklabels(labels);
ax.set_ylabel('Number of essays');
ax.set_title('LSA topic counts');
plt.show()

Based on LSA, the majority of our users belong to topic_0, which is a concern for recommendations. This bar chart also does not provide a view of the clusters. We therefore use a dimensionality-reduction technique called t-SNE to better explain the topic categories.

In [None]:
tsne_lsa_model = TSNE(n_components=2, perplexity=50, learning_rate=100, 
                        n_iter=2000, verbose=1, random_state=0, angle=0.75)
tsne_lsa_vectors = tsne_lsa_model.fit_transform(lsa_topic_matrix)

In [None]:
keys = lsa_keys
two_dim_vectors = tsne_lsa_vectors


mean_topic_vectors = []
for t in range(N):
    articles_in_that_topic = []
    for i in range(len(keys)):
        if keys[i] == t:
            articles_in_that_topic.append(two_dim_vectors[i])    
    #-------------------------------
#     temp_vector_sum = sparse.csr_matrix(temp_vector_sum)
#     temp_vector_sum = temp_vector_sum.toarray()
    #-------------------------------
    
    if len(articles_in_that_topic) == 0:
        print(f'length = {len(articles_in_that_topic)}')
        print('skip')
        continue
    print(f'length = {len(articles_in_that_topic)}')
    articles_in_that_topic = np.vstack(articles_in_that_topic)
    mean_article_in_that_topic = np.mean(articles_in_that_topic, axis=0)
    print(mean_article_in_that_topic)
    print(type(mean_article_in_that_topic))
    print(mean_article_in_that_topic.ndim)
    mean_topic_vectors.append(mean_article_in_that_topic)

In [None]:
# Define helper functions
def get_mean_topic_vectors(keys, two_dim_vectors):
    '''
    returns a list of centroid vectors from each predicted topic category
    '''
    mean_topic_vectors = []
    for t in range(N):
        articles_in_that_topic = []
        for i in range(len(keys)):
            if keys[i] == t:
                articles_in_that_topic.append(two_dim_vectors[i])    
        if len(articles_in_that_topic) == 0:
#             print(f'length = {len(articles_in_that_topic)}')
#             print('skip')
            mean_topic_vectors.append(np.array([0.0 , 0.0]))
            continue
        articles_in_that_topic = np.vstack(articles_in_that_topic)
        mean_article_in_that_topic = np.mean(articles_in_that_topic, axis=0)
        
        mean_topic_vectors.append(mean_article_in_that_topic)
    return mean_topic_vectors

In [None]:
colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c",
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5",
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f",
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5" ])
colormap = colormap[:N]

In [None]:
top_3_words_lsa = get_top_n_words(3, lsa_keys, small_document_term_matrix, small_count_vectorizer)
lsa_mean_topic_vectors = get_mean_topic_vectors(lsa_keys, tsne_lsa_vectors)

plot = figure(title="t-SNE Clustering of {} LSA Topics".format(N), plot_width=700, plot_height=700)
plot.scatter(x=tsne_lsa_vectors[:,0], y=tsne_lsa_vectors[:,1], color=colormap[lsa_keys])

for t in range(N):
    label = Label(x=lsa_mean_topic_vectors[t][0], y=lsa_mean_topic_vectors[t][1], 
                  text=top_3_words_lsa[t], text_color=colormap[t])
    plot.add_layout(label)
    
show(plot)

Here there is no clear separation of topics. The t-SNE clusters resemble the LSA topic frequency distribution, but we failed to explain the clusters clearly.

### Latent Dirichlet Allocation
Now we repeat this process for LDA, which is a generative probabilistic model.
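
To make "generative" concrete, the sketch below samples one toy document the way LDA assumes text is produced: draw a topic mixture for the document, then for each word draw a topic and a word from that topic. The vocabulary, the prior value 0.5, and the sizes are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["music", "travel", "food", "books", "work", "movies"]
n_topics, doc_length = 2, 8

# Each topic is a distribution over the vocabulary (Dirichlet prior, value assumed).
topic_word = rng.dirichlet(alpha=[0.5] * len(vocab), size=n_topics)

# Each document gets its own mixture over topics.
doc_topic = rng.dirichlet(alpha=[0.5] * n_topics)

words = []
for _ in range(doc_length):
    z = rng.choice(n_topics, p=doc_topic)         # pick a topic for this word
    w = rng.choice(len(vocab), p=topic_word[z])   # pick a word from that topic
    words.append(vocab[w])

print(doc_topic.round(2), words)
```

Fitting LDA runs this process in reverse: given only the words, it estimates the topic mixtures and topic-word distributions.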

In [None]:
lda_model = LatentDirichletAllocation(n_components=N, learning_method='online', 
                                          random_state=0, verbose=0)
lda_topic_matrix = lda_model.fit_transform(small_document_term_matrix)

In [None]:
lda_keys = get_keys(lda_topic_matrix)
lda_categories, lda_counts = keys_to_counts(lda_keys)

In [None]:
top_n_words_lda = get_top_n_words(20, lda_keys, small_document_term_matrix, small_count_vectorizer)

for i in range(len(top_n_words_lda)):
    print("Topic {}: ".format(i+1), top_n_words_lda[i])

Here we have predicted the topic category for each sampled essay. Each topic category is summarized by its top 20 words to give an intuition for the topic.


#### Topic essay frequency distribution
Here we visualize the frequency distribution of topic categories across the sampled essays.

In [None]:
top_3_words = get_top_n_words(3, lda_keys, small_document_term_matrix, small_count_vectorizer)
labels = ['Topic {}: \n'.format(i) + top_3_words[i] for i in lda_categories]

fig, ax = plt.subplots(figsize=(16,8))
ax.bar(lda_categories, lda_counts);
ax.set_xticks(lda_categories);
ax.set_xticklabels(labels);
ax.set_title('LDA topic counts');
ax.set_ylabel('Number of essays');

Based on LDA, the majority of our users belong to topic_0, followed by topic 2, which is a concern for recommendations. This bar chart also does not provide a view of the clusters. We therefore use a dimensionality-reduction technique called t-SNE to better explain the topic categories.

In [None]:
tsne_lda_model = TSNE(n_components=2, perplexity=50, learning_rate=100, 
                        n_iter=2000, verbose=1, random_state=0, angle=0.75)
tsne_lda_vectors = tsne_lda_model.fit_transform(lda_topic_matrix)

In [None]:
top_3_words_lda = get_top_n_words(3, lda_keys, small_document_term_matrix, small_count_vectorizer)
lda_mean_topic_vectors = get_mean_topic_vectors(lda_keys, tsne_lda_vectors)

plot = figure(title="t-SNE Clustering of {} LDA Topics".format(N), plot_width=700, plot_height=700)
plot.scatter(x=tsne_lda_vectors[:,0], y=tsne_lda_vectors[:,1], color=colormap[lda_keys])

for t in range(N):
    label = Label(x=lda_mean_topic_vectors[t][0], y=lda_mean_topic_vectors[t][1], 
                  text=top_3_words_lda[t], text_color=colormap[t])
    plot.add_layout(label)

show(plot)

Based on the clustering graph above, we are unable to explain the topic separation clearly. We conclude that random sampling does not provide a clear perspective.