
# Tinder Google Play Reviews

Custom Dataset Provided By: 


### Domain Knowledge

We have no domain knowledge or background information about this dataset, so we will keep the analysis exploratory and let the data guide our modeling choices.

How can we gather a story from the dataset?
What can we determine?

The data is primarily string data, with two integer columns (`score` and `thumbsUpCount`). There are several directions we could take with this dataset:
- Predict thumbsUpCount
- Is there a relationship between username and content
- Relationship between content and at
- Can we classify reviews into topics
- What is the most common level of sentiment from reviews

Techniques we could use include:

- Principal Component Analysis (PCA)
- Dimensionality Reduction
- Linear Regression
- Random Forest
- Latent Dirichlet Allocation

and so forth. For this project, we aim to answer three of these questions by the end of the analysis.


In [43]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import spacy

import pyLDAvis
import pyLDAvis.gensim  # note: renamed to pyLDAvis.gensim_models in pyLDAvis 3.3+
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [44]:
# Load and inspect our dataset
tinder = pd.read_csv('/Users/jasonrobinson/Documents/Projects/tinder_google_play_reviews.csv')

print(tinder.shape)
tinder.head(2)

(530253, 10)


Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt
0,gp:AOqpTOH2EGdi4f1cyGMD3yuybAw8AAEz61LalnOoPtr...,Christina B.,https://play-lh.googleusercontent.com/a-/AOh14...,Won't let me link my spotify,1,0,13.3.0,2022-03-18 22:00:58,,
1,gp:AOqpTOEAwIce8kQ2UdDFb0_RzaZGhjwyHTIk3mI1IaZ...,Franscois Matthee,https://play-lh.googleusercontent.com/a/AATXAJ...,This is not a dating app its a shity version o...,1,0,,2022-03-18 21:57:41,,


In [45]:
import spacy.cli
#spacy.cli.download("en_core_web_md")

In [46]:
# See non-nulls and data types
print(tinder['content'].shape)
tinder.info()

(530253,)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 530253 entries, 0 to 530252
Data columns (total 10 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   reviewId              530253 non-null  object
 1   userName              530248 non-null  object
 2   userImage             530253 non-null  object
 3   content               528908 non-null  object
 4   score                 530253 non-null  int64 
 5   thumbsUpCount         530253 non-null  int64 
 6   reviewCreatedVersion  423368 non-null  object
 7   at                    530253 non-null  object
 8   replyContent          47714 non-null   object
 9   repliedAt             47714 non-null   object
dtypes: int64(2), object(8)
memory usage: 40.5+ MB


In [47]:
# We have some missing values, take a closer look at content
tinder['content'].isnull().sum()

1345

In [48]:
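# Caution: dropna(axis=0) drops rows with a NaN in ANY column, not just the
# 1345 rows with missing content, so only reviews that also have a version
# and a developer reply remain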
tinder = tinder.dropna(axis=0)
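
Note that `dropna(axis=0)` removes every row with a missing value in any column, which is why the dataset shrinks to 41,091 rows below rather than losing only 1,345. A minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration, not rows from the dataset) shows how `subset` would restrict the drop to missing review text:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking three of the review columns; values are invented.
toy = pd.DataFrame({
    "content": ["Keeps crashing.", np.nan, "Great app", "Too many bots"],
    "score": [1, 3, 5, 2],
    "replyContent": [np.nan, np.nan, "Thanks!", np.nan],
})

# Drops any row with a NaN in ANY column: only the replied-to review survives.
dropped_any = toy.dropna(axis=0)

# Drops only rows whose review text is missing: three reviews survive.
dropped_content = toy.dropna(subset=["content"])

print(len(dropped_any), len(dropped_content))
```

Which variant is appropriate depends on whether the analysis should cover all reviews or only those that received a reply.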

In [50]:
# dropna() trimmed our dataset down to 41,091 rows
tinder['content']

613                                               Facebook?
650       Took away the daily free super like and now yo...
830       Only works like 30% of time. I almost always g...
959       app is broken. never found a single match insp...
2104      I've been using Tinder since 2019. I've met a ...
                                ...                        
530245                            Y Facebook sign up only ?
530247    Buggy, after login the hour glass just keep sp...
530249    Tinder is extremely buggy on the galaxy S4 act...
530250                                      Keeps crashing.
530251    Crashes. Doesn't load. Total failure. Take it ...
Name: content, Length: 41091, dtype: object

In [6]:
# Get our summary statistics
#tinder.describe(include='all')

To explore the relationship between **content** and **at**, we filter down to those two columns and check whether sentiment correlates with date and time.

In [11]:
# Look at 10 random rows
tinder[['content', 'at']].sample(10)

Unnamed: 0,content,at
335682,,2017-06-29 08:42:33
289654,,2018-03-11 22:03:28
116987,,2020-05-21 06:27:31
2036,,2022-03-07 03:51:00
297777,,2018-01-18 22:59:11
252529,,2018-10-06 03:25:30
140325,,2020-01-21 14:13:14
138553,,2020-01-30 07:47:12
87865,,2020-10-19 12:06:19
484063,,2015-01-08 01:31:32


In [None]:
# Let's confirm that our review text is stored as object (string) dtype,
# which is what we need to preprocess our text
tinder['content'].dtype

In [53]:
# Convert string to datetime
tinder['at'] = pd.to_datetime(tinder['at'])

In [54]:
# Confirm the conversion (roughly 8.7 years of data, July 2013 to March 2022)
tinder['at'].sort_values(ascending=False)

613      2022-03-15 10:13:13
650      2022-03-15 03:53:09
830      2022-03-14 05:19:01
959      2022-03-13 12:15:34
2104     2022-03-06 16:56:47
                 ...        
530245   2013-07-15 23:44:37
530247   2013-07-15 23:29:17
530249   2013-07-15 22:43:41
530250   2013-07-15 22:27:15
530251   2013-07-15 22:20:31
Name: at, Length: 41091, dtype: datetime64[ns]
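
Once `at` is a datetime column, one way to look for a time trend is to aggregate by calendar month. The sketch below uses the star `score` as a rough stand-in for sentiment, which is an assumption on our part, and runs on a small invented frame rather than the real data:

```python
import pandas as pd

# Invented timestamps and ratings, for illustration only.
toy = pd.DataFrame({
    "at": ["2022-01-03 10:00:00", "2022-01-20 08:30:00",
           "2022-02-11 22:15:00", "2022-02-12 07:45:00"],
    "score": [1, 3, 5, 4],
})
toy["at"] = pd.to_datetime(toy["at"])

# Group reviews by calendar month and average the star rating.
monthly = toy.groupby(toy["at"].dt.to_period("M"))["score"].mean()
print(monthly)
```

The same `groupby` could be applied to `tinder` with a text-based sentiment score in place of `score` once one has been computed.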

***


### Tokenization with spaCy


In [51]:
# We have the option of using spaCy, but for demonstration purposes
# it is also good to know how to write your own cleaning steps

# Collapse runs of whitespace (raw strings avoid invalid-escape warnings)
tinder['content'] = tinder['content'].apply(lambda x: re.sub(r'\s+', ' ', x))

# Remove email headers of the form "From: user@host"
tinder['content'] = tinder['content'].apply(lambda x: re.sub(r'From: \S+@\S+', '', x))

# Remove non-alphabetic characters (this also strips digits)
tinder['content'] = tinder['content'].apply(lambda x: re.sub(r'[^a-zA-Z]', ' ', x))

# Remove extra whitespace and lowercase text
tinder['content'] = tinder['content'].apply(lambda x: ' '.join(x.lower().split()))
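
The four passes above can also be folded into a single function so the column is traversed once. This is a sketch only; `clean_text` is a hypothetical helper name and the sample string is made up:

```python
import re

def clean_text(text: str) -> str:
    """Apply the same four cleaning steps in order to one review."""
    text = re.sub(r"\s+", " ", text)           # collapse whitespace
    text = re.sub(r"From: \S+@\S+", "", text)  # drop "From: user@host"
    text = re.sub(r"[^a-zA-Z]", " ", text)     # keep letters only
    return " ".join(text.lower().split())      # lowercase, trim spaces

print(clean_text("App  crashes 3x a day!!\nFrom: me@mail.com Fix it."))
```

It could then be applied with `tinder['content'].apply(clean_text)`. Note that stripping digits turns "3x" into "x", which may or may not be desirable.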

In [56]:
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)

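# NOTE: get_lemmas must be defined before this cell runs; it is not defined
# anywhere above, so the call raises the NameError shown in the output
tinder['lemmas'] = tinder['content'].parallel_apply(get_lemmas)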

INFO: Pandarallel will run on 12 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


NameError: name 'get_lemmas' is not defined
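
The notebook never defines `get_lemmas`, so its intended behavior is unknown. A plausible sketch, assuming it should mirror the stop-word, punctuation and length filter used in the lemmatization cell further down, might look like this. The `nlp` parameter, the `min_len` default and the `toy_nlp` stand-in are all assumptions introduced here so the example runs without downloading a model:

```python
from types import SimpleNamespace

def get_lemmas(text, nlp, min_len=3):
    """Return lemmas of tokens that are neither stop words nor punctuation
    and whose text is at least `min_len` characters long."""
    return [
        tok.lemma_
        for tok in nlp(text)
        if not tok.is_stop and not tok.is_punct and len(tok.text) >= min_len
    ]

# Stand-in for a spaCy pipeline. It only mimics the four token attributes
# used above and is NOT a real lemmatizer (it just strips trailing "s").
def toy_nlp(text):
    stop_words = {"the", "is"}
    return [
        SimpleNamespace(text=w, lemma_=w.rstrip("s"),
                        is_stop=w in stop_words, is_punct=False)
        for w in text.split()
    ]

print(get_lemmas("the apps is crashing", toy_nlp))
```

With a real model loaded via `spacy.load`, the call would become something like `tinder['content'].apply(lambda x: get_lemmas(x, nlp))`.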

In [None]:
from tqdm import tqdm
tqdm.pandas()

In [57]:
# Keep the tagger enabled: in spaCy v3 the rule-based lemmatizer relies on POS tags
nlp = spacy.load("en_core_web_md", disable=['parser', 'ner'])

In [None]:
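# Lemmatize the cleaned text in `content` (there is no `clean_text` column),
# dropping stop words, punctuation and tokens of two characters or fewer
tinder['lemmas'] = tinder['content'].progress_apply(
    lambda x: [token.lemma_ for token in nlp(x)
               if not token.is_stop and not token.is_punct and len(token) > 2]
)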

In [None]:
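# We will work with spaCy and gensim just for demonstration purposes
nlp = spacy.load('en_core_web_lg')
nlp.Defaults.stop_words  # built-in English stop-word set

# str(Series) would parse the truncated printed repr, so parse one review instead
doc = nlp(tinder['content'].iloc[0])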

In [None]:
doc

In [None]:
#tokens = []
#
#""" Update those tokens w/o stopwords"""
#for doc in tokenizer.pipe(tinder['content'], batch_size=500):
#    
#    doc_tokens = []
#    
#    for token in doc:
#        if (token.is_stop == False) & (token.is_punct == False):
#            doc_tokens.append(token.text.lower())
#
#    tokens.append(doc_tokens)
#
#tinder['tokens'] = tokens

In [None]:
# Assumes the `tokens` column was created by the (commented-out) cell above
tinder['tokens'].head()

In [None]:
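import squarify  # treemap plotting; not imported at the top of the notebook

# `count` is defined in the next section and must be run first
wc = count(tinder['tokens'])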

wc_top20 = wc[wc['rank'] <= 20]

squarify.plot(sizes=wc_top20['pct_total'], label=wc_top20['word'], alpha=.8 )
plt.axis('off')
plt.show()

In [None]:
print(type(nlp.Defaults.stop_words))


### Tokenization with Gensim


In [None]:
# We could also decide to use a custom function of our own

#def tokenize(str):
#    idx = [x for x, v in enumerate(str) if v == '\"']
#    if len(idx) % 2 != 0:
#        idx = idx[:-1]
#    memory = {}
#    for i in range(0, len(idx), 2):
#        val = str[idx[i]:idx[i+1]+1]
#        key = "_"*(len(val)-1)+"{0}".format(i)
#        memory[key] = val
#        str = str.replace(memory[key], key, 1)        
#    return [memory.get(token, token) for token in str.split(",")] 

In [None]:
from collections import Counter

# Let's create a function which takes a corpus of documents and
# returns a dataframe of word counts for us to analyze.
def count(docs):
    word_counts = Counter()  # total occurrences of each word
    appears_in = Counter()   # number of documents containing each word

    total_docs = len(docs)

    for doc in docs:
        word_counts.update(doc)
        appears_in.update(set(doc))

    wc = pd.DataFrame(list(word_counts.items()), columns=['word', 'count'])

    # Rank words by frequency and compute each word's share of all tokens
    wc['rank'] = wc['count'].rank(method='first', ascending=False)
    total = wc['count'].sum()
    wc['pct_total'] = wc['count'] / total

    wc = wc.sort_values(by='rank')
    wc['cul_pct_total'] = wc['pct_total'].cumsum()

    # Attach document frequency and the share of documents containing each word
    ac = pd.DataFrame(list(appears_in.items()), columns=['word', 'appears_in'])
    wc = ac.merge(wc, on='word')
    wc['appears_in_pct'] = wc['appears_in'] / total_docs

    return wc.sort_values(by='rank')

In [None]:
# Apply the function
wc = count(tinder['tokens'])
print(wc.shape)
wc.head()

In [None]:
# Cumulative Distribution Plot
import seaborn as sns  # not imported at the top of the notebook

sns.lineplot(x='rank', y='cul_pct_total', data=wc);

In [None]:
wc[wc['rank'] <= 100]['cul_pct_total'].max()
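
The last expression asks what share of all tokens the 100 most frequent words account for. As a rough illustration of the same idea without pandas, here is a sketch on a tiny invented corpus; the `docs` list and the `coverage` helper are hypothetical and not part of the notebook:

```python
from collections import Counter
from itertools import accumulate

# Three made-up tokenized reviews.
docs = [["app", "crash", "app"], ["match", "app"], ["crash", "bot"]]

# Count every token across the corpus.
counts = Counter(tok for doc in docs for tok in doc)
total = sum(counts.values())

# Words from most to least frequent, with their running share of all tokens.
ranked = counts.most_common()
cumulative = list(accumulate(c / total for _, c in ranked))

def coverage(top_k):
    """Share of all tokens covered by the `top_k` most frequent words."""
    return cumulative[min(top_k, len(cumulative)) - 1]

print(ranked[:2], round(coverage(2), 3))
```

A steep early rise in this cumulative share means a small vocabulary dominates the reviews, which is typically what the cumulative distribution plot above is meant to reveal.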