### TF-IDF Attempts

##### Resources:

- https://medium.com/free-code-camp/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3
- https://www.geeksforgeeks.org/text-analysis-in-python-3/
- https://stackabuse.com/python-for-nlp-creating-tf-idf-model-from-scratch/
- https://realpython.com/natural-language-processing-spacy-python/
- https://spacy.io/usage/spacy-101

In [1]:
import pandas as pd
import json
import glob
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import re
import spacy
from spacy import displacy
from collections import Counter
import csv

In [2]:
slack_data = pd.read_csv('slack_data_clean.csv', encoding='utf-8')

In [3]:
slack_data

Unnamed: 0,type,user,text,files,timestamp,date,time,year,code_block,links
0,message,UD9L1D44T,hi ! i’m pretty sure it’s possible to define u...,,2019-11-21 06:38:50.160000,2019-11-21,06:38:50.160000,2019,,
1,message,U042FH0RB,i don’t actually think there is a public api t...,,2019-11-21 08:02:58.161000,2019-11-21,08:02:58.161000,2019,,
2,message,U046K2QNK,in `powderday` i'm trying to write code that w...,,2019-11-21 18:48:39.162300,2019-11-21,18:48:39.162300,2019,,
3,message,U042FH0RB,that’ll break when yt-4.0 comes out,,2019-11-21 23:02:55.162800,2019-11-21,23:02:55.162800,2019,,
4,message,U042FH0RB,maybe just check the first digit of the versio...,,2019-11-21 23:03:07.163300,2019-11-21,23:03:07.163300,2019,,
...,...,...,...,...,...,...,...,...,...,...
6435,message,UD9L1D44T,anytime :smile:,,2020-05-23 16:11:53.285800,2020-05-23,16:11:53.285800,2020,,
6436,message,U5ET0RNKE,it looks like `ds.covering_grid` and `ds.smoot...,,2020-05-23 16:56:10.289300,2020-05-23,16:56:10.289300,2020,,
6437,message,U5ET0RNKE,is it possible to view the web documentation f...,,2020-05-23 17:01:31.290700,2020-05-23,17:01:31.290700,2020,,
6438,message,U31LWTKNW,the docs from the main release versions are on...,,2020-05-23 17:03:17.290800,2020-05-23,17:03:17.290800,2020,,https://yt-project.org/docs/3.1/


In [94]:
user_text_list = list(slack_data["text"])
#print(user_text_list)

# wrap each message in its own one-element list: [[text_0], [text_1], ...]
row_list = [[text] for text in slack_data["text"]]

print(row_list)
    



### Tokenization

Tokenization takes a string and splits it into a list of individual tokens (roughly, words and punctuation marks). In spaCy, this is the first step in converting text into a `Doc` object.

ex. `"I love coffee" = ["I", "love", "coffee"]`
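
The idea can be sketched with the standard library alone. This is not how spaCy tokenizes (its tokenizer applies prefix, suffix, infix, and exception rules); the `simple_tokenize` helper and its regex are assumptions made purely for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration only; spaCy's real tokenizer
    is rule-based and considerably more careful.
    """
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one punctuation character at a time.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("I love coffee"))
print(simple_tokenize("hey folks! it seems that i can't do that."))
```

Note that this naive version breaks "can't" into `can`, `'`, and `t`, whereas spaCy produces `ca` and `n't`, which is one reason to rely on a proper tokenizer.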

In [95]:
# testing out tokenization

nlp = spacy.load('en_core_web_sm')

doc = nlp(row_list[0][0])

# for token in doc:
#     print(token.text)

for token in doc:
    print(token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

hi INTJ UH ROOT xx True False
! PUNCT . punct ! False False
-PRON- PRON PRP nsubj x True True
be VERB VBP ROOT ’x False True
pretty ADV RB advmod xxxx True False
sure ADJ JJ acomp xxxx True False
-PRON- PRON PRP nsubj xx True True
’ VERB VBZ ROOT ’x False True
possible ADJ JJ acomp xxxx True False
to PART TO aux xx True True
define VERB VB xcomp xxxx True False
unit NOUN NNS compound xxxx True False
equivalence NOUN NNS dobj xxxx True False
from ADP IN prep xxxx True True
the DET DT det xxx True True
user NOUN NN compound xxxx True False
- PUNCT HYPH punct - False False
end NOUN NN pobj xxx True False
, PUNCT , punct , False False
but CCONJ CC cc xxx True True
i PRON PRP nsubj x True True
can VERB MD aux xx True True
not PART RB neg x’x False True
find VERB VB ROOT xxxx True False
back ADV RB advmod xxxx True True
how ADV WRB advmod xxx True True
-PRON- PRON PRP nsubj xx True True
’ VERB VBZ aux ’x False True
do VERB VBN ccomp xxxx True True
in ADP IN prep xx True True
the DET DT det x

### Sentences

Sentence detection finds where each sentence starts and ends. In spaCy's default pipeline the boundaries come from the dependency parse, so a period usually, but not necessarily, ends a sentence.

We can also add our own delimiters with a custom pipeline function - **currently this doesn't seem to work, as spaCy appears to treat all punctuation as a hard stop.**

In [6]:
about_doc = nlp(row_list[17][0])
sentences = list(about_doc.sents)

for sentence in sentences:
    print(len(sentence))
    print(sentence)
    print()

3
hey folks!

25
i'm trying to create a particle field but for a derived particle field, but it seems that i can't do that.



In [7]:
def add_delimiter(doc):
    for token in doc[:-1]:
        if token.text == ",":
            doc[token.i+1].is_sent_start = True
    return doc

In [8]:
custom_nlp = spacy.load("en_core_web_sm")
custom_nlp.add_pipe(add_delimiter, before='parser')
custom_ellipsis_doc = custom_nlp(row_list[17][0])
custom_ellipsis_sentences = list(custom_ellipsis_doc.sents)

print(len(custom_ellipsis_sentences))
for sentence in custom_ellipsis_sentences:
    print(sentence)

3
hey folks!
i'm trying to create a particle field but for a derived particle field,
but it seems that i can't do that.


In [9]:
ellipsis_doc = nlp(row_list[17][0])
ellipsis_sentences = list(ellipsis_doc.sents)
print(len(ellipsis_sentences))
for sentence in ellipsis_sentences:
    print(sentence)

2
hey folks!
i'm trying to create a particle field but for a derived particle field, but it seems that i can't do that.


### Back to Tokenization

Tokens are the basic (meaningful) units of a text. In spaCy they are stored in the `Doc` object.

We can customize tokenization to detect custom characters (can we do this with markdown and code blocks?) by replacing the `tokenizer` property on the `nlp` object.

In [10]:
for token in about_doc:
    # idx = starting index (character offset) of the token
    print(token, token.idx)
    
# we can also use: token.text_with_ws,
# token.is_alpha, token.is_punct, token.is_space,
# token.shape_, token.is_stop

hey 0
folks 4
! 9
i 11
'm 12
trying 15
to 22
create 25
a 32
particle 34
field 43
but 49
for 53
a 57
derived 59
particle 67
field 76
, 81
but 83
it 87
seems 90
that 96
i 101
ca 103
n't 105
do 109
that 112
. 116


In [11]:
from spacy.tokenizer import Tokenizer
custom_nlp = spacy.load("en_core_web_sm")
prefix_re = spacy.util.compile_prefix_regex(custom_nlp.Defaults.prefixes)
suffix_re = spacy.util.compile_suffix_regex(custom_nlp.Defaults.suffixes)
infix_re = re.compile(r'''[`~]''')

def customize_tokenizer(nlp):
    # treat backticks (markdown code markers) and tildes as infixes so they get split out
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)

# token_match can be used to match strings that should never be split


In [13]:
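custom_nlp_tokenizer = customize_tokenizer(custom_nlp)
# NOTE: the custom tokenizer is built here but never attached to the pipeline.
# To use it, assign custom_nlp.tokenizer = custom_nlp_tokenizer before calling
# custom_nlp(...). As written, the output below comes from the default tokenizer.
custom_tokenizer_about_doc = custom_nlp(row_list[17][0])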
print([token.text for token in custom_tokenizer_about_doc])


['hey', 'folks', '!', 'i', "'m", 'trying', 'to', 'create', 'a', 'particle', 'field', 'but', 'for', 'a', 'derived', 'particle', 'field', ',', 'but', 'it', 'seems', 'that', 'i', 'ca', "n't", 'do', 'that', '.']


### Stop Words

spaCy ships with a built-in list of stop words. Here we keep only the tokens in our doc that are NOT stop words:

In [14]:
about_no_stopword_doc = [token for token in about_doc if not token.is_stop]
print(about_no_stopword_doc)

[hey, folks, !, trying, create, particle, field, derived, particle, field, ,, .]


### Lemmatization

Lemmatization is the process of reducing the inflected forms of a word to a single base form, called the lemma, that is itself a valid word in the language.

Inflection marks grammatical categories such as tense and number. Lemmatization lets all forms of a word be analyzed as one item, which helps normalize the text.

spaCy exposes the lemma through the `lemma_` attribute on the `Token` class.

This avoids counting closely related forms such as 'organized' and 'organize' as separate words.
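
A small sketch of why this matters for counting. The `LEMMAS` table and the `lemmatize` helper are hypothetical stand-ins written by hand for illustration; in this notebook the lemma comes from spaCy's `token.lemma_` instead.

```python
from collections import Counter

# Hypothetical, hand-written lemma table for illustration only.
LEMMAS = {
    "organized": "organize",
    "organizes": "organize",
    "organizing": "organize",
    "octrees": "octree",
}

def lemmatize(word):
    """Return the lemma for a word, falling back to the word itself."""
    return LEMMAS.get(word, word)

words = ["organize", "organized", "organizing", "octrees", "octree"]

raw_counts = Counter(words)
lemma_counts = Counter(lemmatize(w) for w in words)

print(raw_counts)    # every surface form is counted separately
print(lemma_counts)  # inflected forms collapse onto their lemma
```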

In [15]:
lemma_test = nlp(row_list[2][0])

for token in lemma_test:
    print(token, token.lemma_)

in in
` `
powderday powderday
` `
i -PRON-
'm be
trying try
to to
write write
code code
that that
will will
automagically automagically
check check
what what
` `
yt yt
` `
version version
we -PRON-
're be
on on
, ,
and and
then then
deal deal
with with
the the
octrees octree
accordingly accordingly
. .
     
is be
this this
reasonable reasonable
as as
a a
check check
, ,
or or
is be
there there
a a
better well
way way
to to
discern discern
( (
in in
code code
) )
if if
we -PRON-
're be
on on
3.x 3.x
or or
4.x 4.x
? ?



 



` `
` `
` `
if if
   
yt.__version yt.__version
_ _
_ _
= =
= =
' '
4.0.dev0 4.0.dev0
' '
: :

    
   
blah blah
` `
` `
` `


### Word Frequency

Let's find the most common words, and the words that occur only once, in a text.

Note: markdown, symbols, and user ids made it through; we still have to find a way to remove or ignore those.
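
One possible way to strip that markup before handing text to spaCy is a regex pass, sketched here. The patterns and the `strip_slack_markup` helper are assumptions based on the artifacts visible in this notebook's output (user mentions, backticks, links, emoji shortcodes), not a complete description of Slack's export format.

```python
import re

# Assumed patterns, based on artifacts visible in this notebook's output.
CODE_BLOCK = re.compile(r"```.*?```", re.DOTALL)  # fenced code blocks
INLINE_CODE = re.compile(r"`[^`]*`")              # `inline code`
MENTION = re.compile(r"<@\w+>")                   # <@U010SCNQYJ1>
URL = re.compile(r"<?https?://[^\s>]+>?")         # bare or <wrapped> links
EMOJI = re.compile(r":[a-z0-9_+-]+:")             # :smile:

def strip_slack_markup(text):
    """Replace Slack-specific markup with spaces (hypothetical helper)."""
    # Code blocks go first so their triple backticks are not
    # mistaken for inline code.
    for pattern in (CODE_BLOCK, INLINE_CODE, MENTION, URL, EMOJI):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())

print(strip_slack_markup("anytime :smile:"))
print(strip_slack_markup("welcome <@U010SCNQYJ1>, see `ds.covering_grid`"))
```

This drops code entirely, which may or may not be what the analysis wants; the emoji pattern can also misfire on strings such as timestamps, so it would need checking against the real data.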

In [16]:
complete_text = str(row_list[17][0])

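# put everything into a single string, as spaCy expects
# NOTE: messages are joined without a separator, so the last word of one
# message fuses with the first word of the next (e.g. 'outin', 'outmaybe' below)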
text_string = ""
for text in row_list[:100]:
    for t in text:
        t_str = str(t)
        text_string += t_str
    
# print(text_string)
    
complete_doc = nlp(text_string)
# remove stop words and punctuation - we can see that markdown and math symbols made it past anyway
words = [token.text for token in complete_doc if not token.is_stop and not token.is_punct]
word_freq = Counter(words)
# print the five most common words and their frequencies
common_words = word_freq.most_common(5)

print("Common Words: ", common_words)
print()

unique_words = [word for (word, freq) in word_freq.items() if freq == 1]
print("Unique Words: ", unique_words)

Common Words:  [('`', 77), ('yt', 39), ('field', 28), ('\n', 22), ('>', 20)]

Unique Words:  ['define', 'equivalences', 'end', 'cookbook', 'actually', 'public', 'api', 'outin', 'powderday', 'write', 'automagically', 'deal', 'octrees', 'accordingly', '  ', 'reasonable', 'discern', '3.x', '4.x', '\n\n\n', 'yt.__version', '4.0.dev0', '\n   ', 'blah```that’ll', 'break', 'comes', 'outmaybe', 'digit', 'string?you', 'https://github.com', 'setup.py#l311', 'greater', 'comparisonsi’d', 'looseversion', 'dev0', 'numberah', '\xa0', 'misremember', 'saw', 'exposed', 'developing', 'finder', 'according', 'guide', 'doable', 'advance', 'support!hi', 'benedikt', 'uses', 'older', 'contained', 'binary', 'outputs', 'currently', 'capability', 'know', 'separately', 'galaxies', 'latest', 'aware', 'default', 'standalone', 'though.<@u010scnqyj1', 'channel<@u010scnqyj1', 'welcome', 'glad', 'here!depending', 'nature', 'useful', 'tool', 'written', '@u050ek2tc', 'analyze', 'rs', 'https://bitbucket.org/rthompson/pygad

### Part of Speech Tagging

A part of speech (POS) is the grammatical role that explains how a particular word is used in a sentence.

POS tagging assigns a POS tag, such as "noun" or "verb", to each token depending on its usage in the sentence.

In spaCy, POS tags are stored as attributes on the `Token` object.

In [23]:
for token in about_doc:
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))
    
    # tag_ lists the fine-grained part of speech
    # pos_ lists the coarse-grained part of speech
    # spacy.explain gives descriptive details about the tag
    

hey UH INTJ interjection
folks NNS NOUN noun, plural
! . PUNCT punctuation mark, sentence closer
i PRP PRON pronoun, personal
'm VBP AUX verb, non-3rd person singular present
trying VBG VERB verb, gerund or present participle
to TO PART infinitival "to"
create VB VERB verb, base form
a DT DET determiner
particle NN NOUN noun, singular or mass
field NN NOUN noun, singular or mass
but CC CCONJ conjunction, coordinating
for IN ADP conjunction, subordinating or preposition
a DT DET determiner
derived VBN VERB verb, past participle
particle NN NOUN noun, singular or mass
field NN NOUN noun, singular or mass
, , PUNCT punctuation mark, comma
but CC CCONJ conjunction, coordinating
it PRP PRON pronoun, personal
seems VBZ VERB verb, 3rd person singular present
that IN SCONJ conjunction, subordinating or preposition
i PRP PRON pronoun, personal
ca MD VERB verb, modal auxiliary
n't RB PART adverb
do VB AUX verb, base form
that DT DET determiner
. . PUNCT punctuation mark, sentence closer


In [24]:
# grab certain tags

nouns = []
adjectives = []

for token in about_doc:
    if token.pos_ == "NOUN":
        nouns.append(token)
    if token.pos_ == "ADJ":
        adjectives.append(token)
        
print(nouns)
print(adjectives)

[folks, particle, field, particle, field]
[]


In [25]:
displacy.render(about_doc, style="dep")

### Preprocessing Functions

Preprocessing converts text into an analyzable format, for example by lowercasing the text, lemmatizing each token, and removing punctuation and stop words.
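
The filter-then-normalize shape of these steps can be sketched on plain strings. The stop word set and both helpers below are assumptions for illustration; lemmatization is left out because it needs a language model, and spaCy's own stop word list is much larger.

```python
import string

# Tiny illustrative stop word list (an assumption); spaCy exposes its
# own, much larger list per token as token.is_stop.
STOP_WORDS = {"i", "a", "the", "to", "but", "for", "it", "that", "do"}

def is_allowed(word):
    """Keep a word unless it is empty, a stop word, or pure punctuation."""
    if not word.strip() or word in STOP_WORDS:
        return False
    return not all(ch in string.punctuation for ch in word)

def preprocess(word):
    """Trim surrounding punctuation and whitespace, then lowercase."""
    return word.strip(string.punctuation + string.whitespace).lower()

text = "Hey folks! I'm trying to create a particle field."
cleaned = [preprocess(w) for w in text.split()]
filtered = [w for w in cleaned if is_allowed(w)]
print(filtered)
```

Here the words are normalized before filtering so that "I" and "i" hit the same stop word entry; with spaCy tokens the order matters less because `is_stop` is already case-insensitive.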


In [26]:
def is_token_allowed(token):
    if (not token or not token.text.strip() or
       token.is_stop or token.is_punct):
        return False
    return True

In [27]:
def preprocess_token(token):
    return token.lemma_.strip().lower()

In [28]:
complete_filtered_tokens = [preprocess_token(token) for token
                           in complete_doc if is_token_allowed(token)]

complete_filtered_tokens

['hi',
 'pretty',
 'sure',
 'possible',
 'define',
 'unit',
 'equivalence',
 'user',
 'end',
 'find',
 'cookbook',
 'little',
 'help',
 'actually',
 'think',
 'public',
 'api',
 'need',
 'poke',
 'code',
 'figure',
 'outin',
 '`',
 'powderday',
 '`',
 'try',
 'write',
 'code',
 'automagically',
 'check',
 '`',
 'yt',
 '`',
 'version',
 'deal',
 'octree',
 'accordingly',
 'reasonable',
 'check',
 'well',
 'way',
 'discern',
 'code',
 '3.x',
 '4.x',
 '`',
 '`',
 '`',
 'yt.__version',
 '=',
 '=',
 '4.0.dev0',
 'blah```that’ll',
 'break',
 'yt-4.0',
 'come',
 'outmaybe',
 'check',
 'digit',
 'version',
 'string?you',
 'yt',
 'internally',
 'version',
 'check',
 '<',
 'https://github.com',
 'yt',
 'project',
 'yt',
 'blob',
 'master',
 'setup.py#l311',
 '>',
 'great',
 'comparisonsi’d',
 'check',
 'sure',
 'looseversion',
 'right',
 'thing',
 'dev0',
 'version',
 'numberah',
 'maybe',
 'misremember',
 'see',
 'expose',
 'develop',
 'frontend',
 'hi',
 'yt',
 'run',
 'rockstar',
 'halo',
 'f

### Rule-Based Matching Using spaCy

Rule-based matching is a way to extract information from unstructured text by identifying and extracting tokens and phrases according to patterns and grammatical features.

It's similar to rule-based matching using regex, but it also considers lexical and grammatical attributes of the text.

What are some rule-based patterns I can extract from the yt Slack data?
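
One possible starting point, sketched with plain regular expressions rather than spaCy: pull out backticked identifiers and version-like strings, both of which show up in the messages above. The two patterns and the `extract_patterns` helper are assumptions chosen from examples in this notebook; spaCy's `Matcher` class would be the token-based counterpart once lemma or POS conditions are needed.

```python
import re

# Assumed patterns, chosen from examples visible in this notebook's data.
# Backticked identifiers such as `ds.covering_grid`:
BACKTICKED = re.compile(r"`([^`\n]+)`")
# Version-like strings such as 3.x, yt-4.0, 4.0.dev0:
VERSION = re.compile(r"\b(?:yt-)?\d+\.(?:\d+|x)(?:\.\w+)*")

def extract_patterns(text):
    """Return backticked identifiers and version-like strings in text."""
    return {
        "identifiers": BACKTICKED.findall(text),
        "versions": VERSION.findall(text),
    }

message = (
    "that'll break when yt-4.0 comes out, are we on 3.x or 4.0.dev0? "
    "try `ds.covering_grid`"
)
print(extract_patterns(message))
```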

### Dependency Parsing

Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationships between headwords and their dependents. The head of the sentence depends on no other word and is called the root of the sentence.

Dependencies can be mapped in a directed graph representation:

- words are nodes
- the grammatical relationships are the edges

This tells you what role a word plays in the text and how different words relate to each other.
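
The graph view can be sketched with a plain dictionary. The triples below are the (token, head, label) values spaCy reports for "hey folks!" in this notebook; the `build_graph` helper is a hypothetical illustration, not part of spaCy.

```python
from collections import defaultdict

# (token, head, dependency label) triples for "hey folks!".
triples = [
    ("hey", "folks", "intj"),
    ("folks", "folks", "ROOT"),
    ("!", "folks", "punct"),
]

def build_graph(triples):
    """Return (root, edges), where edges maps head -> [(dependent, label)]."""
    edges = defaultdict(list)
    root = None
    for token, head, label in triples:
        if label == "ROOT":
            root = token  # spaCy reports the root as its own head
        else:
            edges[head].append((token, label))
    return root, dict(edges)

root, edges = build_graph(triples)
print(root)
print(edges)
```

Keying nodes by word text only works while every word is unique; for real sentences, where words repeat, the token index (`token.i`) would be the safer key.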

In [29]:
dep_doc = nlp(row_list[17][0])

for token in dep_doc:
    print(token.text, token.tag_, token.head.text, token.dep_)

hey UH folks intj
folks NNS folks ROOT
! . folks punct
i PRP trying nsubj
'm VBP trying aux
trying VBG trying ROOT
to TO create aux
create VB trying xcomp
a DT field det
particle NN field compound
field NN create dobj
but CC trying cc
for IN trying conj
a DT field det
derived VBN field amod
particle NN field compound
field NN for pobj
, , trying punct
but CC trying cc
it PRP seems nsubj
seems VBZ trying conj
that IN do mark
i PRP do nsubj
ca MD do aux
n't RB do neg
do VB seems ccomp
that DT do dobj
. . seems punct


In [30]:
displacy.render(dep_doc, style="dep")

### Navigating the Tree and Subtree

The dependency parse tree has all the properties of a tree. 

In [31]:
print(row_list[17][0])

hey folks! i'm trying to create a particle field but for a derived particle field, but it seems that i can't do that.


In [50]:
# Extract the children of a word

user_text_doc = nlp(row_list[17][0])
print([token.text for token in user_text_doc[12].children])

['field']


In [51]:
# extract previous neighboring node of a word

print(user_text_doc[12].nbor(-1))

but


In [52]:
# Extract next neighboring node of a word

print(user_text_doc[12].nbor())

a


In [53]:
# extract the syntactic children to the left and to the right of a word

print([token.text for token in user_text_doc[12].lefts])

print([token.text for token in user_text_doc[12].rights])

[]
['field']


In [54]:
# print subtree of a word

print(list(user_text_doc[12].subtree))

[for, a, derived, particle, field]


In [55]:
def flatten_tree(tree):
    return ''.join([token.text_with_ws for token in list(tree)]).strip()

print(flatten_tree(user_text_doc[12].subtree))

for a derived particle field


### Shallow Parsing

Shallow parsing, or chunking, is the process of extracting phrases from unstructured text. Chunking groups adjacent tokens into phrases on the basis of their POS tags.

Noun phrase: a phrase that has a noun as its head. Noun phrases are useful for explaining the context of a sentence, because they help infer what is being talked about.

In [57]:
user_questions = row_list[17][0]

In [58]:
user_questions_doc = nlp(user_questions)

# extract noun phrases

for chunk in user_questions_doc.noun_chunks:
    print(chunk)

hey folks
i
a particle field
a derived particle field
it
i


### Verb Phrase Detection

A verb phrase is a syntactic unit composed of at least one verb. Verb phrases are useful for understanding the actions that nouns are involved in.

In [59]:
import textacy

In [60]:
pattern = r'(<VERB>?<ADV>*<VERB>+)'

verb_doc = textacy.make_spacy_doc(user_questions, lang='en_core_web_sm')

verb_phrases = textacy.extract.pos_regex_matches(verb_doc, pattern)

for chunk in verb_phrases:
    print(chunk.text)

print()
    
for chunk in verb_doc.noun_chunks:
    print(chunk)

trying
create
derived
seems
ca

hey folks
i
a particle field
a derived particle field
it
i




### Named Entity Recognition

Named entity recognition is the process of locating named entities in unstructured text and then classifying them into pre-defined categories (names, organizations, locations, etc.).

In [64]:
user_questions_ent = row_list[17][0]

user_questions_ent_doc = nlp(user_questions_ent)

for ent in user_questions_ent_doc.ents:
    print(ent.text, ent.start_char, ent.end_char,
         ent.label_, spacy.explain(ent.label_))

In [65]:
displacy.render(user_questions_ent_doc, style='ent')

In [133]:
from nltk import WordNetLemmatizer
from collections import defaultdict

In [134]:
slack_text = r'slack_data_text.csv'

In [141]:
lemmatizer = WordNetLemmatizer()
data_per_yr = defaultdict(list)
with open(slack_text, 'r') as fin:
    lines = fin.readlines()[1:]
    for index, each_line in enumerate(lines):
        each_line = each_line.replace('\n', '').split(',')
        yr = int(each_line[0])
        full_txt = (each_line[1]).split()
        full_txt_out = []
        for word in full_txt:
            if word.endswith('ing') or word.endswith('ed'):
                word = lemmatizer.lemmatize(word, pos='v')
            else:
                word = lemmatizer.lemmatize(word)
            if len(word) > 1:
                full_txt_out.append(word)
        data_per_yr[yr].append(full_txt_out)

#print(data_per_yr)

In [145]:
tf_idf_dict = defaultdict(list)

for yr, docs in data_per_yr.items():
    unique_words_docs_sum = []
    for doc in docs:
        unique_words_in_one = list(set(doc))
        unique_words_docs_sum += unique_words_in_one

    df_dict = Counter(unique_words_docs_sum)

    n_doc = len(docs)

    for doc in docs:
        term_freq = Counter(doc)
        for term, freq in term_freq.items():
            tf = freq/sum(term_freq.values())
            df = df_dict[term]
            tf_idf = tf * np.log(n_doc/(df+1))
            tf_idf_dict[yr].append([term, tf_idf])

print(tf_idf_dict)
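
The weighting in the cell above is `tf * log(n_doc / (df + 1))`. A worked example on toy data may make its behavior easier to see; the three "messages" below are made up for illustration and only the formula is taken from the cell.

```python
import math
from collections import Counter

# Three toy "messages", already tokenized (made-up data for illustration).
docs = [
    ["particle", "field", "field"],
    ["particle", "octree"],
    ["particle", "halo"],
]

n_doc = len(docs)
# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    """tf * log(N / (df + 1)), the same weighting as the cell above."""
    tf = doc.count(term) / len(doc)
    return tf * math.log(n_doc / (df[term] + 1))

print(tf_idf("field", docs[0]))     # rare term: positive score
print(tf_idf("particle", docs[0]))  # term in every doc: negative score
```

With this formula the highest possible score is `log(n_doc / 2)`, reached by a one-word message whose term appears in no other message. Many terms tied at the same top score would be consistent with that case, which suggests very short messages may dominate the top of such a ranking.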



In [151]:
header = ['year', 'term', 'tf-idf']
dfs = []

for each_year, tfidf_scores in tf_idf_dict.items():
    df_list = []
    for term_score in tfidf_scores:
        df_list.append([each_year, term_score[0], float(term_score[1])])
    yr_df = pd.DataFrame(df_list, columns=header)
    yr_df = yr_df.sort_values(by=['tf-idf'], ascending=False)
    if 10 < len(tfidf_scores):
        yr_df = yr_df.iloc[:10].reset_index(drop=True)
        dfs.append(yr_df)
    else:
        raise ValueError('need more than 10 scored terms for this year to take the top 10')

df_out = pd.concat(dfs)

print(df_out)

   year                                      term    tf-idf
0  2019                                       _ツ_  6.975881
1  2019                                   binbash  6.975881
2  2019                                     ooooh  6.975881
3  2019                                  saethlin  6.975881
4  2019                                    hurray  6.975881
5  2019                                      oooh  6.975881
6  2019                   sobsoverlackofemojihere  6.975881
7  2019           ytfuncsmatplotlib_style_context  6.975881
8  2019                            dsdomain_width  6.975881
9  2019                                 nibappfrb  6.975881
0  2020                                       ala  6.622071
1  2020                                     10e10  6.622071
2  2020                                       lol  6.622071
3  2020                                     whoop  6.622071
4  2020                                      wink  6.622071
5  2020                    whereascodebl

In [152]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [158]:
TfidfVectorizer.fit_transform?

Signature: TfidfVectorizer.fit_transform(self, raw_documents, y=None)
Docstring:
Learn vocabulary and idf, return term-document matrix.

This is equivalent to fit followed by transform, but more efficiently
implemented.

Parameters
----------
raw_documents : iterable
    An iterable which yields either str, unicode or file objects.
y : None
    This parameter is ignored.

Returns
-------
X : sparse matrix, [n_samples, n_features]
    Tf-idf-weighted document-term matrix.
File:      ~/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py
Type:      function


In [153]:
vectorize = TfidfVectorizer()

In [164]:
with open(slack_text, 'r') as fin:
    lines = fin.readlines()[1:]
    response = vectorize.fit_transform(lines)

In [166]:
print(response)

  (0, 4700)	0.18396648725047907
  (0, 5536)	0.23002182460164644
  (0, 2754)	0.26631509127859204
  (0, 4955)	0.1034811441859153
  (0, 3375)	0.2159032820251818
  (0, 4813)	0.15339012512237912
  (0, 1829)	0.2095585540279277
  (0, 4041)	0.2046449522984244
  (0, 2142)	0.2095585540279277
  (0, 2062)	0.1181993738600138
  (0, 9259)	0.3389833702501805
  (0, 8789)	0.13690830952081479
  (0, 4218)	0.14752252735556082
  (0, 3733)	0.3389833702501805
  (0, 9178)	0.19560244801947466
  (0, 3071)	0.2404301280341415
  (0, 8928)	0.08109417334374219
  (0, 6994)	0.1980929779185149
  (0, 5242)	0.2768433662552854
  (0, 8633)	0.1662678855219079
  (0, 7058)	0.2166013230611261
  (0, 4914)	0.13655832524303785
  (0, 4733)	0.17199400773820964
  (0, 624)	0.0784056076397113
  (1, 6579)	0.2008061346565861
  :	:
  (6438, 7542)	0.25939459893916755
  (6438, 3352)	0.2352186399042941
  (6438, 9706)	0.31693276245679375
  (6438, 8344)	0.2448433658891562
  (6438, 1650)	0.15183150396716963
  (6438, 9755)	0.17566697831089018
  