# Pocket text classification

This project tries to generate tags for Pocket articles automatically.
It uses the articles you have already tagged in Pocket to suggest tags for new ones.

## Getting content from Pocket

I've already done this step, so I won't cover it right now; I'll add it here when possible.

## Preparing our data

We have two real data sets: manually tagged articles and untagged articles. They cover many kinds of topics, and I usually read them (or queue them for eternity) on my commute to work.

I used to tag them by hand, but it became really hard to keep up, so I decided to automate it.

We want to find the correlation between each article's text (which we acquired in [Getting content from Pocket](#Getting-content-from-Pocket)) and the tags I added to it. In other words, we want to find features: the words that define the content of an article and therefore seem to link the article to its tags.

### Loading and playing with our data set

Opening the JSON file. The `'item_id'`, `'resolved_url'`, and `'tags'` keys come from the Pocket API; `'text'` comes from running BeautifulSoup on every article, as mentioned in [Getting content from Pocket](#Getting-content-from-Pocket).

In [22]:
import json

# Load the tagged articles; the context manager closes the file afterwards
with open("tagged_text.json", encoding="utf-8") as fd:
    tagged_text_list = json.load(fd)
print(tagged_text_list[0].keys())

dict_keys(['item_id', 'resolved_url', 'tags', 'text'])


#### Tokenizing our text

In [51]:
from nltk import download, word_tokenize
download('punkt')

test_text = tagged_text_list[0].get('text')
tokens = word_tokenize(test_text)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/vinicyusmacedo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### Getting the most common words from our text

In [52]:
from nltk.probability import FreqDist
fdist = FreqDist(word.lower() for word in tokens)
fdist.most_common(100)

[('the', 221),
 (',', 195),
 ('of', 139),
 ('.', 131),
 ('to', 121),
 ('and', 112),
 ('a', 111),
 ('’', 95),
 ('that', 82),
 ('“', 73),
 ('”', 69),
 ('in', 60),
 ('is', 54),
 ('for', 49),
 ('you', 45),
 ('s', 44),
 ('incel', 41),
 ('with', 40),
 ('as', 39),
 ('community', 36),
 ('men', 35),
 ('it', 33),
 ('are', 29),
 ('be', 28),
 ('i', 25),
 ('this', 25),
 ('—', 25),
 ('about', 24),
 ('on', 24),
 ('incels', 24),
 ('they', 24),
 ('health', 23),
 ('but', 23),
 ('mental', 22),
 ('more', 22),
 ('or', 21),
 ('who', 21),
 ('have', 21),
 ('by', 20),
 ('he', 20),
 ('them', 20),
 ('can', 19),
 ('their', 18),
 ('one', 18),
 ('an', 17),
 (':', 17),
 ('members', 17),
 ('me', 17),
 ('women', 15),
 ('at', 15),
 ('idea', 15),
 ('t', 15),
 ('support', 14),
 ('vox', 14),
 ('culture', 14),
 ('re', 14),
 ('told', 14),
 ('all', 13),
 ('people', 13),
 ('therapy', 13),
 ('from', 13),
 ('not', 13),
 ('said', 13),
 ('what', 12),
 ('us', 12),
 ('up', 12),
 ('violent', 12),
 ('if', 12),
 ('was', 12),
 ('way', 
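The top of this list is dominated by punctuation and function words (`'the'`, `','`, `'of'`), which say little about the topic. A minimal sketch of filtering them out before counting is shown here. It is an illustration rather than what the notebook does: it swaps in a regex tokenizer and `collections.Counter` for `word_tokenize` and `FreqDist` so that it runs without NLTK data, and `STOP_WORDS` and `content_word_counts` are hypothetical names with a deliberately tiny stop-word list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would need a fuller one.
STOP_WORDS = {"the", "of", "to", "and", "a", "that", "in", "is", "for", "it"}


def content_word_counts(text, stop_words=STOP_WORDS):
    """Count lowercase alphabetic tokens, skipping stop words and punctuation."""
    # The regex keeps runs of ASCII letters only, so ',', '.', and quotes
    # never become tokens.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in stop_words)


sample = "The community is a place for support, and the community is open to all."
counts = content_word_counts(sample)
print(counts.most_common(3))
```

Note that the letters-only regex also splits contractions ("don't" becomes "don" and "t"), much like the `'s'`, `'t'`, and `'re'` entries in the output above, so a real pipeline would probably want to handle those too.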

#### Getting the least common of the 100 most common words

In [5]:
fdist.most_common(100)[-20:]

[('love', 9),
 ('its', 9),
 ('treatment', 9),
 ('these', 9),
 ('black', 9),
 ('like', 9),
 ('could', 9),
 (';', 9),
 ('your', 9),
 ('own', 8),
 ('communities', 8),
 ('help', 8),
 ('may', 8),
 ('most', 8),
 ('spoke', 8),
 ('while', 8),
 ('other', 8),
 ('even', 8),
 ('do', 8),
 ('any', 8)]
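Note that `fdist.most_common(100)[-20:]` returns the tail of the top 100, not the rarest words in the whole text. A minimal sketch of getting the truly rarest tokens follows, using `collections.Counter` (which `FreqDist` subclasses) on toy data; `least_common` is a hypothetical helper name.

```python
from collections import Counter


def least_common(counts, n):
    """Return the n lowest-count (token, count) pairs, rarest first."""
    # most_common() with no argument returns every entry sorted by
    # descending count; the reversed slice takes the last n of those.
    return counts.most_common()[:-n - 1:-1]


counts = Counter("a a a b b c".split())
print(least_common(counts, 2))
```

In a real article most of these will be words that appear exactly once, so the result tends to be long and noisy.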

#### Plotting the frequency distribution

In [None]:
# Render matplotlib plots inline in the Jupyter notebook
%matplotlib inline
fdist.plot()

### First possibility: get outstanding words in relation to all articles

One possibility is to separate common words from distinctive ones.

We can tokenize all the articles and put the tokens in one shared pool. Then we count how many articles each token appears in.

Function words such as articles, prepositions, and conjunctions, along with everyday nouns and verbs, will show up in almost every article. On the other hand, topic-specific words will appear in few articles, even though they may repeat a lot within a single one.

In other words, we want to find words in an article that have a low probability of being found in other articles.
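To make the idea concrete, here is a minimal sketch on three made-up, already tokenized "articles". It assumes every token of an article is counted (not only its most frequent ones), and the names `articles`, `document_frequency`, `share`, and `distinctive` are illustrative only.

```python
from collections import Counter

# Three toy "articles", already tokenized; the contents are made up.
articles = [
    ["the", "community", "offers", "support"],
    ["the", "market", "fell", "the", "most"],
    ["the", "team", "won", "the", "match"],
]

# set() makes each article contribute at most once per token, so the
# counter holds the number of articles containing the token, not raw counts.
document_frequency = Counter()
for tokens in articles:
    document_frequency.update(set(tokens))

# Share of articles (in percent) that contain each token.
share = {token: 100 * n / len(articles) for token, n in document_frequency.items()}

# Tokens found in at most half of the articles are candidates for features.
distinctive = sorted(token for token, pct in share.items() if pct <= 50)
print(distinctive)
```

Here `"the"` appears in every article and is dropped, while each of the other words appears in only one article and is kept.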

First, I'll tokenize each article's text. Next, I'll record the tokens in a dictionary of token occurrences (i.e. how many articles each token appears in). It maps each token to its occurrence count, for example:

```json
{"word": 2}
```

In [50]:
from nltk.probability import FreqDist
token_occurences = {}
token_probabilities = {}
# To test with only the first 3 articles, uncomment the next line
# tagged_text_list = tagged_text_list[0:3]
# Count in how many articles each token is among the 100 most common
for article in tagged_text_list:
    content = article.get('text')
    tokens = word_tokenize(content)
    fdist = FreqDist(word.lower() for word in tokens)
    for word, _ in fdist.most_common(100):
        token_occurences[word] = token_occurences.get(word, 0) + 1

# Calculating token probabilities

for token, count in token_occurences.items():
    token_probabilities[token] = count / len(tagged_text_list) * 100

# Keeping tokens that appear in at most 50% of the articles

filtered_dict = {}
threshold = 50
for key, value in token_probabilities.items():
    if value <= threshold:
        filtered_dict[key] = value

print(filtered_dict)

{'incel': 33.33333333333333, 'community': 33.33333333333333, 'men': 33.33333333333333, '—': 33.33333333333333, 'incels': 33.33333333333333, 'their': 33.33333333333333, 'members': 33.33333333333333, 'women': 33.33333333333333, 'idea': 33.33333333333333, 'support': 33.33333333333333, 'vox': 33.33333333333333, 'culture': 33.33333333333333, 'told': 33.33333333333333, 'therapy': 33.33333333333333, 'said': 33.33333333333333, 'us': 33.33333333333333, 'violent': 33.33333333333333, 'way': 33.33333333333333, 'pill': 33.33333333333333, 'part': 33.33333333333333, 'misogyny': 33.33333333333333, 'sex': 33.33333333333333, 'its': 33.33333333333333, 'treatment': 33.33333333333333, 'these': 33.33333333333333, 'black': 33.33333333333333, 'could': 33.33333333333333, ';': 33.33333333333333, 'own': 33.33333333333333, 'communities': 33.33333333333333, 'help': 33.33333333333333, 'may': 33.33333333333333, 'most': 33.33333333333333, 'spoke': 33.33333333333333, 'while': 33.33333333333333, 'any': 33.3333333333333

Now we want to see which of an article's most common words are among the less common words in our word pool.

In [48]:
article = tagged_text_list[0]
content = article.get('text')
tokens = word_tokenize(content)
fdist = FreqDist(word.lower() for word in tokens)
# most_common() yields each word once, so no duplicate check is needed
article_features = [word for word, _ in fdist.most_common(100) if word in filtered_dict]
print(article_features)

['incel', 'community', 'men', '—', 'incels', 'their', 'members', 'women', 'idea', 'support', 'vox', 'culture', 'told', 'therapy', 'said', 'us', 'violent', 'way', 'pill', 'part', 'misogyny', 'sex', 'its', 'treatment', 'these', 'black', 'could', ';', 'own', 'communities', 'help', 'may', 'most', 'spoke', 'while', 'any']
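The list still contains generic tokens such as `'their'`, `'its'`, and `';'`, since a hard 50% cutoff treats every surviving word equally. One standard refinement, which is not what this notebook does, is TF-IDF: weight each word by how often it appears in the article and by how few articles contain it. The following is a rough sketch under that assumption, using toy data and a hypothetical `tf_idf` helper, and it assumes no article is empty.

```python
import math
from collections import Counter


def tf_idf(articles):
    """Return one {token: score} dict per tokenized, non-empty article."""
    n_articles = len(articles)
    # Number of articles that contain each token.
    document_frequency = Counter()
    for tokens in articles:
        document_frequency.update(set(tokens))

    scores = []
    for tokens in articles:
        term_counts = Counter(tokens)
        scores.append({
            # Term frequency within the article, times the log of the
            # inverse share of articles containing the token.
            token: (count / len(tokens))
            * math.log(n_articles / document_frequency[token])
            for token, count in term_counts.items()
        })
    return scores


# Made-up, already tokenized articles.
articles = [
    ["the", "therapy", "community", "the", "therapy"],
    ["the", "market", "fell"],
    ["the", "team", "won"],
]
first = tf_idf(articles)[0]
print(max(first, key=first.get))
```

A word that appears in every article (here `"the"`) scores zero, so it drops to the bottom of the ranking without needing a fixed threshold.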
