# tlm & utokens
**Unique Token Dictionaries**

## Terms and Definitions


### Dictionary Size $s$

The dictionary size $s$ is the total number of unique tokens in the dictionary.
For an English language dictionary, for example, $s$ is the total number of distinct words it contains.

### Token Length $ \lambda $

The length of a token is the total number of characters that make up the token.
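
As a minimal sketch of these two definitions, assuming a context that is a short whitespace-separated string (the sample text and the names `context`, `dictionary`, `s` and `lengths` are illustrative and do not come from the datasets used later):

```python
# Hypothetical sample context, split on whitespace into tokens
context = "the wishing cap by the sea"
tokens = context.split()

# The dictionary keeps each token once; its size s is the number of unique tokens
dictionary = sorted(set(tokens))
s = len(dictionary)

# Token length (lambda): number of characters in each unique token
lengths = {token: len(token) for token in dictionary}
```

Here "the" appears twice in the context but only once in the dictionary, so it contributes one unit to $s$.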

### Token Frequency $ \gamma $

The token frequency is the number of times a token appears in a given context.
For single token dictionaries the frequency is always equal to one, i.e. $ \gamma = 1 $.
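
A minimal sketch of counting token frequencies, assuming whitespace tokenization and a made-up sample context (the names `context`, `token_freq` and `unique_freq` are illustrative only):

```python
from collections import Counter

# Hypothetical sample context
context = "the wishing cap by the sea"

# Token frequency (gamma): how many times each token appears in the context
token_freq = Counter(context.split())

# In a single token dictionary every token is listed once, so gamma = 1
unique_freq = dict.fromkeys(token_freq, 1)
```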

### Dictionary Weight $ \omega $

The weight of a dictionary is determined by the sum of the product of the token frequency and the token length. 

$ \omega = \sum \limits_{i=1} ^{n} \gamma _i \lambda _i $

Where:

* $ \gamma _i $ is the frequency of the *i*-th token.
* $ \lambda _i $ is the length of the *i*-th token.
* $ n $ is the number of tokens in the dictionary.

For single token dictionaries, where $ \gamma _i = 1 $ for every token, the calculation simplifies to the following expression: $ \omega = \sum \limits_{i=1} ^{n} \lambda _i $
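
The weight formula can be sketched in a few lines of Python. This assumes the context is held as a plain mapping from token to frequency; the function `dictionary_weight` and the toy data are hypothetical and not part of the notebook's datasets:

```python
def dictionary_weight(token_freq):
    """Return omega, the sum of gamma_i * lambda_i over all tokens."""
    return sum(freq * len(token) for token, freq in token_freq.items())

# Hypothetical toy context: token -> frequency (gamma)
token_freq = {"the": 3, "wishing": 1, "cap": 2}

# General case: 3*3 + 1*7 + 2*3 = 22
omega = dictionary_weight(token_freq)

# Single token case: every frequency is 1, so omega is the sum of lengths
omega_unique = dictionary_weight(dict.fromkeys(token_freq, 1))  # 3 + 7 + 3 = 13
```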

### Relative Token Weight $ \omega r _i $

The relative weight of the *i*-th token with respect to its context is the ratio of its own weight to the total weight of the dictionary or context:

$ \omega r _i = \frac {\gamma _i \lambda _i} {\omega} $
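
A minimal sketch of the relative weight, again assuming a plain token-to-frequency mapping (the function `relative_weights` and the toy data are illustrative assumptions):

```python
def relative_weights(token_freq):
    """Return gamma_i * lambda_i / omega for every token."""
    omega = sum(freq * len(token) for token, freq in token_freq.items())
    return {token: freq * len(token) / omega
            for token, freq in token_freq.items()}

# Hypothetical toy context: token -> frequency (gamma)
token_freq = {"the": 3, "wishing": 1, "cap": 2}

weights = relative_weights(token_freq)
total = sum(weights.values())  # relative weights add up to 1 (up to rounding)
```

Because each relative weight is a share of $ \omega $, the values over a whole context sum to one, which makes them comparable across contexts of different sizes.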

### Frequency of Token Length

The token length frequency is the number of times a token length appears within a given context.
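
As a sketch, the token length frequency can be obtained by counting lengths instead of tokens. The token list and the name `length_frequency` are hypothetical:

```python
from collections import Counter

def length_frequency(tokens):
    """Count how many tokens share each length."""
    return Counter(len(token) for token in tokens)

# Hypothetical token list
tokens = ["the", "wishing", "cap", "by", "mary"]

# Lengths are 3, 7, 3, 2, 4, so length 3 occurs twice
length_freq = length_frequency(tokens)
```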



## Imports

In [1]:
import pandas as pd
import numpy as np
import warnings

from urllib import request

## Load and explore utokens

In [2]:
#dtypes = {'token': 'string', 'bytes': 'uint8'}
en_tokens_df = pd.read_csv("datasets/en_tokens.txt", sep='\t', dtype={'token': 'string', 'bytes': 'uint8'})

#en_tokens_df.to_pickle('Datasets/en_tokens_p.txt')
#en_tokens_df.info()

## Tokens Maintainer

In [None]:
try:
    from pyspark import SparkConf, SparkContext
    cn = SparkConf().setMaster('local[2]').setAppName("My program")
    sc = SparkContext(conf = cn)
except ValueError:
    warnings.warn("SparkContext already exists in this scope")

#cols = ['id', 'source', 'name', 'author', 'url', 'lang', 'format', 'isdone']
df_ebooks = pd.read_csv("datasets/ebooks.txt", sep='\t', dtype={'id': 'int64', 'source': 'string', 'name': 'string', 'author': 'string', 'url': 'string', 'format': 'int64', 'isdone':'bool'})
x=df_ebooks.to_numpy()

#books_df_01 = books_df[['id', 'url', 'format', 'isdone']]
def extractor(url):
    res = request.urlopen(url)
    raw = res.read().decode('utf8')
    return raw
ans = []

txt = extractor('http://www.gutenberg.org/cache/epub/63295/pg63295.txt')
len(txt)
txt[1:1000]

# Note: parallelizing a str yields one RDD element per character, not per line or token
raw_rdd = sc.parallelize(txt)
print(raw_rdd.count())
print(raw_rdd.collect())

print (raw_rdd.take(12))

spc_filter = raw_rdd.filter(lambda x: ' ' in x)
print(spc_filter.count())

col_filter = raw_rdd.filter(lambda x: ',' in x)
print(col_filter.count())

dot_filter = raw_rdd.filter(lambda x: '.' in x)
print(dot_filter.count())

sco_filter = raw_rdd.filter(lambda x: ';' in x)
print(sco_filter.count())

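# Token lengths: split the text on whitespace and map each token to its length.
# (int() cannot be applied to the list returned by split(), and raw_rdd holds
# single characters, so the tokens are taken from txt directly.)
lenTokenRdd = sc.parallelize(txt.split()).map(len)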
#lenTokenRdd.count()

#lenTokenRdd = dicRdd.map(lambda line: int(line.split(" ")[1]))

#df_ebooks.info()

#url = "http://www.gutenberg.org/cache/epub/63295/pg63295.txt"
#response = request.urlopen(url)
#raw = response.read().decode('utf8')
#len(raw)


  
32281
['\ufeff', 'T', 'h', 'e', ' ', 'P', 'r', 'o', 'j', 'e', 'c', 't', ' ', 'G', 'u', 't', 'e', 'n', 'b', 'e', 'r', 'g', ' ', 'E', 'B', 'o', 'o', 'k', ' ', 'o', 'f', ' ', 'T', 'h', 'e', ' ', 'W', 'i', 's', 'h', 'i', 'n', 'g', ' ', 'C', 'a', 'p', ',', ' ', 'b', 'y', ' ', 'M', 'a', 'r', 'y', ' ', 'M', 'a', 'r', 't', 'h', 'a', ' ', 'S', 'h', 'e', 'r', 'w', 'o', 'o', 'd', '\r', '\n', '\r', '\n', 'T', 'h', 'i', 's', ' ', 'e', 'B', 'o', 'o', 'k', ' ', 'i', 's', ' ', 'f', 'o', 'r', ' ', 't', 'h', 'e', ' ', 'u', 's', 'e', ' ', 'o', 'f', ' ', 'a', 'n', 'y', 'o', 'n', 'e', ' ', 'a', 'n', 'y', 'w', 'h', 'e', 'r', 'e', ' ', 'i', 'n', ' ', 't', 'h', 'e', ' ', 'U', 'n', 'i', 't', 'e', 'd', ' ', 'S', 't', 'a', 't', 'e', 's', ' ', 'a', 'n', 'd', ' ', 'm', 'o', 's', 't', '\r', '\n', 'o', 't', 'h', 'e', 'r', ' ', 'p', 'a', 'r', 't', 's', ' ', 'o', 'f', ' ', 't', 'h', 'e', ' ', 'w', 'o', 'r', 'l', 'd', ' ', 'a', 't', ' ', 'n', 'o', ' ', 'c', 'o', 's', 't', ' ', 'a', 'n', 'd', ' ', 'w', 'i', 't', 'h',