**Corpus Description**

Link reference [here](https://www.kaggle.com/rtatman/ubuntu-dialogue-corpus).

**Content:**

The new Ubuntu Dialogue Corpus consists of almost one million two-person conversations extracted from the Ubuntu chat logs, used to receive technical support for various Ubuntu-related problems. The conversations have an average of 8 turns each, with a minimum of 3 turns. All conversations are carried out in text form (not audio).

The full dataset contains 930,000 dialogues and over 100,000,000 words and is available here. The Kaggle version linked above is a sample of it, spread across .csv files, and contains more than 269 million words of text over 26 million turns. Each file has the following columns:

* folder: The folder that a dialogue comes from. Each file contains dialogues from one folder.
* dialogueID: An ID number for a specific dialogue. Dialogue IDs are reused across folders.
* date: A timestamp of the time this line of dialogue was sent.
* from: The user who sent that line of dialogue.
* to: The user to whom they were replying. On the first turn of a dialogue, this field is blank.
* text: The text of that turn of dialogue, enclosed in double quotes ("). Line breaks (\n) have been removed.
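As a rough illustration of the column layout described above, the sketch below reads a few rows with the standard `csv` module and groups turns by `(folder, dialogueID)`, since dialogue IDs repeat across folders. The sample rows and the helper name `group_dialogues` are invented for this example and do not come from the corpus.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows that follow the documented column layout.
SAMPLE = (
    "folder,dialogueID,date,from,to,text\n"
    '3,1,2008-04-23T14:55:00,alice,,"how do I install vim?"\n'
    '3,1,2008-04-23T14:56:00,bob,alice,"sudo apt-get install vim"\n'
    '4,1,2009-01-02T10:00:00,carol,,"wifi is not working"\n'
)

def group_dialogues(handle):
    """Group turns by (folder, dialogueID), because IDs repeat across folders."""
    dialogues = defaultdict(list)
    for row in csv.DictReader(handle):
        key = (row["folder"], row["dialogueID"])
        # "from" is a Python keyword, so it is read as a dictionary key.
        dialogues[key].append((row["from"], row["to"], row["text"]))
    return dict(dialogues)

dialogues = group_dialogues(io.StringIO(SAMPLE))
for key, turns in dialogues.items():
    print(key, len(turns), "turn(s)")
```

Keying on the folder as well as the ID keeps the two dialogues with ID `1` apart.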

**Files information**

| lines  |  words  | characters  |      filename      |  size  |
|--------|---------|-------------|--------------------|--------|
|9212878 |91660344 |  996253904  |dialogueText_196.csv|996.3 MB|
|16587831|166392849| 1799936480  |dialogueText_301.csv|1.8 GB  |
|1038325 |11035331 |  116070597  |dialogueText.csv    |116.1 MB|

**Context:**

- Selected file: dialogueText.csv
- The data was split by user with `scripts/process_file.py`, and the five largest resulting files were selected:

```
1. -rw-rw-r-- 1 stiven stiven 535351 abr 18 19:37 ActionParsnip.csv
2. -rw-rw-r-- 1 stiven stiven 440231 abr 18 19:37 jrib.csv
3. -rw-rw-r-- 1 stiven stiven 434792 abr 18 19:37 Dr_Willis.csv
4. -rw-rw-r-- 1 stiven stiven 431646 abr 18 19:37 bazhang.csv
5. -rw-rw-r-- 1 stiven stiven 429946 abr 18 19:37 Pici.csv
```
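The contents of `scripts/process_file.py` are not shown here, so the following is only a guess at the kind of processing involved: group every line of dialogue under its sender (the `from` column) and rank users by the size of their combined text. The function names and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the dialogueText.csv column layout.
SAMPLE = (
    "folder,dialogueID,date,from,to,text\n"
    '1,1,2008-01-01,alice,,"hello there"\n'
    '1,1,2008-01-01,bob,alice,"hi"\n'
    '1,2,2008-01-02,alice,,"another long question"\n'
)

def split_by_user(handle):
    """Collect every line of dialogue under the user who sent it."""
    per_user = defaultdict(list)
    for row in csv.DictReader(handle):
        per_user[row["from"]].append(row["text"])
    return per_user

def heaviest_users(per_user, n=5):
    """Rank users by the UTF-8 size of their combined text, largest first."""
    sizes = {
        user: sum(len(text.encode("utf-8")) for text in texts)
        for user, texts in per_user.items()
    }
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:n]

print(heaviest_users(split_by_user(io.StringIO(SAMPLE)), n=2))
```

In a real run each user's list would be written to its own file, which is what the size listing above reflects.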

In [None]:
# Importations

import nltk
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import Counter
nltk.download("book")
nltk.download('wordnet')

1. **Normalization**: The only normalization applied is converting upper case to lower case.

In [62]:
# Use a context manager so the file is closed after reading.
with open("./ActionParsnip.txt", "r", encoding = "UTF-8") as text:
    raw = text.read()
type(raw)
len(raw)
raw = raw.lower()

2. **Segmentation**: Sentence tokenization via `sent_tokenize` of the nltk python module

In [63]:
sentences = sent_tokenize(raw)
print(len(sentences))
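# Write one sentence per line; writelines() does not add separators itself.
with open('./frases.txt', 'w', encoding='utf-8') as f:
    f.writelines(sentence + '\n' for sentence in sentences)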

335


3. **Tokenization**: Keeping only alphabetic tokens removes 18.979% of the tokens (40613 to 32905).

In [64]:
tokens = word_tokenize(raw)
len(tokens) # -> 40613

tokens = [ word for word in tokens if word.isalpha()]
len(tokens) # -> 32905

word_count = Counter(tokens)
word_count.most_common(20)

[('the', 1682),
 ('you', 1472),
 ('to', 880),
 ('it', 731),
 ('a', 720),
 ('is', 650),
 ('in', 621),
 ('and', 618),
 ('can', 520),
 ('use', 411),
 ('will', 405),
 ('for', 320),
 ('sudo', 288),
 ('i', 266),
 ('run', 261),
 ('of', 256),
 ('then', 254),
 ('install', 251),
 ('as', 249),
 ('if', 245)]

4. **Stop words elimination:** 

- 55.04% of the alphabetic tokens remain after stop-word removal (18112 of 32905), a reduction of 44.96%.
- 44.60% of the original tokens remain (18112 of 40613), a reduction of 55.40%.
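The percentages in this section can be rechecked from the token counts recorded in the code cells (40613, 32905 and 18112). In this small sketch, the 55.04% figure comes out as the share of alphabetic tokens that survive stop-word removal, not the share removed. The variable and function names are chosen for the example only.

```python
raw_tokens = 40613      # tokens returned by word_tokenize
alpha_tokens = 32905    # tokens left after the isalpha() filter
content_tokens = 18112  # tokens left after stop-word removal

def pct(part, whole):
    """Share of `whole` represented by `part`, as a percentage."""
    return 100 * part / whole

removed_non_alpha = pct(raw_tokens - alpha_tokens, raw_tokens)
kept_vs_alpha = pct(content_tokens, alpha_tokens)
kept_vs_raw = pct(content_tokens, raw_tokens)

print(f"removed by isalpha():      {removed_non_alpha:.2f}%")
print(f"kept vs alphabetic tokens: {kept_vs_alpha:.2f}%")
print(f"kept vs all tokens:        {kept_vs_raw:.2f}%")
```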


In [65]:
stopwords.words('english')[1:10]

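# Build the stop-word set once; reloading the list for every token is very slow.
stop_words = set(stopwords.words('english'))
tokens = [ word for word in tokens if word not in stop_words ]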
len(tokens) # -> 18112

18112

5. **Stemming and lemmatization:** 

In [66]:
# STEMMING

from nltk.stem import PorterStemmer, SnowballStemmer 
porter_stem = PorterStemmer()
snowball_stem = SnowballStemmer("english")

porter_tokens = [ porter_stem.stem(word) for word in tokens ]
snowball_tokens = [ snowball_stem.stem(word) for word in tokens ]

original_word_count = Counter(tokens)
porter_word_count = Counter(porter_tokens)
snowball_word_count = Counter(snowball_tokens)

# Porter
print('---------------------------(Porter)----------------------------------')
percentage = (1 - len(porter_word_count) / len(original_word_count)) * 100
print(f'{percentage}% of vocabulary size reduction')

# Snowball
print('--------------------------(Snowball)----------------------------------')
percentage = (1 - len(snowball_word_count) / len(original_word_count)) * 100
print(f'{percentage}% of vocabulary size reduction')

# LEMMATIZATION

from nltk import WordNetLemmatizer
lemma = WordNetLemmatizer()

# Only noun lemmatization is applied. The original verb pass (pos='v') was
# overwritten by the noun pass, so it had no effect on the reported result.
lemma_tokens = [ lemma.lemmatize(word, pos='n') for word in tokens ]

lemma_word_count = Counter(lemma_tokens)

# Lemmatization
print('------------------------(Lemmatization)-------------------------------')
percentage = (1 - len(lemma_word_count) / len(original_word_count)) * 100
print(f'{percentage}% of vocabulary size reduction')

---------------------------(Porter)----------------------------------
10.873408769448378% of vocabulary size reduction
--------------------------(Snowball)----------------------------------
10.590523338048087% of vocabulary size reduction
------------------------(Lemmatization)-------------------------------
3.6951909476661937% of vocabulary size reduction
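The reduction figure above is the relative drop in the number of distinct word types once each token is normalized. The sketch below shows the same metric on a toy token list. The `naive_stem` function is a hypothetical stand-in that strips a few suffixes; it is not the Porter or Snowball algorithm and is used only so the example runs without NLTK.

```python
from collections import Counter

def naive_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def vocabulary_reduction(tokens, normalize):
    """Percentage drop in distinct word types after applying `normalize`."""
    original = Counter(tokens)
    reduced = Counter(normalize(token) for token in tokens)
    return (1 - len(reduced) / len(original)) * 100

toy_tokens = ["install", "installed", "installing", "installs",
              "run", "runs", "sudo", "sudo"]
# 7 distinct words collapse to 3 stems (install, run, sudo).
reduction = vocabulary_reduction(toy_tokens, naive_stem)
print(f"{reduction:.2f}% of vocabulary size reduction")
```

Passing a real stemmer's `stem` method as `normalize` would give the same kind of figure as the one reported for the corpus.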


6. **Modifying sentences, collocations and text generator**

In [59]:
import pandas as pd

sentences = [ f'<s> {sentence} </s>' for sentence in sentences]
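# Write one marked sentence per line; writelines() does not add separators itself.
with open('./frases.txt', 'w', encoding='utf-8') as f:
    f.writelines(sentence + '\n' for sentence in sentences)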
 
bigrams = nltk.collocations.BigramAssocMeasures()
trigrams = nltk.collocations.TrigramAssocMeasures()
bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(tokens)
trigramFinder = nltk.collocations.TrigramCollocationFinder.from_words(tokens)

# bigrams
bigram_freq = bigramFinder.ngram_fd.items()
bigramFreqTable = pd.DataFrame(list(bigram_freq), columns=['bigram','freq']).sort_values(by='freq', ascending=False)
# trigrams
trigram_freq = trigramFinder.ngram_fd.items()
trigramFreqTable = pd.DataFrame(list(trigram_freq), columns=['trigram','freq']).sort_values(by='freq', ascending=False)

bigramFreqTable
# trigramFreqTable

| index |     bigram     | freq |
|-------|----------------|------|
|118    |(run, sudo)     |68    |
|848    |(sudo, lshw)    |40    |
|172    |(sudo, install) |34    |
|332    |(test, iso)     |33    |
|565    |(may, need)     |27    |
|...    |...             |...   |
|5685   |(fine, times)   |1     |
|5686   |(times, use)    |1     |
|5687   |(use, itits)    |1     |
|5688   |(itits, running)|1     |


In [None]:
split_sentences = [ sentence.split() for sentence in sentences ]
bigrams_sentences = nltk.FreqDist([bigram for sentence in split_sentences for bigram in list(nltk.bigrams(sentence))])
display(bigrams_sentences)

Text is generated with the stochastic sampling process presented in class number five:

In [None]:
import numpy as np

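def stochastic(freqdist, keys=None):
    # Sample a key with probability proportional to its relative frequency.
    pivot = np.random.rand()
    acc = 0.
    if keys is None: 
        keys = freqdist.keys()
    else:
        # Rescale the pivot to the total probability mass of the candidate keys.
        pivot = pivot * np.sum([freqdist.freq(k) for k in keys])
        
    key = None
    for key in keys:
        acc = acc + freqdist.freq(key) 
        if pivot < acc: return key
    # Floating-point rounding can leave pivot >= acc; fall back to the last key.
    return key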

def keys(freqdist, start: tuple):
    assert(type(start) is tuple)
    return [key for key in freqdist.keys() if start == key[:len(start)]]

def bigram_generator(prev=None, max_length=100):
    # The corpus was lowercased during normalization, so the seed text must be too.
    sentence = prev.lower().split() if prev else list()
    prev = (sentence[-1],) if prev else ('<s>', )
    for i in range(max_length):
        keys_list = keys(bigrams_sentences, prev)
        if not keys_list: break  # no known continuation for this word
        curr = stochastic(bigrams_sentences, keys_list)[-1]
        if curr == '</s>': break
        sentence.append(curr)
        prev = (curr, )
    return ' '.join(sentence)

bigram_generator()
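For readers without the corpus at hand, here is a minimal self-contained sketch of the same idea on three toy sentences. It is not the class implementation: it swaps the cumulative-pivot loop for the standard library's `random.choices`, which also draws proportionally to the given weights. The function names and toy sentences are made up for the example.

```python
import random
from collections import Counter, defaultdict

def build_bigram_model(sentences):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(words, words[1:]):
            model[prev][curr] += 1
    return model

def generate(model, rng, max_length=20):
    """Sample one word at a time until </s> or max_length is reached."""
    prev, output = "<s>", []
    for _ in range(max_length):
        followers = model.get(prev)
        if not followers:
            break  # no known continuation for this word
        # random.choices draws proportionally to the given weights.
        curr = rng.choices(list(followers), weights=list(followers.values()))[0]
        if curr == "</s>":
            break
        output.append(curr)
        prev = curr
    return " ".join(output)

toy_sentences = ["run sudo apt update", "run sudo lshw", "sudo apt install vim"]
model = build_bigram_model(toy_sentences)
print(generate(model, random.Random(0)))
```

Passing a seeded `random.Random` instance makes a run repeatable, which helps when comparing generated sentences.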

7. **Similarity**

In [79]:
# This program computes the "distance" between two text files
# as the angle between their word frequency vectors (in radians).
#
# For each input file, a word-frequency vector is computed as follows:
#    (1) the specified file is read in and converted into a list of lowercase
#        alphabetic "words", with stop words removed
#    (2) for each word, its frequency of occurrence is determined
#    (3) the word/frequency pairs are stored in a dictionary
#
# The "distance" between two vectors is the angle between them.
# If x = (x1, x2, ..., xn) is the first vector (xi = freq of word i)
# and y = (y1, y2, ..., yn) is the second vector,
# then the angle between them is defined as:
#    d(x,y) = arccos(inner_product(x,y) / (norm(x)*norm(y)))
# where:
#    inner_product(x,y) = x1*y1 + x2*y2 + ... xn*yn
#    norm(x) = sqrt(inner_product(x,x))

import math
    # math.acos(x) is the arccosine of x.
    # math.sqrt(x) is the square root of x.

import string

import sys

########################################################################
# Operation 1: read a text file and apply natural language processing ##
########################################################################
def read_file(filename):
    """ 
    Read the text file with the given filename, lowercase and tokenize it;
    return four token lists: filtered tokens, Porter stems,
    Snowball stems and noun lemmas.
    """
    try:
        # Use a context manager so the file is closed after reading.
        with open(filename, "r", encoding = "UTF-8") as text:
            raw = text.read()
        raw = raw.lower()
        tokens = word_tokenize(raw)
        tokens = [ word for word in tokens if word.isalpha()]
        # Build the stop-word set once instead of reloading the list per token.
        stop_words = set(stopwords.words('english'))
        tokens = [ word for word in tokens if word not in stop_words ]
        
        from nltk.stem import PorterStemmer, SnowballStemmer 
        porter_stem = PorterStemmer()
        snowball_stem = SnowballStemmer("english")
        porter_tokens = [ porter_stem.stem(word) for word in tokens ]
        snowball_tokens = [ snowball_stem.stem(word) for word in tokens ]
        
        from nltk import WordNetLemmatizer
        lemma = WordNetLemmatizer()
        # Only the noun pass is kept; the earlier verb pass was overwritten by it.
        lemma_tokens = [ lemma.lemmatize(word, pos='n') for word in tokens ]
        
        return tokens, porter_tokens, snowball_tokens, lemma_tokens
    except IOError:
        print("Error opening or reading input file: ",filename)
        sys.exit()

##############################################
# Operation 2: count frequency of each word ##
##############################################
def count_frequency(word_list):
    """
    Return a dictionary mapping words to frequency.
    """
    D = {}
    for new_word in word_list:
        if new_word in D:
            D[new_word] = D[new_word]+1
        else:
            D[new_word] = 1
    return D

#############################################
## compute word frequencies for input file ##
#############################################
def word_frequencies_for_file(filename):
    """
    Return dictionary of (word,frequency) pairs for the given file.
    """

    word_list, stem_porter_word , stem_snowball_word, lem_word = read_file(filename)
    
    # word_list can be swapped for the stemmed or lemmatized tokens
    freq_mapping = count_frequency(word_list)

    print("File",filename,":")
    print(len(word_list),"words,")
    print(len(freq_mapping),"distinct words")

    return freq_mapping

def inner_product(D1,D2):
    """
    Inner product between two vectors, where vectors
    are represented as dictionaries of (word,freq) pairs.

    Example: inner_product({"and":3,"of":2,"the":5},
                           {"and":4,"in":1,"of":1,"this":2}) = 14.0 
    """
    sum = 0.0
    for key in D1:
        if key in D2:
            sum += D1[key] * D2[key]
    return sum

def vector_angle(D1,D2):
    """
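    The inputs are dictionaries mapping words to frequencies.

    Return the angle (in radians) between these two vectors.
    """
    numerator = inner_product(D1,D2)
    denominator = math.sqrt(inner_product(D1,D1)*inner_product(D2,D2))
    # Clamp the ratio so floating-point rounding cannot push it above 1.0,
    # which would make math.acos raise a ValueError.
    return math.acos(min(1.0, numerator/denominator))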

def main():
    filename_1 = 'ActionParsnip.txt'
    filename_2 = 'Pici.txt'
    sorted_word_list_1 = word_frequencies_for_file(filename_1)
    sorted_word_list_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(sorted_word_list_1,sorted_word_list_2)
    print("The distance between the documents is: %0.6f (radians)"%distance)

main()

File ActionParsnip.txt :
18112 words,
5656 distinct words
File Pici.txt :
13528 words,
3238 distinct words
The distance between the documents is: 0.964608 (radians)
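To put the reported angle in context: documents with the same word proportions give an angle near 0, and documents with no words in common give π/2 (about 1.5708 radians). The sketch below recomputes the formula on the two small vectors from the `inner_product` docstring. The function name `angle_between` is introduced for this example only.

```python
import math

def angle_between(d1, d2):
    """Angle in radians between two sparse word-frequency vectors (dicts)."""
    dot = sum(freq * d2.get(word, 0) for word, freq in d1.items())
    norm1 = math.sqrt(sum(f * f for f in d1.values()))
    norm2 = math.sqrt(sum(f * f for f in d2.values()))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.acos(cosine)

d1 = {"and": 3, "of": 2, "the": 5}
d2 = {"and": 4, "in": 1, "of": 1, "this": 2}

angle = angle_between(d1, d2)
print(f"{angle:.4f} rad = {math.degrees(angle):.1f} degrees")
print(angle_between(d1, d1))              # identical vectors: approximately 0
print(angle_between({"a": 1}, {"b": 1}))  # no shared words: pi / 2
```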
