umberto-sonnino/cooccurence

Homework II: n-gram and co-occurrence estimation

Given a text corpus, there are two objectives:

  1. Calculate a list of n-grams and their frequencies in the corpus: n-gram list
  2. Create, for a given n-gram, a sorted list of similar words: co-occurrence list

You can work with either of the following corpora:

  1. ukWaC (the corpus used here; not uploaded to the repository)
  2. Wikipedia corpus

Lemmatize the text corpus, if needed. Then go over the processed text and create a list of all n-grams (unigrams and bigrams), and calculate the frequency of each of these n-grams in the corpus.
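
The counting step can be sketched in a few lines of Python. This is only an illustration, not the code in this repository: the function name `count_ngrams`, the whitespace tokenization, and the toy sentence are assumptions, and the input is assumed to be already lemmatized.

```python
from collections import Counter


def count_ngrams(tokens, n_values=(1, 2)):
    """Count every n-gram of the requested sizes in a list of tokens.

    Hypothetical helper for illustration; n-grams are stored as tuples
    so that unigrams and bigrams can share one Counter.
    """
    counts = Counter()
    for n in n_values:
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy input standing in for an already lemmatized corpus.
tokens = "the cat sat on the mat".split()
ngram_counts = count_ngrams(tokens)

# Print the n-gram list sorted by descending frequency.
for ngram, freq in ngram_counts.most_common():
    print(" ".join(ngram), freq)
```

For a corpus the size of ukWaC, the same idea would more likely be applied line by line while streaming the file, rather than on one in-memory token list.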

Co-occurrence estimation has been done with the Jaccard similarity coefficient.
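
For two sets A and B, the Jaccard coefficient is |A ∩ B| / |A ∪ B|. The README does not say which sets are compared, so the sketch below makes an assumption: each word is represented by the set of sentence indices in which it appears. The names `build_context_sets`, `jaccard`, and `similar_words` are hypothetical and not taken from this repository, and the example ranks single words rather than arbitrary n-grams.

```python
from collections import defaultdict


def build_context_sets(sentences):
    """Map each word to the set of sentence indices it occurs in (assumed representation)."""
    contexts = defaultdict(set)
    for idx, sentence in enumerate(sentences):
        for word in sentence:
            contexts[word].add(idx)
    return contexts


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similar_words(target, contexts):
    """Return (word, score) pairs sorted by descending score, ties broken alphabetically."""
    target_set = contexts.get(target, set())
    scores = [
        (word, jaccard(target_set, ctx))
        for word, ctx in contexts.items()
        if word != target
    ]
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores


# Toy corpus: each inner list is one tokenized, lemmatized sentence.
sentences = [
    ["cat", "sat", "mat"],
    ["cat", "ate", "fish"],
    ["dog", "sat", "mat"],
]
contexts = build_context_sets(sentences)
ranked = similar_words("cat", contexts)
print(ranked)
```

Other choices of context, such as a fixed window of neighboring words, plug into the same `jaccard` function unchanged; only `build_context_sets` would differ.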
