Pygrunn 14 article #34
base: pelican
Changes from all commits
cb0253e
2da68e1
6393cd1
fd9af3d
b0a5bc1
c8c26cf
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,249 @@ | ||
| Pygrunn 2014 | ||
| ============ | ||
|
|
||
| :date: 2014-05-10 15:56 | ||
| :tags: conference, talk, nlp, pygrunn | ||
| :category: life | ||
| :author: Dmitrijs Milajevs | ||
| :template: article_cover | ||
| :cover: 016-pygrunn14.jpg | ||
|
|
||
| `Pygrunn <http://www.pygrunn.org/>`_ is an awesome conference for Python | ||
| developers and friends, which takes place in | ||
| `Groningen <http://en.wikipedia.org/wiki/Groningen>`_. | ||
|
|
||
| As usual, the conference was perfectly organized. This is one of the most | ||
| stylish conferences I've ever attended. It keeps growing, and next year the | ||
| conference moves to a bigger venue, so keep the beginning of May 2015 free and | ||
| attend the event. | ||
|
|
||
| Another positive trend is the growing proportion of science-related talks. One | ||
| of the conference topics was (scientific) code quality and | ||
| collaboration between professional developers and scientists. | ||
|
|
||
| Check out the awesome summaries of talks by | ||
| `Reinout van Rees <http://reinout.vanrees.org/weblog/tags/pygrunn.html>`_ | ||
| and | ||
| `Maurits van Rees <http://maurits.vanrees.org/weblog/topics/pygrunn>`_. Get the | ||
| `#pygrunn <https://twitter.com/search?q=%23PyGrunn>`_ tweets and follow | ||
| `@pygrunn <https://twitter.com/PyGrunn>`_. | ||
|
|
||
|
|
||
| Computational linguistics 101 | ||
| ----------------------------- | ||
|
|
||
| `My presentation`__ started as a demonstration of the modern pythonic scientific | ||
| tools (my subjective classification): | ||
|
|
||
| __ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/pygrunn14.ipynb | ||
|
|
||
| 1. Data structures: NumPy_, SciPy_, Pandas_ | ||
| 2. Algorithms: scikit-learn_, NLTK_, TextBlob_, gensim_ | ||
| 3. Reporting: IPython_, matplotlib_, seaborn_ | ||
|
|
||
| .. _NumPy: http://www.numpy.org/ | ||
| .. _SciPy: http://www.scipy.org/scipylib/index.html | ||
| .. _Pandas: http://pandas.pydata.org/ | ||
| .. _scikit-learn: http://scikit-learn.org/ | ||
| .. _NLTK: http://www.nltk.org/ | ||
| .. _TextBlob: http://textblob.readthedocs.org | ||
| .. _gensim: http://radimrehurek.com/gensim/ | ||
| .. _IPython: http://ipython.org/ | ||
| .. _matplotlib: http://matplotlib.org/ | ||
| .. _seaborn: http://www.stanford.edu/~mwaskom/software/seaborn/ | ||
|
|
||
|
|
||
| However, I find technical talks with a lot of code rather boring, so I | ||
| decided to show how these libraries are used to solve simple computational | ||
| linguistics (CL) tasks. | ||
|
|
||
| A universal pattern behind natural languages | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| First, I `covered`__ `Zipf's law <http://en.wikipedia.org/wiki/Zipf%27s_law>`_, | ||
|
Just checking if the __ and the ` are ok. |
||
| which states that the frequency of any word in a corpus of texts is inversely | ||
| proportional to its rank in the frequency table. To show that the law holds for | ||
| an English text, I loaded `the BNC frequency list`__ provided by `Adam | ||
| Kilgarriff`__ into a `Pandas <http://pandas.pydata.org/>`_ `DataFrame`__ and | ||
|
In a RST file you can define the links for example at the end of the doc, and you will always link to the page. Example: More info here: http://docutils.sourceforge.net/docs/user/rst/quickref.html#external-hyperlink-targets |
||
| plotted the sorted frequencies on a log-log scale. | ||
|
|
||
| __ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/pygrunn14.ipynb#english-word-frequencies | ||
| __ http://www.kilgarriff.co.uk/BNClists/lemma.num | ||
| __ http://www.kilgarriff.co.uk/bnc-readme.html | ||
| __ http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.html | ||
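The rank/frequency table behind such a plot can be sketched in a few lines. This is only an illustration, not the notebook's code: it uses the standard library instead of Pandas, and the toy sentence and the helper names ``rank_frequency`` and ``log_log_points`` are made up for the example.

```python
from collections import Counter
import math


def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    # most_common() sorts by descending count; ranks start at 1.
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


def log_log_points(table):
    """Map each (rank, word, count) to (log rank, log count) for plotting."""
    return [(math.log(rank), math.log(count)) for rank, _, count in table]


# Toy text, purely illustrative; a real experiment needs a large corpus.
tokens = "the cat and the dog and the bird".split()
table = rank_frequency(tokens)
points = log_log_points(table)
```

Plotting ``points`` for a large corpus should give a roughly straight, descending line if Zipf's law holds.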
|
|
||
| .. image:: {filename}/static/images/016-bnc_freq.png | ||
|    :align: center | ||
|    :alt: English word frequency counts extracted from the British National Corpus on the log-log scale. | ||
|    :target: {filename}/static/images/016-bnc_freq.png | ||
|
|
||
| As homework, I asked whether the same behavior is observed in | ||
| other languages and what the differences are. | ||
|
|
||
| Distributional semantics | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| I could not resist and `presented`__ my `research area`__ :) by extracting word | ||
| co-occurrence counts and projecting the word vectors to 2 dimensions using | ||
| the `scikit-learn`__ implementation of `manifold learning`__. | ||
|
|
||
| __ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/pygrunn14.ipynb#distributional-semantics | ||
| __ http://www.eecs.qmul.ac.uk/~dm303/ | ||
| __ http://scikit-learn.org/stable/ | ||
| __ http://scikit-learn.org/stable/modules/manifold.html | ||
|
|
||
| In distributional semantics, words are represented as rows in a matrix. The | ||
| columns correspond to the other words a word co-occurs with. The values in the | ||
| matrix are the number of times the two words co-occurred. For example, here are | ||
| the vectors for the words ``idea``, ``notion``, ``boy`` and ``girl``. | ||
|
|
||
| ======= ========== ==== ====== | ||
| \       philosophy book school | ||
| ======= ========== ==== ====== | ||
| idea    10         47   39     | ||
| notion  7          3    15     | ||
| boy     0          12   146    | ||
| girl    0          19   93     | ||
| ======= ========== ==== ====== | ||
|
|
||
| So, ``idea`` was seen with ``philosophy`` 10 times in the corpus I used. A | ||
| co-occurrence in this case means that ``philosophy`` was no more than 5 words | ||
| away from ``idea``. | ||
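Counting co-occurrences within a window can be sketched as follows. This is a minimal illustration under my own assumptions, not the code from the notebook: the function name ``cooccurrence_counts``, the symmetric window and the toy sentence are all hypothetical.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, targets, window=5):
    """Count context words seen within `window` positions of each target."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        if word not in targets:
            continue
        # Look at most `window` tokens to the left and to the right.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts


# Toy sentence, purely illustrative.
tokens = "the idea in this philosophy book is a notion taught at school".split()
counts = cooccurrence_counts(tokens, targets={"idea", "notion"}, window=5)
```

Each ``counts[word]`` is one row of the matrix; picking a fixed list of context words turns it into a vector.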
|
|
||
| The number patterns for ``boy`` and ``girl`` are much more similar than for | ||
| ``boy`` and ``notion``, suggesting that ``boy`` is much more similar to ``girl`` | ||
| than to ``notion``. Clearly, we can select many more words to label the columns, | ||
| making the similarity reasoning more precise. | ||
|
|
||
| We can reason about word semantic similarity from a geometrical point of view using | ||
| a distance measure (for example, Euclidean distance). The closer two vectors are | ||
| to each other in the vector space, the closer the words are semantically. | ||
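As a quick check, here is a small sketch that computes Euclidean distances between the four vectors from the table above. The helper names ``euclidean`` and ``nearest`` are made up for the illustration, and raw counts are used as they are, without any weighting.

```python
import math

# Co-occurrence vectors copied from the table above
# (columns: philosophy, book, school).
vectors = {
    "idea": (10, 47, 39),
    "notion": (7, 3, 15),
    "boy": (0, 12, 146),
    "girl": (0, 19, 93),
}


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(word):
    """Return the other word whose vector is closest to `word`."""
    others = (w for w in vectors if w != word)
    return min(others, key=lambda w: euclidean(vectors[word], vectors[w]))
```

With these numbers, ``nearest("boy")`` is ``girl`` and ``nearest("idea")`` is ``notion``, which matches the intuition.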
|
|
||
| Unfortunately, it's difficult for humans to reason in more than 3 dimensions. | ||
| While the multidimensional space is useful for computations, it's of little use | ||
| for presenting the patterns that words share. If we could imagine the space, we would | ||
| discover areas (or directions) that correspond to the girlish/boyish words and | ||
| to the more abstract idea/notion. | ||
|
|
||
| To overcome the issue, we can reduce the dimensionality of the space in such a | ||
| way that the distance between the elements is respected. Clearly, we can't | ||
| completely preserve the distances, but it's possible to respect the distances to | ||
| some degree. | ||
|
|
||
| Manifold learning is one of many techniques to perform dimensionality reduction. | ||
| If we apply it to the extracted co-occurrence counts for some of the words and | ||
| reduce to two dimensions (so we can plot it), we will notice that related words | ||
| cluster around each other. | ||
|
|
||
| .. image:: {filename}/static/images/016-ds.png | ||
|    :align: center | ||
|    :alt: Word semantic relatedness. | ||
|    :target: {filename}/static/images/016-ds.png | ||
|
|
||
|
|
||
| Sprint | ||
| ------ | ||
|
|
||
| `Spyros Ioakeimidis <https://twitter.com/_spyreto_>`_ and | ||
| `Sjoerd de Haan <https://www.linkedin.com/profile/view?id=22830170>`_ liked the | ||
|
@spyreto and Sjoerd de Haan liked the idea of counting word frequencies among various languages and see how they compare in relation to Zipf's law. |
||
| idea of counting word frequencies across various languages and seeing how they | ||
| compare in relation to Zipf's law. | ||
|
|
||
| Initially, we wanted to take EU directives and compare the official EU languages. | ||
| However, the website was down, and we were kindly redirected to | ||
| `this page <http://sorry.ec.europa.eu/>`_ every time we wanted to get a legal | ||
| document. | ||
|
|
||
| Luckily, we found already prepared `word frequency lists for many languages | ||
| <http://invokeit.wordpress.com/frequency-word-lists/>`_ and reused them. We | ||
| wrote a simple function to plot the frequency of the words against the rank of | ||
| the words in the frequency table. Here are the 10 most frequently used words | ||
| in English, Dutch and Latvian: | ||
|
|
||
| ==== ======== ========= ======== ========= ======== ========= | ||
| \    English            Dutch              Latvian            | ||
| ---- ------------------ ------------------ ------------------ | ||
| Rank Word     Frequency Word     Frequency Word     Frequency | ||
| ==== ======== ========= ======== ========= ======== ========= | ||
| 1    you      6281002   ik       2091479   ir       20182     | ||
| 2    i        5685306   je       1995150   es       19042     | ||
| 3    the      4768490   het      1428477   un       12737     | ||
| 4    to       3453407   de       1399236   tu       12141     | ||
| 5    a        3048287   is       1202489   tas      8601      | ||
| 6    it       2879962   dat      1188131   ka       7964      | ||
| 7    and      2127187   een      1011496   man      7725      | ||
| 8    that     2030642   niet     997681    to       7535      | ||
| 9    of       1847884   en       774098    vai      7527      | ||
| 10   in       1554103   wat      618627    ko       6906      | ||
| ==== ======== ========= ======== ========= ======== ========= | ||
|
|
||
| If you plot the word rank on the x axis and the word frequency on the y axis on | ||
| a log-log scale, you should see a straight line. A straight line on a log-log | ||
| plot implies that the quantities on the two axes are related through a power law. | ||
| Thus, if our data fit a straight line perfectly, that would mean that the | ||
| frequency of a word is exactly proportional to a power of the rank of | ||
| that word in the frequency table. This is the content of Zipf's law, but | ||
| of course, such laws are never exact. | ||
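To make the power-law reading concrete: if the frequency is ``C * rank ** a``, then ``log(frequency) = log(C) + a * log(rank)``, so the slope of the line is the exponent ``a``. Below is a minimal least-squares sketch of estimating that slope. It is an assumption-laden illustration, not the code we ran: the function name ``fit_log_log``, the use of natural logarithms and the synthetic data are all my choices for the example.

```python
import math


def fit_log_log(frequencies):
    """Least-squares line through (log rank, log frequency) points.

    `frequencies` must be sorted in descending order; rank 1 is the first item.
    Returns (slope, intercept) using natural logarithms.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares for a single predictor.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic data that follows f(r) = 1000 / r exactly.
ideal = [1000 / rank for rank in range(1, 51)]
slope, intercept = fit_log_log(ideal)
```

For the synthetic data the fitted slope is -1 (up to rounding) and the intercept is ``log(1000)``; real frequency lists will deviate from the line, especially at the very top and in the long tail.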
|
|
||
| .. image:: {filename}/static/images/016-en_zipf.png | ||
|    :align: center | ||
|    :alt: English word frequency counts on the log-log scale. | ||
|
The image is not found
Contributor
Author
github doesn't know how to find it, but our blog engine does :) |
||
|    :target: {filename}/static/images/016-en_zipf.png | ||
|
|
||
| The blue line shows the provided frequencies; the green line is a regression line. | ||
|
|
||
| One thing we can compare amongst languages is how well this plot follows a | ||
| straight line. The slope of the line also contains interesting information: it | ||
| tells us the exponent of the power law we are dealing with. | ||
|
|
||
| The slope is related to the morphology of a language. For example, in Latvian, | ||
| which has quite rich morphology, the word `"city"` is `"pilsēta"`, but the | ||
| English phrase `"in a city"` is `"pilsētā"`. All the occurrences of "`pilsēta`" | ||
| in a Latvian text will be distributed over several morphological forms, lowering | ||
| the counts. As a result, the slope for a Latvian text will be less steep | ||
| compared to English. | ||
|
|
||
| We `tried`__ English, Ukrainian, Dutch, Russian, Latvian, Spanish and Italian. All | ||
| languages obey Zipf's law, at least visually. | ||
|
|
||
| __ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/Word%20frequencies.ipynb | ||
|
|
||
| ========= ========= =========== | ||
| Language  Slope     Intercept   | ||
| ========= ========= =========== | ||
| en        -1.717729 21.934904   | ||
| uk        -1.044263 11.212273   | ||
| nl        -1.566664 19.635268   | ||
| ru        -1.395736 17.781756   | ||
| lv        -1.055992 11.541761   | ||
| es        -1.707326 22.161790   | ||
| it        -1.601567 20.000540   | ||
| ========= ========= =========== | ||
|
|
||
| Theory [Li1992]_ says that the slope coefficient should be close to -1. As the | ||
| table above shows, the values deviate from -1 quite drastically (-1.57 for | ||
| Dutch, for example). Also, the `slope estimate`__ for English from the `British | ||
| National Corpus`__ is -1.18, in contrast to the -1.72 obtained above. Here is the Zipf's law | ||
| visualization for English extracted from the BNC. | ||
|
|
||
| __ http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/pygrunn14.ipynb#estimating-the-slope | ||
| __ http://www.natcorp.ox.ac.uk/ | ||
|
|
||
| .. image:: {filename}/static/images/016-en_bnc_zipf.png | ||
|    :align: center | ||
|    :alt: Actual and estimated English word frequencies from the BNC. | ||
|    :target: {filename}/static/images/016-en_bnc_zipf.png | ||
|
|
||
| Conclusion | ||
|
Member
Maybe you could add a few closing words on Pygrunn, maybe something relating to linguistics as well. Or if you don't want to add anything you could just change Conclusion to References & keep the link below. |
||
| ---------- | ||
|
|
||
| Pygrunn is a great conference that is starting to attract not only (professional web) | ||
| developers, but also scientists. I was really surprised that my talk got a bit | ||
| of attention and people were willing to hack around a linguistic phenomenon. I | ||
| hope that this trend continues next year and the two communities become closer | ||
| to each other. | ||
|
|
||
| .. [Li1992] Li, Wentian. | ||
|    `Random texts exhibit Zipf's-law-like word frequency distribution.`__ | ||
|    Information Theory, IEEE Transactions on 38.6 (1992): 1842-1845. | ||
|
|
||
| __ http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.164.8422&rep=rep1&type=pdf | ||
It looks like the notebook is not loading.
it works now.
The article is improving!
Still the notebook is not loading. It is not found on the server.
strange, it works for me, maybe there are some problems on the server. I'll give a link to the original file and to the rendered version.
we can use word frequencies available here http://wacky.sslmit.unibo.it/doku.php?id=frequency_lists