## Tasks in Natural Language Processing

* Tokenization - Breaking down text into words and sentences
* Stopword Removal - Filtering common words
* N-Grams - Identifying commonly occurring groups of adjacent words
* Word Sense Disambiguation - Identifying which meaning of a word is intended from the context in which it occurs
* Part-of-Speech Tagging - Labeling each word as a noun, verb, adjective, and so on
* Stemming - Reducing words to a root form by removing their endings
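
Several of the tasks above can be sketched in plain Python. This is a minimal illustration using only the standard library, not how a toolkit such as NLTK implements them. The stopword list, the suffix rules in `naive_stem`, and the sample sentence are all assumptions made up for the example; real stopword lists and stemmers are far more thorough.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "on", "over"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def naive_stem(token):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


text = "The cat is chasing the mice. The cat chased the mice over the fence."
tokens = tokenize(text)
content = remove_stopwords(tokens)
bigram_counts = Counter(ngrams(tokens, 2))
stems = [naive_stem(t) for t in content]

print(content)
print(bigram_counts.most_common(2))
print(stems)
```

Note that the crude stemmer maps both "chasing" and "chased" to "chas", which is not a real word. Stemming only aims to group related word forms, not to produce dictionary entries.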


## Natural Language Processing (NLP)
Natural Language Processing, or NLP, refers to a collection of techniques that let a computer make sense of its interactions with a human being through a natural language. NLP is a broad discipline in computer science that draws on artificial intelligence, computational linguistics, and human-computer interaction (HCI). Several NLP subfields are particularly relevant to a data scientist.
Tokenization, parsing, sentence segmentation, and named entity recognition are some of them. Tokenization isolates each symbol (token) in a text, and parsing conducts a grammatical analysis of those tokens. Sentence segmentation separates one sentence from another in a text. Named entity recognition identifies which tokens map to which types of proper names. A significant portion of the data you deal with as a data scientist is unstructured.
That is, it is not extracted from a database but from sources such as social media sites, text documents, pictures, and so on. Therefore, one of the biggest challenges for a data scientist is to sort through this unstructured data and pre-process it so that data mining and analytics tools can take over and extract the knowledge being sought. Luckily for data scientists, there are already well-developed NLP tools available for programming languages such as Python.
Some of these tools are also built into operating systems such as Unix or Linux.
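
To make sentence segmentation and named entity recognition more concrete, here is a rough sketch in plain Python. It assumes a very simple rule for sentence boundaries and a hand-made lookup table (`GAZETTEER`, a hypothetical name) for entities. Production tools rely on trained models instead; for example, this rule would wrongly split after an abbreviation such as "Dr." when a capitalized name follows.

```python
import re

# Hypothetical gazetteer: a hand-made lookup of names to entity types.
GAZETTEER = {
    "Python": "LANGUAGE",
    "Unix": "OPERATING_SYSTEM",
    "Linux": "OPERATING_SYSTEM",
}


def segment_sentences(text):
    """Split after '.', '!' or '?' when whitespace and a capital letter follow."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]


def tag_entities(sentence):
    """Return (token, entity type) pairs for tokens found in the gazetteer."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]


text = "Many NLP tools ship with Python. Some run on Linux! Is that enough?"
sentences = segment_sentences(text)
entities = [tag_entities(s) for s in sentences]

for sentence, found in zip(sentences, entities):
    print(sentence, found)
```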

Communication cuts to the very heart of who we are as human beings. It's the language that we share that helps us understand larger concepts such as community, law, and justice. As human beings we're always trying to do a better job communicating. So it's not much of a surprise that we want our machines to do the same thing. In many ways, machines do a much better job communicating with each other than we do as human beings. It's pretty easy to have two machines communicate. There might be an occasional packet loss here and there, but when you send an email it usually arrives in its original form.
Human beings, on the other hand, are always struggling to reach greater understanding. If you can deliver 5 or 10% of what you're intending, then you're a great communicator. The main challenge is that we can't communicate with machines in the same way that they communicate with each other. We're not like Neo in The Matrix. We don't have an uplink port that will allow us to connect directly to the network. That means that the machines have to do a better job existing in our world.
To do this, AI programs try to do something called natural language processing. This is when you can interact with the machine using your own natural language. We're all familiar with how to communicate with a search engine like Google. There's a little box, and then you type in different questions or phrases. You can type something like, "recipe for Belgian waffles." Then the search engine will match your phrase to popular results.
It will look through common recipe sites for the terms "Belgian waffles" and "recipe." Natural language processing makes this interaction much more human. Imagine if you could say something like, "I'm cooking breakfast. Can you give me a good recipe for those big fluffy waffles?" Even with this simple request, you have a lot of natural language processing. The machine has to understand that good is relative, so in this case the person is probably looking for the top recipes. The machine also has to figure out what a big fluffy waffle is.
It's pretty common for human beings to describe things by their attributes. It would be almost impossible to come up with an AI program using symbolic reasoning to do this level of natural language processing. How could you come up with an expert that could record all the relationships between different words and phrases? You wouldn't want an expert in a room hand coding different ways to describe Belgian waffles. Again, a lot of the work in this area has been using machine learning and artificial neural networks.
Any time you send a text or an email, it potentially goes through servers that can process parts of your conversations. They don't usually do this because they're interested in what you're saying. Instead they do this because they're interested in how you're saying it. It makes sense that organizations that are interested in artificial intelligence also offer many free communication services. Google has access to anonymized versions of your email and voicemail to pick up how you have conversations.
Apple offers iMessage and Microsoft has Skype. These services give their AI programs a treasure trove of different types of human communications. They can use machine learning to see patterns in how humans use their natural language. But natural language processing isn't just about understanding the words. It's also about understanding the context and meaning. A few years ago one of the top Google searches was "What is love?" At the time, when you put that search into Google, you would get a long list of results.
Most of them were about biological pairing rituals and the importance of feeling connected. This was the kind of response you'd expect from a network that's just matching keywords. Natural language processing gives machines the ability to better understand the larger world. If you're typing "What is love?" into a search engine, then you're probably much more interested in romantic notions of love, perhaps even some poetry or insights into what it's like to be in love. You might just want to hear a hit song by Haddaway.
Human beings have written on love from the beginning of language, so there's sure to be plenty to see on the topic.
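
The limits of pure keyword matching described above can be shown with a small sketch. It is a deliberately naive ranking that counts shared words, and it does not represent how any real search engine works. The three documents and their titles are invented for the example.

```python
import re


def words(text):
    """Return the set of lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_score(query, document):
    """Count how many distinct query words also appear in the document."""
    return len(words(query) & words(document))


def rank(query, documents):
    """Sort document titles by keyword overlap, highest first."""
    scored = [(keyword_score(query, text), title) for title, text in documents.items()]
    # sorted() is stable, so ties keep the original dictionary order.
    return [title for score, title in sorted(scored, key=lambda pair: -pair[0])]


# Hypothetical mini-corpus, made up for illustration.
documents = {
    "biology": "What love is: pair bonding and attachment in mammals.",
    "poetry": "Sonnets on romance and longing.",
    "waffles": "A recipe for Belgian waffles.",
}

print(rank("What is love?", documents))
print(rank("recipe for Belgian waffles", documents))
```

The poetry page shares no words with "What is love?", so it scores zero even though it may be what the person wanted. Closing that gap between matching words and matching meaning is the part that natural language processing is meant to address.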

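In [1]:
# Requires NLTK and its Penn Treebank sample: nltk.download('treebank')
from nltk.corpus import treebank

In [4]:
# Load the first parsed sentence of the sample file as a parse tree
t = treebank.parsed_sents('wsj_0001.mrg')[0]
# Open a window that draws the tree (needs a display with Tk available)
t.draw()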