<a href="https://colab.research.google.com/github/wilsonjefferson/DSSC_NLP/blob/main/Intro_and_Linguistic_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## IPython Notebook

The **IPython Notebook**, or Jupyter Notebook, is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media. For more details on the Jupyter Notebook, please see the Jupyter website.

### Install

```
pip install jupyter
```

### Running Jupyter

You can start the dashboard on any system from the command prompt by entering:
```
jupyter notebook
```
The URL for the dashboard will be something like http://localhost:8888/tree. 


### Cells

Cells form the body of a notebook. There are two main cell types:

- A **code cell** contains code to be executed in the kernel and displays its output below.
- A **Markdown cell** contains text formatted using Markdown and displays its output in-place when it is run.


# Useful links

- [Welcome to Colab](https://colab.research.google.com/notebooks/intro.ipynb#scrollTo=-Rh3-Vt9Nev9)
- [Overview of Colab](https://colab.research.google.com/notebooks/basic_features_overview.ipynb#scrollTo=JVXnTqyE9RET)
  



# Linguistic Analysis

Let's look at the different levels of linguistic analysis. First, we need some text. Let's take *Moby Dick* from Project Gutenberg, as a list of strings:

In [None]:
! pip install wget

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-cp37-none-any.whl size=9681 sha256=553caebdd9c60d8525aa6f840363cbc660f85f30bd48b3fea023aafd682431a7
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [None]:
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/moby_dick.txt'
wget.download(url, 'moby_dick.txt')

'moby_dick (1).txt'

In [None]:
# read the file line by line and close it when done
with open('moby_dick.txt', encoding='utf8') as f:
    documents = [line.strip() for line in f]
print(documents[1])

Call me Ishmael .


We will use the `spacy` library for a lot of the analyses. Here, we load it:

In [None]:
import spacy

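# 'en' is a spaCy v2 shortcut; spaCy v3+ needs the full model name,
# e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')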

`spacy` is one of the main libraries for NLP, and it has very good documentation.

### Usage:

We can now call `nlp()` as a function on any text. By default, it will perform a number of analyses:
- tokenization
- sentence splitting
- lemmatization
- part of speech tagging
- dependency parsing
- named entity recognition

To speed up processing, we can disable the components we do not need (tokenization always runs and cannot be disabled):
```
nlp = spacy.load('en', disable=['tagger', 'parser', 'ner'])
```


Calling `nlp()` on a text returns a `Doc` object. Iterating over it yields tokens, and `doc.sents` yields sentences. Each token has a range of attributes (see [the documentation](https://spacy.io/api/token#attributes)). We will use a few of them in the following:

- `text`: the actual word
- `lemma_`: the dictionary entry of a word
- `pos_`: the part of speech
- `dep_`: the dependency relation
- `is_punct`: check whether word is punctuation
- `is_stop`: check whether word is a stop word

## Tokenization
Before we do anything else, we need to split the text into tokens (words and punctuation marks).

In [None]:
tokens = [[token.text for token in nlp(sentence)] for sentence in documents[:100]]
tokens


### Exercise

What's the longest and shortest sentence?

In [None]:
list_leng = []
for tok in tokens:
  list_leng.append([len(tok), tok])

print(list_leng)

[[2, ['Loomings', '.']], [4, ['Call', 'me', 'Ishmael', '.']], [45, ['Some', 'years', 'ago', '--', 'never', 'mind', 'how', 'long', 'precisely', '--', 'having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', ',', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', ',', 'I', 'thought', 'I', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', '.']], [16, ['It', 'is', 'a', 'way', 'I', 'have', 'of', 'driving', 'off', 'the', 'spleen', 'and', 'regulating', 'the', 'circulation', '.']], [99, ['Whenever', 'I', 'find', 'myself', 'growing', 'grim', 'about', 'the', 'mouth', ';', 'whenever', 'it', 'is', 'a', 'damp', ',', 'drizzly', 'November', 'in', 'my', 'soul', ';', 'whenever', 'I', 'find', 'myself', 'involuntarily', 'pausing', 'before', 'coffin', 'warehouses', ',', 'and', 'bringing', 'up', 'the', 'rear', 'of', 'every', 'funeral', 'I', 'meet', ';', 'and', 'especially', 'whenever', 'my', 'hypos', 'get', 'such', 'an', 'upper', 

### Exercise

Collect counts over the tokens. What is the most frequent token?

In [None]:
from collections import Counter

flat_list = [item for sublist in tokens for item in sublist]

wordfreq = [flat_list.count(p) for p in flat_list]
wordfreqdict = dict(list(zip(flat_list,wordfreq)))

values = sorted(wordfreqdict.values(), reverse=True)[1:20]

for key, value in wordfreqdict.items():
  if value in values:
    print(key)

.
me
--
in
and
to
I
a
the
of
is
;
it
that
as
all
you
-
?
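
`Counter`, imported above but not used, can do the counting in a single pass and return the tokens sorted by frequency. Note that the slice `[1:20]` in the cell above drops the first (highest) entry of the sorted counts, so the most frequent token can go missing from the printed list. A minimal sketch, with a hypothetical `sample_tokens` list standing in for `tokens`:

```python
from collections import Counter

# hypothetical stand-in for the notebook's `tokens`
sample_tokens = [
    ['Call', 'me', 'Ishmael', '.'],
    ['The', 'sea', ',', 'the', 'sea', '.'],
    ['Call', 'the', 'whale', '.'],
]

# flatten the nested lists and count every token in one pass
token_counts = Counter(token for sentence in sample_tokens for token in sentence)

# most_common(n) returns (token, count) pairs, highest count first
for token, count in token_counts.most_common(3):
    print(token, count)
```

Because `Counter` is case-sensitive, `The` and `the` are counted separately here; lower-casing the tokens first would merge them.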


## Lemmatization
We want to get the dictionary form of each word, to reduce variation.

In [None]:
print(documents[7])

There is nothing surprising in this .


What do you expect its lemmatized version to be?

In [None]:
# word.lemma_ to lemmatize a word



Now we run it for all the sentences

In [None]:
lemmas = [token.lemma_ for sentence in documents[:100] for token in nlp(sentence)]
lemmas

['looming',
 '.',
 'call',
 '-PRON-',
 'Ishmael',
 '.',
 'some',
 'year',
 'ago',
 '--',
 'never',
 'mind',
 'how',
 'long',
 'precisely',
 '--',
 'have',
 'little',
 'or',
 'no',
 'money',
 'in',
 '-PRON-',
 'purse',
 ',',
 'and',
 'nothing',
 'particular',
 'to',
 'interest',
 '-PRON-',
 'on',
 'shore',
 ',',
 '-PRON-',
 'think',
 '-PRON-',
 'would',
 'sail',
 'about',
 'a',
 'little',
 'and',
 'see',
 'the',
 'watery',
 'part',
 'of',
 'the',
 'world',
 '.',
 '-PRON-',
 'be',
 'a',
 'way',
 '-PRON-',
 'have',
 'of',
 'drive',
 'off',
 'the',
 'spleen',
 'and',
 'regulate',
 'the',
 'circulation',
 '.',
 'whenever',
 '-PRON-',
 'find',
 '-PRON-',
 'grow',
 'grim',
 'about',
 'the',
 'mouth',
 ';',
 'whenever',
 '-PRON-',
 'be',
 'a',
 'damp',
 ',',
 'drizzly',
 'November',
 'in',
 '-PRON-',
 'soul',
 ';',
 'whenever',
 '-PRON-',
 'find',
 '-PRON-',
 'involuntarily',
 'pause',
 'before',
 'coffin',
 'warehouse',
 ',',
 'and',
 'bring',
 'up',
 'the',
 'rear',
 'of',
 'every',
 'funera

### Exercise
Right now, the lemmas of all pronouns are collapsed into `-PRON-`. Change the code to preserve the original word (as lower case).

In [None]:
# your code here
lemma_with_pron = [token.lemma_ if token.lemma_ != '-PRON-' else token.lower_ for sentence in documents[:100] for token in nlp(sentence)]
lemma_with_pron[0:5]

['looming', '.', 'call', 'me', 'Ishmael']

## Stemming
A more aggressive way of removing variation is *stemming*. Let's look at our example again.

In [None]:
print(documents[7])

There is nothing surprising in this .


What do you expect the stemmed version to be?

In [None]:
from nltk import SnowballStemmer # Snowball is a widely used stemming algorithm

stemmer = SnowballStemmer('english')
# stemmer.stem(word) to stem a word


Now let's stem all our sentences

In [None]:
stems = [[stemmer.stem(token) for token in sentence] for sentence in tokens]
stems

[['loom', '.'],
 ['call', 'me', 'ishmael', '.'],
 ['some',
  'year',
  'ago',
  '--',
  'never',
  'mind',
  'how',
  'long',
  'precis',
  '--',
  'have',
  'littl',
  'or',
  'no',
  'money',
  'in',
  'my',
  'purs',
  ',',
  'and',
  'noth',
  'particular',
  'to',
  'interest',
  'me',
  'on',
  'shore',
  ',',
  'i',
  'thought',
  'i',
  'would',
  'sail',
  'about',
  'a',
  'littl',
  'and',
  'see',
  'the',
  'wateri',
  'part',
  'of',
  'the',
  'world',
  '.'],
 ['it',
  'is',
  'a',
  'way',
  'i',
  'have',
  'of',
  'drive',
  'off',
  'the',
  'spleen',
  'and',
  'regul',
  'the',
  'circul',
  '.'],
 ['whenev',
  'i',
  'find',
  'myself',
  'grow',
  'grim',
  'about',
  'the',
  'mouth',
  ';',
  'whenev',
  'it',
  'is',
  'a',
  'damp',
  ',',
  'drizzli',
  'novemb',
  'in',
  'my',
  'soul',
  ';',
  'whenev',
  'i',
  'find',
  'myself',
  'involuntarili',
  'paus',
  'befor',
  'coffin',
  'warehous',
  ',',
  'and',
  'bring',
  'up',
  'the',
  'rear',
  '

### Exercise

Keep track of the most frequent word for each stem in `tokens`. Hint: use a nested `defaultdict`.

- How many word forms does the stem `hand` have?
- What is the most common word form for the stems `respect` and `whale`? What happened there?

In [None]:
from collections import defaultdict

# your code here
# nested mapping: stem -> word form -> count
forms_by_stem = defaultdict(lambda: defaultdict(int))
for sentence in tokens:
    for token in sentence:
        forms_by_stem[stemmer.stem(token)][token] += 1

print('hand: ', len(forms_by_stem['hand']))  # number of distinct word forms
for stem in ['respect', 'whale']:
    forms = forms_by_stem[stem]
    # most common word form, or None if the stem does not occur in `tokens`
    print(stem + ': ', max(forms, key=forms.get) if forms else None)

## Parts of speech
We can extract the part of speech for every word with the `pos_` attribute.

List of POS tags:
https://universaldependencies.org/u/pos/


In [None]:
print(documents[7])

There is nothing surprising in this .


What are the POS tags of these words?

In [None]:
# your code here
# use token.pos_

nlp(documents[7])

There is nothing surprising in this .

Let's apply this to all our documents

In [None]:
pos = [[token.pos_ for token in nlp(sentence)] for sentence in documents[:100]]
pos

### Exercise
Print out the words in the first 10 sentences, but remove all words that are not nouns, verbs, adjectives, adverbs, or proper names.

In [None]:
# your code here
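
One possible approach is to keep a set of the wanted Universal POS tags and filter on membership. The sketch below works on hypothetical `(word, tag)` pairs rather than on spaCy tokens, so it runs without a model; the names `CONTENT_TAGS` and `keep_content_words` are assumptions for illustration.

```python
# Universal POS tags for nouns, verbs, adjectives, adverbs, and proper nouns
CONTENT_TAGS = {'NOUN', 'VERB', 'ADJ', 'ADV', 'PROPN'}


def keep_content_words(tagged_sentence):
    """Keep only the words whose POS tag is in CONTENT_TAGS."""
    return [word for word, tag in tagged_sentence if tag in CONTENT_TAGS]


# hypothetical (word, tag) pairs, shaped like
# [(token.text, token.pos_) for token in nlp(sentence)]
tagged = [('Call', 'VERB'), ('me', 'PRON'), ('Ishmael', 'PROPN'), ('.', 'PUNCT')]

print(keep_content_words(tagged))
```

Whether auxiliaries such as *is* survive the filter depends on whether the model tags them `VERB` or `AUX`.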


## Named Entities
For each named entity (typically a noun phrase), we can infer its semantic type, such as person, place, or organization.

In [None]:
print(documents[7])

Which are the NEs in this sentence?

In [None]:
# your code here
# nlp(document).ents

In [None]:
from spacy import displacy
displacy.render(nlp(documents[1]), style="ent", jupyter=True)

Let's apply this to all our documents

In [None]:
entities = [[(entity.text, entity.label_) 
             for entity in nlp(sentence).ents]
            for sentence in documents[:50]]
entities
# nlp('John gave a book to Mary and Celia in Cardiff').ents

### Exercise
Who are the 5 most frequently named people in the first 500 sentences?

In [None]:
# your code here
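
A minimal sketch of one way to approach this, assuming the entities are stored as lists of `(text, label)` tuples like `entities` above and that people carry the label `PERSON`. The function name `most_frequent_people` and the sample data are hypothetical.

```python
from collections import Counter


def most_frequent_people(entities_per_sentence, n=5):
    """Count entity texts labelled PERSON and return the n most common."""
    people = Counter(
        text
        for sentence_entities in entities_per_sentence
        for text, label in sentence_entities
        if label == 'PERSON'
    )
    return people.most_common(n)


# hypothetical data in the same shape as `entities` above
sample_entities = [
    [('Ishmael', 'PERSON')],
    [('November', 'DATE')],
    [('Ahab', 'PERSON'), ('Nantucket', 'GPE')],
    [('Ahab', 'PERSON'), ('Queequeg', 'PERSON')],
]

print(most_frequent_people(sample_entities, n=2))
```

For the exercise, `entities` would first need to be rebuilt over `documents[:500]` instead of `documents[:50]`.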


### Exercise

Use the text below to extract all entities. 
- Create tuples of `(lemma, NER type)`
- Collect counts over the tuples
- Look at the 10 most frequent tuples: how many of them are wrong? Why? Discuss with a neighbor.


In [None]:
text = """
Seville.
Summers in the flamboyant Andalucían capital often nudge 40C, but spring is a delight, with the parks in bloom and the scent of orange blossom and jasmine in the air. And in Semana Santa (Holy Week, 14-20 April) the streets come alive with floats and processions. There is also the raucous annual Feria de Abril – a week-long fiesta of parades, flamenco and partying long into the night (4-11 May; expect higher hotel prices if you visit then).
Seville is a romantic and energetic place, with sights aplenty, from the Unesco-listed cathedral – the largest Gothic cathedral in the world – to the beautiful Alcázar royal palace. But days here are best spent simply wandering the medieval streets of Santa Cruz and along the river to La Real Maestranza, Spain’s most spectacular bullring.
Seville is the birthplace of tapas and perfect for a foodie break – join a tapas tour (try devoursevillefoodtours.com), or stop at the countless bars for a glass of sherry with local jamón ibérico (check out Bar Las Teresas in Santa Cruz or historic Casa Morales in Constitución). Great food markets include the Feria, the oldest, and the wooden, futuristic-looking Metropol Parasol.
Nightlife is, unsurprisingly, late and lively. For flamenco, try one of the peñas, or flamenco social clubs – Torres Macarena on C/Torrijano, perhaps – with bars open across town until the early hours.
Book it: In an atmospheric 18th-century house, the Hospes Casa del Rey de Baeza is a lovely place to stay in lively Santa Cruz. Doubles from £133 room only, hospes.com
Trieste.
By April, temperatures are on the rise in Trieste and in the late 20s by May. It is far less touristy than the likes of Florence or Rome, and spring sees the city’s lovely restaurants and bars populated almost exclusively by locals.
A city with a proud coffee-drinking culture – Illy has its headquarters here – Trieste has many venerable cafes, including the dazzling mirror-walled Caffè degli Specchi on the Piazza Unità d’Italia – said to be Europe’s biggest seaside piazza – and the elegant Caffè San Marco, which has a good bookshop. James Joyce was a regular when he lived here between 1904-1915. You can learn all about him at the excellent museum, which also has a free, downloadable themed walk on its website (museojoycetrieste.it).
Above Trieste is a vast limestone plateau known as the carso (or karst). Travel up to Villa Opicina on the edge of the region by bus.
There are several trattorie, but for a real treat catch a cab to one of the 30 or so osmize – farm restaurants that sell their wines, cured meats, cheese, honey, fruit and veg; traditionally, they were open eight, 16 or 24 days per year (“osmi” means “eighth” in Slovene) but this now varies – check the app at osmize.com for details.
Book it: Stay at the palatial, seafront Savoia Excelsior Palace, Jan Morris’s pad in her book Trieste and the Meaning of Nowhere. Doubles from £127 room only, starhotelscollezione.com
Belgrade.
As Belgrade shrugs off the snow, cafe tables start colonising the pavements again. Not that Serbia’s capital hibernates during the winter, but spring brings a freshness worth savouring before summer’s 40C heat kicks in.
You feel it especially in Kalemegdan, the huge park and fortress hugging the confluence of the Danube and Sava rivers. Down below are wide riverside paths that offer superb cycling all the way to the attractive suburb of Zemun. Follow the Sava southwards to reach the river island of Ada Ciganlija – open all year round for walks and bike rides and usually warm enough in May for a swim.
Although barely a month goes by without a festival, the pace picks up during spring. The Belgrade Dance Festival attracts dance companies from around the world. Classical guitar gets its own spotlight during the Guitar Art Festivalfrom 12-17 March, and from 26-28 April you can join the Orthodox Easter festivities. The landmark Mikser House in Savamala may have closed, but its Mikser festival still celebrates the best in Balkan design (24-26 May).
With the long-awaited reopening of the National Museum of Serbia and the Museum of Contemporary Art, Belgrade has some cultural heavyweights to add to its dizzyingly varied restaurant scene. Head to the Dorćol district for cheap cocktails in Blaznavac’s psychedelic garden before a Balkan-Mediterranean dinner in cosy Tezga (Strahinjića Bana 82, on Facebook).
Book it: Set in a handsome 1929 villa in Dorćol, Smokvica has six stylish rooms as well as a restaurant with a courtyard garden. Doubles from €70 room only, smokvica.rs
Montpellier.
Montpellier combines easy elegance and a vibrant cultural scene with a youthful buzz – its university, founded in the 13th century, counts radical satirist Rabelais among its alumni and some 60,000 students live here.
The medieval centre is a maze made for wandering, with 16 leafy squares – in spring, all green and abuzz with alfresco cafe life. The vast, pedestrianised Place de la Comédie connects the old town with the striking new Antigone district, replete with modern, neoclassical-style buildings.
For a fine-art foray, head to Musée Fabre – one of the biggest in France – or nearby photography museum Pavillon Populaire. Montpellier boasts the oldest botanical garden in France, too, dating from 1593 and particularly beautiful in April and May. Independent boutiques, opera houses, markets laden with Languedoc produce (look for oysters and asparagus) and great dining options add to the appeal. Le Petit Jardin is a bistro and lovely garden restaurant near the old cathedral.
Beyond the city, discover vineyards like Mas de Daumas Gassac or hike up Pic St-Loup in the Cévennes foothills for views over the coast. The beach and charming seaside town of Palavas-les-Flots is just 10km away – a tram ride or easy cycle (rent bicycles from Ville et Vélo).
Book it: Hotel Le Guilhem is a 16th-century building in the historic centre with cathedral views from its terrace (where breakfast is served, weather permitting). Doubles from £83 room only, leguilhem.com
Berlin.
After a long, cold winter, Berliners waste no time in celebrating the return of the sun, with beer gardens and flea markets reopening all over the city. Try picturesque Cafe Am Neuen Seefor lakeside pizza and beer in the Tiergarten, and hang out with the hipsters along the canal at Berlin’s coolest flohmarkt, surrounded by cherry blossom trees, a charming place to browse and have a couple of beers.
For the best blossom, though, head to one of the trails along the line of the wall, the Mauerweg, near S-bahn Bornholmer Strasse, or Lichterfelde Süd, where the trees were gifted by the Japanese to celebrate German reunification in 1989.
Berlin is set for two big anniversaries this year: it’s 100 years since the Bauhaus was founded and 30 years since the wall came down, so check out events including exhibitions and dances themed around the former (see visitberlin.de for more information) or get a feel for the creative chaos of post-wall Berlin at the multimedia Nineties Berlin exhibition.
Boat tours along the river and canals are superb in spring but, generally, cycling is the best way to get around (many hotels and hostels have their own). And if urban life gets a bit much, do as the Berliners do and head out to one of the many city lakes, such as Schlachtensee or Wannsee, directly accessible by U and S-bahn.
Book it: Boutique Hotel Oderberger in a trendy, central location has doubles from £113, hotel-oderberger.berlin, or try indoor caravanning at Hüttenpalast, doubles from £61 room only, huettenpalast.de
"""

# your code here

## Parsing
For each word, we can extract the word it is grammatically related to, plus the type of the relation.

In [None]:
print(documents[7])

What are the dependency relations in this sentence?

In [None]:
# your code here


Let's apply this to all our documents

In [None]:
[[(c.text, c.head.text, c.dep_) for c in nlp(sentence)] 
 for sentence in documents[:100]]

Instead of doing this on a word-by-word basis, we can do it for larger chunks, the noun phrases.

In [None]:
[[(c.text, c.root.head.text, c.root.dep_) for c in nlp(sentence).noun_chunks] 
 for sentence in documents[:100]]

### Exercise
How does Melville describe nouns? Extract all the pairs related by `amod`.

In [None]:
# your code here
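
As a rough sketch, the `amod` pairs can be collected by filtering triples of the form `(text, head text, dependency label)` and grouping the modifiers under their head word. The triples below are hypothetical sample data, and `adjectives_by_noun` is an illustrative helper name, not part of spaCy.

```python
from collections import defaultdict


def adjectives_by_noun(triples):
    """Group modifiers under the word they modify, using only `amod` arcs."""
    grouped = defaultdict(list)
    for word, head, relation in triples:
        if relation == 'amod':
            grouped[head.lower()].append(word.lower())
    return grouped


# hypothetical (text, head text, dependency label) triples, shaped like
# (c.text, c.head.text, c.dep_) in the cell above
sample_triples = [
    ('watery', 'part', 'amod'),
    ('the', 'part', 'det'),
    ('damp', 'November', 'amod'),
    ('drizzly', 'November', 'amod'),
    ('old', 'men', 'amod'),
]

modifiers = adjectives_by_noun(sample_triples)
print(dict(modifiers))
print(modifiers['men'])
```

Looking up a single head word, as in `modifiers['men']`, is one way to approach the next question.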


How does he describe men?

In [None]:
# your code here

# Try it on the language you prefer!

https://spacy.io/models
