# *Jupyter Mia!* - my Jupyter Project

**Web Scraping and Linguistic Analysis**

In my *Jupyter Mia!* project, I used Python to perform a linguistic analysis of the [script](https://www.scripts.com/script/mamma_mia!_13236) of the 2008 film *Mamma Mia!*. I used the **requests** and **BeautifulSoup** libraries to scrape the script from that URL, and the **Natural Language Toolkit** (NLTK) library to carry out the following language tasks:
*   Tokenization and POS Tagging
*   POS tag frequency analysis
*   Word frequency analysis
*   Word length analysis
*   Sentence length analysis






# βήμα πρώτο - step one

Let me import the modules

As the first step (*το πρώτο βήμα*) of the analysis, I import the required modules:

In [None]:
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
import matplotlib.pyplot as plt

I used **requests** and **BeautifulSoup** for the web scraping, **Counter** and **nltk** for the data analysis, and **matplotlib** to create charts.

# βήμα δεύτερο - step two

Let's do some web scraping

In [None]:
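# Download the page that hosts the script
page = requests.get("https://www.scripts.com/script/mamma_mia!_13236")
page.raise_for_status()  # stop early if the request failed

# Parse the HTML and collect every <p> element
soup = BeautifulSoup(page.content, 'html.parser')
paragraphs = soup.find_all('p')

# Keep the text of each non-empty paragraph, stripped of surrounding whitespace
only_text = []
for el in paragraphs:
    text = el.get_text().strip()
    if text:
        only_text.append(text)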

Here the text is retrieved from the address given above. **requests** downloads the page and **BeautifulSoup** parses its HTML content. All paragraph elements are then found with **soup.find_all('p')**, and the text of each non-empty paragraph is added to the **only_text** list.
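The filtering step can be tried out on its own, without touching the network. The sketch below is only an illustration: `clean_paragraphs` is a hypothetical helper name, and the sample strings are made up to stand in for the text extracted from the `<p>` elements.

```python
def clean_paragraphs(raw_texts):
    """Strip surrounding whitespace and drop paragraphs that end up empty."""
    cleaned = []
    for text in raw_texts:
        stripped = text.strip()
        if stripped:  # skip paragraphs that contain only whitespace
            cleaned.append(stripped)
    return cleaned


# Made-up sample input, standing in for the text of the scraped paragraphs
sample = ["  First line of dialogue.  ", "\n\t ", "Second line.", ""]
cleaned_sample = clean_paragraphs(sample)
print(cleaned_sample)
```

Keeping this logic in a small function makes it easy to check the cleaning rules before running them on the full script.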

# βήμα τρίτο - step three

Now some tokenization and part-of-speech tagging

In [None]:
# Download the tokenizer and tagger data that NLTK needs (only required once).
# Newer NLTK releases may ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

pos_tagged_tokens = []

# Each element of only_text is one paragraph of the script
for paragraph in only_text:
    tokens = nltk.word_tokenize(paragraph)
    pos_tagged_tokens.extend(nltk.pos_tag(tokens))


Next, the text undergoes part-of-speech (POS) tagging. Each paragraph is tokenized into words with **nltk.word_tokenize()**, and the words are then tagged with **nltk.pos_tag()**.

**The resulting (word, tag) pairs are then stored in a list.**

# βήμα τέταρτο - step four

How often does each POS tag occur?

In [None]:
# Keep only the tag from each (word, tag) pair and count how often each tag occurs
pos_tags_list = [tag for word, tag in pos_tagged_tokens]
pos_tag_counts = Counter(pos_tags_list)
print(pos_tag_counts)
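Raw counts are easier to read when they are sorted and shown as shares of all tokens. The following is a minimal sketch of that idea, not output from the actual script: `tag_shares` is a hypothetical helper, and `sample_tagged` is a made-up list in the same (word, tag) shape that `nltk.pos_tag` returns.

```python
from collections import Counter


def tag_shares(tagged_tokens):
    """Return (tag, count, share) triples sorted from most to least frequent."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return [(tag, n, n / total) for tag, n in counts.most_common()]


# Made-up sample, standing in for pos_tagged_tokens
sample_tagged = [
    ("The", "DT"), ("island", "NN"), ("is", "VBZ"), ("a", "DT"),
    ("quiet", "JJ"), ("place", "NN"), (".", "."),
]

shares = tag_shares(sample_tagged)
for tag, n, share in shares:
    print(f"{tag:<4} {n:>3} {share:.1%}")
```

On the real data, the same function could be called as `tag_shares(pos_tagged_tokens)`.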


# βήμα πέμπτο - step five

And now, how often do particular words occur?
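One possible way to approach this step is sketched below. It assumes the same (word, tag) pairs produced in step three; `word_frequencies` is a hypothetical helper, and the sample list is made up so that the cell runs on its own. Lowercasing and the `isalpha()` filter are my assumptions about what should count as a word.

```python
from collections import Counter


def word_frequencies(tagged_tokens, top_n=10):
    """Count lowercase alphabetic tokens and return the top_n most common."""
    # isalpha() drops punctuation and also fragments such as "n't" or "'s"
    words = [word.lower() for word, _ in tagged_tokens if word.isalpha()]
    return Counter(words).most_common(top_n)


# Made-up sample, standing in for pos_tagged_tokens
sample_tagged = [
    ("The", "DT"), ("island", "NN"), ("is", "VBZ"), ("quiet", "JJ"), (",", ","),
    ("the", "DT"), ("sea", "NN"), ("is", "VBZ"), ("calm", "JJ"), (".", "."),
]

top_words = word_frequencies(sample_tagged, top_n=3)
print(top_words)
```

In a script, function words such as "the" or "is" will probably dominate the list, so filtering by POS tag or stop words may be worth considering.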

# βήμα έκτο - step six

Measuring the length of the words
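A minimal sketch of how word lengths could be measured, again assuming the (word, tag) pairs from step three. `word_length_stats` is a hypothetical helper and the sample data is made up; only alphabetic tokens are measured so that punctuation does not count as a one-letter word.

```python
from collections import Counter


def word_length_stats(tagged_tokens):
    """Return (length distribution, average length) for alphabetic tokens."""
    lengths = [len(word) for word, _ in tagged_tokens if word.isalpha()]
    if not lengths:
        return Counter(), 0.0  # nothing to measure
    distribution = Counter(lengths)
    average = sum(lengths) / len(lengths)
    return distribution, average


# Made-up sample, standing in for pos_tagged_tokens
sample_tagged = [
    ("The", "DT"), ("island", "NN"), ("is", "VBZ"), ("quiet", "JJ"), (",", ","),
    ("the", "DT"), ("sea", "NN"), ("is", "VBZ"), ("calm", "JJ"), (".", "."),
]

distribution, average = word_length_stats(sample_tagged)
print(sorted(distribution.items()))
print(f"average word length: {average:.2f}")
```

The `distribution` counter maps each word length to the number of words of that length, which is a convenient input for a bar chart.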

# βήμα έβδομο - step seven

And now, let me measure the length of the sentences
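Sentence length could be measured along the following lines. This sketch counts whitespace-separated words in sentences that are already split; that is a simplification I chose so the cell runs without any downloads. In the notebook, `nltk.sent_tokenize` could be used to split each paragraph into sentences and `nltk.word_tokenize` to count tokens instead. `sentence_lengths` and the sample sentences are hypothetical.

```python
def sentence_lengths(sentences):
    """Return the number of whitespace-separated words in each sentence."""
    return [len(sentence.split()) for sentence in sentences]


# Made-up sample sentences, standing in for sentences taken from only_text
sample_sentences = [
    "The island is quiet.",
    "Everyone arrives by boat before the wedding.",
    "Music starts.",
]

lengths = sentence_lengths(sample_sentences)
average_length = sum(lengths) / len(lengths)
print(lengths)
print(f"average sentence length: {average_length:.2f} words")
```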

# The story ends here, but the show is extended by two further songs, [*Dancing Queen* and *Waterloo*](https://www.youtube.com/watch?v=yAkHMAbKYRw).