# Lab 1: Jupyter and Python

In this lab, you will load a text and perform a basic analysis of it using Python in a Jupyter interactive notebook.

Follow the instructions and run the code cells. Make sure you understand what happens at every stage.

## Getting the data
1. Download the story "A Christmas Carol" by Charles Dickens from the Project Gutenberg website:
    * Open this page: http://www.gutenberg.org/files/24022/24022-0.txt
    * Save the file locally (File -> Save) as *carol.txt*, the name used by the loading code
    * Upload the file to the same directory as the notebook
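
As an alternative to saving the page by hand, the download could be scripted. The following is a minimal sketch, assuming Python 3 and network access; the helper name `download_story` and its default arguments are hypothetical and not part of the lab.

```python
import shutil
import urllib.request

# URL taken from the instructions above.
STORY_URL = "http://www.gutenberg.org/files/24022/24022-0.txt"


def download_story(url=STORY_URL, path="carol.txt"):
    """Copy the raw bytes found at `url` into the local file `path`.

    Hypothetical helper for illustration; returns the path it wrote to.
    """
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)
    return path


# Uncomment to fetch the story (requires network access):
# download_story()
```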
-----------
## Loading the data
* First, we open the file using *open()* and read its contents using *read()*
* **Make sure the file path is correct**

In [3]:
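# Open the file, read its entire contents into one string,
# and let the with-block close the file automatically
with open(r"carol.txt", "r") as storyFile:
    storyText = storyFile.read()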

----------
## Review the data
Let's print the first 500 characters of the text:

In [4]:
print(storyText[0:500])

﻿The Project Gutenberg EBook of A Christmas Carol, by Charles Dickens

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: A Christmas Carol

Author: Charles Dickens

Illustrator: Arthur Rackham

Release Date: December 24, 2007 [EBook #24022]

Language: English

Character set encoding: UTF-8


----------
Now, let's print the PREFACE. It starts at character 1205:

In [5]:
print(storyText[1205:1472])

  PREFACE

  I have endeavoured in this Ghostly little book to raise the Ghost of an
  Idea which shall not put my readers out of humour with themselves, with
  each other, with the season, or with me. May it haunt their house
  pleasantly, and no one wish to lay it.
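
Hard-coded positions such as 1205 only work for one exact copy of the file. The following is a minimal sketch of locating a section with *find()* and slicing from there. The sample text and the "CONTENTS" end marker are hypothetical stand-ins, not taken from the real file.

```python
# Hypothetical miniature of the story file, used so the example runs on its own.
sample_text = (
    "Title: A Christmas Carol\n\n"
    "  PREFACE\n\n"
    "  I have endeavoured in this Ghostly little book\n\n"
    "  CONTENTS\n"
)

# find() returns the index of the first match, or -1 if there is none.
start = sample_text.find("PREFACE")
# Search for the end marker only after the start position.
end = sample_text.find("CONTENTS", start)

preface = sample_text[start:end]
print(preface)
```

Slicing with `[start:end]` includes the character at `start` and stops just before `end`, the same convention used in the cell above.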


---------
## Counting the number of lines in the story
We start by splitting the text into a list of lines using *splitlines()*

In [6]:
storyLines = storyText.splitlines()

---------
Now we print the first 10 lines:

In [7]:
storyLines[0:10]

['\xef\xbb\xbfThe Project Gutenberg EBook of A Christmas Carol, by Charles Dickens',
 '',
 'This eBook is for the use of anyone anywhere at no cost and with',
 'almost no restrictions whatsoever.  You may copy it, give it away or',
 're-use it under the terms of the Project Gutenberg License included',
 'with this eBook or online at www.gutenberg.org',
 '',
 '',
 'Title: A Christmas Carol',
 '']
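
The '\xef\xbb\xbf' at the start of the first line is the UTF-8 byte order mark (BOM) stored at the beginning of the file. The following is a minimal sketch of one way to drop it when loading, assuming Python 3, where *open()* accepts an `encoding` argument. The temporary sample file is a hypothetical stand-in for carol.txt.

```python
import os
import tempfile

# Hypothetical stand-in for carol.txt: a tiny UTF-8 file that starts with a BOM.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "wb") as f:
    f.write(b"\xef\xbb\xbfThe Project Gutenberg EBook\n\nTitle: A Christmas Carol\n")

# Plain "utf-8" keeps the BOM as the character U+FEFF at the start of the text.
with open(sample_path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# "utf-8-sig" decodes the same bytes but drops a leading BOM if one is present.
with open(sample_path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()

print(repr(with_bom.splitlines()[0]))     # '\ufeffThe Project Gutenberg EBook'
print(repr(without_bom.splitlines()[0]))  # 'The Project Gutenberg EBook'
```

Note that loading the text this way changes where each character sits in the string, so fixed positions used elsewhere in this lab may need adjusting.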

----------
Finally, we count the lines using the built-in *len()* function:

In [8]:
print("The number of lines in the text is:", len(storyLines))

('The number of lines in the text is:', 3976)
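
The line count above includes empty lines. The following is a small sketch, on a hypothetical three-line sample, of how *splitlines()* differs from splitting on "\n" and how empty lines could be excluded from the count.

```python
# Hypothetical sample with one empty line in the middle and a trailing newline.
sample_text = "Title: A Christmas Carol\n\nAuthor: Charles Dickens\n"

lines = sample_text.splitlines()
print(lines)       # ['Title: A Christmas Carol', '', 'Author: Charles Dickens']
print(len(lines))  # 3

# split("\n") treats the trailing newline as a separator,
# so it produces one extra empty string at the end.
print(len(sample_text.split("\n")))  # 4

# Count only lines that contain something other than whitespace.
non_empty = [line for line in lines if line.strip()]
print(len(non_empty))  # 2
```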


---------
## Counting the number of words in the story
We start by splitting the text into a list of words using *split()*

In [9]:
storyWords = storyText.split()

---------
Now we print the first 10 words:

In [10]:
storyWords[:10]

['\xef\xbb\xbfThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'A',
 'Christmas',
 'Carol,',
 'by',
 'Charles']

----------
Finally, we count the words using *len()* on the list of words:

In [11]:
print("The number of words in the text is:", len(storyWords))

('The number of words in the text is:', 32430)
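
Calling *split()* with no argument is what makes this count reasonable. The sketch below, on a hypothetical sample string, compares it with splitting on a single space.

```python
# Hypothetical sample with repeated spaces and a newline between words.
sample_text = "Carol,  by\nCharles   Dickens"

# split() with no argument splits on any run of whitespace (spaces, tabs, newlines).
words = sample_text.split()
print(words)       # ['Carol,', 'by', 'Charles', 'Dickens']
print(len(words))  # 4

# split(" ") splits on every single space, so repeated spaces yield empty strings
# and the newline stays inside one piece.
pieces = sample_text.split(" ")
print(pieces)  # ['Carol,', '', 'by\nCharles', '', '', 'Dickens']
```

Punctuation stays attached to the words in both cases ('Carol,'), which is why the words are stripped before counting distinct ones.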


## Counting the number of lines containing a word

In [12]:
linesCounter = 0
for line in storyLines:
    if "christmas" in line.lower():
        linesCounter += 1

In [13]:
print("The number of lines with the word 'Christmas' is:", linesCounter)

("The number of lines with the word 'Christmas' is:", 95)
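
The *in* test above is a substring check, so it also matches longer words that contain "christmas". The following is a minimal sketch of the difference between substring and whole-word matching, using the standard *re* module. The sample lines are hypothetical.

```python
import re

# Hypothetical sample lines, not quoted from the story file.
sample_lines = [
    "A merry Christmas, uncle!",
    "He thought of Christmases long past.",
    "Bah! Humbug!",
]

# Substring test, as in the cell above: matches "Christmas" and "Christmases".
substring_count = sum(1 for line in sample_lines if "christmas" in line.lower())

# \b marks a word boundary, so only the standalone word matches.
word_pattern = re.compile(r"\bchristmas\b", re.IGNORECASE)
whole_word_count = sum(1 for line in sample_lines if word_pattern.search(line))

print(substring_count)   # 2
print(whole_word_count)  # 1
```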


## Counting the number of different words

In [14]:
# strip leading and trailing punctuation from each word and convert it to lower case
editedWords = [word.strip(""" ,.*()[]!@#$%^&*{}?'`"-""").lower()
               for word in storyWords]

In [15]:
print("number of different words: ", len(set(editedWords)))

('number of different words: ', 5273)
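
Note that *strip()* only removes characters from the two ends of a word, and a token made up entirely of punctuation becomes an empty string that the set still counts. A small sketch on hypothetical sample words (the name `PUNCTUATION` is introduced here for illustration):

```python
# Same character set as in the cell above, without the duplicated '*'.
PUNCTUATION = """ ,.*()[]!@#$%^&{}?'`"-"""

# Hypothetical sample tokens.
sample_words = ["'Ding,", "dong!'", "Don't", "--", "Ding"]

cleaned = [word.strip(PUNCTUATION).lower() for word in sample_words]
print(cleaned)  # ['ding', 'dong', "don't", '', 'ding']

# The apostrophe inside "don't" survives; "--" is reduced to an empty string.
unique = set(cleaned)
unique.discard("")  # drop the empty string so it is not counted as a word
print(len(unique))  # 3
```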


## Using the *Counter* collection
We can use *Counter* to count the frequency of every item in a list.

In [16]:
from collections import Counter
wordsCounter = Counter(editedWords)

---------
What is the number of occurrences of the word 'Christmas'?

In [17]:
wordsCounter["christmas"]

93

---------
What are the 10 most common words?

In [18]:
wordsCounter.most_common(10)

[('the', 1776),
 ('and', 1141),
 ('of', 814),
 ('a', 781),
 ('to', 759),
 ('in', 589),
 ('it', 525),
 ('he', 490),
 ('was', 428),
 ('his', 425)]
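
To see how *Counter* behaves on a list small enough to check by hand, here is a short sketch with a hypothetical word list:

```python
from collections import Counter

# Hypothetical sample words, already lower-cased and stripped.
sample_words = ["the", "ghost", "of", "the", "past", "the", "ghost"]

counts = Counter(sample_words)

print(counts["the"])          # 3
print(counts["scrooge"])      # 0 (a missing item counts as zero, no KeyError)
print(counts.most_common(2))  # [('the', 3), ('ghost', 2)]
```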

-------
## Finding a phrase in the text

Let's find the first occurrence of the phrase "Ding, dong!" in the text.

In [19]:
storyText.find("Ding, dong!")

45967

-------
Let's examine the text around this position:

In [20]:
print(storyText[45800:46050])

rter was so long, that he was more than once convinced he must
have sunk into a doze unconsciously, and missed the clock. At length it
broke upon his listening ear.

'Ding, dong!'

'A quarter past,' said Scrooge, counting.

'Ding, dong!'

'Half past,
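
*find()* reports only the first occurrence, and returns -1 when the phrase is absent. The following is a minimal sketch of collecting every occurrence by passing a start position to *find()*. The helper `find_all` and the sample text are hypothetical.

```python
def find_all(text, phrase):
    """Return the start index of every occurrence of `phrase` in `text`."""
    positions = []
    start = text.find(phrase)
    while start != -1:  # find() returns -1 once there are no more matches
        positions.append(start)
        # Resume the search one character after the previous match.
        start = text.find(phrase, start + 1)
    return positions


# Hypothetical sample modelled on the excerpt above.
sample_text = "'Ding, dong!'\n\n'A quarter past,' said Scrooge.\n\n'Ding, dong!'\n"

positions = find_all(sample_text, "Ding, dong!")
print(positions)
print(find_all(sample_text, "Humbug"))  # []
```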


## Finding numbers in the text

In [21]:
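numbersFound = []
for number in range(1000):
    # find() returns -1 when the digits do not appear anywhere in the text.
    # This is a substring search, so "24" also matches inside "24022".
    numberLocation = storyText.find(str(number))
    if numberLocation >= 0:
        numbersFound.append(number)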

In [22]:
len(numbersFound)

134

In [23]:
max(numbersFound)

997
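
Because the loop above searches for substrings, it counts digit sequences that appear inside longer numbers. The following is a minimal sketch of an alternative that extracts whole runs of digits with a regular expression. It uses one line from the file header as sample text and is not the method used in this lab.

```python
import re

# Sample line copied from the header of the story file.
sample_text = "Release Date: December 24, 2007 [EBook #24022]"

# Substring approach, as in the cells above: any number in 0..999
# whose digits occur anywhere in the text.
substring_hits = [n for n in range(1000) if str(n) in sample_text]

# Regex approach: \d+ matches each maximal run of digits exactly once.
standalone = [int(match) for match in re.findall(r"\d+", sample_text)]

print(substring_hits)
print(standalone)  # [24, 2007, 24022]
```

The substring list contains values such as 7 and 40 that never appear as numbers on their own, while the regex list keeps each number whole.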